# EPITRANSCRIPTOMICS: THE NOVEL RNA FRONTIER

EDITED BY: Giovanni Nigita, Mario Acunzo, William Chi Shing Cho and Carlo Maria Croce

PUBLISHED IN: Frontiers in Bioengineering and Biotechnology, Frontiers in Plant Science, Frontiers in Genetics and Frontiers in Molecular Biosciences

#### Frontiers Copyright Statement

© Copyright 2007-2019 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.

ISSN 1664-8714, ISBN 978-2-88945-741-0, DOI 10.3389/978-2-88945-741-0

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

## EPITRANSCRIPTOMICS: THE NOVEL RNA FRONTIER

Topic Editors:

Giovanni Nigita, The Ohio State University, United States; Mario Acunzo, Virginia Commonwealth University, United States; William Chi Shing Cho, Queen Elizabeth Hospital, Hong Kong; Carlo Maria Croce, The Ohio State University, United States

Following the formulation of the central dogma of molecular biology and the later discovery of classes of non-coding RNAs, the primary focus of Genetics was essentially on variation of DNA aiming at elucidating biological pathways perturbed in diseases. Recently, extensive attention has shifted towards the study of posttranscriptional RNA modifications occurring in both protein-coding as well as non-coding RNAs, revealing a novel and finer layer of complexity in gene regulation. This, in turn, has led to the birth of the novel field of 'Epitranscriptomics'.

The recent increase in applications of high-throughput sequencing (HTS) technology has provided an unprecedented opportunity to identify, on a transcriptome-wide scale, millions of RNA modifications in human genes, with more than 140 distinct types known today, such as methylation (e.g. m6A, m1A, m5C, hm5C, 2'OMe), pseudouridylation (ψ) and deamination (e.g. A-to-I RNA editing).

The scope of this Research Topic was to collect both reviews and research articles addressing the wet lab approaches and bioinformatics methodologies necessary to aid in the identification of novel RNA modifications and characterization of their biological functions. Among the articles embracing the aim of the Research Topic, we have collected four original research and methods articles, five reviews, and a technology article.

Citation: Nigita, G., Acunzo, M., Cho, W. C. S., Croce, C. M., eds. (2019). Epitranscriptomics: The Novel RNA Frontier. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-741-0

## Table of Contents

*Editorial: Epitranscriptomics: The Novel RNA Frontier*

Giovanni Nigita, Mario Acunzo, William Chi Shing Cho and Carlo M. Croce

#### SECTION 1

#### RNA METHYLATION AND PSEUDOURIDYLATION IN NON-CODING RNAs

Yang Zhao, William Dunker, Yi-Tao Yu and John Karijolich

#### SECTION 2

#### EPITRANSCRIPTOMICS IN HUMAN DISEASES

Margarita T. Angelova, Dilyana G. Dimitrova, Nadja Dinges, Tina Lence, Lina Worpenberg, Clément Carré and Jean-Yves Roignant

*Link Between m6A Modification and Cancers*

Zhen-Xian Liu, Li-Man Li, Hui-Lung Sun and Song-Mei Liu

#### SECTION 3

#### EPITRANSCRIPTOMICS IN PLANTS

*Unveiling Chloroplast RNA Editing Events Using Next Generation Small RNA Sequencing Data*

Nureyev F. Rodrigues, Ana P. Christoff, Guilherme C. da Fonseca, Franceli R. Kulcheski and Rogerio Margis

*REDIdb 3.0: A Comprehensive Collection of RNA Editing Events in Plant Organellar Genomes*

Claudio Lo Giudice, Graziano Pesole and Ernesto Picardi

*Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning*

Jie Song, Jingjing Zhai, Enze Bian, Yujia Song, Jiantao Yu and Chuang Ma

*Corrigendum: Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning*

Jie Song, Jingjing Zhai, Enze Bian, Yujia Song, Jiantao Yu and Chuang Ma

*Cadmium Stress Leads to Rapid Increase in RNA Oxidative Modifications in Soybean Seedlings*

Jagna Chmielowska-Bąk, Karolina Izbiańska, Anna Ekner-Grzyb, Melike Bayar and Joanna Deckert

#### SECTION 4

#### TECHNOLOGY REPORT

*The Effect of Centrifugal Force in Quantification of Colorectal Cancer-Related mRNA in Plasma Using Targeted Sequencing*

Vivian Weiwen Xue, Simon Siu Man Ng, Wing Wa Leung, Brigette Buig Yue Ma, William Chi Shing Cho, Thomas Chi Chuen Au, Allen Chi Shing Yu, Hin Fung Andy Tsang and Sze Chuen Cesar Wong

## Editorial: Epitranscriptomics: The Novel RNA Frontier

#### Giovanni Nigita<sup>1</sup> \*, Mario Acunzo<sup>2</sup> \*, William Chi Shing Cho<sup>3</sup> and Carlo M. Croce<sup>1</sup>

*<sup>1</sup> Department of Cancer Biology and Genetics, The Ohio State University, Columbus, OH, United States, <sup>2</sup> Division of Pulmonary Diseases and Critical Care Medicine, Virginia Commonwealth University, Richmond, VA, United States, <sup>3</sup> Department of Clinical Oncology, Queen Elizabeth Hospital, Kowloon, Hong Kong*

Keywords: epitranscriptomics, RNA methylation, RNA editing, pseudouridylation, RNA

**Editorial on the Research Topic**

#### **Epitranscriptomics: The Novel RNA Frontier**

From the formulation of the Central Dogma of molecular biology to the discovery of novel classes of non-coding RNAs (ncRNAs), the focus of Genetics was on DNA variants, with the aim of elucidating biological pathways perturbed in disease. Recently, considerable attention has shifted toward the study of RNA modifications dynamically occurring in both protein-coding and non-coding RNAs, in both the animal and plant kingdoms. This, in turn, has gradually led to the exciting exploration of the novel frontier of "Epitranscriptomics," revealing an additional, finer layer of complexity in gene regulation.

#### Edited and reviewed by:

*Richard D. Emes, University of Nottingham, United Kingdom*

#### \*Correspondence:

*Giovanni Nigita giovanni.nigita@osumc.edu Mario Acunzo mario.acunzo@vcuhealth.org*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

Received: *13 November 2018* Accepted: *22 November 2018* Published: *04 December 2018*

#### Citation:

*Nigita G, Acunzo M, Cho WCS and Croce CM (2018) Editorial: Epitranscriptomics: The Novel RNA Frontier. Front. Bioeng. Biotechnol. 6:191. doi: 10.3389/fbioe.2018.00191*

The application of fast-growing high-throughput sequencing (HTS) technology has provided an unprecedented opportunity to unveil millions of RNA modifications in human genes, such as those resulting from methylation (e.g., m6A, m1A, m5C, hm5C, 2'OMe), pseudouridylation (Ψ) and deamination (e.g., A-to-I RNA editing); to date, more than 140 different forms of RNA modification have been counted.

The aim of this Research Topic was to collect both original research articles and reviews addressing the bioinformatics approaches and wet lab methodologies necessary to aid in the detection of novel RNA modifications and the characterization of their biological functions, as well as to contribute to the identification of the molecular protagonists involved in the regulation of such phenomena. Among the articles embracing the scope of the Research Topic, we have collected five reviews, four original research and methods articles, and a technology article.

Two reviews focused on RNA modifications in ncRNAs. Romano et al. surveyed the main types of RNA methylation in the context of ncRNAs, along with their associated functions. They also described the methodologies used to identify and profile such RNA modifications. Zhao et al., instead, focused on the role of ncRNA pseudouridylation in nuclear gene expression events. In particular, they reviewed how, in the nuclear context, ncRNA pseudouridylation contributes to transcription and pre-mRNA splicing.

Another important RNA modification is the non-templated addition of uridine(s) to the 3′ end of RNA, termed 3′ RNA uridylation, which has been described as having a role in the regulation of both mRNAs and ncRNAs. In this collection, Menezes et al. reviewed current discoveries of the functional roles of 3′ RNA uridylation, particularly focusing on mammalian biology.

RNA modifications add a novel, fine-tuning layer to gene expression regulation, and it has recently been shown that dysregulation of these phenomena can lead to the onset of disease, such as neurological diseases and cancer. In this collection, Angelova et al. discussed the emerging findings indicating that RNA modifications impact the development of the brain and are involved in neurological disorders. One of the most abundant internal modifications in messenger RNAs (mRNAs) is N<sup>6</sup>-methyladenosine (m6A), which has been found to play important roles in multiple biological processes. Liu et al. reviewed how malfunctions of the m6A machinery lead to several cancer types, including solid and non-solid tumors.

RNA modification phenomena are also present in the plant kingdom. It has been observed that post-transcriptional modifications, such as C-to-U RNA editing events, can occur in chloroplasts. In this collection, Rodrigues et al. introduced a method to profile chloroplast RNA editing by using small RNA (sRNA) libraries from Arabidopsis, soybean and rice. The results obtained in this study encourage the use of sRNA libraries to identify novel RNA editing events and confirm previously detected ones, as well as to profile RNA editing in different plants under different biological conditions.

The improvement of high-throughput sequencing (HTS) techniques has allowed us to unveil thousands of editing events in plants. In this collection, Lo Giudice et al. presented the third release of REDIdb, a freely available database of RNA editing events in plant organelles. The current release of REDIdb contains more than 26K RNA editing modifications, together with a new web interface allowing users to place editing events in their genomic, biological and evolutionary contexts.

Together with the improvements in HTS technologies, we have observed remarkable progress in the development of algorithms and computational methods able to analyze and study data in the emerging field of Epitranscriptomics. Song et al. presented PEA-m5C, a transcriptome-wide machine learning-based predictor of m5C modifications in Arabidopsis. PEA-m5C was developed by employing a random forest algorithm with sequence-based features and an optimized window, obtaining an average AUC of 0.939 in 10-fold cross-validation experiments.
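To make the two ingredients named above more concrete, the following minimal sketch shows how a sequence window around a candidate cytosine might be turned into features, and how an AUC can be computed from classifier scores. It is not the PEA-m5C implementation: the one-hot encoding, the window width, the function names and the toy scores are all assumptions made for illustration, and the random forest itself is omitted.

```python
from typing import List, Sequence

ALPHABET = "ACGU"  # assumed RNA alphabet; any other symbol encodes as all zeros


def one_hot_window(seq: str, center: int, flank: int) -> List[int]:
    """One-hot encode the window [center - flank, center + flank] of seq.

    Positions falling outside the sequence are encoded as all zeros.
    """
    features: List[int] = []
    for pos in range(center - flank, center + flank + 1):
        base = seq[pos] if 0 <= pos < len(seq) else None
        features.extend(1 if base == letter else 0 for letter in ALPHABET)
    return features


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison of positives and negatives.

    Equals the probability that a random positive outscores a random negative,
    with ties counted as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In a 10-fold cross-validation setting, such an AUC would be computed on each held-out fold and the ten values averaged.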

The collection also includes a study conducted in soybean seedlings by Chmielowska-Bąk et al. concerning the influence of short-term cadmium stress on two RNA oxidation-dependent modifications: 8-hydroxyguanosine and apurinic/apyrimidinic sites. The authors reported an increased level of RNA oxidation in plants under stress conditions.

Finally, the collection also includes a study on the effect of centrifugal force on the quantification of colorectal cancer-related mRNA in plasma using targeted sequencing. Pre-analytical factors are known to be critical for the measurement of analytes in the blood. This study investigated the effect of two common centrifugation protocols on plasma mRNA quality and quantity. The authors concluded that more targeted mRNAs could be detected by double centrifugation at 1,600 × g followed by 16,000 × g than by a single centrifugation at 3,500 × g.

Epitranscriptomics represents a thrilling, vast and novel field in Cellular Biology. The articles we finally accepted for our Research Topic address some of its most exciting challenges, with the intention of providing a useful resource that will pique the interest of many researchers investigating RNA modification phenomena. We would like to take this opportunity to thank the authors for their contributions and the editorial staff for their professional support.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Nigita, Acunzo, Cho and Croce. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## RNA Methylation in ncRNA: Classes, Detection, and Molecular Associations

#### Giulia Romano<sup>1</sup>, Dario Veneziano<sup>2</sup>, Giovanni Nigita<sup>2</sup> and Serge P. Nana-Sinkam<sup>1</sup> \*

<sup>1</sup> Internal Medicine "Division of Pulmonary and Critical Care Medicine", Virginia Commonwealth University Health System, Richmond, VA, United States, <sup>2</sup> Department of Cancer Biology and Genetics, The Ohio State University, Columbus, OH, United States

Nearly all classes of coding and non-coding RNA undergo post-transcriptional modification, and more than 150 distinct modification types have been reported. Although RNA modifications were first described over 50 years ago, our understanding of their functional relevance in cellular control mechanisms and phenotypes has truly progressed only in the last 15 years, owing to advancements in detection and experimental techniques. Specifically, the phenomenon of RNA methylation in the context of ncRNA has emerged as a novel process in the arena of epitranscriptomics. Methylated ncRNA molecules may indeed contribute to a potentially vast functional panorama, from regulation of post-transcriptional gene expression to adaptive cellular responses. Recent discoveries have uncovered novel dynamic mechanisms and new layers of complexity, paving the way to a greater understanding of the role of such phenomena within the broader molecular cellular context of human disease.

#### Edited by:

Florent Hubé, UMR 7216, Epigénétique et Destin Cellulaire, France

#### Reviewed by:

Erik Dassi, University of Trento, Italy Yuri Motorin, Université de Lorraine, France Clement Carre, FR3631 Institut de Biologie Paris Seine, France

#### \*Correspondence:

Serge P. Nana-Sinkam Patrick.Nana-Sinkam@vcuhealth.org

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 05 March 2018 Accepted: 20 June 2018 Published: 12 July 2018

#### Citation:

Romano G, Veneziano D, Nigita G and Nana-Sinkam SP (2018) RNA Methylation in ncRNA: Classes, Detection, and Molecular Associations. Front. Genet. 9:243. doi: 10.3389/fgene.2018.00243

Keywords: RNA, methylation, epigenetics, non-coding RNAs, RNA methodologies

#### INTRODUCTION

Until recently, the central dogma (Crick, 1970) had supported a primary focus on the molecular contributions of DNA and protein to human disease. The inability to detect and evaluate RNA with the necessary molecular resolution and precision has limited our understanding of the spectrum of RNA modifications that may drive disease.

Following the discovery of pseudouridine (Davis and Allen, 1957), nine additional modifications were identified in 1965 (Holley et al., 1965b). Subsequently, modification events in nucleotides of mRNA molecules were also uncovered in the 1970s (Desrosiers et al., 1974; Adams and Cory, 1975; Dubin and Taylor, 1975; Perry et al., 1975). Gradually, the "static" interpretation of the cellular role of RNA started to be challenged (Gilbert, 1986). With the discovery of novel species of non-coding RNA (ncRNA) and the further investigation of their mechanisms (Lee et al., 1993; Fire et al., 1998; Eddy, 2001), RNA biology came to the forefront (Todd and Karbstein, 2007). Along with advancements in experimental and transcriptomics techniques, which enabled a more detailed investigation of the translational control of cellular responses and phenotypes (Chan et al., 2010), interest in RNA modifications also grew, resulting in significant progress in the last 15 years. Recent discoveries, such as the first and second mRNA m6A demethylases FTO (Jia et al., 2011) and ALKBH5 (Zheng et al., 2013), as well as the identification of the METTL3/METTL14 methyltransferase complex (Liu et al., 2014), have triggered renewed interest in RNA modifications.

To date, a total of 163 post-transcriptional RNA modifications have been uncovered across all living organisms (Boccaletto et al., 2017) and are among the most evolutionarily conserved properties of RNAs (Li and Mason, 2014), revealing a "novel," complex layer of biological regulation known as the epitranscriptome (Saletore et al., 2012). The functional diversity provided by these phenomena can indeed affect RNA structure, play a fundamental role in their interactions with other molecules and in regulatory networks, such as metabolic changes (Lewis et al., 2017), thus affecting every aspect of cellular physiology.

RNA modifications have been categorized as reversible and non-reversible. Among non-reversible modifications, we find well-studied phenomena such as RNA editing and pseudouridylation (Meier, 2011). Nonetheless, recent focus has shifted to reversible modifications, such as cytosine and adenosine methylations (Klungland et al., 2016). However, this classic distinction is being reassessed, in light of the discovery of "erasers" such as FTO and ALKBH5.

The importance of modifications in novel classes of ncRNA transcripts is also becoming apparent. Well-characterized chemical modifications in traditional classes of RNA, such as transfer RNA (tRNA) and ribosomal RNA (rRNA), together with novel detection technologies and deep sequencing analysis (Veneziano et al., 2015, 2016), have paved the way for a fuller assessment of these molecular events in regulatory ncRNAs as well, such as microRNAs (Alarcon et al., 2015b) and long ncRNAs (Patil et al., 2016).

#### RNA METHYLATION

RNA methylation is a reversible, post-transcriptional RNA modification affecting several biological processes, such as RNA stability and mRNA translation (Ji and Chen, 2012; Wang et al., 2014, 2015; Dev et al., 2017), through a variety of RNA methyltransferases, often using distinct catalytic strategies. Furthermore, recent studies have shown how the deregulation of proteins implicated in these modification phenomena is associated with disease (Supplementary Table 1A). In this section, we will review the main types and functions of methylation in ncRNAs (**Figure 1**).

#### N<sup>6</sup>-Methyladenosine (m6A)

N<sup>6</sup>-methyladenosine (m6A) is the most abundant internal modification detected to date in mRNA (Roundtree et al., 2017). Discovered in the 1970s, its function has been thoroughly investigated only in the last decade (Rottman et al., 1974; Wang and He, 2014). This was driven by the recent discovery and characterization of evolutionarily conserved proteins able to encode (writers), decode (readers), and remove (erasers) methylation (Lewis et al., 2017). Since 1994, different writers have been identified, including METTL3 and METTL14, proven to regulate the circadian clock, differentiation of embryonic stem cells and primary miRNA processing (Dominissini et al., 2012; Wang et al., 2014; Alarcon et al., 2015b). These enzymes work in complex with proteins essential to the correct processing of RNA methylation (Schwartz et al., 2014): Wilms tumor 1-associated protein (WTAP), RNA-binding motif protein 15 (RBM15) and Protein virilizer homolog (KIAA1429). Additionally, the discovery of ALKBH5 and FTO has revealed the dynamic dimension of this modification phenomenon for cellular metabolism (Jia et al., 2011; Zheng et al., 2013). Recently, the YTH domain family proteins (YTHDF1–3) and YTH domain-containing protein 1 (YTHDC1) have been characterized as m6A readers, providing the first functional evidence of m6A (Wang et al., 2014).

The methyl group in m6A does not affect Watson–Crick base-pairing (Liu and Jia, 2014). The modification is highly conserved between human and mouse and is located in 5′ UTRs, 3′ UTRs, around stop codons, and in long internal and alternatively spliced exons (Dominissini et al., 2016; Li et al., 2016a; Lewis et al., 2017). It is also found in tRNA, rRNA, and small nuclear RNA (snRNA), as well as in several long non-coding RNAs, such as Xist (Dominissini et al., 2012). While not completely understood, m6A has been shown to play critical roles in the biological regulation of mRNA and ncRNA (Liu and Jia, 2014), particularly splicing, stability, turnover, nuclear export, and mediation of cap-independent translation (Meyer et al., 2015). Recently, Sun et al. (2016) have integrated all m6A sequencing data into a novel database, RMBase, identifying ∼200,000 N<sup>6</sup>-methyladenosine (m6A) sites in human and mouse. Finally, Linder et al. (2015) mapped m6A and m6Am at single-nucleotide resolution and identified small nucleolar RNAs (snoRNAs) as a new class of m6A-containing non-coding RNAs (ncRNAs).

#### N<sup>1</sup>-Methyladenosine (m1A)

Although the first studies on N<sup>1</sup>-methyladenosine (m1A) in total RNA date back more than 50 years (Dunn, 1961), only one study in the last decade has shed substantial light on its function. m1A is a dynamic methylation event at the N<sup>1</sup> position of adenosine, comprising the addition of a methyl group and a positive charge to the base, specifically at the Watson–Crick interface, thereby altering RNA-protein interactions and RNA secondary structures through electrostatic effects (Roundtree et al., 2017). m1A is abundant in tRNA and rRNA (El Yacoubi et al., 2012; Sharma et al., 2013), exercising a major influence on structure and function (Anderson, 2005). Two groups recently found strong conservation of the m1A pattern in several human and murine cell lines as well as in yeast, affirming the important role of this modification across evolution. In particular, m1A has been shown to have a role in mRNA translation, via unique localization near the translation start site and first splice site (Dominissini et al., 2016; Li et al., 2016a; Roundtree et al., 2017) and by facilitating noncanonical binding of the exon–exon junction complex (Cenik et al., 2017).

#### 2′-O-Methylation (2′OMe/Nm)

2′OMe is a very common RNA modification in abundant RNAs (rRNA, snRNA, tRNA) (Schibler and Perry, 1977; Borges and Martienssen, 2015; Roundtree et al., 2017) as well as in microRNA, and it is fundamental for the biogenesis and function of these molecules (Ji and Chen, 2012). It was initially detected at the second and third nucleotide in many mRNAs (Schibler and Perry, 1977). Further, it was observed that in rRNA the loss of an individual modification had no apparent effect, while the deletion of 2–3 modifications in A and P site regions impairs translation and strongly delays pre-rRNA processing (Liang et al., 2009).

2′-O-methylation occurs at 3′ termini and is important in the biogenesis of small RNAs in plants, including miRNAs and siRNAs (Yu et al., 2005). Furthermore, 2′-O-methylation plays an important role in protecting some small RNAs against 3′–5′ degradation and 3′ uridylation, such as piRNAs in animals and Ago2-associated small RNAs in Drosophila (Ji and Chen, 2012). It has been found to be catalyzed by the HUA-ENHANCER-1/piwi-methyltransferase (HEN1/piMET) enzyme.

#### 5-Methylcytosine (m5C)

5-Methylcytosine (m5C) is an epitranscriptomic modification in which the fifth carbon atom of cytosine is the target of methylation; it occurs in poly(A) RNA, rRNA, tRNA, snRNA, and lncRNA (Amort et al., 2013, 2017; Lewis et al., 2017). While some of the proteins regulating m5C in different RNAs have been identified, its biological function remains unclear (Nachtergaele and He, 2017). NOL1/NOP2/Sun domain family member 2 (NSUN2) and DNA methyltransferase-like protein 2 (DNMT2) have been shown to be writers of m5C. Until recently, no erasers or readers had been reported (Lewis et al., 2017); ALYREF has since been identified as a potential reader of m5C (Yang et al., 2017). Several roles have been suggested for m5C, from stabilizing tRNA secondary structure and preventing degradation or cleavage, to playing a role in translation when present in rRNA and increasing the stability of mRNA transcripts (Esteller and Pandolfi, 2017).

#### METHODOLOGIES FOR THE DETECTION AND PROFILING OF RNA METHYLATION

The recent advent of more sensitive and robust sequencing technologies (Li et al., 2016b), coupled with novel biochemical techniques (Song and Yi, 2017), has greatly improved the characterization and understanding of RNA modifications (Frye et al., 2016). This has allowed us to address challenges such as limitations with reverse transcription (RT) signatures and low transcript expression, as is the case with mRNA and lncRNA. Major advances in high-throughput sequencing methods (Helm and Motorin, 2017) have indeed allowed for the systematic identification of RNA modifications at single-nucleotide resolution, effectively distinguishing their distribution patterns in a transcriptome-wide manner.

Traditional biophysical targeted approaches for the detection and quantification of RNA modifications have further matured and provided the foundation for nearly all current high-throughput techniques (Vandivier and Gregory, 2017). Earlier methodologies relied on chromatography applied to direct sequencing, providing the very first evidence of modifications in RNA (Desrosiers et al., 1974). As these techniques only allowed detection of global patterns of modification, they were soon improved with the application of electrophoresis (Gupta and Randerath, 1979; Sprinzl and Vassilenko, 2005) and mass spectrometry (McCloskey and Nishimura, 1977; Kowalak et al., 1993), attaining base resolution for the first time. Recently, other strategies, such as high-resolution melting (Golovina et al., 2014), have been implemented to narrow resolution.

Nonetheless, an important strategy on which several high-throughput techniques were later developed is based on the detection of variation in RT signatures (Brownlee and Cartwright, 1977; Motorin et al., 2007). As RNA modifications may interfere with the RT enzyme, inducing its arrest and/or the misincorporation of non-complementary deoxyribonucleoside triphosphates (dNTPs), this provided the foundation for several current methodologies, some exclusively RT-based and others leveraging chemical treatment of the RNA pool or the use of antibodies for the enrichment of modified RNA populations. Such is the case of techniques employing methyl RIP-seq (MeRIP-seq) (Mishima et al., 2015; Dominissini et al., 2016; Li et al., 2016a) coupled with various crosslinking techniques to improve the resolution window. For instance, m1A-IDseq employs demethylases to generate an m1A-depleted control library for validation (Li et al., 2016a). In alternative techniques, such as m1A-seq, RNA pools undergo Dimroth rearrangement under alkaline conditions, converting m1A residues to m6A, thus producing different RT signatures that can validate the MeRIP data (Dominissini et al., 2016). Certain RNA modifications, such as m6A and m5C, are instead RT-silent. Although simple antibody pulldown methods have satisfactorily mapped m6A sites (Dominissini et al., 2012; Meyer et al., 2012) and antibodies highly specific to methylated RNA bases have also been employed (Linder et al., 2015; Li et al., 2016a), most antibody-based methods do not provide nucleotide resolution. For this reason, more recent global approaches have paired antibody binding with covalent crosslinking at specific RNA sites, resulting in RT signatures able to improve resolution (Linder et al., 2015).
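
The RT-signature principle described above (arrest of the enzyme and/or misincorporation at a modified residue) can be illustrated with a minimal sketch that derives per-position mismatch and arrest rates from read counts. The `PositionCounts` structure, the function names and the thresholds are hypothetical choices made for this illustration; they do not correspond to any of the cited protocols.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PositionCounts:
    """Hypothetical per-position summary of cDNA reads aligned to one transcript."""
    read_through: int  # reads in which RT extended past this position
    mismatches: int    # read-through reads carrying a non-reference base here
    stops: int         # reads whose cDNA synthesis terminated at this position


def mismatch_rate(c: PositionCounts) -> float:
    """Fraction of read-through reads with a misincorporated base."""
    return c.mismatches / c.read_through if c.read_through else 0.0


def arrest_rate(c: PositionCounts) -> float:
    """Fraction of all reads reaching this position at which RT stopped."""
    reached = c.read_through + c.stops
    return c.stops / reached if reached else 0.0


def candidate_sites(profile: List[PositionCounts],
                    min_depth: int = 20,
                    min_mismatch: float = 0.2,
                    min_arrest: float = 0.3) -> List[int]:
    """Return 0-based positions whose RT signature exceeds either threshold.

    The thresholds are arbitrary illustrative values, not published cut-offs.
    """
    hits = []
    for i, c in enumerate(profile):
        if c.read_through + c.stops < min_depth:
            continue  # too few reads to estimate the rates reliably
        if mismatch_rate(c) >= min_mismatch or arrest_rate(c) >= min_arrest:
            hits.append(i)
    return hits
```

In practice such rates would be interpreted against a control library, for example a demethylase-treated one, rather than against fixed thresholds alone.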
In MeRIP-based protocols, for instance, after transcript fragmentation, antibodies forming non-covalent complexes with modified residues are further cross-linked to reactive residues nearby via UV light at distinct frequencies according to the specific technique (e.g., miCLIP and PA-m6A-seq) for m6A detection (Chen K. et al., 2015). Such induced covalent crosslinks are then the sites at which RT stalls, yielding approximate or precise single-nucleotide resolution. Recently, an innovative detection technique has precisely elucidated m6A distributions across unknown regions via an antibody-independent strategy able to produce abortive cDNA signatures at m6A sites, greatly increasing resolution (Hong et al., 2018).

In the case of m5C, bisulfite sequencing has yielded satisfactory results, although it poses a few challenges. Whereas unmodified cytosines are converted to uracils by bisulfite treatment, m5C residues remain unaffected, providing a signature in cDNA. While this has been effective for highly abundant ncRNA populations (i.e., tRNA and rRNA) (Militello et al., 2014), degradation issues (due to higher pH conditions during treatment) and read mapping challenges have yielded poor results for low-abundance RNA species (Squires et al., 2012; Hussain et al., 2013; Jeltsch et al., 2017). An alternative approach termed "suicide enzyme trap" has been employed to characterize substrates of the m5C-methyltransferases (m5C-MTases) NSUN2 and NSUN4 (Metodiev et al., 2014; Van Haute et al., 2016). By mutating m5C-MTases to form irreversible covalent bonds with target residues, the resulting stable enzyme–RNA complexes are suitable for immunoprecipitation and mapping. Such is also the case of the AZA-seq methodology formalized by Khoddami and Cairns (2014), in which the "suicide inhibitor" nucleotide analog 5-azacytidine is incorporated into cellular RNA and "traps" m5C-MTases for pulldown and sequencing.
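
The bisulfite readout discussed in this paragraph can be sketched in a few lines: unmodified cytosines are deaminated to uracil and therefore read as T in cDNA, whereas m5C is retained as C. The example below is a toy simulation under strong assumptions (complete conversion, full-length reads already aligned to the reference, an arbitrary calling threshold); the function names are hypothetical and it is not a replacement for a real bisulfite mapping pipeline.

```python
from typing import List, Set


def bisulfite_convert(rna: str, m5c_positions: Set[int]) -> str:
    """Simulate the cDNA-level readout of bisulfite-treated RNA.

    Unmodified C is deaminated to U and is therefore read as T in cDNA;
    m5C resists conversion and is still read as C. U is written as T.
    """
    out = []
    for i, base in enumerate(rna.upper()):
        if base == "C":
            out.append("C" if i in m5c_positions else "T")
        elif base == "U":
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_m5c(reference: str, reads: List[str], min_fraction: float = 0.5) -> List[int]:
    """Report reference cytosines still read as C in at least min_fraction of reads.

    Reads are assumed to be full-length and already aligned to the reference.
    """
    sites = []
    for i, base in enumerate(reference.upper()):
        if base != "C" or not reads:
            continue
        retained = sum(1 for r in reads if r[i] == "C")
        if retained / len(reads) >= min_fraction:
            sites.append(i)
    return sites
```

Incomplete conversion and RNA degradation, both noted above as practical challenges, would show up here as spurious retained cytosines and reduced read depth, which is why real analyses estimate a conversion rate from unmodified controls.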

Finally, 2'OMe too can be detected at base resolution via differential RT profiles, with or without chemical treatment. The RiboMeth-seq methodology (Birkedal et al., 2015; Krogh et al., 2016; Marchand et al., 2016, 2017), for instance, leverages the ability of 2'OMe to preserve adjacent phosphodiester bonds from alkaline cleavage and produces a high-throughput coverage profile of under-represented positions at the extremes of reads. Nonetheless, chemical treatment is not strictly necessary. Indeed, earlier methods relied on the natural ability of 2'OMe to interrupt RT at low dNTP concentrations (Maden et al., 1995). This principle was recently employed in the development of a high-throughput protocol proven to be more sensitive and specific than methods based on alkaline hydrolysis. These methodologies have specifically been assessed on 2'OMe modifications occurring in ribosomal and transfer RNA, while not as efficiently identifying such modifications in low-abundance RNA molecules such as mRNA and several ncRNAs. To address this deficiency, the recently published Nm-seq protocol leverages the ability of 2'OMe to confer resistance to oxidation by sodium periodate to the ribose backbone of RNA molecules, thus allowing the enrichment and mapping of reads originating from RNA fragments whose internal 2'OMe sites have been exposed at the 3' end via the elimination of non-modified nucleotides. This technique has provided a sensitive and precise 2'OMe detection method for rare RNA classes (Dai et al., 2017).
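The idea behind alkaline-hydrolysis-based detection, namely that a protected phosphodiester bond yields few read ends at that position, can be sketched as follows. This is a deliberately simplified illustration in Python and not the scoring scheme of the published RiboMeth-seq pipelines: the function name, the `flank` parameter and the toy counts are assumptions made for the example.

```python
def end_deficit_scores(end_counts, flank=2):
    """Score each position by how depleted its read-end count is versus neighbors.

    end_counts: number of read ends observed at each position.
    flank: number of neighboring positions considered on each side (assumed value).
    Returns values in [0, 1]; values near 1 suggest a protected position.
    """
    n = len(end_counts)
    scores = []
    for i in range(n):
        neighbors = [end_counts[j]
                     for j in range(max(0, i - flank), min(n, i + flank + 1))
                     if j != i]
        mean_nb = sum(neighbors) / len(neighbors) if neighbors else 0.0
        if mean_nb == 0:
            scores.append(0.0)  # no local coverage: no call possible
        else:
            scores.append(max(0.0, 1.0 - end_counts[i] / mean_nb))
    return scores


counts = [100, 100, 5, 100, 100]  # toy read-end counts with a dip at index 2
scores = end_deficit_scores(counts)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, round(scores[best], 2))
```

In this toy profile only the central position stands out, mirroring the under-representation of read ends expected next to a 2'-O-methylated nucleotide.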

Due to the time-consuming and labor-intensive nature of such techniques, many transcriptomes and potentially novel modifications remain unexplored. For this reason, computational methods have also been developed for the accurate evaluation of modification events (Zhang et al., 2015; Liu et al., 2017). Moreover, given the error-prone nature of high-throughput techniques, it is strongly suggested that modification sites predicted from big data not be considered as candidates unless validated with at least one additional methodology (Helm and Motorin, 2017). All methodologies described above are summarized in Supplementary Table 1B.
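The recommendation to retain only sites supported by an additional methodology amounts to a simple filtering step, sketched below in Python. The method labels, transcript identifiers and coordinates are invented for illustration, and the threshold `min_methods` is an assumed parameter rather than a published standard.

```python
from collections import Counter


def validated_sites(calls_by_method, min_methods=2):
    """Keep sites reported by at least `min_methods` independent methods.

    calls_by_method: dict mapping a method label to an iterable of
    (transcript_id, position) tuples.
    """
    support = Counter()
    for sites in calls_by_method.values():
        support.update(set(sites))  # count each method at most once per site
    return sorted(site for site, n in support.items() if n >= min_methods)


# Hypothetical calls; none of these coordinates come from real data.
calls = {
    "antibody_based": [("tx1", 120), ("tx1", 305), ("tx2", 44)],
    "rt_signature": [("tx1", 305), ("tx2", 44), ("tx3", 9)],
    "computational": [("tx2", 44)],
}
print(validated_sites(calls))  # [('tx1', 305), ('tx2', 44)]
```

Raising `min_methods` makes the filter stricter at the cost of sensitivity.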

#### ncRNA SPECIES AND RNA METHYLATION: FUNCTIONAL ASSOCIATIONS

#### tRNA

tRNA methylations were first identified concurrently with the initial sequencing of the clover-shaped molecule (Holley et al., 1965a). Initially, it was suggested that such phenomena were probably the result of a network of diverse enzymes (Hurwitz et al., 1964). It is now clear that tRNA methylation is highly conserved and that tRNAs are the RNA class containing the majority of modified nucleosides among all discovered RNA
species. With a total of more than 90 modified nucleosides identified (MODOMICS) (Boccaletto et al., 2017), all tRNA molecules from the three domains of life contain 13 methylated nucleosides out of 18 shared (Marck and Grosjean, 2002; Jackman and Alfonzo, 2013). Originally, it was thought that tRNA modifications in general were a straightforward, static process occurring on specific sites of distinct tRNA species. Given the recent characterization of major tRNA modification pathways, along with their associated tRNA methyltransferase enzyme families (Hori, 2014), considerable diversity has emerged among living organisms. The presence of catalytic interactions, distinct RNA substrate recognition mechanisms and diverse chemical processes all suggest a complex functional panorama. Generally, four functional categories can be attributed to tRNA methylation phenomena: preservation of secondary and tertiary structures (Helm and Attardi, 2004; Voigts-Hoffmann et al., 2007); thermodynamic stability (Yokoyama et al., 1987); protection from degradation and rapid tRNA decay (Kadaba et al., 2004; Alexandrov et al., 2006; Guy et al., 2014); translation control and fidelity (Anderson et al., 1998, 2000; Chan et al., 2010, 2012). It is thus evident that tRNA methylation contributes to RNA quality control systems, cellular localization (Kaneko et al., 2003), response to stress stimuli (Schaefer et al., 2010; Becker et al., 2012; Muller et al., 2013), proliferation and many other processes (Phizicky and Hopper, 2015). Most importantly, disruption of energy and amino acid metabolism pathways (e.g., depletion of methionine, necessary for methylation) can impair the downstream RNA modification system, resulting in partially modified tRNAs and thus translational errors (explaining why living organisms use the methionine codon as the initiation codon for protein synthesis) (Hori, 2014). 
Recently, researchers discovered the first tRNA demethylase, ALKBH1, as a novel post-transcriptional gene expression regulation mechanism (Liu et al., 2016). Finally, tRNA methylations and their enzymes may cooperate collectively in functional networks in order to support adaptive cellular responses (Chan et al., 2010; Tomikawa et al., 2010; Ishida et al., 2011).

#### miRNAs

From transcription to decay, the multi-level process of miRNA biogenesis is regulated by two main actors: processing enzymes such as DROSHA, DICER, and AGO proteins (Ha and Kim, 2014); and post-transcriptional modifications. Established RNA modifications, such as RNA editing events, have been shown to dynamically alter the sequence and/or the structure of miRNAs (Nigita et al., 2015; Nishikura, 2016) and consequently, in some cases, their function (Kawahara et al., 2007; Nigita et al., 2016). Recently, this has also been investigated in the context of miRNAs and RNA methylation.

2'OMe has been detected at the 3' end of miRNAs (only in plants) and found to confer stability and protection from 3' uridylation and degradation (Backes et al., 2012; Borges and Martienssen, 2015). m6A within 3' UTRs has been generally associated with the presence of miRNA binding sites; roughly 2/3 of mRNAs containing an m6A site within their 3' UTR also have at least one microRNA binding site (Meyer et al., 2012). In another study, Alarcon et al. (2015b) described how miRNAs can undergo N<sup>6</sup>-adenosine methylation (m6A) as a result of the intervention of METTL3 during pri-miRNA processing. The same authors also showed that m6A marks in pri-miRNAs allow for the RNA-binding protein DGCR8 to identify its specific substrates, promoting the beginning of miRNA biogenesis. Alarcon et al. (2015a) have further hypothesized that the RNA-binding protein HNRNPA2B1 could function as nuclear reader of the m6A mark, binding to m6A marks in pri-miRNAs, thus promoting pri-miRNA processing. Additionally, the effects of RNA demethylation on miRNA expression have also been investigated. Berulava et al. (2015) reported significant miRNA expression dysregulation as a result of knocking down m6A demethylase FTO, providing indirect evidence of cotranscriptional processing in the methylation of mRNAs and miRNAs. Finally, Chen T. et al. (2015) discovered that miRNAs positively regulate m6A installment on mRNAs via a sequence pairing mechanism. Methylation events in miRNAs add a new layer of complexity in the regulation of post-transcriptional gene expression and warrant future studies in order to fully elucidate the roles and functions of modified miRNAs.

#### Long ncRNA

Although most attention has recently been devoted to modifications in mRNA, thousands of lncRNA transcripts have been found to contain a substantial number of modifications (Shafik et al., 2016). Evidence associating methylation with the most established lncRNA transcripts is just starting to be recognized. MALAT1 has been shown to bind the m6A writer METTL16 at its 3' triple-helical RNA stability element (Brown et al., 2016), specifically in its A-rich portion, after it was previously proven that MALAT1 can carry m6A (Liu et al., 2013). The presence of m6A has been further shown to destabilize the hairpin stems in the transcript, making them more flexible and solvent-accessible (Zhou et al., 2016) as well as more accessible for protein binding (Liu et al., 2015). Several putative m5C sites have also been detected in MALAT1 (Squires et al., 2012), but no enzymes have been identified. lncRNA HOTAIR (Khoddami and Cairns, 2013) possesses a specific m5C site which has been verified with a 100% modification rate (Amort et al., 2013). Finally, m6A events have been associated with XIST-mediated transcriptional repression (Patil et al., 2016), while m5C sites can prevent XIST-protein interactions, although this may not be a conserved mechanism (Amort et al., 2013). More detailed information can be found in Jacob et al. (2017).

### AUTHOR CONTRIBUTIONS

GR wrote and set up the manuscript. GN and DV wrote and reviewed the content. SN-S supervised and reviewed the manuscript writing and development.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00243/full#supplementary-material

#### REFERENCES



quad-negative binomial model. BMC Bioinformatics 18:387. doi: 10.1186/s12859-017-1808-4



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Romano, Veneziano, Nigita and Nana-Sinkam. This is an openaccess article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## The Role of Noncoding RNA Pseudouridylation in Nuclear Gene Expression Events

#### *Yang Zhao<sup>1</sup>, William Dunker<sup>1</sup>, Yi-Tao Yu<sup>2</sup>\* and John Karijolich<sup>1,3</sup>\**

*<sup>1</sup>Department of Pathology, Microbiology, and Immunology, School of Medicine, Vanderbilt University, Nashville, TN, United States, <sup>2</sup>Department of Biochemistry and Biophysics, Center for RNA Biology, School of Medicine and Dentistry, University of Rochester, Rochester, NY, United States, <sup>3</sup>Vanderbilt-Ingram Cancer Center, Nashville, TN, United States*

Pseudouridine is the most abundant internal RNA modification in stable noncoding RNAs (ncRNAs). It can be catalyzed by both RNA-dependent and RNA-independent mechanisms. Pseudouridylation impacts both the biochemical and biophysical properties of RNAs and thus influences RNA-mediated cellular processes. The investigation of nuclear-ncRNA pseudouridylation has demonstrated that it is critical for the proper control of multiple stages of gene expression regulation. Here, we review how nuclear-ncRNA pseudouridylation contributes to transcriptional regulation and pre-mRNA splicing.

#### *Edited by:*

*Carlo Maria Croce, The Ohio State University, United States*

#### *Reviewed by:*

*Francesco Russo, University of Copenhagen, Denmark Shihao Shen, University of California, Los Angeles, United States*

#### *\*Correspondence:*

*Yi-Tao Yu yitao\_yu@urmc.rochester.edu; John Karijolich john.karijolich@vanderbilt.edu*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

*Received: 20 November 2017 Accepted: 22 January 2018 Published: 08 February 2018*

#### *Citation:*

*Zhao Y, Dunker W, Yu Y-T and Karijolich J (2018) The Role of Noncoding RNA Pseudouridylation in Nuclear Gene Expression Events. Front. Bioeng. Biotechnol. 6:8. doi: 10.3389/fbioe.2018.00008*

Keywords: RNA pseudouridylation, steroid receptor RNA activator, 7SK RNA, spliceosomal small nuclear RNA, transcription, pre-mRNA splicing

## INTRODUCTION

The proper control of gene expression in the nucleus is achieved by the actions of a diverse set of factors. In addition to proteins, noncoding RNAs (ncRNAs) participate in most, if not all, stages of nuclear gene expression, including both RNA polymerase II (Pol II) transcription and pre-mRNA splicing (Mercer et al., 2009). ncRNAs employ numerous mechanisms to accomplish their function, including binding to and modulating protein function, base pairing with complementary nucleic acids, or directly catalyzing biochemical reactions. Like proteins, proper modification and folding into higher-order structures are a prerequisite for their function.

In addition to the four canonical nucleosides, more than 140 chemically distinct modified RNA nucleosides have been identified in nature (Machnicka et al., 2013). Pseudouridine (ψ), first discovered over 60 years ago (Cohn and Elliot, 1951), is the most abundant internal RNA modification in stable RNAs. ψ, the C5-glycoside isomer of uridine, is formed through an internal transglycosylation reaction in which the N1–C1' bond between the uracil base and the ribose sugar is broken and a C5–C1' glycosidic bond is formed in its place (**Figure 1**). As a consequence of isomerization, an additional hydrogen bond donor is present at the non-Watson–Crick edge. The distinct structure of ψ increases both the rigidity of the phosphodiester backbone and the thermodynamic stability of ψ–A base pairs compared with U–A. This effect is mediated by water-coordinated hydrogen bonding and base stacking (Charette and Gray, 2000).

Initial evidence for a functional role of ψ partially came from the fact that ψ residues are clustered in functionally important and evolutionarily conserved regions of tRNA (Grosjean et al., 1995; Hopper and Phizicky, 2003), rRNA (Branlant et al., 1981; Maden, 1990), and small nuclear RNA (snRNA) (Reddy and Busch, 1988; Massenet et al., 1998; Narlikar et al., 2002; Karijolich and Yu, 2010). Indeed, experimental data have confirmed important roles for pseudouridylation in multiple aspects of gene expression regulation, including spliceosomal small nuclear ribonucleoprotein (snRNP) biogenesis, efficiency of pre-mRNA splicing, and translation fidelity (Karijolich et al., 2010). Here, we will provide an overview of the mechanisms of pseudouridylation and then highlight several pseudouridylated ncRNAs and their effects on nuclear gene expression events, specifically transcription and pre-mRNA splicing.

Figure 1 | Schematic of the pseudouridylation reaction. The isomerization of uridine (U) to pseudouridine (ψ) is mediated by pseudouridine synthases (PUSs). It results in an extra hydrogen bond donor (d) and the same number of hydrogen bond acceptors (a).

#### MECHANISMS OF PSEUDOURIDYLATION

The past decades have seen remarkable progress toward defining mechanisms by which pseudouridylation is catalyzed. Pseudouridylation of ncRNA is catalyzed by pseudouridine synthases (PUSs) through two distinct mechanisms, namely RNA-dependent pseudouridylation and RNA-independent pseudouridylation.

#### RNA-DEPENDENT PSEUDOURIDYLATION

The RNA-dependent pseudouridylation machinery consists of one unique box H/ACA RNA and four core proteins. Box H/ACA RNAs are one of the most evolutionarily conserved families of small ncRNAs and are present in all eukaryotes. They function as guide RNAs to direct pseudouridylation in mRNA, rRNA, spliceosomal snRNAs, and various other types of ncRNAs. With a median length of 133 nucleotides (nts), eukaryotic box H/ACA RNAs adopt a hairpin-hinge–hairpin-tail secondary structure (**Figure 2A**). Within the internal hinge region, there is an H box motif (ANANNA), while an ACA box motif (ACA) is positioned near the 3' end. Within each hairpin structure, an internal loop (pseudouridylation pocket) is present, which facilitates substrate recognition *via* complementary base-pairing interactions between the box H/ACA RNA (the loop sequence) and the substrate RNA. Both hairpins of H/ACA RNAs can carry functional pseudouridylation pockets and can thus independently direct pseudouridylation of uridines in separate RNAs or separate uridines located within the same substrate RNA. In the guide RNA–target RNA interaction, the guide sequences hybridize to the substrate RNA immediately 5' and one nt 3' of the target uridine, thereby framing it. The distance between the target uridine and the H or ACA box of the guide RNA is usually 14–15 nts (Kiss et al., 2010).

Figure 2 | Mechanism of pseudouridylation. (A) Schematic of RNA-dependent pseudouridylation by box H/ACA RNP. The box H/ACA RNP is composed of the box H/ACA RNA and four core proteins (Cbf5, Nhp2, Nop10, and Gar1). The secondary structure of a eukaryotic pseudouridylation guide box H/ACA RNA is shown as a blue line. The RNA adopts a hairpin-hinge–hairpin-tail structure. The box H (5'-ANANNA-3') within the hinge region and the box ACA (5'-ACA-3') motif at the 3'-end of the RNA are highlighted in a yellow box. The pseudouridylation pocket (thick-blue line) facilitates substrate recognition *via* complementary base-pairing interactions between the box H/ACA RNA and the substrate RNA (green line). (B) The standalone pseudouridine synthase (PUS), PUS7, recognizes the consensus sequence in its substrate U2 small nuclear RNA (snRNA) and catalyzes pseudouridine formation at U35. (C) Inducible pseudouridylation of U2 snRNA. Top, the 3' pocket of small nucleolar RNA 81 (snR81) base pairs with the sequence surrounding nucleotide (nt) U93 with two mismatches (denoted as red crosses). Bottom, the sequence surrounding U56 recognized by PUS7 has three nts (highlighted in red) that differ from the consensus PUS7 recognition sequence.
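The two sequence motifs that define box H/ACA RNAs, the H box (ANANNA) and the ACA box near the 3' end, lend themselves to a simple pattern search. The Python sketch below is illustrative only: the function name, the `tail` window used to define "near the 3' end" and the toy sequence are assumptions, and genuine guide RNA annotation also requires the hairpin-hinge–hairpin-tail secondary structure to be evaluated.

```python
import re

# H box consensus ANANNA; the lookahead allows overlapping candidates to be reported.
H_BOX = re.compile(r"(?=A[ACGU]A[ACGU]{2}A)")


def find_box_motifs(rna, tail=10):
    """Locate candidate H boxes and a 3'-proximal ACA box in an RNA sequence.

    tail: number of 3'-terminal nucleotides searched for ACA (assumed window).
    Returns (list of 0-based H box starts, 0-based ACA start or None).
    """
    seq = rna.upper().replace("T", "U")
    h_starts = [m.start() for m in H_BOX.finditer(seq)]
    aca = seq.rfind("ACA", max(0, len(seq) - tail))
    return h_starts, (aca if aca != -1 else None)


# Toy sequence built for the example; it is not a real box H/ACA RNA.
toy = "GGCC" + "AGAUUA" + "CCGG" + "ACA" + "UUU"
print(find_box_motifs(toy))  # ([4], 14)
```

A sequence that passes this filter is only a candidate; motif matches of this length occur frequently by chance.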

The four core proteins associated with box H/ACA RNAs are Cbf5 (dyskerin in human and NAP57 in rodents), Nhp2 (L7Ae in archaea), Gar1, and Nop10 (Yu and Meier, 2014). Cbf5 is the enzymatic component of the RNP and catalyzes the U-to-ψ isomerization reaction (**Figure 2A**). Structures of the enzymes from various species show a high degree of evolutionary conservation, especially in the PUS and Archaeosine transglycosylase (PUA) domain (Hamma et al., 2005; Manival et al., 2006; Rashid et al., 2006). The other three core proteins are also essential, and depletion of any of them in yeast, with the exception of Gar1, causes the loss of all H/ACA RNAs (Girard et al., 1992; Bousquet-Antonelli et al., 1997).

Facilitated by the development of *in vitro* systems for the reconstitution of enzymatically active RNP complexes, the crystal structure of the box H/ACA RNP was first solved using archaeal components (Li and Ye, 2006). These experiments demonstrated that L7Ae, Nop10, and the catalytic domain of Cbf5 bind to the upper stem of the guide RNA, whereas the PUA domain of Cbf5 anchors the lower stem and ACA motif. The substrate RNA is recruited and the target uridine is precisely placed within the active site *via* complementary base-pairing interactions with the bipartite guides, while extensive protein interactions help to stabilize the interaction. In contrast to the other core proteins, Gar1 does not physically interact with the box H/ACA guide RNA or substrate RNA. Instead, Gar1 interacts directly with the thumb loop of Cbf5 and participates in regulating substrate turnover. Studies in yeast have revealed that the structure of the eukaryotic box H/ACA RNP is highly similar to the archaeal one, albeit with exceptions, in particular the independence of the RNP's activity from Nhp2 binding and a novel C-terminal extension in Gar1, which interacts with Cbf5. It is hypothesized that these functional and structural differences reflect the evolutionary adaptations of the eukaryotic box H/ACA RNP to the variable RNA structure and the moderate temperature range in which eukaryotes live, respectively (Li et al., 2011).

#### RNA-INDEPENDENT PSEUDOURIDYLATION

RNA-independent pseudouridylation in eukaryotes acts through standalone enzymes called PUS enzymes. In contrast to the box H/ACA RNP-based mechanism, PUS enzymes carry out both substrate recognition and the internal transglycosylation reaction. Substrate recognition is achieved *via* consensus sequences and/or secondary structure elements of the substrate RNA (**Figure 2B**). In eukaryotes, there are 10 different PUS enzymes, numbered PUS1 through PUS10. These are classified into five families (TruA, TruB, TruD, RluA, and PUS10) based on their bacterial counterparts (Hamma and Ferre-D'Amare, 2006). Although the primary sequences have diverged, all PUSs, including Cbf5, share a conserved catalytic domain and likely a conserved catalytic mechanism based on the solved crystal structure (Foster et al., 2000; Hoang and Ferre-D'Amare, 2001, 2004; Sivaraman et al., 2002, 2004; Del Campo et al., 2004; Ericsson et al., 2004; Kaya et al., 2004; Mizutani et al., 2004; Hoang et al., 2006; McCleverty et al., 2007). This domain structure is composed predominately of anti-parallel β-sheets, with one face decorated by two groups of α-helices and loops. A forefinger–thumb structure formed by these loops pinches the target RNA, while the strictly conserved catalytic aspartate residue participates in the enzymatic reaction.

Unlike the box H/ACA RNPs, which have been found only to reside within the nucleus, PUS enzymes have been found in the nucleus, cytoplasm, and mitochondria. Each PUS enzyme targets either one specific or multiple uridines in many RNA species, including snRNAs, rRNAs, and tRNAs in both cytoplasm and mitochondria.

#### PSEUDOURIDYLATION IS INDUCIBLE

Until relatively recently, RNA modifications, including pseudouridylation, were considered constitutive. Wu et al. (2011) provided the first evidence that RNA modifications are inducible, demonstrating that yeast U2 snRNA is conditionally pseudouridylated when cells are subjected to nutrient deprivation or heat shock. In addition to the three constitutive pseudouridines (ψ35, ψ42, and ψ44) of yeast U2 snRNA, two novel pseudouridines (ψ56 and ψ93) were detected in stressed cells. Further detailed analyses revealed that both the RNA-independent PUS (PUS7 catalyzes ψ56 formation) and the box H/ACA RNP-dependent [small nucleolar RNA 81 (snR81)-guided box H/ACA RNP catalyzes ψ93 formation] modification machineries are involved in inducible pseudouridylation. Interestingly, both ψ56 and ψ93 are "imperfect" substrates for their respective modification machineries. The sequences flanking positions U56 and U93 in U2 snRNA are similar, but not identical, to the sequences surrounding the constitutively pseudouridylated targets of PUS7 and snR81, respectively. For example, the 3' pocket of snR81 base pairs with the sequence surrounding nt U93 of U2 snRNA with two mismatches. In addition, ψ56 formation is mediated by PUS7 engaging a substrate whose sequence differs by three nts from the consensus PUS7 recognition sequence (**Figure 2C**). Inducible pseudouridylation is also functionally relevant, as demonstrated by the observation that the artificial introduction of ψ93 reduces the efficiency of pre-mRNA splicing. Interestingly, it was recently shown that the TOR-signaling pathway regulates ψ93 formation (Wu et al., 2016b).

Inducible pseudouridylation of other snRNAs has been reported subsequently. U6 snRNA is inducibly pseudouridylated at U28 by PUS1 during the yeast filamentous growth program (Basak and Query, 2014). Further analysis of mutants indicates that U6–ψ28 is functionally relevant, as all U6 snRNA mutations that resulted in strong pseudouridylation at position U28 also exhibited a pseudohyphal growth phenotype, whereas blocking U6–ψ28 formation prevents filamentous growth.

Recently, transcriptome-wide mapping of pseudouridines in yeast and human cells has revealed their presence in mRNA (Carlile et al., 2014; Lovejoy et al., 2014; Schwartz et al., 2014). While mRNA pseudouridylation appears to be primarily catalyzed by the standalone PUSs (PUS1–PUS4, PUS6, PUS7, and PUS9), the formation of several pseudouridine residues is catalyzed by box H/ACA RNPs. Remarkably, mRNA pseudouridylation was also found to be highly inducible in a stress-specific manner. For example, by comparing mRNA ψ profiles of untreated cells to those of cells exposed to heat shock or H<sub>2</sub>O<sub>2</sub>, it was found that the inducible pseudouridylation profiles were largely nonoverlapping (Li et al., 2015).

### FUNCTION OF ncRNA PSEUDOURIDYLATION IN TRANSCRIPTION

Transcription is the primary control point for gene expression. It therefore determines cellular function and cell identity and is subjected to tight regulation to achieve a high degree of specificity and efficiency. The eukaryotic DNA template is packaged by histone proteins into a highly condensed structure called chromatin. The chromatin structure is dynamically regulated by both histone modifications and chromatin-remodeling factors (Narlikar et al., 2002). Promoters contain elements that bind to transcriptional activators and repressors, as well as the transcription machinery (Smale and Kadonaga, 2003; Kadonaga, 2004). RNA Pol II is the enzyme that catalyzes the transcription of mRNA from DNA. Pol II is recruited to promoters by transcriptional activators in a holoenzyme form together with general transcription factors and a multiprotein complex called the Srb/Mediator (Bjorklund and Kim, 1996). Following transcription initiation, Pol II transitions to a productive elongation state through interactions with multiple elongation factors (Zhou et al., 2012). Given the central role of transcription in gene expression, it is not surprising that transcription is regulated at multiple steps. Here, we will discuss two pseudouridylated ncRNAs and their function in regulating RNA Pol II transcription (**Figure 3**).

#### STEROID RECEPTOR RNA ACTIVATOR (SRA) AND TRANSCRIPTION PREINITIATION

One layer of transcriptional control comes from the binding of activator and repressor proteins to the promoters of target genes in a sequence-specific manner (Smale and Kadonaga, 2003). Coactivators and corepressors, which interact with activators and repressors, are required to achieve optimal transcriptional regulation in cells (Kadonaga, 2004). The SRA was first identified as a transcriptional coactivator for several steroid-hormone receptors, including receptors for androgens (ARs), estrogens (ERs), glucocorticoids (GRs), and progestins (PRs) (Lanz et al., 1999). Interestingly, while it was initially presumed that a protein encoded by a specific 5'-spliced variant of the SRA gene was the functional factor, subsequent experiments demonstrated that the factor was an ncRNA.

Figure 3 | Functions of noncoding RNA (ncRNA) pseudouridylation during transcription. The secondary structures of ncRNAs are shown, and the pseudouridylation sites are denoted as a red star. The green arrow and red-blocking arrow highlight the ncRNAs that are known to regulate polymerase II (Pol II) transcription, the factors they target and whether the effect of the ncRNA on transcription is stimulatory or inhibitory, respectively. HIV-1 LTR, human immunodeficiency virus-1 long terminal repeat; HRE, hormone response element; HIV-1 TAR, human immunodeficiency virus-1 transactivation response; NHR, nuclear hormone receptor; P, phosphoryl group; P-TEFb, positive transcription–elongation factor-b; SRA, steroid receptor RNA activator; SRC-1, steroid receptor coactivator-1.

Steroid receptor RNA activator operates as part of a ribonucleoprotein complex containing steroid receptor coactivator-1 (SRC-1), which is an AF-2 coactivator (Lanz et al., 1999). Computer-assisted modeling suggests that SRA adopts a highly complex secondary structure containing 11 topological substructures (STRs). Mutagenesis of each STR indicated that 5 of 11 STRs are required for SRA to coactivate transcription, and STR7 is the most important one for SRA function (Lanz et al., 2002).

In a study to identify coactivators for retinoic acid receptor (RAR) in mouse S91 melanoma cells, mPUS1p was unexpectedly identified. In addition, SRA turned out to be a substrate of mPUS1p (Zhao et al., 2004). Using chromatin immunoprecipitation, RAR, PUS1p, and SRA were found to co-occupy the retinoic acid response promoter in a ligand-independent complex. PUS1-mediated pseudouridylation of SRA promotes the formation of an "active" structure and aids in establishing the transcription preinitiation complex upon ligand binding. Further supporting a role of SRA pseudouridylation in transcriptional regulation, mutations in PUS1p that disrupt its interaction with RAR or its pseudouridylation activity attenuate the activation of RAR-dependent transcription. mPUS1p also significantly augmented transactivation by other nuclear receptors (NRs) including thyroid hormone receptor (TR), GR, AR, PR, and ER, illustrating that this mechanism likely applies universally to the regulation of NR-dependent transcription.

In addition to mPUS1p, mPUS3p also modifies SRA and serves as an NR coactivator (Zhao et al., 2007). Unlike mPUS1p, mPUS3p does not enhance sex steroid receptor activity, suggesting that substrate-site specificity may have distinct roles. Indeed, *in vitro* mPUS1p and mPUS3p generally modify different positions in SRA, with a few positions commonly targeted. Intriguingly, the order of modification of SRA by mPUS1p and mPUS3p determines the positions within SRA that are required to be pseudouridylated. However, it is important to note that the only *in vivo*-pseudouridylated site identified in SRA is U206. Interestingly, a U206A mutation, which promotes hyperpseudouridylation of SRA *in vitro*, switches SRA from a coactivator to a molecule with dominant-negative activity *in vivo*.

Pseudouridylation of SRA by mPUS1p and mPUS3p is a highly complex posttranscriptional mechanism that controls a coactivator–corepressor switch in SRA with major consequences for NR signaling (Zhao et al., 2007). Unexpectedly, pseudouridylation of SRA occurs in the stem-loop structure STR5 (at position U206), whose secondary structure was not shown to be important for SRA function. Moreover, the thermodynamic and secondary structure differences between STR5 in hSRA-WT and hSRA-U206A are relatively minor, which further suggests that secondary structure remodeling is unlikely to explain the large biochemical and functional effects observed. Instead, it is proposed that ψ206 stabilizes stems I and II of STR5 in a higher-order conformation through the base-stacking-enhancing properties of ψ, thereby masking new sites and preventing hyperpseudouridylation by mPUS1p and mPUS3p. This in turn may interfere with the binding of SRA to other proteins that define its function as a scaffold for both repressors and activators.

The physiological importance of PUS1p-mediated SRA pseudouridylation is illustrated by a disorder known as mitochondrial myopathy and sideroblastic anemia (MLASA). It is caused by an inactivating mutation in human PUS1p (Bykhovskaya et al., 2004). Some abnormalities in these patients, such as facial dysmorphisms, are suggested to be the consequence of defective hSRA–NR signaling (Fernandez-Vizarra et al., 2007). In addition, ERs and ARs are important targets for cancer therapy. Given the importance of SRA in NR-transcriptional regulation, coupled with the functional significance of PUS1p-mediated SRA pseudouridylation, recent work has suggested that the disruption of SRA pseudouridylation could serve as a novel RNA-based cancer therapeutic (Ghosh et al., 2012).

#### 7SK snRNA AND TRANSCRIPTION ELONGATION

For the past three decades, most attention has focused on the early stages of the transcription cycle involving the recruitment of Pol II to gene promoters and assembly of active preinitiation complexes, which were thought to be the principal points where transcription was controlled (Kuras and Struhl, 1999; Ptashne, 2005). In recent years, however, accumulating evidence indicates that the subsequent stages of the transcription cycle are also highly regulated (Guenther et al., 2007; Muse et al., 2007). Notably, the promoter-proximal pausing and release of Pol II has been identified as a major rate-limiting step for controlling the expression of many metazoan genes and plays a critical role in cell growth, renewal, and differentiation (Levine, 2011; Zhou et al., 2012).

Shortly after initiation, Pol II is paused at a promoter-proximal region by the negative elongation factors NELF and DSIF, resulting in a short nascent transcript ~12 nts in length. Promoter-proximally paused Pol II is the major form of Pol II found on metazoan chromosomes and is poised for entry into productive elongation. The positive transcription elongation factor-b (P-TEFb) is the major factor required to overcome this restriction. P-TEFb, composed of cyclin-dependent kinase 9 (CDK9) and cyclin T1, phosphorylates and thereby antagonizes the inhibitory actions of NELF and DSIF, triggering the release of Pol II from promoter-proximal pausing. In addition, P-TEFb phosphorylates the C-terminal domain of the largest subunit of Pol II, which then serves as a platform for assembling key transcription and RNA-processing factors that promote transcriptional elongation, cotranscriptional processing of pre-mRNA, and, lastly, termination (Zhou et al., 2012).

Under normal growth conditions, up to 90% of cellular P-TEFb is sequestered in an inactive complex called the 7SK RNP (Yang et al., 2001). Within this complex, a noncoding RNA, namely 7SK snRNA, functions as a scaffold and mediates the interaction of P-TEFb with HEXIM1 or -2 and thus inhibits CDK9's kinase activity (Yik et al., 2003). 7SK snRNA, transcribed by RNA Pol III, is an abundant noncoding RNA and highly conserved in higher eukaryotes (Wassarman and Steitz, 1991). Although its levels remain relatively constant, several genome-wide studies suggested that 7SK was pseudouridylated (Kishore et al., 2013; Carlile et al., 2014). Indeed, site-specific and quantitative pseudouridylation assays demonstrated that up to 94% of 7SK snRNA in HeLa cells is pseudouridylated at residue U250. Although the guide RNA has not been identified, it is clear that the box H/ACA RNP machinery catalyzes this modification as the depletion of DKC1 significantly reduces 7SK snRNA pseudouridylation. In addition, 7SK snRNA pseudouridylation was demonstrated to play a critical role in regulating the formation of the 7SK-P-TEFb snRNP, as mutation of U250, or depletion of DKC1, reduced the binding of CDK9 and HEXIM to 7SK snRNA (Zhao et al., 2016).

The identification of a key role for pseudouridylation in 7SK snRNP stability has potential clinical relevance. For instance, one emerging strategy in curing human immunodeficiency virus (HIV) infection is to "shock," or reactivate, the latent HIV reservoirs for their subsequent "kill" by Highly Active Anti-Retroviral Therapy (Richman et al., 2009; Deeks, 2012). Along this line, Zhao et al. demonstrated that reducing 7SK snRNA pseudouridylation destabilized the 7SK snRNP and activated Tat-dependent HIV-1 transcription. Furthermore, reduction in 7SK snRNA pseudouridylation in combination with latency reversal agents (LRAs) significantly increased the reversal of HIV latency, implicating the DKC1-box H/ACA RNP as a promising new target to eradicate latent viral reservoirs (Zhao et al., 2016). It will be interesting to determine whether the widely used chemotherapeutic agent 5-fluorouracil, which is an inhibitor of PUS enzymes, can act synergistically with current LRAs to further enhance the reversal of HIV latency.

#### FUNCTION OF SPLICEOSOMAL snRNA PSEUDOURIDYLATION IN PRE-mRNA SPLICING

Most genes in eukaryotes are not transcribed as mature mRNA but rather as pre-mRNA, which contains coding exons as well as noncoding introns. Pre-mRNA splicing is therefore needed to remove intronic sequences and assemble exons into mature mRNA (Green, 1986; Newman, 1994). Splicing is catalyzed by the spliceosome, a very large complex consisting of snRNAs and numerous protein components (Wahl et al., 2009; Matera and Wang, 2014). The major spliceosome contains five snRNAs (U1, U2, U4, U5, and U6) that participate in the splicing reaction as snRNP complexes.

Pre-mRNA splicing is initiated by the recognition of the 5'-splice site (5'-SS) by the U1 snRNP *via* complementary base-pairing interactions (Kramer et al., 1984; Bindereif and Green, 1987; Ruby and Abelson, 1988; Seraphin and Rosbash, 1989). The branch-site sequence is then engaged by the U2 snRNP *via* complementary base-pairing interactions, resulting in the bulging out of the branch point adenosine of the pre-mRNA and the formation of the spliceosomal complex A (Zhuang and Weiner, 1989; Michaud and Reed, 1991; Wassarman and Steitz, 1992). Subsequently, the tri-snRNP, a complex of U4 snRNP, U6 snRNP, and U5 snRNP, is recruited, creating a fully assembled spliceosome (complex B1). Following a series of RNA–RNA rearrangements, U1 and U4 snRNPs are destabilized and released, resulting in the formation of an active spliceosome (complex B2) (Sawa and Abelson, 1992; Lesser and Guthrie, 1993). This complex catalyzes the first step of pre-mRNA splicing, in which the 2'-OH group of the bulged-out branch point adenosine nucleophilically attacks the phosphate at the 5'-SS. The first-step reaction generates a lariat 2/3 intermediate and a cutoff 5' exon intermediate. After additional conformational changes, complex B2 is converted to complex C, and the second step of splicing is catalyzed, resulting in the production of mature mRNA and lariat intron products. The U2, U5, and U6 snRNPs are recycled for new rounds of pre-mRNA splicing (**Figure 4**) (Burge et al., 1999).
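The assembly pathway described above can be read as an ordered series of snRNP recruitment and release events. The following Python sketch is purely illustrative bookkeeping: the `STEPS` table and the `assemble` helper are hypothetical names introduced here, and the model tracks only which snRNPs are present at each named stage, not kinetics, protein factors, or the catalytic steps themselves.

```python
# Illustrative bookkeeping of snRNP composition during major spliceosome assembly.
# Stage names follow the text; the data structure itself is an assumption.
STEPS = [
    # (stage, snRNPs that join, snRNPs that are released)
    ("5'-SS recognition", {"U1"}, set()),
    ("complex A", {"U2"}, set()),
    ("complex B1", {"U4", "U5", "U6"}, set()),  # tri-snRNP recruitment
    ("complex B2", set(), {"U1", "U4"}),        # release after RNA-RNA rearrangements
]


def assemble(steps):
    """Return a list of (stage, sorted snRNPs present) after each stage."""
    present = set()
    history = []
    for stage, joins, leaves in steps:
        present = (present | joins) - leaves
        history.append((stage, sorted(present)))
    return history


if __name__ == "__main__":
    for stage, snrnps in assemble(STEPS):
        print(f"{stage}: {', '.join(snrnps)}")
```

In this sketch, the snRNPs remaining at complex B2 are the same three (U2, U5, and U6) that the text describes as being recycled after splicing.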

Interestingly, all of the major spliceosomal snRNAs are posttranscriptionally pseudouridylated (**Figure 5A**). U2 snRNA is the most extensively pseudouridylated snRNA, and unsurprisingly investigations into snRNA pseudouridylation have primarily focused on U2 snRNA. The functional study of U2 snRNA pseudouridylation was initiated in the early 1990s by Patton, who prevented U2 snRNA pseudouridylation in HeLa cell S100 and nuclear extracts by the incorporation of 5-fluorouridine (5-FU) (Patton, 1993a,b). Although 5-FU-substituted U2 snRNA was able to form a U2 snRNP, the snRNP was more vulnerable to dissociation by salt.

As experimental systems and assays developed, a more detailed description of the effects of U2 snRNA pseudouridylation on pre-mRNA splicing emerged. For example, Yu et al. demonstrated that *in vitro*-transcribed U2 snRNA, which lacks modification, was unable to rescue a splicing defect in U2 snRNA-depleted oocytes; however, following prolonged reconstitution periods, U2 snRNA was pseudouridylated and able to reconstitute splicing activity. In addition, anti-snRNP immunoprecipitation coupled with glycerol-gradient sedimentation demonstrated that U2 snRNA lacking pseudouridine was unable to form functional 17 S U2 snRNP. Thus, a good correlation between modification status, U2 snRNP biogenesis, and pre-mRNA splicing was established. In addition, by creating chimeric U2 snRNAs between cellular-derived and *in vitro*-transcribed U2, Yu et al. demonstrated that the functionally important modifications primarily resided within the first 27 nts of U2 snRNA (Yu et al., 1998).

Pseudouridylation of residues within the branch-site recognition region is also functionally important for pre-mRNA splicing. Zhao and Yu demonstrated that pseudouridine residues within the branch-site recognition region of Xenopus U2 snRNA are required for U2 snRNP assembly and spliceosome assembly (Zhao and Yu, 2007). In addition, an NMR structure of the yeast U2 snRNA:pre-mRNA branch-site helix demonstrated that ψ35 induces a dramatic structural alteration, which is required for the bulging out of the branch point adenosine and nucleophilic attack on the 5'-SS (Newby and Greenbaum, 2002). Consistent with this, a yeast knockout of PUS7, which catalyzes ψ35 of U2 snRNA, exhibited reduced fitness under conditions of high-salt media, or when in competition with a wild-type strain (Ma et al., 2003). The other pseudouridines within this region, ψ42 and ψ44, are also functionally relevant (Wu et al., 2016a). The deletion of both SNR81, responsible for ψ42, and PUS1, which catalyzes ψ44 formation, reduces the efficiency of pre-mRNA splicing, leading to a growth defect. Further genetic and biochemical analyses demonstrated that U2 snRNA ψ42 and ψ44 facilitate the interaction with Prp5, a U2-dependent ATPase known to play an important role in monitoring U2-intron branch-site interactions at early stages of spliceosome assembly (Xu and Query, 2007). Furthermore, Prp5 has reduced ATPase activity on U2 snRNA lacking ψ42 and ψ44, suggesting that these modifications regulate Prp5 enzymatic activity. Collectively, these data indicate that pseudouridylation within the branch-site recognition region plays a role not only in the biogenesis of functional snRNP but also in spliceosome assembly by influencing the enzymatic activity of Prp5.

In contrast to U2 snRNA, pseudouridylations within the other spliceosomal U snRNAs have received much less attention. However, this is not to say that they have not been investigated. For instance, the functional importance of two ψs within the 5' end of U1 snRNA has been investigated. Adopting an *in vitro*-splicing system in which two 5'-SS are in competition with each other, Roca et al. suggested that U1 snRNA pseudouridylation participates in 5'-SS discrimination (Roca et al., 2005). In addition, Freund et al. demonstrated that a ψ–G base pair, between the U1 snRNA and the substrate pre-mRNA, respectively, stabilized the U1 snRNA interaction with the 5'-SS of HIV-1 RNA (Freund et al., 2003). Lastly, as described earlier, the inducible pseudouridylation of U6 snRNA at U28 during the yeast filamentous growth program is also functionally relevant, as shown by differential growth phenotypes that are dependent on its pseudouridylation (Basak and Query, 2014).

#### CONCLUSION

Remarkable progress has been made toward elucidating the mechanism and function of RNA pseudouridylation in various cellular processes. However, a function for ncRNA pseudouridylation in transcriptional regulation has only just begun to emerge. Our limited understanding of the impact of pseudouridylation on transcription is partially due to the limited number of ncRNAs that were known to be pseudouridylated. However, recent efforts of transcriptome-wide mapping of RNA pseudouridylation have greatly expanded the catalog of known pseudouridylated RNAs and have identified novel modification sites within ncRNAs participating in transcriptional regulation.

For instance, metastasis-associated lung adenocarcinoma transcript 1 (MALAT1), which is involved in the epigenetic modulation of gene expression, as well as alternative splicing, is pseudouridylated at two distinct sites (Carlile et al., 2014). How these modifications contribute to MALAT1 function remains unclear and is certainly worth investigating. As more ψs in ncRNA transcriptional regulators are identified, a better understanding of how these ncRNAs impact eukaryotic mRNA transcription will be achieved. Since many of these ncRNAs are associated with diseases, determining the functional impact of their pseudouridylation may open the door to novel therapeutic strategies.

Although the mechanism and function of snRNA pseudouridylation in splicing are comparatively well studied, there are still many unanswered questions. For example, in contrast to U2 snRNA, the functional significance of pseudouridylation within U1, U4, U5, and U6 is not clear. In addition, within higher eukaryotes, there exists a second spliceosome, the minor spliceosome, which consists of U5, in addition to four distinct U snRNAs, U11, U12, U4atac, and U6atac, and all of them are pseudouridylated (**Figure 5B**). Interestingly, the positions of minor spliceosomal snRNA pseudouridylation are homologous to those within major spliceosomal snRNAs, suggesting their importance in minor intron splicing (Massenet and Branlant, 1999). Detailed functional analysis of these pseudouridylations is required if we seek to understand the mechanism of minor spliceosome biogenesis and minor intron splicing. In addition, these studies may provide a better understanding of how introns are selectively recognized by the two distinct spliceosomes.

### AUTHOR CONTRIBUTIONS

All authors contributed to writing the manuscript.

#### FUNDING

The work presented was supported by grant GM104077 from NIH (to Y-TY) and by start up funds from Vanderbilt University Medical Center (to JK).

#### REFERENCES

Deeks, S. G. (2012). HIV: shock and kill. *Nature* 487, 439–440. doi:10.1038/487439a

Del Campo, M., Ofengand, J., and Malhotra, A. (2004). Crystal structure of the catalytic domain of RluD, the only rRNA pseudouridine synthase required for normal growth of *Escherichia coli*. *RNA* 10, 231–239. doi:10.1261/rna.5187404


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2018 Zhao, Dunker, Yu and Karijolich. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## 3′ RNA Uridylation in Epitranscriptomics, Gene Regulation, and Disease

#### Miriam R. Menezes†, Julien Balzeau† and John P. Hagan\*

*Department of Neurosurgery, University of Texas Health Science Center at Houston, Houston, TX, United States*

#### Edited by:

*Mario Acunzo, Virginia Commonwealth University, United States*

#### Reviewed by:

*Fabio Iannelli, IFOM, The FIRC Institute of Molecular Oncology, Italy*

*Michael Poidinger, Singapore Immunology Network (A\*STAR), Singapore*

*Tongjun Gu, University of Florida, United States*

> \*Correspondence: *John P. Hagan john.p.hagan@uth.tmc.edu*

*†These authors have contributed equally to this work and are co-first authors.*

#### Specialty section:

*This article was submitted to RNA, a section of the journal Frontiers in Molecular Biosciences*

> Received: *04 February 2018* Accepted: *14 June 2018* Published: *13 July 2018*

#### Citation:

*Menezes MR, Balzeau J and Hagan JP (2018) 3′ RNA Uridylation in Epitranscriptomics, Gene Regulation, and Disease. Front. Mol. Biosci. 5:61. doi: 10.3389/fmolb.2018.00061*

Emerging evidence implicates a wide range of post-transcriptional RNA modifications in fundamental biological processes, including the regulation of gene expression. Collectively, they are known as epitranscriptomics. Recent studies implicate 3′ RNA uridylation, the non-templated addition of uridine(s) to the terminal end of RNA, as a key player in epitranscriptomics. In this review, we describe the functional roles and significance of 3′ terminal RNA uridylation, which has diverse functions in regulating both mRNAs and non-coding RNAs. In mammals, three Terminal Uridylyl Transferases (TUTases) are primarily responsible for 3′ RNA uridylation. These enzymes are also referred to as polyU polymerases. TUTase 1 (TUT1) is implicated in U6 snRNA maturation via uridylation. The TUTases TUT4 and/or TUT7 are the predominant mediators of all other cellular uridylation. Terminal uridylation promotes turnover for many polyadenylated mRNAs, replication-dependent histone mRNAs that lack polyA-tails, and aberrant structured noncoding RNAs. In addition, uridylation regulates biogenesis of a subset of microRNAs and generates isomiRs, sequence variant microRNAs that have altered function in specific cases. For example, the RNA binding protein and proto-oncogene LIN28A and TUT4 work together to polyuridylate pre-let-7, thereby blocking biogenesis and function of the tumor suppressor let-7 microRNA family. In contrast, monouridylation of Group II pre-miRNAs creates an optimal 3′ overhang that promotes recognition and subsequent cleavage by the Dicer-TRBP complex, which then yields the mature microRNA. Also, uridylation may play a role in non-canonical microRNA biogenesis. The overall significance of 3′ RNA uridylation is discussed with an emphasis on mammalian development, gene regulation, and disease, including cancer and Perlman syndrome. We also introduce recent changes to the HUGO-approved gene names for multiple terminal nucleotidyl transferases that affect, in part, TUTase nomenclature (TUT1/TENT1, TENT2/PAPD4/GLD2, TUT4/ZCCHC11/TENT3A, TUT7/ZCCHC6/TENT3B, TENT4A/PAPD7, TENT4B/PAPD5, TENT5A/FAM46A, TENT5B/FAM46B, TENT5C/FAM46C, TENT5D/FAM46D, MTPAP/TENT6/PAPD1).

Keywords: RNA epitranscriptomics, 3′ terminal RNA uridylation, TUTase, LIN28/let-7 pathway, DIS3L2, cancer, Perlman syndrome, Wilms tumor

### INTRODUCTION

Epitranscriptomics refers to a diverse set of RNA chemical modifications and post-transcriptional nucleotide additions that play central roles in pre-mRNA splicing, translation, and regulation of gene expression. For example, ribosomal and transfer RNAs undergo extensive chemical modifications that are required for function through stabilization of their RNA secondary structure, while non-templated nucleotide additions are critical for polyadenylated mRNAs and aminoacylation of tRNAs. Recent evidence implicates 3′ RNA uridylation, the addition of non-templated uridine(s) to the RNA end, in several biological processes. This review presents our current understanding of the biochemical and functional roles of 3′ terminal RNA uridylation with a focus on mammalian biology. Over the past decade, 3′ RNA uridylation has emerged as functionally significant for multiple RNA types such as mRNAs, microRNAs, and structured non-coding RNAs. Terminal Uridylyl Transferases (TUTases), also known as polyU polymerases, are the enzymes responsible for 3′ RNA uridylation.

#### NON-CANONICAL TERMINAL RIBONUCLEOTIDYL TRANSFERASES

TUTases fall within a class of seven non-canonical terminal ribonucleotidyl transferases that contain a DNA polymerase β-like nucleotidyltransferase domain (**Figure 1**). Unfortunately, each member of the non-canonical polymerase family that includes TUTases has multiple names, creating considerable confusion in relation to their ribonucleotidyl specificity (**Table 1**). Here, we refer to each enzyme by its HUGO-approved name; this nomenclature was recently updated by the Human Gene Nomenclature Committee to address several concerns. Two genes, TUT4 (aka ZCCHC11) and TUT7 (aka ZCCHC6), encode terminal uridylyl transferases that are primarily cytoplasmic and are quite similar structurally, having arisen from a gene duplication event. These enzymes interact transiently with RNA and are slow polymerases (e.g., the TUT4 uridylation rate is ∼0.2 nucleotides per second; Yeom et al., 2011). As such, they typically add one or very few uridines to their RNA substrates, except in cases where TUTase interaction with RNA is stabilized by other factors such as the RNA binding protein LIN28A. TUT1/STAR-PAP has dual specificity for UTP and ATP with respective *k*<sub>cat</sub> values of 0.059 and 0.002 s<sup>−1</sup> (Yamashita et al., 2017). The remaining non-canonical transferases primarily add non-templated adenosines. Phylogenetic and biochemical analyses reveal that the insertion of a histidine at the active site alters the ribonucleotidyl specificity of the polymerase from ATP to UTP (Munoz-Tello et al., 2012; Yates et al., 2015; Chung et al., 2016; Yamashita et al., 2017). Consistent with this idea, human TENT2 (aka GLD2/TUT2) lacks this critical histidine and prefers ATP over UTP by >80-fold (Chung et al., 2016). Furthermore, insertion of a histidine within the active site of human TENT2 converts the mutant protein to a TUTase. In this review, we describe in detail the functional roles of three mammalian TUTases: TUT1, TUT4, and TUT7. Human TUT4 and TUT7 have numerous roles in regulating both mRNAs and non-coding RNAs. TUT1 has dual functions: it is critical for uridylation and maturation of the spliceosomal U6 snRNA in the nucleus and for the adenylation of select mRNAs, including HO-1 (Heme Oxygenase 1), BIK (BCL2-interacting killer), PTEN, WIF1, and CDH1 (Gonzales et al., 2008; Mellman et al., 2008; Li et al., 2012, 2017; Kandala et al., 2016; Sudheesh and Laishram, 2017). Lastly, we will discuss the roles of TUTases in mammalian development, physiology, and diseases including cancer and Perlman Syndrome.
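To put the kinetic values quoted above in perspective, the short Python sketch below works through the arithmetic. It is a back-of-the-envelope calculation only: it assumes a constant addition rate and uninterrupted enzyme-RNA contact, which the text notes is not the usual situation for TUT4 without stabilizing factors. The ratio of turnover numbers compares *k*<sub>cat</sub> values alone and should not be read as catalytic efficiency. The function names are illustrative.

```python
# Values quoted in the text; everything else in this block is illustrative.
TUT4_RATE_NT_PER_S = 0.2   # approximate TUT4 uridylation rate (Yeom et al., 2011)
TUT1_KCAT_UTP = 0.059      # s^-1 (Yamashita et al., 2017)
TUT1_KCAT_ATP = 0.002      # s^-1 (Yamashita et al., 2017)


def seconds_to_add(n_uridines, rate_nt_per_s=TUT4_RATE_NT_PER_S):
    """Idealized time (s) to add n uridines at a constant rate."""
    if rate_nt_per_s <= 0:
        raise ValueError("rate must be positive")
    return n_uridines / rate_nt_per_s


def kcat_ratio(kcat_a, kcat_b):
    """Fold difference between two turnover numbers (kcat only, not kcat/Km)."""
    if kcat_b <= 0:
        raise ValueError("kcat_b must be positive")
    return kcat_a / kcat_b


if __name__ == "__main__":
    print("one uridine:", seconds_to_add(1), "s")
    print("thirty uridines:", seconds_to_add(30), "s")
    print("TUT1 kcat(UTP)/kcat(ATP):", round(kcat_ratio(TUT1_KCAT_UTP, TUT1_KCAT_ATP), 1))
```

Under these assumptions a single uridine takes about 5 s and a 30-nucleotide tail about 150 s of continuous contact, which is consistent with the statement that transiently bound TUTases usually add only one or a few uridines.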

#### SPLICEOSOMAL U6 snRNA AND URIDYLATION

Transcription of eukaryotic protein coding genes, especially in multicellular organisms, generates a pre-mRNA that typically undergoes splicing to remove introns and join successive exons. The major spliceosome, a large ribonucleoprotein (RNP) complex, is required for splicing and contains five essential snRNPs (U1, U2, U4, U5, and U6). U6 small nuclear RNA (U6 snRNA) has a unique maturation mechanism (reviewed in Mroczek et al., 2012). As illustrated in **Figure 2**, U6 snRNA is the only known RNA substrate where uridylation occurs within the nucleus, and this uridylation is essential for its splicing function (Trippe et al., 1998, 2003, 2006). U6 snRNA is transcribed by RNA polymerase III (Kunkel et al., 1986), and transcription termination occurs within a short (∼5–6 nucleotide) oligo(dT) DNA stretch. The original U6 snRNA transcript contains four uridines at the 3′ end. After transcription and initial 3′ end formation, the chaperone-like La protein can bind this polyU tail, favoring stabilization of the U6 snRNA (Wolin and Cedervall, 2002). To continue the maturation process, La protein is removed and TUT1 adds up to 20 uridines to the polyU tail. This longer polyU tail is a signal for USB1, a 3′→5′ exoribonuclease, to remove uridines, leaving only five of them. USB1 then catalyzes the formation of a 2′,3′-cyclic phosphate at the 3′ end. This maturation process allows the LSm2-8 complex to bind the 3′ extremity, facilitates the proper assembly of mature snRNPs, and is important for nuclear retention (Licht et al., 2008).
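The sequence of 3′ end processing events described above can be followed with a small Python sketch. It only tracks the length of the terminal U-tract through the three steps named in the text; the function name, its parameters, and the `>p` suffix used here to mark the 2′,3′-cyclic phosphate are illustrative choices, not part of the source.

```python
def mature_u6_tail(initial_us=4, tut1_added=20, usb1_final=5):
    """Follow the 3' U-tract of U6 snRNA through the steps named in the text.

    Returns a list of (step, tail) pairs. '>p' is a notation chosen here
    to mark the 2',3'-cyclic phosphate left by USB1.
    """
    if not 0 <= tut1_added <= 20:
        raise ValueError("the text states that TUT1 adds up to 20 uridines")
    steps = []
    tail = "U" * initial_us                 # nascent Pol III transcript
    steps.append(("Pol III termination", tail))
    tail += "U" * tut1_added                # TUT1 extension after La removal
    steps.append(("TUT1 uridylation", tail))
    tail = tail[:usb1_final]                # USB1 3'->5' trimming (cannot lengthen)
    steps.append(("USB1 trimming", tail))
    steps.append(("cyclic phosphate", tail + ">p"))
    return steps


if __name__ == "__main__":
    for step, tail in mature_u6_tail():
        print(f"{step}: {tail}")
```

With the default values taken from the text (four initial uridines, up to 20 added, five remaining), the final entry is a five-uridine tail carrying the cyclic phosphate mark.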

#### MICRORNAS AND URIDYLATION

The following sections describe the biochemical functions that uridylation has in relation to microRNAs, a recently identified class of negative regulators of gene expression. MicroRNA expression is most frequently regulated transcriptionally; however, both microRNA biogenesis and the generation of isomiRs, sequence variant microRNAs, can be regulated by TUTases. Briefly, uridylation plays both positive and negative roles in regulating canonical microRNA biogenesis for a small, albeit important, set of microRNAs, most notably the tumor suppressor let-7 microRNA family. Uridylation is also implicated in two distinct non-canonical microRNA biogenesis pathways that differ in their reliance on Drosha and Dicer. Lastly, uridylation of mature microRNAs can generate isomiRs with altered activity.

*[Figure caption fragment]* … predominantly cytoplasmic proteins where they function in noncoding RNA quality control, mRNA turnover (both polyadenylated mRNAs and histone mRNAs), and regulation of let-7 microRNA biogenesis with LIN28A.

#### Canonical MicroRNA Biogenesis and Function

MicroRNAs are a class of small noncoding RNAs (∼19–23 nucleotides) that primarily function as negative regulators of their mRNA targets (reviewed in Hagan and Croce, 2007; Bartel, 2009; Lin and Gregory, 2015b; Balzeau et al., 2017). Mature microRNAs are generated through two step-wise cleavage reactions from the primary microRNA (pri-miRNA) transcripts. Pri-miRNAs are produced by RNA polymerase II (Cai et al., 2004; Lee et al., 2004) and to a much lesser extent by RNA polymerase III (Borchert et al., 2006). The microRNA host gene can be either a protein coding gene or a long non-coding RNA (lncRNA). For protein coding host genes, the microRNA lies almost exclusively in an intron, as expected, since microRNA processing would otherwise disrupt the mRNA. The few exceptions to this rule are several microRNAs that lie either in alternatively spliced exons or between alternative polyA sites. For host genes that are lncRNAs, microRNAs lie in both introns and exons at significant frequencies.

Canonical microRNA biogenesis occurs by two sequential endonucleolytic reactions to produce a mature, functional microRNA. The pri-miRNA is first cleaved by the nuclear RNase III enzyme Drosha in the Microprocessor complex (Lee et al., 2003; Denli et al., 2004; Gregory et al., 2004; Han et al., 2004). The resultant cleavage product is termed precursor microRNA (pre-miRNA) and has a characteristic hairpin structure of ∼50–75 nucleotides. The pre-miRNA is transported to the cytoplasm by the Exportin-5/Ran GTP complex (Yi et al., 2003; Bohnsack et al., 2004). In the cytoplasm, the pre-miRNA is processed by a second RNase III enzyme Dicer to release a small RNA duplex that is subsequently loaded onto Argonaute (Bernstein et al., 2001; Hutvagner et al., 2001; Ketting et al., 2001; Chendrimada et al., 2005; Haase et al., 2005). Typically, one strand is preferentially incorporated into the miRISC (microRNA-Induced Silencing Complex) where it base pairs with imperfect complementarity to the 3′ UTR of its target mRNAs, resulting in mRNA destabilization and/or translational inhibition. As such, microRNAs may potentially regulate 30% of all mammalian genes (Lewis et al., 2005; Lim et al., 2005).

#### Monouridylation Promotes Dicer Processing of Group II Precursor MicroRNAs

Addition of non-templated uridine(s) to the 3′ end of microRNAs is an important post-transcriptional modification that has been shown to impact activity and biogenesis of miRNAs. Specific pre-miRNAs are substrates for monouridylation and/or oligouridylation that can impact microRNA biogenesis positively or negatively, respectively. As the name indicates, monouridylation is the addition of a single non-templated uridine, while oligouridylation and polyuridylation refer to the addition of polyU-tails that can be as long as 30 nucleotides.
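The distinction drawn above between monouridylation and oligo- or polyuridylation can be expressed as a simple comparison of an observed RNA 3′ end against its templated sequence. The Python sketch below is a minimal illustration under a strong assumption: the exact templated 3′ end is known, so any extra 3′ nucleotides are non-templated (in real data, a terminal U that matches the genome cannot be distinguished this way). The function name and category labels are hypothetical.

```python
def classify_u_tail(templated, observed):
    """Classify the non-templated 3' extension of `observed` relative to `templated`.

    Both sequences are read 5'->3'; DNA-style 'T' is treated as 'U'.
    """
    templated = templated.upper().replace("T", "U")
    observed = observed.upper().replace("T", "U")
    if not observed.startswith(templated):
        return "not comparable"          # trimmed or mismatched read
    extension = observed[len(templated):]
    if not extension:
        return "unmodified"
    if set(extension) != {"U"}:
        return "other addition"          # e.g. adenylation or mixed tails
    if len(extension) == 1:
        return "monouridylated"
    return "oligo/polyuridylated"


if __name__ == "__main__":
    template = "GGAGAUAACUAUACAAUCUAC"   # arbitrary example sequence
    for read in (template, template + "U", template + "UUUUUUUU", template + "A"):
        print(classify_u_tail(template, read))
```

The text sets no numerical boundary between oligo- and polyuridylation, so the sketch keeps them in a single category.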

For the vast majority of microRNAs, nuclear Drosha cleavage of the pri-miRNA results in a pre-miRNA with a 2 nucleotide 3′ overhang that is ultimately recognized in the cytoplasm by the Dicer/TRBP complex that directs the second cleavage event (Zhang et al., 2004; Park et al., 2011). As depicted in **Figure 3**, Narry Kim's lab discovered through sequencing and bioinformatic analyses that a small number of pre-miRNAs are characterized by a single non-templated uridine at the 3′ end that creates a 2 nucleotide overhang (Heo et al., 2012). These atypical precursor microRNAs have been termed Group II and include most let-7 family members (7a-1, 7a-3, 7b, 7d, 7f-1, 7f-2, 7g, 7i, and miR-98) as well as miR-105. Monouridylation of these precursors promotes their subsequent Dicer processing and increases mature microRNA levels in vivo. Sequencing of pre-let-7 revealed that roughly 20% of all pre-let-7 microRNAs were monouridylated at their 3′ ends in HeLa cells, which lack expression of both LIN28A and LIN28B. In contrast, all other single nucleotide additions to pre-let-7 combined represent only 1%. Analysis of numerous deep sequencing libraries indicated that the majority of mature Group II let-7 microRNAs are likely processed from monouridylated precursors, as evidenced by their let-7-3p reads. In vitro uridylation assays identified three terminal ribonucleotidyl transferases (TUT4/ZCCHC11, TUT7/ZCCHC6, and TENT2/PAPD4/TUT2/GLD2) competent for monouridylating pre-let-7a-1, and each had a preference for pre-let-7a-1 with a 1 nt vs. 2 nt 3′ overhang. The NTP specificity was determined for these enzymes, revealing that TUT4 and TUT7 utilized UTP, while TENT2 could use UTP, GTP, or ATP. In vivo knockdowns performed singly or in combination revealed that these enzymes function redundantly in HeLa cells, where triple knockdowns resulted in the most significant loss of Group II mature let-7 and monouridylated pre-let-7. Among single knockdowns, TUT7 had the most prominent effect. Similar results were observed for miR-105 in the triple knockdowns. It remains unclear if TENT2 is an in vivo TUTase and whether its effects are mediated by monoadenylation rather than monouridylation, given the reported preference of TENT2 for ATP over UTP (Chung et al., 2016). These results highlight the surprising complexity of let-7 microRNA biogenesis, which is promoted by monouridylation but blocked by LIN28A-directed polyuridylation, as described in the next section.
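The overhang logic in this section can be condensed into a few lines of Python. The sketch below encodes only what the text states: a 2 nucleotide 3′ overhang is the form recognized by Dicer/TRBP, Group II precursors leave Drosha with a 1 nucleotide overhang, and a single added uridine restores the 2 nucleotide overhang. The label "Group I" for the typical 2 nucleotide class is not used in the text above and is assumed here for symmetry; the function names are likewise illustrative.

```python
OPTIMAL_OVERHANG_NT = 2  # 3' overhang recognized by Dicer/TRBP, per the text


def pre_mirna_group(drosha_overhang_nt):
    """Assign a precursor class from the 3' overhang left by Drosha cleavage."""
    if drosha_overhang_nt == 2:
        return "Group I"        # assumed label for the typical class
    if drosha_overhang_nt == 1:
        return "Group II"       # atypical class described in the text
    return "unclassified"


def overhang_after_uridylation(overhang_nt, added_uridines=1):
    """3' overhang length after non-templated uridine addition."""
    if added_uridines < 0:
        raise ValueError("added_uridines must be non-negative")
    return overhang_nt + added_uridines


def has_optimal_overhang(overhang_nt):
    """True when the overhang matches the 2-nt form favored for Dicer processing."""
    return overhang_nt == OPTIMAL_OVERHANG_NT


if __name__ == "__main__":
    for overhang in (2, 1):
        after = overhang_after_uridylation(overhang) if overhang == 1 else overhang
        print(pre_mirna_group(overhang), overhang, "->", after, has_optimal_overhang(after))
```

The same arithmetic also shows why longer tails are not equivalent: adding several uridines to either class moves the overhang away from 2 nucleotides, in line with the opposite outcome of polyuridylation discussed next.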

#### Polyuridylation of Select Precursor MicroRNAs Blocks Their Biogenesis

To date, the best-characterized example of 3′ RNA polyuridylation blocking microRNA biogenesis involves LIN28A-mediated repression of the let-7 microRNA family (**Figure 3**) (reviewed in Lee et al., 2016; Balzeau et al., 2017). Briefly, the developmentally regulated RNA binding protein and proto-oncogene Lin28A binds to the terminal loop of pre-let-7 where it recruits TUT4 to add a short polyU-tail to pre-let-7, blocking Dicer cleavage (Hagan et al., 2009; Heo et al., 2009; Piskounova et al., 2011). Polyuridylated pre-let-7 is rapidly degraded by the exonuclease Dis3L2 that recognizes the polyU-tail (Chang et al., 2013; Ustianenko et al., 2013). Let-7 microRNAs have gained considerable attention due to their prominent roles as tumor suppressors. Notably, loss of let-7 expression and overexpression of miR-21 are the microRNA alterations most commonly associated with poor clinical prognosis in cancers (Yang et al., 2010; Wang et al., 2011; Nair et al., 2012). Consistent with this finding, let-7 microRNAs act as tumor suppressors by negatively regulating numerous oncogenes such as MYC, RAS, HMGA2, YAP1, CDK6, CCND1, and BLIMP1. In cancer, reactivation of either LIN28A or LIN28B expression accounts for the widespread repression of the tumor suppressor let-7 microRNA family. The significance of the LIN28/let-7 pathway is further highlighted by the fact that LIN28A is a Thomson reprogramming factor whose expression on its own is necessary and sufficient to make induced pluripotent stem cells (Yu et al., 2007; Buganim et al., 2012).

Let-7 is an evolutionarily ancient microRNA family whose founding member was discovered in C. elegans as a heterochronic gene that regulates developmental timing during the transition from late larval to adult cell fates. In humans, there are 12 let-7 family members encoded on eight chromosomes. These microRNAs are quite abundant in most cell types and often account for >10% of the entire cellular microRNA population (Griffiths-Jones et al., 2006). In both embryonic stem cells and many cancer cells, multiple let-7 microRNA primary transcripts are actively generated; however, mature let-7 microRNA levels are remarkably low, due to post-transcriptional gene regulation mediated by LIN28 paralogs.

LIN28A and LIN28B are two closely related and developmentally regulated RNA binding proteins and are implicated directly in negatively regulating let-7 microRNA biogenesis (Heo et al., 2008, 2009; Newman et al., 2008; Rybak et al., 2008; Viswanathan et al., 2008; Hagan et al., 2009; Piskounova et al., 2011). These proteins contain two RNA binding domains whose interactions with the loop of immature let-7 are required for high affinity binding. Specifically, an N-terminal cold shock domain binds a stem-loop structure in the pre-E element while the C-terminal Zinc knuckles bind to a conserved GRAG in the 3′ of the loop (Newman et al., 2008; Piskounova et al., 2008; Nam et al., 2011). Let-7 microRNAs negatively regulate both LIN28A and LIN28B mRNAs by direct binding to their 3′ UTRs, thereby establishing a double negative feedback loop and self-enforcing binary switch. Although both LIN28 proteins were initially thought to block Microprocessor cleavage of nuclear pri-let-7, the subcellular localization of these proteins suggested an alternate model where LIN28 functions in the cytoplasm to block the conversion of pre-let-7 to its mature and functional form. Reports at the time showed that LIN28A is almost exclusively cytoplasmic (Balzer and Moss, 2007; Polesskaya et al., 2007), while LIN28B exists in the cytoplasm throughout the cell cycle with some nuclear accumulation during S/G2 phases (Guo et al., 2006).
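The "double negative feedback loop and self-enforcing binary switch" mentioned above can be illustrated with a generic mutual-repression model. The Python sketch below is a toy simulation, not a model taken from the literature cited here: the symmetric Hill-type equations, the parameter values (`a`, `n`), and the dimensionless units are all assumptions chosen only to show how two mutually repressing components settle into one of two stable states depending on the starting condition. The real system is not symmetric, since let-7 targets LIN28 mRNAs while LIN28 blocks let-7 maturation.

```python
def simulate_switch(let7_0, lin28_0, a=4.0, n=2, dt=0.01, steps=5000):
    """Forward-Euler integration of a symmetric mutual-repression toy model.

    d(let7)/dt  = a / (1 + lin28**n) - let7
    d(lin28)/dt = a / (1 + let7**n)  - lin28

    All quantities are dimensionless; a and n are illustrative assumptions.
    Returns the final (let7, lin28) levels.
    """
    let7, lin28 = float(let7_0), float(lin28_0)
    for _ in range(steps):
        d_let7 = a / (1.0 + lin28 ** n) - let7
        d_lin28 = a / (1.0 + let7 ** n) - lin28
        let7 += dt * d_let7
        lin28 += dt * d_lin28
    return let7, lin28


if __name__ == "__main__":
    # Differentiated-like start: slightly more let-7 than LIN28.
    print("let-7 biased start:", simulate_switch(2.0, 0.5))
    # Stem-cell-like start: slightly more LIN28 than let-7.
    print("LIN28 biased start:", simulate_switch(0.5, 2.0))
```

In this toy model, a modest initial bias toward either component is amplified until that component dominates and the other is held low, which is the qualitative behavior implied by a binary switch.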

Narry Kim's group and ours subsequently discovered that Lin28A recruits TUT4/Zcchc11, which adds an oligouridine tail to the 3′ end of pre-let-7, blocking Dicer cleavage and provoking the degradation of polyuridylated pre-let-7 (Hagan et al., 2009; Heo et al., 2009). To reach this conclusion, both groups used biochemical and cell culture studies to demonstrate that TUT4 is a TUTase and that knockdown of TUT4 elevated mature let-7 levels in Lin28A-expressing cells. Knockdown of no other terminal ribonucleotidyl transferase tested affected mature let-7 levels. Mechanistically, the Zinc Knuckle Domain (ZKD) of LIN28A and the N-terminal region of TUT4 (LIM: LIN28-Interacting Module) are thought to be critical for mediating their protein-protein interactions (Faehnle et al., 2017; Wang et al., 2017). The RNase II exonuclease DIS3L2, a gene whose germline mutation causes Perlman syndrome (Astuti et al., 2012; Higashimoto et al., 2013), has recently been implicated in the degradation of polyuridylated pre-let-7 (Chang et al., 2013; Ustianenko et al., 2013). As expected, Dis3l2 loss does not affect the levels of mature let-7 microRNAs in Lin28A-expressing cells, since polyuridylated pre-let-7 is no longer a viable Dicer substrate.

The Gregory lab has shown that Lin28A and Lin28B can promote uridylation of pre-let-7 in vitro by both TUT4 and TUT7, raising the possibility that the LIN28 paralogs can use alternate or redundant TUTases (Thornton et al., 2012). They reported that double knockdown of both TUT4 and TUT7 increased let-7 more than TUT4 knockdown alone in mouse embryonic stem (ES) and P19 embryonal carcinoma cells that express Lin28A (Thornton et al., 2012). Furthermore, this work confirmed earlier studies showing that TUT7 knockdown on its own does not elevate mature let-7 levels (Hagan et al., 2009; Heo et al., 2009). Our research demonstrated that LIN28A and LIN28B regulate let-7 microRNA biogenesis by distinct mechanisms with differential reliance on TUT4 (Piskounova et al., 2011). For LIN28B, the precise mechanism(s) responsible for let-7 regulation remain unresolved, even though in each scenario direct binding of LIN28B to the terminal loop of immature let-7 is involved. Four mechanisms separately or in combination may be important. Specifically, LIN28B may act in the nucleus to block Microprocessor cleavage of pri-let-7, may sequester pri-let-7 in the nucleolus, may block Dicer cleavage of pre-let-7 in the cytoplasm, and lastly, may work with the alternate TUTase TUT7 to block biogenesis. Since the subcellular localization of LIN28B appears cell type dependent, diverse mechanisms of action may be responsible (Guo et al., 2006; Hafner et al., 2010; Piskounova et al., 2011; Molenaar et al., 2012; Suzuki et al., 2015).

In addition to the LIN28A/let-7 pathway, polyuridylation of pre-miRNA has been proposed to repress other microRNAs. Initially, the Kim lab reported that miR-107, miR-143, and miR-200c are regulated by Lin28A and polyuridylation via TUT4 (Heo et al., 2009); however, their follow-up study argues against this conclusion (Cho et al., 2012). To define RNAs bound to Lin28A, they performed CLIP-Seq, a genome-wide method that incorporates UV crosslinking of live cells to capture RNAs bound to a protein of interest, which is subsequently enriched via immunoprecipitation. Following protease digestion, cDNA libraries corresponding to the co-purified bound RNA are subjected to high-throughput sequencing. Among pre-microRNAs that bind Lin28A, pre-let-7 was discovered but not pre-miR-107, pre-miR-143, or pre-miR-200c. Moreover, knockdown studies showed that Lin28A knockdown only upregulates mature let-7 microRNAs, confirming an earlier report (Hagan et al., 2009).

Polyuridylation has also been proposed to regulate miR-1 biogenesis in myotonic dystrophy patients (Rau et al., 2011); however, this work did not report in vivo TUTase loss-of-function or gain-of-function experiments. Myotonic dystrophy is caused by expansion of CTG and CCTG repeats in the DMPK and ZNF9 genes, respectively, that create aberrant RNAs that functionally sequester the splicing regulator MBNL1. In heart samples from myotonic dystrophy patients, microRNA profiling revealed that mature miR-1 levels were markedly reduced. In normal cardiac cells, the RNA binding protein MBNL1 binds to the terminal loop of pre-miR-1-1 and pre-miR-1-2 as determined by UV crosslinking. Rau and colleagues hypothesized that in myotonic dystrophy, MBNL1 is sequestered, thereby permitting LIN28 to bind to the pre-miR-1 terminal loop and block miR-1 maturation (Rau et al., 2011). In HeLa cells that do not express either LIN28 paralog, co-transfection experiments showed that LIN28A reduced mature miR-1 when pri-miR-1 was ectopically expressed. In vitro uridylation assays showed that Lin28A promotes uridylation of pre-miR-1 by TUT4. Loss-of-function of MBNL1 and Lin28B in H9C2 rat cardiomyoblasts decreased and increased mature miR-1 levels, respectively, consistent with them having antagonistic functions. In myotonic dystrophy patients, miR-1 loss is predicted to cause upregulation of its target genes such as GJA1 (Connexin 43) and CACNA1C (Cav 1.2), leading to dysregulation of important gap junction and calcium channel proteins in the heart.

#### Non-canonical MicroRNA Biogenesis

Uridylation also plays roles in non-canonical microRNA biogenesis. In contrast to canonical microRNA biogenesis, a small subset of microRNAs are made by alternative, noncanonical pathways that bypass the requirement for either Drosha or Dicer (reviewed in Miyoshi et al., 2010; Abdelfattah et al., 2014). Dicer-independent and Drosha-independent microRNAs are rare in mammals. The following sections review our current knowledge.

#### Ago2-Dependent and Dicer-Independent MicroRNAs

For canonical microRNAs, one strand of the microRNA duplex after Dicer cleavage typically is loaded onto an Argonaute protein (Ago1-4), leading to incorporation into the RNA-induced silencing complex (RISC). Although Ago1-4 are individually competent to bind and load microRNAs and siRNAs into RISC, Ago2 is unique in having slicer activity that is responsible for siRNA-mediated (and to a much lesser extent miRNA-directed) cleavage of their target mRNAs.

Research initially in zebrafish implicated miR-451 as a key erythropoiesis gene (Dore et al., 2008; Pase et al., 2009) whose function is evolutionarily conserved in mammals (Patrick et al., 2010; Rasmussen et al., 2010). In comparison to other microRNAs as shown in **Figure 4**, the conserved microRNA miR-451 is rather unusual, since Drosha cleavage yields a short (42 nt) hairpin with only a 17 nucleotide stem that is too short for Dicer cleavage (Siolas et al., 2005). In addition, the mature miR-451 sequence includes the entire hairpin loop. Research using zebrafish and mouse models ultimately led to the discovery that miR-451 biogenesis requires Ago2 rather than Dicer (Cheloufi et al., 2010; Cifuentes et al., 2010). In the zebrafish model, microRNA expression was interrogated in control, maternal-zygotic dicer mutants, and maternal-zygotic ago2 mutants at 48 h post fertilization (Cifuentes et al., 2010). Intriguingly, several microRNAs such as miR-451 and miR-735-5p were unaffected by dicer loss. In contrast, mature miR-451 expression was abolished in ago2 mutants. For miR-451, many reads in wild-type and dicer-deficient zebrafish cells contained the first 30 nucleotides of pre-miR-451 with 1–5 non-templated uridines. The last templated base in this RNA species base-pairs with position 10 in pre-miR-451, a location that is reminiscent of how Ago2 slices mRNA targets using siRNA as a guide in RISC. Further biochemical studies using in vitro processing assays revealed that Ago2 associates with pre-miR-451 and is responsible for its cleavage. Their data show that trimming of Ago2-sliced pre-miR-451 gives rise to mature miR-451. Mouse miR-451 was shown by the Hannon lab to undergo a similar process; this was initially discovered using genetically engineered mice in which Ago2's slicer activity was abolished by a targeted point mutation that replaced an essential catalytic aspartate with alanine. Their mouse work in conjunction with additional in vitro and cell culture assays showed that the biogenesis of miR-451 is Ago2-dependent and Dicer-independent. Recently, another study using the catalytically dead Ago2 mice revealed that miR-486 requires both the catalytic activity of Ago2 and Dicer cleavage of its precursor. Specifically, Dicer cleavage of pre-miR-486 produces duplex miRNA-5p/miRNA-3p; however, this duplex is arrested and Ago2 is required to cleave the passenger strand to liberate the guide for its incorporation into the RISC complex.

#### Mirtrons

Another alternative microRNA biogenesis pathway termed "mirtrons" was recently discovered wherein splicing generates pre-miRNA hairpin mimics, thereby bypassing the requirement for Drosha (reviewed in Rissland, 2015). Mirtrons have been identified in diverse animals including worms, flies, and humans (Berezikov et al., 2007; Okamura et al., 2007; Ruby et al., 2007; Bortolamiol-Becet et al., 2015; Reimao-Pinto et al., 2015). Typical mirtrons have both hairpin ends generated by splicing, forming a short 3′ overhang for nuclear export and cleavage (**Figure 5**). In mammals, short introns that could produce a canonical mirtron are relatively uncommon. A limited number of microRNAs are now definitively known to be mirtrons and include miR-877 and miR-1224 in mammals and miR-1225 and miR-1226 in primates (Berezikov et al., 2007).

Bioinformatic studies have identified hundreds of additional candidate murine and human mirtrons that are predominantly comprised of endogenous Dicer substrates with unique patterns of ordered 5′ and 3′ heterogeneity (Ladewig et al., 2012; Westholm et al., 2012; Wen et al., 2015). In most cases, the ends of mammalian mirtrons are characterized by the presence of oligouridine tails due to untemplated uridylation whose functional significance remains unclear. In Drosophila, the TUTase Tailor uridylates mirtrons and is linked to mirtron turnover (Bortolamiol-Becet et al., 2015; Reimao-Pinto et al., 2015). It remains unclear how uridylation of mirtrons affects their biogenesis in mammals.

#### IsomiRs and Uridylation

Mature microRNAs can differ in sequence and length from their canonical miRNA sequence reported in miRBase. These variants, known as isomiRs, arise in several ways: heterogeneity in Drosha and Dicer cleavage site selection, templated additions or deletions at the 5′ and/or 3′ termini of the miRNA, non-templated additions at the 3′ end, and substitutions within the sequence (Morin et al., 2008). Non-templated variations have been attributed to the activity of ribonucleotidyl transferases, posttranscriptional modifications or the presence of genetic variants within the miRNA transcript (Morin et al., 2008; Cloonan et al., 2011; Wyman et al., 2011; Neilsen et al., 2012; Lee et al., 2013; Ma et al., 2013; Starega-Roslan et al., 2015; McCall et al., 2017). Owing to altered target specificity, isomiRs may regulate a different subset of genes than the canonical miRNA and may play a role in disease pathogenesis (Gong et al., 2012; Tan et al., 2014). Although this review focuses on uridylation, it is known that other terminal ribonucleotidyl transferases are significant players in isomiR formation. For example, TENT2 (aka GLD-2) through monoadenylation stabilizes miR-122, a microRNA that is highly expressed in the liver (Katoh et al., 2009; Burns et al., 2011; D'Ambrogio et al., 2012).
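The isomiR categories listed above can be made concrete with a small classifier. This is a minimal sketch under stated assumptions, not the algorithm of any published isomiR tool: the function name `classify_isomir`, the ±3 nt search window for the 5′ start site, and the toy precursor with arbitrary flanking bases are all hypothetical. A read is aligned to the precursor by exact prefix matching, and whatever remains at the 3′ end is reported as a non-templated tail, so internal substitutions are not handled.

```python
def classify_isomir(read, canonical, precursor, window=3):
    """Describe a read relative to the canonical miRNA within its precursor.

    Returns the 5' start offset, the change in templated 3' end position,
    and the non-templated 3' tail (bases that no longer match the precursor).
    """
    can_start = precursor.find(canonical)
    if can_start < 0:
        raise ValueError("canonical sequence not found in precursor")
    can_end = can_start + len(canonical)

    # Try 5' start sites near the canonical one and keep the start that
    # gives the longest exact (templated) match to the precursor.
    best_start, best_len = can_start, 0
    for start in range(max(0, can_start - window), can_start + window + 1):
        matched = 0
        while (matched < len(read)
               and start + matched < len(precursor)
               and read[matched] == precursor[start + matched]):
            matched += 1
        if matched > best_len:
            best_start, best_len = start, matched

    return {
        "five_prime_offset": best_start - can_start,
        "three_prime_templated_change": (best_start + best_len) - can_end,
        "nontemplated_tail": read[best_len:],
    }


# Toy example: the flanking bases of the precursor are arbitrary.
canonical = "UGAGGUAGUAGGUUGUAUAGUU"
precursor = "GGCA" + canonical + "CCGAUC"

print(classify_isomir(canonical + "UU", canonical, precursor))       # 3' uridylated
print(classify_isomir(canonical[1:-1] + "A", canonical, precursor))  # shifted, trimmed, adenylated
```

Note that a 3′ addition is only called non-templated when it disagrees with the precursor; an added U that happens to match the next genomic base would be scored as a templated extension, which is an inherent ambiguity of sequence-based isomiR calling.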

To identify and quantify isomiRs, small-RNA-Seq has been routinely used. In this method, sequencing libraries are prepared from size-fractionated RNAs to enrich for small RNAs that are used as the initial template. For the high-throughput sequencing itself, a single read is sufficient and the overall number of required reads is lower than for mRNA libraries, which typically use paired-end reads and require far more sequencing depth. A handful of bioinformatic algorithms exist to analyze small-RNA-Seq data for isomiRs (miraligner, isomiRex, isomiRID, IsomiRage, DeAnnIso, mirisomiRExp, isomiR-SEA; Pantano et al., 2010; de Oliveira et al., 2013; Sablok et al., 2013; Muller et al., 2014; Guo et al., 2016; Urgese et al., 2016; Zhang et al., 2016). Of interest, a recent report by Garalde and colleagues suggests that direct RNA sequencing using nanopore technology has several beneficial attributes and has the ability to detect specific nucleotide modifications in RNA (Gyarfas et al., 2009). Since the RNA is sequenced directly, nanopore sequencing, unlike traditional RNA-Seq, does not lose information about post-transcriptional RNA modifications. RNA-Seq also suffers from technical liabilities associated with library preparation, which includes PCR amplification, and its short reads make definitive analysis of alternative splicing events difficult.

A striking example of the complexity of post-transcriptional regulation linked to microRNA uridylation involves TUT4, miR-26a, miR-26b, and interleukin-6 (IL-6) (Jones et al., 2009). In LIN28B-expressing A549 cells, Jones and colleagues showed that knockdown of TUT4 decreased the inflammatory cytokine IL-6 at both the mRNA and protein levels. IL-6 mRNA in TUT4 knockdowns had a reduced half-life and a shorter polyA-tail length. The IL-6 3′ UTR was sufficient to confer TUT4-dependence on a chimeric luciferase construct. Deep sequencing of A549 cells revealed that miR-26a in control cells exists largely (>80%) as a monouridylated 22 nucleotide microRNA, while in TUT4-deficient cells two populations predominate: a 21 nucleotide microRNA missing the 3′ non-templated U (∼50%) and a 23 nucleotide miRNA ending with a non-templated UA dinucleotide (∼28%). In contrast, miR-26b reads are increased in TUT4 knockdowns by >20 fold. Luciferase assays using microRNA mimics showed that the ability of miR-26b to repress IL-6 decreases as the length of non-templated Us increases. Overall, their data suggest that TUT4 in part promotes IL-6 expression by decreasing miR-26b levels as well as shifting the isomiR population of miR-26a. A subsequent study has implicated miR-26a in negatively regulating both LIN28B and TUT4, leading to let-7 upregulation (Fu et al., 2013). It remains unknown whether TUT4 can directly regulate IL-6 mRNA turnover via microRNA-independent mechanisms.

Recent work from the Gregory and Zon labs has interrogated the consequences of simultaneous knockdown of both TUT4 and TUT7 on isomiR generation (Thornton et al., 2014). In HeLa cells that do not express the LIN28 paralogs, microRNA sequencing revealed that the overall abundance of individual microRNAs was largely unaffected by TUTase loss. Depending on the particular microRNA, up to 8% of reads were determined to be uridylated relative to total reads. TUTase depletion reduced mature microRNA uridylation for multiple microRNAs as expected and resulted in a concomitant gain in adenylation.
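The "fraction uridylated relative to total reads" measure used in such experiments is simple arithmetic once each read has been assigned a non-templated 3′ tail. The sketch below assumes that assignment has already been made; the function name `tail_fractions` and all read counts are hypothetical illustrations, not data from the cited study.

```python
def tail_fractions(tail_counts):
    """Fraction of reads carrying U-only or A-only non-templated 3' tails.

    tail_counts maps a tail string ("" for untailed reads) to a read count.
    """
    total = sum(tail_counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    uridylated = sum(n for tail, n in tail_counts.items()
                     if tail and set(tail) == {"U"})
    adenylated = sum(n for tail, n in tail_counts.items()
                     if tail and set(tail) == {"A"})
    return {"uridylated": uridylated / total, "adenylated": adenylated / total}


# Hypothetical read counts for one microRNA in two conditions.
control = {"": 900, "U": 60, "UU": 20, "A": 20}
tutase_knockdown = {"": 930, "U": 15, "UU": 5, "A": 50}

print("control:  ", tail_fractions(control))
print("knockdown:", tail_fractions(tutase_knockdown))
```

With these made-up counts, the uridylated fraction falls while the adenylated fraction rises, which is the qualitative pattern described in the text; total abundance (1,000 reads in both conditions) is unchanged.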

#### URIDYLATION IN MRNA TURNOVER

Transcription and degradation are both critical for controlling mRNA abundance. Like transcription, mRNA turnover is a regulated process that integrates a multitude of factors including cell-type specificity, timing within the cell cycle, and response to intrinsic and extrinsic signals. Multiple mechanisms exist that are important for mRNA turnover that have differential reliance on deadenylation and uridylation. In the forthcoming sections, we will elaborate on the functions of uridylation in degradation of polyadenylated mRNAs, histone mRNAs that lack polyA-tails, and cleaved mRNAs.

#### Uridylation in Polyadenylated mRNA Degradation

Almost all mRNAs end with a polyA-tail that promotes transcript stability and translational efficiency. Over time, polyA-tails become shorter, decreasing translation and enhancing degradation. Mechanistically, mRNA can be actively degraded by multiple mechanisms where there is significant commonality in the critical enzymes (**Figure 6**). Decapping and a 5′→3′ exonuclease, typically XRN1, degrade mRNA from the 5′ end while the exosome degrades mRNA 3′→5′. Emerging evidence demonstrates that uridylation contributes to mRNA turnover. The first evidence that uridylation plays a role in mRNA turnover was derived from research on caffeine-induced death suppressor (Cid1) in the fission yeast Schizosaccharomyces pombe (Rissland et al., 2007). In S. pombe, caffeine in combination with hydroxyurea treatment leads to genomic instability and cell death due to failure in S-M checkpoint control. In a suppressor screen of this phenotype, Cid1 was identified. Bacterially expressed recombinant Cid1 utilizes both UTP and ATP while the Cid1 complex purified from fission yeast acts predominantly as a polyU polymerase that can uridylate multiple mRNAs, targeting them for degradation (Rissland et al., 2007; Rissland and Norbury, 2009).

In mammals, it was widely assumed that polyA tails end in adenine for obvious reasons. In groundbreaking studies by Narry Kim's lab, a novel technique that they termed TAIL-Seq was employed to interrogate the terminal ends of transcripts, revealing that a subset of mRNA transcripts end in U (Chang et al., 2014; Lim et al., 2014). TAIL-Seq is an innovative method to capture the true 3′ end of RNAs. In this method, rRNAs are first depleted and a 3′ adapter with biotin is ligated to the 3′ end of the RNA. RNA is fragmented by modest digestion with RNase T1. A streptavidin purification is then used, the 5′ end of the RNA is phosphorylated, and a 5′ adapter is added. At this point, library preparation and paired-end high-throughput sequencing follow traditional protocols.

Using TAIL-Seq, the Kim lab identified the addition of 1–3 uridines as a common modification at the end of mRNAs with short polyA-tails (<25 As) and guanylation downstream of the polyA-tail of mRNAs with longer tails (>40 nts). Roughly 50 and 80% of polyadenylated transcripts are uridylated at a frequency of >5 and >2%, respectively, with specific transcripts as high as 40%. Considering that uridylation targets mRNAs for degradation, the levels of uridylated transcripts at steady state are rather remarkable. Furthermore, they showed that TUT4 and TUT7 are the TUTases responsible for mRNA uridylation and that their knockdown increased the half-life of numerous transcripts. Overall, they discovered more than 600 genes upregulated in HeLa cells upon TUTase loss. Their data support the idea that uridylated mRNAs are degraded by canonical pathways where DIS3L2 only has a minor role, as anticipated given the limited number of Us added to mRNA. The further importance of TUTases in regulation of the maternal transcriptome during oocyte development and in the maternal-to-zygotic transition that occurs shortly after fertilization is discussed in section TUTases in Vertebrate Development and Physiology.
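The kind of 3′ end information that TAIL-Seq provides can be illustrated by splitting the end of a read into a polyA-tail length and any terminal non-A addition. This is a simplified sketch, not the published TAIL-Seq analysis pipeline: the function names, the `max_addition` limit, and the example sequences are assumptions, and real data require tolerance for sequencing errors within the tail. The length thresholds (<25 As, >40 nts) are those quoted in the text.

```python
def parse_three_prime_end(seq, max_addition=5):
    """Split a transcript 3' end into (polyA length, terminal non-A addition)."""
    seq = seq.upper().replace("T", "U")
    end = len(seq)
    # Walk back over a short run of terminal non-A residues.
    while end > 0 and seq[end - 1] != "A" and len(seq) - end < max_addition:
        end -= 1
    start = end
    # Count the uninterrupted run of A residues preceding the addition.
    while start > 0 and seq[start - 1] == "A":
        start -= 1
    poly_a_len = end - start
    if poly_a_len == 0:
        # No polyA run found; the trailing bases belong to the transcript body.
        return 0, ""
    return poly_a_len, seq[end:]


def label_tail(poly_a_len, addition):
    """Apply the tail-length thresholds quoted in the text."""
    if addition and set(addition) == {"U"} and poly_a_len < 25:
        return "short tail, uridylated"
    if addition and set(addition) == {"G"} and poly_a_len > 40:
        return "long tail, guanylated"
    return "other"


# Hypothetical 3' ends: an arbitrary body followed by a tail.
short_u = "GCAUGC" + "A" * 12 + "UU"
long_g = "GCAUGC" + "A" * 45 + "G"

for name, seq in (("short_u", short_u), ("long_g", long_g)):
    length, addition = parse_three_prime_end(seq)
    print(name, length, repr(addition), label_tail(length, addition))
```

Counting reads per transcript in each category would then give the per-transcript uridylation frequencies discussed above.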

During apoptosis, uridylation plays a critical role in the rapid degradation of global mRNAs that relies prominently on DIS3L2 (Thomas et al., 2015). For this process, TUT4 and TUT7 are responsible for widespread mRNA uridylation. Knockdown of TUTases or DIS3L2 impaired apoptotic mRNA decay and decreased programmed cell death, while DIS3L2 overexpression promotes apoptosis. When taken together, these results suggest that dysregulated uridylation may play multiple roles in disorders involving apoptotic cell death and/or aberrant mRNA turnover, leading to dysregulated gene expression and cell survival.

#### Uridylation and Replication-Dependent Histone mRNA Turnover

Replication-dependent histone mRNAs are exclusively present during the S phase of the cell cycle to encode histone proteins required for packaging of newly synthesized DNA (reviewed in Hoefig and Heissmeyer, 2014; Marzluff and Koreski, 2017). In eukaryotes, this restricted expression is accomplished through regulation of both histone mRNA transcription and degradation (**Figure 7**). In metazoans, histone mRNAs are unique because they end in a highly conserved stem-loop and are not polyadenylated like other mRNAs. The stem-loop structure interacts with LSm1-7 and Stem-Loop Binding Protein (SLBP) to form a complex that recruits 3′ hExo to bring about the degradation of histone mRNAs at the end of S phase or during the inhibition of DNA replication (Mullen and Marzluff, 2008; Hoefig et al., 2013; Lyons et al., 2014; Brooks et al., 2015).

A pioneering study by Mullen and Marzluff demonstrated that histone mRNA turnover is initiated by 3′ terminal oligouridylation that leads to either 3′→5′ degradation by the exosome and/or by decapping and subsequent 5′→3′ degradation by the exonuclease Xrn1 (Mullen and Marzluff, 2008). A subsequent collaborative study by the Rhoads and Marzluff labs showed that uridylated histone mRNAs are predominantly degraded by decapping and then 5′→3′ degradation (Su et al., 2013). The TUTase(s) involved in oligouridylation of histone mRNAs has not been definitively identified. Different terminal ribonucleotidyl transferases have been proposed as important for histone uridylation: MTPAP, TENT4B, TUT4, and TUT7. The emerging consensus is that TUT4 and/or TUT7 are the responsible TUTase(s). Both MTPAP and TENT4B have been shown to be polyA polymerases and lack the critical histidine that confers UTP specificity (Tomecki et al., 2004; Nagaike et al., 2005; Xiao et al., 2006; Rammelt et al., 2011; Berndt et al., 2012; Boele et al., 2014; Yamashita et al., 2017). Furthermore, MTPAP protein is almost exclusively mitochondrial where it is responsible for polyadenylation on mRNA transcripts encoded by mitochondrial DNA (Tomecki et al., 2004; Nagaike et al., 2005; Xiao et al., 2006). Subsequent work showed that MTPAP likely has an indirect effect on histone mRNA levels, since MTPAP knockdown impairs cell growth and reduces the number of cells actively going through S phase (Su et al., 2013).

In 2011, the Norbury lab showed that knockdown of the cytoplasmic TUT4 in HeLa cells blocked histone mRNA degradation and subsequently stalled DNA replication, accompanied by a proportional reduction in uridylated histone mRNA transcripts (Schmidt et al., 2011). The significance of TUT4 was independently confirmed in a subsequent study (Su et al., 2013). These data suggest that TUT4 is the terminal U-transferase responsible for directing histone mRNAs for degradation upon the inhibition or completion of DNA replication. Lackey and colleagues found through high-throughput sequencing that knockdown of TUT7 reduces uridylation at the 3′ end of histone mRNA transcripts as well as uridylation of degradation intermediates in the stem loop (Lackey et al., 2016). In contrast to prior reports, a knockdown of TUT4 did not produce an effect on the 3′ uridylation pattern or on the stem-loop, suggesting that TUT7 in concert with 3′hExo plays a major role in trimming and uridylation of histone mRNAs while TUT4 plays a minor role in this process.

#### Uridylation Promotes Degradation of Cleaved mRNAs

The discovery of RNA interference and the experimental utility of shRNA/siRNAs have revolutionized how gene function is interrogated (Mello and Conte, 2004). In cases of perfect (or near perfect) complementarity, the Ago2-RISC complex carries out siRNA-directed cleavage of the targeted mRNA. In contrast, miRNAs in the RISC complex, owing to their imperfect complementarity to their mRNA targets, usually cause mRNA destabilization and/or translational inhibition. When Ago2-RISC does cleave the target mRNA, the resulting 5′ mRNA fragment is often characterized by the addition of 1–24 uridines that promotes degradation (Shen and Goodman, 2004).

#### URIDYLATION IN NON-CODING RNA QUALITY CONTROL

Non-coding RNAs such as rRNAs, tRNAs, snRNAs, and snoRNAs are involved in diverse cellular processes ranging from chromosome replication to mRNA translation. In certain cases, these ncRNAs may be non-functional due to mutations in their genetic sequence, transcriptional errors, premature termination, or misprocessing. Therefore, it is imperative to eliminate aberrant ncRNAs in order to ensure proper cellular function and avoid disease. Elimination of such misprocessed ncRNA precursors is mediated in part by uridylation of ncRNA termini.

Haas and colleagues used a proteomics approach to identify protein complexes involved in miRNA tailing and trimming and identified TUT1 and DIS3L2 as likely candidates to be involved in ncRNA quality control (Haas et al., 2016). Studies with DIS3L2 and its catalytically inactive variant in Perlman syndrome revealed that DIS3L2 recognizes uridylated ncRNAs in the cytoplasm and subsequently degrades them (Labno et al., 2016). In this case, uridylation was carried out by TUT4 and TUT7 regardless of the origin of the ncRNAs. CLIP-Seq studies revealed that DIS3L2 has affinity for various uridylated ncRNAs: unprocessed tRNAs, vault RNAs, Y-RNAs, 5S RNA, snoRNAs transcribed by RNA pol III, transcription start site-associated short RNAs (TSSas), extended snRNAs and an FTL short RNA from the ferritin mRNA 5′ UTR transcribed by RNA polymerase II (Pirouz et al., 2016; Ustianenko et al., 2016). Also, polyuridylation of aberrant precursor snRNAs targets them for degradation by DIS3L2, an exonuclease that recognizes polyU-tails (Faehnle et al., 2014; Labno et al., 2016; Ishikawa et al., 2018). In addition, TUT7 and to a lesser extent TUT4 are thought to be critical for elimination of trimmed pre-miRNAs (Kim et al., 2015). These findings suggest that the TUT-DIS3L2 surveillance (TDS) pathway is responsible for monitoring and elimination of aberrantly structured ncRNAs.

#### TUTASES IN VERTEBRATE DEVELOPMENT AND PHYSIOLOGY

The in vivo functions of TUTases have been elucidated in morphant zebrafish and in genetically engineered knockout mice. In zebrafish, knockdown of either tut4 or tut7 via morpholino injection into fertilized eggs causes the majority of fish to die within 5 days post-fertilization (Thornton et al., 2014). These morphant zebrafish are characterized by developmental delay, failure in tail elongation, and somite degeneration. In situ hybridization revealed that several Hox genes had aberrant expression patterns in the morphant zebrafish. Of note, the morpholinos used in this experiment block pre-mRNA splicing and therefore do not interrogate the role of the spliced maternal TUTase transcripts that exist in the unfertilized egg.

In mammals, TUT4 works with Lin28A to block let-7 microRNA biogenesis (Hagan et al., 2009; Heo et al., 2009; Piskounova et al., 2011), while Lin-28 utilizes the TUTase PUP-2 to suppress let-7 in the nematode C. elegans (Lehrbach et al., 2009), implicating uridylation as an ancient mechanism important for Lin28 repression of let-7. Similar to TUTase-deficient fish, loss of either lin-28a or lin-28b in zebrafish caused multiple phenotypes, including developmental delay (Ouchi et al., 2014). These embryos were characterized by severe gastrulation defects that led to ∼40% lethality by 12 h post fertilization and those that survived longer had reduced body lengths and head size.

Knockout mice have elucidated the function of the TUTases TUT4/Zcchc11 and TUT7/Zcchc6 in mammals. TUT7-deficient mice are born in expected Mendelian ratios and appear overtly normal (Kozlowski et al., 2017). TUT4 and TUT7 are broadly expressed at the RNA level, with TUT7 relatively higher in the liver, lungs, and alveolar macrophages. To determine if TUT7 is involved in the defense against inhaled pathogens, wild-type and TUT7 knockout mice were challenged with S. pneumoniae through intratracheal exposure. TUT7 nulls had increased neutrophil recruitment during early infection and elevated levels of several cytokine mRNAs such as CXCL5, IL-6, and CXCL1. This finding suggests a role for TUT7 in altering the innate immune response. Mice with organism-wide double knockout of both TUT4 and TUT7 starting at 2 months of age are reported to be overtly healthy for several months, indicating that these genes are likely not essential in adult mammals (Morgan et al., 2017); however, detailed analyses of these mice remain to be performed to understand how uridylation may contribute in more subtle ways to adult physiology across multiple tissues.

TUT4 knockout mice are born normal in size and at expected Mendelian ratios (Jones et al., 2012); however, null mice are characterized by postnatal growth retardation that persists throughout life and roughly 50% of nulls die within a week of birth. Knockout mice that survive to weaning age are long-lived. Reduced levels of insulin-like growth factor 1 (Igf-1) in the liver and the blood were reported to contribute to the observed growth retardation and lethality. Igf-1 is a potent mitogen previously implicated in organismal growth (Baker et al., 1996). Mechanistically, the 3′ UTR of Igf-1 was critical for TUT4-mediated regulation as shown by luciferase reporter assays, suggesting possible microRNA involvement. Of note, deep sequencing of microRNAs in neonatal livers revealed that several mature microRNA species in TUT4-deficient mice had reduced levels of non-templated uridine additions while the overall abundance of individual microRNAs was largely unaffected. Of the mature microRNAs characterized by differential uridylation, several are predicted to repress Igf-1. The uridylated forms of miR-126-5p and miR-379 have diminished ability to repress luciferase expression constructs that harbor the IGF-1 3′ UTR. Therefore, loss of uridylation of these microRNAs is thought to enhance their ability to repress Igf-1. One caveat to this study is that it was conducted prior to the realization that TUTases promote degradation of mRNAs with short polyA-tails (Chang et al., 2014; Lim et al., 2014).
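Statements that knockout pups are "born at expected Mendelian ratios" are commonly supported by a chi-square goodness-of-fit test, although the source does not state which test the cited studies used. The sketch below shows the arithmetic for a heterozygous intercross with an expected 1:2:1 genotype ratio. The litter counts are hypothetical, and 5.991 is the standard chi-square critical value for 2 degrees of freedom at p = 0.05.

```python
def mendelian_chi_square(observed, expected_ratio=(1, 2, 1)):
    """Chi-square statistic for observed genotype counts against a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    chi2 = 0.0
    for obs, ratio in zip(observed, expected_ratio):
        expected = total * ratio / ratio_sum
        chi2 += (obs - expected) ** 2 / expected
    return chi2


CRITICAL_DF2_P05 = 5.991  # chi-square critical value, 2 degrees of freedom

# Hypothetical counts of wild-type, heterozygous and null pups at birth.
observed_counts = (26, 49, 25)
chi2 = mendelian_chi_square(observed_counts)
print(f"chi-square = {chi2:.3f}")
print("consistent with 1:2:1" if chi2 < CRITICAL_DF2_P05 else "deviates from 1:2:1")
```

A significant deficit of null pups at birth would point to embryonic lethality, whereas normal ratios at birth followed by later losses, as described for TUT4 nulls, indicate a postnatal defect.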

Regulation of miRNA expression is necessary for CD4 T-cell maturation. Using deep sequencing, Gutierrez-Vazquez et al. demonstrated that upon T-cell activation, terminally uridylated miRNA sequences decreased, accompanied by a proportional decrease in the levels of TUT4 and TUT7 (Gutierrez-Vazquez et al., 2017). Furthermore, analysis of TUT4-deficient T lymphocytes demonstrated that TUT4 is essential for the maintenance of miRNA uridylation in steady-state T lymphocytes. This study underscores the role of post-transcriptional uridylation in regulating miRNA levels during T-cell activation.

In mammals, the program for early development is established during oogenesis through the maternal transcriptome, which is achieved by stabilization of RNA binding proteins and suppression of RNA degradation pathways. However, during oocyte growth, limited degradation is required to confer distinctiveness to the maternal transcriptome. In a recent study by Morgan et al., TAIL-Seq was used to demonstrate that during oocyte growth in mice, 3′ terminal oligo-uridylation of mRNA that is mediated by TUT4 and TUT7 facilitates the elimination of certain transcripts (Morgan et al., 2017). These events play a critical role in defining the maternal transcriptome. They generated mice in which Zp3-Cre was used to conditionally knock out both TUT4 and TUT7 during oogenesis starting at the secondary stage. TUT4/TUT7 conditional knockout mice were infertile but underwent ovulation normally. Further studies revealed that this phenotype was due to the inability of these animals to support early embryonic development due to meiosis failure. This was mainly attributed to a lack of elimination of TUT4 uridylated transcripts that resulted in an inaccurate maternal transcriptome. This landmark study revealed that uridylation is imperative for accurate development of the maternal transcriptome, which in turn regulates oocyte maturation and fertility (Morgan et al., 2017).

After fertilization, TUT4 and TUT7 are implicated in the targeted degradation of maternal transcripts during the maternal-to-zygotic transition (MZT) in multiple vertebrate species that include frogs, fish, and mice (Chang et al., 2018). To reach this conclusion, the Kim lab profiled mRNAs by TAIL-Seq at fertilization and immediately thereafter. They discovered a marked increase in uridylation during MZT that occurs 4–6 h post fertilization. Using TUT4 and TUT7 morpholinos targeting the start codon, which block translation of the preexisting spliced maternal transcripts for these two genes, they found that MZT uridylation was impaired, that a subset of maternal transcripts was stabilized, and that TUTase loss caused significant gastrulation defects in both zebrafish and frogs. TUT7 was found to be primarily responsible for mRNA uridylation prior to the blastula stage and its loss recapitulated the observed failures in gastrulation. In addition, morpholinos that block splicing of the TUTase pre-mRNAs did not cause MZT defects, indicating that new zygotic expression of the TUTases is not required for this process and confirming the significance of maternal TUTase mRNAs. Altogether, these results highlight that TUTases play critical roles in the elimination of maternal transcripts that is required during MZT. It should be noted that TUT7 null mice are born in appropriate Mendelian ratios and as such, its loss is insufficient to cause embryonic lethality (Kozlowski et al., 2017).

To elucidate the developmental roles of Lin28 in mammals and by extension shed light on potential functions for TUTases, we generated and interrogated conditional mouse knockouts for both Lin28A and Lin28B in collaboration with George Daley's group (Zhu et al., 2011; Shinoda et al., 2013a,b). Lin28B nulls are overtly normal except for a modest impairment in male postnatal growth. Lin28A mouse knockouts are born in the expected Mendelian ratios; however, these mice are growth impaired and the vast majority die perinatally. Mechanistically, loss of Lin28A during embryogenesis leads to lifelong defects in glucose metabolism that likely contribute to slowed postnatal growth in survivors. Additional analysis of these conditional knockout mice as well as gain-of-function Lin28 mice reveals a direct role for Lin28 in regulating glucose metabolism and cellular bioenergetics, in part by let-7 mediated regulation of the Insulin-PI3K-mTOR pathway (Zhu et al., 2010, 2011). Specifically, muscle specific loss of Lin28A or overexpression of let-7 resulted in insulin resistance and impaired glucose tolerance. Double knockout of both Lin28A and Lin28B results in lethality by embryonic day 12.5, with numerous defects including developmental delay and neural tube closure defects, suggesting that the paralogs have partially redundant developmental functions. Conditional knockout of either Lin28A or Lin28B at 6 weeks of age yields no overt phenotypes, demonstrating that these genes do not play essential functions shortly after birth. This result is not surprising as expression of both genes is largely restricted to embryonic development.

### TUTASES IN DISEASE

Given the relative infancy of the 3′ RNA uridylation field, our understanding of how uridylation contributes to disease is rather limited. Recent and provocative evidence implicates uridylation as a critical gene regulator and driver of tumorigenesis. These data indicate dysregulated TUTase activity alone and in concert with the onco-fetal LIN28/let-7 pathway as hallmarks of poor prognosis in multiple cancer types. In addition, Perlman syndrome and potentially a subset of Wilms tumors are caused by mutations in DIS3L2, a 3′→5′ exonuclease that is responsible for degrading numerous polyuridylated RNA substrates. As discussed earlier, polyuridylation may play a role in myotonic dystrophy. As research continues, it is likely that defective uridylation will be discovered that has significance in additional diseases.

#### TUTases and the LIN28/let-7 Pathway in Cancer

The global significance of the onco-fetal LIN28/let-7 pathway has been the subject of multiple reviews, given its importance in ES biology, iPSC reprogramming, and cancer (reviewed in Bussing et al., 2008; Peter, 2009; Viswanathan and Daley, 2010; Lee et al., 2016; Balzeau et al., 2017). The LIN28 paralogs, LIN28A and LIN28B, are RNA binding proteins and proto-oncogenes whose expression occurs primarily during embryonic development. In multiple cancers, one of the LIN28 paralogs is reactivated, blocking biogenesis of the tumor suppressor let-7 microRNA family. LIN28A in particular recruits TUT4 to polyuridylate pre-let-7, blocking let-7 maturation and function. Let-7 exerts its tumor suppressor function in part by inhibiting the expression of numerous oncogenes such as MYC, RAS, YAP1, HMGA2, and CDK6 and by altering cellular bioenergetics, including glucose metabolism. LIN28A and LIN28B expression in cancer is typically mutually exclusive and expression of either gene is almost invariably associated with poor prognosis. Since there are 12 let-7 family members encoded on eight human chromosomes, the activation of the LIN28/let-7 pathway explains the global post-transcriptional let-7 repression observed in many human cancers.

The LIN28/let-7 pathway is directly implicated in cancer by genetically engineered mouse models in which ectopic LIN28A and/or LIN28B expression causes and/or enhances progression of neuroblastoma, Wilms tumor, mast cell leukemia, hepatocellular carcinoma, and colorectal adenocarcinoma (Molenaar et al., 2012; Nguyen et al., 2014; Urbach et al., 2014; Tu et al., 2015; Wang et al., 2015). Reintroduction of let-7 expression effectively impedes tumor growth or even causes tumor regression in several mouse models (Esquela-Kerscher et al., 2008; Viswanathan et al., 2009; Trang et al., 2010; Piskounova et al., 2011). For example, our research demonstrated that TUT4-depletion or let-7 reintroduction caused established xenograft tumors to regress in mice (Piskounova et al., 2011). LIN28 paralog expression confers resistance via a let-7 dependent mechanism to ionizing radiation and several chemotherapies (Weidhaas et al., 2007; Blower et al., 2008; Yang et al., 2008; Chen et al., 2009; Oh et al., 2010; Boyerinas et al., 2012). Altogether, these data provide strong evidence implicating the onco-fetal LIN28/let-7 pathway in cancer and as an attractive target for new therapies. This view is further reinforced by our conditional mouse knockout studies, which demonstrate that loss of either the Lin28A or the Lin28B proto-oncogene at 6 weeks of age causes no overt phenotypes (Shinoda et al., 2013b), consistent with predominant embryonic expression of these genes. In addition, adult mice with conditional knockout of both TUT4 and TUT7 are overtly healthy (Morgan et al., 2017). Therefore, targeted inhibition of LIN28 is less likely to cause deleterious side effects in cancer patients. Recently, the Gregory lab performed a high throughput biochemical screen to identify mouse TUT4 inhibitors (Lin and Gregory, 2015a). Although the majority of initially identified compounds were thiol-containing false positives due to lack of a reducing agent in the screen, several compounds were ultimately identified that could block pre-let-7 uridylation in vitro.

In addition to the role of uridylation in LIN28A-expressing cancers, TUT4 is implicated in both gliomas and breast cancer, independently of the LIN28/let-7 pathway. In breast cancer, Hallett and Hassell defined a dual gene classifier to predict breast cancer survival using E2F1 and TUT4 (i.e., KIAA0191; Hallett and Hassell, 2011). Other proliferation-related genes (e.g., BUB1 or AURKA) could substitute for E2F1 in their 2-gene signature and TUT4 was not prognostic on its own. When taken together, these results suggest that TUT4 overexpression in certain contexts may contribute to poor clinical prognosis in breast cancer. Of note, this classifier was independent of breast cancer subtype and the LIN28/let-7 pathway. Proteomic studies have also revealed that overexpression of TUT4 is common in high grade gliomas in comparison to low grade gliomas (Gerth et al., 2013). The potential significance of this association remains to be elucidated.

#### DIS3L2 in Perlman Syndrome and Wilms Tumor

Perlman syndrome is a rare, autosomal recessive congenital overgrowth and cancer susceptibility disorder (Neri et al., 1984, 1985). In 2012, germline mutations in the DIS3L2 exonuclease were discovered by a team working on Perlman syndrome (Astuti et al., 2012). Subsequently, two DIS3L2 heterozygous missense changes were identified in sporadic Wilms tumor, a tumor highly associated with Perlman syndrome. Astuti and colleagues showed that the loss of DIS3L2 is associated with mitotic abnormalities and the abnormal expression of mitotic proteins, pointing to a tumor suppressor activity for DIS3L2 (Astuti et al., 2012). A case of homozygous deletion of DIS3L2 exon 9 in a Japanese patient with Perlman syndrome was described later (Higashimoto et al., 2013). DIS3L2 is a cytoplasmic RNA exonuclease that is important for degrading numerous polyuridylated non-coding RNAs through recognition of their polyU tails (reviewed in Morris et al., 2013; Pashler et al., 2016).

Recent work has identified multiple non-coding RNAs as direct targets for Dis3l2-mediated decay, where the underlying theme is that terminal polyuridylation is the key specificity determinant. Chang and colleagues showed that Dis3l2 knockdown in mouse embryonic stem cells does not affect mature let-7 levels, consistent with polyuridylated pre-let-7 no longer being a Dicer substrate; however, an increase in polyuridylated pre-let-7 levels was observed. They normalized their data to U6 snRNA, a key spliceosomal component that undergoes oligouridylation and trimming. In vitro, Dis3l2 preferentially interacts with and degrades polyuridylated precursor microRNAs (e.g., pre-miR-21 and pre-let-7g) in comparison to their non-uridylated counterparts, highlighting that a polyU-tail directs at least in part the activity of this enzyme. Ustianenko and colleagues initially identified Dis3L2 as an oligoU binding protein (Ustianenko et al., 2013). They further showed that Dis3l2 associates with and degrades polyuridylated pre-let-7. Interestingly, in LIN28A- and LIN28B-negative HeLa cells, they found that Dis3L2 loss reduced mature let-7 levels. The potential significance of let-7 microRNAs in Perlman syndrome remains to be resolved. Also, the precise defect that leads to Perlman syndrome is unclear, as one can envision that disease is caused by the toxic accumulation of defective RNA species and/or by loss of specific RNA(s) that require DIS3L2 activity for function.

### FUTURE PERSPECTIVES

Although the last decade has revealed the central importance of uridylation in mammals, much remains to be discovered that has significant implications for basic biology and translational medicine. For example, the LIN28/let-7 pathway is directly implicated in numerous poor prognosis cancers where expression of either LIN28 paralog may confer resistance to ionizing radiation and several frontline chemotherapies such as doxorubicin, 5-fluorouracil, taxanes, and platinum-based drugs (reviewed in Lee et al., 2016; Balzeau et al., 2017). Moreover, cancer stem cells that are often responsible for recurrent and more deadly disease are characterized by LIN28A or LIN28B reactivation even when the tumor bulk lacks their expression. Both the LIN28/let-7 pathway and TUT4 and TUT7 appear to be bona fide molecular targets for therapeutic intervention in cancer where their specific inhibition may be well-tolerated in patients (Piskounova et al., 2011; Shinoda et al., 2013b; Morgan et al., 2017). Altogether, these results suggest that LIN28 and/or TUTase inhibitors may represent a novel drug class whose development is worthwhile.

Further research is also needed to define the precise molecular mechanisms by which TUT4 promotes the development of poor prognosis breast cancers and high grade gliomas, independently of the LIN28/let-7 pathway. Ongoing research continues to unravel how mutations in DIS3L2 contribute to Perlman syndrome and Wilms tumors. Also, advances in next generation sequencing technology such as nanopore sequencing may enhance our ability to look at terminal RNA uridylation that does not involve cumbersome sequencing library construction that is used for small-RNA-Seq or TAIL-Seq. In terms of more basic biology, uridylation is likely regulated in cell-type and disease-specific manners. Therefore, the complete repertoire of functionally significant uridylation targets remains to be elucidated. Numerous biological questions remain about RNA uridylation. What regulates TUTase expression and function? Are there other tissue specific RNA binding proteins that function like LIN28A to target specific microRNAs or other RNA substrates for uridylation-dependent degradation? How functionally significant are isomiRs generated by uridylation in development, normal physiology, and disease biology? What are the precise roles that uridylation plays in non-canonical microRNA biogenesis? Are there cis- and trans-acting factors that modulate uridylation activity on specific RNA substrates such as mRNAs with short polyA-tails? Does dysregulated uridylation contribute to pathology in other diseases and by what mechanisms? The answers to these questions as well as others should provide critical new insights into how uridylation contributes to post-transcriptional gene regulation in the emerging field of RNA epitranscriptomics.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication. MRM and JB contributed equally to this work.

#### FUNDING

This work was supported by the NIH grant R01 CA215186 awarded to JPH.

#### REFERENCES

pathway by the terminal uridyltransferase tailor. Mol. Cell 59, 217–228. doi: 10.1016/j.molcel.2015.05.034


Chang, H., Yeo, J., Kim, J. G., Kim, H., Lim, J., Lee, M., et al. (2018). Terminal uridylyltransferases execute programmed clearance of maternal transcriptome in vertebrate embryos. Mol. Cell 70, 72–82.e7. doi: 10.1016/j.molcel.2018.03.004


microRNA turnover during CD4 T-cell activation. RNA 23, 882–891. doi: 10.1261/rna.060095.116


PAPalpha. Nucleic Acids Res. 44, 811–823. doi: 10.1093/nar/gkv1074


scan reveals novel loci for autism. Nature 461, 802–808. doi: 10.1038/nature08490


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Menezes, Balzeau and Hagan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## The Emerging Field of Epitranscriptomics in Neurodevelopmental and Neuronal Disorders

Margarita T. Angelova<sup>1†</sup>, Dilyana G. Dimitrova<sup>1†</sup>, Nadja Dinges<sup>2†</sup>, Tina Lence<sup>2†</sup>, Lina Worpenberg<sup>2†</sup>, Clément Carré<sup>1</sup>\* and Jean-Yves Roignant<sup>2</sup>\*

<sup>1</sup> Drosophila Genetics and Epigenetics, Sorbonne Université, Centre National de la Recherche Scientifique, Biologie du Développement—Institut de Biologie Paris Seine, Paris, France, <sup>2</sup> Laboratory of RNA Epigenetics, Institute of Molecular Biology, Mainz, Germany

#### Edited by:

Giovanni Nigita, The Ohio State University, United States

#### Reviewed by:

Jernej Ule, University College London, United Kingdom

Martin Kos, Universität Heidelberg, Germany

#### \*Correspondence:

Clément Carré clement.carre@upmc.fr

Jean-Yves Roignant j.roignant@imb-mainz.de

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

> Received: 07 March 2018 Accepted: 29 March 2018 Published: 13 April 2018

#### Citation:

Angelova MT, Dimitrova DG, Dinges N, Lence T, Worpenberg L, Carré C and Roignant J-Y (2018) The Emerging Field of Epitranscriptomics in Neurodevelopmental and Neuronal Disorders. Front. Bioeng. Biotechnol. 6:46. doi: 10.3389/fbioe.2018.00046

Analogous to DNA methylation and histone modifications, RNA modifications represent a novel layer of regulation of gene expression. The dynamic nature and increasing number of RNA modifications offer new possibilities to rapidly alter gene expression upon specific environmental changes. Recent lines of evidence indicate that modified RNA molecules and associated complexes regulating and "reading" RNA modifications play key roles in the nervous system of several organisms, controlling both its development and function. Mutations in several human genes that modify transfer RNA (tRNA) have been linked to neurological disorders, in particular to intellectual disability. Loss of RNA modifications alters the stability of tRNA, resulting in reduced translation efficiency and generation of tRNA fragments, which can interfere with neuronal functions. Modifications present on messenger RNAs (mRNAs) also play important roles during brain development. They contribute to neuronal growth and regeneration as well as to the local regulation of synaptic functions. Hence, potential combinatorial effects of RNA modifications on different classes of RNA may represent a novel code to dynamically fine tune gene expression during brain function. Here we discuss the recent findings demonstrating the impact of modified RNAs on neuronal processes and disorders.

#### Keywords: RNA modification, m5C, Nm, pseudouridine, m6A, neurons, disease

#### INTRODUCTION

An estimated 1–2% of all genes in a given organism contribute to nucleic acid modification systems, suggesting biological importance of modified nucleotides (Grosjean, 2009). A classic example is the methylation of cytosine on DNA, which acts as a critical epigenetic regulator of gene expression (Bird, 2002). Additionally, current advances in RNA modification research report over 140 distinct post-transcriptional RNA modifications (Cantara et al., 2011; Machnicka et al., 2013). Initial knowledge has been derived from studies on abundant non-coding RNAs (ncRNAs), such as transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), in prokaryotes and simple eukaryotes. These pioneering investigations described a diverse, chemically complex, and strongly conserved nature of RNA nucleotide modifications (Cantara et al., 2011; Machnicka et al., 2013). The most heavily modified RNAs in any cell type and organism are tRNAs. Up to 20% of nucleotides in mammalian cytoplasmic tRNAs carry modifications (Motorin and Helm, 2011; Pan, 2018). Modified nucleotides outside the anticodon loop of tRNAs occur non-randomly at conserved positions across diverse species and generally affect tRNA stability (Helm, 2006; Motorin and Helm, 2010). In addition, modifications in the anticodon loop can help optimize mRNA decoding by directly affecting codon-anticodon interactions (Agris, 2008).

Aberrant tRNA and rRNA modifications have been linked to various human disease syndromes and the phenotypes are often observed in specific tissues such as the gonads and the nervous system (Torres et al., 2014). Notably, an increasing number of predicted human transfer RNA (tRNA) modification genes have been associated with neurological disorders, in particular with intellectual disability (ID) (for a recent review see Bednárová et al., 2017). ID, previously known as mental retardation (MR), is characterized by nonprogressive cognitive impairment and affects 1–3% of the general population (Daily et al., 2000). It is presently unclear whether all observed phenotypes are caused by aberrant tRNA modifications, by effects on other, unidentified RNA substrates (see below) and/or by a modification-independent function of the involved enzymes (Guo and Schimmel, 2013; Genenncher et al., 2018). Likewise, it is unknown why some tissues, in particular the brain, are more sensitive to the loss of these modifications.

Importantly, besides the heavily modified tRNAs and rRNAs, mRNAs and small and long non-coding RNAs were also found to harbor post-transcriptional modifications. Recent technological advances that allowed mapping of selected RNA modifications on a transcriptome-wide scale revealed widespread distribution of N6-methyladenosine (m6A), pseudouridine (ψ) and ribose 2′-O-methylation (Nm) on mRNA (Dominissini et al., 2012; Meyer et al., 2012; Carlile et al., 2014; Schwartz et al., 2014a; Dai et al., 2017). The prevalence of some others, including N1-methyladenine (m1A) and 5-methylcytidine (m5C), is still debated (Dominissini et al., 2016; Li et al., 2016, 2017c; Dominissini and Rechavi, 2017; Legrand et al., 2017; Safra et al., 2017). m6A, the most abundant mRNA modification, was shown to affect almost every step of mRNA biogenesis, including splicing, export, translation, and mRNA decay (Lence et al., 2017; Roignant and Soller, 2017). It is thus not surprising that misregulation of m6A results in several physiological defects, including brain development abnormalities, obesity, cancer, and other diseases (Batista, 2017; Dai et al., 2018). In addition, the discovery of m6A RNA demethylases (Jia et al., 2011; Zheng et al., 2013; Jacob-Hirsch et al., 2018) and the identification of m6A-binding proteins (Dominissini et al., 2012) indicated that, similarly to DNA modification, RNA methylation can be reversible and convey information via recognition by effector proteins.

Altogether, these recent studies revealed an entirely new layer of regulation of gene expression, which has been central to the development of a novel concept called "RNA epigenetics" or "epitranscriptomics" (He, 2010; Meyer et al., 2012). However, the exact biological function of the majority of modified RNA nucleotides remains to be discovered. In this review, we will focus on several RNA modifications and will discuss their involvement in the development of the brain and neurological disorders.

### 5-METHYLCYTOSINE (m5C)

Cytosine can be methylated at the 5th position of the pyrimidine ring to form 5-methylcytosine (m5C) (**Figure 1**). Various eukaryotic cytosine-5-RNA methyltransferases catalyze the formation of m5C at specific positions (Motorin and Grosjean, 1999; Brzezicha et al., 2006; Sharma et al., 2013; Metodiev et al., 2014; Haag et al., 2015; Schosserer et al., 2015). The analysis of genetic mutations in two particular RNA cytosine-5 methyltransferase family members (Dnmt2/Trdmt and NCL1/TRM4/NSun2) has provided important insights into the biological effects of aberrant m5C deposition.

#### Dnmt2

Dnmt2 is a member of the most widely conserved eukaryotic cytosine-5-DNA methyltransferase protein family (Goll and Bestor, 2005). Despite this classification, only a few studies reported Dnmt2-mediated DNA methylation (Hermann et al., 2003; Kunert et al., 2003; Phalke et al., 2009) and it is today acknowledged that Dnmt2 functions mainly as a tRNA methylase (Okano et al., 1998; Schaefer and Lyko, 2010a,b; Raddatz et al., 2013). Dnmt2 methylation activity on position C38 of three tRNAs, namely tRNA<sup>Asp</sup>, tRNA<sup>Val</sup>, and tRNA<sup>Gly</sup>, has been described in yeast, Drosophila, mouse, and human cells (Goll et al., 2006; Jurkowski et al., 2008; Schaefer et al., 2010). Knockdown of Dnmt2 in zebrafish embryos leads to differentiation defects in some organs, and notably, to abnormal neurogenesis in the hypothalamus and diencephalon (Rai et al., 2007). In Dnmt2 mutant flies, reduced viability under stress conditions was observed (Schaefer et al., 2010). This is in accordance with previous studies that suggest an increased tolerance for stress in Drosophila and Entamoeba upon Dnmt2 overexpression (Lin et al., 2005; Fisher et al., 2006). Nevertheless, the majority of studies suggest that Dnmt2 mutation does not trigger strong detrimental phenotypes in yeast, Drosophila and mice (Wilkinson et al., 1995; Kunert et al., 2003; Goll et al., 2006; Schaefer et al., 2010), which raises the question of why zebrafish rely on Dnmt2 for proper development, whereas mice and flies do not. One possible explanation is that these organisms have redundant mechanisms that compensate for the loss of Dnmt2, which may be absent or less robust in zebrafish. Consistent with this possibility, it was shown that Dnmt2 mutant mice exhibit lethal phenotypes in the absence of a second m5C methyltransferase, NSun2 (Tuorto et al., 2012). In humans, polymorphisms in DNMT2 have been associated with spina bifida, a congenital malformation of the central nervous system (Franke et al., 2009).

#### NSun2

Unlike Dnmt2, mammalian NSun2 has been established to modify not only tRNAs (Blanco et al., 2011; Tuorto et al., 2012) but also other small ncRNAs such as 7SK, vault, and Y-RNAs (Hussain et al., 2013; Khoddami and Cairns, 2013). Dnmt2 and Nsun2 double knockout mice showed a lethal phenotype. However, deletion of NSun2 alone (Blanco et al., 2011; Tuorto et al., 2012; Hussain et al., 2013) or in combination with Dnmt2 (Rai et al., 2007; Tuorto et al., 2012) in specific tissues impairs cellular differentiation pathways in mammalian skin, testes, and brain. The function in the brain appears conserved as Nsun2 mutations are associated with ID and Dubowitz-like syndrome in humans (Abbasi-Moheb et al., 2012; Khan et al., 2012; Martinez et al., 2012), as well as with microcephaly in humans and mice (Blanco et al., 2014). In addition, a recent study in human and mouse neuron precursor cells showed that m5C deposited by Nsun2 regulates neural stem cell (NSC) differentiation and motility (Flores et al., 2017). This study thus provides some links between the failure of RNA m5C deposition and the associated brain development diseases.

It is intriguing that patient fibroblasts and Nsun2-deficient mice (Blanco et al., 2014), as well as Dnmt2 mutant flies (Durdevic et al., 2013), exhibit increased cleavage of tRNA and elevated production of tRNA fragments (tRFs). This accumulation of tRFs reduces protein translation rates and increases oxidative stress as well as neuronal apoptosis. Interestingly, reducing tRNA cleavage in Nsun2-deficient brains is sufficient to rescue sensitivity to oxidative stress, implying that tRFs play a role in Nsun2-mediated defects.

### (2′-O)-METHYLATION (Nm)

2′-O-methylation (Nm) is a common nucleoside modification of RNA, where a methyl group is added to the 2′ hydroxyl of the ribose moiety (**Figure 1**). Nm increases hydrophobicity, protects RNAs from nuclease attacks and stabilizes helical structures (Kurth and Mochizuki, 2009; Byszewska et al., 2014; Kumar et al., 2014; Yildirim et al., 2014). Nm is predominantly found internally in ribosomal RNAs and small nuclear RNAs as well as in tRNAs and in a number of sites on mRNA (Darzacq et al., 2002; Rebane et al., 2002; Kurth and Mochizuki, 2009; Zhao et al., 2012; Somme et al., 2014; Dai et al., 2017). This modification is also present at the 3′-end of miRNAs and siRNAs in plants (Li et al., 2005; Yu et al., 2005), as well as in siRNAs and piRNAs in animals (Horwich et al., 2007; Saito et al., 2007). Nm methyltransferases acting on tRNAs are highly conserved from bacteria and archaea to humans (Somme et al., 2014) and usually target positions in the anticodon loop. For instance, TRM7 in S. cerevisiae modifies positions 32 and 34 of selected tRNAs, among which is tRNA<sup>Phe</sup> (Pintard et al., 2002; Guy et al., 2012). Strikingly, FTSJ1, the TRM7 ortholog in humans, methylates the exact same positions of the exact same tRNAs (Guy and Phizicky, 2015; Guy et al., 2015). Consistent with a conserved function, expression of human FTSJ1 can suppress the severe growth defect of S. cerevisiae ∆trm7 mutants (Pintard et al., 2002; Guy and Phizicky, 2015). Reduction of the modification level in tRNA<sup>Phe</sup> was reported in carcinoma and neuroblastoma in mice (Pergolizzi and Grunberger, 1980; Kuchino et al., 1982) and is associated with ID in humans (see below).

#### FTSJ1

One of the best characterized associations between ID in humans and mutations in a gene encoding an Nm methyltransferase is the one between non-syndromic X-linked ID (NSXLID) and mutations in the FTSJ1 gene (OMIM:300499) (Guy et al., 2015). One third of the X-linked ID (XLID) conditions are syndromic (S-XLID) and the other two thirds are non-syndromic (NSXLID) (Lubs et al., 2012). NSXLID is associated with no obvious and consistent phenotype other than mental retardation (IQ < 70); indeed, NSXLID disorders are clinically diverse and genetically heterogeneous. FTSJ1 loss of function causes NSXLID in males (Froyen et al., 2007; Takano et al., 2008). Heterozygous loss of function mutations in females do not cause the disease, which is probably due to inactivation of the affected X chromosome. Several alleles of FTSJ1 from six independent families correlate with NSXLID. All of these alleles lead to a reduction in mRNA levels and/or protein function (Willems et al., 1993; Hamel et al., 1999; Freude et al., 2004; Ramser et al., 2004; Froyen et al., 2007; Takano et al., 2008; Guy et al., 2015; **Table 1**). Consistent with the 2′-O-methyltransferase activity of FTSJ1 on tRNAs, Guy and Phizicky reported that two genetically independent lymphoblastoid cell lines (LCLs) of NSXLID patients with FTSJ1 loss of function mutations nearly completely lack Cm32 and Gm34 on tRNA<sup>Phe</sup> (Guy et al., 2015). Additionally, tRNA<sup>Phe</sup> from a patient carrying an FTSJ1-p.A26P missense allele specifically lacks Gm34, but has normal levels of Cm32. tRNA<sup>Phe</sup> from the corresponding Saccharomyces cerevisiae TRM7-A26P mutant also specifically lacks Gm34. Altogether, these findings strongly suggest that the absence of the Gm34, but not the Cm32, modification on tRNA<sup>Phe</sup> causes NSXLID in patients carrying distinct FTSJ1 alleles. Nevertheless, the molecular consequences arising from the loss of this 2′-O-methylation are not yet determined. Furthermore, it is worth mentioning two additional studies involving families from the Chinese Han population (Dai et al., 2008; Gong et al., 2008), in which three single nucleotide polymorphisms (SNPs) in the FTSJ1 gene were analyzed. The authors found a positive association with occurrence of NSXLID (Dai et al., 2008) as well as with general cognitive ability, verbal comprehension, and perceptual organization in male individuals (Gong et al., 2008). Although it seems tempting to link variation in the FTSJ1 gene to general human cognitive ability, more in-depth studies are needed to support this idea.

#### TABLE 1 | Proteins required for writing, reading, or removal of different RNA modifications and their mutations associated with altered brain functions.

ID, intellectual disability; KD, knockdown; KO, knock out; cKO, conditional knock out; Lof, reduced function or loss of function; Gof, gain of function; SNP, single nucleotide polymorphism; Hs, Homo sapiens; Mm, Mus musculus; Dr, Danio rerio; Dm, Drosophila melanogaster; m6A, N6-methyladenosine; Ψ, pseudouridine; m5C, 5-methylcytosine; Nm, 2′-O-methylation.

#### TRMT44

TRMT44 is a putative 2′-O-methyluridine methyltransferase predicted to methylate residue 44 in tRNA<sup>Ser</sup> (Leschziner et al., 2011). Mutations in this gene were identified as causative in partial epilepsy with pericentral spikes (PEPS), a novel Mendelian idiopathic epilepsy (Leschziner et al., 2011). However, the underlying mechanisms are currently unknown.

#### Small Nucleolar RNAs (snoRNAs)

snoRNAs are a class of regulatory RNAs responsible for post-transcriptional modification of ribosomal RNAs (rRNAs). Two families of snoRNAs have been described, based on their structure and function: C/D box snoRNAs are responsible for 2′-O-methylation (Cavaillé et al., 1996), whereas H/ACA box snoRNAs mediate pseudouridylation (Ganot et al., 1997; see the following chapter in this review). In zebrafish, loss of three snoRNAs results in impaired rRNA modifications, causing severe developmental defects including growth delay and deformations in the head region (Higa-Nakamine et al., 2012). In humans, C/D box snoRNAs have been implicated in Prader-Willi syndrome (PWS), a complex neurological disease characterized by mental retardation, short stature, obesity, and muscle hypotonia (Sridhar et al., 2008; Doe et al., 2009). In several independent studies, PWS was shown to be caused by the loss of imprinted snoRNAs in locus 15q11–q13. Large deletions of this region underlie about 70% of cases of PWS (Peters, 2008), whereas duplication of the same region is associated with autism (Belmonte et al., 2004; Bolton et al., 2004; Cook and Scherer, 2008). Locus 15q11–q13 contains numerous copies of two C/D box snoRNAs, SNORD115 (HBII-52) and SNORD116 (HBII-85) (Cavaillé et al., 2000). SNORD115 is believed to play key roles in the fine-tuning of the serotonin receptor (5-HT2C) by influencing its pre-mRNA splicing (Vitali et al., 2005; Kishore and Stamm, 2006; Falaleeva et al., 2017), whereas SNORD116 loss is thought to contribute to the etiology of PWS (Cavaillé et al., 2000; Sahoo et al., 2008; Duker et al., 2010).

#### Hen1/Pimet

Hen1/Pimet is a conserved enzyme that adds a 2′-O-methyl group to the 3′-terminal nucleotides of miRNAs and siRNAs in plants, and of siRNAs and piRNAs in animals. This modification protects these small non-coding RNAs (sncRNAs) from 3′ → 5′ exonuclease degradation (Li et al., 2005; Horwich et al., 2007; Saito et al., 2007; Terrazas and Kool, 2009; Ross et al., 2014). In the absence of Hen1/Pimet, piRNAs and siRNAs are destabilized and sncRNA silencing activities are compromised. Surprisingly, Hen1 mutant flies display neither increased lethality nor sterility under normal laboratory conditions, but they do show accelerated neurodegeneration (brain vacuolization), memory defects, and a shorter lifespan (Abe et al., 2014). This suggests a protective effect of Nm and small RNA pathways against age-associated neurodegenerative events. Accordingly, Drosophila lacking the siRNA effector Argonaute 2 (Ago2) are viable but exhibit memory impairment and a shortened lifespan (Li et al., 2013).

### PSEUDOURIDINE (ψ)

Pseudouridine (also known as 5-ribosyluracil or ψ) is the first discovered (Cohn and Volkin, 1951) and most abundant RNA modification, present in a broad range of non-coding RNAs, and was also recently detected in coding mRNA (Carlile et al., 2014; Lovejoy et al., 2014; Schwartz et al., 2014a; Li et al., 2015). The isomerization of uridine into ψ improves the base stacking in RNAs by the formation of additional hydrogen bonds, which influences RNA secondary structure and increases the stability of RNA duplexes (Arnez and Steitz, 1994; Davis, 1995). Pseudouridylation was shown to have a strong impact on different cellular processes, including translation efficiency, splicing, telomere maintenance, and the regulation of gene expression (Mochizuki et al., 2004; Carlile et al., 2014; Schwartz et al., 2014a). This base modification is catalyzed by pseudouridine synthases (Pus) that act on their substrates by two distinct mechanisms. The first is guide RNA-dependent pseudouridylation, in which H/ACA box snoRNAs target RNAs for pseudouridylation via specific sequence interactions between the snoRNA and the target RNA. A specific enzyme present in the snoRNP (small nucleolar ribonucleoprotein) particle catalyzes the uridine modification (dyskerin in human, Cbf5 in yeast; Duan et al., 2009; Liang et al., 2009). Alternatively, RNA-independent pseudouridylation requires stand-alone pseudouridine synthases that directly catalyze ψ formation at particular target RNAs (Yu et al., 2011; Carlile et al., 2014; Rintala-Dempsey and Kothe, 2017). Each enzyme has a unique specificity for its target RNA and modifies uridine in a certain consensus sequence. Pus enzymes are present in all kingdoms of life, are evolutionarily conserved, and are categorized into six families based on their consensus sequences: TruA, TruB, TruD, RluA, RsuA, and Pus10. The sixth family, Pus10, is exclusive to eukaryotes and archaea (Koonin, 1996; Kaya and Ofengand, 2003; Fitzek et al., 2018).

Several pieces of evidence hint toward an implication of ψ in regulating neuronal functions. For instance, patients with mild-to-moderate Alzheimer's disease show significantly elevated levels of urinary ψ (Lee et al., 2007), but it is currently unknown whether there is a link between this increase and the etiology of Alzheimer's disease. Furthermore, it has been suggested that pseudouridylation can serve as a direct indicator of oxidative stress, which in turn has been linked to an increased risk of neurodegeneration (Roth et al., 1999; Uttara et al., 2009). Accordingly, in cells exposed to acute oxidative stress by H<sub>2</sub>O<sub>2</sub> treatment, Li and colleagues detected an elevation of ∼40–50% in mRNA ψ levels, indicating that mRNA pseudouridylation acts as a direct response to cellular stress (Li et al., 2015).

A recent report demonstrated a direct implication of ψ in a neuronal disorder, myotonic dystrophy type 2 (DM2) (Delorimier et al., 2017). DM2 is a neuromuscular disease characterized by severe gray matter changes, including neuronal loss and global neuronal impairment (Minnerop et al., 2011; Meola and Cardani, 2015). DM2 patients have an increased binding of Muscleblind-like 1 protein (MBNL1) to CCUG repeats in an intron of the CNBP gene (Cho and Tapscott, 2007). Interestingly, it was recently reported that pseudouridylation within CCUG repeats reduces RNA flexibility and thus modestly inhibits MBNL1 binding (Delorimier et al., 2017). Similarly, ψ modification of a minimally structured model RNA resulted in an even more drastic reduction of MBNL1 binding to CCUG repeats. This study shows that ψ can reduce the disease-causing binding of MBNL1 at extended CCUG repeats and offers a basis for future research in treating neurodegenerative diseases.

#### Pus1

Pus1 is a member of the TruA family that typically pseudouridylates tRNA but was also recently found to act on rRNA, snRNA, and mRNA (Schwartz et al., 2014a; Carlile et al., 2015). Mutations of Pus1 in humans lead to mitochondrial myopathy and sideroblastic anemia (Bykhovskaya et al., 2004; Fernandez-Vizarra et al., 2007; Bergmann et al., 2010). Recently, a mild cognitive impairment was also characterized in a long-surviving patient with two novel Pus1 mutations (Cao et al., 2016). A different study demonstrated Pus1-dependent pseudouridylation of the steroid receptor RNA activator (SRA). Pseudouridylated SRA acts as a co-activator of the nuclear estrogen receptor α (ERα) (Zhao et al., 2004; Leygue, 2007). Given that ERα was shown to regulate neuronal survival (Gamerdinger et al., 2006; Foster, 2012), it is conceivable that one of the functions of Pus1 in brain activity is mediated via the control of the ER pathway.

#### Pus3

Pus3 is another member of the TruA family, which has strong sequence homology to Pus1 but acts on distinct target RNAs. In situ hybridization showed accumulation of Pus3 mRNA in the nervous system of mouse embryos, suggesting a role of Pus3 in neural development (Diez-Roux et al., 2011). Accordingly, a truncated form of Pus3 accompanied by reduced levels of ψ at U39 in tRNA was detected in patients with ID (Shaheen et al., 2016). Taken together, the discovery of impaired cognition caused by mutations of TruA enzymes emphasizes their importance in neuronal development and maintenance.

#### Dyskerin and RluA-1

The pseudouridine synthase dyskerin is essential for H/ACA box-mediated pseudouridylation in humans (Heiss et al., 1998; Lafontaine et al., 1998). Mutations of the dyskerin-encoding gene DKC1 cause X-linked recessive dyskeratosis congenita (DKC), a rare progressive congenital disorder that mostly affects highly regenerative tissues, such as the skin and bone marrow (Heiss et al., 1998; Mochizuki et al., 2004). Cells of the affected patients have decreased telomerase activity and thus reduced telomere length, which may be responsible for the disease (Mitchell et al., 1999). Interestingly, expression analysis of Dyskerin 1 showed a high level in embryonic neural tissue, as well as in specific subsets of neurons in the cerebellum and olfactory bulb of adult brains (Heiss et al., 2000). While the function of dyskerin in hematopoiesis has been studied intensively, we still lack a detailed understanding of its potential function in the adult nervous system. In Drosophila melanogaster, RluA enzymes modify uridines in rRNA and tRNA. In situ hybridization in embryos revealed a specific localization of RluA-1 mRNA to dendrites of a subset of peripheral neurons, which raises the question of the molecular function of RluA-1 and its target RNAs in the peripheral nervous system during embryonic development (Wang et al., 2011).

### N6-METHYLADENOSINE (m6A)

m6A is an abundant mRNA modification that regulates nearly all aspects of mRNA processing including splicing, export, translation, stability, and decay (Meyer and Jaffrey, 2017; Roignant and Soller, 2017; Roundtree et al., 2017). This modification is catalyzed by a stable protein complex composed of two methyltransferases, Methyltransferase like-3 (Mettl3) and Methyltransferase like-14 (Mettl14) (Sledz and Jinek, 2016; Wang et al., 2016a,b; Schöller et al., 2018). Additional proteins required for m6A deposition are Wilms' tumor 1-associating protein (Wtap) (Liu et al., 2014; Ping et al., 2014; Wang et al., 2014), Vir like m6A methyltransferase associated (Virma) (Schwartz et al., 2014b; Yue et al., 2018), Zinc finger CCCH domain-containing protein 13 (Zc3h13) (Guo et al., 2018; Knuckles et al., 2018; Wen et al., 2018), RNA binding protein 15 (Rbm15) and its paralog Rbm15B (Patil et al., 2016). Mettl3 has the catalytic activity and can accommodate the SAM substrate, while Mettl14 serves to stabilize the binding to RNA (Sledz and Jinek, 2016; Wang et al., 2016a,b; Schöller et al., 2018). In vertebrates, m6A modification is dynamically regulated and can be reversed by two demethylases belonging to the family of α-ketoglutarate-dependent dioxygenases, Fat mass and obesity-associated protein (FTO) and ALKBH5 (Jia et al., 2011; Zheng et al., 2013). Recent advances in techniques to map m6A modification in a transcriptome-wide manner enabled the identification of thousands of modified mRNAs and lncRNAs (Dominissini et al., 2012; Meyer et al., 2012). While m6A has been implicated in many physiological processes, increasing evidence suggests an importance of m6A modification in brain development and in the function of the nervous system.

#### m6A Writer Complex

m6A levels are particularly high in the nervous system, as shown in the developing mouse brain (Meyer et al., 2012), and in heads of adult flies (Lence et al., 2016). Furthermore, a recent study detected higher m6A content in the mouse cerebellum and in neurons compared to glia (Chang et al., 2017). Using in situ hybridization in zebrafish embryos, Ping et al. showed that Wtap is ubiquitously expressed at 36 h postfertilization with enrichment in the brain region (Ping et al., 2014). Consistently, Wtap depletion using morpholino treatment resulted in severe developmental defects, including smaller brain ventricles and a curved notochord at 24 h postfertilization. The importance of m6A during neuronal development was further demonstrated by depletion of METTL3 in human embryonic stem cells (hESC), which strongly impaired neuronal differentiation (Batista et al., 2014), as well as the formation of mature neurons from embryoid bodies (Geula et al., 2015). Notably, m6A mRNA modification is essential for mouse survival, as mice lacking Mettl3 die at E6.5 (Geula et al., 2015). However, two recent studies performed a conditional KO (cKO) of Mettl14 specifically in neurons and revealed an essential role of m6A in embryonic cortical neurogenesis (Yoon et al., 2017; Wang et al., 2018). Mettl14 cKO animals showed decreased NSC proliferation and premature differentiation of NSCs (Wang et al., 2018), as well as delayed specification of different neuronal subtypes during brain development (Yoon et al., 2017). Yoon et al. further demonstrated that m6A modification is required for timely decay of transcripts involved in stem cell maintenance and cell cycle regulation in cortical neuronal progenitors. This allows accurate progression of the cell cycle and in turn induces the spatiotemporal formation of different neuronal subtypes. Interestingly, the authors also observed that many transcripts linked to mental disorders (autism, schizophrenia) are m6A modified in human, but not in mouse, cultures of neuronal progenitor cells (NPC), raising the possibility that m6A specifically regulates these human diseases (Yoon et al., 2017). Consistent with this hypothesis, polymorphisms in ZC3H13 have been associated with schizophrenia (Oldmeadow et al., 2014).

Beyond the role in neuronal development, m6A modification also plays a critical role in the process of axon regeneration in mature mouse neurons (Weng et al., 2018). Weng et al. showed that upon peripheral nerve injury m6A levels of many transcripts in dorsal root ganglion (DRG) were elevated, which led to increased translation during the time of axon regeneration, via the specific m6A reader protein Ythdf1. Mettl14 and Ythdf1 conditional KO mice displayed strong reduction of sensory axon regeneration, resulting from reduced protein synthesis, revealing the critical role of m6A modification in response to injury (Weng et al., 2018).

In Drosophila, loss of components of the methyltransferase complex results in severe locomotion defects due to altered neuronal functions (Haussmann et al., 2016; Lence et al., 2016; Kan et al., 2017). Mettl3 mutants display alterations in walking speed and orientation, which can be rescued by ectopic expression of Mettl3 cDNA in neurons. Whether a particular subset of neurons is responsible for the observed alterations awaits further investigations. Interestingly, another member of the m6A methyltransferase complex, Nito (RBM15 in human), was recently shown to control axon outgrowth, branching and to regulate synaptic bouton formation via the activity of the CCAP/bursicon neurons (Gu et al., 2017), providing first insights toward addressing this question.

#### m6A Readers

As mentioned above, most characterized functions of m6A rely on the direct binding of the so-called m6A "reader" proteins to the modified site. The best-studied m6A readers are the YTH domain-containing proteins, which can specifically bind m6A via their YTH domain (Luo and Tong, 2014; Theler et al., 2014; Xu et al., 2014). RNA in situ hybridization of rat brain sections showed that one particular member of the YTH family, Ythdc1, is enriched in specific cells in the brain (Hartmann et al., 1999). Interestingly, in a yeast two-hybrid screen to identify Ythdc1-interacting proteins, libraries from P5 and E16 brains were screened and the rat homolog of Sam68 was found as the main interactor (Hartmann et al., 1999). Sam68 is known to regulate neuronal activity-dependent alternative splicing events (e.g., of neurexin-1) (Iijima et al., 2011). In line with this function, in situ hybridization assays showed that Drosophila Ythdc1 localizes in the ventral neurectoderm and central nervous system of Drosophila embryos (Lence et al., 2016), and a reduced level of Ythdc1 was found to enhance SCA1-induced neurodegeneration (Fernandez-Funez et al., 2000).

Apart from the YTH-protein family of conventional m6A reader proteins, a number of other proteins that bind RNA in an m6A-dependent fashion have recently been identified (Edupuganti et al., 2017). Among them, Fragile X mental retardation protein (FMRP, also known as POF; FMR1; POF1; FRAXA) was shown to preferentially bind an RNA probe containing m6A sites (Edupuganti et al., 2017). FMRP plays critical roles in synaptic plasticity and neuronal development. Its loss of function in humans leads to the Fragile X syndrome, which is the most prevalent form of inherited ID and the foremost monogenic cause of autism (Bardoni et al., 2001; Lubs et al., 2012; Wang et al., 2012; Hagerman and Polussa, 2015). FMRP exerts these functions through the regulation of alternative mRNA splicing, mRNA stability, mRNA dendritic transport and postsynaptic local protein synthesis of a subset of mRNAs (Antar et al., 2006; Didiot et al., 2008; Bechara et al., 2009; Ascano et al., 2012; Guo et al., 2015). Moreover, it represses mRNA translation during the transport of dendritic mRNAs to postsynaptic dendritic spines and activates mRNA translation of a subset of dendritic mRNAs at synapses (Bechara et al., 2009; Fähling et al., 2009). Consistent with a potential interplay between FMRP and m6A, a recent study found that m6A is present on many synaptic mRNAs that are known targets of FMRP (Chang et al., 2017). Future research will seek to further illuminate the potential role of FMRP within the m6A pathway. It is interesting to note that Nm also appears to contribute to FMRP-mediated translation regulation at synapses. FMRP can form a complex with the non-coding RNA, brain cytoplasmic RNA (BC1), to repress translation of a subset of FMRP target mRNAs (Zalfa et al., 2003). This interaction is modulated by the Nm status of BC1 RNA. In both the nucleus and the cytoplasm of the cell body, Nm is present on BC1, but it is virtually absent at synapses (Lacoux et al., 2012). The authors suggested that changes in the 2′-O-methylation status of BC1 RNA contribute to the fine-tuned regulation of gene expression at synapses and consequently to neuronal plasticity by influencing FMRP local translational control. This example supports a likely combinatorial role of RNA modifications in the regulation of similar targets and/or processes during brain function.

#### m6A Erasers

Two proteins in humans were reported to act as m6A erasers: Fat mass and obesity-associated protein (FTO) and AlkB homolog 5 (ALKBH5) (Jia et al., 2011; Zheng et al., 2013). Both belong to the family of Fe2+- and α-ketoglutarate-dependent dioxygenases and catalyze the removal of the methyl group of m6A by oxidation. Interestingly, even though ALKBH5 is only moderately expressed in the brain, it has been associated with mental disorders. Du et al. found that certain polymorphisms within the ALKBH5 gene correlate with major depressive disorder (MDD), suggesting an involvement of ALKBH5 in conferring risk of MDD (Du et al., 2015).

In comparison, FTO is highly expressed in the human brain, especially in the hypothalamus and the pituitary gland, and displays dynamic expression during postnatal neurodevelopment. Polymorphic alleles of FTO in humans were identified to increase the risk for hyperactive disorder (Choudhry et al., 2013) and for Alzheimer's disease (Keller et al., 2011; Reitz et al., 2012), and to affect brain volume (Melka et al., 2013; Li et al., 2017a). Molecular and functional studies have shown that FTO knockout mice display altered behavior, including locomotion defects, and that FTO also influences learning and memory. For instance, fear-conditioned mice showed a significant increase in m6A intensity on several neuronal targets, and knockdown of FTO further enhanced consolidation of cued fear memory (Widagdo et al., 2016). In line with a role in learning and memory, FTO deficiency reduces the proliferation and neuronal differentiation of adult NSCs, which leads to impaired learning and memory (Li et al., 2017a). Electrophysiological tests in FTO knockout mice demonstrated an impaired dopamine type 2 and 3 receptor response, resulting in an abnormal response to cocaine (Hess et al., 2013). Strikingly, a recent study in mouse embryonic dorsal root ganglia found that FTO is enriched and specifically expressed in axons, influencing translation of axonal mRNAs (Yu et al., 2018). This demonstrates the dynamic role of m6A modification in regulating local translation. However, it is important to stress that FTO was also demonstrated to demethylate m6Am, which is present next to the m7G cap modification (Mauer et al., 2017). In comparison to m6A, m6Am also contains a methyl group on the ribose. As available antibodies recognize both modifications indistinguishably, it is currently difficult to assign their respective contributions in the context of brain activity.

### CONCLUSION

The association of aberrant RNA modifications with various neurological disorders highlights the importance of these chemical moieties for proper brain development and cognition. However, the role of RNA modifications in these processes is not yet completely understood. One of the current challenges lies in identifying the class and identity of RNAs that are targeted by RNA modification enzymes and are causative of the neurological defects. Common targets for many of these enzymes are tRNAs and rRNAs; thus it is likely that in many cases their dysfunction plays an important role in the etiology of the disease. Yet, this does not explain why the phenotypes observed upon mutations of these enzymes are often restricted to the brain. Some of these enzymes or snoRNAs are predominantly expressed in the nervous system, which indicates their importance in this tissue and suggests the existence of differentially modified ribosomes (also recently called specialized ribosomes) that may carry distinct functions (Briggs and Dinman, 2017; Sloan et al., 2017). However, other snoRNAs or enzymes exhibit wider tissue distribution, suggesting that brain-specific phenotypes may reflect a higher sensitivity to altered translation in this tissue compared to other organs. This would be consistent with other disease-causing mutations in ribosomal proteins or tRNA synthetase genes that also manifest their effect specifically in the nervous system (Antonellis et al., 2003; Antonellis and Green, 2008; Yao and Fox, 2013; Brooks et al., 2014). What could be the reason(s) behind this increased sensitivity? It has been observed that tRNA levels are in general higher in the brain compared to other tissues, suggestive of a bigger translational demand (Dittmar et al., 2006). This neuronal specificity may arise from the local translation that occurs at synapses upon environmental changes. In this context, RNA modifications represent an attractive system to regulate this acute need in a dynamic and flexible manner. However, other examples point toward a translation-independent mechanism. For instance, in the case of FTSJ1 mutations and the associated NSXLID, neither the amount nor the charging of the concerned tRNAs appears to be affected (Guy et al., 2012, 2015), seemingly ruling out a general translation defect. Perhaps in this case, the absence of modification can stimulate tRNA cleavage and generate tRFs that can also interfere with translation. This increase in tRFs would not necessarily be associated with a corresponding reduction of the uncleaved tRNA since tRNA levels are tightly regulated (Wilusz, 2015). Alternatively, the absence of modification on the tRNA may affect its interaction with the ribosome and thus influence translation efficiency or fidelity. Therefore, careful examination at different molecular levels is required to appreciate the effect of tRNA and rRNA modification enzyme mutations on translation and their consequence on the neurological phenotype.

That being said, the general picture is probably more complex. For instance, recent reports show that tRFs not only interfere with translation but can also affect transposon regulation and genome stability (Durdevic et al., 2013; Martinez et al., 2017; Schorn et al., 2017; Zhang et al., 2017). This activity could in principle also contribute to neurological disorders, as growing evidence suggests associations between (re)expression of transposable elements and the occurrence of neuropathies (Perrat et al., 2013; Krug et al., 2017; Zahn, 2017; Jacob-Hirsch et al., 2018). In addition, beyond tRNA and tRF, the recent studies on m6A clearly demonstrate the involvement of mRNA modification in different aspects of neuronal development and regulation. The large diversity of RNA processing events in the brain, including the high rates of alternative splicing and recursive splicing (Duff et al., 2015; Sibley et al., 2015), the inclusion of microexons (Irimia et al., 2014) and the biogenesis of circular RNAs (circRNAs), can all in principle be affected by RNA modifications. For instance, m6A modification was recently found on circRNAs, where it enables their translation (Yang et al., 2017; Zhou et al., 2017). Given that misexpression of circRNAs has been associated with neurological disorders (Shao and Chen, 2016; van Rossum et al., 2016; Li et al., 2017b), some of the brain functions of m6A may rely on circRNA-mediated regulation. Thus, because neurons face distinct challenges with regard to localization of RNAs to distal processes and to localized translation, it is likely that in the brain, more than in any other tissue, a combinatorial effect of RNA modifications on different classes of RNAs represents a critical informational layer that dynamically fine-tunes gene regulation (**Figure 2**). Exciting discoveries lie ahead in deciphering this intricate epitranscriptomic code.

FIGURE 2 | RNA modifications are implicated in various neuronal processes. Distinct RNA modifications of tRNAs, small RNAs and mRNAs are required for common biological processes during brain development (left), neuronal differentiation (middle), and proper functioning of individual neurons (right) (see also Table 1). N6-methyladenosine (m6A), pseudouridine (ψ), 5-methylcytosine (m5C), and 2′-O-methylation (Nm). The RNA classes in brackets are the ones studied so far. Additional types with important functions may be modified as well.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### FUNDING

The work presented was supported by DFG (RO4681/5-1), the DIP (RO4681/6-1) to J-YR, by the IBPS Actions Incitatives 2018 to CC, the French MESR to DD and an ARC Ph.D. fellowship to MA. CC acknowledges the Réseau André Picard for its financial support. J-YR and CC are members of the European Epitranscriptomic COST Action (CA16120).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Angelova, Dimitrova, Dinges, Lence, Worpenberg, Carré and Roignant. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Link Between m6A Modification and Cancers

#### Zhen-Xian Liu<sup>1†</sup>, Li-Man Li<sup>1†</sup>, Hui-Lung Sun<sup>2</sup>\* and Song-Mei Liu<sup>1</sup>\*

*<sup>1</sup> Center for Gene Diagnosis, Zhongnan Hospital of Wuhan University, Wuhan, China, <sup>2</sup> Department of Chemistry and Institute for Biophysical Dynamics, Howard Hughes Medical Institute, University of Chicago, Chicago, IL, United States*


N6-methyladenosine (m6A) epitranscriptional modification has recently gained much attention. Through the development of m6A sequencing, the molecular mechanism and importance of m6A have been revealed. m6A is the most abundant internal modification in higher eukaryotic mRNAs and plays crucial roles in mRNA metabolism and multiple biological processes. In this review, we introduce the characteristics of m6A regulators, including "writers" that create the m6A mark, "erasers" that show demethylation activity and "readers" that decode m6A modification to govern the fate of modified transcripts. Moreover, we highlight the roles of m6A modification in several common cancers, including solid and non-solid tumors. The regulators of m6A play major roles in cancer development, affecting processes such as proliferation, migration and invasion. In particular, as the underlying mechanisms are uncovered, m6A and its regulators are expected to become targets for the diagnosis and treatment of cancers.

#### Keywords: m6A, mRNA, cancers, function, structures

### INTRODUCTION

More than 100 kinds of chemical modifications of RNA have been identified in living organisms (Boccaletto et al., 2018). Studies have widely reported certain types of RNA modifications in eukaryotic mRNA, including N1-methyladenosine (m1A), N6-methyladenosine (m6A) and 5-methylcytosine (m5C), among which m6A was first discovered in the 1970s. m6A is the most abundant internal modification of mRNA and long noncoding RNA (lncRNA) in the majority of eukaryotes. In addition, m6A clusters significantly around the stop codon and in the 3′ untranslated region (3′UTR) (Dominissini et al., 2012; Meyer et al., 2012; Bodi et al., 2015). m6A modification mostly occurs at the RRACH motif (R denotes A or G, H denotes A, C, or U) (Narayan and Rottman, 1988; Csepany et al., 1990; Narayan et al., 1994).
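The RRACH consensus can be made concrete with a short sequence scan. The following is a minimal sketch, not a method from the studies cited: the function name `find_rrach_sites` and the example sequence are hypothetical, and a motif match only marks a candidate position, since not every RRACH site in a transcript is actually methylated.

```python
import re

# RRACH expanded to explicit nucleotides: R = A or G, H = A, C, or U.
# The lookahead makes the scan report overlapping occurrences too.
RRACH = re.compile(r"(?=([AG][AG]AC[ACU]))")

def find_rrach_sites(rna: str) -> list[tuple[int, str]]:
    """Return (0-based index of the central A, matched 5-mer) per RRACH hit.

    Hypothetical helper for illustration; it flags candidate sites only.
    """
    seq = rna.upper().replace("T", "U")  # also accept DNA-style input
    return [(m.start() + 2, m.group(1)) for m in RRACH.finditer(seq)]

# Made-up sequence containing two motif instances (GGACU and GGACA).
example = "CUGGACUAAGGACAUUU"
for position, kmer in find_rrach_sites(example):
    print(position, kmer)
```

The reported index is that of the adenosine that would carry the methyl group (the third position of the 5-mer), which is the coordinate most m6A maps refer to.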

As shown in **Figure 1**, the formation of m6A is a reversible process (Jia et al., 2013). m6A "writers" with methyltransferase activity consist of three individual proteins: methyltransferase-like 3 (METTL3), methyltransferase-like 14 (METTL14), and Wilms' tumor 1 associating protein (WTAP). Fat mass and obesity-associated protein (FTO) and alkB homolog 5 (ALKBH5) are m6A demethylases (Jia et al., 2011; Zheng et al., 2013). Another protein family, the m6A "readers," recognizes m6A modification to modulate mRNA fate (Li et al., 2017a).

m6A modification regulates mRNA at different levels, including structure, maturation, stability, splicing, export, translation and decay (Liu and Zhang, 2018). Moreover, m6A is also involved in cell fate decision, cell cycle regulation, cell differentiation and circadian rhythm maintenance (Wu et al., 2018). Furthermore, multiple RNA-binding proteins are affected by m6A. For example, heterogeneous nuclear ribonucleoprotein G (HNRNPG), a new m6A reader protein, uses a low-complexity region to recognize a motif exposed by m6A modification (Liu et al., 2017). HNRNPC and HNRNPA2B1 are two abundant nuclear RNA-binding proteins (Dai et al., 2018). Increasing evidence points to roles for m6A in human diseases. Here, we summarize the functions and roles of m6A regulators in diverse cancers (Wang et al., 2017a), such as acute myeloid leukemia (AML), glioblastoma (GBM), lung cancer and liver cancer, with the aim of elucidating the contributions of m6A to the cancer process.

#### Edited by:

*Mario Acunzo, Virginia Commonwealth University, United States*

#### Reviewed by:

*Mattia Pelizzola, Fondazione Istituto Italiano di Tecnologia, Italy Ioannis S. Vizirianakis, Aristotle University of Thessaloniki, Greece*

#### \*Correspondence:

*Song-Mei Liu smliu@whu.edu.cn Hui-Lung Sun huilung@uchicago.edu*

*†These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

> Received: *05 February 2018* Accepted: *12 June 2018* Published: *13 July 2018*

#### Citation:

*Liu Z-X, Li L-M, Sun H-L and Liu S-M (2018) Link Between m6A Modification and Cancers. Front. Bioeng. Biotechnol. 6:89. doi: 10.3389/fbioe.2018.00089*

### PROTEINS INVOLVED IN m6A METHYLATION

#### m6A "Writers"

METTL3, METTL14, and WTAP form the m6A methyltransferase complex (Bokar et al., 1997; Liu et al., 2014; Ping et al., 2014). METTL3, a 70-kDa protein, was the first m6A "writer" to be identified (Bokar et al., 1997). A recent study indicated that knockdown of METTL3, METTL14, and WTAP decreases the m6A level in polyadenylated RNA. Gel filtration experiments revealed that METTL3 and METTL14 form a stable METTL3-14 complex with a 1:1 stoichiometric ratio, to which WTAP then binds (Liu et al., 2014). Subsequent crystallization and structure determination demonstrated that the METTL3-14 heterodimer is asymmetric. In the crystallized METTL3-14 complex, both METTL3 and METTL14 contain an MTA-70 methyltransferase domain; two CCCH-type zinc finger motifs are located in the N-terminal region of METTL3, and an N-terminal extension is present in METTL14 (Sledz and Jinek, 2016). METTL3 acts as the catalytic core, transferring a methyl group from S-adenosylmethionine (SAM) to the acceptor adenine moiety, while METTL14 serves as the RNA-binding platform, promoting binding of the RNA substrate and enhancing the integrity of the complex (Wang et al., 2016a,b). The METTL3-14 heterodimer induces m6A deposition on nuclear RNA. WTAP does not possess methyltransferase activity (Ping et al., 2014); however, it interacts with the METTL3-14 complex to affect its m6A methyltransferase activity in vivo and its localization in nuclear speckles (Liu et al., 2014).

#### m6A "Erasers"

FTO and ALKBH5 have been reported to exhibit demethylation activity (Jia et al., 2011; Zheng et al., 2013). The level of m6A in mRNA increases after FTO knockdown, while it notably decreases after overexpression of wild-type FTO (Jia et al., 2011). However, Mauer et al. reported that, compared with m6A, FTO showed higher affinity for N6,2′-O-dimethyladenosine (m6A<sup>m</sup>), a reversible modification influencing cellular mRNA fate. FTO preferentially demethylated m6A<sup>m</sup> and reduced the stability of m6A<sup>m</sup> mRNA (Mauer et al., 2017). The difference across the studies may be caused by the location of FTO in different cell lines. ALKBH5 localizes in the nucleus, and the level of m6A in mRNA significantly decreases in cells overexpressing ALKBH5 (Zheng et al., 2013). ALKBH5 is strictly selective for its substrate during catalysis, and the loss of ALKBH5 impairs RNA metabolism, mRNA export and assembly (Zheng et al., 2013).

#### m6A "Readers"

Remarkably, m6A regulates gene expression through m6A "readers," a group of proteins that recognize the m6A modification. The highly conserved YT521-B homology (YTH) domain family proteins include YTHDF1, YTHDF2, and YTHDF3 in the cytoplasm, and YTH domain containing 1 (YTHDC1) in the nucleus (Wang et al., 2014a, 2015; Xu et al., 2014; Xiao et al., 2016; Shi et al., 2017). YTHDF1 promotes the translation of m6A-methylated mRNA, YTHDF2 accelerates the decay of m6A-methylated mRNA, and YTHDF3, together with YTHDF1 and YTHDF2, noticeably enhances the metabolism of m6A-methylated mRNA in the cytoplasm (Shi et al., 2017). In nuclear speckles, YTHDC1 influences mRNA splicing by facilitating SRSF3 but inhibiting SRSF10 (Roundtree et al., 2017). Although YTHDC1 knockdown does not significantly alter the distribution of non-target transcripts, it can affect the transport of mature mRNA from the nucleus to the cytoplasm (Xiao et al., 2016). YTH domain containing 2 (YTHDC2) preferentially binds to m6A-containing transcripts, resulting in a decrease in mRNA abundance and an enhancement of translation efficiency via interaction with the translation initiation and decay machineries (Hsu et al., 2017).

#### CONSEQUENCE OF M6A METHYLATION

#### Impact of m6A on Splicing and Export

mRNA is spliced into a mature transcript and exported from the nucleus to the cytoplasm, where it can be translated into protein. mRNA nuclear export is therefore a vital step connecting transcription with translation, and it also regulates gene expression (Wickramasinghe and Laskey, 2015). Depletion of METTL3 delays the export of mature mRNA. In addition, the circadian period is elongated by prolonged nuclear retention of the mature mRNA of the clock genes Per2 and Arntl (Fustin et al., 2013). ALKBH5 plays vital roles in mRNA export as well as in RNA metabolism and the assembly of nuclear speckle proteins (Zheng et al., 2013). YTHDC1 can interact with SRSF3, a nuclear export adaptor protein and splicing factor, to modulate the binding of RNA with SRSF3 and NXF1; the SRSF3-NXF1 and YTHDC1-SRSF3 protein complexes then lead m6A-modified mRNA into the export pathway (Roundtree et al., 2017). FTO mediates alternative splicing of nuclear pre-mRNA. For instance, FTO controls alternative splicing of RUNX1T1, an adipogenesis-related transcription factor, thereby affecting adipogenesis (Zhao et al., 2014). Furthermore, FTO preferentially binds to pre-mRNA in intronic regions, and FTO knockdown leads to substantial changes in pre-mRNA splicing, with exon skipping events (Bartosovic et al., 2017).

#### Impact of m6A on Translation

Translation is regulated by m6A modification through several mechanisms. YTHDF1 is known to promote translation efficiency by binding to m6A, and translation efficiency decreases notably after YTHDF1 knockdown. YTHDF1 thus ensures efficient protein production from m6A-modified transcripts (Wang et al., 2015). In addition, YTHDF1 recruits eukaryotic initiation factor 3 (eIF3) to directly bind a single m6A modification in the 5′UTR, which facilitates ribosome loading and recruits the 43S complex to promote translation (Meyer et al., 2015).

According to previous studies, m6A sites are mainly enriched around the stop codon and in the 3′UTR (Dominissini et al., 2012; Meyer et al., 2012). Interestingly, heat shock stress induces an elevated m6A peak in the 5′UTR. Upon heat shock stress, nuclear YTHDF2 limits demethylation by the m6A "eraser" FTO to preserve 5′UTR methylation of stress-induced transcripts. The increased 5′UTR m6A methylation promotes cap-independent translation initiation, providing a mechanism for selective mRNA translation under heat shock stress (Zhou et al., 2015). During translation, YTHDF3 interacts with ribosomal 40S/60S subunits and significantly enhances the translation efficiency of m6A-methylated mRNAs that are targeted by both YTHDF1 and YTHDF3 (Li et al., 2017a).

METTL3 also directly promotes translation of certain mRNAs in human cancer cells by recruiting eIF3. This translational control by METTL3 is independent of the reader proteins YTHDF1 and YTHDF2 and of the binding partners METTL14 and WTAP (Lin et al., 2016).

#### Impact of m6A on RNA Stability

The stability of RNA is closely associated with the m6A-dependent degradation process. Knockout of METTL3 and METTL14 in embryonic stem cells (ESCs) can reduce mRNA decay, leading to increased expression of target mRNAs, resistance to differentiation, enhanced self-renewal and maintenance of the pluripotent state (Liu et al., 2014; Wang et al., 2014b). An RNA-binding protein, human antigen R (HuR), can increase RNA stability; however, m6A modification interferes with HuR binding (Srikantan et al., 2012).

YTHDF2 directly modulates the mRNA decay pathway in an m6A-dependent way (Wang et al., 2014a). YTHDF2 has about 16-fold higher binding affinity for m6A-marked mRNA than for unmethylated mRNA, and the bound transcripts show a significantly increased decay rate and a shorter half-life (Batista et al., 2014; Fu et al., 2014). The carboxy-terminal domain of YTHDF2 selectively binds to methylated mRNA and directs YTHDF2-bound transcripts to decay sites (Wang et al., 2014a). Furthermore, the N-terminal region of YTHDF2 interacts with the SH domain of the CNOT1 subunit, recruiting the CCR4-NOT complex to accelerate the deadenylation of YTHDF2-bound m6A-containing mRNA (Du et al., 2016).
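The relationship between decay rate and half-life referred to above can be illustrated with a minimal sketch, assuming simple first-order (exponential) mRNA decay, which the text does not state explicitly. The rate constants below are invented for illustration only and are not measurements from the cited studies.

```python
import math


def half_life(decay_rate: float) -> float:
    """Half-life under first-order decay: t_1/2 = ln(2) / k."""
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    return math.log(2) / decay_rate


def fraction_remaining(decay_rate: float, t: float) -> float:
    """Fraction of the initial transcript pool left after time t: exp(-k * t)."""
    return math.exp(-decay_rate * t)


# Hypothetical rate constants (per hour), chosen only for illustration.
k_unmethylated = 0.10
k_m6a_marked = 0.20  # assumed faster decay for a YTHDF2-bound transcript

print(f"unmethylated half-life: {half_life(k_unmethylated):.2f} h")
print(f"m6A-marked half-life:   {half_life(k_m6a_marked):.2f} h")
```

Under this assumption, doubling the decay rate constant halves the half-life, which is the sense in which a higher decay rate and a shorter half-life describe the same effect.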

#### Roles of m6A Methylation in Cancers

Currently, increasing studies focus on the potential links between m6A and cancers. In different tumors, the effect of m6A modification could be different. The changes of m6A also affect the progression of tumors, including proliferation, growth, invasion and metastasis (**Figures 2**–**4**).

#### Acute Myeloid Leukemia (AML)

The common types of leukemia are acute lymphoblastic leukemia (ALL), chronic lymphocytic leukemia (CLL), acute myeloid leukemia (AML) and chronic myeloid leukemia (CML). With regard to incidence and mortality, AML ranks first among hematopoietic malignancies. Moreover, different cytogenetic abnormalities [t(8;21)(q22;q22), inv(16)(p13q22)/t(16;16)(p13;q22), t(15;17)(q22;q11∼21), and abn(11q23)] and molecular abnormalities (FLT3-ITD, MLL PTD and NPM1 mutations) show different pathogenesis (Frohling et al., 2005). Although new therapies for AML, such as epigenetic targeted drugs and immunotherapies (Yang and Wang, 2018), have been developed, survival rates have not improved significantly.

In recent years, the molecular mechanisms of m6A "writers" and "erasers" in leukemia have been explored (**Figure 2**). METTL3 expression is increased in AML patients and plays an oncogenic role. METTL3 inhibits cell differentiation and apoptosis, and promotes cell proliferation by increasing the translation of c-MYC, BCL-2, and PTEN. METTL3 also activates the PI3K/AKT pathway to control cell differentiation and self-renewal (Vu et al., 2017). Interestingly, METTL3 acts in a chromatin-based pathway independently of METTL14 by localizing to the transcriptional start sites of active genes through the CAATT-box binding protein (CEBPZ), resulting in increased translation of the corresponding mRNA (Barbieri et al., 2017). In addition, almost all "writers" are overexpressed in AML, including METTL3, METTL14, WTAP, and KIAA1429. METTL14 plays an inhibitory role in normal myelopoiesis and suppresses cell differentiation in AML. In the SPI1-METTL14-MYB/MYC axis, METTL14, which is downregulated by SPI1, exerts an oncogenic role by enhancing the self-renewal of leukemia stem/initiating cells and repressing myeloid differentiation through m6A-dependent regulation of MYB and MYC (Weng et al., 2017). WTAP, another m6A writer, has also been associated with AML. Elevated WTAP promotes cell proliferation and inhibits cell differentiation in AML. WTAP also directly associates with Hsp90 (Bansal et al., 2014), a molecular chaperone that maintains the stability of many tumor-promoting oncoproteins (Whitesell and Lindquist, 2005). More importantly, knockdown of WTAP combined with etoposide treatment significantly promotes apoptosis, whereas this does not occur without etoposide treatment (Bansal et al., 2014). These outcomes suggest that WTAP is a potential target for the treatment of AML.

Although neither FTO nor ALKBH5 affects the survival rates of AML patients (Vu et al., 2017), FTO is highly expressed in certain subtypes of AML, such as MLL-rearranged AML, acute promyelocytic leukemia (APL), and t(11q23) and t(15;17) AMLs. FTO reduces the m6A levels of ASB2 and RARA by targeting their UTRs and affects the efficiency of all-trans-retinoic acid (ATRA) treatment (Li et al., 2017c). R-2-hydroxyglutarate (R-2HG), which accumulates in isocitrate dehydrogenase 1/2 (IDH1/2) mutant cancers, can increase global m6A by inhibiting FTO, resulting in decreased MYC/CEBPA mRNA stability and inhibition of leukemia proliferation (Su et al., 2018).

Analysis of datasets from The Cancer Genome Atlas Research Network (TCGA) has indicated that genetic alterations of m6A regulatory genes, including "writers," "readers," and "erasers," are associated with inferior cytogenetic risk in AML (Kwok et al., 2017). METTL3, METTL14, WTAP, FTO, and R-2HG, which interacts with FTO, are potential therapeutic targets for AML. However, studies on the "readers" of m6A in AML have not yet been reported.

#### Glioblastoma (GBM)

Glioblastoma [GBM; WHO grade IV (Xi et al., 2016)] is an invasive malignant primary brain tumor with a median survival of 10–11 months and a poor quality of life (Rick et al., 2018). With the development of epigenetics, the correlation of m6A with GBM has been uncovered in recent years (**Figure 3**).

In GBM, the level of m6A on mRNA is reduced; however, when glioblastoma stem-like cells (GSCs) are induced to differentiate, the m6A level increases. The effect of METTL3 or METTL14 knockdown in promoting GSC growth and self-renewal has been further confirmed by overexpression experiments. Additionally, depletion of METTL3 or METTL14 enhances tumorigenicity, while FTO inhibitor treatment prevents tumor deterioration. Moreover, knockdown of METTL3 or METTL14 alters gene expression, leading to elevation of oncogenes including ADAM19, EPHA3, and KLF4. The FTO inhibitor MA2 can prolong the life span of GSC-transplanted animals, suggesting that m6A is a potential target for the treatment of GBM (Cui et al., 2017). Genes and pathways related to m6A modification could therefore serve as promising molecular targets for GBM treatment.

One study has shown that, in GSCs, METTL3 is increased, METTL14 and ALKBH5 are decreased, and FTO shows no significant change. METTL3 mediates GSC maintenance and dedifferentiation by regulating the stability of SOX2 mRNA through installing m6A on the SOX2 3′UTR. The intact structures of METTL3 and HuR are vital to this process. Furthermore, suppression of METTL3 inhibits GSC growth and neurosphere formation, and reduces the expression of the stem cell-specific marker SSEA1 and of glioma reprogramming factors (including POU3F2, OLIG2, SALL2, and SOX2). In particular, SOX2 mRNA has a high affinity for METTL3. Crucially, depletion of METTL3 leads to increased radiation sensitivity and decreased DNA repair, providing a direction for overcoming radiation tolerance (Visvanathan et al., 2018).

WTAP is overexpressed in GBM. WTAP enhances cell proliferation, migration, invasion and tumorigenicity of glioblastoma cells in xenograft via mediating phosphorylation of epidermal growth factor receptor (EGFR) and AKT. Besides, WTAP regulates the expression of certain genes related to motility of cancer cells, such as chemokine ligand 2 (CCL2), chemokine ligand 3 (CCL3), matrix metallopeptidase 3 (MMP3), lysyl oxidase-like 1 (LOXL1), hyaluronan synthase 1 (HAS1), and thrombospondin 1 (THBS1) (Jin et al., 2012). The high expression of WTAP is an independent negative prognostic factor that is associated with age and WHO grade, predicting poor overall survival for GBM patients (Xi et al., 2016). Therefore, WTAP may be a prognostic marker for GBM.

In contrast to the study mentioned above (Visvanathan et al., 2018), ALKBH5 has been reported to be elevated in GSCs (Zhang et al., 2017), enhancing cell self-renewal, proliferation and tumorigenicity. ALKBH5 serves as a poor prognostic indicator in patients with glioma. ALKBH5 removes m6A from nascent transcripts of FOXM1 (a transcription factor that is highly expressed in GBM patients) by binding to the 3′UTR, enhancing FOXM1 expression. This process can be strengthened by a long noncoding RNA antisense to FOXM1 (FOXM1-AS). Depletion of ALKBH5 and FOXM1-AS inhibits GSC tumorigenesis via the FOXM1 axis. In addition, there is evidence that HuR, SOX2 and Nestin also play a crucial role in this process (Zhang et al., 2017).

#### Lung Cancer

Lung cancer includes small cell lung carcinoma (SCLC) and non-small-cell lung carcinoma (NSCLC), and NSCLC accounts for approximately 85% of all cases. Although the incidence and death rate have declined, the 5-year survival rates remain poor (Molina et al., 2008).

As seen in **Figure 4**, METTL3 enhances the translation of certain oncogenes, such as EGFR, TAZ, MAPKAPK2 (MK2), and DNMT3A, by recruiting eIF3 to the translation initiation complex, a function that is independent of METTL3 catalytic activity and of m6A readers. Furthermore, METTL3 plays a driving role in cancer cell growth, survival and invasion (Lin et al., 2016). Another study shows that miR-33a inhibits the proliferation of NSCLC cells by binding to the 3′UTR of METTL3 mRNA (Du et al., 2017). The links between microRNA, METTL3, and NSCLC suggest that METTL3 may be a novel target for NSCLC therapy.

#### Liver Cancer: Hepatocellular Carcinoma (HCC) and Cholangiocarcinoma (CCA)

#### Hepatocellular Carcinoma (HCC)

The majority of liver cancers are HCC. The incidence and mortality of HCC increase every year owing to the lack of precise early-stage diagnosis and of reliable prediction of tumor metastasis and postsurgical recurrence. The 5-year survival rate is only 18%; the mechanism and pathogenesis of HCC therefore urgently need to be addressed (Ma et al., 2017). Increasing evidence has shown that m6A and its regulators are critical for the development of liver cancer (**Figure 4**).

Currently, only two studies have revealed the association of m6A "writers" with HCC, focusing on METTL3 and METTL14, respectively. METTL3, which is increased in HCC, can facilitate HCC cell growth, migration and colony formation in vitro and enhance HCC tumorigenicity, growth and lung metastasis in vivo. The stability of suppressor of cytokine signaling 2 (SOCS2) mRNA, a downstream target of METTL3 and a tumor inhibitor in HCC, can be downregulated via YTHDF2. Clinically, a higher level of METTL3 might predict a poor prognosis in HCC. RBM15B, KIAA1429 and the m6A level on mRNA also increase in HCC, while METTL14 shows no significant change. However, perturbation of METTL14 clearly alters the proliferation, migration and colony formation of Huh-7 cells (Chen et al., 2017b). Interestingly, in another study, METTL14 and FTO were decreased in HCC while METTL3, WTAP, KIAA1429, and ALKBH5 showed no remarkable change. The downregulation of METTL14 acts as a poor prognostic indicator for recurrence-free survival in HCC and is closely associated with tumor metastasis. Specifically, METTL14 can promote the processing of pri-miR126 to mature miR126, a tumor suppressor in HCC metastasis, by mediating the recognition and binding of the microprocessor protein DGCR8 to pri-miRNA (Ma et al., 2017).

YTHDF2 is closely associated with the malignancy of HCC. YTHDF2 regulates mRNA degradation by recognizing mRNA m6A sites, leading to enhanced proliferation of HCC cells. miR-145, which is down-regulated in HCC patients, can suppress the expression of YTHDF2 by directly targeting the 3′UTR of YTHDF2 mRNA (Yang et al., 2017). Thus, miR-145 could be a candidate target for the treatment of liver cancer.

#### Cholangiocarcinoma (CCA)

Because it is usually diagnosed at an advanced stage, CCA has a poor prognosis and limited curative options, and the overall survival rate is rather low. Invasion and migration of cancer cells further worsen the outcome of CCA (Jo et al., 2013; Goldaracena et al., 2017; Kennedy et al., 2017).

The nuclear protein WTAP is upregulated in CCA and correlates positively with TNM stage, lymph node metastasis and vascular invasion. WTAP siRNA inhibits the migration, invasion and tumorigenicity of CCA cells, but not their proliferation. cDNA microarray and real-time PCR results have revealed that WTAP could promote the expression of metastasis-related genes, such as MMP7, MMP28, and Muc1 (Jo et al., 2013). However, it is not clear whether WTAP function in CCA is related to m6A methyltransferase activity. Further investigations remain to be done.

#### Breast Cancer

In breast cancer, METTL3, HBXIP (mammalian hepatitis B X-interacting protein) and let-7g miRNA form a positive feedback loop (HBXIP/let-7g/METTL3/HBXIP) to promote cell proliferation (**Figure 4**). HBXIP enhances the expression of METTL3 by suppressing the tumor suppressor let-7g. The increase of METTL3, in turn, enhances the level of HBXIP by facilitating m6A modification in mRNA (Cai et al., 2018).

Hypoxia is a vital feature of the tumor microenvironment that drives cancer progression. With hypoxia induction, hypoxia inducible factor (HIF) dependent ALKBH5 strengthens pluripotency factor NANOG mRNA stabilization and accumulation by demethylation (**Figure 4**). Moreover, elevation of ALKBH5 increases the expression of NANOG in breast cancer stem cells (Zhang et al., 2016a). Furthermore, depletion of ALKBH5 or ZNF217, an m6A methyltransferase inhibitor, suppresses NANOG and Kruppel-like factor 4 (KLF4) expression by increasing m6A-methylated RNA (Zhang et al., 2016b). Of note, depletion of ALKBH5 inhibits tumor formation, decreases the breast cancer stem cells and suppresses metastasis from breast to lung in immunodeficient mice (Zhang et al., 2016a,b). The hypoxic induction of pluripotency factor, ALKBH5 and ZNF217 expression depends on HIF, meaning that HIF is vital for breast cancer therapy.

#### Renal Cell Carcinoma (RCC)

METTL3 plays an inhibitory role in RCC, and a higher level indicates a better prognosis. The expression levels of METTL3 mRNA and protein decline in RCC. Depletion of METTL3 promotes cell proliferation, growth, and colony formation through activation of the PI3K-Akt-mTOR pathway, and enhances cell migration and invasion through the epithelial-mesenchymal transition (EMT) pathway. METTL3 knockdown also decreases cell cycle arrest in G1 phase and the expression of P21, which acts as a tumor suppressor (Li et al., 2017b). In short, METTL3 plays an essential role in the progression of RCC (**Figure 4**).

#### Pancreatic Cancer

High expression of YTHDF2 mRNA and protein in pancreatic cancer correlates positively with disease progression (**Figure 4**). In pancreatic cancer cells, YTHDF2 serves as a suppressor of adhesion, invasion, migration and EMT through YAP signaling, but acts as a promoter of proliferation via the Akt/GSK3b/CyclinD1 pathway (Chen et al., 2017a). YTHDF2 is a potential diagnostic and prognostic marker; whether it could be a new therapeutic target remains to be elucidated.

#### Colon Cancer

RNA helicase YTHDC2 acts as a promoter of colon cancer metastasis by enhancing the translation of the hypoxia-inducible factor-1alpha (HIF-1α) gene, which could promote EMT via the transcription factor Twist1 (**Figure 4**). Further, the 5′UTRs of both HIF-1α and Twist1 are targets of YTHDC2 under hypoxia. YTHDC2 is upregulated in colon cancer and shows a clear positive correlation with tumor stage, indicating that YTHDC2 may be a diagnostic marker for colon cancer patients (Tanabe et al., 2016). YTHDC2 serves as an RNA helicase rather than a reader of m6A in this study (Tanabe et al., 2016), although other studies have shown that YTHDC2 is an N6-methyladenosine binding protein (Hsu et al., 2017). Whether the YTHDC2 described in these two contexts is the same entity, and its exact mechanism, still need to be explored.

*NA, Not Available.*

*<sup>a</sup>Vu et al. (2017), Barbieri et al. (2017), Weng et al. (2017), Bansal et al. (2014), Li et al. (2017c), and Su et al. (2018).*

*<sup>b</sup>Xi et al. (2016), Cui et al. (2017), Visvanathan et al. (2018), Jin et al. (2012), and Zhang et al. (2017).*

*<sup>c</sup>Cai et al. (2018), Zhang et al. (2016a), and Zhang et al. (2016b).*

#### Cervical Cancer

In cervical cancer, decreased m6A is closely related to tumor size, FIGO stage, differentiation, lymph node infiltration and tumor recurrence (**Figure 4**). The reduction of m6A promotes cell proliferation and tumor development (Wang et al., 2017b). In particular, FTO might upregulate both the mRNA and protein expression of β-catenin, which interacts with excision repair cross-complementation group 1 (ERCC1). Ultimately, this process leads to chemo-radiotherapy resistance in cervical squamous cell carcinoma, the major type of cervical cancer. More importantly, although FTO expression is not associated with 5-year overall survival, patients with high FTO and β-catenin expression have a poor prognosis (Zhou et al., 2018). These findings on FTO function in the response to cervical cancer treatment expand our understanding of the role of m6A in cancer development.

#### CONCLUSIONS AND OUTLOOKS

RNA epigenetics has become a hot topic in recent years. Among more than 100 different chemical modifications, m6A is the most abundant, and it is dynamic and reversible. It is installed by "writers" and removed by "erasers"; "readers" are the m6A recognition proteins. m6A is extremely important for mRNA metabolism at different stages, from processing in the nucleus to translation and decay in the cytoplasm. Besides, m6A regulates circadian rhythm, cell cycle, cell differentiation, reprogramming, state transitions and stress responses (Zhao et al., 2017). Apart from m6A, other chemical modifications are also indispensable, such as m1A, m5C, 2′-O-methylation (2′OMe), and pseudouridine (ψ) (Esteller and Pandolfi, 2017; Zhao et al., 2017). For example, m5C plays a significant role in translation efficiency, mRNA structure, and genetic recoding of coding genes. It also regulates the processing of vault ncRNA into small RNAs and tRNA cleavage (Esteller and Pandolfi, 2017).

Diverse cancers are influenced by the structures and functions of m6A modification. In this review, we summarized the mechanisms and functions of m6A-modified RNA in 10 cancers, including AML, GBM, lung cancer, HCC, CCA, breast cancer, RCC, pancreatic cancer, colon cancer and cervical cancer (**Table 1**, **Figures 2**–**4**). Compared with m6A "writers" and "erasers," only a few articles have studied the relationship between m6A "readers" and cancers. YTHDF2 is upregulated in GBM, HCC, intestinal-type gastric adenocarcinoma and lung adenocarcinoma. Genetic alterations of m6A readers indicate poor survival in AML (Kwok et al., 2017). However, the underlying mechanism of m6A "readers" in cancers remains a mystery. Meanwhile, other modifications also play a vital role in cancers. For instance, ALKBH3, an m1A RNA demethylase, can protect the genome against alkylation damage. TET1, an RNA m5C demethylase, acts as a tumor suppressor in leukemia (Esteller and Pandolfi, 2017). Whether m6A and other modifications act additively or antagonistically remains to be further studied.

Among the regulators of m6A, METTL3, the crucial m6A methyltransferase, acts as a tumor promoter in most cancers. In colorectal, prostate and bile duct cancers, METTL3 has been reported to be significantly upregulated based on bioinformatic analysis (Chen et al., 2017b). In addition to its methyltransferase activity, METTL3 also influences cancers independently of its catalytic activity; a typical example is its enhancement of oncogene translation (Lin et al., 2016).

Interestingly, the methyltransferases and demethylases have distinct impacts on various cancer cells. In AML cells, both promote cell proliferation and suppress cell differentiation (Bansal et al., 2014; Li et al., 2017c; Vu et al., 2017; Weng et al., 2017). In GSCs, METTL3 and METTL14 suppress differentiation while WTAP and ALKBH5 promote cell proliferation, self-renewal and tumorigenicity (Jin et al., 2012; Cui et al., 2017; Zhang et al., 2017; Visvanathan et al., 2018). The roles of METTL3 in lung cancer are inconsistent across different studies (Lin et al., 2016; Du et al., 2017). Elevated METTL3 plays an oncogenic role in lung cancer (Lin et al., 2016), while METTL3 is downregulated in NSCLC (Du et al., 2017). It is worth noting that the alteration of m6A levels in HCC development is discordant across different studies: METTL3 and METTL14 show completely opposite effects on the migration of HCC cells (Chen et al., 2017b; Ma et al., 2017). These results may suggest that some functions of METTL3 are independent of m6A modification, and the underlying mechanism remains to be explored.

Although research on the roles of m6A in cancers has advanced dramatically in recent years, many challenges remain. First, the mechanisms of m6A regulators in some cancers are largely unknown. Second, whether the m6A level and its regulators could serve as biomarkers for the diagnosis and prognosis of some cancers, and the specificity and sensitivity of such biomarkers, need to be explored. Third, a number of studies have suggested that m6A regulators and related pathways could be used as therapeutic targets, but specific applications in clinical practice with large sample sizes are lacking, and the corresponding side effects are largely unknown. All of these issues should be clearly addressed.

#### AUTHOR CONTRIBUTIONS

Z-XL and L-ML are co-first authors who wrote the article. S-ML and H-LS are corresponding authors.

#### ACKNOWLEDGMENTS

We acknowledge the financial support of the National Natural Science Foundation of China (81472023, 81772276, and 91753201).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Liu, Li, Sun and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Unveiling Chloroplast RNA Editing Events Using Next Generation Small RNA Sequencing Data

Nureyev F. Rodrigues<sup>1</sup>, Ana P. Christoff<sup>1</sup>, Guilherme C. da Fonseca<sup>2</sup>, Franceli R. Kulcheski<sup>3</sup> and Rogerio Margis<sup>1,2,4</sup>\*

<sup>1</sup> Programa de Posgraduação em Genética e Biologia Molecular, Departamento de Genética, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil, <sup>2</sup> Programa de Posgraduação em Biologia Celular e Molecular, Centro de Biotecnologia, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil, <sup>3</sup> Programa de Pósgraduação em Biologia Celular e do Desenvolvimento, Departamento de Biologia Celular, Genética e Embriologia, Universidade Federal de Santa Catarina, Florianópolis, Brazil, <sup>4</sup> Departamento de Biofísica, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil

#### Edited by:

Giovanni Nigita, The Ohio State University Columbus, United States

#### Reviewed by:

Lei Song, National Cancer Institute (NIH), United States Xiyin Wang, North China University of Science and Technology, China Gaurav Sablok, University of Helsinki, Finland Fabio Iannelli, IFOM-The FIRC Institute of Molecular Oncology, Italy

> \*Correspondence: Rogerio Margis rogerio.margis@ufrgs.br

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

Received: 06 June 2017 Accepted: 13 September 2017 Published: 29 September 2017

#### Citation:

Rodrigues NF, Christoff AP, da Fonseca GC, Kulcheski FR and Margis R (2017) Unveiling Chloroplast RNA Editing Events Using Next Generation Small RNA Sequencing Data. Front. Plant Sci. 8:1686. doi: 10.3389/fpls.2017.01686

Organellar RNA editing involves the modification of nucleotide sequences to maintain conserved protein functions, mainly by reverting non-neutral codon mutations. The loss of plastid editing events, resulting from mutations in RNA editing factors or through stress interference, leads to developmental, physiological and photosynthetic alterations. Recently, next generation sequencing technology has generated the massive discovery of sRNA sequences and expanded the number of sRNA data. Here, we present a method to screen chloroplast RNA editing using public sRNA libraries from Arabidopsis, soybean and rice. We mapped the sRNAs against the nuclear, mitochondrial and plastid genomes to confirm predicted cytosine to uracil (C-to-U) editing events and identify new editing sites in plastids. Among the predicted editing sites, 40.57, 34.78, and 25.31% were confirmed using sRNAs from Arabidopsis, soybean and rice, respectively. SNP analysis revealed 58.2, 43.9, and 37.5% new C-to-U changes in the respective species and identified known and new putative adenosine to inosine (A-to-I) RNA editing in tRNAs. The present method and data reveal the potential of sRNA as a reliable source to identify new and confirm known editing sites.

Keywords: small RNA, chloroplast, RNA editing, NGS, SNP genotyping

#### INTRODUCTION

Chloroplasts are notable examples of successful endosymbiosis in the early origin of modern life forms. These organelles possess their own gene expression machinery, with complex posttranscriptional processes and fine nucleus-cytosol crosstalk. In plants, these organelles undergo a posttranscriptional process called RNA editing, corresponding to nucleotide changes from cytosine to uracil (C-to-U) and less frequently from uracil to cytosine (U-to-C), at some sites of coding sequences (Tillich et al., 2006; Chateigner-Boutin and Small, 2010). These nucleotide changes correct the codons to encode appropriate amino acids, maintaining the functional amino acid sequence of the evolutionarily conserved protein (Takenaka et al., 2013). Another well-known mechanism of RNA editing is adenosine to inosine (A-to-I) editing, as observed in the chloroplast tRNA<sup>Arg</sup>(ACG). This type of editing enables hydrogen bond formation with more than one base in the corresponding codon position (Su and Randau, 2011). The A-to-I editing at position 34 of tRNA<sup>Arg</sup>(ACG) produces the wobble nucleotide described as essential for efficient chloroplast translation (Delannoy et al., 2009). In Arabidopsis thaliana, arginine tRNA adenosine deaminase (TAD or ADAT) performs this deamination (Elias and Huang, 2005; Delannoy et al., 2009).
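The two editing types described above can be illustrated with a small sketch. It uses only a handful of entries from the standard genetic code and the textbook wobble rule that inosine at anticodon position 34 can pair with U, C or A. The example codons are generic illustrations, not editing sites reported in this study, and the function names are hypothetical.

```python
# Subset of the standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "UCA": "Ser", "UUA": "Leu",
    "CCA": "Pro", "CUA": "Leu",
    "ACG": "Thr", "AUG": "Met",
}

# Watson-Crick partners for anticodon positions 35 and 36.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Codon third-position bases readable by anticodon position 34.
# Simplified: only A and inosine (I) are covered here.
WOBBLE_34 = {"A": "U", "I": "UCA"}


def edit_c_to_u(codon: str, position: int) -> str:
    """Return the codon after C-to-U editing at a 0-based position."""
    if codon[position] != "C":
        raise ValueError(f"position {position} of {codon} is not a C")
    return codon[:position] + "U" + codon[position + 1:]


def codons_read(anticodon: str) -> list:
    """Codons (5'->3') readable by an anticodon written 5'->3' (positions 34-36)."""
    first = COMPLEMENT[anticodon[2]]   # pairs with anticodon position 36
    second = COMPLEMENT[anticodon[1]]  # pairs with anticodon position 35
    return [first + second + third for third in WOBBLE_34[anticodon[0]]]


for codon in ("UCA", "CCA", "ACG"):
    edited = edit_c_to_u(codon, 1)
    print(f"{codon} ({CODON_TABLE[codon]}) -> {edited} ({CODON_TABLE[edited]})")

print("ACG anticodon reads:", codons_read("ACG"))
print("ICG anticodon reads:", codons_read("ICG"))
```

In this sketch a single C-to-U change at the second codon position alters the encoded amino acid, and replacing A34 with inosine expands the set of arginine codons the tRNA can read from one to three.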

RNA editing in coding sequences increases the conservation levels among proteins across several plant species. Evolutionarily, codons generated by RNA editing are more conserved than codons encoded by genomic DNA (Guo et al., 2015). Editing sites located within coding sequences have been well studied, despite the existence of editing sites in non-coding regions, such as introns and tRNAs. There are several cases of different editing efficiencies from plant to plant, and even among different plant tissues (Peeters and Hanson, 2002; Chateigner-Boutin and Hanson, 2003; Tseng et al., 2013), suggesting that several different RNA editing sites remain to be elucidated.

The identification of all components from the RNA editing machinery has not yet been achieved, although several proteins have been identified as important for the maintenance of editing processes. The pentatricopeptide repeat proteins (PPR) are a highly diverse protein family. In the plant evolutionary landscape of PPR proteins, 109 genomes/proteomes were analyzed, resulting in a total of 49,204 PPR genes and 616,206 motifs (Cheng et al., 2016). Some of these PPRs harbor a DYW motif, similar to the deaminase motifs observed in other proteins, which could explain the C-to-U nucleotide conversion (Salone et al., 2007; Schallenberg-Rüdinger et al., 2013; Hayes et al., 2015). In addition, several studies have reported PPRs associated with specific RNA editing events, demonstrating that these molecules bind to specific cis-elements located upstream of the RNA editing site (Okuda et al., 2006; Barkan and Small, 2014). Moreover, the PPR alone is not sufficient to promote RNA editing but requires other proteins, such as RNA editing-interacting (RIP/MORF), ORRM and OZ proteins, to achieve a successful editing event (Bentolila et al., 2013; Sun et al., 2016).

The most frequent plastid RNA editing type in flowering plants is the C-to-U change, with approximately 40 sites detected thus far in Arabidopsis (Takenaka et al., 2013). To facilitate RNA editing site prediction in organelles, software such as the PREP suite has been developed (Mower, 2009). These programs enable RNA editing site prediction in genes from organelles by considering homology and conservation among protein sequences compared to genomic databases. Currently, thousands of partial and complete plastid genomes are available in NCBI, which can be used to extensively search for RNA editing events.

Different experimental techniques have identified chloroplast RNA editing sites. A widely used method is the reverse transcription PCR (RT-PCR) of plastid messenger RNAs in which several chloroplast cDNA fragments are cloned into vectors and further sequenced (Rüdinger et al., 2009). Additionally, if a chloroplast candidate gene sequence is previously known, then specific primers can be designed to direct the gene amplification from cDNA samples, with subsequent sequencing (Wolf et al., 2004). RNA editing events can also be detected through the Poisoned Primer Extension method or High Resolution Melting (HRM) analysis (Chateigner-Boutin and Small, 2007), using chloroplast cDNA as a template for amplification. Another method to measure RNA editing is multiplex RT-PCR mass spectrometry, described as a robust and convenient method (Germain et al., 2015). Although robust, these methods are dependent on specific primers and are restricted to RNA editing studies only.

RNA sequencing has facilitated RNA editing analyses by comparing reads from RNA-seq data with organelle genome references. Currently, RNA-seq is primarily adapted to study polyadenylated transcripts. As in their cyanobacterial ancestor, several polyadenylated plastid RNA transcripts are associated with the RNA decay pathway via degradation by 3′–5′ exoribonucleases (Komine et al., 2002; Zimmer et al., 2009). Therefore, this approach generates RNA-seq libraries with smaller amounts of plastid reads than libraries generated from organelle-enriched RNA samples, with posterior reduction of ribosomal RNA (Guo et al., 2015). Furthermore, these approaches restrict the analysis to only transcripts located in chloroplasts, preventing a comparative analysis between nuclear and plastid transcripts.

In recent years, studies of small RNAs (sRNA) have considerably increased, particularly associated with the deep sequencing of microRNAs (miRNAs) and other small noncoding RNAs (ncRNAs) from nuclear origin, producing a large amount of new sequence data. These studies have focused on the roles of sRNAs in genome maintenance, development and plant responses to environmental stresses (Simon et al., 2009; Long et al., 2015; Xu et al., 2015). However, plastid-derived sRNA sequences have also been identified in these total sRNA libraries (Ruwe and Schmitz-Linneweber, 2012; Zhelyazkova et al., 2012; Ruwe et al., 2016). Therefore, considerable amounts of sRNA data are available in public databases and can be employed for RNA editing studies. In the present study, we propose that sRNA sequencing data could represent an additional resource to identify chloroplast RNA editing events, in addition to other approaches, such as strand-specific RNA sequencing and Single Nucleotide Polymorphism (SNP). Here, we describe a method for identifying a set of new editing sites in chloroplast transcripts using sRNA data. Analyses of sRNA libraries can provide a strong qualitative and reliable quantitative measure of plastid RNA editing events.

#### MATERIALS AND METHODS

#### sRNA Libraries and Chloroplast Genomes

Public RNA libraries deposited in NCBI GEO (www.ncbi.nlm.nih.gov/geo/) with accession numbers GSE85070 (Wu et al., 2016) (Arabidopsis thaliana, mRNA-seq and sRNA-seq), GSE69571 (da Fonseca et al., 2016) (Glycine max, soybean, mRNA-seq and sRNA-seq) and GSE77046 (Neto et al., 2015) (Oryza sativa japonica group, rice, sRNA-seq; mRNA-seq data unpublished) were used as input data to evaluate the proposed method. These libraries were produced from samples with no qualitative influence on RNA editing and did not use any method to enrich the isolation of plastid RNAs. The Arabidopsis mutant data present in the libraries were not used. For sRNA analyses, only reads with 18–24 nucleotides were selected from the libraries. Complete chloroplast genomes, coding sequences and tRNAs from Arabidopsis (NC\_000932), soybean (NC\_007942), and rice (NC\_001320) were obtained separately at the Index of Genomes from The CpBase: Chloroplast Genome Database (http://chloroplast.ocean.washington.edu/).
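The 18–24 nucleotide size selection can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the code used in the study: the names `parse_fastq` and `filter_by_length` are hypothetical, and the parser assumes plain four-line FASTQ records.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from well-formed four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored
        qual = next(it).strip()
        yield header.strip(), seq, qual


def filter_by_length(records: Iterable[Record],
                     min_len: int = 18,
                     max_len: int = 24) -> List[Record]:
    """Keep only reads whose length lies in [min_len, max_len] (inclusive)."""
    return [rec for rec in records if min_len <= len(rec[1]) <= max_len]
```

A call such as `filter_by_length(parse_fastq(open("library.fastq")))` would return the size-selected reads; the file name here is only a placeholder.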

#### Prediction of Conserved Editing Sites

The Predictive RNA Editor for Plants suite (PREP-Cp) (http://prep.unl.edu/) (Mower, 2009) was used to predict conserved plastid editing sites. These sites were used to evaluate read coverage and editing percentage using the sRNA data. Fasta files corresponding to plastid coding sequence data were manually formatted for use as an input batch file in the PREP-Cp tool. To predict editing sites for each species, a less stringent cutoff value of 0.5 was used instead of the default value of 0.8. This lower cutoff value was used to evaluate the effective occurrence of the predicted editing sites and their efficacious detection from sRNA data.

#### RNA Mapping and Confirmation of Predicted Sites

The sRNA/mRNA libraries were primarily mapped using Bowtie (Langmead et al., 2009) with 0 mismatch and no reverse complement against the chloroplast genome, coding sequences and tRNAs. Mapped reads resulted in a new file (m0). Unmapped reads were submitted to a second round of mapping with no mismatches against nuclear and mitochondrial genomes. This step eliminates all reads with perfect matches against these genomes. Unmapped reads were further mapped with two mismatches and no reverse complement against chloroplast genome and coding sequences. This second group of mapped reads produced another file containing reads with editing events (m2). Both m0 and m2 fastq files were concatenated in an m0 + m2 file. The C-to-U editing sites predicted by PREP-Cp in the cpDNA coding sequence were subjected to m0 + m2 mapping and further manual inspection using Tablet software (Milne et al., 2013). The predicted editing sites were confirmed based on a C-to-T mapping change. The steps described above are summarized in **Figure 1**.
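The three-stage filtering logic described above can be sketched in memory as follows. This is a toy model under stated assumptions, not the authors' pipeline: Bowtie is replaced by a naive ungapped, forward-strand Hamming scan, references are plain strings, and the names `aligns` and `classify_reads` are hypothetical.

```python
from typing import Dict, List, Sequence


def aligns(read: str, reference: str, max_mismatches: int) -> bool:
    """True if the read matches some forward-strand window of the reference
    with at most max_mismatches substitutions (no gaps, no reverse complement)."""
    n = len(read)
    for start in range(len(reference) - n + 1):
        mismatches = 0
        for a, b in zip(read, reference[start:start + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            return True  # window scanned without exceeding the limit
    return False


def classify_reads(reads: Sequence[str],
                   chloroplast: str,
                   other_genomes: Sequence[str]) -> Dict[str, List[str]]:
    """Assign each read to m0, discarded, m2 or unmapped, in that order."""
    out: Dict[str, List[str]] = {"m0": [], "discarded": [], "m2": [], "unmapped": []}
    for read in reads:
        if aligns(read, chloroplast, 0):
            out["m0"].append(read)          # perfect chloroplast match
        elif any(aligns(read, g, 0) for g in other_genomes):
            out["discarded"].append(read)   # perfect nuclear/mitochondrial match
        elif aligns(read, chloroplast, 2):
            out["m2"].append(read)          # candidate editing-informative read
        else:
            out["unmapped"].append(read)
    return out
```

The order of the checks matters: testing for perfect chloroplast matches first keeps unedited plastid reads even when an identical sequence exists in a nuclear insertion, and only the remaining reads are screened against the other genomes before mismatches are allowed.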

#### Single Nucleotide Polymorphism Analysis

The m0 + m2 fastq files from sRNA libraries were mapped against the whole chloroplast genome, coding sequences and tRNAs using Geneious-R8 (Kearse et al., 2012), with the Bowtie algorithm and the same parameters of the previous mapping (**Figure 1**). The Geneious find variation/SNPs tool was used to search for A-to-G and C-to-T changes in putative new editing sites that were not predicted by PREP. The following parameters were used: Minimum Coverage of 5, Maximum Variant P-value of 10<sup>−2</sup>, option to find polymorphism Inside and Outside coding sequence and P-value calculation method as approximate. In the manual inspection of mapping, reads with putative editing events in the 5′ and 3′ end were discarded to improve prediction and selection for validation using RT-qPCR assay.
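The coverage-based part of this screen can be sketched as below. It is a simplified illustration, not the Geneious algorithm: the P-value filter is deliberately omitted, per-position base counts are assumed to be available already, and the function name `call_c_to_t` is hypothetical. The editing percentage is computed as edited reads divided by total mapped reads at the position.

```python
from typing import Dict, List, Tuple

# (1-based position, total coverage, edited coverage, % editing)
Call = Tuple[int, int, int, float]


def call_c_to_t(reference: str,
                base_counts: List[Dict[str, int]],
                min_coverage: int = 5) -> List[Call]:
    """Report reference C positions that carry T reads and meet the coverage cutoff.

    base_counts[i] maps each observed base to its read count at position i.
    No statistical test is applied here (an assumption of this sketch).
    """
    calls: List[Call] = []
    for index, (ref_base, counts) in enumerate(zip(reference, base_counts)):
        total = sum(counts.values())
        edited = counts.get("T", 0)
        if ref_base == "C" and total >= min_coverage and edited > 0:
            calls.append((index + 1, total, edited, 100.0 * edited / total))
    return calls
```

The same function would screen A-to-G changes in tRNAs if the reference base and edited base were passed as parameters instead of being fixed to C and T.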

#### Validation and Analysis of the RNA Editing Sites Using RT-qPCR

To validate predicted and new C-to-U RNA editing sites from the sRNA data in soybean chloroplast transcripts [Glycine max (L.) Merrill], we collected the roots, leaves and petals from the soybean cultivar Conquista. These tissues were collected as biological triplicates. All samples were immediately frozen in liquid nitrogen, and total RNA was extracted using Trizol (Invitrogen, CA, USA). The RNA quality was evaluated through electrophoresis on a 1% agarose gel, and the RNA amount was verified using a Qubit fluorometer and Quant-iT RNA assay kit according to the manufacturer's instructions (Invitrogen, CA, USA).

FIGURE 1 | Pipeline for identification of editing sites using chloroplast RNA transcripts. (1) sRNA-seq/mRNA-seq reads were filtered by mapping against the chloroplast reference genome. Mapped reads were saved as another file named m0 (chloroplast RNAs m0). (2) Reads that did not map were subjected to a new round of mapping against nuclear and mitochondrial reference genomes, and those reads that did map were discarded. (3) The remaining unmapped reads were remapped against the chloroplast genome allowing up to 2 mismatches using Bowtie. (4) The resulting mapped reads (chloroplast m0 + m2), plus the m0 file, were used in the analysis to predict transcript editing sites through PREP and Geneious SNPs approaches.

Reverse transcription quantitative polymerase chain reaction (RT-qPCR) was performed to validate the C-to-U RNA editing rates for some predicted editing sites in soybean chloroplast genes across three different tissues (roots, leaves and petals). To validate and quantify new RNA editing sites, only leaf samples were used. The cDNA synthesis was performed with approximately 1 µg of total RNA. Each reaction was primed with 1 µM dT25V oligonucleotide (Invitrogen, Carlsbad, CA, USA). Prior to reverse transcription, RNA and the oligo(dT)25V primer were mixed with RNase-free water to a total volume of 10 µL and incubated at 70°C for 5 min, followed by cooling on ice. The reactions were reverse transcribed with 1X M-MLV RT buffer, 0.5 mM dNTPs (Ludwig, Porto Alegre, RS, Brazil) and 200 U of M-MLV RT Enzyme (Promega, Madison, WI, USA) in a final volume of 30 µL. The synthesis was performed at 40°C for 60 min. All cDNA samples were diluted 100-fold with RNase-free water and subsequently used as templates in RT-qPCR analysis. The subsequent PCR amplification was performed using a set of primers designed according to Chen et al. (2008), with modifications. A set of primers, comprising two specific editing primers and one unique universal primer, was designed for each editing site. Specific editing primers were characterized by a unique difference in the last nucleotide at the 3′ end that recognizes and differentiates edited and unedited sites. All primers employed in the reaction are listed in Table S1.

All RT-qPCR reactions were performed on a Bio-Rad CFX384 real-time PCR detection system (Bio-Rad, Hercules, CA, USA) using SYBR Green I (Invitrogen, Carlsbad, CA, USA) to detect double-stranded cDNA synthesis. The reactions were conducted in a 10 µL volume containing 5 µL of diluted cDNA (1:100), 0.2X SYBR Green I, 0.1 mM dNTP, 1X PCR buffer, 3 mM MgCl2, 0.25 U Platinum Taq DNA Polymerase (Invitrogen, Carlsbad, CA, USA) and 200 nM of each forward and reverse primer. The samples were analyzed as biological triplicates and technical quadruplicates in a 384-well plate. A non-template control was also included. The PCR reactions were run under the following conditions: an initial polymerase hot start at 94°C for 5 min, followed by 40 cycles at 94°C for 15 s, 60°C for 15 s and 72°C for 10 s. A melting curve analysis was programmed at the end of the PCR run over the range of 65 to 99°C, and the temperature increased stepwise by 0.5°C. The threshold and baseline were manually determined using Bio-Rad CFX manager software.

To calculate the RNA editing rates, we used the threshold cycle (Ct) generated during the qPCR amplifications. To calculate the percentage of editing, an equation that considered the difference between the Ct-values of each editing variant was used:

$$\%\ \text{RNA editing} = \frac{2^{\left(\text{Ct mean of T variant} - \text{Ct mean of C variant}\right)}}{2^{\left(\text{Ct mean of T variant} - \text{Ct mean of C variant}\right)} + 1} \times 100$$
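The Ct-based calculation can be written as a small function. This is a sketch that takes the exponent literally from the text as "Ct mean of T variant − Ct mean of C variant"; whether a particular assay needs the opposite sign depends on how its primers and Ct values are assigned, which this example does not attempt to settle. The function name `percent_editing` is hypothetical.

```python
def percent_editing(ct_mean_t: float, ct_mean_c: float) -> float:
    """Editing percentage from mean Ct values of the T and C variant reactions.

    Implements ratio / (ratio + 1) * 100 with
    ratio = 2 ** (ct_mean_t - ct_mean_c), following the formula as printed.
    Assumes ideal amplification efficiency (a doubling per cycle).
    """
    ratio = 2.0 ** (ct_mean_t - ct_mean_c)
    return ratio / (ratio + 1.0) * 100.0
```

Equal mean Ct values give 50%, and swapping the two arguments gives the complementary percentage, so the two variants always sum to 100%.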

#### RESULTS

#### sRNA Reads Mapped to Chloroplast Genomes

The sRNA libraries sequenced without plastid RNA isolation were mapped to Arabidopsis, soybean and rice chloroplast genomes using an in-house pipeline (**Figure 1**). Approximately 3.2, 1.6, and 0.9 million reads did not map to nuclear and mitochondrial genomes but mapped to Arabidopsis, soybean and rice chloroplast genomes, respectively. These chloroplast (cp) mapped reads represented approximately 22.9% (Arabidopsis), 4.79% (soybean), and 3.62% (rice) of the total reads in these libraries (**Table 1**). The editing informative m2 reads corresponded to 455,904 (Arabidopsis), 208,417 (soybean), and 144,609 (rice). The histograms representing the percentage length distribution of each individual class are shown in **Figure S1**. The mean coverage was 838.6 in Arabidopsis, 358.6 in soybean and 222 in rice. The maximum coverage values were 872,674 in Arabidopsis, 380,116 in soybean and 166,534 in rice. Some chloroplast regions were not covered by the sRNA library reads, with minimal coverage of zero. The number of plastid genome positions with no coverage was 47,057 in Arabidopsis, 24,505 in soybean and 3,039 in rice, representing approximately 30.46, 16.09, and 2.25% of each chloroplast genome, respectively. The genome fraction coverage for Arabidopsis, soybean and rice is represented in **Figure S2**.
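The coverage statistics quoted here (mean, maximum, and the share of positions with no reads) can all be derived from a per-position depth vector. The sketch below is illustrative only; the name `coverage_summary` and the plain-list input are assumptions, and no particular depth-calculation tool is implied.

```python
from typing import Dict, Sequence, Union

Number = Union[int, float]


def coverage_summary(depths: Sequence[int]) -> Dict[str, Number]:
    """Summarize per-position read depth across a reference sequence."""
    if not depths:
        raise ValueError("depths must contain at least one position")
    n_positions = len(depths)
    zero_positions = sum(1 for d in depths if d == 0)
    return {
        "mean": sum(depths) / n_positions,
        "max": max(depths),
        "zero_positions": zero_positions,
        "zero_percent": 100.0 * zero_positions / n_positions,
    }
```

A very high maximum combined with a large uncovered fraction, as reported for Arabidopsis, is what such a summary looks like when reads pile up on a few loci.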

#### sRNA Polymorphisms Confirm PREP Editing Site Prediction in Coding-Sequence Genes

The conserved chloroplast C-to-U RNA editing sites were predicted using the Predictive RNA Editor for Plants (PREP-Cp) (http://prep.unl.edu/) (Mower, 2009). The PREP suite predicted 69 potential editing sites in Arabidopsis, 92 sites in soybean and 79 sites in rice chloroplast genes. These predicted editing sites were distributed in 21 different coding sequences in Arabidopsis and rice and 23 coding sequences in soybean. The mapped chloroplast sRNA reads were analyzed using Tablet software to evaluate the presence/absence of C-to-U editing events in the predicted sites. Different numbers of confirmed editing sites were observed among the three species: 28 sites in Arabidopsis, 32 sites in soybean and 20 sites in rice, corresponding to 40.57, 34.78, and 25.31% of the total sites, respectively. The PREP score (values between 0 and 1) indicates editing site prediction confidence to control the relative proportion of false positive and false negative predictions. When a more stringent score value (≥0.8) was considered, the predicted editing site numbers decreased to 45, 59, and 29 for Arabidopsis, soybean and rice, respectively. Analyses of chloroplast sRNA alignment confirmed 23 of these predicted editing sites in Arabidopsis, 28 sites in soybean, and 14 sites in rice, corresponding to 51.1, 47.45, and 48.27% of the total predicted editing sites, respectively (**Figure 2A**). Even with a higher score value, some predicted sites were not confirmed, reflecting the absence of reads corresponding to editing or insufficient coverage (Table S2). Four editing sites were conservatively predicted and confirmed among the three species. These sites corresponded to three sites inside the ndhB transcript and one site in the rps14 transcript. Soybean and Arabidopsis shared 11 common editing sites in the atpF, clpP, ndhB, ndhD, psbE, psbF, rpoB, rpoC1, and rps14 transcripts. Concerning the rice atpF, clpP, ndhB, psbE, and psbF genes, a thymine was already present in these editing sites. Rice shared a single editing site with Arabidopsis in the ndhB transcript at position 467, which in soybean corresponds to a thymine. The numbers of unique confirmed editing sites for each species were 12, 16, and 14 for Arabidopsis, soybean and rice, respectively (**Figure 2A**). The complete distribution of PREP predicted editing sites according to species is described in Table S2.

m0, reads with no mismatches.

m2, reads with up to 2 mismatches.

FIGURE 2 | PREP predicted editing sites and graphical read distribution and editing in the ndhB transcript. (A) Venn diagram with confirmed RNA editing sites predicted by PREP in Arabidopsis, soybean and rice. Gene names followed by the position numbering of the editing site in the coding sequence are indicated. (B) Graphical representation of sRNA coverage and predicted editing sites in the ndhB gene; (S) editing sites identified by SNP analysis, (T) editing site predicted in another species that already has a thymine in this species, (\*) editing site predicted by PREP and confirmed by read mapping and coverage, (−) predicted sites with reads but not confirmed by editing and (0) predicted editing sites without read coverage.
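The confirmation percentages in this subsection follow directly from the confirmed and predicted site counts. The short sketch below reproduces the arithmetic at the 0.5 cutoff; the helper name `confirmation_rate` is hypothetical, and small last-digit differences from the reported values are rounding effects.

```python
def confirmation_rate(confirmed: int, predicted: int) -> float:
    """Percentage of predicted editing sites confirmed by read evidence."""
    if predicted <= 0:
        raise ValueError("predicted must be positive")
    return 100.0 * confirmed / predicted


# Site counts taken from the text (PREP cutoff 0.5): (confirmed, predicted).
counts = {
    "Arabidopsis": (28, 69),
    "soybean": (32, 92),
    "rice": (20, 79),
}

rates = {species: confirmation_rate(c, p) for species, (c, p) in counts.items()}
```

The same helper applied to the stricter cutoff (23 of 45, 28 of 59 and 14 of 29 sites) shows how raising the PREP score threshold lifts the confirmed fraction to roughly half of the predictions in every species.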

#### mRNA-Seq and sRNA-Seq Differences in RNA Editing Analysis

To provide information concerning sRNA data reliability, the C-to-U RNA editing profiles were compared to the PREP predicted editing sites between the sRNA and mRNA (messenger RNA) libraries in Arabidopsis, soybean and rice. The mRNA-Seq data confirmed 27 predicted editing sites in Arabidopsis, 37 sites in soybean and 20 sites in rice, corresponding to 39.13, 40.21, and 25.31% of the predicted sites, respectively (Table S3). One predicted editing site was exclusively confirmed using mRNA-Seq libraries in Arabidopsis, and 11 predicted editing sites were confirmed in soybean and rice. However, analyses using sRNA-Seq libraries detected two exclusively confirmed editing sites in Arabidopsis, six sites in soybean and eight sites in rice. The confirmed predicted editing sites shared between mRNA and sRNA data corresponded to 37.68, 28.26, and 15.19% of the total predicted editing sites in Arabidopsis, soybean and rice, respectively (**Figure 3**).
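The comparison between the two library types amounts to set operations on confirmed sites. The sketch below uses placeholder integer identifiers (not real site names) arranged to match the Arabidopsis counts in the text: 28 sites confirmed by sRNA, 27 by mRNA, out of 69 predicted. The function name `compare_confirmed` is hypothetical.

```python
from typing import Dict, Iterable, Set


def compare_confirmed(srna_sites: Iterable[int],
                      mrna_sites: Iterable[int]) -> Dict[str, Set[int]]:
    """Split confirmed sites into shared, sRNA-only and mRNA-only groups."""
    srna, mrna = set(srna_sites), set(mrna_sites)
    return {
        "shared": srna & mrna,
        "srna_only": srna - mrna,
        "mrna_only": mrna - srna,
    }


# Placeholder site IDs chosen only to reproduce the Arabidopsis counts.
srna_confirmed = range(1, 29)                 # 28 sites
mrna_confirmed = list(range(1, 27)) + [29]    # 27 sites
groups = compare_confirmed(srna_confirmed, mrna_confirmed)
shared_percent = 100.0 * len(groups["shared"]) / 69  # share of all predicted sites
```

With these counts the shared group holds 26 sites, about 37.68% of the 69 predicted sites, leaving two sRNA-only sites and one mRNA-only site.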

#### Confirmation of PREP Predicted Editing Sites and New Editing Site Prediction through SNP Analysis in Coding-Sequences Using sRNA Data

In addition to the confirmation of the predicted editing sites, new candidates for editing sites were searched. A SNP analysis was used with a P-value of ≤ 10<sup>−10</sup> to identify sites with C-to-T changes. This parameter enabled the identification of 59 potential editing sites in Arabidopsis, 43 sites in soybean, and 19 sites in rice. Among these editing sites, 58, 37, and 15 sites encode amino acid changes in Arabidopsis, soybean and rice, respectively (Table S4). These editing sites were distributed in 27 genes in Arabidopsis, 24 genes in soybean and 11 genes in rice. Comparison of these editing sites against the editing sites predicted using PREP revealed that 20, 18, and 7 sites were previously predicted in Arabidopsis, soybean and rice, respectively (Table S5). Among these sites, 18, 18, and 6 sites were predicted with a higher score value in Arabidopsis, soybean and rice, respectively.

FIGURE 3 | Comparison of predicted editing site confirmation between sRNA and mRNA data. On the left, values of total confirmed predicted editing sites by data type (mRNA or sRNA). Green boxes represent editing sites confirmed in both data; yellow boxes represent editing sites confirmed only in mRNA data; blue boxes represent editing sites confirmed only in sRNA data; and black boxes represent unconfirmed predicted editing sites.

When the edited transcript distribution was evaluated in all species (**Figure 4A**), a higher editing frequency was associated with a core of genes (clpP, ndhB, ndhF, rpoA, rpoB, rpoC1, rpoC2, and rps14) and confirmed with at least one method used for all species evaluated. Considering exclusive edited genes, Arabidopsis showed 14 editing sites distributed among nine genes identified using SNP analysis. The editing in the rice atpA gene, detected through SNP analysis, was predicted by PREP. Soybean presented four exclusive editing sites confirmed by sRNA reads and predicted by PREP. These sites were distributed among the petB, rps2, and rps14 genes. C-to-U changes promote a serine to leucine amino acid change in petB and rps14 and a histidine to tyrosine amino acid change in rps2. Arabidopsis, soybean and rice SNP analysis revealed 19, 15, and 7 C-to-T changes distributed among 11, 10 and five exclusive genes, respectively. All genes and their respective editing sites are listed in Table S6. The comparative C-to-T analysis using different identification methods demonstrated that the SNP method could identify reliable C-to-U editing events, including events previously predicted using PREP at a lower PREP score (>0.5) (**Figure S3**) or a more stringent cutoff (PREP score >0.8) (**Figure 4B**).

FIGURE 4 | Number of genes with C-to-U editing sites in the studied species. (A) Venn diagram with the total number of genes with editing sites in Arabidopsis, soybean and rice, when using both PREP (only confirmed) and SNP analysis. Not all genes share common editing sites among species. The gene identities are described in Table S6. (B) Percentages of total RNA editing sites identified by distinct approaches, as observed in Arabidopsis, soybean and rice. The absolute number of editing sites for each method is in parentheses. Black bars correspond to the percentage of total sites confirmed only by PREP prediction (>0.8 in prediction score); white bars indicate the percentage of total sites confirmed by the SNP approach; and gray bars show the percentage of total sites confirmed using both approaches.

#### C-to-U RNA Editing in the ndhB Gene

The well-studied ndhB gene was the most frequently edited gene detected through PREP prediction in all plants. The number of editing sites predicted by PREP in this gene varied between species: 9 sites in Arabidopsis, 13 in soybean and 10 in rice. The number of editing sites confirmed by sRNA alignment was 7 sites in Arabidopsis, 9 sites in soybean and 7 sites in rice, representing 77.7, 69.23, and 70% of the predicted editing sites, respectively. Other editing sites could not be confirmed, reflecting insufficient read coverage (**Table 2**). In contrast, despite high predicted editing site numbers, 7 sites in Arabidopsis, 9 sites in soybean and 5 sites in rice, the matK gene had only two confirmed predicted editing sites in Arabidopsis and one confirmed predicted editing site in soybean and rice (Table S2).

In the ndhB gene, SNP analysis detected potential new editing sites in all three species (**Table 2**). However, this gene was not the most edited gene according to SNP analysis in rice. In this species, ndhB had three new potential editing sites, while the rpoC2 gene had four new sites. In Arabidopsis, ndhD had 8 new potential editing sites according to SNP analysis. In soybean, the ndhB gene remained the most edited gene (Table S6). Comparative analyses showed a different read distribution of the predicted sites in ndhB among species (**Figure 2B**). Some regions showed higher coverage, not only in the editing site, but also in neighboring sites. For example, the C-to-U editing site that PREP predicted at position 467 showed varied coverage between species, and reads confirming the editing event were observed in both Arabidopsis and rice. Although soybean had a higher number of reads at this site, a T was present in this genomic position. Notably, several sites showed more than 10 reads of coverage but did not confirm editing events. Some putative editing sites predicted using SNP analysis showed higher coverage than the predicted sites confirmed using PREP (**Table 2**).

#### A-to-I Editing Events Predicted Using SNP Analysis in Chloroplast tRNA Genes

Chloroplast sRNAs can also be useful in adenosine to inosine (A-to-I) RNA editing screening. tRNA genes were used to evaluate editing events, by searching for a guanosine (G) SNP in sRNA mapping since inosine is read as G by cellular machineries (Kim, 2004).

tRNA genes showed at least one position with an A-to-G change in at least two species (Table S7), totaling 11, 4, and 12 putative A-to-I editing events in Arabidopsis, soybean and rice, respectively. These A-to-G changes were distributed in 8, 4, and 10 tRNAs in Arabidopsis, soybean and rice, respectively. Among these sites, two sites were conserved between species: position 58 of tRNA-Trp (CCA) between soybean and rice and position 35 of tRNA-Arg (ACG) among all species evaluated. In tRNA-Arg (ACG), nucleotide 35 presented 40, 58.8, and 67.8% of the edited reads in Arabidopsis, soybean and rice, respectively (**Table 3**). The tRNAs most frequently edited were tRNA-Ser (UGA), with 3 A-to-G changes in Arabidopsis, and tRNA-Leu (UAG) and tRNA-Trp (CCA) with two A-to-G changes in Arabidopsis and rice, respectively.

#### Validation of C-to-U RNA Editing in Soybean Plastid Genes

To validate some predicted editing sites and demonstrate sRNA data reliability as a resourceful tool for the identification of RNA editing sites, four PREP predicted editing sites were selected for C-to-U RNA editing analysis using RT-qPCR. The ndhA (position 1073), ndhB (position 149), rps14 (position 80), and rps16 (position 212) editing sites were comparatively quantified in different soybean tissues (**Figures 5A–D**). Five new putative editing sites, identified by SNP analysis, were also confirmed and quantified in leaf samples: accD (position 617), ndhE (position 233), petB (position 611), rps2 (position 248), and rps3 (position 383) (**Figure 5E**). RT-qPCR showed that the percentage of ndhA editing was higher in leaves (76.75%) than in petals (20.11%) or roots (30.23%) (**Figure 5A**). The same editing pattern was observed for ndhB and rps14. In ndhB, the percentage editing was 72.41, 30.54, and 16.55% (**Figure 5B**), while values of 74, 17.86, and 8.15% were obtained in rps14 editing in the leaves, petals and roots, respectively (**Figure 5C**). The rps16 editing profile was different, with an editing percentage that was higher than 60% in all tissues (**Figure 5D**). With respect to putative new C-to-U editing sites identified using SNP analysis, RT-qPCR confirmed C-to-U editing events and demonstrated different editing rates among genes: accD (60.2%), ndhE (39.85%), petB (54.3%), rps2 (71.52%), and rps3 (20.02%) (**Figure 5E**).

#### DISCUSSION

In the present study, we propose an additional resource and new method to identify conserved and new RNA editing sites in plastid RNA sequences. Currently, an increasing number of high-throughput sequencing data have become available. Among these datasets, there are substantial data corresponding to sRNA sequencing libraries. After analyzing some of these libraries, we observed that even without previous isolation of chloroplasts for further RNA extraction and sequencing, millions of chloroplast-derived sRNA reads could be recovered, reflecting mapping against the chloroplast genome. An important constraint of the presented method refers to the library quality and the read coverage of reference genomes.

In the present study, Arabidopsis libraries had the highest mean coverage using sRNA reads, which likely facilitated the recovery of the largest number of confirmed editing sites. The coverage percentage across genomes was different between species, with lower values detected in Arabidopsis. This result demonstrated that the use of sRNA libraries for mapping editing events does not directly depend on substantial coverage across the entire plastid genome. Although this method has the capacity to confirm and discover editing sites in chloroplasts, a smaller number of mitochondrial reads would likely affect RNA editing analysis in this organelle. In the present study, the approach for the identification of editing sites was compared to the PREP and SNP strategies. The editing sites and percentage editing may vary between species because some species may already possess a thymine in the genome. In these cases, C-to-U editing will not occur. The same situation can occur with some A-to-I editing sites, which could affect the general percentage of editing among species. The use of a different PREP score, resulting in distinct cut-off values, may also affect these percentages. In addition, editing factors and their editing sites may evolve differently among species.

TABLE 2 | NdhB C-to-U editing events by PREP and SNP approach using reads derived from sRNA-seq.

\*Coding sequence length and coverage values.

"Nucleotide position": position in base pairs, counted from the A of the initiator codon.

"Total Coverage": total mapped reads in respective nucleotide position.

"Edited Coverage": number of reads showing T instead of C.

"% Editing": percentage of RNA editing using the edited reads divided by total mapped reads.

"PREP score": confidence value of prediction according to PREP.

"nd": not defined.

The elementary step employed in the pipeline used in the present study was the initial sRNA library mapping against the chloroplast genome, considering 0 mismatches. Plastid DNA insertions in nuclear genomes have been demonstrated for partial, intact or even truncated coding sequences in several species (Chen et al., 2015). Thus, an initial filtration step against the chloroplast genome prevents the loss of unedited reads to those loci present in nuclear insertions. Unedited reads are necessary, particularly in quantitative editing analysis, where the editing percentage is measured, and therefore cannot be discarded.

TABLE 3 | A-to-I editing analysis of tRNA-Arg(ACG) sites by SNP approach with corresponding reads derived from sRNA-seq.

\*tRNA sequence length and coverage values.

"Total Coverage": total mapped reads in respective nucleotide position.

"Edited Coverage": number of reads showing G instead of A.

"% Editing": percentage of RNA editing using the edited reads divided by total mapped reads.

Some C-to-U editing studies have previously used mRNA-Seq to demonstrate and quantify editing events in plant mitochondria (Bentolila et al., 2013) and chloroplasts (Guo et al., 2015). Comparison of sRNAs and mRNA data sequences demonstrated that most of the confirmed editing sites can be recovered using both datasets. However, there are differences between these data, demonstrating that sRNAs can identify editing sites that were not detected using mRNA data and vice versa (**Figure 3**). The use of sRNA data to complement RNA editing analysis can improve the identification and measurement of RNA editing in various aspects.

In the present study, a new set of plastid editing sites was identified in soybean. The C-to-U editing events have previously been demonstrated in other species, and we recovered several edited transcripts, including ndhB, ndhD, ndhG, rpoB, and rpoC1 (Corneille et al., 2000; Okuda et al., 2009; Zhou et al., 2009; Chateigner-Boutin et al., 2011; Boussardon et al., 2012; Tseng et al., 2013), in the present analysis. For most known C-to-U editing sites predicted through PREP and confirmed by sRNA reads in the present study, 21 sites have previously been demonstrated in Arabidopsis (Tsudzuki et al., 2001; Tillich et al., 2005) and 19 sites have previously been demonstrated in rice (Corneille et al., 2000; Tsudzuki et al., 2001), representing 30.43 and 24% of the total predicted editing sites, respectively (Table S2). Moreover, we showed editing events in soybean plastid genes, including ndhA, psaI, and petB, which had not previously been demonstrated for rice or Arabidopsis. In the SNP analysis, we identified new C-to-U editing sites. For example, in the Arabidopsis ndhF gene, a putative C-to-U editing site was identified at position 884, leading to a serine to phenylalanine change. In the soybean ndhE gene, a putative C-to-U editing site at position 233 was observed in 73.7% of the reads. This editing led to a proline to leucine change in the encoded protein. Despite this information, the impact of amino acid modifications on respective protein structures remains unclear. Both ndh genes encode thylakoid Ndh complex components involved in photosynthesis optimization under different stress conditions (Casano, 2001; Martin et al., 2004; Rumeau et al., 2007). NdhB mutants under lower air humidity conditions or following exposure to ABA present a reduction in the photosynthetic level, likely mediated through stomatal closure triggered under these conditions (Horvath, 2000). Therefore, a protein structure modification resulting from a loss or decrease in RNA editing events could affect adaptations to stress conditions or cause other unknown changes.

The coding sequence of protein D2, a photosystem II (PSII) core protein encoded by the psbD gene, showed a putative new editing event in rice at positions 1006 and 1007. However, owing to low coverage, these new editing sites still require further experimental confirmation. Maintenance of the D2 protein structure is important not only for proton transport (Pokhrel et al., 2013) but also for the phosphorylation dynamics of this protein (Tikkanen and Aro, 2012) and its interaction with the proteins responsible for PSII maintenance (Liu and Last, 2015). If this editing site is confirmed, then alterations in editing site patterns resulting from factors such as abiotic stress could be associated with susceptibility to photo-oxidative damage. Previous studies have demonstrated that abiotic stress influences the editing process and consequently plastid physiology (Nakajima and Mulligan, 2001; Karcher and Bock, 2002).

analyzed in leaves, petals and roots. Box area represents the lower and upper percentiles; (E) confirmation and quantitation of soybean editing sites identified by SNP analysis. Transcripts from soybean leaves were analyzed for C-to-U editing in specific nucleotide positions: accD-617, ndhE-233, petB-611, rps2-248, and rps3-383. Box area represents the lower and upper percentiles. The upper whisker of the boxplot indicates the highest editing value observed; the lower whisker, the lowest editing value; and the middle line, the median.

Five putative C-to-U editing sites predicted using SNP analysis were validated through RT-qPCR. This result demonstrates the reliability and accuracy of sRNA data resources and of the method presented herein to confirm predicted sites in silico and identify new RNA editing sites. Position 1073 in the ndhA gene is an editing site identified only in the soybean chloroplast editome. RT-qPCR revealed that the editing percentage varies among different soybean tissues. The ndhB site (position 149) was previously evaluated in the non-photosynthetic tissues of Arabidopsis. The RNA editing pattern previously demonstrated in Arabidopsis (Tseng et al., 2013), with a higher percentage in leaves (>75% edited), followed by flowers (25–75% edited) and roots (unedited), was similarly observed in the present study. An exception was observed for the root tissue, which showed a low editing percentage (16.5%) in soybean rather than being unedited, as observed in Arabidopsis. The editing site at position 80 in rps14 was also evaluated across different tissues in Arabidopsis. A high editing percentage was demonstrated in Arabidopsis leaves (Tseng et al., 2013), a pattern also demonstrated in soybean using RT-qPCR. The RNA editing percentages observed in roots and petals showed different patterns between Arabidopsis and soybean, although a decrease in these values was observed in the root tissue of both species. The editing of rps16 at position 212 was predicted and confirmed only in soybean and did not show differences in the editing percentage between leaf and root tissues. These results indicate that sRNA sequence mapping can be used not only to confirm the predicted editing sites, but also to quantify the editing percentage.
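As a minimal sketch of how an editing percentage could be derived from mapped reads at a single site, the snippet below assumes that reads showing T (U in RNA) at the position count as edited and reads showing C count as unedited. The function name, the read counts, and the choice to ignore other bases are illustrative assumptions, not the pipeline used in the study.

```python
def editing_percentage(base_counts):
    """Return the percentage of C-to-U edited reads at one site.

    base_counts maps a base ('A', 'C', 'G', 'T') to the number of mapped
    reads showing that base at the site. Only C (unedited) and T (edited,
    U in RNA) are counted; other bases are treated as noise (assumption).
    """
    unedited = base_counts.get("C", 0)
    edited = base_counts.get("T", 0)
    informative = unedited + edited
    if informative == 0:
        raise ValueError("no C or T reads cover this site")
    return 100.0 * edited / informative


# Hypothetical read counts for one site (not data from the study).
example_site = {"C": 5, "T": 14, "A": 1}
print(round(editing_percentage(example_site), 1))
```

Because the estimate is a simple ratio, its reliability depends directly on read coverage at the site, which is why low-coverage sites call for experimental confirmation.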

The plastid acetyl-CoA carboxylase, necessary for de novo fatty acid synthesis, comprises two components, accA and accD proteins; accD encodes the β-carboxyl transferase subunit and is required in tobacco plants for a functional enzyme (Kode et al., 2005). The vanilla cream1 (vac1) albino mutant, which is defective in a PPR-DYW protein required for editing in accD and ndhF in Arabidopsis, exhibits an albino to pale yellow phenotype and an RNA editing reduction in those transcripts (Tseng et al., 2010). The requirement of plastid accD editing for a functional protein has previously been demonstrated (Sasaki et al., 2001), and this new editing site, which promotes a serine to leucine change, could also be important for the maintenance of protein structure and functionality. The ndhE gene encodes a subunit of a membrane subcomplex of the NAD(P)H dehydrogenase complex (Peng et al., 2011). NdhE protein interacts with the membrane subcomplex proteins, NdhC and NdhG, and with subcomplex proteins, NdhH and NdhK (Efremov et al., 2010; Peng et al., 2011). The new editing site described here promotes a proline to leucine change, which could modify the interaction between these proteins and lead to changes in electron transfer to quinone. The petB gene encodes the cytochrome b<sub>6</sub> protein, a cytochrome b<sub>6</sub>f complex component responsible for mediating electron transfer from plastoquinol to plastocyanin, linking PSII and photosystem I (PSI) (Baniulis et al., 2008); mutants of petB in tobacco showed reduced levels of PSI, PSII and light-harvesting complex proteins (Monde et al., 2000), indicating a requirement of cytochrome b<sub>6</sub> for correct photosynthetic apparatus assembly. The new editing site involving a serine to leucine change in petB at position 611, identified in the present study, could be required for the maintenance of cytochrome b<sub>6</sub>f complex structure and stability. Proteins S2 and S3 are located on the solvent side of the ribosome small subunit (Manuell et al., 2004), and RNA editing events can modify their interactions with other ribosomal proteins and likely with mRNA, with potential effects on the regulatory aspects of plastid translation in response to stress or other homeostasis processes.

The SNP analysis facilitated the evaluation of not only C-to-U editing but also A-to-I editing events in chloroplast tRNAs. The tRNA-Arg (ACG) A-to-I editing event was also observed in all three species in the present study. This change corresponds to an inosine in the wobble position, which enables decoding of the three arginine codons CGU, CGC, and CGA and plays a critical role in plastid protein synthesis (Rogalski et al., 2008). The gene involved in this mechanism in Arabidopsis, At1g68720, encodes a tRNA adenosine deaminase (TADA), which is targeted to plastids. RNAi lines of this gene show markedly reduced A-to-I editing efficiency, with phenotypic consequences such as growth and development delays (Elias and Huang, 2005; Delannoy et al., 2009; Karcher and Bock, 2009). Editing events in other tRNAs have been shown in some species and have been well studied in animals (Su and Randau, 2011) and previously demonstrated in the moss Takakia lepidozioides (Miyata et al., 2008). The method described here can help to identify and measure other tRNA editing events not yet described in plants.
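The effect of an inosine at the wobble position can be illustrated with a small sketch that applies simplified Crick wobble-pairing rules to an anticodon. The pairing table and the function name are illustrative assumptions made for this example; modified bases other than inosine are not modeled.

```python
# Codon third-position bases that each anticodon wobble base (first
# anticodon base, 5'->3') can pair with, under simplified wobble rules.
WOBBLE_PAIRS = {
    "A": "U",
    "C": "G",
    "G": "CU",
    "U": "AG",
    "I": "UCA",  # inosine
}
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codons_read(anticodon):
    """List the codons (5'->3') read by an anticodon written 5'->3'."""
    wobble, middle, last = anticodon
    # Codon and anticodon pair antiparallel: codon base 1 pairs with
    # anticodon base 3, and codon base 2 with anticodon base 2.
    prefix = WATSON_CRICK[last] + WATSON_CRICK[middle]
    return [prefix + third for third in WOBBLE_PAIRS[wobble]]


print(codons_read("ACG"))  # unedited anticodon
print(codons_read("ICG"))  # after A-to-I editing at the wobble position
```

Under these rules the unedited ACG anticodon reads only CGU, whereas the edited ICG anticodon reads CGU, CGC, and CGA.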

In addition to the large amount of data currently available in public databases that can readily be accessed, some biological features of plastid sRNAs can reveal important mechanisms of RNA editing. The precise biogenesis of plastid sRNAs remains unknown because, thus far, there is no evidence of any RNAi machinery in organelles that could generate small RNAs. Notably, there is evidence of a relaxed plastid genome transcription mechanism, resulting in full plastid genome transcription (Hotto et al., 2012). It has been suggested that plastid sRNAs originate from RNA regions protected against degradation by secondary structures or by association with RNA-binding proteins (Pfalz et al., 2009). The results of the present study demonstrated that sRNAs are not necessarily over-represented in regions of editing sites but are also evident in coding sequences with smaller lengths, where these sRNAs can still be observed. These biological features enable the use of sRNA datasets to confirm the results of different RNA editing prediction tools and enable the analysis of editing events in not only a qualitative but also a quantitative manner, depending on the library quality and read coverage.

The identification of editing sites and measurement of editing levels have demonstrated differences among tissues (Tseng et al., 2013) and developmental stages (Miyata and Sugita, 2004). These findings can be used to evaluate the impact of different stresses on these mechanisms (Nakajima and Mulligan, 2001; Van Den Bekerom et al., 2013). Thus, the use of sRNA data to confirm predicted editing sites in association with SNP searches can provide a powerful and reliable plastid editome characterization and measurement, and the results can be applied to compare editing levels in different tissues, developmental stages and physiological conditions.

#### CONCLUSION

Analysis of sRNA libraries can be used to identify and quantify RNA editing events. Using this source of sequence data and pipeline of analyses, we obtained, for the first time, a consistent set of non-conserved and new editing sites in soybean. We propose the use of plastid sRNA libraries as a novel source and approach to study RNA editing events. Until recently, no other studies have taken advantage of such data to screen for RNA editing sites. Thus, the results from the present study should encourage researchers to use small RNA libraries to compare RNA editing in different plants under different conditions to improve knowledge on the editing role of plastid RNA in plant biology.

#### AUTHOR CONTRIBUTIONS

RM, NR, and AC conceived and designed the study. NR conducted in silico analysis. NR and FK conducted the RT-qPCR experiments. NR and GdF analyzed the data. NR and AC drafted the manuscript. All authors have read and approved the manuscript.

#### FUNDING

RM is the recipient of a research fellowship 309030/2015-3, NR is the recipient of a Ph.D. fellowship, and AC and GdF are the recipients of Post-Doctoral fellowships from CNPq. FK was sponsored by a FAPERGS/CAPES-DOCFIX (1634-2551/13-9) grant. The present study was also partially supported through a grant from INCT-MCTIC.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2017.01686/full#supplementary-material

Figure S1 | sRNA length distribution. The histograms represent the percentage of length distribution of each individual class. In black, gray and white bars, Arabidopsis, soybean, and rice read data, respectively.

Figure S2 | Number of plastid genomic sites (Y-axis) and their respective sRNA reads coverage (X-axis). In black, gray and white bars, Arabidopsis, soybean and rice read data, respectively.

Figure S3 | RNA editing site numbers identified by the PREP and SNP approaches in Arabidopsis, soybean and rice. Black bars correspond to sites confirmed only by PREP prediction (>0.5 in prediction score); white bars indicate sites confirmed using the SNP approach; and gray bars show sites confirmed using both approaches.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Rodrigues, Christoff, da Fonseca, Kulcheski and Margis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## REDIdb 3.0: A Comprehensive Collection of RNA Editing Events in Plant Organellar Genomes

Claudio Lo Giudice<sup>1</sup>, Graziano Pesole<sup>1,2</sup> and Ernesto Picardi<sup>1,2</sup>\*

<sup>1</sup> Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Consiglio Nazionale delle Ricerche, Bari, Italy, <sup>2</sup> Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari A. Moro, Bari, Italy

RNA editing is an important epigenetic mechanism by which genome-encoded transcripts are modified by substitutions, insertions and/or deletions. It was first discovered in kinetoplastid protozoa and subsequently reported in a wide range of organisms. In plants, RNA editing occurs mostly by cytidine (C) to uridine (U) conversion in translated regions of organelle mRNAs and tends to modify affected codons, restoring evolutionarily conserved amino acid residues. RNA editing has also been described in non-protein coding regions such as group II introns and structural RNAs. Despite its impact on organellar transcriptome and proteome complexity, current primary databases still do not provide a specific field for RNA editing events. To overcome these limitations, we developed REDIdb, a specialized database for RNA editing modifications in plant organelles. Here we describe its third release, containing more than 26,000 events in a completely novel web interface that accommodates RNA editing in its genomic, biological and evolutionary context through whole genome maps and multiple sequence alignments. REDIdb is freely available at http://srv00.recas.ba.infn.it/redidb/index.html

#### Edited by:

Giovanni Nigita, The Ohio State University, United States

#### Reviewed by:

Giorgio Giurato, Università degli Studi di Salerno, Italy Shihao Shen, University of California, Los Angeles, United States

\*Correspondence:

Ernesto Picardi ernesto.picardi@uniba.it

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

> Received: 01 February 2018 Accepted: 29 March 2018 Published: 11 April 2018

#### Citation:

Lo Giudice C, Pesole G and Picardi E (2018) REDIdb 3.0: A Comprehensive Collection of RNA Editing Events in Plant Organellar Genomes. Front. Plant Sci. 9:482. doi: 10.3389/fpls.2018.00482

Keywords: organellar genomes, RNA editing, plant database, mitochondria, chloroplasts

#### INTRODUCTION

RNA editing is an essential co/post-transcriptional process able to expand transcriptome and proteome diversity in addition to alternative splicing. The term RNA editing was first introduced in 1986 to describe the addition and deletion of uridine nucleotides to and from mRNAs in trypanosome mitochondria (Benne et al., 1986). Since then, RNA editing events have been found in a wide range of organisms and can occur in the nucleus and cytoplasm as well as in organelles (Bowe and dePamphilis, 1996). Modifications due to RNA editing comprise nucleotide substitutions and insertions or deletions that can affect both protein coding and non-protein coding RNAs (Maier et al., 1996; Steinhauser et al., 1999).

In humans, the most prevalent type of RNA editing event is the deamination of adenosine (A) to inosine (I) in double-stranded RNAs (dsRNAs) through the catalytic activity of the adenosine deaminase (ADAR) family of enzymes. To date, more than 4 million events have been collected and annotated in dedicated resources such as DARNED, RADAR, and REDIportal (Kiran et al., 2013; Ramaswami and Li, 2014; Picardi et al., 2017).

In plants, RNA editing occurs mostly in organelles in the form of cytidine (C) to uridine (U) conversion, particularly in translated regions of mRNAs, although the opposite event (U-to-C substitution) has been observed in some taxa, especially in chloroplast RNAs (Takenaka et al., 2013). Plant RNA editing sites are recognized by specific pentatricopeptide repeat (PPR) proteins that are encoded in the nuclear genome. In flowering plants, the editosome machinery requires several additional non-PPR protein factors, even though its molecular assembly has yet to be clarified (Sun et al., 2016).

Most of the C-to-U changes in protein coding regions tend to modify affected codons, restoring evolutionarily conserved amino acid residues (Gray, 2003). Therefore, plant RNA editing is believed to act as an additional proofreading mechanism to generate fully functional proteins. Occasionally, C-to-U modifications occur in untranslated regions, structural RNAs and intervening sequences, affecting splicing and translation efficiency. Indeed, RNA editing changes in domain V of plant group II introns are mandatory for the splicing process (Castandet et al., 2010).

With the advent of high-throughput sequencing technologies, many complete plant organellar genomes have been released and numerous novel RNA editing events uncovered. Nevertheless, RNA editing changes are not always correctly or completely annotated in primary databases (GenBank, ENA and DDBJ), and an appropriate field to unambiguously describe them is not provided. RNA editing modifications are often reported as misc\_feature or even as simple exception notes. With the aim of overcoming these limitations and creating a curated catalog of plant RNA editing events, we developed the specialized REDIdb database. Its first release stored 9,964 modifications distributed over 706 different nucleotide sequences, which increased to 11,897 in the following update.

After 10 years of massively parallel sequencing, we present here REDIdb 3.0, an upgraded release that annotates 26,618 RNA editing events distributed among 281 organisms and 85 complete organellar genomes.

All changes have been recovered from GenBank and the literature using a semi-automated bioinformatics procedure in which each annotation has been manually checked to avoid redundancy or inconsistencies due to errors in flatfiles.

The web interface was completely redesigned and developed using up-to-date technologies for database querying and management.

Furthermore, many computational facilities have been integrated to improve the user experience and ensure continuous and future updates of the database. Indeed, REDIdb 3.0 accommodates RNA editing in its genomic, biological and evolutionary context through whole genome maps and multiple sequence alignments.

Although a variety of RNA editing databases have been released, such as DARNED (Kiran et al., 2013), RADAR (Ramaswami and Li, 2014), and REDIportal (Picardi et al., 2017), REDIdb is the only one devoted to editing changes in plant organelles. Indeed, similar resources such as dbRES (He et al., 2007), RESOPS (Yura et al., 2009), ChloroplastDB (Cui et al., 2006), or GOBASE (O'Brien et al., 2009) have been discontinued or are no longer updated.

#### MATERIALS AND METHODS

All editing events stored in REDIdb derive from GenBank flatfiles through a semi-automated parsing algorithm implemented in custom Python (2.7.13) scripts. Each flatfile is screened for RNA editing features using the SeqIO parser included in the Biopython (1.68) module (Cock et al., 2009).
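To illustrate the kind of screening described above, the following dependency-free sketch scans a toy feature-table fragment for features whose qualifiers mention RNA editing. It is not the REDIdb script, which relies on the Biopython SeqIO parser; the record fragment, the regular expressions, and the function name are assumptions made for this example, and qualifier values wrapped over several lines are not handled.

```python
import re

# Toy feature-table fragment in GenBank flatfile layout (invented for
# illustration; not a real database record).
FLATFILE = '''\
FEATURES             Location/Qualifiers
     CDS             1..300
                     /gene="nad1"
                     /exception="RNA editing"
     misc_feature    25
                     /gene="nad1"
                     /note="RNA editing: C to U"
     misc_feature    120
                     /note="putative binding site"
'''

# A feature line has its key indented by exactly five spaces; qualifier
# lines are indented further and start with '/'.
FEATURE_LINE = re.compile(r"^ {5}(\S+) +(\S+)$")
QUALIFIER_LINE = re.compile(r'^\s+/(\w+)="?(.*?)"?$')


def editing_features(text):
    """Return (key, location, qualifiers) for features mentioning RNA editing."""
    features = []
    for line in text.splitlines():
        feature = FEATURE_LINE.match(line)
        if feature:
            features.append((feature.group(1), feature.group(2), {}))
            continue
        qualifier = QUALIFIER_LINE.match(line)
        if qualifier and features:
            features[-1][2][qualifier.group(1)] = qualifier.group(2)
    return [
        f for f in features
        if any("rna editing" in value.lower() for value in f[2].values())
    ]


for key, location, qualifiers in editing_features(FLATFILE):
    print(key, location, qualifiers)
```

The sketch also shows why manual checking remains necessary: editing can be recorded either as an exception on a CDS or as a free-text note on a misc\_feature, so a purely textual screen can only propose candidates.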

All annotations have been manually checked to identify and correct potential errors, taking into account other related flatfile fields or the literature. The REDIdb database is organized in MySQL tables, and queries are written in Python using the MySQL-python (1.2.5) module, a data access library for the MySQL engine. The web interface is built with Bootstrap (3.3.7), while data presentation is based on DataTables (1.10.13), a JavaScript library that efficiently displays large tables in HTML documents. Genome rendering, available for complete organellar genomes, has been developed in pure Python, mimicking OGDraw graphics (Lohse et al., 2013).

Query results are dynamically generated using the CGI (common gateway interface) technology. Multiple sequence alignments of edited cDNAs and proteins have been generated by ClustalOmega (Sievers et al., 2011) and displayed in HTML pages through the MSAViewer (Yachdav et al., 2016), a JavaScript component of the BioJS collection (https://biojs.net/).

The distribution of RNA editing events along functional domains and predicted protein secondary structures is shown by the feature-viewer JavaScript library (https://github.com/calipho-sib/feature-viewer), which is based on the D3 JavaScript library for visualizing data using web standards (https://d3js.org/). Functional domains have been detected using the InterPro engine (Jones et al., 2014), while protein secondary structures have been predicted using the stand-alone version of the Spider2 program (Yang et al., 2017).

All the scripts to parse multiple alignments, InterPro HTML files and Spider2 outputs have been created in Python. Scripts used to extract RNA editing positions from GenBank flatfiles are freely available at the REDIdb help page. Additional details and supplementary scripts are available upon request.

#### RESULTS

#### Database Content

The previous REDIdb release contained 11,897 editing events distributed over 198 organisms and 929 different nucleotide sequences. This upgraded version instead collects more than 26,000 editing events from 281 organisms, 85 complete organellar genomes and 3,467 sequences. REDIdb 3.0 includes 26,545 events in protein coding sequences and 73 in untranslated regions, structural RNAs and introns. The vast majority of editing changes occur in the mitochondrion, accounting for a total of 23,553 events over 2,300 sequences.
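The counts quoted above can be cross-checked with a few lines of arithmetic. The snippet uses only the figures given in the text; the variable names are chosen here for illustration.

```python
# Counts quoted in the text for REDIdb 3.0 and for the previous release.
coding_events = 26_545
noncoding_events = 73
mitochondrial_events = 23_553
previous_release_events = 11_897

total_events = coding_events + noncoding_events
mitochondrial_share = 100 * mitochondrial_events / total_events
growth_factor = total_events / previous_release_events

print(total_events)  # 26618, the total reported for REDIdb 3.0
print(round(mitochondrial_share, 1), "% of events are mitochondrial")
print(round(growth_factor, 2), "x the previous release")
```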

The most recurrent RNA editing modification is the C-to-U substitution, which accounts for more than 92% of all annotated events and, when located in protein coding regions, tends to modify the amino acid encoded by the edited codon. Indeed, the majority of RNA editing events affect the first and second codon positions, leading to amino acid changes that are the most conserved in comparisons with related orthologs.

#### TABLE 1 | Number of RNA Editing events in complete genomes stored in REDIdb.

Events are divided by sequence (coding/non-coding) and according to their intracellular location. In the presence of multiple accession numbers for the same organism, only the RefSeq record (if present in GenBank) has been considered.

Unlike the previous releases, the new REDIdb database annotates 85 complete organellar genomes. Of these, 57 are mitochondrial genomes and include 7,791 events. As reported in **Table 1**, the most edited mitochondrial genomes are those from Liriodendron tulipifera, Nelumbo nucifera and Ginkgo biloba with 888, 847, and 717 events, respectively. Of the 27 annotated chloroplast genomes, the one from Anthoceros formosae, comprising 564 modifications, is the richest in editing events.

All REDIdb sequences including RNA editing events are identified by unique accession numbers (e.g., EDI0000.). To preserve the full compatibility with previous database versions, accession numbers linked to old entries have been maintained unchanged.

#### Query Form and Output Tables

REDIdb implements a modular query form (**Figure 1A**) allowing users to make flexible searches by selecting the organism or the intracellular location or the gene name. Regarding nucleotide sequences, users can retrieve the original sequence submitted to the primary database or the RefSeq version or both. In addition, the search can be limited to full open reading frames and include individual exons in case of interrupted genes.

Query results are shown in a sortable and exportable summary table (**Figure 1B**) comprising information such as the GenBank accession number, the organism and the link to the related taxonomy, the organelle type and the link to the complete genome (if available), the gene name and a flag indicating its partial or full nature, the editing types and details and the total number of events. Columns can be selectively included in the final table, and results are downloadable in PDF or CSV format. The "Taxonomy" column includes a link to an interactive taxonomy chart, while the "Genome" column contains a link to a chart of the complete genome (if available in primary databases) in which RNA editing events are displayed in their genomic context.

Using the link in the "Gene\_name" column, users can browse individual RNA editing events organized in flatfiles.

#### Entry Organization

RNA editing events stored in REDIdb are organized in specific flatfiles comprising four main sections. The first section (**Figure 2A**) contains a general description of the entry including the organism name, the taxonomy (according to the NCBI Taxonomy database), the GenBank and PubMed accession numbers, the intracellular location (mitochondrion or chloroplast) and the official gene name.

the record (organism, GenBank accession, intracellular location, gene name, PubMed references, etc.), a gene ontology box (B) describing the gene product properties, a feature table (C) with all the editing events and a sequence zone (D) with both the genomic sequence and the corresponding edited transcript/protein.

The second section (**Figure 2B**) is devoted to Gene Ontologies (GO), obtained by matching each protein sequence contained in REDIdb against the InterPro database (Finn et al., 2017). In the case of protein coding genes, it contains information regarding the molecular functions, the biological processes and the cellular localization of the protein product. The third section (**Figure 2C**) shows all the editing features that characterize the record. Here, for each editing event, the position on the transcript is reported and, if the complete reference genome is available, also the genomic location. In the case of editing within protein coding genes, the genomic codon, the edited codon and the amino acid change are determined and reported. Finally, the fourth section (**Figure 2D**) contains the genomic sequence and the corresponding edited transcript. For protein coding genes, the edited protein is also displayed. Genomic sequences as well as edited transcripts and proteins can be retrieved in FASTA format.
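The per-event annotation described for the third section (genomic codon, edited codon and amino acid change) can be sketched as follows, assuming the standard genetic code and a coding sequence that starts in frame. The function name and the short example sequence are hypothetical and do not reproduce REDIdb code.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def describe_c_to_u(cds, position):
    """Describe a C-to-U edit at a 1-based position of a coding sequence.

    Returns (codon_position, genomic_codon, edited_codon, genomic_aa,
    edited_aa). The edited base is written as T, as in cDNA.
    """
    index = position - 1
    if cds[index] != "C":
        raise ValueError("position %d is not a C" % position)
    offset = index % 3           # 0-based position within the codon
    start = index - offset       # first base of the affected codon
    genomic_codon = cds[start:start + 3]
    edited_codon = genomic_codon[:offset] + "T" + genomic_codon[offset + 1:]
    return (
        offset + 1,
        genomic_codon,
        edited_codon,
        CODON_TABLE[genomic_codon],
        CODON_TABLE[edited_codon],
    )


# Invented 9-nt coding sequence: ATG CCA TAA (Met-Pro-stop).
print(describe_c_to_u("ATGCCATAA", 5))
```

In this toy case the edit falls on the second codon position and turns a proline codon (CCA) into a leucine codon (CTA), the same kind of change reported for many plant organellar editing sites.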

#### Graphical Visualization

Edited cDNA and protein sequences can be explored in their evolutionary context through multiple alignments of available orthologous sequences. Since plant RNA editing tends to increase sequence conservation during evolution, annotated RNA editing changes are marked and visualized in the multiple alignment by the MSAViewer to highlight conservation levels and provide valuable comparative genomics information (**Figure 3A**).

In addition, RNA editing events are displayed along the edited sequence together with known functional domains and predicted protein secondary structures in order to better interpret the biological role of specific C-to-U or U-to-C changes (**Figure 3B**).

For complete organellar genomes, each genome is graphically rendered and edited genes can be selectively highlighted. Genome graphs are generated in SVG and show links to edited genes on mouse-over. Further statistics such as the coding potential of the genome as well as the fraction of edited genes are also reported (**Figure 4**).

#### CONCLUSIONS AND PERSPECTIVES

As already mentioned, RNA editing plays an important role in transcriptome and proteome diversity. Since its first discovery in 1986 (Benne et al., 1986), a large number of events have been found in a wide range of eukaryotic organisms (Ichinose and Sugita, 2016). In humans alone, more than 4 million events have been reported, and dedicated resources such as DARNED, RADAR, and REDIportal have been developed to collect them in suitable specialized databases (Kiran et al., 2013; Ramaswami and Li, 2014; Picardi et al., 2017).

In the plant kingdom, RNA editing was first identified as C-to-U substitutions in mitochondrial transcripts (Hiesel et al., 1989), followed by its identification in chloroplasts (Höch et al., 1991). In order to maintain a curated catalog of such events, we developed the specialized REDIdb database. Its third release, described here, contains about three times as many entries as the first version and about twice as many as the second version. To date, REDIdb is the only bioinformatics resource collecting plant organellar RNA editing events. Indeed, similar databases such as dbRES (He et al., 2007) or RESOPS (Yura et al., 2009) have been discontinued or are no longer updated. Plant RNA editing events are also annotated in ChloroplastDB (Cui et al., 2006), devoted to chloroplast genomes, and GOBASE (O'Brien et al., 2009), the organelle genome database. However, such resources are not specialized for RNA editing and include potentially uncorrected errors due to the lack of manual curation (Picardi et al., 2011).

REDIdb 3.0 has been completely redesigned with simplicity as its guiding principle. RNA editing events are always shown in their biological context, and novel graphical facilities have been added. Edited genes are now depicted in complete genome maps, and RNA editing conservation can be investigated in pre-calculated multiple alignments of orthologous sequences. REDIdb 3.0 also allows the visualization of amino acid changes induced by RNA editing in protein domains or secondary structures, providing insights into the potential functional consequences.

Next generation sequencing technologies, now in their third generation, are expected to greatly increase the number of RNA editing candidates in the near future. Therefore, it will be indispensable to collect and annotate them in their biological context, also taking into account the RNA editing levels.

Because it is unique in its field, REDIdb is planned to be maintained and updated over time (as new editing sites or complete genomes are released), taking into account, as much as possible, feedback from users.

#### AUTHOR CONTRIBUTIONS

CL conducted the bioinformatics analyses and wrote the first manuscript draft; EP and GP conceived the study and contributed to writing and revising the manuscript.

#### FUNDING

This work was supported by ELIXIR IIB (CNR).

#### ACKNOWLEDGMENTS

We kindly thank TMR Regina and M. Takenaka for revising the database and fruitful suggestions, and L. Marra for technical and editorial assistance.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Lo Giudice, Pesole and Picardi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning

Jie Song<sup>1,2†</sup>, Jingjing Zhai<sup>1†</sup>, Enze Bian<sup>3†</sup>, Yujia Song<sup>3</sup>, Jiantao Yu<sup>3</sup> and Chuang Ma<sup>1,2</sup>\*

<sup>1</sup> State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Shaanxi, China, <sup>2</sup> Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Shaanxi, China, <sup>3</sup> College of Information Engineering, Northwest A&F University, Shaanxi, China

The emergence of the epitranscriptome opened a new chapter in gene regulation. 5-methylcytosine (m5C), an important post-transcriptional modification, has been shown to be involved in a variety of biological processes such as subcellular localization and translational fidelity. Though high-throughput experimental technologies have been developed and applied to profile m5C modifications under certain conditions, transcriptome-wide studies of m5C modifications are still hindered by the dynamic nature of m5C and the lack of computational prediction methods. In this study, we introduce PEA-m5C, a machine learning-based m5C predictor trained with features extracted from the flanking sequence of m5C modifications. PEA-m5C yielded an average AUC (area under the receiver operating characteristic curve) of 0.939 in 10-fold cross-validation experiments based on known Arabidopsis m5C modifications. Rigorous independent testing showed that PEA-m5C (Accuracy [Acc] = 0.835, Matthews correlation coefficient [MCC] = 0.688) is remarkably superior to the recently developed m5C predictor iRNAm5C-PseDNC (Acc = 0.665, MCC = 0.332). PEA-m5C has been applied to predict candidate m5C modifications in annotated Arabidopsis transcripts. Further analysis of these m5C candidates showed that the position 4 nt downstream of the translational start site is the most frequently methylated. PEA-m5C is freely available to academic users at: https://github.com/cma2015/PEA-m5C.

Keywords: AUC, Epitranscriptome, machine learning, RNA modification, RNA 5-methylcytosine

#### INTRODUCTION

The epitranscriptome, also known as chemical modifications of RNA (CMRs), is a newly discovered layer of gene expression (Meyer and Jaffrey, 2014). With advances in mass spectrometry and high-throughput sequencing technologies, the field of epitranscriptomics is rapidly expanding and attracting a degree of research interest comparable to that in DNA and histone modifications in the field of epigenetics (Helm and Motorin, 2017). Of the more than 150 types of CMRs identified, most have been found in transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) (Hussain et al., 2013), but some can occur in mRNAs and noncoding RNAs (Machnicka et al., 2013; Pan, 2013; Carlile et al., 2014; Dominissini et al., 2016; David et al., 2017). A growing body of evidence indicates that CMRs located in both coding and noncoding regions can play essential roles in a variety of biological processes. For instance, N<sup>6</sup>-methyladenosine (m6A) sites in the 5′-untranslated region (UTR) can promote cap-independent translation under heat stress (Meyer et al., 2015; Zhou et al., 2015), while m6A sites in coding regions can affect translation dynamics by inducing steric constraints and destabilizing pairing between codons and tRNA anticodons (Choi et al., 2016; Zhao et al., 2017). Thus, the transcriptome-wide annotation of RNA modifications is essential for fully understanding the biological functions of CMRs.

#### Edited by:

Giovanni Nigita, The Ohio State University, United States

#### Reviewed by:

Salvatore Alaimo, Università degli Studi di Catania, Italy

Zhaohui Steve Qin, Emory University, United States

> \*Correspondence: Chuang Ma chuangma2006@gmail.com

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

> Received: 07 January 2018 Accepted: 04 April 2018 Published: 18 April 2018

#### Citation:

Song J, Zhai J, Bian E, Song Y, Yu J and Ma C (2018) Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning. Front. Plant Sci. 9:519. doi: 10.3389/fpls.2018.00519

Compared with well-characterized modifications such as m6A and N<sup>1</sup>-methyladenosine (m1A), the transcriptome-wide annotation of 5-methylcytosine (m5C) modifications is more challenging. First, bisulfite sequencing technologies are difficult to implement for profiling m5C modifications because of the instability of mRNA molecules treated with bisulfite (Amort et al., 2013; Li et al., 2016). In addition, other existing high-throughput sequencing technologies, such as m5C-RIP (Edelheit et al., 2013), can localize m5C residues to transcript regions 100–200 nucleotides (nt) long, but fail to accurately identify m5C modifications at single-nucleotide resolution. Second, because of the dynamic nature of m5C (Wang and He, 2014), existing high-throughput sequencing technologies can only capture a snapshot of RNA modifications under certain experimental conditions, and cover just a small fraction of the whole transcriptome of a given sample (Zhou et al., 2016), resulting in significant numbers of false negatives (non-detected true m5C modifications). Third, the base preferences around m5C sites are not strong, which increases the difficulty of computational prediction with traditional statistical approaches. Machine learning (ML) is a branch of artificial intelligence technology that has been widely used in engineering, computer science, informatics and biology (Ma et al., 2014a, 2017; Cui et al., 2015; Libbrecht and Noble, 2015; Zhai et al., 2016). The biggest advantage of ML systems is that they can automatically learn interesting patterns from existing datasets and improve their own performance in accurately predicting novel knowledge from a new data set (Ma et al., 2014a,b). Therefore, computational methods coupled with machine learning technologies may provide an option to accurately annotate RNA modifications like m5C in a transcriptome-wide manner.

To date, iRNAm5C-PseDNC is the only available m5C predictor. It was built using the random forest (RF) algorithm with sequence-based features, and has been reported to have good predictive performance for mammalian m5C prediction (Qiu et al., 2017). However, because of lineage-specific differences in sequence and structural properties between plant and mammalian species, tools developed for mammalian species cannot always retain their original performance when applied to other organisms (Leclercq et al., 2013; Zhai et al., 2017). This issue underscores the need for accurate transcriptome-wide m5C prediction tools in plants, which may lay a foundation for elucidating the mechanisms of formation and the cellular functions of m5C modifications.

In this study, we developed PEA-m5C, an accurate transcriptome-wide m5C predictor under an ML framework with an ensemble of 10 RF-based prediction models. PEA-m5C was trained with features extracted from the flanking sequence of m5C modifications, and showed promising performance when applied to predict m5C modifications in Arabidopsis thaliana. We further applied PEA-m5C to predict candidate m5C modifications in annotated Arabidopsis transcripts, and found that candidate m5C modifications are enriched in the coding region of mRNAs. In addition, the position 4 nt downstream of the translational start site is the most frequently methylated position. All candidate m5C modifications have been deposited in a public database named Ara-m5C for follow-up functional studies. In order to facilitate the application of PEA-m5C, we have implemented the proposed model in a cross-platform, user-friendly and interactive interface with the R and JAVA programming languages.

#### MATERIALS AND METHODS

#### Dataset Generation

In this study, we constructed four m5C datasets: DatasetCV (cross-validation dataset), DatasetHT (hold-out test dataset), DatasetIT1 (independent test dataset for samples from the Arabidopsis silique tissue) and DatasetIT2 (independent test dataset for samples from the Arabidopsis shoot tissue).

DatasetCV and DatasetHT were constructed based on m5C modifications in transcripts expressed in the Arabidopsis root tissue, identified at single-nucleotide resolution using RNA bisulfite sequencing technology (David et al., 2017). During bisulfite conversion, unmethylated cytosines were converted into uracils, while methylated cytosines were not converted. Bisulfite-treated RNA samples were sequenced to generate 100-nt paired-end reads using the Illumina HiSeq 2500. Low-quality reads were processed using Trimmomatic (Bolger et al., 2014), and the remaining clean reads were globally mapped to in silico bisulfite-converted Arabidopsis reference genome sequences using the RNA mode of B-Solana (Kreck et al., 2012). For each cytosine site in the Arabidopsis reference genome, the methylation level was calculated using a proportion statistic: P = (C+ψ)/(T+C), where C and T represent the number of cytosines and thymines in aligned reads at the cytosine site under analysis, respectively, and ψ specifies the added pseudo counts (1/8 counts). The false discovery rate (FDR) was calculated using the R package qvalue (Storey, 2002). Cytosines were regarded as positive samples (m5C modifications) if they satisfied the following criteria: methylation level ≥ 1% and FDR ≤ 0.3. After the removal of sequence redundancy, we finally obtained 1,296 m5C modifications in 885 transcripts (**Table S1**). In these 885 transcripts, cytosines were regarded as negative samples (non-m5C modifications) if they were not annotated as m5C modifications. In order to avoid over-fitting and GC bias in the training process, we limited the number of negative samples to 10 times that of positive samples. Thus, for each positive sample, 10 negative samples with a GC content difference of no more than 5% were selected from the 200-nt region around the positive sample. This allows a similar distribution of positive and GC-matched negative samples, which is markedly different from the background distribution of all cytosines in these 885 transcripts (**Figure S1**). Note that some of the negative samples may in fact be true m5C modifications not yet discovered. We randomly divided these 1,296 positive samples and 12,960 negative samples into two parts for constructing DatasetCV and DatasetHT, respectively. The DatasetCV comprises 1,196 positive samples and 11,960 negative samples, while the DatasetHT has a balanced number (100) of positive and negative samples (**Table S1**).
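The GC-matched selection of negative samples can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that GC content is measured over the 43-nt window around each cytosine, that the "200-nt region" means 100 nt on either side of the positive site, and that the 5% limit is applied to the difference from the positive sample's window. The names `gc_content` and `pick_negatives` are hypothetical.

```python
import random

def gc_content(seq):
    """Fraction of G and C in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def pick_negatives(transcript, pos_site, known_m5c, flank=21, region=100,
                   max_gc_diff=0.05, n=10, seed=0):
    """Pick up to n GC-matched, unannotated cytosines near a positive site.

    Assumptions (not stated in the text): GC content is taken over the
    (2 * flank + 1)-nt window, and the search region is +/- `region` nt.
    """
    def window(i):
        return transcript[max(0, i - flank):i + flank + 1]

    target = gc_content(window(pos_site))
    start = max(0, pos_site - region)
    stop = min(len(transcript), pos_site + region + 1)
    candidates = [
        i for i in range(start, stop)
        if transcript[i].upper() == "C"
        and i not in known_m5c                      # skip annotated m5C sites
        and abs(gc_content(window(i)) - target) <= max_gc_diff
    ]
    rng = random.Random(seed)                       # fixed seed for repeatability
    return sorted(rng.sample(candidates, min(n, len(candidates))))
```

If fewer than 10 candidates satisfy the constraints, this sketch returns all of them; how such cases were handled in the original pipeline is not described.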

Using the same criteria mentioned above, another two datasets (DatasetIT1: 79 positive and negative samples; DatasetIT2: 73 positive and negative samples) were also constructed for Arabidopsis silique and shoot tissues, respectively (**Table S1**). Of note, positive and negative samples in DatasetIT1 and DatasetIT2 did not overlap with those in DatasetCV and DatasetHT.

Each sample in these four datasets was represented by a sequence window of 43 nucleotides centered around the respective cytosine site. For samples near the borders of the available RNA sequence, the positions missing from the 43-nt window were filled with "N," the symbol for unknown. The Arabidopsis reference genome sequences (TAIR10) and annotated transcripts used in this study were downloaded from the Araport 11 database (https://www.araport.org/data/araport11).
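The window extraction with "N" padding can be sketched as follows. This is an illustrative example rather than code from PEA-m5C; the function name `extract_window` and the 0-based indexing are assumptions. A 43-nt window corresponds to 21 nt on each side of the central cytosine.

```python
def extract_window(seq, pos, upstream=21, downstream=21):
    """Return the window around seq[pos], padded with 'N' at the borders.

    `pos` is a 0-based index; the result always has length
    upstream + 1 + downstream.
    """
    if not 0 <= pos < len(seq):
        raise IndexError("pos is outside the sequence")
    left = seq[max(0, pos - upstream):pos]
    right = seq[pos + 1:pos + 1 + downstream]
    return ("N" * (upstream - len(left)) + left + seq[pos]
            + right + "N" * (downstream - len(right)))
```

For instance, `extract_window("CAG", 0, 2, 2)` pads two missing upstream positions and gives `"NNCAG"`.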

#### Feature Encoding

In order to be recognized by ML-based systems, each sample with a window size of L nt was represented as a numeric vector (length: 4∗L + 106) using the binary, k-mer and PseDNC encoding schemes. The details of these three encoding schemes are described below.

#### Binary Encoding

This encoding strategy generates a vector of 4∗L features by representing "A," "C," "G," "U," and "N" as (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), and (0, 0, 0, 0), respectively, for each sample.
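A minimal Python sketch of this binary encoding is given below. The name `binary_encode` is hypothetical, and treating "T" as "U" is an assumption added so that DNA-style input is also accepted.

```python
# One-hot codes as listed in the text; 'N' (unknown) maps to all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),
}

def binary_encode(window):
    """Encode a window of length L as a flat list of 4 * L binary features."""
    features = []
    for base in window.upper().replace("T", "U"):  # assumption: T is read as U
        features.extend(ONE_HOT[base])
    return features
```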

#### K-mer Encoding

In this scheme, the composition of short sequences of different lengths was considered to explore its potential effect on the identification of m5C. In order to avoid the curse of dimensionality, we set k = 1, 2, and 3 to generate 84 features, namely the frequencies of mononucleotide occurrence (k = 1; four features), dinucleotide occurrence (k = 2; 16 features) and trinucleotide occurrence (k = 3; 64 features).
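The 84 k-mer features can be computed along the following lines. This is a sketch under stated assumptions: frequencies are normalized by the number of k-mers in the window that contain no "N", and features are ordered alphabetically within each k. Neither detail is specified in the text, and `kmer_features` is a hypothetical name.

```python
from itertools import product

BASES = "ACGU"

def kmer_features(window, ks=(1, 2, 3)):
    """Return k-mer frequencies for k = 1, 2, 3 (4 + 16 + 64 = 84 values)."""
    window = window.upper().replace("T", "U")  # assumption: T is read as U
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product(BASES, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        total = 0
        for i in range(len(window) - k + 1):
            kmer = window[i:i + k]
            if kmer in counts:                 # k-mers containing 'N' are skipped
                counts[kmer] += 1
                total += 1
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    return features
```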

#### PseDNC Encoding

The pseudo dinucleotide composition (PseDNC) is a widely used encoding strategy that considers sequential information as well as physicochemical properties of dinucleotides in the RNA sequence (Chen et al., 2015, 2017). For each sample, it generates 16+λ numeric features, the first 16 of which are features extracted from adjacent dinucleotide pairs, and the other λ are features extracted from distant dinucleotide pairs (λ denotes the maximal distance between two dinucleotides). The detailed definition of PseDNC is presented in **Supplementary Data 1**.
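As a consistency check on the feature counts given above, the three schemes can be added up. The value of λ is not stated explicitly in the text; the total vector length of 4∗L + 106 implies the value derived below.

$$\begin{aligned} N_{\text{features}} &= \underbrace{4L}_{\text{binary}} + \underbrace{4 + 16 + 64}_{k\text{-mer}} + \underbrace{16 + \lambda}_{\text{PseDNC}} = 4L + 100 + \lambda, \\ 4L + 100 + \lambda &= 4L + 106 \;\Rightarrow\; \lambda = 6. \end{aligned}$$

Under this reading, a 43-nt window yields 4 × 43 + 106 = 278 features and an 11-nt window yields 4 × 11 + 106 = 150 features.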

#### Development of ML-Based m5C Predictor

**Figure 1** illustrates the workflow of PEA-m5C, which consists of three phases, namely, (A) model construction, (B) model optimization, and (C) model prediction. Model construction and optimization were performed on the DatasetCV.

#### Model Construction

To construct an m5C prediction model, PEA-m5C required an input of a set of positive and negative samples. These samples were transformed into a feature matrix using three different encoding schemes (binary, k-mer, and PseDNC). The feature matrix was input into the RF algorithm to construct an m5C prediction model, which consisted of 100 classification trees. Each of the classification trees was built using a set of bootstrapped samples and features. The output of the RF-based m5C prediction model was determined by a majority vote of the classification trees. The RF algorithm was implemented using the R package "RWeka" (Hornik et al., 2009), which provides an R environment to invoke the ML package "Weka" (v3.9.1; https://www.cs.waikato.ac.nz/ml/weka).

#### Model Optimization

Ten-fold cross-validation experiments were performed to optimize m5C prediction models in PEA-m5C by iteratively varying window size and feature number. Cross-validation is a standard method for estimating the generalization accuracy of ML systems. In a ten-fold cross-validation, the DatasetCV was randomly divided into 10 equal subsets and each subset was iteratively selected as a testing set for evaluating the model trained with the other nine subsets. In each fold of cross-validation, considering the strong imbalance between positive and negative samples (1:10), the negative samples were randomly divided into 10 parts, each of which was coupled with the set of positive samples to train an RF-based m5C prediction model. Therefore, ten RF-based m5C prediction models were constructed in the training process. In the testing process, each sample was scored using these ten RF-based m5C prediction models. The corresponding ten prediction scores were averaged as the final prediction score of the sample under analysis. Once the testing process was completed, the prediction accuracy of PEA-m5C (an ensemble of ten RF-based m5C prediction models) was evaluated using receiver operating characteristic (ROC) analysis, which plots the true positive rate (TPR) against the false positive rate (FPR) at varying score thresholds. The area under the ROC curve (AUC) was used to quantitatively score the prediction performance of PEA-m5C. AUC ranges from 0 to 1, with higher values indicating better prediction performance. After all 10 subsets had been successively used as the testing set, the corresponding 10 AUC values were averaged as the overall prediction performance of PEA-m5C.
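The ensemble scheme described above (split the negatives into ten parts, train one model per part together with all positives, average the ten scores, then compute the AUC) can be sketched in plain Python. The trainer is passed in as a function so that any classifier can be plugged in; `toy_train_fn` is a hypothetical stand-in that scores by distance to class means and is not a random forest. All function names are assumptions for illustration.

```python
import random

def split_negatives(negatives, n_parts=10, seed=0):
    """Randomly divide the negative samples into n_parts disjoint parts."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]

def train_ensemble(positives, negatives, train_fn, n_parts=10, seed=0):
    """Train one model per negative part, each paired with all positives.

    train_fn(pos, neg) must return a callable mapping a sample to a score.
    """
    parts = split_negatives(negatives, n_parts, seed)
    return [train_fn(positives, part) for part in parts]

def ensemble_score(models, sample):
    """Average the scores of all models for one sample."""
    return sum(model(sample) for model in models) / len(models)

def roc_auc(pos_scores, neg_scores):
    """AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def toy_train_fn(pos, neg):
    """Hypothetical stand-in for an RF trainer, for 1-D numeric samples only."""
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)

    def score(x):
        d_pos, d_neg = abs(x - mean_pos), abs(x - mean_neg)
        return d_neg / (d_pos + d_neg) if d_pos + d_neg else 0.5
    return score
```

The pairwise AUC used here is equivalent to the area under the ROC curve and is convenient for small examples; for datasets of the size of DatasetCV a rank-based implementation would be faster.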

PEA-m5C was optimized to maximize the AUC by iteratively varying the window size L from 5 to 43 nt and the feature number from 2 to 4∗L+106. The feature subset was selected according to the feature importance estimated using the information gain approach implemented in the R package "FSelector" (Cheng et al., 2012). The detailed process of model optimization is given in **Figure 2**. We initialize the AUC matrix ("AUCMatrix") and feature matrix ("FMatrix") as two empty sets (**Lines 1-2**). Then for a given window size L (5-nt ≤ L ≤ 43-nt) (**Line 3**), we varied the upstream sequence length (L<sub>u</sub>) from 1-nt to (L-2)-nt and the number of features in the subset from 2 to 4∗L+106 (**Lines 4-7**). Subsequently, for each feature subset, we performed a 10-fold cross-validation experiment and stored the corresponding AUC value in a vector ("AUCVector") (**Lines 8-9**). After all possible feature subsets have been examined using 10-fold cross-validation experiments, the maximum AUC in "AUCVector" is stored in the "AUCMatrix" (**Lines 11-12**), and the corresponding feature subset with maximum AUC is stored in "FMatrix" (**Lines 13-15**). Finally, after all possible window sizes have been examined, the optimized L<sub>u</sub> and L<sub>d</sub> can be obtained by searching for the maximum value in "AUCMatrix" (**Lines 18-19**), and the optimized feature subset can be obtained by searching "FMatrix" with L<sub>u</sub> and L<sub>d</sub> (**Lines 20-21**).

FIGURE 2 | The pseudo-code for model optimization.
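The search described above can be illustrated with a short Python sketch. It is not a transcription of the Figure 2 pseudo-code: it keeps a single running best instead of the "AUCMatrix" and "FMatrix" structures, and the cross-validation itself is abstracted into a hypothetical callback `cv_auc(lu, ld, n_features)` that is assumed to return the mean 10-fold AUC for the top-ranked `n_features` features at that window.

```python
def optimize(cv_auc, min_window=5, max_window=43, extra_features=106):
    """Exhaustively search window layout and feature number for the best AUC.

    cv_auc is a user-supplied (hypothetical) function; the window size is
    L = lu + 1 + ld, i.e. upstream flank, central cytosine, downstream flank.
    """
    best = {"auc": float("-inf")}
    for window in range(min_window, max_window + 1):
        for lu in range(1, window - 1):            # Lu = 1 .. L-2
            ld = window - 1 - lu                   # remaining downstream length
            max_features = 4 * window + extra_features
            for n_features in range(2, max_features + 1):
                auc = cv_auc(lu, ld, n_features)
                if auc > best["auc"]:
                    best = {"auc": auc, "lu": lu, "ld": ld,
                            "n_features": n_features}
    return best
```

With the default bounds this loop makes on the order of 10<sup>5</sup> calls to `cv_auc`, each standing for a full 10-fold cross-validation, which illustrates why the optimization is computationally expensive.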

#### Model Prediction

PEA-m5C predicted all candidate m5C modifications in given RNA sequences in FASTA format. For each cytosine site, PEA-m5C first extracted the flanking sequence with the optimized window size. Then, the three feature encoding schemes were applied to transform the flanking sequence into a numeric vector. Subsequently, the optimized feature subset was input into the ten RF-based m5C prediction models. Finally, PEA-m5C generated a prediction score reflecting the probability that this cytosine is a real m5C modification. Of note, four thresholds have also been included in PEA-m5C, which were automatically determined in the 10-fold cross-validation at specificity levels of 99, 95, 90, and 85%, respectively. These four thresholds correspond to four different confidence modes of PEA-m5C: VHMode (very high confidence mode), HMode (high confidence mode), NMode (normal confidence mode) and LMode (low confidence mode), respectively. Cytosine sites with a prediction score higher than the threshold were predicted as positive samples; otherwise, they were predicted as negative samples.
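One way to derive score thresholds at fixed specificity levels is sketched below. The text does not say how PEA-m5C computes its thresholds, so this is an assumed procedure: the cut-off is the smallest observed negative score such that at least the requested fraction of negatives scores at or below it. The names `MODES`, `threshold_at_specificity` and `classify` are hypothetical.

```python
import math

# Specificity level for each confidence mode, as listed in the text.
MODES = {"VHMode": 0.99, "HMode": 0.95, "NMode": 0.90, "LMode": 0.85}

def threshold_at_specificity(neg_scores, specificity):
    """Return a cut-off that keeps at least `specificity` of negatives at or below it."""
    ordered = sorted(neg_scores)
    # Rounding guards against floating-point noise such as 0.29 * 100.
    k = math.ceil(round(specificity * len(ordered), 9))
    return ordered[max(k, 1) - 1]

def classify(score, threshold):
    """A site is called positive only if its score exceeds the threshold."""
    return score > threshold
```

A stricter mode gives a higher threshold, so `threshold_at_specificity(scores, MODES["VHMode"])` is never lower than the value obtained for `MODES["LMode"]` on the same scores.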

#### Model Comparisons

iRNAm5C-PseDNC is the only available m5C predictor that aims to accurately predict m5C modifications in mammalian genomes. It was constructed using the RF algorithm with only PseDNC features, and was trained with mammalian m5C modifications (window size: 41-nt) (Sun et al., 2016). In order to fairly compare prediction performance between iRNAm5C-PseDNC and our proposed model PEA-m5C, we also re-trained iRNAm5C-PseDNC with positive and negative samples of 41-nt in the DatasetCV, and this retrained prediction model was named iRNAm5C-PseDNC<sup>∗</sup>. Prediction performance of iRNAm5C-PseDNC, iRNAm5C-PseDNC<sup>∗</sup> and PEA-m5C was estimated on DatasetHT, DatasetIT1 and DatasetIT2 using six widely used measures: sensitivity (Sn, also known as recall), specificity (Sp), precision (Pr), accuracy (Acc), F1-score (F1), and Matthews correlation coefficient (MCC). These measures were defined as follows:

$$\begin{aligned} \text{Sn} &= \frac{\text{TP}}{\text{TP} + \text{FN}}, \\ \text{Sp} &= \frac{\text{TN}}{\text{TN} + \text{FP}}, \\ \text{Pr} &= \frac{\text{TP}}{\text{TP} + \text{FP}}, \\ \text{Acc} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}, \\ \text{F}_{1} &= \frac{2 \times \text{Pr} \times \text{Sn}}{\text{Pr} + \text{Sn}} = \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}, \\ \text{MCC} &= \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP}) \times (\text{TP} + \text{FN}) \times (\text{TN} + \text{FP}) \times (\text{TN} + \text{FN})}}, \end{aligned}$$

where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives and false negatives, respectively. F<sub>1</sub> is the harmonic mean of Pr and Sn. Compared with Sn, Sp, Pr, and F<sub>1</sub>, Acc and MCC are two more informative measures that combine all of the predictions (TP, TN, FP, and FN) into a single score. Acc, which ranges from 0 to 1, measures the proportion of correct predictions. MCC, also known as the phi coefficient, measures the correlation between the observations and predictions. It is generally regarded as a balanced measure, which can be used even if the two classes are of very different sizes. The value of MCC ranges from −1 to 1, where 1 represents a perfect prediction, 0 indicates a prediction no better than random and −1 means total disagreement between observations and predictions.
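The six measures defined above translate directly into code. The sketch below assumes that both classes are present (so Sn and Sp are defined) and returns 0.0 for Pr, F<sub>1</sub> or MCC when their denominators vanish, which is a convention chosen here rather than one stated in the text.

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Pr, Acc, F1 and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pr = tp / (tp + fp) if tp + fp else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Pr": pr, "Acc": acc, "F1": f1, "MCC": mcc}
```

As a made-up example, `metrics(tp=80, tn=90, fp=10, fn=20)` gives Sn = 0.80, Sp = 0.90, Acc = 0.85 and an MCC of about 0.70; these counts are hypothetical and do not come from the study.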

#### Transcriptome-Wide m5C Annotation and Analysis

Candidate m5C sites in the annotated Arabidopsis transcripts were predicted using PEA-m5C. The spatial distribution of candidate m5C modifications was statistically analyzed in three aspects: (i) feature enrichment (e.g., 5′-UTR, coding region [CDS] and 3′-UTR) analysis of candidate m5C modifications in coding RNAs; (ii) the most frequently methylated position relative to the translational start site; (iii) functional enrichment analysis of genes containing candidate m5C modifications.

The base preference around candidate m5C modification sites was also explored, including: (i) the proportion of m5C modifications in different sequence contexts: CG, CHG and CHH (H: A, T or C); (ii) sequence motifs of candidate m5C modifications.
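Assigning a cytosine to the CG, CHG or CHH context only requires the two bases downstream of it. A minimal sketch follows; the name `cytosine_context` is hypothetical, "U" is read as "T" so that RNA and DNA input behave alike, and sites too close to the 3′ end to be classified return `None` (an assumed convention).

```python
def cytosine_context(seq, pos):
    """Classify the cytosine at 0-based index pos as 'CG', 'CHG' or 'CHH'."""
    seq = seq.upper().replace("U", "T")
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    following = seq[pos + 1:pos + 3]
    if following[:1] == "G":
        return "CG"
    if len(following) == 2 and following[0] in "ACT":
        if following[1] == "G":
            return "CHG"
        if following[1] in "ACT":
            return "CHH"
    return None  # too close to the 3' end, or an ambiguous base such as 'N'
```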

#### RESULTS

#### Characterization of m5C Modifications Using Sequence-Based Features

To investigate whether m5C modifications can be identified using sequence-based features, we first examined the positional frequencies of the four bases in positive and negative samples in the DatasetCV (**Figures 3A,B**). We observed that the positional base frequency appears to be stable in negative samples. In contrast, the positional base frequency was biased toward guanine (G) in the region near m5C sites in positive samples. We then detected position-specific base usages using the rank sum test. Setting the significance level (p-value) to 1.0E-10, we found that 15 position-specific base usages are significantly different between m5C and non-m5C modifications. They are −9G, −7T, −6A, −3G, −2T, −2G, −2C, −1T, −1C, 1G, 1C, 2T, 2G, 6G. The difference can be visualized by comparing the frequencies of these position-specific bases in m5C and non-m5C modifications (**Figure 3C**). Furthermore, through two-sample logo analysis using the R package "DiffLogo" (Nettling et al., 2015), we observed a similar trend of specific nucleotide usage preferences around m5C modifications (**Figure 3D**). These results indicate that base frequency differences exist between m5C and non-m5C modifications.
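Positional base frequencies of the kind compared above can be tabulated with a few lines of Python. This sketch only covers the counting step, not the rank sum test; `positional_frequencies` is a hypothetical name, and padding symbols such as "N" are excluded from the denominator by assumption.

```python
from collections import Counter

def positional_frequencies(windows, alphabet="ACGU"):
    """Return, for each window position, the frequency of each base.

    All windows are assumed to have the same length.
    """
    length = len(windows[0])
    frequencies = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        total = sum(counts[b] for b in alphabet)   # ignores 'N' padding
        frequencies.append(
            {b: counts[b] / total if total else 0.0 for b in alphabet}
        )
    return frequencies
```

Running this separately on the positive and the negative windows gives two tables that can be compared position by position.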


We then examined sequence-based features generated from k-mer and PseDNC encoding schemes. **Figure 4A** displays the mean values of these features for positive and negative samples. When the window size is 11-nt (L<sub>u</sub> = L<sub>d</sub> = 5), we detected 70 k-mer-based features and 19 PseDNC-based features significantly different between positive and negative samples (two-sample t-test; p ≤ 1.0E-4). The top-five ranked features are the frequency of T, G, GG and PseDNC-11, PseDNC-15 (**Figure 4B**). When the window size was extended from 11-nt to 41-nt (L<sub>u</sub> = L<sub>d</sub> = 20), we also detected 32 k-mer-based features and 12 PseDNC-based features at the significance level of 1.0E-4. The top-five ranked features are the frequency of G, GG and GGC, PseDNC-11 and PseDNC-15 (**Figure 4B**).

Taken together, these results indicate that the three encoding schemes, binary, k-mer and PseDNC, can generate discriminative features for m5C prediction. However, the importance of different features is affected by the window size used.

#### A Machine Learning-Based m5C Predictor With Optimized Window Size and Features

To obtain the optimized window size and feature subset, we iteratively performed ten-fold cross-validation experiments on the DatasetCV by varying the window size L from 5-nt to 43-nt and the feature number F from 2 to 4∗L+106 (**Figure 5A**). For a given window size of L (e.g., upstream region: L<sub>u</sub> = 10 and downstream region: L<sub>d</sub> = 5) and feature number of F (e.g., F = 50), we performed a 10-fold cross-validation experiment to calculate an AUC value for evaluating the prediction performance of PEA-m5C. Then, at the given window size L, the best AUC value achieved by PEA-m5C can be found according to the curve depicted in **Figure 5B**, where the x axis represents the number of selected features and the y axis represents the AUC yielded by PEA-m5C. After examining all possible combinations of window sizes and feature numbers, we observed that PEA-m5C achieved the highest AUC value of 0.939 (**Figure 5A**) when the window size was set to 11-nt (L<sub>u</sub> = L<sub>d</sub> = 5) and the 50 top-ranked features were used (**Figure 5C**, **Table S2**).

#### Prediction Evaluation and Comparison Using Hold-Out and Independent Testing Sets


After training PEA-m5C using the DatasetCV with the optimized window size and feature subset, we next evaluated the performance of PEA-m5C on a hold-out test set (DatasetHT). As shown in **Figure 6A**, the prediction score of positive samples (mean ± standard deviation [sd]: 0.775 ± 0.223) was significantly higher than that of negative samples (mean ± sd: 0.194 ± 0.225). This result indicates that PEA-m5C could provide competitive performance in discriminating positive and negative samples. Indeed, PEA-m5C gave an area under the ROC curve (AUC) and an area under the precision-recall curve (auPRC) of 0.939 and 0.945, respectively (**Figures 6B,C**). To assess the performance more comprehensively, six measures (Sn, Sp, Pr, Acc, MCC, and F1) were examined at four thresholds, corresponding to the specificity levels of 99% (very high confidence mode; VHMode), 95% (high confidence mode; HMode), 90% (normal confidence mode; NMode) and 85% (low confidence mode; LMode) in the 10-fold cross-validation experiment, respectively (**Table 1**). In line with the intuitive observations of the ROC curve (**Figure 6B**) and precision-recall curve (**Figure 6C**), PEA-m5C performed markedly better than random selection (AUC = 0.5, auPRC = 0.5, and MCC = 0) in predicting m5C modifications at four different specificity levels (**Table 1**).

Currently, iRNAm5C-PseDNC is the only software available for m5C prediction; however, it was built based on mammalian m5C modifications. This provides us an opportunity to evaluate whether iRNAm5C-PseDNC could retain prediction accuracy on Arabidopsis m5C modifications. We observed that iRNAm5C-PseDNC yielded a high specificity of 0.980, but an extremely low sensitivity of 0.010. The main reason is that there are significant differences between mammalian and Arabidopsis m5C modifications (**Figure S2**). To examine the effectiveness of ML algorithms in iRNAm5C-PseDNC, we generated a new prediction model (named iRNAm5C-PseDNC<sup>∗</sup>) by re-training iRNAm5C-PseDNC using positive and negative samples from the DatasetCV and evaluated its performance using the DatasetHT. Compared with iRNAm5C-PseDNC, iRNAm5C-PseDNC<sup>∗</sup> yielded higher prediction accuracy at the level of Sn, Sp, Pr, Acc, MCC, and F1. However, PEA-m5C still achieved higher prediction accuracy than iRNAm5C-PseDNC and iRNAm5C-PseDNC<sup>∗</sup> (**Table 1**). The prediction performance of PEA-m5C was also better than that of iRNAm5C-PseDNC and iRNAm5C-PseDNC<sup>∗</sup> on DatasetIT1 and DatasetIT2, which consist of samples from Arabidopsis silique and shoot tissues, respectively (**Table S3**).

Taken together, these results indicate that the construction of an Arabidopsis thaliana-specific predictor is necessary and crucial. In addition, PEA-m5C is a useful tool for the prediction of m5C sites in Arabidopsis transcripts.

#### Transcriptome-Wide Annotation and Analysis of Candidate m5C Modifications

The encouraging performance of PEA-m5C in the cross-validation and validation testing experiments provides us an opportunity to accurately predict m5C sites in the annotated Arabidopsis transcripts. At the threshold of 0.891 (VHMode), PEA-m5C predicted 303,421 candidate m5C modifications (**Table 2**), covering 4.56% of cytosines (303,421/6,650,570) in all annotated transcripts in the Araport 11 database (https://www.araport.org/data/araport11). During the writing of our manuscript, Cui and colleagues identified 4,439 m5C peaks in 3,534 expressed genes (**Table S4**) in young seedlings of Arabidopsis (Cui et al., 2017), by applying m5C RNA immunoprecipitation followed by a deep-sequencing approach. We validated the m5C predictions using these 4,439 m5C peaks. Among the 3,534 expressed genes, PEA-m5C identified 5,463 candidate m5C modifications, covering 2,724 of 4,439 reported peak regions. We note that the proportion of covered m5C peaks increased from 61.4% (2,724/4,439) to 89.4% (3,968/4,439) when the HMode was used.

It is well known that cytosines in DNA sequences can be methylated in three sequence contexts, namely CG, CHG, and CHH (H = A, C, or T) (Smith and Meissner, 2013). In this study, we explored the levels of cytosine methylation in RNA sequences. We observed that 24.7, 27.8, and 47.5% of the candidate m5C modifications are methylated in the CG, CHG, and CHH sequence contexts, respectively. These proportions are markedly different from those of cytosines in background sequences (CG: 15.1%, CHG: 17.9%, CHH: 67.0%) (**Figure 7A**). Statistical analysis of base preference showed that there is a very strong "G" signal around candidate m5C modifications (**Figure 7B**). These results indicate that candidate m5C modifications predicted by PEA-m5C may have potential biological functions.

FIGURE 6 | Performance evaluation of PEA-m5C on the DatasetHT. (A) The different distribution of prediction scores for positive and negative samples. (B) The ROC curve illustrating the high performance of PEA-m5C. (C) The precision-recall curve illustrating the high performance of PEA-m5C.

TABLE 1 | Prediction performance of m5C predictors on DatasetHT.

\*An m5C prediction model generated by re-training iRNAm5C-PseDNC using positive and negative samples from the DatasetCV.

TABLE 2 | Candidate m5C modifications in different types of RNAs. Num: number, Prop: proportion, trans: transcripts.

Toward a better understanding of these candidate m5C modifications, we further analyzed the enrichment of m5C within three different regions of mRNAs: 5′-UTR, CDS and 3′-UTR. It can be seen from **Figure 7C** that the majority of m5C modifications are located in CDS regions. Recent studies have indicated that the m5C modification tends to occur downstream of translational start sites in mammalian mRNAs (Amort et al., 2017; Yang et al., 2017). We calculated the distance between candidate m5C modifications and translational start sites, and found that the most frequently methylated position is 4 nt downstream of the translational start site (AUG∗**C**; methylated cytosines are in bold and underlined) (**Figure 7D**). In order to further investigate the potential function of those 1,063 genes with m5C modifications located 4 nt downstream of the translational start site, we performed a GO (gene ontology) enrichment analysis using agriGO 2.0 (Tian et al., 2017) and found that in the BP (Biological Process) sub-category, 166 genes (**Table 3**) are enriched in the term "response to stimulus" with an FDR of 2.40E-4; for the MF (Molecular Function) sub-category, 350 genes are significantly enriched in "catalytic activity" with an FDR of 9.80E-07 (**Table 3**). We also performed pathway enrichment analysis on these 1,063 genes using the hypergeometric distribution test. Pathway information was obtained from the KEGG (http://www.genome.jp/kegg) and AraCyc (http://www.plantcyc.org) databases. At the level of p ≤ 1.0E-2, we identified four significantly enriched pathways, including the L-lysine biosynthesis VI pathway, glutathione metabolism, N-Glycan biosynthesis, and the phosphatidylinositol signaling system (**Table S5**).
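The hypergeometric test used for pathway enrichment can be written with exact integer arithmetic from the Python standard library. The sketch below computes the one-sided upper-tail p-value; the function name and the argument names are assumptions, and the background gene universe used in the study is not stated in the text.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K belong to the pathway.

    k: pathway genes observed in the gene list
    n: size of the gene list (e.g., 1,063 genes)
    K: pathway genes in the background
    N: size of the background
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

A pathway would then be reported as enriched when this p-value falls at or below the chosen cut-off (1.0E-2 in the analysis above).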

#### Implementation of PEA-m5C

To facilitate its practical use, we implemented PEA-m5C as an R package named "PEA-m5C". We also provided a cross-platform, user-friendly and interactive interface for PEA-m5C written in the JAVA programming language (**Figure 8**). This allows the user to easily run PEA-m5C without requiring any programming skills or knowledge. To expand the application of PEA-m5C to other species, users can also retrain prediction models on a pre-specified dataset using the "Self-Defined Mode" option in PEA-m5C, with the input of positive and negative samples in FASTA format. PEA-m5C is freely available to academic users at: https://github.com/cma2015/PEA-m5C.

#### DISCUSSION

In this study, we developed PEA-m5C, a computational framework for accurate identification of m5C modifications in Arabidopsis. The PEA-m5C predictor was constructed using the RF algorithm with an optimized window size and sequence-based features, achieving promising performance in both the 10-fold cross-validation experiment and the hold-out test experiment. PEA-m5C is superior to the newly developed and only available m5C predictor iRNAm5C-PseDNC in several aspects.

different sequence contexts: CG, CHG and CHH (where H = A, C, or U). (B) The sequence logo of candidate m5C modifications. The position of candidate m5C modifications is defined as 0. (C) The distribution of observed m5C modifications (positive samples in the DatasetCV), candidate m5C modifications, and all cytosine sites (background) along the 5′-UTR, CDS and 3′-UTR, normalized for transcript length. (D) The distribution of candidate m5C modifications relative to translational start sites. The position of translational start sites is defined as 0.

TABLE 3 | Top five significant GO terms in the sub-category of biological process (BP), molecular function (MF), and cellular component (CC).

First, besides the PseDNC encoding scheme used in iRNAm5C-PseDNC, PEA-m5C additionally integrates another two encoding schemes (binary and k-mer) to make more use of sequence-based features. Both 10-fold cross-validation and independent testing experiments have demonstrated that higher prediction accuracy can be achieved by PEA-m5C when more feature encoding schemes are used (**Figure S3**; **Table S6**). For instance, in the 10-fold cross-validation, PEA-m5C yielded an AUC of 0.904, 0.914 and 0.939 when PseDNC, PseDNC + k-mer, and PseDNC + k-mer + binary encoding schemes were used, respectively.
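To make the binary and k-mer encoding schemes more concrete, the sketch below encodes one sequence window. It assumes that "binary" means one-hot encoding of each position (4 bits per nucleotide) and that k-mer features are relative frequencies; the function names, the example window, and k = 2 are illustrative choices, not details confirmed by the text.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def binary_encode(window):
    """One-hot encode each position of the window: 4 bits per nucleotide."""
    bits = []
    for base in window:
        bits.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return bits


def kmer_frequencies(window, k=2):
    """Relative frequency of every possible k-mer within the window."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(window) - k + 1
    for i in range(total):
        counts[window[i:i + k]] += 1
    return [counts[km] / total for km in kmers]


# Hypothetical 11-nt window with the candidate cytosine at the centre.
window = "AUGGCCAUCGA"
features = binary_encode(window) + kmer_frequencies(window, k=2)
print(len(features))  # 11 * 4 binary features + 16 dinucleotide features = 60
```

Concatenating the vectors from several schemes, as done here, is one simple way in which a classifier such as a random forest could be given all encodings at once.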

Second, PEA-m5C uses a hybrid optimization strategy to produce better prediction accuracy (**Table S6**), whereas iRNAm5C-PseDNC did not perform a model optimization process. This is understandable, as model optimization is a rather time-consuming process (**Figure 2**). However, the results shown in **Figure 5** illustrate the importance of model optimization in developing accurate m5C predictors. We would also like to note that the process of model optimization needs to be finely tuned, for example in the choice of appropriate feature selection approaches. To select informative features for m5C prediction, we preferred to use the information gain approach rather than statistical analysis approaches (e.g., chi-square test for binary features, Student's t-test for k-mer- and PseDNC-based features). When tested on the DatasetHT, PEA-m5C using the information gain approach yielded a slightly higher maximum MCC (0.790) than that using the chi-square test and the Student's t-test (0.770).
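As an illustration of information gain as a feature-ranking criterion, the following sketch computes it for a discrete feature. It is a generic textbook formulation with hypothetical toy data, not the implementation used in PEA-m5C; continuous features such as k-mer frequencies would first need to be discretized, which is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Toy data: 1 = m5C site, 0 = non-m5C site (hypothetical).
labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # perfectly separates the two classes
uninformative = [1, 0, 1, 0]  # carries no information about the class
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

Features would then be ranked by this score, with the highest-scoring ones retained for model training.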

Finally, PEA-m5C has been implemented as a user-friendly interface written in the Java programming language and as an R package to maximize its practicality. It also includes a self-training module that provides an option to automatically build m5C predictors for specific species, tissues, or conditions. This is very important, as m5C modifications exhibit different sequence patterns in different tissues (**Figure S4**).

In the future, we will endeavor to incorporate more features (e.g., structure-based features) to further improve the performance of PEA-m5C. If possible, species-specific or tissue-specific predictors will be developed to facilitate the functional investigation of m5C modifications in plants.

#### AUTHOR CONTRIBUTIONS

CM: Designed the experiments; JS, JZ, and EB: Performed the experiments; JS, JZ, EB, CM, JY, and YS: Analyzed the data; CM, JZ, and JS: Wrote the paper. All authors read and approved the final manuscript.

#### FUNDING

This work has been supported by the National Natural Science Foundation of China (31570371), the Youth 1,000-Talent Program of China, the Hundred Talents Program of Shaanxi Province of China, the Youth Talent Program of State Key Laboratory of Crop Stress Biology for Arid Areas (CSBAAQN2016001), the Agricultural Science and Technology Innovation and Research Project of Shaanxi Province, China (2015NY011), and the Fund of Northwest A&F University.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2018.00519/full#supplementary-material

Supplementary Data 1 | The description of PseDNC encoding.

Figure S1 | Location distribution of positive, negative and background samples along the 5′ -UTR, CDS and 3′ -UTR, normalized for transcript length (relative distance).

Figure S2 | Two sample logos of Arabidopsis and mammalian m5C modifications. It shows nucleotides which are enriched or depleted in the surrounding region of m5C modifications.

Figure S3 | The ROC curve of 10-fold cross-validation illustrating the performance of PEA-m5C with different feature encoding schemes.

Figure S4 | Different sequence patterns of m5C modifications in DatasetHT, DatasetIT1, and DatasetIT2. (A) Frequencies of 41 × 4 position-specific bases in DatasetHT (Root tissue) and DatasetIT1 (Silique tissue). (B) Two sample logos of DatasetHT (Root tissue) and DatasetIT1 (Silique tissue) m5C modifications. (C) Frequencies of 41 × 4 position-specific bases in DatasetHT (Root tissue) and DatasetIT2 (Shoot tissue). (D) Two sample logos of DatasetHT (Root tissue) and DatasetIT2 (Shoot tissue) m5C modifications.

Table S1 | Four benchmark datasets constructed for the prediction of m5C modifications in this study.

Table S2 | The feature importance measured using the information gain approach at the window size of 11-nt (L<sup>u</sup> = L<sup>d</sup> = 5).

Table S3 | Prediction performance of m5C predictors on DatasetIT1 and DatasetIT2.

Table S4 | Peak regions used for validating transcriptome-wide candidate m5C modifications predicted by PEA-m5C.

Table S5 | Enriched pathways of genes containing m5C modifications at 4-nt downstream of the translational start site.

Table S6 | The performance of m5C predictors on DatasetHT using different encoding schemes.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Song, Zhai, Bian, Song, Yu and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Corrigendum: Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning

Jie Song<sup>1,2†</sup>, Jingjing Zhai<sup>1†</sup>, Enze Bian<sup>3†</sup>, Yujia Song<sup>3</sup>, Jiantao Yu<sup>3</sup> and Chuang Ma<sup>1,2</sup>\*

<sup>1</sup> State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Shaanxi, China, <sup>2</sup> Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Shaanxi, China, <sup>3</sup> College of Information Engineering, Northwest A&F University, Shaanxi, China

Keywords: AUC, Epitranscriptome, machine learning, RNA modification, RNA 5-methylcytosine

#### **A corrigendum on**

#### Edited by:

Giovanni Nigita, The Ohio State University, United States

#### Reviewed by:

Salvatore Alaimo, Università degli Studi di Catania, Italy

> \*Correspondence: Chuang Ma chuangma2006@gmail.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

Received: 17 October 2018 Accepted: 13 November 2018 Published: 30 November 2018

#### Citation:

Song J, Zhai J, Bian E, Song Y, Yu J and Ma C (2018) Corrigendum: Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning. Front. Plant Sci. 9:1762. doi: 10.3389/fpls.2018.01762

**Transcriptome-Wide Annotation of m5C RNA Modifications Using Machine Learning**

by Song, J., Zhai, J., Bian, E., Song, Y., Yu, J., and Ma, C. (2018). Front. Plant Sci. 9:519. doi: 10.3389/fpls.2018.00519

In the original article, there was an error: the word "reversible" was misleading. A correction has been made to the Abstract and to the Introduction, paragraph 2.

Though high-throughput experimental technologies have been developed and applied to profile m5C modifications under certain conditions, transcriptome-wide studies of m5C modifications are still hindered by the dynamic nature of m5C and the lack of computational prediction methods.

Second, because of the dynamic nature of m5C (Wang and He, 2014), existing high-throughput sequencing technologies can only capture a snapshot of RNA modifications under certain experimental conditions, and cover just a small fraction of the whole transcriptome of a given sample (Zhou et al., 2016), resulting in the generation of significant numbers of false negatives (non-detected true m5C modifications).

The authors apologize for the mistake. This error does not change the scientific conclusions of the article in any way. The original article has been updated.

#### REFERENCES

Wang, X., and He, C. (2014). Dynamic RNA modifications in posttranscriptional regulation. Mol. Cell 56, 5–12. doi: 10.1016/j.molcel.2014.09.001

Zhou, Y., Zeng, P., Li, Y. H., Zhang, Z., and Cui, Q. (2016). SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 44:e91. doi: 10.1093/nar/gkw104

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Song, Zhai, Bian, Song, Yu and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Cadmium Stress Leads to Rapid Increase in RNA Oxidative Modifications in Soybean Seedlings

Jagna Chmielowska-Bąk<sup>1</sup>\*, Karolina Izbiańska<sup>1</sup>, Anna Ekner-Grzyb<sup>1</sup>, Melike Bayar<sup>2</sup> and Joanna Deckert<sup>1</sup>

<sup>1</sup> Department of Plant Ecophysiology, Faculty of Biology, Institute of Experimental Biology, Adam Mickiewicz University in Poznań, Poznań, Poland, <sup>2</sup> Department of Molecular Biology and Genetics, Faculty of Science, Istanbul University, Istanbul, Turkey

An increase in the level of reactive oxygen species (ROS) is a common response to stress factors, including exposure to metals. ROS over-production is associated with oxidation of lipids, proteins, and nucleic acids. It has been suggested that the products of oxidation are not solely markers of oxidative stress but also signaling elements. For instance, it has been shown in animal models that mRNA oxidation is a selective process engaged in post-transcriptional regulation of gene expression and that it is associated with the development of symptoms of several neurodegenerative disorders. In the present study, we examined the impact of short-term cadmium (Cd) stress on the level of two RNA oxidation markers: 8-hydroxyguanosine (8-OHG) and apurinic/apyrimidinic sites (AP-sites, abasic sites). In the case of 8-OHG, a significant increase was observed after 3 h of exposure to a moderate Cd concentration (10 mg/l). In turn, a high level of AP-sites, accompanied by strong ROS accumulation and lipid peroxidation, was noted only after 24 h of treatment with a higher Cd concentration (25 mg/l). This is the first report showing induction of RNA oxidation in plant responses to stress factors. The possible signaling and gene regulatory role of oxidatively modified transcripts is discussed.

#### Edited by:

Giovanni Nigita, The Ohio State University, United States

#### Reviewed by:

Kashmir Singh, Panjab University, Chandigarh, India Christophe Bailly, Université Pierre et Marie Curie, France

#### \*Correspondence:

Jagna Chmielowska-Bąk jagna.chmielowska@amu.edu.pl

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

Received: 05 September 2017 Accepted: 18 December 2017 Published: 09 January 2018

#### Citation:

Chmielowska-Bąk J, Izbiańska K, Ekner-Grzyb A, Bayar M and Deckert J (2018) Cadmium Stress Leads to Rapid Increase in RNA Oxidative Modifications in Soybean Seedlings. Front. Plant Sci. 8:2219. doi: 10.3389/fpls.2017.02219

Keywords: cadmium, soybean, RNA oxidation, 8-hydroxyguanosine, AP-sites, abasic sites, epitranscriptomics, oxidative stress

#### INTRODUCTION

Elevated levels of cadmium (Cd) have been found in the soil in several regions of the world (Khan et al., 2011; Wang et al., 2015; Dumitrel et al., 2017). Contamination of the environment with this metal poses a serious threat to plant growth, as Cd is highly mobile and toxic. Importantly, it can be absorbed by crop plants and in this way enter the human body, leading to serious disorders. Cd is considered a class I carcinogen. The main targets of its toxicity are the kidneys, lungs, and skeletal system (Kah et al., 2012).

In recent years, significant effort has been made to elucidate the impact of Cd on plants. The research has focused on the toxicity mechanisms, Cd sensing, and the development of metal tolerance. The most universal response to this toxic element, not limited to plants but also found in bacteria and animals, is the over-production of ROS (reviewed in Chmielowska-Bąk and Deckert, 2012). ROS are "double-faced" molecules, which on one hand can lead to significant damage of cellular compounds but on the other hand are crucial components of the signaling network indispensable for the activation of defense (Cuypers et al., 2016; Sewelam et al., 2016). Recently it has been proposed that the ROS signal is transmitted by the products of the oxidation of biological molecules including lipids, proteins, and nucleic acids (reviewed in Chmielowska-Bąk et al., 2015).

**Abbreviations:** 8-OHG, 8-hydroxyguanosine; AP-site, apurinic/apyrimidinic site; ARP, aldehyde reactive probe; CM-H2DCFDA, 5-(and-6)-chloromethyl-2′,7′-dichlorodihydrofluorescein diacetate, acetyl ester; ROS, reactive oxygen species; TBARS, thiobarbituric acid reactive substances.

Indeed, various studies carried out on human, animal, and plant systems demonstrated a correlation between oxidation of certain species of mRNA and a decrease in the level of the encoded proteins, indicating that the process constitutes a posttranscriptional gene regulatory mechanism (Shan et al., 2003, 2007; Tanaka et al., 2007; Chang et al., 2008; Bazin et al., 2011; Gao et al., 2013). The decrease in the amount of proteins encoded by oxidized transcripts is most probably dependent on the recently described ribosome stalling. In vitro studies carried out on a reconstituted bacterial system demonstrated that the occurrence of 8-OHG, the most common oxidative modification of RNA, slows down the translation process by 2–4 orders of magnitude. The effect was observed regardless of the position of the oxidized base in the codon, even in the wobble position. In eukaryotic extracts, translation was nearly completely inhibited by the presence of 8-OHG. It was suggested that the occurrence of 8-OHG in transcripts leads to alterations in RNA–RNA interactions and prevents adoption of the active conformation in the decoding center. At the same time, it has been shown that oxidized transcripts are subjected to ribosome-based quality control and are destined for degradation through the No-Go decay pathway (NGD) (Simms et al., 2014). In concordance, yeast mutants with an altered decapping system leading to less efficient mRNA degradation showed an elevated level of 8-OHG in the transcripts, accompanied by premature death of the cells. The yeast mutants were also characterized by a higher frequency of reversion from the Trp<sup>−</sup> (tryptophan minus) phenotype (Stirpe et al., 2016).

Another major discovery is the fact that mRNA oxidation is a selective process. A study carried out on postmortem brain tissues isolated from patients suffering from Alzheimer's disease demonstrated that oxidation did not occur in the most abundant transcripts such as β-actin. On the other hand, the most prominent oxidation was always noted in specific mRNAs encoding proteins engaged in signal transduction, cellular transport, gene expression regulation, and response to altered ROS metabolism (Shan et al., 2003). These findings have been confirmed by other reports carried out on human, animal, and plant models (Shan et al., 2007; Tanaka et al., 2007; Chang et al., 2008; Bazin et al., 2011; Gao et al., 2013). Despite the importance of transcript oxidation in gene regulation and cell functioning, so far only two studies have been dedicated to the elucidation of this phenomenon in plants. It was demonstrated on sunflower seeds that alleviation of seed dormancy during dry after-ripening was associated with an increase in 8-OHG in mRNA. The observed oxidation was limited to 24 definite transcripts encoding proteins associated with metabolism, response to stress factors, and transport (Bazin et al., 2011). Similarly, studies on wheat showed selective oxidation of certain transcripts associated with changes in protein levels and release from seed dormancy (Gao et al., 2013).

Besides 8-OHG, oxidation might lead to the formation of numerous other modified bases in nucleic acids (Barciszewski et al., 1999). Studies using an in vitro translation system showed that the most common oxidative modifications of RNA, namely 8-OHG, 5-hydroxyuridine (5-OHU), 5-hydroxycytidine (5-OHC), 8-oxo-7,8-dihydroadenosine (8-OHA), 1,N6-ethenoadenosine (ε-A), 3,N4-ethenocytidine (ε-C), and abasic sites (AP), result in slowing down or complete inhibition of the translation process (Calabretta et al., 2015). In turn, oxidation of DNA is associated with a decrease in its stability and an enhanced mutation rate. A high level of oxidatively modified bases has been noted in various types of cancer (Sedelnikova et al., 2010). The repair of DNA lesions induced by oxidation is carried out by the Base Excision Repair (BER) pathway. The initial step of BER is recognition and excision of the modified bases by DNA glycosylases, leading to the formation of AP-sites (abasic sites), which are considered markers of oxidative stress (Čolak, 2008; Antoniali et al., 2017). Recently, a method of abasic site detection has been successfully applied in the evaluation of RNA oxidation (Tanaka et al., 2011). However, so far the changes in the level of AP-sites in transcripts of plants exposed to stress factors have not been described.

The aim of the present study was to examine the influence of Cd at two concentrations (10 and 25 mg l<sup>−1</sup>) on the intensity of oxidative stress and the frequency of RNA oxidation-dependent modifications: 8-OHG and AP-sites. Our previous research showed that in the earliest period of Cd stress (3 h), ROS are engaged in the regulation of gene expression, while strong accumulation of O<sub>2</sub><sup>−</sup> and H<sub>2</sub>O<sub>2</sub> was marked after 24 h (Chmielowska-Bąk et al., 2017). Therefore, these two time points, 3 and 24 h, were applied in the present study.

#### MATERIALS AND METHODS

#### Growth Conditions and Treatment Procedures

Soybean (Glycine max L. cv. Naviko) seeds were kindly supplied by the Department of Genetics and Plant Breeding, University of Life Sciences in Poznań, Poland. The seeds were surface-sterilized for 5 min with 75% ethanol and for 10 min with 1% sodium hypochlorite. Thereafter, the seeds were washed under running water for 30 min and imbibed in distilled water for 2 h. The seeds were placed on Petri dishes (30 cm in diameter) lined with two layers of moistened lignin covered by one layer of blotting paper and transferred to a growth chamber with a stable temperature of 22 °C for 48 h. Germinated seedlings, selected for similar root length, were transferred to new Petri dishes (10 cm in diameter), wherein the roots were placed between two layers of blotting paper in cut-out holes. Afterward, the seedlings were treated with 5 mL of distilled water (control) or CdCl<sub>2</sub> with Cd at a concentration of 10 or 25 mg l<sup>−1</sup> (corresponding to 89 and 223 µM, respectively).
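The conversion between mass and molar concentration used above can be checked with a short calculation. This is only a worked-arithmetic sketch: the molar mass of Cd (112.41 g/mol) is a standard value rather than one stated in the text, and the function name is illustrative.

```python
CD_MOLAR_MASS_G_PER_MOL = 112.41  # standard atomic weight of cadmium (assumed)


def mg_per_l_to_micromolar(mg_per_l, molar_mass=CD_MOLAR_MASS_G_PER_MOL):
    """Convert a mass concentration (mg/l) to a molar concentration (µM).

    mg/l divided by g/mol gives mmol/l; multiplying by 1000 gives µmol/l.
    """
    return mg_per_l / molar_mass * 1000.0


for dose in (10, 25):
    print(f"{dose} mg/l Cd is about {mg_per_l_to_micromolar(dose):.1f} µM")
```

Under this assumption the two doses come out at roughly 89 µM and 222 µM, in line with the rounded values quoted in the text.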

#### Estimation of the Amount of Dead Cells

Cell viability was estimated on the basis of Evans blue uptake according to Lehotai et al. (2011). Approximately 200 mg of roots were cut off on ice, weighed, and incubated for 20 min in 0.25% Evans blue (Sigma, E-2129). Then the roots were washed twice for 15 min in distilled water and homogenized in a mortar with destaining solution (50 ml of ethanol, 49 ml of distilled water, and 1 ml of 10% SDS). Samples were incubated in a heating block for 15 min at 50 °C and centrifuged (12,000 rpm, 20 °C, 15 min). The Evans blue uptake, indicating cell death, was measured spectrophotometrically at λ = 600 nm.

#### Total RNA and mRNA Isolation

For RNA isolation, approximately 100 mg of soybean roots were cut off on ice, immediately frozen in liquid nitrogen, and stored at −80 °C. The RNA was isolated from frozen tissue with the use of TriReagent (BioShop Canada Inc., Canada, TRI118) according to the manufacturer's instructions. The mRNA was purified from the total RNA using the GenElute™ mRNA Miniprep Kit (Sigma–Aldrich, MRN10-1KT). The amount and purity of the obtained RNA and mRNA were measured using a NanoCell (Thermo Scientific) on a Biomate™ 3S spectrophotometer (Thermo Scientific).

#### Measurement of 8-OHG Level

The level of 8-OHG was quantified with the OxiSelect™ Oxidative RNA Damage ELISA-8OHG Quantification Kit (BioCells, STA-325). For the analysis, 10 µg of sample (total RNA or mRNA) was digested with 20 U of Nuclease S1 (BioShop Canada Inc., Canada, NUC333.50) for 2 h at 37 °C, followed by digestion with 10 U of alkaline phosphatase from bovine intestinal mucosa (Sigma–Aldrich, P6774-2KU) for 1 h at 37 °C. Further procedures were carried out according to the manufacturer's instructions. The absorbance of the samples was measured on an iMark™ Microplate Reader (Bio-Rad) and the 8-OHG concentrations were calculated using ELISA Analysis software with a 4-parameter logistic regression algorithm.
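For readers unfamiliar with 4-parameter logistic (4PL) standard curves, the sketch below shows the usual form of the model and how a concentration is read back from an absorbance once the curve has been fitted. The parameter values are hypothetical, the fitting step itself is not shown, and nothing here reproduces the ELISA Analysis software used in the study.

```python
def four_pl(x, a, b, c, d):
    """4PL model: a = response at zero concentration, d = response at
    infinite concentration, c = inflection point, b = slope factor."""
    return d + (a - d) / (1.0 + (x / c) ** b)


def inverse_four_pl(y, a, b, c, d):
    """Concentration corresponding to response y on a fitted 4PL curve."""
    return c * ((a - d) / (y - d) - 1.0) ** (1.0 / b)


# Hypothetical fitted parameters for a curve in which the signal
# decreases as the analyte concentration increases.
params = dict(a=2.0, b=1.2, c=5.0, d=0.1)

absorbance = four_pl(3.0, **params)           # signal of a standard at 3.0 units
recovered = inverse_four_pl(absorbance, **params)
print(round(recovered, 6))                    # 3.0
```

In practice the four parameters are estimated from the absorbances of the kit standards, and sample concentrations are then obtained with the inverse function.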

#### ROS Detection

General ROS were detected in vivo in the roots of soybean seedlings using the fluorescent dye CM-H2DCFDA (Life Technologies, C6827) dissolved in dimethyl sulfoxide (DMSO; Sigma–Aldrich, 472301) and diluted in phosphate-buffered saline (PBS) buffer (BioShop Canada Inc., Canada, PBS404) to a final concentration of 10 µM. The roots of seedlings were incubated for 1 h in CM-H2DCFDA, washed with distilled water, and treated for 3 or 24 h with Cd solutions (10 or 25 mg l<sup>−1</sup>) or distilled water (experimental control). To exclude the possibility of autofluorescence, a negative control incubated for 1 h in PBS buffer instead of CM-H2DCFDA was used. All procedures were carried out in a dark room. The level of general ROS was visualized by means of a Zeiss Axiovert 200M confocal microscope with 450–490 nm excitation and 515 nm emission wavelengths. The 5× magnified images were photographed using an AxioCam MRC5 camera.

#### Measurements of Lipid Peroxidation

Lipid peroxidation was evaluated on the basis of the amount of TBARS according to Cuypers et al. (2011) with small modifications. Roots of soybean seedlings (200 mg) were cut off on ice and homogenized with 3 ml of 10% TCA (Sigma–Aldrich, TO699). The samples were centrifuged (12,000 rpm, 4 °C, 10 min) and 1 ml of supernatant was transferred to glass tubes. Thereafter, the tubes were filled with 4 ml of 0.5% TBA dissolved in 10% TCA and incubated for 30 min at 95 °C. Subsequently, the samples were cooled, mixed by inversion, and centrifuged (5,000 rpm, 4 °C, 2 min). The absorbance of the supernatant was measured at λ = 532 nm and corrected for unspecific absorbance at λ = 600 nm. The amount of TBARS was calculated on the basis of the extinction coefficient (155 mM<sup>−1</sup> cm<sup>−1</sup>).
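The TBARS calculation described above follows the Beer–Lambert law and can be sketched as below. The extinction coefficient and the volumes come from the text; the 1 cm path length, the normalization to nmol per gram of fresh weight, and the example absorbances are assumptions made for illustration only.

```python
EPSILON_PER_MM_CM = 155.0  # extinction coefficient in mM^-1 cm^-1 (from the text)


def tbars_nmol_per_g(a532, a600, fresh_weight_g=0.2, extract_ml=3.0,
                     aliquot_ml=1.0, reaction_ml=5.0, path_cm=1.0):
    """TBARS content in nmol per g fresh weight.

    Assumes a 1 cm cuvette and that the 1 ml aliquot is representative
    of the whole 3 ml extract (both are assumptions, not stated facts).
    """
    # Beer-Lambert: concentration in mM, which equals µmol per ml.
    conc_mm = (a532 - a600) / (EPSILON_PER_MM_CM * path_cm)
    umol_in_reaction = conc_mm * reaction_ml          # 1 ml extract + 4 ml TBA
    umol_in_extract = umol_in_reaction * extract_ml / aliquot_ml
    return umol_in_extract * 1000.0 / fresh_weight_g  # µmol -> nmol, per gram


# Hypothetical absorbance readings.
print(round(tbars_nmol_per_g(a532=0.31, a600=0.0), 1))
```

Subtracting the reading at 600 nm before applying the coefficient is what the text refers to as the correction for unspecific absorbance.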

#### Estimation of the Level of AP-Sites in RNA

The level of abasic sites (AP-sites) was evaluated using a method based on reaction with ARP described by Tanaka et al. (2011). Approximately 5 µg of mRNA dissolved in 50 µl of DNase- and RNase-free water was incubated with 50 µl of 2 mM N-(aminooxyacetyl)-N′-(D-Biotinoyl)hydrazine (Life Technologies, A-10550) in Tris–EDTA buffer (Sigma–Aldrich, T9285) for 1 h at 37 °C. The reaction was stopped by addition of 50 µl of 50 mM formaldehyde (BioShop Canada Inc., Canada, FOR201.1). RNA precipitation was carried out by addition of 15 µl of 3 M sodium acetate (BioShop Canada Inc., Canada, SAA333.100) and 450 µl of pure ethanol (POCH Basic, BA 6480111). The precipitation proceeded for 24 h at −20 °C. Thereafter, the samples were centrifuged at 12,000 rcf at 4 °C, washed with 500 µl of 75% ethanol, and dissolved in 10 µl of Tris–EDTA.

The amount and purity of the obtained mRNA were measured using a NanoCell (Thermo Scientific) on a Biomate™ 3S spectrophotometer (Thermo Scientific). A total of 1 µg of the mRNA was spotted on a membrane (Zeta-Probe® Blotting Membranes, Bio-Rad) previously soaked in Tris–EDTA and air-dried. The membrane with samples was irradiated for 15 min with UV light, incubated for 30 min in Casein Blocking Buffer (Sigma–Aldrich, B6429), and then incubated for 1 h with Streptavidin-Horseradish Peroxidase (HRP) Conjugate (Sigma–Aldrich, GERPN1231) in Casein Blocking Buffer (1:20,000). Thereafter, the membrane was washed 6 times for 4 min with PBS (BioShop Canada Inc., Canada, PBS404.200) containing 0.05% Tween (Sigma–Aldrich, P1379) and developed with Clarity™ Western ECL Substrate (Bio-Rad). The blocking, incubation with streptavidin-HRP, washing, and developing proceeded on a rocking platform (Shaker-Rocker MR12, BioSan). Chemiluminescence was captured on a ChemiDoc™ Touch Imaging System (Bio-Rad) with a 15 min exposure time. The intensity of spots was measured with Multi Gauge Software (Fuji) as Q-B/pixel<sup>2</sup>, where Q is quantity and B is background. The relative density was expressed as a percentage in relation to the control.
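The final normalization step can be illustrated with a short calculation. The sketch assumes that Q-B/pixel<sup>2</sup> denotes the background-subtracted signal divided by the spot area in pixels; that reading, the function names, and all numbers are illustrative assumptions rather than details taken from the Multi Gauge software.

```python
def spot_density(quantity, background, area_pixels):
    """Background-corrected signal per unit area (assumed reading of Q-B/pixel^2)."""
    return (quantity - background) / area_pixels


def percent_of_control(sample_density, control_density):
    """Relative density expressed as a percentage of the control."""
    return 100.0 * sample_density / control_density


# Hypothetical dot-blot readings.
control = spot_density(quantity=1000.0, background=500.0, area_pixels=100.0)
treated = spot_density(quantity=1500.0, background=500.0, area_pixels=100.0)
print(percent_of_control(treated, control))  # 200.0
```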

#### Statistical Analysis

The measurements of the 8-OHG level in total RNA and mRNA were carried out on 4 and 5–6 experimental repetitions, respectively. The measurements of growth, lipid peroxidation, and abasic site level in mRNA were conducted on 3 experimental repetitions, while the estimation of the amount of dead cells was conducted on 2–3 experimental repetitions. To evaluate statistically significant differences, the obtained data were analyzed with one-way ANOVA (p = 0.05). In the case of the level of abasic sites in mRNA, due to the nonparametric distribution, the Mann–Whitney U-test was applied (p = 0.05). Results which showed no statistically significant differences are marked with the same letter.
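As an illustration of the Mann–Whitney U-test named above, the sketch below computes the U statistic and an exact two-sided p-value by enumerating all possible group assignments. The data and group sizes are hypothetical, and the text does not say whether an exact or an asymptotic p-value was used in the study.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_exact(group_a, group_b):
    """Return (U statistic of group_a, exact two-sided p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))

    def u_from(indices):
        return sum(ranks[i] for i in indices) - n_a * (n_a + 1) / 2.0

    observed = u_from(range(n_a))
    centre = n_a * n_b / 2.0  # expected U under the null hypothesis
    splits = list(combinations(range(n_a + n_b), n_a))
    # Count assignments at least as far from the centre as the observed one.
    extreme = sum(
        1 for s in splits
        if abs(u_from(s) - centre) >= abs(observed - centre) - 1e-9
    )
    return observed, extreme / len(splits)


# Hypothetical relative densities (% of control) for two groups of five.
control = [100, 96, 104, 99, 101]
treated = [131, 150, 126, 142, 138]
u_stat, p_value = mann_whitney_exact(control, treated)
print(u_stat, round(p_value, 4))
```

Exhaustive enumeration is only practical for small groups; for larger samples a normal approximation is normally used instead.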

#### RESULTS

Exposure of seedlings to Cd for 3 h had no effect on their morphology, growth, or the amount of dead cells (**Figures 1A,C,D**). In turn, 24 h of Cd stress resulted in root browning and inhibition of root growth (**Figures 1B,C**). Exposure for 24 h to the higher Cd concentration led to a significant increase in the amount of dead cells (**Figure 1D**).

The level of 8-OHG (an oxidatively modified base) was approximately 5 times higher in mRNA than in total RNA (**Figures 2A,B**, **3A,B**). In response to Cd, a significant increase in the level of 8-OHG was noted in the total RNA and mRNA after 3 h of treatment with the lower concentration (10 mg l<sup>−1</sup>) (**Figures 2A,B**). This effect was not observed after 24 h or in response to the higher Cd concentration (25 mg l<sup>−1</sup>) (**Figures 3A,B**).

The opposite tendency was noted in the case of ROS accumulation. General ROS were detected with a specific dye, CM-H2DCFDA, which emits a green fluorescent signal in response to oxidation. The fluorescence signal was generally lower after 3 h than after 24 h. No differences in fluorescence intensity were observed between the roots of control and Cd-stressed seedlings after 3 h (**Figure 2C**). However, a visible increase in the fluorescence signal was noted in roots of the seedlings treated with Cd for 24 h in relation to the control (**Figure 3C**).

FIGURE 1 | Morphology of the control seedlings and seedlings treated with Cd for 3 h (A) and 24 h (B), length of seedling roots (C), and cell mortality evaluated on the basis of Evans blue uptake (D). The results are means of 3–6 independent repetitions ± SE. Results marked with an asterisk (<sup>∗</sup>) show statistically significant differences in relation to the control.

The ROS over-production was correlated in time with increased lipid peroxidation (**Figure 3D**) and AP-site frequency in the mRNA (**Figure 3E** and **Supplementary Figure S1**). After 3 h, no significant changes in TBARS or AP-site levels were observed between Cd-stressed and control seedlings (**Figures 2D,E**). In turn, after 24 h, the level of TBARS and AP-sites was significantly higher in the roots of seedlings treated with Cd at the higher concentration (25 mg l<sup>−1</sup>) (**Figures 3D,E** and **Supplementary Figure S1**).

#### DISCUSSION

Exposure of plants to Cd leads to the development of various symptoms of toxicity (reviewed in Gallego et al., 2012). In the present study, 3 h long treatment with this metal did not affect the morphology and growth of soybean seedling roots. At this time point, no statistically significant differences in cell viability were noted (**Figures 1A,C,D**). However, already after 24 h, browning and shortening of the roots in response to Cd treatment were observed (**Figures 1B,C**). Additionally, 24 h long exposure to the higher Cd concentration led to a significant increase in the amount of dead cells (**Figure 1D**).

Cadmium (Cd) toxicity might be, at least partially, mediated by ROS. These molecules, which include hydrogen peroxide (H<sub>2</sub>O<sub>2</sub>), hydroxyl radical (HO<sup>•</sup>), superoxide anion (O<sub>2</sub><sup>−</sup>), and singlet oxygen (<sup>1</sup>O<sub>2</sub>), are highly reactive and mediate oxidation of various cellular compounds. Numerous reports showed that exposure to stresses leads to an increase in the level of ROS accompanied by oxidation of proteins and membrane lipids (Gill and Tuteja, 2010; Cuypers et al., 2016; Sewelam et al., 2016). In fact, accumulation of the products of lipid peroxidation, malondialdehyde (MDA) and TBARS, is considered a typical symptom of oxidative stress. However, so far little attention has been given to the oxidation of nucleic acids in plants exposed to unfavorable conditions. Moreover, the studies were limited to changes in the level of DNA oxidation markers (Macovei et al., 2010; Yin et al., 2010), while there is no information concerning the impact of ROS on RNA. It is worth highlighting that studies on bacteria showed higher susceptibility of RNA to oxidation when compared to DNA. Under the same conditions, the level of oxidatively modified RNA exceeded the level of oxidized DNA (Liu et al., 2012). This might be explained by the fact that DNA molecules are more protected due to localization in the nucleus, a higher level of packing, and association with numerous proteins. Among the main RNA types, mRNA seems to be the most vulnerable to oxidation (Bazin et al., 2011).

Indeed, in the present study, the frequency of the most common oxidative modification, 8-OHG, was approximately 5 times higher in mRNA than in the total RNA (**Figures 2**, **3**). Another modification associated with oxidation processes, abasic sites (AP-sites), was undetectable in the total RNA using the same or even higher concentrations of sample as in the case of mRNA (data not shown). The results indicate that this modification, too, occurs more frequently in the transcripts than in other RNA types. Interestingly, Cd-dependent induction of 8-OHG and AP-sites was separated in time. Significantly higher levels of 8-OHG in response to Cd were noted only after 3 h of treatment with the lower metal concentration (**Figures 2A,B**). In turn, an increase in the level of AP-sites was observed after 24 h of exposure to the higher concentration (**Figure 3E**) and was accompanied by an increase in the level of other markers of oxidative stress: strong ROS over-production (**Figure 3C**) and enhanced lipid peroxidation (**Figure 3D**).

The observed time-dependent variation in the Cd-dependent ROS signal is in concordance with other studies. For example, several reports indicate that Cd stress leads to the generation of ROS waves that differ in their time of occurrence (Garnier et al., 2006; Peréz-Chaca et al., 2014; Lv et al., 2017). In a study on tobacco suspension cells, exposure to this metal led to the generation of three ROS waves, of which the earliest occurred within the first hour of metal treatment and was dependent on the activity of the membrane-bound enzyme NADPH oxidase. The same research reported that Cd cytotoxicity was associated with the second ROS wave, which resulted from disturbances in mitochondrial functioning (Garnier et al., 2006). Our earlier research also showed that the ROS signal in soybean seedlings exposed to Cd varies over time. The activity of NADPH oxidase modulated the expression of signaling-associated genes after 3 h of metal treatment, while significant ROS accumulation and lipid peroxidation were noted only after 24 h (Chmielowska-Bąk et al., 2017). Apparently, the role of ROS in plant cells is temporally, species-, and spatially specific (Miller et al., 2008; Cuypers et al., 2016). For example, it has been shown in Arabidopsis thaliana that ROS signals originating from peroxisomes and chloroplasts have distinct effects on the transcriptome (Sewelam et al., 2014). Another study, using agents that induce distinct ROS types, showed that some transcripts are modulated specifically by O<sub>2</sub><sup>•−</sup>, H<sub>2</sub>O<sub>2</sub>, or <sup>1</sup>O<sub>2</sub> (Gadjev et al., 2006).

Reactive oxygen species might play various roles in plants exposed to stress conditions. On one hand, these molecules are responsible for oxidative damage of membrane lipids, proteins, and nucleic acids (Das and Roychoudhury, 2014; Rio, 2015). On the other hand, ROS are engaged in signaling networks and defense mechanisms (Cuypers et al., 2016). ROS-dependent signaling includes direct sensing, for example, through transcription factors or the serine/threonine protein kinase oxidative signal-inducible 1 (OXI1), and indirect modulation of the signal through changes in cellular redox status and/or interaction with other signaling molecules such as nitric oxide, calcium ions, mitogen-activated protein kinases, or plant hormones (Neill et al., 2002; Chi et al., 2013; Kopczewski and Kuźniak, 2013; Wrzaczek et al., 2013; Waszczak et al., 2015; Cuypers et al., 2016). Recently, it has been proposed that the ROS signal might also be transmitted by products of oxidation such as oxylipins, peptides derived from protein oxidation, and oxidatively modified nucleic acids (reviewed in Chmielowska-Bąk et al., 2015). One of the important future challenges in research concerning the role of ROS in plants' response to stresses is elucidation of the exact role of specific ROS signals.

In the present research, the observed Cd-dependent induction of AP-sites is most probably a symptom of oxidative stress (**Figure 3E** and **Supplementary Figure S1**). This assumption is based on the fact that AP-site induction coincided in time with a significant increase in ROS level and lipid peroxidation (**Figures 3C,D**). However, at the present stage of research, it is difficult to explain the exact role of the rapid induction of 8-OHG formation in RNA noted already after 3 h (**Figures 2A,B**), when the symptoms of oxidative stress were not yet detectable (**Figures 2C,D**). Studies carried out on animal, human, and plant models showed that 8-OHG formation is not a random but a highly selective process limited to defined transcripts, although the mechanism of this selectivity has not yet been discovered. A high rate of 8-OHG in transcripts leads to ribosome stalling and, in consequence, to a decrease in the amount of encoded proteins (Shan et al., 2007; Tanaka et al., 2007; Chang et al., 2008; Bazin et al., 2011; Gao et al., 2013; Simms et al., 2014). The selective nature of 8-OHG formation and its impact on protein biosynthesis indicate that this process constitutes a newly discovered mechanism of post-transcriptional gene regulation. Interestingly, it seems that the gene regulatory function of transcripts abundant in 8-OHG plays distinct roles in animals and plants. In animal and human models, a high 8-OHG level in mRNA has been shown to be associated with the development of neurodegenerative disorders (Shan et al., 2003; Chang et al., 2008; Kong and Lin, 2010). In plants, in turn, this oxidative modification of transcripts is essential for regulation of the level of certain proteins and alleviation of seed dormancy, a natural process in the plant life cycle (Bazin et al., 2011; Gao et al., 2013).

In summary, this is the first report showing increased oxidation of total RNA and mRNA in plants exposed to stress. The observed increase in the level of AP-sites coincided in time with strong ROS accumulation and lipid peroxidation, indicating that this mRNA modification constitutes a marker of oxidative challenge. However, the rapid induction of 8-OHG is puzzling, and its exact role in plants' response to unfavorable conditions needs further elucidation.

#### AUTHOR CONTRIBUTIONS

JC-B and JD designed the research. JC-B and AE-G carried out the cultivation of material, collection of samples, and RNA and mRNA isolation. JC-B and KI conducted the evaluation of ROS and AP-sites level. JC-B, MB, and AE-G performed the measurements of 8-OHG level including isolation of mRNA. JC-B carried out the examination of lipid peroxidation. All authors analyzed the obtained results. JC-B wrote the manuscript. All authors critically read, corrected, and approved the manuscript.

#### FUNDING

This work was financed by the National Science Center, Poland, in frame of project number 2014/13/D/NZ9/04812.

#### ACKNOWLEDGMENT

MB participated in the research conducted at the Department of Plant Ecophysiology at Adam Mickiewicz University in Poland during students' mobility in frame of Erasmus+ Programme.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2017.02219/full#supplementary-material

FIGURE S1 | Photograph of an exemplary membrane showing formation of abasic sites (AP-sites) in mRNA isolated from the roots of seedlings treated with distilled water (control) or Cd at the concentration 10 mg l<sup>−1</sup> (Cd 10) or 25 mg l<sup>−1</sup> (Cd 25) for 3 and 24 h.

#### REFERENCES


Reactive Probe. Free Radic. Res. 45, 237–247. doi: 10.3109/10715762.2010.535529


reductase, confers tolerance to aluminum stress in transgenic tobacco. Planta 231, 609–621. doi: 10.1007/s00425-009-1075-3

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chmielowska-Bąk, Izbiańska, Ekner-Grzyb, Bayar and Deckert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## The Effect of Centrifugal Force in Quantification of Colorectal Cancer-Related mRNA in Plasma Using Targeted Sequencing

Vivian Weiwen Xue<sup>1</sup> , Simon Siu Man Ng<sup>2</sup> , Wing Wa Leung<sup>2</sup> , Brigette Buig Yue Ma<sup>3</sup> , William Chi Shing Cho<sup>4</sup> , Thomas Chi Chuen Au<sup>3</sup> , Allen Chi Shing Yu<sup>5</sup> , Hin Fung Andy Tsang<sup>1</sup> and Sze Chuen Cesar Wong<sup>1</sup> \*

<sup>1</sup> Department of Health Technology and Informatics, Faculty of Health and Social Sciences, Hong Kong Polytechnic University, Kowloon, Hong Kong, <sup>2</sup> Department of Surgery, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, Hong Kong, <sup>3</sup> State Key Laboratory in Oncology in South China, Sir YK Pao Centre for Cancer, Department of Clinical Oncology, Hong Kong Cancer Institute and Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong, <sup>4</sup> Department of Clinical Oncology, Queen Elizabeth Hospital, Kowloon, Hong Kong, <sup>5</sup> Department of Computer Science, University of Oxford, Oxford, United Kingdom

#### Edited by:

Stefano Volinia, University of Ferrara, Italy

#### Reviewed by:

Rosanna Asselta, Humanitas Università, Italy

Stefano Duga, Humanitas Università, Italy

\*Correspondence: Sze Chuen Cesar Wong cesar.wong@polyu.edu.hk

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 09 March 2018 Accepted: 26 April 2018 Published: 15 May 2018

#### Citation:

Xue VW, Ng SSM, Leung WW, Ma BBY, Cho WCS, Au TCC, Yu ACS, Tsang HFA and Wong SCC (2018) The Effect of Centrifugal Force in Quantification of Colorectal Cancer-Related mRNA in Plasma Using Targeted Sequencing. Front. Genet. 9:165. doi: 10.3389/fgene.2018.00165

In our previous study, we detected the effects of centrifugal forces on plasma RNA quantification by quantitative reverse transcription PCR. The aims of this study were to perform targeted mRNA sequencing and data analysis in healthy donors' plasma prepared by two centrifugation protocols and to investigate the effects of centrifugal forces on plasma mRNA quality and quantity. Targeted mRNA sequencing was performed using a custom panel of 108 colorectal cancer-related genes in plasma from 18 healthy donors that was prepared by (1) 3,500 g for 10 min at 4°C and (2) 1,600 g for 10 min at 4°C followed by 16,000 g for 10 min at 4°C. Results showed that plasma ribosomal RNA was detected in 16/18 (88.9%) plasma samples centrifuged at 3,500 g and in 6/18 (33.3%) plasma samples centrifuged at 1,600 g followed by 16,000 g. For targeted sequencing, 75/108 (69.4%) and 86/108 (79.6%) genes were detected in 3,500 g and 1,600 g followed by 16,000 g, respectively, while 16/108 (14.8%) genes were not detected in either centrifugation. Detailed analysis showed that 2 of 108 (1.85%) genes showed lower expression in 3,500 g than in 1,600 g followed by 16,000 g. The median expressions of genes in 3,500 g were positively correlated with the expressions in 1,600 g followed by 16,000 g (R<sup>2</sup> = 0.9471, P < 0.0001, Spearman rank correlation). Meanwhile, plasma samples were not distinctively clustered based on centrifugal forces according to hierarchical clustering. Targeted mRNA sequencing and subsequent data analysis were performed in this study to investigate the effects of two different centrifugal forces that are commonly used in plasma collection. Our targeted sequencing results help to understand the centrifugal force effects on plasma mRNA, and these findings show that the centrifugation protocol for plasma mRNA research using targeted sequencing can be standardized, which facilitates multicenter studies for comparison and quality assurance in the future.

Keywords: centrifugal force, targeted sequencing, plasma mRNA, colorectal cancer, gene expression

### INTRODUCTION

Colorectal cancer (CRC) is one of the most serious health issues worldwide (Brenner et al., 2014). The survival rate of primary CRC is significantly higher than that of advanced cancer, which means that more effective cancer screening is helpful in CRC prevention (Levin et al., 2008). The currently recommended annual screening for people aged ≥ 50 years is the guaiac-based fecal occult blood test (gFOBT) or the fecal immunochemical test (FIT) (Smith et al., 2017). Although they are economical and non-invasive tests, gFOBT is not specific enough due to false-positive detection from hemorrhoids and ulcers, and neither gFOBT nor FIT is sufficiently sensitive in neoplasm detection (Brenner and Tao, 2013; Widlak et al., 2017). The insufficient sensitivity of current screening for non-invasive and early detection of CRC calls for a more effective non-invasive detection method based on liquid biopsy, such as plasma. Plasma mRNA has been used as a diagnostic and prognostic tumor marker in various cancers (García et al., 2007). The detection of plasma mRNA is non-invasive and flexible, which is beneficial to the follow-up of cancer patients after surgery or adjuvant therapy, and it has the potential to monitor cancer recurrence as well (Wong et al., 2004b; Stein et al., 2011). However, mRNA has a low abundance, and it is fragmented in plasma (Savelyeva et al., 2017). Moreover, different blood processing protocols, such as filtering and different centrifugal forces used in plasma preparation, may result in different quantification of plasma mRNA (Ng et al., 2002; El-Hefnawy et al., 2004; Wong et al., 2007). In a previous study, an additional centrifugation at 1,300 g for 10 min after a routine centrifugation at 1,800 g for 10 min was found to decrease plasma RNA concentration more than 20-fold (El-Hefnawy et al., 2004).
Moreover, our earlier work, which detected CTNNB1, SELP, KRT20, and GAPDH mRNA using quantitative reverse transcription PCR (RT-qPCR), showed that mRNA quantity was significantly different in metastatic CRC patients' plasma samples prepared by two centrifugations, 800 g for 8 min at 4°C and 4,500 g for 8 min at 4°C (Wong et al., 2007). Those results demonstrated that different centrifugal forces could lead to different quantities of mRNA in plasma samples from CRC patients. This artifact alerts us to the importance of standardizing the centrifugation protocol for plasma mRNA analysis. Otherwise, the data obtained cannot be interpreted and compared with other studies.

With the prevalence of next-generation sequencing, RNA deep sequencing has been used as an approach for transcriptome profiling in plasma samples (Wang et al., 2014; Shih et al., 2015). However, there is no standardized protocol for plasma preparation and no explicit description of the effects of centrifugal force on plasma mRNA, both of which have a crucial impact on the reproducibility of applications and on multicenter research. Some studies used plasma samples collected at centrifugal forces such as 1,400 or 1,600 g to study microRNA (miRNA) and mRNA cancer markers, respectively, using RNA deep sequencing (Wang et al., 2014; Shih et al., 2015). On the other hand, some researchers used higher centrifugal forces to profile plasma extracellular RNAs using small RNA deep sequencing (Freedman et al., 2016; Danielson et al., 2017). In one study, plasma samples were prepared at 2,500 g for 22 min at 4°C with an additional centrifugation at 8,000 g for 5 min after plasma thawing prior to RNA extraction (Freedman et al., 2016). In another study, plasma samples were prepared at 1,000 g for 10 min at room temperature with an additional centrifugation at 2,000 g after plasma thawing prior to RNA extraction (Danielson et al., 2017). Up to now, no study has reported the effect of centrifugal force on plasma mRNA based on RNA sequencing data. Therefore, we aim to examine the effect of centrifugal force on plasma mRNA quantity and quality, which may be important for cancer detection and monitoring.

In this study, we examined a panel of CRC-related mRNAs in healthy donors' (HDs) plasma prepared by two commonly used centrifugations: (1) 3,500 g for 10 min at 4°C and (2) 1,600 g for 10 min at 4°C followed by 16,000 g for 10 min at 4°C. CRC-related mRNA expression was detected using targeted deep sequencing. Subsequently, differential expression and correlation of expression between the two centrifugal forces were analyzed. The information obtained from this study will help us to understand the centrifugal force effects on the expression level of CRC-related mRNAs in plasma samples, which could help us to develop a standardized and effective protocol for CRC biomarker detection using targeted mRNA sequencing in plasma samples.

#### MATERIALS AND METHODS

#### Healthy Donors Recruitment and Plasma Collection

Eighteen HDs were recruited in this study. For each donor, 15 ml peripheral blood was collected in K3 EDTA tubes (Greiner Bio-One, Austria) and divided evenly into two portions. One portion was centrifuged at 3,500 g for 10 min at 4°C, and 3.2 ml plasma was collected and preserved in 9.6 ml Trizol LS Reagent (Thermo Fisher Scientific, USA) before storage at −80°C. The other portion was centrifuged at 1,600 g for 10 min at 4°C followed by 16,000 g for 10 min at 4°C, and 3.2 ml plasma was collected and preserved in the same way. A Microfuge 22R Centrifuge and F301.5 rotor (Beckman Coulter) were used for centrifugation in plasma preparation. Blood processing was done within 4 h after blood draw. All donors were recruited with written informed consent. The study was approved by the Joint Chinese University of Hong Kong and New Territories East Cluster Clinical Research Ethics Committee (CREC-2014.224).
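Because the protocols above are specified as relative centrifugal force (× g) rather than rotor speed, reproducing them on a different centrifuge requires converting g to rpm for the rotor in use. The following is a minimal sketch of that conversion using the standard relation RCF = 1.118 × 10<sup>−5</sup> × r × rpm<sup>2</sup> (r in cm). The 8.0 cm radius is a hypothetical placeholder, not the specification of the rotor used in this study, and the function names are illustrative.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor when the radius is given in cm


def rcf_to_rpm(rcf_g: float, radius_cm: float) -> float:
    """Rotor speed (rpm) needed to reach a given RCF at a given radius."""
    return math.sqrt(rcf_g / (RCF_CONSTANT * radius_cm))


def rpm_to_rcf(rpm: float, radius_cm: float) -> float:
    """RCF (x g) produced by a rotor spinning at `rpm` with the given radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


# Hypothetical rotor radius, for illustration only.
HYPOTHETICAL_RADIUS_CM = 8.0

for rcf in (1600, 3500, 16000):
    rpm = rcf_to_rpm(rcf, HYPOTHETICAL_RADIUS_CM)
    print(f"{rcf:>6} x g -> {rpm:,.0f} rpm at r = {HYPOTHETICAL_RADIUS_CM} cm")
```

Since RCF scales with the square of rotor speed, the 10-fold step from 1,600 g to 16,000 g corresponds to only about a 3.16-fold increase in rpm on the same rotor.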

#### RNA Extraction and Purification

For each sample, 3.2 ml plasma was used for RNA extraction using our established protocol (Wong et al., 2004a,b). In brief, the aqueous layer containing RNA was separated after adding chloroform (Sigma-Aldrich, USA) followed by centrifugation at 12,000 g for 15 min at 4°C. Then, 0.54 volume of absolute ethanol (Sigma-Aldrich, USA) was added to the aqueous layer to achieve appropriate binding conditions. The mixture was purified using the RNeasy Mini Kit (Qiagen, Germany) (Wong et al., 2004a,b). Subsequently, DNase digestion using the TURBO DNA-free Kit (Invitrogen, Lithuania) was performed, and the DNA-free RNA was concentrated using the RNeasy MinElute Cleanup Kit (Qiagen, Germany) according to the manufacturer's instructions (Tsui et al., 2014). Plasma total RNA was eluted in 14 µl RNase-free water and stored at −80°C until use. The quality and quantity of RNA were assessed using the Agilent RNA 6000 Pico Kit (Agilent Technologies, Lithuania) on a 2100 Bioanalyzer. The RNA integrity number (RIN) and the percentage of RNA fragments > 200 nt (DV<sub>200</sub>) were used as quality indicators. RIN is a standardized value describing RNA quality, which takes into account the 28S/18S ratio and other features of the electrophoretic RNA separation results (Schroeder et al., 2006). DV<sub>200</sub> is a parameter for evaluating the length distribution of fragmented RNA (Landolt et al., 2016).

#### Sequencing Library Preparation and Data Analysis

The sequencing library was prepared using a custom-designed TruSeq Targeted RNA Expression Kit (Illumina, USA), which was used to examine a panel of 108 CRC-related genes including 93 Wnt-signaling genes, existing CRC markers from the literature, and a control gene (Supplementary Table 1). The cDNA libraries were synthesized using 5 µl extracted plasma RNA, which was equivalent to about 670 pg RNA per sample, and the sequencing libraries were prepared according to the manufacturer's instructions with slight modifications, including (1) 2-fold diluted adapters to amplify libraries and (2) two rounds of clean-up of PCR products using AMPure XP beads (Beckman Coulter, USA) (Tsui et al., 2014). The quality and quantity of the prepared cDNA libraries were checked with the Agilent High Sensitivity DNA Kit (Agilent Technologies, Lithuania) and by qPCR, respectively. FastStart Universal SYBR Green Master (Roche, Germany) was used for quantification. Primers matching sequences within the adapters (5′-AATGATACGGCGACCACCGAGAT-3′ and 5′-CAAGCAGAAGACGGCATACGA-3′) were used. An Illumina-format DNA standard (Qiagen, Germany) was serially diluted to generate the standard curve for absolute quantification. The pooled sequencing library with 5% PhiX control (Illumina) was sequenced as single-end 51 bp reads on the MiSeq System using MiSeq Reagent Kit v3 (Illumina, Singapore). The targeted RNA sequencing data are available in the Sequence Read Archive (SRA) database (SRP125573).
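The absolute quantification by qPCR described above relies on a standard curve, i.e., a linear fit of threshold cycle (Ct) against log<sub>10</sub> concentration of a serially diluted standard. The sketch below illustrates the general technique only. The dilution series, Ct values, units, and function names are made up for illustration and are not data or software from this study.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def quantify(ct, slope, intercept):
    """Interpolate a concentration from a Ct value using the standard curve."""
    return 10 ** ((ct - intercept) / slope)


# Made-up 10-fold serial dilution (arbitrary concentration units) and Ct values.
standard_conc = [10.0, 1.0, 0.1, 0.01, 0.001]
standard_ct = [10.0, 13.3, 16.6, 19.9, 23.2]

slope, intercept = fit_line([math.log10(c) for c in standard_conc], standard_ct)
efficiency = 10 ** (-1 / slope) - 1  # 1.0 corresponds to perfect doubling per cycle
library_conc = quantify(15.0, slope, intercept)  # hypothetical library Ct of 15.0

print(f"slope={slope:.2f}, intercept={intercept:.2f}, efficiency={efficiency:.2f}")
print(f"estimated library concentration: {library_conc:.3f}")
```

A slope close to −3.32 corresponds to an amplification efficiency of about 100%, which is why the slope is usually checked before a curve is used for quantification.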

Data analysis for targeted mRNA sequencing included two parts. The primary analysis was performed on MiSeq Reporter. After base calling, FASTQ files of sequences with high sequencing quality were aligned to the hg19 reference genome based on the custom-designed regions. Raw aligned read counts of each gene for each sample were output. Counts per million (CPM) were calculated as the normalized expression of each gene to correct for biases due to library size. Let the raw count of gene i in sample j be C<sub>ij</sub>, with i = 1 to n and j = 1 to m. CPM<sub>ij</sub> is calculated as follows (Rau et al., 2013; Law et al., 2014):

$$\mathrm{CPM}_{ij} = \frac{C_{ij}}{\sum_{i=1}^{n} C_{ij}} \times 10^{6}$$

The normalized expression of each gene was shown on a log<sub>2</sub> scale as log<sub>2</sub>(CPM+1) to avoid log transformation of zero CPM (Law et al., 2014; Tsui et al., 2014). The secondary analysis was performed using the following pipeline. A non-specific filter, "keep the gene if it has > 1 CPM in ≥ 18 plasma samples," was used to remove uninformative signals and increase detection power without dependency on centrifugal force labels; this generally excluded low-abundance genes and reduced dispersion (Bourgon et al., 2010; Robinson et al., 2010; Rau et al., 2013). After filtering, DESeq2 was used to estimate dispersion and detect differential expression using a paired-sample test for the remaining genes (Love et al., 2014). A cutoff of fold change > 4 and adjusted P value < 0.05 was used to identify significant differences. The adjusted P value was calculated based on the Benjamini-Hochberg correction (Benjamini and Hochberg, 1995).
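The normalization and filtering steps described above can be sketched in a few lines. This is a minimal illustration of the CPM formula, the log<sub>2</sub>(CPM+1) transform, and the "> 1 CPM in ≥ 18 samples" filter, not the authors' actual pipeline. The function names and the toy count table are hypothetical, and the toy example uses a threshold of 3 samples only because it contains 4 samples.

```python
import math


def cpm(counts):
    """Counts per million.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    Each count is divided by its sample's library size and scaled by 1e6.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [counts[g][j] / lib_sizes[j] * 1e6 if lib_sizes[j] else 0.0
            for j in range(n_samples)]
        for g in genes
    }


def log2_cpm(cpm_table):
    """log2(CPM + 1), which stays defined when CPM is zero."""
    return {g: [math.log2(v + 1) for v in vals] for g, vals in cpm_table.items()}


def passes_filter(cpm_values, min_cpm=1.0, min_samples=18):
    """Keep a gene if it exceeds min_cpm in at least min_samples samples."""
    return sum(v > min_cpm for v in cpm_values) >= min_samples


# Toy table: 3 hypothetical genes x 4 hypothetical samples.
toy_counts = {"geneA": [90, 50, 0, 10], "geneB": [10, 50, 100, 0], "geneC": [0, 0, 0, 0]}
toy_cpm = cpm(toy_counts)
kept = [g for g, v in toy_cpm.items() if passes_filter(v, min_samples=3)]
print(kept)
```

Because the filter looks only at expression levels and never at the centrifugation label of a sample, it is independent of the test statistic, which is the property the cited filtering literature requires.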

TABLE 1 | The summary of detection of genes in two centrifugal forces.


#### Statistical Analysis

Statistical analysis was performed using the Wilcoxon matched-pairs signed rank test and Spearman correlation in Prism 5. P < 0.05 was regarded as indicating a significant difference or a significant correlation, respectively.
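For readers unfamiliar with the two rank-based statistics named above, the following minimal sketch computes the Spearman rank correlation coefficient and the Wilcoxon matched-pairs signed rank statistic W for paired measurements. It only illustrates how the statistics are formed; it does not compute P values and does not reproduce the Prism 5 implementation. The function names and the paired toy values are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def wilcoxon_signed_rank_w(x, y):
    """Smaller of the positive and negative rank sums for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical paired measurements (same donors, two protocols).
protocol_1 = [51, 73, 60, 82, 44]
protocol_2 = [39, 60, 65, 70, 30]
print(spearman_rho(protocol_1, protocol_2))
print(wilcoxon_signed_rank_w(protocol_1, protocol_2))
```

Both statistics depend only on ranks, so they are robust to the skewed, wide-ranging values typical of plasma RNA measurements.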

#### RESULTS

#### Plasma RNA Quality and Quantity From Two Centrifugal Forces

Using the Bioanalyzer, 18S and 28S rRNAs were detected in 16/18 (88.9%) 3,500 g centrifuged plasma samples and 6/18 (33.3%) 1,600 g followed by 16,000 g centrifuged plasma samples, respectively (Supplementary Figure 1). The median RIN was 6.90 (range: 1.3–8.4) and 1.15 (range: 1–7.5) in 3,500 g and 1,600 g followed by 16,000 g centrifuged plasma, respectively. RIN in 3,500 g was significantly higher than that in 1,600 g followed by 16,000 g (**Figure 1A**, P < 0.0001, Wilcoxon matched-pairs signed rank test). The median RNA concentration was 196.5 (range: 101–678) and 100.0 (range: 59–251) pg/µl in 3,500 g and 1,600 g followed by 16,000 g centrifuged plasma, respectively. RNA concentration in 3,500 g was significantly higher than that in 1,600 g followed by 16,000 g (**Figure 1B**, P < 0.01, Wilcoxon matched-pairs signed rank test). The median DV<sub>200</sub> was 58.5% (range: 27–80) and 41.0% (range: 15–74) in 3,500 g and 1,600 g followed by 16,000 g centrifuged plasma, respectively. DV<sub>200</sub> in 3,500 g was significantly higher than that in 1,600 g followed by 16,000 g (**Figure 1C**, P < 0.01, Wilcoxon matched-pairs signed rank test).

#### Summary of Plasma mRNA Targeted Sequencing

Overall, targeted sequencing on the MiSeq yielded 39.0 million raw reads in total, with at least 93% ≥ Q30. Among them, about 32.9 million reads were of high quality, and 20.9% of these were aligned to the targeted regions in the human genome (hg19).

Detection of the 108 CRC-related genes was summarized in **Table 1**. Sixteen (14.8%) of 108 genes were undetectable in both centrifugations. Besides, 6 genes could only be detected in 3,500 g but not in 1,600 g followed by 16,000 g centrifuged plasma, while 17 genes could only be detected in 1,600 g followed by 16,000 g but not in 3,500 g centrifuged plasma. Details of the genes detected only in one centrifugal force condition were listed in **Table 2**. These 23 genes were detected in ≤ 5 plasma samples, and the majority of them (73.9%) were low-abundance (raw counts ≤ 10 counts).
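The detection counts reported above can be cross-checked with simple set arithmetic. The short sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
PANEL_SIZE = 108
undetected_in_both = 16   # genes not detected in either centrifugation
only_3500g = 6            # detected only in 3,500 g plasma
only_1600g_16000g = 17    # detected only in 1,600 g followed by 16,000 g plasma

detected_in_either = PANEL_SIZE - undetected_in_both
detected_in_both = detected_in_either - only_3500g - only_1600g_16000g
total_3500g = detected_in_both + only_3500g
total_1600g_16000g = detected_in_both + only_1600g_16000g

print(detected_in_either, detected_in_both, total_3500g, total_1600g_16000g)
# -> 92 69 75 86

# 73.9% of the 23 single-condition genes corresponds to 17 low-abundance genes.
single_condition = only_3500g + only_1600g_16000g
print(round(0.739 * single_condition))
# -> 17
```

The resulting totals of 75 and 86 detected genes agree with the per-protocol numbers given in the abstract, and 69 genes were detected under both protocols.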

#### Expression Filtering and Gene Expression Levels in Two Centrifugal Forces

In total, 75 and 86 genes were detectable in 3,500 g (**Figure 2A**) and 1,600 g followed by 16,000 g (**Figure 2B**) centrifuged plasma, respectively. Based on the normalized expression levels (CPM), 25 of 108 (23.1%) genes, which had higher expression than the majority of other genes in both centrifugations, passed the filter (> 1 CPM in ≥ 18 plasma samples) and are highlighted in red in **Figure 2**. Only these 25 genes were included in the following differential expression analysis. The median expressions of those genes in 3,500 g were positively correlated with the expressions in 1,600 g followed by 16,000 g (**Figure 3**, R<sup>2</sup> = 0.9471, P < 0.0001, Spearman rank correlation). On the other hand, 83 genes were filtered out because of their low sequencing coverage and were not included in the downstream differential expression analysis.

#### Differential Gene Expression in Two Centrifugal Forces

The differential expression analysis and the fold-change estimation for the 25 genes that passed the filter were performed with DESeq2. Results are listed in **Table 3**, and genes with a significant difference in expression are highlighted in bold. Among them, MYC proto-oncogene (MYC) and hypoxia inducible factor 1 alpha subunit (HIF1A) showed significantly lower expression in 3,500 g than in 1,600 g followed by 16,000 g (16.67-fold with adjusted P < 0.005 and 5.56-fold with adjusted P < 0.05, respectively). In detail, MYC was detected in 11/18 (61.1%) plasma samples in both centrifugations, with a median normalized expression of 9.2 (range: 0.0–30443.0) and 98.9 (range: 0.0–48931.0) CPM in 3,500 g and 1,600 g followed by 16,000 g, respectively. HIF1A was detected in 15/18 (83.3%) plasma samples in both centrifugations, with a median normalized expression of 266.0 (range: 0.0–149880.3) and 1135.8 (range: 0.0–212939.9) CPM in 3,500 g and 1,600 g followed by 16,000 g, respectively.
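The adjusted P values reported above come from the Benjamini-Hochberg procedure, combined with the fold-change cutoff stated in the methods (fold change > 4, adjusted P < 0.05). The sketch below shows the general form of both steps. It is not the DESeq2 implementation, the function names are hypothetical, and the P values in the example are made up for illustration.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_significant(log2_fold_change, adjusted_p, min_fold=4.0, alpha=0.05):
    """Apply the two cutoffs: |fold change| > min_fold and adjusted P < alpha."""
    return abs(log2_fold_change) > math.log2(min_fold) and adjusted_p < alpha


# Made-up raw P values for four hypothetical genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw_p))

# A 16.67-fold lower expression corresponds to a log2 fold change of about -4.06.
print(is_significant(-math.log2(16.67), adjusted_p=0.004))
```

Expressing the cutoff on the log<sub>2</sub> scale treats increases and decreases symmetrically, so a 4-fold decrease (log<sub>2</sub> fold change of −2) is handled the same way as a 4-fold increase.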

Hierarchical clustering of samples and gene expression in the different centrifugations was performed using complete linkage of Euclidean distances (**Figure 4**). Plasma samples were not distinctively clustered based on centrifugal forces, which indicated that the 3,500 g and 1,600 g followed by 16,000 g centrifugations did not cause distinct differential expression of the panel of CRC-related genes in HDs' plasma samples.

FIGURE 2 | [...] interquartile range of log-transformed normalized expressions and outliers, and the 25 genes that passed the filter are highlighted in red.

FIGURE 3 | [...] correlated between the 3,500 g and 1,600 g followed by 16,000 g conditions (R<sup>2</sup> = 0.9471, P < 0.0001, Spearman rank correlation).
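Complete-linkage hierarchical clustering on Euclidean distances, as used for the sample clustering above, can be sketched as follows. This is a minimal agglomerative implementation for illustration only; the software used in the study for this step is not stated here, and the sample labels and expression vectors below are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(profiles):
    """Agglomerative clustering; returns the merge history.

    profiles: dict mapping a sample label to its expression vector.
    Each merge is (members_a, members_b, distance), where distance is the
    largest pairwise Euclidean distance between members of the two clusters.
    """
    clusters = [(label,) for label in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = max(euclidean(profiles[x], profiles[y]) for x in a for y in b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical log-expression vectors for four samples.
toy_profiles = {"s1": [0, 0], "s2": [0, 1], "s3": [5, 5], "s4": [5, 6]}
for members_a, members_b, dist in complete_linkage(toy_profiles):
    print(members_a, members_b, round(dist, 2))
```

If centrifugation protocol dominated the expression profiles, samples from the same protocol would merge early at small distances; the absence of such grouping is what the text above reports.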

#### DISCUSSION

Plasma RNA sequencing has been used to investigate circulating cancer markers. However, the majority of previous studies focused on profiling miRNA markers, because miRNAs are relatively stable in human plasma (Mitchell et al., 2008; Wang et al., 2014). For plasma mRNA, we previously reported that its quantification could be affected by different centrifugal forces based on RT-qPCR results (Wong et al., 2007). Therefore, it is important to clarify centrifugal force effects based on RNA deep sequencing data before examining plasma mRNA using RNA deep sequencing technologies. Here, we provided important information on the centrifugal force effects using a custom panel of CRC-related mRNAs in plasma samples. In brief, we found that 2 of 108 CRC-related genes showed differential expression in plasma samples prepared by protocols with different centrifugal forces. In addition, the results of clustering and correlation for gene expression showed that the two centrifugal forces used in this study did not cause distinct differential expression of the panel of CRC-related genes in HDs' plasma samples. This is the first study to evaluate the effects of centrifugal force on plasma mRNA quantity and quality using targeted RNA sequencing.

By comparing plasma mRNA extracted under two centrifugation protocols, (1) 3,500 g and (2) 1,600 g followed by 16,000 g, we made three important observations.

TABLE 3 | The differential gene expression in two centrifugal forces analyzed by DESeq2 (3,500 vs. 1,600 g followed by 16,000 g).


First, RNA concentration, integrity, and the percentage of longer fragments were significantly decreased in plasma samples prepared by the high centrifugal force. Compared with the cell-free plasma samples prepared by 1,600 g followed by 16,000 g (Chiu et al., 2001), plasma samples prepared by 3,500 g include more RNA-associated particles (Ng et al., 2002). Those particles may include cell debris, extracellular vesicles, and other particles that can combine with mRNA molecules, which contribute to the increased amount of 18S and 28S rRNAs and the corresponding increase in RNA concentration. Meanwhile, the decrease of RIN and DV<sub>200</sub> in plasma samples prepared by the high centrifugal force was probably due to the depletion of RNA-associated particles, which accounts for fewer 18S and 28S rRNAs and more fragmented RNA, respectively, when plasma was subjected to the high centrifugal force. This phenomenon emphasized the effects of RNA-associated particles on the quality and quantity of plasma total RNA at different centrifugal forces.

However, our attention was also focused on whether mRNAs could be affected by different centrifugal forces. Our second finding was that most of the mRNAs in our CRC-related panel were detectable by targeted mRNA sequencing in plasma samples prepared by both centrifugations. This phenomenon is not surprising, because circulating mRNAs are protected from endogenous RNase digestion through association with particles, for example apoptotic bodies and protein complexes (Wieczorek et al., 1985; Hasselmann et al., 2001; Ng et al., 2002). However, a majority of detectable mRNAs had low and overdispersed expression, and a similar phenomenon was found in a recent study on plasma mRNA sequencing in pregnant women (Chim et al., 2017). In addition, we found several low-abundance mRNAs (≤ 10 counts) that were only detectable in one of the two centrifugations (**Table 2**). Most of these mRNAs were only detectable in one or two plasma samples, and they accounted for the fact that more genes were detected in 1,600 g followed by 16,000 g than in 3,500 g. It was difficult to determine which kind of particles protected these mRNAs from RNase in plasma samples. For transcripts only detected in 3,500 g-centrifuged plasma samples, their existence could be related to the presence of cell debris, extracellular vesicles, and other particles, which cannot be removed effectively by the protocol with 3,500 g centrifugal force.

Filters are generally used in RNA sequencing data analysis to eliminate uninformative data points and increase detection power, and they must be chosen prudently to avoid losing type I error control in differential analysis (Bourgon et al., 2010; Rau et al., 2013). In this study, we used a CPM filter, as previously defined in edgeR (Robinson et al., 2010), to exclude mRNAs whose expression fell below the filter criteria stated in the methodology section from subsequent differential expression analysis. Our third finding was that 25 genes were retained for differential expression analysis after filtering. Among them, MYC and HIF1A showed significantly lower expression in 3,500 g than in 1,600 g followed by 16,000 g. This implied that the plasma mRNA of these two genes was hardly affected by centrifugal force and was mainly preserved in cell-free form, which resulted in their increased relative expression after normalization. MYC encodes the transcription factor c-Myc. It shows elevated expression in different tumor cells and acts on the promoter regions of its target genes (Lin et al., 2012). HIF1A encodes a transcription factor that responds to hypoxia by recruiting a specific cyclin-dependent kinase, stimulating RNA polymerase elongation and activating transcription of downstream genes (Galbraith et al., 2013). No previous study has described how MYC and HIF1A mRNAs exist and are preserved in cell-free form in human plasma. The sequencing results for MYC and HIF1A expression were validated using RT-qPCR. Overall, those 25 genes had higher median normalized expression than the other genes in both centrifugation protocols, and their expression levels in the two protocols were significantly correlated (**Figure 3**).
This result indicated that the detected gene expression depended on the intrinsic expression level of each gene rather than on the effects of different centrifugal forces.
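As an illustration of the CPM filtering step discussed above, the following Python sketch computes counts per million and keeps genes that reach a threshold in a minimum number of samples. It is a simplified stand-in, not edgeR's implementation and not the exact analysis used here. The threshold (`min_cpm`), the sample requirement (`min_samples`), the gene names and the counts are all placeholders; the criteria actually applied are those stated in the methodology section.

```python
from typing import Dict, List

Counts = Dict[str, List[int]]  # gene -> raw read counts, one value per sample


def counts_per_million(counts: Counts) -> Dict[str, List[float]]:
    """Scale each count by its sample's library size (total reads) times 1e6."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(values[j] for values in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            values[j] / lib_sizes[j] * 1e6 if lib_sizes[j] else 0.0
            for j in range(n_samples)
        ]
        for gene, values in counts.items()
    }


def cpm_filter(counts: Counts, min_cpm: float, min_samples: int) -> List[str]:
    """Return genes whose CPM reaches min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return [
        gene
        for gene, values in cpm.items()
        if sum(v >= min_cpm for v in values) >= min_samples
    ]


# Hypothetical targeted-panel counts for three samples (library size 1,000 each).
counts = {
    "GENE_A": [900, 850, 900],
    "GENE_B": [95, 149, 50],
    "GENE_C": [5, 1, 0],
    "GENE_D": [0, 0, 50],
}

cpm = counts_per_million(counts)
kept = cpm_filter(counts, min_cpm=10_000, min_samples=2)  # placeholder criteria
print(kept)
```

With these placeholder settings, GENE_C is dropped because it never reaches the threshold, and GENE_D is dropped because it reaches it in only one sample. Requiring several samples to pass is what keeps sporadically detected transcripts out of the differential analysis.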

Moreover, plasma samples did not cluster according to the centrifugal force used in plasma preparation, which showed that the two centrifugal forces used in this study did not lead to a distinct difference in the concentration of CRC-related mRNAs in plasma samples (**Figure 4**).

To conclude, we achieved targeted mRNA sequencing of plasma samples using a custom panel of CRC-related mRNAs. Our sequencing results demonstrated that these plasma mRNAs were not distinctly affected by two widely different centrifugal forces. However, because mRNA from cell debris may interfere with the quantification of disease-derived plasma mRNA and with the efficiency of circulating marker selection for CRC in future studies, we suggest using the protocol with 1,600 g followed by 16,000 g centrifugal force in plasma preparation, which efficiently removes cell debris from plasma and is more likely to expose disease-derived mRNA information. These findings lay a foundation for understanding how plasma RNA behaves upon centrifugation ahead of downstream RNA deep sequencing. Moreover, they can help researchers standardize their protocols so that results generated in multicenter studies can be compared with more precision and confidence. In future work, we may use transmission electron microscopy, ultracentrifugation and other technologies to further study which components or extracellular vesicles account for the differences in plasma mRNA quantification caused by centrifugal force.

#### AUTHOR CONTRIBUTIONS

SW conceived and designed the experiments; VX performed the experiments; SW, VX, AY, and WC analyzed the data; SN, WL, BM, and WC gave invaluable comments on the subject recruitment and data interpretation; VX and SW wrote the paper; TA and HT processed the patient samples and technical work before library preparation.

#### FUNDING

This study was supported by the Health and Medical Research Fund (HMRF), Food and Health Bureau, The Government of the Hong Kong Special Administrative Region (Reference number: 02131226).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00165/full#supplementary-material


of circulating RNA transcripts in pregnant women based on RNA-seq data. Int. J. Mol. Sci. 18:1709. doi: 10.3390/ijms18081709


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Xue, Ng, Leung, Ma, Cho, Au, Yu, Tsang and Wong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.