# RECENT PROGRESSES OF NON-CODING RNAS IN BIOLOGICAL AND MEDICAL RESEARCH, 2nd Edition

EDITED BY: Yun Zheng and Philipp Kapranov

PUBLISHED IN: Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-054-4 DOI 10.3389/978-2-88966-054-4

Topic Editors:

Yun Zheng, Kunming University of Science and Technology, China

Philipp Kapranov, Huaqiao University, China

Publisher's note: In this 2nd edition, the following article has been updated: Xun Y, Tang Y, Hu L, Xiao H, Long S, Gong M, Wei C, Wei K and Xiang S (2019) Purification and Identification of miRNA Target Sites in Genome Using DNA Affinity Precipitation. *Front. Genet.* 10:778. doi: 10.3389/fgene.2019.00778

Citation: Zheng, Y., Kapranov, P., eds. (2020). Recent Progresses of Non-Coding RNAs in Biological and Medical Research, 2nd Edition. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-054-4

# Table of Contents

*Editorial: Recent Progresses of Non-coding RNAs in Biological and Medical Research*

Yun Zheng and Philipp Kapranov


Le Ou-Yang, Jiang Huang, Xiao-Fei Zhang, Yan-Ran Li, Yiwen Sun, Shan He and Zexuan Zhu

*LncRRIsearch: A Web Server for lncRNA-RNA Interaction Prediction Integrated With Tissue-Specific Expression and Subcellular Localization Data*

Tsukasa Fukunaga, Junichi Iwakiri, Yukiteru Ono and Michiaki Hamada


Xuan Zhang, Tianjun Li, Jun Wang, Jing Li, Long Chen and Changning Liu

*Predicting lncRNA-miRNA Interaction* via *Graph Convolution Auto-Encoder*

Yu-An Huang, Zhi-An Huang, Zhu-Hong You, Zexuan Zhu, Wen-Zhun Huang, Jian-Xin Guo and Chang-Qing Yu

*Purification and Identification of miRNA Target Sites in Genome Using DNA Affinity Precipitation*

Yu Xun, Yingxin Tang, Linmin Hu, Hui Xiao, Shengwen Long, Mengting Gong, Chenxi Wei, Ke Wei and Shuanglin Xiang

*Non-Coding RNAs in Pediatric Solid Tumors*

Christopher M. Smith, Daniel Catchpoole and Gyorgy Hutvagner

# Editorial: Recent Progresses of Non-coding RNAs in Biological and Medical Research

Yun Zheng<sup>1\*</sup> and Philipp Kapranov<sup>2\*</sup>

*<sup>1</sup> Yunnan Key Lab of Primate Biomedicine Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming, China, <sup>2</sup> Institute of Genomics, School of Biomedical Sciences, Huaqiao University, Xiamen, China*

Keywords: non-coding RNA, long non-coding RNA (lncRNA), microRNA, biological research, medical research

**Editorial on the Research Topic**

**Recent Progresses of Non-coding RNAs in Biological and Medical Research**

#### Edited by:

*William Cho, Queen Elizabeth Hospital (QEH), Hong Kong*

#### Reviewed by:

*Peter Igaz, Semmelweis University, Hungary*

*Mohammadreza Hajjari, Shahid Chamran University of Ahvaz, Iran*

#### \*Correspondence:

*Yun Zheng zhengyun5488@gmail.com*

*Philipp Kapranov philippk08@hotmail.com*

#### Specialty section:

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

Received: *08 November 2019* Accepted: *17 February 2020* Published: *28 February 2020*

#### Citation:

*Zheng Y and Kapranov P (2020) Editorial: Recent Progresses of Non-coding RNAs in Biological and Medical Research. Front. Genet. 11:187. doi: 10.3389/fgene.2020.00187*

# INTRODUCTION

Short (<200 nt) and long (>200 nt) non-coding (nc) RNAs account for the majority of mammalian transcriptional output and encompass RNA species critical for various aspects of development and disease (Ambros, 2001; Kapranov et al., 2002, 2007; Bartel, 2004; Carninci et al., 2005). We have witnessed an ever-increasing pace of discovery of these transcripts in the last decade, in large measure owing to the widespread application of high-throughput sequencing technologies for RNA analysis. These ncRNAs include, but are not limited to, novel members of known classes such as miRNAs and siRNAs; new classes of small RNAs, for example, those associated with promoters and termini of genes; new classes of long non-coding (lnc) RNAs; a plethora of antisense transcripts; circular RNAs derived from exons and introns; and many others (Laurent et al., 2015; Li et al., 2016; Kristensen et al., 2019; Zhang et al., 2019). Non-coding RNAs have been associated with almost every important biological process and human disease (Calin et al., 2004; Esteller, 2011; Wapinski and Chang, 2011; Mendell and Olson, 2012). However, our understanding of most of these transcripts is still at the initial stages.

Deeper insight into these enigmatic RNA species clearly requires efforts from both wet-lab and computational avenues of research (Zheng et al., 2017). Therefore, this Research Topic aimed to bring together works from both directions that converge on new insights into the functions of ncRNAs. The thirteen papers included in it serve as a collection of recent results and advances across multiple areas of the ncRNA research field.

# WET-LAB EXPERIMENTAL STUDIES OF NCRNAS

Lin et al. identified miR-30c secreted by bovine embryos as a potential biomarker for hampered preimplantation. Two miRNAs, i.e., miR-30c and miR-10b, were found at much higher levels in conditioned medium of slow cleaving embryos compared to intermediately cleaving ones (Lin et al.). One of them, miR-30c, directly repressed cyclin-dependent kinase 12 (CDK12) through a complementary site in the 3′ UTR (Lin et al.). Several DNA damage response (DDR) genes were significantly downregulated after introducing miR-30c or repressing CDK12, suggesting that miR-30c regulates embryo development through the DDR pathway (Lin et al.).

Mature hair follicles in mammals undergo periodic self-renewal processes called hair follicle cycles. Understanding the molecular regulatory mechanisms of the renewal cycle is important in medicine and developmental biology. Zhao et al. examined deregulated miRNAs, lncRNAs and circRNAs in the hair follicle cycle of the Angora rabbit (*Oryctolagus cuniculus*) and provided a comprehensive repository of ncRNAs potentially relevant to this process.

Wang et al. profiled lncRNAs in CD4<sup>+</sup> T cells in a mouse model of acute asthma. They found 36 upregulated and 98 downregulated lncRNAs in the disease samples compared with the control samples (Wang et al.). The potential functions of the deregulated lncRNAs were analyzed by performing miRNA binding analysis (Wang et al.).

It is well established that miRNAs work by guiding the RNA-induced silencing complex (RISC) to their target RNA binding sites in the cytoplasm (Bartel, 2004). However, a steady stream of evidence shows that some miRNAs localize and potentially function in the nucleus (Place et al., 2008; Ritland Politz et al., 2009; Liu et al., 2018). Xun et al. proposed an efficient experimental method to find miRNA binding sequences in genomic DNA in vivo, thus potentially identifying miRNA binding sites in the regulatory regions of genes.

# COMPUTATIONAL STUDIES OF NCRNAS

Ou-Yang et al. proposed a novel method called two-side sparse self-representation (TSSR) for predicting lncRNA-disease associations. TSSR significantly outperformed other tested methods and identified some candidate lncRNA-disease associations (Ou-Yang et al.).

Zhang et al. proposed a method called CRlncRC2 for predicting associations between lncRNAs and cancers. More than four hundred cancer-related lncRNA candidates were identified, which were evaluated by examining the Lnc2Cancer database, reviewing literature, and performing statistical analysis of multiple relevant data sources containing information on mutations and differential gene expression in cancers (Zhang et al.). These results demonstrated that CRlncRC2 is an effective and accurate method for identification of cancer-related lncRNAs (Zhang et al.).

LncRNAs are assumed to realize their functions by interacting with other molecules, such as proteins, chromatin and other RNA species. Shen et al. proposed a new method for identifying lncRNA-protein interactions by employing Kernel Ridge Regression based on Fast Kernel Learning (LPI-FKLKRR). LPI-FKLKRR demonstrated superior performance compared with a series of other methods, as judged by the area under the precision-recall curve.

Huang et al. introduced a computational method to predict interactions between lncRNAs and miRNAs leveraging the information of expression profile data for these transcripts and the graph convolution technique. The proposed model is based on the assumption that the interaction between an lncRNA and a miRNA could be deciphered from their co-expression pattern. Compared with the conventional miRNA-target prediction algorithms based on sequence matching, their work presents a new approach to predict lncRNA:miRNA interactions.

Fukunaga et al. introduced a web server, called LncRRIsearch, for predicting lncRNA:lncRNA and lncRNA:mRNA interactions in human and mouse. The tissue-specific expression and cellular localization data of lncRNAs are integrated in this web server to explore tissue-specific or subcellular-localized lncRNA interactions (Fukunaga et al.).

# REVIEWS AND PERSPECTIVES

Li and Liu summarized recent evidence suggesting that coding and non-coding properties are inherent to both coding and non-coding transcripts. In other words, some lncRNAs and circRNAs could be used to produce short peptides, i.e., have coding capabilities. On the other hand, 3′ and 5′ UTRs of coding genes have non-coding functions such as recruiting RNA-binding proteins (Li and Liu).

Smith et al. reviewed the miRNAs and lncRNAs that play key roles in the initiation and progression of pediatric solid tumors. Pediatric tumors, due to lower mutation load compared with adult ones, are assumed to arise from mis-regulation of networks normally functioning during development at transcriptional level (Smith et al.). The authors summarized accumulating evidence of involvement of miRNAs and lncRNAs in the regulatory networks functioning during oncogenesis.

Watson et al. explored small RNAs in neurodegenerative diseases. This comprehensive review discusses the roles of various small RNAs in multiple neurodegenerative diseases, including Alzheimer's disease, Parkinson's disease, multiple sclerosis, amyotrophic lateral sclerosis, and Huntington's disease.

Recent evidence shows that ncRNAs, both miRNAs and lncRNAs, could serve as communication factors between cells (Bayraktar et al., 2017; Bär et al., 2019). Ramón y Cajal et al. proposed that the interactions between miRNAs and lncRNAs might contribute to cell-type-specific outcomes and to the determination of cell fate. In one model, miRNAs could be competitively sequestered by tissue-specific lncRNAs. In another, miRNAs released into the extracellular space as ligands could interact with lncRNAs acting as receptors in different organs, which either sequester the miRNAs or induce degradation of the miRNAs or the lncRNAs.

# SUMMARY

Non-coding RNAs have been associated with various biological processes and human diseases. These associations were further expanded and reviewed by several studies in this Research Topic. A number of wet-lab and computational methods, as well as database resources, reported in the Topic should help to refine the connections between ncRNAs and diseases and to identify the mechanisms of action of the former, thus further contributing to the advancement of the ncRNA field.

# AUTHOR CONTRIBUTIONS

YZ and PK conceived of the work and wrote the manuscript.

# FUNDING

The research was supported in part by a grant (No. 31760314) of the National Natural Science Foundation of China (http://www.nsfc.gov.cn/) and a grant (No. 2018YFA0108502) of the Ministry of Science and Technology of China to YZ; and a grant (No. 31671382) of the National Natural Science Foundation of China to PK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zheng and Kapranov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multivariate Information Fusion With Fast Kernel Learning to Kernel Ridge Regression in Predicting LncRNA-Protein Interactions

#### Cong Shen<sup>1</sup>, Yijie Ding<sup>2</sup>, Jijun Tang<sup>1,3</sup> and Fei Guo<sup>1\*</sup>

*<sup>1</sup> School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>2</sup> School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China, <sup>3</sup> Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States*

#### Edited by:

*Yun Zheng, Kunming University of Science and Technology, China*

#### Reviewed by:

*Zexuan Zhu, Shenzhen University, China*

*Mauricio Fernando Budini, Universidad de Chile, Chile*

*Min Wu, Agency for Science, Technology and Research (A\*STAR), Singapore*

#### \*Correspondence:

*Fei Guo fguo@tju.edu.cn*

#### Specialty section:

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

Received: *22 September 2018* Accepted: *21 December 2018* Published: *15 January 2019*

#### Citation:

*Shen C, Ding Y, Tang J and Guo F (2019) Multivariate Information Fusion With Fast Kernel Learning to Kernel Ridge Regression in Predicting LncRNA-Protein Interactions. Front. Genet. 9:716. doi: 10.3389/fgene.2018.00716*

Long non-coding RNAs (lncRNAs) constitute a large class of transcribed RNA molecules. They are longer than 200 nucleotides and do not encode proteins. They play an important role in regulating gene expression by interacting with the homologous RNA-binding proteins. Because wet-lab experimental methods are laborious and time-consuming, computational approaches for the prediction of lncRNA-protein interactions (LPI) deserve more attention. A review of state-of-the-art *in silico* investigations leads to the conclusion that there is still room for improving accuracy and speed. This paper proposes a novel method for identifying LPI by employing Kernel Ridge Regression based on Fast Kernel Learning (LPI-FKLKRR). The approach uses four distinct similarity measures each for the lncRNA space and the protein space. Notably, we use Gene Ontology (GO) annotations of proteins to improve the quality of information in the protein space. To integrate the heterogeneous kernels, Fast Kernel Learning (FastKL) is applied to optimize the kernel weights. The final predicted associations are then obtained with Kernel Ridge Regression (KRR). Experimental results show that LPI-FKLKRR performs better than other LPI prediction schemes. On the benchmark dataset, our proposed model LPI-FKLKRR obtains the best Area Under the Precision-Recall Curve (AUPR) of 0.6950, outperforming the integrated LPLNP (AUPR: 0.4584), RWR (AUPR: 0.2827), CF (AUPR: 0.2357), LPIHN (AUPR: 0.2299), and LPBNI (AUPR: 0.3302). Combined with the experimental results of a case study on a novel dataset, these results suggest that LPI-FKLKRR will be a useful tool for LPI prediction.

Keywords: lncRNA-protein interactions, multiple kernel learning, fast kernel learning, kernel ridge regression, gene ontology

# 1. INTRODUCTION

Long non-coding RNAs (lncRNAs) constitute a large class of transcribed molecules. They are longer than 200 nucleotides and do not encode proteins (St Laurent et al., 2015). Existing research has shown that lncRNAs can control gene expression at the transcriptional, post-transcriptional, and epigenetic levels by interacting with the homologous RNA-binding proteins (Guttman and Rinn, 2012; Quan et al., 2015; Tee et al., 2015). A recent study found that a virus-induced lncRNA named lnc-Lsm3b can restrain the activity of the receptor RIG-I during the regulation of the immune response (Jiang et al., 2018). This is consistent with previous studies showing that lncRNAs play potential roles in complex human diseases (Li et al., 2013). Because wet-lab experimental methods in molecular biology are laborious and time-consuming, many state-of-the-art computational studies have addressed this problem in an effort to improve accuracy and time efficiency (Zou et al., 2012; Jalali et al., 2015; Han et al., 2018).

Since it is very difficult to extract any actual details on the 3D structures of lncRNAs and relative proteins, many sequence-based and secondary structure-based approaches for the prediction of lncRNA-protein interaction (LPI) have been published in the literature. Bellucci et al. have established the well-known catRAPID (Bellucci et al., 2011) by leveraging both physicochemical properties and secondary structure information, which could be employed as compound information to handle the problem of predicting LPI. Meanwhile, the hybrid schema RPISeq has been introduced by Muppirala et al. (2011), which employs both Support Vector Machines (SVM) and Random Forest (RF). Wang et al. have proposed a classifier combining Naive Bayes (NB) and Extended NB (ENB) classifier to extrapolate LPI (Wang et al., 2012). Lu et al. have established lncPro, which translates each LPI into numerical form, and applies matrix multiplication (Lu et al., 2013). Suresh et al. developed RPI-Pred based on SVM, by using the structure and sequence information of lncRNAs and proteins (Suresh et al., 2015).

In contrast to the aforementioned works, Li et al. have introduced LPIHN, which employs a heterogeneous network together with a random walk with restart (RWR) on the lncRNA-protein association profile (Li et al., 2015). Ge et al. have used a resource allocation model on a bipartite network and published the algorithm as LPBNI (Ge et al., 2016). More recently, Hu et al. have proposed a semi-supervised link prediction scheme called LPI-ETSLP (Hu et al., 2017), which was soon upgraded to IRWNRLPI. The latter method integrates RWR and matrix factorization (Zhao et al., 2018).

Zhang et al. have suggested two classes of state-of-the-art computational intelligence approaches (Zhang et al., 2017). The first includes supervised LPI binary classifiers, which do not require prior knowledge of interactions as negative instances (Bellucci et al., 2011; Muppirala et al., 2011; Wang et al., 2012; Lu et al., 2013; Suresh et al., 2015). The second category includes semi-supervised approaches, which combine known interactions to suggest unknown LPI. Characteristic examples of this class are LPIHN (Li et al., 2015), LPBNI (Ge et al., 2016), LPI-ETSLP (Hu et al., 2017), and IRWNRLPI (Zhao et al., 2018).

Transfer learning (Jonathan et al., 1995), which recognizes skills or knowledge learned in previous tasks and applies them to novel tasks, is a burgeoning branch of machine learning. Zero-shot learning in pairwise learning with two-step Kernel Ridge Regression (KRR) (Stock et al., 2016) is a special type of transfer learning that constructs predictors from a dataset containing both labeled and unlabeled samples. Hence, it is an effective mechanism for reducing the need for labeled data. To detect the pairs of lncRNAs and proteins that can interact with each other, state-of-the-art statistical methods have been exploited, such as Recursive Least Squares (RLS), Kronecker RLS, Sparse Representation based Classifier (SRC), and Multiple Kernel Learning (MKL). All these techniques have already been applied to predicting Protein-Protein Interactions (PPIs) (Ding et al., 2016; Liu X. et al., 2016), Drug-Target Interactions (DTIs) (Xia Z. et al., 2010; Laarhoven et al., 2011; Twan and Elena, 2013; Nascimento et al., 2016; Shen et al., 2017b), binding sites of biomolecules (Ding et al., 2017; Shen et al., 2017a), identification of disease-resistant genes (Xia J. et al., 2010), and microRNA-disease associations (Zou et al., 2015; Peng et al., 2017), with competitive results.

Building on the above research, we have enriched the categories of similarity measures adopted for LPI prediction. The heterogeneous kinds of similarity information are integrated by applying Fast Kernel Learning (FastKL), which optimizes the weights of the heterogeneous kernels. This research proposes a two-step Kernel Ridge Regression (KRR) applied to LPI prediction. LPI-FKLKRR proved to be more reliable and effective for LPI prediction than other competitive methods. The core of the algorithm proposed herein has been evaluated on the benchmark dataset of LPIs. What is especially encouraging is that many of the LPI predictions made by our method have been confirmed, with a high degree of correlation. We have also conducted comparative testing on a novel dataset to illustrate the stable performance of LPI-FKLKRR.

# 2. METHODS

In this section, we describe the architecture of our model. Its basic components are the known LPI interaction matrix and the multivariate information, which consists of lncRNA expression, the local network, sequence information, and Gene Ontology (GO) annotations. All the similarity information must be combined with the respective combination weights. Finally, we have developed the LPI with Fast Kernel Learning based on Kernel Ridge Regression (LPI-FKLKRR) identification strategy, which uses a two-stage Kernel Ridge Regression for LPI prediction.

## 2.1. Problem Specification

Suppose there are m lncRNAs and n proteins involved in LPI. We formally define the two kinds of molecules as L = {l<sub>i</sub> | i = 1, 2, ⋯, m} and P = {p<sub>j</sub> | j = 1, 2, ⋯, n}, respectively. Hence, the interactions between lncRNAs and proteins can be expressed succinctly as an m × n adjacency matrix **F**, formulated as Equation (1):

$$\mathbf{F} = \begin{bmatrix} f\_{1,1} & f\_{1,2} & \cdots & f\_{1,j} & \cdots & f\_{1,n} \\ f\_{2,1} & f\_{2,2} & \cdots & f\_{2,j} & \cdots & f\_{2,n} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ f\_{i,1} & f\_{i,2} & \cdots & f\_{i,j} & \cdots & f\_{i,n} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ f\_{m,1} & f\_{m,2} & \cdots & f\_{m,j} & \cdots & f\_{m,n} \end{bmatrix}\_{m \times n} \tag{1}$$

where f<sub>i,j</sub> in matrix **F** corresponds to the value for the pair ⟨l<sub>i</sub>, p<sub>j</sub>⟩, with 1 ≤ i ≤ m, 1 ≤ j ≤ n, and m, n ∈ N<sup>∗</sup>. If lncRNA l<sub>i</sub> can interact with protein p<sub>j</sub>, the value of f<sub>i,j</sub> is set to 1; otherwise it is set to 0.
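As a minimal sketch of Equation (1), the binary matrix can be filled from a list of known interacting pairs. The function name `build_adjacency` and the toy identifiers are hypothetical and are not taken from the benchmark dataset.

```python
import numpy as np

def build_adjacency(lncrnas, proteins, known_pairs):
    """Return the m x n binary matrix F with F[i, j] = 1 for known interactions."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(proteins)}
    F = np.zeros((len(lncrnas), len(proteins)), dtype=int)
    for lnc, pro in known_pairs:
        F[row[lnc], col[pro]] = 1
    return F

# Hypothetical toy identifiers, used only for illustration.
lncrnas = ["lnc_a", "lnc_b", "lnc_c"]
proteins = ["prot_x", "prot_y"]
known_pairs = [("lnc_a", "prot_x"), ("lnc_c", "prot_y")]
F = build_adjacency(lncrnas, proteins, known_pairs)
```

Rows index lncRNAs and columns index proteins, matching the m × n layout above.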

The identification of new interactions between lncRNAs and proteins can be viewed as a task suitable for a recommender system (Koren et al., 2009) on a bipartite network, which can mine and detect potentially associated individuals. To this end, we use Multiple Kernel Learning (MKL) to design the optimization for the prediction of LPI. In the following section, we treat each similarity matrix as a kernel.

## 2.2. lncRNA Kernels and Protein Kernels

To conduct MKL, it is necessary to construct similarity matrices of the molecules in the lncRNA and protein kernel spaces, respectively. Specifically, lncRNA expression, protein GO, lncRNA sequence, protein sequence, and the known interactions between one lncRNA and all proteins are considered in our framework. In addition, the training adjacency matrix **F**<sub>train</sub> is obtained by masking known pairs ⟨l<sub>i</sub>, p<sub>j</sub>⟩: the elements of the matrix that belong to the validation set are set to 0, as shown in **Figure 1**.
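The masking step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the helper names `mask_pairs` and `kfold_pair_splits` are hypothetical, and the folds are drawn over all (lncRNA, protein) pairs, which the text does not specify.

```python
import numpy as np

def mask_pairs(F, held_out):
    """Copy F and set the held-out (i, j) entries to 0 to form F_train."""
    F_train = F.copy()
    for i, j in held_out:
        F_train[i, j] = 0
    return F_train

def kfold_pair_splits(F, k=5, seed=0):
    """Yield (F_train, held_out) for k folds over all (i, j) pairs (an assumption)."""
    rng = np.random.default_rng(seed)
    pairs = np.array([(i, j) for i in range(F.shape[0]) for j in range(F.shape[1])])
    rng.shuffle(pairs)  # shuffles the rows (pairs) in place
    for fold in np.array_split(pairs, k):
        held_out = [(int(i), int(j)) for i, j in fold]
        yield mask_pairs(F, held_out), held_out

# Toy interaction matrix, used only for illustration.
F_toy = np.array([[1, 0], [1, 1], [0, 1]])
splits = list(kfold_pair_splits(F_toy, k=3))
```

Because the interaction-profile kernels depend on **F**<sub>train</sub>, they would be recomputed from each fold's masked matrix.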

### 2.2.1. Gaussian Interaction Profile Kernel

Interactions are reflected in the connectivity behavior of the underlying network (Laarhoven et al., 2011; Twan and Elena, 2013). For the lncRNAs, we extract the interaction information corresponding to each row of the training adjacency matrix **F**<sub>train</sub>. We then use the broadly applicable Gaussian Interaction Profile (GIP) kernel to define the interaction kernel for lncRNAs l<sub>i</sub> and l<sub>k</sub> (i, k = 1, 2, ⋯, m). The GIP kernel for proteins p<sub>j</sub> and p<sub>s</sub> (j, s = 1, 2, ⋯, n) is generated in a similar way. In summary, each element of the GIP kernels is given by:

$$\mathbf{K}\_{\rm GIP}^{\rm lnc}(l\_i, l\_k) = \exp(-\sigma\_{\rm lnc} \| \mathbf{F}\_{l\_i} - \mathbf{F}\_{l\_k} \|^2) \tag{2a}$$

$$\mathbf{K}\_{\rm GIP}^{pro}(p\_j, p\_s) = \exp(-\sigma\_{pro} \left\| \mathbf{F}\_{p\_j} - \mathbf{F}\_{p\_s} \right\|^2) \tag{2b}$$

where **F**<sub>l<sub>i</sub></sub>, **F**<sub>l<sub>k</sub></sub> and **F**<sub>p<sub>j</sub></sub>, **F**<sub>p<sub>s</sub></sub> are the interaction profiles of lncRNAs l<sub>i</sub>, l<sub>k</sub> and proteins p<sub>j</sub>, p<sub>s</sub>, respectively. The Gaussian kernel bandwidths σ<sub>lnc</sub> and σ<sub>pro</sub> are set to 1 in the experiments. In practice, when employing 5-fold CV and LOOCV, the GIP kernel similarity should be recalculated each time from the training samples.
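Equations (2a) and (2b) can be sketched in a few lines, assuming that lncRNA profiles are the rows of **F**<sub>train</sub> and protein profiles are its columns. The function name `gip_kernel` and the toy matrix are illustrative only.

```python
import numpy as np

def gip_kernel(profiles, sigma=1.0):
    """K[a, b] = exp(-sigma * ||profiles[a] - profiles[b]||^2)."""
    P = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(P * P, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sigma * sq_dists)

# Toy training matrix: 3 lncRNAs (rows) x 3 proteins (columns).
F_train = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0]])
K_gip_lnc = gip_kernel(F_train)    # Equation (2a), sigma_lnc = 1
K_gip_pro = gip_kernel(F_train.T)  # Equation (2b), sigma_pro = 1
```

Two molecules with identical interaction profiles get similarity 1, and the similarity decays with the number of proteins (or lncRNAs) on which their profiles differ.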

### 2.2.2. Sequence Similarity Kernel

A sequence S of length d is an ordered list of characters, which can be written as S = c<sub>1</sub>c<sub>2</sub> ⋯ c<sub>h</sub> ⋯ c<sub>d</sub> (1 ≤ h ≤ d). Inspired by state-of-the-art methods (Yamanishi et al., 2008; Nascimento et al., 2016), we use the normalized Smith-Waterman (SW) score (Smith and Waterman, 1981) to measure sequence similarity. The formulations are as follows:

$$\mathbf{K}\_{SW}^{\rm lnc}(l\_i, l\_k) = SW(\mathbf{S}\_{l\_i}, \mathbf{S}\_{l\_k}) / \sqrt{SW(\mathbf{S}\_{l\_i}, \mathbf{S}\_{l\_i})SW(\mathbf{S}\_{l\_k}, \mathbf{S}\_{l\_k})} \tag{3a}$$

$$\mathbf{K}\_{SW}^{pro}(p\_j, p\_s) = SW(\mathbf{S}\_{p\_j}, \mathbf{S}\_{p\_s}) / \sqrt{SW(\mathbf{S}\_{p\_j}, \mathbf{S}\_{p\_j})SW(\mathbf{S}\_{p\_s}, \mathbf{S}\_{p\_s})} \tag{3b}$$

where SW(·, ·) stands for the Smith-Waterman score; **S**<sub>l<sub>i</sub></sub> and **S**<sub>l<sub>k</sub></sub> are the sequences of lncRNAs l<sub>i</sub> and l<sub>k</sub>; and **S**<sub>p<sub>j</sub></sub> and **S**<sub>p<sub>s</sub></sub> denote the sequences of proteins p<sub>j</sub> and p<sub>s</sub>.
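The normalization in Equations (3a) and (3b) can be illustrated with a small Smith-Waterman implementation. The scoring scheme (match 2, mismatch -1, linear gap -1) and the toy sequences are assumptions made for this sketch; the text does not state which scoring parameters or substitution matrices were used. Sequences are assumed to be non-empty so that the self-alignment scores are positive.

```python
import math

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def normalized_sw_kernel(seqs, **scoring):
    """K[i][j] = SW(s_i, s_j) / sqrt(SW(s_i, s_i) * SW(s_j, s_j))."""
    self_scores = [smith_waterman(s, s, **scoring) for s in seqs]
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            raw = smith_waterman(seqs[i], seqs[j], **scoring)
            K[i][j] = K[j][i] = raw / math.sqrt(self_scores[i] * self_scores[j])
    return K

# Toy nucleotide sequences, used only for illustration.
K_sw = normalized_sw_kernel(["ACGT", "ACGA", "TTTT"])
```

Dividing by the geometric mean of the two self-alignment scores puts 1 on the diagonal, so long sequences do not dominate the similarity simply because they can accumulate larger raw scores.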

### 2.2.3. Sequence Feature Kernel

We obtain the sequence feature kernels by extracting features from the sequences of lncRNAs and proteins. In practice, Conjoint Triad (CT) (Shen et al., 2007) and Pseudo Position-Specific Score Matrix (Pse-PSSM) (Chou and Shen, 2007) are adopted to describe lncRNA and protein sequences, respectively. Both Sequence Feature (SF) kernels, **K**<sup>lnc</sup><sub>SF</sub> and **K**<sup>pro</sup><sub>SF</sub>, are constructed with a Radial Basis Function (RBF) kernel with bandwidth equal to 1.

### 2.2.4. lncRNA Expression Kernel

It is of interest to identify genes with concordant behaviors, because different genes generally show different behaviors (Lai et al., 2017). The expression profiles of lncRNAs cover 24 cell types and come from the NONCODE database (Xie et al., 2014). After expressing each lncRNA as a 24-dimensional expression profile vector, the lncRNA expression kernel **K**<sup>lnc</sup><sub>EXP</sub> is generated with the RBF kernel, and the kernel bandwidth is again set to 1.
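A minimal sketch of an RBF kernel over feature vectors is given below. It assumes that "bandwidth equal to 1" means a coefficient of 1 in the exponent, by analogy with Equations (2a) and (2b); the function name `rbf_kernel`, the parameter name `gamma`, and the random toy profiles are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K[a, b] = exp(-gamma * ||X[a] - X[b]||^2) for row vectors of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences, shape (n, n, d)
    sq_dists = np.sum(diff ** 2, axis=-1)  # squared Euclidean distances
    return np.exp(-gamma * sq_dists)

# Toy data: 4 lncRNAs, each with a 24-dimensional expression profile.
rng = np.random.default_rng(0)
expression = rng.random((4, 24))
K_exp = rbf_kernel(expression)
```

With a fixed coefficient, the scale of the features controls how quickly the similarity decays, so whether and how the profiles are normalized beforehand is a choice this sketch leaves open.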

### 2.2.5. GO Kernel

Following earlier research (Zheng et al., 2012), proteins with similar Gene Ontology (GO) annotations are expected to act in similar biological processes, to reside in similar cell compartments, or to have similar molecular functions. Therefore, GO annotations are employed in this paper to generate a similarity matrix in protein space. The GO term files were downloaded from the GOA database (Wan et al., 2013).

Semantic similarity is based on the overlap of the terms associated with two proteins (Wu et al., 2013). The Jaccard value that we use to measure the semantic similarity of the GO term sets t<sub>j</sub> and t<sub>s</sub> associated with proteins p<sub>j</sub> and p<sub>s</sub> is defined as follows:

$$Jaccard(t\_j, t\_s) = \frac{|t\_j \cap t\_s|}{|t\_j \cup t\_s|} \tag{4}$$

where t<sub>j</sub> ∩ t<sub>s</sub> denotes the terms common to p<sub>j</sub> and p<sub>s</sub>, and t<sub>j</sub> ∪ t<sub>s</sub> denotes all terms of p<sub>j</sub> and p<sub>s</sub>. However, no formal definition of the GO common terms t<sub>j</sub> ∩ t<sub>s</sub> has been given before.

We define two sequences S<sub>1</sub> and S<sub>2</sub> as having common GO terms only if they are completely identical. For example, given three sequences S<sub>1</sub> = ⟨3, 1, 5⟩, S<sub>2</sub> = ⟨3, 2, 5⟩, and S<sub>3</sub> = ⟨3, 2, 5⟩, if we only required that all corresponding positions of the sequences hold non-zero values, then all three sequences would have common terms. Under our definition, however, S<sub>2</sub> has common terms with S<sub>3</sub> but not with S<sub>1</sub>, because the second elements of S<sub>1</sub> and S<sub>2</sub> differ. Thus, we obtain a sparser GO similarity matrix **K**<sup>pro</sup><sub>GO</sub>, which facilitates the computation.
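The Jaccard value of Equation (4) and the strict matching rule from the example above can be sketched in Python as follows. The term labels in the usage note are placeholders rather than real GO identifiers, and both function names are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard value |A ∩ B| / |A ∪ B| of two term collections (Equation 4)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def have_common_terms(seq_a, seq_b):
    """Strict rule from the text: two sequences share common terms
    only when they are identical position by position."""
    return tuple(seq_a) == tuple(seq_b)
```

For instance, `jaccard({"t1", "t2", "t3"}, {"t2", "t3", "t4"})` shares two of four distinct terms, and `have_common_terms((3, 1, 5), (3, 2, 5))` is false because the second elements differ.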

### 2.3. Fast Kernel Learning

In MKL, we need to find an optimal weight vector **w**, i.e., an optimal weighting strategy, so that the object similarity matrices can be combined appropriately. Concretely, the weight vectors for the lncRNA kernels and the protein kernels are denoted **w**<sup>lnc</sup> and **w**<sup>pro</sup>, respectively. As described above, there are four kernels in lncRNA space (**K**<sup>lnc</sup><sub>GIP</sub>, **K**<sup>lnc</sup><sub>SW</sub>, **K**<sup>lnc</sup><sub>SF</sub>, and **K**<sup>lnc</sup><sub>EXP</sub>) and four kernels in protein space (**K**<sup>pro</sup><sub>GIP</sub>, **K**<sup>pro</sup><sub>SW</sub>, **K**<sup>pro</sup><sub>SF</sub>, and **K**<sup>pro</sup><sub>GO</sub>). The optimal lncRNA and protein kernels are given as follows:

$$\mathbf{K}\_{lnc} = \sum\_{a=1}^{4} w\_a^{lnc} \mathbf{K}\_a^{lnc}, \quad \mathbf{K}\_a^{lnc} \in \mathfrak{R}^{m \times m} \tag{5a}$$

$$\mathbf{K}\_{pro} = \sum\_{a=1}^{4} w\_{a}^{pro} \mathbf{K}\_{a}^{pro}, \quad \mathbf{K}\_{a}^{pro} \in \mathfrak{R}^{n \times n} \tag{5b}$$

where w<sup>lnc</sup><sub>a</sub> and w<sup>pro</sup><sub>a</sub> denote the elements of **w**<sup>lnc</sup> and **w**<sup>pro</sup>; **K**<sup>lnc</sup><sub>a</sub> and **K**<sup>pro</sup><sub>a</sub> are the corresponding normalized similarity matrices among the heterogeneous similarity kernels in lncRNA and protein spaces.

According to the description of Fast Kernel Learning (FastKL) (He et al., 2008), **w** stands for the required optimal solution **w**<sup>lnc</sup> or **w**<sup>pro</sup>, and **K** denotes the kernel matrix **K**<sub>lnc</sub> or **K**<sub>pro</sub>. FastKL not only minimizes the distance between **K** and **Y**, where **Y** = **yy**<sup>T</sup> and **y** is the matrix of all training set labels, but also includes the regularization term ‖**w**‖<sup>2</sup> to prevent overfitting. To this end, **w** is obtained from Equation (6):

$$\begin{aligned} \min\_{\mathbf{w}, \mathbf{K}} \quad & \|\mathbf{K} - \mathbf{Y}\|\_F^2 + \lambda \|\mathbf{w}\|^2 \\ \text{s.t.} \quad & \sum\_{a}^{f} w\_a = 1 \end{aligned} \tag{6}$$

where ‖·‖<sub>F</sub> denotes the Frobenius norm, f is the number of kernels (f = 4 here), and λ is the tradeoff parameter. In practice, we set λ to 10000 when selecting the optimal parameter value.

To simplify Equation (6), recall that the squared Frobenius norm of a matrix equals the trace of the product of the matrix and its transpose, i.e., ‖**X**‖<sup>2</sup><sub>F</sub> = tr(**XX**<sup>T</sup>). The objective function with respect to the optimal solution **w** can then be simplified as follows:

$$\begin{aligned} \min\_{\mathbf{w}} \quad & \mathbf{w}^{\mathrm{T}} (\mathbf{A} + \lambda \mathbf{I}) \mathbf{w} - 2\mathbf{b}^{\mathrm{T}} \mathbf{w} \\ \text{s.t.} \quad & \sum\_{a}^{f} w\_{a} = 1 \\ & A\_{u,v} = \operatorname{tr}(\mathbf{K}\_{u}^{\mathrm{T}} \mathbf{K}\_{v}) \\ & b\_{v} = \operatorname{tr}(\mathbf{Y}^{\mathrm{T}} \mathbf{K}\_{v}) \end{aligned} \tag{7}$$

where tr(·) is the trace operator; A<sub>u,v</sub> is an element of matrix **A**; **K**<sub>u</sub> and **K**<sub>v</sub> denote two different kernel matrices.
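The step from Equation (6) to Equation (7) can be spelled out as follows. This is a sketch that assumes, consistently with Equations (5a) and (5b), that **K** = Σ<sub>a</sub> w<sub>a</sub>**K**<sub>a</sub>; the term tr(**Y**<sup>T</sup>**Y**) does not depend on **w** and is absorbed into a constant that can be dropped from the minimization.

$$\begin{aligned} \|\mathbf{K} - \mathbf{Y}\|\_F^2 + \lambda \|\mathbf{w}\|^2 &= \operatorname{tr}(\mathbf{K}^{\mathrm{T}}\mathbf{K}) - 2\operatorname{tr}(\mathbf{Y}^{\mathrm{T}}\mathbf{K}) + \operatorname{tr}(\mathbf{Y}^{\mathrm{T}}\mathbf{Y}) + \lambda \mathbf{w}^{\mathrm{T}}\mathbf{w} \\ &= \sum\_{u}\sum\_{v} w\_u w\_v \operatorname{tr}(\mathbf{K}\_u^{\mathrm{T}}\mathbf{K}\_v) - 2\sum\_{v} w\_v \operatorname{tr}(\mathbf{Y}^{\mathrm{T}}\mathbf{K}\_v) + \lambda \mathbf{w}^{\mathrm{T}}\mathbf{w} + \text{const} \\ &= \mathbf{w}^{\mathrm{T}}(\mathbf{A} + \lambda \mathbf{I})\mathbf{w} - 2\mathbf{b}^{\mathrm{T}}\mathbf{w} + \text{const} \end{aligned}$$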

In summary, once the final **w**<sup>lnc</sup> and **w**<sup>pro</sup> are obtained, MKL fuses all the similarity matrices, which generates the input matrices of KRR.
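One way to solve the equality-constrained quadratic problem of Equation (7) is through its Lagrangian, which gives **w** = **M**<sup>-1</sup>(**b** + μ**1**) with **M** = **A** + λ**I** and μ chosen so that the weights sum to 1. The NumPy sketch below follows that route. It is an illustration under stated assumptions, not necessarily the solver used by the authors: it enforces only the sum-to-one constraint written in Equation (7) (no non-negativity), and the function names are hypothetical.

```python
import numpy as np


def fastkl_weights(kernels, Y, lam=10000.0):
    """Minimize w^T (A + lam*I) w - 2 b^T w subject to sum(w) = 1."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    Y = np.asarray(Y, dtype=float)
    f = len(kernels)
    # tr(K_u^T K_v) and tr(Y^T K_v) equal sums of elementwise products.
    A = np.array([[np.sum(Ku * Kv) for Kv in kernels] for Ku in kernels])
    b = np.array([np.sum(Y * Kv) for Kv in kernels])
    M = A + lam * np.eye(f)
    ones = np.ones(f)
    Minv_b = np.linalg.solve(M, b)
    Minv_1 = np.linalg.solve(M, ones)
    # Lagrange multiplier that makes the weights sum to one.
    mu = (1.0 - ones @ Minv_b) / (ones @ Minv_1)
    return Minv_b + mu * Minv_1


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, as in Equations (5a) and (5b)."""
    return sum(w * np.asarray(K, dtype=float) for w, K in zip(weights, kernels))
```

With a large λ the regularizer dominates and the weights move toward a uniform split; with a small λ they follow the alignment between each kernel and **Y**.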

### 2.4. Kernel Ridge Regression

Stock et al. developed a pairwise learning scheme called Kernel Ridge Regression (KRR) (Stock et al., 2016), which can be applied to binary classification. The basic idea of KRR is to minimize a suitable objective function with an L2-complexity penalty so that it fits the labeled dyads as closely as possible. Specifically, the KRR prediction for the LPI pair ⟨l<sub>i</sub>, p<sub>j</sub>⟩ has two steps, which are shown in **Figure 2**.

In the first step, an intermediate prediction for lncRNA l<sub>i</sub> against all n proteins is obtained as a 1×n vector **f**<sub>i,·</sub>, computed as follows:

$$\mathbf{f}\_{i,\cdot} = \mathbf{k}\_{lnc}^{\mathrm{T}} (\mathbf{K}\_{lnc} + \lambda\_l \mathbf{I})^{-1} \mathbf{F} \tag{8}$$

where **k**<sub>lnc</sub> denotes the vector of lncRNA kernel evaluations between the lncRNAs in the training set and the lncRNA l<sub>i</sub> in the test set, and λ<sub>l</sub> is the regularization parameter.

In the second step, each element f<sup>∗</sup><sub>i,j</sub> of the prediction matrix **F**<sup>∗</sup> is obtained by using another regularization parameter λ<sub>p</sub>, as in Equation (9):

$$f\_{i,j}^{\*} = \mathbf{k}\_{pro}^{\mathrm{T}} (\mathbf{K}\_{pro} + \lambda\_p \mathbf{I})^{-1} \mathbf{f}\_{i,\cdot}^{\mathrm{T}} \tag{9}$$

where **k**<sub>pro</sub> is the analogous vector of protein kernel evaluations for protein p<sub>j</sub>.

Considering the optimal lncRNA and protein kernels **K**<sub>lnc</sub> and **K**<sub>pro</sub>, the general objective function of the two-step KRR is defined as follows:

$$\min\_{\mathbf{F}^\*} \sum\_{(i,j,\ell)\in\mathbf{F}} (f\_{i,j} - f\_{i,j}^\*)^2 + \operatorname{vec}(\mathbf{F}^\*)^{\mathrm{T}} \boldsymbol{\Xi}^{-1} \operatorname{vec}(\mathbf{F}^\*) \tag{10}$$

where vec(·) is the vectorization operator, which rearranges the matrix elements into a single row; **F**<sup>∗</sup> denotes the prediction of the original matrix **F**, which is estimated by LPI-KRR. The objective function in Equation (10) is minimized iteratively, and the procedure usually converges in about 5–10 iterations.

The kernel matrix **Ξ** used in Equation (10) is defined in Equation (11):

$$\boldsymbol{\Xi} = \mathbf{K}\_{pro} \otimes \mathbf{K}\_{lnc} (\lambda\_l \lambda\_p \mathbf{I} \otimes \mathbf{I} + \lambda\_p \mathbf{I} \otimes \mathbf{K}\_{lnc} + \lambda\_l \mathbf{K}\_{pro} \otimes \mathbf{I})^{-1} \tag{11}$$

By using the lncRNA and protein kernels and the two regularization parameters λ<sub>l</sub> and λ<sub>p</sub>, the prediction matrix **F**<sup>∗</sup> can be represented as Equation (12):

$$\mathbf{F}^\* = \mathbf{K}\_{lnc} (\mathbf{K}\_{lnc} + \lambda\_l \mathbf{I})^{-1} \mathbf{F} (\mathbf{K}\_{pro} + \lambda\_p \mathbf{I})^{-1} \mathbf{K}\_{pro} \tag{12}$$
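Equations (8), (9), and (12) can be sketched directly in NumPy. The code below is an illustration, not the authors' MATLAB implementation: it assumes symmetric kernel matrices, uses linear solves instead of explicit inverses, and the function names `two_step_krr` and `predict_pair` are hypothetical.

```python
import numpy as np


def two_step_krr(K_lnc, K_pro, F, lam_l, lam_p):
    """F* = K_lnc (K_lnc + lam_l I)^-1 F (K_pro + lam_p I)^-1 K_pro (Eq. 12)."""
    m, n = F.shape
    step1 = K_lnc @ np.linalg.solve(K_lnc + lam_l * np.eye(m), F)
    return step1 @ np.linalg.solve(K_pro + lam_p * np.eye(n), K_pro)


def predict_pair(k_lnc, k_pro, K_lnc, K_pro, F, lam_l, lam_p):
    """Score one (lncRNA, protein) pair with Equations (8) and (9)."""
    m, n = F.shape
    # Step 1 (Eq. 8): intermediate 1 x n prediction for the query lncRNA.
    f_i = k_lnc @ np.linalg.solve(K_lnc + lam_l * np.eye(m), F)
    # Step 2 (Eq. 9): combine with the protein-side kernel evaluations.
    return k_pro @ np.linalg.solve(K_pro + lam_p * np.eye(n), f_i)
```

When the query lncRNA and protein are themselves training objects, `k_lnc` and `k_pro` are columns of the kernel matrices and `predict_pair` reproduces the corresponding entry of `two_step_krr`.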

The LPI-FKLKRR calculation framework is illustrated in the following Algorithm 1.

**Algorithm 1** Fast Kernel Learning based on Kernel Ridge Regression (LPI-FKLKRR).

**Input:** **K**<sup>lnc</sup><sub>GIP</sub>, **K**<sup>lnc</sup><sub>SW</sub>, **K**<sup>lnc</sup><sub>SF</sub>, **K**<sup>lnc</sup><sub>EXP</sub> ∈ ℝ<sup>m×m</sup>; **K**<sup>pro</sup><sub>GIP</sub>, **K**<sup>pro</sup><sub>SW</sub>, **K**<sup>pro</sup><sub>SF</sub>, **K**<sup>pro</sup><sub>GO</sub> ∈ ℝ<sup>n×n</sup>; **F** ∈ ℝ<sup>m×n</sup>.

**Output:** **F**<sup>∗</sup>.


# 3. RESULTS

This section provides a quantitative evaluation of our approach on a benchmark dataset. We first present the results of 5-fold cross validation and then analyze the performance of each single kernel independently. LPI-FKLKRR is compared not only with the mean weighted model but also with other outstanding methods. Furthermore, we use a case study to evaluate our method in predicting unknown lncRNA-protein interactions. Finally, we compare LPI-FKLKRR with state-of-the-art work on a novel dataset.

### 3.1. Benchmark Dataset

Although a large number of web-based resources exist (Park et al., 2014), available datasets should be carefully selected. We acquired the benchmark dataset from the state-of-the-art work by Zhang et al. (2017). They collected experimentally determined lncRNA-protein interactions involving 1114 lncRNAs and 96 proteins from NPInter V2.0 (Yuan et al., 2014). Non-coding RNA and protein sequence information were gleaned from the NONCODE (Xie et al., 2014) and SUPERFAMILY (Gough et al., 2001) databases, respectively. Zhang et al. also removed lncRNAs and proteins whose expression or sequence information was unavailable in order to reduce the computational burden. The lncRNAs and proteins with only one interaction were removed for the same reason. A dataset with 4158 lncRNA-protein interactions, containing 990 lncRNAs and 27 proteins, was finally collected.

### 3.2. Evaluation Measurements

To gauge the stability of our model, 5-fold Cross Validation (5-fold CV) has been employed. The Area Under the ROC curve (AUC) and the Area Under the Precision-Recall curve (AUPR) have been used to evaluate our approach. We emphasize that AUPR is a more meaningful quality measure than AUC because true lncRNA-protein interactions are sparse.
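For reference, both measures can be computed from a list of binary labels and prediction scores. The pure-Python sketch below computes AUC through the pairwise ranking (Mann-Whitney) formulation and approximates AUPR by average precision; average precision is one common estimator of the area under the PR curve and is an assumption here, since the text does not state which estimator was used. The function names are hypothetical, and both functions expect at least one positive (and, for AUC, one negative) label.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at each positive in the ranking."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

On sparse interaction data, a predictor can reach a high AUC while still ranking many non-interacting pairs above true ones, which average precision penalizes more strongly.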

### 3.3. Experimental Environment

The proposed LPI-FKLKRR algorithm has been implemented using MATLAB as the development and compilation platform. All programs have been validated on a computer with a 3.7 GHz 4-core CPU, 20 GB of memory, and a 64-bit Windows operating system.

### 3.4. Parameter Optimization

A grid search has been adopted to optimize the parameters λ<sub>l</sub> and λ<sub>p</sub>. The range of λ<sub>l</sub> is from 20 to 980, while λ<sub>p</sub> ranges from 2 to 27. The criteria used to select the optimal values of λ<sub>l</sub> and λ<sub>p</sub> were the highest AUPR value and the lowest values of λ<sub>l</sub> and λ<sub>p</sub>, because smaller values of λ<sub>l</sub> and λ<sub>p</sub> shorten the running time of the algorithm.


*Bold values represent the best value in columns.*

FIGURE 3 | The ROC and PR curve of different models.

We have found that λ<sup>l</sup> = 20.89 and λ<sup>p</sup> = 0.02 are the best values for the two parameters (AUPR: 0.6950).
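The selection rule described in this section can be sketched as a plain grid search in Python. Here `evaluate` is a hypothetical callback that returns the cross-validated AUPR for a given (λ<sub>l</sub>, λ<sub>p</sub>) pair, and breaking ties by the smaller λ<sub>l</sub> and then the smaller λ<sub>p</sub> is an assumed reading of the "lowest values" criterion.

```python
from itertools import product


def grid_search(lam_l_values, lam_p_values, evaluate):
    """Return (lam_l, lam_p, aupr) with the highest AUPR.

    Ties are broken in favor of smaller lam_l, then smaller lam_p
    (an assumed interpretation of the selection criteria).
    """
    best_key, best = None, None
    for lam_l, lam_p in product(lam_l_values, lam_p_values):
        score = evaluate(lam_l, lam_p)
        key = (-score, lam_l, lam_p)  # higher AUPR first, then smaller lambdas
        if best_key is None or key < best_key:
            best_key, best = key, (lam_l, lam_p, score)
    return best
```

The candidate values passed in are left to the caller; the sketch does not assume any particular grid spacing.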

### 3.5. Performance Analysis

After testing the different kernels on the benchmark dataset, we find that the AUPRs of the GIP kernel, sequence feature kernel, sequence similarity kernel, and gene expression & protein GO kernel are 0.6429, 0.4885, 0.5024, and 0.2663, respectively. The detailed results are listed in **Table 1**. The GIP kernel has the highest AUPR value among the single kernels. Multiple kernels with the FastKL weighted model achieve an AUPR of 0.6950, which is an outstanding performance. In **Figure 3**, we can see that FastKL performs better than the other models. It is clear that FastKL is effective in improving the performance of LPI prediction.

In addition, **Figure 4** shows the weight of each kernel in the lncRNA space and the protein space in a 5-fold CV experiment. Notably, the GIP kernel receives the largest weights in the lncRNA space, whereas the four protein similarity matrices share the weights equally in the protein space. A likely explanation is that the four kinds of protein similarity overlap little in the representation space, i.e., each kind of protein similarity captures a specific aspect of protein features.

### 3.6. Comparing to Existing Predictors

The comparison between our approach and other existing methods is shown in **Table 2**. Our proposed approach achieves the highest AUPR, 0.6950, which is superior to all the others. The AUPR values of the other established methods are as follows: integrated LPLNP (AUPR: 0.4584) (Zhang et al., 2017), RWR (AUPR: 0.2827) (Gan, 2014), CF (AUPR: 0.2357) (Sarwar et al., 2001), LPIHN (AUPR: 0.2299) (Li et al., 2015), and LPBNI (AUPR: 0.3302) (Ge et al., 2016). There are two well-founded reasons for the improved performance of our method. First, FastKL effectively combines multivariate information by employing multiple kernel learning. Second, LPI-KRR is an effective prediction algorithm that employs two-step KRR to fuse the lncRNA and protein feature spaces. Because extrapolation is difficult on imbalanced datasets, the PR curve is more effective than the ROC curve on highly imbalanced datasets. Our method nevertheless obtains a competitive AUC value compared to the state-of-the-art algorithms. From all the above we conclude that our approach can be a useful tool for LPI prediction.

\**Results are derived from Zhang et al. (2017). Bold values represent the best value in columns.*

### 3.7. Case Study

We have also used local Leave-One-Out Cross-Validation (LOOCV) to evaluate the predictive performance. Local LOOCV masks the relationships between one protein and all lncRNAs. Our model is trained on the rest of the known information, whether interacting or not, and is tested on the masked relationships. For a protein that does not appear in training, our approach can predict the strength of its interactions with all 990 lncRNAs in the experiment. We rank these interaction values in descending order, since a high rank indicates a high probability of interaction. In **Figure 5**, we can see the performance of single kernels, average weighted kernels, and weighted kernels with FastKL. The FastKL weighted model using multiple kernels achieves the best performance, with values of 0.5506 and 0.7937 for the AUPR and the AUC, respectively. The detailed results are listed in **Table 3**.

TABLE 3 | The AUPR and AUC of different kernels by local LOOCV on benchmark dataset.

*Bold values represent the best value in columns.*
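The masking step of local LOOCV can be sketched in NumPy as follows. This is a simplified illustration: it treats the kernel matrices as fixed inputs (a kernel derived from the interaction matrix itself would need to be recomputed on the masked matrix), it reuses the closed form of Equation (12), and the function names and default λ values are hypothetical.

```python
import numpy as np


def local_loocv_scores(K_lnc, K_pro, F, j, lam_l=1.0, lam_p=1.0):
    """Predicted scores of protein j against every lncRNA after all known
    interactions of protein j have been hidden from the training matrix."""
    F_masked = np.array(F, dtype=float)
    F_masked[:, j] = 0.0  # mask the relationships of protein j
    m, n = F_masked.shape
    step1 = K_lnc @ np.linalg.solve(K_lnc + lam_l * np.eye(m), F_masked)
    F_star = step1 @ np.linalg.solve(K_pro + lam_p * np.eye(n), K_pro)
    return F_star[:, j]


def rank_lncrnas(scores):
    """lncRNA indices sorted from the highest to the lowest predicted score."""
    return [int(i) for i in np.argsort(-np.asarray(scores), kind="stable")]
```

The ranked indices can then be compared with the hidden column of **F** to count how many top-ranked lncRNAs are known interaction partners.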

As shown in **Table 4**, the top 20 interactions for two cases (proteins ENSP00000309558 and ENSP00000401371) have been predicted by LPI-FKLKRR. Likewise, **Table 5** lists the top 10 interactions predicted by LPI-FKLKRR for two lncRNAs, NONHSAT145960 and NONHSAT031708. We check them against the masked relationships between one protein and all lncRNAs, or one lncRNA and all proteins. Our approach achieves successful identification proportions of 11/20 and 12/20 on the proteins ENSP00000309558 and ENSP00000401371, respectively, and of 6/10 and 6/10 on the lncRNAs NONHSAT145960 and NONHSAT031708.

### 3.8. Speed Comparison on Benchmark Dataset

In practice, running speed also plays an important role in predicting LPIs. State-of-the-art methods such as LPLNP can produce highly accurate predictions, so an overall evaluation of each approach should also consider the Running Time (RT). We therefore compared the RT of LPLNP and LPI-FKLKRR by running the publicly available source code of LPLNP. The results are shown in **Table 6**.

TABLE 4 | Top 20 interactions rank on protein ENSP00000309558 and ENSP00000401371.


TABLE 5 | Top 10 interactions rank on lncRNA NONHSAT145960 and NONHSAT031708.


Although LPLNP and LPI-FKLKRR have competitive AUC values (according to the results shown in **Table 2**), LPI-FKLKRR clearly achieves a better average running time, using only 11.48 s to accomplish the LPI prediction task. This is much faster than the 352.93 s of LPLNP (as shown in **Table 6**). Moreover, the standard deviation also shows that LPI-FKLKRR is both fast and stable. Furthermore, considering the higher AUPR value of LPI-FKLKRR, we can strongly suggest that LPI-FKLKRR can be both a time-saving and useful tool for LPI prediction.

### 3.9. Evaluation on Novel Dataset

To support the results of the benchmark experiments, we have employed another dataset, published by Zheng et al. The novel dataset is larger than the benchmark dataset, as shown in **Table 7**.

Drawn from the same databases as the benchmark dataset, the novel dataset consists of 4467 LPIs, including 1050 unique lncRNAs and 84 unique proteins. We compared LPI-FKLKRR with PPSNs (Zheng et al., 2017) by applying 5-fold CV on the novel dataset, and list the results in **Table 8**. The AUC value of the LPI-FKLKRR algorithm is 0.9669, which is higher than that of PPSNs. Moreover, the AUPR value of 0.7062 on the novel dataset demonstrates the robustness of LPI-FKLKRR on an imbalanced dataset.

Apart from the baseline methods tested in **Figure 2**, we make a new comparison on the dataset proposed by Zheng et al. with methods including NRLMF and CF. NRLMF, which is also capable of integrating various data sources, achieved good performance for both MDA prediction (Yan et al., 2017; He et al., 2018) and DTI prediction (Liu Y. et al., 2016). The CF method proposed by Sarwar et al. is another state-of-the-art approach. From **Table 8**, we notice that in terms of both AUPR and AUC, the values of LPI-FKLKRR are higher than those of NRLMF (AUPR: 0.4010, AUC: 0.8287) and CF (AUPR: 0.4267, AUC: 0.8103).

Both 5-fold CV and local LOOCV were also performed in the novel dataset experiment. After testing the different kernels on the novel dataset, we find that in 5-fold CV, the AUPRs of the GIP kernel, sequence feature kernel, sequence similarity kernel, and gene expression & protein GO kernel are 0.6812, 0.4819, 0.4846, and 0.2379, respectively. Multiple kernels with the FastKL weighted model achieve an AUPR of 0.7076, which is an outstanding performance. In **Figures 6, 7**, we can see that FastKL performs better than the other models. This result is consistent with the outcome on the benchmark dataset.

TABLE 6 | Comparison of running time between LPI-FKLKRR and LPLNP in 10 times.

*<sup>a</sup>The address of LPLNP is given by Zhang et al. (2017). Bold values represent the best value in columns.*

# CONCLUSIONS AND DISCUSSION

In this paper, we have proposed a novel method for predicting lncRNA-protein interactions that combines Kernel Ridge Regression with a multiple kernel learning approach (LPI-FKLKRR). LPI-FKLKRR employs fast kernel learning to fuse the lncRNA similarity matrices and the protein similarity matrices, respectively. A two-step Kernel Ridge Regression is then adopted to predict the interactions between lncRNAs and proteins. In 5-fold cross validation (5-fold CV) on the benchmark dataset, the proposed LPI-FKLKRR algorithm achieved reliable and promising results (AUPR: 0.6950). Furthermore, LPI-FKLKRR achieves satisfactory prediction performance compared with the state-of-the-art approaches. A comparison on a novel dataset illustrates the stability of our model.

From the viewpoint of classification, the problem setting of lncRNA-protein interaction prediction can be the same as that of miRNA-disease interaction prediction and drug-target interaction prediction (Ezzat et al., 2018). For instance, the CF method proposed by Sarwar et al. has a more recent counterpart named MSCMF, which projects drugs and targets into a common low-rank feature space (Zheng et al., 2013). This method can be transferred to the area of LPI prediction. Ezzat et al. proposed that chemogenomic methods can be categorized into five types: neighborhood models, bipartite local models, network diffusion models, matrix factorization models, and feature-based classification models. Consequently, in the future we will improve the prediction performance by adding information such as available 3D structure data, by constructing more heterogeneous similarity matrices, by changing the weighting strategy, or by drawing on other effective regression models.

TABLE 7 | The information of two datasets in the experiment.


\**The benchmark dataset and the novel dataset come from the paper of Zhang et al. (2017) and Zheng et al. (2017), respectively.*

TABLE 8 | The AUPR and AUC of different methods on novel dataset.


*<sup>a</sup>AUPR is not exploited by Zheng et al. (2017). Bold values represent the best value in columns.*

# DATA AVAILABILITY STATEMENT

The datasets and code for this study can be found at https://github.com/6gbluewind/LPI\_FKLKRR.

# AUTHOR CONTRIBUTIONS

FG, YD, and CS conceived and designed the experiments. CS and YD performed the experiments and analyzed the data. FG and CS wrote the paper. FG and JT supervised the experiments and reviewed the manuscript.

# FUNDING

This work is supported by a grant from the National Natural Science Foundation of China (NSFC 61772362), the Tianjin Research Program of Application Foundation and Advanced Technology (16JCQNJC00200) and National Key R&D Program of China (SQ2018YFC090002, 2017YFC0908400).

# ACKNOWLEDGMENTS

We are grateful to the editors and reviewers, who provided advice and feedback on this analysis.

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Shen, Ding, Tang and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Interplay Between ncRNAs and Cellular Communication: A Proposal for Understanding Cell-Specific Signaling Pathways

#### Santiago Ramón y Cajal1,2,3 \*, Miguel F. Segura<sup>4</sup> and Stefan Hümmer2,3

<sup>1</sup> Department of Pathology, Vall d'Hebron University Hospital, Universitat Autònoma de Barcelona, Barcelona, Spain, <sup>2</sup> Translational Molecular Pathology, Vall d'Hebron Research Institute, Barcelona, Spain, <sup>3</sup> Spanish Biomedical Research Network Centre in Oncology (CIBERONC), Barcelona, Spain, <sup>4</sup> Group of Translational Research in Child and Adolescent Cancer, Vall d'Hebron Research Institute, Barcelona, Spain

Intercellular communication is essential for the development of specialized cells, tissues, and organs and is critical in a variety of diseases including cancer. Current knowledge states that different cell types communicate by ligand–receptor interactions: hormones, growth factors, and cytokines are released into the extracellular space and act on receptors, which are often expressed in a cell-type-specific manner. Non-coding RNAs (ncRNAs) are emerging as newly identified communicating factors in both physiological and pathological states. This class of RNA encompasses microRNAs (miRNAs, well-studied post-transcriptional regulators of gene expression), long non-coding RNAs (lncRNAs) and other ncRNAs. lncRNAs are diverse in length, sequence, and structure (linear or circular), and their functions are described as transcriptional regulation, induction of epigenetic changes and even direct regulation of protein activity. They have also been reported to act as miRNA sponges, interacting with miRNA and modulating its availability to endogenous mRNA targets. Importantly, lncRNAs may have a cell-type-specific expression pattern. In this paper, we propose that lncRNA–miRNA interactions, analogous to receptor–ligand interactions, are responsible for cell-type-specific outcomes. Specific binding of miRNAs to lncRNAs may drive cell-type-specific signaling cascades and modulate biochemical feedback loops that ultimately determine cell identity and response to stress factors.

#### Edited by:

Philipp Kapranov, Huaqiao University, China

#### Reviewed by:

Simona Greco, Policlinico San Donato (IRCCS), Italy Chandrasekhar Kanduri, University of Gothenburg, Sweden

> \*Correspondence: Santiago Ramón y Cajal sramon@vhebron.net

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 04 January 2019 Accepted: 14 March 2019 Published: 02 April 2019

#### Citation:

Ramón y Cajal S, Segura MF and Hümmer S (2019) Interplay Between ncRNAs and Cellular Communication: A Proposal for Understanding Cell-Specific Signaling Pathways. Front. Genet. 10:281. doi: 10.3389/fgene.2019.00281

Keywords: cancer, cellular communication, cell signaling, long non-coding RNA, microRNA, epigenetics

# INTRODUCTION

Cancer is a complex disease and a major cause of death worldwide. Development of neoplastic disease is a multistep process involving the accumulation of numerous molecular changes. These changes impact cellular function within the tumor and its microenvironment, ultimately resulting in the hallmarks of cancer (Hanahan and Weinberg, 2011).

To date, most researchers have aimed to define the molecular mechanisms of tumorigenesis and cancer progression based on the classical gene expression theory – transcription of coding genes followed by protein synthesis. Following this approach, numerous genetic (e.g., mutations and genomic aberrations) and epigenetic alterations have been identified to have an association with carcinogenesis (Lengauer et al., 1998; Kagohara et al., 2018). In addition, alterations in post-transcriptional regulation of gene expression (e.g., splicing), mRNA translation (e.g., miRNAs) and post-translational protein modification (e.g., phosphorylation) have been reported in almost every type of cancer. When different combinations are taken into consideration, the potential alterations are almost infinite. However, studies have mainly been based on around 20,000 protein-coding genes, corresponding to approximately 2% of the whole transcribed genome (Bertone et al., 2004; Carninci et al., 2005); the other transcripts include a large variety of non-coding RNAs (ncRNAs). Continuous generation of RNA sequencing (RNAseq) data shows that ncRNAs are strongly deregulated in pathological processes – particularly in multifactorial diseases like cancer (Cipolla et al., 2018). Hence, current limitations to deciphering the molecular mechanisms of cancer might be due to the fact that the putative implications of a large part of the genome remain undefined.

While ncRNA genes were for years considered an irrelevant part of the genome, there is growing evidence that mammalian cells produce them in their thousands (Bertone et al., 2004; Carninci et al., 2005). Yet, in the absence of experimental verification of their function, most (>95%) of these transcripts are still considered transcriptional noise (Ponjavic et al., 2007). Studies dating back to the early 1990s indicated that certain lncRNAs may have similar functions to common mRNAs (Brannan et al., 1990; Brown et al., 1991), and since then, detailed studies on certain well-characterized non-coding transcripts have provided mechanistic insights. However, the variety in their mode of action, ranging from protein activity regulation to epigenetic control and regulation of other ncRNAs, implies that we are just beginning to understand their importance for a multitude of biochemical and cellular functions. Importantly, the role of these transcripts in certain tumor types is beginning to become apparent, and the lncRNA expression profile has been proposed as a strong prognostic factor (Jiang et al., 2016; Bolha et al., 2017; Ali et al., 2018). It is therefore conceivable that elucidating the function of lncRNAs in normal cells and their deregulation in cancer cells will be one of the next milestones toward a more detailed understanding of the molecular mechanisms of cancer.

In this review, we propose a new role of ncRNAs in cancer. This model, analogous to the well-established ligand–receptor interactions, proposes intercellular communication via ncRNA interactions as a fundamental concept in cancer. This model could provide a hypothetical basis to explain the different types of biochemical feedback in tissues, which in turn could be linked to the differing response to drugs in tumors harboring similar genetic alterations, the different sites of tumor metastases, and the activation of different microRNA profiles depending on the tumor type and location.

# THE NON-CODING TRANSCRIPTOME

RNAs comprise a diverse range of molecules. In addition to the well-characterized RNAs with established functions such as coding genes (mRNAs), protein synthesis (rRNAs and tRNAs), or mRNA splicing (snRNAs), a multitude of additional non-protein-coding RNAs has been described in recent years. According to their size, ncRNAs are categorized as small (<200 bp) or long non-coding RNAs (lncRNAs, >200 bp).

# SMALL NON-CODING RNAS

The group of short ncRNAs consists of microRNAs (miRNAs), small interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), and Piwi-interacting RNAs (piRNAs) (Zamore and Haley, 2005). Of these, miRNAs have been best characterized in terms of their function, regulation and role in multiple human diseases such as cancer. Initiated by the discovery of the first miRNA, lin-4 in Caenorhabditis elegans (Lee et al., 1993; Wightman et al., 1993), large research efforts have led to a detailed characterization of miRNA biogenesis and regulatory functions in recent years (Ambros, 2004; Bartel, 2004; He and Hannon, 2004; Ebert and Sharp, 2012). Binding of miRNAs to specific sites in their target transcripts, called miRNA recognition elements (MRE), results in either transcript degradation or translational inhibition (Lee et al., 1993; Shivdasani, 2006). Currently there are just under 2,000 human high-confidence annotated miRNAs (Kozomara et al., 2018), and it is believed that they collectively regulate at least one third of the genome (Hammond, 2015). Importantly, gene expression profiling studies have demonstrated altered miRNA expression in a wide range of human diseases, including cancer.

# miRNAs AND CANCER

miRNAs have been shown to participate in cancer throughout the various stages: from tumor origin, to immortalization, metastatic steps and interactions with the host tissue (Saliminejad et al., 2018), and they are able to regulate oncogenes and tumor suppressor genes. In the clinical setting, miRNA expression signatures are emerging as important diagnostic and prognostic predictors (Cho, 2007; Kong et al., 2012; Hayes et al., 2014; Drusco and Croce, 2017). Functional studies clearly support a relevant role of certain miRNAs in cancer. However, a remaining challenge is to understand the exact signaling pathways altered by miRNA deregulation. Several factors contribute to this complexity: (1) the 3′ UTR of a particular target gene contains multiple MREs; (2) multiple MREs can act either alone or cooperatively and (3) the same miRNA can regulate different targets (Doench and Sharp, 2004; Grimson et al., 2007).

In addition to cell-type-specific gene expression profiles, deregulation of miRNAs can result in tumor suppressive or oncogenic effects in a context-dependent manner (Zhang et al., 2007). Furthermore, the existence of feedback loops involving certain transcription factors such as c-Myc, which is both a regulator of miRNA expression and a target of miRNAs, adds yet another layer of complexity to the role of miRNAs in cancer (Jackstadt and Hermeking, 2015).

# miRNA TRAFFICKING

While the main mechanism of action is to control mRNA stability or translation in the cytoplasm, miRNAs can be found in unexpected cellular compartments such as the nucleus, mitochondria, and endoplasmic reticulum (Leung, 2015).

They are also found in the extracellular space – this was first described in 2008, when it was proposed that circulating miRNAs may serve as biomarkers of certain cancers (Mitchell et al., 2008). Since then, miRNAs have been discovered in various extracellular environments including blood (Weber et al., 2010), urine (Gidlof et al., 2011), saliva (Park et al., 2009; Patel et al., 2011), and ascitic fluid (Husted et al., 2011). These findings opened up two new fields of investigation on miRNA. First, as they are easy to detect in body fluids, clinical research has focused on the use of extracellular miRNAs as biomarkers that offer an alternative, non-invasive method for diagnosis and disease monitoring (Duttagupta et al., 2011; Ajit, 2012). Second, because miRNAs are found in the extracellular space, it was proposed that they may act not only in the cells in which they are transcribed but also in neighboring cells. This intriguing idea of horizontal transfer of genetic material subsequently gained major attention. To date, five non-exclusive mechanisms of miRNA release from donor cells have been proposed: (i) miRNA bound to RNA-protein complexes (e.g., in complex with Argonaute) (Arroyo et al., 2011; Turchinovich et al., 2011); (ii) transport via lipid or lipoprotein particles (Rayner and Hennessy, 2013); (iii) vesicles shed directly from the plasma membrane (Hunter et al., 2008; Callis et al., 2009; Shefler et al., 2010; Jaiswal et al., 2012); (iv) vesicles of endosomal origin (exosomes) (Valadi et al., 2007; Skog et al., 2008; Rechavi et al., 2009) and (v) vesicles from apoptotic bodies (Zernecke et al., 2009; Hergenreider et al., 2012). Currently, the coexistence of these different forms of miRNA transport is supported in the literature, but improved biochemical methods and molecular tools with higher temporal and spatial resolution are required to strengthen the evidence (Tkach and Thery, 2016).

Many groups have clearly demonstrated that isolated miRNA-containing fractions (e.g., exosomes) from donor cells are capable of inducing phenotypic alterations in the recipient cells. However, studies on the underlying mechanisms are rare thus far, and basic questions concerning the amount of miRNA (signaling-like or enzymatic function) or the type of miRNA [individual or pool of miRNA(s)] remain unaddressed.

# LONG NON-CODING RNAs

All RNAs longer than 200 nucleotides that are not translated into proteins are collectively categorized as lncRNAs (Carninci et al., 2005; Guttman et al., 2009; Ponting et al., 2009; Cabili et al., 2011; Derrien et al., 2012; Housman and Ulitsky, 2016; Lagarde et al., 2017). Similarly to mRNAs, lncRNAs are transcribed by RNA polymerase II and are often subject to post-transcriptional modifications such as 5′ capping, 3′ polyadenylation and splicing (Quinn and Chang, 2016). Although less well-studied than miRNAs, several thousand lncRNAs have already been described, and thanks to technical advances in RNAseq techniques and computational prediction methods, the total number of lncRNAs identified continues to increase (Quinn and Chang, 2016; Hon et al., 2017). Solely defined by their length, lncRNAs constitute the largest class of ncRNAs in the mammalian genome, and can be further categorized into long intergenic ncRNAs (lincRNAs), antisense RNAs (asRNAs), pseudogenes, and circular RNAs (circRNAs) (Quinn and Chang, 2016; Lagarde et al., 2017). lncRNAs fulfill a variety of functions by interacting with DNA, RNA and proteins, and they may be described according to their mechanism of action toward their interacting molecule as enhancers, decoys, guides or scaffolds (Wang and Chang, 2011; Fok et al., 2017; Kopp and Mendell, 2018).

The first unsupervised clustering analysis of individual transcripts in different tissues revealed that 78% of lncRNAs (in comparison to 19% of mRNAs) were expressed in a tissue-specific manner (Cabili et al., 2011). As sequencing techniques advanced, this specificity of expression has been observed at the individual cell level, and even differential cell-to-cell expression has been observed by Lv et al. (2016); we could confirm this in our own (unpublished) studies in the triple-negative breast cancer cell line MDA-MB-231. In the field of cancer biology, lncRNA expression profiles have recently been proposed as strong prognostic factors and even as therapeutic targets (Chen et al., 2014; Yarmishyn and Kurochkin, 2015).

Despite the growing catalog of lncRNAs, the majority of detected transcription products remain functionally unannotated. Function prediction based on sequence homology, as routinely applied to protein-coding genes, is challenging for lncRNAs (Ulitsky, 2016). Therefore, lncRNA classification requires either specialized computational tools or genome-wide functional studies, as currently performed using loss-of-function techniques (e.g., CRISPRi) (Gilbert et al., 2013; Goyal et al., 2017).

# lncRNAs IN CANCER

Differential expression of lncRNAs has been described in a variety of pathological conditions including cardiovascular, autoimmune, neurodegenerative diseases and particularly in cancer (Walsh et al., 2014; Nguyen and Carninci, 2016; Zhang et al., 2016; Aune et al., 2017; Bhan et al., 2017; Viereck and Thum, 2017; Wan et al., 2017). The first reported lncRNA with an aberrant expression in cancer was prostate cancer associated 3 (PCA3) (Bussemakers et al., 1999), which was identified via differential display analysis of transcripts in normal and cancerous human prostate tissue. PCA3 was the first FDA-approved lncRNA-based biomarker for use in clinical practice. Since then, it has proven a useful, non-invasive test for prostate cancer (Wei et al., 2014). In subsequent years, many other lncRNAs have been identified as having a highly predictive value in the diagnosis of different cancers. Among those, deregulated expression of the lncRNA HOTAIR, originally described in breast cancer, is associated with cancer progression in 26 human tumor types (Gupta et al., 2010; Bhan and Mandal, 2015; Teschendorff et al., 2015). Recently, the application of next-generation sequencing in a variety of different cancer transcriptomes uncovered thousands of lncRNAs with aberrant expression in different cancer types (Huarte, 2015). These numbers increase further when single nucleotide polymorphisms (SNPs) that are associated with cancer are taken into account (Freedman et al., 2011).

Regarding metastasis, some lncRNAs have been associated with more aggressive, metastatic tumors and even with cancer cell colonization to specific organ sites (Li et al., 2018). For example, in colorectal carcinoma, expression of the lncRNA CCAT2 correlated with a higher incidence of liver metastasis (Ling et al., 2013). Another example is the association of elevated levels of HOTAIR with a higher incidence of liver metastasis in gastric cancer (Zhang et al., 2015) and with brain metastasis in non-small cell lung carcinoma (Nakagawa et al., 2013).

Despite the enormous number of lncRNAs described as having aberrant expression in different cancer types, to date, only a few have been functionally characterized (reviewed in Huarte, 2015); existing studies have identified tumor-suppressor and oncogenic functions of lncRNAs (Prensner and Chinnaiyan, 2011; Huarte, 2015). In many cases, the functional role of these lncRNAs has been linked to well-known oncogenic pathways like p53 and c-Myc, or to different steps of classical cancer processes such as epithelial-mesenchymal transition (EMT). Functional analysis will probably expand with the recent application of the CRISPR/Cas9 system, which will provide the tools required to study the function of lncRNAs in genome-wide studies. These unsupervised studies will be crucial to functionally annotate the role of lncRNAs in cancer and shed further light on the regulation of the underlying molecular events (Han et al., 2014; Ho et al., 2015).

# lncRNA AND miRNA INTERPLAY

Different lncRNAs are known to interact with DNA, RNA, and proteins; therefore, the functions of the class of lncRNAs appear to be pleiotropic, ranging from chromatin remodeling to the regulation of transcription, splicing, and translation (Wang and Chang, 2011; Fok et al., 2017; Kopp and Mendell, 2018). A subclass of lncRNA has recently been shown to regulate gene expression in trans by acting as miRNA "sponges" (Karreth et al., 2011; Salmena et al., 2011; Tay et al., 2011; Karreth and Pandolfi, 2013; Su et al., 2013; Yoon et al., 2014; Ulitsky, 2018). These lncRNAs belong to a group of RNAs named competing endogenous RNAs (ceRNAs) (Tay et al., 2014; Thomson and Dinger, 2016; Zhong et al., 2018). While there is no unifying definition of ceRNAs, their function as miRNA "sponges" minimally requires their cytoplasmic localization and the presence of MREs in their sequence. ceRNAs contain MREs for one or multiple miRNAs, and binding is thought to sequester miRNAs and thereby enable translation of endogenous miRNA targets (**Figure 1**).
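The sponge idea can be made concrete with a deliberately simple toy calculation. The sketch below is not a model taken from the cited literature: it assumes that all MREs (on target mRNAs and on the ceRNA) bind the miRNA with equal affinity and that the miRNA pool spreads evenly over every available site. The function name `target_occupancy` and the abundance values are hypothetical and chosen only for illustration.

```python
def target_occupancy(mirna, target_sites, sponge_sites):
    """Fraction of target MREs occupied by miRNA in a toy equal-affinity model.

    Assumption (illustrative only): miRNA molecules distribute evenly over
    all MREs, whether they sit on target mRNAs or on a sponging ceRNA.
    """
    total_sites = target_sites + sponge_sites
    if total_sites <= 0:
        raise ValueError("need at least one binding site")
    # Occupancy is capped at 1 because a site cannot be bound more than once.
    return min(1.0, mirna / total_sites)


# Same miRNA and target abundance in two hypothetical cell types that
# differ only in how many sponge MREs their lncRNAs provide.
low_sponge = target_occupancy(mirna=100, target_sites=100, sponge_sites=0)
high_sponge = target_occupancy(mirna=100, target_sites=100, sponge_sites=300)

print(f"no ceRNA:   {low_sponge:.2f} of target sites occupied")
print(f"with ceRNA: {high_sponge:.2f} of target sites occupied")
```

Under these assumptions, adding sponge sites lowers the occupancy of target MREs, which is the qualitative behavior the ceRNA concept describes. Real affinities, site numbers and subcellular localization would change the numbers.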

In recent years, this concept has been described for the expression of different genes involved in tumor progression (de Giorgio et al., 2013; Cheng et al., 2015; Wang et al., 2016; Song et al., 2017). An example of an oncogenic lncRNA is UCA1, which controls the availability of miR-18a and thereby determines the expression of the oncogene YAP1 (Zhu et al., 2018; **Figure 2A**). In contrast, the tumor suppressor gene PTEN is regulated by the lncRNA CCAT2 by acting as a competing RNA for miR-21 (**Figure 2B**; Xie et al., 2017). In recent years, many more ceRNAs have been described in cancer. We have summarized the known lncRNA/miRNA/mRNA combinations in **Supplementary Table S1**.

(Figure caption fragment) The classical "DNA-RNA-protein" pathway is extended by the functional role of ncRNAs.

In addition to sequestering miRNAs, lncRNAs have also been reported to compete with miRNAs by binding directly to mRNAs (Faghihi et al., 2010). miRNAs have been reported to induce destabilization of lncRNAs, yet some lncRNAs contain miRNA precursors (Cui et al., 2016). This suggests a complex interplay between lncRNAs and miRNAs, which ultimately determines stability and translation of protein-coding mRNAs. Notably, recent studies have revealed that ceRNAs have significant roles in cancer pathogenesis (Cheng et al., 2015; Wang et al., 2016). For example, alterations in the expression of key factors in oncogenic signaling pathways, like BRAF, have been linked to changes in the level of ceRNAs (Ergun and Oztuzcu, 2015; Karreth et al., 2015). We are just starting to understand these complex molecular interactions, their place in functional regulatory networks controlling cellular processes, and their implications in cancer (Liz and Esteller, 2016; Yang et al., 2016; Cao et al., 2017; Chan and Tay, 2018).

# PROPOSAL: TISSUE-SPECIFIC INTERPLAY BETWEEN miRNA AND lncRNA SUPPORTS CONTEXT-DEPENDENT CELL SIGNALING

Currently, the mechanisms involved in cancer origin and progression have been elaborated according to the alterations found in protein-coding genes, barely 2% of the translated genome. Following this approach, most of the described alterations merge into few biochemical routes such as the PI3K/AKT/mTOR or RAS/MAPK pathways. However, the outcome of these alterations is sometimes tissue-specific and cell-context-dependent. It is, for example, still unclear how certain oncogenic mutations progress to tumor formation only in a particular set of tissues. There are clear examples with germline mutations in genes like adenomatous polyposis coli (APC), cadherin 1 (CDH1), BRCA1, von Hippel-Lindau tumor suppressor (VHL), and ataxia telangiectasia mutated (ATM) which are causative for the development of cancer in specific types of tissue (Schneider et al., 2017). Another paradigmatic example is the initial good response to specific inhibitors in melanomas harboring the BRAFV600E mutation, compared to the protumor effect of the same inhibitors in colon adenocarcinomas carrying similar BRAF mutations (Prahallad et al., 2012; Sclafani et al., 2013). Finally, the development of metastases in different types of cancer is often restricted to certain organs (organotropic metastasis), and even clonal subpopulations within the primary tumor display preferences for certain organs (Obenauf and Massague, 2015).

Therefore, in addition to the classical oncogenic signaling pathways (PI3K/AKT/mTOR or RAS/MAPK), additional layers of regulation must exist to explain tissue- and organ-specific processes. We hypothesize that these cell-type-specific outcomes may be caused by interplay between lncRNAs and miRNAs. Building on the concept of ceRNAs, the tissue- and cell-type-specific expression of lncRNAs might be a key mechanism to support tissue-specific regulation of oncogenic signaling pathways. In this sense, the translational profile (i.e., proteome) of miRNA-regulated mRNAs can be controlled by the cell-type-specific expression of lncRNAs. Consequently, the interplay between lncRNAs and miRNAs might affect signaling cascades by regulating the abundance of proteins within these pathways in a cell-type-specific manner (**Figure 3A**).

This core mechanism, acting on the intracellular level, can also be extended to communication between different cells. This is because miRNAs not only act in the cells in which they are transcribed but can also be transferred into different cells (intercellular level). In this regard, the interplay between miRNAs and lncRNAs shows many parallels to ligand-receptor interactions. Ligands like cytokines, hormones and growth factors are released as soluble factors by the donor cell, selectively interact with the target cell receptors and activate signaling cascades which ultimately alter the phenotype of the target cell. miRNAs have been shown to be released into the extracellular space, and lncRNAs are expressed in an organ-, tissue- or cell-type-specific manner. As described above, the abundance of lncRNAs may ultimately determine the effect of miRNAs on the expression of protein-coding genes. Therefore, as occurs in the ligand-receptor model, exposure to the same miRNA may result in cell-type-specific alterations dependent on the expression of lncRNAs (**Figure 3B**).

(Figure caption fragment) (B) Tissue-specific response to secreted miRNAs dependent on the expression profile of lncRNAs.

In summary, the interplay between lncRNAs and miRNAs at an intra- and intercellular level may provide a framework for understanding context-specific phenomena in cancer. In the following section, we will exemplify these putative mechanisms in two concrete cases.

# INTRACELLULAR INTERPLAY BETWEEN lncRNAs AND miRNAs – BRAFV600E IN COLON ADENOCARCINOMA VS MELANOMA

Transduction of a signal from an activated receptor is dependent on the levels of kinases and phosphatases in these pathways and is regulated by positive and negative feedback loops. Alterations in the stoichiometry of factors involved in signaling cascades can be crucial for the ultimate cellular effect. A paradigmatic example for such alterations in the same oncogenic pathway is described for the BRAFV600E mutation. While treatment of BRAFV600E-bearing tumors with BRAF inhibitors elicits a good response in melanomas, a protumor effect has been described in colon adenocarcinomas. It has been proposed that the local feedback in the different cell-signaling pathways could partly explain this unexpected contradictory effect (Schneider et al., 2017). In this respect, miRNAs have been described as key players in fine-tuning the expression of proteins involved in the Ras/Raf signaling pathway (reviewed in Masliah-Planchon et al., 2016). However, the expression of the majority of these miRNAs is not specific for single tissues; therefore, additional layers of regulation must exist to explain the different responses. We propose that the tissue-specific interplay between ncRNAs might partly explain the different sensitivity to BRAF inhibitors in colon adenocarcinoma and melanoma harboring the same oncogenic mutation in BRAF (**Figure 4**).

# INTERCELLULAR INTERPLAY BETWEEN lncRNAs AND miRNAs – ORGANOTROPIC METASTASIS

miRNAs are released into the extracellular space and transferred to target cells. lncRNAs (the "receptors") show organ-, tissue- and cell-type-specific expression. Binding of miRNAs to lncRNAs (akin to a ligand-receptor complex) occurs via a sequence-specific MRE within the lncRNA. Finally, this interaction results in miRNA sequestration and/or degradation of miRNA or lncRNA. As the ligand-receptor model would predict, these interactions should result in the activation of signaling cascades enabling significant alterations in biochemical and cellular functions.

Applying this model to cancer, the mechanisms underlying organotropic metastasis might be explained in part by the interplay between miRNAs and lncRNAs. Even though there is no direct experimental evidence for this hypothesis, there are theoretical possibilities for how this interplay could contribute to the development of site-specific metastasis.

High levels of secreted miRNAs from cancer cells have been reported for almost all types of tumor cells. Upon arrival of a disseminated tumor cell to a distant tissue (e.g., lung, bone, or brain), these miRNAs are first taken up by tissue-specific endothelial cells. Transfer of miRNAs to endothelial cells in turn has been shown to alter the expression of proteins required for maintenance of the endothelial barrier (Zhou et al., 2014; Tominaga et al., 2015). Therefore, the expression profile of competing lncRNAs within endothelial cells might determine if the barrier function can be sustained in the presence of exogenous miRNAs, and metastasis will be favored in organs in which ceRNAs are absent (**Figure 5**).

In addition, aberrant expression of lncRNAs has been reported in different tumor types, and differential cell-to-cell expression has even been observed in certain tumor types (Lv et al., 2016). Therefore, the presence of sponging lncRNAs might alter (a) the miRNA secretome of tumor cells, and (b) the responsiveness to secreted miRNAs from other cells. The latter case in particular might account for clonal populations of tumor cells with site-specific patterns of metastasis (Obenauf and Massague, 2015).

It is tempting to speculate that the interplay between lncRNAs and miRNAs combined with the tissue- or clone-specific expression of lncRNAs might favor the formation of metastatic niches in an organ- or tissue-specific manner. However, future detailed studies will be required to prove this hypothesis.

# FUTURE DIRECTIONS

In our opinion, the continuous accumulation of large amounts of data by RNAseq of whole cancer transcriptomes or extracellular miRNAs might be only partially helpful to move forward. We propose exploring the mechanistic basis in cell-culture-based systems or even model organisms with a reduced complexity. Models derived from these studies could later be used to predict how the extracellular miRNA composition combined with a distinct set of intracellular lncRNAs might impact on disease progression.

In summary, current limitations in our understanding of the molecular mechanisms of cancer might be due to the fact that, until now, only 2% of the genome has been taken into account. Therefore, future studies should aim at expanding our current view of cancer by including the role of ncRNAs in the interpretation of cancer as a multifactorial disease. The proposed model, combining lncRNA-miRNA interactions with intercellular communication, might be particularly helpful in understanding the tissue-specificity of many cancers, hitherto one of the least understood phenomena of cancer.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: https://www.ncbi.nlm.nih.gov/pubmed/.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# FUNDING

SRyC acknowledges support from Fondo de Investigaciones Sanitarias (FIS; PI17/02247 and PI14/01320), Centro de Investigación Biomédica en Red de Cáncer (CIBERONC; CB16/12/00363), and Generalitat de Catalunya (AGAUR; 2017 SGR 1799 and 2014 SGR 1131). MFS acknowledges support from the Instituto de Salud Carlos III (CPII16/00006 and PI17/00564) and the European Regional Development Fund (ERDF).

# ACKNOWLEDGMENTS

The authors thank Trond Aasen for helpful suggestions and discussion on the manuscript. The authors apologize to their colleagues whose work could not be cited in this manuscript due to space limitations.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00281/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ramón y Cajal, Segura and Hümmer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Bovine Embryo-Secreted microRNA-30c Is a Potential Non-invasive Biomarker for Hampered Preimplantation Developmental Competence

Xiaoyuan Lin<sup>1</sup>, Evy Beckers<sup>1</sup>, Séan Mc Cafferty<sup>1</sup>, Yannick Gansemans<sup>2</sup>, Katarzyna Joanna Szymańska<sup>3</sup>, Krishna Chaitanya Pavani<sup>4</sup>, João Portela Catani<sup>1</sup>, Filip Van Nieuwerburgh<sup>2</sup>, Dieter Deforce<sup>2</sup>, Petra De Sutter<sup>5</sup>, Ann Van Soom<sup>4</sup> and Luc Peelman<sup>1</sup>\*

<sup>1</sup> Department of Nutrition, Genetics and Ethology, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium, <sup>2</sup> Department of Pharmaceutics, Faculty of Pharmaceutical Sciences, Ghent University, Ghent, Belgium, <sup>3</sup> Physiology Group, Department of Basic Medical Sciences, Ghent University, Ghent, Belgium, <sup>4</sup> Reproduction, Obstetrics and Herd Health, Ghent University, Merelbeke, Belgium, <sup>5</sup> Department of Uro-Gynaecology, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium

Recently, secreted microRNAs (miRNAs) have received a lot of attention since they may act as autocrine factors. However, how secreted miRNAs influence embryonic development is still poorly understood. We identified 294 miRNAs, 114 known and 180 novel, in the conditioned medium of individually cultured bovine embryos. Of these miRNAs, miR-30c and miR-10b were much more abundant in conditioned medium of slow cleaving embryos compared to intermediate cleaving ones. MiR-10b, miR-novel-44, and miR-novel-45 were more highly expressed in the conditioned medium of degenerate embryos compared to blastocysts, while the reverse was observed for miR-novel-113 and miR-novel-139. Supplementation of miR-30c mimics into the culture medium confirmed the uptake of miR-30c mimics by embryos and resulted in increased cell apoptosis, as also shown after delivery of miR-30c mimics in Madin-Darby bovine kidney cells (MDBKs). We also demonstrated that miR-30c directly targets Cyclin-dependent kinase 12 (CDK12) through its 3′ untranslated region (3′-UTR) and inhibits its expression. Overexpression and downregulation of CDK12 produced effects opposite to those of delivering miR-30c mimics and inhibitor, respectively. The significant down-regulation of several tested DNA damage response (DDR) genes, after increasing miR-30c or reducing CDK12 expression, suggests a possible role for miR-30c in regulating embryo development through DDR pathways.

Keywords: bovine embryos, secreted miRNAs, miR-30c, CDK12, cell cycle, DNA damage response, individual in vitro production

#### Edited by:

Philipp Kapranov, Huaqiao University, China

#### Reviewed by:

Jianmin Su, Northwest A&F University, China Jenna Kropp, University of Wisconsin-Madison, United States

#### \*Correspondence:

Luc Peelman Luc.Peelman@UGent.be

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 05 February 2019 Accepted: 21 March 2019 Published: 05 April 2019

#### Citation:

Lin X, Beckers E, Mc Cafferty S, Gansemans Y, Joanna Szymańska K, Chaitanya Pavani K, Catani JP, Van Nieuwerburgh F, Deforce D, De Sutter P, Van Soom A and Peelman L (2019) Bovine Embryo-Secreted microRNA-30c Is a Potential Non-invasive Biomarker for Hampered Preimplantation Developmental Competence. Front. Genet. 10:315. doi: 10.3389/fgene.2019.00315


# INTRODUCTION

fgene-10-00315 April 4, 2019 Time: 18:10 # 2

Many studies have indicated that the timing of cell division during the early embryonic stages is crucial for normal development and can be used as an indicator of embryo developmental competence (Fenwick et al., 2002; Zernicka-Goetz, 2002, 2006; Plusa et al., 2005; Hiiragi et al., 2006; Terriou et al., 2007). For example, the delay of cell division might be a consequence of chromosomal aberrations and DNA damage (Milewski et al., 2018), and slow cleaving embryos have a higher caspase activity in comparison to fast cleavers (Vandaele et al., 2007). In general, faster cleaving embryos have a significantly higher probability of reaching advanced developmental stages compared to slower cleaving embryos (Van Soom, 1997; Meirelles et al., 2004; Vandaele et al., 2006), while some studies also demonstrated that cleavage divisions that are too fast or too slow are indicative of poor embryo quality (Market Velker et al., 2012; Gutierrez-Adan et al., 2015). During these early embryonic stages, miRNA levels undergo dynamic changes (Mineno et al., 2006; Tang et al., 2007; Yang et al., 2008; Viswanathan et al., 2009; Goossens et al., 2013), indicating their potential role in embryonic development.

With this study we wanted to investigate whether one or more miRNAs have potential as non-invasive biomarkers for preimplantation developmental competence according to cleavage patterns and blastocyst formation. It has been reported that miRNAs are not only localized intracellularly but also secreted via exosomes (Valadi et al., 2007). In addition, miRNAs have been reported to be transferable to other cells, and can be functional in the new location (Valadi et al., 2007; Sohel et al., 2013; Vilella et al., 2015). More specifically, they can be taken up into cells from the extracellular environment, leading to a corresponding endogenous miRNA increase in transfected cells. Recently, secreted miRNA expression was reported to correlate with developmental competence and sexual dimorphism in bovine (Kropp and Khatib, 2015; Gross et al., 2017) and human embryos (Rosenbluth et al., 2014). Although the precise mechanisms of miRNA release in the cellular environment are poorly understood, their selective secretion and high stability (resistance to RNase digestion and other harsh conditions) make miRNAs good candidates for use as biomarkers (Luo et al., 2009; Donker et al., 2012). Potential limitations for their use as biomarkers are their generally low abundance and the high sequence identity among family members.

miR-30 family members are involved in the regulation of p53-induced mitochondrial fission and cell apoptosis (Li et al., 2010). As a member of the miR-30 family, miR-30c has been shown to regulate the cell cycle and proliferation in human and mouse (Li et al., 2012; Quintavalle et al., 2013; Shukla et al., 2015; Liu et al., 2016). One of its potential targets as determined by our study is CDK12 mRNA. CDK12 is a protein kinase involved in transcriptional elongation and mature mRNA synthesis (Bartkowiak et al., 2010; Liang et al., 2015). This kinase has been reported to be crucial for the development of the inner cell mass in mouse embryos (Juan et al., 2016) and, as part of the Cyclin K/CDK12 complex, to maintain genomic stability (Blazek et al., 2011) through regulating DNA damage response (DDR) genes. In this study, we demonstrated that miR-30c is secreted and taken up by bovine embryos and functions as a negative regulator of cell growth by targeting CDK12, indicating that miR-30c can be considered as a promising biomarker for bovine early embryonic development. These findings may provide new insights into understanding the regulatory role of secreted miRNAs in the process of intercellular communication.

# MATERIALS AND METHODS

# In vitro Embryo Production and CM Collection

All animal handling procedures were approved by the Ethical Committee of the Faculty of Veterinary Medicine (EC2013/118) of Ghent University. All methods were performed in accordance with the relevant guidelines and regulations. Bovine blastocysts were produced according to the routine in vitro fertilization (IVF) methods previously used in our lab (Wydooghe et al., 2014a). Briefly, ovaries were collected from the local slaughterhouse and processed within 2 h. The collected ovaries were washed three times in warm physiological saline supplemented with 5 mg/ml kanamycin (GIBCO-BRL Life Technologies, Merelbeke, Belgium). Subsequently, cumulus-oocyte complexes were aspirated from follicles 4 to 8 mm in diameter and cultured in groups of 60 in 500 µl maturation medium containing TCM199 (Life Technologies, Ghent, Belgium) supplemented with 20% heat-inactivated fetal bovine serum (FBS) (Biochrom AG, Berlin, Germany) for 22 h at 38.5°C in 5% CO<sub>2</sub> in air. Frozen-thawed bovine spermatozoa from Holstein bulls were separated through a 45% and 90% Percoll gradient (GE Healthcare Biosciences, Uppsala, Sweden). The final sperm concentration was adjusted to 1 × 10<sup>6</sup> spermatozoa/ml in IVF-Tyrode's albumin-lactate-pyruvate (IVF-TALP), consisting of bicarbonate-buffered Tyrode solution supplemented with 6 mg/ml bovine serum albumin (BSA) (Sigma, Schnelldorf, Germany) and 25 µg/ml heparin. Matured oocytes were washed in 500 µl IVF-TALP medium and were incubated with spermatozoa. After incubation for 21 h, presumed zygotes were vortexed for 3 min to remove spermatozoa and cumulus cells, washed with IVF-TALP and transferred to 20 µl drops of synthetic oviductal fluid supplemented with ITS (5 µg/ml Insulin + 5 µg/ml Transferrin + 5 ng/ml Selenium) and 4 mg/ml BSA. Embryos were cultured individually in drops of 20 µl, covered with mineral oil, at 38.5°C in 5% CO<sub>2</sub>, 5% O<sub>2</sub> and 90% N<sub>2</sub>.

Bovine embryos were divided into groups according to the first cleavage patterns, as described previously (Dinnyes, 1999; Amarnath et al., 2007; Sugimura et al., 2017). Time points of first cleavage [24.2–33.8 h post insemination (hpi)] were recorded and divided into quartiles. The first quartile was considered as "fast," the second and third quartiles were considered as "intermediate" and the last quartile was considered as "slow." More specifically, individual droplets were viewed microscopically at two time points (26.6 and 31.4 hpi), and three groups were produced according to the embryos' cleavage pattern: "fast" (cleavage occurred before 26.6 hpi), "intermediate" (cleavage occurred between 26.6 and 31.4 hpi) and "slow" (cleavage had not occurred yet at 31.4 hpi). Additionally, the developmental competence of each embryo was assessed microscopically at 8 days post insemination (dpi), enabling a division into two subgroups (degenerate embryos and blastocysts). Eventually, the embryos were divided into six groups: FB (fast cleaving blastocyst), IB (intermediate cleaving blastocyst), SB (slow cleaving blastocyst), FD (fast cleaving degenerate), ID (intermediate cleaving degenerate), and SD (slow cleaving degenerate). Conditioned medium of single embryos was collected (17.5 µl each droplet) and pooled for each of the six groups.
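The grouping scheme above can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the function names are hypothetical, and the sketch assumes that the two observation times are the 25% and 75% points of the reported 24.2–33.8 hpi window, which is consistent with the stated values of 26.6 and 31.4 hpi.

```python
FIRST_CLEAVAGE_WINDOW = (24.2, 33.8)  # hpi, range reported in the text


def quarter_cutoffs(start, end):
    """Return the 25% and 75% points of a time window, rounded to 0.1 h."""
    span = end - start
    return (round(start + 0.25 * span, 1), round(start + 0.75 * span, 1))


def embryo_group(cleaved_at_first_check, cleaved_at_second_check, blastocyst_at_8dpi):
    """Assign one of the six group codes (FB, IB, SB, FD, ID, SD).

    The two booleans record whether first cleavage was seen at the early
    (26.6 hpi) and late (31.4 hpi) observation; the third records whether
    the embryo was scored as a blastocyst at 8 dpi.
    """
    if cleaved_at_first_check:
        speed = "F"  # fast: cleaved before the first observation
    elif cleaved_at_second_check:
        speed = "I"  # intermediate: cleaved between the two observations
    else:
        speed = "S"  # slow: not yet cleaved at the second observation
    outcome = "B" if blastocyst_at_8dpi else "D"
    return speed + outcome


print(quarter_cutoffs(*FIRST_CLEAVAGE_WINDOW))
print(embryo_group(False, True, True))
print(embryo_group(False, False, False))
```

Encoding the two microscope checks as booleans mirrors the fact that embryos were viewed at two fixed time points rather than timed continuously.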

# miRNA Extraction

fgene-10-00315 April 4, 2019 Time: 18:10 # 3

At 8 dpi, the CM was collected and miRNA was extracted with the miRNeasy Serum/Plasma kit (Qiagen, Germantown, United States). To meet the minimum concentration requirement for miRNA sequencing, RNA was extracted from CM (three replicates of 3 ml each) and concentrated with the RNeasy MinElute Cleanup kit (Qiagen, Germantown, United States). Finally, the quality and concentration of the RNA samples were examined using an RNA 6000 Pico Chip (Agilent Technologies, Carlsbad, CA, United States) and a Quant-iT RiboGreen RNA Assay kit (Life Technologies, Carlsbad, CA, United States), respectively. The total RNA isolated from CM ranged from 1.982 to 2.448 ng/µl. The FB and FD groups were excluded because the amount of secreted miRNAs required for sequencing could not be obtained from the IVF culture system.

# Small RNA Library Construction and Deep Sequencing

Small RNA library construction was performed with the Tailormix v2 kit (SeqMatic, Fremont, CA, United States). The quality-checked RNA-seq libraries were pooled, and sequencing was performed in triplicate on the Illumina MiSeq (NxtGnt sequencing facility, Gent, Belgium).

# Small RNA-Sequencing Data Analysis and Differential Expression Analysis

Identification of known miRNAs, prediction of putative novel miRNAs and read counting were done using the mirPRo pipeline (Shi et al., 2015). MicroRNA data from miRBase (v21) (Griffiths-Jones et al., 2006) and the annotated cow genome (GCA\_000003055.3) were used as reference. Differential expression between sample groups was statistically tested in R (Ihaka and Gentleman, 1996) with both edgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014) via the SARTools wrapper (Varet et al., 2016). Two comparisons were made after RNA sequencing: IB vs. SB, and (I + S) degenerate embryos vs. (I + S) blastocysts. The results were considered statistically significant when the Benjamini-Hochberg corrected p-value was <0.05.
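To make the significance criterion concrete, the Benjamini-Hochberg adjustment can be sketched as follows. This is a plain-Python illustration of the standard step-up procedure, not the code path used inside edgeR, DESeq2 or SARTools; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each raw p-value is multiplied by m / rank (rank 1 = smallest p-value),
    and monotonicity is enforced by taking a running minimum from the
    largest p-value downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four miRNAs (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```

With few replicates and several hundred tested miRNAs, the factor m / rank grows large, which is why a raw p-value below 0.05 can easily lose significance after correction.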

# Pathway Analysis

The functional analysis of the differentially expressed genes between the groups was performed using DAVID (Huang et al., 2008, 2009) (predicted target genes as input) and miRWalk (Dweep and Gretz, 2015) (miR-30c and miR-10b as input) in terms of enrichment of gene ontologies (GO). In addition, a pathway analysis was performed using the KEGG database to identify the significant pathways affected by the differentially expressed miRNAs. The Benjamini-Hochberg corrected p-values <0.05 were considered statistically significant.

# RT-qPCR

To verify the results of the miRNA sequencing, five mature miRNAs were quantified using RT-qPCR (real-time quantitative PCR). Accordingly, total RNA samples (including miRNAs) isolated from CM (three additional biological replicates of 200 µl each) were reverse transcribed using a miScript II RT kit (Qiagen, Germantown, MD, United States) and subsequently quantified with a miScript SYBR Green Kit containing 10 × miScript Universal Primer (Qiagen, Germantown, MD, United States). U6 (Mondou et al., 2012; Abd El Naby et al., 2013) was quantified to normalize miRNA expression levels.

To check the intracellular expression of the differentially released miRNAs and whether miR-30c is taken up by embryos, miRNAs were quantified using RT-qPCR. Total RNA samples (including miRNAs) were isolated from embryos (three replicates of approximately 5 embryos each) using the miRNeasy Mini kit (Qiagen, Germantown, United States), reverse transcribed using a miScript II RT kit and subsequently quantified with a miScript SYBR Green Kit containing 10 × miScript Universal Primer. U6 was quantified to normalize miRNA expression levels.

Additionally, embryos and MDBKs were used to analyze the mRNA abundance of CDK12 and DDR genes. Total RNA samples were isolated from embryos (three replicates of approximately 5 embryos each) and MDBKs using the RNeasy Micro kit (Qiagen, Germantown, MD, United States) and reverse transcribed using the iScript cDNA synthesis kit (BioRad, Brussels, Belgium). The mRNA levels were quantified with a SsoAdvanced Universal SYBR Green Supermix kit (BioRad, Brussels, Belgium). GAPDH (Herrmann et al., 2013; Li et al., 2016), which proved to be a stable reference gene in our samples (data not shown), was quantified to normalize mRNA expression levels.

All reactions were performed in triplicate, and the 2<sup>−ΔΔCt</sup> method was used to analyze the data. The primer sequences used for RT-qPCR are listed in **Supplementary Table S1**.
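The 2<sup>−ΔΔCt</sup> calculation can be illustrated with a short sketch. The function name, argument layout and Ct values below are hypothetical; the arithmetic is the standard comparative Ct method, with the target normalized to a reference (U6 for miRNAs or GAPDH for mRNAs in this study) and then to a control group.

```python
from statistics import mean


def relative_expression(ct_target_sample, ct_reference_sample,
                        ct_target_control, ct_reference_control):
    """Fold change of a target in a sample relative to a control group.

    Each argument is a list of technical-replicate Ct values.
    dCt  = Ct(target) - Ct(reference), computed per group
    ddCt = dCt(sample) - dCt(control)
    fold = 2 ** (-ddCt)
    """
    delta_ct_sample = mean(ct_target_sample) - mean(ct_reference_sample)
    delta_ct_control = mean(ct_target_control) - mean(ct_reference_control)
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2 ** (-delta_delta_ct)


# Hypothetical triplicate Ct values (illustration only).
fold = relative_expression(
    ct_target_sample=[24.0, 24.0, 24.0],
    ct_reference_sample=[20.0, 20.0, 20.0],
    ct_target_control=[28.0, 28.0, 28.0],
    ct_reference_control=[20.0, 20.0, 20.0],
)
```

In this made-up example the target crosses the threshold four cycles earlier in the sample than in the control at equal reference levels, so `fold` equals 16.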

# miR-30c Mimics Supplementation to Embryos Culture Medium

Since individually cultured embryos are less tolerant than group-cultured embryos (Goovaerts et al., 2009; Wydooghe et al., 2014a,b) and easily die after a change in the culture environment, group culture was used for the miR-30c functional analysis instead of individual culture. The IVF embryos were produced according to the previously described protocol. This time, however, presumed zygotes were vortexed for 3 min after 21 h of incubation, washed with IVF-TALP and transferred to drops of SOF supplemented with ITS, BSA and miR-30c mimics (chemically synthesized, double-stranded RNAs which mimic mature endogenous miRNAs after delivery to cells) or control mimics (chemically synthesized, double-stranded RNAs which have no homology to any known microRNA or mRNA sequences) (Qiagen, Germantown, United States) at a final concentration of 1 µM according to the manufacturer's instructions. Embryos were cultured in groups of 25 in drops of 50 µl, covered with mineral oil, at 38.5°C in 5% CO<sub>2</sub>, 5% O<sub>2</sub>, and 90% N<sub>2</sub>. At 8 dpi, blastocyst rates were calculated. Blastocysts were collected for RT-qPCR or assessed with apoptosis staining.

# TUNEL Staining and Differential Apoptotic Staining

TUNEL staining was performed using a previously described protocol (Ortiz-Escribano et al., 2017) with an in situ cell death detection kit (Sigma, St. Louis, MO, United States). Briefly, ∼20 blastocysts for each group were collected and fixed in 4% paraformaldehyde at room temperature (RT) for 1 h, and then permeabilized in 0.1% Triton X-100 at RT for 10 min. Afterward, blastocysts were stained with 20 µl TUNEL mixture for 1 h at 37°C and subsequently stained with 10 µg/ml DAPI for 10 min. The embryos were mounted on slides and examined using a 20× water immersion objective on a Leica TCS-SP8 X confocal microscope (Leica Microsystems, Wetzlar, Germany). The apoptosis ratio was expressed as the total number of TUNEL-positive cells relative to the total number of cells per blastocyst.

Differential apoptotic staining was performed using previously described protocols (Wydooghe et al., 2011; Lu et al., 2019). On the first day, ∼20 blastocysts for each group were fixed in 4% paraformaldehyde for 1 h and placed in a 4-well dish in permeabilization solution (0.5% Triton X-100 + 0.05% Tween in phosphate buffered saline, PBS) at RT for 1 h. After being washed three times for 2 min in PBS-BSA, the blastocysts were incubated in 2N HCl at RT for 20 min and then in 100 mM Tris–HCl at RT for 10 min. The blastocysts were washed (three times for 2 min) and then placed in 500 µl of blocking solution at 4°C overnight. On the second day, the blastocysts were washed again (three times for 2 min) and incubated in primary CDX-2 antibody (Biogenex, San Ramon, United States) at 4°C overnight. On the third day, the blastocysts were washed twice for 15 min and subsequently incubated in blocking solution containing the rabbit active caspase-3 antibody (Cell Signaling Technology, Leiden, Netherlands) overnight at 4°C. On the fourth day, the blastocysts were incubated in blocking solution containing the goat anti-mouse Texas Red antibody at RT for 1 h and subsequently in blocking solution containing the goat anti-rabbit FITC antibody at RT for 1 h. The blastocysts were washed twice for 15 min and incubated in the dark at RT for 20 min in a 1:200 dilution of Hoechst in PBS-BSA. All slides were examined using a 63× water immersion objective on a Leica TCS-SP8 X confocal microscope. The apoptosis ratio was expressed as the total number of caspase-3-positive cells relative to the total number of cells per blastocyst.

# Plasmid Construction

The full-length coding sequence of CDK12 (4473 bp) (NM\_001205701.1) was amplified from MDBK cDNA and inserted into a pEGFP-N1 vector via the NheI and XhoI sites to construct the CDK12-overexpressing vector. The empty vector (mock) was used as a negative control. The CDK12 3′-UTR (282 bp) containing the predicted miR-30c binding site was amplified from bovine genomic DNA, inserted into a psi-CHECK2 vector (Promega, Madison, United States) via the NotI and XhoI sites and confirmed by sequencing. To test whether the predicted miR-30c target site in the CDK12 3′-UTR is critical for the miR-30c-mediated repression of CDK12 expression, the seed sequence of the predicted miR-30c binding site was mutated (Wu et al., 2017; **Figure 4A**). Primers for vector construction are listed in **Supplementary Table S1**.

# Dual-Luciferase Reporter Assay

The miR-30c mimics/control mimics and luciferase reporter plasmids were co-transfected into HEK293T cells using Lipofectamine 2000 (Invitrogen, Carlsbad, United States). After 24 h of transfection, the Renilla and Firefly luciferase were assayed using the Dual Luciferase Reporter Kit (Promega, Madison, WI, United States).
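A reporter readout of this kind is usually reduced to a single relative activity per construct. The sketch below assumes, as is typical for psi-CHECK2-based assays, that the Renilla signal reports on the cloned 3′-UTR and the firefly signal serves as the internal transfection control; the text does not spell out the normalization, so this choice, the function names and the example readings are assumptions for illustration only.

```python
def normalized_reporter(renilla, firefly):
    """Renilla signal divided by the firefly internal control of the same well."""
    if firefly <= 0:
        raise ValueError("firefly control signal must be positive")
    return renilla / firefly


def relative_luciferase_activity(mimic_well, control_well):
    """Normalized reporter activity with miR-30c mimics relative to control mimics.

    Each well is a (renilla, firefly) tuple of raw luminescence readings.
    A value below 1 indicates repression of the reporter by the mimic.
    """
    return normalized_reporter(*mimic_well) / normalized_reporter(*control_well)


# Hypothetical luminescence readings (illustration only).
wt_activity = relative_luciferase_activity((500.0, 1000.0), (1000.0, 1000.0))
mut_activity = relative_luciferase_activity((980.0, 1000.0), (1000.0, 1000.0))
```

Under these made-up readings the wild-type reporter drops to half of the control level while the mutated reporter stays close to 1, which is the pattern expected for a functional binding site.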

# Cell Culture and Transfection

The HEK293T cells and MDBK cells were cultured at 37°C in 5% CO<sub>2</sub> in DMEM medium (Thermo Fisher Scientific, Waltham, MA, United States) supplemented with 10% FBS (VWR, Radnor, United States), 100 U/ml penicillin and 100 mg/ml streptomycin. miR-30c mimics/inhibitors and their negative controls were delivered into MDBK cells using HiPerFect reagent (Qiagen, Germantown, MD, United States) following the manufacturer's instructions. The short-interfering RNA (siRNA) targeting CDK12 and a non-target control siRNA (si-NTC) were purchased from Qiagen (Germantown, MD, United States). The siRNA or the overexpressing vector was transfected into MDBK cells using Lipofectamine 2000 according to the manufacturer's instructions. Protein and total RNA were extracted for western blotting (WB) and RT-qPCR at 48 and 24 h after transfection, respectively.

# Western Blotting

Cells were collected 48 h after transfection and lysed using radioimmunoprecipitation lysis buffer consisting of 50 mM Tris–HCl (pH 7.5), 150 mM NaCl, 0.1% SDS, 1% NP-40, 0.5% sodium deoxycholate and protease inhibitors. The samples were denatured at 100°C for 10 min before loading onto 10% SDS-polyacrylamide gels. Separated proteins were then transferred onto nitrocellulose membranes and blocked overnight with 5% non-fat milk in PBS with 0.1% Tween-20. Membranes were then incubated overnight with 1/1000 rabbit anti-CDK12 (Novus Biologicals, Abingdon, United Kingdom) and 1/1000 rabbit anti-β-actin (Novus Biologicals, Abingdon, United Kingdom). After three washes, the membranes were incubated with HRP-conjugated goat anti-rabbit IgG (H + L) (Novus Biologicals, Abingdon, United Kingdom) for 2 h at room temperature. Signals were revealed by autoradiography using SuperSignal West Femto Maximum Sensitivity Substrate (Thermo Fisher Scientific, Waltham, United States).

# Cell Cycle Assays: PI Staining and Flow Cytometry


Madin-Darby bovine kidney cells were cultured in 6-well plates for 48 h after transfection and were stained with propidium iodide (PI) at a final concentration of 50 µg/ml PI and 100 µg/ml RNase A in PBS. Then, the cells were analyzed using an Accuri™ C6 flow cytometer (BD, Erembodegem, Belgium), collecting 50,000 events. All experiments were replicated three times.

# Cell Proliferation Assays: WST-1 Colorimetric Assay

WST-1 (4-[3-(4-iodophenyl)-2-(4-nitrophenyl)-2H-5-tetrazolio]-1,3-benzene disulfonate) (Merck, Kenilworth, United States) was used for cell proliferation analysis. The assay was performed using 96-well plates with ∼20,000 cells. At 48 h after transfection, 10 µl of WST-1 was added to 90 µl samples. The samples were measured at a wavelength of 450 nm (with 570 nm as a reference wavelength) using an EZ Read 400 microplate reader (Biochrom, Holliston, United States). Cell viability was then calculated by comparing the absorbance values of the sample groups after background subtraction. All experiments were replicated three times.
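The viability calculation described above can be sketched as follows. The text specifies the measurement and reference wavelengths and a background subtraction, but not the exact formula, so the order of operations (reference-wavelength correction, then blank subtraction, then normalization to the control group), the function names and the absorbance values are assumptions made for this illustration.

```python
def corrected_absorbance(a450, a570, blank_a450, blank_a570):
    """Reference-corrected absorbance of a well minus that of a cell-free blank."""
    return (a450 - a570) - (blank_a450 - blank_a570)


def relative_viability_percent(sample, control, blank):
    """Viability of a treated well as a percentage of the control well.

    sample, control and blank are (A450, A570) tuples.
    """
    sample_signal = corrected_absorbance(*sample, *blank)
    control_signal = corrected_absorbance(*control, *blank)
    if control_signal <= 0:
        raise ValueError("control signal must exceed the blank")
    return 100.0 * sample_signal / control_signal


# Hypothetical plate-reader values (illustration only).
viability = relative_viability_percent(
    sample=(1.0, 0.25), control=(1.5, 0.25), blank=(0.5, 0.25)
)
```

Here the treated well retains half of the control signal, so `viability` is 50.0, which would correspond to a 50% decrease in cell viability.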

# Statistical Analysis

The data are presented as mean ± SD and are derived from at least three independent experiments. The statistical analyses were performed using ANOVA followed by Tukey's test or Student's t test. For each analysis, P < 0.05 was considered significant.

# RESULTS

# Intermediate Cleaving Embryos Result in a Higher Blastocyst Rate Compared to Slow Cleaving Embryos

According to the timing of the first cell division, 1808 individually cultured embryos for each of three replicates were labeled as either fast, intermediate or slow cleaving and evaluated at 8 dpi for developmental competence. Intermediate embryos produced significantly (P = 0.027) more blastocysts than slow embryos (41.16 and 18.7%, respectively; **Figure 1**). No statistically significant difference (P = 0.24) was found between fast and intermediate cleaving embryos (50.65 and 41.16%, respectively; **Figure 1**). The fast group was excluded from sequencing because not enough RNA was obtained due to the low number of embryos belonging to this group.

# miRNAs Secreted by Bovine Embryos

In total, 294 miRNAs were found in conditioned media (CM) after sequencing (microRNA sequencing data are available in the GEO database under the accession number PRJNA492220): 114 known miRNAs and 180 potential novel miRNAs. The uncorrected p-values were indicative of differential secretion from embryos with different cleavage patterns and different developmental competences for the following miRNAs: miR-30c and miR-10b were more abundant in the CM of slow cleaving embryos than in the CM of intermediate cleaving embryos; miR-10b, miR-novel-44, and miR-novel-45 were more abundant in CM from degenerate embryos than in that of blastocysts, while miR-novel-113 and miR-novel-139 were more abundant in the CM of blastocysts than in that of degenerate embryos (**Table 1**). However, given the small sample size, which resulted from the practical difficulty of obtaining enough CM, it was unsurprising that none of the differences remained significant after Benjamini-Hochberg correction for multiple testing. Therefore, the sequencing results for 5 of the 6 above-mentioned miRNAs were verified using RT-qPCR (novel-miR-44 has the same mature sequence as novel-miR-45) (**Figure 2**). RT-qPCR showed that miR-30c and miR-10b had an 18-fold (P = 0.00072) and 30-fold (P = 0.00017) higher expression in the CM from slow cleaving embryos than in that from intermediate cleaving embryos (**Figures 2A,B**). The expression levels of these two miRNAs showed no significant difference between the CM of fast cleaving embryos and that of intermediate cleaving embryos (**Figures 2A,B**). MiR-10b and novel-miR-45 showed a 55-fold (P < 0.00001) and 8-fold (P = 0.0068) higher expression in the CM from degenerate embryos compared to blastocysts (**Figure 2C**). Novel-miR-113 and novel-miR-139 displayed a 14-fold (P = 0.0027) and 22-fold (P = 0.00033) higher expression, respectively, in the CM of blastocysts than in that of degenerate embryos (**Figure 2D**). In addition, miR-30c was found to be 20 times (P = 0.00067) more abundant in CM than in control media (**Figure 2E**).

TABLE 1 | The differentially expressed miRNAs (p < 0.05) in CM content from individually cultured bovine embryos (I, intermediate cleaving; S, slow cleaving).

Intracellular miRNA expression was also assessed using RT-qPCR, and similar results were obtained. miR-30c and miR-10b showed a 13-fold (P = 0.0031) and 21-fold (P = 0.00044) higher expression in slow cleaving embryos than in intermediate cleaving embryos (**Figure 2F**). MiR-10b and novel-miR-45 showed a 37-fold (P = 0.0004) and 5-fold (P = 0.0081) higher expression in degenerate embryos compared to blastocysts (**Figure 2G**). Novel-miR-113 and novel-miR-139 displayed an 18-fold (P = 0.00091) and 7-fold (P = 0.0062) higher expression in blastocysts than in degenerate embryos (**Figure 2H**).

# Pathway Analysis

The GO analysis of the miRNAs differentially expressed between IB and SB revealed that 11 biological processes were over-represented, among which were "in utero embryonic development," "cell cycle," "fibroblast growth factor receptor signaling pathway," and "Notch signaling pathway" (**Figure 3A**). Additionally, 16 KEGG pathways were over-represented, with the p53 signaling pathway, the Wnt signaling pathway, the TGF-beta signaling pathway and apoptosis as top hits (**Figure 3B**). These GO terms and pathways enriched with targets provide an intriguing clue to the biological consequences of the differential secretion of miRNAs from embryos with different cleavage patterns.

# miR-30c Mimics Can Be Taken Up by Bovine Embryos and Increase Embryo Apoptosis

miR-30c has been shown to regulate the cell cycle and proliferation in human breast cancer cells, glioma cells, hematopoietic cells, osteoblast cells and mouse embryonal carcinoma cells (Li et al., 2012; Quintavalle et al., 2013; Shukla et al., 2015; Liu et al., 2016). Combining the above sequencing and RT-qPCR results with this information from the literature, we hypothesized that miR-30c can be taken up by embryos and might influence embryonic development through regulation of the cell cycle. To test this hypothesis, we added miR-30c mimics to the IVF culture medium at 21 hpi, thus allowing the mimics to act on the embryos for 5 to 10 h before they reached the 2-cell stage (26–31 hpi). RT-qPCR results showed that the miR-30c levels were approximately 80 times higher in embryos treated with miR-30c mimics than in the control mimics group (**Figure 4A**), indicating that miR-30c was taken up by the embryos.

No significant difference in blastocyst rate was found between the miR-30c mimics group and the control mimics group (**Figure 4B**). However, TUNEL staining showed that the miR-30c mimics group had an apoptosis rate of 12.86% whereas that of the control mimics group was 5.05% (**Figures 4C,D**). Similarly, differential apoptotic staining showed that the miR-30c mimics group had an apoptosis rate of 11.85% whereas that of the control mimics group was 4.05% (**Figures 4E,F**).

# miR-30c Directly Targets Cell Progression Regulator CDK12

Because different miRNA target prediction methods may produce different results, we adopted the approach of Ozen et al. (2007) and Li et al. (2011). A gene was considered a likely miRNA target if it was identified by at least three of the six algorithms used (TargetScan, miRDB, PicTar, miRanda, miRWalk and TarBase). Of the putative target genes identified in this way, CDK12 (identified by TargetScan, miRDB and miRanda) was chosen for further analysis. This gene was previously shown to be required for the prevention of apoptosis (Bartkowiak et al., 2015; Juan et al., 2016) and to protect cells from genomic instability and inhibit cell differentiation (Blazek et al., 2011; Dai et al., 2012) through the regulation of DDR genes in human and mouse. The 3′-UTR segment of the bovine CDK12 gene containing the putative miR-30c binding site (**Figure 5A**) was amplified and cloned into the luciferase reporter vector psi-CHECK2 and subsequently transfected into HEK293T cells. As shown in **Figure 5B**, the miR-30c mimics markedly suppressed the activity of the wild-type (WT) CDK12 3′-UTR reporter, while the reporter carrying the mutated binding site (MUT) was unaffected. To further confirm the regulatory relationship between miR-30c and CDK12, RT-qPCR and WB were performed to determine the CDK12 mRNA and protein levels in MDBKs. The results showed that CDK12 was suppressed by miR-30c mimics and enhanced by miR-30c inhibitors at the protein level (**Figure 5C**) but not at the mRNA level (**Figure 5D**). The direct target relationship was also analyzed in embryos: miR-30c mimics were added to the embryo culture medium and CDK12 expression was then evaluated using RT-qPCR and WB. As expected, the embryos showed results similar to those of the MDBKs (**Figures 5E,F**). Collectively, these results show that miR-30c directly targets CDK12 and inhibits its translation rather than degrading its mRNA.
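The "at least three of six algorithms" rule used for target selection amounts to a simple vote count. The sketch below is an illustration only: the function name is hypothetical, and apart from CDK12 (reported in the text as predicted by TargetScan, miRDB and miRanda) the gene names and prediction lists are placeholders, not data from this study.

```python
def consensus_targets(predictions, min_tools=3):
    """Return genes predicted by at least `min_tools` algorithms.

    predictions maps an algorithm name to the genes it predicts.
    The result maps each retained gene to the sorted list of supporting tools.
    """
    support = {}
    for tool, genes in predictions.items():
        for gene in set(genes):
            support.setdefault(gene, set()).add(tool)
    return {
        gene: sorted(tools)
        for gene, tools in support.items()
        if len(tools) >= min_tools
    }


# Placeholder prediction lists; only the CDK12 entries reflect the text.
predictions = {
    "TargetScan": ["CDK12", "GENE_A"],
    "miRDB": ["CDK12", "GENE_B"],
    "PicTar": ["GENE_A"],
    "miRanda": ["CDK12"],
    "miRWalk": ["GENE_B"],
    "TarBase": [],
}
likely_targets = consensus_targets(predictions)
```

With these placeholder lists only CDK12 reaches the threshold of three supporting tools, while the two-vote placeholder genes are discarded.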

# miR-30c Overexpression and CDK12 Downregulation Direct Transcription of Key DDR Genes

Given that CDK12 is involved in DNA repair (Paculová et al., 2017) and was shown to be a target gene inhibited by miR-30c in our study, we hypothesized that miR-30c may suppress cell cycle progression by inhibiting DDR pathways. A previous study on mouse embryos showed that four DDR genes, namely Brca1, Fancd2, Fanci, and Atr, had a reduced expression in the absence of CDK12 (Juan et al., 2016). To our knowledge, the relationship among miR-30c, CDK12, and the DDR pathway has not yet been investigated in cattle. Here we examined the expression of these four genes using RT-qPCR after adding miR-30c mimics to the embryo culture medium and after modulating CDK12 expression in MDBKs. As shown in **Figure 6A**, the delivery of miR-30c significantly decreased the mRNA levels of all four investigated DDR genes (BRCA1, FANCD2, FANCI, and ATR) in embryos. As shown in **Figure 6B**, downregulation of CDK12 also significantly decreased the mRNA levels of these four genes. We also examined the expression of the DDR genes after overexpressing CDK12 using the previously mentioned vector construct. As shown in **Figure 6C**, overexpression of CDK12 did not have a significant effect on the mRNA levels of these DDR genes.

# miR-30c Suppresses the Cell Cycle, While CDK12 Promotes the Cell Cycle

Although miR-30c has been shown to regulate cell cycle progression in human and mouse (Quintavalle et al., 2013; Liu et al., 2016), this regulatory relationship is still unclear in bovine cells. Because the compaction of embryos makes it difficult to use them for flow cytometry analysis, further studies were performed using the bovine cell line MDBK. PI staining was used to determine the effect of miR-30c mimics or inhibitors on the MDBK cell cycle. As shown in **Figure 7A**, the cell cycle phase distribution determined by flow cytometry showed an 8% increase of cells in the G1 phase after delivery of miR-30c mimics, indicating suppression of cell growth, while delivery of miR-30c inhibitors resulted in an 8% decrease of cells in the G1 phase.

CDK12 expression was assessed after siRNA or vector transfection. RT-qPCR (**Figure 7C**) and WB (**Figure 7D**) showed that CDK12 expression was indeed upregulated by the vector construct and downregulated by siRNA at the transcriptional level, confirming their suitability for the subsequent experiments. Transfection of the CDK12-overexpressing construct resulted in an 8% decrease of cells in the G1 phase compared with the control group, whereas knockdown of CDK12 using siRNA resulted in a 20% increase of cells in the G1 phase and a 15% decrease in the S phase compared with si-NTC (**Figure 7E**).

# miR-30c Decreases Cell Viability, While CDK12 Increases Cell Viability

The cellular metabolic activity of the MDBKs, which is indicative of cell proliferation, was monitored using the WST-1 assay after addition of miR-30c mimics or inhibitors. As shown in **Figure 7B**, miR-30c mimics led to a significant decrease in cell viability (37%), while miR-30c inhibitors increased cell viability (57%).

As shown in **Figure 7F**, CDK12 overexpression increased cell viability (84%), while CDK12 inhibition led to a 49% decrease in cell viability.

# DISCUSSION

Timing of cleavage is regarded as an important marker to assess embryo quality (Gutierrez-Adan et al., 2015), and it has been shown that rapidly cleaving embryos are of better quality than slower cleaving embryos (Meirelles et al., 2004; Vandaele et al., 2006). Because it has been demonstrated that an embryo's potential is determined more in the early developmental stages than in later developmental stages (Wong et al., 2010; Milewski and Ajduk, 2017; Milewski et al., 2018), we chose the 2-cell stage, instead of the 4-cell stage or the morula stage, to evaluate embryo quality. In our study, fast and intermediate cleaving embryos produced significantly more blastocysts compared to the slow cleaving embryos (50.65 and 41.16% vs. 18.7%), confirming the above theory.

In addition to their intracellular function, secreted miRNAs may play a significant role in intercellular communication (Vickers et al., 2011; Boon and Vickers, 2013; Yang et al., 2018). However, the dynamics of miRNA secretion and their transfer mechanisms are still poorly understood. Secreted miRNAs have been found to be related to cell growth, invasion, migration, dissemination and metastasis, as well as to impairment of the immune system response (Schwarzenbach et al., 2014). Furthermore, they have potential as biomarkers for cancer and benign diseases, which raises the questions of whether and how secreted miRNAs influence embryo development and whether they can be used as non-invasive biomarkers for embryo quality. With the current methods for miRNA detection, the main limitation is the low abundance of miRNAs in CM. However, miRNAs secreted by a single human embryo have been successfully detected and extracted (Capalbo et al., 2016), indicating the potential application for bovine embryos. In our study, to obtain a sufficient amount of miRNAs for sequencing, we concentrated the CM from 167 embryos for each replicate and thus achieved at least 1 million raw reads. The potential of secreted miRNAs as biomarkers relies mainly on their high stability, their capacity to reflect embryo developmental status and their prognostic value in relation to IVF success and pregnancy outcome. Although several recent studies focusing on miRNAs in culture media (Rosenbluth et al., 2014; Kropp and Khatib, 2015) and body fluids, such as follicular fluid (Sohel et al., 2013) and endometrium (Vilella et al., 2015), provide an indication of the developmental competence of embryos, improvements in detection techniques and more knowledge of miRNA signaling are needed before secreted miRNAs can be used as biomarkers in embryonic development. In addition, technical aspects are not the only factor currently limiting the use of secreted miRNAs as biomarkers in culture media and other body fluids: to date, the source of secreted miRNAs is not clear. Therefore, more extensive studies are necessary to clarify whether secreted miRNAs detected in the extracellular environment are the product of dead cells or are secreted in a tissue-specific manner. Furthermore, studies with large sample sizes are needed and some aspects of experimental reliability must be assessed before secreted miRNAs can be used as biomarkers.

Apart from being easy to detect, a biomarker should clearly discriminate the state to be defined, in this case the developmental competence of the embryo. Here, we demonstrated that miRNAs are differentially secreted from bovine embryos with different cleavage patterns and different qualities: miR-30c and miR-10b were differentially expressed between the CM of slow and intermediate cleaving embryos; miR-10b, miR-novel-113, miR-novel-44, miR-novel-45, and miR-novel-139 were differentially expressed between the CM of blastocysts and that of degenerate embryos.

Among the differentially expressed miRNAs, miR-30c was found to be 18 times more abundant in slow cleaving embryos' CM vs. intermediate cleaving embryos' CM. This distinct difference makes it a suitable biomarker candidate for the developmental capacity of bovine early embryos. To gauge the effect of miR-30c uptake by bovine embryos in correlation with the cleavage pattern and the proposed roles of miR-30c in cell proliferation in mouse (Liu et al., 2016), cell apoptosis in human and mouse (Li et al., 2010; Quintavalle et al., 2013; Liu et al., 2016), cell differentiation in human and mouse (Karbiener et al., 2011; Wu et al., 2012) and cell damage in human (Li et al., 2012), apoptosis assays were performed. RT-qPCR results confirmed that miR-30c was indeed taken up by bovine embryos and they showed a higher apoptosis rate, which is in agreement with previous findings that miRNAs could be both released and taken up by embryos (Kropp and Khatib, 2015; Vilella et al., 2015; Gross et al., 2017). The effect was further investigated using the bovine cell line MDBK. The delivery of miR-30c mimics to the MDBKs led to reduced cell proliferation and an arrest at G1 stage, while the delivery of miR-30c inhibitors resulted in the opposite effects, as expected. Previous studies on human embryos suggested that miR-30c can serve as a potential marker of blastocyst implantation potential (Capalbo et al., 2016; Noli et al., 2016). This is not surprising because although miR-30c is highly conserved between different species, it has been shown to act differently among different species. For instance, miR-30c was found to increase cell proliferation in mouse embryonal carcinoma cells (Liu et al., 2016), while it was also found to be a tumor suppressor miRNA in human cancers (Poudel et al., 2013; Shukla et al., 2015).

We also demonstrated for the first time that miR-30c downregulates CDK12 expression at the post-transcriptional level in both bovine embryos and MDBKs. CDK12 is a transcription-associated CDK that exerts control over Pol II-mediated transcription (Ekumi et al., 2015) and is essential for splicing and differentiation (Chilà et al., 2016). Intriguingly, recent research has shown that CDK12 is essential for embryonic development and the maintenance of genomic stability by regulating the expression of DDR genes, and that reduced expression of some of these DDR genes subsequently triggers apoptosis (Juan et al., 2016; Chen et al., 2017). During early embryonic development, DNA replication is prominent and highly efficient DNA repair is crucial for proper embryo development. For instance, Atr- and Brca1-lacking embryos were reported to display growth retardation in mice (Liu, 1996; Brown and Baltimore, 2000). In our study, both the addition of miR-30c mimics to the bovine embryo culture medium and the CDK12 knockdown in MDBKs decreased the expression level of the key DDR genes BRCA1, FANCD2, FANCI, and ATR. These results present evidence that miR-30c overexpression or CDK12 downregulation reduces the expression of these DDR genes at the transcriptional level, leading to a potential failure of DNA damage repair. Interestingly, while CDK12 overexpression increased cell cycle progression and cell proliferation, it had no effect on the mRNA level of those key DDR genes. This indicates that CDK12 overexpression might influence cell cycle progression at other levels or through other mechanisms. For instance, in breast cancer cells, CDK12 overexpression led to altered alternative last exon splicing of a subset of genes (Tien et al., 2017) and increased the invasiveness of a breast cancer cell line by decreasing the expression of the long isoform of DNAJB6 (Paculová and Kohoutek, 2017). A potential weakness of our study is that, due to technical difficulties, part of the functional analysis of CDK12 was done on MDBKs. Validating this mechanism in bovine embryos would strengthen these findings.

In summary, we have found 114 known miRNAs and 180 potential novel miRNAs in CM of bovine embryos. We have also identified miR-30c, which can be secreted and taken up by bovine embryos, as a novel potential biomarker related to bovine embryo apoptosis and reduced development. As miR-30c directly targets CDK12 and downregulates DDR genes, it may exert its effects on cell cycle progression by inhibiting the DDR pathways.

# AUTHOR CONTRIBUTIONS

XL performed the experiments and wrote the manuscript. YG provided the bioinformatics analysis for miRNA sequencing. EB contributed the qPCR experiments. KS was responsible for the embryo staining. KP helped to produce embryos. SMC, JC, PS, FVN, DD, AVS, and LP participated in the study design. All authors reviewed the manuscript.

# FUNDING

This work was supported by Ghent University (BOF GOA project 01G01112).

# ACKNOWLEDGMENTS

The authors thank Petra Van Damme for her excellent technical assistance.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00315/full#supplementary-material







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Lin, Beckers, Mc Cafferty, Gansemans, Joanna Szymańska, Chaitanya Pavani, Catani, Van Nieuwerburgh, Deforce, De Sutter, Van Soom and Peelman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Small Non-coding RNAs: New Class of Biomarkers and Potential Therapeutic Targets in Neurodegenerative Disease

Callum N. Watson<sup>1,2</sup>\*, Antonio Belli<sup>1,2</sup> and Valentina Di Pietro<sup>1,2,3</sup>

<sup>1</sup> Neuroscience and Ophthalmology Research Group, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom, <sup>2</sup> National Institute for Health Research Surgical Reconstruction and Microbiology Research Centre, Queen Elizabeth Hospital Birmingham, Birmingham, United Kingdom, <sup>3</sup> Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana–Champaign, Urbana, IL, United States

Neurodegenerative diseases (NDs) are becoming increasingly prevalent worldwide as the population ages. In the last few decades, due to the devastating nature of these diseases, research into biomarkers has become crucial to enable adequate treatments and to monitor the progress of disease. Currently, gene mutations and CSF and blood protein markers, together with neuroimaging techniques, are the most used diagnostic approaches. However, despite these research efforts, conflicting data still exist, highlighting the need to explore new classes of biomarkers, particularly at early stages. Small non-coding RNAs (microRNA, small nuclear RNA, small nucleolar RNA, tRNA derived small RNA and Piwi-interacting RNA) can be considered a "relatively" new class of molecules that have already proved to be differentially regulated in many NDs; hence they represent a new potential class of biomarkers to be explored. In addition, understanding their involvement in disease development could reveal the underlying pathogenesis of particular NDs, so that novel treatment methods that act earlier in disease progression can be developed. This review aims to describe the involvement of small non-coding RNAs as biomarkers of NDs and their potential role in future clinical applications.

#### Edited by:

Yun Zheng, Kunming University of Science and Technology, China

#### Reviewed by:

Mauricio Fernando Budini, Universidad de Chile, Chile Chao Peng, University of Pennsylvania, United States

> \*Correspondence: Callum N. Watson cnw763@bham.ac.uk

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 14 January 2019 Accepted: 05 April 2019 Published: 26 April 2019

#### Citation:

Watson CN, Belli A and Di Pietro V (2019) Small Non-coding RNAs: New Class of Biomarkers and Potential Therapeutic Targets in Neurodegenerative Disease. Front. Genet. 10:364. doi: 10.3389/fgene.2019.00364

Keywords: small non-coding RNAs, microRNAs, neurodegenerative disease, biomarkers, new therapeutic targets

# INTRODUCTION

Neurodegenerative diseases (NDs) are a class of disorders affecting the central nervous system, characterized by the progressive loss of neuronal tissue. NDs are age-dependent disorders whose prevalence is increasing internationally due to the ever-increasing elderly population, leaving greater numbers of people subjected to the chronic, debilitating nature of these incurable diseases (Heemels, 2016). Currently, the most represented NDs are: Alzheimer's disease (AD), with 5 million people affected in America alone, followed by Parkinson's disease (PD) with 1 million people; multiple sclerosis (MS) with 400,000; amyotrophic lateral sclerosis (ALS) with 30,000 and Huntington's disease (HD) with 3,000 cases (Agrawal and Biswas, 2015).

Some treatments have aimed to relieve the symptoms of NDs; these include L-dopa and deep brain stimulation in PD (Groiss et al., 2009; Nagatsua and Sawadab, 2009). However, very few have aimed to slow or reverse ND development, and those that have been investigated, e.g., stem cell therapy (Chung et al., 2002; Rachakonda et al., 2004), highlight the requirement for more research. Late diagnosis leads to strategic treatment being ineffective due to irreversible disease progression (Sheinerman and Umansky, 2013). This has been reported, for example, for anti-AD therapies in late-stage clinical trials (including dimebon of Medivation and Pfizer, solanezumab of Eli Lilly and bapineuzumab of Pfizer and Johnson & Johnson). Biomarkers for early diagnosis could prevent or limit disease development through prophylactic or early treatment, which has ignited interest. Currently, the most accurate diagnosis relies on neuropathology, mainly based on autopsy, or on the measurement of cerebrospinal fluid (CSF) proteins, such as tau or Aβ in AD, which requires invasive procedures. However, blood proteins, such as Aβ1-42 peptide in AD or cytokines for ALS or HD (Agrawal and Biswas, 2015), as well as genetic diagnostic markers such as ApoE isoforms in AD or α-synuclein or Parkin for PD, have also demonstrated potential clinical utility (Agrawal and Biswas, 2015).

Neuroimaging techniques can also help to make the correct diagnosis and monitor the progress of NDs. Magnetic resonance imaging (MRI) is one of the most widely used neuroimaging techniques for AD (Jack et al., 2011; McKhann et al., 2011) and for dementia with Lewy bodies (DLB) (Ciurleo et al., 2014). Magnetic resonance spectroscopy (MRS) has also shown promise in the early diagnosis of PD and traumatic brain injury by measuring metabolic dysfunctions and irreversible neuronal damage (Vagnozzi et al., 2008).

Recently, a new class of circulating RNAs, the non-coding RNAs, has been re-evaluated and is being considered as a source of potential biomarkers. After years of belief that 98% of the genome was "junk" due to its non-coding nature, it was realized that these non-coding regions had biological functionality. Non-coding genes include introns, pseudogenes, repeat sequences and cis/trans-regulatory elements that function as RNA without translation. Estimations have suggested that 99% of total RNA content is made up of non-coding RNA, with the number of validated non-coding RNAs (ncRNAs) increasing every year (Palazzo and Lee, 2015).

Currently, ncRNAs can be defined by length, as small (18–200 nt) or long (>200 nt), or by functionality, as housekeeping ncRNAs such as ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) or regulatory ncRNAs like microRNAs (miRNAs), small nuclear RNAs (snRNAs), piwi-interacting RNAs (piRNAs), tRNA derived small RNAs (tsRNAs) and long non-coding RNAs (lncRNAs) (Dozmorov et al., 2013). Nonetheless, difficulty in distinguishing categories persists due to the crossover of properties.
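The length-based definition can be expressed as a short sketch. The function name and the "unclassified" label for transcripts shorter than 18 nt are illustrative assumptions, not part of any published classification scheme; only the 18–200 nt and >200 nt boundaries come from the text.

```python
def classify_ncrna_by_length(length_nt: int) -> str:
    """Classify a non-coding RNA by length alone (hypothetical helper).

    Boundaries follow the text: small = 18-200 nt, long = >200 nt.
    Anything shorter than 18 nt is labeled 'unclassified' (an assumption).
    """
    if length_nt > 200:
        return "long"
    if 18 <= length_nt <= 200:
        return "small"
    return "unclassified"


if __name__ == "__main__":
    for n in (22, 200, 201, 10):
        print(n, classify_ncrna_by_length(n))
```

Length alone does not resolve the functional overlap between categories noted above, so a sketch like this is at most a first filter.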

Small non-coding RNAs (sncRNAs) have diverse roles, which in conjunction with other molecules involve gene regulation through either RNA interference, RNA modification or spliceosomal involvement (**Table 1**). Consequently, during disease progression their expression can alter. MiRNAs are the most studied sncRNA as biomarkers with involvement in various diseases including cancers, aging and neurodegenerative disease (Calin and Croce, 2006; Grasso et al., 2014; Di Pietro et al., 2017). Other sncRNAs have shown promise as biomarkers, with links to neurodegenerative disease (Munoz-Culla et al., 2016). There is the potential for multiple sncRNA biomarkers for neurodegenerative diseases, which if found, could aid diagnosis in a clinical setting while demonstrating the processes underpinning the disease development. In future, this could produce novel therapies to treat neurodegenerative diseases using original methodologies.

#### TABLE 1 | Classification of types of small non-coding RNAs.

In this review, we consider the evolving role of sncRNAs and discuss their involvement in neurodegenerative disease with particular emphasis on their potential as biomarkers.

# MICRORNA

MiRNAs are the most studied sncRNA. Their biogenesis commences with the formation of a pri-miRNA made up of two stem-loop structures. A Drosha and DGCR8 complex cleaves the pri-miRNA to form a single stem-loop pre-miRNA. Dicer cleaves the pre-miRNA to create a double stranded miRNA, which is loaded onto an Argonaute family protein to form the miRISC complex (**Figure 1Ai**). As part of the miRISC complex, miRNAs regulate gene expression post-transcriptionally through degradation and repression of mRNA sequences by an Argonaute family protein mediated method (**Figure 1Aii**; O'Brien et al., 2018). A single miRNA can have multiple targets; likewise, a target mRNA can be bound by many different miRNAs, enabling more diverse signaling patterns.
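The many-to-many relationship between miRNAs and their targets can be pictured as a mapping that is readable in both directions. The sketch below is only an illustration: the identifiers `miR-A`, `miR-B`, `mRNA-1` and so on are placeholders, not real miRNA-target pairs, and the function name is hypothetical.

```python
from collections import defaultdict


def invert_targets(mirna_to_targets: dict) -> dict:
    """Turn a miRNA -> {target mRNAs} map into a target -> {miRNAs} map."""
    target_to_mirnas = defaultdict(set)
    for mirna, targets in mirna_to_targets.items():
        for target in targets:
            target_to_mirnas[target].add(mirna)
    return dict(target_to_mirnas)


if __name__ == "__main__":
    # Placeholder identifiers, chosen only to show the shape of the data.
    example = {
        "miR-A": {"mRNA-1", "mRNA-2"},
        "miR-B": {"mRNA-2", "mRNA-3"},
    }
    for target, mirnas in sorted(invert_targets(example).items()):
        print(target, sorted(mirnas))
```

In this toy input, `mRNA-2` is regulated by both placeholder miRNAs, while each miRNA regulates two transcripts.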

MiRNAs show specific signaling in the brain, and were also found to be differentially expressed in bio-fluids. Although there is not yet a consensus on particular miRNAs or brain areas, and no specific miRNA overlap between brain tissues and bio-fluids (as reported in **Table 2**), these findings certainly provide insights for the study of ND pathogenesis.

MiRNAs are best studied in Alzheimer's disease (AD), which manifests itself as deposition of neurofibrillary tangles (NFT) and extracellular amyloid-β (Aβ), before neuronal degeneration and clinical symptoms materialize in the form of behavioral changes such as memory issues. NFT, Aβ and neuronal degeneration have been associated with dysregulation of miRNA gene expression, which could emanate from altered Aβ or Tau metabolism. MiRNAs affect Aβ metabolism by interacting with amyloid precursor protein (APP) through direct binding to the 3′ untranslated region (3′UTR) of the APP mRNA, indirect inhibition through downregulation of Beta-secretase 1 (BACE1) and ATP-binding cassette transporter A1 (ABCA1), or regulation of alternative APP splicing. MiRNAs also affect Tau through regulation of microtubule associated protein tau (MAPT) splicing, affecting tau isoforms 3R and 4R. Direct or indirect binding either modulates phosphorylated Tau-associated protein kinases or influences degradation of phosphorylated tau by binding the 3′UTR of BCL2 associated athanogene 2 (BAG2) mRNA (Zhao et al., 2017).

MiRNAs have an established involvement in neurobiological functions and pathogenesis of numerous other neurodegenerative diseases (Serafin et al., 2014; Fransquet and Ryan, 2018; Ricci et al., 2018). Mitochondrial dysfunction caused by miRNA dysregulation leads to oxidative stress, which causes cell death, α-synuclein aggregation and neurodegeneration known to be present in PD (Spano et al., 2015). In ALS, both TAR DNA binding protein (TARDBP) and fused in sarcoma (FUS) are well-established causative genes, which are involved in miRNA processing. TARDBP has specific roles in the facilitation of post-transcriptional processing, achieved through direct association with miRNA or with processing factors such as Dicer (Kawahara and Mieda-Sato, 2012). FUS regulates miRNA-mediated gene silencing through facilitation of the interaction between miRNA, mRNA and RISC components (Zhang et al., 2018). In HD, a miRNA formulation is being trialed as a therapeutic agent to alter the aberrant Huntingtin (HTT) protein expression (Aronin and DiFiglia, 2014).

MiRNA involvement in ND development has demonstrated the capability of distinguishing between disease subtypes and shown promise for future stratification. For example, in AD, 30 differentially regulated miRNAs found in the brain and blood of AD patients were assigned to different Braak stages, a methodology for classifying AD pathology, with 10 associated with Braak stage III (hsa-mir-107, hsa-mir-26b, hsa-mir-30e, hsa-mir-34a, hsa-mir-485, hsa-mir-200c, hsa-mir-210, hsa-mir-146a, hsa-mir-34c, and hsa-mir-125b) (Swarbrick et al., 2019). Likewise, in PD, miR-331-5p is differentially expressed in plasma of early onset Parkinson's disease (EOPD) patients, which was not seen in late onset Parkinson's disease (LOPD) patients (Cardo et al., 2013; **Table 2**). Studies comparing subtypes of NDs are still in the minority and more are required to understand the true capability of miRNA markers in the stratification of NDs.

# SMALL NUCLEAR RNAs

Small nuclear RNAs (snRNAs), the component parts of the spliceosome, which is responsible for removal of non-coding introns from precursor mRNA, are highly conserved uridine-rich sequences, with five snRNAs making up its core: U1, U2, U4, U5, and U6. These snRNAs combine with partner proteins to form small nuclear ribonucleoprotein (snRNP) complexes, which are essential for pre-mRNA splicing to enable production of functional mRNA for protein translation.

#### TABLE 2 | MiRNAs with an involvement in neurodegenerative disease development.

<sup>∗</sup>Passenger miRNA strand.

Sm-class snRNAs are synthesized by RNA polymerase II and after transcription contain a 7-methylguanosine cap, an Sm-protein binding site and a 3′ stem-loop. The latter two are recognized by the SMN complex, which recruits a set of Sm proteins to create the Sm-core RNP. Following this, the cap undergoes hypermethylation by trimethylguanosine synthase-1 (TGS1), creating a 2,2,7-trimethylguanosine cap. The 3′ end is then trimmed by an unknown exonuclease before subsequent maturation through modifications (Matera et al., 2007; **Figure 1Bi**).

Two types of spliceosome can be assembled: "major" and "minor" (the latter accounting for 0.35% of all introns). Major spliceosome assembly commences with U1 interacting with the 5′ splice site while U2 snRNP binds to the branch point sequence. This leads to the recruitment of the premade U4/U6.U5 tri-snRNP complex; in this state the spliceosome is inactive. After destabilization or release of either U1 or U4, the spliceosome becomes active. The active spliceosome undergoes two phases of catalysis, after which it dissociates (with U2, U5, and U6 being recycled) and releases the mRNA as mRNP (Wahl et al., 2009; **Figure 1Bii**). The minor spliceosome has divergent and highly conserved 5′ splice site and branch point sequences, which interact with U5 as well as the alternative factors U11, U12, and U4atac/U6atac, which are functional analogs of their major counterparts (Verma et al., 2018; **Figure 1Bii**). Both spliceosomes show the capability to contribute to the development of neurodegenerative disease, demonstrating snRNA involvement (Bai et al., 2013; Tsuiji et al., 2013; Ratti and Buratti, 2016; Jutzi et al., 2018).

In sporadic and familial AD, U1 snRNP subunits, including U1-70K and U1A, were present in cytoplasmic aggregates, which form through binding of the basic-acidic dipeptide (BAD) domain of U1-70K to tau (Bishof et al., 2018). Inordinate levels of unspliced RNA also accumulate, caused by dysregulation of RNA processing. In conjunction with evidence that inhibition of U1 snRNP increases APP, this implicates U1 snRNP dysregulation in the pathogenesis of AD (Bai et al., 2013; Hales et al., 2014a,b). Recent evidence has shown that abnormal expression of U1 snRNA can cause premature cleavage and polyadenylation (PCPA) of pre-mRNA at the 3′ poly-A site. This affects splicing and could represent a novel AD-causing pathology (Cheng et al., 2017) (**Table 3**).

U snRNAs are also associated with spinal muscular atrophy (SMA). SMN1 gene dysregulation alters U snRNA levels through its role in U snRNA biosynthesis; nonetheless, the underlying pathology is still unclear (Zhang et al., 2013). Many studies have proposed that a reduction in U snRNAs is key to SMA pathology due to their involvement in mRNA processing, with U1 and U11 of particular interest (Gabanella et al., 2007; Zhang et al., 2008). In contrast, U snRNAs can accumulate in the motor neurons of ALS patient spinal cords when compared to control patients and cause defects, showing that U snRNA levels can reflect disease state, depending on cell type (Tsuiji et al., 2013).

More recently, a study of induced pluripotent stem cell (iPSC) derived motor neurone cultures suggested that an imbalanced ratio of variant U1 to U1 might cause the SMA phenotype rather than an overall reduction in U1 snRNA (Vazquez-Arango et al., 2016). This demonstrates that purely measuring U snRNA level may be an oversimplified measurement and that variant U snRNAs could indicate the underlying pathophysiology of aberrant spliceosome related neurodegeneration.

Other U snRNAs studied in neurodegenerative disease include U2. A U2 snRNA mutation causes neuron degeneration by altering pre-mRNA splicing at select splice sites that are associated with alternative pre-mRNA splicing (Jia et al., 2012). In addition, a dipeptide repeat (C9ORF72) linked to both ALS and frontotemporal dementia (FTD) interacts and interferes with U2 snRNP. In patient derived cells, this led to mislocalisation, but mis-splicing linked to ALS/FTD has yet to be established (Yin et al., 2017).

Mutations found within the gene PRPF4, which encodes hPrp4, a U4/U6 di-snRNP protein, play an important role in the development of retinitis pigmentosa (RP) (Chen et al., 2014). hPrp4 is known to interact with CypH and hPrp3 to regulate the stability of the tri-snRNP, U4/U6.U5. Thus, aberrant splicing could cause RP through direct or indirect mechanisms that have been hypothesized, but not defined.

The minor spliceosome also has ND relevance: in ALS, TDP-43 functionality decreases (Colombrita et al., 2012), which reduces the number of Gemini of coiled bodies (GEMs). GEMs contribute to U12 snRNA biogenesis, so in spinal motor neurones of ALS patients there was a decrease of U12 snRNA and U11/U12 snRNP, which may disrupt pre-mRNA splicing (Ishihara et al., 2013). Additionally, an ALS-associated FUS mutant (P525L) cannot promote minor intron splicing, whereas FUS routinely binds to U11 snRNP to direct splicing. This leads to mislocalisation of FUS-trapped U11 and U12 snRNAs, which form aggregates in the cytoplasm, so incorrect splicing results (Reber et al., 2016). In addition, a cerebellar ataxia mutation in RNU12 causes minor intron retention in homozygous mutant patients (Elsaid et al., 2017). Taken together, these findings demonstrate a likely role for minor intron splicing in motor neurone maintenance.

# SMALL NUCLEOLAR RNAs

Small nucleolar RNAs (snoRNAs) modify RNA through their conserved motifs, with box C/D guiding methylation and box H/ACA guiding pseudouridylation (Ohtani, 2017; **Figure 1Cii**). Each class of snoRNAs displays a unique secondary structure and associates with conserved proteins to form the defined C/D and H/ACA snoRNPs. SnoRNAs mainly target rRNA to modify functionally important regions of the ribosome (Decatur and Fournier, 2002), but other purposes include pre-rRNA endonucleolytic processing (Tollervey and Hurt, 1990), guiding snRNAs such as U6 snRNA (Tycowski et al., 1998) and, more recently described, mRNA guiding (Sharma et al., 2016) or regulation of alternative splicing in pre-mRNAs (Falaleeva et al., 2016).

Box C/D snoRNP biogenesis commences when a protein complex of SNU13 and NOP58 is pre-formed and loaded onto the snoRNA with the help of HSP90/R2TP. This recruits assembly factors and the pre-snoRNPs are transferred to the Cajal bodies where final processing occurs. Box H/ACA RNP biogenesis starts with SHQ1 and DKC1 combining to prevent non-specific RNA binding. SHQ1 is released with the help of the R2TP complex, allowing DKC1 to bind H/ACA RNAs at the site of transcription. Numerous assembly factors including NHP2, NOP10, and NAF1 are present during this pre-snoRNP form. When NAF1, which binds the C-terminal domain of RNA polymerase II to keep the H/ACA RNP inactive, is replaced by GAR1, mature and functional H/ACA RNPs are produced. Both forms are transported to the nucleolus to elicit their actions (Massenet et al., 2017; **Figure 1Ci**).

A study showed differential regulation of two C/D box snoRNAs (e307 and e470) prior to the development of AD in a mouse model. After formation of β-amyloid plaques, this differential expression is no longer present, demonstrating that these snoRNAs could be useful in early diagnosis. However, there is no clear evidence of a role in pathogenesis, which has only been hypothesized using bioinformatics methods (Gstir et al., 2014) (**Table 3**).

Although autism spectrum disorder (ASD) might not be considered a neurodegenerative disease, studies have found links between snoRNAs and ASD, with numerous snoRNA genes found to be differentially expressed using RNA-seq (Wright et al., 2017). Duplication of SNORD115 in the region of mouse chromosome 7 that mirrors human chromosome 15q11-13 (duplication of which is one of the most common chromosomal abnormalities in ASD) has been shown to increase SNORD115 levels and result in abnormal brain development. In addition, SNORD115 (HBII-48 and HBII-52) levels are dysregulated in the superior temporal gyrus of human ASD brain samples, which could explain the 5-HT changes (Gabriele et al., 2014) and alternative splicing seen in ASD (Voineagu et al., 2011), as HBII-52 may regulate 5-HT2C receptor mRNA levels (Stamova et al., 2015) as well as alternative splicing (Kishore et al., 2010).

Another study demonstrated that maternal alcohol consumption in pregnancy alters the C/D box RNA levels in brain cells during abnormal fetal development. DNA methylation, microRNA and snoRNA levels were altered, most notably with SNORD115 increasing and SNORD116 decreasing (Laufer et al., 2013).

# PIWI-INTERACTING RNA

Piwi-interacting RNAs (piRNAs) are a diverse range of small RNAs that are highly enriched in germline tissues. They interact with PIWI-class Argonaute proteins, with a sequence bias only for a uracil as the first 5′ nucleotide. This diverse population can be mapped back to distinct areas of the genome known as piRNA clusters, which contain highly enriched areas of fragmented dysfunctional transposable element (TE) sequences. These are thought to emanate from the memory of previous TE invasions, and can be utilized to protect against TEs (Toth et al., 2016). In addition, PIWI proteins function at the chromatin level by guiding DNA methylation and deposition of repressive histone marks to silence TE transcription (Le Thomas et al., 2013; **Figure 1Dii**).
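The 5′ uracil bias is a simple property to check on a set of candidate sequences. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the input sequences in the usage example are invented for illustration, and DNA-style reads (with T) are treated as RNA by converting T to U.

```python
def five_prime_u_fraction(sequences) -> float:
    """Return the fraction of non-empty sequences whose first nucleotide is U.

    Sequences may be given as RNA (U) or as DNA-style reads (T);
    T is converted to U before counting.
    """
    cleaned = [s.upper().replace("T", "U") for s in sequences if s]
    if not cleaned:
        return 0.0
    return sum(s[0] == "U" for s in cleaned) / len(cleaned)


if __name__ == "__main__":
    # Invented example reads, not real piRNA sequences.
    reads = ["UGCA", "TGGA", "AGCU", "CUUU"]
    print(five_prime_u_fraction(reads))
```

A fraction well above that expected from background base composition would be consistent with a piRNA-like population, although no specific threshold is implied by the text.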

The biogenesis of piRNAs gives rise to two different forms, primary and secondary, of 26–30 nt in length, stemming from single-stranded precursors (Yan et al., 2011; Mani and Juliano, 2013), which are best studied in Drosophila. Primary piRNA biogenesis is poorly defined, but precursors of around 200 nt stemming nearly entirely from piRNA clusters are cleaved (Zucchini (ZUC) is thought to do this) to enable loading onto a PIWI protein in association with other factors (**Figure 1Di**). This piRNA-PIWI complex interacts with TEs to prevent insertion through methylation or transcriptional repression, thereby affecting gene expression (Toth et al., 2016).

In Drosophila, secondary piRNAs are formed through a more defined "ping-pong" pathway, which utilizes the primary piRNAs formed from TE fragments present in piRNA clusters, loaded onto Aubergine (AUB), to find complementary antisense TE transcripts (**Figure 1D**). Once found, the complementary TE mRNA binds and is cleaved ten nucleotides along from the 5′ end by AUB, which terminates its function. Additionally, this creates a new 5′ end and piRNA precursor, which, accompanied by AGO3, is processed into a secondary piRNA. The secondary piRNA, which is representative of the sense TE strand, promotes the development of more cluster-derived piRNAs through complementary cluster transcripts to develop a greater repertoire against active TEs (Toth et al., 2016; **Figure 1Di**).
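The cleavage step of the ping-pong pathway can be sketched as string operations. This is a deliberately simplified model, assuming perfect complementarity between the guide piRNA and its target site and a cut opposite the tenth nucleotide of the guide; the function names and the sequences in the usage example are hypothetical.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return rna.translate(_COMPLEMENT)[::-1]


def ping_pong_cleave(guide: str, transcript: str,
                     offset: int = 10) -> Optional[Tuple[str, str]]:
    """Cleave a transcript at the site paired with a guide piRNA.

    The cut is placed so that the 3' product begins with the nucleotide
    paired to guide position `offset` (counted from the guide's 5' end).
    Returns (5' product, 3' product), or None if no perfectly
    complementary site is found (a simplifying assumption).
    """
    site = reverse_complement(guide)
    start = transcript.find(site)
    if start == -1:
        return None
    cut = start + len(guide) - offset
    return transcript[:cut], transcript[cut:]


if __name__ == "__main__":
    guide = "UGACCGAUUAGCCGGAAUCGAUCGAU"  # invented 26-nt guide with a 5' U
    transcript = "GGG" + reverse_complement(guide) + "CCC"
    five_prime, three_prime = ping_pong_cleave(guide, transcript)
    # The new 5' end overlaps the guide's first 10 nt in antisense.
    print(three_prime[:10], reverse_complement(guide[:10]))
```

Under this model, the first ten nucleotides of the new 5′ end are complementary to the first ten nucleotides of the guide, and a guide that starts with U places an A at position 10 of the product.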

Originally, piRNAs were thought to be present solely in germline cells. More recently, they have been found in other areas of the body, including blood (Yang et al., 2015), blood plasma (Freedman et al., 2016) and the brain (Roy et al., 2017), and have been linked to diseases of the liver (Rizzo et al., 2016), cardiovascular system (Loche and Ozanne, 2016) and brain (Roy et al., 2017), demonstrating that their roles are far-reaching. In neurodegenerative disease, there have been recent studies on PD and AD.

Risk variants APOE (rs2075650) and RNU6-560P (rs10792835 + rs3851179) have been linked with AD through genome-wide association studies (GWAS). These risk variants were significantly correlated with nine (6 APOE and 3 RNU6-560P) different piRNAs, showing regulatory capabilities (Guo X. et al., 2017). PiRNA dysregulation may be integral to the development of AD through aberrant downstream signaling. The link to pathogenesis in AD was clarified for three AD dysregulated piRNAs (piR-38240, piR-34393, and piR-40666) after establishing complementary target genes (CYCS, KPNA6, and RAB11A) through inverse expression correlation (Roy et al., 2017). The target genes were known to regulate AD pathways through oxidative stress induced neurodegeneration, apoptosis and vesicular trafficking of Aβ. This demonstrates a regulatory role for piRNAs in preventing AD, so monitoring their dysregulation could allow early diagnosis and point to a treatment method.
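Inverse expression correlation can be illustrated with a correlation coefficient computed across samples. The sketch below uses Pearson's r, which is an assumption made here for illustration (the statistic used in the cited study is not stated in this review); the threshold of −0.5, the function names and the expression values are likewise hypothetical.

```python
from math import sqrt


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def is_inversely_correlated(pirna_expr, target_expr,
                            threshold: float = -0.5) -> bool:
    """True if piRNA and candidate target expression move in opposite directions.

    The threshold is an illustrative cut-off, not a published value.
    """
    return pearson_r(pirna_expr, target_expr) <= threshold


if __name__ == "__main__":
    # Invented expression values across four samples.
    pirna = [1.0, 2.0, 3.0, 4.0]
    target = [8.0, 6.0, 4.0, 2.0]
    print(pearson_r(pirna, target), is_inversely_correlated(pirna, target))
```

An inverse correlation alone does not prove direct targeting; in the study described above it was combined with sequence complementarity between piRNA and target gene.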

A difference in piRNA expression was found between cells derived from PD patients and from controls. Patient tissue samples showed the same trend, with 70 different piRNAs overlapping between both (**Table 3**). Two distinct trends emerge from these piRNAs: up- or down-regulation (Schulze et al., 2018). In the down-regulated piRNA fraction, those derived from short-interspersed nuclear elements (SINE) and long-interspersed nuclear elements (LINE) in cell lines, and from LINE in tissues, showed significant enrichment when compared to genome-wide expression (Schulze et al., 2018). This is indicative of an inability to silence SINE and LINE derived elements in PD-derived neurones, which could indicate a pathogenic mechanism of PD.

# TRANSFER RNAs

Transfer RNAs (tRNAs) are the most abundant form of sncRNA, making up 4–10% of all cellular RNAs. They were previously thought to be static contributors to gene expression, acting as adaptor molecules in translation. Recently, it has been found that small non-coding tRNAs have unique functions that enable wider signaling and dynamic regulation of various processes (Gebetsberger and Polacek, 2013).

Mature tRNA is formed through transcription of precursor tRNA (pre-tRNA) by RNA polymerase III. The endonucleases ribonuclease P (RNase P) and ribonuclease Z cleave the transcribed pre-tRNA at the 5′ leader sequence and 3′ polyuracil (poly-U) tail, respectively, before tRNA nucleotidyl transferase adds a 3′ CCA tail (**Figure 1Ei**). Many post-transcriptional modifications occur during maturation, and only appropriately processed tRNAs leave the nucleus via a nuclear receptor-mediated export process, while wrongly processed tRNAs are eliminated. The mature tRNAs are 73–90 nts in length and have a clover-leaf shaped secondary structure, composed of a D-loop, an anticodon loop, a T-loop, a variable loop and an amino acid acceptor stem (Kirchner and Ignatova, 2015). Mature tRNA or pre-tRNA can be cleaved into specific products, contrary to what was previously thought, and two main categories of cleaved tRNAs have been defined: (1) tRNA halves and (2) tRNA derived fragments.

tRNA halves are produced by cleavage of the anticodon loop, giving rise to two halves: 30–35 nt 5′-tRNA halves and 40–50 nt 3′-tRNA halves (Li and Hu, 2012; **Figure 1Eii**). A subtype of tRNA halves known as tRNA-derived stress-induced RNAs (tiRNAs) are by-products of stress, which induces cleavage of mature cytoplasmic tRNAs by angiogenin (ANG), a ribonuclease (Yamasaki et al., 2009).

tRNA derived fragments (tRFs) are produced from either pre-tRNAs or mature tRNAs (**Figure 1Eii**). Four main types have been established according to the fragment location on tRNAs: 5-tRFs, 3-tRFs, 1-tRFs, and 2-tRFs. 5-tRFs, located most abundantly in the nucleus, are generated from cleavage of the D-loop of tRNAs by Dicer, with adenine being present at the 3′ ends. Further subdivision classifies 5-tRF isoforms into "a" (∼15 nts), "b" (∼22 nts) and "c" (∼30 nts) (Lee et al., 2009; Kumar et al., 2015). 3-tRFs result from cleavage of the T-loop by Dicer, ANG or another member of the Ribonuclease A superfamily, and contain a CCA tail sequence (18–22 nts) (Lee et al., 2009; Maraia and Lamichhane, 2011; Kumar et al., 2015). 1-tRFs are formed by the cleavage of the 3′-trailer fragment of pre-tRNAs by either RNaseZ or ELAC2; they usually begin after the 3′-end of the mature tRNA and contain a poly-U 3′-end (Lee et al., 2009; Liao et al., 2010). Less is known about 2-tRFs, but they may be formed from the anticodon loop (Goodarzi et al., 2015).
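The subdivision of 5-tRFs by approximate length can be sketched as a nearest-length assignment. The nominal lengths (15, 22 and 30 nt) come from the text; the tolerance of 3 nt and the function name are assumptions introduced for illustration only.

```python
from typing import Optional

# Nominal isoform lengths in nucleotides, as given in the text.
FIVE_TRF_ISOFORMS = {"a": 15, "b": 22, "c": 30}


def classify_5trf_isoform(length_nt: int, tolerance: int = 3) -> Optional[str]:
    """Assign a 5-tRF to isoform 'a', 'b' or 'c' by nearest nominal length.

    Returns None when the length is more than `tolerance` nt away from
    every nominal length (the tolerance is an illustrative assumption).
    """
    name, nominal = min(FIVE_TRF_ISOFORMS.items(),
                        key=lambda item: abs(item[1] - length_nt))
    if abs(nominal - length_nt) > tolerance:
        return None
    return name


if __name__ == "__main__":
    for n in (15, 19, 22, 31, 40):
        print(n, classify_5trf_isoform(n))
```

Length is only one criterion: 5-tRF "c" fragments overlap in size with 5′-tRNA halves (30–35 nt), so cleavage position on the parent tRNA would also be needed to separate them.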

Numerous neurodegenerative disorders are associated with tRFs. ANG mutants show reduced ribonuclease (RNase) activity and were first implicated in the pathogenesis of amyotrophic lateral sclerosis (ALS) (Greenway et al., 2006). Later, a subset of the ALS-associated ANG mutants was observed in Parkinson's disease (PD) patients (van Es et al., 2011). Recombinant ANG can improve life span and motor function in an ALS [SOD1 (G93A)] mouse model, demonstrating that tRFs may have an important role in motor neuron survival (Kieran et al., 2008) (**Table 3**).

The link between ANG-induced tiRNAs, cellular stress and neurodevelopmental disorders was strengthened with the finding of NSun2 (Blanco et al., 2014). Mutations in the cytosine-5 RNA methyltransferase NSun2 have been shown to cause intellectual disability and a Dubowitz-like syndrome in humans (Abbasi-Moheb et al., 2012; Martinez et al., 2012). NSun2 methylates two different cytosine residues of tRNA. Without NSun2, cytosine-5 RNAs are not methylated, which increases the stress-induced ANG-mediated endonucleolytic cleavage of tRNAs, and so 5′-tiRNAs accumulate. Accumulation of these factors leads to cell death in hippocampal and striatal neurons because of translational repression leading to cellular stress. Consequently, NSun2 knockout mice show reduced neuronal size and impaired formation of synapses, which could explain the impairment of patients with NSun2 gene mutations (Blanco et al., 2014).

A mutation in the CLP1 gene (R140A), which encodes an RNA kinase involved in tRNA splicing, is present in patients with pontocerebellar hypoplasia (PCH), a heterogeneous group of inherited neurodegenerative disorders characterized by the loss of motor neurons, muscle paralysis, impaired development of various parts of the brain and differential tRNA splicing (Karaca et al., 2014; Schaffer et al., 2014). Because of the role of CLP1 in RNA splicing, the mutant gene product has reduced kinase activity and affinity for the tRNA endonuclease complex (TSEN), impairing pre-tRNA cleavage and elevating unspliced pre-tRNAs in patient derived neurons (Schaffer et al., 2014). TSEN cuts the transcript at 3′ intron-exon junctions, so the absence of CLP1 means that the 5′-unphosphorylated tRF cannot interact with the pre-tRNA<sup>Tyr</sup> 3′-exon and subsequent splicing steps are interrupted (Cassandrini et al., 2010).

N<sup>6</sup>-threonyl-carbamoyl-adenosine (t6A) is a complex modification of adenosine involved in cytoplasmic tRNA modification. It is located next to the anticodon loop of many tRNAs that decode ANN codons, at position 37 (t6A37). Recently, a biosynthetic defect in t6A resulting from a mutation in the kinase-associated endopeptidase (KAE1) gene, which is part of the kinase, endopeptidase and other proteins of small size (KEOPS) complex, was found in two patients with a neurodegenerative phenotype, implicating tRNA modification in neuronal maintenance (Edvardson et al., 2017).

Although tRNA-derived small non-coding RNAs have already been shown to play a role in cancer progression (Sun et al., 2018), their role as biomarkers in NDs has not yet been fully investigated.

However, animal studies showed 13 dysregulated tRFs in brain samples of the SAMP8 mouse model for AD. In particular, four were upregulated (AS-tDR-011775, AS-tDR-011438, AS-tDR-006835, and AS-tDR-005058) and nine were downregulated (AS-tDR-013428, AS-tDR-011389, AS-tDR-009392, AS-tDR-012690, AS-tDR-010654, AS-tDR-008616, AS-tDR-010789, AS-tDR-011670, and AS-tDR-007919), suggesting a potential role for tRFs in the early detection of AD.

## CONCLUSION

The key problem in the ND field is the lack of understanding of the events preceding the development of the protein-based markers, such as Tau, that are currently used to diagnose NDs. By this stage, the diseases have become more difficult to treat.

SncRNAs play an important regulatory role in the maintenance of the homeostatic brain. Therefore, changes in their concentration levels can be indicative of mechanistic changes that could precede protein-based markers. One single sncRNA biomarker is unlikely to differentiate between diseases. However, a combination of sncRNA biomarkers could be illustrative of the mechanistic development of NDs, enabling early diagnosis, enhanced disease monitoring, and the definition of subtle differences between NDs. Consequently, novel treatment methods directly related to the mechanistic underpinning of specific NDs, and potentially of other brain-related pathologies, can be envisaged.

Novel, less well-studied sncRNAs could be integral to understanding overall disease progression, so new methodologies may be necessary to quantify these changes and allow for future biomarker development.

## AUTHOR CONTRIBUTIONS

CW drafted the manuscript. AB and VDP critically revised the manuscript.

## FUNDING

This study was funded by the National Institute for Health Research (NIHR). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Watson, Belli and Di Pietro. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Systematic Analysis of Non-coding RNAs Involved in the Angora Rabbit (Oryctolagus cuniculus) Hair Follicle Cycle by RNA Sequencing

Bohao Zhao<sup>1</sup>, Yang Chen<sup>1,2</sup>, Shuaishuai Hu<sup>1</sup>, Naisu Yang<sup>1</sup>, Manman Wang<sup>2</sup>, Ming Liu<sup>1</sup>, Jiali Li<sup>1</sup>, Yeyi Xiao<sup>1</sup> and Xinsheng Wu<sup>1,2</sup>\*

<sup>1</sup> College of Animal Science and Technology, Yangzhou University, Yangzhou, China, <sup>2</sup> Joint International Research Laboratory of Agriculture and Agri-Product Safety, Yangzhou University, Yangzhou, China

The hair follicle (HF) cycle is a complicated and dynamic process in mammals, associated with various signaling pathways and gene expression patterns. Non-coding RNAs (ncRNAs) are RNA molecules that are not translated into proteins but are involved in the regulation of various cellular and biological processes. This study explored the relationship between ncRNAs and the HF cycle by developing a synchronization model in Angora rabbits. Transcriptome analysis was performed to investigate ncRNAs and mRNAs associated with the various stages of the HF cycle. One hundred and eleven long non-coding RNAs (lncRNAs), 247 circular RNAs (circRNAs), 97 microRNAs (miRNAs), and 1,168 mRNAs were differentially expressed during the three HF growth stages. Quantitative real-time PCR was used to validate the ncRNA transcriptome analysis results. Gene ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses provided information on the possible roles of ncRNAs and mRNAs during the HF cycle. In addition, lncRNA–miRNA–mRNA and circRNA–miRNA–mRNA ceRNA networks were constructed to investigate the underlying relationships between ncRNAs and mRNAs. LNC\_002919 and novel\_circ\_0026326 were found to act as ceRNAs and participated in the regulation of the HF cycle as miR-320-3p sponges. This research comprehensively identified candidate regulatory ncRNAs during the HF cycle by transcriptome analysis, highlighting the possible association between ncRNAs and the regulation of hair growth. This study provides a basis for systematic further research and new insights on the regulation of the HF cycle.

#### Keywords: rabbit, non-coding RNA, sequencing, hair follicle cycle, ceRNA

#### Edited by:

Yun Zheng, Kunming University of Science and Technology, China

#### Reviewed by:

Changning Liu, Xishuangbanna Tropical Botanical Garden (CAS), China

Zexuan Zhu, Shenzhen University, China

> \*Correspondence: Xinsheng Wu xswu@yzu.edu.cn

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 16 February 2019 Accepted: 12 April 2019 Published: 03 May 2019

#### Citation:

Zhao B, Chen Y, Hu S, Yang N, Wang M, Liu M, Li J, Xiao Y and Wu X (2019) Systematic Analysis of Non-coding RNAs Involved in the Angora Rabbit (Oryctolagus cuniculus) Hair Follicle Cycle by RNA Sequencing. Front. Genet. 10:407. doi: 10.3389/fgene.2019.00407

# INTRODUCTION

Hair follicle (HF) development is a complex morphogenetic process that relies on a variety of signaling systems, and on interactions between mesenchymal and epithelial tissues (Hardy, 1992; Oro and Scott, 1998). Under the biological regulation of stem cells, mature HFs undergo a cycling and continuous self-renewal process, with periods of active growth (anagen), followed by regression (catagen), and rest (telogen) (Cotsarelis et al., 1990; Paus and Cotsarelis, 1999; Fuchs and Segre, 2000; Oshima et al., 2001). In murine HF cycling, key parameters for the recognition of distinct stages have been defined in many studies (Chase et al., 1951; Chase, 1954; Straile et al., 2010). Moreover, the immediate removal of hair shafts can induce homogeneous anagen development in the murine model, which is followed by spontaneous entry into the consecutive stages (catagen and telogen). Methods for the analysis of murine HF growth were established in this way, based on histologic and ultrastructural studies of murine hair cycling (Veen et al., 1999; Müller-Röver et al., 2001). During the anagen phase, the cells of the hair root divide and add to the hair shaft. The HFs actively grow, surrounded by dermal fibroblasts that have not reached the subcutis. During the catagen phase, interfollicular dermal fibroblasts fully surround the HFs, the blood supply is cut off, and the hair bulb starts to atrophy. Finally, HFs enter the telogen phase, in which hair shafts stop growing and begin to shed owing to the synthesis and release of a hair cycle inhibitor (Stenn and Paus, 2001). The molecular mechanisms underlying the regulation of the hair cycle and of HF development are of interest in medicine and developmental biology (Shirokova et al., 2016; Ahmed et al., 2017; Sardella et al., 2017).

Long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and circular RNAs (circRNAs) are non-coding RNAs (ncRNAs) that are not translated into proteins but regulate many cell functions and play vital roles in many biological processes (Mattick and Makunin, 2006; Guttman and Rinn, 2012). miRNAs are small ncRNA molecules (~22 nucleotides in length) that repress gene expression by recognizing specific target mRNAs (Ding et al., 2009). An increasing number of studies have reported that lncRNAs (non-coding RNAs longer than 200 nucleotides) regulate interactions between genes and proteins, act as decoys that bind to miRNAs or proteins, or bind to enhancer regions or neighboring loci to modulate the transcription of their target genes (Winkle et al., 2015; Chen et al., 2016; Li et al., 2016; Song et al., 2017; Lu et al., 2018). CircRNAs consist of continuous loop structures, are more stable than linear mRNAs, and are conserved between different species (Stoffelen et al., 2012; Memczak et al., 2014). As sponges for miRNAs, circRNAs act as competitive inhibitors that interfere with the binding of miRNAs to their target genes (Hansen et al., 2013; Zhong Z. et al., 2016). CircRNAs may also regulate the function of RNA-binding proteins and the transcriptional activity of the host gene (Reut et al., 2014; Li et al., 2015). Although circRNAs have been categorized as ncRNAs, some have been reported to encode proteins (Pamudurti et al., 2017).

Accumulating evidence suggests that lncRNAs are involved in the regulation of the HF cycle (Wang et al., 2017; Song et al., 2018; Zhu Y.B. et al., 2018). Specific lncRNAs, such as HOTAIR, H19, and RP11-766N7.3, have been reported by lncRNA microarray analysis to be differentially expressed in dermal papilla cells after Wnt signaling, and integrated analysis by RNA-seq has led to the identification of candidate lncRNAs that may play a role during the initiation of secondary HFs (Lin et al., 2015; Yue et al., 2016). Moreover, aberrantly expressed miRNAs may participate in the regulation of the development of skin and HFs. miRNAs play important roles in several signaling pathways and control gene expression patterns during the HF cycle (Mardaryev et al., 2010; Chao et al., 2013; Ahmed et al., 2014; Zhou et al., 2018). In addition, the expression levels and functions of circRNAs associated with skin color during different skin differentiation stages have been analyzed by RNA-seq (Zhu Z. et al., 2018).

However, only very few studies have systematically investigated ncRNAs during the HF cycle. This study established a HF cycle synchronization model in the rabbit, allowing an integrated analysis of ncRNAs and mRNAs expressed during the different HF cycle phases (anagen, catagen, and telogen). Numerous essential factors related to the HF cycle have been uncovered, contributing to the understanding of HF cycle regulation and suggesting new potential therapies for hair-related diseases.

# MATERIALS AND METHODS

# Animals

Twelve 6-month-old male Wanxi Angora rabbits were used to establish the HF synchronization model. They were all housed under the same conditions, including temperature, and were fed the same diet (feed pellets and grass). Animals were reared in a controlled environment and had the same hair coat length phenotype. The experimental procedures in this study were approved by the Animal Care and Use Committee of Yangzhou University.

To estimate the wool growth rate and to determine the onset of the anagen phase, the dorsal area of experimental animals was shaved with electronic clippers and entry into anagen was determined by the appearance of light pink skin and by hair regrowth. The length of the hair coat was measured, skin samples were collected after shaving, samples were fixed in 4% formaldehyde, and paraffin sections were stained with hematoxylin–eosin (HE) for histological observations. Longitudinal sections of the HFs showed the skin status and the phase of the HF cycle.

# Tissue Collection

Rabbits were anesthetized via ear vein injection of 0.7% pentobarbital sodium (6 mL/kg). Dorsal skin samples (1 cm<sup>2</sup>) were collected and placed immediately in liquid nitrogen for RNA extraction. Iodine solution was applied to the wound to prevent bacterial infection. Samples were harvested at different phases of the HF cycle for gene expression profiling: growth (anagen), cessation (catagen), and rest (telogen). Three sample replicates were collected at each of days 90, 130, and 150 of the HF cycle for ncRNA and mRNA sequencing analysis.

# RNA Isolation and RNA Quantification

Total RNA from nine samples was extracted from skin tissue using Trizol reagent (Invitrogen, Carlsbad, CA, United States), according to the manufacturer's instructions. RNA degradation and contamination were monitored by running samples on 1% agarose gels. RNA purity was analyzed via a NanoPhotometer® spectrophotometer (IMPLEN, CA, United States). RNA concentration was measured using the Qubit® RNA Assay Kit and a Qubit® 2.0 Fluorometer (Life Technologies, CA, United States). RNA integrity was assessed via the RNA Nano 6000 Assay Kit and a Bioanalyzer 2100 system (Agilent Technologies, CA, United States). lncRNAs and miRNAs were quantified following the same procedure used for conventional mRNAs. Quantification of circRNAs was performed by adding an exonuclease to degrade non-circRNAs. Briefly, two samples containing the same amount of RNA were collected. In one sample, linear RNA was digested with RNase R (Cat. No. RNR07250, Epicentre Company, United States), leaving only the circRNAs, while the other sample was not treated with RNase R. The two RNA samples were reverse transcribed. The samples subjected to RNase R treatment were used to detect circRNAs, whereas the untreated samples were used to detect β-actin.

# Library Construction for lncRNA and circRNA Sequencing

A total of 3 µg of RNA per sample was used for lncRNA sequencing and 5 µg for circRNA sequencing. First, ribosomal RNAs were removed with the Epicentre Ribo-Zero™ rRNA Removal Kit (Epicentre, United States) and the rRNA-depleted samples were purified by ethanol precipitation. Subsequently, sequencing libraries were generated using the rRNA-depleted RNA and the NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina® (NEB, United States), following the manufacturer's recommendations. First strand cDNA was synthesized using random hexamer primers and M-MuLV Reverse Transcriptase (RNase H–). Second strand cDNA synthesis was performed using DNA Polymerase I and RNase H. After adenylation of the 3′ ends of DNA fragments, NEBNext Adaptors with a hairpin loop structure were ligated to prepare for hybridization. To select cDNA fragments with a preferential length of 150–200 bp, the library fragments were purified with the AMPure XP system (Beckman Coulter, Beverly, MA, United States). Then, 3 µl of USER Enzyme (NEB, United States) was used with size-selected, adaptor-ligated cDNA before the PCR. Finally, the PCR products were purified (AMPure XP system) and the library quality was assessed with the Agilent Bioanalyzer 2100 system.

# Library Construction for Small RNA Sequencing

A total of 3 µg of RNA per sample was used as input material for the small RNA library. Sequencing libraries were generated using the NEBNext® Multiplex Small RNA Library Prep Set for Illumina® (NEB, United States), following the manufacturer's recommendations. Briefly, the NEB 3′ SR Adaptor was directly and specifically ligated to the 3′ end of miRNAs, siRNAs, and piRNAs. After the 3′ ligation reaction, the SR RT Primer was hybridized to the excess of 3′ SR Adaptor, transforming the single-stranded DNA adaptor into a double-stranded DNA molecule. Then, the 5′ end adaptor was ligated to the 5′ ends of the miRNAs, siRNAs, and piRNAs. First strand cDNA was synthesized using M-MuLV Reverse Transcriptase (RNase H–). DNA fragments of 140–160 bp length (the length of small non-coding RNAs plus the 3′ and 5′ adaptors) were recovered and dissolved in 8 µL of elution buffer. Finally, the library quality was assessed using the Agilent Bioanalyzer 2100 system and DNA High Sensitivity Chips.

# Clustering and Sequencing of lncRNAs, circRNAs, and miRNAs

Clustering of the index-coded samples was performed on a cBot Cluster Generation System using the TruSeq PE Cluster Kit v3-cBot-HS (Illumina), according to the manufacturer's instructions. After cluster generation, the lncRNA and circRNA libraries were sequenced on an Illumina HiSeq 4000 platform and 150 bp paired-end reads were generated. The miRNA library was sequenced on an Illumina HiSeq 2500 platform and 50 bp single-end reads were generated.

# Quality Control

For lncRNA and circRNA sequencing, raw data (raw reads) in fastq format were first processed through in-house Perl scripts. In this step, clean data (clean reads) were obtained by removing reads containing adapter sequences, reads containing poly-N, and low-quality reads from the raw data. For miRNA sequencing, raw data (raw reads) in fastq format were first processed through custom Perl and Python scripts. In this step, clean data (clean reads) were obtained by removing reads containing poly-N, reads with 5′ adapter contaminants, reads without the 3′ adapter or the insert tag, reads containing poly-A, -T, -G, or -C, and low-quality reads from the raw data. At the same time, the Q20 and Q30 scores and the GC content of the raw data were calculated. A specific length range from the clean reads was selected to conduct all the downstream analyses, based on clean data of high quality.
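The filtering and summary statistics described above can be illustrated with a minimal Python sketch. This is not the authors' in-house script: the function names, the Phred+33 quality encoding, and all cutoffs (fraction of N bases, quality threshold, fraction of low-quality bases) are illustrative assumptions.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def summarize_reads(reads):
    """Return Q20 %, Q30 % and GC % over an iterable of (sequence, quality) pairs."""
    total_bases = q20 = q30 = gc = 0
    for seq, qual in reads:
        scores = phred_scores(qual)
        total_bases += len(seq)
        q20 += sum(s >= 20 for s in scores)
        q30 += sum(s >= 30 for s in scores)
        gc += sum(base in "GCgc" for base in seq)
    if total_bases == 0:
        return {"Q20": 0.0, "Q30": 0.0, "GC": 0.0}
    return {
        "Q20": 100.0 * q20 / total_bases,
        "Q30": 100.0 * q30 / total_bases,
        "GC": 100.0 * gc / total_bases,
    }


def is_clean(seq, qual, adapter, max_n_frac=0.1, min_q=5, max_lowq_frac=0.5):
    """Hypothetical filter: reject adapter-containing, poly-N and low-quality reads.

    All thresholds are assumed values chosen for illustration only.
    """
    if not seq:
        return False
    if adapter and adapter in seq:
        return False
    if seq.count("N") / len(seq) > max_n_frac:
        return False
    scores = phred_scores(qual)
    low_quality = sum(s <= min_q for s in scores)
    return low_quality / len(scores) <= max_lowq_frac
```

In practice, dedicated tools are normally used for this step; the sketch only makes explicit what "Q20", "Q30", and "GC content" measure and what kind of reads are discarded.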

# Genome Mapping, Transcriptome Assembly, and ncRNAs Identification

For lncRNA and circRNA sequences, the reference genome (Oryctolagus cuniculus genome obtained from Ensembl OryCun2.0) and annotation files were directly downloaded from the genome website. An index of the reference genome was built using bowtie2 (Langmead and Salzberg, 2012), and paired-end clean reads were aligned to the reference genome using HISAT2 v2.0.4 (Pertea et al., 2016). Also, the small RNA tags were mapped to the reference sequence with bowtie2 (Langmead and Salzberg, 2012) without mismatch to analyze the expression and distribution of miRNA sequences in the reference genome.

The mapped lncRNA and mRNA reads from each sample were assembled by means of StringTie (v1.3.1) (Pertea et al., 2016), following a reference-based approach. The circRNAs were detected and identified using find\_circ (Memczak et al., 2014). Alignment of the small RNA tags to miRBase 20.0 identified known Oryctolagus cuniculus and Mus musculus (near-source species) miRNAs. miRDeep2 software (Friedländer et al., 2011) was used to identify potentially novel miRNAs and to draw the secondary structures and the characteristics of the hairpin structures of miRNA precursors.

# Quantification of lncRNA, circRNA, mRNA, and miRNA Expression Levels

Cuffdiff (v2.1.1) was used to calculate fragments per kilobase of exon per million fragments mapped (FPKM) for both lncRNAs and mRNAs in each sample (Trapnell et al., 2010). FPKMs of genes were computed by summing the FPKMs of the transcripts in each gene group. For circRNAs, raw read counts were normalized to transcripts per million (TPM) (Zhou et al., 2010): normalized expression level = (read count × 1,000,000)/library size, where the library size is the sum of circRNA read counts. miRNA expression levels were likewise estimated as TPM: normalized expression = (mapped read count/total reads) × 1,000,000. The differential expression of ncRNAs was determined using the DESeq R package (1.10.1) (Wang et al., 2010).
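The normalization formulas above can be written out as a short Python sketch. The function names are hypothetical and the code only restates the arithmetic; it does not reproduce Cuffdiff's statistical model for estimating transcript abundances.

```python
def per_million(read_counts):
    """Scale raw read counts to counts per million of the library size.

    For circRNAs the library size is the sum of circRNA read counts, so the
    dictionary passed in should contain circRNA counts only.
    """
    lib_size = sum(read_counts.values())
    if lib_size == 0:
        raise ValueError("library size is zero")
    return {name: count * 1_000_000 / lib_size for name, count in read_counts.items()}


def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def gene_fpkm(transcript_fpkms):
    """Gene-level FPKM as the sum of the FPKMs of the gene's transcripts."""
    return sum(transcript_fpkms)
```

For example, 100 fragments mapped to a 2,000 bp transcript in a library of 10 million mapped fragments correspond to an FPKM of 5.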

# Target Gene Prediction, GO, and KEGG Enrichment Analysis

In cis regulation, lncRNAs can act on neighboring target genes. Coding genes located within 10 kb/100 kb upstream or downstream of each lncRNA gene were searched for and their function was analyzed. For trans regulation, lncRNAs and their target genes were analyzed based on their expression levels. The correlation between lncRNA and coding gene expression levels was calculated with custom scripts; then, the genes from different samples were clustered using WGCNA (Langfelder and Horvath, 2008) to search for common expression modules, whose function was analyzed by functional enrichment analysis. The target genes of miRNAs and miRNA target sites in exons of circRNA loci were identified using miRanda (version 3.3a, main parameters: -sc 140; -en -10; -scale 4; -strict) (Enright et al., 2004). Differentially expressed (DE) ncRNAs were annotated by gene ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses to investigate their biological functions. Briefly, GO analysis was applied to elucidate genetic regulatory networks of interest by forming hierarchical categories according to the molecular function (MF), cellular component (CC), and biological process (BP) aspects of the differentially expressed genes<sup>1</sup>. KEGG pathway analysis was performed to explore the significantly enriched pathways of DE genes<sup>2</sup>.
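The cis-target search described above amounts to an interval query around each lncRNA. A minimal Python sketch is given below, assuming that both lncRNAs and coding genes are represented as (chromosome, start, end) tuples; the function name and the data layout are hypothetical and do not reflect the authors' custom scripts.

```python
def cis_targets(lncrna, genes, window=100_000):
    """Return the names of coding genes lying within `window` bp of a lncRNA.

    `lncrna` is a (chrom, start, end) tuple and `genes` maps a gene name to a
    (chrom, start, end) tuple. Overlapping genes count as distance 0.
    """
    chrom, start, end = lncrna
    hits = []
    for name, (g_chrom, g_start, g_end) in genes.items():
        if g_chrom != chrom:
            continue
        # Gap between the two intervals; 0 when they overlap.
        gap = max(g_start - end, start - g_end, 0)
        if gap <= window:
            hits.append(name)
    return sorted(hits)
```

Calling the function with `window=10_000` or `window=100_000` corresponds to the two colocalization thresholds mentioned in the text.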

# Quantitative Real-Time PCR

Eight mRNAs, four lncRNAs, and five circRNAs associated with skin and the HF cycle were selected for validation by qRT-PCR analysis. Approximately 1 µg of total RNA was used to synthesize cDNA using HiScript II Q Select RT SuperMix for qPCR (Vazyme). qRT-PCR was performed using the AceQ qPCR SYBR® Green Master Mix (Vazyme), according to the manufacturer's instructions, and data were analyzed via QuantStudio® 5 (Applied Biosystems). The specific primer sequences are listed in **Supplementary Table S1**. The expression levels were calculated using the 2<sup>−ΔΔCt</sup> method (Schmittgen and Livak, 2008), with glyceraldehyde 3-phosphate dehydrogenase (GAPDH) as reference gene.

<sup>2</sup>http://www.genome.jp/kegg/

To confirm the miRNA transcriptome data, three miRNAs were selected for qRT-PCR analysis. Approximately 2 µg of total RNA was used to synthesize cDNA after adding a poly(A) tail to the 3′ end of the miRNAs using the miRcute Plus miRNA First-Strand cDNA Synthesis Kit (Tiangen). qRT-PCR was performed using the miRcute miRNA qPCR Detection Kit (SYBR Green), according to the manufacturer's instructions. The specific primers were designed by Beijing Tiangen Co., Ltd. and the product code sets are listed in **Supplementary Table S1**. The U6 small nuclear RNA gene was chosen as internal control. The expression levels were calculated using the 2<sup>−ΔΔCt</sup> method (Schmittgen and Livak, 2008), and the results of the experiments were normalized to the expression levels of the constitutively expressed U6 gene.
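For readers unfamiliar with the comparative Ct method used for both the mRNA/lncRNA/circRNA and the miRNA assays, the calculation can be sketched as follows. The function name and the Ct values in the usage note are illustrative assumptions, not data from this study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the comparative Ct (2^-ddCt) method.

    dCt  = Ct(target) - Ct(reference gene, e.g., GAPDH or U6)
    ddCt = dCt(sample) - dCt(control/calibrator)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return 2 ** (-delta_delta)
```

With hypothetical Ct values of 24 (target) and 18 (reference) in the sample and 26 and 18 in the calibrator, ΔΔCt is −2 and the target is expressed fourfold higher in the sample.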

# Construction of ncRNAs Regulatory Networks

To investigate the roles of, and interactions between, ncRNAs and mRNAs during the HF cycle, ncRNA regulatory networks were constructed. For the lncRNA–miRNA interaction network, DE lncRNAs were filtered according to the homology between lncRNAs and miRNA precursors; then, the targeting relationships between lncRNAs and miRNAs were predicted by miRanda. The regulatory networks of lncRNA–miRNA–mRNA pairs and circRNA–miRNA–mRNA pairs were then constructed according to the following steps: (i) the ncRNAs and mRNAs that were upregulated or downregulated were retained; (ii) the interactions of lncRNA–miRNA, miRNA–mRNA, and miRNA–circRNA were predicted by miRanda, which predicts miRNA seed-sequence binding sites, and lncRNAs, circRNAs, and mRNAs sharing the same miRNA binding site were identified; (iii) the lncRNA–miRNA–mRNA network covered two cases: upregulated lncRNA, downregulated miRNA, and upregulated mRNA; or downregulated lncRNA, upregulated miRNA, and downregulated mRNA. The circRNA–miRNA–mRNA network likewise covered two cases: upregulated circRNA, downregulated miRNA, and upregulated mRNA; or downregulated circRNA, upregulated miRNA, and downregulated mRNA. Cytoscape software was used to build and visually display the networks.
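The direction filter in step (iii) can be made concrete with a small Python sketch. It assumes that predicted interaction pairs (for example, from miRanda) and the direction of differential expression of every RNA are already available; the function name, the data layout, and the RNA names used in any call are hypothetical.

```python
def cerna_triplets(sponge_mirna_pairs, mirna_mrna_pairs, direction):
    """Build (sponge, miRNA, mRNA) triplets that follow the ceRNA pattern.

    `sponge_mirna_pairs` lists predicted (lncRNA or circRNA, miRNA) pairs,
    `mirna_mrna_pairs` lists predicted (miRNA, mRNA) pairs, and `direction`
    maps each differentially expressed RNA to "up" or "down". RNAs missing
    from `direction` (not differentially expressed) are skipped.
    """
    targets = {}
    for mirna, mrna in mirna_mrna_pairs:
        targets.setdefault(mirna, []).append(mrna)

    triplets = []
    for sponge, mirna in sponge_mirna_pairs:
        for mrna in targets.get(mirna, []):
            d_sponge = direction.get(sponge)
            d_mirna = direction.get(mirna)
            d_mrna = direction.get(mrna)
            if None in (d_sponge, d_mirna, d_mrna):
                continue
            # Sponge and mRNA change together; the miRNA changes the opposite way.
            if d_sponge == d_mrna and d_mirna != d_sponge:
                triplets.append((sponge, mirna, mrna))
    return triplets
```

The resulting triplets could then be exported as an edge list for visualization in a network tool such as Cytoscape.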

# Luciferase Assay

The dual-luciferase reporter system E1910 (Promega, Madison, WI, United States) was used to perform luciferase activity assays. The miR-320-3p mimic and miR-320-3p negative control mimics were purchased from Shanghai GenePharma Co., Ltd. Wild-type luciferase reporter vectors (pMir-HTATIP2-3'UTR-WT, pMir-LNC\_002919-WT, and pMir-novel\_circ\_0026326-WT) were constructed using the primers shown in **Supplementary Table S2**. Their substitution mutants (pMir-HTATIP2-3'-UTR-MUT, pMir-LNC\_002919-MUT, and pMir-novel\_circ\_0026326-MUT) were synthesized by Beijing Tsingke Co., Ltd. Briefly, rabbit skin fibroblast cells (RAB-9, ATCC® CRL-1414™) were cultured in 24-well tissue culture plates. Cells were co-transfected with the pMir-report luciferase reporter, the miRNA (miR-320-3p) mimics and pRL-TK using Lipofectamine™ 2000 (Invitrogen). After 48 h of culture at 37°C, transfected cells were lysed with 100 µl of passive lysis buffer. Next, 20 µl of lysate was mixed with 100 µl of LAR II, and firefly luciferase activity was measured using a luminometer. As an internal control, 100 µl of Stop & Glo reagent was then added to the sample to measure Renilla luciferase activity. Firefly luciferase activity was normalized to the corresponding Renilla luciferase activity.

<sup>1</sup>http://www.geneontology.org
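The normalization of the dual-luciferase readout described in the Luciferase Assay section can be written as a one-line calculation. In the sketch below, expressing the result relative to a negative-control well is an added assumption for illustration, and the function name is hypothetical.

```python
def relative_luciferase(firefly, renilla, control_firefly, control_renilla):
    """Firefly/Renilla ratio of a sample relative to that of a control well."""
    if renilla <= 0 or control_renilla <= 0 or control_firefly <= 0:
        raise ValueError("luminescence readings must be positive")
    sample_ratio = firefly / renilla
    control_ratio = control_firefly / control_renilla
    return sample_ratio / control_ratio
```

A value below 1 for a wild-type reporter co-transfected with the miRNA mimic, together with a value close to 1 for the corresponding mutant reporter, would be the pattern expected for a direct miRNA binding site.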

# RESULTS

# Hair Follicle Cycle Synchronization Model

For the HF cycle synchronization model, Angora rabbits were used. The obtained observations showed that the length of the hair coat increased steadily until day 110. Between days 120 and 150, the growth rate of wool declined rapidly. Then, between days 160 and 180, the wool recovered and once again showed an increased growth rate (**Figure 1A**). Histological analysis showed rapid growth of the hair shaft and increasing depth of the HF between days 0 and 110. Then, the growth of the hair shaft and the depth of the HF decreased between days 120 and 130. Finally, the hair shaft started to fall off and the hair bulbs atrophied between days 140 and 150. After the HF cycle ended, a new HF appeared, the growth of the hair shaft recovered and the HFs moved into a new cycle (**Figure 1B**). In conclusion, the hair cycle of Angora rabbits is characterized by an anagen phase between days 0 and 110, a catagen phase between days 120 and 130, and a telogen phase between days 140 and 150.

# Differentially Expressed lncRNAs, mRNAs, miRNAs, and circRNAs

A summary of the lncRNA-seq, miRNA-seq, and circRNA-seq data from the three HF cycle phases is shown in **Supplementary Table S3**, indicating the relatively high quality of the transcriptome data. The lncRNA-seq, miRNA-seq, and circRNA-seq data were deposited in the Short Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under the bioproject numbers PRJNA479733, PRJNA495446, and PRJNA495449. DE ncRNAs and mRNAs were analyzed using Cuffdiff software with a criterion of p < 0.05. Volcano plots, clustering maps, and Venn diagrams were used to illustrate the distribution of the DE ncRNAs and mRNAs between the three groups (**Figures 2**–**5**). **Table 1** summarizes the number of DE ncRNAs and mRNAs. Differential expressions of 111 lncRNAs (60 upregulated and 51 downregulated), 247 circRNAs (128 upregulated and 119 downregulated), 97 miRNAs (38 upregulated and 59 downregulated), and 1,168 mRNAs (750 upregulated and 418 downregulated) were found between the three HF cycle stages. Complete information on all DE lncRNAs, circRNAs, miRNAs, and mRNAs is listed in **Supplementary Tables S4**–**S7**. Several lncRNAs were found to be associated with the HF cycle, such as LNC\_002694, LNC\_002919, LNC\_003354, LNC\_003790, LNC\_008354, LNC\_008931, and LNC\_005484, which could regulate gene expression by recognizing their target mRNAs. Based on analysis of their biological function, the candidate lncRNAs associated with the HF cycle are listed in **Supplementary Table S8**. Moreover, analysis of the relationships between circRNAs and genes allowed identification of novel\_circ\_0004876, novel\_circ\_0005177, novel\_circ\_0026326, novel\_circ\_0034968, and novel\_circ\_0036671, which may play a role during the HF cycle. 
In addition, several miRNAs, including miR-128-3p, miR-200a-3p, miR-27a-3p, miR-30e-5p, and miR-320-3p, and several mRNAs, such as BMP2, CSNK2B, KRT17, LAMB1, FZD4, SMAD2, HTATIP2, and SIAH1, were identified as playing pivotal roles during the HF cycle and during skin development.

# Validation of Differentially Expressed lncRNAs, circRNAs, miRNAs, and mRNAs by qPCR

To validate the differential expression results for lncRNAs, mRNAs, miRNAs, and circRNAs, the relative expression of four DE lncRNAs (LNC\_002694, LNC\_002919, LNC\_003354, and LNC\_005484), five DE circRNAs (novel\_circ\_0004876, novel\_circ\_0005177, novel\_circ\_0026326, novel\_circ\_0034968, and novel\_circ\_0036671), four DE miRNAs (miR-128-3p, miR-200a-3p, miR-27a-3p, and miR-320-3p), and eight DE mRNAs (BMP2, CSNK2B, FAM45A, FUOM, HTATIP2, KRT17, ME1, and SIAH1) was measured by qRT-PCR (**Figures 6**–**9**). The qRT-PCR results were consistent with the transcriptome sequencing data.

FIGURE 2 | Changes in lncRNA expression during the Angora rabbit hair follicle cycle. (A–C) Volcano plots showing up- and down-regulated lncRNAs between days 90, 130, and 150 of the hair follicle cycle. (D) Venn diagram showing the number of overlapping differentially expressed lncRNAs between days 90, 130, and 150. (E) Heat map of lncRNAs showing hierarchical clustering of DE lncRNAs between days 90, 130, and 150. Up- and down-regulated lncRNAs are shown in red and blue, respectively.

FIGURE 3 | Changes in circRNA expression during the Angora rabbit hair follicle cycle. (A–C) Volcano plots showing up- and down-regulated circRNAs between days 90, 130, and 150 of the hair follicle cycle. (D) Venn diagram showing the number of overlapping differentially expressed circRNAs between days 90, 130, and 150. (E) Heat map of circRNAs showing hierarchical clustering of DE circRNAs between days 90, 130, and 150. Up- and down-regulated circRNAs are shown in red and blue, respectively.

# GO and KEGG Pathway Analysis

lncRNAs can regulate neighboring protein-coding genes; therefore, a colocalization threshold of 100 kb upstream or downstream of lncRNAs was set for the GO and KEGG analyses. Several GO terms were found that were significantly enriched in the three experimental groups (**Supplementary Table S9**), including skin and HF-related GO terms like HF development (GO: 0001942), hair cycle (GO: 0042633), hair cycle process (GO: 0022405), regulation of HF development (GO: 0051797), and skin morphogenesis (GO: 0043589), among others. The top 20 KEGG pathways associated with DE lncRNAs between days 90, 130, and 150 of the HF cycle based on the function of colocalized mRNAs (**Supplementary Figure S1**) and co-expressed mRNAs (**Supplementary Figure S2**) included the Wnt signaling pathway, TGF-β signaling pathway, MAPK signaling pathway, and JAK/STAT signaling pathway.

In addition, based on the relationship between circRNAs and genes, GO analysis of genes producing DE circRNAs was performed (**Supplementary Table S10**). The GO terms identified HF development (GO: 0001942), hair cycle process (GO: 0022405), hair cycle (GO: 0042633), and skin development (GO: 0043588), which were all related to skin and HF development. The top 20 KEGG pathways associated with genes producing DE circRNAs between 90, 130, and 150 days (**Supplementary Figure S3**) of the HF cycle were likewise related to skin and HF development, such as the Hedgehog signaling pathway, Wnt signaling pathway, and MAPK signaling pathway.

Furthermore, GO enrichment analysis of genes targeted by DE miRNA (**Supplementary Table S11**) identified GO terms related to HF development, such as HF morphogenesis (GO: 0031069), negative regulation of HF development (GO: 0051799), and regulation of HF development (GO: 0051797), among others. The top 20 KEGG pathways associated with DE miRNAs are shown in **Supplementary Figure S4**. They include pathways related to HF cycle, such as the Hedgehog signaling pathway, NF-κB signaling pathway, and JAK/STAT signaling pathway.

Finally, GO and KEGG analyses of DE mRNAs are shown in **Supplementary Table S12**. The GO terms identified include, for example, skin morphogenesis (GO: 0043589) and positive regulation of HF development (GO: 0051798). The top 20 enriched KEGG pathways for DE genes between the different stages of the HF cycle are shown in **Supplementary Figure S5**. These KEGG pathways include the Wnt signaling pathway, the MAPK signaling pathway, and the TGF-β signaling pathway, which participate in skin development and the HF cycle. Differentially expressed genes between days 90, 130, and 150 of the HF cycle, as well as their biological functions, are listed in **Supplementary Table S13**.

TABLE 1 | Summary of the number of differentially expressed ncRNAs and mRNAs.

# ceRNA Regulatory Networks

Studying the relationship between ncRNAs and mRNAs may increase our understanding of the molecular mechanisms operating during skin development and the HF cycle. According to the competing endogenous RNA (ceRNA) regulatory hypothesis, ncRNAs and mRNAs can compete for the same miRNAs, resulting in additional layers of regulation of gene expression. Based on the analysis of DE lncRNAs, circRNAs, miRNAs, and mRNAs, a network of lncRNAs and miRNAs was first constructed (**Figure 10**). In lncRNA-miRNA-mRNA regulatory networks, the miRNA may act as the center, the lncRNA as the decoy, and the mRNA as the target, which suggests that lncRNAs could act as miRNA sponges to regulate gene expression (**Figure 11**). In addition, certain circRNAs can competitively bind miRNAs and act as miRNA sponges; therefore, circRNA-miRNA-mRNA triads were constructed with the circRNA as the decoy, the miRNA as the center, and the mRNA as the target (**Figure 12**).

LNC\_002919 and novel\_circ\_0026326 were identified as ceRNAs for miR-320-3p, which targets HTATIP2. A dual-luciferase reporter system was used to verify the binding relationships between the identified lncRNA and miRNA, circRNA and miRNA, and mRNA and miRNA. The luciferase assay showed that miR-320-3p could decrease luciferase activity by binding to sites on LNC\_002919, novel\_circ\_0026326, and the HTATIP2 3′ UTR (**Figure 13**). The interactions between ncRNAs and mRNA suggest the existence of novel regulatory mechanisms during skin development and the HF cycle.
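As a rough illustration of how decoy-miRNA-target triads of this kind can be assembled, the sketch below joins predicted miRNA-ncRNA and miRNA-mRNA pairs on the shared miRNA. The function name and the input format are assumptions made for this example; the pairs for miR-320-3p are those reported above, while `miR-hypothetical` and `LNC_hypothetical` are placeholders. It is not the network-construction software used in the study.

```python
from collections import defaultdict


def build_cerna_triads(mirna_ncrna_pairs, mirna_mrna_pairs):
    """Join (miRNA, ncRNA) and (miRNA, mRNA) pairs on the shared miRNA,
    returning sorted (ncRNA decoy, miRNA center, mRNA target) triads."""
    targets = defaultdict(set)
    for mirna, mrna in mirna_mrna_pairs:
        targets[mirna].add(mrna)

    triads = set()
    for mirna, ncrna in mirna_ncrna_pairs:
        # A miRNA without any mRNA target contributes no triad
        for mrna in targets.get(mirna, ()):
            triads.add((ncrna, mirna, mrna))
    return sorted(triads)


# Pairs for miR-320-3p as reported in the text; the last pair is a placeholder
mirna_ncrna = [
    ("miR-320-3p", "LNC_002919"),
    ("miR-320-3p", "novel_circ_0026326"),
    ("miR-hypothetical", "LNC_hypothetical"),
]
mirna_mrna = [("miR-320-3p", "HTATIP2")]

for triad in build_cerna_triads(mirna_ncrna, mirna_mrna):
    print(" -> ".join(triad))
```

In practice, a ceRNA analysis would also filter triads by expression trends (for example, decoy and target changing in the same direction); that step is omitted here.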

# DISCUSSION

The HF cycle is similar in most mammalian species, and many animal models have been used to study the process of hair growth, including mice (Wolbach, 1951; Chase, 1954), rats (Johnson and Ebling, 1964), monkeys (Uno, 1991), cats (Hendriks et al., 1997), and sheep (Hynd et al., 1986). In mice, the hair growth period lasts only 17–19 days, and anterior regions can enter the resting period before the posterior regions regrow (Chase, 1954). By plucking the hairs of rats, the first wave of hair growth was observed between 22 and 31 days, and HFs from resting clubs were collected at 55 days of age (Johnson and Ebling, 1964). Although animal HFs show a circannual rhythm, the HF cycles producing sheep wool, horse mane, and human scalp hair have special characteristics, including a biological clock that is independent of day and night, season, and temperature over a period of 2–6 years (Stenn and Paus, 2001). The structure, composition, and growth of hair fibers are similar between Angora rabbits and other rabbit breeds. However, a mutation in Angora rabbits prolongs the anagen phase, which lasts approximately 5 weeks in New Zealand white rabbits but more than 3 months in Angora rabbits (Moore et al., 1987). The HF clock in Angora rabbits has its own characteristic chronobiology, with a long growing period and independence from seasons and temperature. This study established a synchronization model for hair growth in Angora rabbits. The HFs initiated vigorous growth after the dorsal area was shaved, and measurement of the length of the hair coat together with histological analysis showed that the growth phase lasted about 110 days, the regression period started at about 120 days, and the resting period at about 150 days. The HF synchronization model can contribute to research on the chronobiology of HFs. ncRNAs are epigenetic, translational, and genetic regulators that may play a role in numerous biological processes in eukaryotes (Mattick and Makunin, 2006). ncRNAs could play complicated and vital roles during the hair cycle; investigation of the regulatory and functional interactions between lncRNAs, circRNAs, miRNAs, and mRNAs may increase understanding of this biological process.

FIGURE 9 | Validation of mRNA differential expression results at 90, 130, and 150 days. qRT-PCR validation of BMP2, CSNK2B, FAM45A, FUOM, HTATIP2, KRT17, ME1, and SIAH1 mRNA expression levels in skin samples between 90, 130, and 150 days. The mRNA expression levels at 130 days and 150 days were normalized to the value at 90 days. Error bars indicate the mean ± SD of triplicate experiments. ∗P < 0.05; ∗∗P < 0.01.

The present study investigated ncRNAs and mRNAs that were significantly up-regulated or down-regulated during the three stages of the HF cycle. Recent studies have shown that DE lncRNAs modulate biological functions in dermal papilla cells, which regulate postnatal hair cycling and the HF cycle (Lin et al., 2015). Likewise, RNA-seq technology has been used for the analysis of lncRNAs and mRNAs during the initiation of sheep secondary HFs (Yue et al., 2016). In addition, miRNAs have been the focus of intense research for several years, and have been associated with HF morphogenesis and development (Mardaryev et al., 2010; Ahmed et al., 2014; Hochfeld et al., 2017). However, only a few studies have analyzed the involvement of circRNAs in skin development and the HF cycle. circRNAs can act as miRNA sponges, suppressing miRNA activity and resulting in increased RNA expression (Hansen et al., 2013). This study employed high-throughput sequencing for the analysis of DE ncRNAs in the HF during the different hair cycle stages, based on the synchronization model. A total of 111 lncRNAs, 247 circRNAs, 97 miRNAs, and 1,168 mRNAs were differentially expressed during the hair cycle stages. Moreover, several differentially expressed mRNAs were identified during hair cycling. As a dermal papilla signature gene, BMP2 is expressed in the hair matrix and can regulate HF cycling (Nakamura et al., 2003; Rendl et al., 2008). In this study, its expression in catagen was significantly decreased. A previous study reported that KRT17 is a key regulator of hair cycling that affects the anagen-catagen transition (Tong and Coulombe, 2006). The present results showed that KRT17, one of the identified candidate mRNAs, was highly expressed in catagen, both between days 130 and 90 and between days 150 and 90. Moreover, functional analysis of lncRNAs identified a colocalization relationship between LNC\_004603 and KRT17, which indicates that LNC\_004603 may be a potential regulator of hair cycling. Furthermore, miR-200a-3p, which has been shown to be preferentially expressed in the epidermis (Yi et al., 2006), was highly expressed in anagen. In addition, the expression of miR-128-3p increased significantly from day 90 to day 150, with high expression in telogen. In human HF mesenchymal stem cells, miR-128 could regulate cell differentiation by targeting SMAD2 (Wang et al., 2016).

Gene ontology analysis includes three domains describing the cellular and molecular roles of genes and gene products: molecular function (MF), cellular component (CC), and biological process (BP) (Harris, 2004). KEGG is a pathway database for the systematic analysis of gene function, linking genomic and functional information (Ogata et al., 2000). GO and KEGG were used to investigate the potential mechanisms of action of the DE ncRNAs in this study. The obtained results suggest that multiple signaling pathways form a complex regulatory network during skin and HF development. These include the Wnt signaling pathway, the Hedgehog signaling pathway, the TGF-β signaling pathway, the MAPK signaling pathway, the BMP signaling pathway, and the JAK/STAT signaling pathway. These signaling pathways have been previously reported to regulate HF morphogenesis and development (Andl et al., 2002; Mill et al., 2003; Jamora et al., 2005; Kulessa et al., 2014; Akilli Öztürk et al., 2015; Harel et al., 2015). Both SMAD2 and SIAH1 were enriched in the Wnt signaling pathway, and SMAD2 was upregulated at day 150 compared with day 90. In addition, SIAH1 decreased significantly from day 90 to day 130, but increased from day 130 to day 150, and was highly expressed when comparing day 150 to day 90. In the cashmere goat, SIAH1 and SMAD2 were significantly expressed during the telogen-anagen HF transition: SIAH1 was highly expressed from telogen to early anagen, and the expression of SMAD2 increased from telogen to late anagen (Liu et al., 2018). Functional analysis of lncRNAs identified a co-expression relationship between LNC\_002690 and SIAH1, indicating that LNC\_002690 might play a central role in hair cycling via regulation of SIAH1 expression. Hence, these genes and lncRNAs could be key candidates in the regulation of HF cycling.
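Enrichment of a GO term or KEGG pathway among a set of DE genes is commonly assessed with a one-sided hypergeometric test. The source does not state which statistic or software was used, so the following is only a minimal sketch of that general technique; the function name and the toy numbers are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """One-sided enrichment p-value P(X >= k).

    N: number of background genes
    K: background genes annotated to the term or pathway
    n: number of DE genes
    k: DE genes annotated to the term or pathway
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


# Toy numbers: 10 background genes, 3 in the pathway, 3 DE genes, all 3 in the pathway
p = hypergeom_enrichment_p(N=10, K=3, n=3, k=3)
print(f"p = {p:.5f}")
```

Real analyses test many terms at once, so the resulting p-values would normally be corrected for multiple testing before the top pathways are reported.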

RNA transcripts are regulated by ceRNAs, which compete for the binding of shared miRNAs. miRNA response elements (MREs) are sequences where miRNAs can bind and repress target gene expression. Acting as miRNA sponges, pseudogenes, lncRNAs, circRNAs, and mRNAs can suppress miRNA function through shared MREs (Salmena et al., 2011). Therefore, to try to understand the role of ncRNAs during the HF cycle, lncRNA–miRNA–mRNA and circRNA–miRNA–mRNA regulatory networks were constructed. LNC\_002919 and novel\_circ\_0026326 acted as sponges for miR-320-3p, which targets HTATIP2. MiR-320-3p has been reported to either directly or indirectly target genes that regulate the cell cycle and differentiation of the HF (Liu et al., 2013). HTATIP2 was highly expressed during the catagen and telogen phases, suggesting that HTATIP2 could inhibit cellular activities during the hair cycle. Decreased or absent HTATIP2 activity modulated through JAK-STAT3 signaling has been shown to play an important role in certain cellular processes. Furthermore, the study shows a link between the JAK-STAT signaling pathway and hair growth (Zhang et al., 2012; Harel et al., 2015). In this analysis of DE lncRNAs, a relationship was found between LNC\_002919 and KRTAP11-1, suggesting that LNC\_002919 could modulate KRTAP11-1 expression. KRTAP11-1 influences keratin-bundle assembly and can regulate the physical properties of hair (Fujimoto et al., 2014). Therefore, LNC\_002919 could be a potent regulator of the HF cycle. However, the molecular mechanisms underlying the regulation of HTATIP2 by LNC\_002919 and novel\_circ\_0026326, which may act as miR-320-3p sponges, need to be further explored.

# CONCLUSION

In summary, this study established a rabbit HF synchronization model and investigated the lncRNA, circRNA, miRNA, and mRNA expression profiles by transcriptome analysis of samples collected at different stages of the HF cycle. GO and KEGG pathway enrichment analyses were carried out to identify candidate ncRNAs and mRNAs involved in the regulation of the HF cycle. In addition, ceRNA networks were constructed, which may be active during the HF cycle. These results provide a basis for an improved understanding of the mechanisms underlying the HF cycle.

# DATA AVAILABILITY

The lncRNA-seq, miRNA-seq, and circRNA-seq data generated for this study were deposited in the Short Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under the bioproject numbers PRJNA479733, PRJNA495446, and PRJNA495449.

# ETHICS STATEMENT

The experimental procedures in this study were approved by the Animal Care and Use Committee of Yangzhou University.

# AUTHOR CONTRIBUTIONS

BZ was responsible for the collection and analysis of results and wrote the manuscript. YC, SH, NY, and MW were responsible for construction of the hair follicle synchronization model. ML, JL, and YX carried out the experiments. BZ and XW designed the study and finalized the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the Modern Agricultural Industrial System Special Funding (CARS-43-A-1), the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD 2014-134), and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (XKYCX17\_059).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00407/full#supplementary-material

FIGURE S1 | Top 20 KEGG pathways based on mRNA colocalization with differentially expressed lncRNAs between 90, 130, and 150 days. (A) Scatterplot showing KEGG pathway enrichment between 130 and 90 days. (B) Scatterplot showing KEGG pathway enrichment between 150 and 90 days. (C) Scatterplot showing KEGG pathway enrichment between 150 and 130 days.

FIGURE S2 | Top 20 KEGG pathways based on mRNA co-expression with differentially expressed lncRNAs between 90, 130, and 150 days. (A) Scatterplot showing KEGG pathway enrichment between 130 and 90 days. (B) Scatterplot showing KEGG pathway enrichment between 150 and 90 days. (C) Scatterplot showing KEGG pathway enrichment between 150 and 130 days.

FIGURE S3 | Top 20 KEGG pathways associated with differentially expressed circRNAs between 90, 130, and 150 days. (A) Scatterplot showing KEGG pathway enrichment between 130 and 90 days. (B) Scatterplot showing KEGG pathway enrichment between 150 and 90 days. (C) Scatterplot showing KEGG pathway enrichment between 150 and 130 days.

FIGURE S4 | Top 20 KEGG pathways associated with differentially expressed miRNAs between 90, 130, and 150 days. (A) Scatterplot showing KEGG pathway enrichment between 130 and 90 days. (B) Scatterplot showing KEGG pathway enrichment between 150 and 90 days. (C) Scatterplot showing KEGG pathway enrichment between 150 and 130 days.

FIGURE S5 | Top 20 KEGG pathways associated with differentially expressed mRNAs between 90, 130, and 150 days. (A) Scatterplot showing KEGG pathway enrichment between 130 and 90 days. (B) Scatterplot showing KEGG pathway enrichment between 150 and 90 days. (C) Scatterplot showing KEGG pathway enrichment between 150 and 130 days.

TABLE S1 | Primers used for the quantitative real-time PCR analysis.

TABLE S2 | Primers used for construction of the luciferase reporter vector.

TABLE S3 | Summary of RNA sequencing for each sample.

TABLE S4 | Analysis of differentially expressed lncRNAs between 90, 130, and 150 days.

TABLE S5 | Analysis of differentially expressed circRNAs between 90, 130, and 150 days.

TABLE S6 | Analysis of differentially expressed miRNAs between 90, 130, and 150 days.

TABLE S7 | Analysis of differentially expressed mRNAs between 90, 130, and 150 days.

TABLE S8 | Differentially expressed lncRNAs associated with the hair follicle cycle by target prediction.

TABLE S9 | Gene ontology classification of differentially expressed lncRNAs between 90, 130, and 150 days.

TABLE S10 | Gene ontology classification of differentially expressed circRNAs between 90, 130, and 150 days.

TABLE S11 | Gene ontology classification of differentially expressed miRNAs between 90, 130, and 150 days.

TABLE S12 | Gene ontology classification of differentially expressed mRNAs between 90, 130, and 150 days.

TABLE S13 | Differentially expressed genes between 90, 130, and 150 days of the hair follicle cycle.

# REFERENCES

Chase, H. B. (1954). Growth of the hair. Physiol. Rev. 34:113.

Fuchs, E., and Segre, J. A. (2000). Stem cells: a new lease on life. Cell 100, 143–155.

Fujimoto, S., Takase, T., Kadono, N., Maekubo, K., and Hirai, Y. (2014). Krtap11-1, a hair keratin-associated protein, as a possible crucial element for the physical properties of hair shafts. J. Dermatol. Sci. 74, 39–47. doi: 10.1016/j.jdermsci.2013.12.006

specification and activation. Stem Cells 34, 1896–1908. doi: 10.1002/stem.2363



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhao, Chen, Hu, Yang, Wang, Liu, Li, Xiao and Wu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Coding or Noncoding, the Converging Concepts of RNAs

#### Jing Li and Changning Liu\*

CAS Key Laboratory of Tropical Plant Resource and Sustainable Use, Xishuangbanna Tropical Botanical Garden, The Innovative Academy of Seed Design, Chinese Academy of Sciences, Kunming, China

Technological advances over the past decade have unraveled the remarkable complexity of RNA. The identification of small peptides encoded by long non-coding RNAs (lncRNAs) as well as regulatory functions mediated by non-coding regions of mRNAs have further complicated our understanding of the multifaceted functions of RNA. In this review, we summarize current evidence pointing to dual roles of RNA molecules defined by their coding and non-coding potentials. We also discuss how the emerging roles of RNA transform our understanding of gene expression and evolution.

Keywords: messenger RNA, long noncoding RNA, coding potential, ribosome profiling, micropeptide

#### Edited by:

Philipp Kapranov, Huaqiao University, China

#### Reviewed by:

Florent Hubé, UMR7216 Epigénétique et Destin Cellulaire, France

Dieter August Wolf, Sanford Burnham Prebys Medical Discovery Institute, United States

#### \*Correspondence:

Changning Liu liuchangning@xtbg.ac.cn

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 24 February 2019 Accepted: 06 May 2019 Published: 22 May 2019

#### Citation:

Li J and Liu C (2019) Coding or Noncoding, the Converging Concepts of RNAs. Front. Genet. 10:496. doi: 10.3389/fgene.2019.00496

# INTRODUCTION

Benefiting from advances in science and technology, our understanding of the complexity of organisms is constantly increasing. The "central dogma" of molecular biology states that genetic information is typically processed from DNA to RNA to protein, and this decides cellular and organismal phenotype (Crick, 1970). In the past, RNAs, except for infrastructural RNAs (such as rRNAs and tRNAs), were commonly considered as an intermediate between DNA and proteins. However, over recent decades, the rapid development of high-throughput sequencing technologies has revealed the pervasive transcription of eukaryotic genomes (Okazaki et al., 2002; Carninci et al., 2005; Kapranov et al., 2007; Lander, 2011), thus revealing RNA-mediated gene regulation. The fact that most regulatory RNAs function without involvement in protein translation led us to re-examine the roles of RNAs in the development and evolution of higher organisms.

In higher organisms, only a small fraction of genetic transcripts (less than 3%) have the capability to encode proteins, despite pervasive transcription across genomes. This raises the question of whether the remaining non-protein-coding transcripts are transcriptional "noise" or contain more genetic information. Large-scale projects for the systematic annotation and functional characterization of genes (such as ENCODE and FANTOM) have reported that at least 80% of mammalian genomic DNA is actively transcribed and elaborately regulated, with the vast majority of this considered to be noncoding RNA (ncRNA) genes (Consortium, 2012; Hon et al., 2017). The numbers of ncRNA genes vary between species, and interestingly, the complexity of an organism is highly associated with the abundance of ncRNA genes but not protein-coding genes, implying the potential importance of ncRNAs (Rubin et al., 2000; Stover et al., 2000; Mattick, 2001; Venter et al., 2001; Kapusta and Feschotte, 2014). Among these, lncRNAs, which are defined as transcripts longer than 200 nucleotides with low/no protein-coding potential, represent a considerable proportion.

**Abbreviations:** 4E-BP, Eukaryotic translation initiation factor 4E (eIF4E)-binding protein; aa, amino acid; CCR2, C-C chemokine receptor type 2; ceRNAs, competing endogenous RNAs; circRNAs, circular RNAs; FBXW7, F-Box and WD repeat domain containing 7; Foxo, Forkhead box-o; HIST1H1C, histone cluster 1 H1 family member C; hnRNPC, heterogeneous nuclear ribonucleoprotein C; hTR, human telomerase RNA; IRES, internal ribosome entry site; lncRNA, long noncoding RNA; Mbl, Mannose-binding lectin; Mdm2, mouse double minute 2; MLN, myoregulin; MOTS-C, mitochondrial open reading frame of the 12S rRNA-c; MS, mass spectrometry; nt, nucleotide; ribo-seq, ribosome profiling sequencing; Scl, sarcolamban; SERCA, sarco/endoplasmic reticulum Ca<sup>2+</sup>-ATPase; sORF, small open reading frame; SR, sarcoplasmic reticulum; SRA, steroid receptor RNA activator; SRAP, steroid receptor RNA activator protein; Ube3a1, ubiquitin-protein ligase E3A; UTRs, untranslated regions; VEGF, vascular endothelial growth factor; ZNF609, zinc finger protein 609.

Long non-coding RNAs can regulate gene expression in various ways, including epigenetic, transcriptional, posttranscriptional, translational and protein location effects. Corresponding to functional diversity, the modes of action of lncRNAs are also quite varied. lncRNAs can recruit epigenetic factors to modify chromatin state (Rinn and Chang, 2012), assemble transcriptional machinery to trigger the initiation of transcription (Bonasio and Shiekhattar, 2014), or act as a structural organizer to participate in the formation of subcellular organelles (Naganuma and Hirose, 2013). Additionally, lncRNAs can complementarily bind with other forms of RNA molecules to modulate gene expression at transcriptional, post-transcriptional and translational levels, for example as a moderator of mRNA activity or a decoy/sponge for miRNA (Poliseno et al., 2010; Gong and Maquat, 2011; Bonasio and Shiekhattar, 2014; Tay et al., 2014; Yoon et al., 2014). Moreover, lncRNAs couple with proteins through particular structures to act as a location transferor, or to modulate enzyme activities (Wang and Chang, 2011).

Based on the "noncoding" definition, the modes of action of lncRNAs mentioned above are exerted primarily through ncRNAs. Intriguingly, recent bioinformatics analyses of large-scale data from ribosome-protected RNA fragments (ribosome profiling or ribo-profiling) have revealed that a considerable proportion of these transcripts contain sORFs and associate with ribosomes (Aspden et al., 2014; Ruiz-Orera et al., 2014; Anderson et al., 2015; Mackowiak et al., 2015; Olexiouk et al., 2016), suggesting that the coding potential of lncRNAs has been vastly underestimated. Several functional experiments have demonstrated that some lncRNAs can encode small peptides (named "micropeptides", with a length of less than 100 aa) that are involved in various biological processes, although this is rare (Hubé and Francastel, 2018). In addition, certain coding transcripts, such as TP53 mRNA, could also function as RNA, without translation to proteins, to regulate significant biological processes (Candeias, 2011; Kloc et al., 2011). Therefore, it seems reasonable to presume that the demarcation of RNA depending on its coding or noncoding status is somewhat blurred, and partially intertwined. That is, RNA roles are likely not tightly constrained (such as RNA functioning only as mRNA or ncRNA), but rather converge and overlap: lncRNAs can function by encoding small peptides, while mRNAs can use their special structural features, such as the 3′ UTR or 5′ UTR, to function (**Figure 1**).

In the present article, we will review current studies of the bilateral functionality of lncRNAs and mRNAs in terms of their coding potential, as well as the advancement of high-throughput techniques that would facilitate a deeper recognition of functional diversity of RNAs. This review will highlight the cases that illuminate the contrapositive roles between lncRNAs and mRNAs, and briefly discuss the biological significance of these discoveries for gene expression and evolution.

# LONG NONCODING RNAs ENCODE SMALL PEPTIDES/PROTEINS WITH REGULATORY FUNCTIONS

# Peptides/Proteins Encoded by Regular Long Noncoding RNAs

The original definition of lncRNAs concerns their low/noncoding potential. However, with accumulating evidence from bioinformatics and ribosome transcriptome profiling, lncRNAs have been shown to display strong ribosomal associations in many species, varying from plant to animal, indicating a potential coding capacity in lncRNA sORFs (Kageyama et al., 2011; Nam et al., 2016; Yeasmin et al., 2018). In recent years, several micropeptides derived from lncRNAs have been shown to be functional. We have summarized these micropeptides in **Table 1**.

Steroid receptor RNA activator (SRA) is a prototypic example of lncRNAs with both coding and noncoding products (Lanz et al., 1999; Mattick, 2003; Hubé et al., 2006, 2011). SRA was initially identified as a noncoding gene with multiple RNA isoforms, which is critical in many biological processes, such as acting as a co-activator of nuclear receptors and a regulator of steroid receptor-dependent gene expression (Hubé et al., 2006, 2011; Cooper et al., 2011). Interestingly, SRA can also encode a conserved SRAP, which, in turn, represses the transcriptional regulatory activity of the SRA1 gene by interacting with a specific SRA stem-loop (Emberley et al., 2003; Chooniedass-Kothari et al., 2006; Hubé et al., 2011). The switch between the coding and noncoding functions of the SRA gene is caused by alternative splicing (AS) of introns/exons (Colley and Leedman, 2011), suggesting the significance of AS events in the generation of bifunctional RNA.

Of note, among the small number of already-known functional micropeptides, a few are muscle-specific, and have been implicated in the regulation of the activities of SERCA (Anderson et al., 2016; Nelson et al., 2016; Matsumoto et al., 2017). For example, MLN, a 46-aa micropeptide specifically expressed in skeletal muscle, is encoded by a lncRNA (LINC00948 in human and 2310015B20Rik in mouse); it can directly interact with SERCA to decrease the affinity of this ATPase for Ca<sup>2+</sup> and inhibit Ca<sup>2+</sup> entry into the SR (Anderson et al., 2015). The Scl micropeptide is encoded by the noncoding pncr003:2L gene, and can affect Ca<sup>2+</sup> traffic in cardiac muscle in the fly; the mutation of this gene triggers an arrhythmic phenotype (Magny et al., 2013). The MOTS-C micropeptide can regulate insulin sensitivity and metabolic homeostasis in the mitochondria of muscle cells, and derives from mitochondrial 12S rRNA (Slavoff et al., 2014; Lee et al., 2015). Among the above examples, the Scl peptides and their respective regulatory functions in the heart are quite conserved between species, including the fly and humans (Magny et al., 2013). These results indicate that several sORFs embedded in the noncoding region of the genome seem to undergo relatively stricter natural selection than adjacent sequences, raising the question of whether these sORFs have the capability to sprout into a new gene in situ or to be integrated as a component into new genes elsewhere during evolution.

The tal gene in Drosophila is of vital importance in tarsal morphogenesis in the fly leg, and stage-and position-specific expression have been reported in embryonic development. Although tal is regarded as a noncoding gene as none of its ORFs are over 100 aa, deeper analysis has found that the functionality of tal is predominantly dependent on the ORF regions (Manak et al., 2006). There are five ORFs in the tal gene, four of which contain a similar and conserved 7 aa motif that determines the functionality of the gene, with the shortest peptide of only 11 aa. Phylogenetic analysis revealed that these tal-like peptides are conserved in metazoans and represent a new class of eukaryotic genes. The discovery of these mini-peptides further expands the possible scope and function of lncRNA-encoded peptides that are hidden in currently sequenced genomes and the transcriptome (Galindo et al., 2007).

# Peptides/Proteins Encoded by Circular RNAs

Circular RNAs are a specialized sub-category of lncRNAs, which are primarily produced by backsplicing the 3′ end to the 5′ end of exons in the same transcript (often a coding gene) via the spliceosome, thereby forming lncRNAs in a circular shape (Ashwal-Fluss et al., 2014; Zhang X.O. et al., 2014; Starke et al., 2015). Through bioinformatics analysis and high-throughput sequencing, many circRNAs have been identified in multiple species (Sanger et al., 1976; Capel et al., 1993; Danan et al., 2012; Memczak et al., 2013; Jeck and Sharpless, 2014; Wang et al., 2014). However, understanding of circRNA function is still very limited. The reported biological activities of circRNAs include acting as a sponge for microRNAs (Hansen et al., 2013; Memczak et al., 2013), as a competitor during pre-mRNA splicing (Ashwal-Fluss et al., 2014), and as a transcriptional regulator in the nucleus (Li et al., 2015). The majority of circRNAs are chimeric lncRNAs derived from mRNA transcripts and likely in part encompass the exons of protein-coding genes. This poses the question of whether circRNAs have protein-coding capabilities. In fact, many studies have demonstrated that circRNAs have coding capabilities both in vitro and in vivo in terms of cap-independent translation (Chen and Sarnow, 1995; Li and Lytton, 1999; Guo et al., 2014; Jeck and Sharpless, 2014; Abe et al., 2015; Wang and Wang, 2015; Pamudurti et al., 2017). Moreover, some functional protein products are encoded by circRNAs (such as circ-FBXW7, circ-Mbl, circ-ZNF609, and circ-SHPRH) (Rybak-Wolf et al., 2015; Pamudurti et al., 2017; Yang et al., 2018; Zhang et al., 2018a).
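Because backsplicing joins the 3′ end of a sequence to its own 5′ end, an ORF in a circRNA can read through the back-splice junction, which is impossible in the linear transcript. A minimal sketch of how such junction-spanning ORFs could be searched for is given below; the function name, the `min_codons` parameter, and the toy sequence are assumptions for illustration, and stop-free "rolling circle" ORFs are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def junction_spanning_orfs(circ_seq, min_codons=10):
    """Find ORFs (ATG ... stop) in a circular sequence that cross the
    back-splice junction, i.e. the point where the 3' end joins the 5' end.

    Returns (start, orf_sequence) tuples; start is 0-based on circ_seq.
    """
    n = len(circ_seq)
    doubled = circ_seq.upper() * 2  # lets a linear scan read through the junction
    hits = []
    for start in range(n):  # consider each start position once around the circle
        if doubled[start:start + 3] != "ATG":
            continue
        # Walk codon by codon for at most one lap around the circle
        for pos in range(start + 3, start + n, 3):
            codon = doubled[pos:pos + 3]
            if len(codon) < 3:
                break
            if codon in STOP_CODONS:
                end = pos + 3
                # end > n means the ORF extends past the junction
                if end > n and (end - start) // 3 >= min_codons:
                    hits.append((start, doubled[start:end]))
                break
    return hits


# Toy circular sequence: the ATG lies near the 3' end, the stop near the 5' end
toy_circ = "GCCTAAGGGATGAAA"
print(junction_spanning_orfs(toy_circ, min_codons=3))
```

Peptides predicted across the junction are useful because they are specific to the circular form and can, in principle, be searched for in MS data.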

circ-ZNF609 was initially identified in a functional genetic screen, and is differentially expressed during myogenesis (Legnini et al., 2017). This circRNA contains an ORF covering almost all ORF regions of the host gene, but has a small variation at the splice junction. Its protein product lacks the zinc-finger domain compared with its linear counterpart, with an obvious impact on myoblast proliferation. Interestingly, heat shock could significantly activate the translation of circ-ZNF609, suggesting a possible regulatory role of circRNA translation under specific stimuli (Legnini et al., 2017).

#### TABLE 1 | Peptides encoded by lncRNAs in plants and animals.

circ-Mbl was first detected as a circRNA originating from the second exon of the splicing factor muscleblind (MBL/MBNL1) in flies and humans, where it competes with pre-mRNA splicing (Ashwal-Fluss et al., 2014). Recently, through a bioinformatics analysis of ribosome footprinting datasets, Pamudurti and coworkers revealed that circ-Mbl could encode a peptide in the fly head, as detected through MS. Both circ-Mbl1 RNA and its protein product reside in the synaptosome and can be regulated by 4E-BP and the forkhead-family transcription factor FOXO, suggesting that translation of this circRNA might be distinctively important in the brain (Pamudurti et al., 2017).

The observation that circRNAs generate proteins can be traced back to much earlier studies in Archaea, where circularized introns produce a site-specific endonuclease (Dalgaard et al., 1993). However, to date, direct experimental evidence for the translation of circRNAs into peptides is still scarce; as a result, it is even harder to understand the function of their translated products. Considering that most circRNAs stem from coding transcripts and contain complete exons, it is plausible that circRNAs and their coding products provide uncharacterized modes of regulation of gene and protein expression (Pamudurti et al., 2017). Therefore, it is important to further investigate the possible functions associated with circRNA coding.

# Large-Scale Approaches for the Identification of Potential sORFs

To date, hundreds of thousands of lncRNAs have been discovered in various species, and there is a desire to study their relevant functional mechanisms (Okazaki et al., 2002; Liu et al., 2005; Kapranov et al., 2007; Ponting et al., 2009; Ulitsky and Bartel, 2013; Volders et al., 2013). However, it is impractical to identify lncRNAs and predict their functions using only traditional technical approaches, irrespective of the requirement for intensive validation of the exact mechanisms underlying lncRNA activities. The same is true for the identification of lncRNA coding capacity. Therefore, new large-scale technological approaches based on computational analysis of transcriptome and proteomics data have been developed, which reinforce and cross-validate one another.

The ribo-seq technique has recently been developed and is widely used to measure the full coding potential of RNA transcripts on a genome scale through deep sequencing of ribosome-protected RNA fragments (Ingolia et al., 2009). By identifying the precise positions of ribosomes on RNAs, ribo-seq can map ongoing translation events in the cytosol, which is useful in identifying potentially functional micropeptides (Ingolia et al., 2011; Ingolia, 2016). With the advent of ribo-seq, thousands of translated sORFs were discovered in lncRNAs (Ingolia et al., 2011; Bazzini et al., 2014; Ruiz-Orera et al., 2014; Ji et al., 2015), including a few functional peptides, such as MLN (Anderson et al., 2015) and HOXB-AS3 (Huang et al., 2017). However, the proportion of coding lncRNAs estimated by various ribosome-profiling studies differs widely (Guttman et al., 2013; Ingolia et al., 2014), owing to false positives and differing prediction thresholds. Therefore, MS has emerged as a complementary method.

Mass spectrometry demonstrates excellent performance in detecting and characterizing proteins and peptides in a complex biological sample. The detection of lncRNA-encoded peptides is the most direct evidence for lncRNA coding potential. However, to date, the proportion of coding lncRNAs detected by MS-based proteomics is small compared with that in ribo-seq results (Verheggen et al., 2017). The main weakness of this approach is that MS-based proteomics is strongly affected by the length and concentration of the peptides in the sample. Therefore, specialized methods have been developed to circumvent these detection limitations. Short translation products at low abundance can surmount the threshold of MS detection through the use of peptidomics approaches (Schulz-Knappe et al., 2005) and enrichment protocols (Mustafa et al., 2015).

Both of the above techniques have their respective advantages and shortcomings; therefore, "proteogenomics" has been developed (Nesvizhskii, 2014; Menschaert and Fenyö, 2017; Ruggles et al., 2017). In proteogenomics, proteomics data are systematically integrated and analyzed with genomics and transcriptomic data generated from DNA-sequencing, RNA-sequencing and ribosome-profiling. The predicted sequences of proteins/peptides are tracked back to the genome and transcripts to identify the gene expression patterns and actual translational events. The significance of proteogenomics studies lies in improving genome annotation, and reasonably applying multi-omics data to explore complex and profound mechanisms in biological activities and complex diseases (Zhang B. et al., 2014; Zhang et al., 2016; Mertins et al., 2016).
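The step of tracking a detected peptide back to a transcript can be illustrated with a minimal sketch. It assumes the standard genetic code, forward-strand frames only, and exact string matching; the function names `translate` and `map_peptide` and the toy sequences are hypothetical, and real proteogenomics pipelines additionally handle splice junctions, reverse strands, and ambiguous residues.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate an RNA/DNA sequence in one forward frame (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[p:p + 3] for p in range(frame, len(seq) - 2, 3))
    # Codons with unexpected characters are rendered as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def map_peptide(peptide, transcript):
    """Return (frame, nucleotide_start) for every exact match of the peptide."""
    hits = []
    for frame in range(3):
        protein = translate(transcript, frame)
        start = protein.find(peptide)
        while start != -1:
            hits.append((frame, frame + 3 * start))
            start = protein.find(peptide, start + 1)
    return hits


# Toy transcript (hypothetical): the peptide MKW is encoded in frame 2.
print(map_peptide("MKW", "GGATGAAATGGTAAC"))  # [(2, 2)]
```

A peptide that maps to a transcript annotated as noncoding, and to no annotated protein, is the kind of candidate that would then be checked against ribo-seq evidence.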

# NONCODING RNA REGULATORY FUNCTIONS EMBEDDED IN mRNAs

#### 3′ UTR Regulatory Roles of mRNAs

Based on current research results, the noncoding regulatory functions discovered in mRNAs are mainly present in the 3′ UTRs, which were previously thought to be vital regulatory elements for mRNA stability and localization. Compared with highly conserved coding regions, which are under strict selective pressure, the 3′ UTR displays more flexibility and plasticity between species. Its size varies from a few to hundreds of nucleotides, and likely has a close relationship with biological complexity (Chen et al., 2012; Mayr, 2016). Moreover, for an RNA molecule, beyond the impact on base pairing, changes in sequence are likely to induce corresponding changes in structure, resulting in information being transmitted from RNA to protein (Berkovits and Mayr, 2015).

By comprehensively surveying the cases reported to date in which mRNAs regulate biological activities without being translated into protein, we found that the 3′ UTR of mRNA plays a large role as an effector. Increasing evidence has demonstrated that the 3′ UTRs of mRNAs are actively involved in repressing the occurrence and progression of cancer, such as the 3′ UTRs of α-tropomyosin mRNA, prohibitin mRNA and ribonucleotide reductase mRNA (Rastinejad et al., 1993; Fan et al., 1996; Manjeshwar et al., 2003). These studies demonstrate that the 3′ UTRs of some mRNAs can antagonize tumor development, likely through post-transcriptional RNA interactions with regulatory factors involved in cellular growth. Indeed, the 3′ UTR can recruit RNA-binding proteins, as in the case of CD47 mRNA. CD47 mRNA has two 3′ UTR isoforms, long (CD47-LU) and short (CD47-SU), and only CD47-LU, which is AU-rich, can interact with the RNA-binding protein TIS11B to form a membraneless organelle with a specific biochemical and biophysical environment that is separate from the cytosol (Ma and Mayr, 2018). However, the most prevalent mode of action of the 3′ UTR is as a ceRNA, as in the cases of CCR2 mRNA and Ube3a1 RNA, which resembles the function of lncRNAs (Valluy et al., 2015; Hu et al., 2017).

# Noncoding Regulatory Roles of mRNA Not Involving the 3′ UTR

Other than the 3′ UTR, the 5′ UTR and ORF can also be involved in RNA-mediated regulatory function, although reports of this phenomenon are scarce. Two mRNAs, TP53 mRNA and HIST1H1C mRNA, are recognized as being involved in ORF-mediated regulation. TP53 protein is a tumor suppressor implicated in many processes during tumor occurrence and development. However, a triple synonymous codon mutant (TriMp53) led to a misshapen structure, resulting in loss of the IRES activity of p47 (one isoform of p53) and abrogated affinity for hnRNPC, but better binding to Mdm2, an E3 ubiquitin-protein ligase that mediates p53/TP53 ubiquitination, and an augmented ability of p53 to activate apoptosis. These facts indicate that TP53 has intricate regulatory roles at both the RNA and protein levels, suggesting that the functions of the RNA and protein molecules are closely intertwined (Candeias, 2011). HIST1H1C mRNA participates in regulating telomere length homeostasis. Aside from the protein product, a 15-nt region in the ORF (nt334–nt348) is responsible for HIST1H1C-mRNA-mediated biological activity, through base pairing with the terminal stem-loop sequence of the P6b region of hTR. These results extend the functional potential of mRNA ORF regions in a non-traditional, noncoding direction (Ivanyi-Nagy et al., 2018).

In terms of the 5′ UTR, there are only two examples. VEGF is a key regulator of angiogenesis during embryonal and cancerous development, and this regulatory function is closely correlated with the 5′ UTR. vegf mRNA has an unusually long 5′ UTR of 1,038 nucleotides, and contains two IRES, resulting in an intricate regulation of VEGF expression. In addition, the presence of the 5′ UTR of vegf mRNA alone in tumor cells could promote the expression of anti-apoptotic genes but repress proapoptotic genes, suggesting an anti-apoptotic role of the vegf 5′ UTR, and demonstrating its potential as a target for cancer treatment. To the best of our knowledge, the 5′ UTR of vegf mRNA represents the only example of an mRNA UTR which can promote tumor progression (Akiri et al., 1998; Huez et al., 1998; Masuda et al., 2008).

The c-myc P0 transcript is an isoform transcribed from promoter 0 (P0) of the c-myc gene, which has an extra ~639 nucleotide extension of the 5′ UTR when compared with the two major isoforms (P1 and P2) of c-myc mRNA. Ectopic expression of the 5′ UTR of the c-myc P0 transcript alone in HeLa cells results in significantly increased expression of the c-Myc1 (p67) and c-Myc2 (p64) proteins as well as increased apoptosis sensitivity, but decreased tumorigenicity, all of which are likely attributable to competitive regulation of gene expression in the c-myc locus. These results demonstrate that the 5′ UTR potentially functions in trans to perform gene regulation (Blume et al., 2003).

# PERSPECTIVE

In recent years, researchers have begun to pay close attention to the development of bifunctional RNAs, and have discussed the evolved roles of RNAs with multiple functions (Dinger et al., 2011; Ulveling et al., 2011; Kageyama et al., 2011; Hubé and Francastel, 2018). In the early stages of such research, researchers discovered individual genes on a case-by-case basis. However, in the last ten years, rapid advances in large-scale detection and identification techniques (such as ribo-seq and MS-based proteomics) have facilitated multi-faceted investigations of genomes and vital processes, thus shedding light on the complex activities of various RNA molecules. Bifunctional RNAs raise questions about the concept of a gene, in terms of whether RNA that is both coding and noncoding represents an independent gene type or a convergence of coding and noncoding genes which occurred during evolution. In this review, we intend to not only investigate the current status of bifunctional RNAs as reported in recent years, but also discuss the potential pervasiveness of bifunctional RNAs from a global perspective in terms of large-scale data.

With recent estimates of ribosome profiling, small peptides encoded by lncRNAs have significantly expanded the extent and diversity of the proteome, and predictions suggested that a large fraction of the annotated lncRNAs in various eukaryotic organisms would be translated with sORFs (Aspden et al., 2014; Ruiz-Orera et al., 2014; Mackowiak et al., 2015; Olexiouk et al., 2016). Proteogenomic evidence has confirmed that many small peptides which stem from regions of lncRNA genes are expressed differentially in different cell types and during different developmental/disease stages, although their functions are somewhat enigmatic (Nesvizhskii, 2014; Zhu et al., 2018). However, other studies have revealed that mRNAs could also be involved in cellular regulatory processes in a coding-independent manner (Nam et al., 2016). The results from a large-scale RNA structure analysis revealed that the secondary structures of mRNA have an essential regulatory effect on its maturation and stability, even for the evolutionarily conserved RNA silencing pathways of eukaryotes, suggesting that mRNAs partially retain the functionality of structure that exists in many RNA molecules (Katz and Burge, 2003; Li et al., 2012; Taggart et al., 2012). All these facts indicate that the coding potential and biological roles of mRNAs and lncRNAs could be switched in some cases, implying a conceptual blurring between coding and noncoding genes.

Many lncRNAs share similar features with classical mRNAs, such as transcription by polymerase II, a 5′ cap and a 3′ polyadenylated tail, and frequent accumulation in the cytoplasm (van Heesch et al., 2014). Therefore, when associated with ribosomes, sORFs embedded in lncRNAs have a significant chance of being translated into peptides. The peptides derived from lncRNAs have a relatively shorter chain length and weaker conservation across different species, and this is consistent with the original lncRNAs, which often have few introns, a low expression level and weak phylogenetic conservation (Cabili et al., 2011; Derrien et al., 2012; Kutter et al., 2012; Necsulea et al., 2014). From the perspective of proteins driving evolution, these peptides are likely to be an important source of new proteins (Ruiz-Orera et al., 2014). Previously reported experimental evidence indicates that noncoding RNAs expressed at low levels could contribute to the birth of novel protein-coding genes (Levine et al., 2006; Cai et al., 2008; Reinhardt et al., 2013). Given that several lncRNA-derived peptides have been demonstrated to play essential roles in many biological activities, it is worth investigating the putative significance of the generation of these lncRNA-derived peptides in gene evolution, expression and regulation.

However, in view of the huge quantity, diversified mechanisms of action, and intricate functions of lncRNAs, it is inappropriate to regard lncRNAs just as a pool for evolved peptides. In terms of RNA alone, its roles are diverse, including the potential to be reverse-transcribed into DNA, or to act as an enzyme to participate in complex biochemical processes (Cech, 1986). Moreover, random RNA sequences can give rise to structurally complex and highly active RNA ligases, suggesting that randomness can produce functionality (Ekland et al., 1995). Therefore, it is very likely that RNA molecules alone carry abundant genetic information, such as particular structural features and ultraconserved sequence elements, which could regulate the timing and place of gene expression during cellular differentiation and development.

In recent decades, because of the addition of the huge family of noncoding genes, RNAs have provoked great interest for their mysterious roles in organisms. lncRNA-encoded peptides expand the horizon of functional mechanisms for these bio-macromolecules. To date, thousands of peptide products have been identified in human cells, with limited understanding of their function. The current review has summarized the recently discovered micropeptides implicated in various biological processes. We also discussed the potential noncoding roles of mRNAs as regulators. The continued discovery and functional characterization of bifunctional RNAs will provide new insights into important cellular processes and organismal evolution.

# AUTHOR CONTRIBUTIONS

Both authors have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# FUNDING

This work was supported through the following grants by the National Natural Science Foundation of China (Grant Nos. 31471220 and 91440113), the Start-up Fund from Xishuangbanna Tropical Botanical Garden, and the "Top Talents Program in Science and Technology" from Yunnan Province.







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Li and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# LncRNA-Disease Association Prediction Using Two-Side Sparse Self-Representation

Le Ou-Yang<sup>1,2†</sup>, Jiang Huang<sup>3†</sup>, Xiao-Fei Zhang<sup>4</sup>, Yan-Ran Li<sup>3</sup>, Yiwen Sun<sup>5</sup>, Shan He<sup>6</sup> and Zexuan Zhu<sup>3</sup>\*

*<sup>1</sup> Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen, China, <sup>2</sup> FJKLMAA (Fujian Key Laboratory of Mathematical Analysis and Applications), Fujian Normal University, Fuzhou, China, <sup>3</sup> College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, <sup>4</sup> School of Mathematics and Statistics and Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China, <sup>5</sup> School of Medicine, Shenzhen University, Shenzhen, China, <sup>6</sup> School of Computer Science, University of Birmingham, Birmingham, United Kingdom*

#### Edited by:

*Philipp Kapranov, Huaqiao University, China*

#### Reviewed by:

*Remco Molenaar, University Medical Center Amsterdam, Netherlands Shihua Zhang, Academy of Mathematics and Systems Science (CAS), China Jie Zheng, ShanghaiTech University, China*

\*Correspondence: *Zexuan Zhu, zhuzx@szu.edu.cn*

*†Joint first authors*

#### Specialty section:

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

Received: *30 December 2018* Accepted: *03 May 2019* Published: *28 May 2019*

#### Citation:

*Ou-Yang L, Huang J, Zhang X-F, Li Y-R, Sun Y, He S and Zhu Z (2019) LncRNA-Disease Association Prediction Using Two-Side Sparse Self-Representation. Front. Genet. 10:476. doi: 10.3389/fgene.2019.00476*

Increasing evidence indicates the involvement of long non-coding RNAs (lncRNAs) in various biological processes. As the mutations and abnormalities of lncRNAs are closely related to the progression of complex diseases, the identification of lncRNA-disease associations has become an important step toward the understanding and treatment of diseases. Since only a limited number of lncRNA-disease associations have been validated, an increasing number of computational approaches have been developed for predicting potential lncRNA-disease associations. However, predicting potential associations precisely through computational approaches remains challenging. In this study, we propose a novel two-side sparse self-representation (TSSR) algorithm for lncRNA-disease association prediction. By adaptively learning the self-representations of lncRNAs and diseases from known lncRNA-disease associations, and leveraging the information provided by known lncRNA-disease associations and the intra-associations among lncRNAs and diseases derived from other existing databases, our model can effectively utilize the estimated representations of lncRNAs and diseases to predict potential lncRNA-disease associations. The experimental results on three real data sets demonstrate that TSSR significantly outperforms other competing methods. Moreover, to further evaluate the effectiveness of TSSR in predicting potential lncRNA-disease associations, case studies of Melanoma, Glioblastoma, and Glioma are carried out in this paper. The results demonstrate that TSSR can effectively identify some candidate lncRNAs associated with these three diseases.

Keywords: lncRNAs-disease associations prediction, computational approaches, sparse representation, lncRNA similarity, disease similarity

# 1. INTRODUCTION

Long non-coding RNAs (lncRNAs), which are a class of non-coding transcripts longer than 200 nucleotides (Derrien et al., 2012; Harrow et al., 2012; Guttman et al., 2013; Chen et al., 2016b), have been proven to be involved in various biological processes (Chen et al., 2012, 2016b, 2018) and closely correlated with the development of complex diseases, such as cancers and rheumatic diseases (Bussemakers et al., 1999; Managadze et al., 2011; Bhartiya et al., 2012; Schonrock et al., 2012; Li et al., 2013; Lu et al., 2013; Zhao et al., 2014; Chen et al., 2016b). For example, studies have revealed the roles of lncRNAs in regulating gene expression (Taft et al., 2010; Wapinski and Chang, 2011). As the development of complex diseases is closely related to the mutations and abnormalities of lncRNAs, predicting the potential associations between diseases and lncRNAs is important for systematically understanding the pathogenesis of human diseases and for identifying biomarkers of disease progression and prognosis (Chen et al., 2016b; Yu et al., 2018). However, only a small number of lncRNA-disease associations have been validated. Therefore, efficient methods for predicting the associations between lncRNAs and diseases are urgently needed (Lu et al., 2018).

In recent years, identifying the associations between diseases and lncRNAs has attracted a lot of attention (Chen and Yan, 2013; Lu et al., 2018). Prediction methods based on biological experiments or computational approaches have been proposed to undertake this task. Because biological experiments are time-consuming and expensive, computational approaches provide an alternative and have been widely used to identify the associations between lncRNAs and diseases (Chen et al., 2016b). Existing computational approaches for association prediction can be roughly classified into three categories. The first category is based on machine learning approaches. These models predict the associations between diseases and lncRNAs based on known lncRNA-disease associations. For example, Chen et al. proposed a semi-supervised learning-based method named Laplacian Regularized Least Squares for LncRNA-disease Association (LRLSLDA) (Chen and Yan, 2013) to predict the associations between diseases and lncRNAs. Zheng et al. formulated the problem of association prediction as a matrix factorization problem and introduced a collaborative matrix factorization model (CMF) (Zheng et al., 2013) to predict the associations. However, the performance of machine learning-based methods depends on the choice of hyperparameters, such as the dimensionality of the latent space in matrix factorization-based methods, and suitable values for these hyperparameters are usually unknown in advance and hard to determine.

The second category is based on random walk. These models identify potential lncRNA-disease associations by integrating known associations between diseases and lncRNAs and similarities among diseases and lncRNAs. For example, Zhou et al. predicted the associations between diseases and lncRNAs by implementing random walk with restart on the constructed similarity networks among lncRNAs and diseases (Zhou M. et al., 2015). The third category is based on data integration. These models focus on integrating multiple heterogeneous data sources. For example, Lu et al. (2018) developed a model named SIMCLDA for identifying the associations between diseases and lncRNAs based on disease-gene and gene-gene ontology associations. However, the above methods rely heavily on similarity networks or external information (e.g., similarity networks among diseases and lncRNAs, and gene-gene associations) that are inferred based on predefined metrics. Moreover, the information extracted from other databases or data platforms may include irrelevant or noisy information that may mislead the prediction of associations.
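To make the random-walk category concrete, the following is a minimal sketch of random walk with restart on a single generic network. It is not the specific network construction of Zhou et al.; the function name, the default restart probability, and the toy path graph are assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change falls below tol.

    adjacency: square list-of-lists of non-negative edge weights.
    seeds:     indices of the nodes the walker restarts from.
    """
    n = len(adjacency)
    # Column-normalize the adjacency matrix so each column is a probability
    # distribution over the next node (all-zero columns stay zero).
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy path graph 0 - 1 - 2 - 3 with the walker restarting at node 0.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
scores = random_walk_with_restart(path, seeds=[0])
print([round(s, 4) for s in scores])
```

Nodes closer to the seed receive higher steady-state scores, which is why such scores are used to rank candidate associations. The ranking depends directly on the network that is supplied.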

To address the above problems, in this paper, we introduce a novel two-side sparse self-representation (TSSR) model for lncRNA-disease association prediction. Based on known lncRNA-disease associations, our model can adaptively learn two non-negative sparse self-representation matrices which capture the intra-similarities among lncRNAs and diseases, respectively. Moreover, our model can also draw support from the intra-associations among diseases and lncRNAs that are derived from external information on lncRNAs and diseases to generate a more accurate estimation of the representation matrices. Experimental results on three real datasets demonstrate that, compared with six state-of-the-art association prediction algorithms, our TSSR model achieves more accurate prediction results. Furthermore, case studies on three cancers (i.e., Glioblastoma, Glioma, and Melanoma) also demonstrate the effectiveness of TSSR in predicting the associations between lncRNAs and diseases. The source code of TSSR is available at https://github.com/Oyl-CityU/TSSR.

The rest of this paper is organized as follows. In section 2, we formulate our two-side sparse self-representation model and introduce a relaxed Majorization-Minimization algorithm to solve the optimization problem. The experimental results and case studies are given in section 3. In section 4, we conclude our work.

# 2. METHODS

# 2.1. Notations and Problem Statement

In this paper, we use $D = \{d_i\}_{i=1}^{m}$ to represent the set of lncRNAs and $T = \{t_j\}_{j=1}^{n}$ to represent the set of diseases, where $m$ and $n$ denote the number of lncRNAs and the number of diseases, respectively. A binary matrix $Y = [Y_{ij}] \in \{0, 1\}^{m \times n}$ is introduced to represent the associations between lncRNAs and diseases, where $Y_{ij} = 1$ if there is an association between lncRNA $d_i$ and disease $t_j$, and $Y_{ij} = 0$ otherwise. Note that there are two reasons that may lead to $Y_{ij} = 0$. The first reason is that it has been experimentally verified that there is no association between $d_i$ and $t_j$. The second reason is that whether there is an association between $d_i$ and $t_j$ is still unknown. Therefore, we usually refer to the zero elements in $Y$ as unknown pairs. The lncRNA-disease association prediction problem can be formulated as the problem of predicting the scores of unknown pairs in $Y$, which can be used for ranking the pairs. In this study, we first rank the unknown pairs in $Y$ based on the predicted scores in descending order, and then select the top-ranked pairs as potential association pairs.
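The problem setup above can be sketched in a few lines. This is an illustrative example only: the helper names, the toy lncRNA and disease labels, and the score matrix are hypothetical, and the scores would in practice come from a prediction model.

```python
def build_association_matrix(lncrnas, diseases, known_pairs):
    """Build the binary m x n matrix Y from a list of known (lncRNA, disease) pairs."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    Y = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in known_pairs:
        Y[row[lnc]][col[dis]] = 1
    return Y


def rank_unknown_pairs(Y, scores, top_k):
    """Rank the zero entries of Y by predicted score (descending) and keep top_k."""
    candidates = [(scores[i][j], i, j)
                  for i in range(len(Y))
                  for j in range(len(Y[0]))
                  if Y[i][j] == 0]
    candidates.sort(key=lambda item: -item[0])  # stable sort, highest score first
    return [(i, j) for _, i, j in candidates[:top_k]]


# Hypothetical toy data: 3 lncRNAs, 2 diseases, 2 known associations.
Y = build_association_matrix(["lncA", "lncB", "lncC"], ["dis1", "dis2"],
                             [("lncA", "dis1"), ("lncC", "dis2")])
scores = [[0.9, 0.4],
          [0.7, 0.2],
          [0.1, 0.8]]
print(Y)                                 # [[1, 0], [0, 0], [0, 1]]
print(rank_unknown_pairs(Y, scores, 2))  # [(1, 0), (0, 1)]
```

Only entries with $Y_{ij} = 0$ are ranked, so already known associations never appear among the predicted candidates.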

In particular, unlike matrix factorization methods that project lncRNAs and diseases into a shared latent space and predict lncRNA-disease associations based on the inner product of their latent vectors, we try to learn the intra-similarities among lncRNAs and diseases from the observed associations in Y, and utilize the learned similarity matrices to reconstruct Y and thus predict the scores of unknown pairs in Y. Here, instead of using predefined metrics to construct the similarity matrices of lncRNAs and diseases (which makes the predicted results sensitive to the selected metrics and input data), we introduce a novel two-side sparse self-representation (TSSR) model to adaptively learn the intra-similarities among lncRNAs and diseases from the observed associations in Y, and effectively utilize external information of lncRNAs and diseases to enhance the prediction performance.

# 2.2. Two-Side Sparse Self-Representation Model

Sparse representation techniques, which focus on finding a sparse representation of a sample in the form of a linear combination of basic elements (also called atoms) in a dictionary, have been widely used in numerous applications such as computer vision and machine learning (Zhang et al., 2015). In traditional sparse representation models, the objective is to solve the following problem

$$\min_{\mathbf{x}} \|\mathbf{x}\|_{0} \quad \text{s.t.} \quad \mathbf{y} = D\mathbf{x}. \tag{1}$$

where $\|\cdot\|_0$ denotes the $L_0$ norm, $\mathbf{y} \in \mathbb{R}^{m \times 1}$ is a sample vector, $D$ is an $m \times l$ matrix which denotes the dictionary and $\mathbf{x} \in \mathbb{R}^{l \times 1}$ is the sparse representation coefficient of $\mathbf{y}$. In practice, the $L_0$ norm is usually replaced with the $L_1$ norm to make problem (1) solvable in polynomial time. Since problem (1) requires extra time to construct the dictionary $D$ and is not data-adaptive, many approaches have been proposed that employ the dataset itself as the dictionary, which results in the following sparse self-representation model

$$\min_{X} \left\lVert Y - YX \right\rVert_F^2 + \beta \left\lVert X \right\rVert_1. \tag{2}$$

where $\|\cdot\|_F$ is the Frobenius norm, $Y$ denotes the feature set of all samples (each row denotes a feature and each column represents the feature vector of a sample), $X$ is the sparse self-representation coefficient matrix of the columns of $Y$ (each column $X_{\cdot j}$ of $X$ denotes the representation coefficient of the $j$-th sample $Y_{\cdot j}$, with all samples in $Y$ as the dictionary) and $\beta$ is a tuning parameter that controls the trade-off between the reconstruction error and the sparsity. By solving model (2), $X$ can capture the most similar relationships among the columns of $Y$, based on the information provided in $Y$. In this study, $Y \in \{0, 1\}^{m \times n}$ describes the observed associations between lncRNAs and diseases and we would like to predict potential associations between lncRNAs and diseases based on their intra-similarities learned from $Y$. Thus, instead of just finding the representations of the columns of $Y$, we prefer to explore the representations of the rows and columns of $Y$ simultaneously, which capture the intra-similarities within lncRNAs and diseases, respectively. Based on the idea of sparse self-representation, we introduce a novel two-side sparse self-representation (TSSR) model to handle the task of lncRNA-disease association prediction. In particular, we formulate the framework of TSSR as the following optimization problem
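As a small illustration of the trade-off in model (2), the sketch below evaluates the objective for two candidate coefficient matrices. It only evaluates the objective and does not solve the minimization; the helper names and the toy matrix are assumptions for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def self_representation_objective(Y, X, beta):
    """Value of ||Y - YX||_F^2 + beta * ||X||_1 for a candidate X."""
    YX = matmul(Y, X)
    error = sum((y - r) ** 2
                for y_row, r_row in zip(Y, YX)
                for y, r in zip(y_row, r_row))
    sparsity = sum(abs(x) for row in X for x in row)
    return error + beta * sparsity


# Toy data: 2 features (rows) and 3 samples (columns).
Y = [[1, 0, 1],
     [0, 1, 1]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
zeros = [[0, 0, 0] for _ in range(3)]

# The identity reconstructs Y exactly, so only the sparsity term remains.
print(self_representation_objective(Y, identity, beta=0.5))  # 1.5
# The all-zero matrix is maximally sparse but reconstructs nothing.
print(self_representation_objective(Y, zeros, beta=0.5))     # 4.0
```

The two extremes show what $\beta$ balances: a dense enough $X$ drives the reconstruction error to zero, whereas a very sparse $X$ pays for it in reconstruction error.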

$$\begin{aligned} \min_{U,V} \; & \|Y - UYV\|_F^2 + \beta (\|U\|_1 + \|V\|_1),\\ & \text{s.t.} \quad U \ge 0, V \ge 0, \sum_{z=1}^m U_{iz} = 1, \sum_{k=1}^n V_{kj} = 1. \end{aligned} \tag{3}$$

where $U = [U_{ii'}] \in \mathbb{R}_+^{m \times m}$ and $V = [V_{jj'}] \in \mathbb{R}_+^{n \times n}$ are two nonnegative sparse matrices which represent the row and column representation coefficient matrices of $Y$, respectively, and $\beta$ is a tuning parameter which controls the sparsity of $U$ and $V$. Based on this definition, $U$ denotes the coefficient matrix based on the dictionary $YV$, which captures the similarities among lncRNAs. For example, $U_{ii'}$ denotes the similarity between the $i$-th and $i'$-th lncRNAs, which correspond to the $i$-th and $i'$-th rows of $Y$. On the other hand, $V$ denotes the coefficient matrix based on the dictionary $UY$, which captures the similarities among diseases. For example, $V_{jj'}$ denotes the similarity between the $j$-th and $j'$-th diseases, which correspond to the $j$-th and $j'$-th columns of $Y$. With the sparse regularization term, we can control the sparsity of the learned representation matrices $U$ and $V$, and find the most similar relationships within lncRNAs and diseases. The constraints $\sum_{z=1}^m U_{iz} = 1$ and $\sum_{k=1}^n V_{kj} = 1$ are used to guarantee the probability properties of $U_{i\cdot}$ and $V_{\cdot j}$, respectively.
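The sketch below evaluates objective (3) and checks its constraints for a hand-picked pair $(U, V)$. It is an illustration under stated assumptions, not the authors' solver: the function names and toy matrices are hypothetical, and using the reconstruction $UYV$ as the score matrix follows the description in section 2.1.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def is_feasible(U, V, tol=1e-9):
    """Check U >= 0, V >= 0, every row of U sums to 1, every column of V sums to 1."""
    nonneg = all(x >= 0 for M in (U, V) for row in M for x in row)
    rows_ok = all(abs(sum(row) - 1) <= tol for row in U)
    cols_ok = all(abs(sum(col) - 1) <= tol for col in zip(*V))
    return nonneg and rows_ok and cols_ok


def tssr_objective(Y, U, V, beta):
    """Value of ||Y - UYV||_F^2 + beta * (||U||_1 + ||V||_1)."""
    R = matmul(matmul(U, Y), V)
    error = sum((y - r) ** 2 for y_row, r_row in zip(Y, R) for y, r in zip(y_row, r_row))
    l1 = sum(abs(x) for M in (U, V) for row in M for x in row)
    return error + beta * l1


# Toy data: 3 lncRNAs x 2 diseases.
Y = [[1, 0],
     [1, 1],
     [0, 1]]
U = [[0.5, 0.5, 0.0],   # lncRNA 0 is represented by itself and lncRNA 1
     [0.0, 1.0, 0.0],
     [0.0, 0.5, 0.5]]
V = [[1, 0],
     [0, 1]]

print(is_feasible(U, V))                 # True
print(matmul(matmul(U, Y), V))           # [[1.0, 0.5], [1.0, 1.0], [0.5, 1.0]]
print(tssr_objective(Y, U, V, beta=0.1))
```

In the reconstruction, the unknown pair (lncRNA 0, disease 1) receives a score of 0.5 because lncRNA 0 borrows from lncRNA 1, which is associated with that disease. This is how learned intra-similarities can turn zero entries into ranked candidates.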

In the above objective function (3), the representation matrices are learned from the original data matrix Y, which means that they will be sensitive to the input data Y. If the input data only includes a small number of known associations, it may be hard to learn a comprehensive representation matrix. With the development of high-throughput experimental techniques and the accumulation of clinical information, we can also collect functional annotations and phenotype information for lncRNAs and diseases, respectively. Based on this prior information, we can infer the intra-associations among diseases and among lncRNAs. To utilize these pairwise associations inferred from other databases to improve the estimation of the two representation coefficient matrices U and V, two regularization terms are added to Equation (3). Moreover, we introduce a weight matrix W in a similar way to Zheng et al. (2013) to prevent unknown instances (for which association information is not available) from contributing to the determination of the row and column representations of Y (i.e., U and V). The final objective function of our TSSR model is as follows.

$$\begin{aligned} \min\_{U,V} & \|W \odot (Y - UYV)\|\_F^2 + \beta (\|U\|\_1 + \|V\|\_1) \\ & + \lambda\_d \|\mathbb{S}\_d - U\|\_F^2 + \lambda\_t \|\mathbb{S}\_t - V\|\_F^2, \\ & \text{s.t.} \quad U \ge 0, V \ge 0, \sum\_{z=1}^m U\_{iz} = 1, \sum\_{k=1}^n V\_{kj} = 1. \end{aligned} \tag{4}$$

where λ<sub>d</sub> and λ<sub>t</sub> are two tuning parameters controlling the influences of prior intra-associations among lncRNAs and diseases, S<sub>d</sub> ∈ ℝ<sup>m×m</sup> and S<sub>t</sub> ∈ ℝ<sup>n×n</sup> denote the affinity matrices of lncRNAs and diseases, respectively, where (S<sub>d</sub>)<sub>ii′</sub> describes the association between lncRNAs d<sub>i</sub> and d<sub>i′</sub>, and (S<sub>t</sub>)<sub>jj′</sub> describes the association between diseases t<sub>j</sub> and t<sub>j′</sub>. ⊙ denotes the element-wise (Hadamard) product of two matrices, and W ∈ ℝ<sup>m×n</sup> is a weight matrix where W<sub>ij</sub> = 0 for unknown entries in Y and W<sub>ij</sub> = 1 for known entries in Y. Consequently, unknown entries in Y do not contribute to the minimization of the first term of Equation (4).
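The objective in Equation (4) can be evaluated directly, which is convenient for monitoring a solver. The function below is a minimal sketch under the assumption that U and V are already nonnegative (so the L<sub>1</sub> norms reduce to sums of entries); the function name and argument names are hypothetical.

```python
import numpy as np

def tssr_objective(Y, W, U, V, S_d, S_t, beta, lam_d, lam_t):
    """Value of the objective in Equation (4), assuming U >= 0 and V >= 0."""
    residual = W * (Y - U @ Y @ V)      # Hadamard mask: entries with W_ij = 0 drop out
    fit = np.sum(residual ** 2)         # squared Frobenius norm of the masked residual
    sparsity = beta * (np.abs(U).sum() + np.abs(V).sum())
    prior = lam_d * np.sum((S_d - U) ** 2) + lam_t * np.sum((S_t - V) ** 2)
    return fit + sparsity + prior
```

With W set to all zeros the data-fitting term vanishes and only the regularization terms remain, which mirrors the statement that unknown entries do not contribute to the first term.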

## 2.3. Optimization Algorithm

Here, to handle the constraints in (4), we employ a relaxed Majorization-Minimization algorithm (Yang and Oja, 2011, 2012) to obtain the solution of objective function (4). For more details about this optimization method, please refer to Yang and Oja (2012). In particular, we denote ∇<sub>U</sub> as the gradient of our objective function with respect to U.

$$\nabla\_U = -2[W \odot (Y - UYV)]V^T Y^T - 2\lambda\_d (\mathbb{S}\_d - U) + \beta. \tag{5}$$

Let ∇<sub>U</sub><sup>+</sup> = 2[W ⊙ (UYV)]V<sup>T</sup>Y<sup>T</sup> + 2λ<sub>d</sub>U + β and ∇<sub>U</sub><sup>−</sup> = 2(W ⊙ Y)V<sup>T</sup>Y<sup>T</sup> + 2λ<sub>d</sub>S<sub>d</sub> denote the positive and negative parts of ∇<sub>U</sub>, respectively. Thus, we have ∇<sub>U</sub> = ∇<sub>U</sub><sup>+</sup> − ∇<sub>U</sub><sup>−</sup>.
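The split of the gradient into positive and negative parts can be checked numerically. The sketch below compares the difference of the two parts with a central finite-difference gradient of the U-dependent terms of Equation (4). It assumes a binary W (so that W ⊙ W = W) and strictly positive U; all function names and the random test data are hypothetical.

```python
import numpy as np

def objective(U, Y, W, V, S_d, beta, lam_d):
    """Terms of Equation (4) that depend on U, for U >= 0 (so ||U||_1 = sum of entries)."""
    fit = np.sum((W * (Y - U @ Y @ V)) ** 2)
    return fit + beta * U.sum() + lam_d * np.sum((S_d - U) ** 2)

def grad_parts_U(U, Y, W, V, S_d, beta, lam_d):
    """Positive and negative parts of the gradient with respect to U."""
    A = V.T @ Y.T                                       # (YV)^T
    pos = 2 * (W * (U @ Y @ V)) @ A + 2 * lam_d * U + beta
    neg = 2 * (W * Y) @ A + 2 * lam_d * S_d
    return pos, neg

def numerical_grad_U(U, *args, h=1e-6):
    """Central finite differences of the objective, entry by entry."""
    G = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        E = np.zeros_like(U)
        E[idx] = h
        G[idx] = (objective(U + E, *args) - objective(U - E, *args)) / (2 * h)
    return G

rng = np.random.default_rng(1)
m, n = 4, 5
Y = (rng.random((m, n)) < 0.4).astype(float)
W = (rng.random((m, n)) < 0.8).astype(float)            # binary mask, so W * W == W
U = rng.random((m, m)) + 0.1
V = rng.random((n, n)) + 0.1
S_d = rng.random((m, m))
beta, lam_d = 0.5, 0.3

pos, neg = grad_parts_U(U, Y, W, V, S_d, beta, lam_d)
max_err = np.max(np.abs((pos - neg) - numerical_grad_U(U, Y, W, V, S_d, beta, lam_d)))
```

Both parts are elementwise nonnegative whenever Y, W, U, V, and S<sub>d</sub> are nonnegative, which is the property that the multiplicative update relies on.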

Due to the constraints Σ<sub>z=1</sub><sup>m</sup> U<sub>iz</sub> = 1 and U<sub>iz</sub> ≥ 0, we obtain the following updating rule for U<sub>iz</sub>:

$$U\_{iz}^{\text{new}} = U\_{iz} \cdot \frac{a\_i^U (\nabla\_U^-)\_{iz} + 1}{a\_i^U (\nabla\_U^+)\_{iz} + b\_i^U}, \tag{6}$$

where a<sub>i</sub><sup>U</sup> and b<sub>i</sub><sup>U</sup> can be obtained by Equations (7) and (8), respectively.

$$a\_i^U = \sum\_z \frac{U\_{iz}}{(\nabla\_U^+)\_{iz}},\tag{7}$$

$$b\_i^U = \sum\_z U\_{iz} \frac{(\nabla\_U^-)\_{iz}}{(\nabla\_U^+)\_{iz}}.\tag{8}$$
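The form of the update in Equation (6) can be motivated by a short Lagrangian argument. The following is a sketch under the assumption that the rule follows the usual construction for sum-constrained multiplicative updates; it is not a derivation given in this article, and the multiplier μ<sub>i</sub> is a symbol introduced here only for illustration. Attaching μ<sub>i</sub> to the row-sum constraint and taking a multiplicative step on the resulting gradient gives

$$\tilde{U}\_{iz} = U\_{iz} \cdot \frac{(\nabla\_U^-)\_{iz} - \mu\_i}{(\nabla\_U^+)\_{iz}}.$$

Requiring the updated row to sum to one and using the definitions in Equations (7) and (8) fixes the multiplier:

$$\sum\_z \tilde{U}\_{iz} = b\_i^U - \mu\_i a\_i^U = 1 \quad \Rightarrow \quad \mu\_i = \frac{b\_i^U - 1}{a\_i^U}, \qquad \tilde{U}\_{iz} = U\_{iz} \cdot \frac{a\_i^U (\nabla\_U^-)\_{iz} + 1 - b\_i^U}{a\_i^U (\nabla\_U^+)\_{iz}}.$$

The numerator can become negative because of the term −b<sub>i</sub><sup>U</sup>. Moving that term to the denominator yields the form of Equation (6), which keeps every entry nonnegative. For entries with U<sub>iz</sub> > 0, both forms share the fixed-point condition a<sub>i</sub><sup>U</sup>(∇<sub>U</sub><sup>−</sup>)<sub>iz</sub> + 1 = a<sub>i</sub><sup>U</sup>(∇<sub>U</sub><sup>+</sup>)<sub>iz</sub> + b<sub>i</sub><sup>U</sup>; multiplying it by U<sub>iz</sub>/(∇<sub>U</sub><sup>+</sup>)<sub>iz</sub> and summing over z gives Σ<sub>z</sub> U<sub>iz</sub> = 1. Under this reading, the row-sum constraint holds at a fixed point but is not enforced exactly after every single update.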

Similarly, we denote ∇<sub>V</sub> as the gradient of our objective function with respect to V.

$$\nabla\_V = -2(Y^T U^T)[W \odot (Y - UYV)] - 2\lambda\_t (\mathbb{S}\_t - V) + \beta. \tag{9}$$

Let ∇<sub>V</sub><sup>+</sup> = 2Y<sup>T</sup>U<sup>T</sup>[W ⊙ (UYV)] + 2λ<sub>t</sub>V + β and ∇<sub>V</sub><sup>−</sup> = 2Y<sup>T</sup>U<sup>T</sup>(W ⊙ Y) + 2λ<sub>t</sub>S<sub>t</sub> denote the positive and negative parts of ∇<sub>V</sub>, respectively. Thus, we have ∇<sub>V</sub> = ∇<sub>V</sub><sup>+</sup> − ∇<sub>V</sub><sup>−</sup>.

Similarly, the updating rule for V<sub>kj</sub> is as follows:

$$V\_{kj}^{\text{new}} = V\_{kj} \cdot \frac{a\_j^V (\nabla\_V^-)\_{kj} + 1}{a\_j^V (\nabla\_V^+)\_{kj} + b\_j^V}, \tag{10}$$

where

$$a\_j^V = \sum\_k \frac{V\_{kj}}{(\nabla\_V^+)\_{kj}}, \qquad b\_j^V = \sum\_k V\_{kj} \frac{(\nabla\_V^-)\_{kj}}{(\nabla\_V^+)\_{kj}}.$$

The details of the optimization algorithm for the proposed TSSR model are described in Algorithm 1. U and V are updated by Equations (6) and (10), respectively. In this study, we stop the iteration when the changes of U and V, measured by the L<sub>1</sub> norm, are less than 1e-6. Finally, the predicted label matrix Ŷ is returned as Ŷ = UYV when the algorithm reaches the convergence conditions.

**Algorithm 1**: Algorithm for the TSSR model

	- 1. Initialize U and V;
	- 2. **While** not converged **do**
	- 3. Update U according to Equation (6):

$$U\_{iz}^{\text{new}} = U\_{iz} \cdot \frac{a\_i^U (\nabla\_U^-)\_{iz} + 1}{a\_i^U (\nabla\_U^+)\_{iz} + b\_i^U};$$

	- 4. Update V according to Equation (10):

$$V\_{kj}^{\text{new}} = V\_{kj} \cdot \frac{a\_j^V (\nabla\_V^-)\_{kj} + 1}{a\_j^V (\nabla\_V^+)\_{kj} + b\_j^V};$$

	- 5. **End while**
	- 6. Return Ŷ = UYV.
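The alternating updates of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the random initialization, the use of the freshly updated U inside the V step, the entrywise reading of the L<sub>1</sub> stopping rule, the iteration cap, and the function name `tssr_fit` are all assumptions.

```python
import numpy as np

def tssr_fit(Y, W, S_d, S_t, beta, lam_d, lam_t, tol=1e-6, max_iter=1000, seed=0):
    """Alternate the multiplicative updates (6) and (10); return U, V and Y_hat = U Y V."""
    if beta <= 0:
        raise ValueError("beta must be positive so that the positive gradient parts are > 0")
    m, n = Y.shape
    rng = np.random.default_rng(seed)
    # Assumed initialization: random positive entries, normalized to meet the sum constraints.
    U = rng.random((m, m)) + 1e-3
    U /= U.sum(axis=1, keepdims=True)
    V = rng.random((n, n)) + 1e-3
    V /= V.sum(axis=0, keepdims=True)

    for _ in range(max_iter):
        # Update U, Equation (6); a and b are row-wise sums, Equations (7) and (8).
        A = V.T @ Y.T
        pos = 2 * (W * (U @ Y @ V)) @ A + 2 * lam_d * U + beta
        neg = 2 * (W * Y) @ A + 2 * lam_d * S_d
        a = np.sum(U / pos, axis=1, keepdims=True)
        b = np.sum(U * neg / pos, axis=1, keepdims=True)
        U_new = U * (a * neg + 1) / (a * pos + b)

        # Update V, Equation (10); a and b are column-wise sums.
        B = Y.T @ U_new.T
        pos = 2 * B @ (W * (U_new @ Y @ V)) + 2 * lam_t * V + beta
        neg = 2 * B @ (W * Y) + 2 * lam_t * S_t
        a = np.sum(V / pos, axis=0, keepdims=True)
        b = np.sum(V * neg / pos, axis=0, keepdims=True)
        V_new = V * (a * neg + 1) / (a * pos + b)

        # Stop when both changes, in entrywise L1 norm, fall below tol.
        delta = max(np.abs(U_new - U).sum(), np.abs(V_new - V).sum())
        U, V = U_new, V_new
        if delta < tol:
            break
    return U, V, U @ Y @ V
```

A call such as `U, V, Y_hat = tssr_fit(Y, W, S_d, S_t, beta=2.0, lam_d=1.0, lam_t=1.0)` returns the two coefficient matrices together with the score matrix. The updates are multiplicative with nonnegative factors, so U and V stay nonnegative without any explicit projection step.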


## 3. RESULTS

In this section, we demonstrate the performance of various algorithms on three real datasets. Furthermore, case studies of three cancer diseases (i.e., Melanoma, Glioblastoma, and Glioma) are performed to validate the effectiveness of our TSSR model. The materials, experimental settings, and parameter settings are described as follows.

# 3.1. Materials

### 3.1.1. LncRNA-Disease Associations

We collect three datasets to evaluate the performance of various prediction algorithms. The first dataset is downloaded from the supplementary data of an article (Lu et al., 2018) and contains 621 experimentally confirmed lncRNA-disease associations between 226 diseases and 285 lncRNAs from the LncRNADisease database<sup>1</sup> established in 2015. The second dataset, involving 260 high-quality associations between 95 lncRNAs and 81 human diseases, is obtained from the supplementary files of a published article (Chen et al., 2015), which retrieved data from the MNDR database<sup>2</sup> (Wang et al., 2013) in March 2015. The third dataset is downloaded from the Lnc2Cancer database<sup>3</sup> in 2015. After removing duplicate records for the same lncRNA-disease pair, we obtain 677 distinct associations involving 54 human cancers and 436 lncRNAs. The statistics of the three datasets are given in **Table 1**.

### 3.1.2. Disease Similarities

As previous studies have discovered that diseases with similar phenotypes are usually related to similar dysfunctions of lncRNAs (Chen et al., 2015), incorporating the similarities among diseases estimated from other databases may help to infer the potential associations between diseases and lncRNAs based on known lncRNA-disease associations. Similar to previous studies (Wang et al., 2010; Chen et al., 2015), we construct the similarity matrix S<sub>t</sub> of diseases by integrating two sources: the disease semantic similarity matrix, inferred from the structure of the directed acyclic graph that describes the relationships among diseases (Wang et al., 2010; Chen et al., 2015), and the disease Gaussian interaction profile kernel similarity matrix, inferred from known associations between diseases and lncRNAs (Chen and Yan, 2013; Chen et al., 2015). In particular, we obtain the similarity matrix S<sub>t</sub> by averaging the disease semantic similarity matrix and the disease Gaussian interaction profile kernel similarity matrix (van Laarhoven et al., 2011; Chen and Yan, 2013; Chen et al., 2015, 2016a).

<sup>1</sup>http://www.cuilab.cn/lncrnadisease

<sup>2</sup>http://www.rna-society.org/mndr/

<sup>3</sup>http://www.bio-bigdata.com/lnc2cancer/

### 3.1.3. LncRNA Similarities

Since lncRNAs with similar functions tend to exhibit similar associations with diseases, calculating the similarities among lncRNAs promotes the identification of potential associations between diseases and lncRNAs. In this study, we calculate the similarity matrix S<sub>d</sub> of lncRNAs by integrating the functional similarity matrix calculated by the LNCSIM model (Chen et al., 2015) and the lncRNA Gaussian interaction profile kernel similarity matrix estimated from known associations between lncRNAs and diseases (Chen and Yan, 2013). Similar to the disease similarity matrix S<sub>t</sub>, we obtain the lncRNA similarity matrix S<sub>d</sub> by averaging the lncRNA functional similarity matrix and the Gaussian interaction profile kernel similarity matrix (van Laarhoven et al., 2011; Chen and Yan, 2013; Chen et al., 2015, 2016a).
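The Gaussian interaction profile (GIP) kernel is cited above without its formula. The sketch below assumes the commonly used definition from van Laarhoven et al. (2011), in which the bandwidth is normalized by the mean squared norm of the interaction profiles and the base bandwidth is set to 1. The function names are hypothetical, and the semantic or functional similarity matrix is taken as a given input rather than computed.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel between the rows of `profiles` (one interaction profile per row)."""
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq                  # normalized bandwidth (assumed convention)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def integrate_similarity(other_sim, gip_sim):
    """Average a precomputed similarity matrix with the GIP kernel matrix."""
    return (other_sim + gip_sim) / 2

# Rows of Y are lncRNAs and columns are diseases, so:
#   lncRNA GIP kernel  -> gip_kernel(Y)
#   disease GIP kernel -> gip_kernel(Y.T)
```

In a cross-validation setting, the kernel would normally be computed from the training associations only, so that held-out entries do not leak into the similarity matrices; whether this was done here is not stated in the text.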

# 3.2. Experimental Settings

To illustrate the effectiveness of our proposed TSSR model, we compare our method with six other state-of-the-art association prediction methods, namely NetLapRLS (Xia et al., 2010), BLM-NII (Mei et al., 2012), CMF (Zheng et al., 2013), PBMDA (You et al., 2017a), PRMDA (You et al., 2017b), and SIMCLDA (Lu et al., 2018). All these methods are designed for predicting the inter-associations between different types of biological entities, and all of them can make use of the prior intra-associations among biological entities to improve their performance. Thus, all these algorithms are well suited to the task of lncRNA-disease association prediction. Moreover, our experimental results show that they are effective in inferring the associations between diseases and lncRNAs. Specifically, 15 repetitions of 10-fold cross validation (CV) are conducted for each model, with the receiver operating characteristic (ROC) curve as the main metric to evaluate the performance. By stacking the columns of matrix Y, we obtain an mn × 1 vector, denoted as vec(Y). In each repetition of 10-fold CV, we randomly divide vec(Y) into ten disjoint folds. Nine folds are treated as the training set while the remaining fold is left out as the testing set. The AUC (Area Under Curve) score is calculated for each 10-fold CV repetition, and the final AUC score for each model is obtained by averaging over the 15 repetitions.
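One repetition of this entry-wise cross validation can be sketched as follows. The code assumes that held-out entries are hidden by setting their weight W<sub>ij</sub> to 0 and by zeroing them in the training copy of Y, and that one AUC is computed per repetition from the pooled held-out scores; these details, the rank-based AUC helper, and the hypothetical `predict` callable are illustrative choices and are not taken from the text.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) statistic, with average ranks for ties."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1     # average 1-based rank of the tie group
        i = j + 1
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def kfold_masks(shape, n_folds=10, seed=0):
    """Yield (W_train, held_out_index) pairs over the entries of vec(Y)."""
    m, n = shape
    perm = np.random.default_rng(seed).permutation(m * n)
    for fold in np.array_split(perm, n_folds):
        w = np.ones(m * n)
        w[fold] = 0.0
        yield w.reshape((m, n), order="F"), fold    # column stacking, as in vec(Y)

def cross_validated_auc(Y, predict, n_folds=10, seed=0):
    """One CV repetition; `predict(Y_train, W)` must return an m x n score matrix."""
    y_vec = Y.flatten(order="F")
    scores = np.zeros(y_vec.shape, dtype=float)
    for W, fold in kfold_masks(Y.shape, n_folds, seed):
        Y_hat = predict(Y * W, W)                   # held-out labels are hidden
        scores[fold] = Y_hat.flatten(order="F")[fold]
    return auc_score(y_vec, scores)
```

Repeating `cross_validated_auc` with 15 different seeds and averaging the results corresponds to the protocol described above.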

# 3.3. Parameter Settings

As each model has some hyperparameters that need to be predefined, we perform cross validation on the training set to determine the values of these hyperparameters. In particular, the parameter settings for the various models are described as follows. For NetLapRLS (Xia et al., 2010), the hyperparameters satisfy γ<sub>d2</sub>/γ<sub>d1</sub> = γ<sub>p2</sub>/γ<sub>p1</sub> and β<sub>d</sub> = β<sub>p</sub>, with their values chosen from {10<sup>−6</sup>, 10<sup>−5</sup>, . . . , 10<sup>2</sup>}. For BLM-NII (Mei et al., 2012), the value of the linear combination weight α is chosen from {0, 0.1, 0.2, . . . , 1.0}. The max function is utilized to combine the interaction scores inferred from the disease and lncRNA sides. For the matrix factorization based methods, the dimensionality of the latent space K is selected from {50, 100} (Zheng et al., 2013). For CMF (Zheng et al., 2013), the regularization coefficient λ<sub>1</sub> is chosen from {2<sup>−2</sup>, . . . , 2<sup>1</sup>} (Zheng et al., 2013), while the values of λ<sub>d</sub> and λ<sub>t</sub> are chosen from {2<sup>−3</sup>, 2<sup>−2</sup>, . . . , 2<sup>5</sup>}. For PBMDA (You et al., 2017a), the maximum path length L is set to 3 and the weight threshold T is selected from {0.2, 0.3, . . . , 0.8} with the step size set to 0.1, while the decay factor α is set to 2.26. For SIMCLDA (Lu et al., 2018), we set the values of α<sub>l</sub> and α<sub>d</sub> from 0.1 to 1 with step size 0.1 and select the regularization parameter from {10<sup>−3</sup>, 10<sup>−2</sup>, . . . , 10<sup>3</sup>}. For TSSR, we choose the parameters β and λ<sub>d</sub> = λ<sub>t</sub> from {2<sup>−10</sup>, 2<sup>−9</sup>, . . . , 2<sup>9</sup>, 2<sup>10</sup>}. Note that the most suitable hyperparameters of a machine learning model usually differ across datasets. Therefore, in this work, we adopt grid search (Bergstra and Bengio, 2012) to select the optimal hyperparameters for each model on each dataset.
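For TSSR, the grid described above has 21 values per parameter. The following sketch enumerates it with λ<sub>d</sub> tied to λ<sub>t</sub>; the `evaluate` callable stands in for a cross-validated AUC computed on the training set, and `toy_evaluate` is a purely hypothetical placeholder used only to make the example runnable.

```python
import itertools
import math

def grid_search(evaluate, beta_grid, lam_grid):
    """Return (best_score, beta, lam) over the grid, with lam_d = lam_t = lam."""
    best = None
    for beta, lam in itertools.product(beta_grid, lam_grid):
        score = evaluate(beta=beta, lam_d=lam, lam_t=lam)
        if best is None or score > best[0]:
            best = (score, beta, lam)
    return best

# Candidate values 2^-10, 2^-9, ..., 2^9, 2^10 for both beta and lambda.
beta_grid = [2.0 ** e for e in range(-10, 11)]
lam_grid = [2.0 ** e for e in range(-10, 11)]

def toy_evaluate(beta, lam_d, lam_t):
    """Hypothetical stand-in for a validation AUC; peaks at beta = 2^1, lam = 2^-10."""
    return -(math.log2(beta) - 1) ** 2 - (math.log2(lam_d) + 10) ** 2

best_score, best_beta, best_lam = grid_search(toy_evaluate, beta_grid, lam_grid)
```

Tuning λ<sub>d</sub> and λ<sub>t</sub> separately would turn the two-dimensional grid into a three-dimensional one, that is, 21 times as many evaluations.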

# 3.4. Comparison With State-of-the-Art Methods

We conduct the experiments with 10-fold CV to assess the performance of TSSR in predicting potential lncRNA-disease associations, compared with six other state-of-the-art methods. Here, the AUC score is used to evaluate the predictive performance of the various methods. The experimental results measured by AUC are shown in **Figures 1**–**3**. As shown in **Figure 1**, on the LncRNADisease dataset, TSSR obtains an AUC score of 0.8736, which is higher than those of the other methods (BLM-NII 0.8641, NetLapRLS 0.7837, CMF 0.7273, PBMDA 0.6885, PRMDA 0.7231, SIMCLDA 0.6067), indicating the superiority of TSSR in predicting lncRNA-disease associations. **Figure 2** shows that on the MNDR dataset, TSSR achieves the best AUC score (TSSR 0.8369, BLM-NII 0.7929, NetLapRLS 0.8210, CMF 0.8078, PBMDA 0.7722, PRMDA 0.6596, SIMCLDA 0.6187). On the Lnc2Cancer dataset (**Figure 3**), TSSR remains competitive with the other six methods with respect to AUC score (TSSR 0.9814, BLM-NII 0.9859, NetLapRLS 0.9392, CMF 0.9864, PBMDA 0.9680, PRMDA 0.8179, SIMCLDA 0.6190). Note that on Lnc2Cancer, TSSR achieves performance similar to that of BLM-NII and CMF. This may be due to the parameter setting of TSSR. In this study, the values of the hyperparameters λ<sub>d</sub> and λ<sub>t</sub> (which control the influences of prior intra-similarities among lncRNAs and diseases) are set to be equal for simplicity, which is reasonable when the numbers of lncRNAs and diseases are balanced. However, the numbers of lncRNAs and diseases in the Lnc2Cancer dataset are imbalanced. Thus, forcing λ<sub>d</sub> and λ<sub>t</sub> to be equal may limit the performance of TSSR. If the values of λ<sub>d</sub> and λ<sub>t</sub> are tuned separately, TSSR could achieve better performance. Moreover, to evaluate the effect of external information on the performance of TSSR, we remove the regularization terms related to the external information (i.e., setting λ<sub>d</sub> = λ<sub>t</sub> = 0) and show the results in **Figure 4**. As shown in this figure, the performance of TSSR and that of TSSR without external information (denoted by TSSR\_original) are comparable (on LncRNADisease, TSSR 0.8736, TSSR\_original 0.8735; on MNDR, TSSR 0.8369, TSSR\_original 0.8367; on Lnc2Cancer, TSSR 0.9814, TSSR\_original 0.9614), which means the performance of TSSR is mainly due to the self-representation learning. Thus, TSSR does not depend heavily on the external information. All these results demonstrate the effectiveness of the proposed TSSR in predicting potential lncRNA-disease associations.

# 3.5. Effects of Parameters

The proposed TSSR involves three parameters, λ<sub>d</sub>, λ<sub>t</sub>, and β, where λ<sub>d</sub> and λ<sub>t</sub> control the influences of prior intra-associations among lncRNAs and diseases and β controls the sparsity of U and V. We study how these parameters affect the performance of TSSR.

**Figure 5** shows the prediction performance of TSSR on the LncRNADisease, MNDR, and Lnc2Cancer datasets, measured by AUC with respect to different values of λ<sub>d</sub> and λ<sub>t</sub>. As shown in **Figure 5**, the optimal value of λ<sub>d</sub> = λ<sub>t</sub> for these three datasets is 2<sup>−10</sup>, 2<sup>0</sup>, and 2<sup>2</sup>, respectively, while β is set to 2<sup>1</sup>, 2<sup>8</sup>, and 2<sup>8</sup>, respectively. We find that TSSR usually performs well when the values of λ<sub>d</sub> and λ<sub>t</sub> are relatively small, which means the additional use of external information is not always helpful for performance improvement. On the contrary, if the external information contains noise, the performance of TSSR may decrease when the effect of external information is overemphasized. These results demonstrate that TSSR can effectively learn the representation matrices from known lncRNA-disease associations, and flexibly utilize external information to promote the prediction of potential lncRNA-disease associations.

In addition, we study the impact of the sparsity control parameter β. **Figure 6** illustrates the AUC scores obtained by TSSR for different values of β. As shown in **Figure 6**, on these three datasets, TSSR achieves the best AUC score when the value of β is 2<sup>1</sup>, 2<sup>8</sup>, and 2<sup>8</sup>, respectively, while λ<sub>d</sub> = λ<sub>t</sub> is set to 2<sup>−10</sup>, 2<sup>0</sup>, and 2<sup>2</sup>, respectively. We can also see from this figure that larger values of β generally achieve better performance, which indicates the importance of controlling the sparsity of the representation matrices U and V.

# 3.6. Case Studies

To further validate the performance of our algorithm, we apply our TSSR model to the LncRNADisease dataset to identify the lncRNAs most likely to be associated with three cancers (i.e., Melanoma, Glioma, and Glioblastoma). Here, all the known associations in the LncRNADisease dataset are used to train the model. Then we select the top 20 lncRNAs that receive the highest predicted ranks for each cancer and verify these predictions based on the MNDR and Lnc2Cancer databases. Moreover, the relevant literature that supports the prediction results is listed to indicate whether the predicted lncRNA-disease associations have been experimentally validated. Notably, the MNDR database contains both experimental and prediction evidence (Ning et al., 2016; Ping et al., 2018). The results for the three cancers are shown in **Tables 2**–**4**, respectively. Note that we only show the predictions that are not included in the training set.
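The ranking step of the case studies can be sketched as below. The code follows one reading of the procedure, in which associations already present in the training matrix are excluded before the top candidates are taken; the function name, its arguments, and the toy numbers in the usage example are hypothetical.

```python
import numpy as np

def top_k_novel(Y, Y_hat, disease_idx, k=20):
    """Indices of the k highest-scoring lncRNAs for one disease, excluding known pairs."""
    known = Y[:, disease_idx] == 1
    scores = Y_hat[:, disease_idx].astype(float)   # astype returns a copy
    scores[known] = -np.inf                        # drop associations seen in training
    order = np.argsort(-scores, kind="stable")     # descending by predicted score
    n_candidates = int((~known).sum())
    return order[:min(k, n_candidates)]

# Toy example: 4 lncRNAs, 2 diseases; lncRNA 0 is already linked to disease 0.
Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], dtype=float)
Y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
candidates = top_k_novel(Y, Y_hat, disease_idx=0, k=2)
```

The returned indices would then be mapped back to lncRNA names and checked against independent databases or the literature, as described above.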

Melanoma is a deadly malignancy that develops from pigment-containing cells, and its incidence is increasing faster than that of any other type of cancer (Aladowicz et al., 2013). People with low levels of skin pigment who are exposed to excess ultraviolet (UV) light have a high risk of developing melanoma (Kanavy and Gerstenblith, 2011). It has been estimated that by 2030, melanoma could overtake colorectal cancer as the fifth most common cancer (Rahib et al., 2014). Therefore, we apply our TSSR model to predict the potential melanoma-associated lncRNAs. According to the results shown in **Table 2** (the complete list of the top 20 identified lncRNAs is shown in **Supplementary Material**), 10 out of the top 20 identified lncRNAs have been verified. For example, Luan et al. (2016) discovered that MALAT1 could promote the cell proliferation, invasion, and migration of melanoma. Li et al. observed that MEG3 expression was markedly decreased in melanoma cells (Li et al., 2018). They also found that melanoma cell apoptosis was induced by up-regulation of MEG3, and consequently concluded that overexpression of MEG3 significantly represses the migration and invasion ability of melanoma cells.

Glioma is one of the most common primary malignant tumors originating in the brain and comprises approximately 30% of all brain tumors (Goodenberger and Jenkins, 2012; Boele et al., 2015). Gliomas are graded from I to IV under the World Health Organization (WHO) grading system (Louis et al., 2016a,b). The exact causes of glioma are still unclear at present (Kwiatkowska and Symons, 2013; Li et al., 2015). Studies have revealed the roles of lncRNAs in the development of human diseases, including glioma (Zhou et al., 2018). Here, we utilize TSSR to identify the lncRNAs that are most likely to be related to glioma. Based on the experimental results, 9 out of the top 20 identified lncRNAs have been validated in the MNDR and Lnc2Cancer databases and other relevant literature. The results are shown in **Table 3** (the complete list of the top 20 identified lncRNAs is shown in **Supplementary Material**). For example, Ma et al. discovered that, compared with paired normal tissues, the expression level of lncRNA MALAT1 was increased in glioma tissues, which means MALAT1 can be treated as a convincing marker for the prognosis of glioma patients (Ma et al., 2015). Zou et al. revealed that glioma patients with high PVT1 expression had a low survival rate (Zou et al., 2017). Moreover, patients who received chemotherapy and radiotherapy could improve their survival by down-regulating PVT1. They also indicated that PVT1 could serve as a potential target for the treatment of diffuse gliomas.

FIGURE 1 | AUC scores of various algorithms in LncRNADisease dataset (\* indicates TSSR significantly outperforms the competitor with *p* < 0.05 using *t*-test, error bars denote 95% confidence intervals).

Glioblastoma, also known as glioblastoma multiforme (GBM) (grade IV of Glioma), is the most common and aggressive form of primary brain tumor and kills nearly every patient in a median time of 15 months (Bleeker et al., 2012; Jovčevska et al., 2013). More importantly, there is still no clear way to prevent the disease (Gallego, 2015). Therefore, it is urgent to predict the potential glioblastoma-associated lncRNAs. In this study, we use our TSSR to undertake this task. As shown in **Table 4**, 8 out of the 20 lncRNAs have been verified in the MNDR and Lnc2Cancer databases and other relevant literature (the complete list of the top 20 identified lncRNAs is shown in **Supplementary Material**). For example, Zhou et al. reported that HOTAIR expression is significantly increased in multiple human cancers, including GBM, and found that HOTAIR is necessary for GBM formation in vivo (Zhou X. et al., 2015). Thus, HOTAIR could be a potential therapeutic target in glioblastoma. Liu et al. found that NBAT1 has lower expression in glioblastoma tissues than in normal brain tissues, and they also observed that up-regulated NBAT1 inhibits the proliferation of T98 and U87 cells via regulating Akt, suggesting that NBAT1 may be related to the prognosis of glioblastoma (Liu et al., 2018).

FIGURE 3 | AUC scores of various algorithms in Lnc2Cancer dataset (\* indicates TSSR significantly outperforms the competitor with *p* < 0.05 using *t*-test, error bars denote 95% confidence intervals).

TABLE 3 | The identified novel lncRNAs that have been verified to be associated with Glioma.

*Prediction evidence denotes the prediction associations in MNDR database.*

Based on the above case studies, we find that our TSSR is effective in identifying novel associations between lncRNAs and diseases based on known lncRNA-disease associations and intra-associations among lncRNAs and diseases.

## 4. CONCLUSION

Increasing evidence indicates the role of lncRNAs in biological processes, which motivates the development of computational models to identify the potential associations between lncRNAs and diseases. Predicting the potential associations between lncRNAs and diseases based on known lncRNA-disease associations is equivalent to a recommendation problem with implicit feedback, where the task is to predict whether the unknown pairs in Y are potential associations or not. In this paper, we present a novel model, named two-side sparse self-representation (TSSR), to predict the scores of unknown pairs in Y. Based on these predicted scores, we can identify potential associations between lncRNAs and diseases. Unlike previous matrix factorization techniques that project lncRNAs and diseases into a shared latent space and predict lncRNA-disease associations based on the inner product of their latent vectors (where the dimension of the latent space is unknown in advance and hard to determine), our model directly learns the intra-similarities among lncRNAs and diseases from the observed associations in Y, and utilizes the learned representation matrices to reconstruct Y by regarding the original Y as a dictionary. As shown in Equation (4), TSSR does not need to make many assumptions about the model in advance. Moreover, by forcing the representation matrices to be sparse, TSSR can learn the most similar relationships among lncRNAs and diseases based on the observed associations in Y. Thus, TSSR is data-adaptive and avoids the determination of some sensitive parameters such as the dimension of the latent space and the number of nearest neighbors. Unlike random walk-based or data integration-based methods that rely heavily on the similarity networks inferred from external information with predefined metrics, our model can adaptively learn the self-representations of lncRNAs and diseases according to their performance in reconstructing the observed associations in Y. Moreover, in case the input data Y only includes a small number of known associations, our model can draw support from the intra-associations among lncRNAs and diseases derived from external information to enhance the learning of the representation matrices. Therefore, our model can effectively predict potential lncRNA-disease associations by leveraging the information provided by known lncRNA-disease associations and external information on lncRNAs and diseases. Experimental results on three real datasets show that TSSR achieves better or competitive performance compared with six other state-of-the-art methods. The effectiveness of TSSR in predicting potential lncRNA-disease associations is also evaluated based on three case studies. As a link prediction algorithm, our TSSR model is flexible and could be used to handle other link prediction tasks in bipartite networks.

TABLE 4 | The identified novel lncRNAs that have been verified to be associated with Glioblastoma.

*Prediction evidence denotes the prediction associations in MNDR database.*

Furthermore, since external information on lncRNAs and diseases is utilized to enhance the performance of the various methods, we also perform a sensitivity analysis to assess the influence of noisy information on their performance. In particular, we generate the similarity matrices S<sub>d</sub> and S<sub>t</sub> randomly (i.e., the elements in S<sub>d</sub> and S<sub>t</sub> are generated randomly) and test the performance of the various methods. The detailed experimental results are shown in **Tables S4–S6**. As shown in these tables, although the performance of TSSR is affected by the noisy information, it still achieves the best performance, which means TSSR could be used to undertake the lncRNA-disease prediction task even when the collected external information on lncRNAs and diseases contains a lot of noise.

With the development of high-throughput experimental techniques, an increasing amount of data for lncRNAs and diseases is becoming available. We can calculate the similarities among lncRNAs (or diseases) based on different views of data and different metrics. How to efficiently seek the optimal combination of these similarities is an interesting direction for future work. We will try to extend our model to handle this problem.

# AUTHOR CONTRIBUTIONS

LO-Y and JH conceived and designed the study, performed the statistical analysis, and drafted the manuscript. ZZ conceived of the study, and participated in its design and coordination and helped to draft the manuscript. X-FZ and Y-RL participated in the design of the study, performed the statistical analysis, and helped to revise the manuscript. YS and SH participated in the design of the study and helped to revise the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work is supported by the National Natural Science Foundation of China under grants No. 61602309, 61871272, 61575125, 11871026, and 61402190, Shenzhen Fundamental Research Program, under grant JCYJ20170817095210760 and JCYJ20170302154328155, Natural Science Foundation of SZU [2017077], Guangdong Special Support Program of Topnotch Young Professionals, under grants 2014TQ01X273, and 2015TQ01R453, Guangdong Foundation of Outstanding Young Teachers in Higher Education Institutions, under grant Yq2015141, Natural Science Foundation of Hubei province [ZRMS2018001337].

# REFERENCES


# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00476/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ou-Yang, Huang, Zhang, Li, Sun, He and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# LncRRIsearch: A Web Server for lncRNA-RNA Interaction Prediction Integrated With Tissue-Specific Expression and Subcellular Localization Data

#### Tsukasa Fukunaga<sup>1,2†</sup>, Junichi Iwakiri<sup>3,4†</sup>, Yukiteru Ono<sup>5</sup> and Michiaki Hamada<sup>1,4,6,7,8,9</sup>\*

*<sup>1</sup> Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, Tokyo, Japan, <sup>2</sup> Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan, <sup>3</sup> Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan, <sup>4</sup> Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, <sup>5</sup> IMSBIO Co., Ltd., Tokyo, Japan, <sup>6</sup> Computational Bio Big-Data Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, <sup>7</sup> Institute for Medical-oriented Structural Biology, Waseda University, Tokyo, Japan, <sup>8</sup> Graduate School of Medicine, Nippon Medical School, Tokyo, Japan, <sup>9</sup> Center for Data Science, Waseda University, Tokyo, Japan*

#### Edited by:

*Philipp Kapranov, Huaqiao University, China*

#### Reviewed by:

*Yuchen Liu, Shenzhen University, China Jin Chen, University of Kentucky, United States*

> \*Correspondence: *Michiaki Hamada mhamada@waseda.jp*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

Received: *18 February 2019* Accepted: *30 April 2019* Published: *28 May 2019*

#### Citation:

*Fukunaga T, Iwakiri J, Ono Y and Hamada M (2019) LncRRIsearch: A Web Server for lncRNA-RNA Interaction Prediction Integrated With Tissue-Specific Expression and Subcellular Localization Data. Front. Genet. 10:462. doi: 10.3389/fgene.2019.00462*

Long non-coding RNAs (lncRNAs) play critical roles in various biological processes, but the function of the majority of lncRNAs is still unclear. One approach for estimating the function of a lncRNA is the identification of its interaction target because functions of lncRNAs are expressed through interaction with other biomolecules in quite a few cases. In this paper, we developed "LncRRIsearch," which is a web server for comprehensive prediction of human and mouse lncRNA-lncRNA and lncRNA-mRNA interactions. The prediction was conducted using RIblast, which is a fast and accurate RNA-RNA interaction prediction tool. Users can investigate interaction target RNAs of a particular lncRNA through a web interface. In addition, we integrated tissue-specific expression and subcellular localization data for the lncRNAs with the web server. These data enable users to examine tissue-specific or subcellular localized lncRNA interactions. LncRRIsearch is publicly accessible at http://rtools.cbrc.jp/LncRRIsearch/.

Keywords: lncRNA, RNA-RNA interaction, web server, tissue-specific expression, subcellular localization

# 1. INTRODUCTION

Long non-coding RNAs (lncRNAs) were initially considered to be transcriptional noise or experimental artifacts, but recent research has revealed that lncRNAs play important roles in various biological processes, such as cell differentiation (Fatica and Bozzoni, 2014) and functioning of the immune system (Carpenter et al., 2013). While large-scale RNA sequencing studies have discovered several tens of thousands of lncRNAs in the human transcriptome (Iyer et al., 2015; Hon et al., 2017), the function is known in detail for only a small number of lncRNAs (Quek et al., 2014; de Hoon et al., 2015). To understand the molecular mechanisms of complex biological systems, elucidating the functions of more lncRNAs is an important research topic.


Recent discoveries of lncRNA-mRNA interactions that regulate biological processes (Gong and Maquat, 2011; Kretz et al., 2013; Abdelmohsen et al., 2014) suggest that comprehensive lncRNA-mRNA interaction predictions are helpful for estimating lncRNA function. Several databases and web services have been developed for function prediction based on lncRNA-mRNA interactions, but no web service offers comprehensive prediction of human and mouse lncRNA interactions. RAID contains some lncRNA-mRNA interaction data taken from the literature, but the number of interactions is limited and coverage is low (Yi et al., 2017). RISE includes experimentally validated lncRNA-RNA interactions based on high-throughput sequencing methods (Lu et al., 2016; Nguyen et al., 2016), but the number of lncRNA interactions is also limited (Gong et al., 2017). The database compiled by Terai et al. (2016) contains predicted lncRNA-mRNA and lncRNA-lncRNA interaction data at transcriptome scale, but it does not store more than one local base-pairing interaction for each lncRNA-RNA interaction. In addition, that database includes only human lncRNA-RNA interactions.

To address these shortcomings, we have constructed LncRRIsearch, a web server for comprehensive prediction of human and mouse lncRNA-mRNA and lncRNA-lncRNA interactions. We applied RIblast to the human and mouse transcriptomes to predict RNA-RNA interactions (Fukunaga and Hamada, 2017). LncRRIsearch provides multiple local base-pairing interactions predicted by RIblast for each lncRNA-RNA interaction. In addition, unlike previous databases or web services, we integrated tissue-specific RNA expression and subcellular localization data of lncRNAs with our web service. These data help to verify the correctness of the predicted interactions. Indeed, we showed in previous research that tissue-specificity information improves the prediction accuracy for lncRNA-RNA interactions (Iwakiri et al., 2017). LncRRIsearch is freely accessible at http://rtools.cbrc.jp/LncRRIsearch/.

# 2. MATERIALS AND METHODS

# 2.1. Dataset of lncRNA and mRNA Sequences

We downloaded human and mouse RNA sequences from GENCODE version 25 and M14, respectively (Harrow et al., 2012). While we used all lncRNA transcript sequences in our analysis, we used only the longest mRNA transcript for each gene to reduce the size of the dataset. In addition, we excluded transcripts in the pseudoautosomal region of the Y chromosome from the analysis. As a result, we obtained 27,674 lncRNA and 20,360 mRNA transcripts for the human RNA dataset, and 16,113 lncRNA and 22,468 mRNA transcripts for the mouse RNA dataset. Note that the human RNA dataset in LncRRIsearch contains an additional 175 mRNA and 3,776 lncRNA transcripts in comparison with the database previously compiled by Terai et al. (2016). This difference is derived from the version update of GENCODE.
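The longest-transcript selection described above can be illustrated with a minimal Python sketch. The record layout `(gene_id, transcript_id, sequence)`, the function name, and the tie-breaking rule (the first transcript seen wins) are assumptions made for illustration; the original pipeline is not described at this level of detail.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (gene_id, transcript_id, sequence); hypothetical layout


def longest_transcript_per_gene(records: Iterable[Record]) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to the (transcript_id, sequence) of its longest transcript.

    Ties are resolved in favor of the transcript seen first (an assumption).
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, transcript_id, sequence in records:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (transcript_id, sequence)
    return best


# Toy input: two isoforms of geneA and a single isoform of geneB.
records = [
    ("geneA", "txA1", "AUGGC"),
    ("geneA", "txA2", "AUGGCUUAGC"),
    ("geneB", "txB1", "AUGC"),
]
selected = longest_transcript_per_gene(records)
```

In this toy input, `txA2` is kept for `geneA` because it is the longer isoform.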

# 2.2. Prediction of lncRNA-RNA Interactions

RNA-RNA interaction prediction for long RNAs is a time-consuming calculation, and even the fastest programs at present cannot predict the interactions in real time. Therefore, we predicted the comprehensive human and mouse lncRNA-mRNA and lncRNA-lncRNA interactomes in advance and stored the interaction results in a MySQL database. By selecting a query RNA or a target RNA, users can obtain pre-calculated prediction results for the selected RNA.

We used the RIblast program, which was recently developed by our group, for comprehensive RNA-RNA interaction prediction (Fukunaga and Hamada, 2017). RIblast predicts local base-pairing interactions based on an interaction energy that is computed from both accessibility energy and hybridization energy. Briefly, RIblast considers both the stabilization energy derived from hybridization between two RNA sequences and the energy required to prevent the formation of intramolecular double-stranded structure. (If an RNA region forms a double-stranded structure in the secondary structure, the region tends not to interact with other RNA molecules via base-pairing.) RIblast outputs multiple candidate local base-pairing interactions for each RNA-RNA pair. The threshold interaction energy was set to −12 or −16 kcal/mol. We regarded the query and target RNA pairs (A, B) and (B, A) as being different because RIblast predicts slightly different interactions for these pairs. Users can sort target transcripts for each query transcript by two criteria: MINENERGY and SUMENERGY. MINENERGY denotes the minimum interaction energy among all local base-pairing interactions between the query RNA and the target RNA. SUMENERGY denotes the sum of the interaction energies of all local base-pairings for the RNA-RNA pair.
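As a rough illustration of the two sorting criteria, the following Python sketch computes MINENERGY and SUMENERGY from a list of local interaction energies for one RNA-RNA pair. The function name and the interpretation of the threshold (keeping only local interactions whose energy is at or below it) are assumptions; this is not the RIblast or LncRRIsearch implementation.

```python
from typing import Dict, List, Optional


def summarize_pair(local_energies: List[float],
                   threshold: float = -12.0) -> Optional[Dict[str, float]]:
    """Summarize the local interactions of one query-target pair.

    Only energies at or below `threshold` (kcal/mol) are kept; this reading
    of the threshold is an assumption. Returns None if nothing passes.
    """
    kept = [energy for energy in local_energies if energy <= threshold]
    if not kept:
        return None
    return {
        "MINENERGY": min(kept),  # most stable single local interaction
        "SUMENERGY": sum(kept),  # total over all retained local interactions
    }


# Hypothetical energies (kcal/mol) for three local interactions of one pair.
energies = [-18.5, -13.0, -9.2]
loose = summarize_pair(energies, threshold=-12.0)
strict = summarize_pair(energies, threshold=-16.0)
```

With the looser threshold two local interactions are retained, so the two scores differ; with the stricter threshold only one remains and both scores coincide.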

We investigated whether experimentally validated lncRNA-mRNA interactions were predicted by RIblast. We verified that RIblast predicted the interaction between human 1/2-sbs RNA (ENST00000548810) and SERPINE1 (ENST00000223095), and the interaction between human 1/2-sbs RNA and ANKRD57 (ENST00000356454) (Gong and Maquat, 2011). In addition, the interaction between human 7SL RNA (ENST00000635274) and TP53 mRNA (ENST00000617185) (Abdelmohsen et al., 2014) was also predicted by RIblast. LncRRIsearch does not provide human TINCR-mRNA interactions (Kretz et al., 2013): TINCR (ENST00000448587) was annotated as mRNA in GENCODE ver. 25, and we did not predict mRNA-mRNA interactions. In summary, our prediction results include the experimentally validated lncRNA-mRNA interactions for transcripts annotated as lncRNAs.

# 2.3. Expression Analysis for Tissue-Specific lncRNA-RNA Interaction

Expression levels of human lncRNA and mRNA genes were estimated from RNA-seq data derived from five international consortia. The first RNA-seq dataset was derived from 32 tissues collected from 122 human individuals, which was produced by the Human Protein Atlas Project (Expression Atlas ID: E-MTAB-2836) (Uhlén et al., 2015). The second RNA-seq dataset was derived from 30 representative tissues, released by the GTEx Consortium (Expression Atlas ID: E-MTAB-2919) (GTEx Consortium, 2015). The third RNA-seq dataset was produced by the Human Body Map Project from 16 tissues (Expression Atlas ID: E-MTAB-513) (Cabili et al., 2011). The fourth RNA-seq dataset, derived from 19 tissues isolated from fetuses with congenital defects, was released by the Epigenome Roadmap Project (Expression Atlas ID: E-MTAB-3871) (Kundaje et al., 2015). The last RNA-seq dataset, the largest collection of primary cells, was derived from 56 tissues produced by FANTOM5 project (Expression Atlas ID: E-MTAB-3358) (Forrest et al., 2014). Note that the second RNA-seq dataset originally contained 53 tissues derived from several cell lines and subregions of a single tissue. To reduce the number of redundant cell types, 30 representative tissues were arbitrarily selected.

In addition, expression levels of mouse lncRNA and mRNA genes were also estimated from RNA-seq data. The first RNA-seq dataset was derived from nine tissues harvested from an adult male C57BL/6 mouse (Expression Atlas ID: E-GEOD-74747) (Huntley et al., 2016). The second RNA-seq dataset was derived from three mouse strains (C57BL/6, DBA/2J, and CD1) (Expression Atlas ID: E-MTAB-2801) (Merkin et al., 2012). In this dataset, gene expression data across eight (C57BL/6 strain) or nine (DBA/2J and CD1 strains) mouse tissues are available.

Tissue specificities of lncRNA and mRNA genes were investigated based on an outlier analysis of the RNA-seq data using ROKU (Kadota et al., 2006). For each lncRNA and mRNA gene, the tissues in which the gene was specifically expressed were detected based on its extremely high or low expression levels in one or a few tissues. These tissue-specificity data allow the user to investigate tissue-specific lncRNAs that regulate the expression levels of their target mRNAs through base-pairing interactions. The tissue-specific lncRNA-RNA interactions derived from the aforementioned five human RNA-seq datasets and four mouse RNA-seq datasets are provided in LncRRIsearch (**Tables S1–S9**).

# 2.4. Integration With Subcellular Localization Data to LncRRIsearch

The subcellular localization dataset was downloaded from the LncAtlas database (Mas-Ponte et al., 2017). This dataset includes subcellular localization data for 15 human cell lines, and localization was quantified by the "relative concentration index" (RCI), defined as the log2-transformed ratio of FPKM between two expression measurements. For example, a high cytoplasmic/nuclear RCI means that the transcript tends to localize in the cytoplasm rather than the nucleus. For 14 cell lines, two types of RCIs (cytoplasmic/nuclear and nuclear/cytoplasmic RCIs) are included in the dataset. For the K562 cell line, five additional types of RCI data (Chromatin/Nucleus, Nucleolus/Nucleus, Nucleoplasm/Nucleus, Cell membrane/Cytoplasm, and Insoluble fraction/Cytoplasm RCIs) are included in the dataset. These subcellular localized RNA-RNA interactions are also provided in LncRRIsearch (**Tables S10–S12**). Details of the dataset are described in the original publication (Mas-Ponte et al., 2017). Note that mouse subcellular localization data are not included in LncRRIsearch.
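The RCI definition above (a log2-transformed ratio of two FPKM values) can be written as a short Python sketch. The function name and the choice to reject non-positive FPKM values are assumptions for illustration; LncAtlas may treat zero or missing values differently.

```python
import math


def relative_concentration_index(fpkm_numerator: float, fpkm_denominator: float) -> float:
    """Return log2(fpkm_numerator / fpkm_denominator).

    Non-positive inputs are rejected here because the log ratio is undefined;
    how the source dataset handles such cases is not assumed.
    """
    if fpkm_numerator <= 0 or fpkm_denominator <= 0:
        raise ValueError("FPKM values must be positive to compute an RCI")
    return math.log2(fpkm_numerator / fpkm_denominator)


# Hypothetical transcript with FPKM 8 in the cytoplasm and 2 in the nucleus.
cyto_nuc_rci = relative_concentration_index(8.0, 2.0)
nuc_cyto_rci = relative_concentration_index(2.0, 8.0)
```

The two RCIs for the same transcript differ only in sign, which is why a positive cytoplasmic/nuclear RCI indicates cytoplasmic enrichment.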

# 2.5. Database Organization

In LncRRIsearch, tissue-specific expression data and subcellular localization data were stored in a series of MySQL databases. For RNA-RNA interaction data, all pre-calculated SUMENERGY and MINENERGY scores were also stored in the databases, but the local base-pair data were not stored because the data size is too large. In the web service, the base pairs are re-predicted by RIblast in real time when both the query and target RNAs are selected based on SUMENERGY or MINENERGY scores. However, because RIblast cannot predict interactions of long RNAs in real time, base-pair prediction results for RNA sequences longer than 5,000 nt were stored in the databases, and these stored data are referenced by the web service.

# 3. RESULTS

LncRRIsearch provides three types of interaction prediction method (**Figure 1**): a name/ID based method, an expression pattern-based method, and a localization-based method.

# 3.1. Investigation of an RNA-RNA Interaction Based on Name or ID

Users first select the target species (human or mouse) and the energy threshold (−12 or −16 kcal/mol), and then input the name or ID of a gene or transcript (**Figure 1**). LncRRIsearch supports GENCODE gene/transcript names or IDs as input, and either a query lncRNA or a target lncRNA/mRNA is required as the input RNA. After a gene of interest is specified, the transcript isoforms derived from the gene are listed so that a single lncRNA transcript can be selected if the gene encodes multiple isoforms. For the selected lncRNA transcript (query transcript), all interacting RNAs (target transcripts) predicted by RIblast are provided. After a single target transcript is selected, the details of the RNA-RNA interaction between the query and target transcripts are shown (**Figure 2**). In this step, all local base-pairing interactions are listed, and users can download the prediction results as a text file. In addition, the global base-pairing interaction is shown as an image (center left of **Figure 2**). In this image, the query RNA and the target RNA are represented as a blue line and a red line, respectively, and the predicted interactions are displayed as gray or black lines between the two RNAs. The color density of each line indicates the strength of the interaction. For each local base-pairing interaction, text (the output of RIblast) and a graphical view based on VARNA (Darty et al., 2009) are also provided (lower left and lower right of **Figure 2**).

# 3.2. Investigation of Tissue-Specific RNA-RNA Interactions

LncRRIsearch helps users to investigate lncRNA-RNA interactions exhibiting tissue-specific expression patterns (**Figure 1**). Users can select an RNA-seq dataset from four different RNA-seq studies and select a tissue of interest. For the selected tissue, one of three possible tissue-specific expression patterns for the query and target RNA transcripts should be selected: Query and target RNAs are specifically up-regulated in the same tissue; query RNAs are specifically up-regulated and target RNAs are down-regulated in the same tissue; or query RNAs are specifically down-regulated and target RNAs are up-regulated in the same tissue.
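The three selectable expression patterns can be sketched as a simple filter over predicted pairs. In this Python sketch the flag encoding (1 for specifically up-regulated, −1 for specifically down-regulated, 0 otherwise), the pattern names, and the data layout are all hypothetical and are not taken from the LncRRIsearch implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical pattern names mapped to (query flag, target flag).
PATTERNS: Dict[str, Tuple[int, int]] = {
    "up_up": (1, 1),      # query and target both specifically up-regulated
    "up_down": (1, -1),   # query up-regulated, target down-regulated
    "down_up": (-1, 1),   # query down-regulated, target up-regulated
}


def pairs_matching_pattern(pairs: List[Tuple[str, str]],
                           specificity: Dict[str, Dict[str, int]],
                           tissue: str,
                           pattern: str) -> List[Tuple[str, str]]:
    """Keep predicted (query, target) pairs whose flags in `tissue` match `pattern`."""
    query_flag, target_flag = PATTERNS[pattern]
    return [
        (query, target)
        for query, target in pairs
        if specificity.get(query, {}).get(tissue, 0) == query_flag
        and specificity.get(target, {}).get(tissue, 0) == target_flag
    ]


# Toy data: one lncRNA and two predicted target mRNAs.
specificity = {
    "lncA": {"liver": 1},
    "mRNA1": {"liver": 1},
    "mRNA2": {"liver": -1},
}
pairs = [("lncA", "mRNA1"), ("lncA", "mRNA2")]
up_up = pairs_matching_pattern(pairs, specificity, "liver", "up_up")
up_down = pairs_matching_pattern(pairs, specificity, "liver", "up_down")
down_up = pairs_matching_pattern(pairs, specificity, "liver", "down_up")
```

Transcripts without a flag for the chosen tissue are treated as not tissue-specific, so they never match any of the three patterns.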

After selecting the tissue-specific expression pattern, the corresponding query and target RNAs predicted by RIblast are listed. In this step, once a query RNA is selected, the list of possible target RNAs is automatically updated for the selected query. By selecting the tissue-specific query and target RNAs, detailed information about interactions between the query and target RNAs is provided (**Figure 2**). In addition, the expression values of query and target RNAs are provided as a graphical view in the results page (The upper right of **Figure 2**).

# 3.3. Investigation of Subcellular Localized RNA-RNA Interactions

Users can investigate subcellular-localized human lncRNA-RNA interactions (**Figure 1**). Users first select an energy threshold and a cell line of interest. For the selected cell line, a type of RCI and an RCI threshold should be selected. For all cell lines except K562, users can choose either the nucleus/cytosol or the cytosol/nucleus RCI. For the K562 cell line, users have five choices of sub-compartment RCIs in addition to the above-mentioned two RCIs. The subsequent steps are the same as in the investigation of tissue-specific RNA-RNA interactions. The RCI values of query and target RNAs are displayed as a graphical view on the results page (center right of **Figure 2**).

# 4. DISCUSSION

We developed LncRRIsearch, a web server for comprehensive prediction of human and mouse lncRNA-mRNA and lncRNA-lncRNA interactions that integrates tissue-specific expression and subcellular localization data. LncRRIsearch has two advantages over other lncRNA-RNA interaction databases or web services: the comprehensiveness of its interaction prediction and the ability to investigate tissue-specific or subcellular localized interaction patterns.

We envision three future improvements of LncRRIsearch. The first is the development of real-time RNA-RNA interaction prediction software. Although LncRRIsearch provides comprehensive human and mouse lncRNA-RNA interaction based on GENCODE version 25 and M14, novel lncRNAs will be discovered in the future. Real-time prediction would be useful for the discoverers of new lncRNAs to investigate their interactions. The acceleration of RNA-RNA interaction prediction is still an important research topic. One possible direction is the simplification of the energy model. RIblast uses a complete nearest-neighbor energy model in the search step, but some researchers have reported that the use of an approximated energy model produces a marked increase in the calculation speed in exchange for only a slight decrease in the prediction accuracy (Tafer et al., 2011; Wenzel et al., 2012; Alkan et al., 2017).

The second improvement is the integration of the results of RNA-RNA interaction detection experiments. Recently, several high-throughput sequencing methods for the exhaustive identification of RNA-RNA interaction sites have been developed, including PARIS (Lu et al., 2016) and MARIO (Nguyen et al., 2016). Although only a few lncRNA-related interactions have been detected in these experiments, simultaneously displaying predicted and experimentally verified interactions (where available) should be useful for users. In addition, such data will encourage researchers to develop machine-learning-based RNA-RNA interaction prediction programs.

The third improvement is an increase in the number of target species. This improvement would enable us to not only investigate the lncRNA interactions of newly added species but also compare lncRNA interactomes between species. Nguyen et al. recently showed that the conservation of experimentally confirmed lncRNA-RNA interaction regions is high, although lncRNA generally lacks sequence conservation (Nguyen et al., 2016). This means that conservation information should be useful for the verification of predicted lncRNA-RNA interactions.

# DATA AVAILABILITY

All datasets analyzed for this study are included in the manuscript and the **Supplementary Files**. LncRRIsearch is publicly available from http://rtools.cbrc.jp/LncRRIsearch/.

# AUTHOR CONTRIBUTIONS

TF, JI, and MH conceived the study and wrote the manuscript. TF, JI, and YO processed the data. YO constructed the database. TF and JI equally contributed to this work. MH supervised this study. All authors read and approved the final manuscript.

# FUNDING

This study was supported by MEXT/JSPS KAKENHI Grants JP16J00129 and JP17H05605 to TF; JP16K16143 to JI; and JP16H05879, JP16H01318, JP16H02484, and JP17K20032 to MH.

# ACKNOWLEDGMENTS

The computations in this research were performed using the supercomputing facilities at the National Institute of Genetics in Research Organization of Information and Systems.

# REFERENCES


# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00462/full#supplementary-material


**Conflict of Interest Statement:** YO was employed by company IMSBIO CO., LTD.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Fukunaga, Iwakiri, Ono and Hamada. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Next Generation Sequencing for Long Non-coding RNAs Profile for CD4<sup>+</sup> T Cells in the Mouse Model of Acute Asthma

Zhengxia Wang<sup>1</sup>†, Ningfei Ji<sup>1</sup>†, Zhongqi Chen<sup>1</sup>†, Chaojie Wu<sup>1</sup>, Zhixiao Sun<sup>1</sup>, Wenqin Yu<sup>1,2</sup>, Fan Hu<sup>3</sup>, Mao Huang<sup>1</sup>\* and Mingshun Zhang<sup>4,5</sup>\*

<sup>1</sup> Department of Respiratory and Critical Care Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China, <sup>2</sup> Department of Infectious Disease, Taizhou People's Hospital, Taizhou, China, <sup>3</sup> State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China, <sup>4</sup> NHC Key Laboratory of Antibody Technique, Nanjing Medical University, Nanjing, China, <sup>5</sup> Department of Immunology, Nanjing Medical University, Nanjing, China

#### Edited by:

Yun Zheng, Kunming University of Science and Technology, China

#### Reviewed by:

Changning Liu, Xishuangbanna Tropical Botanical Garden (CAS), China Fei Li, Zhejiang University, China

#### \*Correspondence:

Mao Huang hm6114@163.com Mingshun Zhang mingshunzhang@njmu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to RNA, a section of the journal Frontiers in Genetics

Received: 04 February 2019 Accepted: 22 May 2019 Published: 07 June 2019

#### Citation:

Wang Z, Ji N, Chen Z, Wu C, Sun Z, Yu W, Hu F, Huang M and Zhang M (2019) Next Generation Sequencing for Long Non-coding RNAs Profile for CD4<sup>+</sup> T Cells in the Mouse Model of Acute Asthma. Front. Genet. 10:545. doi: 10.3389/fgene.2019.00545

Background and Aims: Although long non-coding RNAs (lncRNAs) have been linked to many diseases including asthma, little is known about the lncRNA transcriptomes of CD4<sup>+</sup> T cells in asthma. The present study aimed to explore the lncRNA profile of CD4<sup>+</sup> T cells from a mouse model of acute asthma.

Methods: Next generation sequencing for lncRNAs and mRNAs was performed on CD4<sup>+</sup> T cells from asthma and control mice. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were performed to predict the functions and signaling pathways of the aberrantly expressed lncRNAs. The selected lncRNAs were further measured using quantitative real-time polymerase chain reaction (PCR) and observed by fluorescence in situ hybridization (FISH). The lncRNA–mRNA co-expression network was constructed using Pearson's correlation coefficient and Cytoscape 3.6.

Results: Next generation sequencing revealed 36 up-regulated lncRNAs and 98 down-regulated lncRNAs in acute asthma compared with controls. KEGG pathway analysis showed that cytokine-cytokine receptor interaction had the highest enrichment score. A co-expression network was constructed in which 23 altered lncRNAs and 301 altered mRNAs formed a total of 12,424 lncRNA-mRNA pairs. To validate the RNA sequencing results, we measured four different lncRNAs using qPCR. The lncRNA fantom3\_9230106C11 was significantly reduced in CD4<sup>+</sup> T cells in asthma. Bioinformatics analysis showed that lncRNA fantom3\_9230106C11 had the potential to interact with many miRNAs and transcription factors related to Th2 differentiation.

Conclusion: This study provided the first evidence for differential expression of lncRNAs in CD4<sup>+</sup> T cells in asthma and may serve as a template for further, larger, in-depth functional analyses of asthma-related lncRNAs.

Keywords: asthma, CD4<sup>+</sup> T lymphocyte, long non-coding RNA, mRNA, next-generation sequencing

# INTRODUCTION


Allergic asthma is a type of asthma provoked by allergens, including pollen, animal dander, fungal spores, or house dust mites (HDM) (Takyar et al., 2013; Hondowicz et al., 2016). Once processed and presented by dendritic cells, allergens promote the differentiation and expansion of Th2 cells, releasing type 2 cytokines (IL-4, IL-5, IL-13) and expressing the master transcription factor GATA-3. Th2 cells orchestrate eosinophil maturation and survival, airway hyperresponsiveness, and B cell isotype switching to IgE (Holgate, 2012). Consequently, Th2 cells play crucial roles in the cascade reactions of allergic asthma.

Non-coding RNAs have emerged as essential players in Th2 cell differentiation and allergic asthma. Recently, we reported the aberrant microRNA (miRNA) profile in CD4<sup>+</sup> T cells from a murine model of acute asthma (Liu et al., 2018). In addition to miRNAs, long non-coding RNAs (lncRNAs), which are transcripts longer than 200 nucleotides that lack putative open reading frames, may regulate CD4<sup>+</sup> T cell differentiation in asthma (Zhang F. et al., 2017). In allergic asthma, the lncRNA profile has been documented in primary airway smooth muscle cells, CD8<sup>+</sup> T cells or blood samples (Tsitsiou et al., 2012; Austin et al., 2017; Zhu et al., 2018). To our knowledge, lncRNA expression in CD4<sup>+</sup> T cells in asthma remains elusive.

In this study, we performed next-generation sequencing and data mining to investigate the whole spectrum of transcriptional signatures (mRNAs and lncRNAs) of CD4<sup>+</sup> T cells in the murine model of acute asthma. Some lncRNAs were measured in CD4<sup>+</sup> cells ex vivo or in Th2 cells in vitro using real-time quantitative reverse transcription PCR (qRT-PCR) and fluorescence in situ hybridization (FISH). Moreover, cross-talk between mRNAs and lncRNAs associated with CD4<sup>+</sup> T cell differentiation was revealed.

# MATERIALS AND METHODS

# Establishment of the Model of Acute Asthma

Specific pathogen-free female C57BL/6J mice (18 to 22 g) aged 6 to 8 weeks were obtained from the College of Veterinary Medicine Yangzhou University (Yangzhou, China). All experiments that involved animal and tissue samples were performed in accordance with the guidelines and procedures approved by the Institutional Animal Care and Use Committee of Nanjing Medical University (IACUC-1709011).

The model of acute asthma was established as described previously (Liu et al., 2018). Briefly, the asthmatic mice (n = 3) were sensitized on days 0 and 14 by the intraperitoneal injection of 20 µg ovalbumin (OVA) (Grade V, Sigma-Aldrich) emulsified in 2 mg aluminum hydroxide gel (InvivoGen, San Diego, CA, United States) in a total volume of 200 µl. These sensitized mice were exposed to aerosolized 1% OVA in sterile saline for 30 min on each of days 20 to 22. The control mice (n = 3) were sensitized and challenged using the same protocol with saline alone. All mice were monitored daily and were alive before sacrifice.

Twenty-four hours after the final challenge, lung function was evaluated by the direct measurement of lung resistance and dynamic compliance in restrained, tracheostomized, mechanically ventilated mice via the FinePointe RC System (Buxco Research Systems, Wilmington, NC, United States) under general anesthesia as described previously (Kerzerho et al., 2013). The sera were collected to measure total IgE using an ELISA kit according to the manufacturer's instructions (eBioscience, Thermo Fisher Scientific, United States). To determine lung tissue inflammation, the right upper lung lobe was removed, fixed, dehydrated and embedded in paraffin. The fixed embedded tissues were cut into 5 µm sections on a Leica model 2165 rotary microtome (Leica, Nussloch, Germany), and the tissue slides were stained with hematoxylin and eosin (H&E).

# CD4<sup>+</sup> T Cell Purification From the Spleen

Mice were anesthetized by i.p. injection of a mixture of 10 mg/kg xylazine (MTC Pharmaceuticals, Cambridge, ON, Canada) and 200 mg/kg ketamine hydrochloride (Rogar/STB, London, ON, Canada). The anesthetized mice were sacrificed by cervical dislocation. The spleen was removed, ground and prepared into single cell suspensions. CD4<sup>+</sup> T cells in the spleen were sorted using CD4 (L3T4) microbeads (130-049-201, Miltenyi Biotec, United States). Briefly, the single cell suspensions were incubated with CD4 (L3T4) microbeads. The magnetically labeled cells were flushed and collected. Finally, the purity of CD4<sup>+</sup> T cells was quantified using anti-CD4-APC antibodies (17-0041-81, eBioscience).

# RNA Isolation, Library Preparation, and Sequencing

Total RNA was isolated using the miRNeasy Mini Kit (Qiagen, Germany) and stored at −80°C until use. RNA purity was assessed using the ND-1000 Nanodrop. Each RNA sample had an A260:A280 ratio above 1.8 and an A260:A230 ratio above 2.0. RNA integrity was evaluated using the Agilent 2200 TapeStation (Agilent Technologies, United States), and each sample had an RIN above 7.0. Briefly, rRNAs were removed from total RNA using the Epicentre Ribo-Zero rRNA Removal Kit (Illumina, United States), and the RNA was fragmented to approximately 200 bp. Subsequently, the purified RNAs were subjected to first-strand and second-strand cDNA synthesis followed by adaptor ligation and low-cycle enrichment according to the instructions of the NEBNext® Ultra™ RNA Library Prep Kit for Illumina (NEB, United States).

#### TABLE 1 | Primers used in qRT-PCR.

The purified library products were evaluated using the Agilent 2200 TapeStation and Qubit® 2.0 (Life Technologies, United States) and then diluted to 10 pM for in situ cluster generation on the paired-end flow cell, followed by sequencing (2 × 150 bp) on a HiSeq 3000. Clean reads were obtained by removing reads containing adapters, reads containing poly-N, and low-quality reads from the raw data. Sequence data were mapped to the mouse reference genome mm10 with TopHat v2.0.13. Subsequently, gfold v1.1.2 was employed to count the number of reads mapped to each gene. Differential expression was assessed by DEGseq using RPKM as input. Differentially expressed genes were chosen according to the criteria of fold change >2 and adjusted P-value <0.05. All the differentially expressed genes were used for heat map analysis and KEGG enrichment analyses. For KEGG enrichment analysis, a P-value <0.05 was used as the threshold to determine significant enrichment of the gene sets.
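To make the quantities in this paragraph concrete, the following Python sketch computes RPKM with its standard definition and applies the stated selection criteria. The function names are hypothetical, and treating "fold change >2" as symmetric (more than twofold up or down) is an assumption; the statistical testing itself is done by DEGseq and is not reproduced here.

```python
def rpkm(read_count: float, gene_length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (gene_length_bp * total_mapped_reads)


def is_differentially_expressed(fold_change: float,
                                adjusted_p: float,
                                fc_cutoff: float = 2.0,
                                p_cutoff: float = 0.05) -> bool:
    """Apply the fold-change and adjusted P-value criteria.

    `fold_change` is the asthma/control expression ratio. Changes in either
    direction are accepted, which is an assumption about the criterion.
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    changed = fold_change > fc_cutoff or fold_change < 1.0 / fc_cutoff
    return changed and adjusted_p < p_cutoff


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
up_hit = is_differentially_expressed(fold_change=3.2, adjusted_p=0.01)
down_hit = is_differentially_expressed(fold_change=0.3, adjusted_p=0.01)
not_hit = is_differentially_expressed(fold_change=3.2, adjusted_p=0.20)
```

A gene must pass both criteria, so a large fold change with a non-significant adjusted P-value is not selected.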

# Bioinformatics Data Analysis and Data Mining

The transcriptome was assembled using Cufflinks and Scripture based on the reads mapped to the reference genome. The assembled transcripts were annotated using the Cuffcompare program from the Cufflinks package. The unknown transcripts were used to screen for putative lncRNAs. Three computational approaches (CPC, CNCI, and Pfam) were combined to sort non-coding RNA candidates from putative protein-coding RNAs in the unknown transcripts. Putative protein-coding RNAs were filtered out using a minimum length and exon number threshold. Transcripts with lengths greater than 200 nt and with more than two exons were selected as lncRNA candidates and further screened using CPC/CNCI/Pfam, which distinguished protein-coding genes from non-coding genes.
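The screening logic can be summarized in a short Python sketch. It assumes that the coding/non-coding calls from CPC, CNCI, and Pfam have already been computed and are available as boolean fields, and that a transcript is kept only when all three tools call it non-coding; the class, field names, and this consensus rule are illustrative assumptions. The length and exon thresholds follow the text literally.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record holding precomputed calls from the three tools."""
    transcript_id: str
    length_nt: int
    exon_count: int
    coding_by_cpc: bool
    coding_by_cnci: bool
    coding_by_pfam: bool


def is_lncrna_candidate(t: Transcript,
                        min_length_nt: int = 200,
                        min_exons_exclusive: int = 2) -> bool:
    """Length > 200 nt, more than two exons, and non-coding by all three tools."""
    passes_structure = (t.length_nt > min_length_nt
                        and t.exon_count > min_exons_exclusive)
    noncoding_by_all = not (t.coding_by_cpc or t.coding_by_cnci or t.coding_by_pfam)
    return passes_structure and noncoding_by_all


transcripts = [
    Transcript("t1", 850, 3, False, False, False),   # passes every filter
    Transcript("t2", 180, 3, False, False, False),   # too short
    Transcript("t3", 900, 2, False, False, False),   # too few exons
    Transcript("t4", 1200, 4, False, True, False),   # CNCI calls it coding
]
candidates = [t.transcript_id for t in transcripts if is_lncrna_candidate(t)]
```

Only `t1` survives in this toy set; each of the other transcripts fails exactly one filter.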

The UCSC genome browser was used to locate lncRNA fantom3\_9230106C11. Predicted potential target genes whose loci were within a 10-kb window upstream or downstream of a given aberrantly expressed lncRNA were considered cis-regulated genes. Other genes in the co-expression network were identified as trans-regulated according to complementary base pairing predicted by LncTar (Li et al., 2015). In addition, Bibiserv (Sczyrba et al., 2003) and RNA22 (Miranda et al., 2006) were used to predict the interacting miRNAs. Differentially expressed mRNAs in our NGS data were compared with asthma-associated genes in the human disease database MalaCards<sup>1</sup> and asthma-associated genes in NCBI Gene<sup>2</sup> using a Venn diagram<sup>3</sup>.
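The 10-kb cis window can be expressed as a distance test between two genomic intervals. The Python sketch below is only an illustration: the `Locus` type, the use of closed coordinates, and measuring distance between the nearest ends of the two loci are assumptions, since the text does not specify how the window was measured.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """Hypothetical genomic interval with closed coordinates."""
    chrom: str
    start: int
    end: int


def interval_gap(a: Locus, b: Locus) -> int:
    """Bases separating two intervals on the same chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_cis_candidate(lncrna: Locus, gene: Locus, window_bp: int = 10_000) -> bool:
    """True if the gene lies within `window_bp` upstream or downstream of the lncRNA."""
    return lncrna.chrom == gene.chrom and interval_gap(lncrna, gene) <= window_bp


lnc = Locus("chr1", 50_000, 52_000)
near_gene = Locus("chr1", 58_000, 60_000)   # 6 kb downstream
far_gene = Locus("chr1", 70_000, 71_000)    # 18 kb downstream
other_chrom = Locus("chr2", 50_500, 51_000)

near_is_cis = is_cis_candidate(lnc, near_gene)
far_is_cis = is_cis_candidate(lnc, far_gene)
other_is_cis = is_cis_candidate(lnc, other_chrom)
```

Because the gap is computed symmetrically, the same test covers both upstream and downstream neighbors.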

# CD4<sup>+</sup> T Cell Differentiation in vitro

Naive CD4<sup>+</sup> T cells in the spleen were sorted using a Naive CD4<sup>+</sup> T Cell isolation kit (130-104-453, Miltenyi Biotec, United States). Briefly, the single cell suspensions were incubated with biotin-antibody cocktails, which depleted the non-T cells and memory CD4<sup>+</sup> T cells. The untouched naive CD4<sup>+</sup> T lymphocytes were seeded onto anti-CD3 (85-16-0031-85, eBioscience, 1 mg/ml) precoated 96-well plates. For Th1 polarization, naive CD4<sup>+</sup> T cells were further stimulated with anti-CD28 (85-16-0281-85, eBioscience, 2 µg/ml), IL-2 (212-12, PeproTech, 20 ng/ml), IL-12 (210-12, PeproTech, 50 ng/ml) and anti-IL-4 (85-16-7041-85, eBioscience, 10 µg/ml). For Th2 polarization, naive CD4<sup>+</sup> T cells were further stimulated with anti-CD28 (85-16-0281-85, eBioscience, 2 µg/ml), IL-2 (212-12, PeproTech, 20 ng/ml), IL-4 (214-14, PeproTech, 200 ng/ml), anti-IFN-γ (85-16-7311-85, eBioscience, 10 µg/ml), and anti-IL-12 (85-16-7123-85, eBioscience, 10 µg/ml). Half of the complete RPMI-1640 medium (16000-044, Gibco, United States) with 10% FBS (Gibco, United States) was replaced with fresh medium to maintain the cytokine environments. The CD4<sup>+</sup> T cells were harvested and analyzed on day 5.

# Flow Cytometry Analysis ex vivo and in vitro

For nuclear protein assays, spleens from animal experiments were prepared as single-cell suspensions. After RBC (red blood cell) lysis, the cells were stained with CD16/CD32 FcR (Fc Receptor) blocking antibody, Fixable Viability Dye eFluor™ 506 and anti-CD4-APC (eBioscience, 17-0041-83). For intracellular staining, the cells were fixed and permeabilized (eBioscience, 00-5523-00) according to the manufacturer's instructions, and then intracellular products were stained. Flow cytometry was performed with anti-Gata3-PE-Cy7 (BD Biosciences, 560405), anti-T-bet-PE-Cy7 (eBioscience, 25-5825-80) and isotype controls.

<sup>1</sup>https://www.malacards.org/

<sup>2</sup>https://www.ncbi.nlm.nih.gov/gene/?term=asthma

<sup>3</sup>http://bioinformatics.psb.ugent.be/webtools/Venn/

For cytokine assays, T cells were restimulated on day 5 of culture for 5 h with 20 nM PMA (70-CS1001, Multiscience, China) and blocked with BFA (70-CS1002, Multiscience, China). After 5 h, the cells were harvested, gently washed with PBS containing 1% bovine serum albumin (BSA) and fixed in IC fixation buffer (00-822-49, eBioscience, United States). The fixed cells were permeabilized with permeabilization buffer (00-8333-56, eBioscience, United States) and stained with PE-anti-IFN-γ (85-12-7311-84, eBioscience, United States), PE-anti-IL4 (85-12-7041-83, eBioscience, United States) or the isotype control. The stained CD4<sup>+</sup> T lymphocytes were analyzed on a BD FACSCalibur, and the data were analyzed with FlowJo software (TreeStar).

# Real-Time Quantitative Polymerase Chain Reaction

RNAs from cells or tissues were isolated using a miRNeasy mini kit. The cDNA was synthesized using the Prime Script RT Reagent Kit (TaKaRa, Kyoto, Japan) following the manufacturer's instructions. The quantitative PCR analysis was performed using the CFX96 system (Bio-Rad Laboratories) in conjunction with ready-to-use fast-start SYBR Premix Ex Taq II (TaKaRa, Kyoto, Japan). The cycling conditions were 95°C for 30 s, followed by up to 40 cycles of 95°C for 5 s and 60°C for 30 s, and then dissociation at 95°C for 15 s, 60°C for 30 s and a final extension at 95°C for 15 s. The relative abundance of gene targets was determined by the comparative CT (cycle threshold) method, normalized against the CT of β-actin. The primers used are shown in **Table 1**; they were designed by RiboBio Institute (Guangzhou, China) or obtained from PrimerBank using Primer 5.
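The comparative CT calculation can be sketched as follows. The text states only that a comparative CT method normalized to β-actin was used. The sketch assumes the common 2<sup>−ΔΔCT</sup> form with roughly 100% amplification efficiency, and the function name and CT values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_ctrl, ct_reference_ctrl):
    """Fold change of a target gene in a sample relative to a control,
    using the 2^(-ddCT) form of the comparative CT method (assumed here)."""
    delta_sample = ct_target - ct_reference             # normalize sample to beta-actin
    delta_control = ct_target_ctrl - ct_reference_ctrl  # normalize control to beta-actin
    return 2 ** -(delta_sample - delta_control)

# Hypothetical CT values: the target amplifies two cycles later in the
# asthma sample than in the control, with identical beta-actin CT.
fold = relative_expression(
    ct_target=26.0, ct_reference=18.0,
    ct_target_ctrl=24.0, ct_reference_ctrl=18.0,
)
print(fold)
```

A value below 1 indicates lower expression in the sample than in the control, and a value above 1 indicates higher expression.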


The P-value indicates the difference between the control and asthma groups.

FIGURE 3 | Volcano plots and heat map of mRNA expression between asthma and control groups. (A) Volcano plot assessment of gene expression in CD4<sup>+</sup> T cells between asthma and control groups. Red dots represent differentially expressed mRNAs (p < 0.05) showing fold changes ≥2 or ≤ −2. (B) Heat map analysis of differentially expressed mRNAs between the asthma and control groups. Green indicates low expression, and red indicates high expression. A4, A5, and A7 were the control groups; B4, B6, and B7 were CD4<sup>+</sup> T cells from the asthma group. (C) Venn diagram comparing the asthma-related genes identified by sequencing with those in the NCBI Gene database and the MalaCards database.

# Correlation and Co-expression Analysis

The co-expression analysis was based on Pearson's correlation coefficient. Because random factors can produce spurious correlations, the Pearson correlation coefficient alone is not a sufficiently strict criterion. Therefore, when the number of samples was fewer than eight, we used a shuffling (permutation) method to calculate the P-value for further screening.

The co-expression relationship was screened according to Pearson correlation coefficient and P-value. Differentially expressed lncRNAs and mRNAs with fold changes ≥2 and p < 0.05 were analyzed. For each lncRNA-mRNA pair, the Pearson correlation (COR) was calculated to identify significantly correlated pairs. The Pearson correlation value cutoff was 0.95 and p < 0.05. To create a visual representation, a lncRNA-mRNA regulatory network was constructed using Cytoscape 3.6.
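A minimal sketch of screening one lncRNA-mRNA pair by Pearson correlation with a permutation P-value is given below. The expression values, the number of permutations and the function names are assumptions made for illustration, and the authors' exact shuffling procedure is not described in the text.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided P-value: fraction of shuffles of y whose |r| is at least
    as large as the observed |r| (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical expression values across six samples (3 control, 3 asthma).
lnc_expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mrna_expr = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

r = pearson(lnc_expr, mrna_expr)
p = permutation_p(lnc_expr, mrna_expr)
keep_pair = abs(r) >= 0.95 and p < 0.05  # cutoffs stated in the text
print(keep_pair)
```

With six samples there are only 720 distinct orderings, so the smallest attainable P-value is limited. This is one reason a strict correlation cutoff is applied alongside the P-value.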

# Fluorescence in situ Hybridization (FISH)

Fluorescence in situ hybridization assays were performed using the Ribo™ Fluorescent in situ Hybridization Kit and the Ribo™ lncRNA FISH Probe Mix (Ribo, Guangzhou, China) according to the manufacturer's protocols. Briefly, the cells were fixed in 4% formaldehyde for 10 min and then washed with PBS. The cells were incubated with 20 µmol/L lncRNA FISH probe mix at 37°C overnight. After washing, the FISH preparations were counterstained with DAPI and observed by confocal microscopy with the appropriate fluorescence filter sets (Zeiss, Oberkochen, Germany). The lncRNA probe labeled with Cy3 was designed and synthesized by RiboBio Co., Ltd. Ribo™ U6 and Ribo™ 18S were used as reference controls for the subcellular localization of lncRNA.

# Statistical Analysis

The statistical analysis was performed using GraphPad Prism version 5.0 (GraphPad, San Diego, CA, United States). The results for variables that were normally distributed are displayed as the means ± SEM. An ANOVA was performed to establish equal variance, and a 2-tailed Student's t-test with Bonferroni correction was applied to determine statistical significance, which was defined as P < 0.05.

# RESULTS

# Establishment of an Acute Model of Asthma

As in the well-established murine model of asthma, OVA challenge provoked evident pulmonary inflammation, airway hypersensitivity and higher serum IgE (**Figures 1A–C**). CD4<sup>+</sup> T cells were magnetically sorted from the spleen, and the purity was validated by flow cytometry; 98.5% of the cells were CD4 positive (**Figure 1D**).

# LncRNA Profile of CD4<sup>+</sup> T Cells in Asthma

To obtain a global overview of the CD4<sup>+</sup> T cell transcriptome in asthma, we constructed and sequenced 6 RNA-Seq libraries, including controls (n = 3) and asthma mice (n = 3). All of the data have been submitted to the Sequence Read Archive (SRA<sup>4</sup>) under accession number PRJNA540404. In total, 134 lncRNAs were significantly altered in expression, including 98 downregulated lncRNAs and 36 upregulated lncRNAs (**Figure 2**). The top 10 decreased and top 10 increased lncRNAs are listed in **Table 2**.

# The mRNA Profile of CD4<sup>+</sup> T Cells in Asthma

In RNA sequencing, we explored not only the lncRNAs but also the mRNAs from the same samples. With respect to mRNAs, 141 mRNAs were significantly downregulated and 160 mRNAs were significantly upregulated in the CD4<sup>+</sup> T cells from asthma (**Figure 3**). The top 10 decreased and top 10 increased mRNAs are listed in **Table 3**, including reduced Kbtbd12, Hunk, and Slc6a1 and elevated Ear7, Ear6, and Epx. Among the 301 differentially expressed mRNAs, 17 mRNAs were in the human disease database MalaCards associated with asthma (see text footnote 1), and 28 mRNAs were in the asthma-associated genes<sup>5</sup>. More importantly, 11 mRNAs overlapped among our sequencing data, the MalaCards database and the asthma-associated genes, including IL4, IL-10, MMP9, VCAM-1, Il1rl1, Alox5, Il1rn, Ccr3, Cysltr1, Epx, and Ccl24.

<sup>4</sup>www.ncbi.nlm.nih.gov/sra

TABLE 3 | Top 20 mRNAs in CD4<sup>+</sup> T cells between asthma and control groups.

The P-value indicates the difference between the control and asthma groups.


# The lncRNA-mRNA Co-expression Network

To further examine the function of these differentially expressed lncRNAs in CD4<sup>+</sup> T cells in asthma, we constructed an lncRNA-mRNA co-expression network between 301 differentially expressed mRNAs and 23 differentially expressed lncRNAs. The results showed that the co-expression network comprised 12424 connections between lncRNAs and mRNAs. Notably, fantom3\_4933428M03 and fantom3\_9230106C11 exhibited a high degree of connectivity, suggesting that these two lncRNAs may play key roles (**Figure 4** and **Table 4**). Moreover, our results highlighted the potential internal regulatory correlations between the differentially expressed lncRNAs and mRNAs in CD4<sup>+</sup> T cells between the asthma and control groups.

coefficient, where the solid line indicates a positive correlation and the dashed line indicates a negative correlation.

<sup>5</sup>https://www.ncbi.nlm.nih.gov/gene

<sup>∗</sup><30◦ was not shown.

# Validation of RNA-Seq Data With Real-Time PCR

To confirm the differentially expressed gene data, we further analyzed some dysregulated lncRNAs using qRT-PCR ex vivo and in vitro. In OVA-induced asthma, Th2 cells orchestrate the pathological cascade reactions. As shown in **Figure 5A**, CD4<sup>+</sup> T cells from the asthma group showed higher expression of the Th2 master transcription factor Gata-3 and decreased expression of the Th1 master transcription factor T-bet. In contrast to the sequencing data, the expression of lncRNA fantom3\_4933428M03 and fantom3\_F630107E09 was unexpectedly similar in CD4<sup>+</sup> T cells from the asthma and control groups. However, the expression of lncRNA fantom3\_9230106C11 was significantly decreased in the asthma CD4<sup>+</sup> T cells, in accordance with the sequencing data (**Figure 5B**). To further assess the reliability of the data, we induced Th1 or Th2 cells in vitro (**Figures 6A,B**). As expected, the expression of lncRNA fantom3\_9230106C11 was significantly decreased in Th2 cells (**Figure 6C**). FISH assays indicated that lncRNA fantom3\_9230106C11 was localized in the cytoplasm of CD4<sup>+</sup> T cells (**Figure 7**).

# Prediction of lncRNA fantom3\_9230106C11 Targets

The above results indicated that lncRNA fantom3\_9230106C11 may be involved in Th2 cell differentiation in asthma. LncRNA fantom3\_9230106C11 is located at chr6:34412743–34415062 (GenBank: AK033773.1), overlapping with intron 3, exon 4, and intron 4 of Akr1b7 (chr6: 34412362–34423137). BLAST analysis showed that lncRNA fantom3\_9230106C11 is 99.8% similar to Mus musculus lncRNA URS00009B6C6F<sup>6</sup>. In NONCODE (current version v5.0; http://www.noncode.org/), Mus musculus lncRNA URS00009B6C6F is named NONMMUT056297.2 and is highly expressed in the hippocampus (data source: ERP000591).

We further explored the potential candidates that may interact with the lncRNA. Diverse transcription factors, such as T-bet, Gata-3, c-maf, stat4, stat5, stat6, jun-b, Dec-2, IRF4, Notch, Gfi-1, and YY1, may be involved in Th2 differentiation (Hwang et al., 2013; Lee, 2014). Using LncTar (Li et al., 2015), we found no predicted interaction between lncRNA fantom3\_9230106C11 and the mRNAs of these transcription factors. The ConSite prediction database (Mahmood et al., 2018) showed that the transcription factors Gata1, Gklf, SOX17, S8, and Ahr-ARNT may bind to the promoter of lncRNA fantom3\_9230106C11. In addition to transcription factors, a wide range of miRNAs contribute to Th2 differentiation; for example, miR-17∼92, miR-29, miR-126, miR-132-3p, miR-148a, mir-24, mir-27, mir-19, and mir-155 have been confirmed to regulate Th2 differentiation (Istomine et al., 2016; Pua et al., 2016). In the LncTar, Bibiserv, and RNA22 analyses, lncRNA fantom3\_9230106C11 was predicted to bind miR-19 and other miRNAs. Collectively, the potential candidates of lncRNA fantom3\_9230106C11 may include Akr1b7, Gata1 and miRNAs (**Figure 8**).

<sup>6</sup>https://rnacentral.org/rna/URS00009B6C6F

# DISCUSSION

Allergic bronchial asthma is a common chronic inflammatory disease with well-defined pathological features, including intermittent airway hyperresponsiveness, pulmonary eosinophil infiltration, and excessive mucus secretion (Chang et al., 2015). Numerous studies have demonstrated that allergic asthma is driven predominantly by a Th2 type of immune response in both human and mouse models of asthma (Kaiko and Foster, 2011). LncRNAs have been linked with airway smooth muscle cells in asthma (Zhang et al., 2016; Austin et al., 2017; Yu et al., 2017; Zhang X.-Y. et al., 2017). However, to our knowledge, there is no report on lncRNA expression in CD4<sup>+</sup> T cells in asthma. Therefore, we constructed a mouse model of acute asthma and screened RNA transcripts (mRNAs, lncRNAs) from CD4<sup>+</sup> T cells via next-generation sequencing.

In our study, we identified 134 lncRNAs and 301 mRNAs that were abnormally expressed in CD4<sup>+</sup> T cells from asthma mice compared with controls. In the mRNA expression pattern analysis, the elevated Ear7, Ear6 and Epx are closely associated with eosinophils and type 2 inflammation (Ochkur et al., 2017). IL-4, IL-13, and IL-21 (Lajoie et al., 2014), which promote Th2 differentiation and allergic asthma, were also significantly increased in the CD4<sup>+</sup> T cells from the asthma mouse model. However, the Th1 master transcription factor T-bet and the Th2 master transcription factor Gata3 were comparable in CD4<sup>+</sup> T cells from control and asthma mice, suggesting that Th2 differentiation in asthma may be independent of Gata3 (O'Shea and Paul, 2010). In the lncRNA profile analysis, the roles of the wide range of lncRNAs with altered expression in asthma CD4<sup>+</sup> T cells remain largely unknown. In the lncRNA and mRNA co-expression network, lncRNAs may regulate diverse mRNAs, including Ear6 and Epx, which are required for Th2 differentiation and asthma etiology.

Using real-time qRT-PCR, we demonstrated that lncRNA fantom3\_9230106C11 was decreased in CD4<sup>+</sup> T cells from asthma mice ex vivo and in Th2 cells in vitro. By filtering the aberrantly expressed genes located near lncRNA fantom3\_9230106C11, we found that this lncRNA might regulate the transcription of Akr1b7 in cis. LncRNA fantom3\_9230106C11 covers intron 3, exon 4, and intron 4 of Akr1b7. Akr1b7 encodes aldose reductase-related protein 1, which may catalyze xenobiotic aromatic aldehydes (Liu et al., 2009). The roles of Akr1b7 in Th2 differentiation or asthma, however, remain uncertain. Bioinformatic prediction indicated that the transcription factors Gata1, Gklf, and Ahr-ARNT may be potential candidates for regulating lncRNA fantom3\_9230106C11. Gata1 served as a surrogate for Gata3 in its canonical role of programming Th2 gene expression (Sundrud et al., 2005). Gklf (gut-enriched Krüppel-like factor, or Krüppel-like factor 4, KLF4) was required in Th2 cell responses in vivo (Tussiwand et al., 2015). Ahr-ARNT, which may regulate IL-33, IL-25, and TSLP, was closely associated with severe allergic asthma (Weng et al., 2018). Moreover, considering that lncRNA fantom3\_9230106C11 resides in the cytoplasm, it may interact with miR-19 and other miRNAs, which are closely associated with Th2 differentiation.

Our study was not without limitations. First, BALB/c and C57BL/6 mouse strains are the two most commonly used in asthma models. Compared with the C57BL/6 mice used in this study, BALB/c mice were more prone to a Th2 immune response (Gueders et al., 2009). Although allergic asthma was Th2-dominated, Th1 and other CD4<sup>+</sup> T subpopulations were also involved in the initiation and aggravation of disease. Therefore, the C57BL/6 mouse model was well recognized in the recapitulation of disease features in asthma. Second, spleen CD4<sup>+</sup> T cells rather than lung CD4<sup>+</sup> T cells were analyzed in the present study. We tried to sort CD4<sup>+</sup> T cells from lung parenchyma and pulmonary lymph nodes. However, there were not enough purified CD4<sup>+</sup> T cells (5 × 10<sup>6</sup> from each mouse) to complete the NGS experiment. Previously, adoptive transfer experiments demonstrated that peripheral CD4<sup>+</sup> T lymphocytes regulate asthma pathogenesis (Cohn et al., 2004; Hubeau et al., 2006). Therefore, we postulated that spleen CD4<sup>+</sup> T cells may be surrogates for peripheral CD4<sup>+</sup> T lymphocytes. Third, we have not validated our observations in clinical samples. LncRNAs are considered poorly conserved across different species (Ma et al., 2013). However, the conservation may be multidimensional (Diederichs, 2014). The expression of aberrant lncRNAs (lncRNA fantom3\_9230106C11) in clinical samples should be evaluated.

# CONCLUSION

In conclusion, we conducted a comprehensive analysis of lncRNA profiles in CD4<sup>+</sup> T cells from an asthma model using next-generation sequencing. The co-expression network of lncRNAs and mRNAs was constructed. The present study provided a platform for elucidating the roles of lncRNAs in Th2 differentiation and asthma pathogenesis.

# ETHICS STATEMENT


All experiments that involved animal and tissue samples were performed in accordance with the guidelines and procedures approved by the Institutional Animal Care and Use Committee of Nanjing Medical University (IACUC-1709011).

# AUTHOR CONTRIBUTIONS

MH and MZ conceived the idea and designed the research. ZW, NJ, ZC, and CW conducted the animal and in vitro experiments. ZS, WY, and MZ analyzed the results. FH performed the flow cytometry. ZW and MZ wrote the manuscript. All the authors reviewed and approved the manuscript.

# FUNDING

This research was supported by the Precision Medicine Research of The National Key Research and Development Plan of China (2016YFC0905800), National Natural Science Foundation of China (81671563, 81770031, and 81700028), Natural Science Foundation of Jiangsu Province (BK20171501, BK20171080, and BK20181497), Jiangsu Province's Young Medical Talent Program, China (QNRC2016600), and Jiangsu Provincial Health and Family Planning Commission Foundation (Q2017001).

# REFERENCES

migration of rat airway smooth muscle cells in asthma via upregulating the expression of transient receptor potential 1. Am. J. Transl. Res. 8, 3409–3418.

Zhu, Y. J., Mao, D., Gao, W., and Hu, H. (2018). Peripheral whole blood lncRNA expression analysis in patients with eosinophilic asthma. Medicine 97:e9817. doi: 10.1097/MD.0000000000009817

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Ji, Chen, Wu, Sun, Yu, Hu, Huang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy

*Xuan Zhang<sup>1,2†</sup>, Tianjun Li<sup>3†</sup>, Jun Wang<sup>4†</sup>, Jing Li<sup>1</sup>, Long Chen<sup>3</sup>\* and Changning Liu<sup>1</sup>\**

*<sup>1</sup> CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming, China, <sup>2</sup> University of Chinese Academy of Sciences, Beijing, China, <sup>3</sup> Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China, <sup>4</sup> Institute of Medical Sciences, Xiangya Hospital, Central South University, Changsha, China*

#### *Edited by:*

*Philipp Kapranov, Huaqiao University, China*

#### *Reviewed by:*

*Yun Xiao, Harbin Medical University, China; Xingming Jiang, Harbin Medical University, China*

#### *\*Correspondence:*

*Changning Liu, liuchangning@xtbg.ac.cn; Long Chen, longchen@um.edu.mo*

*†These authors have contributed equally to this work.*

#### *Specialty section:*

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

*Received: 24 February 2019; Accepted: 12 July 2019; Published: 09 August 2019*

#### *Citation:*

*Zhang X, Li T, Wang J, Li J, Chen L and Liu C (2019) Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy. Front. Genet. 10:735. doi: 10.3389/fgene.2019.00735*

In the past decade, hundreds of long noncoding RNAs (lncRNAs) have been identified as significant players in diverse types of cancer; however, the functions and mechanisms of most lncRNAs in cancer remain unclear. Several computational methods have been developed to detect associations between cancer and lncRNAs, yet those approaches have limitations in both sensitivity and specificity. With the goal of improving the prediction accuracy for associations of lncRNA with cancer, we upgraded our previously developed cancer-related lncRNA classifier, CRlncRC, to generate CRlncRC2. CRlncRC2 is an eXtreme Gradient Boosting (XGBoost) machine learning framework, including Synthetic Minority Over-sampling Technique (SMOTE)-based over-sampling, along with Laplacian Score-based feature selection. Ten-fold cross-validation showed that the AUC value of CRlncRC2 for identification of cancer-related lncRNAs is much higher than previously reported by CRlncRC and others. Compared with CRlncRC, the number of features used by CRlncRC2 dropped from 85 to 51. Finally, we identified 439 cancer-related lncRNA candidates using CRlncRC2. To evaluate the accuracy of the predictions, we first consulted the cancer-related long non-coding RNA database Lnc2Cancer v2.0 and relevant literature for supporting information, then conducted statistical analysis of somatic mutations, distance from cancer genes, and differential expression in tumor tissues, using various data sets. The results showed that our approach was highly reliable for identifying cancer-related lncRNA candidates. Notably, the highest ranked candidate, lncRNA AC074117.1, has not been reported previously; however, integrated multi-omics analyses demonstrate that it is the target of multiple cancer-related miRNAs and interacts with adjacent protein-coding genes, suggesting that it may act as a cancer-related competing endogenous RNA, which warrants further investigation. 
In conclusion, CRlncRC2 is an effective and accurate method for identification of cancer-related lncRNAs, and has potential to contribute to the functional annotation of lncRNAs and guide cancer therapy.

Keywords: cancer, long noncoding RNA, machine learning, Synthetic Minority Over-sampling Technique, XGBoost

# INTRODUCTION

Cancer is a leading cause of death worldwide (Siegel et al., 2018) and it is established that cancers are caused by genetic and epigenetic changes (Kanwal and Gupta, 2010; You and Jones, 2012). Hence, high throughput technologies to characterize genes associated with cancer have applications with crucial implications for human health. Long non-coding RNAs (lncRNAs) account for the vast majority of non-coding RNAs longer than 200 nucleotides, and were previously considered "junk" RNA, due to their low coding potential; however, over recent decades, lncRNAs have been recognized as significant regulators of multiple major biological processes impacting development, differentiation, and metabolism (Bhan and Mandal, 2015). In cancer, lncRNAs act via multiple mechanisms, including regulation of chromatin topology in both cis and trans (chromatin remodeling, chromatin interactions), scaffolding of proteins and other RNAs, acting as protein and RNA decoys (competing endogenous RNA, ceRNA), regulating neighboring genes as natural antisense transcripts (NATs), and producing micropeptides (Aab et al., 2016; Ransohoff et al., 2018).

The aberrant expression of lncRNAs has been linked to typical cancer hallmarks, such as continuous proliferation, bypassing apoptosis, genomic instability, drug resistance, invasion, and metastasis (Renganathan and Felley-Bosco, 2017; Bhan et al., 2017; Balas and Johnson, 2018; Wang et al., 2019). For example, the lncRNA growth arrest-specific transcript 5 (*GAS5*), which is down-regulated in almost all tumor tissues, can suppress the tumorigenesis of cervical cancer by downregulating miR-196a and miR-205 (Yang et al., 2017), while *LncRNA‐PVT1*, which is up-regulated in non-small cell lung cancer (NSCLC), can improve tumor invasion and metastasis (Yang et al., 2014). Further, Hox transcript antisense intergenic RNA (*HOTAIR*), which contributes to epigenetic regulation of genes, plays an important role in various cellular pathways by interacting with Polycomb Repressive Complex 2 (PRC2) (Mercer and Mattick, 2013). In addition, due to dynamic changes in their expression levels as cancer develops, some lncRNAs are regarded as potential biomarkers and therapeutic targets (Hanahan and Weinberg, 2011; Bhan et al., 2017). The most prominent example of such a biomarker is prostate cancer antigen 3 (*PCA3*), a lncRNA expressed at high levels in prostate cancer (De Kok et al., 2002; Yarmishyn and Kurochkin, 2015). The detection of *PCA3* in urine is a more specific marker for prostate cancer diagnosis than the commonly used factor, prostate specific antigen (PSA), and has been widely applied in the clinic (Hessels et al., 2003; Tinzl et al., 2004). Another example is lncRNA *TUC339*, which is highly enriched in extracellular vesicles secreted by hepatocellular carcinoma cells, where it regulates the growth and adhesion of tumor cells (Kogure et al., 2013). 
These features of lncRNA prompted us to search for efficient methods to predict functional lncRNAs in cancer, to facilitate deeper understanding of malignancies and the potential application of lncRNAs as targets for cancer therapies and diagnostics.

Systematic understanding of the contributions of lncRNAs to cancer is challenging, partly due to the unpredictability of lncRNA functional elements, as well as their relatively low conservation, low expression levels, and diverse functional mechanisms. The functions of a single lncRNA, or several lncRNAs, can be determined using experimental methods; however, this approach is time consuming and costly. The successful implementation of machine learning systems for the study of genomics, proteomics, systems biology, and evolution, has been a great inspiration to the field of life sciences more generally (Larranaga et al., 2006). Using machine learning algorithms, we can determine the high dimensional characteristics of functional lncRNAs from an informatics perspective. To successfully apply machine learning to the identification of functional lncRNAs in cancer genomics, it is fundamental to first identify positive and negative sets. For this purpose, there are a number of repositories from which cancer-related lncRNAs can be conveniently obtained, including Lnc2Cancer v2.0, a manually curated database that provides comprehensive experimentally supported associations between lncRNAs and human cancer (Gao et al., 2019), and CRlncRNA, another manually curated database that uses stricter criteria to retain only data related to cancer hallmarks that have been experimentally confirmed (Wang et al., 2018). These databases can be exploited to develop machine learning models to predict and rank cancer-related lncRNAs. There has been relatively little research that has attempted to use machine learning methods to predict functional lncRNAs in cancer. For example, Zhao et al. (2015) presented the first naïve Bayes based machine learning method, and identified 707 cancer-related lncRNA candidates. 
In our previous work, we used a Random Forest based algorithm, CRlncRC, to classify cancer-related lncRNAs and other lncRNAs, through integration of 85 features (Zhang et al., 2018); however, compared with the computational prediction work reported for cancer-related protein-coding genes, the identification of cancer-related lncRNAs remains preliminary. The sensitivity and specificity of methods to predict cancer-related lncRNAs require further improvement.

In this study, we developed a new cancer-related lncRNA classifier, CRlncRC2. Compared with CRlncRC, CRlncRC2 uses the Laplacian score feature selection method to reduce training time and prevent over-fitting. In addition, unlike the naïve under-sampling method adopted by CRlncRC, we address the data imbalance problem, which is caused by the relatively small size of available positive sets of cancer-related lncRNAs, using the Synthetic Minority Over-sampling Technique (SMOTE), which balances the classes while aiming to retain all important information. Moreover, CRlncRC2 uses a more powerful machine learning model, extreme gradient boosting machine (XGBoost), to improve its predictive performance. Ten-fold cross-validation showed that the area under the receiver operating characteristic curve (AUC or area under ROC curve) score of CRlncRC2 is much higher than those of CRlncRC (0.86 vs. 0.82) and the method developed by Zhao et al. (0.90 vs. 0.79). Finally, 439 possible cancer-related lncRNAs were identified using CRlncRC2, of which 5 in the top 20 were confirmed using the Lnc2Cancer v2.0 database. Further, statistical analyses show that the identified lncRNAs are closer to cancer protein genes, carry more mutations, and are more likely to be differentially expressed in tumor tissues than negative lncRNAs. In addition, survival analysis revealed a significant difference in overall survival between the low and high expression groups of the top 10 predictions. In particular, one lncRNA, AC074117.1 (ENSG00000234072), which was the top ranked of our predictions and has not been reported in the literature, is suggested as being highly likely to be associated with cancer in the lncRNA-related ceRNA network. In conclusion, CRlncRC2 exhibited good performance in both cross-validation and prediction evaluation. We believe our framework will be a useful tool for study of lncRNA–cancer associations.

**Abbreviations:** AUC, area under the ROC curve; ceRNA, competing endogenous RNA; DT, decision tree; lncRNA, long non-coding RNA; ROC, receiver operating characteristic; SVM, support vector machines; XGBoost, extreme gradient boosting machine
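To make the over-sampling idea concrete, the following is a minimal sketch of the core SMOTE step: each synthetic positive sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is an illustration only, not the implementation used in CRlncRC2. The function name, the toy data and the parameter values are assumptions, and the function requires at least two minority samples.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

# Toy minority class with two features (hypothetical values).
positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(positives, n_new=6, k=2)
print(len(new_points))
```

Unlike under-sampling, which discards negative samples, this approach keeps every original sample and only adds interpolated positives.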

# MATERIALS AND METHODS

Our experiment followed the pipeline illustrated in **Figure 1A**, which consisted of four main steps: Data preparation, Feature engineering, Model training, and Prediction and validation. The detailed processes of feature selection and cross-validation are presented in **Figures 1B**, **C**.

# Data Preparation

Cancer-related lncRNAs (positive set) and cancer-unrelated lncRNAs (negative set) were downloaded from CRlncRC (https://github.com/xuanblo/CRlncRC). A lncRNA was collected as cancer-related if it was differentially expressed in cancer (as verified by real-time qRT-PCR), co-occurred with a significant relevant clinicopathological parameter (e.g., tumor differentiation, clinical stage, and survival time), or was proven by functional experiments (e.g., colony formation assay, matrigel invasiveness assay, xenograft mouse model, and metastasis nude mouse model). As the category of cancer-unrelated lncRNA is difficult to define, and for consistency with other classifiers, we mapped a large number of phenotype-associated single-nucleotide polymorphisms (SNPs) from the NHGRI-EBI GWAS Catalog (Welter et al., 2014) onto the lncRNA sequences, and only those lncRNAs with no phenotype-related SNPs detected within 10 kb upstream or downstream were selected as cancer-unrelated lncRNAs. Finally, we identified 158 positive lncRNAs (**Data Sheet 1**) and 4,533 negative lncRNAs (**Data Sheet 2**).

We downloaded lncRNA feature data from CRlncRC; CRlncRC retrieves 85 features and groups them into four categories: genomic features, expression features, epigenetic features, and network features. Feature category, name, source database, and description information are detailed in **Data Sheet 3**.

# Feature Engineering

Features play an essential role in classification, and appropriate features can improve classification performance significantly. In cancer genomic research, the currently known cancer-related lncRNA (positive) set is available only because its members have already been identified by researchers. It is possible that some samples in the negative set may be considered to belong to the positive set in the future. Hence, we employed Laplacian scoring (He et al., 2005), which is designed to select features without labels, as a criterion to evaluate the importance of each feature. The basic idea of the Laplacian score is to evaluate features according to their locality preserving power, an idea that derives from Laplacian Eigenmaps (Chung, 1997) and Locality Preserving Projection (He and Niyogi, 2003).

In detail, we applied scikit-feature (Li et al., 2017) to calculate Laplacian scores; the parameters for the affinity matrix used in the calculation were as follows: metric = euclidean, neighbor\_mode = knn, and k = 5. Calculated scores range from 0 to 1, with smaller values indicating more important features. The distribution of calculated Laplacian scores is presented in **Figure 2** and clearly shows large margins within each category of features. We therefore computed the differences between consecutive Laplacian scores sorted in ascending order and used the two largest differences to set the thresholds. The exception was the "Epigenetic" category, where we set the margins at the second and third largest differences, because these appeared to be the inflection points. Hence, the features were split into three parts, and the features located in the lower part (i.e., those with scores indicating greater importance) were retained directly. Nevertheless, it is not advisable to simply remove the features located in the other parts, as these also contain some information. Therefore, we merged the features within each of the other parts by taking their mean and retained the merged features to preserve the information. For example, the middle-scoring part of "Expression" contains two features; we removed these two features but retained their mean value. The mean-merged features obtained from the high-scoring parts were also retained. Finally, we generated the training and validation sets by concatenating the processed features of each category. Changes in the number of features in each category are summarized in **Table 1**. After feature selection, we obtained 51 features, eight of which are synthetic. A "Bigtable" containing 11,194 lncRNAs from CRlncRC, with 85 features, is included in **Data Sheet 4**.
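The splitting and merging steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `prefix` argument, and the `_Level1`/`_Level2` naming of merged features are hypothetical, and the Laplacian scores are assumed to have been computed already (for example with scikit-feature).

```python
from statistics import mean


def split_by_largest_gaps(scores, n_cuts=2):
    """Sort features by ascending score and cut the ranking at the largest gaps.

    `scores` maps feature name -> Laplacian score (smaller = more important).
    Returns n_cuts + 1 lists of feature names: lower, middle, upper part.
    """
    ordered = sorted(scores, key=scores.get)
    gaps = [scores[ordered[i + 1]] - scores[ordered[i]]
            for i in range(len(ordered) - 1)]
    # Positions of the n_cuts largest gaps; a cut after position i separates
    # ordered[:i + 1] from the remaining features.
    largest = sorted(range(len(gaps)), key=gaps.__getitem__, reverse=True)[:n_cuts]
    parts, start = [], 0
    for i in sorted(largest):
        parts.append(ordered[start:i + 1])
        start = i + 1
    parts.append(ordered[start:])
    return parts


def merge_features(rows, parts, prefix):
    """Keep lower-part features unchanged; replace every other part by its mean.

    `rows` is a list of dicts (one per lncRNA) mapping feature name -> value.
    The merged feature names are hypothetical placeholders.
    """
    lower, rest = parts[0], parts[1:]
    merged_rows = []
    for row in rows:
        new_row = {name: row[name] for name in lower}
        for level, part in enumerate(rest, start=1):
            new_row[f"{prefix}_Level{level}"] = mean(row[name] for name in part)
        merged_rows.append(new_row)
    return merged_rows
```

For a category whose inflection points are not the two largest gaps (as reported for "Epigenetic"), the cut positions would have to be chosen by hand rather than by this automatic rule.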

## Model Training

The machine learning method, XGBoost (Chen and Guestrin, 2016), was tuned to search for an optimal prediction solution. XGBoost is a type of gradient boosting decision tree method; its objective function is defined in equation (1).

$$\mathcal{L}(\phi) = \sum_{i=1}^{n} \text{loss}(y_{i}, \hat{y}_{i}) + \sum_{k=1}^{K} \Omega(f_{k}), \tag{1}$$

where loss is the training loss, Ω(*f*) is the complexity of a tree, *n* is the number of training samples, and *K* is the number of trees in the model. The model is optimized by minimizing this objective function. To this end, an additive training method was employed for the training loss, and the prediction at the *t*-th training round could be quickly optimized using a second-order Taylor expansion. The greedy algorithm [31] was used to determine optimal tree complexity.
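For readers who want the Taylor step written out, the approximation given in the XGBoost paper (Chen and Guestrin, 2016) is reproduced below. The symbols $y_i$ (label), $\hat{y}_i^{(t-1)}$ (prediction after round $t-1$), $x_i$ (feature vector), $g_i$, $h_i$, $T$ (number of leaves), $w$ (leaf weights), $\gamma$, and $\lambda$ follow that paper rather than the text above, so their mapping onto equation (1) should be read as an assumption.

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ \text{loss}\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2},$$

$$g_i = \frac{\partial\, \text{loss}\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^{2}\, \text{loss}\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}.$$

Because the first term inside the sum is constant at round *t*, only the gradient and Hessian terms need to be evaluated when scoring a candidate tree, which is what makes each round fast.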

*\*LP, lower part of Laplacian score; MP, middle part of Laplacian score; UP, upper part of Laplacian score.*

In our study, we used the dmlc XGBoost library (https://xgboost.ai/) to implement the XGBoost model. To tune the hyper-parameters, we first adopted Bayesian optimization to search for potential hyper-parameters and then manually fine-tuned them to improve the performance of the model. The hyper-parameters for XGBoost primarily control the growth and the robustness of the model.


In addition, as our sample was unbalanced (the ratio of the minority positive class to the majority negative class was approximately 1/30), we adopted SMOTE (Nakamura et al., 2013), tuned by Bayesian optimization, to re-sample our training set and reduce the impact of data imbalance. The final tuning result for this model was n\_estimators = 546, max\_depth = 10, learning\_rate = 0.01, colsample\_bytree = 0.7, subsample = 0.826, and gamma = 0.036.
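To make the over-sampling idea concrete, the following is a minimal sketch of the core SMOTE step: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote`, its parameters, and the default `k=5` are illustrative assumptions; the text does not report the SMOTE settings that were used, and in practice a maintained library implementation would normally be preferred.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    `minority` is a list of feature vectors (lists of floats) and must contain
    at least two samples. All names and defaults here are illustrative.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (itself excluded).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(p, base))[:k]
        mate = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (m - b) for b, m in zip(base, mate)])
    return synthetic
```

Every synthetic point lies on the segment between two real positive samples, so no new region of feature space is invented; the minority class is only densified.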

Ten-fold cross-validation was adopted to evaluate the model trained with the parameters obtained using Bayesian optimization. The algorithm performs a stratified shuffle of all samples into 10 folds and then iterates: in each iteration, nine folds are first over-sampled and then used for training, while the single remaining fold is held out for validation. The over-sampled training set is used to fit the model, and the validation set is used to evaluate the model's performance. Note that the validation set in each iteration is not re-sampled and does not include any data used for training. Further, the models trained in each iteration are independent of one another. We measured the AUC scores using this 10-fold cross-validation (**Figure 1C**).
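The fold construction can be sketched as follows. This is an illustrative stdlib-only version, not the authors' implementation; the function names are hypothetical. The point it demonstrates is the order of operations: folds are built from the original data first, and any over-sampling is applied afterwards to the training indices only.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Shuffle sample indices within each class and deal them out to the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def cross_validation_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the n_folds rounds.

    Over-sampling, if used, should be applied to the training indices only,
    so the validation fold always consists of untouched original samples.
    """
    folds = stratified_folds(labels, n_folds, seed)
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx
```

Over-sampling before splitting would let interpolated copies of a validation sample leak into the training set, which is why the resampling step sits inside the loop.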

Further, to rigorously evaluate the model's performance, we measured the recall, precision, and F1 score, using the 10-fold cross-validation process described above.

Recall is the ratio of correctly predicted positive observations to all observations that actually belong to the positive class, where TP and FN are the numbers of true positives and false negatives, and was calculated using equation 2:

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{2}$$

Precision is the ratio of correctly predicted positive observations to all predicted positive observations, where FP is the number of false positives, and was calculated using equation 3:

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{3}$$

The F1 score is the harmonic mean of precision and recall, and was calculated using equation 4:

$$\text{F1 score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}. \tag{4}$$
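The three metrics can be computed directly from binary labels and predictions. The sketch below is a plain-Python illustration; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With 2 true positives, 1 false positive, and 2 false negatives, for instance, precision is 2/3, recall is 1/2, and the F1 score is 4/7.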

## Prediction and Evaluation

To predict novel cancer-related lncRNAs, we applied our pre-trained model to 7,253 unknown lncRNAs from TANRIC [33]. We then used several methods to test the reliability of our predictions. First, predictions were searched against the Lnc2Cancer v2.0 database. Next, the Kolmogorov-Smirnov test was used to examine whether there were significant differences among the different sets (positive, negative, and predicted) in the distance to cancer protein-coding genes, mutation numbers, and numbers of samples differentially expressed between tumor and normal tissues. Mutation data and cancer protein-coding gene sets were downloaded from COSMIC [34]. Tumor and normal tissue expression profiles were downloaded from TANRIC. Further, survival analysis for the top 10 predictions was conducted using TANRIC.
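As a reminder of what the two-sample Kolmogorov-Smirnov test measures, the sketch below computes its test statistic, the largest vertical distance between the empirical cumulative distribution functions of two samples. It is an illustration only: the function name is hypothetical, the p-value calculation is omitted, and a statistics library would normally be used for the full test.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

The statistic is 0 for identical samples and 1 for samples that do not overlap at all, so it compares whole distributions rather than only their means.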

# RESULTS

## Data Collection

We collected 158 highly trusted cancer-related lncRNAs from CRlncRC as our positive data set. All have been reported in the literature with the support of strict experimental validation and are involved with cancer hallmarks. lncRNAs (n = 4,553) in CRlncRC without phenotype-related SNPs within 10 kb up- or down-stream were used as our negative data set. In CRlncRC, we collected 85 features that could potentially facilitate the recognition of cancer-related lncRNAs and grouped them into four different categories (see **Data Sheet 3** for details): Genomic features (such as GC content and sequence conservation score), Expression features (the expression profiles of lncRNAs in 16 different tissue types), Epigenetic features (different types of epigenetic signals in different types of cell lines), and Network features (the interactions between lncRNAs and cancer-related protein-coding genes and miRNAs). After feature selection using Laplacian scores, we reduced the feature number from 85 to 51. Cumulative curves were plotted and showed that the distribution of the feature values between the positive and negative sets was significantly different (Kolmogorov-Smirnov test, p-value < 0.05) (**Data Sheet 5**). The number in each feature category before and after feature selection is shown in **Table 1**.

## Performance Evaluation

The results of 10-fold cross-validation are presented in **Figure 3**. We drew 10 ROC curves, which had minimum and maximum AUC values of 0.73 and 0.93, respectively, and an average value of 0.86 ± 0.6. In addition to AUC values, further evaluation indicators were used to assess our results, including precision, recall, and F1 score (**Table 2**). The average precision, recall, and F1 score values were 0.72, 0.62, and 0.65, respectively. Overall, these data demonstrate that CRlncRC2 is an efficient tool for identification of lncRNAs related to cancer, with high accuracy and stable performance.
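The AUC reported for each fold has a simple interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it in exactly that way; the function name is illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))
```

Because it depends only on the ranking of scores, this measure is not inflated by a large negative class, which makes it a reasonable headline metric for data as imbalanced as these.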

Compared with other methods, CRlncRC2 has superior performance. Relative to CRlncRC, CRlncRC2 reduced the number of features from 85 to 51, while its mean AUC value reached 0.86, which is 0.04 higher than that achieved using CRlncRC (**Figure 4A**). Further, we compared the prediction performance of CRlncRC2 with that of the method described by Zhao et al. (2015). To ensure a fair comparison, we retrained CRlncRC2 using the dataset reported by Zhao et al. The resulting mean AUC value for CRlncRC2 was much higher than that of the method of Zhao et al. (0.90 vs. 0.79) (**Figure 4B**).

To determine why CRlncRC2 performed better than CRlncRC, we analyzed the feature importance (XGBoost importance weight) in CRlncRC2 (**Data Sheet 6**). Compared with the features used in CRlncRC, the numbers of epigenetic and expression features in CRlncRC2 were almost unchanged, whereas the number of genomic features was reduced by half and the number of network features by two thirds (**Figure 5A**). Two expression features were among the top ten most important features in CRlncRC2, whereas no expression features were among the top ten in CRlncRC (**Figure 5B**). In addition, all four types of features are represented in the top 20 features of CRlncRC2, indicating that CRlncRC2 can make better use of different features (**Figure 5C**). Furthermore, as illustrated in **Figures 5C**, **D**, the proportions of epigenetic features among the top 20 and top 50 features for CRlncRC2 were much larger than those for CRlncRC. Surprisingly, although genomic and network features accounted for a small proportion, the three synthetic genomic and network features (Gen\_LevelTwo, Gen\_LevelOne, and Net\_LevelTwo) ranked the highest, indicating that synthetic features generated by merging the features of the middle and upper Laplacian score parts may contribute substantially to the model (**Figure 5E**). Two repeat features, short interspersed nuclear element (SINE) and long interspersed nuclear element (LINE) signals on gene bodies, ranked No. 4 and No. 5, followed by gene expression level in colon tissue (No. 6) and prostate gland (No. 8), "H3k4me1" epigenetic modification signals within the region 5 kb upstream and downstream of the transcription start site (TSS5k) in GM12878 (No. 7), and "H3k4me3" epigenetic modification signals within the lncRNA gene body/TSS1k region in the H1hesc/GM12878 cell lines (No. 9 and No. 10).

TABLE 2 | Performance of 10-fold cross-validation.

We further evaluated the effectiveness of our approach to handling the imbalanced data. The SMOTE over-sampling method was used to balance the data, and it contributed an increase of 0.01 in the AUC value relative to non-SMOTE-adjusted data (**Figure 6A**). In addition, to compare the performance of different machine learning algorithms, several models were compared using the data without SMOTE over-sampling. ROC curve analysis showed that the XGBoost-based method performed better than the Decision Tree (DT)-based (0.85 vs. 0.60) and Support Vector Machine (SVM)-based (0.85 vs. 0.74) approaches (**Figure 6B**). These results indicate that our new method achieved superior performance relative to previous methods. XGBoost contributed substantially to the AUC values, while data over-sampling provided a further, smaller gain.

## Statistical Analysis of Cancer-Related lncRNA Candidates

We used the pre-trained model to predict novel candidate cancer-related lncRNAs from 7,253 unknown lncRNAs, which were not in our training or testing sets. In total, we predicted 439 cancer-related lncRNA candidates (**Data Sheet 7**). First, we used the data from the newly updated database, Lnc2Cancer v2.0, to test our predictions, since we did not collect our positive dataset from this database. We studied the intersection of our predictions and its collection. Among our top 10, 20, and 50 predictions, 2, 5, and 8 lncRNAs, respectively, were also collected by Lnc2Cancer and had been functionally validated as cancer-related (**Figure 7A**). In total, 47 candidate cancer-related lncRNAs were found in Lnc2Cancer (**Data Sheet 7**). According to the tag information provided in the Lnc2Cancer database, these lncRNAs can be classified into several categories: drug-resistant, methylation, circulating, transcription factor (TF), and variant (**Data Sheet 8, Figure A**). Further, we selected the top 10 among these 47 cancer-related lncRNAs and evaluated their expression in cancers. Surprisingly, almost all lncRNAs exhibited inconsistent changes in expression across tissues (**Data Sheet 8, Figure B**), consistent with their functional diversity and the strong tissue specificity of lncRNAs. In addition, the 47 predicted lncRNAs had roles in numerous malignant tumors, including 17 involved in colorectal cancer, 10 in gastric cancer, and 10 in hepatocellular carcinoma (**Data Sheet 8, Figure C**).


Using statistical methods and multigroup data, we further analyzed the reliability of our predictions. First, we hypothesized that potential cancer-related lncRNAs were likely to have more somatic mutations in cancer genomes, since many previous studies have demonstrated that mutations in functional genes are a primary cause of carcinogenesis. To test this hypothesis, we compared the number of somatic mutations (documented in COSMIC) between different lncRNA sets and a cancer-related protein-coding gene set (**Figure 7B**). The results showed that the cancer-related protein-coding gene set, as the positive control, contained far more somatic mutations than the cancer-unrelated lncRNA set (negative control; Kolmogorov-Smirnov test, p-value = 6.10e-33). The somatic mutation numbers in both the positive and predicted cancer-related lncRNA sets fell between those of cancer-unrelated lncRNAs and cancer-related protein-coding genes, and were significantly higher than those in cancer-unrelated lncRNAs (Kolmogorov-Smirnov test, p-values = 2.35e-07 and 8.27e-06, respectively).

As a number of lncRNAs exert their function in cis, by influencing neighboring genes, we assumed that potential cancer-related lncRNAs were likely to be closer to cancer-related protein-coding genes than cancer-unrelated lncRNAs. Therefore, we calculated the distances of different lncRNA sets to their closest cancer-related protein-coding genes, and compared them with a random background (i.e., the distance between cancer-related protein-coding genes and random positions in the genome) (**Figure 7C**). We found that the distances between cancer-unrelated lncRNAs and cancer-related protein-coding genes were significantly larger than those between cancer-related lncRNAs and cancer-related protein-coding genes (Kolmogorov-Smirnov test, p-value = 4.1e-4). Similarly, the distance of predicted cancer-related lncRNAs from cancer-related protein-coding genes was far shorter than that of cancer-unrelated lncRNAs (Kolmogorov-Smirnov test, p-value = 4.9e-06). Moreover, no significant difference in distance was detected between the background and the cancer-unrelated lncRNA set, as expected.

Next, we examined whether the expression levels of cancer-related lncRNAs differed from those of cancer-unrelated lncRNAs in cancer samples (**Figure 7D**). Using lncRNA expression data from the TANRIC database, we calculated the percentage of lncRNAs that were differentially expressed (absolute log2-fold change > 1) between cancer and paracancerous tissue sample pairs, to determine whether this differed among the lncRNA sets. We found that lncRNAs in the positive set had the highest percentage of differentially expressed genes (approximately 40%), while the value for the negative set was only approximately 20%. Among predicted cancer-related lncRNAs, > 35% showed differential expression. These results further support the association of our predictions with cancer, and also reveal that relying on differential expression alone to identify cancer-related lncRNAs is far from sufficient.
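The differential expression criterion can be illustrated with a short sketch. It assumes one summary expression value per lncRNA for tumor and for paired normal tissue, and it adds a pseudocount of 1 before taking the ratio; both choices, like the function name, are assumptions made for this example, as the text does not specify how the fold change was computed from the sample pairs.

```python
import math


def fraction_differentially_expressed(expression, threshold=1.0, pseudocount=1.0):
    """Fraction of lncRNAs with |log2 fold change| above the threshold.

    `expression` maps lncRNA id -> (tumor_value, normal_value). The pseudocount
    avoids division by zero and is an assumption of this sketch.
    """
    if not expression:
        return 0.0
    hits = 0
    for tumor, normal in expression.values():
        log2_fc = math.log2((tumor + pseudocount) / (normal + pseudocount))
        if abs(log2_fc) > threshold:
            hits += 1
    return hits / len(expression)
```

A threshold of 1 on the log2 scale corresponds to at least a two-fold change in either direction, so both up-regulated and down-regulated lncRNAs are counted.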

## Case Study

FIGURE 5 | (A) … features, and after feature selection, 51 remained in CRlncRC2. (B) Comparison of the top 10 features in CRlncRC2 and CRlncRC. (C) Comparison of the top 20 features in CRlncRC2 and CRlncRC. (D) Comparison of the top 50 features in CRlncRC2 and CRlncRC. (E) Bar plot of the top 10 features used in CRlncRC2.

FIGURE 6 | Comparison of SMOTE and non-SMOTE, non-SMOTE XGBoost, and others. (A) The ROC curve generated using SMOTE XGBoost has an AUC value 0.01 higher than that achieved using non-SMOTE XGBoost. (B) The XGBoost-based ROC curve without SMOTE generated AUC values 0.11 and 0.25 higher than the SVM-based and Decision Tree-based ROC curves, respectively.

Although functional identification of lncRNAs is very challenging, bioinformatics analysis, database searches, and literature review can uncover evidence that our predictions represent lncRNAs with functions in cancer. For the top 10 candidate genes, we used the TANRIC database to generate Kaplan-Meier survival curves for each cancer type. The results showed a significant difference in overall survival between the low and high lncRNA expression groups for all genes in at least one tumor tissue (**Figure 8A**).
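For readers unfamiliar with how a Kaplan-Meier curve is built, the sketch below implements the product-limit estimator for a single group. It is an illustration only: the curves in this study were generated with TANRIC, and the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    `times` are follow-up times; `events` flags are truthy for an observed death
    and falsy for a censored patient. Returns (time, survival) pairs at each
    time where at least one death occurred.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Running the estimator separately on the low and high expression groups gives the two curves whose difference is then assessed for significance.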

For example, survival analysis of the No. 1 lncRNA, AC074117.1, indicated significant differences in survival time between low and high expression groups in individuals with invasive breast carcinoma (BRCA) and kidney renal clear cell carcinoma (KIRC), with p-values of 1.5e-2 and 4.0e-5, respectively (**Figure 8B**, **C**). To study the regulatory function of AC074117.1, we downloaded data on cancer-related small RNA molecules from the Human microRNA Disease Database (HMDD) (Huang et al., 2019), and the interaction network between lncRNAs and miRNAs from StarBase (Li et al., 2014). Subsequently, we constructed an interaction network between AC074117.1 and cancer-related microRNAs (**Figure 8D**). In addition, according to predictions using the LncRNA and Disease Database (version 2.0), AC074117.1 likely targets a gene cluster on chromosome 2 and is associated with a variety of cancers (Bao et al., 2019). Together, these clues suggest that AC074117.1 may be involved in cancer and act as a ceRNA. As shown in **Figure 8E** (data from the UCSC genome browser), AC074117.1 is highly expressed in almost all tissues. Further, there is a histone methylation signal at the AC074117.1 transcription start site. The H3K4Me3 and H3K27Ac signals in the first exon were high, while the H3K4Me1 signals were relatively weak. Moreover, high conservation signals (100 vertebrates basewise conservation scores generated using PhyloP) were found in its exon regions. Notably, there are a large number of repetitive elements across the whole gene body region of AC074117.1. These methylation signals and repeat elements may contribute to the mechanism by which this lncRNA is involved in cancer progression (Anwar et al., 2017; Di Ruocco et al., 2018; Solovyov et al., 2018). In conclusion, our predictions indicate that lncRNA AC074117.1 has a strong potential correlation with cancer.

In addition, recent literature reports support some of the predicted lncRNAs in the top 10 list; for example, the *TRAF3IP2-AS1* lncRNA ranked second (No. 2) among our predictions and is a hub gene in a lncRNA-mediated ceRNA network that competes with the onco-lncRNAs, PVT1 and XIST, and could be a clinically relevant biomarker in glioblastoma (Zan and Li, 2019). TTC28-AS1 (No. 4) is an antisense RNA of TTC28 which is associated with colorectal cancer (Pitkanen et al., 2014). Further, C1RL-AS1 (No. 10) has been linked to angiogenesis, as predicted in the ANGIOGENES database (Muller et al., 2016).

# DISCUSSION

Accumulating reports demonstrate that lncRNAs have significant roles in human cancers. Using experimental methods to study the relationships between lncRNA and cancer is time consuming and costly. In contrast, computational methods enable integration of multi-omics data and provide additional information for data mining. In this study, we developed a new method, CRlncRC2, based on a powerful machine learning algorithm (XGBoost), Laplacian score feature selection, and SMOTE over-sampling, to predict associations of lncRNAs with cancer. Compared with CRlncRC, CRlncRC2 improves performance while requiring fewer features (see **Table 3** for a detailed comparison). The results show that CRlncRC2 is much more sensitive and specific than the previous version (CRlncRC), primarily owing to the chosen algorithm, as results generated using XGBoost differed greatly from those generated using other methods. XGBoost has also been used in numerous other projects, achieving good results. For example, Zheng et al. developed a scalable, flexible approach, BiXGBoost, to reconstruct gene regulatory networks (GRNs), and tested it on DREAM4 and *Escherichia coli* datasets, demonstrating good performance of BiXGBoost in networks of different scales (Zheng et al., 2018).

Machine learning algorithms have important roles in bioinformatics, where they facilitate the solution of problems such as classification, clustering, regression, and prediction; however, the machine learning approach still faces a number of obstacles in predicting cancer-related lncRNAs. First, for biological data, frequently only small positive sets are available, owing to the difficulty of collecting information such as patient data and experimental verification of functional genes, which greatly impedes the practical application of machine learning. Further, machine learning models require optimization for high performance, according to the specific data and situation. To address these problems, in this study, we applied the most stringent criteria to select the positive and negative sets, and used the latest histological information for feature extraction. We chose over-sampling in our new algorithm because it enables use of more information relative to under-sampling, and the results confirmed that it can improve accuracy and specificity. Moreover, we merged features with high Laplacian scores and obtained eight synthetic features, some of which had the highest feature importance ranks. Our findings suggest that high Laplacian score features still contain useful information and that it is not good practice to simply discard them.

LncRNAs have been applied in clinical practice as new biomarkers and prognostic indicators. Research on the relationships between lncRNAs and cancer is attractive and progressing very rapidly. Machine learning methods have the power to discover novel lncRNAs, including disease associated lncRNAs (Kang et al., 2017; Bao et al., 2019). Efforts should continue to improve the ability of machine learning algorithms to predict cancer associations. Moreover, with increasing research into lncRNAs, greater quantities of relevant high-throughput data are becoming easier to obtain. The development of functional research into lncRNAs has revealed additional functional elements and mechanisms (Zhang et al., 2014; Brockdorff, 2018). Further, numerous new tools for evaluating the similarity of non-linear sequences, using k-mer content (Kirk et al., 2018) and a new evolutionary classification perspective (Chen et al., 2016), have been developed, which can be used to extract new features, such as lncRNA conservation. These can facilitate better application of bioinformatics methods to predict cancer-related lncRNAs and assist in cancer diagnosis and treatment.



# CONCLUSIONS

In this study, we upgraded CRlncRC to CRlncRC2, using a powerful machine learning algorithm (XGBoost), Laplacian score feature selection, and an advanced over-sampling method (SMOTE). The results show that both XGBoost and SMOTE can help to improve model accuracy and specificity. After feature engineering, most of the expression and methylation features were retained, indicating their importance for predicting lncRNAs with potential functions in cancer. Using far fewer features, CRlncRC2 has a mean AUC value 0.04 higher than that of CRlncRC. In addition, our predicted top-ranking cancer-related lncRNA candidates are supported by Lnc2Cancer v2.0, literature reports, and statistical data. In summary, CRlncRC2 is an effective and useful method for lncRNA-cancer association identification.

# DATA AVAILABILITY

The datasets analyzed for this study can be found at https://github.com/xuanblo/CRlncRC2.

# AUTHOR CONTRIBUTIONS

CL and LC conceived, designed, and supervised this study. JW and XZ collected and compiled data from the literature and public databases. XZ, TL, and JW designed and developed the data analysis. JL participated in discussion of the project. XZ, JW, TJ, and CL compiled the manuscript draft. CL, LC, and JL revised the manuscript. All authors reviewed, edited, and approved the manuscript.

# REFERENCES


# FUNDING

This work was supported by the National Natural Science Foundation of China (No. 31471220, 91440113), Start-up Fund from Xishuangbanna Tropical Botanical Garden, "Top Talents Program in Science and Technology" from Yunnan Province, Science and Technology Development Fund, Macau S.A.R. (097/2015/A3, 196/2017/A3).

# ACKNOWLEDGMENTS

Data analysis was supported by the High-performance computing (HPC) Platform, The Public Technology Service Center of Xishuangbanna Tropical Botanical Garden (XTBG), CAS, China.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00735/full#supplementary-material.

DATA SHEET 1 | Positive dataset. XLSX 11KB

DATA SHEET 2 | Negative dataset. XLSX 66KB

DATA SHEET 3 | Feature categories. DOCX 16KB

DATA SHEET 4 | Bigtable. CSV 6.9M

DATA SHEET 5 | Cumulative curves of positive and negative lncRNAs for all features. PDF 538KB

DATA SHEET 6 | Feature importance. CSV 1KB

DATA SHEET 7 | Predicted positives. XLSX 27KB

DATA SHEET 8 | Statistics of the interactions with Lnc2Cancer v2.0. PDF 207KB


between immunotherapy responsive and T cell suppressive classes. *Cell Rep.* 23, 512–521. doi: 10.1016/j.celrep.2018.03.042


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Zhang, Li, Wang, Li, Chen and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Predicting lncRNA-miRNA Interaction *via* Graph Convolution Auto-Encoder

*Yu-An Huang1†, Zhi-An Huang2†, Zhu-Hong You1\*, Zexuan Zhu3\*, Wen-Zhun Huang1, Jian-Xin Guo1 and Chang-Qing Yu1*

*1 College of Electronics and Information Engineering, Xijing University, Xi'an, China, 2 Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, 3 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China*

The interaction of miRNA and lncRNA is known to be important for gene regulation. However, the number of known lncRNA-miRNA interactions is still very limited, and few computational tools are available for predicting new ones. Considering that lncRNAs and miRNAs share internal patterns in their partnerships, underlying lncRNA-miRNA interactions could be predicted from the known ones, which can be treated as a semi-supervised learning problem. The attributes of lncRNAs and miRNAs have been shown to be closely related to the interactions between them. Effective use of such side information could help improve performance, especially when training samples are limited. In view of this, we propose an end-to-end prediction model called GCLMI (Graph Convolution for novel lncRNA-miRNA Interactions), which combines the techniques of graph convolution and auto-encoder. Without any preprocessing of the feature information, our method can incorporate raw node attribute data with the topology of the interaction network. Based on a real dataset collected from a public database, the results of k-fold cross-validation experiments illustrate the robustness and effectiveness of the proposed prediction model. We show that the graph convolution layer designed in the proposed model is able to effectively integrate the input data by filtering the graph with node features. The proposed model is anticipated to yield high-potential lncRNA-miRNA interactions when users provide different types of numerical features describing lncRNAs or miRNAs, serving as a useful computational tool.

Keywords: LncRNA–miRNA interactions, graph convolution network, computational prediction model, regulation network, system biology model

# INTRODUCTION

In recent years, knowledge of the role of RNA in gene regulation has emerged from advances in next-generation sequencing technologies, which allow a deeper and more comprehensive study of the full transcriptomes of organisms. The ENCODE project demonstrated that, in mammals, noncoding RNA could constitute a substantial majority of transcripts within the genome (Science, 2004). As much as 98% of the whole human genome encodes noncoding transcripts, most of which are processed to generate small noncoding RNAs, such as miRNAs, or long noncoding RNAs (lncRNAs).

#### *Edited by:*

*Philipp Kapranov, Huaqiao University, China*

#### *Reviewed by:*

*Yun Xiao, Harbin Medical University, China Xiufeng Zhang, University of California, United States*

#### *\*Correspondence:*

*Zhu-Hong You zhuhongyou@gmail.com Zexuan Zhu zhuzx@szu.edu.cn*

*†These authors have contributed equally to this work.*

#### *Specialty section:*

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

*Received: 28 January 2019 Accepted: 17 July 2019 Published: 29 August 2019*

#### *Citation:*

*Huang Y-A, Huang Z-A, You Z-H, Zhu Z, Huang W-Z, Guo J-X and Yu C-Q (2019) Predicting lncRNA-miRNA Interaction via Graph Convolution Auto-Encoder. Front. Genet. 10:758. doi: 10.3389/fgene.2019.00758*


Even though the current understanding of lncRNA functions is still limited, lncRNAs have been shown to be key regulators of multiple biological processes through a complex mechanism in which their modular structure permits them to interact with specific proteins, RNA, and DNA (Wapinski and Chang, 2011). miRNAs, on the other hand, post-transcriptionally regulate the expression of their target genes. Accumulating studies show that, similar to protein-coding genes, both of these types of noncoding RNA influence almost all aspects of biology (Lu et al., 2005). Aberrant expression of noncoding RNAs appears to be one of the initiating factors of different types of disease, including cancers (Lewis et al., 2003; Calin and Croce, 2006).

A number of studies have begun to uncover the interactions between miRNA and lncRNA, and more and more details about the influence of miRNA on lncRNA function are now coming into view (Tay et al., 2014). In some cases, miRNA triggers lncRNA decay. In other cases, lncRNAs act as miRNA sponges/decoys, compete with miRNAs for binding to mRNAs, or generate miRNAs. Recently, the competing endogenous RNA (ceRNA) hypothesis was proposed and has become a mainstream view for explaining the interaction between lncRNA and miRNA (Salmena et al., 2011). Specifically, lncRNAs compete with pseudogenes, circular RNAs, and messenger RNAs for binding or sequestering microRNAs from the same pool by matching the miRNA response elements (MREs). Considering that both lncRNA and miRNA are key regulators of gene expression and that they interact with each other, it is not unexpected that their relationship in the interaction network is tightly regulated. Understanding the lncRNA-miRNA interaction networks governing the initiation and development of diverse diseases is essential, but this understanding remains largely incomplete (Karreth and Pandolfi, 2013).

LncRNAs and miRNAs interact with each other, forming a huge and complex regulation network that controls gene expression at the transcriptional, post-transcriptional, and post-translational levels. Through this multi-level regulation, these two vast families of noncoding RNAs are involved in almost all aspects of the cell life cycle, including cell division, senescence, differentiation, stress response, immune activation, and apoptosis (Shi et al., 2013). In view of this, interactions of noncoding RNAs in the regulation network have attracted widespread attention in medical research (Huang et al., 2016a). A comprehensive understanding of the molecular and cellular effects of such noncoding RNA interactions can offer great insight into disease mechanisms at the molecular level. Noncoding RNAs in those interactions that are newly discovered to be associated with a specific disease can be regarded as potential diagnostic markers and are therefore of high value for therapeutic approaches.

Some efforts have been made to design computational methods that meet the emerging need for accurate, large-scale prediction of lncRNA-miRNA interactions. One popular direction is the statistical analysis of data collected from biological experiments. For example, Sumazin et al. constructed a miRNA-mediated network of coding and noncoding RNA interactions to infer the key dysregulation of ncRNA expression in pathogenesis (Sumazin et al., 2011). The Hermes algorithm they proposed for this network calculates the statistical significance of each RNA-miRNA-RNA triplet by matching the expression profiles of genes and miRNAs in glioblastoma. Similarly, Paci et al. and Conte et al. constructed lncRNA-miRNA-RNA interaction networks by calculating the so-called sensitivity correlation, which denotes the difference between the Pearson correlation coefficient and the partial correlation coefficient for each triplet obtained from breast cancer data (Paci et al., 2014; Conte et al., 2017). To investigate the underlying roles of lncRNA in prostate cancer and lung adenocarcinoma, Du et al. and Sui et al. integrated different types of RNA attribute data to construct regulatory networks in which lncRNAs centrally mediate miRNAs (Du et al., 2016; Sui et al., 2016). All of these methods are based on statistical measures, and each statistical analysis targets a specific type of disease. To identify the noncoding RNA-mediated sponge regulatory network in various diseases recorded in TCGA and UCEC, Wang et al. constructed lncRNA-miRNA-gene triplet networks yielded by prediction algorithms. Based on the constructed networks, a hypothesis testing approach was implemented to predict the triplets associated with diseases (Wang et al., 2015).
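The sensitivity correlation described above can be illustrated with a small sketch. It assumes three expression vectors over the same samples (the names `lnc`, `mrna` and `mir` and the simulated data are hypothetical) and uses the standard first-order partial correlation formula; it is not the code of Paci et al. or Conte et al.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def sensitivity_corr(lnc, mrna, mir):
    """Pearson correlation minus the partial correlation given the miRNA."""
    return pearson(lnc, mrna) - partial_corr(lnc, mrna, mir)

# Simulated data (illustrative only): both RNAs are driven by the same miRNA,
# so their correlation should largely vanish once the miRNA is controlled for.
rng = np.random.default_rng(0)
mir = rng.normal(size=200)
lnc = -mir + 0.5 * rng.normal(size=200)
mrna = -mir + 0.5 * rng.normal(size=200)

s = sensitivity_corr(lnc, mrna, mir)
print(f"sensitivity correlation: {s:.3f}")
```

A large positive value suggests that the lncRNA-mRNA correlation is mostly explained by the shared miRNA, which is the pattern such triplet-based methods look for.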

Another direction for predicting lncRNA-miRNA interactions is based on matching seed sequences. Most computational tools of this type, such as TargetScan, miRanda and RNAhybrid, aim at predicting miRNA targets by selecting evolutionarily conserved microRNA binding sites (Zheng et al., 2017). However, Pinzón et al. pointed out that predictions made with these methods can contain many false positives and are often biologically irrelevant (Pinzón et al., 2017). They showed that the interaction between lncRNAs and miRNAs is dose-sensitive. In view of this, it is hard to predict miRNA targets using sequence information alone, as candidate targets are not always dose-sensitive enough to be functionally regulated by miRNAs.

The past decade has witnessed the exponential growth of noncoding RNA expression profiling data in cancers, but the number of lncRNA-miRNA interactions identified from such big data is still limited (Zheng et al., 2016b). Because different attributes of noncoding RNAs are continuously being updated, the big data about noncoding RNAs pose a significant challenge for data analysis and integration, which are important for predicting new links in the currently sparse lncRNA-miRNA interaction network. To take this area of work forward, some methods based on machine learning have been proposed. Huang et al. proposed the first prediction model for inferring lncRNA-miRNA interactions on a large scale. Specifically, the EPLMI model uses a network diffusion method on weighted networks associated with expression profiles, sequence information and biological function (Huang et al., 2018). The basic assumption of this method is that miRNAs with similar patterns tend to interact with similar lncRNAs and vice versa. However, how to define the similarity among noncoding RNAs based on their expression profiles is still an open problem. The EPLMI model uses Pearson correlation coefficients to compute this similarity, which means it assumes that each element of the noncoding RNA features contributes equally to the similarity score. This assumption, however, may not fit the underlying biological mechanism.
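As a minimal illustration of the Pearson-based similarity discussed above, the sketch below computes a pairwise similarity matrix from a toy expression matrix. The data are made up, and this is only the generic Pearson computation, not the EPLMI implementation.

```python
import numpy as np

# Hypothetical expression profiles: 3 RNAs (rows) measured in 5 tissues (columns).
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],   # proportional to row 0
    [5.0, 4.0, 3.0, 2.0, 1.0],    # reversed trend of row 0
])

# np.corrcoef treats each row as one variable and returns the matrix of
# pairwise Pearson correlation coefficients; every tissue contributes
# with equal weight to each coefficient.
similarity = np.corrcoef(profiles)
print(np.round(similarity, 3))
```

Because every tissue enters the coefficient with the same weight, a tissue that is biologically uninformative for a given interaction influences the score as much as an informative one, which is the limitation noted in the text.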

In recent years, advances in deep learning have fueled the widespread use of data mining in many scientific areas, including bioinformatics (Li et al., 2016). In particular, graph convolution has become a powerful and popular technique for mining graph-based data. Its power comes from its ability to automatically learn latent features within an end-to-end model structure, so that the hidden layers of the model can extract meaningful information from the raw input data. In this work, we introduce graph convolution into an autoencoder to build an end-to-end deep learning prediction model, called GCLMI, for inferring new lncRNA-miRNA interactions on a large scale. Specifically, two different layers are designed to encode and decode, respectively, the raw features of each node on the input graph. As a result, the decoder yields a fully connected network in which the predicted score of each link represents the confidence that the link is true. Unlike sequence-based algorithms, which only consider sequence information, GCLMI is a network-based algorithm that considers the known lncRNA-miRNA interactions along with the expression levels of lncRNA and miRNA. In addition, GCLMI aims to compute the possibility that a lncRNA-miRNA pair interacts in biological processes, whereas sequence-based tools aim to predict the binding sites of miRNAs in transcripts.

To evaluate the prediction performance of the proposed model, we apply it to a real dataset of lncRNA-miRNA interactions. Under 2-fold, 5-fold and 10-fold cross validation, the prediction model yielded average AUCs of 0.8492+/−0.0013, 0.8567+/−0.0009 and 0.8590+/−0.0005, respectively. The results of a series of comparison experiments show that the model we present is superior to several previously proposed methods. In addition, the results illustrate the ability of graph convolution to integrate the raw features of nodes with the topology of the graph. Overall, the experimental results indicate that the proposed deep learning-based model yields accurate results and is robust to parameter settings. It is anticipated that the proposed model could serve as a useful computational tool for predicting large-scale lncRNA-miRNA interactions when known lncRNA-miRNA interactions and their expression profiles are provided by users.

# METHOD

### Materials

The number of known lncRNA-miRNA interactions is still limited, and expression profiles of lncRNA and miRNA are often used to infer highly correlated lncRNA-miRNA pairs. Although the number of such inferred pairs is huge, they have not been confirmed by experiments based on CLIP-Seq techniques and would therefore negatively affect the prediction results (Zheng et al., 2016a). To obtain ground-truth data for our prediction, we collected a dataset of experimentally confirmed lncRNA-miRNA interactions from the lncRNASNP database (version v1.0). lncRNASNP is a comprehensive lncRNA database that provides different kinds of relevant data, including lncRNA expression profiling, expanded lncRNA-associated diseases, and noncoding variants in lncRNAs (available at http://bioinfo.life.hust.edu.cn/lncRNASNP). The database matches the IDs of lncRNAs and integrates data from different public databases, including lncRNA-miRNA interactions from starBase. In total, 8,091 pairwise interactions involving 780 types of lncRNA and 275 types of miRNA are recorded (Gong et al., 2014). These interactions have already been verified *via* laboratory examination and are therefore of high confidence.

Different types of data can be used as features of lncRNA and miRNA, such as nucleotide sequence information, expression profiles, target genes and predicted functional annotations (Zheng et al., 2018). Sequence information is complete for both lncRNA and miRNA but is too complicated for the model to learn, as it is nominal and varies in length across RNAs. The links between noncoding RNAs and target genes would be meaningful for inferring their interactions, but such information is scarce and incomplete for many of them. The functional annotation of a noncoding RNA is important for understanding its characteristics, and much work has been done to infer such annotations from different types of complementary data. However, this information is produced by prediction algorithms with additional assumptions and may therefore introduce bias into downstream prediction models. The superiority of expression profile data over the other data types was illustrated by the experimental results of our previous work (Huang et al., 2018). For this reason, we focus only on the expression profiles of noncoding RNAs in this work.

To collect the expression profiles of lncRNAs, we matched the IDs of lncRNAs between the lncRNASNP and NONCODE (http://www.noncode.org/) databases (Bu et al., 2011). Of the 780 types of lncRNA recorded in the lncRNASNP database, 450 were successfully matched with their expression profiles. The expression profile of each lncRNA gives its expression level in 22 different human tissues or cell lines. The features of miRNAs were collected from the microRNA.org database (http://www.microrna.org/) (Betel et al., 2008). The IDs of 230 of the 275 types of miRNA were successfully converted from lncRNASNP to the microRNA.org database. Each miRNA expression profile consists of 172 values describing the expression level of that miRNA in 172 human tissues and cell lines.

# Graph Convolution

Defining the convolution operator on a graph is still an open problem, and generalizing convolutional neural networks (CNNs) to arbitrary graphs has recently become an area of interest (Kearnes et al., 2016). So far, approaches to graph convolution fall into two categories: i) those based on definitions of spatial convolution and ii) those based on spectral graph theory. The latter is more popular and is elegantly defined as a multiplication in the graph Fourier domain. The spectral framework was first introduced in the context of graph CNNs by Bruna et al. (Bruna et al., 2013). Along this direction, Kipf et al. proposed an optimization strategy based on a first-order approximation of the spectral filters, reducing the complexity from O(n<sup>2</sup>) to O(|ε|) (see **Figure 1**) (Kipf and Welling, 2016).

To formulate the spectral convolution operator on a graph, consider the adjacency matrix *A* of a graph *G* with Laplacian *L* := *D* – *A*, where *D* is the degree matrix, and a signal s holding the attributes of each node on the graph. Defferrard et al. (2016) propose spectral graph convolution to filter s by a nonparametric kernel *g*θ(Λ) = *diag*(θ), where θ is a vector of Fourier coefficients. Given that *L* can be decomposed as *L* = *U*Λ*U*<sup>T</sup>, where Λ is the diagonal matrix of eigenvalues and *U* is the eigenvector matrix, the operator can be defined as

$$g_{\theta} \star s = U g_{\theta} U^{T} s \tag{1}$$
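Formula (1) can be checked numerically on a small graph. The sketch below is illustrative only: the four-node path graph and the variable names are assumptions, and the filter is the nonparametric kernel diag(θ) described above.

```python
import numpy as np

# Toy undirected graph (hypothetical): a path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                  # graph Laplacian L = D - A

# L is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
eigvals, U = np.linalg.eigh(L)

def spectral_conv(theta, s):
    """Formula (1): U diag(theta) U^T s."""
    return U @ np.diag(theta) @ U.T @ s

s = np.array([1.0, 0.0, 0.0, 0.0])         # signal concentrated on node 0
identity_out = spectral_conv(np.ones(4), s)    # theta = 1 leaves s unchanged
laplacian_out = spectral_conv(eigvals, s)      # theta = eigenvalues reproduces L @ s
print(identity_out, laplacian_out)
```

The eigendecomposition is the expensive step here, which is what motivates the polynomial approximation introduced next.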

Approximating the spectral filter by using a truncated expansion in terms of Chebyshev polynomials Tk(s) up to Kth order, the definition is as follows:

$$g_{\theta} \star s = \sum_{k=0}^{K} \theta'_{k} T_{k}\left(L_{N}\right) s \tag{2}$$

where *Tk* denotes the Chebyshev polynomials and *θ*ʹ is a vector of Chebyshev coefficients. Because the complexity of computing L is as large as O(n<sup>2</sup>), Kipf and Welling (2016) further simplified this definition by limiting K = 1 and approximating the largest eigenvalue of L by 2. The convolution operator then becomes:

$$g_{\theta} \star s = \theta \left(I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) s \tag{3}$$
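The step from the Chebyshev expansion to formula (3) can be spelled out as follows. This sketch follows the derivation in Kipf and Welling (2016) and assumes that *L<sub>N</sub>* denotes the rescaled normalized Laplacian, which the text above does not state explicitly; it uses the first two Chebyshev polynomials, $T_0(x) = 1$ and $T_1(x) = x$.

$$
\begin{aligned}
L_N &= \frac{2}{\lambda_{\max}}\left(I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) - I \;\approx\; -D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \qquad (\lambda_{\max} \approx 2), \\
g_{\theta} \star s &\approx \theta'_{0}\, T_{0}(L_N)\, s + \theta'_{1}\, T_{1}(L_N)\, s \;=\; \theta'_{0}\, s - \theta'_{1}\, D^{-\frac{1}{2}} A D^{-\frac{1}{2}} s, \\
g_{\theta} \star s &\approx \theta \left(I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) s \qquad \text{with } \theta = \theta'_{0} = -\theta'_{1}.
\end{aligned}
$$

Tying the two coefficients to a single parameter θ reduces the number of free parameters per filter to one.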

By introducing the renormalization trick $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$, formula (3) can be simplified as:

$$g_{\theta} \star s = \theta \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} s \tag{4}$$

In this work, we follow the definition in formula 4 to design our deep learning model based on graph convolution.
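A minimal NumPy sketch of the renormalized operator in formula (4) is given below. The three-node graph, the function name `renormalized_adjacency` and the value of θ are illustrative assumptions.

```python
import numpy as np

def renormalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, the operator used in formula (4)."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops: A~ = A + I
    deg = A_tilde.sum(axis=1)               # D~_ii = sum_j A~_ij (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale rows and columns, which is equivalent to multiplying by
    # diag(d_inv_sqrt) on both sides.
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# Toy graph (hypothetical): a path 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = renormalized_adjacency(A)

s = np.array([1.0, 2.0, 3.0])               # one attribute value per node
theta = 0.5                                  # a single filter parameter
filtered = theta * A_hat @ s
print(np.round(A_hat, 3), filtered)
```

The self-loops guarantee that every degree is at least one, so the inverse square root is always defined, and the eigenvalues of the resulting operator do not exceed 1 in absolute value.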

# GCLMI: An Auto-Encoder Prediction Model for lncRNA-miRNA Interactions

In this work, we cast the prediction of lncRNA-miRNA interactions as a link prediction problem on a heterogeneous bipartite graph. Consider the adjacency matrix *M* of such a graph, of shape *Nl* × *Nm*, where *Nl* is the number of lncRNA nodes and *Nm* is the number of miRNA nodes. Entry *Mij* of this matrix encodes whether or not the interaction between the *i*-th lncRNA and the *j*-th miRNA has been identified by biological experiments. The prediction task can then be regarded as inferring the values of the unobserved entries of *M* using semi-supervised learning on the observed ones.

In an equivalent picture, we can also represent the interaction data by an undirected graph *G* = (ν, ε, *Xl* , *Xm*), where *Xl* and *Xm* are the feature matrices of the lncRNA nodes and miRNA nodes, respectively. The goal is to learn embedding features *E* for lncRNAs and miRNAs by building a graph-based encoder [*El* , *Em*] = *fen* (ν, ε, *Xl* , *Xm*) and to predict new links by building a decoder *M*' = *fde* (*El* , *Em*). *El* and *Em* are the embedding feature matrices of lncRNAs and miRNAs, with shapes *Nl* × *L* and *Nm* × *L*, respectively (see **Figure 2**).

To this end, our proposed model is composed of two layers of different types: i) an encoder layer for filtering the node features of lncRNA and miRNA on the graph of their interaction network and ii) a decoder layer for predicting a fully connected interaction network using the embedding features learned by the encoder layer.

The inputs of the encoder layer include the feature matrices of lncRNA and miRNA (i.e. *Fl* and *Fm*) and the adjacency matrix of the known lncRNA-miRNA interaction network (i.e. *M*). To integrate the features of lncRNA and miRNA into one input matrix, an expanded matrix *X* is constructed from *Fl* and *Fm* as follows:

$$\mathbf{X} = \begin{bmatrix} F\_l & \mathbf{0} \\ \mathbf{0} & F\_m \end{bmatrix} \tag{5}$$

Accordingly, the adjacency matrix of the known lncRNA-miRNA interaction network is expanded as:

$$A = \begin{bmatrix} \mathbf{0} & M \\ M^T & \mathbf{0} \end{bmatrix} \tag{6}$$
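The two expanded matrices can be assembled with block operations. The sketch below uses toy sizes and made-up values (3 lncRNAs with 2 features, 2 miRNAs with 4 features); only the block layout follows formulas (5) and (6).

```python
import numpy as np

# Hypothetical toy inputs.
F_l = np.arange(6, dtype=float).reshape(3, 2)    # lncRNA features, N_l x D_l
F_m = np.arange(8, dtype=float).reshape(2, 4)    # miRNA features,  N_m x D_m
M = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)              # known interactions, N_l x N_m

N_l, D_l = F_l.shape
N_m, D_m = F_m.shape

# Formula (5): block-diagonal feature matrix, one row per node.
X = np.block([[F_l, np.zeros((N_l, D_m))],
              [np.zeros((N_m, D_l)), F_m]])

# Formula (6): symmetric adjacency matrix of the bipartite graph.
A = np.block([[np.zeros((N_l, N_l)), M],
              [M.T, np.zeros((N_m, N_m))]])

print(X.shape, A.shape)
```

With the sizes reported in the Materials section, *X* would have one row per matched lncRNA or miRNA and 22 + 172 columns, assuming the expression profiles are used as features without further processing.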

Based on the above two input matrices, we compute a graph convolution matrix G according to formula 4:

$$G = \left(I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) X \tag{7}$$

The hidden layer is then built on G by introducing a weight matrix *We* and a bias matrix *Be*. With ReLU as the activation function, the output *E* of the encoder layer is as follows:

$$E = \text{ReLU}\left(G \cdot W_e + B_e\right) = \begin{bmatrix} E_l \\ E_m \end{bmatrix} \tag{8}$$

where the trainable weight matrix $W_e \in \mathbb{R}^{(D_l + D_m) \times N_e}$ transforms the convolution matrix G into a hidden matrix *E*; *Dl* and *Dm* are the numbers of columns of *Fl* and *Fm*, respectively. *Ne* denotes the number of latent factors and is set manually. The output of the encoder layer is a projection from the space of raw features into a hidden space of lower rank. As lncRNA and miRNA are known to interact with each other through the MREs on transcripts, the design of the hidden layer accords with the fact that lncRNA, MRE, and miRNA are associated in a three-layer relation network.

The output of the encoder layer has two components: the embedding feature matrix of lncRNA, *El*, and that of miRNA, *Em*. Introducing a trainable weight matrix *Wd*, the decoder layer is then built on these two separate matrices, which share the same number of columns, as follows:

$$M' = E\_l W\_d E\_m^T \tag{9}$$

The output matrix *M*' clearly has the same shape as the input matrix *M*. As *M*' is numerical, it describes the weights of the links in a fully connected network. All lncRNA-miRNA pairs with a value of 0 in matrix *M* are assigned a predicted value by the decoder. Pairs with high predicted scores are considered more likely to be connected.
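The encoder and decoder described above can be sketched as a single forward pass. This is an untrained illustration under stated assumptions: the sizes and random inputs are hypothetical, the convolution operator is applied to *X* from the left (the order in which the shapes agree with *We*), the bias is taken to be a full matrix with the same shape as *E*, and nodes without any known interaction receive a zero scaling factor. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_l, N_m, D_l, D_m, N_e = 4, 3, 5, 6, 2      # toy sizes (hypothetical)

F_l = rng.random((N_l, D_l))                  # lncRNA features
F_m = rng.random((N_m, D_m))                  # miRNA features
M = (rng.random((N_l, N_m)) < 0.5).astype(float)   # known interactions

X = np.block([[F_l, np.zeros((N_l, D_m))],
              [np.zeros((N_m, D_l)), F_m]])
A = np.block([[np.zeros((N_l, N_l)), M],
              [M.T, np.zeros((N_m, N_m))]])

# Convolution operator I + D^{-1/2} A D^{-1/2}; isolated nodes (degree 0)
# get a zero scaling factor instead of a division by zero.
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
conv = np.eye(N_l + N_m) + d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

G = conv @ X                                  # (N_l + N_m) x (D_l + D_m)

# Trainable parameters, randomly initialised here for illustration.
W_e = rng.normal(scale=0.1, size=(D_l + D_m, N_e))
B_e = np.zeros((N_l + N_m, N_e))              # bias shape is an assumption
W_d = rng.normal(scale=0.1, size=(N_e, N_e))

E = np.maximum(G @ W_e + B_e, 0.0)            # encoder output with ReLU
E_l, E_m = E[:N_l], E[N_l:]                   # split into lncRNA / miRNA parts
M_pred = E_l @ W_d @ E_m.T                    # decoder: one score per pair

print(M_pred.shape)
```

The decoder is bilinear in the two embeddings, so every lncRNA-miRNA pair receives a score, including pairs with no known interaction.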

To train GCLMI in a semi-supervised manner, we use the strategy of negative sampling. Specifically, in each epoch of the training process, we randomly select a fixed number of negative samples from the unlabeled lncRNA-miRNA pairs. The loss function for training is defined as follows:

$$\mathcal{L} = \sqrt{\frac{\sum_{ij;\, \Omega_{p,ij} = 1 \text{ or } \Omega_{n,ij} = 1} \left(M'_{ij} - M_{ij}\right)^2}{\sum_{ij} \left(\Omega_{p,ij} + \Omega_{n,ij}\right)}} + \frac{1}{2} \|W_e\|^2 + \frac{1}{2} \|W_d\|^2 + \frac{1}{2} \|B_e\|^2 \tag{10}$$

where the matrices $\Omega_p \in \{0, 1\}^{N_l \times N_m}$ and $\Omega_n \in \{0, 1\}^{N_l \times N_m}$ denote the masks for the positive samples and for the negative samples drawn by random sampling, respectively. The first term in equation (10) aims to minimize the prediction error, and the second and third terms constrain the weight matrices of the encoder and decoder, respectively; the last term likewise constrains the encoder bias. As negative sampling is used for training, Ω<sub>n</sub> is randomly regenerated in each epoch so that its number of "1" entries is fixed as a specific percentage of the number of positive samples. Hence, we optimize over the positive samples only if this percentage is set to 0, and over the positive samples and a subset of the negative samples otherwise.
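The negative sampling step and the loss in equation (10) can be sketched as follows. The function names, the toy interaction matrix and the reading of the norms as squared Frobenius norms are assumptions made for illustration; gradient-based optimisation is omitted.

```python
import numpy as np

def sample_negative_mask(M, ratio, rng):
    """Mark ratio * (#positives) randomly chosen unlabeled entries of M with 1."""
    unlabeled = np.argwhere(M == 0)
    n_neg = min(int(ratio * M.sum()), len(unlabeled))
    chosen = unlabeled[rng.choice(len(unlabeled), size=n_neg, replace=False)]
    mask = np.zeros_like(M)
    mask[chosen[:, 0], chosen[:, 1]] = 1.0
    return mask

def gclmi_loss(M_pred, M, omega_p, omega_n, W_e, W_d, B_e):
    """Root-mean-square error over the sampled entries plus norm penalties."""
    mask = omega_p + omega_n                  # masks are disjoint, so entries are 0 or 1
    rmse = np.sqrt(np.sum(mask * (M_pred - M) ** 2) / np.sum(mask))
    # Norms are read as squared Frobenius norms (an assumption).
    reg = 0.5 * (np.sum(W_e ** 2) + np.sum(W_d ** 2) + np.sum(B_e ** 2))
    return rmse + reg

rng = np.random.default_rng(0)
M = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 0],
              [1, 0, 0]], dtype=float)        # toy known interactions
omega_p = M.copy()                             # positives: all known interactions
omega_n = sample_negative_mask(M, ratio=1.0, rng=rng)   # redrawn every epoch

zeros = np.zeros((2, 2))
loss_perfect = gclmi_loss(M, M, omega_p, omega_n, zeros, zeros, zeros)
loss_all_zero = gclmi_loss(np.zeros_like(M), M, omega_p, omega_n, zeros, zeros, zeros)
print(loss_perfect, loss_all_zero)
```

In a training loop, `sample_negative_mask` would be called once per epoch, so that different unlabeled pairs act as negatives over time.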

# RESULTS AND DISCUSSION

# Evaluation of Graph Convolution's Effectiveness

With graph convolution, the spectral filter function integrates the attribute features of an input node with those of its neighbor nodes on the graph. The GCLMI model uses graph convolution as a data pre-processing module so that it can train the embedding features of nodes in an end-to-end manner. In this section, we evaluate the effectiveness of graph convolution with regard to its ability to integrate the raw input features. Specifically, we compare the standard GCLMI pipeline with the case in which the input features are removed. To this end, each entry of the input feature matrix X in formula 7 is replaced with the value 1. In this case, the graph convolution operator is meaningless, as all node features are the same. We ran this modified computation in 5-fold cross validation. Without any node feature input, the GCLMI model yields an AUC of 0.8483 in the 5-fold cross validation experiment, significantly lower than the AUC of 0.8567 yielded by the standard computational pipeline (see **Figure 3**). The result shows that the graph convolution designed in GCLMI is feasible and effective for integrating the raw feature inputs (see **Figure 4**).

# Evaluation of the Impact of Negative Sampling

No biological experiment has yet confirmed that any lncRNA-miRNA pair definitely does not interact, so no database can provide negative samples for our training. For this reason, the prediction task in this work can be regarded as a semi-supervised learning problem. Because the known lncRNA-miRNA network is sparse, sampling from the unlabeled pairs generates a data source that contains the underlying negative samples. The information in unlabeled data can thus be leveraged to compensate for the scarcity of training data (Zheng et al., 2014). To do so, we perform negative sampling on the unlabeled samples in each training epoch to construct the negative sample set. However, the number of sampled negatives can affect the prediction performance of the proposed model. A larger number of negative samples provides more information for the model to learn, which can improve performance, but it can also cause the problem of unbalanced training data. The choice of the size of the negative sample set is therefore important for accurate prediction by the GCLMI model. In each training epoch, the size of the negative sample set is fixed as a ratio p of the number of positive samples. In this section, we explore the prediction performance of GCLMI with different values of p (i.e. 0, 0.5, 1.0, 3.0, 5.0, 10.0).

**Figure 5** shows the training loss and training error as the number of training epochs increases in this series of experiments. The training loss is defined by Equation 10 and the training error by the first term of Equation 10. The curves in **Figure 5**(**A**) and **Figure 5**(**B**) show that the training processes of GCLMI with different sizes of negative sample set are similar. In most experiments, the training loss and training error converge to their lower bounds before the 250th epoch and the 150th epoch, respectively, illustrating that the computational process is robust to different negative sampling settings. The prediction performance of GCLMI with different negative sampling settings is also evaluated. As shown in **Figure 5**, the prediction performance, in terms of the AUC value, varies with the size of the negative sample set. Specifically, when the number of negative samples is set to 3 times the number of positive samples, the model achieves its highest prediction performance, with an AUC of 0.8567. It should also be noted that the prediction performance of GCLMI declines greatly when p is set to 0. As p = 0 means that no negative sample is used for training, this result illustrates that negative sampling is effective and necessary for accurate prediction of large-scale lncRNA-miRNA interactions.

# Prediction Performance of GCLMI on k-Fold Cross Validation

To evaluate the accuracy of our proposed model in predicting lncRNA-miRNA interactions, we adopt 2-fold, 5-fold and 10-fold cross validation. All experiments in this work are conducted on a real dataset of experimentally confirmed lncRNA-miRNA interactions. Specifically, in k-fold cross validation, all known lncRNA-miRNA interactions are divided into k parts of roughly equal size, each of which is used in turn as the testing sample set while the rest are used as the training sample set. After running the prediction process of GCLMI with the training set as input, each testing sample obtains a prediction score representing the confidence that the link exists. We consider all 209,152 unlabeled lncRNA-miRNA pairs as candidate samples and compute the ranks of the prediction scores among the candidates.
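The k-fold division of known interactions described above can be sketched in plain Python. The function name `k_fold_splits`, the seed and the toy list of pairs are illustrative assumptions; the authors' own partitioning code may differ.

```python
import random

def k_fold_splits(positive_pairs, k, seed=0):
    """Yield (train_pairs, test_pairs) for each of the k folds."""
    pairs = list(positive_pairs)
    random.Random(seed).shuffle(pairs)            # random division of samples
    folds = [pairs[i::k] for i in range(k)]       # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

# Hypothetical known interactions as (lncRNA index, miRNA index) pairs.
known = [(l, m) for l in range(5) for m in range(2)]
splits = list(k_fold_splits(known, k=5))

for train, test in splits:
    print(len(train), len(test))
```

Repeating the procedure with different seeds gives the repeated random partitions over which a mean AUC and its standard deviation can be computed.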

TABLE 1 | Prediction performance w.r.t. AUC in 2-fold, 5-fold and 10-fold cross validation.

TABLE 2 | Performance comparison among different methods by using RNA expression profile-based similarity in the framework of 5-fold cross validation.

We consider testing samples ranked higher than a given threshold as positive. By varying the threshold, we compute the corresponding true positive rates (TPRs, sensitivity) and false positive rates (FPRs, 1-specificity). Specifically, given a threshold, sensitivity denotes the percentage of testing samples with higher ranks and specificity is the percentage of testing samples with lower ranks. Based on the TPRs and FPRs, the receiver operating characteristic (ROC) curve is plotted and the area under the curve (AUC) is computed as the main evaluation criterion. For a useful predictor, the AUC lies between 0.5 and 1, where 0.5 corresponds to a purely random guess and 1 denotes a perfect prediction. As the known lncRNA-miRNA interactions take turns to be used as testing samples and are assumed to be unknown in the prediction process, a generally high rank among the unlabeled samples means that the prediction performance is good and the prediction model is feasible. In addition, as the division of sample sets is random, we repeat the GCLMI experiment 20 times with different sample divisions to avoid the bias caused by any single partition. The standard deviation is also calculated for each cross validation. As a result, running GCLMI on the collected dataset, we obtain good prediction performance, with average AUCs of 0.8492+/−0.0013, 0.8567+/−0.0009 and 0.8590+/−0.0005 in 2-fold, 5-fold and 10-fold cross validation, respectively (see **Figure 6**). As shown in **Table 1**, increasing the number of folds in cross validation boosts the performance of GCLMI, because more data in the training set benefit the prediction. In this view, we anticipate that the GCLMI model will yield more reliable results as more ground-truth input data become available in the future. The high AUCs illustrate the reliable performance of the model in predicting lncRNA-miRNA interactions on a large scale.
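The rank-based AUC described in this subsection can be computed without plotting the ROC curve, because the area under the ROC curve equals the fraction of (testing sample, candidate) pairs in which the testing sample receives the higher score. The sketch below uses made-up scores and a hypothetical function name.

```python
def rank_auc(test_scores, candidate_scores):
    """Fraction of (test, candidate) pairs won by the test sample; ties count 0.5."""
    wins = 0.0
    for t in test_scores:
        for c in candidate_scores:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5
    return wins / (len(test_scores) * len(candidate_scores))

# Hypothetical prediction scores.
test_scores = [0.9, 0.8, 0.4]             # held-out known interactions
candidate_scores = [0.7, 0.5, 0.3, 0.2]   # unlabeled candidate pairs

auc = rank_auc(test_scores, candidate_scores)
print(round(auc, 4))
```

The double loop is quadratic in the number of samples; for the roughly 200,000 candidates used in this work, a sort-based computation of the same quantity would be preferable.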

# Performance Comparison With Other Similarity-Based Methods

Current approaches to predicting new links on biological bipartite networks are mainly based on similarity assumptions (Sun et al., 2018). Given a network involving two types of nodes that represent two kinds of research objects, most previous prediction models assume that similar objects of one type tend to be associated with similar objects of the other type (Huang et al., 2016b). Therefore, their prediction performance can be greatly influenced by the measure they adopt to calculate the similarity scores among objects of the same type (Huang et al., 2017b). For example, the KATZHMDA model calculates the similarity of microbes using the Gaussian kernel, and the EPLMI model uses the Pearson correlation coefficient for the similarity of lncRNAs and miRNAs based on their expression profiles (Huang et al., 2016a; Huang et al., 2018). However, such a linear computation may be so simple that it fails to describe the general similarity of lncRNAs or miRNAs with regard to their roles in the regulation network. To bypass this barrier, we propose an end-to-end prediction model using graph convolution, so that the prediction does not require any similarity calculation.

To further evaluate the prediction performance, several similarity-based methods are run on the same dataset, using the same similarity matrices of lncRNA and miRNA based on Pearson correlation coefficients of expression profiles. The comparison methods include two types of neighbor-based collaborative filtering (i.e. lncRNA-based CF and miRNA-based CF), matrix factorization-based methods (i.e. SVD-based CF and the basic latent factor model) and EPLMI. Using 5-fold cross validation on the same dataset, the comparison shows that the proposed model has the best prediction ability compared with the five comparison methods, with the highest AUC of 0.8567+/−0.0009 (see **Table 2**). We attribute this superior link prediction performance to the end-to-end learning approach of GCLMI. It is anticipated that such an end-to-end prediction model would yield more accurate predictions with a larger amount of high-dimensional input data in the future.
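For readers unfamiliar with neighbor-based collaborative filtering, the sketch below shows one common formulation of the lncRNA-based variant: the score of a pair is the similarity-weighted average of the interactions of the other lncRNAs with the same miRNA. The exact formulation used in the comparison experiments is not given in the text, so the function, the normalisation and the toy matrices are assumptions.

```python
import numpy as np

def lncrna_based_cf(M, S_l):
    """Score pair (i, j) by the similarity-weighted interactions of other lncRNAs."""
    S = S_l.copy()
    np.fill_diagonal(S, 0.0)                      # a lncRNA is not its own neighbor
    weights = np.abs(S).sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0                   # avoid division by zero
    return (S @ M) / weights

# Toy inputs (hypothetical): 3 lncRNAs, 2 miRNAs.
M = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=float)               # known interactions
S_l = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])                 # lncRNA-lncRNA similarity

scores = lncrna_based_cf(M, S_l)
print(np.round(scores, 3))
```

The miRNA-based variant is obtained in the same way by applying the function to the transposed interaction matrix with a miRNA similarity matrix. Both variants depend entirely on the quality of the similarity matrix, which is the dependence GCLMI is designed to avoid.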

# CONCLUSION

Increasing evidence shows that lncRNA and miRNA collaborate to form a regulation network for gene regulation. Interactions between lncRNA and miRNA thus provide great insight into the molecular mechanisms underlying the initiation and development of various complex diseases. However, little effort has been made to develop computational approaches to predict lncRNA-miRNA interactions on a large scale. The main challenges come from the small number of known interactions between lncRNA and miRNA (i.e. the sparsity of the lncRNA-miRNA interaction network) and the limited understanding of the underlying patterns of lncRNA-miRNA interaction.

To address this issue, we proposed a deep learning-based prediction model named GCLMI, which can effectively predict large-scale lncRNA-miRNA interactions. Given raw data as RNA attribute features, the GCLMI model is able to extract meaningful embedding features for both miRNA and lncRNA in an end-to-end training manner. The results of a series of experiments show that the low-dimensional embeddings learned by the proposed model represent well the relations of the RNAs on the interaction network. Owing to its deep learning structure, we anticipate that the proposed model could be a useful tool for accurate prediction of large-scale lncRNA-miRNA interactions when additional information describing the features of lncRNA and miRNA is provided by users. In the current version of GCLMI, other types of data relevant to the intrinsic features of lncRNA and miRNA, such as ncRNA sequence information and structural data, cannot yet be handled, as the graph convolution operator needs numerical data as inputs. In the future, we will investigate solutions to this limitation.

# AUTHOR CONTRIBUTIONS

Y-AH and Z-AH conceived the algorithm, developed the program, and wrote the manuscript. Z-HY and ZZ helped with manuscript editing, designed and performed experiments. Y-AH and Z-AH prepared the data sets, carried out analyses and helped with program design. All authors read and approved the final manuscript.

# FUNDING

Y-AH was supported by the National Natural Science Foundation of China under Grant No. 61702424 and the Natural Science Basic Research Plan in Shaanxi Province under Grant No. 2018JQ6015. Z-HY was supported by the National Natural Science Foundation of China under Grant No. 61572506.

# REFERENCES


and messenger RNAs in human breast cancer. *BMC Syst. Biol.* 8, 83. doi: 10.1186/1752-05090-8-83


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Huang, Huang, You, Zhu, Huang, Guo and Yu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Purification and Identification of miRNA Target Sites in Genome Using DNA Affinity Precipitation

*Yu Xun<sup>1</sup>, Yingxin Tang<sup>1</sup>, Linmin Hu<sup>1</sup>, Hui Xiao<sup>1</sup>, Shengwen Long<sup>1</sup>, Mengting Gong<sup>1</sup>, Chenxi Wei<sup>1</sup>, Ke Wei<sup>1,2\*</sup> and Shuanglin Xiang<sup>1\*</sup>*

*<sup>1</sup> Key Laboratory of Protein Chemistry and Developmental Biology of Education Ministry of China, College of Life Science, Hunan Normal University, Changsha, China, <sup>2</sup> Medical School, Hunan University of Chinese Medicine, Changsha, China*

Binding to genomic DNA is one of the important ways in which microRNAs (miRNAs) exert their biological functions. However, because of the lack of a suitable experimental method, the genomic sites known to be targeted by miRNAs have been located only in promoter and enhancer regions. In this study, based on affinity purification of miRNAs labeled with biotin at the 3′-end, we established an efficient experimental method to screen miRNA binding sequences across the whole genome *in vivo*. Biotinylated miR-373 was used to test our approach in MCF-7 cells, and Sanger and next-generation sequencing were then used to screen miR-373 binding sequences. Our results demonstrated that the genomic fragments precipitated by miR-373 were located not only in promoters but also in introns, exons, and intergenic regions. Eleven potential miR-373 target genes were selected for further study, and all of these genes were significantly regulated by miR-373. Furthermore, the targeting sequences located in the E-cadherin, cold-shock domain-containing protein C2 (CSDC2), and PDE4D genes could interact with miR-373 in MCF-7 cells but not in HeLa cells, which is consistent with our data that these three genes can be regulated by miR-373 in MCF-7 cells but not in HeLa cells. On the whole, this is an efficient method to identify miRNA targeting sequences in the whole genome.

Keywords: miRNA, target sites, genome, DNA, affinity precipitation

# INTRODUCTION

MicroRNAs (miRNAs) are a class of endogenous small non-coding RNAs that are processed from pre-miRNAs by Dicer into 21- to 25-nt double-stranded sequences (Bartel, 2004; He and Hannon, 2004). By regulating gene expression at the post-transcriptional level, miRNAs take part in many biological processes including development, cell proliferation, apoptosis, organogenesis, and tumorigenesis (Carrington and Ambros, 2003; Bartel, 2004; Filipowicz et al., 2008). It has been clearly shown that miRNAs regulate gene expression at the post-transcriptional level *via* the RNA-induced silencing complex (RISC) pathway in the cytoplasm (Liu et al., 2005a; Dong et al., 2013). However, with the development of new techniques, numerous miRNAs were found to be enriched in the nucleus,

#### *Edited by:*

*Yun Zheng, Kunming University of Science and Technology, China*

#### *Reviewed by:*

*Xiufeng Zhang, University of California, Riverside, United States Y-h. Taguchi, Chuo University, Japan*

#### *\*Correspondence:*

*Ke Wei 004343@hnucm.edu.cn Shuanglin Xiang xshlin@hunnu.edu.cn*

#### *Specialty section:*

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

*Received: 24 February 2019 Accepted: 23 July 2019 Published: 12 September 2019*

#### *Citation:*

*Xun Y, Tang Y, Hu L, Xiao H, Long S, Gong M, Wei C, Wei K and Xiang S (2019) Purification and Identification of miRNA Target Sites in Genome Using DNA Affinity Precipitation. Front. Genet. 10:778. doi: 10.3389/fgene.2019.00778*


**Abbreviations:** CSDC2, cold-shock domain-containing C2; ITSN2, intersectin 2; Hsp60, heat shock protein family D (Hsp60) member 1; ALOX5, arachidonate 5-lipoxygenase; ARID2, AT-rich interaction domain 2; PDE4D, phosphodiesterase 4D; SUN1, Sad1 and UNC84 domain-containing 1; ZNF76, zinc finger protein 76; ZNF385B, zinc finger protein 385B; TTC34, tetratricopeptide repeat domain 34; TPM1, tropomyosin 1; EVI5, ecotropic viral integration site 5; RPL37, ribosomal protein L37; FANCC, FA complementation group C.

which suggests that miRNAs play important roles in the nucleus (Liu et al., 2018). Several studies have shown that miRNAs can regulate gene expression *via* interaction with genomic sequences. In 2008, Place et al. reported that miR-373 can up-regulate cold-shock domain-containing protein C2 (CSDC2) and E-cadherin *via* sequence complementarity with the promoters of these genes. MiR-223 can bind to the promoter of NF1A and down-regulate its expression (Place et al., 2008).

Based on the established rules of miRNA–mRNA interaction, some software tools for predicting miRNA binding sites in the genome have been developed. However, it is hard to accurately predict miRNA target sites in the genome, because the mechanism by which miRNAs regulate genes *via* binding to genomic DNA remains to be elucidated. First, the location of miRNA binding sites in the genome needs further study. Janowski et al. found that small dsRNAs that are completely complementary to the sequence in the region −56 to +17 of the promoter can up-regulate gene expression (Janowski et al., 2007). Meng et al. then reported that the siRNA binding position can be located around −1611 relative to the transcription start site (Meng et al., 2016). Moreover, it was also reported that miRNAs can bind in enhancer regions and increase the transcriptional activity of neighboring genes (Xiao et al., 2017). Second, the mechanism of interaction between the genome and miRNAs has not been fully elucidated. Some papers suggested that nucleotides 2–8 from the 5′-end of the antisense strand are the key to transcriptional activation (Xiao et al., 2017). However, it was also reported that let-7i can interact with the TATA-box motif of the interleukin (IL)-2 promoter because of a low minimum free energy (MFE) value (−27.6 kcal/mol), although the "seed region" of let-7i is not completely complementary to the IL-2 promoter, which suggests that complementarity of the miRNA 5′-end to the target sequence is not the only principle for miRNA target prediction (Zhang et al., 2014). Finally, bioinformatic prediction is insufficient to reflect the real conditions *in vivo*, because epigenetic modification of the genome may affect the interaction of a miRNA with its target site (Liu et al., 2018).
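To make the seed-based rule discussed above concrete, the following minimal Python sketch scans one DNA strand for sites complementary to nucleotides 2–8 of a miRNA. It is an illustration only and not the procedure of any prediction tool cited here: the function names and both example sequences are hypothetical, and only Watson–Crick pairing is considered (no G·U wobble and no free-energy term).

```python
# Minimal sketch: locate seed-complementary sites on one DNA strand.
# All names and sequences are illustrative assumptions, not data from this study.
from typing import List

COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}  # RNA base -> pairing DNA base


def seed_target_site(mirna: str, start: int = 2, end: int = 8) -> str:
    """Return the DNA site (written 5'->3') that pairs with miRNA nucleotides start..end."""
    seed = mirna.upper()[start - 1:end]  # 1-based, inclusive positions
    # Pairing is antiparallel, so walk the seed backwards and complement each base.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_sites(mirna: str, dna: str) -> List[int]:
    """Return 0-based start positions of perfect seed matches on the given strand."""
    site = seed_target_site(mirna)
    dna = dna.upper()
    return [i for i in range(len(dna) - len(site) + 1) if dna[i:i + len(site)] == site]


TOY_MIRNA = "UAGCAUCGGAUCCAGUUACGGA"   # hypothetical 22-nt miRNA
TOY_DNA = "GGTTCCGATGCTTACGCCGATGCAA"  # hypothetical promoter fragment

if __name__ == "__main__":
    print("seed-complementary site:", seed_target_site(TOY_MIRNA))
    print("match positions:", find_seed_sites(TOY_MIRNA, TOY_DNA))
```

A perfect seed match found this way is neither necessary nor sufficient for binding, as the let-7i example shows; a thermodynamic score would have to be added to capture such cases.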

Our recent study reported a convenient experimental approach for the isolation and identification of miRNAs that bind to a messenger RNA, in which a mix of short biotinylated DNA antisense oligonucleotides was applied to enhanced green fluorescent protein (EGFP) mRNA fused to the target gene mRNA (Wei et al., 2014). We wondered whether this affinity assay could be used to screen miRNA binding sequences in genomic regions *via* a biotinylated miRNA of interest. In the present study, based on the biotinylated miRNA capture affinity technique, we developed an experimental procedure for searching for miRNA targeting sequences in promoters and even across whole genomic regions (**Figure 1**). MiR-373 was used to test our method in MCF-7 cells. First, we showed that biotinylated miR-373, like unlabeled miR-373, can up-regulate the expression of E-cadherin, which has been reported to be up-regulated by miR-373 *via* targeting of its promoter. Then, using the method described in this paper, we collected DNA fragments precipitated by biotinylated miR-373 or negative control RNA. Semi-quantitative polymerase chain reaction (PCR) and real-time PCR showed that the E-cadherin and CSDC2 promoters, which contain the previously reported miR-373 binding sites, can be pulled down by biotinylated miR-373 but not by negative control RNA, which suggests that our approach is feasible. Next, to find unknown miR-373 binding sequences, the DNA fragments were inserted into pGEM-T vectors (Promega) and sequenced. Ten unreported miR-373 binding sequences were identified. Interestingly, six of the identified sequences were located in introns and two in intergenic regions; only the remaining two were in gene promoters. Western blot and real-time PCR demonstrated that six of seven identified genes can be up-regulated by miR-373 in MCF-7 cells.
Interestingly, our results showed that miR-373 cannot increase the expression of E-cadherin, CSDC2, and PDE4D in HeLa cells, which is consistent with our data that the miR-373 targeting sequences of these genes cannot be precipitated by miR-373 in HeLa cells. Finally, to screen miRNA-targeted genomic sequences efficiently, next-generation sequencing was used to analyze the samples precipitated by miR-373, and numerous miR-373 targeting sites were sequenced. On the whole, we developed an efficient approach to screen miRNA-targeted genomic sequences and provided a new perspective for studying the interaction between miRNAs and the genome.

# MATERIALS AND METHODS

# Materials

The biotinylated miR-373 and biotinylated negative control miRNA (biotinylated NC miRNA) were synthesized by TaKaRa Biotechnology (Dalian, China) with biotin labeling at the 3′-end of the miRNA (**Table 1**). pGEM-T vectors were purchased from Promega (Madison, USA). ARID2, SUN1, E-cadherin, and ZNF76 antibodies were purchased from ABclonal, Inc. (Wuhan, China).

# Cell Culture and miRNA Transfection

MCF-7 and HeLa cells were purchased from the Cell Bank of the Chinese Academy of Sciences (Shanghai, China) and cultured in Dulbecco's modified Eagle's medium (DMEM) (Gibco-BRL, Carlsbad, USA) supplemented with glutamine, antibiotics, and 10% fetal bovine serum (Gibco-BRL, Carlsbad, USA) in a humidified atmosphere of 5% CO2 at 37°C. Plasmid DNA or miRNA was transfected into cells using Lipofectamine 2000 (Invitrogen, Carlsbad, CA) according to the manufacturer's instructions.

# Western Blot

Cells were harvested 24, 48, or 72 h post-transfection. Then cells were lysed in radioimmunoprecipitation assay (RIPA) buffer [150 mM NaCl, 1-M Tris-HCl (pH 7.2), 1% (v/v) Triton X-100, 1% (w/v) sodium deoxycholate, 0.1% (w/v) sodium dodecyl sulfate (SDS)] with protease inhibitors. Proteins were separated on 10% or 15% SDS–polyacrylamide gel and transferred to poly(vinylidene difluoride) (PVDF) membranes. The resulting blots were blocked with 5% non-fat dry milk, and specific proteins were detected with appropriate antibodies. The proteins were detected using horseradish peroxidase (HRP)-conjugated secondary antibody and Super Signal West Pico Chemiluminescent substrate kits (Pierce).

# DNA–miRNA Pull-Down Assay

The procedures used for affinity purification of biotinylated miRNAs partly followed those previously described by Tidi Hassan and colleagues (Liu et al., 2005b). Cells were transfected with biotinylated miRNA or biotinylated NC miRNA for 24 h. Then 37% formaldehyde was added to the cells to a final concentration of 1%, and the cells were incubated at room temperature for 15 min for cross-linking. The cross-linking reaction was stopped by the addition of 100-mM glycine. Next, cells were collected and lysed in lysis buffer containing 1% SDS, 1-mM EDTA, 50-mM HEPES (pH 7.5), 140-mM NaCl, and 1% Triton X-100, supplemented with 100× protease inhibitor (Boehringer cocktail) and 1-U/μl RNase inhibitor (Invitrogen). The genomic DNA was sheared with a sonicator. This step should be performed on ice to avoid denaturation of chromatin and miRNA. The supernatants were recovered by centrifugation at 12,000*g* for 10 min and incubated with equilibrated streptavidin beads for 1 h at room temperature. The streptavidin beads were washed four times with washing buffer, which contains 10-mM Tris-HCl (pH 7.5), 1-mM EDTA, 0.15-mM LiCl, and 10-mM Tris-HCl. Proteinase K (Roche Applied Science) and RNase A (Roche Applied Science) were used to degrade protein and RNA. The DNA was then separated from the streptavidin beads by heating the beads at 80°C for 5 min. The eluted DNA was recovered using a Chromatin Immunoprecipitation (ChIP) Kit (Millipore, USA) according to the manufacturer's instructions.

TABLE 1 | miRNA sequences and real-time quantitative PCR primers.
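The cross-linking step of the pull-down protocol above specifies a 37% formaldehyde stock and a 1% final concentration. The short Python sketch below works through that dilution; the function name and the 10-ml culture volume are illustrative assumptions, not values given in the protocol.

```python
# Minimal sketch of the dilution arithmetic behind "37% stock, 1% final".
# The 10-ml culture volume is an illustrative assumption, not a protocol value.


def stock_volume_to_add(stock_pct: float, final_pct: float, medium_ml: float) -> float:
    """Volume of stock (ml) to add to medium_ml so that the mixture reaches final_pct.

    Solves stock_pct * v = final_pct * (medium_ml + v) for v.
    """
    if not 0 < final_pct < stock_pct:
        raise ValueError("final concentration must be positive and below the stock concentration")
    return final_pct * medium_ml / (stock_pct - final_pct)


if __name__ == "__main__":
    volume_ml = stock_volume_to_add(stock_pct=37.0, final_pct=1.0, medium_ml=10.0)
    print(f"Add about {volume_ml * 1000:.0f} µl of 37% formaldehyde to 10 ml of medium")
```

The formula accounts for the added stock volume itself, so it differs slightly from the common shortcut of dividing the medium volume by 37.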

# Illumina HiSeq 2000 Next-Generation Sequencing and Bioinformatics Analysis

The PCR products were fragmented to an average length of 150 bp. After DNA-end repair, addition of a 3′-dA overhang, and ligation of methylated sequencing adapters, the DNA samples were sent to Beijing Genomics Institute (BGI, China) for sequencing on an Illumina Genome Analyzer. The bioinformatics analysis steps for the ChIP-Seq libraries are as follows. First, the original image data were converted into sequence data *via* base calling; these data are defined as raw data or raw reads and were saved as FASTQ files. Second, quality control was performed to check whether the data were qualified, and the raw data were filtered to decrease noise: "dirty" raw reads that contained adapter sequence, more than 10% unknown bases, or low-quality bases were removed in this step. Third, the clean reads were mapped to the *Homo sapiens* reference genome, and only uniquely mapped reads aligned with no more than two mismatches were considered in further analyses. Genome-wide peak scanning was then performed in the UCSC Genome Browser to obtain the location and sequence of each peak. Peaks were classified according to their location (UCSC annotation data) into the following genomic regions: intergenic, introns, downstream, upstream, and exons. Furthermore, after peak scanning, all genes related to the miR-373 or NC RNA peaks could be listed. Last, to predict the potential functions of the putative miRNA targets in different cellular components, biological processes, and molecular functions, we used gene ontology (GO) categories (http://www.geneontology.org/) to classify the identified target genes. In addition, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (ftp://ftp.genome.jp/pub/kegg/pathway/) was used for KEGG pathway analyses. We also submitted the HiSeq 2000 next-generation sequencing data to the National Center for Biotechnology Information (NCBI).
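The read-filtering step described above can be sketched in a few lines of Python. This is not the pipeline that was actually run: the adapter sequence, the Phred+33 encoding, and the rule used for "low-quality" reads (more than half of the bases below Phred 20) are assumptions made for illustration, and only the 10% unknown-base cutoff is taken from the text.

```python
# Minimal sketch of a "dirty read" filter for FASTQ records.
# ADAPTER, MIN_PHRED and MAX_LOW_QUAL_FRACTION are illustrative assumptions;
# only the 10% unknown-base cutoff comes from the Methods text.

ADAPTER = "AGATCGGAAGAGC"      # assumed adapter prefix
MAX_N_FRACTION = 0.10          # more than 10% unknown bases -> discard
MIN_PHRED = 20                 # assumed per-base quality threshold
MAX_LOW_QUAL_FRACTION = 0.5    # assumed tolerated fraction of low-quality bases


def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def is_clean(seq: str, quality: str) -> bool:
    """Return True if a read passes the adapter, unknown-base and quality filters."""
    if not seq or len(seq) != len(quality):
        return False
    if ADAPTER in seq.upper():
        return False
    if seq.upper().count("N") / len(seq) > MAX_N_FRACTION:
        return False
    low = sum(1 for q in phred_scores(quality) if q < MIN_PHRED)
    return low / len(seq) <= MAX_LOW_QUAL_FRACTION


def filter_fastq(lines):
    """Yield (header, sequence, quality) for clean reads from an iterable of FASTQ lines."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):   # four lines per FASTQ record
        header, seq, _plus, qual = records[i:i + 4]
        if is_clean(seq, qual):
            yield header, seq, qual


# Hypothetical reads: one clean, one N-rich, one low-quality, one with adapter.
TOY_FASTQ = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read2", "ACNNACGTAC", "+", "IIIIIIIIII",
    "@read3", "ACGTACGTAC", "+", "!!!!!!IIII",
    "@read4", "AGATCGGAAGAGCACGT", "+", "I" * 17,
]

if __name__ == "__main__":
    for header, seq, _qual in filter_fastq(TOY_FASTQ):
        print(header, seq)
```

In practice a dedicated trimming tool would usually clip adapters rather than discard the whole read; discarding is used here only because the text describes removal of "dirty" reads.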

# Validation of miRNA Targets *via* qRT–PCR

Total RNA was isolated from cells transfected with synthetic miRNAs using TRIzol reagent (TaKaRa) according to the manufacturer's instructions. For quantification of mRNA, 1 µg of total RNA was reverse transcribed using the Reverse Transcription System (Promega, Madison, USA). The resulting cDNA was used as the template for semi-quantitative PCR or quantitative real-time PCR. β-Actin served as an endogenous control to normalize the expression data. Each sample was analyzed in triplicate. Relative expression and standard error were calculated by the supplied ABI 7900HT Real-Time System software. All primers used in the qRT–PCR experiments are listed in **Table 1**.
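The text states only that relative expression was calculated by the instrument software with β-actin as the endogenous control. As a hedged illustration of what such a normalization typically involves, the sketch below assumes the widely used 2^(−ΔΔCt) method; the function name and all Ct values are hypothetical and are not data from this study.

```python
# Minimal sketch of relative quantification, assuming the 2^(-ddCt) method.
# The Ct values below are invented for illustration only.
from statistics import mean


def fold_change(target_treated, ref_treated, target_control, ref_control):
    """Relative expression (treated vs. control) from replicate Ct values.

    Each argument is a list of replicate Ct values, e.g. a triplicate.
    The reference gene (here assumed to be beta-actin) normalizes each sample.
    """
    delta_treated = mean(target_treated) - mean(ref_treated)
    delta_control = mean(target_control) - mean(ref_control)
    return 2.0 ** -(delta_treated - delta_control)


if __name__ == "__main__":
    fc = fold_change(
        target_treated=[24.0, 24.1, 23.9],   # target gene, miRNA-transfected cells
        ref_treated=[18.0, 18.1, 17.9],      # reference gene, miRNA-transfected cells
        target_control=[26.0, 26.1, 25.9],   # target gene, NC-transfected cells
        ref_control=[18.0, 18.1, 17.9],      # reference gene, NC-transfected cells
    )
    print(f"fold change = {fc:.2f}")
```

The method assumes roughly equal amplification efficiency for the target and reference amplicons; if that does not hold, an efficiency-corrected model would be needed.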

# Statistical Analysis

Data were expressed as means ± SD from three to four independent experiments. Data were analyzed using Student's *t* test for two groups or analysis of variance (ANOVA) with Tukey–Kramer tests for multiple group comparisons. *P* < 0.05 was considered statistically significant.
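For the two-group comparisons, the calculation can be sketched as follows. This is a minimal illustration of the pooled-variance Student's *t* statistic using only the Python standard library; the function name and the measurements are hypothetical, significance is judged against the tabulated two-sided 5% critical value for 4 degrees of freedom (2.776) rather than by computing an exact *P* value, and the ANOVA with Tukey–Kramer tests is not covered.

```python
# Minimal sketch of a two-sample Student's t test with pooled variance.
# The data are invented; the critical value applies only to df = 4 (n = 3 per group).
from math import sqrt
from statistics import mean, stdev, variance

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t for 4 degrees of freedom


def student_t(group_a, group_b):
    """Return (t, df) for two independent groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


if __name__ == "__main__":
    mir_373 = [4.1, 4.5, 3.9]   # hypothetical relative expression, miR-373 group
    nc = [1.0, 1.1, 0.9]        # hypothetical relative expression, NC group
    print(f"miR-373: {mean(mir_373):.2f} ± {stdev(mir_373):.2f} (mean ± SD)")
    print(f"NC:      {mean(nc):.2f} ± {stdev(nc):.2f} (mean ± SD)")
    t_value, df = student_t(mir_373, nc)
    print(f"t = {t_value:.2f}, df = {df}, significant at 5%: {abs(t_value) > T_CRIT_DF4}")
```

The function would divide by zero if both groups had zero variance, so real use would need a guard for that case.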

# RESULTS

# E-Cadherin Is Up-Regulated by Both miR-373 and Biotinylated miR-373 in MCF-7

Previous reports described that miR-373 can increase the expression of E-cadherin *via* targeting of its promoter in PC-3 cells (**Figure 2A**), while it has no impact on E-cadherin expression in HCT-116 and LNCaP cells (Place et al., 2008). To determine whether miR-373 and biotinylated miR-373 could regulate E-cadherin in MCF-7 cells, biotinylated miR-373, non-biotinylated miR-373, or NC miRNA was transfected into MCF-7 cells for 48 h. Both semi-quantitative PCR and real-time PCR showed that the E-cadherin mRNA level increased more than fourfold after transfection with miR-373 or biotinylated miR-373 compared with NC miRNA (**Figures 2B**, **C**). We also confirmed that both miR-373 and biotinylated miR-373 can up-regulate E-cadherin protein levels in MCF-7 cells (**Figure 2D**). These data demonstrated that miR-373 increases the expression of E-cadherin in MCF-7 cells. Furthermore, labeling the 3′-end of miR-373 with biotin did not significantly affect the ability of miR-373 to regulate E-cadherin.

# The Promoters of E-Cadherin and CSDC2 Can be Pulled Down by Biotinylated miR-373 in MCF-7

Having confirmed that biotinylated miR-373 could increase the expression of E-cadherin in MCF-7 cells, we used the miR-373 targeting sequence in the E-cadherin promoter as a positive control to test whether our method worked. Biotinylated miR-373 or biotinylated NC miRNA was transfected into MCF-7 cells. At 24 h post-transfection, miR-373 targeting sequences were isolated through the DNA–miRNA pull-down assay described in the Materials and Methods section. Semi-quantitative PCR and quantitative RT–PCR were then performed to detect the enrichment of the E-cadherin and CSDC2 promoters. As shown in **Figures 3A**, **B**, both the E-cadherin and CSDC2 promoters could be amplified by semi-quantitative PCR from the sample transfected with biotinylated miR-373 but not from the sample transfected with biotinylated NC miRNA. The GAPDH promoter, which does not contain a potential miR-373 target site, gave no detectable signal when the sample transfected with biotinylated miR-373 was used. As shown in **Figure 3C**, the quantitative RT–PCR results were consistent with those of semi-quantitative PCR. The amounts of the E-cadherin and CSDC2 promoters in the sample transfected with biotinylated miR-373 were more than 10 times those in the sample transfected with biotinylated NC miRNA, whereas the amounts of the actin, GAPDH, ITSN2, and Hsp60 promoters, which were regarded as negative controls, were almost the same in the two samples. These results suggested that our method can be used to enrich miR-373 binding DNA sequences.

# Identifying the Potential miR-373 Targeting DNA Sequences in Purified DNA Products

To identify unknown miR-373 targeting DNA sequences, we inserted the precipitated DNA into pGEM-T vectors (Promega, USA) and sequenced the vectors using a primer that binds the T7 promoter. The specific procedure is shown in **Figure 1**. First, the Quick Blunting Kit (NEB, USA) was used to convert DNA with incompatible 5′ or 3′ overhangs to blunt-ended DNA, because the bio-miR-373-precipitated DNA fragments sheared by sonication include fragments with incompatible 5′ or 3′ overhangs, which are hard to insert into pGEM-T vectors. Second, we added an adenine to the 3′-end of the blunt-ended DNA using Ex Taq DNA Polymerase. Third, to improve the efficiency of inserting DNA fragments into pGEM-T vectors, the A-tailed fragments were concentrated with 20% PEG-8000 and purified with 75% alcohol. Fourth, the purified DNA fragments were inserted into pGEM-T vectors (Promega, USA). Fifth, the vectors were transformed into Top10 cells (Invitrogen, USA), and blue-white screening was used to select positive clones. Over 40 clones were sequenced, and 10 sequences were identified. Last, the sequences were analyzed *via* the UCSC Genome Browser and NCBI Map Viewer. As shown in **Table 2**, six of the identified sequences are located in the introns of ZNF76, PDE4D, ALOX5, KIAA1958, and ZNF385B. Two sequences are in intergenic regions, and two are in the promoters of ARID2 and SUN1.

# The Regulation of the Identified Genes by miR-373

The identified miR-373 targeting sequences are located not only in promoters but also in introns. We wondered whether miR-373 could regulate these genes *via* direct binding to their promoters or introns. First, quantitative RT–PCR and Western blotting were performed to detect the regulation of the identified genes by miR-373. As shown in **Figure 4A**, the mRNA levels were not significantly changed at 24 h post-transfection with miR-373. At 48 h, miR-373 increased the mRNAs of ALOX5, ARID2, CSDC2, KIAA1958, PDE4D, SUN1, and ZNF385B by only 30% to 80%, whereas it up-regulated the mRNAs of E-cadherin and ZNF76 more than fourfold. At 72 h, all genes except ZNF385B were up-regulated more than twofold at the mRNA level by miR-373. We then purchased ARID2, SUN1, and ZNF76 antibodies to detect the expression of these genes at the protein level. As shown in **Figure 4B**, miR-373 significantly increased the protein levels of these genes at 72 h. These data indicated that ARID2, SUN1, and ZNF76 can be clearly up-regulated by miR-373.

To investigate whether the regulation of the identified genes by miR-373 is a common phenomenon, RT–PCR was performed to measure the expression of the identified genes in HeLa cells. Interestingly, ALOX5, ARID2, KIAA1958, SUN1, ZNF76, and ZNF385B were up-regulated by miR-373, whereas CSDC2, E-cadherin, and PDE4D were not significantly increased after transfection with miR-373 (**Figure 5A**). We also found that miR-373 cannot bind to the targeting sequences in the CSDC2, E-cadherin, and PDE4D genes (**Figure 5B**). Some papers reported that some cell lines were resistant to transcriptional activation induced by specific miRNAs or dsRNAs while being sensitive to others. Our results provide evidence that the direct interaction between miRNAs and genomic sequences is key to miRNA-induced regulation of genes.

# Screening miR-373 Binding Sequences *via* the High-Throughput Next-Generation Sequencing Technology

We have successfully established a method to identify unknown miRNA targeting DNA sites, but only 10 sequences were identified in over 40 clones (data not shown). To improve the efficiency of screening for unknown miRNA target sequences, two genomic DNA fragment libraries were constructed and subjected to next-generation sequencing: one was constructed from the miR-373-precipitated sample from MCF-7 cells and named HM-7-DNA, and the other was constructed from the NC RNA-precipitated sample and named HM-7-DNA-NC. As shown in **Supplementary Figure 1**, the main peak of the HM-7-DNA sample was at 360 bp and that of the HM-7-DNA-NC sample was at 276 bp, so both samples were qualified and suitable for sequencing. Quality control was then used to assess the raw data obtained from Illumina HiSeq 2000 sequencing. As shown in **Supplementary Figures 2A, C**, both HM-7-DNA and HM-7-DNA-NC yielded good-quality sequences, because the base quality values were mostly higher than 20. The raw data also had a satisfactory base composition: the four bases A, T, G, and C were distributed uniformly along the reads, and the AT content exceeded the GC content (**Supplementary Figures 2B, D**). The raw data have been submitted to the SRA database under accession number PRJNA547356.

The location and sequence of each peak were identified by genome-wide peak scanning in the UCSC Genome Browser (**Supplementary Tables 1, 2**). We then analyzed the distribution of the sequences from HM-7-DNA. As shown in **Figure 6A**, 49.7% of the sequences are located in intergenic regions, 25.5% in introns, 11.7% in promoters (Up2k), 10.3% in exons, and 2.8% in Down2k regions. We also analyzed the chromosomal location of the miR-373 targeting sequences. The results showed that the candidate targets of miR-373 were mainly distributed on chromosomes 5, 9, 10, and 20 (**Figure 6B**).
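The region categories used above (Up2k, exon, intron, Down2k, intergenic) can be illustrated with a small Python classifier. It is a simplified sketch, not the annotation procedure that was actually used: the `Gene` record, the toy coordinates, and the restriction to plus-strand genes are assumptions made for illustration, with "2k" interpreted as 2,000 bp flanking the gene.

```python
# Minimal sketch: assign a peak summit to a genomic region category.
# Gene coordinates are invented, and only plus-strand genes are handled.
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gene:
    name: str
    start: int                      # first base of the gene (1-based, inclusive)
    end: int                        # last base of the gene (inclusive)
    exons: List[Tuple[int, int]]    # inclusive exon intervals


def classify_peak(summit: int, genes: List[Gene], flank: int = 2000) -> str:
    """Return 'exon', 'intron', 'up2k', 'down2k' or 'intergenic' for one summit."""
    for gene in genes:              # positions inside a gene body take priority
        if gene.start <= summit <= gene.end:
            in_exon = any(s <= summit <= e for s, e in gene.exons)
            return "exon" if in_exon else "intron"
    for gene in genes:              # then the 2-kb flanks (plus strand assumed)
        if gene.start - flank <= summit < gene.start:
            return "up2k"
        if gene.end < summit <= gene.end + flank:
            return "down2k"
    return "intergenic"


TOY_GENES = [Gene("toyGene", 10_000, 20_000,
                  [(10_000, 10_500), (15_000, 15_400), (19_500, 20_000)])]
TOY_SUMMITS = [7_000, 9_000, 10_200, 12_000, 21_000, 50_000]

if __name__ == "__main__":
    counts = Counter(classify_peak(s, TOY_GENES) for s in TOY_SUMMITS)
    for region, n in counts.items():
        print(f"{region}: {100 * n / len(TOY_SUMMITS):.1f}%")
```

Minus-strand genes would need the upstream and downstream flanks swapped, and a real annotation would use an interval index rather than a linear scan over all genes.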

MiRNAs can perform their biological functions *via* targeting genomic DNA and regulating gene expression, so pathway-based analysis of miR-373 target genes helps us to better understand the role of miR-373 in cells. On the one hand, GO analysis was performed to annotate the functions of the genes. **Figure 6C** shows the classification of the peak-related genes of HM-7-DNA based on the GO analysis. The biological process, cellular component, and molecular function ontologies included 17, 6, and 5 categories, respectively. On the other hand, based on KEGG analysis, we found that miR-373 target genes were associated with hypertrophic cardiomyopathy, dilated cardiomyopathy, tight junction, cardiac muscle contraction, and viral myocarditis (**Table 3**).

We also compared HM-7-DNA with HM-7-DNA-NC. As shown in **Figure 6D**, 1,966 genes containing miR-373 targeting sequences were found. Interestingly, 443 genes containing NC miRNA targeting sequences were also identified. It cannot be excluded that any designed NC miRNA can bind to certain DNA sequences, so NC miRNA binding sites may exist in genomic DNA. According to our results, 169 genes contain both miR-373 and NC miRNA target sites. These results suggest that the NC miRNA used in our paper is not suitable for studying the regulation of these 169 genes by miR-373, because the NC miRNA may also regulate these genes. Hence, to better study the genes identified by our method, it is necessary to use biotinylated NC miRNA as a negative control to show that the studied genes have no potential NC miRNA binding sites.
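The comparison of the two libraries described above amounts to simple set operations on the peak-related gene lists. The sketch below illustrates the logic with hypothetical placeholder gene names; the function name is an assumption, and the toy lists are not the gene lists from this study.

```python
# Minimal sketch: split peak-related genes into miRNA-only, shared and NC-only sets.
# Gene names are placeholders, not results from this study.


def partition_targets(mir_genes, nc_genes):
    """Return genes found only with the miRNA, with both baits, or only with the NC."""
    mir, nc = set(mir_genes), set(nc_genes)
    return {
        "mir_only": mir - nc,   # candidates without NC binding sites
        "shared": mir & nc,     # genes for which this NC is not a suitable control
        "nc_only": nc - mir,
    }


TOY_MIR_GENES = ["geneA", "geneB", "geneC", "geneD"]
TOY_NC_GENES = ["geneC", "geneE"]

if __name__ == "__main__":
    groups = partition_targets(TOY_MIR_GENES, TOY_NC_GENES)
    for label, genes in groups.items():
        print(label, sorted(genes))
```

Only the `mir_only` group would be carried forward for validation, which mirrors the selection of genes that contain miR-373 targeting sites but no NC miRNA sites.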

We therefore selected six sequences from the 1,827 genes that contain only miR-373 targeting sites for further study. These sequences are located in exons or introns (**Table 4**). Semi-quantitative PCR was performed to detect the enrichment of the six sequences in miR-373-precipitated DNA. As shown in **Figure 7A**, all six sequences could be pulled down by miR-373. Quantitative RT–PCR then demonstrated that the mRNAs of TTC34 and FANCC increased significantly at 48 and 72 h after transfection with miR-373. TPM1, KIAA1377, EVI5, and RPL37 were down-regulated by miR-373 at 24 h post-transfection but significantly up-regulated at 72 h (**Figure 7B**). We also randomly selected 35 potential miR-373 target genes and analyzed the changes in their expression 48 h after transfection with miR-373. As shown in **Table 5**, the expression of 21 genes changed more than twofold, and that of only four genes changed less than quarter-fold, after transfection with miR-373 compared with NC miRNA.

# DISCUSSION

Although it has been shown that binding to promoters is an important way for miRNAs to regulate gene expression, the mechanism of miRNA target recognition in the genome needs to be further elucidated. Similar to the miRNA–mRNA interaction model, some papers suggested that the "seed sequence" of the miRNA is the key to binding to the promoter (Xu et al., 2014; Xiao et al., 2017). However, Zhang et al. reported that let-7i can bind to the TATA-box motif in the IL-2 promoter even though the seed sequence of let-7i is not completely complementary to the IL-2 promoter (Zhang et al., 2014). Another type of prediction tool, such as RNAhybrid, evaluates the ability of a miRNA to interact with genomic sequences *via* measuring the thermodynamic stability of the complex between the miRNA and dsDNA or ssDNA (Rehmsmeier et al., 2004). Furthermore, Paugh et al. showed that miRNAs can form triplexes with dsDNA in the genome and regulate gene expression (Paugh et al., 2016). Because the underlying mechanism is poorly understood, computational prediction of miRNA target sites in the genome is still at an early stage. In this paper, based on a previously reported technique for purifying miRNA-targeted mRNAs (Hassan et al., 2013), we established an effective biochemical procedure to screen potential miRNA target genes *via* pulling down the genomic sequences that are directly bound by the miRNA. As a result, the putative target DNA sequences bound by biotinylated miRNAs can be easily isolated from cell extracts. These isolated DNA sequences can be analyzed through cloning and sequencing, and the potential target genes can then be found by bioinformatics analysis. As described in this article, we successfully identified the known target genes of miR-373 and also detected unreported target genes. Therefore, we demonstrated that the target genes of a miRNA that binds complementary DNA sequences can be obtained efficiently and directly from cultured cells through our biochemical procedure.

TABLE 3 | Significant enrichment analysis of target gene functions in pathways.

*Q ≤ 0.05 indicates a pathway that was significantly enriched among the peak-related genes.*

TABLE 4 | MiR-373 binding sequences identified by next-generation sequencing.

Another limitation of bioinformatic prediction is that it cannot reflect the real situation *in vivo*. When Li et al. studied the regulation of gene expression by dsRNAs *via* binding to promoters, they found that some specific dsRNAs can increase target gene expression in some cell lines but not in others (Li et al., 2006). It was also reported that E-cadherin expression was up-regulated by miR-373 in PC-3 and LNCaP cells but not in HCT-116 cells (Place et al., 2008). Meanwhile, our results demonstrated that the expression of E-cadherin, CSDC2, and PDE4D, which can be up-regulated by miR-373 in MCF-7 cells, was not significantly increased after transfection of miR-373 into HeLa cells. One of the factors affecting miRNA-mediated gene activation is the epigenetic state of the genome: the promoter of E-cadherin has been shown to be hypermethylated in HeLa cells, which prevented saRNA-induced E-cadherin up-regulation (Li et al., 2006). Furthermore, our results demonstrated that miR-373 can interact with the sequences located in the E-cadherin, CSDC2, and PDE4D genes in MCF-7 cells but not in HeLa cells, which indicated that the direct interaction between miRNAs and their targeting sequences is key to the regulation of target genes. On the whole, our method can measure the direct interaction between miRNAs and genomic DNA, which avoids false positives caused by ignoring modification of the genome.

A noteworthy finding of the present study is that some genomic fragments precipitated by miR-373 were located in introns. The roles of miRNAs in introns have not been widely studied. Meng et al. reported that some miRNA binding sites are located in introns in plants (Meng et al., 2013). It has also been reported that siRNAs targeting intronic sequences near alternative exons regulate mRNA splicing and that Ago1 is essential for RNAi-mediated alternative splicing (Allo et al., 2009). In addition, some miRNAs and long noncoding RNAs are transcribed from introns and share promoters with their host genes (Bosia et al., 2012; Kung et al., 2013; Chamorro-Jorganes et al., 2014; Ramalingam et al., 2014), so miRNAs (e.g., miR-373) that target introns may play a role in regulating the miRNAs and long noncoding RNAs located in those introns. Our results demonstrated that miR-373 can interact with sequences located in introns and that ZNF76, PDE4D, ALOX5, KIAA1958, ZNF385B, TTC34, EVI5, and FANCC, which contain miR-373 binding sites in introns, can be regulated by miR-373. Thus, there is an interaction between miR-373 and intronic sequences that might affect gene expression. Taken together, the results suggest that the interaction between miRNAs and introns may have biological functions in cells, although we have not provided direct evidence that miRNAs regulate gene expression *via* binding to introns. The ZNF76 mRNA level was dramatically increased by miR-373, and further study should focus on the mechanism of this regulation. We will investigate whether the miR-373 binding sequence in the ZNF76 intron is key to the regulation of ZNF76 expression by knocking out this sequence.

Although we successfully identified several miR-373 binding sequences located in promoters, exons, or introns, some purified DNA fragments were located in genomic regions far away from any known gene (over 80 kb, data not shown). These DNA fragments may be located near uncharacterized genes or may act as triggers mediating long-range regulation, as described in a previous report (Zhao et al., 2009). The biological function of the interaction between miRNAs and these target sites needs to be investigated further. Nevertheless, our experimental procedure provides a way to find this kind of target site and to uncover new regulatory mechanisms of miRNAs.

Unexpectedly, there was a lack of correspondence between the Sanger sequencing and next-generation sequencing results. Because the two sets of results were obtained from two independent experiments, many factors, including cell culture conditions, ChIP, and library construction, may contribute to variability between the datasets. Another possible source of variability is that the samples were prepared *via* different procedures. Compared with the samples for Sanger sequencing, the samples for next-generation sequencing undergo an extra PCR amplification step. In this step, some GC-rich sequences may be lost because such sequences are hard to amplify by PCR.

In conclusion, this is a suitable method for identifying miRNA targets in genomic DNA that are complementary to the miRNA. With this method, the interaction between a miRNA and its putative targets can be confirmed by quantitative PCR with specific primers. The method can therefore be used to confirm mechanisms by which miRNAs regulate genes *via* binding to genomic DNA. Furthermore, the experimental procedure can be applied to screen potential miRNA targets. Overall, the method could substantially advance miRNA research.

# DATA AVAILABILITY

The raw data were submitted to the NCBI SRA database under accession number PRJNA547356.

# AUTHOR CONTRIBUTIONS

Designed the experiments: SX, KW. Performed the experiments: YX, KW, YT, LH, HX, SL, MG, CW. Wrote the paper: YX, KW, SX.

# FUNDING

This research was funded by the China Natural Science Foundation (grant numbers 81770389, 81601122, and 81703919), the Hunan Provincial Natural Science Foundation of China (grant numbers 2017JJ3205 and 2017JJ3232), and the Cooperative Innovation Center of Engineering and New Products for Developmental Biology of Hunan Province (grant number 20134486).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00778/full#supplementary-material

SUPPLEMENTARY FIGURE 1 | Distribution of the main peak of genomic DNA. The genomic DNA was sheared by sonicator equipment. Then 1.5% agarose gel was used to measure the quality of samples. Results from (A) HM-DNA and (B) HM-DNA-NC are shown; the main peaks of both genomic DNAs were distributed between 100 and 500 bp. Agilent 2100 was used to detect the specific distribution of sample fragments. HM-DNA fragments were distributed at 360 bp (C); HM-DNA-NC fragments were distributed at 276 bp (D).

SUPPLEMENTARY FIGURE 2 | Quality distribution and base distribution of HM-7-DNA and HM-7-DNA-NC. Quality distributions of (A) HM-7-DNA and (C) HM-7-DNA-NC are shown; the *X*-axis corresponds to the base position along the read, and the *Y*-axis is the quality value. Each dot in the image represents the quality value of the corresponding position along the reads. Base distributions of (B) HM-7-DNA and (D) HM-7-DNA-NC are shown; both show a balanced base composition. The *X*-axis is the base position on the reads, and the *Y*-axis is the percentage of the corresponding base at each position. A, C, G, T, and N represent different bases.

SUPPLEMENTARY TABLE 1 | The peak information of HM-7-DNA.

SUPPLEMENTARY TABLE 2 | The peak information of HM-7-DNA-NC.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Xun, Tang, Hu, Xiao, Long, Gong, Wei, Wei and Xiang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Non-Coding RNAs in Pediatric Solid Tumors

#### *Christopher M. Smith1, Daniel Catchpoole2,3\* and Gyorgy Hutvagner1\**

*1 School of Biomedical Engineering, University of Technology Sydney, Sydney, Australia 2 School of Software, University of Technology Sydney, Sydney, Australia 3 The Tumour Bank–CCRU, Kids Research, The Children's Hospital at Westmead, Sydney, Australia*

Pediatric solid tumors are a diverse group of extracranial solid tumors representing approximately 40% of childhood cancers. Pediatric solid tumors are believed to arise as a result of disruptions in the developmental process of precursor cells which lead them to accumulate cancerous phenotypes. In contrast to many adult tumors, pediatric tumors typically feature a low number of genetic mutations in protein-coding genes which could explain the emergence of these phenotypes. It is likely that oncogenesis occurs after a failure at many different levels of regulation. Non-coding RNAs (ncRNAs) comprise a group of functional RNA molecules that lack protein coding potential but are essential in the regulation and maintenance of many epigenetic and post-translational mechanisms. Indeed, research has accumulated a large body of evidence implicating many ncRNAs in the regulation of well-established oncogenic networks. In this review we cover a range of extracranial solid tumors which represent some of the rarer and enigmatic childhood cancers known. We focus on two major classes of ncRNAs, microRNAs and long noncoding RNAs, which are likely to play a key role in the development of these cancers and emphasize their functional contributions and molecular interactions during tumor formation.

#### *Edited by:*

*Yun Zheng, Kunming University of Science and Technology, China*

#### *Reviewed by:*

*Alessio Naccarati, Italian Institute for Genomic Medicine (IIGM), Italy Zexuan Zhu, Shenzhen University, China*

#### *\*Correspondence:*

*Daniel Catchpoole Daniel.Catchpoole@uts.edu.au Gyorgy Hutvagner gyorgy.hutvagner@uts.edu.au*

#### *Specialty section:*

*This article was submitted to RNA, a section of the journal Frontiers in Genetics*

*Received: 24 January 2019 Accepted: 30 July 2019 Published: 20 September 2019*

#### *Citation:*

*Smith CM, Catchpoole D and Hutvagner G (2019) Non-Coding RNAs in Pediatric Solid Tumors. Front. Genet. 10:798. doi: 10.3389/fgene.2019.00798*

Keywords: pediatric tumors, miRNA, long noncoding RNA, cancer biology, gene expression

Pediatric cancers are often categorized as hematologic, intracranial, or extracranial (Chen et al., 2015). Hematologic cancers include those derived from the blood or blood forming tissues, including bone marrow and the lymph nodes. Intracranial cancers are tumors that develop inside the brain, whereas extracranial solid tumors, often referred to as pediatric solid tumors, arise outside the brain. Collectively, pediatric solid tumors represent approximately 40% of all pediatric cancers and commonly form in the developing sympathetic nervous system (neuroblastoma), retina (retinoblastoma), kidneys (Wilms tumor), liver (hepatoblastoma), bones (osteosarcoma, Ewing sarcoma), or muscles (rhabdomyosarcoma) (Kline and Sevier, 2003; Allen-Rhoades et al., 2018). Solid tumors can originate from cells of any of the three germ layers, the ectoderm, mesoderm, or endoderm, and likely arise due to disruptions in the developmental processes of these precursor cells, leading them to develop cancerous phenotypes (Chen et al., 2015). This contrasts with most adult cancers, which tend to be of epithelial origin and are believed to develop over time due to exposure to toxins and environmental stress. As a result, adult cancers often display a high occurrence of genetic mutations, whereas pediatric solid tumors tend to feature a relatively low number of genetic mutations. This has led to investigations into alternative forms of gene regulation that may contribute to the emergence and development of cancerous cells in pediatric cancers.


Non-coding RNAs (ncRNAs) form a group of functional RNAs lacking protein-coding potential, which play a crucial role in the regulation of gene expression at every level, from epigenetic regulation via methylation and chromatin packaging to post-transcriptional regulation (Cech and Steitz, 2014; Zhao et al., 2016). The most widely studied ncRNAs are the microRNAs (miRNAs), small 20- to 25-nucleotide-long RNAs that play an important role in regulating translation and messenger RNA stability via complementary base pairing (Huang et al., 2011). Other classes of small ncRNAs include small interfering RNAs (siRNAs), Piwi-interacting RNAs (piRNAs), transfer RNA-derived fragments (tRFs), small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs), and small cytoplasmic RNAs (scRNAs). Additionally, long non-coding RNAs (lncRNAs) are a loosely defined group of RNAs, normally longer than 200 nucleotides, that lack protein-coding potential and do not fall into any of the other categories but nonetheless play key roles in the regulation of gene expression (Mercer et al., 2009). Increasing data suggest that ncRNAs play a role in regulating all biological processes, and it is no surprise that studies have observed widespread dysregulation of ncRNAs in nearly all forms of cancer (Prensner and Chinnaiyan, 2011; Leichter et al., 2017). Interestingly, dysregulated RNA patterns are often specific to the type of cancer or even subtype and can provide insight into the mechanisms underlying phenotypic differences between tumors or cells within a tumor, such as their aggressiveness or resistance to certain types of treatments (Blenkiron et al., 2007; Li et al., 2014). Additionally, genome-wide association studies have suggested that over 80% of single nucleotide polymorphisms found associated with cancer are outside of coding regions (Carninci et al., 2005; Cheetham et al., 2013).
In this review, we will discuss how two major classes of ncRNAs, miRNAs and lncRNAs, may contribute to pediatric solid tumors by participating in the regulation of established oncogenic networks known to drive these cancers.

# MICRORNAS AND GENE REGULATION

Not long after the first human miRNA, *let-7*, was discovered in 2000 by the Ruvkun lab, miRNAs began to emerge as key participants in tumorigenesis (Pasquinelli et al., 2000). In 2002, two miRNAs were identified as potential tumor suppressors due to their frequent downregulation or deletion in chronic lymphocytic leukemia (Calin et al., 2002). Calin et al. (2004) later showed that many miRNA genes are located close to fragile sites or common breakpoints that frequently occur in cancers, suggesting that their loss of function was a key event in oncogenesis. Since then, oncomiRs (cancer-associated miRNAs) have become a major research focus (Esquela-Kerscher and Slack, 2006). A better understanding of the mechanisms behind miRNA regulation in cancer is invaluable to researchers and clinicians alike, not only to aid in the identification of new drug targets but also for the development of promising RNA-based therapies and their potential use as early detection biomarkers.

# miRNAs: Biogenesis and Functions

The life cycle of a miRNA typically begins with its transcription into a primary miRNA (pri-miRNA) by RNA polymerase II (Ha and Kim, 2014). pri-miRNAs share several similarities with messenger RNAs (mRNAs) in that they are 5' capped, are 3' polyadenylated, and can be several hundreds or thousands of nucleotides long. In many cases, the pri-miRNA encodes one miRNA species; however, in humans, a substantial number are polycistronic and encode several different miRNAs together. pri-miRNAs must be processed in the nucleus by the RNase III enzyme Drosha, which releases shorter ~65-nucleotide-long precursor RNAs (pre-miRNAs) with a secondary hairpin structure. This hairpin is recognized by the Exportin-5/Ran-GTP transporter, which transports the pre-miRNA from the nucleus to the cytoplasm. In the cytoplasm, the pre-miRNA is further processed by Dicer, another RNase III enzyme, which cleaves the loop and releases a double-stranded miRNA duplex containing the 5 prime (5p) and 3 prime (3p) sequences. The duplex is then recognized by one of the four human Argonaute proteins, which loads one of the strands and discards the other.

miRNAs carry out their functions by binding to Argonaute and associating with various other proteins to form the RNA-induced silencing complex (RISC). As part of this complex, miRNAs serve as guides by binding via complementary base pairing to target sites that are normally found in the 3'-untranslated region (3'UTR) of mRNAs. RISCs can regulate gene expression by direct cleavage of transcripts, transcript destabilization, or blocking translation. In a broader sense, miRNAs play a role in globally "fine-tuning" gene expression and are particularly important in inducing and maintaining differentiated cell states. In cancer, this finely tuned expression is often impaired, enabling gene networks that are normally switched on or off to reverse and begin influencing cellular behavior in a deleterious manner.
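The guide role described above can be illustrated with a minimal sketch of complementary base pairing between a miRNA and a 3'UTR. It rests on assumptions that are not stated in this review: that targeting is approximated by a perfect Watson-Crick match to the "seed" (nucleotides 2 to 8 of the miRNA), that the let-7a-5p sequence is as commonly listed in miRBase, and that `TOY_UTR` is a hypothetical sequence invented for the example. Real target prediction uses many more features and still requires biological validation.

```python
# Watson-Crick complements for RNA bases (G:U wobble pairs are ignored here).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_seed_sites(mirna: str, utr: str, seed_start: int = 2, seed_end: int = 8) -> list[int]:
    """Return 0-based UTR positions that perfectly pair with the miRNA seed.

    The seed is taken as miRNA nucleotides seed_start..seed_end (1-based,
    inclusive); this window is a simplifying assumption, not a rule from the text.
    """
    seed = mirna[seed_start - 1:seed_end]
    site = reverse_complement(seed)  # the sequence the UTR must contain
    return [i for i in range(len(utr) - len(site) + 1) if utr.startswith(site, i)]


# Assumed mature let-7a-5p sequence (check against miRBase before real use).
LET_7A_5P = "UGAGGUAGUAGGUUGUAUAGUU"
# Hypothetical 3'UTR fragment containing two seed-complementary sites.
TOY_UTR = "AAACUACCUCAGGGCUACCUCUU"

sites = find_seed_sites(LET_7A_5P, TOY_UTR)
print(sites)
```

A perfect seed match is neither necessary nor sufficient for repression in cells, so a scan like this is best read as a way to generate candidates rather than to confirm targets.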

# miRNAs: Drivers or Passengers in Cancer?

Microarrays and next-generation sequencing technologies have enabled global measurements of miRNA expression changes and have revealed miRNA dysregulation to be a hallmark of nearly all cancers. miRNA expression profiles often correlate with cancer subtypes and have been effective at classifying cancer samples for risk stratification (De Preter et al., 2011). However, understanding the contribution of specific miRNAs can prove difficult. miRNAs are predicted to regulate hundreds to thousands of genes; however, their influence may be minor, and often, they must act in concert with other miRNAs. Current miRNA target prediction algorithms are imperfect and do not capture the true range of regulatory targets; therefore, biological validation is still needed (Riffo-Campos et al., 2016). Additionally, opposing behavior is seen with many miRNAs, where the same miRNA may be considered an oncogene in one cancer and a tumor suppressor in another. Because of their integration within complex gene networks, it is often not obvious whether a dysregulated miRNA actively participates in the maintenance of a cancerous state or whether it is simply a bystander. Therefore, it is important to examine how miRNAs participate in oncogenic networks on a functional level in order to properly understand their role.

Transcription factors that play an important role in regulating cell proliferation, migration, and apoptosis are commonly perturbed in pediatric solid tumors. One of the best examples of this is in neuroblastoma, where *MYCN* amplification is present in approximately 25% of neuroblastoma patients and disproportionally represents high-risk cases (Huang and Weiss, 2013). MYCN upregulation is also observed at a higher frequency in several other pediatric solid tumors including Wilms tumor, rhabdomyosarcoma (Williamson et al., 2005), and retinoblastoma, although generally not to the extent seen in neuroblastoma. Germline inactivation of the Wilms Tumor 1 (WT1) transcription factor has been linked to a genetic predisposition towards Wilms tumor. Several transcription factors involved in the epithelial-to-mesenchymal transition, including Twist, Snails, and Zebs, have also been implicated in the development of osteosarcoma (Yang et al., 2013). miRNAs are often closely tied to transcription factors, either as regulators or as transcriptional targets (**Figure 1**) (Sin-Chan et al., 2019). One of the earliest studies linking miRNAs to an oncogenic transcription factor was by O'Donnell et al. (2005), who demonstrated that c-Myc could induce expression of the miR-17~92 cluster and that several of these miRNAs could in turn regulate E2F1 expression to control cell proliferation.

#### Disruptions in miRNA Processing

Recent studies have shown that impairments of the miRNA processing machinery are common in Wilms tumor and likely contribute to this disease. For example, a study by Torrezan et al. (2014) found mutations in miRNA processing genes in 33% of tumors, most commonly occurring in the *Drosha* gene, with other mutations in *DICER1*, *XPO5*, *DGCR8*, and *TARBP2*. These results are supported by several other studies by Wu et al. (2013), Rakheja et al. (2014), Walz et al. (2015), Wegert et al. (2015), and Gadd et al. (2017). Rakheja et al. further examined the potential consequences of several of these mutations and found that Drosha mutations often led to a loss of RNase IIIB activity, which prevented processing of pri-miRNAs, leading to a global reduction in mature miRNAs. *DICER1* mutations also frequently affected the RNase IIIB domain; however, this mutation only affected processing of 5p miRNAs from precursors, as *DICER1* contains a second RNase domain for 3p processing. As a result, this mutation led to a shift towards 3p miRNA maturation. These mutations have interesting consequences for global miRNA expression and most likely favor expression of oncogenic miRNAs or reduce expression of miRNAs with tumor-suppressive effects. In line with this, the let-7 family is predominantly 5p-derived, and lower expression of several of its 5p members was found in both *Drosha* and *DICER1* mutants in two of these studies (Rakheja et al., 2014; Walz et al., 2015). Additionally, the miR-200 family, which is known to regulate the mesenchymal-to-epithelial transition and has been associated with highly aggressive forms of cancer, was found to be downregulated in Wilms tumors with mutated miRNA processing genes (Ceppi and Peter, 2014; Walz et al., 2015). The functional role of several oncomiRs has been investigated in detail within the context of pediatric solid tumors and is discussed in the following section.

FIGURE 1 | Regulatory circuitry involving non-coding RNAs in various pediatric solid tumors. Shows elements of a key regulatory circuit involving MYC and E2F family transcription factors and many ncRNAs, often dysregulated in pediatric solid tumors. In many cases, recurring dysregulation of specific elements, including miRNAs and lncRNAs, is observed and may represent vulnerabilities in the normal development of specific cell lineages. (A) Loss of chromosomal regions where let-7 and miR-34 miRNAs are localized is frequently observed in neuroblastoma and may represent a key event in the development of many of these cancers. (B) let-7 dysregulation may facilitate overexpression of the oncogenic fusion transcript EWS-FLI-1 in Ewing sarcoma. (C) The RB1 tumor suppressor regulates E2F, and loss of function via mutations can lead to the development of retinoblastoma. In osteosarcoma, miR-9 may be able to act as an oncogenic driver as it is often overexpressed and can downregulate RB1. (D) In neuroblastoma, miR-9 can display tumor-suppressive properties by cooperating with miR-125a and miR-125b to regulate a specific isoform of trkC and suppress cell proliferation. (E) The lncRNA TUG1 is suggested to act as a ceRNA against miR-9, which has been shown to display tumor-suppressive properties in some osteosarcoma cell lines.

## The miR-17~92 Cluster is a Downstream Effector of Oncogenic Transcription Factors

The miR-17~92 cluster is expressed during normal development of the brain, heart, lungs, and immune system (Koralov et al., 2008; Ventura et al., 2008; Bian et al., 2013; Chen et al., 2013) and is known to regulate critical genes involved in cell growth, proliferation, and apoptosis. This cluster comprises six different miRNAs that are co-expressed: miR-17, miR-18a, miR-19a, miR-19b-1, miR-20a, and miR-92-1. Dysregulation of the miR-17~92 cluster has been shown in several pediatric solid tumors including neuroblastoma, Wilms tumor, retinoblastoma, and osteosarcoma, where a higher expression generally correlates with a poorer prognosis (Chen and Stallings, 2007; Baumhoer et al., 2012; Li et al., 2014). The miR-17~92 cluster is particularly interesting due to its regulation by the transcription factor MYC and its homologue MYCN, where it seems to act as a mediator for some of MYC/MYCN's oncogenic effects (Schulte et al., 2008). Other transcription factors known to target the miR-17~92 cluster include members of the E2F family and STAT3 (Mogilyansky and Rigoutsos, 2013).

Several studies have demonstrated that the miR-17~92 cluster regulates many downstream components of the transforming growth factor beta (TGF-β) pathway, which is known to participate in a variety of cellular processes such as differentiation, proliferation, and immune cell activation. A study by Fontana et al. (2008) demonstrated that in neuroblastoma, miR-17 and miR-20a downregulate the cyclin-dependent kinase inhibitor p21, which is activated by TGF-β. p21 plays a key role in the inhibition of cell cycle progression by blocking the transition from G1 to S phase, and its deregulation leads to uncontrolled cell growth. Additionally, Fontana et al. (2008) showed that miR-17-5p regulated another downstream component of TGF-β, the pro-apoptotic factor Bcl-2 interacting mediator (BIM). Mestdagh et al. (2010) later investigated miR-17~92 regulation of the TGF-β pathway in more depth and identified miR-17 and miR-20a as regulators of TGFBR2 and miR-18a as a regulator of SMAD2 and SMAD4, both signal transducers for TGF-β receptors. miR-18a and miR-19a have also been shown to repress estrogen receptor (ESR1) expression, and prolonged knockdown of miR-18a induced morphological differentiation of SK-N-BE neuroblastoma cells. Interestingly, the TGF-β pathway interacts with ESR1 signaling via several of the SMADs (Band and Laiho, 2011), suggesting a complex interplay between miR-17~92 and its targeted pathways that is necessary for fine-tuning differentiation during neuronal development, a balance that is disrupted when miR-17~92 is overexpressed. While no studies have investigated in detail the interaction between the miR-17~92 cluster and the TGF-β pathway in Wilms tumor, the TGF-β pathway has been implicated in Wilms tumor development. In contrast with neuroblastoma, the TGF-β pathway appears to function as a promoter of Wilms tumor progression, and TGF-β is highly expressed in primary tumors, even more so in metastatic tumors.
This multifaceted behavior of the TGF-β pathway has been shown in other cancers and implies that the pathway's influence is specific to the tumor it is activated in.

The E2F family of transcription factors serves an important role in cell cycle control, as their expression can cause cells to enter the G1 phase to initiate cell division (Chen et al., 2009). Several members, including E2F1, E2F2, and E2F3, regulate miR-17~92 expression. In a study by Kort et al. (2008), E2F3 was shown to be exclusively expressed in Wilms tumor and not in other types of kidney tumors. In line with this, they compared expression of the miR-17~92 miRNAs in Wilms tumor samples to other renal tumor subtypes and found them all to be upregulated. They were also able to show a correlation between E2F3 expression and the stage of Wilms tumor, where it was highest in late-stage metastatic tissues. In retinoblastoma, an early study investigating the miR-17~92 cluster identified that one of its members, miR-20a, participates in an autoregulatory feedback loop with E2F2 and E2F3 (Sylvestre et al., 2007), as both transcription factors were themselves found to be downregulated by miR-20a. The authors suggested that this autoregulation was critical in preventing expression of excessive amounts of E2F transcription factors. Given that MYC/MYCN and E2F have previously been shown to induce each other's expression, miR-20a appears to play an important role in keeping this positive feedback loop in check (Leone et al., 1997; Strieder and Lutz, 2003). Therefore, it is easy to see how disruption in one or more of these regulatory elements could lead to uncontrolled expression of these proliferative and anti-apoptotic signals.
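The logic of this autoregulatory loop can be sketched with a deliberately simple two-variable toy model, in which E2F drives miR-20a production and miR-20a in turn dampens E2F production. Everything quantitative here is an assumption made for illustration: the rate constants, the hyperbolic repression term, and the function name `simulate_e2f` are hypothetical and are not taken from Sylvestre et al. (2007) or any other cited study.

```python
def simulate_e2f(feedback: bool, steps: int = 5000, dt: float = 0.01) -> float:
    """Integrate a toy E2F/miR-20a loop with forward Euler and return final E2F.

    dE/dt = k_e / (1 + M / K) - d_e * E   (miRNA dampens E2F production)
    dM/dt = k_m * E - d_m * M             (E2F induces the miRNA)

    With feedback=False the repression term is switched off, mimicking
    loss of miR-20a-mediated control. All parameters are arbitrary.
    """
    k_e, d_e = 1.0, 1.0   # E2F production and decay rates (assumed)
    k_m, d_m = 1.0, 1.0   # miR-20a production and decay rates (assumed)
    K = 1.0               # miRNA level giving half-maximal repression (assumed)
    e2f, mir = 0.0, 0.0
    for _ in range(steps):
        repression = 1.0 / (1.0 + mir / K) if feedback else 1.0
        d_e2f = k_e * repression - d_e * e2f
        d_mir = k_m * e2f - d_m * mir
        e2f += dt * d_e2f
        mir += dt * d_mir
    return e2f


with_loop = simulate_e2f(feedback=True)
without_loop = simulate_e2f(feedback=False)
print(f"E2F with miR-20a feedback:    {with_loop:.3f}")
print(f"E2F without miR-20a feedback: {without_loop:.3f}")
```

In this sketch, removing the repression term lets E2F settle at a higher level, which is the qualitative point made by the authors; the specific numbers carry no biological meaning.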

A later study by Conkrite et al. (2011) investigated miR-17~92 in retinoblastoma and revealed that this cluster was capable of driving retinoblastoma formation in *RB1/p107*-deficient mice. *RB1* plays a key role in inhibiting cell cycle progression, and germline mutations of this gene can lead to familial retinoblastoma formation (Friend et al., 1986; Classon and Harlow, 2002). *RB1*'s protein product, pRB, inhibits E2F transcription factors by binding and inactivating them, and its absence enables miR-17~92–driven tumor formation.

The miR-17~92 cluster also plays a role in driving tumor progression and metastasis in osteosarcoma (Li et al., 2014). A recent study by Yang et al. (2018) identified QKI2 as a regulatory target of the miR-17~92 cluster. QKI proteins have previously been shown to inhibit β-catenin and induce differentiation in colon cancer. Yang et al. demonstrated that miR-17~92 downregulated QKI2, causing upregulation of β-catenin, leading to increased proliferation, invasion, and migration in osteosarcoma (Yang et al., 2018). Additionally, miR-20a has previously been shown to downregulate expression of Fas, a cell surface marker that interacts with FasL to induce apoptosis in the lungs, the site to which osteosarcoma almost exclusively metastasizes (Huang et al., 2012).

The miR-17~92 cluster plays a tumorigenic role in a number of pediatric solid tumors including neuroblastoma, Wilms tumor, retinoblastoma, and osteosarcoma. The use of the miRNA pathway by transcription factors such as the MYC and E2F families enables them to target a wide range of genes and immediately affect gene expression at the post-transcriptional level. Continued research into how miRNAs may operate as oncogenic drivers will likely expand the repertoire of potential drug targets available to us.

### Let-7 Dysregulation is a Feature in Many Pediatric Solid Tumors

The let-7 family of miRNAs are among the most well-characterized tumor suppressors due to their frequent downregulation in cancers. In total, there are 12 members of the let-7 family located across eight different chromosomes; however, in most cells, only a selection of these miRNAs will be expressed (Balzeau et al., 2017). let-7 miRNAs are important in regulating the cell cycle and maintaining cells' differentiated state by targeting a wide range of genes with known roles in cancer biogenesis such as *MYC/MYCN*, *RAS*, *CDK6*, and *HMGA2* (Buechner et al., 2011; Wu et al., 2015).

let-7 is regulated by the LIN28 proteins, LIN28A and LIN28B, which mediate uridylation, prevent processing of the let-7 precursor, and are important for maintaining pluripotency in cells (Lehrbach et al., 2009; Balzeau et al., 2017). Both *Lin28* genes contain let-7 target sites and participate in a double-negative feedback loop with let-7 (Yin et al., 2017). Overexpression of *Lin28* tends to drive cells towards oncogenesis and is a common feature in cancers. In a study by Urbach et al. (2014), *Lin28b* overexpression was found in approximately 30% of Wilms tumors. Additionally, they found that overexpression of *Lin28* could induce tumor formation in specific renal intermediates and that restoration of let-7 activity could reverse this effect in mice. Similar examples have been shown in mouse models, where *Lin28b* overexpression can drive hepatoblastoma and hepatocellular carcinoma in the liver and neuroblastoma in the neural crest (Molenaar et al., 2012; Nguyen et al., 2014). Molenaar et al. (2012) investigated Lin28b in neuroblastoma and demonstrated that Lin28b could enhance MYCN protein levels via let-7 regulation. However, a later study by Powers et al. (2016) showed that Lin28b expression was redundant in certain MYCN-amplified neuroblastoma cells, as the overexpressed MYCN transcript could function as a miRNA sponge for let-7, thereby negating the effect of let-7 regardless of its expression level. Powers et al. showed that most neuroblastomas were characterized by a loss of let-7 with either MYCN overexpression or chromosomal loss of arm 3p or 11q, where several let-7 miRNAs are located (**Figure 1A**). The authors noted that these events were generally mutually exclusive and suggested that the presence of one event alleviated selective pressure for the other.

A study by Di Fiore et al. (2016) revealed that let-7d could promote and suppress tumor formation within the same system. In this study, they found that let-7d overexpression in osteosarcoma cells reduced several stemness genes, including *Lin28b*, *HMGA2*, *Oct3/4*, and *SOX2*, and could elicit the mesenchymal-to-epithelial transition with upregulation of the epithelial marker E-cadherin and downregulation of mesenchymal markers N-cadherin and vimentin. However, they also found that let-7d enhanced cell migration and invasion, presumably by acting via the TGF-β pathway, which is known to promote this behavior. let-7d strongly increased versican VI expression, which has previously been shown to activate the TGF-β pathway in osteosarcoma (Li S. et al., 2014).

In Ewing sarcoma, Hameiri-Grossman et al. (2015) found that let-7 downregulated the *Ras* oncogene, as well as the transcription factor HIF-1α, to reduce EWS-FLI-1 expression (**Figure 1B**). EWS-FLI-1 is a hybrid transcript that results from a translocation event involving EWS and FLI1, and translocation events such as this are present in nearly all Ewing sarcoma cases and are believed to drive the disease (Delattre et al., 1994).

Loss of let-7 plays a key role in many pediatric solid tumors, as it enables expression of transcription factors and other genes that participate in oncogenesis. This has been emphasized in neuroblastoma, where it has been suggested that loss of let-7 function is an essential event in tumor development, which positions the miRNA pathway as a central player in pediatric solid tumors.

### miR-9 Has Been Shown to Play Oncogenic and Tumor-Suppressive Roles in Different Pediatric Tumors

miR-9 is a highly conserved miRNA involved in several different cellular processes including cell proliferation, differentiation, and migration. Early studies revealed miR-9 to be highly expressed in the brain and to play a role both during development and in the adult brain; however, miR-9 has also been associated with many cancers outside the brain, acting as an oncogene or tumor suppressor (Coolen et al., 2013). miR-9 is upregulated by MYC/MYCN and plays a role in promoting tumor growth and metastasis in several cancers including breast cancer, osteosarcoma, and rhabdomyosarcoma, where it is often overexpressed (Iorio et al., 2005; Luo et al., 2017). However, in other cancers such as neuroblastoma, miR-9's role is less clear, and studies have argued for both oncogenic and tumor suppressor functions (Laneve et al., 2007; Zhi et al., 2014).

The role of miR-9 in osteosarcoma appears to be in promoting cell growth and metastasis (Zhu et al., 2015; Qi et al., 2016). In a study by Zhu et al. (2015), miR-9 knockdown suppressed cell growth and migration of osteosarcoma cells. They were also able to show that miR-9 downregulated RB1 via the Grap2 and cyclin D interacting protein (GCIP), thereby promoting E2F-mediated cell division (**Figure 1C**). Similar behavior has been observed in the alveolar subtype of rhabdomyosarcoma, where miR-9 contributes to increased cell proliferation and migration (Missiaglia et al., 2017). In this study, miR-9 was shown to be induced via MYCN by the PAX3/FOXO1 fusion gene, which is specific to this subtype of rhabdomyosarcoma.

In neuroblastoma, miR-9 expression has been shown to be both up- and downregulated in different studies. An early study by Laneve et al. (2007) showed that miR-9 was downregulated in 50% of primary neuroblastoma samples, and follow-up experiments demonstrated that miR-9 could act together with miR-125a and miR-125b to suppress cell proliferation by targeting a truncated isoform of the neurotrophin receptor tropomyosin-related kinase C (trkC) (**Figure 1D**). However, a later study by Ma et al. (2010) found miR-9 to be a target of MYCN and that miR-9 expression correlated with MYCN and metastatic status in neuroblastoma tumors. In this same study (albeit in breast cancer cells), Ma et al. also demonstrated that miR-9 suppressed E-cadherin to activate β-catenin and promote the epithelial-to-mesenchymal transition. miR-9 is frequently involved in promoting cell migration; however, its absence has also been shown to produce different responses such as cell cycle arrest or apoptosis in neurons depending on their origin (Bonev et al., 2011). The contradictory behavior seen across studies of miR-9 highlights the diverse roles that individual miRNAs can play, and more comprehensive studies are needed to identify the relevant contextual influences on miRNA behavior.

### miR-34 Is a Key Regulator of the Cell Cycle and Drug Resistance in Pediatric Solid Tumors

The miR-34 family has garnered significant interest since its members were discovered to be direct transcriptional targets of the tumor suppressor and transcription factor p53 (Hermeking, 2010). The miR-34 family consists of three miRNAs encoded by two genes, *mir-34a* and *mir-34b/c*. All three miRNAs play a key role in regulating apoptosis and the cell cycle by inducing G1 phase arrest. The genomic locations of *mir-34a* and *mir-34b/c* are notable: they lie on chromosomes 1p36 and 11q23, respectively, regions that are frequently lost in pediatric solid tumors (Ruteshouser et al., 2005; Wittmann et al., 2007). In particular, loss of 1p36 occurs in 20–30% of neuroblastoma cases and correlates with MYCN amplification (Caron et al., 1993; Maris et al., 1995), whereas loss of 11q23 occurs in approximately 40% of cases but almost never occurs with MYCN amplification (**Figure 1A**) (Guo et al., 1999; Attiyeh et al., 2005). miR-34 members are also regulators of the MYC family, as miR-34a is known to regulate MYCN and miR-34b and miR-34c to regulate c-MYC (Wei et al., 2008).

Studies on miR-34a expression have identified frequent downregulation in neuroblastoma, osteosarcoma, and hepatoblastoma (Jiao et al., 2016). miR-34a is itself considered a tumor suppressor due to its involvement in cell cycle arrest and apoptosis (De Antonellis et al., 2014). In neuroblastoma, Cole et al. (2008) investigated the growth-inhibitory effects of several miRNAs mapping to common chromosomal aberrations by overexpressing them in cell lines. In most cases, overexpression did not lead to a noticeable change in phenotype; however, miR-34a and miR-34c induced significant growth inhibition in cell lines with 1p36 deletion. Growth inhibition and suppression of metastasis by miR-34a have also been shown in osteosarcoma by several studies, where members of key proliferative signal transduction pathways such as c-Met, DUSP1, and Eag1 were identified as regulatory targets (Yan et al., 2012; Wu X. et al., 2013; Gang et al., 2017). The miR-34 family also targets several members of the Notch signaling pathway, which has been linked to both oncogenic and tumor-suppressive roles depending on the cellular context. In osteosarcoma, activation of the Notch pathway is known to contribute to tumor growth, and miR-34a–mediated downregulation of this pathway likely contributes to its tumor-suppressive role. However, in Ewing sarcoma, a recent study investigating miR-34b suggested that it could act as an oncogene, promoting proliferation, migration, and invasion through Notch1 repression (Lu Q. et al., 2018). Prior studies in Ewing sarcoma have shown correlations between high miR-34a expression and patient survival, which would indicate a tumor-suppressive role for miR-34a (Nakatani et al., 2012; Marino et al., 2014). It is unclear why miR-34a and miR-34b would display contrasting effects given their shared targets, and further investigation is needed.

Studies by Pu et al. (2016) and Pu et al. (2017) have suggested that miR-34a may also play a role in promoting multidrug resistance in osteosarcoma. These studies found that miR-34a-5p enhanced multidrug resistance through downregulation of the *CD117* and *AGTR1* genes *in vitro*. CD117 is often highly expressed in drug-resistant tumors and is commonly used as a marker for stemness (Adhikari et al., 2010). In contrast, Nakatani et al. (2012) found that miR-34a increased chemosensitivity in Ewing sarcoma.

## Other miRNAs Involved in Multiple Pediatric Solid Tumors

A substantial number of other miRNAs have been discovered with functional implications in multiple pediatric solid tumors. One such miRNA is miR-125b, which typically exhibits tumor-suppressive properties in cancers such as neuroblastoma, osteosarcoma, and Ewing sarcoma, where it is commonly dysregulated (Laneve et al., 2007; Li J. et al., 2014; Xiao et al., 2019). As mentioned above, miR-125b participates in a network with miR-125a and miR-9, regulating expression of a truncated trkC isoform to control neuroblastoma growth and differentiation (Laneve et al., 2007; Le et al., 2009). In osteosarcoma, miR-125b was found to regulate STAT3 by downregulating MAP kinase kinase 7 (MKK7), which inactivates STAT3 via dephosphorylation (Xiao et al., 2019). Loss of miR-125b and consequent overexpression of MKK7 led to increased tumor formation and poorer prognosis. In Ewing sarcoma, miR-125b is involved in regulating the PI3K signaling pathway: through suppression of PIK3CD, it could inhibit cell proliferation, migration, and invasion and induce apoptosis (Li J. et al., 2014). Conversely, in retinoblastoma, miR-125b is overexpressed and has shown oncogenic properties by promoting cell proliferation and migration and inhibiting apoptosis (Bai et al., 2016). Conflicting behavior of miR-125b has been observed in many other cancers, which suggests that its role is highly dependent on cell identity (Sun et al., 2013).

miR-124 has been widely reported to act as a tumor suppressor by inhibiting cell growth and metastasis, and it is a key mediator of differentiation in several pediatric solid tumors (Peng et al., 2014; Feng et al., 2015; Zhao et al., 2017). In neuroblastoma, miR-124a increased the proportion of differentiated cells possessing neurite outgrowths (Le et al., 2009). In retinoblastoma, miR-124 participates in a regulatory network with lncRNAs Malat1 and XIST, which both function as oncogenes by enhancing growth and metastasis through downregulation of miR-124 (Liu S. et al., 2017; Hu et al., 2018). miR-124 itself was shown to target STAT3 to inhibit cell proliferation, migration, and invasion (Liu S. et al., 2016). In Ewing sarcoma, miR-124 expression is suppressed, and its expression was found to reduce growth and metastasis *via* downregulation of mesenchymal genes such as *SLUG* and cyclin D2 (*CCND2*) (Li et al., 2017). Finally, in osteosarcoma, retinoblastoma, and Ewing sarcoma, miR-143 has been found to be dysregulated (De Vito et al., 2012; Li S. et al., 2014; Wang et al., 2016; Sun et al., 2018). For example, Li et al. investigated miR-143 function in osteosarcoma and showed that miR-143 participated in the TGF-β pathway by targeting versican, and that TGF-β could reduce miR-143 expression to promote cell migration and invasion (Li S. et al., 2014). FOS-like antigen 2 (FOSL2) was also identified as a miR-143 target, which enhanced cell proliferation, migration, and invasion in the absence of miR-143 (Sun et al., 2018). Additional miRNA studies are listed in **Table 1**.

TABLE 1 | miRNAs that have been shown to exhibit oncogenic or tumor-suppressive effects through functional studies in various pediatric solid tumors.



### miRNAs Regulate All Aspects of Tumorigenesis

Widespread dysregulation of miRNAs is observed in many pediatric solid tumors, and functional studies have demonstrated that many of these miRNAs can drive or repress oncogenic pathways responsible for cell proliferation, apoptosis, angiogenesis, metastasis, and drug resistance. Importantly, miRNAs such as let-7 and miR-34 play a vital role in pediatric solid tumors by regulating established oncogenic transcription factors such as the MYC and E2F families (Wei et al., 2008; Buechner et al., 2011). Other miRNAs, such as the miR-17~92 cluster and miR-9, serve as downstream effectors for these transcription factors, although their exact role in tumorigenesis seems to depend on the overall transcriptional landscape (Schulte et al., 2008; Ma et al., 2010). In some cases, viewing miRNAs as oncogenes or tumor suppressors likely represents an oversimplification of their role in cancer, and a better understanding of their participation in oncogenic networks will be needed to clarify their exact contributions.

# LONG NON-CODING RNAS REGULATE ONCOGENIC PATHWAYS IN PEDIATRIC SOLID TUMORS

For a long time, the human genome was believed to be composed mostly of "junk" DNA, despite pervasive transcription of much of the genome outside of protein-coding genes and other RNAs known at the time (Prensner and Chinnaiyan, 2011). Originally thought of as transcriptional noise, lncRNAs have now emerged as functional regulators of nearly all essential cellular processes including growth, differentiation, cell state maintenance, apoptosis, splicing, and epigenetic regulation. The first lncRNA, H19, was discovered in 1990, when an RNA molecule was found to be spliced and polyadenylated in a manner typical of mRNAs; however, it lacked an open reading frame and was believed to function as an untranslated RNA molecule (Brannan et al., 1990).

Often, lncRNAs participate within protein complexes and can operate as scaffolds, guides, decoys, or allosteric regulators. Many lncRNAs function as epigenetic regulators by interacting with proteins involved in chromatin remodeling and DNA methylation. Frequently, these lncRNAs are cis-acting and regulate the regions near their transcribed location; however, some are trans-acting. Other lncRNAs function as competing endogenous RNAs (ceRNAs), which contain miRNA binding sites in a similar manner to mRNAs and thereby compete for miRNA binding and reduce the activity of miRNAs on their targets.

Several studies have investigated lncRNA expression in pediatric tumors and have successfully identified unique expression profiles in different cancers and tumor subtypes (Mitra et al., 2012; Brunner et al., 2012; Dong et al., 2014; Sahu et al., 2018). For example, Dong et al. (2014) compared hepatoblastoma samples to normal liver tissue in patients and found 2,736 differentially expressed lncRNAs. A study by Pandey et al. (2014) found 24 lncRNAs that could distinguish low- and high-risk neuroblastoma tumors. In a more recent study, Sahu et al. (2018) identified 16 differentially expressed lncRNAs that could be used to predict event-free survival with greater accuracy than other commonly used clinical risk factors. Mechanistic studies into many of these lncRNAs have revealed that they frequently act as an additional layer of regulation within established oncogenic networks involving protein-coding genes and miRNAs. While the field of lncRNAs is still relatively young, many studies have emerged suggesting that lncRNAs are far more integrated into existing gene networks than previously appreciated (**Figure 1**). In the following section, the roles of some of the better-characterized lncRNAs in pediatric solid tumors are discussed.

### Malat1 Is Induced by MYCN in Neuroblastoma and Competes With Many miRNAs

One of the earliest lncRNAs to be associated with disease was Malat1 (metastasis-associated lung adenocarcinoma transcript 1), which was shown to associate with metastatic tumors in non–small cell lung cancer patients (Ji et al., 2003). Malat1 is abundantly expressed and highly conserved across species, unlike many other lncRNAs, and displays remarkably diverse functions in cellular processes including alternative splicing, nuclear organization, and epigenetic modulation. Studies have suggested an important role for Malat1 in brain development, as it is highly expressed in neurons and its depletion has been shown to affect synapse and dendrite development (Bernard et al., 2010; Chen et al., 2016). However, its importance has been questioned as other studies have found that Malat1-KO mice are viable with no discernable change in phenotype (Nakagawa et al., 2012; Zhang et al., 2012).

In addition to lung cancer, Malat1 is known to contribute to metastasis in other common types of cancer including hepatocellular carcinoma and bladder cancer, with evidence that it acts through induction of the epithelial-to-mesenchymal transition (Ying et al., 2012; Li G. et al., 2014; Yang et al., 2017). The role of Malat1 in several pediatric cancers has also been explored in recent studies. In neuroblastoma, Tee et al. (2014) identified a regulatory network involving N-Myc, Malat1, and the histone demethylase JMJD1A. They found that N-Myc upregulated JMJD1A via direct binding of its promoter region and that JMJD1A could demethylate histone H3K9 near the promoter region of Malat1, leading to its upregulation. MYCN-mediated upregulation of Malat1 provides one mechanism by which MYCN amplification can lead to increased metastasis in neuroblastoma patients. Another study by Bi et al. (2017) demonstrated that Malat1 regulated expression of Axl, a transmembrane receptor tyrosine kinase that is known to activate pathways involved in cell proliferation, survival, and migration. In osteosarcoma, Dong et al. (2015) demonstrated that Malat1 was highly expressed and could activate the PI3K/Akt pathway to promote proliferation and invasion.

Malat1 is known to interact with many miRNAs implicated in cancer. In osteosarcoma, several studies have shown Malat1 can function as a ceRNA for different miRNAs (Wang et al., 2017b; Liu K. et al., 2017b; Sun and Qin, 2018). miR-140-5p is a tumor suppressor that downregulates HDAC4, a histone deacetylase that contributes to tumorigenesis, and competitive binding by Malat1 with miR-140-5p was shown to increase HDAC4 activity (Sun and Qin, 2018). Malat1 was also shown to compete with miR-144-3p binding to ROCK1/ROCK2, promoting proliferation and metastasis (Wang et al., 2017b). In a similar manner, Liu K. et al. (2017) found that Malat1 could regulate cell growth through high-mobility group protein B1 (HMGB1) via ceRNA activity with miR-142-3p and miR-129-5p. Finally, in retinoblastoma, Malat1 downregulated miR-124 activity, leading to activation of the transcription factor SLUG, which is also targeted by miR-124 (Liu S. et al., 2017). SLUG has a known role in the epithelial-to-mesenchymal transition by suppressing E-cadherin via the Wnt/β-catenin pathway (Prasad et al., 2009).

In addition to interactions with miRNAs, Malat1 has also been shown to be processed directly by the Drosha–DGCR8 microprocessor complex through binding sites in the 5' end of the transcript (Macias et al., 2012). lncRNAs such as Malat1 cooperate with the miRNA pathway and a number of transcription factors and epigenetic factors to form a complex network responsible for regulating tumorigenesis. The capacity for Malat1 to drive proliferation and metastasis in pediatric solid tumors suggests that dysregulation of any of these regulatory components can be sufficient for the development of cancer and highlights the value of further research into the relatively new field of lncRNAs.

### H19: lncRNA Dysregulation via Loss of Imprinting May Contribute to Tumorigenesis

H19 is a paternally imprinted gene that is typically expressed exclusively from the maternal allele. Early reports suggested that H19 functioned as a tumor suppressor capable of inhibiting cell growth (Hao et al., 1993; Zhang et al., 1993; Casola et al., 1997; Fukuzawa et al., 1999). Studies in childhood solid tumors such as hepatoblastoma, Wilms tumor, and embryonal rhabdomyosarcoma supported this idea, as all three cancers often exhibited reduced H19 expression and had frequently lost the maternal 11p15 chromosomal region housing this gene (Fukuzawa et al., 1999). Other studies, in osteosarcoma and retinoblastoma, suggested an oncogenic role for H19, as its upregulation and loss of imprinting were commonly seen (Chan et al., 2014; Li L. et al., 2018). The same observation has been made in many other cancers including breast cancer (Lottin et al., 2002). Recently, the Hedgehog signaling pathway, a regulator of differentiation known to participate in cancer development and metastasis, was shown to induce H19 expression (Chan et al., 2014).

Understanding the exact function of H19 has proved difficult; however, it was known to sit downstream of the insulin-like growth factor 2 (IGF2) gene, which encodes a growth factor known to play a role in tumorigenesis. Early reports suggested interactions between IGF2 and H19, as loss of imprinting of either gene caused biallelic expression of the other gene (Ulaner et al., 2003). Ulaner et al. (2003) proposed a model for H19 and IGF2 involving a CTCF-binding site seated between the two genes, which could facilitate the blocking of IGF2 or transcription of H19 depending on its methylation status. However, this model suggested that H19 may simply serve as a marker for epigenetic disruptions and left open the question of what H19's actual function is.

More recent studies have demonstrated a role for H19 in epigenetic regulation. H19 binds to several epigenetic regulators including S-adenosylhomocysteine hydrolase (SAHH), methyl-CpG–binding domain protein 1 (MBD1), and enhancer of zeste homolog 2 (EZH2) (Raveh et al., 2015; Zhou et al., 2015). H19 was found to inhibit SAHH, which led to downregulation of DNMT3B-mediated methylation. MBD1 binds methylated DNA and recruits other proteins to mediate transcriptional repression or histone methylation, and H19 was shown to recruit this protein to several genes including IGF2 (Monnier et al., 2013). Finally, EZH2 is a histone methyltransferase that forms part of the Polycomb repressive complex 2 (PRC2) (Sauliere et al., 2006; Zhou et al., 2015).

H19 also plays a role in maintaining cells in an undifferentiated state by associating with the KH-type splicing regulatory protein (KSRP). When multipotent mesenchymal cells were induced, H19 was found to dissociate from KSRP to promote several of its functions, including the decay of unstable mRNAs and increasing the expression of specific miRNAs involved in proliferation and differentiation through association with Drosha and Dicer.

H19's role in cancer has been emphasized by studies highlighting its relationship to the tumor suppressor p53. The H19 locus reciprocally regulates p53, as p53 suppresses H19 transcription and H19 can inactivate p53 by directly interacting with it (Yang et al., 2012). Notably, H19 also encodes a miRNA in its first exon, miR-675, which suppresses p53 and several other targets including Rb, Igf1r, and several SMAD and cadherin genes. In the absence of functional p53, H19 was shown to promote tumor proliferation and survival under hypoxic conditions. Later studies in colorectal cancer showed that H19 could induce EMT by acting as a ceRNA (Liang et al., 2015). ceRNA function has recently been shown in a retinoblastoma study, targeting the miR-17~92 cluster (Zhang A. et al., 2018a). This study found that H19 contained seven functional binding sites for miR-17~92 and was able to sponge miR-17~92 activity. This led to a de-repression of genes such as *p21* and the STAT3 targets *BCL2*, *BCL2L1*, and *BIRC5*.

In a review, Raveh et al. (2015) proposed that H19 behaves differently depending on the developmental stage of the cell, which could explain the evidence suggesting both oncogenic and tumor-suppressive roles. The authors suggested that H19 functions as a promoter of differentiation during the embryonic period and that absence of H19 at this stage could leave cells vulnerable to forming cancer, so that it seemingly acts as a tumor suppressor. However, in adult cells, where it is not normally expressed, H19 could function as an oncogene by promoting tumor survival and metastasis (Matouk et al., 2015).

### TUG1 Regulates Transcription Factors Through Competition With miRNAs in Osteosarcoma

Recent studies have investigated the role of lncRNA TUG1 as a prognostic factor and ceRNA in osteosarcoma. Ma et al. (2016) identified a correlation between upregulation of TUG1 and poor prognosis and metastasis, which was also evident in plasma, and suggested a potential use as a biomarker for patients with osteosarcoma. TUG1 is known to act through ceRNA activity against a number of miRNAs including miR-9, miR-132, miR-144, miR-153, miR-212, and miR-335 (Xie et al., 2016; Cao et al., 2017; Wang et al., 2017a; Li G. et al., 2018; Li H. et al., 2018). These miRNAs are known to regulate pathways involved in proliferation, cell cycle control, migration, and apoptosis. For example, TUG1 was shown to mediate de-repression of the transcription factor POU class 2 homeobox 1 (POU2F1) via downregulation of miR-9 (**Figure 1E**) (Xie et al., 2016). POU2F1 itself participates in various cellular processes including growth, metabolism, stem cell identity, and metastasis (Vázquez-Arreguín and Tantin, 2016). In another example, Cao et al. (2017) found that TUG1 also regulates migration and the epithelial-to-mesenchymal transition via ceRNA action on miR-144-3p. miR-144-3p is a regulator of EZH2, and upregulation of EZH2 induced cell migration through the Wnt/β-catenin pathway (Cao et al., 2017). Studies have also demonstrated direct interactions between TUG1 and the Polycomb repressor complex; however, to our knowledge, this has not been investigated in pediatric solid tumors (Yang et al., 2011).

### Other Long Non-Coding RNAs in Pediatric Solid Tumors

In addition to those mentioned above, there are a number of other lncRNAs that have been identified as potential oncogenes or tumor suppressors involved in the pathogenesis of pediatric solid tumors (see **Table 2**) (Chen et al., 2017; Pandey et al., 2015).

For example, in osteosarcoma, lncRNAs HOTAIR, SNHG16, SNHG12, THOR, PACER, MFI2, and HOTTIP have all been shown to promote tumor or cell growth (Li et al., 2016; Qian et al., 2016; Ruan et al., 2016; Yin et al., 2016; Chen W. et al., 2018; Su et al., 2019; Wang et al., 2019). HOTAIR is known to play a role in chromatin regulation by acting as a scaffold for PRC2 and lysine-specific histone demethylase 1 (LSD1), and can also act as a ceRNA for miR-217 (Gupta et al., 2010; Tsai et al., 2010; Wang et al., 2019). SNHG16 has been shown to act as a ceRNA for several miRNAs including miR-205 and miR-340 (Zhu C. et al., 2018; Su et al., 2019). Additionally, several lncRNAs with potential tumor-suppressive activity are downregulated in osteosarcoma, such as loc285194, MEG3, and TUSC7 (Pasic et al., 2010; Cong et al., 2016; Shi et al., 2018). loc285194 has been identified as a transcriptional target of p53 and can downregulate miR-211 (Liu Q. et al., 2013). In another study, increased p53 expression and a decrease in cell proliferation and invasion were observed when MEG3 was overexpressed (Shi et al., 2018). Furthermore, MEG3 was found to be downregulated by another lncRNA, EWSAT1, which had previously been shown to enhance cell proliferation and metastasis in both osteosarcoma and Ewing sarcoma (Howarth et al., 2014; Sun et al., 2016).

TABLE 2 | lncRNAs that play a role in pediatric solid tumors. OS, osteosarcoma; RB, retinoblastoma; NB, neuroblastoma; WT, Wilms tumor; HB, hepatoblastoma; RMS, rhabdomyosarcoma; ES, Ewing sarcoma.

In retinoblastoma, HOTAIR, THOR, and MEG3 appear to have a similar influence to that seen in osteosarcoma, acting as oncogenes (HOTAIR and THOR) or tumor suppressors (MEG3) (Gao et al., 2017; Shang, 2018; Yang G. et al., 2018). In the study examining HOTAIR in retinoblastoma, HOTAIR was shown to be engaged in a reciprocal regulatory loop with miR-613 and promoted cell proliferation and activation of EMT, potentially through upregulation of N-cadherin, vimentin, and α‐SMA (Yang G. et al., 2018). Several lncRNAs have also been found to act as oncogenic ceRNAs, including XIST, DANCR, and HOXA11-AS (Hu et al., 2018; Wang J. X. et al., 2018; Han et al., 2019). Finally, PANDAR is upregulated in retinoblastoma and may regulate cell proliferation and apoptosis via the Bcl-2/caspase-3 pathway (Sheng et al., 2018).

A number of studies have also suggested an important role for lncRNAs in neuroblastoma. For example, lncUSMycN is an lncRNA that is frequently co-amplified alongside MYCN (Liu P. Y. et al., 2016). Liu et al. found that in neuroblastoma, lncUSMycN could upregulate MYCN through transcriptional activation of NCYM (a.k.a. MYCNOS), which codes for a protein that stabilizes MYCN (Suenaga et al., 2014). NCYM RNA has also been suggested to bind to the RNA-binding protein NonO, which is also known to upregulate MYCN expression (Liu et al., 2014; Liu P. Y. et al., 2016). Neuroblastoma associated transcript-1 (NBAT-1) is an epigenetic regulator that interacts with EZH2 and functions as a tumor suppressor due to its important role in neuronal differentiation (Pandey et al., 2014). Loss of NBAT-1 expression was found to increase cell proliferation and invasion (Pandey et al., 2014). Finally, an isoform of lncRNA CASC15, CASC15-S, was also implicated as a key element in neuronal differentiation, and low expression was associated with a poor outcome in patients (Russell et al., 2015).

In Wilms tumor, a study by Zhu et al. identified LINC00473 as an oncogenic lncRNA that is upregulated in unfavorable tumors (Zhu et al., 2018b). LINC00473 was shown to promote tumor growth and metastasis by acting as a ceRNA for the tumor suppressor miR-195 (Zhu et al., 2018b).

A study by Dong et al. identified 1,757 upregulated and 979 downregulated lncRNAs when comparing hepatoblastoma with normal tissues, suggesting that lncRNAs play a key role in this disease as well (Dong et al., 2014). The lncRNAs Colorectal Neoplasia Differentially Expressed (CRNDE) and LINC01314 have been investigated in more detail in hepatoblastoma (Dong et al., 2017; Lv et al., 2018). CRNDE is known to be frequently upregulated in hepatoblastoma, and knockdown of CRNDE activated the mTOR pathway and inhibited tumor growth and angiogenesis with a corresponding decrease in VEGFA and Ang-2 levels (Dong et al., 2017). LINC01314 was identified as a tumor suppressor, reducing proliferation and migration via downregulation of cell cycle proteins MCM7 and cyclin D1 (Lv et al., 2018).

# CONCLUDING REMARKS

It is now clear that both miRNAs and lncRNAs form integral parts of the biological networks known to be impaired in pediatric solid tumors. miRNAs such as let-7 and miR-34 are key regulators of many pediatric oncogenes including *MYC*, *MYCN*, *RAS*, and *MET* (Johnson et al., 2005; Wei et al., 2008; Buechner et al., 2011; Yan et al., 2012). Additionally, ncRNAs such as the miR-17~92 cluster, miR-9, and Malat1 also serve as downstream effectors of MYC and MYCN (Schulte et al., 2008; Ma et al., 2010; Tee et al., 2014). Many more ncRNAs participate in these and other pathways to form a highly complex regulatory network essential for maintaining an optimal cell state (see **Tables 1** and **2**). ncRNA dysregulation offers an alternative mechanism to genetic mutations and DNA methylation whereby cell development and differentiation can be disturbed. Despite the relatively rare occurrence of mutations in pediatric solid tumors, copy number variations are common and often occur at regions of the genome that harbor ncRNAs with tumor-suppressive roles (Wei et al., 2008; Powers et al., 2016). Gene expression is often imprecise; however, miRNAs provide a layer of robustness, which helps ensure that biological networks respond appropriately to signals and remain functional despite an ever-increasing cellular disorder (Ebert and Sharp, 2012). lncRNAs, too, play a vital role in maintaining order by forming RNA–protein complexes and serving as ceRNA antagonists against miRNA-mediated repression, although much more work is needed in this field to fully comprehend their range of biological roles. Functional studies have revealed that dysregulation of ncRNAs is capable of driving progenitor cells towards oncogenesis. For example, this has been shown in retinoblastoma, where overexpression of the miR-17~92 cluster could drive tumor formation in *RB/p107*-deficient mice (Conkrite et al., 2011).

While genome-wide association studies have revealed that miRNA processing is frequently disrupted in Wilms tumor, this has not been shown to the same extent in other pediatric solid tumors. However, genetic mutations of protein-coding genes are only one way in which disruptions of miRNA processing can be revealed. Most miRNA studies ignore the fact that a high proportion of expressed miRNAs are isoforms (isomiRs). isomiRs originating from the same miRNA gene can possess a great deal of functional variability, with differences in target acquisition or turnover rate that can have a significant impact on overall gene regulation. Studies focusing on isomiR expression will provide an additional layer of resolution to our understanding of miRNA dysregulation.

Recent developments in single-cell technology have revealed heterogeneity in gene expression profiles among individual cells in many cancers such as glioblastoma and neuroblastoma (Patel et al., 2014; Boeva et al., 2017). Such studies suggest that many tumors comprise different cellular subtypes with unique phenotypes such as growth rate, drug resistance, and metastatic potential, which demand a new way of approaching cancer treatments. miRNA expression in pediatric solid tumors may also be heterogeneous; however, limitations in single-cell technologies have left this avenue relatively unexplored, and further developments are needed.

So far, ncRNA research has played a key role in advancing our understanding of the mechanisms behind pediatric solid tumor development. Evidence supports an active role for ncRNAs in cancer that extends beyond mere passengers. However, continued research is needed to fully comprehend the molecular events leading to the development of cancer and unlock new possibilities for drug targets and biomarkers, which will ultimately lead to a better outcome for patients afflicted by these diseases.

## REFERENCES


# AUTHOR CONTRIBUTIONS

CS collected the information and wrote the review. DC and GH provided guidelines, consulted, and edited the manuscript.

# FUNDING

This work was supported by the Australian Research Council DP180100120 project grant.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Smith, Catchpoole and Hutvagner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
