# MOBILE GENETIC ELEMENTS IN CELLULAR DIFFERENTIATION, GENOME STABILITY, AND CANCER

EDITED BY: Tammy A. Morrish and Jose Luis García Pérez PUBLISHED IN: Frontiers in Chemistry and Frontiers in Molecular Biosciences and Frontiers in Cell and Developmental Biology

#### *Frontiers Copyright Statement*

*© Copyright 2007-2018 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

> *The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88945-389-4 DOI 10.3389/978-2-88945-389-4



### Topic Editors:

**Tammy A. Morrish,** Independent Researcher, Ann Arbor, MI, United States

**Jose Luis García Pérez,** MRC Institute of Genetics and Molecular Medicine (IGMM), University of Edinburgh, United Kingdom; Center for Genomics and Oncological Research (GENYO), Pfizer, University of Granada, Andalusian Regional Government, Spain

Image: hermesc/Shutterstock.com

The human genome, like the genomes of most organisms, comprises various types of repeats derived from mobile genetic elements. Mobile genetic elements that mobilize via an RNA intermediate include both autonomous and non-autonomous retrotransposons; they mobilize by a "copy and paste" mechanism that relies on the presence of a functional reverse transcriptase activity. The extent to which these different types of elements are actively mobilizing varies among organisms, as revealed with the advent of next-generation DNA sequencing (NGS).

Understanding the normal and aberrant mechanisms that affect the mobility of these elements requires a more extensive understanding of how they interact with molecular pathways of the cell, including DNA repair, recombination, and chromatin. In addition, epigenetic mechanisms can also influence the mobility of these elements, likely by transcriptional activation or repression in certain cell types. Studies of how mobile genetic elements interface and evolve with these pathways will rely on genomic studies from various model organisms. In addition, the mechanistic details of how these elements are regulated will continue to be elucidated with the use of genetic, biochemical, molecular, cellular, and bioinformatic approaches. Remarkably, the current understanding of the biology of these elements in the human genome suggests that they may impact developmental biology, including cellular differentiation, neuronal development, and immune function. Thus, aberrant changes in these molecular pathways may also impact disease, including neuronal degeneration, autoimmunity, and cancer.

**Citation:** Morrish, T. A., Pérez, J. L. G., eds. (2018). Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-389-4

# Table of Contents

### **1. Editorial**

*Editorial: Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer*

Tammy A. Morrish and Jose L. Garcia-Pérez

### **2. Retrotransposons in Somatic and Cancer Cells**


Ilaria Sciamanna, Chiara De Luca and Corrado Spadafora

*Crossing the LINE Toward Genomic Instability: LINE-1 Retrotransposition in Cancer*

Jacqueline R. Kemp and Michelle S. Longworth


Yasuo Ariumi


Javier G. Pizarro and Gaël Cristofari

### **3. Retrotransposons in Evolution**


Anton A. Buzdin, Vladimir Prassolov and Andrew V. Garazha

# Editorial: Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer

Tammy A. Morrish<sup>1</sup>\* and Jose L. Garcia-Pérez<sup>2,3,4</sup>

<sup>1</sup> Independent Researcher, Ann Arbor, MI, United States, <sup>2</sup> MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine (IGMM), University of Edinburgh, Edinburgh, United Kingdom, <sup>3</sup> Junta de Andalucía de Genómica e Investigación Oncológica (GENYO), Granada, Spain, <sup>4</sup> Centre for Genomics and Oncological Research, University of Granada, Granada, Spain

Keywords: mobile DNA, reverse transcriptase, genome stability, cellular differentiation, model organisms, retrotransposon, transposon, DNA repair

Edited and reviewed by:

Cecilia Giulivi, University of California, Davis, United States

> \*Correspondence: Tammy A. Morrish morrisht@gmail.com

#### Specialty section:

This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry

Received: 23 October 2017 Accepted: 20 November 2017 Published: 04 December 2017

#### Citation:

Morrish TA and Garcia-Pérez JL (2017) Editorial: Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer. Front. Chem. 5:108. doi: 10.3389/fchem.2017.00108

**Editorial on the Research Topic**

### **Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer**

The human genome, like the genomes of most organisms, harbors various types and abundances of transposable element-derived repeats (Lander et al., 2001; Waterston et al., 2002). The topic "Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer" includes a collection of original research articles and reviews that address the impact of reverse transcriptases, including those encoded by transposable elements, on both basic biological mechanisms and disease. In 1970, the discovery of reverse transcriptases, or RNA-dependent DNA polymerases, was reported by two different laboratories (Baltimore, 1970; Temin and Mizutani, 1970). Since then, numerous studies of retroviral reverse transcriptases have contributed significantly to the characterization and biology of many different retroviruses and retroelements. These studies continue to be of interest for the prevention and treatment of various retrovirus-induced human diseases and for the basic understanding of the origin of retroviruses. In addition, the knowledge of reverse transcription has been harnessed for basic use in molecular biology and other applications, including recent widely used methods such as RNAseq. Whereas retroviral reverse transcriptases are considered exogenously derived, the subsequent discovery in 1985 of telomerase, considered an endogenous RNA-dependent DNA polymerase, has contributed significantly to the understanding of one of the predominant mechanisms of telomere maintenance, which operates in most, but not all, organisms with linear chromosomes (Greider and Blackburn, 1985; Biessmann et al., 1990). Yet sequences encoding endogenous RNA-dependent DNA polymerases are not limited to telomerase.
The isolation and subsequent genetic, biochemical, and molecular characterization of human full-length non-Long Terminal Repeat (LTR) retrotransposons, termed **L**ong **In**terspersed **E**lements (LINE-1), demonstrated that these elements encode a reverse transcriptase activity (Dombroski et al., 1991; Mathias et al., 1991; Feng et al., 1996; Moran et al., 1996). Non-LTR retrotransposons are not limited to the human genome, and are present as full-length and/or truncated, rearranged, inactive remnants in many other genomes. In addition, the reverse transcriptases encoded by non-LTR retrotransposons share sequence identity with many other reverse transcriptases (Nakamura et al., 1997; Malik et al., 1999). Furthermore, non-LTR retrotransposons rely on the encoded reverse transcriptase for integration, typically by target-primed reverse transcription (TPRT), which was initially biochemically defined using the non-LTR retrotransposon R2Bm from *Bombyx mori* (Luan et al., 1993). A review by Onozawa and Aplan included in this topic describes two different types of LINE-1 reverse transcriptase-mediated template sequence insertion polymorphisms (TSIPs), or integration structures that are polymorphic in the human genome (Onozawa and Aplan). The characteristics of class 2 structures allude to the occurrence of additional integration mechanisms by the LINE-1 reverse transcriptase that may occur in germ cells or during embryogenesis (Onozawa and Aplan). Of note, the features described in these class 2 structures are consistent with previous reports of endonuclease-independent LINE-1 retrotransposition (Eickbush, 2002; Morrish et al., 2002).

Phylogenetic analysis of the reverse transcriptase domains supports the idea that retroviruses and telomerase evolved from non-LTR retrotransposons, due to the gain or loss of LTR sequence and/or sequences encoding specific domains (Xiong and Eickbush, 1988; Malik et al., 1999). These early phylogenetic studies are consistent with the protovirus hypothesis proposed by Temin, that (1) retroviruses are likely derived from endogenous retrotransposons and (2) mutations that arise due to the mobility of retrotransposons could potentially activate oncogenes or inactivate tumor suppressor genes, perhaps contributing to tumorigenesis (Temin, 1971; Shimotohno et al., 1980). As LINE-1 elements are active in tumors, yet transcriptionally repressed in many somatic cell types, there has been much interest in understanding the extent to which LINE-1 retrotransposition contributes to tumorigenesis (Solyom et al., 2012; Shukla et al., 2013; Doucet-O'Hare et al., 2015; Ewing et al., 2015; Rodic et al., 2015). Included in this topic is original research using bioinformatic approaches to examine LINE-1 expression and insertion profiles using RNAseq data from normal and primary tumor samples from The Cancer Genome Atlas (TCGA) (Clayton et al.). Here, the authors examined the expression and integration differences in breast invasive carcinoma, head and neck squamous cell carcinoma, and lung adenocarcinoma. Their analysis indicates two cases of LINE-1-mediated insertions near two different tumor suppressor genes, including an Alu insertion into the CBL gene in breast invasive carcinoma and a LINE-1 insertion into the first exon of the BAALC gene in a head and neck squamous cell carcinoma. Again, these findings are consistent with the protovirus hypothesis. However, these tumors may also harbor mutations in "host" genes that regulate LINE-1 retrotransposition.
A number of reviews included in this topic address recent studies on LINE-1 retrotransposition in cancer (Honda; Kemp and Longworth; Sciamanna et al.). In addition, identifying cellular genes and pathways that regulate LINE-1 transcription and activity is an active area of research, and two reviews discuss the current understanding of the regulation of LINE-1 retrotransposition in somatic cells, which may become dysregulated in cancer (Ariumi; Pizarro and Cristofari). The topic also includes two original research articles on the impact of endogenous retroviruses on genome evolution. In the article by Irie et al., the authors use dN/dS analysis and molecular approaches to validate their findings regarding the contribution of the sushi-ichi retrotransposon during the evolution of the zinc finger protein-encoding gene SIRH11/ZCCHC16 and the impact of this gene during eutherian brain evolution. In addition, another research article examines the evolution of the Tbx6 transcription factor binding sites (ORRA1-ORRA1D), which are LTRs derived from the endogenous retroviruses, MaLRs (Yasuhiko et al.). The authors examine the impact on transcription of genes harboring these Tbx6 binding sites, using the Tbx6 knockout mouse. Their findings are coupled with biochemical and bioinformatic approaches. Finally, two reviews describe the host cellular factors that impact the transcriptional dynamics of ERVs in the human genome (Buzdin et al.; Meyer et al.).

Overall, the articles received for this topic, "Mobile Genetic Elements in Cellular Differentiation, Genome Stability, and Cancer," predominantly focus on the evolution of endogenous reverse transcriptases (RT), including the LINE-1 encoded RT, and the endogenous retroviruses ERVs and MaLR. These articles also summarize the findings in the field regarding these reverse transcriptases in normal biology and disease. These summaries and newly reported findings are consistent with the protovirus hypothesis (Temin, 1971; Shimotohno et al., 1980; Shimotohno and Temin, 1981). Identification of additional host factors and cellular pathways that contribute to LINE-1 retrotransposition will help further elucidate the protovirus hypothesis, as not all LINE-1 insertions occur in tumor suppressor genes or oncogenes. In addition, further studies of exogenous and endogenous reverse transcriptases will continue to shed light on the growing knowledge surrounding reverse transcription in the RNA world.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

Funding was provided by a Howard Temin Pathway to Independence Award, Grant Number K99/R00CA154889 from the National Cancer Institute (TM) and the deArce Koch Memorial Endowment Fund from the University of Toledo (TM). JG-P's lab is supported by CICE-FEDER-P12-CTS-2256, Plan Nacional de I+D+I 2008–2011 and 2013–2016 (FIS-FEDER-PI14/02152), PCIN-2014-115-ERA-NET NEURON II, the European Research Council (ERC-Consolidator ERC-STG-2012-233764), by an International Early Career Scientist grant from the Howard Hughes Medical Institute (IECS-55007420), by The Wellcome Trust-University of Edinburgh Institutional Strategic Support Fund (ISFF2) and by a private donation by Ms. Francisca Serrano (Trading y Bolsa para Torpes, Granada, Spain).

### REFERENCES

Baltimore, D. (1970). RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature 226, 1209–1211. doi: 10.1038/2261209a0

Biessmann, H., Mason, J. M., Ferry, K., d'Hulst, M., Valgeirsdottir, K., Traverse, K. L., et al. (1990). Addition of telomere-associated HeT DNA sequences "heals" broken chromosome ends in Drosophila. Cell 61, 663–673. doi: 10.1016/0092-8674(90)90478-W


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Morrish and Garcia-Pérez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Patterns of Transposable Element Expression and Insertion in Cancer

Evan A. Clayton<sup>1,2</sup>, Lu Wang<sup>3,4</sup>, Lavanya Rishishwar<sup>3,4,5</sup>, Jianrong Wang<sup>6</sup>, John F. McDonald<sup>1,2</sup> and I. King Jordan<sup>3,4,5</sup>\*

*<sup>1</sup> Integrated Cancer Research Center, School of Biology, Georgia Institute of Technology, Atlanta, GA, USA, <sup>2</sup> Ovarian Cancer Institute, Atlanta, GA, USA, <sup>3</sup> School of Biology, Georgia Institute of Technology, Atlanta, GA, USA, <sup>4</sup> PanAmerican Bioinformatics Institute, Cali, Colombia, <sup>5</sup> Applied Bioinformatics Laboratory, Atlanta, GA, USA, <sup>6</sup> Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA*

Human transposable element (TE) activity in somatic tissues causes mutations that can contribute to tumorigenesis. Indeed, TE insertion mutations have been implicated in the etiology of a number of different cancer types. Nevertheless, the full extent of somatic TE activity, along with its relationship to tumorigenesis, has yet to be fully explored. Recent developments in bioinformatics software make it possible to analyze TE expression levels and TE insertional activity directly from transcriptome (RNA-seq) and whole genome (DNA-seq) next-generation sequence data. We applied these new sequence analysis techniques to matched normal and primary tumor patient samples from the Cancer Genome Atlas (TCGA) in order to analyze the patterns of TE expression and insertion for three cancer types: breast invasive carcinoma, head and neck squamous cell carcinoma, and lung adenocarcinoma. Our analysis focused on the three most abundant families of active human TEs: Alu, SVA, and L1. We found evidence for high levels of somatic TE activity for these three families in normal and cancer samples across diverse tissue types. Abundant transcripts for all three TE families were detected in both normal and cancer tissues along with an average of ∼80 unique TE insertions per individual patient/tissue. We observed an increase in L1 transcript expression and L1 insertional activity in primary tumor samples for all three cancer types. Tumor-specific TE insertions are enriched for private mutations, consistent with a potentially causal role in tumorigenesis. We used genome feature analysis to investigate two specific cases of putative cancer-causing TE mutations in further detail. An Alu insertion in an upstream enhancer of the *CBL* tumor suppressor gene is associated with down-regulation of the gene in a single breast cancer patient, and an L1 insertion in the first exon of the *BAALC* gene also disrupts its expression in head and neck squamous cell carcinoma.
Our results are consistent with widespread somatic activity of human TEs leading to numerous insertion mutations that can contribute to tumorigenesis in a variety of tissues.

Keywords: LINE-1, L1, Alu, SVA, retrotransposons, bioinformatics, mutation, tumorigenesis

### Edited by:

*Tammy A. Morrish, Formerly affiliated with University of Toledo, USA*

#### Reviewed by:

*David Ray, Mississippi State University, USA; David E. Symer, Ohio State University Comp. Cancer Ctr., USA; Tara Theresa Doucet-O'Hare, National Institutes of Health, USA*

> \*Correspondence: *I. King Jordan king.jordan@biology.gatech.edu*

#### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Molecular Biosciences*

Received: *24 August 2016* Accepted: *31 October 2016* Published: *16 November 2016*

#### Citation:

*Clayton EA, Wang L, Rishishwar L, Wang J, McDonald JF and Jordan IK (2016) Patterns of Transposable Element Expression and Insertion in Cancer. Front. Mol. Biosci. 3:76. doi: 10.3389/fmolb.2016.00076*

## INTRODUCTION

More than 50% of the human genome sequence is derived from transposable element (TE) insertions (Lander et al., 2001; de Koning et al., 2011). The vast majority of TE-derived sequences in the human genome correspond to relatively ancient insertions that are no longer capable of transposition (Mills et al., 2007). However, there are several families of human TEs that remain active to this day. The most abundant families of active TEs in the human genome are the Alu and SVA short interspersed nuclear elements (SINEs) along with the L1 Long Interspersed Nuclear Element (LINE) family (Kazazian et al., 1988; Batzer and Deininger, 1991; Batzer et al., 1991; Brouha et al., 2003; Ostertag et al., 2003; Wang et al., 2005). Alu and SVA SINEs are non-autonomous TEs that are mobilized via the transpositional machinery encoded by the autonomous L1 family of LINEs. Recent evidence indicates that a handful of HERV-K endogenous retroviral elements also remain active in the human genome (Wildschutte et al., 2016).

Active TE families are of great interest since they have the ability to generate de novo mutations, many of which have been linked to human disease (Hancks and Kazazian, 2012; Solyom and Kazazian, 2012). For instance, TE insertions have been shown to contribute to the etiology of a variety of different cancer types (Belancio et al., 2010a; Carreira et al., 2014). Numerous recent studies have used a combination of next-generation sequence analysis, followed by validation with PCR and/or Sanger sequencing, to elucidate connections between TE activity and cancer (Solyom et al., 2012; Shukla et al., 2013; Tubio et al., 2014; Doucet-O'Hare et al., 2015; Ewing et al., 2015). L1 insertions in particular have been implicated as potential cancer-causing mutations in those and other studies (Morse et al., 1988; Miki et al., 1992; Iskow et al., 2010; Lee et al., 2012; Scott et al., 2016). L1 activity is thought to promote tumor development by causing genomic instability, via impaired chromosomal pairing during mitosis, and/or by disrupting coding or regulatory sequences (Kemp and Longworth, 2015).

Many of the studies that have related TEs to cancer have considered TE expression, at the transcript or protein level, and TE insertional activity separately. A number of different cancer types are positive for L1 transcript expression (Belancio et al., 2010b), and L1 proteins have been shown to be ubiquitously expressed in both normal and tumor samples from the same individuals (Bratthauer and Fanning, 1992, 1993; Bratthauer et al., 1994; Asch et al., 1996; Doucet-O'Hare et al., 2015, 2016). There is also evidence suggesting that L1 protein expression can be limited to tumor tissues and thereby serve as a useful cancer biomarker; nearly half of all human cancers are exclusively immunoreactive for L1-ORF1 encoded proteins (Rodic et al., 2014). The expression of L1 proteins in tumors has been shown to affect the expression of a number of cancer-related genes, including the down-regulation of tumor suppressors (Rangasamy et al., 2015). With respect to TE insertional activity, studies on matched normal and tumor tissues have found that novel L1 insertions occur at high frequencies in lung cancer genomes (Iskow et al., 2010). Such insertions frequently occur in oncogenes and tumor suppressors, underscoring their putative role in tumorigenesis (Lee et al., 2012).

A principal challenge when interpreting cancer genomes is distinguishing between so-called passenger and driver mutations. While passenger mutations are present in cancer genomes, they are not considered to contribute to cancer progression; instead, they are simply somatic mutations that arise during carcinogenesis and are carried along during clonal expansion. Driver mutations, on the other hand, are causal mutations that are directly implicated in carcinogenesis and the promotion of cancer growth (Stratton et al., 2009; Marx, 2014; Pon and Marra, 2015). To date, only a few studies have directly implicated TE insertions as cancer driver mutations. One such study analyzed 19 hepatocellular carcinoma genomes utilizing the RC-Seq methodology (Baillie et al., 2011) and discovered two separate L1 insertions that initiate tumorigenesis via distinct oncogenic pathways (Shukla et al., 2013). This study found L1 insertions in two different tumor suppressor genes: Mutated in Colorectal Cancers (MCC) and Suppression of Tumorigenicity 18 (ST18). Most recently, a role for L1 insertional activity was conclusively demonstrated for colorectal cancer caused by an insertion in the APC tumor suppressor gene (Scott et al., 2016). This paper describes a somatic L1 insertion into one copy of the APC gene that, when coupled with a point mutation in the other copy of the gene, initiates tumorigenesis through the two-hit colorectal cancer pathway.

Owing to parallel developments in genomics and bioinformatics, it is now possible to jointly analyze the patterns of TE transcript expression and TE insertional activity in human cancers. The Cancer Genome Atlas (TCGA) provides access to both transcriptome sequence data (RNA-seq) and whole genome sequence data (DNA-seq) for a number of matched normal and primary tumor sample pairs from individual patients (Weinstein et al., 2013). In addition, recently developed bioinformatics algorithms allow for the detection of TE transcripts directly from RNA-seq data (Jin et al., 2015) as well as for the characterization of novel TE insertions from DNA-seq data (Thung et al., 2014; Sudmant et al., 2015). We took advantage of these developments in order to evaluate the patterns of both TE expression and insertional activity in three cancer types: breast invasive carcinoma, head and neck squamous cell carcinoma, and lung adenocarcinoma (**Figure 1** and Supplementary Figure 1). We observed a simultaneous increase of L1 transcript expression and L1 insertional activity for primary tumor samples for all three cancers, and we evaluate individual cases of TE insertions that are implicated as potential cancer-causing mutations.

## MATERIALS AND METHODS

### Genome and Transcriptome Sequence Data

Whole genome sequence data (DNA-seq), transcriptome sequence data (RNA-seq), and patient metadata for matched normal and primary tumor tissue samples from nine cancer patients were acquired from TCGA (Weinstein et al., 2013) via the Cancer Genomics Hub (CGHub) using the download client GeneTorrent (Maltbie et al., 2013). The nine participants included three breast invasive carcinoma patients, three head and neck squamous cell carcinoma patients, and three lung adenocarcinoma patients (**Table 1**). DNA-seq and RNA-seq data were accessed as BAM files of paired-end Illumina sequence data aligned against the human genome reference sequence (build hg19). BAM files containing sequence alignments were validated for quality using FASTQC (Andrews, 2011), and autosomes were extracted from the BAM files for downstream analysis using SAMtools (Li et al., 2009).

**FIGURE 1 |** [...] was analyzed to compare normal versus cancer expression levels, and DNA-seq data was analyzed to identify somatic TE insertion events. The main bioinformatics programs (wrench) and databases (cylinder) used for each phase of the analysis are indicated.

### Gene and Transposable Element (TE) Expression Levels

Gene and TE expression levels were measured using RNA-seq data for the nine matched normal and primary tumor tissue samples. Gene expression levels were quantified as read counts mapped to NCBI RefSeq gene annotations (Pruitt et al., 2012). TE expression levels (for Alu, L1, and SVA elements) were quantified using reads mapped to RepeatMasker annotations, which were subsequently analyzed with the TEtranscripts package (Jin et al., 2015). The TEtranscripts program uses an expectation maximization (EM) algorithm to choose optimal unique TE locations for multi-mapped reads, thereby allowing for accurate expression level measurements for active TE families. The TEtranscripts method was recently shown to yield more reliable measures of TE transcription levels compared to previously published methods, such as HTSeq-count, Cufflinks, and RepEnrich (Trapnell et al., 2010; Criscione et al., 2014; Anders et al., 2015). The L1Base database was used to identify the genomic locations of 145 full-length, intact elements from the most recently active L1 subfamily (Penzkofer et al., 2005). The set of full-length, intact L1 sequences in L1Base was generated by performing a BLAST search of human genomic DNA sequences against the L1 template sequence (Penzkofer et al., 2005). L1Base was used to facilitate measures of active L1 element expression by limiting our analysis to RNA-seq reads that map to full-length, intact L1 sequences, which retain the potential to be transpositionally active. This was done in an effort to ensure that the reads we analyzed were taken from potentially active L1 elements as opposed to older fixed elements, which could represent read-through transcripts initiated from nearby genomic promoters. The expression levels of these potentially active L1 elements were analyzed separately using the TEtranscripts method.
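The idea of using EM to handle multi-mapped reads can be illustrated with a toy sketch. This is not the TEtranscripts implementation: the function name `em_assign`, the input format (one list of candidate loci per read), and the choice to split reads fractionally while ignoring element length and alignment quality are all assumptions made for illustration.

```python
from collections import defaultdict

def em_assign(read_hits, n_iter=50):
    """Toy EM: estimate per-locus read counts when some reads map to several loci.

    read_hits is a list with one entry per read; each entry lists the
    (hypothetical) TE loci that the read aligns to.
    """
    loci = sorted({locus for hits in read_hits for locus in hits})
    n_reads = len(read_hits)
    # Start from uniform relative abundances.
    abundance = {locus: 1.0 / len(loci) for locus in loci}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read across its candidate loci
        # in proportion to the current abundance estimates.
        for hits in read_hits:
            total = sum(abundance[locus] for locus in hits)
            for locus in hits:
                counts[locus] += abundance[locus] / total
        # M-step: re-estimate abundances from the fractional counts.
        abundance = {locus: counts[locus] / n_reads for locus in loci}
    return {locus: abundance[locus] * n_reads for locus in loci}

# Three reads unique to L1_a, one unique to L1_b, two ambiguous reads.
reads = [["L1_a"]] * 3 + [["L1_b"]] + [["L1_a", "L1_b"]] * 2
estimates = em_assign(reads)
```

In this example the ambiguous reads are apportioned mostly to the locus with more uniquely mapped reads, which is the intuition behind EM-based assignment.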

Differential expression levels between normal and cancer tissue pairs, for genes and TEs, were evaluated by comparing distributions of log<sub>10</sub>-transformed RNA-seq expression levels characterized as described above. The statistical significance levels of the observed differential expression between normal and cancer pairs were evaluated by comparing these distributions using the non-parametric Kolmogorov-Smirnov test. Statistical comparisons were done separately for each tissue (cancer) type: breast invasive carcinoma, head and neck squamous cell carcinoma, and lung adenocarcinoma.
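A minimal sketch of such a comparison is given below, assuming expression levels are available as plain lists of counts. The helper names, the pseudo-count of 1 added before the log transform, and the asymptotic p-value approximation are assumptions for illustration; the software actually used by the authors is not specified here, and the asymptotic p-value is only rough for small sample sizes.

```python
import math

def log10_expr(counts, pseudo=1.0):
    """Log10-transform counts; the pseudo-count is an assumption to avoid log(0)."""
    return [math.log10(c + pseudo) for c in counts]

def ks_two_sample(x, y):
    """Return (D, approximate p-value) for the two-sample Kolmogorov-Smirnov test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk through the pooled values and track the largest gap
    # between the two empirical cumulative distribution functions.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        return d, 1.0  # the series below does not converge well near zero
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical expression levels for one tissue type.
normal = log10_expr([10, 20, 30])
tumor = log10_expr([100, 200, 300])
d_stat, p_value = ks_two_sample(normal, tumor)
```

Because the statistic depends only on the ordering of the pooled values, applying the same log transform to both groups does not change D.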

### Transposable Element Insertion Detection

*<sup>a</sup>NT-D, Normal tissue DNA-seq; NT-R, Normal tissue RNA-seq; TP-D, Tumor primary DNA-seq; TP-R, Tumor primary RNA-seq.*

The genomic locations of novel TE insertions from matched normal and primary tumor tissue samples were predicted based on discordant read-pair mapping of DNA-seq data (Ewing, 2015) (**Table 2**). A scheme of our TE insertion detection analysis pipeline is shown in Supplementary Figure 2. DNA-seq BAM files were realigned according to GATK's standard indel realignment method (Van der Auwera et al., 2013) to facilitate TE insertion detection. The programs MELT (Sudmant et al., 2015) and Mobster (Thung et al., 2014) were used together for TE insertion detection. These two programs were selected owing to their previously demonstrated superior performance for human TE insertion detection (Rishishwar et al., 2016). Only TE insertion sites that were found by both methods (i.e., the intersection of the predictions) were used for subsequent analysis. TE insertion predictions made by the individual programs were considered to represent the same insertion if they were found within ±100 bp of each other. An additional filtering step was applied based on the number of mapped sequence reads (coverage) that support each TE insertion prediction. Only predictions with a minimum coverage of 5 reads and a maximum coverage of 4X the average sequencing depth of the sample were used for subsequent analysis. These upper and lower cut-off thresholds were empirically chosen based on the observed distributions of the numbers of discordant mapped read pairs used to call individual TE insertions. Read count distributions were computed individually for each program (MELT, Mobster) used and for each sample (Supplementary Figure 3). The resulting distributions were typically bimodal with a lower peak (i.e., with lower read count support) that we considered to be enriched for potential false positive TE insertion calls. The lower cut-off threshold of 5 reads was chosen to minimize such false positives, and the upper cut-off threshold was chosen to remove calls made in genomic regions that show anomalously high numbers of mapped reads, which tend to be enriched for ambiguously mapped reads.
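The intersection and coverage filters described above can be sketched in Python as follows. This is a hedged illustration, not the pipeline used in the study: the `Call` record, the function names, and the example calls are hypothetical, and two choices are assumptions that the text does not specify, namely that calls must agree on TE family to be matched and that the coverage filter is applied to each program's calls before intersecting.

```python
from typing import NamedTuple

class Call(NamedTuple):
    chrom: str
    pos: int
    te_family: str
    support: int  # number of reads supporting the insertion call

def passes_coverage(call, mean_depth, min_reads=5, max_fold=4.0):
    """Keep calls with at least 5 supporting reads and at most 4x the mean depth."""
    return min_reads <= call.support <= max_fold * mean_depth

def consensus_calls(melt, mobster, mean_depth, window=100):
    """Return MELT calls that have a Mobster call within +/- window bp."""
    melt_ok = [c for c in melt if passes_coverage(c, mean_depth)]
    mobster_ok = [c for c in mobster if passes_coverage(c, mean_depth)]
    shared = []
    for a in melt_ok:
        for b in mobster_ok:
            same_site = a.chrom == b.chrom and abs(a.pos - b.pos) <= window
            # Requiring the same TE family is an assumption of this sketch.
            if same_site and a.te_family == b.te_family:
                shared.append(a)
                break
    return shared

# Hypothetical calls for one sample sequenced at ~35X.
melt_calls = [Call("chr1", 1000, "L1", 8), Call("chr1", 5000, "Alu", 3),
              Call("chr2", 700, "L1", 200)]
mobster_calls = [Call("chr1", 1080, "L1", 6), Call("chr1", 5010, "Alu", 9),
                 Call("chr2", 705, "L1", 150)]
kept = consensus_calls(melt_calls, mobster_calls, mean_depth=35)
```

Here only the chr1:1000 L1 call survives: the Alu call has too few supporting reads in one program, and the chr2 call exceeds the upper coverage threshold (4 × 35 = 140 reads).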

TABLE 2 | Numbers of MELT and Mobster predicted TE insertions in matched normal (N) and primary tumor (T) samples across 9 individuals.

The observed and expected counts of unique L1 insertions were compared for matched normal and primary tumor tissue samples. The observed counts were taken from the TE detection pipeline, and the expected counts were computed as the ratio of unique insertions seen in matched normal vs. primary tumor tissue for all TEs multiplied by the total number of observed L1 insertions. The significance of the difference between the observed versus expected counts of unique L1 insertions was evaluated using Fisher's exact test. Counts of TE insertions for matched normal and primary tumor tissue samples were characterized based on their frequencies from the 1000 Genomes Project (1KGP) (Sudmant et al., 2015) and grouped into three distinct frequency bins. The distributions of TE insertion counts across the three frequency bins were compared for matched normal and cancer samples for the different tissue types analyzed here, and the significance of the differences between these distributions was evaluated using the Kolmogorov-Smirnov test.
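One way to read the observed-versus-expected comparison is sketched below in Python. The layout of the 2×2 table (observed and expected rows, normal and tumor columns), the rounding of expected counts to integers, and all example counts are assumptions of this sketch rather than details given in the text. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def expected_l1_counts(normal_all, tumor_all, l1_total):
    """Split the total L1 count by the normal:tumor ratio seen for all TEs."""
    exp_normal = round(l1_total * normal_all / (normal_all + tumor_all))
    return exp_normal, l1_total - exp_normal

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts, for illustration only.
obs_normal, obs_tumor = 20, 40
exp_normal, exp_tumor = expected_l1_counts(300, 393, obs_normal + obs_tumor)
p_value = fisher_exact_two_sided(obs_normal, obs_tumor, exp_normal, exp_tumor)
```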

### TE Insertion Genome Feature Analysis

The genomic locations of novel TE insertions were considered with respect to several genomic features using the BEDTools program (Quinlan, 2014): RefSeq genes (Pruitt et al., 2012), COSMIC tumor suppressor genes (Forbes et al., 2015), and enhancer elements defined by chromatin states (Roadmap Epigenomics et al., 2015). The population allele frequencies of the predicted TE insertions were computed from the Phase 3 release of the 1KGP (Sudmant et al., 2015) as previously described (Rishishwar et al., 2015).
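The kind of overlap query that BEDTools performs for this step can be illustrated with a small pure-Python sketch. It assumes BED-style 0-based, half-open intervals; the function name, coordinates, and feature names are hypothetical, and a real analysis would use BEDTools itself on the actual annotation files.

```python
def overlapping_features(insertion, features):
    """Return names of features whose [start, end) interval contains the insertion."""
    chrom, pos = insertion
    return [name for f_chrom, start, end, name in features
            if f_chrom == chrom and start <= pos < end]

# Hypothetical annotation intervals: (chrom, start, end, name).
features = [
    ("chr11", 1000, 2000, "enhancer_1"),
    ("chr11", 1500, 9000, "gene_A"),
    ("chr8", 100, 400, "gene_B_exon1"),
]
hits = overlapping_features(("chr11", 1600), features)
```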

## RESULTS AND DISCUSSION

### TE Expression Levels in Matched Normal vs. Primary Tumor Tissue Samples

RNA-seq data were used to evaluate the differences in TE expression levels between matched normal and primary tumor tissue samples as described in the Materials and Methods. The observed differences in gene expression levels between normal and tumor tissue were compared to differences in TE expression levels for breast invasive carcinoma, head and neck squamous cell carcinoma and lung adenocarcinoma. There are no significant differences observed for the distributions of gene expression levels between matched normal and primary tumor tissue pairs for any of the three cancer types analyzed here (**Figure 2**). Similarly, when all three families of potentially active TEs (Alu, L1, and SVA) are considered together, there is no significant difference seen for the overall levels of expression between matched normal and tumor tissue. However, when full-length, potentially active L1 sequences are considered alone, we observe statistically significant increases in L1 expression levels for all three cancer types.

The methods that we used to characterize TE expression levels include several analytical controls aimed to ensure that only genuine TE-initiated transcripts, from members of potentially active families, are measured. Nevertheless, the lack of a difference between normal and tumor expression levels observed when all three active TE families were considered together could reflect technical difficulties with identifying bona fide TE transcripts that are initiated from element promoters as opposed to TE sequences that are passively expressed as part of longer genic transcripts. This is particularly true for Alu elements, many of which are found in the introns of human genes and transcribed as read-through transcripts initiated from RNA Pol II gene promoters (Deininger, 2011). Our confidence in the ability to measure L1-initiated transcripts is higher owing to the focus on previously identified full-length, intact elements that are located in intergenic regions. In any case, the up-regulation of L1s in cancer that we observed has potential implications for increased TE insertional activity for all three families, since L1 encoded proteins are responsible for the cis retrotransposition of L1s as well as the trans activation of Alu and SVA elements (Batzer and Deininger, 2002; Hancks and Kazazian, 2010). We analyzed the same pairs of matched normal and primary tumor tissues to evaluate whether the observed increase in L1 expression corresponds to increased transpositional activity of human TEs.

### Novel TE Insertions in Matched Normal and Primary Tumor Tissue Samples

It is now possible to characterize the genomic locations and copy numbers of individual TE insertions from whole genome DNA-seq data owing to recent developments in computational genomics software (Ewing, 2015; Rishishwar et al., 2016). This technological advance is exemplified by the recent Phase 3 release of the 1KGP, which includes a complete genome-wide census of polymorphic TE insertion sites for 2504 individuals across 26 human populations (Sudmant et al., 2015). We analyzed whole genome DNA-seq data using computational methods for TE insertion detection (see Materials and Methods) in order to compare TE insertional activity between matched normal versus primary tumor tissue samples.

When all three families of active human TEs are considered together, we observed a total of 3672 TE insertions across the nine individuals analyzed for normal and cancer tissue pairs, 693 of which are unique insertions found in only one individual and one tissue type. In other words, we observe an average of ∼77 unique somatic TE insertions per person, i.e., "private" TE insertions. This estimate is similar to the value of ∼90 unique (presumably germline) TE insertions that we previously observed for individuals from the 1KGP (Rishishwar et al., 2015). A large majority of the observed TE insertions (81% for all TEs and 62% for L1s alone) are shared between the normal and tumor tissue types of an individual, suggesting that they represent germline insertions (**Figure 3A**). There are 1.3x more unique TE insertions seen for tumor compared to normal tissue, and this effect is more pronounced for L1s alone, which are 2x more abundant in tumor tissue samples. Accordingly, there is a statistically significant excess of observed versus expected L1 insertions in tumor versus normal tissue (P = 0.019) (**Figure 3B**). These results are consistent with a potential role for L1 transpositional activity in tumorigenesis for the cancer types analyzed here, as has been previously suggested for several different cancers (Morse et al., 1988; Iskow et al., 2010; Lee et al., 2012; Scott et al., 2016).

Given the relatively high level of L1 insertional activity in the tumor tissue samples analyzed here, we tested whether tumor-specific L1 insertions are found at lower frequencies among the (presumably) healthy donors from the 1KGP compared to L1 insertions found in matched normal tissue. The idea was to evaluate whether the tumor-specific L1 insertions represent mutations that are private, and thereby more likely to be deleterious or disease-causing. To do this, individual TE insertions were classified as high frequency (>0.05), low frequency (<0.05) or private (absent) according to their previously characterized population (allele) frequencies from the 1KGP (Rishishwar et al., 2015; Sudmant et al., 2015).
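The three-way classification can be written as a short Python sketch. The function names and example frequencies are hypothetical. Because the text does not say how an allele frequency of exactly 0.05 is handled, this sketch assigns it to the low-frequency bin, which is an assumption.

```python
from collections import Counter

def frequency_bin(allele_freq, threshold=0.05):
    """Classify an insertion by its 1KGP allele frequency (None or 0 means absent)."""
    if not allele_freq:
        return "private"
    return "high" if allele_freq > threshold else "low"

def bin_counts(allele_freqs):
    """Count insertions per frequency bin for one tissue sample."""
    return Counter(frequency_bin(f) for f in allele_freqs)

# Hypothetical allele frequencies for tumor-specific insertions.
tumor_freqs = [None, 0.0, 0.01, 0.30, None, 0.04]
tumor_bins = bin_counts(tumor_freqs)
```

The resulting per-bin counts for normal and tumor samples are the distributions that would then be compared statistically.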

When all three cancer types are considered together, there is a statistically significant excess of private and low frequency TE insertions observed for tumor compared to normal tissue (P = 1.9e-61) (**Figure 3C**). This effect is even more pronounced when L1 insertions are considered alone (P = 2.7e-23). The same pattern of an increased frequency of private L1 insertions in tumor tissue is observed (P < 2.0e-7) when all three cancer types are analyzed for sets of patients (**Figures 3D–F**) and when samples for individual patients are analyzed separately (Supplementary Figure 4). The strongest effect is seen for head and neck squamous cell carcinoma. The pattern of a significant excess of private L1 insertions in tumor compared to normal tissue, observed for all three cancer types studied here, provides further evidence in support of a possible role for L1 activity in tumorigenesis.

FIGURE 3 | … described in the Materials and Methods. (A) The total numbers of predicted TE insertions, pooled for all nine individuals over the three cancer types analyzed here, are shown for normal vs. tumor tissue. Venn diagrams show the numbers of unique versus shared TE insertions for the two tissue types. (B) Comparison of the observed versus expected numbers of unique L1 insertions for normal vs. tumor tissue. (C) Comparisons of the population frequencies of observed TE insertions in matched normal vs. tumor tissue pairs are shown for all of the TEs analyzed here and for L1s alone. (D–F) The same comparisons of TE insertion population frequencies are shown individually for each cancer type analyzed here. TE insertion population frequencies are color coded as shown in the key. *P*-values show the significance of the differences for observed distributions based on Fisher's exact test (B) and the Kolmogorov-Smirnov test (C–F).

It should be noted that TE insertions found in low copy numbers may not be detectable using next-generation sequence analysis, whereas such insertions may be uncovered using more sensitive PCR-based approaches. False negatives of this kind will be more prevalent at low levels of sequence coverage. We have tried to control for this by using relatively high sequence coverage (∼35X) data here, but the conservative lower read count cut-off of 5 reads per TE insertion call that we used may still lead to missed TE insertion calls. Sequence-based predictions can also yield false-positive TE insertion calls. In an effort to deal with this issue, we have only used high-confidence calls produced by two independent programs, MELT and Mobster, that we have recently shown to be most reliable for the detection of human TE insertions (Rishishwar et al., 2016).

One other potential problem with the sequence-based analysis relates to the base pair resolution with which TE insertions can be called via computational analysis of next-generation sequence data. Currently, the most accurate programs for calling TE insertions from next-generation sequence data do not yet allow for the insertions to be precisely located to genomic regions at single base pair resolution. To account for this fact, TE insertions called within a window of ±100 bp are considered to be co-located (Supplementary Figure 2). It is possible that this approximation can lead to multiple TE insertion events being collapsed into a single event. Subsequent experimental confirmation of individual TE insertion calls of interest (e.g., potentially tumorigenic TE insertions) should help to provide certainty with respect to both their validity and their precise genomic locations.

FIGURE 4 | Private TE insertions implicated as potential cancer driver mutations. (A) A tumor-specific Alu insertion (red) is found in a single breast cancer patient. The insertion is located within an upstream enhancer for the *CBL* gene on chromosome 11 (gene model shown in blue), as indicated by enhancer-associated chromatin marks (inset yellow bars). Presence of the Alu insertion is associated with down-regulation of *CBL* (expression levels in green). (B) A tumor-specific L1 insertion (red) is located within the first exon of the *BAALC* gene on chromosome 8 (gene model shown in blue). Co-location of the L1 insertion with promoter-associated chromatin marks (purple bars) is shown in the inset. Presence of the L1 insertion is associated with down-regulation of *BAALC* (expression levels in red).

### Potentially Tumorigenic TE Insertions

Having established a potential role for transpositional activity in tumorigenesis using the genome-wide approaches described above, we wanted to search for specific examples where individual TE insertions could be implicated as possible cancer driver mutations. To do so, we performed an integrated analysis of TE insertion, gene expression and chromatin data (see Materials and Methods) in an effort to identify the cancer-specific TE insertions that are most likely to play a causal role in tumorigenesis. We considered TE insertions that are co-located with either exons or regulatory elements of previously characterized tumor suppressor genes to have the highest likelihood of being functionally relevant. We observed a total of 141 intergenic insertions (35.9%) and 246 intronic insertions (62.6%) out of the 393 total cancer-specific insertions in our dataset. None of these intergenic or intronic cancer-specific TE insertions were found to disrupt any known functional (regulatory) sequence element. Thus, consistent with previous studies, the vast majority of TE insertions that we observed are not likely to affect gene function or expression in cancer. We did find 4 exonic TE insertions, along with 2 insertions located in regulatory elements, for known tumor suppressor genes (1.5% of the total). Here, we focus on two of these potential cases of cancer driver TE insertions, which could prove to be of interest to the TE and/or cancer research communities.
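The category counts quoted above can be checked with a few lines of Python. The counts are taken directly from the text; only the variable names are introduced here.

```python
# Cancer-specific TE insertions by genomic context, as reported in the text.
total = 393
counts = {"intergenic": 141, "intronic": 246, "exonic": 4, "regulatory": 2}

# Percentage of the total for each category, rounded to one decimal place.
percent = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Exonic and regulatory insertions in known tumor suppressor genes, combined.
candidate_percent = round(100 * (counts["exonic"] + counts["regulatory"]) / total, 1)
```

The four categories sum to the 393 cancer-specific insertions, and the six exonic or regulatory insertions account for the stated 1.5% of the total.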

There is a private, breast cancer tumor-specific Alu insertion that is located within an upstream enhancer element that helps to regulate the expression of the Cbl Proto-Oncogene (CBL) gene (**Figure 4A**). CBL is classified as a tumor suppressor gene by the COSMIC database (Forbes et al., 2015). It has been found to be mutated or translocated in a number of cancers including acute myeloid leukemia (Abbas et al., 2008; Naramura et al., 2011; Aranaz et al., 2013); mutations in CBL are also the cause of Noonan syndrome-like disorder (Martinelli et al., 2010). The CBL encoded protein functions as a negative regulator of signal transduction pathways (Schmidt and Dikic, 2005), activation of which has been associated with cancer (Sever and Brugge, 2015). The tumor-specific Alu enhancer insertion that we characterized is associated with down-regulation of CBL expression, consistent with a potential role in tumorigenesis via the activation of signal transduction pathways associated with cell proliferation (Sever and Brugge, 2015).

We also found a private L1 insertion that was unique to a head and neck squamous cell carcinoma tissue sample, located within the first exon of the Brain and Acute Leukemia, Cytoplasmic (BAALC) gene (**Figure 4B**). As its name implies, the BAALC gene is expressed in the brain and related neural tissues, and it was first identified by association with acute myeloid leukemia where it was shown to be overexpressed (Damiani et al., 2013; Zhou et al., 2015). TE insertions within exons are extremely rare and would presumably have a dramatic effect on gene function. Indeed, this particular insertion is associated with nearly complete inactivation of the BAALC gene. This is consistent with previous results showing that the presence of fixed L1 insertions genome-wide is strongly associated with the down-regulation of human gene expression (Han et al., 2004). A recent study has demonstrated that BAALC can inhibit extracellular signal-regulated kinase (ERK) mediated monocytic differentiation of AML cells (Morita et al., 2015). Thus, downregulation of BAALC would presumably result in a loss of control over cellular differentiation, consistent with a possible role in tumorigenesis. A recent study discovered a role for the change in methylation status of a cancer-specific L1 insertion in tumorigenesis (Scott et al., 2016); this could be an additional mechanism by which the BAALC L1 insertion observed here exerts a regulatory effect.

## CONCLUSION

The results of our analysis show a surprisingly high level of somatic TE activity in the human genome. Abundant transcripts from members of all three active human TE families analyzed here—Alu, SVA and L1—can be identified for both normal and cancer tissue samples. In addition, after filtering for high confidence TE insertion calls, we identified an average of close to 80 unique insertions for each tissue among the individual patients in our study. Thus, active human TE families retain the ability to transpose in somatic tissue thereby generating substantial levels of cellular heterogeneity among diverse tissues.

We also observe a correlated increase in both transcript expression levels and transpositional activity for L1 elements in cancer tissue samples when compared to matched normal tissue. Increased cancer expression of L1 elements is particularly relevant for TE insertional activity, since the L1 transpositional machinery is responsible for transposing non-autonomous Alu and SVA elements in trans along with L1 elements in cis. Our results are consistent with previous studies showing expression of L1 transcripts in lung cancer (Belancio et al., 2010b) and expression of L1 ORF1p in breast cancer (Harris et al., 2010), and tumor-specific L1 insertions have also previously been found in breast (Morse et al., 1988), head and neck (Helman et al., 2014), and lung tumors (Helman et al., 2014). We confirmed the presence of numerous tumor-specific L1 insertions in these three cancer types and identify two potentially tumorigenic TE insertions, an Alu insertion in the enhancer region of the tumor suppressor gene CBL and an L1 insertion in the first exon of the BAALC gene. These results underscore the potential for somatic TE activity to generate cellular heterogeneity and to contribute to the etiology of cancer across a wide range of human tissues.

## ETHICS STATEMENT

Ethical approval was not required for this study on restricted access, de-identified data in accordance with the guidelines of the Cancer Genome Atlas (TCGA). Access to the data was approved by the data access committee of the TCGA.

### AUTHOR CONTRIBUTIONS

EC, LW, and LR performed all of the analyses described in the study. JW contributed to the genome feature analysis. IJ and JM conceived of, designed, and supervised the study. All authors contributed to the drafting and revision of the manuscript.

### FUNDING

EC and LW were supported by the Georgia Tech Bioinformatics Graduate Program. LR and IJ were supported by the IHRC-Georgia Tech Applied Bioinformatics Laboratory (ABiL).

### ACKNOWLEDGMENTS

The results published here are in whole or part based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. Information about TCGA can be found at http://cancergenome.nih.gov. The authors thank Emily Norris for feedback on the manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmolb.2016.00076/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Clayton, Wang, Rishishwar, Wang, McDonald and Jordan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Reverse Transcriptase Encoded by LINE-1 Retrotransposons in the Genesis, Progression, and Therapy of Cancer

#### Ilaria Sciamanna<sup>1</sup>, Chiara De Luca<sup>1</sup> and Corrado Spadafora<sup>2</sup>\*

<sup>1</sup> Istituto Superiore di Sanità, Rome, Italy, <sup>2</sup> Institute of Translational Pharmacology, National Research Council of Italy, Rome, Italy

In higher eukaryotic genomes, Long Interspersed Nuclear Element 1 (LINE-1) retrotransposons represent a large family of repeated genomic elements. They transpose using a reverse transcriptase (RT), which they encode as part of the ORF2p product. RT inhibition in cancer cells, either via RNA interference-dependent silencing of active LINE-1 elements, or using RT inhibitory drugs, reduces cancer cell proliferation, promotes their differentiation and antagonizes tumor progression in animal models. Indeed, the non-nucleoside RT inhibitor efavirenz has recently been tested in a phase II clinical trial with metastatic prostate cancer patients. An in-depth analysis of ORF2p in a mouse model of breast cancer showed ORF2p to be precociously expressed in precancerous lesions and highly abundant in advanced cancer stages, while being barely detectable in normal breast tissue, providing a rationale for the finding that RT-expressing tumors are therapeutically sensitive to RT inhibitors. We summarize mechanistic and gene profiling studies indicating that abundant LINE-1-derived RT can "sequester" RNA substrates for reverse transcription in tumor cells, entailing the formation of RNA:DNA hybrid molecules and impairing the overall production of regulatory miRNAs, with a global impact on the cell transcriptome. Based on these data, LINE-1-ORF2 encoded RT has a tumor-promoting potential that is exerted at an epigenetic level. We propose a model whereby LINE1-RT drives a previously unrecognized global regulatory process, the deregulation of which drives cell transformation and tumorigenesis with possible implications for cancer cell heterogeneity.

### Edited by:

Tammy A. Morrish, Independent Investigator, USA

### Reviewed by:

Amy Hulme, Missouri State University, USA; Wenfeng An, South Dakota State University, USA

> \*Correspondence: Corrado Spadafora cspadaf@tin.it

### Specialty section:

This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry

Received: 17 August 2015 Accepted: 26 January 2016 Published: 11 February 2016

#### Citation:

Sciamanna I, De Luca C and Spadafora C (2016) The Reverse Transcriptase Encoded by LINE-1 Retrotransposons in the Genesis, Progression, and Therapy of Cancer. Front. Chem. 4:6. doi: 10.3389/fchem.2016.00006

Keywords: LINE-1 retrotransposons, reverse transcriptase, tumorigenesis, differentiation therapy, cancer heterogeneity, epigenetics

## INTRODUCTION: THE RETROTRANSPOSITION MACHINERY IN THE GENESIS OF GENOMIC AND EPIGENOMIC LANDSCAPES

The complete sequencing of the human genome has disclosed the unexpected finding that coding genes account for a mere 1.2% of the total genome, while the remaining portion is constituted by non-coding DNA (International Human Genome Sequencing Consortium, 2001). Branded by a historically "bad reputation", non-coding sequences have been defined as "junk" (Ohno, 1972) or "selfish" (Orgel and Crick, 1980) DNA, a view further strengthened by the evidence that nearly 50% of the human genome is constituted by apparently functionless transposable "genetic parasites" thought to increasingly litter all chromosomes during evolution.

Two main families of transposable elements characterize eukaryotic genomes: DNA transposons, which mobilize through a "cut and paste" mechanism (Muñoz-López and García-Pérez, 2010), and retrotransposons, which mobilize instead through "copy and paste," a process that requires the reverse transcription of RNA intermediates into cDNA copies as a preliminary step in retrotransposition (Levin and Moran, 2011), promoting the broad expansion of retroelements in eukaryotic genomes.

A key player in this mechanism is the enzyme reverse transcriptase (RT) encoded by LINE-1 retrotransposons themselves. The latter are a source of the RT activity required to promote retrotransposition in human cells (Brouha et al., 2003). LINE-1 elements actually harbor two open reading frames, ORF1 and ORF2, which respectively encode ORF1p, an RNA-binding protein, and ORF2p, with reverse transcriptase (RT) and endonuclease (EN) activities (reviewed in Babushok and Kazazian, 2007). The LINE-1-derived retrotransposition machinery, constituted by ORF1 and ORF2 proteins, has cis-preference for its own LINE-1 RNA (Esnault et al., 2000; Wei et al., 2001; Kulpa and Moran, 2006). LINE-1-derived RT is also used for retrotranscription/retrotransposition of other RNAs, including Alu elements (Dewannieux et al., 2003), SVA (SINE-R/VNTR/ALU) elements (Ostertag et al., 2003) and mRNAs that give rise to a large population of processed pseudogenes nearly as numerous as the original coding genes (Pink et al., 2011). It is now well established that RT-originating sequences contribute to shape genomes and constitute a large proportion of evolutionarily conserved chromosomal DNA, accounting altogether for nearly 50% of the human genome (International Human Genome Sequencing Consortium, 2001). Such extensive preservation suggests a functional importance of retrotransposons. Not surprisingly, retrotransposons are increasingly being implicated in fundamental genomic functions, in both normal and pathological contexts (Rebollo et al., 2012). Indeed, as highly dynamic components of genomes, they provide a relentless source of genetic and epigenetic variation and novelty (Feschotte, 2008; Bourque, 2009; Beck et al., 2011) and, in the long run, act as a major driving force in genome evolution (Oliver and Greene, 2011). A detailed description of all functional implications of retrotransposition in genome biology and evolution is beyond the scope of this article, but extensive information is discussed in excellent reviews (Feschotte, 2008; Goodier and Kazazian, 2008; Bourque, 2009; Beck et al., 2011; Oliver and Greene, 2011; Rebollo et al., 2012).

The advent of high-throughput technologies in recent years has provided an accurate localization of new genomic insertions, shifting the focus from a gene-centric to a genome-wide view. This has revolutionized the traditional paradigms of genome organization by disclosing novel and unexpectedly complex genomic landscapes. Studies now show that genomes are crowded with sequences of reverse-transcribed origin, many of which are correlated with the onset of a variety of pathologies (for a review see Hancks and Kazazian, 2012), in particular cancer (Belancio et al., 2010).

The ENCODE Project Consortium (The ENCODE Project Consortium, 2012) showed that approximately 80% of the human genome is pervasively transcribed; indeed, a significant proportion of small and long non-coding transcripts are functional components of genome-wide regulatory networks (Djebali et al., 2012). The groundbreaking finding of an astounding landscape of small RNAs has unveiled an RNA-mediated regulatory network that controls the genome architecture and transcriptomic profile (Aalto and Pasquinelli, 2012; Li, 2014), influencing a multitude of biological processes. These small RNAs are classified as microRNAs (miRNAs), endogenous small interfering RNAs (endo-siRNAs or siRNAs) (Piatek and Werner, 2014) and Piwi-interacting RNAs (piRNAs) (Kim et al., 2009), depending on their origin and the proteins they interact with. Growing data show a dual relationship between small RNAs and retroelements: on the one hand, small RNAs act as "guardians of the genome" in transposon-defense pathways aimed at repressing retroelement mobility (Yang and Kazazian, 2006; reviewed in Malone and Hannon, 2009); on the other hand, retroelements are intimately involved in their biogenesis, because a growing number of small RNAs in all three classes have a recognized retrotransposon-derived origin (Borchert et al., 2011; Watanabe et al., 2011).

Long non-coding RNAs (lncRNAs) are components of the mammalian transcriptome and constitute a heterogeneous class of thousands of polymerase II-transcribed RNA species that are polyadenylated, spliced, and mostly localized in the nucleus (reviewed by Zhang et al., 2013). A large proportion of lncRNAs, with either oncogenic or tumor suppressor roles, are constituted by antisense RNAs; the latter, together with sense transcripts, are being identified in genome-wide regulatory networks that epigenetically fine-tune genome expression, with implications in tumorigenesis, differentiation and development (reviewed in Pelechano and Steinmetz, 2013; Fatica and Bozzoni, 2014). lncRNAs also have tight connections with transposable elements of both the DNA-based and the retroelement families, which occur within nearly 80% of mature lncRNA transcripts and account for about 30–40% of total human lncRNA sequences (Kelley and Rinn, 2012; Kapusta et al., 2013). Also of RT-derived origin are a large proportion of genomic sequences highly preserved throughout evolution and classified as conserved, highly-conserved and ultra-conserved elements (UCRs), according to the level of conservation throughout species (Bejerano et al., 2004; Woolfe et al., 2005; for a recent review see Nelson and Wardle, 2013).

From an ample survey encompassing the genomes of 29 mammalian species (Lowe and Haussler, 2012), a vision of genomes emerges as complex integrated functional systems, in which a considerable proportion of non-exonic sequences were exapted from mobile element insertions (Nishihara et al., 2006; Lowe and Haussler, 2012) to assemble large-scale regulatory circuits. Deregulation of these circuits is implicated in a variety of diseases, including cancer (Esteller, 2011).

## LINE-1-ENCODED RT AS A NEW UNDERESTIMATED PLAYER IN CANCER

While retrotransposable elements are extensively studied and characterized, somewhat surprisingly the retrotransposon-encoded RT activity has long failed to attract equivalent attention. The RT encoded by infective retroviruses has actually been intensively studied since the time of its discovery in 1970 by Baltimore (1970) and Temin and Mizutani (1970), due to its clinical implications (Herschhorn and Hizi, 2010). In contrast, the endogenous cellular RT has long been overlooked, despite many clues implicating it in processes as relevant as embryogenesis and tumorigenesis. Decades after the discoveries of Baltimore and Temin, a body of evidence has shown that endogenous RT expression is developmentally modulated: low levels of RT, if any, are expressed in differentiated non-pathological tissues; increased expression is instead typical of cells characterized by low differentiation and high proliferation, e.g., early embryos (for a review see Sciamanna et al., 2011) and transformed cells (for a review see Sinibaldi-Vallebona et al., 2011). Overall, this is consistent with the notion that increased LINE-1 expression (Chen et al., 2012a; Rodić et al., 2014) and retroelement mobilization are implicated in tumorigenesis (Hancks and Kazazian, 2012; Kaer and Speek, 2013).

In contrast to differentiated quiescent cells, tissues and cells with low differentiation and high proliferation states are sites of high RT expression and provide permissive contexts for retrotransposition. Following up on that line, several studies have pursued RT inhibition in cancer cells, either using nonnucleoside RT inhibitors (nevirapine and efavirenz; Mangiacasale et al., 2003; Landriscina et al., 2005; Sciamanna et al., 2005, 2013), or RNA interference (RNAi)-mediated downregulation of RT-encoding LINE-1 elements (Sciamanna et al., 2005; Oricchio et al., 2007). In the latter case, the RNAi assays were carried out using double-stranded siRNA oligonucleotides targeted against the ORF-1 encoding domain of human full-length, highly expressed LINE-1s (Brouha et al., 2003). Both the drug-mediated and the RNAi-mediated approaches to reduce LINE-1-derived RT were found to reduce proliferation, promote differentiation and reprogram the global transcription profiles of coding and non-coding sequences in several cancer cell lines (human melanoma, glioblastoma, osteosarcoma and prostate, colon and small cell lung carcinomas). This provided early evidence for the implication of the LINE-1-encoded RT in tumorigenesis. The inhibitory effects of efavirenz on LINE-1 reverse transcription and retrotransposition were further tested in in vitro assays (Dai et al., 2011), and its antiproliferative and differentiating potential has been recently confirmed in breast (Patnala et al., 2013) and pancreatic (Hecht et al., 2015) cancer cell lines. Moreover, efavirenz treatment of mice xenografted with human tumorigenic cells caused the arrest, or a significant slowdown, of progression of several tumor types in vivo (Sciamanna et al., 2005). Importantly, RNAi-mediated LINE-1 downregulation drastically reduced the tumorigenic potential of human cancer cells in nude mice (Oricchio et al., 2007). These effects are reversible and, upon discontinuation of RT inhibitory treatments, tumor cells return to their original dedifferentiated phenotype and unrestrained proliferation capacity (Sciamanna et al., 2005); these observations provided initial hints of an epigenetic role of RT.

The high levels of RT activity found in tumor cells and tissues, reported by our (Mangiacasale et al., 2003; Gualtieri et al., 2013) and other laboratories (Patnala et al., 2013), correlate well with the enhanced rate of retrotransposition observed in many human tumors, a phenomenon that dramatically contributes to shape cancer genomes (Iskow et al., 2010; Lee et al., 2012; Solyom et al., 2012; Shukla et al., 2013; Ewing et al., 2015). In an MMTV-PyVT transgenic mouse strain (Guy et al., 1992), whose females spontaneously develop breast carcinoma, a burst in the copy number of both LINE-1 and SINE B1 elements was detected very early at tumor onset; their copy number further increases along with tumor progression (Gualtieri et al., 2013). These data converge to indicate that tumors constitute a highly permissive environment for retrotranscription, yet do not answer the question of whether overexpression and amplification of LINE-1 elements act as oncological "drivers" or as mere "passengers" (Rodić and Burns, 2013). The findings that pharmacological inhibition of RT is sufficient to reduce cancer cell proliferation, promote differentiation and antagonize tumor progression in animal models, similar to the effects obtained by RNAi-specific downregulation of LINE-1 expression, strongly support a causative role of LINE-1-encoded RT in tumorigenesis. In an applied clinical perspective, therefore, RT can be regarded as a target and RT inhibitors as potential therapeutic agents in a novel cancer differentiation therapy. Efavirenz has recently been tested in a phase II trial with metastatic prostate cancer patients, suggesting that relatively high dosage (over 600 mg per day) may be beneficial as a novel anticancer treatment (Houédé et al., 2014).

The role of RT encoded by LINE-1 in tumorigenesis is distinct from that of RT activities produced from the other two potential sources, i.e., endogenous retroviruses (HERVs) and telomerase-associated RT (TERT). First, RNAi-mediated downregulation of HERV-K expression showed negligible effects on the rate of proliferation and differentiation of cancer cells, in contrast with the dramatic effects observed after LINE-1 specific RNAi (Oricchio et al., 2007). Second, inhibitors of LINE-1 derived RT elicit rapid changes in treated cells in our experiments (Mangiacasale et al., 2003; Sciamanna et al., 2005), unlike drugs targeting telomerase, which reduce cancer cell proliferation only after a long treatment (about 120 days; Damm et al., 2001); these data therefore rule out the possibility that TERT contributes to the rapid response of cells to RT inhibitors. It should be noted, however, that LINE-1 RT is critical for telomere maintenance, given that LINE-1 knockdown in cancer cells correlates with: (i) reduced length of telomeres, (ii) decreased telomerase activity, and (iii) decreased telomerase mRNA level (Aschacher et al., 2016). Together these results reveal that LINE-1 RT has a functional impact on TERT. Thus, while TERT is not involved in the changes elicited by inhibitors targeting the retrotransposon-derived genuine RT, the level of activity of LINE-1 elements may impact on TERT. These findings again strengthen the view that LINE-1 RT is a major player in tumorigenesis.

### LINE-1 ORF2-ENCODED RT ACTIVITY IN CANCER PROGRESSION

The ORF2-encoded RT has been recently assessed for its suitability as a tumor marker (Gualtieri et al., 2013) in females of the cancer-prone MMTV-PyVT strain described above (Guy et al., 1992). In these females, breast cancer tissues withdrawn at different times after birth are representative of progressive cancer stages. ORF2p cannot be detected in normal breast tissue by immunohistochemistry (IHC), but increased expression is triggered very early in tumorigenesis, preceding the appearance of typical histological alterations and accepted cancer markers (e.g., Ki67 and epidermal growth factor receptor ERB2); further upregulation takes place during tumor growth. These findings correlate well with the notion that hypomethylated LINE-1 sequences, from which ORF2p is produced, are typical of cancer genomes and precancerous lesions compared to their normal tissue counterparts (Miousse and Koturbash, 2015).

The abundant expression of LINE-1 products in preneoplastic mammary tissues suggests an exploitable tool as a potential diagnostic biomarker for early cancer detection: the identification of cancer-prone foci marked by increased RT before the appearance of recognizable histological alterations can expand the window of opportunity for therapeutic intervention, which can possibly be most effective if associated with the development of RT inhibitory treatment. Interestingly, the abundance and subcellular localization of LINE-1 products are also proposed to have prognostic value in human metastatic breast cancer (Chen et al., 2012a).

Compelling objectives of "the war on cancer" currently include the definition of novel early markers identifying cancer-prone lesions before their spreading, as well as the development of novel therapeutic approaches in possible replacement of conventional cytotoxic chemotherapy. In a recent critical reappraisal, Hanahan (2014) has pointed out that the war on cancer, if not lost, is certainly not won and has suggested that therapeutic strategies should avoid fragmenting along multiple, highly diversified narrow paths targeting many substrates, each of which is highly selective for a specific cancer. Rather, the therapeutic "bullets" ought to hit fewer targets shared by a large spectrum of cancers (Hanahan, 2014). LINE-1 ORF2-encoded RT would fulfill these criteria, representing, at the same time, an early diagnostic cancer marker, a therapeutic target worth pursuing and the driving component of a newly emerging cancer-promoting mechanism.

### THE MOLECULAR BASES OF THE RT-DEPENDENT CANCER-PROMOTING MECHANISM

As briefly recalled above, retrotransposition events have had a fundamental role not only in shaping the genomic landscape, but also in directing regulatory networks aimed to fine-tune a variety of genomic functions. Data obtained in the last few years increasingly indicate that the retrotransposon machinery, besides being a well-known source of genomic variations caused by new insertions (Böehne et al., 2008; Bourque, 2009), also exerts a global epigenetic regulatory role on the cellular transcriptome. LINE-1 ORF2-encoded RT is a new player in this mechanism.

Prompted by the finding that tumor cell lines are endowed with abundant LINE-1-encoded RT, Sciamanna et al. (2013) began to address the mechanism through which RT might act by comparing the global transcription profile of melanoma cells before and after RT inhibition by microarray analysis. The results showed that RT inhibition modulates the expression of a broad range of coding genes, but also long and small non-coding sequences, including UCRs and miRNAs. miRNAs actually emerged as crucial components of the RT-dependent mechanism; indeed, a subpopulation known to be involved in cell differentiation, cell growth, tumorigenesis and metastatic progression proved highly responsive to RT inhibition. Many miRNA-encoding genes are significantly associated with genomic regions enriched in closely spaced Alu repeats, further strengthening the link between miRNAs and retrotransposons. The physical association of pre-miRNA genomic loci with high density retroelements actually suggests that the latter can exert a regulatory "position effect" on miRNA expression (Slotkin and Martienssen, 2007). Experimental evidence supporting an orchestrating role of the RT enzyme emerged from cesium chloride density centrifugation analysis of nucleic acids extracted from melanoma and prostate carcinoma cell lines, harboring either "native" or efavirenz-inhibited RT: by buoyant density analysis, LINE-1- and Alu-containing molecules with the density of DNA:RNA hybrids were selectively identified in tumor cells; these molecules disappeared upon treatment with efavirenz and were absent in non-transformed human fibroblasts (Sciamanna et al., 2013). Thus, the DNA:RNA hybrids are an especially abundant, if not exclusive, component of cancer cells, generated by reverse transcription of RNA templates that are largely, albeit not exclusively, provided by LINE-1 and Alu transcripts. These data suggest that a cancer-promoting RT-dependent mechanism is active in tumor cells and can be blocked by inhibiting the LINE-1 RT. Based on these data, Sciamanna et al. proposed a model (Sciamanna et al., 2014) whereby the highly expressed LINE-1 RT in cancer cells can intercept RNAs and convert them into RNA:DNA hybrids via reverse transcription. Central to the model is the RT-dependent production of RNA:DNA hybrids, associated with altered functional miRNA profiles, observed under conditions of high LINE-1-derived RT in cancer cells and modulatable by RT inhibitors. A wealth of data show that miRNA expression is indeed downregulated in cancer cells, with profound implications for cell fates (Lu et al., 2005; Gaur et al., 2007; Jansson and Lund, 2012). A variety of small RNAs, including 7SL RNA (Ullu and Weiner, 1984), tRNAs (Kaçar et al., 1992), small nuclear RNAs (Doucet et al., 2015), and YRNAs (Perreault et al., 2005), are known to act as templates for reverse transcription in intermediate steps of the genesis of pseudogenes. It is not unreasonable to hypothesize that miRNA precursors may also be retrotranscribed. The observation that the production of hybrid RNA-DNA molecules is associated with aberrant miRNA profiles in cancer cells actually suggests that RT can "subtract" RNA precursors, thus preventing or impairing the formation of double-stranded (ds) RNA Dicer substrates for the biogenesis of mature miRNAs: this would ultimately contribute to establish favorable conditions for the onset of cancer phenotypes.

RT inhibition results in restored miRNA biogenesis, likely re-establishing their regulatory networks, consistent with its empirically established capacity to revert the cancer phenotype (Sciamanna et al., 2013).

In agreement with this idea, a subset of LINE-1-specific siRNAs, targeting LINE-1 expression and capable of inducing methylation of their promoters, are found to be down-modulated in breast cancer compared to normal cells (Chen et al., 2012b). Conversely, LINE-1 inhibition by siRNAs up-modulates the expression of miRNAs involved in tumor suppression (Ohms et al., 2014). Taken as a whole, these findings indicate an orchestrating role of LINE-1-encoded RT in setting a cancer-permissive cellular state.

Although the LINE-1 enzymatic machinery preferentially reverse transcribes its own RNA (Esnault et al., 2000; Wei et al., 2001; Kulpa and Moran, 2006), the presence of intronless pseudogenes scattered throughout mammalian genomes indicates that mRNAs transcribed from protein-coding genes are also substrates for reverse transcription by the endogenous RT (Pink et al., 2011). This suggests that the RT-dependent mechanism, in addition to targeting miRNAs, can also target several more RNA classes, coding and non-coding, small and long RNAs, though with a possible preferential bias for those associated with, or derived from, retroelement sequences. Consistent with this view, Sciamanna et al. (2013) found that about one third of the efavirenz-downmodulated miRNAs in melanoma cells are clustered on chromosome 19 (C19MC) in a locus characterized by a high density of primate-specific Alu repeats, which were shown to have co-evolved with miRNA-coding genes (Lehnert et al., 2009). An independent study also reported that LINE-1 silencing caused a deregulated profile of miRNA expression in breast cancer cells (Ohms and Rangasamy, 2014).

In summary, LINE-1 expression and small RNA networks emerge as the balanced components of an RT-dependent regulatory mechanism placed at the intersection between normally differentiated and transformed non-differentiated cellular states: when one component rises, the other declines.

It is worth stressing that the partial inactivation of miRNA function is not an exclusive feature of cancer, but is a physiological phenomenon, shared with early preimplantation embryos, a context where again miRNA pathways become transiently suppressed (Suh et al., 2010). Moreover, miRNA inactivation is concomitant with a burst of LINE-1 activity in both tumorigenesis and embryogenesis. In the next paragraph we discuss this striking analogy and suggest that physiological and pathological processes have in common the same RT-dependent mechanism.

### THE RT-BASED MECHANISM AS GLOBAL REGULATOR OF DIFFERENTIATION IN TUMORIGENESIS AND EMBRYOGENESIS

In prior developmental studies, the presence of LINE-1-encoded RT activity and protein was assessed in gametes and early embryos to address the potential role of this enzyme in embryogenesis. Unexpectedly, Giordano et al. (2000) found an RT activity in mature murine spermatozoa, providing the first hint that RT might somehow be involved in early embryogenesis. The sperm endogenous RT, far from being a nonfunctional remnant encoded by "genomic parasites," has a full enzymatic activity able to reverse transcribe exogenous RNA molecules, taken up and internalized by spermatozoa, into cDNA copies that could then be delivered to embryos at fertilization (Giordano et al., 2000; reviewed in Spadafora, 2008). Pittoggi et al. (2003) further found that RT is also present in early embryos and is strictly required for preimplantation development: indeed, exposing zygotes to an RT inhibitor, or antisense-mediated downregulation of LINE-1 (Beraldi et al., 2006), both caused a drastic arrest of development at the two- and four-cell embryo stages with globally altered gene expression profiles. Interestingly, fertilization activates a reverse transcription wave in zygotes within a few hours, which then propagates throughout the first cell division; this is concomitant with the production of new LINE-1 copies that mostly remain as non-integrated extrachromosomal structures (Vitullo et al., 2012). Indirect evidence for an embryonic RT activity also emerges from reports that somatic LINE-1 retrotranspositions occur in human stem cells (Garcia-Perez et al., 2007; Coufal et al., 2009) and in very early stages of development in humans (van den Hurk et al., 2007) and rodents using transgenic murine and rat models (Kano et al., 2009). These findings indicate that the endogenous RT is active in early stages of embryogenesis, where it appears to have implications for epigenetic regulation of gene expression and to be necessary for the unfolding of the developmental program.

Cancer and embryo developmental studies converge on the conclusion that an RT-based mechanism is physiologically activated in early embryogenesis and repressed in differentiated tissues; its unscheduled re-activation in somatic cells has cancer-promoting effects, yielding increased cell proliferation and loss of differentiation, in analogy with embryonic growth. It is a well-established notion that tumors and embryos share a variety of cellular, biochemical and molecular features and that genes typically expressed in embryogenesis, yet silenced in normal differentiated tissues, are re-expressed in tumors (Ma et al., 2010). These circumstances support the conclusion that tumorigenesis often recapitulates developmental patterns (Kaiser et al., 2007). In this conceptual framework, Spadafora (2015) proposed that the RT-dependent mechanism is a source of the functional analogies shared by the physiological and pathological processes connecting embryogenesis and tumorigenesis.

### THE GENESIS OF CANCER HETEROGENEITY

The retrotransposon machinery is highly sensitive to stressing stimuli (Hagan and Rudin, 2002; Terasaki et al., 2013). In response, LINE-1 expression can be activated at differential levels in different cells, depending on the nature and the intensity of endogenous or exogenous stressors. We propose that the differential activation of RT, including by stress, can generate the heterogeneously differentiated cell populations that typically characterize human cancers (Meachem and Morrison, 2013). It is currently unclear whether the cellular heterogeneity observed in cancer reflects the existence of cell populations undergoing a progressive transformation "trajectory," initiating as a primary cancer state and sequentially evolving into metastatic cells, or whether a broad array of cellular variations simultaneously arise in a single stress-responding event. Based on the data discussed above, it is tempting to speculate that the latter is the case; cells with varying degrees of malignancy—some of which may confer metastatic capacity—may concomitantly originate in a single genome-wide burst of stress-activated LINE-1-RT expression. In this hypothetical model, schematized in **Figure 1**, burst(s) of LINE-1 expression, triggered by exogenous and/or endogenous stimuli in normal cells (in green), would generate an array of cell populations endowed with various levels of LINE-1-dependent RT activity (indicated by different shades of colored cells), coinciding with the emergence of preneoplastic lesions. We propose that the different levels of LINE-1 activation correspond to different degrees of cell de-differentiation; in the process, embryonic regulatory patterns can be reactivated and induce somatic cells to revert back to embryo-like states (Kaiser et al., 2007). 
LINE-1 activation at low levels would exert modest de-differentiation effects, while higher levels would determine a more extensive reactivation of embryonic patterns, with the ensuing production of more aggressive "embryo-like" transformed cells. The cell populations concomitantly originating from the activation of RT expression would then differentially propagate throughout cancer progression, thus contributing to cancer heterogeneity. The model represented in **Figure 1** was inspired by the recently proposed "Big Bang" hypothesis for the genesis of human cancer, in which a single ancestral event is thought to originate the heterogeneity of cancer cell populations (Sottoriva et al., 2015), which would then progress and expand in parallel (Klein, 2009). The simultaneous genesis of cells with heterogeneous invasive potential would also offer a possible explanation for the genesis and spreading of metastatic tumors of unknown primary origin: these are a relatively rare class of metastatic tumors detected in patients in which the primary tumor cannot be identified, and account for 3–5% of all cancer diagnoses (Natoli et al., 2011; Stella et al., 2012).

In more general terms, the model predicts a relatively minor role for DNA mutations in cancer progression, as cell transformation is rather viewed as originating from an RT-mediated reactivation of "embryonic" regulatory circuits mostly acting at the epigenetic level in differentiated cells (Spadafora, 2015). Although needing further experimental testing, the model builds on emerging evidence indicating the global reach of RT onto several RNA classes (Sciamanna et al., 2013; Ohms et al., 2014; Ohms and Rangasamy, 2014) and is compatible with the reversibility of the cancer phenotype upon modulation of RT levels (Sciamanna et al., 2005).

Retrotransposable elements also clearly impinge on genome function by generating extensive variations via insertional mutagenesis. Although large numbers of mutations are identified by high-throughput sequencing data in cancer contexts, their role(s) in tumorigenesis is often undefined (Kandoth et al., 2013): predisposing gene mutations in fact play a documented causative role only in 5–10% of human cancers (Nagy et al., 2004).

Recent excellent works have reported that the genomes of different tumor types harbor hundreds of de novo somatic insertions, selectively found in cancer genomes (Iskow et al., 2010; Lee et al., 2012; Solyom et al., 2012; Shukla et al., 2013; Doucet-O'Hare et al., 2015; Ewing et al., 2015; Rodić et al., 2015). Despite these reports, however, the general implication of L1 retrotransposition events as either "driver" mutations (i.e., with a causative role in tumorigenesis), or as "passengers" (i.e., manifesting a consequence of the loss of genome regulation associated with cell transformation), remains an open question (Rodić and Burns, 2013). Insertions were documented and assigned a causative role in specific cases; among others, LINE-1 insertions were found within the c-myc gene (Morse et al., 1988), or in the tumor-suppressing gene apc (Miki et al., 1992) in breast and colon carcinoma, respectively; in those instances, LINE-1 insertions should have an activating (c-myc) or inhibitory (apc) role, respectively. In a different context, Alu insertions also result in neurofibromatosis type 1 (Wallace et al., 1991). It is worth recalling, however, that together LINE-1, Alu and SVA insertions account only for a marginal contribution (<0.5%) to the genesis of cancer (Callinan and Batzer, 2006). This leaves ample room for non-insertional mechanisms of tumorigenesis that implicate retrotransposons. In addition, the concept that new insertions might cause tumorigenesis would be hard to reconcile with the full reversibility of the "therapeutic" effect associated with LINE-1 RT inhibition in various cancer cells, observed in our and other laboratories (Sciamanna et al., 2005; Oricchio et al., 2007; Patnala et al., 2013). In our model, therefore, insertional mutagenesis, though not being totally ruled out, plays a minor role. We believe that most retrotranspositional insertions observed in many tumors reflect a failure to repress the activity of retroelements (a frequent failure in cancer), rather than being a cause of tumorigenesis.

**Figure 1.** [...] burst of RT activity (red flash), which deregulates the transcriptome of individual cells at various levels (represented by different color shades): this originates heterogeneous cancer cell populations. In the model, cancer cell heterogeneity would therefore set in following the early burst of differentially expressed RT activity in different cells. Cancer would then progress with the expansion of various cell populations (on the right).

In an extreme view, mutations may often represent a tolerated consequence of the tumor-associated global deregulation rather than the cause. The evidence summarized so far suggests that deregulated RT activity, likely acting in combination with other key epigenetic processes such as global DNA hypomethylation and chromatin remodeling, contributes to shape pro-tumorigenic expression profiles, and thus favors the phenotypic plasticity and diversity of cancer cells.

### CANCER AS A REVERSIBLE "DEVELOPMENTAL" DISORDER AND DIFFERENTIATION THERAPY

As discussed above, the non-coding RNA profiles modulated by RT can globally regulate cell differentiation. Evidence is emerging that unscheduled reactivation of RT, as occurring in cancer cells, or its developmentally regulated repression, as in normal cells, are sufficient to promote cell de-differentiation or, on the contrary, stabilize the differentiated state, respectively. Tumorigenesis can be viewed as the erroneous resumption of genome-wide networks active in embryogenesis and silenced in adult life, and the differentiation process can be regarded as a sequence of transient and reversible cellular states in which RT activity is variably activated. According to this view, cancer would also be a reversible phenomenon and, as such, potentially modulatable by RT-inhibitory differentiation-inducing agents. The idea that the "normal" differentiation program can be restored to cancer cells, with the loss or attenuation of tumorigenic phenotype, has inspired much research and clinical work in the last decades. Perhaps the best known example is the development of retinoic acid-based differentiation therapy, successfully applied to treat acute promyelocytic leukemia (APML). Retinoic acid is a powerful morphogen and differentiating agent and has been the object of intense studies in the last decades, the outcome of which cannot be exhaustively discussed here (reviewed by Tang and Gudas, 2011). Other attempts to apply the same principle to solid tumors, however, have had more limited results so far (reviewed by Leszczyniecka et al., 2001; Cruz and Matushansky, 2012). The data obtained from in vitro assays, preclinical tests on animal models and a recent human trial converge in viewing the LINE-1-encoded RT as an effective target for a non-cytotoxic, differentiation-inducing cancer therapy; RT inhibition appears to be the common condition sufficient to reverse tumorigenicity and restore differentiation to a wide variety of cancer cells.

### CONCLUSIONS

Growing data undermine the concept of terminal differentiation as a stably acquired condition, revealing that: (i) differentiation states should rather be viewed as transient conditions, and (ii) even in the presence of genomic alterations, epigenetics often wins over genetics (Lotem and Sachs, 2002). Epigenetic changes can effectively bypass the genetic alterations associated with, or caused by, tumorigenesis and reprogram gene expression profiles, reverting, or mitigating, the malignant phenotypes of cells. LINE-1-encoded RT is emerging as a key epigenetic regulator at the intersection between normal and pathological development. As such, the level of RT activity has the potential to shift the biological balance of cells in one or the other direction. In our view, these findings and emerging concepts, besides their clinical implications, fulfill the early prediction by Temin that endogenous RT activity plays roles both in normal development, as in embryogenesis, and in pathologies such as cancer (Temin, 1971).

### FUNDING

Work in our laboratory was supported by the Italian Ministry of Health grant n. 15ONC/8 "Endogenous Reverse Transcriptase as a tumor marker and causative agent of tumor onset and progression," to CS.

### ACKNOWLEDGMENTS

We acknowledge the skillful assistance of Cosimo Curianò with drawing preparation.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Sciamanna, De Luca and Spadafora. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Crossing the LINE Toward Genomic Instability: LINE-1 Retrotransposition in Cancer

### Jacqueline R. Kemp and Michelle S. Longworth\*

*Department of Cellular and Molecular Medicine, Lerner Research Institute of Cleveland Clinic, Cleveland, OH, USA*

Retrotransposons are repetitive DNA sequences that are positioned throughout the human genome. Retrotransposons are capable of copying themselves and mobilizing new copies to novel genomic locations in a process called retrotransposition. While most retrotransposon sequences in the human genome are incomplete and incapable of mobilization, the LINE-1 retrotransposon, which comprises ∼17% of the human genome, remains active. The disruption of cellular mechanisms that suppress retrotransposon activity is linked to the generation of aneuploidy, a potential driver of tumor development. When retrotransposons insert into a novel genomic region, they have the potential to disrupt the coding sequence of endogenous genes and alter gene expression, which can lead to deleterious consequences for the organism. Additionally, increased LINE-1 copy numbers provide more chances for recombination events to occur between retrotransposons, which can lead to chromosomal breaks and rearrangements. LINE-1 activity is increased in various cancer cell lines and in patient tissues resected from primary tumors. LINE-1 activity also correlates with increased cancer metastasis. This review aims to give a brief overview of the connections between LINE-1 retrotransposition and the loss of genome stability. We will also discuss the mechanisms that repress retrotransposition in human cells and their links to cancer.

### Edited by:

*Tammy A. Morrish, The University of Toledo, USA*

### Reviewed by:

*Revati Wani, Pfizer Inc., USA*

*Sheila Lutz, Wadsworth Center, New York State Department of Health, USA*

### \*Correspondence:

*Michelle S. Longworth longwom@ccf.org*

#### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry*

Received: *13 August 2015* Accepted: *27 November 2015* Published: *16 December 2015*

### Citation:

*Kemp JR and Longworth MS (2015) Crossing the LINE Toward Genomic Instability: LINE-1 Retrotransposition in Cancer. Front. Chem. 3:68. doi: 10.3389/fchem.2015.00068*

Keywords: retrotransposons, LINE-1, genomic instability, retrotransposition, cancer

### THE LINE-1 RETROTRANSPOSON IS AN ACTIVE MOBILE ELEMENT

Retrotransposons, a class of transposable elements (TE), are highly repetitive DNA sequences positioned throughout the human genome. These structural elements make use of an RNA-mediated transposition process, allowing them to move from one location in the genome to another, while the original copy remains in its original locus. The RNA-based retrotransposons are classified into the autonomous long terminal repeat (LTR)-containing and non-LTR retrotransposons. LTR-containing retrotransposons, as their name implies, possess LTRs ranging from 100 bp to over 5 kb in size and are endogenous retroviruses. Long interspersed nuclear elements (LINEs), comprising 20% of the human genome, are a type of non-LTR retrotransposon. Non-autonomous retrotransposons are a third class of retrotransposons, of which the short interspersed nuclear elements (SINEs) comprise ∼13% of the human genome (Lander et al., 2001).

The human genome contains millions of copies of retrotransposons; however, only a single non-LTR retrotransposon family, the LINE-1 (L1) family, remains the primary source of retrotransposition. The activity of the L1 retrotransposon has persisted over time within the human genome and its derepression is associated with genomic instability and tumor development (Gasior et al., 2006; Lee et al., 2012). Over 100,000 L1 sequences exist in the human genome; however, most are rendered inactive by point mutations, rearrangements, or truncations (Brouha et al., 2003). It was originally estimated that the average human diploid genome contains ∼80–100 active L1s that are capable of undergoing retrotransposition (Sassaman et al., 1997). Of those which are active, six were classified as "hot" L1s responsible for the bulk of L1 retrotransposition within the human genome (Brouha et al., 2003). More recently, however, three independent studies demonstrated that the occurrence of new L1 insertions is more prevalent than previously thought. Additionally, a number of the newly inserted "hot" L1s were found to be extremely polymorphic and specific to a few individuals, suggesting that L1 retrotransposition may contribute to the propensity for one individual to develop disease over another (Beck et al., 2010; Huang et al., 2010; Iskow et al., 2010).

A full-length L1 retrotransposon is ∼6 kb in size and contains a 5′ untranslated region, two non-overlapping open reading frames (ORF1 and ORF2), and a 3′ untranslated region that ends in a poly(A) tail (Swergold, 1990; Becker et al., 1993). ORF1 encodes a 40 kDa RNA-binding protein (ORF1p; Mathias et al., 1991), whereas ORF2 encodes a 150 kDa protein (ORF2p) with demonstrated endonuclease and reverse transcriptase activities (Mathias et al., 1991; Feng et al., 1996; Piskareva et al., 2003). Interestingly, ORF2p also contains a conserved cysteine-rich domain recently shown to have a high non-specific affinity for RNA, which may contribute to the process of reverse transcription (Piskareva et al., 2013). Various mutants of ORF1p and ORF2p have been created and used to demonstrate that the two proteins are necessary for retrotransposition in a cell culture-based assay (Moran et al., 1996; Wei et al., 2001; Kulpa and Moran, 2005; Doucet et al., 2010).

**Figure 1.** During a cycle of retrotransposition (gray arrows), L1 is transcribed and exported into the cytoplasm, where translation occurs. ORF1p and ORF2p preferentially bind to their own mRNA and form ribonucleoprotein (RNP) complexes. The L1 RNP gains access into the nucleus, where the ORF2p endonuclease domain cleaves genomic DNA to expose a 3′-hydroxyl residue that is used as a primer by the L1 reverse transcriptase to copy the L1 mRNA, a mechanism that has been termed target-primed reverse transcription (TPRT). The resulting cDNA is then inserted into a novel region in the genome. A number of host cell defense mechanisms exist to inhibit L1 retrotransposition (black arrows), including L1 DNA methylation, mutation, and/or degradation, L1 RNA degradation, inhibition of L1 RNP formation, and/or localization to stress granules, and autophagy signaling pathways. All are capable of inhibiting L1 and preventing its mobilization throughout the human genome.

The mobility of an L1 retrotransposon is completely dependent on transcription and translation of its encoded proteins and therefore includes both nuclear and cytoplasmic events essential for retrotransposon duplication (**Figure 1**). ORF1p and ORF2p preferentially bind to their own mRNA and form ribonucleoprotein (RNP) complexes (Leibold et al., 1990; Alisch et al., 2006; Dmitriev et al., 2007; Doucet et al., 2010). ORF1p has been demonstrated to have nucleic acid chaperone activity that is essential for the retrotransposition process (Martin et al., 2005, 2008). The L1 RNP gains access into the nucleus, where the ORF2p endonuclease domain cleaves genomic DNA to expose a 3′-hydroxyl residue that is used as a primer by the L1 reverse transcriptase to copy the L1 mRNA, a mechanism that has been termed target-primed reverse transcription (TPRT). The resulting cDNA is then inserted into a novel region in the genome (Cost et al., 2002). A nuclear localization signal has been identified in ORF2p (Goodier et al., 2004); however, it is unclear whether the L1 RNP is capable of crossing an intact nuclear membrane or whether it gains access following nuclear envelope breakdown (Kubo et al., 2006).
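The retrotransposition cycle described above can be pictured with a deliberately simplified toy model. The sketch below is an illustration only and rests on several assumptions: the genome is a plain string, the function names (`transcribe`, `reverse_transcribe`, `retrotranspose`) and the example sequence are hypothetical, and features of real insertions such as 5′ truncation, target-site duplications, or endonuclease site preference are not modelled. It shows only the logic of the mechanism: the source element is transcribed, the RNA is copied into cDNA, and a new copy is inserted at a target site while the original stays in place.

```python
# Toy model of "copy-and-paste" retrotransposition (illustrative only).

DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_TO_CDNA = {"A": "T", "U": "A", "C": "G", "G": "C"}


def transcribe(dna: str) -> str:
    """Sense-strand DNA -> RNA (T is replaced by U)."""
    return dna.replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """RNA -> first-strand cDNA, written 5'->3' (antisense)."""
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna))


def reverse_complement(dna: str) -> str:
    """Second-strand synthesis: restore the sense orientation."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))


def retrotranspose(genome: str, start: int, end: int, target: int) -> str:
    """Insert a new copy of genome[start:end] at position `target`.

    The source element is left untouched, so the copy number rises by one.
    """
    source = genome[start:end]
    rna = transcribe(source)            # transcription of the element
    cdna = reverse_transcribe(rna)      # reverse transcription at the target
    new_copy = reverse_complement(cdna)
    return genome[:target] + new_copy + genome[target:]


if __name__ == "__main__":
    element = "ATGCATTTAAAA"            # hypothetical stand-in for an L1
    genome = "GGGG" + element + "CCCCGGGG"
    mutated = retrotranspose(genome, 4, 16, 20)
    print(genome.count(element), "->", mutated.count(element))
```

In this sketch the genome grows by the length of the element with each cycle, which is the string-level analogue of the copy number increase discussed in the text.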

### POTENTIAL L1-MEDIATED MECHANISMS OF TUMOR DEVELOPMENT

Many reports have demonstrated that retrotransposons can significantly impact the structure of the human genome. Retrotransposons have adverse effects on genome stability since multiple copies of the same sequence can hinder precise chromosomal pairing during mitosis and meiosis, resulting in DNA double-stranded breaks, more homologous recombination, chromosome duplication, and increased potential for inefficient repair of recombination events (Belgnaoui et al., 2006; Farkash et al., 2006; Gasior et al., 2006). A recent study identified LINE-LINE-mediated non-allelic homologous recombination as an important mechanism of structural rearrangement, contributing to genomic variability and instability (Startek et al., 2015).

L1 retrotransposition events in the human genome have been deemed responsible for ∼97 disease-producing insertions (reviewed in Hancks and Kazazian, 2012). Specifically, direct insertional mutagenesis caused by L1 retrotransposition can result in disruption of coding sequence, disruption of splicing, and/or deregulation of gene expression. Symer and colleagues identified L1 element inversions, extra nucleotide insertions, exon deletions, a chromosomal inversion, and flanking sequence comobilization in the retrotransposon target site in human tissue culture cells (Symer et al., 2002). Studies have also shown that L1 acts as more than just an insertional mutagen: its retrotransposition activity can also result in large genomic deletions (Gilbert et al., 2002).

L1 retrotransposons exhibit a cis-preference, in which the L1 proteins preferentially use their own L1 RNA as the transcript for reverse transcription and integration (Wei et al., 2001; Kulpa and Moran, 2006). However, L1 proteins can also work in trans to promote mobilization of other RNAs, thus increasing their potential for causing genomic instability. Non-autonomous elements including SINEs (Dewannieux et al., 2003) and SVAs (Raiz et al., 2012), as well as small nuclear RNAs (e.g., U6 snRNA; Buzdin et al., 2002; Gilbert et al., 2005; Garcia-Perez et al., 2007), small nucleolar RNAs (e.g., U3 snoRNA; Weber, 2006), and messenger RNAs (Esnault et al., 2000; Wei et al., 2001) are all capable of being trans-mobilized via L1. In all of these cases, retrotransposition of mRNAs results in processed pseudogenes that bear L1 structural hallmarks. These trans-mobilization events utilize ORF1p and/or ORF2p to insert into the human genome and do not involve sequence specificity. Once these pseudogenes are inserted back into the genome, they usually lack introns and promoters, but contain a poly(A) 3′ end and target-site duplications of varying length (Vanin, 1985; Weiner et al., 1986; Esnault et al., 2000). Interestingly, siRNAs have been shown to be expressed from pseudogenes in mouse oocytes, suggesting a potential way in which they might influence gene regulation (Tam et al., 2008). Therefore, generation of processed pseudogenes is a direct product of endogenous retrotransposition activity in the human genome that can contribute to genomic diversity and instability.

Integration of L1 in or near oncogenes or tumor suppressor genes can contribute to tumor development (Morse et al., 1988; Miki et al., 1992; Iskow et al., 2010) and progression of life-threatening cancers, including lung, colon, and breast cancer in humans (Lee et al., 2012; Criscione et al., 2014). For example, disruption of the APC gene by a somatic insertion of L1 was shown to be present in colon cancer and associated with development of colorectal tumors (Miki et al., 1992). The APC gene encodes a tumor suppressor involved in maintaining chromosomal stability during mitosis (Fodde et al., 2001b). In Apc-deficient mouse cells, structural rearrangements resulting from chromosomal breakage and recombination are apparent (Fodde et al., 2001a). Further, cells are defective in chromosome segregation when they carry a truncated form of Apc (Kaplan et al., 2001). Other tumor suppressor genes found to be disrupted by tumor-specific L1 insertions include Mutated in Colorectal Cancers (MCC) and Suppression of Tumorigenicity 18 (ST18; Shukla et al., 2013). Furthermore, since L1 machinery acts to trans-mobilize other RNAs, those insertions can also impact expression of genes. Alu, a type of SINE present in higher copy numbers than L1, can be trans-mobilized, leading to cancer-associated gene insertions. Sites of Alu insertions include the APC locus, where an insertion was associated with desmoid tumors (Halling et al., 1999), the tumor suppressor NF-1 (neurofibromatosis type I; Wallace et al., 1991), and the BRCA1 and BRCA2 breast/ovarian cancer related genes (Miki et al., 1996; Teugels et al., 2005). SVA elements can also be mobilized by the L1 retrotransposition machinery, leading to disease (Ostertag et al., 2003). In one study, mobilization of SVA resulted in deletion of the HLA-A gene in three Japanese families; a number of individuals in these families were afflicted with leukemia (Takasu et al., 2007).

Telomerase reactivation, as a means to maintain telomeres, occurs in the early stages of carcinogenesis to promote cancer cell immortalization (Counter et al., 1992; Kim et al., 1994). Transcriptional regulation of hTERT, the catalytic subunit of telomerase, is a major mechanism for telomerase activation in the cancer setting. In a recent study, L1 was shown to contribute to tumor pathogenicity by inducing hTERT and helping to maintain telomeres in telomerase-positive tumor cells. Depletion of L1 resulted in reduced telomere length, suggesting that L1 is a reasonable target in the treatment of telomerase-positive cancer (Aschacher et al., 2015).

### L1 EXPRESSION IN CANCERS

Given that L1 retrotransposition can lead to genomic instability and that genetic heterogeneity is a common feature in tumor initiating cells, it is not surprising that expression of the L1 encoded ORF1p is reported to be a hallmark of many human cancers, with almost half (47%) of the human neoplasms examined being immunoreactive for L1 (Rodic et al., 2014). L1 positive neoplasms included invasive breast carcinomas (97% L1 positive), high-grade ovarian carcinomas (91.5% L1 positive), and pancreatic ductal adenocarcinomas (PDACs; 89% L1 positive). Carcinomas originating in the endometrium, biliary tract, esophagus, bladder, head and neck, lung, and colon were also frequently L1 immunoreactive (22.6–76.7% L1 positive; Rodic et al., 2014). In a separate study, increased ORF1p expression and novel L1 insertions in PDAC were observed in matched primary and metastatic tissues. However, the overall results showed discordant rates of retrotransposition, suggesting that while increased L1 retrotransposition may not be a direct cause of metastatic PDAC, it may contribute to gene dysregulation leading to metastasis (Rodic et al., 2015). Furthermore, activation of L1 increases the risk of epithelial-mesenchymal transition and metastasis in epithelial cancer (reviewed in Rangasamy et al., 2015) and promotes proliferation and invasion of LoVo colorectal cancer cells (Li et al., 2014) and MDA-MB-231 breast cancer cells (Yang et al., 2013). ORF1p and ORF2p levels are upregulated in breast cancers compared to normal tissues. Cytoplasmic levels of ORF1p and ORF2p are elevated in DCIS breast cancers compared to highly invasive cancers. Conversely, nuclear levels of ORF1p and ORF2p were found to be higher in invasive breast cancers and correlated with increased lymph node metastasis and poor patient survival (Harris et al., 2010; Chen et al., 2012). Furthermore, inhibition of the L1-encoded reverse transcriptase in breast cancer cells was demonstrated to reduce the rate of proliferation and promote cellular differentiation (Patnala et al., 2014). Finally, L1 activity and expression were elevated in rat chloroleukemia cells, suggesting that mobilization of this retrotransposon may contribute to the genomic instability observed in this model of blood cancer (Kirilyuk et al., 2008).

Hypomethylation of L1 DNA has been observed in various cancers and is associated with an increase in transcriptional activation and expression of L1 (Alves et al., 1996; Asch et al., 1996; Kitkumthorn et al., 2012; Murata et al., 2013; Criscione et al., 2014; Park et al., 2014). L1 hypomethylation can occur early in tumorigenesis and is associated with bladder (Patchsung et al., 2012; Salas et al., 2014), gastric (Shigaki et al., 2013; Baba et al., 2014a), colon (Ogino et al., 2008; Antelo et al., 2012; Murata et al., 2013), lung (Saito et al., 2010), and breast cancers (Park et al., 2014). L1 hypomethylation is associated with poor prognosis of lung adenocarcinoma (Ikeda et al., 2013), hepatocellular carcinoma via activation of c-Met (Zhu et al., 2014), esophageal squamous cell carcinoma (ESCC; Iwagami et al., 2013), and with inferior survival in colorectal carcinomas with high microsatellite instability (Inamura et al., 2014). Additionally, L1 hypomethylation in ESCC was shown to be significantly associated with lymph node metastasis, frequency of p53 mutation, and chromosomal instability (Kawano et al., 2014). In a separate study, L1 hypomethylation in ESCC patient samples was associated with an increase in CDK6 expression (Baba et al., 2014b). This may contribute to the aggressiveness of tumors since CDK6 is known to promote tumor progression by stimulating proliferation and angiogenesis (Kollmann et al., 2013). Finally, hypomethylation of L1 in colorectal cancer can lead to activation of oncogenes important in metastasis, including MET, RAB3IP, and CHRM3 (Hur et al., 2014). It was observed that specific L1 sequences residing within the intronic regions of these proto-oncogenes were hypomethylated, and reduced methylation of specific L1 elements within the MET gene correlated with an induction of MET expression (Hur et al., 2014). However, since methylation levels of repetitive L1 elements often tightly correlate with global DNA methylation levels, it is difficult to conclude that L1 hypomethylation directly results in the increased genomic instability found in tumors.

### MECHANISMS THAT INHIBIT L1 RETROTRANSPOSITION ARE OFTEN DEREGULATED IN CANCER

As the uncontrolled movement of retrotransposons throughout the genome can have deleterious consequences for genome stability and health in general, a number of defense mechanisms exist in human cells to repress their movement. These mechanisms exist at the DNA, RNA, and protein levels to inhibit L1 and retrotransposition (**Figure 1**).

DNA methylation status is a major determinant of gene expression changes within the human genome and is involved in various biological processes including cancer (Liu et al., 2003). As discussed above, hypomethylation of L1 DNA is associated with an increase in L1 expression. Conversely, methylation of L1 within the CpG-rich 5′-UTR represses its ability to be activated and transcribed, thereby minimizing the exposure of genomic DNA to L1-associated damage (Hata and Sakaki, 1997; Weisenberger et al., 2005; Barchitta et al., 2014). DNA methylation, therefore, is a key mechanism for L1 silencing. It has been shown in mouse embryonic stem cells that methylation of the L1 promoter is maintained by DNA methyltransferases, including Dnmt1 and Dnmt3a and/or -3b (Woodcock et al., 1998; Liang et al., 2002).

Other epigenetic mechanisms have been reported to be involved in regulating L1 expression. One study showed that reporter genes introduced into human embryonic carcinoma-derived cell lines by engineered L1 retrotransposons were rapidly silenced during or shortly after their integration (Garcia-Perez et al., 2010). Treatment of the cells with histone deacetylase inhibitors reversed the silencing, and ChIP experiments demonstrated that a change in the chromatin status at the L1 integration site correlated with reactivation of the reporter gene (Garcia-Perez et al., 2010). Other studies involving chromatin structure averaged global histone modifications and found that histone H3 lysine 9 methylation is enriched at human retrotransposons, suggesting that histone methylation may play a role in repressing recombination of these retrotransposons (Kondo and Issa, 2003; Martens et al., 2005; Goodier and Kazazian, 2008). Low levels of the silencing histone modification H3K27me3 at L1 loci, in conjunction with L1 hypomethylation, have been associated with poor prognosis and poor clinical outcome in rectal cancer (Benard et al., 2013). Likewise, high levels of the activating histone modification H3K9Ac at L1 loci were associated with poor patient survival. This indicates that L1 methylation and histone modifications work closely together in determining gene expression and tumor progression (Benard et al., 2013).

Global chromatin organization is also involved in repression of Drosophila melanogaster retrotransposons. Studies in the fly have identified a role for the chromatin organizing complex, Condensin II, in repressing retrotransposition in somatic cells and tissues. The Condensin II subunit, dCAP-D3, promotes silencing of retrotransposon-containing loci by maintaining boundaries of repressive histone modifications to repress retrotransposon transcription and ultimately inhibit retrotransposition (Schuster et al., 2013). Furthermore, decreased dCAP-D3 expression impacts chromatin structure, resulting in DNA double strand breaks within the retrotransposon sequence, an increase in homologous pairing, and an increase in global retrotransposon copy number. While global chromatin regulators have yet to be implicated in L1 repression, CAP-D3 and Condensin II are conserved, and further studies are necessary to determine whether they also inhibit retrotransposition in human cells.

Epigenetic modification, however, is not the only mechanism employed by cells to inhibit retrotransposition. Exciting new evidence from multiple labs suggests that a host of cellular proteins employ distinct mechanisms to accomplish the inhibition.

One mechanism includes targeting the L1 RNA intermediate to prohibit insertion of L1 into the human genome. The ribonucleoprotein hnRNPL, which plays multiple roles in RNA metabolism, has been shown to directly interact with L1 RNA to negatively regulate retrotransposition. hnRNPL does so by decreasing the steady-state levels of the L1 RNA (Peddigari et al., 2013). Downregulation of L1 mRNA, and subsequently reduced expression of ORF1p and ORF2p, by RNase L was also shown to restrict L1 mobilization, whereas siRNA-mediated knockdown of endogenous RNase L led to a significant increase in L1 retrotransposition events in a human ovarian cancer cell line (Zhang et al., 2014). Similarly, the melatonin receptor 1 (MT1) inhibits retrotransposition through downregulation of L1 mRNA and ORF1p. Researchers showed that antagonists directed against MT1 abolished this effect in a dose-dependent manner (deHaro et al., 2014). Furthermore, melatonin-rich blood suppressed endogenous L1 RNA during in situ perfusion of tissue-isolated xenografts of human pancreatic cancer (deHaro et al., 2014).

Innate immune defenses can also inhibit retrotransposition of L1. Guo and colleagues demonstrated that autophagy degrades the L1 RNA intermediate, preventing new insertions into the genome and promoting genome stability. Degradation of retrotransposon RNA was facilitated by receptors involved in activating autophagy signaling pathways, NDP52 and p62. Interestingly, this study also showed that mice lacking Atg6/Beclin1, a gene critical for the formation of autophagosomes, accumulate retrotransposon RNA and new genomic insertions of L1 (Guo et al., 2014).

L1 RNP formation and safe delivery of the RNP to genomic DNA are essential for TPRT to occur; therefore, targeting the RNP for degradation is a useful mechanism to inhibit this process. The RNA helicase MOV10 directly associates with the L1 RNP (Goodier et al., 2012) and, similar to SAMHD1 (Zhao et al., 2013), inhibits L1 retrotransposition by promoting stress granule formation (Arjan-Odedra et al., 2012; Li et al., 2013); stress granules are ribonucleoprotein cytosolic foci that appear under cellular stress and often act to promote mRNA degradation (Kedersha et al., 2005). Further, the L1 ORF1p was shown in a separate study to localize in stress granules with components of RISC, suggesting a mechanism for controlling retrotransposition and the associated genomic damage (Goodier et al., 2007). More recently, the zinc-finger antiviral protein ZAP was shown to inhibit L1 retrotransposition by binding to the L1 RNP and inhibiting accumulation of L1 RNA (Goodier et al., 2015; Moldovan and Moran, 2015). ZAP colocalizes with the RNP in cytoplasmic stress granules and interacts with a number of novel proteins, including MOV10 (Goodier et al., 2015; Moldovan and Moran, 2015).

Another mechanism to inhibit retrotransposition involves targeting the single-strand DNA that arises during the process of L1 integration, to repress its mobilization. Certain cellular proteins can directly promote degradation of L1 DNA, thereby inhibiting retrotransposition. For example, the APOBEC3 (A3) family of cytidine deaminases functions to inhibit L1 retrotransposition by deaminating the transiently exposed cDNA, creating C-to-U conversions (Richardson et al., 2014). This may then target the mutated retrotransposon DNA for degradation through endonuclease activity. Additionally, the endonucleases TREX1 (Stetson et al., 2008) and ERCC1/XPF (Gasior et al., 2008) can physically cleave the reverse-transcribed cDNA of L1, thereby inhibiting retrotransposition.
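The C-to-U editing step described above can be written out as a very small sketch. It is a toy illustration under stated assumptions: the cDNA is a plain string, the function name `deaminate_cdna` and its `editable` parameter are hypothetical, and no sequence-context preference or enzyme kinetics of the A3 proteins is modelled. It only shows what a C-to-U conversion does to an exposed single-strand sequence.

```python
# Toy model of cytidine deamination on exposed single-strand cDNA
# (illustrative only; not a model of any specific A3 enzyme).

from typing import Optional, Set


def deaminate_cdna(cdna: str, editable: Optional[Set[int]] = None) -> str:
    """Convert C to U in a single-strand cDNA string.

    If `editable` is given, only cytosines at those 0-based positions are
    converted; otherwise every cytosine is treated as exposed and edited.
    """
    edited = []
    for index, base in enumerate(cdna):
        exposed = editable is None or index in editable
        edited.append("U" if base == "C" and exposed else base)
    return "".join(edited)


def count_edits(before: str, after: str) -> int:
    """Number of positions that differ between two equal-length strings."""
    return sum(1 for a, b in zip(before, after) if a != b)


if __name__ == "__main__":
    cdna = "ACGTCC"                       # hypothetical exposed cDNA
    fully_edited = deaminate_cdna(cdna)
    partly_edited = deaminate_cdna(cdna, editable={1})
    print(fully_edited, count_edits(cdna, fully_edited))
    print(partly_edited, count_edits(cdna, partly_edited))
```

The `editable` set stands in for the idea that only transiently exposed stretches of the cDNA are accessible to the deaminase; restricting it reduces the number of conversions.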

### CONCLUSIONS

Undeniably, L1 retrotransposons are an interesting and important component of the human genome. The activity of L1 retrotransposons can generate a wide array of genomic mutations and rearrangements, with potentially serious consequences for the stability of the genome. L1s are frequently hypomethylated and expressed in human cancers and their increased activity correlates with tumor progression and metastasis. Additionally, L1-insertion-mediated interference with normal RNA processing and expression also contributes to cancer development. Further studies on L1 retrotransposition, their effects on local and global genome organization, and the identification of novel mechanisms which repress retrotransposition to prevent tumor development will broaden our understanding of the impact of retrotransposons on genetic diversity and human health.

### FUNDING

This work was supported by a National Institutes of Health research grant R01GM102400 to ML. The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Kemp and Longworth. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Links between Human LINE-1 Retrotransposons and Hepatitis Virus-Related Hepatocellular Carcinoma

#### Tomoyuki Honda<sup>1,2</sup>\*

<sup>1</sup> Department of Viral Oncology, Institute for Virus Research, Kyoto University, Kyoto, Japan, <sup>2</sup> Division of Virology, Department of Microbiology and Immunology, Osaka University Graduate School of Medicine, Suita, Japan

Hepatocellular carcinoma (HCC) accounts for approximately 80% of liver cancers, the third most frequent cause of cancer mortality. The most prevalent risk factors for HCC are infections by hepatitis B or hepatitis C virus. Findings suggest that hepatitis virus-related HCC might be a cancer in which LINE-1 retrotransposon, often termed L1, activity plays a potential role. Firstly, hepatitis viruses can suppress host defense factors that also control L1 mobilization. Secondly, many recent studies also have indicated that hypomethylation of L1 affects the prognosis of HCC patients. Thirdly, endogenous L1 retrotransposition was demonstrated to activate oncogenic pathways in HCC. Fourthly, several L1 chimeric transcripts with host or viral genes are found in hepatitis virus-related HCC. Such lines of evidence suggest a linkage between L1 retrotransposons and hepatitis virus-related HCC. Here, I briefly summarize current understandings of the association between hepatitis virus-related HCC and L1. Then, I discuss potential mechanisms of how hepatitis viruses drive the development of HCC via L1 retrotransposons. An increased understanding of the contribution of L1 to hepatitis virus-related HCC may provide unique insights related to the development of novel therapeutics for this disease.

### Edited by:

Tammy A. Morrish, Formerly affiliated with University of Toledo, USA

### Reviewed by:

Kaushlendra Tripathi, Mitchell Cancer Institute, USA Gerald Günther Schumann, Paul-Ehrlich-Institut, Germany

\*Correspondence: Tomoyuki Honda thonda@virus.med.osaka-u.ac.jp

### Specialty section:

This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry

Received: 18 December 2015 Accepted: 22 April 2016 Published: 11 May 2016

### Citation:

Honda T (2016) Links between Human LINE-1 Retrotransposons and Hepatitis Virus-Related Hepatocellular Carcinoma. Front. Chem. 4:21. doi: 10.3389/fchem.2016.00021

Keywords: L1, retrotransposon, hepatitis C virus (HCV), hepatitis B virus (HBV), hepatocellular carcinoma

## INTRODUCTION

Liver cancer, 80% of which is hepatocellular carcinoma (HCC), accounts for 9% of all cancer deaths worldwide (Jemal et al., 2011; Tateishi and Omata, 2012). The major causative agents of HCC are hepatitis viruses, such as hepatitis B virus (HBV) or hepatitis C virus (HCV) (Jemal et al., 2011; Tateishi and Omata, 2012). HBV belongs to the Hepadnaviridae family, which has a relaxed circular DNA (rcDNA) as a viral genome (Beck and Nassal, 2007; Nguyen et al., 2008). HCV belongs to the Flaviviridae family, which has a nonsegmented, positive-stranded RNA as a viral genome (Hijikata et al., 1991; Grakoui et al., 1993; Aly et al., 2012). Both viruses cause chronic infections, with approximately 350 and 170 million people worldwide affected by chronic HBV and HCV infections, respectively (Parkin, 2006; Aly et al., 2012). It is now clear that chronic HBV and HCV infections play critical roles in the development of HCC (Jemal et al., 2011; Forner et al., 2012; Tateishi and Omata, 2012). However, the precise mechanisms of hepatocarcinogenesis in chronic hepatitis virus infections are still unclear.

Long interspersed nuclear element-1 (LINE-1 or L1) retrotransposons are genetic elements that constitute approximately 17% of the human genome (Lander et al., 2001). Because most L1s are 5′ truncated, most of them are defective, while 80–100 copies are still retrotransposition-competent and utilize a "copy-and-paste" mechanism to retrotranspose to new genomic loci (Brouha et al., 2003; Beck et al., 2010). Aberrantly expressed or dysregulated L1s are considered a major source of endogenous mutagenesis in humans (Levin and Moran, 2011; Burns and Boeke, 2012). L1 retrotransposition occurs in germ cells, in pluripotent stem cells, and at early stages of human embryonic development (van den Hurk et al., 2007; Beck et al., 2011; Levin and Moran, 2011; Klawitter et al., 2016), as well as in somatic cells, such as neuronal progenitor cells or cancer cells (Muotri et al., 2005; Iskow et al., 2010). Many epidemiological studies suggest a linkage between L1 and cancers (Shukla et al., 2013; Rodic et al., 2014; Harada et al., 2015). However, in most cases, it is unclear whether L1s are activated in normal cells before clonal expansion or in cancer cells at the later stage of carcinogenesis (Goodier, 2014).

Among cancers, hepatitis virus-related HCC is considered to be a cancer in which L1 might be involved (Shukla et al., 2013). Firstly, by far the majority of L1 de novo insertions detected in cancer tissues have been found in cancers of epithelial origin (Goodier, 2014). Secondly, HBV and HCV have the potential to suppress host defense mechanisms that can also control L1 retrotransposition (Gale and Foy, 2005; Chang et al., 2012; Yu et al., 2015). Thirdly, endogenous L1 retrotransposition was demonstrated to activate oncogenic pathways in HCC. Fourthly, several L1 chimeric transcripts with host or viral genes are found in hepatitis virus-related HCC (Lau et al., 2014). Here, I will summarize potential linkages between hepatitis virus-related HCC and L1s. Firstly, I will review how HBV could affect L1 retrotransposon activity. I will then introduce the current understanding of the relationship between HCV and L1. Finally, I will discuss possible L1-mediated mechanisms that may induce HCC. Understanding the possible links between virus-related HCC and L1 may open a new avenue for the development of novel therapeutics for this disease.

### A POTENTIAL LINK BETWEEN HBV AND L1 IN HCC

The 3.2-kb HBV genome encodes four partly overlapping open reading frames (ORFs): preC/C (core and Hepatitis B e-Antigen [HBeAg]), P (viral polymerase), preS/S (Hepatitis B surface Antigen [HBsAg]) and X (non-structural protein [HBx]) genes (**Figure 1A**). In the nucleus, the genome is converted into covalently closed circular DNA (cccDNA). From this cccDNA, all viral RNAs, including pregenomic RNA (pgRNA) as a replication intermediate and viral mRNAs, are transcribed. Viral proteins such as core and polymerase proteins and pgRNAs are assembled into the nucleocapsid within the cytoplasm. In the nucleocapsid, pgRNA is reverse transcribed into rcDNA. All these HBV-related nucleic acids have the potential to trigger innate immune responses in infected cells (Ait-Goughoulte et al., 2010). If these immune responses cannot clear HBV, the virus establishes a chronic infection, which is known to increase the risk of developing liver cirrhosis and HCC (Gonzalez and Keeffe, 2011).

Type I interferons (IFNs) play a major role in anti-viral immunity (Katze et al., 2002). Association of IFNs with IFN receptors activates JAK1 and signal transducer and activator of transcription 1/2 (STAT1/2). Then, these proteins interact with interferon regulatory factor 9 (IRF9) and form a potent transcription factor, upregulating the expression of several hundreds of IFN-stimulated genes (ISGs). These ISGs suppress viral replication and spread through various mechanisms described elsewhere (Katze et al., 2002). IFN is used to control HBV replication, indicating that IFN is a restriction factor (Dienstag, 2008). For example, tetherin, an IFN-inducible transmembrane protein, inhibits HBV virion secretion (Yan et al., 2015). Zinc finger antiviral protein (ZAP) is upregulated in IFN-treated cells and restricts HBV replication through downregulation of pgRNA (Mao et al., 2013). On the other hand, HBV has a variety of strategies to counteract IFN signaling (**Figure 1A**). HBsAg, HBeAg and HBV virions inhibit Toll-like receptor (TLR)-mediated antiviral responses (Visvanathan et al., 2007; Wu et al., 2009; Vincent et al., 2011; Woltman et al., 2011). HBV polymerase suppresses IRF3 activation by interacting with the host RNA helicase, DDX3 (Wang and Ryu, 2010; Yu et al., 2010). HBV polymerase also disrupts ubiquitination of the stimulator of interferon genes (STING) and blocks innate immune responses against cytoplasmic DNA (Liu et al., 2015). Expression of HBx protein inhibits virus-induced expression of the IFN gene by promoting the decay of mitochondrial antiviral signaling protein (MAVS) (Wei et al., 2010; Kumar et al., 2011). Furthermore, HBV abrogates IFN signal transduction by impairing either STAT1 nuclear import or phosphorylation (Christen et al., 2006, 2007; Lütgehetmann et al., 2011). 
All the listed mechanisms that suppress IFN signaling could also activate L1 retrotransposons, because IFN has been shown to inhibit the expression and retrotransposition of L1 (Yu et al., 2015). The mechanisms underlying the inhibitory effect of IFN on L1 remain unclear. However, MOV10 is an attractive candidate mediator of this inhibitory effect, because MOV10 is an IFN-inducible gene and suppresses L1 retrotransposition (Schoggins et al., 2011; Goodier et al., 2012). Collectively, immune suppression by HBV may activate the expression and retrotransposition of L1 elements.

In addition, HBV may also modulate L1 expression epigenetically. L1 retrotransposition activity is usually suppressed in most somatic cells by host DNA methyltransferase-mediated DNA methylation of its promoter (Ishizu et al., 2012; Castro-Diaz et al., 2014). In cancer cells, global DNA hypomethylation occurs at various genomic loci including those containing DNA repeats and/or retrotransposons (Ehrlich, 2002; Hatziapostolou and Iliopoulos, 2011). Many studies have reported hypomethylation of the L1 loci in HCC and HBV infection (Shitani et al., 2012; Zhang C. et al., 2013; Gao et al., 2014; Zhu et al., 2014). In particular, L1 hypomethylation is likely to be linked to poor outcomes of HCC (Gao et al., 2014; Zhu et al., 2014). Given that global hypomethylation occurs in the host genome (including the L1 loci) during HBV infection, this may upregulate L1 expression, potentially removing an obstacle to L1 transposition in liver cells. In addition, some chimeric transcripts, such as HBx-L1, are detected in HCC and associated with a poor prognosis, further supporting the link between HBV-related HCC and L1 (Lau et al., 2014).

### A POTENTIAL LINK BETWEEN HCV AND L1 IN HCC

The 9.6-kb HCV genome contains a single ORF, encoding a polyprotein precursor of approximately 3000 amino acids. The polyprotein is cleaved by host and viral proteases, producing structural (core, E1 and E2) and non-structural (P7, NS2, NS3, NS4A, NS4B, NS5A and NS5B) proteins (**Figure 1B**). The replication of HCV starts with the synthesis of a full-length, negative-stranded RNA intermediate, which in turn works as a template for the de novo production of positive-stranded genomic RNA. Thus, HCV replicates without a known DNA intermediate stage. HCV genomic RNA is highly structured and contains double-stranded regions in various portions (Tuplin et al., 2002; Zhang S. et al., 2013). Double-stranded RNAs (dsRNAs) are also generated during the replication cycle of HCV. Such dsRNAs are potent inducers of innate immune responses, mainly through TLR3 and retinoic acid inducible gene-I (RIG-I) signal pathways (Li et al., 2012). However, the immune responses induced by HCV are not strong enough to eradicate the virus (Battaglia and Hagmeyer, 2000).

Although it is thought that non-retroviral RNA viruses are not integrated into host genomic DNA, we and others have demonstrated that they do become integrated into the host genome via host retrotransposon machineries (Geuking et al., 2009; Horie et al., 2010). Likewise, HCV cDNA is reportedly detected in patients infected with HCV (Zemer et al., 2008). Because the involvement of HIV was ruled out in all the HCV cDNA-positive patients, it is hypothesized that host retrotransposons might be involved (Zemer et al., 2008). However, the retrotransposons responsible for this phenomenon remain unidentified and the involvement of retroviruses other than HIV is not ruled out. Because the 3'UTR of HCV is not polyadenylated, the contribution of L1, whose substrates are usually polyadenylated, to this phenomenon seems to be unlikely. However, several reports propose alternative retrotransposition mechanisms by L1, termed internal priming or twin priming, where a poly-A tail is not required to prime reverse transcription (Ostertag and Kazazian, 2001; Srikanta et al., 2009). These alternative mechanisms may explain how HCV RNA could be reverse transcribed by L1, despite lacking a poly-A tail. It is also unknown whether HCV cDNA is integrated into the host genome or exists as extrachromosomal DNA. A recent report showed that fragments homologous to HCV genes are present in the rabbit and hare genomes, which might suggest the possibility that cDNA of an HCV ancestor has been integrated into the host genome (Silva et al., 2012). These observations imply that some linkages between HCV and retrotransposon activity might exist.

Most HCV-infected patients develop a chronic infection, suggesting that HCV has developed successful strategies to evade host immune responses (Gale and Foy, 2005) (**Figure 1B**). For instance, the HCV NS3/4A protease cleaves the Toll/IL-1 receptor domain-containing adaptor inducing IFN-β (TRIF) adaptor protein and MAVS to impair TLR3 and retinoic acid-inducible gene-I (RIG-I) signaling pathways, respectively (Foy et al., 2005; Li K. et al., 2005; Li X.-D. et al., 2005). NS5A and E2 proteins suppress the signaling of the interferon-induced protein kinase R (PKR), a key molecule in the innate immune system (Gale et al., 1997; Taylor et al., 1999). The interferon sensitivity-determining region (ISDR) in the NS5A protein interacts with the death domain of myeloid differentiation primary response 88 protein (MyD88), a major adaptor protein in TLR signaling, and impairs its signaling (Abe et al., 2007). All these mechanisms that suppress IFN responses against HCV could in turn activate retrotransposons, such as L1, in infected cells, because IFN and IFN-inducible genes, such as MOV10, have been shown to suppress retrotransposition of L1s as described above (Schoggins et al., 2011; Goodier et al., 2012; Yu et al., 2015).

The HCV core protein has an oncogenic potential (Moriya et al., 1998; Shimotohno, 2000). One mechanism put forward for this is that the core protein modulates host gene expression pathways which may activate oncogene expression (Shrivastava et al., 1998; Marusawa et al., 1999; Shimotohno, 2000; Watashi et al., 2001; Ray et al., 2002). In addition to the core protein, NS5A protein also stimulates NF-κB signaling (Ray et al., 1995; Gong et al., 2001; Park et al., 2002; Waris et al., 2003). Similarly, HCV proteins may stimulate the expression of L1 retrotransposons. Indeed, the infectious HCV virion reportedly activates HIV long terminal repeats (LTR) and upregulates gene transcription (Sengupta et al., 2013). However, studies investigating whether HCV proteins have the potential to stimulate L1 expression and/or retrotransposition have not been reported so far.

### POSSIBLE MECHANISMS OF L1 INVOLVEMENT IN HCC DEVELOPMENT

Although a definitive role for L1 activity in contributing to HCC etiology has not been established thus far, investigating a possible link between L1 activation and the development of HCC would be of considerable interest for a number of reasons (**Figure 2**). Firstly, L1s, when aberrantly expressed or dysregulated, can be major sources of endogenous mutagenesis in humans as described above (Levin and Moran, 2011; Burns and Boeke, 2012). Any potential disruption of tumor suppressor genes by L1 retrotransposition could contribute to the development of HCC. Indeed, L1 was shown to be a crucial source of mutations that can reduce the tumor-suppressive capacity of somatic cells (Shukla et al., 2013). A subset of L1 de novo insertions identified in cancer tissue occurred at genes commonly mutated in cancer (Lee et al., 2012). Secondly, L1 de novo insertions can affect the expression of nearby genes and of the genes in which they are inserted (Lee et al., 2012; Shukla et al., 2013). Intragenic L1 insertions usually coincide with reduced gene expression (Lee et al., 2012). For example, L1 insertion into the tumor suppressor mutated in colorectal cancer (MCC) gene coincides with its downregulation (Shukla et al., 2013). MCC is expressed in liver and suppresses the oncogenic β-catenin/Wnt signaling pathway frequently activated in HCC (Fukuyama et al., 2008). If an L1 insertion occurs close to an oncogene, L1 could enhance oncogene expression, resulting in the development of HCC. For example, the telomerase reverse transcriptase (TERT) gene is one of the most common genes associated with L1 de novo insertion (Ding et al., 2012; Lau et al., 2014). Since aberrant expression of TERT is associated with tumor development, L1 insertion near the TERT locus may have a role in carcinogenesis (Cohen et al., 2007; Nault et al., 2015). L1 insertion at the transcriptional repressor suppression of tumorigenicity 18 (ST18) gene activates its expression (Shukla et al., 2013). 
ST18 is a candidate oncogene in liver, because the expression of ST18 is upregulated in several liver cancer cells and in tumors in a mouse model for inflammation-driven HCC (Shukla et al., 2013). Thirdly, L1 provides sites that could lead to genomic rearrangements (Burwinkel and Kilimann, 1998). Such genomic rearrangements contribute to genomic instability (Burwinkel and Kilimann, 1998; Ehrlich, 2002). Fourthly, L1 retrotransposition could contribute new splice donor or acceptor sites, which could alter the host transcriptome and might enhance HCC progression (Singer et al., 2010). Lastly, L1 retrotransposition occasionally creates new chimeric transcripts, which might enhance the progression to HCC. An example of this mechanism is the L1-MET transcript, a chimeric transcript that consists of the c-MET oncogene and an intronic L1 sequence (Zhu et al., 2014). The expression of L1-MET has been shown to be correlated with that of c-MET (Zhu et al., 2014). Because L1-MET is associated with poor prognosis in cancer (Wolff et al., 2010; Hur et al., 2014), L1-MET might be associated with a poor prognosis for HCC via the activation of c-MET signaling (Zhu et al., 2014).

Taken together, I propose two potential roles for L1 elements in the development of hepatitis virus-related HCC. The first relates to a chimeric transcript specific to HBV-related HCC, HBx-L1, which can be detected in more than 20% of HBV-related HCC and correlates with a poor outcome (Lau et al., 2014). The promoter of the HBx gene transcribes HBx-L1 from a locus that is normally silent in the genome. Knockdown of HBx-L1 reduces migratory and invasive properties of HBV-positive HCC cells. HBx-L1 overexpression confers a growth advantage and promotes cell migration and invasion regardless of its chimeric protein-coding potential, suggesting that HBx-L1 is a long noncoding RNA that promotes HCC phenotypes. Furthermore, it has been shown that HBx-L1 affects β-catenin/Wnt signaling, a major pathway in the oncogenesis of HBV-related HCC, confirming its role in HCC (Whittaker et al., 2010; Lau et al., 2014). In addition, I hypothesize a second possible role for L1 as a potent inducer of the expression of cytidine deaminases, such as activation-induced cytidine deaminase (AID) and apolipoprotein B mRNA editing enzyme, catalytic polypeptide 3 (APOBEC3). Since transgenic mice expressing AID invariably develop tumors, cytidine deaminases may have a carcinogenic potential (Okazaki et al., 2003; Takai et al., 2009). In humans, the APOBEC3 family comprises seven proteins: APOBEC3A, B, C, DE, F, G, and H (Schumann et al., 2010; Vieira and Soares, 2013). Members of the APOBEC3 protein family restrict replication of not only retroviruses such as HIV, but also retrotransposons, HBV and HCV (Harris and Liddament, 2004; Vieira and Soares, 2013). Among APOBEC3 proteins, APOBEC3G seems to have a major role in HIV restriction (Chaipan et al., 2013; Vieira and Soares, 2013). APOBEC3G also has restriction activity against LTR retrotransposons in the mouse genome (Esnault et al., 2005; Schumacher et al., 2008). 
All members of the human APOBEC3 protein family of cytidine deaminases restrict L1 retrotransposition, with APOBEC3A, B, C and F having the strongest inhibitory effect (Muckenfuss et al., 2006; Kinomoto et al., 2007). For HBV and HCV, APOBEC3G is a major restriction factor (Vartanian et al., 2010; Peng et al., 2011; Kitamura et al., 2013). HBV and HCV somehow stimulate the expression of cytidine deaminases (Vartanian et al., 2010). Furthermore, L1 activation reportedly increases the expression of the mouse APOBEC3 gene in mouse embryonic fibroblasts (Yu et al., 2015). Taken together, hepatitis viruses, directly and/or maybe indirectly via L1 activation, induce the expression of cytidine deaminases, which may hyperedit host genomes, resulting in the accumulation of deleterious mutations in the genome and the development of HCC (Okazaki et al., 2003; Takai et al., 2009; Vartanian et al., 2010).

### CONCLUSION AND PERSPECTIVE

The lines of evidence presented here suggest potential links between hepatitis virus infection and L1 retrotransposon activity. In particular, L1 hypomethylation and some L1 chimeric transcripts are associated with poor prognosis of HCC, suggesting that L1 can have a significant effect on HCC phenotypes and supporting the idea that HCC is a cancer in which L1 plays a role. However, knowledge of how L1 activation by chronic hepatitis virus infection enhances the development of HCC is still limited. Further accumulation of examples of recurrent L1 insertion sites in the host genome or recurrent chimeric transcripts specific to hepatitis virus-related HCC will be a promising way to understand L1 involvement in HCC etiology. Single-cell analyses of L1 retrotransposition events and expression in tumor cells and surrounding normal cells may advance these efforts. Understanding the potential roles of L1 in HCC may open avenues to developing novel therapeutics, such as RNA interference against HCC-specific L1 chimeric transcripts.

### AUTHOR CONTRIBUTIONS

TH wrote the manuscript and approved it for publication.

### ACKNOWLEDGMENTS

I would like to thank Makoto Hijikata and Nicholas F. Parrish for helpful discussions and Keizo Tomonaga for his support and encouragement. Preparation of this paper was supported in part by KAKENHI Grant Number 15K08496 from Japan Society for the Promotion of Science (JSPS), and grants from the Takeda Science Foundation, Senri Life Science Foundation, Suzuken Memorial Foundation, The Shimizu Foundation for Immunology and Neuroscience Grant for 2015 and The NOVARTIS Foundation (Japan) for the Promotion of Science.

## REFERENCES


hypermutation of hepatitis B viral genomes: excision repair of covalently closed circular DNA. PLoS Pathog. 9:e1003361. doi: 10.1371/journal.ppat.1003361


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Honda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Guardian of the Human Genome: Host Defense Mechanisms against LINE-1 Retrotransposition

### Yasuo Ariumi\*

*Ariumi Project Laboratory, Center for AIDS Research and International Research Center for Medical Sciences, Kumamoto University, Kumamoto, Japan*

Long interspersed element type 1 (LINE-1, L1) is a mobile genetic element comprising about 17% of the human genome, encoding a newly identified ORF0 with unknown function, ORF1p with RNA-binding activity and ORF2p with endonuclease and reverse transcriptase activities required for L1 retrotransposition. L1 utilizes an endonuclease (EN) to insert L1 cDNA into target DNA, which induces DNA double-strand breaks (DSBs). Ataxia-telangiectasia mutated (ATM) is activated by DSBs and subsequently the ATM-signaling pathway plays a role in regulating L1 retrotransposition. In addition, host DNA repair machinery such as the non-homologous end-joining (NHEJ) repair pathway is also involved in L1 retrotransposition. On the other hand, L1 is an insertional mutagenic agent, which contributes to genetic change, genomic instability, and tumorigenesis. Indeed, high-throughput sequencing-based approaches identified numerous tumor-specific somatic L1 insertions in a variety of cancers, such as colon cancer, breast cancer, and hepatocellular carcinoma (HCC). In fact, L1 retrotransposition seems to be a potential factor that reduces tumor-suppressive capacity in HCC. Furthermore, a recent study demonstrated that a specific viral-human chimeric transcript, HBx-L1, contributes to hepatitis B virus (HBV)-associated HCC. In contrast, host cells have evolved several defense mechanisms protecting cells against retrotransposition, including epigenetic regulation through DNA methylation and host defense factors, such as APOBEC3, MOV10, and SAMHD1, which restrict L1 mobility as guardians of the human genome. In this review, I focus on somatic L1 insertions into the human genome in cancers and host defense mechanisms against deleterious L1 insertions.

Keywords: LINE-1, retrotransposition, DNA double-strand breaks (DSBs), DNA repair, tumor suppressor, HBV, epigenetic regulation, somatic insertion

### INTRODUCTION

Long interspersed element type 1 (LINE-1, L1) is an active and autonomous non-long terminal repeat (LTR) retrotransposon composing about 17% of the human genome and L1 is an essential evolutionary force (DeBerardinis et al., 1998; Ostertag and Kazazian, 2001; Cordaux and Batzer, 2009; Hancks and Kazazian, 2012). However, only 100 copies out of ∼500,000 copies still remain active (Brouha et al., 2003; Mills et al., 2007; Beck et al., 2010). The remaining L1s are 5′ truncated and defective. Furthermore, L1 provides the trans-acting functions required for the retrotransposition of non-autonomous retrotransposons such as short interspersed element (SINE), which includes Alu repeats in humans, SINE-VNTR-Alu (SVA), and processed pseudogenes (Esnault et al., 2000; Dewannieux et al., 2003; Hancks et al., 2011).

### Edited by:

*Tammy A. Morrish, Formerly affiliated with University of Toledo, USA*

### Reviewed by:

*Geoff Faulkner, Mater Medical Research Institute, Australia Nemanja Rodic, Yale University, USA*

> \*Correspondence: *Yasuo Ariumi ariumi@kumamoto-u.ac.jp*

#### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry*

Received: *14 February 2016* Accepted: *14 June 2016* Published: *28 June 2016*

#### Citation:

*Ariumi Y (2016) Guardian of the Human Genome: Host Defense Mechanisms against LINE-1 Retrotransposition. Front. Chem. 4:28. doi: 10.3389/fchem.2016.00028*

L1 encodes three open reading frames, a newly identified ORF0 with unknown function, ORF1p with RNA-binding and nucleic acid chaperone activities, and ORF2p with AP-like endonuclease (EN) and reverse transcriptase (RT) activities required for L1 retrotransposition (Mathias et al., 1991; Martin and Bushman, 2001; Ostertag and Kazazian, 2001; Hancks and Kazazian, 2012; Denli et al., 2015). ORF0 is a primate-specific ORF in the anti-sense 5′ untranslated region (UTR) of L1 (Denli et al., 2015). ORF0 predominantly localizes in nuclear PML-adjacent foci and enhances L1 mobility. ORF1p and ORF2p preferentially assemble with L1 RNA and form a ribonucleoprotein (RNP) in cytoplasmic foci (Goodier et al., 2007; Doucet et al., 2010). Although retroviruses and LTR-retrotransposons utilize a long terminal repeat (LTR) to synthesize full-length transcripts, L1 instead utilizes an internal promoter in its 5′UTR (Swergold, 1990). Several transcription factors including SOX11 (Tchenio et al., 2000), YY1 (Becker et al., 1993; Athanikar et al., 2004), RUNX3 (Yang et al., 2003), and p53 (Harris et al., 2009) positively regulate L1 transcription. On the other hand, SOX2 (Muotri et al., 2005) and SRY (Tchenio et al., 2000) as well as several epigenetic factors negatively regulate L1 transcription (**Table 1**).

L1 integrates into the genome by target-primed reverse transcription (TPRT) (Luan et al., 1993) after the L1-RNP complex enters the nucleus. During TPRT, the L1 EN creates a nicked DNA that serves as a primer for reverse transcription of L1 RNA, leading to integration of L1 cDNA into the human genome (Feng et al., 1996). A typical L1 EN cleavage site is 5′ - TTTT/AA-3′ (Feng et al., 1996; Cost and Boeke, 1998). Thus, L1 insertion generates DNA double-strand breaks (DSBs) as well as L1 structural hallmarks such as frequent 5′ truncations, 3′ poly(A) tails and variable length target site duplications (TSDs) in the target DNA. L1 can alter the mammalian genome in many ways upon retrotransposition, since the insertion of L1 into the human genome may cause genomic instability, genetic disorders, and cancers through insertional mutagenesis (Kazazian et al., 1988; Morse et al., 1988; Miki et al., 1992; Narita et al., 1993; Holmes et al., 1994; Gilbert et al., 2002; Morrish et al., 2002; Symer et al., 2002; Belancio et al., 2008; Beck et al., 2011; Hancks and Kazazian, 2012; Bundo et al., 2014; Kines et al., 2014; **Figure 1**). So far, >100 disease-causing retrotransposon insertions have been identified in humans [26 L1, 61 Alu, 12 SVA, 4 poly(A)] (**Figure 1**).
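To make the cleavage-site description concrete, the following is a minimal sketch that scans a DNA string for the typical L1 EN site quoted above (5′-TTTT/AA-3′) and reports where the nick would fall. The function names (`find_en_sites`, `reverse_complement`) and the toy sequences are hypothetical and chosen only for illustration. The sketch matches the exact consensus only, which is a simplifying assumption: it does not model the variation seen at real insertion sites.

```python
import re

# Typical L1 EN site as quoted in the text: 5'-TTTT/AA-3'.
# The "/" marks the nick, i.e. after the fourth T of the motif.
CONSENSUS = "TTTTAA"
NICK_OFFSET = 4

# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_en_sites(seq: str) -> list[int]:
    """Return 0-based nick positions for every exact consensus match.

    A position p means the nick lies between seq[p - 1] and seq[p].
    Only the given strand is scanned; pass reverse_complement(seq)
    to scan the opposite strand. Exact matching is an illustrative
    assumption, not a model of real EN specificity.
    """
    seq = seq.upper()
    # A lookahead reports overlapping matches as well.
    pattern = re.compile(f"(?={CONSENSUS})")
    return [m.start() + NICK_OFFSET for m in pattern.finditer(seq)]


if __name__ == "__main__":
    toy = "GGTTTTAACC"  # made-up sequence, not genomic data
    print(find_en_sites(toy))
    print(find_en_sites(reverse_complement("GGTTAAAACC")))
```

In the TPRT model described above, the 3′ end exposed at such a nick is what primes reverse transcription of the L1 RNA, so the returned positions can be read as candidate priming points on the scanned strand.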

### L1-MEDIATED DSBs INDUCTION AND DNA REPAIR MACHINERY

L1 is known to induce DSBs in target DNA by L1 EN activity (Gasior et al., 2006). The ataxia-telangiectasia mutated (ATM) protein is activated by DSBs and subsequently phosphorylates downstream substrates including p53, Chk2, BRCA1 and the MRE11-Rad50-NBS1 (MRN) complex, resulting in the activation of the DNA damage checkpoint and cell cycle arrest (Harper and Elledge, 2007; Ciccia and Elledge, 2010; Shiloh, 2014). Accordingly, L1 retrotransposition was increased in ATM-deficient cells, indicating that the ATM signaling pathway modulates L1 retrotransposition (Coufal et al., 2011). In contrast, the E6 protein from β-human papillomavirus (β-HPV 5 and 8) reduces ATM protein levels and attenuates L1 retrotransposition, suggesting that ATM is needed for efficient L1 retrotransposition (Wallace et al., 2013). Thus, the DNA damage response may modulate L1 retrotransposition. Notably, L1 can integrate into preformed DSBs generated independently of L1 EN, resulting in retrotransposon-mediated DNA repair (Morrish et al., 2002). Furthermore, host DNA repair machinery may also impact L1 retrotransposition. Gasior et al. reported that the DNA repair enzyme ERCC1/XPF heterodimer limits L1 retrotransposition (Gasior et al., 2008). Importantly, deficiencies in non-homologous end-joining (NHEJ) repair pathway components such as Ku70, Artemis, and DNA ligase IV (LigIV) decrease retrotransposition frequencies of human L1 in chicken DT40 cells, suggesting that the NHEJ repair pathway is required for efficient L1 retrotransposition (Suzuki et al., 2009).

#### TABLE 1 | Host factors regulating the L1 transcription.

## L1 RETROTRANSPOSITION IN CANCERS

Somatic L1 insertions are seldom observed in normal tissues except hippocampus (Baillie et al., 2011; Evrony et al., 2012; Upton et al., 2015). Although most L1 retrotransposition was thought to occur in the germline, somatic L1 insertions were also found to occur in a variety of tumors, including breast cancer, colon cancer, hepatocellular carcinoma (HCC), and lung cancer (Miki et al., 1992; Liu et al., 1997; Iskow et al., 2010; Lee et al., 2012; Solyom et al., 2012; Shukla et al., 2013; Carreira et al., 2014; Helman et al., 2014; Ewing et al., 2015; **Table 2**). Initially, three L1 insertion candidates were reported in human tumors (Morse et al., 1988; Miki et al., 1992; Liu et al., 1997). However, the two insertions described by Liu et al. (1997) and Morse et al. (1988) lack all of the hallmark features of a true somatic retrotransposition event, such as an L1 endonuclease cleavage site, the presence of an L1 poly(A) tail, target-site duplication (TSD), 5′ truncation and inversion, and 3′ transduction (Holmes et al., 1994; Moran et al., 1999; Goodier et al., 2000; Pickeral et al., 2000; Szak et al., 2002; **Table 2**). These insertions may be derived from recombination events, L1 EN-independent insertions (Morrish et al., 2002), or other atypical integration mechanisms of L1 retrotransposition. Indeed, an L1 insertion disrupts the adenomatous polyposis coli (APC) gene in a colon cancer, indicating the disruption of a tumor suppressor gene caused by somatic L1 insertion (Miki et al., 1992). Accordingly, a recent study identified a novel somatic insertion in the APC gene and a hot spot for L1 insertion on Chromosome 17, suggesting that the L1 insertion initiates colorectal cancer (CRC) by mutating the APC gene through the classic two-hit CRC pathway (Scott et al., 2016). 

#### TABLE 2 | L1 insertions in cancers.

\**Lack of the hallmark features of a true somatic retrotransposition event (Morse et al., 1988; Liu et al., 1997).*

Furthermore, high-throughput sequencing-based approaches identified numerous somatic tumor-specific insertions in cancers (Miki et al., 1992; Liu et al., 1997; Iskow et al., 2010; Lee et al., 2012; Solyom et al., 2012; Shukla et al., 2013; Carreira et al., 2014; Helman et al., 2014; Tubio et al., 2014). Indeed, Lee et al. identified L1 insertions in cadherin-12 (CDH12), roundabout, axon guidance receptor, homolog 2 (ROBO2), NRXN3, FPR2, COL11A1, NEGR1, NTM, and CTNNA2 (Lee et al., 2012). Likewise, Solyom et al. identified several tumor-specific insertions in colorectal tumors, including odd Oz/tenm homolog 3 (ODZ3), ROBO2, protein tyrosine phosphatase, receptor type, M (PTPRM), pericentriolar material 1 (PCM1), CDH11, and runt-related transcription factor 1 (RUNX1T1) (Solyom et al., 2012). All insertions were severely 5′ truncated. Interestingly, these genes are associated with cell-adhesion functions and both groups identified L1 insertions in the same ROBO2 gene, suggesting a potential role of cell-adhesion genes in L1 insertion-mediated colorectal tumorigenesis. In addition, Tubio et al. analyzed the somatic L1 retrotransposition activities in 290 cancers and noticed insertions occurring during cancer development. Fifty-three percent of the patients had at least one somatic L1 retrotransposition event, of which 24% were 3′ transductions; events were most frequent in colorectal cancers (93%) and lung cancers (75%). These findings suggest that 3′ transductions are potentially mutagenic. Somatic L1 retrotranspositions tend to insert in intergenic or heterochromatin regions of the cancer genome (Tubio et al., 2014). Furthermore, somatic L1 insertions participate in the dynamics of many tumor genomes and lead to driver mutations. Surprisingly, L1 insertion was reported in colonic adenoma, a known cancer precursor, suggesting that widespread somatic L1 retrotransposition occurs early during development of gastrointestinal (GI) tumors, probably before dysplastic growth (Ewing et al., 2015). Similarly, a recent study demonstrated that L1 retrotransposition is active in esophageal adenocarcinoma and its precursor, Barrett's esophagus (BE), indicating that somatic L1 insertions occur early in BE and esophageal adenocarcinoma. Notably, two L1 insertions were detected in normal esophagus, indicating that some L1 insertions may occur in normal squamous epithelium cells (Doucet-O'Hare et al., 2015). In this regard, most of the new somatic insertions are truncated and would not mobilize again. Thus, mutations arising from insertions in normal precursor esophageal tissue or benign BE could contribute to tumorigenesis. Otherwise, only a rare full-length somatic insertion has the potential to contribute to mutation during the various stages of transition to tumorigenesis. In addition, L1 insertions in pancreatic ductal adenocarcinoma (PDAC) were reported with a discordant rate of retrotransposition between primary and metastatic sites, suggesting that L1 insertions in gastrointestinal neoplasms occur discontinuously. Thus, somatic L1 insertions contribute to genetic and phenotypic heterogeneity in PDAC (Rodić et al., 2015). 
Interestingly, somatic insertions were identified in epithelial tumors but not in blood or brain cancers (Lee et al., 2012). However, this study has limitations. For example, the sample size was small and the normal tissue was not from the same patient. In addition, the study examined only multiple myeloma and not the entire spectrum of blood-based cancers. In this regard, ten-eleven translocation (TET) 2, a DNA demethylation-related protein, is frequently mutated in myeloid and lymphoid tumors (Ko et al., 2015). The TET family oxidizes 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC) in DNA, leading to DNA demethylation. Since DNA methylation has a pivotal regulatory role in L1 silencing, TET2 may impact L1 mobility. Therefore, L1 insertions may be suppressed in such hematological cancers. Intriguingly, several somatic insertions occur in genes that are commonly mutated in cancers, such as tumor suppressor genes. These insertions disrupt the expression of target genes, and are biased toward regions of cancer-specific DNA hypomethylation (Lee et al., 2012). Indeed, recent studies identified somatic L1 insertions in tumor suppressor genes, such as APC and PTEN (Miki et al., 1992; Helman et al., 2014). In addition, the first case of familial retinoblastoma (Rb) caused by a de novo insertion of a full-length L1 into intron 14 of the Rb gene, resulting in aberrant and non-canonical mRNA splicing of the Rb gene, was reported (Rodríguez-Martín et al., 2016). Furthermore, 18 retrotransposon insertions [14 Alu, 3 L1, and 1 poly(A)] were identified in the neurofibromatosis type 1 (NF1) gene (Wimmer et al., 2011).

Although still debated, cell division seems to be required for efficient L1 retrotransposition (Shi et al., 2007; Xie et al., 2013). In fact, retrotransposition was strongly inhibited in cells arrested in the G1, S, G2, or M phase of the cell cycle. The reduction in L1 transcript abundance limits retrotransposition in non-dividing cells, suggesting that inhibition of retrotransposition in non-dividing cells protects somatic cells from accumulation of deleterious mutations caused by L1 insertions (Shi et al., 2007). In contrast, another report detected L1 retrotransposition in non-dividing and primary human somatic cells using an adenovirus-L1 hybrid vector, although retrotransposition was detected in G1/S- but not in G0-arrested cells (Kubo et al., 2006). In addition, retrotransposition was also inhibited during cellular senescence in primary human fibroblasts. So far, several biomarkers of cellular senescence have been identified, such as senescence-associated β-galactosidase (SA-β-Gal), p53/p21, p16INK4a, senescence-associated heterochromatin foci (SAHF), senescence-associated secretory phenotype (SASP), autophagy, telomere-induced foci/DNA damage response (DDR), and cell cycle arrest (Kuilman et al., 2010), and the reduction in L1 retrotransposition may be another biomarker of cellular senescence. Thus, the cell cycle may affect L1 retrotransposition.

L1 protein expression is a common feature of many types of high-grade malignant tumor, yet is rarely detected in the early stages of tumorigenesis (Rodić et al., 2014). The L1 promoter is normally silenced by methylation in normal somatic cells (Woodcock et al., 1997; Schulz et al., 2006). In contrast, the L1 promoter is hypomethylated (Baba et al., 2014), and expression of L1 is elevated in many tumors. In fact, L1 expression was detected in human breast carcinomas and testicular cancers (Bratthauer and Fanning, 1992; Bratthauer et al., 1994; Nangia-Makker et al., 1998). L1 ORF1p protein is detected in a variety of tumor cells including breast cancer, colon cancer, pancreatic ductal adenocarcinoma, and HCC but not in normal somatic cells (Bratthauer et al., 1994; Asch et al., 1996; Rodić et al., 2014). Thus, L1 ORF1p expression seems to be a hallmark of many human cancers as a highly specific tumor marker.

In addition to expression of L1, a hallmark of tumor cells is an activated telomere maintenance mechanism that allows prolonged survival of the malignant tumor cells. In more than 80% of tumors, telomeres are maintained by telomerase. Notably, reduced telomere length was reported in L1 knockdown cells, indicating that L1 is involved in telomere maintenance in telomerase-positive tumor cells (Aschacher et al., 2016). Accordingly, L1 is involved in the transcriptional regulation of hTERT and in the upregulation of its transcription factors c-Myc and KLF-4 (Aschacher et al., 2016). Thus, L1 may contribute to the development of cancers. However, these studies were not done in alternative lengthening of telomeres (ALT)-positive tumors or telomerase-negative tumors. Consequently, it is uncertain whether L1 contributes directly to telomere maintenance or whether the reduction in telomere length is attributable to the reduction in telomerase levels. Indeed, the stoichiometry of telomerase is important for maintaining telomere length (Armanios et al., 2005; Goldman et al., 2005).

Chronic infection with hepatitis B virus (HBV) is a major risk factor for the development of HCC. HBV integration into the human genome was found in most HBV-related HCC and has been implicated in the development of HCC. An initial study proposed that HBV integration occurs randomly, without a preferred integration site (Matsubara and Tokino, 1990). However, high-throughput sequencing-based approaches identified recurrent integration sites in HCC (Ding et al., 2012). HBV integration favored chromosome 17 and occurred preferentially within human transcription units. In particular, the telomerase reverse transcriptase (TERT) and fibronectin 1 (FN1) genes were identified as recurrent HBV integration sites. Furthermore, seven integrations were found in repeat regions including L1, LTR/ERV1, and SINE/Alu (Ding et al., 2012). Similarly, a recent transcriptome sequencing study of HBV-positive HCC cell lines discovered that HBV integrates into L1 (Lau et al., 2014). Insertion of the gene encoding hepatitis B virus x (HBx) into L1 on chromosome 8p11 produces an oncogenic HBx-LINE1 chimeric RNA transcript (Lau et al., 2014; **Figure 2**). The HBx-LINE1 RNA transcript was detected in 23.3% of HCC, suggesting that HBx-LINE1 is selected for in HCC oncogenesis. The long non-coding RNA (lncRNA)-like HBx-LINE1 transcript confers cancer-promoting properties through activation of the Wnt/β-catenin signaling pathway (Lau et al., 2014).

Furthermore, endogenous L1-mediated retrotransposition was identified in the germline and somatic cells of HCC patients (Shukla et al., 2013). The germline L1 insertion in the tumor suppressor mutated in colorectal cancers (MCC) was detected in 21.1% of HCC, resulting in the aberrant expression of MCC. Moreover, suppression of tumorigenicity 18 (ST18) was activated by a tumor-specific somatic L1 insertion (Shukla et al., 2013). Thus, L1-mediated retrotransposition seems to be a potential etiological factor in HCC.

HBx-LINE1 fusion RNA transcript in HCC.

### GUARDIAN OF THE HUMAN GENOME: HOST DEFENSE MECHANISMS AGAINST L1 RETROTRANSPOSITION

Since insertion of L1 into the human genome may cause human genetic disorders and cancer, retrotransposition must be silenced under normal conditions. To restrict deleterious retrotransposition, host cells have evolved several defense mechanisms protecting cells against retrotransposition including epigenetic regulation through DNA methylation (Burden et al., 2005; Trono, 2015), RNA silencing by RNA interference (Soifer et al., 2005; Yang and Kazazian, 2006), PIWI-interacting RNA (piRNA)-PIWI system (Aravin et al., 2007a,b; Kuramochi-Miyagawa et al., 2008; De Fazio et al., 2011; Marchetto et al., 2013), microRNA (Hamdorf et al., 2015), and host restriction factors, such as apolipoprotein B mRNA editing enzyme catalytic polypeptide-like 3 (APOBEC3), Moloney leukemia virus 10 (MOV10), and SAM domain and HD domain containing protein 1 (SAMHD1) (**Table 3**).

DNA methylation within the 5′UTR promoter of L1 is essential for maintaining transcriptional inactivation and for inhibiting L1 retrotransposition (Woodcock et al., 1997; Liang et al., 2002; Burden et al., 2005). L1 is highly active during early embryogenesis, while L1 is silenced early in development through epigenetic mechanisms (**Table 1**). Indeed, methylation of the L1 promoter is maintained by DNA methyltransferases (DNMTs), including DNMT1, DNMT3a, and DNMT3b (Liang et al., 2002). L1 retrotransposition is negatively regulated by methyl-CpG-binding protein 2 (MeCP2)-mediated DNA methylation (Yu et al., 2001; Muotri et al., 2010). In addition, the nucleosomal and remodeling deacetylase (NuRD) multiprotein complex is specifically enriched at the L1 promoter. Rb and E2F are recruited to the L1 promoter along with histone deacetylases (HDACs), including HDAC1 and HDAC2 (Montoya-Durango et al., 2009, 2016). Furthermore, KRAB-associated protein 1 (KAP1, also known as TRIM28) mediates transcriptional silencing of endogenous retroelements (EREs) including L1, Alu, SVA, and human endogenous retrovirus-K (HERV-K), as well as the exogenous retrovirus mouse leukemia virus (MLV), in embryonic stem (ES) cells (Wolf and Goff, 2007, 2009; Matsui et al., 2010; Rowe et al., 2010; Castro-Diaz et al., 2014; Turelli et al., 2014; Trono, 2015). Krüppel-associated box (KRAB)-containing zinc-finger proteins (KRAB-ZFPs/ZNFs), a large family of tetrapod-restricted transcription factors, and their cofactor KAP1 serve as a scaffold for a heterochromatin complex comprising the SETDB1 (also known as ESET) histone methyltransferase, histone deacetylase, nucleosome remodeling, and DNMT activities (Trono, 2015; **Figure 3**). Furthermore, the protein deacylase and mono-ADP ribosyltransferase Sirtuin 6 (SIRT6) represses L1 mobility by ribosylating KAP1 (Van Meter et al., 2014). SIRT6 binds to the 5′UTR of L1 and ribosylates KAP1, which facilitates the interaction of KAP1 with the heterochromatin factor HP1α and thereby contributes to the packaging of L1 into transcriptionally repressive heterochromatin. Moreover, promyelocytic leukemia zinc finger (PLZF) protein, a member of the POK (POZ and Krüppel zinc finger) family of transcription factors that acts as an epigenetic regulator of stem cell maintenance in germ cells and haematopoietic stem cells, represses L1 retrotransposition in germ and progenitor cells (Puszyk et al., 2013). PLZF-mediated DNA methylation induces silencing of the L1 gene, resulting in inhibition of L1 retrotransposition. Species-specific KZNFs might recruit KAP1 to species-specific retrotransposon classes that recently invaded the host genome. In this regard, Jacobs et al. recently reported that two primate-specific KRAB-ZNFs, ZNF91 and ZNF93, repress SVA and L1 retrotransposons, respectively (Jacobs et al., 2014). ZNF93 evolved earlier to repress the primate L1 lineage until ∼12.5 million years ago, when the L1PA3 subfamily escaped ZNF93-mediated restriction through the removal of the ZNF93-binding site, suggesting an evolutionary arms race between KRAB-ZNFs and retrotransposons (Jacobs et al., 2014).

#### TABLE 3 | Host defense factors against L1.

Post-translational modification and subcellular localization of L1 proteins seem to be important for modulation of L1 mobility. In fact, phosphorylation of ORF1p is required for L1 retrotransposition (Cook et al., 2015). L1 ORF1p contains four conserved proline-directed protein kinase (PDPK) target sites, and mutations at these PDPK sites in ORF1p could inactivate L1 mobility (Cook et al., 2015). The PDPK family includes mitogen-activated protein kinases (MAPKs) and cyclin-dependent kinases (CDKs). Although nuclear localization of L1 ORF1p and ORF2p is essential for L1 retrotransposition, L1 ORF1p predominantly localizes in punctate cytoplasmic foci in most cases (Goodier et al., 2007; Harris et al., 2010; Chen et al., 2012). However, in several breast cancers, L1 ORF1p and ORF2p were also detected in the nucleus (Harris et al., 2010; Chen et al., 2012). Indeed, the expression of L1 is elevated in breast cancers.

Recently, the APOBEC3 family of cytidine deaminases, MOV10, and SAMHD1 have been identified as restriction factors for human immunodeficiency virus type 1 (HIV-1) (Sheehy et al., 2002; Harris et al., 2003; Mangeat et al., 2003; Zhang et al., 2003; Burdick et al., 2010; Furtak et al., 2010; Wang et al., 2010; Hrecka et al., 2011; Laguette et al., 2011). APOBEC3A, APOBEC3B, and APOBEC3F, but not APOBEC3G, inhibit L1 retrotransposition in a DNA deaminase-independent manner, indicating a novel anti-L1 retrotransposition mechanism (Turelli et al., 2004; Bogerd et al., 2006; Chen et al., 2006; Muckenfuss et al., 2006; Stenglein and Harris, 2006; Hulme et al., 2007; Kinomoto et al., 2007; Niewiadomska et al., 2007; Schumann, 2007; Arias et al., 2012). In contrast, Kinomoto et al. and Niewiadomska et al. reported that APOBEC3G could inhibit L1 retrotransposition in a DNA deamination-independent manner (Kinomoto et al., 2007; Niewiadomska et al., 2007). Furthermore, APOBEC3G inhibits Alu retrotransposition in a DNA deaminase-independent manner (Chiu et al., 2006; Hulme et al., 2007; Bulliard et al., 2009). The MOV10 RNA helicase also inhibits L1 and Alu retrotransposition (Arjan-Odedra et al., 2012; Goodier et al., 2012, 2013; Li et al., 2013). Similarly, SAMHD1 inhibits LINE-1 and Alu/SVA retrotransposition (Zhao et al., 2013). SAMHD1 inhibits L1 retrotransposition by promoting the sequestration of L1 RNP within stress granules (Hu et al., 2015). Similarly, the zinc-finger antiviral protein (ZAP), also known as PARP13, a member of the poly(ADP-ribose) polymerase (PARP) family, inhibits the retrotransposition of L1, Alu, and intracisternal A particle (IAP) retrotransposons (Goodier et al., 2015; Moldovan and Moran, 2015). ZAP interacts with L1 RNA and L1 ORF1p and co-localizes with stress granules.

Type I interferons (IFN1), including IFNα and IFNβ, are involved in innate immunity against viruses. In this regard, a recent study reported that L1 induces IFN1 and that IFN1, in turn, inhibits L1 retrotransposition, suggesting that IFN1 controls propagation of L1 as well as maintenance of genomic integrity (Yu et al., 2015). Accordingly, several interferon-stimulated genes (ISGs), including APOBEC3, MOV10, BST-2, ISG20, MAVS, MX2, RNase L, SAMHD1, TREX1, and ZAP, restrict L1 retrotransposition, indicating that ISGs are key players in the type I interferon anti-retroelement response (Turelli et al., 2004; Bogerd et al., 2006; Chen et al., 2006; Muckenfuss et al., 2006; Stenglein and Harris, 2006; Hulme et al., 2007; Kinomoto et al., 2007; Niewiadomska et al., 2007; Schumann, 2007; Stetson et al., 2008; Arias et al., 2012; Zhao et al., 2013; Zhang et al., 2014; Goodier et al., 2015; Hu et al., 2015; **Table 3**).

Small RNAs have been implicated in the regulation of L1 mobility. Piwi proteins and Piwi-interacting RNAs (piRNA) silence L1 during genome reprogramming in the embryonic male germ line (De Fazio et al., 2011; Marchetto et al., 2013). Notably, Hamdorf et al. uncovered a new mechanism in which microRNAs restrict L1 mobilization and L1-associated mutations in cancer cells, cancer-initiating cells and iPS cells (Hamdorf et al., 2015). Indeed, miR-128 represses L1 retrotransposition by binding directly to L1 RNA, suggesting a new function of microRNAs in mediating genomic stability by suppressing the mobility of endogenous retrotransposons.

Mutations in the tumor suppressor p53 occur in most human cancers; however, precisely how p53 functions to mediate tumor suppression is not well understood. In this regard, p53 was reported to restrict L1 mobility, suggesting that p53 restricts oncogenesis in part by restricting transposon mobility (Wylie et al., 2016). Although normal human p53 suppressed transposons, mutant p53 from cancer patients could not. Accordingly, L1 activity was elevated in p53-negative human cancers. Thus, an ancestral function of p53 may be transposon control as a guardian of the human genome.

### CONCLUSION

L1 has successfully propagated to compose 17% of the human genome, acting as an evolutionary force. Activation of the normally silent L1 is associated with a high level of cancer-associated DNA damage and genomic instability. Indeed, L1 insertions into the human genome may cause cancers through insertional mutagenesis. In fact, recent high-throughput sequencing-based approaches have identified numerous somatic tumor-specific L1 insertions in a variety of cancers (Iskow et al., 2010; Lee et al., 2012; Solyom et al., 2012; Shukla et al., 2013; Helman et al., 2014); however, the evidence for a causal role is not yet sufficient. Therefore, the role of L1-mediated retrotransposition in human cancers should be clarified. Indeed, whether L1 insertion events are passenger mutations or driver mutations with a causative role in tumorigenesis remains to be clarified (Rodić and Burns, 2013). Intriguingly, somatic insertions were only identified in epithelial tumors (Lee et al., 2012). Accordingly, epithelial cells can be transformed into cancer stem cells (Wang et al., 2013) and metastasis is more prevalent in epithelial tumors (Gotzmann et al., 2004). Thus, epithelial cells seem to be plastic (Carreira et al., 2014). Cancer stem cells are defined as rare cells with indefinite potential for self-renewal that drive tumorigenesis (Reya et al., 2001). However, the role of L1 mobility in cancer stem cells remains to be clarified. Recent studies focus on the relationship among L1 mobility, reprogramming, and differentiation. Indeed, reprogramming somatic cells into iPS cells activates L1 mobility (Wissing et al., 2012; Friedli et al., 2014; Klawitter et al., 2016). On the other hand, L1 mobility is enhanced in tumor cells. In this regard, the elevation of L1 protein or RNA expression levels may be useful as a diagnostic hallmark of many human cancers and as a tumor-specific marker for metastasis and prognosis. Furthermore, recent advances in single cell analysis will be useful for comparing L1 mobility and L1 integration sites at the single-cell level in human cancers.

Finally, tumor suppressor proteins may be associated with transposon control to restrict deleterious retrotransposition as guardians of the human genome. Wild-type p53 suppresses transposon mobility in normal cells, whereas mutant p53 in cancer cells cannot, resulting in the activation of L1 mobility in cancer cells (Wylie et al., 2016). Furthermore, recent studies identified somatic L1 insertions in tumor suppressor genes, such as APC, PTEN, NF1, and Rb (Miki et al., 1992; Wimmer et al., 2011; Helman et al., 2014; Rodríguez-Martín et al., 2016). Thus, L1 insertions in tumor suppressor genes may disrupt their functions and be associated with tumorigenesis. Altogether, host cells have evolved several defense mechanisms protecting cells against retrotransposition.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

### FUNDING

This work was supported by a Lateral Research from the Japan Society for the Promotion of Science (JSPS), by the Research Program on Hepatitis from Japan Agency for Medical Research and Development, AMED, and by Takeda Science Foundation.

### ACKNOWLEDGMENTS

I thank Ms. Kazumi Tsuruhara for secretarial assistance and her kind encouragement.




**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Ariumi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Templated Sequence Insertion Polymorphisms in the Human Genome

#### Masahiro Onozawa<sup>1,2</sup> and Peter D. Aplan<sup>1</sup> \*

*<sup>1</sup> Genetics Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA, <sup>2</sup> Department of Hematology, Hokkaido University Graduate School of Medicine, Sapporo, Japan*

Templated Sequence Insertion Polymorphism (TSIP) is a recently described form of polymorphism recognized in the human genome, in which a sequence that is templated from a distant genomic region is inserted into the genome, seemingly at random. TSIPs can be grouped into two classes based on nucleotide sequence features at the insertion junctions. Class 1 TSIPs show features of insertions that are mediated via the LINE-1 ORF2 protein, including (1) target-site duplication (TSD), (2) polyadenylation 10–30 nucleotides downstream of a "cryptic" polyadenylation signal, and (3) preference for insertion at a 5′-TTTT/A-3′ sequence. In contrast, class 2 TSIPs show features consistent with repair of a DNA double-strand break (DSB) via insertion of a DNA "patch" that is derived from a distant genomic region. A survey of a large number of normal human volunteers demonstrates that most individuals have 25–30 TSIPs, and that these TSIPs track with specific geographic regions. Similar to other forms of human polymorphism, we suspect that these TSIPs may be important for the generation of human diversity and genetic diseases.

### Edited by:

*Tammy A. Morrish, Formerly affiliated with University of Toledo, USA*

### Reviewed by:

*Francesca Storici, Georgia Institute of Technology, USA Jeffrey Han, Tulane University, USA*

### \*Correspondence:

*Peter D. Aplan aplanp@mail.nih.gov*

### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry*

Received: *09 August 2016* Accepted: *27 October 2016* Published: *16 November 2016*

### Citation:

*Onozawa M and Aplan PD (2016) Templated Sequence Insertion Polymorphisms in the Human Genome. Front. Chem. 4:43. doi: 10.3389/fchem.2016.00043*

Keywords: templated sequence insertion polymorphisms (TSIPs), mitochondria, polymorphism, human migration, DNA repair, LINE-1 retrotransposon

## INTRODUCTION

Maintenance of chromosomal integrity is required for the survival of all organisms, from simple prokaryotes to complex eukaryotes. This maintenance of chromosomal integrity is accomplished by DNA repair enzymes. There are a number of DNA repair systems that operate in eukaryotes, including DNA mismatch repair, DNA single-strand break repair, and DNA double-strand break (DSB) repair. DNA DSB repair can be further subdivided into repair by homologous recombination, "canonical" non-homologous end-joining (NHEJ), and "non-canonical" NHEJ (Chiruvella et al., 2013; Deriano and Roth, 2013).

Transfected plasmid DNA can be captured and used as a patch at the site of experimentally induced DNA DSBs; the DNA patches typically show signs of NHEJ, such as micro-deletion, microhomology, and non-templated nucleotide addition (Lin and Waldman, 2001; Varga and Aplan, 2005; Cheng et al., 2010). Moreover, Yu and Gabriel detected mitochondrial DNA fragments at the site of HO endonuclease-induced DNA DSBs (Yu and Gabriel, 1999), demonstrating that a DNA DSB can be repaired by insertion of DNA sequences in yeast.

RNA can provide a template for DNA synthesis during telomere elongation (Autexier and Lue, 2006) or reverse transcription of retrotransposons (Baltimore, 1985). Several lines of evidence suggest that endogenous retrotransposons may also have a role in DNA DSB repair in human cells. When introduced into yeast, the human LINE-1 ORF2 can mediate repair of HO endonuclease-induced DNA DSBs via insertion of retrotransposon sequences (Teng et al., 1996), through a cDNA intermediate. Subsequently, synthetic RNA oligonucleotides were shown to be a template for DNA synthesis during repair of HO endonuclease-induced DNA DSBs in yeast, although the efficiency of repair with RNA oligonucleotides was orders of magnitude lower than with DNA oligonucleotides (Storici et al., 2007). Finally, a role for LINE-1 retrotransposons in DNA DSB repair in mammalian cells has been predicted (Morrish et al., 2002, 2007). This prediction was based on clever experiments that showed new integrations of an endonuclease-incompetent LINE-1 retrotransposon could be identified in rodent cell lines; this form of LINE-1 integration was designated endonuclease-independent (ENi) retrotransposition (Morrish et al., 2002). Because these new integration sites lacked the typical features of LINE-1 endonuclease-mediated insertions, such as Target-Site Duplications (TSDs) and poly(A) tracts, the authors hypothesized that LINE-1 sequences were used as a "patch" to repair a spontaneous DNA DSB (as opposed to one introduced by the LINE-1 endonuclease).

With the advent of next generation sequencing technologies, millions of germline variants within mammalian species have been identified. Most characterized variants, known as polymorphisms, fall into three large categories: single nucleotide polymorphisms (SNPs), small (<50 bp) insertions or deletions, referred to collectively as short indels, and large deletions (>50 bp) (Genomes Project et al., 2012). More recently, a smaller number of polymorphic insertions of retro-elements (such as LINE-1 or Alu) have been identified through the use of sophisticated methods to detect mobile element insertions (Beck et al., 2010; Huang et al., 2010; Iskow et al., 2010). In addition, insertions of processed gene transcripts into the germline have been identified and referred to as polymorphic pseudogenes (Ewing et al., 2013). Recent reports have demonstrated that insertions of retroelements and pseudogenes represent only a fraction of the insertional polymorphisms in the human genome (Onozawa et al., 2014, 2015).

### DNA DSBS CAN BE REPAIRED BY INSERTION OF SEQUENCES DERIVED FROM DISTANT REGIONS OF THE GENOME IN AN EXPERIMENTAL SYSTEM

Templated Sequence Insertions (TSI) were first characterized using a cell culture based approach to study DNA DSB (Varga and Aplan, 2005). These studies modified an experimental system that had been popularized by Jasin and colleagues (Jasin and Haber, 2016), and employed a vector (designated EF1αTK), that contained the EF1α promoter driving expression of the herpes simplex virus thymidine kinase (HsvTK); the recognition site for the rare-cutting meganuclease I-SceI was inserted between the EF1α promoter and HsvTK cDNA (Varga and Aplan, 2005). Expression of the HsvTK enzyme in mammalian cells confers sensitivity to the nucleoside analog ganciclovir; thus, millions of cells can quickly be screened for loss of HsvTK expression by treatment with ganciclovir. The EF1αTK vector was electroporated into the human leukemia cell line U937, or ovarian cancer cell line OVCAR8. Sub-clones that had integrated a single copy of the EF1αTK vector, designated "F5" for U937 (Varga and Aplan, 2005) or "A15" for OVCAR8 (Cheng et al., 2010), were isolated (**Figure 1A**). In an attempt to induce gross chromosomal rearrangements, an I-SceI expression vector was introduced into the cell lines, followed by ganciclovir (GCV) selection of clones that had lost expression of HsvTK due to mis-repair of a DNA DSB.

Although most of the GCV-resistant clones had short deletions encompassing the HsvTK start codon, rare clones that had undergone an insertion at the DNA DSB site were identified (Varga and Aplan, 2005; Cheng et al., 2010; Onozawa et al., 2014). In these clones, the inserted sequence was not derived from nearby genomic sequences, but instead mapped to a distant region of the genome. After modifying the procedure to enrich for insertions (**Figure 1A**), 32 insertions of sequences derived from distant regions of the genome were identified in F5 subclones and 34 in A15 subclones (**Figure 1B**). The origin of these insertions mapped to 18 of 24 human chromosomes, without an obvious preference for any chromosome or chromosomal region (Onozawa et al., 2014). These insertions were designated "templated sequence insertions," or TSIs, in contrast to the short, non-templated insertions seen at the site of NHEJ-mediated repair of a DNA DSB (Onozawa et al., 2014). The TSI junctions often showed features of NHEJ, such as microhomology and non-templated nucleotide addition. Generation of TSIs seemed to be generalizable, as they were not restricted to I-SceI-induced cleavage but were also found at TALEN-mediated cleavage sites (Onozawa et al., 2014).

### TSIs ARE DERIVED FROM RNA

The TSIs inserted at the site of NHEJ-mediated DNA DSB repair were not excised from the genome, since the donor sites were intact, and the donor (inserted) sequence had an additional copy compared to genomic regions flanking the donor site (Onozawa et al., 2014). The TSI sequences were enriched for transcribed sequences, suggesting that the TSIs may have originated via reverse transcription of RNA. In addition, treatment of cells that expressed the I-SceI enzyme with reverse-transcriptase inhibitors suppressed the frequency of TSIs at the DNA DSB site by more than three-fold. Moreover, co-transfection of murine RNA with an I-SceI expression vector into the human F5 cell line showed that reverse transcribed murine sequences were used as insertions at the I-SceI cleavage site. Finally, three insertions displayed mammalian telomere repeat (TTAGGG)n sequences, suggesting that telomerase RNA can also be used as a TSI template. Taken together, although other mechanisms, such as template switching by DNA polymerases or break-induced repair, remain possible (Malkova and Haber, 2012; Morrish et al., 2013), the above observations support the hypothesis that reverse transcribed RNA can be used as a template to patch a DNA DSB (Onozawa et al., 2014). Potential sources of this reverse transcriptase activity include LINE-1 ORF2, HERVs (Hohn et al., 2013), as well as the TERT subunit of telomerase, which has recently been suggested to possess hTR-independent RT activity (Sharma et al., 2012).

FIGURE 1 | Insertion mediated repair of DNA DSBs. (A, Left) Outline of the reporter system used to characterize experimental TSIs (Varga and Aplan, 2005; Onozawa et al., 2014). The EF1α promoter (open box), I-*Sce*I recognition sequence, HsvTK cDNA (vertically striped box), and G418R cassette (horizontally striped box) are indicated. (A, Middle) Genomic DNA was PCR amplified using primers flanking the I-*Sce*I site. To enrich for PCR fragments containing insertions, the gel portion containing fragments of 0.5–2.0 kb was purified, ligated into plasmids, and inserts from individual colonies PCR amplified. (A, Right) Schematic of result showing colonies containing insertions, small indels, and deletions (Onozawa et al., 2014). (B) Size of insertion events recovered from F5 and A15 cell lines varied from 73 to 414 bp (median, 191 bp) (Onozawa et al., 2014). (C) Identification of insertions from whole-genome sequence data. SV data shows chromosome 9 sequences fused to chromosome 1 and reciprocal chromosome 1 sequences fused to chromosome 9. Sequence fragments are consistent with either a balanced translocation or an insertion of a chromosome 1 sequence into chromosome 9 (Onozawa et al., 2014). (D) Analysis of candidate TSI (schematic). PCR primers anneal to TSI acceptor locus (for example, chr 9 from Panel C). Amplification of a TSI leads to a larger (1.0 kb) PCR fragment, as shown. Presence of identical insertion-containing 1.0 kb PCR fragments in independent cell lines (cell line B and C) suggests an insertional polymorphism, which can be confirmed by nucleotide sequence analysis (Onozawa et al., 2014). (E) Nucleotide sequence of the insertion shown in Panels C,D. Chromosome 9 sequences, target-site duplication (TSD), poly(A) tail (negative strand), polyadenylation signal, and chromosome 1 insertion are indicated. Figure modified from Onozawa et al. (2014).

### TSIs ARE NOT AN ARTEFACT OF EXPERIMENTAL, INDUCED DNA DSB, BUT CAN BE IDENTIFIED IN UNMANIPULATED HUMAN CELLS

TSIs cannot be detected from routine analysis of whole genome sequence (WGS) reads, which are typically <150 bp. However, TSIs can be identified from WGS using the principles outlined below. First, the junction of two non-homologous chromosomes is designated as a "structural variant" (SV) on short read WGS; pairs of SVs that map very closely can be ascertained by inspection of SV files (**Figure 1C**). For a pair of SVs to represent a Templated Sequence Insertion Polymorphism (TSIP), both fusion junctions must be located within 50 kb of one another. Second, the strand polarity must align such that an insertion is feasible. Third, each end of the SV needs to be localized to a single, unique genomic locus; SVs that show multiple or imperfect alignments (<95% sequence identity) are excluded. Fourth, all highly repetitive alpha satellite sequences are discarded.
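The four screening criteria above can be sketched as a simple filter over SV records. This is a minimal illustration, not the authors' pipeline: the `Junction` record, its field names, and the reduction of "strand polarity" to a left/right flank label are all assumptions made for the example, and the 50 kb limit is applied to both the acceptor and donor coordinates as one possible reading of the first criterion.

```python
from dataclasses import dataclass
from itertools import combinations

MAX_JUNCTION_DISTANCE = 50_000  # bp; "within 50 kb of one another"
MIN_IDENTITY = 0.95             # alignments below 95% identity are excluded


@dataclass(frozen=True)
class Junction:
    """One SV junction joining an acceptor locus to a donor locus.

    Hypothetical, simplified record; real SV files use different columns.
    """
    acceptor_chrom: str
    acceptor_pos: int
    donor_chrom: str
    donor_pos: int
    acceptor_side: str      # assumed encoding: "left" or "right" flank of the insert
    unique_alignment: bool  # both ends map to a single genomic locus
    identity: float         # alignment identity, 0.0 to 1.0
    alpha_satellite: bool   # junction lies in alpha satellite repeats


def passes_quality(j: Junction) -> bool:
    """Criteria 3 and 4: unique, high-identity, non-satellite alignments."""
    return j.unique_alignment and j.identity >= MIN_IDENTITY and not j.alpha_satellite


def is_candidate_pair(a: Junction, b: Junction) -> bool:
    """Criteria 1 and 2: closely spaced junctions with compatible polarity."""
    same_loci = (a.acceptor_chrom == b.acceptor_chrom
                 and a.donor_chrom == b.donor_chrom)
    close = (abs(a.acceptor_pos - b.acceptor_pos) <= MAX_JUNCTION_DISTANCE
             and abs(a.donor_pos - b.donor_pos) <= MAX_JUNCTION_DISTANCE)
    # An insertion needs one junction on each flank of the inserted sequence.
    opposite_flanks = {a.acceptor_side, b.acceptor_side} == {"left", "right"}
    return same_loci and close and opposite_flanks


def candidate_tsips(junctions):
    """Return all junction pairs that satisfy the four screening criteria."""
    kept = [j for j in junctions if passes_quality(j)]
    return [(a, b) for a, b in combinations(kept, 2) if is_candidate_pair(a, b)]
```

Pairs returned by such a filter are only candidates; as described in the text, each one still requires confirmation by PCR across the putative insertion and nucleotide sequencing.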

The authors screened SVs obtained from two multiple myeloma cell lines (KP6, MC1286PE1) using these criteria and identified 23 unique, verified TSIs (Onozawa et al., 2014). A typical example is shown in **Figures 1C–E**. Briefly, although this pair of SVs is consistent with a balanced translocation, it is also consistent with an insertion of chromosome 1 sequences into chromosome 9 (**Figure 1C**). Primers were generated that could amplify the putative insertion, including flanking sequences and both junctions, on a single PCR fragment (**Figures 1D,E**). Nucleotide sequence of the PCR product verified that the SVs were indeed produced by insertion of chromosome 1 derived sequences into chromosome 9, as opposed to a reciprocal translocation between chromosomes 1 and 9 (**Figure 1E**). Interestingly 8 out of 23 insertions were identical or nearly identical in the two cell lines (Onozawa et al., 2014) (a schematic example of this phenomenon is shown in **Figure 1D**), suggesting that these insertions represent polymorphisms in the human genome as opposed to SV acquired by the tumor cells. Consistent with the TSI definition above, these polymorphic insertions were designated templated-sequence insertion polymorphisms (TSIPs).

### MOST TSIs IDENTIFIED IN NORMAL HUMAN SUBJECTS ARE POLYMORPHIC

A publicly available database of WGS from 52 normal individuals of defined ethnic/geographic groups ("SV baseline genome set," filename B37baselinejunctions.tsv, available at http://www.completegenomics.com/sequence-data/download-data/) contained a total of 39,595 SVs from the 52 individuals (Onozawa et al., 2015). Using the criteria set forth in the preceding section, 171 candidate TSIPs were identified (Onozawa et al., 2015). Each individual had an average of 25–30 TSIPs, and TSIPs could be classified as "common" (26%; present in at least 20% of individuals), or "rare" (74%; present in <20% of individuals). Interestingly, three TSIPs had a frequency of almost 100%, suggesting that the reference human genome (GRCh37/hg19) is based on an uncommon variant that lacks these three TSIPs. When divided into four regional "super groups" of specified geographic origin (African, Asian, European, North American), common TSIPs were present in individuals from all regions, whereas rare TSIPs tended to be restricted to individuals from a single region (Onozawa et al., 2015). There were more TSIPs per individual of African origin than in individuals from other geographic regions. All of these findings are consistent with diversity identified in previous studies of mitochondrial and Y-chromosome sequences, and are consistent with patterns of human migration and the hypothesis that *Homo sapiens* originated in Africa (Cann et al., 1987; Hammer, 1995; Underhill et al., 2000).
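The "common" versus "rare" split described above is a plain frequency threshold, which can be illustrated as follows. This is a sketch under stated assumptions: the function names and the input layout (a mapping from a TSIP identifier to the set of individuals carrying it) are hypothetical, and only the 20% cutoff and the cohort size of 52 come from the text.

```python
from collections import Counter

COHORT_SIZE = 52          # individuals in the SV baseline genome set
COMMON_MIN_PERCENT = 20   # "common": present in at least 20% of individuals


def classify_tsip(carriers: int, cohort_size: int = COHORT_SIZE,
                  min_percent: int = COMMON_MIN_PERCENT) -> str:
    """Label a TSIP 'common' or 'rare' from the number of carriers."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    if not 0 <= carriers <= cohort_size:
        raise ValueError("carriers must be between 0 and cohort_size")
    # Integer arithmetic avoids floating-point rounding at the cutoff.
    is_common = carriers * 100 >= min_percent * cohort_size
    return "common" if is_common else "rare"


def summarize(carriers_by_tsip: dict, cohort_size: int = COHORT_SIZE) -> Counter:
    """Count common and rare TSIPs.

    carriers_by_tsip maps a (hypothetical) TSIP identifier to the set of
    individuals in which that TSIP was observed.
    """
    return Counter(classify_tsip(len(individuals), cohort_size)
                   for individuals in carriers_by_tsip.values())
```

With 52 individuals, the 20% cutoff corresponds to 10.4 carriers, so under this reading a TSIP seen in 11 or more individuals is "common" and one seen in 10 or fewer is "rare".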

The investigators obtained genomic DNA from eight of the normal individuals, who together had 89 candidate TSIs, and successfully validated 69 of the 89 (77.5%) candidate TSIPs (Onozawa et al., 2015). Since these insertions can be polymorphic, they must be heritable, leading to the conclusion that the insertion event must have taken place in either a germ cell (sperm or egg) or an early-stage embryo.

### SEQUENCES USED AS TEMPLATES FOR TSIPs

Nucleotide sequence analysis of the insertion sequences revealed that partial LINE-1 elements, cDNAs (with several spliced exons), non-annotated intergenic or intronic sequences, and mitochondrial sequences were used as templates for TSIPs identified in normal individuals (Onozawa et al., 2015). Of note, although mitochondrial fragment insertions were commonly identified as TSIP donor sequences, no TSIs derived from mitochondrial sequences were identified among the experimentally induced TSIs in the F5 and A15 cell lines described above (Varga and Aplan, 2005; Cheng et al., 2010; Onozawa et al., 2014). Although speculative, it is possible that mitochondrial sequence insertions may be reproduction-specific events that take place in germ cells or early-stage embryos, but do not occur, or occur only rarely, in somatic cells. Interestingly, sperm mitochondria are known to be ubiquitinated and destroyed shortly after fertilization (Sutovsky et al., 2000), leading to the hypothesis that fragmented paternal mitochondrial DNA [or reverse transcribed RNA that was encoded by mitochondrial DNA (Sharma et al., 2012)] can be used to patch a DNA DSB in a fertilized embryo, leading to a TSIP that contains mitochondrial sequence in all cells of the individual, including germ cells (Woischnik and Moraes, 2002; Onozawa et al., 2015; Zhou et al., 2016). Although no TSIPs contained telomere sequences (Onozawa et al., 2014, 2015), interstitial telomeric sequences (ITSs) have been identified in several species (Ruiz-Herrera et al., 2002), and were identified as insertions in the I-SceI experimental system (Onozawa et al., 2014), leading to the speculation that the ITSs may have resulted from telomere patches used to repair a DNA DSB (Nergadze et al., 2004, 2007; Onozawa et al., 2014).

FIGURE 2 | Landscape of insertion polymorphisms in the human genome. LINE-1 mediated integration of LINE-1/SINE sequences, LINE-1 sequences (which may include additional 3′ transduced sequences), and processed cDNA insertions are known to create insertion polymorphisms (Beck et al., 2010; Huang et al., 2010; Iskow et al., 2010; Ewing et al., 2013). Polyadenylated intronic or intergenic fragments can also be acted upon in *trans* by LINE-1 ORF2 and integrate at the site of a nick created by LINE-1 ORF2, resulting in a class 1 TSIP. Class 2 TSIPs can be generated by reverse transcription of RNA transcripts into a cDNA patch that is used to repair a DNA DSB via a NHEJ mechanism. Alternatively, RNA could be inserted in the DNA DSB and used directly as a patch template, as reported for yeast (Storici et al., 2007). Finally a DNA DSB can be repaired by fragments of mitochondrial DNA or cDNA; mitochondrial insertions seem to be unique to germ cells or embryos. Figure modified from Onozawa et al. (2015).

### CLASS 1 AND CLASS 2 TSIPs

TSIPs can be placed into two classes (class 1 and class 2) based on nucleotide sequences at the insertion site (**Figure 2**). Class 1 TSIPs show a duplication of recipient sequences at both insertion junctions of at least 5 bp; this feature is reminiscent of a TSD that is characteristic of insertions caused by cleavage and insertion of LINE-1 sequences. In addition, class 1 TSIs typically inserted at a preferred LINE-1 integration site (consensus sequence 5′-TTTT/A-3′), and contained a non-templated addition of 10–40 "A" residues, as well as a polyadenylation signal (5′-AATAAA-3′) located 10–20 nucleotides upstream of the poly(A) tract. These features strongly support an insertion mediated by LINE-1 ORF2 protein, which contains both endonuclease and reverse transcriptase activity, acting upon non-LINE-1 RNA, and inserting the sequence into a distant region of the genome (Luan et al., 1993; Piskareva et al., 2013). Moreover, since these events must have occurred in germ cells or embryos to be transmitted, it is interesting to note that LINE-1 elements are de-repressed and active in embryos (Castro-Diaz et al., 2014). We can detect no obvious physiologic function for the class 1 TSIPs, and suspect that these are caused by a careless LINE-1 ORF2 protein causing mischief throughout the genome. Class 2 TSIPs had none of these features (i.e., no TSDs, cryptic poly(A) signal, or poly(A) tract) but instead displayed NHEJ features such as microdeletion, microhomology, and non-templated nucleotide addition at the insertion junction, similar to what one would predict if an ENi retrotransposition event (Morrish et al., 2002) used non-LINE-1 RNA as a template. Consistent with this prediction, it is well established that LINE-1 ORF2 can act in trans (Wei et al., 2001). All experimental DNA DSB repair events were class 2 events, and we speculate that class 2 TSIPs are caused by DNA DSB repair events that occurred in a germ cell or early-stage embryo of an ancestral individual.
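The junction features that separate the two classes can be written down as a simple rule set. The Python sketch below is illustrative only: the text lists characteristic features but does not give a formal decision procedure, so the `Junction` record, its field names, and the strict thresholds are assumptions made here. Insertion at the 5′-TTTT/A-3′ consensus is left out of the rule because the text describes it as typical rather than required.

```python
from dataclasses import dataclass


@dataclass
class Junction:
    """Hypothetical summary of one validated insertion (field names are assumptions)."""
    tsd_length: int         # bp of recipient sequence duplicated at both junctions
    polya_length: int       # length of the non-templated 3' poly(A) tract
    has_polya_signal: bool  # AATAAA found upstream of the poly(A) tract
    nhej_signature: bool    # microdeletion, microhomology, or non-templated nucleotides


def classify_tsip(j: Junction) -> str:
    """Apply the class 1 / class 2 features described in the text as a strict rule."""
    # Class 1: TSD of at least 5 bp, poly(A) tract (10-40 A in the text), poly(A) signal.
    if j.tsd_length >= 5 and j.polya_length >= 10 and j.has_polya_signal:
        return "class 1"
    # Class 2: none of the LINE-1 hallmarks, but NHEJ-like junctions.
    if j.tsd_length == 0 and j.polya_length == 0 and j.nhej_signature:
        return "class 2"
    # Anything in between is left for manual inspection.
    return "unclassified"


print(classify_tsip(Junction(tsd_length=12, polya_length=25,
                             has_polya_signal=True, nhej_signature=False)))
print(classify_tsip(Junction(tsd_length=0, polya_length=0,
                             has_polya_signal=False, nhej_signature=True)))
```

The explicit "unclassified" outcome is a design choice of this sketch: real junctions can show partial features, and forcing them into one of the two classes would hide that ambiguity.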

### POTENTIAL TO CAUSE GENETIC DISEASE

There is potential for this mechanism of DNA DSB repair to cause genetic disease. Several TSIPs disrupted the coding region of a gene (Onozawa et al., 2015). Furthermore, a recent report described a constitutional 72-bp insertion of mitochondrial sequence into the coding region of GLI3, leading to Pallister-Hall syndrome (Turner et al., 2003). Of note, the conception of this patient was temporally and geographically associated with high-level radioactive contamination following the Chernobyl accident (Turner et al., 2003). Although speculative, it is conceivable that a DNA DSB in the germ cell, caused by ionizing radiation, was repaired by a TSI derived from mitochondrial DNA in this individual.

### CONCLUSION

TSIPs encompass several forms of insertion polymorphisms in human genomes, and are mediated via a combination of mechanisms. Class 1 TSIPs are retrotransposon-mediated events that insert polyadenylated, reverse-transcribed cDNA into seemingly random regions of the genome, whereas class 2 TSIPs are consistent with DNA DSB repair events, in which a short fragment of reverse-transcribed RNA is used as a patch to repair a DNA DSB. These TSIPs provide unique polymorphic markers, similar to SNPs and variable tandem repeats, and can be used to track population migration and evolution. Similar to retrotransposon insertions, we suspect that TSIPs, which can disrupt coding regions of the genome, may play a role in both the etiology of genetic diseases as well as mammalian evolution.

### AUTHOR CONTRIBUTIONS

PA conceived and edited the mini-review. MO conceived the mini-review, generated figures, and wrote the first draft.

### ACKNOWLEDGMENTS

This work was supported by the intramural research program of the NCI, NIH. MO was supported by the Japan Society for the Promotion of Science (JSPS), Grant-in-Aid for Research Activity Start-up (26890001) and Grant-in-Aid for Scientific Research (C) (16K09836).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Onozawa and Aplan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Post-Transcriptional Control of LINE-1 Retrotransposition by Cellular Host Factors in Somatic Cells

Javier G. Pizarro and Gaël Cristofari\*

Institute for Research on Cancer and Aging of Nice (IRCAN), Faculty of Medicine, CNRS UMR7284, INSERM U1081, University of Nice Sophia Antipolis, Nice, France

Long INterspersed Element-1 (LINE-1 or L1) retrotransposons form the only autonomously active family of transposable elements in humans. They are expressed and mobile in the germline, in embryonic stem cells and in the early embryo, but are silenced in most somatic tissues. Consistently, they play an important role in individual genome variations through insertional mutagenesis and sequence transduction, which occasionally lead to novel genetic diseases. In addition, they are reactivated in nearly half of the human epithelial cancers, contributing to tumor genome dynamics. The L1 element codes for two proteins, ORF1p and ORF2p, which are essential for its mobility. ORF1p is an RNA-binding protein with nucleic acid chaperone activity and ORF2p possesses endonuclease and reverse transcriptase activities. These proteins and the L1 RNA assemble into a ribonucleoprotein particle (L1 RNP), considered as the core of the retrotransposition machinery. The L1 RNP mediates the synthesis of new L1 copies upon cleavage of the target DNA and reverse transcription of the L1 RNA at the target site. The L1 element takes advantage of cellular host factors to complete its life cycle; however, several cellular pathways also limit the cellular accumulation of L1 RNPs and their deleterious activities. Here, we review the known cellular host factors and pathways that positively or negatively regulate L1 retrotransposition at the post-transcriptional level, in particular by interacting with the L1 machinery or L1 replication intermediates, and how they contribute to controlling L1 activity in somatic cells.

Keywords: LINE-1, retrotransposon, genome evolution, repeated sequences, retrotransposition, structural variation (SV)

## L1 ELEMENTS CONTRIBUTE TO THE DYNAMICS OF SOMATIC AND GERMLINE HUMAN GENOMES

The Long INterspersed Element-1 (LINE-1 or L1) retrotransposon forms 17% of our genome (Lander et al., 2001). Most L1 copies present in the reference human genome are defective but ∼100 copies could be retrotransposition-competent (Brouha et al., 2003). In addition, many polymorphic L1 elements, not included in the reference genome, also have the potential to mobilize (Beck et al., 2010; Ewing, 2015; Mir et al., 2015).

L1 elements can retrotranspose in the germline, in embryonic stem cells and in the early embryo (Kazazian et al., 1988; Garcia-Perez et al., 2007; van den Hurk et al., 2007). However, L1 retrotransposons are repressed in most tested normal somatic cells except in the brain (Coufal et al., 2009; Baillie et al., 2011; Evrony et al., 2012; Richardson et al., 2014b; Upton et al., 2015). L1 mobilization impacts human genome evolution through insertional mutagenesis and sequence transduction, which occasionally results in inherited genetic diseases (Hancks and Kazazian, 2012). Somatic retrotransposition in the brain could also contribute to the etiology of some mental disorders or disabilities, such as Rett Syndrome or Ataxia Telangiectasia, characterized by increased levels of L1 mobilization (Muotri et al., 2010; Coufal et al., 2011). Moreover, somatic L1 mobilization contributes to the dynamics of many tumor genomes and can lead to driver mutations (Miki et al., 1992; Iskow et al., 2010; Lee et al., 2012; Solyom et al., 2012; Shukla et al., 2013; Helman et al., 2014; Tubio et al., 2014; Doucet-O'Hare et al., 2015; Ewing et al., 2015; Rodić et al., 2015). Besides its impact as an insertional mutagen, L1 also triggers other forms of genomic alterations such as DNA double-strand breaks or chromosomal translocations, and these activities could contribute to normal aging or tumorigenesis (Wallace et al., 2008; Lin et al., 2009; Belancio et al., 2010). Finally, the L1 machinery also drives the retrotransposition of Short INterspersed Elements (SINEs) and the formation of processed pseudogenes (Esnault et al., 2000; Dewannieux et al., 2003).

### Edited by:

Tammy A. Morrish, Independent Researcher, USA

### Reviewed by:

Sandy Martin, University of Colorado School of Medicine, USA Lixin Dai, Modern Meadow Inc., USA

> \*Correspondence: Gaël Cristofari gael.cristofari@unice.fr

#### Specialty section:

This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Cell and Developmental Biology

> Received: 14 December 2015 Accepted: 18 February 2016 Published: 07 March 2016

#### Citation:

Pizarro JG and Cristofari G (2016) Post-Transcriptional Control of LINE-1 Retrotransposition by Cellular Host Factors in Somatic Cells. Front. Cell Dev. Biol. 4:14. doi: 10.3389/fcell.2016.00014

L1 elements and their host have co-evolved: L1s use the cellular machinery for their own replication, while the host cell has evolved multiple defense mechanisms limiting the deleterious effects of L1. Silencing of L1 expression through CpG DNA methylation and histone modifications is a major repressive mechanism, which prevents the accumulation of mutagenic events (Bourc'His and Bestor, 2004; Castro-Diaz et al., 2014; Jacobs et al., 2014). Here we review post-transcriptional cellular pathways that positively or negatively regulate L1 retrotransposition in somatic cells, in particular by interacting with the L1 machinery or L1 replicative intermediates.

### L1 REPLICATION IS MEDIATED BY A RIBONUCLEOPROTEIN PARTICLE (RNP) AND TARGET-PRIMED REVERSE TRANSCRIPTION (TPRT)

An active L1 retrotransposon comprises a 5′ untranslated region (UTR), two open reading frames (ORF1 and ORF2) separated by a short inter-ORF spacer and a 3′ UTR (**Figure 1**). An antisense ORF0 of unknown function has also been recently described in the 5′ UTR (Denli et al., 2015). As a consequence of the reverse transcription and integration mechanism, the L1 sequence ends with a poly(dA) stretch and is flanked by target site duplications (TSD) of variable size. The 5′ UTR contains RNA polymerase II sense and antisense promoters (Swergold, 1990; Speek, 2001; Nigumann et al., 2002). The translation of the bicistronic L1 mRNA by an unconventional mechanism produces two proteins, named ORF1p and ORF2p (Alisch et al., 2006; Dmitriev et al., 2007). ORF1p is a 40 kDa RNA-binding protein that forms trimers and has nucleic acid chaperone activity (Martin, 1991; Holmes et al., 1992; Martin and Bushman, 2001; Martin et al., 2003; Khazina et al., 2011). ORF2p is a ∼150 kDa protein with endonuclease (EN) and reverse transcriptase (RT) activities, which are critical for L1 retrotransposition (Mathias et al., 1991; Feng et al., 1996; Moran et al., 1996). ORF2p also contains a C-terminal cysteine-rich region, potentially contributing to its RNA binding capability (Piskareva et al., 2013). ORF1p and ORF2p bind the L1 mRNA to form a ribonucleoprotein particle (RNP), considered as the core of the L1 replicative complex (Hohjoh and Singer, 1996; Kolosha and Martin, 1997; Kulpa and Moran, 2005, 2006; Doucet et al., 2010; Goodier et al., 2010). This assembly occurs preferentially in cis (Esnault et al., 2000; Wei et al., 2001; Kulpa and Moran, 2006), through binding of ORF2p to the L1 RNA poly(A) sequence (Doucet et al., 2015). L1 RNPs accumulate in cytoplasmic foci, which colocalize with stress granules (Goodier et al., 2007, 2010; Doucet et al., 2010). The functional importance of these cytoplasmic complexes remains to be elucidated. Although cell division seems to promote retrotransposition, it is not absolutely required (Kubo et al., 2006; Shi et al., 2007; Xie et al., 2013). Thus, access of L1 RNPs to chromatin can occur independently of mitotic nuclear envelope breakdown through an unknown nuclear import mechanism.

New L1 copies are directly synthesized and inserted in the genome by a process called TPRT (Luan et al., 1993; Feng et al., 1996; Cost et al., 2002; Christensen et al., 2006). During TPRT, ORF2p binds and nicks a consensus sequence of the form 5′-TTTT/A-3′ in the genomic DNA (Feng et al., 1996). This cleavage, potentially followed by additional processing steps, exposes a single-stranded T-rich DNA stretch able to partially or completely anneal to the L1 RNA poly(A) tail and to prime ORF2p-mediated reverse transcription (Kulpa and Moran, 2006; Monot et al., 2013; Viollet et al., 2014). A possible second nick, generally a few nucleotides downstream of the first one, allows priming and synthesis of the second DNA strand. Finally, the L1 DNA ends are filled in and sealed, creating TSD (Luan et al., 1993; Feng et al., 1996; Cost et al., 2002). The molecular actors involved in these late stages are unknown. This process is frequently abortive, resulting in 5′ truncated L1 copies.
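The first step of TPRT, recognition of the 5′-TTTT/A-3′ consensus, can be sketched as a motif scan. The Python example below is a simplified illustration and not a published tool: it assumes exact matches to `TTTTA` only (the real ORF2p endonuclease tolerates degenerate sites), places the nick between the last T and the A, and uses hypothetical function names and toy sequences.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_consensus_nick_sites(seq: str, motif: str = "TTTTA", cut_offset: int = 4):
    """Return (strand, cut) pairs for exact matches to the 5'-TTTT/A-3' consensus.

    `cut` is the number of top-strand bases lying to the left of the nick,
    so a "+" hit at index i gives cut = i + 4 (between the last T and the A).
    Hits on the bottom strand are mapped back to top-strand coordinates.
    """
    sites = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        start = s.find(motif)
        while start != -1:
            cut = start + cut_offset
            # Bottom-strand positions are mirrored relative to the top strand.
            sites.append((strand, cut if strand == "+" else len(seq) - cut))
            start = s.find(motif, start + 1)  # allow overlapping matches
    return sites


print(find_consensus_nick_sites("GGTTTTACC"))
print(find_consensus_nick_sites("CCTAAAAGG"))
```

In the first toy sequence the motif lies on the top strand, and the exposed 3′ end of the T-rich stretch is what would anneal to the L1 RNA poly(A) tail. In the second, the motif is found only on the bottom strand, which is why both strands are scanned.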

### L1 RETROTRANSPOSITION IS REGULATED BY CELLULAR FACTORS AT MULTIPLE LEVELS

L1 activity is regulated at multiple stages of the L1 retrotransposition cycle (**Figure 1**). We focus here on posttranscriptional mechanisms and their molecular effectors acting in human or mammalian somatic cells and interacting with components of the L1 RNP or with L1 replication intermediates. L1 regulation in the germline, notably by Piwi-interacting RNA (piRNA), has been reviewed elsewhere (Zamudio and Bourc'his, 2010; Crichton et al., 2014) and is not detailed in the present article.

### Proteomic Studies Have Revealed Cellular Partners of L1 RNPs and Potential Novel Regulators of L1 Retrotransposition

### Overview

Several recent studies have identified cellular partners of L1 RNPs through tagging of ORF1p, ORF2p or L1 RNA, followed by affinity chromatography and mass-spectrometry (Goodier et al., 2013; Peddigari et al., 2013; Taylor et al., 2013; Moldovan and Moran, 2015). These experimental efforts differ by the cell line, the L1 clone, the tagged component in the complex and the chromatography method used, but eventually lead to a number of common host factors (**Figure 2**). It should be underlined that only a fraction of the hits has been validated by co-immunoprecipitation, and only a single study used quantitative mass-spectrometry to measure the specific enrichment of the detected proteins upon elution (Taylor et al., 2013). A first step toward functional characterization generally involves retrotransposition assays in cultured cells upon depletion or overexpression of the tested factor. The outcome of these genetic assays allows a first classification into positive or negative regulators. However, many binding partners only modestly impact the levels of L1 retrotransposition in these assays, or have pleiotropic effects preventing unambiguous interpretation. With few exceptions, the majority of the tested factors are RNA-binding proteins, which copurify with ORF1p through an indirect RNA bridge, colocalize with L1 RNPs in stress granules, and inhibit L1 retrotransposition.

### Limitations

Due to the scarcity of endogenous L1 complexes in cells, all proteomic studies rely on the overexpression of engineered L1 constructs. It is conceivable that: (i) some of the discovered partners become associated with L1 components as a result of L1 overexpression beyond physiological levels; (ii) L1 RNP stoichiometry is altered; (iii) the retrotransposition reporter cassette, which contains an intron, modifies L1 RNA cellular processing, and thus its binding partners.

### Positive Regulators of L1 Retrotransposition

### Poly(A) Binding Proteins Act in L1 RNP Assembly or Trafficking

Poly(A) binding proteins (PABPs) bind mRNA poly(A) tails and are involved in mRNA stability and translation initiation (Goss and Kleiman, 2013). Short hairpin RNA (shRNA)-mediated knockdown of PABPC1 reduces L1 retrotransposition with minimal effects on the accumulation of L1 RNA and proteins, or on poly(A) tail length (Dai et al., 2012). This effect is associated with reduced L1 RNP levels and reduced nuclear accumulation of this complex, suggesting a possible role, direct or indirect, of PABPC1 in the assembly or the subcellular trafficking of the L1 RNP. Consistently, PABPC1 associates with the L1 RNP in an RNA-dependent manner, they colocalize in stress granules (Goodier et al., 2013; Taylor et al., 2013), and moderate PABPC1 overexpression stimulates retrotransposition (Dai et al., 2012).

Other PABPs have been found to associate with the L1 RNA (PABPN1, PABPC4), but their specific role in L1 retrotransposition has either been obscured by pleiotropic effects or not yet been tested (Dai et al., 2012; Goodier et al., 2013; Taylor et al., 2013).

### PCNA is a Cofactor of TPRT

PCNA is a DNA sliding clamp acting as a processivity factor for many DNA polymerases during DNA replication or DNA damage repair (Moldovan et al., 2007). ORF2p binds PCNA through a PCNA-interacting protein (PIP) box, located between the EN and RT domains of ORF2p (Taylor et al., 2013). Mutations in ORF2p PIP box disrupt PCNA-ORF2p interaction and inhibit L1 retrotransposition. Interestingly, ORF2p mutations abrogating its EN or RT activity also disrupt PCNA-ORF2p interaction, suggesting that PCNA binding to ORF2p occurs downstream or concomitantly with TPRT.

### Proline-Directed Protein Kinase(s) Regulate(s) ORF1p Function

ORF1p contains several putative (S/T)-P phosphorylation sites for proline-directed protein kinases (PDPKs), such as mitogen-activated protein kinases and cyclin-dependent kinases. Mutations of S18, S27, T203, and T213, which are potential PDPK targets, decrease L1 retrotransposition, and these residues were found to be phosphorylated by mass-spectrometry in human cells (Cook et al., 2015). Several protein kinases associate with the L1 RNP (Goodier et al., 2013; Taylor et al., 2013; Moldovan and Moran, 2015); however, it remains to be demonstrated whether one or several of them directly target ORF1p. Interestingly, the S18/S27 sites in ORF1p are required for binding by the Pin1 prolyl isomerase (Cook et al., 2015), suggesting a scenario in which binding of Pin1 promotes an ORF1p conformational change, which could affect its stability, activity or localization, or its subsequent ability to be dephosphorylated (Yeh et al., 2004; Liou et al., 2011).

### Cellular Pathways Inhibiting L1 Retrotransposition at Post-Transcriptional Level

### RNA Interference Pathways Prevent the Accumulation of L1 RNA

L1 RNA serves both as an mRNA to produce the L1 machinery and as a template for reverse transcription. Multiple RNA interference (RNAi) pathways act in somatic or embryonic cells to prevent the accumulation of L1 RNA, and eventually retrotransposition.

First, the Microprocessor complex (Drosha/DGCR8), a major nuclear complex implicated in microRNA (miRNA) biosynthesis through pri-miRNA processing, is also able to bind L1 RNA in vivo, to reduce its abundance and to limit L1 retrotransposition. In addition, it can cleave various L1 RNA fragments derived from the L1 5′ UTR region in vitro, indicating that L1 RNA can be a direct Microprocessor substrate (Heras et al., 2013, 2014). Moreover, the miRNA pathway could also act downstream of Microprocessor to inhibit retrotransposition. Indeed, miR-128 in complex with the Argonaute (Ago) protein binds the L1 RNA in the ORF2 region leading to L1 transcript degradation (Hamdorf et al., 2015).

Second, the combined expression of sense and antisense L1 transcripts driven by L1 5′ UTR promoters reduces L1 RNA stability and L1 retrotransposition (Yang and Kazazian, 2006). This process is associated with the synthesis of rasiRNA (repeat-associated small interfering RNA) consistent with a possible processing of L1 RNA duplexes, and is modestly inhibited by Dicer knockdown, suggesting an additional layer of L1 repression mediated by siRNA mechanisms. In agreement with a role of RNAi pathways in somatic L1 regulation, L1 RNPs tend to accumulate in stress granules where they colocalize with several RNAi factors and often interact with them (Goodier et al., 2007, 2013).

### Innate Immunity and Interferon Response Pathways

The cellular innate immune response is one of the first lines of defense against a broad range of viral infections. It involves cellular factors with antiviral activities, among which the interferon (IFN) response pathway plays a central role (MacMicking, 2012; Ivashkiv and Donlin, 2014). This pathway leads to the activation of IFN-stimulated genes (ISG) acting as effectors and reinforcing IFN-signaling itself. A significant proportion of ISG are viral restriction factors (MacMicking, 2012), which also appear to counteract L1 retrotransposition (Goodier et al., 2015), and are described below.

Upon overexpression, several members of the APOBEC3 (A3) cytidine deaminase family inhibit L1 retrotransposition (A3A, A3B, A3C and A3F) (Bogerd et al., 2006; Chen et al., 2006; Muckenfuss et al., 2006; Stenglein and Harris, 2006; Kinomoto et al., 2007; Niewiadomska et al., 2007). A3A is a nuclear protein predominantly expressed in peripheral blood mononuclear cells (PBMCs) and is induced by IFN-β (Chen et al., 2006; Muckenfuss et al., 2006; Stenglein et al., 2010). A3A-mediated L1 inhibition depends on A3A deaminase activity and on the subsequent processing of the deaminated DNA by uracil DNA glycosylase (UNG) and apurinic/apyrimidinic endonuclease (APE) (Richardson et al., 2014a). A3B is also a nuclear protein. It is endogenously expressed in embryonic stem cells, in induced-pluripotent stem cells and in a number of cancer cell lines. Its depletion stimulates L1 retrotransposition (Wissing et al., 2011; Marchetto et al., 2013); however, catalytically dead A3B mutants still inhibit L1 retrotransposition (Wissing et al., 2011). Thus, the mechanism by which A3B represses L1 mobilization remains unknown. Similarly, reducing the expression of A3C moderately increases retrotransposition in cancer cell lines that express detectable levels of endogenous A3C (Muckenfuss et al., 2006). As for A3B, A3C- and A3F-mediated L1 repression is deaminase-independent (Muckenfuss et al., 2006; Stenglein and Harris, 2006; Kinomoto et al., 2007; Horn et al., 2014). A3C might interfere with L1 reverse transcription or the activity of ORF2p in the L1 RNP (Horn et al., 2014).

Several other ISG products, such as MOV10, ZAP or RNase L, limit L1 replication by limiting L1 RNA accumulation. The RNA helicase MOV10 robustly copurifies with the L1 complex, colocalizes with L1 RNPs in stress granules, reduces L1 RNA half-life, and ultimately strongly inhibits retrotransposition (Arjan-Odedra et al., 2012; Goodier et al., 2012, 2013; Li et al., 2013; Taylor et al., 2013; Moldovan and Moran, 2015). Similarly, the zinc-finger antiviral protein (ZAP) associates with the L1 RNP and accumulates with it in stress granules (Goodier et al., 2015; Moldovan and Moran, 2015). Its overexpression reduces full-length L1 RNA levels, and L1 retrotransposition levels. The ZAP zinc-finger domain is necessary and sufficient for its anti-L1 activity. Inversely, knocking down endogenous ZAP increases L1 retrotransposition. The ribonuclease L (RNase L) degrades L1 RNA and inhibits retrotransposition although no association or colocalization was detected with the L1 RNP (Zhang et al., 2014). Other ISGs with known viral restriction activities (e.g., BST2, ISG20, MAVS, and MX2) are also strong inhibitors of L1 retrotransposition (Goodier et al., 2015), but their mechanism of action has not yet been explored.

Finally, SAMHD1 and TREX1 are ISGs involved in a negative feedback loop, acting as repressors of the interferon response itself. Loss-of-function mutations in these genes lead to the Aicardi-Goutières syndrome, an autoimmune disease. Both factors inhibit L1 retrotransposition (Stetson et al., 2008; Zhao et al., 2013). Trex1 (Three-prime-repair exonuclease 1) is an abundant 3′-5′ DNA exonuclease and its overexpression inhibits engineered L1 retrotransposition in cultured cells (Stetson et al., 2008). Trex1-deficient cells accumulate ssDNA fragments derived from various retroelements including L1, suggesting that Trex1 can metabolize reverse transcribed L1 cDNA (Stetson et al., 2008). SAMHD1 (SAM Domain And HD Domain 1) impairs lentivirus replication in non-dividing cells by depleting the intracellular pool of dNTPs and thereby inhibiting reverse transcription (Lahouassa et al., 2012). In contrast, SAMHD1 inhibits L1 retrotransposition in dividing cells, through a dNTPase-independent mechanism, which might directly affect ORF2p levels, and thus inhibit L1 reverse transcription (Zhao et al., 2013).

### DNA Repair Pathways

EN-mediated cleavage of the target DNA or other TPRT intermediates could lead to DNA double-strand break (DSB) or DNA lesion signaling, and activation of subsequent DNA repair pathways. Conversely, these cellular processes could also participate in the resolution of L1 integration, through L1 second strand DNA synthesis or DNA ligation.

The role of DSB signaling and non-homologous end-joining (NHEJ) pathways remains controversial. Ataxia-telangiectasia mutated (ATM) protein, a kinase activated upon DSB, was initially proposed to be required for L1 retrotransposition and L1-induced DSBs (Gasior et al., 2006; Wallace et al., 2013). However, independent studies using ATM-deficient mice or human cell models rather suggest that ATM is a repressor of retrotransposition (Coufal et al., 2011). Similarly, knocking out NHEJ genes (e.g., Ku70/80, DNA Ligase IV or Artemis) decreases L1 retrotransposition in chicken cells (Suzuki et al., 2009). However, loss-of-function of DNA-PKcs or DNA Ligase IV in mammalian cells does not impair L1 retrotransposition (Coufal et al., 2011), indicating that NHEJ is not absolutely required for L1 retrotransposition. An interesting possibility could be that DSB signaling and repair pathways compete with the L1 machinery or other cellular factors for the resolution of L1 insertion during—or after—cDNA synthesis, leading to 5′ truncated insertions (Zingler et al., 2005; Suzuki et al., 2009; Coufal et al., 2011).

Other DNA repair pathways can also antagonize L1 replication. The ERCC1-XPF complex, which plays a role in nucleotide excision, base excision and interstrand crosslink repair pathways, is a potent inhibitor of L1 retrotransposition (Gasior et al., 2008). ERCC1-XPF is an endonuclease able to specifically cleave DNA at junctions between single-stranded and double-stranded regions, a predicted structure produced by the TPRT process. Thus, it has been hypothesized that ERCC1-XPF might cut off L1 cDNA at the target site during reverse transcription.

### OPEN QUESTIONS FOR THE FUTURE

• How is unspliced L1 RNA exported to the cytosol and the L1 RNP imported back to the nucleus?

### AUTHOR CONTRIBUTIONS

JGP drafted the manuscript. GC revised the manuscript.

### FUNDING

GC is funded by the Fondation ARC pour la recherche sur le cancer, the European Research Council (ERC-2010-StG 243312, RETROGENOMICS), the French Government (National Research Agency, ANR) through the "Investments for the Future" (LABEX SIGNALIFE, ANR-11-LABX-0028-01), and the Fondation pour la Recherche Médicale (FRM DEP20131128533).

### ACKNOWLEDGMENTS

We are grateful to Aurelien J. Doucet for critical reading.

### REFERENCES

Zingler, N., Willhoeft, U., Brose, H. P., Schoder, V., Jahns, T., Hanschmann, K. M., et al. (2005). Analysis of 5′ junctions of human LINE-1 and Alu retrotransposons suggests an alternative model for 5′-end attachment requiring microhomology-mediated end-joining. Genome Res. 15, 780–789. doi: 10.1101/gr.3421505

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Pizarro and Cristofari. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An LTR Retrotransposon-Derived Gene Displays Lineage-Specific Structural and Putative Species-Specific Functional Variations in Eutherians

Masahito Irie<sup>1,2</sup>, Akihiko Koga<sup>3</sup>, Tomoko Kaneko-Ishino<sup>1</sup>\* and Fumitoshi Ishino<sup>2</sup>\*

*<sup>1</sup> Department of Nursing, School of Health Sciences, Tokai University, Isehara, Japan, <sup>2</sup> Department of Epigenetics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan, <sup>3</sup> Department of Cellular and Molecular Biology, Primate Research Institute, Kyoto University, Inuyama, Japan*

### Edited by:

*Tammy A. Morrish, Formerly affiliated with University of Toledo, USA*

#### Reviewed by:

*Richard McLaughlin, Fred Hutchinson Cancer Research Center, USA; Annette Damert, Babes-Bolyai University, Romania; Tom F. Moore, University College Cork, Ireland*

### \*Correspondence:

*Tomoko Kaneko-Ishino tkanekoi@is.icc.u-tokai; Fumitoshi Ishino fishino.epgn@mri.tmd.ac.jp*

#### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry*

Received: *12 March 2016* Accepted: *01 June 2016* Published: *23 June 2016*

#### Citation:

*Irie M, Koga A, Kaneko-Ishino T and Ishino F (2016) An LTR Retrotransposon-Derived Gene Displays Lineage-Specific Structural and Putative Species-Specific Functional Variations in Eutherians. Front. Chem. 4:26. doi: 10.3389/fchem.2016.00026*

Amongst the 11 eutherian-specific genes acquired from a sushi-ichi retrotransposon is the CCHC type zinc-finger protein-encoding gene *SIRH11/ZCCHC16.* Its contribution to eutherian brain evolution is implied because of its involvement in cognitive function in mice, possibly via the noradrenergic system. Although the possibility that *Sirh11/Zcchc16* functions as a non-coding RNA still remains, dN/dS ratios in pairwise comparisons between its orthologs have provided supportive evidence that it acts as a protein. It became a pseudogene in armadillos (Cingulata) and sloths (Pilosa), the only two extant orders of xenarthra, which prompted us to examine the lineage-specific variations of *SIRH11/ZCCHC16* in eutherians. We examined the predicted SIRH11/ZCCHC16 open reading frame (ORF) in 95 eutherian species based on the genomic DNA information in GenBank. A large variation in the SIRH11/ZCCHC16 ORF was detected in several lineages. These include a lack of a CCHC RNA-binding domain in its C-terminus, observed in gibbons (Hylobatidae: Primates) and megabats (Megachiroptera: Chiroptera). A lack of the N-terminal half, on the other hand, was observed in New World monkeys (Platyrrhini: Primates) and species belonging to New World and African Hystricognaths (Caviomorpha and Bathyergidae: Rodents) along with Cetacea and Ruminantia (Cetartiodactyla). Among the hominoids, interestingly, three out of four genera of gibbons have lost normal *SIRH11/ZCCHC16* function by deletion or the lack of the CCHC RNA-binding domain. Our extensive dN/dS analysis suggests that such truncated SIRH11/ZCCHC16 ORFs are functionally diversified even within lineages.
Combined, our results show that *SIRH11/ZCCHC16* may contribute to the diversification of eutherians by lineage-specific structural changes after its domestication in the common eutherian ancestor, followed by putative species-specific functional changes that enhanced fitness and occurred as a consequence of complex natural selection events.

Keywords: LTR retrotransposon, genome evolution, exaptation, primate, brain functions

## INTRODUCTION

Mutation and selection are two principal factors in the Darwinian theory of evolution. The domestication of long terminal repeat (LTR) retrotransposons and retroviruses is a kind of mutation that promotes macroevolution through diversification of genomic function by creating new host genes from exogenous genetic materials (Kaneko-Ishino and Ishino, 2012, 2015; Lavialle et al., 2013; Imakawa et al., 2015). In addition to the investigation of duplication of genes (Ohno, 1970; Kimura, 1983), such acquired genes afford good examples for studying macroevolution and diversification as well as serving as lineage-specific markers in phylogenetic analysis. In the human genome, there are approximately 30 LTR retrotransposon-derived genes belonging to two main groups, the sushi-ichi retrotransposon homologs (SIRH, also called MART, or SUSHI) and the paraneoplastic Ma antigen (PNMA) family (Voltz et al., 1999; Rosenfeld et al., 2001; Schüller et al., 2005; Campillos et al., 2006; Kaneko-Ishino and Ishino, 2012, 2015; Iwasaki et al., 2013). These genes are derivatives of the original LTR retrotransposons, but each member has a unique DNA sequence. Therefore, each seems to have been domesticated in such a manner as to have its own unique function. Among the SIRH genes, PEG10/SIRH1 (Paternally expressed 10) is a therian-specific gene, which is conserved in eutherians and marsupials and plays an essential role in early placenta formation (Ono et al., 2001, 2006; Suzuki et al., 2007). Among all the other eutherian-specific SIRH genes, PEG11/RTL1/SIRH2 (Paternally expressed 11/Retrotransposon-like 1) and SIRH7/LDOC1 (Leucine zipper, downregulated in cancer 1) also have been shown to have essential placental functions (Charlier et al., 2001; Edwards et al., 2008; Kagami et al., 2008; Sekita et al., 2008; Naruse et al., 2014), such as maintenance of fetal capillaries and the differentiation/maturation of a variety of placental cells, respectively.
All of this evidence provides strong support for the contribution of SIRH genes to the evolution of viviparity in mammals via their eutherian-specific functions (Kaneko-Ishino and Ishino, 2012, 2015).

SIRH11/ZCCHC16 (Zinc-finger CCHC domain-containing 16) is an X-linked gene that encodes a CCHC type of zinc-finger protein that exhibits high sequence identity to the LTR retrotransposon Gag protein (Irie et al., 2015). It is expressed in the brain, kidney, testis and ovary in adult mice, and its deletion causes abnormal mouse behaviors related to cognition, including attention, impulsivity and working memory, possibly via the locus coeruleus–noradrenergic (LC-NA) system (Irie et al., 2015). It is proposed that phasic activation of NA neurons in the LC is linked to cognitive shifts that facilitate dynamic reorganization of target neural networks, thus permitting rapid behavioral adaptation in response to changing environmental imperatives (Berridge and Waterhouse, 2003; Bouret and Sara, 2005). Therefore, we suggest that the acquisition of SIRH11/ZCCHC16 has played a role in eutherian brain evolution (Irie et al., 2015; Kaneko-Ishino and Ishino, 2015).

The possibility that Sirh11/Zcchc16 functions as a noncoding RNA has not been completely excluded. The dN/dS ratio is a good indicator of selective pressure acting on a protein-coding gene, calculated as the ratio of the number of nonsynonymous substitutions per nonsynonymous site, in a given period of time, to the number of synonymous substitutions per synonymous site, in the same period. Values <1 mean that the gene in question is subject to purifying selection, because nonsynonymous changes tend to alter the protein function whereas synonymous changes have no impact on it. In the case of SIRH11/ZCCHC16, the dN/dS ratios in pairwise comparisons of the orthologs between the mouse and seven representative eutherian species other than xenarthran species are approximately 0.35–0.45 (<1), which suggests that SIRH11/ZCCHC16 has undergone purifying selection after its domestication (exaptation) in the common eutherian ancestor (Irie et al., 2015).
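The definition above can be turned into a small calculation. The following is a minimal sketch, assuming that counts of nonsynonymous and synonymous differences and sites are already available for a pair of orthologs. The function names, the Jukes-Cantor multiple-hit correction (as used in Nei-Gojobori-style estimates) and the example counts are illustrative assumptions; this section does not state which estimation method was used for the reported values.

```python
import math
from typing import Optional


def jukes_cantor(p: float) -> float:
    """Convert a proportion of observed differences into an estimated
    number of substitutions per site (Jukes-Cantor correction)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs: float, nonsyn_sites: float,
          syn_diffs: float, syn_sites: float) -> Optional[float]:
    """Return dN/dS, or None when dS is 0 (the ratio is then undefined)."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0.0:
        return None
    return d_n / d_s


# Hypothetical counts, for illustration only (not taken from the paper).
ratio = dn_ds(nonsyn_diffs=20, nonsyn_sites=700, syn_diffs=20, syn_sites=200)
print(ratio is not None and ratio < 1)  # True: fewer nonsynonymous changes per site
```

Returning `None` when dS is 0 mirrors the "nd" entries described in the Table 1 legend, where the ratio cannot be calculated.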

The evolution of mammalian species is associated with several critical geological events and their associated environmental and geographical impacts. The split of the therians from the monotremes occurred 166–186 Mya, followed by the eutherian/marsupial split 160 Mya (Luo et al., 2003; Asher et al., 2004; Madsen, 2009). The domestication of PEG10 occurred between these two periods (Suzuki et al., 2007) and all the other SIRH genes, such as PEG11/RTL1, SIRH7/LDOC1, and SIRH11/ZCCHC16, were domesticated after the eutherian/marsupial split and before the split of the three major eutherian lineages, boreoeutheria (including euarchontoglires and laurasiatheria), afrotheria and xenarthra 120 Mya that was associated with the division of the supercontinent Pangea (Edwards et al., 2008; Nishihara et al., 2009; Naruse et al., 2014; Irie et al., 2015).

After the extinction of the dinosaurs at the Kreide (Cretaceous)-Paleogene (K-Pg) boundary 65 Mya, an adaptive radiation of mammals independently took place in Eurasia, North and South America, Africa, Australia and Antarctica as well as several isolated islands (Murphy and Eduardo, 2009). The long-term isolation of Australia and South America from other continents, as well as the reunion of pairs of continents, such as Eurasia and Africa, and South and North America, affected the subsequent evolutionary route and history of the eutherians as well as other organisms to a great extent. For example, Xenarthrans evolved and diverged on the isolated South American continent, where carnivorous marsupials and birds had long predominated (Patterson and Pascual, 1972; Murphy and Eduardo, 2009). After the Isthmus of Panama emerged ∼3 Mya, the carnivorous marsupials were replaced by an invading carnivorous laurasiatherian species from North America (Patterson and Pascual, 1972). As mentioned, the domestication of all the SIRH genes was completed by the time of the emergence of the common eutherian ancestor, after which extensive eutherian diversification occurred in a lineage- and species-specific manner (Kaneko-Ishino and Ishino, 2012, 2015). Therefore, it is of great interest to examine the lineage-specific variations of SIRH11/ZCCHC16 as a model gene to explore the extent of its involvement in the eutherian diversification process.

## RESULTS

### Conservation of SIRH11/ZCCHC16 in Eutherian Species

SIRH11/ZCCHC16 encodes a protein composed of approximately 300–310 amino acids (aa), with a CCHC RNA-binding domain in its C-terminus (Irie et al., 2015). Based on whole genome sequence data of 85 eutherian species from GenBank, including two xenarthran species, the SIRH11/ZCCHC16 ORF in each species was deduced from its own DNA sequence that displayed homology with human SIRH11/ZCCHC16. The predicted ORFs in the 83 species, excluding the two xenarthran species with the pseudoSIRH11/ZCCHC16, are illustrated in **Figure 1** (see also **Figure S1**: SIRH11/ZCCHC16 aa sequence). Although there might be some sequence errors in the genomic information, we used it to perform an initial investigation. Ideally, DNA sequences from multiple individuals should be analyzed in every species in this type of investigation. Instead, in this work, we focused on lineage-specific cases and confirmed the mutations using genomic DNA from the same as well as additional species in some of the primate lineages. As a result, a total of 95 eutherian species were analyzed in this study. Afrotheria is the most closely related eutherian group to the xenarthrans and their SIRH11/ZCCHC16 ORFs are highly conserved, as previously reported (**Figure 1**, the lowest of the six rows).

### Mutations Leading to the Loss of a CCHC RNA-Binding Domain in the Boreotherians

Nonsense mutations leading to loss of a CCHC RNA-binding domain were observed in five boreotherian species [two euarchontoglires and three laurasiatherians: the white-cheeked gibbon (Primates), Chinese tree shrew (Scandentia), Amur tiger (Carnivora), and two flying fox species (Chiroptera)] (**Figure 1**, **Figure S1**). Although these mutations may be due to sequence errors, it is worth considering other possibilities, including cases in which the mutations are lineage-specific, such as those in the gibbons and the two megabat species (Megachiroptera). Our analyses suggest that this type of mutation changes SIRH11/ZCCHC16 function at least in these lineages. Interestingly, in one gibbon species, the truncated SIRH11/ZCCHC16 is suggested to have become pseudogenized, and the gene was lost from two other gibbon species by profound structural changes.

In the case of the white-cheeked gibbon (Nomascus leucogenys: Nle), there is a four-base pair deletion leading to a frameshift and the subsequent emergence of a nonsense codon just after it (**Figure 2A**). We confirmed all of these sequence changes in another white-cheeked gibbon at a Japanese zoo. In pairwise dN/dS comparisons among the hominoids, SIRH11/ZCCHC16 was found to be highly constrained among humans, chimpanzees and gorillas (0.20–0.42), while the values of the gibbon (Nle) compared to humans and chimpanzees are higher (0.61–0.74), and those compared to the gorilla and orangutan are close to 1 (0.84–0.90). Although it is not possible to compare pairwise dN/dS values rigorously with this kind of approximate method, these results suggest that the truncated Nle ORF has been subjected to a lesser degree of purifying selection compared to other hominoid members having the full-length SIRH11/ZCCHC16 (**Table 1A**). It is also probable that the Nle SIRH11/ZCCHC16 has lost some function by losing its CCHC RNA-binding domain (Rajavashisth et al., 1989; Curtis et al., 1997; Chen et al., 2003; Schlatter and Fussenegger, 2003; Narayanan et al., 2006; Matsui et al., 2007).
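How a four-base pair deletion shifts the reading frame and brings a stop codon into frame can be illustrated with a toy example. The sketch below uses the standard genetic code and a made-up 27-nt sequence; it is not the gibbon *SIRH11/ZCCHC16* sequence, and the helper names `translate_until_stop` and `delete_bases` are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_until_stop(dna: str) -> str:
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def delete_bases(dna: str, start: int, length: int) -> str:
    """Return the sequence with `length` bases removed from index `start`."""
    return dna[:start] + dna[start + length:]


# Made-up ORF for illustration only (not a real SIRH11/ZCCHC16 sequence).
toy_orf = "ATGGCTGAAACCGGTTTAGCATGGTGA"
mutant = delete_bases(toy_orf, start=6, length=4)

print(translate_until_stop(toy_orf))  # MAETGLAW
print(translate_until_stop(mutant))   # MAPV
```

Because four is not a multiple of three, every codon downstream of the deletion is read in a new frame, and in this toy case a TAG that was previously out of frame terminates translation early.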

As the gibbons are close relatives of the hominids, including humans, we further analyzed two other species in the Hylobatidae family, the white-handed gibbon, Hylobates lar: Hla, and the siamang, Symphalangus syndactylus: Ssy, and found that normal SIRH11/ZCCHC16 was absent from these two species. We set up PCR conditions using primers designed from the gibbon (Nle) DNA sequence. These primers worked well even in gorilla, macaque and marmoset samples but none of the expected bands were obtained from Hla or Ssy (**Figure 2B**, top and middle columns). The results were almost the same with primers designed from the conserved sequences between gibbon and human (F3, F4, R3, R4; **Figure 2B**, top and bottom columns), and we confirmed that the quality of their genomic DNA was good enough for PCR analysis by amplifying tyrosinase (TYR) and lactase (LCT) genes as controls (**Figure S2**). Finally, we performed Southern blot analysis using a gorilla PCR fragment (F1-R1: 544 bp) as a probe and confirmed that no band corresponding to SIRH11/ZCCHC16 appeared in the white-handed gibbon and siamang (**Figure 2C**), suggesting that a large deletion or profound structural change had occurred in these two gibbon species. All these results demonstrate that gibbons in at least three out of four genera do not possess the normal full-length SIRH11/ZCCHC16 ORF as a result of deletion/structural changes or the lack of a CCHC RNA-binding domain, supporting the notion that the gibbon SIRH11/ZCCHC16 gene is not functional and instead has become a pseudogene.

In the case of megabats, the nonsense mutation of SIRH11/ZCCHC16 is conserved between these two species, while the ORF is intact in the four closest microbat species (Microchiroptera) (**Figures 2D,E**), indicating that this mutation occurred relatively recently. Except for the loss of the C-terminal CCHC domain, the megabat SIRH11/ZCCHC16 ORF looks well conserved without further nonsense or frameshift mutations. In this case, it is difficult to estimate whether the truncated ORFs are functional or not. Although the dN/dS value between the two megabats is 1.5, it is difficult to distinguish whether they have been subjected to positive or neutral evolution (**Table 1B**). However, it may simply reflect the fact that these two species are too closely related to be examined in this way, because both the dN and dS values are approximately 10 times lower than in the four microbat species. Among the Chiroptera, the values between the megabats and the microbats are slightly higher (0.63–0.72) than those among the microbats (0.53–0.63; **Table 1B**). Although it is possible that they still possess certain functions, the function of the truncated megabat SIRH11/ZCCHC16 has already been changed or lost, or is on the way to either of these fates, while the microbat SIRH11/ZCCHC16 has undergone purifying selection.


FIGURE 2 | (A) Nonsense mutation in gibbon *SIRH11/ZCCHC16.* The four bp deletion (blue in a red box) in gibbon leads to a nonsense mutation (red). Note that there is a G to A transition (DNA polymorphism) in a stop codon of gibbon (TAA) compared with human/chimpanzee and other primates (TAG). (B) PCR analysis of gibbon *SIRH11/ZCCHC16.* Upper panel shows the schematic representation of primer design. Lower panel shows agarose gel electrophoresis profile in each primer set. The arrows represent expected band size. M, 100 bp and 1 kb ladder; Gor, gorilla; Hla, white-handed gibbon; Ssy, siamang; Rhe, rhesus macaque; Mar, common marmoset; Sol, solvent only (no DNA). (C) Southern blot analysis of *Hla* and *Ssy*. Left and right panels show the result of hybridization using *SIRH11* and *TYR* probes, respectively. The arrows indicate expected band size. Gor, gorilla; Hla, white-handed gibbon; Ssy, siamang; Rhe, rhesus macaque; Mar, common marmoset. (D) Amino acid sequence alignment of Chiroptera SIRH11/ZCCHC16. The blue asterisks in a red box indicate a common nonsense mutation in megachiroptera. The asterisks, colons and periods below the amino acids indicate identical, strongly and weakly similar residues among six species, respectively. (E) DNA alignments around the common TAA nonsense mutation (red).

### Mutations Resulting in the Loss of the N-Terminal Half of the ORF in the Boreotherians

In three lineages in boreotheria, a deletion of the N-terminal half of SIRH11/ZCCHC16 ORF was observed [i.e., the New World monkeys (Primates: three species/three examined (3/3)), the New World and African hystricognaths (Rodentia: 3/3 and 2/2, respectively) as well as species belonging to Cetacea and Ruminantia (Cetartiodactyla: 5/5 and 7/7, respectively)] (**Figure 3**). In all of these species, short putative ORFs, mainly comprising 167 aa, are conserved although the causative mutations are independent of each other, reflecting their own lineage-specific events. In these cases, the pairwise dN/dS analyses suggest that there has been selective constraint in some species and perhaps a more relaxed or neutral selection in others.

Among three New World monkeys, Ma's night monkey, the Bolivian squirrel monkey and the common marmoset, a common deletion of 11 aa was observed near the N-terminus, with a common nonsense mutation just after it (**Figures 3A,C**). The putative short ORF that starts from the next Met codon comprises 167 aa in the first two and 241 aa in the last because of a single additional Met codon that arose in a species-specific manner (**Figure S1**). We further analyzed five more species, the long-haired spider monkey (Ateles belzebuth: Abe), common squirrel monkey (Saimiri sciureus: Ssc), tufted capuchin (Cebus apella: Cap), Azara's owl monkey (Aotus azarae: Aaz) and the cotton-top tamarin (Saguinus oedipus: Soe), and confirmed that the common deletion of 11 aa and the subsequent nonsense mutation are a lineage-specific feature (**Figures 3B,C**). Thus, it seems probable that these mutations emerged in the common Platyrrhini ancestor from which all the New World monkeys in South America diverged (Poux et al., 2006).
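The idea of a putative short ORF starting from the next Met codon downstream of a premature stop can be sketched on a conceptual translation. The example below is a simplified illustration that assumes the downstream Met lies in the same reading frame and that stop codons are written as `*`; the function name `next_met_orf` and the toy peptide are hypothetical and do not represent any species analyzed here.

```python
from typing import Optional


def next_met_orf(translation: str) -> Optional[str]:
    """Given an in-frame conceptual translation in which '*' marks stop
    codons, return the peptide starting at the first Met after the first
    stop, or None if there is no stop or no downstream Met."""
    first_stop = translation.find("*")
    if first_stop == -1:
        return None  # no stop codon at all in this translation
    met = translation.find("M", first_stop + 1)
    if met == -1:
        return None  # no Met downstream of the first stop
    end = translation.find("*", met)
    return translation[met:] if end == -1 else translation[met:end]


# Toy translation for illustration only: a premature stop after three residues.
toy = "MKT*AAMGLCCHC*"
print(next_met_orf(toy))  # MGLCCHC
```

A species-specific additional Met codon further upstream, as described for the common marmoset, would simply move the start of the returned peptide and lengthen the putative ORF.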

A similar situation was assumed in the case of the two closest rodent groups, the South American and African hystricognaths (Caviomorpha and Bathyergidae; Poux et al., 2006). The lost N-terminal parts contain several nonsense mutations and frameshifts, and only one nonsense mutation is conserved in all of the species (**Figure 3D**), indicating that this mutation emerged in a common ancestor in Africa (Poux et al., 2006). Compared with the heavily mutated N-terminal region, the C-terminal region, comprising 161–203 aa, is highly conserved.

Compared with the above two cases, DNA sequences corresponding to the N-terminal half are completely missing in the species of Cetacea and Ruminantia, indicating a large deletion occurred in the common ancestor of these two suborders, although the former has an additional frameshift event that took place in a Cetacea-specific manner (**Figure 3E**, **Figure S3**). The remaining ORFs, each comprising 167 aa, are also highly conserved.

The pairwise dN/dS analysis within each lineage suggests that some portions of the tree display gene-wide dN/dS values that are consistent with purifying selection. For example, among five New World monkey species the truncated SIRH11/ZCCHC16 ORFs are highly constrained (dN/dS = 0.09) between the two night monkeys (Azara's owl monkey and Ma's night monkey; **Table 1C**). However, those of the tufted capuchin are variable: greater than 1 (1.9 and 1.4) to the night monkeys, close to or less than 1 (0.99 and 0.80) to the common marmoset and Bolivian squirrel monkey. Those of the Bolivian squirrel monkey are consistently close to 1 (0.80–1.1), suggesting neutral evolution. Thus, it is possible that the functions of the truncated ORFs in Platyrrhini were diversified to a great extent, presumably because of species-specific adaptation after the structural change of N-terminus deletion in the common Platyrrhini ancestor.
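The way pairwise values are flagged in Table 1 can be written as a simple rule. The sketch below uses the two cut-offs given in the Table 1 legend (values above 0.80 and below 0.21 are highlighted, and "nd" marks pairs whose dS is 0). The label strings and the function name `flag_dn_ds` are illustrative assumptions, and the numbers passed in are the Platyrrhini values quoted in the paragraph above.

```python
from typing import Optional

HIGH_CUTOFF = 0.80  # values above this are highlighted in red in Table 1
LOW_CUTOFF = 0.21   # values below this are highlighted in blue in Table 1


def flag_dn_ds(value: Optional[float]) -> str:
    """Label a pairwise dN/dS value using the Table 1 legend cut-offs.
    None stands for a pair whose dS is 0, reported as 'nd'."""
    if value is None:
        return "nd"
    if value > HIGH_CUTOFF:
        return "high"
    if value < LOW_CUTOFF:
        return "low"
    return "intermediate"


# Platyrrhini values quoted in the text, plus one undefined pair.
for v in (0.09, 0.80, 0.99, 1.4, 1.9, None):
    print(v, flag_dn_ds(v))
```

Such labels only summarize the table; as noted for the gibbon comparison, pairwise values from an approximate method cannot be compared rigorously, so a "high" label is not by itself evidence of neutral or positive selection.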

In the case of rodents, the truncated SIRH11/ZCCHC16 ORFs in Hystricognathi are highly constrained among Rodentia, except for the pairs Damara mole rat/naked mole rat and Damara mole rat/long-tailed chinchilla (0.81 and 0.86, respectively; **Table 1D**). The cases of the Ruminantia (**Table 1E**) and Cetacea (**Table 1F**) are similar to the Platyrrhini in that the dN/dS values vary widely: some are highly constrained (between bison and cattle, as well as sheep and goats), while the others seem to be subjected to neutral evolution. All these results suggest that the function of the truncated SIRH11/ZCCHC16 ORFs diverged in a species-specific manner, possibly reflecting the functional constraints imposed by the environment after the lineage-specific structural changes.

Some species have ORFs with a small deletion of the N-terminal region, such as the western lowland gorilla, with a 28 aa deletion (Primates), and the upper Galilee mountains blind mole rat, with a 22 aa deletion (Rodentia). In the former, using three different individuals we confirmed the nonsense mutation close to the translational start site caused by a frameshift (**Figure 3F**).

### Loss of SIRH11/ZCCHC16 in Xenarthrans

We previously reported that SIRH11/ZCCHC16 in two armadillo species (Dasypus novemcinctus and Tolypeutes matacus: Cingulata) and two sloth species (Choloepus hoffmanni and Choloepus didactylus: Pilosa) was pseudogenized due to severe mutations, including multiple nonsense mutations and frameshifts (Irie et al., 2015). There are no common mutations between the two armadillo species belonging to two different genera, although the genomic DNA information corresponding to the C-terminal part was lacking in GenBank for one armadillo species (Dasypus novemcinctus). The two sloth species belonging to the same genus exhibit a quite similar pattern of nonsense and frameshift mutations.

### TABLE 1 | Pairwise dN/dS analyses on several lineages with truncated SIRH11/ZCCHC16.

*(A) Pairwise dN/dS analysis on Hominoidea, (B) Pairwise dN/dS analysis on Chiroptera, (C) Pairwise dN/dS analysis on Platyrrhini, (D) Pairwise dN/dS analysis on Rodentia, Hystricognathi species are shown in yellow. (E) Pairwise dN/dS analysis on Ruminantia, (F) Pairwise dN/dS analysis on Cetacea. The dN/dS values more than 0.80 or less than 0.21 are shown in red or blue, respectively. When the dS value is 0, it is impossible to calculate this value; it is therefore indicated as nd.*

FIGURE 3 | Continued

Squ, common squirrel monkey; Tam, cotton-top tamarin; Sol, solvent only (no DNA). (C) DNA sequence analysis of Platyrrhini *SIRH11/ZCCHC16.* DNA sequences of Azara's owl monkey, common marmoset, tufted capuchin, and long-haired spider monkey determined by our own experiments are also shown. Magenta boxes show lineage specific insertion or deletion. The underlined letters indicate the *SIRH11/ZCCHC16* ORF. (D) Amino acid sequence alignment of Hystricognathi SIRH11/ZCCHC16. The similar sequences among five Hystricognathi species are expressed in green. The red asterisks and Xs indicate the sites of ORF termination and frameshift, respectively. The blue asterisks in a red box indicate a common nonsense mutation in Hystricognathi. The underlined sequences indicate the putative short ORFs starting from a next Met codon. The asterisks, colons and periods below the amino acids indicate identical, strongly and weakly similar residues among six species, respectively. The house mouse sequence is used as a reference. (E) Amino acid sequence alignment of Cetartiodactyla SIRH11/ZCCHC16. The blue asterisks in red boxes indicate common nonsense mutations in Cetartiodactyla. The red asterisks and Xs indicate the sites of ORF termination and frameshift, respectively. The underlined sequences indicate the putative short ORFs starting from a next Met codon. The similar sequences among 14 Cetartiodactyla species are shown in green. The asterisks, colons and periods below the amino acids indicate identical, strongly and weakly similar residues among 14 species, respectively. The pig sequence is used as a reference. (F) Sequence analysis of gorilla *SIRH11/ZCCHC16.* Upper panel shows the schematic representation of the primer design and nonsense mutation site (red asterisk) in gorilla *SIRH11/ZCCHC16*. Middle panel shows the sequence comparison between gorilla, human, and chimpanzee. Lower panel represents the sequence results of three individuals.

We then searched again for a common nonsense mutation in the other armadillo species (Tolypeutes matacus) along with the two sloth species, and found a promising candidate nonsense mutation in the C-terminus, 22 aa upstream of the CCHC RNA-binding domain (**Figure 4A**, blue asterisks in a red box, and **Figure 4B**). This pattern of a shared nonsense mutation is consistent with the possibility that this mutation inactivated the gene, because the CCHC domain would be critical for normal SIRH11/ZCCHC16 function, as in the case of gibbons mentioned above. In this work, we have surveyed the mutations only in the ORF region, but it should be noted that pseudogenization may also have occurred through promoter or other regulatory mutations, or through missense mutations or insertions/deletions that inactivate the gene.

The anteater is the only remaining animal group in xenarthra, belonging to Pilosa, the same order as the sloth (Delsuc and Douzery, 2009). An attempt was made to analyze SIRH11/ZCCHC16 in the giant anteater (Myrmecophaga tridactyla) by genomic PCR, but the PCR did not work, even using primer sets designed with completely conserved DNA regions between armadillos (Cingulata) and sloths (Pilosa) (**Figure 4C**). Although a single band larger than the expected size was seen in two conditions (**Figure 4C**, F1R1, and F1R2), its DNA sequence did not have any relationship to SIRH11/ZCCHC16 (data not shown). The quality of its genomic DNA was good enough for PCR, as shown by amplifying the tyrosinase gene (TYR) using primers designed from the armadillo DNA sequence (**Figure S4**). However, due to the limited amount and relatively low quality of the DNA, we were unable to perform Southern blot analysis to confirm the absence of SIRH11/ZCCHC16 orthologs in its genome. Therefore, the final conclusion awaits the determination of the anteater genome sequence in the future, but all of the results obtained thus far suggest that the three extant groups in xenarthra lack any functional SIRH11/ZCCHC16 and that the pseudogenizing mutation(s) occurred in a common xenarthran ancestor.

The red asterisks and Xs show the sites of ORF termination and frameshift, respectively. The blue asterisks in a red box indicate a common mutation among three species. Purple characters indicate CCHC amino acids in the RNA binding domain. (B) DNA alignments around the common TAG nonsense mutation (red) indicated by an orange line in (A). (C) PCR analysis of xenarthra *SIRH11/ZCCHC16.* Upper panel: schematic representation of primer design to amplify xenarthra *SIRH11/ZCCHC16*. Lower panel: agarose gel electrophoresis profile in each primer set. Ant, giant anteater; Arm, southern three-banded armadillo; Slo, Linnaeus's two-toed sloth; φX, φX174 *Hae*III marker; 1 kb, 1 kb ladder marker.

### DISCUSSION

It is of interest to determine the roles genes acquired from LTR retrotransposons play in organisms in the current form of the developmental system as well as in the course of biological evolution. Among SIRH genes, PEG10/SIRH1, PEG11/RTL1/SIRH2, and SIRH7/LDOC1 are highly conserved across eutherian species, presumably because they play essential roles in the viviparous reproduction system via placental formation, maintenance, differentiation and maturation, respectively (Ono et al., 2006; Sekita et al., 2008; Naruse et al., 2014). In this study, we found that SIRH11/ZCCHC16 displays lineage-specific structural variations in eutherians, such as the lack of the CCHC RNA-binding domain or the N-terminal half, as well as species-specific variations in the resulting truncated ORFs. Thus, it is possible that certain SIRH genes, such as those concerning cognitive brain function, act as critical determinant factors in the diversification of the eutherians depending on a variety of environmental factors, such as ecological niches and the dynamics of life style as well as the evolutionary history of the species, including geological events.

We showed that all the South American primates and rodents have the truncated SIRH11/ZCCHC16 ORFs in addition to xenarthran pseudoSIRH11/ZCCHC16. Although this might not be of significance, it is of interest to consider the possibility that species with normal SIRH11/ZCCHC16 function suffered a competitive disadvantage in the South American environment in the past. South America has a unique evolutionary history in which geographical factors have played a critical role (Houle, 1999; Poux et al., 2006; Delsuc and Douzery, 2009; Murphy and Eduardo, 2009; Nishihara et al., 2009). Diversification of the three major eutherian groups, boreotheria, afrotheria and xenarthra, is supposed to largely be dependent on the division of the supercontinent Pangea, which is thought to have occurred approximately 120 Mya (Nishihara et al., 2009). Xenarthrans evolved and diverged on the isolated South American continent, where carnivorous marsupials and birds had long predominated (Patterson and Pascual, 1972; Murphy and Eduardo, 2009). After the Isthmus of Panama emerged ∼3 Mya, the carnivorous marsupials were replaced by an invading carnivorous laurasiatherian species from North America (Patterson and Pascual, 1972). In the competition between the carnivorous marsupials and xenarthrans after the extinction of the dinosaurs ∼65 Mya, as well as the marsupials and the carnivorous laurasiatherians ∼3 Mya, the presence/absence of SIRH11/ZCCHC16 among these groups might be a critical factor in the evolutionary outcome. For example, the extinct marsupials have no SIRH11/ZCCHC16, the xenarthrans have pseudoSIRH11/ZCCHC16 and only the laurasiatherians have a normal SIRH11/ZCCHC16 in the South American evolutionary history.

Another issue of interest is the SIRH11/ZCCHC16 mutations in the primates. Phylogenetic relationships, divergence times, and patterns of biogeographic descent among primate species are complex and still controversial. According to a recent molecular phylogenetic analysis using Species Supermatrix, the currently living primates last shared a common ancestor 71–63 Mya and Asia was the ancestral home of the primates. This is also true for the hominoids, suggesting that the ancestor of African apes and humans entered Africa, while the hylobatids remained in Asia (Springer et al., 2012). Among the hominoids, gibbons have lost the normal SIRH11/ZCCHC16: white-cheeked gibbon (Nle) has a truncated SIRH11/ZCCHC16 ORF lacking the CCHC RNA-binding domain, while white-handed gibbon (Hla) and siamang (Ssy) do not have the normal SIRH11/ZCCHC16 gene in their genomes. It is apparent that SIRH11/ZCCHC16 function is not conserved in the latter two species, but is this the case for the truncated SIRH11/ZCCHC16 in the former? It is typically difficult to determine whether truncated ORFs have original, similar or different functions through comparison of the amino acid sequence homology. The dN/dS analysis sometimes helps provide a useful prediction of whether they still possess some function (dN/dS < 1) or have lost their function and already become neutralized (dN/dS = 1), as previously shown in the xenarthran lineage (Irie et al., 2015). The higher dN/dS values of the gibbon (Nle) suggested that the truncated Nle ORF has been subjected to a lesser degree of purifying selection and that the Nle SIRH11/ZCCHC16 has lost some function by losing its CCHC RNA-binding domain. In retroviruses, the CCHC domain forms a part of the nucleocapsid protein that functions in virus genome packaging and the early infection process (Narayanan et al., 2006). 
Proteins containing the CCHC zinc-finger domain are commonly known to interact with single-stranded DNAs (ssDNAs) and RNAs (Matsui et al., 2007) and play important roles in Drosophila as well as mammalian development via transcriptional and translational regulation (Rajavashisth et al., 1989; Curtis et al., 1997; Chen et al., 2003; Schlatter and Fussenegger, 2003). Therefore, it is probable that the CCHC zinc-finger domain is essential for the normal SIRH11/ZCCHC16 function that confers selective advantage. In future, it will be of interest to consider the possibility that SIRH11/ZCCHC16 contributed to brain evolution in Hominoidea and also the alternative, that the loss of SIRH11/ZCCHC16 conferred some selective advantage in the gibbons.

Lineage-specific loss of the N-terminus of SIRH11/ZCCHC16 ORFs in all the species of New World monkeys and Hystricognathi is consistent with their evolutionary history. Our data indicate that a common ancestor of the Platyrrhini in South America already had the mutation(s) leading to the N-terminal deletion. It is proposed that the common ancestor emigrated from Africa and somehow immigrated into South America ∼34 Mya (Houle, 1999; Poux et al., 2006), possibly by an incidental current drift from Africa to South America that existed at that time (Houle, 1999). In the case of the two closest South American and African hystricognaths, the Caviomorpha and Bathyergidae, the results also show that the nonsense mutation leading to the N-terminal deletion first emerged in the common ancestor in Africa. It is proposed that a common Caviomorpha ancestor, from which all the rodent species in South America diverged, emigrated from Africa by an unknown event, just as New World monkeys did (Houle, 1999). The recent discovery of (Late) Eocene primates in Santa Rosa, Peru, extends the fossil record of primates in South America back approximately 10 million years, leading to consideration of possible similarities of an intercontinental dispersal mechanism for the two mammalian groups that occurred around 36 Mya (Bond et al., 2015). However, the Eocene primates bear little resemblance to any extinct or living South American primates, but do bear a striking resemblance to Eocene African anthropoids, while the Santa Rosa rodents exhibit a derived status relative to the contemporaneous African rodents. These authors therefore suggested two possibilities: that rodents and primates might not have had simultaneous crossing episodes, or that the two groups had differing rates of diversification after their arrival in South America (Bond et al., 2015).
Our results appear to support the latter idea, because the patterns of Hystricognathi and Platyrrhini SIRH11/ZCCHC16 diversification are very different, i.e., conservative vs. highly diversified, although this might not be directly related to the morphology of the molars.

In the Chiroptera, the dN/dS analysis did not provide good evidence to indicate that SIRH11/ZCCHC16 is subject to different types of evolutionary selection between the megabats and microbats. This may be because the number of species examined is limited; a more detailed analysis will be necessary to construct a precise evolutionary view among such closely related species. It is known that, of these two suborders of Chiroptera, a sophisticated laryngeal echolocation system is absent in Megachiroptera (Teeling, 2005, 2009). Therefore, it will be of interest to elucidate how structural changes in SIRH11/ZCCHC16 relate to certain neurological changes affecting differences in this behavior between these two suborders of Chiroptera. It should be noted that recent molecular data indicate that Microchiroptera is not a monophyletic group, thus suggesting that sophisticated laryngeal echolocation in the bats either originated in the ancestor of all bats and was subsequently lost in lineages leading to the megabats or originated more than once in the microbat lineages (Teeling et al., 2000). We found that the dN/dS values vary widely in several eutherian lineages that display the N-terminus deletion. This finding suggests that the function of the truncated SIRH11/ZCCHC16 ORFs diverged in a species-specific manner, implying that the protein contributed to diversification of eutherians by increasing evolutionary fitness although SIRH11/ZCCHC16 itself is not an essential gene in eutherian development and growth. However, it will be necessary to carry out maximum likelihood estimates of the dN/dS values using PAML branch models or other techniques to obtain supportive evidence for this idea.

Knockout mice demonstrated that Sirh11/Zcchc16 is involved in cognitive function, including attention, impulsivity and working memory. In mice, Sirh11/Zcchc16 is expressed in the adult kidney, testis and ovary in addition to the brain, but male and female KO mice exhibited normal fertility and kidney function. However, it is possible that it also plays some role in the kidney, testis, ovary and embryonic liver, where Sirh11/Zcchc16 expression was confirmed. Human SIRH11/ZCCHC16 is expressed in similar tissues and organs, such as the adult brain, liver, kidney and testis, as shown by RT-PCR, although the levels are very low, as in the case of mice (**Figure S5**). Therefore, it is important to identify the roles of SIRH11/ZCCHC16 in organs other than the brain in different lineages and species. It is also of particular interest to determine its function in humans because of the X-linked intellectual disability and attention-deficit/hyperactivity linked phenotypes of the Sirh11/Zcchc16 knockout mice.

### MATERIALS AND METHODS

### Ethics

All experiments using primate samples were performed in Kyoto University Primate Research Institute (KUPRI), in accordance with the Guidelines for Care and Use of Nonhuman Primates (Version 3; June 2010) published by KUPRI. For usage of these samples and publication of the results, we obtained permissions from the respective zoos that provided the samples.

## Primate Samples

All hominoid DNA samples were extracted from liver pieces collected from animals that died of natural causes at zoos, except for Nomascus leucogenys (Nle: white-cheeked gibbon). Nomascus genomic DNA was isolated from feces provided by Hirakawa Zoological Park. All New World monkey DNA samples were extracted from cultured epithelial cells originating from animals bred at KUPRI. The cultured cells were derived from a tiny piece of the ear skin of a live animal anesthetized for other purposes, such as a medical treatment or health checkup.

## PCR Analysis

For gibbon SIRH11 analysis, we prepared genomic DNA of Hylobates lar (Hla: white-handed gibbon) and Symphalangus syndactylus (Ssy: siamang). The PCR reaction was performed using PrimeSTAR GXL DNA Polymerase (TaKaRa, Japan) with the following conditions: 94°C, 2 min; 4 cycles of 98°C, 10 s, 50°C, 15 s, 68°C, 1 min; 32 cycles of 98°C, 10 s, 56°C, 15 s, 68°C, 15 s; final extension 68°C, 1 min. The following PCR primers were used: gibbon\_SIRH11\_F1: 5′-GGCATCTCTCCAATTCAGCTGTTAGCAACT-3′, gibbon\_SIRH11\_R1: 5′-GGCAAGGCAATCTCTTGTGAAGTGACCACA-3′, gibbon\_SIRH11\_F2: 5′-AGTGTCTTCTTCACAGCTAACAGCTTTGGC-3′, gibbon\_SIRH11\_R2: 5′-CTGCAGTAGAGGCACAAATGAGTTTCTAGC-3′, gibbon\_SIRH11\_F3: 5′-ACATATCTGGGCCTGACAAGAG-3′, gibbon\_SIRH11\_R3: 5′-GGCTTGGTGTTGGATCAAGG-3′, gibbon\_SIRH11\_F4: 5′-AGCAGTCATTTGGTAAACCCAC-3′, gibbon\_SIRH11\_R4: 5′-CAAGGAAGCCAACAATGGGAG-3′.
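As a rough planning aid, the cycling conditions quoted for the gibbon primers can be written down as data and summed. The sketch below is only an illustration: the `Step` and `Stage` classes and the `hold_time_seconds` helper are hypothetical names introduced here, and the total covers programmed hold times only, ignoring ramping between temperatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    temp_c: int    # hold temperature in degrees Celsius
    seconds: int   # hold time in seconds


@dataclass(frozen=True)
class Stage:
    cycles: int    # number of times the steps are repeated
    steps: tuple   # ordered Step objects within one cycle


def hold_time_seconds(stages):
    """Sum programmed hold times; ramp times between steps are ignored."""
    return sum(
        stage.cycles * sum(step.seconds for step in stage.steps)
        for stage in stages
    )


# Gibbon SIRH11 conditions as quoted in the text.
GIBBON_PROGRAMME = (
    Stage(1, (Step(94, 120),)),                               # initial denaturation
    Stage(4, (Step(98, 10), Step(50, 15), Step(68, 60))),     # low-stringency cycles
    Stage(32, (Step(98, 10), Step(56, 15), Step(68, 15))),    # main cycles
    Stage(1, (Step(68, 60),)),                                # final extension
)

if __name__ == "__main__":
    total = hold_time_seconds(GIBBON_PROGRAMME)
    print(f"{total} s of programmed hold time ({total / 60:.0f} min)")
```

Under these assumptions the programme amounts to 36 amplification cycles and 1,800 s (30 min) of hold time; the actual run time on an instrument will be longer because of ramping.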

For Platyrrhini SIRH11 analysis, we prepared genomic DNA of six species: tufted capuchin (Cap), common marmoset (Mar), Azara's owl monkey (Owl), long-haired spider monkey (Spi), common squirrel monkey (Squ) and cotton-top tamarin (Tam). Human (Hum) and rhesus macaque (Rhe) DNAs were used as controls. The PCR reaction was performed using PrimeSTAR GXL DNA Polymerase (TaKaRa, Japan) with the following conditions: for the F1R1 and F2R2 primer sets, 94°C, 2 min; 4 cycles of 98°C, 10 s, 50°C, 15 s, 68°C, 2 min; 30 cycles of 98°C, 10 s, 64°C, 15 s, 68°C, 40 s; final extension 68°C, 1 min; for the F3R3 and F4R4 primer sets, 94°C, 2 min; 4 cycles of 98°C, 10 s, 50°C, 15 s, 68°C, 1 min; 30 cycles of 98°C, 10 s, 56°C, 15 s, 68°C, 15 s; final extension 68°C, 1 min. The following PCR primers were used: Platyrrhini\_SIRH11\_F1: 5′-GGCATCTCTCCAATTCAGCTGTTAGCAACT-3′, Platyrrhini\_SIRH11\_F2: 5′-AGTGTCTTCTTCACAGCTAACAGCTTTGGC-3′, Platyrrhini\_SIRH11\_F3: 5′-GAGGGAGGAGAGAAAGGTACTG-3′, Platyrrhini\_SIRH11\_F4: 5′-TGCAGAACATTGGCCTTTTCC-3′, Platyrrhini\_SIRH11\_R1: 5′-GGCAAGGCAATCTCTTGTGAAGTGACCACA-3′, Platyrrhini\_SIRH11\_R2: 5′-CTGCAGTAGAGGCACAAATGAGTTTCTAGC-3′, Platyrrhini\_SIRH11\_R3: 5′-TCTGAGCAATTGGCAGGGTC-3′, Platyrrhini\_SIRH11\_R4: 5′-GGTCACCATGAAACTGGGTG-3′. The PCR products were directly sequenced.

For gorilla SIRH11 analysis, genomic DNA was isolated from frozen liver. The PCR reaction was performed using PrimeSTAR GXL DNA Polymerase (TaKaRa, Japan) with the following conditions: 94°C, 2 min; 36 cycles of 98°C, 10 s; 55°C, 15 s; 68°C, 20 s; final extension 68°C, 60 s. The following PCR primers were used: Gorilla\_SIRH11\_F1: 5′-GAGGGAGGAGAGAAAGGTACTG-3′ and Gorilla\_SIRH11\_R1: 5′-TCTGAGCAATTGGCAGGGTC-3′.
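Primer sequences copied from a typeset page often carry stray whitespace, so a quick sanity check can be useful before ordering or aligning them. The following sketch is illustrative only; the helper names `clean`, `gc_fraction` and `reverse_complement` are assumptions introduced here and are not part of the published protocol.

```python
# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq: str) -> str:
    """Strip whitespace left over from typesetting and upper-case the bases."""
    return "".join(seq.split()).upper()


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the cleaned sequence."""
    s = clean(seq)
    return (s.count("G") + s.count("C")) / len(s)


def reverse_complement(seq: str) -> str:
    """Reverse complement, e.g. to view a reverse primer on the forward strand."""
    return clean(seq).translate(COMPLEMENT)[::-1]


# Gorilla primers as listed in the text.
PRIMERS = {
    "Gorilla_SIRH11_F1": "GAGGGAGGAGAGAAAGGTACTG",
    "Gorilla_SIRH11_R1": "TCTGAGCAATTGGCAGGGTC",
}

if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        s = clean(seq)
        print(f"{name}: {len(s)} nt, GC = {gc_fraction(s):.2f}, "
              f"reverse complement = {reverse_complement(s)}")
```

The same check can be applied to any of the primer sets listed in this section by adding entries to `PRIMERS`.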

For xenarthra SIRH11 analysis, we prepared genomic DNA from three xenarthran species: Tolypeutes matacus (southern three-banded armadillo), Choloepus didactylus (Linnaeus's two-toed sloth), and Myrmecophaga tridactyla (giant anteater). Genomic DNA was isolated from frozen tissues using the DNeasy Blood & Tissue Kit (QIAGEN, Germany). The PCR reaction was performed using ExTaqHS (TaKaRa, Japan) with the following conditions: 96°C, 3 min; 30 cycles of 98°C, 10 s; 55 or 60°C, 30 s; 72°C, 60 s; final extension 72°C, 3 min. The following PCR primers were used: Xenarthra\_SIRH11\_F1: 5′-CTTACTGCCTGCCCATTGGT-3′, Xenarthra\_SIRH11\_R1: 5′-GGATTTTAAAAGTTGGTGCAGG-3′, Xenarthra\_SIRH11\_F2: 5′-GGCAGAGAATCTGATTCTA-3′, Xenarthra\_SIRH11\_R2: 5′-GTATTGGTGGTAGATCAGG-3′.

### DNA Sequencing of SIRH11/ZCCHC16 in Primate Species

The Gorilla\_SIRH11\_F1R1 PCR products described above were cloned into the pBluescript II SK (+) vector and sequenced using a forward primer, 5′-TGTAAAACGACGGCCAGT-3′, and a reverse primer, 5′-CAGGAAACAGCTATGACCATG-3′. The Platyrrhini\_SIRH11\_F1R1 PCR products described above were directly sequenced using the Platyrrhini\_SIRH11\_F1 and R1 primers. DNA Data Bank of Japan (DDBJ) accession numbers: LC150703 for western lowland gorilla SIRH11/ZCCHC16, LC150704 for tufted capuchin, LC150705 for Azara's owl monkey and LC150706 for long-haired spider monkey SIRH11/ZCCHC16.

### Southern Blot Analysis

We prepared genomic DNA of five species: gorilla (Gor), Hylobates lar (Hla: white-handed gibbon), Symphalangus syndactylus (Ssy: siamang), rhesus macaque (Rhe), and marmoset (Mar). Twelve micrograms of genomic DNA was digested with the restriction enzymes HindIII and XbaI. Southern blot analysis was performed using a standard protocol (electrophoresis: submerged in 1x TAE buffer, at 1.2 V/cm, for 18 h at 4°C using a 1.2% agarose gel; treatment of DNA in gel: denaturation with 0.5 N NaOH/0.5 M NaCl for 30 min at 20–30°C, neutralization with 0.5 M Tris/0.5 M NaCl (pH 7.0) for 15 min at 20–30°C; capillary transfer to nylon membrane, Hybond-N+ (GE Healthcare): 5x SSC was supplied and absorbed by a paper stack, for 4 h at 20–30°C; hybridization: 59°C for 12.5 h). The TYR and SIRH11 probes were each generated by genomic PCR using gorilla DNA as a template. Probe labeling, hybridization, washes, and detection were performed using the AlkPhos System (GE Healthcare), per the manufacturer's protocol.

### Computational Analysis

Eighty-five eutherian mammal SIRH11 genome sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/). Two SIRH11 sequences we previously identified in xenarthran species, Tolypeutes matacus and Choloepus didactylus, were obtained from DDBJ accession LOC064756 and LOC064757, respectively. The SIRH11 ORF in each species was identified by NCBI nucleotide BLAST search (http://blast.ncbi.nlm.nih.gov/Blast.cgi) using the human SIRH11 ORF sequence (GenBank Accession No. NC\_000023: 112454729-112455661) as the query sequence. EMBOSS Transeq (http://www.ebi.ac.uk/Tools/st/emboss\_transeq/) was used to translate nucleotide sequences into amino acid sequences. Multiple sequence alignment was constructed using Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/) in the default mode.
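The translation step can be illustrated with a few lines of code. This is a minimal sketch of what a tool such as EMBOSS Transeq does in its simplest mode (frame 1, standard genetic code); it is not the tool itself, and the function names `interval_length` and `translate` are hypothetical. It also checks that the quoted human query interval, assuming the coordinates are 1-based and inclusive, spans a whole number of codons.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def interval_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive coordinate interval."""
    return end - start + 1


def translate(nucleotides: str) -> str:
    """Translate frame 1; '*' marks stop codons, 'X' marks ambiguous codons."""
    seq = "".join(nucleotides.split()).upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3   # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    length = interval_length(112454729, 112455661)
    print(f"query interval: {length} nt = {length // 3} codons")
    print(translate("ATGGCCTAA"))   # toy sequence, not SIRH11
```

With inclusive coordinates the interval is 933 nt, that is, 311 codons; whether the final codon is a stop codon depends on how the interval was drawn, which the text does not state.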

### Estimation of the dN/dS Ratio

An amino acid sequence phylogenetic tree was constructed with MEGA6 (Tamura et al., 2013) using the Maximum Likelihood method based on the JTT matrix-based model. The codon alignment of cDNA was created with the PAL2NAL program (www.bork.embl.de/pal2nal/) (Suyama et al., 2006). The nonsynonymous/synonymous substitution rate ratio (ω = dN/dS) was estimated by using CodeML (runmode: −2) in PAML (Yang, 2007).
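How an estimated ω is read can be made explicit in code. The sketch below only encodes the usual interpretation (ω < 1 suggests purifying selection, ω near 1 suggests neutral evolution, ω > 1 suggests positive selection); it does not estimate dN or dS, which in this study was done by CodeML. The `tolerance` band around 1 is an arbitrary illustrative choice, not a statistical test, and the function names are assumptions.

```python
def omega(dn: float, ds: float) -> float:
    """Return the ratio dN/dS; undefined when dS is zero or negative."""
    if ds <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    return dn / ds


def interpret(w: float, tolerance: float = 0.1) -> str:
    """Coarse reading of omega; the tolerance is illustrative, not a test."""
    if w < 1 - tolerance:
        return "purifying selection"
    if w <= 1 + tolerance:
        return "approximately neutral"
    return "positive selection"


if __name__ == "__main__":
    # Made-up dN and dS values, for illustration only.
    for dn, ds in [(0.02, 0.20), (0.20, 0.20), (0.60, 0.20)]:
        w = omega(dn, ds)
        print(f"dN={dn:.2f} dS={ds:.2f} omega={w:.2f}: {interpret(w)}")
```

In practice, whether ω differs from 1 is judged with likelihood-based comparisons rather than a fixed band, so this helper is best treated as a reading aid.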

## AUTHOR CONTRIBUTIONS

Conceived and designed the experiments: TK, FI. Performed the experiments: MI, AK. Analyzed the data: MI, AK, TK, FI. Analyses of primate samples including DNA sequencing, genomic PCR and Southern blot analysis: AK. Wrote the paper: TK, FI.

### FUNDING

This work was supported by the Funding Program for Next Generation World-Leading Researchers (NEXT Program) from the Japan Society for the Promotion of Science (JSPS) to TK, Grants-in-Aid for Scientific Research 15H04427 to AK, and Grants-in-Aid for Scientific Research (S) from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of Japan to FI and TK.

### ACKNOWLEDGMENTS

We thank S. Kawada, T. Mori, and S. Kitamura at the National Museum of Nature and Science for providing us with the giant anteater sample for the PCR experiments. The following facilities provided us with tissue samples for PCR analyses through the GAIN (Great Apes Information Network): Sapporo Maruyama Zoo, gorilla (Gong); Kyoto City Zoo, gorilla (Hiromi); Fukuoka City Zoological Garden, gorilla (Willy); Hirakawa Zoological Park, siamang (Lian); Fukuoka City Zoological Garden, white-handed gibbon (Taki). Fecal samples of white-cheeked gibbon (Monjiro) were provided to us by Hirakawa Zoological Park. We greatly appreciate the cooperation of these facilities and their curators. We are also grateful to Yuki Enomoto of Kyoto University and Kana Mori of Hirakawa Zoological Park for technical assistance. Pacific Edit reviewed the manuscript prior to submission.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fchem.2016.00026

Figure S1 | Alignment of amino acid sequences of SIRH11/ZCCHC16 from 83 eutherian mammals (related to Figure 1).

Figure S2 | PCR analysis of gibbon TYR and LCT (related to Figure 2B). Left panel shows the scheme of PCR primer design. Right panel shows the agarose gel electrophoresis profile for each primer set. The arrows represent the expected band size. To confirm the quality of genomic DNAs of *Hla* and *Ssy*, we amplified the tyrosinase (*TYR*) and lactase (*LCT*) genes using PCR primers designed on the basis of the human sequence. For *TYR* gene amplification, the PCR reaction was performed using PrimeSTAR GXL DNA Polymerase (TaKaRa, Japan) with the following conditions: 94°C, 2 min; 4 cycles of 98°C, 10 s, 50°C, 15 s, 68°C, 2 min; 30 cycles of 98°C, 10 s, 64°C, 15 s, 68°C, 40 s; 68°C, 1 min. The following PCR primers were used: human\_TYR\_F: 5′-TAAGAGAAGCTCTATTCCTGACACTACCTC-3′ and human\_TYR\_R: 5′-AGCTGGTGCTTCATGGGCAAAATCAATGTC-3′. For *LCT* gene amplification, the PCR reaction was performed using PrimeSTAR GXL DNA Polymerase (TaKaRa, Japan) with the following conditions: 94°C, 2 min; 4 cycles of 98°C, 10 s, 54°C, 15 s, 55°C, 2 min; 30 cycles of 98°C, 10 s, 64°C, 15 s, 68°C, 50 s; 68°C, 1 min. The following PCR primers were used: human\_LCT\_F: 5′-AGTTCGAAAGAGATTTGTTCTACCACGGGA-3′ and human\_LCT\_R: 5′-AGCTCTGTTCATTGCCGTGGAAGGCCACGA-3′.

Figure S3 | DNA sequence alignment of Cetartiodactyla SIRH11/ZCCHC16 (related to Figure 3E). Among 14 species, corresponding DNA sequences upstream and around the first Met of pig SIRH11/ZCCHC16 and those of the short ORFs of the other 13 Cetartiodactyla species are highly conserved, demonstrating that a large deletion occurred in this lineage.

Figure S4 | PCR analysis of xenarthran TYR (related to Figure 4B). Left panel shows the scheme of PCR primer design. The PCR primers, F1 and R1, were designed in consensus sequences between the armadillo (*Dasypus novemcinctus*) and sloth (*Choloepus hoffmanni*) *TYR*. Right panel shows the agarose gel electrophoresis profile for each primer set. The arrows represent the expected band size. To confirm the quality of anteater genomic DNA, we amplified the tyrosinase (*TYR*) gene in three xenarthran species, *Tolypeutes matacus* (southern three-banded armadillo), *Choloepus didactylus* (Linnaeus's two-toed sloth) and *Myrmecophaga tridactyla* (giant anteater). The PCR reaction was performed using Ex*Taq*HS (TaKaRa, Japan) with the following conditions: 96°C, 3 min; 30 cycles of 98°C, 10 s; 55°C, 30 s; 72°C, 60 s; final extension 72°C, 3 min. The following PCR primers were used: Xenarthra\_TYR\_F1: 5′-GTTAGTCATGTGCTTTTCAGAAG-3′ and Xenarthra\_TYR\_R1: 5′-CCAGGTGCTTCATGAGCAAAAT-3′.

Figure S5 | Expression of human SIRH11/ZCCHC16 in adult tissues and organs. Upper and lower panels show *SIRH11/ZCCHC16* and *ACTB* agarose gel electrophoresis profiles, respectively. HEK293T genome was amplified as a control for PCR. Human total RNA was purchased from Clontech (Human total RNA Master Panel II, #636643). The cDNA was made from total RNA (1 µg) using ReverTra Ace qPCR RT Master Mix (TOYOBO, Japan). 10 ng of cDNA was used for RT-PCR analysis. The PCR reaction was performed using Ex*Taq*HS polymerase (TaKaRa, Japan) with the following conditions: 96°C, 1 min; 30 cycles (for *ACTB*) or 35 cycles (for *SIRH11/ZCCHC16*) of 98°C, 10 s; 60°C, 30 s; 72°C, 30 s; final extension 72°C, 3 min. The following primer sequences were used: hACTB\_F: 5′-AAGTGTGACGTGGACATCCG-3′ and hACTB\_R: 5′-GATCCACATCTGCTGGAAGG-3′; hSIRH11\_F: 5′-GGTGACCCTGCCAATTGCTC-3′ and hSIRH11\_R: 5′-AGGTACTCTTGTCAGGCCCAG-3′.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Irie, Koga, Kaneko-Ishino and Ishino. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# LTRs of Endogenous Retroviruses as a Source of Tbx6 Binding Sites

### Yukuto Yasuhiko, Yoko Hirabayashi and Ryuichi Ono\*

*Division of Cellular and Molecular Toxicology, Biological Safety Research Centre, National Institute of Health Sciences, Tokyo, Japan*

Retrotransposons are abundant in mammalian genomes and can modulate the expression of surrounding genes by disrupting endogenous binding sites for transcription factors (TFs) or by providing novel TF binding sites within retrotransposon sequences. Here, we show that a (C/T)CACACCT sequence motif in ORR1A, ORR1B, ORR1C, and ORR1D, Long Terminal Repeats (LTRs) of MaLR endogenous retrovirus (ERV), is the direct target of Tbx6, a member of an evolutionarily conserved family of T-box TFs. Moreover, by comparing gene expression between control mice (Tbx6 +/−) and Tbx6-deficient mice (Tbx6 −/−), we demonstrate that at least four genes, *Twist2*, *Pitx2*, *Oscp1*, and *Nfxl1*, are down-regulated with Tbx6 deficiency. These results suggest that ORR1A, ORR1B, ORR1C and ORR1D may contribute to the evolution of mammalian embryogenesis.

### Edited by:

*Tammy A. Morrish, Independent Investigator, United States*

### Reviewed by:

*Feng Zhang, Fudan University, China Dustin C. Hancks, University of Utah, United States Yong Zhang, Institute of Zoology (CAS), China*

> \*Correspondence: *Ryuichi Ono onoryu@nihs.go.jp*

### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry*

Received: *22 December 2016* Accepted: *23 May 2017* Published: *15 June 2017*

### Citation:

*Yasuhiko Y, Hirabayashi Y and Ono R (2017) LTRs of Endogenous Retroviruses as a Source of Tbx6 Binding Sites. Front. Chem. 5:34. doi: 10.3389/fchem.2017.00034*

Keywords: endogenous retroviruses, retrotransposon, transcription factors, evolution, TBX6

## INTRODUCTION

About half of the mammalian genome is occupied by DNA sequences derived from transposable elements (TEs) (Lander et al., 2001; Waterston et al., 2002; Lindblad-Toh et al., 2005; de Koning et al., 2011). Retrotransposons, which mobilize via an RNA intermediate by a copy-and-paste mechanism, comprise the majority of mammalian TEs, whereas DNA transposons, which move via a cut-and-paste mechanism, comprise a small fraction and have accumulated mutations that render them immobile (Deininger et al., 2003). Most TEs are nonfunctional and are regarded as genomic parasites or junk DNA; however, a growing body of evidence suggests that retrotransposons and retrotransposon-derived genes have acquired functions essential for host survival during mammalian evolution (Yoder et al., 1997; Levin and Moran, 2011; Hancks and Kazazian, 2012).

In some cases, open reading frames from TEs are domesticated as endogenous genes during mammalian evolution. For example, Peg10 and Rtl1, derived from the gag and pol proteins of the Ty3/Gypsy type retrotransposon, which is similar to Sushi-ichi, are highly conserved in mammals and participate in placental formation (Ono et al., 2001, 2003, 2006; Sekita et al., 2008). Two other members of the eleven-gene Sushi-ichi retrotransposon homolog (Sirh) family, Sirh7/Ldoc1 and Sirh11/Zcchc16, encode open reading frames (ORFs) similar to the gag protein of Sushi-ichi; they are involved in the determination of the timing of parturition and in cognitive function in the brain, respectively (Ono et al., 2011; Naruse et al., 2014; Irie et al., 2015). Syncytins/SYNCYTINs (mouse/human) and FEMATRIN (cow), derived from the envelope of endogenous retrovirus (ERV), mediate cell-cell fusion to form the syncytiotrophoblast and induce fusion with bovine endometrial cells in vitro (Mi et al., 2000; Dupressoir et al., 2009, 2011; Nakaya et al., 2013). Skin aspartic protease (SASPase), which has a retrovirus-like aspartic protease domain, plays an important role in determining the texture of the skin by modulating the degree of hydration through the processing of profilaggrin (Matsui et al., 2010).

Since the discovery of TEs, it has been posited that TEs may seed regulatory elements throughout genomes and drive phenotypic differences between species via changes in transcriptional output (McClintock, 1950; Britten and Davidson, 1969; Feschotte, 2008). It has become evident that many TEs, such as long terminal repeats (LTRs) of endogenous retroviruses (ERVs), contain TF binding sites and are associated with gene expression patterns. For example, MuERV-L LTRs function as alternative promoters for protein-coding genes, including Gata4 and Tead4, which are important for the specification of primitive endoderm and trophectoderm, respectively, in two-cell embryos (Kigami et al., 2003; Evsikov et al., 2004; Macfarlan et al., 2012). It has also been reported that MuERV-L, exclusively expressed in two-cell embryos, is captured at double-strand break (DSB) sites introduced by the CRISPR/Cas9 system in mouse zygotes (Ono et al., 2015). Some intracisternal A-particle (IAP) retrotransposon insertions are known to induce de novo metastable epi-alleles, such as agouti viable yellow (Avy), axin fused (AxinFu) and the Cdk5rap locus (Vasicek et al., 1997; Morgan et al., 1999; Druker et al., 2004). The stochastic nature of the establishment of the epigenetic state of the 5′ LTR leads to variable expressivity of the adjacent genes. Both the sense and anti-sense LINE-1 (L1) promoters can drive L1 chimeric transcripts (Criscione et al., 2016). Moreover, AS071 and AS021, two AmnSINE1s present in mammals as well as birds and reptiles, are enhancers of the genes FGF8 (fibroblast growth factor 8), 178 kb from AS071, and SATB2, 392 kb from AS021 (Sasaki et al., 2008). Recently, it was reported that MER41, a primate-specific endogenized gammaretrovirus, is a source of interferon γ (IFNG)-inducible binding sites (Chuong et al., 2016).

In this study, we demonstrate a potential role for ORR1A (Origin-Region Repeat 1A), ORR1B, ORR1C, and ORR1D, LTRs of the MaLR (Mammalian-Apparent Long-Terminal Repeat Retrotransposon) endogenous retrovirus-like element, in controlling gene expression via Tbx6 binding (Smit, 1993). Because Tbx6 functions in the regulation of early embryogenesis, including anti-neural fate regulation in the presomitic mesoderm and later somite segmentation, ORR1A, ORR1B, ORR1C, and ORR1D may have played a role in the evolution of mammalian embryogenesis (Chapman and Papaioannou, 1998; Takemoto et al., 2011).

### RESULTS AND DISCUSSION

Tbx6 belongs to an evolutionarily conserved family of T-box transcription factors (TFs), known to be involved in the neural-mesodermal fate determination of axial stem cells (Chapman and Papaioannou, 1998; Takemoto et al., 2011). Previously, we revealed that Tbx6 directly activates the expression of Mesp2, a segmentation and polarization factor in somitogenesis, in a Notch signal-dependent manner (Yasuhiko et al., 2006). Dll1, a ligand of Notch signaling, is also a direct target of Tbx6, implying that Tbx6 participates in the regulation of the Notch signaling pathway (White and Chapman, 2005). The consensus core sequence of Tbx6 binding sites has been reported as CACACCT or AGGTGTBRNNNN (White and Chapman, 2005). In this study, we used (C/T)CACACCT as a consensus of both reports (White and Chapman, 2005; Yasuhiko et al., 2006).

First, the Tbx6 binding sequence motif, (C/T)CACACCT, was identified by whole-genome in silico screening. We then retained only those Tbx6 binding sequences that have at least two more Tbx6 binding sequences within the neighboring 100 bp upstream and/or downstream regions, because we previously demonstrated that higher enhancer activity of Tbx6 was observed when there are more than three Tbx6 binding sequences within a narrow region. As a result, 3500 potential Tbx6 binding sites were identified, and a characteristic feature was revealed (**Figure 1A**; Supplementary Table 1).
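The screening rule can be sketched on a single sequence as follows. This is not the gggenome-based pipeline used in the study: it is a small regular-expression illustration, and it assumes that both strands are scanned and that "within 100 bp" means the start positions of two motifs are at most 100 bp apart. The function names and the `window` and `min_neighbours` parameters are hypothetical.

```python
import re

# (C/T)CACACCT on the forward strand and its reverse complement.
# Lookaheads are used so that overlapping matches would also be reported.
FORWARD = re.compile(r"(?=([CT]CACACCT))")
REVERSE = re.compile(r"(?=(AGGTGTG[AG]))")


def find_sites(seq: str):
    """Return sorted (start, strand) pairs for every motif occurrence."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-") for m in REVERSE.finditer(seq)]
    return sorted(hits)


def clustered_sites(seq: str, window: int = 100, min_neighbours: int = 2):
    """Keep sites with at least `min_neighbours` other sites within `window` bp."""
    sites = find_sites(seq)
    kept = []
    for i, (pos, strand) in enumerate(sites):
        neighbours = sum(
            1
            for j, (other, _) in enumerate(sites)
            if j != i and abs(other - pos) <= window
        )
        if neighbours >= min_neighbours:
            kept.append((pos, strand))
    return kept


if __name__ == "__main__":
    # Toy sequence with three motifs spaced 28 bp apart.
    toy = "TCACACCT" + "A" * 20 + "AGGTGTGA" + "A" * 20 + "CCACACCT"
    print(find_sites(toy))
    print(clustered_sites(toy))
```

A genome-scale screen would need an indexed search tool, as used in the study, rather than this quadratic neighbour count, which is only meant to make the rule concrete.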

Approximately 70.0% of potential Tbx6 binding sites lie within repeat sequences (**Figure 1A**). Specifically, 85.7% of the potential Tbx6-binding repeat sequences were within ORR1A, ORR1B, ORR1C, and ORR1D, LTRs of the MaLR, spanning 679 independent ORR1 LTRs, while SINEs and LINEs represent only 2% of the Tbx6-binding repeat sequences (Bao et al., 2015; Supplementary Table 1).

There are 166,375 RepeatMasker-annotated ORR1s, including partial sequences, in the mouse genome (MM10), and 20% of them have at least one Tbx6 binding site (**Figure 1B**). In fact, the reference sequences of ORR1 LTRs from Repbase, which are consensus sequences of ORR1s, have one or two Tbx6 binding sequence motifs (**Figure 1C**). These data suggest that tandem insertions of these LTRs, or degenerate LTR sequences with more than three Tbx6 binding sequence motifs, might be good targets for Tbx6 binding in vivo. Furthermore, potential Tbx6-binding ORR1s have more than three Tbx6 binding motifs within themselves or share the Tbx6 binding motifs with neighboring sequences.

Tbx6-binding ORR1s more than 300 bp in length were selected, and the consensus sequences including three Tbx6 binding motifs and the absolute distance from each ORR1 to the nearest mouse reference gene were determined (**Figures 1D,E**; Supplementary Table 1). The strong interaction between Tbx6 and the consensus sequence of Tbx6-binding ORR1s was confirmed by electrophoretic mobility shift assay (EMSA), and the interaction disappeared when mutations were introduced into the Tbx6-binding motifs one by one (**Figure 1F**). The finding that three Tbx6 binding motifs have stronger binding affinity than one or two was consistent with our previous report (**Figure 1G**; Yasuhiko et al., 2008). As Tbx6-binding ORR1s were relatively enriched near gene transcription start sites (**Figure 1E**), Tbx6 may contribute to regulating the expression level of nearby genes within windows of up to 60 kb. Then, to explore the influence of ORR1A, ORR1B, ORR1C, and ORR1D on the regulation of gene expression by Tbx6, we compared the expression levels of 9 genes, randomly selected from those within 50 kb of potential Tbx6 binding sites on ORR1A, ORR1B, ORR1C, and/or ORR1D, in Tbx6 (+/−) (control) and Tbx6 (−/−) (Tbx6 KO) embryos at 8.0 days post-coitus (dpc). Because Tbx6 KO embryos have morphological abnormalities after 9 dpc, we used 8.0 dpc embryos in this study to exclude secondary effects from morphological abnormalities.

#### FIGURE 1 | Continued

retrovirus (ERV), occupy 57% of the total Tbx6 binding sequence motifs. (B) Of all the ORR1 sequences in the mouse genome, 20% of ORR1s have at least one Tbx6 binding sequence. (C) DNA sequence comparison between ORR1A (Rodentia ancestral shared), ORR1A0 (Mus musculus), ORR1A2 (Muridae), ORR1A3 (Muridae), ORR1A4 (Muridae), ORR1B (Rodentia ancestral shared), ORR1B1 (Mus musculus), ORR1B2 (Mus musculus), ORR1C1 (Rodentia ancestral shared), ORR1C2 (Rodentia ancestral shared), ORR1D1 (Rodentia ancestral shared), ORR1D2 (Rodentia ancestral shared), ORR1E (Rodentia ancestral shared), ORR1F (Muridae) and ORR1G (Muridae) LTRs. Identical sequences are indicated by asterisks. The Tbx6 binding sequence motif is indicated by green boxes. Yellow boxes indicate the region corresponding to the Tbx6 binding sequence motif "Site C" in Figure 1D. (D) Sequence logo of the ORR1 LTRs that had more than three Tbx6 binding sequence motifs. Three tandem "AGGTGTGs," a Tbx6 binding sequence motif, are highly conserved between ORR1 LTRs that have more than three Tbx6 binding sequence motifs. (E) Frequency histogram of the absolute distance from each ORR1 to the nearest mouse reference gene. The background expectation is derived from the genome-wide ORR1 distribution. Statistical significance of the observed enrichment within the first 10 kb of the nearest mouse reference gene was assessed by a binomial test. (F) Site A and site B sequences independently bind to Tbx6 in an electrophoretic mobility shift assay (EMSA); however, the binding affinity is much higher in the presence of both sites A and B. Sequences of oligonucleotide probes are shown below the gel image. Mutated nucleotides are depicted in lower case. (G) The triple Tbx6 binding sequence motif shows the highest binding affinity to Tbx6, while other T-box TFs, including T (Brachyury), Eomes, Mga, Tbx18, and zebrafish Tbx6 (zTbx6), have no affinity.
Arrowheads in (F,G): Positions of the bands resulting from multiple Tbx6 binding to ORR1 sequences. Sequences of oligonucleotide probes are shown below the gel image. Mutated nucleotides are depicted in lower case.

As expected, four genes, Twist2, Pitx2, Oscp1, and Nfxl1, were down-regulated, although the expression of five other genes, Enpep, Prdm2, Corin, Pdpn and Map4k4, was not altered significantly (**Figure 2**). It has been reported that enhancer activity can be blocked by epigenetic repressive marks in the neighboring regions, such as histone deacetylation and trimethylation of K9 and K27 on histone H3 (H3K9me3 and H3K27me3), or by an insulator, a genetic boundary element blocking the interaction between enhancers and promoters (Roth et al., 2001; Schmidl et al., 2009; Greer and Shi, 2012; Dowen et al., 2014). It is possible that, for the five genes whose expression levels were not altered by Tbx6 deficiency, regulation by Tbx6 is blocked by epigenetic modifications or unknown silencers.

Our analysis revealed the rodent-specific ORR1 family of ERVs to be a source of Tbx6 binding sites. Furthermore, Tbx6-binding ORR1s are enriched near genes that might be associated with several biological processes and molecular pathways (**Figure 3**). In the human genome, there are 2,927 potential TBX6 binding motifs; however, the majority of sites are not in LTRs but in simple repeat sequences or Alu elements (Supplementary Table 2). Although the source of Tbx6/TBX6 binding sequences differs between species, each mammalian species might have shaped its own set of Tbx6/TBX6 binding sequences over the course of mammalian evolution. Our analysis and other reports, including those describing the primate-specific MER41 family as IFNG-inducible binding sites and AmnSINE1s as mammalian enhancers, raise the possibility that TE-derived regulatory elements influence lineage-specific mammalian evolution (Sasaki et al., 2008; Chuong et al., 2016).

### MATERIALS AND METHODS

### Bioinformatic Analyses

Tbx6 binding sequence motifs, (C/T)CACACCT, were identified in the mouse (mm10) and human (hg19) whole genomes using GGGenome (https://gggenome.dbcls.jp), and motifs that did not have at least two more Tbx6 binding sequences within the neighboring 100 bp upstream and/or downstream regions were filtered out. All the TE sequences were downloaded from the RepeatMasker track (mouse: mm10; human: hg19) of the UCSC Genome Browser (https://genome.ucsc.edu). The Intersect intervals program (https://usegalaxy.org/) was used to identify the TEs that have potential Tbx6 binding sequences, using the potential Tbx6 binding sites as a query against RepeatMasker-annotated TEs. The ClosestBed program (https://usegalaxy.org/) was used to find the closest mouse reference genes (mm10) and to determine the absolute distance between each potential Tbx6 binding motif and its closest reference gene. These distances were grouped into 10-kb bins. The expected background was determined by randomly sampling an equal number of elements from the remaining 78,042 annotated ORR1s that did not have more than three Tbx6 binding motifs. Sampling was repeated 100 times, and the mean number of elements was used as the expected value for comparison with the potential Tbx6-binding ORR1s. Statistical significance was determined for the first 10-kb bin by a binomial test as previously described (Chuong et al., 2016). Gene ontology of the closest reference genes within 50-kb windows of potential Tbx6-binding motifs was determined by the GREAT program (http://bejerano.stanford.edu/great/public/html/index.php) (**Figure 3**). The consensus sequence of the potential Tbx6-binding motifs was identified using the ClustalW program (for alignment: http://clustalw.ddbj.nig.ac.jp/index.php?lang=ja) and the Sequence Logo program (for generation of sequence logos: http://weblogo.berkeley.edu).
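The distance binning, background sampling, and first-bin enrichment test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names (`bin_counts`, `expected_first_bin`, `binomial_upper_tail`), the toy distances, the fixed random seed, and the choice of a one-sided upper-tail test with the sampled background fraction as the null probability are all assumptions made for the example. The text specifies only the 10-kb bins, the 100 rounds of sampling, and the use of a binomial test.

```python
import math
import random

BIN_SIZE = 10_000  # 10-kb bins, as stated in the text


def bin_counts(distances, bin_size=BIN_SIZE, n_bins=10):
    """Histogram of absolute distances to the nearest gene in fixed-width bins."""
    counts = [0] * n_bins
    for d in distances:
        idx = int(abs(d) // bin_size)
        if idx < n_bins:
            counts[idx] += 1
    return counts


def expected_first_bin(background, n_sample, repeats=100, bin_size=BIN_SIZE, seed=0):
    """Mean number of sampled background elements that fall in the first bin."""
    rng = random.Random(seed)  # fixed seed (assumption) keeps the sketch reproducible
    total = 0
    for _ in range(repeats):
        sample = rng.sample(background, n_sample)
        total += sum(1 for d in sample if abs(d) < bin_size)
    return total / repeats


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


# Toy data: hypothetical distances in bp, not values from the study.
observed = [1_200, 4_800, 7_500, 9_900, 15_000, 32_000, 48_000, 61_000]
background = [i * 3_700 for i in range(1, 41)]

k = bin_counts(observed)[0]
expected = expected_first_bin(background, n_sample=len(observed))
p_null = expected / len(observed)  # assumed null: background fraction in the first bin
print("observed within 10 kb:", k)
print("expected within 10 kb:", expected)
print("one-sided binomial p-value:", binomial_upper_tail(k, len(observed), p_null))
```

In a real analysis the distances would come from the ClosestBed output rather than from hard-coded lists, and the sampled background would be drawn from the ORR1s without clustered Tbx6 motifs.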

### Electrophoretic Mobility Shift Assay (EMSA)

Full-length ORFs of mouse Tbx6 (NM\_011538.2), T (Brachyury; NM\_009309.2), Eomes (NM\_010136.3), and Tbx18 (NM\_023814.4), and the T-box-coding fragment of Mga (NM\_013720.2), were PCR amplified and cloned into the pCS2+ vector (Rupp et al., 1994). The expression vector pCS2-zTbx6 for zebrafish Tbx6 translation was a gift from Dr Hiroyuki Takeda (Terasaki et al., 2006). Transcription factors were transcribed and translated in vitro using the TnT® Quick Coupled Transcription/Translation System (Promega) following the manufacturer's protocol. The Mesp2 and Dll1 probes were positive controls for the assay and are described in Yasuhiko et al. (2006) and White and Chapman (2005), respectively. Sequences of DNA probes were as follows, with mutated nucleotides designated in lower case:

ORR1A0, 5′-GGGAGTGGCACCATCTGAAGGTGTGGCCTTGTTGGAATAGGTGTGACCTGGTTGGAATG-3′; ORR1A0mA, 5′-GGGAGTGGCACCATCTGAgaattcGGCCTTGTTGGAATAGGTGTGACCTGGTTGGAATG-3′; ORR1A0mB, 5′-GGGAGTGGCACCATCTGAAGGTGTGGCCTTGTTGGAATgaattcGACCTGGTTGGAATG-3′; ORR1A0mAB, 5′-GGGAGTGGCACCATCTGAgaattcGGCCTTGTTGGAATgaattcGACCTGGTTGGAATG-3′; ORR1B1, 5′-GGGAGTGGCACTATTAGAAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGTTGGAGGA-3′; ORR1B1mA, 5′-GGGAGTGGCACTATTAGAgaattcGGCCTTGTTGGAGTAGGTGTGGCCTTGTTGGAGGA-3′; ORR1B1mB, 5′-GGGAGTGGCACTATTAGAAGGTGTGGCCTTGTTGGAGTgaattcGGCCTTGTTGGAGGA-3′; ORR1B1mAB, 5′-GGGAGTGGCACTATTAGAgaattcGGCCTTGTTGGAGTgaattcGGCCTTGTTGGAGGA-3′; ORR1\_3xTbx6BS, 5′-TAGAGGAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGT-3′; ORR1\_3xTbx6BSmA, 5′-TAGAGGgaattcGGCCTTGTTGGAGTAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGT-3′; ORR1\_3xTbx6BSmAB, 5′-TAGAGGgaattcGGCCTTGTTGGAGTgaattcGGCCTTGTTGGAGTAGGTGTGGCCTTGT-3′; ORR1\_3xTbx6BSmABC, 5′-TAGAGGgaattcGGCCTTGTTGGAGTgaattcGGCCTTGTTGGAGTgaattcGGCCTTGT-3′; Mesp2, 5′-CCTTCGAGGGGTCAGAATCCACACCTCTGCAAATGGGCCCGCTTT-3′; Mesp2mB2, 5′-CCTTCGAGaGtaCtGAATCCACACCTCTGCAAATGGGCCCGCTTT-3′; Mesp2mB1, 5′-CCTTCGAGGGGTCAGAATCgAtAtCTCTGCAAATGGGCCCGCTTT-3′; Dll1, 5′-ACAATCAAAGGAACACTAGCTCCAAGAATCACACCTCGGGATTCTAATGAAGCTGCCTA-3′; Dll1m, 5′-ACAATCAAAGGAACACTAGCTCCAAGAATCgaattcCGGGATTCTAATGAAGCTGCCTA-3′.

Sense and anti-sense oligonucleotides for each probe were annealed, DIG-labeled and subjected to EMSA using the DIG Gel Shift Kit, 2nd Generation (Roche). Briefly, 4 fmol of labeled oligonucleotide probe was incubated with 4 µl of in vitro translated mixture, electrophoresed in a 7.5% polyacrylamide gel for 80 min at 100 V, and blotted onto a positively charged nylon membrane. Shifted oligonucleotides were detected using an anti-DIG Fab fragment (Roche) and CDP-Star Ready-to-Use AP substrate (Roche).
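The probe designs above can be checked against the (C/T)CACACCT motif used in the bioinformatic analysis. The sketch below is a hedged illustration rather than the authors' procedure (the genome-wide search was run with GGGenome): it scans both strands of the ORR1\_3xTbx6BS probe and its mA mutant, then applies a clustering filter that keeps only motifs with at least two other motifs within 100 bp. The function names are hypothetical, and measuring the 100-bp window between motif start positions is an assumption.

```python
import re

# Motif from the text, (C/T)CACACCT, and its reverse complement AGGTGTG(A/G).
# Lookaheads are used so that overlapping occurrences are also reported.
FORWARD = re.compile(r"(?=[CT]CACACCT)")
REVERSE = re.compile(r"(?=AGGTGTG[AG])")


def find_motif_starts(seq):
    """0-based start positions of the motif on either strand of seq."""
    seq = seq.upper()
    starts = {m.start() for m in FORWARD.finditer(seq)}
    starts |= {m.start() for m in REVERSE.finditer(seq)}
    return sorted(starts)


def clustered_hits(starts, window=100, min_neighbors=2):
    """Keep hits with at least min_neighbors other hits within window bp.

    Distances are measured between start positions (an assumption).
    """
    kept = []
    for s in starts:
        neighbors = sum(1 for t in starts if t != s and abs(t - s) <= window)
        if neighbors >= min_neighbors:
            kept.append(s)
    return kept


# Probe sequences copied from the list above; lower case marks mutated bases.
WILD_TYPE = "TAGAGGAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGT"
MUTANT_A = "TAGAGGgaattcGGCCTTGTTGGAGTAGGTGTGGCCTTGTTGGAGTAGGTGTGGCCTTGT"

for name, probe in (("ORR1_3xTbx6BS", WILD_TYPE), ("ORR1_3xTbx6BSmA", MUTANT_A)):
    hits = find_motif_starts(probe)
    print(name, "motif starts:", hits, "clustered:", clustered_hits(hits))
```

With these inputs the wild-type probe carries three motifs that pass the filter, while the mA mutant retains two motifs and therefore fails the requirement of two neighboring motifs.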

### Gene Targeting, Mouse Embryos and Real-Time RT-PCR

All animal studies were conducted in accordance with the guidelines approved by the animal care committee of the National Institute of Health Sciences (NIHS; No.934). The protocol was approved by the animal welfare committee of the National Institute of Health Sciences (NIHS; No.41). Animals had access to a standard chow diet and water ad libitum and were housed in a pathogen-free barrier facility with a 12L:12D cycle.

A Tbx6 conditional knockout mouse (Tbx6flox) was generated using the ES cell line TT2 and maintained in an ICR background (Yagi et al., 1993). Briefly, exons 3–5, encoding the T-box DNA binding domain of Tbx6, were flanked by a pair of loxP sites and knocked into the Tbx6 locus by homologous recombination. The PGK-neo selection marker was removed by the FLP-FRT system to obtain Tbx6flox mice. For cDNA preparation, embryos (8.0 days post-coitus, dpc) were obtained by crossing female CAG-Cre/Tbx6flox/+ hybrid heterozygotes with male Tbx6flox/flox homozygotes. Embryos were genotyped by PCR using allantois genomic DNA, and total RNA was prepared using an RNeasy mini kit (QIAGEN). Total RNA was pooled from 5 (Tbx6+/−) and 4 (Tbx6−/−) 8.0-dpc sibling embryos from the same litter. The sequences of primers for real-time RT-PCR were as follows:

Twist2\_forward, 5′-TGTCCGCCTCCCACTAGC-3′; Twist2\_reverse, 5′-TGTCCAGGTGCCGAAAGTC-3′; Pitx2\_forward, 5′-GGCAGTCACCCTGGGAAG-3′; Pitx2\_reverse, 5′-GCCGACACTAGTTTGCGACA-3′; Enpep\_forward, 5′-CCTGCTTTACGACCCCCTAC-3′; Enpep\_reverse, 5′-TTAGCCACAAGTCGTCCCAC-3′; Oscp1\_forward, 5′-GACTCTGCCGCTGCTCT-3′; Oscp1\_reverse, 5′-TCGTCCATGAACTTCCTGTTGA-3′; Prdm2\_forward, 5′-GCTTCGAGGACTTCCAGAGG-3′; Prdm2\_reverse, 5′-TGGTTTAGTGGCCCAGACAC-3′; Pdpn\_forward, 5′-AGGTGCTACTGGAGGGCTTA-3′; Pdpn\_reverse, 5′-GCTGAGGTGGACAGTTCCTC-3′; Nfxl1\_forward, 5′-AGAACCTCCTCAGTTGCTGC-3′; Nfxl1\_reverse, 5′-AAGGGGCATTCACCAGGATG-3′; Corin\_forward, 5′-GATATGTTCACGAAACGGCCC-3′; Corin\_reverse, 5′-CGCTCCTGTCTGCTCTCAAG-3′; Map4k4\_forward, 5′-TTCCGGCCTCTCAAGCCT-3′; Map4k4\_reverse, 5′-TCCCAGACTCCTCACTGGAG-3′; Mesp2\_forward, 5′-ACCCTTACACCAGTCCCTAGAAA-3′; Mesp2\_reverse, 5′-GGTTCTGGAGACACAGAAAGACT-3′; Msgn1\_forward, 5′-GCCAGAAAGGCAGCAAAGTC-3′; Msgn1\_reverse, 5′-AGACAGGCGGCAGGTAATTC-3′; β-actin\_forward, 5′-CTGTCGAGTCGCGTCCA-3′; β-actin\_reverse, 5′-ACGATGGAGGGGAATACAGC-3′.

Primers were designed using the Primer-BLAST tool (https://www.ncbi.nlm.nih.gov/tools/primer-blast/) to amplify 70–150 base pair (bp) fragments whose primer sites are separated by at least one intron (>500 bp), except for Msgn1 (a single-exon gene). PCR was performed using SYBR® Premix Ex Taq™ II (Takara RR820S) following the manufacturer's protocol, with the following cycling conditions: 1 cycle of 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and 60°C for 30 s.

### Statistical Analyses

Statistical significance for qPCR was assessed using a two-tailed unpaired Student's t-test with a threshold of p < 0.1.
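As an illustration of the test named above, the following Python sketch computes a two-tailed unpaired Student's t-test with pooled variance using only the standard library. It is a sketch under stated assumptions: the function names and the expression values are hypothetical (they are not data from this study), and the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistics package.

```python
import math


def student_t_statistic(a, b):
    """Pooled-variance (Student's) t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean_a - mean_b) / se, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # probability mass between -|t| and +|t|
    return max(0.0, 1.0 - central)


# Hypothetical relative expression values for one gene (not data from the study).
heterozygous = [1.00, 1.12, 0.91]
knockout = [0.42, 0.55, 0.61]

t_stat, df = student_t_statistic(heterozygous, knockout)
p_value = two_tailed_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")
print("significant at p < 0.1:", p_value < 0.1)
```

The threshold of 0.1 in the last line mirrors the one stated in the text; replicate measurements are needed for the variance estimate, so the sketch assumes several values per genotype.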

### ETHICS STATEMENT

The animal facility of the National Institute of Health Sciences has been approved by the Japan Health Sciences Foundation since 2008. All animal studies were conducted in accordance with the guidelines approved by the animal welfare committee of the National Institute of Health Sciences (NIHS; No. 41).

### AUTHOR CONTRIBUTIONS

RO conceived of the project. YY, YH, and RO participated in the experimental design. RO performed most analyses. YY produced Tbx6 KO mice and performed EMSA and RT-PCR. RO wrote the manuscript. All authors read and approved the final manuscript.

### ACKNOWLEDGMENTS

This work was supported by a Grant-in-Aid for Scientific Research (C) (26430183) and by the Research on Regulatory Science of Pharmaceuticals and Medical Devices from the Japan Agency for Medical Research and Development (AMED) to RO. We are grateful to Hiroyuki Takeda (University of Tokyo) for providing zTbx6 cDNA clones, and to Satoshi Kitajima and Eriko Ikeno for technical assistance in generating the Tbx6 KO mouse.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fchem.2017.00034/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Yasuhiko, Hirabayashi and Ono. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Endogenous Retroviruses: With Us and against Us

#### Thomas J. Meyer<sup>1</sup>\*, Jimi L. Rosenkrantz<sup>2,3</sup>, Lucia Carbone<sup>1,2,4</sup> and Shawn L. Chavez<sup>3,5</sup>

<sup>1</sup> Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA, <sup>2</sup> Department of Molecular and Medical Genetics, Oregon Health & Science University, Portland, OR, USA, <sup>3</sup> Division of Reproductive & Developmental Sciences, Oregon National Primate Research Center, Portland, OR, USA, <sup>4</sup> Department of Medicine, Knight Cardiovascular Institute, Oregon Health & Science University, Portland, OR, USA, <sup>5</sup> Departments of Obstetrics and Gynecology and Physiology and Pharmacology, Oregon Health & Science University School of Medicine, Portland, OR, USA

Mammalian genomes are scattered with thousands of copies of endogenous retroviruses (ERVs), mobile genetic elements that are relics of ancient retroviral infections. After inserting copies into the germ line of a host, most ERVs accumulate mutations that prevent the normal assembly of infectious viral particles, becoming trapped in host genomes and unable to leave to infect other cells. While most copies of ERVs are inactive, some are transcribed and encode the proteins needed to generate new insertions at novel loci. In some cases, old copies are removed via recombination and other mechanisms. This creates a shifting landscape of ERV copies within host genomes. New insertions can disrupt normal expression of nearby genes via directly inserting into key regulatory elements or by containing regulatory motifs within their sequences. Further, the transcriptional silencing of ERVs via epigenetic modification may result in changes to the epigenetic regulation of adjacent genes. In these ways, ERVs can be potent sources of regulatory disruption as well as genetic innovation. Here, we provide a brief review of the association between ERVs and gene expression, especially as observed in pre-implantation development and placentation. Moreover, we will describe how disruption of the regulated mechanisms of ERVs may impact somatic tissues, mostly in the context of human disease, including cancer, neurodegenerative disorders, and schizophrenia. Lastly, we discuss the recent discovery that some ERVs may have been pressed into the service of their host genomes to aid in the innate immune response to exogenous viral infections.

### Edited by:

Tammy A. Morrish, Independent Investigator, Ann Arbor, USA

### Reviewed by:

Ahmet M. Denli, Salk Institute for Biological Studies, USA; Reiner Strick, Universitätsklinikum Erlangen, Germany

\*Correspondence:

Thomas J. Meyer thomas.joshua.meyer@gmail.com

### Specialty section:

This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry

Received: 22 December 2016 Accepted: 20 March 2017 Published: 07 April 2017

#### Citation:

Meyer TJ, Rosenkrantz JL, Carbone L and Chavez SL (2017) Endogenous Retroviruses: With Us and against Us. Front. Chem. 5:23. doi: 10.3389/fchem.2017.00023

Keywords: endogenous retrovirus, genome, human disease, pre-implantation embryo, stem cells, placenta, innate immunity

## BACKGROUND

A retroviral genome exists in different forms during its replication cycle. A viral particle, or virion, protects the RNA genome of the retrovirus during escape from the host cell and infection of new cells. A virion that enters a new host cell deploys its genomic payload, using its own reverse transcriptase to convert the RNA viral genome into a DNA copy that is integrated into the host genome and is then referred to as a provirus (**Figure 1**). Subsequently, a provirus can be transcribed into RNA again, and then translated by the host's ribosomal machinery to produce more virions. Ancient retroviral infections have occasionally resulted in such integrations into the germline of the host, becoming endogenous retroviruses (ERVs). While some ERVs have been shown to produce infectious particles (van der Laan et al., 2000), most ERV copies suffer mutations over evolutionary time that prevent the normal assembly of viral particles, preventing horizontal transmission of infections between individuals. However, while now trapped within the host genome, some of these provirus copies are still transcribed and can encode some if not all of the original viral proteins. Therefore, ERVs are classified as a family of autonomous retrotransposons. Further, offspring of the host can inherit any germline ERV insertions from their parents, resulting in a pattern of vertical transmission over evolutionary time (**Figure 2**). As much as 8% of the human genome consists of ERV sequences acquired through repeated endogenization events followed by subsequent retrotranspositional expansion of captured viral subfamilies.

These ancient genomic residents represent a potent source of genomic and regulatory variability. The high degree of homology between these ERV copies, and the presence of the long terminal repeats (LTRs) at either end of each copy (**Figure 1**), provide an opportunity for non-allelic homologous recombination that can result in the excision of a given insertion, leaving behind only a single LTR copy. Recombination events between different insertions of the same or similar ERV subfamilies can result in deletions, duplications, and other rearrangements of intervening genomic sequences. Additionally, the ERV sequences themselves can contain motifs that can disrupt or modulate nearby genes and regulatory regions. Not surprisingly, ERV activity is associated with a number of human diseases and is the target of epigenetic repression by the host genome. However, the consequences are not solely deleterious, as there is evidence that ERVs have been co-opted into important regulatory and developmental roles as well.

## ERVS IN GERM CELLS AND PRE-IMPLANTATION EMBRYOS

Certain stages of mammalian pre-implantation embryo and germ cell development, characterized by multiple waves of epigenetic reprogramming, pose a unique challenge for the control of endogenous retroviral activity. During the two waves of epigenetic reprogramming that occur in primordial germ cells (PGCs) and fertilized oocytes, a considerable amount of DNA demethylation occurs. Examination of global DNA methylation at these stages has shown that levels within human and mouse pre-implantation embryos decrease beginning at the 1- to 2-cell stage, depending on the species, and continue to decrease up to or soon after the blastocyst stage (Kobayashi et al., 2012; Guo et al., 2014; Lee et al., 2014; Okae et al., 2014; Wang L. et al., 2014). Since DNA methylation is largely responsible for repression of many transposable elements, including ERVs (Walsh et al., 1998), the activity of ERVs and the alternative mechanisms repressing ERV activation during these periods of global hypomethylation have been the focus of a number of recent investigations.

Given that some ERV families have expanded substantially in the number of genomic integrations in animals (Tristem, 2000; Bénit et al., 2001), it has been hypothesized that widespread reactivation of ERVs during the waves of global reprogramming within germ cell and pre-implantation development is largely responsible for this expansion. On the other hand, it is also known that additional ERV repressive mechanisms must be in place in order to maintain genomic stability throughout epigenetic reprogramming and the highly choreographed molecular processes required for normal germ cell development, fertilization, and embryonic development. These ideas are not mutually exclusive, as there is substantial evidence supporting both reactivation (Fuchs et al., 2013; Wang J. et al., 2014; Grow et al., 2015) and alternative repression (Thomas and Schneider, 2011; Manghera and Douville, 2013; Leung et al., 2014; Liu et al., 2014; Schlesinger and Goff, 2015; Wolf et al., 2015; Thompson et al., 2016) across the vast number and variety of ERVs within the genome during germ cell development and embryogenesis.

Despite the existence of elaborate mechanisms that mediate ERV inactivation within the genome, there is extensive evidence that some ERVs are still active and play an important role during gametogenesis and pre-implantation development. Upregulation of ERV proviral transcription and protein expression has been well documented in early human embryos and embryonic stem cells (hESCs). For example, elevated expression of the ERV-H family has been observed within both naïve-like and primed hESC sub-populations (Wang J. et al., 2014; Theunissen et al., 2016; Supplementary Table 1). Additional transcripts from the ERV-K (HML-2) family are also observed at high levels within hESCs and rapidly decrease upon differentiation (Fuchs et al., 2013). Expression of ERV-K begins at the 8-cell stage, concurrent with embryonic genome activation (EGA), and continues throughout pre-implantation development into the blastocyst stage. A majority of actively transcribed ERV-K loci during this time are associated with LTR5HS, a specific subclass of LTR, which is confined to human and chimpanzee and contains an OCT4 binding motif. The LTR5HS subclass requires both hypomethylation and OCT4 binding for transcriptional activation, which synergistically facilitate ERV-K expression (Grow et al., 2015; Supplementary Table 1). Based on the elevated activity of these ERVs within hESCs and pre-implantation embryos, as well as their known interactions with other cellular factors during this time, it is thought that these ERVs have been functionally incorporated into roles important for defining and maintaining pluripotency-specific states.

The role of LTRs as regulatory regions for proviral DNA represents an additional function that can be utilized by or incorporated into host genomes. In particular, LTRs are known to be co-opted as promoters or enhancer elements of nearby genes important during embryonic development and maintenance of pluripotency (Friedli and Trono, 2015). Approximately 33% of all transcripts in human embryonic tissues are associated with repetitive elements, suggesting a clear pattern of embryonic cell specificity for viral promoters (Fort et al., 2014). Many transcripts detected in the totipotent blastomeres of mouse 2-cell embryos are initiated from LTRs upon EGA as well, indicating that these repeat sequences may help drive cell-fate regulation in mammals (MacFarlan et al., 2012). Regulatory activities of certain LTRs have also been shown to provide important functions not only in embryonic cells, but also within germ cells during gametogenesis. For example, germline-specific transactivating p63 (GTAp63), a member of the p53 family and a transcript important for maintaining genetic fidelity in the human male germline, is under the transcriptional control of an ERV9 LTR (Ling et al., 2002; Beyer et al., 2011; Liu and Eiden, 2011; Supplementary Table 1). Transcriptionally active GTAp63 suppresses proliferation and induces apoptosis upon DNA damage in healthy testis and is frequently lost in human testicular cancers. Restoration of GTAp63 expression levels in cancer cells was observed upon treatment with a histone deacetylase (HDAC) inhibitor, indicating possible epigenetic control of ERV9-mediated GTAp63 expression via activating histone acetylation marks. Thus, the ability of ERV9 regulatory regions to contribute to the maintenance of male germline stability is yet another example of how ERVs have evolved to serve an important function in their human hosts (Liu and Eiden, 2011).

## ERVS IN THE PLACENTA

The placenta is a transient organ representing the maternal-fetal interface during pregnancy; it is derived from the outer trophectoderm (TE) layer of blastocysts, and plays a critical role in the gas, nutrient, and waste exchange required for normal embryonic growth. It is well established that both mouse and human placentas are hypomethylated compared to other somatic cells derived from either in vivo or in vitro sources (Ehrlich et al., 1982; Fuke et al., 2004; Cotton et al., 2009; Popp et al., 2010; Hon et al., 2013). As such, the DNA methylation levels of LTRs within human placentas more closely resemble those observed in oocytes than in somatic tissues, averaging ∼60% methylation across the genome (Schroeder et al., 2015). Given this hypomethylation of LTRs in placentas, it is not surprising that numerous sub-families of ERV proviruses are expressed within human placental tissues. More specifically, there is evidence of proviral transcription from ERV-E (Yi and Kim, 2007), ERV3 (ERV-R; Boyd et al., 1993; Andersson et al., 2005), ERV-K (Kammerer et al., 2011), ERV-fb1 (Sugimoto et al., 2013), ERV-V1/2 (Esnault et al., 2013), ERV-W (Blond et al., 2000), and ERV-FRD (Blaise et al., 2003; Supplementary Tables 1, 2).

The most notable ERV families producing functional proteins during placentation are ERV-W and ERV-FRD, corresponding to Syncytin-1 and Syncytin-2, respectively, which are critical for the cellular fusion underlying human placental syncytia formation and maintenance (Blond et al., 2000; Mi et al., 2000; Blaise et al., 2003, 2005; Dunk et al., 2012; Supplementary Table 2). Cellular fusion is a relatively unique function in normal healthy tissues, with muscle, bone and placenta being the major exceptions. Since regulation of this highly specified function is of much interest, the precise mechanisms underlying the transcriptional control of the Syncytin-1 gene have been the topic of several investigations. Both DNA and histone H3K9 methylation have been reported to be important for inactivating ERV-W and thus repressing Syncytin-1 expression, resulting in pathological conditions such as exogenous viral infections and preeclampsia when repression does not occur (Matousková et al., 2006; Gimenez et al., 2009; Li et al., 2014; Zhuang et al., 2014). It has been shown that transcriptional activation of the ERV-W locus and the promotion of cell fusion also requires the synergism of LTR promoter hypomethylation, along with the binding of several transcription factors such as GCM1, Sp1, and GATA family members (Yu et al., 2002; Cheng et al., 2004; Prudhomme et al., 2004; Cheng and Handwerger, 2005; Chang et al., 2011). Recently, another ERV-derived protein called suppressyn has been identified to alternatively regulate Syncytin-1, but not Syncytin-2-based cell fusion by inhibiting its interaction with the Syncytin-1 associated receptor, ASCT2 (Sugimoto et al., 2013; Supplementary Table 2). Suppressyn is a truncation product of the proviral env gene from the ERV-fb1 element and is transcribed within the placenta. 
Within normal human placentas, suppressyn is co-expressed with Syncytin-1 in the syncytiotrophoblast layer (Sugimoto et al., 2013), further supporting that these two factors are involved in cell-cell fusion regulation at the maternal-fetal interface in utero.

Notably, integration of ERV-W and ERV-FRD into the genome occurred prior to the divergence of Old World (Catarrhini; Cáceres et al., 2006) and New World (Platyrrhini) monkeys (Blaise et al., 2003), respectively. Thus, Syncytin-1 and Syncytin-2 are only present in higher-order primate (Haplorhini) species, although functionally similar yet distinct ERV proviral proteins have been discovered throughout most mammalian genomes, as reviewed in Imakawa et al. (2015). The ERV-V env gene present within Old World monkeys has also been implicated in trophoblast fusion activity, possibly alleviating the lack of functional Syncytin-1 within these species, while the ERV-V reiterations present within the human genome are not functional in this capacity (Esnault et al., 2013; Supplementary Table 2). Syncytin-A and Syncytin-B appear to function like human Syncytins within the mouse placenta and are known to have entered the murine (Muridae) lineages approximately 20 million years ago (Dupressoir et al., 2005). Similarly, Syncytin-Ory1 has been discovered in rabbits and hares (Leporidae; Heidmann et al., 2009), Syncytin-Car1 within 26 different species of carnivorans (Carnivora; Cornelis et al., 2012), Syncytin-Mar1 within the squirrel-related clade (either Sciuridae or Marmotini; Redelsperger et al., 2014), Syncytin-Ten1 within tenrec (Tenrecidae; Cornelis et al., 2014), Syncytin-Rum1 in ruminants (Ruminantia; Cornelis et al., 2013), and Syncytin-Opo1 within the short-lived placenta of opossum and kangaroo marsupials (Marsupialia; Cornelis et al., 2015).

Several ERV-captured env genes have been proposed to have an immunosuppressive role that is important for preventing maternal rejection of the semi-allogeneic fetus during pregnancy. In addition to fusogenic properties derived from the env gene of ERV-FRD, Syncytin-2 contains a classical Env retroviral immunosuppressive domain that has been shown to have immunosuppressive activity in an in vitro tumor-rejection assay (Mangeney et al., 2007). Given its observed protein expression within cytotrophoblast cells of the human placenta, Syncytin-2 has been suggested to facilitate fetal tolerance by suppressing the maternal immune system. Other ERV-derived env proteins from ERV-V and ERV-K have also been proposed to possess an immunosuppressive role in controlling the maternal immune system during pregnancy. This is based on findings that both families have one or more proviral loci in the genome with intact env open reading frames (ORFs) and a corresponding immunosuppressive domain. Additionally, both ERV-V and ERV-K expression has been observed within placental trophoblast cells at the maternal-fetal interface, although corresponding in vitro functional assays have not yet been completed to directly support in vivo findings (Kammerer et al., 2011; Subramanian et al., 2011; Supplementary Table 1). Until these studies are undertaken, the exact function of ERV-V and ERV-K, and whether env protein expression from these ERVs induces maternal immunosuppression within the placenta, will remain unknown.

## ERVS AND HUMAN DISEASE

Through insertional mutagenesis, recombination between homologous copies, and the regulatory disruption that epigenetic suppression of ERV insertions can cause to nearby gene loci, there are many mechanisms by which these elements might cause disease. In particular, their association with various cancers has been well demonstrated, as reviewed in Katoh and Kurata (2013). For instance, ERV activity has been strongly associated with many breast cancers (Golan et al., 2008; Wang-Johanning et al., 2008; Salmons et al., 2014). In melanoma tissues, ERV-K expression of both RNA and protein has been shown (Büscher et al., 2005), and one recent study identified 24 transcribed ERV-K (HML-2) loci (Schmitt et al., 2013). In another study of Hodgkin's lymphoma, all cancer patient samples were found to have alternative transcripts of CSF1R, an important locus associated with this cancer, that initiate at the LTR of an ERV located ∼6.2 kb upstream of the normal promoter (Lamprecht et al., 2010).

ERVs have been demonstrated to be associated with a variety of neurologic diseases, as reviewed in Douville and Nath (2014). One such disease is amyotrophic lateral sclerosis (ALS). Elevated ERV-K (HML-2) activity has been observed in the brain tissue of ALS patients (Douville et al., 2011), while transgenic animals expressing the ERV-K env gene in cortical and spinal neurons developed motor dysfunction, suggesting that these elements may contribute to neurodegeneration (Li et al., 2015). Additionally, the expression of ERV-W env and gag has been observed in samples of muscle from ALS patients (Oluwole et al., 2007). While the ERV-W findings may be due to the inflammatory response (Alfahad and Nath, 2013), the support for the involvement of ERV-K in ALS is mounting, though causality has yet to be demonstrated. Multiple sclerosis (MS) is another neurological disease in which ERVs have been strongly implicated. MSRV (multiple sclerosis-associated retrovirus), a subtype of ERV-W, as well as ERV-W1 and W2 and ERV-H/F have all been linked to MS (reviewed in Christensen, 2016). One study showed significantly elevated Env antigen in serum of MS patients relative to controls, while qPCR of ERV-W in mononuclear cells from blood (PBMC) showed association with MS relative to controls (Perron et al., 2012a). This same study demonstrated Env expression in eight well-characterized MS brains that had lesions throughout the parenchyma and in perivascular infiltrates, as well as at the rim of chronic active lesions. ERV association with schizophrenia and bipolar disorder has been demonstrated through the presence of biomarkers for ERV-K and ERV-W found in blood, cerebrospinal fluid, and the pre-frontal cortex (Karlsson et al., 2001; Huang et al., 2006, 2011; Perron et al., 2012b). In one study of schizophrenia, hypermethylation of a specific ERV-W LTR insertion located in the regulatory region of the GABBR1 gene was associated with risk of schizophrenia (Hegyi, 2013). 
A nearly full-length ERV-K insertion near the PRODH gene, known to be associated with schizophrenia and other neuropsychiatric disorders, has been shown to work in concert with the internal PRODH CpG island to activate the gene. It is thought that aberrant DNA methylation of this locus may be a piece of the schizophrenia puzzle (Suntsova et al., 2013).

### ERVS MAY PLAY A ROLE IN THE INNATE IMMUNE RESPONSE

While the majority of ERV proviruses have acquired mutations, thereby preventing translation into protein, certain families have been especially well preserved and contain functional ORFs for one or more of the classical proviral genes. Within primates, ERV-K (HML-2) represents the best-preserved and most recently active ERV, containing a substantial number of loci that have predicted coding potential throughout different primate genomes. It has also been observed that ERV-K encodes a small accessory protein, Rec, in naïve ES cells and human blastocysts. Overexpression of Rec protein within human pluripotent cells increases the innate antiviral response and can inhibit exogenous viral infections, suggesting an immunoprotective role of the ERV-K Rec protein during early embryonic development (Grow et al., 2015; Supplementary Table 1). An additional ERV-K proviral protein, gag, which makes up the core of viral particles in exogenous retroviruses, is also expressed within human blastocysts and pluripotent cells. Immunolabeling of ERV-K gag protein followed by confocal and transmission electron microscopy revealed ERV-K gag protein within structures of blastocysts resembling virus-like particles (VLPs). This suggested that some ERV proviral sequences within the human genome still retain the ability to code for viral proteins and form VLPs during normal human embryogenesis. Proteins produced from ERV env genes have also been demonstrated to function as restriction factors against exogenous retroviral infection (Malfavon-Borja and Feschotte, 2015).

Even ERV proviruses that do not contain functional ORFs can still harbor sequence motifs that serve to modulate the activity of nearby genes. For instance, interferon (IFN) inducible enhancers have been dispersed via ERV insertions adjacent to IFN-inducible genes independently over mammalian evolution. This has resulted in regulatory networks of genes able to work in concert due to the presence of these ERV sequences. Further, CRISPR-Cas9 deletion of a MER41 insertion upstream of AIM2 in HeLa cells disrupted the endogenous IFNG-inducible regulation of this locus, demonstrating the utility that host genomes can obtain over time by harnessing ERV sequences (Chuong et al., 2016). In another example showing the variety of mechanisms by which ERVs are involved with innate immunity, Chiappinelli et al. (2015) demonstrated that induction of ERV expression, and especially bidirectional transcription of ERVs, activated a double-stranded RNA sensing pathway that triggers a type I interferon response and apoptosis.

## CONCLUSIONS

The relationship between ERVs and the human genome is a diverse and complicated one, resulting from millions of years of co-evolution. ERVs are known to be involved in disease through insertional mutagenesis, as targets of epigenetic repression, and via recombination of sequences between the homologous copies of these elements scattered across the genome. Throughout mammalian evolution, the deleterious effects of ERVs have been balanced by the benefits gained from innovative co-option of their sequences and proteins by their host genomes. These innovations include the intimate relationship between ERV activity and embryonic and placental development, as well as a number of ERV-associated regulatory networks that have become important components of the normal function of our genome. An innate immune response to exogenous retroviral infection is likely only one of several ERV functional roles. ERVs were once thought to be quiescent, dead residents of the human genome, but we are only beginning to uncover how actively our biology is intertwined with these long-time genomic partners.

### AUTHOR CONTRIBUTIONS

TM and JR drafted the manuscript and figures. SC and LC edited the manuscript.

### FUNDING

TM was supported by the National Library of Medicine of the National Institutes of Health under Award Number T15LM007088. JR was supported by the Collins Medical Trust Foundation and Glenn/AFAR Scholarship for Research in the Biology of Aging. LC and SC were supported by NIH/NICHD R01HD086073-A1, National Centers for Translational Research in Reproduction and Infertility (NCTRI). Additional funding for SC came from the Georgeanna Jones Foundation for Reproductive Medicine, Medical Research Foundation of Oregon, and the Collins Medical Trust. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, the Collins Medical Trust Foundation, or Medical Research Foundation of Oregon.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fchem.2017.00023/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Meyer, Rosenkrantz, Carbone and Chavez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Friends-Enemies: Endogenous Retroviruses Are Major Transcriptional Regulators of Human DNA

Anton A. Buzdin<sup>1,2</sup>\*, Vladimir Prassolov<sup>1</sup> and Andrew V. Garazha<sup>3,4</sup>

*<sup>1</sup> Department of Cell Biology, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia, <sup>2</sup> Centre for Convergence of Nano-, Bio-, Information and Cognitive Sciences and Technologies, National Research Centre "Kurchatov Institute," Moscow, Russia, <sup>3</sup> Group for Genomic Regulation of Cell Signaling Systems, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia, <sup>4</sup> Department of Biomedicine, Moscow Institute of Physics and Technology, Moscow, Russia*

Endogenous retroviruses are mobile genetic elements hardly distinguishable from infectious, or "exogenous," retroviruses at the time of insertion in the host DNA. Human endogenous retroviruses (HERVs) are not rare. They gave rise to multiple families of closely related mobile elements that occupy ∼8% of the human genome. Together, they shape the genomic regulatory landscape by providing at least ∼320,000 human transcription factor binding sites (TFBS) located on ∼110,000 individual HERV elements. The HERVs host as many as 155,000 mapped DNaseI hypersensitivity sites, which denote loci active in the regulation of gene expression or chromatin structure. The contemporary view of HERV evolutionary dynamics suggests that at the early stages after insertion, the HERV is treated by the host cells as a foreign genetic element, and is likely to be suppressed by targeted methylation and mutations. However, at the later stages, when a significant number of mutations has already accumulated and the retroviral genes are broken, the regulatory potential of a HERV may be released and recruited to modify the genomic balance of transcription factor binding sites. This process goes together with further accumulation and selection of mutations, which reshape the regulatory landscape of the human DNA. However, developmental reprogramming, stress or pathological conditions like cancer, inflammation and infectious diseases can remove the blocks limiting expression and HERV-mediated host gene regulation. This, in turn, can dramatically alter the gene expression equilibrium and shift it to a newer state, thus further amplifying instability and exacerbating the stressful situation.

Keywords: retrovirus, gene expression regulation, pathology, cancer, inflammation, stress, stability, infection

Human endogenous retroviruses (HERVs) and related genetic elements occupy ∼8% of the human genome. They are thought to be remnants of multiple ancient retroviral infections (Sverdlov, 2000; Belshaw et al., 2004; Buzdin, 2007). HERV insertions occurred in the ancestral germ cell lineage, became fixed in the genome, and became heritable (Buzdin et al., 2003; Dewannieux and Heidmann, 2013). In human DNA, HERVs are represented by 504 groups comprising 717,778 individual fragments (RepeatMasker, hg19). The individual HERV copies are frequently interrupted by other sequences, such as transposable elements, and may each be represented by two or more genomic fragments.

### Edited by:

*Tammy A. Morrish, Independent Investigator, United States*

### Reviewed by:

*Artem Babaian, University of British Columbia, Canada Avindra Nath, National Institute of Neurological Disorders and Stroke, United States*

> \*Correspondence: *Anton A. Buzdin buzdin@ponkc.com*

#### Specialty section:

*This article was submitted to Cellular Biochemistry, a section of the journal Frontiers in Chemistry*

Received: *17 December 2016* Accepted: *24 May 2017* Published: *08 June 2017*

#### Citation:

*Buzdin AA, Prassolov V and Garazha AV (2017) Friends-Enemies: Endogenous Retroviruses Are Major Transcriptional Regulators of Human DNA. Front. Chem. 5:35. doi: 10.3389/fchem.2017.00035*

Older HERVs have accumulated more mutations, including indels, and thus are more fragmented than the evolutionarily young elements. For example, the MER-41-int element located at position chr1:26,952,949-26,962,938 (hg19 assembly) is broken into four fragments in the genome, but biologically it was a single HERV.
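The fragmentation described here matters for any analysis that counts HERV elements, because one biological insertion can appear as several annotation records. The following is a minimal sketch of how neighboring same-name fragments might be stitched back together. It is not the authors' procedure: the function name `merge_fragments`, the `max_gap` threshold, and the three internal breakpoints in `EXAMPLE_FRAGMENTS` are hypothetical and chosen only for illustration. Only the outer coordinates of the locus come from the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Fragment = Tuple[str, int, int, str]  # (chromosome, start, end, repeat name)

# Outer boundaries are the locus quoted in the text (hg19).
# The internal breakpoints are HYPOTHETICAL, invented for this illustration.
EXAMPLE_FRAGMENTS: List[Fragment] = [
    ("chr1", 26952949, 26954500, "MER-41-int"),
    ("chr1", 26955100, 26957800, "MER-41-int"),
    ("chr1", 26958300, 26960000, "MER-41-int"),
    ("chr1", 26960900, 26962938, "MER-41-int"),
]


def merge_fragments(fragments: List[Fragment], max_gap: int) -> List[Fragment]:
    """Join same-name fragments on one chromosome separated by at most max_gap bp."""
    groups: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end, name in fragments:
        groups[(chrom, name)].append((start, end))

    merged: List[Fragment] = []
    for (chrom, name), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:
                # Close enough to the current element: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap too large: emit the current element and start a new one.
                merged.append((chrom, cur_start, cur_end, name))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, name))
    return merged


if __name__ == "__main__":
    for element in merge_fragments(EXAMPLE_FRAGMENTS, max_gap=1000):
        print(element)
```

With a permissive gap threshold the four example records collapse into a single element spanning the quoted locus, whereas a strict threshold leaves them as four independent records. The choice of threshold is therefore an analysis decision that directly affects element counts.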

Many families of HERVs are highly transcriptionally active in human tissues (Buzdin et al., 2006a; Maliniemi et al., 2013). Genomic copies of HERVs are of particular interest because, in addition to viral genes, they also carry various regulatory sequences concentrated in their long terminal repeats (LTRs), DNA fragments about 1 kb long that flank the "body" of an element (**Figure 1**). The LTRs serve as promoters (Buzdin et al., 2006a), enhancers (Chuong et al., 2013; Suntsova et al., 2013), polyadenylation signals (Suntsova et al., 2015), chromatin folding reshapers (Schumann et al., 2010), and binding sites for various nuclear proteins (Young et al., 2013). Importantly, most HERVs reside in the human genome as solitary LTRs that arose through homologous recombination between the 5′ and 3′ flanking LTRs of the same full-length element (Hughes and Coffin, 2004). In turn, further recombination between different HERVs may cause genomic instability (Trombetta et al., 2016). For example, this mechanism may be responsible for at least 78 copy number variation cases encompassing known human genes (Campbell et al., 2014).

Most newly inserted HERVs harbor functional retroviral genes, such as those encoding the reverse transcriptase/integrase, the structural polyprotein Gag and the envelope polyprotein Env, and the canonical function of an LTR is the regulation of retroviral expression. However, the LTRs may also drive the transcription of closely located genomic sequences and human genes (Buzdin et al., 2006a). In this minireview, we focus on the regulatory function of HERVs, which have donated a multitude of functional sequences to the human genome.

### STRUCTURE OF LTR AND BINDING OF NUCLEAR PROTEINS

Most HERVs exist in the form of solitary LTRs. The LTRs include promoter elements, enhancers, transcription factor binding sites, splice sites, and polyadenylation signals, and are thought to serve as the major transcriptional regulators of HERVs. LTRs specifically bind host cell nuclear proteins (Trubetskoy et al., 2002) and serve in the following five pathways of human transcriptional regulation: (i) LTRs may have enhancer/repressor activities (Domansky et al., 2000; Hughes and Coffin, 2004; Ruda et al., 2004; Suntsova et al., 2013); (ii) LTRs may act as promoters; (iii) LTRs may provide polyadenylation sites to terminate read-through transcripts; (iv) LTRs may provide splice sites; (v) LTRs may regulate host genes by RNA interference (Gogvadze et al., 2009).

Mapping DNaseI hypersensitivity sites (DHS) is the method of choice for the high-throughput identification of regulatory genomic regions. Similarly, transcription factor binding sites (TFBS) denote fragments of DNA with nuclear protein binding capacities (Ho et al., 2012). We combined investigation of both DHS and TFBS content of HERVs on a genomic scale (Garazha et al., 2015). To this end, we annotated all the genomic copies of HERVs and devised a bioinformatic algorithm mapping relevant TFBS and DHS features. For the entire set of HERVs, ∼140,000 individual inserts (∼19%) had at least one DHS and ∼110,000 inserts (∼15%) had at least one TFBS. In total, ∼155,000 HERV-related DHS and ∼320,000 HERV-related TFBS were identified (Garazha et al., 2015). This provides direct evidence for the potential involvement of HERVs in the regulation of thousands of human genes. This is also in line with the previous finding that ∼30% of all p53 binding sites localized by chromatin immunoprecipitation in the human genome fall within HERV elements (Wang et al., 2007). Finally, as many as ∼31.4% of all human transcription start sites were mapped within various transposable elements, including HERVs (Faulkner et al., 2009).
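As a quick consistency check, the rounded counts quoted above can be related to one another with a few lines of arithmetic. The sketch below assumes that the percentages refer to the 717,778 RepeatMasker fragments mentioned earlier in the text. That denominator is an assumption made here, and all inputs are the approximate figures from the text rather than the underlying data set.

```python
# Approximate values quoted in the text (Garazha et al., 2015; RepeatMasker, hg19).
TOTAL_FRAGMENTS = 717_778   # assumed denominator for the percentages
INSERTS_WITH_DHS = 140_000  # inserts with at least one DHS
INSERTS_WITH_TFBS = 110_000 # inserts with at least one TFBS
TOTAL_DHS = 155_000         # HERV-related DHS
TOTAL_TFBS = 320_000        # HERV-related TFBS


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


dhs_pct = percent(INSERTS_WITH_DHS, TOTAL_FRAGMENTS)
tfbs_pct = percent(INSERTS_WITH_TFBS, TOTAL_FRAGMENTS)

# Mean number of sites per insert that carries at least one site.
dhs_per_positive = TOTAL_DHS / INSERTS_WITH_DHS
tfbs_per_positive = TOTAL_TFBS / INSERTS_WITH_TFBS

if __name__ == "__main__":
    print(f"Inserts with >=1 DHS:  {dhs_pct:.1f}%")
    print(f"Inserts with >=1 TFBS: {tfbs_pct:.1f}%")
    print(f"DHS per DHS-positive insert:   {dhs_per_positive:.1f}")
    print(f"TFBS per TFBS-positive insert: {tfbs_per_positive:.1f}")
```

Under this assumption the shares come out close to the quoted ∼19% and ∼15%, and a TFBS-positive insert carries roughly three mapped TFBS on average, while a DHS-positive insert carries close to one DHS.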

All 504 known HERV groups were characterized with regard to their TFBS content and showed very different results (available at http://herv.pparser.net/TotalStatistic.php). The families differed in their copy number, ranging from several copies for HERV-F to more than 22,000 members for the THE1B family. The total number of TFBS was also strikingly different, from zero (LTR5, LTR7A) to ∼13,000 (MLT1K). The densities of TFBS also varied among the families. It is also important to quantify the absolute numbers of TFBS in each family. For example, the LTR12 family had the biggest proportion of TFBS-positive members and donated ∼1,300 TFBS to the human DNA, whereas the family MLT1K contributed the greatest number of TFBS (∼13,000), but had a low proportion of TFBS-positive members. Interestingly, the TFBS and DHS tended to co-occur in the same HERV elements. The probability that a particular element had a DHS was proportional to the number of TFBS mapped within it (Garazha et al., 2015). Although these findings provide clues for the identification and functional annotation of multiple previously unknown human regulatory sequences, they are most likely still an underestimation of the HERV-generated TFBS pool. In many cases, the repetitive nature of HERVs did not allow TFBS or DHS to be attributed directly to any particular HERV element (Garazha et al., 2015).

Importantly, all the interrogated transcription factors had TFBS in the HERVs. This can explain the extremely diverse and sometimes strongly tissue-specific influence of the different HERVs on gene expression. For example, the LTR of the most recent HERV family, HERV-K (HML-2), which contains many human-specific and even polymorphic members, shows very high promoter and enhancer activities in human germ cells and the corresponding tumors (seminoma), while being transcriptionally silent in other tissues (Domansky et al., 2000; Ruda et al., 2004). The promoter activity of the HERV-K (HML-2) inserts also provided the first evidence for human-specific antisense regulation of gene expression (Gogvadze et al., 2009). The human-specific LTRs located in the introns of the genes SLC4A8 and IFT172 (for sodium bicarbonate cotransporter and intraflagellar transport protein 172, respectively) can generate transcripts in vivo that are reverse-complementary to the exons of those genes. Overexpression of the antisense transcripts resulted in an approximately three- to fourfold decrease in mRNA levels for these genes (Gogvadze et al., 2009).

The HERVs can also provide polyadenylation signals for the regulation of gene expression. For example, the mRNA for an 8-kDa human protein similar to transcription factor GON4L is polyadenylated using the HERV-K (HML-2) LTR sequence (Baust et al., 2000). Another human transcription factor gene, ZNF195, utilizes the HERV-F LTR as an alternative polyadenylation site (Kjellman et al., 1999).

### FUNCTIONAL INTERPLAY OF HERVs AND HUMAN GENOME

Expression of HERVs is tightly controlled by the host cell because it may be deleterious. Even the physical presence of the repetitive sequences in the genome can generate genomic instability due to homologous recombination between the HERV elements. HERVs can bias normal gene regulatory networks (Suntsova et al., 2015; reviewed by Rebollo et al., 2012). Expression of HERV proteins may result in dangerous inflammatory or immunosuppressive effects (Cho et al., 2008). In mammals, endogenous retroviruses are transcriptionally repressed by KRAB domain zinc finger proteins and their cofactor TRIM28, which recruit methylation machinery to HERV copies (Turelli et al., 2014). In embryonic cells, the zinc finger protein Yin Yang 1 may serve as another repressor of HERV transcription by suppressing promoter activities of the LTRs (Schlesinger et al., 2013). Besides DNA methylation, histone modification is considered an alternative mechanism of endogenous retroviral repression in embryonic stem cells, involving the proteins SETDB1 (the methyltransferase responsible for H3K9 trimethylation) and the H3K4 demethylase LSD1/KDM1A (reviewed by Rebollo et al., 2012).

The APOBEC3 protein family plays a further role in the suppression of HERVs and retroviruses. APOBEC3G (hA3G) inhibits retroviruses by entering viral particles and inducing hypermutation of the viral genome during reverse transcription, leading to G-to-A substitutions (Bae and Jung, 2014). In concert, the protein hA3F induces viral hypermutation by deaminating the minus strand of viral cDNA during reverse transcription (Bae and Jung, 2014). Taken together, these factors induce epigenetic silencing and hypermutation of HERVs. Indeed, the LTRs have a higher mutation rate than the rest of the non-coding fraction of the human genome (Romano et al., 2006).
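The link between minus-strand cytosine deamination and G-to-A substitutions on the coding (plus) strand can be made concrete with a toy model. The sketch below is purely illustrative: the function names are hypothetical, and it assumes that every cytosine on the minus strand is deaminated, whereas real editing affects only a subset of sites. Deaminated cytosine (uracil) is written as T because it is read as T when the plus strand is synthesized.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def deaminate_minus_strand(plus_strand: str) -> str:
    """Toy model: deaminate ALL minus-strand cytosines and return the new plus strand.

    Assumption for illustration only: complete editing of every cytosine.
    """
    minus = reverse_complement(plus_strand)
    edited_minus = minus.replace("C", "T")  # C -> U, read as T
    return reverse_complement(edited_minus)


def g_to_a_positions(before: str, after: str) -> List[int]:
    """Return 0-based plus-strand positions where G was replaced by A."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if b == "G" and a == "A"]


if __name__ == "__main__":
    original = "ATGGCG"  # arbitrary example sequence
    mutated = deaminate_minus_strand(original)
    print(original, "->", mutated)
    print("G-to-A positions:", g_to_a_positions(original, mutated))
```

In this model each plus-strand G pairs with a minus-strand C, so deaminating the minus strand changes only G positions on the plus strand and leaves A, C, and T untouched.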

Conversely, the content of TFBS among the HERVs decreases with their evolutionary age (Garazha et al., 2015). For the heavily mutated, highly diverged (>20%) HERV elements, this content is approximately six-fold lower compared to the evolutionarily youngest elements. This observation may suggest that genomic "domestication" of HERVs involved reformatting of the active TFBS profiles and their further "standardization" upon accumulation of mutations, until they equilibrate with the rest of the non-coding DNA (Garazha et al., 2015). However, this type of analysis can be biased by the higher fragmentation of the evolutionarily older HERVs, because each fragment is considered an independent element. Further studies are, therefore, needed to explore the TFBS accumulation trends in linkage to the evolutionary dynamics of the human genome.

Sometimes co-evolution with the human genome resulted in the recruitment of certain HERV regulatory modules by the host organism (**Table 1**). The best-known example is the acquisition of salivary expression of the carbohydrate digestive enzyme amylase from a HERV element inserted in the common ancestor of great apes (Ting et al., 1992).

TABLE 1 | Implication of HERV transcriptional regulation in human physiology and pathology (selected examples).

On the other hand, HERV-H is a family expressed preferentially in human embryonic stem cells. Surprisingly, the HERV-H LTRs appeared to be the primary mediators of cell fate reprogramming using the well-known "Yamanaka cocktail" (overexpression of the OCT3/4, SOX2, and KLF4 proteins), due to regulatory HERV-H-driven intergenic non-coding RNAs that help to recruit the transcriptional activator genes by serving as a scaffold (Ohnuki et al., 2014). Another human long non-coding RNA (human pluripotency-associated transcript 5, HPAT5), derived from both the HERV element HUERS-P1 and an Alu retrotransposon, was shown to promote pluripotency by functioning as a molecular sponge for the let-7 family of microRNAs (Durruthy-Durruthy et al., 2015; Chuong et al., 2017).

The element MER39 forms an endometrium-specific promoter that regulates expression of Prolactin during pregnancy (Emera et al., 2012). The developmental switch from fetal to adult beta-globin gene expression in humans is controlled by a copy of the HERV9 element (Tuan and Pi, 2014). In the hippocampus, transcription of the PRODH gene is regulated in a human-specific manner by a HERV-K (HML-2) LTR (Suntsova et al., 2013). PRODH metabolizes neuromediator molecules and is strongly implicated in higher nervous activity and neurological disorders, and its deregulation might have had an important impact on human evolution (Suntsova et al., 2013).

## HERV-MEDIATED REGULATION OF GENE EXPRESSION IN PATHOLOGY

### Proliferative Disorders

Recent findings indicate that HERV-mediated control of gene expression may be involved in various human diseases including cancer (Kassiotis, 2014). The role of HERVs in cancer is most likely limited to regulation of gene expression (Hohn et al., 2013). The data from cancer genome sequencing identified over 180 somatic insertions caused by LINE-1 retrotransposon activity, vs. only a single integration of a short HERV fragment, most likely replicated due to a microhomology-mediated DNA repair mechanism (Lee et al., 2012). Many HERVs are abnormally expressed in cancer. For instance, HERV-K (HML-2) elements are overexpressed up to ∼3,000-fold in germ cell tumors and in melanoma (Buzdin et al., 2006b; Schmitt et al., 2013). Upregulation of HERVs can be mediated either by a biased content of specific transcription factors or by disruption of the anti-retroviral suppression mechanisms, such as aberrant demethylation (Conti et al., 2016) and decreased expression of APOBEC3 proteins (Shepelin et al., 2016). HERVs, in turn, may promote cellular transformation by regulating downstream human genes. For example, a demethylated copy of a MaLR LTR can act as an alternative promoter to transcriptionally derepress the gene CSF1R, encoding colony stimulating factor-1 receptor, which is linked with survival of Hodgkin's lymphoma cells (Lamprecht et al., 2010). More examples can be found in other dedicated reviews (Babaian and Mager, 2016; Gonzalez-Cao et al., 2016; Anwar et al., 2017).

### Infectious Diseases

The evolution of human pathogens might generate mechanisms involving transcriptional interactions of endogenous and exogenous retroviruses. For example, in HIV-infected patients, the HERV-K (HML-2) proviruses are expressed in peripheral blood mononuclear cells at higher levels compared to non-infected individuals (Bhardwaj et al., 2014). Antibodies against the HERV-K (HML-2) Env protein in blood were proposed as a new biomarker of HIV-1 infection, because HIV-1 can upregulate expression of a fully N-glycosylated HERV-K (HML-2) envelope protein on the cell surface (Michaud et al., 2014). Moreover, HERV-K (HML-2)-specific T-cells from HIV-1 infected patients completely eliminated, in vitro, human cells infected with a panel of globally diverse HIV isolates. The mechanism of HIV-1 induced activation of human transposable elements possibly involves the activity of the HIV-1 Tat protein (Jones et al., 2013). Recent studies showed that, out of 91 annotated HERV-K (HML-2) proviruses, Tat activated expression of 26 proviruses, silenced 12, and did not change the expression of the others (Gonzalez-Hernandez et al., 2014). In addition, HIV infection may cause transactivation of HERV-W elements with their Env genes and Syncytin (Uleri et al., 2014). However, conflicting data were reported on the presence of HERV-K (HML-2) viral particles in the plasma of HIV-infected patients: higher levels of HERV-K (HML-2) RNA were detected in HIV patients from Uganda, but not from the USA (Li et al., 2013). Of note, a recent association study showed that susceptibility to infection with varicella zoster virus is linked with the non-coding gene HLA Complex P5 in the major histocompatibility complex. This gene is a copy of an endogenous retrovirus that may have the potential to suppress viral activity through indirect regulatory mechanisms. In previous studies, particular genetic variants of this region were associated with a delay in the development of AIDS in HIV-infected individuals (Crosslin et al., 2015).

### Autoimmunity

The biased expression of HERVs is considered one of the triggers of autoimmune disorders (Suntsova et al., 2015), as evidenced by increased proviral RNA levels (Ehlhardt et al., 2006) and anti-HERV protein antibodies in sera from several types of patients (Bannert and Kurth, 2004). Immune reactivity against ERV proteins can be experimentally induced in mice and non-human primates, indicating that immunological tolerance to endogenous retroviral products is not complete (Kassiotis, 2014). HERV overexpression may be linked with massive DNA hypomethylation, as seen for T-cells in systemic lupus erythematosus (SLE) patients (Wu et al., 2015).

Compared to normal controls, patients with rheumatoid arthritis showed an increased antibody response against the HERV-K10 Gag protein (Nelson et al., 2014). HERV-W transcripts and protein isoforms of Syncytin were overexpressed in cartilage of osteoarthritis patients (Bendiksen et al., 2014). In osteoarthritis, the patient's individual disease severity index was correlated with the expression of HERV-K18 provirus (Garcia-Montojo et al., 2013). However, inflammatory diseases may also be associated with decreased expression of HERVs (**Table 1**).

### Neurological Diseases

Expression of HERVs may serve as a biomarker for various neurological diseases (**Table 1**). For example, HERV expression may be inducible in human astrocytes and neurons under inflammatory conditions in an IFNγ-dependent manner (Manghera et al., 2015). For multiple sclerosis (MS), a hypothesis was proposed that HERV-encoded proteins can act as powerful immune stimulators inducing disease progression following neurodegeneration (Libbey et al., 2014). Indeed, genetic variants in some genes restricting retroviral infections were statistically linked with the risk of developing MS, as shown for the TRIM5, TRIM22, and BST2 genes (Nexo et al., 2013).

Abnormally high levels of the HERV-W Env gene product were detected in the plasma of patients with schizophrenia and bipolar disorder (Diem et al., 2012), in active lesions in multiple sclerosis (van Horssen et al., 2016), and in biopsies from chronic inflammatory demyelinating polyradiculoneuropathies (Faucard et al., 2016). The increased expression of the endogenous HERV-K (HML-2) proviral Env gene, in turn, may contribute to the development of amyotrophic lateral sclerosis by inducing neurodegeneration (Li et al., 2015). Finally, HERVs may also cause neurological disorders due to HERV-linked genomic rearrangements (**Table 1**). The human-specific enhancer activity of a HERV-K (HML-2) provirus on the schizophrenia-associated gene PRODH may be another active mechanism of HERV involvement in schizophrenia (Suntsova et al., 2013). Recently, a link was discovered between schizophrenia risk and the complement C4 system (Sekar et al., 2016). Individuals having a polymorphic HERV intronic insertion have elevated C4 expression, which in turn may cause neuronal synapse over-pruning, a phenotype that is associated with schizophrenia. Although this evidence is still indirect, this case is intriguing in light of previous observations of an association between schizophrenia and elevated ERV transcriptional activity (Chuong et al., 2017).

### CONCLUSIONS

Taken together, these findings suggest that at the early stages after insertion, the HERV is treated by the host cells as a foreign genetic element, and is likely to be suppressed by targeted methylation and mutations. However, at the later stages, when a significant number of mutations has already accumulated and the retroviral genes are broken, the regulatory potential of a HERV may be released and recruited to modify the genomic balance of transcription factor binding sites. This process goes together with further accumulation and selection of mutations, which reshape the regulatory landscape of the human DNA. However, developmental reprogramming, stress or pathological conditions like cancer, inflammation and infectious diseases can remove the blocks limiting expression and HERV-mediated host gene regulation. This, in turn, can dramatically alter the gene expression equilibrium and shift it to a newer state, thus further exacerbating the stressful or unstable situation.

### AUTHOR CONTRIBUTIONS

All authors listed have made substantial, direct and intellectual contributions to the work, and approved it for publication.

### FUNDING

This work was supported by the Russian Scientific Foundation grant 14-14-00060.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Buzdin, Prassolov and Garazha. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.