# THE PAST AND THE FUTURE OF HUMAN IMMUNITY UNDER VIRAL EVOLUTIONARY PRESSURE

EDITED BY : Gkikas Magiorkinis and Tara P. Hurst PUBLISHED IN : Frontiers in Immunology and Frontiers in Microbiology

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-229-9 DOI 10.3389/978-2-88963-229-9


# THE PAST AND THE FUTURE OF HUMAN IMMUNITY UNDER VIRAL EVOLUTIONARY PRESSURE

Topic Editors:

Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece Tara P. Hurst, Birmingham City University, United Kingdom

Image: "Black and Red Queen evolutionary dynamics" by Gkikas Magiorkinis, licensed under CC-BY.

There is a long-standing evolutionary battle between viruses and their hosts that continues to be waged. Evidence of this conflict can be found on both sides: the human immune system responds to new viral challenges, and viruses have developed often sophisticated countermeasures. The "arms race" between viruses and hosts can be thought of as an example of the "Red Queen" race, an evolutionary hypothesis inspired by the dialogue between Alice and the Red Queen in Lewis Carroll's "Through the Looking-Glass". At the same time, viruses have minimal genomic content, as they have evolved to hitchhike on the biological machinery of their hosts (or of other co-infecting viruses). The minimalistic viral genome can be thought of as the result of "Black Queen" evolution, a theory inspired by the card game Hearts, where the winner is the one with the fewest points at the end.

The effects of this arms race are evident in the evolution of the human immune system. This system is capable of responding to diverse viral challenges, utilizing both the ancient innate immune system and the more recently evolved adaptive immune system of jawed vertebrates. It is now well known that the two systems are linked, with innate immunity hypothesized to have provided raw material for the emergence of the adaptive immune response. The adaptive immune response comprises several protein families (including B and T cell receptors, MHC and KIR proteins, for example) that are encoded by complex and variable genomic regions. This complexity enables responsive genetic changes to occur in immune cells, such as the ability of genomic hypervariable regions in B cells to recombine in order to produce more specific antibodies. Indeed, the human immune system is thought to be continually evolving via various mechanisms, such as changes in the genes encoding immune receptors and in the regulatory sequences that control their expression. For example, there is some evidence that exogenous viral infections can alter the expression of endogenous retroviruses, some of which contribute to the immune response.

Viral countermeasures can include encoding decoy receptors for the signalling molecules of the immune response, altering the gene expression of adaptive immune cells during chronic infection, or using host enzymes to facilitate viral immune escape. As the articles herein show, the immune system continues to be challenged by viral infections, and these challenges continue to shape how the immune system combats pathogens. Viruses and human immunity are thus continuously part of "Red and Black Queen" evolutionary dynamics.

We had the pleasure of working with Jonas Blomberg as a reviewer during the course of the Research Topic and his untimely passing was a great loss. Prof. Blomberg made significant contributions, including to the nomenclature of endogenous retroviruses (ERVs), the evolution and characterization of specific human ERVs (HERVs) and the contribution of ERVs to diseases such as cancer. It is with great respect for his contributions to the ERV field that we dedicate this eBook to his memory.

Citation: Magiorkinis, G., Hurst, T. P., eds. (2019). The Past and the Future of Human Immunity Under Viral Evolutionary Pressure. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-229-9

# Table of Contents

*Editorial: The Past and the Future of Human Immunity Under Viral Evolutionary Pressure*

Tara Patricia Hurst and Gkikas Magiorkinis

Faezeh Borzooee, Krista D. Joris, Michael D. Grant and Mani Larijani

Patrick Gemmell, Jotun Hein and Aris Katzourakis

*Related Endogenous Retrovirus-K Elements Harbor Distinct Protease Active Site Motifs*

Matthew G. Turnbull and Renée N. Douville

Emmanuel Atangana Maze, Claire Ham, Jack Kelly, Lindsay Ussher, Neil Almond, Greg J. Towers, Neil Berry and Robert Belshaw

*Pockets of HIV Non-infection Within Highly-Infected Risk Networks in Athens, Greece*

Leslie D. Williams, Evangelia-Georgia Kostaki, Eirini Pavlitina, Dimitrios Paraskevis, Angelos Hatzakis, John Schneider, Pavlo Smyrnov, Andria Hadjikou, Georgios K. Nikolopoulos, Mina Psichogiou and Samuel R. Friedman

*Immunomodulatory Function of HBeAg Related to Short-Sighted Evolution, Transmissibility, and Clinical Manifestation of Hepatitis B Virus*

Anna Kramvis, Evangelia-Georgia Kostaki, Angelos Hatzakis and Dimitrios Paraskevis

*Hepatitis B Virus Adaptation to the CD8+ T Cell Response: Consequences for Host and Pathogen*

Sheila F. Lumley, Anna L. McNaughton, Paul Klenerman, Katrina A. Lythgoe and Philippa C. Matthews

*Impact of Interferon-α Receptor-1 Promoter Polymorphisms on the Transcriptome of the Hepatitis B Virus-Associated Hepatocellular Carcinoma*

Timokratis Karamitros, George Papatheodoridis, Dimitrios Paraskevis, Angelos Hatzakis, Jean L. Mbisa, Urania Georgopoulou, Paul Klenerman and Gkikas Magiorkinis

*Evolution of Two Major Zika Virus Lineages: Implications for Pathology, Immune Response, and Vaccine Development*

Jacob T. Beaver, Nadia Lelutiu, Rumi Habib and Ioanna Skountzou

# Editorial: The Past and the Future of Human Immunity Under Viral Evolutionary Pressure

#### Tara Patricia Hurst <sup>1</sup> \* and Gkikas Magiorkinis <sup>2</sup> \*

*<sup>1</sup> Department of Life Sciences, School of Health Sciences, Birmingham City University, Birmingham, United Kingdom, <sup>2</sup> Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, Athens, Greece*

Keywords: human viruses, endogenous retrovirus, immunity, evolution, disease

**Editorial on the Research Topic**

#### **The Past and the Future of Human Immunity Under Viral Evolutionary Pressure**

"Nothing in biology makes sense except in the light of evolution," argued Theodosius Dobzhansky in 1973. More than 100 years after Charles Darwin's Origin of Species, it was increasingly realized that the theory of evolution can help us disentangle biological mechanisms. Here we asked contributors to see their antiviral research interests "through the looking glass" of evolution and to consider how evolutionary theory may help us understand human antiviral immunity. Thus, our special topic explored the scope of the evolution of human antiviral immunity, whether this included short-term effects or long-term impact. We specifically asked contributors to provide thought-provoking manuscripts that could fall within the scope of the coevolution of viruses and the human immune response. They submitted articles discussing the ongoing dynamics of co-existing with ancient endogenous retroviruses, as well as the pressures exerted by extant exogenous viruses such as influenza virus and human immunodeficiency virus (HIV-1). Our understanding of antiviral immunity was broadened by papers exploring mechanisms such as T cell function and RNA structural components.

In a Hypothesis and Theory article, Marchi et al. explore the effects of persistent infection on T cell responses by analyzing gene expression profiles and gene networks in publicly-available data sets. Importantly, they identified T-box 21 (Tbx21)-mediated gene expression as a hallmark of inflation but not exhaustion of memory CD8+ T cells. This finding has potential to shape our understanding of the genes that define memory T-cell populations. Further, this shows how different viral infections can influence either exhaustion or inflation. This highlights the adaptability of the immune response to viral infection but also how infection could alter gene expression to shift T cell phenotypes.

In one of two mini-reviews in the topic, Smyth et al. discuss the contribution of RNA structure to antiviral immunity. RNA is much more than an intermediary between DNA and protein; it can adopt complex three-dimensional structures that may be altered by single nucleotide mutations, with consequent effects on its functions. Thus, since RNA structures are critical to the replication of many viruses, they are also substrates upon which evolution can act. This can include non-coding RNAs produced during replication, as well as genomic RNA and transcripts. Viral RNA is sensed by different pattern recognition receptors (PRRs) of the innate immune response, such as Toll-like receptor 3 (TLR3) and retinoic acid inducible gene I (RIG-I). There is thus pressure on viral RNA to evade immune detection, such as by nucleotide and codon usage bias (adenosine by HIV-1), the presence of specific secondary structures that restrict transcription to low levels, or the maintenance of single-stranded RNA regions to avoid PRR detection. In turn, viral RNA structures have influenced the human immune system, such as the evolution of the human leukocyte antigen (HLA) locus.

#### Edited and reviewed by:

*Ian Marriott, University of North Carolina at Charlotte, United States*

#### \*Correspondence:

*Tara Patricia Hurst tara.hurst@bcu.ac.uk Gkikas Magiorkinis gkikasmag@gmail.com*

#### Specialty section:

*This article was submitted to Microbial Immunology, a section of the journal Frontiers in Immunology*

Received: *16 August 2019* Accepted: *17 September 2019* Published: *02 October 2019*

#### Citation:

*Hurst TP and Magiorkinis G (2019) Editorial: The Past and the Future of Human Immunity Under Viral Evolutionary Pressure. Front. Immunol. 10:2340. doi: 10.3389/fimmu.2019.02340*


The co-evolution of influenza virus and host immunity is discussed in the second mini-review (Voskarides et al.). The authors argue that this is an example of antagonistic evolution, with influenza virus as predator and the human host as prey. This conflict manifests in the genetic variability of influenza viruses, a result of the absence of proof-reading by the RNA polymerase, which continually challenges the host immune system. In response, the human host has genetic major histocompatibility complex (MHC) diversity, which facilitates T cell-mediated immunity to diverse pathogens.

APOBEC3G (apolipoprotein B mRNA editing enzyme catalytic polypeptide-like 3G, A3G) is a cellular cytidine deaminase that mutates retroviral genomes, making it a key host cell mechanism for debilitating viruses that embed in the genome. In an original article herein, the authors propose that the action of A3G extends to adaptive immunity, with HIV-1 cytotoxic T cell (CTL) escape mutants being generated through A3G activity (Borzooee et al.). The authors use a bioinformatics approach to simulate A3G mutations and show that the resulting HIV-1 mutants are biased toward CTL escape. These results point toward an evolutionary adaptation of the HIV-1 genome that essentially "hijacks" human antiretroviral activity to counteract T-cell immunity.
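To make the idea of simulating A3G mutations concrete, the following is a minimal sketch and not the method used by Borzooee et al. It assumes the commonly reported A3G signature of G-to-A changes on the viral plus strand in a 5'-GG-3' context, and it only asks whether a nucleotide change falls inside a chosen window (a stand-in for an epitope-coding region). The function names, the mutation rate parameter and the toy sequence are all hypothetical.

```python
import random


def a3g_target_sites(seq):
    """Plus-strand positions of a G followed by another G (assumed A3G context)."""
    return [i for i in range(len(seq) - 1) if seq[i] == "G" and seq[i + 1] == "G"]


def simulate_a3g(seq, rate, rng):
    """Turn each target G into A with probability `rate` (hypothetical parameter)."""
    bases = list(seq)
    for i in a3g_target_sites(seq):
        if rng.random() < rate:
            bases[i] = "A"
    return "".join(bases)


def fraction_window_hit(seq, start, end, rate, trials, seed=0):
    """Fraction of simulated mutants whose window seq[start:end] was altered."""
    rng = random.Random(seed)
    original = seq[start:end]
    hits = 0
    for _ in range(trials):
        mutant = simulate_a3g(seq, rate, rng)
        if mutant[start:end] != original:
            hits += 1
    return hits / trials
```

A fuller analysis would translate the mutants, keep only non-synonymous changes and score their effect on HLA binding or CTL recognition; none of that is attempted here.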

A cluster of papers within the topic discuss the role of endogenous retroviruses (ERVs) in the immune response. This includes an up-to-date review on the shaping of the immune response by human ERVs (HERVs) (Grandi and Tramontano). Further, four original research articles explore diverse aspects of ERVs and immunity. Firstly, ERV sequences within the genome are a potential source of genetic diversity that is known to be co-opted by the host. This has been shown most recently to include HERV-H sequences contributing to stem cell identity. Gemmell et al. explore the nature of HERV-H loci and their transcription further, showing positive correlations between particular long terminal repeats (LTRs) and cell types (type II LTRs and embryonic cells).

Secondly, Turnbull and Douville analyzed HERV-K protease sequences within the human genome. They show that transcription of HERV-K proteases varies according to the functional motifs, with two motifs (DTGAD, DTGVD) being more abundant among all HERV-K integrations, but also more active, compared to the less common motifs, in diseases such as amyotrophic lateral sclerosis and breast cancer. These results are important not only for understanding the interactions among HERV-K integrations evolutionarily, but also to appreciate the diversity of enzymatic targets when designing anti-HERV based treatments.
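As a rough illustration of how loci could be grouped by protease active site motif, the sketch below tallies motifs in a set of protein sequences. It is not the pipeline used by Turnbull and Douville: only the motifs DTGAD and DTGVD come from the text above, while the generic `DTG.D` pattern, the function name and the input format are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed generic pattern: D-T-G, any residue, D (covers DTGAD and DTGVD).
ACTIVE_SITE = re.compile(r"DTG[A-Z]D")


def tally_active_site_motifs(proteins):
    """Count loci per first matching motif; `proteins` maps locus name to sequence."""
    counts = Counter()
    for sequence in proteins.values():
        match = ACTIVE_SITE.search(sequence.upper())
        counts[match.group(0) if match else "none"] += 1
    return counts
```

Called on a dictionary of translated protease sequences, the function returns a `Counter` keyed by motif, with loci lacking any match counted under `"none"`.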

Thirdly, the effect of the histone deacetylase (HDAC) inhibitor vorinostat on HERV expression in CD4+ T cells was analyzed using RNA-Seq (White et al.). Vorinostat is one of the latency-reversing agents being examined in HIV-1 shock and kill strategies. The study found that over 2000 HERV elements were modulated by vorinostat, notably with downregulation of ERVL and upregulation of HERV-9 elements, the latter of which was confirmed by digital droplet PCR. This study identified three HERV-9 members (LTR12 elements) that were upregulated by vorinostat in a dose-dependent manner. The relevance of this finding lies in the transcriptional regulation of certain genes by LTR12 elements, including the pro-apoptotic genes TP3 and TNFRSF10B (White et al.). In agreement with our previous work using RT-qPCR, members of the HERV-K (HML-2), HERV-W and HERV-FRD families were not found to be upregulated by vorinostat (1). Notably, the study by White et al., unlike that of Hurst et al., used uninfected, primary CD4+ T cells to avoid confounding effects of HIV-1 infection.

Finally, ERVs can be upregulated in response to acute exogenous viral infections and this may contribute to the antiviral response. In the paper by Maze et al. the expression of Papio cynocephalus Endogenous Retrovirus (PcEV) was found to be upregulated in macaque plasma and tissues (peripheral blood mononuclear cells (PBMCs) and spleen) in response to simian immunodeficiency virus (SIV) infection. Further, upregulation of the interferon (IFN)-stimulated gene (ISG), STAT1, an important marker of the early antiviral response, correlated with the increased PcEV RNA levels (Maze et al.). The authors posit a role for ERVs in the activation of the innate immune response.

The effect of the immunity-virus interface at the population level was examined by Williams et al., who present the perplexing phenomenon of a "pocket" of uninfected individuals in a connected network of recently HIV-1 infected individuals. This raises the question of how a cluster of connected HIV-1 exposed individuals could remain uninfected within an HIV-1 outbreak. The authors discuss potential explanations, including the possibility that long-term infections act as a "firewall" against the spread of infection to the full population. The latter could be explained by the existence of an epidemic equilibrium of co-existence where human-HIV co-evolution would be acting in the absence of treatment.

The evolutionary gameplay underlying the pathogenesis of the hepatitis B virus e antigen (HBeAg) is reviewed and explored by Kramvis et al. HBeAg, as the authors explain, is a tolerogen, an antigen that increases the tolerance of HBV infection by the human host. There is significant variability in HBeAg seroconversion and transmission efficiency among genotypes, which the authors suggest can be explained as a result of short-sighted evolution. The virus-host interplay of HBV is further explored by Lumley et al., who focus on the effects of CD8+ T cell responses. The authors review the evidence that CD8+ T cells play an important role in the chronicity of HBV infection as well as in disease outcomes such as cirrhosis and hepatocellular carcinoma (Lumley et al.). The role of interferon-α receptor-1 (IFNAR1) promoter polymorphisms in HBV infection is explored by Karamitros et al. with respect to the development of hepatocellular carcinoma. The authors show that the variable tandem repeat [VNTR: −77(GT)n] has a significant impact on the transcription profile of IFNAR1 and this may be relevant in cancer pathophysiology.

Finally, Beaver et al. have reviewed evidence of how Zika virus (ZIKV) evolution can explain the observed patterns of pathogenesis. The authors review evidence showing that the immune responses against African isolates differ from those observed against Asian ZIKV lineages, and discuss how this could explain observed pathogenic differences among ZIKV strains.

From ancient germline infections to modern epidemics, outbreaks or emerging infections, our contributors explored the evolutionary interplay of human immunity and viruses at the molecular, individual, or population level. They offered thought-provoking hypotheses for further exploration, reviewed publications, and provided novel results in the field of human antiviral immunity. This shows the power of integrating evolutionary theory with research on antiviral immunity, and we hope that our special topic will further promote evolutionary thinking across fields of human pathogen research.

### AUTHOR CONTRIBUTIONS

TH and GM wrote the draft and edited the final version of the manuscript.

#### REFERENCES

1. Hurst T, Pace M, Katzourakis A, Phillips R, Klenerman P, Frater J, et al. Human endogenous retrovirus (HERV) expression is not induced by treatment with the histone deacetylase (HDAC) inhibitors in cellular models of HIV-1 latency. Retrovirology. (2016) 13:10. doi: 10.1186/s12977-016-0242-4

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Hurst and Magiorkinis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Inflation vs. Exhaustion of Antiviral CD8+ T-Cell Populations in Persistent Infections: Two Sides of the Same Coin?

#### Emanuele Marchi <sup>1,2</sup> \*, Lian Ni Lee <sup>1,2</sup> and Paul Klenerman <sup>1,2</sup>

*<sup>1</sup> Peter Medawar Building for Pathogen Research, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom, <sup>2</sup> Translational Gastroenterology Unit, John Radcliffe Hospital, Oxford, United Kingdom*

Persistent virus infection can drive CD8+ T-cell responses which are markedly divergent in terms of frequency, phenotype, function, and distribution. On the one hand viruses such as Lymphocytic Choriomeningitis Virus (LCMV) Clone 13 can drive T-cell "exhaustion", associated with upregulation of checkpoint molecules, loss of effector functions, and diminished control of viral replication. On the other, low-level persistence of viruses such as Cytomegalovirus and Adenoviral vaccines can drive memory "inflation," associated with sustained populations of CD8+ T-cells over time, with maintained effector functions and a distinct phenotype. Underpinning these divergent memory pools are distinct transcriptional patterns—we aimed to compare these to explore the regulation of CD8+ T-cell memory against persistent viruses at the level of molecular networks and address whether dysregulation of specific modules may account for the phenotype observed. By exploring in parallel and also merging existing datasets derived from different investigators we attempted to develop a combined model of inflation vs. exhaustion and investigate the gene expression networks that are shared in these memory pools. In such comparisons, co-ordination of a critical module of genes driven by Tbx21 is markedly different between the two memory types. These exploratory data highlight both the molecular similarities as well as the differences between inflation and exhaustion and we hypothesize that co-ordinated regulation of a key genetic module may underpin the markedly different resultant functions and phenotypes *in vivo*—an idea which could be tested directly in future experiments.

Keywords: exhaustion, inflation, bioinformatics, LCMV (lymphocytic choriomeningitis virus), CMV (cytomegalovirus)

#### INTRODUCTION

CD8+ T cell responses play a critical role in the control of many virus infections. Lymphocytic Choriomeningitis Virus (LCMV) infection has been extensively modeled in the mouse, and the dynamics, specificity, and function of the antiviral CD8+ T cell responses are well understood. One feature of this infection is the development of CD8+ T cell exhaustion, first described in the pre-tetramer era as loss of function and finally deletion in the presence of persisting viruses such as DOCILE strains (1), and subsequently investigated at a molecular level using LCMV Clone 13, which shows similar features (2). Mapping the transcriptional underpinning of CD8+ T cell exhaustion was crucial in defining its mechanisms, key among which is the expression of checkpoint molecules such as PD-1 (3). These discoveries have been very influential in understanding both antiviral and anti-cancer responses and have driven the development of new checkpoint blockade therapies.

#### Edited by:

*Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece*

#### Reviewed by:

*Ian Humphreys, Cardiff University, United Kingdom Anastasia Samsonova, Saint Petersburg State University, Russia*

\*Correspondence: *Emanuele Marchi emanuele.marchi@ndm.ox.ac.uk*

#### Specialty section:

*This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology*

Received: *02 August 2018* Accepted: *23 January 2019* Published: *06 March 2019*

#### Citation:

*Marchi E, Lee LN and Klenerman P (2019) Inflation vs. Exhaustion of Antiviral CD8*+ *T-Cell Populations in Persistent Infections: Two Sides of the Same Coin? Front. Immunol. 10:197. doi: 10.3389/fimmu.2019.00197*

In contrast, persistent infection with cytomegaloviruses (CMVs)—human and murine CMV (MCMV)—is linked with the development of memory "inflation" (4). This is marked by the late expansion and maintenance of a number of CD8+ T cell pools directed at a subset of peptides (5). Their phenotype lacks the expression of checkpoint molecules, rather showing acquisition of markers of cellular differentiation over time. Importantly, in contrast to the exhausted phenotype, these cells retain strong effector functions. The transcriptional underpinning of this—and of the related model of memory inflation driven by adenoviral vectors (6)—has also been explored, and this is also highly distinct from the development of conventional "central" long-lived memory cells (7).

Both models are associated with viral persistence—high level viremia in the case of LCMV, and very low level local reactivation in the case of MCMV. How is the development of these two apparently very distinct forms of CD8+ T cell memory driven? To address this we aimed to compare the gene expression profiles and gene networks in these different settings. Such datasets are complex to generate and require highly reproducible and well-established models, coupled with adequate T cell numbers specific for individual epitopes. Thus, currently such data are valuable and remain an important resource to explore. Typically, even with newer techniques such as RNA-Seq, such studies only focus on immune responses to a single pathogen rather than comparing diverse pathogens.

To achieve the comparison we sought, distinct datasets generated from different platforms by two different laboratories were merged, and an integrated model created and subjected to validation. We explored whether two data sets generated from comparable experimental designs were sufficiently suitable to be merged, despite differences in array platforms, mouse suppliers and viruses, and whether this could provide some new insights into the relationship between these different T cell populations.

As outlined in **Figure 1A**, the data sets comparing memory Inflation against conventional memory (7) and CD8 T cell Exhaustion against conventional memory (8) include comparable or analogous samples, in which CD8 transcriptomes from early and later stages of viral infection were referred to Naïve cells. In our model, samples from these experiments were expected to fall into three broad categories: the common reference samples (Naïve CD8 T cells, the most comparable expression profiles), an intermediate phenotype (CD8 T cells from non-Inflating or non-Exhausted samples), and an Extreme phenotype (Inflating or Exhausted CD8 T cells, hence expected to diverge significantly from the rest of the samples and, hypothetically, from each other). The reliability of the integration could be evaluated from the expected distribution of expression profiles in exploratory analysis (i.e., principal component analysis, PCA), and further data mining could explore the validity of the model and of the hypotheses generated (**Figure 1A**).
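One simple way to picture how two data sets can be referred to a common Naïve baseline before merging is sketched below. This is an illustrative assumption, not the normalization used in the study (which is not described in this section): each data set is assumed to hold log2 expression values, every gene is centered on the mean of that data set's own Naïve samples, and only genes shared by both data sets are kept. The function names and the nested-dictionary layout are hypothetical.

```python
def center_on_naive(dataset, naive_samples):
    """Subtract the per-gene mean of the naive samples from every sample.

    `dataset` maps sample name -> {gene: log2 expression} (assumed layout).
    """
    genes = set.intersection(*(set(profile) for profile in dataset.values()))
    baseline = {
        g: sum(dataset[s][g] for s in naive_samples) / len(naive_samples)
        for g in genes
    }
    return {
        sample: {g: profile[g] - baseline[g] for g in genes}
        for sample, profile in dataset.items()
    }


def merge_on_shared_genes(centered_a, centered_b, label_a="A", label_b="B"):
    """Keep genes present in both data sets and return labelled expression rows."""
    genes_a = set(next(iter(centered_a.values())))
    genes_b = set(next(iter(centered_b.values())))
    shared = sorted(genes_a & genes_b)
    merged = {}
    for label, data in ((label_a, centered_a), (label_b, centered_b)):
        for sample, profile in data.items():
            merged[f"{label}:{sample}"] = [profile[g] for g in shared]
    return shared, merged
```

With this kind of centering the Naïve samples of both data sets sit at the origin by construction, so any remaining separation between platforms shows up in the non-Naïve samples, which is what an exploratory PCA would then inspect.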

Via such an integrative and comparative bioinformatics analysis of expression profiles from inflation and exhaustion murine models, we describe similarities and dissimilarities of the phenomena at a transcriptomic level. We observe some predicted and also some unexpected features, with some possible practical implications for future experimental design. The results presented here explore the use of such a workflow as one approach to integrate valuable existing datasets from different platforms. While data from such in silico comparisons cannot reach the quality and accuracy of those derived from a single unified in vivo experiment, there is an opportunity that existing, publicly available data can be studied further in order to address questions not originally anticipated at the time datasets were generated, and to help develop new ideas for the field.

### RESULTS AND DISCUSSION

### PCA of Expression Data From Inflation and Exhaustion Models Reveals Comparable Events Between the Phenomena

We first addressed the overall transcriptomic similarities between memory inflation and exhaustion by re-exploring a previously generated dataset, GSE73314 (7). This dataset was derived from mice infected with MCMV, tracking one inflationary response, M38, and one conventional memory response (M45) at both early (acute, d7) and late (memory, d50) timepoints.

Antigen-specific T cell populations were FACS-sorted following tetramer staining and gene expression was analyzed by microarray. In parallel, inflationary and conventional responses to beta-galactosidase expressed in a replication-deficient human Adenovirus serotype 5 construct (HuAd5-lacZ) were studied following HuAd5-lacZ immunization. Responses to the inflationary epitope βgal<sup>96</sup> (referred to as D8V in this paper) and the conventional epitope βgal<sup>497</sup> (referred to as I8V in this paper) were analyzed again at the peak acute (d21) and late memory (d100) time points. The analysis of this dataset has been previously described, including principal components analysis based on all informative genes or subsets of transcription factors, and importantly the close relationships between these expression profiles and those of human "inflationary" populations derived from studies of CMV were confirmed (7). **Figure 2A** reproduces data from that paper [previously published as Figure S4 in (7)] and depicts the first 3 principal components of these data prior to addition of the integrated exhaustion dataset. A clustering of data derived from the acute timepoint for the relevant tetramers [D8V d21 (blue), I8V d21 (yellow), M38 d7 (magenta), and M45 d7 (pink)] is observed. However, with reference to the naive (green) dataset, the inflationary populations at the late timepoints [D8V d100 (red), M38 d50 (turquoise)] sit slightly further segregated than the acute samples, while the conventional memory pools at the late timepoints (M45 d50) have shifted in the opposite direction, with some re-expression of profiles linked to resting naive cells. I8V at d100 are slightly divergent from M45 at d50, possibly reflecting subtle intrinsic differences (e.g., tissue tropism, antigen levels during infection) between the two infection models, especially as MCMV is a replicating virus with periodic episodes of viral reactivation while HuAd5 is a non-replicating vector which does not reactivate. Nonetheless, the expression profiles of both groups of conventional memory cells at the early phase of infection (M45 d7 and I8V d21) show reduced divergence, suggestive of a conserved response between acute and conventional cells at the molecular level.

samples (M38, D8V) and non-Inflating samples (M45, I8V), at acute stages (days 7 or 21) and later stages (days 50 or 100), and naive samples. (B) PCA of Exhausted/non-Exhausted CD8 T cells. 3D PCA showing distribution of transcription profiles of a model of Exhaustion (Cl13, tetrahedrons), with non-Exhaustive samples (Arm, spheres) at different stages, and naive samples. Stages: 6 days (yellow), 8 days (brown), 15 days (pink), 30 days (black), naive (green).

The same type of descriptive analysis was next performed on the exhaustion data set, GSE41867 (8). The data points were derived from acutely resolved lymphocytic choriomeningitis virus (LCMV) infection (Arm) or chronic LCMV infection (Arm Cl13) groups at days 6, 8, 15, and 30 post-infection and were generated on the Affymetrix platform. In this study, gene-expression profiles of exhausted CD8+ T-cells from mice infected with chronic LCMV (Arm Cl13) were compared with functional CD8+ cells from mice infected with the Armstrong strain of the virus. The analysis revealed a comparable layout of sample distribution (**Figure 2B**), with the relative distances between groups in good concordance with the experimental outcomes. Dysfunctional clone-13 specific CD8+ T-cells (tetrahedrons; 15 and 30 days, pink and black, respectively) appear to drift away at late time points and do not cluster around conventional CD8+ T-cells (spheres, pink and black). In contrast, as was previously observed in the inflation model, the gene expression profiles of cells at the early timepoints [day 6 (yellow) and day 8 (brown) post-infection] from acutely resolving (spheres) and chronic LCMV (tetrahedrons) infection are seen to cluster together.

Data set integration, when it is possible and successfully achieved, should allow more direct comparisons between samples generated in independent experiments. Gaining an alternative perspective on the differences or similarities between samples, without designing an entirely new experiment (which could become large, complex and very costly), could allow researchers to investigate new hypotheses and improve experimental design. To improve the comparative analysis of the two models, the two data sets were merged (schematic in **Figure 1A**) following a pipeline designed to reduce, as much as possible, batch effects from multiple sources that would otherwise confound accurate detection of gene expression signals (see Methods; schematic in **Figure 1B**).

**Figure 3** shows a 3D PCA of the two merged data sets [from Bolinger et al. (7) (B) and Doering et al. (8) (D)], generated with different microarray platforms (Illumina and Affymetrix), plotting together Inflating and Exhausting samples (in red) with their respective functional counterparts (in blue). Projections of the first three principal components capture most of the variability (>50%) of the total set of genes commonly expressed between the two platforms (∼14,000). The results are also shown as a dendrogram in **Figure S1A**. To check the robustness of the results, 5 outliers were removed, the merged dataset was reprocessed (without extra normalization after ComBat processing), and the most variable genes were selected by filtering on variance [interquartile range (IQR) > 0.5, n = 1,660] (9). As shown in **Figures S1A–C**, the sample distribution is in good agreement with the original 3D PCA (**Figure 3**).

The appropriateness of the data integration is highlighted by the fact that naive samples [in green, spheres: Affymetrix, (8); tetrahedron: Illumina, (7)], which are the most comparable samples between the two data sets, cluster in close proximity to each other. Interestingly, late timepoint Inflation and Exhausted samples (in red), besides tending to cluster relatively close to each other, diverge the most from naive samples and from non-Inflation and non-Exhausted samples (in blue), which instead occupy an intermediate position in the overall plot (**Figure 3**). In accordance with their extremely differentiated phenotype, Inflating samples are placed at the furthest distance from naive samples, immediately followed by the exhausted cells.

Batch effects may still play a role in this type of data analysis, and this was investigated using the pvca R package (10). Prior to ComBat processing, the "platform" factor (Affymetrix, Illumina) was assessed as the major source of batch effect; following batch effect removal, the "State" factor (Naive, Resolving, Inflating\_Exhausting; three simplified categories of samples: reference sample, intermediate phenotype, and extreme phenotype) represents the major source of variability (**Figure S7**). This initial, descriptive statistical approach suggests that immune responses demonstrating Inflation and Exhaustion share some common features at the molecular level, despite the divergence that is observed at the phenotypic level.

It is reasonable to postulate the existence of a common set of genes behind these immune responses; a pathway, in which the behavior and the intrinsic dynamics of its components could determine the fate of T-cells toward either Inflation or Exhaustion.

### Weighted Gene Co-expression Network Analysis of Inflating Samples

To test the hypothesis that a shared set of genes lies behind immune responses that are opposite yet highly related at their origin, we performed Weighted Gene Co-Expression Network Analysis [WGCNA, (11)] on the Inflation data set only (i.e., without any potential artifacts generated through merging), after first filtering the genes by variance (n = 2,231), with the intention of detecting a module of genes characteristic of the phenomenon.

The dendrogram in **Figure 4A** shows gene hierarchical clustering highlighting modules of genes with high interconnectivity based on TOM similarity (Topological Overlap Measure, a robust measure of network proximity). The second largest detected module (blue) was found to be enriched with immune relevant genes, after checking for the most statistically represented GO category (GOenrichmentAnalysis in WGCNA package, **Tables S1A,B**). Reactome pathway enrichment analysis (12) confirmed the overrepresentation of immune relevant pathways within the blue module (**Figure S2A**). Genes such as Tbx21 and Eomes, known to have pivotal roles in T-cell differentiation, are contained in the blue module, along with other genes, such as E2f2, involved in cell proliferation and consistent with the Inflation phenotype.
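To make the TOM similarity concrete, the following is a minimal Python sketch (the analysis itself was run with the R package WGCNA). It assumes the standard unsigned formulation, in which the adjacency is a_ij = |cor_ij|^β and the overlap is TOM_ij = (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), with k_i the connectivity of gene i. The correlation values and β = 2 are hypothetical toy inputs, and the network type used by the authors is not stated here.

```python
def soft_threshold_adjacency(corr, beta):
    """Unsigned adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(corr)
    return [[0.0 if i == j else abs(corr[i][j]) ** beta for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """Topological overlap matrix computed from a symmetric adjacency matrix."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]  # k_i = sum of adjacencies of gene i
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Strength of connections that i and j share via "third party" genes.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            denom = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical correlation matrix for three genes (toy values, not study data).
toy_corr = [[1.0, 0.9, 0.8],
            [0.9, 1.0, 0.7],
            [0.8, 0.7, 1.0]]
toy_adj = soft_threshold_adjacency(toy_corr, beta=2)
toy_tom = topological_overlap(toy_adj)
for row in toy_tom:
    print([round(value, 3) for value in row])
```

Two genes receive a high overlap when they are directly correlated and also share strong connections with the same neighbours, which is why TOM is less sensitive to a single spurious correlation than the raw adjacency.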

We therefore repeated PCA on the whole integrated data set, using the 588 genes assigned to the blue module, to test if sample distribution in respect to these genes was unaltered, improved or changed (**Figure 4B**); it was observed, in fact, that the sample layout was conserved: independent naive samples (in green, spheres, and tetrahedrons), that could function as calibrating samples, were clustered even closer, while inflating samples (red, tetrahedrons) were separated further compared to the PCA plot using the whole gene set (**Figure 3**). In general, the rest of Inflating and Exhausted samples appear to cluster tighter in the context of this subset of genes (blue module genes).

It is important to note that the same analysis performed using other gene modules, e.g., the turquoise module, can disrupt the PCA sample layout we observe using all genes (or blue module genes) and generates a less informative PCA plot or hierarchical clustering dendrogram. The clustering analysis using genes from the turquoise module (not enriched in immune relevant GO terms, **Tables S1A,B**, or Reactome pathways, **Figure S2B**) produced a sample layout (**Figure S3**) that is noticeably different from the layout produced when employing blue module genes. Module detection, clustering analysis, GO terms and Reactome pathway enrichment (**Figures S2A,B**) were consistent after removing outliers and changing parameters to best fit a scale-free topology (more appropriate soft-thresholding power, β = 20). The equivalent approach of hierarchical sample clustering analysis, again employing blue module genes, shows clearly that Inflating and Exhausted samples cluster together (in red), along with recently activated samples at acute phases from both data sets (**Figure 4C**).

### Graphical Representation of Gene Networks Obtained in Inflating and Exhausted Data Sets

A separate WGCNA was executed on the Exhaustion data set, matching the criteria utilized for the Inflation network inference, and similarly a module containing Tbx21 was detected with a gene composition significantly enriched in immunological pathways. Indeed, there is a considerable gene overlap between the two modules, which is even more evident among transcription factor genes, where the overlap reaches 41% of the features (**Figure 5A**).

To gain further insight into the two related Tbx21 modules, a graphical representation of node connectivity was produced. A circular topology was imposed on both gene modules to make the networks comparable and to observe whether the gene hubs (nodes with the highest connectivity) were conserved or changed. Edges between genes reflect their direct correlations and a similar correlation with "third party" genes, measured in the TOM matrix.

co-expression network analysis of Inflating samples after removing outliers (Soft-thresholding power β = 20).

We discovered that in the module network generated using the Inflation data set, Tbx21 appeared to be a dominant hub, connecting with the vast majority of genes in the module, emphasized clearly when only edges with a weight > 0.41 were plotted (**Figure 5B**). Tbx21 has been previously described to be pivotal in controlling CD8 T cell activation (13) and this finding would extend the role of Tbx21 in maintaining functionality and effector status of the T cells in a subset of long-term memory populations.
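The notion of a dominant hub can be illustrated by counting, for each gene, the edges that survive a weight cut-off. The Python sketch below is only a toy illustration: the gene names other than Tbx21 and all TOM weights are hypothetical, and only the 0.41 cut-off is taken from the text.

```python
def hub_degrees(genes, tom, min_weight):
    """For each gene, count the edges whose TOM weight exceeds min_weight."""
    degrees = {}
    for i, gene in enumerate(genes):
        degrees[gene] = sum(
            1 for j in range(len(genes)) if j != i and tom[i][j] > min_weight
        )
    return degrees


# Hypothetical module of four genes with made-up, symmetric TOM weights.
toy_genes = ["Tbx21", "GeneA", "GeneB", "GeneC"]
toy_weights = [
    [1.00, 0.60, 0.50, 0.45],
    [0.60, 1.00, 0.20, 0.10],
    [0.50, 0.20, 1.00, 0.30],
    [0.45, 0.10, 0.30, 1.00],
]
toy_degrees = hub_degrees(toy_genes, toy_weights, min_weight=0.41)
toy_hub = max(toy_degrees, key=toy_degrees.get)
print(toy_degrees, "-> hub:", toy_hub)
```

In this toy network every retained edge involves Tbx21, which is the pattern described for the Inflation module; a module without a unique hub would show several genes with comparable degrees.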

In contrast, in the corresponding graphical representation of the analogous module in the Exhaustion data set, Tbx21 no longer appears to be the unique hub, losing its central role. Even lowering the cut off (weight < 0.54) of plotted edges shows evidently that other genes are more likely to function as hubs (**Figure 5B**). The matrices in **Tables S2A,C** represent the TOMs (topological overlap measures) used as input to visualize the blue (Tbx21) and turquoise module networks in Cytoscape with a circular topology imposed (**Figure 5B**). To corroborate this finding, we analyzed the data with an alternative approach, employing a method that searches for genes following a similar trajectory to Tbx21 across the time points of the two experiments (**Figures 5C,D**). We applied the Pavlidis template matching test (14, 15) using Tbx21 as template, equally for the Inflating and Exhausted modules. Consistent with the network analysis, we observe that in the Exhaustion data only 33 genes significantly match the expression trajectory of Tbx21 (**Figure 5D**), while in the Inflating data module over 200 genes display expression patterns strongly matching that of Tbx21 (**Figure 5C**).
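The idea behind template matching can be sketched in a few lines of Python: each gene's expression profile across the time points is correlated with the template profile (here Tbx21), and genes above a threshold are retained. This is a simplified illustration, not the MeV implementation used in the study; in particular it applies a fixed Pearson correlation cut-off (`min_r`, an assumed parameter) instead of a significance test, and all expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile cannot match any trajectory
    return sxy / math.sqrt(sxx * syy)


def match_template(profiles, template_gene, min_r=0.9):
    """Return genes whose profile correlates with the template at >= min_r."""
    template = profiles[template_gene]
    return sorted(
        gene for gene, profile in profiles.items()
        if gene != template_gene and pearson(profile, template) >= min_r
    )


# Hypothetical expression values at four time points (toy data).
toy_profiles = {
    "Tbx21": [1.0, 2.0, 4.0, 5.0],
    "GeneA": [2.0, 4.0, 8.0, 10.0],  # same trajectory, different scale
    "GeneB": [5.0, 4.0, 2.0, 1.0],   # opposite trajectory
    "GeneC": [1.0, 3.0, 3.0, 6.0],   # loosely similar trajectory
}
toy_matches = match_template(toy_profiles, "Tbx21")
print(toy_matches)
```

Because correlation ignores scale, a gene expressed at a different absolute level can still match the template, which suits the comparison of trajectories rather than expression magnitudes.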

represents average expression, and the dotted line is the centralized (*z*-score) template expression.

In order to further explore the hypothesis that Tbx21 is the main master regulator driving inflation and directly interacting with the genes in its module, we analyzed publicly available ChIP-Seq data (16) in which antibodies against Tbx21 were used to analyze transcription factor-DNA binding in cytotoxic T lymphocytes from mice infected with LCMV (P14 mice). A Tbx21 binding profile was obtained by running the macs2 peak caller algorithm (17) on the wild type sample (SRR2075567, Tbx21 ChIP-seq on effector P14 CD8+ T cells) with an input control (SRR2075584) (q < 0.01). Following gene annotation of peak regions [Homer bioinformatics tools, homer.ucsd.edu/homer, (18)], 383 out of 588 genes in the Tbx21 module obtained with Inflating samples were found to have binding sites for Tbx21 in proximity (<1 kbp); this rises to 414 genes if we include peaks within a distance of 2 kbp. A Fisher test was performed to demonstrate that the Tbx21 module was significantly enriched with peaks: highly significant peaks (macs2 peak score >50) mapping to the blue module (63/588) were significantly more enriched (p = 0.0003) than turquoise module peaks (44/805).
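The module comparison can be framed as a 2 × 2 contingency table (genes with and without a high-scoring peak, in the blue vs. turquoise module). The Python sketch below implements a two-sided Fisher exact test with the standard library only. The table layout is an assumption about how the reported counts were tested, so the resulting p-value need not coincide exactly with the published one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the hypergeometric
    probabilities of all tables at most as likely as the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x hits in row 1 with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= observed * (1 + 1e-7)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Counts quoted in the text: genes with a high-scoring peak / module size.
blue_hits, blue_size = 63, 588
turquoise_hits, turquoise_size = 44, 805
odds_ratio, p_value = fisher_exact_two_sided(
    blue_hits, blue_size - blue_hits,
    turquoise_hits, turquoise_size - turquoise_hits,
)
print(f"odds ratio = {odds_ratio:.2f}, two-sided p = {p_value:.2g}")
```

Exact integer arithmetic via `math.comb` avoids the overflow that factorial-based formulas would hit with margins of this size.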

We can formulate the hypothesis that an Inflationary response involves a "homogeneous" genetic program, mainly coordinated by the transcription factor Tbx21. We postulated that a significant difference between Inflating and Exhausted populations could be more evident from differential correlations in expression between genes than from absolute changes in expression levels of the same genes between models. The stronger correlations that Tbx21 shows with its putative target genes in the Inflation data highlight its fully functional role as a master regulator, while in the Exhaustion data its task could be mitigated or disrupted by other coexisting transcriptional pathways. To illustrate the level of correlation between Tbx21 and genes within its module, it is informative to examine the expression levels of Tbx21 and one representative gene across the time points. E2f2 is a transcription factor involved in cell growth and proliferation. The relationship between Tbx21 and E2f2 was analyzed separately in the Exhaustion and Inflation datasets (**Figure S4**). In the Inflation time course, E2f2 shows an expression pattern that matches that of Tbx21 well, especially at later stages of infection (beyond the dashed line), while in the Exhaustion time series the patterns of expression are poorly correlated.

In support of this observation, inflating samples at the late timepoints are enriched with Reactome pathways involved in cell division and proliferation. The late timepoints of the inflation and exhausted samples within the integrated dataset were checked for enrichment of 674 curated Reactome gene sets [MSigDB Collections, (19) c2.cp.reactome.v6.2.symbols.gmt]. Inflated (M38, day 50) and exhausted (Arm Cl13, day 30) cell populations were compared directly by GSEA [Gene Set Enrichment Analysis, (20)], and the top enriched gene sets in Inflation were found to be homogeneous, belonging to categories of cell division and DNA replication (**Table S3**, **Figure S5**). Conversely, the most significant Reactome pathways enriched in exhausted samples are quite heterogeneous (**Table S4**, **Figure S5B**). Repeating this analysis with inflating cells from the Adenovirus model (Day 50) vs. exhausted samples (Arm Cl13 day 30) yielded consistent results in spite of the presence of a clear outlier in the inflating samples (S11 D8V), which would have reduced the statistical power (data not shown). Taken together, these findings lend credence to the idea that a key difference between inflation and exhausted cells at the later timepoints is that regulation of the former's proliferative capacities is retained, akin to those of activated and highly proliferating effector and memory CD8 T cells after acute Arm infection (8).

Finally, we further explored the new potential insights gained from data integration using GSEA of Immunological signatures from MSigDB (4,872 gene sets, c7.all.v6.2.symbols.gmt). In the same comparison between Inflation and Exhaustion at late timepoints (MCMV M38 day 50 vs. LCMV Cl13 day 30), gene sets from comparable studies that were significantly enriched in either the Inflation or the Exhaustion populations (FDR < 0.25, based on gene set permutations) were found to be biologically interpretable. Data and relevant examples consistent with experimental data (21) are shown in **Table S5** and **Figure S6**.

### CONCLUSIONS

This analysis of previously published transcriptional data was aimed to address a simple but important question: what is the relationship between memory inflation and immune exhaustion? Both types of memory are distinct from conventional central memory development where populations contract following an acute expansion. In the one case the cells remain functional (inflation) and in the other they lose function (exhaustion) associated with distinct phenotypes. We hypothesized that dysregulation of a key module of genes might account for the phenotypic and functional differences seen between these T cell types.

Although it is possible to address this question using parallel analyses of existing data (and we have done this here), there is additional power in the merging of datasets, provided great care is taken to avoid artifacts due to batch and platform effects. Clearly, a repeat set of experiments with mice treated with the different infections and vaccines in parallel, with a conserved pipeline for gene expression measurement and downstream data analysis, would be ideal, but it would also be impractical and costly given constraints on animal models and animal welfare in different settings (for example in the UK, LCMV Arm Cl13 remains a biosafety group 3 pathogen). We therefore think that approaches of data integration, using a range of appropriate tools, applied to the increasing number of publicly available datasets will have important potential for future studies. Indeed, this potential should only increase over time with convergence of sequencing approaches toward high-throughput RNA-Seq methods and accrual of datasets.

We observed two features of note from this analysis. Firstly, using a merged dataset, some transcriptional features of exhaustion and inflation appear to be shared, and cluster broadly closer to those derived from responses analyzed at an acute timepoint than to naive T cells or conventional memory. This is perhaps not a surprise since both inflationary and exhaustive memory are dependent on persistent/repetitive antigen stimulation. While the nature and/or the intensity of this antigen re-encounter may differ between the settings, in both cases, TCR triggering occurs and it is the response to this triggering which distinguishes the populations. While the populations clearly differ in expression of inhibitory receptors such as PD-1 and TIM-3, these represent only a relatively small set of genes within the total number up- and down-regulated in these populations compared to naive cells—and may overall contribute little to the PCA and hierarchical methods used to compare these subsets. This is not to say that such features are not critical and distinctive, but simply that other shared features may be worth exploration in future.

The second feature of interest relates to the relative connectivity of Tbx21 amongst genes within the same module; this analysis was initially performed using the primary non-merged datasets. All such genes in each setting (inflation or exhaustion) show a degree of correlation; however, the role of Tbx21 as a central master transcription factor amongst these genes is fundamentally different in the inflationary pool. This fits with the well-demonstrated type 1 responsiveness driven by Tbx21 and associated with these populations, with functional control of virus, and also with a differentiated phenotype which depends on functional interactions between Tbx21 and Zeb2 (16, 22). The lack of connectivity of Tbx21 in exhaustion has been previously described and is reproduced in this comparative analysis; however, the cause of this has yet to be fully elucidated.

It is interesting to note that in the absence of Eomes, a transcription factor prominent in exhausted (8) but not in inflating T cells, memory CD8 T cells display a phenotype which strikingly resembles that of inflating memory cells, being IL-2 low and high in Granzyme B, Klrg-1, and Perforin (23). However, these memory cells were not able to proliferate properly in the recall response. This does not appear to be the case in inflating memory cells, and underscores the importance of the co-ordinated cell cycling module present in inflating cells which is not observed in exhausted cells. Furthermore, retention of their proliferative capacity would also explain their numerical superiority.

Infection with CMV drives not only memory inflation and conventional memory, but is also associated with the development of "peripheral memory" and tissue resident memory. Peripheral memory is linked to an intermediate expression of the chemokine receptor CX3CR1 (fractalkine receptor) and with the potential to proliferate in vivo as well as differentiate to both CX3CR1 high and low subsets, depending on the exposure to antigen (24, 25). Tissue resident cells have lost CX3CR1 expression like central memory cells, but have evolved tissue associated phenotypes such as expression of CD69 and CD103 (26). It is likely that the combination of these multiple distinct memory types is responsible for long term virus control, especially as the maintenance of inflationary pools depends on longer-lived central and peripheral memory cells. Further transcriptional analyses of peripheral and tissue-resident memory within the MCMV and adenovirus models vs. the LCMV model, for example using an integrated approach such as that shown here, will be of value in defining their key distinguishing characteristics and how they are inter-related. Overall, this will certainly contribute to our understanding of the host responses to chronic viral infection and hopefully to the development of novel vaccines.

In conclusion, we have tried to address an important immunological question by a deeper analysis of valuable experimentally derived datasets, and in doing so generate new ideas about the causes of exhaustion and the mechanisms involved in robust memory after vaccination. Such studies can be examined in parallel but the tools available to interrogate the data as one integrated group are attractive and this area will doubtless be explored further in future. The major issues include not only potential batch effects, but also fundamental differences in mouse strains used, housing and handling, all of which can lead to artifacts. For immunologic experiments, the presence of very well-defined/conserved cell populations (in this case naive and conventional memory CD8+ T cells) provides some important internal references and controls following dataset integration. Further mathematical tools should also be applied to assess the quality of the integration and avoid biases (e.g., the PVCA-R package). As gene expression studies develop and cross-referencing of RNA-Seq datasets becomes both more attractive and more powerful this issue of normalization and integration will become even more important, for example in initiatives such as the Human Cell Atlas (https://humancellatlas.org). Further, however well-integrated mathematically, such data ultimately demand experimental validation—in this case a deeper analysis of the role of Tbx21 (and associated genes in the module) in inflationary memory would be valuable.

## METHODS

PCA of expression profiles was performed in R (27) using the princomp/prcomp functions. Three-dimensional plots employing the first three principal components were generated using the R package rgl (28), which provides high-level functions for 3D interactive graphics.
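For readers unfamiliar with what the PCA step computes, the following Python sketch derives principal components by power iteration with deflation on the sample covariance matrix, and reports the fraction of variance each component explains. It is an illustration under simplifying assumptions and not the R code used in this study: the toy matrix is hypothetical, and production analyses would rely on an SVD-based routine, since power iteration converges slowly when eigenvalues are close.

```python
import math
import random


def pca(samples, n_components=3, n_iter=200, seed=0):
    """PCA by power iteration with deflation.

    samples: list of rows (one per sample), each a list of feature values.
    Returns (scores, explained_variance_ratio).
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    total_variance = sum(cov[i][i] for i in range(p))
    rng = random.Random(seed)
    components, ratios = [], []
    for _ in range(min(n_components, p)):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(p) for j in range(p))
        components.append(v)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the variance captured by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(c[j] * comp[j] for j in range(p)) for comp in components]
              for c in centered]
    return scores, ratios


# Hypothetical matrix: 4 samples x 3 features lying on a single line.
toy_matrix = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
toy_scores, toy_ratios = pca(toy_matrix, n_components=2)
print([round(r, 3) for r in toy_ratios])
```

The explained-variance ratios are the quantity behind statements such as "the first three principal components capture most of the variability", and the scores are the coordinates plotted for each sample.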

The merging of data sets from different microarray platforms (Illumina, Affymetrix) is summarized in the following steps:


Weighted Correlation Network Analysis [WGCNA, (32)] was executed using R package WGCNA (11) on a subset of highly variable genes (IQR > 0.5, 2231 features).
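The variance filter applied before network construction can be sketched as follows in Python. The gene names and expression values are hypothetical; only the IQR > 0.5 threshold comes from the text, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption that may differ slightly from the one used in the original R analysis.

```python
import statistics


def interquartile_range(values):
    """IQR computed with linear interpolation between order statistics."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(expression, threshold=0.5):
    """Keep genes whose expression IQR across samples exceeds the threshold."""
    return {gene: values for gene, values in expression.items()
            if interquartile_range(values) > threshold}


# Hypothetical log-scale expression values across six samples.
toy_expression = {
    "GeneFlat": [7.0, 7.1, 7.0, 7.1, 7.05, 7.0],      # nearly constant
    "GeneVariable": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # varies across samples
}
toy_filtered = filter_by_iqr(toy_expression)
print(sorted(toy_filtered))
```

Filtering on IQR rather than on variance makes the selection less sensitive to a single extreme sample.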

Hierarchical clustering analysis of samples using gene modules was performed using edited functions in flashClust R package (33).

Graphical visualization of gene module networks was generated in Cytoscape (34), with gene edges based on Topological Overlap Measures (TOM). Pavlidis template matching analysis (14), using the Tbx21 gene as the reference expression pattern to match, was performed in the MeV Java-based application (15).

Peak calling on Tbx21 binding profile data (16) was performed using the macs2 algorithm to identify genome-wide locations of transcription factor binding from ChIP-seq data (17). Tbx21 peaks were annotated using Homer bioinformatics tools [http://homer.ucsd.edu/homer/ngs/annotation.html (18)].

The effectiveness of batch effect correction was assessed using the R package pvca (10).

The versions of all the packages used are provided in the **Supplementary Materials and Methods**.

#### AUTHOR CONTRIBUTIONS

EM conceived the bioinformatic model, developed the idea, performed the bioinformatic analysis, and wrote the article. PK conceived the model, reviewed and wrote the article, and gave final approval. LNL contributed to the interpretation and design of the bioinformatics analysis and writing of the article. LNL and PK contributed equally.

#### ACKNOWLEDGMENTS

We are very grateful to the scientists who generated the original transcriptional datasets—particularly Stuart Sims and Bea Bolinger in Oxford and also John Wherry's team at the University of Pennsylvania, without whom this analysis would not have been possible. Thanks must also go to those who have collaborated on subsequent MCMV experiments including Julia Colston, Madeleine Zinser, Andy Highton, and Claire Gordon. This work was supported by the Wellcome Trust WT109965MA. LNL is supported by Cancer Research UK grant C30332/A23521.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2019.00197/full#supplementary-material

Figure S1 | (A) Clustering analysis of merged samples from Inflating and Exhausted models. Dendrogram showing samples clustering (equivalent to analysis in Figure 3) after removing outliers and filtering gene by expression variance (IQR > 0.5). (B) PCA analysis of merged samples from Inflating and Exhausted models. Samples PCA using first two principal components (equivalent to analysis in Figure 3) after removing outliers and filtering gene by expression variance (IQR > 0.5). (C) PCA analysis of merged samples from Inflating and Exhausted models. Samples PCA using first three principal components (equivalent to analysis in Figure 3) after removing outliers and filtering gene by expression variance (IQR > 0.5).

Figure S2 | (A,B) Reactome pathways enrichment analysis. Most represented pathways in blue module (A) and in turquoise module (B). Gene ratio is the proportion of pathways genes in the total number of module genes.

Figure S3 | Hierarchical clustering of Inflating/Exhausted samples based on turquoise module genes. Dendrogram plot showing sample clustering analysis (Euclidean distance) on Inflating-Exhausted merged sets, based on a gene set of 692 genes, detected as turquoise module in a repeated Gene co-expression network analysis of Inflating samples after removing outliers (Soft-thresholding power β = 20).

Figure S4 | Tbx21 and E2f2 Expression graphs. Normalized level of expressions of Tbx21 and E2f2 (in blue) across time points in Exhaustion and Inflation model experiments. Dash line marks the limit between early and late stage of infection.

Figure S5 | (A,B) GSEA enrichment plots of Reactome gene sets. First 9 top enriched gene sets pathways in Inflating samples (A) and in Exhausting samples (B) at late stages (IFNL: M38, 50 Days; EXHA: Cl13, 30 Days).

Figure S6 | (A,B) Representative GSEA enrichment plots of Inflation vs. Exhaustion. Illustrative enrichments of a CD8 effector signature (exact source: GSE30962\_1570\_200\_UP) in Inflation samples (A) and a CD8 memory signature (exact source: GSE1000002\_1582\_200\_DN) in Exhaustion samples (B).

Figure S7 | Assessment of batch effect contribution prior and after ComBat processing. An analysis by pvca R package was performed to estimate the variability of experimental effects.

Table S1 | (A,B) GO term enrichment analysis of WGCNA modules. GO categories enrichment analysis in modules detected including all Inflating samples (A, β = 9) and excluding possible outliers (B, β = 20).

Table S2 | (A–C) TOM matrices of blue and turquoise modules. Table matrices with topological overlap measures (TOM) of blue modules detected including all samples (A) or excluding possible outliers (B); TOM matrix of turquoise module including all inflating samples (C).

Table S3 | GSEA report of Reactome gene sets enriched in Inflation. GSEA report of Reactome curated pathways found enriched (FDR < 0.25) in Inflating samples (M38, days 50) vs. Exhausting samples (Cl13, days 30).

Table S4 | GSEA report of Reactome gene sets enriched in Exhaustion. GSEA report of Reactome curated pathways found enriched (FDR < 0.25) in Exhausting samples (Cl13, days 30) vs. Inflating samples (M38, days 50).

Table S5 | (A,B) GSEA reports of MsigDB Immunological Signatures gene sets in Inflation and Exhaustion. GSEA reports of immunological signatures found enriched (FDR < 0.25) in Inflating samples (A) (M38, days 50) vs. Exhausting samples (B) (Cl13, days 30).

Supplementary Materials and Methods | Relevant R and Bioconductor packages and versions used.

### REFERENCES

a recombinant adenoviral vector. J Immunol. (2013) 190:4162–74. doi: 10.4049/jimmunol.120266


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor declared a past co-authorship with the authors EM and PK.

Copyright © 2019 Marchi, Lee and Klenerman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# RNA Structure—A Neglected Puppet Master for the Evolution of Virus and Host Immunity

Redmond P. Smyth<sup>1,2†</sup>, Matteo Negroni<sup>3\*†</sup>, Andrew M. Lever<sup>4,5</sup>, Johnson Mak<sup>6</sup> and Julia C. Kenyon<sup>4,7,8\*</sup>

<sup>1</sup> Helmholtz Institute for RNA-based Infection Research, Würzburg, Germany, <sup>2</sup> Faculty of Medicine, University of Würzburg, Würzburg, Germany, <sup>3</sup> Université de Strasbourg, CNRS, Architecture et Réactivité de l'ARN, UPR9002, F-67000, Strasbourg, France, <sup>4</sup> Department of Medicine, University of Cambridge, Addenbrooke's Hospital, Cambridge, United Kingdom, <sup>5</sup> Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore, <sup>6</sup> Institute for Glycomics, Griffith University, Gold Coast, QLD, Australia, <sup>7</sup> Department of Microbiology and Immunology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore, <sup>8</sup> Homerton College, Cambridge, United Kingdom

#### Edited by:

Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece

#### Reviewed by:

Subrata H. Mishra, Johns Hopkins University, United States Jianzhong Zhu, Yangzhou University, China

#### \*Correspondence:

Matteo Negroni m.negroni@ibmc-cnrs.unistra.fr Julia C. Kenyon jck33@cam.ac.uk

†These authors have contributed equally to this work and are co-first authors

#### Specialty section:

This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology

Received: 10 June 2018 Accepted: 24 August 2018 Published: 19 September 2018

#### Citation:

Smyth RP, Negroni M, Lever AM, Mak J and Kenyon JC (2018) RNA Structure—A Neglected Puppet Master for the Evolution of Virus and Host Immunity. Front. Immunol. 9:2097. doi: 10.3389/fimmu.2018.02097

The central dogma of molecular biology describes the flow of genetic information from DNA to protein via an RNA intermediate. For many years, RNA has been considered simply as a messenger relaying information between DNA and proteins. Recent advances in next generation sequencing technology, bioinformatics, and non-coding RNA biology have highlighted the many important roles of RNA in virtually every biological process. Our understanding of RNA biology has been further enriched by a number of significant advances in probing RNA structures. It is now appreciated that many cellular and viral biological processes are highly dependent on specific RNA structures and/or sequences, and such reliance will undoubtedly impact on the evolution of both hosts and viruses. As a contribution to this special issue on host immunity and virus evolution, it is timely to consider how RNA sequences and structures could directly influence the co-evolution between hosts and viruses. In this manuscript, we begin by stating some of the basic principles of RNA structures, followed by describing some of the critical RNA structures in both viruses and hosts. More importantly, we highlight a number of available new tools to predict and to evaluate novel RNA structures, pointing out some of the limitations readers should be aware of in their own analyses.

#### Keywords: RNA structure, viral evolution, secondary structure, immune evasion, viral RNA

### INTRODUCTION

Mutation rates of viral genomes are extremely high when compared with those of eukaryotic cells; RNA virus polymerases typically possess error rates of 10<sup>−4</sup> to 10<sup>−6</sup> per base (1). Such rapid mutation is a strategy by which viruses can evade host adaptive immune responses (2). Antiviral defenses of the innate immune system, which is less genetically flexible than the adaptive response, enable a very broad range of recognition from which it is difficult for viruses to escape, even given their high error rate. To counter the innate immune system, viruses have developed strategies to block its activation. This arms race prompts the immune system to develop countermeasures to recognize and eliminate the virus, whilst viruses that survive and transmit successfully are those that have evolved to escape it. When considering the evolutionary pressures on both virus and host, research has often focused on the protein sequences needed by each. A recent explosion in our understanding of the functions of RNA, however, leads us to consider instead the role of RNA itself in driving the evolution of viruses and of human immunity.
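To give a feel for what per-base error rates in this range mean for a whole genome, the following minimal sketch multiplies the error rate by a genome length. The genome length of 10,000 nucleotides is an assumed round figure chosen for illustration, and errors are assumed to occur independently at each base; neither assumption comes from the text above.

```python
def expected_mutations(error_rate_per_base: float, genome_length: int) -> float:
    """Expected number of errors introduced per genome copy."""
    return error_rate_per_base * genome_length


def prob_at_least_one_mutation(error_rate_per_base: float, genome_length: int) -> float:
    """Probability that a copy carries one or more errors,
    assuming independent errors at every position."""
    return 1.0 - (1.0 - error_rate_per_base) ** genome_length


GENOME_LENGTH = 10_000  # assumed illustrative genome size, in nucleotides

for rate in (1e-4, 1e-5, 1e-6):
    mean = expected_mutations(rate, GENOME_LENGTH)
    p_any = prob_at_least_one_mutation(rate, GENOME_LENGTH)
    print(f"rate={rate:.0e}: {mean:.2f} expected errors per copy, "
          f"P(at least one)={p_any:.3f}")
```

Under these assumptions, the upper end of the quoted range corresponds to roughly one new mutation in every genome copied.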

RNA is a truly multifunctional molecule. It directs ribosomes (themselves RNA based enzymes) to produce proteins, but also regulates cellular activity by interacting directly with proteins or nucleic acids and by catalyzing biochemical reactions (ribozymes). Indeed, RNA is now implicated in almost every cellular process, including immune defense. It is also recognized as being key to viral infection processes. RNA multifunctionality comes from its ability to fold into complex three-dimensional structures that can often switch conformation to effect different functions such as binding other RNA molecules or proteins. The initial fold of an RNA molecule depends primarily on its sequence and is established by Watson-Crick pairing of complementary bases into stem-loop structures that then orientate themselves relative to one another [for a general review of RNA structure see (3)]. This three-dimensional positioning can be stabilized by non-canonical interactions or structures such as pseudoknots, which occur where the nucleotides of a loop region base pair intramolecularly with complementary nucleotides. An example is shown in **Figure 1A**, within the complex structural element known as an IRES (internal ribosome entry site). As for proteins, single nucleotide mutations can alter the three-dimensional structure of the RNA, with corresponding deleterious or positive effects on its function; RNA structures are hence substrates for, and drivers of, viral evolution. For example, random mutation may confer a new beneficial function on a given structure that is then selectively favored by evolution.
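The dependence of a stem-loop on sequence can be illustrated with a toy search, sketched below. It only checks whether two short arms can form consecutive Watson-Crick pairs around a loop; it ignores thermodynamics, wobble pairs, and tertiary contacts, and the function names, the example sequence, and the stem and loop sizes are all hypothetical choices made for this illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def is_stem(five_prime_arm: str, three_prime_arm: str) -> bool:
    """True if the two arms can pair antiparallel using Watson-Crick pairs only."""
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    return all((x, y) in WATSON_CRICK
               for x, y in zip(five_prime_arm, reversed(three_prime_arm)))


def find_stem_loops(seq: str, stem_len: int = 4, min_loop: int = 3, max_loop: int = 8):
    """Return (start, loop_length) for every perfect stem-loop of the given stem length."""
    hits = []
    n = len(seq)
    for i in range(n - 2 * stem_len - min_loop + 1):
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop  # start of the 3' arm
            if j + stem_len > n:
                break
            if is_stem(seq[i:i + stem_len], seq[j:j + stem_len]):
                hits.append((i, loop))
    return hits


wild_type = "GGGCAAAAGCCC"  # hypothetical hairpin: GGGC / AAAA loop / GCCC
mutant = "GGGCAAAAGCAC"     # single substitution in the 3' arm

print(find_stem_loops(wild_type))  # the stem-loop is found
print(find_stem_loops(mutant))     # the substitution abolishes perfect pairing
```

The mutant differs from the wild-type sequence at a single position, which is enough to break the perfect stem in this simplified model.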

Viruses are known to be extremely thrifty with their genomes to maximize replication speed, using strategies such as overlapping reading frames and polycistronic mRNAs. Similarly, they often contain functional RNA structures within both coding and non-coding regions. For example, the first 500 nucleotides of the HIV-1 genome are densely packed with structured domains that control key steps of the replication cycle including transcription, translation, export, packaging, and reverse transcription (7–9) and with structural switches that aid their regulation (5, 10–12). Functionality encoded within an RNA structure is often a requisite for successful initiation or completion of viral replication. Many viruses, particularly those of the Picornaviridae, initiate translation from IRES elements (13). RNA structures facilitating frameshifting enable viruses to encode multiple proteins from a single RNA (14). Additionally, viral RNA structures may directly or indirectly impact cellular immunity. In HIV-1, the transactivation response element (TAR, **Figure 1B**) regulates transcription of the genomic RNA and gene expression (15), thereby playing a central role in determining the level of virus detected by the immune system; at its most extreme, this leads to complete evasion of the immune response through latency (16). Frameshift structures or splicing regulators qualitatively and quantitatively manage the amounts of proteins produced by viruses, and hence those that are seen by the immune system (17). For example, some strains of influenza A virus encode a second open reading frame (ORF) in segment three, which is accessed by a low frequency +1 ribosomal frameshifting event (18). This ORF produces PA-X, a protein which modulates inflammatory, apoptotic, and T-lymphocyte signaling pathways (18). Viruses with larger genomes can even produce their own microRNAs (19) or long noncoding RNAs (lncRNAs) that control cellular functions (20, 21). Dengue virus expresses a subgenomic RNA that has been shown to inhibit interferon expression by binding to TRIM25 (22). This subgenomic RNA is produced when a cellular 5′-to-3′ exoribonuclease stalls at a stable pseudoknot RNA structure in the 3′ UTR; small substitutions within this structure modulate viral fitness and pathogenicity through their effects on the immune system. The dengue 3′ UTR also folds differently in humans and insects (23), leading to production of different immunomodulatory non-coding RNAs in each host type (22, 23). This is one mechanism by which the same viral genome can both replicate effectively in human and insect cells and counteract these two divergent immune systems.

The immune system senses viral RNA using different mechanisms, including the recognition of viral single-stranded RNA by Toll-like receptors (24). The importance of RNA structures in viral replication is so fundamental that they are directly recognized by the innate immune response. Not only has the innate immune system evolved to recognize double-stranded RNA (dsRNA) within viral genomes or genomic replication intermediates, often via MDA5, it has also evolved to recognize the double-stranded parts of conserved RNA elements like IRESs (13) or 3' stem-loops, via RIG-I [reviewed in (25)]. The importance of this in maintaining broad antiviral defense is reflected in the fact that such RNA structures are also formed by DNA viruses in their protein-coding RNAs. The cellular double-stranded RNA recognition system not only leads to production of interferon, but was recently shown to upregulate NKG2D ligand, thus alerting NK cells to the presence of virus and enabling destruction of the infected cell (26). Viruses have evolved strategies to mask recognition of their RNAs, such as 5' cap-snatching (27), but their need to maintain certain critical RNA structures means that they struggle to avoid recognition entirely.

### VIRAL EVOLUTION TO EVADE AN IMMUNE RESPONSE MAY BE CONSTRAINED OR FACILITATED BY RNA STRUCTURE

The presence of essential conserved viral RNA structures can constrain the ability of viruses to evolve and to evade the immune response. Some RNA viruses have optimized their genome structure and organization to facilitate viral evolution during co-infection of the same cell with different viral strains. One widespread strategy is genome segmentation leading to reassortment (seen in rotaviruses and influenza viruses) (28). Another common strategy is template switching during replication leading to recombination and the formation of genome chimeras (seen in retroviruses) (29). Reassortment and recombination are non-random processes that are known to depend on RNA sequence and structure, but the underlying mechanisms are often poorly understood. Both processes may be facilitated by inter-molecular interactions (30, 31), and evolution can be either promoted or inhibited by intra-molecular RNA structure.

In HIV-1, each virion contains two copies of the RNA genome, which in the case of co-infection of the same cell can originate from two different proviruses. These are non-covalently joined as a dimer via stem-loop 1 (SL1, **Figure 1B**). During viral replication, reverse transcriptase (RT) switches template, thereby adding template-switching-generated errors to its inherently low fidelity, producing the genetic diversity that allows HIV-1 to rapidly escape the immune system and antiretroviral therapy (2). Sequence incompatibility preventing formation of heterodimers of genomic RNA at SL1 has been shown to be a major restriction to inter-subtype recombination, with only a 3 nt sequence difference being sufficient to disrupt this (32). When considering distantly related strains of HIV with compatible SL1 sequences, the main factor governing recombination is the degree of local sequence similarity (33, 34). However, in more closely related viral sequences, RNA structures strongly influence recombination locally (35–37). This has been shown to be the case for well-defined RNA structures within env such as the C2 hairpin and the Rev responsive element (RRE) (38). It has been suggested that this evolved to favor the stepwise folding of proteins during translation, but it has also been shown to favor the occurrence of recombination in these same regions (37). As a consequence, recombination may shuffle whole domains of proteins, thus generating structural variants that escape immune recognition, particularly for quaternary epitopes generated by the juxtaposition of different protein domains. RNA structures and sequences have also been shown to influence the fidelity of reverse transcription such that stable secondary structures enhance the number and type of mutations incorporated (39, 40). The regions of env encoding the external parts of the viral surface glycoprotein gp120 are under strong positive selection by the humoral immune system. Perhaps counterintuitively, they present a lower degree of RNA structure; however, this can be accounted for by such rapid viral mutation that the RNA is unable to conserve base-pairing. As studies have shown, poorly structured RNA regions are reverse transcribed with higher fidelity, which paradoxically limits the rate of introduction of mutations in these highly variable regions of the genome (39).

RNA viruses with segmented genomes can undergo reassortment, leading to the exchange of entire gene segments, potentially giving rise to new viral strains to which humans have no previous immunity. In influenza A, reassortment has been historically associated with the emergence of pandemic strains, including the most recent H1N1 2009 pandemic which contained influenza gene segments from human, avian and swine lineages (41, 42). It is thought that packaging sequences within each gene segment direct their selective incorporation into virus particles (43, 44) and recent work suggests a mechanism where packaging signals mediate RNA-RNA interactions that would guide their incorporation. It therefore follows that any preferential interactions or incompatibilities between vRNP segments would then regulate genetic reassortment and influenza evolution (45). It is possible that improved understanding of this process would help to better predict the emergence of pandemic influenza.

In addition to genome diversification through recombination and reassortment, RNA structures influence immune escape by modulating the viral proteome. The use of frameshifting or alternative start codons is often controlled by viral RNA structures, resulting in the translation of viral peptides or proteins in a different reading frame. The resulting peptides are often antigenic and may act to dilute the presentation of conventional viral peptides on the surface of the infected cell (46). In some viruses there is an evolutionary constraint to maintain translation-impeding RNA structures in genes that encode good T cell epitopes, thus maintaining their translation at levels too low to trigger T cell recognition and killing. The EBNA-1 RNA from Epstein-Barr virus for example regulates its own translation in cis (47) and the evolutionary pressure on this gene comes from the need to maintain G-quadruplex RNA structures (**Figure 1C**) that act as "steric blocks" to ribosomes. When these structures are destabilized, cells infected with the resulting mutant virus are more readily seen by T cells than those infected with wild-type virus (48).

Viruses also appear to be under pressure to maintain unstructured regions in their genome: there is a bias in HIV-1 toward the use of A's in the retroviral genome; this biases codon usage and ultimately even the amino acid composition of the viral proteins (49). Adenosines are vastly overrepresented in the single-stranded regions of RNA and underrepresented in double-stranded regions; their only binding partner, U, can also pair with G, which may explain the single-stranded nature of A-rich regions. Artificial introduction of extensive synonymous A to G mutations in pol led to increased stability of the dimeric genome inside the virion, and reduced reverse transcription as a result (50). The signature distribution of adenosine frequency and its relation to local RNA structure was thought to be maintained by the influence of RNA secondary structures on reverse transcription. Changing the A ratio in local areas by only including codons found in natural isolates of HIV-1 did not affect replication efficiency in vitro (51); however, it is possible that in vivo viruses use this strategy of maintaining parts of their genome as single-stranded in order to avoid innate immune recognition by RIG-I or MDA5. Many viruses also need to maintain a low number of CpG dinucleotides in their genome in order to avoid recognition by the zinc-finger antiviral protein (ZAP) (52). When HIV-1 codon usage was humanized, affecting the native RNA structure, a reduced IFN-α/β response was observed (53), suggesting that the maintenance of specific structures in the viral genome comes at the cost of greater recognition by the innate immune response. Despite this, the viral genome apparently undergoes positive selection for the maintenance of many specific RNA structures: when synonymous mutations were introduced extensively into the viral genome, a decrease in infectivity was observed that could be attributed to an expected alteration of splicing pattern and/or modification of RNA structures (50, 54).
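The compositional signatures discussed above (adenosine bias and CpG depletion) can be quantified with very simple sequence statistics. The sketch below is illustrative only: the function names are hypothetical, the example sequence is made up, and the CpG measure is the commonly used observed/expected ratio (CpG count multiplied by sequence length, divided by the product of the C and G counts), which is an assumption about how one might measure depletion rather than the method of the cited studies.

```python
def base_fraction(seq: str, base: str) -> float:
    """Fraction of positions in seq occupied by the given base."""
    seq = seq.upper()
    return seq.count(base.upper()) / len(seq) if seq else 0.0


def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G).
    Values well below 1 indicate CpG depletion."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


example = "AUGAAAGCAAUAGCAGAAACGAUAAAGCAA"  # made-up A-rich sequence
print(f"A fraction: {base_fraction(example, 'A'):.2f}")
print(f"CpG obs/exp: {cpg_observed_expected(example):.2f}")
```

Applied in sliding windows along a genome, the same two functions would give a rough profile of where A-rich or CpG-poor regions lie.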

### THE ROLES OF VIRAL AND CELLULAR RNA STRUCTURE IN THE EVOLUTION OF HUMAN IMMUNE RESPONSES AND THE HUMAN IMMUNE SYSTEM

In terms of the evolution of the human genome, the degree of variation at the MHC class I locus is positively correlated with local pathogen richness, for which viruses are postulated to play an important role. This is particularly evident for HLA B (55). The importance of RNA structure in influencing the generation of T cell epitopes, either through translational enhancing, blocking, or frameshifting mechanisms, means that RNA structures within viruses must have influenced the evolution of the HLA locus. For example, macaques that have the correct MHC-I allele to present an antigenic cryptic peptide derived from the env ORF are better able to control simian immunodeficiency virus (SIV) infection (56) which would be expected to be a driver for the maintenance of this MHC allele within the population. As mentioned above, the mechanisms controlling the generation of cryptic translational products are often RNA structure-dependent.

Viral RNA structures have also influenced the evolution of the innate immune system; as previously discussed, hallmarks of viral RNAs are targeted by conserved RNA-binding proteins such as RIG-I. More specific antiviral proteins have also evolved, however. APOBEC3 proteins target retroviral genomes and are incorporated into viral particles (57). These host cell-derived cytidine deaminases bind to the viral RNA and mutate it during the reverse transcription process, leading to non-functional virus. It has been reported that regions of the genome under strong purifying selection present an underrepresentation of APOBEC3 target sequences, a signature of a strong pressure for limiting the occurrence of mutations in certain regions of the genome (39). Retroviruses have also developed direct strategies to counteract these proteins, often by encoding proteins that bind to them directly. Interestingly, APOBEC3s have recently been shown to bind to the same motifs in the viral RNA genome as the viral NC structural protein involved in genome packaging. This suggests a competitive relationship may have developed between host cell and viral factors, for binding to the same viral RNA structures (58).

### NOVEL WAYS TO EXPLORE RNA STRUCTURE AND FUNCTION

RNA functionality is best understood through its structure, but RNA structure determination is extremely challenging. Although it is formed based on simple base pairing rules, for RNA molecules of biologically relevant sizes there are an astronomical number of possible structural permutations, meaning that RNA structure cannot be predicted easily from base pairing rules alone. Biophysical methods such as crystallography (59) and NMR (60) are each able to determine RNA structure at atomic resolution but both have difficulty with large RNA substrates. This is evidenced by the paucity of atomic resolution RNA structures, compared to their protein equivalents. For example, the RCSB databank holds structural data for over 100,000 proteins, but contains only around 1000 RNA structures. This difficulty arises because RNA molecules tend to adopt long flexible shapes with weak tertiary interactions that are prone to misfolding (61). Furthermore, the negatively charged phosphate groups on the surface of an RNA molecule can impose technical challenges as they hinder crystal packing. Newer techniques are emerging to address this gap including small-angle X-ray scattering (62), single molecule FRET (63, 64), and atomic force microscopy (65).
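The combinatorial explosion mentioned above can be made concrete by counting nested secondary structures with a standard recurrence: the last nucleotide is either unpaired, or paired with an earlier position, which splits the chain into two independent parts. The sketch below assumes that any two nucleotides may pair and that at least three unpaired nucleotides must separate paired positions; both are simplifying assumptions, so the result is a sequence-independent upper bound that excludes pseudoknots.

```python
def count_structures(n: int, min_loop: int = 3) -> int:
    """Number of nested secondary structures on n nucleotides,
    assuming any two positions can pair if at least min_loop
    unpaired nucleotides lie between them."""
    s = [1] * (n + 1)  # s[k]: structures on k nucleotides; s[0] = 1 (empty chain)
    for length in range(1, n + 1):
        total = s[length - 1]  # case 1: last nucleotide is unpaired
        # case 2: last nucleotide pairs with position j (1-based)
        for j in range(1, length - min_loop):
            total += s[j - 1] * s[length - j - 1]
        s[length] = total
    return s[n]


for n in (10, 50, 100, 500):
    digits = len(str(count_structures(n)))
    print(f"{n} nt: a number with {digits} digits")
```

Even under these restrictive rules the count grows exponentially with length, which is why exhaustive enumeration is not feasible for RNAs of biologically relevant size.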

RNA secondary structure is currently most commonly resolved using a combination of (i) phylogenetic approaches, (ii) structure prediction algorithms, and (iii) experimental methods with chemical/enzymatic probes. Whilst these methodologies cannot determine RNA structure at atomic resolution, they are nevertheless able to generate models that provide useful biological insights (8, 66, 67). Indeed, RNA structure determination is currently undergoing a revolution thanks to advances in next generation sequencing technology that have transformed traditional biochemical assays into powerful tools that can characterize thousands of RNA structures in single experiments (68–71). The most widely used methods take advantage of chemical probes, such as dimethyl sulfate (DMS) and selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) reagents, that differentially react with single-stranded vs. double-stranded RNA. Knowledge of whether a nucleotide is likely to be base paired or not can significantly improve the accuracy of RNA structure predictions from thermodynamic folding algorithms when included as an energetic consideration in the modeling programme, known as a pseudo free energy parameter (72). For example, if chemical probes show a nucleotide is single-stranded, the modeling algorithm adds an energetic penalty to structures that include it in a double-stranded region. The programme then displays the most energetically favorable structures that best fit all of the data. Several chemical probes can penetrate cells and virions, which is important for understanding RNA function in vivo, such as the binding sites of regulatory proteins (9, 73). Further characterization of RNA structure-function relationships can be obtained using specialized approaches, such as mutational interference mapping experiment (MIME) (74) and cross-linking SHAPE (XL-SHAPE), where protein binding sites are mapped using UV cross-linking in parallel with SHAPE probing (75), and by CLIP (crosslinking-immunoprecipitation sequencing) related methodologies (76). More recently, RNA proximity ligation has emerged as a new class of RNA structural probing technique for the direct detection of long-range base pairing or inter-molecular interactions (77–84). These types of interactions are commonly found in viral genomes/regulatory RNAs and are difficult to identify with other methodologies. As the immune system is known to be regulated by non-coding RNAs (85), the ability to detect direct interactions between viral and cellular RNAs will be particularly important for future understanding of virus-host interactions.
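The pseudo free energy idea described above can be sketched as follows. One widely used form converts a nucleotide's SHAPE reactivity into an energy term, slope × ln(reactivity + 1) + intercept, that is added whenever the nucleotide is placed in a base pair. The slope and intercept defaults below (2.6 and −0.8 kcal/mol) are values reported in the SHAPE-directed folding literature and are used here as placeholders; whether they suit a given reagent or dataset is an assumption to check, and the function name is hypothetical.

```python
import math


def shape_pseudo_energy(reactivity: float,
                        slope: float = 2.6,
                        intercept: float = -0.8) -> float:
    """Pseudo free energy (kcal/mol) applied to a paired nucleotide.

    Low reactivity gives a negative value (pairing is rewarded);
    high reactivity gives a positive value (pairing is penalized).
    """
    if reactivity <= -1.0:
        raise ValueError("reactivity must be greater than -1")
    return slope * math.log(reactivity + 1.0) + intercept


for r in (0.0, 0.3, 1.0, 2.0):
    print(f"reactivity {r:.1f} -> {shape_pseudo_energy(r):+.2f} kcal/mol")
```

In a folding program this term would be summed with the nearest-neighbor thermodynamic energies, so that structures pairing highly reactive nucleotides score as less favorable.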

### AUTHOR CONTRIBUTIONS

JM conceived the review. RS, MN, AL, JM, and JK wrote the manuscript.

#### FUNDING

This work was supported by the Helmholtz Association (VH-NG-1347 to RS), Sidaction (AI25-1-02335 to MN), The Biomedical Research Centre UK and Clinical Academic Reserve UK (to AL), and the Medical Research Council UK (MR/N022939/1 to AL and JK). JM is recipient of funding from Australian National Health and Medical Research Council project grant App1121697.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Smyth, Negroni, Lever, Mak and Kenyon. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Influenza Virus—Host Co-evolution. A Predator-Prey Relationship?

#### Konstantinos Voskarides, Eirini Christaki and Georgios K. Nikolopoulos\*

*Medical School, University of Cyprus, Nicosia, Cyprus*

Influenza virus continues to cause yearly seasonal epidemics worldwide and periodic pandemics. Although influenza virus infection and its epidemiology have been extensively studied, a new pandemic is likely. One of the reasons influenza virus causes epidemics is its ability to change antigenically on a continual basis through genetic diversification. However, host immune defense mechanisms also have the potential to evolve over shorter or longer periods of evolutionary time. In this mini-review, we describe the evolutionary processes relating to influenza viruses and their hosts through the lens of a predator-prey relationship.

Keywords: adaptation, antagonistic evolution, influenza virus, bottleneck, immune system, antigen, genetics, mutation

#### Edited by:

*Tara Patricia Hurst, Abcam, United Kingdom*

#### Reviewed by:

*Sarah Rowland-Jones, University of Oxford, United Kingdom Jianzhong Zhu, Yangzhou University, China*

#### \*Correspondence:

*Georgios K. Nikolopoulos gknikolopoulos@gmail.com; nikolopoulos.georgios@ucy.ac.cy*

#### Specialty section:

*This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology*

Received: *31 May 2018* Accepted: *15 August 2018* Published: *07 September 2018*

#### Citation:

*Voskarides K, Christaki E and Nikolopoulos GK (2018) Influenza Virus—Host Co-evolution. a Predator-Prey Relationship? Front. Immunol. 9:2017. doi: 10.3389/fimmu.2018.02017*

### INTRODUCTION

Health disasters caused by influenza viruses are abundant in human history (1–3). The Spanish flu of 1918 is estimated to have claimed more than 50 million lives, a figure that exceeded the death toll of World War I (4, 5). Another two pandemics occurred later in the twentieth century, when surveillance systems and laboratory capacity were in place to monitor them: the Asian flu (1957) and the Hong Kong flu (1968) (2, 6), which had lower fatality rates but still caused between 0.5 and 2 million deaths each. In 2009, a novel H1N1 influenza virus spread rapidly from Mexico to the rest of the globe (2) but ultimately had a moderate impact, with fewer deaths than the previous pandemics (2, 7), partly due to our improved response. Seasonal epidemics may also be severe, with high morbidity and mortality (8).

Viral infections are not a static phenomenon. Many viral genomes can adapt to their hosts' cellular environment, while survivors of viral infections probably have particular genetic characteristics that helped them remain alive. Therefore, both genomes change over time, reminiscent of the evolutionary race between predators and prey. Viral genomes, however, are smaller and more flexible, allowing viruses to adapt rapidly through random mutagenesis or through transmission to other species. Influenza pandemics are due to these random and sudden viral mutations or trans-species transmissions. From this perspective, we discuss influenza viruses and their coevolution with the human genome.

### INFLUENZA VIRUS

#### Virology and Replication Cycle

Influenza viruses are members of the Orthomyxoviridae family. Their genome consists of negative-sense, segmented, single-stranded RNA. Viral particles also contain essential viral proteins and a host cell-derived envelope (9). Influenza viruses are further classified into four types: A, B, C, and D. Influenza virus types A, B, and C infect and cause respiratory illness in humans. Influenza D viruses mainly affect cattle and are not known to infect humans (10). Influenza A and B viruses cause seasonal epidemics, whereas type C viruses usually cause a mild upper respiratory tract illness and associated epidemics have only rarely been reported (11). Influenza A viruses can infect many animal species, including birds, pigs, horses, marine mammals, and other hosts, and can cause pandemics. Influenza A viruses are categorized into subtypes based on the molecular characteristics of their surface glycoproteins, hemagglutinin (HA) and neuraminidase (NA). At least 18 antigenically distinct HA subtypes and 11 distinct NA subtypes of influenza A virus strains infecting humans and animals have so far been identified (12, 13). Two genetically and antigenically distinct lineages (Victoria and Yamagata) of influenza B viruses co-circulate in humans (14–16).

Hemagglutinin comprises two subunits, HA1 and HA2: HA1 is crucial for binding to the host cell receptor, whereas HA2 mediates membrane fusion. Viral endocytosis is followed by uncoating and release of viral RNA, which is imported into the nucleus, where viral RNA transcription and replication take place using the viral polymerase proteins and the host cell machinery (17). Virions are assembled at the cell surface and bud enclosed in an envelope originating from the host cell membrane. Neuraminidase allows the virus to leave the infected cell as it cleaves sialic acid (SA) from the cell surface receptors. Viral replication causes cell death through various mechanisms, including disruption of protein synthesis and apoptosis. Since viral release continues for hours before cell death, many respiratory epithelial cells are affected and die within a few replication cycles (9, 18).

Influenza viruses target epithelial cells of the respiratory tract, which contain SA receptors. Epithelial cells across species express different SA receptors and Influenza A virus strains show a predilection for certain types of such receptors, making zoonotic transmission difficult. For example, human influenza strains have a predilection for SA α-2,6 galactose receptors, which are found in the respiratory epithelium of the upper airways in humans, while animal influenza A viruses bind to SA α-2,3 galactose, which is found on the epithelial cells of birds and pigs, but could also be expressed in the human lower respiratory tract epithelium (12, 19, 20). HA epitopes are the major determinants for the production of strain-specific neutralizing antibodies.

#### Clinical Manifestations

Influenza symptoms usually present abruptly, after an incubation period of 1–2 days. Systemic symptoms are characteristic and help differentiate influenza from other upper respiratory tract viral illnesses. These include high fever, chills, rigors, headache, myalgias, malaise, and anorexia. Fever and systemic symptoms commonly last for 3 days; however, fever can last up to 8 days. Myalgias can be severe and usually involve the back and extremities. Respiratory symptoms include dry cough, sore throat, hoarseness, nasal congestion, and discharge (18).

Subtypes of influenza differ in their ability to infect airway epithelial cells of the upper or lower respiratory tract, hence causing either a milder infection or a more severe illness leading to severe pneumonia. For example, H5N1 infects alveolar epithelial cells as well as alveolar macrophages, triggering a significant pro-inflammatory response, which can result in severe lung injury (21–23).

### HOST IMMUNE RESPONSE TO INFLUENZA VIRUS INFECTION

Cells of the innate immune response are the first and fastest responders upon influenza virus infection, recruited by chemokines released by airway epithelial cells. Upon viral entry, intracellular viral ssRNA and other viral molecular patterns are recognized mainly by Toll-like receptors (TLRs) 3, 7, 8, and 9 and by retinoic acid-inducible gene I (RIG-I) receptors. The downstream signaling triggered by the activation of these receptors results in the activation of transcription factors like nuclear factor kappa-B and interferon regulatory factor (IRF) 3 and 7, leading to the expression of pro-inflammatory cytokines and interferons (24–26). Moreover, the NOD-like receptor family pyrin domain containing 3 (NALP3) inflammasome is also activated upon influenza virus infection, promoting IL-1β and IL-18 secretion and pulmonary infiltration by neutrophils and macrophages (27).

Natural Killer (NK) cells, monocytes, neutrophils, and dendritic cells migrate to the site of infection and exhibit antiviral activity. NK cells have cytotoxic activity on cells infected with influenza virus; macrophages phagocytose infected cells and regulate adaptive immune responses; and dendritic cells present viral antigens bound to Major Histocompatibility Complex (MHC) molecules to naïve and memory T lymphocytes, initiating the specific adaptive immune response. In addition, immunoglobulins (mainly IgA) present in nasal secretions contribute to the anti-influenza immune response by preventing viral entry (26).

Both T and B cells are essential in the adaptive immune response to influenza virus infection. Naive CD8<sup>+</sup> T cells, upon activation by dendritic cells and facilitated by the action of cytokines, proliferate and differentiate into cytotoxic T lymphocytes (CTLs). CTLs are able to kill influenza virus-infected cells and also restrict viral replication via production of cytokines and effector molecules like perforin and granzymes (26, 28, 29). Memory CTLs can respond efficiently during a secondary infection and confer cross-protective heterosubtypic immunity (30, 31). Moreover, CD4<sup>+</sup> T cells express co-stimulatory molecules that participate in antibody production by B cells. At the same time, antiviral cytokines, like IFN-γ, TNF, and IL-2, are expressed by Th1 CD4<sup>+</sup> T cells, which also help activate alveolar macrophages. Th2 CD4<sup>+</sup> T cells contribute to B cell-specific responses (32). Comprising an essential defense mechanism against influenza virus infection, B cells generate neutralizing antibodies (hemagglutinin-inhibiting antibodies) as well as non-neutralizing antibodies (anti-NA antibodies and antibodies against structural proteins). The latter, as well as specific antibody-dependent cell-mediated cytotoxicity, play an important role in cross-protection against different influenza subtypes (31, 33). There is increased interest in the role of cross-reactive T cells and broadly neutralizing antibodies in the development of vaccines that could elicit "universal" immunity against different or novel strains of influenza viruses.

## VIRAL EVOLUTION

### Predator and Prey. Antagonistic Evolution

The predator-prey relationship between species is of great interest in evolutionary biology, since the balance of attack and defense mechanisms is very important for the long-term survival of both species. A simplified example that is often given is that of lions and their prey. The faster the lions are, the more successful they can be in acquiring food, surviving, and reproducing. The faster the zebras (the lions' prey) are, the better they can escape the lions, and so survive and reproduce. It is all about Darwinian fitness, since survival and robustness are associated with a higher probability of having more offspring (descendants). That translates into reproductive success and a better chance to evolve. Darwinian fitness cannot increase simultaneously for both predator and prey species: it increases for one species and decreases for the other. That state is not stable, however, and afterwards the opposite happens. This is the phenomenon of antagonistic evolution, which is believed to push evolution forward and can sometimes increase phenotypic diversification (34).

Antagonistic evolution can be a main driver of multiple phenotypes, specialization, or even speciation. The evolution of multiple species belonging to the same genus or family is termed evolutionary radiation. An example is that of the marine mammals, dolphins and whales. It is believed that multiple such species have evolved due to prey (fish, squid) specialization (35, 36).

Antagonistic evolution also occurs between hosts and various pathogens, including viruses. Successful viruses are the ones that can deceive cells, enter the cytoplasm, take advantage of the cell's translation machinery (ribosomes), reproduce, and then exit the cell by exocytosis to infect other cells. Viruses can be considered the predators and cells the prey, even though viruses do not necessarily kill the cells. This is probably a major difference from the classical predator-prey relationship. Evolutionarily successful viruses are transmitted from one host to another (a cell or a whole organism) while causing minimal damage. On the other side, cellular defense mechanisms, like the MHC proteins, have the potential to evolve in a quite short evolutionary time, which makes them a highly effective immunological shield.

#### Genetic Diversification of Influenza Viruses

Influenza viruses exhibit vast genetic variability arising from three molecular mechanisms: mutation, genetic re-assortment, and genetic recombination. While the first two mechanisms have been extensively studied, the significance of the third one needs clarification (37). The results of these genetic processes are antigenic shift or antigenic drift. Antigenic shift is a rapid change of the virus genome and antigenicity, resulting from a combination of different viral strains. The eight RNA segments of the influenza virus genome can re-assort between different strains, producing new subtypes. This usually happens inside livestock animals, in geographic regions where people live in proximity to animals (38). Antigenic shifts have caused influenza pandemics in the past. Mutations, on the other hand, usually nucleotide substitutions, can cause antigenic drift. Antigenic drift causes more gradual changes to the viral HA and NA proteins and is the reason new influenza strains appear each year (39). Antigenic drift is very important for vaccine production.
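To make the combinatorics of re-assortment concrete, the following minimal sketch enumerates the genotypes that two co-infecting strains could in principle produce if each of the eight segments were drawn independently from either parent. The segment names are the conventional ones for influenza A and do not appear in the text above; the parent labels `A` and `B` and the function name are hypothetical, and the assumption that every combination is equally viable is an idealization that real viruses do not satisfy.

```python
from itertools import product

# Conventional names of the eight influenza A genome segments
# (an addition for illustration; the text only says "eight RNA segments").
SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]


def reassortant_genotypes(parent_a="A", parent_b="B"):
    """Return every assignment of each segment to one of two parental strains."""
    return [
        dict(zip(SEGMENTS, combo))
        for combo in product((parent_a, parent_b), repeat=len(SEGMENTS))
    ]


genotypes = reassortant_genotypes()
# Drop the two purely parental genotypes to count true reassortants.
reassortants = [g for g in genotypes if len(set(g.values())) > 1]
print(len(genotypes), len(reassortants))  # 2**8 genotypes, two of them parental
```

Under these idealized assumptions a single co-infection by two strains already allows 256 segment combinations, 254 of which differ from both parents, which gives a sense of why re-assortment can change antigenicity abruptly.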

The major cause of the high evolutionary rate of influenza viruses is the lack of a proofreading (correction) function in the viral RNA polymerase. The mutation rate is close to 10<sup>−3</sup>–10<sup>−4</sup> per nucleotide site, which is very high and produces high genetic variability in the viral genetic pool (40, 41). Another consequence of this continuously evolving viral genetic variability is the increased probability of widening the host range of the virus through changes in the HA protein (37).

Influenza A probably evolves under negative selective pressure against CpG oligonucleotides. This is thought to result from innate immune recognition, which appears to be less important for influenza B evolution (42, 43). Additionally, it has been shown that antigenic drift is more frequent in influenza A viruses than in influenza B viruses (44). Even though antigenic drift usually involves gradual changes in viral properties, the possibility of a rapid evolutionary change due to a mutation (punctuated evolution) cannot be excluded.

One of the most interesting recent findings regarding the genetic variability and evolution of influenza viruses is that when people become infected, they are actually infected by a diverse population of viruses and not by a single strain (45). This implies that vaccines target the dominant strain and not the whole population of viruses. Most of these viral strains are not well adapted but can persist for a long time inside a human (or other species) population, posing a hidden danger of gradual adaptation through the accumulation of mutations (46–48). Natural selection determines which viral variants survive and which do not. Additionally, this multi-infection process may be related to cooperation or competition phenomena among viruses, which are currently not well understood (39).

Further understanding of the mutational and evolutionary processes related to influenza viruses can help us develop prediction models, which are very important for vaccine production. Most predictive modeling studies focus on the surface protein hemagglutinin (49). Hemagglutinin DNA sequences for thousands of influenza A strains, isolated over the last 40 years, are freely available in electronic databases (50). However, prediction studies and experimental evolution studies are very challenging, and currently available tools are not adequate for reliable prediction. More details can be found in recently published reviews (51, 52).

### A Predator-Prey Relationship Between Human and Influenza Virus

The relationship between viruses and eukaryotic cells originated millions of years ago. It is not a simple one, since many viruses complete their cycle inside a cell without harming it. Instead, their nucleic acid sequences have sometimes become embedded in the host cell DNA. In humans, about half of the genome consists of viral sequences, reflecting this long-lasting ancient interaction (53, 54).


Antibody-mediated immunity to influenza viruses is well understood, and vaccine success is based on this response. Despite this, vaccines are ineffective in many people. One reason is that cellular immunity through T cells is also a very important immunological response (32). Antibody variability is a genetically intrinsic mechanism of B cells. The T-cell response, however, depends on the efficiency of MHC (HLA) proteins, which differs among individuals and is an inherited trait. Recognition of viral proteins on the surface of infected cells is performed through MHC proteins. There are thousands of genetic polymorphisms in the MHC genes at the population level, producing millions of combinations. These combinations are inherited in the form of "gene packages" termed haplotypes (each person carries two MHC haplotypes). It is unlikely for two individuals to carry a single common haplotype. Some MHC proteins (haplotypes) are more effective than others in recognizing viral proteins. Why is there such vast diversity of MHC proteins in human populations? The answer is that it increases the likelihood that some individuals will survive a severe pandemic, so that extinction of the species is avoided. On the other hand, we can assume that the haplotypes that exist today are the ones that were successful in the past in combating specific viral or bacterial infections. Survivors passed these haplotypes on to the next generations. Of course, as stated before, viruses are not stable entities. Mutations increase the number of viral types, and some of them will eventually "defeat" the previously successful MHC proteins, again rendering the (human) population susceptible. This is quite similar to the predator-prey relationship that occurs extensively in nature (**Figure 1**).

T4 cells (CD4 cells or T-helper cells) and T8 cells (CD8 cells or T-cytotoxic cells) are the most important lymphocytes that use MHC proteins to recognize virus-infected cells. Studies have shown that the long-term evolution of T8 cell epitopes may have shaped the long-term evolution of influenza A virus (55, 56). On the other hand, influenza viruses seem to respond to this evolutionary pressure through mutations in their nucleoprotein gene (nucleoprotein is an important viral capsid protein for T-cell recognition). It has also been shown that some combinations of MHC proteins and nucleoprotein variants result in an immunological response and some do not (57–59). This antagonistic evolutionary relationship drives a reciprocal increase in genetic variability. In the same context, TLRs may have remained evolutionarily stable, possibly because the pathogen-associated molecular patterns (PAMPs) they recognize have also remained unaltered, since they are fundamental for pathogen survival (60).

Another category of important proteins in human cells that defend against viral infections is the IFITM (interferon-induced transmembrane) proteins. At least three human IFITM proteins exist (61). When upregulated (by both type I and type II interferons), they can restrict viruses outside cells, preventing viral penetration through the membrane lipid layers. IFITM3 seems to be more specific for influenza viruses (61). Polymorphisms in the IFITM3 gene can increase or reduce susceptibility to influenza infections, underlining the evolutionary significance of these proteins (62, 63).

#### Genetic Bottlenecks

In population genetics, a genetic bottleneck is the event in which a sub-group of individuals originating from a larger population migrates to another geographic region, carrying only a subset of the genetic information of the initial population. This is a common process during evolution. Populations split, and a part of the initial genetic diversity is transferred to different geographic regions. Isolation and new mutations can dramatically change the genetic diversity of the new population.

Transmission of viruses to new hosts involves three different genetic bottleneck phenomena: (i) The viral transmission bottleneck, which determines how much of the viral diversity generated in one host passes to another during transmission (64, 65). Studies show that along with major variants, other minor viral variants can be transferred from one host to another (66). These minor variants can become major ones through natural selection pressures. Two different processes are related to this bottleneck: virus entry, and virus replication followed by production of new virions. (ii) Individuals or sub-populations (e.g., humans) migrate to other geographic regions and carry with them only a subset of the initial pool of viruses. (iii) Genetic bottlenecks of the hosts, e.g., humans. The host genome is vital for the effectiveness of the immune response. Only a subset of the parental population's MHC gene variability will be transferred to the new geographic region.

Combinations of these three kinds of genetic bottlenecks and evolutionary forces will shape the final virus-host relationship that is established in a human population (**Figure 2**). This relationship is not static, since new genetic bottlenecks keep occurring, especially nowadays, when people travel extensively.

What determines the viral bottleneck size? Small viral founding populations are usually observed, even though this depends on the virus and the host (67). Experimental studies in ferrets and guinea pigs showed that influenza infections resulting from limited-dose aerosol exposure had tighter bottlenecks than those resulting from contact transmission (68). More studies are needed to elucidate the importance of this phenomenon in human populations. Viruses follow particular routes across our planet, and understanding them may contribute to prevention strategies.
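The effect of bottleneck size on which variants reach a new host can be illustrated with a simple sampling calculation. The sketch below assumes that founding virions are drawn independently and at random from the donor's viral population, with no selection during transmission; this is a simplifying assumption rather than a model taken from the cited studies, and the function name, the 5% variant frequency, and the bottleneck sizes are purely illustrative.

```python
def prob_variant_transmitted(freq, bottleneck_size):
    """Probability that at least one founding virion carries a variant.

    freq            -- frequency of the variant in the donor host (0 to 1)
    bottleneck_size -- number of virions founding the new infection
    Assumes independent random sampling of founders (no selection).
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be between 0 and 1")
    if bottleneck_size < 0:
        raise ValueError("bottleneck_size must be non-negative")
    return 1.0 - (1.0 - freq) ** bottleneck_size


# Illustrative values only: a minor variant at 5% frequency in the donor.
for n in (1, 10, 100):
    print(n, round(prob_variant_transmitted(0.05, n), 3))
```

Under this assumption a tight bottleneck usually excludes minor variants, whereas a wide one transmits them almost surely, which is one reason the route of transmission may matter for the diversity seen in the recipient.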

### CONCLUSION

Influenza viruses and humans have co-evolved in the past and will continue to co-evolve. Our understanding of influenza virus pathogenesis and evolution has increased; however, we are still unprepared to stop the next pandemic strain from emerging. Evolution prediction algorithms can help inform efforts toward the development of better preventive measures.

#### AUTHOR CONTRIBUTIONS

KV and EC: conception, drafting the manuscript, approval of the final version; GN: conception and design, critical revision of the manuscript, approval of the final version.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Voskarides, Christaki and Nikolopoulos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# APOBEC3G Regulation of the Evolutionary Race Between Adaptive Immunity and Viral Immune Escape Is Deeply Imprinted in the HIV Genome

Faezeh Borzooee, Krista D. Joris, Michael D. Grant and Mani Larijani\*

Immunology and Infectious Diseases Program, Division of Biomedical Sciences, Faculty of Medicine, Memorial University of Newfoundland, St. John's, NL, Canada

#### Edited by:

Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece

#### Reviewed by:

Matteo Negroni, Center for the National Scientific Research (CNRS), France Tara Patricia Hurst, Abcam, United Kingdom Jean-Christophe Paillart, Université de Strasbourg, France

> \*Correspondence: Mani Larijani mlarijani@mun.ca

#### Specialty section:

This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology

Received: 15 July 2018 Accepted: 07 December 2018 Published: 11 January 2019

#### Citation:

Borzooee F, Joris KD, Grant MD and Larijani M (2019) APOBEC3G Regulation of the Evolutionary Race Between Adaptive Immunity and Viral Immune Escape Is Deeply Imprinted in the HIV Genome. Front. Immunol. 9:3032. doi: 10.3389/fimmu.2018.03032

APOBEC3G (A3G) is a host enzyme that mutates the genomes of retroviruses like HIV. Since A3G is expressed pre-infection, it has classically been considered an agent of innate immunity. We and others previously showed that the impact of A3G-induced mutations on the HIV genome extends to adaptive immunity also, by generating cytotoxic T cell (CTL) escape mutations. Accordingly, HIV genomic sequences encoding CTL epitopes often contain A3G-mutable "hotspot" sequence motifs, presumably to channel A3G action toward CTL escape. Here, we studied the depths and consequences of this apparent viral genome co-evolution with A3G. We identified all potential CTL epitopes in Gag, Pol, Env, and Nef restricted to several HLA class I alleles. We simulated A3G-induced mutations within CTL epitope-encoding sequences, and flanking regions. From the immune recognition perspective, we analyzed how A3G-driven mutations are predicted to impact CTL-epitope generation through modulating proteasomal processing and HLA class I binding. We found that A3G mutations were most often predicted to result in diminishing/abolishing HLA-binding affinity of peptide epitopes. From the viral genome evolution perspective, we evaluated enrichment of A3G hotspots at sequences encoding CTL epitopes and included control sequences in which the HIV genome was randomly shuffled. We found that sequences encoding immunogenic epitopes exhibited a selective enrichment of A3G hotspots, which were strongly biased to translate to non-synonymous amino acid substitutions. When superimposed on the known mutational gradient across the entire length of the HIV genome, we observed a gradient of A3G hotspot enrichment, and an HLA-specific pattern of the potential of A3G hotspots to lead to CTL escape mutations. These data illuminate the depths and extent of the co-evolution of the viral genome to subvert the host mutator A3G.

#### Keywords: CTL epitope, APOBEC3G (A3G), HIV, immune escape, viral evolution

### INTRODUCTION

HIV, like other RNA viruses, evolves rapidly and continuously through the accumulation of mutations (1). The high rate of HIV genome mutation, between 10<sup>−4</sup> and 10<sup>−5</sup> mutations per nucleotide per replication cycle, is generated by HIV's error-prone reverse transcriptase (RT) (2–7). APOBEC3G (A3G) is a member of the apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like editing complex (APOBEC) family of cytidine deaminase enzymes. Malim and colleagues, in 2002, discovered that A3G is responsible for the prevalence of G to A mutations in HIV sequences from HIV-infected individuals (8). The APOBEC family includes 11 members in humans: activation-induced cytidine deaminase (AID), APOBEC1, APOBEC2, APOBEC3A-H, and APOBEC4, which, through their cytidine deaminase activity, are involved in diverse physiological processes including lipid metabolism, antibody diversification, virus/retroelement restriction, and cancer genome hypermutation (9–13).

In general, the A3 branch family members are capable of impeding infectivity of HIV and several other viruses such as hepatitis B, human T cell leukemia virus type 1, and human papillomavirus (14–18), though A3G is the most effective actor on the HIV genome (19). These enzymes exert their antiviral restriction activity by deamination of cytidine to uridine (C to U) in the minus-strand single-stranded DNA during reverse transcription of viral genomic RNA, which results in guanosine to adenosine (G to A) mutation in plus-strand DNA (20–26). A3G is constitutively expressed in resting CD4<sup>+</sup> T cells, macrophages, and dendritic cells, but can be further induced by interferon (IFN) (27–29). It is packaged into the HIV virion in A3G-expressing producer cells and can act on the viral genome in the subsequently infected cell (20). It has been shown by single-virion analysis that A3G can be co-packaged with A3F, A3D, or A3H haplotype II and co-mutate the same viral genome in a single cycle of HIV replication (30).

A3G has a sequence preference for mutating C in CCC, TCC, and ACC motifs, but this preference is further modulated by the DNA secondary structure, together mediating accumulation of G to A mutations in viral cDNA (18, 21, 22, 24, 30–34). A3G-mediated mutations on the HIV genome also follow a twin gradient pattern, determined by the impact of the central and end polypurine tracts (PPTs) on the reverse transcription dynamics of the HIV genome, which alters the time that various regions are left single-stranded and available for A3G to act on (34, 35). Depending on the load and positions of A3G-induced mutations, this could lead to either degradation or G to A hypermutation in the viral genome (23, 25, 26). One of the early observations of the contribution of A3G to producing defective viral proteins was its ability to create premature stop codons (36, 37). For example, the codon encoding tryptophan (TGG) is converted by A3G to a stop codon (TGA) (37, 38). A3G is also able to physically interfere with HIV replication in a deamination-independent manner by blocking reverse transcription or binding to tRNA to prevent reverse transcription initiation (39–41). However, it appears that viral restriction has a higher requirement for deamination-dependent A3G activity (42–45).

Over the last decade and a half, several hundred published studies have focused on A3G's HIV restriction activity. A3G can induce as many as five mutations per kilobase (22, 26, 32, 38), at least an order of magnitude higher than RT's error rate. It has been suggested that HIV's RT is responsible for only 2% of HIV genome mutagenesis and that the other 98% can be attributed to the action of A3G (46). While high A3G activity correlates with slower disease progression, lower A3G activity leading to sub-lethal mutations might enhance HIV diversity and lead to more rapid disease progression (47). In contrast, other studies showed a much lower contribution of A3G to genetic variation of HIV, as compared with RT-driven mutations. One study reported negligible sub-lethal mutation frequencies, as low as 4 × 10<sup>−21</sup> and 1 × 10<sup>−11</sup> for A3G and A3F mutations, respectively, which is significantly lower than the frequency of mutations arising from RT (39). Most reports suggest that HIV can experience both a beneficial and a harmful influence from A3G expression (47–50). Other studies reported that A3G is unlikely to drive HIV diversification or facilitate viral adaptation in vivo, and that A3G, even at low expression levels, is lethal for HIV (36, 51, 52).
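The "order of magnitude" comparison can be checked with a few lines of arithmetic. The sketch below simply converts the upper A3G figure quoted here (five mutations per kilobase) to a per-nucleotide rate and divides it by the RT error-rate range given earlier in this Introduction (10<sup>−4</sup> to 10<sup>−5</sup> per nucleotide per replication cycle). It assumes, for illustration only, that the two figures are directly comparable per replication cycle; the variable names are hypothetical.

```python
# Figures quoted in the text; treating them as directly comparable
# per replication cycle is an assumption made for this illustration.
a3g_per_kb = 5              # up to five A3G-induced mutations per kilobase
rt_rates = (1e-4, 1e-5)     # RT error rate per nucleotide per replication cycle

a3g_per_nt = a3g_per_kb / 1000          # mutations per nucleotide
fold_over_rt = [a3g_per_nt / rate for rate in rt_rates]

for rate, fold in zip(rt_rates, fold_over_rt):
    print(f"RT rate {rate:.0e}: A3G upper bound is about {fold:.0f}-fold higher")
```

On these numbers the A3G upper bound is roughly 50- to 500-fold above the RT error rate, consistent with the statement that it is at least an order of magnitude higher.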

Most studies reporting the impressively high load of A3G mutation were carried out using Vif-deficient HIV because A3G is antagonized by the HIV-encoded accessory protein Vif (53–56). Although Vif is necessary for HIV replication in A3G-expressing cells, it is not required in A3G-deficient cells (8, 53–58). Vif binding mediates proteasomal degradation of A3G, but it can also downregulate the translation of A3G (59–61). It has been reported that the accessory protein Vpr can also bind A3G and mediate its proteasomal degradation (62). Thus, in the presence of HIV's full complement of accessory anti-A3G factors, only low levels of mutations are induced by A3G (47, 49, 59, 63, 64).

HIV-specific CD8<sup>+</sup> cytotoxic T cell (CTL) responses and their human leukocyte antigen (HLA) restriction are crucial determinants of viral containment following the initial innate immune response (65, 66). Multiple parameters such as HLA genotype, virus sequence, and T cell receptor repertoire contribute to CTL response effectiveness (67–72). Despite the significant protective role of CTLs in limiting HIV replication, the immune system ultimately fails to clear HIV, at least in part because of mutations within or adjacent to CTL epitopes during the early and chronic phases of disease progression (73). HIV is under intense evolutionary pressure for escape mutations that lead to evasion of CTL killing, and CTL escape mutations are a major force driving viral evolution in acute/early chronic infection (74, 75). These mutations can be located either inside or outside CTL epitopes and can appear quickly or slowly, but to be selected, they must strike a balance between the cost to viral replicative fitness and the benefit of escape from CTLs (76–80). Mutations that facilitate immune evasion are positively selected and become dominant in the viral population (73, 81, 82). On the other hand, amino acid alterations under immune pressure can sometimes even elicit a de novo immune response (83).

CTL escape mutations can act through several mechanisms: by reducing or abrogating binding of viral epitopes to HLA class I, by disrupting intracellular epitope processing, or by altering recognition by T cell receptors (84–89). Viral proteins in infected cells are first proteolytically degraded in the cytosol by immunoproteasomes (86, 90, 91). The peptide products of proteasomal degradation, including epitope precursors, can be up to 32 amino acids long; however, immunoproteasomes are inclined to generate longer peptides ending with C-terminal hydrophobic residues that are anchors for most HLA class I molecules. After post-proteasomal degradation, epitope precursors, typically 8–16 amino acids in length, are transferred by the transporter associated with antigen processing (TAP1 and TAP2) into the endoplasmic reticulum (ER) lumen, where HLA class I molecules are folded and assembled (92). Further N-terminal trimming by enzymes such as the ER aminopeptidases (ERAP1 and 2) can occur in the ER so that peptides fit the groove of the restricting HLA class I molecules (93, 94). The peptide-HLA complex is subsequently transported to the cell surface to be recognized by CTLs. Thus, proteasomal degradation and antigen processing are key determinants of epitope availability for the anti-HIV CTL response (95). In addition to mutations in epitopes, immune escape can also occur through mutation of the flanking regions that impact proteasomal processing (96). It has been shown that the robustness of the anti-HIV CTL response correlates with the number of epitopes generated by proteasomal cleavage, and thus, mutations that impact processing can drastically influence CTL escape (77–79, 89, 97–100).

The most restrictive step in antigen processing is the peptide's ability to bind to the specific expressed set of HLA class I molecules using N- and C-terminal anchor residues that bind into the groove of a specific HLA class I molecule (101). Mutations at anchor residues can disrupt HLA class I binding, whereas other mutations such as those at the central bulge of the peptide (normally residues 4–6 in the canonical 9-mer peptide) interfere with TCR recognition of the HLA-peptide complex (73, 84, 96, 102). It remains to be fully understood how the host's condition shapes the availability of beneficial mutations; however, HLA profile is a major CTL escape driver (85, 103–105). HLA-B is the most protective among all three HLA class I loci (A, B, and C) (106). CTL escape and reversion pathways are more closely associated with epitopes restricted to protective HLA alleles such as HLA-B27 and -B57, which are associated with slower disease progression and more robust HIV-specific CTL responses (78, 107). Escape mutations in these HLA-restricted epitopes incur a high fitness cost by reducing viral replicative capacity (108). In contrast, HLA-B35 and -B8 are associated with rapid disease progression based on their ability to present less effective epitopes to TCRs (109–113).

Mutational diversity of HIV's genome has a crucial role in evasion of immune recognition, and multiple studies have implicated A3G as an important player in the interplay between the adaptive anti-viral CTL response, viral adaptation, and immune escape (114–119). A3G has been proposed to induce CTL escape in two ways, either by directly mutating CTL epitopes or by causing mutations outside epitopes which influence the peptide degradation and HLA presentation of wild-type CTL epitopes (117, 119–121). Several reports describe A3G-induced mutations located within or flanking CTL epitopes. One report found remarkable evidence for enrichment of non-synonymous amino acid substitutions by A3G in the anchor or proximal amino acid residues of HLA-restricted epitopes that are important in epitope processing, leading to immune escape (115). Consistent with a role for A3G in CTL escape, an earlier bioinformatics study reported a reduction in CTL recognition as a result of A3G mutation in epitopes (119). On the other hand, it was demonstrated that increasing the turnover of truncated HLA-restricted peptides, generated due to the action of A3G, can enhance the CTL response in a mouse model of CTL responses to HIV (120). We previously measured CTL recognition of wild-type or A3G-mutated epitopes ex vivo by CTLs from HIV-infected individuals. We considered a limited subset of CTL epitopes known to elicit CTL recognition, and we focused on A3G-induced mutations in epitope residue positions 3, 5, and 7, which would mainly impact TCR recognition. We found that in the vast majority of instances, A3G-induced mutations in CTL epitopes abrogated CTL recognition of epitopes in an HLA-dependent manner (114, 117, 118). Moreover, we showed that A3G mutational hotspots are enriched in the viral genomic sequences encoding immunogenic CTL epitopes in Gag, Pol, and Nef (117). This is in agreement with the earlier study that also found enriched A3G hotspot motifs within the rapidly diversifying CTL escape sites in Env (119). Interestingly, and in contrast to our findings on A3G hotspot motif enrichment in CTL epitope-encoding sequences, another study reported that A3G hotspots are less frequently located at genomic locations encoding the V1-V5 region, the most variable regions of the gp120 envelope protein, in order to hold in reserve its potential mutational capacity for long-term adaptation of HIV to the antibody response (122).

Most of our knowledge about epitope-specific CTL responses in chronic HIV infection comes from studies using the standard IFN-γ ELISpot assay (123). Besides experimental approaches, a variety of computational tools for prediction of mutations and their impact have provided valuable information. Different sets of algorithms such as artificial neural networks (ANN), average relative binding (ARB), stabilizing matrix method (SMM) and others have enabled prediction of CTL epitopes within the viral proteome based on HLA-binding affinity (124, 125). Web-based HLA binding predictors have broad allelic coverage, with as much as 90–95% accuracy (124–128). HLA binding and subsequent recognition by TCR are the most selective steps in the peptide presentation pathway (129). However, other processes upstream of HLA binding, such as proteasomal cleavage, TAP transport, and the stability of the peptide-HLA complex, also shape viral epitope availability (130). Prediction tools such as NetChop and PAProC have been developed, based on protein degradation by purified proteasomes, to predict potential cleavage sites (131–134). The reliability of these tools has been demonstrated (135–138).

Thus, mutations that impact protein proteasomal processing and/or epitopes' HLA binding can lead to loss of CTL recognition and immune escape (88, 89, 96, 139, 140), but the extent to which A3G mutations could potentially impact each of the successive stages of CTL epitope generation and presentation is not known. Here we utilized the aforementioned computational tools to construct a comprehensive CTL epitope map of HIV based on the steps of antigen presentation: proteasomal cleavage, TAP transporter efficiency and HLA-binding affinity. We simulated all possible A3G-mediated mutations within and outside CTL epitope-encoding sequences of the HIV genome. We then examined predicted consequences for CTL epitope generation. We also probed whether the positions and predicted consequences of A3G-mediated mutations are random, or rather indicative of co-evolution of the HIV genome with the action of the host mutator. In cases where such co-evolution was observed, we studied the depth and extent of the strategies used by the HIV genome to influence the outcomes of A3G activity. Since our experimental system is devoid of the immense in vivo selection pressure for CTL escape, and hence able to predict enhanced generation of CTL epitopes as well as the opposite scenario of immune evasion without bias, the analysis provides a unique lens for considering how viral genomes co-evolve with host restriction factors.

### MATERIALS AND METHODS

#### A3G-Induced Mutation Simulation

Simulation of A3G-induced mutations was carried out as previously described (117). Briefly, the whole genome of the HIV-1 isolate HXB2 BRU was obtained from NCBI. This sequence was chosen since it was used in previous works and model systems that studied the role of A3G in HIV CTL escape (117, 119). A3G-induced mutations (G-to-A) at the 5′-most dG of A3G's trinucleotide hotspot motifs, read on the plus-sense strand (GGG, GGA, and GGT), were manually simulated and translated to amino acid sequence. For this analysis, we considered first-round mutations. For multiple back-to-back A3G hotspots, all possible amino acid alteration consequences of A3G-induced mutations were considered.
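The simulation described above was performed manually; as an illustration only, the same rule can be sketched programmatically. The code below scans a plus-strand coding sequence for the GGG, GGA, and GGT motifs, changes the 5′-most G of each motif to A one site at a time (first-round, single mutations), and reports the resulting amino acid change. The function names and the nine-nucleotide toy sequence are hypothetical, the input is assumed to be uppercase, in-frame DNA, and combinations of mutations at adjacent hotspots are not enumerated here.

```python
import re

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Plus-strand hotspots named in the text: GGG, GGA, GGT (5'-most G is the target).
# A lookahead is used so that overlapping motifs are all reported.
HOTSPOT = re.compile(r"(?=GG[GAT])")


def translate(dna):
    """Translate in-frame DNA, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def first_round_mutants(dna):
    """Yield (position, mutant) for each single G-to-A change at a hotspot."""
    for match in HOTSPOT.finditer(dna):
        pos = match.start()
        yield pos, dna[:pos] + "A" + dna[pos + 1:]


def amino_acid_changes(dna):
    """List (nucleotide position, wild-type residue, mutant residue)."""
    wild_type = translate(dna)
    changes = []
    for pos, mutant in first_round_mutants(dna):
        codon_index = pos // 3
        if codon_index < len(wild_type):  # skip hits in a trailing partial codon
            changes.append((pos, wild_type[codon_index], translate(mutant)[codon_index]))
    return changes


# Toy sequence (hypothetical): ATG TGG GAT encodes M-W-D.
print(amino_acid_changes("ATGTGGGAT"))
```

In the toy sequence both hotspot positions fall in the tryptophan codon, and either single change (TGG to TAG, or TGG to TGA) produces a stop codon, in line with the premature stop codons discussed in the Introduction. A tuple whose two residues are identical would indicate a synonymous change.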

#### Prediction of CTL Epitopes, HLA Binding, and Proteasomal Cleavage

To generate a comprehensive list of all potential CTL epitopes of HIV, we considered all HIV peptides that are predicted to be efficiently processed by proteasomes and also to bind to HLA class I molecules. We identified the portions of the HIV genome encoding known CTL epitopes using the HIV Molecular Immunology Database (http://www.hiv.lanl.gov/content/immunology/tables/ctl\_summary.html). We evaluated the predicted MHC binding affinity of wild-type and A3G-mutated variant CTL epitope sequences of Gag, Pol, Env, and Nef restricted to HLA-A02:01, -A03:01, -B57:01, and -B35:01. We utilized epitope prediction algorithms that enabled us to investigate the impact of mutations at A3G hotspots, within or in the flanking regions of the predicted and known epitopes, on HLA binding affinity and epitope processing. Here we used NetMHCpan 4 (http://www.cbs.dtu.dk/services/NetMHCpan/), which uses artificial neural networks (ANN), to construct a fine CTL epitope map based on HLA-I binding. The NetMHCpan 4 server predicts binding of peptides to any HLA molecule of a known sequence using ANNs (127, 136, 141). Then the Immune Epitope Database (IEDB) server (http://tools.iedb.org/processing/) was applied for further prediction based on proteasomal cleavage, TAP transporter efficiency and HLA binding affinity to improve the selection of potential epitopes. To evaluate the impact of A3G alterations on HLA-binding, we only considered predicted epitopes with a high-rank score between 0 and 0.5 percentile as strong HLA binders and 0.5–2.0 percentile as weak HLA binders (136). We then calculated a Delta from the wild-type sequence HLA binding score to evaluate the change in predicted HLA affinity caused by A3G-induced mutation. We set a Delta of 0.1 as a threshold of significant difference for enhanced or diminished HLA binding affinity, based on distribution analysis of the difference values. We used NetChop 3.1 (http://www.cbs.dtu.dk/services/NetChop/) to display the impact of the mutation on proteasomal cleavage. The program's C-term 3.0 network is trained on a database consisting of 1260 publicly available HLA class I ligands (using only the C-terminal cleavage site of the ligands). The highest probabilities of cleavage (threshold set at 0.5) were applied based on default program recommendation (142). To predict proteasomal cleavage sites, wild-type and A3G-induced mutated polypeptides of Gag, Pol, Env, and Nef were submitted to NetChop 4.
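The binder classification and the Delta threshold described above can be illustrated with a minimal sketch. Several details are assumptions made here for illustration and are not stated in the text: Delta is taken as the mutant percentile rank minus the wild-type percentile rank, a higher rank is read as weaker binding, and the hypothetical helpers `classify_binder` and `binding_change` do not call NetMHCpan itself but operate on ranks already obtained from it.

```python
def classify_binder(rank):
    """Classify a percentile rank (lower rank means stronger predicted binding)."""
    if rank < 0:
        raise ValueError("percentile rank cannot be negative")
    if rank <= 0.5:
        return "strong"
    if rank <= 2.0:
        return "weak"
    return "non-binder"


def binding_change(wt_rank, mut_rank, threshold=0.1):
    """Return (delta, label) for a wild-type/mutant pair of percentile ranks.

    Assumption: delta = mutant rank - wild-type rank, so a positive delta
    means the mutant is predicted to bind the HLA molecule less well.
    """
    delta = mut_rank - wt_rank
    if abs(delta) < threshold:
        return delta, "unchanged"
    return delta, ("diminished" if delta > 0 else "enhanced")
```

Under these assumptions, a wild-type epitope with rank 0.3 (strong binder) whose mutant has rank 1.5 (weak binder) would be scored as diminished binding, whereas a change from 0.50 to 0.55 falls below the 0.1 threshold and is treated as unchanged.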

#### Analysis of A3G Hotspot Frequency in Sequences Encoding CTL Epitopes at the Nucleotide Level and Prediction of Amino Acid Alteration Consequences

To investigate the enrichment of A3G hotspots in sequences encoding CTL epitopes of Gag, Pol, Env, and Nef restricted to HLA-A2:01, HLA-A3:01, HLA-B57:01, and HLA-B35:01, we counted the number of A3G hotspot motifs (GGA, GGG, and GGT) in CTL epitope-encoding regions vs. non-CTL epitope-encoding sequences. We normalized for gene size by dividing the frequency of hotspots by the total number of analyzed nucleotides in each gene. The normalized hotspot frequencies at the nucleotide level were calculated for sequences encoding CTL epitopes restricted to each individual HLA, and for non-CTL epitope-encoding sequences. Then, the ratio of hotspot frequency inside to outside epitope-encoding sequences was determined for each A3G hotspot motif and each restricting HLA. As controls to evaluate potential A3G hotspot enrichment in sequences encoding CTL epitopes, we conducted a parallel analysis with randomly shuffled HIV genomic sequence using the "Shuffle DNA" function of the Sequence Manipulation Suite (http://www.bioinformatics.org/sms2/shuffle\_dna.html). The HIV sequence was randomly shuffled six independent times, and A3G hotspot enrichment analysis was performed for each hotspot motif and restricting HLA, using the same border locations as those of the CTL epitope-encoding sequences in the actual HIV genomic sequence. MATLAB was used to describe the distribution of amino acid alterations in epitope-surrounding regions, considering a 32-amino acid boundary around each 8–11 mer epitope, with a limit of either 2 or 4 A3G-induced amino acid changes on either the N- or C-terminal side of each CTL epitope within this boundary considered to be a clustered pattern. GraphPad Prism 5 was used to generate the schematic graph displaying the distribution of A3G-induced mutations within and in the flanking regions of CTL epitopes of Gag, Pol, Env, and Nef.
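The inside-to-outside enrichment ratio and its shuffled control can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's pipeline: a hotspot is assigned to a region by the position of its 5′-most G, epitope-encoding regions are given as 0-based half-open `(start, end)` intervals, and Python's `random` module stands in for the Sequence Manipulation Suite shuffling. The names `enrichment_ratio` and `shuffled_control` are hypothetical.

```python
import random


def enrichment_ratio(seq, regions, motif):
    """Ratio of per-nucleotide motif frequency inside vs. outside the regions."""
    inside = set()
    for start, end in regions:
        inside.update(range(start, end))
    hits_in = hits_out = 0
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == motif:
            if i in inside:
                hits_in += 1
            else:
                hits_out += 1
    n_in = len(inside)
    n_out = len(seq) - n_in
    if n_in == 0 or n_out == 0 or hits_out == 0:
        return float("nan")  # ratio undefined for this motif/region set
    return (hits_in / n_in) / (hits_out / n_out)


def shuffled_control(seq, regions, motif, n_shuffles=6, seed=0):
    """Repeat the analysis on shuffled sequence, keeping the region borders fixed."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_shuffles):
        bases = list(seq)
        rng.shuffle(bases)
        ratios.append(enrichment_ratio("".join(bases), regions, motif))
    return ratios
```

A ratio above 1 indicates more hotspots per nucleotide inside the epitope-encoding regions than outside; on a genome-length sequence the shuffled controls would be expected to scatter around 1.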

For analysis of amino acid alteration consequences of A3G enrichment in CTL-encoding sequences, affected amino acids (as a result of non-synonymous A3G-mediated mutations), or non-affected amino acids (as a result of silent A3G-mediated mutations) were determined, and frequencies of amino acid consequences inside or outside epitope-encoding sequences were normalized to the total amino acid number of regions of each polypeptide (Gag, Pol, Env, and Nef) present in epitope or non-epitope regions, for each restricting HLA. Then, the ratio of normalized non-synonymous to silent substitutions in CTL epitope encoding region to non-CTL epitope encoding region was calculated for A3G-induced mutations of Gag, Pol, Env, and Nef, restricted to each HLA. The ratio of non-synonymous and silent A3G-mediated amino acid substitutions in the CTL epitope encoding region to non-CTL epitope encoding region was calculated in two ways for each individual A3G hotspot motif. First, the ratio of A3G-induced mutations resulting in non-synonymous residue changes to silent mutations inside CTL epitopes was divided by the same ratio determined for regions of each polypeptide located outside CTL epitopes. Second, the frequency of A3G-mediated non-synonymous mutations inside CTL epitopes was divided by the frequency of A3G-mediated non-synonymous mutations outside CTL epitopes, ignoring silent consequences.
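The two ratios defined above reduce to simple arithmetic on four mutation counts and two region lengths. The sketch below is illustrative only; the function name `channeling_ratios` and its argument names are hypothetical, and the counts in the usage example are made up rather than taken from the study.

```python
def channeling_ratios(ns_in, silent_in, ns_out, silent_out, aa_in, aa_out):
    """Return (ratio_1, ratio_2) for one polypeptide, motif, and restricting HLA.

    ratio_1: (non-synonymous / silent) inside epitopes divided by the same
             quantity outside epitopes.
    ratio_2: non-synonymous frequency per residue inside epitopes divided by
             the frequency outside epitopes, ignoring silent changes.
    aa_in and aa_out are the residue counts of epitope and non-epitope regions.
    """
    nan = float("nan")
    if silent_in and silent_out and ns_out:
        ratio_1 = (ns_in / silent_in) / (ns_out / silent_out)
    else:
        ratio_1 = nan  # undefined when a denominator is zero
    if aa_in and aa_out and ns_out:
        ratio_2 = (ns_in / aa_in) / (ns_out / aa_out)
    else:
        ratio_2 = nan
    return ratio_1, ratio_2
```

With hypothetical counts of 6 non-synonymous and 2 silent changes inside epitopes (100 residues) and 3 of each outside (200 residues), the first ratio is 3 and the second is 4, both indicating preferential non-synonymous outcomes inside epitopes.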

To analyze the distribution pattern of A3G hotspot positioning in the HIV genome, we divided the entire length of the genome into 60 bp stretches and counted the number of A3G hotspots whose mutation would result in non-synonymous amino acid changes or stop codon generation. This pattern was plotted and compared against the known mutational gradient of the HIV genome as previously described (35, 143). In the same manner, to analyze the distribution pattern of A3G hotspots whose mutation is predicted to result in CTL escape, we considered, for each HLA, A3G hotspots that fall in CTL epitope-encoding sequences and whose mutation caused a predicted decrease in HLA binding affinity, as described above. For each HLA, we also generated a map of the positions of CTL epitope-encoding sequences across the entire HIV genome. We derived a map of normalized escape potential which we defined as the number of CTL-escape inducing A3G hotspots in each 60 bp segment, normalized (divided) by the total number of A3G hotspots within the segment. Based on the normalized escape potential and the number of CTL epitopes encoded in each 60 bp segment, for each HLA, we derived a map of escape factor which we defined as the product of the number of CTL epitopes and normalized escape potential. Thus, the escape factor value represents the potential for A3G-induced mutations to generate CTL-escape in any given 60 bp increment of the HIV genome.
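The normalized escape potential and escape factor defined above can be computed per 60 bp window as in the following sketch. It follows the definitions in this section but makes some assumptions of its own: positions are 0-based nucleotide coordinates, epitope-encoding regions are half-open `(start, end)` intervals, an epitope is counted in every window it overlaps, and windows without any hotspot are given a potential of 0. The function name `escape_maps` is hypothetical.

```python
def escape_maps(genome_length, hotspots, escape_hotspots, epitopes, window=60):
    """Return (normalized_escape_potential, escape_factor), one value per window.

    hotspots:        positions of all A3G hotspots
    escape_hotspots: subset predicted to lower HLA binding when mutated
    epitopes:        (start, end) intervals of CTL epitope-encoding sequences
    """
    n_windows = -(-genome_length // window)  # ceiling division
    total = [0] * n_windows
    escaping = [0] * n_windows
    n_epitopes = [0] * n_windows
    for pos in hotspots:
        total[pos // window] += 1
    for pos in escape_hotspots:
        escaping[pos // window] += 1
    for start, end in epitopes:
        for w in range(start // window, (end - 1) // window + 1):
            n_epitopes[w] += 1
    # Escape-inducing hotspots divided by all hotspots in the same window.
    potential = [e / t if t else 0.0 for e, t in zip(escaping, total)]
    # Escape factor: number of epitopes times normalized escape potential.
    factor = [n * p for n, p in zip(n_epitopes, potential)]
    return potential, factor
```

For instance, a window holding two hotspots, one of which is escape-inducing, and overlapping two epitope-encoding sequences has a normalized escape potential of 0.5 and an escape factor of 1.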

### RESULTS

#### Potential Wild-Type and A3G-Mutated CTL Epitopes in Gag, Pol, Env, and Nef

We utilized NetMHCpan 4 and IEDB using ANN to construct a CTL epitope map based on proteasomal cleavage, TAP transporter efficiency and HLA-binding affinity. Using the entire peptide sequences of Gag, Pol, Env, and Nef, we generated a list of all potential HLA-binding peptide epitopes. Although the binding affinity data covers 172 HLA molecules (136), we restricted our analysis to HLA-A02:01, -A03:01, -B57:01, and -B35:01 because HLA-A02:01 and -A03:01 are frequent in the population, and HLA-B57:01 and -B35:01 correlate with robust and weak HIV-specific CTL responses, respectively. Because potential epitopes are defined through the presence of HLA-binding anchor residues in the HIV proteome, the number of putative epitopes is significantly higher than the number that actually elicit CTL responses, owing to limitations in epitope processing and presentation to the TCR, and to the many complex physiological and immune response dynamics underlying the CTL response that cannot be accounted for by epitope prediction algorithms (144). Nevertheless, we noted that the set of epitopes that we generated on the basis of HLA binding and proteasomal processing predictions included the majority (∼70%) of experimentally-verified CTL epitopes listed in the HIV Molecular Immunology Database (**Table S1**). In addition to the predicted set of epitopes, we also included in our analyses experimentally-known CTL epitopes.

Thus, we generated an epitope list which includes all potential CTL epitopes restricted by HLA-A2:01, -A3:01, -B57:01, and -B35:01. In total, for Gag, Pol, Env, and Nef, we identified 14-12-14-10, 19-33-26-21, 22-14-20-8, and 8-6-9-8 epitopes restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01, respectively (**Table 1** and **Table S1**). To dissect the role of A3G at each step of CTL epitope generation, we simulated A3G mutations in sequences encoding Gag, Pol, Env, and Nef. A3G mutations (G-to-A) on the 5′-most dG in A3G hotspot trinucleotide motifs (GGG, GGA, and GGT) were simulated and translated to the peptide sequence (**Figure 1**). We found 16-16-21-13 (Pol), 8-10-11-7 (Gag), 5-2-6-5 (Nef), and 13-8-15-3 (Env) epitopes restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01, respectively, whose encoding sequences contain A3G hotspots. After simulation of A3G-induced mutations at these motifs, we identified 33-33-44-20 possible mutated epitopes restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01 for Pol, of which 25 alterations include stop codons (**Table 1** and **Table S1**). These numbers were 8-20-22-8 and 12 stop codons for Gag, 11-4-14-15 and 9 stop codons for Nef, and 28-20-31-6 and 29 stop codons for Env. These results indicate that A3G-induced mutations can potentially alter CTL epitopes restricted to all four examined HLAs in Gag, Pol, Env, and Nef (**Table 1** and **Table S1**). Considering all predicted wild-type epitopes, Gag, Pol, Env, and Nef contained 21, 40, 26, and 12% of the predicted CTL epitopes, respectively, and 25, 30, 26, and 19% of all CTL epitopes were restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01, respectively. After simulating A3G-induced mutations, 22, 42, 25, and 11% of all mutated epitopes came from Gag, Pol, Env, and Nef, and 26, 23, 33, and 18% of all epitopes were restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01, respectively. This initial comparison between the distribution of wild-type vs. A3G-mutated CTL epitopes suggested a bias for A3G-mediated mutations in HLA-B57 restricted epitopes, consistent with previous suggestions for a role of A3G in mediating decreased CTL recognition for peptides restricted to protective HLAs such as B57 (117, 118).

#### The Potential Impact of A3G-Induced Mutations on HLA Binding Affinity

To examine the specific impacts of A3G-induced mutations on HLA-binding, we considered predicted epitopes with a high-rank NetMHCpan 4 score between 0 and 0.5 percentile as strong HLA binders and 0.5–2.0 percentile as weak HLA binders, according to default parameters of the prediction algorithm (136) (**Figure 1**, **Table S1**). However, we noted that 30% of experimentally-verified epitopes exhibit out of range and low HLA binding affinity scores; hence, their aforementioned absence in the total predicted pool of CTL epitopes (**Table S1**). To evaluate the change in predicted HLA affinity that occurred as a result of each A3G-induced mutation, we calculated a Delta value from the wild-type sequence HLA binding score. We set 0.1 as a threshold of difference for increased or reduced HLA binding affinity because below this value poor correlation was observed between the predicted HLA affinity rank and absolute nM affinities.

an evolutionary imprint on the HIV genome by studying the pattern and distribution of A3G hotspots across the HIV genome, in the context of their potential for facilitating the generation of CTL escape mutants (Top box).

Although A3G-induced stop codons would not lead to infectious virus production, viral genomes containing stop codons can produce immunogenic truncated peptides which contain CTL epitopes (120). Thus, we considered all A3G mutations, either including stop codon generators (**Figure 2A**, top panel) or excluding them (**Figure 2A**, bottom panel). Considering all A3G-induced mutations, 25, 46, 32, and 14% of HLA-A2:01-, -A3:01-, -B57:01-, and -B35:01-restricted epitopes, respectively, exhibited increased HLA-binding affinity after simulated A3G mutation. Conversely, 75, 54, 68, and 86% of HLA-A2:01-, -A3:01-, -B57:01-, and -B35:01-restricted epitopes exhibited decreased HLA-binding affinities as a result of A3G-induced mutations (**Figure 2A**, top panel). Excluding A3G-mediated stop codons, 32, 51, 62, and 24% of HLA-A2:01-, -A3:01-, -B57:01-, and -B35:01-restricted epitopes exhibited increased HLA-binding affinity after simulated A3G mutation, whilst 68, 49, 38, and 76% of A3G-induced mutations in HLA-A2:01-, -A3:01-, -B57:01-, and -B35:01-restricted epitopes led to decreased HLA-binding affinities (**Figure 2A**, bottom panel).

Next, we sought to break down the effect of A3G-induced mutations at epitope-HLA anchor vs. non-anchor residues. A3G-induced mutations in non-anchor residues were predicted to lead to enhanced, diminished, or abolished HLA binding affinity for epitopes restricted to all 4 HLAs, with diminished/abolished HLA binding being the most common predicted outcome for HLA-A2:01-, -A3:01-, and -B35:01-restricted epitopes (74, 53, and 80% of all A3G mutations, respectively); for HLA-B57:01, however, that outcome was nearly equal to the potential for enhanced HLA affinity (35%) (**Figure 2B**). Escape mutations that reduce class I HLA binding commonly occur at the N-terminal (amino acid position 2 in the peptide epitope) and/or the C-terminal (e.g., amino acid position 9 in a 9mer peptide epitope) anchor residues in most epitopes (84, 96, 145). Thus, we then considered A3G-induced mutations that target the anchor residues. A3G-induced mutations in anchor residues were predicted to lead to enhanced, diminished or abolished HLA binding affinity for epitopes restricted to all 4 HLAs, with diminished/abolished HLA binding being the most common predicted outcome for HLA-B57:01- and -B35:01-restricted epitopes (33 and 7% of all A3G mutations, respectively). In contrast, for HLA-A2:01 and -A3:01, A3G-induced mutations at anchor residues mediated diminished/abolished HLA binding affinity at the same or much lower levels than enhanced HLA binding affinities (2 and 1%, respectively) (**Figure 2C**).

TABLE 1 | Summary of the number of potential CTL epitopes restricted to HLA-A2:01, HLA-A3:01, HLA-B57:01, and HLA-B35:01 for wild-type and A3G-mediated mutated Gag, Pol, Env, and Nef proteins.

Considering all epitopes that were within the aforementioned HLA-binding threshold range, 2 and 23% of A3G-induced amino acid changes targeted the N- and C-terminal anchor positions, respectively. Twenty-three percent is approximately two-fold higher than expected by random chance (84, 146): if A3G-induced mutations were equally distributed amongst all residues in a pool of 8–11 mer peptides, then each residue ought to have a ∼10% probability of being targeted. In total, A3G can potentially generate a stop codon in 30% of all epitopes that contain A3G hotspots. Of these, 43% led to a stop codon at the most C-terminal position, of which the overwhelming majority (92%) were restricted to HLA-B57:01, since tryptophan (the TGG codon) is the C-terminal anchor for HLA-B57:01 (**Figure 2A**, compare top and bottom panels). This codon, which has a high likelihood of containing an A3G hotspot motif dependent on the next downstream nucleotide, is the most susceptible codon for generating a stop due to G-to-A mutation (37, 38). Also, we observed that 30% of all potential A3G-induced mutations in CTL epitopes were located at residue positions 3, 5, and 7, which are key for TCR recognition (147), while 40% of A3G-mediated substitutions targeted residue positions 1, 4, 6, 8, 9, and 10. These results indicate that the N- and C-terminal anchor residues are under- and over-targeted by A3G for mutation, respectively, whilst the middle positions are apparently equally targeted. Furthermore, the increased A3G targeting of the most C-terminal anchor residue reflects its overwhelming propensity for stop codon generation in HLA-B57:01-restricted epitopes (**Figure 2A**). Based on these analyses, we conclude that A3G-induced mutations can increase or decrease HLA-binding affinities of potential CTL epitopes; however, the major outcome considering all mutations (non-synonymous amino acid changes and stop codons) at all residues (anchor and non-anchor) was decreased HLA-binding affinity.
These results are consistent with previous observations that the CTL-epitope-encoding sequences of HIV have evolved to channel A3G-induced mutations to mediate CTL escape. On the other hand, we also observed the generation of 18 neo-epitopes based on enhanced HLA-binding affinity.
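The random-chance expectation quoted above can be checked with a short calculation, assuming, as stated in the text, that a substitution is equally likely to fall on any residue of a peptide of length $L$:

$$
p_{\text{uniform}}(L) = \frac{1}{L}, \qquad L \in \{8, 9, 10, 11\} \;\Rightarrow\; 0.091 \le p_{\text{uniform}}(L) \le 0.125 \quad (\sim 10\%)
$$

$$
\frac{p_{\text{observed, C-terminal anchor}}}{p_{\text{uniform}}} \approx \frac{0.23}{0.10} = 2.3, \qquad \frac{p_{\text{observed, N-terminal anchor}}}{p_{\text{uniform}}} \approx \frac{0.02}{0.10} = 0.2
$$

Under this uniform model, the C-terminal anchor is targeted roughly twice as often as expected and the N-terminal anchor roughly five times less often.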

#### The Role of A3G-Induced Mutations on Proteasomal Processing of Epitopes

We utilized NetChop 3.1 to examine the impact of A3G-induced mutations on the proteasomal processing of HIV proteins, which is the step before peptide epitope generation for HLA binding. We submitted the entire sequences of Gag, Pol, Env, and Nef to NetChop, either in wild-type format, or including all possible A3G-induced mutations. On the entire peptide sequence, we overlaid the map of HLA-A2:01-, -A3:01-, -B57:01-, and -B35:01-restricted CTL epitopes. Each residue within wild-type or A3G-mutated Gag, Pol, Env, and Nef proteins was then assigned a cleavage prediction score (a score above the default threshold of 0.5 is considered a cleavage site) (142), and scores at each residue position were compared between wild-type and A3G-mutated proteins (**Supplementary File 1**: Excel table). For this analysis, we considered two categories of A3G-induced mutations: those that fell within individual CTL epitopes, or those that fell outside but within six amino acids upstream or downstream of the N- or C-terminal residues of the epitope (148). A3G-induced mutations that generated new/enhanced cleavage sites within a CTL epitope, or abolished/decreased cleavage within the six amino acids adjacent to the epitope, would likely lead to diminished proteasomal processing of the epitope. Conversely, A3G mutations that enhanced cleavage in the adjacent region of an epitope or abolished/decreased cleavage within the epitope itself would likely lead to enhanced proteasomal processing of the epitope.
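The decision rule in the preceding paragraph can be written as a small function. This is a sketch under assumptions that go beyond the text: residue positions are 0-based, the epitope spans the half-open interval `[epitope_start, epitope_end)`, any increase or decrease in the cleavage score counts as enhanced or decreased cleavage (the 0.5 site threshold is not applied), and the epitope's own C-terminal residue is treated like any other internal position. The function name `processing_effect` is hypothetical, and the scores are assumed to come from a prior NetChop run.

```python
def processing_effect(position, wt_score, mut_score,
                      epitope_start, epitope_end, flank=6):
    """Predicted effect of a cleavage-score change on processing of one epitope."""
    change = mut_score - wt_score
    if change == 0:
        return "none"
    inside = epitope_start <= position < epitope_end
    adjacent = (epitope_start - flank <= position < epitope_start
                or epitope_end <= position < epitope_end + flank)
    if not (inside or adjacent):
        return "not considered"
    gained = change > 0
    if inside:
        # Extra cleavage within the epitope destroys it; lost cleavage spares it.
        return "diminished" if gained else "enhanced"
    # Extra cleavage in the flank helps release the epitope; lost cleavage hinders it.
    return "enhanced" if gained else "diminished"
```

For an epitope spanning residues 10 to 18, a score rising from 0.2 to 0.8 at residue 12 would be classed as diminished processing, whereas the same rise at residue 7 (in the upstream flank) would be classed as enhanced processing.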

In this manner, we quantified the impact of A3G-induced mutations on CTL epitope proteasomal cleavage (**Supplementary File 1**, **Table 2**). Considering all A3G-mediated mutations, for epitopes restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01, respectively, 42, 53, 43, and 45% of all A3G-induced mutation events resulted in decreased predicted proteasomal processing, whilst 58, 47, 57, and 55% of mutations resulted in generation of sites predicted to enhance proteasomal processing. If epitopes were categorized by protein of origin rather than restricting HLA, for Gag, Pol, Env, and Nef, respectively, 54, 43, 41, and 53% of all A3G-induced mutations resulted in decreased predicted proteasomal processing, whilst 46, 57, 59, and 47% resulted in enhanced predicted processing. Excluding A3G-mediated stop codon generation, for epitopes restricted to HLA-A2:01, -A3:01, -B57:01, and -B35:01, respectively, 43, 56, 53, and 43% of all A3G-induced mutations resulted in decreased predicted proteasomal processing, whilst 57, 44, 43, and 57% resulted in enhanced proteasomal processing.

In total, when A3G-mediated stop codon generation is included, 54% of all A3G-induced mutations that could potentially impact proteasomal cleavage were predicted to lead to enhanced CTL epitope production, whilst 46% could potentially decrease CTL epitope production. These numbers are 51 and 49%, respectively, when A3G-mediated stop codon generation is excluded (**Table 2**). These results indicate that there has not been a strong evolutionary pressure maintained on the viral genome for utilizing A3G toward CTL escape at the level of modulating proteasomal processing for CTL epitope generation.

We then investigated whether mutations mediated by A3G are clustered around the CTL epitopes, with more A3G hotspots being present either near the N- or C-terminal boundaries of epitopes, than expected at random. Considering a limit of 4 mutational A3G hotspots (A3G hotspots whose mutation would lead to non-synonymous substitutions) in 32 residues around each epitope, there appeared to be a marked paucity of such clustering (**Table 3**, **Figure S1**); however, when this limit was lowered to 2 mutational A3G hotspots, instances of clustering expectedly rose to 50–70% of epitopes. When all A3G hotspots were considered, this clustering proportion rose to 70–80%; thus, if non-mutational A3G hotspots can alter aspects of epitope production pre-translation (e.g., splicing, expression, etc.) this could be considered a significant trend.

#### Patterns and Consequences of A3G Hotspot Distribution Within or Outside CTL Epitope-Encoding Regions

We then investigated the enrichment of A3G hotspots (GGA, GGG, and GGT) inside vs. outside genomic sequences encoding CTL epitopes restricted to HLA-A2:01, HLA-A3:01, HLA-B57:01, and HLA-B35:01 in Gag, Pol, Env, and Nef genes (**Table S2**, **Figure 3**). First, we normalized for total nucleotide length of each gene and calculated the ratio of normalized A3G hotspot frequencies inside to outside epitope-encoding sequences for each protein's CTL epitopes. Thus, an inside:outside ratio >1 would be indicative of A3G hotspot enrichment in CTL-encoding sequences. As a control, we subjected the entire HIV genomic sequence to a random shuffling process, six independent times, but retained the positions/borders of the CTL epitope-encoding sequences. We then conducted the same analysis and, as expected, arrived at ratios of ∼1 (**Figure 3A**). We did not observe a generalizable trend of hotspot enrichment (ratio >1) in CTL-encoding sequences; however, when compared to the hypothetical ratio of 1 and the randomly shuffled control analyses, we noted that sequences encoding epitopes restricted to HLA-A2:01 and HLA-B57:01 often exhibited the highest enrichment ratios of 2–2.5 for at least 1–2 out of the 3 A3G hotspot motifs (**Table S2**, **Figure 3A**). These data are consistent with our previous observations that viral genomic sequences encoding more immunogenic CTL epitopes (restricted to more common HLA alleles, or those that elicit a more effective CTL response) have evolved to maintain A3G hotspots. Since A3G-mediated mutations which lead to stop codon generation would most often lead to non-infectious genomes, it is difficult to envision how maintenance of such A3G motifs to alter CTL epitopes at the cost of producing a non-infectious virus could be advantageous for the virus. Thus, in our enrichment analysis, which was conducted to measure the extent to which the viral genome has evolved to utilize A3G toward CTL escape, we excluded A3G hotspots that would lead to stop codons. Rather, we considered these separately by examining the frequency and positional distribution of stop codon-generating A3G motifs in Gag, Pol, Env, and Nef. We found that between 18 and 43% of stop codons are positioned in the first quarter of each peptide (**Figure S2**), and there was a general trend of more frequent A3G-mediated stop codon generation in Pol and Env, as compared to Gag and Nef.

Having examined gene sequence A3G hotspot enrichment, we sought to measure the potential consequences at the CTL epitope protein level. To this end, all simulated A3G-induced mutations were translated to protein sequences as described above. Since it is known that A3G can mutate the entire viral genome at low levels, epitopes with multiple hotspots and multiple mutated versions were considered independently. We then quantified A3G-induced non-synonymous and silent substitutions that fell within or outside of CTL epitopes of Gag, Pol, Env, and Nef restricted to HLA-A2:01, HLA-A3:01, HLA-B57:01 and HLA-B35:01. Next, we determined the ratio of A3G-induced mutations that caused non-synonymous residue changes to A3G-induced mutations which resulted in silent mutations within CTL epitopes and divided this by the same ratio determined for regions of each polypeptide that fell outside CTL epitopes. This analysis was carried out for each individual A3G hotspot motif, and as a total for all amino acids affected by A3G mutations within each polypeptide (**Table S2** and **Figure 3B**). Thus, a ratio of >1 would indicate that the genomic sequence of HIV has evolved to channel A3G-induced mutations into amino acid changes, more often within CTL epitope-encoding regions as compared to sequences outside these portions. Indeed, we observed numerous instances of significant preferential channeling (ratios up to 4.5) toward non-synonymous residue changes in CTL epitopes of Gag, Pol, Env, and Nef as a result of A3G-induced mutations (**Figure 3B**). If we ignored A3G-driven silent consequences and evaluated the ratio of only A3G-mediated non-synonymous mutations inside CTL epitope-encoding regions to A3G-mediated non-synonymous mutations outside CTL epitope-encoding regions, we observed that in 28/48 graphed bars (58% of all measurements) the ratio was ≥ 1, with some ratios in the 2–3 range (**Figure 3C**).
In general, the bias for A3G mutations to translate to non-synonymous rather than silent amino acid mutations was more pronounced for Gag, Pol and Nef as compared to Env, consistent with the former three polypeptides housing the vast majority of HIV's CTL epitopes (**Figures 3B**,**C** compared to **Figure 3A**).

In principle, non-synonymous amino acid changes arising from A3G mutations can enhance or diminish antigen presentation at the proteasomal processing and HLA binding levels (**Figure 2**, **Table 2**). To examine the distribution patterns of A3G hotspots that could potentially lead to CTL escape, we generated a map of all A3G hotspots across the entire HIV genome and overlaid this map on the experimentally-determined and well-known gradient of G to A mutations across the HIV genome (**Figure 4A**). In the context of A3G action, this twin gradient has been suggested to be a consequence of the HIV genome's replication dynamics. Certain portions of the HIV minus strand genome remain single-stranded for a longer period compared to other segments, due to dynamics of RNA digestion by RT, the role of the polypurine tracts (PPT), and subsequent positive sense strand polymerization. These segments are thus more available for A3G targeting, resulting in a mutation gradient (34, 35, 143). We observed that regions near the central PPT and the C-terminal end (Nef) that are more highly mutated are rich in A3G hotspots, consistent with the notion that the viral genome has positioned hotspots in genomic locations that are more prone to being targeted by the A3G enzyme (**Figure 4A**). For each HLA, we plotted the number of CTL epitope-encoding sequences (in Gag, Pol, Env, and Nef) at incremental positions along the entire viral genome length (**Figure 4B**, top graph of each panel). We also plotted a normalized escape potential graph which represents the likelihood that an A3G hotspot located inside a CTL epitope-encoding sequence can generate a CTL-escape mutation. This was calculated by counting the A3G hotspots predicted to lower HLA binding affinities and normalizing these by the total abundance of A3G hotspots in the given CTL epitope-encoding region (**Figure 4B**, middle graph of each panel).
Considering the number of CTL-epitopes encoded by a given genomic location (top panel), as well as the normalized escape potential of the sequences encoding this epitope (middle panel), we then generated an escape factor map which represents the compound potential for A3G to cause CTL escape across the HIV genome, for each HLA (**Figure 4B**, bottom graph of each panel). First, we noted that the potential for A3G-generated CTL escape was present throughout the length of the genome, for epitopes restricted to all 4 HLAs; however, it was generally more frequent in regions of the genome with a higher mutational potential and less frequent in regions known to be mutated at lower rates (comparing escape factor maps in **Figure 4B** to the mutational gradient in **Figure 4A**). Secondly, epitopes restricted to HLA-A2:01 and HLA-B57:01 exhibited overall higher abundance and frequent positioning of escape-inducing A3G hotspots, with HLA-B57:01-restricted epitopes also containing the highest escape factor values (**Figure 4B**). Thirdly, regions encoding for Gag, Pol and Nef contained generally a higher density of potentially CTL escape-inducing A3G hotspots, as compared to Env. The polypeptide and HLA-specific patterns observed are consistent with the A3G hotspot enrichment analysis (**Figure 3**) and taken together suggest the evolution of HIV genome to position A3G hotspot motifs in CTL-encoding regions, and highly mutable regions of the HIV genome, such that they preferentially yield CTL escape-inducing non-synonymous amino acid changes.

### DISCUSSION

Here, we aimed to follow up on previous works suggesting that A3G is a source of CTL escape-inducing mutations. We first mapped all potential CTL epitopes within Gag, Pol, Env, and Nef, and considered the impact of A3G-induced mutations on these epitopes. To this end, we embarked on a two-pronged analysis: first, from the immune recognition perspective, we examined the effect of A3G-induced mutations on the various stages of CTL epitope production, including proteasomal processing and HLA-binding affinities. Second, from the viral genome evolution perspective, we examined whether, where and to what consequence, A3G hotspots have been maintained or enriched in genomic sequences that encode for CTL epitopes. At each stage of all analyses, we considered three individual A3G hotspots (GGA, GGG, and GGT), and potential impact on CTL epitopes restricted to four HLA alleles that have been previously shown to have differential abilities to present immunogenic CTL peptides of HIV. Furthermore, opposite to the notion of CTL escape mediated by A3G-induced mutations, we also considered A3G mutations that can potentially generate novel or more immunogenic CTL epitopes. Despite a wealth of information about the role of A3Gs in CTL escape, knowledge of novel CTL epitopes mediated by endogenous mutators remains poor. We found that although A3G-mediated mutations could potentially enhance or diminish the proteasomal cleavage of Gag, Pol, Env, and Nef into CTL epitopes, the overwhelming impact on HLA binding affinities of CTL epitopes as a result of A3G mutations was decreased affinity.

Here we also provide strong and novel lines of evidence for the co-evolution of the HIV genome with A3G, so as to utilize this host factor toward CTL escape. First, A3G hotspot motifs were positioned in CTL-encoding epitopes so as to preferentially cause non-synonymous mutations. Secondly, most A3G-induced mutations in CTL epitopes resulted in diminished/abrogated HLA binding capacity. Thirdly, the distribution pattern of CTL escape-inducing A3G hotspots across the HIV genome varies with restricting HLAs and generally correlates with the known mutational gradient across the entire HIV genome. These observations shed light on the multiple layers of depth to which the HIV genome has resorted to position A3G hotspots for CTL escape. An earlier study which examined the overall pattern of A3G-mediated non-synonymous vs. silent mutations concluded that A3G has not left an evolutionary footprint on the HIV genome (149). This study broadly examined all A3G/F target motifs but not in the context of a specific biological force which may encourage genome evolution in response to A3G/F. In contrast, we argue that evidence for co-evolution of a pathogen's genome with a host factor may not be broadly apparent but must be sought for in the specific context of the pro/anti-viral biological impacts driven by the host factor and specific locations of the pathogen's genome impacted and under pressure by the host factor action. Thus, we considered HIV genome co-evolution with A3G/F in the context of the potential for CTL escape by focusing on sequences that encode for such epitopes. We find that when considered in this context, there is substantial evidence for the evolution of the HIV genome to subvert the activity of A3G/F toward its own gain. 
In our analyses of predicted immune response to HIV (HLA binding and proteasomal processing of CTL epitopes) we considered A3G-mediated stop codons; though these would yield non-infectious viruses, truncated proteins that are immunogenic could still be produced (120). On the other hand, we excluded A3G-mediated stop codons from our analyses of viral genome evolution (A3G hotspot enrichment and positioning) since they cannot be considered as an advantageous mode of utilizing A3G to the virus's benefit.
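The distinction drawn above between silent, non-synonymous, and stop-generating A3G-induced mutations can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The function name `classify_hotspots` and the toy sequence are hypothetical, the standard genetic code is assumed, and the sketch assumes that the 5′ G of each GGA/GGG/GGT motif on the plus strand is the base converted to A (the GG-to-AG signature generally attributed to A3G).

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The three A3G hotspot motifs considered in the text.
HOTSPOTS = ("GGA", "GGG", "GGT")


def classify_hotspots(cds):
    """Classify the G-to-A change at each hotspot of an in-frame coding sequence.

    Assumption (illustrative): the first G of the motif is the mutated base.
    Returns a list of (position, original codon, mutated codon, effect).
    Sequences containing characters other than A, C, G, T raise KeyError.
    """
    cds = cds.upper()
    results = []
    for i in range(len(cds) - 2):
        if cds[i:i + 3] not in HOTSPOTS:
            continue
        start = (i // 3) * 3              # first base of the codon containing position i
        codon = cds[start:start + 3]
        offset = i - start                # position of the target G inside that codon
        mutated = codon[:offset] + "A" + codon[offset + 1:]
        before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
        if after == before:
            effect = "silent"
        elif after == "*":
            effect = "stop"
        else:
            effect = "non-synonymous"
        results.append((i, codon, mutated, effect))
    return results


if __name__ == "__main__":
    # Toy sequence (hypothetical), read in frame as ATG TGG AAG GTC GGA.
    for hit in classify_hotspots("ATGTGGAAGGTCGGA"):
        print(hit)
```

In this toy sequence, the hotspot inside the tryptophan codon TGG produces a stop codon (TAG), which is the type of event included in the immune-response analyses but excluded from the genome-evolution analyses described above.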

restricted to HLA-A2:01, HLA-A3:01, HLA-B57:01, and HLA-B35:01 in Gag, Pol, Env, and Nef proteins.

Rather than considering existing A3G hotspots as evidence for their role in selection as we have, the case can also be made for the opposite view: that if the usage of A3G hotspots toward advantageous outcomes for the virus was key, then current sequences circulating at the population level ought to be rather devoid of A3G hotspots and instead rich in the mutated versions. Whilst this argument may hold true in several different contexts, considering it in the context of CTL escape is more difficult. First, CTL escape mutations are usually not broadly selectable at the population level but are highly individual host-dependent since they are intimately connected to HLA genotype. Second, from the virus's perspective, there may be two advantages to maintaining the A3G/F hotspots. The first arises from the conflicting demands of replication fitness on one hand and immune evasion on the other, which is best illustrated by the high rates at which certain CTL escape mutations revert to the wildtype, presumably fitter, sequence, especially upon transmission to a new host with a different HLA genotype wherein CTL escape mutations from the previous host are no longer advantageous. The second is that it may not be to the advantage of the virus to benefit from maximum CTL escape, as it would limit its replication capacity by quickly eliminating infected host cells (150, 151). Thus, it may be advantageous to conserve some CTL escape potential in the form of A3G hotspots, available for use when it suits the virus. An example of this very conservation of A3G/F mutational hotspots has been shown in terms of antibody epitopes in Env (122).

FIGURE 4 | Distribution pattern of CTL escape-inducing A3G hotspots in the context of the entire HIV genome and its mutational gradient. (A) To analyze the distribution pattern of A3G hotspots across the HIV genome, we counted the number of A3G hotspots that could result in non-synonymous amino acid changes within windows of 60 bp length across the entire length of Gag, Pol, Env, and Nef. The number of A3G hotspots in each 60 bp window is plotted as gray bars, and the overall pattern is shown as a black line. We then overlaid this pattern against the known mutational gradient of the HIV genome as previously described (red line) (143) (Copy Right License Number: 4460241074666). (B) To analyze the distribution pattern of predicted CTL escape-inducing A3G hotspots for each HLA, a map of the positions of CTL epitope-encoding sequences for Gag, Pol, Env, and Nef, restricted to HLA-A2:01, HLA-A3:01, HLA-B57:01, and HLA-B35:01, was created (top panel of each box). A map of normalized escape potential was also generated (middle panel of each box). We defined normalized escape potential as the number of CTL escape-inducing A3G hotspots in each 60 bp segment, normalized (divided) by the total number of A3G hotspots within the segment. Based on the normalized escape potential and the number of CTL epitopes encoded in each 60 bp segment, for each HLA we then constructed a map of escape factor, which we defined as the product of the number of CTL epitopes and the normalized escape potential (bottom panel). In this map, black indicates no potential for CTL escape (escape factor = 0), orange indicates low potential (escape factor = 1), pink indicates modest potential (escape factor range of 2–4), and red indicates high potential (escape factor range of 5–7).
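The quantities defined in the Figure 4 legend (hotspot counts per 60 bp window, normalized escape potential, and escape factor) can be written down as a short Python sketch. This is an illustration under stated assumptions rather than the code used to generate the figure. The function names, the use of non-overlapping windows, the choice to report a potential of 0 for windows without any hotspot, and all numbers in the usage example are hypothetical.

```python
def window_counts(positions, seq_len, window=60):
    """Count hotspot positions (0-based) in consecutive, non-overlapping windows."""
    n_windows = -(-seq_len // window)      # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def escape_factor_map(total_hotspots, escape_hotspots, epitope_counts):
    """Per-window normalized escape potential and escape factor.

    normalized escape potential = escape-inducing hotspots / all hotspots
    escape factor               = number of CTL epitopes * normalized escape potential
    Windows without hotspots are given a potential of 0 (an assumption).
    """
    potentials, factors = [], []
    for total, escaping, n_epitopes in zip(total_hotspots, escape_hotspots, epitope_counts):
        potential = escaping / total if total else 0.0
        potentials.append(potential)
        factors.append(n_epitopes * potential)
    return potentials, factors


if __name__ == "__main__":
    # Hypothetical toy data for a 180 bp region (three 60 bp windows).
    all_hotspots = window_counts([3, 20, 41, 58, 70, 95, 100, 110, 119], seq_len=180)
    escaping = window_counts([3, 41, 70, 95, 100, 110, 119], seq_len=180)
    epitopes_per_window = [3, 1, 2]
    potentials, factors = escape_factor_map(all_hotspots, escaping, epitopes_per_window)
    print(all_hotspots, escaping, potentials, factors)
```

Dividing by the total number of hotspots in each window keeps hotspot-rich segments from dominating the map simply because they contain more motifs, while multiplying by the epitope count weights each segment by how many restricted epitopes it encodes.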

These findings bring to light novel aspects of the interplay between the host mutator A3G and the co-evolution of the viral genome. Overall, A3G-induced mutations were predicted to influence CTL epitope production and HLA binding, both toward the production of more immunogenic epitopes and, conversely, toward CTL escape. It is important to consider these results in the context of two additional concepts. First, although the overall action of A3G on the HIV genome is predicted to result more often in CTL escape than in generation of new, more immunogenic epitopes, it is important to note that even if the two outcomes occurred with equal probability, the escape mutations would be the dominant outcome under immune pressure in vivo (9, 87, 119). Second, in this analysis, we did not take into account the fitness consequences of A3G-induced mutations. Predicting the in vitro replicative fitness cost and peptide HLA binding affinity of clinically derived sequences has shown that escape mutations in CTL epitopes of Gag restricted to protective HLA class I alleles carried higher fitness costs and lower levels of reduction in HLA class I binding affinity compared to mutations in epitopes restricted to other HLA class I alleles. This suggests that one way by which protective HLA molecules act is by binding epitopes whose CTL escape mutations incur a high fitness cost with relatively low benefit in terms of HLA-binding affinity reduction (108).

The practical application of this work will lie in determining epitope choice for vaccine design. Epitope clusters and altered epitopes with the potential to be better processed or bound by HLA because of A3G mutations ought to be superior platforms for the development of prophylactic or post-infection CTL-based vaccines. Thus, accounting for and indeed exploiting the action of endogenous genome mutators to design more effective vaccines would represent a strategic advance in HIV vaccine design. In addition, the analyses carried out here should be considered in the context of extensive A3 family enzyme mutations of tumor genomes, as understanding the mechanisms by which a tumor cell can escape or boost the CTL response is critical to developing vaccination and therapies based on CTL epitopes. We and others have postulated that the function of A3 family members in cancer genome mutagenesis may bear parallels to their role in viral genome mutagenesis, as tumor cells are also under pressure to avoid detection by CTL and could use A3-induced mutagenesis to this end (152–159). At present, whether and how frequently this may occur is unknown, and using similar analyses to gain insights will have important implications for the design of personalized anti-tumor CTL-based strategies.

## AUTHOR CONTRIBUTIONS

FB generated all data, with help from KJ. FB and ML analyzed the data, generated the figures and wrote the manuscript. MG edited the manuscript. We thank our colleague Emma Quinlan for assistance with editing.

## FUNDING

This work was supported by Canadian Institutes of Health Research (CIHR) operating grants (MOP111132 and OCH131580) to ML. FB is supported by a Dean's fellowship from the Faculty of Medicine Memorial University and a Ph.D. fellowship from the Beatrice Hunter Cancer Research Institute (BHCRI) with funds provided by the Cancer Research Training Program as part of the Terry Fox Foundation strategic Health Research Training Program in Cancer Research at CIHR.

## ACKNOWLEDGMENTS

We are grateful to our colleague Emma M. Quinlan for reproducing the HIV mutational gradient (red trend line) in **Figure 4A**, from Kijak et al. (143) (Copy Right License Number: 4460241074666).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2018.03032/full#supplementary-material

### REFERENCES


immunodeficiency virus type 1 (HIV-1) proteins reveal imprints of immune evasion on HIV-1 global variation. J Virol. (2002) 76:8757–68. doi: 10.1128/JVI.76.17.8757-8768.2002


rapid decline of T-helper lymphocytes in HIV-1 infection. A report from the multicenter AIDS cohort study. Lancet (1990) 335:927–30.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Borzooee, Joris, Grant and Larijani. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Human Endogenous Retroviruses Are Ancient Acquired Elements Still Shaping Innate Immune Responses

Nicole Grandi<sup>1</sup> and Enzo Tramontano<sup>1,2</sup>\*

<sup>1</sup> Laboratory of Molecular Virology, Department of Life and Environmental Sciences, University of Cagliari, Cagliari, Italy, <sup>2</sup> Istituto di Ricerca Genetica e Biomedica, Consiglio Nazionale delle Ricerche, Cagliari, Italy

About 8% of our genome is composed of sequences of viral origin, namely human endogenous retroviruses (HERVs). HERVs are relics of ancient infections that affected the primate germ line over the last 100 million years and became stable elements at the interface between self and foreign DNA. Intriguingly, HERV co-evolution with the host led to the domestication of activities previously devoted to the retrovirus life cycle, providing novel cellular functions. For example, selected HERV envelope proteins have been coopted for pregnancy-related purposes, and proviral Long Terminal Repeats participate in the transcriptional regulation of various cellular genes. Given the persistence of HERVs in the host genome and their basal expression in most healthy tissues, it is reasonable that human defenses should prevent HERV-mediated immune activation. Despite this, HERVs and their products (including RNA, cytosolic DNA, and proteins) are still able to modulate and be influenced by the host immune system, suggesting a central role in the evolution and functioning of human innate immunity. Indeed, HERV sequences have been major contributors in shaping and expanding the interferon network, dispersing inducible genes that have occasionally been domesticated in various mammalian lineages. Moreover, HERV integration within or near genes encoding critical immune factors has been shown to influence their activity, or to be responsible for their polymorphic variation in the human population, as in the case of an HERV-K(HML10) provirus in the major histocompatibility complex region. In addition, HERV expressed products have been shown to modulate innate immunity effectors, and have therefore often been related on the one side to inflammatory and autoimmune disorders, and on the other side to the control of excessive immune activation through their immunosuppressive properties. Finally, HERVs have been proposed to establish a protective effect against exogenous infections. The present review summarizes the involvement of HERVs and their products in innate immune responses, describing how their intricate interplay with the first line of human defenses can actively contribute either to the protection of the host or to its damage, implying a subtle balance between the persistence of HERV expression and the maintenance of a basal immune alert.

#### Edited by:

Tara Patricia Hurst, Abcam, United Kingdom

#### Reviewed by:

Alexander Emmer, Martin Luther University of Halle-Wittenberg, Germany

Renée Nicole Douville, University of Winnipeg, Canada

#### \*Correspondence:

Enzo Tramontano tramon@unica.it

#### Specialty section:

This article was submitted to Microbial Immunology, a section of the journal Frontiers in Immunology

Received: 25 May 2018 Accepted: 20 August 2018 Published: 10 September 2018

#### Citation:

Grandi N and Tramontano E (2018) Human Endogenous Retroviruses Are Ancient Acquired Elements Still Shaping Innate Immune Responses. Front. Immunol. 9:2039. doi: 10.3389/fimmu.2018.02039

Keywords: HERV, endogenous retroviruses, innate immunity, interferon, evolution, autoimmunity, cancer

## INTRODUCTION

Our genome includes an impressive proportion of repetitive elements, among which human endogenous retroviruses (HERVs) account for about 8% (1). HERVs are DNA sequences of retroviral origin that have been acquired over the last 100 million years through multiple integrations by now-extinct exogenous retroviruses (2, 3). Peculiarly, while known retroviruses target somatic cells, these ancestral infections affected the primate germ line, leading to the vertical transmission of HERV relics to the offspring (**Figure 1**). It is, however, still not clear whether the exogenous retroviruses that gave rise to HERVs had germ line cells as their main or unique target or infected this population by chance (4). In general, the mechanism that formed HERV insertions is analogous to the one used in exogenous retrovirus replication. In both cases, once in the cytoplasm, the RNA genome is reverse transcribed into a double-stranded DNA (dsDNA) by the viral reverse transcriptase (5). The resulting proviral DNA is then integrated into the host chromosomes by the viral integrase, which interacts with cellular cofactors (6). At this point, provirus expression generates a set of mRNAs encoding the different viral proteins. In the presence of a functional reverse transcriptase, the full-length mRNA can also be reverse transcribed, producing a proviral cDNA theoretically competent for new integration events. The action of the host editing systems and the genomic substitution rate, however, often made HERV proviruses defective, leaving only a residual protein-coding capacity but more often producing non-coding RNAs. HERVs share with exogenous retroviruses the typical proviral structure, normally being composed of two long terminal repeats (LTRs) that flank the internal portion containing the viral genes gag, pro-pol and env (**Figure 1**). The LTRs are formed during reverse transcription and have a regulatory role in viral gene expression, harboring promoters, enhancers and polyadenylation signals. The retroviral genes encode the structural components, i.e., matrix, capsid and nucleocapsid (gag) and the envelope surface and transmembrane subunits (env), as well as the enzymes involved in the viral life cycle, namely protease (pro), reverse transcriptase and integrase (pol) (**Figure 1**). While only these simple retroviral gene products have been identified in the majority of ERVs, some groups are known to have a more complex genome that encodes additional proteins. This is the case for HERV-K(HML2) sequences, whose env gene, depending on the presence or absence of a characteristic 292-bp deletion, can give rise to two splicing variants, Np9 and Rec, respectively (7, 8). Both proteins have been investigated due to their possible oncogenic properties [for a recent review, see (9)]. Only very recently, the presence of a putative rec gene has also been reported in another HERV-K group, namely HML10 (10). Aside from protein-coding genes, HERV proviruses harbor a primer-binding site (PBS) and a polypurine tract (PPT) located between the 5′LTR and gag and between env and the 3′LTR, respectively (**Figure 1**). Both have an important role during reverse transcription: the PBS binds the cellular tRNA that primes the synthesis of the (–)strand DNA, while the PPT acts as a primer for (+)strand DNA production.

The PBS sequence has also traditionally been used for HERV classification, and the current nomenclature still often relies on the tRNA type recognizing the PBS of the different HERV groups (e.g., HERV-K for lysine tRNA, HERV-W for tryptophan tRNA, etc.). However, HERV classification is still ongoing, and the use of different methods in recent decades has often led to multiple designations for the same HERV sequences and groups (11). Besides the above PBS-based taxonomy, a minority of HERV groups have been named following unconventional criteria, such as the presence of a proximal cellular gene (e.g., HERV-ADP) or a peculiar amino acid motif (e.g., HERV-FRD). Clearly, such a lack of established rules generated confusion in the field, underlining the need for unambiguous guidelines for the naming of HERV groups, their single members and the products expressed by the latter (11, 12). Hence, the above-mentioned nomenclature criteria are now considered inadequate, as they are not based on phylogenetic aspects and, in the case of the PBS-based one, do not take into account the frequent occurrence of alternative PBS types (3, 13, 14). Currently, HERVs are broadly divided into three classes according to their similarity to exogenous members: class I (gammaretrovirus- and epsilonretrovirus-like), class II (betaretrovirus-like) and class III (spumaretrovirus-like). The classification of the various HERV groups is instead based on the phylogenetic relationships among the different groups, considering above all the highly conserved pol gene, and can be corroborated by structural features found in all the members of the same genus or class (3, 11, 13). As for HERV nomenclature, a revised naming system has been introduced by the Human Genome Organization Gene Nomenclature Committee (12). Hence, for clarity, we provide a list of aliases where multiple HERV designations exist (**Table S1**).

An updated analysis of HERV sequences within the human genome has been recently performed (3) with the software RetroTector (15). A multi-step approach allowed the classification of about 3200 HERV insertions into 39 "canonical" groups plus 31 "non-canonical" clades showing a mosaic structure arising from recombination and secondary integrations (3). This classification provided a comprehensive overview of HERV diversity, and is also a useful background for the still-ongoing characterization of the different HERV groups at the genomic level. Such exact knowledge of the localization and coding capacity of individual HERV sequences could in fact represent a great advance in the understanding of their potential effects in both physiological and pathological contexts (16). In particular, while in the former some HERV locus-specific activities have already been demonstrated in placenta (9, 17–19), the tentative link between HERVs and human diseases has not yet led to any definitive association. In this regard, while many reports investigated the overall expression of the various HERV groups in diseased tissues, very few studies tried to assign it to specific proviral loci of origin and to analyze their differential expression between patients and healthy individuals. This, together with the above-mentioned lack of knowledge about individual HERV sequences and the absence of standardized methodologies for HERV association studies, has so far prevented the identification of precise molecular mechanisms of pathogenesis and, thus, an unambiguous link to any disorder (16, 20). Nevertheless, even if their significance in disease etiology is still uncertain, it is by now clear that HERVs can have an impact on the host in several ways. While some effects depend on the mere presence of HERV sequences within the human genome (16), in other instances HERV expression can provide RNAs and proteins potentially able to trans-regulate human genes and to influence host immunity (21–24). It has been shown that HERV expression products can act as pathogen-associated molecular patterns (PAMPs), triggering the cellular receptors responsible for the first line of defense (25–27). They can moreover provide antigenic epitopes that are recognized by lymphocytes (especially through molecular mimicry with exogenous viral molecules) and stimulate the onset of specific T- and B-cell responses (28–30). These mechanisms might be particularly relevant for the tentative involvement of HERVs in autoimmunity and inflammatory diseases. On the other hand, HERVs have also been implicated in the downregulation of host immunity, ranging from their role in maternal immune tolerance to the fetus to their suggested protective action against excessive immune activation (31–34). The present review will discuss the interplay between HERV insertions and their expression products and our defenses, with special attention to their contribution in shaping and influencing human innate immunity.

being fixed in the human population. The general structure of a full-length HERV provirus is represented: the two Long Terminal Repeats (LTRs) are formed during the reverse transcription of the viral RNA genome and flank the gag, pro, pol, and env genes. The primer binding site (PBS) and the polypurine tract (PPT) are located between 5′LTR and gag and between env and 3′LTR, respectively. The viral genes encode the structural and non-structural proteins found in the viral particle: gag, matrix (MA), capsid (CA) and nucleocapsid (NC); pro-pol, protease (PR), reverse transcriptase (RT) and integrase (IN); env, surface (SU) and transmembrane (TM) subunits. While in exogenous retroviral infections the integrated provirus is transcribed by the cellular machinery to release new virions, HERV persistence within the host genome and the action of cellular editing systems led to the accumulation of mutations that often made the proviruses coding-defective and thus unable to produce infectious particles.

## HERV ROLE IN THE EVOLUTION AND SHAPING OF THE HUMAN GENOME

About 75 years ago, the pioneering studies of Barbara McClintock suggested that transposable elements (TEs), now known to constitute >40% of our DNA (1), were not useless "junk DNA" but normal components of eukaryotic genomes that can have important regulatory roles (35). Nowadays, growing evidence confirms that TEs had a crucial role in the shaping and evolution of vertebrate genomes, contributing to the establishment of lineage-specific patterns of gene expression (36, 37). TE insertions appear to be highly conserved among mammals, showing an increased density in the proximity of cellular genes and being a main source of transcription factor binding sites and regulatory signals (37–42). This points to their relevance for human development and transcriptional modulation throughout evolution (36, 38). Of course, the majority of TEs were acquired several million years ago and are by now silenced due to the accumulation of mutations in their coding sequence and to various cellular mechanisms of transcriptional repression, such as histone hypermethylation. Despite this, many TEs are likely involved in a long-term co-evolution with the host, which first exerted a selective pressure against their detrimental effects and then, in some instances, led to their cooptation for biological processes (43, 44). Intriguingly, even the general epigenetic silencing of TEs could represent an ancestral adaptation of the harboring organism that initially evolved defense systems to downregulate them and then developed strategies to control gene expression through their exaptation (44). In this view, the sporadic loss of epigenetic regulation might not only be linked to certain developmental stages and disorders, but might also constitute an opportunity to boost evolvability (44). Accordingly, despite these controlling mechanisms, many HERV sequences still retain a residual expression capacity, leading to the production of either coding or non-coding RNA transcripts that can both influence the host biology. One of the most remarkable examples of the HERV impact on vertebrate physiology is represented by "syncytins," an ensemble of Env proteins encoded by different HERV sequences in all eutherian mammals through a process of convergent evolution (45). Syncytins have in fact been domesticated independently by the various species, providing common and important functions for placenta development and physiology. In the case of the human genome, two env loci, namely ERVWE1 (HERV-W, 7q21.2) and ERVWE2 (HERV-FRD, 6p24.1), encode the coopted Env proteins syncytin-1 and -2, respectively (17, 18). While syncytin-1 has a pivotal role in placental syncytiotrophoblast development and homeostasis (17, 18, 46, 47), syncytin-2 is thought to be involved in the maternal immune tolerance to the fetal allograft (32). The role of these and other HERV-derived Env proteins has been recently reviewed elsewhere (9).

Aside from protein production, the thousands of HERV sequences dispersed in our DNA contributed to the evolution of the primate genome by providing an abundant source of regulatory elements. It is well known that our genetic information is organized in regulatory networks, involving both cis-regulatory sequences and trans-acting genes, and that their interaction is at the base of cellular plasticity and evolution (23, 44). HERVs also participate in this complex interplay, being able to regulate the activity of host genes in several ways and at different expression levels (**Figure 2**). In fact, even the sole presence of HERV proviruses and solitary LTRs can influence the activity of cellular genes (**Figure 2a**). The insertion of repetitive elements is a source of genomic modifications, possibly leading to the disruption or insertional mutagenesis of co-localized genes or promoting chromosomal rearrangements through non-allelic homologous recombination (**Figure 2a**). This was reported in male infertility, in which the intra-chromosomal recombination between two homologous HERV-I sequences located on chromosome Y is responsible for the microdeletion of the azoospermia factor a (48). Moreover, HERV LTRs can provide cis-regulatory activity to nearby cellular genes by enhancing their transcription or even providing alternative promoters and splicing signals, also with a remarkable tissue-specificity (**Figure 2a**). A representative example is the HERV-E provirus integrated upstream of an ancestral amylase gene and acting as a specific enhancer in parotid glands (49, 50). Furthermore, despite the frequent loss of protein-coding capacity, the abundant production of HERV-derived non-coding RNAs (ncRNAs), such as microRNAs and long ncRNAs, may also provide cis-regulatory elements able to modulate the expression of host genes, either alone (e.g., providing a recognition motif for an RNA-binding protein) or in concert with cellular transcription factors (**Figure 2b**). This occurs, for example, in human embryonic stem cells, whose pluripotency depends on nuclear long ncRNAs expressed by a HERV-H element and recruiting specific cellular transcriptional activators (51). In addition, HERV ncRNAs can act as "RNA sponges" binding and dampening microRNA families involved in the post-transcriptional regulation of gene expression, given that several human microRNAs have high sequence homology with HERVs (52–54). Such miRNA sponge activity has been reported in the positive regulation of pluripotency in embryonic stem cells, which depends on the binding of a HERV-H long ncRNA (HPAT5, locus 6q27) to complementary sequences in the let-7 microRNA family (54, 55) (**Figure 2b**). Finally, if a HERV protein is produced, it may modulate host gene expression through the biological activities previously involved in the virus life cycle and now providing new cellular functions (21) (**Figure 2c**). As an example, HERV Gag and Rec proteins are known to influence the stability, localization and translation of cellular transcripts (21). Accordingly, HML2 Rec was reported to interact with ∼1,600 cellular mRNAs in embryonic cells and, especially, to influence their ribosome occupancy, possibly suggesting a regulatory function coopted for early development (56). Similarly, the Arc Gag-like protein derived from an ancient Ty3/gypsy retrotransposon has been repurposed during brain evolution to mediate communication between neural cells, having an important role in the development and plasticity of the nervous system (57–59). In particular, Arc has been shown to assemble into capsids that include mRNA sequences to be transferred from a neuron to new recipient cells through extracellular vesicles, where they then undergo activity-dependent translation (60).

Besides the role of individual HERV sequences, it has been proposed that the ensemble of TEs widespread in human DNA can have a more extensive role. By constituting a sort of parallel regulatory network, TEs can possibly influence multiple host genes and, thus, shape whole pathways involved in complex cellular processes (22, 36, 61). Accordingly, more than one third of p53 binding sites in the human genome have been dispersed by class I HERV sequences, which have become major components of the p53 regulatory network (62). Hence, the acquisition of HERV insertions could have acted as a driving force for human genome plasticity and cellular networking, and, as described below, such TE-dependent modeling and functional renewal has been particularly critical in the evolution of innate immunity. Vertebrate genomes have been subjected to substantial rearrangements through the acquisition of HERVs and, intriguingly, such colonization has been concomitant with the development of important immune pathways (24). Major shifts have occurred in both innate and adaptive immune systems, increasing the complexity and specificity of vertebrate antiviral defenses (24). The loci of the primate MHC (major histocompatibility complex) are in fact characterized by a high density of HERV integrations, which contributed to their remarkable plasticity (63) and led to gene variations among species. As an example, a HERV-K(HML10) provirus inserted within the ninth intron of the human complement C4A gene (MHC class III), also called HERVKC4, is responsible for its dichotomous size variation (10, 64–66). Besides generating physical changes among species, such TE insertions in immune gene introns could also account for regulatory effects, being present mostly in antisense orientation and subjected to bidirectional transcription (10). Notably, mRNA sequences originating from classes of genes with a relatively recent expansion, such as the ones involved in immunity, are enriched in TEs, which are not present in the transcripts arising from highly conserved genes with basic functions (38). Overall, some of the strongest evidence for the role of HERVs in the shaping of pivotal immune systems concerns the interferon (IFN) network, a crucial antiviral pathway for innate immunity and a fundamental effector to initiate and maintain adaptive responses (67). Intriguingly, Chuong and coauthors showed that HERV insertions greatly contributed to the evolution and amplification of the IFN transcriptional network, independently dispersing a large number of IFN-inducible enhancers in many mammalian genomes (22). In particular, the experimental deletion of a subset of these endogenous elements in human cell lines affected the activity of neighboring genes induced by IFN, thus impairing important immune pathways such as the AIM2 inflammasome (22). Furthermore, due to their residual regulatory activities, HERV LTRs can act as promoters and/or enhancers after IFN-mediated stimulation (68). This has been reported for HERV-K(HML2) LTRs, which harbor two IFN-stimulated response elements (ISREs) activated by IFN signaling, leading to increased HML2 expression in response to inflammation (69, 70). In line with this, transcription of HERV-K env genes can be stimulated by IFN α, encoding Env superantigens responsible for polyclonal T-cell activation (71).

(3). (b) HERV non-coding RNAs (ncRNAs) can also cis-regulate cellular genes, even through the recruitment of cellular regulators (e.g., transcription and splicing factors) (4). In addition, HERV ncRNAs have been reported to act as "microRNA sponges," binding and dampening microRNA families responsible for post-transcriptional modifications (5). (c) Finally, some HERV proteins can also regulate gene expression through their interaction with cellular mRNAs and the modulation of their transfer and ribosome occupancy (6).

#### HERVS AS ACTIVATORS OF ANTIVIRAL INNATE IMMUNITY

Innate immunity is the first and most ancient line of defense against microbial infections, acting through a complex network that is conserved throughout the animal kingdom (72). When an infectious agent overcomes the organism's physical barriers, the presence of conserved PAMPs (i.e., lipids, proteins, glycans, and nucleic acids) allows their prompt recognition by innate immunity sensors, namely pattern recognition receptors (PRRs). PRRs are germline-encoded receptors with a pivotal role in antiviral defenses, as well as in the response to self-injuries (73), recognizing PAMPs and danger-associated molecular patterns (DAMPs), i.e., molecules present in damaged or stressed tissues (74).

In vertebrates, PRRs include five major classes of receptors that are either transmembrane or cytosolic (tm- and cytPRRs, respectively) in different cell types (73). tmPRRs are represented by the Toll-like receptors (TLRs), detecting PAMPs either on the cell surface or in the endosomal compartments. cytPRRs include RIG-I-like receptors (RLRs), NOD-like receptors (NLRs), C-type lectin receptors (CLRs) and DNA sensors, all recognizing intracellular PAMPs (26, 75–77). cytPRRs are usually present in cells that can be actively infected by a given class of microbial agents, while tmPRR cell-extrinsic recognition is independent of cell infection and usually occurs in cells devoted to pathogen detection (78). In both cases, PRR stimulation by microbial PAMPs activates a complex signaling cascade that triggers the production of various pro-inflammatory molecules, including cytokines, chemokines and type I IFN (78, 79). These effectors have a dual role. On the one hand, they quickly establish an antimicrobial environment to counteract the infection. On the other hand, the activation of PRRs expressed on antigen-presenting cells (APCs), especially dendritic cells (DCs), evokes the development of a long-lasting adaptive immunity, leading to the onset of specific cellular and humoral defenses (78). The latter are mainly represented by the clonal expansion of naïve cytotoxic T cells and the production of antibodies by B lymphocytes. Other important APCs are macrophages and B lymphocytes, which are, however, mostly involved in the stimulation of already-activated T cells. The overall innate immune signaling has been reviewed elsewhere (72, 79–81): in the following sections we will focus our attention on the specific pathways relevant to exogenous retrovirus sensing and, thus, possibly also involved in HERV immune recognition.

HERVs have been integral parts of the human genome for millions of years, are highly transcribed during embryonic development and are expressed at variable levels in several adult tissues. Hence, from the immunological point of view, HERV expression products lie at the interface between self-molecules and microbial antigens. In fact, also depending on the cellular and immunological surroundings, HERV-derived molecules can be either tolerated by human defenses or able to stimulate human immunity, which has led to their intensive investigation in various autoimmune diseases. In addition, the presence of other immune stimuli has been shown to influence HERV expression, suggesting a complex and multifaceted interplay that is still not completely clarified. In principle, the innate immune pathways activated by HERV-derived products are the same involved in the first-line antiviral defenses counteracting exogenous retroviruses, including both tm- and cytPRRs (26) (**Figure 3**). In humans, TLRs 1 to 10 are inserted in plasmatic or endosomal membranes and expressed on both innate immune cells and non-immune cell types, while cytPRRs are soluble elements normally found in the cytosol or migrating to it when stimulated by the presence of retroviral molecules following cellular infection (80, 82). Both tm- and cytPRRs can potentially recognize HERV nucleic acids and proteins either due to their molecular identity with exogenous viral PAMPs or as DAMPs (26, 74). The upregulation of HERV transcription, even in the absence of a causal role in the initiation of immune-related pathogenesis, can thus provide numerous HERV-derived PAMPs or DAMPs able to further prompt inflammation, contributing to the symptomatology. This could likely occur in diseases such as autoimmunity and cancer, which have in common a general epigenetic de-regulation known to strongly derepress nonspecific retroelement expression (9, 16).
Hence, even if no definitive associations have been described yet, a growing body of studies indicates that HERVs may have an impact on human immune homeostasis, starting from their interaction with PRRs known to have a central role in the early response to HIV-1 infection, i.e., TLRs, RLRs, and cytoplasmic DNA sensors (83).

### Sensing of HERVs by Transmembrane PRRs

As mentioned above, different tmPRRs can potentially be activated by either HERV nucleic acids or proteins (84) (**Figure 3**). TLRs were the first PRRs to be identified and have been studied in the greatest detail: they can sense viral molecules either on the cellular surface, through plasmatic membrane TLRs (1, 2, 4, 5, 6, and 10), or in the endosome, through TLRs localized in the endosomal membranes (3, 7, 8, and 9) (80). All TLRs have a common structure, characterized by a first domain that binds specific ligands, protruding outside the cell surface or inside the endosome, and a second conserved cytosolic domain that is required for the intracellular signaling. Their activation relies, however, on different types of molecules. Plasmatic membrane TLRs can recognize retroviral proteins, either as individual molecules or as components of viral-like structures (**Figure 3**). Accordingly, TLR2 and TLR4 are both able to detect the Env proteins of retroviruses, including HIV-1 gp120, which induces NF-κB activation and proinflammatory cytokine secretion (85). In line with this, both receptors are upregulated in DCs, macrophages and peripheral blood mononuclear cells during HIV infection (86–88). Endosomal TLRs can instead be involved in the sensing of different retroviral nucleic acids, which however must be transferred by autophagy from the cytoplasm to the endosomal lumen to be detected (26, 74) (**Figure 3**). In particular, one of the most immunogenic nucleic acid PAMPs is dsRNA, which is normally not found in uninfected cells. Viral dsRNA can stimulate an innate response through TLR3 signaling (72) and, interestingly, the immune activity elicited by TLR3 agonists has been shown to provide protection from different viral infections, including certain HIV strains (89). TLR3 has also been implicated in HIV-1 transactivation (90), replication (91), and inflammation (92). Following proviral transcription, HERV ssRNA could be recognized by both TLR7 and TLR8.
The same TLRs, in fact, are important for the recognition of the HIV RNA genome during acute infection, probably through the recognition of guanosine- and uridine-rich ssRNA that stimulates the production of IFN-α and proinflammatory cytokines by DCs and macrophages (93). In the presence of reverse transcriptase activity, the HERV ssRNA can serve as a template for the synthesis of RNA:DNA hybrids, which can be sensed by TLR9 (94). The latter was in fact shown to sense RNA:DNA hybrids presenting viral-derived sequences, leading to the secretion of IFN-I and pro-inflammatory cytokines in DCs (94). TLR9 is also the sole DNA-sensing TLR and specifically binds unmethylated CpG-rich DNA (72, 82) that, however, is present mostly in bacteria and DNA viruses (72, 80). Recently, TLR9 agonists have been reported to induce HIV production in latently infected cells, and are being investigated as possible therapeutic strategies to attack the proviral reservoir (95, 96).

interaction with HERV molecules has been reported are marked with an asterisk.

In any of the above pathways, once activated, tmPRR dimeric receptors prime a complex intracellular signaling cascade that, through several kinases and ubiquitinases, can lead to the activation and nuclear translocation of transcription factors that stimulate the expression of cytokines, chemokines and type I IFN (IFN-I), establishing an initial antiviral status and stimulating the adaptive immune defenses. Currently, even if all the mentioned tmPRRs could potentially sense HERV-derived molecules, only limited findings report a direct interaction between these ligands and a few plasmatic membrane TLRs. Conversely, endosomal TLRs were widely shown to be involved in the control of murine ERVs (97), although their interplay with human TEs remains poorly characterized.

In one of the most investigated groups, HERV-W, the Env surface subunit has been characterized as a potent stimulator of TLR4, harboring remarkable pro-inflammatory properties that might contribute to multiple sclerosis immunopathogenesis (98–100) (**Table 1**). The HERV-W Env was shown to interact with both TLR4 and CD14, inducing proinflammatory molecules that are prevalent in multiple sclerosis, such as IL-1, IL-6 and tumor necrosis factor α [(99, 109, 110), **Table 1**]. These immune effectors can likely contribute to the damage of brain populations, including DCs, oligodendrocytes and astrocytes (111). HERV-W Env was in fact shown to activate DCs and promote a Th1-like immune response (99), adding mechanistic insights to the tentative link between HERV-W sequences and multiple sclerosis [recently reviewed in Grandi and Tramontano (16)]. The interaction between HERV-W Env and TLR4 was also investigated in both murine and human oligodendrocyte precursors, leading to an increased production of cytokines and inducible nitric oxide synthase (101) (**Table 1**). In line with this, HERV-W Env overexpression in mice led to the development of experimental allergic encephalomyelitis (112) and inflammatory hallmarks typical of myelin injuries (113, 114). Such a HERV-W Env-mediated pro-inflammatory effect was responsible for reduced oligodendrocyte differentiation and affected myelin expression and renewal (101), as also demonstrated by the fact that treatment with a specific antibody for HERV-W Env (GNbAC1) was able to rescue myelin expression (115). In addition, HERV-W Env holds a significant superantigen activity (116), which has also been proposed to play a role in demyelination by evoking a polyclonal non-specific T-cell activation and the massive release of multiple cytokines (98, 109).
A similar interaction between HERV-W Env and the TLR4 of pancreatic β cells has been proposed to promote autoimmune reactions and to affect insulin secretion in type I diabetes patients (117).

A second line of evidence for the pro-inflammatory potential of HERVs comes from the investigation of HERV-K(HML2) group expression in psoriasis, another poorly understood autoimmune disorder. In this context, wild-type and mutated HML2 dUTPases (metallo-enzymes that hydrolyze dUTP, preventing its incorporation into the viral DNA) were shown to interact ex vivo with TLR2 (108). Such interaction stimulated the expression of NF-κB and induced Th1 and Th17 cytokine production in DCs and Langerhans-like cells as well as, even if at lower levels, in keratinocytes (108) (**Table 1**). This could suggest a possible role of HML2 dUTPase in the immunopathogenesis of psoriatic lesions, even if these observations were made in primary cells from healthy individuals and not confirmed in vivo. A subsequent study explored the HML2 dUTPase locus in a large number of psoriatic patients, reporting the association of some single-nucleotide variants with a lower susceptibility to the disease (118). A small subset of these patients was also tested for dUTPase-induced cellular immunity, frequently showing increased B- and T-cell responses that were, however, not further characterized at the molecular level (118).

Finally, the stimulation of endosomal TLR3 by HERV dsRNA molecules has been observed after treatment with DNA methyltransferase inhibitors, anticancer agents that were shown to remove methylation from HERV promoter regions, causing their reactivation and subsequently triggering IFN-α and IFN-β responses (27). Interestingly, such HERV-mediated immune stimulation probably underlies the anticancer effect of demethylating agents (see below).

### Sensing of HERVs by Cytosolic PRRs

Besides transmembrane TLRs, other PRRs devoted to the antiviral response are present in the cytosol as soluble factors or reach this compartment after the sensing of microbial nucleic acids. Among them, the most relevant for antiviral immunity include the RLRs retinoic acid-inducible gene I (RIG-I), sensing short dsRNA and ssRNA with a 5′-triphosphate moiety, and MDA5 (melanoma differentiation-associated gene 5), which detects long dsRNA molecules (72). RIG-I was shown to be a crucial intracellular sensor for HIV nucleic acids, recognizing complex secondary structures in HIV RNA and stimulating IFN-I release by mononuclear cells (83). Accordingly, as a viral immune escape mechanism, HIV protease was shown to target RIG-I for lysosomal degradation (119). Conversely, MDA5 was not affected by HIV protease-mediated degradation (119), even if its upregulation has also been observed in HIV infection (120). Other cytPRRs, namely DNA sensors, instead recognize DNA of viral origin that, unlike nuclear DNA, can be found in the cytoplasm during retroviral replication. In particular, in HIV-1-infected cells, the replication intermediates cDNA, ssDNA, and RNA:DNA hybrids can all be detected by cytoplasmic DNA-sensing proteins (83, 121). Viral dsDNA can be sensed by the DNA-dependent activator of IFN (DAI) and the IFNγ-inducible protein 16 (IFI16), both triggering STING-TBK1-IRF3 signaling, as well as by the absent in melanoma protein 2 (AIM2), activating the inflammasome pathway (26, 82, 122). Another important sensor for cytoplasmic dsDNA is cGAS (cyclic GMP-AMP synthase), which can also recognize DNA-RNA hybrids and mtDNA released after mitochondrial damage (82). The main dsDNA sensors relevant to HIV infection are IFI16 and cGAS, which are both upregulated in the absence of antiretroviral therapy and associated with chronic immune activation (123). As for foreign ssDNA, stem-rich secondary structures in HIV-1 ssDNA were shown to induce IFI16 in macrophages (124).

Given that retrovirus replication takes place mostly in the cytoplasm, cytPRRs are known to be crucial to the detection of RNA and DNA originating from exogenous retroviruses, while very little evidence about the molecular sensing of HERV nucleic acids is available yet (26). In particular, similarly to TLR3, MDA5 has also been shown to sense HERV dsRNA triggered by treatment with DNA methyltransferase inhibitors, evoking IFN production (27). A role in the prevention of HERV DNA/RNA accumulation has been proposed for certain enzymes involved in the cytoplasmic homeostasis of nucleic acids, which could possibly provide some protection against HERV-mediated immune activation (9). This hypothesis derives from observations made in mice deficient for the 3′→5′ exonuclease 1 (Trex1), in which the accumulation of endogenous retroelements' cDNA led to the activation of innate DNA sensors and the production of IFN (125–127). Even if a similar function is still to be confirmed for human Trex1, this enzyme was shown to restrict the reverse-transcribed DNA from endogenous retroelements, which instead accumulated in Trex1-deficient cells (125). Moreover, patients with Aicardi-Goutières autoimmune syndrome lack this enzyme and show inflammatory and IFN-I responses that could possibly be sustained by HERV nucleic acid accumulation. Hence, Trex1 could have a role in the control of HERV cDNA-mediated immune activation (125, 128), also considering that this cellular protein was recently reported to prevent cGAS- or IFI16-mediated recognition of HIV-1 replication intermediates, being possibly involved in viral immune escape mechanisms (129). Similarly, the 3′→5′ RNA exoribonuclease SKIV2L has recently been implicated in the exosomal degradation of endogenous RNA molecules in the cytosol, thus possibly also preventing HERV RNA accumulation that could trigger viral RNA-sensing receptors (130).

#### Consequences of Innate Immunity Stimulation

Regardless of their cytosolic or membrane-associated localization, the sensing of viral molecules by innate immune receptors evokes the production of pro-inflammatory effectors (such as IFN, cytokines and chemokines), which promptly establish an antiviral status. This immune activation helps to contain the infection and prepares the ground for subsequent adaptive immune responses, mediated by T- and B-lymphocytes that elicit specific cellular and humoral defenses, respectively. Overall, such innate-adaptive signaling is fundamental to counteract exogenous viruses, and the resulting immune activation is generally able to eliminate the infectious triggers and then to shut down. In the case of HERV molecules, however, their stable presence and expression in the organism could provide continuous triggers to the host immune sensors. In fact, the main findings linking HERVs to autoimmune and inflammatory disorders rely on the chronic stimulus exerted by endogenous retroviral molecules, which could sustain molecular mimicry events based on the sequence identity between HERV products and either exogenous viruses or body components (16, 28, 30, 131, 132). Of note, after the initial immune activation, the production of IFN establishes a positive feedback that further upregulates IFN-stimulated genes through both a paracrine and an autocrine loop, fostering chronic inflammation and autoimmunity. The same antiviral status prompted by HERV expression can hence create a vicious circle in which inflammatory molecules and epigenetic dysregulation further upregulate HERV expression (26, 69, 84). For all these reasons, even if no definitive etiological evidence has been obtained yet, the role of HERVs in triggering innate defenses could contribute to autoimmune pathological developments and, hence, be a valuable therapeutic target.
Accordingly, monoclonal antibodies against HERV-W Env proteins are currently in clinical trials as an innovative approach against multiple sclerosis (133, 134) and type I diabetes (117). In addition, even in the absence of a direct immunogenic role of HERV-derived proteins, the presence and nucleotide variability of HERV loci in proximity to key immune genes should be taken into account for their possible influence on the latter, especially in the context of genome-wide association studies. These analyses are by now an established method to identify risk loci linked to autoimmune conditions, but rarely take into account HERV sequences and their networking with cellular genes. As an example, MS-associated SNPs showed a significant enrichment of HERV insertions and potential HERV ORFs in their genetic neighborhood as compared to control SNPs (135, 136). Accordingly, SNPs in the antiviral gene TRIM5 were shown to be negatively correlated with MS, while SNPs around a HERV-Fc1 locus on chromosome X had a significant association with the disease, with a risk that was strongly influenced by the two genes in an additive fashion (137). Hence, besides protein production, the genetic analysis of HERV loci in the neighborhood of immune genes, especially those with key immune roles, can provide insights into inter-individual variants contributing to autoimmunity risk.

#### HERVS AS INHIBITORS OF INNATE IMMUNITY

In parallel to pro-inflammatory effects, HERV-derived peptides have also been implicated in immune-suppressive mechanisms. The latter mainly involve Env transmembrane subunits, which hold a characteristic immunosuppressive domain (ISD) conserved among retroviral Env proteins. In animal exogenous retroviruses, the ISD counteracts the host antiviral responses (138, 139) while, in the case of HERVs, this element has occasionally been co-opted for maternal immune tolerance during pregnancy (32, 45). In this physiological status, a subtle immune balance must allow the invasion of fetal trophoblasts, which also present paternal antigens, while maintaining the ability to protect the organism from microbial infections. Pregnancy is hence accompanied by the repression of cellular immunity through the shift from Th1 inflammatory cytokine production (TNF-α, IFN-γ, and IL-2) toward an anti-inflammatory Th2 cytokine response (IL-4, IL-5, IL-10), to prevent cytotoxic processes potentially harmful to the fetus (102, 140, 141). Such shift is thought to be mainly due to the syncytin-2 ISD, which is highly preserved and shows strong immunosuppressive potential (32). However, syncytin-1 was also shown to be able to inhibit cytokine production in blood, suggesting a possible contribution to the maternal immune shift (102).

Besides these domesticated proteins, other HERV Envs have shown immunosuppressive activity. The transmembrane subunit of a HERV-K(HML2) sequence was reported to inhibit T-cell activation in a similar way to the HIV ISD, influencing cytokine release and immune gene expression (107). Similarly, the ISD of a HERV-H Env protein (Env-59) showed anti-inflammatory potential in a mouse model of arthritis (103), and was inversely related to the level of pathogenic effectors (such as IL-6 and TLR7) in human autoimmune rheumatic diseases (104). Furthermore, a recently described HERV LTR (HERVP71A, locus 6p22.1) was shown to serve as a tissue-specific enhancer for HLA-G gene expression in human extravillous trophoblasts, at the fetal-maternal interface, inhibiting natural killer cell cytotoxicity and conferring immune tolerance to the developing placenta (106).

#### HERVS AND EXOGENOUS INFECTIONS

HERVs have been proposed to have a role during exogenous viral infections, and such a role could be either beneficial or harmful (16). It is worth noting that the interplay between exogenous and endogenous viruses is still not fully characterized, even if some evidence suggests that exogenous infections sustained by different viral species (among which HIV, herpesviruses and influenza) are able to modulate HERV expression (142–144). In the case of an upregulation of HERV expression, such cooperative action could increase the immune triggering exerted by HERV products and, especially in the presence of retroviral infections, possibly account for complementation of defective viruses and recombination events (16). Here, however, we will focus on the possible protective effects against exogenous infections exerted by HERV expression products, which could theoretically restrict any step of the viral cycle (21).

Like any organism or biological entity, viruses are subjected to an ensemble of selective pressures, exerted by the host defenses as well as by the surrounding environment, including other microbes threatening the same host. Due to this, the different viral species evolved strategies to avoid the host antiviral systems and to compete with other viral populations, to ensure their replication. In line with this, it is well known that a cell infected by a certain virus often becomes resistant to superinfection, developing a virus-induced viral resistance against members of the same species as well as different viruses (24). Considering HERVs, such cross-protection has been investigated mostly for exogenous retroviruses that share identity in their proteins and nucleic acids, being more prone to interact with HERV products (16) (**Figure 4**). A partial resistance to exogenous infection could depend on the interference and blocking of the same cellular receptor by HERV-derived proteins or pseudo-viral particles (145, 146) (**Figure 4**). Inside the cell, the expression of HERV antisense transcripts has been proposed to confer protection against infections through the complementary interaction with homologous RNA sequences originating from exogenous retrovirus expression, forming dsRNA molecules that can be recognized as PAMPs by human PRRs (26, 72) (**Figure 4**). Finally, in the case of HERV protein production, their identity with exogenous viral proteins could lead to complementation events possibly affecting the formation of viral particles (**Figure 4**). Accordingly, a HERV-K(HML2) Gag protein was reported to co-assemble with HIV-1 Gag and to subsequently impair HIV-1 capsid formation as well as HIV-1 particle release and infectivity (147, 148) (**Figure 4**).

On the one hand, all these direct interactions between HERV and exogenous viral products have been argued to have had a role in the restriction and extinction of the ancestral exogenous retroviruses from which HERVs originated (21). In line with this, Blanco-Melo and coauthors "resuscitated" two ancestral Env proteins: one, more ancient, was reconstructed from various HERV-T insertions in primate species, and the other was encoded by a human HERV-T provirus acquired about 20 million years later (149). The "older" Env was shown to bind the human receptor MCT1, and to be able to generate infectious pseudotyped MLV particles and syncytia (149). The "modern" HERV-T Env, while still able to bind MCT1, lost these abilities and even downregulated MCT1 either at the cell surface (causing its internalization) or during secretion (blocking its transport), leading to its degradation (149). On the other hand, the same interactions could also have contributed to the rapid evolution and adaptation of HERV genes, driving the selection of elements that could increase the range of restricted microbes and confer a sort of broad antiviral protection (21). In this regard, upregulation of HERV-K(HML2) Rec proteins has recently been reported during embryogenesis, leading to the specific stimulation of the IFN-induced viral restriction factor IFITM1 in epiblast and embryonic stem cells (56, 150). This finding led to the fascinating idea that Rec and the HML2 mRNAs associated with it for nuclear export might be detected in the cytosol, eliciting an innate antiviral response thought to broadly inhibit embryonic viral infections (56).

and (c) the binding of HERV proteins or pseudoparticles to the same cellular receptor, preventing HIV binding and entry. Of note, these effects can be eventually sustained also by the possible upregulation of HERV expression by HIV infection.

#### HERVS, CANCER AND ANTICANCER STRATEGIES

In the context of cancer development, the ability of HERVs to modulate the immune system may have opposite effects, being potentially relevant to both the oncogenic processes and the anticancer defenses (23).

On the one hand, HERV immunosuppressive functions might contribute to cancer progression by reducing the immune recognition and attack of tumor cells. As an example, the ISD peptide of an HERV-H Env (env60), namely H17, was shown to induce tumor cells' epithelial-to-mesenchymal transition, stimulating CCL19 chemokine expression and the subsequent recruitment and expansion of pluripotent immunoregulatory CD271<sup>+</sup> cells (105). This mechanism was proposed to be critical for cancer cell immune escape as well as for metastatic invasion and adhesion (105).

On the other hand, the immunogenic properties of HERV expression, combined with their general upregulation in cancer tissues, especially due to a broad epigenetic dysregulation, could represent an innovative therapeutic target (16). In fact, HERV products have been investigated in the field of anticancer immunotherapy, especially if expressed to higher extents (tumor-associated) or exclusively (tumor-specific) in transformed cells (16). As mentioned above, demethylating agents, commonly used in anticancer therapy, are known to induce a hypomethylated status that derepresses retrotransposon expression, which is the basis of their therapeutic activity (25). In fact, such stimulation led to the production of HERV bidirectional transcripts that can form dsRNA, sensed by cellular PRRs and thus directing an IFN-I and IFN-III response against colorectal tumor cells (29). Accordingly, the individual knockdown of key immune players like MDA5, MAVS, and IRF7 in cancer cells significantly reduced the anticancer activity of DNA methyltransferase inhibitors (29). The activation of dsRNA sensors by HERV nucleic acids following treatment with demethylating agents was reported at the same time by Chiappinelli et al. in ovarian cancer cells (27). The authors underlined that a major anticancer mechanism of demethylating agents is an induced IFN-I immune response mediated by the cytosolic dsRNA sensing of the multiple upregulated HERV transcripts (27). In particular, the cellular PRRs TLR3 and MDA5 have a pivotal role in such immune triggering, given that tumors showing high HERV expression present a concomitant significant activation of these viral sensors (27). Therefore, demethylating agents have been proposed either in association with active immunization therapies, to produce synergistic anticancer effects (151), or as a strategy to overcome primary resistance to immune checkpoint blockade therapies (25).

#### CONCLUSIONS

A growing body of evidence suggests that the relationship between our genome and HERVs constitutes an intricate and multifaceted co-evolution spanning millions of years throughout mammalian development. During such a long liaison, the detrimental effects exerted by HERVs have been balanced by beneficial activities that brought innovation and diversity to the human genome and physiology. Among these, the establishment of diverse HERV-mediated regulatory networks and the co-option of HERV proteins for pregnancy functions provided pivotal features characterizing our biological nature. Intriguingly, while HERVs are products of ancestral exogenous viral infections pervading primates, they became major contributors in shaping and improving human antiviral immunity. Nowadays, they are still able to modulate it in an ambivalent way, suggesting that some new adaptive interplay between HERVs and our genome might still evolve de novo through sequence variation and fortuitous interactions (21). Additional studies are needed to characterize in detail the interaction between individual HERV products and specific immune receptors as well as the pathways involved in HERV-mediated modulation of innate responses. The characterization of such a complex interplay would allow a better understanding of the persistence of these long-time genomic residents and would finally clarify their role in human immune physiology and pathogenesis.

#### AUTHOR CONTRIBUTIONS

NG and ET participated in the conception, drafting and revision of the manuscript and approved the final version.

#### FUNDING

This work was supported by a POR-FESR 2014-2020 grant.

#### ACKNOWLEDGMENTS

We would like to thank the colleagues involved in the studies reported in the present review, and apologize to the ones whose work has not been referenced here.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2018.02039/full#supplementary-material

#### REFERENCES

response in cancer via dsRNA including endogenous retroviruses. Cell (2015) 162:974–986. doi: 10.1016/j.cell.2015.07.011


trophoblast cell fusion and differentiation. Mol Cell Biol. (2003) 23:3566–74. doi: 10.1128/MCB.23.10.3566-3574.2003


and coinfection with opportunistic pathogens. AIDS Res Hum Retroviruses (2011) 27:1099–109. doi: 10.1089/AID.2010.0302


activity in both human autoimmune diseases and experimental arthritis. Arthritis Rheumatol. (2017) 69:398–409. doi: 10.1002/art.39867


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Grandi and Tramontano. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Exaptation of HERV-H: Evolutionary Analyses Reveal the Genomic Features of Highly Transcribed Elements

Patrick Gemmell <sup>1</sup> , Jotun Hein<sup>2</sup> and Aris Katzourakis <sup>1</sup> \*

*<sup>1</sup> Department of Zoology, University of Oxford, Oxford, United Kingdom, <sup>2</sup> Department of Statistics, University of Oxford, Oxford, United Kingdom*

HERV-H endogenous retroviruses are thought to be essential to stem cell identity in humans. We embrace several decades of HERV-H research in order to relate the transcription of HERV-H loci to their genomic structure. We find that highly transcribed HERV-H loci are younger, more fragmented, and less likely to be present in other primate genomes. We also show that repeats in HERV-H LTRs are correlated to where loci are transcribed: type-I LTRs associate with stem cells while type-II repeats associate with embryonic cells. Our findings are generally in line with what is known about endogenous retrovirus biology but we find that the presence of the zinc finger motif containing region of *gag* is positively correlated with transcription. This leads us to suggest a possible explanation for why an unusually large proportion of HERV-H loci have been preserved in non-solo-LTR form.

#### Edited by:

*Tara Patricia Hurst, Abcam, United Kingdom*

#### Reviewed by:

*Martin Sebastian Staege, Martin Luther University of Halle-Wittenberg, Germany Nicole Grandi, University of Cagliari, Italy*

\*Correspondence: *Aris Katzourakis aris.katzourakis@zoo.ox.ac.uk*

#### Specialty section:

*This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology*

> Received: *05 July 2018* Accepted: *28 May 2019* Published: *09 July 2019*

#### Citation:

*Gemmell P, Hein J and Katzourakis A (2019) The Exaptation of HERV-H: Evolutionary Analyses Reveal the Genomic Features of Highly Transcribed Elements. Front. Immunol. 10:1339. doi: 10.3389/fimmu.2019.01339*

Keywords: HERV-H, endogenous retrovirus, transcription, stem cell, exaptation

### BACKGROUND

Endogenous retroviruses (ERVs) are the result of germ line retroviral integrations that are passed from one generation to another in a Mendelian fashion. At integration, a typical ERV locus contains the viral genes gag, pol, and env, as well as two flanking long terminal repeats (LTRs) that are identical. Once present in a population, ERVs may increase in number via reinfection or retrotransposition (1) until all active members of the family are silenced by host defenses (1, 2), degraded by mutations, or truncated by solo-LTR formation (1). Most human endogenous retroviruses (HERVs) are highly degraded, but in a small number of described cases they have been co-opted by their host, have preserved their sequence, and now play roles in human biology. HERV-derived interferon-inducible enhancers regulate human innate immune pathways (3), with the widespread recruitment of endogenous viruses in host immunity across animals pointing to a systematic evolutionary process (4). Furthermore, the ERV-derived syncytins are important for placentation (5), and recent studies suggest that members of the HERV-H family are important to the maintenance of stem cell identity (6, 7).

The HERV-H family (formerly known as RTVL-H, for retrovirus-like sequence primed with a primer binding site homologous to histidine tRNA) has been studied down to the level of individual insertions on many occasions in the past 30 years. The LTRs of HERV-H promote its transcription, and variation in these regions has been used to divide the family into three subtypes: type-I, type-Ia, and type-II (8, 9). The recombinant type-Ia subtype was originally thought to have expanded recently and to have stronger transcription than loci with pure type-I or type-II LTRs (9); however, the relative youth of type-Ia repeats was later brought into question by Anderssen et al. (10), who discovered type-Ia repeats in the marmoset, a New World monkey species. Thus, considerable effort has been made to identify the structural variations of HERV-H that are important to its biological activity.

The sequencing of the human genome was important to a more comprehensive analysis of HERV-H. A study of the majority of full-length HERV-H integrations was performed by Jern et al. (11), who clustered HERV-H into two groups, the larger HERV-H group of 926 full-length elements and the smaller HERV-H like group of 92 full-length elements. Within the larger HERV-H group, Jern et al. (11) used previously studied variants of HERV-H (12, 13) to define two further subgroups: an older group of 77 RGH2-like elements (13) having a fairly intact pol and more frequently containing env; and a younger group of 705 RTVLH2-like elements (12) having more pol deletions and less frequently containing env. The 926 bona-fide HERV-H elements in the human genome were subsequently studied in some detail resulting in an annotated consensus sequence (14). In an unrelated study on ERV replication mechanisms, a small number of relatively intact HERV-H (presumably the RGH2-like elements) were found to have low env dN/dS. It was suggested that these elements had replicated via re-infection and were also responsible for copying other less intact members of the HERV-H family (15). These studies provided further important information on the origin and diversity of HERV-H within the human genome.

In the last few years interest in HERV-H has redoubled as the relationship between retrotransposons and stem cells has become a focus of research. In the case of HERV-H, there is evidence that HERV-H transcripts are upregulated in, and necessary for, the maintenance of human stem cell identity. This evidence derives from studies including those of Wang et al. (7), who show that half of the full-length HERV-H in the human genome are bound by the pluripotency-associated transcription factors NANOG, OCT4, and LBP9. These loci produce chimeric hESC (human embryonic stem cell) and hiPSC (human induced pluripotent stem cell) specific transcripts and long non-coding RNAs. By disrupting HERV-H or LBP9, Wang et al. (7) could demonstrate that some HERV-H elements play an essential biological role: differentiation markers were upregulated while pluripotency-associated transcription factors were downregulated, so that the capacity for self-renewal was shown to be impaired. The same year, Lu et al. (6) reached a similar conclusion on the importance of HERV-H after discovering that interfering with HERV-H transcripts in hESCs led to modified cells becoming more fibroblast-like, with concomitant changes in the appropriate transcriptional markers. More recently, Gemmell et al. (16) showed that the most highly transcribed HERV-H have diverged quickly, and are presumably under directional selection, while Göke et al. (17) studied HERV-H transcription in different cell types and concluded that HERV-H loci are stage-specific regulatory elements in humans. Combined, these studies suggest that HERV-H sequences have been exapted by their hosts, whereby retrovirally derived sequences have been functionally co-opted for their roles in host biology. These exciting recent results have yet to be integrated with the tradition of studying HERV-H from an evolutionary perspective.

Given the current interest in HERV-H, we think it important to capitalize on several decades of detailed work on the family, including the aforementioned consensus and the LTR repeat types that have previously been shown to affect transcription. Additionally, we think it notable that an especially large number of HERV-H loci have been maintained in non-solo-LTR form (see **Figure 1**), an oddity that has been raised but not resolved in the recent literature (19–21). For these reasons, the crux of this study is as follows: what is the relationship between the genomic characteristics of a HERV-H locus and the level at which it is transcribed? Below we identify the features of HERV-H that are significantly correlated with transcription and show how age, integrity, and the LTR repeat type of a locus affect its transcription levels. We also demonstrate that there is a relationship between gag and transcription, and conjecture that the presence of zinc finger motifs in the terminal region of this gene might be related to why so many HERV-H loci are maintained in a full-length state.

#### RESULTS

We produced multiple sequence alignments of the HERV-H genes gag (1,080 sequences), pol (1,126 sequences), and env (1,081 sequences), using the consensus HERV-H (14) as a guide (complete procedures are provided in the Methods section). We used the same consensus to form alignments of the 5′ (908 sequences) and 3′ (932 sequences) LTRs of HERV-H. Examining the alignments we identified regions that corresponded to deletions shared across subsets of viral sequences. We refer to these regions via enumeration from 5′ to 3′ and they are depicted in **Figure 2**.

L5/1–L5/4, G1–G4, P1–P8, E1–E5, and L3/1–L3/4 are also noted.

BLASTn was used to search for the previously characterized type-I and type-II repeats within the HERV-H LTRs. To place the HERV-H loci in context we obtained their distance from the nearest gene, and also located 847 of the loci in a six-way alignment of primate genomes, thereby gaining some perspective on their structural state in other primates. Finally, we were able to determine the genetic distance (K80 substitution model) between the paired LTRs of 627 of the HERV-H loci—as the divergence between paired LTRs accumulates at a neutral rate, both genetic distance and orthology contribute information about the age of the loci under investigation.
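To make the LTR-based age proxy concrete, the sketch below computes the K80 (Kimura two-parameter) distance between two aligned sequences, such as the 5′ and 3′ LTRs of one locus. It is an illustration only and not the code used in this study; the function name `k80_distance` and the choice to skip gapped or ambiguous columns are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k80_distance(seq_a: str, seq_b: str) -> float:
    """K80 distance between two aligned sequences of equal length.

    Columns containing gaps or ambiguous bases are skipped
    (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared      # proportion of transitions
    q = transversions / compared    # proportion of transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Because the two LTRs of a locus are identical at integration, a larger distance indicates an older insertion under the neutral-rate assumption stated above.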

TABLE 1 | PGLS applied to transcription data from embryonic cells for 409 HERV-H loci: λ = 0.48 (0.17, 0.78); *R* <sup>2</sup> = 0.24.


TABLE 2 | PGLS applied to transcription data from stem cells for 409 HERV-H loci: λ = 0.50 (0.27, 0.73); *R* <sup>2</sup> = 0.13.


Because HERV-H loci are related, they do not represent independent data points for statistical analysis. To quantify the relationship between HERV-H transcription data (7) and the aforementioned per locus genomic features of HERV-H, we controlled for the non-independence between measurements due to shared evolutionary history by incorporating phylogenetic information. A maximum likelihood (ML) phylogeny of the genes of the HERV-H loci was constructed, and we used a phylogenetic generalized least squares (PGLS) method to control for the non-independence of measurements across the HERV-H family. Of the 834 loci in the phylogeny, 409 had a complete collection of genomic features (i.e., no missing values) and could therefore be used in our PGLS analysis (**Tables 1**, **2**). Akaike information criterion (AIC) based model selection and plots were used to explore model space and we formulated minimum adequate models to explain the features of our dataset i.e., we carefully explored regression space and include here models containing statistically informative explanatory variables. We provide our data in **Additional File 1**. The response variable in our analyses was the logarithm of per locus mean transcription in reads per kilobase per million reads (RPKM) and PGLS regressions were performed using measurements from three distinct cell types defined by Wang et al. (7): somatic (32 sample types); embryonic (114 sample types); and stem (55 hESC and 25 hiPSC sample types). An overview of the methods of Wang et al. is included in our own methods section, which also describes how to link our data to theirs.
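The core of a PGLS fit can be sketched in a few lines. The example below is a minimal pure-Python illustration and not the R/APE analysis used in this study. It assumes a phylogenetic covariance matrix `vcv` is already available (for example, shared branch lengths between tips), scales its off-diagonal entries by λ, and then computes the generalized least squares estimate β = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. All function names are hypothetical.

```python
def lambda_transform(vcv, lam):
    """Multiply off-diagonal covariances by lambda; keep variances unchanged."""
    n = len(vcv)
    return [[vcv[i][j] if i == j else lam * vcv[i][j] for j in range(n)]
            for i in range(n)]

def solve(A, B):
    """Solve A Z = B (B given as a list of rows) by Gauss-Jordan elimination."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(v) for v in B[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def transpose_times(A, B):
    """Return the matrix product A' B."""
    return [[sum(A[k][i] * B[k][j] for k in range(len(A)))
             for j in range(len(B[0]))] for i in range(len(A[0]))]

def gls_fit(X, y, V):
    """GLS coefficients for design matrix X, response y, and covariance V."""
    vinv_x = solve(V, X)                    # V^-1 X
    vinv_y = solve(V, [[v] for v in y])     # V^-1 y
    lhs = transpose_times(X, vinv_x)        # X' V^-1 X
    rhs = transpose_times(X, vinv_y)        # X' V^-1 y
    return [row[0] for row in solve(lhs, rhs)]
```

With λ = 0 the covariance matrix becomes diagonal and, when all tip variances are equal, the fit reduces to ordinary least squares; with λ = 1 the full tree-derived covariance is used. Intermediate values, such as those reported in **Tables 1**, **2**, down-weight but do not remove the phylogenetic correlation.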

#### Qualitative Analysis

In this section we describe the patterns that are apparent from visual inspection of our results. These observations motivate the quantitative analysis that follows in the next section.

The relationship between HERV-H phylogeny, transcription, and genomic structure is visually summarized in **Figure 3**. To aid the reader, a simplified and annotated partial version of this figure is also provided in **Figure 4**. The tips of the tree in **Figure 3** are colored according to the categorical transcription level assigned to them by Wang et al. (7) (see Methods). Upon visual inspection, it can be seen that loci annotated as highly transcribed by Wang et al. (7) are located toward the bottom of the ladderized tree, and we confirmed that this phenomenon was also true of phylogenies built using LTRs and from amino-acid or nucleotide trees of individual HERV-H genes (some examples are contained in **Additional File 2**). Thus, HERV-H transcription appears to have a phylogenetic component.

In which cells is HERV-H active? **Figures 3**, **4** represent the average transcription of HERV-H loci across cells categorized as somatic, embryonic, or stem. The figures show a hierarchy of HERV-H transcription such that transcription is typically lowest in somatic cells, and higher in embryonic cells and stem cells in turn. Note that this comparison between different cell types involves contrasting data gathered in different ways (e.g., cell culture vs. primary sources) by different labs and that any systematic effect this leads to is not something that we correct for.

LTRs identified as type-I outnumber those identified as type-Ia/II. However, considering transcription in conjunction with LTR subtypes, one can also see that type-Ia/II loci are distributed throughout the tree and that type-II repeats are associated with higher levels of embryonic transcription, even if a locus is categorically characterized as inactive or moderately active by Wang et al. (7). Therefore, in agreement with Göke et al. (17), we seem to identify stage-specific transcription, though in this case we relate it to particular well-studied repeat types [see e.g., (10)].

If transcription is positively correlated with function, our results suggest that many parts of consensus HERV-H might have deleterious effects that prevent their exaptation. Columns L5, gag, pol, env, and L3 of **Figure 3** indicate the presence or absence of the viral regions L5/1–L5/4, G1–G4, P1–P8, E1–E5, and L3/1–L3/4 at particular loci. Highly active loci toward the bottom of the tree can be seen to have highly intact LTRs, while some less transcribed loci toward the top of the tree have less intact LTRs. Less active older loci toward the top of the tree tend to have relatively complete internal regions when compared to the active loci at the bottom of the figure.

The status of active HERV-H in other primates is also of interest. The sequences used to build the phylogeny in **Figure 3** come from the human genome. A rough indication of the age and status of HERV-H loci in other primates is given by the column cgom. Loci present in chimpanzee, gorilla, orangutan, or macaque at a minimum of 25% of the level that they appear in human, and that therefore presumably consist of more than a solo-LTR, are marked with a dot as appropriate. Visual inspection of **Figures 3**, **4** shows that the less derived and less active HERV-H loci that are located toward the top of the tree appear more often present in a substantive way in other primates; the active and derived loci toward the bottom of the tree appear to be absent or degraded in distant primates. This suggests that if HERV-H is important to stem cell biology in non-human primates, then the loci that are involved might be different to those that are involved in humans.

FIGURE 4 | Concise guide to Figure 3 showing transcription levels and per locus genomic features of a subset of the 409 HERV-H loci. Rows are ordered, colored, and sourced as per Figure 3, though some rows are omitted to aid clarity. Annotations highlight aspects of our results that are discussed in detail in the main article text.

#### Statistical Results

The above observations (younger loci are more transcribed; loci with less intact genes are more transcribed; loci with more intact LTRs are more transcribed; the repeat type of an LTR is relevant to where the corresponding HERV-H is transcribed) are supported by a statistical analysis and are robust to phylogenetic non-independence. Two PGLS regressions are reported in detail: **Table 1** displays a regression for transcription measurements from embryonic cells and **Table 2** shows a similar analysis of stem cell data. Both regressions were constructed with respect to the same 409 loci, though each model contains only the explanatory variables that were found to be important from a statistical perspective.

In embryonic cells, the presence of region P6 of pol is negatively correlated with transcription. The P6 region of pol has previously been identified with an RNaseH motif and the 5′ end of the integrase domain (14). Positive correlations between embryonic transcription and the presence of the LTR region L5/1, as well as between embryonic transcription and type-II repeats, are also seen. For stem cells, the situation is similar, in that P6 is negatively correlated with transcription, whereas L5/1 is positively correlated with transcription. However, **Table 2** also shows that in stem cells it is type-I LTRs (indicator TI + UI) that are positively correlated with transcription, and that the older an ERV is the less active it is likely to be. Surprisingly, given that **Figures 3**, **4** suggest that intact loci tend to be less transcribed, the presence of the G4 region of gag is positively correlated with transcription in both embryonic cells and stem cells.

A coefficient of determination corrected to take account of correlation structure [see (22), p. 224] can be applied to the regressions presented in **Tables 1**, **2**. The regression describing embryonic data has R <sup>2</sup> = 0.24, and therefore explains nearly a quarter of the variance in the transcription of HERV-H using only a few features of the loci themselves. The model describing stem cell data has R <sup>2</sup> = 0.13, and seems to underfit the most highly transcribed HERV-H loci. Both PGLS models possess intermediate levels of phylogenetic signal which suggests that the evolutionary relationship between HERV-H loci must be taken into account in order to make correct statistical inferences relating structure and transcription. As expected, given that there is no reported role for HERV-H in somatic cells, we were unable to produce a PGLS regression with R <sup>2</sup> > 0.01 for somatic cell measurements—therefore we do not think there is a meaningful relationship between the structure of the majority of HERV-H loci and their somatic transcription.

### DISCUSSION AND CONCLUSIONS

We found the repeat class of LTRs to be related to where the corresponding HERV-H is transcribed. HERV-H loci classified as type-I were roughly twice as highly transcribed in stem cells when compared to other loci. On the other hand, we found the presence of a type-II repeat increased transcription of HERV-H loci by a factor of four in embryonic cells. The finding that type-II repeats are preferentially transcribed in embryonic cells is potentially important in light of the widespread interest in cultivating naive human cells. Mouse ESC cultures have been shown to contain sub-populations of naive-like cells that have totipotent properties (23), and these totipotent cells have transcriptional characteristics that are usually associated with 2-cell embryos. Our analyses show that the type-II repeat is correlated with HERV-H transcription in early human embryonic cells. At the same time, these loci with type-II repeats do not appear to be a focus of Wang et al. (7) as they do not usually cluster into the highly active category (**Figures 3**, **4**). As the work of Macfarlan et al. (23) shows that the natural timing of the transcription of repeat loci is relevant for identifying naive-like cells in mouse, it is worth considering whether a similar situation is true for humans also.

We also demonstrated that the integrity of HERV-H LTRs is important to the magnitude of HERV-H transcription. In particular, transcription of HERV-H is consistently correlated with the presence of the first 114 bp (L5/1) of the consensus LTR. This short sequence is part of the larger U3 region and is known to contain a MYB binding site (76–80) as well as to finish with an Sp1 binding site (105–114). The correlation makes sense given Sp1 binding sites have been shown to be important in previous assays (10, 24).

A related feature of our results is the fragmented nature of the pol and env genes of many HERV-H loci (**Figures 2**, **3**). It has previously been shown that the decay of env is a common feature of the most prolific families of endogenous retroviruses (25) and HERV-H seems to be typical in this respect. In addition it is sensible to assume that a more complete (i.e., longer) pol or env will contain more sequence that triggers host defenses (2) or is more likely to be involved in harmful ectopic recombination events [e.g., (26)]. In particular, we find here that the P6 region of pol is negatively correlated with transcription to the extent that the presence of P6 is associated with a roughly 2-fold decrease in HERV-H transcription. It will be interesting to determine whether the P6 region attracts repressive chromatin/DNA marks that could explain this observation. This result shows that, even today, more complete HERV-H loci are subject to reduced transcription in their host.

In light of the preceding discussion it is striking that the gag gene is relatively complete in the majority of HERV-H, and that the presence of the G4 region of gag is associated with a 38 and 68% increase in HERV-H transcription in embryonic and stem cells, respectively. The consensus HERV-H gag has previously been annotated and was described as containing two zinc finger motifs (14). These motifs are located in region G4 (positions 1390–1431 and 1459–1497) and would have originally enabled proteins encoded by the exogenous progenitor of HERV-H to bind single-stranded viral RNA as part of a packaging process. The association between these motifs and HERV-H transcription levels may be a statistical coincidence, but it is unexpected enough to encourage speculation on the alternative that there might be a causal explanation for the relationship.
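Because the response variable is the logarithm of transcription, a regression coefficient for an indicator variable corresponds to a multiplicative change in transcription. The short sketch below shows the conversion under the assumption that natural logarithms were used (the base is not stated in the text); the function name is hypothetical and the numbers in the usage note are taken from the percentages quoted above, not from the regression tables.

```python
import math

def percent_change(coefficient: float) -> float:
    """Percentage change in the response implied by a coefficient
    fitted on a natural-log scale (assumed base)."""
    return (math.exp(coefficient) - 1.0) * 100.0
```

Under this assumption, a 38% increase corresponds to a coefficient of about ln(1.38) ≈ 0.32, a 68% increase to about ln(1.68) ≈ 0.52, and a 2-fold decrease to about ln(0.5) ≈ −0.69.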

We draw upon recent research that shows that endogenous viral sequences have been frequently repurposed by their hosts to provide functional roles in host immunity. This process results in virally derived genes that are now functioning as host immune genes, which have been termed endogenous viral element-derived immunity (EDIs) (27). We propose that just as exogenous zinc finger motifs enable retroviral proteins to package RNA to the benefit of an exogenous virus, so too might endogenous zinc finger motifs have at one time enabled HERV-H proteins to bind viral RNA to the benefit of the host. The general phenomenon of viral material being captured by a host for immune benefits is well-established and occurs via many mechanisms and in many organisms (27). In this case our suggestion is that insofar as zinc finger motif-complete products of HERV-H were in other ways defective, or bound related viruses that they did not package, they would act in competition with genuinely effective packaging proteins and therefore as antagonists against viral reinfection. This scenario invokes a new role for zinc finger domains in host defense as compared to their (unrelated) known role as part of host KRAB zinc finger based defenses (28–31).

Reflecting on the above discussion we conclude with the following narrative. In the distant past exogenous HERV-H was a complete retrovirus adapted to horizontal transmission and at some point roughly 35 million years ago an endogenization occurred (32). It is possible that HERV-H then drifted to fixation. The intact nature of HERV-H gag is in stark contrast to the fragmented pol and env genes that are the hallmarks of prolific endogenous retroviruses. If we suppose the relationship between the zinc finger motif containing region of gag and transcription is important, this would go some way toward explaining why HERV-H is present at a roughly 1:1 full-length to solo-LTR ratio in the human genome. This ratio is extraordinary, but it could be that while HERV-H LTRs drive transcription, at some point in time the zinc finger motifs in the last third of gag made some HERV-H loci useful to the host, and consequently more likely to be at first tolerated and then later co-opted for a role in stem cell identity. This would suggest we view co-opted HERV-H as a biologically convenient aggregation of LTRs and the G4 region of gag, both embedded in an ancestral ERV packaging, the remainder of which was of little functional consequence over evolutionary time.

### METHODS

### Multiple Sequence Alignments of HERV-H Loci

We obtained the genomic sequence underlying the 1,225 full-length HERV-H loci described by Wang et al. (7) from the UCSC Genome Browser database available at http://genome.ucsc.edu (33), and used Genome Reference Consortium Human Build 37 (GRCh37, hg19) for the human genome. We also obtained the corresponding RepeatMasker (http://www.repeatmasker.org) gene annotations from the same database. For the non-human primates, we used the following assembly versions: Chimp: CHIMP2.1.4, Gorilla: gorGor3.1, Orangutan: PPYG2, Macaque: MMUL1.0, Marmoset: C\_jacchus3.2.1.

Before aligning the HERV-H loci we first used RepeatMasker annotations to identify any non-HERV-H repeats within the underlying sequences. Such non-HERV-H repeats, for example SINEs or LINEs, would, if ignored, introduce confounding indels into our analysis. For this reason all but the outermost 20 bp of these repeats were removed from the 1,225 HERV-H sequences before constructing alignments. The remaining 40 bp or less of non-HERV-H repeats were flagged so that we could remove them by hand, thereby ensuring that we did not remove mistakenly RepeatMasked genuine HERV-H sequence. Non-HERV-H repeats were found to be rare: the sequence underlying 80% of the 1,225 loci remained completely unmodified, and the remaining 20% of loci had a median of 12% of their underlying sequence removed.
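The trimming step described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually implemented: the function name `trim_repeat`, the half-open 0-based coordinates, and the decision to leave repeats of 40 bp or less untouched (but flagged) are all choices made for the example.

```python
def trim_repeat(seq: str, start: int, end: int, keep: int = 20):
    """Remove the interior of an annotated repeat seq[start:end].

    Keeps `keep` bp at each edge of the repeat and returns the trimmed
    sequence plus the interval (in trimmed coordinates) that still holds
    repeat-derived sequence and should be inspected by hand.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("repeat interval lies outside the sequence")
    if end - start <= 2 * keep:
        # Nothing to cut: the whole repeat is flagged for manual review.
        return seq, [(start, end)]
    trimmed = seq[:start + keep] + seq[end - keep:]
    return trimmed, [(start, start + 2 * keep)]
```

For a locus with several annotated repeats, applying the function from the 3′-most repeat to the 5′-most keeps the coordinates of the remaining annotations valid.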

To identify the gag, pol, and env genes within the 1,225 HERV-H sequences we used tBLASTn (34). We searched each of the 1,225 loci using a previously published (14) HERV-H consensus sequence as a query. Hits of at least 25 bp in length and with an expect value no more than 10<sup>−6</sup> were merged in a way that maintained fragment order and sense. The result of this reconstruction was 1,080 gag, 1,126 pol, and 1,081 env genes.
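One way to picture the filter-and-merge step is shown below. The thresholds (25 bp, expect value of 10⁻⁶) come from the text; everything else is an assumption made for illustration, including the `Hit` record, measuring hit length on the locus, and the greedy rule that keeps a hit only if it lies further along the query than the previous one. It is not the merging code used in this study.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical record for one BLAST hit (1-based, inclusive coordinates)."""
    q_start: int   # start on the consensus query
    q_end: int     # end on the consensus query
    s_start: int   # start on the HERV-H locus
    s_end: int     # end on the HERV-H locus
    evalue: float
    strand: str    # "+" or "-"

def filter_hits(hits, min_len=25, max_evalue=1e-6):
    """Keep hits that are long enough and significant enough."""
    return [h for h in hits
            if abs(h.s_end - h.s_start) + 1 >= min_len and h.evalue <= max_evalue]

def ordered_chain(hits, strand="+"):
    """Greedy chain of same-strand hits whose query positions advance
    as we move along the locus, preserving fragment order and sense."""
    same_strand = sorted((h for h in hits if h.strand == strand),
                         key=lambda h: h.s_start)
    chain = []
    for h in same_strand:
        if not chain or h.q_start > chain[-1].q_end:
            chain.append(h)
    return chain
```

The chained hits can then be concatenated in order to give one reconstructed gene per locus.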

Only 20 of the 1,225 HERV-H loci were not matched by one of the three tBLASTn searches. An inspection using Dfam (35) revealed that these loci were sequences with short internal regions, that contained other retroviral insertions, or that were short overall (one full-length region was only 44 bp).

A similar search and merge process was performed for the 5′ and 3′ LTRs of each of the 1,225 HERV-H loci. In this case the search was performed using BLASTn (34) and resulted in the reconstruction of 908 5′ LTRs and 932 3′ LTRs.

To construct multiple sequence alignments of the genes and LTRs of the HERV-H loci we first pairwise aligned the reconstructed sequences to the appropriate part of the consensus. We then progressively combined these alignments to create five multiple sequence alignments, one for each of the three genes and one for each LTR. Pairwise alignment was conducted with Stretcher (36) and progressive multiple alignment was conducted using MUSCLE (37).

### Identifying Regions Containing Common Deletions

Examining the sequence in each alignment we identified regions that demarcated deletions common to subsets of viral sequences. The presence of sequence in these regions was bimodally distributed and is represented in **Figure 2**, where the regions are named via enumeration from 5′ to 3′: we identified eight regions in the alignment of HERV-H LTRs (L5/1–L5/4 for the four regions in the 5′ LTR and L3/1–L3/4 for the four regions in the 3′ LTR); four regions in gag (G1–G4); eight regions in pol (P1–P8); and five regions in env (E1–E5). As per Belshaw et al. (15), if <5% of a region was aligned for a particular HERV-H the region was marked as absent at that locus, while if more than 55% of a region was aligned it was marked as present. Ambiguous regions were coded as missing values. The 25 regions we defined were included as indicator variables (i.e., 1 or 0) when constructing statistical models (below).
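The presence/absence coding can be expressed compactly. The sketch below uses the thresholds given in the text; the function names are hypothetical, and representing a missing value as `None` is a choice made for the example.

```python
from typing import List, Optional, Sequence

def code_region(fraction_aligned: float,
                absent_below: float = 0.05,
                present_above: float = 0.55) -> Optional[int]:
    """Return 0 (absent), 1 (present), or None (ambiguous, i.e., missing)."""
    if fraction_aligned < absent_below:
        return 0
    if fraction_aligned > present_above:
        return 1
    return None

def complete_cases(rows: Sequence[Sequence[Optional[int]]]) -> List[Sequence[Optional[int]]]:
    """Keep only loci for which every region indicator is known."""
    return [row for row in rows if None not in row]
```

Only loci without missing values across all features can enter a regression, which is why the statistical models are fitted to a subset of the loci in the phylogeny.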

#### Distance to Nearest Gene

The RefSeq gene annotation track was downloaded from the appropriate UCSC Genome Browser database (as detailed above). The distance between the centroid of each HERV-H locus and its nearest neighboring gene was calculated and recorded.
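A minimal sketch of this calculation is given below. It assumes that genes are supplied as (start, end) intervals on the same chromosome as the locus and that distance is measured from the locus centroid to the nearest gene boundary, with zero returned when the centroid falls inside a gene. The text does not specify these details, so they should be read as assumptions, and the function name is hypothetical.

```python
def distance_to_nearest_gene(locus_start: int, locus_end: int, genes):
    """Distance in bp from the locus centroid to the closest gene interval.

    `genes` is an iterable of (start, end) tuples on the same chromosome.
    Returns None when no genes are supplied.
    """
    centroid = (locus_start + locus_end) / 2
    best = None
    for gene_start, gene_end in genes:
        if gene_start <= centroid <= gene_end:
            distance = 0.0
        else:
            distance = min(abs(centroid - gene_start), abs(centroid - gene_end))
        if best is None or distance < best:
            best = distance
    return best
```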

#### Characterization of LTR Subtypes

Several examples of HERV-H LTR subtypes (9, 32) are provided by Anderssen et al. (10). We constructed consensus type-I (TI) and type-II (TII) repeats as well as consensus unique-I (UI) and unique-II (UII) sequences based on **Figure 2** of the study by Anderssen et al. (10). These consensus sequences were used as queries in a BLASTn search against the 1,225 HERV-H loci. Hits with expect values of no more than 10<sup>−6</sup> were treated as indicating the presence of the appropriate sequence. Data on the presence of LTR subtype sequences at HERV-H loci are recorded in **Additional File 1**.

#### Pairing Loci to EPO Multiple Alignments

To examine the status of the 1,225 HERV-H full-length loci in other primates we obtained the Enredo-Pecan-Ortheus (EPO) genome scale multiple sequence alignment from Ensembl Release 71 (38). We then used BLASTn to locate the HERV-H loci in the human row of the EPO alignments. Of the 1,225 HERV-H loci, 847 were unambiguously located (unique and exact matches) within a six-way EPO alignment. The remaining loci were not present in the alignments in an unambiguous form. This could be because the region of the human genome they are located in is not included in the EPO alignments, or because the region they are located in was identified as having been duplicated in a primate other than human.

Some uncertainty remains over loci that were unambiguously located. This is because the EPO methodology could fail to identify genuine orthologs and paralogues in the non-human primates. We note that orthology information from the EPO alignments contributes qualitatively to our results but was not used in our statistical analyses.

#### Tree Building and Phylogenetic Regression

A supermatrix concatenation of gag, pol, and env alignments was produced. The alignment was edited by hand to remove short or badly aligned regions and sequences that could not reasonably be assumed to be homologous. The tree building software RAxML 8.2.3 (39) was used to produce a maximum likelihood (ML) tree relating the 834 sequences in the supermatrix alignment. Tree inference was performed under the GTR + γ substitution model.

The supermatrix based tree was rooted by constructing an auxiliary phylogeny of the reverse-transcriptase (RT) region of pol (nucleotides 82–576 of the pol consensus). The 569 HERV-H sequences with relatively complete RT, having over 140 of 165 possible codons, were first translated and then combined with a panel of 15 HERV-W RTs. An ML tree was constructed from the resulting alignment using the PROTGAMMAAUTO option.

All regression analysis was conducted with the R system (40). Regressions taking into account phylogeny were performed using a phylogenetic generalized least squares (PGLS) approach as introduced by (41). The λ method (42) was used to assess phylogenetic signal. Analyses were conducted using the R package APE (43).
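For orientation, the textbook form of a PGLS estimator with a λ transformation is sketched below. This assumes that the λ method of reference (42) is Pagel's λ and that trait covariance follows a Brownian motion model on the tree; the symbols are illustrative and are not taken from the article.

```latex
\hat{\beta} = \left(X^{\top} V_{\lambda}^{-1} X\right)^{-1} X^{\top} V_{\lambda}^{-1} y,
\qquad
\left(V_{\lambda}\right)_{ij} =
\begin{cases}
V_{ij}, & i = j \\
\lambda V_{ij}, & i \neq j
\end{cases}
```

Here $y$ is the response vector, $X$ the design matrix, and $V$ the covariance matrix among tips implied by their shared branch lengths. With $\lambda = 1$ the full phylogenetic covariance is retained; with $\lambda = 0$ the tips are treated as independent, so an estimate of $\lambda$ near zero indicates little phylogenetic signal.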

#### Transcription Data

Wang et al. (7) previously curated HERV-H transcription data from a large panel of RNA sequencing experiments performed on a variety of cell types. We have subsequently related these data to genomic features in our own study.

In overview, Wang et al. (7) compared uniquely mapped HERV-H reads to other uniquely mapped reads and normalized the counts as RPKM (reads per kilobase per million mapped reads). Wang et al. (7) also used hierarchical clustering to apply one of three labels (highly active, moderately active, or inactive) to each HERV-H locus as a way of summarizing the transcription of a locus across many cell types and experiments. These data are fully available in Supplementary Table 7 of Wang et al. (7), while the provenance of the reads themselves is available in Supplementary Table 4 of the same article. Our data can be related in full to the data of Wang et al. (7) via the first column of our **Additional File 1**: our serial number 1 to 1,225 corresponds to what Wang et al. refer to as Repeat\_ID, which takes values HERVH\_1 to HERVH\_1225.
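As a reminder of what RPKM normalization does, a minimal sketch follows. It uses the standard RPKM definition and assumes the library size is the total number of uniquely mapped reads; the exact denominator used by Wang et al. (7) is not restated here, and the function name and example numbers are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    # count / (length in kb) / (library size in millions)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: 500 uniquely mapped reads on a 5,000 bp locus
# in a library of 20 million uniquely mapped reads.
example_rpkm = rpkm(500, 5_000, 20_000_000)
print(example_rpkm)
```

Dividing by both locus length and library size makes values comparable between loci of different lengths and between experiments of different sequencing depth.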

The data used by Wang et al. (7) come from a variety of cell types which they grouped into five broad categories including somatic tissue (both tissue specific and mixture), embryonic cells, hESC (human embryonic stem cells), and hiPSC (human induced pluripotent stem cells). This processing and grouping of RNA transcription experiment data by Wang et al. (7) was convenient for our study and we relied on their data without subsequent relabeling or adjustment.

## AUTHOR CONTRIBUTIONS

PG and AK conceived the study, interpreted the results, and participated in the editing of the manuscript. PG performed the analysis. PG was supervised by AK and JH. All authors read and approved the final manuscript.

## FUNDING

PG was supported by the Engineering and Physical Sciences Research Council and AK is funded by the Royal Society.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2019.01339/full#supplementary-material

Additional File 1 | This file contains information on the structure of primate HERV-H loci in a machine readable tabular form.

Additional File 2 | This file contains trees built using HERV-H LTRs and individual HERV-H genes.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Gemmell, Hein and Katzourakis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Related Endogenous Retrovirus-K Elements Harbor Distinct Protease Active Site Motifs

#### Matthew G. Turnbull<sup>1</sup>† and Renée N. Douville<sup>1,2</sup>\*

<sup>1</sup> Department of Biology, University of Winnipeg, Winnipeg, MB, Canada, <sup>2</sup> Department of Immunology, University of Manitoba, Winnipeg, MB, Canada

Background: Endogenous retrovirus-K is a group of related genomic elements descending from retroviral infections in human ancestors. HML2 is the clade of these viruses which contains the most intact provirus copies. These elements can be transcribed and translated in healthy and diseased tissues, and some of them produce active retroviral enzymes, such as protease. Retroviral gene products, including protease, contribute to illness in exogenous retroviral infections. There are ongoing efforts to test anti-retroviral regimens against endogenous retroviruses. Herein, we examine the potential activity and diversity of human endogenous retrovirus-K proteases, and their potential for impact on immunity and human disease.

#### Edited by:

Tara Patricia Hurst, Abcam, United Kingdom

#### Reviewed by:

Antoinette Van Der Kuyl, University of Amsterdam, Netherlands

Larance Ronsard, Ragon Institute of MGH, MIT and Harvard, United States

\*Correspondence:

Renée N. Douville r.douville@uwinnipeg.ca orcid.org/0000-0002-7854-007X †orcid.org/0000-0001-8603-4882

#### Specialty section:

This article was submitted to Virology, a section of the journal Frontiers in Microbiology

Received: 01 May 2018 Accepted: 25 June 2018 Published: 18 July 2018

#### Citation:

Turnbull MG and Douville RN (2018) Related Endogenous Retrovirus-K Elements Harbor Distinct Protease Active Site Motifs. Front. Microbiol. 9:1577. doi: 10.3389/fmicb.2018.01577

Results: Sequences similar to the endogenous retrovirus-K HML2 protease and reverse transcriptase were identified in the human genome, classified by phylogenetic inference and compared to Repbase reference sequences. The topologies of trees inferred from protease and reverse transcriptase sequences were similar and agreed with the classification using reference sequences. Surprisingly, only 62/480 protease sequences identified by BLAST were classified as HML2; the remainder were classified as other HML groups, with the majority (216) classified as HML3. Variation in functionally significant protease motifs was explored, and two major active site variants were identified – the DTGAD variant is common in all groups, but the DTGVD motif appears limited to HML3, HML5, and HML6. Furthermore, distinct RNA expression patterns of protease variants are seen in disease states, such as amyotrophic lateral sclerosis, breast cancer, and prostate cancer.

Conclusion: Transcribed ERVK proteases exhibit a diversity which could impact immunity and inhibitor-based treatments, and these facets should be considered when designing therapeutic regimens.

Keywords: endogenous retrovirus-K (ERVK), protease, protease inhibitor, active site motifs, RNAseq, amyotrophic lateral sclerosis, breast cancer, prostate cancer

## INTRODUCTION

Retroviridae is a diverse family composed of both exogenous infectious viral species, whose life cycle includes stages with an ssRNA genome inside virions which is converted into a dsDNA provirus, and endogenous proviral species (endogenous retroviruses, ERVs) transmitted in a Mendelian fashion. The evolutionary pressures on ERVs and exogenous retroviruses (XRVs) are very different; ERVs experience much stronger negative selection against pathogenicity since their survival depends directly on reproduction of their host (Holmes, 2011; Stewart et al., 2011; Barbeau and Mesnard, 2015). In fact, some ERVs are directly involved in placentation (Mi et al., 2000), and some host genes descend from retroviral ancestors (Kaneko-Ishino and Ishino, 2012). Indeed, the role of ERVs in the evolution and function of their host genome is becoming clearer (Bannert and Kurth, 2004; Cowley and Oakey, 2013). The apparent inactivity of many ERVs is probably a direct consequence of this pressure, as many ERV integrations contain inactivating mutations that disrupt the function or expression of viral genes. The most severe inactivating mutation is the case of solo LTRs, where the entire coding region of the ERV is deleted. Despite this, a surprising number of ERVs appear to lack obvious inactivating mutations, and some are even infectious, blurring the line between ERV and XRV (Boller et al., 2008; Jha et al., 2011; Denner and Young, 2013; Kozak, 2014). Although no infectious ERV has been proven to exist in the human genome, some loci are polymorphic, and population genetics analysis and functional studies have not ruled out low-level ongoing infectious replication of some human ERVs (Turner et al., 2001; Jha et al., 2011; Naveira et al., 2014; Wildschutte et al., 2014, 2016).

Here we focus on the Betaretrovirus-like human ERVK, and particularly the HML2 clade (HK2), which includes loci encoding functional enzymes, and whose youngest members may be less than 200,000 years old (Franklin et al., 1988; Boller et al., 2008). Actively transcribed ERVs, such as ERVK-10, are assigned names by the Human Gene Nomenclature Committee as recommended by Mayer et al. (2011). The biological role of transcribed ERVK elements has become clearer in recent years (Ruda et al., 2004; Macfarlan et al., 2012; Manghera et al., 2014, 2015; Michaud et al., 2014; Schlesinger et al., 2014; Grow et al., 2015; Li et al., 2015; Bray et al., 2016; Christensen, 2016; Becker et al., 2017; Prudencio et al., 2017), but the effects of individual viral proteins are understudied compared to those of their XRV counterparts. Protease-mediated maturation of retroviral structural proteins and enzymes is critical to the XRV life-cycle; viruses lacking PR activity produce virions with an immature morphology and are not infectious (Kohl et al., 1988). Even though ERVs are not known to replicate by infection of new cells, they can increase in copy number by reinfection of their host cell; this process, called reintegration, requires the activity of all the viral enzymes (Dewannieux et al., 2006). Because PR function is conserved across Retroviridae, a virus with an inactive protease may be complemented by another retroviral protease. The factors which permit or exclude this complementation must be context dependent; for example, the HK2 protease encoded by ERVK-10 cleaves HIV-1 Gag in vitro, although not in virio (Towler et al., 1998). The contextual nature of complementation is not perfectly understood, so care must be taken in applying findings across model systems, and particularly in the context of human disease (Towler et al., 1998; Contreras-Galindo et al., 2012; Morandi et al., 2015).

Structural analysis proceeds most easily using crystallographic structural determination, which is not yet available for the HK2 PR; however, biochemical and genetic analyses reveal a typical A2 aspartic protease, with each 106-residue monomer of the mature dimer processed autocatalytically from a larger precursor (Mueller-Lantzsch et al., 1993; Towler et al., 1998; Kuhelj et al., 2001; Dunn et al., 2002). The active HK2 PR is moderately sensitive to pepstatin, with a pH optimum around 4.5 (Towler et al., 1998; Kuhelj et al., 2001). Some HK2 elements (such as ERVK-10) encode a functional protease, and HK2 virions with condensed cores have been observed budding from human cells (Mueller-Lantzsch et al., 1993; Sauter et al., 1995; Towler et al., 1998; Boller et al., 2008). Furthermore, the two reconstructed HK2 viruses Phoenix and ERVKCON have a PR-dependent infectivity (Dewannieux et al., 2006; Lee and Bieniasz, 2007). Although the substrate affinity of the HK2 PR has not been extensively studied, the PR cleavage sites in the HK2 Gag polyprotein are known (Kraus et al., 2011), as is its susceptibility to select PIs (Towler et al., 1998; Kuhelj et al., 2001).

In addition to catalyzing maturation of viral proteins, PR may play a role in immunity and pathology of human diseases. Protease pathogenicity has been best studied in HIV-1, whose PR is known to cleave a variety of host proteins (Shoeman et al., 1991; Impens et al., 2012), and to trigger apoptosis under specific conditions (Shoeman et al., 1991; Rumlova et al., 2014). HIV-1 PR also directly combats innate immunity by trafficking the dsRNA sensor RIG-I to the lysosome (Solis et al., 2011). Another example of HIV PR inhibition of innate immunity is its ability to cleave RIPK1 and RIPK2 proteins, thus successfully abrogating NF-κB signaling (Wagner et al., 2015). The effect of proteases is not limited to innate immune signaling proteins, but also effector proteins such as intrinsic restriction factors. Both FIV and MLV cleave APOBEC3 in their respective hosts as a means to evade this antiviral defense (Abudu et al., 2006; Yoshikawa et al., 2017). Given the known pathogenic potential of retroviral proteases, it seems prudent to explore the potential effects of PRs encoded by ERVK which, in contrast to HIV-1, are present in all human beings – and which have been associated with specific human diseases, such as ALS, schizophrenia, rheumatic disease and cancer (Frank et al., 2005; Seifarth et al., 2005; Reynier et al., 2009; Douville et al., 2011; Li et al., 2015; Christensen, 2016).

Here, we undertake genomic, sequence-function, and transcriptomic analysis of the diverse ERV PRs in the human genome. We draw on the existing ERV and XRV literature to evaluate the phylogenetic distribution of sequence motifs with potential functional relevance, focusing on HK2 PRs. We accomplish this by identifying human ERV protease sequences, inferring their phylogenetic affiliation to each other, and comparing them to representative sequences. The classification process was carried out in parallel using RT as a standard comparator. Here we show that distinct ERVK PR variants with predicted differences in their functional activity are differentially transcribed in disease-relevant human tissues.

**Abbreviations:** AA, peptide/amino acid; ALS, amyotrophic lateral sclerosis; BLAST, basic local alignment search tool; CA, Capsid; DNA, deoxyribonucleic acid; dsRNA, double stranded RNA; DU, dUTPase; ERV, endogenous retrovirus; ERVK, endogenous retrovirus-K; FTP, file transfer protocol; GIMP, GNU image manipulation program; GRCh38, Genome Reference Consortium human genome release 38; GTR, general time-reversible; HIV, human immunodeficiency virus; HK, human endogenous retrovirus-K HML; HML, homology to mouse mammary tumor virus-like; HMM, hidden Markov model; IN, integrase; JTT, Jones, Taylor, and Thornton; LTR, long terminal repeat; MA, matrix; MACSE, multiple alignment of coding sequences; NCB, National Computational Biology Institute; NC, nucleocapsid; NT, nucleotide/nucleic acid; PI, protease inhibitor; PNG, portable network graphics; PR, protease; RAxML, randomized axelerated maximum likelihood; RH, RNAse H; RIG, retinoic acid inducible gene; RNA, ribonucleic acid; RNA-Seq, RNA sequencing; RT, reverse transcriptase; sALS, sporadic ALS; SRA, sequence read archive; ssRNA, single stranded RNA; SU, surface subunit of envelope; WAG, Whelan and Goldman; XIST, X inactive specific transcript; XRV, exogenous retrovirus.

### RESULTS

#### Representative Retroviral Motifs

With the goal of using the most recent human genome build GRCh38 to identify loci encoding retroviral PR and RT enzymes, HMMs were used as a search tool. Pfam-A was searched for all relevant protein families related to retroviruses. **Table 1** describes the output from this search, revealing 92 HMMs in total, with 40 directly related to retroviral core and accessory proteins. Most important amongst these are the retroviral protease domain RVP, and the RT core domain RVT\_1. Retrovirus-associated HMMs (**Table 1**), the PR sequence of ERVK-10 (Towler et al., 1998), and the RVT\_1 domain of Phoenix (Dewannieux et al., 2006), were subsequently used to identify pro and pol sequences in representative retroviral sequences and in the human genome.

### Representative Endogenous Retroviral Sequences

As expected, BLAST identified more, but less diverse, sequences than HMMER in both the human genome and Repbase consensa. tBLASTn for HK2 PR matched every ERVK clade except HK10, but no other ERV. HMMER identified an RVP domain in every ERVK clade, and most other ERVs. The regions identified by each method overlap. tBLASTn for the Phoenix RVT\_1 domain matched most ERVs. HMMER also identified an RVT\_1 domain in most ERVs. The results of these methods mostly overlapped, with some notable differences. BLAST identified only part of the RVT\_1 domain on ERVE, and the RVT\_1 domain on ERV9 and ERVFc covers only part of the BLAST result. Each of the BLAST and HMMER results from ERVFb only partially overlapped. HMMER identified an RVT\_1 domain on ERVT, but BLAST found no match. Both methods disproportionately detected HML2 loci; BLAST did so because the query sequences were HML2-derived, whereas HMMER did so because HML2 loci are more intact and therefore more easily discovered by LTRharvest.

The nucleotide sequences from each search were aligned using their translation and their phylogeny was inferred using RAxML from both nucleotide and amino acid (AA) alignments (see **Figure 1** for PR tBLASTn results). The resulting alignments and trees were used to classify the genomic sequences identified by the corresponding search. These reference trees have very low bootstrap support values; however, pol and pro trees have similar topology and the classifications based on evolutionary placement agree with the tree topology inferred directly from genomic pro sequences.

TABLE 1 | Retroviral HMMs from Pfam sorted according to the retroviral protein product from which they were derived.


HMMs in the "other" category correspond to common functional motifs, nonretroviral proteins, or domains with unknown function.

### Classification of ERV Sequences in the Human Genome

Having established a system to identify PR and RT elements, ERV nucleotide sequences were identified in GRCh38 using BLAST and HMMER as directed by GenomeTools. These were then curated to eliminate highly divergent sequences with rare insertions (less frequent than 1/20 sequences) and to minimize the size of their multiple alignment before phylogenetic inference. LTRdigest hits that did not co-occur with another retroviral domain were eliminated. The genomic distribution of the resulting ERV annotations is presented in **Table 2**, along with curation data.

When examining PR hits, tBLASTn for ERVK-10 PR identified 480 sequences in GRCh38 (**Additional Files 1**, **2**). Forty-two are identical to another PR sequence, of which 15 are from an alternative assembly. LTRdigest identified 150 RVP domains in GRCh38 (**Additional Files 3**, **4**). Only 8 of these were shorter than half of the expected length of an RVP domain. Occasionally, sequences from different loci were indistinguishable; twelve were identical to another RVP locus. RVP positions 38–42 are absent in 83 PRs; 48 of these were confidently classified into one of ERVW, ERV9, ERVE, ERVT, or ERV3. The remaining 35 PR sequences were not confidently classified. In contrast, the PR sequences identified by LTRdigest are less numerous and more diverse than those identified by BLAST. There is a substantial overlap in the sequences recognized by each method, but LTRdigest identified HK2 PR with a disproportionately higher frequency than other ERVs, given their relative occurrence in the human genome (Gifford and Tristem, 2003; Bannert and Kurth, 2006).

"Raw" refers to all sequences identified, and "Curated" refers only to sequences that were aligned for further analysis. Credible RT sequences are those which were not eliminated for causing gappy sites. Curated RVT\_1 sequences are those which co-occurred with another retroviral HMM match. LTRharvest and LTRdigest were not used to search alternative assemblies, so there are no results to report. Since this curation workflow did not exclude any protease search results, the number is shown only once.

When examining RT hits, tBLASTn for the RVT\_1 domain of Phoenix identified 1412 sequences in GRCh38, of which 840 were removed because they induced gaps (**Additional Files 5**, **6**). LTRdigest identified 1766 RVT\_1 domains in GRCh38, of which 1617 were removed either because they did not co-occur with another core retroviral domain or they contained uncommon insertions which induced gaps in the alignment (**Additional Files 7**, **8**). Standalone RVT\_1 domains were eliminated. This is because non-ERV retroelements, such as LINEs (Boissinot et al., 2000), also encode an RT with an RVT\_1 domain, but lack other retroviral proteins. These non-ERV retroelements would skew the analysis, since they are much more numerous; retroelements as a whole make up 42.2% of the human genome, but ERVs make up less than 8.3% [47]. Although LINEs are not flanked by LTRs, they are nonetheless annotated when LTRharvest mistakes other appropriately spaced repetitive sequences for LTRs.
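The co-occurrence rule used to discard standalone RVT\_1 domains can be sketched as a simple filter. This is an illustration only, not the workflow used in the study: the function name, the element identifiers, and the short list of core domain names are assumptions (the HMMs actually used are those listed in **Table 1**).

```python
# Illustrative set of core retroviral domain names (an assumption for this
# sketch; it is not the article's list).
CORE_DOMAINS = frozenset({"RVP", "RVT_1", "rve", "Gag_p24"})


def curate_rt_hits(elements, rt_domain="RVT_1", core_domains=CORE_DOMAINS):
    """Return ids of elements whose RT hit co-occurs with another core domain."""
    kept = []
    for element_id, domains in elements.items():
        found = set(domains)
        other_core = (found & core_domains) - {rt_domain}
        if rt_domain in found and other_core:
            kept.append(element_id)
    return kept


# Hypothetical annotations: element id -> domains found within that element.
toy_elements = {
    "element_A": ["RVP", "RVT_1", "rve"],  # ERV-like: kept
    "element_B": ["RVT_1"],                # standalone RT (LINE-like): dropped
    "element_C": ["RVP"],                  # no RT hit, so not an RT result
}
print(curate_rt_hits(toy_elements))
```

Requiring a second core domain is what separates ERV-derived RT sequences from the far more numerous RT sequences of non-ERV retroelements.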

In order to assign appropriate nomenclature to the identified sequences, we classified 489 loci (one or more PR or RT encoding regions separated by no more than 10,000 bp) (**Table 3**). We observe 53 loci contain only one BLAST or LTRdigest result, 324 loci contain 2, 57 loci contain 3, 53 loci contain 4, and one each contain 5 and 6 results. Four loci had all results placed on internal nodes by RAxML and were not classified at all. Eighteen loci contained sequences whose classification varied; in most cases these sequences were truncated or insertion of one retroelement inside another resulted in spurious association of sequences belonging to separate elements.
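The locus definition above (search results separated by no more than 10,000 bp) amounts to a single pass over coordinate-sorted hits. The sketch below is a minimal illustration under stated assumptions: separation is measured from the end of the growing locus to the start of the next hit, strand is ignored, and the function name and coordinates are hypothetical.

```python
from collections import Counter, defaultdict


def group_into_loci(hits, max_gap=10_000):
    """Group (chrom, start, end) hits separated by at most max_gap bp."""
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((start, end))

    loci = []
    for chrom in sorted(by_chrom):
        current = None  # [chrom, locus_start, locus_end, n_hits]
        for start, end in sorted(by_chrom[chrom]):
            if current is not None and start - current[2] <= max_gap:
                # Close enough to the growing locus: extend it.
                current[2] = max(current[2], end)
                current[3] += 1
            else:
                if current is not None:
                    loci.append(tuple(current))
                current = [chrom, start, end, 1]
        if current is not None:
            loci.append(tuple(current))
    return loci


# Hypothetical hits: two close together on chr1, one distant, one on chr2.
toy_hits = [
    ("chr1", 100, 400),
    ("chr1", 5_000, 5_300),
    ("chr1", 20_000, 20_300),
    ("chr2", 50, 350),
]
toy_loci = group_into_loci(toy_hits)
hits_per_locus = Counter(n_hits for _, _, _, n_hits in toy_loci)
print(toy_loci)
print(hits_per_locus)
```

Tabulating the number of results per locus, as in the last step, gives the kind of breakdown reported above (loci containing one, two, three, or more results).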

Totals are shown for each ERV supergroup. The total number of unplaced sequences is also shown along with the number of sequences placed on internal branches, which in both cases were between the H/F and W/9 clades. Sequences were considered unplaced if the probability of their placement into the tree was less than 90%. One sequence was placed on an internal node outside these clades. AA alignments are translated from nucleotide (NT) alignments.

Phylogenetic trees constructed from genomic sequences (**Figure 2**) showed trends similar to those observed in the reference trees (**Figure 1** and **Additional Files 9–16**). The classifications assigned by placement of reference sequences into the phylogeny (**Additional Files 17–32**) agree with the branching of the phylogeny inferred from genomic sequences directly. This can be clearly seen in **Figure 2**, in which the leaves of these phylogenies are colored according to their classification by evolutionary placement. Each phylogenetic tree (**Additional Files 33–40**) constructed from aligned nucleic acid sequences of either PR or RT from reference ERV sequences contained a bipartition segregating ERVK clades, although it should be noted that BLAST PR results include exclusively ERVK sequences. HK1, 2, 4, 9, and 10 commonly formed a clade within ERVK. In all four trees in **Figure 2**, ERVK leaves form a distinct clade, and likewise ERVW and ERV9 form a clade which is associated with ERVF. ERVS forms a clade with ERVL. ERVI forms a clade with ERVADP. ERVE forms a clade with ERV3. **Figure 2** is intended to allow the reader to quickly see the broad trends; if the reader is interested in the placement of specific elements, the phylogenetic trees output by RAxML are included in **Additional Files 33–40**.

TABLE 3 | Number of sequences classified into each category from each search method.

PR, results from BLAST for HK2 protease; RT, results from BLAST for Phoenix RVT\_1 domain of reverse transcriptase; RVP, results from LTRdigest for RVP domain; RVT\_1, results from LTRdigest for RVT\_1 domain; NT, results classified by their nucleic acid sequence; AA, results classified by their translation.

### Alternate Genome Assemblies Reveal Additional PR Sequences

Although it fails to capture the real diversity of humanity, the reference human genome contains variable regions with multiple valid assemblies. New loci can be distinguished from loci which were duplicated by inclusion in both assemblies by examining the flanking genomic sequences, which should also be identical for technical duplicates. The 15 kilobases flanking the start site of each PR identified in GRCh38 by BLAST and GenomeTools were extracted and compared to alternative assemblies using blastn. Eight regions had a perfect match, of which five were originally identified by BLAST. Many BLAST hits surrounding identical PR sequences had partial matches indicative of closely similar insertions in divergent genomic loci. Moreover, 31 of the 52 regions identified on alternative assemblies had no high-scoring matches. The fact that many sequences could not be paired up is consistent with previous observation of unique insertions in a recent analysis of the ERV content within alternative assemblies (Wildschutte et al., 2014).

#### Diversity of Protease Sequences

The consensus sequences for each ERVK clade from Repbase contain many conserved sites (**Figure 1**). It is clear that PR functional motifs are less diverse than the surrounding sequences (**Figure 3**), and that variants occur with different frequencies (**Table 4**). The α-helix (C2) has only one common variant (GRDLL). The active site loop (B1) has two; the DTGAD motif is most common and occurs in all clades, whereas the second most frequent (DTGVD) occurs only in the HK3, HK5, and HK6 clades. The tree in **Figure 4** is displayed to clearly show the abundance of HK3 sequences (top of figure) versus all other clades. Additionally, **Figure 4** highlights the distribution of common active site variants (DTGAD in blue, DTGVD in orange).

Active site motifs were extracted from MACSE-aligned protease sequences identified by BLAST in the human genome (GRCh38). Sequences containing stop codons and/or frameshift mutations were removed and the frequency of each remaining motif was determined. This was done using standard UNIX utilities. Sequences were considered inactive if they contained mutations not observed in active proteases of other XRVs; otherwise, they were considered potentially active.
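The motif tally described above was done with standard UNIX utilities; the Python sketch below is only a stand-in that illustrates the same idea. The function name, the toy sequences, and the column range are hypothetical, and it assumes that stop codons appear as `*` and frameshifts as `!` in the translated alignment, and that sequences with a gap inside the motif are skipped.

```python
from collections import Counter


def motif_frequencies(aligned_proteins, start, end, bad_chars="*!"):
    """Count motifs in alignment columns [start, end) of intact sequences."""
    counts = Counter()
    for seq in aligned_proteins.values():
        if any(ch in seq for ch in bad_chars):
            continue  # skip sequences with stop codons or frameshifts
        motif = seq[start:end]
        if "-" in motif:
            continue  # skip sequences with a gap inside the motif (assumption)
        counts[motif] += 1
    return counts


# Hypothetical aligned protease fragments; columns 2-6 hold the active site.
toy_alignment = {
    "locus_a": "MKDTGADLL",
    "locus_b": "MKDTGVDLL",
    "locus_c": "MKDTGADL*",  # stop codon: excluded
    "locus_d": "MKDTGADLL",
    "locus_e": "MK!TGADLL",  # frameshift: excluded
}
toy_counts = motif_frequencies(toy_alignment, 2, 7)
print(toy_counts.most_common())
```

Running the same tally on a second column range (for example the helix motif) and pairing the two motifs per sequence would give co-occurrence counts of the kind summarized in **Table 4**.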

### ERVK HML2 Proteases Exhibit Great Diversity

Sequences classified as HK2 by RAxML were examined in more detail to determine if their variability could have functional consequences if the protease is translated, especially if it could lead to differential drug susceptibility because the variable residue intrudes into the substrate binding site. Several columns of the alignment in **Figure 5** represent variable sites which clearly subdivide the clade in two.

The most notable variation within HK2 sequences (**Figure 5**) is the variable AA (L, V or I) at position 89 which is predicted to help form the S3 and S1′ binding subsites by homology (Menendez-Arias et al., 1994). Columns 55, 65, and 66 are also notable. The AA (V or I) at position 55 follows the flap interface (columns 51–54) predicted by homology (Kuhelj et al., 2001). Residues 65 and 66 are near the surface of PR in the N-terminal strand of A2; from this position, their variability could impact protein-protein interactions within the viral polyproteins or with host partners, and could potentially influence flap mobility during substrate binding via interactions with α-helix C1. Thus, the cellular complement of HK2 PR variants represents a continuum of enzymatic affinities and protein-protein interactions.

### Variant Proteases Are Expressed in Human Health and Disease

We looked in publicly available RNA-Seq datasets (**Table 5**) from diseases with some known association with ERV expression to establish how genomic diversity is transcribed in healthy and disease states (Seifarth et al., 2005; Douville et al., 2011; Ren et al., 2012; Agoni et al., 2013; Wildschutte et al., 2014; Abba et al., 2015; Li et al., 2015; Brohawn et al., 2016). RNA-Seq results were narrowed to examine the expression of the two most highly expressed and well-studied groups, HML2 and HML3. Further, ERVK loci (detectable by both BLAST and LTRharvest as previously described) were limited to proteases without gross inactivating mutations. Reanalysis of transcriptomics data from three independent studies focused on ALS, breast cancer and prostate cancer is shown in **Figure 6**; expression of other HML groups is shown in **Additional File 42**. Overall, the expression of loci encoding DTGAD and DTGVD motifs is significantly different, due to low expression of DTGVD loci, and each of these is divergent from the pooled expression of other less common motifs.

TABLE 4 | Frequency of co-occurrence for observed ERVK active site and associated helix motifs.

Expression of protease encoding transcripts alone does not prove that these will be properly translated or proteolytically processed (Bauerova-Zabranska et al., 2005; Tien et al., 2018). Additional supporting evidence comes from examination of the upstream gag gene by searching reads aligning to the same locus using tBLASTn for the Gag sequence of ERV-K113. Several, but not all, loci expressed gag as well as pro. Individual experiments are required to confirm Gag-pro-pol polyprotein processing for each transcribed locus.

There is a trend toward increased ERVK transcription in cervical spinal tissues of ALS patients versus neuro-normal controls (**Figure 6**); indeed, this enhancement is evident when patients are stratified by sex, with female cases having an increased burden of ERVK (**Additional File 43**, p < 0.01). When tissues from cancer patients were examined, breast cancer biopsies revealed increased PR transcripts with DTGAD (p < 0.05) and alternate PR active site motifs (p < 0.05), as compared to cosmetic breast resection controls. Overall, HML-2 expression was enhanced in breast cancer (p < 0.01). This ERVK load difference in breast cancer appears to be driven by the increased expression of three loci in 8p23.1 flanked by LTR5A, and encoding a DTGSD active site motif. Interestingly, an HIV-1 PR variant which encodes a DTGSD motif is known to be active, although less so than the wild-type enzyme (Hong et al., 1998). Similarly, HML-2 transcripts are increased in prostate cancer specimens as compared to autologous non-cancerous adjacent tissue (p < 0.01). These findings point toward a mixed profile of PR variants in disease states with elevated ERVK expression.

#### DISCUSSION

The human genome is home to a diversity of ERVs; this necessitates much effort in discriminating those of biological and/or clinical importance. Our identification of transcribed ERVK loci with a potential to encode functional PR enzymes will inform future studies examining their biological impact and drug susceptibility.

There remain technical challenges in assessing ERVs, as the results from LTRharvest show that this paradigm is not ideal for analysis of more complicated loci containing recombination, serial insertions, or substantial deletions. Theoretically, each ERV should encode exactly 1 pro and 1 pol at the moment of insertion, but over time deletion and insertion events can remove or obscure these signatures. Despite this issue, the use of HMMs to identify ERVs within the human genome was effective. Indeed, RVP and RVT\_1 annotations are more commonly identified within elements annotated by LTRharvest, since BLAST did not detect loci distantly related to the query HK2 sequences. The search results were improved by the merging of results from the same paralog and alignment-based extension. These sequences were grouped both by direct phylogenetic inference and by comparison to a reference tree. The phylogenetic patterns apparent in **Figures 1**, **2** broadly agree with previously recognized relationships between ERVs (Vargiu et al., 2016). Although trees from pro and pol differ in some respects, they generally agree at deeper nodes which are important for classification, despite the lower phylogenetic signal of protease due to its comparatively shorter length.

area of the tree is indicated by the name associated with the arc around the perimeter. Branch coloring represents bootstrap support, where red indicates poor support (closer to zero) and green indicates good support (closer to 100). Placement of the root at the branch connecting HK3 is only a matter of visual convenience – the real root branch is more likely that leading to HK8. This alignment is available in Additional File 41.

There is value in examining both reference genomes and alternate assemblies. The reference genome was constructed from only a few people (Osoegawa et al., 1998; Lander et al., 2001), suggesting that the ERV content of individual human genomes could vary considerably and deserves closer scrutiny, especially since unfixed loci are likely to be younger (Belshaw et al., 2005) and could encode functional enzymes. Future use of data from platforms such as the 1000 Genomes Project (2018) will provide an improved appreciation of the diversity of ERV content in human DNA.

#### ERVK Protease Sequence Diversity

The diversity of human genomic retroviral proteases reflects their evolutionary history and conserved structure-function relationships. Residues flanking the active site and those maintaining the dimer interface are less variable, a pattern clearly

TABLE 5 | RNA-Seq Libraries from the Sequence Read Archive analyzed in this study.


discernible in the Pfam family RVP (Finn et al., 2014) and evident within **Figure 1**. The β-sheet at the "bottom" of the enzyme is an exception to this rule, as its residues are not as conserved, perhaps because this region of tertiary structure is driven by backbone interactions (Louis et al., 2007). Overall, the most notable variation in genomic protease sequences is the absence of RVP positions 38–42 in 83 PRs, none of which were classified as ERVK.

The sequences of the retroviral protease active site (B1) and the active site-associated helix (C2) are conserved, as seen in **Figures 1**, **3**, **5** for ERVK. The most common active site motif of ERVK, DTGAD, is identical to that predominantly observed in exogenous retroviruses. The second most common motif, DTGVD (A29V), is observed in functional retroviral proteases; in fact, there are many examples of HIV proteases bearing this variation (e.g., UniProt: P15833). Given the wide phylogenetic distribution of A29V and the similar chemical properties of alanine and valine residues, this is probably a functional variant of the ERVK protease. This motif could therefore represent pre-integration variation of the ancestral XRV which gave rise to ERVK, an interpretation further supported by the concentration of DTGVD motif-containing sequences in HK3, seen in **Figure 4**. Alternatively, the motif could represent a post-integration mutation predating the amplification of HK3 (Mayer and Meese, 2002). We also predict that several α-helix (C2) variants could be functional in ERVK. In the HIV-1 PR (e.g., PDB: 1HVC), the C2 arginine residue (R94) fills the space between B1 and the loop connecting A1′ to D2′ while forming hydrogen bonds to residues in both. Another large polar residue might fill this role; indeed, lysine is known to do so in HIV-1 protease (Louis et al., 2007). Several genomic motifs fit these requirements, such as the HK2 R94Q and R94K variants, and the HK8 D95E variant. Systematic experimental verification of the activity of individual PR variants is thus warranted.

Nonetheless, we predict that most of the less common active site loop variants are inactive (**Figure 4**). This prediction is drawn from our understanding of the composition of this motif. Many variations affect a residue which catalyzes the hydrolytic reaction, or one which is predicted to be essential to the dimer interface. The catalytic residues must be aspartic acid. The threonine residue which helps form the dimer interface is uncommonly replaced by serine (e.g., PDB: 2JYS, 2RSP) (Jaskolski et al., 1990; Hartl et al., 2008). Other variations could disrupt the active site by substituting a side-chain of very different size. The glycine residue must be small, as distortions to the structure of this region could disrupt the catalytic structure. This is also true for the alanine residue, although it is known to be replaced by serine in HIV-1 without a total loss of activity (Ido et al., 1991). The final aspartic acid residue participates in hydrogen bonds with the nearby arginine residues (Louis et al., 2007), and might be replaced by another small charged or polar residue; however, the character of this residue may not be critical, since it can vary with glycine in HIV-1 protease (Louis et al., 2007).
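The residue-by-residue reasoning above can be written down as a small rule table. The following is only an illustrative sketch: the names `TOLERATED`, `predict_motif` and `tally` are hypothetical, the rules are a paraphrase of the predictions in the text rather than the classification procedure used in this study, and none of the labels replaces an activity assay.

```python
from collections import Counter

# Tolerated residues at each position of the five-residue active-site motif,
# paraphrased from the discussion above (assumption: these are predictions only).
TOLERATED = [
    {"D"},            # catalytic residue: must be aspartic acid
    {"T", "S"},       # threonine, uncommonly replaced by serine
    {"G"},            # glycine: must remain small
    {"A", "V", "S"},  # alanine; A29V and a serine variant are reported as active
    None,             # final residue: its character may not be critical
]


def predict_motif(motif: str) -> str:
    """Label a five-residue motif as canonical, possibly functional, or predicted inactive."""
    motif = motif.upper()
    if len(motif) != 5:
        raise ValueError("expected a five-residue motif such as DTGAD")
    if motif == "DTGAD":
        return "canonical"
    for residue, allowed in zip(motif, TOLERATED):
        if allowed is not None and residue not in allowed:
            return "predicted inactive"
    return "possibly functional"


def tally(motifs):
    """Count how many motifs fall under each label."""
    return Counter(predict_motif(m) for m in motifs)


if __name__ == "__main__":
    for example in ("DTGAD", "DTGVD", "NTGAD"):
        print(example, predict_motif(example))
```

A table of this kind is easy to revise as experimental data on individual PR variants become available.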

The young, polymorphic HK2 element ERVK113 is well studied and was reported to produce immature virions, perhaps due to a non-functional protease (Boller et al., 2008), in contrast to the known active HK2 protease of ERVK-10 (Ono, 1986; Towler et al., 1998). This interpretation is supported by reported differences in the sequences of these two mature proteases; the G56S (reported as G234S) substitution in ERVK113 may disrupt the flap interface, compromising PR activity and conferring an immature particle morphology (Boller et al., 2008).

### Implications of ERVK PR Diversity on Immunity

Retroviral proteases have numerous and varied protein targets which are difficult to predict, with correspondingly broad cellular effects stemming from their expression. Many proteases include targets with immune functions, but these vary based on host-virus pairings. It remains to be seen how ERVK PRs may modulate innate immunity signaling in humans, as seen with XRVs (Abudu et al., 2006; Solis et al., 2011; Wagner et al., 2015; Yoshikawa et al., 2017). Based on our knowledge of HIV (Konvalinka et al., 1995; Snasel et al., 2000; Lin et al., 2003), it is even conceivable that each ERVK PR variant may target distinct panels of cellular proteins, due to their different substrate specificities.

The potential impact of ERVK PR is not limited to the innate immune system. Observations from the recovery of immune function in patients on HAART containing or not containing PIs suggest that HIV PR can directly impact the functioning of adaptive immune cells independently of other viral proteins (Ananworanich et al., 2003; Chiodi, 2006). Furthermore, differential ERVK PR substrate specificities could lead to changes in the peptide fragment population from which MHC class I epitopes are drawn, in addition to the way in which PI use alters this system (Kourjian et al., 2014; Kourjian et al., 2016). Our observation of distinct ERVK PR variants expressed in disease highlights the need for future studies to consider their overlapping and distinct impacts on cell signaling and proteome profiles related to immunity.

### Implications for Drug-Based Treatment of ERVK-Associated Disease

Many protease coding sequences could be translated into an active enzyme, and their sequence variability may impact their activity, and likely their susceptibility to antiretroviral drugs. Variation between DTGAD and DTGVD in the active site of sequences classified as HK3 is notable. This variation could have biological relevance when both HK2 and HK3 elements are expressed, such as in ALS (Douville et al., 2011) and schizophrenia (Frank et al., 2005). The HK2 PR is susceptible to some clinically relevant HIV-1 PIs, although not to the same degree as the HIV-1 PR (Towler et al., 1998; Kuhelj et al., 2001; Tyagi et al., 2017). Since ERVK expression is associated with many disease states, and the activity of retroviral enzymes can be pathogenic, it is possible that PIs may be useful in the future treatment of ERVK-associated conditions. In fact, at least one ongoing clinical trial (NCT02437110) plans to administer the PI Darunavir, along with other drugs, to patients with ALS. Another clinical trial (NCT01528865) targeting ERVK in lymphoma was withdrawn – but the intent to explore the clinical utility of ERVK PIs is becoming clear.

The rapid development of drug resistance is a major hurdle in the treatment of HIV-1 infections, but since ERVK is fixed in the host genome, it cannot rapidly evolve such drug resistance. However, ERVK elements are diverse and likely respond differently to PIs. This diversity is not represented in the ERVK literature, which is focused on the consensus HK2 PR (Towler et al., 1998; Kuhelj et al., 2001; Dewannieux et al., 2006; Lee and Bieniasz, 2007; Tyagi et al., 2017). One recent paper states that the active site loop of the HIV-1 and ERVK PR are identical (Tyagi et al., 2017), a finding which this report directly contradicts. We predict this diversity may necessitate the simultaneous use of multiple PIs in the treatment of ERVK-associated diseases. Consequently, if the drug regimen in the above-mentioned clinical trial is not effective in treating ALS, this cannot be taken as evidence that such therapy would not be effective using multiple PIs in a combination therapy.

We have established that a multitude of ERVK protease variants exist in the human genome, and that some of these variants can reasonably be predicted to have differing drug binding profiles. Although many ERVs are transcriptionally repressed (Bogerd et al., 2006; Schlesinger and Goff, 2013; Schlesinger et al., 2014), our analysis of publicly available RNA-Seq data shows not only that ERVs are expressed, but that the expression of loci with differing biochemical and evolutionary characteristics varies between and within different disease conditions, as well as in healthy controls. Due to the relatively recent insertion of ERVK sequences within the human genome, and the pathogenicity of XRV PRs, it is warranted that future studies consider their role in modulating immunity. Since distinct biochemical and ancestral features may impact the susceptibility of these enzymes to PIs, we propose that patient-specific drug regimens may be required in the treatment of ERVK-associated disease. Moreover, given the potential importance of protease sequence variability, the sequences of other ERVK proteins (particularly the drug targets reverse transcriptase and integrase) should also be explored.

### CONCLUSION

Within the human genome, ERVK proteases exhibit a high degree of diversity. Specifically, two predominant PR active sites emerge, the DTGAD and DTGVD variants, which are differentially expressed in disease states. This study will also be an asset for inferring how ERVK PRs impact the human proteome, specifically as it pertains to immune function. It is possible to interpret this paper as casting a negative light on the prospect of inhibitor-based treatment of ERVK-associated disease, but this is not our intention. We do, however, caution against treatments and techniques that treat the numerous and diverse elements called ERVK as though they were a single invariable element. This variability is not endless; biomedical and clinical techniques to account for this enzymatic variation exist. We recommend that the results of inhibition assays against the specific ERVK proteases expressed in each disease state should be considered if anti-PR drug regimens are to be implemented for inflammatory disease.

### MATERIALS AND METHODS

Sources for databases and software employed in the course of this study are listed in **Table 6**.

Sequences homologous to retroviral protease (pro) and reverse transcriptase (pol) were automatically extracted from the last major build of the reference human genome and classified with published sequences from Repbase. Structure-function analysis was applied to ERV PR sequences using the limited published data on them, and by transferring knowledge of better studied proteases.

### Reference Sequences

The 38th official assembly of the human genome (GRCh38/hg38 published December 2013) was obtained from NCBI. Representatives for ERVADP, ERV9, ERV3, ERVE, ERVFRD, ERVF, ERVFXA, ERVFC, ERVFB, ERVH, ERVI, ERVL, ERVRB, ERVP, ERVS, ERVT, ERV16, ERVW, and HML groups 1 through 10 (HK1 through HK10) were derived from Repbase v19.07 (Jurka et al., 2005). HK9 and HK10 are also called K14C and KC4, respectively. The sequences of ERVK Phoenix (Dewannieux et al., 2006), and ERVK-10 (Kuhelj et al., 2001) were taken from their respective publications. These sequences can be found in **Additional Files 46**, **47**, respectively. Genes were located using BLAST v2.2.28 (Altschul et al., 1990) and HMMER v3.1b1 (Eddy, 1998). Cytological bands were assigned by a perl script using co-ordinates from NCBI.

TABLE 6 | Databases and software used in this study.


#### BLAST

Each GRCh38 chromosome and representative retroviral sequence was searched by tBLASTn for the amino acid sequences of the mature ERVK-10 PR and the RVT\_1 domain of Phoenix pol identified by HMMER. BLAST hits from the same gene in different frames were merged, aligned by MACSE v1.01b (Ranwez et al., 2011), and then sequences 5′ and/or 3′ of each hit which would make the alignment flush were retrieved. Proteases identified in the reference genome assembly were matched to alternative assemblies using a perl script.
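The merging of hits from different reading frames can be pictured as interval merging on genomic coordinates. The sketch below is a minimal illustration and not the perl script used in this study: the `Hit` record, the `merge_hits` function and the `max_gap` parameter are hypothetical, and it assumes that hits belong to the same gene when they lie on the same chromosome and strand and overlap or abut.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chrom: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


def merge_hits(hits, max_gap=0):
    """Merge hits on the same chromosome and strand that overlap or lie within max_gap bp."""
    merged = []
    for hit in sorted(hits, key=lambda h: (h.chrom, h.strand, h.start, h.end)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.chrom == hit.chrom
            and last.strand == hit.strand
            and hit.start <= last.end + max_gap + 1
        ):
            # Extend the previous interval rather than starting a new one.
            merged[-1] = Hit(last.chrom, last.strand, last.start, max(last.end, hit.end))
        else:
            merged.append(hit)
    return merged


if __name__ == "__main__":
    example = [
        Hit("chr1", "+", 100, 200),
        Hit("chr1", "+", 150, 260),  # overlaps the first hit, e.g. a different frame
        Hit("chr1", "+", 500, 600),
    ]
    for locus in merge_hits(example):
        print(locus)
```

Raising `max_gap` would also join hits separated by short interruptions, at the cost of occasionally fusing neighbouring elements.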

#### HMMER and GenomeTools

The HMMs used in the search were identified in Pfam-A using the search term "retrovirus" (Finn et al., 2014). ERV reference sequences were searched by HMMER with an E-value threshold of 1.0. Putative LTR retroelements were identified in GRCh38 by LTRharvest, and these were subsequently analyzed by LTRdigest, which ran HMMER. LTRharvest and LTRdigest are part of GenomeTools v1.5.1 (Gremme et al., 2013). Hits from the same locus in different reading frames were merged and aligned with MACSE, then the regions which would make each sequence flush to the alignment were retrieved.

#### Phylogenetic Inference and Classification

Results from each of the four searches were curated by eliminating HMM matches not associated with another core retroviral domain and by removing BLAST results which induced gaps in ≥95% of alignment rows with MACSE. Sequences were then re-aligned with MACSE and models of evolution were selected using jModelTest (Darriba et al., 2012) and ProtTest (Guindon et al., 2010; Darriba et al., 2011). Maximum likelihood phylogenies were inferred using RAxML v7.2.8 (Stamatakis, 2014) for nucleotide and protein alignments of each search (8 trees in total). GRCh38 results were classified using RAxML's evolutionary placement algorithm with reference to the inferred phylogeny of representative ERV sequences.
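One possible reading of the gap-based curation step is sketched below. It is an assumption-laden illustration, not the procedure used in this study: the function name `gap_inducing_rows` is hypothetical, and a sequence is taken to "induce gaps" when it carries a residue in an alignment column where at least 95% of rows hold a gap character.

```python
def gap_inducing_rows(alignment, threshold=0.95, gap_chars="-"):
    """Return names of rows holding a residue in any column where >= threshold of rows are gaps.

    alignment: dict mapping sequence name -> aligned sequence (all of equal length).
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows:
        return set()
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")

    flagged = set()
    for col in range(length):
        gapped = [row[col] in gap_chars for row in rows]
        if sum(gapped) / len(rows) >= threshold:
            # Rows with a residue here are the ones forcing gaps into the others.
            flagged.update(name for name, is_gap in zip(names, gapped) if not is_gap)
    return flagged


if __name__ == "__main__":
    toy = {f"seq{i}": "AC-GT" for i in range(39)}
    toy["insertion"] = "ACTGT"
    print(gap_inducing_rows(toy))
```

Under this reading, flagged sequences would be removed and the remainder re-aligned, as described above.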

Misclassified or recombinant elements were sought by comparing the assignment of search results separated by less than 10 kbp. This locus-based definition is computationally simple, but potentially error prone; a more accurate element-based analysis would require algorithmic reconstruction of insertion architecture, as was undertaken in the construction of Dfam (Wheeler et al., 2013).
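The locus-based comparison described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code used in this study: the function name `discordant_neighbours` and the tuple layout `(chrom, start, end, group)` are hypothetical, and only immediately adjacent hits on the same chromosome are compared.

```python
def discordant_neighbours(hits, max_distance=10_000):
    """Return pairs of adjacent hits less than max_distance bp apart with different group labels.

    hits: iterable of (chrom, start, end, group) tuples.
    """
    ordered = sorted(hits, key=lambda h: (h[0], h[1]))
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left[0] == right[0]
        distance = right[1] - left[2]  # negative when the two hits overlap
        if same_chrom and distance < max_distance and left[3] != right[3]:
            pairs.append((left, right))
    return pairs


if __name__ == "__main__":
    example = [
        ("chr1", 1_000, 1_300, "HK2"),
        ("chr1", 5_000, 7_000, "HK3"),    # 3.7 kbp from the previous hit, different group
        ("chr1", 50_000, 52_000, "HK2"),  # too far away to be compared
    ]
    for pair in discordant_neighbours(example):
        print(pair)
```

Pairs returned by such a check are candidates for misclassification or recombination and would still need manual inspection.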

#### Measuring Expression in Publicly Available RNA-Seq Data

Publicly available RNA-Seq data from tissues associated with ERVK expression in human disease were obtained from the SRA using sratools. The resulting FASTQ files were examined with FASTQC and then aligned to the human genome (GRCh38) using bowtie2. The resulting SAM file was converted to an indexed BAM file using samtools, and this file was queried for expression of the ERV loci identified using the methods above. The raw expression values thus obtained were normalized by library size and locus length to produce FPKM values.
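The normalization described above follows the standard FPKM definition: the fragment count for a locus is divided by the locus length in kilobases and by the number of mapped fragments in millions. A minimal sketch is given below; the function name `fpkm` and its argument names are illustrative and do not come from the scripts in Additional File 48.

```python
def fpkm(fragment_count, locus_length_bp, total_mapped_fragments):
    """Fragments per kilobase of locus per million mapped fragments."""
    if locus_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("locus length and library size must be positive")
    # count / (length / 1e3) / (library / 1e6) == count * 1e9 / (length * library)
    return fragment_count * 1e9 / (locus_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # 50 fragments on a 2 kbp locus in a library of 25 million mapped fragments
    print(fpkm(50, 2_000, 25_000_000))  # 1.0
```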

The resulting FPKM values were examined on density and QQ plots to confirm that they did not conform to a normal distribution. Non-parametric tests of statistical significance were used in this paper. For unpaired expression data (breast cancer, ALS) the Mann–Whitney U test was used, and for paired data (prostate cancer) the Wilcoxon Signed-Rank Test was used. In all cases tests were two-sided and p-values were corrected using the Bonferroni procedure, considering each set of contrasts involving the same variables to be a family.
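The family-wise Bonferroni correction mentioned above multiplies each raw p-value by the number of contrasts in its family and caps the result at 1. The sketch below illustrates only this correction step, not the Mann–Whitney U or Wilcoxon tests themselves; the function name `bonferroni_by_family`, the nested-dictionary layout and the contrast names in the example are assumptions made for illustration.

```python
def bonferroni_by_family(p_values_by_family):
    """Apply the Bonferroni correction separately within each family of contrasts.

    p_values_by_family: dict mapping family name -> dict of contrast name -> raw p-value.
    Returns a dict of the same shape holding adjusted p-values.
    """
    adjusted = {}
    for family, contrasts in p_values_by_family.items():
        n_tests = len(contrasts)
        adjusted[family] = {
            contrast: min(1.0, p * n_tests) for contrast, p in contrasts.items()
        }
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values, used only to show the call pattern.
    raw = {
        "family_A": {"contrast_1": 0.01, "contrast_2": 0.04, "contrast_3": 0.5},
        "family_B": {"contrast_1": 0.02},
    }
    print(bonferroni_by_family(raw))
```

Treating each set of contrasts as its own family keeps the correction from becoming overly conservative when several unrelated comparisons are reported in one paper.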

#### AVAILABILITY OF DATA AND MATERIAL

The datasets used to generate the findings of this paper are available from NCBI by their accession numbers. SRA accession numbers for each RNA-Seq dataset are: ALS – SRP064478, Breast Cancer – SRP058722, Prostate Cancer – ERP000550. The human genome version 38 was obtained from the NCBI FTP site [79]. Datasets generated during the current study which are not included as additional files are available from the corresponding author upon request. Important scripts used in the course of this study are included in **Additional File 48**.

### AUTHOR CONTRIBUTIONS

MT and RD conceived the experiments. MT performed the bioinformatics analysis. MT and RD wrote the manuscript.

#### FUNDING

Funding support was provided by the University of Winnipeg and a Discovery Grant from Natural Sciences and Engineering Research Council of Canada (NSERC #RGPIN-2016-05761).

#### ACKNOWLEDGMENTS

We would like to thank the Douville lab team for their input on this project. This work was supported by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca) by use of the computer cluster Orcinus. We acknowledge that the University of Winnipeg is in Treaty 1 territory and in the heart of the Métis nation. Orcinus is located on the traditional, ancestral, and unceded territory of the Musqueam people. MT currently resides in the traditional territory of the Three Fire Peoples: the Ojibwa, Odawa, and Potawatomi.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.01577/full#supplementary-material

FILE 1 in FASTA (fa) format | Aligned translated genomic protease BLAST results. The sequences found by BLAST for the HK2 protease in the human genome aligned and translated by MACSE.

FILE 2 in FASTA (fa) format | Aligned genomic protease BLAST results. The sequences found by BLAST for the HK2 protease in the human genome aligned by MACSE.

FILE 3 in FASTA (fa) format | Aligned translated genomic RVP HMMER results. The sequences found by HMMER directed by LTRdigest for the protein family HMM RVP in the human genome aligned and translated by MACSE.

FILE 4 in FASTA (fa) format | Aligned genomic RVP HMMER results. The sequences found by HMMER directed by LTRdigest for the protein family HMM RVP in the human genome aligned by MACSE.

FILE 5 in FASTA (fa) format | Aligned translated genomic reverse transcriptase BLAST results. The sequences found by BLAST for the HK2 reverse transcriptase in the human genome aligned and translated by MACSE.

FILE 6 in FASTA (fa) format | Aligned genomic reverse transcriptase BLAST results. The sequences found by BLAST for the HK2 reverse transcriptase in the human genome aligned by MACSE.

FILE 7 in FASTA (fa) format | Aligned translated genomic RVT\_1 HMMER results. The sequences found by HMMER directed by LTRdigest for the protein family HMM RVT\_1 in the human genome aligned and translated by MACSE.

FILE 8 in FASTA (fa) format | Aligned genomic RVT\_1 HMMER results. The sequences found by HMMER directed by LTRdigest for the protein family HMM RVT\_1 in the human genome aligned by MACSE.

FILE 9 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 protease amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by BLAST using the mature HK2 protease as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.

FILE 10 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 protease nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by BLAST using the mature HK2 protease as a query as inferred by RAxML under the GTR model of evolution and aligned by MACSE.

FILE 11 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 reverse transcriptase amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.

FILE 12 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 reverse transcriptase nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the GTR model of evolution and aligned by MACSE.

FILE 13 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVP amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVP as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.

FILE 14 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVP nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVP as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.

FILE 15 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVT\_1 amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVT\_1 as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.

FILE 16 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVT\_1 nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVT\_1 as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.

FILE 17 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic protease BLAST amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the JTT + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 18 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic protease BLAST nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 19 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic reverse transcriptase BLAST amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the RtREV + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 20 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic reverse transcriptase BLAST nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model.

FILE 21 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVP HMMER amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the WAG + G + F model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 22 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVP HMMER nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 23 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVT\_1 HMMER amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the RtREV + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 24 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVT\_1 HMMER nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.

FILE 25 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic protease BLAST amino acid sequences. The reference tree for placement of genomic protease BLAST results labeled for use with the corresponding classification likelihood file.

FILE 26 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic protease BLAST nucleic acid sequences. The reference tree for placement of genomic protease BLAST results labeled for use with the corresponding classification likelihood file.

FILE 27 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic reverse transcriptase BLAST amino acid sequences. The reference tree for placement of genomic reverse transcriptase BLAST results labeled for use with the corresponding classification likelihood file.

FILE 28 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic reverse transcriptase BLAST nucleic acid sequences. The reference tree for placement of genomic reverse transcriptase BLAST results labeled for use with the corresponding classification likelihood file.

FILE 29 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic RVP HMMER amino acid sequences. The reference tree for placement of genomic RVP HMMER directed by LTRdigest results labeled for use with the corresponding classification likelihood file.

FILE 30 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic RVP HMMER nucleic acid sequences. The reference tree for placement of genomic RVP HMMER directed by LTRdigest results labeled for use with the corresponding classification likelihood file.

FILE 31 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic RVT\_1 HMMER amino acid sequences. The reference tree for placement of genomic RVT\_1 HMMER directed by LTRdigest results labeled for use with the corresponding classification likelihood file.

FILE 32 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic RVT\_1 HMMER nucleic acid sequences. The reference tree for placement of genomic RVT\_1 HMMER directed by LTRdigest results labeled for use with the corresponding classification likelihood file.

FILE 33 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using BLAST with the HK2 protease as a query. The phylogeny of amino acid sequences identified in representative Repbase consensa by BLAST using the mature HK2 protease as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.

FILE 34 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using BLAST with the HK2 protease as a query. The phylogeny of nucleic acid sequences identified in representative Repbase consensa by BLAST using the mature HK2 protease as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.

FILE 35 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using BLAST with the HK2 reverse transcriptase as a query. The phylogeny of amino acid sequences identified in representative Repbase consensa by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the RtREV + G model of evolution and aligned by MACSE.

FILE 36 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using BLAST with the HK2 reverse transcriptase as a query. The phylogeny of nucleic acid sequences identified in representative Repbase consensa by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.

FILE 37 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using HMMER for RVP. The phylogeny of amino acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVP as a query as inferred by RAxML under the WAG + G + F model of evolution and aligned by MACSE.

FILE 38 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using HMMER for RVP. The phylogeny of nucleic acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVP as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.

FILE 39 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using HMMER for RVT\_1. The phylogeny of amino acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVT\_1 as a query as inferred by RAxML under the RtREV + G + I model of evolution and aligned by MACSE.

FILE 40 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using HMMER for RVT\_1. The phylogeny, inferred by RAxML under the GTR + G model of evolution, of nucleic acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVT\_1 as a query aligned by MACSE.

FILE 41 in FASTA (fa) format | Alignment used to generate Figure 4.

FILE 42 in PNG (png) format | Expression of non-HML2/HML3 ERVs in human tissues. The expression of ERVK loci in GRCh38 containing an uninterrupted protease was measured with bowtie2 using RNA-Seq data from different conditions (SRA accession numbers for each condition are ALS – SRP064478, Breast Cancer – SRP058722, Prostate Cancer – ERP000550). Normalization to fragments per kilobase of exon per million mapped reads (FPKM) is relative to the entire locus. This graphic was made using R with the tidyverse library, and GIMP.

FILE 43 in PNG (png) format | Expression of ERVK in cervical spinal cord RNA-Seq of male and female donors with ALS. The expression of ERVK loci in GRCh38 containing an uninterrupted protease was measured with bowtie2 using RNA-Seq data from different conditions (SRA accession numbers for each condition are ALS – SRP064478, Breast Cancer – SRP058722, Prostate Cancer – ERP000550). Normalization to fragments per kilobase of exon per million mapped reads (FPKM) is relative to the entire locus. Samples were sexed using expression of XIST. This graphic was made using R with the tidyverse library, and GIMP.

FILE 44 in Comma Separated Values (csv) format | Counts of HML2/HML3 ERVs expressed in human tissues.

FILE 45 in Excel (xlsx) format | Counts of HML2/HML3 ERVs expressed in human tissues.

FILE 46 in FASTA (fa) format | Sequence of the HML2 protease. The sequence of the ERVK-10 mature protease.

FILE 47 in FASTA (fa) format | Sequence of the HML2 reverse transcriptase. The sequence of the RVT\_1 domain of the reconstituted HML2 progenitor virus Phoenix.

FILE 48 in ZIP (zip) format | Important scripts used for this study. This ZIP archive contains the perl and bash scripts used to locate and align protease and reverse transcriptase using BLAST in the human genome and the scripts used to measure expression in RNA-Seq data.

#### REFERENCES

1000 Genomes Project (2018). Available at: http://www.internationalgenome.org/


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Turnbull and Douville. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Transcriptional Modulation of Human Endogenous Retroviruses in Primary CD4**+** T Cells Following Vorinostat Treatment

*Cory H. White<sup>1†</sup>, Nadejda Beliakova-Bethell<sup>2,3</sup>, Steven M. Lada<sup>2</sup>, Michael S. Breen<sup>4</sup>, Tara P. Hurst<sup>5</sup>, Celsa A. Spina<sup>2,6</sup>, Douglas D. Richman<sup>2,3,6</sup>, John Frater<sup>7</sup>, Gkikas Magiorkinis<sup>5†</sup> and Christopher H. Woelk<sup>1\*†</sup>*

*<sup>1</sup>Faculty of Medicine, University of Southampton, Southampton, Hants, United Kingdom, <sup>2</sup>San Diego VA Medical Center and Veterans Medical Research Foundation, San Diego, CA, United States, <sup>3</sup>Department of Medicine, University of California San Diego, La Jolla, CA, United States, <sup>4</sup>Department of Genetic and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States, <sup>5</sup>Department of Zoology, University of Oxford, Oxford, United Kingdom, <sup>6</sup>Department of Pathology, University of California San Diego, La Jolla, CA, United States, <sup>7</sup>Nuffield Department of Clinical Medicine, Peter Medawar Building for Pathogen Research, South Parks Road, Oxford, United Kingdom*

The greatest obstacle to a cure for HIV is the provirus that integrates into the genome of the infected cell and persists despite antiretroviral therapy. A "shock and kill" approach has been proposed as a strategy for an HIV cure whereby drugs and compounds referred to as latency-reversing agents (LRAs) are used to "shock" the silent provirus into active replication to permit "killing" by virus-induced pathology or immune recognition. The LRA most utilized to date in clinical trials has been the histone deacetylase (HDAC) inhibitor—vorinostat. Potentially, pathological off-target effects of vorinostat may result from the activation of human endogenous retroviruses (HERVs), which share common ancestry with exogenous retroviruses including HIV. To explore the effects of HDAC inhibition on HERV transcription, an unbiased pharmacogenomics approach (total RNA-Seq) was used to evaluate HERV expression following the exposure of primary CD4<sup>+</sup> T cells to a high dose of vorinostat. Over 2,000 individual HERV elements were found to be significantly modulated by vorinostat, whereby elements belonging to the ERVL family (e.g., LTR16C and LTR33) were predominantly downregulated, in contrast to LTR12 elements of the HERV-9 family, which exhibited the greatest signal, with the upregulation of 140 distinct elements. The modulation of three different LTR12 elements by vorinostat was confirmed by droplet digital PCR along a dose–response curve. The monitoring of LTR12 expression during clinical trials with vorinostat may be indicated to assess the impact of this HERV on the human genome and host immunity.

Keywords: human endogenous retroviruses, histone deacetylase inhibitor, primary CD4**+** T cells, total RNA-Seq, long terminal repeat

#### *Edited by:*

*Hao Shen, University of Pennsylvania, United States*

#### *Reviewed by:*

*Dirk Dittmer, University of North Carolina at Chapel Hill, United States; Wenhui Hu, Temple University, United States*

#### *\*Correspondence: Christopher H. Woelk*

*christopher.woelk@merck.com*

#### *†Present address:*

*Cory H. White and Christopher H. Woelk, Merck Exploratory Science Center, Merck Research Laboratories, Cambridge, MA, United States; Gkikas Magiorkinis, Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, Athens, Greece*

#### *Specialty section:*

*This article was submitted to Microbial Immunology, a section of the journal Frontiers in Immunology*

*Received: 24 October 2017 Accepted: 09 March 2018 Published: 12 April 2018*

#### *Citation:*

*White CH, Beliakova-Bethell N, Lada SM, Breen MS, Hurst TP, Spina CA, Richman DD, Frater J, Magiorkinis G and Woelk CH (2018) Transcriptional Modulation of Human Endogenous Retroviruses in Primary CD4+ T Cells Following Vorinostat Treatment. Front. Immunol. 9:603. doi: 10.3389/fimmu.2018.00603*

### INTRODUCTION

Vorinostat is a histone deacetylase (HDAC) inhibitor also known as suberoylanilide hydroxamic acid. HDAC inhibitors act on HDAC enzymes and block the removal of acetyl groups from histones resulting in a relaxed chromatin state (1) and the modulation of the expression of large numbers of genes (2, 3). In addition, HDAC inhibitors appear to affect the acetylation states of transcription factors at the protein level, which alters their activity and leads to further transcriptional changes (4). HDAC inhibitors have wide-ranging therapeutic value and have been considered for the treatment of cancer (5) and neurodegenerative disorders (6), as well as in "shock and kill" strategies to facilitate an HIV cure (7). The therapeutic efficacy of HDAC inhibitors against cancer is thought to stem from their ability to induce tumor cell apoptosis (5). Vorinostat is approved by the U.S. Food and Drug Administration (FDA) for the treatment of refractory cutaneous T-cell lymphoma (8). In an HIV cure setting, HDAC inhibitors may provide the "shock" capable of flushing HIV out of the persistent reservoir, while antiretroviral therapy is used to prevent new infections so that the cell lysis mediated by viral replication or the immune system may then "kill" actively replicating cells (7). Due to the pre-existing FDA approvals for human use, vorinostat has already been used in a number of completed (9, 10) and ongoing (11) clinical trials assessing shock and kill strategies for an HIV cure.

Human endogenous retroviruses (HERVs), which constitute approximately 8% of the human genome, are themselves descended from ancient exogenous retroviruses (12) and thus share common ancestry with HIV. HERV structure reflects that of retroviruses with two long terminal repeat (LTR) elements flanking *gag*, *pol*, and *env* genes, although HERVs most frequently exist in the genome as solitary LTR elements due to the loss of genes through recombination (13). Since vorinostat activates the expression of HIV, there have been concerns that this drug may also upregulate HERVs with potentially pathological consequences (14). For example, HERV pathology could result from the modulation of the expression of protein coding genes or the formation of chimeric proteins with aberrant function leading to oncogenesis (15), as well as the dysregulation of inflammatory immune responses through the expression of HERV encoded proteins (e.g., *gag* and *env*) (16). Indeed, HERV expression has previously been associated with a wide repertoire of diseases including diabetes, schizophrenia, autoimmune diseases (e.g., multiple sclerosis and rheumatoid arthritis), and cancer (17). However, the difficulties in associating HERVs with disease should be stressed due to their ubiquitous nature in human populations although polymorphisms between individuals could explain disease specificity (17). Finally, it was previously shown that HIV capsids could be successfully pseudotyped *in vitro* with HERV-W Env resulting in infectious virus particles (18). This raises the possibility that coexpression of HERVs and HIV might lead to novel retroviral strains with new properties through transcomplementation or recombination, although the latter may be unlikely due to the large evolutionary distance between HERV elements and HIV (19).

To explore the ability of vorinostat to modulate the expression of HERV elements in the human genome, our previous analysis utilized a targeted approach [i.e., real-time reverse transcription polymerase chain reaction (RT-qPCR)] to assess the expression of the *env* and *pol* genes of specific HERV families (i.e., HERV-K, HERV-W, and HERV-FRD) following HDAC inhibitor treatment (14). This study showed that treating cell line model systems of chronic HIV infection (i.e., J-LAT-8.4 and U1 cells) with different concentrations of vorinostat (i.e., 1 µM and 1 mM) for 24 h did not significantly alter the expression of these HERV elements. Furthermore, treatment of uninfected and HIV-infected primary CD4<sup>+</sup> T cells with another HDAC inhibitor, panobinostat (20 nM), for 24 h did not result in the upregulation of these HERV genes. In contrast, Kronung et al. (20) previously applied another targeted RT-qPCR approach to study the expression of transcripts of the *TP63* and *TNFRSF10B* genes that are under control of an LTR12 promoter derived from the HERV-9 family. Treatment with vorinostat (1 or 5 µM) for 18 h upregulated these genes *via* the LTR12 promoter across various cell lines (i.e., GH, H1299, K562, U2OS, HeLa, Ovcar-3, and HuT-78), suggesting that this drug may indeed modulate HERV elements. However, discrepancies have been noted between cell lines and primary cells with respect to the host gene transcriptional profile induced by vorinostat (2). The main motivation for the current study was to resolve these discrepancies and determine if vorinostat can modulate HERV elements in primary CD4<sup>+</sup> T cells using an unbiased approach (i.e., total RNA-Seq). Uninfected instead of HIV-infected primary CD4<sup>+</sup> T cells were selected for study to disambiguate the effects of vorinostat on HERV elements since the Tat protein of HIV has also been shown to activate HERV elements, e.g., HERV-K(HML-2) (21, 22).

### MATERIALS AND METHODS

#### Isolation of Primary CD4**+** T Cells

For subsequent total RNA-Seq analysis, cryopreserved primary CD4<sup>+</sup> T cells that were viably frozen were obtained from four different healthy donors (AllCells, Inc., Emeryville, CA, USA) and thawed in RPMI with 20% human serum. Dead cells resulting from thawing frozen cells were removed using Viahance magnetic negative selection (Biophysics Assay Laboratory Inc., Worcester, MA, USA). For dose–response analysis, peripheral blood was isolated from two additional healthy donors by venipuncture according to the protocols approved by an institutional review board into polypropylene syringes containing sodium heparin. Primary CD4<sup>+</sup> T cells were isolated using the RosetteSep CD4<sup>+</sup> T cell enrichment cocktail (StemCell Technologies Inc., Vancouver, Canada). Aliquots taken from CD4<sup>+</sup> T cell samples were subjected to flow cytometry to assess purity (i.e., >95% cells expressing CD4).

#### Treatment of Primary CD4**+** T Cells With Vorinostat

Primary CD4<sup>+</sup> T cells (2.5 million cells per milliliter) were plated into six-well tissue culture plates at 2 ml per well. For the four donors subjected to total RNA-Seq analysis, wells were either treated with a high dose of vorinostat (10 µM) dissolved in dimethyl sulfoxide (DMSO) or left untreated (i.e., DMSO solvent only). For the two donors subjected to dose–response analysis by droplet digital PCR, the wells were treated with 0.34, 1, 3, and 10 µM of vorinostat dissolved in DMSO or left untreated (i.e., DMSO solvent only). In all cases, after 24 h of vorinostat exposure, the samples were washed twice with 10 ml of phosphate buffered saline and resuspended in RLT Plus buffer (Qiagen, Valencia, CA, USA) containing β-mercaptoethanol for RNA extraction.

#### RNA Isolation

Total RNA was extracted from primary CD4<sup>+</sup> T cells using the RNeasy Plus Kit (Qiagen, Valencia, CA, USA) and genomic DNA removed using an on-column DNase treatment. RNA integrity was assessed using the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA) and RNA integrity numbers of samples were on average 8.9 (SD ± 0.29).

#### Total RNA-Seq Data Generation

Cytoplasmic and mitochondrial ribosomal RNAs (rRNAs) were removed from total RNA extractions using the Ribo-Zero Gold (Human/Mouse/Rat) rRNA Removal Kit (Epicentre, Madison, WI, USA). RNA-Seq libraries were prepared using the TruSeq™ Stranded Total RNA Library Prep Kit (Illumina, San Diego, CA, USA) and sequenced to a depth of 100 million reads using the HiSeq2000 (Illumina) to generate 50 bp paired-end reads.

#### Total RNA-Seq Data Analysis

Sequence data in FASTA format for more than 92,000 distinct HERV elements from the human genome were downloaded from the Endogenous Retrovirus Database (HERVd, 2012 release) (23, 24). This FASTA file of HERV sequence data was converted into a Bowtie index using the *bowtie-build* command (25) and also used to manually construct a gene transfer format (GTF) file. Duplication has led to the expansion of HERV elements throughout the human genome, and these elements are often fragmented due to insertions. To enable accurate quantification of HERV elements, reads were mapped with the "-m 1" option using the Bowtie index to ensure that only reads uniquely mapping to a single HERV element with no mismatches were retained. To maximize the alignment of reads to fragments of HERV elements, paired-end reads were decoupled into single-end reads for mapping purposes. The number of reads mapping to each HERV element was then counted using htseq-count with the GTF file (26). Raw counts were converted to counts per million (cpm) mapped reads, HERV elements that did not reach at least 1 cpm in at least half of the samples were removed, and the remaining data were subjected to trimmed mean of M-values normalization. Finally, reads from the total RNA-Seq data were also mapped to the human genome using TopHat (27) (default settings with coverage-based search for junctions disabled) in order to visualize read pile-up against HERV elements in a genomic context using the UCSC genome browser (28).
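The expression filter described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and element identifiers are hypothetical, library size is assumed to be the total number of reads counted across all HERV elements in a sample, and the trimmed mean of M-values normalization step is not reproduced here.

```python
def cpm_matrix(counts):
    """Convert raw counts to counts per million (cpm).

    counts maps each element ID to a list of raw read counts, one per sample.
    Assumption: library size = sum of counted reads over all elements.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        name: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for name, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_fraction=0.5):
    """Keep elements with at least min_cpm in at least min_fraction of samples."""
    cpm = cpm_matrix(counts)
    kept = {}
    for name, row in cpm.items():
        n_pass = sum(1 for value in row if value >= min_cpm)
        if n_pass >= min_fraction * len(row):
            kept[name] = counts[name]
    return kept


# Toy data with hypothetical element IDs (four samples, 1,000,000 reads each).
toy_counts = {
    "elem_a": [500000, 400000, 600000, 500000],
    "elem_b": [0, 0, 0, 1],
    "elem_c": [500000, 600000, 400000, 499999],
}
print(sorted(filter_low_expression(toy_counts)))
```

In this toy example, `elem_b` reaches 1 cpm in only one of four samples and is therefore dropped, while the other two elements are retained.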

#### Total RNA-Seq Data Access

Metadata, FASTQ files, and a raw HERV expression matrix have been submitted to the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE102187.

#### Droplet Digital PCR (ddPCR) Analysis

Custom TaqMan assays were used to quantify the upregulation by vorinostat of three LTR12 elements with HERVd designations rv\_007357, rv\_007420, and rv\_010177, using the QX100 Droplet Digital PCR System (Bio-Rad, Hercules, CA, USA) as previously described (29, 30). Briefly, the PrimerQuest Tool (31) (Integrated DNA Technologies, Coralville, IA, USA) was used to design two TaqMan assays against rv\_007357 and a single assay against each of the remaining LTR12 elements. Primer and probe sequences for these TaqMan assays were subject to BLAT analysis against the human genome to confirm specificity and are presented in **Table 1**. Five nanograms of RNA in a 20 µl PCR reaction volume were used for each target in duplicate. The TaqMan Gene Expression Assay (Hs03044961\_g1) for the ribosomal protein L27 (*RPL27*) gene was selected as a normalizer (32). LTR12 element expression was assessed between the vorinostat treated (10 µM) and untreated condition using the samples used to generate the original total RNA-Seq data (Donors 1–4). LTR12 expression was also assessed in a vorinostat dose–response curve (0, 0.34, 1, 3, and 10 µM) for two additional donors (Donors 5 and 6).

*<sup>a</sup> The "rv" designations from the Human Endogenous Retrovirus Database (HERVd) are listed for each LTR12 element. Two primer and probe sets were used for the LTR12 element with designation rv\_007357.*

*GC%, percentage of guanine and cytosine bases in corresponding primer or probe; bp, base pair.*

#### Statistical Analyses

For total RNA-Seq data, HERV elements were identified as differentially expressed between vorinostat-treated and untreated samples at a false discovery rate (FDR) corrected *p*-value <0.05 using edgeR (33). edgeR models RNA-Seq count data with a negative binomial distribution, an approach that requires an estimate of the true biological coefficient of variation. The square of this value, the dispersion parameter, was estimated in edgeR by first fitting a single dispersion parameter using all genes while taking into account trends with gene abundance (i.e., trended dispersion). Genewise (i.e., tagwise) dispersion estimates were then computed, and an empirical Bayes method was used to shrink these genewise dispersion estimates toward the trended dispersion. Gene expression data were then fit to a generalized linear model (GLM), and a GLM likelihood ratio test was used to assess differential gene expression. Model parameters included a sample donor variable to account for the paired structure of the data (treated and untreated samples from the same donor). Significance values were adjusted for multiple testing using the Benjamini and Hochberg (34) method to control the FDR.
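Two pieces of the statistical framework above lend themselves to a short illustration: the relationship between the dispersion and the biological coefficient of variation (BCV) in the negative binomial model, and the Benjamini and Hochberg adjustment. The Python sketch below is only illustrative and is not edgeR itself; the function names are hypothetical, and the variance formula follows the standard edgeR parameterization, in which the variance equals the mean plus the dispersion times the squared mean.

```python
import math


def nb_variance(mean, dispersion):
    """Negative binomial variance: Poisson noise plus biological variation."""
    return mean + dispersion * mean ** 2


def bcv(dispersion):
    """Biological coefficient of variation is the square root of the dispersion."""
    return math.sqrt(dispersion)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative values only (not data from this study).
print(nb_variance(100.0, 0.04), bcv(0.04))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

An element would then be called differentially expressed when its adjusted value falls below 0.05, matching the threshold stated above.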

Validation of gene expression initially assessed by total RNA-Seq was performed using ddPCR, which is more sensitive than RT-qPCR, since target RNA is distributed across thousands of oil emulsion droplets that each undergo reverse transcription and a subsequent end point PCR reaction. The number of target RNA molecules present was calculated from the fraction of positive end point reactions using Poisson statistics because some droplets contain no template while others contain one or more copies (35). All ddPCR data were expressed as copies of target RNA molecules (e.g., LTR12 element) per million copies of *RPL27* mRNA molecules and then log2 transformed. Prior to log2 transformation, a small regularization constant of 0.01 was added to all values used in this calculation to avoid taking a logarithm of zero in some instances. Differential expression of LTR12 elements was assessed between the vorinostat-treated (10 µM) and untreated condition in a paired *t*-test (*p*-value < 0.05) using the samples used to generate the original total RNA-Seq data (Donors 1–4), as well as along the aforementioned dose–response curve for Donors 5 and 6.
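The Poisson correction and normalization described above can be sketched as follows. This is a minimal illustration rather than the instrument software: the function names are hypothetical, and the sketch assumes that the 0.01 constant is added to the *RPL27*-normalized value immediately before the log2 transformation, which the text does not specify in detail.

```python
import math


def copies_per_droplet(positive, total):
    """Mean copies per droplet from the fraction of positive droplets.

    Under Poisson loading, P(droplet is negative) = exp(-lambda), so
    lambda = -ln(1 - positive / total).
    """
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total droplets")
    return -math.log(1.0 - positive / total)


def total_copies(positive, total):
    """Estimated number of template copies across all analysed droplets."""
    return copies_per_droplet(positive, total) * total


def log2_per_million_reference(target_copies, reference_copies, constant=0.01):
    """log2 of target copies per million reference copies (e.g., RPL27).

    Assumption: the regularization constant is added to the normalized value.
    """
    per_million = 1e6 * target_copies / reference_copies
    return math.log2(per_million + constant)


# Illustrative droplet counts only (not data from this study).
target = total_copies(positive=500, total=10000)
reference = total_copies(positive=9000, total=10000)
print(log2_per_million_reference(target, reference))
```

The logarithmic correction matters most when a large fraction of droplets is positive, because a positive droplet may then contain more than one template copy.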

### RESULTS

#### Global Modulation of HERVs Upon Treatment With Vorinostat

Primary CD4<sup>+</sup> T cells were isolated from four seronegative donors and treated with a high dose of vorinostat (10 µM) or left untreated for 24 h. This high dose of vorinostat was initially used in an exploratory approach to optimize the ability to identify HERV elements modulated by this HDAC inhibitor. These eight samples were subjected to total RNA-Seq analysis, and the resulting data mapped against HERV sequences curated in HERVd (23). The non-parametric equivalent of a paired *t*-test identified 2,101 distinct HERV elements modulated by vorinostat with an FDR corrected *p*-value of less than 0.05 that mapped to 120 different HERV families across the human genome. In a conservative approach, HERV elements with an absolute log2 fold change greater than 3 were identified, leaving 451 upregulated HERV elements from 81 distinct HERV families and 363 downregulated elements from 82 families annotated from HERVd (**Figure 1**). LTR16C and LTR33 elements, which originated from the ERVL family, were predominantly downregulated, whereas LTR12 elements from the HERV-9 family were predominantly upregulated. The upregulation of LTR12 elements was by far the most dramatic: the most upregulated element (rv\_005487) had a log2 fold change of 11.985 (actual fold change 4,054). Furthermore, 46 of the top 100 upregulated HERV elements were from the LTR12 HERV family (data not shown). Other ERVL elements (i.e., not belonging to the LTR16C or LTR33 families) had a balance of up- and downregulated elements. In summary, vorinostat clearly modulated HERV elements across the genome but appeared to have specificity for certain elements (e.g., LTR12) and families (e.g., ERVL and HERV-9).
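The relationship between the log2 fold changes and the actual fold changes quoted above, together with the conservative threshold, can be checked with a few lines of Python. The function names are hypothetical and serve only to illustrate the arithmetic.

```python
def fold_change(log2_fc):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2_fc


def is_strongly_modulated(log2_fc, fdr, min_abs_log2_fc=3.0, max_fdr=0.05):
    """Apply the conservative filter: |log2 fold change| > 3 and FDR < 0.05."""
    return abs(log2_fc) > min_abs_log2_fc and fdr < max_fdr


# The most upregulated element reported above (rv_005487).
print(round(fold_change(11.985)))  # about 4,054-fold
# A log2 fold change of 3 corresponds to an eightfold change.
print(fold_change(3.0))
```

Because the filter uses the absolute value, it applies symmetrically to upregulated and downregulated elements.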

… elements belonging to the LTR12 family which were significantly upregulated with a log2 fold change greater than 3 by vorinostat treatment.

Previously, using targeted RT-qPCR analysis, we demonstrated that exposure of cell lines chronically infected with HIV (i.e., J-LAT8.4 or U1 cells) to low (1 µM) and high (1 mM) doses of vorinostat did not lead to the consistent upregulation of the following HERV elements: HK2 *env*, HK2 *pol*, HERV-W *env* (syncytin-1), and HERV-FRD *env* (syncytin-2) (14). To confirm these findings, the expression of these elements in the total RNA-Seq data of the current study was examined following vorinostat treatment of primary CD4<sup>+</sup> T cells. There was no difference in expression of these elements between vorinostat-treated and untreated cells (Figure S1 in Supplementary Material).

#### ddPCR Confirmation of LTR12 Upregulation by Vorinostat

LTR12 elements were selected for ddPCR validation since these elements were the most upregulated by vorinostat. Three LTR12 elements on chromosome 6 (rv\_007357, rv\_007420, and rv\_010177) were selected for ddPCR analysis since they were (1) upregulated by at least eightfold (log2 fold change of 3), (2) longer than 1,000 bp and upregulated along large segments of the HERV element, (3) not confounded by the presence of a neighboring gene within 5 kb, and (4) consistently upregulated across all donors (**Figure 2**). The most upregulated LTR12 element (rv\_005487) is approximately 1.5 kb in the human genome but was not selected for ddPCR analysis because only a 250 bp fragment of this element was expressed. Two TaqMan primer and probe sets were designed against rv\_007357 to capture the two major peaks of expression in this LTR12 element, whereas a single primer and probe set was used to target each of rv\_007420 and rv\_010177 (**Table 1**; **Figure 2**). Significant upregulation of all three elements upon vorinostat exposure was confirmed by ddPCR, and fold changes reflected those detected by RNA-Seq analysis (**Figure 3**).

position of the LTR12 element. LTR12 elements are labeled with their "rv" designation from the Human Endogenous Retrovirus Database. The *y*-axis indicates the read level averaged over 40 bp, and the pink caps to black bars in the figure indicate reads whose numbers extended beyond the depicted scale. Small colored bars represent the position of primers (blue) and probes (red) from custom TaqMan assay used to assess the expression of LTR12 elements by droplet digital PCR. Two distinct TaqMan assays were targeted to rv\_007357 (A) with a single assay against each of the remaining LTR12 elements (B,C).

(ddPCR). Log2 fold changes between vorinostat treated and untreated conditions were averaged across all four donors (Donors 1–4) for the RNA-Seq (black bars) and the ddPCR (gray and hatched bars) data for each LTR12 element. Error bars represent SDs across donors. The labels "Set 1" and "Set 2" indicate the two distinct primer and probe sets used to target the same LTR12 element (i.e., rv\_007357). A second primer and probe set was not used to target the other LTR12 elements, and the missing bar is thus labeled "NA" for not applicable. LTR12 elements are labeled with their "rv" designation from the Human Endogenous Retrovirus Database.

#### Dose Responsive Upregulation of LTR12 by Vorinostat

A high dose of vorinostat (10 µM) was used to treat primary CD4<sup>+</sup> T cells subjected to RNA-Seq analysis. This dose is just above what can be achieved with intravenous administration of vorinostat for the treatment of B- and T-cell malignancies (36) and much higher than doses achieved via oral administration in clinical trials to explore "shock and kill" strategies for an HIV cure (9, 10). Furthermore, 10 µM of vorinostat appears cytotoxic for just over 20% of healthy T lymphocytes (37). Therefore, the expression of the three LTR12 elements was examined by ddPCR over a more pharmacologically relevant dose–response curve (0.34, 1, 3, and 10 µM), which included less cytotoxic doses (e.g., 0.34 and 1 µM), in primary CD4<sup>+</sup> T cells isolated from two independent donors. The expression of each LTR12 element was clearly dose dependent for each donor and, for rv\_007357, was upregulated more than twofold (log2 fold change > 1) at the lowest dose of 0.34 µM in both donors (**Figure 4**). In summary, vorinostat upregulated LTR12 elements at doses used to treat blood cell malignancies (36) and at even lower doses (i.e., 0.34 µM) that are relevant to clinical trials with HIV-infected individuals (9, 10).

### DISCUSSION

A total RNA-Seq experiment was used to investigate the modulation of HERVs in primary CD4<sup>+</sup> T cells exposed to vorinostat. The primary result from this work demonstrated the power of such an untargeted (i.e., unbiased) approach. Our previous studies, using a targeted RT-qPCR approach, had concluded that vorinostat did not modulate HERV elements of the HERV-K, HERV-W, and HERV-FRD families (14). Although this result was confirmed for elements from these families (Figure S1 in Supplementary Material), the current untargeted approach suggests that vorinostat does indeed modulate HERV elements in primary CD4<sup>+</sup> T cells, predominantly LTRs (e.g., LTR12, LTR16C, and LTR33) of the ERV9 and ERVL families (**Figure 1**). Brocks and coworkers (38) recently used Cap Analysis Gene Expression Sequencing (CAGE-Seq) (39) to identify expression from transcription start sites (TSS) in a lung cancer cell line (NCI-H1299) exposed to vorinostat. The CAGE-Seq data were used to identify treatment-induced non-annotated TSS, which were shown to be enriched for LTR12 elements, thus confirming the induction of these elements by vorinostat in a cell line using a different methodology. Furthermore, Kronung and colleagues (20) noted that vorinostat could activate the expression of LTR12-driven genes (e.g., *TP63* and *TNFRSF10B*) in various cell lines but did not modulate the expression of HERV-E, HERV-H or MaLR-driven genes, and thus concluded that vorinostat was specific for ERV9 LTRs (i.e., LTR12). However, the current study suggests that vorinostat modulates the expression of LTRs outside of the ERV9 family (e.g., LTR16C and LTR33). In summary, these results advocate for untargeted approaches (e.g., total RNA-Seq and CAGE-Seq), often referred to as "fishing expeditions," since targeted approaches may not interrogate all relevant transcripts.

The ERV9 family, which includes the LTR12 element, is one of the most successful inhabitants of the human genome due to its continued proliferation until almost six million years ago, around the time of the human-chimp split (40). In contrast, the ERVL family is probably the oldest family and appears to lack an *env* gene, consistent with these elements being ancient retrotransposons that entered genomes before the mammalian radiation (41). It is not clear why the transcription of both of these HERV families is modulated by an HDAC inhibitor (i.e., vorinostat), but this finding suggests that HDACs may be an important epigenetic checkpoint in their transcription. HERV elements have recently been associated with non-coding regulatory RNAs with diverse properties ranging from contributing to the pluripotency of human cells (42) to promoting immunoglobulin M production in B-cell driven immune responses independent of T-cells (43). Therefore, agents such as vorinostat that alter the function of HDACs may need to undergo additional evaluation for HERV upregulation to assess their impact on the function of immune cells.

Modulation of HERV elements by vorinostat may be considered an "off-target" effect with respect to the primary goal of HIV activation for shock and kill strategies to facilitate a cure (7). A limitation of this study is that the pathological consequences of these off-target effects remain unknown (16), but one concern might be the oncogenic effects of LTR-driven genes modulated by vorinostat. Lamprecht et al. (44) have demonstrated that activation of an LTR of the MaLR family drives the expression of a proto-oncogene (i.e., *CSF1R*) that may lead to the development of Hodgkin lymphoma. However, vorinostat appears to drive the expression of pro-apoptotic genes (i.e., *TP63* and *TNFRSF10B*) in cell lines and thus protect against tumorigenesis (20). An alternate view would be that the HERV elements modulated by vorinostat encode products that might facilitate HIV activation. To explore this more fully, future analyses should better characterize the transcripts expressed from HERV elements upregulated by vorinostat treatment with respect to the relevant RNA products (e.g., messenger RNA, long non-coding RNA or micro RNA) and determine their role, if any, in HIV activation.

Another potential oncogenic concern would be that replication-defective HERV elements could recombine to form replication-competent virus with tumor-inducing potential. This phenomenon has been observed *in vivo* in mice (45–48). Specifically, Young et al. (45) demonstrated that a series of recombination events could restore a replication-defective ERV (i.e., Emv2) to a replication-competent virus in antibody-deficient mice (Rag1−/−) that eventually led to thymic and splenic tumors. In humans, the most recently integrated HERV elements belong to the HML-2 family of the HERV-K group, known as HERV-K(HML-2), which have maintained open reading frames encoding functional viral proteins that are expressed but form non-infectious particles (49–52). Dewannieux et al. (53) demonstrated that the human genome still has the coding potential to resurrect infectious retroviruses from replication-defective HERV-K(HML-2) elements. However, this required a three-fragment recombination event *in vitro*, and such a resurrected virus has not been observed to our knowledge *in vivo* in humans. In summary, despite these observations, it is highly unlikely that the HERV elements upregulated by vorinostat in this study, which are predominantly LTR fragments from older HERV families, could recombine to reconstitute the full-length genome required to generate an infectious element.

A final concern is that vorinostat activation of both HIV and HERV elements may lead to recombination and the evolution of novel retroviruses with unknown pathogenicity (19). Acute HIV infection has been shown to lead to the activation of HERV-K(HML-2) elements (22, 54), probably mediated through interactions with the Tat protein. Although subsequent intra-HERV-K recombination has been suggested (55), this remains unconfirmed by other groups (22). Furthermore, to the best of our knowledge, such a recombination event between distantly related retroviruses, such as HERVs and HIV, has never been described. It is unlikely that the HERV elements upregulated in this study by vorinostat exhibit sufficient similarity to facilitate efficient homologous recombination with the HIV genome (56). For example, the LTR12 elements examined in this study exhibit limited similarity with the HIV LTR from the HXB2 strain (accession number K03455): rv\_007357 (38%), rv\_007420 (23%), and rv\_010177 (10%). A limitation of this study is that HERV element modulation by vorinostat was conducted in the absence of HIV infection. HIV infection was excluded in this pilot investigational study because the virus itself can modulate HERV element expression (21, 22), and this confounder was removed so that the effects of HDAC inhibition on HERV element expression could be unambiguously assessed. Future studies of vorinostat treatment of HIV-infected cells could screen for HIV:HERV recombinants, although these are unlikely to be found for the reasons stated earlier.

Analysis of total RNA-Seq data and validation by ddPCR primarily focused on HERV elements upregulated by vorinostat, i.e., LTR12. A potential limitation is that HERV elements may be embedded in, or in close proximity to, genes that are upregulated, so that their signal results from read-through transcription. This was not likely for the three LTR12 elements selected for ddPCR confirmation (**Figures 3** and **4**), since they were at least 5 kbp from protein-coding genes in the genome. In addition, a large number of HERV elements were also downregulated (e.g., ERVL, **Figure 1**). This is reflective of our previous work examining vorinostat-modulated gene expression in which the number of genes downregulated was similar to the number upregulated (2–4). In a similar vein, vorinostat leads to chromatin relaxation, and the milieu of transcription factors present in the nucleus then determines which HERV elements are upregulated and which are downregulated, according to the transcription factor binding sites present in these elements. Specifically, we have previously determined that vorinostat upregulated the high mobility group (HMG) AT-hook 1 (*HMGA1*) transcription factor at the transcript, protein, and acetylation level (4).

In summary, the modulation of a large number of HERV elements by vorinostat was demonstrated using an unbiased approach (i.e., RNA-Seq). Evidence for the pathogenic consequences of HERV modulation is not of sufficient strength to limit the use of vorinostat in shock and kill approaches toward an HIV cure. However, HERV elements such as LTR12 could be monitored as off-target biomarkers during shock and kill clinical trials with HDAC inhibitors and trial subjects should be screened to further explore HIV:HERV interactions.

#### REFERENCES


### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Institutional Review Board (IRB) at the University of California San Diego (UCSD) with written informed consent from all subjects who donated blood for this study. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the IRB at UCSD.

### AUTHOR CONTRIBUTIONS

CWh, NB-B, and CWo contributed to the design of the work. NB-B and SL were responsible for the acquisition of the data which was analyzed by CWh, NB-B, SL, and MB. All the authors contributed to the interpretation of the data. CWh, NB-B, and CWo were primarily responsible for drafting the manuscript with significant inputs from TH and GM with edits and revisions suggested by all other authors.

#### ACKNOWLEDGMENTS

We acknowledge support from the IRIDIS 4 High Performance Computing Facility and associated support services at the University of Southampton, the Collaboratory of AIDS Researchers for Eradication (CARE), the Bioinformatics Core at the University of Southampton, and the Genomics and Sequencing Core at the UCSD Center for AIDS Research.

### FUNDING

This work was performed with the support of the UCSD Center for AIDS Research from the National Institutes of Health (NIH) to DR (P30 AI36214); the U.S. Department of Veterans Affairs grant (K2BX002731) to NBB; the James B. Pendleton Charitable Trust; as well as other NIH grants to DR (the CARE Martin Delaney Collaboratory U19 AI096113) and to CWo (AI104282). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at https://www.frontiersin.org/articles/10.3389/fimmu.2018.00603/full#supplementary-material.


lymphoma. *Oncologist* (2007) 12(10):1247–52. doi:10.1634/theoncologist.12-10-1247


endogenous retroviral elements in the blood of HIV-1-infected individuals. *J Virol* (2012) 86(1):262–76. doi:10.1128/JVI.00602-11

56. An W, Telesnitsky A. Effects of varying sequence similarity on the frequency of repeat deletion during reverse transcription of a human immunodeficiency virus type 1 vector. *J Virol* (2002) 76(15):7897–902. doi:10.1128/JVI.76.15.7897-7902.2002

**Conflict of Interest Statement:** The senior author, CWo, is currently employed by Merck & Co., which manufactures vorinostat, but the research presented in this manuscript was completed prior to this appointment, while CWo was still a professor at the University of Southampton. The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2018 White, Beliakova-Bethell, Lada, Breen, Hurst, Spina, Richman, Frater, Magiorkinis and Woelk. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Variable Baseline Papio cynocephalus Endogenous Retrovirus (PcEV) Expression Is Upregulated in Acutely SIV-Infected Macaques and Correlated to STAT1 Expression in the Spleen

#### Edited by:

*Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece*

#### Reviewed by:

*Mani Larijani, Memorial University of Newfoundland, Canada Claudia Matteucci, University of Rome Tor Vergata, Italy*

#### \*Correspondence:

*Neil Berry neil.berry@nibsc.org Robert Belshaw robert.belshaw@plymouth.ac.uk*

*†Joint senior authors*

#### Specialty section:

*This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology*

Received: *28 January 2019* Accepted: *08 April 2019* Published: *15 May 2019*

#### Citation:

*Maze EA, Ham C, Kelly J, Ussher L, Almond N, Towers GJ, Berry N and Belshaw R (2019) Variable Baseline Papio cynocephalus Endogenous Retrovirus (PcEV) Expression Is Upregulated in Acutely SIV-Infected Macaques and Correlated to STAT1 Expression in the Spleen. Front. Immunol. 10:901. doi: 10.3389/fimmu.2019.00901*

Emmanuel Atangana Maze<sup>1,2</sup>, Claire Ham<sup>2</sup>, Jack Kelly<sup>1</sup>, Lindsay Ussher<sup>1</sup>, Neil Almond<sup>2</sup>, Greg J. Towers<sup>3</sup>, Neil Berry<sup>2\*†</sup> and Robert Belshaw<sup>1\*†</sup>

*<sup>1</sup> School of Biomedical Sciences, Faculty of Medicine and Dentistry, University of Plymouth, Plymouth, United Kingdom, <sup>2</sup> Division of Infectious Disease Diagnostics, National Institute for Biological Standards and Control (NIBSC), Potters Bar, United Kingdom, <sup>3</sup> Division of Infection and Immunity, University College London, London, United Kingdom*

Retroviral replication leaves a DNA copy in the host cell chromosome, which over millions of years of infection of germline cells has led to 5% of the human genome sequence being comprised of endogenous retroviruses (ERVs), distributed throughout an estimated 100,000 loci. Over time these loci have accrued mutations such as premature stop codons that prevent continued replication. However, many loci remain both transcriptionally and translationally active and ERVs have been implicated in interacting with the host immune system. Using archived plasma and tissue samples from past studies of macaques experimentally infected with simian immunodeficiency virus (SIV), the expression of one macaque ERV in response to acute viral infection was explored together with a measure of the innate immune response. Specifically, RNA levels were determined for (a) *Papio cynocephalus* Endogenous Retrovirus (PcEV), an ERV, (b) STAT1, a key gene in the interferon signaling pathway, and (c) SIV, an exogenous pathogen. Bioinformatic analysis of DNA sequences of the PcEV loci within the macaque reference genome revealed the presence of open reading frames (ORFs) consistent with potential protein expression but not ERV replication. Quantitative RT-PCR analysis of DNase-treated RNA extracts from plasma collected during acute SIV infection detected PcEV RNA at low levels in 7 of 22 macaques. PcEV RNA levels were significantly elevated in PBMC and spleen samples recovered during acute SIV infection, but not in the thymus and lymph nodes. A strong positive correlation was identified between PcEV and STAT1 RNA levels in spleen samples recovered from SIV-positive macaques. One possibility is that SIV infection induces PcEV expression in infected lymphoid tissue that contributes to induction of an antiviral response.

Keywords: endogenous retrovirus, STAT1, macaque, PcEV, SIV, RNA, innate immunity

## INTRODUCTION

Endogenous Retroviruses (ERV) are descendants of ancient retroviral infections that have become established in the germline and proliferated to now represent ∼5% of the genome in humans and other mammals, rising to ∼8% if the older group of Mammalian Apparent LTR-Retrotransposons (MaLRs) is included. Individual proviruses, termed loci, accumulate mutations over time during their Mendelian transmission that eventually render them replication-defective. In the human reference genome, no locus has retained full-length Open Reading Frames (ORFs) for all genes, which precludes the possibility of yielding infectious, cell-free progeny virions (1, 2). However, ERVs and other retroelements have been shown to play a role in epigenetic gene regulation (3–7), with many ERV loci having been co-opted. Examples include HERV-E Long Terminal Repeat (LTR)-driven tissue-specific expression of a human salivary amylase gene (8), ERV-derived syncytins that play a role in placentation (9), and promoter-containing LTRs of ERVs that influence epigenetic gene control and pluripotency (10).

The study of ERVs in non-human primates (NHP) has potential to broaden the wealth of data generated from mouse studies. Smaller mammals tend to have more ERV loci integrating into their genome than larger mammals (11) and, consistent with this general pattern, replication-competent ERV loci have been found in the mouse (12) but not in humans (1, 2, 12–14). In terms of the number of recently integrated ERV loci, the macaque genome is more similar to the human than the mouse (11). Macaques thus represent an opportunity to derive additional relevant data that could augment more conventional mouse studies. To help bridge this gap we have utilized archived materials derived from previous simian immunodeficiency virus (SIV) studies.

Recently, it has been suggested that ERVs and other retroelements play a role in innate immune signaling by acting as Pathogen Associated Molecular Patterns (PAMPs) (15–21). Indeed, it has been proposed that retroelement PAMPs, such as cytosolic DNA resulting from reverse transcription, set an activation threshold for triggering the innate immune response (22). These authors speculate that the level at which activation occurs is an evolutionary trade-off between avoidance of tolerance of exogenous pathogens on one hand and constant triggering by endogenous elements on the other. The latter appears to occur in the inherited autoimmune disease Aicardi-Goutières syndrome, where deficiency of the TREX1 DNA exonuclease may lead to a build-up of retroelement PAMPs, which triggers inappropriate innate signaling (23). If ERVs were involved in the early immune response, it might be expected that a correlation would exist between ERV expression and measures of innate immune activation in archived macaque samples taken during acute infection.

Our study focuses on Papio cynocephalus Endogenous Retrovirus (PcEV), a recently integrated ERV first described in baboons (24). PcEV is one of three ERV lineages copying within the macaque genome during the last 5 million years (25). Since a range of components in the ERV replication cycle can potentially be detected by the innate immune system (26), we argue that the likelihood of an ERV locus being biologically relevant increases if its sequence is free of inactivating mutations such as premature stop codons. Induction of type-1 interferons (IFN-1), via the JAK/STAT pathway, represents a key part of the innate response to viral infections (27, 28). Interferon-stimulated genes (ISGs) act to limit viral replication and play a role in the control of the viral burden (29). However, the immunomodulatory role of ISG induction in acute-phase and chronic SIV infection is complex and relatively poorly understood, contributing to both dampening and inflammatory immune responses (30, 31). IFN-1 signaling results in the transcription of hundreds of ISGs, an important member being STAT1 (32, 33). Basal IFN-1 levels set the abundance of STAT1, which influences the susceptibility toward infection (34). STAT1 therefore provides a measure of IFN-1 activation induced during acute SIV infection in macaques (35), hence its expression was explored in this study.

Archived macaque plasma and tissue samples were analyzed to determine PcEV activity during acute SIV infection when de novo responses are likely to be highest, particularly in lymphoid tissues targeted during primary infection (35). This activity was also compared with STAT1 RNA levels in the same RNA preparations. Our study represents the first analysis of PcEV loci in the macaque reference genome sequences and demonstrates the presence of multiple potentially protein-coding, but not replicating, PcEV loci. In conjunction with identification of low levels of PcEV RNA in plasma of a proportion of acutely SIV-infected macaques, PcEV was actively transcribed in tissues with a level of cell-associated gene expression that is upregulated in response to acute SIV infection. In the spleen, this appeared to be directly correlated to STAT1 RNA expression. The potential significance of these findings linking ERV and innate immunity is discussed.

#### RESULTS

### Bioinformatic Analysis Suggests PcEV Protein Production but Not Replication

An online search of the most recent rhesus macaque (Macaca mulatta, RM) genome assembly (rheMac8) using the constructed PcEV reference sequence revealed 72 PcEV loci (plus another 76 matches in unassembled regions), most of which are represented only by fragments. **Figure 1** illustrates the only 10 loci that do not have large regions (more than a few hundred nucleotides) missing from the reference genome. Among these loci are examples of full-length ORFs for all genes but no single locus has full-length ORFs in all genes.

The fragmentation observed is largely the result of genome assembly errors. Four of these 10 most intact loci contain scaffold gaps, where sequenced regions either side have not been joined. One example is locus chr9:50301533-10531, which in the rheMac8 assembly has its pro-pol reading frame interrupted by a single premature stop codon and by a long insertion which consists of a tandem duplication at a region marked as a scaffold gap. However, an examination of this locus sequence from an earlier assembly of the RM genome (rheMac2), which was from the same individual animal, and from the homologous locus in the cynomolgus macaque (Macaca fascicularis, CM) genome (macFas5), a species that diverged from the RM ∼1–2 million years ago (mya) (37, 38), demonstrates that neither the premature stop codon nor the insertion belong to this locus. These three pro-pol sequences are not homologous along their entire length and have a clearly visible transition, a "breakpoint," near to position 3,937 in the reference genome: the ∼40% of the rheMac8 pro-pol sequence before this breakpoint is no more similar to the other two sequences than it is to a range of other PcEV loci, while downstream the sequences of all three loci are very similar.

shown by the number of nucleotides involved ("H" marks an indel within a homopolymer, assumed to be sequencing errors). Results are from the rheMac8 build with all interruptions either confirmed or corrected using the earlier rheMac2 build from the same animal, i.e., interruptions were treated as sequencing errors if present only in one build. Details of assembly problems in these loci are in Supplementary Information, which also contains multiple alignments for the three genes, the LTR alignments used for dating, the flanking regions used to determine locus homology across genomes, and reference (consensus) sequences both for the complete provirus and individual genes. *Pro-pol* alignment starts from the position suggested in Mang et al. (24). The reference coordinates here include a one nucleotide gap inserted at the end of *pro-pol* to incorporate the frameshift in *env*. Note, the *gag* and *pro-pol* are translated in the same frame with suppression of the *gag* stop codon (36).

This can also be demonstrated phylogenetically (**Figure 2**), where a tree of pro-pol sequences before the breakpoint shows the rheMac8 sequence is not recovered in the same clade as the rheMac2 and macFas5 sequences, while a tree of the pro-pol sequences after this breakpoint shows all three sequences recovered together in a well-supported clade (as seen in the other loci). This represents a clear assembly error. Indeed, the locus in rheMac2 and in macFas5 are in the antisense direction while the locus in rheMac8 is in the sense direction. Consistent with widespread assembly problems, only approximately half of the 'intact' loci in rheMac8 are intact in macFas5, and vice versa. There is also much more fragmentation in PcEV loci than in the human HERV-K(HML2) loci (39), which belong to an older ERV lineage (40). The human ERVs were only sequenced after being cloned within bacterial artificial chromosomes (41), which avoided the problem with the macaque genomes of trying to assemble simultaneously multiple very similar ERV loci from short next generation sequencing reads.
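The breakpoint argument rests on comparing sequence similarity on either side of a position in an alignment. The following is a minimal sketch of that comparison, assuming the sequences are already aligned to equal length. The function names, the toy sequences, and the split position are hypothetical and are not taken from the PcEV alignments or from the authors' pipeline.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences, skipping columns with a gap in either."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def identity_around_split(a: str, b: str, split_at: int):
    """Return (identity before split_at, identity from split_at onward)."""
    before = percent_identity(a[:split_at], b[:split_at])
    after = percent_identity(a[split_at:], b[split_at:])
    return before, after


# Toy aligned sequences (hypothetical): dissimilar upstream, identical downstream.
locus_x = "ACGTTGCAACGTACGTACGT"
locus_y = "TGCATGCTACGTACGTACGT"
before, after = identity_around_split(locus_x, locus_y, 8)
print(f"identity before split: {before:.1f}%, after split: {after:.1f}%")
```

A sharp change in identity across the split, as in the toy data, is the kind of pattern that would prompt the separate phylogenetic trees described above.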

Evidence was identified of insertionally polymorphic PcEV loci, i.e., loci present in some individuals but not others (42). The intact loci recovered are mostly several millions of years old (**Figure 1**), the most recently integrated locus being chr9:50301533-10531. The LTRs of this locus differ by two substitutions which, given an estimated rate of nucleotide substitution of ∼1 × 10<sup>−9</sup>/nucleotide/year (25), indicates an age of ∼1.9 million years. All these loci are also found in the CM reference genome and are therefore expected to be fixed in the RM population. However, two loci with large scaffold gaps were estimated to have integrated within at least the last 700,000 years; both were insertionally polymorphic. Locus chr1:55452680-60247 is heterozygous for presence/absence (=provirus/pre-integration site) in the reference genome and locus chr5:114918771-21894 is represented solely by the pre-integration site in the rheMac3 genome sequence, which is from a different individual (from China rather than India; further details of these loci are included in **Supplementary Information**).
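The age estimate above can be illustrated with the commonly used LTR-divergence calculation, in which the two LTRs of a provirus are identical at integration and then diverge independently, so that age = divergence / (2 × rate). This is a sketch under stated assumptions: the text does not give the LTR length, so the value used below is hypothetical and was chosen only because it is consistent with the reported figures; the function name is also illustrative.

```python
def ltr_age_years(substitutions: int, ltr_length: int, rate_per_site_per_year: float) -> float:
    """Estimate provirus age from the divergence between its 5' and 3' LTRs.

    Both LTRs accumulate substitutions independently after integration,
    hence the factor of 2 in the denominator.
    """
    divergence = substitutions / ltr_length  # substitutions per site
    return divergence / (2.0 * rate_per_site_per_year)


# Assumption: LTR length is NOT stated in the text; 526 nt is a hypothetical value.
ASSUMED_LTR_LENGTH = 526
age = ltr_age_years(substitutions=2, ltr_length=ASSUMED_LTR_LENGTH, rate_per_site_per_year=1e-9)
print(f"estimated age: {age / 1e6:.2f} million years")
```

With a different LTR length or substitution rate the estimate scales proportionally, which is why dates derived from very few substitutions carry wide uncertainty.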

We infer from these observations that PcEV transcription arising from some of these 10 loci may result in PcEV protein production but is unlikely to result in PcEV replication.
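The distinction drawn here between loci with and without full-length ORFs comes down to whether a coding region contains an in-frame stop codon before its natural end. A minimal sketch of such a check is given below, assuming a nucleotide sequence that begins at the first codon of the gene. The function names and toy sequences are hypothetical; the sketch does not model frameshifts or the *gag* stop-codon suppression noted in the Figure 1 legend.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(seq: str, frame: int = 0) -> list:
    """Return 0-based start positions of in-frame stop codons before the final codon."""
    seq = seq.upper().replace("U", "T")
    codon_starts = list(range(frame, len(seq) - 2, 3))
    # Exclude the last complete codon, which is expected to be the natural stop.
    return [i for i in codon_starts[:-1] if seq[i:i + 3] in STOP_CODONS]


def is_full_length_orf(seq: str, frame: int = 0) -> bool:
    """True if no in-frame stop codon interrupts the sequence before its final codon."""
    return not premature_stops(seq, frame)


# Toy sequences (hypothetical), not taken from any PcEV locus.
intact = "ATGGCCAAATGA"       # ATG GCC AAA TGA
interrupted = "ATGTAAAAATGA"  # ATG TAA AAA TGA
print(is_full_length_orf(intact), premature_stops(interrupted))
```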

### PcEV Loci Are Likely to be Transcribed and Lack STAT1 Binding Sites

The LTRs of the above loci, excluding the two very old ones, all have an undisrupted TATA binding protein (TBP) binding site (=TATA box) at position 382 (**Figure 3**; **Table S1**). This corresponds to the position of the conserved TATA box found in PcEV in the baboon genome (24). Since the TATA box usually has at least minimal promoting activity (43), loci without disruption in the TBP binding site are likely to be transcribed. Another TBP binding site was predicted at position 51 but is unlikely to be real since it is poorly conserved among the loci and not found in other γ-retroviruses, such as HERV-W and MuLV (43, 44). The LTRs also contain many undisrupted transcription factor binding sites. The regions containing these sites corresponded to ones already described in the baboon (24), so were named in accordance with that earlier study. Direct-repeat enhancers (DR1, DR2, DR3) contained sites for GRα, GATA1, and GATA2. Three CCAAT box-associated binding regions (CCAAT box 1, 2, and 3) contained sites for GRβ, CEBPβ, and NFY. A fourth CCAAT box region (box 4) contains these three sites plus the site for CEBPα. Such CCAAT boxes directly upstream of a TATA box have been shown to efficiently promote transcription activity of the HERV-W LTR (43). Other retroviral LTRs similarly harbor binding sites for many transcription factors (45).

The screening of PcEV loci for putative transcription factor binding sites included key inflammation-related transcription factors: STAT1, NFκB (46), and STAT3 (47). Predicted binding sites for STAT1 and NFκB were not identified, so it is unlikely that STAT1 or NFκB directly regulates PcEV expression through binding to the retroviral promoter. The only inflammation-related transcription factor binding site found was for STAT3 at position 149 (**Figure 3**).

### Low Levels of Plasma PcEV RNA in One-Third of Acutely SIV-infected Macaques

Cell-free PcEV RNA levels were measured by RT-qPCR in plasmas from unchallenged control (naïve) macaques (n = 5) or macaques at or around peak SIV viraemia (SIV+) challenged with one of four different SIV strains: SIVmac239 (n = 5), SIVmac251 (n = 5), SIVmacC8 (n = 6), or SIVsmE660 (n = 6). SIVmacC8 represents a minimally attenuated nef-disrupted variant of wild-type SIVmac251/32H (35), while all others are wild-type SIVs belonging to the SIVmac/SIVsm lineage. All viral challenges were performed in CM, except for SIVmac239, which was conducted in RM. Three naïve animals were CM, two were RM. As shown in **Figure 4**, 7 of 22 macaques (32%) exhibited low but detectable levels of PcEV in their plasma (39; 84; 94; 139; 70; 35; 165 PcEV copies/ml). These seven macaques include instances of infection with each of the four SIV strains. PcEV RNA was not detected in the plasma of naïve macaques, although the small sample size does not enable any inference about the relationship between PcEV detection in plasma and SIV status (Fisher's Exact Test comparing presence vs. absence of PcEV in plasma, from naïve and pooled SIV+ macaques, p=0.27). Stringent protocols were employed for DNase treatment to remove contaminating genomic DNA, including paired +/– RT reaction steps in each experiment (**Figure S1**). Neither PcEV nor SIV were amplified in RT minus reactions (**Table 1**), providing additional confidence that all PcEV molecules originated from an RNA template. Hence, low levels of PcEV RNA were detectable in around one-third of SIV+ plasmas during acute infection.
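The presence/absence comparison above can be illustrated with a two-sided Fisher's exact test written against the Python standard library. This is a sketch, not the authors' analysis: the 2×2 layout (0 of 5 naïve and 7 of 22 SIV+ plasmas positive) is an assumption about how the counts were tabulated, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention, so the output need not match the reported p-value exactly.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: naive, SIV+; columns: PcEV detected, PcEV not detected (assumed layout).
p_value = fisher_exact_two_sided(0, 5, 7, 15)
print(f"two-sided p = {p_value:.3f}")
```

Under this layout the test is far from conventional significance thresholds, in line with the statement that the sample size does not support an inference about SIV status.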

### Variable Baseline Tissue PcEV Transcriptional Activity Is Upregulated During Acute SIV Infection

Levels of detectable cellular PcEV transcripts in a range of tissues targeted by SIV were determined by isolating RNA from PBMC, spleen, thymus, and peripheral and mesenteric lymph nodes (PLN/MLN) of naïve and acutely SIV+ macaques. Data generated for cell-associated PcEV transcriptional activity were compared for each preparation, normalized by co-amplification of the constitutively expressed cellular GAPDH gene with PcEV. Baseline PcEV RNA transcriptional activity was identified in the majority of tissues from all naïve macaques (**Figure 5**), though many were at low levels and none exceeded 60 PcEV RNA copies/1,000 copies GAPDH. In these naïve tissues, basal PcEV transcriptional activity for MLN, thymus, PBMC, spleen and PLN was determined as 19, 20, 25, 33, and 52 PcEV RNA copies/1,000 copies GAPDH, respectively. Differences between these tissue means were not statistically significant (Kruskal-Wallis test, p = 0.07).
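The unit "copies/1,000 copies GAPDH" implies a simple ratio normalization of target copies to reference-gene copies measured in the same RNA preparation. A minimal sketch follows; the function name and the input copy numbers are hypothetical and serve only to show the arithmetic.

```python
def copies_per_1000_gapdh(target_copies: float, gapdh_copies: float) -> float:
    """Express target RNA copies relative to 1,000 copies of GAPDH from the same sample."""
    if gapdh_copies <= 0:
        raise ValueError("GAPDH copy number must be positive")
    return 1000.0 * target_copies / gapdh_copies


# Hypothetical qPCR copy numbers from one tissue preparation.
normalized = copies_per_1000_gapdh(target_copies=250.0, gapdh_copies=10000.0)
print(f"{normalized:.0f} copies/1,000 copies GAPDH")
```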

Mean PcEV levels were higher in tissues recovered from SIV+ macaques during acute infection (10–14 days postinfection) compared to naïve macaques (**Figure 5**). PcEV RNA was significantly increased in PBMC (Mann-Whitney test, p = 0.01) and thymus (p = 0.03), but the increases were not significant in spleen (p = 0.23), MLN (p = 0.38), and PLN (p = 0.20). Differences between tissue means were not statistically significant (Kruskal-Wallis test, p = 0.83). Taken together, the data indicate that SIV infection per se upregulates the expression of PcEV RNA from a variable baseline level of transcriptional activity in a range of tissues targeted by SIV during acute infection.
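For group sizes as small as those used here, a Mann-Whitney comparison can be computed exactly by enumerating every way of assigning the pooled ranks to the two groups. The sketch below does this with the standard library only. It is an illustration of the test, not the authors' software or settings, and the expression values are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_exact(x, y):
    """Return (U statistic for x, exact two-sided p-value) by full enumeration."""
    ranks = midranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    offset = n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    u_obs = sum(ranks[:n1]) - offset
    dev_obs = abs(u_obs - mean_u)
    total = extreme = 0
    for idx in combinations(range(n1 + n2), n1):
        u = sum(ranks[i] for i in idx) - offset
        total += 1
        if abs(u - mean_u) >= dev_obs - 1e-12:
            extreme += 1
    return u_obs, extreme / total


# Hypothetical normalized expression values for one tissue.
naive = [12, 18, 25, 30]
siv_positive = [40, 55, 61, 90, 120]
u_stat, p_exact = mann_whitney_exact(naive, siv_positive)
print(f"U = {u_stat}, exact two-sided p = {p_exact:.4f}")
```

Enumeration is only practical for small groups; for larger samples a normal approximation or a statistics library would be the usual choice.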

#### Cellular STAT1 RNA Levels Are Elevated During Acute SIV Infection

STAT1 represents an independent variable in these studies and its relationship with SIV status and tissue type was determined in the same manner as for PcEV (**Figure 6**). As with PcEV, a global STAT1 induction was observed during acute SIV infection compared to naïve macaques. Considering each tissue separately, STAT1 was significantly increased in PBMC (Mann-Whitney test, p < 0.001), thymus (p < 0.001), and MLN (p = 0.03), but not spleen (p = 0.15) or PLN (p = 0.60). Hence, during acute SIV infection cellular STAT1 levels are elevated, though at variable levels for individual tissues, as SIV infection stimulates a robust immune response.

#### Cell-Associated PcEV Transcription Levels Are Correlated to STAT1 in SIV+ Spleens

Finally, expression levels of the two genes were compared in the same RNA preparations from each tissue sample. **Figure 7** shows tissue-specific patterns identifying the relationship between localized PcEV and STAT1 expression. Cellular PcEV and STAT1 RNA levels were strongly correlated in the spleen of SIV+ macaques (Spearman test, r = 0.90, p = 0.005) and showed a positive though statistically non-significant relationship in spleens derived from naïve macaques (r = 0.68, p = 0.11). Two SIV+ macaques (H19 and H20) with elevated splenic PcEV levels were PcEV+ in plasma, but the macaque with the highest level (G24) was PcEV RNA- in plasma. Four PBMC samples from macaques that were PcEV+ in plasma (including H19 and H20) exhibited a wide range of PcEV transcription levels.

STAT1 and PcEV levels were not significantly correlated in the other tissues but most relationships were positive. Pooling all tissue data from the 22 SIV+ macaques (a total of 30 samples) yielded a significant positive correlation (Spearman test, r = 0.47; p = 0.0091). Pooled PcEV and STAT1 levels in the 13 naïve macaques were also significantly correlated (r = 0.61; p = 0.0004), albeit at a lower magnitude of induction for both PcEV and STAT1. Such correlations in pooled data might be artifactual but it is interesting that no such correlation was found between STAT1 and cellular SIV levels in these total RNA preparations (r = 0.29, p = 0.21); there was also no correlation between STAT1 and plasma SIV RNA levels (r = 0.25, p = 0.22).
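The Spearman coefficients reported in this section are Pearson correlations computed on ranks. A minimal standard-library sketch of the coefficient is shown below; the paired values are hypothetical and are not data from this study, and the sketch does not compute a p-value.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical paired measurements from five samples of one tissue.
pcev_levels = [5, 12, 20, 33, 60]
stat1_levels = [2.0, 1.1, 1.8, 4.5, 9.0]
rho = spearman_rho(pcev_levels, stat1_levels)
print(f"Spearman rho = {rho:.2f}")
```

Because the coefficient depends only on rank order, it is insensitive to the large differences in absolute expression between tissues, which is one reason pooled correlations need cautious interpretation.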

TABLE 1 | PcEV and SIV RNA levels in plasma as determined by qPCR with (+) and without (–) reverse transcriptase (RT).


*Species of macaque is indicated; CM, cynomolgus macaque; RM, rhesus macaque. PcEV RNA levels are expressed as simple integers; SIV RNA levels use a power of ten notation. Ct, cycle threshold.*

*a In CM.*

*b In RM.*

### DISCUSSION

The transcriptional activity of an ERV in macaques has been investigated using a combination of bioinformatic prediction analyses and biological measures of in vivo activity, exploiting archived biological materials collected from previous macaque challenge studies. PcEV expression was upregulated in several key lymphoid tissues of acutely SIV-infected macaques, when de novo immune responses are likely to be highest. In the spleen PcEV expression was positively correlated to cellular STAT1 RNA levels. Analysis of baseline data indicated low, variable cell-associated PcEV RNA levels in the absence of a clear external stimulus, such as SIV. Introduction of SIV induced PcEV in all tissues measured, attaining statistical significance in the thymus and PBMC. While caution needs to be exercised in not overinterpreting our findings given a number of caveats (e.g., inherent variations in SIV strain dynamics, tissues targeted during acute infection, macaque species), these data clearly suggest a global upregulation of PcEV transcriptional activity in response to acute SIV infection.

FIGURE 5 | SIV infection upregulates PcEV expression. Comparison of PcEV expression in PBMC, spleen, thymus, MLN, and PLN. Open symbols represent naïve macaques and closed symbols represent SIV+ macaques. Each symbol represents the number of PcEV RNA copies from a single tissue from a single individual. Means for each group are indicated as horizontal bars with *p*-values given from a Mann-Whitney comparison of the ranked expression levels between naïve and SIV+ macaques.

It is widely recognized that SIV establishes productive infection in multiple lymphoid tissues during the primary viraemic phase. STAT1, part of the IFN-1 cascade, represents one component of the dynamic response to SIV, triggering a range of innate immunity genes during acute infection (35), but this response fails to clear the virus in the long term. This study confirmed induction of STAT1 in response to SIV and highlighted tissue-specific differences. In the spleen, PcEV and STAT1 RNA levels were positively correlated but this relationship was less clear in PBMC, thymus and MLN and warrants further scrutiny, especially where small sample sizes precluded a comprehensive analysis. With both PcEV and STAT1, exposure to SIV per se appears sufficient to trigger a response independent of the level of viral replication.

Although little is known about the macaque, in the mouse model it is well-established that microbial triggering of innate and adaptive sensors leads to increased ERV transcription (51). Treatment with both lipopolysaccharide (LPS), a Toll-like receptor 4 (TLR4) agonist, and polyinosinic-polycytidylic acid [poly(I:C)], a TLR3 agonist, significantly induced expression of two mouse ERV proviruses (52). Examples of human ERVs whose expression is enhanced in the presence of IFN or viral infection include HERV-K(HML2), reported to be increased in PBMC from HIV-1-infected individuals (13, 53, 54), and HERV-W, reported to be increased by influenza virus infection in vitro (55, 56). The HERV-K(HML2) LTR has two interferon-stimulated response elements (ISREs) and binding sites for inflammatory transcription factors such as NFκB (57, 58). PcEV lacks such motifs and the absence of clear STAT1 binding sites suggests that STAT1 proteins do not directly drive PcEV transcription. However, prediction of transcription factor binding sites is complex, and relaxing the dissimilarity margin of the prediction from 5% to 15% also assigns the STAT3 binding sites in PcEV to NFκB and STAT1. STAT3 is activated at the protein level by IFN-1 (59) and binding motifs in the PcEV LTR for CEBPβ, whose RNA expression is elevated by type-II IFN (60), were identified in the sequence analyses. Alternatively, PcEV might be upregulated by accessory proteins of exogenous retroviruses, as HIV-1 Tat increases HERV-K(HML2) transcription (61, 62). Another possibility is inflammation-induced demethylation of LTRs: the HIV-1 5′LTR undergoes demethylation following administration of TNFα in vitro and of LPS to transgenic mice harboring latent HIV genomes in their lymphocytes (63, 64).

With regard to how PcEV might impact on innate immune signaling pathways, two main mechanisms can be envisaged (26). The first, via an exogenous route, implies binding of ERV particles or glycoproteins to innate receptors such as TLRs, as demonstrated in vitro where recombinant Env glycoprotein from HERV-W triggers secretion of pro-inflammatory cytokines via TLR4 in PBMC (65). The second, via an endogenous route, is that ERV transcription and reverse transcription may generate intermediate reaction products, such as dsRNA (possibly formed by transcription and hybridization of ERV sense and antisense transcripts) and RNA:DNA hybrids generated during reverse transcription, which are detected by TLRs and cytosolic RIG-I-like receptors, such as MDA5 (18–20). Only low PcEV RNA levels were detected in plasma from around one-third of SIV+ macaques, suggesting PcEV particles are unlikely to be produced in large quantities. It seems plausible, however, that PcEV might trigger induction mechanisms through transiently generated and released viral components, a possibility supported by the finding of full-length ORFs for the env gene in several loci. In humans, ERV-encoded reverse transcriptase has been suggested to reverse transcribe viral RNA intracellularly (66, 67). In autoimmune TREX1-deficient mice, ERV DNA accumulates in the cytoplasm, associated with inflammatory phenotypes (23) which can be treated using RT inhibitors (68). Hence, the correlation between PcEV and STAT1 expression in this in vivo model requires further clarification.

macaques and closed circles SIV+ macaques. Large closed symbols represent SIV+ macaques in whose plasma PcEV RNA was detected. Statistical analysis of SIV+ and naïve macaques are indicated (Spearman correlation analysis). Individual animals are labeled if they appear in more than one plot.

One study limitation is that we restricted our analyses to data from only a single time point. This might not reveal any delayed correlations between PcEV, SIV, and STAT1 levels, especially given differences in cellular and anatomical localization of SIV (35). Cell sorting experiments to derive specific cell sub-sets would help address this issue and refine analyses. Differences in viral kinetics and tropism might also play a confounding role, although all SIV strains studied, even the minimally attenuated SIVmacC8, exhibit robust early replication kinetics in vivo as determined by plasma viral RNA levels 10–14 days postinfection.

To assess the relevance of the macaque as a model system, similarities and differences between human and macaque ERVs require consideration. Humans have a single ERV lineage that has been replicating within the last few million years: HERV-K(HML2). Early studies reported HERV-K(HML2) RNA in plasma from HIV patients (53), but more recent studies failed to reproduce this (13, 14). Evidence for low level PcEV RNA was identified in the plasma of a proportion of SIV+ macaques. Detection of retroviral RNA in plasma presumes some form of protein protection, whether as fully or partially packaged particles, so some macaques in the population may harbor a replication-competent PcEV locus. Although we did not find any potentially replication-competent PcEV in the reference macaque genome sequences, this may reflect a combination of genome assembly problems with the likelihood that such loci would be unfixed in the macaque population, i.e., present only in the genomes of some individual macaques (42). Evidence for at least two unfixed PcEV loci was found. This phenomenon occurs in HERV-K(HML2), where loci with full-length ORFs in all genes are present only in some non-reference genomes (69, 70). Replication-competent PcEV loci could also be generated by recombination: in the mouse model, recombination between replication-defective loci reconstituted replication-competent loci with resulting infectious virions (71). Although macaques might differ from humans in having as yet unrecovered replication-competent PcEV loci, levels of cell-associated HERV-K(HML2) transcription appear similar: ∼30 copies/1,000 copies GAPDH in PBMC of non-acute HIV infection cases and ∼15 copies/1,000 copies GAPDH in PBMC from uninfected individuals (13).

Assuming SIV induces PcEV expression in vitro, knockdown of PcEV transcription and individual sensors followed by measurement of the IFN-1 response would determine whether PcEV is involved in the IFN-1 response. The other two recently active ERV lineages in the macaque, Chimpanzee Endogenous Retrovirus (CERV) and Simian Endogenous type D Retrovirus (SERV) (25), also require study since within both lineages single examples of loci with full-length ORFs in all genes were identified. Another important group of retroelements, Long Interspersed Nuclear Elements (LINEs), also requires scrutiny because they, unlike ERVs, are known to be copying in the human population (72) and potentially replication-competent loci have been found in the macaque genome (73). LINEs have an entirely intra-cellular replication cycle that would help in distinguishing between endogenous and exogenous routes of innate triggering. Considering several ERVs and LINEs together would provide a wider perspective on whether retroelements are involved in fundamental immune response induction.

In conclusion, the current study demonstrates that PcEV retains transcriptional activity despite the apparent lack of replication. Expression appears to be upregulated by acute SIV infection and in some tissues a correlation exists between the expression levels of PcEV and STAT1, the latter being part of the IFN-1 cascade. The mechanistic processes at play warrant detailed investigation to determine whether ERVs are overlooked components of innate immune responses and activation signaling mechanisms.

#### MATERIALS AND METHODS

#### Extraction and Alignment of PcEV Locus Sequences

Bioinformatic analyses were performed using UCSC's Genome Browser website and its BLAT tool (74) on the most recent reference genome sequence of the RM, rheMac8 (Mmul\_8.0.1). This assembly derives from short-read Illumina whole genome sequencing (MacaM) (75) with later closing of some gaps using long-read PacBio sequencing of the same individual animal (comment in GenBank accession GCA\_000772875.3).

Sequences of loci were compared with those from an earlier build of the RM genome, rheMac2 (76) (GenBank accession GCA\_000002255.2). This build is an earlier short-read whole genome sequencing project largely from the same (female) individual (ID 17573), with some sequencing of an unrelated male used in finishing and to provide a Y chromosome. Sequences were also compared to those of a CM, a closely related species sharing a common ancestor only 0.91 (±0.11) million years ago, which is more recent than the common ancestor of all rhesus macaques (38). The reference genome sequence of the CM, macFas5, appears to have been built independently of the RM genome using a range of short-read technologies (comment in GenBank accession GCA\_000364345.1). The other RM genome assembly available on the UCSC website, rheMac3, is very fragmentary.

PcEV was described initially from the baboon (24) (GenBank reference AF142988), erroneously referred to as BaEV in an earlier analysis of the macaque genome (25) (see **Supplementary Information**). This GenBank sequence was used to locate and download the sequences of several PcEV loci in the macaque using the UCSC Genome Browser. A multiple alignment was then made of the downloaded sequences using MEGA (77, 78) and a new consensus reference sequence built with full-length ORFs in all genes (sequences from different loci differed only by a few percent so alignments were unambiguous). This reference was used to re-search the macaque genome. Multiple alignments were made manually from downloaded sequences and individual loci examined visually for full-length ORFs and other motifs.

Previously published data on recently active macaque ERV lineages (11, 25) are presented here in more detail (**Figures S2–S4**).

## Prediction of Transcription Factor Motifs

A consensus PcEV LTR sequence was built based on an alignment of LTR sequences from the loci displayed in **Figure 1**, excluding the two much older loci, plus NC\_022517 (36) and AF142988 (24). This sequence was screened with the ALGGEN-PROMO online tool (79, 80) for the presence of putative binding sites for a range of transcription factors linked to either inflammation or retroviral infection (43, 44). The dissimilarity between query and binding site matrices from the database was fixed at a threshold of 5%. The following transcription factors and binding sites were selected for NHPs only, with the accession number in the TRANSFAC public database given in square brackets: AP1 [T00029]; NFI/CTF [T00094]; C/EBPα [T00105]; NFY [T00150]; CREB [T00163]; cRel [T00168]; CTF [T00174]; GATA1 [T00306]; GATA2 [T00308]; GATA3 [T00311]; GRα [T00337]; NF1 [T00539]; NFAT1 [T00550]; C/EBPβ [T00581]; C/EBP1 [T00583]; NFκB [T00590]; NFκB1 [T00593]; RelA [T00594]; POU2F1 [T00641]; Sp1 [T00759]; TBP [T00794]; YY1 [T00915]; STAT1α [T01492]; STAT3 [T01493]; STAT1β [T01573]; NFYA [T01804]; GRβ [T01920]; NFAT2 [T01945]; NFAT1 [T01948]; STAT1 [T04759]; GR [T05076].

#### Dating the Integration of ERV Loci

Integrations were dated using the nucleotide divergence between the two LTRs of the proviruses. These LTRs form the flanks of the provirus (the complete integrated DNA form of a retrovirus) and are identical at the time of integration, subsequently accumulating substitutions at the host background rate. The following equation was used to calculate locus age (81): integration time = sequence distance between 5′ and 3′ LTR/(rate in 5′ LTR + rate in 3′ LTR).

Thus,

$$\text{estimated age} = \frac{a/b}{2r}$$

where

a = number of mismatches between the two LTRs,

b = length of the LTR,

r = estimated rate of nucleotide substitution.

Note that the rate is multiplied by two because substitutions will have occurred along the two branches leading to the 5′ and the 3′ LTRs; alternatively, the LTR length can be multiplied by two to the same effect. The rate of nucleotide substitution was taken as 1.0 × 10<sup>−9</sup> substitutions/nucleotide/year based on analyses of primate ERV sequences (25). This value is about half the rate estimated for mammals generally (82, 83) but is close to estimates for the rate of neutral molecular evolution in Old World Monkeys and Apes (38).
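The dating calculation can be illustrated with a minimal Python sketch. The function name and the example locus (3 mismatches over a 500-nt LTR) are hypothetical and chosen only for illustration; the default rate is the value quoted in the text.

```python
def ltr_age_years(mismatches, ltr_length, rate=1.0e-9):
    """Estimate provirus age from the divergence between its two LTRs.

    Implements age = (a / b) / (2 * r): both LTRs accumulate substitutions
    independently, so pairwise distance grows at twice the per-lineage rate.
    """
    if ltr_length <= 0:
        raise ValueError("ltr_length must be positive")
    if mismatches < 0:
        raise ValueError("mismatches cannot be negative")
    distance = mismatches / ltr_length  # a / b
    return distance / (2.0 * rate)


# Hypothetical locus (illustrative values, not taken from the study).
print(f"{ltr_age_years(3, 500):.3g} years")
```

Doubling the rate in the denominator is equivalent to doubling the LTR length, which is the alternative formulation mentioned in the text.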

For loci with no substitutions within their LTRs, the probability ($P_0$) of observing this if the locus had integrated at time "age" can be found from the Poisson distribution with $P_0 = e^{-\lambda}$, where $\lambda$ is the average number of events in the time period. Thus, using the notation above, $P_0 = e^{-2 \cdot b \cdot r \cdot \text{age}}$, and the probability of having no substitutions in the LTRs is just <0.5 at age 700,000 years.
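As a worked illustration of this Poisson argument, the sketch below evaluates the zero-substitution probability and inverts it to find the age at which that probability falls to a chosen level. The LTR length of 500 nt is an assumption made here for illustration (the length is not stated in this passage); with that value the probability at 700,000 years comes out just below 0.5, in line with the statement above.

```python
import math


def prob_no_substitutions(ltr_length, age_years, rate=1.0e-9):
    """P0 = exp(-lambda), with lambda = 2 * b * r * age expected substitutions."""
    lam = 2.0 * ltr_length * rate * age_years
    return math.exp(-lam)


def age_at_probability(ltr_length, p0, rate=1.0e-9):
    """Age at which the probability of seeing no substitutions equals p0."""
    return -math.log(p0) / (2.0 * ltr_length * rate)


ASSUMED_LTR_LENGTH = 500  # hypothetical value, not given in the text

print(prob_no_substitutions(ASSUMED_LTR_LENGTH, 700_000))
print(age_at_probability(ASSUMED_LTR_LENGTH, 0.5))
```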

#### Study Population and RNA Preparation

Plasma and cells were derived from acutely SIV-infected Indian RM and CM challenged with different SIV isolates in historical studies of SIV pathogenesis and vaccination (35, 48, 50, 84). Total RNA was extracted from 140 µL to 1 mL of plasma using QIAamp (Qiagen Ltd) according to the manufacturer's instructions. Plasmas obtained from CM were extracted from a volume of 140 µL; the four RM plasmas were extracted from 1 mL of plasma. Each RNA preparation underwent a DNase step to remove contaminant DNA and all qPCR samples were run with an RT minus control, an essential step when working with ERVs. Healthy human plasma contains about 1,000 genome copies/mL, in the form of cell-free DNA (85), with many ERV copies in every cell. By contrast, exogenous retroviruses have only a single DNA copy in a small minority of PBMC (86), so the danger of DNA contamination is greatly reduced. For quantification of cell-associated transcripts, RNA was extracted from frozen cells, washed once in PBS and lysed with guanidine isothiocyanate (Sigma Ltd). Chloroform (200 µL) was added followed by centrifugation at 13,000 rpm for 2 min; the aqueous phase was collected, 1 volume of 100% ethanol added and samples loaded onto an RNeasy silica column (Qiagen Ltd). Columns were washed and RNA eluted according to the manufacturer's instructions to a final volume of 50 µL and quantitated using a NanoDrop spectrophotometer. Each RNA preparation was diluted to 10 ng/µL, and 50 ng (5 µL) added to each qPCR reaction.

#### Primers and Probe Sequences

Primer sequences for qPCR are shown below in 5′ -3′ orientation. PcEV primers were based on an alignment of PcEV pro-pol sequences (**Figure S5**) designed to anneal to all loci. SIV primers are based on conserved regions in gag (49) except for SIVsmE660 based on conserved LTR sequences (50). PcEV, GAPDH, and STAT1 sequences were as follows:

PcEV: CCGTGTCTATCAAGCAATATCC (forward), GGCAGAAGAGGAGTGTTCCAGG (reverse) and AACTCGGAGTGTTGCGAC (probe).

GAPDH: GGCTGAGAACGGGAAGCTC (forward), AGGGATCTCGCTCCTGGAA (reverse), and TCATCAATGGAAGCCCCATCACCA (probe).

STAT1: CAATACCTCGCACAGTGGTTAGAAAA (forward) and CGGATGGTGGCAAATGAAAC (reverse).

SIVsmE660: CTCCACGCTTGCTTGCTTAA (forward), AGGGTCCTAACAGACCAGGG (reverse), and TCCCATATCTCTCCTAGYCGCCGC (probe).

Other SIVs: AGTGCCAACAGGCTCAGAAAA (forward), TGCGTGAATGCACCAGATG (reverse) and TTAAAAAGCCTTTATAATACTGTCTGCG (probe).

#### Plasmids and in vitro Transcription

pBluescript II SK+ (pBS, 3.0 kb) was from Stratagene Ltd (1 µg/µL). pBS-PcEV was derived separately by inserting a PcEV amplicon into the multiple cloning site of a pBS vector using ApaI (nt 604) restriction sites (GENEWIZ, Inc). pBS-PcEV was linearized by restriction with HindIII at a site located 16 nt downstream of the PcEV sequence. Reaction mixes included 1 µg of pBS-PcEV plasmid (0.2 µg/µL), 2 U/µL final concentration of enzyme and 15 µL of water/enzyme buffer. Linearization was at 37°C for 1 h. In vitro transcription and purification were carried out using the MAXIscript kit (Ambion Ltd) following the manufacturer's instructions.

#### Standard Curves

PcEV transcripts were DNase-treated and 10-fold serial dilutions made. PcEV copy number was determined by Poisson end-point calculations, with Ct values obtained for each plasma sample compared to the standard curve to estimate number of PcEV copies per reaction (for 5 µL total RNA input). The number of PcEV copies/mL for each plasma sample was determined taking into account extraction starting volume. SIV standards used were derived from a SIVmac251 plasma RNA reference series (49, 87). For intracellular RT-PCR assessments, standards for GAPDH and STAT1 were as previously described (35).
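The interpolation from Ct value to copies per reaction, and the scaling to copies/mL, can be sketched as follows. This is an illustrative outline only, not the authors' analysis pipeline: the dilution series, Ct values, and the eluate volume are hypothetical, and the function names are introduced here for the example. The 5 µL input and 0.14 mL plasma volume follow the text.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_per_reaction(ct, slope, intercept):
    """Invert the standard curve to obtain template copies in one reaction."""
    return 10 ** ((ct - intercept) / slope)


def copies_per_ml(per_reaction, eluate_ul, input_ul, plasma_ml):
    """Scale copies per reaction to copies per mL of extracted plasma."""
    return per_reaction * (eluate_ul / input_ul) / plasma_ml


# Hypothetical 10-fold dilution series and Ct values (illustrative only).
standards = [1e2, 1e3, 1e4, 1e5, 1e6]
cts = [33.4, 30.1, 26.7, 23.4, 20.1]
slope, intercept = fit_standard_curve(standards, cts)
sample = copies_per_reaction(28.0, slope, intercept)
# eluate_ul=60.0 is an assumed elution volume, not a value from the text.
print(round(copies_per_ml(sample, eluate_ul=60.0, input_ul=5.0, plasma_ml=0.14)))
```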

#### DNase Treatment of RNA Preparations

Removal of DNA contaminant was conducted using the DNA-free kit (Ambion Ltd). 5 µL of 10X DNase buffer was added to 50 µL of RNA preparation. 2 µL of TURBO DNase was added and incubated for 1 h at 37°C. 10 µL of Inactivation Reagent was added and incubated for 5 min with intermittent flicking of tubes. The inactivation reagent was pelleted by centrifugation at 10,000 g for 2 min. Supernatant containing RNA was harvested, then assayed or stored at −80°C for further analysis. In the absence of DNase treatment, similar copy numbers in RT+ and RT− reactions were noted: H17 (RT+ = 1.18E+04; RT− = 1.27E+04); H18 (RT+ = 7.06E+03; RT− = 9.12E+03); H19 (RT+ = 2.39E+04; RT− = 2.37E+04); H20 (RT+ = 6.96E+03; RT− = 8.99E+03). Observed levels of DNA contamination are consistent with theoretical expectations: assuming macaques have levels of cell-free DNA in plasma similar to those of humans, and given that at least 50 intact PcEV copies exist in each macaque cell, at least 5 × 10<sup>4</sup> PcEV DNA copies would be expected in plasma.

#### qPCR and qRT-PCR

PcEV and SIV transcripts in plasma were quantified with 5 µL of DNase-treated RNA per qPCR reaction, assayed in triplicate using the RNA Ultrasense One-Step Quantitative RT-PCR kit (Invitrogen Ltd). For PcEV, qPCR reactions consisted of 25.9 µL nuclease-free water, 1 µL forward and reverse primers (5 µM each, final concentration: 100 nM), 2.5 µL PcEV-specific Taqman probe (10 µM), 10 µL of 5X master mix, 0.1 µL ROX reference dye, and 2.5 µL Superscript III RT and Platinum Taq polymerase enzyme mix. Cycling parameters were 48°C for 30 min (RT step) and 95°C for 5 min, followed by 40 cycles of 95°C for 15 s and 61°C extension for 1 min. For SIV, the mix contained 25.9 µL nuclease-free water, 1 µL each forward and reverse primer (5 µM each, final concentration: 100 nM), and 2.5 µL SIV-specific probe (10 µM). Cycling parameters were 52°C for 60 min (RT step) and 95°C for 10 min, followed by 40 cycles of 95°C for 30 s and 61°C extension for 90 s.
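The stated final primer concentration can be checked with a simple dilution calculation. The sketch below assumes a 50 µL total reaction volume, inferred here from the use of 10 µL of 5X master mix rather than stated explicitly in the text; the function name is illustrative.

```python
def final_concentration_nm(stock_um, volume_ul, reaction_ul):
    """Final concentration (nM) after diluting a stock (µM) into a reaction."""
    if reaction_ul <= 0:
        raise ValueError("reaction_ul must be positive")
    return stock_um * 1000.0 * volume_ul / reaction_ul


ASSUMED_REACTION_UL = 50.0  # assumption: 10 µL of 5X master mix implies 50 µL

# 1 µL of a 5 µM primer stock in the assumed 50 µL reaction.
print(final_concentration_nm(5.0, 1.0, ASSUMED_REACTION_UL))
# 2.5 µL of a 10 µM probe stock in the same reaction.
print(final_concentration_nm(10.0, 2.5, ASSUMED_REACTION_UL))
```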

For quantification of cell-associated transcripts, each RNA preparation was diluted to 10 ng/µL, so that 50 ng (5 µL) was added in each qPCR reaction. For PcEV and SIV, qPCR was conducted as stated above. For GAPDH, the mix contained 25.9 µL nuclease-free water, 1 µL each forward and reverse primer (5 µM each, final concentration: 100 nM), and 2.5 µL GAPDH-specific probe (10 µM). Cycling parameters were 52°C for 60 min (RT step) and 95°C for 10 min, followed by 40 cycles of 95°C for 30 s and 61°C extension for 90 s. For STAT1, the Power SYBR Green RNA-to-Ct 1-step kit was used (Applied Biosystems Ltd). The mix consisted of 15.6 µL nuclease-free water, 4 µL each forward and reverse primer (10 µM each, final concentration: 400 nM), 25 µL 2X Master Mix (containing SYBR Green and polymerase) and 0.4 µL RT-enzyme mix. Cycling parameters were 48°C for 30 min (RT step) and 95°C for 10 min, followed by 40 cycles of 95°C for 15 s and 60°C extension for 1 min. The melt curve was 95°C for 15 s, 60°C for 15 s and 95°C for 15 s. RNA preparations were confirmed to lack detectable DNA by performing a parallel quantitative PCR lacking reverse transcriptase for PcEV, SIV and GAPDH on all RNA extracts. In the RNA Ultrasense system, the enzyme mix was replaced by 0.2 µL Platinum Taq polymerase (10 U/µL) (Invitrogen Ltd).


#### Statistical Analysis

GraphPad Prism version 6.01 was used for all statistical analyses. The Mann-Whitney test was performed for comparing nonpaired samples in naïve vs. SIV+ groups; the Kruskal-Wallis test was used to compare expression levels between macaques infected with different strains of SIV. Since most datasets had distributions that were significantly different from normal, correlation coefficient r and p-values are presented from a non-parametric Spearman test. Fisher's Exact Test was used to assess whether SIV is associated with PcEV in the plasma of SIV+ animals.
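For readers unfamiliar with the Spearman test, the coefficient is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a pure-Python illustration of that definition only; it is not the GraphPad Prism procedure used in the study, it does not compute p-values, and the input values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical expression values (illustrative only).
print(spearman_r([12.0, 30.0, 7.5, 18.0], [0.8, 2.1, 0.5, 0.9]))
```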

### ETHICS STATEMENT

All biological samples used in this study were derived from previously conducted experiments, which received the appropriate ethical approval from the local NIBSC ethical committee. Animal procedures were performed in strict accordance with UK Home Office guidelines under a license granted by the Secretary of State for the Home Office which approved the work described. These previous experiments have been published (35, 48, 81–83, 85) along with full details of experimental design, procedures and ethical approval.

## AUTHOR CONTRIBUTIONS

EM, NB, and RB conceived and designed the experiments. EM and CH performed the experiments. RB, JK, and LU performed the bioinformatic analyses. EM, NA, GT, NB, and RB wrote the paper. EM and RB carried out the statistical analyses. All authors approved the final draft of the paper.

### FUNDING

The project was supported by NIBSC and a University of Plymouth Ph.D. studentship.

#### ACKNOWLEDGMENTS

We are grateful to Giada Mattiuzzo (NIBSC, Division of Virology) for providing GAPDH and STAT1 standards. Amelia Ellisson and David Camp helped with bioinformatic analysis of PcEV loci.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2019.00901/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Maze, Ham, Kelly, Ussher, Almond, Towers, Berry and Belshaw. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Pockets of HIV Non-infection Within Highly-Infected Risk Networks in Athens, Greece

Leslie D. Williams <sup>1</sup> , Evangelia-Georgia Kostaki <sup>2</sup> , Eirini Pavlitina<sup>3</sup> , Dimitrios Paraskevis <sup>2</sup> , Angelos Hatzakis <sup>2</sup> , John Schneider <sup>4</sup> , Pavlo Smyrnov <sup>5</sup> , Andria Hadjikou<sup>6</sup> , Georgios K. Nikolopoulos <sup>7</sup> , Mina Psichogiou<sup>8</sup> and Samuel R. Friedman<sup>1</sup> \*

*1 Institute for Infectious Disease Research, National Development and Research Institutes, New York, NY, United States, <sup>2</sup> Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, Athens, Greece, <sup>3</sup> Transmission Reduction Intervention Project, Athens, Greece, <sup>4</sup> Departments of Medicine and Public Health Sciences, University of Chicago Medical Center, Center for AIDS Elimination, Chicago, IL, United States, <sup>5</sup> Alliance for Public Health, Kyiv, Ukraine, <sup>6</sup> Medical School, University of Cyprus, Nicosia, Cyprus and European University Cyprus, Nicosia, Cyprus, <sup>7</sup> Medical School, University of Cyprus, Nicosia, Cyprus, <sup>8</sup> First Department of Internal Medicine, Laikon General Hospital, Medical School, National and Kapodistrian University of Athens, Athens, Greece*

#### Edited by:

*Tara Patricia Hurst, Abcam, United Kingdom*

#### Reviewed by:

*Ellsworth Marvin Campbell, Centers for Disease Control and Prevention (CDC), United States Taisuke Izumi, Henry M. Jackson Foundation, United States*

> \*Correspondence: *Samuel R. Friedman friedman@ndri.org*

#### Specialty section:

*This article was submitted to Virology, a section of the journal Frontiers in Microbiology*

Received: *06 May 2018* Accepted: *23 July 2018* Published: *24 August 2018*

#### Citation:

*Williams LD, Kostaki E-G, Pavlitina E, Paraskevis D, Hatzakis A, Schneider J, Smyrnov P, Hadjikou A, Nikolopoulos GK, Psichogiou M and Friedman SR (2018) Pockets of HIV Non-infection Within Highly-Infected Risk Networks in Athens, Greece. Front. Microbiol. 9:1825. doi: 10.3389/fmicb.2018.01825*

As part of a network study of HIV infection among people who inject drugs (PWID) and their contacts, we discovered a connected subcomponent of 29 uninfected PWID. In the context of a just-declining large epidemic outbreak, this raised a question: What explains the existence of large pockets of uninfected people? Possible explanations include "firewall effects" (Friedman et al., 2000; Dombrowski et al., 2017) wherein the only HIV+ people that the uninfected take risks with have low viral loads; "bottleneck effects" wherein few network paths into the pocket of non-infection exist; low levels of risk behavior; and an impending outbreak. We considered each of these. Participants provided information on their enhanced sexual and injection networks and assisted us in recruiting network members. The largest connected component had 241 members. Data on risk behaviors in the last 6 months were collected at the individual level. Recent infection was determined by LAg (Sedia™ Biosciences Corporation), data on recent seronegative tests, and viral load. HIV RNA was quantified using Artus HI Virus-1 RG RT-PCR (Qiagen). The 29 members of the connected subcomponent of uninfected participants were connected (network distance = 1) to 17 recently-infected and 24 long-term infected participants. Fourteen (48%) of these 29 uninfected were classified as "extremely high risk" because they self-reported syringe sharing and had at least one injection partner with viral load >100,000 copies/mL who also reported syringe sharing. Seventeen of the 29 uninfected were re-interviewed after 6 months, but none had seroconverted.
These findings show the power of network research in discovering infection patterns that standard individual-level studies cannot. Theoretical development and exploratory network research studies may be needed to understand these findings and deepen our understanding of how HIV does and does not spread through communities. Finally, the methods developed here provide practical tools to study "bottleneck" and "firewall" network hypotheses in practice.

Keywords: networks, HIV transmission, non-infection, HIV risk, firewall effects, bottleneck effects

## INTRODUCTION

We present a case study of a large sub-network of non-infection that we encountered during the Transmission Reduction Intervention Project (TRIP). TRIP traced the injection and sexual networks of recently-infected people in a successful attempt to recruit and intervene with additional recently-infected people to get them into treatment both to protect their health and to reduce their transmitting HIV to others during the early infection period of high viral load. In the course of this project, we discovered a large, connected sub-component of 29 uninfected people within a larger network that contained many recently-infected members.

This paper explores how such a large connected "pocket" of the uninfected could exist. It considers three possible explanations for the existence of such a sub-network:


The paper also describes the network location and risk links among members of the sub-network of uninfected participants.

## MATERIALS AND METHODS

Research methods have previously been described (Nikolopoulos et al., 2016, 2017), so we do so only briefly here.

Setting: The study took place (6/2013–7/2015) in Athens, Greece, where an HIV outbreak among people who inject drugs (PWID) began in 2011 (Paraskevis et al., 2011, 2013, 2015; Nikolopoulos et al., 2015b).

Laboratory Methods: HIV testing used a microparticle anti-HIV-1/2 EIA (AxSYM HIV-1/2 gO, Abbott) confirmed by Western Blot (MP Diagnostics). All HIV+ participants were tested by Limiting Antigen Avidity Assay (LAg; Sedia™ Biosciences Corporation) (Duong et al., 2012, 2015; Nikolopoulos et al., 2017). This test uses antibody maturation to categorize HIV infection as "recent" or "longstanding." A normalized optical density (ODn) score of 1.5 was used as the cut-off, with a median of three ODn values ≤1.5 indicating recent infection. This corresponds to a window period of 130 days (Duong et al., 2015). HIV RNA was quantified for all HIV-positive samples with Artus HI Virus-1 RG RT-PCR (Qiagen). Antibody-negative samples in social networks were tested for viremia (and thus acute infection) in pools of 10.
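The LAg cut-off rule described above can be expressed as a short sketch. The function and constant names are illustrative, and the ODn values are hypothetical; the study's final classification also drew on testing history and viral load, which this sketch does not model.

```python
from statistics import median

LAG_ODN_CUTOFF = 1.5  # cut-off stated in the text


def classify_lag(odn_values):
    """Classify infection recency from three LAg ODn readings.

    A median ODn at or below the cut-off is read as "recent";
    anything above it as "longstanding".
    """
    if len(odn_values) != 3:
        raise ValueError("expected exactly three ODn values")
    if median(odn_values) <= LAG_ODN_CUTOFF:
        return "recent"
    return "longstanding"


# Hypothetical readings (illustrative only).
print(classify_lag([1.2, 1.6, 1.4]))
print(classify_lag([1.4, 1.9, 2.3]))
```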

#### Questionnaire

Participants were interviewed using a questionnaire containing items on demographics, sexual and injecting behaviors, drug treatment, and antiretroviral treatment. A main focus of this interview was to collect information on participants' network members and the venues where they interacted with risk network members to enable network and venue recruitment. Participants were asked to name people they had injected or had sex with in the prior 6 months; people who injected or had sex in their presence in the prior 6 months; and people who injected, used drugs, or had sex with people the participants had injected or had sex with. They were also asked about places they usually visit to use drugs, to have sex, or to meet new sex partners. We worked with respondents to make a list of sex or drug injecting venues, and staff visited venues to recruit participants for the appropriate arms of the study.

#### Recruitment

TRIP used social network tracing and venue recruitment methods to locate those who had recently been infected. These methods have been shown to be able to locate infections downstream, upstream, and sideways across infection chains (Friedman et al., 2014). To be eligible for the study, all participants had to be 18 years or older and able to answer the questionnaire.

Recently-infected participants in TRIP were people who were very likely to have acquired HIV in the past 6 months, and included both original participants who were first enrolled in the project and whose networks were subsequently traced ("seeds"), and their network contacts. Long-term infected individuals were TRIP participants (both seeds and their contacts) who had probably been infected more than 9 months ago. The classification of participants as either recently- or longer-term-infected was based on HIV testing histories, LAg ODn values, and viral load levels.

#### Seeds

More specifically, seeds were newly-HIV-diagnosed PWID referred to the study by the allied ARISTOTLE project (Sypsa et al., 2014, 2017; Hatzakis et al., 2015) or other testing facilities. Seeds with LAg ODn ≤1.5 and no indication of advanced disease were classified as recently infected. Most recently-infected seeds also had documented seroconversion in the prior 6 months. Seeds with LAg ODn >1.5, and without documented seroconversion in the last 6 months, were classified as long-term infected. They were matched to recently infected seeds for age (±5 years), risk group, and gender. Many had tested positive for HIV >3 months before their participation in TRIP but learned about their infection shortly before their TRIP baseline interview.

#### Network Tracing

The named network and venue members of recently- and longer-term infected seeds were recruited as follows. We recruited injection and sex partners, and other risk environment contacts, for two steps (i.e., the Step 1 network members recruited directly by the seed, and the Step 2 network members recruited by the Step 1 network members). We tested them for HIV. If they tested positive, we conducted LAg tests and measured HIV viral load. People with recent HIV infection in networks were defined as newly diagnosed individuals with documented testing history of negative serology in the last 6 months and/or LAg ODn ≤1.5, without any indication of advanced disease. Antibody-negative samples were tested for HIV RNA in pools of 10 to identify acute infections. To maximize the number of potential highly infectious people recruited, we also recruited the network members of people with "borderline-recent infection" found in networks. People with borderline-recent infection were defined as newly diagnosed individuals with LAg ODn ≥1.5 but with documented (or reliable, self-reported) history of testing HIV-negative within the last 9 months and/or high viral load (>100,000 copies/ml). For analytic purposes, we included people with borderline-recent infection as part of the recently-infected group in the analyses in this paper.

Based on the logic that infection spreads among members of social networks, and that people often find new sexual and injection partners within their social networks, TRIP did not stop when it encountered an uninfected network member but traced the network for at least one additional step (i.e., at least 2 steps from each seed). When recently or borderline-recently infected participants were located in network tracing, their risk and social contacts were recruited for 2 additional steps. For example, if a network member who was 2 steps away from his/her seed was classified as recently infected, we recruited his/her social network members and then the social network partners of those partners.

#### Incentives

Participants received 10 euros for baseline interviews and 5 euros for each network contact they named who participated in TRIP. As part of HIV testing, we provided participants with standard counseling and appropriate referrals to care. Recently and acutely infected participants received expedited assistance.

#### Follow-Up

Participants were followed up approximately 6 months later. Those who were uninfected at their first interview were offered the chance to be re-tested for HIV infection (and for recent infection and viral load if infected).

#### Informed Consent

The project was approved by the Institutional Review Boards of the Hellenic Scientific Society for the Study of AIDS and Sexually Transmitted Diseases in Athens and National Development and Research Institutes (NDRI) in New York. All participants provided written informed consent.

#### Analyses

Participants recruited in one of four classifications were included in the present analyses: recently-infected seeds; network/venue members of recently-infected seeds; longer-term-infected seeds (LT seeds); and network/venue members of LT seeds. (In these analyses, people with borderline-recent infections were categorized as recently-infected.) Statistical analyses were conducted with SPSS Statistics 21. **Table 2** compares descriptive statistics for members of the subcomponent of negatives with descriptive statistics for the recently-infected and long-term infected participants to whom they are linked. Although not shown in the table, one-way ANOVAs (for continuous variables) and Chi-square tests of independence (for binary variables) were used to compare these groups on all characteristics presented in the table. These tests produce approximate p-values that can only be used as heuristic guides because these three subsets of participants were recruited through chain-referral. As such, the sample violates the assumptions of sampling independence that underlie statistical inference.

TABLE 1 | Numbers and kinds of links of members of the 29 member subcomponent of uninfected participants with each other and with the 17 recently infected participants and long-term infected participants.

*For the two venue-based links between uninfected and recently-infected participants, there were no "risky links" since at least one of the dyad members reported not engaging in any sex without a condom and also no syringe sharing. The two venue links in the last row are both cases in which long-term infected participants were recruited from the venues of negatives in the uninfection pocket. In both of these links, each of the dyad members reported syringe sharing, although we do not know if they ever shared with each other.*

Network analyses and visualizations were conducted using Visone 2.16. Calculations of Seidman k-core specify subsets of a component whose members are all linked to k or more members of that same subset. In any given component, there can be only one 2-core. There can be multiple k-cores with k > 2; all of their members, by definition, are members of the 2-core. Participants who are not members of a core with k > 1 are only weakly tied to the network and thus to patterns of viral transmission. Thus, k-core analysis lets us understand how the uninfected component members "fit into" the large connected component, and the extent to which they are linked to denser parts of the network.
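The k-core definition above can be made concrete with a short peeling procedure: repeatedly drop anyone with fewer than k remaining links until no such person is left. The sketch below is illustrative only. It is not the Visone computation used in the study, and the function names and the toy edge list are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs, ignoring self-links."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def k_core(adjacency, k):
    """Return the members of the k-core: the largest subset in which
    every member is linked to at least k other members of the subset."""
    remaining = {node: set(neighbors) for node, neighbors in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if len(remaining[node]) < k:
                # Remove the node and erase it from its neighbors' link sets.
                for neighbor in remaining[node]:
                    remaining[neighbor].discard(node)
                del remaining[node]
                changed = True
    return set(remaining)

# Hypothetical toy network: a fully linked group P, Q, R, S;
# T is linked to P and Q; U is linked only to T.
EDGES = [("P", "Q"), ("P", "R"), ("P", "S"), ("Q", "R"), ("Q", "S"),
         ("R", "S"), ("T", "P"), ("T", "Q"), ("U", "T")]
ADJACENCY = build_adjacency(EDGES)

print(sorted(k_core(ADJACENCY, 2)))
print(sorted(k_core(ADJACENCY, 3)))
```

In this toy network U falls outside the 2-core, T belongs to the 2-core but not the 3-core, and the fully linked group forms the 3-core, which mirrors the nesting of cores described in the text.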

#### RESULTS

Forty-five recently-infected, 105 long-term infected, and 181 uninfected participants were recruited. The largest connected component had 241 members, and is shown in **Figure 1**. Within this large connected component there was a subcomponent (i.e., "pocket") of 29 connected uninfected PWID (located in the center of **Figure 1**). These 29 participants and the participants with whom they had a direct risk network link are the focus of this paper. (A direct network link usually means that at least one of two participants named the other as a network member during the interview. However, we also considered participants to be directly linked in cases where our field staff saw them together at injection venues and therefore categorized them as people who probably injected together, even if they did not report this on their questionnaires. Only 4 such links were identified among our 29 negative pocket members and any of their direct network connections.)

All but one of the members of this 29-member subcomponent are members of the Seidman 2-core of the large component, as are all of the infected participants to whom they are directly linked. Indeed, most of the 29 are members of a 3-core as well.

**Figure 2** shows the 29 members of the connected subcomponent of uninfected participants and their risk ties to each other and to the 17 recently-infected and 24 long-term infected participants with whom they have direct risk-network connections. **Table 1** shows that the uninfected had many links with each other (35 total links in Row 1) and with members of the recently-infected (47 total links in Row 2) and longer-term infected (36 total links in Row 3) participants, and that almost all of these risk links were injection links rather than sexual links.

**Table 2** presents sociodemographic and behavioral characteristics, HIV prevalence rate, and selected other variables for each of these 3 groups and for the total sample. As mentioned above, statistical comparisons only produce approximate p-values due to violations of sampling assumptions. Only one comparison was significant at p < 0.05: that the long-term infected were more likely to be unemployed.

Twenty-one (72%) of the 29 uninfected "pocket" members were directly linked (network distance = 1) to at least one recently-infected participant, and 16 (55%) to at least one long-term infected participant. We classified 14 out of 29 (48%) uninfected "pocket" members as being at "extremely high risk" because they self-reported syringe sharing and had at least one direct link to at least one injection partner who self-reported sharing syringes and had a viral load > 100,000 copies per mL. These 14 extremely high risk uninfected participants said they used a syringe someone else had already used a mean of approximately 45 times in the last 6 months. Another six of the 29 were linked to someone who shared syringes and had a viral load > 100,000 copies per mL but self-reported that they themselves had not shared syringes in the last 6 months.
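The classification rule described above can be written out explicitly. The following is a minimal sketch, assuming hypothetical record fields (`shares_syringes`, `viral_load`) and a hypothetical function name. It encodes only the two criteria stated in the text and is not the study's analysis code.

```python
HIGH_VIRAL_LOAD = 100_000  # copies per mL, the threshold quoted in the text

def classify_pocket_member(person, injection_partners):
    """Classify an uninfected participant from their own self-report and
    the self-reports and viral loads of their directly linked injection partners."""
    linked_to_high_load_sharer = any(
        partner["shares_syringes"]
        and (partner.get("viral_load") or 0) > HIGH_VIRAL_LOAD
        for partner in injection_partners
    )
    if not linked_to_high_load_sharer:
        return "not classified"
    if person["shares_syringes"]:
        return "extremely high risk"
    return "linked to high-viral-load sharer, no self-reported sharing"

# Hypothetical records for illustration only.
sharer = {"shares_syringes": True}
non_sharer = {"shares_syringes": False}
partners = [
    {"shares_syringes": True, "viral_load": 250_000},
    {"shares_syringes": True, "viral_load": None},   # uninfected partner
]

print(classify_pocket_member(sharer, partners))
print(classify_pocket_member(non_sharer, partners))
```

As the Discussion notes, a rule of this kind rests on self-reports that are not relationship-specific, so it cannot show that sharing occurred within a particular dyad.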

Seventeen of the 29 members of the uninfected subcomponent were re-interviewed and had blood taken in a 6-month follow-up. None of these 17 tested positive for HIV. At study intake, 12 (70.6%) of the 17 were in at least one partnership defined as having "extremely high risk."

#### DISCUSSION

The research in this paper shows both the power of risk network research and the limitations of current theories about the spread of HIV (and perhaps other agents) through networks and communities. Unlike phylogenetic research or behavioral epidemiology, the network design used in this study can investigate the ties among people who are infected and uninfected, and thus can pose questions about why groups of people who are uninfected remain that way despite having risk network links to people who both have high viral loads and engage in risky behavior.

Of note, neither of the two existing network-level theories can explain why the 29-member subcomponent remains uninfected. These 29 members have many sexual and/or injection ties both to recently-infected and to longer-term infected participants, which shows that the networks do not create a bottleneck that is preventing transmission to the 29-member subcomponent. Similarly, the large number of sexual and/or injection ties to participants who have high viral loads and/or are recently-infected shows that something besides the firewall effect is protecting the subcomponent members.

In a study of HIV in New York in the early 1990s (Friedman et al., 1997), we showed (1) that membership in the 2-core of the large component was associated with being HIV-infected and also with higher levels of risk behavior, and (2) that 3-core membership was also associated with additional risk. Thus, it is particularly puzzling to find a large subcomponent of the noninfected with most of its members in the 2-core (and, indeed, many are in a 3-core), and thus not peripheral to the risk network.

Insofar as we can test them, behavioral theories also do not explain why the uninfected subcomponent remains uninfected. Syringe sharing on the part of both the uninfected and their infected injection partners is widespread, and many of these infected partners have high viral loads and thus should be quite infectious. Nonetheless, since we lack relationship-specific data about how often a given participant engaged in a risk behavior with a specific other participant, it remains possible that the uninfected people whom we designated as "extremely high risk" may not have engaged in syringe sharing with their participant partners who had high viral loads and who also engage in syringe sharing (with unknown persons). A related limitation is that four of the 48 links of uninfected participants with recently-infected participants were "venue links," which means that we cannot be certain that they are directly linked as friends or partners.

One possible explanation for the fact that the "pocket" members remained uninfected is that the epidemic outbreak among Athens PWID is fairly new, having started in 2011 (Paraskevis et al., 2013; Nikolopoulos et al., 2015a; Sypsa et al., 2017), and that HIV therefore simply had not reached them yet. We cannot rule this out, but the fact that none of the 17 uninfected participants for whom we have follow-up testing data seroconverted by the 6-month follow-up provides a (low statistical power) piece of evidence that suggests that something more is going on here.

Thus, we are left with a conundrum: none of the existing theories can explain our observations. It is, of course, possible that our data are an anomaly, which suggests that replication research is sorely needed. It is also possible that some members of the pocket of non-infection could have a degree of genetic immunity to HIV (Tsiara et al., 2018). On the other hand, these data are sufficiently strong to suggest that the theoretical development of the field is incomplete and that some deep thinking is required. The focus of this deep thinking should go beyond the question of why individuals with high-risk connections are not infected, and should instead consider the question of how such a large, at-risk connected cluster remains uninfected. Relatedly, if this phenomenon turns out to be common, future efforts should seek to understand the contradiction between this phenomenon and the fact that large-scale epidemic outbreaks do happen.

Future replication research should seek to obtain detailed data on the risk and protective behaviors each member of each dyad engages in with each of their specific network members. It should also collect and analyze specimens for measuring possible individual resistance to infection (e.g., via human leukocyte antigen typing).

Future theory and research should not only seek to understand how such a large "pocket" of uninfected network members can remain so, given the observed risks, but should also seek to explore some additional questions posed by the present findings: (1) Given the large number of longer-term infected participants with viral loads >100,000 copies/mL in **Table 2**, as these people with extremely high viral load develop more effective antibody responses and their viral loads decrease, will this establish effective firewalls to reduce further viral transmission? And (2) as a corollary question, in the context of an epidemic among Athens PWID that began in 2011 and had just passed its period of highest incidence at the time TRIP began recruiting, why were these high viral loads so prevalent?

Finally, the straightforward methods used here to study subnetworks of non-infection provide a template for studying "bottleneck" and "firewall" network hypotheses in practice. This template should be useful as additional theories are developed.

#### AUTHOR CONTRIBUTIONS

All authors contributed to design and data collection, and critiqued and approved the final text. LW had primary responsibility for analyses. LW and SF had primary responsibility for writing the article. SF had primary responsibility for the project as a whole (all sites), while GN took primary responsibility for the conduct of the research at the Athens site.

#### ACKNOWLEDGMENTS

This research was supported by US National Institute on Drug Abuse grants DP1 DA034989 Preventing HIV Transmission by Recently-infected Drug Users (Transmission Reduction Intervention Project) and P30 DA11041 Center for Drug Use and HIV Research. The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Williams, Kostaki, Pavlitina, Paraskevis, Hatzakis, Schneider, Smyrnov, Hadjikou, Nikolopoulos, Psichogiou and Friedman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Immunomodulatory Function of HBeAg Related to Short-Sighted Evolution, Transmissibility, and Clinical Manifestation of Hepatitis B Virus

#### Anna Kramvis <sup>1</sup> \*, Evangelia-Georgia Kostaki <sup>2</sup> , Angelos Hatzakis <sup>2</sup> and Dimitrios Paraskevis <sup>2</sup>

*<sup>1</sup> Hepatitis Virus Diversity Research Unit, Department of Internal Medicine, Faculty of Health Science, University of the Witwatersrand, Johannesburg, South Africa, <sup>2</sup> Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, Athens, Greece*

#### Edited by:

*Tara Patricia Hurst, Abcam, United Kingdom*

#### Reviewed by:

*Timokratis Karamitros, University of Oxford, United Kingdom Masaya Sugiyama, National Center For Global Health and Medicine, Japan*

> \*Correspondence: *Anna Kramvis Anna.Kramvis@wits.ac.za*

#### Specialty section:

*This article was submitted to Virology, a section of the journal Frontiers in Microbiology*

Received: *19 July 2018* Accepted: *03 October 2018* Published: *24 October 2018*

#### Citation:

*Kramvis A, Kostaki E-G, Hatzakis A and Paraskevis D (2018) Immunomodulatory Function of HBeAg Related to Short-Sighted Evolution, Transmissibility, and Clinical Manifestation of Hepatitis B Virus. Front. Microbiol. 9:2521. doi: 10.3389/fmicb.2018.02521*

Hepatitis B virus (HBV) infection, a global public health problem, can be asymptomatic, acute or chronic and can lead to serious consequences of infection, including cirrhosis and hepatocellular carcinoma. HBV, a partially double-stranded DNA virus, belongs to the family *Hepadnaviridae* and replicates via reverse transcription of an RNA intermediate. This reverse transcription is catalyzed by a virus-encoded polymerase that lacks proofreading ability, which leads to sequence heterogeneity. HBV is classified into nine genotypes and at least 35 subgenotypes, which may be characterized by distinct geographical distributions. This HBV diversification and distinct geographical distribution has been proposed to be the result of the co-expansion of HBV with modern humans, after their out-of-Africa migration. HBeAg is a non-particulate protein of HBV that has immunomodulatory properties as a tolerogen that allows the virus to establish HBV infection *in vivo*. During the natural course of infection, there is seroconversion from a HBeAg-positive phase to a HBeAg-negative, anti-HBe-positive phase. During this seroconversion, there is loss of tolerance to infection, and immune-escape HBeAg-negative mutants can be selected in response to the host immune response. The different genotypes and, in some cases, subgenotypes develop different mutations that can affect HBeAg expression at the transcriptional, translational and post-translational levels. The ability to develop mutations affecting HBeAg expression can influence the length of the HBeAg-positive phase, which is important in determining both the mode of transmission and the clinical course of HBV infection. Thus, the different genotypes/subgenotypes have evolved in such a way that they exhibit different modes of transmission and clinical manifestation of infection. Loss of HBeAg may be a sign of short-sighted evolution because there is loss of tolerogenic ability of HBeAg and HBeAg-negative virions are less transmissible. Depending on their ability to lead to HBeAg seroconversion, the genotypes/subgenotypes exhibit varying degrees of short-sighted evolution. The "arms race" between HBV and the immune response to HBeAg is multifaceted and its elucidation intricate, with transmissibility and persistence being important for the survival of the virus. We attempt to shed some light on this complex interplay between host and virus.

Keywords: genotypes, subgenotypes, tolerogen, transmission, HBeAg seroconversion

Hepatitis B virus (HBV), which belongs to the family *Hepadnaviridae*, is the smallest DNA virus infecting man. HBV has a partially double-stranded, circular DNA genome of ∼3,200 base pairs, with four overlapping open reading frames (ORFs). HBV replicates via an RNA intermediate, through the process of reverse transcription catalyzed by the viral polymerase, which lacks proofreading ability. Thus, sequence heterogeneity is a feature of the virus, and within the host HBV exists as a quasispecies of mixed viral strains. Based on an intergroup divergence of >7.5% across the complete genome, HBV has been classified phylogenetically into 9 genotypes, A–I (Norder et al., 2004; Kramvis et al., 2005; Yu et al., 2010; Kramvis, 2014) (**Figure 1**), with a putative 10th genotype, "J," isolated from a single individual (Tatematsu et al., 2009). Genotypes A–D, F, H, and I are classified further into at least 35 subgenotypes, using between ∼4 and 8% intergroup nucleotide divergence across the complete genome and good bootstrap support (Norder et al., 2004; Kramvis et al., 2005, 2008; Kramvis, 2014). The genotypes, and in some cases the subgenotypes, have distinct global and local geographical distributions, with different genotypes prevailing in the two regions where HBV is endemic, South East Asia and Africa (**Figure 2**) (Kramvis et al., 2005; Kramvis, 2014). This HBV diversification and distinct geographical distribution has been proposed to be the result of the co-expansion of HBV with modern humans, after their out-of-Africa migration (Paraskevis et al., 2013).
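The divergence thresholds quoted above can be illustrated with a simple pairwise calculation. This is a deliberately simplified sketch: real classification uses intergroup divergence over complete genomes together with phylogenetic (bootstrap) support, neither of which is modeled here. The function names, the toy sequences, and the choice to treat the overlapping 7.5–8% band as genotype-level are assumptions made for illustration.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned nucleotide sequences,
    counting only sites where both sequences have A, C, G, or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

def divergence_level(distance, genotype_cutoff=0.075, subgenotype_cutoff=0.04):
    """Map a divergence value onto the thresholds quoted in the text."""
    if distance > genotype_cutoff:
        return "genotype-level"
    if distance >= subgenotype_cutoff:
        return "subgenotype-level"
    return "below subgenotype level"

def mutate(sequence, n_sites):
    """Toy helper: change the first n_sites bases to a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[base] for base in sequence[:n_sites]) + sequence[n_sites:]

REFERENCE = "ACGT" * 25  # hypothetical 100-base sequence, not an HBV genome

for n_sites in (1, 5, 10):
    distance = p_distance(REFERENCE, mutate(REFERENCE, n_sites))
    print(n_sites, distance, divergence_level(distance))
```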

In addition to encoding the structural proteins, HBcAg (core or capsid protein) and the viral envelope proteins [three forms of HBsAg, small (S), middle (M), and large (L)], and the polymerase/reverse transcriptase, the compact genome of HBV encodes two non-particulate proteins, the X protein (a transcriptional transactivator) and HBeAg (Tiollais et al., 1981). Antibodies are directed against both structural and nonstructural proteins of HBV. Anti-HBc is non-neutralizing and a sign of exposure to HBV. Anti-HBs is the neutralizing antibody, directed against the dominant "a" epitope of the viral envelope, and anti-HBs levels >10 IU/L, following either natural infection or vaccination, are indicators of immunity. Anti-HBe is directed against B cell epitopes shared by both HBeAg and HBcAg, with HBeAg acting as a decoy for HBcAg. HBcAg and HBeAg T cell epitopes are also cross-reactive in both humans (Ferrari et al., 1991) and mice (Milich et al., 1987). Anti-HBe seroconversion can occur up to 6 years before the actual loss of HBeAg or the onset of liver damage (Thompson et al., 2007).

The complex interplay between the host and viral factors (including genotype/subgenotype, viral load and HBeAg status) plays an important role in determining the clinical outcomes of HBV infection. HBV infection can be asymptomatic, acute, or chronic (HBsAg-positive for longer than 6 months), and can lead to serious consequences, including cirrhosis and hepatocellular carcinoma (HCC; liver cancer). HBV is generally non-cytopathic. The liver damage associated with either acute or chronic hepatitis B results from the immune attack on hepatocytes, in a bid to eliminate HBV during the immune clearance or reactive phase, which leads to necroinflammation. In order to overcome the effects of the immune response, viruses can code for immunomodulatory proteins, such as HBeAg in the case of HBV, and also evolve genetically in order to escape the immune response. Immune escape mutants can be selected during the course of natural infection in response to the host immune response. In HBV, the complex patterns of purifying selection are a result of the overlapping ORFs (Mizokami et al., 1997) and the high frequency of recombination (Simmonds and Midgley, 2005; Zhou and Holmes, 2007). The "arms race" between HBV and the immune response is multifaceted and its elucidation intricate, with transmissibility and persistence being important for the survival of the virus. Here an attempt will be made to shed some light on this complex interplay.

## HBeAG AS AN IMMUNOMODULATOR

The 25 kDa precursor of HBeAg, with an additional 29 amino acids on its amino terminus relative to HBcAg, is translated from the preC/C ORF on the precore mRNA [1901–2452/2488 from the EcoRI site, (Kramvis et al., 2005)] (Summers and Mason, 1982; Messageot et al., 2003). HBcAg is translated from the pregenomic RNA from position 1901. The basic core promoter (BCP) of HBV controls the transcription of the preC/C region from both transcripts (Yuh et al., 1992; Yu and Mertz, 1996) (reviewed in Kramvis and Kew, 1999). The amino terminal signal peptide directs the precursor to the endoplasmic reticulum (ER), where it is post-translationally modified by cleavage on the amino and carboxyl termini to give rise to the mature HBeAg that is expressed in the hepatocyte cytosol and also secreted in its soluble form in the serum (Revill et al., 2010).

Even though its exact function has not been determined, the conservation of HBeAg in all hepadnaviruses signifies an important role of this non-particulate secreted protein (Revill et al., 2010). Moreover, HBeAg must impart an evolutionary advantage to HBV because even though HBeAg-negative mutant strains of HBV exist (see below), they have not replaced the wildtype virus. From various animal studies it is evident that HBeAg is not involved in viral infection, replication, and assembly (Chang et al., 1987; Schlicht et al., 1987; Chen et al., 1992; Milich and Liang, 2003), but is important for natural infection in vivo (Milich and Liang, 2003). It is thought that the virus has retained the secretory form of HBcAg because it has immunomodulatory functions. HBeAg downregulates the immune response to HBcAg by deletional, nondeletional, central, and peripheral immune tolerance (Milich et al., 1990, 1998; Milich, 1997; Milich and Liang, 2003; Chen et al., 2004, 2005). Thus, HBeAg-mediated immune regulation may predispose to chronicity and persistence following in utero or perinatal transmission and prevent severe liver injury during adult infections (Milich and Liang, 2003). Clinically, HBeAg is an index of viral replication, infectivity, inflammation, severity of disease and response to antiviral therapy.

HBeAg is expressed by both human and non-human hepadnaviruses, with all mammalian-infecting sequences showing high conservation within the amino terminus of the precursor and key immunomodulatory epitopes (Revill et al., 2010), including B-cell epitopes and T-cell recognition sites in the region that overlaps with HBcAg (Milich et al., 1987; Belnap et al., 2003; Billaud et al., 2005). Thus, HBeAg:

• elicits both humoral and cell-mediated immunity, which differ from that directed to HBcAg (Huang et al., 2006)

HBcAg and HBeAg share extensive amino acid homology but differ in their structure and localization. HBcAg is particulate and found in the cell whereas HBeAg is non-particulate/monomeric and can be found in the cell but is also secreted extracellularly (Milich and Liang, 2003). Because of these differences they are targeted by different antibodies (Imai et al., 1982). The antibody to HBcAg can either be T-cell dependent or independent, and the non-cross-reactive antibody response to HBcAg is greater than that directed against HBeAg (Milich and McLachlan, 1986). HBeAg exhibits a low T-cell dependent antibody response (Milich and Liang, 2003). The CD8<sup>+</sup> T-cell (CTL) response against HBcAg is important in HBV elimination in humans (Bertoletti et al., 1991; Chisari and Ferrari, 1995). HBcAg and HBeAg are highly cross-reactive at the CD4<sup>+</sup> and CD8<sup>+</sup> T-cell levels (Milich et al., 1988; Bertoletti et al., 1991; Kuhrober et al., 1997; Townsend et al., 1997) and are indistinguishable in terms of CTL priming and CTL target recognition in vitro (Kuhrober et al., 1997; Townsend et al., 1997; Frelin et al., 2009). HBeAg expressed in hepatocytes in vivo is a superior target for HBcAg/HBeAg CTLs compared to HBcAg alone (Frelin et al., 2009). In order to be recognized by the CTLs, cytosolic HBeAg and/or its precursors must be processed and presented in the context of major histocompatibility complex (MHC) class I molecules (Frelin et al., 2009), whereas HBeAg presentation for CD4<sup>+</sup> cells is via the MHC class II pathway (Milich and Liang, 2003).

• functions as a T cell tolerogen (Chen et al., 2005) and regulates the immune response against the intracellular nucleocapsid (Chen et al., 2004)

An important function of the non-particulate HBeAg, which is the only HBV protein to cross the placenta (Milich and Liang, 2003), is to establish neonatal T cell tolerance to both HBeAg and HBcAg (Milich et al., 1990; Hsu et al., 1992; Chen et al., 2004). Therefore, it is essential that HBeAg remains non-particulate so that it can cross the placenta (Schodel et al., 1993).

• regulates the innate immune response to viral infection within hepatocytes.

The adaptive immune response is attenuated by the preferential activation of Th2 cells over Th1 cells, following NF-κB activation by HBeAg (Yang et al., 2006). Th2 cells induce a non-protective humoral response whereas Th1 cells stimulate macrophages, which eliminate virions (Forsthuber et al., 1996; Huang et al., 2006). HBeAg in the serum leads to a switch from the normal Th1-mediated anti-HBc antibody response to a Th2 phenotype (Milich et al., 1998). The difference in the activation of the different Th cell subsets is probably because the primary antigen-presenting cells are different for HBcAg and HBeAg, being B cells for the former and dendritic/macrophage cells for the latter (Milich et al., 1997). This preferential activation by HBeAg of Th2 cells, together with the depletion of Th1 cells, which are required for viral clearance, can lead to the persistence of HBV. HBeAg also down-regulates the expression of the Toll-like receptor (TLR) family within hepatocytes (Riordan et al., 2003; Visvanathan et al., 2007). When peripheral blood monocyte cells were pre-treated with HBeAg, they displayed an impaired TLR signaling response (Visvanathan et al., 2007).

### HBeAG TO ANTI-HBe SEROCONVERSION

Broadly, the natural history of HBV can be divided into at least four phases, which are differentiated by the level of viral replication and the host immune response:


The length of the HBeAg-positive phase is important in determining both the mode of transmission and the clinical course of HBV infection. Mother-to-child transmission and subsequent chronic infection in the infant are favoured when mothers are HBeAg-positive (Chen et al., 2012), whereas children born to HBeAg-negative mothers are more likely to develop acute hepatitis B or may be infected horizontally later in life, if not immunized (Hadziyannis and Vassilopoulos, 2001; Kramvis, 2016). Following acute HBV infection, the development of chronic infection requires the expression of HBeAg (Hadziyannis, 2011), with a longer duration of HBeAg-positivity increasing the risk of liver disease progression (Chu and Liaw, 2007). Although HBeAg loss is universally considered to be a favourable outcome (Hsu et al., 2002; Chu and Liaw, 2007), usually accompanied by decreased viral load and clinical remission, active hepatitis, cirrhosis, and hepatocellular carcinoma can develop in a minority of patients even after seroconversion (Hsu et al., 2002).

HBeAg seroconversion is a dynamic process lasting a number of years and involves the interaction between the host immune response and changes in the quasispecies and viral loads of HBV (Lim et al., 2007).

### VIRAL EVOLUTION AND EMERGENCE OF MUTANTS AFFECTING HBeAG EXPRESSION

The loss of HBeAg between the high replicative, low inflammatory phase and HBeAg-negative chronic hepatitis phase is accompanied by:

• Decrease in viral sequence diversity

Viral sequence diversity was found to be 2.4-fold higher in both spontaneous and interferon-induced HBeAg seroconverters compared with non-seroconverters (Lim et al., 2007). Mutations in both HBeAg and HBcAg were positively selected in 70% of the seroconverters compared to 5% of the non-seroconverters (Lim et al., 2007).

#### TABLE 1 | Mutations affecting HBeAg expression.


\**The logistic regression method was used to identify HBV mutations associated with specific genotypes/ subgenotypes (Afifi et al., 2004). Larger values show tighter correlations, with positive values related to a positive association, and negative values to a negative association. All associations were statistically significant using 2-sided p-values (Kramvis et al., 2008).*

• Increase in viral substitution rates

Viral substitution rates have been shown to increase between the high replicative, low inflammatory phase and the immune clearance/reactive phase (Hannoun et al., 2000). Following seroconversion, immune pressure continues to drive viral mutagenesis even though the viral loads may be low (Lim et al., 2007) and HBV replicates less efficiently in the HBeAg-negative phase of disease (Volz et al., 2007).

• Emergence of HBeAg-negative mutants

The viral quasispecies following seroconversion are defective in HBeAg production as a result of mutations in the BCP and preC/C ORF. Mutations in the BCP and preC can influence the expression of HBeAg at the transcriptional, translational and post-translational levels (Kramvis, 2016) (**Table 1**).

Position 1858 in the preC/C can be either a C or T (**Figure 3**) and is differentially associated with the different genotypes/subgenotypes (**Table 1**). This variation at position 1858 can influence which of the mutations affecting HBeAg expression will develop (Kramvis et al., 2008) and their frequency (Revill et al., 2010).
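Per-locus mutation frequencies of the kind plotted in **Figure 3** can be tallied by comparing each sequence with a reference motif at the loci of interest. The sketch below is a minimal illustration and is not the Mutation Reporter Tool. It uses the ten loci and the reference motif named in the Figure 3 caption, while the function name, the toy input motifs, and the rule that any non-reference character counts as mutant are assumptions.

```python
# Loci and reference motif as listed in the Figure 3 caption.
LOCI = (1762, 1764, 1809, 1810, 1811, 1812, 1858, 1862, 1888, 1896)
REFERENCE_MOTIF = "AGGCACGGGG"

def percent_mutant(motifs, reference=REFERENCE_MOTIF):
    """Return, for each locus, the percentage of sequences whose residue
    differs from the reference residue at that locus.

    `motifs` holds one string per sequence, made of the residues found at
    the loci of interest, in the same order as the reference motif."""
    if not motifs:
        raise ValueError("at least one sequence is required")
    mutant_counts = [0] * len(reference)
    for motif in motifs:
        if len(motif) != len(reference):
            raise ValueError("each motif must cover every locus")
        for index, (observed, expected) in enumerate(zip(motif.upper(), reference)):
            if observed != expected:
                mutant_counts[index] += 1
    return [100.0 * count / len(motifs) for count in mutant_counts]

# Hypothetical residues for four sequences, for illustration only.
TOY_MOTIFS = ["AGGCACGGGG", "TAGCACGGGG", "AGGCACGGGA", "TAGCACTGGA"]

for locus, percent in zip(LOCI, percent_mutant(TOY_MOTIFS)):
    print(locus, percent)
```

Extracting the residues at each genome position from an alignment is left out here because it depends on the numbering and alignment conventions of the data set being analyzed.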

Although the HBV sequences deposited in the public databases, such as GenBank, may not be entirely representative of the strains circulating globally, they provide the best available datasets of genotyped strains of HBV (Bell et al., 2016). From this data set it is evident that the genotype can affect the frequency of the BCP/PC mutations. The BCP double mutation 1762T/1764A occurs in ∼25% of downloaded sequences (Bell et al., 2016) without difference between subgenotypes A1, A2, and genotype D (**Figure 3**). In two case control studies, independently of HBeAg status, 1762T/1764A was significantly more frequent in genotypes A and C compared to D and B, respectively (Orito et al., 2001; Tanaka et al., 2004). G1896A, a mutation that introduces a stop codon and truncates HBeAg during translation, does not develop in the genotypes/subgenotypes in which 1858C occurs frequently (Li et al., 1993; Lok et al., 1994) and with which 1858C is positively associated, namely A, C2, F2, and H (Kramvis et al., 2008). On the other hand, G1896A is frequent in the genotypes/subgenotypes that have 1858T, namely C1, D, E, and F (Kramvis et al., 2008; Revill et al., 2010). As illustrated in **Figure 3**, more than 40% of genotype D sequences downloaded from GenBank had G1896A, whereas it was found in very few genotype A sequences. Overall, G1896A occurs in up to 30% of all sequences and is most frequently detected in genotypes G (100%) and B (50%), followed by genotypes D (40%) and C (23%) (Revill et al., 2010). All genotype G and the majority of subgenotype B6 sequences encode this mutation, and it is least frequently observed in genotype A (1.5%) and genotype F (5%) (Revill et al., 2010). Thus, because of these differences, the estimated annual rate of HBeAg to anti-HBeAg seroconversion can differ between the genotypes. Genotype B has a higher annual rate of seroconversion (15.5%) than genotype C (7.9%) (Kao et al., 2004) and genotype E (7.4%) (Shimakawa et al., 2016). In a Taiwanese study, it was shown that HBeAg seroconversion of patients infected with genotype C occurred 10 years later than in those infected with genotype B (Kao et al., 2004).
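To give a feel for what the quoted annual rates imply, the following sketch converts each rate into the time by which half of a cohort would be expected to have seroconverted. It assumes a constant annual rate applied independently each year, which is a simplification introduced here for illustration and not a model taken from the cited studies; the function name is hypothetical.

```python
import math

# Annual HBeAg seroconversion rates quoted in the text (fraction per year).
ANNUAL_RATES = {"B": 0.155, "C": 0.079, "E": 0.074}


def years_to_half_seroconverted(annual_rate: float) -> float:
    """Years until 50% of a cohort has seroconverted.

    Assumption: a constant annual rate r, so the fraction still
    HBeAg-positive after t years is (1 - r) ** t. Solving
    (1 - r) ** t = 0.5 gives t = ln(0.5) / ln(1 - r).
    """
    return math.log(0.5) / math.log(1.0 - annual_rate)


half_times = {g: years_to_half_seroconverted(r) for g, r in ANNUAL_RATES.items()}
for genotype, years in half_times.items():
    print(f"genotype {genotype}: ~{years:.1f} years to 50% seroconversion")
```

Under this simplified model the ordering of the genotypes follows the ordering of the rates; the absolute times should not be read as clinical estimates, because real seroconversion rates vary with age and phase of infection.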

FIGURE 3 | Mutation distribution graphs generated using the Mutation Reporter Tool (Bell and Kramvis, 2013) showing the percentage of mutant residues relative to the reference motif found at the ten loci of interest specified (1762, 1764, 1809–1812, 1858, 1862, 1888, 1896). Three data sets were submitted to the tool to produce the three graphs showing the mutation distribution for 605 subgenotype A1, 730 subgenotype A2, and 1899 genotype D unselected sequences. The basic core promoter/precore sequences (1750–1900 from the *Eco*RI site) were downloaded from http://hvdr.bioinf.wits.ac.za/alignments (Bell et al., 2016). The reference motif used was AGGCACGGGG. This is also shown by the letter preceding each locus on the X-axis. To facilitate direct comparisons between the graphs, conserved loci were not suppressed and the Y-axis was scaled to 100% by selecting the appropriate controls on the input page.

Subgenotype A2 does not select the G1896A mutation, nor the other mutations found in subgenotype A1 that contribute to the HBeAg-negativity of this unique strain (**Table 1**). In particular, the mutations in the Kozak sequence upstream from the precore start codon (1809–1812), which are characteristic of subgenotype A1 (Baptista et al., 1999; Kramvis and Kew, 2007; Revill et al., 2010) and interfere with translation of HBeAg by a leaky scanning mechanism (Ahn et al., 2003), do not occur in subgenotype A2. Therefore, subgenotype A2 generally has a high frequency of HBeAg-positivity compared to subgenotype A1, and this is statistically significant in individuals younger than 30 years (Tanaka et al., 2004). The reason for the high frequency of HBeAg-positivity in subgenotype A2 is possibly that the only mutation selected in A2 that can affect HBeAg expression is 1762T/1764A. This double mutation is found more frequently in HBeAg-negative carriers of subgenotype A2 HBV, whereas the frequency of 1762T/1764A is not affected by HBeAg status in subgenotype A1 (Tanaka et al., 2004). In genotype D isolates from HBeAg-negative individuals, both 1762T/1764A and 1896A were more frequent than in isolates from HBeAg-positive individuals (Tanaka et al., 2004). The BCP 1762T/1764A mutations are not selected in subgenotype B6 strains, which are prevalent in people living in the Canadian Arctic, who are frequently HBeAg-negative as a result of 1896A and precore start codon mutations (Osiowy et al., 2011). Even though genotypes D and E cannot be differentiated in the precore region and both have 1858T, they do not develop G1896A at the same frequency. In Sudan, where genotypes D and E co-circulate, viral loads were significantly higher in genotype E-infected than in genotype D-infected patients, and patients infected with genotype E showed a significantly higher frequency of HBeAg-positivity among blood donors (Mahgoub et al., 2011), asymptomatic carriers and liver disease patients (Yousif et al., 2013). The high frequency of HBeAg-negativity in genotype D was a result of G1896A (Mahgoub et al., 2011; Yousif et al., 2013), which correlates with the statistically positive and negative association of G1896A with genotypes D and E, respectively (Kramvis et al., 2008) (**Table 1**). This is confirmed by a comparison of the frequency of G1896A in sequences downloaded from the public databases (Bell et al., 2016), where G1896A occurred in 47.2% of genotype D compared to 34.2% of genotype E sequences (p < 0.0001) (**Figure 4**). Genotype G is unique in that all sequences have a premature stop codon at position 2 of the precore precursor protein and therefore HBeAg is not expressed (Stuyver et al., 2000).
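The comparison of G1896A frequencies between genotypes D and E can be illustrated with a two-proportion z-test. The text does not say which statistical test produced the reported p-value, so the test below is a stand-in chosen for illustration. The counts are also an assumption: they are reconstructed by applying the quoted percentages to the data set sizes given in the figure caption for genotypes D and E (1899 and 471 sequences).

```python
import math


def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Assumed counts: quoted percentages applied to the caption's data set sizes.
n_d, n_e = 1899, 471
k_d = round(0.472 * n_d)  # genotype D sequences carrying G1896A
k_e = round(0.342 * n_e)  # genotype E sequences carrying G1896A

z, p_value = two_proportion_z_test(k_d, n_d, k_e, n_e)
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

With these reconstructed counts the test gives a p-value below 0.0001, in line with the significance level reported in the text.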

FIGURE 4 | Mutation distribution graphs generated using the Mutation Reporter Tool (Bell and Kramvis, 2013) showing the percentage of mutant residues relative to the reference motif found at the ten loci of interest specified (1762, 1764, 1809–1812, 1858, 1862, 1888, 1896). Two data sets were submitted to the tool to produce the two graphs showing the mutation distribution for 1899 genotype D and 471 genotype E unselected sequences. The basic core promoter/precore sequences (1750–1900 from the *Eco*RI site) were downloaded from http://hvdr.bioinf.wits.ac.za/alignments (Bell et al., 2016). The reference motif used was AGGCACGGGG. This is also shown by the letter preceding each locus on the X-axis. To facilitate direct comparisons between the graphs, conserved loci were not suppressed and the Y-axis was scaled to 100% by selecting the appropriate controls on the input page.

Subgenotype A1 is the most "sophisticated" in terms of its control of HBeAg expression (Kramvis and Kew, 2007). Firstly, like all other genotypes and subgenotype A2, it can develop 1762T/1764A. Secondly, instead of Kozak 1809–1812 GCAC, which is present in subgenotypes A2 and D3, subgenotype A1 has TCAT (Kramvis and Kew, 2007). These variations are characteristic of subgenotype A1 and affect HBeAg expression at the translational level (Ahn et al., 2003), by converting the Kozak region from an optimal to a suboptimal translation context (Kimbi et al., 2004) and causing decreased translation of HBeAg by a ribosomal leaky scanning mechanism (Ahn et al., 2003). Compared to subgenotypes A2 and D3, transfection with subgenotype A1 led to a lower expression of the precore/core precursor in the secretory pathway and a higher co-localization in the nucleus (Bhoola and Kramvis, 2016). This reduction in HBeAg levels is comparable to that observed in the presence of 1762T/1764A and, when occurring together, the Kozak and BCP 1762T/1764A mutations reduce HBeAg expression in an additive manner (Ahn et al., 2003). Thirdly, a G to T transversion at position 1862 in the precore region of HBV occurs more frequently in subgenotype A1 isolates from HBeAg-negative than from HBeAg-positive South Africans (Kramvis et al., 1997, 1998) and affects HBeAg expression at the post-translational level (Sugauchi et al., 2004; Chen et al., 2008; Inoue et al., 2009; Bhoola and Kramvis, 2017). This mutation results in a valine to phenylalanine substitution at the −3 position of the signal peptide cleavage site at position 19 of the precursor protein. Phenylalanine interferes with signal peptide cleavage (Nielsen et al., 1997). This leads to decreased HBeAg expression as a result of the increased retention of the precursor in the cytoplasm of the hepatocyte (Chen et al., 2008; Inoue et al., 2009; Bhoola and Kramvis, 2017). When this mutation was introduced into a genotype D plasmid driven by a cytomegalovirus promoter, it resulted in a 54% reduction in the secretion of HBeAg relative to the wild-type and in the formation of aggresomes (Chen et al., 2008). When this mutation was introduced into a subgenotype A1 backbone, the decrease in expression of secreted HBeAg was smaller (22%) than that in the genotype D context. The mutant was found to lead to the accumulation of the HBeAg precursor protein in the ER and ER-Golgi intermediate compartment (ERGIC). This accumulation resulted in an earlier activation of the three UPR pathways, but not in an increase in apoptosis (Bhoola and Kramvis, 2016).
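The mutation distribution graphs referred to in this section report, for each locus, the percentage of sequences carrying a residue that differs from the reference. The sketch below shows one way such percentages could be computed; it is not the implementation of the Mutation Reporter Tool. For brevity it covers only three of the ten loci, with wild-type residues taken from the mutation names used in the text (A1762T, G1764A, G1896A), and the four input sequences are hypothetical toy data.

```python
from collections import Counter

# Three of the ten loci, with the wild-type residue implied by the
# mutation names in the text (A1762T, G1764A, G1896A).
REFERENCE = {1762: "A", 1764: "G", 1896: "G"}


def mutation_distribution(sequences: list[dict[int, str]]) -> dict[int, dict[str, float]]:
    """Percentage of each non-reference residue at every locus.

    Each sequence maps a locus to the residue observed there, i.e. the
    alignment columns of interest have already been extracted.
    """
    total = len(sequences)
    result: dict[int, dict[str, float]] = {}
    for locus, ref in REFERENCE.items():
        counts = Counter(seq[locus] for seq in sequences)
        result[locus] = {
            residue: 100.0 * n / total
            for residue, n in counts.items()
            if residue != ref
        }
    return result


# Hypothetical toy data set of four sequences (not real HBV data).
toy = [
    {1762: "A", 1764: "G", 1896: "G"},
    {1762: "T", 1764: "A", 1896: "G"},
    {1762: "A", 1764: "G", 1896: "A"},
    {1762: "A", 1764: "G", 1896: "A"},
]
distribution = mutation_distribution(toy)
print(distribution)
```

Keeping conserved loci in the result (as empty entries) mirrors the caption's note that conserved loci were not suppressed, which makes graphs from different data sets directly comparable.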

Thus, in the various genotypes and subgenotypes, HBeAg loss and escape from the immune response result from a number of mutations. To persist, HBV has to balance its ability to establish an infection (HBeAg-positive, immunotolerogenic phase) against its ability to escape the immune response (HBeAg-negative phase).

### INTERACTION OF THE VIRUS AND HOST

Previous studies have shown that a few years before seroconversion and as soon as the levels of HBeAg begin declining, the genetic diversity of the virus increases (Hannoun et al., 2000; Lim et al., 2007; Cheng et al., 2013). Notably, viral diversity does not follow this trend in non-seroconverters but remains at the low level seen over the course of the immunotolerant phase (Lim et al., 2007; Wu et al., 2011; Cheng et al., 2013). Given the stable rate at which mutations accumulate in the HBV genome during the different stages of HBV replication, the differences in viral diversity observed between seroconverters and non-seroconverters, both during the course of the infection and close to the time of seroconversion, are attributable to the strong selective pressure of the immune response (Cheng et al., 2013). Specifically, as soon as the levels of HBeAg decrease, tolerance diminishes, leading to an increased immune response and the selection of escape variants, which are infrequently selected during the immunotolerant phase. The increased genetic diversity is inversely related to HBV-DNA levels, intimating that the variants selected before and during the HBeAg-negative phase have a lower replicative capacity than the virions circulating in the absence of immune selective pressure (Cheng et al., 2013). Thus, the intra-host viral evolution also follows different stages according to the HBeAg and immune status of the host. During the immunotolerant phase, the virus replicates without constraint from the host and thus there is no need to select multiple variants; in contrast, during the immune clearance and seroconversion phase, different variants are selected because of the strong selective pressure (Lim et al., 2007; Cheng et al., 2013). Interestingly, during the latter phase, multiple variants with higher genetic variability in the non-overlapping genomic regions are selected and circulate.

Therefore, there is a complex interaction of the virus and the host. As previously found, HBV has co-expanded with modern humans for at least 28,000 years, suggesting a long period of interaction (Paraskevis et al., 2013). HBV major clades (genotypes) have been generated as a result of founder effects of different strains spreading in distinct geographic regions. Similarly, subgenotypes, which fall within the HBV major clades, are the result of more recent events of the same kind. The time to the most recent common ancestor (tMRCA) of the different genotypes has previously been estimated at several thousand years ago (Paraskevis et al., 2013), while the tMRCA of the subgenotypes is more recent (Paraskevis et al., 2013). The long co-expansion of HBV with humans, together with the fact that some genotypes are found more frequently in specific areas and populations (e.g., F/H in indigenous populations in America; B and C in Asia; many different subgenotypes infecting mostly indigenous populations across the globe), suggests that some of the genomic characteristics related to HBeAg expression have probably been shaped over a long time period. This is presently a hypothesis and concrete evidence will come from the analysis of HBV from ancient samples.

Stochastic mutations lead to the break of immune tolerance and/or increased immune reactivity, which drives viral evolution from a low- to a high-level positive selection stage that in turn leads to decreased HBV DNA levels because of increased immune pressure and less efficient viral replication. This increased sequence diversity in the intra-host quasispecies in seroconverters prior to the loss of HBeAg was found to be largely as a result of sequence variation in the precore/core open reading frame and its regulatory elements (Harrison et al., 2011). This sequence variation in the region encoding HBeAg can influence the immunomodulatory role of HBeAg, the ability of HBV to persist and be transmitted, as well as its virulence. The driving forces of viral evolution during the different stages of HBV infection are complex and difficult to present in a linear fashion. The balance and trade-offs of the immunomodulation and immune escape of HBeAg can be influenced by the different genotypes/subgenotypes.

### SHORT SIGHTED EVOLUTION, TRANSMISSIBILITY AND CLINICAL MANIFESTATION RELATED TO HBeAG EXPRESSION

Broadly, HBeAg-negativity is preferable at the individual host level whereas HBeAg-positivity is more favourable at the population level (Milich and Liang, 2003). As the natural course of HBV infection progresses from the HBeAg-positive (high replicative/low inflammatory) phase to the HBeAg-negative (chronic hepatitis) phase, T cell tolerance ends as HBeAg expression changes from its secreted soluble form to the cytosolic form. As an immunotolerogen and a decoy, secreted HBeAg is beneficial during the early phases of the infection, whereas the cytosolic form represents a liability, considering that this form is processed and presented by MHC class I molecules. Furthermore, HBeAg expressed in vivo is superior to HBcAg as a target, directly or indirectly, of HBcAg/HBeAg-specific CTLs (Frelin et al., 2009). This differential CTL recognition of HBeAg, expressed by wild-type virus, may result in the preferential elimination of wild-type relative to HBeAg-negative HBV (Frelin et al., 2009). As a consequence, viral diversity increases during the course of infection, with the emergence of strains defective in HBeAg expression, which can escape the immune response and persist, albeit at the cost, in some cases, of reduced replication, decreased transmissibility and increased virulence (**Figure 5**). Thus, this adaptive evolution of HBV during the course of HBeAg seroconversion can be considered short-sighted when it limits the ability of the virus to transmit to new hosts, either by reducing per contact transmissibility or by reducing the contact rate of infected hosts because of increased virulence that can result in death (Lythgoe et al., 2017). The increased virulence can also be a result of the loss of immune tolerance. The ability of the different genotypes/subgenotypes to lead to HBeAg seroconversion can influence the degree of short-sighted evolution of HBV.

### RELATIONSHIP OF GENOTYPES AND SUBGENOTYPES TO SHORT-SIGHTED EVOLUTION

As discussed above, mutations in the BCP/precore region can affect HBeAg expression at the transcriptional, translational and post-translational levels, and these mutations do not develop in all genotypes/subgenotypes at the same frequency (**Table 1**, **Figures 3**, **4**). Subgenotype A2 and genotype H develop only the 1762T/1764A mutations, which affect transcription of precore mRNA and result in decreased levels of HBeAg (Buckwold et al., 1996; Baptista et al., 1999), without switching off HBeAg expression. Thus, these (sub)genotypes, which have been shown to have a relatively high frequency of HBeAg-positivity (Tanaka et al., 2004), are expected to have a long HBeAg-positive, high replicative, low inflammatory phase. This allows HBV to be transmitted sexually, because the infection can remain asymptomatic and the host healthy until sexual maturity (Araujo et al., 2011), before the development of cirrhosis and HCC. In fact, in a study carried out in the USA, genotype A (subgenotype A2 predominates over A1 in the USA) was predominantly transmitted via sexual (82%) or parenteral (54%) routes compared to genotypes B and C, which were transmitted vertically (Chu et al., 2003). On the other hand, the persistence of 1762T/1764A mutants expressing HBeAg, even at relatively low levels, means that the hepatocytes expressing HBeAg will be targeted, leading to liver injury and the consequent development of cirrhosis and HCC much later in life. A high prevalence of 1762T/1764A has been detected in HBV strains from HCC patients compared to asymptomatic carriers of HBV (Baptista et al., 1999).

G1896A is the classical HBeAg-negative mutation: it affects the translation of HBeAg by introducing a stop codon that truncates the HBeAg precursor, so that HBeAg is not expressed. Genotypes/subgenotypes with 1858T can develop G1896A HBeAg-negative mutants (**Table 1**) and HBeAg-negative chronic hepatitis. Loss of HBeAg may be a sign of short-sighted evolution because the tolerogenic ability of HBeAg is lost, and HBeAg-negative virions are less transmissible and cannot establish chronic hepatitis. Establishment of chronic HBV infection following perinatal transmission requires wild-type HBV, whereas transmission of mixed wild-type/G1896A quasispecies leads to early immune elimination and resolution of the acute infection (Raimondo et al., 1993). G1896A occurs more frequently in genotype D than in genotype E strains (**Figure 4**), despite both having 1858T. This means that women of childbearing age infected with genotype E will remain HBeAg-positive for longer, allowing for mother-to-child transmission, which has resulted in the high prevalence and geographical restriction of genotype E to Africa and to African emigrants to other regions (Mulders et al., 2004). Vertical transmission is advantageous in restricted human populations (Li et al., 2017). The higher HBeAg-positivity seen in individuals infected with genotype E compared to those infected with genotype D could confer tolerance and less serious clinical manifestations than genotype D (Yousif et al., 2013). Genotype E was found to prevail in Sudanese blood donors (Mahgoub et al., 2011), whereas the liver disease patients were infected with genotype D (Yousif et al., 2013). High ratios (>50%) of G1896A in the HBeAg-positive phase in individuals infected with genotype D are accompanied by persistently high viraemia and ALT elevation after anti-HBe seroconversion and a higher risk of cirrhosis (Chu et al., 2002).
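The link between position 1858 and the G1896A stop codon can be made concrete with a small sketch. Two points in it are background from the HBV literature and are not spelled out in the text above: nucleotide 1896 is the middle base of precore codon 28 (TGG, tryptophan), and positions 1858 and 1896 face each other in the stem of the encapsidation signal, so the residue at 1858 determines whether G1896A strengthens or disrupts that base pair. The function names are hypothetical, and bases are written in DNA notation (T stands for U in the RNA).

```python
# Only the two codons needed for this example are listed.
CODONS = {"TGG": "Trp", "TAG": "Stop"}

WATSON_CRICK = {frozenset("AT"), frozenset("GC")}
WOBBLE = {frozenset("GT")}  # G-U wobble pair in the RNA stem


def pair_type(base_1858: str, base_1896: str) -> str:
    """Classify the 1858/1896 base pair (DNA notation)."""
    pair = frozenset((base_1858, base_1896))
    if pair in WATSON_CRICK:
        return "Watson-Crick"
    if pair in WOBBLE:
        return "wobble"
    return "mismatch"


def effect_of_g1896a(base_1858: str) -> dict[str, str]:
    """Describe the codon change and the stem pairing before and after G1896A."""
    wild_type_codon = "TGG"  # precore codon 28; 1896 is the middle base
    mutant_codon = wild_type_codon[0] + "A" + wild_type_codon[2]
    return {
        "codon_change": f"{CODONS[wild_type_codon]} -> {CODONS[mutant_codon]}",
        "pair_before": pair_type(base_1858, "G"),
        "pair_after": pair_type(base_1858, "A"),
    }


for base in ("T", "C"):
    print(f"1858{base}:", effect_of_g1896a(base))
```

With 1858T the mutation turns a wobble pair into a Watson-Crick pair, whereas with 1858C it replaces a Watson-Crick pair with a mismatch, which is consistent with G1896A being rare in genotypes/subgenotypes that carry 1858C.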

Genotype G is an HBeAg-negative strain because of a premature stop codon at position 2 of the precore precursor protein and G1896A. Although genotype G is replication competent on its own (Li et al., 2007), it is always found as a co-infection with either subgenotype A2 or genotype H because the absence of the immunotolerogenic properties of HBeAg does not allow it to establish a persistent infection (Li et al., 2007). Like subgenotype A2 (**Figure 3**), genotype H develops 1896A infrequently and therefore can supply HBeAg in trans following co-infection. Genotype G can be transmitted and propagated without the helper virus and has been shown to replace genotype A in the HBeAg-negative phase of infection (Li et al., 2007), but it requires the presence of a helper virus expressing HBeAg in the early phases of infection in immunocompetent hosts (Kato et al., 2002). The absence of immunomodulatory HBeAg means that genotype G alone can only infect immunocompromised individuals (Li et al., 2007). This is the reason why this genotype is relatively rare and is transmitted by high-risk transmission chains, including men who have sex with men (MSM), intravenous drug users (IVDUs) and blood transfusions.

Subgenotype A1 is the only strain that develops G1862T, and HBV strains with G1862T have been isolated from tumorous but not from adjacent non-tumorous liver tissue (Kramvis et al., 1998). Thus, patients infected with subgenotype A1 carrying the characteristic suboptimal Kozak sequence preceding the precore start codon, together with the 1762T/1764A and 1862T mutations, would have severely decreased levels of HBeAg. The reduction or absence of HBeAg in the serum would result in the loss of immunotolerance and in the immune response being directed to the hepatocytes because of the reduction of soluble HBeAg, which acts as a decoy. This, together with the increased ER stress, can result in liver damage, thus contributing to the higher hepatocarcinogenic potential of this subgenotype (Kramvis and Kew, 2007). In individuals infected with subgenotype A1, HCC develops 6.5 years earlier than in individuals infected with other (sub)genotypes (Kramvis and Kew, 2007). The early HBeAg seroconversion and the development of HCC before the age of 30 mean that this subgenotype is characterized by the highest degree of short-sighted evolution, as its mode of transmission is limited to early horizontal transmission and a short HBeAg-positive phase. This could be the reason why it is the only subgenotype that can develop the unique mutations that affect HBeAg expression, Kozak 1809–1812 and G1862T, at the translational and post-translational levels, respectively.

## KNOWLEDGE GAPS AND FUTURE STUDIES

There is a paucity of studies on the natural history of subgenotype A1 of HBV, which prevails in Africa as opposed to subgenotype A2 found outside Africa. Moreover, there are no studies on the immune response to this particular strain of HBV. Further studies on the infection by this subgenotype in ethnic groups of non-African descent are important to determine whether the host genetic background influences its natural history. A relatively high prevalence of subgenotype A1 was found in HCC patients in southern India, and similarly to African studies, there was an association of subgenotype A1 with HCC and its development at a younger age (Gopalakrishnan et al., 2013). These studies are important and relevant considering the recent human migrations from Africa, which can lead to the dispersal of the strain globally.

## CONCLUSION

HBV is a very successful, highly evolved virus for a number of reasons. It has evolved in such a way that it can be transmitted by various modes: in utero, perinatally, horizontally early in life, and sexually. Various (sub)genotypes have preferred mode(s) of transmission. HBV can establish chronic infection and, even though this may have serious consequences, including the development of cirrhosis and HCC that may limit mobility and hence sexual transmission, these arise later in life, long after the perinatal and neonatal transmission of the virus. Loss of HBeAg may be a sign of short-sighted evolution because there is loss of the tolerogenic ability of HBeAg and HBeAg-negative virions are less transmissible. The different genotypes/subgenotypes exhibit different degrees of short-sighted evolution.

#### REFERENCES


### AUTHOR CONTRIBUTIONS

AK conceptualized and wrote the first draft of the paper and carried out the mutation analysis. DP contributed to the writing and editing of the paper. E-GK carried out the phylogenetic analyses and prepared **Figures 1**, **2**. AH contributed to the writing and editing of the paper. All authors listed have made a substantial and intellectual contribution to the work, and read and approved the final manuscript for submission.

### ACKNOWLEDGMENTS

Research was supported by the Cancer Association of South Africa, Deutsche Forschungsgemeinschaft (German Research Foundation), Japan Society for the Promotion of Science, National Research Foundation of South Africa, Poliomyelitis Research Foundation, South African Medical Research Council, University of the Witwatersrand and the Hellenic Scientific Society for the Study of AIDS and STDs.


subtypes (Ae and Aa) of hepatitis B virus genotype A. J. Gen. Virol. 85(Pt 4), 811–820. doi: 10.1099/vir.0.79811-0


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kramvis, Kostaki, Hatzakis and Paraskevis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Hepatitis B Virus Adaptation to the CD8+ T Cell Response: Consequences for Host and Pathogen

*Sheila F. Lumley<sup>1,2</sup>, Anna L. McNaughton<sup>1</sup>, Paul Klenerman<sup>1,2,3</sup>, Katrina A. Lythgoe<sup>4</sup> and Philippa C. Matthews<sup>1,2,3</sup>\**

*<sup>1</sup>Medawar Building for Pathogen Research, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom, <sup>2</sup>Department of Infectious Diseases and Microbiology, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, United Kingdom, <sup>3</sup>Oxford BRC, John Radcliffe Hospital, Oxford, United Kingdom, <sup>4</sup>Nuffield Department of Medicine, Big Data Institute, University of Oxford, Oxford, United Kingdom*

#### *Edited by:*

*Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece*

#### *Reviewed by:*

*Antonio Bertoletti, Duke-NUS Medical School, Singapore; Joerg Timm, Heinrich-Heine Universität Düsseldorf, Germany; Robert Thimme, Albert Ludwigs Universität Freiburg, Germany*

*\*Correspondence:*

*Philippa C. Matthews philippa.matthews@ndm.ox.ac.uk*

#### *Specialty section:*

*This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology*

*Received: 12 April 2018 Accepted: 25 June 2018 Published: 16 July 2018*

#### *Citation:*

*Lumley SF, McNaughton AL, Klenerman P, Lythgoe KA and Matthews PC (2018) Hepatitis B Virus Adaptation to the CD8+ T Cell Response: Consequences for Host and Pathogen. Front. Immunol. 9:1561. doi: 10.3389/fimmu.2018.01561*

Chronic viral hepatitis infections are a major public health concern, with an estimated 290 million individuals infected with hepatitis B virus (HBV) globally. This virus has been a passenger in human populations for >30,000 years, and remains highly prevalent in some settings. In order for this endemic pathogen to persist, viral adaptation to host immune responses is a prerequisite. Here, we focus on the interplay between HBV infection and the CD8+ T cell response. We present the evidence that CD8+ T cells play an important role in control of chronic HBV infection and that the selective pressure imposed on HBV through evasion of these immune responses can potentially influence viral diversity, chronicity, and the outcome of infection, and highlight where there are gaps in current knowledge. Understanding the nature and mechanisms of HBV evolution and persistence could shed light on differential disease outcomes, including cirrhosis and hepatocellular carcinoma, and help reach the goal of global HBV elimination by guiding the design of new strategies, including vaccines and therapeutics.

Keywords: hepatitis B virus, evolution, adaptation, diversity, CD8+ T cells, adaptive immunity, human leukocyte antigen

#### INTRODUCTION

Within hosts, viruses with high mutation rates can rapidly adapt to the selection pressures placed upon them, including natural and vaccine induced immune responses, and antiviral therapy. Hepatitis B virus (HBV) represents a substantial international public health challenge, with an estimated 290 million people chronically infected globally (1). In this review, we explore the evidence for HBV escape from the CD8+ T cell response and examine the influence this process could have on infection outcomes.

**Abbreviations:** C, HBV core gene; CARs, chimeric antigen receptors; cccDNA, covalently closed circular DNA; GWAS, genome wide association studies; HBV, hepatitis B virus; HBcAg, hepatitis B core antigen; HBeAg, hepatitis B e antigen; HBsAg, hepatitis B surface antigen; HCC, hepatocellular carcinoma; HLA, human leukocyte antigen; KIR, killer-cell immunoglobulin-like receptors; NLG, N-linked glycosylation; ORF, open reading frame; P, HBV polymerase gene; PBL, peripheral blood lymphocyte; RT, reverse transcriptase; S, HBV surface gene; SNP, single-nucleotide polymorphism; TCR, T cell receptor; MHC, major histocompatibility complex; HIV, human immunodeficiency virus; HCV, hepatitis C virus.

Hepatitis B virus belongs to the Hepadnaviridae family of small, enveloped, primarily hepatotropic viruses. At only 3,200 bp, HBV has one of the smallest genomes of all known pathogenic viruses. The partially double-stranded DNA (dsDNA) circular genome consists of four genes, X, Polymerase (P), Core (C), and Surface (S), and a high proportion of the genome is encoded on overlapping open reading frames (**Figure 1**). In the nucleus of the infected hepatocyte, the partially dsDNA genome is "completed" to form a fully dsDNA molecule, which is subsequently supercoiled to form covalently closed circular DNA (cccDNA). The cccDNA is transcribed into pregenomic RNA, which is then reverse transcribed by HBV reverse transcriptase (RT), an enzyme that lacks 3′–5′ exonuclease proof-reading capacity and therefore introduces mutations into the HBV genome during each round of replication [in duck hepadnavirus, the mutation rate is estimated at between 0.8 × 10<sup>−5</sup> and 4.5 × 10<sup>−5</sup> substitutions per nucleotide per replication (2)]. The mutations generated result in a viral quasispecies, composed of dominant genotype(s) surrounded by clouds of closely related HBV variants.
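The per-nucleotide mutation rate quoted above can be translated into a per-genome figure with simple arithmetic. The sketch below does this under two assumptions that are not made in the text: that the duck hepadnavirus estimate carries over to a 3,200 bp HBV genome, and that substitutions occur independently and uniformly across sites. The function names are hypothetical.

```python
GENOME_LENGTH_BP = 3200  # approximate HBV genome length quoted in the text

# Duck hepadnavirus estimates (substitutions per nucleotide per replication),
# applied to HBV here purely as an illustrative assumption.
RATE_LOW, RATE_HIGH = 0.8e-5, 4.5e-5


def expected_mutations_per_genome(rate: float) -> float:
    """Mean number of substitutions introduced per genome copy."""
    return rate * GENOME_LENGTH_BP


def fraction_with_mutation(rate: float) -> float:
    """Probability that a genome copy carries at least one substitution."""
    return 1.0 - (1.0 - rate) ** GENOME_LENGTH_BP


for label, rate in (("low", RATE_LOW), ("high", RATE_HIGH)):
    print(
        f"{label}: {expected_mutations_per_genome(rate):.4f} mutations/genome, "
        f"{100 * fraction_with_mutation(rate):.1f}% of copies mutated"
    )
```

Under these assumptions roughly 3% to 13% of newly copied genomes would carry at least one substitution, which illustrates how a small per-site error rate can still generate a diverse quasispecies.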

The error-prone RT, coupled with high rates of HBV replication [estimated at between 200 and 1,000 virions/hepatocyte/day at the peak of infection (3)], results in the production of a large number of virions harboring mutations. The vast majority of mutations are likely to be deleterious, some are neutral, and a minority provide the virus with a potential selective advantage, such as escape from CD8+ T cell-mediated responses. However, HBV polymorphisms are constrained by the overlapping reading frame structure of the genome, since the majority of mutations can simultaneously affect multiple genes [these have been described as "mirror" mutations (4), **Figure 2**]. Mutations that are neutral or beneficial for one protein might be detrimental for another. Accordingly, overlapping regions of the HBV genome generally have less diversity compared to non-overlapping regions (5) and the within-host rate of evolution at overlapping regions is about half that of non-overlapping regions (6).
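The constraint imposed by overlapping reading frames can be shown with a toy example in which one nucleotide substitution is read in two frames. The ten-nucleotide sequence below is hypothetical and is not taken from the HBV genome; the codon table is the standard genetic code, and the function name is an illustrative choice.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int) -> str:
    """Translate all complete codons of seq starting at offset frame."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical toy sequence read in two overlapping frames (0 and +1).
wild_type = "ATGGCTGAAT"
mutant = wild_type[:5] + "C" + wild_type[6:]  # single T -> C substitution

for name, seq in (("wild-type", wild_type), ("mutant", mutant)):
    print(f"{name}: frame 0 = {translate(seq, 0)}, frame +1 = {translate(seq, 1)}")
```

In this toy case the substitution is synonymous in frame 0 but changes an amino acid in frame +1, which is the kind of coupling that limits the mutations HBV can tolerate in overlapping regions.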

Current vaccination and treatment approaches are hindered by poor diagnosis and access to treatment, drug and vaccine escape mutants, viral rebound on treatment cessation or immunosuppression, and lack of curative therapy (8). To make a significant impact on HBV prevalence, parallel improvements in diagnostics, treatment, and prevention are required; ultimately, new immunotherapeutic strategies may be key to the success of elimination. Developing a more robust picture of the extent, nature, and significance of the interplay between the virus and the host CD8+ T cell response is an important avenue of enquiry, enabling us to predict and tailor therapeutic interventions that may be beneficial in mediating control or clearance of chronic infection.

A robust body of data has been assimilated over the past few decades for HIV and HCV, informing significant understanding of the nature and impact of CD8+ T cell-mediated immune control and escape (**Table 1**). For HBV, there is a relative paucity of such evidence but the field could be advanced by similar approaches. We have therefore set out to assimilate the evidence for viral adaptation to the host CD8+ T cell response in HBV infection, and to consider the significance of this adaptation both to viral fitness and function, and to host outcomes. Finally, we highlight gaps in our current understanding and knowledge, in order to provide foundations for ongoing research efforts.

### THE IMMUNOLOGICAL BASIS FOR ESCAPE

Acute and chronic HBV infections are associated with functionally different CD8+ T cell responses (**Table 1**). Acute, self-resolving infections are characterized by functionally efficient, multi-specific antiviral CD8+ T cell responses which are sustained after viral clearance (9). Both non-cytolytic and cytolytic mechanisms have been implicated (22). In contrast, chronic infection is typically characterized by a lack of protective T cell memory maturation and exhausted HBV-specific CD8+ T cell responses (22–24).

Th1-polarized CD4+ T cells regulate and maintain CD8+ T cell responses and contribute to HBV clearance (80). Genome wide association studies (GWAS) have linked a range of human leukocyte antigen (HLA) class II alleles with disease outcomes. CD4+ responses are associated with vaccine responses (81) and clearance of acute infection (82, 83). Host HLA class II genotype has also been linked to treatment response (84) and to risk of developing hepatocellular carcinoma (HCC) (85). CD4+ CD25+ regulatory T cells suppress the activation, proliferation, and interferon-γ production of both CD4+ and CD8+ T cells in chronic HBV infection (86, 87).

The highly polymorphic HLA class I genes are thought to be an important host factor for viral control, contributing to differences in HBV outcome observed globally. Host HLA polymorphisms and different HBV genotypes have been demonstrated to influence the rate of disease progression and the long-term outcome of HBV infection (66, 88, 89). However, HBV can subvert multiple steps of the CD8+ T cell antigen processing and presentation pathway to evade detection by the host (**Figure 3**, boxes 1–5). Thus, while all individuals with chronic HBV infection are at risk of increased progression to cirrhosis and HCC, individual outcomes depend on the interplay between host, viral, and environmental factors. In addition to HLA genes, other factors are implicated in disease outcome including age and duration of infection, other host genetic factors (90), and exposure to hepatotoxins (91).

### MECHANISMS OF HBV ESCAPE FROM CD8**+** T CELL RESPONSES

#### Antigen-Processing Escape Mutants

The amino acids flanking viral epitopes are important for effective antigen processing; mutations in these regions may impair proteasomal processing of the epitopes and are recognized in both HCV and HIV as a mechanism of CD8+ T cell escape (100, 101). Likewise, mutations altering the processing of HBV epitopes could be relevant for HBV escape from the CD8+ T cell-mediated immune response (**Figure 3**, box 1); however, none have been identified at present, potentially due to the focus on mutations lying within HLA-restricted epitopes rather than in the flanking regions.

Table 1 | Strands of evidence for the significance of the CD8+ T cell response in control/clearance of infection with blood-borne viruses.

*a The citations within this table aim to provide a robust overview of the evidence, using a combination of strong examples from the primary literature together with selected review articles that summarize specific aspects of this topic.*

*HBV, hepatitis B virus; HBeAg, hepatitis B e antigen; HLA, human leukocyte antigen; GWAS, Genome wide association studies; SNPs, single-nucleotide polymorphisms; MHC, major histocompatibility complex; HCV, Hepatitis C virus; HIV, Human immunodeficiency virus.*

### Virus Peptides Regulate Surface HLA Expression

Virus-induced changes in HLA class I surface expression play an important role in viral pathogenesis and persistence (**Figure 3**, box 2). CD8+ T cells recognize HBV-infected hepatocytes through presentation of HLA class I HBV epitopes on the cell surface (13); however, this expression can be upregulated or downregulated. Decreased presentation of class I MHC molecules on hepatocytes and lymphoid cells is described in the woodchuck hepatitis virus model (40). Changes in surface HLA have also been described in human HBV infection (37); for example, lower HLA class I expression has been associated with hepatitis B e antigen (HBeAg)-positive vs HBeAg-negative status (38, 39). Interestingly, these studies are three decades old and have not been replicated in the more recent literature. Downregulation of HLA class II molecules by pre-core mutants has also been described during chronic HBV infection (102). Mutations altering the processing or presentation of HBV HLA class I epitopes, although not conclusively demonstrated, could hypothetically be relevant for escape from the CD8+ T cell-mediated immune response in human infection (**Figure 3**, box 2).

### Selective Mutation of HLA-Binding Residues

Immune escape by selective mutation of HLA-binding residues within HBV CD8+ epitopes is one of the most commonly identified mechanisms for viral CD8+ immune escape (**Table 1**; **Figure 3**, box 3). Evidence of this escape mechanism in HBV has emerged through the identification of HLA class I "footprints" (94), mutations that are significantly enriched in patients with certain HLA class I alleles. Older literature was conflicting regarding the frequency and significance of such footprints in HBV (96, 103), but HLA footprints have subsequently been identified in all four HBV genes and mapped to known or predicted HLA epitopes [**Table 1** (53)]. In some cases, these mutations result in altered peptide-HLA binding scores, providing a plausible mechanism for HBV immune escape, but have so far only been identified using cross-sectional data (54, 55, 96). However, the pattern of escape is consistent across populations with divergent HLA haplotypes and different HBV genotypes, for example, genotypes B and C in a cohort of Chinese-origin patients (56), New Zealand-resident Tongans with chronic HBV genotype C3 infections (57) and Iranian patients with genotype D infection (58). Importantly, core mutants in patients with chronic genotype A or D infection have been confirmed to impair CD8+ T cell IFN-γ secretion *in vitro* (54), indicating that these mutants could play an *in vivo* role in immune escape. These studies are limited by the sequencing methods (in which a predefined number of variants is typically selected for cloning and sequencing), which reduce sensitivity for detecting low-abundance variants compared to newer ultra-deep sequencing methods.
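The notion of an HLA "footprint" as a mutation enriched in carriers of a given allele can be illustrated with a one-sided Fisher's exact test on a 2×2 table. The following is a minimal sketch only: the function name and the counts are hypothetical, and the cited studies may have used different statistical approaches and corrections for multiple testing.

```python
from math import comb


def footprint_enrichment_p(mut_carriers, wt_carriers, mut_noncarriers, wt_noncarriers):
    """One-sided Fisher's exact p-value for enrichment of a mutation among
    carriers of an HLA allele (hypothetical helper, illustration only).

    The four arguments are the cells of a 2x2 table:
    carriers with / without the mutation, non-carriers with / without it.
    """
    carriers = mut_carriers + wt_carriers
    mutants = mut_carriers + mut_noncarriers
    total = carriers + mut_noncarriers + wt_noncarriers
    denominator = comb(total, carriers)
    p_value = 0.0
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    for k in range(mut_carriers, min(carriers, mutants) + 1):
        p_value += comb(mutants, k) * comb(total - mutants, carriers - k) / denominator
    return p_value


# Hypothetical counts: the mutation is seen only in carriers of the allele.
p = footprint_enrichment_p(3, 0, 0, 3)
```

With cohorts this small the test has little power, which is one reason the small datasets discussed in this review can yield conflicting results.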

Further studies have focused on identifying regions of the HBV genome with high within-host nucleotide diversity, and high rates of nonsynonymous substitutions, to determine which regions may be under strong HLA-mediated selection pressure. In a longitudinal study of eight HBeAg-negative asymptomatic HBV carriers, followed over 25 years, the ratio of synonymous to nonsynonymous mutations (dS/dN) in the core gene was low, suggesting high rates of positive selection, although no specific HLA-restricted epitopes were identified (104). Another study demonstrated low dS/dN ratios in patients with sustained vs unsustained viral control following treatment, with specific surface antigen polymorphisms lying within HLA class I epitopes identified [sV14G, sF20S, sT45I, sI213L (105)], although this was not confirmed either with HLA genotyping or through demonstration of a functional impact on T cell recognition.
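To make the synonymous/nonsynonymous comparison concrete, the sketch below counts codon differences between two aligned coding sequences and classifies each as synonymous or nonsynonymous using the standard genetic code. It is a deliberately crude illustration under stated assumptions: sequences are in-frame, ungapped, and of equal length, the function name and example sequences are hypothetical, and the per-site normalization used in formal dS/dN estimates is omitted.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(reference, variant):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Hypothetical helper: raw counts only, not normalized by the number of
    synonymous and nonsynonymous sites.
    """
    reference, variant = reference.upper(), variant.upper()
    if len(reference) != len(variant) or len(reference) % 3 != 0:
        raise ValueError("sequences must be aligned, in-frame, and equal in length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(reference), 3):
        ref_codon, var_codon = reference[i:i + 3], variant[i:i + 3]
        if ref_codon == var_codon:
            continue
        if CODON_TABLE[ref_codon] == CODON_TABLE[var_codon]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical example: CTT>CTC keeps leucine, GAA>GAT changes Glu to Asp.
syn, nonsyn = count_codon_changes("ATGCTTGAA", "ATGCTCGAT")
```

A low ratio of synonymous to nonsynonymous changes, once properly normalized, is the pattern the studies above interpret as evidence of positive selection.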

## Epitope Masking With N-Linked Glycosylation (NLG)

N-linked glycosylation is a post-translational modification that plays an established role in the antigenicity and infectivity of viruses (106, 107). NLG can mask immunogenic epitopes, interfering with antibody recognition of hepatitis B surface antigen (HBsAg), leading to immune and diagnostic escape (32, 95). It can also impact on HBV virion secretion, likely by altering the ability of envelope proteins to interact with the capsid surface (108–110). The number of NLG sites correlates with disease state, with increased NLG reported in patients with reactivated HBV vs chronic infection, and in those with sustained vs unsustained response off treatment (32, 111). Although the role of NLG in HBV evasion of CD8+-mediated T cell immunity is yet to be determined, it could potentially provide another strategy for immune escape by interfering with the binding of an HBV epitope to an HLA molecule, or the binding of an HLA-antigen complex to a cognate T cell receptor (TCR) (**Figure 3**, box 4).
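Counting potential NLG sites in a protein sequence is commonly approximated by scanning for the consensus sequon N-X-S/T, where X is any residue except proline. This motif is general biochemical background rather than something stated in this review, and the sketch below is a hedged illustration: the function name and the example peptide are hypothetical, and a matched sequon is only a candidate site, not evidence of glycosylation.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_nlg_sequons(protein):
    """Return 1-based positions of asparagines in N-X-S/T sequons (X != P)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Hypothetical peptide: NGS and the overlapping NNT/NTS match, NPT does not.
sites = find_nlg_sequons("MNGSANPTNNTS")
```

Comparing the number of candidate sites between sequences from different disease states would be one simple way to explore the correlation described above.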

### Alteration of TCR Recognition

Engagement of the TCR with HLA class I/peptide complexes on antigen-presenting cells is key to activating CD8+ T cells; therefore, mutations in the TCR contact residues of an epitope can lead to immune escape. Immunodominance of viral epitopes is not simply determined by the amino acid sequence of the peptide and its binding affinity, but also depends on the peptide concentration and T cell clone, with the same HBV peptide able to induce different signaling cascades in different CD8+ T cell clones (97). Antagonist functions may provide HBV with a means of immune escape (**Figure 3**, box 5). Specifically, certain CD8+ T cell epitopes in hepatitis B core antigen (HBcAg) (97) and HBsAg (98) act as TCR antagonists, binding the TCR and inhibiting the CD8+ T cell response. HBeAg may promote HBV chronicity by inducing CD8+ T cell tolerance. However, the underlying mechanisms driving this immune state in humans remain to be elucidated. Indeed, the mechanism may not involve presentation of an HLA class I-restricted epitope, as currently no epitopes have been identified that are unique to the pre-core sequence of HBeAg (a ~29 amino acid stretch not shared with HBcAg) (112).

Chronic HBV infection is characterized by an exhausted CD8+ T cell phenotype associated with reduced cytotoxic activity and enhanced expression of inhibitory markers. TCR binding in the presence of high HBsAg levels induces T cell exhaustion, characterized by poor effector cytotoxic activity, impaired cytokine production and sustained expression of multiple inhibitory receptors. A hierarchy of co-inhibitory receptors, dominated by PD-1, acts synergistically to promote CD8+ tolerance. The degree of T cell impairment also depends on suppressive cytokines, interaction with other T cell subsets, and stage of T cell differentiation (113–117). T cell exhaustion is (at least partly) reversible; blockade of inhibitory receptors including PD-1 (26, 117), CTLA-4 (27), and Tim-3 (29) partly improves HBV-specific CD8+ T cell function *in vitro*. In addition, therapy with nucleot(s)ide analogs may lead to a modest reconstitution of HBV-specific T cell function (118). Although this restoration is transient (119), these CD8+ T cells can be associated with viral control upon therapy cessation (120).

## ESCAPE OVER SPACE AND TIME

### Kinetics of Escape

Evidence of immune-mediated selection has been found in HBV infection (53–56), although the kinetics of immune escape are yet to be robustly delineated. Longitudinal samples from the same individuals form the ideal dataset to address questions about the changes in viral sequence and diversity over time, but this has rarely been undertaken for HBV infection. One longitudinal study of HBV evolution following acquisition from a single source demonstrated an expansion and contraction of HBV diversity, with maximum diversity coinciding with peak viremia, and a predominance of nonsynonymous mutations with greatest diversity in the core gene (3). Further longitudinal data are required to unpick the timing and kinetics of viral evolution.
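One simple way to quantify the expansion and contraction of viral diversity in longitudinal samples is the mean pairwise difference per site among aligned sequences from each time point. The sketch below is an assumption-laden illustration, not the method of the cited study: the function name and toy sequences are hypothetical, and gaps, ambiguous bases, and sequencing error are ignored.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean pairwise proportion of differing sites among aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if length == 0 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(sequences, 2))
    # Hamming distance summed over all pairs of sequences.
    differences = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return differences / (len(pairs) * length)


# Hypothetical toy alignment sampled at a single time point.
diversity = nucleotide_diversity(["AAAA", "AAAT", "AATT"])
```

Applying the same calculation to each time point in turn would give a diversity trajectory that can be compared against viral load.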

An area of HBV kinetics that has received some attention is the scenario of HBeAg loss. It is hypothesized that the change from HBeAg-positive to HBeAg-negative occurs by one of two mechanisms:

(i) Antibody-mediated control (121, 122) usually associated with low HBV DNA levels. This situation is most likely to be characterized by low viral sequence diversity, although the low viral loads make this difficult to study given the limits of sensitivity of next generation sequencing approaches.

(ii) Selection of pre-core and promoter mutations (123), reducing or eliminating HBeAg production. In this case, HBeAg-negative status is associated with an increase in evolutionary rate and therefore with increased sequence diversity (6, 54, 56, 57, 104, 111, 124–126). The cause/effect relationship between the increased evolutionary rate and the shift in immune activity is unclear. The higher viral mutation rate could lead to the occurrence of stochastic mutations, generating new T cell epitopes that disrupt immune tolerance, or could be the consequence of increased immune reactivity driving escape mutants.

#### Compartment-Specific Evolution

Compartment-specific evolution has been described for chronic viruses, including HIV (127, 128) and HCV (129, 130), although the evidence for HBV is very limited to date. The practical barriers to sampling tissue compartments longitudinally from the same patient make it difficult to assess the co-evolution of genetically distinct subpopulations over time [as is likely the case for HIV in the genital tract (131)].

Although *hepadnaviruses* are characteristically hepatotropic, HBV DNA is also found in a range of other tissues, including lymphatic cells. In the woodchuck model, life-long replication- and transmission-competent viruses persist in lymphocytes (132). However, it is difficult to demonstrate that hepatitis B virions isolated from different compartments in humans are replication and transmission competent, without a viable method of culturing autologous virus. There are some data to suggest that peripheral blood lymphocytes (PBLs) can support viral replication (133), but secretion of HBeAg and HBsAg from liver macrophages has not been detected [Lucifora, unpublished data, referred to in (134)].

HBV may undergo independent evolution in different tissue compartments, leading to compartmentalization of viral subpopulations (135–137); for example, HBV variants isolated from PBLs may be specifically adapted to this environment (137), potentially harboring relevant immune-escape mutants (135, 137). It has been hypothesized that compartment-specific mutants may serve as a source of reactivation or transmission; they have been implicated in reinfection post liver transplant (138), mother to child transmission (137, 139), fulminant hepatic failure in the context of HIV co-infection (140), and antiviral escape (141).

Further work is required to confirm whether HBV does harbor replication and transmission competent viruses in cells other than hepatocytes. If this is confirmed, understanding host-virus dynamics at the compartmental level, studying the emergence of immune and antiviral escape mutants and the factors contributing to persistence and transmission will be crucial for developing improved therapeutics for HBV control.

#### FUNCTIONAL IMPACT OF ESCAPE MUTATIONS ON HBV

The primary functional impact of mutations within HLA class I-restricted T cell epitopes is to alter the frequency and/or functionality of the CD8+ T cell immune response. These mutations may have additional impact on the viral replication cycle and treatment response in the following ways:


The full range of functional impacts of HBV CD8+ immune escape mutants has not been comprehensively explored. Understanding the functional impact of mirror and compensatory mutations that are associated with CD8+ T cell-mediated selection may lead to further insights into the host–virus interaction.

#### CLINICAL IMPACT OF VIRUS AND HOST POLYMORPHISMS ON HOST OUTCOME

#### Impact of HBV Mutations on Reactivation

Hepatitis B virus reactivation as a consequence of immunosuppression has emerged as an important issue across a wide range of clinical settings [as previously reviewed (149, 150)]. Reactivation is seen secondary to immunosuppressive therapy for cancer, in particular in the context of therapy with rituximab and fludarabine (33, 151), solid organ transplantation (150), bone marrow transplantation (152), and autoimmune disease [especially with infliximab treatment (153–155)], highlighting that HBV reactivation is associated with a general defect of HBV-specific T cell control. Reactivation has also been documented in immunocompetent patients despite the presence of neutralizing antibodies (156).

Specific mutations associated with HBV reactivation have been identified in both neutralizing antibody targets and T cell epitopes (32–34). In a study of 29 patients with HBV reactivation, 75% of HBV-reactivated patients (vs 3% of chronic HBV controls) carried HBsAg mutations localized in immune-active HBsAg regions, and 5 of 13 identified HBsAg mutations were localized in HLA-restricted T cell epitopes [either class I (sC48G, sV96A, sL175S, and sG185E) or class II (sS171F)] (32). This suggests that in addition to an iatrogenic trigger for reactivation during immunosuppressive therapy, viral sequence can be a contributory factor as a result of CD8+ immune escape mutants.

### Impact of Host HLA Class I Haplotype on HBV Infection Outcome

GWAS approaches have linked various single-nucleotide polymorphisms in the HLA class II region with a range of infection outcomes, but there is a lack of such robust evidence for the involvement of HLA class I genes (72, 73, 75). One study identified a relationship between class I HLA-A genotype and HBeAg status (66), suggesting a role for genes at this locus in control of infection. However, confirmation of HLA associations can be difficult due to the variability in study design and methodologies and the small, heterogeneous populations sampled. Furthermore, the mechanisms for these HLA class I associations with disease outcome are poorly understood. Differences in antigen presentation, TCR binding leading to changes in T cell activation, and altered cytokine production may be responsible, either individually or in combination. Effects of linkage disequilibrium with other important neighboring loci, such as HLA class II or killer-cell immunoglobulin-like receptor genes, cannot be excluded. Functional studies are required to determine the basis for these associations.

#### Impact of HBV Adaptation on Control Strategies

It is likely that to achieve elimination in line with global public health goals (157), new therapies targeting either the host immune system or the HBV replication cycle will be needed. Specific immunotherapies are under development, targeting both the innate and adaptive immune system, which aim to eliminate (or stably suppress) HBV replication (158).

T cell-based immune therapies are attractive options for HBV control (159). Strategies broadly take two approaches, either aiming to restore functionality and increase the quantity of existing defective host T cells with vaccines and checkpoint inhibitors, or to mimic the T cell response mounted during naturally resolving acute HBV infection by the adoptive transfer of HBV-specific T cells. Adoptive T cell therapy renders T cells HBV-specific by expression of natural HLA-restricted TCRs or HLA-independent chimeric antigen receptors on the T cell surface. Although natural TCRs have the advantage of activating the T cell response in a physiological way, therapy is potentially complicated by the need to match TCR to host HLA alleles, although some cross-reactivity may occur (160).

Given that HBV is able to evade natural immunity (32, 149, 156, 161, 162), vaccine-induced immunity (163, 164), and antiviral therapy (145, 165–169), it should be anticipated that HBV has the potential to mutate and escape from immunotherapeutic control. This is a vital consideration in the development of new HBV control strategies, and good knowledge of the full range of escape strategies should allow us to predict and potentially mitigate this. Care must be taken when developing T cell immunotherapies and polyepitope vaccines as immunodominance is a complex function of the nature and context of the epitope within the peptide, the TCR, the T cell clone, and the environment (98). The HBV literature is skewed toward the investigation of certain populations with specific HBV genotypes and HLA haplotypes, as highlighted in the "Hepitopes" database, a catalog of HLA class I epitopes in HBV, in which a disproportionate 44% of reported CD8+ T cell epitopes are HLA-A\*02 restricted (53). The effect of using a polyepitope vaccine or re-directing T cells against peptides presented by discordant HLA alleles needs to be considered. This might inadvertently occur by using a vaccine or T cell-based immunotherapy based on key epitopes from a different genotype to that prevalent in the population to which it is delivered, and may produce functionally incompetent T cells, unable to recognize the infectious virus strain when used, or potentially lead to immunopathology. Since knowledge about certain host/virus interactions is under-represented, further studies will be required to define the full range of CD8+ T cell epitopes presented by HLA alleles, the antiviral functions of the corresponding CD8+ T cells in each compartment, the potential for generation of immune escape mutants and the impact these have on the immune response.

Table 2 | Areas for future focus in determining the nature and characteristics of the CD8+ T cell response to hepatitis B virus (HBV).

#### CHALLENGES

The understanding of HBV escape from the CD8+ immune response is lagging behind that of HIV and HCV. The field struggles with a lack of comprehensive literature, small datasets that can lead to conflicting results, differences in approaches to classifying patient groups into poor/outdated descriptions of "phases" of infection, over-reliance on serostatus, and lack of longitudinal follow-up and deep sequence data. Establishing the role of compartmentalization in infection is complex, with clinical samples scarce due to the risk associated with liver biopsy. These challenges are exacerbated by an under-resourcing of clinical and research approaches in many of the settings where HBV is endemic (8).

### FUTURE FOCUS

There are many unanswered questions in the field of HBV and CD8+ immunity. In **Table 2**, we highlight gaps in our current understanding and knowledge, suggest desirable methods to develop, datasets to collate, and questions to be answered in order to provide foundations for ongoing research efforts.

#### SUMMARY

Hepatitis B virus is a complicated, unique virus, which has evolved together with *Homo sapiens* over millennia; it has evolved a range of mechanisms that favor transmission and persistence which include the capacity to evade the CD8+ T cell response. By focusing on understanding the evolutionary interplay between host and virus, we can develop better insights into areas where we can target viral "Achilles heels." The need for novel anti-HBV strategies should drive a deeper exploration of this host–pathogen interaction. Future research will be strengthened by comprehensive cross-sectional and longitudinal studies on HLA-typed hosts with clinical details, across a range of host ethnicities and HBV genotypes, with high quality serological and whole genome HBV deep sequencing data. This will provide a more comprehensive understanding of the nature and mechanisms of HBV evolution and persistence, helping us to reach the goal of global HBV eradication by guiding the design of new strategies, including vaccines and therapeutics.

### AUTHOR CONTRIBUTIONS

SL undertook the primary literature review and drafted the manuscript; all authors had substantial input into revisions.

#### FUNDING

SL is funded by the National Institute for Health Research. PM is funded by the Wellcome Trust (grant number 110110). KL is funded by the Wellcome Trust (grant number 107652/Z15/Z) and The Royal Society. PK is funded by the Wellcome Trust (WT109965MA) and National Institute for Health Research Biomedical Research Centre, Oxford.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling Editor declared a past co-authorship with one of the authors PK.

*Copyright © 2018 Lumley, McNaughton, Klenerman, Lythgoe and Matthews. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

*Timokratis Karamitros<sup>1,2</sup>\*, George Papatheodoridis<sup>3</sup>, Dimitrios Paraskevis<sup>4</sup>, Angelos Hatzakis<sup>4</sup>, Jean L. Mbisa<sup>5</sup>, Urania Georgopoulou<sup>6</sup>, Paul Klenerman<sup>7</sup> and Gkikas Magiorkinis<sup>1,4</sup>\**

#### *Edited by:*

*Linda F. Van Dyk, University of Colorado Denver, United States*

#### *Reviewed by:*

*Hridayesh Prakash, All India Institute of Medical Sciences, India; Sandip Chakraborty, College of Veterinary Sciences and Animal Husbandry, India*

#### *\*Correspondence:*

*Timokratis Karamitros tkaram@pasteur.gr; Gkikas Magiorkinis gmagi@med.uoa.gr*

#### *Specialty section:*

*This article was submitted to Microbial Immunology, a section of the journal Frontiers in Immunology*

*Received: 29 September 2017; Accepted: 28 March 2018; Published: 16 April 2018*

#### *Citation:*

*Karamitros T, Papatheodoridis G, Paraskevis D, Hatzakis A, Mbisa JL, Georgopoulou U, Klenerman P and Magiorkinis G (2018) Impact of Interferon-α Receptor-1 Promoter Polymorphisms on the Transcriptome of the Hepatitis B Virus-Associated Hepatocellular Carcinoma. Front. Immunol. 9:777. doi: 10.3389/fimmu.2018.00777*

*<sup>1</sup>Department of Zoology, University of Oxford, Oxford, United Kingdom, <sup>2</sup>Department of Microbiology, Public Health Laboratories, Hellenic Pasteur Institute, Athens, Greece, <sup>3</sup>Academic Department of Gastroenterology, Laiko General Hospital, Medical School of National and Kapodistrian University of Athens, Athens, Greece, <sup>4</sup>Department of Hygiene and Epidemiology and Medical Statistics, Medical School of National and Kapodistrian University of Athens, Athens, Greece, <sup>5</sup>Virus Reference Department, Public Health England, London, United Kingdom, <sup>6</sup>Department of Microbiology, Molecular Virology Laboratory, Hellenic Pasteur Institute, Athens, Greece, <sup>7</sup>Peter Medawar Building for Pathogen Research and Translational Gastroenterology Unit, University of Oxford, Oxford, United Kingdom*

Background and aims: Genetic polymorphisms within the promoter of interferon-α receptor type-1 (IFNAR1) have been associated with the susceptibility to and the outcome of chronic hepatitis B virus (HBV) infection. However, the impact of these polymorphisms in the transcriptome of the HBV-associated hepatocellular carcinoma (HCC) remains largely unexplored.

Methods: Using whole-genome and exome sequencing data from The Cancer Genome Atlas project, we characterized three single-nucleotide polymorphisms (SNPs: −568G/C, −408C/T, −3C/T) and one variable number tandem repeat [VNTR: −77(GT)n] within the IFNAR1 promoter sequence in 49 HCC patients. RNAseq data from 10 genotyped HCC samples were grouped according to their −77VNTR or −3SNP genotype to evaluate the impact of these polymorphisms on the differential expression on the HCC transcriptome.

Results: There is a fourfold higher impact of the −77VNTR on the HCC transcriptome compared to the −3SNP (*q* < 0.1, *p* < 0.001). The expression of the primary IFNAR1 transcript is not affected by these polymorphisms but a secondary, HCC-specific transcript is expressed only in homozygous −77VNTR ≤8/≤8(GT)n samples (*p* < 0.05). At the same time, patients carrying at least one −77VNTR >8(GT) allele presented a strong upregulation of the fibronectin-1 (FN-1) gene, which has been associated with the development of HCC. Gene Ontology and pathway enrichment analysis of the differentially expressed genes revealed a strong disruption of the PI3K–AKT signaling pathway, which can be partially triggered by the extracellular matrix FN-1.

Conclusion: The IFNAR-1 promoter polymorphisms are not involved in the expression levels of the main IFNAR-1 transcript. The −77VNTR has a regulatory role on the expression of a secondary, truncated, HCC-specific transcript, which in turn coincides with disruptions in cancer-associated pathways and in FN-1 expression modifications.

Keywords: hepatitis virus B, interferon-α receptor type-1 promoter, hepatocellular carcinoma, interferon-α receptor, polymorphism, transcriptome, RNAseq, the Cancer Genome Atlas project

#### INTRODUCTION

Worldwide, more than two billion people have been infected with hepatitis B virus (HBV) and approximately 250 million individuals are chronically infected (1). Infected patients can be inactive chronic HBV carriers (IC) (eAg-negative, eAb-positive, with low levels of HBV DNA and no evidence of liver inflammation) or present with progressive chronic hepatitis B (CHB) (2–5). HBV is responsible for >50% of hepatocellular carcinomas (HCCs) worldwide (2, 6, 7). On the other hand, HBeAg-negative ICs have a more benign prognosis with very low risk of cirrhosis or HCC, as indicated by long-term follow-up studies (8–10).

The genetic profile of the patient plays a substantial role in the clinical outcome of HBV infection (11–13). The virus also modulates cellular mechanisms and signal pathways during the course of the infection. For example, the persistent expression of the HBV x antigen (HBxAg) is correlated with the development of fibrosis and cirrhosis during CHB, as it can activate the fibronectin-1 (FN-1) gene through the induction of nuclear factor kappa B (NF-kappa B or NFkB) (14). FN is an omnipresent extracellular matrix (ECM) glycoprotein. Plasma FN and cellular FN have distinct properties and roles in the strictly regulated mechanism of tissue repair (15). FN is a very important component of the ECM and any dysfunction in the fibrinogenesis mechanisms can lead to the development of fibrotic disease (16).

Interferons (IFN)-α/β are cytokines involved in both innate and adaptive immune responses (17, 18), and thus play a pivotal role as cancer defense mechanisms (19). Interferon-induced signal transduction pathways represent a fine-tuned network of interactions, which are triggered upon binding of IFN to its receptor (IFNR) (20–22). IFNs are shown to regulate the transcription levels of more than 2,000 genes, which compose the *Interferome*, a gene-network created by integrating information collected from high-throughput experiments (23). As IFNs are anticancer and antiviral cytokines, it is expected that the genetic profiles of the genes involved in these signal transduction pathways (e.g., IFNR) will have an impact on the susceptibility of the patients to cancer and, when HBV is involved, to CHB and HBV-related HCC.

IFN-α is administered as part of the first-line therapies for CHB (2, 24). The compatible receptor (IFNAR) consists of two subunits, IFNAR-1 and IFNAR-2 (25). A number of polymorphisms in the promoter of the IFNAR-1 gene and also in its coding sequence have been described (26–29). Those observed in the promoter [three SNPs and one variable number tandem repeat (VNTR) at positions −568, −408, −3, and −77, respectively] are believed to affect the expression of the receptor and are associated with the clinical outcome of HBV infection (27, 30, 31).

We have previously shown that the same polymorphisms are associated with the clinical phase of HBeAg-negative chronic HBV infection in Caucasians. Briefly, patients with genotypes −568GC/CC, −408CT, and −3CT and patients with less than 8(GT) repeats in the −77VNTR were more frequent among inactive carriers (IC) vs. patients with HBeAg-negative CHB (32). Notably, the (GT)-repeats in the −77VNTR were strongly associated with the clinical outcome of the patients in our study; homozygotes carrying both alleles with ≤8(GT)-repeats, were more likely to be ICs, compared to those carrying both alleles with >8(GT) repeats (OR = 7.14, *p* ≤ 0.001).

In this study, we use whole-genome sequencing (WGS), whole-exome sequencing (WXS), and RNAseq data derived from The Cancer Genome Atlas (TCGA) project in combination with a bioinformatics pipeline, which efficiently and accurately genotypes the −77VNTR. We estimate the distribution of the genetic variations of the four IFNAR-1 promoter polymorphisms and we investigate the impact of −77VNTR and −3SNP polymorphisms on the regulation of HCC transcriptome and interferome. IFNAR-1 is transcribed into three transcripts, ENST00000270139 (which is translated into the receptor), ENST00000493503, and ENST00000442071; performing differential expression analysis in patients grouped according to their −77VNTR genotype, we show that −77VNTR modifies the transcriptional profile of IFNAR-1 gene by controlling the expression of the last one, which comprises only a fibronectin III domain and is translated into a truncated form of the receptor. The expression of this transcript coincides with significantly lower expression of FN-1. Finally, the PI3K–AKT signaling pathway, which is partially triggered by FN-1, was significantly enriched with differentially expressed genes. Our results indicate that this secondary transcript could potentially act as a regulatory element, but its functional role remains largely unknown.

#### PATIENTS AND METHODS

#### Study Design

We examined four previously described polymorphisms within the IFNAR-1 promoter (27, 30, 31), from now on referred to as interferon-α receptor promoter polymorphisms (IFNARPPs). They include three single-nucleotide polymorphisms (SNPs) at positions −568G/C, −408C/T, and −3C/T and a variable number tandem repeat of the binucleotide GT, −77VNTR(GT)n.

We used 49 (15 WGS and 34 WXS) TCGA "liver hepatocellular carcinoma (LIHC)" samples (project ID 10464). We filtered all the available (up to March 2016) HCC samples with positive HBV surface antigen. We performed genotyping of the four IFNARPPs, using the pipeline described below. We compared the frequencies of the revealed genotypes with those of a group of 92 chronically infected HBeAg-negative IC as defined in our previous study (32) as they present very low risk of cirrhosis and HCC (8–10). Briefly, IC state was defined for patients with persistently normal ALT values under strict follow-up and maximum HBV DNA levels ≤20,000 IU/mL. None of these inactive carriers had cirrhosis (33).

To assess the impact of the −77VNTR and the −3SNP on the expression profiles in HBV-associated hepatocellular carcinoma, we used tumor RNAseq data available for 10 TCGA HCC genotyped samples after classifying them according to their genotype (for −77VNTR: ≤8/≤8 vs. ≤8/>8 or >8/>8(GT) repeats, and for −3SNP: CC vs. CT or TT) (**Table 1**).
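The grouping rule described above can be written down explicitly. The following is a minimal Python sketch, assuming each sample's −77VNTR genotype is available as two integer (GT)-repeat counts and its −3SNP genotype as a two-letter string; the function names and group labels are hypothetical and are not part of the published pipeline.

```python
# Illustrative sketch of the sample grouping used for the RNAseq comparison.
# Function names and label strings are assumptions made for this example.

def vntr_group(allele_a: int, allele_b: int, cutoff: int = 8) -> str:
    """Classify a -77VNTR genotype from the (GT)-repeat counts of both alleles.

    Samples with both alleles at or below the cutoff form one group;
    samples with at least one longer allele form the other.
    """
    if allele_a <= cutoff and allele_b <= cutoff:
        return "<=8/<=8"
    return "at least one >8"


def snp3_group(genotype: str) -> str:
    """Classify a -3SNP genotype as CC vs. CT or TT."""
    normalized = "".join(sorted(genotype.upper()))  # "TC" and "CT" are the same genotype
    if normalized not in {"CC", "CT", "TT"}:
        raise ValueError(f"unexpected -3SNP genotype: {genotype!r}")
    return "CC" if normalized == "CC" else "CT or TT"


if __name__ == "__main__":
    # Hypothetical genotypes, not taken from Table 1.
    print(vntr_group(5, 8))
    print(vntr_group(5, 14))
    print(snp3_group("TC"))
```

Keeping the cutoff as a parameter makes it easy to repeat the comparison with a different threshold, such as the ≤9 vs. >9 split used by other groups.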

#### Bioinformatics

The .bam files were transformed into paired-end .fastq files using Samtools-bam2fq (34) and were locally mapped against two artificial chromosomes using Bowtie2 (35) in --very-sensitive-local mode (**Figure 1**). The artificial reference chromosomes were created by splitting the IFNAR1 promoter sequence (GenBank: X60459.1) at the −77VNTR, leaving 3 GT repeats at each hanging end, to avoid non-specific mapping of reads due to the low complexity of the microsatellite repeats. We extracted the mapped reads and used them to *de novo* assemble the alleles of the IFNAR1 promoter for each TCGA patient, using MIRA (36). We further analyzed the *de novo* assembled contigs using the R *DNAstrings* package to count the GT repeats. We used *Samtools mpileup* and *bcftools* (34) to call the variations of the three SNPs from the mapping alignments. The genotyping results were confirmed and quality-controlled by visual inspection of the mapping alignments of 15 random samples, using IGV (37).
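Two steps of this pipeline, the construction of the split reference and the counting of GT repeats in an assembled contig, can be illustrated in a few lines. The sketch below is written in Python rather than R and is not the authors' code; the function names, the toy sequence, and the VNTR coordinates are hypothetical, and the repeat count is taken as the longest uninterrupted (GT)n run in the contig, which is an assumption about how the repeats are scored.

```python
import re

GT_UNIT = "GT"


def split_reference(promoter: str, vntr_start: int, vntr_end: int,
                    keep_repeats: int = 3) -> tuple[str, str]:
    """Split a promoter sequence at its VNTR into two pseudo-chromosomes.

    vntr_start and vntr_end are 0-based, end-exclusive coordinates of the
    (GT)n tract. Each pseudo-chromosome keeps `keep_repeats` GT units on
    its hanging end, as described in the text.
    """
    keep = keep_repeats * len(GT_UNIT)
    if vntr_end - vntr_start < 2 * keep:
        raise ValueError("VNTR tract is too short to leave the requested overhangs")
    left = promoter[:vntr_start + keep]   # upstream flank + first GT units
    right = promoter[vntr_end - keep:]    # last GT units + downstream flank
    return left, right


def count_gt_repeats(contig: str) -> int:
    """Return the length, in GT units, of the longest uninterrupted (GT)n run."""
    runs = re.findall(r"(?:GT)+", contig.upper())
    return max((len(run) // len(GT_UNIT) for run in runs), default=0)


if __name__ == "__main__":
    # Toy sequence with 8 GT repeats; not the real X60459.1 promoter.
    toy = "ACCA" + "GT" * 8 + "CCTA"
    left, right = split_reference(toy, vntr_start=4, vntr_end=20)
    print(left, right)
    print(count_gt_repeats(toy))
```

Counting only the longest run avoids scoring isolated GT dinucleotides in the flanking sequence as part of the microsatellite.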

We performed the RNAseq analysis using Kallisto (38) to map the reads against the Human Transcriptome reference (v.GRCh38.rel79) and to calculate the transcript abundances. We analyzed the impact of the IFNARPPs on the transcriptional landscape of the interferon-associated genes using the "Interferome" database (23) on a subset of our whole-transcriptome results. We used Sleuth (39) and R-base functions to interpret and visualize the RNA-seq analysis results. We performed Gene Ontology (GO) and KEGG pathway enrichment analysis using the differentially expressed genes (*p* < 0.001).
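Transcript abundances in this study are reported in transcripts per million (TPM). As a reminder of what that unit means, the sketch below applies the standard TPM normalization to per-transcript read counts and effective lengths. It is an illustration in Python with hypothetical input values and a hypothetical function name, not a reproduction of how Kallisto computes its estimates internally.

```python
def tpm(counts: list[float], effective_lengths: list[float]) -> list[float]:
    """Convert per-transcript read counts into transcripts per million.

    Each count is first divided by the transcript's effective length, and
    the resulting rates are rescaled so that they sum to one million.
    """
    if len(counts) != len(effective_lengths):
        raise ValueError("counts and effective_lengths must have the same length")
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("at least one transcript must have a non-zero count")
    return [rate / total * 1_000_000 for rate in rates]


if __name__ == "__main__":
    # Hypothetical counts and effective lengths for three transcripts.
    values = tpm([500.0, 20.0, 0.0], [2500.0, 800.0, 400.0])
    for value in values:
        print(round(value, 2))
```

Because TPM values always sum to one million within a sample, they describe relative abundance, which is why a transcript can be compared across samples of different sequencing depth.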

#### Statistical Analysis

We used the *t*-test, the Fisher's exact test, and the *z*-test to evaluate the association of the demographic and genetic characteristics of the patients with the disease outcome. We used RStudio v0.99.446 for R v3.2.3 programming language for all statistical computing and graphics.


### RESULTS

### RNAseq Differential Expression Design

We tested for differentially expressed transcripts in 10 HCC RNAseq samples after defining their −77VNTR and −3SNP genotypes. We grouped and compared them according to their −77VNTR genotype [samples A–E: >8/≤8 or >8/>8(GT)n vs. F–J: ≤8/≤8(GT)n] and according to their −3SNP genotype (samples A, B, C, F, G: "CT" vs. D, E, H, I, J: "CC") (**Table 1**). The quality-control assessment of the RNAseq analysis is summarized in Figure S1 in Supplementary Material.

### The **−**77 VNTR and **−**3 SNP Polymorphisms and the IFNAR1 Transcription Profile

The IFNAR1 gene generates three different transcripts, but only one of them is translated into the functional receptor protein subunit (transcript "001": ENST00000270139). Of the two remaining transcripts, one is not translated (transcript "002": ENST00000493503) and the other (transcript "003": ENST00000442071) produces a truncated (136 aa long) isoform of the receptor, incorporating only a fibronectin type III domain. We found that the −77 VNTR and −3 SNP polymorphisms do not significantly affect the expression levels of the primary transcript "001" and the secondary transcript "002."

Figure 1 | Schematic representation of the computational pipeline applied to whole-genome sequencing (WGS) and WXS The Cancer Genome Atlas (TCGA) data (in blue) for the genotyping of the IFNARPPs. The IFNAR1 promoter sequence (GenBank: X60459.1) was used for the construction of two pseudo-chromosomes, on which the reads were aligned. The mapped reads were extracted and used for the *de novo* assembly of the two alleles of each sample. This method allows the assembly of alleles varying substantially in length compared to the reference.

Interestingly, the expression of the secondary transcript "003" was detectable only in homozygous patients carrying both alleles with ≤8(GT) repeats, while it was absent in patients carrying at least one allele with >8(GT) repeats (2.47 vs. 0.01 TPM, respectively) (**Figure 2**). We further tested 10 TCGA normal tissue (liver) RNAseq samples for the expression of this transcript. All were negative except for one, which presented only basal expression (0.02 TPM). Thus, transcript "003" was HCC-specific.

#### Impact of **−**77 VNTR and **−**3 SNP on the HCC Transcriptome and Interferome

We classified 10 HCC TCGA samples according to their genotype, ≤8/≤8 vs. ≤8/>8 or >8/>8(GT) repeats for the −77VNTR, and CC vs. CT or TT for the −3SNP. There were significant changes in the transcription profiles between the groups tested. In detail, for the −77VNTR grouping, there were 246 differentially expressed genes (*p* < 0.001), while for the −3SNP grouping, only 57 genes showed significant changes in their expression levels (*p* < 0.001) (**Figure 3**).

Focusing on the interferon-related genes, we created a subset of the transcriptomics results according to transcript IDs found in the *Interferome* database (23). The −77VNTR polymorphism had a fourfold higher impact on the Interferome compared to the −3SNP, as 46 and 11 transcripts, respectively, are either up- or downregulated (*p* < 0.001). Notably, among the most significantly differentially expressed genes was FN-1, which presented a 4.49-fold lower expression in ≤8/≤8(GT)n homozygous patients (**Figure 4**; Table S2 in Supplementary Material).

Figure 2 | Analysis of IFNAR1 transcripts: (A) structure of the IFNAR-1 gene. Alternative splicing and transcription initiation sites produce three distinct transcripts that differ in length. Only transcripts 001 and 003 are translated into proteins, which share only the C-terminal part and a fibronectin type III domain. (B) RNAseq-derived expression levels of the three transcripts from 10 hepatocellular carcinoma (HCC) samples grouped based on their −77VNTR(GT)n genotype. The levels of transcripts 001 (responsible for the production of the receptor) and 002 do not differ significantly between the genotypes. Transcript 003 is expressed only in samples with ≤8/≤8 (GT)n repeats in the −77VNTR polymorphism. The genotypes of HCC RNA samples (A–J) are described in Table 1.

#### Impact of **−**77 VNTR on Signaling Pathways

Performing GO and KEGG pathway enrichment analysis based on the −77VNTR grouping, we found six significantly enriched pathways (**Table 2**). The PI3K–AKT signaling pathway was ranked first, with the highest number and proportion of differentially expressed genes involved (14 genes, *p* < 0.05). The super-family "pathways in cancer" was also found significantly enriched with 15 differentially expressed genes, but since the PI3K–AKT pathway was the main contributor, the family was excluded from **Table 2**. Intriguingly, the PI3K–AKT signaling pathway can be triggered by FN-1 (**Figure 5**).

### IFNARPPs and HBV-Associated Hepatocellular Carcinoma

We genotyped the four IFNARPPs in 49 HBV-associated HCC samples, 15 WGS and 34 WXS. We were able to genotype the −568SNP [whose G allele was previously associated with more pathogenic disease states (26, 32)] in only 36 out of 49 HCC samples, as the read coverage was occasionally reduced over the −568SNP in the WXS dataset. We compared the distributions of the polymorphisms with those described previously for 92 ICs (32). The demographic characteristics of the 49 HCC and the 92 IC patients are summarized in Table S1 in Supplementary Material. There was no statistically significant difference in frequency of the −568SNP genotypes between the IC and HCC groups (**Table 3**). Given the previously described linkage between alleles −408C/T and −3C/T (−408C to −3C and −408T to −3T) (27), these were analyzed together. There was no statistically significant difference in the prevalence of the −408/−3 SNP polymorphisms in the IC and the HCC groups (**Table 3**). The −408/−3 TT genotype was not identified in any of the TCGA samples tested. Notably, the number of GT repeats at the −77VNTR of the IFNAR-1 promoter was associated with the disease status. Heterozygous ≤8/>8 (GT)n patients were less likely to be IC compared to homozygotes carrying both alleles with ≤8 GT repeats (OR = 0.41, 95% CI: 0.19, 0.90). The same trend was observed when >8/>8 and ≤8/>8 (GT)n genotypes were grouped together (OR = 0.40, 95% CI: 0.19, 0.84) (**Table 3**; Figure S2 in Supplementary Material).
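For readers less familiar with how an odds ratio and its confidence interval are obtained from a 2×2 genotype-by-status table, the following Python sketch shows one common approach, the log (Woolf) method. The text does not state which interval method was used in this study, so the method is an assumption, and the counts in the example are hypothetical rather than taken from Table 3.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int,
                  z: float = 1.96) -> tuple[float, float, float]:
    """Odds ratio and Woolf (log-method) confidence interval for a 2x2 table.

    a: index genotype, outcome present     b: index genotype, outcome absent
    c: reference genotype, outcome present d: reference genotype, outcome absent
    z = 1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the log method")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: rows are genotype groups, columns are IC vs. HCC.
    estimate, lower, upper = odds_ratio_ci(a=10, b=20, c=30, d=15)
    print(f"OR = {estimate:.2f}, 95% CI: {lower:.2f}, {upper:.2f}")
```

An interval that excludes 1, as reported for the −77VNTR comparisons above, indicates an association at the chosen confidence level.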

#### DISCUSSION

Implementing a fully automated computational pipeline, we characterized the genetic profile of the four polymorphisms located in the IFNAR1 promoter region in 49 HBV-associated HCC samples derived from the TCGA project.

Whole-genome sequencing data are very large and are usually available as ready-to-use alignments (.bam files), which are binary and compressed. Variant calling from .bam files can be performed with several routine pipelines, each characterized by different biases toward specific types of SNP and in/del genotyping errors (40). For example, reads that differ significantly in length can be ignored and remain unmapped. This is highly likely in VNTRs with heterozygous genotypes in loci where the two alleles differ substantially in length compared to the reference sequence (e.g., −77VNTR(GT)5/15), which in combination with low read coverage and read length (quite common in WGS data) can lead to inaccurate variant calling. Our pipeline solves this problem by using an artificial reference of two pseudo-chromosomes, derived from the original sequence, segregated exactly on the VNTR locus, leaving only 3(GT)-repeats at each segregation edge. By using these two pseudo-chromosomes, we were able to specifically mine all the reads corresponding to the IFNAR1 promoter region, whether originally mapped or unmapped, which in turn allowed the precise reconstruction of the two alleles of each sample. Our method provides biologically meaningful results, as only one (homozygous) or two (heterozygous) contigs were generated for all the samples tested. No non-specific contigs were generated, indicating that the mining of the reads was highly specific.

Since only 15 TCGA WGS samples met our selection criteria, we used WXS data as well. Although WXS libraries are optimized for mRNA sequencing, non-coding upstream and downstream sequences are randomly captured during the target enrichment process (41, 42). Thus, we were able to accurately genotype the polymorphisms proximate to the first IFNAR1 exon, with the exception of the −568SNP, which could only be genotyped in 36 out of the 49 HCC TCGA samples. The −568SNP is the most distant one relative to the start of transcription, and this resulted in reduced read coverage in 11 WXS samples. The limited number of observations in the HCC group did not allow for statistically significant results with regard to the −568 SNP and disease status, although a relative difference in the genotypic distributions was observed (Figure S2 in Supplementary Material). Further studies are needed to confirm the significance of this SNP, as it has been previously associated with the development of CHB and the spontaneous recovery after acute HBV infection (27, 32).

*a A threshold of 10 differentially expressed genes per pathway has been applied.*

Our results suggest that the number of GT repeats in the −77VNTR is associated with the state of the disease, as alleles with more than 8(GT)n are more frequent among HCC samples (**Table 3**; Figure S2 in Supplementary Material). The finding that more (GT) repeats are associated with a more severe outcome of the disease is in concordance with our previous observations, where alleles carrying ≤8(GT) repeats in the −77VNTR were associated with the IC phase, whereas >8(GT) repeats were more frequent in HBeAg-negative CHB patients (32). Zhou et al., studying CHB patients and spontaneously resolved cases after acute HBV infection, showed that alleles carrying ≤9(GT) repeats were more frequent in CHB patients, while >9(GT) repeats were associated with spontaneous clearance after acute HBV infection (27). Moreover, this result was linked with the rest of the polymorphisms examined in haplotypes: [≤9(GT)n, −568C, −408/−3T] for the CHB patients and [>9(GT)n, −568G, −408C, −3C] in cases of spontaneous clearance after acute HBV infection. In our previous study, alleles −568C and −408/−3T were associated with the IC phase, whereas −568G and −408/−3C were more frequent in CHB patients. In this study, although genotypes −568CG and −408/−3CT show a trend of higher frequency in the HCC samples compared to genotypes −568CC and −408/−3CC (Figure S2 in Supplementary Material), they were not associated with disease, as the limited availability of HCC samples and the lack of coverage over the −568SNP in WXS samples led to a lack of statistical power for these comparisons. Notably, genotype −408/−3TT was totally absent in the HCC dataset. Studies in larger datasets may discern whether alleles −568G and −408/−3T and the homozygous genotypes GG and TT are linked to HCC.

*Statistically significant associations in bold.*

*IC, inactive carrier; HCC, hepatocellular carcinoma.*

*\*Genotype available only for 128 samples (92 for IC and 36 for HCC).*

The effective expression of IFNAR-1 is essential for the linkage with IFN and the triggering of the downstream signal pathways (43). Even minor modifications in the receptor structure can impair its normal function and alter its antiviral and antiproliferative properties (44, 45). Moreover, the expression levels of both the receptor subunits IFNAR1 and IFNAR2 are associated with the IFN-β treatment outcome in multiple sclerosis patients (46). Hepatitis C patients with 5GT repeats in the −77VNTR (genotypes GT5/5/GT5/14) have been reported to be better responders to IFN-based therapy (47). On the other hand, experiments based on luciferase reporter plasmids have suggested that the promoter of IFNAR-1 is not affected by the −568SNP and the −77VNTR (26, 27), while in another study its activity was significantly affected by the −3SNP (48). In detail, the promoter activity was reduced for −3T plasmids through reduced binding affinity to HMGB1, a factor that was suggested to bind to the −3 element to regulate the transcription levels of IFNAR1. Similarly, reduced levels of IFNAR1 expression in HBV patients with a C > T substitution at the −3 position of the IFNAR1 promoter were reported in the same study (48). In another study, an important role of the −3CT SNP was evident regarding the regulation of the transcription factor High Mobility Group B protein 1 (HMGB-1); the C > T transition was shown to reduce the binding affinity of HMGB-1 to the IFNAR-1 promoter sequence, thus leading to reduced expression of IFNAR-1 (48).

In this study, we conducted a thorough RNAseq differential expression analysis on HCC samples grouped according to their −77VNTR and −3SNP genotype. In agreement with our previous findings, where a significant role of the −77VNTR genotypes had arisen with respect to the clinical course of HBV infection, here we report a more significant involvement of −77VNTR in the modification of the HCC transcriptome and interferome, as more genes were found to be significantly differentially expressed compared to those affected by the −3SNP. We also shed light on the controversial question about the impact of these promoter polymorphisms on the expression of IFNAR-1 gene. The expression levels of the receptor are associated with the clinical outcome of chronically HBV-infected patients (49). We conclude that the expression of the major IFNAR-1 transcript, which is responsible for the production of the functional receptor, is not significantly affected by these polymorphisms (**Figure 2**).

At the same time, a secondary IFNAR1 transcript, which is translated into a truncated form of the receptor, is only produced at detectable levels in samples with ≤8/≤8 (GT)n genotype in the −77VNTR, independently from the −3SNP genotype. This transcript is not detectable in normal liver tissue samples, thus appears to be HCC specific. Secondary IFNAR1 transcripts have been reported in tumor cell lines (50) and here, for the first time, we report that −77VNTR genotype is crucial for their production, which apparently leads to the transcriptional remodeling of IFNAR1 gene.

Grouping the samples according to their −77VNTR genotype revealed that one of the most differentially expressed genes was the extracellular matrix FN-1. Patients carrying at least one −77VNTR >8(GT) allele presented a strong upregulation of the FN-1 gene, while FN-1 was significantly downregulated in ≤8/≤8(GT)-repeats carrying patients, a phenomenon coinciding with the expression of the secondary IFNAR-1 transcript, which implies its potentially protective role against the development of HCC. This transcript comprises only a fibronectin III domain; FN fragments and modules can inhibit FN-matrix assembly by competing for FN-assembly sites, which could act as a feedback system to regulate FN levels on the cell surface (51). Specifically, the first type III domain of the FN molecule is important for the matrix assembly (52, 53), while even small fragments derived from this module regulate FN polymerization, inducing it at moderate concentrations but inhibiting it at high concentrations (54, 55). FN-1 is involved in HCC as it can be upregulated by HBxAg in an NFkB-dependent way (14), while it is generally over-expressed in several cancers [reviewed in Ref. (56)]. NFkB was not found differentially expressed in our dataset (*p* = 0.16) but is linked to PI3K–AKT, which was the most significantly enriched signaling pathway, down to the MDM2 proto-oncogene and the apoptosis regulator Bcl-2 (**Figure 5**). FN-1 can trigger this pathway and transduce multiple intracellular signals that control the cell cycle. Our data suggest that, apart from NFkB, complementary mechanisms that are partially controlled by FN-1 and involved in cell cycle regulation might exist.

This study is based on data mining and *in silico* analyses; thus, our findings generate a strong hypothesis about links between the expression of the truncated IFNAR-1 transcript and the PI3K–AKT signaling pathway and/or HCC development. To test this hypothesis, further wet-lab studies will be needed. Knock-down of the transcript, transfection with a plasmid that produces this transcript, and quantitative analysis of the respective transcripts and proteins could confirm these observations and shed light on these complex regulatory networks.

In conclusion, our results suggest very minimal (if any) involvement of the IFNAR-1 promoter polymorphisms in the expression levels of the IFNAR-1 major transcript but at the same time raise a potential and intriguing role for the −77VNTR regarding the regulation of downstream genes. Our study shows that the majority of modifications of the *Interferome* coincided with the production of the truncated IFNAR-1 transcript. Thus, further study of this truncated transcript could clarify the mechanistic features of the combined antiviral and anticancer roles of IFNAR-1.

#### AUTHOR CONTRIBUTIONS

TK designed and conducted the analyses, evaluated the results, and wrote the manuscript. GP, DP, AH, JM, UG, PK, and GM wrote and revised the manuscript.

#### FUNDING

The work was supported by the Medical Research Council, UK (Project Reference: MR/K010565/1).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at https://www.frontiersin.org/articles/10.3389/fimmu.2018.00777/full#supplementary-material.

#### REFERENCES

the assessment of patients with HBeAg-negative chronic hepatitis B virus infection. *J Viral Hepat* (2014) 21(7):517–24. doi:10.1111/jvh.12176


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2018 Karamitros, Papatheodoridis, Paraskevis, Hatzakis, Mbisa, Georgopoulou, Klenerman and Magiorkinis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Evolution of Two Major Zika Virus Lineages: Implications for Pathology, Immune Response, and Vaccine Development

#### *Jacob T. Beaver, Nadia Lelutiu, Rumi Habib and Ioanna Skountzou\**

*Department of Microbiology and Immunology, Emory Vaccine Center, Emory University School of Medicine, Atlanta, GA, United States*

Zika virus (ZIKV) became a public health emergency of global concern in 2015 due to its rapid expansion from French Polynesia to Brazil, spreading quickly throughout the Americas. Its unexpected correlation to neurological impairments and defects, now known as congenital Zika syndrome, brought on an urgency to characterize the pathology and develop safe, effective vaccines. ZIKV genetic analyses have identified two major lineages, Asian and African, which have undergone substantial changes during the past 50 years. Although ZIKV infections have been circulating throughout Africa and Asia for the latter part of the 20th century, the symptoms were mild and not associated with serious pathology until now. ZIKV evolution also took the form of novel modes of transmission, including maternal–fetal transmission, sexual transmission, and transmission through the eye. The African and Asian lineages have demonstrated differential pathogenesis and molecular responses *in vitro* and *in vivo*. The limited number of human infections prior to the 21st century restricted ZIKV research to *in vitro* studies, but current animal studies utilize mice deficient in type I interferon (IFN) signaling in order to invoke enhanced viral pathogenesis. This review examines ZIKV strain differences from an evolutionary perspective, discussing how these differentially impact pathogenesis *via* host immune responses that modulate IFN signaling, and how these differential effects dictate the future of ZIKV vaccine candidates.

#### Keywords: Zika virus, evolution, transmission, tissue tropism, phylogeny, immunology, vaccines

#### *Edited by:*

*Gkikas Magiorkinis, National and Kapodistrian University of Athens, Greece*

#### *Reviewed by:*

*Luciana Barros Arruda, Universidade Federal do Rio de Janeiro, Brazil Jose Luiz Proenca-Modena, Universidade Estadual de Campinas, Brazil*

> *\*Correspondence: Ioanna Skountzou iskount@emory.edu*

#### *Specialty section:*

*This article was submitted to Viral Immunology, a section of the journal Frontiers in Immunology*

*Received: 02 May 2018 Accepted: 03 July 2018 Published: 18 July 2018*

#### *Citation:*

*Beaver JT, Lelutiu N, Habib R and Skountzou I (2018) Evolution of Two Major Zika Virus Lineages: Implications for Pathology, Immune Response, and Vaccine Development. Front. Immunol. 9:1640. doi: 10.3389/fimmu.2018.01640*

### INTRODUCTION

Zika virus (ZIKV) has garnered international attention due to its rapid worldwide expansion since 2015, when an epidemic struck Brazil, resulting in a newly identified pathology including severe neurological impairments such as microcephaly, which is now part of the congenital ZIKV syndrome, as well as Guillain–Barré syndrome (GBS) afflicting adults (1). The World Health Organization declared ZIKV a Public Health Emergency of International Concern in February 2016, during which time ZIKV was spreading rapidly across South America, the Caribbean, and into the United States (1). This precarious outbreak in Brazil spread rampantly across the western continents, raising critical questions pertaining to the evolution of this virus. Prior to 2015, ZIKV infections were limited geographically to Africa and Asia and were reported to be mostly asymptomatic, with approximately 20% of cases mildly symptomatic and presenting as a self-limiting febrile illness whose most common symptoms were maculopapular rash, conjunctivitis, and joint pain (2). The mounting evidence that ZIKV is now causing neuropathology and fetal brain disruption, as well as rising concerns over novel modes of ZIKV transmission, suggests an evolutionary change in the molecular and genetic structure of ZIKV strains that has contributed to its rapid expansion, severity of pathogenicity, and multiple routes of infection. These increasing adverse effects illustrate why an analysis of the phenotypic differences between the African and Asian lineages, as well as between the many strains that have evolved under each branch, is a vital component of our ongoing effort to develop vaccines or therapeutics and fill major gaps of knowledge regarding ZIKV pathogenesis.

### HISTORY OF VIRUS EMERGENCE

Zika virus was discovered in the Zika Forest of Uganda in 1947 by Alexander Haddow and George Dick during a surveillance investigation of yellow fever in rhesus macaques (3). The virus was later isolated from the *Aedes africanus* mosquito collected at the same site (4). The first human case occurred in Nigeria in 1954, but it was not until 1966 that ZIKV was first detected in Asia alongside the first evidence of transmission by an urban vector, *Aedes aegypti* mosquitoes from Malaysia (4, 5). We know that two major lineages of ZIKV were formed at this time, African and Asian, which is confirmed by current genetic and phylogenetic analyses (2). ZIKV made no headlines until an outbreak in 2007 on Yap Island, Micronesia, which left 73% of the residents infected (6). Despite the presence of DENV IgM in all affected individuals, the unique symptomatic presentation was definitively identified as ZIKV-induced. The next outbreak was 6 years later in French Polynesia, spreading to several other islands in Oceania. The most commonly reported symptoms in the Yap Island and French Polynesian outbreaks included rash, fever, arthralgia, and non-purulent conjunctivitis (7, 8). However, the first case with GBS as a complication of ZIKV infection was reported in the 2013 outbreak in French Polynesia (7). It was also in 2013 that ZIKV transmission was discovered to occur through blood or other bodily fluids and not just through mosquito bites.

Brazil was the next location to experience an outbreak, early in 2015. Phylogenetic and molecular clock analyses revealed that there was a single introduction of ZIKV to the country (9). The virus was likely brought to Brazil by a traveler from French Polynesia after a stop at Easter Island (10–12). In a recent article, Passos et al. performed a retrospective blood bank analysis on 210 samples collected from patients during a DENV-4 outbreak that occurred in early 2013 (13). Of these samples, 10% showed a singly positive qRT-PCR result for ZIKV, and only 2% demonstrated consistently positive results across triplicate samples. While the cycle threshold for positive results used by this group is less stringent than those of other groups, the finding nonetheless suggests that ZIKV may have been present in South American countries as early as April of 2013.

From the confirmed 2015 cases, it took less than 1 year for the virus to spread throughout Brazil, into neighboring South American countries, and into Central and North America. An increase in GBS cases was reported in Brazil, Colombia, Suriname, and Venezuela, and microcephaly cases, which included neurological disorders and neonatal malformations, were reported in NE Brazil (12, 14–23). The remarkable rise of infants born with microcephaly in Brazil set off international alarms and garnered global attention (24). Sequence homology studies reveal that of the two ZIKV lineages, the strains responsible for the human outbreaks throughout the Americas were phylogenetically closest to the Asian lineage (25).

### MOLECULAR BIOLOGY OF ZIKV

#### ZIKV Genome Organization

Zika virus is a positive-sense, single-stranded RNA virus that belongs to the *Flaviviridae* family. This family includes the human pathogenic viruses Japanese encephalitis virus, dengue virus type 1–4 (DENV), yellow fever virus (YFV), West Nile virus (WNV), and tick-borne encephalitis virus. To better understand how ZIKV might be evolving, it is important to understand its genomic structure. The genome is a 10.8 kb single-stranded, positive-sense RNA molecule that consists of a 5′ untranslated region (UTR) (~107 nt), one open reading frame (ORF) (~10.2 kb), and a 3′ UTR (~420 nt). The ORF encodes a polyprotein precursor that is processed into three structural proteins, capsid (C), premembrane/membrane (prM), and envelope protein (E), as well as seven non-structural proteins (NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5). The viral polyprotein is co- or post-translationally cleaved by viral NS2/NS3 protease, host signal peptidase (C/prM, prM/E, E/NS1, 2K/NS4B) and a host protease (NS1/NS2A). The pr- fragment of the prM protein is cleaved by furin in the trans-Golgi apparatus to generate mature virions. The major surface glycoprotein involved in host cell binding and membrane fusion is the E protein. Viral reproduction is accomplished through the non-structural proteins (NS1-NS5), which serve as self-cleaving peptidases, along with the viral RNA-dependent RNA polymerase. The genome organization and major protein functions of ZIKV are highly similar to those of all other members of the *Flavivirus* genus (25–27).
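The genome layout described above can be made concrete with a short calculation. The Python sketch below lays out the three genome segments end to end using the approximate lengths quoted in the text and lists the polyprotein products in their order along the ORF. The exact lengths vary between strains, so the coordinates are illustrative only, and the constant and function names are hypothetical.

```python
# Approximate segment lengths (nt) as quoted in the text; real strains differ slightly.
GENOME_LAYOUT = [("5' UTR", 107), ("ORF", 10_200), ("3' UTR", 420)]

# Order of the mature proteins along the polyprotein, N- to C-terminus.
POLYPROTEIN_ORDER = ["C", "prM", "E", "NS1", "NS2A", "NS2B", "NS3", "NS4A", "NS4B", "NS5"]


def segment_coordinates(layout: list[tuple[str, int]]) -> list[tuple[str, int, int]]:
    """Return 1-based, inclusive (name, start, end) coordinates for each segment."""
    coordinates = []
    start = 1
    for name, length in layout:
        end = start + length - 1
        coordinates.append((name, start, end))
        start = end + 1
    return coordinates


if __name__ == "__main__":
    for name, start, end in segment_coordinates(GENOME_LAYOUT):
        print(f"{name}: {start}-{end}")
    orf_length = dict(GENOME_LAYOUT)["ORF"]
    print("approximate codons in ORF:", orf_length // 3)
    print("structural proteins:", POLYPROTEIN_ORDER[:3])
    print("non-structural proteins:", POLYPROTEIN_ORDER[3:])
```

With these rounded values the segments sum to roughly 10.7 kb, which agrees with the quoted genome size of about 10.8 kb once rounding of the individual segment lengths is taken into account.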

As RNA genome viruses are strategically organized to contain the minimal number of genes required for sufficient replication and host immune evasion, many RNA viruses have evolved innovative methods for manipulating and subverting molecules within their host cells (28). Among these are non-coding, subgenomic RNAs. These subgenomic flavivirus RNA components (sfRNA) have been implicated both in the reduction of type I interferon (IFN) transcription and in mediating resistance to cellular exonucleases that would degrade genomic transcripts, such as Xrn1 (29, 30). While the complete functional role of sfRNAs remains unknown, a few key pieces of information have already emerged regarding ZIKV. Of these, work by Donald et al. suggests that the sfRNA of ZIKV can not only inhibit the type I IFN response by means of a pseudo-knot tertiary structure, but may do so in a manner that is broader than that of other flaviviruses, such as DENV (31). Additionally, the difference in ZIKV lineage does not impact the generation of these sfRNAs, and is unlikely to impact the predicted tertiary structure.

#### Genetic Evolution of the Virus

The MR766 (HQ234498) strain of ZIKV from Uganda is considered the classical strain and is used consistently in both *in vitro* and *in vivo* research studies to model ZIKV infections. While this strain has been passaged 147 times in insect cell cultures and suckling-mouse brain tissues, very few mutations have been detected in its genome. In fact, when the genomic sequence of MR766 is compared with those of two other variants, AY632535 and DQ859059, which were both isolated in Uganda in 1947 from sentinel Rhesus macaques, all three variants differ in only 0.4% of nucleotide and 0.6% of amino acid positions (32). The initially low mutation rates may have been responsible for low transmission to humans and possibly subclinical infections. Phylogenetic analysis of available ZIKV genomes reveals that 86.5% of isolates are from humans, 11% are from mosquitoes, and 2% are from NHPs. Interestingly, of the available genomes, African lineages are only isolated from mosquitoes and NHPs, while Asian lineages are isolated from both humans and mosquitoes (**Figure 1**).
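Percent-difference figures such as the 0.4% nucleotide divergence quoted above are typically derived from aligned sequences. The following is a minimal sketch, assuming two sequences that have already been aligned to equal length with `-` marking gaps. The function name and the toy sequences are hypothetical and are not taken from any ZIKV genome, and the cited study may have used a different procedure.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percentage of mismatched positions between two pre-aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


# Toy example only (hypothetical sequences): one mismatch in ten positions.
toy_a = "ACGTACGTAC"
toy_b = "ACGTACGTAT"
print(f"{percent_difference(toy_a, toy_b):.1f}% difference")
```

The same function applies to amino acid alignments, since it only compares characters column by column.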

Phylogenetic trees of ZIKV have also been used to track the movement of ZIKV strains across the globe and to identify potentially serious mutations that could alter molecular mechanisms and thereby enhance pathology (33). Phylogenetic analysis of available ZIKV genomes reveals that approximately 93% of published genomes are from the Asian lineage and 7% are from the African lineage. Among the Asian lineage genomes, isolates collected from North, South, and Central America account for 66.9% of all isolates: 38.8% were from North and Central America and 28.1% were from South America. The remaining 26.1% were Asian lineage isolates collected from Asia and Oceania. Only 7% of all analyzed strains were from Africa (**Figure 2**).
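The regional percentages quoted above can be checked for internal consistency with a few lines of arithmetic. This sketch reads each value as a share of all analyzed genomes, which is an assumption that the arithmetic supports. The dictionary keys are descriptive labels, not identifiers from the underlying dataset.

```python
# Regional shares (percent of all analyzed genomes), as quoted in the text.
shares = {
    "North and Central America": 38.8,
    "South America": 28.1,
    "Asia and Oceania": 26.1,
    "Africa": 7.0,
}

# The two American regions together should match the quoted 66.9%.
americas = shares["North and Central America"] + shares["South America"]

# All four regions together should account for every analyzed genome.
total = sum(shares.values())

print(f"Americas: {americas:.1f}%")
print(f"All regions: {total:.1f}%")
```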

A phylogenetic analysis juxtaposing the MR766 strain with Asian lineage strains, particularly from Suriname and French Polynesia, reveals 50 lineage-specific amino acid differences (2, 34–37). Of these, all variations occur in either the NS1 or NS5 proteins (34, 35). Yet, when Wang et al. compared human to mosquito strains from the Federated States of Micronesia (FSM) outbreak in 2007 and the French Polynesia outbreak in 2013 (H/PF/2013), they identified 435 and 446 nucleotide changes in FSM and H/PF/2013, respectively, although 344 nucleotides were identical. Wang et al. considered them to be sub-lineages deriving from the same ancestor that arrived in Malaysia in 1966 but had seemingly no clinical impact for 50 years (2, 26). All contemporary human strains within the Americas share higher sequence homology with the Asian lineage strain P6-740 (Malaysia/1966), the sole mosquito isolate (*A. aegypti*) of that lineage, than with IbH-30656 (Nigeria/1968) (38). These isolates are more closely related to the H/PF/2013 strain (French Polynesia/2013) than to the FSM strain (Micronesia/2007), suggesting that although the two Asian variants evolved from a common ancestor, they further diversified and the genetic distance between the 2007 and 2013 variants increased (2). A third major lineage from Africa is thought to exist based on analysis of only the E and NS5 gene sequences (35, 39). This lineage is designated Africa II, but it is neglected because whole-genome sequences are incomplete.

Multiple sequence alignments using the 58 complete genome and five envelope sequences of ZIKV available as of April 2016 revealed conserved amino acid variations. Nineteen variations were found in the structural proteins and 47 in the non-structural proteins, with the most variations in NS5, although the RNA-dependent polymerase domain had no variations at conserved motifs (26, 40). Eight variation sites were located in the E protein; two sites were in the stem region and one site was in the transmembrane region. Substitutions in the stem and transmembrane regions affect virion assembly and membrane fusion, whereas substitutions in Domain III of the E protein may affect receptor binding (41, 42).

Smith et al. (27) found the difference between the African and Asian isolates used in their analysis to be ~75–100 AA residues in the ORF, while the strains within each Asian and American lineage differed by ~10–30 AAs, suggesting that even minimal mutations could have phenotypic impact. A separate analysis by Wang et al. on nucleotide sequences compared 8 African strains (7 from mosquitos, 1 from monkey) with 25 Asian strains (all human) and found 59 amino acid variations that differed between the two major lineages, but were shared within the various strains of either African or Asian ancestry (2). The highest variability (10%) between Asian human and African mosquito strains was in the pr region of the prM protein, though the effect of this structural change on viral function is not clear. Yuan et al. demonstrated that differences in neurovirulence among Asian lineage strains from Cambodia and Venezuela may depend on a single amino acid substitution, S139N (43). This substitution occurs in the pr region of prM, and Yuan et al. hypothesize that it may contribute to the neurovirulent phenotype, but do not speculate as to a mechanism. These data are important because they not only demonstrate variance among strains of the same lineage but also shed critical light on intra-lineage, strain-specific evolutionary differences. Comparison of protein sequences using P6-740 as the Asian reference and FSM showed over 400 variations at the nucleotide level and 26 unique substitutions at the protein level. Comparison of FSM, H/PF/2013, and the Brazilian strains from 2015 to 2016 showed that all these strains acquired changes at an additional eight positions, for a total of 34 amino acid changes compared to P6-740. All isolates showed identical amino acids at these positions with the exception of T2634M/V in the NS5 protein.
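Substitution labels such as S139N and T2634M/V follow the common convention of reference residue, position, then one or more variant residues. The following minimal sketch shows how such labels could be parsed for bookkeeping. The class and function names are hypothetical, and the pattern covers only the simple single-letter form used in the text above.

```python
import re
from typing import NamedTuple, Tuple


class Substitution(NamedTuple):
    reference: str             # residue in the reference strain
    position: int              # 1-based position in the protein or polyprotein
    variants: Tuple[str, ...]  # residue(s) observed in the variant strain(s)


# Reference letter, digits, then one or more variant letters separated by '/'.
_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)$")


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'S139N' or 'T2634M/V'."""
    match = _PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized substitution label: {label!r}")
    ref, pos, alts = match.groups()
    return Substitution(ref, int(pos), tuple(alts.split("/")))


for label in ("S139N", "T2634M/V"):
    print(label, "->", parse_substitution(label))
```

Note that the label alone does not say which protein the position refers to; S139N is numbered within prM, whereas T2634M/V is given here with NS5, so that context has to be tracked separately.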

Of note, no known ZIKV mosquito strain has the same nucleotide sequence as the human strains, though this could be due to sampling bias or to ZIKV transmission through alternative routes (2). Nonetheless, nucleotide sequence changes can have an impact on viral pathology, replication, transmissibility, and fitness. One such example is the impact on post-translational modification of the E protein. Faye et al. in 2014 reported a deletion of the N154 glycosylation site of the E protein in African isolates that did not occur in Asian lineage strains (40). Neurovirulence may depend on glycosylation of the E protein (Asn 154) (44, 45). Naik and Wu reported that mutation of putative N-glycosylation sites on DENV NS4B decreased RNA replication, suggesting that glycosylation may play important roles in the infectivity, maturation, and virulence of flaviviruses (46).
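Putative N-glycosylation sites such as the one at Asn 154 are usually located by scanning for the canonical sequon Asn-X-Ser/Thr, where X is any residue except proline. That rule is general biochemistry rather than something stated in the text above. The sketch below is a minimal illustration under that assumption; the function name and the toy sequence are hypothetical, and the sequence is not the ZIKV E protein.

```python
import re

# Canonical N-glycosylation sequon: N, any residue except P, then S or T.
# The lookahead lets overlapping sequons be reported as well.
_SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein: str) -> list:
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [m.start() + 1 for m in _SEQUON.finditer(protein.upper())]


# Toy sequence (hypothetical) containing N-G-S, N-P-T (excluded), N-N-T and N-T-S.
toy_protein = "MKNGSAANPTQNNTS"
print(find_sequons(toy_protein))
```

Comparing the positions returned for two aligned strains would flag a lost site, such as the deletion reported in some African isolates, although a sequon match only indicates a potential site and not confirmed glycosylation.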

host species of isolation using Virus Variation analysis system available through NCBI. Only complete nucleotide genomes were screened and duplicate strains were removed to produce 296 unique strains. Strains isolated from humans, mosquitos, NHPs, and cell cultures are labeled with red (humans), blue (mosquitos), green (NHP), and aqua (cell cultures). (B) The total number of strains isolated per host species was used to derive the percentage of each host within the grand total.

Similar to DENV, ZIKV evolution depends on the worldwide spread of the mosquito vector, growing human population size, and increased foreign travel and commerce (37). Sequence analyses demonstrate that the virus originated in Africa within two distinct groups, Uganda and Nigeria, mostly isolated from non-human hosts and vectors, and anchored by the MR-766 strain. The Asian cluster was first isolated in Malaysia and is anchored by a prototype strain, P6-740; it includes strains from other Southeast Asian countries, such as Cambodia, and from French Polynesia. The American clade, which includes strains from Brazil and other American or Caribbean strains, evolved from this Asian cluster and expanded rapidly among naïve populations (37). As ZIKV evolves, it diversifies and creates new interactions with vectors and hosts that impact pathology, producing unique lineage- and strain-specific pathological profiles. To date, 197 fully sequenced African and Asian isolates have been characterized and deposited in GenBank.

followed by continent of isolation using Virus Variation analysis system available through NCBI. Only complete nucleotide genomes were screened and duplicate strains were removed to produce 296 unique strains. Strains were clustered from Africa (red), from the Americas (Asian lineage) (blue), and from Asia and Oceania (Asian lineage) (green). (B) The total number of strains isolated per continent was used to derive the percentage of each continent within the grand total.

#### ZIKV TRANSMISSION AND TISSUE TROPISM

### Vector Influence on Viral Evolution

While the African lineage contained eight mosquito isolates, P6-740 (Malaysia/1966) was the sole mosquito isolate in the Asian lineage. In 2007, human sera from patients with painful febrile disease and *A. albopictus* mosquitos were sampled in West Africa and tested positive for ZIKV (47). At the same time, the Micronesia outbreak identified *A. (Stegomyia) hensilli* as the likely principal vector (6). In 2013, ZIKV reached French Polynesia, with subsequent spread to other Oceanian islands (New Caledonia, Cook Islands, and Easter Island); *A. aegypti* and *A. albopictus* are present throughout most of this region (48). Eleven percent of the population was infected, with symptoms such as low-grade fever, rash, conjunctivitis, and arthralgia, as well as GBS (49). *A. aegypti* has not only expanded to Central and South America but is also regarded as the most common vector for DENV (50). The New World strains of *A. aegypti* and *A. albopictus*, which are the most common in the USA, are poor transmitters of ZIKV (51), suggesting that continuous divergence of the Asian lineage through genomic evolution may have adapted the virus to direct human-to-human transmission without the involvement of a vector. Indeed, while *Aedes* is widely accepted as the vector for ZIKV (52–54), work by Guedes et al. has demonstrated that ZIKV can infect and replicate in the midgut and salivary glands, and can be detected in the saliva, of *Culex* spp. (55). This work suggests that, while the topic remains contentious and requires further investigation, the transmission vector range for ZIKV may be greater than anticipated.

#### Non-Vector Transmission

Zika virus and other flaviviruses (with the exception of hepatitis C virus) are transmitted by mosquito bite, but ZIKV has clearly diverged from other flaviviruses. Since the 2015 epidemic, the wealth of published data has made it apparent that ZIKV can be transmitted from human to human through sexual transmission, blood transfusion, ocular transmission, or vertical transmission from mother to fetus (15, 16, 20). While the African strains are better transmitted through mosquitos, the American strains with Asian ancestry may have acquired enhanced capabilities for transmission through sexual intercourse. This is supported by the numerous clinical cases of sexual transmission from male to female partners, and the limited data regarding female-to-male transmission (56–67). In these clinical cases, the originally infected individuals are reported to have traveled to South American countries or Pacific island nations, where they are believed to have contracted ZIKV (68). Recent findings from Mead et al. reveal that while ZIKV can be detected in semen from infected men for up to 9 months after infection, sexual transmission of ZIKV typically occurs within 20 days of infection, and the amount of infectious ZIKV in semen decreases rapidly within the first 3 months of infection. Additionally, data from Barreto-Vieira et al. regarding attempts to standardize *in vitro* techniques for an Asian lineage strain of ZIKV demonstrated that mammalian cells associated with mucosal membranes are more susceptible to ZIKV infection than insect cells (69).

### Human Infection Studies

#### Fetoplacental Infections

The new routes of transmission demonstrate a novel tissue tropism for the virus. As such, tropism for the human placenta allowing infection of the fetus is unique to ZIKV, even though it should be noted that other flaviviruses that are not human pathogens do infect placental tissue in their respective hosts. Asian strains of ZIKV attracted global attention for their impact on maternal and fetal health (70, 71). Infectivity studies have shown that South American strains of Asian lineage are capable of infecting human decidua and umbilical cord tissues and are responsible for apoptosis of chorionic villi, which function in fetal/maternal blood/nutrient exchange (72, 73). In Brazil, mothers giving birth to newborns with microcephaly had reported fever, rash, and conjunctivitis during pregnancy, the most common signs and symptoms of ZIKV infection. However, diagnosis beyond the earliest stage of acute disease is nearly impossible in dengue-endemic regions, and since the majority of exposures to ZIKV may cause asymptomatic infections, large numbers of infected pregnant women may have gone undiagnosed or misdiagnosed (20). Importantly, the severity of these responses varies greatly between patients and indicates that while studies on human immune responses to ZIKV infection are important, there is a large knowledge gap yet to be filled regarding the diversity of ZIKV infections among human demographics. An excellent example of how ZIKV pathogenic severity depends on individual genetic background is the study published by Caires-Junior et al., which compares pairs of twins in which one twin is diagnosed with congenital Zika virus syndrome (CZS) (74). This study demonstrates that upon infection of neural tissues with ZIKV, cells from the CZS twins grew more slowly and exhibited increased viral replication. Importantly, the transcriptome results of the study reveal a significant difference in the level of an mTOR inhibitor protein, DDIT4L, between CZS and unaffected twins. This finding is significant because it indicates an individual genetic disposition for increased mTOR signaling, and as mTOR signaling pathways are critical for autophagy-mediated viral clearance, it means that ZIKV infections are intensified in these individuals.

#### Fetal Brain Infections

Zika virus's proclivity for neural cells is not a novel finding among flaviviruses, as members of this viral family are known for their neurovirulent qualities. Several studies have sought to investigate the neuroinvasiveness and pathogenesis of ZIKV in both human-specific cell cultures and neonatal mice. One example is the work done by van den Pol et al., which investigates ZIKV cellular targeting within the brain (75). Here, *in vivo* studies of neonatal mouse brains reveal ZIKV infection of specific regions of the brain as early as 4 days postinfection, with the pattern of infection fully determined by 7 days postinfection. Additionally, human cell culture work by van den Pol et al. reveals that human astrocytes demonstrate a 24-h virus incubation period before shedding active virus into culture media. Importantly, a study by Lin et al. using cultured human fetal brain tissues suggests that ZIKV enters the brain *via* the subventricular zone and actively infects and replicates in committed neural cells; thus, as cells propagate to develop the brain, the viral presence increases (76). The authors then speculate that the immune response to this neural infection may drive microcephaly.

#### Urogenital Tract Infections

Viral RNA can persist at high levels for months in the sperm of infected men even after resolution of symptoms and persists in the vaginal secretions of infected females for weeks after symptoms resolve (60, 77). This type of persistence in the reproductive tract and sexual transmission is not observed with other flaviviral infections (78).

#### Ocular Infections

Ocular infections are also unique to ZIKV and are likely transmitted through conjunctival fluids, tears, and lacrimal glands. Although the reported cases of African lineage infections were too few to identify similar symptomatology, ocular infections by the Asian lineage have been documented in humans and recapitulated in animal models. Inner retinal vasculopathy (79) and other ocular infections have been reported in connection with international travel to South American and Caribbean island nations (79, 80). Ocular abnormalities have been documented in infants with congenital Zika virus syndrome (CZS). In a report by Fernandez et al., postmortem examination of fetuses from terminated pregnancies revealed micro-calcifications of the retina, an increased amount of autolysis of tissues at the front of the ocular tract, detachment of the retinal and RPE layers, and a distinct lack of neural differentiation of retinal neurons (81). This study by Fernandez et al. is consistent with other clinical cases, such as those reported in Brazil, where 34.5% of all ZIKV microcephalic infants reported in the first 2 months of 2016 had ocular abnormalities, such as retinal mottling, optic nerve degeneration, and a lack of differentiation of retinal neurons (82–90).

### Animal Infection Studies

#### Ocular Infections

To investigate the pathogenic variability of ZIKV lineages as it develops during active infections, many investigations have modeled ZIKV infections in animals. Noteworthy among these are murine models, which have been well characterized for use in studying other flaviviruses. A129 and AG129 mice, which are deficient in IFN receptor signaling, showed specific cellular tropism of ZIKV in retinal cell layers of the eye when infected. A129 mice are deficient in IFN alpha/beta receptors, while AG129 mice are deficient in IFN alpha/beta and gamma receptors. Infection of Müller cells and retinal astrocytes resulted in a sustained proinflammatory microenvironment within the ocular tract that contributed to conjunctivitis and uveitis in these mice (91). In an NHP model for ocular and uteroplacental pathogenesis, Mohr et al. demonstrated a lack of retinal neuron maturation, anterior segment dysgenesis, and notable chorioretinal lesions in fetal macaques (92). Different ZIKV strains display divergent tropism within host tissues. Miner et al. showed that ZIKV infections induced purulent viral panuveitis in AG129 mice and that Asian lineage ZIKV from South America reached a higher viral burden in ocular tissues than Asian lineage ZIKV from Oceania (93).

#### Brain, Spleen, and Testicular Infections

Regarding tissue tropism of the Asian and African strains in animal models, the reports are conflicting. Dowall et al. demonstrated that viral RNA and ZIKV-induced histopathology with both MP1751 (African lineage) and PRVABC59 (Asian lineage) were highest in the brain, spleen, and testis of A129 mice, although the histological changes were more prominent in animals infected with the African lineage. The histopathology was minimal in heart, liver, kidney, and lung, and the Asian lineage caused no measurable clinical features (94). The Natal RGN strain from northeastern Brazil was isolated from the brain of a fetus with microcephaly and contained half of all mutations in the NS1 gene, suggesting that tissue-specific evolution of ZIKV has contributed to the emergence of congenital Zika syndrome (2).

#### Lineage-Dependent Differences in Animal and Cell Culture Models

There are limited studies characterizing ZIKV strains and most studies utilize one strain exclusively. Phenotypic differences among African and Asian isolates have been reported in both *in vitro* and *in vivo* models, demonstrating the importance of considering ZIKV isolate, passage history, cell type, or mouse model when interpreting results. Asian strains of ZIKV have been analyzed for differential infectivity in many human and non-human cell lines. These include cells from ovary, kidney, liver, brain, lung, and keratinocytes. The latter of these is important to understand since the skin is the first barrier encountered during mosquito bites, as well as the first defense against ZIKV entry. Analyses from these *in vitro* studies demonstrate that infection after 48 h produced differences between cell lines in the amount of intracellular NS1, as well as amount of virus release, and the extent of infection did not directly correlate to IFN response (95).

Substantial differences are also found between African and Asian strains *in vitro* among mammalian and insect cell lines (27). A low-passage African isolate from mosquito reached higher titers than two low-passage Asian strains at all observed time points (0–36 dpi) in cell lines from four diverse vertebrate hosts and five insect cell lines (27). Similarly, Vielle et al. reported strain-specific infection profiles in Vero cells, Aedes cells, and human monocytoid DCs (MoDCs). The authors used five African and Asian lineage strains isolated from various hosts: MR766 (U-1947) from monkey, MP1751 (U-1962) from a pool of *A. africanus* mosquitos, PF13/25013-18: FP-2013 (French Polynesia) from human serum, PR-2015 (Puerto Rico) from human serum, and G-2016 (Guadeloupe) from human semen. The low-passage U-1962 and U-1947 are very distant phylogenetically. The African lineage U-1962 and the Asian lineage PR-2015 showed the highest rates of infection in Vero cells compared to the other strains. Similar findings for U-1962 were observed in the mosquito cell line. In contrast, *in vitro* studies of human MoDCs showed similar susceptibility to infection, activation/maturation, expression of type I and III IFNs, and cell death between lineages. The authors reported that NS5 of U-1962 showed polymorphism compared to the other strains of the study, but none of the residues were putative STAT2 binding residues, suggesting that the levels of expression of mutations were independent of mutations in the NS5 sequence of the U-1962 strain (96).

Additionally, infectivity studies comparing the two lineages revealed that African strains of ZIKV can infect human neural progenitor cells and produce both higher titers of progeny virus, and also induce higher levels of cellular apoptosis (34, 97, 98). A study by Hamel et al. demonstrated similar findings using human astrocyte cell cultures, where they found African ZIKV strain HD78788 can reach higher infectious titers 24 h postinfection of human astrocytes and also induces less innate antiviral gene transcription than Asian strain H/PF/2013 (99). This trend was further confirmed in human dendritic cells, which are one of the primary cell types naturally infected by ZIKV (96).

While mice may not naturally become infected by ZIKV, they can be used as models for pathogenesis, and similar reports of differential lineage-specific infection characteristics have been published regarding *in vivo* mouse experiments. Zhang et al. conducted a study juxtaposing an older Asian lineage strain from Cambodia with a more contemporary American strain from Venezuela to investigate potential differences in neurovirulence between the two (100). They found that, compared to the Cambodia strain, neonatal mice infected with the Venezuelan strain of ZIKV demonstrated more neuronal cell death, obliteration of oligodendrocyte development, and an increase in the amount of CD68 and Iba1 positive microglia/macrophages in brain tissues. On the other hand, Qian et al. reported that African and Asian lineages showed similar levels of brain-development disruption in an organoid development model, thus prompting questions that call for further analyses within the Asian lineage (101). This change between Asian and American, or alternatively between pre-epidemic and epidemic strains of ZIKV, may be attributed to new mutations between the two (102). Indeed, sequencing data comparing over 20 strains of ZIKV reveals at least 15 amino acid changes between epidemic and pre-epidemic strains, as well as the generation of a 9 amino acid bulge, rather than an external loop structure at the 3-prime UTR region of the NS5 sequence.

Mouse background strain, transgenic line, and pharmacological manipulation in wild-type strains all produce variable results using the same dose and strain of ZIKV from either lineage. The animal model chosen is one potential complication of comparing ZIKV isolates, but so is the *in vitro* and/or *in vivo* passage history of isolates. There is a tradeoff between passage number and viral fitness in either vertebrate or insect hosts. A study by Haddow et al. suggests that the high passage number of traditional African strains, such as MR766, in both Vero cells and suckling mouse brains has resulted in a distinct loss of glycosylation sites, which may thus affect pathology in other organs (32). The pathology of a low-passage African isolate from mosquito and two low-passage Asian strains was compared *in vivo* using two mouse models, the A129 mouse (deficient in type-I interferon receptor, Ifnar1<sup>−</sup>/<sup>−</sup>) and the IFN-I antibody blockade mouse. The A129 mouse is commonly used to study ZIKV because the virus has been shown to infect cells by targeting human STAT2 to suppress IFN signaling, and it has been proposed that since it does not bind murine STAT2, it cannot infect mice unless the type I IFN receptor is knocked out or blocked (1). Of additional note, the majority of investigations that utilize ZIKV grow their viral stocks in Vero cells, which do not produce type I IFN in response to a viral infection and are therefore permissive to ZIKV infection (103, 104).

According to Smith et al., the African isolate caused more severe clinical pathology and lethality in both mouse models, suggesting enhanced virulence of the African strain compared to both Asian strains. Significant phenotypic differences were also observed between the two Asian strains (CPC-0740 and SV0127-14) used in the study; SV0127-14 produced 10- to 100-fold lower titers in all cell types compared to CPC-0740, and it produced only mild clinical symptoms and 10% mortality in Ifnar1<sup>−</sup>/<sup>−</sup> mice versus 90% mortality with the more virulent CPC-0740 (27). The IFN-I antibody mouse model was far less susceptible than the Ifnar1<sup>−</sup>/<sup>−</sup> model, producing zero mortality and no clinical symptoms with CPC-0740. The same result was found using a more recent Asian isolate from Puerto Rico, PRVABC59, in the IFN-I antibody blockade mouse (27).

Similar to previous studies comparing African and Asian strains, Dowall et al. reported that A129 mice tolerated infections with an Asian strain well, while an African strain was lethal, with morbidity and mortality worsening in a dose-dependent manner. Interestingly, although the Asian strain produced no clinical symptoms, viral RNA levels were detected in various tissues, including brain, spleen, lungs, and kidneys and viral burden was detected in secretions, albeit the magnitude and time course of the Asian and African strain differed, with detection levels produced earlier using African infections. Moreover, seroreactivity revealed detectable antibody responses in the Asian infected A129 mice despite no clinical signs of illness (1).

### CURRENT UNDERSTANDING OF ZIKV AND THE HOST IMMUNE RESPONSE

Zika virus hid from the public eye for several decades after its discovery in the 1940s, as many cases of ZIKV infection are believed to have been either subclinical or misdiagnosed as a different flavivirus infection, such as DENV. As ZIKV spread from Africa across Asia and into the Polynesian and Micronesian islands, it shifted from a mild pathogenesis and largely subclinical symptomatology to the neurovirulent Asian lineage with a higher incidence of congenital abnormalities. The specific pathways activated by viral infection inevitably steer the innate immune response toward differential patterns and intensities of cellular and humoral responses. ZIKV is a member of the flavivirus family, and thus shares similar cell signaling pathways with other viruses within this group, which directly antagonize the IFN response system of the host innate immune response, but through a species-specific mechanism (**Figure 3**). Thus, to fully comprehend the challenges that ZIKV poses to human immunity and to recognize fully efficacious vaccine candidates, a detailed understanding of how ZIKV directly evades and antagonizes host innate and adaptive responses is vital.

#### Type I IFN Responses

Type I IFN refers to the classic IFN α/β signaling pathway, whereby viral antigens, or pathogen-associated molecular patterns, are recognized by pattern recognition receptors, initiating an intracellular protein cascade that culminates in protein translocation into the nucleus and subsequent transcriptional activation of the genes for IFN-α and IFN-β (105). When IFN α/β binds to its surface receptor, IFN alpha receptor (IFNAR), on nearby cells, a series of phosphorylation events involving Janus kinase 1, tyrosine kinase 2, and the many signal transducer and activator of transcription (STAT) proteins ultimately results in the transcription of IFN-stimulated genes (ISGs), which have antiviral properties through a broad variety of mechanisms.

STAT1 heterodimers and thus increases chemoattractant IFN-stimulated genes. (D) STAT2 is targeted for proteasomal degradation by NS5 resulting in an increased rate of STAT homodimerization and upregulation of anti-inflammatory cytokines.

#### Lineage Similarities

Many flaviviruses directly antagonize different stages of the type I IFN pathway *via* species-independent and diverse mechanisms that all centrally rely on non-structural protein 5 (NS5). Of the seven NS proteins and sub-proteins, the NS5 protein functions as the viral polymerase enzyme and in RNA capping (106–108). Dengue virus NS5 protein has been shown to inhibit human STAT2 function by means of an E3 ubiquitin ligase, called UBR4, that targets STAT2 for proteasomal degradation. ZIKV similarly inhibits human STAT2, but can do so independently of UBR4 (109, 110). ZIKV inhibition of STAT2 has been demonstrated in several studies in which infection of primary human dendritic or endothelial cells resulted in lower expression of pro-inflammatory cytokines, such as interleukin 6 (IL-6), IFN α/β, and chemokine C-C ligand 5 (CCL5) (111, 112). Additionally, NS1 and NS4B have both demonstrated the ability to inhibit the production of IFN α/β after direct stimulation with poly I:C (synthetic double-stranded RNA) by blocking the formation of the TBK1 (TANK binding kinase 1) complex, which allows for oligomerization of interferon regulatory factors (IRFs) (107, 113, 114).

Primary human skin fibroblasts infected with the French Polynesia isolate H/PF/2013 mounted innate immune responses by increasing the expression of RIG-I, MDA5, TLR3 leading to upregulation of Type I IFNs and ISGs as well as CXCL10 and CCL5 (115). The same strain led to IFN-β production and induced apoptosis of infected lung epithelial cells A549 (116). Overall, both lineages have demonstrated similar mechanisms of Type I IFN activation and upregulation, and similar pathways of STAT2 inhibition and targeting for degradation.

Importantly, ZIKV inhibits human STAT2 function *via* a mechanism similar to DENV, but does not similarly inhibit murine STAT2. These two proteins share 64% sequence homology, which may thus account for the dissimilar protein interactions. This single protein non-homology between human and mouse STAT2 has led to a domination of immune-deficient murine models as the primary model for pathogenic and vaccine studies (117, 118). These models are primarily A129 and AG129 systems, which lack the IFNAR protein, required for the production of ISGs. Despite many investigators' claims that immune competent models for ZIKV will lack symptomatic expression, replication, and similarity to human infections, several recent publications have demonstrated that ZIKV can be modeled in immune competent mice using either a C57BL/6 or BALB/c background (119). The success of immune competent systems may be directly attributed to ZIKV inhibitory effects on STAT3, 4, 5, or 6. This knowledge, however, still requires investigation and remains a knowledge gap within the field.

#### Lineage Differences

Bowen et al. found that dendritic cells from donors are productively infected *in vitro* by both lineages, but with different kinetics. The African lineage showed faster replication and greater infection magnitude and, unlike the Asian lineage, caused cell death. All strains antagonized STAT1 and STAT2. Asian strains used included PRVABC59 and P6-740, while African strains included the Ugandan prototype strain (MR766) and a low-passage Senegalese strain (DakAr41524). They found that the viral kinetics varied with the source of the monocyte isolate, but that related strains varied similarly. All strains induced IFNB gene transcription, and a decrease in STAT2 phosphorylation was seen in both lineages but was more pronounced in the African lineage (111).

Simonin et al. infected human primary neural cells with African and Asian lineage strains. The African strain induced upregulation of at least 19 genes, including RIG-I, MDA-5, and TLR-3, and induction of type I and type II IFN was higher, associated with enhanced levels of inflammatory cytokines such as IL-6 or tumor necrosis factor (TNF). The only downregulated gene was CXCL8, a mediator of inflammatory responses. The Asian strain did not show any significant upregulation of genes; instead, four genes (CXCL8, CXCL10, CASP1, CTSS) were downregulated. Even at higher MOIs, the cytokine response to the Asian strain was weak. In addition, neural cell infection by both strains showed similar differences in viral infectivity and cytokine production (97). In contrast, when McGrath et al. infected human neural stem cells from two individual patients with Asian and African lineages of ZIKV and conducted a transcriptomics analysis, they found increased expression of IFN-α, IL-2, TNF-α, and IFN-γ genes, as well as genes involved in complement, apoptosis, and STAT5 signaling pathways, in cells infected with the Asian isolate (120). Patients displaying neurological symptoms during the ZIKV epidemic in Brazil demonstrated higher blood concentrations of pro-inflammatory cytokines, such as IL-6, IL-7, and IL-8, as well as higher levels of chemotactic molecules, such as IP-10 and MCP-1 (121). These findings were recapitulated in a non-human primate model (122) and in mouse studies modeling ZIKV infections using homologous strains.

Ultimately, this inhibition of type I IFN production and signaling results in an attenuated innate response to infection and may alter T cell-specific responses (123, 124). Presently, fewer than a handful of reports have examined the host-induced immune responses to genetically evolved ZIKV. Tripathi et al. examined strain-specific differences using Ifnar1<sup>−/−</sup> and Stat2<sup>−/−</sup> C57BL/6 mice. They looked at three viruses from the Asian lineage (P6-740, FSS13025, PRVABC59) and two from the African lineage (MR766; DakAr41519). They found that the African strains conferred faster onset of disease and higher mortality in both mouse models. Infection with the African strains was marked by more severe neurological symptoms, while neurological symptoms in Asian infections were more prolonged. While both lineages induced host inflammatory responses, the African isolates elicited higher levels of several cytokines and markers of T cell infiltration (IL-6, CXCL10, TNF-α, IFN-γ, CCL3, CCL4, CCL5, CXCL9, GZMB, CCL2, CCL7, CXCL1, CXCL2, IL-1β, IL-15, CD4, CD8, CCR5, CXCR3, CCR2) (125).

Foo et al. infected human blood monocytes with the Uganda strain MR766 or the French Polynesia strain H/PF/2013. They found that both strains productively infected CD14 monocytes, but that infection with Asian viruses led to the expansion of non-classical monocytes, resulting in an M2-skewed immunosuppressive phenotype marked by IL-10 production. African lineages, on the other hand, induced pro-inflammatory M1-skewed responses, inducing CXCL10. They also found that blood from pregnant women was more susceptible to infection. Infection with virus from the African lineage led to higher viral burdens and increased levels of IFN-β, STAT, OAS, IRF, and NF-κB. The Asian lineage induced a greater expansion of CD14<sup>lo</sup>CD16<sup>+</sup> non-classical monocytes, despite having a lower viral load. In general, the African strain promoted cytokines and immunomodulatory genes involved with inflammation (CXCL10, IL-23A, CD64, CD80, IL-18, IDO, SOCS1, CCR7), while the Asian strain was associated with the activation of immunosuppressive genes (IL-10, Arg1, CD200R, CD163, CD23, CCL22, VEGFa). These results were confirmed using a second strain of ZIKV from each lineage (IbH30656 for African and PRVABC59 for Asian). Pregnancy enhanced infection of CD14 monocytes by both lineages, and a similar pattern was found, with the African lineage having a higher viral load. Blood from the first and second trimesters of pregnancy demonstrated considerably higher CD14<sup>lo</sup>CD16<sup>+</sup> non-classical monocyte levels upon infection with the Asian strain, but blood from the third trimester had levels similar to those of nonpregnant blood. The African strain, however, produced a slight increase in non-classical monocytes during all three trimesters. Unlike monocytes from non-pregnant women, monocytes from pregnant women were more reactive to the Asian lineage than to the African one. The Asian strain additionally induced genes associated with adverse pregnancy outcomes in the first two trimesters of pregnancy (ADAMTS9 and fibronectin 1) (73).

While both Asian and African lineages have demonstrated similar agonism and antagonism in Type I IFN signaling, Asian lineage strains have additionally shown a secondary method of IFN antagonism. Asian strains isolated from South American countries appear to directly activate IRF3, IRF7, and IRF9 through NS1, NS4, and NS5 viral proteins (112). African lineages have not yet demonstrated this ability, which implicates the differential amino acid residues as key binding factors for innate immune mediators.

#### Type II IFN Responses

Unlike the type I IFN response, which inhibits ZIKV infection, Chaudhary et al. demonstrated that IFN-γ (a type II IFN) can not only decrease ZIKV infectivity in mammalian cell culture, but can also upregulate transcription of innate inflammatory and chemoattractant cytokines, such as IRF1 and CXCL10, respectively. This suppression of type I responses, together with activation of type II responses, is also demonstrated by the NS2A and NS4B proteins. This phenomenon of variable activation of type I and type II IFN responses has also been documented in human clinical cases of ZIKV infections (126). A study performed by Kam et al. sought to fully characterize immune biomarkers associated with ZIKV infections in 95 human clinical cases from Brazil. The authors demonstrated an increase of IFN-γ among human febrile cases of ZIKV and differential cytokine expression between febrile patients and those with neurological complications. Between these two groups, patients with neurological complications showed similar levels of IFN-γ and decreased levels of anti-inflammatory cytokines, such as IL-10 and IP-10 (121). ZIKV NS5 was shown to generate the differential type I and type II responses during infection by specifically inhibiting IFN-β signaling while simultaneously functioning as a prominent activator of IFN-γ signaling.

Interferon-γ plays a critical role in host antiviral responses, and increased levels of IFN-γ can be associated with the host natural killer (NK) cell response, as NK cells secrete high levels of this cytokine during infection. A study by V. Costa et al. proposed that DENV infection is controlled by NK cells specifically through the production of IFN-γ, and that these NK cells are activated by DENV-infected dendritic cells (DCs) (127). While this specific mechanism of IFN-γ inhibition has not yet been demonstrated for ZIKV, it has been shown that ZIKV does infect antigen-presenting cells upon infection *via* mosquito bite. Cimini et al. found that the amount of IFN-γ secreted by CD4<sup>+</sup> T-cells is reduced (128). Given that CD4<sup>+</sup> and CD8<sup>+</sup> T-cells have proven to be the dominant drivers of ZIKV clearance during human infection (129), this alteration in cytokine secretion raises several questions regarding the mechanisms that ZIKV uses to subvert the host immune response and facilitate its replication.

Ngono et al. compared CD8 T cell responses to two Zika strains from the African (MR766) and Asian (FSS13025) lineages in wild-type C57BL/6 mice treated with an IFNAR blocking antibody, and in LysMCre<sup>+</sup>IFNAR<sup>fl/fl</sup> C57BL/6 (H-2<sup>b</sup>) mice (lacking IFNAR in certain myeloid cells). They found that the viral load of the African lineage strain decreased 3 days postinfection, but that of the Asian lineage strain did not. Both strains elicited similar levels of granzyme B<sup>+</sup> CD8<sup>+</sup> T cells in both mouse models. They additionally identified epitopes recognized by IFN-γ-secreting CD8<sup>+</sup> T cells and found that in both strains the major epitope was E protein derived. In LysMCre<sup>+</sup>IFNAR<sup>fl/fl</sup> mice, they identified 14 peptides specific to the African lineage, three specific to the Asian, and 12 shared by both, with all proteins being targeted except for NS1 and NS2b in the Asian lineage. The Asian and African strains resulted in sixfold and fivefold increases in CD44<sup>+</sup>CD62L<sup>−</sup>CD8<sup>+</sup> T cells, respectively, indicating a strong CD8 response. Both strains showed similar CD8<sup>+</sup> T cell kinetics, with the percentage of IFN-γ<sup>+</sup> CD8<sup>+</sup> T cells being highest at day 7 postinfection. It is important to note that the MR766 isolate was serially passaged in mouse brains, possibly affecting its behavior (124).

Collectively, the majority of the investigations regarding ZIKV and the Type II IFN response have been done using Asian lineage viruses. Thus, there is an immense knowledge gap concerning African lineage ZIKV strains and how they may directly affect the Type II IFN response. While it can be inferred that both African and Asian lineages benefit from the increase in STAT1 being freely able to generate homodimers and thus promote ISGs, African lineage ZIKV strains have not specifically demonstrated this ability, and thus it remains unknown.

#### Type III IFN Responses

Discovered in the early 2000s, type III IFNs comprise four variants of IFN-λ (numbered 1–4). The receptor for this type of IFN is unique because, rather than being ubiquitously expressed on nucleated cells like IFNARs, it is selectively expressed on epithelial cell surfaces. Thus, IFN-λ plays a distinct role in the protection of epithelial barriers. Additionally, although the receptor is expressed on epithelial surfaces, other cell types can respond to type III IFN signaling, such as those in the central nervous system (130, 131). While the role of IFN-λ has been studied during WNV and YFV infections, there is limited information regarding IFN-λ and ZIKV infections. During DENV and YFV infection, the depletion of type III interferons results in impaired CD4<sup>+</sup> and CD8<sup>+</sup> T-cell activation, and thus also negatively impacts viral clearance. Additionally, in mouse models deficient for type III IFN signaling and infected with YFV, the loss of type III IFN signaling results in decreased blood/brain barrier maintenance and thus allows for viral neuro-invasion (132). Of the ZIKV studies investigating type III IFN responses to infection, many focus on maternal and fetal infections, with emphasis on the fetal/maternal blood barrier (133–137).

The placenta is the organ that separates the fetal and maternal blood supply, primarily through the chorionic villi, where fetal and maternal blood are spatially separated by 3–4 cell layers. After blastocyst implantation in the uterine wall, trophoblast cells multiply and differentiate into various cell types. One such cell type, the syncytiotrophoblast, forms the outer epithelial layer of the chorionic villi, where the majority of fetal/maternal blood exchange occurs (138). The syncytiotrophoblast layer is the primary epithelial defense in the fetal/maternal blood barrier and contains the first cells ZIKV encounters during fetal infection. Indeed, the type III interferons produced by syncytiotrophoblasts allow for autocrine protection, and subsequently prevent ZIKV from infecting the fetus (139).

Given that the cells of the fetal/maternal blood interface are resistant to ZIKV infections based on their ability to secrete IFN-λ, several studies have focused on uncovering the mechanism by which ZIKV gains entry into the amniotic space and thus can infect the fetus (140). These studies focused on a specific type of fetal macrophage called Hofbauer cells (HBCs), which derive from the fetus, are of monocytic origin, and are commonly found throughout the chorionic villi (141). These cells first appear early during human pregnancy (within the first 3 weeks), and then diminish in number between the fourth and fifth month of gestation. Among mothers and fetuses infected with ZIKV, however, HBCs have been seen to linger long into the third trimester of pregnancy at a density higher than normally observed (142). Not only do HBCs persist at increased density, they are also susceptible to direct infection by ZIKV. It is speculated that they can move freely between the chorionic tissues where blood exchange occurs in the placenta (143–145). Thus, by traversing from the chorionic fetal/maternal blood interface to the amniotic sac and fetus, HBCs can act as a shuttle for ZIKV to bypass the fetal/maternal blood barrier and infect the fetus.

Asian lineage strains have shown the ability to upregulate Type III IFN production and mRNA translation both in cell culture and in primary human clinical cases. African lineage strains have demonstrated this ability to a lesser extent, and only in cell cultures. *In vitro* studies with ZIKV AF (MR-766) and AS (FSS13025) infected human choriocarcinoma JEG-3 cells showed induction of antiviral type III IFN responses and the ISG 2′-5′ oligoadenylate synthetase, suggesting that IFN type III responses produced by human placental trophoblasts confer protection against ZIKV infection (143). Interestingly, while both lineages have demonstrated an increase in both translation and transcription, there is no increase in the amount of active Type III IFN proteins produced and detected in culture medium.

#### ZIKV Vaccine Development

Since the outbreaks that garnered international attention for ZIKV in 2015, the race for a vaccine to combat ZIKV has yielded several candidates, currently at various stages of development. A successful ZIKV vaccine must confer strong protection in healthy and pregnant populations, while also proving safe and efficacious with regard to fetal/neonatal health. Important features to consider for the best vaccine include both enhanced magnitude and quality of neutralizing antibodies and minimal cross-reactivity to DENV to prevent antibody-dependent enhancement (ADE) (146–149). ADE is perhaps the most critical of these: because flaviviruses are antigenically and structurally similar, non-neutralizing antibodies generated against one flavivirus can result in fatal outcomes upon secondary infection with a different flavivirus.

Recombinant envelope (E) protein subunit vaccines have been tested as both whole protein and single domain subunit candidates. Many of the published E protein subunit vaccines have required multiple booster administrations to achieve high antibody titers, and rely on the use of adjuvants to amplify the humoral response induced (150, 151). The use of only domain III of the E protein has proven sufficient to neutralize ZIKV in cell culture systems and in multiple mouse models when proteins are generated using the French Polynesian strain of ZIKV (Asian lineage) as a template (152). Combining the knowledge of how T-cells mediate viral clearance of ZIKV infections with epitope predictions, Pradhan et al. demonstrated through *in silico* analyses that subunit vaccines using the NS2B, NS3, and NS4A proteins have potential as neutralizing vaccines that mediate a strong CD4<sup>+</sup> T-cell response (153).

Another class of ZIKV vaccine candidates is lipid-encapsulated mRNAs. While multiple publications have demonstrated that lipid-enclosed mRNA vaccines are capable of antibody-based neutralization, only the work of Richner et al. addresses the issue of cross-reactive antibodies, by specifically deleting the fusion loop in domain II of the ZIKV E protein (154). Removal of this fusion loop domain results in minimal cross-reactivity between ZIKV and DENV serotype 1. DNA vaccines have also been evaluated as ZIKV candidates, and one has progressed to clinical trials. DNA vaccines provide an economic advantage with their ease of production, and Larocca et al. have demonstrated that a ZIKV DNA vaccine using the E and membrane (M) proteins can induce strong T-cell responses (155). An additional study by Muthumani et al. evaluated the efficacy of a DNA vaccine in both immune-compromised mice and NHP and demonstrated the successful generation of neutralizing antibodies and high antibody titers (156). For these vaccines, the constructs were synthesized from residues conserved among MR766 and Brazilian strains in an attempt to generate the broadest cross-protective response. Despite this effort, data regarding non-homologous challenges and evaluations against different strains and lineages remain scarce. This is true for the majority of vaccine candidate studies published.

Live-attenuated vaccines and chimeric vector vaccines offer a third and fourth candidate against ZIKV, as they provide a strong cellular immune response due to their active infections, and thus provide a humoral response that more closely recapitulates a natural infection. Whole inactivated virus vaccines are the frontrunner candidates for future ZIKV vaccines, based on the clinical trials occurring within the United States, 75% of which use whole inactivated particles from the PRVABC59 strain of ZIKV, which poses the largest threat to the US (157). Live-attenuated vaccines also have the benefit of generating a natural cellular response without the risk of causing disease. Shan et al. have demonstrated that deletions within the 3′-UTR of the RNA genome can generate an attenuated ZIKV clone that fails to replicate, and that this vaccine can induce sufficient protection to prevent ZIKV from causing disease (158, 159).

#### Knowledge Gaps and Future Studies

Despite the major concern that ADE and ZIKV antibody cross-reactivity among flaviviruses pose for the development of vaccines, few studies successfully demonstrate that the neutralizing antibodies produced by their particular vaccine candidate are not cross-reactive to other flaviviruses. Thus, while the quest for neutralization is important, the future risk of ZIKV-mediated DENV ADE cannot and should not be overlooked, so that future DENV and CHIKV outbreaks can be avoided. A third consideration is that few, if any, publications directly investigate the binding avidity of these antibodies, focusing instead on neutralization capacity, although the data strongly suggest that low-avidity and low-affinity antibodies have a higher chance of generating ADE for other flaviviruses, presenting another potential complication (152, 160, 161).

Additionally, despite the known disparities between different lineages of ZIKV, a lack of clarity remains regarding the mechanism behind the neutralizing antibodies produced by a candidate vaccine and how the vaccine fares against strains non-homologous to those it contains. Neutralization assays should ideally include multiple strains from each of the three lineages: African, Asian, and American. This information is highly significant, because representative viruses from these lineages demonstrate differential neuroinvasive and inflammatory capabilities, as well as different infection profiles.

Few studies focus on the impacts of ZIKV vaccination during pregnancy, and even fewer address the issue of ZIKV sexual transmission. This is particularly important for areas with a high prevalence of DENV, where ZIKV cases can often be misdiagnosed and left untreated. It is, therefore, of paramount importance to design effective vaccines conferring strong mucosal immunity to these high-risk groups.

#### CONCLUSION

Zika virus continues to pose an international threat as both a neuroinvasive virus with potentially lethal consequences and as a direct danger to pregnant mothers. While ZIKV has demonstrated a continual geographic and phylogenetic expansion, the actual number of genetic mutations between lineages is surprisingly limited. African lineages have been shown to be more infectious and to generate stronger inflammatory responses in a number of *in vitro* and *in vivo* experiments. The question that arises from the collection of studies demonstrating a more virulent role for African strains is why the human impact in the Americas since 2015, including neurological complications and fetal disruption, has been produced by strains with Asian ancestry. It is also curious that the African strain is not found in any recently reported human cases. One possibility is that surveillance of infected humans in Africa is insufficient to detect severe cases of infection with the African strain.

Moreover, the African lineages exhibit an inhibitory mechanism on IFN production and signaling, which has been well documented in other flaviviruses. The Asian and American lineages, however, have evolved a secondary mechanism to prevent IFN transcription, by binding IRF3 and IRF7 and preventing their translocation into the nucleus. Hence, while the African lineage has been shown to be more infectious than the Asian, it often presents as a self-limiting febrile disease, whereas the Asian and American lineages have exhibited persistent infections, with some human cases shedding active virus for upwards of 6 months (162).

Vaccine candidate research for ZIKV continues to be limited, offering little insight into potential differences in vaccination responses between circulating lineages. Additionally, these investigations have not prioritized the critical relationship between ZIKV and other flaviviruses, such as DENV, as the antibodies proven protective against ZIKV may promote more severe infections for DENV, rather than provide cross-neutralizing benefits. While our understanding of ZIKV has increased profoundly, extensive investigations are still required for a better understanding of lineage-specific dynamics and the host immune response in terms of the evolutionary trajectory of these lineages as they continue to expand geographically. Improved understanding of these topics, which currently present knowledge gaps in the field of ZIKV research, will serve as the cornerstone for designing future vaccines and antivirals that are not only efficacious for ZIKV, but also safe against other flaviviruses.

### AUTHOR CONTRIBUTIONS

JB, NL, RH, and IS wrote and reviewed the manuscript, and IS approved the version to be published. All authors listed have made substantial and intellectual contribution to the work.

### FUNDING

The work was supported by NIH grant 5R01AI111557.

#### REFERENCES

Zika syndrome. *JAMA Ophthalmol* (2017) 135:1163–9. doi:10.1001/jamaophthalmol.2017.3595


virological factors potentially associated with the rapidly expanding epidemic. *Emerg Microbes Infect* (2016) 5:e22. doi:10.1038/emi.2016.48


human maternal decidual tissues, inducing distinct innate tissue responses in the maternal-fetal interface. *J Virol* (2017) 91:e01905-16. doi:10.1128/JVI.01905-16


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2018 Beaver, Lelutiu, Habib and Skountzou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*