In and outs of Chuviridae endogenous viral elements: origin of a retrovirus and signature of ancient and ongoing arms race in mosquito genomes

Background Endogenous viral elements (EVEs) are sequences of viral origin integrated into the host genome. EVEs have been characterized in various insect genomes, including mosquitoes. A large EVE content has been found in Aedes aegypti and Aedes albopictus genomes among which a recently described Chuviridae viral family is of particular interest, owing to the abundance of EVEs derived from it, the discrepancy in the endogenized gene regions and the frequent association with retrotransposons from the BEL-Pao superfamily. In order to better understand the endogenization process of chuviruses and the association between chuvirus glycoproteins and BEL-Pao retrotransposons, we performed a comparative genomics and evolutionary analysis of chuvirus-derived EVEs found in 37 mosquito genomes. Results We identified 428 EVEs belonging to the Chuviridae family confirming the wide discrepancy between the number of genomic regions endogenized: 409 glycoproteins, 18 RNA-dependent RNA polymerases and one nucleoprotein region. Most of the glycoproteins (263 out of 409) are associated specifically with retroelements from the Pao family. Focusing only on well assembled Pao retroelement copies, we estimated that 263 out of 379 Pao elements are associated with chuvirus-derived glycoproteins. Seventy-three potentially active Pao copies were found to contain glycoproteins into their LTR boundaries. Thirteen out of these were classified as complete and likely autonomous copies, with a full LTR structure and protein domains. We also found 116 Pao copies with no trace of glycoproteins and 37 solo glycoproteins. All potential autonomous Pao copies, contained highly similar LTRs, suggesting a recent/current activity of these elements in the mosquito genomes. Conclusion Evolutionary analysis revealed that most of the glycoproteins found are likely derived from a single or few glycoprotein endogenization events associated with a recombination event with a Pao ancestral element. A potential fully functional Pao-chuvirus hybrid (named Anakin) emerged and the glycoprotein was further replicated through retrotransposition. However, a number of solo glycoproteins, not associated with Pao elements, can still be found in some mosquito genomes 114 million years later, suggesting that these glycoproteins were likely domesticated by the host genome and may participate in an antiviral defense mechanism against both chuvirus and Anakin retrovirus.

LTR structure and protein domains. We also found 116 Pao copies with no trace of glycoproteins and 37 solo glycoproteins. All potential autonomous Pao copies, contained highly similar LTRs, suggesting a recent/current activity of these elements in the mosquito genomes. Conclusion: Evolutionary analysis revealed that most of the glycoproteins found are likely derived from a single or few glycoprotein endogenization events associated with a recombination event with a Pao ancestral element. A potential fully functional Pao -chuvirus hybrid (named Anakin) emerged and the glycoprotein was further replicated through retrotransposition. However, a number of solo glycoproteins, not associated with Pao elements, can still be found in some mosquito genomes 114 million years later, suggesting that these glycoproteins were likely domesticated by the host genome and may participate in an antiviral defense mechanism against both chuvirus and Anakin retrovirus.

Background
Viruses have long-term and intricate interactions parasitizing host cells and hence both viruses and host are subject to an endless arms race (FORTERRE;PRANGISHVILI, 2009) . A large body of evidence currently supports that virus/host interactions can occur at both the protein and nucleic acid levels. One clear example of the last, is that viral genomic sequences can be integrated into the host genome (FESCHOTTE; GILBERT, 2010;JOHNSON, 2019) . The process of viral genome integration is called endogenization and normally occurs as a "life cycle" stage in viral groups such as retroviruses and phage DNA viruses (WEISS, 2016) . However, recent studies have shown that genomes, or genomic regions, of non-integrative viruses can also be found integrated into various eukaryotic genomes (FESCHOTTE; GILBERT, 2010;JOHNSON, 2019;KATZOURAKIS;GIFFORD, 2010) . These viral loci have been called endogenous viral elements (EVEs).
Endogenization can occur by two main mechanisms: through non-homologous recombination mediated by double-strand break repair pathway of the host cell; or mediated by proteins, such as reverse transcriptases and integrases, from the endogenous retrotransposons-envelope-free retrovirus-like elements (KATZOURAKIS; GIFFORD, 2010) . Recent findings on mosquito genomes suggest that the latter mechanism is likely to be the most important, since the abundance and diversity of EVEs are positively correlated with retrotransposon abundance and activity (PALATINI et al. , 2017;WHITFIELD et al. , 2017) .
Likewise, EVEs and retrotransposon loci normally exist in close proximity in the host chromosome, suggesting a causal role between the activity of certain retrotransposon families and EVE integration (PALATINI et al. , 2017;WHITFIELD et al. , 2017) . EVEs are normally found as fragments of exogenous viral genomes and it is therefore unlikely that they are able to generate new virus particles or infect new cells. Therefore, there are three main hypotheses regarding the fate and impact of these elements on the host genome: i) EVEs may evolve neutrally accumulating mutations and degenerate over time; ii) EVEs may be co-opted by the host genomes, giving rise to new functional host genes; and iii) EVEs may play an antiviral role, generating small RNAs which degrade cognate exogenous viral RNA or non-functional viral proteins that hinders proper assembly/maturation of a new viral particle or blocks the viral receptor on the host cell surface (ARMEZZANI et al. , 2014;ITO et al. , 2013;ROBINSON et al. , 1981) . The vast majority of studies on EVEs found in mosquito genomes focus on the role of piRNA production as a post-transcriptional regulatory mechanism of exogenous circulating viruses (PALATINI et al. , 2017;WHITFIELD et al. , 2017 ;THÉRON et al. , 2013) . On the other hand, there are few studies of the role of EVE proteins in the antiviral response or their role in the emergence of new host genes.
Chuviridae is a recently discovered RNA viral family of negative-sense single-stranded viruses characterized by metatranscriptomic and bioinformatic analysis only (SHI et al. , 2016) . The information available on this family is limited to its distribution (it likely infects several arthropod species including mosquitoes), its variable genomic structure (unsegmented, bisegment and circular), and the presence of a number of EVEs in species from Amphipoda, Hemiptera, Coleoptera, Hymenoptera and Diptera genomes (Shi et al. 2016). An in-depth descriptive analysis of chuvirus-derived EVEs exists only for the Ae.
aegypti mosquito genome (WHITFIELD et al. , 2017) , in which they displayed three intriguing features: a great abundance of chuvirus-derived EVEs compared to other viral families (it is the second most abundant EVE family, outnumbered only by the Rhabdoviridae family); an association with retroelements from the BEL-Pao superfamily; and a striking difference in the viral genome fragment endogenized -a much higher quantity of glycoprotein (Gly) compared to RNA dependent RNA polymerase (RdRp) and nucleoprotein (NP) sequences (WHITFIELD et al. , 2017) . Endogenization of different viral regions are expected to be influenced by the virus genome structure and orientation of the RNA replication process. Whitfield et al., 2017 have shown that EVEs derived from another non-segmented negative-sense single-stranded virus from the Rhabdoviridae family occur in the following order of abundance: NP -> Gly -> RdRp ressambling the order in which the viral genomes is replicated in this viral family. On the other hand, Chuviridae viruses, which are similar to Rhabdoviridae in genomic structure (NP-Gly-RdRp), shows a very different endogenization pattern, with eighty-seven endogenous Gly sequences, only four from RdRp and no endogenized nucleoproteins (WHITFIELD et al. , 2017) . It may indicate that chuvirus-derived EVEs found in Ae. aegypti genome were either derived from a exogenous chuvirus with segmented or non-segmented genome and for some reason Gly has been majorly endogenized, or that the endogenization of chuvirus genomic regions occurred evenly at the beginning but only Gly was maintained through the evolutionary time.
Endogenous repetitive elements are abundant in metazoan genomes and mosquito genomes from the Aedes genus are particularly full of retrotransposons/retroviruses (GOUBERT et al. , 2015;NENE et al. , 2006;WHITFIELD et al. , 2017) . One of the most abundant LTR retrotransposons/retroviruses are from the BEL-Pao superfamily which is also very abundant and widespread in other metazoan genomes (MALIK, H. S.;EICKBUSH, 2000) . Elements from this group have two long terminal repeat (LTR) regions, consisting of between 100 and 900 base pairs, and two coding regions -a capsid protein with a GAG domain, and a polyprotein, which commonly has four domains: aspartic protease (PR), reverse transcriptase (RT), RNAse H (RH) and integrase (INT). Moreover, at least three elements (Roo, Tas and Cer-7) have an envelope-like protein (Gly) downstream of the polyprotein, which were acquired from Gypsy , Phlebovirus and Herpesvirus, respectively (FELDER et al. , 1994 ;BROWNING et al. , 1996 ;MALIK, HARMIT S.;HENIKOFF, 2005) .
In view of all the intriguing aforementioned chuvirus/BEL-Pao/host features, we investigated in depth which biological phenomenon has generated the high abundance of chuvirus glycoproteins found in mosquito genomes and examined the role these glycoproteins may play in BEL-Pao retroelements and mosquito biology. Our results showed, for the first time, that these EVEs are broadly present in mosquito genomes and that a large majority of the glycoproteins are physically associated with elements from the Pao family.
We also found that most of the chuvirus-derived glycoproteins are structurally associated with potentially autonomous Pao elements and are likely to play a role in viral particle formation as an envelope protein. However, we also found structurally conserved solo glycoproteins, suggesting a potential role for these in antiviral defense mechanisms via receptor competition against bona fide chuviruses and/or the new retrovirus (named Anakin in this paper) emerging from the acquisition of the chuvirus glycoprotein by Pao retroelements.

Data Collection
A non-redundant database of proteins including all taxa and a database of chuvirus genomes were obtained from NCBI (last accessed January 2018). Mosquito genomes were retrieved from NCBI and Vectorbase (last accessed January 2018) and chuvirus-derived EVEs already identified in mosquito genomes were retrieved from (WHITFIELD et al. , 2017) . Mosquito genome sources and assembly metrics are presented in Supplementary Material 1 . All command lines used in the present study can be found in Supplementary Material 2 .

EVE Screening
Two BLAST-based strategies were used. The first used the chuvirus genome dataset as a query in a tBLASTx (ALTSCHUL et al. , 1989) screening against mosquito genomes, and the second used the EVEs from (WHITFIELD et al. , 2017) as a query in a BLASTn analysis against mosquito genomes. All EVE regions were extracted from mosquito genomes in two ways: I -only the ungapped aligned region; II -ungapped aligned region plus 10kb of each flanking region.
The resulting sequences were used as a query in a BLASTx analysis against the non-redundant protein database with different filters in order to eliminate false positives and false negatives in chuvirus EVE identification. This was done for two reasons. First, as a number of viral proteins are still not annotated in the databases and hence if one considers the first hit alone in order to determine the viral origin one would rule out some EVEs and produce a false negative result. Second, some wrongly annotated viral proteins would cause the first hit to be mis-annotated as viral, while the following two or more hits would indicate that the sequence belongs to another taxon, generating a false positive result. Queries that showed four or fewer matches were annotated as an EVE if the best match was a viral protein. For queries with five or more subject matches, the proportion of the five best matches was taken into consideration, in accordance with the following criterion: a sequence is annotated as an EVE only if three or more subjects have viral proteins.
The EVE sequences containing the flanking sequences were clusterized using CD-HIT (LI; GODZIK, 2005) to remove redundancy. Flanking sequences were retained to avoid clusterization of identical or very similar copies while allowing the removal of the same EVE copy recovered using the two search strategies (chuvirus genomes and EVEs).

Chuvirus EVE characterization
Using the above strategy, we obtained a total of 428 chuvirus-derived EVEs from mosquito genomes, of which 409 were glycoproteins, 18 RNA-dependent RNA polymerases and one nucleoprotein region. However, a number of these were found in small-size contigs and in close proximity to indeterminate "NNN" regions in the assembly. In order to avoid genome assembly problems, we restricted our analysis to contigs bearing EVEs with at least For the remaining 322 sequences, we extracted potential coding regions using the EMBOSS getorf ( http://www.bioinformatics.nl/cgi-bin/emboss/getorf ) tool, and ORFs containing fewer than 100 amino acids were removed from further steps, resulting in 279 sequences.

Flanking sequence analysis
Once they had been passed through the aforementioned filters, all EVEs plus flanking regions were translated using the EMBOSS getorf software, and the resulting ORFs were

Search for all homologous BEL-Pao elements
Nucleotide sequences of complete BEL-Pao elements containing domain signatures and LTRs recovered in the previous analysis were used as queries in a BLASTn analysis against the respective mosquito genomes to recover all homologous BEL-Pao copies.
Sequences retrieved in this step were recovered with 10kb of each flanking region and screened for the presence of chuvirus-derived EVEs. Long terminal repeat (LTR) regions from all sequences retrieved were evaluated using LTR_FINDER and default parameters (XU; WANG, 2007) .
All EVEs were sorted, based on RdRp, Gly or NP proteins and BEL-Pao Retrotransposons based on whether they are associated with chuvirus-derived glycoproteins or not and the structural conservation of the copies, into the following categories: i) potentially active retrotransposons containing both LTRs, BEL-Pao conserved protein domains (GAG-PR-RT-RH-INT) with or without a chuvirus-derived glycoprotein; ii) defective elements which have at least one BEL-Pao domains and one or no LTRs; iii) solo LTRs; and iv) solo glycoproteins ( Table 1 ).

Molecular modeling
Molecular modeling was performed for three groups of glycoproteins: I -Solo glycoproteins, corresponding to glycoproteins without the BEL-Pao signature in their flanking regions; II -glycoproteins found inside of BEL-Pao element boundaries (LTRs) -in order to select some Gly proteins found inside BEL-Pao retrotransposons (group II), an amino acid distance matrix was created using UGENE (OKONECHNIKOV; GOLOSOVA; FURSOV, 2012) . One random sequence was chosen for molecular modeling from each cluster exhibiting more than 90% similarity ( Supplementary Material 4 ).; and IIIglycoproteins from bona fide chuviruses previously characterized by metatranscriptomics in mosquitoes.
The Phyre2 server (KELLEY et al. , 2015) was used to select a template for each protein of interest. The 3D structure of each template selected was obtained from the Protein Data Bank (PDB) (Berman et al. 2000) and submitted, together with the Gly sequence, to the Modeller package version 9.23 multimer modeling algorithm. The predicted models were evaluated for their stereochemical parameters using the Procheck tool ( Laskowski et al.

Solo LTRs
LTRs from previously characterized elements were used as a query against mosquito genomes in a BLASTn analysis. Matches were recovered with 10 kb of flanking region.
Sequences with a single LTR matching the mean length of LTRs identified using full BEL-Pao elements (around 670 base pairs) and no surrounding BEL-Pao domains were considered to be Solo LTRs ( Supplementary Material 5 ).

Evolutionary analysis
Two analyses were performed. The first was based on Gly protein sequences using only amino acid sequences with more than 100 residues. This analysis included most EVEs Reconstruction of phylogenetic trees was carried out using MrBayes v3.2.2 x64 (RONQUIST; HUELSENBECK; TESLENKO, 2010) , starting with three seeds and 1,000,000 generations. The convergence of the independent runs was detected when the standard deviation of all three seeds was lower than 0.05. The burn-in removed the first 25% of trees sampled and the remaining 75% were used to generate a posterior probability consensus tree and the phylogenies were visualized using iTOL (LETUNIC; BORK, 2007) .
DNA extraction was performed from pools of 5 individuals (females and males) of each species with the protocol established by Ayres, 2003(AYRES et al. , 2003 , the quality and concentration of DNA was evaluated with Nanodrop 2000 (Thermo Fisher Scientific).
Primers to amplify the endogenous gene of sodium channel ( sodium channel protein para-like LOC109432678 ), were used as control to evaluate the DNA samples integrity before the EVE screening, and the regions that covers part of the EVE and a region of the

Supporting evidence of widespread chuvirus endogenization in different mosquito genomes
Four hundred and twenty-eight EVEs were identified in 32 out of 37 genomes screened. These elements corresponded to 409 glycoprotein fragments, 18 RNA-dependent RNA-polymerases (RdRp) and one nucleoprotein ( Supplementary Material 8 ). After the exclusion of EVEs from small contigs and those in close proximity to uncertain sequences/assembled bases (NNNs), 279 sequences remained. Two hundred forty-one of these presented glycoprotein conserved amino acid domains.
In view of the previously described and confirmed abundance of chuvirus-derived glycoprotein EVEs, a more in-depth characterization of these EVEs was performed.
Endogenous glycoproteins varied in size from 117 to 1977 nucleotides (median = 880) and amino acid length from 39 to 659 residues. The amino acid identity with each chuvirus genome used as a query varied from 29.03 to 56.41 percent (average identity = 32.83, SD = 6.31, Supplementary Material 8 ).

Glycoproteins are mostly associated with elements from the BEL-Pao retrotransposon superfamily
Screening for BEL-Pao protein domain signatures and LTRs we found glycoproteins in three different contexts ( Figure 1, Table 1 ). Thirty-seven glycoproteins were categorized as solo glycoproteins, since no BEL-Pao protein domain or LTR signature was found ( Figure   1 A ). Two hundred glycoproteins were flanked by BEL-Pao protein domains but with only one LTR or none, characterizing them as defective BEL-Pao elements ( Figure 1 B ).
Sixty-three glycoproteins were associated with BEL-Pao domains flanked by two LTRs, characterizing them as potentially active elements. These included thirteen of those copies showing all domains and LTRs of complete BEL-Pao elements ( Figure 1 C ).
We also identified solo LTRs in seven genomes ( Table 1 ). These varied in number from only one in the Anopheles gambiae Pimpirena assembly to 82 solo LTRs in the Aedes albopictus C636 assembly. Interestingly, the number of solo LTRs was found to be greater than the number of LTRs associated with full BEL-Pao elements in the Aedes albopictus  C636 and Anopheles maculatus BTQ1 assemblies ( Table 1 ) .

Endogenous and exogenous glycoproteins have similar 3D structures
Although BLAST and phylogenetic analysis indicate that the EVEs found share a common ancestor with chuviruses, the amino acid identity is considerably low (between 25 and 50%). Another way to obtain further support for the viral origin of these endogenous sequences is through 3D structure modeling. If these polypeptides resemble viral glyco/envelope proteins in 3D space, this would corroborate their viral origin.
The templates identified by the Phyre2 tool represent many type B glycoproteins    The evolutionary tree using the RT and RH regions from both BEL-Pao retroelements, including elements containing or not chuvirus glycoproteins, shows four clearly defined clades ( Figure 5 ). One of these has 181 elements without glycoproteins closely-related to BEL elements in Branch 1 ( Figure 5 ). The other three are closely-related clades with Pao elements from Branch 2, of which 129 are composed of elements with glycoproteins (red scalene triangles) and 122 of elements without glycoproteins (blue scalene triangles, Figure 5 ). It is clear that chuviruses glycoproteins are specifically associated with elements from the Pao family -the now called Anakin elements.
There is a large number of Pao elements without glycoproteins, but comparing it with the Anakin elements the last are more abundant and widespread, in a larger number of mosquito species, than Pao elements without glycoproteins. ( Supplementary Material 11 ).

PCR validation of solo-glycoproteins in different mosquito lineages
Five primer pairs were designed to amplify both chuvirus/mosquito integration junction of solo-glycoprotein regions. Four lineages of Ae. albopictus , two lineages of C.
quinquefasciatus and two lineages of Ae. aegypti were screened ( Table 2, Supplementary Material 12 ). At least one EVE-mosquito boundaries were sequenced successfully for all five solo glycoproteins ( Table 2 ), showing a consistent alignment to the corresponding genomic region in the available genomes ( Supplementary Material 12 ). This confirms the integration of the EVEs and highlights that these insertions were conserved and an ancient component of these genomes.

Discussion
Viruses and transposable elements share one or few common ancestors and hence present similar features, particularly for some retroviruses and retrotransposons that are differentiated by the presence/absence of an envelope gene (glycoprotein). The envelope protein is responsible for the infection capacity of the former (HAYWARD, 2017) . But more than that, these entities exchange genetic information between them and with their host genome (DREZEN et al. , 2017;FRANK;FESCHOTTE, 2017;CORDAUX, 2017;SINHA;JOHNSON, 2017) . Several instances are of interexchange of functional genes that allowed major changes in their evolutionary history as known such as the acquisition of a complete herpesvirus by a piggyBac transposon (INOUE et al. , 2017) or the emergence of retroviruses by envelope capture (KIM et al. , 2004) . Here, we characterized a new event of gene sequence exchange between viruses, the mosquito host genome and retrotransposons with the acquisition of a glycoprotein (envelope) gene from a chuvirus by a Pao retrotransposon (named Anakin). Besides, we also found evidence of past and likely current arms race between these entities.
Virus genome replication is tightly linked to virus genome structure. Replication origin and orientation may favor the endogenization of the genomic regions that are first copied/transcribed, since these regions are produced more abundantly than the last replicated/transcribed regions (RUSSO et al. , 2019;WHITFIELD et al. , 2017 Figure 6A ).
Alternatively, this hybrid element may have emerged from a recombination event involving a Pao retroelement and a chuvirus-derived solo glycoprotein after its endogenization ( Figure   6A ). After viral envelope protein endogenization into the host genome, two major events may  ( Table   2 ). But we can not rule out the repurposing of more divergent glycoproteins to other unknown functions at the host level.
Finally, our results show the importance of taking EVEs into account in metaviromic studies (NOURI et al. , 2018) . Some of the chuvirus EVEs detected in our study exhibited a high degree of identity (99.08 to 100%) with previously described circulating chuviruses Figure 6: General scheme of the chuvirus derived EVE integration into the germinative cells of the host genome, phusion with Pao retroelements and potential antiviral mechanisms by receptor competition. A -Bonafide circulating chuvirus infecting a host cell and realising their single strand RNA into the cell cytoplasm (1); RT from retrotransposons fortuitously recognize the viral genome and retro transcribe to dsDNA (2); dsDNA is directly integrated into the host genome by double strand break repair mechanisms or by recognition and integration using integrase proteins from retroelements (3) either in different genomic loci (4) or into a Pao retroelement boundaries (5); Expression of Pao proteins (6); B -chuvirus-derived glycoproteins are dispersed in many genomic loci but have been replicated substantially by the Pao + chuvirus glycoprotein element (Anakin). Anakin retrovirus can assemble viral particles with envelope proteins allowing it to infect new host cells (1); Complete or defective solo glycoprotein are translated and exported to the extracellular environment (2) and can block new viral infection of both Anakin and chuvirus by binding to cell membrane virus protein receptors (3).
Kaiowa, Guato, Cumbaru and Croada (LARA PINTO et al., 2017;PAUVOLID-CORRÊA et al., 2016) . Although, we cannot rule out the possibility that some of these sequences may indeed have come from circulating chuviruses that infect Ae. albopictus species, the high identity values of these "viruses" with the EVEs found here strongly suggest that these previously defined chuviruses are in fact endogenous elements of the Ae. albopictus genome ( Supplementary Material 14 ).

Conclusion
In this study we revealed the diversity of endogenous virus elements derived from the

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.