Stem-Loop RNA Hairpins in Giant Viruses: Invading rRNA-Like Repeats and a Template Free RNA

We examine the hypothesis that de novo template-free RNAs still form spontaneously, as they did at the origins of life, invade modern genomes, contribute new genetic material. Previously, analyses of RNA secondary structures suggested that some RNAs resembling ancestral (t)RNAs formed recently de novo, other parasitic sequences cluster with rRNAs. Here positive control analyses of additional RNA secondary structures confirm ancestral and de novo statuses of RNA grouped according to secondary structure. Viroids with branched stems resemble de novo RNAs, rod-shaped viroids resemble rRNA secondary structures, independently of GC contents. 5′ UTR leading regions of West Nile and Dengue flavivirid viruses resemble de novo and rRNA structures, respectively. An RNA homologous with Megavirus, Dengue and West Nile genomes, copperhead snake microsatellites and levant cotton repeats, not templated by Mimivirus' genome, persists throughout Mimivirus' infection. Its secondary structure clusters with candidate de novo RNAs. The saltatory phyletic distribution and secondary structure of Mimivirus' peculiar RNA suggest occasional template-free polymerization of this sequence, rather than noncanonical transcriptions (swinger polymerization, posttranscriptional editing).

Other molecules (short parasitic repeats, frequently forming stem-loop hairpins, viroids, viruses, etc.) presumably subsisted mainly as parasites and occasionally contributing new, sometimes functional parts. Persistence of the cooperative system implies integrating new molecules with new functions, while channeling most resources to critical components such as ribosomes.
Nowadays, ribosomes compete with parasitic elements that hijack the cell's integrated cooperative system (Xie and Scully, 2017). Viruses frequently mimic cellular processes (Hiscox, 2007), including the cell's replication/transcription compartments (Chaikeeratisak et al., 2017). Hence when life began, ribosomal RNAs had virus-like properties. This arms race might explain why >95% of the cell's transcriptome consists of ribosomal RNA (Peano et al., 2013). Here we hypothesize that new RNAs still spontaneously emerge and integrate molecular cooperative systems of organisms.
We use several analyses, including comparisons among RNA secondary structures, to detect and test for de novo RNA emergence.

Structural Homology
Classical sequence homology between linear sequences is inefficient at reconstructing ancient evolution because sequences evolve relatively fast. Structures, rather than sequences, are conserved for longer periods. For example, analyses considering structural homology among proteins suggest a common cellular ancestry for modern cells and viruses (Nasir and Caetano-Anollés, 2015). Similarly, analyses using simple properties of secondary structures formed by diverse RNAs detect two main clusters (A1 and A2, small vs. complex RNAs), each subdivided into two main groups (A1:B1-B2; A2:D1-D2), schematized in Figure 1.
Cluster B1 is the functionally most diverse group and includes several presumably ancestral RNAs, such as replication FIGURE 1 | Hierarchical cluster of secondary structures formed by diverse RNA molecules, adapted from Seligmann and Raoult (2016). origin, tRNA and some ribozymes. Its "sister" cluster B2 includes diverse, probably more derived/recent molecules (i.e., the only known protein-encoding viroid, AbouHaidar et al., 2014). Clusters D1 and D2 are characterized by rRNA subunits, associated with parasitic RNAs: D1 includes all six 23S rRNA subunits and retroviruses; and D2 groups all 16S rRNA subunits and most families of RPEs, rickettsial palindromic elements, which infest specifically Rickettsia genomes (Amiri et al., 2002;Gillespie et al., 2012). Secondary structure similarities between parasitic and ribosomal RNAs underscore virus-like rRNA properties (Figure 1), presumably due to the assumed arms race between rRNAs and parasitic RNAs.
Cluster D2 includes half of the secondary structures formed by tRNA sequences. Hence B1-D2 would represent organic life's main tRNA-rRNA axis of molecular evolution, where simple ancestral tRNA-like RNAs complexified into rRNA-like RNAs (Bloch et al., 1983(Bloch et al., , 1984(Bloch et al., , 1989. This interpretation is in line with detections of candidate tRNA genes within mitochondrial 16S rRNA of chaetognath mitogenomes that otherwise would lack tRNAs (Barthélémy and Seligmann, 2016).
Here we analyze additional types of RNAs and explore their similarities with clusters B1-2/D1-2. Additional viroids, are analyzed to explore the possibility that some viroids date from the precellular world (Bussière et al., 1995;Diener, 1996) and others emerged de novo recently (Koonin and Dolja, 2013;Seligmann and Raoult, 2016), potentially solving the conundrum about primordial/recent de novo viroid origins (Diener, 2016). Two RNA types function as controls to confirm the statuses of putative ancestral/de novo clusters B1 and B2. The remaining RNAs originate from giant viruses and test putative evolutionary links between giant viruses and rRNAs.

Secondary Structure Predictions, Properties, and Comparisons
Secondary structure predictions follow previous analyses (Seligmann and Raoult, 2016). Four variables are estimated from the optimal secondary structure predicted by Mfold (Zuker, 2003): 1. overall nucleotide percentage not involved in self-hybridization (loops); 2. percentage of nucleotides in closed loops at stem extremity among all nucleotides in loops (external loops); and 3. GC contents in loops, and 4. in stems. RNA size is not included among these variables. Table 1 presents these variables for sequences analyzed here. They are compared with corresponding data from a previous collection of RNA sequences (Seligmann and Raoult, 2016, therein Table 1 and Figure 2) that defined clusters in Figure 1. Comparisons use Pearson's correlation coefficient r as a similarity estimate, setting statistically significant similarity at r = 0.95 (one tailed P = 0.05). Correlation analyses plot each of the four variables 1-4 of Y for one of the new RNAs analyzed here as a function of corresponding variables 1-4 of X for each of previously analyzed RNAs (Seligmann and Raoult, 2016, therein Table 1). Correlation coefficients estimate similarities between X and Y according to variables 1-4 (example in Figure 2). Results of statistical analyses presented here are validated also by the Benjamini-Hochberg correction method for false discovery rates which accounts for multiple tests (Benjamini and Hochberg, 1995; methodology detailed for unrelated analyses in Seligmann and Warthi, 2017). This test, unlike the classical Bonferroni approach that minimizes false positive detection rates and misses numerous positive results (Perneger, 1998) optimizes between false negative and false positive detection rates (Käll et al., 2007). The method is not affected by lack of independence between observations, and accounts for multiple testing. These additional analyses confirm classical statistics and therefore are not detailed in Results.

Mimivirus' RNA
Mimivirus' transcriptome (public data available at http://sra.dnanexus.com/studies/SRP001690/experiments; Legendre et al., 2010) are analyzed using CLCgenomicswb7. Reads are mapped on Mimivirus' reference genome NC_014649, according to the following criteria: at least half of the read maps to the reference genome, with at least 80% identity.

Simulation-Generated Palindromes
Previous analyses produced the classification scheme of secondary structures formed by RNAs in Figure 1. Secondary structures formed by selected RNAs are analyzed to test this scheme. We classified the secondary structures of 24 sequences produced in silico by Demongeot and Moreira (2007). These  Variables are: N-sequence length (not used for classifying RNAs in further analyses); Loop-percentage of nucleotides not involved in self-hybridization; eLoop-percentage of nucleotides among those in loops, in closed loops at stem extremity; percentage of GC in stems, and loops. "Cluster" indicates the cluster in Figure 1 with the highest similarity in secondary structure properties. "Circ" indicates sequences generated by simulations attempting to mimic circular RNA genesis (Demongeot and Moreira, 2007); "De novo" is a template-free synthesized sequence (Béguin et al., 2015); MITE1-29 are Pandoravirus'miniature inverted-repeat transposable elements from Submariner family (Sun et al., 2015), * indicates MITE inserted in a protein coding gene; Mito and Mimi are amoeban rRNA sequences aligning with Mimivirus sequences (alignments described in Table 3);" 5 ′ -UTR" leader sequence of Dengue and West Nile virus (Gale et al., 2000); "Swinger alone" is the part of the 5 ′ -UTR leader detected in Mimivirus' transcriptome, not templated by its genome; "surrounding" integrates the former swinger sequence with its untransformed surrounding Mimivirus sequences.
RNAs were generated by simulations designed to reconstruct likely short primordial genes. Simulations were constrained to produce circular RNAs with codons coding for all 20 amino acids and a stop codon, and to form a stem-loop hairpin. The resulting theoretical RNAs have consensual tRNA sequence properties (Demongeot and Moreira, 2007). Optimal secondary structures of these 24 theoretical RNAs (Circ1-24, Table 1) are compared with optimal secondary structures formed by RNAs used previously (Seligmann and Raoult, 2016, therein Table 1) as shown for Circ1, in Figure 2. The secondary structure of Circ1 resembles the cloverleaf structure of tRNA Asn, and much less the OL-like structure formed by that tRNA sequence. Mitochondrial tRNAs form secondary structures that resemble mitochondrial light strand replication origins (OL), presumably function as OLs in OL absence (Desjardins and Morais, 1990;Seligmann and Labra, 2014).
Half of the 24 theoretical RNAs designed by simulations cluster with B1. Ten cluster with D2 (as shown in Figure 2), and the remaining two with D1. There are 24 × 6 = 144 FIGURE 2 | Multivariate comparisons between estimates of secondary structure variables for theoretical sequence Circ1 (from Table 1 comparisons between the 24 simulation-generated RNAs and the six RNAs in B1. Of these comparisons, 19 (13.2%) have Pearson correlation coefficients r with P < 0.05. Among the 24 × 17 = 408 comparisons with D2, 10 (2.45%) have P < 0.05. There are 24 × 13 = 312 comparisons with D1; only two (0.64%) have P < 0.05. No comparison between the 24 simulation-generated RNAs with D2 has P < 0.05. Hence our interpretation of B1 as representing secondary structures formed by ancestral RNAs agrees with the independent approach developed by Demongeot and Moreira (2007).
The fact that secondary structures formed by the 24 simulation-generated, presumably ancestral-like RNAs of Demongeot and Moreira (2007) preferentially cluster with B1 confirms the ancestral status of these theoretical RNAs and of cluster B1. The observation that translation/genetic code properties converge with tRNA sequence properties (Demongeot and Moreira, 2007) is also congruent with analyses of tRNAs along the principles of the natural circular code (Michel, 2012(Michel, , 2013Michel, 2014, 2015): a set of 20 codons overrepresented in coding vs. other frames of protein coding genes (Arquès and Michel, 1996), which enable coding frame retrieval (Ahmed et al., 2007(Ahmed et al., , 2010Michel and Seligmann, 2014). Results also fit a model of evolution of secondary structures into sequence signals such as codons (El Houmami and Seligmann, 2017).

Template-Free Synthesized Sequence
Simulation-generated sequences such as those in the previous section are suboptimal to confirm the ancestral/de novo status of clusters B1/B2. We analyze the predicted optimal secondary structure formed by a short sequence that is synthesized template-free by a combination of three archaeal enzymes, including DNA polymerase PolB (Béguin et al., 2015). This sequence is by definition "de novo." Its secondary structure properties most resemble those of an exceptional, protein-encoding viroid in cluster B2 (one tailed P = 0.034). This result is statistically significant also after considering multiple testing, using the Benjamini-Hochberg correction for false discovery rates that accounts for the number of tests done (Benjamini and Hochberg, 1995). This strengthens the status of cluster B2 as recent spontaneously generated RNAs.

Ancient and de novo Viroids
Evidence for recent vs. precellular origins of viroids are equivocal. Potentially, results of previous classifications of viroid secondary structures might be biased by including in analyses viroids with high GC contents and forming complex branching secondary structures. Analyses here include ten viroids forming rod-like secondary structures with GC contents ranging from 35 to 61%. All cluster with D1 and D2 (seven and three viroids, respectively, Table 2). No comparison with B1 and B2 has P < 0.05. Among 130 and 170 comparisons of these 10 rod-shaped viroids with secondary structures belonging to D1 and D2, 47 and 8 (36.2 and 4.7%, respectively), have P < 0.05.
Hence rod-shaped viroids belong to the ancestral tRNA-rRNA axis of molecular evolution. Viroids with more complex branching patters clustered with B2 according to previous analyses (Seligmann and Raoult, 2016). This suggests that viroid 'survival' requires evolutionary secondary structure simplification, perhaps because endonucleases target secondary branching (Fujishima et al., 2011). Results are compatible with mixed evidence for ancient precellular origins and recent de novo emergence of viroids.

Ribosomal RNA-Like Mimivirus Sequences
We used blastn (Altschul et al., 1997) to explore putative links between ribosomal RNAs and giant virus genomes. For that purpose we extracted rRNA sequences from Acanthamoeba castellanii's complete mitochondrial genome (NC_001637) and aligned these with Acanthamoeba polyphaga Mimivirus' complete genome (NC_014649). The amoeba's mitogenome includes four rRNA genes-5S, 16S, and 18S rRNAs-and a 23S-like sequence (Burger et al., 1995). Blastn alignment criteria were set at the shortest word size (7), the weakest match/mismatch scores (1/−1) and gap costs (existence 0; extension 2). For each rRNA we chose the alignment with the lowest (best) e value among the alignments detected by blastn between these mitochondrial rRNA genes and Mimivirus' genome ( Table 3).
For 23S-like rRNA, three alignments are considered because they represent similarities with different Mimivirus sequences and the alignments have low e values. Secondary structures formed by four of the six rRNA sequences aligning with Mimivirus sequences cluster best with D2 and two sequences cluster best with B2 (Figure 1). Hence these rRNA sequences most resemble the cluster that includes bacterial 16S rRNA subunits.
These rRNA sequences aligned with sequences from the Mimivirus genome. These Mimivirus sequences also form secondary structures which cluster differently in Figure 1 than their putative amoeban mitochondrial rRNA homologs. Optimal secondary structures formed by four of the six corresponding Mimivirus DNA sequences cluster best with B2 and the remaining two with D2. Very few of the r coefficients used for these classifications have P < 0.05. Hence these results must be considered as tentative.
Differences in clustering by secondary structures formed by mitochondrial rRNA sequences vs. Mimivirus' genome for aligning sequence pairs suggest that alignments are frequently due to convergences between rRNAs and viral sequences. A possible interpretation of these clustering results ( Table 1) is that viruses tend to create de novo rRNA-like sequences, though some of the alignments might suggest regular homology due to common ancestry. Lateral transfers between the host and Mimivirus is also a reasonable explanation. Independently of lateral transfers, this putative mitochondrial rRNA-Mimivirus homology is in line with common ancestry between Megavirales and a cellular ancestor of mitochondria, as suggested by homologies between polymerases of these organisms (Kempken et al., 1992;Rohe et al., 1992;Kapitonov and Jurka, 2006;Yutin et al., 2013;Krupovic and Koonin, 2016;Koonin and Krupovic, 2017) and above noted similar regulations of posttranscriptional processing (vertebrate mitochondria, Ojala et al., 1981;Claverie and Abergel, 2009;Mimivirus, Byrne et al., 2009). Hence secondary structure analyses apparently strengthen the hypothesis that mitochondria share a common ancestor with Megavirales.

Potential Origins of Mimivirus' Template-Free RNA
This RNA not templated by Mimivirus' genome might originate from accidental contamination. However, three arguments suggest less conventional hypotheses. Firstly, this RNA is repeatedly detected throughout the virus' infection cycle, hence in different sequencing events. Secondly, the exact same RNA is detected each time in terms of sequence and length. Thirdly, this exact sequence occurs in the genome of another giant virus, Megavirus terra1. These arguments suggest that the occurrence of this RNA in Mimivirus' transcriptome is not circumstantial.
This RNA could originate from (a) de novo creation, as suggested for some short stem-loop hairpin RNAs (Seligmann and Raoult, 2016) and analyses in previous sections, (b) pools of vertically transmitted RNAs originating from horizontal transfers (Stedman, 2013) forming quasi species groups (Villarreal, 2015(Villarreal, , 2016, or (c) noncanonical transcriptions of genomic DNA, such as RDDs [RNA-DNA differences, which result from posttranscriptional nucleotide substitutions (Li et al., 2011) or indels (insertions/deletions, Chen and Bundschuh, 2012)]. In addition, this RNA might result from a peculiar type of transcription, which produces transcripts matching genomes only if one assumes that transcription systematically exchanges nucleotides over the whole length of the transcript, which is called swinger transcription. There are 23 possible nucleotide exchange rules, 9 are symmetric exchanges of type X<->Y and 14 are asymmetric exchanges of the type X->Y->Z->X. These 23 transformations are each separately applied in silico to Mimivirus' genome to produce 23 swinger-transformed versions of that genome which are used for further analyses. Current information on systematic nucleotide exchanges is reviewed by Seligmann (2017), with further references therein (see also Seligmann, 2013;Michel and Seligmann, 2014).

Swinger Transcript or Template Free RNA Polymerization?
Our preliminary explorations of the kinetic data of the transcriptome of the giant virus Mimivirus (Raoult et al., 2004;Legendre et al., 2010Legendre et al., , 2011 detected putative swinger RNAs among RNAs that do not match the regular genome sequence, after excluding regular RNAs by mapping on the regular genome (Figures 3-5). Table 4 compares abundances and mean lengths of detected putative swinger RNAs among transcriptomic data produced by 454 and SOLID massive sequencing techniques (SOLID data unpublished, available in our laboratory). Results by both techniques are comparable. Abundances estimated from 454 sequencing correlate positively with those produced by SOLID (Spearman rank nonparametric correlation rs = 0.323, one tailed P = 0.066). Similarly, mean lengths of swinger reads correlate positively for data produced by 454 and SOLID (rs = 0.394, one tailed P = 0.082). Combining P-values from these two tests using Fisher's method for combining P values (Fisher, 1950), which sums −2 × ln(Pi) where i ranges from 1 to k tests (here k = 2) yields a chisquare statistic with 2 × k degrees of freedoms with a combined P = 0.034. Hence results from both sequencing methods are overall congruent,   swinger RNAs are not artifacts resulting from massive sequencing technologies.
Within 454 data, two swinger RNAs (boxes in Figure 4, primary and secondary structures in Figures 4, 5) were detected throughout most of the viral cycle, corresponding to genomic sequences at positions 243499-243537 (C->T->G->C swinger rule, box 1 in Figure 3) and 768549-768596 (A->C->G->A swinger rule, box 2 in Figure 4). These two genomic positions are each other's inverse complement. This is also the case for the corresponding swinger RNA reads. These reads correspond to the aforementioned template-free RNA and are homologous with Megavirus terra1 (KF527229: positions 903764-903800). This candidate swinger RNA aligns only with swinger transformed versions of Mimivirus' genomic sequence, but its similarity with the swinger transformed genome is also low. Putatively, some genomic sequences in Mimivirus could originate from horizontal transfers from other viruses, suggesting a potential implication in the horizontal transfer of the Sputnik virophage that infests Mimivirus (La Scola et al., 2008). This is in line with the chimeric origin of most Mimivirus genes (originating from eukaryotic hosts and bacterial coparasites of the eukaryotic host, Moreira and Brochier-Armanet, 2008;Jeudy et al., 2012), and confirms horizontal transfer of other viral sequences (Yutin et al., 2013), via swinger transformations.

GENERAL COMMENTS
Analyses that integrate molecular structure information are surprisingly successful at resolving various biological problems, notably of ancient evolution (for example the presumed monophyletic cellular origin of viruses, Nasir and Caetano-Anollés, 2015). However, techniques enabling this remain relatively inaccessible, notably for RNAs. This situation is particularly true if analyzes are supposed to integrate nonequilibrium dynamics of folding during the synthesis of the molecule [proteins: cotranslational folding, (Holtkamp et al., 2015;Seligmann and Warthi, 2017); RNAs: cotranscriptional folding, Gong et al., 2017]. Indeed, most RNAs fold during their syntheses, the new fold appearing after a new stretch of nucleotides is added to the elongating RNA (Schroeder et al., 2002). Each new fold is partially constrained by the previous one, which renders prediction algorithms more complex (Zhao et al., 2010;Frieda and Block, 2012).
The approach used here does not require any special computational skills. It could be adapted to dynamical contexts, and to include information on groups of closely related, suboptimal secondary structures, when these are only slightly more unstable than the optimal structure. Previous analyses (Seligmann and Raoult, 2016) included such special cases, for mitochondrial tRNAs in their classical cloverleaf fold, and in their replication origin (OL)-like fold. Analyzes separating different folds with similar stabilities for structures formed by the same sequence, such as cloverleaf vs. OL-like tRNA structures, yield sometimes different results in classifications such as that in Figure 1 (see Seligmann and Raoult, 2016, therein Figure 2).
In addition, molecules with very similar estimates for all variables may form very different structures, or similar structures of very different sizes (the variables do not include sequence length). Hence the simple approach used here could be applied to study closely related clouds of secondary structures formed by a given RNA. It could also be adapted to discriminate further between similar secondary structures by including additional variables, such as GC contents in internal vs. external loops, and angular rotation between stem branches.
Overall our results indicate that the versatility of RNA structures enable for functional novelties. Their physicochemical properties are also compatible with this role in primitive protolife organic conditions.

CONCLUSIONS
Interpretations of a previous classification of RNA secondary structures formed by a variety of RNAs (Seligmann and Raoult, 2016) were tested by classifying specific RNAs of special interest. For example, circular RNAs generated by simulations presumably mimicking ancestral RNAs cluster mainly, as expected, with cluster B1, a presumed group of ancestral RNAs (Figure 1).
Cluster B2 was previously interpreted as representing de novo emerged RNAs because it grouped short simple RNAs from very different functional types (tRNAs, ribozymes, viroids, etc.). Secondary structures formed by a sequence synthesized template free (hence de novo) cluster with B2.
Rodshaped viroids from a wide range of GC contents belong independently of GC contents to D1 and D2, two clusters characterized by rRNAs and parasitic sequences. Viroids forming secondary structures characterized by more complex branching patterns belong to cluster B2 (putative de novo RNAs). Hence results suggest that some viroids are recent, rodshaped ones are presumably ancient RNAs.
Presumed parasitic palindromic sequences from Pandoravirus (MITE submarine family) resemble cluster D2. D2 is characterized by16S rRNA subunits and Rickettsial palindromic elements that parasitize Rickettsia genomes. This fits previous grouping of secondary structures formed by parasitic sequences with 23S and 16S rRNAs (clusters D1 and D2).
Mimivirus sequences that align with amoeaban mitochondrial rRNA genes also strengthen suspected evolutionary links between rRNA and viral sequences. Mitochondrial rRNA sequences aligning with Mimivirus sequences cluster as expected with D2, characterized by bacterial 16S rRNA. Interestingly, secondary structures formed by Mimivirus sequences with which the mitochondrial rRNAs align, cluster mainly with B2, suggesting recent de novo origins for viral sequences resembling rRNAs. These results are based on relatively weak similarities and hence can only be considered as preliminary. Nevertheless, they suggest that viruses produce rRNA-like sequences, in line with the prediction that the study of giant viruses will 'change current conceptions of life, diversity and evolution' (Abrahao et al., 2014).
The secondary structure formed by the 5 ′ UTR leading region of flavivirid viruses clusters also with D2, hence it is a further viral, rRNA-like sequence. An RNA persisting throughout Mimivirus' infection cycle and lacking homology with Mimivirus' genome occurs also in some flavivirus genomes, Megavirus, in copperhead snakes and levant cotton. This saltatory phylogenetic distribution is compatible with repeated spontaneous, template free synthesis by polymerases deterministically producing specific sequences, as observed in Archaea (Béguin et al., 2015). Indeed, this RNA, embedded in the surrounding regular 5 ′ and 3 ′ sequences (Table 1 and Figure 5), forms a secondary structure that clusters with B2. This would suggest de novo emergence of this unusual RNA.
Analyses assuming different scenarios based on noncanonical transcriptions do not reach clear-cut conclusions on the origin of that RNA that does not map to the Mimivirus genome. Though contamination cannot be totally excluded, other hypotheses seem plausible. Indeed, this RNA maps imperfectly on Mimivirus' swinger-transformed genome (Mimivirus' transcriptome includes numerous swinger RNAs, results from each 454 and SOLID massive sequencing methods are congruent). Overall, results hint that parasitic RNAs form rRNA-like secondary structures, and template free polymerizations apparently enrich genomes with new RNA/DNA sequences.