Data Report ARTICLE
Secondary Structures Dataset of Eukaryotic Group II Introns
- 1Institute of Cytology and Genetics, Russian Academy of Sciences, Russia
- 2Novosibirsk State University, Russia
- 3Institute of Computational Mathematics and Mathematical Geophysics (RAS), Russia
- 4INSERM U981 Identification de Nouvelles Cibles Thérapeutiques en Cancérologie, France
Being one of the most successful mobile elements, group II introns are present in all three domains of life. It is commonly believed that group II introns played an important role in the emergence of eukaryotic retroelements and splicing (Novikova and Belfort, 2017). In biotechnology, group II introns have a promising application as specific gene-targeting vectors (Song et al., 2015).
Group II intron sequences show very weak similarity to one another, yet they fold into a common (although variable) structure that provides them with ribozyme activity. The secondary structure of group II introns consists of six distinctive domains, DI-VI. The first of them is organized into a complex structure with the largest number of stems and serves as a scaffold on which the whole structure is assembled. The remaining domains have few structural elements, which accounts for their low specificity, except for DV-VI, which possess moderate sequence conservation.
The pioneering collection of group II intron sequences contains a small number of eukaryotic group II introns with no individual structures (Dai et al., 2003). The largest collection, RFAM, relies on the presence of DV-VI domains (Kalvari et al., 2017) and on the capability to form a secondary structure at the position of the DI domain. However, these criteria are not ideal, because group III introns also possess DV-VI-like domains as well as their own DI domain, which differs strongly from the DI domain of group II introns. Therefore, an explicit model of the DI domain is necessary to distinguish group II intron sequences from group III intron sequences.
This data report aims to provide a new, comprehensive secondary structure catalog of eukaryotic group II introns using a bioinformatics approach. In this study, we applied structure computation and structural alignment of low-homology sequences, together with a descriptor-based technique that identifies the target structure.
2 Value of the Data
1. Group II introns are widespread mobile elements that are evolutionarily related to the emergence of splicing and eukaryotic retroelements and are used as a tool in genome engineering and gene expression control.
2. No exhaustive collection of secondary structures of eukaryotic group II introns has been published thus far. Here, we provide a dataset of DI domain secondary structures of eukaryotic group II introns, annotated with the tertiary interactions (EBS1-IBS1, EBS2-IBS2, α-α’, β-β’) and well-conserved RT motifs that we found.
3. These data will be useful for analyzing the complexity of group II intron folding and the evolutionary relationship of these introns to eukaryotic retroelements and splicing, for identifying novel group II introns, and for targetron design in gene engineering.
3 Materials and Methods
To search for structural homologs within RNA sequences, we used the descriptor-based RScan program (http://www.softberry.com/freedownloadhelp/rna/rscan/rscan.all.html). To search for sequence homologs, we used the NCBI BLAST web server (Madden, 2002). To filter structures by stability, we used RNAeval (Lorenz et al., 2011): only structures with a Z-score < -2.2 passed the filter. As a negative dataset, we generated random 5000-nt-long sequences of the same dinucleotide composition.
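The stability filter can be illustrated with a minimal Python sketch. It assumes that the Z-score compares the free energy of a candidate structure with a background sample of energies (for example, from the random sequences) and that the population standard deviation is used; the text states only the cutoff of -2.2. The function names are hypothetical, and the energies are assumed to have been computed beforehand (e.g., with RNAeval).

```python
from statistics import mean, pstdev

Z_THRESHOLD = -2.2  # cutoff stated in the text


def z_score(energy, background_energies):
    """Z-score of one folding energy against a background sample.

    Assumption: the background is summarized by its mean and
    population standard deviation.
    """
    mu = mean(background_energies)
    sigma = pstdev(background_energies)
    if sigma == 0:
        raise ValueError("background energies have zero variance")
    return (energy - mu) / sigma


def passes_stability_filter(energy, background_energies,
                            threshold=Z_THRESHOLD):
    """True if the structure is more stable than the cutoff requires."""
    return z_score(energy, background_energies) < threshold


if __name__ == "__main__":
    background = [-10.0, -12.0, -8.0, -10.0]  # illustrative values only
    for candidate in (-14.0, -12.0):
        print(candidate, passes_stability_filter(candidate, background))
```

A more negative Z-score means that the candidate is more stable than the background, so only strongly stabilized structures pass.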
Our strategy for finding eukaryotic group II introns and predicting their DI secondary structures was as follows. First, we extracted 8 eukaryotic DI domains (Arabidopsis thaliana nad1.I4, Porphyra purpurea LSU.I1 and I2, Pylaiella littoralis cox1.I1, I2 and I3, Pylaiella littoralis LSU.I1 and I2) from the literature (Fontaine et al., 1997; Burger et al., 1999; Sultan et al., 2016; Chan et al., 2018) and built four distinctive structure models for them. Each model corresponded to its own set of stems (Fig. 1) and started with the 5’ splice site sequence GUGCG or UUGCG.
Second, by varying the lengths of loops and stems within the models and prohibiting non-canonical base pairs (except the G-U pair), we searched for structural homologs in eukaryotic sequences hosted in the group II intron Database (Dai et al., 2003). We then filtered the results by the stability criterion and by the presence of the pseudoknot interactions EBS1-IBS1, EBS2-IBS2 and α-α'. In this way, we extended our training set by two more sequences, Neurospora crassa cox1.I1 and Schizosaccharomyces pombe EF2 cob1.I1.
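Two of the constraints above are simple enough to sketch in Python: the 5' start of each model (GUGCG or UUGCG) and the rule that stems may contain only canonical pairs plus G-U. This is an illustration only, not the RScan descriptor syntax; the function names are hypothetical, and a stem is assumed to be a gapless antiparallel duplex of two equal-length strands.

```python
# Watson-Crick pairs plus the G-U wobble pair allowed in the text.
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}
SPLICE_SITES = ("GUGCG", "UUGCG")  # 5' ends named in the text


def starts_with_splice_site(rna):
    """True if the sequence begins with one of the two 5' motifs."""
    return rna.upper().startswith(SPLICE_SITES)


def can_form_stem(five_prime, three_prime):
    """True if the two strands pair antiparallel with allowed pairs only.

    Both strands are written 5'->3', so the 3' strand is read backwards.
    """
    if not five_prime or len(five_prime) != len(three_prime):
        return False
    paired = zip(five_prime.upper(), reversed(three_prime.upper()))
    return all(pair in ALLOWED_PAIRS for pair in paired)


if __name__ == "__main__":
    print(starts_with_splice_site("GUGCGAAA"))
    print(can_form_stem("GGAU", "GUCC"))
    print(can_form_stem("GGAA", "GUCC"))
```

A real descriptor search additionally varies stem and loop lengths, which this sketch does not attempt.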
Finally, we repeated the previous step on the 10820 eukaryotic sequences of the RFAM database of group II introns (http://rfam.xfam.org/family/RF00029#; note that this dataset contains many twins), and then removed the twins. Thus, we obtained robust structures, while clearly missing those group II introns in which the DI domain has gained or lost stems relative to our models.
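Twin removal might look like the following Python sketch. The text does not define a twin precisely; the sketch assumes that twins are records whose sequences are identical after normalizing case and treating T as U. The function name and the record layout (name, sequence) are hypothetical.

```python
def remove_twins(records):
    """Keep the first record for each distinct sequence.

    Assumption: two records are twins when their sequences are identical
    after upper-casing and converting T to U.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper().replace("T", "U")
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


if __name__ == "__main__":
    demo = [("a", "GUGCG"), ("b", "gtgcg"), ("c", "UUGCG")]
    print(remove_twins(demo))
```

Keeping the first occurrence preserves the original order of the records.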
In all the training and predicted introns, we searched for the YADD sequence (RT active site) and the conserved motifs of RT. We called an RT motif conserved if its similarity to any of the RT motifs from Enterococcus faecium, Geobacter sp. M18 and Enterococcus casseliflavus (the top three sequences in Figure 5 of Zimmerly and Wu, 2014) exceeded 70%.
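The motif test can be sketched in Python under stated assumptions: the intron-encoded protein has already been translated, the candidate and reference motifs are aligned without gaps and have equal length, and percent identity stands in for the similarity measure, which the text does not specify. The function names are hypothetical, and the reference motifs must be supplied by the user.

```python
import re

SIMILARITY_THRESHOLD = 70.0  # percent, as stated in the text


def find_yadd(protein):
    """Return 0-based start positions of every YADD occurrence."""
    return [m.start() for m in re.finditer("(?=YADD)", protein.upper())]


def percent_identity(a, b):
    """Percent of identical positions in two equal-length motifs.

    Assumption: identity is used in place of the unspecified
    similarity measure.
    """
    if not a or len(a) != len(b):
        raise ValueError("motifs must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def is_conserved(candidate, references, threshold=SIMILARITY_THRESHOLD):
    """True if the candidate exceeds the threshold against any reference."""
    return any(percent_identity(candidate, ref) > threshold
               for ref in references)


if __name__ == "__main__":
    print(find_yadd("MKYADDLLYADD"))
    # Placeholder strings, not real RT motifs.
    print(is_conserved("ABCDEFGHIJ", ["ABCDEFGHXX"]))
```

The comparison is strict, so a motif at exactly 70% is not called conserved, matching the word "exceeded" in the text.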
Our four DI domain models (including the consensus of the 5’ splice sites; see also Table 1 Suppl. for a description of the models and their occurrence) showed very different rates of false positives. While the most relaxed model, that of the IIA intron, was found on average once every 59K nt in random sequences, the others were not observed even in 9 × 10^8 nt. Specificity increased further with the requirement for EBS1-IBS1, EBS2-IBS2 and α-α' pseudoknots, since we found them in all sequences of the training set and required them in the predicted set. Although such high specificity in the latter case should come with low sensitivity, we found three such group II introns, all of type IIB (Table 1 Suppl.).
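To give a sense of scale for the rate of one hit per 59K nt, the following Python sketch converts it into an expected number of chance hits and the probability of at least one chance hit in a scan of a given length. It assumes that false hits occur independently at a uniform rate (a Poisson approximation), which the text does not state; the function names are hypothetical.

```python
import math

NT_PER_FALSE_HIT = 59_000  # most relaxed (IIA) model, from the text


def expected_false_hits(scanned_nt, nt_per_hit=NT_PER_FALSE_HIT):
    """Expected number of chance matches in scanned_nt nucleotides."""
    return scanned_nt / nt_per_hit


def prob_at_least_one_false_hit(scanned_nt, nt_per_hit=NT_PER_FALSE_HIT):
    """Poisson probability of one or more chance matches.

    Assumption: hits are independent and uniformly distributed.
    """
    return 1.0 - math.exp(-expected_false_hits(scanned_nt, nt_per_hit))


if __name__ == "__main__":
    for length in (5_000, 59_000, 1_000_000):
        print(length,
              round(expected_false_hits(length), 3),
              round(prob_at_least_one_false_hit(length), 3))
```

Under this approximation, a single 5000-nt random sequence would be expected to yield fewer than 0.1 chance hits of the relaxed model before the pseudoknot requirement is applied.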
We observed that bacterial DI domains can often also fold into the eukaryotic structures, although typically with a lower structure energy. This may indicate a high capacity of bacterial group II introns for horizontal transfer to eukaryotes. In the group II intron Database (Dai et al., 2003), we found 12 such bacterial group II introns that met all our criteria for eukaryotic group II introns, and calculated the fraction of base pairs shared between the bacterial and eukaryotic structures of the DI domain to be 79.4%. In other words, if we correctly predict a sequence as a eukaryotic group II intron but it actually folds into a different DI domain structure, we still correctly predict 79.4% of base pairs on average.
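The shared base-pair fraction can be computed as in the following Python sketch. It assumes that both structures are given in dot-bracket notation over the same sequence coordinates, that pseudoknots are omitted, and that the fraction is taken relative to the reference structure; the text does not state the denominator. The function names are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced opening bracket")
    return pairs


def shared_pair_fraction(predicted, reference):
    """Fraction of reference base pairs also present in the prediction.

    Assumption: the reference structure supplies the denominator.
    """
    ref_pairs = base_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference structure has no base pairs")
    return len(base_pairs(predicted) & ref_pairs) / len(ref_pairs)


if __name__ == "__main__":
    # Toy structures: the prediction misses the innermost pair.
    print(shared_pair_fraction("((....))", "(((..)))"))
```

Averaging this fraction over a set of introns gives a single summary value of the kind reported in the text.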
In total, apart from the two sequences from the eukaryotic group II intron Database, we found 22 sequences in RFAM that satisfied the constructed structural models of DI domains of eukaryotic group II introns. We divided these 22 sequences into three groups. The first group consists of nine fungal group II introns, which contain conserved and correctly positioned RT motifs, albeit fewer than in the training set (Fig. 2). All of these candidates include the YADD motif and have a GUGCG sequence at the 5' splice site; seven of them have β-β’ pseudoknots and eight are of the IIA type. The second group, two IIB sequences (fungi and plantae), also has GUGCG and has neither YADD nor β-β’. The third group consists of 11 very similar (97-99%) plant sequences that also have neither YADD nor β-β’ but have a UUGCG 5’ splice site. Note that only one of the ten training sequences contained a UUGCG 5’ splice site. These 11 sequences are probably copies of a particular plant transposon, because a homology search found 436 other plant sequences identical to them.
Keywords: mobile group II introns, RNA, secondary structure, Ribozyme, mobile element
Received: 26 Apr 2019;
Accepted: 18 Oct 2019.
Copyright: © 2019 Titov, Kobalo, Vorobiev and Kulikov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Dr. Igor Titov, Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk, Russia, email@example.com