Ori-Finder 2, an integrated tool to predict replication origins in the archaeal genomes
- 1Department of Physics, Tianjin University, Tianjin, China
- 2Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- 3SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering, Tianjin, China
DNA replication is one of the most basic processes in all three domains of cellular life. With the advent of the post-genomic era, the increasing number of complete archaeal genomes has created an opportunity for exploration of the molecular mechanisms for initiating cellular DNA replication by in vivo experiments as well as in silico analysis. However, the location of replication origins (oriCs) in many sequenced archaeal genomes remains unknown. We present a web-based tool Ori-Finder 2 to predict oriCs in the archaeal genomes automatically, based on the integrated method comprising the analysis of base composition asymmetry using the Z-curve method, the distribution of origin recognition boxes identified by FIMO tool, and the occurrence of genes frequently close to oriCs. The web server is also able to analyze the unannotated genome sequences by integrating with gene prediction pipelines and BLAST software for gene identification and function annotation. The result of the predicted oriCs is displayed as an HTML table, which offers an intuitive way to browse the result in graphical and tabular form. The software presented here is accurate for the genomes with single oriC, but it does not necessarily find all the origins of replication for the genomes with multiple oriCs. Ori-Finder 2 aims to become a useful platform for the identification and analysis of oriCs in the archaeal genomes, which would provide insight into the replication mechanisms in archaea. The web server is freely available at http://tubic.tju.edu.cn/Ori-Finder2/.
DNA replication is one of the essential and conserved features among all three domains of life. In bacteria, DNA replication initiates from a single replication origin (oriC), which is often adjacent to the replication-related genes and distributed with the DnaA box motifs, whereas eukaryotic organisms exploit significantly more replication origins, ranging from hundreds in yeast to tens of thousands in human (Gao et al., 2012). Archaea are classified as a separate domain in the three-domain system, and share some similar features with both bacteria and eukaryotes (Woese and Fox, 1977). Similar to the bacteria, the oriCs in archaea are located in the intergenic regions around the replication-related proteins and distributed with the origin recognition boxes (ORBs). The ORB motifs are the conserved sequences and recognition sites for the Orc1/Cdc6 initiation proteins (Barry and Bell, 2006). In some organisms, G-stretches are also observed at the end of ORBs. On the other hand, the origin binding proteins in archaea are homologous to the related eukaryotic Orc1/Cdc6 proteins, and some archaea could also adopt more than one oriC to initiate DNA replication. With the increasing availability of complete archaeal genomes, identification of their oriCs would provide further insight into the mechanism of DNA replication in archaea and reveal the evolutionary history between bacteria and eukaryotes (Barry and Bell, 2006; Wu et al., 2014b).
The first putative oriC of archaea was identified in Halobacterium sp. strain NRC-1 by GC-skew method and demonstrated by cloning into a non-replicating plasmid (Myllykallio et al., 2000). The Z-curve method is an alternative technique that detects the asymmetrical nucleotide distribution around replication origins. The three components of the Z-curve, xn, yn, and zn display the distributions of purine versus pyrimidine (R vs. Y), amino versus keto (M vs. K) and strong H-bond versus weak H-bond (S vs. W) bases along the sequence, respectively. The xn and yn components are termed the RY and MK disparity curves, respectively. The AT and GC disparity curves are defined by (xn + yn)/2 and (xn -yn)/2, which shows the excess of A over T and G over C, respectively, along the sequence (Zhang and Zhang, 2005; Gao, 2014). Based on the Z-curve analysis, we have identified single oriC in Methanocaldococcus jannaschii and Methanosarcina mazei, double oriCs in Halobacterium sp. strain NRC-1, and three oriCs in Sulfolobus solfataricus P2, which are consistent with the subsequent experiments (Soppa, 2006). Recently, multiple orc1/cdc6-associated oriCs in all the available haloarchaeal genomes have been predicted by identification of putative ORBs (Wu et al., 2012). Based on these discoveries, several basic features of the oriCs could be summarized in archaea. Firstly, most oriCs are located in proximity to the genes encoding archaeal replication-related proteins, such as archaeal Orc/Cdc6 protein, Whip (Winged-Helix Initiator Protein) and DNA primase. Secondly, oriCs are often located around the extremes of disparity curves. Finally, most of the oriCs contains the AT-rich unwinding elements and conserved ORBs (Zhang and Zhang, 2005; Barry and Bell, 2006; Wu et al., 2014a).
Our group has developed a web-based system Ori-Finder 1 to find oriCs in the bacterial genomes based on the Z-curve method with high accuracy and reliability (Gao and Zhang, 2008). Now with the knowledge of oriCs in the archaeal genomes, we present an online tool, Ori-Finder 2, to identify the oriCs in the archaeal genomes, based on the integrated method comprising the analysis of base composition asymmetry using the Z-curve method, the distribution of ORB elements identified by FIMO tool, and the occurrence of genes frequently close to replication origins, which is available at http://tubic.tju.edu.cn/Ori-Finder2/.
Methods and Implementation
Ori-Finder 2 utilizes an integrated approach to predict oriCs in the user-supplied archaeal genomes automatically. Figure 1 presents the workflow of Ori-Finder 2. Users submit an annotated or unannotated genome sequence to the web server. For the annotated genome, we recommend that users submit the sequence file in GenBank format or upload the sequence file in FASTA format as well as its corresponding protein table (PTT) file. The web server is also able to analyze the unannotated genomes by integrating two gene prediction pipelines, ZCURVE1.02 and Glimmer3 (Guo et al., 2003; Delcher et al., 2007), for gene identification and BLAST program for functional annotations of genes. Then all the intergenic sequences are scanned by Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices (Grant et al., 2011), to obtain the ORB sequences, and also by REPuter program, a classic pipeline to compute exact repeats and palindromes in complete genomes (Kurtz et al., 2001), to identify the repeats. Finally, all the intergenic sequences adjacent to the replication-related genes with the ORB sequences are predicted as oriCs. Since the approach relies on the prior knowledge of oriCs in archaea, it may fail to identify the oriCs adjacent to the unknown genes which might be involved in DNA replication. In order to overcome the drawback, the intergenic sequences, which contain more than two conserved motifs, will be also predicted as oriCs. BLAST searches are performed against DoriC, a database of bacterial and archaeal replication origins, to search the homologs (Gao and Zhang, 2007; Gao et al., 2013). Here, the conserved motifs of ORB sequences used in FIMO were obtained from DoriC. All the records in DoriC were organized into several taxonomic clusters, including Methanobacteriaceae, Methanomicrobia, Methanococcaceae, Sulfolobaceae and Thermococcaceae. And the conserved ORB motifs were calculated from the corresponding clusters by Multiple EM for Motif Elicitation (MEME) program, a tool used to discover motifs in a group of related DNA or protein sequences (Bailey et al., 2009). Table 1 displays the regular expressions of ORB motifs. Note that the common motif is calculated from all the records in DoriC. The motif logos are shown in the submission form, and the position specific probability matrix (PSPM) is available in the document webpage. Each job of Ori-Finder 2 is assigned a unique ID, and the whole process will take several minutes to complete. Users could retrieve their results with the job ID or be notified by email if specified in the submission page.
FIGURE 1. Workflow diagram of Ori-Finder 2. The flow chart schematically shows the procedure to identify oriC regions by Ori-Finder 2.
In the result, the information including genome size, GC content, the locations of replication-related genes and the predicted oriCs, as well as the Z-curve (AT, GC, RY, and MK disparity curves) for the input genome is displayed as an HTML table. In addition, the detailed information about the repeats identified by REPuter program, ORBs recognized by FIMO and the homologs in DoriC are also presented in the corresponding subtable. The ORB motifs in all the intergenic regions are also available for download from the provided URL. Users could also click to enlarge the embedded figure to obtain the high resolution one which displays the RY, MK, GC, AT disparity curves, replication-related proteins, and the predicted oriCs. The result webpage and figures will be stored in 7 days on the web server.
Results and Discussion
Based on this online system, we predicted the oriCs for all the available complete archaeal genomes in GenBank. For example, Pyrococcus abyssi is a classical model of DNA replication in the archaeal organisms. Similar to bacteria, there is only one oriC in its circular chromosome, which has been identified by cumulative oligomer skew and confirmed by in vivo method. With the annotated genome file, the oriC predicted by Ori-Finder 2 is in accordance with the experimental result and located at the peak of the MK disparity curve. Several ORB sequences are recognized in the oriC. Figure 2 is a screenshot of the result by Ori-Finder 2. In addition, some archaea adopt more than one oriC during the DNA replication. For this situation, Ori-Finder 2 also predicted multiple oriCs in their genomes. Haloferax volcanii DS2 has a chromosome with multiple oriCs. Five oriCs were identified in silico, and three of them have been confirmed in vitro (Norais et al., 2007; Wu et al., 2012; Hawkins et al., 2013). With the annotated genome file, all the five oriCs mentioned above have been predicted by Ori-Finder 2 successfully, and another oriC with three ORB motifs is also found, which is adjacent to the genes purO and cgi. Besides that, the oriCs identified in the unannotated genomes are consistent with the previous results. In order to estimate the performance of Ori-Finder 2, we used 13 annotated archaeal chromosomes, whose oriCs have been confirmed by experimental method or identified in silico by other groups (Table 2). Compared with the records in DoriC, the sensitivity and precision are 66.7% and 62.1%, respectively. The reason of the lower precision and sensitivity compared with the programs to detect bacterial origins, such as Ori-Finder 1, is that bacteria have only one oriC in their chromosomes, but archaea tend to have more than one. Furthermore, oriCs in archaea show more diversity than those in bacteria, such as more complex ORBs in comparison with the DnaA boxes, and more unknown species-specific replication-related genes. It is difficult to predict the oriCs in archaea with high precision and sensitivity due to the limited amount of experimental data. For example, not all the oriCs in the genomes with multiple oriCs are found, and the ORBs with unique features need to be further explored by experimental methods. For the convenience of users’ query, the oriCs confirmed by in vivo or in silico methods have been collected into DoriC, which is freely available at http://tubic.tju.edu.cn/doric/.
FIGURE 2. Example of Ori-Finder 2 result for Pyrococcus abyssi GE5. (A) The information of genome size, GC content, the locations of replication-related genes and the predicted oriCs. (B) The detailed information of the predicted oriC region including size, GC content, homologs in DoriC and sequence, as well as the information of the identified ORBs including the ORB motif (also referred to as “Pattern name”), location, strand, the associated log-likelihood ratio score, P value and the matched sequences. Note that the log-likelihood ratio score and P value are computed by FIMO to measure the similarity between the ORB motif and the matched sequence, and the P value cutoff for FIMO motif searching is 10-4. The ORB motif used here is the common motif. (C) The left figure shows the Z-curves (AT, GC, RY, and MK disparity curves) for the original sequence, and the right figure shows the Z-curves (AT, GC, RY, and MK disparity curves) for the rotated sequence beginning and ending in the maximum of the GC disparity curve. The short vertical red line indicates the location of replication-related protein. The black arrow is the predicted oriC region.
Here, we presented a user-friendly interactive web-based platform Ori-Finder 2 to predict the oriCs in the archaeal genomes. The tool integrated several genomic pipelines, including FIMO, BLAST, ZCURVE, Glimmer, and REPuter, to comprehensively annotate and analyze the oriCs. Moreover, the ORB motifs are also calculated by MEME and organized by taxonomy. The software presented here does not necessarily find all the origins of replication in cases where there are multiple ones in a genome. However, we will continually strive to improve our approach to make it more accurate and sensitive with the increase of the oriCs confirmed experimentally in archaea. As the only currently available auto-annotation system for the archaeal replication origins at the sequence level, we believe that Ori-Finder 2 will be helpful to predict the archaeal replication origins and provide insight into DNA replication in archaea.
Hao Luo designed the computer program and drafted the manuscript. Chun-Ting Zhang and Feng Gao supervised the study and revised the manuscript. All authors read and approved the final manuscript.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors thank Dr. Kurtz for providing the REPuter binaries. They also would like to thank Dr. Ren Zhang for invaluable assistance. The present work was supported in part by National Natural Science Foundation of China (Grant Nos. 31171238 and 30800642), and Program for New Century Excellent Talents in University (No. NCET-12-0396).
Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L.,et al. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208. doi: 10.1093/nar/gkp335
Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A.,et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423. doi: 10.1093/bioinformatics/btp163
Kurtz, S., Choudhuri, J. V., Ohlebusch, E., Schleiermacher, C., Stoye, J., and Giegerich, R. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29, 4633–4642. doi: 10.1093/nar/29.22.4633
Myllykallio, H., Lopez, P., Lopez-Garcia, P., Heilig, R., Saurin, W., Zivanovic, Y.,et al. (2000). Bacterial mode of replication with eukaryotic-like machinery in a hyperthermophilic archaeon. Science 288, 2212–2215. doi: 10.1126/science.288.5474.2212
Norais, C., Hawkins, M., Hartman, A. L., Eisen, J. A., Myllykallio, H., and Allers, T. (2007). Genetic and physical mapping of DNA replication origins in Haloferax volcanii. PLoS Genet. 3:e77. doi: 10.1371/journal.pgen.0030077
Wu, Z., Liu, J., Yang, H., Liu, H., and Xiang, H. (2014a). Multiple replication origins with diverse control mechanisms in Haloarcula hispanica. Nucleic Acids Res. 42, 2282–2294. doi: 10.1093/nar/gkt1214
Keywords: archaea, replication origins, Z-curve, origin recognition box, DNA replication
Citation: Luo H, Zhang C-T and Gao F (2014) Ori-Finder 2, an integrated tool to predict replication origins in the archaeal genomes. Front. Microbiol. 5:482. doi: 10.3389/fmicb.2014.00482
Received: 05 August 2014; Accepted: 27 August 2014;
Published online: 15 September 2014.
Edited by:Eric Altermann, AgResearch Ltd., New Zealand
Copyright © 2014 Luo, Zhang and Gao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Feng Gao and Chun-Ting Zhang, Department of Physics, Tianjin University, Tianjin 300072, China e-mail: firstname.lastname@example.org; email@example.com