BAC-BROWSER: The Tool for Visualization and Analysis of Prokaryotic Genomes

Prokaryotes are actively studied objects in the scope of genomic regulation. Microbiologists need special tools for complex analysis of data to study and identification of regulatory mechanism in bacteria and archaea. We developed a tool BAC-BROWSER, specifically for visualization and analysis of small prokaryotic genomes. BAC-BROWSER provides tools for different types of analysis to study a wide set of regulatory mechanisms of prokaryotes: - transcriptional regulation by transcription factors (TFs), analysis of TFs, their targets, and binding sites.- other regulatory motifs, promoters, terminators and ribosome binding sites- transcriptional regulation by variation of operon structure, alternative starts or ends of transcription.- non-coding RNAs, antisense RNAs- RNA secondary structure, riboswitches- GC content, GC skew, codon usage BAC-browser incorporated free programs accelerating the verification of obtained results: primer design and oligocalculator, vector visualization, the tool for synthetic gene construction. The program is designed for Windows operating system and freely available for download in http://smdb.rcpcm.org/tools/index.html.


INTRODUCTION
Prokaryotes are the great source for discovery of new regulatory mechanisms. Their diversity and relative simplicity of genome organization make them suitable objects for different -omics and comparative analysis. Regulation by transcription factors is considered as well described for bacteria, but even for model bacteria Escherichia coli and Bacillus subtilis new TFs and targets are still being identified (Keseler et al., 2013;Gama-Castro et al., 2016;Fang et al., 2017;Belliveau et al., 2018;Gao et al., 2018). Since the low conservation of regulatory networks, regulation of other bacteria, including important pathogens, remains obscure (Lozada-Chávez et al., 2006;Rodionov, 2007;Fisunov et al., 2016;Eckweiler et al., 2018). Operon structure is one of the interesting and poorly studied features of prokaryotes. An operon is defined as a set of genes transcribed en bloc, so operon consists of several genes transcribed and regulated together (Jacob et al., 1960). New data about bacterial transcription showed that operon structure is not stable and can vary in different conditions (Koide et al., 2009;Mao et al., 2015;Junier and Rivoire, 2016). To cope with this discrepancy, genes transcribed and regulated together were called transcriptional units (TU) (Cho et al., 2009;Junier and Rivoire, 2016) and now term operon has more meaning in terms of functional rather than direct transcriptional link between genes (Okuda et al., 2007). Identification of TUs is a new task that has to be resolved to understand transcriptional regulation. Exact mapping of TUs can facilitate identification of new riboswitches, non-coding RNA, antisense transcripts (Sharma et al., 2010;Boutard et al., 2016). The same TU may have multiple TSSs (transcription start sites) and transcription ends. Alternative TSSs in bacteria are found for 15-60% genes and operons (Stazic and Voß, 2016), they are also involved in transcription regulation (Cho et al., 2009;Li et al., 2015). To characterize both poorly studied regulatory mechanisms and classical ways of regulation is needed to use complex information about transcription and translation and combine it with genomics data. For this purpose were developed special tools and programs (Pavlopoulos et al., 2015).
There are many specific tools for TU mapping, TSS and terminator prediction. They use various strategies for data analysis (Amman et al., 2014;Fortino et al., 2014Fortino et al., , 2016Čuklina et al., 2016;Promworn et al., 2017), and some can visualize the results (McClure et al., 2013;Hilker et al., 2016). However, these tools generally concentrated on one or few aspects of regulation. Wide set of tools utilize known algorithms for high-throughput data analysis and genomes visualization but none of them allow analysis of complex operon structure of prokaryotes (Carver et al., 2012;Okonechnikov et al., 2012;Lechat et al., 2013;Dietrich et al., 2014).
We developed new genome browser BAC-BROWSER designed for molecular biologists and microbiologists that combine user-friendly interface with multiple functions for prokaryotic regulation analysis. Using RNA-seq read coverage BAC-BROWSER can identify TSSs and terminators, map and visualize TUs. BAC-BROWSER provides a wide set of algorithms for analysis of genes, proteins, regulatory motifs and structural elements, like hairpins and riboswitches. Furthermore, the program facilitates subsequent verification of computational results with built-in tools for molecular biology analysis. BAC-BROWSER includes free modules for primer design, vector representation, and analysis. Based on BAC-BROWSER prediction we identified TSSs and gene promoters that helped to reveal a new mechanism of transcriptional regulation and reconstruct transcriptional control network for three bacteria species (Mazin et al., 2014;Fisunov et al., 2016).

General Information
BAC-BROWSER software was written in VB.NET 9.0. The program loads genomes in Fasta, Genbank, and GFF format.

Analysis of Coverage and Transcription Unit Mapping
BAC-BROWSER intakes coverage data in SAM/BAM format or in own format. BAC-BROWSER format represents the array of values, where each one represents coverage at a given position starting from the first nucleotide. The SAM format can be converted into BAC-BROWSER format. The coverage dataset can be assembled into a single table file, which simplifies loading of data. The coverage can be represented as full read coverage or first nucleotide coverage. In the second method only first nucleotide contributes to the coverage, while the remaining read is skipped. This representation is used for analysis of 5 -end enriched RNA-seq coverage (Mazin et al., 2014).
The algorithm of transcription unit identification from the coverage have been described in Mazin et al. (2014). Briefly, the algorithm calculates coverage derivative and identifies its local extremes, which correspond to local steps in coverage. The steps may originate from TSSs, transcriptional terminators and RNA ends produced by RNA processing as well as coverage noise produced fortuitously. The algorithm further splits coverage into intervals between consecutive steps and identifies statistical significance of coverage difference between adjacent intervals. In current implementation we use standard deviation, which threshold can be manually set. The original implementation described a quasi-Poisson distribution test. Then if the difference is statistically insignificant the step is removed and intervals are joined. The process repeats iteratively until convergence. The final set of intervals can be assigned as transcription units (operons). The borders between the intervals correspond to TSSs and transcriptional terminators, respectively (Mazin et al., 2014). So, this algorithm identifies TSSs, terminators, and TUs. However, its accuracy depends on the quality of data and parameters the user chooses.
The exact identification of TSSs with single-nucleotide resolution requires first nucleotide coverage data from 5 -end enriched RNA-seq library. The method is different from the algorithm for TU identification and also have been published (Mazin et al., 2014). The program scans the coverage for local maxima, which correspond to TSSs. The algorithm accounts for background signal to identify maxima that are above the noise threshold. The latter is dynamically calculated for the local sequence area (parameters are user-adjusted).
It is also possible to export normalized coverage (FPKM) of annotated features for further analysis. A coverage of a single gene is normalized to the total library coverage or to the coverage of CDSs (Anders and Huber, 2010). The latter excludes overrepresented features like rRNAs and tRNAs that may introduce quantitative bias.

Applications for Sequence Analysis, Oligonucleotide Thermodynamics and Secondary Structure Analysis and Motif Search
Standard algorithms for sequence search are implemented in the application (Altschul et al., 1990). At first stage query and subject sequences are splitted into K-words. K-words are indexed and the matrix of matched hits is built. Matched hits serve as seeds to expand alignment. The gaps between seeds are filled by Wunch-Needleman algorithm (Needleman and Wunsch, 1970). A user can adjust thresholds for sequence similarity, minimal alignment length and length of seeds.
Positional weight matrixes (PWMs) are built from a usersupplied set of aligned sequences. For logo plot construction user can apply log-likelihood score or Shannon positional entropy (Li and Tompa, 2006). PWMs are stored in text format and can be easily edited. External or manually constructed PWMs are suitable for search. The program scans a genome and finds PWM matches above the threshold score manually selected by a user. The score is calculated by the standard method as a sum of frequencies of matched nucleotides. For motif de novo search we implemented modified MEME algorithm (Bailey and Elkan, 1994).
Thermodynamic parameters for oligonucleotides are calculated by the modified Allawi and SantaLucia's the nearestneighbor method for DNA (Allawi and SantaLucia, 1997). Thermodynamic parameters for RNA hairpins (used in hairpin finder at genomic view window) are calculated by the nearestneighbor algorithm for RNA (SantaLucia, 1998). Secondary structures and of oligonucleotide dimers are calculated using iterative annealing with subsequent calculation of duplex stability. For calculation of hairpins, the oligonucleotide is split into two halves and iterative annealing of halves is performed. A search of secondary structures in a genome is performed in a sliding window. For each subsequence the algorithm calculates the hairpin stability in the same way as for oligonucleotide (see above), but with RNA thermodynamic parameters.
The identification of repetitive elements is performed in a sliding window. For each subsequence the algorithm identifies if N nucleotides separated by the spacer of K length form direct or inverted repeats.

Tools for Molecular Biology
PCR primer design in the program includes the following methods: automatic primer design for the given sequence and manual primer design. The first method designs a list of primers that correspond to the manual set of parameters like Tm, amplicon length and secondary structure stability. The algorithm makes the list of subsequences from the sequence to which the primers have to be designed. Then the algorithm filters subsequences until the appropriate ones are found. The second method allows designing oligonucleotide directly on DNA sequence within the viewer window. A user can move and resize oligonucleotide on genomic sequence and Tm is calculated on-flight. This method can be used to design oligonucleotides for gene synthesis or probes for real-time PCR. Tm is calculated by the modified Allawi and SantaLucia's thermodynamics method (Allawi and SantaLucia, 1997). Calculation of oligonucleotide concentration from OD is performed by approximation of Beer-Lambert law by the following equation: OD / (nA × 15200 + nG × 12010 + nC × 7050 + nT × 8400 + M), where nA, nG, nC, and nT is the number of respective nucleotides and M corresponds for absorbance of DNA modifications (if present).
The protein isoelectric point is calculated based on the Henderson-Hasselbach equation and table values of pK for charged amino acids. The program iteratively adjusts pH until the isoelectric point is reached.

Sequence and Coverage Analysis
In BAC-BROWSER there are available linear and circular sequence representations. An easy and usable interface enables to manipulate with annotation, gene sequences, mark and retrieve nucleotide fragments or directly analyze them. BAC-BROWSER allows manual correction, addition, and import of annotation. BAC-BROWSER provides simple extraction of a single or multiple genes or protein sequences. The program can upload files with more than one sequence and show it as one genome. Sequence search is available in the application. BAC-BROWSER performs a search of nucleotide and protein sequences with the customizable percent of identity and can do the simultaneous search of several sequences uploaded in Fasta file. The program readily works with short sequences as vectors, viral and bacterial genomes. BAC-browser will work with the human genome and other large genomes, but it is not designed for this and the analysis will take a long time.
In BAC-BROWSER, there are more than 15 tools implemented for sequence and associated coverage analysis: -genome sequence searching and editing; -tools for editing genome annotation: annotation amendment, importing and exporting of features, open reading frame (ORF) prediction; -tools for motif search and analysis like PWM scanning and building, and also de novo motif search; -tools for repeat and hairpin formation search; -restriction site search and enzyme mapping; -BLAST search; -GC content (Figure 1) and GC skew calculators, and codon usage.
RNA-seq analysis output in SAM format can be directly loaded by the application. SAM/BAM file user can obtain from programs for RNA-seq analysis such as Bowtie, BWA or TopHat (Li and Durbin, 2009;Langmead and Salzberg, 2012;Kim et al., 2013). Read alignments are automatically converted into nucleotide coverage. BAC-BROWSER can perform coverage normalization and log transformation. A user chooses the type of coverage representation depending on RNA-seq library type. BAC-BROWSER generates standard nucleotide coverage and coverage obtained with first nucleotides of reads. The second type of coverage we recommend to use for 5 enriched RNA-seq (Sharma et al., 2010;Creecy and Conway, 2015). This type of RNA-seq library preparation used for exact TSSs mapping and quantitate transcript analysis (Mazin et al., 2014;Fisunov et al., 2016). The program also can load other quantitate or qualitative data in simple and universal BAC-BROWSER format. It can be used for methylation, SNP or GC content display. So, BAC-BROWSER provides a simultaneous view on biological data of different types and keeps it in lightweight universal format.

Prediction of Transcriptional Units
In the program we implemented a method for condition-specific TU analysis in bacteria. The program analyses RNA-seq data for fast identification of TSSs, transcription terminators and TUs. This method and algorithm we used in our published work (Mazin et al., 2014), but here we want to compare it with another method and show accuracy on simulated data. For real data testing, we used previously published E. coli dataset (Conway et al., 2014). To get coverage were used processed RNA-seq data. Derived results correspond to results of Conway et al. (2014) and show high sensitivity (0.91) ( Table 1).
For read count simulation was used negative binomial distribution (Frazee et al., 2015). We simulated standard RNAseq transcriptome reads varying mean coverage of gene from 20 to 400 reads per nucleotide, coverage for each transcript was drawn from the normal distribution. Results of the testing show, that implemented in BAC-BROWSER algorithm for TU identification is sensible for mean transcript coverage and can identify about 90% of TUs (Figure 2).

CONCLUSION
We developed an easy, fast and multifunctional application BAC-BROWSER for prokaryotic genome visualization and analysis. The program combines popular algorithms and methods for better interpretation and analysis of complex data and provides tools for subsequent verification of results with molecular biology methods. The program freely available and will be improved and supplemented in future versions which will be also available for the research community. In particular, we plan to make the web version of the program accessible to users with any operating system. Now BAC-browser has several limitations: it works only in Windows operating system and designed for analysis of small genomes up to 20 Mb in length. The manual describing file formats, usage of the program and parameters of tools is available in http://smdb.rcpcm.org/tools/ index.html.

DATA AVAILABILITY
Program in freely available in the SMDB website http: //smdb.rcpcm.org/tools/index.html under the GNU GPL license. Operating systems: Windows XP and later versions. Programming language: VB.NET 9.0. Other requirements: none. Any restrictions to use by non-academics: none.

AUTHOR CONTRIBUTIONS
GF contributed to the software design, software testing, and the drafting of the manuscript. IG contributed to the software testing, bioinformatics analysis, data interpretation, and the drafting of the manuscript. VG contributed to the design of the study and the drafting of the manuscript.

FUNDING
This work was funded by the Russian Science Foundation grant 14-24-00159 "Systems research of minimal cell on a Mycoplasma gallisepticum model."