Analysis of key genes in Mycobacterium ulcerans reveals conserved RNA structural motifs and regions with apparent pressure to remain unstructured

Buruli Ulcer is a neglected tropical disease that results in disfiguring and dangerous lesions in affected persons across a wide geographic area, including much of West Africa. The causative agent of Buruli Ulcer is Mycobacterium ulcerans, a relative of the bacterium that causes tuberculosis and leprosy. Few therapeutic options exist for the treatment of this disease beyond antibiotics in the early stages, which are frequently ineffective, and surgical removal in the later stage. In this study we analyze six genes in Mycobacterium ulcerans that have high potential of therapeutic targeting. We focus our analysis on a combined in silico and comparative sequence study of potential RNA secondary structure across these genes. The result of this work was the comprehensive local RNA structural landscape across each of these significant genes. This revealed multiple sites of ordered and evolved RNA structure interspersed between sequences that either have no bias for structure or, indeed, appear to be ordered to be unstructured and (potentially) accessible. In addition to providing data that could be of interest to basic biology, our results provide guides for efforts aimed at targeting this pathogen at the RNA level. We explore this latter possibility through the in silico analysis of antisense oligonucleotides that could potentially be used to target pathogen RNA.


Introduction
Buruli Ulcer (BU) is a skin-related neglected tropical disease (NTD) caused by the family of bacteria responsible for tuberculosis and leprosy (1).The bacterial pathogen, Mycobacterium ulcerans (M.ulcerans), is the causative agent of this disease, which mainly affects the skin and, on rare occasions, the bone.BU has been reported in 33 countries, 14 of which regularly report to the World Health Organization (2).M. ulcerans thrives in temperatures between 29-33°C and requires a low oxygen concentration of 2.5% (3,4), thus, the bacteria is able to easily spread in areas near the equator such as West Africa and other tropical areas in the Western Pacific (5,6).
Although this pathogen spreads quite readily, there is limited knowledge on its mode of transmission (5).However, it is known that M. ulcerans relies on the virulence factor mycolactone for pathogenesis (7, 8).Mycolactone, an exotoxin secreted by M. ulcerans, has three major biological functions: cytotoxicity, resulting in large-scale skin ulceration; immunosuppression, resulting in limited local inflammatory response, and analgesic effects, resulting in limited or no pain associated with damage (9,10).M. ulcerans' ability to block and degrade the nerves of infected tissue, and therefore inhibit the feeling of pain, makes it particularly dangerous (11).Due to the long incubation period and lack of symptoms, the infected often unknowingly carry the disease, prohibiting early diagnosis that is vital to effective treatment (7).
Current treatment options include surgical removal of the skin, oral antibiotics, and dressing of the skin (12).Surgery and skin grafting are only effective for early cases of BU where small areas of the skin are affected and can be safely removed (13).However, at later stages of the disease, large areas of the skin become damaged and surgery is not a safe solution (13).Three antibiotic options have been utilized to treat BU including rifampin, clarithromycin, and streptomycin (13), however, oral antibiotics have not proven to be ideal treatments as patients who take these medications are frequently hospitalized due to adverse effects (7).
Although BU is devastating to people in many developing countries, very few studies have been conducted to find novel treatments.Since it only affects a small portion of the world (6,13), therapeutics would have a relatively small return on investment-thus leaving BU a neglected disease.With this in mind, RNA-targeting therapeutics are emerging as exciting new treatments, which can take the form of small-molecules (14)(15)(16) or antisense oligonucleotides [ASOs (17,18)].The ASO approach could be particularly well suited as a strategy for treating BU and other neglected diseases, due to their rapid and relatively low cost to develop (19,20).Additionally, as RNA-based therapies grow in popularity, the projected future cost associated with their deployment is expected to decline (21).
ASOs are relatively simple yet promising therapeutics that can target RNA via sequence complementarity.Once bound to the target sequence, they can prevent protein production by blocking the ribosome, masking important functional elements, or causing degradation of the bound transcript (22).Significant to this current study, ASOs have previously been used as antimicrobial agents (18,22), which offers hope that this method could be applied to BU. Targeting of RNA, using both small-molecules and ASOs, is greatly facilitated by analysis of target secondary and tertiary structure (16).Analysis of RNA structure has proven that it is an important component of gene regulation in human and pathogen genomes (23)(24)(25)(26); however, to date, no analyses of RNA structure have been performed in M. ulcerans.A better understanding of RNA structure in this pathogen would thus provide information that could prove useful in treating BU, as RNA structures have been shown to be important to proper regulation of translation in other bacterial strains (27).
In this study, we apply an in silico sequence analysis pipeline for prediction of RNA secondary structure, ScanFold, that is based on the identification of small local motifs with unusually high propensity for ordered thermodynamic stability (28).This is accomplished by measuring the stability of evolved/ordered sequences vs in silico randomized sequences with the same nucleotide composition.This approach was optimized on viral human pathogens: HIV, Zika, human Herpesviridae and, most recently, SARS-CoV-2 (23).Significantly, this latter analysis of SARS-CoV-2 led to the discovery (or highlighting) of structural motifs that are being targeted with both small molecules (29)(30)(31) and ASOs (32,33).In this current analysis of M. ulcerans, we enhance our approach to make predictions that are of particular relevance to identification of efficient ASOs.

Methods
The M. ulcerans agy99 genome (NC_008611.1) and associated gff3 data as well as the pMUM001 virulence plasmid (NC_005916.1)and associated gff3 data were downloaded from NCBI (Supplementary File S1).Based on previous work by Butt et al. (34), six essential genes with no host (human) homology were chosen for analyses.The chosen target genes are Mul_RS01615, Mul_RS01365, Mul_RS04730, Mul_RS09540, Mul_RS04200 and Mul_RS00210, or Targets 1-6 respectively.All genes except Mul_RS00210 were found in the NC_008611.1 genome and Mul_RS00210 was found in the NC_005916.1 virulence plasmid.
These genomes and associated gff3 files were loaded into IGV-ScanFold (version 0.2.0) (https://github.com/ResearchIT/IGV-ScanFold/releases/) to view the sequence of M. ulcerans as a whole.After searching for the genes, they were initially scanned in their entirety by highlighting the gene and using the "ScanFold>Run ScanFold on selected region" option.For this analysis the default parameters were changed to a 120 nt window size, a 1 nt step size, 100 randomizations per window, mononucleotide shuffling, 37°C temperature, competition of 1 (to demand that only one unique base pair per nucleotide is possible), either forward or reverse strand depending on the gene polarity, and the RNAfold folding algorithm was selected.Metrics obtained from ScanFold include the MFE (a measure of thermodynamic stability), ΔG z-score (a measure of ordered stability that can indicate potential function), and ensemble diversity or ED (a measure of predicted structure's conformational volatility).The MFE is estimated by the change in predicted Gibb's folding free energy (the ΔG°) going from a fully denatured (random coil) RNA to an ordered 2D structure, where more negative values indicate increasing favorability and greater stability.The ΔG z-scores identify structures that have propensity for ordered stability, where negative values indicate the number of standard deviations more stable the native sequence is compared to any randomized version of it (35,36).The ED uses the RNA partition function to compare the distance between the optimal thermodynamic structure and suboptimal conformations (37).This distance is measured as the number of base pairs different between Boltzmann weighted conformations and is averaged across the ensemble.Lower ED values indicate a single dominant conformation, while higher EDs suggest multiple conformations or a lack of defined structure (38).Additionally, arc diagrams depicting weighted z-score structures are produced where blue, green, and yellow arcs indicate z-scores ≤ −2, ≤ −1, and < 0, respectively.For more information on these metrics, see the ScanFold methods paper (39).After initial analysis, several genes were found to have regions of low z-score near the 3′ and 5′ ends.All genes were rescanned using the same parameters with an additional 50-300 nucleotides upstream and downstream of the start and stop codon, respectively.This allowed us to note any patterns in stability immediately outside the targets.As an additional layer of structure prediction validation, these genes, including the upstream and downstream sequences, were also scanned using dinucleotide shuffling.The results of mono-and dinucleotide shuffling were compared using the "ct_compare.py"script (https://github.com/moss-lab/SARS-CoV-2)and were found to have only minor differences.The average difference in percent of paired nucleotides was 2.29% and 0.66% for −1 and −2 z-scores pairs respectively, and the average percent similarity or consistency in pairing partners was 86.63% and 97.93% for −1 and −2 z-scores pairs respectively.All structures analyzed in this work were predicted using both shuffling methods, further validating that they may be potentially functional.
To further assess the potential functionality of structures found in these genes, covariation analysis was completed using the cm-builder analysis pipeline (40).All 21 structures with a z-score ≤ −2 were analyzed for covariation using the cm-builder perl script (40).This script builds off the RNAFramework toolkit (41,42) and utilizes Infernal (here using release 1.1.2(43);) to build and search for covariation models from each predicted ScanFold secondary structure.Briefly, cm-builder uses Infernal to generate a covariance model (CM) using the sequence and structure of the selected motif.A database of sequences is built for Infernal using a BLAST search (44).For M. ulcerans, the nt collection database was searched for each of the selected motifs found in the target genes using the following parameters: BLASTn (Somewhat similar), Organism (bacteria, taxid:2), and a max sequence number of 5000.Infernal then takes the BLAST data and aligns the sequences removing any redundant and severely truncated alignments.The resulting set of homologs from the alignment of BLAST hits is then aligned to the original CM, which is used to build a new CM.The whole process is repeated three times.Finally, the resulting alignment is refactored to remove gap-only positions and include only bases spanning the first to the last base-paired residue.The final alignment file is analyzed using R-scape v2 (41) where APC corrected G-test statistics (45) are used to identify motifs showing significantly covarying base-pairs using the default E value of 0.05.Statistically significant covariation is indicative of evidence for a conserved RNA base pair, but the lack of covariance does not indicate that the structure has no potential for function.All covariation results were analyzed to evaluate their significance or power, and all statistically significant base pairs were annotated on the 2D models made using VARNA (46) (Supplementary File S2).
After running ScanFold to define regions of unusual thermodynamic stability, and cmbuilder to identify statistically significant covarying base pairs, the sequences and predicted secondary structures of each gene were loaded into OligoWalk (47) and ran using default parameters (Mode: Break Local Structure, Oligo Chemistry: DNA, Oligo Length: 18nt, and Oligo Concentration: 1uM).The overall ΔG, duplex ΔG, intra-oligo ΔG, and inter-oligo ΔG values for 18 nucleotide complementary oligonucleotides tiled across each gene sequence were obtained for further analysis.Additionally, using the sequence fasta file, final partners file, and log files from IGV-Scan Fold an in-house python script, "ScanFold_Oligo_Metrics.py", was used to partition the data into 18 nt fragments and determine the average ED, MFE, and z-score for each.(https://github.com/moss-lab/ScanFold_Oligo_Metrics/blob/main/ScanFold_Oligo_Metrics.py).This was accomplished using a tiled window approach, which allowed for the complementary analysis of ScanFold and OligoWalk metrics for the same fragments.For more information on the data generated by this script, see the ReadMe file on GitHub.Using Oligowalk and ScanFold data, bar graphs were made for every gene (Supplementary File S3).All data used to create these graphs and analyze the results, can be found in (Supplementary File S4).Following the work of Matveeva et.al., oligonucleotides that had intra-oligo values greater than −1.1 kcal/mol, inter-oligo values greater than −8 kcal/mol, duplex ΔG values of less than −15 kcal/mol, and negative overall ΔG values were considered as optimal ASO targets to begin further analysis.This data was then compared to the ScanFold fragment data and trends were noted.

Results
We focused on six genes found to be essential to M. ulcerans (34,48) with no known homology to human hosts: Mul_RS01615, Mul_RS01365, Mul_RS04730, Mul_RS09540, Mul_RS04200 and Mul_RS00210.These were selected for their essentiality to M. ulcerans, their homology to genes targeted in other pathogens, and their lack of host homology (34,48).These were subjected to the full ScanFold pipeline as integrated in IGV-ScanFold.In the first step, ScanFold-Scan was used to define the local RNA folding landscape of each gene (Supplementary File S5).Here, a scanning analysis window of 120 nucleotides was moved one base at a time across each gene sequence while predicting several folding metrics for each window: the minimum free energy of folding (MFE, ΔG°), its associated base pairing model (secondary structure), ensemble diversity (ED), and a thermodynamic z-score that compares the MFE of the natively ordered RNA to the MFE of randomized versions of the same sequence to identify propensity for ordered stability as indicated by negative values.Details on all metrics can be found in the Methods Section.In the second stage, ScanFold-Fold, folding metrics are partitioned to each nucleotide and base pair, and consensus base pairs across all windows-weighted by their propensity for ordered stability (thermodynamic z-score) -are identified and output as distinct motifs (Supplementary File S6).
A summary of results across each analyzed gene is seen in Table 1.Here, the average ΔG values across all analyzed genes ranged from −48.40 kcal/mol to −33.23 kcal/mol.This difference in predicted RNA folding stability is directly correlated with GC %, as expected, which ranged from 68.14% to 61.32%.The average z-score and ensemble diversity (ED), however, do not follow trends for ΔG or GC%.The z-score quantifies how much greaterthan-random the folding stability of an RNA sequence is, which is primarily dependent on the sequence order and not its nucleotide composition.Likewise, the ED value indicates the diversity of potential structural conformations in an RNA's folding ensemble, which also appears to be an evolved property of ordered/functional RNA sequences (38).The average z-score ranged from −0.82 to −0.03, while the average ED ranged from 20.64 to 26.85.Here we see that the gene with the lowest (most favorable) ΔG and highest GC% (Mul_RS01615) has the highest (least favorable) average z-score and ED.The gene with the lowest average z-score (Mul_RS04730) also showed the highest percent of windows with z-score < −1 (39.55%).This did not hold true for z-scores < −2 (two standard deviations more stable than random), but the gene with the second lowest average z-score (Mul_RS01365) showed the highest percent of windows with z-scores < −2 (14.76%).The gene with the smallest fraction of its nucleotides spanned by low z-score windows was Mul_RS01615 (19.0% and 4.4% for the −1 and −2 z-score cutoffs, respectively).
In the prediction of the ΔG for each analysis window, a model secondary structure is also generated.Across each gene, this resulted in many predicted base pairs where a specific nucleotide can be paired differently across multiple overlapping windows.As a result, many potential base pairing partners may be predicted per nucleotide.ScanFold-Fold predicts a single structural context (paired or unpaired) for each nucleotide based on its contributions to low z-score windows (indicating ordered stability and potential functionality).The genes that had the greatest percentages of low z-score base pairs were Mul_RS04730 and Mul_RS01365, where Mul_RS04730 had 25.90% of its base pairs predicted with average z-score < −1 and Mul_RS01365 had 10.12% of its base pairs predicted with average z-score < −2 (Table 1).Additionally, ScanFold-Fold extracts discrete structural motifs (i.e., single hairpins or multi-branched stem loops) containing at least one base pair with an average z-score < −2.This results in a list of motifs for each gene, where the longest gene Mul_RS00210 had the greatest number of motifs (8 motifs) and the much shorter gene Mul_RS01365 had the second greatest number of motifs (7 motifs).
In summary, all predictions (mono-and dinucleotide shuffling) indicate a particular importance for functional RNA secondary structures encoded within the genes for virulence factor production and cell wall biosynthesis (Supplementary Files S5).The ScanFold-Fold results also give us a means of generating motifs of interest for further analysis (Supplementary File S6).

Analysis of RNA folding in Mul_RS01365
Mul_RS01365 encodes a protein that is homologous to the desA2 (stearoyl-ACP desaturase) protein of M. tuberculosis.This protein is thought to function by catalyzing the initial conversion of saturated fatty acids to unsaturated fatty acids in lipid metabolism.
A summary of ScanFold data is shown on Figure 1, where blue, green, and yellow arcs indicate base pairing with average z-scores ≤ −2, ≤ −1, and < 0, respectively.The overall thermodynamic stability remains flat across Mul_RS01365 until scans reach the 3′ end of the gene, where stability increases (more negative ΔG in windows overlapping these nucleotides).Despite fairly monotonous trends in ΔG, evidence for regions of ordered RNA stability (negative z-scores) are clustered into two distinct regions at the 5′ and 3′ ends of the coding region, respectively.Interestingly, the core coding region is spanned by an area with positive z-scores, which indicates that the ordered/evolved sequence is less stable than predicted based on nucleotide content: i.e., the evolved sequence may be ordered to be unstructured or accessible.This region of unusual instability correlates with spikes in the ensemble diversity, which indicates a volatile ensemble of potential RNA secondary structures or a lack of stable structure.Conversely, the previously mentioned low z-score clusters overlap regions of low ensemble diversity, indicating one (or a few similar) conformation(s).
When the low z-score windows were evaluated by ScanFold-Fold, seven distinct motifs (M1-7) were identified with exceptionally low (< −2) z-score-weighted base pairs (Figure 1).These motifs comprise two upstream hairpins (M1 and M2) and five downstream hairpins that span the translational start and stop codons, respectively.M1 is notable for containing the start codon embedded within a stable helix and for being the longest thermodynamically stable hairpin predicted for this gene.When evaluated for their conservation across mycobacterial species, none of the proposed base pairs were found to have statistically significant covariation (Supplementary File S2).The proposed structures were, however, found to be present/conserved across a wide array of species.In M1, for example, the hairpin structure is preserved across pathogenic species of mycobacteria (Figure 2).Conservation is highest in the terminal hairpin loop region that contains the start codon, and trails off toward the basal stem, where deletions and inconsistent mutations (that ablate canonical base pairing potential) would be predicted to weaken the basal stem.The core hairpin is best preserved (100% preservation of base pairing) in the medically significant strains, M. gordonae, kansasii, tuberculosis, and bovis while inconsistent mutations begin accumulating in M. fortuitum, where a U>C mutation disrupts a single AU pair (Figure 2).M. avium had the most inconsistent mutations in the core hairpin, that would be predicted to weaken or break four base pairs (out of 15).Conversely, while no compensatory mutations are observed in the core hairpin, six species had at least one consistent change, where single point mutations preserve base pairing; M. Leprae had two consistent mutations, however, these were offset by several inconsistent changes.

Analysis of RNA folding in Mul_RS04730
Mul_RS04730 encodes a protein homologous to the rpoA (alpha chain of RNA polymerase) of M. tuberculosis.This protein functions as a component of the DNA-dependent RNA polymerase responsible for bacterial genome replication.
A summary of ScanFold data is shown in Figure 3 following the same color scheme mentioned previously.Here, the overall thermodynamic stability (ΔG) remains flat across the entirety of the transcript.Despite the flatline trend in ΔG, evidence of ordered RNA stability (negative z-scores) is clustered into one distinct region around the 5′ end of the coding region.The coding region is predominantly spanned by structures with z-scores of 0 to −1 but a few structures with z-scores ≤ −2 and ≤ −1 are interspersed.This indicates that the ordered/evolved sequence is overall only slightly more stable than predicted based on nucleotide content: i.e., the evolved sequence may be ordered to be structured.The regions of only slightly increased stability correlate with spikes in the ensemble diversity when compared to those of the much more stable region near the 5′ end of the coding region.This is indicative of a volatile ensemble of potential RNA secondary structures or a lack of stable structure across the core coding sequence; whereas the more 5′ region shows a less volatile ensemble or more stable structures that may be potentially functional.
When the low z-score windows were evaluated by ScanFold-Fold, two distinct motifs (M1-2) were identified with exceptionally low (< −2) z-score-weighted base pairs (Figure 3).These motifs comprise one upstream multi-branch helix (M1) and one downstream hairpin (M2) that span the start site of translation and part of the coding sequence, respectively.M1 is notable for containing the start codon in a single stranded region between two hairpins of the multi-branch helix.When evaluated for their conservation across mycobacterial species, two of the proposed base pairs were found to have statistically significant covariation with power ≥ 0.25 (Supplementary File S2 and Figure 3).In addition, part of the M1 structure and all of the M2 structure were found to be present/conserved across a wide array of species.In M1, for example, the hairpin structure is preserved across pathogenic species of mycobacteria (Figure 4).Conservation is highest in the central hairpin that shows evidence of significant covariation and in the downstream hairpin containing the start codon.However, toward the 5′ end of the structure, conservation drops off, as homologous sequences were not found here.The small hairpin containing the last nucleotide of the start codon is best conserved (100% preservation of base pairing) in the medically significant strain, M. gordonae, kansasii, fortuitum, phlei, abscessus, and avium; while inconsistent mutations begin accumulating in M. tuberculosis, bovis, and leprae where a U>A mutation disrupts a single AU pair (Figure 4).M. gordonae had the most inconsistent nucleotides (with the proposed 2D structure) in the 5′ hairpin (10 mutations), while M. tuberculosis and bovis had the most inconsistent nucleotides in the core hairpin (5 mutations).Interestingly, two compensatory mutations are observed in the core hairpin, which match those found in the cm-builder analysis.Seven of the ten species had at least one consistent change, where single point mutations preserve base pairing.M. kansasii had two consistent mutations, however, these were offset by several inconsistent changes.Notably, M. leprae, another necrotizing bacterium, appeared to have the highest level of conservation across the entire structured region when compared to M. ulcerans.Overall, the core hairpin and next downstream hairpin containing the start codon appear to be quite conserved, and when analyzed against all other data seems to indicate this region could be potentially functional.

Analysis of RNA folding in Mul_RS04200
Mul_RS04200 encodes a protein homologous to the rpoB (beta chain of RNA polymerase) of M. tuberculosis.This protein functions as a component of the DNA-dependent RNA polymerase responsible for bacterial genome replication.
Interesting trends are seen in the ScanFold-Scan folding metrics partitioned per nucleotide of the gene (Supplementary Figure S1).The overall thermodynamic stability (ΔG°) remains flat across the entirety of the Mul_RS04200 gene.Despite the flatline trend in ΔG°, there is evidence for regions of ordered RNA stability (negative z-scores).Z-scores remained relatively negative across the entire gene, but regions of lower z-scores were noted.Only one region in the middle of the gene showed a significant decrease in z-score below −2, while the majority of the gene's 3′ end displayed lower than average z-scores (indicating increased stability towards the end of the gene).In contrast to the 3′ end, a region of positive z-scores (potentially ordered to be unstructured) is observed approximately 750 nucleotides from the 5′ end.
The ScanFold-Fold motif found in the lowest z-score region (M1; Supplementary Figure S1) had base pairs with significantly low z-scores (< −2) which increased upstream and downstream of the hairpin.Notably, as base pairing extended out from the basal stem, the z-scores steadily increased until the final two bulges and terminal loop became only slightly negative.

Analysis of RNA folding in Mul_RS09540
Mul_RS09540 encodes a protein homologous to the rpoZ (omega component of RNA polymerase) of M. tuberculosis.This protein functions as a component of the DNAdependent RNA polymerase responsible for bacterial genome replication.
When analyzing the ScanFold-Scan folding metrics partitioned per nucleotide of the gene, interesting trends are seen (Supplementary Figure S2).Compared to the others, Mul_RS09540 is the smallest gene that was analyzed.The overall thermodynamic stability (ΔG) remains uniform across the entirety of the gene, and it was found to have the fewest base pairs with z-scores < −2.Near the 3′ end, there are two distinct clusters of z-score values that increase into the positive range, thus indicating a region that may be evolved to have reduced structural propensity.This same unstable region correlates with high ensemble diversity, further indicating the lack of ordered and stable structure at the 3′ end.Conversely, in the middle of the gene, low z-score regions overlapped with regions of low ensemble diversity indicating that stronger RNA secondary structures are present near the core of the coding sequence.Interestingly, the most stable area of the structure seems to occur towards the middle of the gene rather than near the 3′ or 5′ ends.
The ScanFold-Fold motif found for this region (M1; Supplementary Figure S2) is predicted to form a 106 nucleotide hairpin that displays interesting trends in stability.In this structure, the most significantly low z-score base pairs were found in the first 5 base pairs of the basal stem.As base pairing extended out from the basal stem, the z-scores steadily increased until the final two bulges and terminal loop became only slightly negative.

Analysis of RNA folding in Mul_RS00210
Mul_RS00210 encodes a protein homologous to the Pks7 (putative polyketide synthase) of M. tuberculosis.This protein likely functions in lipid metabolism where it is believed to be involved in an intermediate step for the synthesis of a polyketide molecule, mycolactone.This polyketide has been shown to act as the virulence factor essential for infection and the painless nature of the ulcers (48).Rather than falling within the bacterial genome, the Mul_RS00210 gene occurs within the 174 kb virulence plasmid, pMUM001, that encodes a cluster of giant polyketide synthases (48).Until recently, this was an uncharacterized example of plasmid-mediated virulence in a Mycobacterium, and it is believed that the pathogenicity of M. ulcerans is due the acquisition of pMUM001 by horizontal gene transfer (48).
A summary of ScanFold data is shown on (Figure 5).The overall thermodynamic stability (ΔG) of Mul_RS00210 is more variable than all other genes analyzed in this study.Evidence for regions of ordered RNA stability (negative z-scores) are clustered into many distinct regions throughout the coding region.Here, there are eight structured regions with z-scores < −2 and five small regions of positive z-scores interspersed (the first region is in the intergenic region).This indicates that the ordered/evolved sequences making up these eight structures are much more stable than predicted based on nucleotide content: i.e., the evolved sequence may be ordered to have structure.These regions of high ordered stability (low z-score) correlate with dips in the ensemble diversity when compared to those found in regions of much lower ordered stability indicating that they are more stable and have a less volatile ensemble of potential RNA secondary structures that can form.The few regions of positive z-score may be indicative of regions that are evolved to be unstructured and therefore accessible to regulatory molecules.
When the low z-score windows were evaluated by ScanFold-Fold, eight distinct motifs (M1-8) were identified with exceptionally low (< −2) z-score-weighted base pairs (Figure 5).These motifs comprise seven hairpins (M1-7) and one multibranch helix (M8).Here, all hairpins were found to have base pairs with significantly lower than average z-scores (< −2), whereas the multi-branch helix was found to only have significantly low z-scores in the basal stem.When evaluated for conservation across mycobacterial species, two of the proposed base pairs found in structures M1 and M2 were found to have statistically significant covariation with power of 0 to 0.1 and ≥ 0.25 respectively (Figure 5 and Supplementary File S2).In addition, all structures were found to be present/conserved across a wide array of species.
One unique feature of all −2 z-score structures found in this gene was their presence throughout the other two genes encoding the polyketide synthase subunits responsible for production of mycolactone (Figure 5).Predicted motifs are thus replicated multiple times throughout the plasmid-up to 15 times (Figure 5).Although this observed multiplication of structural elements is unique to our analysis, this finding is not entirely surprising as Stinear et al. found that these three genes have stretches of up to 27kb with near identical nucleotide sequence (99.7%).Additionally, of the 105-kb mycolactone locus, there is only ~9.5 kb of unique, nonrepetitive sequence (48).

Analysis of RNA folding in Mul_RS01615
Mul_RS01615 encodes a protein that is homologous to the accD3 (putative acetyl CoA carboxylase carboxyl transferase-beta subunit) of M. tuberculosis.This protein is a component of the acetyl coenzyme A carboxylase complex and plays a functional role in lipid metabolism.
The overall thermodynamic stability (ΔG) stability varied across the gene, with a general trend toward lower stability (more positive values) toward the 3′ end (Figure 6).A distinct cluster of low z-score nucleotides occurs in the core coding region of the gene, which overlaps a cluster of distinctly low ensemble diversity values.The ScanFold-Fold motif built for this region, M1, contains a multibranch loop structure formed from two hairpin loops with low, but not significantly negative, z-score nucleotides and base pairs.The multibranch loop structure sits atop a long stem formed by significantly low (< −1) z-score base pairs and nucleotides, where the basal stem (composed of six base pairs) had the lowest (< −2) z-scores.Although the ensemble diversity was low across M1, the region spanning the two hairpins of the multibranch loop were higher (Figure 6), suggesting potential conformational dynamics for the two hairpins.Notably, the region immediately downstream of M1 was spanned by positive z-scores, indicating a potential for ordered instability.A second structural motif, M2, was predicted immediately downstream of the annotated open reading frame (ORF) for Mul_RS01615 (Figure 6).This motif falls in a somewhat diffuse region of low z-scores that overlaps a cluster of moderate ensemble diversities; thus, while the base pairs and nucleotides comprising M2 have significantly low z-scores (<−2), the predicted conformational ensemble does not appear to be particularly tight (i.e., potential for dynamics).It is worth noting that the transcript annotations for M. ulcerans are not sufficient to determine if this motif falls within the 3′ UTR of Mul_RS01615.While both motifs in Mul_RS01615 were found to have potentially homologous sequences and structures in other mycobacteria (Supplementary File S2), significant covariation was not observed.

Analysis of antisense oligonucleotide accessibility
To explore the potential value of ScanFold data in identifying binding sites for short antisense oligonucleotides (ASO), we partitioned the ScanFold-Scan and -Fold results by averaging predicted metrics across short (18 nt) windows that approximate the size of potential ASO binding sites.This was further enhanced by predicting duplex binding affinities via the program OligoWalk (47) while considering the effects of ScanFold predicted local intramolecular structure on intermolecular duplex formation (all data available in Supplementary File S4).To facilitate analysis, these results were also plotted vs. ScanFold-Fold predicted base pairs (Supplementary File S3).Numerous short stretches of sequence across each gene of interest, including potentially accessible regions with positive z-scores, overlap strong predicted duplex binding sites.While the results in toto are potentially valuable for aiding in the identification and design of ASOs vs. M. ulcerans, we focus our attention on the two genes that have predicted ordered structures that encompass the start sites of translation, which are particularly attractive sites for antimicrobial ASO therapeutics.
The ASO accessibility results for the desA2 homolog, Mul_RS01365, are summarized in Figure 7.The trend toward enhanced thermodynamic stability of local RNA secondary structure in this gene is starkly illustrated, where significant dips in the partitioned averaged MFE and z-score overlap the translational stop site and ScanFold-Fold modeled low zscore structures.The inaccessibility of this region to ASOs is corroborated by OligoWalk predictions that show the overall duplex ΔGs are highly unfavorable: indeed, this region is predicted to have the most positive values across the gene.Conversely, the 5′ end, spanning the start codon, is predicted to have some of the least stable (least negative predicted average MFEs) local structure across the gene.Perhaps counterintuitively, there are dips in the average z-score and ensemble diversity that indicate ordered structure and indeed, ScanFold-Fold models show significantly ordered base pairing in this region (Figures 1, 7).Put another way, while the thermodynamic stability of this region is predicted to be low, it is still higher than expected given the nucleotide content, thus the ordered stability indicated by negative z-scores.
Favorable overall ASO duplex ΔGs span the start codon, despite it being contained in ordered structure (e.g., in M1, Figure 1), indicating a potential for "strand invasion", where an ASO can efficiently bind by breaking/replacing intramolecular helices with intermolecular base pairs.To explore this further, and to illustrate how our data could facilitate ASO design, the most favorable (lowest ΔG, taking into account disruption of target structure) ASO predicted to occlude the start site was identified (Figure 8).Here, an 18-mer ASO is predicted to bind to the mRNA with an overall ΔG of −9.8 kcal/mol, which considers the significant energy barrier (+15.8 kcal/mol) needed to disrupt 12 base pairs in the M1 hairpin structure.This disruption is predicted to totally ablate the terminal hairpin structure (Figure 8).Notably, other favorable ASO binding sites flank this optimal one (Figure 8 and Supplementary File S4) and thus, the ASO sequence can be extended in either direction to enhance predicted stability: e.g., to add additional stabilizing base pairs to the looped-out nucleotides predicted to be released upon disruption of the terminal hairpin loop.
ASO accessibility results for the rpoA homolog Mul_RS04730 are shown in Figure 9.The partitioned MFE varies across the mRNA, however, it is predicted to be less stable in the 18-mers that span the start codon.The z-score is lowest toward the 5′ end of the RNA and steadily increases toward the 3′ end, indicating 18-mers are less likely to be engaged in ordered RNA structures as one moves along the sequence.Importantly, the overall predicted ASO duplex stabilities were most favorable in the region spanning the start codon; indeed, the most stable predicted duplex across the entire mRNA spans the start site (Figure 9).The overall predicted ASO duplex ΔG is −15.0 kcal/mol, which considers the relatively low barrier (+7.4 kcal/mol) needed to invade the flanking hairpin structures in the target RNA (2 base pairs in each structure) (Figure 10).Similar to Mul_RS01365, multiple other ASOs are predicted to bind near this optimal site, allowing for more stable sequences to be designed: e.g., by extending the optimal predicted ASO to further invade the small downstream hairpin structure.
Beyond these interesting examples of ordered/structured RNAs that could be targeted with ASOs there are regions with apparent ordering for a lack of structure (positive z-scores) that could be especially accessible to ASOs (Supplementary File S3).For example, the 18-mer sequences that span the core coding region of Mul_RS01365 are all predicted to have positive z-scores and relatively unfavorable MFEs (Figure 7).This central portion of the coding region also possesses the most favorable predicted ASO binding sites for the gene, suggesting that this region is an especially attractive site for ASO binding.

Discussion
This study reports the first comprehensive analysis of RNA secondary structure in M. ulcerans.Here, we focused on a set of essential genes whose proteins have no homology to the human host.This is based on analyses of human and M. ulcerans pathways in the KEGG database and BLASTP homology searches of 20 different bacteria (34).This large set of 236 genes was also prioritized based on their druggablity and how well suited they are as potential therapeutic targets (34).Here, we focused on six of these genes specifically based on their potential significance for drug targeting.Five of these genes are annotated as being essential to survival and replication (RNA synthesis and lipid metabolism) of the bacteria (34), while one is annotated to be required for pathogenicity and the painless nature of the ulcers (polyketide or mycolactone synthesis) (9,48).We have specifically adapted our structural analysis pipelines to obtain data on these genes that may provide valuable results for advancing M. ulcerans mRNA therapeutics.Maps of the local RNA folding landscapes provide insights into the stability and potential for ordered/functional folding across each gene of interest.For example, looking at the ScanFold data for Mul_RS01365 (both partitioned per-nucleotide and by 18-mers; Figures 1 and 7, respectively), we see thermodynamic stability (favorable/low MFE predictions) and ordered stability (low z-scores) at the 3′ end spanning the stop codon.This suggests potential roles for RNA structure and its thermodynamic stability in the termination of translation.Conversely, the coding region of this gene appears to be unstructured (as evidenced by mediocre MFEs and positive z-scores) perhaps to facilitate interactions with regulatory molecules or to promote rapid translation of this gene.Indeed, this latter idea is gaining traction as a mode for affecting protein folding (49): i.e., mRNA structural stability regulating the rate of translation to facilitate folding of proteins being translated.
While none of the analyzed genes had global biases for ordered structure (average z-score < −1; Table 1) clusters of ordered stability were present in each, where at least one defined structural motif could be predicted with exceptional (z-score < −2) base pairs, yielding 21 motifs in total across the six target genes when using mono-or dinucleotide shuffling.These sequences have been (presumably) ordered to fold over evolution and this proposition is supported by the overall conservation of secondary structure that was observed for each motif, as well as the identification of 4 statistically significant covarying base pairs.The functional roles of conserved, ordered RNA secondary structures in M. ulcerans and species with identified homologs can be diverse.For example, as noted above, structures in coding regions could affect translation rates.Structures may also be playing roles in modulating accessibility to regulatory molecules present in the bacterial cell or in mRNA turnover (e.g., modulating sensitivity to endogenous RNases).Significantly, two motifs span the translational start site: in Mul_RS01365 and Mul_RS04730 the start codons are modeled to lie within a helix and loop, respectively.
The instances where start sites are constrained within ordered structure present an interesting case for potential ASO therapeutics.Promising efforts at designing antimicrobial ASOs have focused on occluding the translational start sites of bacterial genes (50).For Mul_RS01365 and Mul_RS04730, we predict oligos that can invade target intramolecular structure and, for Mul_RS01365, have the potential to greatly disrupt the ordered folding that may itself be functionally significant.That is to say, an ASO targeting the start site of Mul_RS01365 may have two potential modes of action: obscuring the start site to impede translation and disrupting a secondary structure that may itself play roles in regulating translation.Beyond these two structural motifs, we present 17 others which may also be amenable to ASO-mediated disruption or that may be targeted with RNA-targeting small-molecules-a promising area of research for novel therapeutics.Notably, the 8 identified motifs found in Mul_RS00210 are replicated multiple times in the virulence plasmid, which increases the concentration of potential ligand binding sites.Disrupting the genes responsible for mycolactone production could attenuate the pathogenicity of M. ulcerans.Our data also suggest regions of unusual thermodynamic instability across all analyzed genes, which may be ordered for accessibility to trans-regulatory molecules.These sites may be particularly attractive for ASOs that can mask such sites from interacting partners and/or stimulate mRNA degradation.
It is worth noting that there are several potential limitations to our approaches.These limitations arise in part from the sparse data for M. ulcerans due to the "orphan" status of its associated disease.This paucity of data makes analyses of this pathogen challenging, including the analysis of functional RNA structure.For example, the identification of RNA secondary structure models is limited by the lack of covariation across the identified structures due to the limited diversity of sequences available.While structures are conserved, the little variability in sequence cannot identify significant, structure-preserving, covariation.There are few available large datasets; this includes genome sequencing of different mycobacterium strains (such as other M. ulcerans strains), biochemical probing data, and "omics" data such as RNA binding protein sites.Lastly, we are limited by a lack of experimental data to validate the effect of our predicted ASOs on survival and replication of M. ulcerans.Even with these potential limitations, the results presented here provides researchers with valuable data that can guide their studies to identify targets and design experiments.This highlights the need for additional work on this pathogen to collect more genome-wide datasets and develop better systems for in vitro studies (e.g. of putative ASO activity).

Conclusion
We present the first in-depth analysis of RNA structure in six key genes of M. ulcerans, the microbe responsible for the neglected tropical disease, Buruli Ulcer.Our results are made public to advance a basic understanding of the RNA biology of this pathogen-by providing conserved structural motifs with high likelihood of function (which may, themselves, serve as potential therapeutic targets).As well, we hope to advance novel therapeutics against M. ulcerans by providing data to facilitate antisense oligonucleotide design.

FIGURE 1 .
FIGURE 1. Mul_RS01365 ScanFold results and 2D models.Global ScanFold results for Mul_RS01365.The ΔG z-score, minimum free energy (MFE), ensemble diversity (ED), and base pair arc diagram, and gene cartoon (top to bottom) are shown to the left.All −2 ΔG z-score structures found are represented as 2D models to the right.The base pair arc diagram is annotated with gray boxes to show the location of M1-7 across the gene.In the 2D structure model of M1, the canonical AUG start codon is annotated in blue.The nucleotides of each structure are annotated with the average per nucleotide z-scores where the most negative are indicated in red and the most positive are indicated in blue.

FIGURE 3 .
FIGURE 3. Mul_RS04730 ScanFold results and 2D models.Global ScanFold results for Mul_RS04730.The ΔG z-score, minimum free energy (MFE), ensemble diversity (ED), base pair arc diagram, and gene cartoon (top to bottom) are shown to the left.All −2 ΔG z-score structures found are represented as 2D models to the right.The base pair arc diagram is annotated with gray boxes to show the location of M1-2 across the gene.In the 2D structure model of M1, the canonical AUG start codon is annotated in blue.Nucleotides that exhibit statistically significant covariation are annotated by green bars across the base pair.The nucleotides of each structure are annotated with the average per nucleotide z-scores where the most negative are indicated in red and the most positive are indicated in blue.

FIGURE 5 .
FIGURE 5. Mul_RS00210 ScanFold results, 2D models, and structure occurrence throughout the pMUM001 plasmid.Global ScanFold results for Mul_RS00210 including the location of repetitive sequence and structure elements across the pMUM001 virulence plasmid.(A) A schematic diagram of the 174kb pMUM001 virulence plasmid with the locations of the three largest genes that encode the subunits responsible for production of mycolactone.The colored boxes annotated across these three genes represent the locations of the sequences that form structures M1 (red), M2 (yellow), M3 (green), M4 (blue), M5 (pink), M6 (orange), M7 (purple), and M8 (black) in the Mul_RS00210 gene.(B) Global ScanFold results for Mul_RS00210 including (top to bottom) the ΔG z-score, minimum free energy (MFE), ensemble diversity (ED), base pair arc diagram, gene cartoon, and 2D models of −2 ΔG z-score structures.The base pair arc diagram is annotated with colored boxes (correlated to the colors in panel (A) to show the location of M1-8 across the gene.In the 2D structure models, nucleotides that exhibit statistically significant covariation are annotated by green bars across the base pair.The nucleotides of each structure are annotated with the average per nucleotide z-scores where the most negative are indicated in red and the most positive are indicated in blue.

FIGURE 6 .
FIGURE 6. Mul_RS01615 ScanFold results and 2D models.Global ScanFold results for Mul_RS01615.The ΔG z-score, minimum free energy (MFE), ensemble diversity (ED), base pair arc diagram, and gene cartoon (top to bottom) are shown to the left.The base pair arc diagram is annotated with gray boxes to show the location of M1-7 across the gene.All −2 ΔG z-score structures found are represented as 2D models to the right.The nucleotides of each structure are annotated with the average per nucleotide z-scores where the most negative are indicated in red and the most positive are indicated in blue.

FIGURE 8 .
FIGURE 8. Predicted binding and strand invasion of Mul_RS01365 Motif 1 by the most favorable 18-mer antisense oligonucleotide.The Mul_RS01365 Motif 1 2D model, ASO of interest with OligoWalk data, and the predicted structure after strand invasion (left to right).Metrics in green indicate favorable binding, and the nucleotide outlined in green indicate the position of the start codon in the structure.

FIGURE 10 .
FIGURE 10.Predicted binding and strand invasion of Mul_RS04730 Motif 1 by the most favorable 18-mer antisense oligonucleotide.The Mul_RS04730 Motif 1 2D model, ASO of interest with OligoWalk data, and the predicted structure after strand invasion (left to right).Metrics in green text indicate favorable binding and the nucleotides outlined in green indicate the position of the start codon in the structure.