Improved Measurements of RNA Structure Conservation with Generalized Centroid Estimators

Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task for not only molecular cell biology but also bioinformatics. Secondary structures of ncRNAs are employed as a key feature of ncRNA analysis since biological functions of ncRNAs are deeply related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone could not be an appropriate measure for identifying ncRNAs since the free energy is heavily biased by the nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measurements are unfortunately not suitable for identifying ncRNAs in some cases including the genome-wide search and incur high false discovery rate. In this study, we propose improved measurements based on SCI and BPD, applying generalized centroid estimators to incorporate the robustness against low quality multiple alignments. Our experiments show that our proposed methods achieve higher accuracy than the original SCI and BPD for not only human-curated structural alignments but also low quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than or comparable with that of the original SCI on structural alignments generated with RAF, a high quality structural aligner, for which twofold expensive computational time is required on average. We conclude that our methods are more suitable for genome-wide alignments which are of low quality from the point of view on secondary structures than the original SCI and BPD.

structurally conserved. Thus, SCI is defined as the rate of MFE for the common secondary structure to the averaged MFE for each sequence. MFE values for each sequence and the common secondary structure are calculated by RNAfold and RNAalifold in Vienna RNA package (Hofacker, 2003). Gruber et al. (2008) have employed the base pair distance (BPD) as a measurement of structure conservation. BPD has been originally defined as the normalized Hamming distance between two RNA secondary structures (Flamm et al., 2001). In , BPD has been shown to be as accurate as SCI.
RNAz with SCI has been used for ncRNA screens in several organisms including humans (Washietl et al., 2005a), nematodes (Missal et al., 2006), plasmodium (Mourier et al., 2008), and arabidopsis (Song et al., 2009), showing that RNAz is one of the most accurate tools for identifying ncRNAs. However, for practical use of RNAz on the genome-wide search, a relatively high false discovery rate has unfortunately been estimated (Washietl et al., 2007). It is conceivable that multiple alignments produced by a standard aligner that does not consider any secondary structures are not suitable for identifying ncRNAs in some cases and incur high false discovery rate. Wang et al. (2007) have also suggested that the genome-wide alignments in the UCSC Genome Browser (Kent et al., 2002) produced by MULTIZ (Blanchette et al., 2004) should be improved in some regions for identifying ncRNAs. To improve the accuracy, two strategies can be considered: the one is to employ a structural aligner such as RAF (Do et al., 2008) to produce high quality alignments, and the other is to develop a more robust method against low quality alignments. Since the former strategy will consume impractical execution time for structural alignments, this study takes the latter strategy.
Recently, several researchers have studied high-dimensional space estimation based on maximizing expected accuracy (MEA) including sequence alignment and RNA secondary structure prediction (Ding et al., 2005;Do et al., 2005Do et al., , 2006Carvalho and Lawrence, 2008;Hamada et al., 2009Hamada et al., , 2011, showing that MEAbased estimation is more reliable than the maximum likelihood estimation. CentroidFold and CentroidFold employ one of MEA-based estimators called g-centroid estimators for predicting RNA (common) secondary structures, and have been shown to be more accurate than other existing tools such as MFE-based methods (Hamada et al., 2009(Hamada et al., , 2011. Especially, CentroidFold can predict much more accurate common secondary structures for low quality multiple alignments produced by CLUSTAL W (Thompson et al., 1994) than RNAalifold.
In this study, we propose improved measurements for structure conservation based on g-centroid estimators for RNA (common) secondary structure prediction, instead of MFE-based predictions by RNAfold and RNAalifold, to incorporate the robustness against low quality multiple alignments. We call them C-SCI and C-BPD, which use centroid structures instead of MFE structures to calculate SCI and BPD. Our experiments show that the proposed methods achieve higher accuracy than the original SCI and BPD for not only human-curated structural alignments but also low quality alignments produced by CLUSTAL W. Furthermore, the accuracy of C-SCI on CLUSTAL W alignments is comparable with that of the original SCI on RAF alignments for which twofold expensive computational time is required on average.

rna secondary structure predIctIon wIth g-centroId estIMator
CentroidFold implements a g-centroid estimator which predicts secondary structures with the maximum expected accuracy by a kind of posterior decoding methods on the base-pairing probability matrix (Hamada et al., 2009).
Let Σ = {A,C,G,U} and Σ * denote the set of all finite RNA sequences consisting of bases in Σ. For a sequence x = x 1 x 2 …x n ∈ Σ * , let |x| denote the number of symbols appearing in x, which is called the length of x. Let S(x) be a set of secondary structures of an RNA sequence x. An element y ∈ S (x) is represented as a |x| × |x| binaryvalued triangular matrix y = (y ij ) i < j , where y ij = 1 means that bases x i and x j form a base pair.
CentroidFold employs a gain function between a true secondary structure u ∈ S(x) and a predicted secondary structure y ∈ S(x) defined as where g is a weight for base pairs, and I(condition) is the indicator function, which takes 1 or 0 depending on whether condition is true or false. The gain function (1) is equal to the weighted sum of the number of true positives and the number of true negatives of base pairs. CentroidFold predicts a secondary structure y ∈ S(x) which maximizes the expectation of G g (u, y) with respect to an ensemble of all possible secondary structures S(x) which is distributed under a posterior distribution p(u|x), , | where C is a constant independent of y, and p ij = E u|x [u ij ] is the base-pairing probability that the i-th and j-th bases form a base pair. The optimal secondary structure ˆ[ ( , )] arg max S E θ γ θ can be calculated efficiently by using the following Nussinov-style DP algorithm:  (Hofacker et al., 2002;Bernhart et al., 2008). The free energy of a consensus structure is defined as the average of the energy contributions of the single sequences plus covariance scores for bonuses of compensatory and consistent co-mutation in the alignment.
The consensus MFE alone could be used to identify functional RNAs in terms of thermodynamic stability of consensus structures. However, it is difficult to make straightforward use of it, since the folding energy is heavily biased by the nucleotide composition and the length of the alignment. SCI solved this problem by normalizing E A [y MFE (A)] with the average of E x [y MFE (x)] for all x ∈ A. From a different view, SCI reflects the idea that for a well-conserved alignment the structure of each sequence resembles each other and the consensus structure resembles all of them, so SCI is near 0 for an alignment that is not structurally conserved, whereas SCI is near 1 or above for an alignment that is structurally conserved. Especially, if the alignment is structurally well-conserved and compensatory and consistent mutations often occur, SCI may be above 1.
As shown in the definition (5), SCI obviously depends on the accuracy of common secondary structure prediction, which is also deeply influenced by the quality of multiple alignments of RNAs. This fact is supported by a previous study  and our results shown in Section 3. For the genome-wide search, high quality alignments that consider RNA secondary structures cannot be obtained easily due to the computational cost for calculating structural alignments. Therefore, a robust method that does not require high quality alignments is desired.

Base pair distance
The BPD evaluates secondary structure conservation of a given multiple alignment of RNA sequences by comparing predicted secondary structures directly rather than MFE (Flamm et al., 2001;Gruber et al., 2008). BPD is based on the normalized Hamming distance between two RNA secondary structures defined as: for y ∈ S(x) and y′ ∈ S(x′) for two RNA sequences x and x′ with the same length |x| = |x′|. Two variations of BPD can be defined: the first is the mean over all the combination of pairwise distances in the alignment A, that is, The second is the mean distance from the consensus structure to each individual structure in the alignment A, that is, Note that x,x′ ∈ A in Eqs (6) and (7) may contain the gaps ("−") resulting from the alignment A so that the Hamming distance can be defined. free energy and the CONTRAfold model (Do et al., 2006) based on a machine learning technique. In our experiments, we used the McCaskill model with Boltzmann likelihood parameters (Andronescu et al., 2010).
The weight g in the definition (1) controls the number of predicted base pairs, that is, the trade-off between specificity and sensitivity of predicted base pairs. If g = 1, this estimator is equivalent to the centroid estimator (Ding et al., 2005;Carvalho and Lawrence, 2008).
CentroidFold can predict a common secondary structure of a multiple alignment of RNA sequences by using the g-centroid estimator under the mixture of the RNAalifold model and the McCaskill model (Hamada et al., 2011). Let A be an alignment of RNA sequences that contains #A sequences. CentroidFold employs a gain function defined as the sum of the gain function (1) for all x ∈ A: CentroidFold predicts a common secondary structure y ∈ S(A) which maximizes the expectation of G y g u * ( , ) under the mixed distribution of the McCaskill model and the RNAalifold model: where p(u|x) and p(u|A) are the McCaskill model and the RNAalifold model, respectively. Here, w ∈ [0,1] is a weight between two distributions (w = 0.5 in our experiments). The optimal common secondary structure can similarly be calculated by using the recursion (3) with the averaged base-pairing probability defined as ( | ). = ∑ ∈S θ θ CentroidFold and CentroidFold have been shown to be more accurate than other existing tools (Hamada et al., 2009(Hamada et al., , 2011. Especially, CentroidFold can predict much more accurate common secondary structures than RNAalifold for low quality multiple alignments produced by CLUSTAL W.

MFE-basEd MEasurEMEnts oF structurE consErvation
In this section, we introduce two existing measurements of structure conservation based on prediction of RNA secondary structures that minimizes free energy.

Structure conservation index
The SCI evaluates secondary structure conservation of a given multiple alignment of RNA sequences in terms of the MFE. SCI is defined as For a single sequence x, E x (y) denotes the free energy of a secondary structure y ∈ S(x), and y MFE (x) = arg min y∈S(x) E x (y) is defined to be the MFE structure of x calculated by RNAfold (Hofacker, 2003). Similarly, for an alignment A, E A (y) is the free energy of a consensus structure y ∈ S(A), and y MFE (A) = arg et al., 2006), which includes 18,990 reference alignments of 36 RNA families. Reference alignments in BRAliBase 2.1 are human-curated alignments which are made from Rfam database (Griffiths-Jones et al., 2005) aiming for evaluating structural alignments. We produced multiple alignments using CLUSTAL W (Thompson et al., 1994) version 1.83 with default settings to investigate the discrimination capability on low quality alignments. We also produced structural alignments using RAF (Do et al., 2008) version 1.00 with default settings. RAF is one of the most efficient structural aligners based on the Sankoff (1985) algorithm which simultaneously aligns and folds given RNA sequences. However, RAF is much slower than CLUSTAL W since secondary structures are taken into account. For each alignment, we generated 10 negative controls by utilizing SISSIz (Gesell and Washietl, 2008), which randomizes columns of a given alignment to destroy its common secondary structure, while maintaining gap patterns, dinucleotide compositions, and sequence length. These alignments were binned according to their normalized Shannon entropy by the size of 0.05. The normalized Shannon entropy is defined as the average of the Shannon entropy of the individual column over all columns in the alignment whose length is |A|: where j is in the alphabet Σ = {A,U,G,C, − } consisting of the four nucleotides and the gap character "−," and p j i is the relative frequency observing the character j in the column i. We clustered the alignments into the bins from 0.05 to 1.15 stepping with 0.05 of normalized Shannon entropy. Figure 1 shows the distribution of the reference alignments on the bins of the normalized Shannon entropy.
To evaluate the performance of the various strategies, we performed the receiver operating characteristic (ROC) curve analysis. An ROC curve is a plot of true positive rate versus false positive rate in varying the discrimination threshold of a classifier. The area under the ROC curve (AUC) is used for evaluation of the discrimination; the AUC value closer to 1 means better discrimination capability.
In our study, we compared C-SCI and C-BPD (pairwise, consensus) with the original SCI and BPD. To compute SCI, we used the program scif available at http://www.biophys.uni-duesseldorf.de/ bralibase/. The two variants of BPD were calculated by a modified scif which uses a utility function for the BPD in Vienna RNA package. We implemented C-SCI and C-BPD based on CentroidFold package version 0.0.9 for predicting (common) secondary structures, and Vienna RNA Package version 1.8.5 for calculating the free energy of predicted structures. Figure 2 shows the results of ROC curve analysis of C-SCI and C-BPD comparing with SCI and BPD on the BRAliBase reference alignments and the CLUSTAL W alignments for each bin of normalized Shannon entropy, indicating that C-SCI achieved the highest AUC, especially on low entropy region.

dIscrIMInatIon capabIlIty
Both variants of BPD have the same drawback as SCI since MFEbased structures are used, that is, unreliable prediction of (common) secondary structures will result in inaccurate identification of ncRNAs.

centroId-based MeasureMents oF structure conservatIon
Now, we improve the above-mentioned measurements of secondary structure conservation by employing CentroidFold and CentroidFold instead of RNAfold and RNAalifold, respectively, for (common) secondary structure prediction to incorporate the robustness against low quality multiple alignments.

C-SCI
We propose C-SCI based on SCI which employs centroid structures instead of MFE structures. First, we predict the consensus centroid structure for an alignment A, denoted by y c (A), and centroid structures for each sequence x ∈ A, denoted by y c (x). We map a predicted structure onto each sequence x and calculate its free energy E x [y c (x)] for all of the sequences. For an alignment, we map a predicted consensus structure onto each sequence x and get rid of gaps and corresponding parts of the structure. To calculate the energy, we use RNAeval (Hofacker, 2003) with the predicted structure on the sequence. The free energy of a consensus secondary structure is calculated from the averaged free energy for all sequences and the covariance score which is implemented according to RNAalifold (Hofacker et al., 2002).
Then, C-SCI is calculated as follows: C-SCI has two parameters which affect the discrimination capability. We denote g A as the parameter g for predicting consensus secondary structures on multiple alignments, and g S as g for predicting secondary structures on individual sequences. The two parameters for centroid-based measurements were determined empirically; g a = 1.0 and g s = 6.0, which on average gave us accurate results in various conditions.

C-BPD
Since BPD directly compares RNA secondary structures, a more accurate and robust method for predicting RNA secondary structures is desired. We can employ centroid structures instead of MFE structures to calculate BPD. We call this C-BPD. As well as BPD, we can consider two variations of C-BPD: 3 results

dataset and evaluatIon
To confirm the discrimination capability of C-SCI and C-BPD, we performed the experiments in accordance with the previous study      1 shows the discrimination accuracies for all the alignments of all the methods. This indicates that the centroid-based measurements achieve higher AUC scores on the alignments by all the aligners than their MFE-based counterparts.

coMputatIonal coMplexIty
To address the genome-wide search, the computational cost is a serious problem. As shown in Table 1, it is obvious that the use of reference alignments which are structurally corrected by human curation is desirable. However, it is impractical to always obtain such reference alignments since high quality alignments cannot be obtained due to limited human resources and knowledge. Two alternative approaches are to use structural aligners which can align RNA sequences conserving their secondary structures, and to use the standard aligners like CLUSTAL W.
SCI through RAF alignments even for short sequences less than 200 nt. Therefore, in case that structural alignments might be unavailable such as the genome-wide search, C-SCI is practical to use and is expected to have as high discriminant power as SCI on structural alignments.

dIscussIon
We proposed centroid-based measurements of secondary structure conservation, and examined their performance. The results clearly show that C-SCI and C-BPD are more accurate than the original SCI and BPD. The discrimination capability of C-SCI for CLUSTAL W alignments is more accurate than or comparative to SCI for RAF structural alignments. This means that our methods are more suitable for genome-wide alignments which are low quality from the point of view on secondary structures. As shown in Figure 2, the original SCI and BPD are inaccurate, especially for low entropy regions, that is, highly conserved alignments. For a highly conserved alignment, the common secondary structure predicted by RNAalifold will be very similar with the individual secondary structure predicted by RNAfold for each sequence in the alignment. This means that the original SCI and BPD cannot distinguish structurally conserved alignments from structurally non-conserved alignments for low entropy regions because of the definition of the original SCI and BPD. Therefore, important genes which are highly conserved would be undetectable as ncRNAs even if such genes could fortunately be aligned between related species by a sequence-based aligner. This is a serious drawback of SCI and BPD. Figure 2 also indicates that our methods are robust on low entropy regions compared with SCI and BPD. For the centroidbased measurements, we used g a = 1.0 and g s = 6.0, which are the weight for base pairs of predicting secondary structures for alignments and individual sequences, respectively. The previous study has shown that almost the best accuracy of predicting secondary structures for individual sequences can be achieved at g s = 6.0 (Hamada et al., 2009), whereas only highly reliable base pairs for alignments might be predicted at g a = 1.0. Therefore, our methods can detect only the alignments with reliable base pairs as structurally conserved. This results in reducing false detections of ncRNAs for low entropy regions as shown in Figure 4. acknowledgMents This work was supported in part by a grant from "Functional RNA Project" funded by the New Energy and Industrial Technology Development Organization (NEDO) of Japan, and was also supported in part by Grant-in-Aid for Scientific Research on Priority Area "Comparative Genomics" from the Ministry of Education, Culture, Sports, Science, and Technology of Japan. We thank Michiaki Hamada for fruitful discussions.