PIANO: A Web Server for Pseudouridine-Site (Ψ) Identification and Functional Annotation

Known as the “fifth RNA nucleotide”, pseudouridine (Ψ or psi) is the first-discovered and most abundant RNA modification occurring at the Uridine site, and it plays a prominent role in a number of biological processes. Thousands of Ψ sites have been identified within different biological contexts thanks to the advancement in high-throughput sequencing technology; nevertheless, the transcriptome-wide distribution, biomolecular functions, regulatory mechanisms, and disease relevance of pseudouridylation are largely elusive. We report here a web server—PIANO—for pseudouridine site (Ψ) identification and functional annotation. PIANO was built upon a high-accuracy predictor that takes advantage of both conventional sequence features and 42 additional genomic features. When tested on six independent datasets generated from four independent Ψ-profiling technologies (Ψ-seq, RBS-seq, Pseudo-seq, and CeU-seq) as benchmarks, PIANO achieved an average AUC of 0.955 and 0.838 under the full transcript and mature mRNA models, respectively, marking a substantial improvement in accuracy compared to the existing in silico Ψ-site prediction methods, i.e., PPUS (0.713 and 0.707), iRNA-PseU (0.713 and 0.712), and PseUI (0.634 and 0.652). Besides, PIANO web server systematically annotates the predicted Ψ sites with post-transcriptional regulatory mechanisms (miRNA-targets, RBP-binding regions, and splicing sites) in its prediction report to help the users explore potential machinery of Ψ. Moreover, a concise query interface was also built for 4,303 known Ψ sites, which is currently the largest collection of experimentally validated human Ψ sites. The PIANO website is freely accessible at: http://piano.rnamd.com.


INTRODUCTION
Pseudouridine (5-ribosyluracil, Y, and psi) is the first-discovered (Cohn and Volkin, 1951) and most abundant RNA modification occurring at the Uridine site catalyzed by 13 pseudouridine synthase (PUS) (Chen and Patton, 2000;Zhao et al., 2004;McCleverty et al., 2007;Shaheen et al., 2016;Jacob et al., 2017). Y is present in many classes of RNA within all organisms, such as messenger RNA (mRNA), transfer RNA (tRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), and ribosomal RNA (rRNA) (Ge and Yu, 2013). Y was termed as "the fifth nucleotide" with an estimated Y/U ratio of 7-9% (Jacob et al., 2017), and it is considered to be the most prevalent of the mRNA modifications (Meyer and Jaffrey, 2017). Y plays a prominent role in many biological processes. The presence of Y in tRNA and rRNA regulates the entry site binding process in ribosomal RNA (Jack et al., 2011) and RNA structure stabilization (Kierzek et al., 2014). A recent study also demonstrated that Y is related to transcript stability (Schwartz et al., 2014), environmental signal response (Carlile et al., 2014), and genetic code switching in mRNA (Karijolich and Yu, 2011;Fernández et al., 2013). Y deficiency may be associated with various diseases. It has been found that the dysregulation of Y modification of mitochondrial tRNA acts as an etiology of mitochondrial myopathy and sideroblastic anemia (MLASA) (Bykhovskaya et al., 2004). Furthermore, mutations in pseudouridine are also involved in diseases like lung cancer and duykeratosis congenita (Mei et al., 2012).
Several high-throughput sequencing approaches have been developed for profiling the transcriptome-wide distribution of Y, including Pseudo-seq (Carlile et al., 2014), Y-seq (Schwartz et al., 2014), PSI-seq (Lovejoy et al., 2014), and CeU-seq (Li X,et al., 2015). These approaches all share the same principle, in which RNA is treated with the N-cyclohexyl-N'-(2-morpholinoethyl)carbodiimide-metho-p-toluenesulfonate (CMC) to leave a bulky group on Y and stop reverse transcription. Since the bulky adduct on the Y may reduce the sensitivity in the detection of Y, Vahid et al. recently developed a new approach, RBS-seq, which is based on a modification of RNA bisulfite sequencing and claims better sensitivity (Khoddami et al., 2019). Currently, the experiment-validated Y sites in human, mouse, and a few other model organisms are available from RMBase database (Xuan et al., 2017), and the regulation pathways of Y were more explicitly explained in MODOMICS database (Boccaletto et al., 2017).
Wet-lab approaches are surely effective for the study of transcriptome pseudouridylation with respect to a specific biological context; however, they are also laborious and offer only limited coverage, i.e., the reported RNA Y sites by wet-lab experiments are still restricted to the transcripts more readily expressed under a specific cell/tissue condition. Alternatively, computational efforts may provide a more cost-effective avenue (Chen X, et al., 2017). To date, many computational efforts have been made to facilitate the study of RNA epigenetics (Boccaletto et al., 2017;Chen X, et al., 2017;Xue et al., 2020;Liu et al., 2020) in terms of both experimental data collection and site prediction works. For predictors related to the identification of Y RNA modification, PseUI (He et al., 2018), XG-PseU (Liu et al., 2019), and iRNA-PseU (Chen et al., 2016) allow for prediction of putative Y sites from an RNA sequence, and PPUS (Li Y.H, et al., 2015) can predict the Y sites regulated by a specific pseudouridine synthase. However, these three predictors are all based on sequence-derived features only without considering other genomic features (such as conservation, gene annotation, and miRNA binding) that may contribute to the prediction, and thus their performance is limited . Moreover, their prediction results are not functionally annotated with potential posttranscriptional regulation machineries that may explain the functional consequences of the predicted Y sites.
We present here a web server-PIANO-for pseudouridine site identification and functional annotation. Inspired by the WHISTLE framework , PIANO took advantage of both the conventional sequence features and 42 additional genomic features. Using six independent datasets generated from four different technologies, we showed that PIANO adds a marked improvement to the accuracy of existing Y-site prediction. Moreover, the PIANO web server accepts both genomic location and RNA sequence format as input file when making predictions, and the putative Y sites returned are also annotated with various post-transcriptional regulations, including miRNA-targets, RBPbinding regions, and splicing sites, to unveil potential functional mechanisms of Y. The PIANO website is freely accessible at: http:// piano.rnamd.com.

Training and Testing Data for Y-Site Prediction
To construct the Y-site prediction model, we used the known human Y sites detected from four different base-resolution Y profiling techniques, including Y-Seq, RBS-Seq, CeU-Seq, and Pseudo-Seq (see Table 1). The Y sites at base-resolution were directly downloaded from Gene Expression Omnibus (GEO).
In the beginning of the performance evaluation, dataset H1 (see Table 1) was used as the testing data, while dataset H2-H4 were used as for training. Specifically, the base-resolution Y sites in training datasets were used as the positive training data. The negative sites used in model training were randomly selected from unmodified U sites located on the same transcripts of positive sites (see Figure 1). To make the best use of the limited volume of positive data, we randomly selected 10 negative sites for each of the positive sites. To balance the positive-to-negative ratio, the negative sites were then randomly split into 10 subsets, and 10 separate predictors were generated with a 1:1 positive-to-negative ratio. The negative sites of testing data were generated following the same procedure. Consequently, 10 separate predictors were generated, and their prediction results were averaged. Following the experimental design of WHISTLE framework , we performed dataset level leave-one-out validation over the H1-H5 base-resolution datasets; four samples from H1-H5 were used as training, while the other was used for testing. Subsequently, the sites from the datasets H1-H5 (generated from Y-Seq, RBS-Seq, and CeU-Seq) were used to establish a predictor, whose performance was evaluated on the dataset H6, which was generated from an independent technology (Pseudo-Seq).

Sequence-Derived Features
The length of 41bp was widely used to extracted sequence information in many previous studies, which was determined as a suitable flanking window by relevant tests, i.e., iRNA-m7G (Liu et al., 2019), iRNA-2OM , and MethyRNA (Chen W, et al., 2017). Consequently, the sequence-derived information of 41 bp flanking window of Y and non-Y (U) sites as central was generated using the chemical properties of nucleotides, position-specific nucleotide propensity (PSNP), and cluster information.
In the first encoding method, the nucleotides are classified into three categories based on three distinct structural chemical properties. Ring structures of nucleotides are the first to be considered; here, adenosine and guanosine have two rings, while cytidine and uridine only have one ring. In addition, the guanosine and cytidine have stronger hydrogen bonding than adenosine and uridine. Furthermore, adenosine and cytidine can be classified as the amino group, while guanosine and uridine contain the keto group. Based on these chemical properties defined above, the i -th nucleotide from sequence S may be encoded by a vector S i = (x i , y i , z i ): (1) Thus, the A, C, G, and U may be encoded as a vector (1,1,1), (0,1,0), (1,0,0), and (0,0,1), respectively.
The position-specific nucleotide propensity (PSNP) stands for the differences of the frequency of nucleotides calculated in specific locations between RNA sequences of positive and negative data. The frequency of occurrence of A, U, G, and C in the i -position were calculated for both positive and negative data, respectively, to obtain two matrices with 4×41 dimension as Z plus and Z minus , where Z plus was extracted from sequence of all positive data, and Z minus was extracted from sequence of all negative data. The position-specific nucleotide propensity (PSNP) matrices was defined as Z PSNP : For the cluster information, the average relative position of the closest k (k=1,2 and 3) nucleotide to center Y/non-Y was calculated for each nucleic acid (A, G, C, and U). The k was considered as 1 to 3. Using sequence 'AGCUAGCCAUC CUACGGUACAGCAU' as an example, the center U is at the ninth positive. For encoding the cluster information of adenine, the average relative position of the closest 1 (k=1) adenine to center U is 1 (1/1); when k equals to 2, the relative position of the second closest adenine to center U is 4, and, therefore, the average relative position of the closest 2 (k=2) adenine to center U is 2.5 (5/2) and 3.7 (11/3) when k equals to 3. Similarly, the cluster information of guanosine in this example sequence is 3 (3/1), 3.5(7/2), and 4.7(14/3) when k equals to 1, 2, and 3, respectively.
The sequence-derived encoding methods employed by the three previously published predictors were used to reproduce the PPUS, iRNA-PseU, and PseUI with the same training data of PIANO, respectively, and their performances were compared with PIANO using independent datasets.

Genome-Derived Features
In the original WHISTE approach, 35 additional genomic features that might contribute to the prediction of m 6 A RNA methylation sites were considered . In PIANO, seven new genomic features were added to the prediction model, the details of the 42 genomic features considered in the prediction were summarized in Supplementary Table S1. Specifically, genomic Features 1-16 are dummy variable features indicating whether the uridine sites shall fall within the transcript regions that satisfy certain topological properties. All the features in this category are generated by the GenomicFeatures R/Bioconductor package using the transcript annotations hg19 TxDb package (Lawrence et al., 2013). To remove the ambiguity caused by transcript isoforms, only the primary (longest) transcripts of each gene were kept for the extraction of the transcript subregions. The longest transcript isoform was used to unambiguously assign m6A peak regions to mRNAs (Ke et al., 2017) and contributed to a better performance in accuracy compared with using the average value of multiple transcripts. Genomic Features 17-20 are real valued features defining the relative position of the transcript regions (3'UTR, 5'UTR, CDS, and whole transcript), i.e., the distance from the adenine to the 5' end divided by the width of the region. The values are also set to zero for sites that do not belong to the region. Genomic features 21-25 represent the length of the transcript region containing the modification site. The values are also set to zero for sites that do not belong to the region.
Features 26-27 captured the distance from the adenine sites to the 5'end or 3'end of the splicing junctions. Additionally, the distance to the nearest neighboring y sites in the training data is generated to measure the clustering effect of the y RNA modification sites. Evolutionary conservation score of the uridine sites and its flanking regions are measured by Phast-Cons (Siepel et al., 2005) score, and the fitness consequence (Gulko et al., 2015) scores were presented in features 28-31.
To consider the RNA secondary structures around the uridine site, the RNA secondary structures are predicted using RNAfold from the Vienna RNA package (Lorenz et al., 2011) and presented in features 32-33. Genomic properties of transcripts containing the Y sites were presented in features 34-38. Finally, features 39-42 represent omics information, such as microRNA target sites (Chou et al., 2017) and HNRNPC binding sites (2012).

Machine Learning Approach Used for Y-Site Prediction
As a high-efficiency machine learning algorithm in computational biology, the SVM (Support Vector Machine) has been widely applied in microRNA target prediction (Liu et al., 2010), protein phosphorylation prediction (Wong et al., 2007), and m 6 A RNA methylation site prediction (Chen W, et al., 2017). In this project, the R language interface of LIBSVM (Chang and Lin, 2011) was used to build our model with the radial basis function as kernel, and the other parameters were set at the default.

Performance Evaluation of Y-Site Prediction
To evaluate the performance of PIANO, a 5-fold cross-validation was employed on training datasets using the SVM classifier, and the independent testing dataset was used to measure the final performance of PIANO. There is no overlap between the training sites and testing sites, as only the Y sites not previously used as training data were considered during performance evaluation; the performance evaluation result should thus directly reflect the capability of the algorithm to identify previously unknown Y sites. To evaluate the performance, the ROC (receiver operating characteristic) curve (sensitivity against 1-specificity) was used, and the area under ROC curve (AUROC) was calculated as the main performance evaluation metric.

Estimate the Probability of Y
The likelihood ratio (LR) of a Y site is calculated to estimate the probability of Y RNA methylation: In the PIANO web server, a site was predicted to be a putative Y site if its predictive value was above 0.5 with a minimum LR value of 1. A site with a larger LR value suggests that it is more likely to be a Y site. The machine learning classifiers usually obtain the lowest empirical rate with the value of 0.5 as cutoff. The statistical significance of LR is assessed by an upper bound of the p-value, indicating how extreme the observed LR is among all the transcriptome U sites. It is calculated from the relative ranking of the putative Y sites among all the transcriptome U sites, i.e., if only 0.1% of U sites have a LR score larger than a specific U site, then the upper bound of the p-value of this site is 0.001. In the report of PIANO web server, a putative Y site is considered to be of high confidence if its LR within the top 0.5% of all transcriptome Us (corresponding to an upper bound of the p-value < 0.005) of all the transcriptome U sites, followed by medium confidence (0.005 < upper bound of the p-value ≤ 0.05) and low confidence (p-value > 0.05).

Functional Annotation of Putative Y Site
The gene symbol, Ensembl gene ID, gene region, and gene type for each putative Y site were annotated using ANNOVAR package (Wang et al., 2010). Furthermore, we annotated the putative Y sites with three kinds of post-transcriptional regulation, including RNA-binding proteins (RBPs) regions, miRNA-RNA targets, and splicing sites. We first found the intersection between the computational predicted Y sites and POSTAR2-derived RBP binding regions . For miRNA targets, we obtained the information from miRanda (Agarwal et al., 2015) and starBase2 (Li et al., 2013), and we found the Y sites within the miRNA targets regions to explore the potential influence of Y on miRNA-target interactions. Finally, we obtained the Canonical splice sites (GT-AG) from UCSC (Lawrence et al., 2013) annotations, 100 bp upstream region from 5' splicing sites and 100 bp downstream region from 3' splicing sites were extracted for the subsequent analysis of Y sites on splicing sites. The detailed information of the posttranscriptional regulation association analysis can be found in Supplementary Table S2.

RESULTS
Although the genome-derived features alone are already very effective for predicting Y sites, the best performance was achieved when the sequence features and genomic features were combined. Consequently, our PIANO predictor was established based on both the genome-derived features and sequence-derived features. When designing the encoding methods for sequence features used for the PIANO approach, the chemical properties of nucleotides, position-specific nucleotide propensity (PSNP), and cluster information were considered. We found that this combination (sequence and genomic features) achieved the best performance in accuracy compared with combining genomederived features with other basic sequence encoding methods (i.e., one-hot encoding method). The performance of the predictor was evaluated under two modes. For the full transcript mode, the positive and negative Y sites located in both exonic and intronic regions are all considered to construct the predictor. In the mature mRNA mode, only positive and negative Y sites located on mature mRNA transcripts are considered; this is because existing experimental datasets overwhelmingly relied on polyA selection in RNA-seq library preparation, and intronic Y sites are likely to be underrepresented in the data, which may lead to an over-estimation of accuracy under the full transcript mode.
To avoid potential over-fitting and to identify the most significant subset of genomic features, feature selection was implemented; the datasets H2-H5 were used as training data, while dataset H1 was used for the independent testing data. The relative importance of each genome-derived feature were measured by the Perturb method (Gevrey et al., 2003). According to the rank of importance, the top N most important features were reserved in the prediction and were evaluated with a 5-fold cross-validation. We showed that the newly developed method PIANO substantially outperformed competing approaches on crossvalidation (Supplementary Table S3) when tested on independent datasets (Supplementary Table S3) or benchmarked by an independent technique (Supplementary Table S4). To sum up, by testing independent datasets generated from four different Y profiling technologies (Y-seq, RBS-seq, Pseudo-seq, and CeU-seq), the newly developed method PIANO achieved an average AUC of 0.955 and 0.838 under full transcript and mature mRNA modes, respectively (see Table 2), representing a marked improvement compared to PPUS (0.713 and 0.707), iRNA-PseU (0.713 and 0.712), and PseUI (0.634 and 0.652).
The performance of the purposed predictor was further evaluated by separating the training and testing datasets between the cell type in which datasets H3-H5 generated from HEK293T were used for training, while datasets H2 and H6 from Hela were used for independent testing. Consistent with previous validation results, our method PIANO achieved a marked improvement in prediction accuracy compared with existing predictors, using the AUROC (area under ROC curve) and AUPRC (area under precision-recall curve) as an evaluation metric, when tested on independent dataset with a 1:1 positive to negative ratio (Supplementary Table S5) and 1:10 positive to negative ratio (Supplementary Table S6), respectively, suggesting the reliability of our newly proposed approach. Besides, the comparison between different algorithms indicated that SVM (Support Vector Machine) was a quite effective machine learning approach and achieved the best performance in our study (Supplementary Table S5). In addition, to further evaluate different approaches, we also considered the prediction of PUS-specific Y sites. In this experiment, TruB1, PSU7, and TruB2 were considered, and the goal was to predict their specific substrates (Safra et al., 2017). Consistent with previous results in Y-site prediction, the PIANO method again substantially outperformed competing approaches under both the full transcript and mature mRNA model ( Table 3), suggesting the effectiveness of the approach.

Construction of the PIANO Website
A website PIANO, which stands for pseudouridine site identification and functional annotation, was built for the convenience of academic users. Hyper Text Markup Language Only the Y sites not previously used as training data were considered during performance evaluation, so the training sites and testing sites did not overlap. Because existing datasets overwhelmingly relied on polyA selection in RNA library preparation and intronic Y sites are likely to be underrepresented in the data, the performances were evaluated under two modes: full transcript and mature mRNA modes.
In the mature mRNA mode, only positive and negative Y sites located on mature mRNA transcripts are considered, as previously described . Our new approach PIANO substantially outperformed competing approaches in accuracy. (HTML), Cascading Style Sheets (CSS), and Hypertext Preprocessor (PHP) were used to construct the PIANO web interface. This included a database containing 4,303 experimentally validated Y sites reported from four different high-throughput Y profiling techniques, which is so far the most complete collection of Y in humans. Among those experimentally validated Y sites, we found Y was distributed most often along coding DNA sequence and 3'UTR, but it was relatively rare in 5'UTR (Supplementary Figure S2). Secondly, a web server for putative Y-site identification from the userdefined genomic ranges or provided FASTA sequences (detailed in Figure 2) was used. The help document of the PIANO web server is provided in the Supplementary Materials. Both experimentally validated Y sites and the predicted putative Y sites are functionally annotated with various posttranscriptional regulations to unveil potential functional mechanism concerning Y. The data and prediction results may be conveniently downloaded and visualized with web browser. The PIANO website is freely accessible from: http://piano. rnamd.com.
Here, by integrating 42 genomic features together with conventional sequence-derived features, we have developed the (so far) most accurate Y-site predictor. Our new method (PIANO) substantially outperformed competing approaches when using four different Y profiling protocols as the benchmarks (with 0.24 and 0.12 improvement in terms of AUC under full transcript and mature mRNA modes, respectively) and supports functional annotation for the putative Y sites. A web site-PIANO-was also developed, including (1) a database hosting currently the largest collection of 4,303 experimentally validated human Y sites; and (2) a web FIGURE 2 | Interface and output of the PIANO web server for Y-site prediction and functional annotation. (A) When predicting human Y sites, the PIANO web server supports two types of input: the genomic ranges of human genome assembly and the FASTA sequences. As the prediction process may take quite some time, it is highly recommended that the user should provide an email address, where an email notification will be sent when the job is finished.  server enabling the prediction of novel Y sites from given genomic ranges or FASTA sequences. Users may query and download their predicted results with clear and simple instructions (see Supplementary Materials). The scripts used to generate genomic and sequence features considered in PIANO's framework, the training and testing data, and datasets related to the construction of the PIANO database were provided in the download page of PIANO website. In conclusion, our work will serve as a useful resource for researchers who are interested in Y and its role concerning various post-transcriptional regulations. Nevertheless, it is worth noting that there exist significant discrepancies in the Y sites reported by different technologies (Zaringhalam and Papavasiliou, 2016;Adachi et al., 2018). Although the discrepancy may be due to the context-specificity of pseudouridylation and technology preferences, our PIANO predictor achieved reasonable consensus with all the four highthroughput profiling Y techniques; Y is, however, considered as the most prevalent mRNA modifications (Meyer and Jaffrey, 2017) with an estimated Y/U ratio of 7-9% (Jacob et al., 2017). Currently, only a small number of Y sites have been reported; we are therefore not able to calculate a reasonable number for the real-life estimate of class imbalance. This may due to the limited detection power of existing experimental approaches. With an estimated real-life Y/U ratio as 8%, we can expect at least 10 times the number of negative sites. Under this assumption, we tested the stability of our method by assigning 1:10 and 1:1 positive-to-negative ratio for the training and testing data. The result showed that the performance generated by the 1:10 class were more stable than the 1:1 class (Supplementary Figure S3). We further calculated the value of FDR, FPR, and TPR in this setting, using different LRs as cutoff (Supplementary Table S7).
To sum up, we cannot rule out the possibility of experimental bias, and the training data (gold standard data) may be further optimized in the future as more experimental evidence is accumulated. To make the PIANO method more practically useful, the predictor should be used by combining with other experimental evidence and knowledge, e.g., the Us within a binding site of PUS. The performance of PIANO method is much better than all existing approaches, and it can provide the most reliable putative Y sites for users.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: GSE60047, GSE58200, GSE63655, GSE90963.

AUTHOR CONTRIBUTIONS
KC, JM, GL, and JS initialized the project. KC and BS designed the research plan. ZW constructed the genomic features considered in human Y site prediction. BS performed the development of the Y site web server. YT and BS built the website. BS and KC drafted the manuscript. All authors read, critically revised, and approved the final manuscript.