ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 14 July 2020

Sec. Computational Genomics

Volume 8 - 2020 | https://doi.org/10.3389/fbioe.2020.00808

Riboflow: Using Deep Learning to Classify Riboswitches With ∼99% Accuracy

  • 1. MS Program in Computer Science, Department of Computer Science, College of Engineering and Applied Sciences, Stony Brook University, Stony Brook, NY, United States

  • 2. MS in Bioinformatics, Georgia Institute of Technology, Atlanta, GA, United States

  • 3. Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, India

Article metrics

View details

12

Citations

6,9k

Views

1,7k

Downloads

Abstract

Riboswitches are cis-regulatory genetic elements that use an aptamer to control gene expression. Specificity to cognate ligand and diversity of such ligands have expanded the functional repertoire of riboswitches to mediate mounting apt responses to sudden metabolic demands and signal changes in environmental conditions. Given their critical role in microbial life, riboswitch characterisation remains a challenging computational problem. Here we have addressed the issue with advanced deep learning frameworks, namely convolutional neural networks (CNN), and bidirectional recurrent neural networks (RNN) with Long Short-Term Memory (LSTM). Using a comprehensive dataset of 32 ligand classes and a stratified train-validate-test approach, we demonstrated the accurate performance of both the deep learning models (CNN and RNN) relative to conventional hyperparameter-optimized machine learning classifiers on all key performance metrics, including the ROC curve analysis. In particular, the bidirectional LSTM RNN emerged as the best-performing learning method for identifying the ligand-specificity of riboswitches with an accuracy >0.99 and macro-averaged F-score of 0.96. An additional attraction is that the deep learning models do not require prior feature engineering. A dynamic update functionality is built into the models to factor for the constant discovery of new riboswitches, and extend the predictive modeling to new classes. Our work would enable the design of genetic circuits with custom-tuned riboswitch aptamers that would effect precise translational control in synthetic biology. The associated software is available as an open-source Python package and standalone resource for use in genome annotation, synthetic biology, and biotechnology workflows.

Availability:

PyPi package: riboflow @ https://pypi.org/project/riboflow

Repository with Standalone suite of tools: https://github.com/RiboswitchClassifier

Language: Python 3.6 with numpy, keras, and tensorflow libraries.

License: MIT.

Introduction

Riboswitches are ubiquitous and critical metabolite-sensing gene expression regulators in bacteria that are capable of folding into at least two alternative conformations of 5′UTR mRNA secondary structure, which functionally switch gene expression between on and off states (Mandal et al., 2003; Roth and Breaker, 2009; Serganov and Nudler, 2013). The selection of conformation is dictated by the binding of ligand cognate to the aptamer domain of a given riboswitch (Gelfand et al., 1999; Winkler et al., 2002, 2004). Cognate ligands are key metabolites that mediate responses to metabolic or external stimuli. Consequent to conformational changes, riboswitches weaken transcriptional termination or occlude the ribosome binding site thereby inhibiting translation initiation of associated genes (Yanofsky, 1981; Mandal and Breaker, 2004). Riboswitches provide an intriguing window into the ‘RNA world’ biology (Stormo and Ji, 2001; Brantl, 2004; Breaker et al., 2006; Strobel and Cochrane, 2007) and there is evidence of their wider distribution in complex genomes (Sudarsan et al., 2003; Barrick and Breaker, 2007; Bocobza and Aharoni, 2014; McCown et al., 2017). The modular properties of riboswitches have engendered the possibility of synthetic control of gene expression (Tucker and Breaker, 2005), and combined with the ability to engineer binding to an ad hoc ligand, riboswitches have turned out to be a valuable addition to the synthetic biologist’s toolkit (Wieland and Hartig, 2008; Wittmann and Suess, 2012). In addition to orthogonal gene control they are useful in a variety of applications, notably metabolic engineering (Zhou and Zeng, 2015), biosensor design (Yang et al., 2013; Meyer et al., 2015) and genetic electronics (Villa et al., 2018). Riboswitches have been used as basic computing units of a complex biocomputation network, where the concentration of the ligand of interest is titrated into measurable gene expression (Beisel and Smolke, 2009; Domin et al., 2017). Riboswitches have also been directly used as posttranscriptional and translational checkpoints in genetic circuits (Chang et al., 2012). Their key functional roles in infectious agents but absence in host genomes make them attractive targets for the design of cognate inhibitors (Blount and Breaker, 2006; Deigan and Ferré-D’Amaré, 2011; Wang et al., 2017). Characterisation of riboswitches would expand the repertoire of translational control options in synthetic biology and bioengineering. In turn, this would facilitate the reliable construction of precise genetic circuits. In view of their myriad applications, robust computational methods for the accurate characterisation of novel riboswitch sequences would be of great value.

Since the discovery of riboswitches (Mironov et al., 2002; Nahvi et al., 2002), many computational efforts have been advanced for their characterisation, notably Infernal (Nawrocki and Eddy, 2013), Riboswitch finder (Bengert and Dandekar, 2004), RibEx (Abreu-Goodger and Merino, 2005), RiboSW (Chang et al., 2009) and DRD (Havill et al., 2014), and reviewed in Clote (2015) and Antunes et al. (2017). These methods used probabilistic models of known classes with or without secondary structure information to infer or predict the riboswitch class. Singh and Singh explored featuring mono-nucleotide and di-nucleotide frequencies in a supervised machine learning framework to classify different riboswitch sequences, and concluded that the multi-layer perceptron was optimal (Singh and Singh, 2016). Their work achieved modest performance (F-score of 0.35 on 16 different riboswitch classes). None of the above methods were shown to generalize effectively to unseen riboswitches. Our remedy was to explore the use of deep learning models for riboswitch classification. Deep networks are relatively recent neural network-based frameworks that use a type of learning known as representation learning (Bengio et al., 2013). Convolutional neural networks are one type of deep learning, known for hierarchical information extraction. Such architectures with alternating convolutional and pooling layers have been earlier used to extract structural and functional information from genome sequences (Alipanahi et al., 2015; Sønderby et al., 2015; Zhou and Troyanskaya, 2015; Kelley et al., 2016). Recurrent neural networks are counterparts to CNNs and specialize in extracting features from time-series data (Che et al., 2018). RNNs with Long Short-Term Memory (termed LSTM) incorporate recurrent connections to model long-run dependencies in sequential information (Hochreiter and Schmidhuber, 1997), such as in speech and video (Graves and Schmidhuber, 2005). This feature of LSTM RNNs immediately suggests their use in character-level modeling of biological sequence data (Lipton, 2015; Lo Bosco and Di Gangi, 2017). Bidirectional LSTM RNN have been shown to be especially effective, given that they combine the outputs of two LSTM RNNs, one processing the sequence from left to right, the other one from right to left, together enabling the capture of dynamic temporal or spatial behavior (Sundermeyer et al., 2014). Bidirectional LSTM RNNs are a particularly powerful abstraction for modeling nucleic acid sequences whose spatial secondary structure determines function (Lee et al., 2015). Two recent successes of deep learning methods in RNA biology have been: (i) prediction of RNA secondary structure (Singh et al., 2019), and (ii) dynamic range improvement in riboswitch devices (Groher et al., 2019). Here we have evaluated the relative merits of a spectrum of state-of-the-art learning methods for resolving the ligand-specificity of riboswitches from sequence. It is demonstrated that the deep learning models vastly outperformed other machine learning models with respect to the classification of riboswitches belonging to 32 different families.

Materials and Methods

Dataset and Pre-processing

We searched the Rfam database of RNA families (Kalvari et al., 2018) with the term “Riboswitch AND family” and the corresponding hit sequences were obtained in FASTA format from the Rfam ftp server (Rfam v13 accessed on July 6, 2019). Each riboswitch is represented by the coding strand sequence, with uracil replaced by thymine, thereby conforming to the nucleotide alphabet ‘ACGT.’ Each sequence was scanned for non-standard letters (i.e., other than the alphabet) and such occurrences were corrected using the character mapping defined in Table 1. The feature vectors for machine learning were extracted from the sequences. For each sequence, 20 features were computed, comprising four mononucleotide frequencies (A,C,G,T) and 16 dinucleotide frequencies. To address possible skew in distribution, all the frequency features were normalized to zero mean and unit variance. Deep models, namely convolutional neural networks (CNNs) and bidirectional recurrent neural networks with LSTM (hereafter simply referred as RNNs) are capable of using the sequences directly as the feature space, obviating any need for feature engineering. We used the first 250 bases of the riboswitch sequence as the input, with the proviso that shorter sequences (which is usually the case; Table 2) were padded for the extra spaces. Python scripts used to create the final dataset are available in the repository for this project1. The dataset consists of the riboswitch sequence, four 1-mer frequencies, 16 2-mer frequencies, and class, for each instance, which could be appropriately subsetted for training the base and deep models.

TABLE 1

S. No.Original letterMapped character#Occurrences in dataset
1RG6
2YT8
3KG1
4SG3
5WA2
6HA2

Non-standard nucleotide mapping.

Rare occurrences of non-standard nucleotides in the sequences were converted using this mapping key.

TABLE 2

Class no.Rfam IDClass NameClass sizeAvg. length
1RF00504Glycine riboswitch4592100
2RF01786Cyclic di-GMP-II riboswitch66186
3RF01750ZMP/ZTP riboswitch167492
4RF00059Thiamine pyrophosphate riboswitch12559110
5RF01057SAH Riboswitch83292
6RF01725SAM -1 -4 Variant riboswitch793104
7RF00162SAM - 1 Riboswitch6027113
8RF00174Cobalamin riboswitch14212189
9RF01055Molybdenum Co-factor riboswitch1221134
10RF01727SAM-SAH Riboswitch24050
11RF01482AdoCbl riboswitch182137
12RF03057nhaA-I RNA55956
13RF01734Fluoride Riboswitch201870
14RF00167Purine Riboswitch2632101
15RF00234Glucosamine-6-phosphate riboswitch936175
16RF01739Glutamine riboswitch110364
17RF03072raiA RNA736219
18RF03058sul1 RNA34456
19RF00380Ykok riboswitch (Magnesium sensing)1059170
20RF00168Lysine Riboswitch2240180
21RF03071DUF164626553
22RF01689AdoCbl variant RNA212125
23RF00379ydaO/yuaA leader3918164
24RF00634SAM - 4 Riboswitch1245116
25RF01767SAM - 3 Riboswitch19590
26RF00080yybP-ykoY manganese riboswitch833158
27RF02683NiCo riboswitch (Nickel or Cobalt sensing)23597
28RF00442Guanidine-I riboswitch902109
29RF00522Pre-queosine riboswitch -153345
30RF00050Flavin Mononucleotide Riboswitch4080142
31RF01831THF riboswitch663102
32RF00521SAM - 2 Riboswitch81978

A summary of the riboswitch dataset used in our study.

The dataset includes a mixture of metabolite/ion/cofactor/amino-acid/nucleotide/vitamin/signaling-molecule aptamer ligands. ‘Class no.’ corresponds to the response classes to be learnt in machine learning. The average length of all sequences in a given class is also given.

Predictive Modeling

The machine learning problem is simply stated as: given the riboswitch sequence, predict the ligand class of the riboswitch. Toward this, a battery of eight supervised machine learning and deep classifiers were studied and evaluated in the present work (Table 3). Classifiers derived from implementations in the Python scikit-learn machine learning library (Pedregosa et al., 2011)2 are referred to as base models and include the Decision Tree, K-nearest Neighbors, Gaussian Naive Bayes, the ensemble classifiers AdaBoost and Random Forest and the Multi-layer Perceptron. The deep classifiers namely CNN and RNN derived from implementations in the Python keras library3 on tensorflow (Abadi et al., 2015). Three scripts in the repository, namely baseModels.py, rnnApp.py, and cnnApp.py, implement the base models, RNN, and CNN, respectively. For both the base and deep classifiers, the dataset was split into 0.9:0.1 training:test sets. Multi-class modeling is fraught with overfitting to particular classes (especially pronounced in cases of extreme class skew). To address this issue, two strategies were adopted: (i) the splitting process was stratified on the class, which ensured that each class was proportionately and sufficiently distributed in both the training and test sets, and (ii) hyperparameter optimisation, discussed below.

TABLE 3

ClassifierFeatures usedHyperparameters of interestML Library
Decision Tree1- and 2-mer frequenciesMaximum features, Minimum sample split, Minimum sample leaf, Random state, Max depthSklearn
Gaussian Naïve Bayes1- and 2-mer frequenciesPriorsSklearn
k-Nearest Neighbors1- and 2-mer frequenciesNumber of neighbors Leaf size, Weights, AlgorithmSklearn
Adaptive Boosting1- and 2-mer frequenciesNumber of estimators, Learning rate, AlgorithmSklearn
Random Forest1- and 2-mer frequenciesNumber of estimators, Max depth, Impurity criterionSklearn
Multi-layer Perceptron1- and 2-mer frequenciesActivation, Solver, Alpha (regularization term), Learning rate, epochs, hidden layers, nodes per hidden layerSklearn
CNNRiboswitch sequenceNumber of filters, Kernel size, Activation function, Pooling method, number of Conv1D layers, Dropout ratio, Optimiser, #EpochsTensorFlow (Keras)
Bidirectional RNN with LSTMRiboswitch sequenceActivation function, number of LSTM nodes, number of Bidirectional layers, Dropout ratio, Optimiser, #EpochsTensorFlow (Keras)

Description of the base model and deep classifiers.

Hyperparameters noted for each classifier are meant to be representative. For the deep models, any long riboswitch genome sequences were truncated to 250 nucleotides, which is an adjustable parameter (max_len) set to much larger than the average length of any riboswitch class. The Python3 library used for implementation of machine learning model is noted.

Hyperparameter Optimisation

Hyperparameter fine tuning for each classifier was effected by discrete combinatorial grid search on the hyperparameters associated with that classifier. The grid search was evaluated with 10-fold cross-validation of the training set. This yielded the optimal hyperparameters for each classifier. The scripts for hyperparameter optimisation of the base models are available in the repository. In the case of the deep models, we used a train-validate-test approach to model optimisation with keras/TensorFlow, by setting the ‘validation’ flag to 0.1.

Evaluation Metrics

The performance of each classifier was evaluated on the test set using the receiver operating characteristic (ROC) analysis in addition to standard metrics such as the precision, recall, accuracy and F-score (harmonic mean of precision and recall) (van Rijsbergen, 1975). The ROC curve was obtained by plotting the TP rate vs. the FP rate, i.e., sensitivity vs. (1 – specificity), and the area under the ROC curve (AUROC) could be estimated to rate the model’s performance. AUROC represents the probability that a given classifier would rank a randomly chosen positive instance higher than a randomly chosen negative one. ROC analysis is robust to class imbalance, typical of the machine learning problem at hand, however, a multi-class adaptation of the binary ROC is necessary. For each classifier, this is achieved by computing classwise binary AUROC values in a one-vs.-all manner, followed by aggregating the classwise AUROC values into a single multi-class AUROC measure (Yang, 1999; Manning et al., 2008). Aggregation of the classwise AUROC values could be done in at least two ways:

  • 1.

    Micro-average AUROC, where each instance is given equal weight.

  • 2.

    Macro-average AUROC, where each class is given equal weight.

In micro-averaging, all the instances from different classes are treated equally, to arrive at the final metrics. In particular, the microaverage AUROC is given by the area under the overall TP rate vs. overall FP rate curve.

On the other hand, the macro-average of a given metric for a multi-class prediction problem is estimated by the average of the metric for the individual classes. For example, the macro-average AUROC is given by:

where AUROCi is the binary AUROC for the ith class.

It is clear that the micro-average AUROC would be dominated by the larger classes, while the macro-average AUROC is a more balanced measure. Both the micro-average and macro-average AUROC measures were used to evaluate the performance of all our classifiers. A python script, multiclass.py available in the repository, generates all the performance metrics and plots.

Dynamic Extension of the Models

Genome sequencing of diverse, exotic prokaryotes is likely to yield new regulatory surprises mediated by riboswitches (Breaker, 2011). A model that could classify a fixed set of 32 classes remains static in the wake of exponentially growing number of known genomes. To address the challenge of extending the model to new classes, we have devised a strategy for dynamically updating the model. This process is initiated by feeding the sequences corresponding to the new class(es) to an updater script, which revises the dataset and then trains a new model. The automation of modeling would ensure a user-friendly pipeline for learning any number of riboswitch classes along with the production of performance metrics of such models. The script dynamic.py, available in the repository, implements the dynamic functionality of our modeling effort.

Results

Our Rfam query retrieved 39 riboswitch class hits, however, seven of these classes had a membership of less than 100 sequences and were filtered out. Subsequently, our dataset consisted of 32 riboswitch classes with a total of 68,520 sequences. A summary of this dataset is presented in Table 2. The largest classes include the cobalamin and thiamine pyrophosphate classes, with >10,000 riboswitches in each, accounting for considerable diversity within classes. Classes with >4,000 members include Flavin mononucleotide (FMN) and glycine riboswitches. Other notable classes with at least 1,000 members include the lysine, purine, fluoride and glutamine switches. The riboswitch sequences were inspected for the standard alphabet (Table 1) and the final pre-processed comma-separated values (csv) datafile with each instance containing the sequence, 20 features and riboswitch class was prepared (available at4).

Table 3 recapitulates the key properties of the classifier models used in this study. We noticed poor performance of the base models on the test set with default model parameters, which could be traced to persistent overfitting (dominated by the larger classes), despite stratified sampling. Hyperparameter optimisation of the default parameters is one solution to address this problem and was carried out on the base models. The exercise is summarized in Appendix I (Supplementary File S1), which includes the final configuration of the hyperparameter space for all the base and deep classifiers. The optimized CNN and RNN architectures are illustrated in Figure 1. In the CNN, two convolutional layers were used followed by a pooling layer and dropout layer before flattening to a fully connected output layer. The RNN employed two sophisticated bidirectional LSTM units sandwiched by dropout layers before flattening to a fully connected output layer. The number of training epochs necessary for each deep model was determined based on the convergence of the error function (shown in Figure 2).

FIGURE 1

FIGURE 1

Deep learning frameworks used in the study. (A), CNN architecture, optimized for two 1-dimensional convolutional layers; and (B), Bidirectional RNN with LSTM, optimized for two bidirectional layers. Two dropout layers are used in the RNN.

FIGURE 2

FIGURE 2

Epoch tuning curves for the CNN (A) and RNN (B). The CNN converges faster with respect to the number of epochs, however, the RNN learns better, as seen with the continuously decreasing loss function.

With the optimized classifiers, the performance of the predictive modeling was evaluated on the unseen testing set. Figure 3 shows the resultant classifier performance by an array of metrics including accuracy, and F-score. It is abundantly clear that the deep models vastly outperformed the base classifiers in all metrics across all classes. Figures 4, 5 show the ROC curves along with the micro-averaged and macro-averaged AUROC for the base models and the deep models, respectively. The AUROC is indicative of the quality of the overall model. It is seen that the AUROC is 1.00 for all classes for the RNN. Table 4 summarizes the performance of the classifiers, with the detailed classwise F-score of each classifier available in the Supplementary Table S2 and the classwise break-up of the AUROC of all classifiers in the Supplementary Table S3.

FIGURE 3

FIGURE 3

Standard performance metrics. Clockwise from top left, Accuracy; Precision; F-score; and Recall. The overall precision, recall and F-score were computed by macro-averaging the classwise scores. The deep models emerged as vastly superior alternatives to the base machine learning models on all performance metrics.

FIGURE 4

FIGURE 4

AUROC for the base models. (A) Decision Tree, (B) Gaussian NB, (C) kNN, D: AdaBoost, (E) Random Forest, and (F) Multi-layer Perceptron. Gray lines denote classwise AUROCs of all 32 classes, from which it is clear that not all classes are equally learnt.

FIGURE 5

FIGURE 5

AUROC for the deep models. (A) CNN, and (B) RNN. Gray lines denote classwise AUROCs of all 32 classes. It clear that RNN achieves learning perfection at both the macro and classwise levels.

TABLE 4

ModelAccuracyPrecisionRecallF-scoreMicro AUROCMacro AUROC
Decision Tree0.540.490.390.420.880.81
Gaussian NaïveBayes0.370.390.460.380.90.89
K-neighbors0.670.750.550.610.940.88
AdaBoost0.470.420.360.360.890.92
Random Forest0.710.860.580.650.980.97
Multi-layer perceptron0.740.750.670.700.990.98
CNN0.970.980.910.9311
RNN0.990.970.960.9611

Performance metrics for all classifiers.

The macro-averaged values of precision, recall and F-score are shown. Micro AUROC, micro-average AUROC; Macro AUROC, macro-average AUROC.

Discussion

The RNN model marginally (but clearly) outperformed the CNN model, and both of them significantly outperformed all the base models on all key metrics, notably accuracy and F-score. The best-performing among the base models was the Multi-layer Perceptron. It is noteworthy that the AUROC approached unity and near-perfection for both the deep models, especially the bidirectional RNN with LSTM. This implied that the use of k-mer features masked long-range information whereas the deep models were able to capture such correlations directly from the full sequence. These results affirmed that RNNs could be used to effectively simulate the interactions in biological sequences.

The F-score (a measure of balanced accuracy) is a more unforgiving metric than AUROC in the case of multi-class problems (Table 4). While the CNN and RNN had macro-averaged F-scores of 0.93 and 0.96, respectively, none of the base models exceeded 0.70 including the multilayer perceptron. Supplementary Table S2 provides the classwise F-scores of all classifiers. All the base models struggled to classify the largest riboswitch classes, namely TPP and Cobalamin. This is a consequence of the diversity of such large riboswitch classes, making the ‘outlier’ members harder to classify correctly. Both the RNN and CNN are mapping the sequence input to its corresponding riboswitch family. In such a case, the sequence similarity within a family and sequence dissimilarity across families represent plausible discriminating features that the models are learning. Higher order features such as sequence context and base dependencies dictated by RNA secondary structure also constitute learnable information. Table 5 shows the results of a sequence-based clustering analysis of the riboswitch families using the cd-hit algorithm (Li and Godzik, 2006). It is seen that there are >7000 and >10,000 singleton clusters in the TPP and Cobalamin classes, respectively. Here we introduce a diversity metric for riboswitch families, defined as the ratio of the number of clusters at 90% sequence identity to the total size of the family. Compared to the overall diversity score of 0.7, both TPP and Cobalamin classes had above-average diversities (0.71 and 0.82, respectively). However, these diverse classes did not pose any problems for the deep models. On the contrary, the AdoCbl and AdoCbl variant riboswitch classes posed significant learning challenges to both base and deep models. AdoCbl in particular is the smallest riboswitch family considered here, but is also unnaturally diverse (with a score of 0.83). This frustrates learning, because the limited number of instances do not adequately represent the class diversity, and result in class outliers. Consequently this emerged as the most challenging for all classifiers, reflected in the classwise F-scores (Supplementary Table S2). Four of the base classifiers managed a zero F-score, while the CNN and RNN achieved F-scores of 0.38 and 0.67, respectively, by far their lowest for any class. Even here, the consistency of the RNN model is remarkable. The other classes that were notably challenging to the base models but not to the deep models included Cyclic di-GMP-II, ZMP/ZTP, SAM 1–4 variant, Molybdenum co-factor, Glucosamine-6-phosphate and Guanidine-I riboswitch classes. Of these, the Glucosamine-6-phosphate, Cyclic di-GMP-II, and Molybdenum co-factor riboswitch classes are among the most diverse riboswitch families, with diversity scores ≥0.80. It is seen that the classification problems arise either with diverse classes or at the extremes of class sizes. Too large the class, the diversity is challenging, whereas too small and the learning itself is incomplete and challenged. The deep models – RNN and CNN – consistently performed well across all classes, independent of the size of the class. It could be inferred in this case that using direct features (i.e., sequences) rather than engineered features (i.e., k-mer frequencies) led to more robust models. From Table 5, it is also clear that most of the learnt classes (especially the large ones) are diverse, thereby elevating the classifier performance to robustness against adversarial attacks – that is, changing a few nucleotides in the input sequence would be unlikely to drastically alter the class prediction.

TABLE 5

Rfam ID#Clusters at 90%Diversity#Singleton clusters#Redundant sequences#Clusters in the rest
RF0005024840.6118664645208
RF0005989540.71736627238738
RF000806110.734932947081
RF0016237360.62272214143956
RF0016721220.8118185945570
RF0016819050.8517074545787
RF00174116200.821013028936077
RF002348140.877311146878
RF0037925830.6620449745109
RF003804410.422653547251
RF004426000.674494647092
RF0050430100.6623456644682
RF005214950.613792047197
RF005222140.401132847478
RF006345010.403171247191
RF010559780.808657346714
RF010576290.765191347063
RF014821500.831321247546
RF016891310.62101247562
RF017254350.553262547257
RF017271490.621051647543
RF0173416160.80138912746076
RF017393440.312155647348
RF0175010740.648219446618
RF017671220.63881247570
RF017865550.844804947137
RF018314500.683275747242
RF026831590.681223147533
RF030573590.642667047333
RF03058970.28571747595
RF03071860.33482947606
RF030722730.371417247419
ALL476860.70388711951N-

Clustering analysis of riboswitch families at 90% sequence identity.

Singleton clusters indicate clusters with only one representative sequence. Redundant sequences are 100% identical to another member in the riboswitch family. The full set of riboswitch sequences (ALL) and ALL minus the riboswitch family of interest were also clustered.

These results might be put in perspective by benchmarking against the existing literature. Guillen-Ramirez and Martinez-Perez (Guillén-Ramírez and Martínez-Pérez, 2018) extended the k-mer features logic and arrived at an optimal combination of 5460 k-mer features. Using a limited dataset of 16 classes, they used state-of-the-art machine learning to achieve accuracies in the high nineties, however, their results did not generalize equally to riboswitches with remote homology. For e.g., their best-performing classifier (Multi-layer Perceptron) misclassified 6 out of the 225 instances of Lysine riboswitch as cobalamin-gated. The source code for the features and modeling used in their work is not readily available for new applications. To make the workflow described in our study easily reproducible and user-friendly, we have developed a Python package riboflow5 mirroring the best RNN model. In addition to predicting the most probable class of a given riboswitch sequence, riboflow provides an option to predict the complete vector of class probabilities, which could be helpful in disambiguating any class confusion. It would also inform the design of synthetic orthogonal riboswitches for biotechnology applications. The implementation and usage details are provided in Appendix II (Supplementary File S1).

An interesting benchmark is afforded by the Riboswitch Scanner (Mukherjee and Sengupta, 2016), which used profile HMMs (Eddy, 2011) of riboswitch classes to detect riboswitches in genomic sequences. While our method addresses inter-class discrimination of riboswitch sequences, Riboswitch Scanner is a web-server that essentially performs riboswitch vs. not-riboswitch discrimination for user-given riboswitch classes. The absence of F-score metrics does not allow for direct comparisons, however, the sensitivity and specificity seemed consistently comparable for most classes, with noticeable variations with respect to the Glycine, THF and SAM I-IV variant riboswitch classes. It must be indicated that their method is validated with Rfam seed sequences, without consideration for the proliferation of riboswitch sequences. Performance evaluation on limited data could inflate performance estimates and complicate their interpretation. We further note that it is possible to extend our method to the ‘riboswitch-or-not’ classification problem by calibrating the prediction probability thresholds. In any case, our work enables the ranking of riboswitches using the strength of the predicted probabilities, which would aid the selection of the best riboswitch sequence design.

It must be noted that riboswitches are precisely specific to cognate ligands. For e.g., the AdoCbl riboswitch would not tolerate a methyl-substituted cobalamin (Nahvi, 2004) nor does the TPP riboswitch interact with thiamine or thiamine monophosphate (Lang et al., 2007). At the same time, these two riboswitches are very diverse in their phylogenomic distribution and actual sequences. The key to effective learning lies in treading a fine line between the intra-class diversity and inter-class specificity. It is remarkable the bidirectional LSTM RNN was able to achieve exactly this tradeoff. The roots of such performance of the deep models in general has been explained recently to be related to the lottery ticket hypothesis (Frankle and Carbin, 2019) as well as learning the intrinsic dimension of the problem (Li et al., 2018), here the classification of riboswitches.

To extend the functionality of our work, we have introduced a dynamic component to all our models, both base and deep. With the exponential growth in genome sequencing, the room for riboswitch discovery is enormous (Yu and Breaker, 2020). Our models could accommodate new riboswitch class definitions by way of dataset augmentation, thereby making them general and more robust. This work used the dynamic functionality to extend a preliminary 24-class model to the present 32 classes with sustained performance. The implementation and usage details of the dynamic functionality and other utilities provided in the repository are given in Appendix II (Supplementary File S1). It is noted that the deep learning models could be adapted to new classes and related problems by the technique of transfer learning (Weiss et al., 2016). Addition of new data to existing models presents data quality issues, which remain contentious (Leonelli, 2019), and could be partially addressed using the tools employed in this study to assess the canonical Rfam dataset.

In summary, we present riboflow, a python package (see foot note 5) as well as standalone suite of tools, that have been validated and thoroughly tested on 32 riboswitch classes. By using large and complete datasets, the variance of our modeling procedure has been optimized, and this ensures the generality and applicability of our models on new instances/classes without compromise of performance. riboflow is an off-the-shelf solution that is ready for programmatic incorporation of the RNN model into automatic annotation and design pipelines. All of our trained models are available to the interested user as ‘pickled’ models from https://github.com/RiboswitchClassifier. Riboswitches are a cornerstone of progress in synthetic biology, representing key checkpoints for translation activation. Our work presents an intuitive general-purpose extensible platform for the effortless characterization of new riboswitch sequences and classes, which would accelerate bacterial genome annotation, synthetic biology, and biotechnology, including the rapid design of novel genetic circuits with exquisite specificity. The predicted probabilities of class membership could be used as a proxy of aptamer binding strength with cognate signaling molecule, and this paves the way for the design of effective riboswitches for any stimulus or set of stimuli. In addition to being indispensable workhorses in synthetic biology, riboswitches represent novel and exciting targets for the development of new class of antibiotics (Penchovsky et al., 2020), and our work would also help toward the design of riboswitch inhibitors to combat emerging and multi-drug resistant pathogens. In addition, our work opens up the applications of deep learning methods, including advanced relatives like stacked Bi-LSTM and attention models (Vaswani et al., 2017), for addressing related and unrelated problems in the biological realm.

Conclusion

We have demonstrated that CNN and RNN, without needing prior feature extraction, are capable of robust multi-class learning of ligand specificity from riboswitch sequence, with the RNN posting an F-score of ∼0.96. The confidence of classification could be obtained from an inspection of the predicted classwise probabilities. The bidirectional LSTM RNN model has been packaged into riboflow to enable embedding into genome annotation pipelines, genetic circuit-design automation, and biotechnology workflows. The CNN shows the best tradeoff between the time-cost of training the model and overall performance and could be applied to the task of learning new riboswitch classes using the provided dynamic update option that is provided. All the code used in our study is freely available for any use and further improvement by the scientific community as well as in the larger interest of reproducible research. Our study has highlighted the use of macro-averaged F-score as a disciminating objective metric of classifier performance on multi-class data. Our work reaffirms the competitive advantages of bidirectional LSTM RNNs over conventional machine learning and hidden markov profiles in modeling data sequences, and opens up their applications for modeling other non-coding RNA elements.

Statements

Data availability statement

All datasets generated for this study are included in the article/Supplementary Material.

Author contributions

AP conceived and designed the work and wrote the manuscript. KP, RB, and AP performed the experiments and analyzed the data. All authors contributed to the article and approved the submitted version.

Funding

This work was funded in part by DST-SERB grant EMR/2017/000470 and the SASTRA TRR grant (to AP).

Acknowledgments

We are grateful to the School of Chemical and Biotechnology, SASTRA Deemed University for infrastructure and computing support. A preliminary version of the manuscript has been released as a Pre-Print at bioRxiv (Premkumar et al., 2019).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe.2020.00808/full#supplementary-material

References

  • 1

    AbadiM.AgarwalA.BarhamP.BrevdoE.ChenZ.CitroC.et al (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Available online at: www.tensorflow.org(accessed May 10, 2018).

  • 2

    Abreu-GoodgerC.MerinoE. (2005). RibEx: a web server for locating riboswitches and other conserved bacterial regulatory elements.Nucleic Acids Res.33(Suppl. 2), W690W692.

  • 3

    AlipanahiB.DelongA.WeirauchM. T.FreyB. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.Nat. Biotechnol.33831838. 10.1038/nbt.3300

  • 4

    AntunesD.JorgeN.CaffarenaE. R.PassettiF. (2017). Using RNA sequence and structure for the prediction of riboswitch aptamer: a comprehensive review of available software and tools.Front. Genet.8:231. 10.3389/fgene.2017.00231

  • 5

    BarrickJ. E.BreakerR. R. (2007). The distributions, mechanisms, and structures of metabolite-binding riboswitches.Genome Biol.8:R239.

  • 6

    BeiselC. L.SmolkeC. D. (2009). Design principles for riboswitch function.PLoS Comput. Biol.5:e1000363. 10.1371/journal.pcbi.1000363

  • 7

    BengertP.DandekarT. (2004). Riboswitch finder—a tool for identification of riboswitch RNAs.Nucleic Acids Res.32(Suppl. 2), W154W159.

  • 8

    BengioY.CourvilleA.VincentP. (2013). Representation learning: a review and new perspectives.IEEE Trans. Pattern Anal. Mach. Intelligence3517981828. 10.1109/tpami.2013.50

  • 9

    BlountK. F.BreakerR. R. (2006). Riboswitches as antibacterial drug targets.Nat. Biotechnol.2415581564. 10.1038/nbt1268

  • 10

    BocobzaS. E.AharoniA. (2014). Small molecules that interact with RNA: riboswitch-based gene control and its involvement in metabolic regulation in plants and algae.Plant J.79693703. 10.1111/tpj.12540

  • 11

    BrantlS. (2004). Bacterial gene regulation: from transcription attenuation to riboswitches and ribozymes.Trends Microbiol.12473475. 10.1016/j.tim.2004.09.008

  • 12

    BreakerR. R. (2011). Prospects for riboswitch discovery and analysis.Mol. Cell43867879. 10.1016/j.molcel.2011.08.024

  • 13

    BreakerR. R.GestelandR. F.CechT. R.AtkinsJ. F. (2006). The RNA World.New York, NY: Cold Spring Harbor Laboratory Press.

  • 14

    ChangC. L.LeiQ.LucksJ. B.Segall-ShapiroT. H.WangD.MutalikV. K. (2012). An adaptor from translational to transcriptional control enables predictable assembly of complex regulation.Nat. Methods910881094. 10.1038/nmeth.2184

  • 15

    ChangT.-H.HuangH.-D.WuL.-C.YehC.-T.LiuB.-J.HorngJ.-T. (2009). Computational identification of riboswitches based on RNA conserved functional sequences and conformations.RNA1514261430. 10.1261/rna.1623809

  • 16

    CheZ.PurushothamS.ChoK.SontagD.LiuY. (2018). Recurrent neural networks for multivariate time series with missing values.Sci. Rep.8:6085. 10.1038/s41598-018-24271-9

  • 17

    CloteP. (2015). Computational prediction of riboswitches.Methods Enzymol.553287312. 10.1016/BS.MIE.2014.10.063

  • 18

    DeiganK. E.Ferré-D’AmaréA. R. (2011). Riboswitches: discovery of drugs that target bacterial gene-regulatory RNAs.Acc. Chem. Res.4413291338. 10.1021/ar200039b

  • 19

    DominG.FindeißS.WachsmuthM.WillS.StadlerP. F.MörlM. (2017). Applicability of a computational design approach for synthetic riboswitches.Nucleic Acids Res.4541084119.

  • 20

    EddyS. R. (2011). Accelerated profile HMM searches.PLoS Comput. Biol.7:e1002195. 10.1371/journal.pcbi.1002195

  • 21

    FrankleJ.CarbinM. (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks.arXiv [Preprint]. Available online at: https://arxiv.org/abs/1803.03635(accessed March 4, 2019).

  • 22

    GelfandM. S.MironovA. A.JomantasJ.KozlovY. I.PerumovD. A. (1999). A conserved RNA structure element involved in the regulation of bacterial riboflavin synthesis genes.Trends Genet.15439442. 10.1016/s0168-9525(99)01856-9

  • 23

    GravesA.SchmidhuberJ. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures.Neural Net.18602610. 10.1016/j.neunet.2005.06.042

  • 24

    GroherA. C.JagerS.SchneiderC.GroherF.HamacherK.SuessB. (2019). Tuning the performance of synthetic riboswitches using machine learning.ACS Synth. Biol.83444. 10.1021/acssynbio.8b00207

  • 25

    Guillén-RamírezH. A.Martínez-PérezI. M. (2018). Classification of riboswitch sequences using k-mer frequencies.Biosystems1746376. 10.1016/j.biosystems.2018.09.001

  • 26

    HavillJ. T.BhatiyaC.JohnsonS. M.SheetsJ. D.ThompsonJ. S. (2014). A new approach for detecting riboswitches in DNA sequences.Bioinformatics3030123019. 10.1093/bioinformatics/btu479

  • 27

    HochreiterS.SchmidhuberJ. (1997). Long short-term memory.Neural Comput.917351780. 10.1162/neco.1997.9.8.1735

  • 28

    KalvariI.ArgasinskaJ.Quinones-OlveraN.NawrockiE. P.RivasE.EddyS. R.et al (2018). Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.Nucleic Acids Res.46D335D342. 10.1093/nar/gkx1038

  • 29

    KelleyD. R.SnoekJ.RinnJ. (2016). Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks.Genome Res.26990999. 10.1101/gr.200535.115

  • 30

    LangK.RiederR.MicuraR. (2007). Ligand-induced folding of the thiM TPP riboswitch investigated by a structure-based fluorescence spectroscopic approach.Nucleic Acids Res.3553705378. 10.1093/nar/gkm580

  • 31

    LeeB.LeeT.NaB.YoonS. (2015). DNA-level splice junction prediction using deep recurrent neural networks.arXiv [Preprint]. Available online at: arXiv.org>cs>arXiv:1512.05135(accessed August 25, 2019).

  • 32

    LeonelliS. (2019). Philosophy of biology: the challenges of big data biology.eLife8:e47381. 10.7554/eLife.47381

  • 33

    LiC.FarkhoorH.LiuR.YosinskiJ. (2018). Measuring the intrinsic dimension of objective landscapes.arXiv [Preprint]. Available online at: https://arxiv.org/pdf/1804.08838.pdf(accessed July 3, 2019).

  • 34

    LiW.GodzikA. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.Bioinformatics2216581659. 10.1093/bioinformatics/btl158

  • 35

    LiptonZ. C. (2015). A critical review of recurrent neural networks for sequence learning.arXiv [Preprint]. Available online at: https://arxiv.org/abs/arXiv:1506.00019(accessed May 29, 2015).

  • 36

    Lo BoscoG.Di GangiM. (2017). Deep learning architectures for DNA sequence classification.Lecture Notes Comput. Sci.10147162171. 10.1007/978-3-319-52962-2_14

  • 37

    MandalM.BoeseB.BarrickJ. E.WinklerW. C.BreakerR. R. (2003). Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria.Cell113577586. 10.1016/s0092-8674(03)00391-x

  • 38

    MandalM.BreakerR. R. (2004). Gene regulation by riboswitches.Nat. Rev. Mol. Cell Biol.5451463.

  • 39

    ManningC. D.RaghavanP.SchützeH. (2008). Introduction to Information Retrieval.Cambridge: Cambridge University Press.

  • 40

    McCownP. J.CorbinoK. A.StavS.SherlockM. E.BreakerR. R. (2017). Riboswitch diversity and distribution.RNA239951011. 10.1261/rna.061234.117

  • 41

    MeyerA.PellauxR.PototS.BeckerK.HohmannH. P.PankeS.et al (2015). Optimization of a whole-cell biocatalyst by employing genetically encoded product sensors inside nanolitre reactors.Nat. Chem.7673678. 10.1038/nchem.2301

  • 42

    MironovA. S.GusarovI.RafikovR.LopezL. E.ShatalinK.KrenevaR. A.et al (2002). Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria.Cell111747756. 10.1016/S0092-8674(02)01134-0

  • 43

    MukherjeeS.SenguptaS. (2016). Riboswitch scanner: an efficient pHMM-based web-server to detect riboswitches in genomic sequences.Bioinformatics32776778. 10.1093/bioinformatics/btv640

  • 44

    NahviA. (2004). Coenzyme B12 riboswitches are widespread genetic control elements in prokaryotes.Nucleic Acids Res.32143150. 10.1093/nar/gkh167

  • 45

    NahviA.SudarsanN.EbertM. S.ZouX.BrownK. L.BreakerR. R. (2002). Genetic control by a metabolite binding mRNA.Chem. Biol.910431049. 10.1016/S1074-5521(02)00224-7

  • 46

    NawrockiE. P.EddyS. R. (2013). Infernal 1.1: 100-fold faster RNA homology searches.Bioinformatics2929332935. 10.1093/bioinformatics/btt509

  • 47

    PedregosaF.VaroquauxG.GramfortA.MichelV.ThirionB.GriselO.et al (2011). Scikit-learn: machine learning in python.J. Mach. Learn. Res.1228252830.

  • 48

    PenchovskyR.PavlovaN.KaloudasD. (2020). RSwitch: a novel bioinformatics database on riboswitches as antibacterial drug targets.IEEE/ACM Trans. Comput. Biol. Bioinform.99:1. 10.1109/TCBB.2020.2983922

  • 49

    PremkumarK. A. R.BharanikumarR.PalaniappanA. (2019). Riboflow: using deep learning to classify riboswitches with ∼99% accuracy.bioRxiv [Preprint]. 10.1101/868695

  • 50

    RothA.BreakerR. R. (2009). The structural and functional diversity of metabolite-binding riboswitches.Annu. Rev. Biochem.78305334. 10.1146/annurev.biochem.78.070507.135656

  • 51

    SerganovA.NudlerE. (2013). A decade of riboswitches.Cell1521724. 10.1016/j.cell.2012.12.024

  • 52

    SinghJ.HansonJ.PaliwalK.ZhouY. (2019). RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning.Nat. Commun.10:5407.

  • 53

    SinghS.SinghR. (2016). Application of supervised machine learning algorithms for the classification of regulatory RNA riboswitches.Brief. Funct. Genomics1699105. 10.1093/bfgp/elw005

  • 54

    SønderbyS. K.SønderbyC. K.NielsenH.WintherO. (2015). Convolutional LSTM networks for subcellular localization of proteins.ArXiv [Preprint]. 10.1007/978-3-319-21233-3_6

  • 55

    StormoG. D.JiY. (2001). Do mRNAs act as direct sensors of small molecules to control their expression?Proc. Natl. Acad. Sci. U.S.A.9894659467. 10.1073/pnas.181334498

  • 56

    StrobelS. A.CochraneJ. C. (2007). RNA catalysis: ribozymes, ribosomes, and riboswitches.Curr. Opin. Chem. Biol.11636643. 10.1016/j.cbpa.2007.09.010

  • 57

    SudarsanN.BarrickJ. E.BreakerR. R. (2003). Metabolite-binding RNA domains are present in the genes of eukaryotes.RNA9644647. 10.1261/rna.5090103

  • 58

    SundermeyerM.AlkhouliT.WuebkerJ.NeyH. (2014). “Translation modeling with bidirectional recurrent neural networks,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, 1425.

  • 59

    TuckerB. J.BreakerR. R. (2005). Riboswitches as versatile gene control elements.Curr. Opin. Struct. Biol.15342348. 10.1016/j.sbi.2005.05.003

  • 60

    van RijsbergenC. J. (1975). Information Retrieval.London: Butterworths.

  • 61

    VaswaniA.JonesL.ShazeerN.ParmarN.GomezA. N.UszkoreitJ.et al (2017). “Attention is all you need,” in Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA.

  • 62

    VillaJ. K.SuY.ContrerasL. M.HammondM. C. (2018). Synthetic biology of small RNAs and riboswitches.Microbiol. Spectr.6:10.1128/microbiolspec.RWR-0007-2017. 10.1128/microbiolspec.rwr-0007-2017

  • 63

    WangH.MannP. A.XiaoL.GillC.GalgociA. M.HoweJ. A.et al (2017). Dual-targeting small-molecule inhibitors of the Staphylococcus aureus FMN riboswitch disrupt riboflavin homeostasis in an infectious setting.Cell Chem. Biol.24576588. 10.1016/j.chembiol.2017.03.014

  • 64

    WeissK.KhoshgoftaarT. M.WangD. (2016). A survey of transfer learning.J. Big Data3:9.

  • 65

    WielandM.HartigJ. S. (2008). Artificial riboswitches: synthetic mRNA-based regulators of gene expression.Chembiochem918731878. 10.1002/cbic.200800154

  • 66

    WinklerW.NahviA.BreakerR. R. (2002). Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression.Nature419952956. 10.1038/nature01145

  • 67

    WinklerW. C.NahviA.RothA.CollinsJ. A.BreakerR. R. (2004). Control of gene expression by a natural metabolite-responsive ribozyme.Nature428281286.

  • 68

    WittmannA.SuessB. (2012). Engineered riboswitches: expanding researchers’ toolbox with synthetic RNA regulators.FEBS Lett.58620762083. 10.1016/j.febslet.2012.02.038

  • 69

    YangJ.SeoS. W.JangS.ShinS. I.LimC. H.RohT. Y.et al (2013). Synthetic RNA devices to expedite the evolution of metabolite-producing microbes.Nat. Commun.4:1413. 10.1038/ncomms2404

  • 70

    YangY. (1999). An evaluation of statistical approaches to text categorization.J. Inf. Retr.16788.

  • 71

    YanofskyC. (1981). Attenuation in the control of expression of bacterial operons.Nature289751758. 10.1038/289751a0

  • 72

    YuD.BreakerR. R. (2020). A bacterial riboswitch class senses xanthine and uric acid to regulate genes associated with purine oxidation.RNA10.1261/rna.075218.120[Epub ahead of print].

  • 73

    ZhouJ.TroyanskayaO. G. (2015). Predicting effects of noncoding variants with deep learning- based sequence model.Nat. Methods12931934. 10.1038/nmeth.3547

  • 74

    ZhouL. B.ZengA. P. (2015). Engineering a lysine-ON riboswitch for metabolic control of lysine production in Corynebacterium glutamicum.ACS Synth. Biol.413351340. 10.1021/acssynbio.5b00075

Summary

Keywords

riboswitch family, synthetic biology, machine learning, convolutional neural network, recurrent neural network, hyperparameter optimization, multiclass ROC, clustering

Citation

Premkumar KAR, Bharanikumar R and Palaniappan A (2020) Riboflow: Using Deep Learning to Classify Riboswitches With ∼99% Accuracy. Front. Bioeng. Biotechnol. 8:808. doi: 10.3389/fbioe.2020.00808

Received

17 December 2019

Accepted

23 June 2020

Published

14 July 2020

Volume

8 - 2020

Edited by

Yusuf Akhter, Babasaheb Bhimrao Ambedkar University, India

Reviewed by

Ahsan Z. Rizvi, Mewar University, India; Joohyun Kim, Vanderbilt University Medical Center, United States

Updates

Copyright

*Correspondence: Ashok Palaniappan,

These authors have contributed equally to this work

This article was submitted to Computational Genomics, a section of the journal Frontiers in Bioengineering and Biotechnology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics