AUTHOR=Lin Rattaphon, Wichadakul Duangdao

TITLE=Interpretable Deep Learning Model Reveals Subsequences of Various Functions for Long Non-Coding RNA Identification

JOURNAL=Frontiers in Genetics

VOLUME=Volume 13 - 2022

YEAR=2022

URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2022.876721

DOI=10.3389/fgene.2022.876721

ISSN=1664-8021

ABSTRACT=Long non-coding RNAs (lncRNAs) play crucial roles in many biological processes and are implicated in several diseases. With next-generation sequencing technologies, a substantial number of unannotated transcripts have been discovered. Classifying unannotated transcripts through biological experiments is more time-consuming and expensive than using computational approaches. Several tools for identifying lncRNAs are available. These tools, however, do not explain which features contributed to their predictions. Here, we present Xlnc1DCNN, a tool that distinguishes lncRNAs from protein-coding transcripts (PCTs) using a one-dimensional convolutional neural network and provides explanations for its predictions. Evaluation on the human test set showed that Xlnc1DCNN outperformed other state-of-the-art tools in terms of accuracy and F1-score. The explanation results revealed that lncRNA transcripts were mainly identified as sequences with no conserved regions, with short patterns of unknown function, or with only regions of transmembrane helices, while protein-coding transcripts were mostly classified by conserved protein domains or families. The explanation results also pointed to possibly inconsistent annotations among public databases: lncRNA transcripts that contain protein domains, protein families, or intrinsically disordered regions (IDRs). Xlnc1DCNN is freely available at https://github.com/cucpbioinfo/Xlnc1DCNN.
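
The abstract does not describe the network architecture or the explanation method, so the following is only a minimal sketch of the general idea behind a one-dimensional convolution over a transcript sequence, not the Xlnc1DCNN implementation. The encoding order, the helper names (one_hot, conv1d_valid), and the hand-made ATG filter are all assumptions introduced for illustration; a trained model would learn its filters from data. Reporting the position of the strongest filter response is used here as a crude stand-in for attributing a prediction to a subsequence.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed encoding order; U is mapped to T


def one_hot(seq):
    """Encode a nucleotide string as a (length, 4) array.

    Bases outside ALPHABET (for example N) become all-zero rows.
    """
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for i, base in enumerate(seq.upper().replace("U", "T")):
        j = ALPHABET.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded


def conv1d_valid(x, kernel):
    """Slide a (width, 4) kernel along a (length, 4) input.

    Stride 1, no padding, so the output has length - width + 1 entries.
    """
    width = kernel.shape[0]
    n_out = x.shape[0] - width + 1
    return np.array([np.sum(x[i:i + width] * kernel) for i in range(n_out)])


# Hypothetical, hand-made filter that responds most strongly to "ATG".
atg_filter = one_hot("ATG")

seq = "CCATGGCA"
activations = conv1d_valid(one_hot(seq), atg_filter)

# Position of the strongest response: the subsequence this filter "points at".
best = int(np.argmax(activations))
print(activations.tolist(), best, seq[best:best + 3])
```

In a real classifier, many such filters would be stacked with nonlinearities and pooling, and a dedicated attribution method would replace the simple argmax used here.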