LncPep: A Resource of Translational Evidences for lncRNAs

Long noncoding RNAs (lncRNAs) are a type of transcript that is >200 nucleotides long with no protein-coding capacity. Accumulating studies have suggested that lncRNAs contain open reading frames (ORFs) that encode peptides. Although several noncoding RNA-encoded peptide-related databases have been developed, most of them display only a small number of experimentally validated peptides, and resources focused on lncRNA-encoded peptides are still lacking. We used six types of evidence, coding potential assessment tool (CPAT), coding potential calculator v2.0 (CPC2), N6-methyladenosine modification of RNA sites (m6A), Pfam, ribosome profiling (Ribo-seq), and translation initiation sites (TISs), to evaluate the coding potential of 883,804 lncRNAs across 39 species. We constructed a comprehensive database of lncRNA-encoded peptides, LncPep (http://www.shenglilabs.com/LncPep/). LncPep provides three major functional modules: 1) user-friendly searching/browsing interface, 2) prediction and BLAST modules for exploring novel lncRNAs and peptides, and 3) annotations for lncRNAs, peptides and supporting evidence. Taken together, LncPep is a user-friendly and convenient platform for discovering and investigating peptides encoded by lncRNAs.


INTRODUCTION
Long noncoding RNAs (lncRNAs) are defined as RNAs longer than 200 nucleotides (nt) and have been shown to be extensively expressed and to exert powerful regulatory functions (Marchese et al., 2017). Mechanistically, lncRNAs can regulate protein-protein and protein-DNA interactions by serving as scaffolds or guides, bind to proteins as decoys, and modulate mRNA expression as microRNA (miRNA) sponges. Evidence accumulated over the past decade demonstrates that lncRNA regulation plays key roles in diverse biological and pathological contexts, such as the immune response, cell proliferation (Li et al., 2018), neuronal disorders (Salta and De Strooper, 2017), and tumour biology. LncRNAs were long regarded as "junk RNAs" with no potential to encode functional proteins. Recently, however, a growing body of evidence has demonstrated that lncRNAs are able to encode functional peptides that play vital roles in physiological processes (Anderson et al., 2015; Matsumoto et al., 2017; Anastasia et al., 2019; Niu et al., 2020; Cai et al., 2021; Zhang et al., 2021). For example, peptides translated from the lncRNA Aw112010 are essential for the orchestration of mucosal immunity during bacterial infection and colitis (Jackson et al., 2018). Matsumoto et al. identified and functionally characterized a novel polypeptide encoded by the lncRNA LINC00961 (Matsumoto et al., 2017). This LINC00961-encoded peptide was found to negatively regulate amino acid-stimulated mTORC1 activation by interacting with lysosomal v-ATPase, thereby modulating muscle regeneration. The lncRNA HOXB-AS3 was discovered to encode a conserved 53 amino acid (aa) peptide that suppresses colon cancer growth by competitively binding to the arginine residues in the RGG motif of hnRNP A1 (Huang et al., 2017). These studies expand our understanding of lncRNAs and the coding potential of the genome.
With increasing numbers of experimentally validated lncRNA-encoded peptides, a comprehensive identification and annotation of peptides translated from lncRNAs is urgently needed.
Various computational algorithms and biotechnologies have been developed to directly or indirectly capture translational evidence of RNAs. The coding potential assessment tool (CPAT) (Wang et al., 2013) and coding potential calculator v2.0 (CPC2) (Kang et al., 2017) are the most commonly used algorithms to assess RNA coding ability. Ribosome profiling (Ribo-seq) is a common method to identify translated RNAs (Ingolia et al., 2009). The N6-methyladenosine modification of RNA (m6A) promotes RNA translation initiation (Meyer et al., 2015), and translation initiation sites (TISs) detected by global translation initiation sequencing are important evidence for the encoding of proteins or peptides (Lee et al., 2012). Ribo-seq, m6A sites, and TISs provide indirect proof of lncRNA-encoded peptides. Although other indirect evidence also supports lncRNA-encoded peptides, none of these lines of evidence offers dependable predictions on its own.
To identify the peptides encoded by lncRNAs, we built a comprehensive database, LncPep, that contains 10,580,228 peptides predicted to be translated from 883,804 lncRNAs across 39 species. Direct and indirect evidence is integrated to evaluate the peptide-encoding potential of lncRNAs. This database provides a convenient data search and browse engine, detailed information on each lncRNA and its translated peptide, and supporting evidence. Moreover, prediction and BLAST searches for novel lncRNAs and peptides are available for users. LncPep is expected to serve as an important resource to discover and investigate biologically functional peptides hidden in lncRNAs. All the information and data are freely accessible at http://www.shenglilabs.com/LncPep/.

RESULTS
Data Source and Summary
The current version of LncPep contains 883,804 lncRNAs across 39 species together with six different peptide-encoding lines of evidence to evaluate their translation potential (Figure 1). This evidence provides direct or indirect support for lncRNA translation. For convenience, we normalized the score for each line of evidence to range from 0 to 1 and combined these scores for a comprehensive translation potential evaluation (see Materials and Methods).
The numbers of lncRNAs, predicted peptides, and supporting lines of evidence in each species are summarized in Table 1. Detailed information on lncRNAs was retrieved from NONCODE, LncBook, and LNCipedia (Figure 1). Both ATG and non-ATG codons were considered as start codons in all the predicted ORFs. The minimum ORF length was set to 10 aa, and the longest ORF was selected when multiple ORFs overlapped in the same lncRNA. Five different pieces of evidence supporting translation are included in LncPep (Figure 1): CPAT, CPC2, Ribo-seq, TISs, and m6A sites. Only humans and mice have five pieces of evidence, while other species have three or fewer pieces of evidence (Figure 1). The CPAT and CPC2 algorithms are the most commonly used tools for RNA coding potential evaluation, and the coding probability scores provided by these two tools were used in LncPep as evidence (Wang et al., 2013; Kang et al., 2017). Since ribosomes and TISs are necessary for RNA translation, we used Ribo-seq and validated TISs as two pieces of evidence to support lncRNA translation (Ramakrishnan, 2002; Wan and Qian, 2014; Wang et al., 2019). m6A modification was reported to promote RNA translation, and the detected m6A sites were also used as evidence (Meyer, 2019). "Natural" peptides are more likely to be functional, and we used Pfam domains to assess lncRNA-translated peptides as one line of evidence of functionality (Mistry et al., 2021).

LncRNA Translating Features
Peptides were predicted from extracted lncRNA sequences based on ORF searching, and translation evidence scores were calculated for the predicted peptides. We defined high-confidence peptides (HCPs) as peptides with Ribo-seq evidence in human and mouse, with at least four pieces of evidence in Arabidopsis thaliana, Caenorhabditis elegans, fruit fly, rat, yeast, and zebrafish, and with at least three pieces of evidence in the other species. On average, fewer than one HCP was encoded per lncRNA in all species (Figure 2A). Although the numbers of HCPs per lncRNA in humans and mice were only 0.016 and 0.01, respectively, the large number of lncRNAs in these two species means that HCPs could account for a considerable part of the human and mouse proteomes. Most of the lncRNA-encoded peptides were less than 100 aa in length (Figure 2B). In terms of evidence, the vast majority of peptides are supported by more than two pieces of evidence (Figure 2C). In humans, approximately 5% of peptides are supported by more than 2 types of evidence. We compared predicted peptides in LncPep with those in sORFs and Microproteins. Only 153 peptides were shared by all databases, and about 95% of LncPep peptides were unique among these three databases.
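The HCP definition above amounts to a species-dependent rule, which the following Python sketch makes explicit. It is only an illustration: the function name, the species labels, and the representation of a peptide's evidence as a set of evidence names are assumptions, and the text does not say how CPAT or CPC2 probabilities are thresholded to count as a "piece of evidence", so the sketch simply takes the set of supporting evidence as given.

```python
from typing import Iterable

# Species groups taken from the HCP definition in the text; the labels are illustrative.
RIBO_SEQ_REQUIRED = {"human", "mouse"}
AT_LEAST_FOUR = {
    "Arabidopsis thaliana", "Caenorhabditis elegans",
    "fruit fly", "rat", "yeast", "zebrafish",
}


def is_high_confidence(species: str, evidence: Iterable[str]) -> bool:
    """Apply the species-dependent HCP rule to a peptide's supporting evidence."""
    supported = set(evidence)
    if species in RIBO_SEQ_REQUIRED:
        # Human and mouse: Ribo-seq evidence is required.
        return "Ribo-seq" in supported
    if species in AT_LEAST_FOUR:
        # Second group: at least four pieces of evidence.
        return len(supported) >= 4
    # All other species: at least three pieces of evidence.
    return len(supported) >= 3


if __name__ == "__main__":
    print(is_high_confidence("human", {"Ribo-seq", "TIS"}))
    print(is_high_confidence("rat", {"CPAT", "CPC2", "m6A"}))
```

Under these assumptions, a human peptide with four non-Ribo-seq pieces of evidence is not an HCP, whereas a peptide from a species outside the two named groups qualifies with any three.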

Data Access and Download
LncPep provides convenient and flexible routes to mine the data. In the "Browse" module, users can select the species they are interested in, and a brief summary of the peptides is provided, including the host lncRNA, peptide sequence and length, the evidence, and the scores (Figure 3A). Users can further browse summarized details of host lncRNAs by clicking the lncRNA ID. Clicking the arrow in the "Pep_seq" column opens a popup window showing the peptide sequence, and clicking the arrow in the "Evd" column shows the detailed evidence supporting the peptide of interest. The summary table can be sorted by peptide length, CPAT score, CPC2 score, m6A number, Pfam number, Ribo-seq number, TIS number, or integrated peptide-encoding score. In addition, users can filter the summary table by selecting one or more pieces of evidence.
LncPep allows users to search the entire database by lncRNA ID, host gene, genomic location, and evidence on the search page (Figure 3B). The results table contains peptide numbers, query names, species, lncRNA IDs, ORF genomic loci, peptide lengths, peptide sequences, ORF start sites, ORF end sites, translation scores, and supporting evidence. Search results can be ranked by peptide length, ORF start site, or integrated peptide-encoding score by clicking the corresponding table header. On the lncRNA or peptide page, detailed information on the lncRNAs, peptides, and evidence is provided (Figure 4). All the data are free to download on the "Download" page (http://www.shenglilabs.com/LncPep/#!/download) (Figure 3C).
As a growing number of lncRNA-encoded peptides have been reported, we also curated experimentally validated lncRNA-encoded peptides. Through literature research and integration, we collected experimentally validated peptides from 27 articles and provided detailed information on the host lncRNAs, peptides, and articles (Figure 3D). Most of the studies were based on human lncRNAs, and a smaller group was based on mouse lncRNAs. This module will continue to be updated.

Predict and BLAST
With the development of high-throughput sequencing technology, a large number of lncRNAs and peptides have been or will be discovered. Prediction and BLAST modules will be useful for users to identify their own functional lncRNAs and peptides. Thus, we developed the "Predict" (Figure 3E) and "Blast" (Figure 3F) modules in the LncPep database, wherein users can input their own lncRNA sequences in FASTA format. The results table contains peptide numbers, lncRNA IDs, species, ORF numbers, ORF sequences, and options for BLAST. Users can view the ORF sequences and lengths in a popup window by clicking the arrow in the "ORF sequence" column. Users can also BLAST ORFs of interest by clicking "Blast ORF" in the "Blast" column. Furthermore, users can BLAST specific lncRNA or ORF sequences against the datasets deposited in LncPep. LncRNA or ORF sequences in FASTA format are required as input. Before clicking the "Blast" button, users are also required to indicate whether the input sequences are peptides or lncRNAs. The species and the E-value threshold can be selected by the user. Currently, up to 1,000 sequences can be uploaded and analysed at the same time, and results should be obtained within a few minutes.

Example Application
Users can investigate potential translated peptides of lncRNAs of interest. For example, HSALNT0229539 is a 1646 nt-long human lncRNA annotated in the LncBook database (https://ngdc.cncb.ac.cn/lncbook/transcript?transid=HSALNT0229539), which is located at chr16:29679186-29698684 (+) (Figure 4A). The CPAT and CPC2 scores of HSALNT0229539 were 0.286 and 0.209, respectively (Figure 4A). On average, each ORF is covered by more than one line of evidence (Ribo-seq and TIS) (Figure 4B). Only HSALNT0229539 ORF-1 is supported by Pfam evidence (Figure 4B). Detailed sequence information of lncRNA HSALNT0229539 is shown in a popup window after clicking the hyperlink on the "Sequence" arrow (Figure 4C). In total, 5 ORFs were discovered in lncRNA HSALNT0229539, and detailed information is summarized in the following "ORF and peptide information" table (Figure 4B). LncRNA HSALNT0229539 is much more highly expressed in fallopian tube than in other normal human tissues (Figure 4D). Furthermore, HSALNT0229539 is extensively expressed in multiple cancer cell lines, indicating that HSALNT0229539 is broadly expressed across cancer types (Figure 4E). HSALNT0229539 ORF-1 is located at positions 19-372 of lncRNA HSALNT0229539 and is predicted to be translated into the following peptide: MKQAVRAARQAADFTLKVEVECSSLQEAVQAAEAGADLVLLDNFKPEELHPTATVLKAQFPSVAVEASGGITLDNLPQFCGPHIDVISMGMLTQAAPALDFSLKLFAKEVAPVPKIH (Figure 4F). HSALNT0229539 ORF-1 is predicted with coding potential scores of 0.286 and 0.209 for CPAT and CPC2, respectively. In the Pfam database, QRPTase_C matched the ORF-1 sequence. In the RPFdb database, 154 Ribo-seq signals were mapped to the ORF-1 region. In addition, two pieces of TIS evidence were found in the HSALNT0229539 ORF-1 region. Evidence from outside public databases can be accessed by clicking the corresponding hyperlinks in the "Database" column.

DISCUSSION
The rapid development of high-throughput RNA sequencing technologies has greatly facilitated the discovery and in-depth investigation of lncRNAs (Atkinson et al., 2012; Brar and Weissman, 2015; Stark and Grzelak, 2019). These RNAs, transcribed from typically non-protein-coding regions of genomes, have recently been demonstrated to encode functional peptides in various biological contexts. The LncPep database provides an online resource for peptide-encoding lncRNAs and contains 883,804 lncRNAs across 39 species with translational evidence. LncPep offers various ways to browse and search lncRNA-encoded peptide resources and supports users in predicting and BLASTing customized lncRNA/peptide sequences for exploratory research on novel lncRNA transcripts or peptides. Furthermore, users can download the full datasets deposited in LncPep, which will empower researchers to explore the "coding realm" of lncRNAs. A "document" page is offered to help users understand and use this database quickly.
Frontiers in Cell and Developmental Biology | www.frontiersin.org | January 2022 | Volume 10 | Article 795084
To date, FuncPEP (Dragomir et al., 2020), ncEP, cncRNAdb (Huang et al., 2021), and SmProt (Hao et al., 2018) have been developed for noncoding RNA peptides. FuncPEP, ncEP, and cncRNAdb curated experimentally validated peptides encoded by noncoding RNAs, including lncRNAs, circRNAs, and miRNAs. SmProt collected peptides shorter than 100 amino acids (aa) identified from ribosome profiling data, literature, or MS, but no peptides longer than 100 aa encoded by lncRNAs were included (Lun et al., 2020; Meng et al., 2020; Cai et al., 2021). Compared to these existing databases of noncoding RNA-encoded peptides, LncPep is focused on lncRNAs and has four advantages: 1) both validated and predicted lncRNA-encoded peptides are included in LncPep; 2) abundant evidence and detailed annotations are supplied to support peptides and lncRNAs; 3) LncPep does not impose a length limit on peptides, and peptides longer than 100 aa, which multiple studies have reported to be important, are also included; and 4) the "Predict" and "Blast" modules will help users to explore novel lncRNAs and peptides.
Some limitations remain to be addressed. Internal ribosome entry sites (IRESs) (Hellen and Sarnow, 2001; Bonnal et al., 2003) are functional cis-acting RNA elements that can direct 40S ribosomes to an internal position on the RNA for translation initiation; thus, IRESs are also an important line of support for lncRNA translation. RNA structure (Mao et al., 2014; Mauger et al., 2019) also affects translation; thus, lncRNA structure needs to be taken into account. We did not include these lines of evidence because the available datasets are limited. In the future, we will continue to update the database and add IRES, lncRNA structure, and more Ribo-seq data to support lncRNA translation; we will also further improve the web interface of the database.

MATERIALS AND METHODS
Data Collection
Basic information on lncRNAs was retrieved from LncBook (Ma et al., 2019), NONCODE, and LNCipedia (Volders et al., 2019), and 898,452 lncRNA transcripts across 39 species were included. The lncRNA transcript expression profiles in different normal human tissues and multiple types of cancer cell lines were retrieved from LncExpDB; these data were originally collected from the Human Protein Atlas (HPA) (Uhlen et al., 2017) and the Cancer Cell Line Encyclopedia (CCLE) (Ghandi et al., 2019), respectively. In particular, the lncRNA expression data for normal human tissues included 122 samples from 32 different tissue types, and the expression data for cancer cell lines covered 659 cancer cell lines from 44 primary sites.

Analysis of Evidence for lncRNA-Translated Peptides
The CPAT algorithm and CPC2 were employed to evaluate RNA encoding potential. The CPAT algorithm is based on a logistic regression model built with sequence features from known coding RNA candidates (Wang et al., 2013). We calculated the CPAT scores of lncRNAs for all species, and the CPAT scores were used as one of the criteria to assess the reliability of lncRNA encoding potential. CPC2 is a fast and accurate coding potential calculator based on intrinsic sequence features and is a species-neutral tool (Kang et al., 2017). Thus, the CPC2 scores were calculated for all lncRNA transcripts of 39 species as one line of evidence for encoding potential.
Ribosomes are key modules in polysomes with actively translated RNAs (Ramakrishnan, 2002). Therefore, the association with ribosomes/polysomes detected by ribosome profiling (Ribo-seq) can serve as strong evidence for peptide-translating lncRNAs. RPFdb (Wang et al., 2019) is a public resource for ribosome profiling containing Ribo-seq data from 3,603 samples. We downloaded the Ribo-seq data for humans, mice, C. elegans, chicken, rat, zebrafish, and Arabidopsis thaliana and then used bedtools to map them to ORFs of lncRNA transcripts, requiring coverage >90%. The mapped Ribo-seq signals are evidence of lncRNA translation.
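As a rough illustration of the mapping criterion above, the following Python sketch counts Ribo-seq footprints whose overlap with an ORF exceeds 90% of the footprint length. The paper used bedtools and does not state which interval the 90% refers to, so treating it as a fraction of the footprint is an assumption here, as are the function names and the 0-based, half-open coordinates. Strand is ignored for brevity, and the coordinates in the example are made up.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based and half-open


def overlap_fraction(read: Interval, orf: Interval) -> float:
    """Return the fraction of the read that lies inside the ORF."""
    read_start, read_end = read
    orf_start, orf_end = orf
    read_len = read_end - read_start
    if read_len <= 0:
        raise ValueError("read interval must have positive length")
    overlap = max(0, min(read_end, orf_end) - max(read_start, orf_start))
    return overlap / read_len


def count_supporting_reads(reads: Iterable[Interval], orf: Interval,
                           min_fraction: float = 0.9) -> int:
    """Count reads whose overlap with the ORF exceeds min_fraction (assumed criterion)."""
    return sum(1 for read in reads if overlap_fraction(read, orf) > min_fraction)


if __name__ == "__main__":
    orf = (18, 372)                                       # hypothetical ORF
    reads = [(20, 50), (0, 30), (360, 390), (100, 130)]   # hypothetical footprints
    print(count_supporting_reads(reads, orf))             # prints 2
```

Only the two footprints lying entirely inside the ORF are counted; the two that straddle an ORF boundary overlap it by 12 of 30 nt and are rejected.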
Translation initiation sites (TISs) are important for protein/peptide production from transcripts. Global translation initiation sequencing technology (Wan and Qian, 2014) was used to identify genome-wide TISs. TISdb (Wan and Qian, 2014) is a database that curates human and mouse TISs characterized by global translation initiation sequencing. We downloaded these validated TISs from TISdb and mapped them to ORFs of lncRNA transcripts, and the mapped TISs were used as evidence for lncRNA translation.
The N6-methyladenosine modification of RNA (m6A) is the most abundant internal modification on RNA transcripts in eukaryotic cells. m6A located in 3′ UTRs can promote the translation of capped RNAs (Helm and Motorin, 2017). The RNA EPItranscriptome Collection (REPIC) database and the m6A-Atlas database are two commonly used m6A modification resources. We downloaded and merged the m6A profiles for humans, mice, Arabidopsis, chimpanzees, fruit flies, rats, yeast, and zebrafish from these two databases and mapped them to the 3′ UTRs of lncRNAs. Mapped m6A modification sites are used to support lncRNA translation.
The Pfam database (Mistry et al., 2021) is a large collection of protein families and is widely used to analyse novel genomes and proteins. Thus, we downloaded the Pfam datasets and applied hmmsearch to search all the predicted lncRNA peptides, and an E-value < 0.0001 was used as the cut-off.
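The E-value cut-off described above can be applied when post-processing hmmsearch results. The Python sketch below assumes the search was run with the `--tblout` option, whose per-sequence table lists the target (sequence) name in the first column, the query (profile) name in the third, and the full-sequence E-value in the fifth. The function name and the two example rows are made up for illustration, and the paper does not say whether the full-sequence or the per-domain E-value was used.

```python
from typing import Iterable, List, Tuple


def filter_tblout(lines: Iterable[str],
                  max_evalue: float = 1e-4) -> List[Tuple[str, str, float]]:
    """Keep (peptide, Pfam family, E-value) rows with full-sequence E-value below the cut-off."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        fields = line.split()
        peptide, family, evalue = fields[0], fields[2], float(fields[4])
        if evalue < max_evalue:
            hits.append((peptide, family, evalue))
    return hits


if __name__ == "__main__":
    # Two made-up rows in the --tblout column layout (not real search output).
    example = [
        "# target name  accession  query name  accession  E-value  score  bias ...",
        "pep_1 - DomainA PFXXXXX.1 1.2e-10 40.1 0.1 1.5e-10 39.8 0.1 1.0 1 0 0 1 1 1 1 -",
        "pep_2 - DomainB PFXXXXX.1 0.003 12.0 0.0 0.004 11.5 0.0 1.0 1 0 0 1 1 1 0 -",
    ]
    for hit in filter_tblout(example):
        print(hit)
```

With these made-up rows, only `pep_1` passes the 0.0001 threshold.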

Peptide Sequence Prediction
The potential peptide sequences translated from candidate lncRNAs were predicted by using Open Reading Frame (ORF) Finder, which searches for ORFs in the DNA sequences of lncRNAs of interest (Wheeler et al., 2003). In particular, ORF Finder performs a six-frame translation of the DNA sequences of interest and returns candidate ORF sequences. If predicted peptides overlapped, we retained the longer one. Both ATG and non-ATG start codon parameters were applied in ORF prediction, as non-ATG codons have been shown to be an important group of translation initiation sites (Ingolia et al., 2011; Lee et al., 2012).
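To make the ORF search concrete, the Python sketch below scans the three forward reading frames of a sequence, keeps ORFs of at least 10 aa, and retains the longest ORF among overlapping candidates. It is a simplified stand-in and not ORF Finder itself: it omits the reverse strand, the set of start codons is a parameter (ATG by default; which non-ATG codons the authors allowed is not stated), the initiator is written as M regardless of the start codon, and all function names are illustrative.

```python
from typing import List, Sequence, Tuple

# Standard genetic code in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, starts: Sequence[str] = ("ATG",),
              min_aa: int = 10) -> List[Tuple[int, int]]:
    """Return forward-strand ORFs as 0-based half-open (start, end) spans, stop codon included."""
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                # Close the ORF opened by the first start codon since the last stop.
                if start is not None and (i - start) // 3 >= min_aa:
                    orfs.append((start, i + 3))
                start = None
            elif start is None and codon in starts:
                start = i
    return orfs


def keep_longest_nonoverlapping(orfs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Greedily keep the longest ORFs, dropping any that overlap one already kept."""
    kept: List[Tuple[int, int]] = []
    for s, e in sorted(orfs, key=lambda o: (-(o[1] - o[0]), o[0])):
        if all(e <= ks or s >= ke for ks, ke in kept):
            kept.append((s, e))
    return sorted(kept)


def translate(seq: str, orf: Tuple[int, int]) -> str:
    """Translate an ORF span without its stop codon; unknown codons become X."""
    s, e = orf
    cds = seq.upper().replace("U", "T")[s:e - 3]
    peptide = "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3))
    return "M" + peptide[1:]  # assumption: the initiator is always written as M


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
    for orf in keep_longest_nonoverlapping(find_orfs(toy)):
        print(orf, translate(toy, orf))  # prints (2, 38) MAAAAAAAAAA
```

Passing, for example, `starts=("ATG", "CTG", "GTG", "TTG")` would also admit near-cognate start codons; that particular list is an assumption rather than the setting used for LncPep.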

Calculation of Peptide-Encoding Scores of lncRNA
We defined a peptide-encoding score to quantitatively assess the translation potential of each lncRNA as the sum of the CPAT, CPC2, m6A, Pfam, Ribo-seq, and TIS scores:

\[ \text{Peptide-encoding score} = S(\mathrm{CPAT}) + S(\mathrm{CPC2}) + S(\mathrm{m6A}) + S(\mathrm{Pfam}) + S(\text{Ribo-seq}) + S(\mathrm{TIS}) \]

For m6A, Pfam, Ribo-seq, and TIS, the score was defined as 1 if at least one sample or sequence mapped to the related peptide and as 0 if none mapped. For example, the score of m6A is

\[ S(\mathrm{m6A}) = \begin{cases} 1, & \text{if at least one m6A site mapped} \\ 0, & \text{otherwise} \end{cases} \]

and the scores of Pfam, Ribo-seq, and TIS are defined in the same way. For CPAT and CPC2, the scores were based on the coding probabilities provided by these two algorithms. The score of CPAT is

\[ S(\mathrm{CPAT}) = \mathrm{CPAT}(\text{coding probability}) \]
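The scoring rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function and parameter names are invented, the four mapping-based lines of evidence are passed as hit counts and converted to 0 or 1, and the CPC2 score is taken to be its coding probability in the same way as the CPAT score.

```python
def peptide_encoding_score(cpat_prob: float, cpc2_prob: float,
                           m6a_hits: int = 0, pfam_hits: int = 0,
                           ribo_hits: int = 0, tis_hits: int = 0) -> float:
    """Sum two coding probabilities and four binary (0/1) evidence scores."""
    for name, prob in (("cpat_prob", cpat_prob), ("cpc2_prob", cpc2_prob)):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    # Each mapping-based line of evidence contributes 1 if anything mapped, else 0.
    binary = sum(1 for hits in (m6a_hits, pfam_hits, ribo_hits, tis_hits) if hits > 0)
    return cpat_prob + cpc2_prob + binary


if __name__ == "__main__":
    # Hypothetical peptide: two coding probabilities plus Ribo-seq and TIS support.
    print(peptide_encoding_score(0.5, 0.25, ribo_hits=3, tis_hits=1))  # prints 2.75
```

Because each of the six terms lies between 0 and 1, the combined score ranges from 0 to 6, and the number of mapped reads or sites beyond the first does not change it.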