ORIGINAL RESEARCH article

Front. Bioinform., 26 May 2025

Sec. RNA Bioinformatics

Volume 5 - 2025 | https://doi.org/10.3389/fbinf.2025.1585794

CytoLNCpred-a computational method for predicting cytoplasm associated long non-coding RNAs in 15 cell-lines

  • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India


Abstract

The function of a long non-coding RNA (lncRNA) is largely determined by its specific location within a cell. Previous methods have relied on noisy datasets, including mRNA transcripts in tools intended for lncRNAs, and have excluded lncRNAs lacking significant differential localization between the cytoplasm and nucleus. To overcome these shortcomings, we developed a method for predicting cytoplasm-associated lncRNAs in 15 human cell-lines, identifying which lncRNAs are more abundant in the cytoplasm than in the nucleus. All models in this study were trained using five-fold cross validation and tested on a validation dataset. First, we developed machine and deep learning based models using traditional features such as composition and correlation. With composition and correlation based features, machine learning algorithms achieved average AUCs of 0.7049 and 0.7089, respectively, across the 15 cell-lines. Second, we developed machine learning models using embedding features obtained from the large language model DNABERT-2; this approach achieved an average AUC of 0.665 across all cell-lines. Finally, we fine-tuned DNABERT-2 on our training dataset and evaluated the fine-tuned model on the validation dataset, where it achieved an average AUC of 0.6336. For predicting differential lncRNA localization, correlation-based features combined with ML algorithms therefore outperform LLM-based models. These cell-line specific models as well as a web-based service are available to the public from our web server (https://webs.iiitd.edu.in/raghava/cytolncpred/).

Highlights

• Prediction of cytoplasm-associated lncRNAs in 15 human cell lines

• Machine learning using composition and correlation features

• DNABERT-2 embeddings for lncRNA localization prediction

• Correlation-based models outperform LLM-based models

• Web server and models available for public use

Introduction

The rapidly expanding field of non-coding RNAs has revolutionized our understanding of gene regulation and cell biology. Among the diverse classes of non-coding RNAs, long non-coding RNAs (lncRNAs) have attracted significant attention due to their ability to regulate gene expression at various levels. Initially dismissed as transcriptional noise, lncRNAs have emerged as critical players in cellular processes, including development, differentiation, and disease progression (Statello et al., 2021). To fully comprehend the functional roles of lncRNAs, it is imperative to investigate their subcellular localization. lncRNAs have distinct functions in the nucleus and cytoplasm, influencing transcriptional and posttranscriptional processes. In the nucleus, lncRNAs regulate gene expression and chromatin organization, while in the cytoplasm, they participate in signal transduction and translation. Some lncRNAs exhibit dual localization and functional diversification, reflecting their adaptability to different subcellular environments (Miao et al., 2019; Aillaud and Schulte, 2020; Mattick et al., 2023).

In recent years, extensive research efforts have been focused on deciphering the subcellular localization of lncRNAs. Various experimental approaches, such as fluorescence in situ hybridization (FISH) (Chang et al., 2023), RNA sequencing (RNA-seq) (Mayer and Churchman, 2017), and fractionation techniques (Miao et al., 2019), have been employed to identify the subcellular localization patterns of lncRNAs. These studies have revealed that lncRNAs can be localized in different cellular compartments, including the nucleus, cytoplasm, nucleolus, and specific subcellular structures. The subcellular localization of lncRNAs is often associated with their biological functions. For instance, nuclear-localized lncRNAs are frequently involved in transcriptional regulation, chromatin remodeling, and epigenetic modifications. Cytoplasmic lncRNAs, on the other hand, can interact with proteins or act as competitive endogenous RNAs (ceRNAs) to regulate gene expression post-transcriptionally (Bridges et al., 2021). However, most of these methods are expensive to perform and require highly specialized instrumentation.

Advancements in computational methods and machine learning approaches have further facilitated the prediction of lncRNA subcellular localization. These methods leverage various features, such as sequence composition, secondary structure, and evolutionary conservation, to predict the subcellular localization of lncRNAs with high accuracy. Several computational methods have been proposed for predicting lncRNA subcellular localization. Sequence-based methods rely on the nucleotide composition of the lncRNA. They utilize features such as k-mer frequency, nucleotide composition, and sequence motifs. However, these methods are trained on datasets that are not unique to humans, and they do not account for the variation in the subcellular localization of lncRNA in different cells.

Cell-line-specific prediction of subcellular localization is important because lncRNAs can vary in their subcellular localization across cell-lines. This was reported by Lin et al. in lncLocator 2.0, where a single lncRNA was observed to localize differently in different cell-lines (Lin et al., 2021). We observed a similar trend in our dataset, where some lncRNAs were localized in the nucleus in some cell-lines but in the cytoplasm in others. This pattern can be seen clearly in Figure 1.

FIGURE 1

Bubble plot indicating the variability of localization of a single lncRNA across multiple cell-lines.

lncLocator 2.0 is a cell-line-specific subcellular localization predictor that employs an interpretable deep-learning approach (Lin et al., 2021). TACOS, also a cell-line-specific subcellular localization predictor, uses tree-based algorithms along with various sequence compositional and physicochemical features (Jeon et al., 2022). Among all the existing computational methods, only lncLocator 2.0 and TACOS are designed to predict subcellular localization specific to different cell-lines. The primary issue with these methods is that the datasets used to develop these methods have not been properly filtered. Specifically, these methods have included mRNA sequences in their datasets, which can lead to inaccurate predictions. Additionally, the datasets have eliminated lncRNAs with an absolute fold-change less than 2, which can result in the failure to predict the subcellular location of lncRNAs with borderline concentration differences between locations.

To address the limitations of existing methods in a comprehensive manner, we have developed CytoLNCpred. In this study, we aimed to enhance the prediction accuracy compared to current tools, which have significant room for improvement. Furthermore, we have cleaned the dataset and adhered to industry standards to validate the performance of our method. In CytoLNCpred, a machine learning model trained using correlation-based features demonstrated significantly better performance on the validation dataset compared to existing tools.

Materials and methods

To aid in the development of a prediction model for lncRNA subcellular localization, we have designed a workflow, depicted in Figure 2. The comprehensive details of each phase in this workflow are outlined in the subsequent sections.

FIGURE 2

Overall architecture of CytoLNCpred.

Dataset creation

In this study, we have selected lncAtlas for acquiring cell-line specific subcellular localization information. lncAtlas is a comprehensive resource of lncRNA localization in human cells based on RNA-sequencing data sets (Mas-Ponte et al., 2017). lncAtlas contains a wide array of information, including the Cytoplasm to Nucleus Relative Concentration Index (CNRCI), which we have utilized in our method. CNRCI is defined as the log2-transformed ratio of RPKM (Reads Per Kilobase per Million mapped reads) in two samples, in this case the cytoplasm and nucleus. It is calculated as follows:

CNRCI = log2(RPKM_cytoplasm / RPKM_nucleus)

Sequence information for the lncRNAs was obtained from ENSEMBL database (version 112) and lncRNAs with no sequence were dropped. In order to modify the dataset for a classification problem, we assigned sequences having CNRCI value greater than 0 as Cytoplasm and those having CNRCI value less than 0 were assigned as Nucleus. Redundancy was removed using MeshClust (Girgis, 2022), using a sequence similarity of 90%. Figure 3 graphically depicts how the training and validation datasets were created.
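The CNRCI computation and the CNRCI-based labelling described above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not from lncAtlas):

```python
import math

def cnrci(rpkm_cytoplasm, rpkm_nucleus):
    """CNRCI: log2-transformed ratio of cytoplasmic to nuclear RPKM."""
    return math.log2(rpkm_cytoplasm / rpkm_nucleus)

def localization_label(value):
    """CNRCI > 0 is labelled Cytoplasm, CNRCI < 0 Nucleus."""
    if value > 0:
        return "Cytoplasm"
    if value < 0:
        return "Nucleus"
    return None  # a CNRCI of exactly 0 favours neither compartment

print(cnrci(8.0, 2.0))           # 2.0
print(localization_label(-1.3))  # Nucleus
```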

FIGURE 3

Graphical overview of the process of dataset creation.

Further, we used only sequences up to 10,000 nucleotides in length, as longer lncRNAs were misleading for the machine learning models and computationally very expensive when large language models were involved. A summary of the dataset used for each cell-line is provided in Table 1.

TABLE 1

Cell-lines | Original | Non-redundant dataset | After filtering out sequences >10,000 nt | Training dataset (complete) | Training dataset (nucleus) | Training dataset (cytoplasm) | Validation dataset (complete) | Validation dataset (nucleus) | Validation dataset (cytoplasm)
A549 | 1826 | 1821 | 1131 | 904 | 547 | 357 | 227 | 137 | 90
H1.hESC | 4194 | 4178 | 2552 | 2041 | 1224 | 817 | 511 | 307 | 204
HeLa.S3 | 1142 | 1141 | 703 | 562 | 470 | 92 | 141 | 118 | 23
HepG2 | 1703 | 1699 | 1029 | 823 | 598 | 225 | 206 | 150 | 56
HT1080 | 1183 | 1180 | 690 | 552 | 314 | 238 | 138 | 78 | 60
HUVEC | 1870 | 1859 | 1137 | 909 | 659 | 250 | 228 | 165 | 63
MCF.7 | 2714 | 2703 | 1702 | 1361 | 1028 | 333 | 341 | 257 | 84
NCI.H460 | 772 | 769 | 460 | 368 | 304 | 64 | 92 | 76 | 16
NHEK | 1383 | 1378 | 755 | 604 | 450 | 154 | 151 | 112 | 39
SK.MEL.5 | 694 | 691 | 380 | 304 | 242 | 62 | 76 | 60 | 16
SK.N.DZ | 762 | 759 | 422 | 337 | 205 | 132 | 85 | 52 | 33
SK.N.SH | 2086 | 2077 | 1211 | 968 | 715 | 253 | 243 | 179 | 64
GM12878 | 2136 | 2128 | 1286 | 1028 | 788 | 240 | 258 | 198 | 60
K562 | 1197 | 1191 | 729 | 583 | 414 | 169 | 146 | 104 | 42
IMR.90 | 497 | 496 | 314 | 251 | 133 | 118 | 63 | 34 | 29

Detailed summary of the dataset used in the study, including the total number of samples for each cell-line in the source database and the final non-redundant dataset.

Feature generation - Composition and correlation-based

To facilitate the training of machine learning (ML) models, we generated a large variety of features using different approaches. These features convert variable-length nucleotide sequences into fixed-length numeric vectors. We used the in-house tool Nfeature (Mathur et al., 2021) to generate multiple composition and correlation features.

Composition-based

Nucleotide composition-based features refer to quantitative representations of sequences that can be derived from the proportions and arrangements of nucleotides within these sequences. In this study, we have computed nucleic acid composition, distance distribution of nucleotides (DDN), nucleotide repeat index (NRI), pseudo composition and entropy of a sequence. The details for each of the features are provided in Table 2.

TABLE 2

Feature name | Type of descriptor | No. of descriptors
Nucleic acid Composition | Di-Nucleotide | 16
Reverse complement K-Mer composition | Di-Nucleotide | 10
Nucleotide Repeat Index | Mono-Nucleotide | 4
Entropy | Sequence-level | 1
Entropy | Nucleotide-level | 4
Distance Distribution | Mono-Nucleotide | 4
Pseudo Composition | Pseudo Di-nucleotide | 19
Pseudo Composition | Pseudo Tri-nucleotide | 65
Total number of descriptors | | 123

Overview of the composition-based features generated using Nfeature.
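As an illustration of the simplest composition descriptor in Table 2, the 16-dimensional dinucleotide composition can be computed as below (a sketch of the idea only; the actual feature vectors in this study were generated with Nfeature):

```python
from itertools import product

def dinucleotide_composition(seq):
    """Return the 16-dimensional vector of overlapping dinucleotide
    frequencies, a fixed-length representation of a variable-length
    nucleotide sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if dimer in counts:
            counts[dimer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in kmers]

vec = dinucleotide_composition("ACGTACGT")
print(len(vec))  # 16
```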

Correlation-based features

In this study, using Nfeature, we quantitatively assess the interdependent characteristics inherent in nucleotide sequences through the computation of correlation-based metrics. Correlation refers to the degree of relationship between distinct properties or features; an autocorrelation denotes the association of a feature with itself, whereas a cross-correlation indicates a linkage between two separate features. By employing these correlation-based descriptors, we effectively normalize the variable-length nucleotide sequences into uniform-length vectors, rendering them amenable to analysis via machine learning algorithms. These specific descriptors facilitate the identification and extraction of significant features predicated upon the nucleotide properties distributed throughout the sequence, enabling a more robust understanding of genetic information. A brief description of the features has been provided in Table 3.

TABLE 3

Feature name | Type of descriptor | Number of descriptors
Cross Correlation | Trinucleotide Cross Correlation | 264
Auto-Cross Correlation | Auto Dinucleotide - Cross Correlation | 288
Auto-Cross Correlation | Auto Trinucleotide - Cross Correlation | 288
Auto Correlation | Tri-Nucleotide | 24
Auto Correlation | Normalized Moreau-Broto | 24
Auto Correlation | Dinucleotide Moran | 24
Auto Correlation | Dinucleotide Geary | 24
Pseudo Correlation | Serial Correlation Pseudo Trinucleotide Composition | 65
Pseudo Correlation | Serial Correlation Pseudo Dinucleotide Composition | 17
Pseudo Correlation | Parallel Correlation Pseudo Trinucleotide Composition | 65
Pseudo Correlation | Parallel Correlation Pseudo Dinucleotide Composition | 17
Total number of descriptors | | 1100

Overview of the correlation-based features generated using Nfeature.

The total number of descriptors generated using both composition and correlation-based features is 1223. A detailed explanation of the features and their biological implications is provided in Supplementary Table S1. The properties used to calculate the correlation-based features are provided in Supplementary Table S2.
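To make the correlation descriptors concrete, the following sketch computes a Moreau-Broto-style autocorrelation of a dinucleotide property along a sequence (the property values here are toy placeholders, not the real physicochemical scales; the actual features were generated with Nfeature):

```python
def moreau_broto_autocorrelation(seq, prop, lags=(1, 2, 3)):
    """Normalized Moreau-Broto-style autocorrelation: for each lag d,
    average the product of a dinucleotide property at positions i and
    i+d along the sequence, giving one descriptor per lag."""
    values = [prop[seq[i:i + 2]] for i in range(len(seq) - 1)
              if seq[i:i + 2] in prop]
    result = []
    for d in lags:
        n = len(values) - d
        result.append(sum(values[i] * values[i + d] for i in range(n)) / n
                      if n > 0 else 0.0)
    return result

# Toy property values for a few dinucleotides (placeholders only).
toy_prop = {"AC": 1.0, "CG": -1.0, "GT": 0.5, "TA": -0.5}
print(moreau_broto_autocorrelation("ACGTACG", toy_prop, lags=(1,)))  # [-0.65]
```

Because each lag yields one number regardless of sequence length, descriptors of this kind turn variable-length lncRNAs into uniform-length vectors, as described above.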

Embedding using DNABERT-2

DNABERT-2 is an adaptation of BERT (Bidirectional Encoder Representations from Transformers) designed specifically for DNA sequence analysis (Zhou et al., 2023). DNABERT-2 generates embeddings for DNA sequences that encapsulate not just the individual bases, but also their biological significance in terms of structure, function, and interactions. A key advantage of DNABERT-2 embeddings lies in their ability to capture the complex dependencies within DNA sequences. We have made use of both aspects of DNABERT-2: the pre-trained model for making predictions, and the embeddings from the model as features for downstream tasks. The pre-trained model was fine-tuned on our training dataset using the default parameters mentioned in its GitHub repository. We also generated embeddings from both the pre-trained and the fine-tuned models for use as features for machine learning algorithms. The embeddings are derived from the hidden states of the model's final output layer using max pooling; this yields 768 embedding values for each lncRNA.
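The max-pooling step described above can be illustrated independently of the model itself: given the final-layer hidden states for a tokenized sequence, element-wise max pooling collapses them into a single 768-dimensional embedding (a NumPy stand-in for the DNABERT-2 output tensor; the token count of 120 is arbitrary):

```python
import numpy as np

def max_pool_embedding(hidden_states):
    """Collapse per-token final-layer hidden states (seq_len x 768)
    into one fixed-size sequence embedding by taking the element-wise
    maximum over the token dimension."""
    return hidden_states.max(axis=0)

# Stand-in for a DNABERT-2 output tensor: 120 tokens x 768 hidden units.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(120, 768))
embedding = max_pool_embedding(hidden)
print(embedding.shape)  # (768,)
```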

Five-fold cross validation

To estimate the performance of the machine learning based models during training, we employed five-fold cross validation. In this method, the training dataset is split into five folds in a stratified manner; the model is trained on four folds and the remaining fold is used for validation. This process is repeated five times, changing the fold used for validation each time. This yields five unbiased sets of performance metrics, and the performance of the model is reported as their mean.
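The stratified splitting behind five-fold cross validation can be sketched as follows (a simplified round-robin illustration, not the exact implementation used in this study):

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Assign example indices to k folds round-robin within each
    class, so every fold preserves the overall class ratio. Fold i
    serves as the validation set while the other k-1 folds train."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = ["Cytoplasm"] * 60 + ["Nucleus"] * 40
folds = stratified_kfold(labels, k=5)
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

Each fold here keeps the 60:40 class ratio of the full dataset, which is the point of stratification.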

Feature selection

To optimize model performance and reduce computational complexity, we employed the Minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm to identify the most informative features in our dataset. The mRMR algorithm selects features that exhibit the highest relevance to the target variable while minimizing redundancy among the features themselves, thereby enhancing the efficiency and predictive power of the models. Before applying the mRMR algorithm, we combined all the previously generated features: the three feature sets (composition, correlation and embeddings) were merged into a combined set comprising 2278 features. We evaluated the impact of feature selection by selecting subsets of 10, 50, 100, 500, 1000, 1500, and 2000 features. These subsets were subsequently used in downstream analyses to assess their influence on model performance metrics. Feature importance was also evaluated using simple correlation.
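A minimal sketch of the greedy mRMR idea, using absolute Pearson correlation for both relevance and redundancy (an illustration of the principle, not the exact mRMR implementation used here):

```python
import numpy as np

def mrmr(X, y, n_select):
    """Greedy mRMR sketch: repeatedly pick the feature whose relevance
    to the target (absolute Pearson correlation) minus its mean
    redundancy with the already-selected features is highest."""
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = X[:, 2] + 0.1 * rng.normal(size=100)  # feature 2 drives the target
print(mrmr(X, y, 3)[0])  # 2
```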

Model development

In this study, three different approaches were followed for model development. The first approach involves fine-tuning DNABERT-2 on our training dataset and subsequently using the fine-tuned model to make predictions on the validation dataset. This method fine-tunes both the tokenizer and the pre-trained model on our training dataset; the fine-tuned model then takes lncRNA sequences and generates predictions using this tokenizer and model. In the second, hybrid approach, we combined components of large language models and machine learning algorithms: instead of hand-crafted features, we generated embeddings from the pre-trained as well as the fine-tuned DNABERT-2 model, used them to train machine learning models, and subsequently evaluated those models. The third approach uses composition and correlation-based features to train machine learning models. The final model for each cell-line was developed using the best-performing combination of features and machine learning algorithm.

Model evaluation metrics

The binary classification performance of our fine-tuned model was evaluated using the following metrics: Sensitivity (SENS), Specificity (SPEC), Precision (PREC), Accuracy (ACC), Matthew's Correlation Coefficient (MCC), F1-Score (F1) and Area Under the Receiver Operator Characteristic curve (AUC). The aforementioned metrics were calculated using the four different types of prediction outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN):

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
F1 = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)

The evaluation of our binary classification model using various metrics provides critical insights into its performance. Sensitivity (SENS) measures the model’s ability to identify positive instances, while Specificity (SPEC) assesses its accuracy in recognizing negative instances. Precision (PREC) reflects the accuracy of positive predictions, and Accuracy (ACC) offers an overall measure of correctness, though it may be misleading in imbalanced datasets. Matthew’s Correlation Coefficient (MCC) provides a balanced view by considering all prediction outcomes, with values close to 1 indicating strong predictive capability. The F1-Score (F1) combines Precision and Sensitivity into a single metric, ideal for balancing the trade-off between false positives and negatives. Finally, the Area Under the Curve (AUC) evaluates the model’s ability to distinguish between classes across different thresholds, with higher values indicating better performance. Together, these metrics enable a comprehensive evaluation of the model, guiding necessary improvements and refinements.
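The metrics above follow directly from the four confusion-matrix counts and can be computed as follows (a reference sketch with an illustrative set of counts):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the standard binary-classification metrics from the
    four confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = (2 * prec * sens / (prec + sens)) if (prec + sens) else 0.0
    return {"SENS": sens, "SPEC": spec, "PREC": prec,
            "ACC": acc, "MCC": mcc, "F1": f1}

metrics = classification_metrics(tp=40, fp=10, tn=35, fn=15)
print(round(metrics["SENS"], 3), round(metrics["ACC"], 3))  # 0.727 0.75
```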

Results

In this study, an attempt was made to design a model that will be able to classify the subcellular location of lncRNA into cytoplasm or nucleus. To achieve this, we tried out multiple approaches. Figure 4 provides an overview of the performance of the various approaches tried in this study.

FIGURE 4

Overview of the performance achieved by different prediction strategies. The values indicate the average MCC and AUC across all the 15 cell-lines for a prediction strategy.

Functional enrichment analysis

The GO and KEGG enrichment analysis, conducted using RNAenrich (Zhang et al., 2023), reveals distinct functional roles for cytoplasmic versus nuclear-localizing lncRNAs. Among the significantly enriched GO terms (adjusted p-value <0.05), 2,511 were shared, while 397 were unique to cytoplasmic lncRNAs (positive class) and 254 to nuclear lncRNAs (negative class). Cytoplasmic lncRNAs were enriched for biological processes such as “response to interferon-beta” (GO:0035456), “positive regulation of apoptotic process” (GO:0043065), and “RNA splicing” (GO:0008380), indicating roles in immune signaling, post-transcriptional regulation, and cellular stress responses. Correspondingly, KEGG pathway enrichment identified associations with Ferroptosis (hsa04216) and Autophagy (hsa04140), further highlighting their involvement in cytoplasmic stress and degradation pathways. In contrast, nuclear-localized lncRNAs were enriched for GO terms such as “eukaryotic 48S preinitiation complex” (GO:0033290), “regulation of transcription of nucleolar large rRNA by RNA polymerase I” (GO:1901836), and “MLL1/2 complex” (GO:0044665), reflecting their roles in transcriptional regulation, chromatin remodeling, and nucleolar function. KEGG analysis further linked nuclear lncRNAs to Sterol Biosynthesis (hsa00100) and Nucleotide Excision Repair (hsa03420), pointing to nuclear functions in genome maintenance and metabolic regulation. Together, these enrichments underscore the compartment-specific biological functions of lncRNAs, shaped by their cellular localization.

Feature importance

To identify features associated with subcellular localization labels (cytoplasm or nucleus), we computed the correlation of each feature with the corresponding CNRCI values. A positive correlation indicates that an increase in the feature value favors cytoplasmic localization, whereas a negative correlation suggests a preference for nuclear localization. This analysis elucidates which features predominantly influence localization to either compartment. The top 10 features most highly correlated with the CNRCI values are provided in Tables 4, 5 for composition-based and correlation-based features, respectively. More detailed versions of these tables are provided in Supplementary Tables S3, S4. It can be observed that cytosine-based k-mers are more prevalent among the positively correlated features (supporting cytoplasmic localization), whereas thymine is predominantly found in the negatively correlated features (supporting nuclear localization).

TABLE 4

Cell-line | Top 5 positively correlated features | Top 5 negatively correlated features
SK.N.DZ | CDK_CGA, CDK_TCG, CDK_AAC, CDK_ACG, CDK_CG | CDK_TGG, CDK_TG, CDK_CTG, CDK_GGG, CDK_GGT
HeLa.S3 | CDK_TAC, CDK_AA, CDK_AAT, CDK_A, CDK_TA | CDK_GG, CDK_G, CDK_GGG, CDK_GGC, CDK_CTG
HUVEC | CDK_CG, CDK_GCG, CDK_CGG, CDK_CGA, CDK_CGC | CDK_TG, CDK_CAT, CDK_T, CDK_TGA, CDK_ATG
NHEK | CDK_GCG, CDK_CGC, CDK_CCG, CDK_CG, CDK_CGG | CDK_TG, CDK_TGT, CDK_GTG, CDK_GT, CDK_TCT
GM12878 | CDK_CGA, CDK_ACG, CDK_CG, CDK_GCG, CDK_TCG | CDK_CAT, CDK_TCA, CDK_TAT, CDK_AT, CDK_TTC
IMR.90 | CDK_CGA, CDK_GCG, CDK_CG, CDK_CGG, CDK_CGC | CDK_TG, CDK_TGG, CDK_CTG, CDK_CCT, CDK_CT
A549 | CDK_CGA, CDK_AAA, CDK_AA, CDK_GCG, CDK_GAA | CDK_CTG, CDK_CA, CDK_CAG, CDK_TG, CDK_CCA
MCF.7 | CDK_CGA, CDK_CG, CDK_GCG, CDK_CGC, CDK_CGG | CDK_TG, CDK_ATG, CDK_CAT, CDK_TGT, CDK_T
NCI.H460 | CDK_AAC, CDK_CGA, CDK_CAA, CDK_ACG, CDK_CGT | CDK_TGG, CDK_TG, CDK_GGG, CDK_GTG, CDK_GGT
SK.MEL.5 | CDK_CGC, CDK_CG, CDK_CCG, CDK_ACG, CDK_CGA | CDK_AGT, CDK_GAT, CDK_TGA, CDK_TG, CDK_TTG
H1.hESC | CDK_TA, CDK_TTA, CDK_TAA, CDK_AAT, CDK_AA | CDK_CTG, CDK_CAG, CDK_C, CDK_CCA, CDK_CC
HT1080 | CDK_CGA, CDK_CG, CDK_GCG, CDK_ACG, CDK_CGC | CDK_TG, CDK_T, CDK_ATG, CDK_TGT, CDK_TTT
K562 | CDK_CGA, CDK_CG, CDK_GCG, CDK_ACG, CDK_CGC | CDK_CAG, CDK_AG, CDK_CTG, CDK_TGG, CDK_CA
HepG2 | CDK_CGA, CDK_CG, CDK_TCG, CDK_GCG, CDK_CGC | CDK_TG, CDK_CT, CDK_CTG, CDK_CA, CDK_TGT
SK.N.SH | CDK_AA, CDK_AAC, CDK_AAA, CDK_TAA, CDK_CAA | CDK_CTG, CDK_TG, CDK_TGG, CDK_CAG, CDK_CCA

Composition based-features having the highest correlation with the CNRCI values for 15 cell-lines.

TABLE 5

Cell-line | Top 5 positively correlated features | Top 5 negatively correlated features
SK.N.DZ | TCC_p3_p4_lag1, TACC_p3_p4_lag1, TCC_p3_p11_lag1, TACC_p3_p11_lag1, TCC_p2_p3_lag2 | DACC_p1_p4_lag1, TCC_p3_p12_lag1, TACC_p3_p12_lag1, TCC_p12_p3_lag1, TACC_p12_p3_lag1
HeLa.S3 | DACC_p5_p3_lag1, PC_PTNC_TAG, SC_PTNC_TAG, PKNC_TAG, TAC_p1_lag1 | TCC_p1_p8_lag1, TACC_p1_p8_lag1, TCC_p8_p1_lag1, TACC_p8_p1_lag1, TCC_p2_p8_lag1
HUVEC | DACC_p7_lag1, DACC_p9_p8_lag1, DACC_p1_p7_lag1, DACC_p4_p2_lag1, DACC_p4_p8_lag1 | DACC_p9_p7_lag1, DACC_p1_p8_lag1, DACC_p4_p7_lag1, DACC_p7_p2_lag1, DACC_p7_p4_lag1
NHEK | DACC_p4_p2_lag1, DACC_p10_p2_lag1, DACC_p6_p2_lag1, DACC_p7_p12_lag1, DACC_p9_p4_lag1 | DACC_p7_p2_lag1, DACC_p4_p12_lag1, DACC_p6_p12_lag1, DACC_p9_p7_lag1, PC_PDNC_TT
GM12878 | TCC_p9_p3_lag1, TCC_p10_p3_lag1, TACC_p9_p3_lag1, TACC_p10_p3_lag1, TCC_p3_p9_lag1 | DACC_p9_p7_lag1, DACC_p4_p7_lag1, DACC_p7_p4_lag1, DACC_p7_p6_lag1, DACC_p6_p7_lag1
IMR.90 | DACC_p4_p2_lag1, DACC_p6_p2_lag1, DACC_p1_p7_lag1, DACC_p10_p2_lag1, MAC_p2_lag1 | DACC_p6_p12_lag1, DACC_p7_p2_lag1, DACC_p9_p7_lag1, DACC_p1_p2_lag1, DACC_p4_p12_lag1
A549 | DACC_p1_lag1, DACC_p4_lag1, DACC_p8_p9_lag2, TAC_p3_lag1, TACC_p3_lag1 | DACC_p9_p1_lag1, DACC_p4_p1_lag1, DACC_p8_p1_lag2, DACC_p1_p4_lag1, DACC_p7_p9_lag2
MCF.7 | DACC_p7_lag1, DACC_p1_p7_lag1, DACC_p9_p8_lag1, DACC_p4_p2_lag1, DACC_p4_p8_lag1 | DACC_p1_p8_lag1, DACC_p9_p7_lag1, DACC_p4_p7_lag1, DACC_p7_p2_lag1, DACC_p7_p4_lag1
NCI.H460 | DACC_p9_p4_lag1, DACC_p1_p7_lag1, DACC_p9_p6_lag1, DACC_p7_p12_lag1, DACC_p4_p6_lag1 | DACC_p1_p10_lag1, DACC_p1_p6_lag1, DACC_p1_p4_lag1, DACC_p4_p12_lag1, DACC_p10_p12_lag1
SK.MEL.5 | DACC_p9_p8_lag1, DACC_p4_p2_lag1, DACC_p9_p2_lag1, DACC_p10_p2_lag1, SC_PTNC_CGG | TCC_p3_p7_lag1, TACC_p3_p7_lag1, DACC_p9_p7_lag1, TCC_p7_p3_lag1, TACC_p7_p3_lag1
H1.hESC | NMBAC_p1_lag1, DACC_p3_p11_lag1, DACC_p1_lag1, DACC_p3_p5_lag1, PDNC_TC | PKNC_CTT, PKNC_CAT, SC_PTNC_CTT, PC_PTNC_CTT, SC_PTNC_CAT
HT1080 | DACC_p4_p2_lag1, DACC_p1_p7_lag1, DACC_p10_p2_lag1, DACC_p7_lag1, DACC_p6_p2_lag1 | DACC_p9_p7_lag1, DACC_p1_p8_lag1, DACC_p7_p2_lag1, DACC_p1_p2_lag1, DACC_p4_p7_lag1
K562 | TCC_p9_p3_lag1, TCC_p10_p3_lag1, TACC_p9_p3_lag1, TACC_p10_p3_lag1, TCC_p3_p9_lag1 | DACC_p1_p8_lag2, DACC_p1_p4_lag1, DACC_p1_p10_lag2, DACC_p4_p1_lag2, DACC_p9_p7_lag2
HepG2 | DACC_p7_p1_lag1, TAC_p3_lag1, TACC_p3_lag1, DACC_p4_lag1, DACC_p1_p7_lag1 | DACC_p4_p7_lag1, DACC_p7_p4_lag1, DACC_p7_p6_lag1, DACC_p6_p7_lag1, DACC_p9_p7_lag1
SK.N.SH | DACC_p1_lag1, DACC_p1_p7_lag1, DACC_p9_p4_lag1, DACC_p8_p9_lag2, TCC_p11_p3_lag1 | DACC_p1_p4_lag1, DACC_p9_p1_lag1, DACC_p4_p1_lag1, DACC_p1_p6_lag1, TCC_p12_p3_lag1

Correlation-based features having the highest correlation with the CNRCI values for 15 cell-lines.

Additionally, we assessed feature variability across cell-lines by calculating the difference between the maximum and minimum correlation values observed for each feature across all 15 cell-lines. This approach highlights the features exhibiting the most pronounced inter-cell-line variation, providing insight into their potential biological or experimental variability. Figures 5, 6 show heatmaps of the most variable features and their correlation with the CNRCI values for composition and correlation-based features, respectively. The complete information for the variable features is provided in Supplementary Tables S5, S6.
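The max-minus-min computation described above can be sketched as follows (a hypothetical helper with toy correlation values, for illustration only):

```python
def correlation_range(corr_by_cell_line):
    """Given {cell_line: {feature: correlation with CNRCI}}, return
    each feature's max-minus-min correlation across cell-lines,
    sorted so the most variable features come first."""
    features = next(iter(corr_by_cell_line.values())).keys()
    ranges = {}
    for feat in features:
        vals = [corrs[feat] for corrs in corr_by_cell_line.values()]
        ranges[feat] = max(vals) - min(vals)
    return dict(sorted(ranges.items(), key=lambda kv: -kv[1]))

# Toy correlation values for two features in two cell-lines.
toy = {
    "HeLa.S3": {"CDK_CG": 0.30, "CDK_TG": -0.25},
    "K562":    {"CDK_CG": 0.10, "CDK_TG": 0.05},
}
print(list(correlation_range(toy))[0])  # CDK_TG
```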

FIGURE 5

Heatmap depicting the composition-based features that demonstrate the highest variation in their correlation with the CNRCI values within 15 cell-lines.

FIGURE 6

Heatmap depicting the correlation-based features that demonstrate the highest variation in their correlation with the CNRCI values within 15 cell-lines.

Model based on composition and correlation features

Composition and correlation features generated with Nfeature were used to train multiple ML models. We computed the performance of nine composition-based features and thirteen correlation-based features, and implemented all combinations of feature and ML model to identify which feature-model combination performs best. Composition features combined with classical ML methods achieved an average AUC of 0.7049 and an MCC of 0.1965 across the 15 cell-lines. Similarly, with correlation-based features and ML methods, the best performance achieved was an average AUC of 0.7089 and an MCC of 0.2133. The performance of the best-performing model for each of the 15 cell-lines is provided in Tables 6, 7 for composition and correlation-based features, respectively. The detailed performance of all models used in this analysis is provided in Supplementary Tables S7, S8. The model parameters are provided in Supplementary Table S12.

TABLE 6

Cell-lines | Feature used | ML model used | Sensitivity | Specificity | Precision | Accuracy | MCC | F1-score | AUC
A549 | RDK | RandomForestClassifier | 0.600 | 0.825 | 0.692 | 0.736 | 0.438 | 0.643 | 0.761
H1.hESC | CDK | RandomForestClassifier | 0.500 | 0.785 | 0.607 | 0.671 | 0.297 | 0.548 | 0.708
HeLa.S3 | PKNC | GaussianNB | 0.826 | 0.593 | 0.284 | 0.631 | 0.310 | 0.422 | 0.779
HepG2 | RDK | GaussianNB | 0.429 | 0.847 | 0.511 | 0.733 | 0.292 | 0.466 | 0.718
HT1080 | CDK | MLPClassifier | 0.517 | 0.769 | 0.633 | 0.659 | 0.296 | 0.569 | 0.728
HUVEC | PKNC | SVC | 0.032 | 0.964 | 0.250 | 0.706 | −0.011 | 0.056 | 0.756
MCF.7 | CDK | GaussianNB | 0.452 | 0.809 | 0.437 | 0.721 | 0.259 | 0.444 | 0.724
NCI.H460 | PKNC | SVC | 0.000 | 1.000 | 0.000 | 0.826 | 0.000 | 0.000 | 0.723
NHEK | CDK | SVC | 0.000 | 1.000 | 0.000 | 0.742 | 0.000 | 0.000 | 0.643
SK.MEL.5 | CDK | XGBClassifier | 0.063 | 0.950 | 0.250 | 0.763 | 0.023 | 0.100 | 0.676
SK.N.DZ | CDK | MLPClassifier | 0.545 | 0.692 | 0.529 | 0.635 | 0.237 | 0.537 | 0.679
SK.N.SH | PDNC | QuadraticDiscriminantAnalysis | 0.547 | 0.732 | 0.422 | 0.683 | 0.259 | 0.476 | 0.703
GM12878 | PKNC | GradientBoostingClassifier | 0.133 | 0.939 | 0.400 | 0.752 | 0.115 | 0.200 | 0.687
K562 | ALL_COMP | DecisionTreeClassifier | 0.452 | 0.788 | 0.463 | 0.692 | 0.243 | 0.458 | 0.620
IMR.90 | PKNC | AdaBoostClassifier | 0.448 | 0.735 | 0.591 | 0.603 | 0.192 | 0.510 | 0.668
Average | | | 0.370 | 0.829 | 0.405 | 0.704 | 0.197 | 0.362 | 0.705

Performance of the best ML model for each cell-line on the validation dataset using composition features.

TABLE 7

Cell-lines | Feature used | ML model used | Sensitivity | Specificity | Precision | Accuracy | MCC | F1-score | AUC
A549 | PC_PDNC | GradientBoostingClassifier | 0.567 | 0.803 | 0.654 | 0.709 | 0.381 | 0.607 | 0.757
H1.hESC | PDNC | RandomForestClassifier | 0.495 | 0.811 | 0.635 | 0.685 | 0.324 | 0.556 | 0.720
HeLa.S3 | PKNC | GaussianNB | 0.783 | 0.585 | 0.269 | 0.617 | 0.272 | 0.400 | 0.779
HepG2 | MAC | GaussianNB | 0.679 | 0.567 | 0.369 | 0.597 | 0.218 | 0.478 | 0.715
HT1080 | SC_PTNC | GaussianProcessClassifier | 0.583 | 0.769 | 0.660 | 0.688 | 0.359 | 0.619 | 0.738
HUVEC | PDNC | AdaBoostClassifier | 0.302 | 0.897 | 0.528 | 0.732 | 0.243 | 0.384 | 0.757
MCF.7 | SC_PDNC | SVC | 0.000 | 1.000 | 0.000 | 0.754 | 0.000 | 0.000 | 0.753
NCI.H460 | PDNC | SVC | 0.000 | 1.000 | 0.000 | 0.826 | 0.000 | 0.000 | 0.683
NHEK | PKNC | AdaBoostClassifier | 0.231 | 0.866 | 0.375 | 0.702 | 0.116 | 0.286 | 0.635
SK.MEL.5 | PDNC | GradientBoostingClassifier | 0.063 | 0.933 | 0.200 | 0.750 | −0.007 | 0.095 | 0.654
SK.N.DZ | TAC | SVC | 0.424 | 0.865 | 0.667 | 0.694 | 0.327 | 0.519 | 0.739
SK.N.SH | PC_PTNC | GaussianNB | 0.750 | 0.531 | 0.364 | 0.588 | 0.248 | 0.490 | 0.689
GM12878 | PC_PDNC | XGBClassifier | 0.350 | 0.899 | 0.512 | 0.771 | 0.288 | 0.416 | 0.715
K562 | DACC | AdaBoostClassifier | 0.405 | 0.808 | 0.459 | 0.692 | 0.221 | 0.430 | 0.642
IMR.90 | PC_PTNC | AdaBoostClassifier | 0.621 | 0.588 | 0.563 | 0.603 | 0.208 | 0.590 | 0.655
Average | | | 0.417 | 0.795 | 0.417 | 0.694 | 0.213 | 0.391 | 0.709

Performance of the best ML model for each cell-line on the validation dataset using correlation features.

Models based on embeddings from DNABERT-2

Embeddings from large language models are known to encapsulate not just the individual bases, but also their biological significance in terms of structure, function, and interactions. In this approach, we generated high-level representations of lncRNA sequences using both the pre-trained and the fine-tuned models. These embeddings were used to train ML models, which were then evaluated on the validation dataset. With pre-trained embeddings, the models achieved an average AUC of 0.6586 and an average MCC of 0.1182. When fine-tuned embeddings were used as features, the performance changed only marginally, with an average AUC of 0.6604 and an average MCC of 0.1740. Detailed results for the performance of ML models on the validation dataset using pre-trained and fine-tuned embeddings as features are provided in Tables 8, 9, respectively. The detailed performance of all models is reported in Supplementary Tables S9, S10.

TABLE 8

Cell-lines | ML model used | Sensitivity | Specificity | Precision | Accuracy | MCC | F1-score | AUC
A549 | SVC linear | 0.500 | 0.759 | 0.577 | 0.656 | 0.267 | 0.536 | 0.661
H1.hESC | MLP Classifier | 0.588 | 0.638 | 0.519 | 0.618 | 0.223 | 0.552 | 0.666
HeLa.S3 | XGBoost Classifier | 0.087 | 0.992 | 0.667 | 0.844 | 0.201 | 0.154 | 0.699
HepG2 | MLP Classifier | 0.000 | 1.000 | 0.000 | 0.728 | 0.000 | 0.000 | 0.706
HT1080 | MLP Classifier | 0.817 | 0.513 | 0.563 | 0.645 | 0.338 | 0.667 | 0.735
HUVEC | Random Forest Classifier | 0.032 | 0.982 | 0.400 | 0.719 | 0.041 | 0.059 | 0.690
MCF.7 | MLP Classifier | 0.012 | 0.984 | 0.200 | 0.745 | −0.013 | 0.022 | 0.657
NCI.H460 | Gradient Boosting Classifier | 0.063 | 1.000 | 1.000 | 0.837 | 0.228 | 0.118 | 0.599
NHEK | Gaussian Naive Bayes Classifier | 0.538 | 0.705 | 0.389 | 0.662 | 0.223 | 0.452 | 0.660
SK.MEL.5 | KNN | 0.063 | 0.967 | 0.333 | 0.776 | 0.061 | 0.105 | 0.638
SK.N.DZ | Gradient Boosting Classifier | 0.364 | 0.808 | 0.545 | 0.635 | 0.191 | 0.436 | 0.678
SK.N.SH | MLP Classifier | 0.469 | 0.709 | 0.366 | 0.646 | 0.166 | 0.411 | 0.618
GM12878 | Logistic Regression | 0.117 | 0.955 | 0.438 | 0.760 | 0.125 | 0.184 | 0.655
K562 | Logistic Regression | 0.071 | 0.923 | 0.273 | 0.678 | −0.009 | 0.113 | 0.609
IMR.90 | Random Forest Classifier | 0.345 | 0.824 | 0.625 | 0.603 | 0.193 | 0.444 | 0.699
Average | | 0.271 | 0.851 | 0.460 | 0.704 | 0.149 | 0.284 | 0.665

Performance of the best ML model for each cell-line on the validation dataset using embeddings from pre-trained DNABERT-2 model.

TABLE 9

| Cell-lines | ML model used | Sensitivity | Specificity | Precision | Accuracy | MCC | F1-score | AUC |
|---|---|---|---|---|---|---|---|---|
| A549 | SVC_linear | 0.489 | 0.708 | 0.524 | 0.621 | 0.200 | 0.506 | 0.660 |
| H1.hESC | SVC_linear | 0.422 | 0.801 | 0.585 | 0.650 | 0.241 | 0.490 | 0.687 |
| HeLa.S3 | MLP Classifier | 0.000 | 0.992 | 0.000 | 0.830 | −0.037 | 0.000 | 0.732 |
| HepG2 | Gaussian Naive Bayes Classifier | 0.571 | 0.767 | 0.478 | 0.714 | 0.321 | 0.520 | 0.708 |
| HT1080 | SVC_linear | 0.450 | 0.756 | 0.587 | 0.623 | 0.217 | 0.509 | 0.700 |
| HUVEC | MLP Classifier | 0.206 | 0.903 | 0.448 | 0.711 | 0.147 | 0.283 | 0.726 |
| MCF.7 | Gaussian Naive Bayes Classifier | 0.512 | 0.739 | 0.391 | 0.683 | 0.232 | 0.443 | 0.700 |
| NCI.H460 | XGBoost Classifier | 0.000 | 0.987 | 0.000 | 0.815 | −0.048 | 0.000 | 0.535 |
| NHEK | AdaBoost Classifier | 0.359 | 0.821 | 0.412 | 0.702 | 0.189 | 0.384 | 0.600 |
| SK.MEL.5 | SVC_radial | 0.000 | 1.000 | 0.000 | 0.789 | 0.000 | 0.000 | 0.539 |
| SK.N.DZ | Gradient Boosting Classifier | 0.515 | 0.692 | 0.515 | 0.624 | 0.207 | 0.515 | 0.717 |
| SK.N.SH | AdaBoost Classifier | 0.156 | 0.866 | 0.294 | 0.679 | 0.028 | 0.204 | 0.587 |
| GM12878 | Gradient Boosting Classifier | 0.183 | 0.939 | 0.478 | 0.764 | 0.182 | 0.265 | 0.683 |
| K562 | Gradient Boosting Classifier | 0.143 | 0.894 | 0.353 | 0.678 | 0.052 | 0.203 | 0.590 |
| IMR.90 | Gradient Boosting Classifier | 0.552 | 0.706 | 0.615 | 0.635 | 0.261 | 0.582 | 0.671 |
| Average | | 0.304 | 0.838 | 0.379 | 0.701 | 0.146 | 0.327 | 0.656 |

Performance of the best ML model for each cell-line on the validation dataset using embeddings from fine-tuned DNABERT-2 model.

Fine-tuned DNABERT-2 model

In this approach, we used our training dataset to fine-tune DNABERT-2, producing a fine-tuned tokenizer and model. Using these, we generated high-level representations of the lncRNA sequences, which the model then used directly to make predictions. The fine-tuned DNABERT-2 model could not be evaluated during training because we were unable to implement five-fold cross-validation in this setting. Instead, the fine-tuned model was evaluated on the validation dataset, and the performance metrics are reported in Table 10. The detailed performance of all models is provided in Supplementary Table S11.

TABLE 10

| Cell-lines | Sensitivity | Specificity | Precision | Accuracy | MCC | F1-score | AUC |
|---|---|---|---|---|---|---|---|
| A549 | 0.500 | 0.861 | 0.703 | 0.718 | 0.393 | 0.584 | 0.762 |
| H1.hESC | 0.520 | 0.713 | 0.546 | 0.636 | 0.235 | 0.533 | 0.644 |
| HeLa.S3 | 0.000 | 1.000 | 0.000 | 0.837 | 0.000 | 0.000 | 0.527 |
| HepG2 | 0.000 | 1.000 | 0.000 | 0.728 | 0.000 | 0.000 | 0.698 |
| HT1080 | 0.567 | 0.731 | 0.618 | 0.659 | 0.301 | 0.591 | 0.676 |
| HUVEC | 0.159 | 0.976 | 0.714 | 0.750 | 0.251 | 0.260 | 0.710 |
| MCF.7 | 0.000 | 1.000 | 0.000 | 0.754 | 0.000 | 0.000 | 0.562 |
| NCI.H460 | 0.000 | 1.000 | 0.000 | 0.826 | 0.000 | 0.000 | 0.576 |
| NHEK | 0.000 | 1.000 | 0.000 | 0.742 | 0.000 | 0.000 | 0.564 |
| SK.MEL.5 | 0.000 | 1.000 | 0.000 | 0.789 | 0.000 | 0.000 | 0.468 |
| SK.N.DZ | 0.515 | 0.885 | 0.739 | 0.741 | 0.439 | 0.607 | 0.791 |
| SK.N.SH | 0.000 | 1.000 | 0.000 | 0.737 | 0.000 | 0.000 | 0.589 |
| GM12878 | 0.000 | 1.000 | 0.000 | 0.767 | 0.000 | 0.000 | 0.718 |
| K562 | 0.190 | 0.885 | 0.400 | 0.685 | 0.099 | 0.258 | 0.570 |
| IMR.90 | 0.655 | 0.529 | 0.543 | 0.587 | 0.185 | 0.594 | 0.649 |
| Average | 0.207 | 0.905 | 0.284 | 0.730 | 0.127 | 0.228 | 0.634 |

Performance of fine-tuned DNABERT-2 model on the validation dataset.

Model based on features selected by mRMR algorithm

In order to identify the best set of features from the combined feature set of 2278 features (composition, correlation, and embeddings), the mRMR algorithm was used. Seven different feature sets were created based on the top ‘k’ features selected by mRMR. Performance was evaluated for these seven feature sets for each cell-line using 12 different ML classifiers. Table 11 reports the AUC of the best model for each combination, and the last row shows the average for each feature set. The best average AUC was obtained when the top 500 features selected by mRMR were used for training.
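The greedy selection loop behind mRMR can be sketched in a few lines. This is a minimal illustration, assuming Pearson correlation as both the relevance and redundancy measure (practical mRMR implementations often use mutual information instead); the toy feature names are hypothetical:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr(features, labels, k):
    """Greedy mRMR: at each step pick the feature maximising
    relevance to the label minus mean redundancy with features
    already selected."""
    remaining = list(features)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            rel = abs(pearson(features[name], labels))
            red = (sum(abs(pearson(features[name], features[s]))
                       for s in selected) / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

features = {
    "f1": [0, 0, 1, 1, 0, 0, 1, 0],  # informative feature
    "f2": [0, 0, 1, 1, 0, 0, 1, 0],  # exact duplicate of f1 (redundant)
    "f3": [0, 1, 1, 1, 0, 1, 1, 1],  # weaker but complementary signal
}
labels = [0, 0, 1, 1, 0, 0, 1, 1]
print(mrmr(features, labels, k=2))  # -> ['f1', 'f3']
```

Note how the redundancy penalty makes mRMR skip the duplicate `f2` in favour of the less correlated `f3`, which is exactly why it can outperform a plain top-k relevance ranking.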

TABLE 11

| Cell lines | Top 10 | Top 50 | Top 100 | Top 500 | Top 1000 | Top 1500 | Top 2000 |
|---|---|---|---|---|---|---|---|
| A549 | 0.756 | 0.75 | 0.749 | 0.741 | 0.739 | 0.738 | 0.738 |
| GM12878 | 0.725 | 0.706 | 0.706 | 0.727 | 0.742 | 0.734 | 0.719 |
| H1.hESC | 0.677 | 0.687 | 0.694 | 0.71 | 0.712 | 0.716 | 0.717 |
| HT1080 | 0.71 | 0.73 | 0.728 | 0.731 | 0.745 | 0.749 | 0.742 |
| HUVEC | 0.733 | 0.744 | 0.762 | 0.771 | 0.76 | 0.748 | 0.764 |
| HeLa.S3 | 0.669 | 0.728 | 0.756 | 0.734 | 0.718 | 0.776 | 0.779 |
| HepG2 | 0.736 | 0.722 | 0.725 | 0.72 | 0.691 | 0.673 | 0.685 |
| IMR.90 | 0.725 | 0.651 | 0.647 | 0.621 | 0.636 | 0.602 | 0.663 |
| K562 | 0.615 | 0.665 | 0.645 | 0.628 | 0.585 | 0.581 | 0.586 |
| MCF.7 | 0.706 | 0.728 | 0.73 | 0.701 | 0.715 | 0.721 | 0.711 |
| NCI.H460 | 0.535 | 0.514 | 0.533 | 0.638 | 0.553 | 0.593 | 0.568 |
| NHEK | 0.64 | 0.634 | 0.626 | 0.609 | 0.603 | 0.632 | 0.65 |
| SK.MEL.5 | 0.672 | 0.652 | 0.552 | 0.725 | 0.597 | 0.619 | 0.575 |
| SK.N.DZ | 0.671 | 0.664 | 0.675 | 0.704 | 0.705 | 0.715 | 0.707 |
| SK.N.SH | 0.683 | 0.684 | 0.685 | 0.669 | 0.683 | 0.684 | 0.682 |
| Average | 0.683 | 0.684 | 0.681 | 0.695 | 0.679 | 0.685 | 0.686 |

AUC values of the best ML model for each cell-line on the validation dataset for the feature sets generated by mRMR algorithm.

Performance comparison of CytoLNCpred and existing state-of-the-art classifiers

To further illustrate the efficacy of our method, we conducted a comparative analysis against other state-of-the-art classifiers. Specifically, we evaluated two existing predictors, lncLocator 2.0 and TACOS, which predict the subcellular location of lncRNAs in different cell-lines.

Among these predictors, lncLocator 2.0 relies on word embeddings and a multilayer perceptron regressor to predict CNRCI values; the predicted CNRCI values are then converted to labels using a fixed threshold. The second predictor, TACOS, generates a variety of feature encodings from composition and physicochemical properties and deploys tree-based algorithms to make predictions. It is important to note that TACOS has been trained on only 10 of the 15 cell-lines. For a fair comparison, we report the performance metrics of each tool evaluated on our validation dataset. Table 12 summarizes the evaluation of CytoLNCpred and the existing tools based on AUROC.
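For readers unfamiliar with the CNRCI, it is the log2 ratio of cytoplasmic to nuclear expression as defined in lncAtlas, so thresholding it yields a binary cytoplasm/nucleus label. The sketch below shows that conversion; the threshold of 0 is our illustrative assumption, not necessarily the fixed value used by lncLocator 2.0:

```python
import math

def cnrci(cytoplasmic_expr, nuclear_expr):
    """CN-RCI: log2 ratio of cytoplasmic to nuclear expression."""
    return math.log2(cytoplasmic_expr / nuclear_expr)

def label_from_cnrci(value, threshold=0.0):
    """A CN-RCI above the threshold indicates cytoplasmic enrichment."""
    return "cytoplasm" if value > threshold else "nucleus"

print(label_from_cnrci(cnrci(8.0, 2.0)))  # log2(4) = 2.0  -> cytoplasm
print(label_from_cnrci(cnrci(1.0, 4.0)))  # log2(0.25) = -2.0 -> nucleus
```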

TABLE 12

| Cell-lines | lncLocator 2.0 | TACOS | CytoLNCpred (composition features) | CytoLNCpred (correlation features) |
|---|---|---|---|---|
| A549 | 0.592 | 0.741 | 0.761 | 0.757 |
| H1.hESC | 0.649 | 0.752 | 0.708 | 0.720 |
| HeLa.S3 | 0.492 | 0.721 | 0.779 | 0.779 |
| HepG2 | 0.487 | 0.712 | 0.718 | 0.715 |
| HT1080 | 0.597 | 0.727 | 0.728 | 0.738 |
| HUVEC | 0.500 | 0.721 | 0.756 | 0.757 |
| MCF.7 | 0.530 | - | 0.724 | 0.753 |
| NCI.H460 | 0.500 | - | 0.723 | 0.683 |
| NHEK | 0.519 | 0.623 | 0.643 | 0.635 |
| SK.MEL.5 | 0.500 | 0.567 | 0.676 | 0.654 |
| SK.N.DZ | 0.500 | - | 0.679 | 0.739 |
| SK.N.SH | 0.508 | 0.633 | 0.703 | 0.689 |
| GM12878 | 0.500 | 0.568 | 0.687 | 0.715 |
| K562 | 0.500 | - | 0.620 | 0.642 |
| IMR.90 | 0.471 | - | 0.668 | 0.655 |
| Average | 0.523 | 0.676 | 0.705 | 0.709 |

Comparing the performance of our method and other existing classifiers using our validation dataset based on AUROC for all cell-lines.

mRNA localization prediction accuracy using CytoLNCpred

To assess the applicability of CytoLNCpred, a tool originally developed for lncRNA localization prediction, to mRNA sequences, we utilized mRNA data obtained from the lncAtlas database. These mRNA sequences were subjected to prediction using the standalone version of CytoLNCpred, and its performance was evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC). The AUROC values obtained for mRNA localization prediction across different cell lines, alongside the corresponding performance for lncRNA prediction, are presented in Figure 7.
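AUROC, the metric used throughout this evaluation, can be computed without tracing the ROC curve: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (the normalized Mann-Whitney U statistic), with ties counted as half. A self-contained sketch with toy scores:

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs where the
    positive outscores the negative; ties contribute 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auroc(labels, scores))  # -> 0.8888888888888888 (i.e., 8/9)
```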

FIGURE 7

Performance evaluation of CytoLNCpred in predicting the subcellular localization of mRNA and lncRNA sequences in various cell lines, as measured by the Area Under the Receiver Operating Characteristic curve (AUROC).

The results indicate that CytoLNCpred exhibits a varying degree of accuracy in predicting the subcellular localization of mRNA sequences across the tested cell lines. As illustrated in Figure 7, the predictive performance for mRNA localization differs depending on the cellular context. Notably, the highest prediction accuracy for mRNA was observed in the A549 cell line (AUROC = 0.800), suggesting a strong potential for the tool in this specific context. While the performance varied across different cell lines, with HUVEC showing the lowest AUROC (0.598), the overall results suggest that features learned by CytoLNCpred for lncRNA localization can also provide some discriminatory power for mRNA localization. Interestingly, in the A549 cell line, the prediction accuracy for mRNA even surpassed that observed for lncRNAs. However, in other cell lines like HUVEC and MCF-7, the performance on mRNA was notably lower compared to lncRNAs.

In order to further validate model accuracy for mRNAs, 10 random cytoplasmic mRNAs were obtained from the NCBI Gene database. These mRNAs were then predicted using CytoLNCpred for the A549 cell-line. The model correctly predicted only 2 of the 10 cytoplasmic mRNAs; detailed results are provided in Supplementary Table S13. This variability in performance across cell types highlights the potential influence of cell-specific factors on mRNA localization and suggests that further refinement or specialized models might be beneficial for broader applicability to mRNA.

Discussion

In recent years, researchers have recognized that the subcellular localization of lncRNAs plays a pivotal role in understanding their function. Unlike protein-coding genes, lncRNAs do not encode proteins directly. Instead, they exert their effects through diverse mechanisms, including interactions with chromatin, RNA molecules, and proteins. The precise localization of lncRNAs within the cell provides crucial information about their regulatory roles.

In our analysis, LINC00852 showed a marked cell-line-specific shift: predominantly nuclear (negative localization score) in most cell types, but strongly cytoplasmic in the NCI-H460 lung carcinoma line. This mirrors literature reports that LINC00852 can be cytoplasmically enriched in lung carcinoma. In lung carcinoma cell-lines (A549 and SPCA-1), LINC00852 was found mainly in the cytoplasm by qRT-PCR assay (Liu et al., 2018). It binds the S100A9 protein in the cytoplasm, activating the MAPK pathway and playing a positive role in the progression and metastasis of lung adenocarcinoma cells. By contrast, other studies have observed LINC00852 in the nucleus in some tumors. In osteosarcoma cell-lines (such as 143B and MG-63), LINC00852 acts as a transcription factor and increases the expression of the AXL gene (Li et al., 2020). Such context dependence could reflect tissue-specific expression of RNA-binding factors or isoforms that govern nuclear export. The fact that LINC00852 is cytoplasmic in some cancer lines but nuclear in others suggests it may switch roles: in its cytosolic form it may act as a post-transcriptional regulator (e.g., a miRNA sponge), whereas nuclear retention may imply transcriptional or chromatin-related roles.

SNHG3 in our data is mostly cytoplasmic in H1.hESC, HepG2 (liver carcinoma) and GM12878 (lymphoblastoid) cells, but nuclear in cell-lines like MCF-7 and NHEK. This pattern aligns with experimental studies. In colorectal cancer cell lines (e.g., SW480, LoVo), SNHG3 was found to localize predominantly in the cytoplasm, at levels comparable to GAPDH (Huang et al., 2017). There it acts as a competing endogenous RNA (ceRNA), sponging miRNAs (e.g., miR-182-5p) to upregulate oncogenic targets such as c-Myc. The high cytoplasmic SNHG3 expression in stem-cell-like and proliferative cell-lines (H1.hESC, HepG2) in our dataset suggests a similar ceRNA role, whereas its nuclear enrichment in more differentiated/epithelial cells may reflect downregulation of this pathway in those contexts. In general, SNHG-family lncRNAs are known to influence cancer cell growth and often operate via cytoplasmic post-transcriptional mechanisms, consistent with SNHG3’s localization and function in promoting malignancy.

Subcellular localization of lncRNA has gained prominence in recent times due to the role these transcripts play in gene regulation within the cell. A large number of aptamer- and ASO-based drugs are being developed using RNA nanotechnology. In recent years, the convergence of nanotechnology and long non-coding RNAs (lncRNAs) has yielded exciting developments in drug development. Nanoparticles, such as liposomes and exosomes, are being harnessed for targeted delivery of lncRNA-based therapeutics to cancer cells. Additionally, CRISPR-Cas9 technology, delivered via nanoparticles, enables precise gene editing by modulating lncRNA expression. Computational models and deep learning approaches are aiding our understanding of lncRNA-mediated mechanisms. Overall, this interdisciplinary field holds immense promise for personalized medicine, improved therapies, and better patient outcomes.

Predicting lncRNA subcellular localization using tools like CytoLNCpred offers significant potential for guiding the development of RNA-based therapeutics and CRISPR strategies. Since antisense oligonucleotides (ASOs) are generally more effective against nuclear lncRNAs (Zong et al., 2015) and small interfering RNAs (siRNAs) excel against cytoplasmic targets (Lennox and Behlke, 2016), a CytoLNCpred prediction indicating cytoplasmic enrichment would favor siRNA development, while a predicted nuclear localization would suggest ASOs as the primary choice. Similarly, this prediction informs CRISPR approaches: targeting nuclear-acting lncRNAs might be best achieved by disrupting transcription or key regulatory elements using CRISPRi or Cas9 (Rosenlund et al., 2021), whereas lncRNAs predicted to function in the cytoplasm could be more effectively targeted by degrading the transcript directly using RNA-targeting CRISPR-Cas13 (Xu et al., 2020), potentially guiding crRNA expression strategies (e.g., using a U1 promoter for cytoplasmic crRNA localization). Thus, localization prediction aids in the rational selection of therapeutic modalities and CRISPR targeting strategies based on the likely site of lncRNA function.
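The decision rule sketched in the paragraph above can be summarized in a small helper. The mapping follows the cited rules of thumb (cytoplasmic targets favor siRNA/Cas13, nuclear targets favor ASOs/CRISPRi-Cas9); the function name and return format are our own illustration, not part of CytoLNCpred:

```python
def suggest_modalities(predicted_localization):
    """Map a predicted lncRNA localization to candidate knockdown
    strategies, per the rule of thumb in the text (illustrative only;
    not a substitute for experimental validation)."""
    if predicted_localization == "cytoplasm":
        return {"oligo": "siRNA", "crispr": "Cas13 (RNA-targeting)"}
    if predicted_localization == "nucleus":
        return {"oligo": "ASO", "crispr": "CRISPRi/Cas9"}
    raise ValueError(f"unknown localization: {predicted_localization}")

print(suggest_modalities("cytoplasm")["oligo"])  # -> siRNA
print(suggest_modalities("nucleus")["oligo"])    # -> ASO
```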

In recent times, large language models are considered state-of-the-art (SOTA) methods, so apart from the classical composition- and correlation-feature-based models, we also implemented DNABERT-2 for our classification problem. The DNABERT-2 model has been trained on the genomes of a wide variety of species and is computationally very efficient. DNABERT-2 uses Byte Pair Encoding to generate tokens, which is known to perform better than k-mer tokenization. To fully exploit the DNABERT-2 model, we generated embeddings from both the pre-trained and the fine-tuned models. These embeddings, when combined with ML methods, predicted subcellular localization well, and in fact better than the fine-tuned DNABERT-2 model itself.
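The difference between BPE and fixed k-mer tokenization can be made concrete with a toy implementation: BPE starts from single bases and greedily merges the most frequent adjacent token pair, so recurring motifs become single variable-length tokens. This sketch is purely illustrative; DNABERT-2's actual vocabulary is learned on multi-species genomes:

```python
from collections import Counter

def bpe_tokenize(seq, num_merges):
    """Toy byte-pair encoding over a DNA string: repeatedly merge the
    most frequent adjacent token pair (stopping when no pair repeats)."""
    tokens = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_tokenize("ATATGCGC", 2))  # -> ['AT', 'AT', 'GC', 'GC']
```

Unlike overlapping 3-mers, which would always yield fixed-length tokens, the merged units here adapt to whatever motifs recur in the sequence.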

In our study, we compared the performance of DNABERT-2 with traditional composition- and correlation-based features for classifying the subcellular localization of lncRNAs. While DNABERT-2, a pre-trained language model, showed promising results, we found that traditional machine learning models trained on carefully crafted composition and correlation features consistently outperformed it. This suggests that, for this specific task, carefully engineered features capture the relevant biological information more effectively than the general-purpose representations learned by DNABERT-2. Specifically, the correlation-based features achieved a higher average AUC than all other approaches. However, these approaches failed on RNA sequences longer than 10,000 base pairs; to reduce the variability in nucleotide length, sequence length was therefore limited to 10,000 base pairs.
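The length cap described above amounts to a simple preprocessing step. A minimal sketch, noting that the exact truncation strategy (here, keeping the 5' end) is our assumption rather than a detail stated in the text:

```python
MAX_LEN = 10_000  # length cap stated in the text

def truncate(seq, max_len=MAX_LEN):
    """Cap a transcript at max_len nucleotides (keeping the 5' end, an
    assumed choice) so downstream models see bounded sequence lengths."""
    return seq[:max_len]

print(len(truncate("A" * 12_000)))  # -> 10000
```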

The lncAtlas database, while a valuable resource for lncRNA subcellular localization, has several significant limitations including its restriction to GENCODE-annotated lncRNAs and a limited set of 15 human cell lines, with detailed sub-compartment data available only for the K562 cell line. The database relies on RNA-seq data and the Relative Concentration Index (RCI), which provides relative abundance rather than absolute counts. Furthermore, lncAtlas has not been updated since 2017, making it less comprehensive and potentially outdated.

To address these limitations, future research should focus on developing techniques to improve the interpretability of DNABERT-2’s predictions. This could involve methods such as attention visualization or feature importance analysis. Furthermore, expanding the diversity of training data is essential to enhance the model’s generalizability across different biological contexts. By incorporating data from a wider range of organisms and conditions, subcellular localization prediction could become a more versatile and reliable tool for genomic analysis. In our case, correlation-based features with machine learning algorithms outperformed all other approaches. Moreover, improved machine learning algorithms that can account for large variability in nucleotide length still need to be developed.

Conclusion

Understanding the subcellular localization of lncRNA can provide great insights into their function within the cell. Computational tools have recently expanded the domain of subcellular localization by the development of faster and more accurate methods. In this study, we used a variety of machine learning as well as large language models to accurately predict lncRNA subcellular localization. The implementation of large language models to tackle biological problems is gaining momentum and our study also highlights its importance. The final model used in CytoLNCpred was designed using a traditional machine learning model trained using correlation-based features. This tool will help researchers to improve the functional annotation of lncRNA and develop RNA-based therapeutics.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://webs.iiitd.edu.in/raghava/cytolncpred/, https://github.com/raghavagps/cytolncpred.

Author contributions

SC: Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. NM: Formal Analysis, Investigation, Methodology, Validation, Visualization, Writing – review and editing. GR: Conceptualization, Formal Analysis, Funding acquisition, Investigation, Supervision, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. The current work has been supported by the Department of Biotechnology (DBT) grant BT/PR40158/BTIS/137/24/2021.

Acknowledgments

Authors are thankful to the Department of Science and Technology (DST-INSPIRE) and Indraprastha Institute of Information Technology, New Delhi, for fellowships and financial support. Authors are also thankful to Department of Computational Biology, IIITD New Delhi for infrastructure and facilities. Figure 2, 3, 4 and 7 were created with BioRender.com.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbinf.2025.1585794/full#supplementary-material

References

  1. Aillaud, M., and Schulte, L. N. (2020). Emerging roles of long noncoding RNAs in the cytoplasmic milieu. Noncoding RNA 6, 44. doi: 10.3390/ncrna6040044

  2. Bridges, M. C., Daulagala, A. C., and Kourtidis, A. (2021). LNCcation: lncRNA localization and function. J. Cell Biol. 220, e202009045. doi: 10.1083/jcb.202009045

  3. Chang, J., Ma, X., Sun, X., Zhou, C., Zhao, P., Wang, Y., et al. (2023). RNA fluorescence in situ hybridization for long non-coding RNA localization in human osteosarcoma cells. J. Vis. Exp. doi: 10.3791/65545

  4. Girgis, H. Z. (2022). MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores. BMC Genomics 23, 423. doi: 10.1186/s12864-022-08619-0

  5. Huang, W., Tian, Y., Dong, S., Cha, Y., Li, J., Guo, X., et al. (2017). The long non-coding RNA SNHG3 functions as a competing endogenous RNA to promote malignant development of colorectal cancer. Oncol. Rep. 38, 1402-1410. doi: 10.3892/or.2017.5837

  6. Jeon, Y.-J., Hasan, M. M., Park, H. W., Lee, K. W., and Manavalan, B. (2022). TACOS: a novel approach for accurate prediction of cell-specific long noncoding RNAs subcellular localization. Brief. Bioinform. 23, bbac243. doi: 10.1093/bib/bbac243

  7. Lennox, K. A., and Behlke, M. A. (2016). Cellular localization of long non-coding RNAs affects silencing by RNAi more than by antisense oligonucleotides. Nucleic Acids Res. 44, 863-877. doi: 10.1093/nar/gkv1206

  8. Li, Q., Wang, X., Jiang, N., Xie, X., Liu, N., Liu, J., et al. (2020). Exosome-transmitted linc00852 associated with receptor tyrosine kinase AXL dysregulates the proliferation and invasion of osteosarcoma. Cancer Med. 9, 6354-6366. doi: 10.1002/cam4.3303

  9. Lin, Y., Pan, X., and Shen, H.-B. (2021). lncLocator 2.0: a cell-line-specific subcellular localization predictor for long non-coding RNAs with interpretable deep learning. Bioinformatics 37, 2308-2316. doi: 10.1093/bioinformatics/btab127

  10. Liu, P., Wang, H., Liang, Y., Hu, A., Xing, R., Jiang, L., et al. (2018). LINC00852 promotes lung adenocarcinoma spinal metastasis by targeting S100A9. J. Cancer 9, 4139-4149. doi: 10.7150/jca.26897

  11. Mas-Ponte, D., Carlevaro-Fita, J., Palumbo, E., Hermoso Pulido, T., Guigo, R., and Johnson, R. (2017). LncATLAS database for subcellular localization of long noncoding RNAs. RNA 23, 1080-1087. doi: 10.1261/rna.060814.117

  12. Mathur, M., Patiyal, S., Dhall, A., Jain, S., Tomer, R., Arora, A., et al. (2021). Nfeature: a platform for computing features of nucleotide sequences. bioRxiv preprint. doi: 10.1101/2021.12.14.472723

  13. Mattick, J. S., Amaral, P. P., Carninci, P., Carpenter, S., Chang, H. Y., Chen, L.-L., et al. (2023). Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. 24, 430-447. doi: 10.1038/s41580-022-00566-8

  14. Mayer, A., and Churchman, L. S. (2017). A detailed protocol for subcellular RNA sequencing (subRNA-seq). Curr. Protoc. Mol. Biol. 120, 4.29.1-4.29.14. doi: 10.1002/cpmb.44

  15. Miao, H., Wang, L., Zhan, H., Dai, J., Chang, Y., Wu, F., et al. (2019). A long noncoding RNA distributed in both nucleus and cytoplasm operates in the PYCARD-regulated apoptosis by coordinating the epigenetic and translational regulation. PLoS Genet. 15, e1008144. doi: 10.1371/journal.pgen.1008144

  16. Rosenlund, I. A., Calin, G. A., Dragomir, M. P., and Knutsen, E. (2021). CRISPR/Cas9 to silence long non-coding RNAs. Methods Mol. Biol. 2348, 175-187. doi: 10.1007/978-1-0716-1581-2_12

  17. Statello, L., Guo, C.-J., Chen, L.-L., and Huarte, M. (2021). Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 22, 96-118. doi: 10.1038/s41580-020-00315-9

  18. Xu, D., Cai, Y., Tang, L., Han, X., Gao, F., Cao, H., et al. (2020). A CRISPR/Cas13-based approach demonstrates biological relevance of vlinc class of long non-coding RNAs in anticancer drug response. Sci. Rep. 10, 1794. doi: 10.1038/s41598-020-58104-5

  19. Zhang, S., Amahong, K., Zhang, Y., Hu, X., Huang, S., Lu, M., et al. (2023). RNAenrich: a web server for non-coding RNA enrichment. Bioinformatics 39, btad421. doi: 10.1093/bioinformatics/btad421

  20. Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., and Liu, H. (2023). DNABERT-2: efficient foundation model and benchmark for multi-species genome. arXiv [q-bio.GN]. doi: 10.48550/ARXIV.2306.15006

  21. Zong, X., Huang, L., Tripathi, V., Peralta, R., Freier, S. M., Guo, S., et al. (2015). Knockdown of nuclear-retained long noncoding RNAs using modified DNA antisense oligonucleotides. Methods Mol. Biol. 1262, 321-331. doi: 10.1007/978-1-4939-2253-6_20

Summary

Keywords

lncRNA, cytoplasm localization, machine learning, DNABert-2, cell-line specific localization

Citation

Choudhury S, Mehta NK and Raghava GPS (2025) CytoLNCpred-a computational method for predicting cytoplasm associated long non-coding RNAs in 15 cell-lines. Front. Bioinform. 5:1585794. doi: 10.3389/fbinf.2025.1585794

Received

01 March 2025

Accepted

14 May 2025

Published

26 May 2025

Volume

5 - 2025

Edited by

Stephen M. Mount, University of Maryland, United States

Reviewed by

Ganesh Panzade, National Cancer Institute at Frederick (NIH), United States

Srinivasulu Yerukala Sathipati, Marshfield Clinic Research Institute, United States

Swapna Vidhur Daulatabad, National Cancer Institute at Frederick (NIH), United States


Copyright

*Correspondence: Gajendra P. S. Raghava,

ORCID: Shubham Choudhury, orcid.org/0000-0002-4509-4683; Naman Kumar Mehta, orcid.org/0009-0009-0244-2826; Gajendra P. S. Raghava, orcid.org/0000-0002-8902-2876

