ORIGINAL RESEARCH article

Front. Bioinform., 26 May 2025

Sec. RNA Bioinformatics

Volume 5 - 2025 | https://doi.org/10.3389/fbinf.2025.1585794

CytoLNCpred-a computational method for predicting cytoplasm associated long non-coding RNAs in 15 cell-lines

  • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India

The function of long non-coding RNA (lncRNA) is largely determined by its specific location within a cell. Previous methods have used noisy datasets, including mRNA transcripts in tools intended for lncRNAs, and excluded lncRNAs lacking significant differential localization between the cytoplasm and nucleus. To overcome these shortcomings, we developed a method for predicting cytoplasm-associated lncRNAs in 15 human cell-lines, identifying which lncRNAs are more abundant in the cytoplasm than in the nucleus. All models in this study were trained using five-fold cross validation and tested on a validation dataset. First, we developed machine learning and deep learning based models using traditional features such as composition and correlation. Using composition and correlation based features, machine learning algorithms achieved an average AUC of 0.7049 and 0.7089, respectively, for 15 cell-lines. Second, we developed machine learning based models using embedding features obtained from the large language model DNABERT-2. The average AUC for all the cell-lines achieved by this approach was 0.665. Subsequently, we also fine-tuned DNABERT-2 on our training dataset and evaluated the fine-tuned DNABERT-2 model on the validation dataset. The fine-tuned DNABERT-2 model achieved an average AUC of 0.6336. Correlation-based features combined with ML algorithms outperform LLM-based models in predicting differential lncRNA localization. These cell-line specific models, as well as a web-based service, are available to the public from our web server (https://webs.iiitd.edu.in/raghava/cytolncpred/).

Highlights

• Prediction of cytoplasm-associated lncRNAs in 15 human cell lines

• Machine learning using composition and correlation features

• DNABERT-2 embeddings for lncRNA localization prediction

• Correlation-based models outperform LLM-based models

• Web server and models available for public use

Introduction

The rapidly expanding field of non-coding RNAs has revolutionized our understanding of gene regulation and cell biology. Among the diverse classes of non-coding RNAs, long non-coding RNAs (lncRNAs) have attracted significant attention due to their ability to regulate gene expression at various levels. Initially dismissed as transcriptional noise, lncRNAs have emerged as critical players in cellular processes, including development, differentiation, and disease progression (Statello et al., 2021). To fully comprehend the functional roles of lncRNAs, it is imperative to investigate their subcellular localization. lncRNAs have distinct functions in the nucleus and cytoplasm, influencing transcriptional and posttranscriptional processes. In the nucleus, lncRNAs regulate gene expression and chromatin organization, while in the cytoplasm, they participate in signal transduction and translation. Some lncRNAs exhibit dual localization and functional diversification, reflecting their adaptability to different subcellular environments (Miao et al., 2019; Aillaud and Schulte, 2020; Mattick et al., 2023).

In recent years, extensive research efforts have been focused on deciphering the subcellular localization of lncRNAs. Various experimental approaches, such as fluorescence in situ hybridization (FISH) (Chang et al., 2023), RNA sequencing (RNA-seq) (Mayer and Churchman, 2017), and fractionation techniques (Miao et al., 2019), have been employed to identify the subcellular localization patterns of lncRNAs. These studies have revealed that lncRNAs can be localized in different cellular compartments, including the nucleus, cytoplasm, nucleolus, and specific subcellular structures. The subcellular localization of lncRNAs is often associated with their biological functions. For instance, nuclear-localized lncRNAs are frequently involved in transcriptional regulation, chromatin remodeling, and epigenetic modifications. Cytoplasmic lncRNAs, on the other hand, can interact with proteins or act as competitive endogenous RNAs (ceRNAs) to regulate gene expression post-transcriptionally (Bridges et al., 2021). However, most of these methods are expensive to perform and require highly specialized instrumentation.

Advancements in computational methods and machine learning approaches have further facilitated the prediction of lncRNA subcellular localization. These methods leverage various features, such as sequence composition, secondary structure, and evolutionary conservation, to predict the subcellular localization of lncRNAs with high accuracy. Several computational methods have been proposed for predicting lncRNA subcellular localization. Sequence-based methods rely on the nucleotide composition of the lncRNA. They utilize features such as k-mer frequency, nucleotide composition, and sequence motifs. However, these methods are trained on datasets that are not unique to humans, and they do not account for the variation in the subcellular localization of lncRNA in different cells.

Cell-line specific subcellular localization gains prominence due to the variability (in terms of subcellular localization) that lncRNAs exhibit within different cell-lines. This was reported by Lin et al. in lncLocator 2.0, where it was observed that a single lncRNA had different localization in different cell-lines (Lin et al., 2021). We observed a similar trend in our dataset, where some lncRNAs were found to be localized in the nucleus for some cell-lines but were localizing to the cytoplasm in some other cell-lines. This pattern can be seen clearly in Figure 1.

Figure 1. Bubble plot indicating the variability of localization of a single lncRNA across multiple cell-lines.

lncLocator 2.0 is a cell-line-specific subcellular localization predictor that employs an interpretable deep-learning approach (Lin et al., 2021). TACOS, also a cell-line-specific subcellular localization predictor, uses tree-based algorithms along with various sequence compositional and physicochemical features (Jeon et al., 2022). Among all the existing computational methods, only lncLocator 2.0 and TACOS are designed to predict subcellular localization specific to different cell-lines. The primary issue with these methods is that the datasets used to develop these methods have not been properly filtered. Specifically, these methods have included mRNA sequences in their datasets, which can lead to inaccurate predictions. Additionally, the datasets have eliminated lncRNAs with an absolute fold-change less than 2, which can result in the failure to predict the subcellular location of lncRNAs with borderline concentration differences between locations.

To address the limitations of existing methods in a comprehensive manner, we have developed CytoLNCpred. In this study, we aimed to enhance the prediction accuracy compared to current tools, which have significant room for improvement. Furthermore, we have cleaned the dataset and adhered to industry standards to validate the performance of our method. In CytoLNCpred, a machine learning model trained using correlation-based features demonstrated significantly better performance on the validation dataset compared to existing tools.

Materials and methods

To aid in the development of a prediction model for lncRNA subcellular localization, we designed the workflow depicted in Figure 2. Each phase of this workflow is detailed in the subsequent sections.

Figure 2. Overall architecture of CytoLNCpred.

Dataset creation

In this study, we selected lncAtlas for acquiring cell-line specific subcellular localization information. lncAtlas is a comprehensive resource of lncRNA localization in human cells based on RNA-sequencing data sets (Mas-Ponte et al., 2017). lncAtlas contains a wide array of information, including the Cytoplasm to Nucleus Relative Concentration Index (CNRCI), which we have utilized in our method. CNRCI is defined as the log2-transformed ratio of FPKM (Fragments Per Kilobase per Million mapped reads) in two samples, in this case the cytoplasm and the nucleus. It is calculated as follows:

$$\mathrm{CNRCI} = \log_2\left(\frac{\text{Cytoplasmic expression (FPKM)}}{\text{Nuclear expression (FPKM)}}\right)$$

Sequence information for the lncRNAs was obtained from the ENSEMBL database (version 112), and lncRNAs with no sequence were dropped. To frame the task as a classification problem, we labelled sequences with a CNRCI value greater than 0 as Cytoplasm and those with a CNRCI value less than 0 as Nucleus. Redundancy was removed using MeshClust (Girgis, 2022) at a sequence similarity threshold of 90%. Figure 3 graphically depicts how the training and validation datasets were created.
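The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names `cnrci` and `label_from_cnrci` are hypothetical, and the handling of zero expression values and of a CNRCI of exactly 0 is an assumption, since the text does not specify either case.

```python
import math
from typing import Optional


def cnrci(cyto_fpkm: float, nuc_fpkm: float) -> float:
    """Return log2(cytoplasmic FPKM / nuclear FPKM).

    Assumption: both values must be strictly positive; how zero
    expression is treated upstream is not described in the text.
    """
    if cyto_fpkm <= 0 or nuc_fpkm <= 0:
        raise ValueError("FPKM values must be positive")
    return math.log2(cyto_fpkm / nuc_fpkm)


def label_from_cnrci(value: float) -> Optional[str]:
    """Map a CNRCI value to a class label using the sign rule."""
    if value > 0:
        return "Cytoplasm"
    if value < 0:
        return "Nucleus"
    return None  # CNRCI == 0 is not covered by the rule in the text


# Example: a transcript four times more abundant in the cytoplasm
score = cnrci(8.0, 2.0)
print(score, label_from_cnrci(score))
```

Because the rule depends only on the sign of CNRCI, lncRNAs with small concentration differences between compartments are retained rather than filtered out.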

Figure 3. Graphical overview of the process of dataset creation.

Further, we used sequences up to the length of 10,000 nucleotides only, as the longer lncRNA were misleading for the machine learning models and computationally very expensive when large language models were involved. The summary of the dataset used for each cell line is provided in Table 1.

Table 1. Detailed summary of the dataset used in the study, including the total number of samples for each cell-line in the source database and the final non-redundant dataset.

Feature generation - Composition and correlation-based

To facilitate the training of machine learning (ML) models, we generated a large variety of features using different approaches. These features convert nucleotide sequences into fixed-length numerical vectors. We used the in-house tool Nfeature (Mathur et al., 2021) to generate multiple composition and correlation features.

Composition-based

Nucleotide composition-based features refer to quantitative representations of sequences that can be derived from the proportions and arrangements of nucleotides within these sequences. In this study, we have computed nucleic acid composition, distance distribution of nucleotides (DDN), nucleotide repeat index (NRI), pseudo composition and entropy of a sequence. The details for each of the features are provided in Table 2.

Table 2. Overview of the composition-based features generated using Nfeature.
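To make the idea of a composition feature concrete, the sketch below computes normalized k-mer frequencies for a sequence. It is an illustrative stand-in, not Nfeature itself: the function name `kmer_composition` is hypothetical, and the choices to map U to T, skip k-mers containing ambiguous bases, and normalize by the number of valid k-mers are assumptions.

```python
from itertools import product


def kmer_composition(seq: str, k: int = 2) -> dict:
    """Return the relative frequency of every possible k-mer over ACGT."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips k-mers with N or other symbols
            counts[kmer] += 1
            total += 1
    # Every sequence yields a vector of length 4**k, whatever its length
    return {km: (c / total if total else 0.0) for km, c in counts.items()}


features = kmer_composition("ACGTACGTTTGA", k=2)
print(len(features))  # 16 dinucleotide frequencies
```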

Correlation-based features

In this study, using Nfeature, we quantitatively assess the interdependent characteristics inherent in nucleotide sequences through the computation of correlation-based metrics. Correlation refers to the degree of relationship between distinct properties or features; an autocorrelation denotes the association of a feature with itself, whereas a cross-correlation indicates a linkage between two separate features. By employing these correlation-based descriptors, we effectively normalize the variable-length nucleotide sequences into uniform-length vectors, rendering them amenable to analysis via machine learning algorithms. These specific descriptors facilitate the identification and extraction of significant features predicated upon the nucleotide properties distributed throughout the sequence, enabling a more robust understanding of genetic information. A brief description of the features has been provided in Table 3.

Table 3. Overview of the correlation-based features generated using Nfeature.
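The way an autocorrelation descriptor turns a variable-length sequence into a fixed-length vector can be sketched as follows. This is a simplified auto-covariance over a single per-nucleotide property, not the exact descriptors computed by Nfeature; the function name `auto_covariance` is hypothetical, and the values in `TOY_PROPERTY` are invented for illustration and are not taken from Supplementary Table S2.

```python
# Invented property values, for illustration only
TOY_PROPERTY = {"A": 1.0, "C": -1.0, "G": 0.5, "T": -0.5}


def auto_covariance(seq: str, prop: dict, max_lag: int = 3) -> list:
    """Auto-covariance of one nucleotide property at lags 1..max_lag."""
    values = [prop[b] for b in seq.upper() if b in prop]
    n = len(values)
    if n == 0:
        return [0.0] * max_lag
    mean = sum(values) / n
    result = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        if pairs <= 0:
            result.append(0.0)  # sequence too short for this lag
            continue
        total = sum(
            (values[i] - mean) * (values[i + lag] - mean) for i in range(pairs)
        )
        result.append(total / pairs)
    return result


# Sequences of different lengths map to vectors of the same length
print(len(auto_covariance("ACGTACGT", TOY_PROPERTY)))
print(len(auto_covariance("ACGTACGTACGTTTGACA", TOY_PROPERTY)))
```

A cross-correlation descriptor follows the same pattern but pairs the value of one property at position i with the value of a different property at position i + lag.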

The total number of descriptors generated using both composition and correlation-based features is 1223. A detailed explanation of the features and their biological implications is provided in Supplementary Table S1. The properties used to calculate correlation-based features are provided in Supplementary Table S2.

Embedding using DNABERT-2

DNABERT-2 is an adaptation of BERT (Bidirectional Encoder Representations from Transformers) designed specifically for DNA sequence analysis (Zhou et al., 2023). DNABERT-2 generates embeddings for DNA sequences that encapsulate not just the individual bases, but also their biological significance in terms of structure, function, and interactions. Moreover, the key advantage of DNABERT-2 embeddings lies in their ability to capture the complex dependencies within DNA sequences. We have made use of both aspects of DNABERT-2: the model itself to make predictions, and the embeddings from the model as features for downstream tasks. The pre-trained model was fine-tuned on the training dataset using the default parameters mentioned in its GitHub repository. We also generated embeddings from both the pre-trained and the fine-tuned models in order to use them as features for machine learning algorithms. The embeddings are derived from the hidden states of the model's final output layer, using max pooling. The embedding generated for each lncRNA using DNABERT-2 has 768 dimensions.
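The max-pooling step can be illustrated independently of the model. In the sketch below, the per-token hidden states that DNABERT-2 would produce are represented as plain Python lists, and the function name `max_pool` is hypothetical. Whether padding tokens were masked before pooling in the study is not stated, so the optional mask is an assumption.

```python
from typing import List, Optional


def max_pool(hidden_states: List[List[float]],
             attention_mask: Optional[List[int]] = None) -> List[float]:
    """Collapse per-token vectors into one vector by element-wise maximum.

    hidden_states: one vector per token (768 values each for DNABERT-2).
    attention_mask: optional 0/1 flags; tokens flagged 0 are ignored.
    """
    if attention_mask is None:
        attention_mask = [1] * len(hidden_states)
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("no tokens left to pool")
    # zip(*kept) iterates over hidden dimensions instead of tokens
    return [max(column) for column in zip(*kept)]


# Three tokens with a hidden size of 2; the last token is padding
tokens = [[1.0, 5.0], [3.0, 2.0], [0.0, 9.0]]
print(max_pool(tokens, attention_mask=[1, 1, 0]))
```

The pooled vector has the same length as the hidden size regardless of how many tokens the sequence produces, which is what allows it to be used directly as a feature vector.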

Five-fold cross validation

To estimate the performance of machine learning based models during training, we used five-fold cross validation. In this method, the training dataset is split into five folds in a stratified manner; the model is trained on four folds and the remaining fold is used for validation. This process is repeated five times, each time holding out a different fold for validation and training on the other four. This yields five sets of performance metrics, and the performance of the model is reported as their mean.
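The procedure can be sketched with the standard library alone. This is an illustration of stratified k-fold splitting, not the implementation used in the study; the function names, the round-robin assignment, and the fixed random seed are assumptions, and `score_fn` is a hypothetical callback that trains a model on the training indices and returns a metric on the validation indices.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def cross_validate(labels, score_fn, n_folds=5, seed=0):
    """Return the mean of score_fn(train_idx, val_idx) over all folds."""
    folds = stratified_folds(labels, n_folds, seed)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(score_fn(train_idx, val_idx))
    return sum(scores) / len(scores)


labels = [1] * 10 + [0] * 20
for fold in stratified_folds(labels):
    print(len(fold), sum(labels[i] for i in fold))
```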

Feature selection

To optimize model performance and reduce computational complexity, we employed the Minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm to identify the most informative features from our dataset. The mRMR algorithm selects features that exhibit the highest relevance to the target variable while minimizing redundancy among the features themselves, thereby enhancing the efficiency and predictive power of the models. To apply the mRMR algorithm, we combined all the features that were generated previously. Three different feature sets (Composition, Correlation and Embeddings) were used to generate a combined feature set comprising 2278 features. We evaluated the impact of feature selection by selecting subsets of the top 10, 50, 100, 500, 1000, 1500, and 2000 features. These subsets were subsequently used in downstream analyses to assess their influence on model performance metrics. Feature importance was also evaluated using simple correlation.
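The greedy selection logic behind mRMR can be sketched as follows. This is a simplified variant under stated assumptions, not the implementation used in the study: relevance and redundancy are both measured with the absolute Pearson correlation and combined by subtraction, whereas mRMR implementations often use mutual information or an F-statistic. The function names `pearson` and `mrmr` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    features: dict mapping feature name -> list of values per sample.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # fully redundant with "a"
    "b": [0, 0, 1, 1, 0, 0, 1, 1],       # weaker but independent of "a"
}
target = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.5, 1.5]
print(mrmr(features, target, k=2))
```

In this toy example the duplicated feature is passed over in favour of a weaker but non-redundant one, which is the behaviour that distinguishes mRMR from ranking features by relevance alone.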

Model development

In this study, three different approaches were followed for model development. The first approach involves fine-tuning DNABERT-2 on our training dataset and subsequently using the fine-tuned model to make predictions on the validation dataset. This method fine-tunes both the tokenizer and the pre-trained model on our training dataset, producing a fine-tuned tokenizer and model. The fine-tuned model takes lncRNA sequences as input and generates predictions using this tokenizer and model. The second approach is a hybrid one, combining components of large language models and machine learning algorithms. Instead of handcrafted features, we generated embeddings from the pre-trained as well as the fine-tuned DNABERT-2 model. These embeddings were then used to train machine learning models, which were subsequently evaluated. The third approach uses composition and correlation-based features to train machine learning models. The final model was developed using the.

Model evaluation metrics

The binary classification performance of our models was evaluated using the following metrics: Sensitivity (SENS), Specificity (SPEC), Precision (PREC), Accuracy (ACC), Matthews Correlation Coefficient (MCC), F1-Score (F1) and Area Under the Receiver Operating Characteristic curve (AUC). These metrics were calculated from the four types of prediction outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN):

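$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\mathrm{F1\ score} = \frac{TP}{TP + 0.5 \times (FN + FP)}$$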

The evaluation of our binary classification model using various metrics provides critical insights into its performance. Sensitivity (SENS) measures the model’s ability to identify positive instances, while Specificity (SPEC) assesses its accuracy in recognizing negative instances. Precision (PREC) reflects the accuracy of positive predictions, and Accuracy (ACC) offers an overall measure of correctness, though it may be misleading in imbalanced datasets. Matthew’s Correlation Coefficient (MCC) provides a balanced view by considering all prediction outcomes, with values close to 1 indicating strong predictive capability. The F1-Score (F1) combines Precision and Sensitivity into a single metric, ideal for balancing the trade-off between false positives and negatives. Finally, the Area Under the Curve (AUC) evaluates the model’s ability to distinguish between classes across different thresholds, with higher values indicating better performance. Together, these metrics enable a comprehensive evaluation of the model, guiding necessary improvements and refinements.
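For reference, the threshold-based metrics can be computed directly from the four outcome counts, and AUC can be computed from prediction scores as the probability that a randomly chosen positive is ranked above a randomly chosen negative. The sketch below is illustrative only: the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not one taken from the study.

```python
import math


def _safe_div(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute SENS, SPEC, PREC, ACC, MCC and F1 from outcome counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SENS": _safe_div(tp, tp + fn),
        "SPEC": _safe_div(tn, tn + fp),
        "PREC": _safe_div(tp, tp + fp),
        "ACC": _safe_div(tp + tn, tp + tn + fp + fn),
        "MCC": _safe_div(tp * tn - fp * fn, mcc_denominator),
        "F1": _safe_div(tp, tp + 0.5 * (fn + fp)),
    }


def auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties count as half. Quadratic in sample count, which is fine for a
    sketch but slow for large datasets.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return _safe_div(wins, len(positives) * len(negatives))


print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```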

Results

In this study, an attempt was made to design a model that will be able to classify the subcellular location of lncRNA into cytoplasm or nucleus. To achieve this, we tried out multiple approaches. Figure 4 provides an overview of the performance of the various approaches tried in this study.

Figure 4. Overview of the performance achieved by different prediction strategies. The values indicate the average MCC and AUC across all the 15 cell-lines for a prediction strategy.

Functional enrichment analysis

The GO and KEGG enrichment analysis, conducted using RNAenrich (Zhang et al., 2023), reveals distinct functional roles for cytoplasmic versus nuclear-localizing lncRNAs. Among the significantly enriched GO terms (adjusted p-value <0.05), 2,511 were shared, while 397 were unique to cytoplasmic lncRNAs (positive class) and 254 to nuclear lncRNAs (negative class). Cytoplasmic lncRNAs were enriched for biological processes such as “response to interferon-beta” (GO:0035456), “positive regulation of apoptotic process” (GO:0043065), and “RNA splicing” (GO:0008380), indicating roles in immune signaling, post-transcriptional regulation, and cellular stress responses. Correspondingly, KEGG pathway enrichment identified associations with Ferroptosis (hsa04216) and Autophagy (hsa04140), further highlighting their involvement in cytoplasmic stress and degradation pathways. In contrast, nuclear-localized lncRNAs were enriched for GO terms such as “eukaryotic 48S preinitiation complex” (GO:0033290), “regulation of transcription of nucleolar large rRNA by RNA polymerase I” (GO:1901836), and “MLL1/2 complex” (GO:0044665), reflecting their roles in transcriptional regulation, chromatin remodeling, and nucleolar function. KEGG analysis further linked nuclear lncRNAs to Sterol Biosynthesis (hsa00100) and Nucleotide Excision Repair (hsa03420), pointing to nuclear functions in genome maintenance and metabolic regulation. Together, these enrichments underscore the compartment-specific biological functions of lncRNAs, shaped by their cellular localization.

Feature importance

To identify features associated with subcellular localization labels (cytoplasm or nucleus), we computed the correlation of each feature with the corresponding CNRCI values. A positive correlation indicates that an increase in the feature value favors cytoplasmic localization, whereas a negative correlation suggests a preference for nuclear localization. This analysis elucidates which features predominantly influence localization to either compartment. The top 10 features that were most highly correlated with the CNRCI values are provided in Tables 4 and 5 for composition-based and correlation-based features, respectively. More detailed versions of these tables are provided in Supplementary Tables S3 and S4. It can be observed that cytosine-based k-mers are more prevalent among the positively correlated features (supporting cytoplasmic localization), whereas thymine is predominantly found in negatively correlated features (supporting nuclear localization).

Table 4. Composition based-features having the highest correlation with the CNRCI values for 15 cell-lines.

Table 5. Correlation-based features having the highest correlation with the CNRCI values for 15 cell-lines.

Additionally, we assessed feature variability across cell lines by calculating the difference between the maximum and minimum correlation values observed for each feature across all 15 cell lines. This approach highlights features exhibiting the most pronounced inter-cell-line variation, providing insight into their potential biological or experimental variability. Figures 5 and 6 show heatmaps of the most variable features and their correlation with the CNRCI values for composition and correlation-based features, respectively. The complete information for the variable features is provided in Supplementary Tables S5 and S6.
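Both quantities used in this section, the per-cell-line correlation of a feature with CNRCI and its range across cell lines, can be sketched as follows. The text says only "simple correlation", so the use of the Pearson coefficient here is an assumption; the function names are hypothetical and the numbers in the usage example are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_range(correlation_by_cell_line):
    """Difference between the largest and smallest correlation of a feature."""
    values = list(correlation_by_cell_line.values())
    return max(values) - min(values)


# Invented feature values and CNRCI scores for two cell lines
feature = [0.10, 0.20, 0.15, 0.30]
cnrci_by_cell_line = {
    "A549": [-1.0, 0.5, -0.2, 1.2],
    "HepG2": [0.8, -0.3, 0.4, -1.1],
}
correlations = {
    cell_line: pearson(feature, values)
    for cell_line, values in cnrci_by_cell_line.items()
}
print(correlations)
print(correlation_range(correlations))
```

A feature whose correlation changes sign between cell lines, as in this toy example, produces a large range and would rank among the most variable features.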

Figure 5. Heatmap depicting the composition-based features that demonstrate the highest variation in their correlation with the CNRCI values within 15 cell-lines.

Figure 6. Heatmap depicting the correlation-based features that demonstrate the highest variation in their correlation with the CNRCI values within 15 cell-lines.

Model based on composition and correlation features

Composition and correlation features generated using Nfeature were used to train multiple ML models. We computed the performance of nine composition-based features and thirteen correlation-based features. We evaluated every combination of feature and ML model to identify the best-performing combination. The composition features combined with classical ML methods achieved an average AUC of 0.7049 and an MCC of 0.1965 across the 15 cell-lines. Similarly, with correlation-based features and ML methods, the best performance achieved was an average AUC of 0.7089 and an MCC of 0.2133. The performance of the best model using composition and correlation-based features is provided for all 15 cell-lines in Tables 6 and 7, respectively. The detailed performance of all the models used in this analysis is provided in Supplementary Tables S7 and S8. The model parameters are provided in Supplementary Table S12.

Table 6. Performance of the best ML model for each cell-line on the validation dataset using composition features.

Table 7. Performance of the best ML model for each cell-line on the validation dataset using correlation features.

Models based on embeddings from DNABERT-2

Embeddings from large language models are known to encapsulate not just the individual bases, but also their biological significance in terms of structure, function, and interactions. In this approach, we generated high-level representations of lncRNA sequences using both the pre-trained and the fine-tuned models. These embeddings were used to train ML models, and the models were evaluated on the validation dataset. With pre-trained embeddings, the model achieved an average AUC of 0.6586 and an average MCC of 0.1182. When fine-tuned embeddings were used as features, the average AUC changed only marginally (0.6604), while the average MCC rose to 0.1740. Detailed results for the performance of ML models on the validation dataset using pre-trained and fine-tuned embeddings as features are provided in Tables 8 and 9, respectively. The detailed performance of all the models is reported in Supplementary Tables S9 and S10.

Table 8. Performance of the best ML model for each cell-line on the validation dataset using embeddings from pre-trained DNABERT-2 model.

Table 9. Performance of the best ML model for each cell-line on the validation dataset using embeddings from fine-tuned DNABERT-2 model.

Fine-tuned DNABERT-2 model

In this approach, we used our training dataset to fine-tune the model and generate a fine-tuned tokenizer and model. Using this tokenizer and model, we generated high-level representations of our lncRNA sequences, which the model then used to generate predictions. The fine-tuned DNABERT-2 model could not be evaluated during training, as we were not able to implement five-fold cross validation. The fine-tuned model was evaluated on the validation dataset, and the resulting performance metrics are reported in Table 10. The detailed performance of all the models is provided in Supplementary Table S11.

Table 10. Performance of fine-tuned DNABERT-2 model on the validation dataset.

Model based on features selected by mRMR algorithm

To identify the best set of features from the combined feature set of 2278 features (composition, correlation, and embeddings), the mRMR algorithm was used. Seven different feature sets were created based on the top 'k' features selected by mRMR. Performance was evaluated for each of the seven feature sets in each cell-line using 12 different ML classifiers. Table 11 reports the AUC of the best model for each combination, and the last row shows the average for each feature set. The best AUC value was obtained when the top 500 features selected by mRMR were used for training.

Table 11. AUC values of the best ML model for each cell-line on the validation dataset for the feature sets generated by mRMR algorithm.

Performance comparison of CytoLNCpred and existing state-of-the-art classifiers

To further illustrate the efficacy of our method, we conduct a comparative analysis with other cutting-edge classifiers. Specifically, we evaluate existing predictors, namely, lncLocator 2.0 and TACOS, which employ predictive algorithms to predict subcellular location of lncRNAs in different cell-lines.

Among these predictors, lncLocator 2.0 relies on word embeddings and a MultiLayer Perceptron Regressor to predict CNRCI values. The predicted CNRCI values were then converted to labels using a fixed threshold value. The second predictor, TACOS, generates a variety of feature encodings from composition and physicochemical properties and uses tree-based algorithms to make predictions. It is important to note that TACOS has been trained on 10 out of the 15 cell-lines. For a fair performance comparison, we report the performance metrics of each tool evaluated on our validation dataset. Table 12 summarizes the evaluation of CytoLNCpred and other existing tools based on AUROC.

Table 12. Comparing the performance of our method and other existing classifiers using our validation dataset based on AUROC for all cell-lines.

mRNA localization prediction accuracy using CytoLNCpred

To assess the applicability of CytoLNCpred, a tool originally developed for lncRNA localization prediction, to mRNA sequences, we utilized mRNA data obtained from the lncAtlas database. These mRNA sequences were subjected to prediction using the standalone version of CytoLNCpred, and its performance was evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC). The AUROC values obtained for mRNA localization prediction across different cell lines, alongside the corresponding performance for lncRNA prediction, are presented in Figure 7.

Figure 7. Performance evaluation of CytoLNCpred in predicting the subcellular localization of mRNA and lncRNA sequences in various cell lines, as measured by the Area Under the Receiver Operating Characteristic curve (AUROC).

The results indicate that CytoLNCpred exhibits a varying degree of accuracy in predicting the subcellular localization of mRNA sequences across the tested cell lines. As illustrated in Figure 7, the predictive performance for mRNA localization differs depending on the cellular context. Notably, the highest prediction accuracy for mRNA was observed in the A549 cell line (AUROC = 0.800), suggesting a strong potential for the tool in this specific context. While the performance varied across different cell lines, with HUVEC showing the lowest AUROC (0.598), the overall results suggest that features learned by CytoLNCpred for lncRNA localization can also provide some discriminatory power for mRNA localization. Interestingly, in the A549 cell line, the prediction accuracy for mRNA even surpassed that observed for lncRNAs. However, in other cell lines like HUVEC and MCF-7, the performance on mRNA was notably lower compared to lncRNAs.

To further validate the model accuracy for mRNAs, 10 random cytoplasmic mRNAs were obtained from the NCBI Gene database. The localization of these mRNAs was then predicted using CytoLNCpred for the A549 cell-line. The model correctly predicted only 2 of the 10 cytoplasmic mRNAs; the detailed results are provided in Supplementary Table S13. This variability in performance across cell types highlights the potential influence of cell-specific factors on mRNA localization and suggests that further refinement or specialized models might be beneficial for broader applicability to mRNA.

Discussion

In recent years, researchers have recognized that the subcellular localization of lncRNAs plays a pivotal role in understanding their function. Unlike protein-coding genes, lncRNAs do not encode proteins directly. Instead, they exert their effects through diverse mechanisms, including interactions with chromatin, RNA molecules, and proteins. The precise localization of lncRNAs within the cell provides crucial information about their regulatory roles.

In our analysis, LINC00852 showed a marked cell-line-specific shift: it was predominantly nuclear (negative localization score) in most cell types, but strongly cytoplasmic in the NCI-H460 lung carcinoma line. This mirrors literature reports that LINC00852 can be cytoplasmically enriched in lung carcinoma. In lung carcinoma cell-lines (A549 and SPCA-1), LINC00852 was found mainly in the cytoplasm by qRT-PCR assay (Liu et al., 2018). It binds the S100A9 protein in the cytoplasm, activating the MAPK pathway, and plays a positive role in the progression and metastasis of lung adenocarcinoma cells. By contrast, other studies have observed LINC00852 in the nucleus in some tumors. In osteosarcoma cell-lines (such as 143B and MG-63), LINC00852 was observed to act as a transcription factor and increase the expression of the AXL gene (Li et al., 2020). Such context dependence could reflect tissue-specific expression of RNA-binding factors or isoforms that govern nuclear export. The fact that LINC00852 is cytoplasmic in some cancer lines but nuclear in others suggests that it may switch roles: in its cytosolic form it may act as a post-transcriptional regulator (e.g., miRNA sponge), whereas nuclear retention may imply transcriptional or chromatin-related roles.

SNHG3 in our data is mostly cytoplasmic in H1.hESC, HepG2 (liver carcinoma) and GM12878 (lymphoblast) cells, but nuclear in cell-lines like MCF-7 and NHEK. This pattern aligns with experimental studies. In colorectal cancer cell lines (e.g., SW480, LoVo), SNHG3 was found to localize predominantly in the cytoplasm (comparable to GAPDH) (Huang et al., 2017). There it acts as a competing endogenous RNA (ceRNA), sponging miRNAs (e.g., miR-182-5p) to upregulate oncogenic targets like c-Myc. The high cytoplasmic SNHG3 expression in stem-cell like and proliferative cell-lines (H1.hESC, HepG2) from our dataset suggests a similar ceRNA role, whereas its nuclear enrichment in more differentiated/epithelial cells may reflect downregulation of this pathway in those contexts. In general, SNHG-family lncRNAs are known to influence cancer cell growth and often operate via cytoplasmic post-transcriptional mechanisms, consistent with SNHG3’s localization and function in promoting malignancy.

The subcellular localization of lncRNAs has gained prominence in recent years because of their role in gene regulation within the cell. A large number of aptamer- and ASO-based drugs are being developed using RNA nanotechnology. The convergence of nanotechnology and long non-coding RNAs (lncRNAs) has also yielded promising advances in drug development. Nanoparticles, such as liposomes and exosomes, are being harnessed for targeted delivery of lncRNA-based therapeutics to cancer cells. Additionally, CRISPR-Cas9 technology, delivered via nanoparticles, enables precise gene editing by modulating lncRNA expression. Computational models and deep learning approaches are aiding our understanding of lncRNA-mediated mechanisms. Overall, this interdisciplinary field holds immense promise for personalized medicine, improved therapies, and better patient outcomes.

Predicting lncRNA subcellular localization using tools like CytoLNCpred offers significant potential for guiding the development of RNA-based therapeutics and CRISPR strategies. Since antisense oligonucleotides (ASOs) are generally more effective against nuclear lncRNAs (Zong et al., 2015) and small interfering RNAs (siRNAs) excel against cytoplasmic targets (Lennox and Behlke, 2016), a CytoLNCpred prediction indicating cytoplasmic enrichment would favor siRNA development, while a predicted nuclear localization would suggest ASOs as the primary choice. Similarly, this prediction informs CRISPR approaches: targeting nuclear-acting lncRNAs might be best achieved by disrupting transcription or key regulatory elements using CRISPRi or Cas9 (Rosenlund et al., 2021), whereas lncRNAs predicted to function in the cytoplasm could be more effectively targeted by degrading the transcript directly using RNA-targeting CRISPR-Cas13 (Xu et al., 2020), potentially guiding crRNA expression strategies (e.g., using a U1 promoter for cytoplasmic crRNA localization). Thus, localization prediction aids in the rational selection of therapeutic modalities and CRISPR targeting strategies based on the likely site of lncRNA function.
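
The selection logic described above can be written down as a small decision rule. The following is a minimal sketch only: it assumes the predictor returns a cytoplasmic probability between 0 and 1, and the names `suggest_modality` and `Suggestion` as well as the 0.5 threshold are hypothetical choices for illustration, not part of CytoLNCpred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    knockdown: str  # oligonucleotide-based modality
    crispr: str     # CRISPR-based strategy


def suggest_modality(cyto_probability: float, threshold: float = 0.5) -> Suggestion:
    """Map a predicted cytoplasmic probability to candidate modalities.

    The 0.5 threshold is an illustrative assumption; a real cut-off would
    need to be chosen per cell-line from validation data.
    """
    if not 0.0 <= cyto_probability <= 1.0:
        raise ValueError("cyto_probability must lie in [0, 1]")
    if cyto_probability >= threshold:
        # Predicted cytoplasmic: siRNA and RNA-targeting Cas13 are favoured.
        return Suggestion(knockdown="siRNA", crispr="CRISPR-Cas13")
    # Predicted nuclear: ASOs and transcription-level CRISPR are favoured.
    return Suggestion(knockdown="ASO", crispr="CRISPRi or Cas9")


print(suggest_modality(0.8))
print(suggest_modality(0.2))
```

Such a rule only narrows the initial choice; the final modality would still depend on experimental validation of the localization in the cell type of interest.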

Large language models are currently regarded as state-of-the-art (SOTA) methods, so in addition to the classical composition and correlation feature-based models, we also implemented DNABERT-2 for our classification problem. The DNABERT-2 model has been trained on the genomes of a wide variety of species and is computationally efficient. DNABERT-2 uses Byte Pair Encoding to generate tokens, which is reported to perform better than k-mer tokenization. To fully exploit the DNABERT-2 model, we generated embeddings from both the pre-trained and fine-tuned models. Combined with ML methods, these embeddings achieved an average AUC of 0.665 across the cell-lines, compared with 0.6336 for the fine-tuned DNABERT-2 model used directly.

In our study, we compared the performance of DNABERT-2 with traditional composition and correlation-based features for classifying the subcellular localization of lncRNAs. Although DNABERT-2, a pre-trained language model, showed promising results, traditional machine learning models trained on composition and correlation features consistently outperformed it. This suggests that, for this specific task, the engineered features capture the relevant biological information more effectively than the general-purpose representations learned by DNABERT-2. Specifically, the correlation-based features achieved a higher average AUC than all other approaches (0.7089, versus 0.7049 for composition-based features, 0.665 for DNABERT-2 embeddings and 0.6336 for fine-tuned DNABERT-2). However, these approaches failed when we used RNA sequences longer than 10,000 base pairs. To reduce the variability in nucleotide length, the sequence length was therefore limited to 10,000 base pairs.
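
To make the two feature families concrete, the sketch below computes a k-mer composition vector and a very simple lag-based correlation measure after applying the 10,000-nucleotide cap. It is an illustration under stated assumptions, not the feature definitions used in this study: the function names are hypothetical, the cap is applied by keeping the first 10,000 nucleotides (the text does not say whether longer sequences were truncated or excluded), and `lag_match_fraction` is a simplified stand-in for the correlation descriptors.

```python
from itertools import product

MAX_LEN = 10_000  # length cap mentioned in the text


def truncate(seq: str, max_len: int = MAX_LEN) -> str:
    """Upper-case, map U to T, and keep at most max_len nucleotides (assumed policy)."""
    return seq.upper().replace("U", "T")[:max_len]


def kmer_composition(seq: str, k: int = 2) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the ACGT alphabet."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def lag_match_fraction(seq: str, lag: int) -> float:
    """Fraction of positions i where seq[i] equals seq[i + lag].

    A toy correlation-style descriptor: it captures dependence between
    positions a fixed distance apart, which composition alone ignores.
    """
    n = len(seq) - lag
    if n <= 0:
        return 0.0
    return sum(seq[i] == seq[i + lag] for i in range(n)) / n


seq = truncate("ACGU" * 3000)           # 12,000 nt input, capped to 10,000
features = list(kmer_composition(seq, k=2).values())
features += [lag_match_fraction(seq, lag) for lag in (1, 2, 3, 4)]
print(len(seq), len(features))
```

The resulting fixed-length vector (16 dinucleotide frequencies plus four lag values here) is the kind of input a classical ML classifier can consume regardless of the original transcript length.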

The lncAtlas database, while a valuable resource for lncRNA subcellular localization, has several significant limitations including its restriction to GENCODE-annotated lncRNAs and a limited set of 15 human cell lines, with detailed sub-compartment data available only for the K562 cell line. The database relies on RNA-seq data and the Relative Concentration Index (RCI), which provides relative abundance rather than absolute counts. Furthermore, lncAtlas has not been updated since 2017, making it less comprehensive and potentially outdated.
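
Because the RCI is a relative measure, it helps to see how it behaves numerically. The sketch below assumes the cytoplasmic/nuclear RCI is the log2 ratio of cytoplasmic to nuclear expression (FPKM), which matches the sign convention used in this article (negative scores indicate nuclear enrichment). The function names, the zero cut-off and the handling of non-positive expression values are illustrative assumptions.

```python
import math


def relative_concentration_index(cyto_fpkm: float, nuc_fpkm: float) -> float:
    """log2(cytoplasmic FPKM / nuclear FPKM); assumed definition of the RCI."""
    if cyto_fpkm <= 0 or nuc_fpkm <= 0:
        # The ratio is undefined without expression in both fractions.
        raise ValueError("both FPKM values must be positive")
    return math.log2(cyto_fpkm / nuc_fpkm)


def label_from_rci(rci: float, cutoff: float = 0.0) -> str:
    """Binary label: 'cytoplasm' above the cut-off, 'nucleus' otherwise (assumed rule)."""
    return "cytoplasm" if rci > cutoff else "nucleus"


# Two transcripts with the same 4-fold ratio receive the same RCI even though
# their absolute abundances differ 100-fold.
print(relative_concentration_index(8.0, 2.0), relative_concentration_index(800.0, 200.0))
print(label_from_rci(relative_concentration_index(1.0, 4.0)))
```

This illustrates the limitation noted above: the index reports where a transcript is relatively enriched, not how many copies are present in either compartment.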

To address these limitations, future research should focus on developing techniques to improve the interpretability of DNABERT-2’s predictions. This could involve methods such as attention visualization or feature importance analysis. Furthermore, expanding the diversity of training data is essential to enhance the model’s generalizability across different biological contexts. By incorporating data from a wider range of organisms and conditions, subcellular localization prediction could become a more versatile and reliable tool for genomic analysis. In our case, correlation-based features with machine learning algorithms outperformed all other approaches. Moreover, improved machine learning algorithms need to be developed that can account for the large variability in nucleotide lengths.

Conclusion

Understanding the subcellular localization of lncRNAs can provide valuable insights into their function within the cell. Computational tools have recently expanded the domain of subcellular localization through the development of faster and more accurate methods. In this study, we used a variety of machine learning models as well as large language models to predict lncRNA subcellular localization. The application of large language models to biological problems is gaining momentum, and our study also highlights its importance. The final model used in CytoLNCpred is a traditional machine learning model trained on correlation-based features. This tool will help researchers to improve the functional annotation of lncRNAs and develop RNA-based therapeutics.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://webs.iiitd.edu.in/raghava/cytolncpred/, https://github.com/raghavagps/cytolncpred.

Author contributions

SC: Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. NM: Formal Analysis, Investigation, Methodology, Validation, Visualization, Writing – review and editing. GR: Conceptualization, Formal Analysis, Funding acquisition, Investigation, Supervision, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. The current work has been supported by the Department of Biotechnology (DBT) grant BT/PR40158/BTIS/137/24/2021.

Acknowledgments

The authors are thankful to the Department of Science and Technology (DST-INSPIRE) and Indraprastha Institute of Information Technology, New Delhi, for fellowships and financial support. The authors are also thankful to the Department of Computational Biology, IIITD New Delhi, for infrastructure and facilities. Figures 2, 3, 4 and 7 were created with BioRender.com.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbinf.2025.1585794/full#supplementary-material

References

Aillaud, M., and Schulte, L. N. (2020). Emerging roles of long noncoding RNAs in the cytoplasmic milieu. Noncoding RNA 6, 44. doi:10.3390/ncrna6040044

Bridges, M. C., Daulagala, A. C., and Kourtidis, A. (2021). LNCcation: lncRNA localization and function. J. Cell Biol. 220, e202009045. doi:10.1083/jcb.202009045

Chang, J., Ma, X., Sun, X., Zhou, C., Zhao, P., Wang, Y., et al. (2023). RNA fluorescence in situ hybridization for long non-coding RNA localization in human osteosarcoma cells. J. Vis. Exp. doi:10.3791/65545

Girgis, H. Z. (2022). MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores. BMC Genomics 23, 423. doi:10.1186/s12864-022-08619-0

Huang, W., Tian, Y., Dong, S., Cha, Y., Li, J., Guo, X., et al. (2017). The long non-coding RNA SNHG3 functions as a competing endogenous RNA to promote malignant development of colorectal cancer. Oncol. Rep. 38, 1402–1410. doi:10.3892/or.2017.5837

Jeon, Y.-J., Hasan, M. M., Park, H. W., Lee, K. W., and Manavalan, B. (2022). TACOS: a novel approach for accurate prediction of cell-specific long noncoding RNAs subcellular localization. Brief. Bioinform. 23, bbac243. doi:10.1093/bib/bbac243

Lennox, K. A., and Behlke, M. A. (2016). Cellular localization of long non-coding RNAs affects silencing by RNAi more than by antisense oligonucleotides. Nucleic Acids Res. 44, 863–877. doi:10.1093/nar/gkv1206

Li, Q., Wang, X., Jiang, N., Xie, X., Liu, N., Liu, J., et al. (2020). Exosome-transmitted linc00852 associated with receptor tyrosine kinase AXL dysregulates the proliferation and invasion of osteosarcoma. Cancer Med. 9, 6354–6366. doi:10.1002/cam4.3303

Lin, Y., Pan, X., and Shen, H.-B. (2021). lncLocator 2.0: a cell-line-specific subcellular localization predictor for long non-coding RNAs with interpretable deep learning. Bioinformatics 37, 2308–2316. doi:10.1093/bioinformatics/btab127

Liu, P., Wang, H., Liang, Y., Hu, A., Xing, R., Jiang, L., et al. (2018). LINC00852 promotes lung adenocarcinoma spinal metastasis by targeting S100A9. J. Cancer 9, 4139–4149. doi:10.7150/jca.26897

Mas-Ponte, D., Carlevaro-Fita, J., Palumbo, E., Hermoso Pulido, T., Guigo, R., and Johnson, R. (2017). LncATLAS database for subcellular localization of long noncoding RNAs. RNA 23, 1080–1087. doi:10.1261/rna.060814.117

Mathur, M., Patiyal, S., Dhall, A., Jain, S., Tomer, R., Arora, A., et al. (2021). Nfeature: a platform for computing features of nucleotide sequences. doi:10.1101/2021.12.14.472723

Mattick, J. S., Amaral, P. P., Carninci, P., Carpenter, S., Chang, H. Y., Chen, L.-L., et al. (2023). Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. 24, 430–447. doi:10.1038/s41580-022-00566-8

Mayer, A., and Churchman, L. S. (2017). A detailed protocol for subcellular RNA sequencing (subRNA-seq). Curr. Protoc. Mol. Biol. 120, 4.29.1–4. doi:10.1002/cpmb.44

Miao, H., Wang, L., Zhan, H., Dai, J., Chang, Y., Wu, F., et al. (2019). A long noncoding RNA distributed in both nucleus and cytoplasm operates in the PYCARD-regulated apoptosis by coordinating the epigenetic and translational regulation. PLoS Genet. 15, e1008144. doi:10.1371/journal.pgen.1008144

Rosenlund, I. A., Calin, G. A., Dragomir, M. P., and Knutsen, E. (2021). CRISPR/Cas9 to silence long non-coding RNAs. Methods Mol. Biol. 2348, 175–187. doi:10.1007/978-1-0716-1581-2_12

Statello, L., Guo, C.-J., Chen, L.-L., and Huarte, M. (2021). Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 22, 96–118. doi:10.1038/s41580-020-00315-9

Xu, D., Cai, Y., Tang, L., Han, X., Gao, F., Cao, H., et al. (2020). A CRISPR/Cas13-based approach demonstrates biological relevance of vlinc class of long non-coding RNAs in anticancer drug response. Sci. Rep. 10, 1794. doi:10.1038/s41598-020-58104-5

Zhang, S., Amahong, K., Zhang, Y., Hu, X., Huang, S., Lu, M., et al. (2023). RNAenrich: a web server for non-coding RNA enrichment. Bioinformatics 39, btad421. doi:10.1093/bioinformatics/btad421

Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., and Liu, H. (2023). DNABERT-2: efficient foundation model and benchmark for multi-species genome. arXiv [q-bio.GN]. doi:10.48550/ARXIV.2306.15006

Zong, X., Huang, L., Tripathi, V., Peralta, R., Freier, S. M., Guo, S., et al. (2015). Knockdown of nuclear-retained long noncoding RNAs using modified DNA antisense oligonucleotides. Methods Mol. Biol. 1262, 321–331. doi:10.1007/978-1-4939-2253-6_20

Keywords: lncRNA, cytoplasm localization, machine learning, DNABert-2, cell-line specific localization

Citation: Choudhury S, Mehta NK and Raghava GPS (2025) CytoLNCpred-a computational method for predicting cytoplasm associated long non-coding RNAs in 15 cell-lines. Front. Bioinform. 5:1585794. doi: 10.3389/fbinf.2025.1585794

Received: 01 March 2025; Accepted: 14 May 2025;
Published: 26 May 2025.

Edited by:

Stephen M. Mount, University of Maryland, United States

Reviewed by:

Ganesh Panzade, National Cancer Institute at Frederick (NIH), United States
Srinivasulu Yerukala Sathipati, Marshfield Clinic Research Institute, United States
Swapna Vidhur Daulatabad, National Cancer Institute at Frederick (NIH), United States

Copyright © 2025 Choudhury, Mehta and Raghava. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Gajendra P. S. Raghava, raghava@iiitd.ac.in

ORCID: Shubham Choudhury, orcid.org/0000-0002-4509-4683; Naman Kumar Mehta, orcid.org/0009-0009-0244-2826; Gajendra P. S. Raghava, orcid.org/0000-0002-8902-2876
