ORIGINAL RESEARCH article
Front. Bioinform.
Sec. Single Cell Bioinformatics
Volume 5 - 2025 | doi: 10.3389/fbinf.2025.1562410
Optimization of clustering parameters for single-cell RNA analysis using intrinsic goodness metrics
Provisionally accepted
- 1Ri.MED Foundation, Palermo, Italy
- 2Department of Economics, Business and Statistics, University of Palermo, Palermo, Sicily, Italy
- 3National Center for Gene Therapy and Drugs based on RNA Technology, Palermo, Italy
Accurate clustering of cell subpopulations is a crucial step in single-cell RNA sequencing analysis, and it hinges on the efficacy of unsupervised clustering. Despite the advances in and numerous adaptations of clustering algorithms, correctly clustering cells remains challenging because the outcome depends on both the data and the parameters selected for the clustering process. In this context, the present study aimed to predict the accuracy of clustering methods under varying parameters by exploiting intrinsic goodness metrics. A robust linear regression model, built using three cell-atlas datasets as ground truth, showed that using the UMAP method to generate the neighborhood graph and increasing the resolution both have a beneficial impact on accuracy. The impact of the resolution parameter is accentuated when the number of nearest neighbors is reduced, which results in sparser and more locally sensitive graphs that better preserve fine-grained cellular relationships. Moreover, it is advisable to test different numbers of principal components, given that this parameter is highly affected by data complexity. Clustering accuracy was then predicted effectively from intrinsic metrics: fifteen intrinsic measures were calculated for the three cell-atlas datasets, and an ElasticNet regression model was trained in both intra- and cross-dataset approaches. The findings showed that the within-cluster dispersion and the Banfield-Raftery index can be used as proxies for accuracy. Conversely, the silhouette index, despite its extensive use in the literature, is less effective as an intrinsic metric for an immediate comparison of different clustering parameter configurations.
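To make the two metrics highlighted above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used definitions: within-cluster dispersion as the sum of squared distances of each cell to its cluster centroid, and the Banfield-Raftery index as the sum over clusters of n_k · log(S_k / n_k), where S_k is the scatter of cluster k and n_k its size. The function names and the choice to skip zero-scatter clusters are illustrative assumptions.

```python
import math
from collections import defaultdict


def _cluster_scatter(points, labels):
    """Return {label: (n_k, S_k)}, where S_k is the sum of squared
    distances of the cluster's points to the cluster centroid."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    stats = {}
    for label, members in groups.items():
        n = len(members)
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / n for d in range(dim)]
        scatter = sum(
            (m[d] - centroid[d]) ** 2 for m in members for d in range(dim)
        )
        stats[label] = (n, scatter)
    return stats


def within_cluster_dispersion(points, labels):
    """Total within-cluster sum of squares (assumed definition)."""
    return sum(s for _, s in _cluster_scatter(points, labels).values())


def banfield_raftery(points, labels):
    """Sum over clusters of n_k * log(S_k / n_k) (assumed definition).
    Clusters with zero scatter are skipped because the log is undefined;
    this handling is an assumption of the sketch."""
    total = 0.0
    for n, scatter in _cluster_scatter(points, labels).values():
        if scatter > 0:
            total += n * math.log(scatter / n)
    return total


# Toy example: four 2-D points (e.g. two principal components) in two clusters.
embedding = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]
clusters = [0, 0, 1, 1]
wcd = within_cluster_dispersion(embedding, clusters)
br = banfield_raftery(embedding, clusters)
```

In practice these values would be computed on the reduced-dimension embedding for each clustering parameter configuration and then compared across configurations; how the paper scales or combines them in the ElasticNet model is not stated in the abstract.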
Keywords: single-cell, clustering, intrinsic metrics, ElasticNet, robust linear mixed regression
Received: 17 Jan 2025; Accepted: 27 May 2025.
Copyright: © 2025 Sciaraffa, Gagliano, Augugliaro and Coronnello. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Nicolina Sciaraffa, Ri.MED Foundation, Palermo, Italy
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.