ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
Unveiling Patterns in Clinical Data: Exploring the Role of Large Language Models and Clustering Algorithms
Provisionally accepted
- 1 Kern Medical Center, Bakersfield, United States
- 2 Tarleton State University, Stephenville, United States
- 3 West Virginia University, Morgantown, United States
- 4 University of Central Florida, Orlando, United States
- 5 West Virginia University Heart and Vascular Institute, Morgantown, United States
Objective: Large Language Models (LLMs) have shown exceptional performance in natural language processing, yet their utility in structured clinical data analysis remains relatively underexplored. This pilot study investigates whether LLM-generated embeddings can preserve the structural integrity of clinical datasets and enhance predictive modeling, particularly in resource-constrained settings.

Methods: We applied dimensionality reduction techniques, namely Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), together with k-means clustering, to compare original data structures with those derived from LLM embeddings (see the illustrative pipeline sketch following the abstract). Evaluation metrics included cosine similarity, area under the curve (AUC), and R², applied across 100 synthetic datasets and two real-world clinical datasets: the UCI medical database and endocarditis patient records. We assessed multiple LLM architectures, including BERT, RoBERTa, Llama 2, and E5-small, focusing on predictive accuracy and computational efficiency.

Results: LLM embeddings closely mirrored original data structures, with BERT achieving a cosine similarity of 0.95 on linear datasets and Llama 2 (30B) reaching 0.85 on quadratic datasets, albeit with higher computational costs. Predictive performance improved significantly with increases in the subject-to-variable ratio (SVR), and three groups were identified: similar performance, assisted better, and assisted significantly better. These groups differed based upon the equation used to generate the synthetic data.

Discussion: These findings highlight the potential of LLMs to enhance structured data analysis and identify optimal conditions, such as SVR thresholds, for their practical use. The trade-off between computational cost and performance across different LLM architectures is also emphasized, suggesting the need for context-specific model selection.

Conclusion: LLMs can be effectively leveraged to repurpose existing clinical datasets for individualized clinical questions, such as optimizing surgical timing for patients with infective endocarditis and embolic stroke. This approach advances precision medicine and supports data-driven clinical decision-making.
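The sketch below is a minimal, hypothetical illustration of the kind of pipeline summarized in the Methods, not the authors' actual code. It assumes sentence-transformers and scikit-learn as backends, uses the intfloat/e5-small-v2 checkpoint as a stand-in for the E5-small model named above, serializes each data row to text in an assumed format, and substitutes randomly generated numbers for the clinical records.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# Synthetic stand-in for a structured clinical dataset (subjects x variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                            # 200 subjects, 8 variables
feature_names = [f"var{i}" for i in range(X.shape[1])]   # hypothetical variable names

# Serialize each row as text so a language model can embed it (assumed format).
rows_as_text = [
    "; ".join(f"{name}={value:.2f}" for name, value in zip(feature_names, row))
    for row in X
]

# Embed the serialized rows with a small pretrained model.
model = SentenceTransformer("intfloat/e5-small-v2")
E = model.encode(rows_as_text, normalize_embeddings=True)

# Two-dimensional projections of the original features and of the embeddings;
# in a study like this, these would be compared visually and quantitatively.
X_pca = PCA(n_components=2).fit_transform(X)
E_pca = PCA(n_components=2).fit_transform(E)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
E_tsne = TSNE(n_components=2, random_state=0).fit_transform(E)

# k-means clustering in the original and embedded spaces, with cluster
# agreement measured by the adjusted Rand index.
labels_orig = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_embed = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(E)
ari = adjusted_rand_score(labels_orig, labels_embed)

# One simple structure-preservation check: correlate the pairwise
# cosine-similarity matrices of the original features and the embeddings.
sim_orig = cosine_similarity(X)
sim_embed = cosine_similarity(E)
agreement = np.corrcoef(sim_orig.ravel(), sim_embed.ravel())[0, 1]

print(f"Cluster agreement (adjusted Rand index): {ari:.3f}")
print(f"Correlation of pairwise similarity structure: {agreement:.3f}")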
Keywords: Endocarditis, Large language models, Medical Informatics, Natural Language Processing, precision medicine, Predictive Modeling
Received: 02 Nov 2025; Accepted: 30 Jan 2026.
Copyright: © 2026 Ali, Gandhi, Jafri, Ali, Yahya Raza, Sulaiman and Mehaffey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Abbas Ali
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
