Unveiling Patterns in Clinical Data: Exploring the Role of Large Language Models and Clustering Algorithms

Ali, Abbas; Gandhi, Subi; Jafri, Syed H; Ali, Mohammed M; Yahya Raza, Syed; Sulaiman, Samian; Mehaffey, James

doi:10.3389/frai.2026.1737530

ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Provisionally accepted

Abbas Ali¹*

Subi Gandhi²

Syed H Jafri²

Mohammed M Ali³

Syed Yahya Raza⁴

Samian Sulaiman⁵

James Mehaffey⁵

¹Kern Medical Center, Bakersfield, United States
²Tarleton State University, Stephenville, United States
³West Virginia University, Morgantown, United States
⁴University of Central Florida, Orlando, United States
⁵West Virginia University Heart and Vascular Institute, Morgantown, United States

The final, formatted version of the article will be published soon.

Objective: Large Language Models (LLMs) have shown exceptional performance in natural language processing, yet their utility in structured clinical data analysis remains relatively underexplored. This pilot study investigates whether LLM-generated embeddings can preserve the structural integrity of clinical datasets and enhance predictive modeling, particularly in resource-constrained settings.
Methods: We applied dimensionality reduction techniques, namely Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), together with k-means clustering, to compare original data structures with those derived from LLM embeddings. Evaluation metrics included cosine similarity, area under the curve (AUC), and R², applied across 100 synthetic datasets and two real-world clinical datasets: the UCI medical database and endocarditis patient records. We assessed multiple LLM architectures, including BERT, RoBERTa, Llama 2, and E5-small, focusing on predictive accuracy and computational efficiency.
Results: LLM embeddings closely mirrored original data structures, with BERT achieving a cosine similarity of 0.95 on linear datasets and Llama 2 (30B) reaching 0.85 on quadratic datasets, albeit with higher computational costs. Predictive performance improved significantly across the board as the subject variable ratio (SVR) increased. Three groups were identified: similar performance, assisted better, and assisted significantly better. These groups differed according to the equation used to generate the synthetic data.
Discussion: These findings highlight the potential of LLMs to enhance structured data analysis by identifying optimal conditions, such as SVR thresholds, for their practical use. The trade-off between computational cost and performance across different LLM architectures is also emphasized, suggesting the need for context-specific model selection.
Conclusion: LLMs can be effectively leveraged to repurpose existing clinical datasets for individualized clinical questions, such as optimizing surgical timing for patients with infective endocarditis and embolic stroke. This approach advances precision medicine and supports data-driven clinical decision-making.
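
The structure comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not say how cosine similarity was computed, so the sketch assumes one plausible reading, namely comparing the pairwise-distance structure of the PCA-projected original data with that of the PCA-projected embeddings. The function names (`make_synthetic`, `stand_in_embedding`, `structure_similarity`) are hypothetical, the embedding is a random projection standing in for a real LLM encoder, and SVR is assumed to mean the number of subjects divided by the number of variables.

```python
import numpy as np


def make_synthetic(n_subjects, n_variables, kind="linear", seed=0):
    """Hypothetical generator: features X and an outcome y built from a
    linear or quadratic generating equation plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_subjects, n_variables))
    w = rng.normal(size=n_variables)
    signal = X @ w if kind == "linear" else (X ** 2) @ w
    y = signal + 0.1 * rng.normal(size=n_subjects)
    return X, y


def stand_in_embedding(X, dim=32, noise=0.05, seed=1):
    """Assumption: a noisy random projection used in place of an LLM
    embedding, so the example runs without any model download."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], dim)) / np.sqrt(dim)
    return X @ P + noise * rng.normal(size=(X.shape[0], dim))


def pca_project(X, n_components=2):
    """PCA via SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def pairwise_distances(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def structure_similarity(A, B):
    """Cosine similarity between the pairwise-distance vectors of A and B.
    Distances are unchanged by rotation, so the score ignores the
    arbitrary orientation of PCA axes."""
    iu = np.triu_indices(A.shape[0], k=1)
    da = pairwise_distances(A)[iu]
    db = pairwise_distances(B)[iu]
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))


n_subjects, n_variables = 200, 10
svr = n_subjects / n_variables  # assumed definition of SVR
X, y = make_synthetic(n_subjects, n_variables, kind="linear")
E = stand_in_embedding(X)
sim = structure_similarity(pca_project(X), pca_project(E))
print(f"SVR = {svr:.0f}, structure similarity = {sim:.3f}")
```

Because distance vectors are non-negative, this score lies between 0 and 1 and tends to run high even for loosely related structures, so it is better read relative to a baseline (for example, a shuffled copy of the data) than as an absolute measure.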

Keywords: Endocarditis, Large language models, Medical Informatics, Natural Language Processing, precision medicine, Predictive Modeling

Received: 02 Nov 2025; Accepted: 30 Jan 2026.

Copyright: © 2026 Ali, Gandhi, Jafri, Ali, Yahya Raza, Sulaiman and Mehaffey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Abbas Ali

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.