ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Unveiling Patterns in Clinical Data: Exploring the Role of Large Language Models and Clustering Algorithms

Provisionally accepted
Abbas Ali1*, Subi Gandhi2, Syed H Jafri2, Mohammed M Ali3, Syed Yahya Raza4, Samian Sulaiman5, James Mehaffey5
  • 1Kern Medical Center, Bakersfield, United States
  • 2Tarleton State University, Stephenville, United States
  • 3West Virginia University, Morgantown, United States
  • 4University of Central Florida, Orlando, United States
  • 5West Virginia University Heart and Vascular Institute, Morgantown, United States

The final, formatted version of the article will be published soon.

Objective Large Language Models (LLMs) have shown exceptional performance in natural language processing, yet their utility in structured clinical data analysis remains relatively underexplored. This pilot study investigates whether LLM-generated embeddings can preserve the structural integrity of clinical datasets and enhance predictive modeling, particularly in resource-constrained settings.
Methods We applied the dimensionality reduction techniques Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), together with k-means clustering, to compare original data structures with those derived from LLM embeddings. Evaluation metrics included cosine similarity, area under the curve (AUC), and R², applied across 100 synthetic datasets and two real-world clinical datasets: the UCI medical database and endocarditis patient records. We assessed multiple LLM architectures, including BERT, RoBERTa, Llama 2, and E5-small, focusing on predictive accuracy and computational efficiency.
Results LLM embeddings closely mirrored original data structures, with BERT achieving a cosine similarity of 0.95 on linear datasets and Llama 2 (30B) reaching 0.85 on quadratic datasets, albeit with higher computational costs. Predictive performance improved significantly across the board as the subject variable ratio (SVR) increased. Three groups were identified: similar performance, assisted better, and assisted significantly better. These groups differed according to the equation used to generate the synthetic data.
Discussion These findings highlight the potential of LLMs to enhance structured data analysis by identifying optimal conditions, such as SVR thresholds, for their practical use. The trade-off between computational cost and performance across different LLM architectures is also emphasized, suggesting the need for context-specific model selection.
Conclusion LLMs can be effectively leveraged to repurpose existing clinical datasets for individualized clinical questions, such as optimizing surgical timing for patients with infective endocarditis and embolic stroke. This approach advances precision medicine and supports data-driven clinical decision-making.
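
The abstract does not state how cosine similarity was computed between the original data and the LLM embeddings. As a minimal sketch of one possible approach, the Python below compares the pairwise-distance profile of a dataset with that of its embedding. The function names, the toy data, and the choice to compare Euclidean distance vectors are all assumptions made for illustration, not the authors' method.

```python
import math
from itertools import combinations


def pairwise_distances(rows):
    """Euclidean distances between all pairs of rows, as a flat list."""
    return [math.dist(a, b) for a, b in combinations(rows, 2)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def structure_similarity(original, embedded):
    """Hypothetical structure-preservation score (an assumption, not the
    paper's metric): cosine similarity between the pairwise-distance
    vectors of the original rows and of their embeddings. Row i of both
    inputs must describe the same subject."""
    if len(original) != len(embedded):
        raise ValueError("both inputs need one row per subject")
    return cosine_similarity(
        pairwise_distances(original), pairwise_distances(embedded)
    )


# Toy data: four subjects described by two variables.
original = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

# Stand-in for an embedding that keeps the geometry: a scaled copy
# padded with an extra zero dimension.
faithful = [[2 * x, 2 * y, 0.0] for x, y in original]

# Stand-in for an embedding that scrambles which subject is which.
scrambled = list(reversed(original))

faithful_score = structure_similarity(original, faithful)
scrambled_score = structure_similarity(original, scrambled)
```

In this toy setup the geometry-preserving stand-in scores at or extremely close to 1, while the scrambled one scores lower. With real embeddings, the rows of `embedded` would come from an LLM encoder applied to a text serialization of each record, a step this sketch leaves out.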

Keywords: Endocarditis, Large language models, Medical Informatics, Natural Language Processing, precision medicine, Predictive Modeling

Received: 02 Nov 2025; Accepted: 30 Jan 2026.

Copyright: © 2026 Ali, Gandhi, Jafri, Ali, Yahya Raza, Sulaiman and Mehaffey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Abbas Ali

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.