AUTHOR=M S Pavithran , D Saranyaraj , Chakrabortty Anirban 

TITLE=Lung cancer risk prediction using augmented machine learning pipelines with explainable AI

JOURNAL=Frontiers in Artificial Intelligence

VOLUME=Volume 8 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1602775

DOI=10.3389/frai.2025.1602775

ISSN=2624-8212

ABSTRACT=Lung cancer remains the leading cause of cancer-related deaths worldwide, making early and precise diagnosis is critical for improving the patient survival rates. Machine learning has shown promising results in predictive analysis for lung cancer prediction. However, class imbalance in clinical datasets negatively impacts the performance of Machine Learning classifiers, leading to biased predictions and reduced accuracy. In an attempt to address this issue, various data augmentation techniques were applied alongside classification models to enhance predictive performance. This study evaluates data augmentation techniques paired with machine learning classifiers to address class imbalance in a small lung cancer dataset. A comparative analysis was conducted to assess the impact of different augmentation techniques with classification models. Experimental findings demonstrate that K-Means SMOTE, combined with a Multi-Layer Perceptron classifier, achieves the highest accuracy of 93.55% and an AUC-ROC score of 96.76%, surpassing other augmentation-classifier combinations. These results underscore the importance of selecting optimal augmentation methods to improve classification performance. Furthermore, to ensure model interpretability and transparency in medical decision-making, LIME is utilized to provide insights into model predictions. The study highlights the significance of advanced augmentation techniques in addressing data imbalance, ultimately enhancing lung cancer risk prediction through machine learning. The findings contribute to the growing field of AI-driven healthcare by emphasizing the necessity of selecting effective augmentation-classifier pairs to develop more accurate and reliable diagnostic models. Due to the dataset’s high cancer prevalence (87.45%) and limited size, this work is a preliminary methodological comparison, not a clinical tool. Findings emphasize the importance of augmentation for imbalanced data and lay the groundwork for future validation with larger, representative datasets.