ORIGINAL RESEARCH article
Front. Big Data
Sec. Machine Learning and Artificial Intelligence
This article is part of the Research Topic: Frontiers in Information Technology, Electronics, and Management Innovation
Enhanced SQL Injection Detection Using Chi-Square Feature Selection and Machine Learning Classifiers
Provisionally accepted
Nelson Mandela African Institution of Science & Technology, Arusha, Tanzania
In the face of increasing cyberattacks, SQL injection remains one of the most common and damaging web threats, accounting for over 20% of global cyberattack costs. However, because these attacks are dynamic and variable, current detection methods often suffer from high false positive rates and low accuracy. This paper enhances SQL injection detection using chi-square feature selection and machine learning models. A combined dataset was assembled by merging a custom dataset with the SQLiV3.csv file from the Kaggle repository. Jensen–Shannon Divergence (JSD) analysis revealed moderate domain variation (overall JSD = 0.5775), with class-wise divergence of 0.1340 for SQLi and 0.5320 for benign queries. TF-IDF was used to convert SQL queries into feature vectors, followed by chi-square feature selection to retain the most statistically significant features. Five classifiers, namely Multinomial Naïve Bayes, Support Vector Machine, Logistic Regression, Decision Tree, and K-Nearest Neighbor (KNN), were evaluated before and after feature selection. Results show that chi-square feature selection improves classification performance across all models by reducing noise and eliminating redundant features. Notably, the Decision Tree and KNN models, which initially performed poorly, showed substantial improvements after feature selection. The Decision Tree went from the second-worst performer before feature selection to the best classifier afterward, achieving an accuracy of 99.73%, precision of 99.72%, recall of 99.70%, F1-score of 99.71%, false positive rate (FPR) of 0.25%, and misclassification rate of 0.27%. These findings highlight the crucial role of feature selection in high-dimensional data environments.
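To make the chi-square selection step concrete, the following is a minimal sketch, not the authors' implementation. It scores each token with the textbook 2x2 chi-square statistic (token present or absent versus SQLi or benign) and keeps the top-k tokens. Several things here are assumptions made for illustration: whitespace tokenization and binary token presence stand in for the TF-IDF weights used in the paper, the function names `chi_square_scores` and `select_top_k` are hypothetical, and the four toy queries are invented rather than drawn from the paper's dataset.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each token by the 2x2 chi-square statistic of
    (token present / absent) versus (label 1 = SQLi / label 0 = benign)."""
    n = len(docs)
    n_pos = sum(labels)
    # Assumption: naive lowercase whitespace tokenization, presence only.
    token_sets = [set(doc.lower().split()) for doc in docs]

    df_total = Counter()  # number of queries containing the token
    df_pos = Counter()    # number of SQLi queries containing the token
    for tokens, label in zip(token_sets, labels):
        df_total.update(tokens)
        if label == 1:
            df_pos.update(tokens)

    scores = {}
    for token, total in df_total.items():
        a = df_pos[token]        # SQLi queries with the token
        b = total - a            # benign queries with the token
        c = n_pos - a            # SQLi queries without the token
        d = (n - n_pos) - b      # benign queries without the token
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A token that appears in every query (or none) carries no signal.
        scores[token] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring tokens; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:k]]


# Hypothetical toy data: label 1 = SQLi, label 0 = benign.
toy_queries = [
    "select name from users where id = 1",
    "select name from users where id = 1 or 1=1",
    "select price from items where sku = 7",
    "select price from items where sku = 7 or 1=1",
]
toy_labels = [0, 1, 0, 1]

toy_scores = chi_square_scores(toy_queries, toy_labels)
print(select_top_k(toy_scores, 2))
```

On this toy data, tokens that occur only in the injected queries (`or`, `1=1`) receive the highest score, while tokens shared by every query (`select`, `from`) score zero, which is the noise-reduction effect the abstract attributes to feature selection. Library implementations may compute the statistic differently, for example from summed feature weights rather than presence counts.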
Future research will investigate how feature selection affects deep learning architectures, explore adaptive feature selection and incremental learning approaches, assess robustness against adversarial attacks, and evaluate model transferability across production web environments to ensure reliable real-time detection. Together, these directions aim to establish feature selection as a vital step in developing reliable SQL injection detection systems.
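The evaluation metrics reported in the abstract follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp: SQLi queries flagged as SQLi      fp: benign queries flagged as SQLi
    tn: benign queries passed as benign   fn: SQLi queries passed as benign
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of benign queries wrongly flagged as attacks.
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        # Complement of accuracy.
        "misclassification_rate": (fp + fn) / total,
    }


# Hypothetical counts, for illustration only.
for name, value in binary_metrics(tp=8, fp=2, tn=8, fn=2).items():
    print(f"{name}: {value:.2%}")
```

Because the misclassification rate is the complement of accuracy, the reported 99.73% accuracy and 0.27% misclassification rate are consistent with each other.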
Keywords: cyberattacks, machine learning, feature selection, SQL injection, high-dimensional data
Received: 15 Aug 2025; Accepted: 24 Oct 2025.
Copyright: © 2025 Casmiry, Mduma and Sinde. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Emanuel Casmiry, casmirye@nm-aist.ac.tz
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.