ORIGINAL RESEARCH article

Front. Big Data

Sec. Machine Learning and Artificial Intelligence

This article is part of the Research Topic: Frontiers in Information Technology, Electronics, and Management Innovation

Enhanced SQL Injection Detection Using Chi-Square Feature Selection and Machine Learning Classifiers

Provisionally accepted
  • Nelson Mandela African Institution of Science & Technology, Arusha, Tanzania

The final, formatted version of the article will be published soon.

In the face of increasing cyberattacks, SQL injection remains one of the most common and damaging web threats, accounting for over 20% of global cyberattack costs. However, because these attacks are dynamic and variable, current detection methods often suffer from high false positive rates and low accuracy. This paper enhances SQL injection detection using chi-square feature selection and machine learning models. A combined dataset was assembled by merging a custom dataset with the SQLiV3.csv file from the Kaggle repository. Jensen–Shannon Divergence (JSD) analysis revealed moderate domain variation (overall JSD = 0.5775), with class-wise divergence of 0.1340 for SQLi and 0.5320 for benign queries. TF-IDF was used to convert SQL queries into feature vectors, followed by chi-square feature selection to retain the most statistically significant features. Five classifiers, namely Multinomial Naïve Bayes, Support Vector Machine, Logistic Regression, Decision Tree, and K-Nearest Neighbor (KNN), were tested before and after feature selection. Results show that chi-square feature selection improves classification performance across all models by reducing noise and eliminating redundant features. Notably, the Decision Tree and KNN models, which initially performed poorly, showed substantial improvements after feature selection. The Decision Tree rose from second-worst performer before feature selection to best classifier afterward, achieving an accuracy of 99.73%, precision of 99.72%, recall of 99.70%, F1-score of 99.71%, false positive rate (FPR) of 0.25%, and misclassification rate of 0.27%. These findings highlight the crucial role of feature selection in high-dimensional data environments.
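To make the chi-square selection step concrete, the following is a minimal sketch, not the authors' implementation. It assumes a whitespace tokenizer and scores each token with the classic 2x2 contingency chi-square statistic on token presence, which is a simplification of scoring TF-IDF weights. The function names and the toy queries are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter


def tokenize(query):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return query.lower().split()


def chi_square_scores(queries, labels):
    """Chi-square statistic between token presence and class (1 = SQLi, 0 = benign)."""
    n = len(queries)
    n_pos = sum(labels)
    n_neg = n - n_pos
    present_pos, present_neg = Counter(), Counter()
    for query, label in zip(queries, labels):
        for tok in set(tokenize(query)):  # count presence once per query
            if label == 1:
                present_pos[tok] += 1
            else:
                present_neg[tok] += 1

    scores = {}
    for tok in set(present_pos) | set(present_neg):
        a = present_pos[tok]  # SQLi queries containing the token
        b = present_neg[tok]  # benign queries containing the token
        c = n_pos - a         # SQLi queries without the token
        d = n_neg - b         # benign queries without the token
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[tok] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data for illustration only.
queries = [
    "select * from users where id = 1 or 1=1",
    "admin' or '1'='1' --",
    "1 union select password from users",
    "select name from products where id = 7",
    "update cart set qty = 2 where id = 3",
    "hello world search term",
]
labels = [1, 1, 1, 0, 0, 0]

scores = chi_square_scores(queries, labels)
print(select_top_k(scores, 3))
```

Tokens that appear in both classes, such as `select` or `where`, receive low scores and are the kind of feature this step would discard. In practice a library routine would usually replace the hand-written scoring; scikit-learn's `SelectKBest` with `chi2` is one commonly used option, though the abstract does not state which implementation was used.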
Future research will investigate how feature selection affects deep learning architectures, and will explore adaptive feature selection, incremental learning approaches, and robustness against adversarial attacks. It will also evaluate model transferability across production web environments to ensure reliable real-time detection, further establishing feature selection as a vital step in developing reliable SQL injection detection systems.
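The evaluation metrics quoted in the abstract (accuracy, precision, recall, F1-score, FPR, and misclassification rate) all follow from a binary confusion matrix. The sketch below uses the standard definitions, with SQLi treated as the positive class. The function name `binary_metrics` and the example counts are hypothetical and are not the paper's results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # FPR: share of benign queries wrongly flagged as SQLi.
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        # Misclassification rate is the complement of accuracy.
        "misclassification_rate": (fp + fn) / total,
    }


# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because misclassification rate is defined as one minus accuracy, the reported 99.73% accuracy and 0.27% misclassification rate are consistent with each other.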

Keywords: cyberattacks, machine learning, feature selection, SQL injection, high-dimensional data

Received: 15 Aug 2025; Accepted: 24 Oct 2025.

Copyright: © 2025 Casmiry, Mduma and Sinde. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Emanuel Casmiry, casmirye@nm-aist.ac.tz

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.