Your new experience awaits. Try the new design now and help us make it even better

ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. AI in Business

Volume 8 - 2025 | doi: 10.3389/frai.2025.1669191

This article is part of the Research TopicSoft Computing and Artificial Intelligence Techniques in Decision Making, Management and EngineeringView all 5 articles

Minimizing unnecessary tax audits using multi-objective hyperparameter tuning of XGBoost with focal loss

Provisionally accepted
Ivan  MalashinIvan Malashin*Igor  S MasichIgor S MasichVadim  S TynchenkoVadim S TynchenkoAndrei  P GantimurovAndrei P GantimurovVladimir  A NelyubVladimir A NelyubAleksei  S BorodulinAleksei S Borodulin
  • Department of Law, Bauman Moscow State Technical University, Moscow, Russia

The final, formatted version of the article will be published soon.

This study presents a machine learning (ML) approach for detecting non-compliance in companies' tax data. The dataset, consisting of over one million records, focuses on three key targets: invalid addresses, invalid director information, and invalid founder information. The analysis prioritizes young companies (≤3 years old) with fewer than 100 employees, thereby improving class distributions and model effectiveness. A combination of binary classification techniques was employed, including benchmarked supervised learning models (XGBoost, Random Forest), anomaly detection methods (LOF, Isolation Forest), and semi-supervised learning using deep neural networks (DNNs) with unlabeled data. Given its computational efficiency, XGBoost was selected as the primary model. However, class imbalance persisted even among young companies, necessitating the integration of focal loss to improve classification performance. To further enhance accuracy while maintaining model interpretability, NSGA-II (Non-dominated Sorting Genetic Algorithm II) was used for multi-objective hyperparameter optimization of XGBoost. The objectives were to maximize ROC-AUC for improved predictive performance and minimize the number of trees to enhance interpretability. The optimized model achieved a ROC-AUC of 0.9417, compared to 0.9161 without optimization, demonstrating the effectiveness of this approach. Additionally, SHAP analysis provided insights into key factors influencing non-compliance, supporting explainability and aiding regulatory decision-making. This methodology contributes to fair and efficient oversight by reducing unnecessary inspections, minimizing disruptions to compliant businesses, and improving the overall effectiveness of tax compliance monitoring.

Keywords: Company data, Tax Compliance, anomaly detection, machine learning, XGBoost

Received: 19 Jul 2025; Accepted: 24 Sep 2025.

Copyright: © 2025 Malashin, Masich, Tynchenko, Gantimurov, Nelyub and Borodulin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Ivan Malashin, ivan.p.malashin@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.