
ORIGINAL RESEARCH article

Front. Environ. Sci.

Sec. Environmental Informatics and Remote Sensing

This article is part of the Research Topic: Modeling for Environmental Pollution and Change, Volume II

Development of a Novel Imputation Framework for PM2.5 Particle Data in Pakistani Cities using Machine Learning and Statistical Techniques

Provisionally accepted
  • 1 American University of Kuwait, Safat, Kuwait
  • 2 University of Strathclyde Department of Mathematics and Statistics, Glasgow, United Kingdom
  • 3 The Public Authority for Applied Education and Training, Safat, Kuwait

The final, formatted version of the article will be published soon.

Gaps in PM2.5 data from environmental monitoring systems, caused by sensor and communication failures, system maintenance issues, and monitoring gaps, can hinder assessments relevant to public health and the development of policies aimed at improving air quality. This study therefore compares five imputation methods: (1) Bayesian regression (BR), (2) K-Nearest Neighbors (KNN), (3) missForest, (4) Predictive Mean Matching (PMM), and (5) Random Forest (RF), using daily PM2.5 measurements collected from May 2019 to December 2024 at monitoring sites in the Pakistani cities of Islamabad, Karachi, Lahore and Peshawar. Three missingness mechanisms were simulated on these data, namely Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR), with missing rates ranging from 5% to 25%. The accuracy of each imputation technique was evaluated using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). The results showed that imputation under the MAR mechanism consistently produced the lowest errors as the missing rates increased. Among the tested methods, missForest and KNN performed best at all missingness levels and, more importantly, retained most of the temporal structure, range, and variability of the data. missForest yielded the lowest RMSE and MAE values across the various missing rates and mechanisms, which makes it the best imputation approach for PM2.5 data in this setting.
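The evaluation design described above (mask known values, impute them, then score the imputed values against the held-out truth with RMSE and MAE) can be illustrated with a minimal Python sketch. It is not the authors' code. The series and the covariate are synthetic, all function names are hypothetical, the 5-point steps between 5% and 25% are an assumption, and a simple mean fill stands in for the five methods actually compared in the study. The MAR and MNAR masks use a deterministic simplification: the entries with the largest driver values are dropped, where the driver is an observed covariate (MAR) or the PM2.5 value itself (MNAR).

```python
import math
import random


def make_series(n, seed=0):
    """Synthetic daily PM2.5-like values (hypothetical, not the study data)."""
    rng = random.Random(seed)
    return [60 + 25 * math.sin(2 * math.pi * t / 365) + rng.gauss(0, 8)
            for t in range(n)]


def make_covariate(n, seed=1):
    """Synthetic fully observed covariate (hypothetical) used to drive MAR."""
    rng = random.Random(seed)
    return [math.cos(2 * math.pi * t / 365) + rng.gauss(0, 0.3)
            for t in range(n)]


def mask_mcar(values, rate, seed=0):
    """MCAR: every index is equally likely to be missing."""
    rng = random.Random(seed)
    k = round(rate * len(values))
    return set(rng.sample(range(len(values)), k))


def mask_by_driver(driver, rate):
    """Drop the indices with the largest driver values.

    MAR if `driver` is an observed covariate,
    MNAR if `driver` is the PM2.5 series itself.
    """
    k = round(rate * len(driver))
    order = sorted(range(len(driver)), key=lambda i: driver[i], reverse=True)
    return set(order[:k])


def impute_mean(values, missing):
    """Stand-in imputer: fill missing entries with the observed mean."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = sum(observed) / len(observed)
    return [fill if i in missing else v for i, v in enumerate(values)]


def rmse(truth, imputed, missing):
    """Root Mean Square Error over the masked entries only."""
    return math.sqrt(
        sum((truth[i] - imputed[i]) ** 2 for i in missing) / len(missing))


def mae(truth, imputed, missing):
    """Mean Absolute Error over the masked entries only."""
    return sum(abs(truth[i] - imputed[i]) for i in missing) / len(missing)


def evaluate(values, covariate, rates=(0.05, 0.10, 0.15, 0.20, 0.25)):
    """Return {(mechanism, rate): (rmse, mae)} for the stand-in imputer."""
    results = {}
    for rate in rates:
        masks = {
            "MCAR": mask_mcar(values, rate),
            "MAR": mask_by_driver(covariate, rate),
            "MNAR": mask_by_driver(values, rate),
        }
        for mechanism, missing in masks.items():
            imputed = impute_mean(values, missing)
            results[(mechanism, rate)] = (
                rmse(values, imputed, missing),
                mae(values, imputed, missing),
            )
    return results


if __name__ == "__main__":
    series = make_series(200)
    covariate = make_covariate(200)
    for (mechanism, rate), (r, a) in evaluate(series, covariate).items():
        print(f"{mechanism:5s} rate={rate:.2f} RMSE={r:.2f} MAE={a:.2f}")
```

Replacing `impute_mean` with any imputer that has the same inputs and output would allow the same masks and metrics to be reused, which is the point of scoring only the masked entries against their known values.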

Keywords: air quality monitoring, machine learning, MissForest, Pakistan, PM2.5 missing data imputation

Received: 05 Jan 2026; Accepted: 30 Jan 2026.

Copyright: © 2026 Alsaber, Khan, PAN, Alshatti and Gray. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Ahmad Alsaber
Muhammad Khan

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.