Particulate matter forecast and prediction in Curitiba using machine learning

Introduction Air quality is directly affected by pollutant emission from vehicles, especially in large cities and metropolitan areas or when there is no compliance check for vehicle emission standards. Particulate Matter (PM) is one of the pollutants emitted from fuel burning in internal combustion engines and remains suspended in the atmosphere, causing respiratory and cardiovascular health problems to the population. In this study, we analyzed the interaction between vehicular emissions, meteorological variables, and particulate matter concentrations in the lower atmosphere, presenting methods for predicting and forecasting PM2.5. Methods Meteorological and vehicle flow data from the city of Curitiba, Brazil, and particulate matter concentration data from optical sensors installed in the city between 2020 and 2022 were organized in hourly and daily averages. Prediction and forecasting were based on two machine learning models: Random Forest (RF) and Long Short-Term Memory (LSTM) neural network. The baseline model for prediction was chosen as the Multiple Linear Regression (MLR) model, and for forecast, we used the naive estimation as baseline. Results RF showed that on hourly and daily prediction scales, the planetary boundary layer height was the most important variable, followed by wind gust and wind velocity in hourly or daily cases, respectively. The highest PM prediction accuracy (99.37%) was found using the RF model on a daily scale. For forecasting, the highest accuracy was 99.71% using the LSTM model for 1-h forecast horizon with 5 h of previous data used as input variables. Discussion The RF and LSTM models were able to improve prediction and forecasting compared with MLR and Naive, respectively. The LSTM was trained with data corresponding to the period of the COVID-19 pandemic (2020 and 2021) and was able to forecast the concentration of PM2.5 in 2022, in which the data show that there was greater circulation of vehicles and higher peaks in the concentration of PM2.5. Our results can help the physical understanding of factors influencing pollutant dispersion from vehicle emissions at the lower atmosphere in urban environment. This study supports the formulation of new government policies to mitigate the impact of vehicle emissions in large cities.


Introduction
Vehicle emissions represent one of the primary sources of air pollution in urban areas globally, with road traffic emissions constituting a significant portion of the particulate matter (PM) present, especially at the roadside (Charron et al., 2007).Particulate matter (PM) emissions from vehicles, which account for 56% of PM, encompass various sources, including exhaust emissions (Khazini et al., 2023).These emissions predominantly contribute to fine PM, known as PM 2.5 , which refers to particles with an aerodynamic diameter <2.5 µm.Additionally, PM emissions arise from the re-suspension of dust and wear and tear of vehicle components such as brakes, tires, and clutches, primarily contributing to the coarse mode of PM (PM 2.5 -PM 10 ) (Abu-Allaban et al., 2003;Thorpe and Harrison, 2008;Kam et al., 2012;Pant and Harrison, 2013).
The population living in cities is exposed to high concentrations of PM, and the United Nations estimates that the world population living in urban areas will increase approximately 12% between 2022 and 2050 (United Nations, 2019).With this increase, the rise in vehicle fleets and the characteristics of cities will potentially affect the concentration of pollutants.The existence of a large number of buildings and scarcity of vegetation, associated with geographical and meteorological factors in urban environments, influence the diffusion, transformation, deposition, and removal of pollutants in the atmosphere (Abhijith et al., 2017;Harrison, 2018;Barwise and Kumar, 2020;Shakya et al., 2023).
Despite being known for its advances in urban mobility, Curitiba (Brazil) has observed an increase in the number of cars per inhabitant, which is higher than the population growth, and a decrease in the number of public transport users (Fochesatto et al., 2023).In 2023, the vehicle fleet in Curitiba was the fifth largest in the country, comprising more than 1.7 million vehicles, mostly cars, which accounted for approximately 66% of the fleet, followed by motorcycles, accounting approximately 10% of the fleet (BRASIL, 2023).Andrade et al. (2012) showed that the main sources of PM in Curitiba were vehicle emissions, which is responsible for most of the PM 2.5 emitted.Mercuri et al. (2023) showed that the flow of vehicles in Curitiba is directly related to the concentration of particulate matter on urban roads.
Several studies have proposed models to predict PM concentration (Brokamp et al., 2018;Shang et al., 2019;Xiao et al., 2020), but identifying the key factors influencing these predictions remains a challenging problem.Prediction and forecasting are often used interchangeably, but in our research, the terms have a clear distinction.In this study, forecasting refers to the process of estimating fine particle concentration in the future based on past observation data.On the other hand, prediction refers to estimating PM 2.5 concentration in the same time step as the input variables used to make the prediction.
The Random Forest algorithm is one of the most common models used for PM estimation.It has been increasingly used in studies predicting the concentration of atmospheric pollutants, with different temporal and spatial resolutions (Reichstein et al., 2019;Stafoggia et al., 2019;Xu et al., 2019), and it makes it possible to select variables of interest that can influence the concentration of PM 2.5 , calculating the importance of each one in the model and classifying them.Recently, there have been few studies using Recurrent Neural Networks (RNN) and their variations for air quality forecasting.Among these, long short-term memory (LSTM) takes into account the temporal dependencies in PM 2.5 concentration records and has been increasingly applied (Bekkar et al., 2021;Dhakal et al., 2021;Guo et al., 2022).Perez et al. (2020) used a neural network (NN) model and a linear model to predict the maximum 24-h PM 2.5 average in Chile, and the authors found a higher accuracy using the neural network model.Hooyberghs et al. (2005) described the design of an NN prediction tool for ambient PM concentrations in Belgium; based on measurements from 10 monitoring sites from 1997 to 2001 and on simulations of meteorological parameters, they identified the boundary layer height (BLH) as the most important input variable.Li et al. (2020) evaluated and compared the performance of six common machine learning algorithms (MLAs), including Random Forest (RF), for predicting hourly street-level of PM 2.5 concentrations at three roadside stations in Hong Kong, showing that RF was the MLA with the highest predictive accuracy and R 2 values greater than or equal to 0.95.Kamińska (2018) applied RF to predict NO, NO 2 , and PM 2.5 values in Wrocław, Poland.In the research, traffic volume, temporal characteristics, and meteorological conditions (wind speed and direction, temperature, pressure, and relative humidity) were considered as predictors; the author showed that in warmer periods, RF produces a better fit and that the most important predictors for PM 2.5 concentrations were meteorological conditions, especially temperature and wind.
This study aims to use machine learning models (Random Forest and LSTM neural network) to estimate the concentration of PM 2.5 in Curitiba, Brazil.The models to access PM 2.5 concentrations were developed using data from optical sensors installed in the city between 2020 and 2022, meteorological variables, boundary layer height, vehicle flow count, and particulate matter concentration as input variables.We seek to identify the importance of different input variables and compare PM 2.5 prediction and forecasting model performances.The study is organized as follows: Section 2 describes the data related to vehicle counting, PM 2.5 concentration, meteorological conditions, and boundary layer height, presents an overview of the machine learning models and a description of performance evaluation metrics; Section 3 contains the modeling results and discussion; and Section 4 summarizes the conclusions.

Materials and methods
This section is divided in four parts: subsection 2.1) a description of the solution developed by the authors to measure PM 2.5 concentrations and the dataset construction based on vehicle count, meteorological, and boundary layer height data; subsection 2.2) a description of the Random Forest model used for PM prediction; subsection 2.3) an overview of the LSTM Neural Network architecture applied for PM forecast; and subsection 2.4) the performance metrics applied to evaluate the quality of predictions.  .Vehicle, meteorological, and particle data Particulate matter measurement was performed using an SDS011 sensor coupled to a Raspberry Pi single-board computer.This sensor employs optical technology and uses laser scattering to obtain the concentration of particulate matter between 0.3 µm and 10 µm, including inhalable particles classified as PM 2.5 (World air quality index project, 2008).It represents a low-cost, low-power consumption measurement method with adaptability to different locations and climatic conditions, indicating the potential use for monitoring networks in various locations within a city or country (Liu et al., 2019;Tagle et al., 2020).The SDS011 sensors were deployed at 14 locations in the city of Curitiba (see Figure 1), which  was characterized by predominantly paved streets and residential areas (Rodrigues et al., 2023(Rodrigues et al., , 2024)).The data used in this study refer to the period from 1 January 2020 to 31 December 2022 .
The vehicle count data were obtained from the Perkons company, covering four points in the city (Figure 1) from 1 January 2020 to 4 October 2022.The total vehicle count (including motorcycles, cars, and trucks) was used to represent the average hourly or daily vehicle count in the city.In addition to the number of vehicles, meteorological variables and the variation in the height of the boundary layer are expected to influence the concentration and dispersion of PM in the atmosphere.Therefore, air temperature ( • C), relative humidity (%), atmospheric pressure (mB), global radiation (kJ m −2 ), wind speed (m/s), wind direction ( • ), wind gust (m/s), and precipitation data (mm) were obtained from the National Meteorological Institute (INMET) from two automatic weather stations: station A807 located at the Polytechnic Centre of the Federal University and station B806 located in the city of Colombo.
European Centre for Medium-Range Weather Forecasts (ECMWF) global climate atmospheric reanalysis data (ERA5) were utilized to obtain the variation in the planetary boundary layer height in Curitiba.The reanalysis combines model data with observations across the world, including satellite and radiosonde datasets and various observational datasets from the World Meteorological Organization's Global Telecommunication System (GTS) (Hersbach et al., 2020;Li et al., 2023).The ERA5 data cover the entire globe, on a 1440 × 721 grid with 0.25 • latitude and 0.25 • longitude resolution, a vertical resolution of 37 standard pressure layers, an hourly temporal resolution, and is computed by the bulk Richardson number method (a measure of the atmospheric conditions) (Hersbach et al., 2020;Guo et al., 2021).Figure 1 illustrates the location of Curitiba in Brazil and indicates the SDS011 sensor points, vehicle count locations, meteorological stations, and the site used for downloading ERA5 data.

. PM . prediction using Multiple Linear Regression and Random Forest
The Multiple Linear Regression (MLR) model was used as the baseline for predicting PM 2.5 .An MLR extends simple linear regression to include more than one explanatory variable, producing a multivariate model.The equation for the line in MLR modeling takes the form of Equation 1, where for i = n observations, y i is the dependent variable, x i is the explanatory variable, β 0 is the y-intercept (constant term), β p is the slope coefficients for each explanatory variable, and ǫ is the model's error term (also known as the residuals).
The Random Forest (RF) algorithm has been used to predict environmental variables concentrations, it is a classification and regression algorithm that integrates multiple decision trees through ensemble learning (Jeung et al., 2019).The RF model performs a random sampling of the original dataset using the decision tree as the basic random forest classifier resulting in n different sample datasets.These datasets are used to build n different decision tree models, with the final findings depending on the average value of these decision tree models (Kamińska, 2019;Luo et al., 2023).Essentially, the RF is constructed by a large number of trees, and the algorithm calculates the average result of all trees, as shown in Equation 2, where f (x) is the result of the RF non-linear regression, K is the number of trees, and T(x) is the result of each regression tree.In this study, the RF model was created for predicting the hourly and daily mass concentrations of PM 2.5 (dependent variable) using meteorological, vehicle count, and boundary layer height variables described in Section 2.1 as predictive (independent) variables.After calculating hourly and daily averages and cleaning missing data, the dataset was divided into 80% of training and 20% of test datasets.This division was made using a method to split arrays or matrices into random train and test subsets with a random state that controls the shuffling applied to the data before applying the split to ensure reproducible output across multiple function calls.A total of 1,000 decision trees were used to apply the random forest regression method, which is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.The performances of the MLR baseline estimation and RF model were calculated and compared for hourly and daily averages, as well as the importance of each predictor variable in the RF model.
. PM .forecast using naive model and long short-term memory neural network The long short-term memory (LSTM) is one of the Recurrent Neural Network (RNN) models most widely used in air quality forecasting because it considers the temporal dependencies observed in PM 2.5 concentration time series (Huang and Kuo, 2018;Bekkar et al., 2021).It was created to solve problems of long-term dependencies, which general RNNs cannot learn, and gradient vanishing or explosion in backpropagation, which means that the learning speed of the previous hidden layers is slower than the deeper hidden layers in RNNs, even leading to a decrease in accuracy rate as hidden layers increase (Huang and Kuo, 2018;Yadav et al., 2020;Bekkar et al., 2021).Meanwhile, LSTM has longer memory and can learn from inputs that are separated from each other by long time lags (Bekkar et al., 2021).
An LSTM has three analogical gates based on the sigmoid function, which works on the range between 0 and 1.The input gate controls the writing of input information, the forget gate determines whether the information is saved or released from the memory at each decision point, and the output gate decides what information to output (Huang and Kuo, 2018;Bekkar et al., 2021).To compare LSTM network's performance, Naive prediction's performance was build and used as reference.Naive forecasting models are based on the repetition of a historical observation solely, without trying to explain the underlying causal relationships that produce the variable being estimated (Shim et al., 2011;Ciechulski and Osowski, 2024).Our version of Naive model considers the forecast equal to the latest observation in a time series (Gleser, 1990), which means that the PM 2.5 concentration was taken as the same as the previous hour (or day) on the current hour (or day).
Following the same procedure as for the RF model, missing data were cleaned, hourly and daily averages were calculated, and 20% of data were used in Naive's forecasting representing the test data, comprising the period from 17 March 2022 to 4 October 2022.The forecast errors were calculated and used as the reference for LSTM model, as described in the next section of performance metrics.The target for the LSTM model was PM 2.5 concentration values at the subsequent timestep, i.e., the forecast horizon was set to 1 h or 1 day.We have varied the number of timesteps for the LSTM to look backward (window size) while predicting from 1 to 35.We have tested and compared different window sizes and hidden units while predicting, and the first 80% of the time series was used for training and the last 20% of data was used for testing the model.The LSTM final architecture has 2 and 3 hidden layers with 64 neurons each for hourly and daily models, respectively.
Data were preprocessed using a method to standardize features by removing the mean and scaling to unit variance.A standard LSTM code was written and optimized using the PyTorch package; we have used mean squared error (squared L2 norm) for the loss function and Stochastic Gradient Descent for the optimizer.The LSTM performance errors were calculated and compared with Naive's model.

. Performance evaluation metrics
The mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), coefficient of determination (R 2 ), and an accuracy metric were used to assess the prediction and forecast quality and compare the results of MLR with RF and of Naive estimation with LSTM.In Equations 3-6 below, n is the sample size, o i and p i represent the measured and predicted value, respectively, and ō denotes the mean of all measured values.MAE (mean absolute error) is the arithmetic mean of the absolute deviations between the measured and predicted values of the sample, as shown in Equation 3.
The mean absolute percentage error (MAPE) expresses the prediction or forecast error as a percentage and can be calculated from Equation 4.
RMSE (root mean square error) of a sample is the quadratic mean of the differences between the observed values and predicted ones.It reflects the prediction accuracy and its calculation formula is shown in Equation 5.
The coefficient of determination (R 2 ) reflects the proportion of all variations of the dependent variable that can be explained by the independent variable through the regression relationship and can be calculated by Equation 6.
The accuracy metric (ACC) is a percentage value, which depends on the MAPE and is calculated using Equation 7.

Results and discussion
Figure 2 shows the daily profile of each variable with the lines representing the average in each hour of the day.From Figure 2B, it can be observed that the number of vehicles is an important source of particle emission in Curitiba, and that the PM 2.5 concentration has one peak approximately 7 a.m. and another near 8 p.m.We note the effect of solar radiation in heating the surface, generating more dispersion and, consequently, vertical air mass movement, as represented by the increased wind velocity during the day.Changes in relative humidity and winds may also affect particle dynamics.(µg/m 3 ), is shown in Equation 8.The following independent variables from Equation 8 are hourly averages: a is atmospheric pressure (mB), P H is precipitation data (mm), V H is vehicle count, and H H is planetary boundary layer height (m).
MLR for daily time scale prediction of PM 2.5 concentration, which is indicated by the dependent variable PM D 2.5 (µg/m 3 ), is described by Equation 9.The following independent variables from Equation 9 are daily averages: T D is air temperature ( a is atmospheric pressure (mB), P D is precipitation data (mm), V D is vehicle count, and H D is planetary boundary layer height (m).
Table 1 summarizes the errors for test data using hourly and daily averages for the studied models (Random Forest and Multiple Linear Regression used for prediction, LSTM, and Naive models used for forecasting).All forecasts shown in Table 1 used window size equal to one (hour or day) and PM 2.5 concentration as input data.
Figures 3, 4 show the RF results of test data for hourly and daily averages.Figure 3 presents a dispersion plot with predicted values on x-axis and measured values on y-axis, with the 1:1 line, and Figure 4 also shows the time series of predicted and measured PM 2.5 concentration and the vehicle count data on the right y-axis.Figure 4 shows a decrease in the number of vehicles in the city of Curitiba in 2020, which probably contributed to the observed decrease in PM concentration.
RF model accuracy was higher compared with MLR in all cases, as well as MAE, MAPE, and RSME errors decrease and R 2 value increase.The daily average of the RF model provided the highest accuracy value, reaching 99.37%.For both time scale predictions, there was a reduction in errors and an increase in the R 2 value and accuracy using the RF model.By using the RF model, there was a greater increase in accuracy compared with MLR on the hourly scale, increasing the value by 0.46%, while the increase on the daily scale was only 0.07%.The increase in the R 2 value and the decrease in errors when using the RF model were also more noticeable on the hourly scale than on the daily scale.
. /fdata. .The RF algorithm was able to select the most important variables for predicting PM 2.5 .Figure 5 shows the most important predictors used by RF model for hourly and daily averages.The importances in RF are computed as the mean and standard deviation of accumulation of the impurity decrease within each tree.
Boundary layer height was the most important variable, and relative humidity was the third most important for both time scale predictions: hour and day.There was a distinction between the other variables considering the hourly and daily averages, especially wind gust and wind speed, which exchanged between the second and second-to-last most important variables.Precipitation remained the least important variable, and vehicle count rose from the fifth most important variable to the third in the analysis of daily averages.
Initially analyzing the hourly average, the accuracy value for the LSTM model was very close to Naive's, but it was found that the accuracy, R 2 , and associated errors changed according to the size of window (look back period) and the number of hidden layers.It was also checked whether increasing the number of input variables could improve the performance of LSTM compared with Naive for hourly and daily scenarios, which are described in the following.
In hourly timescale scenarios, the most important variables found in the RF model were added in stages as inputs to different LSTM models, starting with the BLH.Variables of equal importance were added in a single step, for example, relative humidity and .
/fdata. .temperature.Table 2 explains the different scenarios (H1, H2, H3, H4, and H5), and input variables were added to the LSTM models.Figure 6 shows the evolution of errors as the look back period increases for the five different inputs scenarios using three hidden layers, hidden dimension of 64, and 150 epochs of training.
The highest accuracy (99.71%) was found in scenarios H3 and H4 with a window size of 5, as observed in Figure 6A. Figure 6C shows that the lowest MAPE, 0.29%, value was found for the Naive model and LSTM with scenarios H4 and H5, while the other values varied between 0.30 and 0.31%.R 2 reached a maximum value of 0.89 on all occasions for scenario H5 with a window size greater than or equal to 10, as shown in Figure 6E. Figure 6B shows that the MAE errors remained below the MAE value found in the Naive model for an LSTM window size equal to or greater than 5, and Figure 6D shows that the RMSE values were always lower by using LSTM, compared with Naive, for scenario 5 with a window size equal to or greater than 10.
Similar to the analysis of hourly averages, the most important variables found in the RF model with daily averages were added to the LSTM model, maintaining the same window size and using 3 hidden layers, hidden dimension of 64, and 150 epochs for training.Table 3 describes each scenario for the daily time scale (D1, D2, D3, D4, and D5), the input variables, and the LSTM and Naive model's evaluation metrics.
For daily scenarios, the accuracy and R 2 values were higher for the LSTM model, and MAE, MAPE, and RSME errors were lower when compared with the Naive approximation.When the number of inputs was increased with a window size of 1 day, there was a variation in the error and accuracy values.The highest accuracy and lowest MAPE were obtained in scenario D3 using three inputs: PM 2.5 concentration, boundary layer height, and wind velocity.Figure 7 shows measured PM 2.5 concentration and predicted concentrations using LSTM on a daily scale with PM 2.5 concentration as input (scenario D1).
In 2022, vehicle count and particulate matter concentrations were higher than in 2021 and 2020.This behavior is associated with the COVID-19 pandemic, in which vehicle circulation decreased due to the lockdown in the City of Curitiba.Consequently, there was a decrease in PM 2.5 concentration peaks during 2020-2021.Even though the LSTM was trained with lower values, corresponding to the period of the pandemic, the model was able to forecast the PM 2.5 concentration in 2022.

Conclusion
The data gathered in this research provide important information about the City of Curitiba, Brazil, especially the relationship between number of vehicles and concentration of fine particles.Using the dataset created, the models for prediction and forecasting indicated that Random Forest and LSTM model were good estimators of PM 2.5 concentration.
Model's performance was analyzed using measured data from Curitiba, while several inputs were tested.RF had better results for prediction compared with MLR, reaching 99.37% of accuracy at daily time scale.The lowest accuracy in prediction was the one that considered the hourly time scale using MLR.In general, models at daily scale performed better compared with models at hourly scale.The RF model identified boundary layer height as the most important input variable for both time scales and precipitation as the less important.Variations in wind conditions, vehicle count, air temperature, and relative humidity contributed significantly to predictions at hourly and daily scales.
The inputs recognized as most important in RF prediction (BLH, wind intensity, humidity, and vehicle count) were also important for LSTM forecast.The results of the LSTM model showed sensible variation depending mainly on model's looking back window size and inputs, sometimes reaching or exceeding the values found in the Naive model, with a maximum accuracy of 99.71% found on hourly scale with window size equal to 5 h.LSTM model had better performance compared with Naive's in forecasting at daily scale.Because of its ability to exploit the sequential nature of the data, LSTM network have the tendency to outperform Naive model.
Data showed the influence of COVID-19 pandemic on vehicle circulation and fine particulate matter concentration in Curitiba, with lower values in 2020 and 2021, followed by an increase in 2022.LSTM neural network was trained with pandemic data and was able to generate good forecasts for PM 2.5 concentration in 2022, a post pandemic period.
RF and LSTM proved to be good models for the prediction of fine particles and forecasting in Curitiba, respectively.Our results help the physical understanding of factors influencing pollutant dispersion from vehicle emissions at the lower atmosphere in urban environment.As a suggestion for future studies, we recommend the application and comparison of other models to predict and forecast PM 2.5 , as well as testing larger window sizes to verify if it is possible to improve the performance of the model.It is also suggested to include vehicle information categorized by type or fuel as input variables of the models.

Funding
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article.This work was partially financed by CAPES-Brazil-Finance Code 001, the CNPq-Brazil-Universal Call (grant agreement no.423674/2018-8), the Estonian Research Council (grant PRG 1674), and the European Union's Horizon 492 2020 Research and Innovation Programme (grant agreement no.871115) ACTRIS IMP.

FIGURE
FIGURELocation of SDSsensor points, vehicle count, meteorological stations, and ERA data in the city of Curitiba and metropolitan region (Colombo).

FIGURE
FIGURE Daily profile with hourly averages of the dataset variables: (A) PM .concentration, air temperature, and relative humidity; (B) vehicle count, BLH, and solar radiation; (C) wind gust, wind velocity, and wind direction; (D) precipitation and air pressure.Shaded areas represent confidence interval of %.

FIGURE
FIGURERF predictions against measured PM .concentration in Curitiba for the validation period considering: (A) hourly averages and (B) daily averages.

FIGURE
FIGURETest data time series of measured and predicted PM .concentration in Curitiba using RF.Fine particles and vehicle count data were resampled using (A) hourly and (B) daily averages.

FIGURE
FIGURERF importance of the input variables for predicting PM .concentration in Curitiba considering: (A) hourly averages and (B) daily averages.

FIGURE
FIGURE LSTM -h forecast performance metrics for di erent input scenarios as we increased the looking back window size.The metrics are: (A) Accuracy (B) MAE (C) MAPE, (D) RMSE and (E) R .
TABLE Description of scenarios and input variables used in each LSTM hour forecast model.

TABLE Summary of
Naive and LSTM model's performance results for di erent scenarios using daily averages of input variables and window size equal to day.Test data time series of measured and predicted daily concentration of PM . in Curitiba using LSTM with window size equal to day. FIGURE