Construction and evaluation of hourly average indoor PM2.5 concentration prediction models based on multiple types of places

Background: People usually spend most of their time indoors, so indoor fine particulate matter (PM2.5) concentrations are crucial for refining individual PM2.5 exposure evaluation. The development of indoor PM2.5 concentration prediction models is essential for the health risk assessment of PM2.5 in epidemiological studies involving large populations.
Methods: In this study, based on monitoring data from multiple types of places, the classical multiple linear regression (MLR) method and the random forest regression (RFR) machine learning algorithm were used to develop hourly average indoor PM2.5 concentration prediction models. Indoor PM2.5 concentration data, comprising 11,712 records from five types of places, were obtained by on-site monitoring. The potential predictor variable data were derived from outdoor monitoring stations and meteorological databases. Ten-fold cross-validation was conducted to examine the performance of all proposed models.
Results: The final predictor variables incorporated in the MLR model were outdoor PM2.5 concentration, type of place, season, wind direction, surface wind speed, hour, precipitation, air pressure, and relative humidity. The ten-fold cross-validation results indicated that both models had good predictive performance, with determination coefficients (R2) of 72.20% for RFR and 60.35% for MLR. Generally, the RFR model had better predictive performance than the MLR model (RFR model developed using the same predictor variables as the MLR model, R2 = 71.86%). In terms of predictors, the variable importance results for both types of models suggested that outdoor PM2.5 concentration, type of place, season, hour, wind direction, and surface wind speed were the most important predictor variables.
Conclusion: In this research, hourly average indoor PM2.5 concentration prediction models based on multiple types of places were developed for the first time. Both the MLR and RFR models, based on easily accessible indicators, displayed promising predictive performance, and the machine learning RFR model outperformed the classical MLR model. This result suggests the potential application of RFR algorithms for indoor air pollutant concentration prediction.


Introduction
PM2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less, which is one of the environmental pollutants with the greatest impact on public health (1-3). Numerous epidemiological studies have shown that both long-term and short-term exposure to PM2.5 increases the risk of death from respiratory and cardiovascular diseases in the population (4-6). Studies have shown that for every 10 μg/m3 increase in the average concentration of PM2.5 in ambient air, there is a 3.1% increase in hospital admissions and a 2.5% increase in mortality from chronic obstructive pulmonary disease (7). Furthermore, there is a 3% increase in emergency department visits for bronchial asthma (8), a 16% increase in the risk of death from ischemic heart disease, and a 14% increase in mortality from stroke (4, 9).
Currently, most relevant studies use ambient PM2.5 concentrations as a surrogate for human PM2.5 exposure without taking into account the difference between indoor and outdoor PM2.5 concentrations or the contribution of indoor PM2.5 exposure to actual human exposure, which limits the interpretation of their results. Most people spend at least 80% of their day indoors, and for some specific populations, such as older adults and children, this percentage is even higher (10-12). Therefore, indoor PM2.5 concentration is crucial for accurate PM2.5 exposure assessment and health risk assessment. Direct measurement of indoor PM2.5 concentration can provide the most accurate data; however, it is not easy to achieve, as it requires considerable manpower and material resources as well as the compliance of the research participants, especially in large-scale population and/or long-term studies. When direct measurement is difficult to achieve, it is important to construct appropriate predictive models.
At present, many studies have established prediction models for indoor PM2.5 concentration (12-18), mainly involving multiple linear regression (MLR) models and random forest regression (RFR) models, each of which has its own advantages and disadvantages. There is still controversy about which model predicts indoor PM2.5 concentration better. In addition, the models in these studies have mostly predicted the average indoor PM2.5 concentration over one or more days, and do not adequately account for the fluctuation of indoor PM2.5 concentration within the day (or longer) or the variability of individual behaviors over time (19-21). Indoor PM2.5 concentration prediction models with higher temporal resolution are therefore of more practical significance for improving individual PM2.5 exposure assessment. The existing models were also constructed using indoor PM2.5 concentration monitoring data from a single type of place, which limits their generalizability and their practical application to different types of places. No study has yet established prediction models for hourly average indoor PM2.5 concentration based on data from multiple types of places.
In this study, monitoring data on indoor PM2.5 concentrations from five types of typical sites (offices, primary and secondary schools, kindergartens, shopping malls, and restaurants) in Shanghai were collected during different seasons. The data were used to develop and evaluate predictive MLR and RFR models for hourly average indoor PM2.5 concentrations based on multiple types of places. The aim of the study was to provide a feasible way to improve individual PM2.5 exposure assessment.

Data collection
Five types of typical locations (offices, middle and primary schools, kindergartens, shopping malls, and restaurants) were selected for indoor PM2.5 concentration field monitoring in 16 districts of Shanghai. A TSI DustTrak 8530 benchtop aerosol monitor (TSI Incorporated, Shoreview, MN, United States) was used for the monitoring. For office buildings, shopping malls, and restaurants, one floor in each of the high, middle, and low areas of the building was selected as the monitoring site. Two, four, and six monitoring points were set for indoor areas of 200-1,000 m2, 1,001-5,000 m2, and over 5,000 m2, respectively. For kindergartens and middle and primary schools, two classrooms from each floor in the high, middle, and low areas were used as monitoring sites. One, three, and five monitoring points were set for indoor areas of less than 50 m2, 50-100 m2, and more than 100 m2, respectively. All of the above points were distributed evenly along the diagonal of the room or in a plum-blossom (five-point) pattern, and the height of each point was set at the level of the human breathing zone (0.8-1.2 m). Measurements were taken in January, April, July, and October of 2018 (the 4 months represented the four seasons of the year: January for winter, April for spring, July for summer, and October for autumn). Indoor PM2.5 concentrations in each location were monitored for 1 week during these 4 months, with each instrument recording the concentration every 15 min and covering all times of the day (00:00-23:00 h) to capture people's activities in the various places as fully as possible.
For the construction of prediction models, we used the findings of relevant publications (17, 21-24) to identify 11 easily accessible indicators that may have significant effects on indoor PM2.5 concentrations. Details of the indicators can be found in Supplementary Table S1. The outdoor PM2.5 and PM10 concentration data were obtained from the monitoring stations of 16 municipal control points in Shanghai. The distance between every government-controlled monitoring station and each indoor place we monitored was calculated, and the data from the closest station were used as the outdoor PM2.5 and PM10 concentration data for that indoor place. Meteorological data for the same period were obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF) and included outdoor temperature, relative humidity, air pressure, precipitation, surface wind speed, and wind direction.
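The nearest-station matching described above can be illustrated with a minimal Python sketch (the study itself used R). The text does not state which distance formula was used, so the great-circle (haversine) distance here is an assumption, and the station names and coordinates are hypothetical values chosen only for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(site, stations):
    """Return the name of the station closest to `site`, a (lat, lon) tuple."""
    return min(stations, key=lambda name: haversine_km(*site, *stations[name]))

# Hypothetical coordinates, for illustration only (not the study's stations).
stations = {
    "station_A": (31.23, 121.47),
    "station_B": (31.11, 121.38),
    "station_C": (31.30, 121.52),
}
site = (31.22, 121.48)  # hypothetical indoor monitoring site
print(nearest_station(site, stations))
```

Each indoor site would then inherit the hourly outdoor PM2.5 and PM10 series of the station returned by `nearest_station`.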

Data analysis
The data analysis in this study was based on hourly arithmetic means; that is, the indoor and outdoor PM2.5 concentrations, the outdoor PM10 concentration, and the related meteorological parameters were all processed as hourly mean values. For example, the indoor PM2.5 concentration at 09:00 h was the mean value from 08:00 h to 09:00 h. After data cleaning, the final database consisted of 11,712 records, with 11 potential predictor variables and the natural log-transformed indoor PM2.5 concentration (approximately normally distributed) as the response variable for MLR and RFR model construction. Data analysis and model construction were performed with R software (version 4.1.0), and statistical significance levels were set at p < 0.01 and p < 0.05 (two-sided).
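As a rough illustration of this preprocessing, the Python sketch below averages 15-min readings into hourly means labelled by the end of the hour and then applies the natural log transform. The handling of a reading taken exactly on the hour is an assumption (it is assigned to the hour it closes), and the concentrations are made-up values.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

def hourly_log_means(readings):
    """Average readings into hourly means labelled by the END of the hour,
    then take the natural log of each hourly mean."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Assumption: a reading at exactly HH:00 closes the hour ending at HH:00,
        # so the 09:00 label covers the interval (08:00, 09:00].
        floor = ts.replace(minute=0, second=0, microsecond=0)
        label = floor if ts == floor else floor + timedelta(hours=1)
        buckets[label].append(value)
    return {label: math.log(sum(v) / len(v)) for label, v in sorted(buckets.items())}

start = datetime(2018, 1, 8, 8, 15)
values = [30.0, 34.0, 38.0, 42.0, 50.0]  # made-up concentrations in ug/m3
readings = [(start + timedelta(minutes=15 * i), v) for i, v in enumerate(values)]
result = hourly_log_means(readings)
for label, log_mean in result.items():
    print(label, round(log_mean, 3))
```

The first four readings (08:15 to 09:00) form the 09:00 record, and the 09:15 reading opens the 10:00 record.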
Frontiers in Public Health 03 frontiersin.org

MLR model construction steps
A sensitivity analysis was conducted for the effects of different variable screening methods on the predictive efficacy of MLR models. Three variable screening methods were adopted: 1) manually supervised forward linear regression, as commonly used in classical land-use regression modeling (25, 26), 2) stepwise regression (backward; variables with regression coefficient p < 0.05 were retained), and 3) least absolute shrinkage and selection operator (Lasso). The manually supervised forward linear regression method was used to build a basal multiple regression model in three steps. (i) After testing the assumptions of the regression model, each potential predictor variable was univariately regressed against the response variable (natural log-transformed hourly average PM2.5 concentration), and predictor variables with significant (p < 0.05) regression coefficients were retained for the next step. (ii) Correlations between predictor variables were tested. Among predictor variables that were highly correlated with one another (Spearman r > 0.50, p < 0.05), only the variable with the highest coefficient of determination (R2) was retained for further analysis. (iii) The predictor variables that remained after the previous two steps were sorted by R2 (from highest to lowest) and entered into the regression model in that order. Finally, only those predictor variables with significant partial regression coefficients (p < 0.05), which raised the R2 of the model by more than 1% and whose coefficients were consistent with the a priori hypothesis (such as a positive coefficient for outdoor PM2.5), were retained.
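The correlation screening in the second step of the manually supervised procedure can be sketched in Python as follows. This is one possible reading of that step, not the authors' R code: predictors are visited in descending order of univariate R2 and dropped when their Spearman correlation with an already retained predictor exceeds 0.50. Using the absolute value of the correlation and omitting the p-value check are simplifying assumptions, and the data and R2 values are made up.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def screen(predictors, univariate_r2, threshold=0.50):
    """Keep the highest-R2 predictor from each group of highly correlated ones."""
    kept = []
    for name in sorted(predictors, key=univariate_r2.get, reverse=True):
        if all(abs(spearman(predictors[name], predictors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Made-up data and univariate R2 values, for illustration only.
predictors = {
    "pm25_out": [10, 20, 30, 40, 50, 60],
    "pm10_out": [15, 22, 41, 45, 70, 71],
    "wind": [3, 1, 4, 1.5, 5, 2],
}
univariate_r2 = {"pm25_out": 0.40, "pm10_out": 0.30, "wind": 0.05}
print(screen(predictors, univariate_r2))
```

In this toy example the outdoor PM10 series is perfectly rank-correlated with outdoor PM2.5 and has the lower univariate R2, so it is the one that is dropped.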
In the process of MLR model diagnosis, variance inflation factors of the predictive variables were tested to evaluate multicollinearity. Additionally, considering that season may modify the effects of other potential predictor variables on indoor PM 2.5 concentration, we stratified the data by winter-spring (January, April) and summer-autumn (July, October) seasons and developed season-specific prediction models.

RFR model construction steps
The random forest model is a machine learning model that classifies and/or predicts unknown samples through ensemble learning with a large number of decision trees. It is now widely used in the processing of big data because of its fast computing speed, high prediction accuracy, and robustness to noise (27-29). The model has two significant characteristics, namely sample randomization and variable randomization. The bagging algorithm, also known as bootstrap aggregating, is the basis of the random forest model: samples are drawn randomly with replacement to form different data sets for training the base learners, so that the individual learners are approximately independent of one another. The random forest algorithm extends the bagging algorithm. In addition to random sampling of samples, it incorporates random selection of variables at each split node of the tree, which further enhances the diversity of the decision trees, reduces the risk of model overfitting, and can effectively improve the generalization performance of the final ensemble model (27, 29). The prediction accuracy and generalization of a random forest model are closely related to two important hyperparameters: ntree (the number of trees) and mtry (the number of variables randomly sampled as candidates at each split). The randomForest package of R software (version 4.1.0) was used to construct the RFR model. In our analysis, different values were set for these two parameters as a sensitivity analysis in order to obtain maximum model prediction effectiveness. The percentage increase in mean squared error (%IncMSE) of the predicted value was taken as the indicator of variable importance; that is, the values of each predictor variable were randomly permuted in turn. If the predictor variable is important, the prediction error of the model will increase after its values are randomly replaced, so the larger the %IncMSE, the more important the variable.
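The idea behind permutation-based importance can be illustrated with a simplified Python sketch. It is not the randomForest package's %IncMSE computation, which works tree by tree on out-of-bag samples; here the values of one column are shuffled for an arbitrary fitted `predict` function and the percentage increase in MSE is averaged over repeats. The toy model and data are hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Percent increase in MSE after shuffling each column, averaged over repeats."""
    rng = random.Random(seed)
    base = mse(y, [predict(r) for r in rows])
    result = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(y, [predict(r) for r in permuted]) - base)
        result.append(100 * sum(increases) / n_repeats / base)
    return result

# Toy data: the response depends on column 0 only; column 1 is irrelevant.
rows = [[float(i), float((i * 7) % 10)] for i in range(20)]
y = [2 * r[0] + (1 if i % 2 else -1) for i, r in enumerate(rows)]

def predict(row):
    """Stand-in for a fitted model that uses column 0 only (hypothetical)."""
    return 2 * row[0]

importance = permutation_importance(predict, rows, y)
print(importance)
```

Shuffling the irrelevant column leaves the predictions unchanged, so its importance is zero, while shuffling the informative column inflates the error.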
In order to evaluate and compare the prediction efficiency of the MLR model and the RFR model for indoor hourly average PM 2.5 concentration in various types of places, we developed two RFR models. The first RFR model was called the Full variables-RFR model (Full-RFR). Since the RFR model does not need to consider preconditions such as the independence of predictive variables that are faced by general MLR models, all 11 potential predictive variables were included in the model. The second RFR model was called the Conjoint-RFR model (Conjoint-RFR). In order to compare the MLR and RFR models, this Conjoint-RFR model was established using the same predictor variables as the MLR model with the best prediction performance identified in the previous steps.

Evaluation of models
The R 2 and root mean squared error (RMSE) calculated based on the predicted and measured values of the model were used as the model performance evaluation indexes. In addition, the generalization performance of the model was evaluated by a ten-fold cross-validation (CV) method. In short, the entire dataset was randomly and equally divided into ten subsets, nine of which were selected as the training set and the remaining one was used as the test set to test the prediction performance of the model. This process was repeated 10 times until each subset was used for one verification (30).
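A minimal Python sketch of ten-fold cross-validation with R2 and RMSE is given below. It uses a one-predictor least-squares line as a stand-in model and pools the out-of-fold predictions before computing the two metrics; whether the study pooled predictions or averaged per-fold metrics is not stated, so that choice is an assumption, and the data are simulated.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def k_fold_cv(x, y, k=10, seed=0):
    """Return (CV R2, CV RMSE) computed from pooled out-of-fold predictions."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal random subsets
    preds = [0.0] * len(x)
    for test in folds:
        held = set(test)
        train = [i for i in idx if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = a + b * x[i]  # each record is predicted exactly once
    my = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(y))

# Simulated data, for illustration only.
rng = random.Random(1)
x = [float(i) for i in range(100)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
cv_r2, cv_rmse = k_fold_cv(x, y)
print(f"CV R2 = {cv_r2:.3f}, CV RMSE = {cv_rmse:.3f}")
```

Because every record is held out once, the resulting metrics reflect performance on data the model did not see during fitting.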

Indoor PM 2.5 pollution in various places
Summary statistics of the hourly average indoor PM2.5 concentration for each type of site are shown in Table 1. In general, the median hourly average indoor PM2.5 concentration was 34.9 μg/m3 and the interquartile range was 24.5 μg/m3, with a few readings on the high side and a maximum value of 288 μg/m3. The result of Welch analysis of variance (Welch ANOVA) (31) showed significant differences (p < 0.01) in the hourly average indoor PM2.5 concentrations between different types of places. The highest hourly average indoor PM2.5 concentrations were found in restaurants (44.4 μg/m3), probably because frequent cooking in restaurants produces a large amount of grease smoke and causes indoor PM2.5 concentrations to increase (32 Station were used here to characterize the indoor PM2.5 pollution in each location (Figure 1). Taking 35 μg/m3 as the standard, indoor PM2.5 exceeded the standard to different degrees in all places; restaurants were the worst, followed by kindergartens. The monitoring results suggest that the indoor environmental quality of these two types of places needs to be improved.
The changes of hourly average indoor PM 2.5 concentration at different times are shown in Figure 2. Overall, there were significant differences (p < 0.01) in indoor PM 2.5 at different times of the day, and we also observed significant intraday fluctuations in the monitoring data for each type of place (p < 0.05). The variability of PM 2.5 concentration at different times of the day in multiple types of places is closely related to the nature of the place. For example, the fluctuation of PM 2.5 concentration in the restaurant was as expected (p < 0.01), with two peaks occurring after 11:00 and after 17:00, which are roughly the beginning of lunch and dinner. At these times, intensive cooking leads to higher indoor PM 2.5 concentrations, and similar patterns were observed in other places (Figure 2). These results demonstrate the intraday variability of indoor PM 2.5 concentration as well as the spatial variability across places.

MLR model results
Univariate regression model results for hourly average indoor PM2.5 concentration are summarized in Supplementary Table S2. All 11 predictor variables were significantly associated with hourly average indoor PM2.5 (p < 0.05). The R2 of the 11 prediction variables   Table 2. The model developed with the stepwise regression method had the best prediction performance (CV R2 = 60.48%) and the lowest prediction error (CV RMSE = 0.44) among the three MLR models (Table 3). In this paper, the relative importance of the predictor variables within the MLR model was determined using the Lindeman, Merenda and Gold (LMG) method. LMG has been evaluated as the most successful indicator of the relative importance of independent variables and was implemented using the "relaimpo" package of R software (33, 34) (Figure 3).

RFR model results
We compared all RFR models with ntree of 200, 500, and 1,000 and mtry of 1-11 (Supplementary Figure S1), and determined that ntree = 200 and mtry = 2 were the most suitable RFR parameters for this study after considering the model's prediction effectiveness, prediction error, and efficiency. Results from the Conjoint-RFR model, which used the same predictor variables as the MLR model, showed that the RFR model explained a greater proportion of the variance of indoor PM2.5 time-averaged concentrations, with an R2 (RMSE) of 89.65% (0.23). After ten-fold cross-validation, its predictive efficacy decreased (CV R2 = 71.86%) and its prediction error increased (CV RMSE = 0.37). Nevertheless, the overall performance of the model was still better than that of the corresponding MLR model (CV R2 = 60.48%; CV RMSE = 0.44). The performance of the Full-RFR model incorporating all predictor variables was better than that of the Conjoint-RFR model, with a CV R2 (RMSE) of 72.20% (0.36). The variable importance results from the random forest algorithm (Figures 4A,B) indicated that the top five variables in the Conjoint-RFR model (Figure 4B), in order of importance, were type of place, outdoor PM2.5 concentration, season, hour, and surface wind speed. Comparison of the importance rankings in the Conjoint-RFR model and the corresponding MLR model shows that the top three variables in both models are the same, namely outdoor PM2.5 concentration, type of place, and season, but in a different order. By contrast, the variable "hour" appears in the top five variables in the Conjoint-RFR model, whereas wind direction is in the top five in the MLR model.

Discussion
Significant differences in indoor PM2.5 concentrations between various types of places and at different times of day were found in our study. The variable "type of place" ranked first and second in the importance assessment of the predictor variables of the RFR model and the MLR model in this study, respectively. This result emphasizes the importance of place type in predicting indoor PM2.5 concentration and suggests that it might be difficult to extrapolate a prediction model based on a single type of place to other types of places. This conclusion is not difficult to understand: the different functional attributes of each place naturally create a unique indoor microenvironment, which consequently affects the generation, diffusion, deposition, and other behaviors of PM2.5 (35-38). For example, in an office, there is a high concentration of people, frequent use of office equipment (e.g., printers, photocopiers, and computers) and air-conditioning equipment (air-conditioners, humidifiers, air filters), with low ventilation and a single source of indoor pollution, whereas in a shopping mall there is a higher flow of people, more frequent ventilation, and a more complex internal environment. In contrast, the frequent cooking activities in restaurants generate smoke and high temperatures, creating a different microenvironment from the places mentioned above (35,39 46). No matter what causes this variability, establishing a higher temporal resolution in an indoor PM2.5 concentration prediction model is more practical for refining individual PM2.5 exposure assessment and health risk evaluation.
Figure 2. Variation of intraday hourly average indoor PM2.5 concentration in each place (μg/m3). PM2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less.
MLR models are widely used for indoor air quality prediction because of their simple methodology, easy application, and strong interpretability of results (13, 17, 47). However, prerequisites exist for MLR application. First, a linear relationship must exist between the predictor variables and the response variable. Second, the response variable must follow a normal distribution for any given value of each predictor variable. Third, the response variable must satisfy the homogeneity of variance across different values of each predictor variable. Fourth, the predictor variables must be independent of each other and must not be closely correlated. These prerequisites are sometimes not easily satisfied in practical applications.
With improvements in computing power and the advent of the era of big data, machine learning algorithms have been constantly enhanced and have attracted wide attention. The random forest algorithm is an ensemble decision tree-based algorithm proposed by Breiman and Cutler in 2001, which can construct a large number of decision trees in parallel and achieve significantly higher computational efficiency than other machine learning methods by integrating the learning of multiple decision trees (27, 29). Because the random forest algorithm inherently includes interactions between variables, there is no need to consider the multicollinearity among variables that affects general models, and the algorithm performs robustly with mixed data types, missing data, imbalanced data, and extreme data, leading to a high prediction accuracy of the model (28). In addition, owing to the inclusion of sample perturbation and attribute perturbation in the algorithm, the random forest model can effectively limit overfitting and is regarded as one of the best algorithms today (48-50). Of course, random forest models also have certain drawbacks, such as poor interpretability; the model is usually considered a black box. Furthermore, categorical variables with more levels will have a greater impact on the model results than those with fewer levels, which may bias the prediction results (48, 51).
In our study, MLR and RFR prediction models were developed for hourly average indoor PM2.5 concentrations based on monitoring data from multiple types of places. As a conventional and classical prediction model, the MLR model is widely used to predict indoor PM2.5 concentration. Our MLR model (CV R2 = 60.48%) had a relatively high predictive performance compared with published MLR prediction models of indoor PM2.5 concentration averaged over 1 day or longer (such as 1 week), whose R2 values ranged from 33 to 87% (13, 16, 18, 19, 52-54). To the best of our knowledge, only one study, by Xu et al. (13), has developed an MLR prediction model for hourly average indoor PM2.5 concentration. In that study, two MLR models were developed for two regions, with CV R2 values of 71 and 75%. These two CV R2 values indicated better model predictive performance than that of our MLR model. This difference might be because the model development in our paper was based entirely on easily accessible temporal indicators and outdoor indicators. By contrast, the model construction in the study by Xu et al. (13) incorporated not only outdoor indicators (such as outdoor PM2.5 concentration and outdoor relative humidity) but also indoor indicators (such as indoor smoking and cooking), with a wide range of indicator coverage. However, the model in that study also suffered from difficulties in the definition of relevant indicators, such as "whether or not to cook." In fact, cooking ingredients, cooking methods, cooking time, and the type of oil used have significant effects on indoor PM2.5 concentration (55, 56). Moreover, these types of prediction indicators are not easy to obtain and collecting them is costly. Only a few studies have developed RFR prediction models for indoor PM2.5 concentration, and the CV R2 values in these studies have ranged from 48.9 to 82% (13, 16, 18). The predictive efficacy of the Full-RFR model in this study (CV R2 = 72.20%) was also at a high level.
Figure 3. Relative importance of the multiple linear regression (MLR) model predictor variables. R2, coefficient of determination; PM2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less.
Figure 4. The importance of the predictor variables in random forest regression (RFR) models based on "%IncMSE." Full-RFR model (A), Conjoint-RFR model (B). PM2.5 and PM10 refer to particulate matter with an aerodynamic diameter of 2.5 μm or less and of 10 μm or less, respectively.
Although MLR and RFR models are both common indoor PM2.5 concentration prediction models, it is still controversial which approach better predicts indoor PM2.5 concentrations. Previous studies have shown (16, 44) that, using the same dataset, an RFR model usually outperforms an MLR model in terms of predictive efficacy owing to the strengths of the algorithm itself, such as robustness to missing data and good characterization of interactions between different predictor variables. However, some studies have reached the opposite conclusion, as in the study by Yuchi et al. (18). In their study, the two models used the same variables and the same dataset, and the MLR model (CV R2 = 50.2%) outperformed the RFR model (CV R2 = 48.9%) in terms of generalization performance. This issue was also explored in the current study: the results of our sensitivity analysis for the modeling algorithm showed that the Full-RFR model, which used all predictor variables, and the Conjoint-RFR model, which used the same predictor variables as the MLR model, both performed better than the MLR model.
Compared with other studies, the current study had several strengths. First, indoor PM2.5 concentration monitoring data from multiple types of places were used for modeling, which makes the models more generalizable for predicting indoor PM2.5 concentration than models developed using data from a single type of place. Second, we built the models with high temporal resolution indoor PM2.5 concentration data (hourly average data), which took into account the temporal variability of indoor PM2.5. Third, the sample size used for modeling (n = 11,712) greatly exceeded the number of predictor variables (11), so the model was less prone to overfitting. Fourth, the model prediction cost was low, and the predictor variables in the model were all easy to obtain. For example, outdoor PM2.5 concentration, wind direction, and surface wind speed can be found through the websites of relevant government departments. The model is suitable for epidemiological studies with large populations and/or long time periods.
Of course, there were some limitations in the study. First, the outdoor PM 2.5 concentration data of indoor places in the study were obtained from the nearest government-controlled monitoring sites. Although this approach has been used in many previous studies, it could introduce some errors in the model due to the spatial variability of outdoor PM 2.5 concentrations. Second, the absence of human indoor activity variables, such as smoking and cooking, might cause an increase in the prediction error of the model at certain time periods and contexts, for instance, during cooking and when air purifiers were used. Third, the model was developed and evaluated based on data from Shanghai, and there was a lack of equivalent data from other regions for further validation of model performance.

Conclusion
We found significant differences in indoor PM2.5 concentration between types of places and time periods. This finding reflects the possible limitations of models based on indoor PM2.5 concentration data from a single type of place as well as the need for a prediction model with a high temporal resolution in order to refine individual PM2.5 exposure assessment. Here, we developed MLR and RFR models for hourly average indoor PM2.5 concentration over multiple types of places. Both models were based on easy-to-access indicators and showed good predictive efficacy. They could, therefore, be used for quantitative estimation of indoor PM2.5 exposure in large-scale population studies. In addition, the performance of the classical MLR model and the machine learning RFR model in predicting indoor PM2.5 concentration was evaluated comparatively, and the model performance metrics showed that the RFR model using the same dataset outperformed the MLR model. This finding suggests the potential of RFR models in predicting indoor air pollutant levels, and other machine learning algorithms may also be worthy of exploration.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions
YS, ZD, CD, and SS designed the research. YS, FH, FC, DW, ML, and HZ performed data acquisition, organization and a part of analysis. ZD and YS analyzed the data and wrote the primary manuscript. YS, ZD, CD, SS, FH, and FC provided a contribution to the explanation of the findings and critically reviewed and edited the manuscript. All authors contributed to the article and approved the submitted version.