Predicting the Number of Days With Visibility in a Specific Range in Warsaw (Poland) Based on Meteorological and Air Quality Data

Atmospheric visibility is an important parameter of the environment which is dependent on meteorological and air quality conditions. Forecasting of visibility is a complex task due to the multitude of parameters and nonlinear relations between these parameters. In this study, meteorological, air quality, and atmospheric visibility data were analyzed together to demonstrate the capabilities of the multidimensional logistic regression model for visibility prediction. This approach allowed determining independent variables and their significance to the value of the atmospheric visibility in four ranges (i.e., 0–10, 10–20, 20–30, and ≥ 30 km). We proved that the Iman–Conover (IC) method can be used to simulate a time series of meteorological and air quality parameters. The visibility in Warsaw (Poland) is dependent mainly on air temperature and humidity, precipitation, and ambient concentration of PM10. Three logistic models of visibility allowed us to determine precisely the number of days in a month with visibility in a specific range. The sensitivity of the models was between 75.53 and 90.21%, and the specificity 78.51 and 96.65%. The comparison of the theoretical (modeled) with empirical (measured) distribution with the Kolmogorov–Smirnov test yielded p-values always above 0.27 and, in half of the cases, above 0.52.


INTRODUCTION
The effects of anthropogenic air pollution are discussed for years (Schedling, 1967;Saikawa et al., 2017;Kinney, 2018;Grewling et al., 2019;Makra, 2019). Visibility, an indicator of atmospheric transparency, refers to the visual range that distant objects can be clearly discerned (Li et al., 2019). Visibility is an atmospheric parameter, which can serve as a visual index for air quality (Kuo et al., 2013). Generally, visibility is the "distance at which the contrast of a given object with respect to its background is just equal to the contrast threshold of an observer" (WMO, 2015). It can also be defined as the clearness with which objects stand out from their backgrounds or other objects and how far people can see and how well they can identify objects (Bennett, 1930;Wooten and Hammond, 2002). Thus, a reduction in visibility is not simply fewer colors. On polluted days, however, the sky will appear white or grayish, and there is diminished contrast, making it difficult to discern the contours and details of objects (Ding et al., 2020). Many studies were made to evaluate the relationship between air pollution and visibility (Hyslop, 2009;Majewski et al., 2015;Qu et al., 2020). The impairment of visibility is primarily attributed to the scattering and absorption of visible light by particulate matter (PM). As the number of particulate of PM and selected PM components (e.g., black carbon) increases, more light is absorbed and scattered, resulting in less clarity, color, and visual range (Latha and Badarinath, 2003). Therefore, visibility at urban sites is mainly shaped by the emission of PM and PM precursors from anthropogenic sources like road traffic, combustion of fossil fuels, municipal solid waste treatment, and industry (Tsai et al., 2007;Deng et al., 2008;Zhao et al., 2011;Fajardo et al., 2013;Majewski et al., 2014;Zhuang et al., 2014).
Forecasting visibility is a complex task due to the multitude of parameters and nonlinear relations between these parameters. The statistical models which are currently used in the prediction of visibility have limited capabilities (Madan et al., 2000;Zhang et al., 2017;So et al., 2018;Dietz et al., 2019). We decided to use a classification model to make our forecast of visibility. We used multidimensional logistic regression since it allows us to assess the impact of a given variable on visibility without any additional calculations and the need for implementation of complex numerical algorithms (Hosmer et al., 2013). Although it is not easy to implement visibilityair quality interactions in the model, multiple outputs can be combined to improve visibility forecasting. Our solution allows the identification of visibility values in as many as four classes, taking into account different meteorological conditions and air quality measures (including the concentration of PM), which have not been analyzed in such scope before. Using the designated logistic models, cumulative empirical distributions of selected independent variables were identified, which was the basis for the development of their Monte Carlo generators. The obtained tool makes it possible to model the number of days with the visibility value within the appropriate range, which has not been analyzed in detail so far.
Generally, there is a lack of studies concerning the problem of visibility and air pollution in Poland since the air quality in Poland is one of the lowest in Europe (EEA, 2019). The air pollution in Poland is mainly caused by the emission from road traffic, industry, and energy production (Pastuszka et al., 2003;Rogula-Kozłowska et al., 2013;Błaszczak et al., 2016;KOBIZE, 2019). The impact of air quality in Poland can be seen in the scale of the whole of Europe (Spindler et al., 2010;Rogula-Kozłowska et al., 2014;Leoni et al., 2018); hence, monitoring of air quality including atmospheric visibility is very important. However, the visibility is monitored only at 15 places in Poland (PANSA, 2020), i.e., one place per over 20,000 km 2 . This proves that the crucial point is the development of the methodology of forecast visibility which is based on historical data.
Here we presented an original model with the consistent methodological approach and assumptions. This approach to the visibility forecast can be successful in application to the analysis of publicly available data concerning many sites around the world.

Model Overview
The workflow of our original visibility forecast model is shown in Figure 1.

Air Quality and Meteorological Observations
The information used in the study came from the period between 2004 and 2013. We used data from three different monitoring stations located in the southern part of Warsaw, the capital of Poland. Their location is shown in Supplementary  Figure 1. The air quality data were from the monitoring station MzWarWokalna located at 52.160772N, 21.033819E. We used 1-h average concentrations of sulfur dioxide (SO 2 ), carbon monoxide (CO), ozone (O 3 ), nitrogen dioxide (NO 2 ), and PM 10 , monitored via pulsed fluorescence, infrared absorption, absorption of ultraviolet light, chemiluminescence, and a β-gauge automated particle sampler, respectively. The daily, arithmetic mean of 24 1-h concentrations was calculated for this study. The meteorological parameters were recorded at the Ursynów-SGGW station (52. 160810N  meteorological measurements were done according to the procedures of the Polish Institute of Meteorology and Water Management, National Research Institute (IMGW-PIB). The visibility was measured at the Warsaw Chopin Airport weather station (52.162876N, 20.961125E) with the use of a visibility meter equipped with an atmospheric phenomenon detector-Vaisala FS11 (wavelength 875 nm). It performed the functions of a visibility meter using light dispersion measurements and an atmospheric phenomenon detector with a measurement range from 10 to 50 km. The data (1h values) were shared by the IMGW-PIB. For the whole research period, the 87,634 1-h values were obtained and a daily mean was calculated as for air quality data. We are then aware that during rain phenomena, the instrument FS11 is measuring beam attenuation due to scattering processes and is not physically taking absorption, e.g., by water vapor into account. The manufacturer claims that the response of the FS11 has been tested, evaluated, and 25 times verified with a transmissometer including a visible light band emitter at different locations around the world. Therefore, the absorption effect is covered to a certain extent, according to the manufacturer. We see the need of further research, aimed at a detailed analysis of visual range changes and precipitation.
The datasets were validated before the usage. The meteorological equipment was daily calibrated and validated by the personnel of the SGGW-WULS. The air quality data, before the publication in the archive of the Chief Inspectorate for Environmental Protection, are crosschecked and validated. The quality of the visibility data was constantly checked by the personnel at the Warsaw Chopin Airport since it determines the safety of the operation of the airport. As a result of these data quality control procedures, all the data which were outlier were removed by the respective services.
In the summer period (May-August), the median of all daily values of visibility in each month Vis(p = 0.50) has the highest value and is in the range 28.8-30.8 km, while in the winter period (November-February), the monthly visibility was the lowest and almost constant (in the range of 0.5 km). The visibility in the mid-period (March-April and September-October) has the highest change (from 13.4 to 28.8 km). Similar monthly trends are also observed for the Vis p = 0.05 and Vis(p = 0.95). According to these observations and results of clustering data points (Figure 2), we divided the year into three periods (i.e., N = 3): (I) November-February, (II) March-April and September-October, and (III) May-August, which we analyzed separately (Supplementary Figure 2).

Data Analysis
For each of the N periods, the visibility values are converted to a binary variable. It is the fundament for forecasting visibility. The binary value represented is true when visibility Vis is higher than the threshold visibility for the given period Vis N . In parallel, we determine which M independent variables x i has an impact on the values of Vis in a given FIGURE 2 | The dendrogram of the data points used in the work. The data are clustered into three groups: C1, C2, and C3. The x-axis represents measured air quality and meteorological data vectors, and the y-axis is the distance between points. The horizontal line represents the threshold distance between clusters we used in the work, and C_ denotes datasets.
period. The parallel M independent variables are determined with the backward stepwise algorithm. The independence of variables was checked using Spearman's coefficient of correlation ρ. We treat two variables as dependent if the significance was at p < 0.05. The values of ρ are provided in Supplementary Table 2.
We use the logistic model to forecast the binary data. Contrary to other classification models, the result of these calculations is probability. In general, the model can be described by Equation (1) (Harrell, 2015): where p is the probability that Vis is greater than the specific threshold value Vis lim , x is the vector of independent variablesair quality data or meteorological conditions, and α is the vector of the coefficients determined using the maximum likelihood estimation method (Hosmer et al., 2013). The variable is assumed to be independent if the hypothesis about the dependence of variables can be rejected at p < 0.05.
We conducted three phases of the logistic model determination: learning, testing, and validation. The data for each of these phases were selected randomly, 400 points, i.e., 3,600 points for three models and three phases.
The crucial point of the analysis was the determination of value probability p lim for which p(Vis > Vis lim ) = p lim . According to the value of Vis lim , the data about visibility were divided into two classes: Vis > Vis lim and Vis ≤ Vis lim ; we choose p lim = 0.50 (Majewski et al., 2014). Since the preliminary analysis of the data revealed three periods with similar visibility (Figure 2), we used N = 3 models, and hence, we will determine three values of Vis lim , three sets of independent variables (for each logistic model M 1 , M 2 , and M 3 ), and three sets of coefficients α i .
Based on the N determined models of the forecast of Vis i = 1,2,... , N = f (x 1 , x 2 , . . ., x j ), we found empirical distributions of each of the j variables. Using these distributions, we decided to find the best fitting theoretical distribution. The choice of the distribution was done using the Kolmogorov-Smirnov (K-S) test (Massey, 1951). To get the highest compatibility of the measured and forecasted data, we analyzed the following distributions as a candidate for modeling variables (Krishnamoorthy, 2016): normal, lognormal, exponential, beta, generalized extreme value (GEV), Gumbel, Weibull, Fischer-Tippet, Johnson, Rayleigh, and Pareto with K-S test.
We used the Spearman coefficient of correlation ρ between variables and best fit theoretical distribution to build a Monte Carlo (MC) generator of synthetic time series of air quality and meteorological data. In case when variables are dependent, we used the Iman-Conover method (IC) which is a modification of the MC method (Iman and Conover, 1982). In the IC method, the variability of variables is based using marginal distributions (theoretical distributions) and the covariance is assessed by the Spearman coefficient of correlation (Chun and Tsang, 2004). Thus, the usage of this method needs to meet specific conditions (see section Conditions for the Iman-Conover method of the Supplementary Material). In our work, time series are generated using the IC method based on air quality and meteorological data except for the precipitation. We used the MC method to simulate the precipitation (P), according to the theoretical models of days with precipitation. The aim was to calculate the number of days with the precipitation n r in the given period. The precipitation P was modeled as binary data (P = 0-no precipitation, P = 1-precipitation).

Simulation of Time Series Using Iman-Conover MC
We used our logistic models and generators to simulate time series K-times in a given period (day, week, month, etc.) of the meteorological parameters and air quality data. We used it to determine the probability that the value of Vis exceeds the limit. This procedure was repeated for each of the N models (for each model for a given variability of visibility). As a result, N cumulative distribution functions (CDFs) of the probability of the number of days with visibility higher than a specific value for every period were determined.
As a result of K = 10,000 simulations, we obtained time series with 7, 30, 365, etc. values for weekly, monthly, and yearly forecasts, respectively. We used these values to calculate p Vis > Vis lim1,2,3 . Let's define a function Vis limk , x 1 , x 2 , ..., x M k with domain for a whole forecast period in such a way: = p Vis limk , x 1 , x 2 , ..., x M k > 0.5 → D(Vis lim ) = 1 p Vis limk , x 1 , x 2 , ..., x M k > 0.5 → D(Vis lim ) = 0 (2) The value D is a binary parameter describing whether it is more probable that visibility is higher than Vis lim (D = 1) or more probable that visibility is lower (D = 0). We analyzed the total number of days with visibility higher than Vis lim for the forecast periods and the value of D p was the number of days in the given period when the above conditions were satisfied. The expected value of D p can be calculated using Formula (3): where f (x i ) is the probability density function of theoretical distribution which is the best fitting empirical distribution for variable x i . Since the variables in the integral above are not independent, the analytical solution is hardly feasible, and hence, we used the MC method.

Prediction Assessment
The assessment of the performance of the models is made using sensitivity (SENS) and specificity (SPEC). According to Hosmer et al. (2013) and Harrell (2015), we used also the third measure for logistic models-counting uncertainty R z 2 . We choose the following threshold values for further calculations: Vis lim1 = 10 km, Vis lim2 = 20 km, and Vis lim3 = 30 km (Figure 2  and Supplementary Figure 2).
Finally, we used the two-sample Kolmogorov-Smirnov test to compare CDF from the logistic model and CDF from the experiment. It allowed us to evaluate numerically whether our model prediction of the number of days within the specific visibility range is in agreement with the empirical distributions of visibility.

The Logistic Regression Model
During the initial design of the model, we analyzed Vis lim with different lowest values and different intervals between them. The Warsaw Chopin Airport, where the visibility was measured, is located at the site where visibility is good and the share of measurement with limited visibilities is low. Hence, the visibility data models with intervals lower than 10 km had poor performance. In this work, we present the performance and possibilities of the logistic regression model with three limit values. We present the modes with values Vis lim1 = 10 km, Vis lim2 = 20 km, and Vis lim3 = 30 km, which have good capabilities of forecasting visibility. For example, for n m1 = 2, 517 data points, where Vis = 10 km (i.e., Vis lim1 > 10 km), the forecast was in agreement for 1,896 points (SENS = 75.53%), while for the remaining 517 data points, 500 was in agreement (SPEC = 96.65%). The value of Vis lim is crucial for the number M of the independent variables for each model (M 1 = 7, M 2 = 6, M 3 = 5). The CDFs for three different Vis lim divided into three periods are provided in Supplementary Figures 3-5.
Below, we present the values of coefficients α and reminding the vector of concentrations x for the logistic model described in Frontiers in Environmental Science | www.frontiersin.org the Data Analysis section (Equation 1). The zeros in α vectors are put for dependent variables.
α 0I = 26.0 ± 1.4, α 0II = 18.6 ± 1.0, α 0III = 16.2 ± 0.9 (6) The prediction assessment for the models we yielded is as follows: for model I, SENS = 75.53%, SPEC = 96.65%, R z 2 = 93.03% based on 2,517 points in the set; for model II, SENS = 86.82%, SPEC = 89.21%, R z 2 = 88.09% based on 1,616 points; and for model III, SENS = 90.21%, SPEC = 78.51%, R z 2 = 86.28 % based on 1,021 points. For the chosen values of limits Vis lim , the concentration of PM 10 and the values of Rh,T, andP have an impact on Vis in all the models. The solar irradiance and wind velocity are omitted in the vectors above since none of these variables was independent in any model. The increase of values of Rh, P, and PM 10 results in the decrease of Vis, while the increase of T results in an increase of Vis. These results are in the agreement with earlier studies (e.g., Deng et al., 2014;Majewski et al., 2015;Aman et al., 2019;Araghi et al., 2019;Won et al., 2020). The concentrations of CO and SO 2 are independent variables only in the model where Vis lim1 = 10 km, and the increase of both of them decreases visibility. NO 2 decreases the values of Vis for Vis lim1 = 10 km and Vis lim2 = 20 km.
The results of the fitting coefficients show that with the increase of Vis lim , the positive influence of temperature decreases, while the negative influence of the concentration of PM 10 and the presence of precipitation increases. The increase of relative humidity decreases the probability of occurrence of the given visibility, but this influence decreases with the Vis lim .
The sensitivity analysis of the logistic models was done using odds ratio (OR). We assumed percentage variability of marginal values of Rh (0) , PM 10(0) , and T (0) by 30%. The OR for Rh is increasing with Vis lim from OR Rh Vis lim = 10km = 0.0058 to OR Rh Vis lim = 20km = 0.0212 to OR Rh Vis lim = 30km = 0.0217, while for T, P, and PM 10 , the OR is decreasing with the Vis lim (for Vis lim = 10km, OR PM 10 = 0.49, OR T = 1.99, OR T = 0.75; for Vis lim = 20km OR PM 10 = 0.39, OR T = 1.85, OR T = 0.58; for Vis lim = 30km, OR PM 10 = 0.35, OR T = 1.67, OR T = 0.40).
The results of OR clearly show that the choice of values of Vis lim has an impact on the sensitivity of the model and, hence, on the results of the model.

Empirical and Theoretical Distribution Fitting
Since we divided the year into N = 3 periods, our analysis of variables was also conducted in three periods. We determined empirical distributions of air quality data and meteorological data. We evaluated which of the distribution has the K-S p-value ( Table 1). We provided more details in see section Distributions of the Supplementary Material.

Simulation of Visibility Occurrence
We used developed logistic models and MC generators of air quality and meteorological data to determine the number of days in month D Vis , when visibility was in one of four ranges (defined by the chosen values of Vis lim ): (a) below 10 km, i.e., [0, 10[, (b) [10, 20[, (c) [20, 30[, and (d) [30, +∞[. The calculations of CDFs of D Vis were done for all three periodsthree models-which were used in the work. The results are shown in Figure 3. It presents two families of CDFs: the dashed which is based on the measurement data and the continuous which is based on our model. We see that the shape of the dashed and respective continuous distributions is very similar. We verified our results using the K-S test for measurement data from the visibility monitoring station and our model. We found that for all comparisons, we could not reject the hypotheses that empirical distribution and theoretical (from the model) are the same. The probabilities (p-values of the K-S test) can be presented as the following matrix of p KS ij where i, row index, denotes period, and j describes the range of visibility: 1- [0, 10[ km, [10, 20[ km, 3-[20, 30[ km, and 4-[30,+∞[ km: p Since the CDFs for Vis < 10 km in the summer period (black) in Figure 3C do not have enough degrees of freedom, we did not evaluate p KS for this range in this period. In general, these values are high to very high, which confirms the quality of our model. We achieved that the model correctly determines the probability that the number of days with the given visibility is less or equal to a specific value. In period I (November-February), the number of days in a month with visibility in the range [10, 20[is the highest and the median of the simulated number of days with such visibility is D Vis = 15, while the median for the range [0, 10[is D Vis = 11. Similarly, we found that for period II (March-April and September-October), the highest number of days in a month is for the interval [10, 20[with the median D Vis = 12; however, the number of days in all intervals with visibility above 10 km is similar to the median number of days D Vis = 8 for [30, +∞[ and D Vis = 9 for [20, 30[. For period III (May-August), the median number of days in a month D Vis = 21 is the highest for the interval [30, +∞[.
From the application point of view, the most important is predicting the number of days with visibility in the range below 10 km since the visibility in this range is connected with aviation. Based on the curves in Figure 3, in the winter period, we see that the probability of the number of days in a month with visibility below 10 km is (D Vis ≤ 21) = 1, while in the mid-period, this number of days in a month decreases to 7 and in the summer period to 2. The beginning of each curve also plays an important role in the analysis of the visibility conditions at the given site.
FIGURE 3 | CDF of the number of days D Vis in a month (30 days) with Vis in given range, "th" in the legend stands for theoretical CDF (calculated from the described model), "emp" denotes CDF from measurements, and E(x) is the expected value of D Vis for the (A) winter period, (B) mid-period, and (C) summer period. Figure 3, in the winter period, we can conclude that there are always 2 days in a month with Vis = 10 km since (D Vis = 2) = 0, while in the summer period, there are at least 13 days in a month with Vis = 30 km. The application of this model to another site after training allows to identify the support of the probability density function (PDF), i.e., the inverse image of open interval ]0, 1[ of values of CDF. In the case of our data, the support of PDF for VIS is as follows:

According to
• Vis [0, 10[km is: The intervals presented above are the essential description of the visibility conditions at the site we were training our model and can be applied elsewhere.
The further analysis of the CDF generated with our model gives information about the 5th percentile or the 95th percentile of the number of days in a month with specific visibility or lower. This plays an important role in the evaluation of the projects which require the assessment of limited visibility occurrence. We believe that our model can be easily applied if the sources of pollutants are similar to Warsaw and, hence, if the spatial distribution of values of the concentrations of pollutants is similar. For example, in Poland, the differences in PM 10 concentrations between particular regions may be high (Łowicki, 2019;GIOŚ, 2020). It is mainly caused by the high impact of local sources of emission (Rogula-Kozłowska et al., 2012Błaszczak et al., 2020). However, the described model was developed based on the data collected in Warsaw, where air quality and sources are similar to most European cities, where the spatial distributions of concentrations of air pollutants are rather flat (Dziennik Urzędowy Województwa Mazowieckiego, 2007Mazowieckiego, , 2017Holnicki et al., 2017). It caused our proposal of visibility forecasting to be quite universal and useful for most European cities.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: Air quality data http://powietrze. gios.gov.pl/pjp/archives; Visibility data https://dane.imgw.pl/; Meteorological data are available from SGGW-WULS database on request to GM, contact: grzegorz_majewski@sggw.edu.p.

AUTHOR CONTRIBUTIONS
GM and BS: conceptualization and methodology. JSB, TM, and AD: data curation. BS: analysis. GM and JSB: writing. JB, BS, WR-K, EA, and TM: reviewing and editing. GM and WR-K: supervision. All authors contributed to the article and approved the submitted version.

FUNDING
This research and APC was funded from a subsidy of the Ministry of the Interior and Administration, Poland to The Main School of Fire Service in Warsaw, Poland. The research was also a part of the project "Implementation doctorate -edition II Faculty W-7 (03DW/0001/18)" financed by the National Centre for Research and Development, Warsaw, Poland.

ACKNOWLEDGMENTS
We would like to thank the reviewers for the critical remarks which significantly improved the quality of the manuscript.