Estimation of the ground-level SO2 concentration in eastern China based on the LightGBM model and Himawari-8 TOAR

Sulfur dioxide (SO2) is one of the main pollutants in China’s atmosphere, but ground-based SO2 monitors are too sparsely distributed to provide complete coverage. Therefore, obtaining SO2 concentrations at high spatial resolution is of great significance for SO2 pollution control. In this study, based on the LightGBM machine learning model, combined with the top-of-atmosphere radiation (TOAR) of Himawari-8 and additional data such as meteorological factors and geographic information, a TOAR-SO2 estimation model with high temporal and spatial resolution is established for eastern China (97–136°E, 15–54°N). TOAR and meteorological factors are the two variables that contribute the most to the model, and both of their feature importance values exceed 30%. The TOAR-SO2 model performs well in estimating ground-level SO2 concentrations, with 10-fold cross validation R2 (RMSE) of 0.70 (16.26 μg/m3), 0.75 (12.51 μg/m3), 0.96 (2.75 μg/m3), 0.97 (2.16 μg/m3), and 0.97 (1.71 μg/m3) when estimating hourly, daily, monthly, seasonal, and annual average SO2, respectively. Taking North China as the main study area, the annual average SO2 is estimated. The concentration of SO2 in North China has shown a downward trend since 2016 and decreased to 15.19 μg/m3 in 2020. The good agreement between ground measured and model estimated SO2 concentrations highlights the capability and advantage of using the model to monitor spatiotemporal variations of SO2 in eastern China.


Introduction
In past decades, China's industrialization has accelerated, resulting in more serious environmental problems (Li et al., 2014). SO2 is a primary source of air pollution and directly affects human health, causing various cardiovascular and respiratory diseases (Sunyer, 2003; Johns and Linn, 2011; Li et al., 2015; Song et al., 2016; Wang et al., 2018). In addition, as a main precursor of sulfate, SO2 increases the frequency of haze events and causes substantial damage to the ecological environment (Zhu et al., 2011; Lee, 2015; Calkins et al., 2016).
In recent years, China has successively built a series of SO2 ground monitoring stations. These stations can provide data sources for SO2-related research. However, the small number and uneven distribution of ground-based SO2 monitors limit their spatial coverage (Yu et al., 2018). Compared with ground monitoring, satellite observation, mainly from polar-orbiting satellites, has wide coverage, and the SO2 column concentration correlates well with the ground-level SO2 concentration. Therefore, models based on the SO2 column concentration from satellite observation have become an effective tool to obtain ground-level SO2 concentrations with high spatial resolution (Ialongo et al., 2016; Liu et al., 2016). At present, the satellite instruments widely used in SO2 column concentration monitoring are the Global Ozone Monitoring Experiment (GOME) (Eisinger and Burrows, 1998), the Atmospheric Infrared Sounder (AIRS) (Carn, 2005), the Ozone Monitoring Instrument (OMI) (Yang et al., 2007; Zhang et al., 2017; Li et al., 2020a) and the Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY) (Lee et al., 2011). Based on satellite remote sensing, these studies successfully estimated the ground-level SO2 concentration using statistical methods, which filled the gap in observational data.
However, polar-orbiting satellites can only provide daily SO2 column concentrations, and these data are a combination of observations taken at two different times. That is to say, the input data to the model have low temporal resolution and are not observed at the same time. By comparison, some geostationary satellites can observe full-disk TOAR once an hour. Himawari-8 is an advanced geostationary satellite (Yoshida et al., 2018) launched by the Japan Meteorological Agency (JMA). Its TOAR data cover a wide area of eastern China with high temporal resolution and include 16 bands ranging from visible to infrared wavelengths. Therefore, the Himawari-8 TOAR has great advantages for building estimation models of ground-level pollutant concentration with high spatial and temporal resolution, and it has been widely used in many related studies (Zang et al., 2018; Wei et al., 2021; Xu et al., 2021; Song et al., 2022a; Chen et al., 2022c). However, to the best of our knowledge, the Himawari-8 TOAR has not been applied to ground-level SO2 concentration estimation so far.
Compared with statistical models, machine learning algorithms handle high-dimensional data better and can better capture nonlinear relationships, giving them better application prospects in SO2 estimation (Tripathy et al., 2021). Therefore, this study aims to estimate the ground-level SO2 concentration in eastern China based on the Light Gradient Boosting Machine (LightGBM) machine learning model, combined with Himawari-8 TOAR and auxiliary data such as meteorological factors and geographic information.

The SO2 ground observation data used in the study come from the National Environmental Quality Monitoring Center of China and can be obtained from its official website at http://www.cnemc.cn/en/. The quality assurance and validity judgment of the SO2 data follow the HJ818-2018 technical specification. The study area in this paper is eastern China (97-136°E, 15-54°N), and hourly SO2 data from approximately 1800 SO2 ground monitoring stations from 1 September 2015 to 31 August 2021 are used. The spatial distribution of these stations is shown in Figure 1. It can be seen that the stations are only sparsely distributed in the west and north of the study area.

Himawari-8 TOAR
Himawari-8, the world's first geostationary weather satellite capable of obtaining color images, was launched by JMA in 2014, and data became available in 2015. The Advanced Himawari Imager (AHI) on board has a total of 16 bands from visible to infrared wavelengths, named B1 to B16 (Yoshida et al., 2018). Details of the 16 AHI bands are given in Table 1. However, the observation range of Himawari-8 is limited to 80°E-200°E and 60°S-60°N, so no valid data can be obtained in China's Tibet, Xinjiang and western Sichuan. The Himawari-8 TOAR data used in this paper have a temporal resolution of 1 h and a spatial resolution of 5 km.

Meteorological factors and geographic information
Considering that meteorological conditions affect the formation, accumulation and diffusion of SO2, various meteorological factors are added to the model. The meteorological factors used in this study are from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis datasets (Hersbach et al., 2020), which have an hourly temporal resolution and a spatial resolution of 0.25°×0.25° or 0.1°×0.1° (as shown in Table 2). They mainly include boundary layer height (BLH), relative humidity (RH), surface pressure (SP), 2 m temperature (TM), and 10 m U and V winds (U10, V10) (Li et al., 2019b; Song et al., 2022b). In addition to meteorological factors, geographic information also affects SO2 concentrations. The geographic information selected in this paper mainly includes land cover type (LUCC), altitude (height) and population density (pd). LUCC is represented by the high and low vegetation indices (LH, LL) from ERA5, height is from SRTM-3 data (spatial resolution of 90 m) jointly measured by NASA and the National Imagery and Mapping Agency (NIMA), and pd is from the NASA Socioeconomic Data and Applications Center (spatial resolution of 0.04°×0.04°). The distribution of population density in eastern China is shown in Figure 1.

Data matching
FIGURE 2
Feature importance for Himawari-8 satellite bands in the models of four seasons, where the blue dashed line represents the x=0.02 line.
Frontiers in Earth Science frontiersin.org

First, bilinear interpolation is used to resample the meteorological factors and geographic information to the grid resolution of Himawari-8 (0.05°×0.05°). Then, hourly SO2 ground station observations are matched against the established grid. If a grid cell contains one station, the observed value of that station is used as the value of the cell; if it contains more than one station, the average of the data from these stations is used. After data matching, the study area (97-136°E, 15-54°N) contains a total of 6,087,672 data points.
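The station-to-grid matching described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tuple layout of the station records, and the choice of the south-west corner (97°E, 15°N) as the grid origin are assumptions made for the example.

```python
import math
from collections import defaultdict

GRID_RES = 0.05                  # grid resolution in degrees, as stated in the text
LON_MIN, LAT_MIN = 97.0, 15.0    # assumed grid origin: south-west corner of the study area


def cell_index(lon, lat, res=GRID_RES):
    """Return the (row, col) index of the grid cell that contains a point."""
    # The small epsilon guards against floating-point error on cell boundaries.
    col = math.floor((lon - LON_MIN) / res + 1e-9)
    row = math.floor((lat - LAT_MIN) / res + 1e-9)
    return row, col


def grid_station_values(stations):
    """Average the station observations that fall in the same grid cell.

    `stations` is an iterable of (lon, lat, so2) tuples for a single hour.
    A cell with one station keeps that station's value; a cell with several
    stations gets their mean.
    """
    buckets = defaultdict(list)
    for lon, lat, so2 in stations:
        buckets[cell_index(lon, lat)].append(so2)
    return {cell: sum(values) / len(values) for cell, values in buckets.items()}


# Hypothetical observations: two stations share a cell, one is alone.
example = [(116.41, 39.91, 10.0), (116.43, 39.93, 20.0), (121.47, 31.23, 8.0)]
gridded = grid_station_values(example)
```

Cells without any station simply do not appear in the result, which matches the idea that only grid cells with ground observations contribute training samples.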

Bands selection
Testing shows that including a variable whose feature importance is below 2% not only degrades the model performance but also increases the computational cost, wasting storage space and running time. Therefore, we take 2% as the feature importance threshold for selecting TOAR bands. It should be noted that the feature importance only represents the contribution of each variable to the model and cannot explain the physical reasons why these variables affect the ground-level SO2 concentrations. In this way, we pick out suitable bands in each season as part of the input data. Figure 2 shows the feature importance for Himawari-8 satellite bands in the models of the four seasons. Based on the results in Figure 2, the final bands are selected as listed in Table 3.
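The threshold rule above can be illustrated with a short sketch. The importance values are hypothetical, and the normalisation (dividing each raw importance by the sum over all features passed in) is an assumption; the paper does not state how the importance shares were computed.

```python
def select_features(importances, threshold=0.02):
    """Keep the features whose normalised importance is at least `threshold`.

    `importances` maps a feature name (for example a band such as "B3") to a
    raw, non-negative importance score.
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    return [name for name, value in importances.items() if value / total >= threshold]


# Hypothetical raw importances for four bands in one seasonal model.
raw = {"B1": 50.0, "B2": 30.0, "B3": 19.0, "B4": 1.0}
kept = select_features(raw)
```

Because the selection is repeated per season, the function would be called once for each seasonal model, which is why the retained bands can differ between seasons.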

Light gradient boosting machine
LightGBM is a decision tree algorithm based on the histogram algorithm. Its main idea is still to train weak classifiers (decision trees) iteratively to obtain the optimal model, but it adds two new techniques, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), which allow it to capture data characteristics quickly (Ke et al., 2017). At the same time, LightGBM uses a depth-limited leaf-wise algorithm to filter out leaf splits with low gain, reducing the algorithm overhead. Owing to these optimizations, LightGBM saves considerable running time and storage space compared with traditional decision tree algorithms and can therefore process massive data rapidly (Ma et al., 2022).
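To make the GOSS idea more concrete, the following is a simplified sketch of the sampling step as described by Ke et al. (2017): instances with the largest absolute gradients are all kept, a random subset of the remaining instances is drawn, and the drawn instances are up-weighted by (1 - a) / b so that the gradient statistics stay approximately unbiased. The function name and default rates are assumptions for illustration; this is not LightGBM's internal implementation.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance_index: weight} for the instances retained by GOSS."""
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]

    # Randomly sample from the small-gradient instances.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Large-gradient instances keep weight 1; sampled ones are amplified.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Hypothetical gradients for 100 instances.
weights = goss_sample([float(i) for i in range(100)])
```

With the default rates, only 30% of the instances are used to evaluate split gains in each iteration, which is one source of the speed-up mentioned above.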
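The model performance is described by three indicators: the coefficient of determination (R2), the mean absolute error (MAE), and the root mean square error (RMSE). Their definitions are as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $\hat{y}_i$ represents the predicted value of the model, $y_i$ represents the true value, $\bar{y}$ represents the mean of the true values, and $n$ represents the total number of samples.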
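As a small worked example, the three indicators can be computed as below. The sketch assumes the standard definitions of R2, MAE and RMSE in terms of the predicted values, true values, their mean and the sample count; the function name and the sample numbers are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Hypothetical observed and estimated SO2 concentrations (μg/m3).
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
```

For these four samples the residual sum of squares is 26 and the total sum of squares is 500, so R2 is 0.948, MAE is 2.5 μg/m3 and RMSE is the square root of 6.5.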
The performance of LightGBM and other machine learning models on the Himawari-8 TOAR data is compared in Table 4. We choose the LightGBM model in this study because of its good performance and short running time.

Results and discussion

Model cross validation results
To test the performance of the model, we apply 10-fold cross validation (Chen et al., 2019; Chen et al., 2022a). The data are split into ten parts, nine for training the model and one for validating the results, and the process is repeated ten times. Figure 3 shows the 10-fold cross validation results on the validation dataset for each hour from 9:00 to 16:00 (this article uses Beijing time, which is 8 h ahead of Universal Time). R2 is 0.64-0.72, RMSE is 11.89-19.9 μg/m3, MAE is 5.56-9.65 μg/m3, and the fitting slope is 0.62-0.69. The results estimated by the model are slightly lower than the observations. During the period of 9:00-16:00, the performance of the model varies with time, generally rising first and then decreasing. The model performs best in the period of 13:00-14:00 (R2 is 0.72). This is because meteorological conditions around noon, such as high temperature and atmospheric instability, are conducive to the diffusion of pollutants, and because solar radiation is strongest at this time, so the TOAR is also stronger and the satellite receives the best radiation signal.
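The sample-based 10-fold procedure described above can be sketched with a plain index splitter. This is an illustrative helper written for this example (the name `k_fold_indices` and the shuffling seed are assumptions), not the routine used in the study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, valid_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


# Each sample is used for validation exactly once across the ten rounds.
for train_idx, valid_idx in k_fold_indices(95):
    pass  # fit the model on train_idx and evaluate it on valid_idx here
```

The validation predictions from the ten rounds are typically pooled before computing R2, RMSE and MAE, so every sample contributes one out-of-fold estimate.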
As shown in Figures 4A-D, the model performs best in winter with a 10-fold cross validation R2 of 0.72. Performance is the worst in summer, with an R2 value of only 0.51. R2 values in spring and autumn are 0.62 and 0.65, respectively. This may be related to the complex and changeable meteorological conditions in summer and the highest concentration of SO2 in winter (Wei et al., 2019; Zang et al., 2019). Therefore, the TOAR-SO2 model can effectively capture high SO2 events in winter over eastern China.
FIGURE 3
Hourly model 10-fold cross validation results based on samples, where the light dashed line is the perfectly fitted line, that is, the 1:1 relationship, LT represents Beijing time, and N represents the total sample amount.

The 10-fold cross validation results based on daily, monthly, seasonal, and annual average SO2 are shown in Figures 4E-H. The performance of the TOAR-SO2 model improves significantly when estimating monthly, seasonal, and annual average SO2, with R2 (RMSE) of 0.96 (2.75 μg/m3), 0.97 (2.16 μg/m3) and 0.97 (1.71 μg/m3), respectively. In contrast, the model is less accurate when estimating the daily average SO2, with an R2 (RMSE) of 0.75 (12.51 μg/m3).

To test whether the model performs better in regions with high annual average SO2 concentrations, this paper conducts a 10-fold cross validation for 348 cities in eastern China with ground-based SO2 records and then calculates the proportion of cities with R2 values of 0.8-0.9, 0.7-0.8, 0.6-0.7, 0.5-0.6, and 0-0.5 among the top 20, top 50, top 100, and top 150 cities ranked by pollution level from 2016 to 2020, and among all cities in eastern China. As Figure 5 shows, the proportion of cities with R2 values between 0.8 and 0.9 increases significantly as SO2 pollution increases. Cities with R2 values between 0.7 and 0.8 generally show the same trend, but only 10% of these cities are among the top 20 polluted cities. The proportion of cities with R2 values between 0.5 and 0.7 does not change significantly. However, the proportion of cities with R2 values lower than 0.5 decreases significantly with increasing SO2 concentration. It can be seen that the model has a better estimation effect in areas with severe SO2 pollution, where the estimated results are close to the site data.
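The city-level comparison above amounts to ranking cities by pollution level and tabulating R2 bins within each top-N subset. A minimal sketch is given below; the input layout (pairs of mean SO2 and city-level R2), the function name, and the handling of the bin edges (each bin includes its lower edge, values of 0.8 or more fall in the highest bin, and any negative R2 is counted in the lowest bin) are assumptions for illustration.

```python
BIN_LOWER_EDGES = [0.8, 0.7, 0.6, 0.5, float("-inf")]
BIN_LABELS = ["0.8-0.9", "0.7-0.8", "0.6-0.7", "0.5-0.6", "0-0.5"]


def r2_bin_shares(cities, top_n=None):
    """Return the share of cities in each R2 bin among the top_n most polluted.

    `cities` is a list of (mean_so2, r2) pairs; `top_n=None` uses all cities.
    """
    ranked = sorted(cities, key=lambda city: city[0], reverse=True)
    subset = ranked if top_n is None else ranked[:top_n]
    counts = dict.fromkeys(BIN_LABELS, 0)
    for _, r2 in subset:
        # Assign the city to the first bin whose lower edge it reaches.
        for lower, label in zip(BIN_LOWER_EDGES, BIN_LABELS):
            if r2 >= lower:
                counts[label] += 1
                break
    return {label: count / len(subset) for label, count in counts.items()}


# Hypothetical cities: (mean SO2 in μg/m3, cross validation R2).
cities = [(40, 0.85), (35, 0.82), (30, 0.75), (20, 0.55), (10, 0.30), (5, 0.45)]
top2 = r2_bin_shares(cities, top_n=2)
overall = r2_bin_shares(cities)
```

Comparing the shares for increasing `top_n` reproduces the kind of summary the text draws from Figure 5.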
In conclusion, the TOAR-SO2 model established in this study can accurately estimate the SO2 concentration in eastern China. The estimated result is slightly lower than the observation. The TOAR-SO2 model performs best in winter and in areas with severe SO2 pollution, and it works well when estimating monthly, seasonal, and annual average SO2. Therefore, the SO2 estimated by the TOAR-SO2 model can provide reliable data for monitoring the spatial variation and temporal trend of SO2 pollution in eastern China.

Feature importance of the TOAR-SO2 model
The feature selection of the TOAR-SO2 model adopts the backward selection method (Li et al., 2020b), that is, the variables with low feature importance are filtered out, and only the variables with high feature importance are retained. The feature importance of the selected variables in each season is shown in Figure 6A. The results show that TOAR and meteorological factors are the two variables that contribute the most to the model, and both of their feature importance values exceed 30% in each season. The high feature importance of meteorological factors indicates that they have a great influence on SO2 concentration (Xie et al., 2015; Liu et al., 2017). The feature importance of the time element is between 7.7% and 10%.
Among the various meteorological factors used in the model, U10, V10 and BLH contribute the most to the model, followed by RH, SP, and TM (Figure 6B). Wind speed can change the concentration of SO2 by changing the diffusion and transport speed of SO2, and BLH is related to the stability of the atmosphere and directly affects the vertical mixing and long-distance diffusion of pollutants (Miao et al., 2018). Besides, some studies have shown that BLH can also affect wind speed (Rigby and Toumi, 2008). In addition, a high-RH environment can accelerate the heterogeneous absorption of SO2 by aerosols, resulting in the conversion of SO2 to sulfate (Zhang et al., 2015b; Fu and Chen, 2017). SP and TM are related to the height of the boundary layer and the strength of the turbulence in the atmosphere (Zhang et al., 2015a; Mentes and Eper-Papai, 2015), which also affect the SO2 concentration.
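Grouped importance shares such as "TOAR" versus "meteorological factors" can be obtained by summing the per-feature importances within each group. The sketch below uses hypothetical importance values and an assumed grouping of feature names; it does not reproduce the values in Figure 6.

```python
def grouped_importance(importances, groups):
    """Sum normalised feature importances within each named group.

    `importances` maps feature name to raw importance; `groups` maps a group
    name to the list of feature names that belong to it.
    """
    total = sum(importances.values())
    return {
        group: sum(importances.get(name, 0.0) for name in names) / total
        for group, names in groups.items()
    }


# Hypothetical raw importances and grouping.
raw = {"B3": 20.0, "B7": 15.0, "BLH": 18.0, "U10": 17.0, "hour": 10.0, "height": 20.0}
groups = {
    "TOAR": ["B3", "B7"],
    "meteorology": ["BLH", "U10"],
    "time": ["hour"],
    "geography": ["height"],
}
shares = grouped_importance(raw, groups)
```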

Spatial distribution of SO2 in eastern China
By inputting the hourly data of each variable into the model, the hourly SO2 concentration in eastern China is estimated, and then the spatial distribution of the mean value of SO2 between 2016 and 2020 is calculated (Figure 7A). The result shows that the distribution of SO2 concentrations has obvious regional differences and is generally high in the north and low in the south (the average concentration in the north is 21.75 μg/m3, and the average concentration in the south is 18.05 μg/m3). The average concentration of SO2 in North China is the highest, reaching 22.21 μg/m3. This is due to the existence of a large number of coal mining enterprises in these areas, coupled with the multivalley basin topography, resulting in a large accumulation of SO2. The lowest annual concentration of SO2 is found in the southeastern and northeastern regions of China. A comparison with Figure 7B shows that the results predicted by the model are generally consistent with the observations. Figure 7C shows the annual average SO2 concentration predicted by the model for the 10 cities with the most serious SO2 pollution in the ground monitoring data. There is a certain deviation between the predicted results and the actual situation for individual cities, but most cities are close to the actual situation. According to the results, the concentration of SO2 in these cities gradually decreased from the high value in 2016 to less than 20 μg/m3 in 2020.
FIGURE 9
The intraday variation of SO2 concentration predicted by the model (μg/m3).

FIGURE 10
Spatial distribution of annual mean values of SO2 in North China from 2016 to 2020 (μg/m3).

The mean values of SO2 in spring, summer, autumn and winter from 2016 to 2020 are estimated, and their spatial distributions are shown in Figure 8. The results show that the concentration of SO2 demonstrates obvious seasonal differences, and the concentration of SO2 in winter is significantly higher than that in spring, summer and autumn, which is related to the large number of residents burning coal for heating in winter; furthermore, the stable atmospheric structure and low precipitation in winter are not conducive to wet deposition and diffusion of SO2 (Calkins et al., 2016; Zhao et al., 2016). The concentration of SO2 reaches the highest value in winter (25.88 μg/m3), then begins to decrease in spring (16.07 μg/m3), decreases to the lowest value in summer (14.22 μg/m3), and increases again in autumn (16.82 μg/m3). This indicates that the SO2 concentration varies continuously on the temporal scale.
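Seasonal means such as those quoted above can be derived from hourly estimates by grouping on the month of each timestamp. The sketch below assumes the usual meteorological seasons (December to February as winter, March to May as spring, and so on); the paper does not spell out its season definition, and the record layout and function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumed season definition: meteorological seasons of the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_means(records):
    """Average SO2 values per season.

    `records` is an iterable of (timestamp, so2) pairs, where `timestamp`
    is a datetime and `so2` is a concentration in μg/m3.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, so2 in records:
        season = SEASON_BY_MONTH[timestamp.month]
        sums[season] += so2
        counts[season] += 1
    return {season: sums[season] / counts[season] for season in sums}


# Hypothetical hourly estimates for one grid cell.
records = [
    (datetime(2016, 1, 5, 9), 30.0),
    (datetime(2016, 12, 5, 9), 20.0),
    (datetime(2016, 7, 1, 13), 14.0),
]
means = seasonal_means(records)
```

The same grouping pattern, keyed on the date, the month or the year instead of the season, gives the daily, monthly and annual averages used elsewhere in the analysis.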
This study also estimates the hourly mean value of SO2 between 9:00 and 16:00 in eastern China, and the results are shown in Figure 9. In general, the SO2 concentration keeps declining between 9:00 and 16:00, with the highest concentration at 9:00 and the lowest at 16:00. The concentration in the morning is generally higher than that in the afternoon. This is because the temperature in the morning is lower than that in the afternoon and the structure of the atmosphere is more stable, which is not conducive to SO2 diffusion. The intraday variation of SO2 concentration in southern China is not obvious, and the SO2 concentration remains low throughout the day.

Figure 10 shows the spatial distribution of the annual average concentration of SO2 in North China (the specific location is shown in Figure 1) estimated by the model from 2016 to 2020. In North China, the region with the most serious SO2 pollution, the annual average concentration of SO2 keeps declining from 2016 to 2020. In 2016, the SO2 concentration exceeded 40 μg/m3 in many areas. In 2017, the SO2 concentration decreased significantly, and the number of areas above 40 μg/m3 was greatly reduced. Within these areas, the SO2 concentration decreased most in western Shanxi Province, western Hebei Province and central Inner Mongolia, but remained at a high level in northern Ningxia. In 2018, the SO2 concentration further decreased, and only a few areas exceeded 40 μg/m3. By 2020, the SO2 concentration in 90.52% of North China was lower than the national ambient air quality level 1 SO2 concentration limit of 20 μg/m3. In general, SO2 pollution in North China has been effectively alleviated in the past 5 years, which is closely related to the wide application of flue gas desulfurization (Duan et al., 2016) and the government's policies to strengthen the control of SO2 emissions, such as coal desulfurization. In addition, SO2 pollution levels have also been affected by the COVID-19 epidemic (Fan et al., 2020; Ran et al., 2020).
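A statistic such as "the share of North China below the 20 μg/m3 limit" can be computed from a gridded annual mean field as sketched below. Weighting each cell by the cosine of its latitude, to approximate cell area on a regular latitude-longitude grid, is an assumption made here; the paper does not state whether the reported percentage is area-weighted. The function name and input layout are hypothetical.

```python
import math


def area_share_below(cells, limit=20.0):
    """Return the area-weighted share of grid cells with SO2 below `limit`.

    `cells` is an iterable of (latitude_in_degrees, so2) pairs. Each cell is
    weighted by cos(latitude), which is proportional to its area on a regular
    latitude-longitude grid.
    """
    total = 0.0
    below = 0.0
    for lat, so2 in cells:
        weight = math.cos(math.radians(lat))
        total += weight
        if so2 < limit:
            below += weight
    return below / total


# Hypothetical annual mean values (μg/m3) for four cells at the same latitude.
share = area_share_below([(40.0, 10.0), (40.0, 15.0), (40.0, 25.0), (40.0, 19.9)])
```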

Discussion
In this study, we build a TOAR-SO2 model with high spatial and temporal resolution over eastern China. The model performs well and can provide reliable SO2 data for remote areas lacking ground monitoring stations, which is of great significance for SO2 pollution control. The comparison of model performance between this study and other studies is shown in Table 5. Among studies that cover a large area rather than just a city, the Space-Time Extra-Tree (STET) model (Wei et al., 2022) performs best, followed by our model. However, our model has higher temporal and spatial resolution than the STET model. Figure 11 compares the annual average SO2 concentrations in eastern China from 2016 to 2020 between the dataset estimated in this study and the ChinaHighSO2 dataset estimated by the STET model. In general, the two are in good agreement, especially during 2018-2020. When estimating the annual average concentration of SO2, the R2 of ChinaHighSO2 (0.98) is slightly higher than that in this study (0.97), but our RMSE (1.71 μg/m3) and MAE (0.82 μg/m3) are lower than the RMSE (2.46 μg/m3) and MAE (1.35 μg/m3) of ChinaHighSO2. Overall, both models can be considered reliable in estimating the annual average SO2 concentration.

Conclusion
In this study, we apply Himawari-8 TOAR data to build a TOAR-SO2 model with high spatial and temporal resolution based on the LightGBM machine learning model. The TOAR-SO2 model can effectively capture high SO2 events in winter and works well when estimating monthly, seasonal, and annual average SO2, with R2 (RMSE) of 0.96 (2.75 μg/m3), 0.97 (2.16 μg/m3) and 0.97 (1.71 μg/m3), respectively. The concentration of SO2 in North China estimated by the model has shown a downward trend since 2016. Overall, the good agreement between ground measured and model estimated SO2 concentrations highlights the capability and advantage of using the model to monitor spatiotemporal variations of SO2 in eastern China.
In the future, we need to improve the accuracy of the model in summer and extend the prediction range to the whole of China to obtain more accurate hourly concentrations of ground-level SO2 with wider coverage. In addition, models established by machine learning methods lack interpretability. In the next step, we will improve the interpretability of the model by combining machine learning methods with atmospheric chemistry models that account for chemical mechanisms.

Data availability statement
The estimated data and data reading codes are available from https://doi.org/10.5281/zenodo.7047543.

Author contributions
Conceptualization, BC; methodology, BC and TX; writing-original draft preparation, BC and TX; resources, BC; formal analysis, TX; software, TX, YR, LZ, XL, YW, JH, and ZS; data curation, TX, YR, LZ, XL, YW, JH, and ZS; visualization, TX and ZS. All authors have read and agreed to the published version of the manuscript.