High Spatial Resolution Topsoil Organic Matter Content Mapping Across Desertified Land in Northern China

Soil organic matter (SOM) content is an effective indicator of desertification; thus, monitoring its spatial‒temporal changes on a large scale is important for combating desertification. However, mapping SOM content in desertified land is challenging owing to the heterogeneous landscape, relatively low SOM content and vegetation coverage. Here, we modeled the SOM content in topsoil (0–20 cm) of desertified land in northern China by employing a high spatial resolution dataset and machine learning methods, with an emphasis on quarterly green and non-photosynthetic vegetation information, based on the Google Earth Engine (GEE). The results show: 1) the machine learning model performed better than the traditional multiple linear regression model (MLR) for SOM content estimation, and the Random Forest (RF) model was more accurate than the Support Vector Machine (SVM) model; 2) the quarterly information regarding green vegetation and non-photosynthetic were identified as key covariates for estimating the SOM content in desertified land, and an obvious improvement could be observed after simultaneously combining the Dead Fuel Index (DFI) and Normalized Difference Vegetation Index (NDVI) of the four quarters (R2 increased by 0.06, the root mean square error decreased by 0.05, the ratio of prediction deviation increased by 0.2, and the ratio of performance to interquartile distance increased by 0.5). In particular, the effects of the DFI in Q1 (the first quarter) and Q2 (the second quarter) on estimating low SOM content (<1%) were identified; finally, a timely (2019) and high spatial resolution (30 m) SOM content map for the desertified land in northern China was drawn which shows obvious advantages over existing SOM products, thus providing key data support for monitoring and combating desertification.

Soil organic matter (SOM) content is an effective indicator of desertification; thus, monitoring its spatial-temporal changes on a large scale is important for combating desertification. However, mapping SOM content in desertified land is challenging owing to the heterogeneous landscape, relatively low SOM content and vegetation coverage. Here, we modeled the SOM content in topsoil (0-20 cm) of desertified land in northern China by employing a high spatial resolution dataset and machine learning methods, with an emphasis on quarterly green and non-photosynthetic vegetation information, based on the Google Earth Engine (GEE). The results show: 1) the machine learning model performed better than the traditional multiple linear regression model (MLR) for SOM content estimation, and the Random Forest (RF) model was more accurate than the Support Vector Machine (SVM) model; 2) the quarterly information regarding green vegetation and non-photosynthetic were identified as key covariates for estimating the SOM content in desertified land, and an obvious improvement could be observed after simultaneously combining the Dead Fuel Index (DFI) and Normalized Difference Vegetation Index (NDVI) of the four quarters (R 2 increased by 0.06, the root mean square error decreased by 0.05, the ratio of prediction deviation increased by 0.2, and the ratio of performance to interquartile distance increased by 0.5). In particular, the effects of the DFI in Q1 (the first quarter) and Q2 (the second quarter) on estimating low SOM content (<1%) were identified; finally, a timely (2019) and high spatial resolution (30 m) SOM content map for the desertified land in northern China was drawn which shows obvious advantages over existing SOM products, thus providing key data support for monitoring and combating desertification.

INTRODUCTION
Desertification is defined as land degradation in drylands resulting from various factors, including climatic variations and human activities (UNCCD, 1994), which has a major effects on carbon emissions through the loss of soil organic matter (SOM) when soil deplete from their original state (MEA -Millennium Ecosystem Assessment, 2005). China is among the countries most severely affected by desertification, representing 17.93% of its total land area (State Forestry Administration of China, 2015). Thus, desertification in China is likely to considerably impact the global CO 2 status (Feng et al., 2004;. Accurate and timely SOM content mapping is of great importance for assessing desertification (Sims et al., 2019). However, few researches have focused on the large scale SOM content estimation of the desertified land of Northern China (Li et al., 2018;Sun et al., 2019).
Soil sampling is the traditional method for determining the regional SOM content. Although the method yields more accurate results, it requires considerable effort, material resources, and time, especially for large areas (Wang et al., 2017;Hamzehpour et al., 2019;Lee et al., 2019;Yang et al., 2020). Due to its high-cost soil sampling is not practical for frequent or real-time SOC monitoring. Digital Soil Mapping (DSM) (McBratney et al., 2003) is efficient and convenient for acquiring soil information based on mathematical or statistical relationships between field soil observations and related predictor variables (e.g., climate, vegetation, relief, parent material, and time) (Zhao et al., 2014;Liang et al., 2019b;Keskin et al., 2019).
Remote sensing has facilitated the progress of DSM, mainly by providing up-to-date information on land cover and natural vegetation status, and is often replaced by vegetation indices (Lamichhane et al., 2019;Wiesmeier et al., 2019); for example, Normalized Difference Vegetation Index (NDVI) is the most commonly used (Plaza et al., 2018;Wang et al., 2018;Zhang and Huisingh, 2018;Xu et al., 2019;Zeraatpisheh et al., 2019;Yang et al., 2020). However, NDVI is constrained when representing vegetation coverage in semi-arid and arid areas due to the influence of exposed soil, dead vegetation and litter on the spectral response (Maynard and Levi, 2017). While incorporating non-green vegetation information in different seasons could help alleviate this kinds of problem in soil organic carbon prediction .
Machine learning methods, such as random forest (RF), support vector machine (SVM), enhanced regression trees (ERT), and co-kriging methods, have impacted the success of soil attribute estimations in recent years (Were et al., 2015;Keskin et al., 2019;Lamichhane et al., 2019;Wang et al., 2019;Zeraatpisheh et al., 2019). RF technology has become the most popular machine learning method and can help avoid model overfitting (Zhang et al., 2017;Zeraatpisheh et al., 2019;Yang et al., 2020) and support variable importance evaluation (Ballabio, 2009;Were et al., 2015;Zhi et al., 2018). In addition, the SVM model (Cortes and Vapnik, 1995), based on the principle of structural risk minimization, can avoid the dependence on data scale and data distribution; therefore, it has better learning capabilities and smaller prediction errors (Ballabio, 2009;Were et al., 2015). Many studies have compared the accuracy of different machine learning models; however, the conclusions are not uniform (Brungard et al., 2015;Liang et al., 2019a;Hamzehpour et al., 2019). For example, Wu et al. (2016) compared Mahalanobis Distance (MD), Maximum Likelihood (ML), Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Random Forests (RFs) for land degradation mapping, and found ML classification is a suitable candidate. Were et al. (2015) found SVR to be the best predictor of SOC, while Wu et al. (2018) showed that the RF model is superior to other models such as SVM. Nevertheless, there is a consensus that machine learning is more suitable for soil attribute estimation than traditional multiple regression (Chagas et al., 2016;Zeraatpisheh et al., 2019).
The relatively coarse resolution (250-1,000 m) of existing SOM products [e.g., the Harmonized World Soil database (WIEDER, 2014) with 1000-m resolution and Soil Grids (Hengl et al., 2017) with a higher 250-m resolution] is another limitation for the desertified land in northern China, in which high spatial heterogeneity exist and directly affect the spatial and temporal distribution of topsoil organic matter (Tongway and Ludwig, 2007). With the open access of the multi-spectral satellites (e.g., Landsat 8 and Sentinel-2), researches on SOC estimation with high spatial resolution were gradually carried out; for example, Wang et al. used Landsat 5 TM data to estimate SOC and nitrogen in Liaoning Province, China (Wang et al., 2017); Forkuor et al. (2017) found that Landsat eight data provides acceptable SOC digital map of south-western Burkina Faso, while in the study of Castaldi et al. (2019), Sentinel-2 data was sufficient for describing the variability of SOC. However, the majority of previous studies have only focused on the field or local scales, mainly due to the limitation associated with the processing of extensive high spatial resolution data. In recent years, the emergence of cloud platforms (such as GEE), characterized by massive amounts of available high spatial resolution datasets and powerful computing capabilities, has aided in collecting multitemporal, high spatial resolution vegetation information for SOM modeling (Dong et al., 2016;Gorelick et al., 2017;Beaton et al., 2019).
The main purpose of this study was to estimate the high spatial resolution topsoil SOM content in the desertified land of northern China based on the GEE platform by employing Sentinel-2 data, field-measured SOM samples, and ancillary factors with advanced machine learning methods (SVM and RF) and traditional statistical methods (MLR). In particular, the potential of high spatial and quarterly green and nonphotosynthetic information generated from Sentinel-2 data was explored.

Study Area
The extent of the desertified land was from the "Atlas of Desert and Aeolian Desertification in Northern China" which was produced through visual interpretation of Landsat images and has been adopted by many studies (Wang et al., 2011). The product includes multi-temporal results from 1975-2010 with 4°o f desertification: slight, moderate, severe, and extremely severe. In this study, desertification monitoring results from 2010 were finally utilized ( Figure 1). The total area is 375,900 km 2 and the climate difference is significant in the area, which spans arid, semi-arid, and sub-humid climate regions (Tao, 2014). Natural vegetation types include woodlands, shrublands, grasslands, and barren land from east to west (Zhang and Huisingh, 2018). The major soil types are Gray-brown desert soils (WRB reference soil group: Cambisols), brown calcic soil (WRB reference soil group: Cambisols), and aeolian sandy soil (WRB reference soil group: Arenosols). All desertification processes generally lead to an obvious reduction of SOM in topsoil; thus, the SOM content of the study area was lower than that in non-desertified areas (Allington and Valone, 2010;Tao, 2014).

Soil Sample Collection and Chemical Analysis
Based on the information on the type and extent of desertified land, and considering elevation, climate, and soil type, soil samples were collected from July to August 2017-2019. The sample collection was mainly concentrated in eight representative areas: Mu Us Sandy Land, Otindag Sandy Land, Horqin Sandy Land, Hulunbuir Sandy Land, Gonghe alpine desertified grassland, Minqin Oasis-Desert transition belt, the northern edge of the Taklimakan Desert, and the southern margin of the Gurbantunggut Desert ( Figure 1). The total number of samples was 243 ( Table 1). The size of the sampled square is 30 m*30 m, the soil profile depth was 0-20 cm and then nine 10*10 m grids are laid out in the square. Take one soil sample at the center of each grid, and get nine soil samples. The nine soil samples were thoroughly mixed and analyzed in the laboratory, and the SOM content was obtained by the analysis. At the same time, a handheld GPS (Trimble GPS, precision <1 m) recorder was used to record the geographic coordinates of the center point of the plot.
Soil samples were stored in a sealed package for SOM analysis in the laboratory. Subsequently, the bulk samples were air-dried and sieved to retrieve the fine earth fraction (<2 mm) (O'Kelly, 2004). SOM was measured by the dichromate oxidation method as described by the Agriculture Chemistry Speciality Council, Soil Science Society of China (Agriculture Chemistry Speciality Council, 1983). Jenny (1941) defined soil development as a function of climate, organic matter/organic activity, relief, parent materials and time. Later, McBratney et al. (2003) proposed digital soil mapping based on this, using the SCORPAN equation as a framework to quantitatively estimate soil properties through soil samples and environmental covariates. The latest SOC digital mapping review article shows that covariates representing organism/organic activities were among the most frequent among top five covariates, followed by the variables representing climate and topography (Lamichhane et al., 2019).

Predictor Variables
Organism/organic activities including land cover and natural vegetation, but real-time vegetation cover information is difficult to obtain. Vegetation index (Mahmoudabadi et al., 2017;Guo et al., 2020) and land cover or land use data are often used as vegetation indicators (Meersmans et al., 2008;Li et al., 2013). Especially, NDVI (Tucker, 1979) has been widely used in the estimation of SOC content, and it has been found to have a significant positive correlation with soil carbon storage , and in several soil carbon content estimation studies, NDVI is important predictors (Guo et al., 2020;Mahmoudabadi et al., 2017;Wang et al., 2020Wang et al., , 2017. At present, studies using NDVI for DSM (digital soil mapping), the acquisition of NDVI is mostly by an annual mean or alternatively by a single image acquired during a specific time of year (the peak of the growing season or the dry season, etc.). However, more and more studies support the view that the temporal variability of vegetation spectra driven by soil and climate feedback is an important predictor of soil variability (Ballabio et al., 2012;Maynard and Levi, 2017). Guo et al. (2020) used three NDVI images in different periods to successfully predict the SOC content at the field scale. Clark (2017) found that using NDVI in different seasons at the same time improved the accuracy of land cover classification.
In studies of digital mapping of SOC/SOM, the variable representing the "C" factor is the second most common (Lamichhane et al., 2019). Climate is the main controlling factor for SOC storage from a regional to global scale (Lamichhane et al., 2019;McBratney et al., 2003;Plaza et al., 2018;Wiesmeier et al., 2019). Therefore, the average annual temperature and annual rainfall are used as predictors for estimating SOC content in many studies (Liang et al., 2019b;Schillaci et al., 2017). With the development of remote sensing technology, climate indicators such as surface temperature and surface evapotranspiration have also been incorporated into the SOC content prediction index system (Sayão et al., 2018;Zhao et al., 2014).
Topography plays a vital role in the spatial distribution of SOC, because topography affects precipitation, water flow paths, stagnant water and discharge, and has a significant impact on the erosion process. Based on the terrain index extracted from DEM data, elevation, slope, aspect, TWI... are all commonly used predictors in SOC prediction (Wang et al., 2017;Zeraatpisheh et al., 2019).
We also selected 14 indicators that can represent climate, organism, and topography to predict the soil organic matter content in the study area ( Table 2), included two climatic factors: 10-years average annual temperature (MAT) and annual average precipitation (MAP); eight vegetation variables (NDVI and DFI of four quarters); three topographic factors (elevation, slope, aspect) and land-cover type (LC). The data sources of all variables were directly used online on the GEE platform, on which all related calculations were also completed. The unified projection was WGS 84 and the spatial resolution was 30 m. Through the reproject function {ee.Image (). reproject [(crs:'EPSG:4326', scale: 30)]} of the GEE platform. The method of resampling is nearest neighbor resampling, the unified projection was WGS 84 and the spatial resolution was 30 m.
In the selection of vegetation index, we considered both photosynthetic vegetation information and non-photosynthetic vegetation information, so we chose NDVI and DFI two vegetation indexes. DFI is the band index, which has good potential for estimating dead fuel (DF) coverage in steppe areas (Cao et al., 2010). DF is also called non-photosynthetic vegetation (NPV) or plant litter, and includes dry/cured/dead grasses, dead foliage and twigs, dead wood, slash wood, litter and duff, as well as crop residue (Cao et al., 2010). In addition, we calculated the vegetation index of the four quarters within a year for comparative experiments to study whether the multitemporal vegetation index information can improve the estimation accuracy of soil carbon content.
The selected remote sensing image source was Sentinel-2 data, and the spatial resolution of the bands differed. The highest spatial resolution was 10 m and the revisit period was 5 days. The image was obtained from the Sentinel-2 surface reflectance dataset (GEE ID: COPERNICUS/S2_SR) from the GEE. The period from 1 January to 31 December 2019 was used to ensure data quality by incorporating more images into the analysis. After determining the time and range, the MASK function provided by the GEE platform was first used to filter the images with less than 20% cloud cover. We calculated the vegetation indexes on the cloudless images, established a time series according to the quarters, and took the max value of the vegetation index of the image set in each quarter as a covariate.
That means a year's Sentinel-2 data is divided into four sets. Every 3 months is a set (January to March, April to June...), and four quarters' NDVI and DFI were calculated.
Here, NIR, Red, SWIR2, and SWIR1 are the reflectance values of the Sentinel-2 image in the near-infrared, red, and two short wave infrared bands, respectively, and correspond to bands 8, 4, 11, and 12 of Sentinel-2 data, respectively. In this study, MAT and MAP were selected as climate factors. The MAT and MAP sources were based on the ERA5 Daily aggregates dataset (GEE ID: ECMWF/ERA5/DAILY), obtained by the mean value function [return. mean ()] for the GEE. The data was collected from 2010 to 2019, the spatial resolution was 0.25°arc degrees, and the temporal resolution was 1 day.
The altitude factor was derived from the 90-m resolution SRTM90 (GEE ID: CGIAR/SRTM90_V4) topographic data in the GEE dataset. Slope and aspect were obtained by running a terrain analysis [ee.Algorithms.Terrain ()] on the GEE platform.
The land-cover data source was the MCD12Q1 V6 product (GEE ID: MODIS/006/MCD12Q1), which provides global landcover types at yearly intervals (2001-2016) derived from six different classification schemes. The type of classification scheme we selected was the Land-Cover Type 1: Annual International Geosphere-Biosphere Programme (IGBP) classification.
The high-resolution variation of SOM was acquired through the utilization of Sentinel-2 based VIs, namely NDVI and DFI. Except these high-resolution predictors, we do utilize some coarse resolution predictors, mainly due to the dataset availability in the GEE platform. However, these coarse resolution variables are meaningful for estimating the SOM level in the coarse-scale, and when combined with high-spatial resolution predictors, could generate a more detailed SOM map.

Modelling Methodology
The relationship between SOM and predictor variables was analyzed using correlation analysis. To explore the potential of quarterly green and dry vegetation information, we divided all predictor variables into three groups: 1) NDVI (July-September, corresponding to maximum green vegetation coverage), topography, climate, and land-cover types (covariate set A); 2) NDVI (four quarters), topography, climate, and land-cover types (covariate set B); and 3) the DFI (four quarters) was added for covariate set B (covariate set C).
Three modeling techniques were selected for SOM prediction: Multiple linear regression (MLR), RF, and SVM. Two machine learning methods (i.e., RF and SVM) can model the complex nonlinear relationship between variables and organic matter and are considered adept at predicting SOM in various regions (Were et al., 2015;Zhou et al., 2020). Moreover, RF can also sort the importance of variables by the change in the error outside the bag (Zeraatpisheh et al., 2019;Zhou et al., 2020). As a simple linear model, MLR was employed as an evaluation benchmark to compare the performances of machine learning and traditional statistical models.
RF is an ensemble learning method based on decision trees (Breiman, 2001). Re-sampling by the bootstrap method is used to randomly select observations in the dataset, using the principle of minimum error to select the optimal segmentation feature and the optimal segmentation point. The segmentation point (value) divides the data into two parts, determines the corresponding output value, and finally generates a regression decision tree after multiple divisions . The combination of trees generated by multiple choices constitutes an RF, and the final value of the predictor variable is the linear average of the prediction results of multiple trees (Ballabio, 2009;Were et al., 2015;Zhi et al., 2018). Based on the stability of the RF model, the excellent model performance, and the ability to evaluate the importance of variables, RFs have been widely used in various classification and regression problems Keskin et al., 2019;Yang et al., 2020). On the GEE platform RF, only two parameters need to be set to generate a predictive model: 1) the number of trees (n tree); and 2) the maximum number of leaf nodes in each tree (maximum nodes). To realize these two model parameters, we determined the final value of the parameters by setting the step size to traverse the parameter combination within a suitable range in the GEE platform. For the RF model, we set the number of trees (n tree) in GEE within the range of 10-10,000, and the maximum tree depth 5-50 to conduct the experiment and select the most suitable model parameters. The final model parameters are shown in Table 3: SVM is a machine learning method developed by Cortes and Vapnik (1995) based on the principle of structural risk minimization. It is specifically proposed for small-sample learning problems. The use of kernel functions solves the problem of dimensionality, which is suitable for dealing with nonlinear problems. In addition, the structural risk minimization principle enables the SVM to have a very good generalization ability (Were et al., 2015;Zhi et al., 2018). The hyperplane is used to optimally divide all data into different classes, which results in acceptable learning ability and smaller prediction error (Ballabio, 2009). In this study, the realization of an SVM algorithm was completed in the GEE platform. For the optimal parameter setting of the SVM models, the kernel function of SVM used the widely recognized radial-basis kernel (RBF) , the cost factor C, the minimum loss function P, and the gamma coefficient G were determined using the LIBSVM toolbox (Chang and Lin, 2011). In the process of parameter setting, the root mean square error (RMSE) of different parameter combinations was measured by 5-fold cross-validation. Finally, the parameters are determined as shown in Table 4: The MLR model was used to establish a linear equation between a dependent variable Y and multiple independent variables (X1, X2... Xn), and the least-squares method was used to estimate the parameters of each independent variable under the condition of minimizing the sum of the square of errors. It is a commonly used early modeling method for regression problems. Because the method is simple and provides accurate estimations, it has been used in many studies (Meersmans et al., 2008;Besalatpour et al., 2013;Zhang et al., 2017). Although MLR is relatively simple, in many studies, it is well fitted to the relationship between variables. Furthermore, when evaluating the superiority of a machine learning model, a linear model is often used as a benchmark. Therefore, we selected MLR to estimate the SOM content due to its robust and widespread applications (Chagas et al., 2016;Lee et al., 2019).

Model Performance Assessment
Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data. The determination coefficient (R 2 ), RMSE, ratio of prediction deviation (RPD), and the ratio of performance to interquartile distance (RPIQ) were selected as indicators to measure the model accuracy.
In Eqs. 3-6, P i and O i are the predicted and observed SOM content, respectively; n is the number of samples; P and Ò are the means of the predicted and observed SOM content, respectively; SD is the standard deviation of measured values; IQ is the interquartile range (IQ Q3−Q1) of the measured SOM in the validation dataset; and Q1 and Q3 are the first and third quarters of the SOM content in the validation dataset, respectively. In general, a good model prediction should correspond to high R2, RPD, and RPIQ values, and low RMSE values. In particular, the model classification criteria adopted in this study were based on RPD values, which were divided into five model classes:

SOM Content Distribution
Statistical analysis of the soil sample points showed that the maximum SOM content in the study area was 3.8%, the   minimum was 0.001%, the mean was 0.559%, the standard deviation was 0.749%, and the coefficient of variation was 0.334. From the distribution map (Figure 2), it can be seen that the overall SOM content (0-20 cm) was low (i.e., <1% in most areas), which is consistent with the overall characteristics of the SOM in desertified land. Among the 243 samples, 34.36% had a SOM content <0.2%, whereas only 13.03% of samples had a SOM content >2%. According to the different sampling regions, the SOM content in the Gonghe alpine desertified grassland, Mu Us Sandy Land, and Hulunbuir Sandy Land were significantly higher than in other areas.

Correlations Between Single Input Factors and SOM Content
The results of the correlation analysis between the single input factors and the SOM content are shown in Figure 3. SOM content was positively correlated with mean annual precipitation (r 0.48) and elevation (r 0.48), and negatively correlated with mean annual temperature (r -0.45). Perhaps more worthy of attention was the very high correlation between SOM and the quarterly vegetation index. Not unexpectedly, NDVI was positively correlated with SOM, and the correlation of NDVI in the third quarter (r 0.47) was higher than that in other quarters. The performance of DFI was yet unexpected. The DFI in the four quarters showed a good correlation with SOM (r > 0.5), especially in the first quarter (r 0.64).
In the study area, water is the major limiting factor of plant production, thus the increase of precipitation could contribute to the accumulation of SOM (Plaza et al., 2018). An increase in temperature will lead to accelerated decomposition of organic matter in the soil and thus decrease the SOM reserves. Additionally, the positive relationship between elevation and SOM is mainly due to due to the influence of precipitation and temperature. As the elevation increases, precipitation increases and temperature decreases (McBratney et al., 2003;Plaza et al., 2018;Lamichhane et al., 2019).
Topography will affect the water temperature conditions and the distribution of soil-forming materials (Jenny, 1941). Among all topography variables, the correlation between elevation and SOM reached 0.48, slope reached 0.15, and aspect correlation reached 0.12. Because elevation and slope, slope direction often affects the redistribution of surface water and heat conditions, the water accumulation, and the decomposition rate of surface litter, as well as reflects the soil erosion and deposition process, which indirectly affects the change of soil organic matter content (Plaza et al., 2018). Model Validation and Comparison Figure 4 and Table 5 show the performance of the three modeling techniques (RF, SVM, and MLR) under different combinations of covariates. According to the RPD value of the model, the RF-B and RF-C models (RPD 2.0, RPD 2.1) can be evaluated as being very good models. The three models of the SVM method can be considered fair, and the three linear models are considered very poor. Comparing the changes in the evaluation index values of each model, we found that the accuracy of the model has a close effect on the selection of the modeling method and the parameter combination. With the addition of the quarterly NDVI and introduction of the quarterly DFI index, the model effect gradually improved. After adding the quarterly information of NDVI, the R 2 in the RF-B model increased by 0.2, the RMSE decreased by 0.01, the R 2 in the SVM-B model increased by 0.5, and the RMSE decreased by 0.02. After simultaneously combining the DFI and NDVI in the four seasons quarterly as variables, the model accuracy was more obviously improved. The RF model R 2 increased by 0.06, the RMSE decreased by 0.05, RPD increased by 0.2, and RPIQ increased by 0.5. The SVM model R 2 increased by 0.09, the RMSE decreased by 0.05, RPD increased by 0.16, RPIQ increased by 0.29. The MLR model R 2 increased by 0.07, RMSE decreased by 0.04, RPD increased by 0.12, and RPIQ increased by 0.21. The results further show the importance of multi-temporal data and dry vegetation on the surface of arid areas and demonstrate that the combination of all predictors (covariate set C) results in the best predictive performance.
Compared with the performance of the three models, the two machine learning methods showed better estimation ability than the linear regression model. This is similar to the complex nonlinear relationship between soil attributes and environmental factors proved by many soil attribute estimation studies, which are more suitable for fitting using a FIGURE 4 | Model accuracy evaluation scatter plot. Models A,B, and C refer to covariate sets A,B, and C, respectively. RMSEroot mean square error, RPDratio of prediction deviation, RPIQ -ratio of performance to interquartile distance.
Frontiers in Environmental Science | www.frontiersin.org August 2021 | Volume 9 | Article 668912 machine learning method (Besalatpour et al., 2013). For the two machine learning methods, the RF model performed better than the SVM model under different parameter combinations. The three modeling methods consistently showed the best effect of model C, which also shows that the addition of the DFI index and multi-temporal vegetation index can effectively improve the model effect.

Variable Importance
For the SOM mapping using the model RF, the ranking of the predictors sorted by relative importance is shown in Figure 5 (note: Importance was converted to a percentage). The importance of the variables in models A, B, and C differed slightly. In covariate set A, the importance of MAT reached 39%, elevation reached 26%, and NDVI in the third quarter (Q3) reached 13%; in covariate set B, MAT and elevation were the two most important variables, followed by the NDVI in the first quarter (Q1); in covariate set C, DFIs in Q1 and the second quarter (Q2) (importance of 23 and 23%, respectively) were important variables for predicting the spatial distribution of SOC, followed by elevation, MAP, and MAT. This may be due to the mutual influence between variables, and models with different combinations of variables had shown different importance.

Spatial Prediction
According to different model methods (RF, SVM, and MLR) and three sets of parameter combinations (covariate sets A, B, and C), the SOM maps for the study area were produced with a 30 m resolution based on the GEE platform ( Figure 6). The results showed a distribution trend of high SOM in the east and low SOM  in the west, and high-value areas were mainly distributed in the farming-pastoral transition zone near the Hunshandake Sandy Land, Hulunbuir Sandy Land, and Qinghai Gonghe Alpine. The areas with lower values were located near the Mu Us Sandy Land and the marginal and central part of the Gurbantunggut Desert in the northwest. This can be partly explained by the land-cover type and vegetation coverage. However, the difference between the three modeling methods was also obvious ( Figure 6). RF produced more high-value regions (SOM >1.5%) and provided a better simulation of the low-value regions on the edge of the Gurbantunggut Desert. The spatial differences in the organic matter maps brought about by the three different combinations of variables were reflected in the Mu Us Sandy Land and the desertified area in northwest Qinghai. For example, in the desertified land northwest of Qinghai, the SOM contents determined using the RF model were 0.1-0.3% for group A, 0.3-0.5% for group B, and 0.5-1% for group C.
Statistics of the SOM content (i.e., results of RF model C) were analyzed according to different desertification degrees, and the results are shown in Figure 7. There was a significant negative correlation between the degree of land desertification and the SOM content. The maximum and minimum SOM contents corresponding to different degrees of desertification exhibited no obvious regularity, mainly because desertified land is generally a mixture of vegetation and bare soil in different proportions, while the SOM content prediction is based on the pixel scale.

Variable Covariate Importance
The NDVI has been regarded as the most important and commonly used alternative index and plays an important role in SOM and SOC emissions (McBratney et al., 2003;Yang et al., FIGURE 6 | Spatial distribution of soil organic matter (SOM) content. A,B, and C refer to covariate sets A,B, and C, respectively; 1, 2, and 3 represent different modeling techniques, i.e., random forest (RF), support vector machine (SVM), and multiple linear regression model (MLR), respectively. Frontiers in Environmental Science | www.frontiersin.org August 2021 | Volume 9 | Article 668912 2020; Zhao et al., 2014). However, in desertified areas with sparse and non-photosynthetic vegetation coverage, NDVI does not provide sufficient vegetation information to predict SOM . As shown in Figure 5, the contribution of NDVI was lower than that of MAT and elevation factors, and the inclusion of quarterly NDVI information resulted in little improvement, whereas DFI had important effects on SOM estimation in the desertified lands. DFIs in Q1 and Q2 were the most important variables ( Figure 5C), and the relative importance reached 46%, followed by that of elevation and MAT, while the importance of the best performance of the seasonal NDVI only reached 4%. This finding is consistent with other studies (Maynard and Levi, 2017), who found that when bare land reaches or exceeds 20%, NDVI (as a representative variable of vegetation characteristics) no longer has statistical significance. At the same time, Wang et al. (2018) also proved that the indicator function of non-green vegetation on soil carbon content cannot be ignored. The dynamic distribution of quarterly NDVI and DFI was plotted against four grades of SOM content from low to high, namely 0-1, 1-2, 2-3, and 3-4%, respectively (Figure 8). The box plot shows that NDVI and DFI had different quarterly trends. At the same level of SOM content, NDVI in Q2 was higher than NDVI in other quarters, while DFI reached its maximum value in Q1. Additionally, as the level of SOM increased, the values of quarterly NDVI and DFI also increased. This is also consistent with the results of previous studies , Wang et al. 2017) because vegetation controls the input of surface organic matter, and soil fertility will also determine the growth of ground vegetation (Plaza et al., 2018). The importance of DFI in Q1 and Q2 can be explained, since the separability in the distribution of DFI values for the 0-1% SOM degradation in Q1 and Q2 is most significant, which is completely different from the upper limit, lower limit, and quantile of other SOM content level, compared to the distributions of other vegetation indices.

Comparison of Soil Organic Carbon Map to Other Existing Datasets
Currently, research on SOM or SOC in China focuses on small areas in a province or a small ecosystem (Li et al., 2018;Sun et al., 2019). Very few large-scale studies have been conducted and have a low resolution (1 km). Furthermore, the sample data for the estimation of SOM in China is from the 1980 soil census data, time is old, and errors exist (Feng et al., 2002;Liang et al., 2019b). Moreover, this study shows that the estimation of SOM content in northern China is highly uncertain because of the uneven distribution of historical samples and the low SOM content (Liang et al., 2019a). Our work addresses these knowledge gaps to some extent and has the potential to be extended to other areas.
To demonstrate the advantages of our products, comparisons with the harmonized world soil database (WIEDER, 2014) and soil grids (Hengl et al., 2017) [i.e., two of the most widely used global soil organic matter products and with a higher resolution (250 m)] were FIGURE 8 | Dynamic distribution of vegetation index at different soil organic matter (SOM) content grades (a: SOM content was 0-1%; b: SOM content was 1-2%; c: SOM content was 2-3%; d: SOM content was 3-4%; 1_NDVI: Normalized difference vegetation index (NDVI) in the first quarter; 2_NDVI: NDVI in the second quarter; 3_NDVI: NDVI in the third quarter; 4_NDVI: NDVI in the fourth quarter; 1_DFI: DFI in the first quarter; 2_ DFI: Dead fuel index (DFI) in the second quarter; 3_ DFI: DFI in the third quarter; 4_ DFI: DFI in the fourth quarter).
Frontiers in Environmental Science | www.frontiersin.org August 2021 | Volume 9 | Article 668912 conducted. The results of the RF model were converted into soil organic carbon using a coefficient of 0.58, and we took an area of Mu Us Sandy Land as an example to make small-scale comparisons of different data products ( Figure 9). Clearly, our 30-m resolution product shows more spatial details. In the desertification area of northern China, the ground situation is relatively complex, and the main vegetation types in the sandy land are sparsely distributed shrubs with strong spatial heterogeneity. Estimation of SOM content with a low resolution would involve a large number of mixed pixels. The shrub-covered area with relatively high SOM content could not be distinguished from bare soil or sand with extremely low SOM content, and the spatial variation of the SOM content cannot be expressed.

Effects of Desertification on SOM Content
According to our research, the SOM content decreased as the degree of desertification increased. The SOM content changed from 0.81 for slight desertification to 0.62 (g/100 g), 0.51, and 0.42 as the degree of desertification deepened, and the SOM content decreased by 48% from areas with slight to extremely severe desertification. The majority of the SOM is lost in the early stages of desertification, and less loss occurs in the subsequent stage. In a study of soil nutrients in different degrees of desertification on the Qinghai-Tibet Plateau, Li et al. (2006) also found that the deepening of the degree of land desertification will lead to a reduction of SOM and that although there was a significant difference between the mild and heavy stages of desertification (p < 0.01), there was no significant difference between the heavy and severe stages of desertification, which also means that most of the organic matter in the topsoil had been lost in the heavy stage of desertification. Although different studies have reached the same conclusion on the impact of land desertification on SOM (Li et al., 2006;Zhao et al., 2009;Tang et al., 2015), the differences in SOM between different degrees of desertification were inconsistent. In our study, the SOM content of slight to extremely severe desertification decreased by 48%, while Zhao et al. reported a 29% reduction (Zhao et al., 2009). In short, the amount of SOM loss caused by land desertification both depends on the initial level of SOM in the study area and the type of desertification.
Given the large area of sandy land in northern China (37.59 M Ha) and obviously SOM content in different desertification degrees, the large scale restoration of desertied land under way, such as the Beijing and Tianjin Sandstorm Source Controlling Program (Wu et al., 2013), the Grain for Green Project (Stokes et al., 2010), and the Three-North Shelterbelt Project (Ma, 2004), would have an important effect on SOM gains in the desertified land. At the same time, it is easier to obtain an increase in SOM by first restoring the desertified lands not seriously degraded.

CONCLUSION
In this study, we produced a high spatial resolution (30 m) SOM content map for the desertified land in northern China based on the online remote sensing dataset and machine learning techniques of the GEE platform. Our study initially used quarterly NDVI and DFI as remote sensing indicators to predict SOM contents using machine learning and linear regression models. The results show that the quarterly information of the green and non-photosynthetic vegetation was identified as a key covariate for SOM content estimation in desertified land, and an obvious improvement was observed after simultaneously combining the DFI and NDVI of the four quarters (i.e., R 2 increased by 0.06, the RMSE decreased by 0.05, RPD increased by 0.2, and RPIQ increased by 0.5). The importance of DFI in Q1 and Q2 for low SOM content estimation was illustrated in this study. The performance of machine learning was better than that of the linear regression model, of which RF (R 2 0.80; RMSE 0.36) showed some advantages compared to SVM (R 2 0.66; RMSE 0.47). This is promising for SOM content mapping for desertified land with low SOM content and sparse vegetation in arid regions, especially with the assistance of the GEE platform. This study also reveals that the high spatial resolution SOM content map (30 m) is more appropriate for heterogeneous desertified land than other existing coarse spatial resolution global SOM products. In addition, our study found that the majority of SOM is lost in the early stage of desertification and that restoring desertified land has the potential to increase SOC storage when considering the vast and wide distribution of desertified land. In the future, SOM content mapping f or desertified land could be improved by incorporating more representative predictor variables and improving machine learning training. However, the products developed in this study could be utilized for desertification and soil carbon studies in northern China.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
XL designed the experiment; JY analyzed the data and completed the draft; BS performed the field survey; CY provided the map of sandy land distribution; BW and ZG provided guidance on experiment design; JW revised the draft. All authors have read and agreed to the published version of the manuscript.