Municipal Solid Waste Forecasting in China Based on Machine Learning Models

As the world's largest producer of municipal solid waste (MSW), China is challenged by a low MSW utilization rate due to the lack of a smart MSW forecasting strategy. This paper aims to construct an effective MSW prediction model using machine learning techniques. Based on an empirical analysis of provincial panel data from 2008 to 2019 in China, we find that the Deep Neural Network (DNN) model performs best among all the machine learning models considered. Additionally, we introduce the SHapley Additive exPlanation (SHAP) method to unravel the correlation between MSW production and socioeconomic features (e.g., total regional GDP, population density). We also find that growth of the urban population and agglomeration of the wholesale and retail industries promote the production of MSW in regions of high economic development, and vice versa. These results can help in the planning, design, and implementation of solid waste management systems in China.


INTRODUCTION
Over the past decade, the urban population in China has reached up to 900 million residents, with an urbanization rate of over 60% (NBSC, 2021). This growth significantly strains the urban resources (e.g., water, air, and energy) that underpin residents' quality of life (Hoornweg and Bhada-Tata, 2012). Municipal solid waste (MSW), as a source of renewable energy, is considered an essential part of the Waste-to-Energy (WtE) system (Ouda et al., 2013; Kuznetsova et al., 2019; Mukherjee et al., 2020). It is reported that the production of MSW in China was around 242 million tons in 2020, compared with 8.17 million tons in 2008 (NBSC, 2020). The efficient management of municipal solid waste is therefore becoming an important concern for urban sustainability governance. However, the utilization efficiency of MSW was merely about 45% in China, much lower than in other advanced countries, such as over 80% in Japan (Ding et al., 2021). Increasing the utilization efficiency of MSW therefore matters to both central and local governments in China as they promote sustainable urban development (He and Lin, 2019).
In general, an integrated decision-support methodology for waste-to-energy management systems (WtEMS) design is composed of three modules: 1) waste modeling and prediction, 2) optimization of the WtEMS, and 3) a multi-dimensional assessment, as shown in Figure 1 (Kuznetsova et al., 2019). Among these three modules, waste modeling and prediction play a fundamental role in effective urban planning and energy management. Many international scholars have carried out extensive studies on this module using group comparisons, time series analysis, and system dynamics (Beigl et al., 2008). Recently, with the popularity of machine learning (ML) methods, alternative methods have been put forward to forecast the quantity of generated municipal solid waste effectively (Guo et al., 2021). For instance, taking Suzhou as an example, Niu et al. (2021) constructed a long short-term memory (LSTM) neural network, an autoregressive integrated moving average (ARIMA) model, and a traditional neural network to predict MSW production. They found that the LSTM played a vital role in predicting MSW production but did not reveal the correlation between MSW production and socio-economic variables. Nguyen et al. (2021) selected residential areas in Vietnam as a case study and found that both the random forest (RF) and the k-nearest neighbor (KNN) approaches performed effectively in predicting the amount of urban waste. Birgen et al. (2021) developed a Gaussian Processes Regression (GPR) method to predict the daily lower heating value of MSW by combining the historical data of a WtE plant with weather and calendar data. In addition, other ML methods, such as the support vector machine (SVM) (Kumar et al., 2018) and the decision tree (Kannangara et al., 2018), have also been employed to predict MSW production.
Similar to other energy forecasting research topics (e.g., crude oil prices, gas consumption), MSW production is also highly influenced by various socio-economic factors (Zhang et al., 2009; Liang et al., 2019; Huang et al., 2021a). However, previous studies neither revealed the correlation between each factor and MSW production nor identified their interaction in different socioeconomic circumstances (Kannangara et al., 2018; Niu et al., 2021; Nguyen et al., 2021). In the context of China, existing studies have scarcely discussed the performances and applications of different ML methods in predicting MSW. Therefore, this paper aims to construct a prediction model using machine learning models and provincial panel data for 2008-2019 in China. It also compares the performances of six different ML models in predicting China's municipal solid waste generation. Considering that the data input form and the model hyperparameters have a great influence on prediction results, we tested different preprocessing strategies to ensure robust estimation and prediction by the ML models. Finally, this paper provides some potential implications for both policy-makers and other industry stakeholders, supported by evidence from the ML prediction model.
FIGURE 1 | Integrated decision support method for WtEMS design: methodology flowchart.
Frontiers in Energy Research | www.frontiersin.org November 2021 | Volume 9 | Article 763977
The initial contributions of this paper are threefold. First, it emphasized the good performance of machine learning approaches in predicting MSW production and extended the existing literature to construct a prediction model by comparing six supervised learning algorithms. These models varied from linear, non-linear to ensemble methods and artificial neural network methods, including a body of discussions on data preprocessing, resampling, model training, testing, and interpretation steps. Therefore, the constructed prediction model of MSW would theoretically shed light on other similar research related to prediction issues in the future. Second, this paper estimated the impacts of diverse socio-economic factors on MSW production, such as the regional economic development level (e.g., regional GDP, population density, per capita disposable income), industrial structure (e.g., wholesale and retail values added), and waste generation characteristics. Third, to improve the interpretations of ML models, this paper employed the SHapley Additive exPlanation (SHAP) approach and visualized the SHAP value of each explanatory variable. This technique would also provide good evidence to explain the outcomes of ML models for other researchers in the future.
The remaining sections of this paper are organized as follows: Materials and Methods describes the models adopted in this paper and the process of data acquisition. Results reports the comparison among the six ML models, presenting their predictive capability and the SHAP analysis. Conclusion provides conclusions and some implications.

MATERIALS AND METHODS
Figure 2 outlines the main steps of the methodology used in this study. First, we preprocessed the original database and selected critical variables for MSW prediction. Second, we compared six ML models: multiple linear regression (MLR), support vector regression (SVR), Random Forest, extreme gradient boosting (XGBoost), k-nearest neighbor, and deep neural network (DNN). Third, three evaluation metrics are used to compare the prediction performance of each algorithm. Finally, the SHAP method is employed to analyze and discuss the output.

The Multiple Linear Regression Model
Multiple linear regression is a commonly used ML method for estimating the marginal effects of independent variables (called the feature vector in machine learning) on the dependent variable. It has been widely applied to waste prediction, with desirable explanatory power, in different regions and countries (Beigl et al., 2008). In China, this approach is also employed to predict MSW production in "Calculation and Prediction Method of Municipal Solid Waste Production (CJ/T 106-1999)", the official guide compiled by the Ministry of Construction, China.
The model can be expressed as Eq. 1:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon \quad (1)$$
where $Y$ is MSW generation in this paper, $\beta_0$ denotes the regression constant, $\beta_1, \ldots, \beta_k$ are the regression coefficients, $X_1, \ldots, X_k$ are the explanatory variables, and $\epsilon$ is the regression residual. Usually, MLR uses the ordinary least squares (OLS) method to estimate the parameters that achieve the lowest sum of squared errors between the observed and predicted responses. Under OLS estimation, the results of MLR are easy to interpret. However, MLR has some drawbacks. For instance, multicollinearity among the predictors can result in estimation errors, and omitted variables can induce biased estimates. In this paper, we concentrate on the performance of each ML model and select variables based on earlier studies (Kannangara et al., 2018; Namlis and Komilis, 2019; Niu et al., 2021; Nguyen et al., 2021); the multicollinearity and omitted-variable problems are therefore outside our scope.
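For reference, the OLS criterion described above can be written compactly. The matrix notation below is an assumption introduced here for illustration: $X$ is taken to be the design matrix with a leading column of ones, $Y$ the vector of observed responses, and $X^{\top}X$ is assumed to be invertible.

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_{i=1}^{n}
      \Bigl( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \Bigr)^{2}
  = (X^{\top} X)^{-1} X^{\top} Y
```

This form also shows why multicollinearity matters: when predictors are nearly collinear, $X^{\top}X$ is close to singular and the estimated coefficients become unstable.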

Support Vector Regression
SVM was originally used to deal with pattern recognition problems and was later extended to regression because of its sparse solutions and good generalization (Demir and Bruzzone, 2014). By introducing an ε-tube to reformulate the optimization problem, the SVM model can be transformed into an SVR model that finds the optimal approximation of a continuous-valued function while balancing the complexity and the prediction error of the model (Huang et al., 2021b). The accuracy of an SVR model relies heavily on three parameters: the penalty parameter (C), the kernel width (c), and the precision parameter (ε) (Abbasi and El Hanandeh, 2016; Li et al., 2021). Specifically, the larger C is, the smaller the fitting error on the training data and the weaker the generalization ability tends to be. The larger c is, the more support vectors there are, and vice versa. ε is the radius of the tube located around the regression function; in other words, the choice of ε denotes the magnitude of errors that can be neglected. Since these three parameters are critical to the adaptability of the model, we tune them with a grid search in Results to optimize the SVR model.
A large body of literature has discussed the SVR and SVM models in predicting the generation of MSW. For example, Abbasi and El Hanandeh (2016) tuned the hyperparameters of SVR with the grid search method and applied the model with the optimal parameters to the monthly prediction of MSW in Logan City, Australia. They found that SVR can effectively reduce the mean absolute error (MAE) and root-mean-square error (RMSE), and improve prediction performance (R-square). In addition, SVM has been applied to the prediction of MSW production in Vietnam with an MAE of 131.07, which confirmed that the SVM model gave better predictions. Kumar et al. (2018) applied SVM to the prediction of the plastic waste production rate and found that its result (R² = 0.74) was better than that of RF (R² = 0.66) and slightly lower than that of the artificial neural network (ANN) (R² = 0.75). Mehrdad et al. (2021) argued that SVM was superior to both the adaptive neuro-fuzzy inference system and artificial neural network models in predicting methane generation.
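The role of the precision parameter ε can be illustrated with the ε-insensitive loss that underlies SVR: errors that fall inside the tube are ignored, and only the excess beyond ε is penalized. The following is a minimal sketch in plain Python, not the implementation used in this study; the function name and the sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Mean epsilon-insensitive loss: errors inside the tube cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical values, chosen only for illustration.
actual = [10.0, 12.0, 15.0]
predicted = [10.4, 11.0, 15.1]

# Only the second prediction lies outside a tube of radius 0.5.
loss = epsilon_insensitive_loss(actual, predicted, epsilon=0.5)
print(loss)
```

A wider tube (larger ε) ignores more of the errors, which is why ε has to be tuned on the same scale as the target variable.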

Random Forest
Random Forest is an evolution of Bagging, which aims to reduce the variance of a statistical model: it simulates the variability of the data through the random extraction of bootstrap samples from a single training set and aggregates the predictions on a new record (see Breiman, 1996). It gives more stable and better predictions of the explained variables than other machine learning models (Huang et al., 2021b). Generally, the RF algorithm can be expressed as follows: 1) use Bagging to randomly generate sample subsets; 2) apply the random subspace idea by randomly extracting features, splitting nodes, and building a regression sub-decision tree; 3) repeat the above steps to construct T (the number of decision trees) regression sub-decision trees that form a random forest; 4) take the mean of the predicted values of the T sub-decision trees as the final prediction result.
The RF model has been widely used in the prediction of waste. Kumar et al. (2018) used RF to predict the plastic waste generation rate and obtained an R-square of 0.66. The size of the random forest, that is, the number of decision trees (Ntrees), and the number of features tried at each split (Nfeatures) have a significant impact on the predictive ability of the RF model (Hariharan, 2021). When Ntrees exceeds a certain value, the prediction performance of the model converges. In this case, increasing the number of decision trees will not improve the model performance but will result in model redundancy. In addition, using a smaller Nfeatures reduces the similarity among the trees in the forest, but also reduces the complexity and strength of each tree. Conversely, increasing Nfeatures can make each tree more powerful, but also increases the correlation between the trees. Therefore, in the following section, we optimize these two hyper-parameters to acquire better results.
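The four RF steps and the two hyper-parameters can be sketched in a few lines of plain Python. This is an illustrative toy, not the scikit-learn model used in this study: each "tree" is simplified to a single-split regression stump, so drawing one random feature subset per tree is equivalent to drawing it per split. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, feature_ids):
    """Fit a one-split regression tree, searching only the given features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            lm, rm = mean(left), mean(right)
            sse = (sum((yi - lm) ** 2 for yi in left)
                   + sum((yi - rm) ** 2 for yi in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # every candidate feature is constant in this sample
        m = mean(y)
        return lambda row: m
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm


def fit_forest(X, y, n_trees=50, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):                            # step 3: repeat T times
        idx = [rng.randrange(n) for _ in range(n)]      # step 1: bootstrap sample
        feats = rng.sample(range(p), n_features)        # step 2: random features
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees


def predict_forest(trees, row):
    return mean(tree(row) for tree in trees)            # step 4: average the trees
```

Here `n_trees` plays the role of Ntrees and `n_features` the role of Nfeatures; a real forest grows deeper trees and re-draws the feature subset at every node.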

Extreme Gradient Boosting
The XGBoost algorithm, proposed in 2016, is a relatively new approach (Chen and Guestrin, 2016). Unlike the RF model, which uses the bagging ensemble method, XGBoost is an ensemble tree model that uses boosting to combine classification and regression trees (CART). It has the advantages of fast training speed and high prediction accuracy. The result of XGBoost is the sum of the prediction scores of all CARTs (Chen and Guestrin, 2016), as given in Eq. 2:
$$\hat{y} = \sum_{m=1}^{N} f_m(x) \quad (2)$$
where $N$ represents the number of trees in the model, $f_m$ represents each CART tree, $x$ is the feature vector, and $\hat{y}$ is the predicted result. Since its introduction, the XGBoost model has been widely used in the prediction of oil prices (Costa et al., 2021) and energy usage (Feng et al., 2021). However, to date, the XGBoost model has not been applied to MSW generation prediction. Similar to RF, the number of integrated CARTs (Ntrees) in XGBoost has a great influence on the prediction performance. Therefore, to increase the model's performance in predicting MSW generation, it is necessary to optimize this hyper-parameter. In Results, we also use the grid search method to evaluate candidate values of this hyper-parameter and obtain the optimal model structure.

K-Nearest Neighbor
The KNN algorithm is a non-parametric learning method first proposed by Cover and Hart (1967). Since its introduction, it has been widely used in regression and classification because of its simple and intuitive mathematical form (Wu et al., 2008). It is essentially a supervised learning technique that measures the similarity between a test sample and its K nearest training samples (Zheng et al., 2020). Here, K is a user-defined number, normally an odd number, and the similarity is measured by the commonly used Euclidean distance. In classification, the test sample is assigned the most frequent class among these training samples; in regression, the mean value of the K nearest training samples is taken as the predicted value. The Euclidean distance is expressed in Eq. 3:
$$d(x, x') = \sqrt{\sum_{j=1}^{p} (x_j - x'_j)^2} \quad (3)$$
where $x$ and $x'$ are the feature vectors of two samples and $p$ is the number of features. One drawback of the KNN approach is that the number K, a hyperparameter, must be selected in advance, and it greatly influences which nearest samples are used (Wu et al., 2008; Zheng et al., 2020). To mitigate this drawback, in the following section we limit K to positive integers between 1 and 30 and evaluate each value with 10-fold cross-validation.
Several studies have applied the KNN approach to the prediction of MSW. For example, Abbasi and El Hanandeh (2016) first attempted to evaluate the ability of KNN to forecast MSW generation. They concluded that KNN can give good prediction performance and may be used to establish forecasting models that provide accurate and reliable MSW generation predictions. Nguyen et al. (2021) predicted MSW production in Vietnam with an R-square of over 0.96, which indicated that more than 96% of the variation in MSW production was explained by the KNN model.
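KNN regression as described above, with Euclidean distance and the mean of the K nearest targets as the prediction, can be sketched as follows. This is a plain-Python illustration rather than the scikit-learn model used in this study; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to `query`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / k
```

Because the distance sums squared differences over all features, a feature measured in large units dominates the neighbor ranking, which is the motivation for normalizing the inputs before fitting KNN.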

Artificial Neural Network
The ANN model is a computational system composed of multiple layers of neurons (input-hidden-output) (Al-Dahidi et al., 2019). This model is widely used in waste management because of its strong fault tolerance and its ability to describe the complex relationships between variables in a multivariate system (Abbasi and El Hanandeh, 2016; Mehrdad et al., 2021; Nguyen et al., 2021; Niu et al., 2021). The deep neural network is a branch of ANN based on the perceptron model. Indeed, an ANN model with multiple hidden layers is called a DNN since it has to train and process through multiple layers (Liu et al., 2017). The structure of a DNN also includes an input layer, hidden layers, and an output layer. In general, the structures of DNN and ANN are similar, and their training algorithms do not differ. However, studies have shown that DNN tends to provide better performance and accuracy than conventional ANN models.
In this paper, a DNN with four layers of structure is constructed, namely the input layer, the first hidden layer, the second hidden layer and the output layer with one neuron. The number of neurons in the hidden layer has a great influence on the prediction performance of DNN. The smaller the number of neurons, the more likely it is to lead to insufficient fitting. On the contrary, an excessive number of neurons may lead to overfitting. Therefore, selecting the appropriate number of neurons for DNN is also one of the bases to improve the model performance. In this paper, the number of neurons in the first hidden layer (Nh1) and the number of neurons in the second hidden layer (Nh2) are optimized to gain better results. Specifically, we first specify the numerical space of the number of neurons, and then test on the train and test samples, taking the optimal result as the optimal network structure.
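The search space for the two hidden layers can be enumerated directly. The sketch below assumes the range reported in the Tuning section (Nh1 from 16 to 240 in steps of 16, with Nh2 fixed at half of Nh1) and assumes that both endpoints are included; the function name is hypothetical.

```python
def dnn_candidates(start=16, stop=240, step=16):
    """Return (Nh1, Nh2) pairs with Nh2 fixed at half of Nh1."""
    return [(nh1, nh1 // 2) for nh1 in range(start, stop + 1, step)]


for nh1, nh2 in dnn_candidates():
    # Each pair defines one network: input -> Nh1 -> Nh2 -> 1 output neuron.
    print(nh1, nh2)
```

Tying Nh2 to Nh1 keeps the search one-dimensional, so each candidate network only needs to be trained and scored once per preprocessing strategy.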

Data Collection
In this paper, we aim to construct an ML-based prediction model of MSW production, which is the response variable in all ML models. However, because there are currently no relevant statistics on MSW production in China, we use the MSW removal volume as a proxy indicator (Niu et al., 2021; Namlis and Komilis, 2019). More specifically, we obtained these annual statistics for all provinces in mainland China from 2008 to 2019 to support our research.
The input variables used to predict MSW production are collected from the provincial panel databases of the China Statistical Yearbook 2008-2019. We obtained nine diverse socioeconomic factors related to MSW production, covering the regional economic development level (e.g., regional GDP, population density, per capita disposable income), industrial structure (e.g., wholesale and retail values added), and waste generation characteristics. Table 1 reports the variable definitions and descriptive statistics. As plotted in Figure 3, the skewness and kurtosis differ noticeably across variables. To mitigate the influence of these differences on the prediction of MSW production, we employ three different data preprocessing methods and explore the models' performance under each of them in the following sub-sections.

Data Preprocessing and Re-Sampling
The preprocessing methods adopted include linear normalization (Range) and standard deviation normalization (Scale), as shown in Eq. 4 and Eq. 5, respectively. For ML models (such as KNN) that need to calculate the distance between samples, different orders of magnitude between variables greatly affect the performance of the model. We retained the original input data (Raw) and applied the two normalization strategies, Range and Scale, to reduce the influence of the data's dimensions and skewness on the predictions. Thus, the results of the three preprocessing methods are comparable.
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (4)$$
where $x_{\min}$ represents the minimum value of a variable and $x_{\max}$ represents its maximum value.
$$x' = \frac{x - \bar{x}}{\sigma} \quad (5)$$
where $\bar{x}$ represents the mean and $\sigma^2$ the variance of each variable.
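The two normalization strategies can be written as short per-variable functions. This is a minimal sketch in plain Python; it assumes the population standard deviation for Scale (the source does not say whether the population or sample form is used), and it does not handle constant columns, for which both denominators would be zero.

```python
from statistics import mean, pstdev


def range_normalize(values):
    """Range strategy: map a variable linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def scale_normalize(values):
    """Scale strategy: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # population form is an assumption
    return [(v - mu) / sigma for v in values]
```

To avoid leaking information from the test data, the minimum, maximum, mean, and standard deviation would normally be computed on the training portion only and then reused for the test portion.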
To minimize the deviation caused by sampling and to prevent the model from over-fitting, we adopted 10-fold cross-validation as the resampling technique: in each fold, a random subset of the input data serves as the training set, and the remaining data are used as the test set to assess the generalization ability of the algorithms.
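The 10-fold resampling scheme can be sketched as an index generator in plain Python. The function name, the shuffling seed, and the way samples are dealt into folds are illustrative assumptions, not details taken from this study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx
```

A model is fitted on `train_idx` and scored on `test_idx` in each round, and the k scores are averaged to compare hyper-parameter settings.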

Metrics of the Model
To evaluate the performance of each machine learning algorithm, we use three metrics: the MAE, the RMSE, and the coefficient of determination ($R^2$) (Chai et al., 2021; Nguyen et al., 2021). These measurements are formulated as Eqs 6-8:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right| \quad (6)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2} \quad (7)$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (8)$$
where $n$ is the number of samples, $x_i$ is the response predicted by the model, $y_i$ is the actual value of the response, and $\bar{y}$ is the mean of the actual values.
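The three metrics can be computed directly from paired lists of actual and predicted values. The sketch below is plain Python and assumes the conventional definition of the coefficient of determination, in which the total sum of squares is taken around the mean of the actual values; the function names are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - x) for y, x in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error."""
    squared = sum((y - x) ** 2 for y, x in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination (assumes the conventional definition)."""
    y_bar = sum(actual) / len(actual)
    ss_res = sum((y - x) ** 2 for y, x in zip(actual, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot
```

MAE and RMSE are in the units of the target, so they change with the preprocessing of the response, whereas R² is unitless and easier to compare across studies.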

Model Interpretation
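Model interpretability is a major challenge in applying ML methods and has not received enough attention in ML-based MSW forecasting research. To improve the interpretability of the machine learning models, this paper employs the SHAP method, which assigns each input variable a value reflecting its importance to the prediction (Lundberg and Lee, 2017). The SHAP value of feature $i$ is its Shapley value, given in Eq. 9:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right] \quad (9)$$
where $F$ is the set of all features, $S$ is a subset of features that excludes $i$, and $f_S(x_S)$ is the model output when only the features in $S$ are known.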
However, a major limitation of Eq. 9 is that the computation cost grows exponentially as the number of features (socio-economic factors) increases. To solve this problem, Lundberg et al. (2020) proposed a computationally tractable explanation method, TreeExplainer, for decision tree-based ML models such as RF. The TreeExplainer method makes it much more efficient to calculate a feature's SHAP value both locally and globally (Ayoub et al., 2021).
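The exponential cost mentioned above is easy to see in a brute-force implementation of Shapley values, which enumerates every subset of the remaining features for each feature. The sketch below is plain Python and is not TreeExplainer; `value_fn` and the toy model are hypothetical stand-ins for "the model output when only the features in a subset are known".

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every feature subset."""
    players = range(n_features)
    phi = []
    for i in players:
        others = [j for j in players if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi


def toy_value(subset):
    """Hypothetical model: additive effects plus an interaction of 0 and 1."""
    effects = {0: 3.0, 1: 1.0, 2: 0.5}
    value = sum(effects[j] for j in subset)
    if 0 in subset and 1 in subset:
        value += 2.0
    return value


print(shapley_values(toy_value, 3))
```

Each feature requires 2^(n-1) subsets, so nine features already mean 256 subsets per feature, and every additional feature doubles the work; the interaction term in the toy model is split equally between the two features involved.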
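SHAP combines optimal allocation with local explanations using the classic Shapley values. It helps users to trust the predictive models by showing not only what the prediction is but also why and how it is made (Ayoub et al., 2021). The SHAP interaction values can be calculated as the difference between the Shapley values of factor $i$ with and without factor $j$, as in Eq. 10:
$$\Phi_{i,j} = \sum_{S \subseteq F \setminus \{i,j\}} \frac{|S|!\,(|F| - |S| - 2)!}{2\,(|F| - 1)!}\, \nabla_{ij}(S), \qquad \nabla_{ij}(S) = \left[ f_{S \cup \{i,j\}} - f_{S \cup \{j\}} \right] - \left[ f_{S \cup \{i\}} - f_S \right] \quad (10)$$
where $i \neq j$, $F$ is the set of all features, $S$ is a subset of features that excludes $i$ and $j$, and $f_S$ is the model output when only the features in $S$ are known. Because of this advantage, we employ SHAP to explain the RF model, which is based on decision trees. Compared with existing methods, SHAP can reflect the influence of the features in each sample and show whether that influence is positive or negative, thereby improving the explainability of the model output.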

Comparison of Model Results
The programming environment used in this study is Python (version 3.8.3) with the additional support packages scikit-learn (version 0.24.1) and TensorFlow (version 2.2.2), which are used to run the ML algorithms.

Tuning
In this section, parameters of machine learning models are tuned, excluding multiple linear regression approach because it doesn't involve any hyper-parameters. Specific adjustment for parameters is shown in Table 2.
In the tuning process of SVR, we apply the three aforementioned data preprocessing strategies (Raw, Range, and Scale) in turn. As shown in Table 3, in the Raw strategy, which retains the original form of the input data, the penalty parameter (C) varies from 1 to 4000, compared with 0.01-10 in the Range strategy. The precision parameter (ε) lies between 0.0001 and 0.001 in the Range and Scale strategies, compared with an interval from 0 to 5000 in the Raw strategy. The kernel width (c) does not show any differences among the three strategies. The Range and Scale strategies effectively normalize and rescale the distributions of the input variables. Scaled and Auto denote the values of c given by Eq. 11 and Eq. 12, respectively:
$$c_{\text{Scaled}} = \frac{1}{N_S \cdot S^2} \quad (11)$$
$$c_{\text{Auto}} = \frac{1}{N_S} \quad (12)$$
where $N_S$ represents the number of sample features and $S^2$ represents the sample variance. The optimization results are shown in Figure 4.
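The two automatic settings of the kernel width can be computed as below. This is a sketch under an assumption: Scaled and Auto are taken to correspond to scikit-learn's `gamma="scale"` and `gamma="auto"` options, where the variance is computed over all entries of the input matrix. The function names are hypothetical.

```python
from statistics import pvariance


def kernel_width_scaled(X):
    """1 / (N_S * S^2), with S^2 the variance over all entries of X (assumed)."""
    n_features = len(X[0])
    flat = [v for row in X for v in row]
    return 1.0 / (n_features * pvariance(flat))


def kernel_width_auto(X):
    """1 / N_S, which depends only on the number of features."""
    return 1.0 / len(X[0])
```

Under this reading, the Scaled value adapts to the spread of the inputs while the Auto value does not, so the two settings coincide only when the variance of the inputs is one.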
The hyper-parameters in the other ML models are also tuned. For RF, the number of variables tried at each split (Nfeatures) is set to positive integers between 1 and 9, given the nine input variables in this paper. The forest size (Ntree) is set to positive integers between 50 and 400. The optimization results of the hyper-parameters are shown in Figure 5. In Figures 4 and 5, the redder the color, the higher the R² of the parameter combination (and therefore the better the prediction), and vice versa. For KNN, the number of neighbors K is set to a positive integer between 1 and 29. For XGBoost, the number of trees (Ntree) is set to 23 positive integers between 50 and 490. For DNN, the number of neurons in the first hidden layer (Nh1) is set to a positive integer between 16 and 240 in steps of 16, and the number of neurons in the second hidden layer (Nh2) is set to one half of the number in the first hidden layer. Moreover, the Adam method is used as the optimizer, MAE is set as the loss function, and the maximum number of epochs is set to 200. Meanwhile, to prevent over-fitting of the DNN, the EarlyStop mechanism is introduced, with the minimum learning rate set to 0.003 and the tolerance set to 20. The hyper-parameter selection results of KNN, XGBoost, and DNN are shown in Figure 6. The hyper-parameters adopted by each method are shown in Table 4.
Figure 7 presents the prediction performance of the different ML models under the three preprocessing strategies. Several findings can be drawn from the comparison among models. First, the prediction performance of MLR is the worst among all the methods because it involves no hyper-parameters and hence no corresponding adjustments. Second, the overall performances of SVR and KNN are similar, but the prediction ability of SVR is slightly higher than that of KNN except under Scale preprocessing. Building an SVR model normally requires a more complex process than KNN: across the different forms of input data, KNN only needs one hyper-parameter to be adjusted, which requires less work than SVR. Third, the RF and XGBoost models present significant and similar advantages in predicting MSW production compared with MLR, SVR, and KNN according to R². Fourth, the DNN has the best predictive performance among all the algorithms.

Model Application and Generation Ability
In this study, the RF and DNN models showed high R² values (> 0.9) under all preprocessing methods. This means that the developed ML models had good explanatory power and were not over-fitted or over-trained. Compared with the ML methods for MSW prediction developed in earlier studies, our results were significantly better in prediction accuracy. For example, Niu et al. (2021) developed LSTM and ANN models for predicting MSW generation, and during the testing phase the R² values were 0.92 and 0.74, respectively (Table 5). In addition, a DNN model with a predictive performance (R²) of 0.9 has been reported for MSW production projections in Vietnam. According to Kumar et al. (2018) and Kannangara et al. (2018), the ANN, SVM, and other ML models for predicting MSW generation showed R² values even lower than 0.8. Thus, the machine learning model developed in this paper promotes the effective prediction of MSW production.

SHAP Analysis
Overall Analysis
Figure 8 shows the SHAP summary plot, which orders features by their importance in predicting MSW production. Specifically, a higher SHAP value of a feature indicates a higher-ranked importance to the MSW production volume. For example, the difference in the region's GDP has the greatest impact on the model's prediction of MSW production. This is likely because waste production is highly related to household wealth, which directly influences daily consumption and the potential production of MSW (Malinauskaite et al., 2017). Moreover, higher values of this feature result in higher SHAP values, which correspond to a higher output amount of MSW.
In addition, the industry structure has a great influence on MSW production because of its indirect impact on citizens' consumption. For instance, a higher added value of wholesale and retail trade indicates higher production of MSW compared with other industries (e.g., transportation, warehousing, and postal services industries). Some studies have argued that consumption patterns and population increase are important factors that contribute to MSW production in developing countries (Liu et al., 2019; Nguyen et al., 2021). Besides, the urban population also shows a significant impact on MSW production because it drives the total amount of MSW produced. In contrast, other socio-economic features have a relatively insignificant impact on MSW in China. In the following, we analyze the dependency among these three features to discover the generation mode of MSW in China.
Figure 9 plots the relationship between a feature and its SHAP value, dependent on another feature, in the RF model. We select Nup and InWAR as the features to discuss and identify how they vary as InGDP changes. As shown in Figures 9A,B, the red points represent higher values of InGDP and the blue points represent lower ones. Figure 9A plots the moderating effect of GDP on the impact of the urban population on MSW production. It shows that when both Nup and InGDP are low, the SHAP value of Nup is below zero, which indicates that Nup negatively impacts MSW production under these circumstances. In other words, in less developed regions the impact of the urban population on MSW production may be undermined even though the local urban population increases. In contrast, with economic growth, an increase in the urban population promotes the production of MSW, as indicated by the red points with positive SHAP values in this figure.
Figure 9B reflects the interaction between GDP and the added value of wholesale and retail industries on MSW production. For example, before InWAR reaches 600 billion, its SHAP value is always negative. However, once InWAR exceeds 600 billion yuan as total GDP increases, the added value of wholesale and retail trade plays a positive role in promoting the production of MSW. It means that if the added value of the wholesale and retail industry remains at a low level (less than 6,000 billion yuan), these industries have little effect on MSW production. However, if the added value exceeds the threshold of 6,000 billion yuan, regional GDP strengthens the impact of the added value of the WAR industry. Correspondingly, the SHAP value of InWAR indicates a significant promotion of MSW production.

CONCLUSION
To address the prediction in the production of municipal solid waste and support the WtE system design, we mainly constructed the MSW prediction method in China by using machine learning algorithms. In the comparisons of six ML models, we concentrated our attention on the predictive performances of each algorithm, particularly, by introducing three preprocessing strategies. As a result, SVR had the lowest hyperparameter consistency under different preprocessing strategies. Among the six ML methods established in this study, DNN has the best predictive ability, with an R-square of over 0.97 under all three data preprocessing strategies. The prediction performance of the machine learning methods developed in this paper is also significantly higher than the current standard (MLR) in China.
In addition, we find that the form of the input data and the hyper-parameters had a great influence on the models' performances. Specifically, regional GDP, urban population, and the added value of the wholesale and retail industries are the most important variables that affect MSW production in different provinces of China. With the development of the urban economy, an increase in the urban population promotes the generation of municipal solid waste. Conversely, in less developed regions, an increase in the urban population reduces the generation of MSW. Besides, the different stages of development of the wholesale and retail industries also affect the production of MSW: in less developed regions, a lower added value of the wholesale and retail industries indicates a weak impact on MSW production, and vice versa. Our findings provide a reliable forecasting method for stakeholders. With better predictions of MSW production, national and local policymakers could more effectively implement governance policies that promote a friendly residential environment and urban sustainability. However, given data from lower administrative levels, even more powerful predictive models could be built. Future studies can pursue this direction to achieve more reliable and accurate results.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.