Forecasting of Steam Coal Price Based on Robust Regularized Kernel Regression and Empirical Mode Decomposition

Aiming at the difficulty of modeling the nonlinear relations in the steam coal dataset, this article proposes a forecasting method for the price of steam coal based on robust regularized kernel regression and empirical mode decomposition. A robust regularized kernel regression model is constructed from a polynomial kernel function, a robust loss function, and an L2 regular term. The polynomial kernel function does not depend sensitively on kernel parameters and can mine global rules in the dataset, which improves the forecasting stability of the kernel model. The method maps the features into a high-dimensional space with the polynomial kernel function, transforming the nonlinear law in the original feature space into a linear law in the high-dimensional space, so that a linear model can learn the law there. The Huber loss function is selected to reduce the influence of abnormal noise in the dataset on model performance, and the L2 regular term is used to reduce the risk of overfitting. A combined model based on empirical mode decomposition (EMD) and the autoregressive integrated moving average (ARIMA) model compensates for the error of the robust regularized kernel regression model, making up for the limitations of a single forecasting model. Finally, the proposed model is verified on the steam coal dataset: evaluated by indexes such as RMSE, MAE, and mean absolute percentage error, it achieves the best index values among all contrast models.


INTRODUCTION
Accurate forecasting of the steam coal price can provide a basis for enterprises related to steam coal to formulate procurement plans. The steam coal price indicates the overall state of supply and demand in the steam coal market, and China is a large consumer of coal (Xiong and Xu, 2021; Wang and Du, 2020). Forecasting the steam coal price accurately helps analyze the steam coal market, grasp the laws implied in it, and improve the market's efficiency.
In recent years, many different methods for coal price forecasting have been proposed. Time series models can mine the implicit laws in a time series. Matyjaszek et al. removed the effect of abnormal price fluctuations on the forecasting model by using transgenic time series (Matyjaszek et al., 2019). Ji et al. effectively improved forecasting accuracy by combining the ARIMA model and a neural network model (Ji et al., 2019). Wu et al. decomposed the price series into several components, used the ARIMA model and the SBL model for coal price forecasting, and added up the forecasted values of all the components as the final forecast; compared to the contrast models adopted, this approach effectively improves forecasting precision (Wu et al., 2019). Chai et al. combined the STL decomposition method with the ETS model, and their experiments show the best forecasting performance compared with benchmark models and neural networks (Chai et al., 2021). However, it is difficult for a time series model to learn the implicit nonlinear laws in the data; such a model is sensitive to abnormal values and considers only a single variable rather than other influence factors.
A neural network model, built on the interconnection of artificial neurons, can learn the nonlinear laws in the data. Alameer et al. effectively improved the forecasting accuracy of coal price based on LSTM and DNN models (Alameer et al., 2020). Lu et al. used empirical mode decomposition to preprocess the original dataset and then chose a radial basis function neural network model for training and forecasting; the results show higher stability (Lu et al., 2020). Yang et al. adopted an improved whale optimization algorithm to optimize a combined model of improved ensemble empirical mode decomposition and LSTM, which outperforms the other reference models (Yang et al., 2020). Zhang et al. decomposed the original data series by the multi-resolution singular value decomposition method and forecasted the coal price with an MFO-optimized ELM model; experiments show the forecasting performance of the proposed model was superior to that of the contrast models. However, the neural network model is a black box that is difficult to interpret.
The steam coal market is a complex nonlinear system involving factors such as the economy, steam coal transportation, steam coal supply, and steam coal demand. These influence factors cover a wide range, include many feature series, and contain some noisy data. The method proposed here improves model interpretability by using a linear model and reduces the adverse impact of noisy data on the forecasting model by using the Huber loss function (Gupta et al., 2020). The kernel function is used to mine the implicit nonlinear laws in the steam coal data (Vu et al., 2019; Ye et al., 2021). A combined model can improve performance by drawing on the advantages of its sub-models (Wang et al., 1210; Zhou et al., 2019; Wang et al., 2020a; Wang et al., 2020b; Qiao et al., 2021; Zhang et al., 2021). The method decomposes the forecasting error of the base model into multiple modal components with EMD (Yu et al., 2008; Xu et al., 2019; Xia and Wang, 2020), builds an ARIMA model (Conejo et al., 2005; Karabiber and Xydis, 2019) for each modal component, and adds up the forecasted values of all components as an error compensation for the original forecasting model.
To address the difficulty of modeling the nonlinear relations in the steam coal dataset, this article proposes a forecasting method for the price of steam coal based on robust regularized kernel regression and empirical mode decomposition. The second part introduces the theoretical background of the algorithms used; the third part states the data preprocessing steps, the selection of features, and the whole process of model training and forecasting; the fourth part shows the model comparison and experimental results; and the fifth part contains the conclusion and outlook.

Huber-Ridge Model
The Huber function (Huber et al., 1992) is highly robust and can effectively reduce the negative influence of abnormal data on model performance. The Huber loss function is shown in Eq. 1, where u is the residual and M is the threshold of the Huber function. The Huber function penalizes residuals larger than the threshold only linearly rather than quadratically, which effectively lowers the influence of abnormal sample points on model training.
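Eq. 1 is not rendered in the text above; a reconstruction of the standard Huber loss, using the residual u and threshold M defined in the text, is:

```latex
L_M(u) =
\begin{cases}
\dfrac{1}{2}\,u^2, & |u| \le M, \\[4pt]
M\,|u| - \dfrac{1}{2}\,M^2, & |u| > M.
\end{cases}
```

For |u| > M the penalty grows only linearly in the residual, which is what bounds the influence of outlying samples.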
The Ridge model adds an L2 penalty term to the objective function of the linear regression model. The objective function of the model is shown in Eq. 2, where X is the feature matrix, ω is the weight coefficient vector, Y is the forecasted target quantity, ‖ω‖₂² is the L2 penalty term, and α_l2 is the regular coefficient.
The L2 regular term shrinks the weights of features that are adverse to model forecasting toward 0 in order to reduce the impact of weakly correlated features. When the regular coefficient α_l2 is large, the L2 regular term drives more entries of the weight vector toward 0, screening out the main features and mitigating overfitting to some extent.
The Huber loss function and L2 regular term are combined to construct the Huber-Ridge model (Owen, 2006), improving the model robustness and lowering the overfitting risk. Its objective function is shown in Eq. 3:
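Eqs. 2 and 3 are likewise not rendered above; consistent with the notation in the text (and with the Huber loss L_M of Eq. 1), standard reconstructions of the two objectives are:

```latex
% Ridge (Eq. 2): squared loss plus L2 penalty
\min_{\omega}\; \lVert Y - X\omega \rVert_2^2 \;+\; \alpha_{l2}\,\lVert \omega \rVert_2^2

% Huber-Ridge (Eq. 3): Huber loss plus L2 penalty
\min_{\omega}\; \sum_{i=1}^{n} L_M\!\left(y_i - x_i^{\top}\omega\right) \;+\; \alpha_{l2}\,\lVert \omega \rVert_2^2
```

The only change from Eq. 2 to Eq. 3 is the replacement of the squared loss by the Huber loss, which preserves the Ridge shrinkage while bounding the influence of abnormal samples.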

Polynomial Kernel Huber-Ridge Model
The T.M. Cover theorem (Cover, 1965) points out that data in a high-dimensional space can exhibit linear laws more easily. The kernel function maps vectors from the low-dimensional feature space into a high-dimensional feature space, transforming the nonlinear law of the low-dimensional space into a linear law in the high-dimensional space; a linear model can then learn the linear law in the high-dimensional space and thereby indirectly learn the nonlinear law of the original feature space. Because the high-dimensional feature space has a very high dimensionality, the curse of dimensionality may arise if the model is fitted directly in that space. The kernel function effectively solves this problem: it expresses the inner product in the high-dimensional space as a computation on the low-dimensional vectors, avoiding explicit inner product calculation in the high-dimensional space and greatly reducing the computation of the model. The regularized risk functions have a unified expression (Schölkopf et al., 2001), as shown in Eq. 4. The kernel function is introduced into the Huber-Ridge model (Jianke Zhu et al., 2008) so that the model can learn the implicit nonlinear laws in the data. The objective function of the kernel Huber-Ridge model is shown in Eq. 6, where y_i is the actual value, f(x) is the forecasted value, and u is the forecasting error. Eqs. 9-12 can be obtained by taking the partial derivatives with respect to w and b, where I_0 is a diagonal matrix of shape (n, n) whose elements in the region A equal 1 and whose remaining elements equal 0.
The basic Newton method is used to iteratively update w and b, as shown in Eq. 13, where ∇ is the gradient as shown in Eq. 14, γ is the step length (usually with a value of 1), and H⁻¹ is the inverse of the Hessian matrix shown in Eq. 15. Simplifying Eq. 13 gives the final computational equation of the objective function of the kernel Huber-Ridge model, as shown in Eq. 16. The polynomial kernel is a commonly used kernel function and is shown in Eq. 17, where x = (x₁, x₂, …, xₙ) is the feature vector and d is the kernel parameter. Substituting the polynomial kernel function (Eq. 17) into Eq. 16 yields the final computational equation of the objective function of the polynomial kernel Huber-Ridge model.
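As an illustration of the Huber-Ridge idea (not the authors' kernel-trick Newton implementation), the model can be sketched in NumPy with an explicit degree-2 polynomial feature map and iteratively reweighted least squares, which reproduces the Huber influence function; all function names and the synthetic data here are hypothetical:

```python
import numpy as np

def huber_ridge_fit(Phi, y, M=1.0, alpha=1e-4, n_iter=50):
    """Fit Huber-loss ridge regression by iteratively reweighted least
    squares (IRLS): residuals beyond the threshold M get weight M/|r|,
    reproducing the linear (bounded) Huber penalty for outliers."""
    n, p = Phi.shape
    # Initialize with a plain ridge solution.
    beta = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(p), Phi.T @ y)
    for _ in range(n_iter):
        r = y - Phi @ beta
        w = np.where(np.abs(r) <= M, 1.0, M / np.maximum(np.abs(r), 1e-12))
        W = Phi * w[:, None]                       # row-weighted design matrix
        beta = np.linalg.solve(Phi.T @ W + alpha * np.eye(p), W.T @ y)
    return beta

# Quadratic data with a few gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = 2.0 + 3.0 * x + 1.5 * x**2 + 0.05 * rng.standard_normal(60)
y[::12] += 10.0                                    # inject abnormal samples
Phi = np.column_stack([np.ones_like(x), x, x**2])  # explicit degree-2 map
beta = huber_ridge_fit(Phi, y)                     # close to (2, 3, 1.5)
```

With the kernel trick, Phi would never be formed explicitly; the explicit feature map is only practical for low polynomial degrees and few features, which is why the paper works with the kernel form instead.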

EMD Model
Empirical mode decomposition is a signal decomposition technique that decomposes the original signal into a series of components called intrinsic mode functions. It is often used to handle time series data, decomposing the original series into different components to explore the implicit laws in the data. An intrinsic mode function must meet the two conditions below: 1. Over the data interval, the number of extrema and the number of zero crossings differ by at most one. 2. The mean of the upper envelope and the lower envelope is zero.
The EMD model is adaptive: it decomposes a time series without a prespecified number of components until the stopping criterion is met. The relationship between the original series and the decomposed components is shown in Eq. 18, where X(t) is the original time series, the summation term is the sum of the components imf_i (i = 1, …, n), and r is the residual. When the residual series becomes a monotonic function, the decomposition stops. The steps of empirical mode decomposition are as follows:
STEP 1: Identify all the maximum and minimum points in the time series, and fit the upper envelope e_u and the lower envelope e_l by cubic spline interpolation through the maxima and minima, respectively.
STEP 2: Calculate the mean of the upper envelope e_u and the lower envelope e_l to obtain the mean envelope e_mean = (e_u + e_l)/2.
STEP 3: Calculate the difference between the original series X(t) and the mean envelope e_mean to obtain an intermediate series.
STEP 4: Check whether the intermediate series satisfies the two conditions of an intrinsic mode function; if so, take it as the component imf_i, otherwise repeat steps 1-3 on the intermediate series.
STEP 5: Subtract the component imf_i from the original time series X(t) and execute steps 1-4 again. When the stopping criterion is satisfied, the decomposition ends.
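The sifting steps above can be sketched as follows. This is a simplified illustration, not a production EMD implementation: the endpoint handling (endpoints reused as spline knots) and the stopping rules are cruder than in established EMD libraries, and all names are hypothetical:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def emd(x, max_imfs=6, max_sift=20):
    """Minimal EMD sifting sketch: repeatedly subtract the mean of the
    upper/lower cubic-spline envelopes until few extrema remain, then
    peel off the component and continue on the residual."""
    t = np.arange(len(x), dtype=float)
    residual = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        h = residual.copy()
        for _ in range(max_sift):                      # sifting iterations
            maxima = argrelextrema(h, np.greater)[0]
            minima = argrelextrema(h, np.less)[0]
            if len(maxima) < 2 or len(minima) < 2:
                break
            up = np.r_[0, maxima, len(h) - 1]          # knots incl. endpoints
            lo = np.r_[0, minima, len(h) - 1]
            e_u = CubicSpline(t[up], h[up])(t)         # upper envelope
            e_l = CubicSpline(t[lo], h[lo])(t)         # lower envelope
            h = h - (e_u + e_l) / 2.0                  # subtract mean envelope
        imfs.append(h)
        residual = residual - h
        n_ext = len(argrelextrema(residual, np.greater)[0]) \
              + len(argrelextrema(residual, np.less)[0])
        if n_ext < 2:                                  # near-monotonic: stop
            break
    return np.array(imfs), residual

t = np.linspace(0.0, 1.0, 500)
x = np.sin(2 * np.pi * 25 * t) + 0.5 * np.sin(2 * np.pi * 3 * t) + t
imfs, res = emd(x)   # fast oscillation, slow oscillation, trend residual
```

By construction the components and the residual sum back to the original series exactly, mirroring the additive relationship of Eq. 18.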

ARIMA Model
The autoregressive integrated moving average (ARIMA) model is defined in Eq. 19, where y_t and ε_t are the actual value and the residual at time t, respectively, φ = (φ₁, φ₂, …, φ_p) and θ = (θ₁, θ₂, …, θ_q) are weight vectors, and p and q are the model orders. The historical values of the variable and the historical white noise errors are used to forecast the current value.
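Eq. 19 is not rendered above; the standard ARMA(p, q) form it refers to, applied after d-th-order differencing in ARIMA, is reconstructed below (whether the paper's exact formulation includes the constant term c is an assumption):

```latex
y_t = c + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}
```

The first sum is the autoregressive part over the p most recent values, and the second is the moving average part over the q most recent white noise errors.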
The ARIMA model requires stationary data as a prerequisite. Non-stationary data can be handled by combining the ARIMA model with different transformation methods (Gilbert, 2005). The ARIMA model has three parameters, (p, d, q), in which d is the differencing order of the data series.

Evaluation Indexes
Model performance is evaluated by the mean absolute error (MAE), defined in Eq. 20; the root-mean-square error (RMSE), defined in Eq. 21; and the mean absolute percentage error (MAPE), defined in Eq. 22, where n is the number of samples in the verification set, ŷ_i is the forecasted value of the model, and y_i is the true value. The closer the MAE, RMSE, and MAPE values are to 0, the better the model performance.
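Assuming the conventional definitions of these indexes (Eqs. 20-22 are not rendered in the text, and whether MAPE is reported as a fraction or a percentage is an assumption), they can be written directly in NumPy:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error (Eq. 20)."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root-mean-square error (Eq. 21)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error (Eq. 22), in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])
print(mae(y_true, y_pred))   # 6.666...
print(rmse(y_true, y_pred))  # 8.164...
print(mape(y_true, y_pred))  # 5.0
```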

The Training and Forecasting Process of Polynomial Kernel Huber-Ridge-EMD-ARIMA Model
The training and forecasting process of the proposed polynomial kernel Huber-Ridge-EMD-ARIMA (PK-Huber-Ridge-EMD-ARIMA) forecasting framework is shown in Figure 1. Its steps are as follows. STEP 1: Data preprocessing and feature selection: after preprocessing, screen the correlated features according to the Pearson correlation coefficient and the Spearman correlation coefficient, and divide the data into training and test sets.

Data Description
The Qinhuangdao steam coal price data, namely the 4500-kilocalorie, 5000-kilocalorie, and 5500-kilocalorie steam coal exit prices of Qinhuangdao Port from Jan. 2017 to Jul. 2021, are used for the experimental study. The data are sampled once a week, i.e., at weekly frequency.

Data Preprocessing
The original steam coal price data and feature data suffer from missing values and inconsistent sampling frequencies, so they must be preprocessed before being used for model training and forecasting.

Table 1. Types of features and the corresponding feature sets:

Coal production: raw coal production of key state-owned coal mines, coal production of large coal enterprises, ..., and raw coal production of non-state-owned coal mines.
Coal supply: national coal import quantity, coal inventory and dispatching data of Qinhuangdao Port, cargo-ship ratio of four ports around the Bohai Sea, and historical data of total coal storage in five ports around the Bohai Sea.
Coal transportation: coal transportation quantity of the Daqin Line, coal sales volume of national key coal mines, ..., and daily average number of coal railway loading vehicles.
Coal consumption: national industrial power consumption, national social electricity consumption, power generation in coastal provinces, and coal consumption in the power grid.
Macroeconomic: coal futures price, added value of the national secondary industry, GDP, consumer price index, producer price index, and investment in fixed assets of the whole society.

The data preprocessing steps are as follows:
1. Unify the data sampling frequency: the model input requires a one-to-one relationship between the coal feature data and the coal price, that is, the same sampling frequency for both. When the sampling frequencies of the original coal price data and the related feature data are inconsistent, the data are unified to one sampling frequency: data sampled more frequently than the specified frequency are downsampled, and data sampled less frequently are upsampled. Daily data are reduced to weekly data; quarterly and monthly data are raised to weekly data, which creates missing values. The upsampled data are sorted in ascending order of date, and the missing values are then filled by linear interpolation.
2. Fill in the missing values: the original coal data contain some missing values and non-numerical entries, which need to be handled to make full use of the dataset. Missing values are filled by linear interpolation; non-numerical entries are deleted, and the resulting gaps are likewise filled by linear interpolation. The interpolation between points (x_0, y_0) and (x_k, y_k) is given in Eq. 23.
3. Standardize the dataset: there are dimensional differences between different types of data. To prevent dimensional effects from degrading model performance, the standardization equation (Eq. 24) transforms each data distribution into a standard distribution with mean 0 and variance 1. Here x_ki is the kth value of the ith feature index, X̄_i is the mean of the ith feature index, σ_i is its standard deviation, and n is its sample size.
4. Divide the training set and the test set: the training and test sets are not fixed but change dynamically. In the energy market, the influencing factors of energy indicators change over time (Liang et al., 2019). The coal data and related feature data within a sliding window of width k are used as the training dataset for forecasting the coal price at the next time point.
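A minimal sketch of the missing-value filling (Eq. 23) and z-score standardization (Eq. 24) steps, with hypothetical function names and toy price values:

```python
import numpy as np

def fill_linear(y):
    """Fill NaN gaps by linear interpolation between known points (Eq. 23)."""
    y = np.asarray(y, dtype=float).copy()
    idx = np.arange(len(y))
    known = ~np.isnan(y)
    y[~known] = np.interp(idx[~known], idx[known], y[known])
    return y

def standardize(x):
    """Z-score standardization (Eq. 24): mean 0, variance 1."""
    return (x - x.mean()) / x.std()

# Weekly price series with two missing observations.
weekly = fill_linear([580.0, np.nan, np.nan, 640.0, 655.0])
# → [580, 600, 620, 640, 655]
z = standardize(weekly)
```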

Feature Selection
Selecting comprehensive and relevant features can greatly improve the performance of the forecasting model. All feature data are arranged as a data matrix, and the optimal feature variables are chosen according to the feature type and the optimal time interval of each feature. The coal price pertains to many factors; the main feature types influencing the coal market price are listed in Table 1.
There are many initial feature indexes, so feature screening is needed. The feature variable at the same time point as the forecasted target does not necessarily have the highest correlation; it is necessary to find the optimal time interval, that is, the optimal delay order, for each feature variable. The optimal-delay linear features and the optimal-delay nonlinear features are screened by the Pearson correlation coefficient and the Spearman correlation coefficient, respectively.
The value range of the Pearson correlation coefficient r_p and the Spearman correlation coefficient r_s is [-1, 1]. The closer the absolute value of r_p or r_s is to 1, the stronger the correlation; the closer it is to 0, the weaker the correlation.
Given the corresponding threshold values of the Pearson correlation coefficient and Spearman correlation coefficient are r mp and r sp , respectively, the feature indexes whose correlation coefficient exceeds the threshold value r mp and r sp are screened. Given the maximum delay order of the feature is delay max , the parameter setting is shown in Table 2.
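The delay-order screening can be sketched as follows; the helper names, the synthetic data, and the tie-free Spearman rank computation are illustrative assumptions, not the authors' code:

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks
    (simplified: assumes no ties in the data)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def best_delay(feature, target, delay_max=8):
    """Return the delay order k in 1..delay_max whose Pearson |r|
    between feature(t) and target(t + k) is highest, with that r."""
    best = (1, 0.0)
    for k in range(1, delay_max + 1):
        r = np.corrcoef(feature[:-k], target[k:])[0, 1]
        if abs(r) > abs(best[1]):
            best = (k, r)
    return best

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = np.r_[np.zeros(3), x[:-3]] + 0.1 * rng.standard_normal(200)  # y lags x by 3
lag, r = best_delay(x, y)                 # recovers the true delay order 3
rs = spearman(x[:-lag], y[lag:])          # nonlinear (rank) correlation check
```

Features whose |r| at the best delay exceeds the thresholds r_mp (Pearson) or r_sp (Spearman) would then be kept, with the correlations recomputed at each forecast date as the sliding window advances.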
The chosen feature variables and steam coal price are input to the forecasting model mentioned, and the model outputs the steam coal price at the next time point. Table 3 and Table 4 show the selection of the optimal delay feature variable when forecasting the steam coal price on Jul. 6, 2021. The feature variable selected as per this method changes over time.

Experiment Result
In this article, the Lasso, Ridge, Huber-Ridge, PK-Huber-Ridge, and PK-Huber-Ridge-EMD-ARIMA models are compared. One-step-ahead forecasting is used for the empirical test.
The Qinhuangdao thermal coal data and feature data at the first 120 time points are used as the input variables of the forecasting model, and the model outputs the thermal coal price at the 121st time point.
The hyperparameter settings of the Lasso, Ridge, Huber-Ridge, PK-Huber-Ridge, and PK-Huber-Ridge-EMD-ARIMA models are shown in Table 5. Here, α_l1 is the coefficient of the L1 regular term; α_l2 is the coefficient of the L2 regular term; M is the threshold of the Huber loss function; d_k is the kernel parameter of the polynomial kernel and represents the order of the polynomial; p is the autoregressive order of the ARIMA model; d_a is the differencing order; and q is the moving average order. The BIC criterion (Burnham and Anderson, 2004) is used to select the optimal ARIMA hyperparameters p, d_a, and q.
The forecasting models are used to forecast the 4500-kilocalorie (Dataset 1), 5000-kilocalorie (Dataset 2), and 5500-kilocalorie (Dataset 3) steam coal price data of Qinhuangdao Port from March 17, 2020 to July 6, 2021. Table 6 and Figures 3-5 show the evaluation index results of the forecasting models.
Through the comparison of the experimental results of five thermal coal price forecasting models, the following conclusions can be obtained.
Compared with the single models, the proposed combined model has better forecasting performance. In the experiments on Dataset 1, Dataset 2, and Dataset 3, the forecasting performance of the PK-Huber-Ridge-EMD-ARIMA model is better than that of the Lasso, Ridge, Huber-Ridge, and PK-Huber-Ridge models. The thermal coal price dataset is complex, and the forecasting performance of a single model is limited; a combined model can better handle complex datasets. The PK-Huber-Ridge-EMD-ARIMA model adopts a decomposition-integration method together with a time series forecasting model to compensate for the error of the single model: the residual error of the model forecast is itself modeled to recover the hidden rules that the original single model did not learn.
Compared with the ordinary models, the robust kernel model has better performance. In the experiments on Dataset 1, Dataset 2, and Dataset 3, the forecasting performance of the PK-Huber-Ridge-EMD-ARIMA model is better than that of the Lasso, Ridge, and Huber-Ridge models. The thermal coal dataset contains nonlinear laws; the PK-Huber-Ridge-EMD-ARIMA model uses the polynomial kernel function to map the nonlinear features into a high-dimensional space so that the linear model can learn the nonlinear laws of the original feature space, further improving the forecasting performance.

CONCLUSION
To address the nonlinear laws in the steam coal dataset and the limitations of a single forecasting model, this article proposes a forecasting method for the price of steam coal based on robust regularized kernel regression and empirical mode decomposition. The robust regularized kernel regression model learns the nonlinear laws in the original data through the kernel function; the Huber loss function enhances the robustness of the forecasting model; and the L2 regular term lowers the risk of overfitting. A combined model based on EMD and ARIMA compensates for the error of the polynomial kernel Huber-Ridge model, further improving the forecasting performance. Compared to the Lasso, Ridge, Huber-Ridge, and PK-Huber-Ridge models, the proposed forecasting model (PK-Huber-Ridge-EMD-ARIMA) has the minimum MAE, RMSE, and MAPE.
The influence factors of the steam coal price are complex and easily affected by national policies. Quantifying policy factors and feeding them into the forecasting model for training and forecasting is the subject of future work.

DATA AVAILABILITY STATEMENT
The data analyzed in this study are subject to the following licenses/restrictions: The data used to support the findings of this study are currently under embargo, while the research findings are commercialized. Requests to access these datasets should be directed to tmz@csust.edu.cn.