Developing an XGBoost Regression Model for Predicting Young’s Modulus of Intact Sedimentary Rocks for the Stability of Surface and Subsurface Structures

Young’s modulus (E) is essential for predicting the behavior of materials under stress and plays an important role in the stability of surface and subsurface structures. E has a wide range of applications in mining, geology, civil engineering, etc.; for example, coal and metal mines, tunnels, foundations, slopes, bridges, buildings, drilling, etc. This study developed a novel machine learning regression model, namely an extreme gradient boosting (XGBoost) to predict the influences of four inputs such as uniaxial compressive strength in MPa; density in g/cm3; p-wave velocity (Vp) in m/s; and s-wave velocity in m/s on two outputs, namely static Young’s modulus (Es) in GPa; and dynamic Young’s modulus (Ed) in GPa. Using a series of basic statistical analysis tools, the accompanying strengths of each input and each output were systematically examined to classify the most prevailing and significant input parameters. Then, two other models i.e., multiple linear regression (MLR) and artificial neural network (ANN) were employed to predict Es and Ed. Next, multiple linear regression and ANN were compared with XGBoost. The original dataset was allocated as 70% for the training stage and 30% for the testing stage for each model. To improve the performance of the developed models, an iterative 10-fold cross-validation method was used. Therefore, based on the results XGBoost model has revealed the best performance with high accuracy (Es: correlation coefficient (R2) = 0.998; Ed: R2 = 0.999 in the training stage; Es: R2 = 0.997; Ed: R2 = 0.999 in the testing stage), root mean square error (RMSE) (Es: RMSE = 0.0652; Ed: RMSE = 0.0062 in the training stage; Es: RMSE = 0.071; Ed: RMSE = 0.027 in the testing stage), RMSE-standard deviation ratio (RSR) index value (Es: RSR = 0.00238; Ed: RSR = 0.00023 in the training stage; Es: RSR = 0.00304; Ed: RSR = 0.001 in the testing stage) and variance accounts for (VAF) (Es: VAF = 99.71; Ed: VAF = 99.99 in the training stage; Es: VAF = 99.83; Ed: VAF = 99.94 in the testing stage) compared to the other developed models in this study. Using a novel machine learning approach, this study was able to deliver substitute elucidations for predicting Es and Ed parameters with suitable accuracy and runtime.


INTRODUCTION
Young's modulus (E) is important for predicting the behavior of materials under load and plays a key part in the stability of surface and subsurface structures. E has a broad application in mining, geology, civil engineering, etc., i.e., coal and metal mines, tunnels, foundations, slopes, bridges, buildings, drilling, etc. Computation of accurate rock deformation properties, especially E is essential to the design of any rock engineering or rock mechanics project. Several researchers have studied the deformation and behavior of various types of rocks (Zhao et al., 2017;Rahimi and Nygaard, 2018;Davarpanah et al., 2019;Xiong et al., 2019). Generally, there are two common techniques, static and dynamic, employed to measure E. Static Young's modulus (E s ) is generally acquired as the digression of the stress-strain curve at 50% of the maximum strength of the rock core sample. The dynamic Young's modulus (E d ) can be determined if the density of the rock along with the velocities of compressional and shear waves is known. In rock engineering, the variation between E s and E d has been broadly investigated (Brotons et al., 2016). Normally, the value of E d is slightly greater than the E s studied by various researchers (Zhang, 2006;Kolesnikov, 2009). The ratio between E d and E s was calculated to range between 1 and 20 (Wang, 2000).
Typically, there are two common techniques, such as destructive and non-destructive, to estimate the strength and deformation of rocks. According to the recommended standards of the International Society of Rock Mechanics (ISRM) and the American Society for Testing Materials (ASTM), the use of destructive testing in the laboratory to directly estimate E is complex, time-consuming and expensive process. At the same time, sample preparation is quite difficult in the case of fragile, internally damaged, thin and highly foliated rocks (Jing et al., 2020). Thus, attention must be paid to the indirect evaluation of E through the use of rock index tests. Many researchers have established prediction models to overcome these shortcomings by employing soft computing methods such as artificial neural network (ANN), multiple regression analysis (MRA) and other novel machine learning approaches (Lindquist et al., 1994;Singh and Dubey, 2000;Tiryaki, 2008;OzcelikBayram et al., 2013;Abdi et al., 2018;Teymen and Mengüç, 2020;Cao et al., 2021;Yang et al., 2020;Duan et al., 2020). Waqas et al. used linear and nonlinear regression, regularization and ANFIS (using neuro-fuzzy inference system) to predict the E d of sedimentary rocks (Waqas and Ahmed, 2020). Abidi et al. proposed the ANN and MRA  Frontiers in Earth Science | www.frontiersin.org October 2021 | Volume 9 | Article 761990 3 (linear) methods for predictive modeling of E using input variables like porosity in %; dry density (γd) in g/cm 3 ; P-wave velocity (Vp) in km/s; and water absorption (Ab) in %. The results indicated that the ANN model outperformed the MRA (Abdi et al., 2018). Davarpanah et al. developed linear and nonlinear relationships between static and dynamic deformation parameters of various rocks and found strong correlations between them (Davarpanah et al., 2020). Aboutaleb et al. conducted non-destructive experiments with SRA (simple regression analysis), MRA, ANN and SVR (support vector regression) and found that ANN and SVR models were more accurate in predicting E d (Aboutaleb et al., 2018). Mahmoud et al. predicted the E s of sandstone using an ANN network with 409 data events in the training phase and 183 data events in the testing phase. The developed ANN model predicted E s with a high correlation coefficient (R 2 0.999) and minimum mean absolute percentage error (AAPE 0.98%) . Elkatatny developed an ANN network for predicting E d from the drilling parameters. The study showed encouraging results (Elkatatny, 2021). Elkatatny et al. was first to correlate E s prediction results from different models such as ANN, ANFIS and SVM. The established correlations improved the accuracy of the estimated E s . Cao et al. employed the novel approach of supervised machine learning, namely an extreme gradient boosting (XGBoost) combined with the firefly algorithm (FA) to predict E. The results showed that the proposed approach is suitable for predicting E.
Based on the above literature and to the best of author's knowledge, XGBoost machine learning method has rarely been used, especially in combination with ANN and MLR, for predictive modeling of E of rocks. Due to limitations of the conventional predictive methods, the prediction of E with machine learning approaches plays a key role in determining the accuracy of the corresponding data of tests performed in the laboratory. In this novel study, XGBoost is developed for predicting E s and E d using four input parameters, i.e., uniaxial compressive strength (UCS) in MPa; density in g/cm 3 ; p-wave velocity (Vp) in m/s; and s-wave velocity (Vs) in m/s, complimented with ANN and MLR. Then, the original dataset of 64 data points is split as 70% for the training stage and 30% for the testing stage. To improve the performance of the machine learning model, an iterative 10-fold cross-validation method is used.

Construction of a Dataset
Several multivariate parameters of intact sedimentary rocks (marlstone, sandstone, and limestone) are already reported (Moradian and Behnia, 2009) to have been used as inputs to predict the static Young's modulus (E s ) and dynamic Young's modulus (E d ), which include uniaxial compressive strength (UCS) in MPa; density in g/cm 3 ; p-wave velocity (Vp) in m/s; and s-wave velocity (Vs) in m/s. There were a total of 64 events with no missing data. Table 1 shows the original dataset and statistical distribution in this study.
In this study, to visualize the original dataset of E, the seaborn module in Python was used. Figures 1A,B illustrate a pairwise scatter of the kernel density estimation (KDE). The purpose of building a KDE pairwise plot is to examine the association between any two influencing parameters in the original dataset. Based on Figures 1A,B, all the input parameters have a moderate to strong positive correlation with both E s and E d . Next, Figures 2A,B highlight the diagonal correlations between the input and output parameters. The seaborn module in Python was used for diagonal correlation heatmaps to develop the correlation coefficients of multiple inputs with E s and E d .
Correlation coefficient values are specified in the light red to dark red color for E s and light purple to dark purple for E d . According to Figures 2A,B, the overall correlation coefficients between the input and output parameters are relatively high. Therefore, all parameters were incorporated to improve the accuracy of the final probabilistic prediction framework in the E s and E d circumstance. Figure 3 depicts the flow chart of the study.

Multiple Regression Analysis
Multiple regression analysis (MRA) can be classified into linear and nonlinear regression. However, this study has implemented the multiple linear regression (MLR) as a result of multiple variables using SPSS (version 23). MLR is a numerical method that uses multiple descriptive parameters to estimate the output of a reporting parameter. The MLR method is used to obtain the best-fit relationship between the parameters. Generally, MLR can be employed to establish the association between independent (input) and dependent (output) parameters. In this study, the MLR technique was used to predict E s and E d , respectively. The MLR relationship between the inputs and output can usually be expressed by Eq. 1. where, D depicts the output parameter, a denotes the regression constant, B 1 -B n are the coefficients of regression and X 1 -X n are the input variables. Based on the consequences of MLR, E s and E d are predicted by the established linear expressions as shown in Eqs 2, 3. (3) where, E s and E d are static and dynamic Young's modulus in GPa, respectively. UCS represents uniaxial compressive strength in MPa, Vp and Vs are the p-wave and s-wave velocities in m/s, respectively.

Artificial Neural Network
Artificial neural network (ANN) is among many supervised machine learning methods and has found wide application in a variety of fields. An ANN consists of three components, i.e., input layers, hidden layers, and an output layer. The structure of ANN and the choice of hidden layers and neurons play a crucial part in ascertaining its performance (Chester, 1990). The feedforward back-propagation (FFBP) neural network, a multilayer perceptron network, was used owing to its simple process and wide applicability. Backpropagation (BP) is one of the most efficient and commonly employed learning algorithms in multi-layer networks (Cevik et al., 2011;Hajihassani et al., 2014). Each network must contain sufficient neurons depending on the application of ANN. Neurons of a given layer are connected to the neurons of the subsequent consecutive layer with every connection having a certain weightage (Atkinson and Tatnall, 1997). Equation 4 is employed to estimate the approximate number of neurons in the hidden layer (N h ), since the inappropriate selection of the neurons in the hidden layer often leads to "under-fitting" and "over-fitting" and must be prevented. Figure 4 represents the basic structure of the ANN network for predicting E s and E d in this study.
where, N 1 denote the total number of inputs. In order to build the net input n, the weighted input ω i p i is connected to a scalar bias b, f denotes the transfer function, and O denotes the scalar output. If the neuron has Z input parameters, the output can be computed by Eq. 5. This study used a sigmoid transfer function in the hidden layer and a linear output function in the output layer. To achieve the number of neurons in the hidden layer, this study used some provisions, since there is no specific approach to providing the desired results. In addition, fifty epochs were used for training the ANN network and the least error of validation is considered as a stop to avoid overfitting.

Extreme Gradient Boosting
The extreme gradient boosting (XGBoost) algorithm was created by Chen and Guestrin (Chen and Guestrin, 2016). Being an effective tree-based ensemble learning algorithm, it is considered a powerful tool among data science researchers. XGBoost is based on gradient boosting architecture (Friedman, 2001), which uses various complement functions to estimate the results using Eq. 6.
where, y i indicates the predicted output for ith data with the parameter vector U i ; n denotes the number of estimators corresponding to independent tree structures for each f k (i.e. k 1 to n); and y 0 i displays the primary hypothesis, which is actually the mean of the original parameters in the training data. η represents the rate of learning connected to improving the performance of the model, whether connecting the additional trees to prevent overfitting. The statistical model has to be developed with less overfitting which is one of the genuine problems that often conflicts in machine learning. In the XGBoost model, the training phase is determined in a complementary way. According to Eq. 3 in the kth stage, the kth estimator is connected to the model and the prediction of the kth y −k i is calculated from the estimated output y −(k−1) i in the next step, and the established f k of the kth complementary estimator is shown in Eq. 7.
whereas f k represents the leaves weight that is established by reducing the objective function of the kth tree and is given by Eq. 8.
where, Z denotes the quantity of leaf nodes, c denotes the complexity parameter, λ denotes constant coefficient, and ω 2 a  Frontiers in Earth Science | www.frontiersin.org October 2021 | Volume 9 | Article 761990 denotes the leaf weight from 1 to Z. c and λ are regularization parameters employed to improve the model to keep away from the over-fitting. g a and h a are the summation parameters for the entire dataset associated with a leaf of the initial and previous loss function gradient, correspondingly. In order to build the kth tree, a leaf is distributed into several leaves. Such a system is implied by using the gain parameters expressed in Eq. 9.
where, G denotes the gain parameters, O R and P R denote the right leaf, respectively. O L and P L denote the subsequent division of the left leaf, respectively. When the gain parameter is approximated to zero, the division criteria are generally assumed. c and λ are regularization parameters that are indirectly reliant on the gain parameters. For example, a larger regularization parameter can significantly reduce the gain parameter, thus avoiding the leaf convolution phenomenon. However, this will reduce the performance of the model to adapt to the training data. Figure 5A demonstrates the basic structure of the level-wise XGBoost tree model.

Grid Search Cross Validation
The grid search method is used for the adjustment of hyperparameters (Bergstra and Bengio, 2012). The technique approves a search in an identified range of hyperparameters and defines the desired results leading to the best outcomes of the assessment criteria, i.e., R 2 , MAE, MSE and RMSE. GridSearchCV() has been carried out in scikit-learn Python programing language to process this strategy. This method simply calculates the score of CV for all hyperparameters integrated with a particular reach. In this study, a 10-fold iterated arbitrary arrangement practice was incorporated in the CV command as specified in Figure 5B. GridSearchCV() allows not only to compute the desired hyperparameters, but also to evaluate the metric values to their desired outcomes. This study used all the remaining features of the Python programing language by default to perform Grid Search CV.

Performance Criterion
Typically, the performance of a model must be estimated when approaching the steadiness of a prediction framework, using an extensive range of performance criteria to select a highly accurate model. Therefore, this study proposes a unique performance criterion as follows:

Correlation coefficient
The correlation coefficient (R 2 ) is the key to the execution of the regression survey. The computation of R 2 can be expressed by Eq. 10.

Variance Accounts For
Variance accounts for (VAF) is also considered as one of the important metrics for evaluating the overall performance of the model. Higher the VAF value, the greater will be the performance of the model. The computation of VAF can be expressed by Eq. 14.
where, X and Y are the mean of the original and predicted data values, respectively, X i and Y i are the original and predicted data values, respectively, and n shows the number of datasets.
Yi is the mean value of the original data. Table 2 signifies the performance ranking and the corresponding RSR values.

RESULTS AND DISCUSSION
In this study, a novel machine learning regression XGBoost model was developed and compared with two other models, namely MLR and ANN, to confirm the accuracy of predicting E s and E d . To avoid overfitting of these models, the original dataset was partitioned into 70% for the training stage and 30% for the testing stage of 64 events. The ANN and XGBoost models are trained on training data and then validated by testing data. The 50 epochs were used for training the ANN model and the least error of validation is considered as a stop to avoid overfitting. According to Eq. 4, a total of nine neurons are selected in the hidden layer, which is connected to four input neurons and two output neurons, as shown in Figure 4. In this study, an XGBoost model with the default features of the XGBoost module was executed, i.e., M 50 estimators, the regularization properties of c 0, λ 1, and a learning rate of η 0.3. Moreover, a 10-fold iterated arbitrary arrangement practice was incorporated to substantiate the models. The original and predicted output values were then arranged and represented in scattered plots in order to ease the performance and correlation analysis of the developed models. The input parameters are UCS (MPa); Density (g/cm 3 ); Vp (m/s); and Vs (m/s). The predicted output parameters are E s and E d . The final output was evaluated by using performance criteria such as R 2 , RMSE and VAF, and the developed models were compared to estimate the appropriate model with higher accuracy of prediction results in this study. Figures 6A Table 3 shows the performance criterion of the MRA, ANN and XGBoost determined using Eq. 10-14. Figure 11 depicts the comparative Taylor diagrams of the proposed models MRA, ANN and XGBoost in the (a) training stage and (b) testing stage for E s and E d to further estimate the performance of the models more extensively. The standard deviation values associated with one another by the circular lines are shown as horizontal and vertical coordinates in the diagrams. Two performance metrics, one is the R 2 value, indicated by the blue radial lines from the starting of the coordinates, and the other is the RMSE value, specified by the green circular line. Observed values were used as the base model with zero errors, i.e., RMSE 0 in the diagrams, the maximum R 2 1, and the computed standard deviations. Next, the R 2 , RMSE and standard deviation of the other models were compared with the observed values. A best model is one with highest degree of similarity to observed data model. As shown in Figure 11, the XGBoost model has been able to approach the observed data and outperformed the MRA and ANN models in both the training and testing phases.
Moreover, according to Table 3 and the results in Figure 11, the XGBoost model has revealed the best performance with high accuracy ( Therefore, XGBoost is an applicable machine learning regression model that can be applied to accurately predict the E s and E d .

CONCLUSION
Young's modulus (E) plays an important role in the stability of surface and subsurface structures. Therefore, an accurate estimation of E is mandatory. This study developed a novel machine learning XGBoost regression model with four input parameters, i.e., UCS (MPa), density (g/cm 3 ), Vp (m/s) and Vs (m/s) for predicting E s (GPa) and E d (GPa). In addition, the MRA and ANN models were included to compare their results with the proposed model. To avoid overfitting of these models, the original dataset was partitioned into 70% for the training stage and 30% for the testing stage of 64 data points. The study concludes that the proposed XGBoost regression model performed more accurately than the other studied models in predicting E s and E d . Employing a novel machine learning approach, this study was able to provide substitute elucidations to predict E s and E d parameters with appropriate accuracy and runtime. Future work can be extended using various datasets to further confirm the reliability of the proposed model.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.