Machine learning based ground motion site amplification prediction

Site condition impact on seismic ground motion has been a complex but important subject in earthquake hazard analysis. Traditional studies on site amplification effect are either based on site response via wave propagation simulation or regression analysis using parameters such as Vs30, bedrock ground motion and site response period. Ground Motion Prediction Equations (GMPEs) are used for regions where there is limited data of seismic records. The main issues with these approaches are that they cannot demonstrate the complex relationship between site amplification and its various affecting parameters, thus there exists large uncertainty in the results. Recent studies based on machine learning have shown significant improvement in predicting the site amplification, but the result is not well explained. This study assembled the information on 6 parameters including Vs30, magnitude, epicentral distance, earthquake source depth, bedrock ground motion, and altitude of 353,327 records observed during 1997 and 2019 from 698 KiK-net stations. Three machine learning algorithms of Random Forest (RF), XGBoost, and Deep Neural Networks (DNN) were implemented to predict the site amplification factor using these 6 selected parameters. Shapley Additive explanation (SHAP) was used to explain the importance of the 6 parameters. The results show that all three machine learning algorithms performed much better than the traditional GMPE approach with XGBoost’s performance the best. The explanation provided by the SHAP analysis further enhanced the reasonability of this study. It is anticipated that the combination of machine learning and SHAP analysis can provide better assessment for site amplification of ground motion with better potential of future application in seismic hazard analysis.


Introduction
The effect of site condition on seismic ground motion has been a research focus in geotechnical earthquake engineering and seismic hazard analysis. Numerous records have demonstrated that site conditions amplify the ground motion and further intensify the earthquake damage (Borcherdt, 1970;Borcherdt and Gibbs, 1976;Seed et al., 1988;Dobry et al., 2000;Bala et al., 2009). A large number of studies have been performed to better represent the effect of site amplification, and traditional approaches can be categorized into two groups: site response analysis based on wave propagation simulation and the empirical Ground Motion Prediction Equations (GMPEs). Site response analysis based on site wave propagation simulation is mostly using one-dimensional (1-D) or multi-dimensional models. The non-linear property of the soil is approximated through equivalent linearization (Idriss and Seed, 1968;Seed and Idriss, 1969). Park and Hashash conducted the analysis through improved equivalent linearization of site soil nonlinear property (Park and Hashash, 2004;Park and Hashash, 2008), and Gerlymos and Gazetas proposed a new constitutive model in a 1-D analysis (Gerolymos and Gazetas, 2005). Harmon et al. developed the ground motion site amplification model for the West and East part of the US based on extensive site response simulation (Harmon et al., 2019). Site response analysis can provide detailed results of the site amplification, but the analysis requires extensive information on soil property which is often not available. The other approach for site response analysis is through empirical Ground Motion Prediction Equations (GMPEs), which is a statistical model based on earthquake property and simplified parameters of site soil conditions. The early stage GMPEs were usually developed separately for rock and soil site conditions, and ground motion amplification factor was used to consider statistical effect of site conditions (Boore et al., 1997;Sadigh et al., 1997). Abrahamson and Silva considered non-linear effect of the site amplification factor (Abrahamson and Silva, 1997), and Boore et al. (Boore et al., 1997) used the average shear-wave velocity of the top 30 m soils (Vs30) to represent the site effect. Seyhan and Stewart (Seyhan and Stewart, 2014) developed a site amplification model for the West part of the US and used the GMPEs by Boore et al. (Boore et al., 2014) for rock sites to compliment the data scarcity of records on rock sites. Most GMPEs used only Vs30 to express the site condition, which cannot completely represent the complex nature of the various site conditions, thus resulting in a large degree of uncertainty. With the accumulation of a great number of observed records and rapid development of computational resources machine learning algorithms have been actively introduced to predict the site response. Daniel Roten et al. (Roten and Olsen, 2021a) applied machine learning to predict the site amplification factor using records from the KiK-net, and the result was compared against the result by the 1-D site response analysis showing the mean squared logarithm error reduction of at least 50%. Hamidreza et al. (Hamidreza and Soleimani Kutanaei, 2015) compared the site amplification result by Artificial Neural Network (ANN) against the result by 2-D site response analysis and found that the shear wave velocity and soil layer depth are more important than other factors in site amplification. Kveh (Kaveh et al., 2016)and (Derras et al., 2017) predicted the site amplification through M5 decision tree method and ANN, respectively. Chuanbin Zhu (Zhu et al., 2021)compared the performance of RF, GRA, SRI and HVSR methods on the benchmark dataset, and the results showed that RF performance is relatively better. Daniel Roten (Roten and Olsen, 2021b) compared the neural network methods (CNN, MLP) with the theoretical SH1D amplification method, and the results showed that CNN has a smaller MSLE and MAE in the prediction results. Machine learning approach has shown its superiority in predicting site amplification of ground motion, but the approach itself looked like a "black box" approach with results somethings difficult to be explained.
This study utilized the observed ground motion in KiK-net between 1997 and 2019, and applied RF, XGBoost, and DNN approaches to predict site amplification factor for different periods. The results were compared against those by (Yoojoong and Jonathan, 2005), who used Vs30 as the main parameter to predict ground motion site amplification. The comparison demonstrates that machine learning approaches are superior in predicting site amplification with XGBoost performance the best. Shapley additive explanation (SHAP) was then introduced to explain the proposed prediction models, and the contribution of each parameter on site amplification was analyzed to provide better guidance for future similar studies.

Data
KiK-net is the ground motion observation network operated by the Japanese National Research Institute for Earth Science and Disaster Resilience (NIED), which is composed of approximately 700 stations (https://www.doi.org/10.17598/NIED.0004). Each station has a pair of high-end sensors placed at the top and bottom of the borehole to record three components (NS, EW, UD) of the earthquake ground motion. In addition, detailed soil profile data has been provided for most stations by NIED. This study conducted the research based the data accumulated by KiK-net during 1997-2019.

Data preparation
The ground motion records by KiK-net are stored separately for each earthquake event, and this study assembled the records with the following additional parameters. Information on earthquake magnitude (Mag), altitude of the station (S_Altitude), earthquake source depth (Depth), and the latitude and longitude of both the earthquake source and the observation stations are included in the ground motion records. The epicentral distance(R) can be calculated via the latitude and longitude of the earthquake source and observation stations. The Vs30 at the stations can be calculated from the borehole data according to Eq. 1 (Yoojoong and Jonathan, 2005). For stations where borehole data does not reach down to 30 m, the last layer will be extended to 30 m deep (Boore, 2004).
where d i is the depth for the i-th layer, and Vs i is its corresponding shear wave velocity. The calculation only accounts for soil layers up to 30 m deep. Spectral acceleration Sa is calculated as in Eq. 2 (Chen Longwei Chen Zhuoshi Yuan Xiaoming, 2013).
where Sa(T) EW , Sa(T) NS are the spectral acceleration of the EW and NS component for period T. The site amplification factor (Amp) for Sa can be calculated as in Eq. 3.

Amp T
where Sa(T) S , Sa(T) R are the SA at surface and bottom respectively for period T. The above computation is repeated for all the records Frontiers in Earth Science frontiersin.org observed in the KiK-net during 1997-2019, resulting in a data set of Amp for more than 350,000 earthquake records.

Data filtering and distribution
The location distribution of earthquake source and observation station for the dataset selected in this study is shown in Figure 1. Previous researches (Bommer and Martinez-Pereira, 2000) have shown that only ground motions with PGA more than 20 cm/s 2 will have impact on our society, engineering structures and living environment. The PGA S (subscript s stands for surface) distribution of the data collected for this study is shown in Figure 2A, and from Figure 2A it can be seen that PGA S for most records is below 20 cm/s 2 . For application in future engineering practice, we need to filter the data from the assembled dataset. Figure 2B shows the distribution of Amp with datasets of different minimum PGA S of 0, 1, 2, and 20 cm/s 2 . As can be seen from Figure 2B, the distribution of Amp for datasets of minimum 1 cm/s 2 and 20 cm/s 2 looks similar, therefore, we chose the dataset with minimum PGA S of 1 cm/s 2 for this study to ensure that there be enough samples in the dataset. This filtered dataset has more than 260,000 samples. The distribution of Vs30, Mag, R and Depth for the selected dataset is also shown in Figures 3A-D, respectively. As can be seen from Figure 3 Vs30 mostly falls into the range of 250-700 m/s, and M is mostly between 4 and 7, while R is mostly less than 200 km and the source depth is mostly shallower than 60 km.

Ln Amp c ln
where c, V ref , b are regression coefficients, and η is a random factor expressing the ground motion response, and ε is the error of regression. It is noted that Eq. 4 is only applicable when Vs30 is between 130 and 1,300 m/s and PGA R is between 0.02 and 0.8 g. This study made sure that the selected dataset also met these conditions for the comparison study.

Machine learning approaches
Because of the vast advancement of computer technology and computing resources in recent years, machine learning approaches have been extensively applied in many areas of studies, often resulting in much satisfactory results (Jordan and Mitchell, 2015). In this study three  representative machine learning algorithms (RF, XGBoost, DNN) were selected in constructing models to predict the site amplification factor. For the feature parameters in the constructed machine learning models, only parameters which were easily accessible and without regional restrictions were selected, which included Vs30, Mag, R, S_altitude, Depth, Acc_rock. For validating the proposed models, 10% records The Bayesian optimization process for hyper-parameters of the 3 ML models. (A)Hyper-parameter optimization process for RF, (B)Hyper-parameter optimization process for XGBoost, (C)Hyper-parameter optimization process for DNN.
Frontiers in Earth Science frontiersin.org 05 were randomly selected from the dataset. Of the remaining 90% data, 70% were used for training, and the other 30% were used for testing the accuracy and degree of fitting in the prediction.

Random forest
Random Forest (RF) algorithm is a classification prediction approach based on multiple decision trees by Leo Breiman (Breiman, 2001), and it is an extension of decision tree algorithm. The RF algorithm traverses all nodes (trees) to be split, identifies the optimum split variable and its corresponding slipt threshold for maximum impurity reduction for all sub-nodes. The above process is repeated until the threshold value requirement is satisfied, thus resulting in generation of a forest of trees. Impurity is often represented by Gini index, which can be calculated using the following formula: where Gini(t) is the Gini index at node t, P( k t ) is the ratio of k category sample size by the total sample size at node t, and K is the number of sample categories at note t.
Bagging approach is used as the key algorithm in RF, and each tree in the forest is trained by a randomly selected sample set. The output of the results is an average of results for all the trees, as shown below: where F(x) is the output result, f i (x) is the result for tree i, and n is the number of trees in the forest.

XGBoost
XGBoost algorithm is a classification prediction approach based on the iteration of multiple decision trees by Tianqi Chen (Chen and Guestrin, 2016), and boosting is the key algorithm in XGBoost. The

FIGURE 5
The prediction performance of four methods on validation dataset (The closer the scatter distribution is to the 1:1 line, the better the prediction results).

Frontiers in Earth Science
frontiersin.org XGBoost approach starts from an initial model trained on an initial dataset, and the result is used to reconstruct the next model, and this process repeats until a satisfactory result is obtained. During this iteration process, the newly generated tree is used to approximate the error of the previous tree, and the result can be expressed as the additive tree models. Aggregation can be used to represent the result, as shown below: where y (t) i is the model prediction result for the t iteration, f k (x) is the prediction result for tree k, and t is the number of tree models.
During the iteration process, XGBoost use an approach similar to the one for decision tree, that is, traversing of classification of all feature parameters and using an objective function OBJ to evaluate the performance. Splitting is performed when OBJ increment surpasses pre-determined threshold, and the OBJ can be expressed as in the following: where the first item on the right is the differentiable loss function used to measure the distance between predicted value y i and the object value y i which is differentiable, and the second item Ω is penalty for model's complexity, which is used to reduce the risk of over-fitting.

Deep neural networks
The multi-layer network proposed by Hinton in 2006 opened the door for deep learning (Hinton et al., 2006). Multi-layer perceptron (MLP) is composed of input, hidden (multiple), output layers with full connection between neighboring layers. Each layer can be treated as a logistic regression model, and neuron parameter can be calculated via a traversing approach as shown in Eq. 9: a n i σ K k 1 w nm ik *a m k (9)

FIGURE 6
The prediction residual of four methods on validation dataset.

Frontiers in Earth Science frontiersin.org
where a n i is the i-th neuron in the nth layer, K is number of neurons in the neighboring layer m, w nm ik is the calculation coefficient for two neurons (a n i , a m k ), and σ is the activation function to provide non-linear modeling capability for the networks.

Optimization of hyper-parameters
The hyper-parameters in an ML model are very important in affecting the performance of the models, so identifying the proper setup of hyper-parameters is an important task in building ML models. In this study, based on the Optuna framework of Bayesian optimization, we optimize the important hyper-parameters in the three selected ML models. For RF, the hyper-parameters are maximum depth of decision tree (Max_depth), the number of trees (Num_boost_round), the minimum sample size for splitting (Samples_split) and the minimum sample size in a leaf (Samples_ leaf). For XGBoost the hyper-parameters are maximum depth of the decision trees (Max_depth), learning rate (Learning_rate), number of decision trees (Num_boost_round), fitting parameter (Gamma) and normalization parameters (Alpha, Lambda). Similarly, for DNN, the hyper-parameters include learning rate (Learning_rate), number of network layers (N_layers) and sample size in one training (Batch_ size). Mean Squared Error (MSE) of the model results was used as the control parameter to determine the hyper-parameters and MSE can be expressed as in the following Equation: where N represents the number of samples, y i is the object value of i-th sample, y i is the predicted value of the i-th sample. The MSE and the average training time for the hyper-parameter optimization of the 3 ML models are shown in Table 1, and the training process is visualized in Figure 4.

Results for the four models
The traditional model by Yoojoong Choi et al. is used as the baseline model, and the results by the three proposed machine learning models are compared against it from the baseline model at periods of 0.01, 0.2, 1, and 3 s. The results are shown in Figure 5 where the horizontal axis is the actual Amp while the vertical axis is the predicted Amp. For further observing the prediction error of the model, we calculate the residual and the result is shown in Figure 6, where the residual can be calculated based on Eq. 13 which represents the model prediction error. As can be seen clearly from Figures 5, 6, the results from the three machine learning models show much less scattering and smaller residual, indicating better accuracy than it by the traditional approach. To further analyze the prediction accuracy of the three proposed models, R 2 and mean absolute error (MAE) are used to assess the three models for the training, testing and validation datasets. The results are shown in Figure 7, where R 2 represents the ratio of explainable portion in the overall squared summation divided by the predicted squared summation. The closer R 2 is to 1, the better the prediction results. R 2 and MAE can be expressed in the following equations.

FIGURE 7
R 2 and MAE distribution of the three machine learning models for training, testing and validation datasets.

Frontiers in Earth Science
frontiersin.org 08 where N represents the number of samples, y i is the object value of the i-th sample, y i is the predicted value of the i-th sample, and y i is the mean value of the object value.
From the R 2 distribution in Figure 7 it can be concluded that DNN has smaller values than the values by both XGBoost and RF for training, testing and validation datasets. RF has higher R 2 value than it by XGBoost for the training dataset, but lower values for both the testing and validation datasets, indicating an overfitting trend by the RF model, which may lead to less generalization capability in future predictions. From the MAE distribution in Figure 7, it can be seen that DNN has the highest values in all three datasets, and RF has lower MAE than XGBoost for the training dataset, but it has higher MAE than XGBoost for both the testing and validation datasets. Based on the above observation, it can be concluded that the DNN model performed the worst, and RF model showed some over-fitting tendency, while the XGBoost model performed the best in predicting the site amplification factor. Frontiers in Earth Science frontiersin.org To further study the generalization capability of the proposed XGBoost model, 6 events with different Vs30 values outside the training dataset were chosen to verify the amplification prediction results, as shown in Figure 8. As can be seen from Figure 8, for different Vs30, the predicted amplification factor is pretty close to the actual ratio, Scattering plots for feature parameters and their corresponding SHAP values.

Frontiers in Earth Science
frontiersin.org 10 indicating good stability and generalization capability of the proposed model.

Explanation of the predicted results
Although machine learning has been increasingly applied in many areas with great success, it is still considered a "black box" approach with little reasonable explanation of the results. In this study the SHAP approach is introduced to explain the prediction models proposed. SHAP was originally constructed by Lundberg (Akiba et al., 2019) in 2017 as an explanation model, and its core is to calculate the contribution (SHAP values) of each feature parameter based on collaborative game theory to reflect the contribution of each parameter in the prediction. The calculation of SHAP value can be expressed in the following.
where y set (x) represents the model prediction when feature parameters are set and F is the feature number. For assembled tree models, SHAP method considers each feature parameter as a contributor, and the summation of contribution value from every parameter will lead to the final prediction assessment, as in the following (Lundberg and Lee, 2017): where y is the predicted result by the model, and SHAP 0 is the average prediction for all samples in the training dataset. SHAP i represents the SHAP value of the i-th feature parameter. It should be noted that the effect of each feature parameter on the prediction has been studied in the past (Liu and Lei, 2005), but those studies did not consider the coupling effect of other feature parameters in the analysis. Based on collaborative game theory, SHAP approach can well represent the contribution by each feature parameter via SHAP value in the prediction results accounting for the coupleinge effect of feature parameters, and the contribution can be both positive and negative, thus well suited for explaining the prediction results (Zhang, 2020).

Contribution of each feature parameter
As explained in previous sections, XGBoost performed the best in the prediction, therefore it was selected to predict site amplification factor at 6 periods (0.01, 0.2, 0.6, 1, 2, and 3 s) and the corresponding SHAP values were calculated. For observation of the relationship between feature parameter and SHAP value, scattering plots were provided in Figure 9 for all 6 parameters at the 6 periods. From Figure 9 we can make the following observations.
1. SHAP value increases with Vs30 when Vs30 is small, but decreases when Vs30 is large, representing a negative correlation between SHAP value and Vs30. This negative trend increases with increasing period value. The positive correlation when Vs30 is small indicates that the predicted value for soft soil sites is bigger than actual value, and this over prediction is worse for longer periods. This observation is comparable to the conclusions in previous studies (Emel et al., 2014;Wojtuch et al., 2021), validating the results by this study and providing an evidence to show that SHAP analysis can well explain the model results. 2. There is generally a negative correlation between Mag and its SHAP value, representing over prediction by the model for smaller earthquakes. 3. There is no clear correlation trend between station altitude and its SHAP value when the period is short (0.01 and 0.2 s). When the period increases, the SHAP value is big when the altitude is small, and the SHAP value increases with decreasing station altitude, representing over prediction at low altitude for long periods. 4. There is obvious negative correlation between bedrock acceleration (Acc_rock) and SHAP value, which means that the model overpredicts the site amplification when Acc_rock is small but underestimates the amplification when Acc_rock is big. This observation is also supported by results by Beresnev (Walling et al., 2008), which may be because of non-linear effect of the site. 5. From the perspective of SHAP calculation, the SHAP values for Depth and R do not show a particularly obvious trend.

Feature parameter importance
The SHAP value analysis has provided a good picture of the impact by each feature parameter, and to further quantify the impact of each parameter on the prediction results, importance analysis was performed for all the feature parameters in this study. In a typical XGBoost model, importance can be ranked using different measures (Gain, Cover, Weight) but the conclusion can be different if using different measures, so it is hard to decide which measure is the best for application. Since SHAP value represents the contribution of each feature parameter on the prediction result, we can use the average of the SHAP value to objectively represent the importance of each feature parameter and rank the absolute mean SHAP value to decide on the importance of each parameter. The results are shown in Table 2. The same results are graphically represented in Figure 9 for better comprehension. As can be seen from both Table 2 and Figure 10 Vs30, Mag, S_Altitude and Acc_rock have relatively large influence on the prediction results while R and Depth show less influence with Vs30 having the largest impact and Depth having the least impact. These results may serve as good reference for future site amplification studies.

Conclusion
Based on the observed ground motion between 1997 and 2019 and station information from the KiK-net, a large database was assembled that included both the site amplification factor and 6 feature parameters. Three prediction models for site amplification factor based on machine learning were proposed and prediction results were compared against the result by a traditional approach at 6 different periods. Extensive analysis was conducted to find the best prediction model. SHAP analysis was used on the XGBoost model to provide better explanation of the prediction results and assess the impact and importance of the 6 feature parameters at 6 different periods. The following observations can be made based on this study.
1. In terms of predicting site amplification effect, machine learning algorithms are significantly better than traditional regression methods. Among the three machine learning approaches studied, XGBoost performs the best, followed by RF and DNN. 2. Comparison with the actual ground motion records for site amplification verifies the performance of the XGBoost prediction model, demonstrating huge potential of machine learning in site amplification factor prediction. 3. In the SHAP analysis, Vs30, Acc_rock and Mag are significantly negatively correlated with the predicted value, and S_Altitude is negatively correlated for large periods.

FIGURE 10
Feature importance ranking.
Frontiers in Earth Science frontiersin.org