Performance Evaluation of GIS-Based Novel Ensemble Approaches for Land Subsidence Susceptibility Mapping

The optimal prediction of land subsidence (LS) is very much difficult because of limitations in proper monitoring techniques, field-base surveys and knowledge related to functioning and behavior of LS. Thus, due to the lack of LS susceptibility maps it is almost impossible to identify LS prone areas and as a result it influences severe economic and human losses. Hence, preparation of LS susceptibility mapping (LSSM) can help to prevent natural and human catastrophes and reduce the economic damages significantly. Machine learning (ML) techniques are becoming increasingly proficient in modeling purpose of such kinds of occurrences and they are increasing used for LSSM. This study compares the performances of single and hybrid ML models to preparation of LSSM for future prediction of performance analysis. In this study, the spatial prediction of LS was assessed using four ML models of maximum entropy (MaxEnt), general linear model (GLM), artificial neural network (ANN) and support vector machine (SVM). Alongside, the possible numbers of novel ensemble models were integrated through the aforementioned four ML models for optimal analysis of LSSM. An inventory LS map was prepared based on the previous occurrences of LS points and the dataset were divvied into 70:30 ratios for training and validating of the modeling process. To identify the robust and best LSSMs, receiver operating characteristic-area under curve (ROC-AUC) curve was employed. The ROC-AUC result indicated that ANN model gives the highest ROC-AUC (0.924) in training accuracy. The highest AUC (0.823) of the LSSMs was determined based on validation datasets identified by SVM followed by ANN-SVM (0.812).


INTRODUCTION
Land subsidence (LS) is a natural geo-hazard phenomenon that occurs around the globe and causes extensive deformation of the earth's surface. More specifically, subsidence may causes lowering of the earth's land surface by natural or human induced activities, most importantly mobilization of solid or fluid underground materials are the main causes (Herrera-García et al., 2021). In general, there is a downward motion of the rock and soil surface in an almost vertical direction or sometimes with a slight angle that is known as LS (Cigna and Tapete, 2021). This phenomenon occurs suddenly or sometimes gradually due to a number of natural as well as anthropogenic factors. The main factors for their occurrences are earthquakes, volcanic activity, floods, overexploitation of groundwater and its decline, mining activities, tunnel construction, and so on (Erkens and Stouthamer, 2020;Lyu et al., 2020). Among all these possible factors, groundwater exploitation and structural weakness are the most important issues in this concern . Groundwater depletion is most responsible for LS and it is a slow and gradual process (Galloway and Burbey, 2011). The overdraft aquifer systems particularly in the agricultural and residential area are the most susceptible zone for the occurrences of LS Orhan et al., 2021). Like other natural geo-hazard phenomena, this incident also causes life loss and huge economic losses. Essentially, land loss causes devastating damage to property and infrastructure, such as construction, communication with transport networks, drainage systems, underground pipelines, and many more (Cigna and Tapete, 2021). LS not only affected environmental changes but also impacts on social and economic activities. Apart from all these direct effects, several indirect results such as minimizing groundwater storage capacity, water contamination, increases flood hazards Nhu et al., 2020), dissolution of calcareous rocks and stem faulting and mining. The phenomenon of LS is not a recent activities rather it has a long history. A body of literature studies has been shown that during the past century LS occurred due to the depletion of groundwater over 200 places in 34 countries across the globe (Herrera-García et al., 2021). Land subsidence on a spatial and temporal scale is therefore causing significant environmental, socio-economic and financial damage across the globe.
Iran, its geographical location and associated conditions have favored a semi-arid and arid climatic region with frequency of drought in recent decades (Arabameri et al., 2021). Therefore, in order to overcome the drought, this country is faced with an increasing demand for water supply through overextraction of groundwater due to the expansion of urban and agricultural uses (Babaee et al., 2020;Pourghasemi et al., 2020). In the upcoming decades, population growth and associated economic activities will continue to increase the demand of groundwater and groundwater depletion, and severe LS activities are found in different regions of the world (Famiglietti, 2014). Thus, the most important natural resources, i.e., water levels, are gradually declining due to over-extraction for agricultural, domestic and industrial uses (Abdollahi et al., 2019;Guzy and Malinowska, 2020). Apart from this, climate change and its associated phenomena have a major impact on LS in this region.
The Phenomenon of climate change significantly increases the atmospheric temperature, as a result drought condition have been occurred and a large number of people greatly depends on groundwater, and gradually destruction of aquifers causes LS, particularly in the central and north-eastern parts of Iran (Rateb and Abotalib, 2020). Land use change is one of the most important factors for LS in arid and semi-arid climates . The pattern of land use land cover also impacted on groundwater availability and recharge capacity. Land use changes have been directly supported by land subsidence, particularly in the extreme semiarid and arid climatic condition. As a result, the recovery of deformation surface after LS is costly and time consuming, it is therefore essential to predict land subsidence susceptibility mapping (LSSM) for proper management and optimal uses of land resources. The occurrences of LS subsequently causes land degradation and it is destroy infrastructure, agricultural land and other natural resources. Therefore, to control and manage the fertile agricultural land, infrastructure from LS it is necessary to optimal mapping of LS by which land use planners management the land resources in sustainable way (Ghorbanzadeh et al., 2018;Arabameri et al., 2021).
As a result, several geo-environmental conditioning factors, along with different prediction models, have been used to predict LSS. A number of hazard susceptibility maps have been developed worldwide, based on qualitative and quantitative methods (Oh et al., 2019;Mohammady et al., 2019). Advances in Remote Sensing and Geographic Information System (RS-GIS) technology, along with artificial intelligence, greatly help in the mapping of a number of natural hazards with proper management of environmental issues through land use planning. Recently, Interferometric Synthetic Aperture Radar (InSAR) observations satellite data have been used to monitor and detected LS areas (Karimzadeh and Matsuoka, 2020). The spatial and temporal land deformation is measured through InSAR observations and it is a microwave remote sensing system (Orhan et al., 2021). Several machine learning algorithm (MLA) has been used over time to predict LSSM, such as logistic regression (LR) (Tien Bui et al., 2018), artificial neural network (ANN) (Mohebbi Tafreshi et al., 2020), support vector machine (SVM) (Arabameri et al., 2021), logistic tree model (LTM) (Arabameri et al., 2021), decision tree (DT) (Lee and Park, 2013) and so on. However, the most recent ensemble model, i.e. a combination of several MLAs, has been widely used for better and meaningful results for this purpose. In another way, we can say that the Ensemble Model was used for the accurate presentation of single classifiers along with its higher accuracy (Pham et al., 2017). Beside this, the ensemble model also has the capacity to deal with the difficult relationship between different scales of influence and spatial data (Kanevski et al., 2004). Several research studies on LSS mapping using MLA and their ensemble by various researchers, such as Tien Bui et al. (2018); Abdollahi et al. (2019); and many more.
Thus, current research work on LSSM has been carried out in the arid and hyper-arid climate zone of the Kashan plain in the north of Esfahan Province to mitigate surface deformation and the proper management of surface structures.
A body of literature survey (Arabameri et al., 2020d(Arabameri et al., , 2021Babaee et al., 2020;Rezaei et al., 2020) and based on the local topographical, hydrological, climatological, geological and environmental condition, here we have selected twelve appropriate LS conditioning factors. Therefore, we used twelve geo-environmental conditioning factors namely elevation, aspect, distance to road (DtR), groundwater drawdown, distance to fault (DtF), topographic wetness index (TWI), distance from stream (DtS), normalized differentiate vegetation index (NDVI), curvature, slope, land use and lithology to meet our objective. In this study, four popular MLAs namely the Maximum Entropy (MaxEnt), general linear model (GLM), artificial neural network (ANN) and support vector machine (SVM) were used for modeling and mapping of LS, based on the stateof-the-art skillful characteristics and literature study (Abdollahi et al., 2019). The selection behind these MLAs are based on their earlier involvement in different research work on natural hazard susceptibility studies and respective optimal prediction performance (Zamanirad et al., 2019;Mohebbi Tafreshi et al., 2020;Najafi et al., 2020). In the case of MaxEnt, it has the ability to choose the correct estimation of the uncertain probability distribution and to select the highest entropy of the given probabilistic constraints. The GLM algorithm is based on a logistic regression model and used for a fractional response to handle a binary value dataset. Structured code input and output nodes were determined by trial and error in ANN, and events and non-event phenomena were determined. SVM is mainly used for classification, error analysis; generalize the overall function and find out about the two-class hyperplane in the dataset. Finally, a total of 11 ensembles, in which six are two models ensemble and five are three-four models ensemble, have been developed for better predictive analysis of LSSM in this region. A body of literature survey and best of our knowledge it has been found that no research study on the 11 possible ensembles of aforementioned four ML algorithms were used in LS studies. The maximum ensemble methods were created using the aforementioned four popular ML algorithms to optimal estimation of LS prediction performance and this is the novelty in this research study. Thereafter, all of these output results were validated by area under curve (AUC) analysis. As a result, the LS susceptibility zones have been classified into five zones, i.e., very low, low, medium, high, and very high. Depending on the LS susceptibility zones, appropriate prevention strategies should be taken to control future occurrences and proper management strategies.

Description of the Study Area
The Kashan plain is located in the North of the Esfahan Province in the Kashan city (between 33 • 40 00 to 34 • 35 00 N, and 51 • 05 00 to 51 • 55 00 E) and covers about 2,231 km 2 area (Figure 1). Elevation in the study area ranges between 803 m and 1,671 m above mean sea level. The average annual rainfall ranges between 75 and 185 mm and there is more rainfall in the west (Ghazifard et al., 2016). The climate of the study area has two classes of arid and hyper-arid. The minimum and maximum temperature in this area is 16 and 22 • C, respectively, and also the minimum and maximum slopes in area are 0 and 129%, respectively (Ghazifard et al., 2016;Goorabi et al., 2020). Kashan plain is located in the foothills of the Karkas Mountains and on the margin of the central desert of Iran. Based on the land use map, poor rangeland the largest area (53.26%), and then by agricultural (21.5%), barren land (16.19%) and the remaining area is shared between afforest, sand dune, salt land and residential areas ( Table 1). Based on the lithology map, diverse lithological have covered the area in which the largest area pertains to the low level piedmont fan and valley terrace deposits (68.33%), followed by unconsolidated windblown sand deposits including sand dunes (18.83%) and the remaining area is shared between other formations presented in detail in Table 1. Based on the land type (geomorphology) map, plates the largest area (32.09%), and then by flood plain (26.98%), low land (18.36%) and the remaining area is shared between mountains, scree and hill areas.

Methodology
The research work of the LSSM has been carried out in four steps (Figure 2).
• Preparing an LS inventory map using 239 LS points.
Historical LS data were collected through field survey along with the help of Coppernnicus aerial view output and high resolution satellite images. A total of 12 geoenvironmental control factors have been used to meet our research objective. • Multi-collinearity testing was conducted among the conditioning factors used in this study using inflation factor variance (VIF) and tolerance (TOL) techniques (Band et al., 2020;Arabameri et al., 2021). • To map the LS susceptibility of a number of MLAs, i.e., MaxEnt, GLM, ANN, and SVM have been used in this study together with a total of eleven ensemble methods. • The performance of each model was validated by area under curve (AUC) analysis.

LS Inventory Map
The LS Inventory Map (LSIM) is the primary mapping tool for LSSM. LSIM shows the spatial distribution of a number of LS regions (Figure 1). It is well known that LS zones can be predicted on the basis of both the historical and the current spatial distribution of LS. The current inventory map in this area has been prepared using RS-GIS technology. As a result, LS areas have been identified from Coppernnicus aerial view output and extensive field surveys with the Global Positioning System (GPS) to locate the exact position in the field. In general, any type of inventory map can be used to assess the relationship between the distribution of a particular hazard location and the associated conditioning factors responsible for that hazard. A total of 239 LS points were used in this study, in which 70% (167) was used as a training dataset and 30% (72) was used for dataset validation.
In this study, we have followed the influence of data Splitting performance (Nguyen et al., 2021) to divided the entire dataset  was splitting into 70:30 ratio. Some of the field photographs in this study area are shown in Figure 3.

Land Subsidence Conditioning Factors (LSCFs)
The quality of the predictive outcome of the LSSMs depends to a large extent on the selection of control factors. The evaluation of the relationship between the LS and their associated conditioning factors is therefore very necessary as it influences the modeling process. Twelve LSCFs have therefore been chosen to prepare the LSSM for this area. There is no universal criterion for the selection of these variables, although several literature studies have been conducted (Sahu et al., 2017). The types of LS and the availability of data are also taken into account for LSSM along with the geo-environmental conditions in this area. As already mentioned that in this study we have selected a total of 12 suitable LSCFs based on the literature survey and keeping in view the local geoenvironmental conditions like topography, hydro-climatology and geological conditions. The 12 LSCFs used for this study are elevation, aspect, distance to road (DtR), groundwater drawdown, distance to fault (DtF), topographic wetness index (TWI), distance from stream (DtS), normalized differentiate vegetation index (NDVI), curvature, slope, land use and lithology (Figures 4A-L). Therefore, several data sources were used to prepare these twelve conditioning factors. Different topographic and hydrological factors have been prepared from Advanced Land Observation satellite Phased Array type L-band Synthetic  Aperture Radar digital elevation model (ALOSPALSAR DEM) with a 12.5 m resolution which is freely available on the Alaska Satellite Facility (ASF) website 1 . Sentinel 2A satellite data with a the land use map. The lithology map in this area was taken from the Geological Maps of Iran collected from the Geological Society of Iran (GSI) 3 at a scale of 1:100,000.
As a result, the elevation map was derived from the DEM analysis of the ArcGIS 10.5 platform, with values ranging from 803 to 1671 m ( Figure 4A). The second side of the slope is the derivative. Aspect is the altitude calculator and the slope direction of its eight neighbors. The aspect map of the present study area ( Figure 4B) has been shown. DtR is an important factor in the occurrence of LS due to surface pressure and may cause LS in its surrounding area. The DtR range in this study ranges from 0 to 16,776 m ( Figure 4C). Studies indicates that several types of factors such as climate-hydrological and physical factors significantly control soil moisture .
Among the various factors, groundwater depletion is the most responsible conditioning factor for LS. The groundwater map shows values from 1.8 to 21.1 m ( Figure 4D). DtF is also responsible for the deformation of the soil surface through the use of LS. The value of a DtF map is between 0 and 16,679 m ( Figure 4E). The degree of water accumulation in the area depends on the TWI, which is a secondary topographic variable. The TWI map in this area was prepared using DEM on the ArcGIS 10.5 platform and the value ranges from 1.94 to 14.69 ( Figure 4F). The following equation has been used to calculate the TWI.
Where, A s and β denotes cumulative catchment area (m 2 ) and define slope angle, respectively. The phenomenon of climate change and several human induced activities are considered the two major drivers flow pattern in a basin area and impacted on hydrological factors Tian et al., 2020).
DtS is also a significant factor for LS. The probability of LS is increasing away from the river, and vice versa. The DtS map was shown in Figure 4G and ranged from 0 to 2,619 km. NDVI has the capacity to measure the growth and biomass of the vegetation (Yilmaz, 2009). This factor also plays an important role in the LSSM, as land use largely affects the occurrence of LS. The NDVI value ranged from −0.72 to 0.82, with a lower value indicating bare surface area and a higher value indicating forest cover ( Figure 4H). The following equation was used to calculate the NDVI values using Sentinel 2A satellite data.
Where, Band8 is near-infrared and Band4 is red reflectance of the spectrum. Curvature represents the secondary geomorphic assets and shows the pattern of flow, sedimentation, erosion, etc. (Yesilnacar, 2005). The curvature map in this study was shown in Figure 4I and the values range from −4.77 to 5.66. The value of the slope map in this study area ranges from 0 to 129% ( Figure 4J). LS is highly influenced by the slope of the area. Lithology is another key factor in the occurrence of LS as it affects the storage capacity of water. In this study, the lithological map ( Figure 4K) was classified into seven types. Table 1 shows details of the lithological characteristics, such as their description, age of formation, percentage area, etc. Finally, the land use map has been prepared to understand the coverage of the surface area and its impact on the LS. The land use map ( Figure 4L) for this area was classified into seven categories and their classes, along with their respective areas, are shown in Table 2.

Multi-Collinearity (MC) Analysis
MC can be defined as the linear relationship between two or more variables in the dataset (Alin, 2010). Linear dependency is the top most priority given in this analysis and explained variables through correlation matrix (Saha et al., 2021b). Preparation of natural hazard related susceptibility mapping and their optimal prediction accuracy is based on suitable geo-environmental conditions (Arabameri et al., 2021). As a result, MC test has been carried out in this study to analysis the specific relationship among all of these variables and minimize the bias. Generally, MC occurs when there is a high correlation among the two or more variables (Arabameri et al., 2017). In other words, MC test is required to ensure the independent conditioning factors in a dataset . Tolerance (TOL) and variance inflation factors (VIF) techniques have been used by several researchers to test MC analyzes (Chowdhuri et al., 2020a). Therefore, in this study, we also used these two techniques and their respective equations were calculated as follows: Where, R 2 j is the regression value of j variables in a dataset. The MC occurred when the TOL value is < 0.10 or 0.20 and VIF value is > 5 or 10 of a respective variables in a given dataset.

Maximum Entropy (MaxEnt)
The MaxEnt algorithm is based on the principle of maximum entropy and is one of the most popular predictive machine learning models (Woodbury et al., 1995). In general, MaxEnt estimates the probability distribution of the target based on the principle of maximum entropy and the probability distribution of LS occurrences in this study. MaxEnt has always chosen the highest entropy in a given probabilistic dataset. Apart from this, the presence features are used only by the MaxEnt model and have a significant impact on inaccessible areas with a reliable outcome (Reddy and Dávalos, 2003;Phillips et al., 2009). The relative influence of predictive variables (IPV) is estimated using jackknife re-sampling techniques within this model to generate response curves. Model performance was calculated by re-sampling the jackknife, excluding the predictor variables from the data set (Yang et al., 2013). Basically, this model identifies the true distribution (π) of LS, i.e. target occurrences over area X within a specific study area. Here, historical evidences of LS taken as a training dataset to define the true distribution (π). Let's consider, the LS occurrence probability indicates the location of area X and the target probability distribution is π (X). Location X indicates the probability of LS occurring by P(y = 1|X) and the Bayes rule has been applied to express this algorithm as follows: P y = 1 X = P y = 1 P(X|y = 1) P(X) = P y = 1 (X) 1/|X| (5) Where, P y = 1 indicates success of LS occurrences and (X) indicates total number of occurrences over the whole study area. π(X) is predicted through maximum entropy principle along with Gibbs probability distribution. Thus, Gibbs probability distribution may be expressed as follows: Where, Z λ (X) indicates normalization constant of a vector, λ i indicates weights assigned of a vector. Furthermore, in the study area if m LS occurrences, variation between the regularization and log likelihood is expressed as follows (Phillips and Dudík, 2008): Where, β j represent the parameter of regularization for the j th variables of predictor.

Generalized Linear Model (GLM)
The GLM was originally introduced and used by Nelder and Wedderburn (1972). In general, it is the probability of statistical method with a logit function and widely used in the field of natural hazards analysis (Lucà et al., 2011). The simple linear regression model was modified to form the GLM model. One major advantages of this model is that its simplicity, thus GLM extensively used in the wide fields of statistical analysis (Vorpahl et al., 2012). The basic function of this model is to develop multivariate regression between dependent and independent variables. Basically, the extensive form of a simple linear regression model is therefore GLM's ability to build up a non-normal distribution between datasets (Payne et al., 2012). It also has the ability to develop binary datasets based on the presence and absence of data, using the logit link function for logistic regression. The GLM algorithm can easily handle the binary data set with the fractional response of the logit link function (Garosi et al., 2018). The basic function of this model fitting approaches includes finding out error distribution, determining the variables within this system and run the logit link function. According to Bernknopf et al. (1988) the function of GLM is as follows: Y = Pr y = 1 = e C 0 +C 1 +X 1 +···+C n X n 1 + e C 0 +C 1 X 1 +···+C n X n Where, Y (logit-function) indicates the probability of an incident happening and it varies from 0 to 1; X 1 . . . X n represent the values of different controlling factors and C 1 . . . C n is their respective coefficient.

Artificial Neural Network (ANN)
One of the most popular MLAs, i.e., the ANN, is the most accurate and widely used forecasting model that has been effectively applied in various areas of forecasting analysis, such as social, economic, stock issues, natural hazard susceptibility mapping, etc. In general, ANN is a flexible statistical structure capable of identifying a non-linear relationship between input and output variables of a dataset (Hsu et al., 1995). In modern times, this model has been used with the utmost precision for forecasting, process control and pattern recognition in the broader perspective of science and technology fields (Sudheer et al., 2002). An ANN model has some unique features, and the result of this model's output forecast is more attractive and accurate. The unique features of this model are based on data-driven, self-adapted methods, the ability to generalize, and the ability to manage complex non-linear relationships. Apart from that, among all non-linear classes, ANN is a universal approximator capable of approximating complex class functions with high accuracy (Zhang and Qi, 2005). Among the various algorithms used in the ANN model, Multilayer Perceptron (MLP) is the trendiest and widely used by a number of researchers (Kosko, 1992). Therefore, within this MLP algorithm ANN model consists of three layers, i.e., input layer, hidden layer and output layer (Mandal and Mondal, 2019). If the function of the input layer is not able to involve in a proper way than data structured of the model is measured by hidden layer nodes (Arabameri et al., 2020c). The hidden layer is determined through trial and error method within this model (Gong et al., 1996). Thus, the model structure systematically predicts by input as well as hidden layers and evaluates the output results. The output layer has been defined by Boolean value of 0 and 1. In this research study, 0 indicates no LS and 1 indicates LS. According to Hagan et al. (1995) the back propagation of an ANN model can be expressed as follows: The net input of j th neuron of layer l and I iteration δ factor for neuron j th in the output layer i th δ factor for neuron j th in the hidden layer i th w l ji (t + 1) = w l ji (t) + α w l ji (t) − w l ji (t − 1) Where, α is the momentum rate and n is the learning rate within this model.

Support Vector Machine (SVM)
The SVM is a supervised machine learning method and broadly used in statistical test such as categorization and regression analysis . The algorithm of SVM is based on the principle of structural risk minimization and statistical learning (Vapnik, 2013). This model is binary classifiers and was introduced by Vapnik (2013) in 2013. In general, SVM has the capability of resolving the statistical dataset in the way of classification and regression analysis. In SVM model errors are recognized through several classification functions and finally generalize the overall function (Joachims, 1998).
The main two principles of SVM are the optimal hyper-plane classification and the use of kernel function (Yao et al., 2008). Therefore, the principle of hyper-plane is used to differentiate into two classes, i.e., events and non-events, in this study it is LS and non-LS. The situation of closeness of optimal hyperplane and training dataset is known as support vectors (Lee et al., 2017). The statistical induced problems in a SVM model can be employed in two ways: optimal separating hyper-plane from training dataset and conversion of non-linear data into linearly separable data through kernel function (Yao et al., 2008). In a SVM modeling, two classes have been created by hyperplane, i.e., one is above the hyper-plane denoted by 1 and another is below the hyper-plane denoted by 0. The following equations have been used to calculate the hyper-plane in a SVM model.

Subject to
Min n i=1 ϕ i y j = 0 and 0 ≤ α i ≤ D Where, x = x i , i = 1, 2,. . . n represent the input variables of vector, y = y i , j = 1, 2,. . .n represent the output variables of vector and ϕ i is Lagrange multipliers. The decision function of SVM can be expressed as follows: Where, a is the bias which indicate linear distance of hyper plane from the origin, K x i , x j is kernel functions such as polynomial (POL) and radial basis function (RBF) and, these can be expressed as follow (Kavzoglu and Colkesen, 2009).

Validation and Accuracy Assessment
The Validation and evaluation of MLA and ensemble generated LSSMs were done by using area under receiver operating curve analysis (ROC-AUC) as it is a standard toll to do the same. ROC-AUC is a statistical analysis and has been widely used by a number of researchers to validate and assess the accuracy of several natural hazard susceptibility mappings Nguyen et al., 2019;Yuan and Moayedi, 2019;. In general, it is a two-dimensional curve and consists of events and non-event phenomena (Frattini et al., 2010). It is a graphical construction on X-axis known as sensitivity and Y-axis known as 1-speficity. The X and Y axis are false positive (FP) and true positive (TP), respectively. The four indices, i.e., true positive (TP), true negative (TN), false positive (FP) and false negative (FN) have been used to assessment the ACC of ROC. In which, true and false positive indicates LS and non-LS points correctly, on the other side, true and false negative indicates LS and non-LS points incorrectly. In the ROC-AUC sensitivity identifies LS and 1-specificity identifies non-LS accurately. The ROC-AUC value ranges from 0.5 to 1. The lower value (0.5) indicates poor performance and higher value (1) indicates good performance by the model. The following equations were used to calculate the ROC-AUC value of a model.
Where, P indicates presence of LS and N indicates absence of LS.

Multi-Collinearity Analysis
The multi-collinearity analysis is the significant factor selection method. It is the method where the land subsidence causative factors (LSCFs) have been filtered from high correlations among LSCF variables which have resulted in the erroneous output and uncertain predictions. In this study, the multi-collinearity analysis has been done through the Tolerance (TOL) and the Variance inflation factor (VIF) values of the LSCFs. When the TOL value is below 0.1 and the VIF is above 5, it has a problem with the multi-collinearity. Stream power index (SPI) and drainage density followed the above rules and we removed the variables. Rests of the 12 variables have been considered as LSCFs which have no multi-collinearity problems. The VIF of the 12 LSCFs ranges from 2.864 to 1.085 and the TOL value ranges from 0.349 to 0.921 (Table 3).

Land Subsidence Susceptibility Maps (LSSMs)
The land subsidence susceptibility maps of the Kashan plain have been created from the different single and ensemble classifier machine learning (ML) models. Here, four-stage of machine learning land subsidence susceptibility models were used. First, the four single or stand-alone ML; second, the two ensemble models were created by the integrated of two single models; third, again three ensemble models were created by the integrated of the three single ML models and last four ensemble models were created by the integrated of the four single ML model. Each of the models has presented an LSSM (Figures 5-7). The LSSMs were classified in five probability zones of very low, low, medium, high, and very high. There are many classification approaches to classify the land subsidence probability maps. These are natural break, quantile, equal interval, manual and standard deviation. In this study, all the mentioned classification methods have been applied and the natural break method gave the best classification result for all the maps. Where the spatial data have the big jump value, the natural break classification scheme is suitable for classification probability map (Ayalew and Yamagishi, 2005). The single ML and two, three, and four ensembles of LSSM have been analyzed in the next sections.

LSSMs From Single ML Models
There are four single ML models have been used for the LSSM. The LSSMs of the GLM, MaxEnt, ANN, and SVM models have been presented in Figure 5 and indicate the same symbol for each class. The areas of the very high, high, medium, low, and very low land subsidence susceptibility area in the GLM model are 15, 17, 24, 25, and 18% ( Figure 8A). The percentage coverage of the very high, high, medium, low, and very low land subsidence susceptibility area in the MaxEnt model are 17, 19, 24, 24, and 16. The percent coverage of the very high, high, medium, low, and  very low land subsidence susceptibility area in the ANN model are 25, 13, 6, 6, and 50%. And the areas of very high to very low land subsidence susceptibility area in the SVM model are 16, 10, 18, 29, and 27%. The areas of probability classes of land subsidence in the models of GLM, MaxEnt, and SVM are almost the same and they maintain consistency. The probability classes of LSSM in ANN model are slide difference from the other three models ( Figure 8A).

LSSMs From First Stage Ensemble Models
After the single ML land subsidence susceptibility model, the first stage ensemble models were created by the integrated of two ML models. In this process, six ensemble models have been created. These are GLM-MaxEnt, GLM-ANN, GLM-SVM, MaxEnt-ANN, MaxEnt-SVM, and ANN-SVM (Figure 6). The very high probability of land subsidence areas varies from 9 to 10% in the above six ensemble models ( Figure 8B). The percentage of the high probability of land subsidence areas ranges from 9 to 11. The percentage of medium probability of land subsidence areas ranges from 12 to 18. The low and very low probability area ranges from 13 to 27 and 37 to 54%, respectively ( Figure 8B). So the very high and high probabilities of land subsidence classes have a good consistency in the six ensemble models.

LSSMs From Second and Third Stage Ensemble Models
In this stage, the possible ensemble models have been made by the integration of three and four single ML models. These ensemble models are GLM-MaxEnt-ANN, GLM-MaxEnt-SVM, MaxEnt-ANN-SVM, and ANN-SVM-GLM-MaxEnt. The LSSMs were prepared from these ensemble models presented in Figure 7.
In this second stage ensemble model, the percentage of very high land subsidence susceptibility zone ranges from 5 to 6 ( Figure 8C). The high land subsidence susceptibility probability zone varies from 7 to 8%. The percentage of medium land subsidence susceptibility probability zone ranges from 9 to 11. The low land subsidence probability zone varies from 15 to 21 and the very low land subsidence hazard probability zone ranges from 56 to 69%. So there are no such differences among the land subsidence hazard probability zone in second stage ensemble models. The final ensemble model or the third stage ensemble model was made through the integration of all four single models. The final ensemble model has 3, 7, 7, 14, and 69% of land subsidence susceptible areas for the very high, high, medium, low, and very low zone, respectively ( Figure 8C).

Evaluation of Land Subsidence Model
Validation is an important task for modeling based output because of the accessibility of the model output determined by the model output validation. The goodness of fit and prediction accuracy of all ML ensemble models have been evaluated using training and validation land subsidence data applied the technique of area under curve (AUC) of the receiver operating characteristic (ROC) curve. The evaluation performance result of single ML and ensemble of two, three, and four ML models in training and validating stage have been shown in Figures 9, 10. All the single ML and ensemble two, three, and four ML methods showed the excellent goodness of fit and prediction accuracy of the models. The AUC-ROC of the single four ML model on the training stage showed in Figure 9A. The ANN model has the highest AUC value (0.924) followed by SVM and GLM. On the validation stage ( Figure 10A), the SVM model got the highest (0.823) accuracy among the single ML model and followed by ANN (0.794).
In case of the ensemble of two ML methods (Figures 9B,  10B), the ensemble of the ANN-SVM model has higher AUC-ROC in training (0.915) and validating (0.812) datasets and followed by the GLM-ANN (0.808) and ANN-MaxEnt (0.805). The ensemble of three and four ML algorithms showed the   Table 4 shows the comparison of AUC-ROC for all single ML and ensemble two, three, and four ML methods using training and validating datasets. The result of the reliability of ML algorithms based on training datasets has depicted that the ANN has the highest AUC (0.924) means this model is the best fit model for the land subsidence hazard mapping. And the second and third best fit models are the ANN-SVM (0.915) and GLM-ANN (0.908). The prediction accuracy of the ML algorithms based on validating datasets showed the SVM has the highest AUC (0.823) followed by ANN-SVM (0.812) and MaxEnt-ANN (0.808). The SVM model is the best model for the land subsidence susceptibility mapping because it has the finest prediction accuracy.

DISCUSSION
Several kinds of natural hazards related environment problems has been solved by various research groups for sustainable management and utilization of natural resources (Pradhan and Kim, 2017;Jiang et al., 2018;Tsai et al., 2019;Wang et al., 2020;Xu et al., 2021) with the help of remote sensing (RS) technology (Han et al., 2019;Hu et al., 2020;Zhang et al., 2020c) and geographical information system (GIS) tool (Zuo et al., 2015;Xu et al., 2018;Zhu et al., 2019;Yang et al., 2020b) and widespread progress in the computational facilities (Chao et al., 2018;Zhang et al., 2018Zhang et al., , 2020bCao et al., 2020;Xu et al., 2020;Feng et al., 2020). A noteworthy support from the combination of RS and GIS technology gives efficiently solution in the several types of natural hazards related problems (Yang et al., 2018(Yang et al., , 2020aSun et al., 2021). Therefore, the geospatial technology, i.e., remote sensing and GIS has been providing high resolution multispectral satellite data and their optimal processing which is immensely help to analysis and solves several types of natural hazards related risk. The optimal data processing without any kinds of bias is done through machine learning algorithms and their computation analysis has been presented through the help of GIS platform (Pourghasemi and Rossi, 2017;Yang et al., 2015;Chen et al., 2019a;Zhu et al., 2019). Thus, in the present time, RS-GIS techniques and machine learning algorithms has been widely helped for optimal evaluation of many scientific problems (Yang et al., 2015;Wu et al., 2020). Preparing of LS hazard susceptibility mapping is an important and one of the key challenging tasks among the land use planners (Arabameri et al., 2021). Therefore, several researchers proposed various kinds of models to address this key challenges and there has been great interest in improving the prediction performance of the hazards related susceptibility mapping through ML models (Oh et al., 2019). It is also be mentioned here that no specific models can give an optimal result and have controversy among the researchers in this regards . The reliability and accurate result is the most predominate condition for the LS hazard susceptibility mapping and researchers tried to form new novel ensemble models to produced good outcomes (Reichenbach et al., 2018). A lot of ML methods have been applied in the previous past few years for the spatial probability map of various kind of environmental hazards (Arabameri et al., 2020c). The spatial analysis of LS indicates that subsidence specifically occurred in the flat areas particularly in the alluvial deposited land and agricultural areas located in arid regions (Herrera-García et al., 2021). A present time, ML algorithms and their ensemble methods have been applied in various fields for the susceptibility mapping and it has been shown to be effective in terms of predictive performance (Nguyen et al., 2019;Arabameri et al., 2020c;Feng et al., 2020;Liu et al., 2020;Zhang et al., 2020a;Saha et al., 2021b). Particularly, ensemble models always enhanced the output result by integrated the several ML algorithms (Mojaddadi et al., 2017;Arabameri et al., 2020d;Saha et al., 2021a).
Thus, based on the literature survey and local geoenvironmental conditions, different LSCFs have primarily selected to perform the LSSM in this present study area. After that multi-collinearity assessment studies was carried out and based on the result of multi-collinearity, a total of 12 geo-environmental variables were selected for the LS susceptibility modeling (Chen et al., 2019b;Arabameri et al., 2020d). Thereafter, ML algorithms of GLM, MaxEnt, SVM, and ANN were used and 11 possible ensemble models were developed to mapping LS susceptibility analysis. GLM or the logistic regression model is the most common statistical technique used for the prediction of landslide, flood, groundwater, gully erosion susceptibility. The most advantages of GLM is that assumes a linear relationship between a link function of the predictors and response. The presence-only feature can be considered as advantages of MaxEnt because the determination of non-land subsidence may result in uncertainty. SVM is a supervised based classification model and it is very capable of dealing with non-linear and high-dimensional grouping problems by use of the different SVM based function (Huang and Zhao, 2018). The ANN is an effective tool in a neural network, where the hidden and output layer nodes process their inputs (Lee et al., 2012) and successfully applied in this study. Ensemble models rapidly applied for the susceptibility modeling, but some author reported that ensemble models have better accuracy than the single models (Pham et al., 2019;Arabameri et al., 2020b;Chowdhuri et al., 2020b) consequently, some study reported that stand-alone ML models have better accuracy (Althuwaynee et al., 2014).
In this study, we used a total of 15 models for land subsidence modeling, but the ANN model was the best model by the success rate of accuracy (AUC = 0.924) and the success rate AUC obtained from the training datasets. On the other hand, the SVM land subsidence hazard susceptibility model was the best model by the prediction accuracy rate (AUC = 0.823) and the predictive accuracy rate obtained from the validating datasets. An ANN model is based on the non-linear statistical analysis of a given dataset and evaluation on the basis of observed coherence network dynamics. Thus, ANN gives the highest accuracy in the training dataset. On the other side, SVM model try to relocate the idea based on the kernel function using unsupervised function (Smits and Jordaan, 2002) and hence, gives the better performance in validation dataset. The accuracy of the models in success and prediction rate has been analyzed through the ROC-AUC curve. The ROC curve is a quantitative model evaluator successfully used for model performance evaluation in most of the studies (Su et al., 2017;Arabameri et al., 2020d). The graphical presentation of the ROC-AUC curve created its most suitable model evaluator. The ROC-AUC curve result demonstrated that the 15 models performed well, but the SVM and ANN-SVM models have made the best prediction of the gully erosion. Another study of LS in the Kashmar region, Iran based on MaxEnt models gives the result of ROC-AUC is 88.9% (Rahmati et al., 2019a,b) which is significantly higher than the present study. Similarly, LS susceptibility studies in the Semnan province of Iran using ANN models gives the result of ROC-AUC is 0.919 (Arabameri et al., 2020d) which is less than the our study result (AUC = 0.924). Therefore, studies indicate that the same ML models give different result in accuracy assessment from region to region, depending upon the local geo-environmental factors.
Thus, in this study, the combination of remote sensing and GIS techniques along with ML algorithms has given the optimal result for land subsidence susceptibility mapping. Among the four ML algorithms, SVM gives the most optimal prediction performance outcome than the others. Therefore, based on the output maps of LS resulting from hybrid ML algorithms will be very much helpful to the land use planners and policy makers for sustainable management and uses of land resources. The land degradation process through LS is also control through taken proper management strategies.

CONCLUSION
The phenomenon of LS is one of the economic threats among the global people through the land degradation processes usually induced by human misuse. Therefore, proper assessment and management of this kind of natural hazard is crucial for sustainable development of any region. Hence, for the purpose of land management it is necessary to identification, modeling, assessment and analysis, and in the present research study it has been carried out in the Kashan plain, Iran. Here, ML algorithms of GLM, ANN, MaxEnt, SVM and their 11 possible ensemble classifier models were used for LS susceptibility modeling and mapping. The final result indicates that the ANN model is the best in training phase among the 15 models. But in the prediction accuracy SVM model is the best among all models. Furthermore, the consistency between the LSSMs was mentioned properly and there are no such differences between the susceptibility zones. Additionally, maximum ensemble models from the four ML models were developed in this study and in near future several others ML models can be used to compare and evaluate the optimal result. Not only this, based on the optimal output of this study, these ensemble models can be used in further research studies such as several environmental hazards potentiality mapping and prediction. The best outcomes of the study are land subsidence susceptibility maps which will help in the local administrations and decision-makers in land use planning and proper management of land resources. As we know every research study have some limitations, therefore this study also carried some limitation in terms of using limited LS causative factors and lack of hydrological modeling. Another side, the strength of this research study is the quality of the ML modeling and their optimal prediction result.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
AA: conceptualization, methodology, software, validation, formal analysis, visualization, and resources. AA and SL investigation. OA: data curation. AA and SL: writing-original draft preparation. AA, SC, AS, IC, and HM: writing-review and editing. AA and SL: supervision. SL: funding. All authors have read and agreed to the published version of the manuscript.