Reducing Local Correlations Among Causal Factor Classifications as a Strategy to Improve Landslide Susceptibility Mapping

A landslide susceptibility map (LSM) is the basis of hazard and risk assessment and guides land planning and utilization, disaster early warning, etc. Researchers are often overly keen on hybridizing state-of-the-art models or exploring new mathematical susceptibility models to improve the accuracy of the susceptibility map as measured by the receiver operating characteristic curve. Correlation analysis of the causal factors is a routine step before susceptibility modeling to ensure that the overall correlation among all factors is low. However, this overall correlation analysis is insufficient to detect a high local correlation among the causal factor classes. The objective of this study is to answer three questions: 1) Is there a high correlation between causal factors in some parts locally? 2) Does it affect the accuracy of landslide susceptibility assessment? and 3) How can this influence be eliminated? To this end, Wanzhou County was taken as the test site, where landslide susceptibility assessment based on 12 causal factors has been previously performed using the frequency ratio (FR) model and random forest (RF) model. In this work, we conducted a local spatial correlation analysis of the “altitude” and “rivers” factors and found a sizeable spatial overlap between altitude-class-1 and rivers-class-1. The “altitude” and “rivers” factors were reclassified, and then the FR model and RF model were used to reevaluate the susceptibility and analyze the accuracy loss caused by the local spatial correlation of the two factors. The results demonstrated that the accuracy of LSMs was markedly enhanced after reclassification of “altitude” and “rivers,” especially for the RF model–based LSM. This research sheds new light on the local correlation of causal factors arising from a particular geomorphology and their impact on susceptibility.


INTRODUCTION
The landslide susceptibility map represents the spatial probability of landslide occurrence, is the basis for landslide hazard and risk assessment (Fell et al., 2008; Pellicani et al., 2017), and is used in practice for land planning (Cascini, 2008; Chen et al., 2019), quantitative risk analysis (Chen et al., 2016; Yan et al., 2020), early warning systems (Segoni et al., 2018; Rosi et al., 2021), etc. In the past several decades, hazard susceptibility assessment has been a research focus at a range of spatial scales, including local-scale, basin-scale (Bueechi et al., 2019; Huang et al., 2021a), and national-scale (Bălteanu et al., 2020). The relationship between existing landslides and their causal factors is modeled to obtain the landslide probability for the whole study area, which is the basic framework of landslide susceptibility. Internal geological and external environmental factors are the main drivers of landslides, characterized by altitude, slope, aspect, lithology, curvature, human engineering activities, rivers, traffic, etc. (Xiao et al., 2019). In recent years, to improve the accuracy of susceptibility evaluation, many new statistical methods (Segoni et al., 2016; Reichenbach et al., 2018), machine learning methods (Catani et al., 2013; Lagomarsino et al., 2017; Huang et al., 2020), and hybrid models (Rossi et al., 2010; Shirzadi et al., 2017; Huang et al., 2021b) have been introduced in susceptibility mapping.
After the susceptibility calculation, a receiver operating characteristic (ROC) curve is typically used for accuracy analysis (Xiao et al., 2020). The model with the highest area under the curve (AUC) is considered the most suitable model for the test site (Canavesi et al., 2020; Sun et al., 2020) and, at the same time, provides a reference for other research areas. Researchers are overly keen on hybridizing state-of-the-art models (Schicker and Moon, 2012; Kornejady et al., 2018; Luo and Liu, 2018) or exploring new mathematical susceptibility models (Paryani et al., 2020; Wu et al., 2020), often ignoring the interrelationships between causal factors. Each study area has its specific geomorphological features. By analyzing the correlation of the causal factors, factors with high overall correlation are excluded (Mind'je et al., 2020; Zhao and Chen, 2020). However, the remaining causal factors may be highly correlated in some micro-topographic parts of the area, which cannot be detected by the overall correlation analysis and has not been mentioned in the literature. Given this, several issues need to be discussed: Is there a high correlation between causal factors in some parts locally? Does it affect the accuracy of landslide susceptibility assessment? How can this influence be eliminated?
In Wanzhou County, Chongqing, China, the Yangtze River flows through the entire area from southwest to northeast, causing many landslides along both banks (Yang et al., 2017; Wang et al., 2019; Huang et al., 2021). Both sides of the Yangtze River are highly susceptible to landslides, and the region is characterized by low elevation and proximity to rivers (Yang et al., 2018; Deng et al., 2021; Hu et al., 2021). Therefore, it is necessary to explore whether the "altitude" and "rivers" factors are highly correlated in the region and how this influences susceptibility mapping. This study aims to show that local spatial correlation among causal factors can exist and can reduce the accuracy of susceptibility mapping. Taking Wanzhou County as an example, we conducted a local spatial correlation analysis of "altitude" and "rivers" in the study area to discuss their valid contribution to susceptibility. The "altitude" and "rivers" factors were reclassified; then, the frequency ratio (FR) model and random forest (RF) model were used to reevaluate the susceptibility and analyze the accuracy loss caused by the local spatial correlation of these factors. The results shed new light on local correlations of factors arising from a particular geomorphology and their impact on susceptibility.
The study area extends into the subtropical humid monsoon zone and features a mild climate with abundant sunshine and mean annual precipitation of 1,191.3 mm, mainly concentrated from May to September (about 90% of the yearly rainfall). During summer, the rain is characterized by short and intense rainstorms (up to 100 mm/day). The Yangtze River runs throughout the study area from southwest to northeast, and 93 large and small streams form a complex surface runoff network. The elevation gradually decreases from east to west, forming a hilly landscape, with an overall step-like morphology formed by multilevel fluvial terraces, which resulted from the combination of repeated tectonic uplift stages and the Yangtze River erosion. According to the information provided by Chongqing Natural Resources Bureau, more than 600 landslides were identified in the study area. Since the impoundment of the Three Gorges Reservoir in 2003, many dormant landslides have been reactivated, mainly triggered by water level fluctuation and rainfall. The well-known Anlesi Landslide, Caojiezi Landslide, and Taibaiyan Landslide are all ancient landslides with a volume of more than 10 million cubic meters, and they all developed in subhorizontally dipping sandstone and mudstone interbedded strata.
The bedrock lithology encompasses sandstones, mudstones, shales, and limestones (Table 1), with nearly horizontal stratifications. Extending from both sides of the Yangtze River, the outcropping bedrock mainly increases in age from Triassic to Jurassic (2.3-137 Ma), with sporadic Permian (299-252 Ma) and Quaternary bedrock (from 2.5 Ma). The middle Jurassic Shaximiao Group, consisting of alternating layers of sandstone and mudstone, is the most widely distributed geological unit.

INPUT DATA AND METHODOLOGY
Frontiers in Earth Science | www.frontiersin.org November 2021 | Volume 9 | Article 781674
Modeling Algorithms
1) Frequency ratio (FR) model. The frequency ratio model is a relatively simple statistical model (Kumar and Anbalagan, 2015). Each factor is classified according to a specific method, and the contribution degree of each factor category is calculated based on statistical analysis. The combined contribution degrees of all factors form the Landslide Susceptibility Index (LSI), and the formula is
$$\mathrm{FR} = \frac{S_1/S}{A_1/A}, \qquad \mathrm{LSI} = \sum_{i=1}^{n} \mathrm{FR}_i$$
where S_1 is the landslide area within the classification, S is the area within the classification, A_1 is the total landslide area of the study area, A is the total area of the study area, and FR_i is the frequency ratio of the class that a location falls into for the i-th of the n causal factors.
2) Random forest (RF) model. The random forest model is a nonparametric multivariate technique based on an ensemble learning algorithm. It was proposed by Breiman and has been widely used in various research fields because of its excellent performance, including landslide susceptibility evaluation (Breiman, 1996a, 1996b, 2001). The random forest model is considered a relatively effective method for classification, regression, and unsupervised learning. It contains a number of classification trees for prediction, and these trees are generated randomly by using "bagging" to produce multiple independent training sets. The main advantages of this model are as follows: it is suitable for analyzing nonlinear variables without considering multicollinearity and is robust to outliers; it can deal with high-dimensional data, handles both discrete and continuous data, and has no fixed standardization requirements for the input data set; it processes data quickly and provides a ranking of variable importance; and compared with other models, it is strongly resistant to noise.
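The frequency ratio defined for the FR model above can be illustrated with a minimal sketch. The function names and the three toy classes are hypothetical and are not taken from the study; summing one FR value per causal factor to obtain the LSI is the usual convention and is assumed here.

```python
def frequency_ratio(landslide_area, class_area, total_landslide_area, total_area):
    """FR = (S1 / S) / (A1 / A) for one class of one causal factor."""
    if class_area <= 0 or total_landslide_area <= 0 or total_area <= 0:
        raise ValueError("class area, total landslide area and total area must be positive")
    return (landslide_area / class_area) / (total_landslide_area / total_area)


def landslide_susceptibility_index(fr_values):
    """LSI of one location: sum of the FR values of the classes it falls into.

    Assumption: one FR value per causal factor, combined by simple summation.
    """
    return sum(fr_values)


# Toy factor with three classes: (landslide area S1, class area S), arbitrary units.
# These numbers are synthetic and only illustrate the arithmetic.
toy_classes = {
    "class-1": (30.0, 200.0),
    "class-2": (10.0, 300.0),
    "class-3": (10.0, 500.0),
}
total_landslide_area = sum(s1 for s1, _ in toy_classes.values())  # A1
total_area = sum(s for _, s in toy_classes.values())              # A

fr = {
    name: frequency_ratio(s1, s, total_landslide_area, total_area)
    for name, (s1, s) in toy_classes.items()
}

for name, value in fr.items():
    print(f"{name}: FR = {value:.2f}")
```

An FR above 1 means the class holds a larger share of the landslide area than of the total area; an FR below 1 means the opposite.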

Input Data and Methodology
Twelve landslide susceptibility causal factors of Wanzhou County and two models, namely, frequency ratio (FR) and random forest (RF), are used in this research. In the study area, massive landslides were induced by the Yangtze River, heavily skewing the landslide distribution toward lower altitudes. The altitude range of the study area is 120-1,656 m, divided into six classes: 120-350, 350-500, 500-700, 700-900, 900-1,100, and 1,100-1,656 m (Table 3). According to their scale, the water systems were divided into three types: I) the main stem of the Yangtze River, II) secondary tributaries of the Yangtze River, and III) seasonal streams. The influence of a river on landslide development depends on the type of river and the distance from the slope to the river. The rivers factor was divided into five classes by distance to each water system, as shown in Table 4.
In the previous susceptibility evaluation, the Spearman correlation coefficient between altitude and rivers was only −0.14 (Table 5), indicating that the overall correlation between these two factors was low. The altitude-class-1 zone (less than 350 m) has the highest frequency ratio contribution (Table 2), which the initial analysis attributed to the effect of rivers. The water level of the Yangtze River reservoir fluctuates between 145 and 175 m, affecting slopes mostly below 350 m, so landslide occurrence tends to vary with altitude. After in-depth consideration of the causal factors in the study area, it was found that river development is highly related to topographic elevation, so there may be a considerable spatial overlap between the altitude-class-1 zone and the rivers-class-1 zone.
Therefore, three issues arise: Is there a high correlation between the altitude-class-1 and rivers-class-1 zones? Does it affect the accuracy of landslide susceptibility assessment? How can this influence be eliminated?
Exploring and answering these three issues are the main research objectives of this study. The research approach includes the following steps:
- First, altitude-class-1 and rivers-class-1 were divided into three zones: a, b, and c. As shown in Figure 2, "a" is the common area of altitude-class-1 and rivers-class-1, and "b" and "c" are the separate areas of altitude-class-1 and rivers-class-1, respectively. The frequency ratios of landslides in zones a, b, and c were counted and compared with those of altitude-class-1 and rivers-class-1 to reflect the actual contribution of the two factors. This step answers the question of whether there is a high correlation between the altitude-class-1 and rivers-class-1 regions.
- The altitude and rivers factors were reclassified, and then the susceptibility of Wanzhou County was re-evaluated. The altitude was divided into seven classes, where classes 2 to 6 remained the same, and class-1 was split into class-1a and class-1b. The rivers factor was divided into six classes, where classes 2 to 5 were left as they were, and class-1 was split into class-1a and class-1c. Altitude-class-1a and rivers-class-1a are, spatially, exactly the same area. Susceptibility was reassessed using the FR and RF models based on the reclassified altitude and rivers factors and the original ten other causal factors. This step can be considered a preliminary stage that directly illustrates the impact on the accuracy of the susceptibility evaluation while providing quantitative data for analysis in a further step.
- Quantitative and pixel-by-pixel analysis of susceptibility maps: The receiver operating characteristic (ROC) curve was used to verify the accuracy of the susceptibility results, and a pixel-by-pixel comparison was used to locate where the susceptibility map changed after factor reclassification.
Figure 3 shows the distribution of landslides in the altitude-class-1 and rivers-class-1 areas. The dark gray "a" zone represents the common area of altitude-class-1 and rivers-class-1, while the orange "b" and blue "c" zones are the separate areas of altitude-class-1 and rivers-class-1, respectively. All landslides in the study area are superimposed on the map as black raster cells, showing the differential distribution of landslides in areas a, b, and c. Landslides in the gray area are visibly fewer than those in the dark gray and blue areas. As a quantitative comparison, landslide frequency ratio statistics were computed for each of areas a, b, and c (Table 6). The data show that the frequency of landslide distribution in areas a, b, and c varies greatly.
FIGURE 3 | Spatial distribution of altitude-class-1 and rivers-class-1.
The landslide frequency ratio in the common area a is 2.72, the landslide frequency ratio in altitude-class-1 rises from 2.98 to 3.49 after removing area a, and the landslide frequency ratio in rivers-class-1 plummets from 1.41 to 0.46 after removing area a. It can be tentatively inferred that the common area of altitude-class-1 and rivers-class-1 to some extent distorts the assessment of the actual contribution of the altitude and rivers factors to landslide development. That is, the initially calculated landslide frequency ratios of altitude and rivers are not entirely reliable. "Altitude-class-1" was reclassified into "altitude-class-1a" and "altitude-class-1b," while "rivers-class-1" was divided into "rivers-class-1a" and "rivers-class-1c." Table 7 shows the original classes and new classes, including the percentage of domain in the total domain (PDTD) and the frequency ratio contribution of each class. A Coxcomb chart (Figure 4) presents the same information as Table 7. The arc of each sector represents the PDTD of the class, and its radius stands for the FR value. The red stripes represent the original class-1, and the reclassified areas 1a and 1b (1c) are indicated in blue and green, respectively, so that the length of the sector radius reflects the contribution of each area to landslide development. It is evident from Figure 4 that the landslide frequency distribution in class-1 is not uniform, especially for the "rivers-class-1" area: the contribution of "rivers-class-1a" far exceeds the average contribution of "rivers-class-1." In contrast, the actual contribution of "rivers-class-1c" is minimal. It follows that a reclassification of the area was necessary to better reflect the contribution of causal factors to landslides.
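The zone split and the per-zone frequency ratios described above can be sketched with plain set operations. This is a minimal illustration on a synthetic 100-pixel domain, assuming each class-1 zone and the landslide inventory are available as sets of pixel ids; the function names and all numbers are hypothetical and do not reproduce Table 6.

```python
def split_class1_zones(altitude_class1, rivers_class1):
    """Return (a, b, c): shared pixels, altitude-only pixels, rivers-only pixels."""
    a = altitude_class1 & rivers_class1
    b = altitude_class1 - rivers_class1
    c = rivers_class1 - altitude_class1
    return a, b, c


def zone_frequency_ratio(zone, landslides, n_domain_pixels):
    """Landslide density inside the zone divided by the density over the whole domain."""
    if not zone:
        raise ValueError("zone is empty")
    zone_density = len(zone & landslides) / len(zone)
    domain_density = len(landslides) / n_domain_pixels
    return zone_density / domain_density


# Synthetic toy domain with pixel ids 0..99; none of this is study data.
N_PIXELS = 100
altitude_class1 = set(range(0, 30))    # low-altitude pixels
rivers_class1 = set(range(20, 50))     # pixels closest to rivers
landslides = {0, 1, 2, 3, 4, 5, 20, 21, 22, 30}

zone_a, zone_b, zone_c = split_class1_zones(altitude_class1, rivers_class1)

fr_by_zone = {
    "altitude-class-1": zone_frequency_ratio(altitude_class1, landslides, N_PIXELS),
    "rivers-class-1": zone_frequency_ratio(rivers_class1, landslides, N_PIXELS),
    "a (shared)": zone_frequency_ratio(zone_a, landslides, N_PIXELS),
    "b (altitude only)": zone_frequency_ratio(zone_b, landslides, N_PIXELS),
    "c (rivers only)": zone_frequency_ratio(zone_c, landslides, N_PIXELS),
}

for name, value in fr_by_zone.items():
    print(f"{name}: FR = {value:.2f}")
```

In this toy case, the frequency ratio of the whole rivers class-1 is pulled up by the shared zone a, while the rivers-only zone c on its own has a much lower value, which is the kind of masking effect the reclassification is meant to expose.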
To verify the effects of reclassifying "altitude-class-1" and "rivers-class-1," the system of 12 causal factors from the previous susceptibility assessment (Table 2) was used again for the landslide susceptibility assessment in this test. Except for altitude and rivers, the remaining ten causal factors retained their previous classification.

RESULTS
The LSM of Wanzhou County was recalculated using the FR model and RF model based on the improved factors; then, the area under the receiver operating characteristic (ROC) curve (AUC) was applied to evaluate the accuracy of each result. The ROC curve mainly reflects the change in the number of landslides in each susceptibility interval from high to low. As shown in Figure 5, after reclassification of altitude-class-1 and rivers-class-1, the accuracy of the LSM based on the FR model improved by 0.5 percentage points (from 72.8% to 73.3%), and the accuracy of the LSM based on the RF model improved markedly, by 5.1 percentage points (from 79.9% to 85.0%).
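As a minimal sketch of how an AUC value of this kind can be computed, the function below uses the pairwise-comparison form of the area under the ROC curve: the fraction of (landslide, non-landslide) pixel pairs in which the landslide pixel receives the higher susceptibility value, with ties counted as one half. The sampling of validation pixels used in the study is not described here, so the scores and labels are purely synthetic and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: susceptibility values, higher means more susceptible.
    labels: 1 for landslide pixels, 0 for non-landslide pixels.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one landslide and one non-landslide pixel")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Synthetic example: three landslide pixels and three non-landslide pixels.
toy_scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(f"AUC = {toy_auc:.3f}")
```

The double loop is quadratic in the number of pixels, which is fine for a toy example; for full rasters a rank-based implementation would be the more practical choice.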
The LSM was divided into 10 zones at 10% intervals according to the susceptibility value (i.e., the landslide probability of occurrence), and the number of landslide pixels and the number of all pixels in each zone were counted pixel by pixel. The number of landslide points clearly increases with the susceptibility value (Figure 6A). For the two models, the percentage of landslides falling in the top 20% interval of the occurrence probability increased by 8.1% (FR model, from 18.10% to 26.2%) and 24.87% (RF model, from 24.2% to 48.98%), respectively. In contrast, pixels overall were primarily located in zones with a susceptibility value below 40% (Figure 6B).
The susceptibility value was divided into five zones by equal interval: very low (0-20%), low (20-40%), moderate (40-60%), high (60-80%), and very high (80-100%). The landslide statistics of different susceptibility levels are shown in Table 8 and Figure 7. The frequency ratio value for the very high susceptibility areas varied considerably. The frequency ratio value based on the FR model increased from 4.09 to 4.64, and the value based on the RF model increased from 4.10 to 7.23.
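The equal-interval zoning and the per-level frequency ratio can be sketched as follows. This is an illustration on synthetic values only; the treatment of values lying exactly on a class boundary (assigned to the upper level here, with 1.0 kept in "very high") is an assumption, since the text does not state it, and the function names are hypothetical.

```python
from collections import Counter

LEVELS = ["very low", "low", "moderate", "high", "very high"]


def susceptibility_level(value):
    """Map a susceptibility value in [0, 1] to one of five equal-interval levels."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("susceptibility value must lie in [0, 1]")
    index = min(int(value * 5), 4)  # 1.0 stays in the top level
    return LEVELS[index]


def level_frequency_ratios(values, is_landslide):
    """FR per level: landslide share of the level divided by landslide share overall."""
    totals = Counter()
    slides = Counter()
    for value, flag in zip(values, is_landslide):
        level = susceptibility_level(value)
        totals[level] += 1
        slides[level] += int(flag)
    n_landslides = sum(slides.values())
    if n_landslides == 0:
        raise ValueError("no landslide pixels in the input")
    overall_density = n_landslides / sum(totals.values())
    return {
        level: (slides[level] / totals[level]) / overall_density
        for level in LEVELS
        if totals[level] > 0
    }


# Synthetic pixels: susceptibility value and landslide flag (1 = landslide).
toy_values = [0.05, 0.10, 0.15, 0.30, 0.50, 0.70, 0.90, 0.95]
toy_flags = [0, 0, 0, 0, 0, 1, 1, 1]
toy_fr = level_frequency_ratios(toy_values, toy_flags)

for level, value in toy_fr.items():
    print(f"{level}: FR = {value:.2f}")
```

A well-performing map concentrates landslide pixels in the upper levels, so the FR of the "very high" level rising after reclassification is the behavior reported above.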
The above results demonstrated that the accuracy of the very high susceptibility zone was markedly enhanced after reclassification of "altitude-class-1" and "rivers-class-1," especially for the RF model-based LSM.

DISCUSSION
FIGURE 6 | Distribution of points versus the landslide probability of occurrence. (A) Landslide points; (B) all pixels in the domain.
The two LSMs based on the RF model are shown in Figure 8. Although the improved LSM has a 5.1% higher AUC, it is not easy to see the difference when comparing these two maps with the naked eye. The two maps were therefore subtracted pixel by pixel to obtain a comparison map of their differences (Figure 9). Since the raster value of each susceptibility map is between 0 and 1, the value of the comparison map could potentially range from −1 to 1. A simple visual inspection of Figure 9 reveals apparent differences between the two susceptibility maps. The values in Figure 9 range from −0.9731 to 0.9482, with pure blue representing −1, pure red representing 1, and a gradual blue-yellow-red transition between −1 and 1. Most importantly, the differences between the two LSMs are not evenly distributed, and some spatial patterns of rivers can be recognized in the comparison map. Following the method proposed by Xiao et al. (2020) for understanding and interpreting the different results of LSM, the values of the comparison map were split at ±0.5 into three classes, namely, "underestimation" (UN), "approximation" (APR), and "overestimation" (OV). Table 9 shows the range of values and percentages for each classification. In total, 97.13% of the comparison map pixels are located in the APR region, and only scattered pixels are UN or OV.
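The three-way classification of the comparison map can be sketched as below. Two points are assumptions rather than statements from the text: the difference is taken as original minus improved, so that a strongly negative value means the original map underestimated the susceptibility, and values exactly at ±0.5 are assigned to APR. The pixel values are synthetic and the function names are hypothetical.

```python
from collections import Counter


def classify_difference(original, improved, threshold=0.5):
    """Label one pixel as UN, APR or OV from the two susceptibility values.

    Assumed sign convention: difference = original - improved.
    """
    difference = original - improved
    if difference < -threshold:
        return "UN"   # original map underestimated the susceptibility
    if difference > threshold:
        return "OV"   # original map overestimated the susceptibility
    return "APR"      # the two maps approximately agree


def class_percentages(original_map, improved_map, threshold=0.5):
    """Percentage of pixels in each class for two equally sized lists of values."""
    if len(original_map) != len(improved_map) or not original_map:
        raise ValueError("maps must be non-empty and of equal length")
    counts = Counter(
        classify_difference(o, i, threshold)
        for o, i in zip(original_map, improved_map)
    )
    return {label: 100.0 * counts[label] / len(original_map) for label in ("UN", "APR", "OV")}


# Synthetic susceptibility values for ten pixels of the two maps.
toy_original = [0.10, 0.20, 0.90, 0.50, 0.40, 0.30, 0.60, 0.70, 0.80, 0.15]
toy_improved = [0.80, 0.25, 0.20, 0.55, 0.45, 0.35, 0.50, 0.75, 0.85, 0.10]
toy_shares = class_percentages(toy_original, toy_improved)
print(toy_shares)
```

Counting the UN and OV labels separately for each class of a causal factor, as done for the rivers factor in the following paragraphs, only requires grouping the same labels by the class id of each pixel.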
To explore the critical class of the rivers factor that led to differences between susceptibility maps, a simple count of the UN and OV points for each class of rivers was performed (Table 10). In the statistics of Table 10, rivers-class-1a accounts for only 9.68% of the total area, but it contains 26.53% of the UN pixels. Meanwhile, rivers-class-1c accounts for only 13.27% of the total area, but it contains 38.16% of the OV pixels.
In the original RF model-based susceptibility assessment, rivers-class-1 was not differentiated into area 1a and area 1c. This statistical result indicates that, in the original LSM, the susceptibility value is underestimated in rivers-class-1a and overestimated in rivers-class-1c. The deviation of the susceptibility results is consistent with that found in the factor contribution analysis (Table 7; Figure 4B). In the rivers-class-1a area, the landslide contribution was underestimated, and so were the calculated susceptibility values. For rivers-class-1c, both the landslide contribution and the susceptibility value were overestimated. After reclassifying the rivers factor, the RF model improved the LSM accuracy in the rivers-class-1 area, thus improving the accuracy in the high susceptibility area and the whole area.
Rivers-class-1a and rivers-class-1c are shown together with the UN and OV pixels in Figure 10. In Figure 10A, the rivers-class-1a area is marked in yellow, the rivers-class-1c area is indicated in blue, and the other classes are uniformly shown in light gray. UN and OV pixels are displayed in black and red, respectively, scattered sporadically throughout the study area. Zooming in on the two regions of Figures 10B,C, one can clearly see that the red OV pixels tend to be distributed on class-1c, again in agreement with the statistics of Table 10.
Previous studies of landslide susceptibility have included correlation analysis of the causal factors, but only for each causal factor as a whole. The study in this work demonstrated the existence of a high local correlation between classifications of altitude and rivers. In other words, the high local correlation of factor classifications cannot be detected by the overall correlation analysis. In this study, the conjecture about altitude and rivers comes entirely from the in-depth knowledge of the topography and river system in the study area. On the basis of this conjecture, a local correlation analysis and a quantitative study of its effect on the accuracy of LSM were performed. The results show that the high local correlation of altitude and rivers factors does exist and truly affects the accuracy of LSM. Meanwhile, a simple reclassification of factors can eliminate this effect and improve the accuracy of LSM.

CONCLUSION
This study shows that local correlation of causal factors can exist and can reduce the accuracy of susceptibility assessment. A simple method of factor reclassification was proposed to improve the accuracy of LSM effectively. Wanzhou County was taken as the test site, where landslide susceptibility assessment based on 12 causal factors had previously been completed using the FR model and RF model. In this work, we conducted a local spatial correlation analysis of the "altitude" and "rivers" factors and found a large spatial overlap between altitude-class-1 and rivers-class-1. "Altitude-class-1" was reclassified into "altitude-class-1a" and "altitude-class-1b," while "rivers-class-1" was divided into "rivers-class-1a" and "rivers-class-1c," where "altitude-class-1a" was spatially identical to the "rivers-class-1a" area. The FR model and RF model were used to reevaluate the susceptibility. The area under the receiver operating characteristic curve (AUC) was applied to evaluate the accuracy of each LSM. The results demonstrated that the accuracy of LSMs was markedly enhanced after reclassification of "altitude-class-1" and "rivers-class-1," especially for the RF model-based LSM. A pixel-by-pixel comparison of the two LSMs based on the RF model was performed and visually inspected against rivers-class-1. In the previous susceptibility mapping, the calculated susceptibility value in the rivers-class-1a area tends to be underestimated, and the opposite is seen for the rivers-class-1c area. This research sheds new light on the local correlation of causal factors arising from a particular geomorphology and their impact on susceptibility.
Finally, the following points can be summarized for the cases in this study.
- The overall correlation between the altitude and rivers factors is low, but there is a considerable spatial overlap between altitude-class-1 and rivers-class-1. The presence of this common overlap area has led to the underestimation and overestimation of the contribution of altitude-class-1 and rivers-class-1 to landslides, respectively, in previous susceptibility assessments.
- The accuracy of the LSMs was improved by 0.5% (FR model) and 5.1% (RF model) after reclassification of "altitude-class-1" and "rivers-class-1," with the largest gain in the very high susceptibility zone of the RF model-based LSM.
- Since the FR model does not consider the weight coefficients of the causal factors, the FR model-based LSM is not sensitive enough to the reclassification of the altitude and rivers factors. The RF model performs better not only in modeling the relationship between causal factors and landslides but also in distinguishing the differences of each factor class.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
TX organized and analyzed the data and wrote the manuscript, LW provided and analyzed the data, LY analyzed the data and wrote the manuscript, WT and TX were responsible for the project, and CZ analyzed the data. All authors have read and agreed to the published version of the article.