Climatic and seismic data-driven deep learning model for earthquake magnitude prediction

The effects of global warming are felt not only in the Earth's climate but also in the geology of the planet. Modest variations in stress and pore-fluid pressure brought on by temperature variations, precipitation, air pressure, and snow coverage are hypothesized to influence seismicity on local and regional scales. Earthquakes may therefore be anticipated by intelligently evaluating historical climatic datasets and earthquake catalogs collected all over the world. This study attempts to predict the magnitude of the next probable earthquake by evaluating climate data along with eight mathematically calculated seismic parameters. Global temperature has been selected as the only climatic variable for this research, as it substantially affects the planet's ecosystem and civilization. Three popular deep neural network models, namely, long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM), and transformer models, were used to predict the magnitude of the next earthquakes in three seismic regions: Japan, Indonesia, and the Hindu-Kush Karakoram Himalayan (HKKH) region. Several well-known metrics, namely, the mean absolute error (MAE), mean squared error (MSE), log-cosh loss, and mean squared logarithmic error (MSLE), have been used to evaluate these models. All models eventually converge to small values of these cost functions, demonstrating their accuracy in predicting earthquake magnitudes. These approaches produce significant and encouraging results when used to predict earthquake magnitude at diverse places, paving the way towards a robust prediction mechanism that has yet to be developed.


Introduction
Climate change is defined as an alteration in the climate as measured by statistical parameters such as the global mean surface temperature. The term "climate" here refers to the long-term pattern of meteorological conditions that has prevailed over the past three decades. Climate is made up of many factors, such as temperature, humidity, precipitation, air pressure, wind speed, evaporation, cloud cover, condensation, radiation, and evapotranspiration. The climate and temperature of Earth are increasingly influenced by both natural forces, such as variations in solar radiation, and human activities, such as the burning of fossil fuels and deforestation. Changes in the relative amounts of solar radiation and the Earth's emitted infrared radiation are the root causes of climate change. Observations indicate that the global temperature rose by 1.4°F (0.78°C) between 1900 and 2005 (Singh and Singh, 2012). Several additional climatic events, such as extreme heat waves, glacial melting, sea ice loss, soaring sea levels, frequent heavy rains and ocean acidification, are intimately related to global warming. The effects of global warming are not restricted to climate change alone; the entire planet is grappling with the effects of an energy imbalance. This imbalanced energy might manifest as isostatic rebound, a volcanic explosion, or an earthquake. Human activities have been the primary source of the planet's warming during the past few decades (Intergovernmental Panel on Climate Change, 2014).
According to records, there has been a significant rise in temperature in the Northern Hemisphere over the past 1,400 years (Pachauri et al., 2014), and the bulk of this warming has occurred in the previous three decades. Figure 1 displays the annual variation in global surface temperature relative to the 20th century average. Climate change has led to a rise in sea level, a decrease in ice cover, and exacerbation of severe meteorological conditions, such as intensified tropical cyclones and severe droughts. Increased emissions of greenhouse gases from power plants, industry, and automobiles exacerbate this warming, which impacts not only the Earth's climate but also the geology of the planet (Singh and Singh, 2012). This growth in emissions stems primarily from the burning of fossil fuels and growing urbanization, and the enhanced climatic forcing from rising concentrations of greenhouse gases perpetuates the warming (McCormick et al., 2007).
Climate change has sparked widespread concern among scientists and policymakers in recent years. As the Earth warms, the ice sheets of Greenland and the polar regions and the mountain glaciers melt, lessening the glacial load on the crust. The crust relaxes and rebounds as these glaciers melt. Glacial melting has also boosted the flow of glacial rivers, producing a massive outflow of water into the ocean that might disrupt the delicate equilibrium of plate tectonics on a global scale (Glick, 2011; Mara and Vlad, 2013). The solid Earth is unloaded by the decay of glacial ice sheets and caps, as shown in Figure 2; this unloading can cause crustal deformation and mantle expansion (Smiraglia et al., 2007; Pagli and Sigmundsson, 2008). The crust progressively rises owing to isostatic rebound following erosion or glacial melting, causing crustal deformation and tectonic motion (Larsen et al., 2005).
According to satellite data collected worldwide, glacier mass is decreasing in the high mountain regions of Alaska, coastal Greenland, arctic Canada, the southern Andes, and Asia. Furthermore, significant amounts of water are being discharged into the oceans (Kaser et al., 2006;Meier et al., 2007;Gardner et al., 2013;Abdullah et al., 2020). Figure 3 depicts a global summary of the World Glacier Monitoring Service's findings on the mass changes of selected glaciers (Global Glacier State -World Glacier Monitoring Service, 2021). This figure shows the changes in glacier mass over time, as measured in millimetres of equivalent water.
Since the Industrial Revolution, it has been largely believed that global warming has played a major role in rising global sea levels (Church and White, 2011). Furthermore, it is assumed that the decay of glacier ice and ocean thermal expansion played a significant role in global sea-level rise during the 20th century (Church and White, 2011). Figure 4 shows the rise in sea level since 1993 (black line). Thermal expansion (red line) and the added water owing mostly to melting glaciers (blue line) are two of the many contributing factors for which exact estimates are available. As shown in Figure 5, subsidence of the crust is initiated by the additional weight imposed by global sea-level rise. In addition, subsidence of the crust can promote plate tectonics to counterbalance the increased stress caused by the added seawater.
Earthquakes occur when the Earth's tectonic plates move as a result of a sudden and large release of internal energy. Earthquakes are among the most devastating natural disasters. They frequently strike without notice, giving people little time to prepare. In addition, earthquakes frequently trigger other natural disasters, such as surface fault rupture (Bray, 2001), tsunamis (Jain, Virmani, and Abraham, 2019), snow slides (Podolskiy et al., 2010), landslides (Keefer, 1984), soil liquefaction (Verdugo and González, 2015), and fires (Cassidy, 2013), which exacerbate the crisis. Devastating earthquakes cause deaths (Ambraseys and Melville, 1983), massive infrastructure damage (Bilham, 2009), social disruption, and rapid economic downturn (So and Platt, 2014). In the last two decades, earthquakes have caused more than half of all natural disaster-related fatalities (Bartels and VanRooyen, 2012). The devastating effects of significant earthquakes can be lessened with timely and accurate predictions that allow for the adoption of preventative measures. A reliable forecast indicates an earthquake's location, time, and magnitude, and such predictions can save many lives and resources. Although several strategies employing diverse input factors have been offered, such accurate forecasts are uncommon in past research (Otari and Kulkarni, 2012). Since the end of the 19th century, researchers in seismology and related disciplines have attempted to identify earthquake precursors, that is, unusual phenomena that typically occur before earthquakes. According to various studies, earthquakes can be predicted by observations of numerous precursors, such as temperature increases (Zandonella, 2001; Sadhukhan et al., 2021b; Sadhukhan et al., 2021c; Maji et al., 2021), ionospheric analysis (Pulinets, 2004), animal behaviour (Fidani, 2013; Cao and Huang, 2018), hydrogeological and gas geochemical analysis (Hartmann and Levy, 2005) and radon gas emissions (Petraki et al., 2015).
The majority of earthquake prediction techniques rely on the existence of particular precursors (Ikram and Qamar, 2014). In reality, however, these precursors often do not materialise before subsequent seismic occurrences, or they are hard to recognize, so such approaches may not always produce desirable outcomes. Because the precursors do not necessarily occur before every earthquake, it is exceedingly difficult to generalize and standardize these prediction systems. This has led to the proposal of novel methods for future earthquake prediction (Tiampo and Shcherbakov, 2012). Earthquake prediction can be of two types: long-term and short-term. Short-term prediction, that is, predicting earthquakes within the next several days, weeks, or months, is a very challenging task requiring a great deal of data and analysis; it ought to be reliable and accurate, with a minimum of false positives (Goswami et al., 2018), because short-term forecasts are commonly used to evacuate a region ahead of an earthquake. Long-term prediction is based on the timing and location of past earthquakes: the existing tectonic context, historical data, and geographical information are evaluated to determine where and how frequently earthquakes occur. Although less precise, it can help define building code standards and develop emergency response plans.
Earthquake prediction is a vital subject in seismology, since successful prediction can save lives, property, and infrastructure. However, because earthquakes are highly dynamic and appear spontaneous, prediction attempts often fail to provide favourable outcomes. Numerous technologies, such as mathematical analysis, artificial intelligence, and machine learning algorithms, have been proposed to address this issue, and many different approaches have been used in recent theoretical and practical investigations of earthquake prediction. Air ionization, radon migration, latent heat release, variations in surface temperature, air pressure, relative humidity, cloud formation, coupling with precipitation anomalies, radio wave propagation, and ionosphere and magnetosphere effects are all climate-associated variables that have been identified as potential precursors to future earthquakes (Daneshvar and Freund, 2017).
Earthquake prediction models perform admirably for earthquakes of moderate magnitude; however, the results obtained for large shocks are disappointing. Large earthquakes cause the most damage and concern, yet because high-magnitude earthquakes are uncommon, they are difficult to predict owing to a lack of appropriate evidence. After decades of effort, the seismology community has been unable to devise a reliable system for earthquake prediction. Prediction remains out of reach because current technology cannot precisely monitor stresses and pressure changes with instruments positioned beneath the crust; hence, relevant seismic data are always limited. A lack of multidisciplinary collaboration between seismology and computer science to accurately predict and quantify earthquake occurrences has also kept earthquake prediction a challenging endeavour to date. Owing to the extremely non-linear and complex geophysical processes that create earthquakes, no mathematical or empirical relationship exists between any physically recordable parameter and the timing, magnitude, or location of a future earthquake (Panakkat and Adeli, 2007).
The following is the outline of the study: The relevant prior research is outlined in Section 2. The research's data and methodology are described in Section 3, and its analytical methods are presented in Section 4. The deep neural network models employed in this research are outlined in Section 5, and their results are discussed in Section 6. Finally, Section 7 draws the necessary conclusion.

Related work
There is increasing evidence in the scientific community suggesting that climate change also contributes to geological occurrences such as tremors, tsunamis, and volcanic eruptions. In recent years, numerous researchers throughout the world have been striving to verify the effect of climate change on seismicity. The occurrence of earthquakes is assumed to be a random and extremely non-linear process, and no model exists that can predict the exact time, position, and magnitude of earthquakes. Numerous studies have been undertaken on earthquake occurrences and forecasts, yielding a variety of conclusions regarding the subject.

FIGURE 3
Variation in the cumulative mass of reference glaciers. The value is expressed as a metre water equivalent (mwe) compared to 1976. Source: World Glacier Monitoring Service (WGMS).
Several studies have revealed significant irregularities in climatic factors before large earthquakes. Satellite thermal imaging has revealed long- and short-term temperature anomalies preceding major earthquakes (Tronin et al., 2002; Pulinets et al., 2006; Jiao et al., 2018; Pavlidou et al., 2019). These transient anomalies might vary by 2°C-4°C between four and 20 days before an earthquake and gradually disappear thereafter (Ouzounov et al., 2007). Some unexpected anomalies have also been found above clouds in the atmosphere and in the lithosphere (Sasmal et al., 2021). A variety of factors, such as changes in the geophysics of the crust, contribute to the occurrence of these pre-seismic irregularities, which can be explained by the lithosphere-atmosphere-ionosphere coupling (LAIC) model (Pulinets and Ouzounov, 2011; Carbone et al., 2021).
Preliminary research was conducted on the aberrant variations in the enhanced surface latent heat flux and water vapour anomaly before the Colima and Gujarat earthquakes (Dey and Singh, 2003; Dey et al., 2004). Many different climate factors and processes may influence seismic activity. Changes in many critical climatic variables often precede severe earthquakes; these include surface latent heat flux, precipitation, wind speed, cloudiness and vertical air flow (Mansouri Daneshvar et al.). At many regional and temporal scales, comprehensive studies utilizing various time intervals and spatial resolutions indicate increased precipitation preceding large seismic events. Therefore, a significant positive relationship appears to exist between seismicity and precipitation prior to major shocks (R = 0.711).

FIGURE 4
Rise in global sea levels as seen through satellite altimeters, 1993-2018. Source: NOAA.

FIGURE 5
Land subsidence caused by additional seawater.
Conversely, seismic events can themselves be followed by climatic abnormalities, increased precipitation, and cyclone formation. Heavy rainfall across the seismic region within 5 days after a large earthquake was shown to be substantially correlated with such events (Zhao et al., 2021). Approximately 74.9% of earthquakes in China were followed by epicentral rainfall, while 86.6% of earthquakes were accompanied by rainfall over the seismic area. Rainfall is more prevalent in earthquake zones than the 30-year climatic trends would suggest, and earthquakes predominate during the monsoon season (Zhao et al., 2021).
Increased rainfall was also observed in Iran and the neighbouring Middle East region just before major earthquakes between 2002 and 2013 (Mansouri Daneshvar et al., 2014). The researchers investigated the geographical correlations between seismic occurrence and meteorological changes by grouping 39 significant earthquakes into eight seismological areas. They found moderate to high correlations (R²) between the preceding precipitation and the magnitudes and hypocentre depths of large earthquakes. Further studies indicated that rainfall has the capacity to anticipate earthquake sequences beginning at least three and a half months in advance. The estimated lagged correlation demonstrates a positive relationship between precipitation and subsequent earthquake occurrence days, with lags ranging from 3 to 103 days (Mansouri Daneshvar et al.). Even though earthquakes lead to all types of climatic anomalies, rainfall statistics are largely significant factors in estimating the place and magnitude of possible tremors. Extremely dry conditions (drought) often precede major earthquakes, and one or more years of above-average precipitation are then usually followed by tremors (Huang et al., 1979). Fluctuations in surface heat flow around the epicentre zone have been related to enhanced thermal energy at the Earth's surface, which is thought to be the cause of this anomalous precipitation. The increase in sensible heat flow aids the process of evapotranspiration, which produces atmospheric water vapour. This might lead to the formation of clouds and abnormal precipitation. Researchers have shown that semi-stationary linear cloud development is also associated with increased seismicity (Guangmeng and Jie, 2013; Thomas et al., 2015). Abnormal rainstorms over the epicentral area have also been documented prior to major earthquakes (Mullayarov et al., 2012; Daneshvar and Freund, 2017).
Some intriguing studies explicitly establish a correlation between the incidence of earthquakes and the increase in surface temperature. Begley (2006) claims that earthquakes occur when nucleation processes release significant amounts of stored energy along the fault plane. Reduced stress on the crust in response to glacial decay generates "isostatic rebound," which eventually results in fault resurfacing and increased seismicity. Numerous researchers have attempted to establish the connection between rising temperatures and seismic activity (Usman and Amir, 2009; Usman et al., 2011; Usman, 2016). The study regions included several glaciers; rising global temperatures caused these glaciers to melt, relieving pressure on the Earth below, so the crust rebounded and seismic activity may have increased. Most of the recorded earthquakes had Richter magnitudes between 3.0 and 3.9, and seismic activity continued to grow with rising temperatures. Further research showed that an increase in temperature is associated with an increase in the number of shallow earthquakes (between 0 and 80 km). A plausible correlation between rising earthquake activity and climate change was proposed by Mara and Vlad (2013). Glacial ice sheets cover 10% of Earth's entire crustal area; hence, any changes in their extent due to glacier decay would have significant implications for the planet's tectonic stability.
The rapid depletion of ice caps is another obvious effect of global warming. A study by McGuire (2013) suggests that the decline of ice sheets can also produce earthquakes. He asserted that a rise in global temperature over several decades triggered the melting of enormous, thick ice sheets, enabling the crust to bounce back. If global sea levels continue to rise, load-related crustal deformation at ocean basin margins may ultimately "unclamp" coastal faults. There is substantial evidence of a major relationship between climatic change and earthquakes during the transition from the previous ice age, notably in North America and Scandinavia. The environment of the Northern Hemisphere, notably Alaska, is significantly impacted by climate change and global warming (Hinzman et al., 2005). Sadhukhan et al. (2021c) studied the relationships between earthquake magnitudes and variations in global temperature by applying signal processing methods. Semblance analysis was used to verify the association between these two dynamics. The causality test reveals that the two dynamics are strongly connected, indicating that one may be predicted given the historical data of the other. The authors then employed a variety of statistical signal processing techniques to explore the multifractal, non-linear, and chaotic nature of the two dynamics: earthquake magnitude and global temperature variations (Sadhukhan et al., 2021b). A correlation study determined the degree of correlation between global earthquake frequency and global temperature changes (Maji et al., 2021). Additionally, RNN-based deep learning models have been used to verify the relationship between climate change and seismicity (Sadhukhan et al., 2021a).
Using statistical methods such as correlation and regression analysis, Masih (2018) investigated the correlation between climate change and the frequency of earthquakes. The study asserts that climate change due to global warming triggers the decline of glacial ice sheets, depressurization of the underlying rocks and reactivation of faults, thereby rendering the affected regions seismically active with frequent earthquakes. Molchanov (2010) used correlation analysis to explore the relationship between climatic change (temperature) and crustal seismicity and found that fluctuations in temperature and seismic activity exhibited comparable tendencies. Evidence was presented by Swindles et al. (2017) that glacial extent, driven by climate, has influenced the frequency of seismic events and volcanic activity in Iceland throughout millennia.

FIGURE 6
Cumulative count of earthquakes that occurred in (A) Indonesia, (B) Japan and (C) HKKH Region ordered by decreasing magnitude on a logarithmic scale (main). Frequency-Magnitude distribution in seismic catalogs of (A) Indonesia, (B) Japan and (C) HKKH Region (inset).

Based on the preceding discussions, an attempt has been made to predict the magnitude of the next probable earthquake by evaluating climate data along with eight mathematically calculated seismic parameters. Three widely used deep neural network models, namely, long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM), and transformer models, were used to predict the magnitude of future earthquakes in a given seismic region using climate data and eight seismic parameters calculated from a predefined number of past significant seismic events with magnitudes at or above a predefined threshold. Since global temperature has such a profound effect on the planet's ecosystems and civilization, it has been chosen as the single climatic variable for this analysis. Within each study region, the geological structure and features are broadly uniform, which makes it possible to build accurate models of the relationship between global temperature and the mathematically derived seismic parameters for predicting the magnitude of the next earthquake. The significance of this work is that these models can accurately predict the magnitude of the next approaching earthquake and achieve strong performance metrics across the forecast magnitude ranges.

Data and methods
Deep learning-based earthquake prediction research has been carried out in Indonesia, Japan and the Hindu-Kush Karakoram Himalaya (HKKH) region. Each of these places has a high frequency of earthquakes, making them suitable for earthquake prediction research. The underlying dataset for this research is a temporal series of historical seismicity for the indicated locations. Global temperature anomaly data extracted from the global land and ocean temperature anomaly dataset maintained by the National Oceanic and Atmospheric Administration (NOAA) of the United States Department of Commerce has been used as the experimental dataset (https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series). Additionally, historical seismicity data from the United States Geological Survey (USGS), which is publicly available at https://earthquake.usgs.gov/earthquakes/search/, have also been used for this investigation. This study explored both datasets from January 1921 to December 2020. The coordinate boundaries of these regions are shown in Table 1, and their catalogs are evaluated to calculate the magnitude of completeness. The minimum magnitude below which an earthquake catalogue is deemed incomplete is known as the catalogue's magnitude of completeness (Mc). Among the well-known catalog-based approaches for calculating Mc, fitting a Gutenberg-Richter model to the observed frequency-magnitude distribution has received much attention in recent works (Wiemer, 2000; Pavlenko and Zavyalov, 2022; Yuliastuti et al., 2022). This method has a major limitation in dealing with a small number of events in a catalog. In this study, the Gutenberg-Richter law of seismic magnitude distribution has been deployed to compute Mc, as all the seismic catalogues used span a wide time frame and a broad experimental study region that captures an enormous number of seismic events. This method sorts earthquakes into "bins" based on the number of occurrences with magnitudes exceeding a predetermined threshold. The count for each bin is then displayed on a logarithmic scale. If they were statistically ideal, the data would form a straight line. Although it is nearly impossible to obtain statistically perfect datasets, Mc can be estimated using this relationship: when a straight line is fitted to the data, the point where the data deviate from the line indicates the level of completeness. Figures 6A-C depict the cumulative count of earthquakes ordered by decreasing magnitude on a logarithmic scale for Indonesia, Japan and the HKKH region, respectively. Table 1 summarizes the magnitude of completeness computed for these regions.
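As a rough illustration of this catalogue-based procedure, the sketch below (a minimal example; the binning, tolerance and synthetic catalogue are illustrative assumptions, not taken from the paper) builds the cumulative frequency-magnitude curve, fits a straight line to its upper portion and reports the magnitude at which the observed counts begin to follow that line.

```python
import numpy as np

def estimate_mc(magnitudes, bin_width=0.1, tol=0.1):
    """Rough catalogue-based estimate of the magnitude of completeness (Mc).

    Builds the cumulative frequency-magnitude distribution, fits the
    Gutenberg-Richter relation log10 N(>=M) = a - b*M to the upper part of
    the curve, and returns the smallest magnitude bin at which the observed
    counts stay within `tol` log units of the fitted straight line.
    """
    mags = np.asarray(magnitudes, dtype=float)
    bins = np.arange(mags.min(), mags.max(), bin_width)
    cum_counts = np.array([(mags >= m).sum() for m in bins], dtype=float)
    log_n = np.log10(cum_counts)

    # Fit the straight line to the upper (presumably complete) half only.
    upper = bins >= np.median(mags)
    slope, intercept = np.polyfit(bins[upper], log_n[upper], 1)
    predicted = intercept + slope * bins

    # Mc is taken as the first magnitude bin where data and line agree.
    close = np.abs(log_n - predicted) < tol
    return bins[np.argmax(close)] if close.any() else bins[0]

# Example on a synthetic Gutenberg-Richter-like catalogue (b ~ 1, minimum M = 3.0)
rng = np.random.default_rng(0)
synthetic_mags = 3.0 + rng.exponential(scale=1.0 / np.log(10), size=5000)
print(estimate_mc(synthetic_mags))
```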
In this study, global temperature anomaly data, along with eight seismic parameters, were utilized to determine the seismic potential of any region. The parameters were selected based on the Gutenberg-Richter law of earthquake magnitude distribution and recent earthquake prediction research (Panakkat and Adeli, 2007; Panakkat and Adeli, 2009; Asim et al., 2018). The number of instances in each of the three datasets varies according to the seismic events recorded in the catalogues of the individual regions. Before the seismic parameters are computed, the earthquake database is purged of all seismic events with magnitudes below the threshold, which eliminates erroneous or incomplete records when determining seismic parameter trends. The most recent 100 records prior to each earthquake event have been considered to calculate these seismic parameters. These parameters are then used along with global temperature anomaly data to forecast the magnitude of the next earthquake.
A vector of seismicity characteristics created for each preceding significant seismic event, together with the monthly global temperature anomaly, forms the input to the deep neural network. Each seismic zone is unique, and different seismic parameters display various characteristics. Consequently, independent training of the LSTM, Bi-LSTM, and transformer models is conducted using 80% of the available seismic records in the relevant dataset for each area. After the models have been trained, the results are evaluated against the remaining 20% of the datasets. Figure 7 depicts the overall flowchart of the suggested research technique for estimating the magnitude of an impending earthquake.

FIGURE 7
Research methodology used in this research.
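The following sketch shows, in structural terms, how such input vectors and the chronological 80/20 split could be assembled; `compute_features` is a placeholder for the eight parameter calculations described in the next subsection, and the data structures are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def build_dataset(catalogue, temperature_anomaly, compute_features, n=100):
    """Build (X, y): for every event after the first n, compute the eight
    seismic parameters from the preceding n events and append the global
    temperature anomaly of the corresponding month.

    `catalogue` is assumed to be a chronologically sorted list of
    (timestamp, magnitude) pairs; `temperature_anomaly` maps "YYYY-MM"
    strings to anomaly values; `compute_features` returns the eight
    seismic parameters for a window of events.
    """
    X, y = [], []
    for i in range(n, len(catalogue)):
        window = catalogue[i - n:i]                      # last n = 100 events
        features = list(compute_features(window))        # eight seismic parameters
        month_key = catalogue[i][0].strftime("%Y-%m")
        features.append(temperature_anomaly[month_key])  # climatic input
        X.append(features)
        y.append(catalogue[i][1])                        # target: next magnitude
    return np.array(X), np.array(y)

# Chronological split: train on the earlier 80% of events, test on the rest.
# X, y = build_dataset(catalogue, temperature_anomaly, compute_features)
# split = int(0.8 * len(X))
# X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]
```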

Seismic parameters
The investigation of seismic parameters and their computations are inspired by the work (Panakkat and Adeli, 2007;Adeli and Panakkat, 2009;Panakkat and Adeli, 2009). Eight parameters were derived from seismic catalogs to predict the magnitude of an imminent earthquake. The most recent n records (n = 100) prior to each earthquake event are used to calculate these earthquake parameters. These parameters are then used along with global temperature anomaly data to forecast the magnitude of the next earthquake. The parameters are numerical representations of seismic facts such as the Gutenberg-Richter law, foreshock frequency, seismic energy release, and typical temporal earthquake magnitude distribution. Consequently, a feature vector of eight parameters depicts the region's internal geological state prior to each earthquake occurrence.

Time elapsed (T) for the last "n" seismic events
The first seismic parameter addressed in this study is the elapsed time T, which reflects the time interval spanned by the last n occurrences, where n is 100 in our study and t_i denotes the time of the i-th event.
T = t_n − t_1

Most earthquakes are preceded by significant precursor activity, such as a series of foreshocks. Indeed, some of the most popular earthquake prediction models (Zaliapin et al., 2003) are based on the frequency and intensity of foreshocks. The foreshock frequency can be measured using the T value, which depends on the set magnitude threshold. A high T value indicates a dearth of foreshocks, which may indicate a diminished likelihood of a subsequent large seismic event in many seismic zones. On the other hand, a small T value indicates a relatively high foreshock frequency and an increased probability of a subsequent large seismic event.

The mean magnitude (M mean )
The second seismic parameter is the average magnitude of the 100 most recent earthquakes. It reflects the magnitudes of the foreshocks, since seismic activity of magnitude M increases just before a large earthquake.
Following the accelerated release theory (Bufe and Varnes, 1993), the quantity of energy released by a fractured fault increases exponentially as the period between earthquakes decreases. In other words, the measured mean magnitudes of foreshocks increase just prior to the occurrence of a large earthquake.
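Written out, the mean magnitude over the window is simply the arithmetic mean of the last n event magnitudes:

```latex
M_{\mathrm{mean}} = \frac{1}{n}\sum_{i=1}^{n} M_i, \qquad n = 100,
```

where M_i is the magnitude of the i-th event in the window.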

The slope of the Gutenberg-Richter curve (b-value)
The Gutenberg-Richter inverse power law is utilized to describe the relationship between the number of earthquakes N with a magnitude equal to or greater than M, where parameter a denotes the intensity of seismicity and parameter b represents the ratio of minor to major events. These are two of the seismic parameters considered in this study. The b-value determines the frequency of smaller earthquakes relative to larger ones: the greater the b-value, the more frequent the smaller earthquakes. Multiple studies link the b-value to the differential stress in the Earth's crust, with low b-values seen in highly stressed zones or faults and high b-values in less stressed locations. Using the maximum likelihood (ML) method, the b-value can be calculated as

b = log10(e) / (M_mean − M_c),

where M_c and M_mean are the magnitude of completeness of the catalogue and the mean magnitude of the past n events (n = 100), respectively.
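A small sketch of these two computations is given below: b follows the ML estimator above, while the a-value (discussed in the next subsection) is anchored so that the Gutenberg-Richter line reproduces the count of window events above Mc, which is an assumed convention rather than the paper's exact expression.

```python
import numpy as np

def gr_parameters(window_mags, mc):
    """b-value via the maximum-likelihood estimator given above; a-value
    chosen so that log10 N(>=M) = a - b*M matches the number of window
    events at the completeness magnitude (an assumed convention)."""
    mags = np.asarray(window_mags, dtype=float)
    b = np.log10(np.e) / (mags.mean() - mc)
    a = np.log10(np.sum(mags >= mc)) + b * mc
    return a, b

# Example: a window of 100 magnitudes with Mc = 3.5
# a_val, b_val = gr_parameters(window_mags, mc=3.5)
```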

The y-intercept of the Gutenberg-Richter curve (a value)
The values of a and b in the Gutenberg-Richter inverse power law provide a regression line that can estimate future earthquake frequencies. The G-R distribution has constant parameters a and b. The a value is expressed as

a = log10 N(M),

where N(M) represents the number of earthquakes of a specific magnitude M.

Magnitude deficit (ΔM)
The magnitude deficit is another seismic parameter in this study. Based on the Gutenberg and Richter (1956) relationship, it measures the difference between the largest observed magnitude and the largest predicted magnitude.
where M_max,observed is the maximum magnitude observed in the previous n events and M_max,expected is the largest magnitude projected by the inverse power-law relationship over the same n occurrences, so that ΔM = M_max,observed − M_max,expected. Because an event of the largest magnitude is expected to occur only once among the n occurrences, N = 1, log N = 0, and Eq. 3 yields M_max,expected = a/b.

Rate of the square root of the seismic energy (dE^1/2)

The rate of the square root of seismic energy release (dE^1/2) is another seismic parameter associated with seismic activity. Most seismic zones are open physical systems with a constant accumulation of energy induced by lithospheric plate movement. These systems retain relative equilibrium if frequent low-magnitude seismic activity dissipates the increasing accumulation of energy (Roeloffs, 2000). When low-magnitude seismic activity is halted for an extended period due to frictional or mechanical causes, the physical system conserves energy; this phenomenon is known as "seismic quiescence." When the accumulated energy exceeds a predetermined level, it is released as a large seismic event (Tiampo, Rundle, McGinnis, Gross, and Klein, 2002). Hence, the rate at which seismic energy is emitted is a critical parameter of seismicity in quiescent environments. The square root of the seismic energy released per unit time is represented by

dE^1/2 = Σ E_i^1/2 / T,
where E^1/2 is the square root of the seismic energy E, determined from the associated Richter magnitude M and expressed in ergs through the empirical relation E = 10^(11.8 + 1.5M).
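A brief sketch of this computation over a window of events follows; the energy-magnitude relation is the one assumed above, and the time unit used for T is an illustrative choice.

```python
import numpy as np

def sqrt_energy_rate(window_mags, window_times):
    """Rate of the square root of released seismic energy,
    dE^1/2 = sum(sqrt(E_i)) / T, with E_i in ergs from the assumed
    relation log10 E = 11.8 + 1.5 M and T = t_n - t_1 in days."""
    energies = 10.0 ** (11.8 + 1.5 * np.asarray(window_mags, dtype=float))
    elapsed_days = (window_times[-1] - window_times[0]).days
    return np.sqrt(energies).sum() / max(elapsed_days, 1)
```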
Sum of the mean square deviations from the regression line using the Gutenberg-Richter inverse power law (η-value)

This metric assesses the degree to which observed seismic data conform to the Gutenberg-Richter inverse power-law relationship. Lower values indicate that the observed distribution is more likely to be well approximated by the power law. In contrast, larger values indicate increased unpredictability and the inability of the power law to represent the magnitude-frequency distribution.

Mean time between characteristic events (µ)
This is the average duration or interval between occurrences of a characteristic event over the past n instances. According to the elastic rebound hypothesis (Reid, 1910), certain seismic zones demonstrate periodic trends in the slow accumulation of stress and its eventual release via large earthquakes. Researchers have shown that the time between large earthquakes is rather consistent (Kagan and Jackson, 1991). Large earthquakes of this kind are known as characteristic events. In this context, magnitudes are grouped within a certain approximation range; for example, earthquakes with magnitudes between 5 and 5.5 are considered to have the same characteristic magnitude. Ideally, characteristic events occur at roughly equal intervals of time. Suppose t_i^characteristic denotes the observed time interval between characteristic occurrences of magnitude M_i, and the total number of characteristic events is represented by n_characteristic. In that case, the mean time between characteristic occurrences (µ) can be calculated using the following equation:

µ = Σ t_i^characteristic / n_characteristic

The input of the deep neural network is a vector of seismicity parameters generated for each prior significant seismic event and the global temperature anomaly for the corresponding month. A collection of seismic parameters exhibiting maximum performance in one region may not do so in another region. Additionally, global temperature anomalies have a substantial impact on earthquakes. Together with the global temperature anomaly, all seismic parameters are employed concurrently to construct a deep learning-based model for earthquake magnitude prediction. The next section provides an overview of the deep neural network models utilized in this study.

Analytical methods
This section gives a concise summary of the analytical procedures utilized in this investigation.

Deep neural networks
Deep learning is generally implemented using neural network architectures. A deep neural network applies a series of non-linear processing layers, with basic components operating in parallel. It consists of an input layer, several hidden layers, and an output layer, and the layers are made up of interconnected nodes, or neurons. Each hidden layer uses the output of the preceding layer as its input. In data science, deep learning has emerged as a powerful technique for tackling previously intractable problems in the natural world (Sadowski and Baldi, 2018; Bourilkov, 2019). This is assisted by deep learning's enhanced capacity to find intricate patterns in extremely large datasets.
In sequential data, long-term contextual information is mostly accessible through the internal states of the network, where past activity is stored. Recurrent neural networks (RNNs) are a robust modelling method for such sequential data because of their cyclic connections. RNNs are highly effective in applications involving the labelling and prediction of sequences. They use information carried over from previous time steps to improve their predictions, which allows them to employ a dynamic contextual window over the input sequence instead of the static contextual window used by feed-forward networks (Sak et al., 2014).

Long short-term memory recurrent neural network
LSTM networks are a special kind of recurrent neural network that are designed to recognize the importance of context in making sequence predictions. LSTM networks were first proposed in 1997 by Hochreiter and Schmidhuber (1997). LSTM is a kind of RNN that overcomes the difficulties of handling long-term dependencies (Graves, 2014). In addition, LSTMs do not suffer from the vanishing gradient problem (Hochreiter, 1998;Gers et al., 2000). LSTMs have feedback connections, in contrast to deep feedforward neural networks. In both context-free and context-sensitive language learning, LSTM models outperform RNNs (Gers and Schmidhuber, 2001). In addition to handling single data points in vectors or arrays, they can handle data sequences. Because of this, LSTMs excel in processing and predicting time series.
The LSTM model, in contrast to the hidden-layer neurons of an RNN, is made up of a unique collection of memory cells, and its behaviour depends on the cell state. A gate structure filters information to maintain and refresh the state of the memory cells; it includes input, forget, and output gates. There are three sigmoid layers and one tanh layer in every memory cell (Qiu et al., 2020). The forget gate f_t of the LSTM unit determines which cell-state information is omitted from the model. The memory cell concatenates the previous moment's output h_{t−1} and the current moment's external information x_t into a long vector [h_{t−1}, x_t] to represent the current moment.

FIGURE 9
Learning graph of the LSTM, Bi-LSTM and transformer models using (A) Indonesia Earthquake Catalogue, (B) Japan Earthquake Catalogue and (C) HKKH Region Earthquake Catalogue.

W_f and b_f are the forget gate's weight matrix and bias, respectively, and σ is the sigmoid function. The forget gate's primary function is to monitor how much of the prior cell state C_{t−1} is reserved for the current cell state C_t. The gate outputs a value between 0 and 1 based on h_{t−1} and x_t, with 1 signifying total retention and 0 indicating total discard. The input gate i_t regulates how much of the current network input x_t is reserved for the cell state C_t, preventing unnecessary data from entering the memory cells. The first step is determining which cell state has to be modified; the sigmoid layer chooses the modified value, as illustrated in Eq. 13. The second step is to modify the cell state's data: using the tanh layer, a new candidate vector C̃_t is generated to govern the amount of new information provided, as indicated by Eq. 14.
The final step is to update the cell state of the memory cell using Eq. 15. The output gate o_t controls how much of the current cell state is passed on to the output, and the final output value of the cell, h_t, is obtained from the output gate and the updated cell state, as in the standard LSTM formulation summarized below.
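For reference, the standard LSTM gate formulation described above (corresponding to Eqs 12-17) is, with σ the sigmoid function and ⊙ denoting element-wise multiplication:

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```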

Bidirectional long short-term memory recurrent neural network
Bidirectional networks provide significant advantages over unidirectional networks in a variety of contexts (Cui et al., 2022). Bidirectional LSTM is derived from the bidirectional RNN (Schuster and Paliwal, 1997), which employs two hidden layers to examine the input sequence in both the forward and backward directions. The two hidden layers are connected to the same output layer. Using the positive sequence of inputs from time T − n to time T − 1, the forward layer generates a sequence of output values, denoted h→. In contrast, the reverse layer generates a sequence of values, denoted h←, using the inverted sequence of inputs from time T − 1 to time T − n. The outputs of the forward and backward layers are calculated using the conventional LSTM equations (Eqs 12-17). The bidirectional LSTM layer generates the output vector Y_T, whose elements combine the two directions, y_t = σ(h→_t, h←_t).

Here, the σ function is used to mix the two output sequences; it may be a summation, average, concatenation, or multiplication. Using earthquake magnitude prediction as an illustration, the final output of a bidirectional LSTM layer may be expressed as a vector Y_T = [y_{T−n}, ..., y_{T−1}], where the last element, y_{T−1}, indicates the expected magnitude of the next earthquake.

Transformer model
The self-attention-based transformer for sequence modelling has recently been introduced and has been a tremendous success (Parikh et al., 2016; Vaswani et al., 2017). In contrast to RNN-based approaches, the transformer model may access any historical segment regardless of distance, and it excels at recognizing recurring patterns with long-term dependencies. The performance benefits of transformer models in prediction have been widely established, and numerous recent studies have applied them in image, music, and speech processing (Parmar et al.). However, scaling attention to extremely long sequences is computationally expensive, since the space complexity of self-attention rises quadratically with sequence length. When forecasting time series with exact precision and substantial long-term dependence, this becomes a serious issue. In particular, the space complexity of the canonical transformer, which rises quadratically with input length L, may create a memory bottleneck. The sparse transformer (Child et al., 2019), with a complexity of O(n√n), and the sparse log transformer, with a complexity of O(n(log n)²), are some solutions to this problem. These methods have made it possible to model long-term time series.
The transformer model is composed of encoders and decoders. Each encoder layer's primary responsibility is to create representations describing the relationships between the inputs. In contrast, the decoder component takes all the encoded data and uses the embedded context information to produce a new sequence of output values. Both the encoder and decoder are built of modules that may be stacked on top of one another, and the bulk of these modules are composed of multi-head attention and feed-forward layers. The encoder consists of six identical layers, each with two stacked sublayers: the first is a multi-head self-attention layer, while the second is a simple position-wise, fully connected feed-forward network. In addition, a residual connection is created around each sublayer, followed by a normalization layer. Similar to the encoder, the decoder consists of six layers with the same sublayers. Furthermore, multi-head attention is applied to the encoder outputs to help in the production of target translations. The attention function in a transformer maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The input consists of queries and keys of dimension d_k and values of dimension d_v. Each query is multiplied by all keys, each product is divided by √d_k, and the softmax function is then applied to obtain the weights for the values. The result is a weighted sum of the values, with each value receiving a weight defined by the compatibility function of the query with its corresponding key. In practice, the attention function is computed in parallel on a set of queries, keys, and values packed into matrices Q, K, and V. The output matrix is calculated as

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Evaluation criteria
To evaluate the performance of these deep learning models, the following four evaluation criteria were considered: mean absolute error (Willmott and Matsuura, 2005), mean square error (Pishro-Nik, 2014), log-cosh loss (Grover, 2021) and mean squared logarithmic error (Mean Squared Logarithmic Error Loss, 2021).

Mean absolute error (MAE)
The MAE is the average absolute difference between predictions and targets; in a predicted-versus-observed scatter plot, it corresponds to the average absolute distance of the points from the identity line through the origin. Thus, the MAE assesses how well a forecast matches the actual results.

Mean square error (MSE)
The MSE reflects the deviation between the forecasts and the true values. It is the average squared deviation between the prediction and the target. Since it is based on a squared term, negative values are impossible, and it comprises both the estimator's variance and its bias. For a specified number of observations n, the MSE is the average of the squared differences between anticipated and observed values:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,

where y_i is the observed value and ŷ_i is the predicted value.

Log-cosh loss
In regression problems, the log-cosh loss is another, smoother metric than the MSE. It computes the logarithm of the hyperbolic cosine of the prediction error (Grover, 2021):

L(t, p) = Σ_{i=1}^{n} log(cosh(p_i − t_i)),

where p denotes the predicted value and t represents the true value. log(cosh(x)) is approximately equal to x²/2 for small x and to |x| − log 2 for large x. This shows that the log-cosh loss behaves similarly to the mean squared error but is less affected by occasional, drastically incorrect predictions (Chris, 2019).

Mean squared logarithmic error (MSLE)
The mean squared logarithmic error is the average of the squared differences between the log-transformed actual and forecasted values over the observed data. The loss function can be expressed as

MSLE = (1/n) Σ_{i=1}^{n} (log(y_i + 1) − log(ŷ_i + 1))²,

where ŷ denotes the predicted value. This loss may be understood as a ratio between the actual and anticipated values, since

log(y_i + 1) − log(ŷ_i + 1) = log((y_i + 1) / (ŷ_i + 1)).

The addition of "1" to both y and ŷ is for mathematical convenience, as log(0) is not defined and both y and ŷ can be zero.
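For concreteness, the four metrics can be computed as follows (plain NumPy versions written for illustration; the log-cosh value is averaged here, whereas implementations may instead sum it):

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """MAE, MSE, log-cosh loss and MSLE for a set of magnitude predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": np.mean(err ** 2),
        "log-cosh": np.mean(np.log(np.cosh(err))),
        "MSLE": np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2),
    }

# Example with small magnitude errors of the order reported in the results
print(evaluation_metrics([4.5, 5.0, 5.5], [4.4, 5.1, 5.6]))
```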

Model architecture and training
This work uses deep learning methods to provide predictions regarding the magnitude of future earthquakes based on temperature anomalies and eight other seismic parameters for a specific location. The vectors consisting of temperature anomaly data and eight seismic parameters are fed into neural networks with two types of recurrent units: long short-term memory cells and bidirectional long short-term memory cells. These vectors are also supplied into a transformer model that employs an attention mechanism by variably weighting the significance of each incoming data element. Because each region has unique features and is distinct from others, independent training was conducted to construct a prediction model based on seismic parameters that are distinctive to each region.

LSTM model
There are 32 LSTM units in the primary layer of the LSTM model. To prevent overfitting, a dropout layer with a rate of 0.2 is applied thereafter. When a system is overfitted, it might produce good training results but poor testing outcomes; overfitting occurs when a system depends excessively on its historical data, rendering it rigid and incapable of adjusting to new input. After the dropout layer, there are two dense layers, one with linear activation and one activated by a rectified linear unit (ReLU), followed by an output layer consisting of a single dense unit.
The number of epochs and the batch size are two key hyperparameters that must be determined prior to training, based on experience and extensive trial and error. The number of epochs controls how many times the learning algorithm iterates through the training dataset; one epoch means that every sample in the training dataset has had the opportunity to influence the internal model parameters. The batch size determines how many samples are processed before the internal model parameters are updated. Here, we pick a batch size of 128 and 50 epochs for each regional earthquake catalogue, which includes thousands of earthquake events (data rows). This means that the dataset is divided into subsets of 128 samples each, the model weights are recalculated after every 128 samples, and the model passes over the entire dataset fifty times.
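A minimal Keras sketch consistent with this description is given below; the layer sizes stated in the text (32 LSTM units, dropout 0.2, single output unit), the MSE loss and the training settings (batch size 128, 50 epochs) follow the paper, while the hidden dense widths, the optimizer for this model and the input shape are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm_model(n_timesteps, n_features):
    """LSTM magnitude-prediction model: 32 LSTM units, dropout 0.2,
    two dense layers (linear and ReLU) and a single output unit."""
    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        layers.LSTM(32),
        layers.Dropout(0.2),
        layers.Dense(16, activation="linear"),  # hidden width assumed
        layers.Dense(16, activation="relu"),    # hidden width assumed
        layers.Dense(1),                        # predicted magnitude
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae", "msle"])
    return model

# Training with the hyperparameters stated in the text
# model = build_lstm_model(n_timesteps=1, n_features=9)  # 8 parameters + temperature anomaly
# model.fit(X_train, y_train, batch_size=128, epochs=50,
#           validation_data=(X_test, y_test))
```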
The model's architectural representation is shown in Figure 8A. Figure 9 shows the learning graph generated by deep learning models (LSTM, Bi-LSTM, and transformer) utilizing the processed dataset of three seismic zones. A learning curve is a graph that illustrates how the learning performance of a model varies with experience or time. Learning curves are frequently utilized in deep neural network algorithms that learn gradually and adjust their internal parameters over time. The major goal of our work with deep neural networks is to reduce error as much as possible. The objective function is typically characterized by a loss function, with "loss" referring simply to the value produced by the loss function. In this study, all three sets of seismic catalogs were used to build learning curves during the training phase, and the default loss function used was mean square error loss. Low scores suggest higher learning, while a zero score shows that the training dataset was learned accurately and without any mistakes. Here, the training loss plot reduces to the point of stability with a minimal number of epochs, indicating a satisfactory fit of the model with three seismic catalogs.

Bi-LSTM model
In the Bi-LSTM model, the main layer comprises twenty-four LSTM units that can operate in both directions. To prevent overfitting, the subsequent layer is a dropout layer with a dropout rate of 0.2. The next layers are also unchanged: a layer of dense units activated by a ReLU function, another layer of dense units connected linearly, and a final dense unit serving as the output. The final output layer consists of a single dense output unit. Figure 8B displays an architectural depiction of the model. Figure 9 displays the training loss plot's gradual decline to stability after a few epochs, indicating that the model fits the three seismic datasets well.
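The only structural change relative to the LSTM sketch above is the recurrent layer; in Keras it could be expressed as shown below (24 units per direction is an assumption, since the text states only that the layer comprises twenty-four bidirectional LSTM units).

```python
from tensorflow.keras import layers

# Bidirectional wrapper around an LSTM layer, replacing layers.LSTM(32)
# in the previous sketch; unit count per direction assumed.
recurrent_layer = layers.Bidirectional(layers.LSTM(24))
```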

Transformer model
Here, a multi-head self-attention mechanism has been implemented. This method employs self-attention processes to model sequence data and to identify complex correlations of varied lengths in time-series data. Furthermore, this transformer-based technique can describe a wide spectrum of non-linear dynamical systems. The Q, K, and V representations are derived from the input via dense layers. The decoder stage is optional and depends on the scale of the model and data; in this work, the decoder is bypassed entirely, meaning that only one or more layers of the attention block are utilized. In the last phase, a few dense layers are employed to produce the desired prediction. Figure 8C depicts the model's implemented architecture.
Each attention block comprises a feed-forward block, a self-attention block, and a normalization block, and the sizes of the inputs and outputs of each block are the same. Adam (Kingma and Ba, 2015), an excellent initial optimizer for training, has been used in this research. Dropout is applied for regularization in the encoder's and decoder's three types of sublayers: self-attention, feed-forward, and normalization. The dropout rate for each sublayer is 0.2. The training loss plot stabilizes after a few epochs, as shown in Figure 9, demonstrating a decent model fit to the three seismic datasets.
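A hedged Keras sketch of such an encoder-only, attention-based regressor is shown below; the dropout rate of 0.2 and the Adam optimizer follow the text, whereas the number of heads, key dimension, feed-forward width, number of blocks and pooling choice are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def attention_block(x, num_heads=4, key_dim=16, ff_dim=32, dropout=0.2):
    """One encoder-style block: self-attention, feed-forward and
    normalization sublayers with residual connections and dropout 0.2."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    attn = layers.Dropout(dropout)(attn)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    ff = layers.Dropout(dropout)(ff)
    return layers.LayerNormalization()(x + ff)

def build_transformer_model(n_timesteps, n_features, n_blocks=2):
    """Encoder-only transformer for magnitude regression: Q, K and V are
    derived from the inputs through dense projections inside the attention
    layer, the decoder is bypassed, and dense layers produce the output."""
    inputs = layers.Input(shape=(n_timesteps, n_features))
    x = inputs
    for _ in range(n_blocks):
        x = attention_block(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(16, activation="relu")(x)
    outputs = layers.Dense(1)(x)  # predicted magnitude
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
    return model
```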

Results and observations
This study attempts to estimate the magnitude of incoming earthquakes based on fluctuations in global temperature and eight seismic parameters derived from the previous 100 earthquake events. Another aspect influencing model accuracy is the number of neurons in the hidden layers: if it is set too high, overfitting might occur, and the model will be incapable of generalizing effectively to new data. Dropout layers, which deactivate a fraction of the neurons, have been utilized to address this difficulty. Four metrics have been employed to assess the performance of these models: mean absolute error (MAE), mean squared error (MSE), log-cosh loss, and mean squared logarithmic error (MSLE). MAE and MSE are deviation measures that indicate how far the predictions are from the target values; prediction models perform better when these deviation values are smaller. The log-cosh loss is the logarithm of the hyperbolic cosine of the prediction error, and the MSLE can be interpreted as a ratio between the true and anticipated values. The MAE, MSE, log-cosh loss, and MSLE values all started to converge after running these deep learning models on the training dataset.

Results for Indonesia earthquake catalogs
By examining pre-processed seismic datasets from Indonesia, deep learning algorithms were utilized to investigate the influence of global temperature fluctuations on earthquake occurrences and to evaluate how the actual and predicted magnitudes vary over time, as illustrated in Figure 10. Based on our observations, these models can project magnitudes ranging from 3.8 M to 5.8 M. Figure 11 depicts the predicted magnitude as a function of the observed magnitude. The x-axis represents the model's projected magnitude, while the y-axis represents the observed or actual magnitude recorded in the seismic dataset. The diagonal line in the plot's centre represents the estimated regression line. Because each data point is quite close to the anticipated regression line, we can conclude that the LSTM model fits the data fairly well. The figure also indicates that the model can predict earthquakes up to 5.8 M. Figure 12 shows a histogram of the errors produced by a deep neural network when forecasting the magnitude of the next upcoming earthquake. The difference between the actual and projected values is referred to as the "error." These error values may be negative since they represent the extent to which the projected values differ from the actual values. The bulk of the anticipated magnitudes have errors near 0.0, with larger deviations being rare. The distribution is approximately symmetrical, with LSTM model values ranging from −0.2 to 0.2. As shown in Figure 13, the performance of each deep learning model improved as the deviation metrics decreased with increasing epochs. Table 2 compares all deviation metrics calculated by the models on the Indonesia earthquake datasets during training and testing. On the Indonesia earthquake dataset, all models performed well. As demonstrated in Table 2, the Bi-LSTM model had the lowest deviation metrics throughout the training period. When these models were fed an unknown test dataset, the LSTM model outperformed the others with the lowest deviation metrics. In the testing stage, the LSTM model surpasses the others, with the lowest MAE = 0.066, MSE = 0.007, log-cosh loss = 0.039, and MSLE = 0.003.

Results for Japan earthquake catalogs
Deep learning methods were utilized to evaluate the preprocessed seismic data from Japan. The study evaluates how the predicted and actual magnitudes vary over time, as shown in Figure 14. These models can predict magnitudes ranging from 3.8 M to 5.8 M based on our observations. Figure 15 illustrates the predicted magnitude as a function of the observed magnitude. The x-axis indicates the model's predicted magnitude, while the y-axis represents the observed or true magnitude recorded in the seismic database. Because each data point is relatively near the predicted regression line, we can infer that the transformer model fits the data fairly well. The figure also indicates that the model can predict earthquakes with magnitudes up to 5.8 M. Figure 16 shows the distribution of errors made by a deep neural network when predicting the magnitude of the next impending earthquake. Most anticipated magnitudes have errors near 0.0, whereas greater discrepancies are uncommon. The distribution of errors for the LSTM and transformer models is confined within the range from −0.2 to 0.2, indicating a roughly symmetrical distribution. Figure 17 depicts how each deep learning model's performance was enhanced as the deviation metrics dropped with increasing epochs. Table 3 compares all the deviation metrics produced by these models on the Japan earthquake datasets during training and testing. The Bi-LSTM model obtained the lowest deviation metrics during the training period, as shown in Table 3. The transformer model outperformed the other models with the lowest deviation metrics when fed with an unknown test dataset; in the testing phase, the transformer model surpasses the other models.

Results for HKKH region earthquake catalogs
Deep learning techniques were also used to examine preprocessed seismic data from the HKKH area. Figure 18 illustrates the temporal evolution of the projected and actual magnitudes.
According to our findings, these models can estimate magnitudes ranging from 3.5 M to 5.2 M. Figure 19 depicts the predicted magnitude as a function of the observed magnitude: the x-axis represents the model's predicted magnitude, while the y-axis represents the observed magnitude recorded in the seismic database. We may infer that the LSTM model fits the data fairly well in contrast to the other models, because each data point lies close to the predicted regression line. Furthermore, the figure shows that the model can predict earthquakes with magnitudes of up to 5.3 M. Figure 20 shows a histogram of the errors generated by the deep neural networks while estimating the magnitude of an upcoming earthquake. Most of the predicted magnitudes have errors near 0.0, with larger deviations being unusual. For the LSTM model, the distribution is reasonably symmetrical, with errors ranging from −0.2 to 0.2. Figure 21 shows how the performance of each deep learning model improved as the deviation metrics dropped with increasing epochs. Table 4 compares all deviation metrics derived from the models on the HKKH region earthquake dataset during training and testing. All models performed well on this dataset. As shown in Table 4, the Bi-LSTM model had the lowest deviation metrics during training, whereas the LSTM model outperformed the others on the unseen test dataset, with the lowest MAE = 0.083, MSE = 0.011, log-cosh loss = 0.039, and MSLE = 0.005.
The test results presented in Tables 2-4 demonstrate that these deep learning models can predict the magnitude of an impending earthquake with a maximum MSE of 0.03 across all three regional earthquake catalogs. The errors produced by the models correspond to a maximum standard deviation of approximately 0.17 magnitude units over all three datasets. We can therefore conclude that these models fit the datasets accurately, since the error in the magnitude estimates of mild earthquakes has a maximum standard deviation of 0.17, depending on the network. The results thus indicate that the models make small prediction errors and that they accurately model the seismic datasets of the three regions together with the global temperature data. In other words, these models have effectively identified a correlation between earthquake magnitude and global temperature fluctuations.
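For reference, the deviation metrics reported in Tables 2-4, together with the standard deviation of the errors, can be computed from observed and predicted magnitudes with a short routine such as the sketch below (NumPy only). The synthetic arrays merely illustrate a residual spread of roughly 0.17 magnitude units and are not the study's results.

```python
import numpy as np

def deviation_metrics(y_true, y_pred):
    """Compute MAE, MSE, log-cosh loss, MSLE, and the error standard
    deviation from observed and predicted magnitudes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                    # mean absolute error
    mse = np.mean(err ** 2)                       # mean squared error
    log_cosh = np.mean(np.log(np.cosh(err)))      # log-cosh loss
    msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)  # mean squared log error
    return {"MAE": mae, "MSE": mse, "log-cosh": log_cosh,
            "MSLE": msle, "error std": np.std(err)}

# Hypothetical example with a residual spread of about 0.17 magnitude units
rng = np.random.default_rng(0)
y_true = rng.uniform(3.8, 5.8, 1000)
y_pred = y_true + rng.normal(0.0, 0.17, 1000)
print(deviation_metrics(y_true, y_pred))
```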
Using earthquake catalogs from three distinct regions, the performance of the proposed system is assessed in this section. Low MSE, MAE, log-cosh, and mean squared logarithmic error values indicate that the models fit all three datasets well, pointing to a solid prediction system. The training graphs depict the models' bias and variance errors. During the training period, all of the models converged to similarly small MSE, MAE, log-cosh loss, and MSLE values for all three seismic databases. These evaluation criteria are used to assess each model's performance, and their convergence to small values indicates that the models fit the datasets with a high degree of precision, suggesting a correlation between earthquake magnitude and fluctuations in global temperature. All three deep learning algorithms used in this study predicted the magnitude of approaching earthquakes accurately, confirming the efficiency and usefulness of this earthquake modelling approach.
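The epoch-wise convergence plots referred to above can be generated directly from the per-epoch metric values recorded during training (for example, the history returned by a Keras training run). The sketch below uses hypothetical metric values purely to illustrate the plotting step; the numbers are not the study's results.

```python
import matplotlib.pyplot as plt

# Hypothetical per-epoch metric values of the kind stored in a Keras
# History object (history.history); real curves would come from model.fit.
history = {
    "mse":  [0.120, 0.045, 0.021, 0.012, 0.008],
    "mae":  [0.270, 0.160, 0.110, 0.080, 0.066],
    "msle": [0.015, 0.007, 0.004, 0.003, 0.003],
}

# Plot each deviation metric against the epoch number
for name, values in history.items():
    plt.plot(range(1, len(values) + 1), values, marker="o", label=name)
plt.xlabel("Epoch")
plt.ylabel("Deviation metric")
plt.legend()
plt.show()
```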
Temperature has a significant impact on growing heat fluxes close to earthquake zones. Sensible heat flow boosts evapotranspiration, one of the processes that moves water vapour into the atmosphere; this may lead to cloud formation and increased precipitation. Strong earthquakes are frequently linked to an increase in precipitation in seismic zones. In addition, climate change and global warming increase glacier erosion, resulting in a shift of the mass balance on the Earth's crust. This mass redistribution may enhance the probability of stress release in a previously stressed region, because erosion removes part of the overburden stress that had kept the system stable prior to unloading. Consequently, the loading and unloading of water bodies due to climate change may have direct effects on local seismicity. The impact of climate change on regional and transregional earthquakes, however, needs to be thoroughly investigated.
Most of the hypotheses produced in earthquake precursor signal studies are based on empirical formulas. Multiple factors contribute to the occurrence of an earthquake, including the accumulation of energy caused by tectonic motions, the stress-strain pattern, the fault types, the dynamics of inner-earth fluids, and the geomorphological structure. Consequently, extremely complicated precursory signals may be received at the concluding phase of earthquake preparation. On the basis of a precursory signal's characteristics (amplitude, frequency, and phase), one can provide quantitative information regarding the probable magnitude, depth, location, and timing of the next earthquake. However, little progress has been made thus far.

Conclusion
This study provides a novel method for establishing the association between earthquake occurrences and climate change by employing deep learning, and it explores a sustainable route toward earthquake prediction. Global temperature was selected as the single climatic variable because it substantially impacts the Earth's ecosystem and civilization. Global temperature data, along with several mathematically computed seismic parameters, are the basic inputs to the deep learning algorithms. The study presented deep learning-based approaches for forecasting the magnitude of imminent earthquakes using LSTM, Bi-LSTM, and transformer models with global temperature anomaly data on the earthquake catalogs of Japan, Indonesia, and the Hindu-Kush Karakoram Himalaya region. Approximately 80% of each earthquake dataset was used to train the deep learning models, and the remaining 20% was held out for prediction. A double hidden layer was employed in the LSTM and Bi-LSTM models, and a multi-head self-attention mechanism was built into the transformer model. Model accuracy is very sensitive to several parameters, such as the number of recurrent units in the hidden layers, the batch size, and the number of epochs; extensive testing was carried out during the training phase to identify their optimal values. Dropout layers were used in all models to prevent overfitting. The effectiveness of these models was assessed using the MAE, MSE, log-cosh loss, and MSLE metrics.
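As an illustration of the kind of architecture described above, the following minimal sketch assumes a Keras/TensorFlow implementation of an LSTM regressor with two hidden recurrent layers, dropout, and an 80/20 chronological split. The layer sizes, dropout rate, window length, and feature count are assumptions for illustration and should not be read as the study's reported configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed input shape: sliding windows of 10 past events, each described by
# 9 features (8 seismic indicators plus the global temperature anomaly).
WINDOW, FEATURES = 10, 9
X = np.random.rand(1000, WINDOW, FEATURES).astype("float32")        # placeholder inputs
y = np.random.uniform(3.5, 6.0, size=(1000, 1)).astype("float32")   # placeholder magnitudes

# Chronological 80/20 train-test split, as described in the text
split = int(0.8 * len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# LSTM with two hidden recurrent layers and dropout to limit overfitting
model = models.Sequential([
    layers.Input(shape=(WINDOW, FEATURES)),
    layers.LSTM(64, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(32),
    layers.Dropout(0.2),
    layers.Dense(1),   # predicted magnitude of the next event
])
model.compile(optimizer="adam", loss="mse",
              metrics=["mae", "msle", tf.keras.losses.LogCosh()])
model.fit(X_train, y_train, epochs=5, batch_size=32,
          validation_data=(X_test, y_test), verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))
```

A Bi-LSTM variant would wrap each recurrent layer in `layers.Bidirectional(...)`, while a transformer variant would replace the recurrent layers with multi-head self-attention blocks.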
The cost functions for all models on the various earthquake datasets converge to minimal values. For the Indonesia earthquake catalog, the LSTM model performed best during testing, with MAE = 0.066, MSE = 0.007, log-cosh loss = 0.038, and MSLE = 0.003. The LSTM model also exhibited the lowest MAE = 0.083, MSE = 0.011, log-cosh loss = 0.039, and MSLE = 0.005 when tested with the unseen dataset obtained from the seismic catalogue of the Hindu-Kush Karakoram Himalaya region. The transformer model achieved the lowest MAE = 0.083, MSE = 0.015, log-cosh loss = 0.054, and MSLE = 0.007 for the earthquake catalogue of Japan. Such low values indicate that the models fit the data well, suggesting a correlation between earthquake magnitude and fluctuations in global temperature. Several regional earthquake catalogs were used to test and validate the deep learning-based techniques, and the results showed that the LSTM, Bi-LSTM, and transformer models are accurate algorithms for predicting earthquake magnitude. However, the maximum magnitudes anticipated by these models are confined between M5 and M6, depending on the dataset. Owing to the relative scarcity of large seismic events in the historical earthquake records of some areas, particularly within a time span suited for retrospective forecasts, it is difficult to quantify the degree of statistical success in predicting large earthquakes.
The global temperature anomaly is considered the only climate variable in this investigation since it strongly affects the Earth's ecosystem and civilization. In this experimental study, melting ice and isostasy have been discussed to explain their relationship with rising global temperatures and their effect on regional seismicity. To the best of our knowledge, no dataset is publicly available for climatic variables such as precipitation, humidity, air pressure, and wind speed. However, incorporating these variables might alter the behaviour of the deep neural network models, and a comparative analysis of earthquake magnitude prediction with and without temperature would be an interesting direction for future work.
A very interesting recent study (Christie et al., 2022) on the eastern Antarctic Peninsula's Larsen A and B ice shelves highlights the need for more in-depth research to make our claim more robust and accurate. Notably, if the study could be carried out using regional temperature anomalies instead of the global temperature anomaly for specific regions, together with seismic data, the findings might be more accurate. In our experimental study, the publicly available global temperature anomaly and the seismic data of the selected regions were used as inputs. The non-availability of regional temperature anomaly data remains a major challenge of this work.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.