AUTHOR=Chelidze Tamaz, Kiria Tengiz, Melikadze George, Jimsheladze Tamar, Kobzev Gennady
TITLE=Earthquake Forecast as a Machine Learning Problem for Imbalanced Datasets: Example of Georgia, Caucasus
JOURNAL=Frontiers in Earth Science
VOLUME=10
YEAR=2022
URL=https://www.frontiersin.org/journals/earth-science/articles/10.3389/feart.2022.847808
DOI=10.3389/feart.2022.847808
ISSN=2296-6463

ABSTRACT=In the paper (Chelidze et al., 2020) we considered the problem of forecasting M ≥ 3 earthquakes (EQs) with a Machine Learning (ML) approach, using as training data monitored time series of water level variations in deep wells, together with geomagnetic and tidal time series, in Georgia (Caucasus). For such magnitudes, the ratio of "seismic" to "aseismic" days in Georgia is approximately 1:5, and the dataset is close to balanced. However, the forecast problem is of practical importance for stronger events, say, events of M ≥ 3.5, for which the learning dataset becomes much more imbalanced: the ratio of seismic to aseismic days in Georgia reaches values of the order of 1:20 and more. In this case, some commonly accepted ML classification measures, such as accuracy, lead to misleading assessments because of the large weight of true negative cases. As a result, the minority class (here, seismically active periods) is effectively ignored. We applied specific measures to counteract the imbalance effect and to exclude the possibility of overfitting. After regularization (balancing) of the training data, we built the confusion matrix and performed Receiver Operating Characteristic (ROC) analysis in order to forecast the next-day probability of M ≥ 3.5 earthquake occurrence. We found that the Matthews correlation coefficient (MCC) and the F1 score are measures that give good results even if the initial negative and positive classes are of very different sizes. Application of MCC and F1 to the observed geophysical data yields a next-day M ≥ 3.5 seismic event forecast probability of the order of 0.8. After randomization of EQ dates in the training dataset, the Matthews coefficient decreases to 0.17.

1. Introduction

In the paper (Chelidze et al., 2
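The following is a minimal illustrative sketch, not the authors' pipeline, of the abstract's central point: on a roughly 1:20 imbalanced seismic/aseismic day classification, accuracy can reward a useless "always aseismic" predictor, whereas MCC and F1 rank classifiers sensibly. The labels below are synthetic, and the use of scikit-learn metrics is an assumption for illustration only; the paper's actual features (water level, geomagnetic, and tidal series) are not reproduced here.

```python
# Illustrative sketch with synthetic labels; not the paper's data or model.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, confusion_matrix)

rng = np.random.default_rng(0)

# ~1:20 imbalance: 1 = "seismic" day (M >= 3.5 event), 0 = "aseismic" day
y_true = (rng.random(2000) < 1 / 21).astype(int)

# A useless "always aseismic" classifier versus an imperfect but informative one
y_always_negative = np.zeros_like(y_true)
y_informative = np.where(rng.random(2000) < 0.8, y_true, 1 - y_true)  # ~80% labels kept

for name, y_pred in [("always negative", y_always_negative),
                     ("informative", y_informative)]:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(name,
          "acc=%.2f" % accuracy_score(y_true, y_pred),
          "F1=%.2f" % f1_score(y_true, y_pred, zero_division=0),
          "MCC=%.2f" % matthews_corrcoef(y_true, y_pred),
          "TP/FP/FN/TN =", (tp, fp, fn, tn))
```

In this toy setting the trivial predictor scores an accuracy near 0.95 while its F1 and MCC are zero, and the informative predictor has lower accuracy but clearly positive F1 and MCC, which is the behaviour that motivates the measures adopted in the paper.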