AUTHOR=Yang Yun TITLE=Feature importance analysis of solar flares and prediction research with ensemble machine learning models JOURNAL=Frontiers in Astronomy and Space Sciences VOLUME=Volume 11 - 2024 YEAR=2025 URL=https://www.frontiersin.org/journals/astronomy-and-space-sciences/articles/10.3389/fspas.2024.1509061 DOI=10.3389/fspas.2024.1509061 ISSN=2296-987X ABSTRACT=Solar flares, as intense solar eruptive events, have a profound impact on space weather, potentially disrupting human activities like spaceflight and communication. Hence, identifying the key factors that influence the occurrence of solar flares and forecasting flares accurately are of significant research importance. Considering the imbalance of the flare data set, three ensemble learning models (Balanced Random Forest (BRF), RUSBoost (RBC), and NGBoost (NGB)), which have gained popularity in statistical machine learning in recent years, were combined with imbalanced data sampling techniques to classify and predict the labels representing flare eruptions in the test set. In this study, these models were used to classify and predict flares of ≥ C-class and ≥ M-class, respectively. After obtaining the feature importance scores of each model, a comprehensive feature importance ranking was derived from the individual model rankings. The main results are as follows: (1) For the prediction of ≥ C-class and ≥ M-class flares, the best-performing model achieved a Recall of ∼0.76 and ∼0.88 and a true skill statistic (TSS) of ∼0.65 and ∼0.78 on the test set, respectively. These are relatively high scores for these evaluation metrics. (2) The importance scores of each feature under different evaluation metrics and the comprehensive importance ranking can be obtained directly from the models without the need for additional feature analysis tools. 
When this ranking is used to reduce the dimensionality of the data set for the three main models, similar or better classification results can be achieved with only about half of the original features. (3) Our results show that the mean photospheric magnetic free energy (MEANPOT), the time decay value based on the magnitudes of all previous flares (Edec), and the total unsigned current helicity (TOTUSJH) are the three quantities most strongly related to solar flares; they represent free energy, the historical information of flare occurrences, and twist degree, respectively. In addition, from an analysis of the feature parameters of four different active regions, we find that the geographical information of the flare occurrence is an important factor. The objective of this work is to provide prediction methods for imbalanced data as well as feature importance ranking methods.