Hybrid feature-LSTM for solar radiation forecasting in different Chinese climate zones

Zhao, Long; Wang, Fei; Wang, Hong; Wang, Hui; Shi, Yi

doi:10.3389/feart.2026.1745611

ORIGINAL RESEARCH article

Front. Earth Sci., 10 February 2026

Sec. Atmospheric Science

Volume 14 - 2026 | https://doi.org/10.3389/feart.2026.1745611

Long Zhao¹

Fei Wang²

Hong Wang³

Hui Wang³*

Yi Shi²

¹College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang, China
²College of Agricultural Equipment Engineering, Henan University of Science and Technology, Luoyang, China
³Henan Tobacco Company Luoyang Company, Luoyang, China

Solar radiation is a critical energy source for life and ecosystems on Earth, influencing the growth and development of crops. Accurate solar radiation forecasting promotes agricultural development and ensures national food security. This study developed a high-accuracy solar radiation prediction system based on the long short-term memory (LSTM) network model and its hybrid models. Three feature importance analysis algorithms, including extreme gradient boosting, light gradient boosting machine, and categorical gradient boosting, were employed to evaluate the importance of meteorological factors and to develop various factor combinations. Furthermore, three optimization algorithms, including the beluga whale optimization algorithm, the goose optimization algorithm, and the horned lizard optimization algorithm (HLOA), were applied to optimize the hyperparameters of the LSTM model. Based on the forecasting results, the optimal input combinations and optimization algorithms were determined. The findings reveal that sunshine hours (SH) and vapor pressure deficit (VPD) are the factors most strongly correlated with solar radiation in the temperate continental zone (TCZ), whereas maximum temperature (Tmax) exhibits the highest importance coefficient in the tropical monsoon zone (TPMZ). The optimal factor combination for the forecasting models includes SH, VPD, Tmax, minimum temperature (Tmin), and relative humidity (RH) in both climate zones. Optimization algorithms significantly enhance the accuracy of the LSTM model, with HLOA demonstrating the best performance. Specifically, the HLOA-LSTM model with the optimal factor combination achieves the following precision metrics: for TCZ, RMSE = 3.470 ± 0.224 MJ/(m²·day), R² = 0.807 ± 0.016; for TPMZ, RMSE = 2.858 ± 0.561 MJ/(m²·day), R² = 0.814 ± 0.038.
The results of this study indicate that the HLOA-LSTM model, with inputs including SH, VPD, Tmax, Tmin, and RH, is the optimal model for solar radiation prediction, providing a valuable reference for high-precision solar radiation forecasting in the TCZ and TPMZ.

1 Introduction

The growth of the global population and economy, along with improved living standards, has led to increasing demands for food and energy (Zeng et al., 2018). Solar energy is not only the most accessible renewable and clean energy source, but also a crucial element for plant growth and crop evapotranspiration (Farhadi and Taki, 2020; Yang and Gueymard, 2019). Solar energy resources play a significant role in enhancing food security, mitigating the destructive environmental impacts of fossil fuels, and improving the monitoring of agricultural hydrological environments (Miao et al., 2018). Accurate estimation of solar radiation is essential for reducing fossil fuel consumption and managing hydrological conditions in farmland.

Currently, solar radiation data are primarily obtained through three methods: direct measurement using solar devices such as pyranometers, estimation via empirical models for regions without solar measuring instruments, and prediction using machine learning models (Taki et al., 2018). While direct observations through meteorological stations can provide accurate solar radiation information, the measurement instruments and their maintenance are expensive and time-consuming (Fan et al., 2018). Due to the limited coverage of meteorological and radiation monitoring stations, researchers cannot determine solar radiation levels in areas without measuring equipment (Cao et al., 2023). To address this limitation, researchers have developed various simple empirical models with minimal input requirements based on the relationships and patterns of solar radiation and its influencing factors (Yang et al., 2006). Although empirical models offer advantages such as ease of use and availability of input data, they require regional calibration for accurate results and are challenging to apply over large areas (Zhang et al., 2019). Moreover, the complexity and variability of the relationships between solar radiation and its influencing factors cannot be fully represented by quantitative models.

With the advancement of artificial intelligence, researchers have applied various machine learning models to solar radiation prediction (Jia et al., 2022; Sebastianelli et al., 2024). Machine learning models possess powerful nonlinear fitting capabilities, enabling them to capture complex nonlinear relationships between multiple independent and dependent variables. They can handle multivariable datasets, exhibit strong generalization abilities, and offer better adaptability and flexibility (Rodríguez et al., 2024; Zhao et al., 2023a). Fan et al. (2019) compared the performance of various empirical and machine learning models, concluding that machine learning models outperform empirical models. Among the machine learning algorithms used for solar radiation prediction, the long short-term memory (LSTM) network, an improvement of the recurrent neural network (RNN), utilizes a gated structure that significantly enhances its performance in processing sequential data, thereby providing more accurate results (Ghimire et al., 2019). This makes LSTM models widely used in solar radiation prediction studies. De Araujo (2020) demonstrated that the LSTM algorithm is an effective model for solar radiation prediction. Yildirim et al. (2023) compared the performance of LSTM, multilayer perceptron, and adaptive neuro-fuzzy inference systems for solar radiation prediction, concluding that LSTM outperformed the others, providing highly accurate results. Therefore, this study employs the LSTM model to establish a high-performance predictive model.

Standalone machine learning models have been widely applied in predictive tasks. To improve prediction accuracy and robustness further, researchers have combined optimization algorithms with machine learning models to fine-tune hyperparameters (Patel and Swathika, 2024; Tahir et al., 2024; Zhao et al., 2022). The integration of meta-heuristic optimization algorithms with machine learning models enables the discovery of optimal hyperparameter configurations, thereby maximizing model potential (Wu et al., 2019). The beluga whale optimization (BWO) algorithm, inspired by the predatory behavior of beluga whales, effectively explores the solution space. It is easy to implement and understand while exhibiting good adaptability (Zhong et al., 2022). The goose optimization (GO) algorithm, which simulates goose behavior, ensures precise global and local searches, making it a powerful optimization algorithm for efficient exploration of optimal solutions (Hamad and Rashid, 2024). The horned lizard optimization algorithm (HLOA) simulates the defensive strategies of horned lizards, effectively overcoming challenges in the optimization process. It enables simultaneous exploration and exploitation of the solution space, avoiding local optima. With its simple parameter configuration, HLOA is suitable for various complex parameter optimization tasks (Peraza-Vázquez et al., 2024). Accordingly, this study selects the BWO algorithm, the GO algorithm, and HLOA to fine-tune the hyperparameters of the LSTM model, aiming to construct a highly accurate and robust solar radiation prediction system.

The accuracy of machine learning models is influenced not only by hyperparameters but also significantly by the input features. Analyzing the importance of input factors can identify variables highly correlated with the target variable, exclude less relevant variables, improve training efficiency, and prevent overfitting (Zhao et al., 2023b). Gradient boosted decision trees (GBDT), an ensemble learning algorithm based on decision trees, evaluates feature contributions to provide robust analysis results (Wang S. et al., 2023). Researchers have developed three new algorithms based on GBDT—extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). These algorithms improve computational efficiency, predictive accuracy, and overall performance to enhance applicability. Jiang et al. (2022) used the LightGBM algorithm to select land surface parameters affecting soil salinity, optimizing soil salinity prediction models. Chen et al. (2022) applied the XGBoost algorithm to construct a multi-regional soil salinity estimation model, demonstrating that XGBoost provides accurate feature importance analysis results. Kookalani et al. (2022) showed that the CatBoost algorithm effectively captures nonlinear relationships to offer reliable feature analysis. Therefore, this study utilizes XGBoost, LightGBM, and CatBoost algorithms to comprehensively analyze the importance of meteorological factors, aiming to reduce model complexity and enhance prediction accuracy.

The study integrated six meteorological factors, namely sunshine hours (SH), maximum temperature (T_max), minimum temperature (T_min), vapor pressure deficit (VPD), relative humidity (RH), and wind speed (WIND), to develop a solar radiation estimation system. First, XGBoost, LightGBM, and CatBoost were applied to perform feature importance analysis on the input factors, forming diverse combinations of input variables. Subsequently, the LSTM model was employed as the base model, and its hyperparameters were optimized using the BWO algorithm, the GO algorithm, and HLOA to construct a highly accurate solar radiation estimation model. The primary objectives of this study are as follows: (1) using the XGBoost, LightGBM, and CatBoost algorithms to analyze the correlations of meteorological factors with solar radiation across different climate zones; (2) constructing various factor combinations based on the comprehensive results of the three analysis algorithms and determining the optimal input factor combination for each climate zone; (3) combining the BWO algorithm, the GO algorithm, and HLOA with the LSTM model to develop hybrid models and comparing the predictive performance of the hybrid models with that of the standalone model.

2 Materials and methods

2.1 Study sites

China’s climate zones can be classified into the tropical monsoon zone (TPMZ), subtropical monsoon zone, temperate monsoon zone, temperate continental zone (TCZ), and mountain plateau zone, based on regional variations in climate and precipitation. For this study, six stations were selected within the TCZ and TPMZ. These two zones exhibit significant differences in climatic conditions. By conducting experiments across these climatic zones, we can more fully assess the robustness and generalization capabilities of machine learning models and optimization algorithms when processing data with distinct climatic characteristics. The geographical and meteorological information of the study sites is listed in Table 1, and their spatial distribution is illustrated in Figure 1. The data used in this study were obtained from the National Meteorological Science Data Center of China (http://data.cma.cn) and randomly divided into training and testing sets in an 8:2 ratio to ensure the independence of model training and validation. The dataset includes daily solar radiation measurements from 2012 to 2016, along with daily meteorological variables such as sunshine hours, maximum temperature, minimum temperature, vapor pressure deficit, relative humidity, and wind speed. Furthermore, prior to model training, all variables were normalized to ensure comparability among features and facilitate more efficient convergence of the machine learning algorithms.
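
The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say which normalization was used, so min-max scaling is an assumption, as are the fixed seed, the choice to fit the scaling on the training rows only, and the function names.

```python
import random

def split_80_20(rows, seed=0):
    """Randomly split rows into training and testing sets in an 8:2 ratio."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def fit_min_max(train_rows):
    """Column-wise minima and maxima, computed from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, lows, highs):
    """Scale each column with the training minima and maxima (assumed min-max scaling)."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Illustrative records only (not station data): [SH (h), T_max (degC), RH (%)]
records = [[float(i % 12), 10.0 + i, 30.0 + 2 * i] for i in range(10)]
train, test = split_80_20(records)
lows, highs = fit_min_max(train)
train_scaled = apply_min_max(train, lows, highs)
test_scaled = apply_min_max(test, lows, highs)
```

Training rows land in [0, 1] by construction; test rows can fall slightly outside that interval because the scaling parameters come from the training set.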

Table 1. Geographical and meteorological information of study sites.

Figure 1

Topographic map of China displaying elevation with color gradients from green (low) to orange (high). Marked study stations include Hetian, Hami, Datong, Tengchong, Nanning, and Shantou. Regions are labeled TCZ, TMZ, MPZ, SMZ, TPMZ. A scale bar and legend indicate elevation range from negative 277 to 8806 meters.

Figure 1. Geographical distribution of study sites. Note: TCZ--temperate continental zone; TMZ--temperate monsoon zone; SMZ--subtropical monsoon zone; MPZ--mountain plateau zone; TPMZ--tropical monsoon zone.

2.2 Feature importance analysis algorithms

2.2.1 Light gradient boosting machine

LightGBM is a framework that implements the GBDT algorithm (Jiang et al., 2025). It supports highly efficient parallel training and offers advantages over traditional GBDT algorithms, including higher accuracy, faster training speeds, and lower memory consumption, making it well-suited for processing large-scale datasets. The feature importance analysis in the LightGBM algorithm is based on feature split gain. Specifically, the gain importance for a given feature i is expressed as shown in Equation 1:

FI_{i}^{gain} = \sum_{j = 1}^{T} I(i \text{ is used in split } j) \cdot Gain_{j} (1)

where $FI_{i}^{gain}$ represents the gain importance of feature i, T is the total number of trees, $I(i \text{ is used in split } j)$ is an indicator function that denotes whether feature i was used in the split of the j-th tree, and $Gain_{j}$ is the split gain in the j-th tree. The specific details of the LightGBM algorithm are provided by Ke et al. (2017).
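
As a small illustration of Equation 1, the sketch below accumulates split gains per feature. The `(feature, gain)` records and the function name are hypothetical; in practice a boosting library reports these totals directly.

```python
from collections import defaultdict

def gain_importance(splits):
    """Sum the split gain over every split that uses each feature (Equation 1)."""
    totals = defaultdict(float)
    for feature, gain in splits:
        # The indicator in Equation 1 is 1 only for the feature used in this split
        totals[feature] += gain
    return dict(totals)

# Hypothetical split records, for illustration only
splits = [("SH", 12.0), ("VPD", 7.5), ("SH", 3.0), ("T_max", 4.5)]
importance = gain_importance(splits)
```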

2.2.2 Extreme gradient boosting

The XGBoost algorithm is built on the principles of gradient boosting, with a focus on optimizing the objective function. By employing a second-order Taylor expansion of the objective function, XGBoost accelerates convergence and improves accuracy (Bilali et al., 2025). Feature importance in XGBoost is determined by the gain achieved when a feature is used in a split across all decision trees. The importance of feature i is calculated as the average of its gain across all trees, as expressed in Equation 2:

I_{i} = \frac{1}{T} \sum_{t = 1}^{T} I_{i}(I_{t}) (2)

where $I_{i}(I_{t})$ represents the gain of feature i in the t-th decision tree, and $T$ is the total number of decision trees. Further details of the XGBoost algorithm are provided by Chen and Guestrin (2016).

2.2.3 Categorical gradient boosting

CatBoost is an improved algorithm based on the GBDT framework. By employing symmetric decision trees as base learners, it significantly enhances training speed while effectively mitigating overfitting. CatBoost evaluates feature importance using multiple metrics that assess the contribution of each feature. Among these metrics, the predicted value change reflects the average variation in predictions when a feature value changes. Additionally, the loss function change measures the variation in the loss function when a feature is removed versus when the full feature set is used (Wang Y. et al., 2023). The specific formulas are shown in Equations 3–5:

PVC = \sum_{\text{trees, leafs}} \left[ {(V_{1} - avr)}^{2} \cdot b_{1} + {(V_{2} - avr)}^{2} \cdot b_{2} \right] (3)

avr = \frac{V_{1} \cdot b_{1} + V_{2} \cdot b_{2}}{b_{1} + b_{2}} (4)

where $b_{1}$ and $b_{2}$ are the weights of the left and right leaves, respectively, and $V_{1}$ and $V_{2}$ are the objective function values for the left and right leaves.

LFC = L(I) - L(I_{i}) (5)

where $L (I)$ is the loss function value when feature i is excluded, and $L (I_{i})$ is the loss function value when the complete feature set is used. The comprehensive information on the CatBoost algorithm is provided by Prokhorenkova et al. (2018).
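
For a single split, Equations 3–5 reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names and all numbers are hypothetical, and the full PVC sums this quantity over every relevant split in every tree.

```python
def split_pvc(v1, b1, v2, b2):
    """Contribution of one split to PVC (Equations 3 and 4)."""
    avr = (v1 * b1 + v2 * b2) / (b1 + b2)        # weighted mean of the two leaf values
    return (v1 - avr) ** 2 * b1 + (v2 - avr) ** 2 * b2

def loss_function_change(loss_without_feature, loss_full):
    """Equation 5: loss with feature i excluded minus loss with the full feature set."""
    return loss_without_feature - loss_full

# Hypothetical leaf values and weights
contribution = split_pvc(v1=1.0, b1=30, v2=3.0, b2=10)   # avr = 1.5, contribution = 30.0
lfc = loss_function_change(loss_without_feature=4.0, loss_full=3.5)   # 0.5
```

A positive LFC means the loss rises when the feature is removed, so the feature is contributing to the fit.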

2.2.4 Comprehensive analysis

To derive a more robust and reliable feature importance analysis, this study combines the results from the three algorithms. The importance scores obtained from each algorithm are averaged to produce the final feature importance score. This integrative approach leverages the strengths of each algorithm, enhancing the reliability and robustness of the results. The comprehensive calculation formula is shown in Equation 6:

FI = \frac{FI_{CatBoost} + FI_{XGBoost} + FI_{LightGBM}}{3} (6)
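
Equation 6 can be sketched as below. The three libraries report importance on different scales, so this sketch first rescales each set of scores to sum to 1; that rescaling step is an assumption, since the text only states that the scores are averaged. All scores shown are hypothetical.

```python
def normalize(scores):
    """Rescale a set of importance scores so that they sum to 1 (assumed step)."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}

def combined_importance(*score_sets):
    """Average the rescaled importance scores across algorithms (Equation 6)."""
    normalized = [normalize(s) for s in score_sets]
    names = list(normalized[0])
    return {n: sum(s[n] for s in normalized) / len(normalized) for n in names}

# Hypothetical raw scores, each on its own scale
xgb = {"SH": 0.40, "VPD": 0.25, "T_max": 0.15, "T_min": 0.10, "RH": 0.06, "WIND": 0.04}
lgbm = {"SH": 300, "VPD": 280, "T_max": 200, "T_min": 120, "RH": 60, "WIND": 40}
cat = {"SH": 35.0, "VPD": 30.0, "T_max": 15.0, "T_min": 10.0, "RH": 6.0, "WIND": 4.0}

fi = combined_importance(cat, xgb, lgbm)
ranking = sorted(fi, key=fi.get, reverse=True)
```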

2.3 Machine learning algorithms

2.3.1 Long short-term memory networks

The LSTM neural network is an improved algorithm based on RNN. The LSTM framework features three distinct gate structures: the input gate, output gate, and forget gate. These gates work in tandem to maintain and update the internal states of the network, significantly enhancing LSTM’s performance when processing sequential data. The LSTM network effectively mitigates the vanishing gradient problem commonly observed in traditional RNNs, greatly improving training efficiency. Within the gate structures of the LSTM network, the input gate determines which input information should be stored in the memory cell; the forget gate decides which information should be forgotten; and the output gate selects the information to be extracted and output from the current memory cell (Ghimire et al., 2022). The structural diagram of LSTM is shown in Figure 2, and the mathematical representation is given by Equations 7–12:

F_{t} = σ (w_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}) (7)

I_{t} = σ (w_{i} \cdot [h_{t - 1}, x_{t}] + b_{i}) (8)

O_{t} = σ (w_{o} \cdot [h_{t - 1}, x_{t}] + b_{o}) (9)

{\tilde{C}}_{t} = \tanh (w_{c} \cdot [h_{t - 1}, x_{t}] + b_{c}) (10)

C_{t} = F_{t} \cdot C_{t - 1} + I_{t} \cdot {\tilde{C}}_{t} (11)

h_{t} = O_{t} \cdot \tanh (C_{t}) (12)

where $F_{t}$ , $I_{t}$ , and $O_{t}$ represent the forget gate, input gate, and output gate, respectively; $σ$ denotes the activation function sigmoid; $w$ signifies the weight matrices; $b$ denotes the bias vectors; $h_{t}$ and $h_{t - 1}$ represent the hidden layer states at time $t$ and $t - 1$ , respectively; $x_{t}$ represents the input vector at time $t$ ; $C_{t}$ and $C_{t - 1}$ denote the cell states at time $t$ and $t - 1$ , respectively.
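
Equations 7–12 describe one time step of the cell. The sketch below implements that single step with plain Python lists, purely to make the data flow explicit; the function names and the `params` dictionary layout are hypothetical, and a real model would use a deep learning framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, b, v):
    """Compute w . v + b for a weight matrix w (list of rows) and bias vector b."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Equations 7-12."""
    z = h_prev + x_t                                                     # [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["w_f"], params["b_f"], z)]   # Eq. 7
    i = [sigmoid(a) for a in affine(params["w_i"], params["b_i"], z)]   # Eq. 8
    o = [sigmoid(a) for a in affine(params["w_o"], params["b_o"], z)]   # Eq. 9
    c_tilde = [math.tanh(a) for a in affine(params["w_c"], params["b_c"], z)]  # Eq. 10
    c_t = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]  # Eq. 11
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]                # Eq. 12
    return h_t, c_t

# Two hidden units, three inputs; all-zero weights make every gate equal 0.5
params = {}
for gate in ("f", "i", "o", "c"):
    params["w_" + gate] = [[0.0] * 5 for _ in range(2)]
    params["b_" + gate] = [0.0, 0.0]

h_t, c_t = lstm_step([0.2, 0.5, 0.1], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights the forget gate halves the previous cell state and the candidate state is zero, so `c_t` is `[0.5, -0.5]`.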

Figure 2

Diagram of an LSTM cell showing three gates: forget, input, and output. The cell takes previous states $ h_{t-1} $ and $ C_{t-1} $, processes them through sigmoid and tanh functions, and updates states to $ h_t $ and $ C_t $. Each gate controls different data flow pathways.

Figure 2. The structure of LSTM network.

To further enhance the predictive accuracy of the LSTM network, this study employed multiple optimization algorithms to tune four hyperparameters of the model, including the number of nodes in the first hidden layer (NumHiddenUnits), the maximum number of training epochs (MaxEpochs), the initial learning rate (InitialLearnRate), and the dropout probability (DropoutRate). Specifically, the MaxEpochs defines the upper limit on the number of complete passes through the training dataset, the InitialLearnRate determines the initial step size for the optimization algorithm during weight updates, and the DropoutRate suppresses overfitting by randomly setting a portion of the input elements to zero during training. The search ranges for the four hyperparameters are shown in Table 2. In addition, all model training in this study was conducted on a system equipped with an Intel(R) Core(TM) i5-1035G1 CPU.
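
One common way to expose these four hyperparameters to a meta-heuristic is to let each candidate solution be a vector in the unit cube and decode it into concrete values. The sketch below shows that decoding step only. The bounds are hypothetical placeholders (the actual search ranges are those of Table 2), and the linear mapping and function name are assumptions.

```python
# Hypothetical bounds for illustration; the actual search ranges are given in Table 2
BOUNDS = {
    "NumHiddenUnits": (10, 200),
    "MaxEpochs": (50, 300),
    "InitialLearnRate": (1e-4, 1e-1),
    "DropoutRate": (0.0, 0.5),
}
INTEGER_PARAMS = {"NumHiddenUnits", "MaxEpochs"}

def decode(position):
    """Map a candidate position in [0, 1]^4 to concrete hyperparameter values."""
    params = {}
    for u, (name, (lo, hi)) in zip(position, BOUNDS.items()):
        u = min(max(u, 0.0), 1.0)                 # clip positions that leave the cube
        value = lo + u * (hi - lo)                # linear mapping (assumed)
        params[name] = int(round(value)) if name in INTEGER_PARAMS else value
    return params

candidate = decode([0.5, 0.5, 0.5, 0.5])
```

The fitness of a candidate would then be the validation error of an LSTM trained with `decode(position)`.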

Table 2. Search ranges of different hyperparameters in the optimization algorithms.

2.3.2 Beluga whale optimization algorithm

The beluga whale optimization algorithm is a meta-heuristic optimization algorithm inspired by the behavior of beluga whales (Zhong et al., 2022). This algorithm simulates three stages of beluga behavior—exploration, exploitation, and whale fall—to optimize model parameters. In this process, each beluga whale represents a candidate solution, with its position continually updated. During the exploration phase, the BWO algorithm performs a global search by mimicking the swimming behavior of beluga whales in the search space. In the exploitation phase, the algorithm simulates the whales’ predation behavior and incorporates the Levy flight strategy to enhance convergence performance. In this stage, beluga whales cooperate to hunt prey by sharing positional information and interacting with one another, enabling more focused and precise local searches. During the whale fall phase, the whales’ positions and sinking steps are used to establish new individual positions, maintaining population diversity and size (Pan et al., 2024; Sun X. et al., 2024).

2.3.3 Goose optimization algorithm

The goose optimization algorithm is a meta-heuristic algorithm inspired by the resting and foraging behavior of geese (Hamad and Rashid, 2024). The algorithm begins with population initialization, where the fitness value of each individual is calculated based on its position in the search space. The GO algorithm then enters the exploration and exploitation phases. The exploration phase simulates the random searching behavior of geese when they have not identified a clear target. This phase aims to discover new solutions through random searches, helping the algorithm escape local optima and move toward the global optimum. The exploitation phase models the intensive searching behavior of geese upon discovering food, focusing on refining known solutions through local search to find the best solution. The GO algorithm balances these two phases using a random variable. If the variable’s value exceeds 0.5, the algorithm enters the exploration phase; otherwise, it enters the exploitation phase. This mechanism ensures that the algorithm can thoroughly search known regions while broadly exploring new ones (Sun Y. et al., 2024; Wang et al., 2025).
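
The switching rule described above can be illustrated with a toy population search. Only the 0.5 threshold on a random variable is taken from the text. The update rules here (uniform resampling for exploration, a small Gaussian step around the best solution for exploitation) are simplified stand-ins and not the published GO equations, and the surrogate fitness function is hypothetical.

```python
import random

def toy_switch_search(fitness, dim, pop_size=20, iterations=100, seed=0):
    """Toy minimizer that switches between exploration and exploitation at 0.5."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    best = min(population, key=fitness)[:]
    for _ in range(iterations):
        for k, candidate in enumerate(population):
            if rng.random() > 0.5:
                # Exploration: sample a fresh point anywhere in the unit cube
                trial = [rng.random() for _ in range(dim)]
            else:
                # Exploitation: small Gaussian step around the best solution so far
                trial = [min(max(b + rng.gauss(0.0, 0.05), 0.0), 1.0) for b in best]
            if fitness(trial) < fitness(candidate):
                population[k] = trial             # greedy replacement
            if fitness(trial) < fitness(best):
                best = trial[:]
    return best, fitness(best)

def surrogate(position):
    """Hypothetical stand-in for a validation error, minimized at 0.3 in every dimension."""
    return sum((x - 0.3) ** 2 for x in position)

best, score = toy_switch_search(surrogate, dim=4)
```

In the hybrid models the fitness evaluation would be an LSTM training run, which is why the number of evaluations drives the training time.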

2.3.4 Horned lizard optimization algorithm

The horned lizard optimization algorithm is a novel meta-heuristic optimization algorithm inspired by the defensive behaviors of horned lizards (Peraza-Vázquez et al., 2024). The algorithm simulates the defensive strategies of horned lizards to search for optimal solutions in the solution space. HLOA first calculates the fitness value of each horned lizard, and during the optimization process, it employs various defensive strategies, including camouflage, skin color changes, blood-squirting, and fleeing. These strategies help the search avoid suboptimal solutions. Camouflage and skin color changes allow the horned lizard to better blend into its environment, enhancing the algorithm’s adaptability. Blood-squirting and fleeing strategies enable the lizard to evade predators, creating new search spaces. This improves the algorithm’s ability to explore uncharted areas of the solution space, strengthening its global search capabilities (Hachemi et al., 2024; Prajapati et al., 2024). The principles of the three optimization algorithms are illustrated in Figure 3.

Figure 3

Flowchart illustrating processes for optimizing Long Short-Term Memory (LSTM) parameters using three methods: BWO, GO, and HLOA. Each method involves initializing parameters, calculating fitness values, and iterative phases such as exploitation and exploration. Conditions and iterative checks lead to outputting optimal LSTM parameters.

Figure 3. The principles of the BWO, GO, and HLOA algorithms.

2.4 Evaluation metrics

This study evaluates the performance of solar radiation estimation models using root mean square error (RMSE), coefficient of determination (R²), mean absolute error (MAE), and Nash-Sutcliffe efficiency (NSE). Additionally, the global evaluation index (GPI) is used to assess the overall predictive performance of the models. The formulas for these metrics are shown in Equations 13–17:

R^{2} = \frac{{[\sum_{i = 1}^{n} (p_{i} - \bar{p}) (t_{i} - \bar{t})]}^{2}}{\sum_{i = 1}^{n} {(p_{i} - \bar{p})}^{2} \sum_{i = 1}^{n} {(t_{i} - \bar{t})}^{2}} (13)

RMSE = \sqrt{\frac{\sum_{i = 1}^{n} {(t_{i} - p_{i})}^{2}}{n}} (14)

MAE = \frac{\sum_{i = 1}^{n} |t_{i} - p_{i}|}{n} (15)

NSE = 1 - \frac{\sum_{i = 1}^{n} {(t_{i} - p_{i})}^{2}}{\sum_{i = 1}^{n} {(t_{i} - \bar{t})}^{2}} (16)

GPI = \sum_{j = 1}^{4} α_{j} (g_{j} - y_{j}) (17)

where $p_{i}$ represents the predicted value of solar radiation, $t_{i}$ represents the actual value of solar radiation, $\bar{p}$ represents the mean of the predicted solar radiation values, $\bar{t}$ represents the mean of the actual solar radiation values, and $n$ is the number of samples. The closer RMSE and MAE are to 0, the smaller the error in the prediction results. The closer R² and NSE are to 1, the better the predictive model fits the data. $g_{j}$ is the normalized value of RMSE, R², MAE, and NSE, while $y_{j}$ corresponds to the median value of each parameter. $α_{j}$ takes the value 1 for R² and NSE, and −1 in other cases. A higher GPI value or a higher ranking indicates better overall performance of the model.
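
Equations 13–17 can be sketched in plain Python as follows. The normalization behind $g_{j}$ is not specified in the text, so min-max scaling of each metric across the compared models is assumed here; the function names, the model labels, and all numbers are hypothetical.

```python
import math
from statistics import median

def r2(t, p):
    """Squared correlation between predictions and observations (Equation 13)."""
    t_bar, p_bar = sum(t) / len(t), sum(p) / len(p)
    cov = sum((pi - p_bar) * (ti - t_bar) for pi, ti in zip(p, t))
    var_p = sum((pi - p_bar) ** 2 for pi in p)
    var_t = sum((ti - t_bar) ** 2 for ti in t)
    return cov ** 2 / (var_p * var_t)

def rmse(t, p):
    return math.sqrt(sum((ti - pi) ** 2 for ti, pi in zip(t, p)) / len(t))   # Eq. 14

def mae(t, p):
    return sum(abs(ti - pi) for ti, pi in zip(t, p)) / len(t)                # Eq. 15

def nse(t, p):
    t_bar = sum(t) / len(t)
    residual = sum((ti - pi) ** 2 for ti, pi in zip(t, p))
    return 1.0 - residual / sum((ti - t_bar) ** 2 for ti in t)               # Eq. 16

def gpi(metric_table):
    """GPI per model (Equation 17), assuming min-max scaling across models."""
    alpha = {"R2": 1, "NSE": 1, "RMSE": -1, "MAE": -1}
    models = list(metric_table)
    scores = {m: 0.0 for m in models}
    for name, a in alpha.items():
        values = [metric_table[m][name] for m in models]
        lo, hi = min(values), max(values)
        scaled = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
        y = median(scaled)                        # median of the scaled metric
        for m, g in zip(models, scaled):
            scores[m] += a * (g - y)
    return scores

observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
errors = (rmse(observed, predicted), mae(observed, predicted), nse(observed, predicted))

# Hypothetical metric values for three models
table = {
    "A": {"RMSE": 3.0, "R2": 0.82, "MAE": 2.2, "NSE": 0.80},
    "B": {"RMSE": 3.5, "R2": 0.78, "MAE": 2.6, "NSE": 0.76},
    "C": {"RMSE": 4.0, "R2": 0.74, "MAE": 3.0, "NSE": 0.72},
}
scores = gpi(table)
```

Under this assumed scaling, a model that is best on all four metrics among three compared models reaches a GPI of 2.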

3 Results

3.1 Importance analysis of factors related to solar radiation

This study utilized XGBoost, LightGBM, and CatBoost algorithms to evaluate the importance of meteorological factors associated with solar radiation, using the integrated results of the three algorithms as the final importance ranking. The heatmap of correlations between six meteorological factors and solar radiation is shown in Figure 4, and the rankings of the importance of different meteorological factors are presented in Table 3. In the TCZ, the top-ranked factors are SH and VPD. The third and fourth positions are generally occupied by temperature-related factors, while RH and WIND rank relatively lower overall. In the TPMZ, the most important meteorological factor is T_max, followed by SH and VPD in second and third place, respectively, with T_min ranking next. RH and WIND exhibit a lower correlation with solar radiation.

Figure 4

Heatmap showing correlation coefficients between variables Tmax, RH, Tmin, WIND, VPD, and SH and solar radiation at stations numbered 51828 to 59431. The color gradient scale indicates strength from 0.000 to 1.000.

Figure 4. Correlations between meteorological factors and solar radiation. Note: the heatmap displays Pearson correlation coefficients between meteorological factors and solar radiation at six stations. Lighter colors indicate stronger correlations.

Table 3. Rankings of meteorological factor importance at study sites.

SH has the most significant influence at station 52203--Hami, with a relative importance value of 0.489. Following this, station 53487--Datong has an SH importance value of 0.312, and station 51828--Hetian also exhibits relatively high importance, with a value of 0.301. In the TPMZ, SH’s importance is moderate, with importance values of 0.251, 0.220, and 0.208 at stations 56739--Tengchong, 59316--Shantou, and 59431--Nanning, respectively. VPD has the highest importance value at station 53487, with a value of 0.353, making it the most significant factor at this station. At stations 51828, 52203, and 59431, the importance values of VPD are 0.277, 0.146, and 0.260, respectively, ranking second. For stations 56739 and 59316, VPD ranks third in importance, with values of 0.247 and 0.201, respectively. T_max is the most important factor in the TPMZ, with importance values of 0.263, 0.311, and 0.279 at stations 56739, 59316, and 59431, respectively. In the TCZ, T_max ranks third in importance at stations 51828 and 53487, with values of 0.189 and 0.116, respectively. T_min ranks fourth in importance at TPMZ stations, with values of 0.106, 0.148, and 0.145 at stations 56739, 59316, and 59431, respectively. In the TCZ, T_min’s importance value at station 52203 is 0.114, ranking third. The T_min values at stations 51828 and 53487 are 0.110 and 0.098, respectively, ranking fourth. RH is the fourth most important factor at station 52203, with an importance value of 0.097, while its importance is lower at other stations, indicating its relatively smaller influence. WIND is the least important factor across all stations, with the highest value being only 0.058 at station 52203.

Based on the importance rankings of factors at each station, this study established different combinations of factors to determine the optimal set of input factors. The input combinations for different research stations are shown in Table 4.
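
The construction of nested input combinations from a ranking can be sketched as follows. The ranking shown is illustrative, the function name is hypothetical, and the station-specific combinations actually used are those of Table 4.

```python
def nested_combinations(ranking, smallest=3):
    """Build C1, C2, ... by taking the top-k ranked factors for increasing k."""
    return {
        f"C{k - smallest + 1}": ranking[:k]
        for k in range(smallest, len(ranking) + 1)
    }

# Illustrative ranking, from most to least important
ranking = ["SH", "VPD", "T_max", "T_min", "RH", "WIND"]
combos = nested_combinations(ranking)
```

With six ranked factors this yields four combinations, from the three-factor C1 to the six-factor C4.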

Table 4. Input factor combinations for different study sites.

3.2 Analysis of optimal input combinations for solar radiation prediction models

This study employed the LSTM model and its hybrid models, incorporating different input factor combinations to predict solar radiation, aiming to identify the optimal set of input factors. The prediction accuracy of models in different climate zones is shown in Tables 5, 6. The box plots of evaluation metrics for model prediction results under different input combinations are displayed in Figure 5. In the TCZ and TPMZ, as the number of input factors increases, the prediction accuracy of the models improves. When predictions are made using the optimal factor combinations, the accuracy of the models reaches its peak. Subsequent additions of input factors no longer significantly enhance prediction accuracy. For both the TCZ and TPMZ, the optimal input factor combination is C3, including SH, VPD, T_max, T_min, and RH.

Table 5. Prediction accuracy of LSTM and hybrid models under different factor combinations in the TCZ.

Table 6. Prediction accuracy of LSTM and hybrid models under different factor combinations in the TPMZ.

Figure 5

Four box plots comparing four models: LSTM, BWO-LSTM, GO-LSTM, and HLOA-LSTM across four criteria: R², RMSE, MAE, and NSE. Each plot displays median lines and mean values, with varying performances for C1, C2, C3, and C4.

Figure 5. Box plots of prediction results for different input combinations. Note: C1-C4 represent different factor combinations at research sites. C1 represents a three-factor combination, either SH, VPD, T_max or SH, VPD, T_min, depending on the station. C2 represents a four-factor combination, either SH, VPD, T_max, T_min or SH, VPD, T_min, RH, depending on the station. C3 represents a five-factor combination, including SH, VPD, T_max, T_min, and RH. C4 represents a six-factor combination, including SH, VPD, T_max, T_min, RH, and WIND.

In the TCZ, as the input combinations progress from C1 to C3, the prediction performance of each model gradually improves. The RMSE values are 3.897 ± 0.473 MJ/(m²·day), 3.728 ± 0.369 MJ/(m²·day), and 3.610 ± 0.364 MJ/(m²·day), respectively; the corresponding R² values are 0.734 ± 0.035, 0.762 ± 0.033, and 0.795 ± 0.028. When all factors are included as inputs (C4), the accuracy does not improve further, with RMSE = 3.676 ± 0.407 MJ/(m²·day) and R² = 0.775 ± 0.034. Similarly, in the TPMZ, as the input combinations progress from C1 to C3, the prediction accuracy gradually improves. The RMSE values are 3.427 ± 0.833 MJ/(m²·day), 2.936 ± 0.640 MJ/(m²·day), and 2.879 ± 0.581 MJ/(m²·day), respectively; the corresponding R² values are 0.696 ± 0.080, 0.756 ± 0.063, and 0.794 ± 0.058. Adding further input factors does not significantly improve prediction accuracy: with combination C4 as input, RMSE = 2.879 ± 0.575 MJ/(m²·day) and R² = 0.787 ± 0.048.

3.3 Analysis of independent and hybrid solar radiation prediction models

To improve the accuracy of the standalone LSTM model for solar radiation prediction, this study utilized three meta-heuristic optimization algorithms, including BWO, GO, and HLOA, to fine-tune the LSTM model’s hyperparameters and achieve optimal parameter configurations. To visually demonstrate the predictive performance of the models, scatter plots of predicted versus actual solar radiation values under the same input factor combinations were created (as shown in Figure 6). Additionally, the GPI metric was used to comprehensively evaluate the prediction results, with GPI values and rankings for different models under four input combinations summarized in Table 7. The results indicate that the optimization algorithms significantly enhanced the prediction accuracy of the LSTM model, and the performance of hybrid models surpassed that of the standalone LSTM model. In both the TCZ and TPMZ, the HLOA-LSTM model exhibited superior predictive performance, followed by the GO-LSTM model, which performed comparably to the HLOA-LSTM. The BWO-LSTM model ranked last among the hybrid models but still outperformed the standalone LSTM model.

Figure 6. Scatter plot of predicted and observed solar radiation values for the prediction models. Note: each point represents a set of predicted versus observed solar radiation data. The diagonal line indicates perfect prediction. Points closer to the diagonal line represent higher prediction accuracy.

Table 7. GPI values and rankings of model prediction results under different input combinations.

Taking the optimal combination C3 as an example, the GPI values of the LSTM model at stations 51828, 52203, and 53487 are 1.039, 0.365, and 0.585, respectively. Among the three hybrid models, the HLOA-LSTM model performs strongly at these stations, with GPI values of 1.948, 1.634, and 1.985. The GO-LSTM and BWO-LSTM models also demonstrate good predictive capabilities, with GPI values of 1.799, 1.652, and 1.774 for GO-LSTM and 1.134, 2.000, and 1.699 for BWO-LSTM. For stations 56739, 59316, and 59431, the GPI values of the LSTM model are 0.906, 1.220, and 1.218, respectively, while those of the HLOA-LSTM model are 1.949, 1.729, and 1.903. At these stations, the GO-LSTM model obtains GPI values of 1.770, 1.840, and 1.930, and the BWO-LSTM model obtains 1.698, 1.851, and 1.684. Training time is also a key metric for evaluating the practical applicability of a model. Experimental results show that the average training time for the LSTM model is 6.278 s. After integrating optimization algorithms, the computational complexity of the hybrid models increases, leading to corresponding increases in training time. The average times per iteration for BWO-LSTM and GO-LSTM are 8.711 s and 10.147 s, respectively, while the training time for the HLOA-LSTM model is 10.769 s. Although HLOA-LSTM incurs a higher training overhead than the standalone LSTM (10.769 s versus 6.278 s), its significant improvement in prediction accuracy justifies this additional cost in applications demanding high-precision forecasting.
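
The figures quoted above can be summarized with a short script. The sketch below uses only the C3 GPI values and timing figures from this paragraph; the variable names are introduced here for illustration, and the timing comparison assumes that the four reported times are directly comparable, although the text describes two of them as times per iteration.

```python
from statistics import mean

stations = ["51828", "52203", "53487", "56739", "59316", "59431"]

# GPI values under input combination C3, in station order, from the text.
gpi_c3 = {
    "LSTM": [1.039, 0.365, 0.585, 0.906, 1.220, 1.218],
    "BWO-LSTM": [1.134, 2.000, 1.699, 1.698, 1.851, 1.684],
    "GO-LSTM": [1.799, 1.652, 1.774, 1.770, 1.840, 1.930],
    "HLOA-LSTM": [1.948, 1.634, 1.985, 1.949, 1.729, 1.903],
}

# Reported times in seconds, from the text.
train_time_s = {
    "LSTM": 6.278,
    "BWO-LSTM": 8.711,
    "GO-LSTM": 10.147,
    "HLOA-LSTM": 10.769,
}

# Mean GPI over the six stations, and models ordered from best to worst.
mean_gpi = {model: mean(values) for model, values in gpi_c3.items()}
ranking = sorted(mean_gpi, key=mean_gpi.get, reverse=True)

# Model with the highest GPI at each individual station.
best_per_station = {
    station: max(gpi_c3, key=lambda model: gpi_c3[model][i])
    for i, station in enumerate(stations)
}

# Time of each model relative to the standalone LSTM.
time_ratio = {model: t / train_time_s["LSTM"] for model, t in train_time_s.items()}

print(ranking)
print(best_per_station)
print({model: round(r, 2) for model, r in time_ratio.items()})
```

Averaged over the six stations, the ordering is HLOA-LSTM, GO-LSTM, BWO-LSTM, LSTM, which matches the ranking described in Section 3.3, although the best model at an individual station is not always HLOA-LSTM (for example, BWO-LSTM has the highest GPI at station 52203). The HLOA-LSTM time is about 1.7 times that of the standalone LSTM.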

4 Discussion

4.1 Regional differences in the influence of meteorological factors on solar radiation

Through feature analysis, this study reveals differences between the TCZ and TPMZ in how meteorological factors correlate with solar radiation. The correlation between meteorological factors and solar radiation varies across regions, which may be related to the climatic characteristics of the areas where the stations are located. SH ranks relatively high in both the TCZ and TPMZ, indicating that SH is an important input for predicting solar radiation (Venkatachalam and Solomon, 2019). VPD reflects the dryness of the air, and its importance is also high in both the TCZ and TPMZ, making it an important factor influencing solar radiation (Sharafati et al., 2019). The weather in the TCZ is typically sunny and dry, with longer sunshine hours (Wu et al., 2013), so SH and VPD have a more significant impact on solar radiation in this region. T_max reflects the cumulative solar radiation over the course of a day and is also a meteorological factor with a high correlation to solar radiation (Tao et al., 2019). The year-round high temperatures in the TPMZ result in a higher correlation between T_max and solar radiation in this area. T_min (Yadav et al., 2015), RH (Jović et al., 2016), and WIND (Kasaeian et al., 2016) generally rank lower, but they still contribute to solar radiation models by providing additional information.
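
One simple way to combine importance scores from several models into a single ranking is sketched below. It is an assumption-laden illustration: the integration rule (normalize each model's scores to sum to 1, then average) is one plausible choice and is not confirmed as the rule used in this study, and the numbers in `placeholder_scores` are placeholders, not results from the study.

```python
FACTORS = ["SH", "VPD", "T_max", "T_min", "RH", "WIND"]


def integrate_importances(scores_by_model):
    """Normalize each model's scores to sum to 1, then average across models."""
    n_models = len(scores_by_model)
    combined = [0.0] * len(FACTORS)
    for scores in scores_by_model.values():
        total = sum(scores)
        for i, score in enumerate(scores):
            combined[i] += (score / total) / n_models
    return combined


# Placeholder numbers for illustration only; NOT values from the study.
# Different libraries report importance on different scales, which is
# why each model is normalized before averaging.
placeholder_scores = {
    "XGBoost": [40, 25, 15, 10, 7, 3],
    "LightGBM": [300, 250, 200, 120, 90, 40],
    "CatBoost": [35.0, 30.0, 15.0, 10.0, 6.0, 4.0],
}

combined = integrate_importances(placeholder_scores)
order = sorted(FACTORS, key=lambda name: combined[FACTORS.index(name)], reverse=True)
print(order)
```

Normalizing before averaging prevents a model with large raw scores from dominating the combined ranking.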

4.2 Optimal factor combinations

Based on the feature analysis results, this study constructed four different input factor combinations. In the TCZ and TPMZ, the models with the three-factor input combination (C1) showed relatively low prediction accuracy. Although C1 includes the meteorological factors most highly correlated with solar radiation in the region, such as SH, T_max, and VPD, solar radiation is influenced by multiple factors, and these three factors alone may not provide sufficient information for the prediction model. C2, which adds T_min or RH, provides more meteorological information to the model, allowing it to better capture the non-linear relationship between the input variables and solar radiation. The combination of SH, VPD, T_max, T_min, and RH is the optimal input combination for the TCZ and TPMZ, as it includes the key meteorological factors affecting solar radiation and significantly improves the model's prediction accuracy. The introduction of WIND did not further enhance prediction accuracy, possibly because the additional input caused the model to overfit by capturing noise and outliers in the data. In addition, WIND is correlated with meteorological factors such as SH and T_max, so it contributes redundant information, which not only reduced model accuracy but also increased model complexity.
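
The redundancy argument can be checked on a given dataset by looking at pairwise correlations between candidate inputs. The sketch below is a minimal illustration: the function names, the 0.8 threshold, and the `toy` series are all assumptions introduced here, and the toy numbers are synthetic values chosen so that one pair is strongly correlated; they are not station data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for input pairs with |r| >= threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Synthetic toy series for illustration only; NOT measurements from the study.
toy = {
    "SH": [2.0, 5.0, 8.0, 6.0, 9.0, 3.0],
    "WIND": [1.1, 2.4, 4.2, 3.0, 4.4, 1.6],
    "RH": [70.0, 40.0, 65.0, 80.0, 50.0, 45.0],
}

print(redundant_pairs(toy))
```

A pair flagged this way is a candidate for pruning, but a high linear correlation alone does not prove that an input is useless to a non-linear model, so the check is best used alongside the accuracy comparison of C3 and C4.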

4.3 Optimal hybrid model

This study optimized the LSTM model using three optimization algorithms: BWO, GO, and HLOA. The results show that the hybrid models perform better than the standalone LSTM model, indicating that the BWO, GO, and HLOA algorithms can effectively tune the hyperparameters of the LSTM and thereby improve its prediction accuracy. In both the TCZ and TPMZ, the performance of HLOA exceeded that of the GO and BWO algorithms, likely because HLOA draws on the diverse behavioral strategies of the horned lizard, which give the algorithm stronger global and local search capabilities and the ability to handle various complex optimization problems while avoiding issues that may arise during the optimization process (Peraza-Vázquez et al., 2024). In contrast, the GO and BWO algorithms are more prone to getting stuck in local optima and may suffer from premature convergence (Li et al., 2024). However, the enhanced search capabilities of HLOA come at the expense of increased computational complexity, resulting in an increase in training time. Nevertheless, the HLOA-LSTM model achieves substantially higher prediction accuracy, and this additional computational cost is considered acceptable in application scenarios where prediction accuracy is prioritized.

In summary, feature importance analysis algorithms and optimization algorithms can effectively improve the performance of the LSTM model in solar radiation prediction. In the TCZ and TPMZ, HLOA demonstrated outstanding optimization capabilities, significantly improving the prediction accuracy and stability of the LSTM model. The HLOA-LSTM model, which uses the five meteorological factors SH, VPD, T_max, T_min, and RH as input variables, is the best solar radiation prediction model in the TCZ and TPMZ.

Despite achieving favorable performance in solar radiation prediction, this study still has certain limitations. Firstly, the study area is limited to the TCZ and TPMZ. When applying the model to broader climatic regions, further adjustments and optimizations are required. Secondly, this study primarily considered common meteorological factors, while other known factors affecting solar radiation, such as cloud cover and aerosol optical depth, have not been fully examined. This could be a limiting factor in improving model accuracy. Additionally, there is still room for improvement in the predictive accuracy of the final hybrid model, HLOA-LSTM. Future research will continue to explore the potential of machine learning models in solar radiation prediction, with a particular focus on analyzing and optimizing their applicability and accuracy under diverse climatic conditions and meteorological variables. Concurrently, efforts will be made to expand the scale of datasets to evaluate the training efficiency and performance of models on larger datasets, while investigating potential computational overhead and optimization opportunities. This will help address the complexities associated with solar radiation prediction under varying environmental conditions.

5 Conclusion

This study developed a high-accuracy solar radiation prediction model with minimal input requirements based on the LSTM algorithm. To identify meteorological factors with strong correlations to solar radiation, three feature importance analysis algorithms—XGBoost, LightGBM, and CatBoost—were employed to analyze six meteorological factors (SH, VPD, T_max, T_min, RH, and WIND). The integrated results of these analyses were used to determine the final importance values for each factor. Based on these findings, different factor combinations were constructed to identify the optimal combination. To further enhance the prediction accuracy of the LSTM model, three optimization algorithms—BWO, GO, and HLOA—were used to fine-tune the hyperparameters of the LSTM model, resulting in a highly accurate solar radiation prediction model. The main findings are as follows:

1. XGBoost, LightGBM, and CatBoost algorithms effectively analyzed the input variables, significantly improving the prediction accuracy of the model. In the TCZ, SH and VPD were the factors most strongly correlated with solar radiation, followed by T_max, T_min, and RH, while WIND showed relatively weak correlation. In the TPMZ, T_max was the most important factor, followed by SH and VPD, then T_min and RH, with WIND again having relatively low importance for solar radiation. The input combination of SH, VPD, T_max, T_min, and RH was identified as the optimal factor combination for the TCZ and TPMZ.

2. Optimization algorithms significantly enhanced the prediction accuracy of the LSTM model, with HLOA achieving the best optimization performance and the largest improvement in model accuracy.

3. When using the optimal factor combination as input, the HLOA-LSTM model is the best solar radiation prediction model for the TCZ and TPMZ, with prediction accuracies of: RMSE = 3.470 ± 0.224 MJ/(m²·day), R² = 0.807 ± 0.016 for the TCZ, and RMSE = 2.858 ± 0.561 MJ/(m²·day), R² = 0.814 ± 0.038 for the TPMZ.
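
For readers who wish to reproduce the two accuracy metrics quoted in these findings, a minimal sketch is given below. It assumes the standard definitions, with R² taken as the coefficient of determination (1 minus the ratio of residual to total sum of squares); some studies instead report the squared correlation coefficient, so this choice is an assumption. The sample values are made up for illustration and are not data from the study.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up daily solar radiation values in MJ/(m^2·day), for illustration only.
observed = [10.0, 15.0, 20.0, 25.0]
predicted = [12.0, 14.0, 19.0, 27.0]

print(round(rmse(observed, predicted), 3))       # 1.581
print(round(r_squared(observed, predicted), 3))  # 0.92
```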

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

LZ: Conceptualization, Formal Analysis, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review and editing. FW: Conceptualization, Data curation, Methodology, Resources, Software, Validation, Writing – original draft. HoW: Formal Analysis, Investigation, Writing – review and editing. HuW: Project administration, Writing – review and editing. YS: Investigation, Software, Visualization, Writing – review and editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was financially supported by the Henan Tobacco Company Luoyang Company Technological Innovation Projects (Grant No. 2023410300200043), National Natural Science Foundation of China (Grant No. 52309050), and the Key Scientific Research Projects of Colleges and Universities in Henan Province (Grant No. 24B416001).

Conflict of interest

Authors HoW and HuW were employed by Henan Tobacco Company Luoyang Company.

The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that this work received funding from Henan Tobacco Company Luoyang Company. The funder had the following involvement in the study: study design and decision to submit it for publication.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bilali, A. E., Hadri, A., Taleb, A., Tanarhte, M., El Mahdi, E., and Kharrou, M. H. (2025). A novel hybrid modeling approach based on empirical methods, PSO, XGBoost, and multiple GCMs for forecasting long-term reference evapotranspiration in a data scarce-area. Comput. Electron. Agric. 232, 110106. doi:10.1016/j.compag.2025.110106

Cao, Q., Yang, L., Liu, Y., and Wang, S. (2023). Development criterion of estimating hourly global solar radiation for all sky conditions in China. Energy Convers. Manag. 284, 116946. doi:10.1016/j.enconman.2023.116946

Chen, T., and Guestrin, C. (2016). “Xgboost: a scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.

Chen, B., Zheng, H., Luo, G., Chen, C., Bao, A., Liu, T., et al. (2022). Adaptive estimation of multi-regional soil salinization using extreme gradient boosting with Bayesian TPE optimization. Int. J. Remote Sens. 43, 778–811. doi:10.1080/01431161.2021.2009589

de Araujo, J. M. S. (2020). Performance comparison of solar radiation forecasting between WRF and LSTM in Gifu, Japan. Environ. Res. Commun. 2, 045002. doi:10.1088/2515-7620/ab7366

Fan, J., Chen, B., Wu, L., Zhang, F., Lu, X., and Xiang, Y. (2018). Evaluation and development of temperature-based empirical models for estimating daily global solar radiation in humid regions. Energy 144, 903–914. doi:10.1016/j.energy.2017.12.091

Fan, J., Wu, L., Zhang, F., Cai, H., Zeng, W., Wang, X., et al. (2019). Empirical and machine learning models for predicting daily global solar radiation from sunshine duration: a review and case study in China. Renew. Sustain. Energy Rev. 100, 186–212. doi:10.1016/j.rser.2018.10.018

Farhadi, R., and Taki, M. (2020). The energy gain reduction due to shadow inside a flat-plate solar collector. Renew. Energy 147, 730–740. doi:10.1016/j.renene.2019.09.012

Ghimire, S., Deo, R. C., Raj, N., and Mi, J. (2019). Deep solar radiation forecasting with convolutional neural network and long short-term memory network algorithms. Appl. Energy 253, 113541. doi:10.1016/j.apenergy.2019.113541

Ghimire, S., Deo, R. C., Casillas-Pérez, D., Salcedo-Sanz, S., Sharma, E., and Ali, M. (2022). Deep learning CNN-LSTM-MLP hybrid fusion model for feature optimizations and daily solar radiation prediction. Measurement 202, 111759. doi:10.1016/j.measurement.2022.111759

Hachemi, A. T., Sadaoui, F., Saim, A., Ebeed, M., and Arif, S. (2024). Dynamic operation of distribution grids with the integration of photovoltaic systems and distribution static compensators considering network reconfiguration. Energy Rep. 12, 1623–1637. doi:10.1016/j.egyr.2024.07.050

Hamad, R. K., and Rashid, T. A. (2024). GOOSE algorithm: a powerful optimization tool for real-world engineering challenges and beyond. Evol. Syst. 15, 1–26. doi:10.1007/s12530-023-09553-6

Jia, D., Yang, L., Lv, T., Liu, W., Gao, X., and Zhou, J. (2022). Evaluation of machine learning models for predicting daily global and diffuse solar radiation under different weather/pollution conditions. Renew. Energy 187, 896–906. doi:10.1016/j.renene.2022.02.002

Jiang, X., Duan, H., Liao, J., Guo, P., Huang, C., and Xue, X. (2022). Estimation of soil salinization by machine learning algorithms in different arid regions of Northwest China. Remote Sens. 14, 347. doi:10.3390/rs14020347

Jiang, Y., Li, F., Gong, Y., Yang, X., and Zhang, Z. (2025). Multiple environmental variables as covariates to improve the accuracy of spatial prediction models for SOM on Karst Aera. Land Degrad. and Dev. doi:10.1002/ldr.5454

Jović, S., Aničić, O., Marsenić, M., and Nedić, B. (2016). Solar radiation analyzing by neuro-fuzzy approach. Energy Build. 129, 261–263. doi:10.1016/j.enbuild.2016.08.020

Kasaeian, A., Mehrpooya, M., Aghaie, M., and Ahmadi, M. H. (2016). Solar radiation prediction based on ICA and HGAPSO for Kuhin City, Iran. Mech. and Industry 17, 509. doi:10.1051/meca/2015100

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). Lightgbm: a highly efficient gradient boosting decision tree. Adv. Neural Information Processing Systems 30.

Kookalani, S., Cheng, B., and Torres, J. L. C. (2022). Structural performance assessment of GFRP elastic gridshells by machine learning interpretability methods. Front. Struct. Civ. Eng. 16, 1249–1266. doi:10.1007/s11709-022-0858-5

Li, J., Zhou, X., Zhou, Y., and Han, A. (2024). Optimal configuration of distributed generation based on an improved Beluga whale optimization. IEEE Access. doi:10.1109/ACCESS.2024.3368440

Miao, S., Ning, G., Gu, Y., Yan, J., and Ma, B. (2018). Markov Chain model for solar farm generation and its application to generation performance evaluation. J. Clean. Prod. 186, 905–917. doi:10.1016/j.jclepro.2018.03.173

Pan, Y., Tian, H., Farid, M. A., He, X., Heng, T., Hermansen, C., et al. (2024). Metaheuristic optimization of water resources: a case study of the Manas River irrigation district. J. Hydrology 639, 131640. doi:10.1016/j.jhydrol.2024.131640

Patel, A., and Swathika, O. G. (2024). Off-Grid small-scale power forecasting using optimized machine learning algorithms. IEEE Access. doi:10.1109/ACCESS.2024.3430385

Peraza-Vázquez, H., Peña-Delgado, A., Merino-Treviño, M., Morales-Cepeda, A. B., and Sinha, N. (2024). A novel metaheuristic inspired by horned lizard defense tactics. Artif. Intell. Rev. 57, 59. doi:10.1007/s10462-023-10653-7

Prajapati, S., Garg, R., and Mahajan, P. (2024). Novel adaptive MPPT technique for enhanced performance of grid integrated solar photovoltaic system. Comput. Electr. Eng. 120, 109648. doi:10.1016/j.compeleceng.2024.109648

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. Adv. Neural Information Processing Systems 31. doi:10.48550/arXiv.1706.09516

Rodríguez, E., Droguett, E. L., Cardemil, J. M., Starke, A. R., and Cornejo-Ponce, L. (2024). Enhancing the estimation of direct normal irradiance for six climate zones through machine learning models. Renew. Energy 231, 120925. doi:10.1016/j.renene.2024.120925

Sebastianelli, A., Serva, F., Ceschini, A., Paletta, Q., Panella, M., and Le Saux, B. (2024). Machine learning forecast of surface solar irradiance from meteo satellite data. Remote Sens. Environ. 315, 114431. doi:10.1016/j.rse.2024.114431

Sharafati, A., Khosravi, K., Khosravinia, P., Ahmed, K., Salman, S. A., Yaseen, Z. M., et al. (2019). The potential of novel data mining models for global solar radiation prediction. Int. J. Environ. Sci. Technol. 16, 7147–7164. doi:10.1007/s13762-019-02344-0

Sun, X., Zhu, L., and Liu, D. (2024a). Blueberry bruise non-destructive detection based on hyperspectral information fusion combined with multi-strategy improved Beluga Whale optimization algorithm. Front. Plant Sci. 15, 1411485. doi:10.3389/fpls.2024.1411485

Sun, Y., Wang, X., Gao, L., Yang, H., Zhang, K., Ji, B., et al. (2024b). Multi-objective optimal scheduling for microgrids—improved goose algorithm. Energies 17, 6376. doi:10.3390/en17246376

Tahir, M. F., Yousaf, M. Z., Tzes, A., El Moursi, M. S., and El-Fouly, T. H. (2024). Enhanced solar photovoltaic power prediction using diverse machine learning algorithms with hyperparameter optimization. Renew. Sustain. Energy Rev. 200, 114581. doi:10.1016/j.rser.2024.114581

Taki, M., Rohani, A., Soheili-Fard, F., and Abdeshahi, A. (2018). Assessment of energy consumption and modeling of output energy for wheat production by neural network (MLP and RBF) and Gaussian process regression (GPR) models. J. Cleaner Production 172, 3028–3041. doi:10.1016/j.jclepro.2017.11.107

Tao, H., Ebtehaj, I., Bonakdari, H., Heddam, S., Voyant, C., Al-Ansari, N., et al. (2019). Designing a new data intelligence model for global solar radiation prediction: application of multivariate modeling scheme. Energies 12, 1365. doi:10.3390/en12071365

Venkatachalam, C., and Solomon, G. (2019). Dataset of solar energy potential assessment for Adama city (Ethiopia). Data Brief 24, 103879. doi:10.1016/j.dib.2019.103879

Wang, S., Wu, Y., Li, R., and Wang, X. (2023a). Remote sensing-based retrieval of soil moisture content using stacking ensemble learning models. Land Degrad. and Dev. 34, 911–925. doi:10.1002/ldr.4505

Wang, Y., Zhang, Z., Pang, N., Sun, Z., and Xu, L. (2023b). CEEMDAN-CatBoost-SATCN-based short-term load forecasting model considering time series decomposition and feature selection. Front. Energy Res. 10, 1097048. doi:10.3389/fenrg.2022.1097048

Wang, J., Zhou, X., Liu, Y., Sun, J., Guo, P., and Lv, W. (2025). An efficient nondestructive detection method of rapeseed varieties based on hyperspectral imaging technology. Microchem. J. 210, 112913. doi:10.1016/j.microc.2025.112913

Wu, W., Tang, X.-P., Yang, C., Guo, N.-J., and Liu, H.-B. (2013). Spatial estimation of monthly mean daily sunshine hours and solar radiation across mainland China. Renew. Energy 57, 546–553. doi:10.1016/j.renene.2013.02.027

Wu, L., Zhou, H., Ma, X., Fan, J., and Zhang, F. (2019). Daily reference evapotranspiration prediction based on hybridized extreme learning machine model with bio-inspired optimization algorithms: application in contrasting climates of China. J. Hydrology 577, 123960. doi:10.1016/j.jhydrol.2019.123960

Yadav, A. K., Malik, H., and Chandel, S. S. (2015). Application of rapid miner in ANN based prediction of solar radiation for assessment of solar energy resource potential of 76 sites in Northwestern India. Renew. Sustain. Energy Rev. 52, 1093–1106. doi:10.1016/j.rser.2015.07.156

Yang, D., and Gueymard, C. A. (2019). Producing high-quality solar resource maps by integrating high-and low-accuracy measurements using Gaussian processes. Renew. Sustain. Energy Rev. 113, 109260. doi:10.1016/j.rser.2019.109260

Yang, K., Koike, T., and Ye, B. (2006). Improving estimation of hourly, daily, and monthly solar radiation by importing global data sets. Agric. For. Meteorology 137, 43–55. doi:10.1016/j.agrformet.2006.02.001

Yildirim, A., Bilgili, M., and Ozbek, A. (2023). One-hour-ahead solar radiation forecasting by MLP, LSTM, and ANFIS approaches. Meteorology Atmos. Phys. 135, 10. doi:10.1007/s00703-022-00946-x

Zeng, Z., Gower, D. B., and Wood, E. F. (2018). Accelerating forest loss in Southeast Asian Massif in the 21st century: a case study in Nan Province, Thailand. Glob. Change Biology 24, 4682–4695. doi:10.1111/gcb.14366

Zhang, Y., Cui, N., Feng, Y., Gong, D., and Hu, X. (2019). Comparison of BP, PSO-BP and statistical models for predicting daily global solar radiation in arid Northwest China. Comput. Electron. Agric. 164, 104905. doi:10.1016/j.compag.2019.104905

Zhao, L., Zhao, X., Pan, X., Shi, Y., Qiu, Z., Li, X., et al. (2022). Prediction of daily reference crop evapotranspiration in different Chinese climate zones: combined application of key meteorological factors and Elman algorithm. J. Hydrology 610, 127822. doi:10.1016/j.jhydrol.2022.127822

Zhao, L., Qing, S., Bai, J., Hao, H., Li, H., Shi, Y., et al. (2023a). A hybrid optimized model for predicting evapotranspiration in early and late rice based on a categorical regression tree combination of key influencing factors. Comput. Electron. Agric. 211, 108031. doi:10.1016/j.compag.2023.108031

Zhao, L., Qing, S., Wang, F., Wang, H., Ma, H., Shi, Y., et al. (2023b). Prediction of rice yield based on multi-source data and hybrid LSSVM algorithms in China. Int. J. Plant Prod. 17, 693–713. doi:10.1007/s42106-023-00266-z

Zhong, C., Li, G., and Meng, Z. (2022). Beluga whale optimization: a novel nature-inspired metaheuristic algorithm. Knowledge-Based Syst. 251, 109215. doi:10.1016/j.knosys.2022.109215

Keywords: feature importance analysis, long short-term memory network algorithm, machine learning, optimization algorithms, solar radiation

Citation: Zhao L, Wang F, Wang H, Wang H and Shi Y (2026) Hybrid feature-LSTM for solar radiation forecasting in different Chinese climate zones. Front. Earth Sci. 14:1745611. doi: 10.3389/feart.2026.1745611

Received: 15 November 2025; Accepted: 19 January 2026;
Published: 10 February 2026.

Edited by:

Petru Adrian Cotfas, Transilvania University of Braşov, Romania

Reviewed by:

Kasim Oztoprak, Konya Food and Agriculture University, Türkiye
Andreea Sabadus, West University of Timişoara, Romania

Copyright © 2026 Zhao, Wang, Wang, Wang and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hui Wang
