- 1 College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang, China
- 2 College of Agricultural Equipment Engineering, Henan University of Science and Technology, Luoyang, China
- 3 Henan Tobacco Company Luoyang Company, Luoyang, China
Solar radiation is a critical energy source for life and ecosystems on Earth, influencing the growth and development of crops. Accurate solar radiation forecasting promotes agricultural development and ensures national food security. This study developed a high-accuracy solar radiation prediction system based on the long short-term memory (LSTM) network model and its hybrid models. Three feature importance analysis algorithms, including extreme gradient boosting, light gradient boosting machine, and categorical gradient boosting, were employed to evaluate the importance of meteorological factors and to develop various factor combinations. Furthermore, three optimization algorithms, including beluga whale optimization algorithm, goose optimization algorithm, and horned lizard optimization algorithm (HLOA), were applied to optimize the hyperparameters of the LSTM model. Based on the forecasting results, the optimal input combinations and optimization algorithms were determined. The findings reveal that sunshine hours (SH) and vapor pressure deficit (VPD) are the factors most strongly correlated with solar radiation in the temperate continental zone (TCZ), whereas maximum temperature (Tmax) exhibits the highest importance coefficient in the tropical monsoon zone (TPMZ). The optimal factor combination for the forecasting models includes SH, VPD, Tmax, minimum temperature (Tmin), and relative humidity (RH) in both climate zones. Optimization algorithms significantly enhance the accuracy of the LSTM model, with HLOA demonstrating the best performance. Specifically, the HLOA-LSTM model with the optimal factor combination achieves the following precision metrics: for TCZ, RMSE = 3.470 ± 0.224 MJ/(m²·day), R² = 0.807 ± 0.016; for TPMZ, RMSE = 2.858 ± 0.561 MJ/(m²·day), R² = 0.814 ± 0.038. The results of this study indicate that the HLOA-LSTM model, with inputs including SH, VPD, Tmax, Tmin, and RH, is the optimal model for solar radiation prediction, providing a valuable reference for high-precision solar radiation forecasting in the TCZ and TPMZ.
1 Introduction
The growth of the global population and economy, along with improved living standards, has led to increasing demands for food and energy (Zeng et al., 2018). Solar energy is not only the most accessible renewable and clean energy source, but also a crucial element for plant growth and crop evapotranspiration (Farhadi and Taki, 2020; Yang and Gueymard, 2019). Solar energy resources play a significant role in enhancing food security, mitigating the destructive environmental impacts of fossil fuels, and improving the monitoring of agricultural hydrological environments (Miao et al., 2018). Accurate estimation of solar radiation is essential for reducing fossil fuel consumption and managing hydrological conditions in farmland.
Currently, solar radiation data are primarily obtained through three methods: direct measurement using solar devices such as pyranometers, estimation via empirical models for regions without solar measuring instruments, and prediction using machine learning models (Taki et al., 2018). While direct observations through meteorological stations can provide accurate solar radiation information, the measurement instruments and their maintenance are expensive and time-consuming (Fan et al., 2018). Due to the limited coverage of meteorological and radiation monitoring stations, researchers cannot determine solar radiation levels in areas without measuring equipment (Cao et al., 2023). To address this limitation, researchers have developed various simple empirical models with minimal input requirements based on the relationships and patterns of solar radiation and its influencing factors (Yang et al., 2006). Although empirical models offer advantages such as ease of use and availability of input data, they require regional calibration for accurate results and are challenging to apply over large areas (Zhang et al., 2019). Moreover, the complexity and variability of the relationships between solar radiation and its influencing factors cannot be fully represented by quantitative models.
With the advancement of artificial intelligence, researchers have applied various machine learning models to solar radiation prediction (Jia et al., 2022; Sebastianelli et al., 2024). Machine learning models possess powerful nonlinear fitting capabilities, enabling them to capture complex nonlinear relationships between multiple independent and dependent variables. They can handle multivariable datasets, exhibit strong generalization abilities, and offer better adaptability and flexibility (Rodríguez et al., 2024; Zhao et al., 2023a). Fan et al. (2019) compared the performance of various empirical and machine learning models, concluding that machine learning models outperform empirical models. Among the machine learning algorithms used for solar radiation prediction, the long short-term memory (LSTM) network, an improvement of the recurrent neural network (RNN), utilizes a gated structure that significantly enhances its performance in processing sequential data, thereby providing more accurate results (Ghimire et al., 2019). This makes LSTM models widely used in solar radiation prediction studies. De Araujo (2020) demonstrated that the LSTM algorithm is an effective model for solar radiation prediction. Yildirim et al. (2023) compared the performance of LSTM, multilayer perceptron, and adaptive neuro-fuzzy inference systems for solar radiation prediction, concluding that LSTM outperformed the others, providing highly accurate results. Therefore, this study employs the LSTM model to establish a high-performance predictive model.
Standalone machine learning models have been widely applied in predictive tasks. To improve prediction accuracy and robustness further, researchers have combined optimization algorithms with machine learning models to fine-tune hyperparameters (Patel and Swathika, 2024; Tahir et al., 2024; Zhao et al., 2022). The integration of meta-heuristic optimization algorithms with machine learning models enables the discovery of optimal hyperparameter configurations, thereby maximizing model potential (Wu et al., 2019). The beluga whale optimization (BWO) algorithm, inspired by the predatory behavior of beluga whales, effectively explores the solution space. It is easy to implement and understand while exhibiting good adaptability (Zhong et al., 2022). The goose optimization (GO) algorithm, which simulates goose behavior, ensures precise global and local searches, making it a powerful optimization algorithm for efficient exploration of optimal solutions (Hamad and Rashid, 2024). The horned lizard optimization algorithm (HLOA) simulates the defensive strategies of horned lizards, effectively overcoming challenges in the optimization process. It simultaneously enables exploration and exploitation of the solution space, avoiding local optima. With its simple parameter configuration, HLOA is suitable for various complex parameter optimization tasks (Peraza-Vázquez et al., 2024). In summary, this study selects the BWO algorithm, the GO algorithm, and HLOA to fine-tune the hyperparameters of the LSTM model, aiming to construct a highly accurate and robust solar radiation prediction system.
The accuracy of machine learning models is influenced not only by hyperparameters but also significantly by the input features. Analyzing the importance of input factors can identify variables highly correlated with the target variable, exclude less relevant variables, improve training efficiency, and prevent overfitting (Zhao et al., 2023b). Gradient boosted decision trees (GBDT), an ensemble learning algorithm based on decision trees, evaluates feature contributions to provide robust analysis results (Wang S. et al., 2023). Researchers have developed three new algorithms based on GBDT—extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). These algorithms improve computational efficiency, predictive accuracy, and overall performance to enhance applicability. Jiang et al. (2022) used the LightGBM algorithm to select land surface parameters affecting soil salinity, optimizing soil salinity prediction models. Chen et al. (2022) applied the XGBoost algorithm to construct a multi-regional soil salinity estimation model, demonstrating that XGBoost provides accurate feature importance analysis results. Kookalani et al. (2022) showed that the CatBoost algorithm effectively captures nonlinear relationships to offer reliable feature analysis. Therefore, this study utilizes XGBoost, LightGBM, and CatBoost algorithms to comprehensively analyze the importance of meteorological factors, aiming to reduce model complexity and enhance prediction accuracy.
The study integrated six meteorological factors (sunshine hours (SH), maximum temperature (Tmax), minimum temperature (Tmin), vapor pressure deficit (VPD), relative humidity (RH), and wind speed (WIND)) to develop a solar radiation estimation system. First, XGBoost, LightGBM, and CatBoost were applied to perform feature importance analysis on the input factors, forming diverse combinations of input variables. Subsequently, the LSTM model was employed as the base model, and its hyperparameters were optimized using the BWO algorithm, the GO algorithm, and HLOA to construct a highly accurate solar radiation estimation model. The primary objectives of this study are as follows: (1) using the XGBoost, LightGBM, and CatBoost algorithms to analyze the correlations of meteorological factors with solar radiation across different climate zones; (2) constructing various factor combinations based on the comprehensive results of the three analysis algorithms and determining the optimal input factor combination for each climate zone; (3) combining the BWO algorithm, the GO algorithm, and HLOA with the LSTM model to develop hybrid models and comparing the predictive performance of the hybrid models with that of the standalone model.
2 Materials and methods
2.1 Study sites
China’s climate zones can be classified into tropical monsoon zone (TPMZ), subtropical monsoon zone, temperate monsoon zone, temperate continental zone (TCZ), and mountain plateau zone, based on regional variations in climate and precipitation. For this study, six stations were selected within the TCZ and TPMZ. These two zones exhibit significant differences in climatic conditions. By conducting experiments across these climatic zones, we can more fully assess the robustness and generalization capabilities of machine learning models and optimization algorithms when processing data with distinct climatic characteristics. The geographical and meteorological information of the study sites is listed in Table 1, with their spatial distribution illustrated in Figure 1. The data used in this study were obtained from the National Meteorological Science Data Center of China (http://data.cma.cn) and randomly divided into training and testing sets in an 8:2 ratio to ensure the independence of model training and validation. The dataset includes daily solar radiation measurements from 2012 to 2016, along with daily meteorological variables such as sunshine hours, maximum temperature, minimum temperature, vapor pressure deficit, relative humidity, and wind speed. Furthermore, prior to model training, all variables were normalized to ensure comparability among features and facilitate more efficient convergence of the machine learning algorithms.
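The split and normalization described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the text does not say which normalization was applied, so min-max scaling fitted on the training rows is assumed, and the function name `split_and_normalize` is hypothetical.

```python
import random

def split_and_normalize(rows, train_fraction=0.8, seed=0):
    """Randomly split rows 8:2, then min-max scale each column.

    Assumption: min-max scaling with bounds taken from the training rows
    only; the normalization method is not stated in the text.
    """
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)          # random, reproducible split
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in order[:n_train]]
    test = [rows[i] for i in order[n_train:]]

    n_cols = len(rows[0])
    lo = [min(r[j] for r in train) for j in range(n_cols)]
    hi = [max(r[j] for r in train) for j in range(n_cols)]

    def scale(row):
        # Constant columns map to 0.0 to avoid division by zero.
        return [(v - l) / (h - l) if h > l else 0.0
                for v, l, h in zip(row, lo, hi)]

    return [scale(r) for r in train], [scale(r) for r in test]
```

Each row would hold one day's variables (for example SH, Tmax, Tmin, VPD, RH, WIND, and solar radiation). Because the bounds come from the training rows, scaled test values can fall slightly outside [0, 1].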
Figure 1. Geographical distribution of study sites. Note: TCZ--temperate continental zone; TMZ--temperate monsoon zone; SMZ--subtropical monsoon zone; MPZ--mountain plateau zone; TPMZ--tropical monsoon zone.
2.2 Feature importance analysis algorithms
2.2.1 Light gradient boosting machine
LightGBM is a framework that implements the GBDT algorithm (Jiang et al., 2025). It supports highly efficient parallel training and offers advantages over traditional GBDT algorithms, including higher accuracy, faster training speeds, and lower memory consumption, making it well-suited for processing large-scale datasets. The feature importance analysis in the LightGBM algorithm is based on feature split gain. Specifically, the gain importance for a given feature i is expressed as shown in Equation 1:
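For reference, split-gain importance is commonly written as the total gain of every split that uses the feature. The notation below is assumed for illustration and may differ from the authors' own symbols: $T$ is the number of trees, $S_t(i)$ is the set of splits on feature $i$ in tree $t$, and $G_s$ is the loss reduction achieved by split $s$.

```latex
\mathrm{Gain}(i) = \sum_{t=1}^{T} \sum_{s \in S_t(i)} G_s
```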
2.2.2 Extreme gradient boosting
The XGBoost algorithm is built on the principles of gradient boosting, with a focus on optimizing the objective function. By employing a second-order Taylor expansion of the objective function, XGBoost accelerates convergence and improves accuracy (Bilali et al., 2025). Feature importance in XGBoost is determined by the gain achieved when a feature is used in a split across all decision trees. The importance of feature i is calculated as the weighted average of its gain across all trees, expressed as in Equation 2 below:
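As a hedged illustration, the gain of a single split under the second-order expansion is usually written as below, where $G_L, G_R$ and $H_L, H_R$ are the sums of first- and second-order loss derivatives in the left and right child nodes, and $\lambda$ and $\gamma$ are regularization terms. The per-feature average that follows assumes equal weights over the set $S(i)$ of splits on feature $i$, because the weighting scheme is not specified in the text; all symbols are assumptions rather than the authors' notation.

```latex
\mathcal{G}_{\mathrm{split}} = \frac{1}{2}\left[
  \frac{G_L^{2}}{H_L + \lambda}
  + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma

\mathrm{Imp}(i) = \frac{1}{|S(i)|} \sum_{s \in S(i)} \mathcal{G}_{\mathrm{split}}(s)
```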
2.2.3 Categorical gradient boosting
CatBoost is an improved algorithm based on the GBDT framework. By employing symmetric decision trees as base learners, it significantly enhances training speed while effectively mitigating overfitting. CatBoost evaluates feature importance using multiple metrics that assess the contribution of each feature. Among these metrics, the predicted value change reflects the average variation in predictions when a feature value changes. Additionally, the loss function change measures the variation in the loss function when a feature is removed versus when the full feature set is used (Wang Y. et al., 2023). The specific formulas are shown in Equations 3–5:
2.2.4 Comprehensive analysis
To derive a more robust and reliable feature importance analysis, this study combines the results from the three algorithms. The importance scores obtained from each algorithm are averaged to produce the final feature importance score. This integrative approach leverages the strengths of each algorithm, enhancing the reliability and robustness of the results. The comprehensive calculation formula is shown in Equation 6:
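The averaging step can be sketched as follows. This is a minimal illustration under one stated assumption: each algorithm's scores are rescaled to sum to 1 before averaging, so that differently scaled outputs are comparable. The function names and the numbers in the usage note are hypothetical, not results from the study.

```python
def combine_importance(*score_dicts):
    """Average per-feature importance over several algorithms.

    Each argument maps feature name -> raw importance score.
    Assumption: scores are rescaled to sum to 1 per algorithm before
    they are averaged feature by feature.
    """
    features = list(score_dicts[0])
    combined = {f: 0.0 for f in features}
    for scores in score_dicts:
        total = sum(scores[f] for f in features)
        for f in features:
            combined[f] += scores[f] / total / len(score_dicts)
    return combined

def rank(combined):
    """Return feature names ordered from most to least important."""
    return sorted(combined, key=combined.get, reverse=True)
```

For example, with made-up scores `{"SH": 4, "VPD": 3, "WIND": 1}`, `{"SH": 50, "VPD": 30, "WIND": 20}`, and `{"SH": 0.3, "VPD": 0.6, "WIND": 0.1}`, the combined ranking is SH, VPD, WIND even though one algorithm places VPD first.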
2.3 Machine learning algorithms
2.3.1 Long short-term memory networks
The LSTM neural network is an improved algorithm based on RNN. The LSTM framework features three distinct gate structures: the input gate, output gate, and forget gate. These gates work in tandem to maintain and update the internal states of the network, significantly enhancing LSTM’s performance when processing sequential data. The LSTM network effectively mitigates the vanishing gradient problem commonly observed in traditional RNNs, greatly improving training efficiency. Within the gate structures of the LSTM network, the input gate determines which input information should be stored in the memory cell; the forget gate decides which information should be forgotten; and the output gate selects the information to be extracted and output from the current memory cell (Ghimire et al., 2022). The structural diagram of LSTM is shown in Figure 2, and the mathematical representation is given by Equations 7–12:
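For reference, the standard LSTM formulation of the three gates, the candidate state, the memory cell, and the hidden output is given below. The symbols are the conventional ones and are assumed here rather than taken from the authors' notation: $x_t$ is the input, $h_t$ the hidden state, $C_t$ the memory cell, $W$ and $b$ the weights and biases, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

The additive update of $C_t$ is what lets gradients pass across many time steps without vanishing as quickly as in a plain RNN.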
To further enhance the predictive accuracy of the LSTM network, this study employed multiple optimization algorithms to tune four hyperparameters of the model, including the number of nodes in the first hidden layer (NumHiddenUnits), the maximum number of training epochs (MaxEpochs), the initial learning rate (InitialLearnRate), and the dropout probability (DropoutRate). Specifically, the MaxEpochs defines the upper limit on the number of complete passes through the training dataset, the InitialLearnRate determines the initial step size for the optimization algorithm during weight updates, and the DropoutRate suppresses overfitting by randomly setting a portion of the input elements to zero during training. The search ranges for the four hyperparameters are shown in Table 2. In addition, all model training in this study was conducted on a system equipped with an Intel(R) Core(TM) i5-1035G1 CPU.
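Meta-heuristic optimizers typically work on continuous position vectors, so each candidate has to be mapped onto the four hyperparameters. The sketch below shows one way this could be done. The bounds in `BOUNDS` are placeholders chosen for illustration only and are not the ranges of Table 2; the function name `decode` is also hypothetical.

```python
# Hypothetical bounds for illustration only; the ranges actually used in
# the study are those of Table 2.
BOUNDS = {
    "NumHiddenUnits": (10, 200, int),
    "MaxEpochs": (50, 300, int),
    "InitialLearnRate": (1e-4, 1e-1, float),
    "DropoutRate": (0.0, 0.5, float),
}

def decode(position):
    """Map a candidate vector in [0, 1]^4 to the four LSTM hyperparameters."""
    params = {}
    for u, (name, (lo, hi, kind)) in zip(position, BOUNDS.items()):
        u = min(max(u, 0.0), 1.0)        # clip candidates that left the box
        value = lo + u * (hi - lo)       # linear map onto the search range
        params[name] = round(value) if kind is int else value
    return params
```

Integer-valued settings (hidden units, epochs) are rounded after scaling, while the learning rate and dropout probability stay continuous.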
2.3.2 Beluga whale optimization algorithm
The beluga whale optimization algorithm is a meta-heuristic optimization algorithm inspired by the behavior of beluga whales (Zhong et al., 2022). This algorithm simulates three stages of beluga behavior—exploration, exploitation, and whale fall—to optimize model parameters. In this process, each beluga whale represents a candidate solution, with its position continually updated. During the exploration phase, the BWO algorithm performs a global search by mimicking the swimming behavior of beluga whales in the search space. In the exploitation phase, the algorithm simulates the whales’ predation behavior and incorporates the Levy flight strategy to enhance convergence performance. In this stage, beluga whales cooperate to hunt prey by sharing positional information and interacting with one another, enabling more focused and precise local searches. During the whale fall phase, the whales’ positions and sinking steps are used to establish new individual positions, maintaining population diversity and size (Pan et al., 2024; Sun X. et al., 2024).
2.3.3 Goose optimization algorithm
The goose optimization algorithm is a meta-heuristic algorithm inspired by the resting and foraging behavior of geese (Hamad and Rashid, 2024). The algorithm begins with population initialization, where the fitness value of each individual is calculated based on its position in the search space. The GO algorithm then enters the exploration and exploitation phases. The exploration phase simulates the random searching behavior of geese when they have not identified a clear target. This phase aims to discover new solutions through random searches, helping the algorithm escape local optima and move toward the global optimum. The exploitation phase models the intensive searching behavior of geese upon discovering food, focusing on refining known solutions through local search to find the best solution. The GO algorithm balances these two phases using a random variable. If the variable’s value exceeds 0.5, the algorithm enters the exploration phase; otherwise, it enters the exploitation phase. This mechanism ensures that the algorithm can thoroughly search known regions while broadly exploring new ones (Sun Y. et al., 2024; Wang et al., 2025).
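The phase-switching rule can be illustrated with a short sketch. Only the 0.5 threshold comes from the description above; the two update rules are generic placeholders (uniform resampling and Gaussian perturbation of the best position), not the published GO equations, and the function name and default settings are assumptions.

```python
import random

def phase_switching_search(objective, dim, n_agents=20, n_iter=200, seed=0):
    """Minimize `objective` on [0, 1]^dim with a 0.5-threshold phase switch.

    The update rules are placeholders, NOT the published GO equations:
    exploration resamples a position uniformly, exploitation perturbs the
    best-known position with small Gaussian noise.
    """
    rng = random.Random(seed)
    agents = [[rng.random() for _ in range(dim)] for _ in range(n_agents)]
    best = min(agents, key=objective)
    best_fit = objective(best)
    for _ in range(n_iter):
        for k in range(n_agents):
            if rng.random() > 0.5:
                # Exploration: random search over the whole space.
                cand = [rng.random() for _ in range(dim)]
            else:
                # Exploitation: local search around the best solution.
                cand = [min(max(b + rng.gauss(0.0, 0.05), 0.0), 1.0)
                        for b in best]
            agents[k] = cand
            fit = objective(cand)
            if fit < best_fit:
                best, best_fit = cand, fit
    return best, best_fit
```

In a hyperparameter-tuning setting, `objective` would train an LSTM with the settings encoded by the candidate and return a validation error such as RMSE; here any function of a list of floats can be used to try the loop.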
2.3.4 Horned lizard optimization algorithm
The horned lizard optimization algorithm is a novel meta-heuristic optimization algorithm inspired by the defensive behaviors of horned lizards (Peraza-Vázquez et al., 2024). The algorithm simulates the defensive strategies of horned lizards to search for optimal solutions in the solution space. HLOA first calculates the fitness value of each horned lizard, and during the optimization process, it employs various defensive strategies, including camouflage, skin color changes, blood-squirting, and fleeing. These strategies help avoid suboptimal solutions during the optimization process. Camouflage and skin color changes allow the horned lizard to better blend into its environment, enhancing the algorithm’s adaptability. Blood-squirting and fleeing strategies enable the lizard to evade predators, creating new search spaces. This improves the algorithm’s ability to explore uncharted areas of the solution space, strengthening its global search capabilities (Hachemi et al., 2024; Prajapati et al., 2024). The principles of the three optimization algorithms are illustrated in Figure 3.
2.4 Evaluation metrics
This study evaluates the performance of solar radiation estimation models using root mean square error (RMSE), coefficient of determination (R²), mean absolute error (MAE), and Nash-Sutcliffe efficiency (NSE). Additionally, the global evaluation index (GPI) is used to assess the overall predictive performance of the models. The formulas for these metrics are shown in Equations 13–17:
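The four pointwise metrics can be sketched with their usual definitions. One assumption is labeled here: R² is computed as the squared Pearson correlation between observed and predicted values, which keeps it distinct from NSE; the authors' exact definition may differ. GPI is not included because it depends on how the metrics are scaled across models.

```python
import math

def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus residual over observed variance."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def r2(obs, pred):
    """Squared Pearson correlation (assumed definition of R²)."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)
```

With observations `[1, 2, 3, 4]` and predictions `[1.5, 2.5, 3.5, 4.5]`, RMSE and MAE are both 0.5, R² is 1 because the predictions are perfectly correlated with the observations, and NSE is 0.8 because the constant bias is penalized.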
3 Results
3.1 Importance analysis of factors related to solar radiation
This study utilized XGBoost, LightGBM, and CatBoost algorithms to evaluate the importance of meteorological factors associated with solar radiation, using the integrated results of the three algorithms as the final importance ranking. The heatmap of correlations between six meteorological factors and solar radiation is shown in Figure 4, and the rankings of the importance of different meteorological factors are presented in Table 3. In the TCZ, the top-ranked factors are SH and VPD. The third and fourth positions are generally occupied by temperature-related factors, while RH and WIND rank relatively lower overall. In the TPMZ, the most important meteorological factor is Tmax, followed by SH and VPD in second and third place, respectively, with Tmin ranking next. RH and WIND exhibit a lower correlation with solar radiation.
Figure 4. Correlations between meteorological factors and solar radiation. Note: the heatmap displays Pearson correlation coefficients between meteorological factors and solar radiation at six stations. Lighter colors indicate stronger correlations.
SH has the most significant influence at station 52203--Hami, with a relative importance value of 0.489. Following this, station 53487--Datong has an SH importance value of 0.312, and station 51828--Hetian also exhibits relatively high importance, with a value of 0.301. In the TPMZ, SH’s importance is moderate, with importance values of 0.251, 0.220, and 0.208 at stations 56739--Tengchong, 59316--Shantou, and 59431--Nanning, respectively. VPD has the highest importance value at station 53487, with a value of 0.353, making it the most significant factor at this station. At stations 51828, 52203, and 59431, the importance values of VPD are 0.277, 0.146, and 0.260, respectively, ranking second. For stations 56739 and 59316, VPD ranks third in importance, with values of 0.247 and 0.201, respectively. Tmax is the most important factor in the TPMZ, with importance values of 0.263, 0.311, and 0.279 at stations 56739, 59316, and 59431, respectively. In the TCZ, Tmax ranks third in importance at stations 51828 and 53487, with values of 0.189 and 0.116, respectively. Tmin ranks fourth in importance at TPMZ stations, with values of 0.106, 0.148, and 0.145 at stations 56739, 59316, and 59431, respectively. In the TCZ, Tmin’s importance value at station 52203 is 0.114, ranking third. The Tmin values at stations 51828 and 53487 are 0.110 and 0.098, respectively, ranking fourth. RH is the fourth most important factor at station 52203, with an importance value of 0.097, while its importance is lower at other stations, indicating its relatively smaller influence. WIND is the least important factor across all stations, with the highest value being only 0.058 at station 52203.
Based on the importance rankings of factors at each station, this study established different combinations of factors to determine the optimal set of input factors. The input combinations for different research stations are shown in Table 4.
3.2 Analysis of optimal input combinations for solar radiation prediction models
This study employed the LSTM model and its hybrid models, incorporating different input factor combinations to predict solar radiation, aiming to identify the optimal set of input factors. The prediction accuracy of models in different climate zones is shown in Tables 5, 6. The box plots of evaluation metrics for model prediction results under different input combinations are displayed in Figure 5. In the TCZ and TPMZ, as the number of input factors increases, the prediction accuracy of the models improves. When predictions are made using the optimal factor combinations, the accuracy of the models reaches its peak. Subsequent additions of input factors no longer significantly enhance prediction accuracy. For both the TCZ and TPMZ, the optimal input factor combination is C3, including SH, VPD, Tmax, Tmin, and RH.
Table 5. Prediction accuracy of LSTM and hybrid models under different factor combinations in the TCZ.
Table 6. Prediction accuracy of LSTM and hybrid models under different factor combinations in the TPMZ.
Figure 5. Box plots of prediction results for different input combinations. Note: C1-C4 represent different factor combinations at research sites. C1 is a three-factor combination: either SH, VPD, and Tmax or SH, VPD, and Tmin, depending on the station. C2 is a four-factor combination: either SH, VPD, Tmax, and Tmin or SH, VPD, Tmin, and RH, depending on the station. C3 is the five-factor combination of SH, VPD, Tmax, Tmin, and RH. C4 is the six-factor combination of SH, VPD, Tmax, Tmin, RH, and WIND.
In the TCZ, as the input combinations progress from C1 to C3, the prediction performance of each model gradually improves. The RMSE values are 3.897 ± 0.473 MJ/(m²·day), 3.728 ± 0.369 MJ/(m²·day), and 3.610 ± 0.364 MJ/(m²·day), respectively; the corresponding R² values are 0.734 ± 0.035, 0.762 ± 0.033, and 0.795 ± 0.028. When all factors are included as inputs in the prediction model, the accuracy does not significantly improve, with RMSE = 3.676 ± 0.407 MJ/(m²·day) and R² = 0.775 ± 0.034. Similarly, in the TPMZ, as the input combinations progress from C1 to C3, the models’ prediction accuracy also gradually improves. The RMSE values are 3.427 ± 0.833 MJ/(m²·day), 2.936 ± 0.640 MJ/(m²·day), and 2.879 ± 0.581 MJ/(m²·day), respectively; the corresponding R² values are 0.696 ± 0.080, 0.756 ± 0.063, and 0.794 ± 0.058. Adding further input factors does not result in a significant improvement in prediction accuracy. When the combination C4 is used as input for the model, the prediction precision is RMSE = 2.879 ± 0.575 MJ/(m²·day), R² = 0.787 ± 0.048.
3.3 Analysis of independent and hybrid solar radiation prediction models
To improve the accuracy of the standalone LSTM model for solar radiation prediction, this study utilized three meta-heuristic optimization algorithms, including BWO, GO, and HLOA, to fine-tune the LSTM model’s hyperparameters and achieve optimal parameter configurations. To visually demonstrate the predictive performance of the models, scatter plots of predicted versus actual solar radiation values under the same input factor combinations were created (as shown in Figure 6). Additionally, the GPI metric was used to comprehensively evaluate the prediction results, with GPI values and rankings for different models under four input combinations summarized in Table 7. The results indicate that the optimization algorithms significantly enhanced the prediction accuracy of the LSTM model, and the performance of hybrid models surpassed that of the standalone LSTM model. In both the TCZ and TPMZ, the HLOA-LSTM model exhibited superior predictive performance, followed by the GO-LSTM model, which performed comparably to the HLOA-LSTM. The BWO-LSTM model ranked last among the hybrid models but still outperformed the standalone LSTM model.
Figure 6. Scatter plot of predicted and observed solar radiation values for the prediction models. Note: each point represents a set of predicted versus observed solar radiation data. The diagonal line indicates perfect prediction. Points closer to the diagonal line represent higher prediction accuracy.
Taking the optimal combination C3 as an example, the GPI values of the standalone LSTM model at stations 51828, 52203, and 53487 are 1.039, 0.365, and 0.585, respectively. Among the three hybrid models, the HLOA-LSTM model performs prominently at these stations, with GPI values of 1.948, 1.634, and 1.985. The GO-LSTM and BWO-LSTM models also demonstrate good predictive capabilities, with GPI values of 1.799, 1.652, and 1.774 for GO-LSTM and 1.134, 2.000, and 1.699 for BWO-LSTM. For stations 56739, 59316, and 59431, the GPI values of the LSTM model are 0.906, 1.220, and 1.218, respectively, while those of the HLOA-LSTM model are 1.949, 1.729, and 1.903. At these stations, the GO-LSTM model obtains GPI values of 1.770, 1.840, and 1.930, and the BWO-LSTM model obtains 1.698, 1.851, and 1.684. Training time is also a key metric for evaluating the practical applicability of a model. Experimental results show that the average training time for the LSTM model is 6.278 s. After integrating optimization algorithms, the computational complexity of the hybrid models increases, leading to corresponding increases in training time. The average times per iteration for BWO-LSTM and GO-LSTM are 8.711 s and 10.147 s, respectively, while the training time for the HLOA-LSTM model is 10.769 s. Although HLOA-LSTM incurs higher training overhead than LSTM, its significant improvement in prediction accuracy justifies this additional cost in applications demanding high-precision forecasting.
4 Discussion
4.1 Regional differences in the influence of meteorological factors on solar radiation
This study reveals the differences in the correlation between various meteorological factors and solar radiation in the TCZ and TPMZ through feature analysis. The research found that the correlation between meteorological factors and solar radiation varies across different regions, which may be related to the climatic characteristics of the areas where the stations are located. SH ranks relatively high in both the TCZ and TPMZ, indicating that SH is an important input for predicting solar radiation (Venkatachalam and Solomon, 2019). VPD reflects the dryness of the air, and its importance is also significant in both the TCZ and TPMZ, making it an important factor influencing solar radiation (Sharafati et al., 2019). The weather in the TCZ is typically sunny and dry, with longer sunshine hours (Wu et al., 2013), so SH and VPD have a more significant impact on solar radiation in this region. Tmax reflects the cumulative solar radiation over the course of a day and is also a meteorological factor with a high correlation to solar radiation (Tao et al., 2019). The high temperatures year-round in the TPMZ result in a higher correlation between Tmax and solar radiation in this area. Tmin (Yadav et al., 2015), RH (Jović et al., 2016), and WIND (Kasaeian et al., 2016) generally rank lower, but they still contribute to solar radiation models and provide more data for the models.
4.2 Optimal factor combinations
Based on the feature analysis results, this study constructed four different input factor combinations. In the TCZ and TPMZ, the models with the three-factor input combination (C1) showed relatively low accuracy in their predictions. This could be because, although C1 includes meteorological factors with a high correlation to solar radiation in the region, such as SH, Tmax, and VPD, solar radiation is influenced by multiple factors. Three factors alone fail to provide comprehensive information for the prediction model, leading to lower model accuracy. C2, which adds Tmin or RH, provides more meteorological information to the model, facilitating a better capture of the non-linear relationship between input variables and solar radiation. The combination of SH, VPD, Tmax, Tmin, and RH is the optimal input combination for the TCZ and TPMZ, as it includes the key meteorological factors affecting solar radiation, significantly improving the model’s prediction accuracy. The introduction of WIND did not further enhance prediction accuracy, possibly because too many input factors caused overfitting of the model, capturing noise and outliers in the data. Additionally, WIND is correlated with meteorological factors like SH and Tmax, leading to redundant information, which not only reduced model accuracy but also increased model complexity.
4.3 Optimal hybrid model
This study optimized the LSTM model using three optimization algorithms: BWO, GO, and HLOA. The results show that the hybrid models perform better than the standalone LSTM model, indicating that the BWO, GO, and HLOA algorithms can effectively optimize the hyperparameters of LSTM. These algorithms can effectively adjust the parameter configuration, significantly improving the model’s prediction accuracy. In both the TCZ and TPMZ, the performance of HLOA exceeded that of the GO and BWO algorithms, likely because HLOA employs the diverse behavioral strategies of the horned lizard, giving the algorithm stronger global and local search capabilities and the ability to handle various complex optimization problems, effectively avoiding issues that may arise during the optimization process (Peraza-Vázquez et al., 2024). In contrast, the GO and BWO algorithms are more prone to getting stuck in local optima and may experience problems like premature convergence (Li et al., 2024). However, the enhanced search capabilities of HLOA come at the expense of increased computational complexity, resulting in a slight increase in training time. Nevertheless, the HLOA-LSTM model achieves substantially higher prediction accuracy, and this additional computational cost is considered acceptable in application scenarios where prediction accuracy is prioritized.
In summary, feature importance analysis algorithms and optimization algorithms can effectively improve the performance of the LSTM model in solar radiation prediction. In the TCZ and TPMZ, HLOA demonstrated outstanding optimization capabilities, significantly improving the prediction accuracy and stability of the LSTM model. The HLOA-LSTM model, which uses five meteorological factors (SH, VPD, Tmax, Tmin, and RH) as input variables, is the best solar radiation prediction model in the TCZ and TPMZ.
Despite achieving favorable performance in solar radiation prediction, this study still has certain limitations. Firstly, the study area is limited to the TCZ and TPMZ; applying the model to broader climatic regions will require further adjustment and optimization. Secondly, this study primarily considered common meteorological factors, while other known factors affecting solar radiation, such as cloud cover and aerosol optical depth, have not been fully examined. This could be a limiting factor in improving model accuracy. Additionally, there is still room for improvement in the predictive accuracy of the final hybrid model, HLOA-LSTM. Future research will continue to explore the potential of machine learning models in solar radiation prediction, with a particular focus on analyzing and optimizing their applicability and accuracy under diverse climatic conditions and meteorological variables. Concurrently, efforts will be made to expand the scale of datasets to evaluate the training efficiency and performance of models on larger datasets, while investigating potential computational overhead and optimization opportunities. This will help address the complexities associated with solar radiation prediction under varying environmental conditions.
5 Conclusion
This study developed a high-accuracy solar radiation prediction model with minimal input requirements based on the LSTM algorithm. To identify meteorological factors with strong correlations to solar radiation, three feature importance analysis algorithms—XGBoost, LightGBM, and CatBoost—were employed to analyze six meteorological factors (SH, VPD, Tmax, Tmin, RH, and WIND). The integrated results of these analyses were used to determine the final importance values for each factor. Based on these findings, different factor combinations were constructed to identify the optimal combination. To further enhance the prediction accuracy of the LSTM model, three optimization algorithms—BWO, GO, and HLOA—were used to fine-tune the hyperparameters of the LSTM model, resulting in a highly accurate solar radiation prediction model. The main findings are as follows:
1. XGBoost, LightGBM, and CatBoost algorithms effectively analyzed the input variables, significantly improving the prediction accuracy of the model. In the TCZ, SH and VPD were the factors most strongly correlated with solar radiation, followed by Tmax, Tmin, and RH, while WIND showed relatively weak correlation. In the TPMZ, Tmax was the most important factor, followed by SH and VPD, then Tmin and RH, with WIND again having relatively low importance for solar radiation. The input combination of SH, VPD, Tmax, Tmin, and RH was identified as the optimal factor combination for the TCZ and TPMZ.
2. Optimization algorithms significantly enhanced the prediction accuracy of the LSTM model, with HLOA achieving the best optimization performance and the largest improvement in model accuracy.
3. When using the optimal factor combination as input, the HLOA-LSTM model is the best solar radiation prediction model for the TCZ and TPMZ, with prediction accuracies of: RMSE = 3.470 ± 0.224 MJ/(m2·day), R2 = 0.807 ± 0.016 for TCZ, and RMSE = 2.858 ± 0.561 MJ/(m2·day), R2 = 0.814 ± 0.038 for TPMZ.
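For readers who wish to reproduce accuracy metrics of the kind reported above, the following minimal sketch computes RMSE and R2 and summarizes repeated runs as mean ± spread. It assumes that R2 denotes the coefficient of determination and that the ± term is the sample standard deviation across runs; neither definition is stated in this section, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def rmse(observed, predicted):
    """Root mean square error, in the units of the observations."""
    return sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def summarize(values):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(values), stdev(values)
```

Calling `summarize` on the per-run RMSE values of one model and climate zone gives the two numbers written as mean ± standard deviation.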
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
LZ: Conceptualization, Formal Analysis, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review and editing. FW: Conceptualization, Data curation, Methodology, Resources, Software, Validation, Writing – original draft. HoW: Formal Analysis, Investigation, Writing – review and editing. HuW: Project administration, Writing – review and editing. YS: Investigation, Software, Visualization, Writing – review and editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was financially supported by the Henan Tobacco Company Luoyang Company Technological Innovation Projects (Grant No. 2023410300200043), National Natural Science Foundation of China (Grant No. 52309050), and the Key Scientific Research Projects of Colleges and Universities in Henan Province (Grant No. 24B416001).
Conflict of interest
Authors HoW and HuW were employed by Henan Tobacco Company Luoyang Company.
The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author(s) declared that this work received funding from Henan Tobacco Company Luoyang Company. The funder had the following involvement in the study: study design and decision to submit it for publication.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Bilali, A. E., Hadri, A., Taleb, A., Tanarhte, M., El Mahdi, E., and Kharrou, M. H. (2025). A novel hybrid modeling approach based on empirical methods, PSO, XGBoost, and multiple GCMs for forecasting long-term reference evapotranspiration in a data scarce-area. Comput. Electron. Agric. 232, 110106. doi:10.1016/j.compag.2025.110106
Cao, Q., Yang, L., Liu, Y., and Wang, S. (2023). Development criterion of estimating hourly global solar radiation for all sky conditions in China. Energy Convers. Manag. 284, 116946. doi:10.1016/j.enconman.2023.116946
Chen, T., and Guestrin, C. (2016). “XGBoost: a scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Chen, B., Zheng, H., Luo, G., Chen, C., Bao, A., Liu, T., et al. (2022). Adaptive estimation of multi-regional soil salinization using extreme gradient boosting with Bayesian TPE optimization. Int. J. Remote Sens. 43, 778–811. doi:10.1080/01431161.2021.2009589
de Araujo, J. M. S. (2020). Performance comparison of solar radiation forecasting between WRF and LSTM in Gifu, Japan. Environ. Res. Commun. 2, 045002. doi:10.1088/2515-7620/ab7366
Fan, J., Chen, B., Wu, L., Zhang, F., Lu, X., and Xiang, Y. (2018). Evaluation and development of temperature-based empirical models for estimating daily global solar radiation in humid regions. Energy 144, 903–914. doi:10.1016/j.energy.2017.12.091
Fan, J., Wu, L., Zhang, F., Cai, H., Zeng, W., Wang, X., et al. (2019). Empirical and machine learning models for predicting daily global solar radiation from sunshine duration: a review and case study in China. Renew. Sustain. Energy Rev. 100, 186–212. doi:10.1016/j.rser.2018.10.018
Farhadi, R., and Taki, M. (2020). The energy gain reduction due to shadow inside a flat-plate solar collector. Renew. Energy 147, 730–740. doi:10.1016/j.renene.2019.09.012
Ghimire, S., Deo, R. C., Raj, N., and Mi, J. (2019). Deep solar radiation forecasting with convolutional neural network and long short-term memory network algorithms. Appl. Energy 253, 113541. doi:10.1016/j.apenergy.2019.113541
Ghimire, S., Deo, R. C., Casillas-Pérez, D., Salcedo-Sanz, S., Sharma, E., and Ali, M. (2022). Deep learning CNN-LSTM-MLP hybrid fusion model for feature optimizations and daily solar radiation prediction. Measurement 202, 111759. doi:10.1016/j.measurement.2022.111759
Hachemi, A. T., Sadaoui, F., Saim, A., Ebeed, M., and Arif, S. (2024). Dynamic operation of distribution grids with the integration of photovoltaic systems and distribution static compensators considering network reconfiguration. Energy Rep. 12, 1623–1637. doi:10.1016/j.egyr.2024.07.050
Hamad, R. K., and Rashid, T. A. (2024). GOOSE algorithm: a powerful optimization tool for real-world engineering challenges and beyond. Evol. Syst. 15, 1–26. doi:10.1007/s12530-023-09553-6
Jia, D., Yang, L., Lv, T., Liu, W., Gao, X., and Zhou, J. (2022). Evaluation of machine learning models for predicting daily global and diffuse solar radiation under different weather/pollution conditions. Renew. Energy 187, 896–906. doi:10.1016/j.renene.2022.02.002
Jiang, X., Duan, H., Liao, J., Guo, P., Huang, C., and Xue, X. (2022). Estimation of soil salinization by machine learning algorithms in different arid regions of Northwest China. Remote Sens. 14, 347. doi:10.3390/rs14020347
Jiang, Y., Li, F., Gong, Y., Yang, X., and Zhang, Z. (2025). Multiple environmental variables as covariates to improve the accuracy of spatial prediction models for SOM on Karst Aera. Land Degrad. and Dev. doi:10.1002/ldr.5454
Jović, S., Aničić, O., Marsenić, M., and Nedić, B. (2016). Solar radiation analyzing by neuro-fuzzy approach. Energy Build. 129, 261–263. doi:10.1016/j.enbuild.2016.08.020
Kasaeian, A., Mehrpooya, M., Aghaie, M., and Ahmadi, M. H. (2016). Solar radiation prediction based on ICA and HGAPSO for Kuhin City, Iran. Mech. and Industry 17, 509. doi:10.1051/meca/2015100
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Information Processing Systems 30.
Kookalani, S., Cheng, B., and Torres, J. L. C. (2022). Structural performance assessment of GFRP elastic gridshells by machine learning interpretability methods. Front. Struct. Civ. Eng. 16, 1249–1266. doi:10.1007/s11709-022-0858-5
Li, J., Zhou, X., Zhou, Y., and Han, A. (2024). Optimal configuration of distributed generation based on an improved Beluga whale optimization. IEEE Access. doi:10.1109/ACCESS.2024.3368440
Miao, S., Ning, G., Gu, Y., Yan, J., and Ma, B. (2018). Markov Chain model for solar farm generation and its application to generation performance evaluation. J. Clean. Prod. 186, 905–917. doi:10.1016/j.jclepro.2018.03.173
Pan, Y., Tian, H., Farid, M. A., He, X., Heng, T., Hermansen, C., et al. (2024). Metaheuristic optimization of water resources: a case study of the Manas River irrigation district. J. Hydrology 639, 131640. doi:10.1016/j.jhydrol.2024.131640
Patel, A., and Swathika, O. G. (2024). Off-Grid small-scale power forecasting using optimized machine learning algorithms. IEEE Access. doi:10.1109/ACCESS.2024.3430385
Peraza-Vázquez, H., Peña-Delgado, A., Merino-Treviño, M., Morales-Cepeda, A. B., and Sinha, N. (2024). A novel metaheuristic inspired by horned lizard defense tactics. Artif. Intell. Rev. 57, 59. doi:10.1007/s10462-023-10653-7
Prajapati, S., Garg, R., and Mahajan, P. (2024). Novel adaptive MPPT technique for enhanced performance of grid integrated solar photovoltaic system. Comput. Electr. Eng. 120, 109648. doi:10.1016/j.compeleceng.2024.109648
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. Adv. Neural Information Processing Systems 31. doi:10.48550/arXiv.1706.09516
Rodríguez, E., Droguett, E. L., Cardemil, J. M., Starke, A. R., and Cornejo-Ponce, L. (2024). Enhancing the estimation of direct normal irradiance for six climate zones through machine learning models. Renew. Energy 231, 120925. doi:10.1016/j.renene.2024.120925
Sebastianelli, A., Serva, F., Ceschini, A., Paletta, Q., Panella, M., and Le Saux, B. (2024). Machine learning forecast of surface solar irradiance from meteo satellite data. Remote Sens. Environ. 315, 114431. doi:10.1016/j.rse.2024.114431
Sharafati, A., Khosravi, K., Khosravinia, P., Ahmed, K., Salman, S. A., Yaseen, Z. M., et al. (2019). The potential of novel data mining models for global solar radiation prediction. Int. J. Environ. Sci. Technol. 16, 7147–7164. doi:10.1007/s13762-019-02344-0
Sun, X., Zhu, L., and Liu, D. (2024a). Blueberry bruise non-destructive detection based on hyperspectral information fusion combined with multi-strategy improved Beluga Whale optimization algorithm. Front. Plant Sci. 15, 1411485. doi:10.3389/fpls.2024.1411485
Sun, Y., Wang, X., Gao, L., Yang, H., Zhang, K., Ji, B., et al. (2024b). Multi-objective optimal scheduling for microgrids—improved goose algorithm. Energies 17, 6376. doi:10.3390/en17246376
Tahir, M. F., Yousaf, M. Z., Tzes, A., El Moursi, M. S., and El-Fouly, T. H. (2024). Enhanced solar photovoltaic power prediction using diverse machine learning algorithms with hyperparameter optimization. Renew. Sustain. Energy Rev. 200, 114581. doi:10.1016/j.rser.2024.114581
Taki, M., Rohani, A., Soheili-Fard, F., and Abdeshahi, A. (2018). Assessment of energy consumption and modeling of output energy for wheat production by neural network (MLP and RBF) and Gaussian process regression (GPR) models. J. Cleaner Production 172, 3028–3041. doi:10.1016/j.jclepro.2017.11.107
Tao, H., Ebtehaj, I., Bonakdari, H., Heddam, S., Voyant, C., Al-Ansari, N., et al. (2019). Designing a new data intelligence model for global solar radiation prediction: application of multivariate modeling scheme. Energies 12, 1365. doi:10.3390/en12071365
Venkatachalam, C., and Solomon, G. (2019). Dataset of solar energy potential assessment for Adama city (Ethiopia). Data Brief 24, 103879. doi:10.1016/j.dib.2019.103879
Wang, S., Wu, Y., Li, R., and Wang, X. (2023a). Remote sensing-based retrieval of soil moisture content using stacking ensemble learning models. Land Degrad. and Dev. 34, 911–925. doi:10.1002/ldr.4505
Wang, Y., Zhang, Z., Pang, N., Sun, Z., and Xu, L. (2023b). CEEMDAN-CatBoost-SATCN-based short-term load forecasting model considering time series decomposition and feature selection. Front. Energy Res. 10, 1097048. doi:10.3389/fenrg.2022.1097048
Wang, J., Zhou, X., Liu, Y., Sun, J., Guo, P., and Lv, W. (2025). An efficient nondestructive detection method of rapeseed varieties based on hyperspectral imaging technology. Microchem. J. 210, 112913. doi:10.1016/j.microc.2025.112913
Wu, W., Tang, X.-P., Yang, C., Guo, N.-J., and Liu, H.-B. (2013). Spatial estimation of monthly mean daily sunshine hours and solar radiation across mainland China. Renew. Energy 57, 546–553. doi:10.1016/j.renene.2013.02.027
Wu, L., Zhou, H., Ma, X., Fan, J., and Zhang, F. (2019). Daily reference evapotranspiration prediction based on hybridized extreme learning machine model with bio-inspired optimization algorithms: application in contrasting climates of China. J. Hydrology 577, 123960. doi:10.1016/j.jhydrol.2019.123960
Yadav, A. K., Malik, H., and Chandel, S. S. (2015). Application of rapid miner in ANN based prediction of solar radiation for assessment of solar energy resource potential of 76 sites in Northwestern India. Renew. Sustain. Energy Rev. 52, 1093–1106. doi:10.1016/j.rser.2015.07.156
Yang, D., and Gueymard, C. A. (2019). Producing high-quality solar resource maps by integrating high-and low-accuracy measurements using Gaussian processes. Renew. Sustain. Energy Rev. 113, 109260. doi:10.1016/j.rser.2019.109260
Yang, K., Koike, T., and Ye, B. (2006). Improving estimation of hourly, daily, and monthly solar radiation by importing global data sets. Agric. For. Meteorology 137, 43–55. doi:10.1016/j.agrformet.2006.02.001
Yildirim, A., Bilgili, M., and Ozbek, A. (2023). One-hour-ahead solar radiation forecasting by MLP, LSTM, and ANFIS approaches. Meteorology Atmos. Phys. 135, 10. doi:10.1007/s00703-022-00946-x
Zeng, Z., Gower, D. B., and Wood, E. F. (2018). Accelerating forest loss in Southeast Asian Massif in the 21st century: a case study in Nan Province, Thailand. Glob. Change Biology 24, 4682–4695. doi:10.1111/gcb.14366
Zhang, Y., Cui, N., Feng, Y., Gong, D., and Hu, X. (2019). Comparison of BP, PSO-BP and statistical models for predicting daily global solar radiation in arid Northwest China. Comput. Electron. Agric. 164, 104905. doi:10.1016/j.compag.2019.104905
Zhao, L., Zhao, X., Pan, X., Shi, Y., Qiu, Z., Li, X., et al. (2022). Prediction of daily reference crop evapotranspiration in different Chinese climate zones: combined application of key meteorological factors and Elman algorithm. J. Hydrology 610, 127822. doi:10.1016/j.jhydrol.2022.127822
Zhao, L., Qing, S., Bai, J., Hao, H., Li, H., Shi, Y., et al. (2023a). A hybrid optimized model for predicting evapotranspiration in early and late rice based on a categorical regression tree combination of key influencing factors. Comput. Electron. Agric. 211, 108031. doi:10.1016/j.compag.2023.108031
Zhao, L., Qing, S., Wang, F., Wang, H., Ma, H., Shi, Y., et al. (2023b). Prediction of rice yield based on multi-source data and hybrid LSSVM algorithms in China. Int. J. Plant Prod. 17, 693–713. doi:10.1007/s42106-023-00266-z
Keywords: feature importance analysis, long short-term memory network algorithm, machine learning, optimization algorithms, solar radiation
Citation: Zhao L, Wang F, Wang H, Wang H and Shi Y (2026) Hybrid feature-LSTM for solar radiation forecasting in different Chinese climate zones. Front. Earth Sci. 14:1745611. doi: 10.3389/feart.2026.1745611
Received: 15 November 2025; Accepted: 19 January 2026;
Published: 10 February 2026.
Edited by:
Petru Adrian Cotfas, Transilvania University of Braşov, Romania
Reviewed by:
Kasim Oztoprak, Konya Food and Agriculture University, Türkiye
Andreea Sabadus, West University of Timişoara, Romania
Copyright © 2026 Zhao, Wang, Wang, Wang and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Hui Wang