Optimizing soybean variety selection for the Pan-African Trial network using factor analytic models and envirotyping

Araújo, Maurício S.; Pavan, João P. S.; Stella, André A.; Fregonezi, Bruno F.; Lima, Natally F.; Leles, Erica P.; Santos, Michelle F.; Goldsmith, Peter; Chigeza, Godfree; Diers, Brian W.; Pinheiro, José B.

doi:10.3389/fpls.2025.1594736

ORIGINAL RESEARCH article

Front. Plant Sci., 06 June 2025

Sec. Plant Breeding

Volume 16 - 2025 | https://doi.org/10.3389/fpls.2025.1594736


Maurício S. Araújo¹*

João P. S. Pavan¹

André A. Stella¹

Bruno F. Fregonezi¹

Natally F. Lima²

Erica P. Leles³

Michelle F. Santos³

Peter Goldsmith³

Godfree Chigeza⁴

Brian W. Diers³

José B. Pinheiro¹*

¹Genetics Diversity and Breeding Laboratory, Department of Genetics, University of São Paulo, Piracicaba, São Paulo, Brazil
²Allogamous Plant Breeding Laboratory, Department of Genetics, University of São Paulo, Piracicaba, São Paulo, Brazil
³Feed the Future Innovation Lab, University of Illinois Urbana-Champaign, United States Agency for International Development (USAID), Washington, DC, United States
⁴International Institute of Tropical Agriculture, Consultative Group on International Agricultural Research (CGIAR), Ibadan, Oyo, Nigeria

Soybean is a global food and industrial crop; however, climate change significantly affects its grain yield. Therefore, selecting varieties that are well adapted to the target population of environments is imperative in Sub-Saharan Africa. This study aimed to identify soybean varieties with high overall performance and stability using multi-environment trial data from the Pan-African Soybean Trial Network. Additionally, we sought to determine the environmental factors influencing yield through envirotyping tools. In two South-Eastern African countries, a total of 169 soybean varieties were evaluated across 83 environments: 47 trials at 19 locations in Malawi and 36 trials at 14 locations in Zambia. The trials followed a randomized complete block design with three replications. Data for 37 environmental features were obtained from NASA POWER and SoilGrids. We fitted factor analytic (FA) models to estimate genotype adaptation across environments. Additionally, we applied an environmental kernel approach and the XGBoost method to assess the number of mega-environments. The FA model with four factors provided the best fit, explaining 82.44% of the variance with an average semi-variance ratio (ASVR) of 81.95%. Approximately 59.6% of the genotype-by-environment interaction was of the crossover type. Varieties V025, V035, and V158 exhibited high yield potential and reliability but displayed moderate stability. Three mega-environments were identified, with growing degree days, mean temperature, and photosynthetically active radiation use efficiency being the features most strongly associated with soybean grain yield. To improve the identification of variety adaptation in these environments, integrating machine learning models with crop growth modeling is essential to assess associations between environmental features and soybean yield.

1 Introduction

Soybean (Glycine max L.) is a commodity crop of great global importance (Mishra et al., 2024). Its grains are widely utilized in agro-industry, primarily for oil production, high-protein food products, and animal feed formulation (Zhi et al., 2020). Its nutritional composition is determined by proteins, oil, carbohydrates, isoflavones, and minerals. However, population growth and the ever-increasing demand for protein sources, both for human consumption and animal feed, highlight the need to expand global soybean production (Messina, 2022). In this context, improving production efficiency in new agricultural frontiers through the development of better adapted varieties becomes essential to ensure food security for future generations. Accordingly, genetic improvement programs have focused on developing high-yielding varieties with resistance to pests and diseases, as well as broad adaptation to target environmental conditions (Favoretto et al., 2025). These advancements have been driven by the optimization of breeding strategies and the adoption of effective agricultural practices (Carciochi et al., 2019).

Plant breeders rely on multi-environment trials (METs) to evaluate genotype performance across diverse conditions, representing the target population of environments (TPE) and assessing genotype adaptation to specific or broad environments (Poupon et al., 2023; Malosetti et al., 2016; Costa-Neto et al., 2023; Vitale et al., 2024). When crossover interactions occur, genotype rankings vary across environments (Fehr, 1987; Cooper and Delacy, 1994), and neglecting genotype-by-environment (G×E) interaction can introduce some bias and reduce selection efficiency (van Eeuwijk et al., 2016). To quantify G×E interaction, various methods have been explored, each with distinct assumptions and applications. These include analysis of variance (Plaisted and Peterson, 1959; Shukla, 1972), regression models (Finlay and Wilkinson, 1963; Eberhart and Russell, 1966), non-parametric approaches (Lin and Binns, 1998), multiplicative models such as GGE Biplot (Yan et al., 2000) and AMMI (Gauch and Zobel, 1997; Gauch, 2008), linear mixed models (Henderson, 1949, 1950), factor analytic (FA) models — which are extensions of linear mixed models — (Piepho, 1997a, b; Smith et al., 2001b), and Bayesian approaches (Cotes et al., 2006), all widely applied in plant breeding.

Factor analytic (FA) models are a specific class of linear mixed models (LMMs) that are particularly robust in handling diverse data structures, especially unbalanced data. As a parsimonious approximation of the unstructured model, they indirectly construct the full genetic covariance structure, accounting for heterogeneous variances and covariances. This capability allows for the exploration of genetic covariance between environments or traits, making FA models well-suited for METs. Their effectiveness stems from dimensionality reduction through latent variables, known as factors (Smith et al., 2001b; Piepho, 1998). Additionally, as linear mixed models, they facilitate the inclusion of relatedness information, whether genomic (marker-based) or ancestral (pedigree) (Smith et al., 2005). Building on these principles, Smith and Cullis (2018) introduced the Factor Analytic Selection Tools (FAST), which incorporate parameters for assessing overall performance (OP) and stability via Root Mean Square Deviation (RMSD). These metrics enhance breeders’ decision-making by providing a statistically sound and comprehensive evaluation framework. Today, FA models are the benchmark for handling unbalanced MET data within the LMM framework (Tolhurst et al., 2022; Araújo et al., 2024), with recent insights by Piepho and Williams (2024) emphasizing their utility in predicting genotype performance in METs.

Beyond selecting the most appropriate statistical methods, modern plant breeding demands additional tools to enhance the predictive ability of models. Over the past decade, environmental features have emerged as valuable resources for improving predictions in METs (Xu, 2016; Resende et al., 2024). Although the integration of environmental data into genetic analyses is not a new concept (Van Eeuwijk and Elgersma, 1993; Wood, 1976), advances in hardware and data processing have enabled the use of large datasets, facilitating the incorporation of environmental features into statistical genetic models. Enviromics, a specialized field at the intersection of environmental data, statistics, and quantitative genetics, leverages plant ecophysiology to better understand how environmental factors influence plant development and the plasticity of key agronomic traits (Costa-Neto and Fritsche-Neto, 2021). In this context, envirotypes represent all sources of environmental variation affecting plant development and can serve as environmental markers in statistical genetic models, aiding in the prediction of genotypic performance in non-evaluated environments (Xu, 2016; Resende et al., 2025).

The addition of information derived from Geographic Information System (GIS) techniques into predictive models has been encouraged to improve the efficiency of breeding programs (Guarino et al., 2002). An initial effort was made by Booth (1990) aiming to indicate climatically suitable regions for the introduction of tree species at a global scale based on the environmental conditions where they were collected. Annicchiarico et al. (2006) assessed how GIS-based methodologies could aid the recommendation of durum wheat genotypes in MET, as compared to traditional methodologies. The integration of machine learning, quantitative genetics, enviromics, and GIS tools enhances the identification of environmental patterns in target environments. These resources enable the exploration of environmental homogeneity and the determination of factors influencing climatic variability, facilitating the incorporation of G×E interaction and the selection of cultivars adapted to specific conditions.

Soybean variety selection is becoming increasingly important due to the crop’s high nutritional value and economic significance in the global market. Despite this potential, the adaptation of soybean varieties to Sub-Saharan African environments, specifically in the South-Eastern countries of Malawi and Zambia, remains largely unexplored, limiting the availability of high-performing cultivars suited to the region’s diverse agro-ecological conditions. This gap is particularly concerning given the rapid population growth and the escalating demand for affordable protein-based food sources, which underscore the necessity of expanding and optimizing soybean production. Moreover, climate change exacerbates environmental variability, increasing the urgency for resilient cultivars capable of maintaining stable yields across unpredictable conditions (Sousa et al., 2019). To address this challenge, this study employs advanced selection tools to identify superior varieties with high overall performance and stability within the Pan-African Trials Network. Furthermore, the integration of envirotyping methodologies enables the exploration of associations between environmental variables and G×E interactions, facilitating the identification of specific adaptations critical for sustainable soybean production in Malawi and Zambia.

2 Material and methods

2.1 Phenotypic data and field trials

Soybean variety yield trials are part of the Soybean Innovation Lab (SIL). This program was established to select high-yielding varieties adapted to the target population of environments (TPE) in Africa, in support of cultivation by smallholder farmers. This initiative led to the creation of the Pan-African Soybean Variety Trials (PATs) through partnerships with the African Agricultural Technology Foundation (AATF), the Syngenta Foundation for Sustainable Agriculture (SFSA), and the International Institute of Tropical Agriculture (IITA) (Santos, 2019). The PATs program plays a key role in identifying and disseminating varieties capable of adapting to diverse agro-ecological conditions, thereby contributing to enhanced food security and economic growth across selected African countries. The African continent was divided into 33 Agro-ecological Zones (AEZs), classified according to criteria such as climatic zones (tropical, temperate, etc.), length of the growing season, soil type, and altitude, with a resolution of 5 arc-minutes (≈ 9.2 km × 9.2 km) (Figure 1) (Food and Agriculture Organization of the United Nations, 2025).

Figure 1

Figure 1. (A) displays the map of Africa with Agro-ecological Zones (AEZ) classified into 33 distinct categories based on climatic variables, topography, and the chemical and physical properties of the soil. Each color on the map represents a specific AEZ class. Refer to Food and Agriculture Organization of the United Nations (2025) for detailed identification of each class. The red and black points on the map highlight the countries of Malawi and Zambia, respectively. (B) presents the map of Malawi, highlighting its respective AEZs. The colors of the points indicate the locations where the trials were conducted, and the number in parentheses represents the number of trials carried out at each site. (C) shows Zambia with the distribution of trial locations, along with the number of experimental trials conducted in each region.

A total of 169 soybean varieties were evaluated over the 2017/18 to 2023/24 seasons (Supplementary Figure S1) in trials conducted in the two South-Eastern African countries of Malawi and Zambia. In Malawi, 47 trials were conducted across 19 distinct locations (Figure 1B), and in Zambia, 36 trials were carried out across 14 locations (Figure 1C). Each trial, defined as a location-by-season combination, was treated as one environment. The trials followed a randomized complete block design (RCBD) with three replications. Each plot consisted of four rows measuring five meters in length (4 × 5 m), spaced 50 cm apart, with 20 plants per row. Grain yield (kg ha⁻¹) was measured from the two central rows. Agronomic management practices adhered to the specific technical recommendations for soybean cultivation.

2.2 Envirotyping

Throughout the crop’s growing season, we collected data on 37 environmental features (Table 1). Each genotype’s sowing and harvesting dates were used to retrieve environment-specific variables, enabling the characterization of trial conditions and the assessment of their similarity. The environmental covariates encompassed geographic, climatic, and soil information. The climatic variables were obtained using the EnvRtype package (Costa-Neto et al., 2021), which accesses the NASA POWER database (https://power.larc.nasa.gov/) (Sparks, 2018; NasaPower, 2022). Soil attributes were retrieved from the SoilGrids database via API using the httr package for web access (Wickham, 2023) and jsonlite for JSON parsing (Ooms, 2014). Static variables such as altitude and soil properties were associated with the trial location coordinates.

Table 1

Table 1. Summary statistics of 37 environmental features grouped into geographical, climatic, and soil-related categories.

Prior to kernel construction, we applied quality control filters to remove missing or inconsistent values and standardized all continuous variables using Z-score normalization to ensure comparability across different measurement scales (Equation 1):

\begin{array}{l} Z_{i j} = \frac{x_{i j} - {\bar{x}}_{\cdot j}}{s_{\cdot j}} & (1) \end{array}

where ${\bar{x}}_{\cdot j}$ and $s_{\cdot j}$ denote the mean and standard deviation, respectively, of the j-th variable across all locations.
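The standardization in Equation 1 can be illustrated with a minimal Python sketch. The function name `zscore_columns`, the toy covariate values, and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration only; the text does not state which estimator of the standard deviation was used.

```python
import math


def zscore_columns(rows, ddof=1):
    """Standardize each column of a row-major table as in Equation 1.

    ddof=1 gives the sample standard deviation (an assumption of this sketch).
    """
    n = len(rows)
    n_cols = len(rows[0])
    out = [[0.0] * n_cols for _ in range(n)]
    for j in range(n_cols):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - ddof)
        sd = math.sqrt(var)
        if sd == 0.0:
            raise ValueError(f"column {j} is constant and cannot be standardized")
        for i in range(n):
            out[i][j] = (col[i] - mean) / sd
    return out


# Hypothetical covariates for three environments: [mean temperature, rainfall]
W_raw = [[22.0, 800.0], [24.0, 1000.0], [26.0, 1200.0]]
W_std = zscore_columns(W_raw)
print(W_std)
```

After this step every column has mean zero and unit standard deviation, so covariates measured on different scales contribute comparably to the similarity kernel.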

To reduce multicollinearity, we examined the Pearson correlation matrix and flagged highly correlated variable pairs. Redundant variables were removed based on domain knowledge and exploratory principal component analysis (PCA), which was implemented using the factoextra version 1.0.7 package (Kassambara and Mundt, 2016).

The final environment-by-variable matrix W was then used to compute the enviromic similarity kernel $K_{E}$ as described in Equation 2.

\begin{array}{l} K_{E} = \frac{W W^{⊤}}{trace (W W^{⊤}) / n} & (2) \end{array}

where $W^{⊤}$ is the transpose of W, and n is the number of environments. This scaling sets the trace of $K_{E}$ to n, so that its diagonal elements average one, which keeps kernels comparable across analyses. The matrix W contains standardized environmental covariates (e.g., climatic and soil variables), with rows representing environments (location-by-year combinations) and columns corresponding to environmental descriptors.
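As a minimal sketch of Equation 2, the snippet below builds the kernel from a small standardized matrix using plain Python lists. The function name `enviromic_kernel` and the toy matrix are hypothetical; the authors' own implementation is not shown in the text.

```python
def enviromic_kernel(W):
    """Return K_E = W W^T / (trace(W W^T) / n) for a row-major matrix W."""
    n = len(W)
    # Gram matrix W W^T: inner products between environment rows
    gram = [[sum(a * b for a, b in zip(W[i], W[j])) for j in range(n)]
            for i in range(n)]
    trace = sum(gram[i][i] for i in range(n))
    if trace == 0.0:
        raise ValueError("W W^T has zero trace; the kernel is undefined")
    scale = trace / n
    return [[g / scale for g in row] for row in gram]


# Toy standardized matrix: 3 environments (rows) by 2 covariates (columns)
W_toy = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
K_E = enviromic_kernel(W_toy)
mean_diagonal = sum(K_E[i][i] for i in range(len(K_E))) / len(K_E)
print(K_E)
print(mean_diagonal)
```

Dividing by trace(W W^T)/n rescales the Gram matrix so that its diagonal averages one, whatever the number of covariates.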

2.2.1 Identification of mega-environments

Initially, environments were grouped into mega-environments based on an enviromic similarity matrix, denoted as the enviromic kernel ($K_{E}$). This matrix integrated 37 environmental covariates and grain yield. Hierarchical clustering was applied using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm (Sokal and Michener, 1958). The optimal number of clusters was defined using the Elbow method, and the most influential covariates were explored via principal component analysis (PCA) (Pearson, 1901). To prevent methodological circularity, the dataset was randomly split into training (70%) and test (30%) subsets prior to unsupervised learning. PCA and K-means clustering were applied exclusively to the training subset, and the resulting cluster assignments were used as categorical labels for model training.

Classification was performed using the XGBoost (Extreme Gradient Boosting) algorithm (Chen and Guestrin, 2016), implemented via the xgboost package. The model was configured for multi-class classification (multi:softmax) and trained using the first three principal components. The hyperparameters used were: tree depth of 6, learning rate (η) of 0.3, and 100 boosting iterations. The objective function minimized by the algorithm included both the predictive loss and regularization terms, and is expressed in Equation 3:

\begin{array}{l} ℒ (θ) = \sum_{i = 1}^{N} ℓ (y_{i}, {\hat{y}}_{i}^{(t)}) + \sum_{t = 1}^{T} Ω (f_{t}), & (3) \end{array}

where $ℓ$ denotes the multinomial log-loss function, and the regularization term $Ω (f_{t})$ for each tree $f_{t}$ is defined in Equation 4:

\begin{array}{l} Ω (f_{t}) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}, & (4) \end{array}

in which T is the number of leaves of tree $f_{t}$ (distinct from the number of trees T over which Equation 3 sums), w_j is the score on leaf j, γ is the complexity penalty for the number of leaves, and λ controls the L2 regularization on leaf weights. All analyses were performed in R (version 4.3.1) using the following packages: cluster (Maechler et al., 2019), caret (Kuhn et al., 2020), xgboost (Chen et al., 2022), and dendextend (Galili, 2015).
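To make the regularization term of Equation 4 concrete, the following sketch evaluates it for a single tree. The function name `tree_penalty`, the leaf scores, and the values of γ and λ are hypothetical; the text reports tree depth, learning rate, and number of iterations but not γ or λ.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Regularization term of Equation 4 for one tree.

    leaf_weights: scores w_j on the leaves of the tree
    gamma: penalty per leaf; lam: L2 penalty on the leaf scores
    """
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


# Hypothetical tree with three leaves; gamma and lam are illustrative values
penalty = tree_penalty([0.5, -0.5, 1.0], gamma=1.0, lam=2.0)
print(penalty)
```

Larger trees and larger leaf scores both raise the penalty, which is how the objective in Equation 3 trades predictive loss against model complexity.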

To explore the relationship between environmental variables and grain yield, we fitted a multiple linear regression model using the adjusted mean yield for each environment as the response variable. The model is specified in Equation 5:

\begin{array}{l} y = μ + \sum_{i = 1}^{t} β_{i} X_{i} + e & (5) \end{array}

where y represents the adjusted mean yield in each environment; µ is the intercept of the model, corresponding to the overall mean yield; $β_{i}$ denotes the coefficient associated with the i-th environmental variable; $X_{i}$ corresponds to the value of the i-th environmental feature; e is the random error term, assumed to follow a normal distribution with zero mean and constant variance. Adjusted means used as the response variable were obtained by fitting separate linear mixed models for each environment, in which genotype was included as a fixed effect and replication as a random effect. From these models, empirical best linear unbiased estimates (eBLUEs) of genotype means were extracted. Subsequently, the mean of the eBLUEs within each environment was calculated and used as the environment-level adjusted mean in the subsequent analyses.
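The regression in Equation 5 can be sketched with ordinary least squares solved through the normal equations. This is an illustrative stand-in written with the Python standard library, not the authors' R code; the function name `fit_ols` and the toy data are assumptions, and the mixed-model step that produces the adjusted means is not reproduced here.

```python
def fit_ols(X, y):
    """Least-squares fit of y = mu + sum_i beta_i * X_i.

    Returns the coefficients as [mu, beta_1, ..., beta_p].
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Normal equations A b = c with A = X'X and c = X'y, as an augmented matrix
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution
    coef = [0.0] * p
    for i in reversed(range(p)):
        s = M[i][p] - sum(M[i][k] * coef[k] for k in range(i + 1, p))
        coef[i] = s / M[i][i]
    return coef


# Toy data generated from y = 1 + 2*x1 + 3*x2 (hypothetical covariates)
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_toy = [1.0, 3.0, 4.0, 6.0]
coefficients = fit_ols(X_toy, y_toy)
print(coefficients)
```

With standardized covariates, as in Equation 1, the fitted intercept equals the mean of the response, which matches the interpretation of µ given above.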

2.3 Statistical analysis

We analyzed the phenotypic data using the linear mixed-effects model described by Henderson (1949) and Henderson (1950). Estimation of variance components was performed using the residual maximum likelihood (REML) method (Patterson and Thompson, 1971). The model was implemented using the ASReml-R package (version 4.1.2) (Butler et al., 2018) within the R software environment (R Core Team, 2022). Prior to model fitting, we assessed the validity of key model assumptions through standard residual diagnostics. The normality of residuals was evaluated using quantile–quantile (Q-Q) plots, as recommended by Kozak and Piepho (2018). Residual independence was assumed, and heteroscedasticity across environments was addressed by specifying a diagonal residual covariance matrix, allowing each environment to have its own residual variance. The applied model follows Equation 6.

\begin{array}{l} y = μ 1_{n} + X_{1} s + X_{2} b + Z_{1} g + ε & (6) \end{array}

In which $y^{(n \times 1)}$ is the vector of phenotypic data across $t$ environments, where $n = \sum_{j = 1}^{t} n_{j}$, and $n_{j}$ is the number of observations in each environment $j$; $μ$ is the model intercept; $s^{(t \times 1)}$ is the vector of fixed effects for environments; $b^{(b \times 1)}$ is the vector of fixed effects for the blocks, where $b = \sum_{j = 1}^{t} b_{j}$ and $b_{j}$ is the number of blocks within environment $j$; $g^{(v \times 1)}$ is the vector of random effects for the $v$ genotypes evaluated across environments, where $g \sim MVN (0, G \otimes I_{v})$. Although genotypes are conceptually common across environments, the factor analytic (FA) model implicitly nests genotypes within environments by modeling the genotype-by-environment interaction through the G matrix, which captures the variance–covariance structure among environments. $ε^{(n \times 1)}$ is the vector of residual effects, where $ε \sim MVN (0, R \otimes I_{n})$. Here, R is a diagonal matrix of order $t$, allowing for heterogeneous residual variances across environments, i.e., $R = diag (σ_{ε_{1}}^{2}, σ_{ε_{2}}^{2}, \dots, σ_{ε_{t}}^{2})$. $X_{1}^{(n \times t)}$, $X_{2}^{(n \times b)}$, and $Z_{1}^{(n \times v)}$ represent the incidence matrices of the vectors accompanying them in the model. $1_{n}^{(n \times 1)}$ is a vector of ones; and $I_{v}$ and $I_{n}$ are identity matrices of orders $v$ and $n$, respectively.

The genotypic effect vector $g$, for an FA model of order $K$, is then expressed in Equation 7:

\begin{array}{l} g = (\hat{Λ} \otimes I_{v}) \hat{f} + \hat{δ} & (7) \end{array}

where ${\hat{Λ}}^{(t \times K)}$ is the matrix containing the $K$ factor loadings for each of the $t$ environments, with rows $(λ_{1}, λ_{2}, \dots, λ_{t})$, ${\hat{f}}^{(K v \times 1)}$ is the vector containing the scores of the $v$ genotypes on each of the $K$ factors, ${[f_{1}^{T}, f_{2}^{T}, \dots, f_{K}^{T}]}^{T}$, and ${\hat{δ}}^{(t v \times 1)}$ is the vector representing the model’s lack of fit. The joint distribution of $\hat{f}$ and $\hat{δ}$ is given in Equation 8:

\begin{array}{l} (\begin{matrix} \hat{f} \\ \hat{δ} \end{matrix}) \sim N [(\begin{matrix} 0 \\ 0 \end{matrix}), (\begin{matrix} I_{K} \otimes I_{v} & 0 \\ 0 & Ψ \otimes I_{v} \end{matrix})] & (8) \end{array}

In which $Ψ^{(t \times t)}$ is the diagonal matrix of specific variances ($ψ_{1}, ψ_{2}, \dots, ψ_{t}$) for each environment, i.e., the variance not captured by the factors.
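Equations 7 and 8 together imply a between-environment genetic covariance matrix of the form ΛΛᵀ + Ψ, since the factor scores have identity covariance and are independent of the lack-of-fit effects. The sketch below assembles that matrix; the function name `fa_covariance` and all numerical values are hypothetical.

```python
def fa_covariance(loadings, specific_vars):
    """Build G = Lambda Lambda^T + Psi implied by Equations 7 and 8.

    loadings: t rows (environments) by K columns (factors)
    specific_vars: the t specific variances on the diagonal of Psi
    """
    t = len(loadings)
    G = [[sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(t)]
         for i in range(t)]
    for i in range(t):
        G[i][i] += specific_vars[i]  # Psi only affects the diagonal
    return G


# Hypothetical loadings for t = 3 environments and K = 2 factors
loadings_toy = [[1.0, 0.0], [0.5, 0.5], [-0.5, 1.0]]
psi_toy = [0.2, 0.1, 0.3]
G_toy = fa_covariance(loadings_toy, psi_toy)
print(G_toy)
```

With t environments and K factors, the structure is described by the tK loadings and t specific variances instead of the t(t + 1)/2 parameters of an unstructured matrix, which is the source of the parsimony mentioned in the Introduction.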

The most parsimonious model was selected based on the percentage of explained variance $v_{k_{t}}$, computed for each factor $k$ within each environment $t$ and accumulated over all $K$ factors (Equation 9) (Smith et al., 2015), and on the average semi-variance ratio (ASVR) (Equation 10) (Piepho, 2019; Chaves et al., 2023).

\begin{array}{l} v_{k_{t}} = \frac{{\hat{λ}}_{k_{t}}^{⋆^{2}} d_{k}}{\sum_{k = 1}^{K} {\hat{λ}}_{k_{t}}^{⋆^{2}} d_{k} + {\hat{ψ}}_{t}} \times 100 & (9) \end{array}

\begin{array}{l} A S V R = \frac{\frac{2}{t (t - 1)} \sum_{j = 1}^{t - 1} \sum_{j' = j + 1}^{t} [\frac{1}{2} (\sum_{k = 1}^{K} {\hat{λ}}_{k_{j}}^{⋆^{2}} + \sum_{k = 1}^{K} {\hat{λ}}_{k_{j'}}^{⋆^{2}}) - \sum_{k = 1}^{K} {\hat{λ}}_{k_{j}}^{⋆} {\hat{λ}}_{k_{j'}}^{⋆}]}{\frac{2}{t (t - 1)} \sum_{j = 1}^{t - 1} \sum_{j' = j + 1}^{t} [\frac{1}{2} ((\sum_{k = 1}^{K} {\hat{λ}}_{k_{j}}^{⋆^{2}} + {\hat{ψ}}_{j}) + (\sum_{k = 1}^{K} {\hat{λ}}_{k_{j'}}^{⋆^{2}} + {\hat{ψ}}_{j'})) - \sum_{k = 1}^{K} {\hat{λ}}_{k_{j}}^{⋆} {\hat{λ}}_{k_{j'}}^{⋆}]} \times 100 & (10) \end{array}

where $j$ and $j'$ index the two environments of each pair, and $t$ is the number of environments.
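Both selection criteria can be computed directly from the estimated loadings and specific variances. The sketch below is a hedged illustration of Equations 9 and 10: the function names, the toy values, and the choice of unit factor variances in the example are assumptions. In the ASVR function the loadings are taken to already absorb the factor variances, as the equation is written without $d_{k}$.

```python
from itertools import combinations


def explained_variance(loadings, factor_vars, specific_vars):
    """Percent of variance explained by each factor in each environment (Eq. 9)."""
    result = []
    for lam_t, psi_t in zip(loadings, specific_vars):
        parts = [lam * lam * d for lam, d in zip(lam_t, factor_vars)]
        total = sum(parts) + psi_t
        result.append([100.0 * part / total for part in parts])
    return result


def asvr(loadings, specific_vars):
    """Average semi-variance ratio in percent (Eq. 10).

    The 2 / (t (t - 1)) averaging factor cancels between numerator and
    denominator, so plain sums over environment pairs are used.
    """
    num = 0.0
    den = 0.0
    for j, jp in combinations(range(len(loadings)), 2):
        common_j = sum(l * l for l in loadings[j])
        common_jp = sum(l * l for l in loadings[jp])
        cov = sum(a * b for a, b in zip(loadings[j], loadings[jp]))
        num += 0.5 * (common_j + common_jp) - cov
        den += 0.5 * ((common_j + specific_vars[j])
                      + (common_jp + specific_vars[jp])) - cov
    return 100.0 * num / den


# Hypothetical estimates for t = 3 environments and K = 2 factors
lam_toy = [[1.0, 0.0], [0.5, 0.5], [-0.5, 1.0]]
psi_est = [0.2, 0.1, 0.3]
ev = explained_variance(lam_toy, factor_vars=[1.0, 1.0], specific_vars=psi_est)
ratio = asvr(lam_toy, psi_est)
print(ev)
print(ratio)
```

Summing a row of `ev` gives the total percentage explained in that environment by all K factors.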

The generalized heritability of Cullis et al. (2006) was obtained using Equation 11:

\begin{array}{l} H^{2} = 1 - (\frac{{\bar{ν}}_{B L U P}}{2 σ_{g}^{2}}) & (11) \end{array}

Where ${\bar{ν}}_{B L U P}$ is the average pairwise prediction error variance, and $σ_{g}^{2}$ is the genotypic variance.
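A short numerical sketch of Equation 11 follows. The function name `cullis_heritability` and the input values are hypothetical; in practice both quantities would come from the fitted mixed model.

```python
def cullis_heritability(mean_pairwise_pev, genotypic_var):
    """Generalized heritability of Equation 11.

    mean_pairwise_pev: average prediction error variance of pairwise
        differences between genotype BLUPs
    genotypic_var: estimated genotypic variance
    """
    if genotypic_var <= 0.0:
        raise ValueError("genotypic variance must be positive")
    return 1.0 - mean_pairwise_pev / (2.0 * genotypic_var)


# Illustrative values only
h2 = cullis_heritability(mean_pairwise_pev=0.5, genotypic_var=1.0)
print(h2)
```

The ratio compares the uncertainty of a difference between two predicted genotypes with the variance of a true difference (twice the genotypic variance), so values near one indicate that genotype contrasts are well estimated.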

The coefficient of variation was calculated using Equation 12.

\begin{array}{l} C V = \frac{{\hat{σ}}_{e}}{\hat{μ}} & (12) \end{array}

Where ${\hat{σ}}_{e}$ is the estimated residual standard deviation, and $\hat{μ}$ is the overall mean of each environment.

We estimated the genetic correlation between pairs of environments as described by Cullis et al. (2010), given by Equation 13:

\begin{array}{l} ρ_{g_{t t'}} = \frac{\sum_{k = 1}^{K} {\hat{λ}}_{k_{t}} {\hat{λ}}_{k_{t'}}}{\sqrt{{\hat{σ}}_{g t}^{2} {\hat{σ}}_{g t'}^{2}}} & (13) \end{array}

where ${\hat{σ}}_{g t}^{2}$ and ${\hat{σ}}_{g t'}^{2}$ represent the genotypic variance components in environments $t$ and $t'$, respectively. In matrix form, the full correlation matrix is obtained as $D G D$, where $D$ is a diagonal matrix composed of the reciprocal square roots of the diagonal elements of matrix $G$.
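The matrix form D G D can be sketched as follows, where D holds the reciprocal square roots of the diagonal of G. The function name `genetic_correlations` and the toy covariance matrix are hypothetical.

```python
import math


def genetic_correlations(G):
    """Convert a genetic covariance matrix G into correlations via D G D."""
    # Diagonal of D: reciprocal square roots of the genotypic variances
    d = [1.0 / math.sqrt(G[i][i]) for i in range(len(G))]
    return [[d[i] * G[i][j] * d[j] for j in range(len(G))] for i in range(len(G))]


# Hypothetical covariance matrix for two environments
G_cov = [[4.0, 2.0], [2.0, 9.0]]
C = genetic_correlations(G_cov)
print(C)
```

Each off-diagonal element is the covariance between two environments divided by the product of their genotypic standard deviations, which is the element-wise form of Equation 13.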

The crossover interaction was estimated using Equation 14:

\begin{array}{l} σ_{g e_{r k}}^{2} = 1 - \frac{σ^{2} (\sqrt{σ_{g_{t}}^{2}})}{σ_{g e}^{2}} & (14) \end{array}

In Equation 14, $σ^{2} (\sqrt{σ_{g_{t}}^{2}})$ denotes the variance of the genotypic standard deviations across environments. The variance component for the genotype-by-environment (G×E) interaction, denoted as $σ_{g e}^{2}$, was estimated using a compound symmetry (CS) model. In this structure, the variance-covariance matrix of the genetic effects is defined as $σ_{g}^{2} J + σ_{g e}^{2} I_{j}$, where $J$ is a matrix of ones. The CS model was adopted following the conceptual framework proposed by Cooper and Delacy (1994), which enables the partitioning of G×E interaction into simple (related to genotypic response consistency) and crossover (due to changes in genotype ranking) components. By assuming equal genetic variances and covariances across environments, the CS structure provides a neutral and interpretable baseline, from which deviations can be attributed to crossover interaction. This approach avoids conflating model-derived correlation structures, such as those in FA models, with the theoretical decomposition of the G×E variance.
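A minimal numerical sketch of Equation 14 is given below, reading $σ^{2} (\sqrt{σ_{g_{t}}^{2}})$ as the variance of the per-environment genotypic standard deviations. The function name `crossover_proportion`, the input values, and the use of the sample variance are assumptions of this sketch; the text does not specify the variance estimator.

```python
import math
import statistics


def crossover_proportion(genotypic_vars, sigma2_ge):
    """Proportion of G x E variance attributed to crossover (Equation 14).

    genotypic_vars: genotypic variance in each environment
    sigma2_ge: G x E variance component from the compound symmetry model
    """
    sds = [math.sqrt(v) for v in genotypic_vars]
    # Sample variance of the standard deviations (assumption of this sketch)
    heterogeneity = statistics.variance(sds)
    return 1.0 - heterogeneity / sigma2_ge


# Illustrative values only
prop = crossover_proportion([1.0, 4.0, 9.0], sigma2_ge=2.5)
print(prop)
```

The subtracted term is the share of G×E variance due to heterogeneous genotypic variances (the simple, non-crossover part), so the remainder is attributed to changes in genotype ranking.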

2.4 Factor Analytic Selection Tools

To address identifiability issues and enable biological interpretability in factor analytic (FA) models, we adopted the constraints implemented in ASReml-R (Butler et al., 2018), as described by Smith et al. (2021). Specifically, for models with more than one factor (K > 1), the upper triangular elements of the loading matrix Λ were set to zero, and the factor scores were assumed to have a diagonal covariance matrix with decreasing elements. The constrained loading matrix is denoted as Λ^∗, and the corresponding factor scores as f^∗. To recover the original (rotated) parameterization while preserving the variance structure implied by the model, we performed a singular value decomposition (SVD) of Λ^∗ as follows in Equation 15:

\begin{array}{l} Λ^{*} = U L^{1 / 2} V^{⊤}, & (15) \end{array}

where $U$ and $V$ are orthonormal matrices of dimensions $t \times K$ and $K \times K$, respectively, and $L$ is a diagonal matrix whose elements, the squared singular values of $Λ^{*}$, are sorted in decreasing order. The final rotated loading matrix is then obtained as $Λ = Λ^{*} V L^{- 1 / 2} = U$, and the diagonal matrix of factor variances is $D = L$. Accordingly, the scores $f$ are reconstructed as $(L^{1 / 2} V^{⊤} \otimes I_{v}) f^{*}$, ensuring that the variance of the factors satisfies $var (f) = D \otimes I_{v}$, as required for proper modeling of the random effects in the FA structure. These constraints facilitate identifiability and maintain the interpretability of the latent dimensions while preserving the implied genetic covariance structure across environments.
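As a brief check of the rotation, the following derivation assumes that the constrained scores have identity covariance, $var (f^{*}) = I_{K} \otimes I_{v}$, as in Equation 8. It is a sketch under that assumption and uses only the definitions given above:

\begin{array}{l} var (f) = (L^{1 / 2} V^{⊤} \otimes I_{v}) (I_{K} \otimes I_{v}) (V L^{1 / 2} \otimes I_{v}) = L^{1 / 2} V^{⊤} V L^{1 / 2} \otimes I_{v} = L \otimes I_{v} = D \otimes I_{v} \end{array}

\begin{array}{l} (Λ \otimes I_{v}) f = (U L^{1 / 2} V^{⊤} \otimes I_{v}) f^{*} = (Λ^{*} \otimes I_{v}) f^{*} \end{array}

The first line uses $V^{⊤} V = I_{K}$; the second shows that the rotation leaves the fitted genotype effects, and hence the implied covariance structure, unchanged.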

To support genotype selection within the environments evaluated, we used FA models and applied the selection tools proposed by Smith and Cullis (2018). Specifically, the overall performance (OP_v) (Stefanova et al., 2009) of the v-th genotype was calculated using Equation 16:

\begin{array}{l} O P_{v} = \frac{1}{t} \sum_{j = 1}^{t} {\hat{λ}}_{1 j}^{*} {\tilde{f}}_{1 v}^{*} & (16) \end{array}

In Equation 16, ${\hat{λ}}_{1 j}^{*}$ represents the rotated loading of the $j$ -th environment on the first latent factor, ${\tilde{f}}_{1 v}^{*}$ denotes the rotated score of the $v$ -th genotype on the first latent factor, and $t$ is the number of environments.

The remaining factors evaluate the stability parameter. The overall stability of the v-th genotype can be calculated by the root mean square deviation (RMSD_v) using the following Equation 17:

\begin{array}{l} R M S D_{v} = \sqrt{\frac{1}{t} \sum_{j = 1}^{t} ε_{v j}^{*^{2}}} & (17) \end{array}

In Equation 17, $ε_{v j}^{*}$ is the deviation of the $v$ -th genotype in the $j$ -th environment from the prediction based on the first factor alone, obtained as $ε_{v j}^{*} = {\tilde{β}}_{v j} - {\hat{λ}}_{1 j}^{*} {\tilde{f}}_{1 v}^{*}$, where ${\tilde{β}}_{v j} = \sum_{k = 1}^{K} {\hat{λ}}_{k j}^{*} {\tilde{f}}_{k v}^{*}$ is the linear combination of loadings and scores over all $K$ factors. The deviation therefore collects the contribution of all factors except the first.
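The two selection statistics can be sketched for a single genotype as follows. This illustration assumes, following the framework of Smith and Cullis (2018), that the deviation in each environment is the contribution of factors 2 to K (the part of the fitted effect not explained by the first factor) and that deviations are squared before averaging. Function names and numerical values are hypothetical.

```python
import math


def overall_performance(loadings, scores_v):
    """OP of one genotype: mean over environments of loading_1 * score_1."""
    t = len(loadings)
    return sum(row[0] * scores_v[0] for row in loadings) / t


def rmsd(loadings, scores_v):
    """Stability of one genotype: root mean square of the deviations
    contributed by factors 2..K in each environment."""
    t = len(loadings)
    sum_sq = 0.0
    for row in loadings:
        deviation = sum(l * f for l, f in zip(row[1:], scores_v[1:]))
        sum_sq += deviation * deviation
    return math.sqrt(sum_sq / t)


# Hypothetical rotated loadings (t = 3 environments, K = 2 factors)
# and rotated scores of one genotype on the two factors
rot_loadings = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5]]
scores_one_genotype = [10.0, 2.0]
op_value = overall_performance(rot_loadings, scores_one_genotype)
rmsd_value = rmsd(rot_loadings, scores_one_genotype)
print(op_value, rmsd_value)
```

A genotype with a large score on the first factor and small scores on the remaining factors therefore combines high overall performance with a low RMSD.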

The responsiveness of genotype v to the k-th factor ( $R E_{v k}$ ) was computed as shown in Equation 18:

\begin{array}{l} R E_{v k} = ({\bar{λ}}_{k +}^{*} - {\bar{λ}}_{k -}^{*}) f_{v k}^{*} & (18) \end{array}

where ${\bar{λ}}_{k +}^{*}$ and ${\bar{λ}}_{k -}^{*}$ represent the mean of the positive and negative rotated loadings, respectively, associated with the $k$ -th latent factor.
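Responsiveness can be sketched as the difference between the mean positive and the mean negative loading of a factor, multiplied by the genotype's score on that factor, following the definitions in the text. The function name `responsiveness` and the values are hypothetical.

```python
from statistics import mean


def responsiveness(loadings_k, score_vk):
    """Responsiveness of one genotype to factor k.

    loadings_k: rotated loadings of all environments on factor k
    score_vk: rotated score of the genotype on factor k
    """
    positive = [l for l in loadings_k if l > 0]
    negative = [l for l in loadings_k if l < 0]
    if not positive or not negative:
        raise ValueError("factor needs both positive and negative loadings")
    return (mean(positive) - mean(negative)) * score_vk


# Hypothetical loadings of four environments on one factor
re_value = responsiveness([0.5, -0.5, 0.5, -1.5], score_vk=2.0)
print(re_value)
```

The guard reflects that the contrast is only defined for factors whose loadings change sign across environments, which is typically the case for factors beyond the first.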

We evaluated the reliability of each genotype using Equation 19:

\begin{array}{l} R_{v} = 1 - \frac{P E V_{v}}{{\bar{σ}}_{g}^{2}} & (19) \end{array}

In which ${PEV}_{v}$ is the prediction error variance of the v-th genotype, and ${\bar{σ}}_{g}^{2}$ is the mean genotypic variance across environments.

An ideal genotype should present both high overall performance (OP_v) and low root mean square deviation (RMSD_v). The ideal genotype is selected based on the construction of an index (FAST_v) (Chaves et al., 2023; Cowling et al., 2023) (Equation 20):

\begin{array}{l} F A S T_{v} = (2 \times \frac{O P_{v} - \overline{O P}}{\sqrt{σ_{(O P)}^{2}}} - \frac{R M S D_{v} - \overline{R M S D}}{\sqrt{σ_{(R M S D)}^{2}}}) \times R_{v} & (20) \end{array}
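The index in Equation 20 standardizes OP and RMSD across genotypes, weights OP twice as heavily, and scales the result by reliability. A minimal sketch follows; the function name `fast_index`, the toy inputs, and the use of the sample standard deviation are assumptions, since the estimator is not stated in the text.

```python
import statistics


def fast_index(op, rmsd_values, reliability):
    """FAST index of Equation 20 for a list of genotypes."""
    op_mean, op_sd = statistics.mean(op), statistics.stdev(op)
    rm_mean, rm_sd = statistics.mean(rmsd_values), statistics.stdev(rmsd_values)
    index = []
    for o, r, rel in zip(op, rmsd_values, reliability):
        z_op = (o - op_mean) / op_sd      # higher OP raises the index
        z_rmsd = (r - rm_mean) / rm_sd    # higher RMSD lowers the index
        index.append((2.0 * z_op - z_rmsd) * rel)
    return index


# Hypothetical values for three genotypes
fast = fast_index(op=[10.0, 20.0, 30.0],
                  rmsd_values=[3.0, 1.0, 2.0],
                  reliability=[0.5, 0.5, 0.5])
print(fast)
```

Genotypes can then be ranked by the index, with larger values indicating a better combination of performance, stability, and reliability.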

3 Results

Environmental kernel-based analyses incorporated climate and soil data from trials between 2017 and 2024. Principal component analysis (PCA) explained 52.7% of the total variance, with 33.1% attributed to the first principal component (PC1) and 19.6% to the second (PC2) (Figure 2A). Ten environmental features contributed most to climate variation among trials, with growing degree days (gdd), mean temperature (tmean), and photosynthetically active radiation use efficiency (fue) showing the strongest loadings in PC1 (Figure 2B). Hierarchical clustering applied to environmental similarities (based on the XGBoost model) suggested three mega-environment groups (Figure 2C). Regarding yield, the variables fue, spv (seasonal precipitation variation), and tmrange (thermal amplitude) were associated with the largest regression coefficients. Additionally, fue, tmdew (mean dew point), wsm (soil moisture), and rhm (mean relative humidity) showed statistically significant associations with yield (p < 0.05) (Figure 2D).

Figure 2

Figure 2. (A) Principal component analysis (PCA) based on the environmental kernel, with green, black, and red indicating the mega-environments Mega 1, Mega 2, and Mega 3, respectively. (B) Environmental variables that contribute the most across all evaluation sites. (C) Dendrogram based on the XGBoost model, used to cluster environments into mega-environments in Malawi and Zambia during the 2017/18 to 2023/24 growing seasons. (D) Variables with the greatest influence on yield performance in the trials.

The M4 model, with a factor analytic (FA) variance-covariance structure consisting of four factors (Table 2), exhibited the best fit for the dataset (Supplementary Figure S2). The four-factor model (FA4) explained 82.44% of the variance, with an ASVR of 81.95%. The selection criterion considered not only explanatory capacity but also parsimony.

Table 2. Log-likelihood (LogL), deviance, number of parameters (Par.), explained variance (var%), and average semi-variance ratio (ASVR) for the models tested.

The Pan-African Trial Network demonstrated high experimental precision, with coefficient of variation (CV) values ranging from 0.07 (M18s2E006) to 0.50 (M21s1E051). Broad-sense heritability coefficients (H²) were also substantial, ranging from 0.46 (M19s2E015) to 0.85 (Z21s2E059) (Figure 3). Across trials, the CV had a median of 0.229, with first and third quartiles of 0.183 and 0.272, respectively. Similarly, H² values had a median of 0.768, with Q1 = 0.710 and Q3 = 0.789 (Supplementary Figure S3). The average yield across the trials was 2,508.54 kg ha⁻¹; however, there was considerable variation among the experiments, ranging from 523.82 kg ha⁻¹ (Z19s2E027) to 4,410.92 kg ha⁻¹ (M22s2E062) (Supplementary Table S1). Considering the two countries individually, the average yield in Malawi was 3,171.10 kg ha⁻¹, while in Zambia it was 2,555.94 kg ha⁻¹.

Figure 3. Scatterplot showing the relationship between the coefficient of variation (CV) and heritability (H²) across 83 soybean yield trials conducted in Malawi and Zambia. Each point represents an environment (trial), positioned according to its heritability (X-axis) and CV (Y-axis), with labels indicating the environment codes.

Figure 4 shows a heatmap of pairwise genetic correlations between environments based on the factor analytic (FA) model. The strongest negative correlation was observed between trials Z19s2E028 and Z22s2E067 (r = −0.99), indicating a strong crossover interaction. Environments Z21s2E059, Z21s2E056, and Z21s2E057 showed high variability in correlations with other trials (SD > 0.48), suggesting inconsistent genotype responses. In contrast, Z19s2E024 and Z22s2E065 were among the most stable environments, with the lowest standard deviations in correlations (SD < 0.23). Trials such as Z20s2E046 and M20s2E039 exhibited the highest mean correlations with other environments (mean r > 0.20), highlighting their potential as representative environments for genotype recommendation. These results reflect substantial heterogeneity in genotype-by-environment interactions across trials conducted in Malawi and Zambia from 2017/18 to 2023/24, emphasizing the importance of environment-specific selection.

Figure 4. Heatmap showing pairwise genotypic correlations between environments based on the factor analytic (FA) model. Each cell represents the genetic correlation between two trials, with a color scale ranging from −1 to 1. Trial names are shown along both axes, and the figure emphasizes patterns of genetic similarity across environments. The evaluations were conducted in Malawi and Zambia from the 2017/18 to 2023/24 seasons, focusing on soybean grain yield.
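The quantities read from such a heatmap can be reproduced from FA estimates. The sketch below assumes the standard FA decomposition of the between-environment genetic covariance, G = ΛΛ' + Ψ; the function names and the three-environment, two-factor loadings and specific variances are made-up illustrative values, not estimates from this study.

```python
import numpy as np

def fa_genetic_correlations(loadings, specific_var):
    """Return G = LL' + Psi, its correlation matrix, and the share of variance due to factors."""
    common = loadings @ loadings.T
    G = common + np.diag(specific_var)
    sd = np.sqrt(np.diag(G))
    C = G / np.outer(sd, sd)                  # pairwise genetic correlations
    explained = np.trace(common) / np.trace(G)
    return G, C, explained

def summarize_correlations(C):
    """Per-environment mean and SD of correlations with the others, and share of negative pairs."""
    p = C.shape[0]
    off_diag = ~np.eye(p, dtype=bool)
    mean_r = np.array([C[j, off_diag[j]].mean() for j in range(p)])
    sd_r = np.array([C[j, off_diag[j]].std(ddof=1) for j in range(p)])
    upper = C[np.triu_indices(p, k=1)]
    share_negative = float((upper < 0).mean())
    return mean_r, sd_r, share_negative

# Hypothetical loadings (3 environments x 2 factors) and specific variances.
loadings = np.array([[1.0, 0.5],
                     [0.8, -0.6],
                     [0.2, 0.9]])
specific_var = np.array([0.25, 0.0, 0.15])

G, C, explained = fa_genetic_correlations(loadings, specific_var)
mean_r, sd_r, share_negative = summarize_correlations(C)
```

In this toy case one of the three environment pairs has a negative correlation. How many negative correlations count as crossover interaction in a given study depends on the criterion chosen, which this sketch does not attempt to reproduce.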

The varieties V020, V075, V137, V158, V035, V025, and V031 exhibited the best performance, as indicated by the highest OP values (Y-axis). Regarding stability, V013 showed the best fit, with the lowest RMSD values (X-axis) according to the FAST index. Varieties V025, V035, and V158 demonstrated high yield and reliability but exhibited medium stability (Figure 5).

Figure 5. Graph showing the relationship between overall performance (OP) and stability, measured as root mean square deviation (RMSD), for soybean varieties evaluated in the Pan-African Trials Network across the 2017–2023/24 seasons. OP represents the mean performance of each genotype across environments, while RMSD quantifies the deviation from the average response, with lower values indicating higher stability. Each point corresponds to a genotype, and colors represent the reliability of the estimated performance–stability values, with the color scale ranging from red (low reliability) to green (high reliability). Axes labels and the legend have been enlarged to improve readability. This visualization summarizes results from the FAST (Factor Analytic Selection Tools) analysis.

Figure 6 presents the response of the varieties to the second (Figure 6A), third (Figure 6B), and fourth (Figure 6C) factors. Responsiveness to specific factors facilitates the identification of environmental conditions associated with the environments that contribute to these factors. In this context, varieties V075, V020, and V137 demonstrated high overall performance and stability across factors 2, 3, and 4, respectively. Conversely, genotypes exhibiting low reliability (< 0.4%), such as V029, V110, V100, and V105 (Figure 5), also consistently showed the poorest overall performance across all four evaluated factors, highlighting their limited adaptability and potential. Additionally, variety V013 maintained the best fit in terms of OP, suggesting higher stability and suitability under the tested conditions (Figure 6). These findings suggest that the associated factors may reflect meaningful environmental characteristics that can be leveraged for specific adaptation.

Figure 6. Overall performance (OP) vs. stability (RMSD) for all 169 soybean varieties from the Pan-African Trials Network. Biplot (A) represents responsiveness to the second factor, (B) to the third factor, and (C) to the fourth factor. Each point represents a genotype, with color indicating the reliability of its estimated performance–stability score. The color scale ranges from red (low reliability) to green (high reliability), as shown in the accompanying legend. Axes labels and the reliability legend have been enlarged to enhance readability.

4 Discussion

In this study, we applied FAST tools for selecting soybean varieties with high overall performance and stability in grain yield across METs. Additionally, we utilized GIS and envirotyping tools to explore associations between environmental features and grain yield, and to define mega-environments. Integrating environmental data into genetic-statistical models facilitated the characterization of G×E interaction patterns and their association with yield performance (Tolhurst et al., 2022). Furthermore, identifying environmental similarities between the experimental network and the TPE can enhance genetic gains through selection (Chaves et al., 2024).

Soybean yield components are strongly influenced by the environment (Araújo et al., 2024) and are therefore subject to G × E interaction (Meyer et al., 2024; Agoyi et al., 2024; Abebe et al., 2024). Over the years, overall performance and stability parameters have been assessed using methods based on analysis of variance (ANOVA) and linear regression. However, these methods have limitations, such as (i) modeling the genotype effect only as fixed and (ii) requiring balanced data. We instead modeled the genotype effect as random, employing the factor analytic structure (Piepho, 1997b; Smith et al., 2001a). This approach allows genetic parameters to be estimated using heterogeneous random effects, enabling the evaluation of genetic progress over breeding cycles in various locations, seasons, and agricultural years (Gogel et al., 2018; Chaves et al., 2023).

The genetic correlation heatmap in Figure 4 reveals high heterogeneity in genetic variances and low genetic correlations among environments, highlighting the crossover nature of the G × E interaction (Cullis et al., 2010). In other words, as the intensity of the interaction increases, the genetic correlation between pairs of environments decreases. This phenomenon is explained by the disparity in genetic variance values in each environment and the covariance between pairs of environments (Cooper and Delacy, 1994). Heinemann et al. (2022) demonstrated, in the context of crossover G×E interaction, the influence of environmental features on yield components. This can be explained by the direct effect of specific environmental variables on the adaptation of genotypes in METs. Therefore, it becomes crucial to identify environmental factors (climate, soil, spatial trends, among others) and genetic factors influencing the G × E interaction. To achieve this, robust methodologies are necessary to dissect this interaction and enable more precise selection (Kang et al., 1989).

The FA model stands out for its efficiency in handling diverse data structures (Piepho, 1998). This approach is commonly employed in MET, particularly during the stages of cultivar selection and recommendation (Kelly et al., 2007). This becomes possible due to the derivation of orthogonal factors from a set of correlated variables (Cullis et al., 2014). These factors represent linear combinations of the factor loadings associated with each environment, along with the corresponding scores for each cultivar. It is worth noting that the structure of the FA model resembles that of an unstructured covariance matrix but distinguishes itself by its greater parsimony. A study conducted by Chaves et al. (2023) demonstrated the effectiveness and flexibility of FAST in selecting tropical maize genotypes, aiming for overall performance and stability across different locations and seasons. The authors suggested incorporating pedigree or genomic data into the statistical model, applying optimization methods, and using environmental features as strategies to enhance selection estimates.
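The structure described above can be written compactly. The notation below follows a common convention for FA models in MET analysis and is an assumption here; it may differ from the symbols used in the Methods. For genotype i in environment j, with p environments and k factors:

```latex
\begin{aligned}
u_{ij} &= \sum_{r=1}^{k} \lambda_{jr} f_{ir} + \delta_{ij},
  && \text{($\lambda_{jr}$: loading of environment $j$; $f_{ir}$: score of genotype $i$)} \\
\mathbf{G}_e &= \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi},
  && \boldsymbol{\Psi} = \operatorname{diag}(\psi_1, \dots, \psi_p) \\
\rho_{jj'} &= \frac{\boldsymbol{\lambda}_j^{\top}\boldsymbol{\lambda}_{j'}}
  {\sqrt{(\boldsymbol{\lambda}_j^{\top}\boldsymbol{\lambda}_j + \psi_j)
         (\boldsymbol{\lambda}_{j'}^{\top}\boldsymbol{\lambda}_{j'} + \psi_{j'})}},
  && j \neq j' \\
n_{\mathrm{FA}k} &= pk + p - \tfrac{k(k-1)}{2},
  \qquad n_{\mathrm{US}} = \tfrac{p(p+1)}{2}
\end{aligned}
```

As a rough illustration of the parsimony argument, with p = 83 environments and k = 4 factors these formulas give 83·4 + 83 − 6 = 409 genetic variance parameters for the FA structure, against 83·84/2 = 3,486 for an unstructured matrix. These counts cover only the between-environment genetic covariance, so they need not match the total parameter counts reported for the fitted models.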

The evaluation of genotypes with high overall performance and stability can be done through latent regression graphs. Although these graphs provide valuable information, selecting the best cultivars using this methodology can be labor-intensive, as it requires evaluating an individual regression for each genotype. To overcome these limitations, Smith and Cullis (2018) proposed FA selection tools, which assess the overall performance and stability of each genotype across the entire dataset. Overall performance is derived from the first factor once the loadings have been rotated so that they are positive, in which case this factor corresponds to the main effects of the genotypes. Considering this factor alone, there is no complex G × E interaction, as the ranking of genotypes remains unchanged across environments. The RMSD is used to estimate stability by measuring the deviation of each genotype from the line drawn by the latent regression. In this study, weights were assigned to both parameters since, for this specific dataset, productive performance was deemed more critical than stability. Several studies have achieved genetic gains using MET data by employing FAST for cultivar recommendation (Smith and Cullis, 2018; Tolhurst et al., 2019; Bakare et al., 2022).
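The two FAST summaries described above can be sketched as follows, assuming the usual definitions: OP is the mean first-factor loading multiplied by the genotype's first-factor score, and RMSD is the root mean square, over environments, of the contribution of the remaining factors. The loadings, scores, function names, index form, and weights (0.7 and 0.3) are hypothetical; the weights actually used are not restated in this section.

```python
import numpy as np

def fast_summary(loadings, scores):
    """OP and RMSD per genotype.

    loadings: environments x factors (rotated so first-factor loadings are positive)
    scores:   genotypes x factors
    """
    op = loadings[:, 0].mean() * scores[:, 0]
    # Deviations from the first-factor regression line: genotypes x environments
    deviations = scores[:, 1:] @ loadings[:, 1:].T
    rmsd = np.sqrt((deviations ** 2).mean(axis=1))
    return op, rmsd

def weighted_index(op, rmsd, w_op=0.7, w_stability=0.3):
    """Hypothetical index: reward standardized OP, penalize standardized RMSD."""
    z_op = (op - op.mean()) / op.std(ddof=1)
    z_rmsd = (rmsd - rmsd.mean()) / rmsd.std(ddof=1)
    return w_op * z_op - w_stability * z_rmsd

# Made-up example: 2 environments, 2 factors, 3 genotypes.
loadings = np.array([[1.0, 0.5],
                     [0.8, -0.5]])
scores = np.array([[2.0, 0.0],
                   [1.0, 2.0],
                   [-1.0, -1.0]])

op, rmsd = fast_summary(loadings, scores)
index = weighted_index(op, rmsd)
best = int(np.argmax(index))
```

Here the first genotype has both the highest OP and the lowest RMSD, so it ranks first under any positive pair of weights; with real data the two criteria usually trade off, which is where the choice of weights matters.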

The environmental and altitudinal characteristics of Malawi and Zambia significantly influence local climatic conditions, vegetation distribution, and land use (Supplementary Figure S4). Both countries are situated in high-altitude regions, with Malawi exhibiting altitudes ranging from 500 to 1,500 m, reaching 3,002 m in the Mulanje Mountains (Lancaster, 1980), while Zambia maintains an average altitude between 1,000 and 1,500 m, with Mount Mafinga as its highest peak (2,339 m). These altitudinal variations directly impact temperature regimes, precipitation patterns, and agricultural potential, aligning with previous studies on the influence of topography on African ecosystems. Higher elevations in Malawi are associated with milder temperatures and increased precipitation, which favor diverse vegetation and agricultural systems. In contrast, low-altitude areas, such as regions near Lake Malawi and the Shire Valley, experience warmer and more humid conditions, influencing local biodiversity and crop adaptability. Similarly, Zambia’s elevated plateaus contribute to a moderate climate, reducing temperature extremes and promoting stable precipitation levels (Rawlins and Kalaba, 2020).

The analysis of mega-environments aims to identify target regions or environments with consistent patterns of G×E interaction over multiple years (Yan et al., 2023). When these patterns are stable and repeatable, the target region can be subdivided into sub-regions or mega-environments (Cooper and Hammer, 1996). However, when data are limited to a single year, the mega-environment concept may not be appropriate, as these environments should represent repeatable G×E interaction patterns (Basford and Cooper, 1998). In addition to yield data, incorporating environmental variables such as edaphoclimatic characteristics (elevation, temperature, precipitation, and soil type) can enhance the delineation of mega-environments. These variables provide a more comprehensive understanding of environmental influences on genotype performance, facilitating more precise recommendation strategies for different regions.

In this context, we observed that growing degree days (gdd), mean temperature (tmean), photosynthetically active radiation use efficiency (fue), seasonal precipitation variability (spv), and temperature range (tmrange) were the most important factors influencing soybean grain yield in these environments. In tropical and subtropical regions such as Malawi and Zambia, adequate gdd accumulation is essential to ensure that soybean reaches maturity at the appropriate time. Mean temperature directly affects soybean metabolic rates, and excessively high temperatures can induce heat stress, negatively impacting photosynthesis and grain formation. Light intensity, temperature, and water availability all influence fue. In regions with high solar radiation, such as Malawi and Zambia, soybean has the potential for high fue, provided that other factors, such as water and nutrient availability, are not limiting. Irregular precipitation patterns, including severe droughts, can adversely affect soybean development from germination to grain filling. A moderate temperature range is beneficial for soybean, promoting improved carbon assimilation and balanced growth. Understanding the influence of environmental variables on soybean cultivation and modeling the G×E interaction enables the identification of specific adaptations, assisting breeders in deciding which varieties can have their genetic potential fully exploited (Araújo et al., 2024). Integrating robust statistical models, machine learning techniques (Crossa et al., 2024), and crop growth models (Bustos-Korts et al., 2022) can enhance the accuracy of these recommendations.
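As a small worked example of two of the covariates discussed above, the sketch below accumulates growing degree days and averages the daily temperature range from daily minimum and maximum temperatures. The simple averaging formula and the 10 °C base temperature are assumptions for illustration (a base value often used for soybean); the exact definitions behind gdd and tmrange in this study may differ, and the temperatures are made up.

```python
def growing_degree_days(tmin, tmax, t_base=10.0):
    """Sum of daily (mean temperature - base), with negative days counted as zero."""
    total = 0.0
    for low, high in zip(tmin, tmax):
        total += max(0.0, (low + high) / 2.0 - t_base)
    return total

def mean_temperature_range(tmin, tmax):
    """Average daily difference between maximum and minimum temperature."""
    ranges = [high - low for low, high in zip(tmin, tmax)]
    return sum(ranges) / len(ranges)

# Made-up daily temperatures (degrees Celsius) for three days.
tmin = [18.0, 20.0, 8.0]
tmax = [30.0, 32.0, 10.0]

gdd = growing_degree_days(tmin, tmax)          # 14 + 16 + 0 = 30
tmrange = mean_temperature_range(tmin, tmax)   # (12 + 12 + 2) / 3
```

The third day has a mean of 9 °C, below the assumed base, so it adds nothing to the total.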

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Author contributions

MA: Investigation, Conceptualization, Supervision, Writing – review & editing, Data curation, Methodology, Software, Visualization, Resources, Funding acquisition, Validation, Writing – original draft, Project administration, Formal Analysis. BF: Writing – original draft, Methodology, Writing – review & editing, Formal Analysis. AS: Writing – review & editing, Writing – original draft, Methodology, Formal Analysis. JPP: Formal Analysis, Writing – review & editing, Methodology, Writing – original draft. NL: Methodology, Writing – original draft, Writing – review & editing, Formal Analysis. EL: Resources, Supervision, Conceptualization, Writing – review & editing, Writing – original draft, Data curation. MS: Project administration, Data curation, Conceptualization, Writing – review & editing, Supervision, Writing – original draft. PG: Validation, Writing – review & editing, Project administration, Supervision, Writing – original draft, Funding acquisition, Visualization. GC: Project administration, Supervision, Writing – review & editing, Writing – original draft, Funding acquisition, Resources. BD: Validation, Writing – original draft, Supervision, Writing – review & editing. JBP: Conceptualization, Visualization, Resources, Validation, Data curation, Project administration, Formal Analysis, Methodology, Investigation, Writing – review & editing, Writing – original draft, Software, Supervision, Funding acquisition.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. Mauricio dos Santos Araújo was supported by FAPESP (São Paulo Research Foundation, Grant 2024/01868). We are grateful to the São Paulo Research Foundation (FAPESP). We also express our sincere appreciation to the University of São Paulo and the University of Illinois Urbana-Champaign for their support of this study.

Acknowledgments

We want to thank the coordinators and participants of the United States Agency for International Development (USAID) Feed the Future Soybean Innovation Lab Pan-African Soybean Variety Trials for their valuable contributions in providing the soybean data used in this study. We kindly thank Innocent Vulou Unzimai for the reviews.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1594736/full#supplementary-material

References

Abebe, A. T., Adewumi, A. S., Adebayo, M. A., Shaahu, A., Mushoriwa, H., Alabi, T., et al. (2024). Genotype × environment interaction and yield stability of soybean (Glycine max L.) genotypes in multi-environment trials (METs) in Nigeria. Heliyon 10, e38097. doi: 10.1016/j.heliyon.2024.e38097

Agoyi, E. E., Ahomondji, S. E., Yemadje, P. L., Ayi, S., Ranaivoson, L., Torres, G. M., et al. (2024). Combining AMMI and BLUP analysis to select high-yielding soybean genotypes in Benin. Agron. J. 116, 2109–2128. doi: 10.1002/agj2.21615