A Method for Identifying Environmental Stimuli and Genes Responsible for Genotype-by-Environment Interactions From a Large-Scale Multi-Environment Data Set

It has not been fully understood in real fields what environment stimuli cause the genotype-by-environment (G × E) interactions, when they occur, and what genes react to them. Large-scale multi-environment data sets are attractive data sources for these purposes because they potentially experienced various environmental conditions. Here we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for the G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set that consisted of 25,158 records collected at 52 environments. ECGC illustrated what meteorological factors shaped the G × E interactions in six traits including yield, flowering time, and protein content and when these factors were involved in the interactions. For example, it illustrated the relevance of precipitation around sowing dates and hours of sunshine just before maturity to the interactions observed for yield. Moreover, genome-wide association mapping on the sensitivities to the identified stimuli discovered candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to bring novel insights on the G × E interactions observed in fields.


Methods 1. Overview of historical data
The earliest records of modern breeding of soybean in Japan date back to the 1930s, as far as we could investigate.
In this study, experimental results from 1961 were subjected to analyses because most weather stations of the Japan Meteorological Agency started observations of major meteorological factors in the 1960s. At each breeding centre, experimental results were released annually as booklets until the 2000s; subsequently, the booklet reports were replaced with digital files, including the Excel and Ichitaro formats, the latter of which is a Japanese application for For correspondence with genomic information, experiments for later generations, typically F6 and later, of which seeds were often available, were subjected to digitalisation and integration. Duplicated records were removed by assessing the years of evaluation, variety names, field names and phenotypic values of yield and stem length. The historical data included 72,829 records of 6,106 varieties evaluated at 440 fields over 55 years (from 1961 to 2015).
Note that each record was an average of the evaluation of multiple plants, with typically two or three replications. However, in statistical analyses, these records were treated as measurements on single plants because the number of replications and plants were often unknown, and varied among fields and years if they were known. It is also notable that, throughout this study, these records were often treated as measurements on single plants for ease of explanations (e.g. Figure 1).

Methods 2. Development of mixed models at each environment 2.1. Overview
Single-trait mixed models were fitted to the records from each environment. The purposes were (1) to remove variations due to years and management conditions and to adjust the phenotypic values with them, and (2) to estimate the additive genetic effects at the environment (see Table S2 for the data summary at each environment). Here, plant densities and amount of fertiliser are presented as x_y_z, where, for plant density, x, y and z indicate the interval between rows (cm), the interval between strains in a row (cm) and the number of plants per strain; and for fertiliser, x, y and z indicate the amount of nitrogen, phosphate and potassium (kg/10a), respectively. The amount of fertiliser was sometimes unclear because the amount of additional fertiliser was not always described explicitly; in these cases, the amount was regarded as missing (NA). The sowing dates are presented as the days from April 1 st .
The models for each environment were developed as follows. The base models included year effects as fixed effects and additive genetic effects as random effects, as follows: where Y is a vector of phenotypic values, X is an incidence matrix of year effects, B is the fixed effect of years, Z is a design matrix, U is the additive genetic effect and E is a vector of residual errors. The covariance structure of U was defined using the genome-wide SNPs 1 . When the conditions of a management method (plant density, fertiliser and sowing date) could be regarded as categorical variables, they were modelled as fixed effects, as follows: where M is an incidence matrix of the effects of the management conditions and A is the fixed effect. When the variables could be regarded as continuous, the management conditions were modelled using basis functions, as follows: where F is a matrix of value output from the basis functions and W is a vector of weight of the functions. A row vector of F is given as, where fj is the j th basis function, xi is the input value of row i (e.g. the sowing date of variety i) and n is the number of functions.
The genotype by management (G × M) interactions were also modelled. When the management conditions were categorical, the model was: where H is a design matrix and V is a vector of random effects. V was assumed to follow: where Im is the identity matrix of size m, which is the number of categories, G is the genetic relationship matrix and 2 is the variance component. When the management conditions were continuous, the model was: where Fj is the j th column vector of F, the dot (•) indicates the element-wise product and Qj is a vector of random effects for the j th basis function. Qj was assumed to follow: ~Multivariate Normal( , 2 ).
For each management method (plant density, fertiliser and sowing date), models (2) and (3) were compared using AICs, and it was decided whether the G × M interactions were included. Subsequently, using the chosen modelling scheme for each management method (i.e. with or without the interactions), the best combination of methods (e.g. density + fertiliser or fertiliser + sowing) were searched using AIC. Parameter estimation and AIC calculation were conducted using airemlf90 2 .
Narrow-sense SNP heritability (ℎ 2 ) for Model 1, Model 2A, and Model 2B was estimated as where 2 is the variance component of the additive genetic effect (i.e., additive genetic variance) and 2 is the component of the residual error (i.e., residual variance). For Model 3B, ℎ 2 was estimated as where , is the value of the jth basis function evaluated at point t, and 2 is the variance component of the random effect of the jth basis function. This model was selected only for days to flowering at F19. The continuous management conditions were sowing dates, and the point t was set to 10 th June, a typical sowing date at the experimental field. Because Model 3A was not selected, heritability was not calculated using this model.
The remainder of this section describes this model-selection process for each environment. The abbreviations of traits are SW, seed weight (g/100 seeds); SL, stem length (cm); YI, yield (kg/a); DTF, days to flowering (days); DTM, days to maturity (days); and PR, protein content (%).

F01
The plant density, fertiliser and sowing date consisted of 14 conditions, five conditions and 36 dates ranging from 50 to 112, respectively. The plant density was transformed to the density per unit area, and then modelled using quadratic curves, as follows: The effect of the fertiliser was absorbed into the year effect because some conditions were indistinguishable from the year effects. The sowing dates were modelled using cubic splines 3 . The number of splines and knots were 8 and 12, respectively. Thus, where f indicates the spline function. The first and last four knots were arranged to the earliest and latest sowing dates, respectively, to constrain the span of splines 3 . The remaining knots were arranged between them, at equal intervals.
This method was common to all the environments for which the sowing dates were modelled using splines.
The AIC values obtained for each management method and trait are presented in Table Sp1. For all traits, models that included the G × M interactions were worse than those without the interactions. Moreover, all models with the interactions failed to converge within the default number of iterations of airemlf90 (5,000). As described below, the models with the interactions often suffered from convergence problems, probably because of an insufficient sample size to support the complex models. For F01, therefore, models without the interactions were selected for all traits and both management methods. The inclusion of the plant density and sowing dates without the G × M interactions yielded the lowest AICs in each trait (Table Sp2). Thus, the model including the density and sowing date effects, but excluding the G × M interactions, was selected.

F02
The plant density, fertiliser and sowing date consisted of three conditions, five conditions and 13 dates ranging from 40 to 65, respectively. The plant density and fertiliser were modelled as fixed effects. The sowing dates were modelled using quadratic curves, and not using splines, because the range was narrow.
The AIC values obtained for each management method are presented in Table Sp3. For all management methods, models including the G × M interactions were worse than those without interactions. Thus, the G × M interactions were not considered. Subsequently, the combinations of management methods were searched using AICs (Table Sp4). As a result, the model that included all management methods was chosen. The lowest values are underlined for each management method.

F03
The plant density, fertiliser and sowing date consisted of five conditions, seven conditions and 14 dates ranging from 33 to 85, respectively. The effects of plant density and fertiliser were absorbed into the year effect. The sowing dates were modelled using cubic splines, as described for F01.
AIC values are presented in Table Sp5. For each trait, models including the G × M interactions were better than the models without the interactions. However, as observed in F01, because these models failed to converge within the default number of iterations of airemlf90 (5,000), we avoided the use of the results of the model with the interactions. Instead, the models without the interactions were selected for subsequent analyses, because these models were also better than the base models.

F04_D
The data at F04 included two conditions (dense and sparse) as the plant density; thus, the data were divided into two datasets, F04_D and F04_S, which were defined as different environments. The fertiliser and the sowing date of F04_D consisted of two conditions and 14 dates ranging from 68 to 104, respectively. Although the fertiliser consisted of only two conditions, because these two conditions were simultaneously adopted only in 2006, the effects of fertiliser were modelled as the year effects. The sowing dates were modelled using cubic splines, as described for F01. For each trait, models without the G × M interactions were better than those with the interactions and the base models (Table Sp6). The lowest (best) values are underlined.

F04_S
The fertiliser and the sowing date of F04_S consisted of two conditions and 14 dates ranging from 75 to 114, respectively. As observed for F04_D, although the fertiliser consisted of only two conditions, because these two conditions were simultaneously adopted only in 2007, the effects of fertiliser were modelled as the year effects. The sowing dates were modelled using cubic splines, as described for F01. For each trait, models without the G × M interactions were the best models (Table Sp7). The lowest (best) values are underlined.

F14
The plant density, fertiliser and sowing date consisted of four conditions, eight conditions and 18 dates ranging from 44 to 71, respectively. The plant densities differed in the intervals between the strains. Thus, the intervals were modelled using quadratic curves. The fertiliser was modelled as a fixed effect, whereas the sowing date effects were absorbed into the year effects.
For each management method, models with the G × M interactions were worse than those without the interactions (Table Sp8). Including both the plant density and fertiliser without the G × M interactions yielded the lowest AICs for SL, YI and PR; whereas for SW, DTF and DTM, the inclusion of the fertiliser exclusively, yielded the lowest AICs (Table Sp9). Thus, for the former three traits, the model that included the density and fertiliser effects was selected, whereas for the latter three traits, the model that included the fertiliser effect was selected. The lowest values are underlined for each management method.

F18
The plant density, fertiliser and sowing date consisted of four conditions, three conditions and 8 dates ranging from 46 to 56, respectively. The plant density was added to the models as a fixed effect, whereas the fertiliser and sowing date were treated as the year effects. The models without the G × M interactions yielded the lowest AIC values for all traits (Table Sp10); thus, they were selected for subsequent analyses. The lowest values are underlined for each trait.

F19
The plant density and the fertiliser had only one condition. The sowing dates consisted of 22 dates ranging from 26 to 116. The sowing date was modelled using splines, as described for F01. The models with the G × M interactions yielded the lowest AIC values for all traits but SW (Table Sp11). However, the models did not converge, with the exception of that for DTF. Thus, for DTF, the model with the interactions was selected, whereas for the other traits, models without interactions were selected.

F21
Both the plant density and the fertiliser consisted of eight conditions. The sowing dates consisted of 31 dates ranging from 64 to 119. The sowing date was modelled using splines, as described for F01. The models with the G × M interactions yielded the lowest AIC values for PR and DTF, whereas the models without the interactions yielded the lowest for the remaining traits (Table Sp12). However, because the models with the interactions did not converge, the models without the interactions were selected for all traits.

F27_D
The data at F27 were divided into two sets according to the plant density, i.e. dense (F27_D) and sparse (F27_S). The fertiliser of F27_D consisted of three conditions, which were absorbed into the year effects. The sowing dates, which ranged from 63 to 97, were modelled using cubic splines, as described for F01. For all traits, the models with the sowing date effects but without the G × M interactions yielded the lowest AIC values (Table Sp13). The lowest values are underlined for each trait.

F27_S
The fertiliser of F27_S consisted of three conditions that were absorbed into the year effects. The sowing dates, which ranged from 37 to 78, were modelled using cubic splines. For all traits, the models with the sowing date effects but without the G × M interactions yielded the lowest AIC values (Table Sp14). The lowest values are underlined for each trait.

F29
The plant density and the fertiliser consisted of four conditions. These conditions were indistinguishable from the year effects; thus, their effects were absorbed into the year effects. The sowing date consisted of 25 dates ranging from 66 to 113, and was modelled using cubic splines, as described for F01. The inclusion of the sowing date effects improved the model (Table Sp15), whereas the inclusion of the G × M interactions only improved the model for DTF. Moreover, because the models with the interactions did not converge for the traits, including DTF, the models without the interactions were selected. *Models did not converge.

F30
The plant density, fertiliser and sowing date consisted of nine conditions, six conditions and 27 dates ranging from 56 to 95, respectively. The fertiliser effects were absorbed into the year effects because they were indistinguishable.
The plant density was modelled as a fixed effect, whereas the sowing date effects were modelled using cubic splines, as described in F01.
For each management method (plant density and sowing date), the models with the G × M interactions were worse than those without the interactions ( Table Sp16). The inclusion of both the plant density and the sowing date without the G × M interactions yielded the lowest AICs for all traits (Table Sp17). The lowest values are underlined for each management method.

F34_D
The data at F34 were divided into two data sets according to the plant density, i.e. dense (F34_D) and sparse (F34_S).
The fertiliser consisted of two conditions, which were indistinguishable from the year effects; thus, their effects were absorbed into the year effects. The sowing date consisted of 18 dates ranging from 48 to 103 and was modelled using cubic splines, as described for F01. The models that included the sowing date effects without the G × M interactions were better than the models with the G × M interactions for all traits (Table Sp18). The lowest values are underlined for each trait.

F34_S
The fertiliser consisted of two conditions and its effects were absorbed into the year effects. The sowing date consisted of 15 dates ranging from 48 to 107 and was modelled using cubic splines. The models that included the sowing date effects without the G × M interactions were better than the models with the G × M interactions for all traits (Table   Sp19). The lowest values are underlined for each trait

F41
The plant density, fertiliser and sowing date consisted of five conditions, 13 conditions and 19 dates ranging from 32 to 90, respectively. The plant densities per unit area (m 2 ) were calculated and modelled using quadratic curves. The sowing date was modelled using cubic splines, whereas the fertiliser effects were absorbed into the year effects.
The models with the G × M interactions were suggested for SL of the sowing date (Table Sp20) (Table Sp21).

Methods 3. Estimation of slopes of additive genetic effects on environmental covariates
The estimates of the additive genetic effects of a variety at the 52 environments (30 in the case of PR) were regressed on the environmental covariates and the slopes were estimated using a maximum likelihood approach. This process is written for variety i as: where ̂, is the estimate of the additive genetic effect at environment j which is scaled with the estimate of the additive genetic standard deviation (̂ ), μi is the intercept, xj is the standardized environmental covariate at environment j, βi is the slope and , is the residual error. The residual error follows a normal distribution, where 2 is the variance of the parts of the additive genetic effects that were unexplainable by the environmental covariate (see Methods in the main text) and , 2 is the prediction error variance (i.e., uncertainty of additive genetic effects or posterior variance) which is scaled with ̂.
In this procedure, for stable computation, two assumptions in ECGC were simplified. First, the slope (β ) was treated as a fixed effect and estimated for each variety independently. Second, 2 is assumed to be common to each environment (i.e., 2 ). Without these simplifications, estimation of the slopes was unstable because of high dimensional data (624 varieties and 52 environments). Here, , 2 is a known value and the remaining values (μi, βi and 2 ) are parameters to be estimated. The log likelihood is as follows: The maximum likelihood estimates of these parameters were obtained by iteratively maximising the log likelihood with regard to each parameter. βi was updated as: and μi was updated as: These equations were obtained by setting the first derivatives of the loglikelihood as zero. Because 2 does not have a simple solution using this approach, 2 was updated by maximising the log likelihood directly using the Brent

Methods 4. Inference of the alleles of flowering genes
Because the SNP array used here does not include the alleles of known flowering genes (E1, E2, E3 and E4), we inferred the alleles of these genes for the used varieties from the SNP genotypes. The alleles of the four flowering genes were reported for 63 varieties 1 , and 26 varieties were overlapped with our study. Using these 26 varieties as the reference, the correspondence between the haplotypes of SNPs located around the genes and the reported alleles was investigated. Haplotypes were constructed by gradually expanding the target region in both the 5 and 3 directions of the genes, such that alleles were distinguished by haplotypes.
For E1, 12 haplotypes were constructed from 209 SNPs that were located within the ± 600 kbp region of the gene (Glyma.06G207800, 20207077-20207940 bp on chromosome 6). Among the 12 haplotypes found, one corresponded to the e1-nl allele, one corresponded to the e1-as allele and 10 corresponded to the E1 allele. Although a variety (V619) carrying the E1 allele exhibited the haplotype corresponding to the e1-nl allele, because this variety could not be distinguished from the other varieties with the e1-nl allele even when the span of the haplotypes was expanded further, we assigned the e1-nl allele to this variety. Because the 79 varieties included in our study had haplotypes that were not represented in the reference, we could not determine the alleles of these varieties. For