I.4 screening experimental designs for quantitative trait loci, association mapping, genotype-by environment interaction, and other investigations
- 1 Division of Rare and Manuscript Collections, Cornell University, Ithaca, NY, USA
- 2 Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Mexico DF, Mexico
Crop breeding programs using conventional approaches, as well as new biotechnological tools, rely heavily on data resulting from the evaluation of genotypes in different environmental conditions (agronomic practices, locations, and years). Statistical methods used for designing field and laboratory trials and for analyzing the data originating from those trials need to be accurate and efficient. The statistical analysis of multi-environment trials (METs) is useful for assessing genotype × environment interaction (GEI), mapping quantitative trait loci (QTLs), and studying QTL × environment interaction (QEI). Large populations are required for scientific study of QEI, and for determining the association between molecular markers and quantitative trait variability. Therefore, appropriate control of local variability through efficient experimental design is of key importance. In this chapter we present and explain several classes of augmented designs useful for achieving control of variability and assessing genotype effects in a practical and efficient manner. A popular procedure for unreplicated designs is the one known as “systematically spaced checks.” Augmented designs contain “c” check or standard treatments replicated “r” times, and “n” new treatments or genotypes included once (usually) in the experiment.
Conventional breeding will continue to make significant contributions to efforts to maintain the rate of crop improvement for food production and nutrition in order to meet the increase in human population growth. However, biotechnological methods, such as linkage analysis for detecting quantitative trait loci (QTLs), marker-assisted selection (MAS), association mapping, genomic selection, etc., will also be required. It is of paramount importance that the statistical methods used for designing field and laboratory trials and for analysing the data originating from those trials be accurate and efficient.
Crop breeding programs using conventional approaches, as well as new biotechnological tools, rely heavily on data resulting from the evaluation of genotypes in different environmental conditions (agronomic practices, locations, and years). The incidence of genotype-by-environment interaction (GEI) is a consequence of QTL-by-environment interaction (QEI) and marker effect-by-environment interaction, and this affects conventional breeding as well as MAS and genomic selection breeding strategies. The series of field trials known as multi-environment trials (METs) are vital for: (i) studying the incidence of GEI and assessing the stability of quantitative traits; (ii) mapping QTL and QEI; and (iii) finding associations among molecular markers and quantitative trait variation based on linkage disequilibrium analysis. To detect and quantify the presence of QEI is of vital importance for understanding the genetic architecture of quantitative traits.
All biotechnological methods are based on molecular marker data and phenotypic data. Phenotypic data are vitally important for assessment of the within-environment error structure for each of the trials that will be used later in the MET analysis. The MET statistical analysis is useful for assessing GEI, mapping QTLs, and studying QEI. Large populations are required for scientific study of QEI, and for determining the association between molecular markers and quantitative trait variability. Therefore, appropriate control of local variability through efficient experimental design is of key importance.
Spatial variability in the field is a universal phenomenon that affects the detection of differences among treatments in agricultural experiments by inflating the estimated experimental error variance. Researchers wishing to conduct field trials are faced with this dilemma. They tackle the problem by using an appropriate statistical design and layout for the experiment, and by using suitable methods for statistical analysis. A priori control of local variability in each testing environment is usually determined from the experimental design used to accommodate the genotypes to the experimental units. However, a posteriori control of the residual effect based on a model that provides a good fit to the data can effectively complement the control of local variability provided by the experimental design (see e.g., Federer, 2003a). Recently, efficient experimental designs (both unreplicated and replicated) have been developed, assuming that observations are not independent in that contiguous plots in the field may be spatially correlated (Martin et al., 2004; Cullis et al., 2006).
Commonly, field trials used for linkage analyses or association mapping analyses involve 200 or more genotypes. These may consist of individuals from segregating F2 and F3 populations, recombinant inbred lines (RILs), accessions from a genebank, advanced breeding cultivars, or individuals from any segregating population. Usually, QTL mapping is done on large numbers of genotypes (500 or more) in as many locations or conditions as possible, for estimating QEI and examining the stable or unstable part of the chromosome that influences the trait under study. Thus, seed availability and land and labor costs are crucial factors to be considered when establishing METs for QTL and QEI analyses, and association mapping.
The class of augmented designs is especially useful for achieving control of variability and assessing genotype effects in a practical and efficient manner. In the early stages of a breeding program, a plant breeder is faced with evaluating the performance of large numbers of genotypes. Frequently, the seed supply is limited, but even if it is not, the large number of genotypes can necessitate using a single experimental unit per genotype.
A popular procedure for unreplicated designs is the one known as “systematically spaced checks.” In this procedure, a standard check genotype is systematically spaced every certain number of experimental units. Several statistical procedures have been devised over the years to compare the yield of a new genotype with the standard variety. This procedure can require an inordinate amount of space, labor, and other resources devoted to check plots of a single standard genotype. Yates (1936) has shown that the number of check plots should be of the order of the square root of the number of (new genotype) test plots. In conducting METs, Sprague and Federer (1951) have shown that a cost-efficient procedure for maximizing genetic advancement involves using two replicates at each location for single crosses of maize, three replicates for top crosses, and four replicates for double crosses.
A third class of procedure used in the screening of genotypes for yield and other characteristics is that of “augmented experimental designs.” These designs contain c check or standard treatments replicated r times, and n new treatments or genotypes included once (usually) in the experiment. Some of the c checks could be promising new genotypes (treatments) in the final stages of testing. Any standard experimental design may be used for the check treatments and then the block sizes or the number of rows and columns are increased to accommodate the new treatments. This class of design has several desirable qualities, including the following:
1. The checks can be of any kind and of any number c.
2. The number of new entries can be any number n.
3. The new treatments can be considered as random or as fixed effects.
4. Survivors in the final stages of screening may be used as checks along with some standard checks. The dual use of these genotypes as checks and as their final evaluation is an efficient use of resources.
5. Some of the designs in this class allow for screening when other factors are present, thereby revealing genotype-by-factor interactions.
6. Non-contenders can be discarded prior to harvest, since they do not affect computation of blocking effects and variances.
Various augmented experimental designs are discussed in the following sections. These are augmented block (Federer, 1956, 1961), augmented row–column (Federer and Raghavarao, 1975; Federer et al., 1975), augmented resolvable row–column (Federer, 2002), augmented split plot (Federer, 2005b), and augmented split block (Federer, 2005a).
When the field layout is in a row–column formation, either for the entire experiment or within each complete block, an experimental design can be developed that controls variability in two directions for any number of genotypes and replicates. The row–column experimental designs have two block components, i.e., blocks in rows and blocks in columns. When the entire experiment is laid out in a row–column arrangement, the “latinised” designs assure that entries do not occur more than once in a row or a column of the experiment. Also, neighbor restricted designs restrict randomization of entries in such a way that certain groups of entries do not occur together, so that genotypic interference due to different maturity or plant height can be avoided.
Analysis of designed, spatially laid out experiments needs to take account of the design restrictions encountered. The actual spatial variation that occurs during the course of conducting field experiments may not be taken into account in the experimental design or in the standard statistical analysis selected before the experiment was conducted. Hence, to achieve appropriate statistical analysis for the data obtained from the experiment, it is necessary to determine the type and nature of the spatial variation present in the experiment. This often means selecting from a family of plausible statistical analyses. Federer (2003a) presented a number of methods useful for “exploratory model selection,” to account for the variation that is present in the results of an experiment rather than what the variation pattern was expected to be. He used various forms of trend analysis on a variety of examples to determine the model that explained the variation present in each experiment. Several publications have been written using various forms of trend analysis for a variety of situations (Wolfinger et al., 1997; Federer, 2002, 2003a,b; Federer and Wolfinger, 2003).
Augmented Block Experimental Designs
Augmented block experimental designs fall into two categories, complete blocks and incomplete blocks for the check genotypes or treatments. A randomized complete block design (RCBD), with r replicates or blocks, is used for the c check genotypes to start the construction of an augmented randomized block. Then, the r blocks are expanded to include the c checks plus n/r new genotypes in each block. If n is not a multiple of r, then fewer or more new genotypes would appear in some of the blocks. The c checks and n/r new genotypes are randomly allotted to the experimental units (plots) in each block. Genotype numbers are randomly assigned to the new genotypes, but this is not necessary in the early stages of screening since each new genotype is a random event in itself. To illustrate an augmented RCBD, let c = 3 checks, r = 4 blocks, and n = 13 new genotypes. A plan is:
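A plan of this kind can be generated programmatically. The following Python sketch is a hedged illustration of the construction described above, not the authors' own procedure: the function name `augmented_rcbd`, the check labels A, B, and C, and the fixed seed are assumptions introduced here. It places every check in every block, spreads the new genotypes as evenly as possible, and randomizes the order within each block.

```python
import random

def augmented_rcbd(checks, new_entries, n_blocks, seed=None):
    """Return one randomized block list per block (hypothetical helper)."""
    rng = random.Random(seed)
    new = list(new_entries)
    rng.shuffle(new)  # random assignment of new genotypes to blocks
    base, extra = divmod(len(new), n_blocks)
    plan, start = [], 0
    for b in range(n_blocks):
        # When n is not a multiple of r, the first `extra` blocks get one more
        size = base + (1 if b < extra else 0)
        block = list(checks) + new[start:start + size]
        start += size
        rng.shuffle(block)  # random allotment to plots within the block
        plan.append(block)
    return plan

# c = 3 checks, r = 4 blocks, n = 13 new genotypes
plan = augmented_rcbd(["A", "B", "C"], range(1, 14), n_blocks=4, seed=1)
for i, block in enumerate(plan, start=1):
    print(f"Block {i}:", " ".join(str(e) for e in block))
```

With these parameters one block holds seven entries and the other three hold six, for 25 plots in total.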
A partitioning of the degrees of freedom in an analysis of variance (ANOVA) table for this design is:
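The degrees of freedom can be worked out directly from c, r, and n. The Python sketch below assumes the usual partition for an augmented RCBD, in which only the replicated checks contribute to the error term, giving (r − 1)(c − 1) error degrees of freedom. The function name and the row labels are illustrative assumptions.

```python
def augmented_rcbd_df(c, r, n):
    """Degrees of freedom for an augmented RCBD (illustrative sketch)."""
    return {
        "total": r * c + n,           # one plot per check per block, plus n new
        "mean": 1,
        "blocks": r - 1,
        "entries": c + n - 1,         # checks and new genotypes together
        "entries: checks": c - 1,
        "entries: new": n - 1,
        "entries: checks vs new": 1,
        "error": (r - 1) * (c - 1),   # from the replicated checks only
    }

df = augmented_rcbd_df(c=3, r=4, n=13)
for source, value in df.items():
    print(f"{source:24s}{value:4d}")
```

For c = 3, r = 4, and n = 13 this gives 25 total, 3 for blocks, 15 for entries, and 6 for error.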
In the first stage of screening, there may be a very large number of new genotypes with n of 8,000, 30,000, or even over 100,000. In these cases, the block size may become larger than is considered necessary to retain relative homogeneity within each block. The class of experimental designs known as an “incomplete block design” (ICBD) can then be used. The incomplete blocks of an ICBD may or may not be grouped into complete blocks, i.e., the design may or may not be resolvable. An appropriate ICBD for c checks, r replicates of the checks, incomplete blocks of size k, s incomplete blocks within a complete block, and b incomplete blocks is selected for the check genotypes. Then the b incomplete block sizes are increased to include n/b new genotypes in each incomplete block. To illustrate, let c = 15 checks arranged in r = 5 replicates and b = rs = 25 incomplete blocks of size k = 3. Let n = 300 new genotypes, and then n/b = 300/25 = 12. By enlarging the 25 incomplete blocks from k = 3 to k = 15 to accommodate 3 + 12 = 15 experimental units, the 300 new genotypes can be put into these 25 incomplete blocks. The 12 new genotypes and the three checks are randomly allotted to the 15 experimental units in each of the 25 incomplete blocks. The blocks of genotypes are randomly allotted to the incomplete blocks in the field layout. The 15 check genotypes may, for example, be two standard genotypes and 13 promising and surviving new genotypes from previous screening cycles.
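The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch assuming a resolvable design in which c is a multiple of k and n is a multiple of b. The function name `augmented_icbd_sizes` and the dictionary keys are hypothetical.

```python
def augmented_icbd_sizes(c, r, k, n):
    """Block counts and sizes for an augmented resolvable ICBD (sketch)."""
    if c % k:
        raise ValueError("c must be a multiple of k for this sketch")
    s = c // k                 # incomplete blocks per complete block
    b = r * s                  # total number of incomplete blocks
    if n % b:
        raise ValueError("n must be a multiple of b for this sketch")
    new_per_block = n // b
    enlarged = k + new_per_block
    return {
        "s": s,
        "b": b,
        "new_per_block": new_per_block,
        "enlarged_block_size": enlarged,
        "total_plots": b * enlarged,
    }

print(augmented_icbd_sizes(c=15, r=5, k=3, n=300))
```

For the example in the text this returns s = 5, b = 25, 12 new genotypes per incomplete block, an enlarged block size of 15, and 375 plots in all.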
A randomized form of an ICBD may be obtained from a software toolkit such as Gendex (2009). Using the parameters k = 3 + n/b = 15 (the enlarged incomplete block size), v = c + n/r = 75, and r = 5, a randomized form of an ICBD is obtained. Then the n/r numbers for v that appear in an incomplete block are replaced by genotype numbers to accommodate the n = 300 new genotypes, but retaining k of the check treatments in each incomplete block according to the plan for checks only.
A partitioning of the degrees of freedom in an ANOVA table for the above example is:
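A sketch of how such a partition can be computed is given below in Python. It assumes a resolvable design analysed with replicates, incomplete blocks within replicates, and entries, with the error obtained by subtraction. The function name and row labels are assumptions made for illustration.

```python
def augmented_icbd_df(c, r, k, n):
    """Degrees of freedom for an augmented resolvable ICBD (sketch)."""
    s = c // k                           # incomplete blocks per replicate
    df = {
        "total": r * c + n,              # check plots plus unreplicated new entries
        "mean": 1,
        "replicates": r - 1,
        "blocks within replicates": r * (s - 1),
        "entries": c + n - 1,
    }
    # Error is what remains once all other lines are removed from the total
    df["error"] = df["total"] - sum(v for key, v in df.items() if key != "total")
    return df

for source, value in augmented_icbd_df(c=15, r=5, k=3, n=300).items():
    print(f"{source:28s}{value:4d}")
```

With c = 15, r = 5, k = 3, and n = 300 the lines are 375 total, 4 for replicates, 20 for blocks within replicates, 314 for entries, and 36 for error. The error line equals that of the check-only ICBD, because the unreplicated new genotypes add no error degrees of freedom.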
When the new genotypes are unreplicated, they do not contribute to the estimation of the block and error variances and the estimation of the block effects (Federer and Raghavarao, 1975). Only the replicated check treatments do this. Computer codes for analysing the results from augmented block designs have been given by Wolfinger et al. (1997) and Federer (2003a).
Augmented Complete Block Design for a QTL Mapping Study
A typical QTL experiment in maize consists of F2 plants obtained from the cross of two maize inbred lines referred to as parent 1 (P1) and parent 2 (P2). Subsequently, the F2 plants can be selfed to produce, say, 900 independent F5 lines. These 900 new entries (RILs) will be genotyped with molecular markers, and the genetic data and the respective phenotypic data will be used for QTL and QEI mapping. These lines may be crossed to an inbred tester from an opposite heterotic group to obtain testcross seeds. The check entries may include the parents P1 and P2, the F1 from the cross P1 × P2, and two other checks (check1 and check2) the breeder wishes to include. One possible augmented complete block design (CBD) may consist of 20 blocks of 45 new entries each, augmented by P1, P2, F1, check1, and check2. Thus, each block comprises a total of 50 entries (45 new entries comprising testcross F5 lines and five check entries that are repeated in every block). The same or a different group of test lines in each block can be used in all the sites where the experiment is planted, but with a different randomization of the blocks. In this case, the augmented RCBD has c = 5 checks (P1, P2, F1, check1, and check2), r = 20 blocks, and n = 900 new genotypes. A possible plan is:
The distribution of the repeated checks in the field should avoid, as much as possible, appearance of the same replicated check more than once in the same row or column. This latinised augmented CBD may help to reduce bias due to unexpected soil trends running across columns or rows.
A partitioning of the degrees of freedom in an ANOVA table for this design in each site is:
Supposing that the trial were established in three different sites, then the partition of the degrees of freedom in the ANOVA table would be as follows:
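The per-site and across-site partitions can be sketched in Python as follows. The sketch assumes that the same checks and new entries are grown at every site, that entry-by-site interaction is fitted, and that the pooled error comes only from the replicated checks within each site. The function name and row labels are assumptions, not taken from the original tables.

```python
def augmented_cbd_df(c, r, n, sites=1):
    """Degrees of freedom for an augmented CBD over one or more sites (sketch)."""
    entries = c + n - 1
    return {
        "total": sites * (r * c + n),
        "mean": 1,
        "sites": sites - 1,
        "blocks within sites": sites * (r - 1),
        "entries": entries,
        "entries x sites": entries * (sites - 1),
        "error": sites * (r - 1) * (c - 1),   # pooled over sites, checks only
    }

for n_sites in (1, 3):
    print(f"sites = {n_sites}")
    for source, value in augmented_cbd_df(c=5, r=20, n=900, sites=n_sites).items():
        print(f"  {source:22s}{value:5d}")
```

For a single site with c = 5, r = 20, and n = 900 this gives 1,000 plots, 19 degrees of freedom for blocks, 904 for entries, and 76 for error. For three sites the total is 3,000, with 2 for sites, 57 for blocks within sites, 904 for entries, 1,808 for entries × sites, and 228 for error.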
Augmented Incomplete-Complete Block Design for an Association Mapping Study
This example supposes that 200 diverse bread wheat accessions from a genebank are to be used for an association mapping study. The accessions will be used to examine the possible relationship between various phenotypic traits (such as grain yield, resistance to leaf and yellow rust, bread making quality, protein content, etc.) and the molecular markers located along the seven chromosomes of the three genomes of wheat (A, B, and D). Ten sites with contrasting environmental conditions would be used to allow good discrimination of the 200 accessions. Differential environmental conditions must be used in order to obtain a good discrimination for resistance to different potential rust pathogens as well as for the other traits.
It is assumed that c = 15 checks can be arranged in r = 5 replicates and b = 25 incomplete blocks of size k = 3 are formed. The 200 accessions can be accommodated in 25 incomplete blocks of size 11 by enlarging the incomplete blocks from k = 3 to k = 11 by adding n/b = 200/25 = 8 new entries in each incomplete block.
The ANOVA table of the combined analysis across ten environments is:
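A hedged Python sketch of such a combined partition follows. It assumes the same 15 checks and 200 accessions in every environment, a resolvable ICBD within each environment, an entry-by-environment term, and an error term obtained by subtraction. The function name and row labels are illustrative assumptions.

```python
def combined_augmented_icbd_df(c, r, k, n, envs):
    """Degrees of freedom for an augmented ICBD combined over environments."""
    s = c // k                             # incomplete blocks per replicate
    entries = c + n - 1
    df = {
        "total": envs * (r * c + n),
        "mean": 1,
        "environments": envs - 1,
        "replicates within environments": envs * (r - 1),
        "blocks within replicates": envs * r * (s - 1),
        "entries": entries,
        "entries x environments": entries * (envs - 1),
    }
    # Pooled error: whatever remains after all other lines are removed
    df["error"] = df["total"] - sum(v for key, v in df.items() if key != "total")
    return df

for source, value in combined_augmented_icbd_df(15, 5, 3, 200, envs=10).items():
    print(f"{source:32s}{value:5d}")
```

Each environment has 5 × 15 + 200 = 275 plots, so the combined total is 2,750. Under these assumptions the pooled error has 360 degrees of freedom, i.e., 36 per environment.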
Augmented Row–Column Experimental Designs
Augmented row–column designs can be constructed either by adding rows and/or columns or by enlarging the intersections of the rows and columns of a square or rectangle. Considering the latter option, a 5 × 5 Latin square can be used for five checks A, B, C, D, and E, augmented with 250 new genotypes, adding 10 new genotypes to each row–column intersection as follows to obtain the schematic plan before randomization:
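A schematic plan of this type can be built as in the Python sketch below. It uses a cyclic Latin square for the checks, which is only one of many possible squares, and numbers the new genotypes consecutively across the intersections. The function name `augmented_latin_square` is hypothetical, and randomization is deliberately left out because the plan is the one before randomization.

```python
def augmented_latin_square(checks, new_per_cell):
    """Unrandomized augmented Latin square: one check plus new entries per cell."""
    t = len(checks)
    plan, next_new = [], 1
    for i in range(t):
        row = []
        for j in range(t):
            check = checks[(i + j) % t]      # cyclic Latin square for the checks
            new = list(range(next_new, next_new + new_per_cell))
            next_new += new_per_cell
            row.append([check] + new)        # 1 check + new_per_cell new entries
        plan.append(row)
    return plan

plan = augmented_latin_square(list("ABCDE"), new_per_cell=10)
for row in plan:
    # Show each intersection as check plus the range of new genotype numbers
    print("  ".join(f"{cell[0]}:{cell[1]}-{cell[-1]}" for cell in row))
```

Each of the 25 intersections holds 11 entries, so the 250 new genotypes and the five checks occupy 275 plots.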
A randomization plan would be obtained for the Latin square and then the 11 entries in each row–column intersection would be randomly allotted to the 11 experimental units in each intersection. The new genotypes are randomly assigned to the numbers 1–250. A partitioning of the degrees of freedom in an ANOVA table is:
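The degrees of freedom for this arrangement can be computed as in the Python sketch below. It assumes a t × t Latin square of checks with m new genotypes added to each intersection, and an error term equal to the usual Latin square error for the checks, (t − 1)(t − 2). The function name and row labels are assumptions.

```python
def augmented_latin_square_df(t, m):
    """Degrees of freedom for a t x t Latin square with m new entries per cell."""
    n = t * t * m                       # number of unreplicated new genotypes
    return {
        "total": t * t * (1 + m),
        "mean": 1,
        "rows": t - 1,
        "columns": t - 1,
        "entries": t + n - 1,           # checks and new genotypes together
        "error": (t - 1) * (t - 2),     # from the checks only
    }

for source, value in augmented_latin_square_df(t=5, m=10).items():
    print(f"{source:10s}{value:4d}")
```

For t = 5 and m = 10 the lines are 275 total, 4 for rows, 4 for columns, 254 for entries, and 12 for error.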
An alternative row–column plan would be to set up a 25 row by 15 column rectangle as shown below.
If the variation in rows and in columns can be explained by linear, quadratic, and perhaps cubic trends and their interactions, then two checks would have been sufficient to obtain row and column solutions to adjust the new treatments, and 325 new treatments could have been included. An equal number of rows and columns results in the minimum number of check genotypes. For example, using a 20 × 20 square, 40 plots could be allocated to two check genotypes and 360 to new genotypes. There still would be more than 20 degrees of freedom associated with the error mean square. Another scenario supposes that one standard check genotype and four promising new genotypes in the final stage of evaluation are used. Utilizing new genotypes in their final stage of testing allows dual use of the results and efficient experimentation, eliminating the inclusion of too many check plots.
A randomization plan would involve randomly allocating the rows and columns in the above plan to the rows and columns in the experimental area, randomly assigning the letters A–E to the checks, and randomly allotting the numbers 1–250 to the new genotypes. A partitioning of the degrees of freedom in an ANOVA table is:
Federer et al. (1975) discuss a number of other arrangements including one used by Dr. A. Mangelsdorf. The Mangelsdorf design has a nice balanced property and was used for METs.
The first plan given above within this section is row–column–check connected in that solutions are obtainable for all effects. The plan immediately above is row–check connected and column–check connected but is not row–column–check connected. This means that functions of the column effects, such as linear, quadratic, cubic, etc., regressions are used in the analysis of such designs. In order to have a plan that is row–column–check connected, two of the transversals of the square or rectangle need to be adjacent to each other, a feature that an experimenter may consider as undesirable. Computer codes illustrating this type of analysis are given by Federer (2003b), Federer and Wolfinger (2003), and Wolfinger et al. (1997).
Augmented Resolvable Row–Column Experimental Designs
Experimental designs such as a lattice square or a lattice rectangle may be used to construct augmented lattice square and augmented lattice rectangle plans (Federer, 2002, 2003b). For such plans, row blocking and column blocking are included in each complete block, thus making the design resolvable. Since the proportion of experimental units devoted to checks is smaller in an augmented lattice square, this is the plan that will be illustrated. There are k × k experimental units in each complete block, and 2k, 3k, etc., check genotypes may be used. To construct such a plan, a lattice square plan is obtained first for v = k² treatments. The complete blocks where treatments 1 to k and k + 1 to 2k appear together in a row or in a column are deleted. For 2k check genotypes, treatments 2k + 1, 2k + 2, …, k² are deleted in each of the r blocks. The rk(k – 2) new treatments are inserted into the deleted treatment spaces of the lattice square. To illustrate, with k = 7 and r = 7, a plan would be as shown below.
The symbol × indicates where one of the rk(k – 2) = 245 new genotypes would be entered. Row linear and quadratic effects and column linear and quadratic effects can be estimated (Federer, 2002). Checks 1–7 appear once with checks 8–14 in rows and in columns, but do not appear with each other. The diagonal elements need not be adjacent, as illustrated below.
A partitioning of the degrees of freedom in an ANOVA is:
To screen 30,000 new genotypes, k would be 33 and k = r = 33 replicates would be required. As stated earlier, the 2k = 66 checks could consist of two standard checks plus 64 new genotypes in their final stage of testing.
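The capacity of an augmented lattice square follows from the count rk(k – 2) of new treatments. The short Python sketch below tabulates the plot counts for the k = r = 7 illustration and for k = r = 33. The function name and dictionary keys are assumptions made for this example.

```python
def augmented_lattice_square_counts(k, r):
    """Plot counts for an augmented lattice square with 2k checks (sketch)."""
    return {
        "checks": 2 * k,
        "plots_per_block": k * k,
        "new_per_block": k * (k - 2),     # k*k plots minus 2k check plots
        "new_total": r * k * (k - 2),
        "total_plots": r * k * k,
    }

for k in (7, 33):
    print(k, augmented_lattice_square_counts(k, r=k))
```

With k = r = 7 there is room for 245 new genotypes in 343 plots. With k = r = 33 there is room for 33,759 new genotypes, which covers the 30,000 required.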
As an alternative design in this class, the checks could be in a lattice square experimental design. Then, each of the row–column intersections within each complete block could be enlarged to include the desired number of new genotypes.
Augmented Split Plot Experimental Designs
In order to compare the effect of environments and management procedures on new genotypes, the class of augmented split plot experimental designs has been proposed by Federer (2005b). The effects of factors such as tillage, fertilizers, insecticides, irrigation, planting density, date of planting, etc., on new genotypes could be assessed. The effect of the date of planting is often confused with site-to-site effects. The new genotypes to be assessed may appear in split plot treatments or in whole plot treatments. New genotypes can be tested for several factors at a time by using augmented split-split plot, split-split-split plot, etc., designs. These designs allow for genotype-by-factor interactions and GEI, and are useful, especially in the final stages of screening genotypes. A schematic plan of a design is shown below for four whole plots, such as tillage practices, with three checks (20, 21, and 22) and 19 new genotypes as the 7 or 8 split plot treatments, and r = 4 blocks or replicates of check genotypes.
There are seven split plot treatments in Block 4 and eight in the other three blocks. The checks are given the highest numbers because SAS software subtracts the highest numbered effect from all the others for the estimated effects, and gives a standard error of a difference between an estimated effect of a genotype and the highest numbered one, rather than a standard error of an effect as indicated. It is usually more desirable to compare all new genotypes with a check, rather than compare all entries with a new genotype. The usual randomization procedure for a split plot experimental design would be used.
A partitioning of the degrees of freedom in an ANOVA would be:
Codes for analysing data for this design and others in this class are given by Federer (2005b).
Augmented Split Block Experimental Designs
Augmented split block experimental designs are another class of augmented experimental design for assessing the effects of various factors on new genotypes, as described by Federer (2005a), who discussed five different examples of this class and presented a numerical example and code for analysis of the data. New genotypes may be considered to be random or fixed effects. One of the cases considered is an intercropping example for two crops with new genotypes for both crops. Allowing for interaction of factors with genotypes is an important aspect of this class of design. To illustrate one design within this class, an augmented randomized block experimental design is used for c = 3 checks (A, B, C), n = 25 new genotypes (1–25), and r = 4 blocks. Then, d = 4 dates of planting (D1, D2, D3, D4) are strip blocked across the entries in each of the four blocks. This is illustrated in the schematic layout below.
The date treatments are in an RCBD and the checks and new genotypes are in an augmented randomized block experimental design. The date experimental units are distributed across all the genotype entries in a block.
A possible partitioning of the degrees of freedom in an ANOVA table is:
In the early stages of a plant breeding program, expected genetic gains may be increased by screening a large number of genotypes rather than making more precise comparisons among fewer genotypes. This makes it necessary to evaluate many entries for which there may not be sufficient seed for replication. For this reason Federer proposed augmented designs, in which a set of check entries is replicated an equal (or unequal) number of times in a specified field design and an additional set of new test entries is included in the experiment only once. In this review we show different types of augmented complete block and incomplete block designs for the check treatments, with the test entries being added or “augmented” to the blocks.
This approach provides a very efficient means of screening test entries and has a considerable amount of flexibility. An augmented ICBD might be preferred over an augmented CBD when the number of repeated checks is large. When soil variability runs in two directions, augmented row–column designs should be a good alternative, and when the experiment is “latinised” so that entries do not occur more than once in a row or column, precision increases further. The augmented incomplete block and/or row–column designs can be used for association mapping and/or genomic selection where a large number of entries (usually more than 1,000) are needed but cannot be planted in all possible environments. These augmented designs are especially advantageous when soil heterogeneity increases owing to limiting factors such as low water and nitrogen availability in the field.
There are many variations of split plot and split block experimental designs. Federer and King (2007) discuss several of these variations as well as combinations of the designs. Experimenters may find some of these variations suitable for augmenting with new genotypes that will fit the conditions for their experiment. Such designs as given in the last two sections above allow the experimenter to obtain interactions of new genotypes with a variety of factors. Instead of a single factor, a factorial combination of several factors could be used. For example, instead of date only, a factorial arrangement of date, fertilizer level, and insecticide could be used. Considerable flexibility is possible through the use of augmented experimental designs.
When it is advisable to use an augmented design, it may be used at several sites. For example, the Mangelsdorf design presented by Federer et al. (1975) was used at several sites in Brazil. Methods for combining results over sites have been described by Federer et al. (2001), and they even allow for different designs at the different sites.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Sadly, Professor Walter T. Federer, the lead author of this chapter, passed away in April 2008. He was one of the greatest statisticians on the theme of experimental design for plant breeding, agronomy, and agriculture in general. Professor Federer was a unique, enthusiastic human being who was always ready to discuss serious scientific issues without losing his unique character of extreme kindness and gentlemanliness. I have the privilege to say that he was my friend.
Federer, W. T. (2003b). “Analysis for an experiment designed as an augmented lattice square design,” in Handbook of Formulas and Software for Plant Geneticists and Breeders, ed. M. S. Kang (Binghamton, NY: Food Products Press), 283–289.
Federer, W. T., and Wolfinger, R. D. (2003). “Augmented row–column design and trend analyses,” in Handbook of Formulas and Software for Plant Geneticists and Breeders, ed. M. S. Kang (Binghamton, NY: Food Products Press), 291–295.
Gendex. (2009). Gendex DOE Toolkit. Available at: http://designcomputing.net/gendex/
Keywords: multi-environment trials, augmented experimental designs, genotype × environment interaction, quantitative trait loci (QTL)
Citation: Federer WT and Crossa J (2012) I.4 screening experimental designs for quantitative trait loci, association mapping, genotype-by environment interaction, and other investigations. Front. Physio. 3:156. doi: 10.3389/fphys.2012.00156
Received: 15 March 2012; Accepted: 03 May 2012; Published online: 01 June 2012.
Edited by: Jean-Marcel Ribaut, Generation Challenge Programme, Mexico
Reviewed by: Shan Lu, Nanjing University, China
Stanislav Kopriva, John Innes Centre, UK
Uener Kolukisaoglu, University of Tuebingen, Germany
Copyright: © 2012 Federer and Crossa. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.
*Correspondence: José Crossa, Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600 Mexico DF, Mexico. e-mail: email@example.com