Two-phase and family-based designs for next-generation sequencing studies

The cost of next-generation sequencing is now approaching that of early GWAS panels, but is still out of reach for large epidemiologic studies and the millions of rare variants expected poses challenges for distinguishing causal from non-causal variants. We review two types of designs for sequencing studies: two-phase designs for targeted follow-up of genomewide association studies using unrelated individuals; and family-based designs exploiting co-segregation for prioritizing variants and genes. Two-phase designs subsample subjects for sequencing from a larger case-control study jointly on the basis of their disease and carrier status; the discovered variants are then tested for association in the parent study. The analysis combines the full sequence data from the substudy with the more limited SNP data from the main study. We discuss various methods for selecting this subset of variants and describe the expected yield of true positive associations in the context of an on-going study of second breast cancers following radiotherapy. While the sharing of variants within families means that family-based designs are less efficient for discovery than sequencing unrelated individuals, the ability to exploit co-segregation of variants with disease within families helps distinguish causal from non-causal ones. Furthermore, by enriching for family history, the yield of causal variants can be improved and use of identity-by-descent information improves imputation of genotypes for other family members. We compare the relative efficiency of these designs with those using unrelated individuals for discovering and prioritizing variants or genes for testing association in larger studies. While associations can be tested with single variants, power is low for rare ones. Recent generalizations of burden or kernel tests for gene-level associations to family-based data are appealing. These approaches are illustrated in the context of a family-based study of colorectal cancer.


INTRODUCTION
In the early days of genomewide association studies (GWAS), the cost of commercial high-density genotyping panels was prohibitive for large-scale epidemiologic studies needed to detect the modest relative risks (RRs) now known to be associated with most common variants for complex diseases . Hence, investigators turned to multi-stage designs, in which only a sample of subjects were genotyped on such platforms and a generous selection of the most significant associations were then tested on an independent sample using custom genotyping techniques. The final analysis typically combined the data from both stages, with a final significance level chosen to ensure genomewide significance after allowing for the number of variants tested in the second. The basic principles were based on a series of papers written before the GWAS era (Satagopan et al., 2002(Satagopan et al., , 2004Satagopan and Elston, 2003), and subsequent work showed how to optimize the allocation of sample size and first-stage critical values in the GWAS context (Wang et al., 2006;Skol et al., 2007). In particular, Skol et al. (2006) showed that this joint analysis was more powerful than treating the design as discovery followed by independent replication, despite various high-profile journals' requirements for an independent replication study (Panagiotou et al., 2012). Although this became the conventional GWAS design throughout the first decade of the 21st century, rapidly declining costs of commercial GWAS chips have made it feasible for many studies to obtain genome-wide coverage on all available subjects in a single stage (Thomas et al., 2009a,b). The cost of custom genotyping for large numbers of hand-picked SNPs was often comparable to standard high-density panels, and having more subjects with genome-wide data allowed for more informative analysis of interactions, subgroups, pleiotropic effects, etc. For a general review of multi-stage designs in genetics, see (Elston et al., 2007).
As we entered the "post-GWAS" era, the focus began to shift toward rare variants and the use of next-generation sequencing (NGS) technologies that could in principle (given a large enough sample size and deep enough sequencing) uncover all the genetic variation in a region, not just the common SNPs that have been used to tag the unknown causal variants. In part, this interest stemmed from the increasing recognition that common variants were accounting for only a relatively small proportion of the total heritability of most complex diseases (Manolio et al., 2009;Schork et al., 2009). Amongst other possible explanations for the "missing heritability," rare variants have been proposed, based on an evolutionary argument (Gorlov et al., 2011) or empirical evidence (Bodmer and Bonilla, 2008) that their effect sizes could be larger, although recent whole-exome sequencing studies have cast some doubt on this hypothesis (e.g., Heinzen et al., 2012). Furthermore, since rare variants tend not to be well tagged by common ones (Duan et al., 2013), use of conventional GWAS panels would tend to miss associations with rare variants. Cost currently precludes application of NGS to whole-genome sequencing on a large scale, so clever study design has again become important (Thomas et al., 2009a,b). One of the first uses of NGS was for targeted follow-up of GWAS hits, for which an alternative to two-stage designs, known as two-phase designs, is a natural choice. These differ from the two-stage designs described above in that the set of subjects chosen for expensive data collection (e.g., NGS) are a proper subset of a larger epidemiologic study rather than an independent sample and that this subset is selected on the basis of information already available on the full study (Whittemore and Halpern, 1997;Thomas et al., 2004;Yang and Thomas, 2011). In the case of NGS, this could involve stratification jointly on disease status and carrier status of the associated variant(s). While this would tend to induce a spurious association between any variants in LD with the GWAS SNPs and disease even under the null hypothesis that they are not causal, this bias can be avoided by adjusting for the sampling fractions, and additional information available in the full study can also be incorporated. The basic principles were developed in a series of seminal papers by Norman Breslow with various colleagues (see Breslow and Holubkov, 1997b;Breslow and Chatterjee, 1999;Scott et al., 2007;Breslow et al., 2009b, for summaries of this work). Recently, Schaid et al. (2013a) has provided an excellent discussion of the use of this approach for targeted follow-up of GWAS hits by NGS. However, for whole genome or whole exome sequencing studies, there would be no point in selecting individuals based on whether they carried a specific polymorphism, except to eliminate those known to be carrying a known major mutation.
Most GWAS for discovering common variants associated with disease traits have been conducted using a case-control design with unrelated controls. Not only are unrelated individuals easier to identify and enroll than are entire families (particularly multiple-case families), but the statistical efficiency for discovery or association testing per subject genotyped is typically higher using unrelated controls than using unaffected siblings or other relatives (Witte et al., 1999). However, with the growing interest in rare variants and the availability of NGS, there has been a resurgence of interest in using family-based designs (Zhu et al., 2010;Feng et al., 2011;Ionita-Laza and Ottman, 2011;Shi and Rao, 2011). Family-based designs may have other advantages that outweigh their loss of statistical efficiency. By exploiting information about co-segregation, they may be more efficient at prioritizing potentially causal variants from non-causal ones for subsequent testing for association with disease in larger samples. The ability to exploit Mendelian inheritance may also improve the imputation of rare variants in untested samples (Li et al., 2009;. Finally, family-based designs can exploit both between-and within-family comparisons in a two-step analysis for better power while being robust to bias from population stratification (Lange et al., 2003;Van Steen et al., 2005;Feng et al., 2007;Murphy et al., 2008). In this paper, we focus on the first of these advantages, using a design that sequences a subset of family members initially, ranks the discovered variants in terms of their likelihood of being associated with the trait using the phenotype information on the entire family, and then tests for association in an independent sample. In this sense, the design has elements of both two-phase and two-stage designs, in that the sequencing set is a proper subset of a larger family-based study and that an independent sample is used for replication or combined analysis.
One consequence of the new focus on rare variants is the need for novel analysis strategies, because testing associations individually with every variant would have very little power due to the large multiple comparisons penalty and their rarity. In a sample of, say, size 200, one might identify about 20 million variants. Most of these are likely to be unrelated to disease and genotyping all of them for a large case-control association study would be neither feasible nor statistically efficient, so some means of identifying those most likely to be causal is needed. Furthermore, under some models of disease causation, multiple variants in a causal gene (or pathway) could affect its function, so aggregating variants within genes may also improve power. To address this need, a host of "burden" tests have been developed based on counts of rare variants, weighted in various fashions (see Asimit and Zeggini, 2010;Cirulli and Goldstein, 2010;Basu and Pan, 2011;Bacanu et al., 2012;Thomas, 2012, for reviews). However, these are ill-suited to the situation where a region contains both deleterious and protective variants (Hoffmann et al., 2010). A random effects model that focuses instead on the variance of risk across variants rather than their mean might therefore be more powerful. The first of this type was the C α test (Neyman and Scott, 1966;Neale et al., 2011), which tests for overdispersion of case-control ratios, conditional on the total number of variants. The Sequence Kernel Association Test [SKAT (Wu et al., 2011;Lee et al., 2012)], based on a general linear mixed model, tests for association between similarity of phenotypes and similarity of multi-locus genotypes across all pairs of subjects. See Schaid (2010a,b) for a general review of the basic statistical foundations of such tests and various choices of kernel functions for genetic applications. Recently, this class of methods has been extended to family studies (Huang et al., 2010;Schifano et al., 2012;Chen et al., 2013;Ionita-Laza et al., 2013;Schaid et al., 2013b). Hierarchical modeling approaches offer another approach to the analysis of rare variants, allowing formal incorporation of external information for prioritization.
A variety of methods for incorporating genomic context, functional, or pathway annotation data have been discussed in GWAS contexts (reviewed by Cantor et al., 2010;Thompson et al., 2013). Examples of prior information might include loci previously reported, pathway or genomic annotation, expression QTL or other functional assays, etc. (Rebbeck et al., 2004;Bush et al., 2009;Karchin, 2009;Nicolae et al., 2010;Wang et al., 2010;Freedman et al., 2011;San Lucas et al., 2012;Minelli et al., 2013). Filtering on such variables has become a popular strategy, but risks eliminating many causal variants whose potential significance has not yet been recognized or loading up the list of prioritized variants with too many non-causal ones based on irrelevant information. The weighted False Discovery Rate (Roeder et al., 2006;Wakefield, 2007;Whittemore, 2007) and Gene Set Enrichment Analysis (Chasman, 2008;Holden et al., 2008) require specification of weights in advance and there is no obvious way to combine multiple filters. The hierarchical modeling approach described below is more flexible, allowing the weights given to various biofeatures to be determined empirically, based on their observed correlation with disease associations across the ensemble of all variants.
An example of a two-phase design is the Women's Environmental Cancer and Radiation Epidemiology (WECARE) Study of the risk of second breast cancers among survivors of a first breast cancer, focusing on radiation dose to the contralateral breast (Stovall et al., 2008;Langholz et al., 2009), various genes involved in DNA damage response pathways (Begg et al., 2008;Concannon et al., 2008;Borg et al., 2010;Malone et al., 2010;Quintana et al., 2011;Brooks et al., 2012;Quintana et al., 2012;Reiner et al., 2013) and their interactions (Bernstein et al., 2010; a GWAS is also currently in progress. The design is a nested case-control study, with two controls matched to each case on age and year of diagnosis of the first cancer and study center, and "counter-matched" on radiotherapy for treatment of the first cancer (Bernstein et al., 2004). As an illustration of the two-phase design, we are currently performing whole genome sequencing on a subsample of 201 subjects and whole exome sequencing on several hundred more, drawn from the 701 cases and 1399 controls, stratified jointly by case-control status and risk predictors-age at first cancer, family history (FH), radiation treatment, and time since exposure.
As an example of a family-based design, we are currently performing deep targeted resequencing of 11 replicated regions identified by previous GWASs as associated with colorectal cancer (CRC), using ∼4200 samples drawn from the Colon Cancer Family Registries (C-CFR). The C-CFR is an international collaboration of registries of families ascertained through CRC in various ways, some population-based, some from high-risk genetic clinics, some including population controls or control families (Newcomb et al., 2007). To date, 10,662 CRC families have been enrolled, totaling 62,353 individuals, with genetic samples available on 5113 cases and 9196 unaffected family members or population controls with epidemiologic risk factor information, and FH data on many more (http://epi.grants.cancer.gov/ (CFR/about_colon.html, accessed 3/8/13). For the purpose of comparing different designs, we have selected some samples from multiple-case families and some from unrelated cases or controls in various ways. Ultimately these data would be used to compare designs empirically in terms of the yield of significant findings by subsampling from these real data (e.g., to assess whether a lower depth of sequencing, narrower regions, fewer subjects or subjects targeted in different ways would have sufficed).
The aim of this paper is to review recent developments in methods for the design and analysis of NGS studies, with a particular focus on two-phase and family-based designs, and to illustrate the various issues with simulated data and applications to power calculations and preliminary data from these two studies.

TWO-PHASE SAMPLING FOR TARGETED RESEQUENCING
Suppose one has already completed a large case-control GWAS of unrelated individuals, in which one or more tag SNPs have been found to be strongly associated with a particular disease. It is unlikely that the associated SNPs would themselves be causalmore likely they are simply in LD with the truly causal variant(s). The aim of a targeted resequencing study is therefore to exhaustively re-sequence the region to identify all variants in the hopes of discovering these causal ones. Two key decisions are required: how to select the subsample to be sequenced; and how to prioritize the variants found in this subsample.

Approaches to prioritization of variants
One obvious method of prioritization would be on the basis of novelty, i.e., to focus attention on variants that have not been seen previously (or only rarely) among population controls. With the growing catalog of sequence variants in public databases like the 1000 Genomes Project, many causal variants are likely already to have been discovered and those that are novel are likely to be so rare that there would be very little power testing their association individually with disease. Furthermore, most novel variants are likely to be neutral. Nevertheless, the discovery of a novel association with disease is important, irrespective of whether or not the existence of the variant has been previously reported, but a discovery of a novel variant and its association with disease is particularly noteworthy. The same applies to filtering based on differences in allele frequencies between cases and controls within the sequencing subset (Yang and Thomas, 2011). Thus, some investigators have decided not to sequence controls, but this could be ill advised if cases are sequenced in populations not well represented in public databases or on platforms with different discovery characteristics (e.g., depth of coverage, quality control filtering).
Under the hypothesis that a gene may harbor multiple variants any of which could affect function or that a critical pathway could be affected by polymorphism in any of the genes in it, a strategy that aggregates across multiple related variants may be helpful. Methods section Simulation of Gene-and Pathway-level Prioritization in the WECARE STUDY describes a simulation of this strategy based on the WECARE study; results are discussed in the application below.
Hierarchical modeling (Greenland, 2000) entails adding a second-level model for the effect estimates of each variant, allowing the magnitude of the effects, their probability of being nonnull, or their covariances to depend upon external information ("prior covariates") (Conti and Witte, 2003;Hung et al., 2004;Chen and Witte, 2007;Hung et al., 2007;Lewinger et al., 2007;Conti et al., 2009;Hoffmann et al., 2010;Capanu and Begg, 2011;. Hierarchical models involve a first (subject)-level model for individual's phenotypes Y i as a function of a vector of genotypes G i =(G iv ) at loci v and corresponding regression coefficients β = (β v ), e.g., a general linear model of the form f [E(Y i )] = G i β, and a second (variant)-level model for the distribution of these regression coefficients as a function of prior covariates Z v , e.g., a linear regression model of the form E(β v ) = Z v π (or) perhaps a model for their variances or covariances (Thomas et al., 2009a,b). This approach also has the advantage of allowing for the uncertainty about which effects should be included in the model within a Bayes model averaging framework, e.g., by modeling the probability that a variant has no effect as logit[Pr(β v = 0)] = Z v α (Quintana et al., 2012;Quintana and Conti, 2013). It also allows the data to determine the optimal weights for the various prior covariates rather than having to specify them a priori; these papers show how the gain in power from including covariates with high predictive value for classifying causality of variants is offset by very little loss of power from including covariates with low predictive value (since they are usually assigned very little weight), in contrast to filtering approaches, which can lead to substantial loss of power if either sensitivity or specificity is low.

Joint analysis of SNP and sequence data
The subset of subjects from the substudy with the full sequence data would probably not provide adequate power for testing associations with disease directly, either variant-by-variant or by any of the aggregation methods described above. How then might one take advantage of the data from the much larger GWAS from which the substudy subjects were selected? Three basic strategies are possible: (i) by genotyping; (ii) by imputation; or (iii) by joint analysis.
The first of these is the simplest, but most expensive. One simply does custom genotyping in the main study of the prioritized variants. Under the hypothesis that any causal variants are likely to have been discovered by sequencing and that they survived prioritization, then the genotype data for the main study should be sufficient and associations can be tested directly, with appropriate correction for the effective number of independent tests performed (Conneely and Boehnke, 2007). A final analysis can then include a test of whether the novel variants account for the original SNP association (Yang and Thomas, 2011).
Imputation has become a standard approach for GWAS analysis, so that typically several million common variant associations are tested by combining the study data from the SNP panel with population distributions of all common and uncommon variants from such databases as the HapMap and 1000 Genomes projects (Asimit and Zeggini, 2012;Howie et al., 2012). For each variant not on the GWAS panel, one computes the expected allelic dosage and uses this as the covariate in a logistic regression model; this strategy is known to be superior to simply using the most likely genotype, in part because it correctly allows for the uncertainty in the imputation (Stram et al., 2003). Whether this strategy would be viable for rare variants is still unknown, but there are two reasons for concern. First, the strategy relies on linkage disequilibrium, and rare variants tend to have weaker LD than common ones (Duan et al., 2013). Second, it also relies on having sufficiently large reference panels, which would not include newly discovered variants.
Joint analysis of the full sequence data on the subsample and the SNP data on the main study is the most powerful approach and, like imputation does not involve any further genotyping costs. In their series of seminal papers on two-phase studies, Breslow et al. describe three basic analysis approaches: pseudolikelihood (PL), weighted likelihood (WL), and semi-parametric likelihood Breslow and Zhao, 1988;Cain and Breslow, 1988;Breslow and Holubkov, 1997a,b;Breslow and Chatterjee, 1999;Breslow et al., 2003Breslow et al., , 2009aBreslow and Wellner, 2007). The simplest of these is the WL approach, so for simplicity, we confine our discussion here to this one (Methods section Likelihoods for Joint Analysis of Two-phase Studies). The basic idea is based on Horvitz-Thomson estimating equations, which use the score function derived from the likelihood for a logistic regression of disease status in the substudy data alone, weighting each subject's contribution inversely by their and W i = N si /n si , s i being the sampling stratum to which subject i belongs and N s and n s the main study and substudy sample sizes respectively. While simple in concept, the disadvantage is that the only information used from the main study is the stratum sample sizes. A refinement of this approach is to replace the empiric weights based on the realized sample sizes by predicted weights based on a logistic regression of sampling probabilities on additional covariates available for all main study participants. Recent papers show how this basic approach can be stabilized by using "calibrated weights" without requiring assumptions about the validity of an imputation model using influence residuals (Breslow et al., 2009a,b). The utility of this approach for targeted follow-up of GWAS hits is discussed in Schaid et al. (2013a).

Optimization of sampling fractions
As with two-stage designs, it is theoretically possible to optimize the choice of sampling fractions, subject to a constraint on total cost, but in practice this requires knowledge of the true values of various model parameters (causal allele frequencies, RRs, LD with the GWAS SNPs, etc.). Fortunately, the design is often relatively insensitive to these parameters, so that a balanced design in which the various strata are represented by equal numbers n s in the subsample may be nearly optimal (Reilly and Pepe, 1995;Reilly, 1996). In their article on the application of this design to targeted follow-up of GWAS hits, for example, Schaid et al. (2013a) do not address optimization, but recommend the balanced design.
As with two-stage designs, the basic idea is either to maximize power subject to a constraint on total cost or to minimize the cost required to attain a target power. If the only cost is sequencing the subsample, then it is sufficient to optimize the proportional allocation of substudy subjects across strata. If instead one is designing both the main study and substudy de novo or if custom follow-up genotyping of the main study is planned, then the relative sample sizes of the two phases also need to be considered. In either case, there are likely to be multiple hypotheses being tested, so optimization of power for a specific type of variant may be less helpful than a global optimization. For this purpose, we have previously considered Asymptotic Relative Cost Efficiency, a quantity inversely proportional to the total cost times the variance of the parameter of interest, combining main and substudy data (Thomas, 2007), but more recently in the context of designs for sequencing using DNA pooling, we aimed to optimize power subject to a constraint on total cost (Liang et al., 2012). We adopt a similar approach to optimize designs for testing the 1 degree of freedom Madsen and Browning (2009) rare variant burden test (Methods section Optimization of Two-phase Studies), but this could be easily extended to maximize power for multi-dimensional hypotheses. Here, we summarize a small simulation study to illustrate the potential of two-phase designs (Methods section Simulation of Two-phase Designs).
Results using the simulated sequence data are shown in Table 1 for the full cohort (the "ideal" results if the entire sample could have been sequenced) compared with two-phase analyses using (1) imputation, (2) the Horvitz-Thompson WL approach with sample weights (Horvitz and Thompson, 1952); (3) the Breslow-Cain PL ; and (4) the Breslow-Holubkov semi-parametric estimator (Breslow and Holubkov, 1997a,b). The top half of the table provides results for the Madsen-Browning index including all variants present in the parent casecontrol study, while the bottom half is limited to those seen at least once in the subsample; for the latter, the risk index would include different variants for the different designs, so point estimates are not comparable. The last line gives the average estimate, empirical standard deviation of estimates across replicates, and the estimated non-centrality parameter (NCP) for the Wald test if no subsampling were done. Generally, the imputation approach was the least efficient for all designs, followed closely by WL estimator, while the semi-parametric one the most efficient. Except for the WL, the optimal sampling design was also the most efficient; the inefficiency of the WL in this case seems to be due to some small strata receiving very large weights (for example, 500/2 in the low-risk case stratum compared with 500/214 in the high-risk case stratum, see Footnote b of Table 1). In earlier simulations (not shown) we found relatively little inflation in the variance estimates or changes in the point estimates as the number of strata increases (although the number of replicates that failed to converge increased). Further research in the case of many sparse strata would be helpful, as well as on such issues as the size of the region to be sequenced and the depth of sequencing.

Application to the WECARE study of contralateral beast cancers
To illustrate the potential yield from a sequencing substudy, we consider whole genome sequencing of a subset of 200 genetically enriched subjects, with the intent of following up a subset of discovered variants by testing their associations in a larger study of 700 cases and 1400 controls. The sample sizes used for illustration derive from the WECARE study (Bernstein et al., 2004) described above. The 201 subjects in the top part of Table S1 were selected by prioritizing young age, positive FH, cases over controls, and among cases, those who received radiotherapy (and for these, longer latency). These samples are currently being wholegenome sequenced at an average depth of coverage of 30×. We used simulation to address the following questions: 1. What is the anticipated yield of variants discovered one or more times in this sample, as a function of population MAF and RR? 2. Of those discovered, what proportion would be novel (not in the 1000 Genomes Project), what proportion would be truly causal, and both novel and causal? 3. Among the discovered variants in each category (of MAF, RR, causality, and number of times seen in each series), what  Figure S1. In a subsample of this size, most variants with MAF >0.1% would be seen at least once, including the most causal variants ( Table 2). Restricting to those seen more than once considerably reduces the number of variants prioritized, as does eliminating those never or seldom reported in the 1000 Genomes Project sample, but also eliminates most of the truly causal variants. If, however, the goal is to identify at least some of the novel causal variants with adequate power to test them for association in the main study, then this design might still discover something in the range of 5-20 causal variants out of the total 1600 simulated, depending on the specific criteria used for prioritization, and of course depending upon the true simulation model parameters.
We also simulated gene-and pathway-level burden tests (Methods section Simulation of Gene-and Pathway-level Prioritization in the WECARE study; Table 3). These show a modest improvement in power at the higher levels of aggregation, but power is still low with these sample sizes. The simulated causal variants are predominantly rare, so only about 16% of causal ones are even discovered in this small sample, setting an upper bound for power for single variant tests. Of the 43 discovered causal variants (on average across 100 replicates), 19 are prioritized and 3 of these are found to be significantly associated in the full study sample, for an average power of 1.1%. Of course the corresponding proportions were much smaller for null variants, yielding only 3 false positives in total out of 31 million. (The elevated "false positive" rate for single variant tests compared with the target 0.05 is due to null variants in strong LD with other causal variants.) Similar comparisons yielded 2.9, 5.7, and 4.6% power for generegions, genes, and pathways respectively, with type I error rates at or below the target level. The improvement at the region and gene levels probably reflects the increasing benefit from pooling similar variants, while the failure of the pathway burden test to yield even better power may be due to an increasing proportion of truly null genes or variants diluting the effect of the positive ones. These results are based on prioritization at each level at α 1 = 0.01/p where p is the number of tests (variants, genes, etc. discovered in the subsample); while optimization of these values is possible, the results seem to be relatively insensitive across a broad range of choices. The specific results are somewhat more sensitive to the specific model parameters, only one being presented here, but the general patterns remained consistent across all values we considered. Sequencing is still underway but preliminary results from the first 93 samples from Table S1 suggest that the majority of subjects (mainly contralateral cases with early onset and/or family-history positive subjects) carry at least one functionally significant, clinically relevant or predicted diseasecausing mutation, based on external annotation criteria including Human Genome Mutation Database (Stenson et al., 2009), ClinVar [http://www.ncbi.nlm.nih.gov/clinvar,] MutationTaster (Schwarz et al., 2010) in both known and unknown breast cancer candidate genes and pathways, and >50% carry at least two and 10% carry three or more. The next step is to see whether these variants are differentially distributed between WECARE cases and (unilateral breast cancer) controls or population rates, whether they are associated with radiotherapy (suggesting an interaction effect), and to test these variants in the full WECARE study sample.

FAMILY-BASED DESIGNS FOR PRIORITIZATION
Several investigators have recently reported approaches to efficient selection of individuals for sequencing in family-based designs.  described an approach for targeted sequencing of regions exploiting already available linkage information to optimize imputation to other family members, but without using phenotype information, whereas Wang et al. (2013) described an approach for whole genome sequencing using phenotype and kinship information. We simulated various family-based designs for whole genome sequencing to address the following questions: 1. What criteria should be used to select families and members for sequencing substudies? 2. What criteria should be used to prioritize variants for subsequent association testing? 3. How do family-based designs for sequencing compare with those using unrelated individuals in terms of probability of discovering novel variants, classifying variants by their likelihood of being causal, and power for testing association with disease?
We consider a two-phase design that uses a subset of individuals from a family study chosen on the basis of their phenotypes and relationships to each other for discovery and screening, followed by association testing in the full pedigrees. If enough family-based samples are available, replication using additional family-based samples is preferable to using a different sampling scheme because one would like the spectrum of variants (e.g., MAFs and RRs) being tested in the second stage to be comparable to those discovered in the first. We compare the relative efficiency of this family-based design with a conventional two-stage case-control design with comparable costs. We evaluated these designs by simulating 4-generation pedigrees with 22 members in each. We sampled haplotypes from the same simulated population described earlier and randomly dropped genes through the pedigrees, generating phenotypes with randomly selected rare variants as causal with the same RR distribution and retaining those pedigrees with some minimum number of cases (Methods section Simulation of Family-Based Designs). To address the first question, families with various numbers of affected individuals were ascertained, and from each of these we selected individuals to sequence in various ways (e.g., two cases of at least second-degree relationship to each other and one unaffected individual). We then tabulated the following statistics for causal and non-causal variants by the relationship of the cases to each other.
• Rule-based criterion: the number of families for which all affected members carry the variant and no unaffected ones did among the subset sequenced; • Likelihood ratio (LR) criterion: the ratio of the retrospective likelihoods under the simulated penetrances and allele frequencies vs. the null penetrance (the average rate in the ascertained families); while similar to the lod score used in linkage analysis, here a single-locus likelihood is used to test association with a directly-observed variant, not markers in LD with an unobserved locus; • Bayes factor (BF) criterion: similar to the likelihood ratio, but based on the marginal probabilities under the simulated prior distributions of penetrances and allele frequencies; • Score test criterion: the score test derived from the retrospective likelihood, evaluated under the null hypothesis.
(Methods section Family-based Criteria for Prioritization of Variants). The score test was evaluated both at the single-variant and the regional level, the latter using the family-based SKAT tests (Schifano et al., 2012;Chen et al., 2013;Ionita-Laza et al., 2013;Schaid et al., 2013b). The other tests were used only for ranking variants individually. The score test essentially relates the phenotypes of the entire pedigree to the genotypes of those who have been sequenced using the inverse of the kinship matrix to weight them. If linkage information is available, then a direct test of association with imputed genotypes is possible , allowing for residual phenotypic correlations Figure 1 shows the mean score statistics per family for those with a total of 4 affected individuals in which either an affected sib pair with an affected first cousin or a discordant sib pair with an affected cousin have been sequenced. These were derived for an 11-member sub-pedigree for which exhaustive enumeration of all possible genotypes and phenotypes was feasible. As expected, variants with the largest scores for causal variants (top panel) and the highest probabilities of being causal (bottom panel) were those where both cases were carriers and (if sequenced) the control not. Having the control affected somewhat lowers the average score, but not as much as having an additional case being a carrier increases it, essentially because we are considering a relatively rare disease (population prevalence 5%).
This trade-off is explored further in Figure 2 for various types of relatives sequenced. Here, we fix the total number of individuals being sequenced at 100 across the designs being compared (e.g., 25 pedigrees with four members each sequenced, 33 with 3, 50 with 2 each, or 100 singletons). Although, as expected, having more families with fewer individuals sequenced increases the absolute discovery probabilities (not shown), the relative difference comparing causal and null variants goes the other direction, and the relative probability of prioritization also increases for more subjects per pedigree and fewer pedigrees sequenced, for a considerable increase in the overall relative probability of discovery and prioritization.
Comparing designs with two individuals sequenced per pedigree shows little difference in the probability of discovery across the relationships among the pairs (Figure 2), but shows that the www.frontiersin.org December 2013 | Volume 4 | Article 276 | 7

FIGURE 1 | Mean scores for causal variants (top panel) and ratio of frequencies of causal to non-causal variants (bottom panel) in simulated 11-member pedigrees with at least 4 affected members.
In each panel, results are shown for a design sequencing an affected sib pair and affected cousin by the number of carriers of the variant allele (left) or an affected first cousin pair and an unaffected sib by the number of carriers among cases and controls (right).
conditional probability of prioritization given discovery and the joint probability of discovery and prioritization is better for more distant relatives. Obviously, the more stringent the cutoff for any of these criteria, the fewer the variants that would be prioritized, but noncausal variants tend to be eliminated much faster than causal variants (Figure S2), so the challenge is to choose a threshold that minimizes the false positive proportion, subject to the total number of variants that can be tested in subsequent replication efforts. The relative performance of the various prioritization criteria is illustrated in Figure 3 as Receiver Operating Curves, varying these thresholds. Although the BF criterion is the best overall in terms of the area under the curve, it is the most computationally intensive and the score test is nearly as good and much faster to compute.

RELATIVE EFFICIENCY OF FAMILY-vs. POPULATION-BASED DESIGNS.
We compared the power of a two-stage family-based design with that of a conventional case-control design. The overall power for any design with independent tests in the two stages is simply the product of the powers for the two stages (Methods section Calculation of Power for Two-stage Designs). The probabilities of discovery, prioritization, and replication are illustrated in Table 4 for a range of design parameters-total sample sizes, proportions allocated to stage 1, numbers of copies required to be judged a discovery, and the minimum threshold required www.frontiersin.org December 2013 | Volume 4 | Article 276 | 9 FIGURE 3 | Receiver operating curves comparing different prioritization schemes.  Figure S3.) Obviously, as the number required for discovery is lowered or the threshold for prioritization is raised, fewer variants in total would be prioritized, leading to a less stringent multiple comparisons penalty, but at some point the overall power decreases because too many of the truly causal variants are either not discovered or not prioritized. Although the overall power (the proportion of all causal variants discovered, prioritized, and replicated) for any of these designs is only about 20% (for a sample size of 1000 families), the majority of those not found are either very rare or have very small effect sizes. Of particular interest are the numbers of novel variants (those not in the 1000 Genomes Project database) that are discovered, prioritized, and replicated. Since these too are predominately rare, power for them is even lower, but depending upon the total number that actually exist, they could still represent a substantial yield of true positive findings. These comparisons are provided for the "optimal" designs of each type and one alternative design that yields better power at modestly larger cost. Assuming costs of $1000 per family for enrollment and obtaining pedigree phenotypes, $100 per subject enrolled in a case-control design, $5000 per whole-genome sequence, and $0.05 per subject-genotype, the optimal familybased design turns out to require 540 pedigrees in stage I and

Frontiers in Genetics | Statistical Genetics and Methodology
December 2013 | Volume 4 | Article 276 | 10 1260 in stage II at a critical value of the score test for prioritizing variants of 3.0; this yields 167 true-positive replicated associations out of 1000 simulated at a cost of $12 M. The corresponding optimal case-control design would require 1400 case-control pairs in stage I, 5600 in stage II at a critical value of α 1 = 0.001 in stage I, for a yield of 159 true positive replicated associations at a cost of $18 M (about 2/3 the cost-efficiency of the familybased design). Of course, with different cost ratios, these optimal designs would change, as illustrated in Table 4 for the case where enrollment costs are 5 times larger and sequencing only $1000 per whole genome. In this instance, case-control designs turn out to be the more cost-efficient. For this simulated MAF distribution, only 175 of the 1000 causal variants would be novel, but in every situation considered, the power for discovering such rare variants is still more than half that for all causal variants. No striking differences were seen between the spectrum of RRs and MAFs discovered, prioritized, and replicated by the two types of designs with sample sizes chosen to yield similar overall power.

APPLICATION TO THE COLORECTAL CANCER FAMILY REGISTRIES DATA
Included in the Colorectal Cancer Family Registries are a few large families for whom all available family members have either been genotyped for previously replicated GWAS SNPs or whole exome sequence variants. We analyzed one of thesea large Australian pedigree comprising 145 individuals with a total of 7 colorectal, 1 Lynch syndrome, and 9 other cancer cases. Genotypes for 32 GWAS-associated SNPs were available for 49 of the members, including 5 of the CRC and Lynch cases and 4 of the other cancers. These data were used to illustrate the effect of subsampling. We selected individuals under various criteria, calculated LR, BF, and score statistics using only the SNP data for these selected individuals [but all the phenotype information (Visscher and Duffy, 2006)], and compared these results to those from the complete genotype data to see which criteria best distinguish variants that are "truly" associated (based on the complete data) from "false" positives. Figure 4 shows the correlation of each statistic computed using all available genotype data (as the "gold standard") with those using only the subsample of genotypes (averaging over 10 replicate subsamples). For the CRC and Lynch syndrome phenotype, we compared subsets of 1-4 cases and 0-3 controls out of the 5 available cases and 44 available controls. These showed little improvement in correlation for any of the test statistics from adding more than about 2 cases. Adding the genotypes for one or two unaffected members somewhat improved the correlation for LRs and BFs when there were 2 or more cases, but surprisingly worsened the correlation when only a single case was included; this may simply reflect instability due to the small number of cases in total. Results were somewhat more stable when all 9 cancer cases with available genotypes were considered, allowing comparisons of larger subsamples of cases. Adding more than about 3 cases did not materially improve the correlation and adding 1-3 controls improved the correlations only modestly, again reducing them when a single control was added to a single case. The bottom panel shows that the more distant relative pairs were more informative about distinguishing apparently associated from non-associated SNPs (based on the complete data).
Because only common variants were available in the Australian data, we also performed similar simulations on 15 large pedigrees that had previously been included in a linkage scan (Cicek et al., 2012) and had whole exome data available on 2-3 CRC cases from each. Not surprisingly, in this small dataset, no genomewide significant associations were found by the score test with any of the 359,744 single nucleotide variants (SNVs) called at least once (a third of these were called only 3 or fewer times) or with 100-SNV bins with the regional score SKAT test, nor were the regional tests particularly correlated with the maximum single-SNV tests or with prior annotation. Additional simulations (not shown) based on the real sequence data and simulated phenotypes (conditional on the total number of cases in each pedigree) confirmed that 15 families would be far too few to find any significant causal effects, even if IBD information were used. The design of a larger NGS study is described in the concluding section.

DISCUSSION
One of the advantages of family-based designs is that Mendelian inconsistencies can be used to check for genotyping errors (Pompanon et al., 2005). This has, of course, long been recommended as a routine quality control check in linkage and family-based GWAS studies. This advice becomes even more important when dealing with NGS data because of its inherently higher error rate as a function of depth of sequencing and quality control filters applied (Faye et al., 2013). Further research on approaches to using pedigree information to improve variant calling would be helpful.
In a similar vein, Mendelian inheritance could be exploited for improved imputation of variants in unsequenced family members. On obvious way to proceed would be to first use standard imputation procedures with external reference populations (Howie et al., 2012), treating each subject with GWAS SNP data as independent to obtain preliminary genotype probabilities at the sequenced variants. These could then be combined with the observed genotype calls for the sequenced family members using Mendelian transmission probabilities to obtain refined genotype probabilities (Burdick et al., 2006). While proceeding variantby-variant in this manner is relatively straightforward, it fails to take LD patterns among the variants into account, but a similar strategy could be applied to haplotypes . Such a two-step approach was used in simulations for the Genetic Analysis Workshop 18 (http://www.gaworkshop.org/ gaw18/index.html). Ideally, a unified approach that would integrate the two sources of information in a single step would be preferable. In addition to imputation, identity-by-descent information could be used to inform the selection of subjects for sequencing  and directly as a local genetic similarity kernel in family-based SKAT tests.
Homozygosity mapping in families has proven to be a valuable technique for mapping recessive alleles (Kruglyak et al., 1995;Chahrour et al., 2012). Design issues for sequencing studies for such traits are likely to be somewhat different from those considered here and would be useful avenue for further research. Another possibility is to use a two-step analysis of the same data, exploiting between-family comparisons to prioritize variants and within-family comparisons to test the most promising ones, in the spirit of Van Steen et al. (2005). Because these two tests are independent, one then need only correct the significance level for the number of variants passed to the second stage. For quantitative traits, regression of the offspring phenotypes on the mean of the parents' genotypes provides a simple first-step test. For disease traits, one would have to include control trios, nuclear families with varying proportions of affected offspring, or external control individuals to have the variability in phenotypes needed for the first-step test. Practically, however, this approach would require having access to the DNA for case-parent trios (which might not be available for late-onset diseases like cancer) and sequencing of the parents rather than the cases for the first stage; not only would this double the sequencing costs over a more conventional design that sequences only the cases, but it might seem counter-intuitive since the parents may not themselves be affected (even though at least one of each pair must carry any variant that case does). One could of course reverse the two steps, but this would require sequencing the entire trio in the first step and the final inference would not be robust to population stratification. In further simulation studies (not shown), we found that use of external controls tends to be more powerful than between-family comparisons for the first step, but is more susceptible to population stratification bias; this is not a threat to validity if used in the first step, but could reduce power if too many false-positives are passed to the second step, inflating the multiple testing penalty. The two-step analysis approach consistently yielded better power than a two-stage case-control design, however.
Two-stage and two-phase designs are also amenable to considerable cost-efficiency gains by using DNA pooling techniques (Sham et al., 2002) in the first stage, thereby allowing one to sequence many more subjects than would be feasible if one were to sequence individuals. Of course, only aggregate allele frequency information (Huang et al., 2010), not individual genotypes, are then available [unless one uses molecular bar-coding techniques (Craig et al., 2008)], but these can still be used for discovery of novel variants  or case-control comparisons of pool allele frequencies (Johnson, 2007;Macgregor et al., 2008;Zhao and Wang, 2009). Further cost-efficiencies are possible by constructing pools of pools, with bar-coding of the sub-pools (Smith et al., 2010). Optimization of designs using DNA pooling has been described by Liang et al. (2012), but extension to family-based studies remains a challenge (Lee, 2005).
We conclude by describing how these considerations influenced the design of a planned whole-exome sequencing study within the colorectal CFR. We are planning a three-stage familybased design, in which the first stage would use already available sequence information for prioritizing about 1000 genes. This would be followed by two stages of replication, each with probands from about 1000 multiple case families and 1000 controls. The first stage would exploit existing control data from the 1000 Genomes Project, while the second and third stages would use individually matched population controls. Because our hypothesis is that causal genes may harbor multiple rare variants-not necessarily the same across families-the two replication stages would perform full resequencing of the entire coding and flanking regions of the prioritized genes, 1000 genes in stage 2, 100 genes in stage 3. For the same reason, we have decided to use a family-based design for all three stages, since variants discovered in multiple-case families may not be well represented in unselected series of population-based cases. After analysis of the sequencing data from each stage, additional genotyping of the prioritized variants would be done on all other available family members for analyses using a conditional segregation analysis (Hopper et al., 1999). Three criteria would be used for prioritization at every stage: a family-based test of co-segregation with disease for each variant separately and for entire genes; a gene-based test of association comparing cases and controls; and filtering based on bioinformatics predictors. The first of these uniquely exploits the information available from a family-based design and can be used to rank genes on the probability they carry at least one causal variant, using an aggregate assessment of the impact of all rare variants in the gene. The three comparisons would be unified through hierarchical modeling, in which both the family-based and case-control comparisons would be incorporated in the likelihood for the first (individual)-level model, and the bioinformatics predictors would be incorporated in the second (variant)-level model. Similar issues are currently being discussed in the design of a large-scale sequencing study for the WECARE project. Since this is not a family-based study, the key decision there is how best to select the subset for sequencing in a two-phase design.

METHODS
All simulations were based on the same population of 10,000 haplotypes of length 250 Kb generated by the COSI program (Schaffner et al., 2005) with the population history parameters provided in their Table 1. This population contained 5125 unique variants, of which 4557 had minor allele frequencies (MAF) <0.05, 95% less than 0.01, 79% less than 0.001.

SIMULATION OF TWO-PHASE DESIGNS
We postulated a disease model involving multiple rare variants drawn from the simulated haplotype population with the probability of having any effect and the expected size of the effect depending inversely on the MAF (Figure S1). We then sampled pairs of haplotypes at random from this population, computed their risk under this model, and assigned case-control status, continuing in this manner until the target number of 1000 cases and 1000 controls were obtained for the parent GWAS, and tested these data for association with all common SNPs (MAF >5%). If a significant association is found with one or more SNPs, the replicate was retained for the sequencing substudy.
For the subsample, we first constructed a risk index based on a multiple logistic regression of disease state on all non-redundant GWAS SNPs and stratified the phase 1 subjects into three strata of high, medium, and low risk (with cutpoints at the 25th and 75th percentiles). We compared case-control, balanced, and optimal sampling of 600 subjects total out of the available 4000 (Methods Section Optimization of Two-phase Studies). For these subjects, we retained all variants, including the causal ones but also many more irrelevant ones.
Finally, we conducted a joint analysis of both phases. We tested association with the Madsen and Browning (2009) index-the number of rare variants weighted inversely by the square root of their allele frequencies-as the rare-variant covariate of interest, treated as continuous, using the WL, PL, and semi-parametric likelihood methods described in Methods Section Likelihoods for Joint Analysis of Two-phase Studies. We found that the risk index used for sampling was a confounder of the Madsen-Browning rare variant index, due to LD among the variants included in each, due to differences between the weights in the Madsen-Browning index and the simulated weights, and due to having used the disease status to construct the risk index, so all results were adjusted for the sampling risk index. This entire process was repeated 1000 times.

LIKELIHOODS FOR JOINT ANALYSIS OF TWO-PHASE STUDIES
Following the general notation used by Breslow and Holubkov (1997a,b), we let V represent a set of GWAS SNPs in a region found to be associated with disease Y, and X represent the causal variant(s) in LD with the GWAS SNPs, to be discovered by sequencing the region on a subsample of subjects. For this purpose, we defined the causal variable X to be the Madsen-Browning index. The imputation strategy entails simply fitting a regression model for X|V to the substudy data, and then usingX(V) as the covariate for Y|X in the full study. Proper inference would, however, require that the uncertainty in the imputation be taken into account in the analysis of the main study data. We now describe a formal likelihood approach to accomplish this.
If the first stage is a case-control sample, then the full likelihood would be the retrospective probability December 2013 | Volume 4 | Article 276 | 13 and the likelihood for the second stage sample S would be The full likelihood would then be L(θ) = L 1 (α, β, γ)L 2 (α, β) where θ = (α, β, γ). In practice, however, both V and X are highly multidimensional and we wish to avoid having to specify their joint LD distribution parametrically. When we do not assume functional forms of p α (X|V) and p γ (V) the likelihood above becomes suitable for semiparametric maximum likelihood (SPML) estimation. Both Breslow and Holubkov (1997a,b) and Scott et al. (2007) have developed profile likelihoods by maximizing out the high-dimensional parameters p α (X|V) and p γ (V), with a different parameterization. We followed a recent formulation of the problem from Scott et al., in which the estimating equations for β and the constraints for nuisance parameters π are described in a "log-likelihood" where π 1v = 1 − π 0v, N yv = N yv − n yv , and By iterating between maximizing a logistic likelihood with fixed offsets containing π and updating π using its constraint equations, we can obtain semiparametric (SPML) efficient estimates of β. The SPML approach has the advantage of being flexible about the distribution of covariates X and V while retaining good efficiency. However, the derivation of the semiparametric estimating equations was complex enough that it appeared only after two approximation methods-the WL and the PL-had been published. The WL approach weights individual score functions from model p β (Y|X) inversely proportional to the sampling probabilities. In our implementation, the original Horvitz-Thompson weights N yv /n yv were used, although an improvement might be to use predicted weights 1/Pr(S =1|Y,V,Z) that could incorporate auxiliary information Z from the full cohort (say, from a logistic model), or better yet, the calibrated weights described in Breslow et al. (2009b). The PL approach was first developed in Breslow and Cain (1988), representing an alternative that uses first phase information. Following their seminal paper, we work with a PL based on p β, δ (Y|X,V,S), which incorporates the parameter of interest β and the nuisance log-odds δ for Y = 1 in stratum V = v. They insert estimatesδ v = log(N 1v /N 0v ) into the PL and then solve for β. Schill et al. (1993) proposed to estimate δ and β simultaneously; this method, although not implemented here, has been reported to yield similar results as the Breslow and Cain (1988) version.

OPTIMIZATION OF TWO-PHASE STUDIES
Our general strategy for optimization of any of these two-phase designs aims to solve the following problem. Suppose we have collected phase I data on N subjects and seek to selectively assemble data on at most n subjects based on available information. The available information from phase I, mainly observations of Y and V, is summarized by the cell sizes N yv . What we wish to optimize are the cell-specific sampling fractions, denoted by s yv = n yv /N yv . A natural choice of objective function might be to aim for more precise parameter estimates per unit cost, for example using the Asymptotic Relative Cost Efficiency (Thomas, 2007). However, while this goal could be readily achieved for a scalar parameter as in Reilly (1996), it is less clear when more than one parameter is estimated. In this work, we chose our objective function to be the non-centrality parameter of the likelihood ratio test for H 0 :β= 0 vs. H 1 :β = 0, withβ being the subset of interest in β. This objective is equivalent to a linear combination of information matrix entries, and is thus a good summary of the standard error estimates. We denote the entire parameter vector including β and other nuisance parameters as θ. It has been shown (Self et al., 1992;Brown et al., 1999) that the non-centrality parameter can be computed as λ is the log-likelihood function, θ A is the true parameter vector under the alternative hypothesis, andθ 0 is the parameter vector that maximizes E θA [l(θ)] under the null hypothesis. A slightly different form, λ = λ − v with ν being the degrees of freedom of the test, is also reasonable. In this particular problem, we used l * (.) shown in the previous section in place of l(.). It has been shown (Scott and Wild, 1989;Scott et al., 2007) that the profile log likelihood l * (.) is amenable for standard likelihood ratio tests. This problem setting requires N yv , Pr(X|Y,V), β 0 (true value of β) to be either known or pre-specified. To obtain these quantities, we simulated 1000 data sets as described in section Simulation of Two-phase Designs, and consider these data sets as the underlying "super-population." Then we used the estimates of Pr(X|Y,V), β 0 and the average values of N yv from this super-population as the input to the optimization procedure. Hence we used a fixed optimal design for all simulation replicates, using the SPML. Compared to solving the optimization problem for each replicate, this strategy represents a solution to the "expected" problem.

CALCULATION OF THE EXPECTED YIELD OF SINGLE-VARIANT TESTS IN THE WECARE STUDY
The estimates in Table 2 were derived using only the MAF distribution from the simulated haplotype population. The expected carrier probabilities in each age, FH, and disease stratum were computed from the assumed distribution of RRs over a grid of MAF and RR values using standard Mendelian inheritance methods. We computed the probability of observing c copies of a variant in the subsample from the Poisson distribution for each MAF and RR bin, summing the expected counts over all sampling strata, and in a similar manner, we computed the probability of observing c copies among 1000 population controls. The total yield of discovered novel variants is then the sum over these bins of the number of variants in each bin times the probability of seeing >c and <c copies.
For association testing in the main study, we computed the NCP for the Mantel-Haenszel test (stratified by age and FH) for each bin and then computed power by reference to the cumulative normal distribution with Bonferroni correction for either the total number of discovered variants or only the number of novel variants. These are again summed over all RR and MAF bins to estimate the expected yield for various values of c and c' given in Table 2.

SIMULATION OF GENE-AND PATHWAY-LEVEL PRIORITIZATION IN THE WECARE STUDY
To compare the power of single-variant and burden tests, we selected variants from the haplotypes in a multi-level fashion as follows. We defined 100 pathways p, each comprising 1-20 genes g, further subdivided into three regions r (e.g., "exons," "introns and promoter regions," and "more distant enhancer regions"). Starting and ending locations of each gene and is sub-regions were selected at random and all variants v within these regions were included. Pathways, genes, and variants were selected as causal with probabilities π P , π G , and π v respectively, where π v for variant v depends upon the type of region and its MAF. Thus, a variant has a causal effect only if all three levels are designated as causal. Each causal variant was assigned a log RR β v as a sum of pathway, gene, and variant-level effects, each being the absolute value of a normal deviate with zero mean and variance σ 2 P , σ 2 G , and σ 2 v , respectively, σ 2 v also depending upon the type of region and MAF. We drew two haplotypes at random for each potential subject and computed the genetic log RR as the sum of the β v s for each variant they carried. Subjects were assigned at random to an age stratum and then to disease (unaffected, unilateral, bilateral) and FH strata with probabilities depending upon their age and genetic RR. This process was continued until the target number of subjects in each age, FH, and disease stratum was obtained and a random subset of these was designated as the sequencing sample. In the real WECARE study, we prioritized the youngest cases, those with a positive FH, and radiotherapy subjects with the longest latency for sequencing. In this way, we selected 201 subjects out of the total of 2199 available for sequencing; the distribution of the entire study sample and the sequencing subsample by age, FH, and laterality is provided in Table S1.
For the analysis, we scanned the subsample to identify all variants seen at least twice. All single variants that were seen more frequently than expected (by an amount depending on the number of comparisons) based on the general population MAFs and similarly all pathways, genes, or regions that were seen more frequently than expected were prioritized. These are tested for case-control association in the main study, using a Cochran-Mantel-Haenzsel (CMH) test, stratified by age and FH, with Bonferroni adjustment for the number of comparisons at each level.

SIMULATION OF FAMILY-BASED DESIGNS
Family-based simulations used a fixed pedigree structure of 4 generations with two offspring in each generation for a total of 22 members in 7 nuclear families. We sampled two haplotypes at random for each of the founders from the simulated haplotype population and dropped them at random without recombination through the non-founders. As before, we chose causal variants and their RRs depending upon MAF, computed the genetic log RR as the sum of the β v s for each causal variant a subject carries, and assigned disease status accordingly, adjusting the intercept to yield a population prevalence of 5%. Families with the required number of cases (set to 4 for most of the results reported here) were retained and the process continued until 1000 such families were ascertained. Various criteria were used to select a subset of family members whose genotypes were to be retained for analysis (the "sequencing subset," e.g., two affected individuals of at least second-degree relationship and one unaffected member), while retaining the phenotype information for the entire pedigree.
For each of the causal and a random sample of the non-causal variants, we computed the likelihood ratio, Bayes factor, and score statistics (described below) and tabulated these values for different configurations of genotypes among the sequenced members and their relationships to each other. For the rule-based prioritization, we also tabulated the number of variants found in at least f min families and the number of these that were prioritized by having the target genotype configuration (e.g., both cases being carriers and the control not). The distributions of causal and non-causal variants for each criterion are shown in Figure S2 as a function of the threshold for prioritization; plotting one curve against the other yields the ROCs displayed in Figure 3.

Rule-based criterion
Variants were classified on the basis of the number of families in which all sequenced cases carried the variant and any sequenced controls did not.

Likelihood ratio criterion
Following the principles described in Petersen et al. (1998), we estimated the probability that any particular variant is causal under a given genetic model by accumulating likelihood ratio contributions (comparing the likelihoods of the data under the alternative hypothesis that a particular variant is causal to that under the null hypothesis that it is not causal) across families. Letting Y denote the phenotypes of all family members (including those not sequenced), G bs the observed sequence data, β v the genetic RR and q v the minor allele frequency for variant v, the likelihood ratio is where Pr(G obs v , Y) = G unobs v Pr(Y|G v ) Pr(G v ). These calculations were done evaluating the likelihood under the simulated RR and minor allele frequency for each variant under the alternative hypothesis and under the induced marginal population risk for the null hypothesis.

Bayes factor criterion
The likelihood ratio criterion requires a maximization of the likelihood under the alternative hypothesis, which can be unstable for rare variants. To avoid this, Petersen et al. (1998)  Bayes factor by averaging over a prior distribution of MAF and RR. Bayes Factors are computed as the ratio of these marginal probabilities of the joint genotypes of the sampled individuals under the true model to that under the null. Of course, we do not know the true values of either β or q, or even their true probability distributions, so we used the simulated probability distributions for Pr(q) and Pr(β|q), averaging over a random samples of 100 parameter values drawn from their null and alternative distributions.

Score test criterion
We computed the score statistic T v for a single variant v derived from a multivariate logistic model for the phenotypes of the entire pedigree and the genotypes of the sequenced subset as where Y f is the vector of phenotypes for family f, p f is the familyspecific disease prevalence, K f is the kinship matrix, G fv the vector of genotypes for variant v, and q fv is the mean of G v among the sequenced members. The G fv − q fv deviations are set to zero for untyped individuals, but the inclusion of the kinship terms for typed-untyped pairs allows their phenotypes to contribute. This statistic has mean zero under the null hypothesis and asymptotic variance var(T v ) = f t fv 2 . For the purpose of prioritizing variants we used the score test T v 2 /var(T v ) for each variant and select the top-ranked ones at some cutoff. This provides a pure withinfamily comparison, but those families for which all sequenced individuals are no carriers or all are carriers become uninformative. A more powerful test that exploits both between-and within-family information replaces the p f and q fv by and, the corresponding means over all families. The regional (SKAT) test for all variants in region R is simply [ f ( v ∈ R t 2 fv )] 2 / f ( v∈R t fv 2 ) 2 .

CALCULATION OF POWER FOR TWO-STAGE DESIGNS
Using family-based simulation described above, we tabulated the average number of families in the simulated sample in which each variant was seen at least once and the NCP as the mean of the simulated score statistics by bins of MAF and RR. To compare designs with different stage 1 sample sizes and discovery thresholds, we rescaled the numbers of families carrying a given variant by the ratio of proposed and simulated sample sizes and recomputed the probability of discovery by reference to the Poisson distribution in each bin. In a similar manner, we computed the probability of prioritization at stage one at threshold λ min in each bin by rescaling the NCPs by the ratio of sample sizes and referred them to the non-central chi square distribution. To extend our simulation results to the whole genome, we multiplied the predicted number of simulated null variants meeting our discovery and prioritization criteria by 20,000,000/1000. The total number of variants carried forward to stage 2 is then simply the sum over all MAF and RR bins of the product of the number of variants in the population times the probabilities of discovery and prioritization. Power for stage two was computed in a similar manner by rescaling the NCPs by the stage 2 sample size and referring it to the non-central chi square distribution with Bonferroni correction for this number of tests carried forward. Thus, the yield of causal variants discovered, prioritized, and replicated is the sum of the number of variants in the population times the probability of discovery, prioritization, and replication over all MAF and RR bins. In each MAF bin, we also computed the Poisson probability that a variant would not have been seen at least twice in 1000 population controls and computed the power for novel variants in a similar manner.
Calculations for case-control designs were similar, except that no simulation was required. The probability of discovery could be computed directly from the Poisson distribution in the combined case and control sample and the NCP s for the chi square test for allelic association computed in the usual way for a 2×2 contingency table of allele counts by case-control status.

ACKNOWLEDGMENTS
Supported in part by NIH grants U01 HG005927, U19 CA148107, R01 ES019876, R01 CA129639, P30 CA014089, P30 ES 07048. The authors are grateful to Drs. Paul Marjoram and David Conti for many helpful discussions and to the investigators from the Colorectal Cancer Family Registries (Dr. Steve Thibodeau, P.I. of the proposed sequencing substudy) and WECARE Study (Dr. Jonine Bernstein, P.I.) for the studies used for illustration, particularly Dr. David Duggan for the summary of preliminary data from the WECARE Study.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/ fgene.2013.00276/abstract  Table S1 | Sample sizes used to illustrate power calculations for a two-phase design for the WECARE study (note that the actual sequencing sample is further stratified by radiotherapy and latency for the purpose of studying gene-radiation interactions, factors not considered in these simulations). *UBC, unilateral breast cancer (controls); CBC, contralateral (second asynchronous) breast cancer (cases)