Cryptic relatedness in epidemiologic collections accessed for genetic association studies: experiences from the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study and the National Health and Nutrition Examination Surveys (NHANES)

Epidemiologic collections have been a major resource for genotype–phenotype studies of complex disease given their large sample size, racial/ethnic diversity, and breadth and depth of phenotypes, traits, and exposures. A major disadvantage of these collections is they often survey households and communities without collecting extensive pedigree data. Failure to account for substantial relatedness can lead to inflated estimates and spurious associations. To examine the extent of cryptic relatedness in an epidemiologic collection, we as the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study accessed the National Health and Nutrition Examination Surveys (NHANES) linked to DNA samples (“Genetic NHANES”) from NHANES III and NHANES 1999–2002. NHANES are population-based cross-sectional surveys conducted by the National Center for Health Statistics at the Centers for Disease Control and Prevention. Genome-wide genetic data is not yet available in NHANES, and current data use agreements prohibit the generation of GWAS-level data in NHANES samples due issues in maintaining confidentiality among other ethical concerns. To date, only hundreds of single nucleotide polymorphisms (SNPs) genotyped in a variety of candidate genes are available for analysis in NHANES. We performed identity-by-descent (IBD) estimates in three self-identified subpopulations of Genetic NHANES (non-Hispanic white, non- Hispanic black, and Mexican American) using PLINK software to identify potential familial relationships from presumed unrelated subjects. We then compared the PLINKidentified relationships to those identified by an alternative method implemented in Kinship-based INference for Genome-wide association studies (KING). Overall, both methods identified familial relationships in NHANES III and NHANES 1999–2002 for all three subpopulations, but little concordance was observed between the two methods due in major part to the limited SNP data available in Genetic NHANES. Despite the lack of genome-wide data, our results suggest the presence of cryptic relatedness in this epidemiologic collection and highlight the limitations of restricted datasets such as NHANES in the context of modern day genetic epidemiology studies.

Epidemiologic collections have been a major resource for genotype-phenotype studies of complex disease given their large sample size, racial/ethnic diversity, and breadth and depth of phenotypes, traits, and exposures. A major disadvantage of these collections is they often survey households and communities without collecting extensive pedigree data. Failure to account for substantial relatedness can lead to inflated estimates and spurious associations. To examine the extent of cryptic relatedness in an epidemiologic collection, we as the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study accessed the National Health and Nutrition Examination Surveys (NHANES) linked to DNA samples ("Genetic NHANES") from NHANES III andNHANES 1999-2002. NHANES are population-based cross-sectional surveys conducted by the National Center for Health Statistics at the Centers for Disease Control and Prevention. Genome-wide genetic data is not yet available in NHANES, and current data use agreements prohibit the generation of GWAS-level data in NHANES samples due issues in maintaining confidentiality among other ethical concerns. To date, only hundreds of single nucleotide polymorphisms (SNPs) genotyped in a variety of candidate genes are available for analysis in NHANES. We performed identity-by-descent (IBD) estimates in three self-identified subpopulations of Genetic NHANES (non-Hispanic white, non-Hispanic black, and Mexican American) using PLINK software to identify potential familial relationships from presumed unrelated subjects. We then compared the PLINKidentified relationships to those identified by an alternative method implemented in Kinship-based INference for Genome-wide association studies (KING). Overall, both methods identified familial relationships in NHANES III andNHANES 1999-2002 for all three subpopulations, but little concordance was observed between the two methods

INTRODUCTION
Epidemiologic cohorts are a valuable resource for genotypephenotype studies given their large sample size, racial/ethnic diversity, and wealth of phenotypes, traits, and exposures. Though typically ascertained for specific phenotypes, the breadth of phenotypic and environmental variables makes these cohorts particularly well-suited for inclusion in secondary analyses. With the addition of genetic data such as genotypes, these epidemiologic cohorts are positioned for inclusion in genetic association or linkage studies. Some examples of well-characterized studies that have genetic data include the Framingham Heart Study (Splansky et al., 2007), Women's Health Initiative (Anderson et al., 2003), and the Jackson Heart Study (Taylor, 2005).
Some epidemiologic cohorts specifically seek related individuals during the ascertainment process in order to study the heritability of certain traits in a similar genetic background or minimize certain environmental differences between individuals; for example, the Framingham Heart Study recruited participants from a Massachusetts town and subsequently enrolled their offspring in later phases of the study to identify factors that contribute to cardiovascular disease (Splansky et al., 2007), whereas the Marshfield Clinic Personalized Medicine Research Project participants are relatively ethnically homogenous and come from the Marshfield, Wisconsin area (McCarty et al., 2005). Others, such as the National Health and Nutrition Examination Surveys (NHANES), use an ascertainment process where multiple participants from a single household may be included without documentation of the relationship between those participants (Ezzati et al., 1992).
Genetic association studies are susceptible to confounding, through population substructure and cryptic relatedness (Astle and Balding, 2009). Population substructure and cryptic relatedness both result from underlying relatedness between individuals in a population that is greater than the amount expected in a freely mating population. Distant relatedness, relatedness that occurs on a macro level such as racial/ethnic or genetic ancestry is a known confounder in association studies, and multiple methods exist for its identification and adjustment in analysis (Liu et al., 2013). Cryptic relatedness implies unknown (undocumented) more recent relatedness and includes family relationships such as grandparent-grandchild and full sibling pairs. While known family structure is essential for genetic linkage or family based association studies, population-based association studies assume independent (unrelated) individuals. Cryptic relatedness in studies that assume independent individuals results in inflated effect size estimates and possible false positive associations; thus, adjustment for familial relationships within population-based studies is considered necessary.
Several methods have been developed to infer familial relationships primarily for linkage studies, including Graphical Representation of Relationship errors (GRR; Abecasis et al., 2001) and multiple hidden Markov model (HMM) programs (Boehnke and Cox, 1997;McPeek and Sun, 2000). Contemporary linkage studies consist of pedigrees of varying size and thousands of diallelic markers, and methods such as GRR are routinely applied to these data to identify errors in pedigree assignment. While existing methods such as GRR could be applied to unrelated samples to identify cryptic relatedness, the performance of this simple clustering method is dependent both on the number of samples genotyped and the number and frequency of the markers genotyped. The need for consistent identification of cryptic relatedness coupled with the availability of dense marker panels led to the development of algorithms for genome-wide association studies which typically consist of thousands of individuals (either unrelated or as pedigrees) and hundreds of thousands to millions of diallelic markers.
One such GWAS-developed algorithm is implemented in PLINK, a widely used software program for genetic association studies that calculates identity-by-descent (IBD) using a combination of identity-by-state (IBS) between individuals and allele frequency at each single nucleotide polymorphism (SNP), assuming Hardy-Weinberg Equilibrium (Purcell et al., 2007). At a given locus, two individuals carrying the same allele are said to be IBD if the alleles arose from the same ancestral allele; if the same alleles are not the product of the same ancestral allele, they are said to be IBS. An alternate method, Kinshipbased INference for Genome-wide association studies (KING) was recently developed to infer relationships using kinship coefficients (Manichaikul et al., 2010). A kinship coefficient is the probability that at a specific locus, the allele picked at random in two individuals is IBD (Lange and Sinsheimer, 1992). Both of these methods assume large numbers of SNPs to calculate relatedness between pairs of individuals and may not accurately assign distant relationships when a small number of SNPs is used. In both simulated and actual GWAS-level data, KING is computationally faster than PLINK, and its framework is more flexible allowing for small sample sizes and population heterogeneity.
Most but not all epidemiologic cohorts or cross-sectional studies available for genetic association studies have GWASlevel data available on all or a fraction of the participants available. For these GWAS-less studies, basic quality control for genetic association studies such as the identification of cryptic relatedness or population substructure can be a challenge. One such study that routinely faces this challenge is the large, population-based NHANES. DNA samples were obtained from NHANES participants between 1991 and 1994 (NHANES III) and NHANES 1999-2002, and Genetic NHANES consists of several racial/ethnic subgroups: non-Hispanic whites (n = 6,634), non-Hispanic blacks (n = 3,458), and Mexican Americans (n = 3,950). Genetic NHANES has been genotyped for hundreds of SNPs (range n = 364-784) mostly selected to replicate and generalize genome-wide association study findings in diverse populations as part of the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study, a study site of the larger Population Architecture using Genomics and Epidemiology (PAGE) I study (Matise et al., 2011).
National Health and Nutrition Examination Survey participants are recruited by household; however, familial relationships are not consistently included in the data collection process. To understand the extent of gross cryptic relatedness, we inferred familial relationships using PLINK and KING, both developed for genetic association study settings involving large samples of presumably unrelated participants. Despite limited resolution, we identified cryptic relatedness in these large cross-sectional surveys using both of these methods and call attention to the potential for hidden familial relationships in epidemiologic cohorts accessed for genetic association studies.

Study Population
The National Health and Nutrition Examination Surveys are population-based surveys conducted across the U. S. by the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC). NHANES began in the 1960s; the Third NHANES (NHANES III) was performed from 1988 to 1994 and included 33,994 participants. From 1999, NHANES surveys have been performed continuously. NHANES 1999-2002 included 25,316 participants. NHANES capture selfdescribed race/ethnicity on all participants. Oversampling of Mexican Americans, non-Hispanic blacks, children, and the elderly were performed to create nationally representative samples. DNA is available for 7,159 participants over age 12 in NHANES III (non-Hispanic white, n = 2,631; non-Hispanic black, n = 2,108; Mexican American, n = 2,073; other, n = 348) and for 7,839 participants (non-Hispanic white, n = 4003; non-Hispanic black, n = 1350; Mexican American, n = 1877; other Hispanic, n = 418; other race including multi-racial, n = 191) in NHANES 1999-2002. Written informed consent was obtained from all participants. The CDC Ethics Review Board reviewed the present study, and the de-identified NHANES data were considered non-human subjects research by the Vanderbilt University Institutional Review Board.

Statistical Analysis
Participants were stratified by self-reported race/ethnicity in both NHANES datasets; IBD was estimated in non-Hispanic white, non-Hispanic black, and Mexican American subgroups. SNPs were excluded from the analysis if, within a given NHANES race/ethnicity, they were significantly out of Hardy-Weinberg Equilibrium (HWE; p < 0.001). To obtain a set of independent SNPs, we further excluded SNPs that were in high linkage disequilibrium (LD) with each other (r 2 > 0.60) using the -indep command in PLINK. Two different programs, PLINK (Purcell et al., 2007) and KING (Manichaikul et al., 2010), were used to calculate IBD within each NHANES subgroup.
The -genome command was used to calculate the IBD estimates using PLINK. We pre-specified the classification scheme for potential familial relationships based on ranges of z-scores and π-hat scores (proportion IBD) for PLINK ( Table 1). The -kinship command with additional parameters of -homo and -show-IBD were used to calculate IBD estimates using KING. Kinship coefficient ranges were used to identify potential familial relationships in KING (Table 1). Further classification of first degree relationships from KING were obtained by evaluating the IBD0 values; parent/child relationships will have IBD0 = 0 except for instances of genotyping error. We compared the ability of the software programs to identify cryptic relatedness within NHANES racial/ethnic groups using less than 1,000 SNPs. Due to the data use agreement with the CDC, any participant counts < 5 are suppressed for confidentiality concerns.

RESULTS
We calculated IBD estimates and potential familial relationships in NHANES III (n = 6,811) and NHANES 1999-2002 (n = 7,230) non-Hispanic whites, non-Hispanic blacks, and Mexican Americans using PLINK and KING software. In both NHANES III andNHANES 1999-2002, the majority of the participants   Table 2).
In NHANES III, PLINK calculated more potential familial relationships than KING in all subgroups. We identified few instances of potential duplicate samples/monozygotic (MZ) twins (Table 3) using either program; however, we observed no concordance between the potential duplicates/MZ twin pairs (Figure 1). A number of first degree relationships (parent/offspring and full sibling) were calculated with both methods. PLINK calculated the greatest number of potential first degree relationships in non-Hispanic whites (n = 1,864) and the majority of these relationships were parent/offspring (n = 1,843), while Mexican Americans full sibling pairs were the most common first degree relationship calculated with KING (n = 194; Table 3). Across all subgroups, most first degree relationships were categorized by PLINK as parent/offspring and by KING as full sibling ( Table 3). The greatest number of second degree and third degree relationships was observed in non-Hispanic blacks using PLINK (n = 237,154) and in non-Hispanic whites using KING (n = 162,609; Table 3). We found little concordance (<20% same relationship identified for a given participant pair) between the methods to identify the same participants' relationships, though 60% of non-Hispanic white parent/offspring pairs calculated with KING were classified as parent/offspring pairs with PLINK (Figure 1). We observed similar concordance/discordance among the results for non-Hispanic black parent/offspring pairs (50% concordant) and Mexican American parent/offspring pairs (33.33% concordant; Figure 1). Within the discordant results in each subgroup, a number of each potential familial relationship identified through one method was classified as a different familial relationship using the other method. Where the methods yielded a different potential familial relationship for a given participant pair, we found PLINK calculated the pair as more related (e.g., KING calculated third degree relationships classified as second degree or first degree in PLINK). This observation held across all subgroups and familial relationships, except second degree relationships in non-Hispanic blacks and Mexican Americans, where KINGcalculated second degrees were more frequently classified as PLINK third degrees than any other relationship. Broadly, PLINK and KING identified numerous potential first, second, and third degree relationships in NHANES III; however, there was little concordance between the results. In NHANES 1999-2002, we again observed PLINK calculated a greater number of potential familial relationships than KING for non-Hispanic whites and Mexican Americans; KING calculated a greater number of familial relationships in non-Hispanic blacks ( Table 4). There were few instances of MZ twins/duplicate samples among the three subgroups and the participant pairs were concordant between the two methods (Table 4; Figure 1). Fewer overall first degree relationships were calculated in NHANES 1999-2002 than in NHANES III (Tables 3 and 4). Similar to the trends observed in NHANES III, KING identified fewer instances of first degree relationships than PLINK (Table 4). PLINK calculated the greatest number of first degree relationships in non-Hispanic blacks (n = 498) and the majority of these were parent/offspring (n = 493), whereas Mexican American full sibling relationships were the most numerous relationship calculated by KING (n = 140; Table 4). The greatest number of second and third degree relationships were observed in non-Hispanic whites (n = 282,512) using PLINK, and in non-Hispanic blacks (n = 167,925) using KING ( Table 4). As with the NHANES III results, we found little concordance (<25%) between the participant pairs identified within each potential familial relationship with the exception of parent/offspring pairs across the three subgroups, in which >95% of parent/offspring pairs identified with KING were also identified with PLINK (Figure 1). In addition, we observed a similar trend within discordant pairs, where PLINK calculated the pair as more related than with KING, with the exception of non-Hispanic white second degree relationships which were more often classified as third degree relationships in PLINK. Overall, we identified potential cryptic relatedness in NHANES 1999-2002 using PLINK and KING, though concordance between the two methods was generally lacking.

DISCUSSION
Cryptic relatedness is a source of confounding in populationbased genetic association studies. Population stratification due to distant ancestry is typically accounted for in genetic association studies; however, more recent shared ancestry may not be considered. Though beneficial for epidemiologic, linkage, and family based association studies, close relatedness (first and second degree) may lead to spurious results and/or inflated effect estimates in population-based association studies where an underlying assumption is that the individuals are unrelated. Additionally, specifically testing the influence of multiple interactions on heritability estimates requires known relatedness between individuals (Patel et al., 2013). Epidemiologic cohorts and cross-sectional surveys that inconsistently collect pedigree data on their participants are particularly at risk for confounding due to cryptic relatedness. We estimated likely familial relationships in NHANES III and NHANES 1999-2002 using ∼730 and ∼370 SNPs, respectively, and two methods, PLINK and KING. Using PLINK, which estimates IBD by calculating IBS and allele frequencies at each SNP, we observed many potential first degree relationships in NHANES III across the three population subgroups. The majority of these first degree relationships in PLINK were likely parent/offspring pairs. In NHANES 1999-2002, we observed fewer likely first degree relationships than in NHANES III. Thousands of potential second and third degree relationships were calculated using PLINK in both NHANES III andNHANES 1999-2002. Using KING, which estimates IBD using kinship coefficients, we observed far fewer first degree relationships in NHANES III and NHANES 1999-2002 than with PLINK. The majority of the first degree relationships calculated by KING were full sibling pairs, in contrast to the parent/offspring majority identified with PLINK. In NHANES 1999-2002, the majority of first degree relationships were full sibling pairs. We identified numerous potential second and third degree relationships using KING. Using both PLINK and KING, which calculate familial relationships with different statistical methods, we observed relatedness in NHANES III and NHANES 1999-2002 across non-Hispanic white, non-Hispanic black, and Mexican American subgroups.
An earlier study evaluated familial relationships in NHANES III using short tandem repeats (STRs) at fifteen DNA loci used in forensic investigations (Identifiler R ) and found evidence of first and second degree relationships (Katki et al., 2010). Studies suggest that approximately 50 SNPs yield the same relationship discrimination as 13-15 STRs (Butler, 2007), and paternity results obtained by STR and SNP analysis are generally comparable (Dario et al., 2009). In acknowledgment of the lack of self-reported pedigree information for NHANES III participants, Katki et al. (2010) calculated the likely relationships for participants living in multi-person households with an exact method and an IBS method. Both methods were generally consistent in identifying first degree relationship pairs, but there was greater variation in the classification of likely second degree and first cousin relationships. It was further observed that the STRs in the Identifiler test were not informative enough to accurately discriminate between second degree, third degree, and unrelated participants (Katki et al., 2010). Using information not available to our study, Katki et al. (2010) also compared the likely familial relationships with the ages of the participants; they found only 5% of the parent-offspring pairs had age differences less than 16 years and 9% of sibling pairs with age differences greater than 25 years, suggesting few of their proposed first degree relationships were misclassified. However, their method may have underestimated the true level of relatedness in NHANES III by restricting potential familial relationships to participants from the same multi-person household; given the NHANES ascertainment process, it is likely additional family relationships exist, such as siblings or second and third degree relationships, within a given geographic region that may be present in the NHANES datasets.
In general, our results are consistent with those of Katki et al. (2010). We observed evidence of cryptic relatedness in NHANES III using both PLINK and KING methods. Similar to the counts observed in Katki et al. (2010), our PLINK results uncovered a high number of parent-offspring relationships. In contrast, our KING results calculated fewer first degree relationships and classified the majority of them as full sibling pairs. The similarities between our PLINKcalculated likely relationships and those obtained by Katki et al. (2010) may be due to PLINK's method of calculating IBD using IBS and allele frequencies, though notably, in Katki et al. (2010), the exact method classification and IBS method were nearly identical. Our study is the first, to our knowledge, to consider cryptic relatedness in NHANES 1999NHANES -2002therefore, no comparison to other studies could be drawn.
Despite the general agreement of our study with the previous Katki et al. (2010) work demonstrating cryptic relatedness in NHANES III and the results of the present study for NHANES 1999-2002, we found no evidence to suggest cryptic relatedness has yet led to inflated effect sizes or spurious positive associations in published studies accessing these NHANES datasets. Most of the published studies that have used the genetic NHANES datasets have been replication studies or generalization of prior findings to diverse populations (Crawford et al., unpublished). Formal meta-analysis of genetic NHANES association data with other epidemiologic cohorts generally does not reveal significant genetic heterogeneity or differences in genetic effect sizes (Dumitrescu et al., 2011b;Haiman et al., 2012;Carty et al., 2013;Fesinmeyer et al., 2013b;Jeff et al., 2014;Restrepo et al., 2014). Future studies that access these datasets to identify novel genetic variants should consider the potential for false positives and/or inflated effect sizes in their results and verify allele frequencies are comparable to published data.
A limitation to our study is the small number of SNPs used to identify likely familial relationships. Also, the SNPs in NHANES III do not necessarily overlap with SNPs in NHANES 1999NHANES -2002 in different SNP sets with different polymorphism information content. These differences may account in part for the observation that PLINK results were less concordant with KING results in NHANES III compared with NHANES 1999-2002 despite the greater number of SNPs available for analysis. In general, PLINK, KING, and other software programs that calculate cryptic relatedness, require large numbers of SNPs (typically GWAS-level data) to calculate IBD; the more markers are used in the calculation, the greater the stability and accuracy of the IBD estimates (Marchani et al., 2009), and the greater the confidence in the estimated familial relationships. Using fewer than the thousands of SNPs expected by these programs may have resulted in inflation of IBD estimates in our study and led to the discordant results between the two methods that we observed. A proportion of the discordant results may also be due to the fact that PLINK requires independent SNPs whereas KING does not. We used the same SNP set for both programs to make direct comparisons, but this restriction to independent SNPs coupled with the already limited number of SNPs available in Genetic NHANES may have compounded the discordance observed between the two methods.
Most epidemiologic studies with DNA have GWAS data, allowing investigators to more accurately assess the level of relatedness in the dataset, contrasting with the problem of assessing cryptic relatedness in genotype-limited datasets such as NHANES. Per CDC data use agreements, GWAS-level genotyping is not permitted at this time by investigators outside of CDC or without a CDC contract. While there are multiple programs for accurately estimating IBD given thousands of SNPs, there are relatively few options for epidemiologic collections with hundreds or fewer genotyped SNPs. Recently, a new R package, CrypticIBDCheck, was published demonstrating the ability to calculate IBD with as few as 60 candidate genes (∼300 SNPs) which have not been LD pruned for independence (Nembot-Simo et al., 2013). Future studies of Genetic NHANES should evaluate this and emerging tools to identify cryptic relatedness and properly adjust for its potential impact on downstream genetic association studies.
Another limitation of the current study is the lack of pedigree data available in NHANES, which unlike simulated data with known pedigree structures used to originally compare PLINK and KING (Manichaikul et al., 2010), prohibits the full evaluation of any method used to identify cryptic relatedness. Both PLINK and KING identified familial relationships in Genetic NHANES, but there was little concordance with respect to the number and type of familial relationships between the two methods. Under this scenario, alternative approaches to adjusting for cryptic relatedness in downstream analyses (such as generalized estimating equations to account for correlated data (Tregouet et al., 1997) may be more appropriate. Adjustment for correlated data has the additional advantage of preserving sample size and power compared with the usual practice of removing individuals from datasets after relationships have been identified. However, like PLINK and KING, methods such as mixed linear models (Yang et al., 2014) work best with genome-wide as opposed to sparse candidate gene data.
In summary, we have estimated familial relationships in Genetic NHANES, an epidemiologic cross-sectional study with sparse candidate gene SNPs available for genetic association studies. We implemented the commonly used PLINK and compared these familial relationships to those estimated by a second method implemented in KING. Familial relationships were identified in Genetic NHANES using both methods, but little concordance was observed between the methods. In absence of pedigree data, it is not possible to determine which of the two methods is more accurate in estimating familial relationships in this collection of cross-sectional surveys. In absence of GWASlevel data, basic quality control will continue to be a challenge for NHANES and similar epidemiologic collections. Further research is needed to identify alternate approaches to identify cryptic relatedness using sparse SNP data such as those available in Genetic NHANES.

AUTHOR CONTRIBUTIONS
The listed authors provided substantial contributions to the conception or design of the work (JM, DC), the acquisition (DC), analysis (RG, KB-G), or interpretation of the data (JM, RG, KB-G) for the work; drafted the work (JM) or revised it critically for important intellectual context (RG, KB-G, DC); gave final approval of the version to be published (JM, RG, KB-G, DC); and agreed to be accountable for all aspects of the work in ensuring that questions related to accuracy or integrity of any part of the work are appropriately investigated and resolved (JM, RG, KB-G, DC).

FUNDING
Genotyping in NHANES was supported in part by the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) Study (U01HG004798 and its ARRA supplements) as part of the Population Architecture using Genomics and Epidemiology (PAGE) I study established by the National Human Genome Research Institute (NHGRI). Select NHANES III data presented here were genotyped under funding provided by the University of Washington's Center for Ecogenetics and Environmental Health (CEEH) supported by the National Institute of Environmental Sciences (NIEHS; 5 P30 ES007033-12) and the Vanderbilt Institute for Clinical and Translational Science Award (VICTR) as part of the Clinical and Translational Science Award (CTSA) grant (1UL1 RR024975-01) from the National Center for Research Resources at the National Institutes of Health (NIRR/NIH). Finally, genotyping services for select NHANES III SNPs presented here were provided by the Johns Hopkins University under federal contract number (N01-HV-48195) from NHLBI.