%A Panarella,Michela
%A Burkett,Kelly M.
%D 2019
%J Frontiers in Genetics
%C
%F
%G English
%K association study,Extreme phenotype sampling,population stratification,Principal Component Analysis,Type 1 error
%Q
%R 10.3389/fgene.2019.00398
%W
%L
%M
%P
%7
%8 2019-May-03
%9 Original Research
%#
%! Population stratification under EPS
%*
%<
%T A Cautionary Note on the Effects of Population Stratification Under an Extreme Phenotype Sampling Design
%U https://www.frontiersin.org/articles/10.3389/fgene.2019.00398
%V 10
%0 JOURNAL ARTICLE
%@ 1664-8021
%X Extreme phenotype sampling (EPS) is a popular study design used to reduce genotyping or sequencing costs. Assuming continuous phenotype data are available on a large cohort, EPS involves genotyping or sequencing only those individuals with extreme phenotypic values. Although this design has been shown to have high power to detect genetic effects even at smaller sample sizes, little attention has been paid to the effects of confounding variables, and in particular population stratification. Using extensive simulations, we demonstrate that the false positive rate under the EPS design is greatly inflated relative to a random sample of equal size or a “case-control”-like design where the cases are from one phenotypic extreme and the controls randomly sampled. The inflated false positive rate is observed even with allele frequency and phenotype mean differences taken from European population data. We show that the effects of confounding are not reduced by increasing the sample size. We also show that including the top principal components in a logistic regression model is sufficient for controlling the type 1 error rate using data simulated with a population genetics model and using 1,000 Genomes genotype data. Our results suggest that when an EPS study is conducted, it is crucial to adjust for all confounding variables.
For genetic association studies this requires genotyping a sufficient number of markers to allow for ancestry estimation. Unfortunately, this could increase the costs of a study if sequencing or genotyping was only planned for candidate genes or pathways; the available genetic data would not be suitable for ancestry correction as many of the variants could have a true association with the trait.
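The inflation described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' simulation code: it assumes two equally sized subpopulations with made-up allele frequencies (0.2 and 0.4) and a phenotype mean shift of 0.5 standard deviations, and it swaps in simple tests (an allelic two-proportion z-test for the EPS sample, a normal-approximation correlation test for the random sample) in place of the regression models used in the article. All function names and parameter values are hypothetical. The variant has no effect on the phenotype, so every rejection is a false positive.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_cohort(rng, size, freqs, means):
    """Draw (genotype, phenotype) pairs; the genotype never affects the phenotype.

    Each individual belongs to one of two equally likely subpopulations that
    differ in allele frequency and phenotype mean (assumed values, not data).
    """
    cohort = []
    for _ in range(size):
        pop = rng.randrange(2)
        genotype = (rng.random() < freqs[pop]) + (rng.random() < freqs[pop])
        cohort.append((genotype, rng.gauss(means[pop], 1.0)))
    return cohort


def eps_test(cohort, n):
    """Allelic z-test comparing the n/2 lowest and n/2 highest phenotypes."""
    ranked = sorted(cohort, key=lambda pair: pair[1])
    half = n // 2
    low = sum(g for g, _ in ranked[:half])
    high = sum(g for g, _ in ranked[-half:])
    alleles = 2 * half  # alleles observed in each extreme group
    pooled = (low + high) / (2 * alleles)
    se = math.sqrt(2 * pooled * (1 - pooled) / alleles)
    if se == 0:
        return 1.0
    return two_sided_p((high - low) / alleles / se)


def random_sample_test(rng, cohort, n):
    """Genotype-phenotype correlation test on a simple random sample of size n."""
    sample = rng.sample(cohort, n)
    mean_g = sum(g for g, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sgg = sum((g - mean_g) ** 2 for g, _ in sample)
    syy = sum((y - mean_y) ** 2 for _, y in sample)
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in sample)
    if sgg == 0 or syy == 0:
        return 1.0
    r = sgy / math.sqrt(sgg * syy)
    # t-statistic for the regression slope, referred to a normal distribution
    return two_sided_p(r * math.sqrt((n - 2) / (1 - r * r)))


def false_positive_rates(freqs, means, cohort_size=4000, n=400,
                         reps=200, alpha=0.05, seed=1):
    """Return (EPS rate, random-sample rate) of rejections under the null."""
    rng = random.Random(seed)
    eps_hits = random_hits = 0
    for _ in range(reps):
        cohort = simulate_cohort(rng, cohort_size, freqs, means)
        eps_hits += eps_test(cohort, n) < alpha
        random_hits += random_sample_test(rng, cohort, n) < alpha
    return eps_hits / reps, random_hits / reps
```

Calling `false_positive_rates((0.2, 0.4), (0.0, 0.5))` gives the two unadjusted rejection rates for the stratified cohort, while `false_positive_rates((0.3, 0.3), (0.0, 0.0))` gives a control without stratification. Neither test adjusts for ancestry, so the sketch shows only the problem, not the principal-component correction the abstract recommends.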