Association Analysis of Candidate Variants in Admixed Brazilian Patients With Genetic Generalized Epilepsies

Genetic generalized epilepsies (GGEs) include well-established epilepsy syndromes with generalized onset seizures: childhood absence epilepsy, juvenile myoclonic epilepsy (JME), juvenile absence epilepsy (JAE), myoclonic absence epilepsy, epilepsy with eyelid myoclonia (Jeavons syndrome), generalized tonic–clonic seizures, and generalized tonic–clonic seizures alone. Genome-wide association studies (GWASs) and exome sequencing have identified 48 single-nucleotide polymorphisms (SNPs) associated with GGE. However, these studies were mainly based on non-admixed, European, and Asian populations. Thus, it remains unclear whether these results apply to patients of other origins. This study aims to evaluate whether these previous results could be replicated in a cohort of admixed Brazilian patients with GGE. We obtained SNP-array data from 87 patients with GGE, compared with 340 controls from the BIPMed public dataset. We could directly access genotypes of 17 candidate SNPs, available in the SNP array, and the remaining 31 SNPs were imputed using the BEAGLE v5.1 software. We performed an association test by logistic regression analysis, including the first five principal components as covariates. Furthermore, to expand the analysis of the candidate regions, we also interrogated 14,047 SNPs that flank the candidate SNPs (1 Mb). The statistical power was evaluated in terms of odds ratio and minor allele frequency (MAF) by the genpwr package. Differences in SNP frequencies between Brazilians and Europeans, sub-Saharan Africans, and Native Americans were evaluated by a two-proportion Z-test. We identified nine flanking SNPs, located on eight candidate regions, which presented association signals that passed the Bonferroni correction (rs12726617; rs9428842; rs1915992; rs1464634; rs6459526; rs2510087; rs9551042; rs9888879; and rs8133217; p-values <3.55e–06).
In addition, the two-proportion Z-test indicates that the lack of association of the remaining candidate SNPs could be due to different genomic backgrounds observed in admixed Brazilians. This is the first time that candidate SNPs for GGE are analyzed in an admixed Brazilian population, and we could successfully replicate the association signals in eight candidate regions. In addition, our results provide new insights on how we can account for population structure to improve risk stratification estimation in admixed individuals.

INTRODUCTION
Genetic generalized epilepsies (GGEs) are a group of epilepsy syndromes in which the main feature is the recurrence of generalized onset seizures with no known or suspected etiology other than possible genetic predisposition (Berg et al., 2010; Scheffer et al., 2017). GGEs are among the most common types of epilepsy, with an estimated prevalence of 190 per 100,000 individuals (Aaberg et al., 2017). They include well-established syndromes: childhood absence epilepsy, juvenile myoclonic epilepsy (JME), juvenile absence epilepsy (JAE), myoclonic absence epilepsy (a rare form of GGE), epilepsy with eyelid myoclonia (Jeavons syndrome), generalized tonic-clonic seizures, and generalized tonic-clonic seizures alone (Berg et al., 2010). These different GGE syndromes share most genetic susceptibility factors, suggesting an important correlation among the clinical subtypes (International League Against Epilepsy Consortium on Complex Epilepsies, 2018). The diagnosis of GGE relies mainly on clinical information and electroencephalographic examination (Scheffer et al., 2017).
Previous genome-wide association studies (GWASs) and exome sequencing analyses have identified 48 single-nucleotide polymorphisms (SNPs) putatively associated with susceptibility to the GGEs (EPICURE Consortium et al., 2012a,b; Zhang et al., 2014; Wang et al., 2019). It is well known that admixed American populations are underrepresented in GWASs, decreasing the accuracy of replicating, predicting, and estimating polygenic risks for complex disorders in these populations (Martin et al., 2017, 2019).
Therefore, this work aims to investigate if a genetic association exists between previously reported candidate SNPs and GGEs in a cohort of admixed Brazilians. To accomplish this goal, we first investigated the population structure of Brazilian patients with GGE. Subsequently, we performed an association study using the 48 previously reported candidate SNPs and their flanking regions.

Subjects
We evaluated a total of 87 patients with GGE who were followed up prospectively in the outpatient epilepsy clinic of the University of Campinas (UNICAMP) hospital. All patients had the diagnosis of GGE according to criteria established by the International League Against Epilepsy (ILAE) (Berg et al., 2010; Fisher et al., 2014). Patients were compared with a group of 340 individuals without any neurological disorder from the BIPMed database (Rocha et al., 2020). Both samples are predominantly from the Southeastern region of Brazil. Among the patients with GGE, we found 63 with JME, 10 with JAE, four with generalized tonic-clonic seizures alone, two with Jeavons syndrome, one with myoclonic absence epilepsy, one with epilepsy with generalized tonic-clonic seizures, and six patients in whom a specific GGE syndrome could not be determined. All research participants signed an informed consent form previously approved by our Institutional Research Ethics Committee (IRB # 12112913.3.0000.5404).

Single-Nucleotide Polymorphism Quality Control and Population Structure Analysis
We extracted the genotypes for the 48 candidate SNPs (Table 1) from the SNP-array data generated by the Genome-Wide Human SNP Array 6.0 (Affymetrix Inc., Thermo Fisher Scientific, Waltham, MA, United States). These SNP-array data contain 905,171 available SNPs (GRCh37 build). To obtain an unbiased estimation of the population structure of our samples, we processed the SNP-array dataset of the 87 patients with GGE and the 340 BIPMed controls according to previous processing recommendations and pipelines (Anderson et al., 2010; Secolin et al., 2019). First, we removed ambiguous variants (with G/C or A/T alleles) from each dataset. Next, we merged the two datasets into one larger admixed Brazilian dataset (N = 427), maintaining only biallelic SNPs, autosomal SNPs, SNPs without Hardy-Weinberg disequilibrium (p-value <0.000001), and missing data <10%. Then, we estimated the heterozygosity rate for each sample and removed individuals with heterozygosity rates higher or lower than three standard deviations from the mean to avoid individuals with high inbreeding (low heterozygosity rates) or sample contamination (high heterozygosity rates). We also removed pairs of individuals who presented a proportion of identical-by-state (IBS) alleles >0.85, which could indicate duplicated samples, and individuals with genomic relatedness matrix estimations higher than 0.125, which is the expected genomic relatedness for third-degree relatives (Anderson et al., 2010). The merging process, genotyping, and sample filtering were performed using PLINK 1.9 software (Purcell et al., 2007).
Table footnote: The positions are based on GRCh37. SNPs with an asterisk (*) were obtained by BEAGLE imputation. BP, base pairs; PMID, PUBMED ID publications; MAF, minor allele frequency; HWE, Hardy-Weinberg equilibrium; OR, odds ratio; CI, confidence interval.
Subsequently, we merged the filtered admixed Brazilian sample with the 1000 Genomes Project (1KGP) dataset (The 1000 Genomes Project Consortium et al., 2015), maintaining only the SNPs present in the admixed Brazilian sample. After merging, we removed SNPs with a minor allele frequency (MAF) < 0.01 and SNPs in linkage disequilibrium (LD), using the following parameters: window size = 50 SNPs, shift step = 5 SNPs, and r² = 0.5 (Anderson et al., 2010). We compared our dataset with the 1KGP data by principal component analysis (PCA) using PLINK v1.9 software (Purcell et al., 2007) to evaluate the presence of population-based outliers in the Brazilian samples.
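The sample-level heterozygosity filter described in this section (removing individuals whose heterozygosity rate lies more than three standard deviations from the mean) can be illustrated with a minimal Python sketch. This is not the PLINK-based pipeline used in the study; the function names, the 0/1/2 genotype coding, and the use of `None` for missing calls are assumptions made only for illustration.

```python
from statistics import mean, stdev


def heterozygosity_rate(genotypes):
    """Fraction of non-missing genotype calls that are heterozygous.

    Genotypes are assumed to be coded as minor-allele counts (0, 1, 2),
    with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("sample has no called genotypes")
    return sum(1 for g in called if g == 1) / len(called)


def flag_heterozygosity_outliers(rates, n_sd=3.0):
    """Return the IDs of samples whose rate is more than n_sd SDs from the mean.

    `rates` maps a sample ID to its heterozygosity rate. Low outliers may
    indicate inbreeding; high outliers may indicate sample contamination.
    """
    values = list(rates.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation
    return sorted(s for s, r in rates.items() if abs(r - mu) > n_sd * sd)
```

With a small number of samples, a single extreme value inflates the standard deviation enough that it may not exceed the three-SD cutoff, so a filter of this kind is only meaningful on cohorts of reasonable size.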
To evaluate whether patients with GGE and BIPMed controls present population stratification, we performed the analysis of molecular variance (AMOVA) (Excoffier et al., 1992) using the poppr.amova R package and the RStudio interface, comparing the genetic distance between the two groups based on a set of 10,000 random SNPs across the genome. The AMOVA partitions the source of genetic variance (σ²) into two components: within-groups and between-groups. The null hypothesis states that the samples were obtained from a global population, with variation due to random sampling in the construction of populations. Thus, we would expect a high heterogeneity within groups (σ² = 100%) and no heterogeneity between groups (σ² = 0%). On the other hand, under the alternative hypothesis, each group was obtained from different populations, and we would expect a low heterogeneity within groups (σ² < 100%) and high heterogeneity between groups (σ² > 0%) (Excoffier et al., 1992). Therefore, to evaluate the significance of σ² components, we generated a Monte Carlo null distribution of 10,000 variance components and tested against the observed variance components by the randtest function in the ade4 R package.
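The logic of partitioning variance into within-group and between-group components and testing the result against a Monte Carlo null distribution can be sketched as follows. This is a simplified one-level AMOVA computed from a matrix of squared pairwise distances, written in Python for illustration; it is not the poppr/ade4 implementation used in the study. The function names are hypothetical, and the p-value convention `(hits + 1) / (n_perm + 1)` is an assumption about how the permutation result is summarized.

```python
import random


def phi_st(dist2, labels):
    """One-level AMOVA phi statistic from squared pairwise distances.

    dist2[i][j] is the squared distance between samples i and j;
    labels[i] is the group (e.g. "case" or "control") of sample i.
    """
    n = len(labels)
    groups = {}
    for i, g in enumerate(labels):
        groups.setdefault(g, []).append(i)
    k = len(groups)

    # Sum of squared deviations: sum over unordered pairs divided by group size.
    ssd_total = sum(dist2[i][j] for i in range(n) for j in range(i + 1, n)) / n
    ssd_within = 0.0
    for members in groups.values():
        pair_sum = sum(
            dist2[a][b] for x, a in enumerate(members) for b in members[x + 1:]
        )
        ssd_within += pair_sum / len(members)
    ssd_among = ssd_total - ssd_within

    msd_among = ssd_among / (k - 1)
    msd_within = ssd_within / (n - k)
    # Average group size corrected for unequal group sizes.
    n0 = (n - sum(len(m) ** 2 for m in groups.values()) / n) / (k - 1)

    var_among = (msd_among - msd_within) / n0
    var_within = msd_within
    return var_among / (var_among + var_within)


def permutation_p_value(dist2, labels, n_perm=999, seed=0):
    """Shuffle group labels to build a null distribution for phi."""
    rng = random.Random(seed)
    observed = phi_st(dist2, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if phi_st(dist2, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Because the null distribution is built by relabeling samples, even a very small between-group component can be statistically significant when the number of markers and samples is large, which is consistent with using the test as a trigger for covariate adjustment rather than as a measure of effect size.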

Single-Nucleotide Polymorphism Selection and Imputation
We observed that 31 SNPs were not found in the SNP-array dataset. Therefore, we performed an imputation of all 48 SNPs to obtain the missing SNPs and to evaluate the concordance between the imputed genotypes and the genotypes assessed by the SNP array. Since we analyzed a sample of admixed individuals, we elected to perform the imputation using two approaches. First, we phased and imputed the dataset using SHAPEIT2 v2.r387 (O'Connell et al., 2014) and BEAGLE v5.1 software (Browning et al., 2018) using the default software parameters for phasing and imputation. As a reference for the BEAGLE imputation, we used the 1KGP dataset (GRCh37/hg19 assembly) (The 1000 Genomes Project Consortium et al., 2015). To save on computation time, we imputed only the chromosomes in which the candidate SNPs are located (Table 1). We also evaluated whether the genotypes were successfully imputed by the correlation (in terms of r²) of genotype dosage values between the imputed genotypes and true genotypes used as a reference from the 1KGP provided by the BEAGLE software. For the second imputation approach, we used the TOPMED Imputation Server (Das et al., 2016), with the TOPMed v.R2 on GRCh38 build (Kowalski et al., 2019). The TOPMED server imputation performed the liftover from GRCh37 to GRCh38 and the phasing using the EAGLE v.2.4 algorithm. Finally, imputation was performed by minimac4.
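Two quantities are used above to judge imputation quality: the agreement between array-genotyped and imputed calls, and the squared correlation between dosage values. A minimal Python sketch of both is given below. The function names and the 0/1/2 genotype coding are assumptions for illustration, and the r² computed here is a plain squared Pearson correlation, not the quality estimate that BEAGLE reports internally.

```python
def genotype_concordance(array_calls, imputed_calls):
    """Proportion of samples whose imputed genotype equals the array genotype.

    Genotypes are assumed to be coded 0/1/2; pairs with a missing (None)
    call on either side are skipped.
    """
    pairs = [
        (a, b)
        for a, b in zip(array_calls, imputed_calls)
        if a is not None and b is not None
    ]
    if not pairs:
        raise ValueError("no overlapping genotype calls to compare")
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def dosage_r2(reference, imputed):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(reference)
    if n != len(imputed) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    mean_x = sum(reference) / n
    mean_y = sum(imputed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, imputed))
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in imputed)
    if sxx == 0 or syy == 0:
        raise ValueError("r2 is undefined when one vector has no variance")
    return (sxy * sxy) / (sxx * syy)
```

Concordance is the more direct check when the same SNP is available from both the array and the imputation, as for the 17 directly genotyped candidates; r² is the natural choice when only dosages are available.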

Candidate Single-Nucleotide Polymorphism Association Analysis
After genotype and individual filtering, 360 individuals remained (69 patients with GGE and 291 BIPMed controls), which were used in the association analysis. We estimated the statistical power of our sample by the genpwr package in R (Moore et al., 2020), which estimates statistical power for combinations of true and test genetic models (Dominant, Additive, Recessive, 2df/unspecified model). In this case, we evaluated the statistical power using a vector of MAFs (from 0.05 to 0.45, by 0.05) and a vector of odds ratios (from 1.5 to 2.0, by 0.1) since not all candidate SNPs presented OR estimations from previous studies. We also set the following parameters for genpwr: model = logistic; N = 360; case/control ratio = 69/291 = 0.237; and alpha = 0.05. We evaluated candidate SNP association and OR estimation by logistic regression analysis using the PLINK v1.9 software (Purcell et al., 2007), including the first five PCs as covariates. We did not include age, age at seizure onset, and sex since these variables have not been correlated with the GGE phenotype (Berg et al., 2010; Scheffer et al., 2018).
It has been reported that SNPs found to be associated with the phenotype by GWAS in one population may be only nominally associated or non-associated in another population due to differences in LD across populations (Akiyama et al., 2019; Chen et al., 2020; Graff et al., 2021); however, this does not mean that an association signal in the genomic region cannot be replicated. This is because the SNPs ascertained from GWAS are only tagging variants linked to causal ones. The lack of signals in the replication population could simply be caused by the broken linkage between tagging and causal variants. Therefore, to account for the difference in LD across populations and to investigate the transferability of previous GWAS signals, we used the SNP-array dataset, filtered for population structure and without LD pruning (652,883 SNPs), to interrogate by logistic regression the SNPs located within 1 Mb upstream and downstream of the candidate SNPs. We assumed a p-value adjusted by the Bonferroni correction to avoid biased results due to the multiple comparisons. In this case, we used two thresholds: the first threshold took into account the 48 SNPs (p-value = 0.05/48 = 0.001), assuming one effective test per region, which is a reasonable assumption and may lead to more informative results. However, this threshold may not be stringent enough. Therefore, we also evaluated the results under a second threshold, considering all the 48 candidate SNPs and the additional flanking SNPs tested, and the results were plotted using the qqman package in R software (Turner, 2014).
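The two Bonferroni thresholds can be checked with a few lines of arithmetic. The Python sketch below assumes that the second threshold divides alpha by the 48 candidate SNPs plus the 14,047 flanking SNPs reported elsewhere in this article; that denominator is an inference, chosen because it reproduces the reported value of 3.55e-06. The window helper and all names are hypothetical.

```python
ALPHA = 0.05
N_CANDIDATES = 48       # candidate SNPs from previous studies
N_FLANKING = 14_047     # flanking SNPs reported in this article
WINDOW_BP = 1_000_000   # 1 Mb upstream and downstream


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def in_flanking_window(snp_chrom, snp_bp, cand_chrom, cand_bp, window=WINDOW_BP):
    """True if a SNP lies within `window` base pairs of a candidate SNP."""
    return snp_chrom == cand_chrom and abs(snp_bp - cand_bp) <= window


# First threshold: one effective test per candidate region.
region_level = bonferroni_threshold(ALPHA, N_CANDIDATES)
# Second threshold (assumed denominator): candidates plus all flanking SNPs.
snp_level = bonferroni_threshold(ALPHA, N_CANDIDATES + N_FLANKING)

print(f"{region_level:.3f}")  # 0.001
print(f"{snp_level:.2e}")     # 3.55e-06
```

The region-level threshold treats each candidate region as a single hypothesis, whereas the SNP-level threshold treats every tested SNP as independent; because neighboring SNPs are correlated through LD, the second threshold is conservative.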
Since previous studies of GGE were based on European populations and admixed Brazilians have a large proportion of European ancestry, we decided to evaluate whether the candidate SNP allele frequencies are similar between Brazilian and European populations. We extracted European allele frequencies from the gnomAD database (Karczewski et al., 2020) and performed a two-proportion Z-test using the prop.test function in R. Also, we included African populations from gnomAD in the analysis due to the sub-Saharan African ancestry component present in Brazilian populations. However, since gnomAD does not separate Native American populations in the database, we included the Latin population in the analysis as a proxy.
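For readers unfamiliar with the test, a two-proportion Z-test on allele counts can be sketched in Python as follows. This is a plain pooled Z-test without continuity correction; R's `prop.test` applies a continuity correction by default and reports a chi-squared statistic, so its p-values can differ slightly from this sketch. The function name and the allele counts in the example are hypothetical.

```python
import math


def two_proportion_z_test(count1, n1, count2, n2):
    """Two-sided pooled Z-test for the difference between two proportions.

    count1/n1 and count2/n2 are allele counts over the number of
    chromosomes observed (twice the number of individuals for autosomes).
    Returns the Z statistic and the two-sided p-value.
    """
    p1 = count1 / n1
    p2 = count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the allele is monomorphic")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical example: 720 chromosomes from 360 admixed individuals
# compared with 10,000 chromosomes from a reference population.
z, p = two_proportion_z_test(count1=180, n1=720, count2=3000, n2=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because reference databases such as gnomAD contain very large numbers of chromosomes, even modest frequency differences can reach significance, so the size of the difference should be read alongside the p-value.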

Population Structure Analysis
The principal components in the PCA plot indicate that both cases and controls clustered together and were spread between Europeans, sub-Saharan Africans, and other admixed American populations (Figure 1). The AMOVA results showed that 99.61% of the genetic variation was observed within groups (patients or controls), and only 0.39% of the genetic variation was observed between groups (Table 2). Because we have one hierarchical level of stratification (patients/controls), the poppr.amova package provided a single ϕ-statistic = 0.0031, with a p-value = 0.001 (Table 2), indicating evidence of population stratification between patients and controls and the necessity of population structure correction in further association tests.

Single-Nucleotide Polymorphism Selection and Imputation
According to the imputation results from the BEAGLE software (Browning et al., 2018), the correlation (in terms of r²) between the estimated allele dosage and the true allele dosage, using the 1KGP as reference, presented a minimum value of 95%. In addition, all 17 SNPs genotyped by the SNP array were correctly imputed by the BEAGLE software. However, we observed that the 17 SNPs genotyped by the SNP array showed only 45.7% concordance (on average) with the genotypes imputed by the TOPMED server. Thus, we decided to perform further analysis using the imputed genotypes generated by the BEAGLE software.

Candidate Single-Nucleotide Polymorphism Association Analysis
As detailed in Table 1, one candidate SNP (rs1046276) was withdrawn from further association analysis due to Hardy-Weinberg disequilibrium (p < 0.000001). According to the analysis performed using the genpwr package (Moore et al., 2020), we observed that the Additive model presented the highest power estimation. We did not observe 80% statistical power for OR ≤ 1.6 (≥ 0.62 for protection effect) (Figures 2A,B). However, we calculated that our study had 80% power to detect an increased risk in terms of OR ≥ 1.7 (≤0.58 for protection effect) with MAF > 0.25 (Figure 2C), OR ≥ 1.8 (≤0.55 for protection effect) with MAF > 0.2 (Figure 2D), and OR ≥ 1.9 (≤0.52 for protection effect) with MAF > 0.15 (Figures 2E,F).
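The protection-effect values quoted alongside each risk OR appear to be the reciprocals of the risk ORs, truncated to two decimals; this reading is an inference from the numbers rather than something stated in the text. A short Python check, with hypothetical function names:

```python
import math


def protective_equivalent(or_risk):
    """Odds ratio of the same effect size expressed for the protective allele."""
    if or_risk <= 0:
        raise ValueError("an odds ratio must be positive")
    return 1.0 / or_risk


def truncate_2dp(value):
    """Truncate (not round) a positive number to two decimal places."""
    return math.floor(value * 100) / 100


for or_risk in (1.6, 1.7, 1.8, 1.9):
    print(or_risk, truncate_2dp(protective_equivalent(or_risk)))
# prints 0.62, 0.58, 0.55 and 0.52 for the four risk ORs
```

Expressing the threshold both ways is useful because the direction of an odds ratio depends only on which allele is taken as the reference.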
Since most Brazilian ancestry is derived from European populations (Kehdy et al., 2015; Moura et al., 2015; Secolin et al., 2019), we could hypothesize that effect sizes in terms of OR would present a higher correlation with European effect sizes than with those from Chinese or European/African American samples from previous studies (EPICURE Consortium et al., 2012a,b; International League Against Epilepsy Consortium on Complex Epilepsies, 2014, 2018; Zhang et al., 2014; Wang et al., 2019). Thus, we show a comparison of the OR estimations of 11 SNPs, which were available from the previous studies, and the OR estimations in our admixed Brazilian samples (Table 3). Remarkably, Chinese and European/African American samples also presented similar OR estimations compared with admixed Brazilians. Two SNPs had different OR estimations for admixed Brazilians compared with European and Chinese samples (rs10496964 and rs11890028).
Furthermore, the two-proportion Z-test results showed that 25 candidate SNPs have allele frequencies that differ between admixed Brazilians and the ancestral populations. Among them, 16 SNPs presented differences in allele frequencies between admixed Brazilians and European populations. All 25 SNPs presented different allele frequencies when comparing admixed Brazilians and African samples. Remarkably, we also found 15 candidate SNPs with different allele frequencies when comparing admixed Brazilians and the Latin American samples in the gnomAD database (Table 4).

DISCUSSION
The Brazilian population was formed by an admixture of three main ancestry populations: Europeans, sub-Saharan Africans, and Native Americans (Kehdy et al., 2015; Moura et al., 2015; Secolin et al., 2019). In this scenario, it is important to explore whether candidate SNPs previously identified as associated with complex disorders in non-admixed populations also display association signals in the Brazilian admixed population.  Zhang et al., 2014; Wang et al., 2019). These studies were all performed in non-admixed populations, predominantly of European ancestry, raising the question of reproducibility of these results in other populations. Lack of transferability of GWAS results and polygenic risk scores obtained from Europeans to admixed American populations has previously been reported (Martin et al., 2017, 2019), making it important to investigate whether these SNPs are associated with GGEs in our admixed Brazilian sample. An alternative explanation for the lack of reproducibility among populations relies on the observation that only tagging SNPs are ascertained in GWAS, and the lack of replication in different populations could be due to broken linkage between the tagging SNPs and the causal variants (Akiyama et al., 2019; Chen et al., 2020; Graff et al., 2021). Thus, we searched for SNPs flanking 1 Mb upstream and downstream of the candidate regions to investigate this issue. Indeed, we found 14,047 flanking SNPs, and nine of them presented statistically significant association signals after stringent corrections for multiple comparisons (p-value < 3.55e-06). These nine SNPs encompass eight candidate regions (  Wang et al., 2019). Therefore, we may suggest that polygenic risk scores calculated in European populations at these specific loci could indeed be transferable to admixed Brazilian individuals.
However, although all these 29 candidate regions passed the Bonferroni correction based on the 48 candidate SNPs (p-value = 0.001), we understand that this p-value threshold is not stringent. Thus, the lack of association signal cannot be discarded for the 20 remaining candidate regions. One may therefore still speculate that the lack of reproducibility could be due to the absence of statistical power, population stratification, or the differences in the genomic structure of the admixed sample compared with the previously studied populations.
Although we have identified flanking SNPs in the neighborhood of the candidate regions, which presented 80% statistical power to detect an increased-risk or protective allele effect, we acknowledge the limited statistical power provided by the cohort analyzed, with 87 patients with GGE and 340 controls.
Despite the observed high heterogeneity within groups (σ² = 99.61%) and low heterogeneity between patients and controls (σ² = 0.39%), the statistics based on AMOVA results revealed evidence of population stratification between patients with GGE and the BIPMed controls. Thus, we corrected possible spurious association results by taking the first five principal components into account in the logistic regression model (Marchini et al., 2004; Price et al., 2010).
Indeed, the two-proportion Z-test showed that 16 SNPs presented different allele frequencies when comparing admixed Brazilian and European samples, further substantiating the hypothesis of lack of genetic association due to genetic differences when comparing the admixed Brazilians and Europeans.
It is important to note that 31 SNPs were not found in the SNP-array dataset, and we decided to impute them from all populations available in the 1KGP dataset (The 1000 Genomes Project Consortium et al., 2015). Previous studies have demonstrated that imputation accuracy for populations with a high proportion of European ancestry is higher than for populations with African or Native American ancestry (Martin et al., 2017). In addition, the EPIGEN-Brazil Initiative has also imputed admixed Brazilian samples from the 1KGP dataset with high confidence variants (Magalhães et al., 2018). However, the imputation by the TOPMED Consortium has demonstrated improved quality of variant imputation for admixed African and Hispanic/Latin populations compared with the 1KGP dataset (Kowalski et al., 2019). Thus, we also used this approach for comparison. We observed a perfect match between the SNPs genotyped in the SNP-array and their imputed correspondents for the BEAGLE imputation using the 1KGP as reference. By contrast, there was only 45.7% correspondence between the SNPs genotyped and the imputed SNPs using TOPMED. Thus, we can argue that Hispanic/Latin samples included in the TOPMED reference panel (Kowalski et al., 2019) may not represent the genomic structure of admixed Brazilians (Adhikari et al., 2016). This is an important finding and indicates that although allele frequencies of admixed Brazilian populations are different from other populations reported in public databases (Adhikari et al., 2016;Magalhães et al., 2018;Rocha et al., 2020), there is a remarkable accuracy in the SNP imputation for admixed Brazilian individuals based on populations from the 1KGP database, as demonstrated by our results and elsewhere (Magalhães et al., 2018).
In conclusion, we replicated association signals on eight candidate regions previously found in European populations, indicating the possibility of transferability of polygenic risk scores from European studies to admixed Brazilian populations in these specific candidate regions. In addition, we show evidence that differences in the genetic architecture of the population may hinder the replication of association results in admixed Brazilians for the remaining candidate regions, thus supporting the hypothesis of population differences influencing the association results in the present study. Also, we documented the effect of different methods/databases used for genotype imputation in admixed Brazilians. These results could be relevant to improving stratification risk estimation and future precision health applications in admixed Brazilian patients with GGEs and other complex disorders.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ebi.ac.uk/eva/, PRJEB39251; https://www.ebi.ac.uk/ena, PRJEB45235.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Comitê de Ética em Pesquisa da Universidade Estadual de Campinas. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
FK performed data processing, statistical analysis, and imputation. TA and PA performed data acquisition and SNP array genotyping. MA, CY, and FC performed the clinical analysis of GGE patients. FC and IL-C served as the principal investigators. RS conceptualized the work, created the study design, and served as a principal investigator. All authors reviewed and approved the final version of the manuscript.