The Extent and Impact of Variation in ADME Genes in Sub-Saharan African Populations

Introduction: Investigating variation in genes involved in the absorption, distribution, metabolism, and excretion (ADME) of drugs is key to characterizing pharmacogenomic (PGx) relationships. ADME gene variation is relatively well characterized in European and Asian populations, but African populations are under-studied, which has implications for drug safety and effective use in Africa. Results: We identified significant ADME gene variation in African populations using data from 458 high-coverage whole genome sequences, 412 of which are novel, and from previously available African sequences from the 1000 Genomes Project. ADME variation was not uniform across African populations, particularly for high impact coding variation. Copy number variation was detected in 116 ADME genes, with equal ratios of duplications and deletions. We identified 930 potential high impact coding variants, most of which are specific to a single African population cluster. Large frequency differences (i.e., >10%) were seen in common high impact variants between clusters. Several novel variants are predicted to have a significant impact on protein structure, but additional functional work is needed to confirm their relevance for PGx use. Most variants of known clinical outcome are rare in African compared with European populations, potentially reflecting a clinical PGx research bias toward European populations. Discussion: The genetic diversity of ADME genes across sub-Saharan African populations is large. The Southern African population cluster is most distinct from that of far West Africa. PGx strategies based on European variants will be of limited use in African populations. Although established variants are important, PGx must take into account the full range of African variation.
This work urges further characterization of variants in African populations, including in vitro and in silico studies, and consideration of the unique African ADME landscape when developing precision medicine guidelines and tools for African populations.


S2 Population Structure
A full analysis of population structure is beyond the scope of this paper (see [1]). However, it is useful to understand the extent of the diversity of the samples. Figure 2 in the main text shows a PCA of our samples (PC1 versus PC2). In Figure S1 we show PC2 versus PC3 of the same data, and we also show our data in the context of other populations. As explained in the methods section, the PC analysis included a number of reference populations, including some 1000 Genomes European, Asian and African populations, to ensure the analysis was unbiased. However, for clarity we only display some of the populations.
Figure S2 shows the structure chart of the same data as Figure 2 in the main paper, for k = 3, ..., 8. Admixture proportions were computed with ADMIXTURE [2]: 30 independent estimates were run for each value of k and the final result was computed using CLUMPP [3]. For clarity we have omitted some of the smaller groups. BWA = samples from Botswana, ZAF = samples from South Africa, NGA = Berom from Nigeria, BEN = Benin, CMR = Cameroon, BFA = Burkina Faso, GHA = Ghana, CEU = Utah residents (CEPH) with Northern and Western European ancestry (KGP), San = Khoe and San from HAAD.

S3 Regulatory Variation
Table S5 lists SNPs that meet all of the following criteria:
• located in a non-coding region
• MAF > 0.01
• CADD-PHRED score ≥ 10 [4]
• the canonical transcript for the SNP is not in a coding region (a check on the initial selection)
• binomial p-value compared with our entire 1000 Genomes data set < 0.05
The binomial p-value is calculated by taking the count of instances of the SNP as 2 × homozygous count + heterozygous count; the total is this count summed over both alleles, and the expected probability is the MAF in the entire 1000 Genomes data set. Given that our data are generally high coverage and some of the 1000 Genomes data are not, this is a first cut at finding significant regulatory SNPs. In some instances the "ancestral" allele has a much lower frequency than the "minor" allele, which points to a need to review the ancestral allele call. We excluded examples with this issue (in any case, none passed the p-value threshold).
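The allele-count convention and binomial test described above can be sketched as follows. This is a minimal illustration rather than the pipeline used for Table S5: the function names, the one-sided (upper-tail) form of the test, the genotype counts, and the reference MAF of 0.02 are all assumptions made for the example.

```python
import math

def allele_counts(n_hom_alt, n_het, n_hom_ref):
    """Return (alternate allele count, total allele count) from genotype counts."""
    alt = 2 * n_hom_alt + n_het      # 2 x homozygous count + heterozygous count
    ref = 2 * n_hom_ref + n_het      # same convention for the other allele
    return alt, alt + ref

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

# Hypothetical genotype counts for one SNP in 458 individuals,
# tested against a hypothetical reference MAF (assumed one-sided test).
alt, total = allele_counts(n_hom_alt=5, n_het=40, n_hom_ref=413)
p_value = binom_upper_tail(alt, total, p=0.02)
passes = p_value < 0.05
```

A one-sided test is used here because the aim is to find SNPs that are more prevalent than in the reference data; a two-sided test would be an equally reasonable reading of the description.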
This initial filtering resulted in 54 SNPs. We then compared differences between pairs of populations with those in the overall population, using PLINK to calculate FST scores.

S5 Regulatory variation
Genetic variants from ADME core genes were filtered for those meeting all of the following criteria: located in a non-coding region (within 10,000 bp upstream or downstream of the canonical transcript); MAF > 0.01; CADD-PHRED score ≥ 10 [4]; and binomial p-value compared with the entire 1000 Genomes Project data set < 0.05. We compared these genetic variants (Table S5) for variability within pairs of populations as compared with the entire 1000 Genomes data set using FST scores [5]. Since the number of genetic variants is small, we do not stratify them further into specific regulatory elements.
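The filtering criteria above can be expressed as a single predicate. The sketch below is illustrative only: the `Variant` record, its field names, and the transcript coordinates are hypothetical, and it assumes that "non-coding region" means a non-coding position inside the transcript or its 10,000 bp flanks.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    pos: int            # genomic position (bp)
    coding: bool        # True if the variant falls in a coding region
    maf: float          # minor allele frequency in the study data
    cadd_phred: float   # CADD-PHRED score
    binom_p: float      # binomial p-value versus the reference data set

FLANK_BP = 10_000  # window upstream and downstream of the canonical transcript

def passes_filter(v, tx_start, tx_end):
    """Return True if the variant meets all four criteria."""
    in_window = (tx_start - FLANK_BP) <= v.pos <= (tx_end + FLANK_BP)
    return (in_window and not v.coding and v.maf > 0.01
            and v.cadd_phred >= 10 and v.binom_p < 0.05)

# Hypothetical transcript spanning 100,000-120,000 bp and two example variants.
v_ok = Variant(pos=95_000, coding=False, maf=0.05, cadd_phred=12.3, binom_p=0.001)
v_far = Variant(pos=80_000, coding=False, maf=0.05, cadd_phred=12.3, binom_p=0.001)
kept = [v for v in (v_ok, v_far) if passes_filter(v, 100_000, 120_000)]
```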
There were 54 genetic variants in non-coding regions across our African data sets that have significantly higher prevalence than in the overall KGP data set (Table S5). Figure S3 illustrates differences between population cluster pairs using FST scores. In most cases, there is little differentiation between pairs of population clusters (FST close to zero). We omit KS (Khoe and San) due to the low sample size in this cluster.
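To make the interpretation of FST concrete, the sketch below computes a simple heterozygosity-based FST, (HT - HS) / HT, for one biallelic site in two equally weighted populations. This is a textbook simplification chosen for illustration; the scores in Figure S3 were produced with PLINK, whose estimator may differ. The function name and allele frequencies are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Heterozygosity-based FST for one biallelic site, two equally weighted populations."""
    # Mean expected heterozygosity within populations (HS).
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity in the pooled population (HT).
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # site is monomorphic overall, so there is no differentiation to measure
    return (h_t - h_s) / h_t

# Hypothetical allele frequencies in two population clusters.
same = fst_two_pops(0.30, 0.30)      # identical frequencies
modest = fst_two_pops(0.20, 0.40)    # moderate frequency difference
fixed = fst_two_pops(0.0, 1.0)       # alternate alleles fixed in each cluster
```

Identical frequencies give a value of zero and fixed differences give one, so values close to zero correspond to variants whose frequencies barely differ between the two clusters.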

S6 Runs of homozygosity
Runs of homozygosity (ROH) are regions of the genome where an individual carries two identical copies of a genomic segment because the maternal and paternal lines share an ancestor. The length of an ROH correlates with how recent the shared ancestor was: longer runs indicate a more recent common ancestor. With high-coverage data, we are able to detect ROH of at least 300 kb in size. High ROH is a measure of inbreeding, is linked to decreased fitness, and may be associated with ill health [6,7]. However, ROH are not randomly distributed across the genome, and islands of homozygosity (ROHi) are known to exist: regions where the ROH of several individuals within a population overlap [8]. There is some evidence that these islands arise as a result of positive selection.
There are a total of 634 ROH in the sample. The key metrics we use are the size of each ROHi (that is, how many individuals are in the ROHi) and this size normalised by the length of the gene (ROHi/kb). The genes with the largest ROHi and ROHi/kb are CYP1A1 and CYP1A2. ABCB1 and DPYD are relatively large genes and have a large ROHi. Tables S6 and S7 summarise the ROH found in the core and extended genes in our data sets. ROHi/kb varies widely across all genes in the genome. Figure S5 shows a violin plot of ROHi/kb in the core, extended, and all other genes in the genome. Statistical comparison is difficult because the values are not normally distributed and a small number of extreme values skew the averages.
Figure S5: Distribution of ROHi/kb across the core, extended, and all genes. As there are extreme values, a y-axis cut-off of 3 was chosen to assist comparison. The median value and inter-quartile range are shown.
Regions of homozygosity in the core and extended gene sets were identified using PLINK [9], with settings appropriate for high-coverage data [6], viz. --homozyg-snp 30, --homozyg-kb 300, --homozyg-window-snp 30, --homozyg group-verbose. Table S6 shows the ROHi found in the core genes. For each group, the proportion of individuals that are part of the ROHi for that gene is shown. The two rightmost columns give the total number of individuals in the data set that are part of the ROHi and that number normalised by the length of the gene (i.e., #ROHi per thousand base pairs). Table S7 gives the same information for the extended gene set.
Table S6: Runs of homozygosity in the core genes split by group. For each gene the number of ROH found across all samples is shown by group as a fraction of the individuals in that group who share the ROHi, followed by the total number and the total normalised by gene length.
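The two ROHi metrics reported in Tables S6 and S7 can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure used to build the tables: it treats an individual as part of a gene's ROHi if any of that individual's ROH segments overlaps the gene, and the function name, segment tuples, and coordinates are hypothetical.

```python
def rohi_for_gene(roh_segments, gene_chrom, gene_start, gene_end):
    """Return (ROHi size, ROHi per kb) for one gene.

    roh_segments: iterable of (individual_id, chrom, start_bp, end_bp).
    """
    # Distinct individuals with at least one ROH overlapping the gene (assumed criterion).
    carriers = {
        iid for iid, chrom, start, end in roh_segments
        if chrom == gene_chrom and start <= gene_end and end >= gene_start
    }
    gene_kb = (gene_end - gene_start + 1) / 1000.0
    return len(carriers), len(carriers) / gene_kb

# Hypothetical ROH segments; individual "A" has two runs over the same gene.
segments = [
    ("A", "15", 500, 400_000),
    ("A", "15", 9_000, 320_000),
    ("B", "15", 5_000, 350_000),
    ("C", "7", 1_000, 500_000),   # different chromosome, not counted
]
# Hypothetical gene of 10 kb on chromosome 15.
size, per_kb = rohi_for_gene(segments, "15", 1_000, 10_999)
```

Normalising by gene length matters because long genes such as ABCB1 and DPYD are more likely to overlap an ROH by chance, so a large raw ROHi does not by itself imply a high density.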