# THE GENETIC AND ENVIRONMENTAL BASIS FOR DISEASES IN UNDERSTUDIED POPULATIONS

EDITED BY: Nicola Mulder, Zané Lombard, Mayowa Ojo Owolabi and Solomon Fiifi Ofori-Acquah

PUBLISHED IN: Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-168-8 DOI 10.3389/978-2-88966-168-8

Topic Editors:

Nicola Mulder, University of Cape Town, South Africa; Zané Lombard, University of the Witwatersrand, South Africa; Mayowa Ojo Owolabi, University of Ibadan, Nigeria; Solomon Fiifi Ofori-Acquah, University of Ghana, Ghana

Citation: Mulder, N., Lombard, Z., Owolabi, M. O., Ofori-Acquah, S. F., eds. (2020). The Genetic and Environmental Basis for Diseases in Understudied Populations. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-168-8

# Table of Contents

*Editorial: The Genetic and Environmental Basis for Diseases in Understudied Populations*

Nicola Mulder, Zané Lombard, Mayowa Ojo Owolabi and Solomon Fiifi Ofori-Acquah

*A Sex-Stratified Genome-Wide Association Study of Tuberculosis Using a Multi-Ethnic Genotyping Array*

Haiko Schurz, Craig J. Kinnear, Chris Gignoux, Genevieve Wojcik, Paul D. van Helden, Gerard Tromp, Brenna Henn, Eileen G. Hoal and Marlo Möller

*Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population*

Haiko Schurz, Stephanie J. Müller, Paul David van Helden, Gerard Tromp, Eileen G. Hoal, Craig J. Kinnear and Marlo Möller


Jeffrey G. Shaffer, Frances J. Mather, Mamadou Wele, Jian Li, Cheick Oumar Tangara, Yaya Kassogue, Sudesh K. Srivastav, Oumar Thiero, Mahamadou Diakite, Modibo Sangare, Djeneba Dabitao, Mahamoudou Toure, Abdoulaye A. Djimde, Sekou Traore, Brehima Diakite, Mamadou B. Coulibaly, Yaozhong Liu, Michelle Lacey, John J. Lefante, Ousmane Koita, John S. Schieffelin, Donald J. Krogstad and Seydou O. Doumbia


*Genetic Screening of the Usher Syndrome in Cuba*

Elayne E. Santana, Carla Fuster-García, Elena Aller, Teresa Jaijo, Belén García-Bohórquez, Gema García-García, José M. Millán and Araceli Lantigua

*Hydroxyurea-Induced miRNA Expression in Sickle Cell Disease Patients in Africa*

Khuthala Mnika, Gaston K. Mazandu, Mario Jonas, Gift D. Pule, Emile R. Chimusa, Neil A. Hanchard and Ambroise Wonkam

*The Potential Role of Regulatory Genes (*DNMT3A, HDAC5, *and* HDAC9*) in Antipsychotic Treatment Response in South African Schizophrenia Patients*

Kevin Sean O'Connell, Nathaniel Wade McGregor, Robin Emsley, Soraya Seedat and Louise Warnich


Allan Kalungi, Jacqueline S. Womersley, Eugene Kinyanda, Moses L. Joloba, Wilber Ssembajjwe, Rebecca N. Nsubuga, Jonathan Levin, Pontiano Kaleebu, Martin Kidd, Soraya Seedat and Sian M. J. Hemmings

*Frequencies of the* LILRA3 *6.7-kb Deletion are Highly Differentiated Among Han Chinese Subpopulations and Involved in Ankylosing Spondylitis Predisposition*

Han Wang, Yuxuan Wang, Yundi Tang, Hua Ye, Xuewu Zhang, Gengmin Zhou, Jiyang Lv, Yongjiang Cai, Zhanguo Li, Jianping Guo and Qingwen Wang

GJB2 *and* GJB6 *Mutations in Non-Syndromic Childhood Hearing Impairment in Ghana*

Samuel M. Adadey, Noluthando Manyisa, Khuthala Mnika, Carmen de Kock, Victoria Nembaware, Osbourne Quaye, Geoffrey K. Amedofu, Gordon A. Awandare and Ambroise Wonkam


Bonnie R. Joubert, Stacey N. Mantooth and Kimberly A. McAllister

*Novel and Known Gene-Smoking Interactions With cIMT Identified as Potential Drivers for Atherosclerosis Risk in West-African Populations of the AWI-Gen Study*

Palwende Romuald Boua, Jean-Tristan Brandenburg, Ananyo Choudhury, Scott Hazelhurst, Dhriti Sengupta, Godfred Agongo, Engelbert A. Nonterah, Abraham R. Oduro, Halidou Tinto, Christopher G. Mathew, Hermann Sorgho and Michèle Ramsay

*Spinal Muscular Atrophy in the Black South African Population: A Matter of Rearrangement?*

Elana Vorster, Fahmida B. Essop, John L. Rodda and Amanda Krause

# Editorial: The Genetic and Environmental Basis for Diseases in Understudied Populations

#### Nicola Mulder<sup>1</sup>\*, Zané Lombard<sup>2</sup>, Mayowa Ojo Owolabi<sup>3</sup> and Solomon Fiifi Ofori-Acquah<sup>4</sup>

*<sup>1</sup> Computational Biology Division, Department Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa, <sup>2</sup> Division of Human Genetics, National Health Laboratory Service & School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa, <sup>3</sup> Center for Genomic and Precision Medicine, University of Ibadan, Ibadan, Nigeria, <sup>4</sup> West African Genetic Medicine Center, College of Health Sciences, University of Ghana, Accra, Ghana*

Keywords: understudied populations, genetics, disease, GWAS, pharmacogenomics

#### **Editorial on the Research Topic**

#### **The Genetic and Environmental Basis for Diseases in Understudied Populations**

Large-scale genomics research is costly, requiring significant resources for community engagement, participant recruitment, experimentation, and genomics data generation. Though the costs of genotyping and sequencing are decreasing, the large sample sizes required for genome-wide association studies (GWAS) restrict such studies to researchers with significant funding and adequate resources. Until recently, these studies have been performed predominantly in populations from Europe and other first-world countries, creating a bias in the representation of global populations in public databases. Through a change in funding priorities for some major biomedical research funders, and a recognition of the need for diversity in genetic data, the balance has begun to shift, and under-represented populations are increasingly being included in genomics studies. Data from these under-represented populations have the potential to significantly alter our understanding of the genetic basis for human diseases in all populations, as they enable us to complete a picture which previously had major gaps. For example, inclusion of African populations, our oldest and most diverse populations, is providing important insights into human evolution and the origin of disease-related mutations.

For this Research Topic, we sought high-quality research papers describing novel insights into genetic and environmental factors that impact disease risk, expression, prognosis, and treatment in populations that are understudied in human genomics research. Topics could include population genetics, genome-wide association studies, epigenetics, pharmacogenomics, environmental risk factors for diseases or gene-environment interactions in diseases. The final topic issue has 19 published articles covering various diseases studied in African and other previously under-represented populations.

Though not reporting specific studies, Shaffer et al. describe capacity development efforts in Mali to increase the number of trained bioinformaticians and data scientists able to analyse and interpret large-scale genomics data on local populations. Some of the papers describe novel methods or evaluation of existing methods for working on complex populations. For example, Schurz, Müller et al. evaluated the accuracy of three different imputation methods for multi-way admixed populations, using the South African Colored population as an example. Their findings demonstrate the importance of using appropriate imputation software and a reference panel containing populations that accurately represent the ancestral populations of admixed individuals.

Fatumo et al. report the first GWAS in a Ugandan population for multivariate blood cell count phenotypes. The authors used both univariate and multivariate approaches and demonstrated that the multivariate approach has greater power and identifies additional loci. They report that performing a joint analysis of correlated phenotypes simultaneously can provide new insights into complex traits that may not be identified in separate univariate analyses. However, they also observed that highly correlated traits may inflate p-values. New candidate loci for several blood cell count parameters were found, illustrating the need for conducting GWAS in non-European populations to better understand the genetics of blood cell physiology.

#### Edited and reviewed by:

*TingFung Chan, The Chinese University of Hong Kong, China*

\*Correspondence: *Nicola Mulder nicola.mulder@uct.ac.za*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

Received: *07 May 2020* Accepted: *17 August 2020* Published: *23 September 2020*

#### Citation:

*Mulder N, Lombard Z, Owolabi MO and Ofori-Acquah SF (2020) Editorial: The Genetic and Environmental Basis for Diseases in Understudied Populations. Front. Genet. 11:559956. doi: 10.3389/fgene.2020.559956*

There are a number of articles reporting on studies in African populations, most commonly on cohorts in South Africa. Vorster et al. describe a multiplex ligation-dependent probe amplification analysis to determine the cause of spinal muscular atrophy (SMA) in Black South African patients. Though no significant pathogenic CNVs were detected, they found discordant copy numbers of exons which suggest complex rearrangements that may affect the SMN1 gene. Their study reiterates the fact that the genetic determinants of SMA in some individuals from this population group differ from those identified previously in other populations. Infectious diseases such as HIV and TB are prevalent in African populations, and mental illness is on the increase. Kalungi et al. sought to identify associations between relative telomere length and internalizing mental disorders, such as depressive disorders, anxiety disorders, and post-traumatic stress disorders, among young HIV-infected Ugandan individuals. A longer relative telomere length was found in individuals with these disorders than in age- and sex-matched controls, and they concluded that though the length was not the cause of the disorders, the disorders were causing accelerated cellular aging.

In another African GWAS, motivated by the epidemiological evidence that males are more affected by tuberculosis (TB) than females, Schurz, Kinnear et al. reported the first TB host susceptibility genome-wide association study (GWAS) with a specific focus on sex-stratified autosomal analysis and the X chromosome. Although the results are only nominally indicative of association, they do highlight the significance of the X chromosome in TB susceptibility, and the importance of considering ascertainment bias in genotyping arrays when selecting appropriate genotyping tools for studies in understudied populations.

Looking at environmental impact, Joubert et al. systematically reviewed the important progress and promising opportunities in environmental health research in Africa. Literature describing harmful health effects of metals, pesticides, and dietary mold represented a context unique to Africa. However, cardiovascular and respiratory health endpoints impacted by air pollution were comparable to observations in other countries. Air pollution exposures unique to Africa were dust and specific occupational exposures. Investigations of environmental exposures with distinct routes of exposure, unique co-exposures and comorbidities, combined with the extensive genomic diversity in Africa in the context of gene-environment studies may lead to the identification of novel mechanisms underlying complex disease and promising potential for translation to global public health.

In line with this prospect, Boua et al. examined gene-smoking interactions with carotid intima-media thickness (cIMT) to identify potential drivers for atherosclerosis risk in West-African populations of the AWI-Gen Study. They identified new gene-smoking interaction variants for cIMT within the previously described RCBTB1 region and the novel regulatory region of TBC1D8. In silico functional analysis suggested the involvement of genes implicated in biological processes related to cell or biological adhesion and regulatory processes in gene-smoking interactions with cIMT.

Precision medicine and pharmacogenomics were a strong theme in several of the featured publications. In a paper from outside the African continent, Nagar et al. surveyed pharmacogenomic variants in two populations in Colombia, Antioquia and Chocó, with differing ancestries. They found that some pharmacogenomic variants have unusually high minor allele frequencies and differentiation according to ancestral contributions. These included variants with toxicity and dosing implications. As a result, the authors developed a cost-effective allele-specific PCR assay to test for relevant variants to inform healthcare decisions. In another paper, O'Connell et al. investigated the potential role of regulatory genes in antipsychotic treatment response in South African schizophrenia patients. Seven candidate genes showed significant expression level changes and four variants within these genes were significantly associated with treatment response. Compared to previously reported studies, two of these variants are identified as lying within eQTLs that impact brain gene expression, providing promising evidence that these may potentially serve as biomarkers of antipsychotic treatment response in the future.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

NM was supported by NIH grant U24HG006941; MO was supported by the NIH (SIREN U54HG007479; SIBS Genomics R01NS107900; ARISES R01NS115944-01; H3Africa CVD Supplement 3U24HG009780-03S5, and CaNVAS 1R01NS114045-01); SO-A was supported by NIH grants HL143886, HL1330600, HL141011, and GM113816; and ZL is supported by NIH grant U01MH115483.

### ACKNOWLEDGMENTS

The guest editors wish to thank all the authors and reviewers for their valuable contributions to this Research Topic and we hope that this collection of articles will be of interest to the medical and genetics community.

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Mulder, Lombard, Owolabi and Ofori-Acquah. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Sex-Stratified Genome-Wide Association Study of Tuberculosis Using a Multi-Ethnic Genotyping Array

Haiko Schurz<sup>1,2</sup>\*, Craig J. Kinnear<sup>1</sup>, Chris Gignoux<sup>3</sup>, Genevieve Wojcik<sup>4</sup>, Paul D. van Helden<sup>1</sup>, Gerard Tromp<sup>1,2,5</sup>, Brenna Henn<sup>6</sup>, Eileen G. Hoal<sup>1</sup> and Marlo Möller<sup>1</sup>

<sup>1</sup> DST-NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa, <sup>2</sup> South African Tuberculosis Bioinformatics Initiative, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa, <sup>3</sup> Colorado Center for Personalized Medicine, Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States, <sup>4</sup> Department of Genetics, Stanford University, Stanford, CA, United States, <sup>5</sup> Centre for Bioinformatics and Computational Biology, Stellenbosch University, Cape Town, South Africa, <sup>6</sup> Department of Anthropology, UC Davis Genome Center, University of California, Davis, Davis, CA, United States

#### Edited by:

Zané Lombard, University of the Witwatersrand, South Africa

#### Reviewed by:

Shigeki Nakagome, Trinity College Dublin, Ireland; Carina M. Schlebusch, Uppsala University, Sweden

\*Correspondence: Haiko Schurz haiko@sun.ac.za

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 06 September 2018 Accepted: 06 December 2018 Published: 18 January 2019

#### Citation:

Schurz H, Kinnear CJ, Gignoux C, Wojcik G, van Helden PD, Tromp G, Henn B, Hoal EG and Möller M (2019) A Sex-Stratified Genome-Wide Association Study of Tuberculosis Using a Multi-Ethnic Genotyping Array. Front. Genet. 9:678. doi: 10.3389/fgene.2018.00678

Tuberculosis (TB), caused by Mycobacterium tuberculosis, is a complex disease with a known human genetic component. Males seem to be more affected than females and in most countries the TB notification rate is twice as high in males as in females. While socio-economic status, behavior and sex hormones influence the male bias, they do not fully account for it. Males have only one copy of the X chromosome, while diploid females are subject to X chromosome inactivation. In addition, the X chromosome codes for many immune-related genes, supporting the hypothesis that X-linked genes could contribute to TB susceptibility in a sex-biased manner. We report the first TB susceptibility genome-wide association study (GWAS) with a specific focus on sex-stratified autosomal analysis and the X chromosome. A total of 810 individuals (410 cases and 405 controls) from an admixed South African population were genotyped using the Illumina Multi Ethnic Genotyping Array, specifically designed as a suitable platform for diverse and admixed populations. Association testing was done on the autosome (827,386 variants) and X chromosome (20,939 variants) in a sex-stratified and combined manner. SNP association testing was not statistically significant using a stringent cut-off for significance but revealed likely candidate genes that warrant further investigation. A genome-wide interaction analysis detected 16 significant interactions. Finally, the results highlight the importance of sex-stratified analysis as strong sex-specific effects were identified on both the autosome and X chromosome.

Keywords: tuberculosis, GWAS, sex-bias, host genetics, X chromosome, sex-stratified, susceptibility

## INTRODUCTION

Tuberculosis (TB), caused by Mycobacterium tuberculosis (M. tuberculosis), is a global health epidemic and the leading cause of death due to a single infectious agent (World Health Organization [WHO], 2017). In 2016, 1.3 million TB deaths were reported in HIV-negative individuals and an additional 374,000 deaths related to TB/HIV co-infection were recorded. The majority of these deaths occurred in Southeast Asian and African countries (World Health Organization [WHO], 2017). TB is a complex disease, influenced by environmental and behavioral factors such as socio-economic status and smoking, as well as definite human genetic components. The contribution of the host genes to disease has been highlighted by numerous investigations, including animal (Pan et al., 2005), twin (Comstock, 1978; Sorensen et al., 1988; Flynn, 2006), linkage (Bellamy et al., 2000; Greenwood et al., 2000) and candidate gene association studies (Schurz et al., 2015). More recently, genome-wide association studies (GWAS) in diverse populations have been done (Thye et al., 2010, 2012; Oki et al., 2011; Mahasirimongkol et al., 2012; Png et al., 2012; Chimusa et al., 2014; Curtis et al., 2015; Grant et al., 2016; Sobota et al., 2016; Qi et al., 2017).

Interestingly, another influential factor in TB disease development is an individual's biological sex, which has been largely ignored in past TB studies and was usually only used as a covariate for adjusting association testing statistics. In 2016, males comprised 65% of the 10.4 million recorded TB cases, indicating that the TB notification rate is nearly twice as high in males as in females (World Health Organization [WHO], 2017). While socio-economic and behavioral factors do influence this ratio, they do not fully explain the observed sex-bias (Jaillon et al., 2017). Another factor that influences sex-bias is the effect that sex hormones (estrogen and testosterone) have on the immune system. Estrogen is an immune activator, upregulating pro-inflammatory cytokines (TNFα), while testosterone is an immune suppressor, upregulating anti-inflammatory cytokines (IL-10) (Cutolo et al., 2006). This could explain why males are more susceptible to infectious diseases than females (Jaillon et al., 2017). However, immune responses differ even between pre-pubertal boys and girls, as well as between post-menopausal women and elderly men, which shows that sex hormones do not fully explain the sex-bias (Klein et al., 2015). Thus, it has been proposed that the X chromosome and X-linked genes directly contribute to the observed sex-bias.
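The "nearly twice as high" statement follows from simple arithmetic on the 2016 figures quoted above. The sketch below is only an illustration of that arithmetic: the function name `male_to_female_ratio` is hypothetical, and it compares case counts, which approximates a ratio of notification rates only under the assumption that the male and female populations are of similar size.

```python
def male_to_female_ratio(male_share: float) -> float:
    """Return the male:female case ratio implied by the male share of cases."""
    if not 0.0 < male_share < 1.0:
        raise ValueError("male_share must be strictly between 0 and 1")
    return male_share / (1.0 - male_share)


# Figures quoted in the text for 2016 (WHO, 2017).
total_cases_millions = 10.4
male_share = 0.65

# Implied case counts, in millions.
male_cases = total_cases_millions * male_share
female_cases = total_cases_millions - male_cases

# Ratio of male to female cases (0.65 / 0.35).
ratio = male_to_female_ratio(male_share)

print(f"male cases:   {male_cases:.2f} million")
print(f"female cases: {female_cases:.2f} million")
print(f"male:female ratio: {ratio:.2f}")
```

A ratio of 0.65 / 0.35 is a little under 2, which is consistent with the wording "nearly twice as high".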

There are approximately 1,500 genes on the X chromosome, many of which are involved in the adaptive or innate immune system (Brooks, 2010). Since females have two X chromosomes, one requires silencing in order to equalize dosage of gene expression to that of males, who have only one X chromosome. This silencing occurs randomly in each cell, making females functional mosaics for X-linked genes and giving them a major immunological advantage over males (Jaillon et al., 2017). As males are haploid for X-linked genes, any damaging polymorphisms or mutations on the X chromosome will have a more pronounced immunological effect in males than in mosaic females, thereby influencing the sex-bias (Abramowitz et al., 2014).

To date, eleven GWAS investigating susceptibility to clinical TB have been published (Thye et al., 2010, 2012; Oki et al., 2011; Mahasirimongkol et al., 2012; Png et al., 2012; Chimusa et al., 2014; Curtis et al., 2015; Grant et al., 2016; Sobota et al., 2016; Omae et al., 2017; Qi et al., 2017). There has not been significant overlap between the 11 published TB GWAS, but it seems that replication is more likely when populations with similar genetic backgrounds are compared: the WT1 locus was associated with disease in populations from West and South Africa (Thye et al., 2012; Chimusa et al., 2014). Critically, genotyping microarrays that did not fully accommodate African genetic diversity were used in these studies (Thye et al., 2010, 2012; Chimusa et al., 2014; Curtis et al., 2015; Grant et al., 2016). It is therefore possible that unique African-specific susceptibility variants were not tagged by these initial arrays, since LD blocks are shorter in African populations (Campbell and Tishkoff, 2008). Moreover, none of the GWAS included or examined the X chromosome or sex-stratified analysis of the autosomes as was done in an asthma cohort (Mersha et al., 2015). Genetic differences between asthmatic males and females were identified on the autosome, with certain alleles having opposite effects between the sexes. Candidate gene association studies provide independent confirmation of the involvement of the X chromosome in TB susceptibility, through the association of X-linked TLR8 susceptibility variants with active TB. Davila et al. (2008) investigated four TLR8 variants (rs3761624, rs3788935, rs3764879, and rs3764880) in an Indonesian cohort and showed that all variants conferred susceptibility to TB in males but not females. The results for males were validated in male Russian individuals (Davila et al., 2008). These results were validated for rs3764880 in Turkish children, but no significant association was found for rs3764879 (Dalgic et al., 2011). Hashemi-Shahri et al. (2014) found no significant TLR8 associations in an Iranian population, while rs3764880 was significantly associated with TB susceptibility in both males and females in a Pakistani cohort (Bukhari et al., 2015). In admixed South African individuals rs3764879 and rs3764880 were significantly associated in both males and females, while rs3761624 was only significantly associated in females (Salie et al., 2015). Interestingly, in this cohort opposite effects were consistently found between the sexes for the same allele in all investigated TLR8 variants (Salie et al., 2015), echoing the asthma findings of Mersha et al. (2015). Finally, in a Chinese cohort rs3764879 was significantly associated with TB disease in males but not females. While many of these variants did not reach genome-wide significance, they still provide evidence of the involvement of X-linked genes in TB susceptibility.

**Abbreviations:** 95CI, 95% confidence interval; ARMCX1, Armadillo repeat containing X-linked 1; ARSF, arylsulfatase F; ASNS, asparagine synthetase; ATP2C1, ATPase secretory pathway Ca<sup>2+</sup> transporting 1; C5orf64, chromosome 5 open reading frame 64; CFAP54, cilia and flagella associated protein 54; CIITA, class II major histocompatibility complex transactivator; CXorf51B, chromosome X open reading frame 51B; DIAPH2, diaphanous related formin 2; DNA, deoxyribonucleic acid; DPF3, double PHD fingers 3; DROSHA, drosha ribonuclease III; FRMPD4, FERM and PDZ domain containing 4; GRAMD2B, GRAM domain containing 2B; GWAS, Genome-wide association study; HIV, human immunodeficiency virus; HWE, Hardy–Weinberg equilibrium; IL-10, interleukin 10; LD, linkage disequilibrium; LINC00400, long intergenic non-protein coding RNA 400; LINC02153, long intergenic non-protein coding RNA 2153; LINC02246, long intergenic non-protein coding RNA 2246; MAF, minor allele frequency; MEGA, multi-ethnic genotyping array; MHC, Major histocompatibility complex; MIR514A1, MicroRNA 514a-1; miRNA, micro RNA; MTND6P12, MT-ND6 pseudogene 12; NCS1, neuronal calcium sensor 1; NF-kB, nuclear factor kappa-light-chain-enhancer of activated B cells; NTM, neurotrimin; OR, odds ratio; P\_comb, combined p-value using Stouffer's method; P\_Diff, P-value for sex-differentiation test; PBMC, peripheral blood mononuclear cell; PCSK6, proprotein convertase subtilisin/kexin type 6; pTB, pulmonary Tuberculosis; RN7SKP120, RNA, 7SK small nuclear pseudogene 120; RNA, ribonucleic acid; RNF125, ring finger protein 125; RNF126, ring finger protein 126; RNU6-974P, RNA, U6 small nuclear 974, pseudogene; RTN4RL1, reticulon 4 receptor like 1; SAC, South African colored; SALL2, spalt like transcription factor 2; SNP, single nucleotide polymorphism; SRPX, sushi repeat containing protein X-linked; TB, tuberculosis; TBL1X, transducin beta like 1 X-linked; TENT4A, terminal nucleotidyltransferase 4A; TLR, toll-like receptor; TNFα, tumor necrosis factor alpha; TST, tuberculin skin test; URI1, URI1, prefoldin like chaperone; WT1, Wilms tumor 1; XWAS, X chromosome wide association study.

We report the first TB susceptibility GWAS with a specific focus on sex-stratified autosomal analysis and the X chromosome to elucidate the male sex-bias. Individuals from the unique five-way admixed SAC population, with ancestral contributions from Bantu-speaking African, KhoeSan, European, South and East Asian groups, were genotyped in this study (Chimusa et al., 2013; Daya et al., 2013). These genetic contributions are due to both the complex colonization history of South Africa and the country's importance as a refreshment station on major trade routes during the fifteenth to nineteenth centuries (de Wit et al., 2010; Uren et al., 2016). This is also the first GWAS in the SAC that uses an array (Illumina Multi Ethnic Genotyping Array, see section "Genotyping") specifically designed to detect variants in the four most commonly studied populations, making it the most suitable platform for diverse and admixed populations at the time of genotyping.

### MATERIALS AND METHODS

#### Study Population

Study participants were recruited from two suburbs in the Cape Town metropole of the Western Cape. These suburbs were chosen for their high TB incidence and low HIV prevalence (2%) at the time of sampling (1995–2005) (Kritzinger et al., 2009). Approximately 98% of the residents in these suburbs self-identify as SAC and have similar socio-economic status, which reduces confounding bias in the association testing (Chimusa et al., 2014). The cohort consists of 420 pulmonary TB (pTB) cases, bacteriologically confirmed to be culture and/or smear positive, and 419 healthy controls from the same suburbs. Approximately 80% of individuals over the age of 15 years from these suburbs have a positive tuberculin skin test (TST), indicating exposure to M. tuberculosis (Gallant et al., 2010). All study participants were over 18 years of age and HIV negative.

Approval was obtained from the Health Research Ethics Committee of Stellenbosch University (project registration numbers S17/01/013 and 95/072) before participant recruitment. Written informed consent was obtained from all study participants prior to blood collection. DNA was extracted from the blood samples using the Nucleon BACC Genomic DNA extraction kit (Illumina, Buckinghamshire, United Kingdom). DNA concentration and purity were checked using the NanoDrop® ND-1000 Spectrophotometer and NanoDrop® v3.0.1 software (Inqaba Biotechnology, Pretoria, South Africa). The study adhered to the ethical guidelines set out in the Declaration of Helsinki, 2013 (World Medical Association [WMA], 2018).

#### Genotyping

Genotyping was done using the Illumina MEGA (Illumina, Miami, United States), which contains 1.7 million markers from various ethnicities, making it highly suitable for diverse and admixed populations. The array is based on novel variants identified by the Consortium on Asthma among African ancestry populations in the Americas (CAAPA), the Illumina human core content for European and Asian populations as well as multi-ethnic exome content from African, Asian and European populations. The array also contains ancestry informative markers specific to the SAC population. While the KhoeSan population is not highly represented on the array, which could lead to a certain level of ascertainment bias, at the time of genotyping it was the most suitable platform for this diverse and admixed population. GenomeStudio v2.04 (Illumina, Miami, United States) was used for SNP calling to calculate intensity scores and to call common variants (MAF ≥ 5%), followed by analysis with zCall to recall rare genotypes (MAF < 5%) (Goldstein et al., 2012).

#### Genotyping Quality Control

Quality control (QC) of the genotyping data was done using the XWAS version 2.0 software and QC pipeline to filter out low quality samples and SNPs (Chang et al., 2014; Gao et al., 2015). Data were screened for sex concordance, relatedness (up to third degree of relatedness) and population stratification (as determined by principal component analysis). Genotypes for males and females were filtered separately in order to maintain inherent differences between the sexes. SNPs were removed from the analysis if missingness correlated with phenotype (threshold = 0.01), if SNP missingness was greater than 10%, if the MAF was less than 1% or if they deviated from Hardy–Weinberg equilibrium (HWE) in controls (threshold = 0.01). Individuals with more than 10% missing genotypes were also removed. Filtering continued iteratively until no additional variants or individuals were removed. Overlapping markers between the sexes were merged into a single dataset. X chromosome genotypes were extracted and variants were removed if the MAF or missingness was significantly different between the sexes (threshold = 0.01). A flow diagram explaining quality control steps and association testing of the data is shown in **Supplementary Figure S1**.
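The missingness and MAF filters described above can be illustrated with a minimal Python sketch. This is not the XWAS pipeline itself; the function names, the genotype encoding (0, 1 or 2 copies of the alternate allele, `None` for a missing call) and the toy data are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional

# Assumed encoding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def snp_missingness(calls: List[Genotype]) -> float:
    """Fraction of individuals with no genotype call at this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """MAF among called diploid genotypes (autosomal SNPs or female X)."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passing_snps(genotypes: Dict[str, List[Genotype]],
                 max_missing: float = 0.10,
                 min_maf: float = 0.01) -> List[str]:
    """Return the IDs of SNPs that pass both the missingness and MAF filters."""
    return [
        snp for snp, calls in genotypes.items()
        if snp_missingness(calls) <= max_missing
        and minor_allele_frequency(calls) >= min_maf
    ]


# Toy data: three hypothetical SNPs typed in ten individuals.
toy = {
    "snpA": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],        # common and fully called
    "snpB": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic, fails MAF
    "snpC": [1, None, None, 0, 1, 2, 0, 1, 0, 1],  # 20% missing, fails missingness
}
kept = passing_snps(toy)
```

In a sex-stratified protocol such a filter would be applied to the male and female genotypes separately before the overlapping markers are merged.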

#### Admixture

The SAC population is a 5-way admixed population with ancestral contributions from Bantu-speaking African populations, KhoeSan, Europeans, and South and East Asians (Chimusa et al., 2013; Salie et al., 2015). To avoid confounding during association testing the ancestral components are included as covariates (Daya et al., 2014a). Admixture was estimated for the autosome (chromosomes 1–22) and the X chromosome separately using the software ADMIXTURE (v1.3) (Alexander et al., 2009) and reference genotyping data for 5 ancestral populations. The reference populations used to infer ancestry were European (CEU) and South Asian (Gujarati Indians in Houston, Texas and Pathan of Punjab) extracted from the 1000 Genomes Phase 3 data (Sudmant et al., 2015), East Asian (Han Chinese in Beijing, China), African (Luhya in Webuye, Kenya, Bantu-speaking African, Yoruba from Nigeria) and San (Nama/Khomani) (Uren et al., 2016; Martin et al., 2017). Due to the limited number of individuals available for each reference population the SAC data had to be divided into 21 groups to equal the number of individuals per reference population. The number of individuals per reference population and admixed population has to be kept consistent in order to maximize the accuracy of the admixture results by not over-representing one particular population in the analysis. Admixture inference was therefore done separately for each of the 21 SAC groups, referred to as running groups. Each running group was analyzed five times at different random seed values. The results for each individual were averaged across the five runs in order to obtain the most accurate ancestry estimations (Shringarpure et al., 2016). Four ancestral components [African, San, European, and South Asian (Salie et al., 2015)] were included as covariates in the logistic regression association testing, with the smallest component (East Asian) excluded in order to avoid complete separation of the data.
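The averaging of each individual's ancestry proportions across repeated runs can be sketched as follows. This is an illustration only: the data layout, the function name `average_runs` and the toy proportions are hypothetical, and the sketch assumes that the ancestry components appear in the same order in every run.

```python
from statistics import mean
from typing import Dict, List

# One run: individual ID -> ancestry proportions (summing to 1).
Run = Dict[str, List[float]]


def average_runs(runs: List[Run]) -> Run:
    """Average each individual's ancestry proportions across repeated runs.

    Assumes the components are in the same order in every run; in practice
    they may first need to be aligned between runs.
    """
    individuals = runs[0].keys()
    return {
        ind: [mean(values) for values in zip(*(run[ind] for run in runs))]
        for ind in individuals
    }


# Toy example: two hypothetical individuals, three components, three seeds.
runs = [
    {"ind1": [0.50, 0.30, 0.20], "ind2": [0.10, 0.60, 0.30]},
    {"ind1": [0.52, 0.28, 0.20], "ind2": [0.12, 0.58, 0.30]},
    {"ind1": [0.48, 0.32, 0.20], "ind2": [0.08, 0.62, 0.30]},
]
averaged = average_runs(runs)
```

Because every run sums to one for each individual, the averaged proportions also sum to one, so any one component is determined by the others.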

#### Association Analysis

#### SNP Based Association Analysis

Autosomal TB association testing was done with sex-stratified and combined datasets using the additive model in PLINK (version 1.7<sup>1</sup>) (Purcell et al., 2007) in order to detect sex-based differences. TB association testing for the X chromosome was done separately in males and females using XWAS (version 2) and the results were combined using Stouffer's method in order to obtain a combined association statistic (Chang et al., 2014; Gao et al., 2015). A sex-differentiated test was conducted for the X chromosome using the XWAS software to test for significant differences in genetic effects between males and females. SNP based association testing (sex-stratified or not) compares the frequency of alleles between cases and controls to determine if a specific allele co-occurs with a phenotype (TB) more often than would be expected by chance. The sex-differentiation test, on the other hand, compares the effect size (OR) of a variant between the sexes to determine if a variant has a different effect on risk between the sexes. The sex-differentiation test is explained in more detail by Chang et al. (2014). X chromosome inactivation states were also included in the association testing as covariates using a method developed by Wang et al. (2014). To include inactivation states in the association analysis the most likely state was determined for each SNP. A variant can either be inactivated, or it can be skewed toward the deleterious or normal allele, or the variant can escape inactivation. To determine which of the four states is most probable the likelihood ratio for each one was calculated and the inactivation state that maximized the likelihood ratio was applied to the SNP in question. This was done for each variant as inactivation states vary along the X chromosome [for a detailed description see Wang et al. (2014)]. Ancestry, sex and age were included in the analyses as covariates where applicable. Information on other risk factors known to influence TB susceptibility, such as smoking and alcohol consumption, was not available for this study cohort and could not be included as covariates. Multiple testing correction was done using the SimpleM method (Gao et al., 2010), which adjusts the significance threshold based on the number of SNPs that explains 95% of the variance in the study cohort. This method is less conservative than Bonferroni correction and is a close approximation of permutation results in a fraction of the time. For the autosome the genome-wide significance threshold was set to 5.0e−8 (Panagiotou and Ioannidis, 2012).
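To make the two kinds of X chromosome test concrete, the sketch below combines sex-specific Z-scores with a weighted Stouffer statistic and computes a Z-statistic for the difference in log odds ratios between the sexes. It is a simplified illustration and not the XWAS implementation: the square-root-of-sample-size weights, the form of the difference statistic, the function names and all input values are assumptions.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def stouffer_z(z_male: float, z_female: float,
               n_male: int, n_female: int) -> float:
    """Weighted Stouffer combination of two signed Z-scores.

    Weights are sqrt(sample size), one common choice (an assumption here).
    """
    w_m, w_f = sqrt(n_male), sqrt(n_female)
    return (w_m * z_male + w_f * z_female) / sqrt(w_m ** 2 + w_f ** 2)


def sex_difference_z(or_male: float, se_male: float,
                     or_female: float, se_female: float) -> float:
    """Z-statistic for a difference in log odds ratios between the sexes.

    se_male and se_female are standard errors of the log odds ratios.
    """
    return (log(or_female) - log(or_male)) / sqrt(se_male ** 2 + se_female ** 2)


# Hypothetical inputs, not values from this study.
z_same = stouffer_z(2.0, 2.0, n_male=400, n_female=400)       # concordant effects
z_opposite = stouffer_z(2.0, -2.0, n_male=400, n_female=400)  # opposite effects
z_diff = sex_difference_z(or_male=0.5, se_male=0.25,
                          or_female=2.0, se_female=0.25)
p_diff = two_sided_p(z_diff)
```

With signed Z-scores, concordant effects reinforce each other in the combined statistic while opposite effects cancel, which is the situation the sex-differentiation test is designed to detect.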

#### Gene Based Association Analysis

Gene-based association testing groups SNPs together and thus decreases the multiple testing burden and increases power to detect an association. Gene-based association testing was done using the XWAS v2 scripts, which were implemented using the Python<sup>2</sup> (version 2.7.10) and R programming environments [version 3.2.4, (R Development Core Team, 2013)] and the R packages corpcor and mvtnorm. Reference files for the known canonical genes on the X chromosome for human genome build 37 were included in the XWAS v2 software package and used to group variants and p-values by gene (Chang et al., 2014; Gao et al., 2015). Bonferroni correction was used to adjust for multiple testing instead of SimpleM, as all genes, unlike SNPs, are independent of each other in the context of association testing and as such the number of tests corrected for cannot be less than the number of genes tested.

#### Interaction Analysis

Genome-wide SNP interaction analysis was done using CASSI<sup>3</sup> (v2.51). A joint effects model was implemented for a rapid overview of interactions of all variants across the genome (autosome and X chromosome). Variants from significant interactions were reanalyzed using a logistic regression approach with covariate correction, which would not be feasible for a genome-wide interaction analysis as it would be too computationally intensive. As there is no general consensus on the significance threshold for genome wide interaction analysis Bonferroni correction was used in order to avoid potential inflation of false positive results.

### RESULTS

#### Cohort Summary

In total 410 TB cases and 405 healthy controls passed the sex-stratified QC procedure. General summary statistics for the cohort, including mean and standard deviation of age and global ancestry as well as the ratio of males to females in both cases and controls, are shown in **Table 1**. Clear differences were observed between TB cases and controls for both age and ancestry, justifying their inclusion as covariates. Ancestral distributions were compared using the Wilcoxon signed-rank test and were shown to significantly differ (unpublished results) between the autosome and X chromosome (**Figure 1**). Y chromosome and mitochondrial haplogroup analysis also revealed strong sex biased admixture in the SAC population, with a strong female KhoeSan and male Bantu-speaking African and European bias (Quintana-Murci et al., 2010). Because sex biased admixture has been shown to produce strong differences between the autosomal and X chromosome ancestral components, the autosomal and X chromosome components were included as covariates in their respective analyses (Wang et al., 2008; Bryc et al., 2010a,b).

TABLE 1 | South African colored (SAC) sample characteristics showing case/control and sex distribution, mean and standard deviation of age and global ancestral components.

<sup>1</sup>http://zzz.bwh.harvard.edu/plink/

<sup>2</sup>http://www.python.org

<sup>3</sup>https://www.staff.ncl.ac.uk/richard.howey/cassi/using.html

#### SNP Based

The top results for the autosomal association testing are shown in **Table 2** and **Supplementary Figure S2**, with the QQ-plot indicating no constraints on the analysis or inflation of the results (**Supplementary Figure S2**). Following multiple test correction, no significant associations were identified for the combined or sex-stratified analysis, but it is important to note that the top associations differed between the sex-stratified and combined analyses as well as for males and females (**Table 2**). The most significant variant for the combined autosomal association test was rs17410035 (OR = 0.4, p-value = 1.5e−6, **Table 2**), located in the 3′-UTR of the DROSHA gene, which encodes a type 3 RNase. This RNase is involved in miRNA processing and miRNA biogenesis (Mullany et al., 2016). Although little evidence exists that rs17410035 has an impact on DROSHA gene expression or miRNA biogenesis (which could affect gene expression), it has been associated with increased risk of colon cancer (OR = 1.22, p-value = 0.014) (Mullany et al., 2016) and cancer of the head and neck (OR = 2.28, p-value = 0.016) (Zhang et al., 2010). When the rs17410035 SNP interacts with other variants (rs3792830 and rs3732360) it can further increase the risk for cancer of the head and neck (Zhang et al., 2010), which illustrates the importance of doing interaction analysis. For the autosomal sex-stratified analysis the variant with the lowest p-value in males was rs11960504 (OR = 2.8, p-value = 7.21e−6, **Table 2**), located downstream of the GRAMD2B gene, a gene for which no information is available. The top hit in females was rs2894967 (OR = 2.17, p-value = 4.77e−6), a SNP located upstream of the TENT4A gene, a gene coding for a DNA polymerase shown to be involved in DNA repair (Ogami et al., 2013). Closer inspection of the data revealed that the effects between the sexes were in the same direction for all top hits in the combined analysis, whereas all variants identified in the sex-stratified analysis had effects in opposite directions between the sexes, or one sex had no effect, indicating that even on the autosome strong sex specific effects are prominent.

For the X chromosome specific association testing a sex-stratified test was conducted and the results were then combined using Stouffer's method, which provided a good fit between expected and observed p-values (QQ-plot **Figure 2**) (Chang et al., 2014; Gao et al., 2015). The simpleM method indicated that of the 20,939 X-linked variants 17,600 explained 95% of the variance in the data, resulting in a significance threshold of 2.8e−6 (0.05/17,600). No statistically significant associations with TB susceptibility were identified in either the sex-stratified or the combined analysis (**Table 3** and **Figure 2**). The top hit for both the X-linked combined analysis (p-value = 2.62e−5) and the females-only analysis (OR = 1.83, p-value = 1.06e−4) was the same variant, rs768568, located in the TBL1X gene. For the males the variant with the lowest p-value was rs12011358 (OR = 0.37, p-value = 1.25e−4), located in the MTND6P12 gene. Neither of these genes has previously been associated with TB susceptibility and MTND6P12 is a pseudogene with unknown expression patterns or function. Variants in TBL1X have been shown to influence prostate cancer (Park et al., 2016) and central hypothyroidism (Heinen et al., 2016) susceptibility. TBL1X is a regulator of nuclear factor kappa-light-chain-enhancer of activated B cells (NF-kB) and is thus involved in the immune system, which could impact TB susceptibility.

The method of modeling X chromosome inactivation states, developed by Wang et al. (2014), was also incorporated into the X-linked association testing, but no significant associations were observed. Although the p-values were generally lower than for the Stouffer method, the QQ-plot revealed that including estimations of X chromosome inactivation states inflated the p-values and increased the chance of type 1 errors, and these results were therefore discounted (**Supplementary Table S1** and **Supplementary Figure S3**).

The sex differentiation test did not result in any significant associations (**Table 4**) and the variant with the lowest p-value was located in a pseudogene, RNU6-974P (p-value = 8.33e−5). The second lowest p-value was for a variant upstream of the SRPX gene (p-value = 2.18e−4), which has previously been shown to have a tumor suppressor function in prostate carcinomas (Kim et al., 2003). Whether these variants are associated with TB susceptibility or influence sex-bias is unclear, but the vastly opposite effects between the sexes are noteworthy. When comparing the OR for the sex differentiation test it is clear that variants can have major sex specific effects, again highlighting the need for sex-stratified analysis (**Table 4**).

#### Gene Based

The X chromosome gene-based analysis, in which 1,105 X-linked genes were analyzed, did not show any significant associations using a Bonferroni-adjusted significance threshold of 4.5e−5 (**Table 5**). The association with the lowest p-value for the combined analysis was in the chromosome X open reading frame 51B (CXorf51B) (p-value = 1.28e−4) coding for an uncharacterized protein (LOC100133053). The lowest p-value for males was in an RNA coding region that interacts with Piwi proteins (DQ590189.1, p-value = 1.7e−3), a subfamily of Argonaute proteins. While Piwi proteins are involved in germline stem cell maintenance and meiosis, the function of the Piwi interacting RNA molecules is unknown (Girard et al., 2006). For females the top hit was ARMCX1 (p-value = 6.07e−4), a tumor suppressor gene involved in cell proliferation and apoptosis of breast cancer cells. While this gene has not been previously implicated in TB susceptibility, M. tuberculosis has been shown to affect apoptosis pathways in order to evade the host immune response, suggesting that ARMCX1 could affect TB susceptibility (Parandhaman and Narayanan, 2014). While not significant, the analysis again reveals strong sex specific effects and the sex-stratified and combined analyses gave three different results (**Table 5**).
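The corrected significance thresholds reported in this section all follow from dividing a family-wise error rate of 0.05 by the number of tests (or, for simpleM, the effective number of tests). A short sketch of that arithmetic is given below; the function name is hypothetical, and the test counts are those stated in the text (17,600 effective X-linked SNPs, 1,105 X-linked genes and 1,893,973,105 pairwise interactions).

```python
def corrected_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests


# Test counts as reported in the text.
x_snp_threshold = corrected_threshold(17_600)               # simpleM effective SNPs
x_gene_threshold = corrected_threshold(1_105)               # Bonferroni over genes
interaction_threshold = corrected_threshold(1_893_973_105)  # Bonferroni over SNP pairs

# Formatted to two significant figures for comparison with the reported values.
summary = {
    "X-linked SNPs": f"{x_snp_threshold:.1e}",
    "X-linked genes": f"{x_gene_threshold:.1e}",
    "interactions": f"{interaction_threshold:.1e}",
}
```

The fewer tests that have to be corrected for, the less stringent the threshold, which is why grouping SNPs by gene reduces the multiple testing burden.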

#### Interaction Analysis

A genome-wide interaction analysis was performed using the CASSI software. In total 1,893,973,105 interactions were analyzed and, following a Bonferroni correction for the number of interactions tested, the significance threshold was set to 2.6e−11. For the joint effects model, 18 interactions passed the significance threshold (**Supplementary Table S2**). The top interaction was between rs1823897, upstream of the ARSF gene, and rs7064174 in the FRMPD4 gene (p-value = 7.23e−14), two genes for which not much information is available, and it is unclear how they could be involved in TB susceptibility. The top 450 associations from the joint effects model were then retested using logistic regression and the same covariates as the SNP based association testing. No significant interactions (threshold of 2.6e−11) were observed in the logistic regression model (**Table 6**), but as Bonferroni correction is very conservative the top interactions should still be considered as they reach the significance level for SNP based GWAS.

TABLE 2 | Top associations for the combined and sex-stratified autosomal association testing.

Among the top hits in the logistic regression analysis (**Table 6**) some could impact TB susceptibility as they are involved in immune functions. The interaction with the lowest p-value was between rs2631914, located upstream of LINC02153, which is upregulated in people with major depressive disorder (Cui et al., 2016), and rs8067702, located downstream of RTN4RL1, previously associated with congenital heart disease, microcephaly and mild intellectual disability (Tang et al., 2015). While this interaction is not very informative in the context of TB, three other interactions were identified that could impact TB susceptibility (**Table 6**).

TABLE 4 | Sex-differentiation analysis.

fgene-09-00678 January 16, 2019 Time: 18:45 # 8

The first interaction of interest is between RNF125 (rs35996537) and URI1 (rs1118924), which are involved in downregulation of CD4<sup>+</sup>/CD38<sup>−</sup> T-cells and PBMCs in HIV-1 positive individuals and in the NF-kB/CSN2/Snail pathway, activated by TNFα, respectively (Shoji-Kawata et al., 2007; Zhou et al., 2017). The second is the interaction between rs386560079 (ATP2C1), which is involved in regulation of intracellular Ca<sup>2+</sup>/Mn<sup>2+</sup> concentrations through the Golgi apparatus (Deng and Xiao, 2017), and rs6498130 (CIITA). Variants in the CIITA gene reduce the expression of MHC class II proteins and receptors, resulting in an immune privilege phenotype (Mottok et al., 2015). The final interaction of interest is between rs12286374 (NTM), which is mainly expressed in the brain and promotes neurite outgrowth and adhesion (Maruani et al., 2015), and rs2040739 (RNF126), a ring type E3 ligase involved in the protein kinase B pathway, which has been previously implicated in glucose metabolism, apoptosis, cell proliferation and transcription (Song et al., 2005). While none of these genes have previously been implicated in TB susceptibility, the fact that some of them are involved in immune functions suggests a role in TB susceptibility.

### DISCUSSION

In this GWAS we investigated TB susceptibility in the admixed SAC population, with a specific focus on sex-bias and the X chromosome. A sex-stratified QC protocol was applied to the data in order to conserve inherent differences between the sexes and all statistical analyses were conducted in a sex-stratified and combined dataset in order to fully assess the impact of sex on TB susceptibility and the male sex-bias it presents with. We found no significant associations on the autosome or X chromosome for either the sex-stratified or the combined SNP and gene-based association testing. A few significant interactions were identified, but the impact of these on TB susceptibility is unclear and will require further investigation to validate and functionally verify.

TABLE 5 | X chromosome gene-based association results.

For the combined autosomal SNP based association testing the only potential variant of interest is rs17410035, located in the DROSHA gene (**Table 2**), which is potentially involved in miRNA biogenesis and could impact TB susceptibility if immune related regulatory miRNA is affected. For the X-linked association testing the top association in males was in an uninformative pseudogene, while the female and combined analysis revealed the same variant, rs768568, located in the TBL1X gene (**Table 3**). The TBL1X protein has been shown to be a co-activator of NF-kB mediated transcription of cytokine coding genes, but the mechanism of activation is unclear (Park et al., 2016). NF-kB is a vital component of the proinflammatory signaling pathway and is involved in multiple immune pathways including TLRs (Lawrence, 2009), which have previously been shown to influence TB susceptibility (Schurz et al., 2015). Based on this one could extrapolate that variants in the TBL1X gene could affect activation and proinflammatory signaling of NF-kB, which could have a direct effect on the immune system and thus TB susceptibility. The direction of effect for this variant was the same in males and females (**Table 3**), but was less significant in males, probably due to loss of power when analyzing haploid genotypes. For the variants identified in the sex differentiated analysis it is unclear how they could influence TB susceptibility as the top hit is located in a pseudogene. However, the sex differentiated test did reveal just how big the difference in effects can be between the sexes for a specific variant (**Table 4**). If these variants with opposite effects are not analyzed in a sex-stratified way then the effects would cancel each other out and any information on sex specific effects would be lost. The X-linked gene-based association test revealed no significant associations despite having more power than the SNP based association testing. A possible reason for this could be that Bonferroni correction was used and, as this is very conservative, possible associations could have been missed. When looking at the most significant associations (**Table 5**), however, it is unclear how the identified genes could be implicated in TB susceptibility.
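The cancellation of opposite sex specific effects in a pooled analysis can be shown with a small worked example. The allele counts below are invented for illustration and are not data from this study; each table holds the counts of the tested allele and the other allele in cases and in controls.

```python
from typing import Tuple

# (tested allele in cases, other allele in cases,
#  tested allele in controls, other allele in controls)
Table = Tuple[int, int, int, int]


def odds_ratio(table: Table) -> float:
    """Allelic odds ratio from a 2x2 table of allele counts."""
    a, b, c, d = table
    return (a * d) / (b * c)


def pool(first: Table, second: Table) -> Table:
    """Add two 2x2 tables cell by cell, as a non-stratified analysis would."""
    a, b, c, d = (x + y for x, y in zip(first, second))
    return (a, b, c, d)


# Hypothetical counts with opposite effects in the two sexes.
males: Table = (40, 60, 25, 75)     # tested allele enriched in male cases
females: Table = (25, 75, 40, 60)   # tested allele depleted in female cases

or_males = odds_ratio(males)
or_females = odds_ratio(females)
or_pooled = odds_ratio(pool(males, females))
```

In this constructed example the allele doubles the odds of disease in one sex and halves it in the other, yet the pooled odds ratio is exactly one, so a combined analysis would report no effect at all.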

The joint effects interaction analysis revealed several significant interactions, but as association results have been previously shown to be severely influenced by admixture (Daya et al., 2014b) only the results for the logistic regression analysis will be discussed here. A few variants were identified in the logistic interaction analysis that could impact TB susceptibility (**Table 6**). URI1 (rs1118924) is activated by TNFα and is involved in the NF-kB/CSN2/Snail pathway, CIITA (rs6498130) impacts expression of MHC class II proteins and receptors, and rs35996537 (RNF125) and rs2040739 (RNF126) are both located in genes coding for E3 ubiquitin ligase proteins, which affect a multitude of cellular functions, such as apoptosis (Song et al., 2005) and protein degradation (Shin et al., 2015). NF-kB, TNFα, MHC class II, E3 ligases, apoptosis and T-cells have all been implicated in TB susceptibility and could collectively contribute by influencing the immune response (Hirsch et al., 1999, 2005; Torres et al., 2006; Fallahi-Sichani et al., 2012; Bai et al., 2013; Parandhaman and Narayanan, 2014; Shin et al., 2015; Franco et al., 2017). As TB is a complex disease all potential influential factors need to be considered and as such the interaction analysis cannot be ignored. Shortcomings of the interaction analysis are that it is very computationally intensive and suffers from a massive multiple test correction burden. Future research should thus focus on ways to prioritize variants for interaction analysis to decrease computation time, as well as on obtaining sufficient sample sizes to offset the multiple test correction burden.

A previous GWAS in the SAC population found a significant association with TB susceptibility in the WT1 gene (rs2057178, OR = 0.62, p-value = 2.71e−6) (Chimusa et al., 2014). This association did not reach genome-wide significance in our study (OR = 0.75, p-value = 0.049). At the time of the GWAS by Chimusa et al. (2014) there were few African and KhoeSan (only 6 KhoeSan) individuals in the reference data used for imputation and the accuracy of imputation in this population was not known. As the identified variant (rs2057178) was imputed into the data it should have been validated in the SAC population using an appropriate genotyping approach. Secondly, although the variant reached the significance threshold for the number of variants tested, it did not reach the genome-wide significance threshold of 5.0e−8 (Panagiotou and Ioannidis, 2012). Finally, the GWAS performed by Chimusa et al. (2014) only contained 91 control individuals compared to 642 cases, which could affect the power of the study. Chimusa et al. (2014) were unable to replicate previous associations identified in the X-linked TLR8 gene (Davila et al., 2008). The two TLR8 variants in our data, rs3764880 (OR = 1.73, p-value = 3.1e−4) and rs3761624 (OR = 1.70, p-value = 3.94e−4), also did not show significant associations. While the haploid genotypes in males contribute to this, a second influential factor could be admixture. Chimusa et al. (2014) did not perform X chromosome specific admixture analysis, which could affect association testing of X-linked genes. Furthermore, only six KhoeSan reference individuals were available, which could affect the accuracy of admixture inference and severely affect the results. For our study 307 KhoeSan individuals were available, which improved the admixture inference and could explain why stronger effects (higher OR) were detected for the TLR8 variants when compared to Chimusa et al. (2014). It is also important to note that using global ancestry components as covariates does not correct for ancestry at any specific locus and as a result each locus in this population could have up to five different ancestries. This could greatly reduce power and contribute to the lack of replication between studies. In order to address this, future studies could incorporate local ancestry inference into the analysis in order to determine the number of ancestries at a locus of interest. Other candidate genes identified in previous GWAS studies were also separately analyzed here, but associations did not replicate (Online **Supplementary Data Sheet S2**).

We did not find any significant associations with TB susceptibility, but highlight the need for sex-stratified analysis. Closer inspection of the data revealed a large number of SNPs with opposite directions of effect between the sexes, not only on the X chromosome but on the autosome too. Sex specific effects have previously been reported for autosomal variants associated with pulmonary function in asthma (Berhane et al., 2000). In the SAC population these opposite effects have previously been observed for X-linked variants in the TLR8 gene (Daya et al., 2013) and the same is observed in this study. Sex-stratified analysis should therefore be included in association studies and incorporated in the study design. This can be done by keeping the male to female ratio balanced in the cases and controls. It would also be prudent to do the power calculation for the males and females separately. This will ensure sufficient power for sex-stratified analysis and could elucidate informative sex specific effects. This study was done in a 5-way admixed population. As was observed for the interaction analysis, including admixture components significantly changes the association results. Furthermore, it was observed (unpublished results) that the ancestral distributions of the X chromosome and autosome are different (**Figure 1**), which is an indication of sex-biased admixture (Goldberg and Rosenberg, 2015; Shringarpure et al., 2016) and highlights the importance of including X chromosome admixture components for X-linked and sex-bias analysis. It is important to note here that the ancestral components in the SAC present with a very wide range (**Figure 1**) and all this variability could affect the power of association studies. It is therefore desirable to increase the sample size when analyzing admixed individuals. Alternatively, a meta-analysis can be conducted, including data from all five ancestral populations, or local ancestry inference could be included in the analysis.
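A sex-specific power calculation of the kind recommended above could look like the following rough sketch, which uses a normal approximation for a two-sided comparison of allele frequencies between cases and controls. The approximation, the function name, the allele frequencies and the sample sizes are all assumptions chosen for illustration; the number of alleles is taken as twice the number of individuals for diploid genotypes and equal to the number of individuals for hemizygous males on the X chromosome.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_cases: float, p_controls: float,
                          n_case_alleles: int, n_control_alleles: int,
                          alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing allele frequencies."""
    std_normal = NormalDist()
    se = sqrt(p_cases * (1 - p_cases) / n_case_alleles
              + p_controls * (1 - p_controls) / n_control_alleles)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(p_cases - p_controls) / se
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical X-linked variant: frequency 0.30 in cases and 0.20 in controls,
# with 200 cases and 200 controls of each sex.
power_female_x = power_two_proportions(0.30, 0.20,
                                       n_case_alleles=2 * 200,
                                       n_control_alleles=2 * 200)
power_male_x = power_two_proportions(0.30, 0.20,
                                     n_case_alleles=200,
                                     n_control_alleles=200)
```

Under these assumptions the male-only analysis of an X-linked variant has lower power than the female-only analysis at the same sample size, because males contribute one allele each. A stricter `alpha`, such as a genome-wide threshold, can be passed to see how much larger each sex-specific sample would need to be.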

### CONCLUSION

While no significant associations were identified, this study shows the importance of conducting sex-stratified analysis. This analysis should be incorporated during the study design phase to ensure sufficient power and allow the inclusion of covariates with sex specific effects (in this case admixture components). The sex-stratified analysis revealed that the effect of certain variants can differ between males and females, not only for the X chromosome but also for the autosome. TB is a complex disease and most genetic associations do not replicate across different populations, which complicates the elucidation of the genetic impact on disease susceptibility. By including sex-stratified analysis and identifying sex specific effects and the cause for the male bias we can adjust treatment according to sex and potentially improve treatment outcome and survival.

### DATA AVAILABILITY

The summary statistics from the case-control cohort will be made available to researchers on request, while access to the raw data will only be available to researchers who meet the criteria for access to confidential data after application to the Health Research Ethics Committee of Stellenbosch University. Requests can be sent to: Dr. Marlo Möller, E-mail: marlom@sun.ac.za.

### AUTHOR CONTRIBUTIONS

HS, MM, CK, and GT conceived the idea for this study. CG, GW, and BH did the calling and QC of the raw genotyping data. HS did the analysis and wrote the first draft. BH assisted with admixture analysis. All authors contributed to writing and proofreading for approval of the final manuscript.

### FUNDING

This research was partially funded by the South African government through the South African Medical Research Council. The content is solely the responsibility of the authors and does not necessarily represent the official views of the South African Medical Research Council. This work was also supported by the National Research Foundation of South Africa (Grant Number 93460) to EH. This work was also supported by a Strategic Health Innovation Partnership grant from the South African Medical Research Council and Department of Science and Technology/South African Tuberculosis Bioinformatics Initiative (SATBBI, GW) to GT.

### ACKNOWLEDGMENTS

We would like to acknowledge and thank the study participants for their contribution and participation. A preprint of this paper is available on the bioRxiv preprint repository (Schurz et al., 2018).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00678/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Schurz, Kinnear, Gignoux, Wojcik, van Helden, Tromp, Henn, Hoal and Möller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population

Haiko Schurz<sup>1,2</sup>\*†, Stephanie J. Müller<sup>1,2</sup>†, Paul David van Helden<sup>1</sup>, Gerard Tromp<sup>1,2</sup>, Eileen G. Hoal<sup>1</sup>, Craig J. Kinnear<sup>1</sup>‡ and Marlo Möller<sup>1</sup>‡

<sup>1</sup> DST-NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa, <sup>2</sup> South African Tuberculosis Bioinformatics Initiative (SATBBI), Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa

Genotype imputation is a powerful tool for increasing statistical power in an association analysis. Meta-analysis of multiple study datasets also requires a substantial overlap of SNPs for a successful association analysis, which can be achieved by imputation. Quality of imputed datasets is largely dependent on the software used, as well as the reference populations chosen. The accuracy of imputation of available reference populations has not been tested for the five-way admixed South African Colored (SAC) population. In this study, imputation results obtained using three freely-accessible methods were evaluated for accuracy and quality. We show that the African Genome Resource is the best reference panel for imputation of missing genotypes in samples from the SAC population, implemented via the freely accessible Sanger Imputation Server.

#### Edited by:

Nicola Mulder, University of Cape Town, South Africa

#### Reviewed by:

Peristera Paschou, Purdue University, United States

Pablo Orozco-terWengel, Cardiff University, United Kingdom

#### \*Correspondence:

Haiko Schurz haiko@sun.ac.za

†Co-first authors

‡Co-senior authors

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 27 September 2018 Accepted: 17 January 2019 Published: 05 February 2019

#### Citation:

Schurz H, Müller SJ, van Helden PD, Tromp G, Hoal EG, Kinnear CJ and Möller M (2019) Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population. Front. Genet. 10:34. doi: 10.3389/fgene.2019.00034

Keywords: imputation, accuracy, quality, admixture, 1000 Genomes, African, CAAPA, AGR

## INTRODUCTION

Over the past decade, genotyping technologies for genome-wide association studies (GWAS) have allowed for extensive and rapid genotyping of common variants (Ding and Jin, 2009; Ragoussis, 2009; Vergara et al., 2018). Commercial single nucleotide polymorphism (SNP) genotyping arrays contain between 300 000 and 2.5 million markers, but none have complete coverage of the human genome. Genotype imputation can be used to improve both coverage and power of a GWAS by inferring the alleles of un-genotyped SNPs based on the linkage disequilibrium (LD) patterns derived from directly genotyped markers and comparing them to a suitable reference population (Marchini and Howie, 2010; Pei et al., 2010; Malhotra et al., 2014). These imputed variants can then be used for association testing, to improve fine-mapping of a target region, or to conduct a meta-analysis.

Meta-analysis is a powerful and commonly used technique, but if the study data were generated using different platforms, there may be a reduction in statistical power due to minimal overlap between the genotyped markers. To overcome this reduction in power, imputation may be used to increase the marker overlap between datasets, thereby improving the power of a meta-analysis (Anderson et al., 2008; Marchini and Howie, 2010; Hancock et al., 2012; McRae, 2017).

**Abbreviations:** 1000G, 1000 Genomes Phase 3 reference panel; AGR, African Genome Resource; AGVP, African Genome variation project; CAAPA, Consortium on Asthma among African ancestry populations in the Americas; HRC, Haplotype Reference Consortium; MEGA, Multi ethnic genotyping array; MIS, Michigan imputation server; PBWT, Positional Burrows-Wheeler Transformation; SAC, South African Colored; SIS, Sanger imputation server.

Imputation is dependent on the adequate matching of haplotypes based on LD, and thus it is essential that the reference population is genetically similar to the population being imputed. Numerous reference datasets are freely available online and can be used for imputation via suitable imputation software. These include, amongst others, the 1000 Genomes phase 3 data (1000G) (Sudmant et al., 2015), the Human Genome Diversity Project (Cavalli-Sforza, 2005), the Haplotype Reference Consortium (HRC) (McCarthy et al., 2016) and the HapMap consortium (International HapMap 3 Consortium et al., 2010). Most of the above-mentioned reference panels focussed mainly on representing the European population, and data for African populations and admixed populations containing African ancestry are limited.

African and admixed populations are more heterogeneous in their haplotype block structure and, as such, would benefit from a larger reference dataset incorporating more genetic diversity (Vergara et al., 2018). Reference datasets of this nature would increase the chances that an observed haplotype is present in the reference data, thereby greatly improving the imputation accuracy for African and admixed individuals with African ancestry. Fortunately, recent years have seen a substantial increase in the representation of African populations in the 1000G data (Sudmant et al., 2015) and additional databases focusing on representing African populations have been established. Three such resources have recently become viable options for accurate imputation of African populations: the Consortium on Asthma among African ancestry populations in the Americas [CAAPA, (Mathias et al., 2016)] reference panel, which is available for download from dbGaP with Accession ID: phs001123.v1.p1 (access required); the African Genome variation project (AGVP) (Gurdasani et al., 2015); and the African Genome Resource<sup>1</sup> (AGR, not publicly available).

The AGR<sup>1</sup> contains the largest collection of haplotypes of African origin, with all the 1000G samples and an additional 2000 samples from Uganda, 100 samples from each of a set of five populations from Ethiopia, Egypt, Namibia (Nama/Khoesan), and South Africa (Zulu). The AGR contains 97 004 203 biallelic SNPs spanning the autosomes and the X chromosome for 4 956 samples<sup>1</sup>. The 1000G reference panel contains 84 237 642 biallelic SNPs for 2 504 samples selected from 26 populations across Africa, Europe, the Americas, and South and East Asia (Sudmant et al., 2015). The CAAPA reference panel contains whole-genome sequences for 883 samples recruited into 19 case-control studies on asthma in the Americas. A total of 31 163 897 autosomal SNPs are included on the panel for imputation (Mathias et al., 2016).

Apart from choice of reference panel, the software used also affects the imputation accuracy (Hancock et al., 2012). Many imputation software packages are freely available and have been previously tested and validated for accuracy, including Impute2 (Howie et al., 2009), Beagle (Verma et al., 2014), MaCH, MaCH-Minimac and MaCH-Admix (Roshyara et al., 2016). These imputation software packages were evaluated in African and African-American populations using different reference panels and produced varying degrees of imputation quality and accuracy (Hancock et al., 2012; Roshyara et al., 2016).

Huang et al. (2009) tested imputation accuracy in 29 populations using the HapMap reference and showed that the highest imputation accuracy was achieved for European populations, followed by East-Asian, Central- and South-Asian, American, Oceanian, Middle-Eastern, and African populations. An additional finding from this study was that combining multiple reference populations resulted in improved imputation accuracy for any population analysed (Huang et al., 2009). While more appropriate reference panels are now available, which would increase the accuracy of imputation in African individuals, these results indicate that there are difficulties when imputing populations for which there is a limited number of reference individuals.

Imputation accuracy has previously been assessed for African populations (Huang et al., 2009; Hancock et al., 2012; Roshyara et al., 2016) and for populations with two- or three-way admixture, with results reaching over 75% accuracy (Nelson et al., 2016). In the present study, we assessed the accuracy of imputation in the five-way admixed South African Colored (SAC) population. The SAC population contains genetic contributions from Bantu-speaking Africans, KhoeSan, Europeans, and South- and East-Asians (de Wit et al., 2010; Daya et al., 2013). While imputation in this population has been conducted previously and the resulting data used for association analyses (Chimusa et al., 2014), the accuracy of imputation in this highly admixed population is yet to be evaluated.

Here we assessed the quality and accuracy of results obtained from imputation in the SAC population and show that the AGR reference panel, accessed via the Sanger Imputation Server, produced the highest quality and accuracy in imputed data. An in-house protocol using IMPUTE2 and the 1000G reference panel imputed more variants than Sanger (AGR) but at a slightly reduced quality and accuracy.

#### METHODS

#### SAC Data

Two sources of data for the SAC cohort were available, namely genotypes obtained using the Affymetrix 500k array containing 500 000 SNP markers (Affymetrix, California, United States) and the Illumina (Illumina, California, United States) multiethnic genotyping array (MEGA) with 1.7 million markers. This study was carried out in accordance with the recommendations of the Health Research Ethics Committee of Stellenbosch University (project registration numbers S17/01/013, S17/02/037, and 95/072). The protocol was approved by this committee before participant recruitment, and written informed consent was obtained from all study participants prior to blood collection, in accordance with the Declaration of Helsinki.

Genotype data obtained using the Affymetrix and MEGA arrays were subjected to iterative quality control (QC) using PLINK v1.9 (Purcell et al., 2007; Chang et al., 2015) as previously described (Schurz et al., 2018), with the exception that related individuals were not removed. Individuals missing more than 10% genotype information and SNPs with more than 2% missingness were removed, as were variants with a minor allele frequency (MAF) below 5% and loci with excessive heterozygosity (a detailed description of the filtering process can be found in **Supplementary Data S3**). All remaining missingness in the data is randomly distributed (data not shown) and the stringent SNP filter was used to ensure there are no incorrectly genotyped variants in the data that could influence the imputation accuracy (**Supplementary Data S4**).

<sup>1</sup>https://imputation.sanger.ac.uk/

These QC steps were iterated until no additional variants or individuals were removed, and concluded with a sex-concordance check to remove individuals with incorrect sex information. Genotype Harmonizer version 1.4.15 (Deelen et al., 2014) was used to strand align the two datasets to the 1000 Genomes Phase 3 reference panel [human genome build 37, (Sudmant et al., 2015)], update SNP IDs and remove any variants not in the reference panel. For the strand alignment, a minimum LD value of 0.3 with at least three flanking variants was required. A secondary MAF alignment was also used at a threshold of 5%. Finally, the minimum posterior probability to call genotypes in the input data was left at the default value of 0.4.
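The iterative filtering described above can be pictured as a loop that repeats until neither samples nor variants are removed. The sketch below is only an illustration in plain Python, not the PLINK workflow used in the study: the function name `iterative_qc`, the genotype encoding (0, 1 or 2 copies of one allele, `None` for a missing call) and the input layout are assumptions, and the heterozygosity and sex-concordance checks are omitted.

```python
def iterative_qc(genotypes, max_ind_missing=0.10, max_snp_missing=0.02, min_maf=0.05):
    """Hypothetical helper. genotypes maps sample ID -> list of calls,
    where each call is 0, 1 or 2 allele copies, or None if missing.
    Returns (kept_sample_ids, kept_snp_indices)."""
    samples = sorted(genotypes)
    n_snps = len(genotypes[samples[0]]) if samples else 0
    snps = list(range(n_snps))
    while True:
        size_before = (len(samples), len(snps))
        # Step 1: drop individuals missing more than 10% of the remaining SNPs.
        samples = [
            s for s in samples
            if snps and sum(genotypes[s][j] is None for j in snps) / len(snps) <= max_ind_missing
        ]
        # Step 2: drop SNPs with more than 2% missingness or MAF below 5%.
        kept = []
        for j in snps:
            calls = [genotypes[s][j] for s in samples]
            observed = [c for c in calls if c is not None]
            if not observed:
                continue
            if 1 - len(observed) / len(calls) > max_snp_missing:
                continue
            freq = sum(observed) / (2 * len(observed))
            if min(freq, 1 - freq) < min_maf:
                continue
            kept.append(j)
        snps = kept
        # Stop once a full pass removes nothing.
        if (len(samples), len(snps)) == size_before:
            return samples, snps
```

Repeating both steps matters because removing a sample changes per-SNP missingness and MAF, and removing a SNP changes per-sample missingness.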

#### Phasing and Imputation

Three different reference panels were used to conduct five protocols of phasing and imputation in order to assess which performed best for our admixed population (**Table 1**). The first protocol was an in-house method where the Affymetrix data (PLINK files) were phased using SHAPEIT v2 (Delaneau et al., 2012), using the default effective population size of 15 000. Imputation was then performed using IMPUTE2 v2.3.2 (Howie et al., 2009) and the 1000G Phase 3 reference panel (Sudmant et al., 2015), with default parameters except for the effective population size, which was set to 15 000 for consistency with the haplotype phasing process.

The second and third protocols made use of the Sanger Imputation Server<sup>1</sup> (SIS). Genotypes from the Affymetrix 500k array in PLINK file format were converted to Variant Call Format (VCF) using PLINK v1.9 and then uploaded to the server, where phasing was performed using SHAPEITv2.r790 (Delaneau et al., 2012) followed by imputation using the Positional Burrows-Wheeler Transformation (PBWT) algorithm (Durbin, 2014). Imputation was performed in two separate runs: the first run made use of the 1000G Phase 3 reference panel for imputation, and the second run made use of the African Genome Resource panel.

The fourth and fifth protocols made use of the Michigan Imputation Server [MIS, (Das et al., 2016)]. PLINK files were converted to VCF using PLINK v1.9 and uploaded to the server for two imputation runs, both of which were run in the QC and imputation mode. SHAPEITv2.r790 was used for haplotype phasing in both runs, followed by imputation using the Minimac3 algorithm (Das et al., 2016). For the first run, the mixed population option was used for the QC, and imputation was performed with the 1000G Phase 3 reference panel. For the second run, which imputed with the CAAPA reference panel, it was mandatory to select the African-American population for QC.

In summary, IMPUTE2 and Minimac3 implement a Hidden Markov Model (HMM) in different ways. IMPUTE2 uses a Markov chain to implement the HMM, while Minimac3 uses a Monte Carlo procedure to implement the HMM (Li et al., 2010). PBWT also works on a Monte Carlo iteration, but instead of an HMM it infers haplotypes using a Positional Burrows-Wheeler Transformation. All these imputation algorithms do a number of iterations of phasing (haplotype inference) and imputation, and then the probabilities for each genotype are averaged over all iterations to give the posterior probability for each imputed genotype (**Supplementary Data S2**).

Although haplotype pre-phasing has been shown to decrease imputation accuracy slightly, it was used in this study for consistency between the protocols (the Michigan server did not have an option to skip phasing) and to increase the speed of imputation (Howie et al., 2009).

For all imputation runs, the reference panels included all available populations since using an all-inclusive reference panel is known to improve imputation accuracy (Huang et al., 2009). Of the five variations of imputation performed, only the MIS (CAAPA) run was incapable of performing imputation on the X chromosome. Results for the X chromosome have, however, been included for the other four imputation runs since the accuracy of X-linked imputation has not been previously evaluated.

#### QC of Imputed Data

Imputed data were returned from the imputation software in one of two formats: either as a VCF file or in Impute2 (gen/sample) format. Depending on the format, one of two QC procedures was employed to convert the imputed data from genotype probabilities to actual genotypes. Data output from the two procedures were compared and showed complete overlap, so the procedures can be used interchangeably.

#### Procedure 1

For the in-house imputation performed using Impute2, a gen/sample output file was obtained and converted to a PLINK file using GTOOL<sup>2</sup> version 0.7.5. R version 3.2.4 was used to identify INDELS, which were removed using GTOOL (R Development Core Team, 2013). This was performed in order to more accurately assign SNP IDs and allele information when genotypes were called using GTOOL. The genotype calling threshold was set to 0.7, which was determined to have the best ratio of imputation accuracy and number of imputed variants (**Supplementary Figure S1**). Once genotypes were called, the resulting ped/map PLINK files were converted to bed/bim/fam PLINK files and all variants with no-call alleles were removed.
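The thresholding step can be illustrated with a short sketch: a genotype is called only when its most probable state reaches the calling threshold, and is otherwise set to a no-call. This is an assumption-laden illustration rather than the internals of GTOOL or PLINK; the function name `call_genotype`, the probability ordering (AA, AB, BB) and the no-call symbol `N` are chosen here for the example.

```python
def call_genotype(probs, alleles, threshold=0.7, missing="N"):
    """Hypothetical helper. probs is (P(AA), P(AB), P(BB)) for one sample
    at one SNP; alleles is (A, B). Returns the called pair of alleles,
    or a pair of no-call symbols if no state reaches the threshold."""
    a, b = alleles
    candidates = [(a, a), (a, b), (b, b)]
    # Index of the most probable genotype state.
    best = max(range(3), key=lambda i: probs[i])
    if probs[best] < threshold:
        return (missing, missing)
    return candidates[best]
```

For instance, `call_genotype((0.05, 0.9, 0.05), ("A", "G"))` yields a heterozygous call, whereas probabilities of `(0.5, 0.4, 0.1)` yield a no-call. A higher threshold removes more uncertain calls at the cost of fewer called genotypes, which is the trade-off referred to above.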

#### Procedure 2

For the imputation completed using the two online servers, VCF files were returned. The VCF files were converted to PLINK ped/map files using a genotype calling threshold of 0.7 (PLINK option: --vcf-min-gp) and coding all no-call alleles as N (PLINK option: --output-missing-genotype N). INDELS and SNPs with no-call alleles were removed and the files were converted to PLINK bed format (bed/bim/fam).

<sup>2</sup>http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html

TABLE 1 | Haplotype phasing and genotype imputation methods used.

<sup>1</sup>AGR, African Genome Resource. <sup>2</sup>CAAPA, Consortium on Asthma among African-ancestry Populations in the Americas.

#### Imputation Quality and Accuracy

To assess imputation quality we considered the internal quality metrics obtained from each imputation protocol: the INFO score (in the case of IMPUTE2) and the r-squared value (for PBWT and Minimac3). Although the INFO score and r-squared quality metrics are not directly comparable, they have been shown to be highly correlated in two notable studies (Marchini and Howie, 2010; Browning and Browning, 2016). Both papers reported that the quality scores returned by several commonly used imputation software packages, including those utilized in the protocols of this study, are highly correlated. These values range from 0 to 1, where a higher value indicates increased quality of an imputed SNP. These quality metrics were used to assess quality within each imputed dataset rather than between datasets. Median quality scores were plotted against MAF in order to determine how quality was affected by MAF and to assess which imputation protocol had returned the best quality data at a given MAF.
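Summarising quality by MAF amounts to grouping variants into MAF bins and taking the median score in each bin. The sketch below shows one way this could be done; the function name `median_quality_by_maf` and the bin edges are assumptions made for illustration and are not taken from the study.

```python
from statistics import median

def median_quality_by_maf(variants, bin_edges=(0.0, 0.01, 0.05, 0.1, 0.2, 0.5)):
    """Hypothetical helper. variants is an iterable of (maf, quality) pairs,
    where quality is an INFO score or r-squared value between 0 and 1.
    Returns {(low, high): median quality} for every non-empty MAF bin."""
    bins = {}
    for maf, quality in variants:
        for low, high in zip(bin_edges, bin_edges[1:]):
            # Bins are half-open [low, high); the last bin also includes its
            # upper edge so that variants with MAF exactly 0.5 are kept.
            if low <= maf < high or (high == bin_edges[-1] and maf == high):
                bins.setdefault((low, high), []).append(quality)
                break
    return {edges: median(scores) for edges, scores in sorted(bins.items())}
```

The resulting per-bin medians can then be plotted per protocol, bearing in mind that INFO and r-squared values are correlated but not identical measures.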

Imputation accuracy was assessed by extracting the overlapping individuals from the MEGA and imputed Affymetrix data. Any variants that had been genotyped on both platforms prior to imputation were then removed using PLINK. Between the two arrays there were only 41 815 variants genotyped on both platforms; these were evenly distributed across the genome, so their removal post-imputation should not affect the analysis. The analysis was performed per chromosome and for each SNP the alleles were compared between the imputed Affymetrix data and the MEGA data. If both alleles of a SNP matched, it was considered a complete match (or a flip match if the alleles were correct but the strand was swapped). If only one allele matched, it was considered a half match, and if no alleles matched, it was considered a no-match. For each chromosome the total number of imputed variants was recorded and their distribution by MAF was plotted to determine how the number of variants correlated with MAF between the different imputation protocols.
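The four match categories can be expressed as a small classification function. The following is a minimal sketch under stated assumptions: each SNP is represented by an unordered pair of alleles from the imputed data and from the array data, a flip match is detected by complementing the imputed alleles, and the names `classify_match` and `shared_alleles` are hypothetical. No-call alleles are assumed to have been removed beforehand.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def shared_alleles(pair1, pair2):
    """Count alleles shared between two unordered allele pairs (0, 1 or 2)."""
    remaining = list(pair2)
    shared = 0
    for allele in pair1:
        if allele in remaining:
            remaining.remove(allele)
            shared += 1
    return shared

def classify_match(imputed, genotyped):
    """Return 'complete', 'flip', 'half' or 'none' for one SNP."""
    if shared_alleles(imputed, genotyped) == 2:
        return "complete"
    # Same alleles reported on the opposite strand.
    flipped = tuple(COMPLEMENT[a] for a in imputed)
    if shared_alleles(flipped, genotyped) == 2:
        return "flip"
    if shared_alleles(imputed, genotyped) == 1:
        return "half"
    return "none"
```

Note that for A/T and C/G pairs a strand flip is indistinguishable from a complete match in this scheme, so such pairs are reported as complete matches.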

To determine the imputation accuracy, the SNP overlap between the MEGA and imputed Affymetrix data was assessed. Within this overlap the number of SNPs that were complete-, flip-, half- or non-matched was recorded along with their average INFO score or r-squared value. Since flipped SNPs can be realigned to a reference, or to a different dataset if a meta-analysis is planned, they were considered matches for the purposes of calculating imputation accuracy. Accuracy was calculated as the proportion of overlapping SNPs that were complete (or flipped) matches. This provided an indication of accuracy and error rate within the overlapping region and should be a good indication of overall imputation accuracy. These calculations were performed for the autosomes and the X chromosome separately in order to determine how accurately and with what quality the X-linked variants were imputed compared to the autosomal variants.
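As a worked illustration of this calculation, accuracy is the number of complete and flip matches divided by the total number of overlapping SNPs. The function name `imputation_accuracy` and the category labels below are assumptions introduced for the example, and the counts in the usage note are invented to show the arithmetic rather than taken from the study.

```python
from collections import Counter

def imputation_accuracy(categories):
    """Hypothetical helper. categories is an iterable of per-SNP labels:
    'complete', 'flip', 'half' or 'none'. Returns the proportion of
    overlapping SNPs counted as matches (complete or flip)."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no overlapping SNPs to evaluate")
    return (counts["complete"] + counts["flip"]) / total
```

With eight complete matches, one flip match and one half match among ten overlapping SNPs, the function returns 9/10, that is, an accuracy of 90%.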

### RESULTS

#### Genotyping Data

After QC and strand alignment, 919 individuals and 239 612 variants with a genotyping rate of 99.39% remained in the Affymetrix 500k dataset, and 771 individuals with 1 491 347 variants remained in the MEGA dataset with a genotyping rate of 99.43%. A total of 325 individuals were genotyped on both the Affymetrix and MEGA arrays and 43 140 SNP markers overlapped between the two platforms. Following imputation, the 325 individuals with genotype data from both MEGA and Affymetrix were extracted from both the MEGA data and the imputed Affymetrix data so that their imputed genotypes (Affymetrix) could be directly compared to their actual genotypes (MEGA) in order to determine imputation accuracy. The 43 140 SNPs that were genotyped on both platforms were removed from both datasets after imputation in order to not skew the accuracy analysis.

#### Imputation

For the SAC cohort, the best genotype imputation results were obtained from the in-house IMPUTE2 (with 1000G reference panel) and the Sanger imputation server (with the AGR reference panel) methods. The in-house method resulted in the most imputed variants across both the autosomes (60 438 387) and X chromosome (2 574 793), followed by SIS (AGR) (52 088 766 autosomal and 1 638 163 X-linked variants), while the SIS with 1000G reference panel had slightly fewer imputed variants than with the AGR panel (50 418 390 autosomal and 1 679 254 X-linked variants). The Michigan imputation server had only about half as many imputed variants as the other methods, for either reference panel (**Table 2**). The number of imputed variants that did not reach the genotype calling threshold (0.7) was lowest in the in-house method followed by the Michigan server results, and SIS (1000G) and SIS (AGR) had the highest percentage of variants not reaching the genotype calling threshold (**Table 2**). When imputed Affymetrix variants were compared to the MEGA genotypes, the SIS (AGR) data had the highest accuracy (within the overlapping region) on both the autosomes (89.27%) and X chromosome (90.21%). The imputation accuracy for the in-house and SIS (1000G) methods was very similar, with the in-house method having a slightly lower genome wide error rate. The accuracy of the Michigan server was good on the autosomes (∼62-83%) but lacking for the X chromosome (∼65%) (**Table 3**). The SIS (AGR) imputed the fewest X-linked variants, but at the highest accuracy, whereas the in-house method had twice as many X-linked variants as Sanger with only a 1.28% drop in accuracy (**Tables 3**, **4**).

For the autosomes and X chromosome, the SIS (AGR) produced the best imputation quality across all MAF ranges, closely followed by the in-house method where quality was second to SIS (1000G) only for low MAF (0-1%) variants on the X chromosome (**Figure 1**). The Michigan server produced the lowest quality imputation according to internal quality metrics (**Figure 1** and **Table 4**). The median quality score was comparable across all autosomal chromosomes and thus only chromosome 1 is shown as a representation of the autosomes and for comparison to the X chromosome (**Figure 1**). **Figure 2** confirms that the SIS (AGR) method and the in-house method produced the best imputation quality since more SNPs were imputed at high quality for both Chromosome 1 and the X chromosome. Since the SIS (AGR) has the largest number of imputed genotypes not reaching the calling threshold, a trade-off between quality and number of variants exists between SIS (AGR) and the in-house method.

#### DISCUSSION

Imputation accuracy was previously evaluated in African and three-way admixed populations, but we have performed the first evaluation in a five-way admixed population. The imputation accuracy in African-American individuals (considered to be three-way admixed) ranges from 78% (Malhotra et al., 2014) to 89% (Howie et al., 2009). Bantu-speaking Southern African individuals have been imputed with an accuracy of about 95% and even African San individuals had an imputation accuracy of 89% (Huang et al., 2009). In the present study, the SIS (AGR) and the in-house imputation protocol had similar accuracies (89% and 88%, respectively, **Table 2**) compared to previous results from African and admixed populations. It should, however, be noted that the clear majority of non-matching variants were ambiguous (imputed genotype A/T and MEGA genotype G/C, or vice versa) and the majority of half-matched variants were imputed as monomorphic (data not shown). These ambiguous variants were imputed at high quality (**Table 3**) and were not removed when filtering on quality score, but could be removed or aligned to a reference allele using appropriate software (such as Genotype Harmonizer). However, removal of these ambiguous variants is not mandatory. When analyzing a single dataset, the ambiguous variants of interest can be compared to a relevant reference genome and then flipped. This is especially useful when conducting a meta-analysis since these variants will then be comparable even though they originate from different datasets. If these ambiguous variants are considered to be correctly imputed, then the accuracy of imputation with the SIS (AGR) increases to 96% while the accuracy of the in-house imputation protocol increases to 94%. Accuracy and quality can be further improved by removing half-matching variants through a quality score and MAF filter.

TABLE 2 | Number of imputed variants and variants overlapping with MEGA as well as the percentage of calls that did not reach the genotype calling threshold (0.7). Imputed number of SNPs is given in millions and Overlapping number is given per ten thousand.

<sup>1</sup>Number of SNPs in millions. <sup>2</sup>Number of SNPs per ten thousand.

TABLE 3 | Genome wide error rate and accuracy of imputation on the autosomes and X chromosome.

TABLE 4 | Number of SNPs and accompanying median quality score for the three categories, within the MEGA overlapping region.

<sup>a</sup>Number of SNPs in thousands.

Since four of the five protocols were capable of imputing X-linked variants, and since the quality and accuracy of X chromosome imputation has not been previously tested, we included it for this analysis. The X chromosome had only slightly lower or higher imputation quality for all imputation runs when compared to the autosomes, indicating that X chromosome imputation can be performed with confidence (**Tables 2**, **3**). Although not specifically analysed here, the quality of imputation at low MAF should also be noted: the imputation quality for rare variants was unexpected as large reference panels with the correct populations are required to accurately impute rare variants (Kim et al., 2015; Zheng et al., 2015; **Figure 1**).

The biggest limitation for imputation in the five-way admixed population is the lack of a suitable reference panel. Imputation in the San population has been shown to have the lowest imputation accuracy (89%) compared to other African populations (Huang et al., 2009), which could be due to a lack of applicable reference individuals. Since the main ancestral component in the SAC population is KhoeSan (Daya et al., 2013) this could affect the accuracy and quality of imputation in this population. However, this has improved due to the addition of KhoeSan individuals in the AGR and 1000G reference panels.

In conclusion, we have shown that imputation of the SAC population is feasible and produces quality data on both the autosomes and X chromosome. While the SIS (AGR) imputation had the best quality and accuracy, the in-house protocol using Impute2 and 1000G Phase 3 also produced imputed data of a high standard and had the highest number of imputed variants. This protocol may prove especially useful in the case of a meta-analysis where one wishes to maximize SNP overlap between datasets. As the number of applicable reference populations and individuals grows, imputation accuracy will improve for African and admixed populations, but it remains the gold standard to Sanger sequence a variant of interest to confirm that the imputed variant is present in the population prior to conducting further research.

### DATA AVAILABILITY

Summary statistics for the quality and accuracy assessment of the SAC data will be made available to researchers who meet the criteria for access to confidential data after application to the Health Research Ethics Committee of Stellenbosch University. Requests can be sent to: MM, E-mail: marlom@sun.ac.za.

### AUTHOR CONTRIBUTIONS

HS, SM, GT, CK, and MM conceived the idea for this study. HS and SM performed the data QC. SM conducted phasing, imputation, and quality assessment. HS performed the accuracy assessment and wrote the first draft. All authors contributed to writing and proofreading for approval of the final manuscript.

#### FUNDING

This research was partially funded by the South African government through the South African Medical Research Council. The content is solely the responsibility of the authors and does not necessarily represent the official views of the South African Medical Research Council. This work was also supported by the National Research Foundation of South Africa (grant number 93460) to EH and by a Strategic Health Innovation Partnership grant from the South African Medical Research Council and Department of Science and Technology/South African Tuberculosis Bioinformatics Initiative (SATBBI, GW) to GT.

#### ACKNOWLEDGMENTS

We would like to acknowledge and thank the study participants for their contribution and participation.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00034/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Schurz, Müller, van Helden, Tromp, Hoal, Kinnear and Möller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Using Whole Genome Sequencing in an African Subphenotype of Myasthenia Gravis to Generate a Pathogenetic Hypothesis

#### Melissa Nel<sup>1</sup> , Nicola Mulder<sup>2</sup> , Tarin A. Europa<sup>1</sup> and Jeannine M. Heckmann<sup>1</sup> \*

<sup>1</sup> Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa, <sup>2</sup> Computational Biology Division, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa

#### Edited by:

Fulvio Cruciani, Sapienza University of Rome, Italy

#### Reviewed by:

Henry Kaminski, George Washington University, United States Linda L. Kusner, George Washington University, United States

#### \*Correspondence:

Jeannine M. Heckmann jeanine.heckmann@uct.ac.za

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 12 November 2018 Accepted: 11 February 2019 Published: 01 March 2019

#### Citation:

Nel M, Mulder N, Europa TA and Heckmann JM (2019) Using Whole Genome Sequencing in an African Subphenotype of Myasthenia Gravis to Generate a Pathogenetic Hypothesis. Front. Genet. 10:136. doi: 10.3389/fgene.2019.00136

Myasthenia gravis (MG) is a rare, treatable antibody-mediated disease which is characterized by muscle weakness. The pathogenic antibodies are most frequently directed at the acetylcholine receptors (AChRs) at the skeletal muscle endplate. An ophthalmoplegic subphenotype of MG (OP-MG), which is characterized by treatment-resistant weakness of the extraocular muscles (EOMs), occurs in a proportion of myasthenics with juvenile symptom onset and African genetic ancestry. Since the pathogenetic mechanism(s) underlying OP-MG is unknown, the aim of this study was to use a hypothesis-generating genome-wide analysis to identify candidate OP-MG susceptibility genes and pathways. Whole genome sequencing (WGS) was performed on 25 AChR-antibody positive myasthenic individuals of African genetic ancestry sampled from the phenotypic extremes: 15 with OP-MG and 10 individuals with control MG (EOM treatment-responsive). Variants were called according to the Genome Analysis Toolkit (GATK) best practice guidelines using the hg38 reference genome. In addition to single variant association analysis, variants were mapped to genes (±200 kb) using VEGAS2 to calculate gene-based test statistics and HLA allele group assignment was inferred through "best-match" alignment of reads against the IMGT/HLA database. While there were no single variant associations that reached genome-wide significance in this exploratory sample, several genes with significant gene-based test statistics and known to be expressed in skeletal muscle had biological functions which converge on muscle atrophy signaling and myosin II function. The closely linked HLA-DPA1 and HLA-DPB1 genes were associated with OP-MG subjects (gene-based p < 0.05) and the frequency of a functional A > G SNP (rs9277534) in the HLA-DPB1 3′UTR, which increases HLA-DPB1 expression, differed between the two groups (G-allele 0.30 in OP-MG vs. 0.60 in control MG; p = 0.04). Furthermore, we show that rs9277534 is an HLA-DPB1 expression quantitative trait locus in patient-derived myocytes (p < 1 × 10<sup>−3</sup>). The application of a SNP to gene to pathway approach to this exploratory WGS dataset of African myasthenic individuals, and comparing dichotomous subphenotypes, resulted in the identification of candidate genes and pathways that may contribute to OP-MG susceptibility. Overall, the hypotheses generated by this work remain to be verified by interrogating candidate gene and pathway expression in patient-derived extraocular muscle.

Keywords: myasthenia gravis, African, whole genome sequencing, extraocular muscle, ophthalmoplegia, HLA-DPB1, extreme phenotype, association

#### INTRODUCTION


Myasthenia gravis (MG) is a rare but treatable antibody-mediated disease which results in fatigable weakness of skeletal muscles, including extraocular (or eye) muscles. In most individuals this is a result of pathogenic antibodies targeting the acetylcholine receptors (AChR) at the neuromuscular junction, which cause activation of complement at the muscle endplate and consequent muscle damage (Engel et al., 1977).

Though the incidence of AChR-antibody positive MG in sub-Saharan Africa is similar to global figures (Mombaur et al., 2015), and the response to MG therapies overall is similar among populations (Heckmann et al., 2007), we have recognized a subphenotype of treatment-resistant ophthalmoplegia, or OP-MG, among a subset of MG subjects of African genetic ancestry (Heckmann et al., 2007; Heckmann and Nel, 2017). This OP-MG subphenotype is characterized by severe, persistent extraocular muscle (EOM) weakness and commonly affects subjects with juvenile onset, but otherwise characteristic AChR-antibody positive MG (i.e., generalized muscle weakness which responds to treatment). The pathogenesis of the OP-MG subphenotype remains unknown though we hypothesize that individuals who develop this subphenotype may harbor African susceptibility variants which impact on the MG disease process in the particular context of the EOMs.

A previous extended whole exome sequencing (WES) study of OP-MG subjects, including untranslated region (UTR) coverage, identified a number of putative regulatory variants (Nel et al., 2017). However, this study suffered from several limitations including false positive variant calls which could not be validated by Sanger sequencing (likely PCR related) and limited coverage of the non-coding genome (which is expected to harbor a greater burden of variants contributing to complex disease risk). In that study we identified a number of OP-MG associated variants in the HLA class II region, though it was not possible to verify them with Sanger sequencing due to the complexity of this region. This was interesting because the genetic basis of MG has been investigated for more than three decades in individuals of European genetic ancestry and the consistent finding has been the association of the class I and II HLA region with individuals by age at MG onset (Nel and Heckmann, 2018).

The focus of the present study was to perform PCR-free whole genome sequencing (WGS) in a well characterized cohort of OP-MG and control MG individuals, all AChR antibody-positive and differing only by the responsiveness of their EOMs to standard therapy. Although the sample is small (n = 25), this discovery cohort represented highly selected individuals from the phenotypic extremes and matched for ancestry to maximize the power to detect association signals. Single nucleotide polymorphisms (SNPs) which were suggestive of association with OP-MG were validated in a larger cohort and a SNP to gene to pathway approach was used to prioritize genes based on skeletal muscle expression patterns.

#### MATERIALS AND METHODS

#### Patient Samples

Patients with generalized myasthenia gravis (MG) of early-onset (<25 years) and African genetic ancestry (either black African or Cape mixed African ancestry) were recruited for WGS. This discovery sample represented the phenotypic extremes of treatment responsivity to myasthenic-associated EOM weakness. The case group (n = 15) included individuals with OP-MG as previously described (Heckmann and Nel, 2017), defined as treatment-resistant weakness of EOMs. The control group (n = 10) included individuals with no persistent EOM weakness, i.e., EOM weakness may have been present at disease presentation but responded appropriately to treatment. DNA samples from 28 African ancestry MG patients (1 OP-MG and 27 control MG) with early onset disease (<38 years) served as a validation sample to genotype selected variants. This study was approved by the UCT Faculty of Health Sciences Human Research Ethics Committee (HREC 591/2014) and all subjects gave written informed consent in accordance with the Declaration of Helsinki. The study design is outlined in **Figure 1**.

### DNA Extraction and Whole Genome Sequencing

Genomic DNA was extracted from buffy coats of nucleated cells obtained from anticoagulated whole blood using the salting out method (Miller et al., 1988). Sequencing libraries (2 × 150 bp read length) were prepared from DNA samples using the TruSeq PCRfree library preparation kit (Illumina). Libraries were sequenced on Illumina HiSeq sequencing instruments (30× coverage) at the Kinghorn Centre for Clinical Genomics (Sydney, Australia) and the Centre for Genomic Regulation (Barcelona, Spain).

#### Read Alignment and Variant Calling

Paired end sequencing reads (FASTQ files) were aligned to the hg38 reference genome (including HLA contigs) using BWA MEM v0.7.15 (1000 Genomes Project Consortium et al., 2012) to generate BAM files. The Genome Analysis Toolkit best practice guidelines for germline SNPs and Indels were followed (GATK v3.7) (Van der Auwera et al., 2013), including duplicate read removal and base quality score recalibration of BAM files followed by variant calling using HaplotypeCaller (first on individual samples to generate GVCF files and then on the entire cohort to generate a final multisample VCF file) (Poplin et al., 2017). Variant quality score recalibration (VQSR) was performed separately for SNPs and Indels using a tranche sensitivity threshold of 99% to remove false positive calls. Variants were annotated using the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016).

### HLA Allele Determination

Reads aligning to the HLA region on chromosome 6 (33,064,568–33,080,777) and to the HLA contigs were extracted from the BAM files and realigned to reference sequences from the IMGT/HLA database (v3.29.0.1, 2017). HLA allele group assignment was inferred through "best-match" alignment of reads against the IMGT/HLA alleles using HLA Explore Software (Omixon).

### Case Control Association Analysis

Autosomal, bi-allelic variants were extracted from the VCF file and PLINK v1.9 (Chang et al., 2015) was used to perform various quality control procedures prior to association testing. Variant level filtering (excluding variants with MAF <5%, call rate <95% and Hardy–Weinberg (HW) equilibrium p-value < 1 × 10<sup>−6</sup> in controls) and sample level filtering (excluding individuals with outlying missing genotype or heterozygosity rates) were performed according to previously described guidelines (Anderson et al., 2010; Reed et al., 2015). To exclude any large-scale differences in ancestry between the OP-MG and control MG groups (which could confound the case-control association analysis), principal component analysis of 50,357 variants was performed after LD based SNP pruning using PLINK v1.9 (--indep-pairwise 1000 50 0.15). The allelic association of each marker with OP-MG was tested using Fisher's exact test (considering the unpruned dataset) and the genomic inflation estimate (lambda) was calculated for the unadjusted model based on the median χ<sup>2</sup> statistic.
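To make the allelic test concrete, the following is a minimal Python sketch of a two-sided Fisher's exact test on a 2 × 2 table of allele counts (cases vs. controls, allele 1 vs. allele 2). It is not PLINK's implementation: the function name is hypothetical, and the two-sided p-value is taken as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. cases, controls); columns are allele counts.
    Hypothetical helper for illustration, not the PLINK routine.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of seeing x copies of allele 1 in row 1
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more probable than the observed table;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Illustrative (hypothetical) allele counts: 9 vs. 21 alleles in cases,
# 12 vs. 8 alleles in controls.
example_p = fisher_exact_two_sided(9, 21, 12, 8)
```

With only 25 individuals (50 alleles) the set of possible tables is small, which is one reason an exact test is preferred here over a large-sample χ<sup>2</sup> approximation.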

#### Gene and Pathway Based Analyses

As a complementary approach to single variant association analysis, VEGAS2 was used to calculate gene (Mishra and Macgregor, 2015) and pathway (Mishra and MacGregor, 2017) based p-values. This software tool maps SNPs to genes based on their genomic location. We performed two analyses in parallel: stringent mapping (including SNPs within a gene plus any SNPs outside of the gene with r<sup>2</sup> > 0.8 with SNPs within the gene) and less stringent mapping (including SNPs within a gene plus any SNPs 200 kb upstream and downstream of the 5′UTR and 3′UTR boundaries). For each mapping approach, the p-values from each mapping SNP are aggregated accounting for the linkage disequilibrium (LD) between SNPs and correcting for the gene size (number of SNPs). To compute pathway-based test statistics, the gene-based test statistics for gene lists in curated pathways (multiple sources including BIOCARTA, REACTOME and KEGG databases) and custom pathways (mined from various sources of EOM gene expression data, **Supplementary Table S2**) are aggregated and corrected for pathway size bias.
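The positional (less stringent) mapping step can be sketched in a few lines of Python. This is only an illustration of assigning SNPs to genes within a flanking window; it does not reproduce VEGAS2, and in particular it omits the LD-aware aggregation of p-values. The function name, record layout, gene name and coordinates are all hypothetical.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes, flank=200_000):
    """Assign each SNP to every gene whose boundaries, extended by `flank`
    base pairs on both sides, contain the SNP position.

    snps  : iterable of (snp_id, chromosome, position)
    genes : iterable of (gene_name, chromosome, start, end), where start and
            end stand for the 5'UTR and 3'UTR boundaries of the gene
    """
    snps = list(snps)
    mapping = defaultdict(list)
    for gene_name, gene_chrom, start, end in genes:
        for snp_id, snp_chrom, position in snps:
            same_chromosome = snp_chrom == gene_chrom
            in_window = start - flank <= position <= end + flank
            if same_chromosome and in_window:
                mapping[gene_name].append(snp_id)
    return dict(mapping)


# Hypothetical coordinates, for illustration only.
example_genes = [("GENE_A", "1", 1_000_000, 1_050_000)]
example_snps = [
    ("snp1", "1", 900_000),    # inside the upstream 200 kb window
    ("snp2", "1", 700_000),    # too far upstream
    ("snp3", "2", 1_010_000),  # wrong chromosome
    ("snp4", "1", 1_250_000),  # exactly on the downstream window edge
]
example_mapping = map_snps_to_genes(example_snps, example_genes)
```

A wide window means one SNP can map to several neighboring genes, so gene-based p-values from adjacent genes are not independent of one another.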

#### Sanger Sequence Verification of Variants

Two variants were verified by Sanger sequencing in a validation sample of myasthenics with African genetic ancestry consisting of 1 OP-MG subject and 27 control MG subjects using the following primers: CCAGGCTGAGAGACAAAGCAGACC forward and CGTACTTATGTGCCACACAAGAC reverse for rs16834631 in PEF1 and GATGGAGCTTCCGGAAGTCTTGG forward and CAAGGCAACTGCCTCTCTGCACC reverse for rs7816955 in FAM92A1.

#### Cell Cultures

Dermal fibroblasts from 10 OP-MG and 5 control MG individuals were obtained from skin punch biopsies using the explant method. These were transduced with an RGD fiber modified adenovirus containing a human MyoD transgene as previously described (Nel et al., 2019). Briefly, transduced fibroblasts were maintained in differentiation medium (DMEM + 5% horse serum + 1% P/S) for 48 h to induce myogenic transdifferentiation and generate myocytes. Myocytes stained positively for sarcomeric myosin and successful myogenic transdifferentiation was further confirmed by demonstrating muscle-specific gene expression in myocytes (CHRNA1, MYOD1, and MYOG). Importantly, based on muscle-specific gene expression levels, the degree of myogenic transdifferentiation was similar in both OP-MG and control MG myocytes. To mimic MG-induced gene expression changes in vitro, myocytes were stimulated with 5% homologous MG sera for 24 h before harvesting RNA. Sera samples were sourced from AChR antibody-positive, treatment-naive MG patients with generalized myasthenia and severe extraocular muscle involvement.

#### Quantitative Polymerase Chain Reaction (qPCR)

RNA was extracted from myocytes using the HighPure RNA extraction kit (Roche) according to the kit protocol. RNA concentration and purity were determined using the Nanodrop<sup>®</sup> ND1000 spectrophotometer (Thermo Scientific) and all ratios were within the recommended ranges (A260/280 = 1.8–2.0; A260/230 > 1.7). 400 ng total RNA was reverse transcribed to cDNA using the RT<sup>2</sup> First Strand Kit (Qiagen) according to the manufacturer's specifications. Quantitative PCR was performed on the cDNA samples using proprietary Quantitect primer assays (Qiagen) (RPLP0, HLA-DPB1) and RT<sup>2</sup> SYBR Green Mastermix (Qiagen) on the 7900HT Fast Real-Time PCR System (Applied Biosystems). RPLP0 was selected from a panel of 10 reference genes which were screened for their expression stability in myocytes (Nel et al., 2019). Individual data points were calculated as 2<sup>−ΔCq</sup>, where ΔCq = target gene Cq – reference gene Cq (Schmittgen and Livak, 2008).
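The 2<sup>−ΔCq</sup> calculation can be written out as a short Python sketch. The function name and the Cq values are hypothetical and serve only to illustrate the arithmetic; no measured data from the study are reproduced.

```python
def relative_expression(target_cq, reference_cq):
    """Return 2^(-dCq), where dCq = target gene Cq - reference gene Cq.

    A higher target Cq (later amplification) relative to the reference gene
    gives a smaller value, i.e. lower relative expression.
    """
    delta_cq = target_cq - reference_cq
    return 2.0 ** (-delta_cq)


# Hypothetical Cq values: the target gene (e.g. HLA-DPB1) crosses threshold
# 5 cycles after the reference gene (e.g. RPLP0).
example_value = relative_expression(target_cq=25.0, reference_cq=20.0)
```

Because each cycle corresponds to an approximate doubling of product, a difference of one cycle changes the computed value by a factor of two; this assumes near-100% amplification efficiency for both assays.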

### Data Visualization

Quantile–Quantile (Q–Q) and Manhattan plots were created in R (version 3.5.1) using the qqman package (Turner, 2018). The heatmap of skeletal muscle tissue RNAseq expression data from the Genotype-Tissue Expression (GTEx) project was generated using the GTEx Portal [1]. Graphs of qPCR expression data were created using Prism 7 (version 7.0c).

#### Computation

Computations were performed using facilities provided by the University of Cape Town's ICTS High Performance Computing team: hpc.uct.ac.za and the Bioinformatics Unit at the Centre for Genomic Regulation (CRG), Barcelona.

### RESULTS

### Clinical Characteristics of Study Participants

The clinical characteristics of the study participants are summarized in **Table 1**. For the WGS discovery sample, all subjects had early onset MG and there was no significant difference in the age of disease onset between OP-MG and control MG groups (14 years vs. 16 years, p = 0.450), or the sex ratios (p = 0.13). The WGS sample comprised 11 black African ancestry individuals (44%) and 14 Cape mixed African ancestry (M/A) individuals (56%) with similar ancestry proportions in OP-MG and control MG groups. While the sex and ancestry proportions were similar between the WGS and validation samples, the age at MG onset was significantly higher in the validation sample compared to the WGS sample (p = 4 × 10<sup>−8</sup>). This was primarily because we tried to reduce confounders for the highly selected sample undergoing WGS by matching for age which was previously identified as a biological factor.

IQR, interquartile range.

#### Description of Variants

The final VQSR filtered callset contained ∼18 million variants, including ∼2 million (11%) novel variants, with an overall Ti/Tv ratio of 2.12 and a heterozygous/homozygous ratio of 2.06, which is in line with previously published genome-wide quality control metrics, particularly for African datasets (DePristo et al., 2011; Guo et al., 2014). A high proportion of the detected variants were singletons (29%, ∼5 million). Overall there were ∼5 million variants per genome, which is consistent with previously published data for African populations (Auton et al., 2015).
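For readers unfamiliar with the Ti/Tv metric, the following Python sketch shows how a transition/transversion ratio can be computed from a list of single nucleotide variants. It is an illustration only, with hypothetical function names and toy input; it is not the tool used to produce the value reported above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snvs):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue  # not a substitution; skip
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return float("inf")
    return transitions / transversions


# Toy input: two transitions (A>G, C>T) and one transversion (A>C).
example_ratio = titv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

There are twice as many possible transversions as transitions, so purely random substitutions would give a ratio near 0.5; a genome-wide ratio of roughly 2 is therefore commonly read as a sign that the callset is not dominated by random sequencing errors.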

### Population Structure

Population structure within the dataset was assessed using principal component analysis (PCA) after LD based SNP pruning. Combined, principal components 1 and 2 explain 26% of the total variance within the dataset; these are visualized on the PCA plot shown in **Figure 2**. The samples segregate into two clusters reflecting the black African and Cape M/A groups. Present day South Africans include a major ethnolinguistic group of black African South-Eastern Nguni-language (isiXhosa and isiZulu) speakers. The Cape M/A ancestry population (predominantly Khoisan and Nguni-speaking African ancestry as well as smaller genetic contributions from Europeans and Southeast Asians) (De Wit et al., 2010; Quintana-Murci et al., 2010) comprises the most prevalent sub-population in the Western Cape region where this study was conducted. Despite their shared African ancestry with the black African ancestry individuals, the Cape M/A ancestry individuals form a dispersed but distinct cluster reflecting the admixed nature of this population which has considerable ancient African hunter-gatherer (Khoisan) and lesser non-African genetic contributions (Choudhury et al., 2017). Two out of the three outlier samples in the black African ancestry cluster represent individuals from other African countries (Zimbabwe and Burundi). Importantly, OP-MG and control MG individuals are equally represented in both ancestry groups which indicates that the case control association analysis will not be confounded by differences in population structure.

Abbreviations: WES, whole exome sequencing; MHC, major histocompatibility complex.

## Analysis of WGS Data to Identify Association Signals

Various approaches were used in parallel (outlined in **Figure 3**) to identify OP-MG associated variants and genes. The results of these analyses were interpreted in conjunction with our previous work, involving WES of OP-MG and control MG subjects (Nel et al., 2017), to collectively generate hypotheses regarding OP-MG susceptibility pathways.

#### Single Variant Association Analysis

Following variant and sample level filtering, the frequencies of 8,752,596 variants were compared between case and control groups (i.e., OP-MG vs. control MG) using Fisher's exact test (**Figure 3A**). The black points in the quantile–quantile (Q–Q) plot in **Figure 4** show the observed p-values (sorted from largest to smallest) plotted against the expected p-values from a theoretical χ<sup>2</sup>-distribution (Ehret, 2010). The gray straight line in the Q–Q plot indicates the distribution of SNPs under the null hypothesis. The black points form a straight line which is "deflated" relative to the gray line suggesting that the analyses were underpowered due to the small sample size in this study. Consequently, there were no variant associations which reached genome-wide significance (p < 5 × 10<sup>−8</sup>).
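The degree of inflation or deflation seen in a Q–Q plot is often summarized by the genomic inflation factor (lambda). As a hedged sketch, the Python code below converts each p-value to a 1-degree-of-freedom χ<sup>2</sup> statistic and divides the observed median by the expected median under the null (approximately 0.455). The function names are hypothetical, and PLINK's own calculation may differ in detail, for example in how it handles p-values from an exact test.

```python
from statistics import NormalDist, median

_STANDARD_NORMAL = NormalDist()


def chi2_1df_from_p(p_value):
    """Chi-square (1 df) statistic whose upper-tail probability is p_value."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-values must lie in the interval (0, 1]")
    # A 1-df chi-square variable is the square of a standard normal, so the
    # statistic is z^2 with z the two-sided normal quantile for p_value.
    z = _STANDARD_NORMAL.inv_cdf(p_value / 2.0)
    return z * z


def genomic_inflation(p_values):
    """Lambda = median observed chi-square / median expected under the null."""
    observed_median = median(chi2_1df_from_p(p) for p in p_values)
    expected_median = chi2_1df_from_p(0.5)  # approximately 0.455
    return observed_median / expected_median


# Evenly spaced p-values mimic the null hypothesis (lambda close to 1);
# p-values crowded toward 1 mimic a deflated test (lambda below 1).
null_like = [(i - 0.5) / 1001 for i in range(1, 1002)]
deflated_like = [0.9] * 100
lambda_null = genomic_inflation(null_like)
lambda_deflated = genomic_inflation(deflated_like)
```

A lambda well below 1 corresponds to observed points falling under the null line in the Q–Q plot, the pattern described above as "deflated".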

The Manhattan plot shows 7 variants with suggestive association with either the OP-MG or control MG phenotype (p < 1 × 10<sup>−5</sup>, **Figure 5A**) which are summarized in **Supplementary Table S1**. Five out of 7 variants had a lower frequency in OP-MG compared to control MG and all variants are common in African populations (1000 Genomes data). None of these top associated variants had any predicted functional consequences. While this may be true of many top GWAS hits in studies using chip data, where the top SNPs may not themselves be pathogenic but may "tag" other functional variants in LD, this is an unlikely scenario in our study since we have genome-wide variant coverage.

Therefore, in order to further prioritize variants with sub-genome wide significance thresholds, we screened the VEP "impact" annotations of 1,751 variants with p < 0.001 (**Figure 5B**). Seven variants were classified as "low impact" (splice region, intron and synonymous variants) and two variants were classified as "moderate impact" (missense variants) but were predicted to be benign by various prediction tools. The remainder of the variants were classified as "modifiers" and included intergenic variants and variants in up- and downstream gene regions, some of which overlapped regulatory features.

Since non-coding genetic variation was hypothesized to contribute to OP-MG susceptibility, we applied a tissue-specific prioritization approach to identify which modifier variants overlapped a regulatory feature active in muscle [human skeletal muscle myoblast and myotube (HSMM and HSMMtube) samples from the ENCODE project and psoas muscle samples from the Roadmap Epigenomics Project]. Muscle samples were chosen since there is no publicly available expression data for human EOM. This analysis identified 13 variants which were more common in OP-MG compared to control MG. Two upstream gene variants, rs7816955 in FAM92A1 (p = 9.2 × 10<sup>−4</sup>) and rs16834631 in PEF1 (p = 1.5 × 10<sup>−4</sup>), overlapped Ensembl regulatory features classified as active promoters based on epigenome activity in relevant muscle cell lines. While the FAM92A1 variant did not overlap any Ensembl motif features, the PEF1 variant overlapped 24 putative transcription factor binding sites based on binding matrices, one of which was a high information position with predicted decreased binding of the RFX3::FIGLA transcription factor pair.

Both variants had a reported frequency ≤ 0.30 among African controls and ≤0.10 among European controls (1000 Genomes Project) and were validated by Sanger sequencing in the WGS sample. Their frequency was also determined in a validation sample (n = 28) which confirmed the association of these variants with OP-MG: PEF1 rs16834631 0.57 in OP-MG vs. 0.16 in control MG (p = 0.001) and FAM92A1 rs7816955 0.47 in OP-MG vs. 0.18 in control MG (p = 0.021; **Figure 3A**).

#### Gene-Based Association Analysis

A single variant association testing approach, while unbiased, is limited by stringent genome-wide significance thresholds which are difficult to reach after correcting for multiple testing (particularly relevant with our small sample size). Searching for association signals in single variants assumes that all affected individuals (i.e., OP-MG cases) have the same pathogenic variant(s) which does not fit with our current understanding of the genetic architecture of complex disease, which may be attributed to the joint effect of many causal loci with small effect sizes (Fu et al., 2013). To interrogate the collective biological meaning of the sub-threshold single variant associations, all variants (**Figure 5C**) were mapped to genes and their modest association signals were aggregated using VEGAS2 to derive gene based p-values (**Figure 3B**). A mapping threshold of 200 kb upstream and downstream of gene boundaries was chosen since this distance has been shown to increase the number of significant phenotype-pathway associations, particularly for autoimmune diseases (Brodie et al., 2016).

While no genes had significant p-values after correcting for multiple testing of 23,361 genes, 38 genes had a p-value ≤ 0.015. These were prioritized by determining their tissue expression using RNAseq expression data from the Genotype-Tissue Expression (GTEx) project (Aguet et al., 2017). Since there is no available expression data for the specific allotype of EOM, we prioritized genes based on their expression level in skeletal muscle tissues. Eleven genes had a medium expression level in skeletal muscle defined as a transcripts per million (TPM) value of 11–1,000 (shown in blue boxes in **Figure 6**). The functions of proteins encoded by genes with TPM > 20 are summarized in **Table 2**.
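To see why none of the gene-based p-values survives correction, a simple worked example helps. The source does not name the correction method, so the Python sketch below assumes a Bonferroni correction at a family-wise α of 0.05; the function names and the toy p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def significant_genes(gene_p_values, alpha=0.05):
    """Genes whose p-value falls below the Bonferroni threshold, where the
    number of tests is the number of genes supplied."""
    threshold = bonferroni_threshold(alpha, len(gene_p_values))
    return [gene for gene, p in gene_p_values.items() if p < threshold]


# Threshold for the 23,361 genes tested (assumed alpha = 0.05).
gene_level_threshold = bonferroni_threshold(0.05, 23_361)

# Toy example with three hypothetical genes.
toy_hits = significant_genes({"GENE_A": 0.001, "GENE_B": 0.03, "GENE_C": 0.5})
```

Under this assumption the per-gene threshold is on the order of 2 × 10<sup>−6</sup>, several orders of magnitude below the p ≤ 0.015 cut-off used for prioritization, which is why the 38 genes are treated as hypothesis-generating candidates and not as significant associations.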

#### Pathway-Based Association Analysis

The sample size was not sufficient to produce meaningful pathway-based test statistics from the VEGAS2 pathway analysis which interrogated both curated and custom pathways.

#### HLA Region Associations


In our previous work we identified a unique "HLA signature" spanning the class II region of the MHC in OP-MG subjects (Nel et al., 2017) (**Figure 3C**) and the gene-based analysis in the present study also identified association signals in this region (HLA-DPA1 p = 0.015 and HLA-DPB1 p = 0.033). We therefore performed HLA typing (see section "HLA Allele Determination") to interrogate differences in HLA-DPA1 and HLA-DPB1 allele frequencies between OP-MG and control groups. In our sample, HLA-DPB1 allele diversity (12 alleles plus ambiguous alleles for 6 individuals) was higher than HLA-DPA1 allele diversity (5 alleles) which is similar to studies in European populations (Hollenbach et al., 2012). We found differences in the frequency of 3 HLA-DP alleles between OP-MG and control MG (**Table 3**). Interestingly, for the HLA-DPB1 locus, where alleles can be divided into two groups based on their associated HLA-DPB1 expression levels, we found that the proportion of "low expression" and "high expression" alleles differed between the OP-MG and control MG groups (p = 0.021). The HLA-DPB1<sup>∗</sup>105:01 allele, the most common "low expression" allele in our sample and only observed in OP-MG individuals, appears to be common in African populations. The expression level of HLA-DPB1 alleles was shown to be correlated with the genotype at rs9277534, a functional A > G SNP located in the 3′UTR of HLA-DPB1 (Thomas et al., 2012). The G-allele of this SNP increases HLA-DPB1 expression levels by altering the binding affinity of various microRNAs (Shieh et al., 2018). In keeping with the observed HLA-DPB1 frequency differences (Schöne et al., 2018), we found a higher frequency of the rs9277534 G-allele in the control MG group.

TABLE 2 | Top 7 genes (based on GTEx skeletal muscle expression data, TPM > 20) with the muscle-specific function of their encoded proteins.

Hu et al., 2012; Egerman and Glass, 2014; Jang et al., 2015; Kusner et al., 2010; Park et al., 2011; Ito et al., 2004; Galloway et al., 2016.

HLA-DPB1 rs9277534 genotype-expression correlations have been demonstrated in blood (Yamazaki et al., 2018) but there are no data on this expression quantitative trait locus (eQTL) in skeletal muscle tissue. We therefore analyzed HLA-DP expression grouped by rs9277534 genotype in myocytes derived from transdifferentiated dermal fibroblasts from OP-MG and control MG subjects (Nel et al., 2019) and found that the G-allele increased HLA-DPB1 expression levels (**Figure 7**, p < 1 × 10<sup>−3</sup>).

#### DISCUSSION

In this study we have used various strategies to mine WGS data in an attempt to generate hypotheses regarding the pathogenetic basis of a subphenotype of a rare autoimmune disease, myasthenia gravis. The subphenotype is characterized by treatment resistance of the eye muscles, or EOMs, whereas the non-ocular muscles respond to standard MG therapies (Heckmann and Nel, 2017). EOM is a specific allotype of muscle tissue because it differs from limb muscles in many respects (Porter et al., 2001). Since only a proportion of MG subjects develop the OP-MG subphenotype, the pool of affected individuals available for genetic studies is small. Nonetheless, we employed a focused strategy using extreme subphenotype sampling of OP-MG cases vs. MG disease controls to perform a genome wide analysis. Putative OP-MG susceptibility variants, genes and pathways were identified following prioritization based on known tissue-specific expression patterns in skeletal muscle since gene expression data for EOM is not available.

We have identified three main candidate pathogenic themes which we postulate are involved in developing OP-MG, and preliminary functional studies show at least some support for these hypotheses. Briefly, we summarize evidence gleaned from other areas, using the principle of triangulation, to lend support to the generated hypotheses.

The first two themes relate to muscle atrophy and muscle recovery/remodeling. The EOMs may be more susceptible to complement-mediated muscle endplate injury during MG (in Soltys et al., 2008) due to their relatively lower expression levels of complement regulatory proteins, particularly decay accelerating factor (DAF) (Kaminski et al., 2004). We previously screened the DAF gene in OP-MG subjects and found a higher frequency of a functional DAF promoter polymorphism compared to controls which impaired transcriptional upregulation of DAF expression in patient-derived cell lines following a lipopolysaccharide immune stimulus (Heckmann et al., 2010). Also, clinically and at surgery, the EOMs in the most severe cases of OP-MG are thin/atrophic, not fibrotic and unable to generate muscle force (Heckmann and Nel, 2017). Although there is limited histological data on EOMs in MG, neurogenic atrophy is a common pathological observation in the muscle biopsies of MG cases (Oosterhuis and Bethlem, 1973) and is likely to be the result of "functional denervation," or the disconnection between the nerve and muscle endplate secondary to MG-induced damage (Nakano and Engel, 1993). With that in mind, the gene-based analysis identified two genes (MKNK2, AKT1S1) involved in the IGF1/AKT/mTOR pathway, which is a key pathway in promoting muscle atrophy following denervation (Tang et al., 2014) (**Table 2**). In keeping with these unbiased findings, our previous gene expression profiling of OP-MG myocytes using a panel of genes relevant in several MG studies found expression of genes from this pathway (IGF1, AKT1, and AKT2) were strongly correlated in OP-MG myocytes but not in the myocytes from control MG cases (Nel et al., 2019). Interestingly, IGF1 is highly expressed in EOMs where it regulates both the muscle mass and force generation of these muscles, and its signaling is dysregulated in paralyzed EOMs (Altick et al., 2012).

<sup>1</sup>Goldfein, 2017; <sup>2</sup>Schöne et al., 2018; <sup>3</sup>Thomas et al., 2012. <sup>∗</sup>For HLA-DPB1, allele assignment was ambiguous for 6 OP-MG and 3 control MG individuals and these samples were excluded from the analysis.

Subsequent to MG damage we would expect the EOMs to undergo "regeneration" or remodeling due to their high numbers of resident satellite cells (McLoon and Wirtschafter, 2003), and this process requires the synthesis of new structural muscle proteins. We were therefore interested to observe that 3 of the 7 genes (MYL12B, PPP1R12C, and PPP1R2) identified by the gene-based analysis (**Table 2**) are involved in the stability and regulation of myosin II which is a prominent isoform in EOMs expressed by fast type IIA and IIB muscle fibers, respectively (Park et al., 2011).

While unbiased genome-wide association studies (GWAS) typically employ very large samples, the application of this approach to the study of susceptibility to MG has not been very informative in terms of identifying new disease loci. In two recent GWAS in MG, the strongest association signals identified were localized to the HLA region (Gregersen et al., 2012; Renton et al., 2015), which had already been identified over three decades ago in a small case-control sample (Compston et al., 1980). The third theme we identified relates to the HLA region since we found an association signal with lower HLA-DPB1 expression in OP-MG which results from a functional polymorphism in the 3′UTR of HLA-DPB1. Although the HLA-DP locus is not in LD with other HLA loci, the expression levels of HLA-DPB1 are increasingly recognized to have clinical relevance (Fleischhauer, 2015). While the main MG susceptibility locus lies in the class I or II region depending on the age at symptom onset (Nel and Heckmann, 2018), HLA-DPB1 alleles may influence the phenotypic manifestations of the MG disease process in different individuals.

We also identified association signals in PEF1 and FAM92A1 which were validated in an independent sample, although the functional relevance of these genes in EOM is unknown. This highlights the importance of validating the hypotheses generated by this work in patient-derived EOM tissue, preferably from OP-MG individuals. It is worth noting that candidate gene associations, such as those previously identified in the regulatory region of DAF, were not identified following the filtering criteria used in this study. This is likely due to the sample size constraints imposed by WGS which limits the ability to detect significant associations for low frequency variants such as DAF -198 C > G. This SNP had a frequency of 0.13 among the OP-MG subjects in this study (p = 0.119) which is comparable to the statistically significant association previously reported using a larger sample size (0.12, p = 0.001) (Heckmann et al., 2010), albeit with an overlap of two OP-MG samples between the two studies.

In conclusion, despite the limitations of using a small sample to mine whole genome data to generate pathogenic hypotheses in a structured yet unbiased approach, several lines of evidence suggest we have achieved our aims. The next step will be to analyze the functionality of these genes and pathways in patient-derived extraocular muscle tissue.

#### DATA AVAILABILITY

The whole genome sequencing data on which the findings of this manuscript are based have been deposited in the European Genome-Phenome Archive (EGA): https://www.ebi.ac.uk/ega/home and can be found under the following accession ID: EGAS00001003462. Access to the dataset is governed by a data access committee.

#### AUTHOR CONTRIBUTIONS

MN designed the myocyte model, performed the functional studies in myocytes, performed the genomic data analysis, and wrote the manuscript. NM provided computing resources, bioinformatics support, and editorial input. TE performed the qPCR experiments on ocular fibroblasts. JH conceived and designed the study, collected DNA samples and clinical data, and provided editorial input and funding support.

#### FUNDING

MN was funded by a Novartis-Africa Mobility award and UCT Faculty of Health Sciences postdoctoral research fellowship. JH received funding from the French Muscular Dystrophy Association (AFM-Téléthon) (20049), National Research Foundation (NRF) of South Africa (113416), and the Rare Disease Foundation microgrant program and the BC Children's Hospital Foundation (2016).

#### REFERENCES


#### ACKNOWLEDGMENTS

MN wishes to thank Dr. Julia Ponomarenko (head of the Bioinformatics Unit at the CRG) and her team for the provision of bioinformatics training.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00136/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Nel, Mulder, Europa and Heckmann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Population Pharmacogenomics for Precision Public Health in Colombia

Shashwat Deepali Nagar<sup>1,2,3†</sup>, A. Melissa Moreno<sup>4†</sup>, Emily T. Norris<sup>1,2,3†</sup>, Lavanya Rishishwar<sup>2,3</sup>, Andrew B. Conley<sup>2,3</sup>, Kelly L. O'Neal<sup>1</sup>, Sara Vélez-Gómez<sup>4</sup>, Camila Montes-Rodríguez<sup>4</sup>, Wendy V. Jaraba-Álvarez<sup>4</sup>, Isaura Torres<sup>4</sup>, Miguel A. Medina-Rivas<sup>3,5</sup>, Augusto Valderrama-Aguirre<sup>1,3,6</sup>, I. King Jordan<sup>1,2,3\*</sup> and Juan Esteban Gallo<sup>3,4\*</sup>

<sup>1</sup> School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, United States, <sup>2</sup> IHRC-Georgia Tech Applied Bioinformatics Laboratory, Atlanta, GA, United States, <sup>3</sup> PanAmerican Bioinformatics Institute, Cali, Colombia, <sup>4</sup> GenomaCES, Universidad CES, Medellín, Colombia, <sup>5</sup> Centro de Investigación en Biodiversidad y Hábitat, Universidad Tecnológica del Chocó, Quibdó, Colombia, <sup>6</sup> Biomedical Research Institute, Cali, Colombia

#### Edited by:

Nicola Mulder, University of Cape Town, South Africa

#### Reviewed by:

Laura B. Scheinfeldt, University of Pennsylvania, United States Keyan Zhao, University of California, Los Angeles, United States

#### \*Correspondence:

I. King Jordan king.jordan@biology.gatech.edu Juan Esteban Gallo jegallo@ces.edu.co

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 17 November 2018 Accepted: 04 March 2019 Published: 22 March 2019

#### Citation:

Nagar SD, Moreno AM, Norris ET, Rishishwar L, Conley AB, O'Neal KL, Vélez-Gómez S, Montes-Rodríguez C, Jaraba-Álvarez WV, Torres I, Medina-Rivas MA, Valderrama-Aguirre A, Jordan IK and Gallo JE (2019) Population Pharmacogenomics for Precision Public Health in Colombia. Front. Genet. 10:241. doi: 10.3389/fgene.2019.00241

While genomic approaches to precision medicine hold great promise, they remain prohibitively expensive for developing countries. The precision public health paradigm, whereby healthcare decisions are made at the level of populations as opposed to individuals, provides one way for the genomics revolution to directly impact health outcomes in the developing world. Genomic approaches to precision public health require a deep understanding of local population genomics, which is still missing for many developing countries. We are investigating the population genomics of genetic variants that mediate drug response in an effort to inform healthcare decisions in Colombia. Our work focuses on two neighboring populations with distinct ancestry profiles: Antioquia and Chocó. Antioquia has primarily European genetic ancestry followed by Native American and African components, whereas Chocó shows mainly African ancestry with lower levels of Native American and European admixture. We performed a survey of the global distribution of pharmacogenomic variants followed by a more focused study of pharmacogenomic allele frequency differences between the two Colombian populations. Worldwide, we found pharmacogenomic variants to have both unusually high minor allele frequencies and high levels of population differentiation. A number of these pharmacogenomic variants also show anomalous effect allele frequencies within and between the two Colombian populations, and these differences were found to be associated with their distinct genetic ancestry profiles. For example, the C allele of the single nucleotide polymorphism (SNP) rs4149056 [Solute Carrier Organic Anion Transporter Family Member 1B1 (SLCO1B1)∗5], which is associated with an increased risk of toxicity to a commonly prescribed statin, is found at relatively high frequency in Antioquia and is associated with European ancestry. In addition to pharmacogenomic alleles related to increased toxicity risk, we also have evidence that alleles related to dosage and metabolism have large frequency differences between the two populations, which are associated with their specific ancestries. Using these findings, we have developed and validated an inexpensive allele-specific PCR assay to test for the presence of such population-enriched pharmacogenomic SNPs in Colombia. These results serve as an example of how population-centered approaches to pharmacogenomics can help to realize the promise of precision medicine in resource-limited settings.

Keywords: pharmacogenomics, pharmacogenetics, precision medicine, genetic ancestry, admixture, Colombia, Antioquia, Chocó

#### INTRODUCTION

The precision medicine approach to healthcare entails a customized model whereby medical decisions and treatments are specifically tailored to individual patients (Collins and Varmus, 2015; Jameson and Longo, 2015). Currently, precision medicine is most commonly implemented via pharmacogenomic methods, which account for how individuals' genetic makeup affects their response to drugs (Weinshilboum and Wang, 2006; Ma and Lu, 2011). Pharmacogenomic knowledge of genetic variant-to-drug response interactions provides a means to optimize individual patients' treatment regimes, simultaneously maximizing drug efficacy while minimizing adverse reactions. Indeed, the essence of precision medicine has been described as "the right treatment, to the right patient, at the right time." While the precision medicine paradigm promises to revolutionize healthcare delivery, its prohibitive costs put it out of reach for the developing world. In particular, the need to characterize genomic information for each individual patient in a given population can place a tremendous burden on healthcare systems that may be struggling to provide basic services. For the moment, precision medicine as a standard of care is still very much limited to the Global North.

A recently articulated alternative to the precision medicine model is referred to as precision public health (Khoury et al., 2016, 2018; Weeramanthri et al., 2018). The focus of precision public health is populations, instead of individuals, and the idea is to leverage modern healthcare technologies for more precise population-level interventions. The mantra for precision public health is "the right intervention, to the right population, at the right time." This population-centered model of healthcare delivery provides one way for the technological innovations underlying precision medicine to realize their potential in developing countries. With respect to pharmacogenomics, knowledge regarding population genomic distributions of the genetic variants that mediate drug response can be used to focus resources and efforts where they will be most effective (Bachtiar and Lee, 2013). Under the precision public health model, population genomic profiles, as opposed to genomic information for each individual patient, can be employed to guide pharmacogenomic interventions; this is a far more cost-effective and realistic approach for the developing world (Nordling, 2017). For this study, we applied the precision public health paradigm using a survey of the distribution of pharmacogenomic variants in diverse Colombian populations. The major aim of this work was to tailor pharmacogenomic testing and interventions to the specific populations for which they will realize the greatest benefit.

Colombia is home to a highly diverse, multi-ethnic society. The modern population of Colombia is made up of individuals with genetic ancestry contributions from ancestral source populations in Africa, the Americas, and Europe (Wang et al., 2008; Bryc et al., 2010; Moreno-Estrada et al., 2013; Ruiz-Linares et al., 2014; Rishishwar et al., 2015). Colombia is also known to contain a number of unique regional identities. There are at least five distinct recognized regions in Colombia, each of which has its own defining demographic contours (Appelbaum, 2016; Wade, 2017). In fact, owing to historical barriers to migration, Colombian populations with very different genetic ancestry profiles can be found in close geographic proximity. This is very much the case for the two populations characterized for this study: Antioquia and Chocó (Medina-Rivas et al., 2016; Conley et al., 2017). Despite the fact that these neighboring administrative departments share a common border, their populations show clearly distinct genetic ancestries. Antioquia has primarily European ancestry, whereas Chocó is mainly African, and both populations also show varying levels of Native American admixture.

Previous studies have shown that the frequencies of pharmacogenomic variants can vary across populations with divergent genetic ancestries. This includes variation in pharmacogenomic variant allele frequencies among distantly related populations worldwide (Ramos et al., 2014; Lakiotaki et al., 2017) as well as marked frequency differences among populations sampled from within the same country (Bonifaz-Pena et al., 2014; Hariprakash et al., 2018). We hypothesized that pharmacogenomic allele frequencies should differ between the Colombian populations of Antioquia and Chocó, given their distinct ancestry profiles. If this was indeed the case, it would have direct implications for the development of pharmacogenomic approaches in the country. In this way, we hoped that a survey of the population pharmacogenomic patterns for Antioquia and Chocó could serve as an exemplar for the implementation of precision public health in the developing world.

Colombia's first clinical genomics laboratory – GenomaCES from Universidad CES in Antioquia<sup>1</sup> – is currently working to develop genomic diagnoses that are tailored to the local population, and members of the ChocoGen Research Project<sup>2</sup> are exploring the connections between genetic ancestry and health disparities in the understudied Colombian population of Chocó. Here, these two groups have joined forces in an effort to (i) discover pharmacogenomic variants with special relevance for these two Colombian populations and (ii) develop cost-effective and rapid pharmacogenomic assays for those variants, which can be readily deployed in resource-limited settings.

<sup>1</sup>https://www.genomaces.com/

<sup>2</sup>https://www.chocogen.com/

#### MATERIALS AND METHODS


### Pharmacogenomic SNPs (pharmaSNPs)

Pharmacogenomic single nucleotide polymorphisms (pharmaSNPs), i.e., human genetic variants associated with specific drug responses, were mined from the Pharmacogenomics Knowledgebase (PharmGKB<sup>3</sup>) (Whirl-Carrillo et al., 2012). PharmGKB provides a manually curated set of clinical annotations with information about pharmaSNPs and their corresponding drug responses. The PharmGKB clinical annotations were downloaded and filtered to extract all individual pharmaSNP clinical annotations. Data on pharmaSNP clinical annotations were parsed and stored, including information about the direction and nature of the variant-associated drug responses, the identity of each pharmaSNP effect and non-effect allele, the genes wherein pharmaSNPs are located, and the drug interaction evidence levels.

### PharmaSNP Genetic Variation

Data on human genome sequence variation were taken from the phase 3 data release of the 1000 Genomes Project (Genomes Project et al., 2015). For the 1000 Genomes Project, genome-wide SNPs were characterized via whole genome sequencing for 2504 individuals from 26 global populations, including the Colombian population of Antioquia [Colombian in Medellín (CLM), Colombia<sup>4</sup>]. All of the pharmaSNPs from PharmGKB were found to be present in 1000 Genomes Project phase 3 variant calls. Genome sequence variation for the Colombian population of Chocó was characterized as part of the ChocoGen Research Project<sup>5</sup> as previously described (Medina-Rivas et al., 2016; Chande et al., 2017; Conley et al., 2017).

Genome sequence variation data were used to calculate the average minor allele frequency (MAF) and fixation index (FST) for a genome-wide set of n = 28,137,656 pruned SNPs and for the set of n = 1995 pharmaSNPs using the program PLINK (Purcell et al., 2007). Linkage disequilibrium pruning was performed to yield the genome-wide background SNP set with the PLINK indep command, using an $r^2$ threshold of 0.5 with a sliding window of 50 nt and a step size of 5 nt. MAF ($p$) values for each SNP were calculated across all populations as: $p = \frac{\text{number of variant sites}}{\text{total number of sites}}$. FST values for each SNP were calculated among populations as: $F_{ST} = \frac{\sigma^2}{\bar{p} \times (1 - \bar{p})}$, where $\bar{p}$ is the average MAF across all 26 global populations and $\sigma^2$ is the observed MAF variation. Pairwise genomic distances were computed as 1-identity-by-state/Hamming distances between genomes using the PLINK distance command with the --distance-matrix option. The resulting high-dimensional pairwise genomic distance matrix was projected in two dimensions using the multi-dimensional scaling (MDS) method implemented in the base package of the R statistical language (R Core Team, 2013). The program ADMIXTURE was used to characterize genetic ancestry components based on the genome-wide and pharmaSNP sets using K = 3 clusters (Alexander et al., 2009).
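The two per-SNP statistics defined above can be illustrated with a minimal Python sketch. This is not the PLINK implementation used in the study; the function names are hypothetical, and the use of the population (rather than sample) variance for σ² is an assumption, since the text does not specify it.

```python
from statistics import mean, pvariance


def allele_frequency(variant_count, total_count):
    """p = number of variant sites / total number of sites."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return variant_count / total_count


def fst(population_frequencies):
    """F_ST = sigma^2 / (p_bar * (1 - p_bar)) for one SNP.

    population_frequencies holds one allele frequency per population.
    Assumption: sigma^2 is taken as the population variance.
    """
    p_bar = mean(population_frequencies)
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # monomorphic SNP: no differentiation to measure
    return pvariance(population_frequencies) / (p_bar * (1.0 - p_bar))
```

For a SNP with frequencies 0.1 and 0.5 in two populations, the average frequency is 0.3 and the variance is 0.04, so this sketch gives FST = 0.04/0.21, or roughly 0.19.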

The differences in pharmaSNP effect allele frequencies ($f$) between Antioquia (ANT) and Chocó (CHO) were measured as (1) the log-transformed ratio of the population-specific allele frequencies, $\log_2(f_{ANT}/f_{CHO})$, and (2) the population-specific allele frequency difference, $\Delta = f_{ANT} - f_{CHO}$. These two effect allele difference metrics were plotted orthogonally and the Euclidean distance from the origin was calculated for each pharmaSNP to yield a composite difference.
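As a worked illustration of the composite difference, the following Python sketch computes both metrics and their Euclidean distance from the origin for a single pharmaSNP. The function name is hypothetical, and rejecting zero frequencies is an assumption: the text does not say how alleles absent from one population were handled.

```python
import math


def composite_difference(f_ant, f_cho):
    """Return (log2 ratio, frequency difference, Euclidean distance from origin)."""
    if f_ant <= 0.0 or f_cho <= 0.0:
        # Assumption: the log ratio is undefined when an allele is absent.
        raise ValueError("both allele frequencies must be greater than zero")
    log_ratio = math.log2(f_ant / f_cho)
    delta = f_ant - f_cho
    return log_ratio, delta, math.hypot(log_ratio, delta)
```

With effect allele frequencies of 0.5 in Antioquia and 0.25 in Chocó, the log ratio is 1, the difference is 0.25, and the composite distance is about 1.03. Because the log ratio is unbounded while the difference is confined to the interval from −1 to 1, the log ratio tends to dominate the distance at low allele frequencies.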

#### PharmaSNP Ancestry Associations

The influence of genetic ancestry on pharmaSNP genotype frequencies was measured via ancestry association analysis. To do this, individuals' genetic ancestry fractions – African, European, and Native American – inferred using ADMIXTURE with the genome-wide SNP set, were regressed against their individual pharmaSNP genotypes. The strength of the resulting ancestry × pharmaSNP associations was quantified using a linear regression model: $y = \beta x + \varepsilon$, where $x \in \{0, 1, 2\}$ corresponds to the number of pharmaSNP effect alleles, $y$ is the ancestry fraction for a given ancestral group (African, European, or Native American), and $\beta$ quantifies the strength of the association. The significance of the ancestry association was measured as the P-value obtained from a t-test, where $t = \beta / SE_{\beta}$.
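The regression step can be sketched in plain Python as follows. This is an illustrative ordinary least squares fit, not the authors' code: the function name is hypothetical, and fitting an intercept is an assumption, as the model above is written without one. The P-value would then be obtained by comparing t to a t-distribution with n − 2 degrees of freedom, which is omitted here to keep the sketch free of external libraries.

```python
import math


def ancestry_association(genotypes, ancestry_fractions):
    """Regress ancestry fraction (y) on effect allele count (x in {0, 1, 2}).

    Returns (beta, se_beta, t) with t = beta / se_beta.
    Assumption: the model includes an intercept term.
    """
    n = len(genotypes)
    if n != len(ancestry_fractions) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(genotypes) / n
    y_bar = sum(ancestry_fractions) / n
    sxx = sum((x - x_bar) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("genotypes show no variation")
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(genotypes, ancestry_fractions))
    beta = sxy / sxx
    intercept = y_bar - beta * x_bar
    # Residual sum of squares around the fitted line
    rss = sum((y - (intercept + beta * x)) ** 2
              for x, y in zip(genotypes, ancestry_fractions))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    t = beta / se_beta if se_beta > 0 else math.inf
    return beta, se_beta, t
```

A positive β indicates that individuals carrying more copies of the effect allele tend to have a higher fraction of the given ancestry, and a negative β indicates the reverse.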

#### Exome Sequence Analysis

Whole exome sequence (WES) analysis was conducted on a cohort of 132 de-identified patients characterized for the purposes of genetic testing by the GenomaCES laboratory (O'Donnell-Luria and Miller, 2016). The study was carried out in accordance with article 11 of resolution 8430 of 1993 of Colombian law, which states that for every investigation in which a human being is the study subject, respect for their dignity and the protection for their rights should always be present. The study protocol was reviewed and approved by the ethics committee and the research committee of Universidad CES, and all subjects gave written informed consent authorizing use of their biological samples and genetic information obtained through exome sequencing for research and academic training in accordance with the Declaration of Helsinki. Patient DNA was extracted from peripheral blood using the salting out method (Miller et al., 1988). Exon enrichment was performed using the Integrated DNA Technologies xGen capture kit, and exome sequencing was performed on the Illumina HiSeq 4000, generating 150 bp paired end reads at 100X coverage. Read quality was assessed using the FastQC program with a threshold of Q ≥ 30 (Andrews, 2010). Sequence reads were mapped to the hs37d5 (1000 Genomes Phase II) human genome reference sequence using SAMtools (Li et al., 2009), and variants were called using VarScan 2 (Koboldt et al., 2012). The resulting VCF files were surveyed for the presence of pharmaSNP alleles using the VCFtools package (Danecek et al., 2011). Manual inspection of the mapped sequence reads in support of pharmaSNP variant calls was performed using the Integrative Genomics Viewer (IGV) (Thorvaldsdottir et al., 2013).

<sup>3</sup>https://www.pharmgkb.org/ accessed April 2018

<sup>4</sup>https://www.coriell.org/0/Sections/Collections/NHGRI/1000Clm.aspx

<sup>5</sup>https://www.chocogen.com/
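To make the VCF survey step concrete, the sketch below counts copies of a pharmaSNP effect allele in one single-sample VCF record. It is a plain-Python stand-in for illustration, not the VCFtools workflow used in the study; the function name is hypothetical, and it assumes a tab-delimited record with a FORMAT column containing a GT field.

```python
def effect_allele_count(vcf_record, effect_allele):
    """Count effect alleles (0, 1, or 2) in a single-sample VCF record.

    Returns None when the genotype is missing.
    Assumption: columns follow the standard VCF order and the first
    sample is in the tenth column.
    """
    fields = vcf_record.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    alleles = [ref] + alts  # GT index 0 is REF, 1.. are ALT alleles
    format_keys = fields[8].split(":")
    genotype = fields[9].split(":")[format_keys.index("GT")]
    calls = genotype.replace("|", "/").split("/")
    if "." in calls:
        return None
    return sum(1 for call in calls if alleles[int(call)] == effect_allele)


# Illustrative record for rs4149056 (T>C); the position is a placeholder.
record = "12\t100\trs4149056\tT\tC\t.\tPASS\t.\tGT:DP\t0/1:98"
copies = effect_allele_count(record, "C")
```

Counting against the allele string rather than the GT index means the same function works whether the effect allele is the reference or an alternate allele.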

#### Allele-Specific PCR Assay


The identity of pharmaSNP allelic variants was assayed in the same 132 patients using custom-designed allele-specific PCR assays following the Web-based Allele-Specific PCR (WASP) primer design protocol (Wangkumhang et al., 2007). Both the WASP and Primer-BLAST (Ye et al., 2012) tools were used to design pairs of allele-specific forward primers that overlap with the pharmaSNPs of interest and their corresponding single reverse primers. PCR assays were performed using the Thermo Scientific™ Taq DNA Polymerase kit, with 25 µL final reagent volume, on the Bio-Rad thermocycler (C1000 Touch™ Thermal Cycler). PCR products were visualized and scored as homozygous non-effect allele, heterozygous, or homozygous effect allele using electrophoresis performed with 2.5% agarose gels stained with ethidium bromide (10 µL) with a running time of 60 min at 70 V in 1X TBE buffer. UV light was used to visualize the gel-separated PCR products.
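The scoring logic can be summarized with a short Python sketch. It assumes that each sample is run in two reactions, one per allele-specific forward primer, and that a visible band indicates amplification; the function name and the "no call" outcome are hypothetical additions for illustration.

```python
def score_genotype(band_effect, band_non_effect):
    """Map band presence in the two allele-specific reactions to a genotype.

    band_effect: True if the effect-allele primer reaction shows a band.
    band_non_effect: True if the non-effect-allele primer reaction shows a band.
    """
    if band_effect and band_non_effect:
        return "heterozygous"
    if band_effect:
        return "homozygous effect allele"
    if band_non_effect:
        return "homozygous non-effect allele"
    # Assumption: no band in either reaction is treated as a failed assay.
    return "no call"
```

Genotypes scored this way can then be compared with the exome-derived calls for the same patients to check concordance.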

#### RESULTS

#### Pharmacogenomic SNP Variation Worldwide

We operationally define pharmaSNPs as human nucleotide variants that are known to affect how individuals respond to medications. The Pharmacogenomics Knowledgebase (PharmGKB<sup>6</sup>) provides a catalog of pharmaSNPs together with information regarding their known impacts on drug response. PharmGKB categorizes pharmaSNPs with respect to their specific effects on drug efficacy, dosage, or toxicity/adverse drug reactions as well as the level of evidence for their role in drug response: (1) high, (2) moderate, (3) low, or (4) preliminary. We mined the PharmGKB database for pharmaSNPs across all four evidence levels, yielding a total of 1995 SNPs genome-wide.

We evaluated the global patterns of pharmaSNP variation using whole genome sequence data for 26 populations from five continental (super) population groups characterized as part of the 1000 Genomes Project (Genomes Project et al., 2015). Levels and patterns of variation for pharmaSNPs were compared to a genome-wide background set of >28 million SNPs. Across all 26 global populations, pharmaSNPs show a very high average MAF (avg. MAF = 0.25) compared to genome-wide SNPs (avg. MAF = 0.02; **Figure 1A**). PharmaSNPs also show significantly higher levels of the fixation index (avg. FST = 0.07), a measure of between-population differentiation, for global populations compared to genome-wide SNPs (avg. FST = 0.01; **Figure 1B**). It should be noted that the higher avg. MAF observed for pharmaSNPs compared to genome-wide SNPs could reflect an ascertainment bias owing to a relative excess of rare variants in the 1000 Genomes Project sequence data. However, no such bias is expected for the FST values as calculated here, which are largely unaffected by the presence of rare variants in the 1000 Genomes Project data (Bhatia et al., 2013).

Given the high levels of variation and between-population discrimination shown by pharmaSNPs, we also evaluated the extent to which they carry information about genetic ancestry and admixture, particularly for the Colombian populations of Antioquia and Chocó. Pairwise genomic distances were computed for the Colombian populations together with a set of global reference populations from Africa, the Americas, and Europe, using both pharmaSNPs and the genome-wide SNP set. Pairwise genomic distances computed using both sets of SNPs were used to reconstruct the evolutionary relationships among human populations worldwide. The results for the genome-wide (**Figure 1C**) and pharmaSNP (**Figure 1D**) sets are highly similar. The genome-wide SNP set does provide higher resolution and tighter groupings than the pharmaSNPs, but the nature of the relationships among global populations does not change between the two SNP sets. The African, European, and Native American populations occupy the three poles of the MDS plot, with Antioquia falling along the axis between the European and Native American groups and Chocó grouping more closely with the African populations. Both Colombian populations show evidence of substantial admixture compared to the global reference populations.

We performed a similar comparison of the ability of pharmaSNPs to quantify patterns of genetic ancestry compared to genome-wide SNPs using the program ADMIXTURE. Using K = 3 ancestry components, genome-wide SNPs clearly distinguish the reference African, European, and Native American populations, and characterize the Colombian populations of Antioquia and Chocó as distinct mixtures of all three ancestries (**Figure 1E**). Consistent with previous results (Conley et al., 2017), Antioquia shows an average of 61% European, 32% Native American, and 7% African ancestry, whereas Chocó shows primarily African ancestry (76%) followed by 13% Native American, and 11% European fractions. PharmaSNPs show qualitatively similar results albeit with lower resolution compared to the genome-wide SNP set (**Figure 1F**). Using pharmaSNPs, the global reference populations are not quite as distinct and the European component of ancestry appears to be overestimated in both the Native American reference populations as well as Antioquia and Chocó. Nevertheless, the clear distinction between the patterns of ancestry and admixture for the Colombian populations, whereby Antioquia is primarily European and Chocó is mostly African, is captured when only the pharmaSNPs are used.

#### Pharmacogenomic SNP Variation in Colombia: Antioquia Versus Chocó

Despite the fact that the Colombian administrative departments of Antioquia and Chocó are located in close proximity, their populations have distinct global origins (**Figure 2A**). As discussed in the previous section and elsewhere (Rishishwar et al., 2015; Medina-Rivas et al., 2016; Conley et al., 2017), the population of Antioquia shows mainly European genetic ancestry with substantial Native American admixture, whereas Chocó has primarily African ancestry with lower levels of Native American and European admixture. In light of the high levels of global variation seen for pharmaSNPs (**Figure 1**), we expected to see pronounced differences in the distributions of pharmaSNP alleles between Antioquia and Chocó. Such differences should have implications for public health strategies in the country, particularly with respect to the allocation of resources for pharmacogenomic testing.

<sup>6</sup>https://www.pharmgkb.org/

FIGURE 1 | … Europe, and the Americas for (C) genome-wide SNPs and (D) pharmaSNPs. ADMIXTURE plots showing the genome-wide continental ancestry fractions using (E) all genome-wide SNPs and (F) only pharmaSNPs for admixed Colombian populations (Antioquia and Chocó) and reference African (blue), European (orange), and Native American (red) populations.

We compared the frequencies of pharmaSNP effect alleles between Antioquia and Chocó to test this hypothesis. PharmaSNP effect alleles are operationally defined for this purpose as the allelic variants that increase the observed effect for a given drug-gene interaction, i.e., the alleles that increase the efficacy, dosage, or risk of toxicity/adverse drug responses for a drug. To ensure maximum relevance of our results for public health in Colombia, we focused on pharmaSNPs corresponding to the highest evidence levels in PharmGKB (levels 1 and 2; n = 155 pharmaSNPs). PharmaSNP effect allele frequency differences between Antioquia and Chocó were measured in two ways – (1) as the log transformed ratio of allele frequencies Antioquia/Chocó and (2) as the allele frequency differences between Antioquia and Chocó – in order to capture both high relative differences at low allele frequencies and high absolute differences at high allele frequencies (**Figure 2B**). When these two dimensions of pharmaSNP effect allele frequency differences are plotted orthogonally, the Euclidean distance from the origin captures the overall between-population difference seen for each SNP (**Figure 2C**).

As expected, numerous pharmaSNP effect alleles show large frequency differences between Antioquia and Chocó (**Figure 2**). We sought to quantify the role that the distinct genetic ancestry profiles of these two populations play in these pharmaSNP effect allele frequency differences. To do so, we developed and applied an ancestry association method whereby individuals' genetic ancestry fractions – African, European, and Native American – are regressed against their genotypes for any given pharmaSNP. This approach allows us to visualize and quantify the influence of genetic ancestry on pharmaSNP genotype frequencies in these two diverse Colombian populations. **Figure 3** shows examples of ancestry associations for three pharmaSNPs with high levels of effect allele (and genotype) divergence between Antioquia and Chocó; ancestry associations for nine additional pharmaSNPs of interest to Colombia can be seen in **Supplementary Figure S1**. **Table 1** shows the results of ancestry association analyses for 13 pharmaSNPs of interest to Colombia, based on high levels of divergence between Antioquia and Chocó, and **Supplementary Table S1** contains the ancestry association results for all level 1 and 2 PharmGKB SNPs showing pharmaSNP effect allele Euclidean distances > 0.5 (as shown in **Figure 2C**).

#### TABLE 1 | Colombian ancestry-associated pharmaSNPs of interest.

The pharmaSNPs marked with an asterisk (<sup>∗</sup>) are shown in Figure 3.

#### Tacrolimus

The T allele of the pharmaSNP rs776746 (CYP3A5\*3) is found at higher frequency in Chocó and is positively correlated with African ancestry and negatively correlated with both European and Native American ancestry (**Figure 3A**). This pharmaSNP is a splice site acceptor variant located within an intron of the CYP3A5 (Cytochrome P450 Family 3 Subfamily A Member 5) encoding gene. The T allele is associated with increased metabolism of tacrolimus, an immunosuppressive drug often used to treat transplant patients, and thus individuals with T-containing genotypes may require relatively higher dosages of this drug. Consistent with these observations, physicians in Cali, Colombia, have anecdotally reported that Afro-Colombian transplant patients do not respond well to standard doses of tacrolimus.

#### Warfarin

The C allele of the pharmaSNP rs9923231 (VKORC1\*2) shows a similar pattern with higher frequency in Chocó, a positive correlation with African ancestry, and negative correlations with both European and Native American ancestry (**Figure 3B**). This pharmaSNP is one of several variants of the VKORC1 (Vitamin K Epoxide Reductase Complex Subunit 1) encoding gene that have been associated with warfarin sensitivity. The SNP is located in the upstream, regulatory region of the gene, and individuals with the C allele may require an increased dosage of warfarin.

#### Simvastatin

The C allele of the pharmaSNP rs4149056 (SLCO1B1<sup>∗</sup> 5) is found at higher frequency in Antioquia, showing a negative correlation with African ancestry and a positive correlation with European ancestry (**Figure 3C**). The correlation with Native American ancestry for this SNP is not significant. This SNP is a missense variant in the SLCO1B1 encoding gene. The C allele is associated with simvastatin toxicity, and individuals with this allele may be at higher risk for simvastatin-related myopathy. These results are consistent with observations of physicians from the Universidad CES clinic in Antioquia, who have observed that ∼30% of patients treated with Simvastatin show evidence of adverse drug reactions.

#### Metformin

The C allele of the pharmaSNP rs11212617 is found at substantially higher frequency in Chocó compared to Antioquia, and it is positively correlated with African ancestry and negatively correlated with both European and Native American ancestry (**Figure 3D**). This pharmaSNP shows an interaction with the type 2 diabetes drug Metformin; the C effect allele was found to be associated with greater treatment success (GoDARTS and Ukpds Diabetes Pharmacogenetics Study Group et al., 2011). Interestingly, metformin was subsequently shown to have higher efficacy for the reduction of blood glucose levels in African–Americans compared to European–Americans (Williams et al., 2014; Zhang and Zhang, 2015). Thus, this ancestry-associated pharmaSNP shows a direct connection between genetic ancestry differences and differential drug response.

#### Cost-Effective pharmaSNP Genotyping in Colombia With Allele-Specific PCR

The results from the analysis of pharmaSNP variation in Colombia uncovered a number of SNPs with specific relevance to the country, in terms of anomalous effect allele frequencies within local populations, associations with different genetic ancestry groups, and broad relevance to public health. We reasoned that such population genomic profiling can be used to focus efforts to develop precision medicine in the country and to maximize the return on investment for pharmacogenomic testing in resource-limited settings. To this end, GenomaCES developed and validated three custom allele-specific PCR assays to genotype pharmaSNPs of special relevance to these Colombian populations.

The criteria for the selection of pharmaSNPs that were interrogated with our custom allele-specific PCR assays included the PharmGKB evidence level along with a combination of population genomic and clinical information. Pharmacogenomic assays were only developed for pharmaSNPs from the PharmGKB evidence level 1A. This is the highest evidence level and corresponds to pharmaSNPs that are included in medical society-endorsed pharmacogenomics guidelines and/or implemented in major health systems. The additional criteria used to prioritize pharmaSNPs for the development of allele-specific PCR assays were: (i) observations of population-specific allele frequencies in Colombia along with related ancestry-associations, (ii) pharmacogenomic associations with drugs that are widely prescribed in Colombia and used to treat common conditions, and (iii) pharmacogenomic associations with drugs for which GenomaCES investigators have anecdotal information from collaborating physicians that pharmacogenomic tests would be of use to the local population, based on their observations of anomalous drug responses in their patients. It should be noted that the population and clinical criteria are not mutually exclusive; indeed, physicians' observations of anomalous drug responses in their patient populations are almost certainly related to the population-specific allele frequencies of the relevant pharmaSNPs.

An example of an allele-specific PCR assay developed for the simvastatin-associated pharmaSNP rs4149056 (SLCO1B1<sup>∗</sup> 5), located within an exon of the SLCO1B1 protein coding gene on the short arm of chromosome 12, is shown in **Figure 4A**. The pharmaSNP variant detection assay relies on the use of two forward primers – one to capture the non-effect allele T and one to capture the effect allele C – and a single reverse primer. Use of these two primer pairs results in allele-specific amplicons, depending on the presence of each allele in an individual patient's genome. PCR results are shown for four patients: Patient-132 homozygous TT, Patient-44 heterozygous TC, and Patient-17 and Patient-26 homozygous CC (**Figure 4B**). We visualized the results of exome sequence analysis, with respect to the quality and coverage of mapped reads along with the counts of the different variant calls, to manually confirm the results of the allele-specific PCR assays (**Figure 4C**).
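The logic of reading a genotype from the two allele-specific reactions can be sketched as follows. This is only an illustration of the principle described above, assuming each reaction is scored as a simple present/absent band; the function name `call_genotype` and the "no call" label are hypothetical, and the sketch does not reproduce the laboratory's actual scoring procedure.

```python
def call_genotype(t_band_present: bool, c_band_present: bool) -> str:
    """Infer the rs4149056 genotype from two allele-specific PCR reactions.

    t_band_present: amplicon observed with the T-specific forward primer.
    c_band_present: amplicon observed with the C-specific forward primer.
    """
    if t_band_present and c_band_present:
        return "TC"       # both alleles amplified: heterozygous
    if t_band_present:
        return "TT"       # only the non-effect allele amplified
    if c_band_present:
        return "CC"       # only the effect allele amplified
    return "no call"      # neither reaction amplified: assay failure (assumed label)


# Band patterns matching the three genotype classes described in the text.
for bands in [(True, False), (True, True), (False, True)]:
    print(bands, "->", call_genotype(*bands))
```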

Having confirmed the accuracy of the rs4149056 (SLCO1B1<sup>∗</sup> 5) variant detection assay, we then ran it on a cohort of 132 de-identified patients from the GenomaCES laboratory, all of whom have exome sequences available for confirmatory analysis. The results of the allele-specific PCR and exome analyses are highly similar; taking the exome results as the ground truth against which to compare the PCR assay yields an overall accuracy of 97.7% for this test (**Figure 4D**). Two additional allele-specific PCR assays for SNPs associated with warfarin dosage – rs1799853 (CYP2C9<sup>∗</sup> 2) and rs1057910 (CYP2C9<sup>∗</sup> 3) – were tested on the same patient set and confirmed via exome sequence analysis. These two allele-specific PCR genotyping assays show even higher accuracies of 98.5 and 100%, respectively. We calculated a number of additional performance metrics for all three of these tests, breaking down each assay into its three constituent genotypes, the results of which are shown in **Supplementary Figure S2**.
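The accuracy figures above are derived from confusion matrices that compare exome calls (taken as ground truth) with PCR calls. The sketch below shows how overall accuracy and per-genotype sensitivity could be computed from such a matrix. The individual cell counts are hypothetical and invented purely for illustration; they were chosen only so that the total (132 patients) and the overall accuracy (97.7%) agree with the values reported in the text, and they do not reproduce **Figure 4D**.

```python
GENOTYPES = ["TT", "TC", "CC"]


def overall_accuracy(matrix):
    """Fraction of calls on the diagonal (PCR call equals exome call)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_genotype_sensitivity(matrix):
    """For each exome genotype (row), the fraction recovered by the PCR assay."""
    return {
        genotype: matrix[i][i] / sum(matrix[i])
        for i, genotype in enumerate(GENOTYPES)
        if sum(matrix[i]) > 0
    }


# HYPOTHETICAL counts: rows are exome genotypes, columns are PCR genotypes,
# both in the order TT, TC, CC. Not the study's data.
hypothetical_counts = [
    [90, 1, 0],
    [2, 30, 0],
    [0, 0, 9],
]

accuracy = overall_accuracy(hypothetical_counts)
sensitivity = per_genotype_sensitivity(hypothetical_counts)
print(f"overall accuracy: {accuracy:.1%}")
for genotype, value in sensitivity.items():
    print(f"sensitivity for {genotype}: {value:.1%}")
```

Breaking the result down by genotype, as in the second function, mirrors the per-genotype performance metrics that the text attributes to **Supplementary Figure S2**.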

### DISCUSSION

#### Caveats and Limitations

We would like to point out some of the caveats and limitations of the current study as they relate to the accuracy and utility of pharmacogenomic tests in understudied populations. The reach of our analysis is somewhat limited by the focus on pharmaSNPs, i.e., single nucleotide variants, as opposed to all possible genetic variants that may impact drug response. PharmGKB contains annotations of gene-to-drug response interactions that are mediated by a number of different kinds of variants, including larger scale structural variants such as insertion/deletion events and copy number variations (Roden et al., 2006; He et al., 2011). Furthermore, there are a number of pharmacogenomic tests that rely on the characterization of combinations of linked SNPs, i.e., haplotypes or star-alleles. For example, the most reliable warfarin sensitivity assays utilize multiple SNPs (haplotypes) across two genes in order to arrive at specific dosage recommendations (Johnson et al., 2011; Fung et al., 2012). Our survey of pharmaSNP variation will not capture these complex classes of pharmacogenomic variants and interactions.

Our focus on pharmaSNPs can be primarily attributed to the availability and the reliability of SNP data at our disposal, as opposed to other more complex genetic variants, particularly for the population of Chocó, which was characterized using a genome-wide SNP array (Medina-Rivas et al., 2016; Chande et al., 2017; Conley et al., 2017). Nevertheless, it is important to note that (i) there are numerous documented cases of individual SNPs that show demonstrable and reproducible effects on drug response (Lauschke et al., 2017) and (ii) there are many more pharmaSNPs available for analysis compared to the other variant classes (Whirl-Carrillo et al., 2012). For example, ∼93% of PharmGKB variant annotations correspond to individual pharmaSNPs (1995 out of 2144 total variants). Accordingly, we are confident that our study design captures the majority of the pharmacogenomically relevant human genetic variation based on current knowledge in the field.

Another limitation relates to the fact that we compared pharmaSNP allele frequencies among populations whose ancestries are distinct from those of the cohorts in which the pharmaSNPs were originally characterized. As with other classes of clinical genetics studies (Petrovski and Goldstein, 2016; Popejoy and Fullerton, 2016), there remains a very strong bias whereby the majority of pharmacogenetic clinical trials have been conducted in developed countries on cohorts with European ancestry (Karlberg, 2008; Thiers et al., 2008). Thus, it is formally possible that the pharmaSNPs we analyzed may have different effects on drug response in our populations of interest. Of course, the most rigorous way to assess the population-specific role of genetic variation in drug response would be to conduct clinical trials in all populations of interest. Currently, however, the high cost and complexity of performing clinical trials across multiple populations, particularly for variants with already well documented effects on drug response, renders this approach prohibitive. In addition, it is important to point out that the associations between pharmaSNPs and drug response that our study relies on are far more likely to be causal than associations uncovered by genome-wide association studies (GWAS), many of which do not replicate across populations with distinct ancestry profiles (Martin et al., 2017). This is because GWAS SNPs do not correspond to causal variants per se; rather, they are tag variants that mark haplotypes wherein the causal SNPs lie, and haplotype structure is known to vary widely across populations (Conrad et al., 2006). PharmaSNPs, on the other hand, correspond to the specific causal variants for which there is direct evidence of an impact on drug response. This is particularly the case for the narrower set of 155 pharmaSNPs with the highest PharmGKB confidence levels, which we used for our comparison of Antioquia and Chocó. The strong clinical and experimental evidence for the effects of these high confidence pharmaSNPs on drug response gives us confidence with respect to their potential relevance for our populations of interest.

FIGURE 4 | Allele-specific PCR assay for pharmaSNPs. (A) Schema depicting the design of the allele-specific PCR assay for the pharmaSNP rs4149056 (SLCO1B1∗5) on chromosome 12. Two allele-specific forward primers are designed for the pharmaSNP of interest and paired with a single reverse primer, yielding allele-specific amplicons. (B) Allele-specific PCR results for four individuals are shown. PCR gel lanes are labeled with the allele used for the forward primer – T or C. (C) Results of exome sequence analysis used to confirm the results of the allele-specific PCR assays. Sequence reads (red – forward, blue – reverse) mapped to the genomic position for the SNP rs4149056, coverage levels (gray boxes above), and the identity of the called nucleotide variants at that same position are shown along with the reference nucleotide and amino acid sequences for the corresponding region of the SLCO1B1 gene (protein). Images were taken from the Integrative Genomics Viewer (IGV). Confusion matrices showing comparisons between the pharmaSNP variant calls made via exome sequence analysis and the allele-specific PCR assays are shown for (D) the simvastatin toxicity SNP rs4149056 (SLCO1B1∗5), and the warfarin dosage SNPs (E) rs1799853 (CYP2C9∗2) and (F) rs1057910 (CYP2C9∗3). Identical variant calls are shown along the diagonal, whereas off-diagonal calls show discrepancies between the exome and PCR variant calls; accuracy levels for each test are shown.

#### The Underlying Complexity of So-Called Hispanic/Latino Populations

As briefly mentioned in the previous section, a number of recent studies have underscored the major sampling bias that currently exists for human clinical genomic studies and emphasized the corollary importance of extending clinical trials to currently understudied populations. These studies rely on a variety of labels related to "Hispanic/Latino" to describe understudied populations from Latin America, or individuals and communities with origins in Latin America. For example, in a survey of the ancestry of study participants in GWAS cohorts, the authors used the label "Hispanic and Latin American ancestry," showing that members of this group made up a mere 0.06% of GWAS study participants in 2009 and 0.54% in 2016 (Popejoy and Fullerton, 2016). Another study, which demonstrated the importance of using matched ancestry samples for clinical variant interpretation, employed the category "Latino ethnicity" to classify exome variants into a single control group (Petrovski and Goldstein, 2016). The widely used Exome Aggregation Consortium (ExAC) database uses the term "Latino" as a population category for exome sequence variants (Lek et al., 2016), and the 1000 Genomes Project uses the super population code "Ad Mixed American (AMR)" to group genetically diverse populations from Colombia, Mexico, Peru, and Puerto Rico (Genomes Project et al., 2015).

It is interesting to note that the origins of the term Hispanic/Latino as a catch-all phrase to describe an extraordinarily diverse set of populations can be traced to decisions imposed by activists and bureaucrats of the US Census Bureau, motivated by the opportunity to create a politically influential interest group (Mora, 2014). The results of our study highlight the artificial nature, and the lack of practical utility, of the Hispanic/Latino label as it pertains to clinical genetic studies. Our two populations of interest – Antioquia and Chocó – would both be considered Hispanic/Latino, and in fact they are both from the same country within Latin America, but they have very distinct patterns of genetic ancestry and admixture. Furthermore, we show here that the differences in genetic ancestry have specific implications for the pharmacogenomic profiles of each population. The same thing will certainly hold true for many other sets of populations both within and between different Latin American countries. In light of this realization, we would like to emphasize that the stratification of so-called Hispanic/Latino populations for clinical genetic studies should be performed using their distinct genetic ancestry profiles as opposed to a politically imposed pan-ethnic label.

#### Population-Guided Approaches to Pharmacogenomics in the Developing World

We hope that the population pharmacogenomic approach we applied to Colombian populations in this study can serve as a model for the broader application of such approaches in the developing world. Currently, genomic approaches to precision medicine are prohibitively expensive for many developing countries owing to their reliance on deep genetic characterization of individual patients. Precision public health, on the other hand, entails population-level interventions, and the focus on populations can provide a more cost-effective means for the implementation of novel genomic approaches to healthcare (Khoury et al., 2016; Khoury et al., 2018; Weeramanthri et al., 2018). Population-guided approaches to pharmacogenomics allow healthcare providers to allocate resources and efforts where they will be most effective by uncovering pharmacogenomic variants with special relevance to specific populations (Bachtiar and Lee, 2013; Nordling, 2017).

Here, we report a number of examples of pharmacogenomic variants with anomalously high effect allele frequencies in distinct Colombian populations. For example, the T allele of the pharmaSNP rs776746 is associated with African ancestry and found at a relatively high frequency in Chocó (**Figure 3** and **Table 1**). Since this variant is associated with the need for a higher dosage of the immunosuppressive drug Tacrolimus, Afro-Colombians may be particularly prone to organ rejection following allogeneic transplant. Accordingly, the local deployment of a pharmacogenomic test for this particular SNP in Chocó would simultaneously focus limited resources for genetic testing while also ensuring an outsized impact for Afro-Colombian patients. As another example, the population of Antioquia shows an elevated frequency of the C allele of the pharmaSNP rs4149056, which is associated with increased risk of simvastatin toxicity (**Figure 3** and **Table 1**). The development of a pharmacogenomic assay for this SNP, which is currently underway at GenomaCES in Antioquia, could help to mitigate the risk of adverse drug reactions to this commonly prescribed medication in the local population.

#### Prospects for Pharmacogenomics in Colombia

This is an auspicious moment for the development of pharmacogenomic approaches to public health in Colombia. The Colombian biomedical community is simultaneously faced with a combination of great opportunities and profound challenges, both with respect to genomic medicine overall and for pharmacogenomics in particular (De Castro and Restrepo, 2015). In all of South America, Colombia is one of only two countries, together with Argentina, with nationalized healthcare systems that guarantee comprehensive coverage for all of their citizens. In 2015, the terms of this guarantee were updated, via the Ministry of Health and Social Protection resolution 5592, to cover broadly defined molecular genetic and genomic tests. This change resulted in a far more comprehensive coverage policy for these kinds of tests than currently exists in the United States, where many precision medicine treatments are still paid for directly by patients (Szabo, 2018). This resolution reflects great foresight on the part of Colombian policy makers and represents a tremendous opportunity for local biomedical researchers, clinicians, and the patients that they serve. Furthermore, a very strong case has been made for how genome-enabled approaches to precision medicine should ultimately lead to substantial cost savings for the national healthcare system over the long term (Gallo, 2017; Gibson, 2018).

On the other hand, the costs of many of the tests covered by this policy are so high in Colombia that the sustainability of the policy has been called into serious question. For example, the molecular biology reagents needed for tests of this kind can often cost three times as much or more in Colombia, compared to the United States, owing to taxes and tariffs. We firmly believe that key solutions to this economic challenge will be to (i) build the local capacity needed to perform such tests and (ii) develop genomic assays that are specifically tailored to the needs of Colombian populations. To these ends, Universidad CES has invested substantially in the development of local capacity in genomic medicine via the establishment of GenomaCES, which is Colombia's first homegrown genomic medicine laboratory. As we have shown here, GenomaCES is working to develop inexpensive and rapid pharmacogenetic genotyping tests based on relatively simple allele-specific PCR assays. Developing local tests of this kind can help to ensure that variants of specific relevance to the country are prioritized for testing and to avoid the prohibitively high costs of commercially available tests and/or kits.

### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the **Supplementary Files**.

### AUTHOR CONTRIBUTIONS

AV-A, IJ, and JG conceived of and designed the study. SN, AM, EN, LR, AC, KO'N, SV-G, and CM-R performed all data analysis. AM, SV-G, WJ-Á, and IT performed laboratory assays. MM-R, AV-A, IJ, and JG supervised and managed all aspects of the project in Colombia and the United States. MM-R and JG acquired study subject samples. SN, LR, and IJ prepared figures and wrote the manuscript.

### FUNDING

SN, EN, LR, AC, and IJ were supported by the IHRC-Georgia Tech Applied Bioinformatics Laboratory. AM, SV-G, CM-R, WJ-Á, IT, and JG were supported by GenomaCES and Universidad CES. AV-A was supported by Fulbright Colombia.

### ACKNOWLEDGMENTS

We would like to acknowledge the study subjects from Antioquia and Chocó who chose to trust us with that most precious and personal biological asset – their DNA.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00241/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Nagar, Moreno, Norris, Rishishwar, Conley, O'Neal, Vélez-Gómez, Montes-Rodríguez, Jaraba-Álvarez, Torres, Medina-Rivas, Valderrama-Aguirre, Jordan and Gallo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Expanding Research Capacity in Sub-Saharan Africa Through Informatics, Bioinformatics, and Data Science Training Programs in Mali

Jeffrey G. Shaffer<sup>1</sup> \*, Frances J. Mather<sup>1</sup> , Mamadou Wele<sup>2</sup> , Jian Li<sup>1</sup> , Cheick Oumar Tangara<sup>2</sup> , Yaya Kassogue<sup>2</sup> , Sudesh K. Srivastav<sup>1</sup> , Oumar Thiero<sup>2</sup> , Mahamadou Diakite<sup>2</sup> , Modibo Sangare<sup>2</sup> , Djeneba Dabitao<sup>2</sup> , Mahamoudou Toure<sup>2</sup> , Abdoulaye A. Djimde<sup>2</sup> , Sekou Traore<sup>2</sup> , Brehima Diakite<sup>2</sup> , Mamadou B. Coulibaly<sup>2</sup> , Yaozhong Liu<sup>1</sup> , Michelle Lacey<sup>3</sup> , John J. Lefante<sup>1</sup> , Ousmane Koita<sup>2</sup> , John S. Schieffelin<sup>4</sup> , Donald J. Krogstad<sup>1</sup> and Seydou O. Doumbia<sup>2</sup>

<sup>1</sup> Department of Global Biostatistics and Data Science, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, United States, <sup>2</sup> Faculty of Medicine and Odontostomatology, University of Sciences, Techniques and Technologies of Bamako, Bamako, Mali, <sup>3</sup> Department of Mathematics, Tulane University, New Orleans, LA, United States, <sup>4</sup> Sections of Pediatric & Adult Infectious Diseases, School of Medicine, Tulane University, New Orleans, LA, United States

#### Edited by:

Nicola Mulder, University of Cape Town, South Africa

#### Reviewed by:

Nicki Tiffin, University of Cape Town, South Africa Faisal Mohamed Fadlelmola, University of Khartoum, Sudan

> \*Correspondence: Jeffrey G. Shaffer

jshaffer@tulane.edu

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 16 November 2018 Accepted: 28 March 2019 Published: 12 April 2019

#### Citation:

Shaffer JG, Mather FJ, Wele M, Li J, Tangara CO, Kassogue Y, Srivastav SK, Thiero O, Diakite M, Sangare M, Dabitao D, Toure M, Djimde AA, Traore S, Diakite B, Coulibaly MB, Liu Y, Lacey M, Lefante JJ, Koita O, Schieffelin JS, Krogstad DJ and Doumbia SO (2019) Expanding Research Capacity in Sub-Saharan Africa Through Informatics, Bioinformatics, and Data Science Training Programs in Mali. Front. Genet. 10:331. doi: 10.3389/fgene.2019.00331

Bioinformatics and data science research have boundless potential across Africa due to its high levels of genetic diversity and disproportionate burden of infectious diseases, including malaria, tuberculosis, HIV and AIDS, Ebola virus disease, and Lassa fever. This work lays out an incremental approach for reaching underserved countries in bioinformatics and data science research through a progression of capacity building, training, and research efforts. Two global health informatics training programs sponsored by the Fogarty International Center (FIC) were carried out at the University of Sciences, Techniques and Technologies of Bamako, Mali (USTTB) between 1999 and 2011. Together with capacity building efforts through the West Africa International Centers of Excellence in Malaria Research (ICEMR), this progress laid the groundwork for a bioinformatics and data science training program launched at USTTB as part of the Human Heredity and Health in Africa (H3Africa) initiative. Prior to the global health informatics training, its trainees published first or second authorship and third or higher authorship manuscripts at rates of 0.40 and 0.10 per year, respectively. Following the training, these rates increased to 0.70 and 1.23 per year, respectively, which was a statistically significant increase (p < 0.001). The bioinformatics and data science training program at USTTB commenced in 2017, focusing on student, faculty, and curriculum tiers of enhancement. The program's sustainability measures included institutional support for core elements, university tuition and fees, resource sharing and coordination with local research projects and companion training programs, increased student and faculty publication rates, and increased research proposal submissions. Challenges included the reliance of high-speed bandwidth availability on short-term funding, lack of a discounted software portal for basic software applications, protracted application processes for United States visas, lack of industry job positions, and low publication rates in the areas of bioinformatics and data science. Long-term, incremental processes are necessary for engaging historically underserved countries in bioinformatics and data science research. The multi-tiered enhancement approach laid out here provides a platform for generating bioinformatics and data science technicians, teachers, researchers, and program managers. Increased literature on bioinformatics and data science training approaches and progress is needed to provide a framework for establishing benchmarks on the topics.

Keywords: bioinformatics, data science, data capture and management systems, genetics, genomics, Human Heredity and Health in Africa (H3Africa), malaria, training

### INTRODUCTION

African countries have long been disproportionately burdened by the "big three" infectious diseases (HIV and AIDS, tuberculosis, and malaria) and neglected emerging infectious diseases such as Ebola virus disease (EVD) and Lassa fever. African populations maintain the world's highest levels of genetic diversity, which declines proportionately with increasing distance from Africa (Tishkoff et al., 2009). Bioinformatics and data science research [considered in this context, respectively, as the methods and software tools for understanding biological data, and the unification of data design, collection, and analysis (Hayashi, 1998; Wikipedia, 2019a)] thrives on genetically diverse populations, as population substructure variation contributes to the identification of true associations in complex disorders and drug response (Campbell and Tishkoff, 2008; Tishkoff et al., 2009; Quansah and McGregor, 2018). Research on these topics within Africa provides considerable opportunities for improving health outcomes through its application to infectious disease research, vaccine and drug development, and the study of drug resistance patterns. The completion of the Human Genome Project and technological advances have led to significant cost reductions for genomic data acquisition and also provide immense opportunities for novel insights into etiology, diagnosis, and therapy (Tishkoff et al., 2009).

African researchers and participant populations have historically been underrepresented in GWAS. Through 2014, only 11 of the thousands of GWAS conducted had included African participants (Rotimi et al., 2014). While African countries such as South Africa have strong bioinformatics and data science capabilities, such capacity is unevenly distributed across Africa, and many of its countries have yet to develop any of these capacities (Aron et al., 2017). Currently, bioinformatics and data science degree programs are concentrated within several African institutions (Karikari et al., 2015b). Other factors negatively impacting bioinformatics and data science research in Africa include weak biomedical infrastructure; lack of governmental financial support; limited computational expertise; lack of participation in collaborative research beyond sample collection; and limited training opportunities, biorepositories, and databases (Tishkoff et al., 2009; Woolley et al., 2010; Rotimi et al., 2014; Karikari et al., 2015a; World Health Organization, 2015; Nielsen et al., 2017).

To fully benefit from advances in bioinformatics and data science research, it is imperative to train the next generation of African scientists in their use (Adoga et al., 2014; Human Heredity and Health in Africa, 2018). Tastan Bishop et al. (2015) note that the shortage of trained bioinformaticians is among the main obstacles to the development of bioinformatics in Africa. Doumbo and Krogstad (1998) note that doctoral training on advanced topics is essential for African countries to define and implement their own health priorities. These demands call for building local university programs and infrastructure to establish environments that are conducive to bioinformatics and data science training. Bioinformatics is known to require less infrastructural investment than bench science initiatives, but essential resources are still necessary, such as powerful computer systems, reliable high-speed internet, access to databases and software programs, and reliable electricity (Karikari et al., 2015a). Karikari et al. (2015a) also note the importance of research infrastructure, research funding, training programs, scientific networking, and collaborations as key elements for developing bioinformatics expertise. Other factors affecting the implementation of training programs include teaching laboratories, server systems, airfare cost, timeliness of visas, suitable computational infrastructure, socio-political stability, and availability of open training spots (Aron et al., 2017; Shaffer et al., 2018). This capacity may be gained through research and training on overlapping computationally intensive topics such as data management and data capture (Shaffer et al., 2018). Attwood et al. (2017) describe the importance of data management, data storage, data integration, data sharing, and data science in bioinformatics training. DCMSs are regularly noted in the literature as key tools for establishing sustainable and collaborative research efforts (Lansang and Dennis, 2004; World Health Organization, 2004; Abou Zahr and Boerma, 2005; Kirigia and Wambebe, 2006; Gezmu et al., 2011; Gutierrez et al., 2015; Mulder et al., 2017; Shaffer et al., 2018).

**Abbreviations:** ABioNET, African Bioinformatics Network; ACE, African Centers of Excellence in Bioinformatics program; AESA, Alliance for Accelerating Science in Africa; ASHG, African Society of Human Genetics; COI, conflict of interest; DCMS, data capture and management systems; EVD, Ebola virus disease; FNIH, Foundations for the National Institutes of Health; GDP, Gross domestic product; GIS, geographic information systems; GWAS, Genome-wide association studies; H3Africa, Human Heredity and Health in Africa; ICEMR, West Africa International Centers of Excellence for Malaria Research; IRB, Institutional review board; ITGH, Informatics Training in Global Health; ITMI, International Training in Medical Informatics; LMIC, Low and middle income countries; M.Sc., Master of Science; MRTC, Malaria Research and Training Center; MSPH, Master of Science in Public Health; NGS, next generation DNA sequencing; NIAID, National Institute of Allergy and Infectious Diseases; NIH, National Institutes of Health; NMCP, Mali National Malaria Control Program; USTTB, University of Sciences, Techniques and Technologies of Bamako, Mali; WHO, World Health Organization.

Multi-country organizations such as the H3Africa and H3Africa BioNet (H3ABioNet) consortiums have yielded extensive training and research opportunities within Africa (Human Heredity and Health in Africa, 2013; National Institutes of Health, 2018). The H3Africa initiative aims to study genomics and environmental diseases to improve the health of African populations through a partnership among the AESA, the Wellcome Trust, the ASHG, and the NIH (Adoga et al., 2014; Human Heredity and Health in Africa, 2018). The H3Africa Consortium has diversified bioinformatics skills and training in Africa, providing genomics training for over 500 Africans in approximately 5 years (Mulder et al., 2018a). H3ABioNet is a Pan-African bioinformatics network consisting of 32 bioinformatics research groups in 15 African countries and partner institutions in the United States. It provides training in both introductory bioinformatics topics and specialized topics such as next generation sequencing (NGS) and GWAS (National Institutes of Health, 2018). The H3ABioNet bioinformatics training platform includes distance-based online training courses using virtual classrooms across 20 African institutions (Gurwitz et al., 2017). The Eastern Africa Network of Bioinformatics Training (EANBiT) provides bioinformatics training in Kenya as part of an M.Sc. program in bioinformatics (International Centre of Insect Physiology and Ecology, 2018). Doctoral training in bioinformatics is also provided in Botswana and Uganda through the Collaborative African Genomics Network (CAfGEN; Mlotshwa et al., 2017). Karikari (2015) discusses the current bioinformatics training programs in Ghana. Tastan Bishop et al. (2015) lay out the development of bioinformatics as a discipline and list the current bioinformatics degree programs in Africa. Mulder et al. (2018b) provide guidelines for competencies for bioinformatics training in Africa.
The African Genomic Center maintains the first genome sequencing facility that was launched in Cape Town in 2018 and includes a strong bioinformatics training component (SAMRC, 2018). Other organizations promoting bioinformatics in Africa include The African Society for Bioinformatics and Computational Biology and formerly The ABioNET (SciDevNet, 2004; African Society for Bioinformatics and Computational Biology, 2019).

The focus of the current work is bioinformatics and data science training in Mali, a country in sub-Saharan Africa. Research in Mali has emphasized malaria as it is the country's primary cause of morbidity and mortality, representing 42% of consultations in its health centers (Sissoko et al., 2017). Malaria control strategies in Mali have emphasized universal intervention coverage, epidemic and entomological surveillance, and targeted operational research (President's Malaria Initiative, 2018). Substantial progress in malaria reduction has occurred through scaling up malaria prevention and control interventions, resulting in a nearly 50% reduction in malaria mortality rates in children under 5 years of age (President's Malaria Initiative, 2018). However, resistance to antimalarial drugs has complicated efforts to fully control malaria. The utilization of genomic and clinical data to understand parasite evolution, predict behaviors of resistance to new antimalarial medication, and inform strategies to prevent the spread of drug-resistant malaria is thus of great importance (Flegg et al., 2011; Fairhurst et al., 2012; Maiga et al., 2012; Takala-Harrison and Laufer, 2015; Oboh et al., 2018). Other infectious diseases with significant burden in Mali include leishmaniasis, filariasis, and tick-borne diseases. Neglected infectious diseases that have not been extensively studied in Mali (but are not necessarily absent) include Lassa fever and EVD (Schieffelin et al., 2014; Shaffer et al., 2014; Traore et al., 2016).

As with many countries in sub-Saharan Africa, Mali has significant limitations in developing, implementing, sustaining, and expanding innovative mechanisms for research efforts and clinical trials that are central to its health improvement (Miiro et al., 2013; Mwangoka et al., 2013; Richie et al., 2015; Dicko et al., 2016; Niare et al., 2016). Recent studies on health information systems (HIS) in Mali reported limited expertise in data management, data analysis, and report generation (MEASURE Evaluation, 2014, 2016). Mali also shares the difficult task of collecting data through a weak HIS for monitoring the health of its population (Asangansi, 2012; Ndabarora et al., 2014). Despite these limitations, research investments in Mali have been substantial. Mali was established as an International Center of Excellence in Research (ICER) in 2002 and is currently ranked as the seventh highest investment country for malaria research (Head et al., 2017). The USTTB regularly serves as the lead institution for research and training projects, including several recent awards as part of the H3Africa initiative (Landoure et al., 2016; Human Heredity and Health in Africa, 2019).

Here we describe an incremental approach for engaging the next generation of African scientists in research through a progressive sequence of informatics, bioinformatics, and data science training programs at the USTTB. We describe the approaches, developments, and challenges encountered, culminating with the West African Center of Excellence for Global Health Bioinformatics Research Training program, in an effort to assist researchers in reaching underserved populations in similar environments.

#### MATERIALS AND METHODS

#### Study Site

Situated in urban Bamako, Mali, USTTB comprises schools of medicine, pharmacy, and basic sciences; an institute of applied science; and research laboratories focusing on malaria, tuberculosis, and retrovirology (Harvard T.H. Chan School of Public Health, 2018). The site maintains teaching computer laboratories; server systems; and a formal data center including computer workstations, printers, and internet access in controlled-access spaces. USTTB is a member of the REDCap (Vanderbilt University, Nashville, TN, United States) Consortium. The site is situated near the epicenter for a host of infectious diseases and is surrounded by numerous complementary research efforts and networks, including the West Africa International Centers of Excellence for Malaria Research (ICEMR; National Institute of Allergy and Infectious Diseases, 2018a).

### An Incremental Approach for Engaging Underserved Populations in Bioinformatics and Data Science Research

Formal research and training infrastructure at USTTB dates back to 1989 with the launch of the MRTC. The facility maintained hardwired internet access, laboratories, classrooms, conference rooms, and a library (Science Blog, 2000). The MRTC supported a host of internationally funded research projects (particularly through the NIAID) and training programs and worked closely with Mali's National Malaria Control Program (NMCP; Science Blog, 2000). While the MRTC's mission was not initially focused on molecular research, it spawned growth in the area through its capacity building, particularly in epidemiology. Bioinformatics was formally introduced to Mali in 2003 through the African Center for Training in Functional Genomics of Insect Vectors of Human Disease (AFRO VECTGEN), which was sponsored by the WHO as part of its Special Programme for Research and Training in Tropical Diseases (TDR) initiative. A timeline of incremental developments in bioinformatics and data science capacity building, research, and training at USTTB is listed in **Table 1**.

The West African Center of Excellence for Global Health Bioinformatics Research Training program was launched in 2017 (Africa). The program leveraged infrastructure and personnel from: two earlier informatics training programs, a malaria research project, the USTTB bioinformatics M.Sc. program, and the African Center of Excellence in Bioinformatics (ACE) teaching computer laboratories (Doumbia et al., 2012; Koita et al., 2016, 2017; National Institute of Allergy and Infectious Diseases, 2018b). Descriptions of these efforts follow.

### International Training in Medical Informatics (ITMI)

From 1999 to 2003, the ITMI program provided short- and long-term training in informatics for Malian researchers at the MRTC and governmental health agencies across West Africa. The ITMI program was complemented with research on determinants of drug resistance, immune evasion and virulence in malaria, development of field research sites to study drug resistance, human response to malaria, pathogenesis of severe malaria, and malaria vaccine trials. The informatics focus of the ITMI covered research question formulation and data collection, capture, linkage, management, and analysis. The program included five trainees with the overall goal of completing master's degrees in public health and preparing manuscripts and submitting them for publication in peer-reviewed journals.

### Informatics Training in Global Health (ITGH)

Building on the ITMI program, the ITGH program was carried out between 2004 and 2011 and provided training toward completion of M.Sc. degrees in public health. Several trainees also participated in an online master's diploma program known as epidemiology and public health (Epidemiologie et Sante Publique en ligne; ESPEL). With training delivered entirely in French, ESPEL was a consortium serving Francophone countries in the Mediterranean and North Africa through the University of Bordeaux with courses in statistics and epidemiology. Course instruction in the ESPEL program was provided by USTTB medical faculty and online tutoring tools.


FIC, Fogarty International Center; H3Africa, Human Heredity and Health in Africa; ICEMR, West Africa International Centers of Excellence for Malaria Research; NIAID, National Institute of Allergy and Infectious Diseases; NIH, National Institutes of Health; TDR, Special Program for Research and Training in Tropical Diseases; USTTB, University of Sciences, Techniques and Technologies of Bamako, Mali; WHO, World Health Organization.

### West Africa International Centers of Excellence for Malaria Research (ICEMR)

The International Centers of Excellence for Malaria Research is a network of research centers with a common mission to eradicate and control malaria in Asia, Africa, Latin America, and the Southwest Pacific (National Institute of Allergy and Infectious Diseases, 2018a). Between 2010 and 2017, the West African ICEMR network carried out longitudinal malaria studies at four sites in Senegal, The Gambia, and Mali (Doumbia et al., 2012). These sites differed in the seasonal prevalence of Plasmodium falciparum (P. falciparum) infection and in the incidence of uncomplicated malaria (Shaffer et al., 2018). The primary goal of the study was to collect epidemiologic, clinical, and molecular data to better understand the transmission and human impact of malaria. Significant byproducts of this work were trained research personnel and established DCMS (Shaffer et al., 2018). These efforts continued in 2017, focusing on the study of malaria control interventions and antimalarial drug resistance (National Institute of Allergy and Infectious Diseases, 2018b).

### African Centers of Excellence in Bioinformatics Program (ACE)

The ACE program is a public-private partnership with the NIAID and the FNIH to strengthen bioinformatics research capacity in low and middle income (LMIC) African countries (Foundations for the National Institutes of Health, 2018). The ACE program was launched at USTTB in 2016 and included a teaching computer laboratory and e-classroom with Adobe Connect (San Jose, CA, United States) capacity for real time learning and instruction.

#### African Center for Training in Functional Genomics of Insect Vectors of Human Disease (AFRO VECTGEN)

The AFRO VECTGEN program was initiated in 2003 through a partnership with the WHO Special Programme for Research and Training in Tropical Diseases (TDR) and the MRTC Department of Medical Entomology and Vector Ecology (Doumbia et al., 2007). The program provided training for African scientists on genome sequence data management and analysis and functional genomics for research on vector-borne diseases.

### USTTB Master of Science (M.Sc.) Bioinformatics Program

USTTB maintains an M.Sc. in Bioinformatics program established in 2015 in collaboration with the H3ABioNet Consortium. The program is one of only 13 such programs in 7 African countries (Tastan Bishop et al., 2015; Mulder et al., 2016). The program includes 20 courses arranged over 4 quarterly semesters with cohorts of 15 students, including three semesters of coursework and short-term internships and a single semester of thesis research in bioinformatics. The 1st year of study includes two semesters of core coursework equivalent to 60 academic credits (European Credit system), and the 2nd year consists of 60 credits of coursework, a 4-month practicum, and a master's thesis research project. Training is provided collaboratively by USTTB faculty; the H3ABioNet Consortium (through instructors in Tunisia, South Africa, and Ghana); and collaborating institutions in France and the United States (through video conferencing and webinars). The program's curriculum is shown in **Table 2**.

A key component of the curriculum for engaging underserved populations in research was a formal course on English speaking and writing in scientific research (**Table 2**, course code BIN 103).

### West African Center of Excellence for Global Health Bioinformatics Research Training

Launched in October 2017, the West African Center of Excellence for Global Health Bioinformatics Research Training is a collaborative bioinformatics and data science training program between USTTB and Tulane University. The program provides bioinformatics and data science training to faculty and students at USTTB and is sponsored by the NIH Fogarty International Center as part of the H3Africa initiative. The program seeks to establish a sustainable bioinformatics and data science research training program at USTTB, focusing on advancing the USTTB bioinformatics curriculum, increasing faculty and student authorship in bioinformatics and data science journals, grant proposal development, and improving success in gaining extramural research funding.

TABLE 2 | USTTB Master of Science (M.Sc.) in Bioinformatics program curriculum.

<sup>1</sup>Semesters based on quarterly time periods.

#### RESULTS

The primary outcomes of the ITMI and ITGH training programs included numbers of college degrees earned and publication frequencies and rates. These programs laid the foundation for subsequent research and training efforts. Among the trainees in the ITMI and ITGH programs were investigators in the West Africa ICEMR and the West African Center of Excellence for Global Health Bioinformatics Research Training program.

### International Training in Medical Informatics (ITMI)

This program provided long-term training to five trainees between 1999 and 2003. Each of these trainees successfully completed an M.Sc. in Public Health (MSPH) degree. The impact of the ITMI on publication productivity is shown in **Table 3**.

Publication rates per year in first authorship and third and higher authorship increased by 690% (0.10 per year to 0.79 per year) and 253% (0.40 per year to 1.41 per year), respectively, following the ITMI training program. Each of these increases was statistically significant (p < 0.001). Publications following the ITMI program were focused in the areas of malaria interventions, vaccine development, and epidemiology (Sagara et al., 2014; Portugal et al., 2017).
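The percentage increases quoted above follow directly from the reported annual rates. As a minimal sketch, using only the rounded rates given in the text (the function name `percent_increase` is illustrative and not part of the original analysis), the arithmetic can be checked as follows:

```python
def percent_increase(rate_before, rate_after):
    """Relative change from rate_before to rate_after, as a percentage."""
    return (rate_after - rate_before) / rate_before * 100.0

# Publications per trainee per year, before and after ITMI training,
# as reported in the text (values are rounded to two decimals).
first_author = percent_increase(0.10, 0.79)
third_or_higher_author = percent_increase(0.40, 1.41)

print(f"First authorship: {first_author:.1f}% increase")
print(f"Third and higher authorship: {third_or_higher_author:.1f}% increase")
```

Because the input rates are rounded, the second value comes out near 252.5%, which is consistent with the reported 253%. The sketch does not reproduce the significance test, which would require the underlying person-time data.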

#### Informatics Training in Global Health (ITGH)

The ITGH program provided short- and long-term informatics training in Mali between 2004 and 2011. The program included 53 short-term trainees and 7 long-term trainees from the MRTC, local governmental agencies, field sites, and neighboring Francophone West African countries. Short-term workshop training was delivered in both French and English, and ten short-term trainees completed the ESPEL online diploma training in biostatistics and epidemiology. Three of the long-term trainees earned master of public health degrees in biostatistics programs, and four of the long-term trainees completed the online ESPEL training program.

### West African Center of Excellence for Global Health Bioinformatics Research Training Program (WABT)

The WABT was sponsored by the National Institutes of Health through its H3Africa initiative. The USTTB served as the WABT's lead institution, partnering with Tulane University (New Orleans, LA, United States) and the University of Strasbourg (Alsace, France). The WABT integrated three intertwined training components, namely faculty training and development, curriculum enhancement, and student training enhancement for students enrolled in the USTTB master's degree program in bioinformatics. The feedback loop illustrating the approach for launching new trainees into academic and research positions is shown in **Figure 1**.

The program provided a direct pipeline of trainees into the USTTB bioinformatics program and local research projects. An advisory board provided independent oversight and insight for the program. The three tiers of enhancement covered in the training program follow.

#### Faculty Enhancement

Faculty trainees were recruited from USTTB faculty with responsibilities in developing and overseeing the USTTB bioinformatics and data science curriculum. This component was carried out through participation in scientific workshops; delivery of oral and poster presentations at scientific conferences; mentorship on proposal development and manuscript preparation; and development and implementation of an annual bioinformatics symposium at the USTTB site. Research proposal topics focused on mobile health in malaria surveillance and efficacy evaluation for seasonal malaria chemotherapy in malaria prevention through the application of bioinformatics and data science approaches and technologies.

#### Curriculum Enhancement

Training activities included mentored program and curriculum development for the USTTB M.Sc. in bioinformatics program. The program's course competencies were compared and evaluated according to Mulder et al. (2018a,b). Curriculum modifications included an expansion of the program's thesis component by providing options for completing thesis work at outside institutions. Additionally, a certificate training program in bioinformatics was developed based on current course offerings to expand participation and generate additional revenue for supporting the program. The ultimate goal of the curriculum enhancement activities was to lay the groundwork for a doctoral program in bioinformatics at USTTB.

TABLE 3 | Publication productivity for n = 5 trainees enrolled in the International Training in Medical Informatics program (ITMI).

Results in second and third columns expressed as frequency of publications and average publications per trainee per year. <sup>1</sup>Comparison of person-time publication rates between pre- and post-training time periods.

#### Student Enhancement

The project provided partial scholarships for current and incoming students in the USTTB M.Sc. in bioinformatics program as well as partial scholarships for related doctoral programs. Trainees were recruited among students enrolled in the USTTB M.Sc. in bioinformatics program or doctoral students working in research programs with a bioinformatics focus. Training activities included "study abroad" training at outside institutions through the following mentored activities: formal coursework; literature review preparation; data capture, management, and analysis; and manuscript preparation. Manuscript data were provided through the West Africa ICEMR research projects. Trainees were responsible for attending and presenting research findings at professional research conferences, including the American Society of Tropical Medicine and Hygiene (ASTMH) and H3Africa consortium meetings. Funds of \$10,000 USD per year were allocated for pilot research projects, which were intended to foster mentorship, incorporation of research into the classroom, and research evaluation. An online portal was developed for proposal submissions, and proposals were reviewed and scored by the training program's key investigators and advisory board (**Figure 2**).

#### Workshop Training

Workshop training was provided annually at USTTB aimed toward two aspects: (1) enabling junior trainees to effectively manage and interpret genetic data, including major bioinformatics database sources and integration with biological data; and (2) performing computational tasks and carrying out analytical approaches to process, analyze, and interpret biological data. The program's workshop themes are listed in **Table 4**.

The official language in Mali is French, but Bambara is the most widely spoken (Wikipedia, 2019b). The workshop training was delivered in English, and periodic translation summaries in French were delivered by USTTB faculty. Each day of the workshops concluded with student oral summaries of concepts and activities. Certificates of completion were awarded following successful completion of the workshops and were presented by the USTTB president and the project's principal investigators.

#### Additional Training Activities

Financial program management training was provided for USTTB financial administrators through in-person discussion sessions with trained sponsored projects personnel. Training was also provided on biographical sketch development, COI and disclosure, and research ethics.

#### Challenges

The challenges encountered during the bioinformatics and data science training program included language barriers, complexity in obtaining United States Exchange Visitor (J-1) visas, limited high-speed internet availability, and the lack of discounted software portals. While the training was primarily delivered in English, many of the trainees were not fully fluent in English. The training also required high-speed bandwidth for utilizing software extensions and accessing biomedical databases. While high-speed bandwidth was available for training at USTTB, it depended on ongoing short-term funding. The lack of a discounted software portal for commercial software presented challenges for acquiring and upgrading several common software applications such as Microsoft Access (Redmond, WA, United States). The program therefore focused on freeware applications including R, REDCap, ArcGIS Online (Esri, Redlands, CA, United States), and QGIS (formerly Quantum GIS; Open Source Geospatial Foundation, Chicago, IL, United States). While the training program included cohorts of trainees with similar academic focuses, the participants' bioinformatics expertise ranged from beginner to advanced. The program strategy therefore covered basic biological concepts before moving to more advanced topics in bioinformatics and data science. The lack of industry opportunities for trainees largely limited post-training employment prospects to academia, research, and governmental health agencies.

A PubMed search with the key words "Mali bioinformatics" yielded N = 63 publications between 2006 (the year of the first observed bioinformatics publication) and November 2018. None of these hits focused exclusively on bioinformatics or data science training (**Figure 3**).

An analogous search for the Netherlands yielded 8,027 hits. The Netherlands has a population of approximately 17,084,719 (2017 estimate), slightly smaller than Mali's total population of 17,885,245 (2017 estimate; United States Central Intelligence Agency, 2018). These results illustrate that bioinformatics publication output in Mali remains far lower than in developed countries with similar populations.
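To put the two search results on a common scale, the hit counts can be normalized by population. The following is a minimal sketch using only the figures quoted in the text. The per-million normalization and all variable names are illustrative assumptions rather than part of the original analysis.

```python
# PubMed hit counts and 2017 population estimates as quoted in the text.
searches = {
    "Mali": {"hits": 63, "population": 17_885_245},
    "Netherlands": {"hits": 8_027, "population": 17_084_719},
}

def hits_per_million(hits, population):
    """Search hits per one million residents."""
    return hits / population * 1_000_000

rates = {
    country: hits_per_million(values["hits"], values["population"])
    for country, values in searches.items()
}

# Ratio of per-capita output between the two countries.
ratio = rates["Netherlands"] / rates["Mali"]

for country, rate in rates.items():
    print(f"{country}: {rate:.1f} hits per million residents")
print(f"Netherlands-to-Mali ratio: {ratio:.0f}")
```

Under these assumptions, the per-capita gap is more than a hundredfold. Keyword hit counts are only a rough proxy for research output, so the ratio should be read as an order-of-magnitude indication.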

#### Sustainability

Sustainability was a core part of the program's study design that was envisioned through its development in a university setting capable of: maintaining key resources during lapses in short-term funding, generating tuition and fees, developing workforces and human capital, and providing teaching computer and wet laboratory capacity. The program's sustainability approaches and measures are shown in **Table 5**.

TABLE 5 | Sustainable measures for the West African Center of Excellence for Global Health Bioinformatics Research Training program.


#### DISCUSSION

Bioinformatics and data science expertise arguably has the most potential and impact in underserved parts of Africa due to the continent's high levels of disease and genetic diversity. The computational capacity and dynamic nature of bioinformatics research and training necessitate incremental processes in capacity building and training on related data intensive topics such as data capture and management. This capacity may be used to establish research networks and improve site suitability for hosting additional research. The training programs here yielded a sustainable platform for launching trainees into academia and complementary research projects. The training programs also benefited from a longstanding partnership between USTTB and Tulane University in both training and research capacities. This partnership fostered local participation in content and program development to target the specific needs and health outcomes found in Mali. The authors acknowledge that the progress described throughout this work did not occur in a vacuum and directly benefitted from a host of efforts by other researchers over the past several decades.

Indeed the teaching laboratory facilities available through the African Center of Excellence in Bioinformatics and the bioinformatics curriculum development by Mulder et al. (2016) were vital to the launch of our bioinformatics and data science training program.

We found English training to be a key component for engaging underserved populations, and thus our earlier informatics training programs included formal English training. We believe that providing English training at outside institutions is becoming more difficult as United States visa programs such as the Exchange Visitor (J-1) visa program mandate English proficiency (U.S. Department of State, 2019). Inclusion of formal coursework in English in the USTTB bioinformatics curriculum also provided key advancements in this regard. It is of central importance for African universities with research missions to incorporate research-driven courses on overlapping computationally intensive topics into their curricula, including biostatistics, GISs, bioinformatics, and data science, to foster a workforce capable of competing for large-scale research projects.

#### Trainee Outcomes

Trainee outcomes were considered primarily in terms of publications, grant proposal submissions and awards, research conference presentations, and employment outcomes. In the absence of a strong industry or pharmaceutical presence, it is likely that trainees will ultimately gain employment in research, academic, or government settings. Emphasizing proposal preparation within the training programs therefore has great utility in this regard. The training efforts in this work benefitted from several complementary malaria research projects that provided training and data sources for manuscript development and publication. Our programs also benefitted from the biannual H3Africa research conferences, which provided an international venue for our trainees to present their work in oral or written form. Professional responsibilities in Africa are perhaps less dependent on scientific publishing for measuring scientific productivity than in other parts of the world, and thus additional incentives for publishing may be useful to conform to the extramural funding process, where publication is highly prioritized. One such incentive occurs at the University of Cape Town, where governmental supplements are provided for completed publications (Whitworth et al., 2010). While overall publication rates are improving across African institutions, these publications are not always available through search engines such as PubMed. Schoonbaert (2009) notes that a more complete resource for African literature is CABI's Global Health Database. Similar issues may arise for country-specific grants or foundation grants as they often go uncaptured in research repositories such as NIH RePORTER (U.S. Department of Health and Human Services, 2018). To this end, country research databases may be useful in improving research visibility.

#### Correlate Training

Ideally, trainees should develop diverse research portfolios including topics focused on practical needs with utility in all facets of research, such as DCMS development and oversight. These practical skills provide opportunities over the entire course of research endeavors, as opposed to skill sets limited to advanced analytical techniques that are suited to the later stages of research. The challenges associated with bioinformatics and data science training parallel those for related data intensive research processes such as DCMS. Core bioinformatics and data science research infrastructure shares common elements with data intensive epidemiological and clinical research, such as the setup of data systems, data management, and data warehousing. Because of the overlap in bioinformatics and DCMS responsibilities, we provided training on the development and use of REDCap databases and tablet-based data collection. Other practical training in our programs included biographical sketch, curriculum vitae, and resume development and program management.

#### Regional Thinking and Sustainability of Bioinformatics and Data Science Training Programs

While the training programs covered in this work focused on Mali, we believe that they made a positive impact more broadly across the region of sub-Saharan Africa. These programs regularly hosted trainees from other West African countries, including Nigeria and Ghana. The West Africa ICEMR also fostered collaborations among multiple sub-Saharan countries, including Mali, Senegal, and The Gambia.

Regional training approaches may increase research participation for countries lacking the capacity to host training efforts independently. Regional approaches are recognized to promote sharing of study protocols and standardization of case definitions and reporting practices (Shaffer et al., 2018). Integrating data sources across study sites or countries provides opportunities for more advanced, multifactor approaches for evaluating treatments and vaccines. Such efforts are facilitated when host countries consider the health problems of neighboring countries as their own. Defining appropriate regional groupings may also take into account areas where a disease is absent, as these may provide viable control populations for research. Virtual regional infrastructures have also been shown to improve engagement with countries with sparse research resources (Jennings et al., 2004).

Long-term sustainability of training and capacity development in Africa will likely require additional support within the host countries in the area of research and development. Among 13 African countries (Cameroon, Gabon, Ghana, Kenya, Malawi, Mali, Mozambique, Nigeria, Senegal, South Africa, Tanzania, Uganda, and Zambia), only 3 countries (Uganda, Malawi, and South Africa) achieved a modest goal of spending at least 1% of GDP on research and development. GDP expenditures on research and development among the remainder of surveyed countries ranged between 0.20 and 0.48% (NEPAD Planning and Coordinating Agency, 2010). By contrast, these expenditures for the United States and Japan were 2.74 and 3.15%, respectively (The World Bank, 2019).

### Clinical Trials Infrastructure and Drug Development


Increased pharmaceutical involvement within Africa is greatly needed for developing bioinformatics and data science expertise and clinical research participation within the continent. Koita et al. (2016) note that key priorities in West Africa are the development of clinical research facilities and the training of host country investigators, so that the facilities and expertise necessary to evaluate candidate interventions are available in endemic regions when and where they are needed. The authors also note that many treatments deployed in Africa may never have been evaluated in participants from their target countries. Bioinformatics and data science training programs provide an opportunity for showcasing workforce capacity to attract pharmaceutical and commercial investment. In turn, such investments will likely provide competitive advantages for short-term research. Partnerships with pharmaceutical companies may also serve as another means for sustaining core infrastructure during lapses in short-term funding.

### Importance of Literature on Bioinformatics and Data Science Training Efforts

To our knowledge, this is the first manuscript on bioinformatics and data science training in Mali. Additional literature on bioinformatics and data science training in Africa is needed for establishing training priorities, monitoring progress, and developing goal-based strategies for its improvement. Such literature also allows investigators developing new training programs to build on prior efforts and adapt training approaches. It is therefore essential for journal publishers to recognize the importance of publishing work on training programs as they often serve as the backbone for their associated research.

#### CONCLUSION

Bioinformatics and data science training programs in developing countries require incremental and collaborative strategies for their feasible and sustainable development. The progress described here covered decades of collaborative efforts centered on training and research on computationally intensive topics. These efforts laid the groundwork and provided platforms conducive to hosting a bioinformatics and data science training program in Mali. Training programs are perhaps best facilitated through Africa's university systems, as these are well positioned to maintain core resources during lapses in short-term funding. While bioinformatics and data science training programs are rapidly growing across Africa, much of the continent currently lacks substantial commercial investment and is reliant on short-term funding for training and research efforts. It is therefore critical to incentivize commercial and governmental investment within African countries to complement short-term funding efforts. It is also of central importance to publish literature on scientific training programs to monitor and evaluate progress, develop standards, and share training approaches and experiences.

### AUTHOR CONTRIBUTIONS

JGS conceived and drafted the manuscript which was reviewed and approved by all authors. JGS, FM, MW, CT, JL, OT, SKS, DK, and SD participated in study design. JGS, FM, JL, SKS, OT, MW, DK, JJL, MD, MS, DD, MT, YK, AAD, ST, BD, MC, OK, YL, and SD assisted in carrying out the training program. JSS and ML assisted in manuscript editing and consultation.

### FUNDING

This study was supported by NIH Global Health Training awards D43TW01086 and D43TW007000 and NIH Cooperative Agreements U2R TW010673 for West African Center of Excellence for Global Health Bioinformatics Research Training and U19 AI 089696 and U19 AI 129387 for the West African Center of Excellence for Malaria Research. Data sets for trainee publication were financially supported through NIH/NIAID Cooperative Agreements for the West African International Centers for Excellence in Malaria awards U19 AI 089696 and U19 AI 129387. AAD is supported through the DELTAS Africa Initiative, an independent funding scheme of the African Academy of Sciences (AAS)'s Alliance for Accelerating Excellence in Science in Africa (AESA), which is supported by the New Partnership for Africa's Development Planning and Coordinating Agency (NEPAD Agency) with funding from the Wellcome Trust (DELGEME grant #107740/Z/15/Z) and the United Kingdom (UK) Government. The views expressed in this publication are those of the author(s) and not necessarily those of AAS, NEPAD Agency, Wellcome Trust, or the UK Government.

#### ACKNOWLEDGMENTS

We thank Professor Adama Keita for his support and accommodations in carrying out the West African Center of Excellence for Global Health Bioinformatics Research Training program. We thank Kathleen T. Branley, Tami G. Jenniskens, Denise M. Lovrovich, Kathleen M. Kozar, and Fatoumata Bamba for their administrative assistance. We also thank the members of the Advisory Group for their guidance: Dr. Robert L. Murphy, Dr. Nancy Kass, Dr. Mariam Sylla, and Dr. Stephen B. Kennedy. We dedicate this work to the victims of malaria and their families. Finally, we thank the members of the NIAID African Centers in Excellence in Bioinformatics and Data-Intensive Science (ACE) program: Darrell Hurt, Christopher Whalen, and Michael Tartakovsky. The teaching infrastructure established at USTTB through the ACE program played a key role in carrying out the aforementioned bioinformatics and data science training activities.

#### REFERENCES



of African scientists. Genet. Med. 19, 826–833. doi: 10.1038/gim.2016.177


artemisinin derivatives for uncomplicated malaria: a pooled analysis of clinical trials data from Mali. Malar. J. 13:358. doi: 10.1186/1475-2875-13-358



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Shaffer, Mather, Wele, Li, Tangara, Kassogue, Srivastav, Thiero, Diakite, Sangare, Dabitao, Toure, Djimde, Traore, Diakite, Coulibaly, Liu, Lacey, Lefante, Koita, Schieffelin, Krogstad and Doumbia. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Noonan Syndrome in South Africa: Clinical and Molecular Profiles

Cedrik Tekendo-Ngongang<sup>1†</sup>, Gloudi Agenbag<sup>1</sup>, Christian Domilongo Bope<sup>1,2</sup>, Alina Izabela Esterhuizen<sup>1,3</sup> and Ambroise Wonkam<sup>1,4</sup>\*

<sup>1</sup> Division of Human Genetics, Departments of Medicine and Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa, <sup>2</sup> Departments of Mathematics and Computer Sciences, Faculty of Sciences, University of Kinshasa, Kinshasa, Democratic Republic of Congo, <sup>3</sup> National Health Laboratory Service, Groote Schuur Hospital, Cape Town, South Africa, <sup>4</sup> Faculty of Health Sciences, Institute of Infectious Diseases and Molecular Medicine, University of Cape Town, Cape Town, South Africa

Noonan Syndrome (NS) is a common autosomal dominant multisystem disorder, caused by mutations in more than 10 genes in the Ras/MAPK signaling pathway. Differential mutation frequencies are observed across populations. Clinical expressions of NS are highly variable and include short stature, distinctive craniofacial dysmorphism, cardiovascular abnormalities, and developmental delay. Little is known about phenotypic specificities and molecular characteristics of NS in Africa. The present study investigated patients with NS in Cape Town (South Africa). Clinical features were carefully documented in a total of 26 patients. Targeted Next-Generation Sequencing (NGS) was performed on 16 unrelated probands, using a multigene panel comprising 14 genes: PTPN11, SOS1, RIT1, A2ML1, BRAF, CBL, HRAS, KRAS, MAP2K1, MAP2K2, NRAS, RAF1, SHOC2, and SPRED1. The median age at diagnosis was 4.5 years (range: 1 month to 51 years). Individuals of mixed-race ancestry were most represented (53.8%), followed by black Africans (30.8%). Our cohort revealed a lower frequency of pulmonary valve stenosis (34.6%) and a less severe developmental milestones phenotype. Molecular analysis found variants predicted to be pathogenic in 5/16 cases (31.2%). Among these mutations, two were previously reported: MAP2K1 c.389A>G (p.Tyr130Cys) and PTPN11 c.1510A>G (p.Met504Val); three are novel: CBL c.2520T>G (p.Cys840Trp), PTPN11 c.1496C>T (p.Ser499Phe), and MAP2K1 c.200A>C (p.Asp67Ala). Molecular dynamic simulations indicated that the novel variants identified impact the stability and flexibility of their corresponding proteins. Genotype-phenotype correlations showed that clinical features of NS were more typical in patients with variants in MAP2K1. This first application of targeted NGS for the molecular diagnosis of NS in South Africans suggests that, while there is no major phenotypic difference compared to other populations, the distribution of genetic variants in NS in South Africans may be different.

Keywords: Noonan syndrome, multigene panel testing, targeted next-generation sequencing, RASopathies, Ras/MAPK signaling pathway, South Africa

#### Edited by:

Zané Lombard, University of the Witwatersrand, South Africa

#### Reviewed by:

Koenraad Devriendt, KU Leuven, Belgium

Maria Paola Lombardi, University of Amsterdam, Netherlands

#### \*Correspondence:

Ambroise Wonkam ambroise.wonkam@uct.ac.za

#### †Present Address:

Cedrik Tekendo-Ngongang, Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 10 January 2019 Accepted: 28 March 2019 Published: 16 April 2019

#### Citation:

Tekendo-Ngongang C, Agenbag G, Bope CD, Esterhuizen AI and Wonkam A (2019) Noonan Syndrome in South Africa: Clinical and Molecular Profiles. Front. Genet. 10:333. doi: 10.3389/fgene.2019.00333

### INTRODUCTION

Noonan syndrome (NS; MIM 163950) is a common autosomal dominant condition, with an estimated global incidence of 1:1,000 to 1:2,500 live births (Mendez and Opitz, 1985). Affected individuals present with multisystem involvement, including short stature, distinctive craniofacial dysmorphism, congenital heart defects (CHD), skeletal abnormalities, developmental delay, coagulation defects, and other abnormalities (Allanson and Roberts, 2016). Noonan syndrome is clinically heterogeneous, with significant interfamilial and intrafamilial variable expression. The condition is caused by heterozygous germline mutations in more than 10 genes encoding either proteins of the Ras family of GTPases (KRAS, NRAS, RIT1, and RRAS), modulators of Ras function (PTPN11, SOS1, SOS2, CBL, RASA2, and SHOC2), or downstream signal transducers (RAF1, BRAF, and MAP2K1) (Cordeddu et al., 2015; Aoki et al., 2016). To date, mutations in all identified genes for NS result in gain-of-function within the Ras/MAPK pathway, and account for up to 80% of NS cases (Aoki et al., 2016). In many clinical settings, missense variants in PTPN11 alone are found in about 50% of affected individuals (Tartaglia and Gelb, 2005), while SOS1 has been reported to be the second most mutated gene, accounting for 10–20% of PTPN11-negative patients (Roberts et al., 2007). Compared to PTPN11 and SOS1, the contribution of other known genes seems to be minimal, with variable mutation frequencies observed across populations (Allanson and Roberts, 2016). Recent studies have reported an association between biallelic variants in LZTR1 and the NS phenotype (Johnston et al., 2018; Nakaguma et al., 2019), supporting the existence of an autosomal recessive form of the condition, as suggested by some clinical studies (Abdel-Salam and Temtamy, 1969; Maximilian et al., 1992; Van Der Burgt and Brunner, 2000). However, heterozygous variants in LZTR1 have also been previously associated with the NS phenotype in at least seven families (Chen et al., 2014; Yamamoto et al., 2015), and the consequences of LZTR1 pathogenic variants on the Ras/MAPK signaling pathway remain to be clarified. The clinical diagnosis of NS is straightforward in many cases, through recognition of key craniofacial and musculoskeletal features, in combination with CHD (Romano et al., 2010).
Phenotypic overlap with other conditions sharing the same pathogenetic mechanisms, the so-called RASopathies, challenges the diagnosis of NS in some patients and families; this is particularly true for Cardiofaciocutaneous syndrome (CFC) and Costello syndrome. As evidenced by several genetic studies, phenotypic expressivity in many conditions may vary across populations and ethnic groups, making it more challenging to apply well-established clinical diagnostic criteria universally (Tekendo-Ngongang et al., 2014; Kruszka et al., 2017). Affected individuals with NS from sub-Saharan Africa have seldom been reported in the literature (Kruszka et al., 2017). Furthermore, no study investigating the genetic etiology of NS in South Africans has previously been conducted, even though numerous individuals and families affected with NS have been identified in South Africa; this is largely due to the unavailability of molecular diagnostic testing for RASopathies in the state public sector. The present study aimed to characterize a cohort of South African patients with NS from clinical and molecular perspectives, using a targeted NGS approach.

#### MATERIALS AND METHODS

#### Ethical Approval

The study was performed in accordance with the Declaration of Helsinki and with the approval of the Faculty of Health Sciences Human Research Ethics Committee, University of Cape Town (HREC: 449/2016). Written informed consent was obtained from the parents and/or the patient prior to their involvement in the study, including permission to publish photographs.

#### Patients and Phenotyping

This study was conducted in Cape Town, South Africa, and included 26 participants (20 children and six adults); among them, 20 were unrelated. Patients were recruited through the University of Cape Town (UCT) affiliated hospitals, namely Red Cross War Memorial Children's Hospital (RCWMCH) and Groote Schuur Hospital (GSH). Patients were selected retrospectively and prospectively using the Van der Burgt scoring system for clinical diagnosis of NS (Van der Burgt, 2007). All pediatric and adult patients were assessed by trained clinical geneticists familiar with RASopathies. For each proband, family history suggestive of NS was systematically assessed by means of a pedigree spanning three or more generations. Phenotypic data recorded included all clinical characteristics, with emphasis on antenatal features, NS-specific dysmorphology assessment, investigation of bleeding diathesis, and cardiovascular abnormalities on electrocardiograms (ECG) and echocardiograms.

#### Molecular Methods

Genomic DNA of each selected patient was isolated from peripheral blood leukocytes at the National Health Laboratory Service (NHLS) Molecular Genetics Laboratory, GSH, following the manufacturer's standard Maxwell 16 protocol (Maxwell® 16 Blood DNA Purification Kit, Promega, Madison, WI 53711, USA).

#### Targeted Gene Panel Sequencing and Variants Analysis

Of the 26 DNA samples isolated, 16 DNA samples from unrelated patients were genotyped. Targeted sequencing was performed with the Ion Torrent platform, at the sequencing laboratory of the Division of Human Genetics of UCT, using the Ion PGM™ system (Thermo Fisher Scientific, Waltham, Massachusetts, USA). Pre-designed primers for the Ion AmpliSeq Noonan Research Panel were used (Life Technologies, Carlsbad, CA). The primers amplify exons and intron/exon boundaries of 14 genes known to be associated with NS and related conditions: A2ML1, BRAF, CBL, HRAS, KRAS, MAP2K1, MAP2K2, NRAS, PTPN11, RAF1, RIT1, SHOC2, SOS1, and SPRED1. This multigene panel is predicted to cover 100% of the targeted regions, in 268 amplicons (Nelen et al., 2014). Sequencing data analysis, including quality assessment, read alignment, variant identification, variant annotation, visualization, and prioritization, was primarily performed using the bioinformatics pipeline of the Ion Torrent Suite and the Ion Reporter cloud-based software (Thermo Fisher Scientific, Waltham, Massachusetts, USA). Of the usable reads, 99% could be mapped to the human reference genome used (Homo sapiens, hg19, build 37.2). Further manual analysis was carried out for variant prioritization and interpretation based on the variant call format (VCF) file generated by the Ion Reporter software. In this step, variants were prioritized using their minor allele frequency (MAF < 0.01) based on the 1,000 genomes and 5,000 exomes projects, their zygosity, their function, their location within the gene, and their pathogenicity according to ClinVar. A parallel analysis of sequencing data was performed based on the binary alignment map (BAM) file generated by the Ion Reporter software.
The Picard tools SortSam, MarkDuplicates and FixMateInformation were run on a per-sample basis to sort reads by coordinate, mark polymerase chain reaction (PCR) duplicate reads and verify mate-pair information, respectively (Mckenna et al., 2010). Variant calling was done using Samtools, bcftools (Li et al., 2009) and VarScan 2 (Koboldt et al., 2012) with the human reference genome (hg19; build 37.2). The conservation and deleteriousness of each variant were investigated using ANNOVAR, which interrogated the following tools: SIFT, PolyPhen2 HVAR, PolyPhen2 HDIV, MutationTaster, MutationAssessor, Likelihood ratio test (LRT), FATHMM, MetaSVM, MetaLR, GERP++, PhyloP, VEST3, DANN, CADD, PROVEAN, FATHMM-MKL, Integrated\_fitCons, SiPhy\_29way, PhastCons (Wang et al., 2010). A second level of variant filtration, intended to limit false positives and false negatives, was conducted on the annotated VCF files using an in-house Python script that retains only variants predicted to be deleterious or disease-causing by the 19 tools interrogated by ANNOVAR. The script uses two approaches to select deleterious variants: (i) hypothesis-free: a vote is taken across the annotation prediction tools, and a variant is retained when the share of tools calling it "Deleterious or damaging disease-causing (D)" reaches a defined cut-off (∼50%); (ii) non-hypothesis-free: the vote is restricted to a list of known genes for the study and uses a lower cut-off (∼25%) (Lebeko et al., 2017). In practice, the cut-offs were defined as follows: (i) hypothesis-free: select only variants for which at least 10 tools interrogated by ANNOVAR predicted a "D"; (ii) non-hypothesis-free: select only variants in the list of 14 NS genes analyzed for which at least 5 tools interrogated by ANNOVAR predicted a "D". Existing online databases for previously reported NS variants and published literature on NS-associated variants were consulted for each candidate variant.
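The two-mode voting filter described above can be sketched in Python as follows. This is a minimal illustration and not the authors' in-house script: the function names, the dictionary input format, and the assumption that each tool's output has already been reduced to a single label ("D" for deleterious) are all hypothetical, and the cut-offs are read here as "at least 10" and "at least 5" of the 19 tools.

```python
from typing import Dict, Optional, Set

# The 14 genes on the panel, as listed in the text.
NS_PANEL_GENES: Set[str] = {
    "A2ML1", "BRAF", "CBL", "HRAS", "KRAS", "MAP2K1", "MAP2K2",
    "NRAS", "PTPN11", "RAF1", "RIT1", "SHOC2", "SOS1", "SPRED1",
}

# Cut-offs taken from the text: 10 of 19 tools (about 50%) without a gene
# list, 5 of 19 tools (about 25%) when restricted to known genes.
HYPOTHESIS_FREE_MIN_VOTES = 10
GENE_LIST_MIN_VOTES = 5


def count_deleterious_votes(predictions: Dict[str, str]) -> int:
    """Count the tools whose (already simplified) label is 'D'."""
    return sum(1 for label in predictions.values() if label == "D")


def keep_variant(
    gene: str,
    predictions: Dict[str, str],
    known_genes: Optional[Set[str]] = None,
) -> bool:
    """Decide whether a variant passes the vote.

    With known_genes=None the hypothesis-free cut-off applies to any gene.
    With a gene list, only variants in listed genes can pass, at the lower
    cut-off. (Hypothetical interface; the real script is not published here.)
    """
    votes = count_deleterious_votes(predictions)
    if known_genes is None:
        return votes >= HYPOTHESIS_FREE_MIN_VOTES
    return gene in known_genes and votes >= GENE_LIST_MIN_VOTES


if __name__ == "__main__":
    # Made-up labels for illustration only: 6 of 19 tools call "D".
    example = {f"tool_{i}": ("D" if i < 6 else "T") for i in range(19)}
    print(keep_variant("CBL", example))                  # hypothesis-free mode
    print(keep_variant("CBL", example, NS_PANEL_GENES))  # gene-list mode
```

Keeping the gene list as an optional argument makes the difference between the two modes explicit: the same vote count can fail the stricter genome-wide threshold while passing the more permissive one applied to genes already implicated in NS.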

### Sanger Sequencing Validation and Family Segregation Studies

All variants identified and predicted to be pathogenic were subsequently confirmed with Sanger capillary direct cycle sequencing and capillary electrophoresis using standard protocols. Depending on availability of family members, segregation studies were performed: the proband's parents were screened to ascertain the origin of the variant. In addition, and where possible, other available family members, clinically affected or not, were screened for the identified variant using Sanger direct cycle sequencing.

### 3D Protein Structure Prediction for Functional Characterization of Novel Variants

Molecular dynamic (MD) simulations were conducted to assess the effect of the novel variants on protein function. The tertiary structure of the PTPN11 protein SHP-2 (PDB id: 2SHP) (Hof et al., 1998) and the MAP2K1 structure (UniProtKB: A4QPA9) were generated using the I-TASSER homology webserver (Zhang, 2008). The CBL structure was generated by combining an existing structure (PDB id: 1FBV) with a homology model. Six independent MD simulations were conducted: MAP2K1 mutant and wild-type (WT), SHP-2 mutant and WT, and CBL mutant and WT. All MD simulations were conducted with the GROMACS package, version 4.6.5 (Pronk et al., 2013), using the OPLS force field (Jorgensen and Tirado-Rives, 1988). The systems were simulated in a cubic box and solvated in TIP3P water (Neira et al., 1996). The temperature was maintained at 300 K using the Parrinello-Donadio-Bussi V-rescale thermostat (Bussi et al., 2007) and the pressure at 1 bar using the Berendsen barostat (Berendsen et al., 1984). The short-range non-bonded interactions were modeled using Lennard-Jones potentials. The long-range electrostatic interactions were calculated using the particle mesh Ewald (PME) algorithm (Darden et al., 1993; Essmann et al., 1995). The LINCS algorithm was used to constrain all bond lengths (Hess et al., 1997) and the velocities were assigned according to the Maxwell-Boltzmann distribution at 300 K.
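For readers who want to see how the stated simulation conditions might map onto GROMACS run parameters, the sketch below collects them in a Python dictionary and renders them as lines of an `.mdp` file. It is a fragment under stated assumptions, not the authors' input file: only settings named in the text are included, and everything the text does not report (integrator, time step, run length, cut-off radii, coupling groups and coupling time constants) is deliberately left out, so the output is not a complete, runnable parameter file.

```python
from typing import Dict

# Settings taken from the description in the text. The keys are standard
# GROMACS .mdp option names; the pairing of each key with the text is this
# sketch's reading, not a copy of the authors' file.
MDP_SETTINGS: Dict[str, str] = {
    "tcoupl": "V-rescale",            # velocity-rescaling thermostat
    "ref_t": "300",                   # reference temperature in K
    "pcoupl": "Berendsen",            # Berendsen barostat
    "ref_p": "1.0",                   # reference pressure in bar
    "coulombtype": "PME",             # particle mesh Ewald electrostatics
    "constraints": "all-bonds",       # constrain all bond lengths
    "constraint_algorithm": "lincs",  # LINCS constraint solver
    "gen_vel": "yes",                 # draw initial velocities
    "gen_temp": "300",                # Maxwell-Boltzmann temperature in K
}


def render_mdp(settings: Dict[str, str]) -> str:
    """Render settings as aligned 'key = value' lines."""
    width = max(len(key) for key in settings)
    lines = [f"{key:<{width}} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_mdp(MDP_SETTINGS), end="")
```

The same settings would apply to each of the six systems (mutant and wild type for MAP2K1, SHP-2 and CBL), so that any difference in stability or flexibility can be attributed to the substitution and not to the simulation protocol.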

### RESULTS

### Socio-Demographics

A total of 26 patients were included in this study, mostly unrelated (n = 20; 77%). The majority were children (n = 20; 77%) with a median age at clinical diagnosis of 4.5 years (range: 1 month to 51 years). Our cohort had a preponderance of individuals of mixed-ancestry background (53.8%), followed by black Africans (30.8%). There was a slight predominance of males (sex ratio: 1.3; 15 males: 11 females).

### Phenotypic Description

Gross motor milestones were on par for most patients, with the ability to walk before the age of 18 months in 61.5% (n = 16/26) of cases. Six (6/26; 23.1%) patients were unable to walk after 24 months. Speech delay was reported in 50% (13/26) of cases. Craniofacial features were widely variable, but more characteristic in infants (2–12 months; 4/26), with widely spaced eyes, epicanthic folds and ptosis found in 75% of cases (**Table 1**). Comparison of six key physical characteristics (short stature, ptosis, widely spaced eyes, epicanthic folds, low-set ears, and webbed neck) between ethnic groups showed that these features were less frequent in Caucasians (2/26). Black Africans (8/26) presented with the most consistent dysmorphic features, with epicanthic folds (87.7%), ptosis (75%), and low-set ears (75%) found to be the most common features (**Table 2**). At least one cardiovascular abnormality was identified in 65.4% (17/26) of patients. The most common CHD was pulmonary valve stenosis (PS), found in 34.6% (9/26) of our cohort. Hypertrophic cardiomyopathy (HCM) was identified in 19.2% (5/26) and left axis deviation on ECG was found in 23.1% (6/26) of cases. The complete list of clinical features identified in our cohort of 26 patients can be found in **Table S1**.
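Because the proportions above rest on 26 patients, it can help to see how wide the uncertainty around them is. The sketch below recomputes each percentage from the counts given in the text and attaches a 95% Wilson score interval. The interval is an illustration added here, not an analysis reported by the authors, and the function and variable names are hypothetical.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Counts as reported in the text (numerator, denominator).
REPORTED_COUNTS: Dict[str, Tuple[int, int]] = {
    "walking before 18 months": (16, 26),
    "speech delay": (13, 26),
    "any cardiovascular abnormality": (17, 26),
    "pulmonary valve stenosis": (9, 26),
    "hypertrophic cardiomyopathy": (5, 26),
    "left axis deviation on ECG": (6, 26),
}

if __name__ == "__main__":
    for feature, (k, n) in REPORTED_COUNTS.items():
        low, high = wilson_interval(k, n)
        print(f"{feature}: {100 * k / n:.1f}% "
              f"(95% interval {100 * low:.1f} to {100 * high:.1f}%)")
```

For pulmonary valve stenosis (9/26), for example, the interval runs from roughly 19% to 54%, which shows why comparisons with frequencies from larger cohorts need to be made cautiously.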



TABLE 2 | Comparison of six key dysmorphic features between ethnic groups.


### Genetic Variants Profile

Following targeted NGS of 16 DNA samples from unrelated patients, in silico predictive algorithms supported the classification of seven heterozygous missense variants as pathogenic (7/16; **Table 3**; **Figure 1**). The novel variants were CBL c.2520T>G (p.Cys840Trp), PTPN11 c.1496C>T (p.Ser499Phe), and MAP2K1 c.200A>C (p.Asp67Ala). Analysis confirmed segregation of the three novel variants with the disease in the respective families. Segregation analysis revealed a possible case of germline mosaicism in a family where the unaffected father does not carry the variant (MAP2K1 c.200A>C) found in his two affected children. These children were half-brothers, born to two different mothers.

The two variants previously reported in ClinVar as pathogenic are MAP2K1 c.389A>G (p.Tyr130Cys) and PTPN11 c.1510A>G (p.Met504Val). Two identified CBL variants were predicted to be pathogenic by prediction tools, but were reported in ClinVar as benign, c.1858C>T (p.Leu620Phe), and as a variant of uncertain significance, c.2345C>T (p.Pro782Leu). Consistent with this, CBL c.2345C>T failed to segregate with the disease in a large dominant family (**Figure S1**).
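The variant names used in this section follow the usual coding (c.) and protein (p.) notation. As a small illustration of how such names break down into position, reference and substituted residue, the sketch below parses the three novel variants. It is a deliberately narrow helper written for this example: it only handles single-nucleotide substitutions and three-letter missense changes, and its function names are hypothetical and not part of any published tool.

```python
import re
from typing import List, NamedTuple, Tuple


class ProteinChange(NamedTuple):
    ref: str       # three-letter code of the reference residue
    position: int  # residue number in the protein
    alt: str       # three-letter code of the substituted residue


# Simple patterns: c.<pos><ref>><alt> and p.<Ref><pos><Alt> only.
_CODING = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
_PROTEIN = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_coding_change(text: str) -> Tuple[int, str, str]:
    """Return (position, reference base, alternate base)."""
    match = _CODING.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple coding substitution: {text!r}")
    return int(match.group(1)), match.group(2), match.group(3)


def parse_protein_change(text: str) -> ProteinChange:
    """Return the reference residue, position and substituted residue."""
    match = _PROTEIN.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple missense change: {text!r}")
    return ProteinChange(match.group(1), int(match.group(2)), match.group(3))


# The three novel variants, as named in the text.
NOVEL_VARIANTS: List[Tuple[str, str, str]] = [
    ("CBL", "c.2520T>G", "p.Cys840Trp"),
    ("PTPN11", "c.1496C>T", "p.Ser499Phe"),
    ("MAP2K1", "c.200A>C", "p.Asp67Ala"),
]

if __name__ == "__main__":
    for gene, coding, protein in NOVEL_VARIANTS:
        print(gene, parse_coding_change(coding), parse_protein_change(protein))
```

Rejecting anything that does not match the strict pattern is a design choice: a malformed name raises an error instead of being silently accepted, which is useful when variant lists are transcribed by hand.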

### Molecular Dynamic (MD) Simulations for Novel Variants

Molecular dynamic (MD) simulations (**Figure 2**) showed that: for CBL, substitution of the negatively charged and hydrophilic amino acid Cys840 with the non-polar and hydrophobic amino acid Trp840 could impact binding interactions, stability and the flexibility of the protein.

For MAP2K1, substitution of the negatively charged and hydrophilic amino acid Asp67 with the non-polar and hydrophobic amino acid Ala67 could impact binding interactions; for SHP-2, substitution of the polar and hydrophilic amino acid Ser499 with the non-polar and highly hydrophobic amino acid Phe499 in the active site of the protein could impact binding interactions and the stability of structure.

### Genotype-Phenotype Correlations

Correlations between the five variants that were predicted to be pathogenic and NS-related phenotypes are presented in **Table 4**. Comparisons showed that patients with variants in MAP2K1 were diagnosed earlier (mean age at diagnosis: 1 year) and presented with more typical clinical features of NS, followed by patients with variants in PTPN11 (mean age at diagnosis: 3.3 years; **Figure 3**). The patient with a variant in CBL was found to have subtler clinical features (**Table 5**).

### DISCUSSION

This study provides a unique insight into the clinical and molecular profiles of South African individuals affected by NS, and is a rare attempt to comprehensively describe this condition in Africa. Patients were diagnosed relatively late, which could be explained by at least three factors. Firstly, the diagnosis of NS was rarely considered or explored by clinicians in prenatal settings, partly due to the absence of prenatal molecular diagnostic testing for RASopathies in South Africa; a prenatal detection rate of 17.3% was reported elsewhere (Croonen et al., 2013). Secondly, in South Africa and Africa at large, the scarcity of trained medical geneticists often results in initial misdiagnosis. Finally, age-related variability in NS physical features makes its clinical diagnosis more difficult for some professionals: moderately-to-severely affected individuals would be expected to be diagnosed by childhood, and mildly affected individuals in adulthood following either cardiac decompensation or cascade screening after the birth of a severely affected child (Van der Burgt, 2007; Roberts et al., 2013). Indeed, in the present study, all adult patients were diagnosed with NS during their assessment in the cardiology department. Motor milestone development in this cohort of South African patients was comparable to that reported in the literature, with 61.5% of patients being able to walk by the age of 18 months (Sharland et al., 1992). Half (50%) of our patients were able to speak simple two-word sentences before the age of 24 months, far below the average age (31 to 32 months) for simple two-word sentences in NS reported by other authors (Pierpont, 2016). Nevertheless, these differences should be interpreted with caution due to the small size of our cohort and the fact that only screening measures were used for the evaluation of milestones in the present study. The dysmorphology in NS also varies with ethnicity: a collaborative effort investigating NS-associated physical features in 125 individuals from diverse populations found that the three most common physical features, present in >70% of individuals, were widely spaced eyes (≥80%), low-set ears (>80%) and short stature (>70%); ptosis was less common in black Africans (63%), and webbed neck less common in Asians (Kruszka et al., 2017). In the present study, Black Africans were found to have the most distinctive features, with epicanthic folds, previously reported to be very common in the general black South African population (Christianson et al., 1995), being the most common feature. However, caution should be observed when interpreting the high frequency of epicanthic folds in this study, as it may be more suggestive of a common variant in the general population than a distinctive feature of NS in South African patients. Congenital heart defects represent a major cause of morbidity and mortality in affected individuals with NS of all age groups (Prendiville et al., 2014). As also found in this study, CHD are reported in 50–80% of individuals affected with NS, the most common CHD being PS (Hickey et al., 2011; Prendiville et al., 2014).

TABLE 3 | Characteristics of pathogenic variants identified.

The molecular detection rate in this study was relatively low (31.2%), compared to the expected >70% when using whole exome sequencing (WES) or comprehensive multigene panel testing (Aoki et al., 2016). It is unlikely that our lower detection rate could be attributed to inappropriate phenotyping, in view of the stringent patient selection process applied. This suggests that genotyping these patients using other sequencing methods such as WES may allow identification of pathogenic variants in genes not investigated in this study, particularly in the large family (**Figure S1**) with strong clinical features of NS and no mutation identified. All the variants detected were missense, in accordance with available data on NS (Tartaglia et al., 2002; Nava et al., 2007; Martinelli et al., 2010). Interestingly, we identified a pathogenic variant (MAP2K1 c.389A>G; p.Tyr130Cys) known to be frequently associated with Cardiofaciocutaneous syndrome (Rodriguez-Viciana et al., 2006). This patient was initially labeled with the diagnosis of NS at the age of 12 months, but the progression of clinical features later favored revision of the diagnosis to Costello syndrome. This case illustrates the challenges in clinical diagnosis of some NS patients, due to overlapping features with other RASopathies. Variability in phenotypic expression, high genetic heterogeneity and low mutation frequency in several NS genes are among the difficulties in establishing consistent correlations between known causative genes or variants and specific phenotypes. Our findings are consistent with the literature, with a positive association between short stature, PS, coagulopathy, pectus deformities of the chest, and variants in PTPN11 (Tartaglia et al., 2002; Yoshida et al., 2004). Similar to previous reports, patients with variants in MAP2K1 in our study had typical craniofacial features and skin manifestations of NS (Nava et al., 2007; Nyström et al., 2008).
To date, very little is known about genotype-phenotype correlations in NS patients with variants in CBL. In addition to presenting with less typical craniofacial features, our patient with a CBL variant had a cardiovascular phenotype characterized by a combination of bicuspid aortic valve and coarctation of the aorta, which are infrequently associated with NS.

and hydrophobic may impact binding interaction, stability and the flexibility of the protein; (B) Zoom of the CBL mutation site comparing the configuration of the wild type and mutant protein and illustrating the flexibility of the structures. (C) MAP2K1 protein showing the mutated residue (Ala67) colored red. The substitution of the negatively charged and hydrophilic Asp67 with the non-polar and hydrophobic Ala67 potentially impacts binding interactions; (D) Zoom of the MAP2K1 mutation site comparing the configuration of the wild type and mutant structures. (E) Crystal structure of SHP-2 including three domains of the protein: PTP (cyan), C-SH2 (yellow) and N-SH2 (pink). The mutated residue (Phe499), colored red, is located in the active site (Lee et al., 2005). The substitution of the polar and hydrophilic Ser499 with the non-polar and highly hydrophobic Phe499 in the active site of the protein potentially impacts binding interactions and the stability of the new structure; (F) Zoom of the SHP-2 mutation site comparing the configuration of the wild type and mutant structures.

## CONCLUSION

This first application of targeted NGS for the molecular diagnosis of NS in South Africans suggests that clinical characteristics and genotype-phenotype correlations found in affected individuals are generally similar to those reported in other populations. Therefore, careful phenotyping based on existing diagnostic criteria can effectively enable the diagnosis of most NS-affected individuals in South Africa. The use of targeted NGS in the present study has allowed for detection of novel variants in genes infrequently associated with NS in other populations. Further studies of a larger African cohort with NS, ideally using WES, are needed.

TABLE 4 | Summary of clinical features in mutation-positive patients.

+: Feature present; –: Feature absent; N/A: not applicable.

FIGURE 3 | Craniofacial features of an 11-year-old boy with NS and PTPN11 c.1510A>G (p.Met504Val) variant. (A) Frontal views showing a triangular face with pointed chin; tall forehead; bilateral ptosis, predominantly on the right; sparse eyebrows; epicanthic folds and protruding ears. (B) Lateral view showing high anterior hairline with low-set and posteriorly rotated ears. (C) Electropherogram of the PTPN11 c.1510A>G (p.Met504Val) variant. (D) Posterior view with the arrow indicating webbing of the neck.

TABLE 5 | Comparisons of clinical features associated with the three genes identified.

### ETHICS STATEMENT

The study was performed in accordance with the Declaration of Helsinki and with the approval of the Faculty of Health Sciences Human Research Ethics Committee, University of Cape Town (HREC: 449/2016). Written informed consent was obtained from the parents and/or the patient prior to their involvement in the study, including permission to publish photographs.

## AUTHOR CONTRIBUTIONS

CT-N, AE, and AW contributed to conception and design of the study. CT-N and AW collected data; CT-N, GA, and CB performed molecular analysis and interpretation of data. CT-N wrote the first draft of the manuscript and CB wrote a section of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

## FUNDING

This study was supported by the South African Medical Research Council (SAMRC's) Self-initiated Research (SIR) and the Wellcome Trust/AAS Ref: H3A/18/001, to AW; NIH, USA, grant number U01HG009716 to AW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

### ACKNOWLEDGMENTS

The authors thank members of the clinical unit of the Division of Human Genetics, University of Cape Town for contributing to data collection. We also thank Drs. Bertram Henderson, Maureen Conradie and Sarah Kraus for their assistance in data acquisition. Finally, we are grateful to patients and families who participated in this study.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00333/full#supplementary-material

Figure S1 | Pedigree of a family with a dominant inheritance pattern of NS, but no pathogenic variant found in the 14 genes investigated. Affected individuals (black) presented with typical craniofacial features, short stature, pectus deformities of the chest, webbed or short neck, congenital heart defect and café au lait spots. The variant CBL c.2345C>T (p.Pro782Leu) identified in the proband (IV-2), classified as uncertain significance in ClinVar, did not segregate with the disease in the family: five first or second-degree relatives were screened for the detected variant in addition to the index case, including three clinically affected and two non-affected individuals. The number beneath deceased family members indicates their age at death.

Table S1 | Summary of clinical features identified in the cohort of 26 patients.

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tekendo-Ngongang, Agenbag, Bope, Esterhuizen and Wonkam. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Population Levels Assessment of the Distribution of Disease-Associated Variants With Emphasis on Armenians – A Machine Learning Approach

Maria Nikoghosyan<sup>1,2</sup>, Siras Hakobyan<sup>2</sup>, Anahit Hovhannisyan<sup>3</sup>, Henry Loeffler-Wirth<sup>4</sup>, Hans Binder<sup>4†</sup> and Arsen Arakelyan<sup>1,2\*†</sup>

<sup>1</sup> Institute of Biomedicine and Pharmacy, Russian-Armenian University, Yerevan, Armenia, <sup>2</sup> Research Group of Bioinformatics, Institute of Molecular Biology NAS RA, Yerevan, Armenia, <sup>3</sup> Laboratory of Ethnogenomics, Institute of Molecular Biology NAS RA, Yerevan, Armenia, <sup>4</sup> Interdisciplinary Centre for Bioinformatics, University of Leipzig, Leipzig, Germany

#### Edited by:

Zané Lombard, University of the Witwatersrand, South Africa

#### Reviewed by:

Ruzong Fan, Georgetown University Medical Center, United States

Teri Manolio, National Human Genome Research Institute (NHGRI), United States

\*Correspondence:

Arsen Arakelyan arsen.arakelyan@rau.am; arakelyanaa@zoho.com

†These authors share senior authorship

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 14 December 2018 Accepted: 11 April 2019 Published: 26 April 2019

#### Citation:

Nikoghosyan M, Hakobyan S, Hovhannisyan A, Loeffler-Wirth H, Binder H and Arakelyan A (2019) Population Levels Assessment of the Distribution of Disease-Associated Variants With Emphasis on Armenians – A Machine Learning Approach. Front. Genet. 10:394. doi: 10.3389/fgene.2019.00394

Background: During the last decades, a number of genome-wide association studies (GWASs) have identified numerous single nucleotide polymorphisms (SNPs) associated with different complex diseases. However, associations reported in one population are often conflicting and do not replicate when studied in other populations. One of the reasons could be that most GWAS employ a case-control design in one or a limited number of populations, while little attention has been paid to the global distribution of disease-associated alleles across different populations. Moreover, the majority of GWAS have been performed on selected European, African, and Chinese populations, and a considerable number of populations remain understudied.

Aim: We have investigated the global distribution of the disease-associated SNPs discovered so far across worldwide populations of different ancestry and geographical regions, with a special focus on the understudied population of Armenians.

Data and Methods: We have used genotyping data from the Human Genome Diversity Project and from the Armenian population and combined them with disease-associated SNP data taken from public repositories, leading to a final dataset of 44,234 markers. Their frequency distribution across 1039 individuals from 53 populations was analyzed using self-organizing maps (SOM) machine learning. Our SOM portrayal approach reduces data dimensionality, clusters SNPs with similar frequency profiles, and provides two-dimensional data images which enable visual evaluation of disease-associated SNP landscapes among human populations.

Results: We find that populations from Africa, Oceania, and America show specific patterns of minor allele frequencies of disease-associated SNPs, while populations from Europe, Middle East, Central South Asia, and Armenia mostly share similar patterns. Importantly, different sets of SNPs are associated with common polygenic diseases, such as cancer, diabetes, and neurodegeneration, in populations from different geographic regions. Armenians are characterized by a set of SNPs that is distinct from those of other populations from the neighboring geographical regions.


Conclusion: Genetic associations of diseases vary considerably across populations, which necessitates health-related genotyping efforts, especially for populations that have so far been understudied. SOM portrayal represents a novel, promising method in population genetic research with special strength in visualization-based comparison of SNP data.

Keywords: complex diseases, genetic risk alleles, small populations, genome-wide association study, machine learning, self-organizing maps, population-level disease variant distribution, single nucleotide polymorphisms

#### INTRODUCTION

Non-communicable polygenic diseases such as cancers, neurodegeneration, cardiovascular, and metabolic disorders have become the most prevalent type of disease worldwide and account for the majority of deaths in developed and transition economy countries (Habib and Saha, 2010; Benziger et al., 2016). Initiation and development of complex diseases are governed by both genetic and environmental factors (Ramos and Olden, 2008). Genetic predisposition to complex diseases is not the result of a single mutation but requires the synergistic effect of variations in many genes. These variants can be common or rare in a population, giving rise to the "common variant" and "rare variant" hypotheses (Pritchard, 2001; Reich and Lander, 2001). Currently, one of the primary tasks of genome medicine is to identify panels of complex disease-predisposing genetic markers for use in disease prognostics, diagnostics, and drug development (Abraham and Inouye, 2015).

The most widely applied method for searching for multiple genetic variants is the genome-wide association study (GWAS). During the last decades, thousands of GWAS have identified numerous single nucleotide polymorphisms (SNPs) associated with different complex diseases such as cancers, schizophrenia, diabetes, and Alzheimer's and Parkinson's diseases (Giri et al., 2016; Foley et al., 2017; Sud et al., 2017; Billingsley et al., 2018). However, associations reported in one population often do not replicate when studied in another population and, moreover, are sometimes reported as neutral or even protective (Colhoun et al., 2003; Rice et al., 2007; Li and Meyre, 2013).

One explanation is that most GWAS employ a case-control design in selected populations, mainly of European and, to a lesser extent, of African and Chinese origin, while other populations largely remain understudied. This issue has gained significant attention in recent years, and a number of papers have been published that evaluate how risk allele frequencies at known disease loci vary across populations and how this causes biases in population risk score estimation (Jankovic et al., 2010; Abraham and Inouye, 2015; Kim et al., 2018). Moreover, it has lately been shown that assessment of the population-level distribution of disease risk alleles can contribute to public healthcare planning (Lau et al., 2018). However, most studies of this kind focus either on limited population diversity or on a limited set of disease-SNP associations.

Moreover, the inclusion of genetically isolated populations will considerably enhance the understanding of complex trait-associated variants because of their reduced allele diversity (Kristiansson et al., 2008).

In order to describe the entire landscape of population-level variation of disease-associated SNPs across multiple populations and geographic regions, we used a bioinformatics pipeline based on self-organizing maps (SOM) machine learning. This method has been previously applied to different high-dimensional omics data such as transcriptomic, epigenomic, and proteomic data (Binder et al., 2014; Hopp et al., 2015, 2018; Arakelyan et al., 2017). Its strong visualization capabilities and options for downstream bioinformatics analysis motivated us to apply SOM machine learning to genomic SNP data to study disease-associated risk profiles. We have investigated the distribution of about 44,000 disease-associated SNPs across 52 populations of different ancestry and geographical origin, among them the so-far understudied population of Armenians. Historically inhabiting the region of the South Caucasus, the Armenian population has been reproductively isolated since the Bronze Age (Haber et al., 2016), which makes it an interesting example for studying local specifics of the interaction between the distribution of genetic risks for complex diseases and actual disease prevalence at the population level.

#### MATERIALS AND METHODS

#### Data and Pre-processing

In the first step of analysis population-related SNP data were merged with disease-associated SNPs and preprocessed (**Figure 1**). We considered the following data sets.

#### Population Data (HGDP and Armenians Data Set)

We used preprocessed genome-wide SNP data (Illumina 650Y arrays) taken from the Human Genome Diversity Project (HGDP<sup>1</sup>) after removal of atypical and duplicated samples. The data comprise genotypes (650,000 SNPs) from 940 individuals from 51 populations in seven geographical regions (Africa, Europe, Middle East, South and Central Asia, East Asia, Oceania, and America) (Rosenberg, 2006).

Single nucleotide polymorphism data (Illumina Human Omniexpress microarray platform) of 99 Armenians (Eastern Armenian population) were taken from the recent publication by Haber et al. (2016).

#### Disease-Associated SNP Data

Lists of SNPs associated with diseases were collected from the following four databases: UniProt humsavar<sup>2</sup>, NCBI ClinVar<sup>3</sup>, GWASdb<sup>4</sup>, and DisGeNET<sup>5</sup>. The lists from all sources were then combined and the duplicated records were removed. The final list consisted of 321,955 disease-associated SNPs.

<sup>1</sup>http://www.hagsc.org/hgdp/

<sup>2</sup>www.uniprot.org/docs/humsavar

<sup>3</sup>https://www.ncbi.nlm.nih.gov/clinvar/

delivers a SNP portrait of each individual. It represents a colored image showing clusters of SNPs with increased minor allele frequency (MAF) as red 'spot-like' areas. They were then used for extracting population-specific associations with disease risks and biological functions by applying enrichment techniques. The SOM mining step also makes use of overview maps summarizing all spots observed on population averaged mean portraits which characterize the SNP landscape of individuals of a certain geographic region and of SNP profiles showing the allele score across all individuals and populations studied. For example, the red minor-allelic spot in the right-upper corner of the map (see dashed circle) is specific for Africans because it is observed in their portraits but not in the portrait of Europeans. Its profile shows high and low values of the allele score for individuals from these regions. Each of the spots delivers a list of SNPs and associated genes, which, in turn, are used to extract disease risks for populations showing these spots.

#### SNP-Filtered Population Data

Disease-associated SNPs with minor allele frequency (MAF) > 0.05 were selected from both data sets after removing missing genotypes using "vcftools." VCF genotype files were transformed into a genotype matrix using the "VariantAnnotation" (Obenchain et al., 2014) and "snpstats" (Guino et al., 2006) R packages. The final dataset consisted of 44,234 disease-associated SNPs in 1039 samples, combining the HGDP and Armenian populations.

<sup>4</sup>http://jjwanglab.org/gwasdb

<sup>5</sup>http://www.disgenet.org/

#### Allele Coding

For further data processing, SNP genotypes were coded by the following integers: 0 – homozygous major allele genotype, 1 – heterozygous genotype, and 2 – homozygous minor allele genotype. The full set of SNPs of each individual constitutes its SNP-portrait, while the allele-coded values of each SNP over all individuals in the data set constitute its SNP-profile (**Figure 1**).
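The MAF filter and the 0/1/2 allele coding described above can be illustrated with a minimal sketch. It is not the authors' pipeline (which used vcftools and R packages); the function name `code_snp`, the two-letter genotype strings, the `None` marker for missing calls, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def code_snp(genotypes):
    """Return (codes, maf) for one biallelic SNP.

    genotypes: list of two-letter strings such as "AG"; None marks a missing call
               (an assumption of this sketch).
    codes:     number of minor alleles per individual (0, 1 or 2), None if missing.
    """
    called = [g for g in genotypes if g is not None]
    counts = Counter("".join(called))
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(counts) < 2:
        # Monomorphic (or entirely missing) SNP: no minor allele is observed.
        return [None if g is None else 0 for g in genotypes], 0.0
    # Minor allele = less frequent allele; ties are broken alphabetically.
    minor = min(counts, key=lambda allele: (counts[allele], allele))
    codes = [None if g is None else g.count(minor) for g in genotypes]
    maf = counts[minor] / (2 * len(called))
    return codes, maf


# Hypothetical genotypes of six individuals at one SNP.
toy_genotypes = ["AA", "AG", "GG", "AA", None, "AG"]
codes, maf = code_snp(toy_genotypes)
keep = maf > 0.05  # the MAF threshold stated in the text
print(codes, maf, keep)
```

In this toy example, allele G is the minor allele, so a "GG" individual receives code 2 and an "AA" individual code 0. The list of codes across individuals corresponds to one SNP-profile in the sense used above.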

#### Disease Classification


We used Disease Ontology (DO, release 2018-07-05) based classification of diseases. The DO is structured into types of disease on different levels using a tree-model (Schriml et al., 2012). For comparability of disease-SNP associations, we mapped DO-terms of level 4 and higher to level 3 of DO terms. For instance, diabetes mellitus (level 5) is assigned to carbohydrate metabolism disease (level 3) in further analysis (**Supplementary Figure S1**).
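The mapping of deeper DO terms to their level-3 ancestor can be sketched as a walk up a parent-pointer tree. The hierarchy below is a toy example, not the actual DO release: the intermediate terms, the single-parent assumption, and the convention that the root is level 1 are all assumptions chosen so that the example from the text (diabetes mellitus at level 5 mapping to carbohydrate metabolism disease at level 3) is reproduced.

```python
def path_from_root(term, parent):
    """List of terms from the root down to `term` in a parent-pointer tree."""
    path = [term]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def map_to_level(term, parent, target_level=3):
    """Map a term at level > target_level to its ancestor at target_level.

    Terms at or above the target level are returned unchanged.
    The root is counted as level 1 (an assumption of this sketch).
    """
    path = path_from_root(term, parent)
    if len(path) <= target_level:
        return term
    return path[target_level - 1]


# Hypothetical toy hierarchy (child -> parent); not the real DO structure.
toy_parent = {
    "disease": None,
    "disease of metabolism": "disease",
    "carbohydrate metabolism disease": "disease of metabolism",
    "glucose metabolism disease": "carbohydrate metabolism disease",
    "diabetes mellitus": "glucose metabolism disease",
}

print(map_to_level("diabetes mellitus", toy_parent))
```

The real ontology allows a term to have more than one parent, in which case a rule for choosing among several level-3 ancestors would be needed; the text does not specify one.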

### Generating SNP-Portraits Using Self-Organizing Maps

In the next step, preprocessed and filtered HGDP SNP datasets were feature-centralized and then clustered using SOM machine learning (see Wirth et al., 2011 for a detailed description of the method, and **Figure 1** for a schematic representation). It translates the original data matrix consisting of the allele scores of N = 44,234 disease-associated SNPs collected from M = 1,039 individuals into a data matrix of reduced dimensionality of K = 3,025 so-called meta-SNP profiles. Hereby, the term 'profile' denotes the vector of allele score values across the individuals. The SOM training algorithm distributes the SNPs over the K micro-clusters of meta-SNPs by minimizing the Euclidean distance between the SNP-profiles as a similarity measure. This ensures that SNPs with similar profiles cluster together in the same or in closely located meta-SNPs. Each meta-SNP profile can be interpreted as the mean profile averaged over all SNP profiles referring to the respective meta-SNP cluster. The allele scores of the meta-SNPs of each individual are visualized by arranging them into a two-dimensional grid of K = 55 × 55 tiles and by using red to blue colors for maximum to minimum allele score values in each of the grid images. These images 'portray' the genetic landscape of each individual studied. We used the SOM implemented in the "oposSOM" R package (Löffler-Wirth et al., 2015). All populations were labeled according to their geographical location, while Armenians were considered as a separate group. Mean SNP-SOM portraits of populations from the same geographic region were obtained by averaging the meta-SNP values of the respective individual SNP-portraits. A separate "zoom-in" SOM (Wirth et al., 2011) was trained by considering only populations of the HGDP data set from the Middle East and Europe together with Armenians to better resolve details of their disease-associated genomes. Full data analysis results are available from the Zenodo open data platform (Nikoghosyan et al., 2018).
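To make the clustering step more concrete, the following is a minimal sketch of a generic online self-organizing map in plain Python. It is not the oposSOM implementation: the grid size, learning-rate and neighbourhood schedules, the per-profile centering used as a stand-in for "feature centralization", and the toy profiles are all assumptions chosen for readability.

```python
import math
import random


def center(profile):
    """Subtract the profile's own mean (one reading of 'feature centralized')."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(profile, weights):
    """Index of the meta-SNP whose weight vector is closest (Euclidean) to the profile."""
    return min(range(len(weights)), key=lambda k: squared_distance(profile, weights[k]))


def train_som(profiles, grid=3, epochs=40, seed=0):
    """Train a grid x grid SOM on SNP profiles; returns (meta_snp_profiles, assignment)."""
    rng = random.Random(seed)
    data = [center(p) for p in profiles]
    dim = len(data[0])
    units = [(row, col) for row in range(grid) for col in range(grid)]
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in units]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            progress = step / total_steps
            rate = 0.5 * (1.0 - progress) + 0.01            # learning rate decays
            radius = max(grid / 2 * (1.0 - progress), 0.5)  # neighbourhood shrinks
            win_row, win_col = units[best_matching_unit(data[i], weights)]
            for k, (row, col) in enumerate(units):
                grid_d2 = (row - win_row) ** 2 + (col - win_col) ** 2
                pull = rate * math.exp(-grid_d2 / (2 * radius ** 2))
                weights[k] = [w + pull * (x - w) for w, x in zip(weights[k], data[i])]
            step += 1
    assignment = [best_matching_unit(p, weights) for p in data]
    return weights, assignment


# Toy allele-score profiles (rows: SNPs, columns: six hypothetical individuals).
toy_profiles = [
    [2, 2, 2, 0, 0, 0],
    [2, 2, 2, 0, 0, 0],
    [0, 0, 0, 2, 2, 2],
    [0, 1, 0, 2, 1, 2],
    [1, 1, 1, 1, 1, 1],
]
meta_snps, assignment = train_som(toy_profiles)
print(assignment)  # index of the meta-SNP (grid tile) each SNP was assigned to
```

Each row of `meta_snps` plays the role of one meta-SNP profile. Reading column j across all rows and arranging the values on the grid would give the portrait of individual j. In the study the grid is 55 × 55 rather than the 3 × 3 used here.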

### Spot Clustering, Disease, and GO-Term Enrichment

In the third step, we performed an analysis of the SOM-clustered data to assess disease-associated genetic risks across the populations. Our SOM implementation used a ternary code for coloring each meta-SNP, giving rise to spot-like red and blue colored regions in the SNP-portraits due to the self-organizing properties of the SOM algorithm. Red and blue spots refer to minor and major allelic regions, while green areas mark heterozygous alleles. We then used segmentation algorithms developed previously (Wirth et al., 2011) to extract so-called spot-clusters of (red) minor-allelic regions. Each of these spot-clusters consists of hundreds to thousands of SNP-profiles. Enrichment of disease DO terms in the spot clusters was then estimated by Fisher's exact test. For each spot, the test assesses whether the number of SNPs associated with a given disease is larger than expected under the assumption of random distribution of SNPs among the spots. Enrichment analysis was also performed for the gene-ontology (GO) terms "biological process" and "cellular component" using over-representation analysis as implemented in the WebGestalt webserver (Wang et al., 2017) to assess the functional context of the genes containing the SNPs in a given spot.
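The enrichment test for one disease in one spot can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The function name `enrichment_p` and the counts in the example are hypothetical; the sketch assumes each SNP is counted once and that the background is the full set of clustered SNPs.

```python
from math import comb


def enrichment_p(k, n_spot, n_disease, n_total):
    """One-sided Fisher's exact test p-value P(X >= k).

    k:         SNPs in the spot that are associated with the disease
    n_spot:    SNPs in the spot
    n_disease: SNPs associated with the disease in the whole map
    n_total:   all SNPs in the map (background)
    """
    upper = min(n_spot, n_disease)
    total_ways = comb(n_total, n_spot)
    tail_ways = sum(
        comb(n_disease, i) * comb(n_total - n_disease, n_spot - i)
        for i in range(k, upper + 1)
    )
    return tail_ways / total_ways


# Hypothetical counts: 3 of the 5 SNPs in a spot belong to a disease
# that is annotated to 5 of 20 SNPs overall.
p_value = enrichment_p(k=3, n_spot=5, n_disease=5, n_total=20)
print(round(p_value, 4))
```

A small p-value indicates that the disease is over-represented in the spot relative to a random distribution of SNPs. With one test per disease and spot, some correction for multiple testing would normally be applied, although the text does not state which one was used.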

### RESULTS

### SOM-Portrayal of Geographical Diversity of Disease-Associated SNPs

Human disease-related genetic diversity is shaped by demographic, biological, and environmental factors. Here we applied a SOM approach to gain new insights into population-level distributions of disease-predisposing alleles across geographic regions using whole genome SNP-scans of 1039 individuals selected from 52 ethnicities in seven geographic regions and of Armenians considered separately. The SOM was trained using ca. 44,000 disease-associated SNPs. We obtained a gallery of "SNP portraits" visualizing the genotypes of disease-associated SNPs for each individual studied (**Supplementary Figure S2**). Inspection of the portraits reveals a high diversity of textures reflecting the allelic landscapes in terms of areas enriched for major homozygous, heterozygous and minor homozygous genotypes color-coded in blue, green, and red, respectively. On the other hand, sample portraits were mostly very similar for individuals originating from the same geographic region, while the portraits of individuals from different regions progressively diverge with increasing geographic distance in most cases. For example, individuals from sub-Saharan Africa typically show a red "spot" in the right upper corner of their SOM-portraits which shifts toward the right lower corner for individuals from Middle East and Europe including Armenians. This shift reflects the fact that the latter three populations show on average similar collections of minor homozygous disease-associated SNPs which, however, differ markedly from those of sub-Saharan Africans. The red spots in the mean portrait of individuals from Central and South Asia partly overlap with those of Europeans, but the portrait also shows new, ubiquitous spots referring to disease-risk associated SNPs not observed in Europeans. The mean portraits of East Asian, Native American, and Oceania populations also reflect a combination of common and ubiquitous spots reflecting footprints of their population history.

the crossroad between African, European, and Asian branches of the MST. (B) A zoom-in SOM was calculated using data of selected populations from Middle East, Europe and Armenians for SOM-training to better resolve local similarity relations. The zoom-in MST reveals a relatively compact cluster of Armenians bordered by populations from Middle East and Europe, respectively. (C) Difference portraits of Armenians with respect to other populations show an increase of non-African genetic contributions with respect to Middle Eastern populations and increased European contributions with respect to Central and South Asian populations.

To visualize the similarity relations between the individuals from different geographic regions, we generated a minimum spanning tree (MST) based on Pearson's correlation coefficients between their SOM-portraits (**Figure 2A**). For comparison, we generated an independent component analysis (ICA) plot, which is often applied as a similarity presentation in population genetics (**Supplementary Figure S3**). The results reflect the variation of disease-associated alleles across the geographic regions. Interestingly, the MST also resembles the distribution of the populations across the geographic regions, ranging from Africa at one end to America and Oceania at the other. A similar MST was recently reported by us using a selection of the most variant SNPs instead of disease-associated ones (Binder and Wirth, 2014). The disease-associated SNP genotypes selected here reflect similar genetic drift effects as the most-variant SNPs.
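A minimum spanning tree built from Pearson correlations between portraits can be sketched as follows. The text does not state how correlations were converted to edge lengths, so the distance 1 − r used here is an assumption, as are the function names and the four toy portraits. Prim's algorithm is used for simplicity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def minimum_spanning_tree(portraits):
    """Prim's algorithm on distances d = 1 - r (assumed conversion).

    portraits: dict mapping a sample name to its vector of meta-SNP values.
    Returns the tree as a list of (node_in_tree, newly_added_node) edges.
    """
    names = list(portraits)
    dist = {
        (a, b): 1.0 - pearson(portraits[a], portraits[b])
        for a in names for b in names if a != b
    }
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        candidates = [(a, b) for a in in_tree for b in names if b not in in_tree]
        a, b = min(candidates, key=lambda edge: dist[edge])
        edges.append((a, b))
        in_tree.add(b)
    return edges


# Hypothetical portraits of four samples (four meta-SNP values each).
toy_portraits = {
    "A": [2, 1, 0, 0],
    "B": [2, 2, 0, 1],
    "C": [0, 1, 2, 2],
    "D": [0, 0, 2, 2],
}
edges = minimum_spanning_tree(toy_portraits)
print(edges)
```

In the toy data, samples A and B correlate strongly, as do C and D, so the tree links each pair directly and joins the two pairs through the least dissimilar cross-pair edge (B to C).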

Interestingly, the Armenian individuals accumulate into a homogenous cloud at the crossroad between three branches collecting populations from (sub-Saharan) Africa and Middle-East, from Europe and from Asia, respectively. This localization of the Armenian cluster is in accordance with the previous genetic studies based on the genetic variation data on autosomal and uniparental loci (Hovhannisyan et al., 2014; Haber et al., 2016; Yepiskoposyan et al., 2016).

A more detailed view from a zoom-in SOM using only populations from the Middle East and Europe further emphasizes the intermediate position of the Armenian population between the Middle East and Europe (**Figure 2B**). Difference portraits show that disease-associated allele landscapes of Armenians are characterized by non-African patterns compared with Middle East populations and by European patterns compared with Central and South Asian populations. The difference in comparison to other European populations is subtler, with marked similarities in the allelic composition. In summary, SOM-portrayal of disease-associated SNPs reflects and characterizes the distribution of humans across geographical regions. With respect to their disease-associated genome, Armenians occupy a central position between Middle Eastern, European, and Central Asian populations, in a region that was an ancient crossroad of human migration.

#### Segmenting the SNP Landscape Into Minor-Allelic Spots

The majority (about 70%) of minor alleles in the HGDP dataset are associated with the diseases studied, which is in accordance with previous observations (Lachance, 2010). We were interested in studying clusters of co-localized minor-allelic SNPs that are evident as red, spot-like areas in the SNP-portraits. The spot summary map collects all relevant red spots (clusters of SNPs with high minor allele frequency) to provide an overview of the minor-allelic spot regions observed in the mean SOM portraits of the different geographic regions (**Figure 3A**). Overall, we identified 13 minor-allelic spots labeled by capital letters A–M (**Figure 3A**). Mean profiles averaged over the allelic codes of all alleles collected in the respective spot reveal the geographic specificity of minor allele prevalence (**Figure 3C**). We identified seven spots which were unique for a given geographic region and another six (mixed) spots which were shared between several regions (**Figure 3D**). For instance, portraits for Armenians, Europeans, Central South Asians and populations from the Middle East are characterized by red spots located in the right-lower corner, while the portraits from (sub-Saharan) Africa and from East Asia show different spots in the right and left upper corners of the map, respectively. On the other hand, SNP portraits from Oceania and, to a lesser degree, from America are characterized by two or more spots of unique and/or mixed distribution. For example, spot L reflects similar minor allelic SNP profiles of Oceanians, Native Americans and East Asians and partly also Africans, while spot I reflects the common genetic history of original populations in America and East Asia.

In order to demonstrate how SOM assigns single SNPs into clusters based on their allele frequency profiles, we mapped 40 SNPs from 17 genes with a high number of disease associations taken from Price et al. (2015) into the SNP landscape (**Figure 3B**). Most of the SNPs accumulate in the regions of spots D and E (19.5%) and of spots J and K (34.1%), corresponding to European and East Asian populations, respectively. About 38% of the genes were found in or near spots assigned to minor allele enrichment in other geographic regions such as Africa, Oceania, and America. This unbalanced distribution is presumably due to the population bias in the studied SNPs toward Europe and East Asia. It emphasizes the necessity of extending genetic association studies to other populations.

We also evaluated the effect of linkage disequilibrium (LD) on the distribution of SNPs in the SOM portraits, using SNPs located on chromosome 1 available in our dataset. The SOM algorithm naturally tries to allocate SNPs with correlated profiles in close proximity (or in the same cluster), while SNPs with anti-correlated profiles are positioned in the most distant regions of the SNP portrait. Thus SNPs that are in LD will be located either in one cluster (for positively associated alleles) or in two clusters located most distantly on the "SNP portrait" (for negatively associated alleles) (**Supplementary Figure S4**). Furthermore, since the disease-associated SNPs used in our study were already "pre-selected" based on GWAS or functional studies, and since the goal of our study was "portraying the population-level genetic risks" for known associations rather than identifying new ones, we can assume that the effect of LD on our findings was minimal.

Thus, the SOM method aggregates disease-associated alleles into clusters associated with one or more regions, thereby reflecting the geographical variability of disease susceptibility coded in MAF.

#### Associations Between Diseases and Spot-Modules of SNPs and Their Functional Context

Next, we evaluated disease enrichment in spots compared with their background distribution based on the clustered SNPs using Fisher's exact test. This "background" distribution shows that the largest number of SNPs is associated with complex diseases such as cardiovascular, nervous and respiratory system disorders and carbohydrate metabolism disease (**Supplementary Figures S5**, **S6**).

We detected 11 significant disease associations per spot on average (**Supplementary Figures S7**–**S19**). The top diseases per spot are presented in **Figure 4A**. The same diseases, such as carbohydrate metabolism disease (diabetes mellitus), mood disorders, immune system, and neuronal diseases, are enriched in different spots. These redundantly distributed diseases are typically associated with SNPs in different genes, as shown in the plots in **Figure 4C**. They revealed predominantly a one-to-one relation between the SNPs in spots and diseases (**Figure 4B**). The distribution of gene counts over the spots (**Figure 4C**) roughly follows an exponential decay law, meaning that the number of genes associated with one spot dominates over the number of genes associated with multiple spots.

In order to further assess the differences in the functional context of the genes carrying the SNPs, we performed functional annotation of GO terms in each spot using over-representation analysis implemented in the WebGestalt web-server (Wang et al., 2017). The results demonstrate that each spot is characterized by an almost unique set of enriched GO biological process terms (**Figure 5A**). Similar patterns are observed in the enrichment of GO terms related to molecular function (**Supplementary Figure S20**) and cellular localization (**Figure 5B**). On the other hand, one finds the same terms [e.g., related to adhesion, which plays an important role in maintaining the physiological state of various organs (Müller, 2006)] in different spots. Other GO terms enriched in the spots included cell migration, cell and organ development, and signaling, which all appear to be deregulated in various diseases.

Our results underpin the complex character of disease pathophysiology, which involves deregulation of multiple biological pathways and cellular networks (Zheng et al., 2018), often in a population-specific fashion (Ran et al., 2011). In summary, our results demonstrate considerable geographic specificity in the distribution of genes and biological processes associated with the same diseases.

### Genetic-Risk Profiling

For the detailed overview, we represented the disease-spot associations as a heatmap in **Figure 6A**. We compared them with the minor allele score profiles of the spots (**Figure 6B**) to combine the assignment of diseases with geographic regions.

The diseases accumulating in the lower part of the heatmap in **Figure 6A** are the most thoroughly studied ones, showing the highest overall enrichment in the background distribution (compare with **Supplementary Figure S5**) as well as in spots. According to the minor allele enrichment, these diseases can be considered as the most prevalent ones worldwide. Indeed, the global prevalence of diabetes (carbohydrate metabolism disease) is 8.5% (Kakkar, 2016), which makes it one of the most frequent diseases. Mood disorders (bipolar disorder, anxiety, and depression) are considered the most frequent mental conditions (Steel et al., 2014), while immune system cancers (mostly malignant diseases of the blood and lymphoid system) have also been reported to have a high incidence rate worldwide (Foreman et al., 2018). SNPs associated with thoracic cancer (including lung cancer) were significantly enriched in three spots (A, E, K) covering all geographic regions.

The diseases in the upper part of the heatmap in **Figure 6A** are less enriched in the background distribution and thus they refer to moderately prevalent/studied ones. These diseases reveal region specificity of spot enrichment.

For example, SNPs associated with vitamin metabolic disorders were enriched in spots A and K, which show increased minor allele scores in Africa and East Asia, respectively. Vitamin deficiency in these regions has mostly been attributed to economic and political reasons and to local dietary practices (Bailey et al., 2015). Our results, however, showed that four SNPs (rs1348864, rs4778359, rs7781309, rs9937918) associated with vitamin D metabolism (Bernatzky et al., 2009; Engelman et al., 2010) show high MAF in Africa and East Asia, suggesting an increased genetic risk as well. Notably, low blood levels of 25-hydroxyvitamin D, the marker of vitamin D deficiency, were reported for these regions (Prentice, 2008; Prentice et al., 2009).

Likewise, SNPs for bilirubin metabolic disorder accumulate in five spots (A, D, E, F, and K) linked to Africa, Europe, the Middle East, Asia, and Armenia. Interestingly, previous studies clearly implicated SNPs identified in the spots with serum bilirubin levels in Europeans, Asians, and Africans (Kim et al., 2010; Chen et al., 2012; Cox et al., 2013). Moreover, population-dependent sets of mutations and polymorphisms were shown to be implicated in the development of inherited disorders of bilirubin clearance (Memon et al., 2016).

Finally, we found that some diseases are enriched in a single spot. For example, SNPs related to anxiety disorders were significantly enriched in Europe, the Middle East and Armenia (spot D). This result is in line with the large-scale meta-analysis performed by Baxter et al. (2013) indicating significantly reduced risk for anxiety disorders in non-western cultures compared with the western ones.

Overall, the results of population-level genetic risk profiling indicate a bias toward more prevalent diseases with global impact, such as cancers, immune system diseases, and diabetes. This, in turn, results in a larger number of associations compared with less widespread diseases. We also find that the enrichment of disease-associated SNPs is linked to disease prevalence in many cases.

#### Genetic Risks of Armenians

The analysis of the global SNP landscape of worldwide populations provided an overview of the geographic distribution of disease-related genetic risk factors. However, it hardly resolves finer-grained, population-level disease associations, especially for relatively small populations such as Armenians. Our initial analyses reveal patterns of disease-associated alleles that Armenians share with neighboring populations from the Middle East and Europe (e.g., Spots D and E in **Figure 3A**).

A detailed comparison of SNP portraits showed that Armenians are characterized by different spot patterns compared with that observed for populations from Europe, the Middle East and Central South Asia (**Figure 2C**).

Only a few genes were found in two or more spots according to different SNPs in the same gene (e.g., SH2B3). Examples were shown for three selected disease classes. (C) The number–number distributions over the spots follow an exponential decay meaning that the majority of genes associates with a single spot.

In order to better resolve differences between these neighboring populations, we performed a so-called zoom-in SOM analysis (Wirth et al., 2011) that considered only populations from Europe (French, Sardinian, Russian, North Italians), the Middle East (Bedouin, Druze, Palestinians) and Armenia.

This analysis revealed a spot cluster of minor alleles of SNPs that specifically characterize Armenians (spot H in **Figure 7**). These SNPs are associated with immune diseases, diabetes mellitus, skin diseases, and musculoskeletal diseases as the top four classes. The top SNPs showing the highest MAF, the affected genes, the associated diseases, and their incidence in Armenia (Andreasyan et al., 2017) are listed in **Table 1**.

For example, the incidence of Behcet's disease was reported to be higher in Armenians and other South Caucasus populations than in Russians (Lennikov et al., 2015). The highest prevalence of this disease has been reported in the Turkish population (450–500 per 100,000); however, the prevalence of 90 per 100,000 in ethnic Armenians (Oke and Khulief, 2016) is still considerably higher than in Europeans (Leonardo and McNeil, 2015).

A Tourette syndrome SNP is among the disease SNPs associated with Armenians. Systematic studies considering 44 populations have reported that Tourette syndrome is rare among Afro-Americans in the United States and sub-Saharan Africans. To date, most of the Tourette syndrome cohorts have been described from Western sites and also from China, Japan, and the United Arab Emirates (Qi et al., 2017).

the included SNPs for Armenians. Top diseases which associate with spot H are shown as barplot of enrichment p-values.

Likewise, no data are currently available on the incidence and prevalence rates of Alzheimer's disease in Armenia; however, it is accepted that the actual rates are comparable to, if not higher than, worldwide rates (Saberi et al., 2012; Tataryan, 2012). Official statistics are also unavailable for obesity; however, the 2013 WHO report (WHO, 2013) indicated that 55.5% of the adult population in Armenia were overweight and 24.0% were obese. Overall, our analysis suggests links between population-level enrichment of disease-associated alleles and disease prevalence, particularly in Armenians, for whom reliable data on disease prevalence are presently lacking.

#### TABLE 1 | Top disease-risk associated SNPs in Armenians.

<sup>∗</sup>SNPs were located in spot H (Figure 6) and further sorted according to the maximum ratio MAF (Armenians)/MAF (others).
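The ordering described in the Table 1 footnote, by the ratio MAF (Armenians)/MAF (others), can be sketched as follows. This is only an illustration: the SNP identifiers and frequencies are hypothetical, and how MAF (others) was aggregated over the remaining populations is an assumption not specified in the text.

```python
def rank_by_maf_ratio(maf_target, maf_others, floor=1e-6):
    """Return SNP ids sorted by MAF(target) / MAF(others), largest ratio first.

    `floor` guards against division by zero for alleles absent elsewhere.
    """
    ratios = {
        snp: maf_target[snp] / max(maf_others[snp], floor)
        for snp in maf_target
        if snp in maf_others
    }
    return sorted(ratios, key=ratios.get, reverse=True)


# Hypothetical SNP ids and allele frequencies, for illustration only.
maf_armenians = {"snp_a": 0.30, "snp_b": 0.10, "snp_c": 0.25}
maf_elsewhere = {"snp_a": 0.10, "snp_b": 0.20, "snp_c": 0.05}
ranking = rank_by_maf_ratio(maf_armenians, maf_elsewhere)
```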

### DISCUSSION

In this study, we analyzed population- and geographic region-wide distributions of disease-associated genetic risk factors using SOM machine learning. This approach generated region- and population-specific "SNP portraits" that visualize the distribution of disease-predisposing alleles and allow for direct comparison and assessment of the variation of disease-associated alleles across geographic regions.

Our results clearly indicate that there are region- and population-level specifics in the enrichment of disease-associated alleles, which could be linked to disease prevalence. Moreover, we noticed a significant variation of the disease predisposition background across worldwide populations, in particular for common diseases such as diabetes, cancers, cardiovascular, and mental diseases. These observations confirm their multifactorial nature and the involvement of multiple pathways in their pathogenesis. It is worth noting that low-frequency alleles associated with a disease in one population showed a considerably high population-level frequency in another. We also observed a bias in present knowledge toward the most prevalent diseases (like cancers and diabetes) as well as toward variants reported in the Western world and a few Asian and African populations. Future studies are required that focus more on so far understudied diseases and populations.

Further, our results raise the question of how the genetic risk in one population transfers to another, and they emphasize the need to involve as many populations as possible in clinical genomics initiatives.

As an example of an understudied population, we focused on Armenians. Portraying disease-related SNPs in Armenians demonstrated similarities with Middle Eastern, European, and Central Asian populations. A more detailed analysis detected SNPs with specifically increased MAFs in Armenians compared with all other populations studied, which indicates local disease prevalence in agreement with epidemiological data.

A few limitations of our analysis are worth noting. A high population-level MAF does not necessarily indicate increased disease susceptibility or prevalence in a particular region. Alternatively, it can also be the result of long periods of "non-exposure" to the disease in a certain population. Next, our study neglected a large number of disease-causing mutations because they were included neither in the array data used nor in the available disease association catalogs. DNA sequencing will provide more resolution in this respect, but there are presently not enough consistent data. Currently, several datasets are available that contain exome or whole genome sequencing data from various population genetic and disease-specific studies, such as ExAC/gnomAD or 1000 Genomes, which will enable studying population diversity based on a larger number of samples. We chose HGDP for our methodical study because it provides a relatively large population diversity that still exceeds that of the other datasets (51 in HGDP vs. 26 in 1000 Genomes, 10 in gnomAD, and 17 in ExAC) and because its measuring platform matches that of the SNP data for Armenians. Future studies have to consider population diversity in the Caucasus region and the surrounding areas, and also ancient samples, which are becoming increasingly available, for more detailed disease-risk profiling in space and time.

From a methodological point of view, our study demonstrates the power of machine learning and, particularly, of SOM portrayal for analyzing genomic data. This method possesses strong visualization capabilities by providing maps of the SNP landscapes of the populations under study. They project the highly multidimensional SNP patterns strictly into two dimensions, in contrast to principal component plots, which are still multidimensional. Moreover, and most importantly, the method generates a "SNP portrait" for each individual, thereby enabling the personalized evaluation of its SNP patterns. These individual portraits can be used to generate mean portraits averaged over selected groups of individuals, e.g., populations from selected geographic regions, which can then be compared to identify common and different SNP patterns. The strong clustering capabilities of SOM deliver groups of SNPs showing similar profiles, and the application of enrichment techniques provides their functional and disease context. The method needs further development for applications to genomic data, for example, to include other genetic defects and to integrate and visualize additional phenotypic information.

Overall, our novel approach extends the toolset employed in association and population genetic studies. The strength of the SOM portrayal used here lies in the possibility of disentangling the entire genetic variation landscape into functional clusters, which can subsequently be assigned to various features of the groups studied. This includes stratification of populations and identification of disease-associated variants.

#### CONCLUSION

Our results clearly indicate that there is great scope for further research in this area. There is a strong need to include non-Western populations in future studies that are clinically, geographically, and ethnically well-characterized.

#### AUTHOR CONTRIBUTIONS

AA and HB initiated the study. MN and AA performed calculations with contribution from SH, HL-W, and HB. MN, SH, AH, HL-W, HB, and AA contributed to results interpretation, manuscript writing, and approved the final manuscript.

#### FUNDING

The study was performed with the support of the Internal Grant of Russian-Armenian University within the framework of funding from the Ministry of Education and Science of the Russian Federation. This study was supported by the German Federal Ministry of Education and Science (BMBF) grants LHA (idSEM program: FKZ 031L0026 to HB and HL-W), PathwayMaps (WTZ ARM II-010 and 01ZX1304A to HB and AA), and oBIG (FFE-0034 to HL-W).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00394/full#supplementary-material

#### REFERENCES

groups, 1990 to 2016. JAMA Oncol. 4, 1553–1568. doi: 10.1001/jamaoncol.2018.2706

WHO (2013). Nutrition, Physical Activity and Obesity.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Nikoghosyan, Hakobyan, Hovhannisyan, Loeffler-Wirth, Binder and Arakelyan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Complimentary Methods for Multivariate Genome-Wide Association Study Identify New Susceptibility Genes for Blood Cell Traits

Segun Fatumo<sup>1,2,3</sup>\*, Tommy Carstensen<sup>4</sup>, Oyekanmi Nashiru<sup>3</sup>, Deepti Gurdasani<sup>4</sup>†, Manjinder Sandhu<sup>4,5</sup>† and Pontiano Kaleebu<sup>1,2</sup>†

<sup>1</sup> Uganda Medical Informatics Centre, MRC/UVRI and LSHTM Uganda Research Unit, Entebbe, Uganda, <sup>2</sup> London School of Hygiene and Tropical Medicine, London, United Kingdom, <sup>3</sup> H3Africa Bioinformatics Network (H3ABioNet) Node, Centre for Genomics Research and Innovation, NABDA/FMST, Abuja, Nigeria, <sup>4</sup> Human Genetics, Wellcome Sanger Institute, Hinxton, Cambridge, United Kingdom, <sup>5</sup> Division of Computational Medicine, Department of Medicine, University of Cambridge, Cambridge, United Kingdom

#### Edited by:

Solomon Fiifi Ofori-Acquah, University of Ghana, Ghana

#### Reviewed by:

Lucas Lodewijk Janss, Aarhus University, Denmark Timothy Thornton, University of Washington, United States

#### \*Correspondence:

Segun Fatumo Segun.Fatumo@mrcuganda.org; segunfatumo@gmail.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 21 December 2018 Accepted: 28 March 2019 Published: 26 April 2019

#### Citation:

Fatumo S, Carstensen T, Nashiru O, Gurdasani D, Sandhu M and Kaleebu P (2019) Complimentary Methods for Multivariate Genome-Wide Association Study Identify New Susceptibility Genes for Blood Cell Traits. Front. Genet. 10:334. doi: 10.3389/fgene.2019.00334

Genome-wide association studies (GWAS) have found hundreds of novel loci associated with full blood count (FBC) phenotypes. However, most of these studies were performed in a single-phenotype framework without taking into consideration the clinical relatedness among traits. In this work, in addition to the standard univariate GWAS, we also use two different multivariate methods to perform the first multiple-trait GWAS of FBC traits in ∼7000 individuals from the Ugandan General Population Cohort (GPC). We started by performing the standard univariate GWAS approach. In the first multivariate method, we tested for marker associations with 15 FBC traits simultaneously in a multivariate mixed model implemented in GEMMA while accounting for the relatedness of individuals and pedigree structures, as well as population substructure. In this analysis, we provide a framework for the combination of multiple phenotypes in multivariate GWAS analysis and show evidence of multicollinearity whenever the correlation between traits exceeds the correlation coefficient threshold of r<sup>2</sup> ≥ 0.75. This approach identifies two known loci and one novel locus. In the second multivariate method, we applied principal component analysis (PCA) to the same 15 correlated FBC traits. We then tested for marker associations with each PC in univariate linear mixed models implemented in GEMMA. We show that the FBC composite phenotype as assessed by each PC expresses information that is not completely encapsulated by the individual FBC traits, as this approach identifies three known and five novel loci that were not identified using either the standard univariate or the multivariate GWAS method. Across both multivariate methods, we identified six novel loci. As a proof of concept, both multivariate methods also identified the known loci HBB and ITFG3. The two multivariate methods show that multivariate genotype-phenotype methods increase power and identify novel genotype-phenotype associations not found with the standard univariate GWAS in the same dataset.

Keywords: multivariate GWAS, PCA, full blood counts, multiple phenotype, genome-wide association study

## INTRODUCTION

fgene-10-00334 April 24, 2019 Time: 17:30 # 2

Genome-wide association studies (GWAS) have discovered loci associated with an extensive range of human traits and diseases. Mostly, the standard univariate GWAS approach has been performed in a single-trait framework without taking into consideration clinical relatedness and correlations among phenotypes. However, as many human traits are highly correlated, and given the usual stringent genome-wide significance threshold, such analyses may have a number of limitations, including difficulties in identifying genetic risk factors with pleiotropic effects (Park et al., 2011). Current large-scale standard univariate and multivariate GWAS analyses have principally concentrated on populations of European lineage (Need and Goldstein, 2009; Zhang et al., 2009; Galesloot et al., 2014; Porter and O'Reilly, 2017), with only a few small-scale GWAS in African populations across a narrow range of cardiometabolic diseases and traits (Gurdasani et al., 2015; Peprah et al., 2015). In order to generalize the discoveries from genetic studies of complex diseases and provide opportunities for new insights into disease etiology and potential therapeutic strategies, it will be vital to investigate genetic susceptibility in a global setting, including populations of African ancestry (McCarthy et al., 2008; Adoga et al., 2014).

Multivariate linear mixed models have been extensively used in a range of genetics studies (Yu et al., 2006; Kang et al., 2008, 2010; Zhang et al., 2010; Lippert et al., 2011; Loh et al., 2015; Hackinger and Zeggini, 2017). Recently, this approach has attracted substantial interest in GWAS. Genome-wide Efficient Mixed Model Association (GEMMA) (Zhou and Stephens, 2014) fits a multivariate linear mixed model to test SNP associations with multiple traits simultaneously while adjusting for population stratification. In previous studies, multivariate analyses have mainly been performed on GWAS of lipid traits (Park et al., 2011) and anthropometric traits (Ried et al., 2016), mostly in European and Asian populations. The cellular components of the full blood count (FBC) arise from a common pluripotent stem cell (Seet et al., 2017) and are highly correlated. Thus, FBC traits provide an opportunity to: (1) explore how multivariate GWAS performs in comparison with standard univariate analyses in a family-based dataset, (2) investigate the effect of highly correlated traits in multivariate analyses, (3) explore different multivariate approaches in GWAS, and (4) understand when a multivariate analysis would be most helpful in a GWA study. In the present study, we performed the first multivariate GWAS of FBC traits by analyzing 2,230,258 quality-controlled autosomal SNPs in nearly 7000 individuals who are structured in clustered groups in rural Uganda, genotyped on the Illumina Human Omni 2.5 M octo array. We applied two complementary multivariate GWAS strategies in nearly 5000 genotyped samples, with validation of the associated genetic variants in ∼2000 individuals with whole genome sequencing (WGS) data sampled from the Ugandan General Population Cohort (GPC).

## MATERIALS AND METHODS

#### Study Population

The General Population Cohort (GPC) is a population-based open cohort of roughly 22,000 inhabitants of 25 neighboring villages of Kyamulibwa, a subcounty of Kalungu district in rural south-western Uganda (**Figure 1**). The cohort study was founded in the late 1980s by the Medical Research Council (MRC) United Kingdom in partnership with the Uganda Virus Research Institute (UVRI), primarily to investigate trends in the incidence and prevalence of HIV infection in Uganda. Samples were collected from research participants during a survey of the study area. The study area is clustered into villages defined by governmental borders, ranging in size from 300 to 1500 dwellers, and includes numerous families resident within households (Asiki et al., 2013). The GPC Round 22 study took place in 2011 through collaboration between the University of Cambridge, the Wellcome Sanger Institute (WSI), and MRC/UVRI. The study was contained within one annual survey round of the longitudinal cohort. The focus of the GPC Round 22 study was to investigate the genetics and epidemiology of communicable and non-communicable diseases in order to provide etiological insights into the genetic variation in cardiometabolic and infectious risk factors in children and adults, using both population genetic and epidemiological approaches. The first set of samples, tagged UGWAS, consisted of ∼5000 Ugandan subjects genotyped on the Illumina HumanOmni2.5-8 genotyping array. Following stringent quality control (see section "Quality Control"), 4778 individuals were carried forward for analysis. The second set of samples, tagged UG2G, consisted of ∼2000 individuals who underwent whole genome sequencing; of these, 1,629 individuals passed quality checks and did not overlap with the genotype data. Both UGWAS and UG2G included several pedigrees and individuals with cryptic relatedness, as well as individuals clustered by household and village. Due to extensive migration into and around the region, nine ethnolinguistic groups in south-western Uganda were included in the sample.

#### Study Design

The data collection of the GPC Round 22 study comprised five main stages, which took place over the course of 2011: mobilization (recruitment and consenting), mapping, census, survey, and feedback of results and clinical follow-up. The census consisted of a family questionnaire and a questionnaire for the individual recruited from within the family. The family questionnaire was completed by the head of the family or another responsible adult or emancipated minor member of the household. The household census questionnaire focused on sociodemographic information about the household, such as the quality of the house, property ownership, and employment of workers. The individual survey questionnaire captured information on members of a household, including position within the household, marital status, resident status, childbirth and fertility, tribe, and religion. Information on lifestyle and health was obtained using a standard questionnaire.

This included biophysical measurements and blood samples (Asiki et al., 2013). To assess the spectrum of genetic variants associated with cardiometabolic traits in this population, we previously performed a standard univariate GWAS of a range of individual cardiometabolic traits. In the current study, we applied two different multivariate GWAS methods to analyze multiple related FBC phenotypes simultaneously, following a standard univariate GWAS analysis of each individual trait. We assessed the common autosomal SNPs in the imputed genotype data (UGWAS) and the sequence data (UG2G) in a pooled analysis comprising all 6,407 individuals, rather than in a meta-analysis, which would treat these as independent datasets and potentially result in inflation of type I error.

#### Quality Control


Briefly, we applied a succession of stringent quality control steps to ∼5000 Ugandan samples genotyped on an Illumina array. Specifically, a total of 2,314,174 autosomal variants were genotyped on the Illumina HumanOmni2.5-8 array. We excluded 39,368 autosomal variants that did not pass the stringent variant quality control cutoffs (Heckerman et al., 2016). We also excluded a total of 91 individuals during sample QC because they failed to meet the cutoffs for sample call rate (>97%) or heterozygosity (within mean ± 3 SD), or because they failed the sex check based on the X chromosome. Three samples were also excluded because they were too closely related to one another based on identity by descent (IBD >0.90) (Heckerman et al., 2016). Downstream analyses were performed on the remaining 2,230,258 autosomal markers and 4,778 samples that passed quality checks. The workflow for data processing of UG2G has been described previously in more detail.
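The two numeric sample filters described above (call rate and heterozygosity) can be illustrated with the minimal sketch below. It is not the pipeline used in the study: the function name, the sample identifiers, and the toy values are hypothetical, and the sex check and IBD filter are omitted.

```python
from statistics import mean, stdev


def sample_qc(call_rate, het_rate, min_call_rate=0.97, n_sd=3.0):
    """Return the ids of samples passing the call-rate and heterozygosity filters.

    call_rate, het_rate: dicts mapping sample id -> per-sample statistic.
    A sample passes if its call rate exceeds `min_call_rate` and its
    heterozygosity lies within mean +/- n_sd standard deviations.
    """
    mu = mean(het_rate.values())
    sd = stdev(het_rate.values())
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return {
        s
        for s in call_rate
        if call_rate[s] > min_call_rate and low <= het_rate[s] <= high
    }


# Hypothetical toy data: 20 typical samples plus one heterozygosity outlier.
het = {f"s{i}": 0.30 for i in range(20)}
het["s_outlier"] = 0.60
call = {s: 0.99 for s in het}
call["s0"] = 0.95  # fails the call-rate cutoff
passed = sample_qc(call, het)
```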

#### Genotype Imputation

Imputation was carried out on pre-phased data with IMPUTE2 (Howie et al., 2009) using a merged reference panel of the whole genome sequence data from the African Genome Variation Project (Gurdasani et al., 2015), the UG2G data described earlier, and the 1000 Genomes phase 3 project (1000 Genomes Project Consortium, 2015), following standard recommendations. Imputation was carried out in chunks of 2 Mb, which were then concatenated. For downstream analyses, imputed SNPs were further filtered at an info score threshold of 0.3 and a minor allele frequency (MAF) threshold of 0.5%. All duplicated sites and variants were also removed from the data. Analyses were carried out on the final set of 18,868,552 imputed variants that passed QC. This filtering removed all monomorphic variants from the data, which are based on Genome Reference Consortium human build 37 (GRCh37), also called human genome build 19 (hg19).
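The post-imputation filter can be sketched as a simple predicate. The sketch below assumes inclusive thresholds and a per-variant alternate-allele frequency as input, neither of which is stated in the text; the function and parameter names are hypothetical. It also shows why a MAF filter removes monomorphic variants, whose minor allele frequency is zero.

```python
def keep_imputed_snp(info, alt_freq, min_info=0.3, min_maf=0.005):
    """Return True if an imputed variant passes the info and MAF thresholds.

    info:     imputation info score for the variant
    alt_freq: alternate-allele frequency in the sample (0..1)
    """
    maf = min(alt_freq, 1.0 - alt_freq)  # the minor allele may be either allele
    return info >= min_info and maf >= min_maf
```

For example, `keep_imputed_snp(0.9, 0.0)` is false because a monomorphic variant has a MAF of zero, regardless of its info score.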

#### Phenotype Definition and Transformation

Fifteen FBC traits were measured using the Beckman Coulter ACT5 Diff CP hematology analyzer (**Table 1**). We carried out the inverse normal transformation of each trait residual. First, we obtained residuals after the regression of each trait on age, age<sup>2</sup>, and sex. We then inverse normally transformed the residuals for GWAS analysis.
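The second step, the inverse normal transformation, can be illustrated with a rank-based sketch. It assumes that the residuals from the regression on age, age<sup>2</sup>, and sex are already available, and it uses the Blom offset (c = 3/8) with average ranks for ties; both choices are assumptions, since the text does not specify the variant of the transformation. The residual values are hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical trait residuals, for illustration only.
residuals = [0.4, -1.2, 3.5, 0.1, -0.3]
transformed = inverse_normal_transform(residuals)
```

The transform keeps the ordering of the residuals but maps them onto quantiles of the standard normal distribution, so the extreme value 3.5 no longer dominates the scale.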

#### Evaluation for Systematic Difference Between Genotype and Sequence Data

Following the merger of imputed genotype and sequence data, we first examined whether systematic differences existed between imputed genotype data and sequence data (**Figure 2**). We carried out principal component analysis (PCA) on these data to examine whether there was separation by data mode (imputed genotype data and sequence data). We noted clear separation of the data points of imputed genotype and sequence data on PCA. In order to minimize systematic effects, we examined the 343 samples that had been genotyped and sequenced in duplicate. Using these samples, we evaluated different thresholds of concordance between sequence and imputed genotype data for identical samples, filtering out SNPs that showed a concordance <0.80 or <0.90 in the 343 samples. We found that a minimum concordance threshold of 0.90 was required to abolish the systematic effects observed between genotype array and sequence data on PCA.

TABLE 1 | A description of phenotypic traits analyzed in the total 6407 individuals in the pooled dataset.

Following exclusion of 904,283 variants (2.3% of all variants) that showed <90% concordance in genotypes between the sequence and imputed genotype data (for the 343 samples that had been both genotyped and sequenced), PCA did not show any systematic differences between imputed genotype and sequence data. We inspected the first ten PCs to ensure that systematic differences did not represent an important axis of variation in the genetic data. Following filtering, a total of 39,312,112 autosomal markers in the joint set of 6,407 samples were taken forward for analyses. For GWAS association analyses, we only included the subset of variants (n = 20,594,556) that met an MAF threshold of at least 0.5%.
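The concordance filter can be illustrated with the sketch below. It assumes hard genotype calls coded 0/1/2 for the duplicate samples, with `None` for a missing call, and an inclusive threshold; the actual concordance metric used in the study is not specified, and all names and genotypes here are hypothetical.

```python
def genotype_concordance(imputed, sequenced):
    """Fraction of duplicate samples with identical calls, ignoring missing calls."""
    pairs = [
        (a, b) for a, b in zip(imputed, sequenced) if a is not None and b is not None
    ]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def concordant_variants(imputed_by_variant, sequenced_by_variant, threshold=0.90):
    """Keep variants whose concordance across duplicate samples meets the threshold."""
    return [
        v
        for v in imputed_by_variant
        if v in sequenced_by_variant
        and genotype_concordance(imputed_by_variant[v], sequenced_by_variant[v])
        >= threshold
    ]


# Hypothetical genotypes for ten duplicate samples.
imputed = {
    "var1": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "var2": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "var3": [0, 0, 1, 1, 2, 2, 0, 0, 1, None],
}
sequenced = {
    "var1": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],  # all ten calls agree
    "var2": [0, 1, 2, 0, 1, 2, 0, 1, 1, 1],  # eight of ten calls agree
    "var3": [0, 0, 1, 1, 2, 2, 0, 0, 1, 2],  # nine comparable calls, all agree
}
kept = concordant_variants(imputed, sequenced)
```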

#### Statistical Methods for Association Analysis

We used the exact linear mixed model approach implemented in GEMMA version 24 for analysis of pooled data from 6,407 individuals in GPC. We evaluated different approaches for generation of the kinship matrix to control type I error in analysis. It has been shown that inclusion of causal SNPs in the kinship matrix can lead to overly conservative results for these SNPs and a reduction in power for GWAS discovery. In order to maximize discovery, we used the leave one chromosome out (LOCO) approach for analysis (Listgarten et al., 2012; Yang et al., 2014). In this approach, each chromosome in turn is excluded from generation of the kinship matrix used for association analysis of the markers along that chromosome. This ensures that causal SNPs at a locus on a given chromosome are not used for generation of the kinship matrix used in the analysis of that specific chromosome. Therefore, we generated 22 kinship matrices for analysis, each excluding the chromosome being analyzed using the given matrix.
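The bookkeeping behind the LOCO scheme can be sketched as follows: for every chromosome to be tested, the kinship matrix is built only from markers on the other chromosomes. The sketch covers the marker selection only, not the kinship computation or the GEMMA call, and the marker labels are hypothetical.

```python
def loco_marker_sets(markers_by_chrom):
    """Map each tested chromosome to the markers used for its kinship matrix.

    markers_by_chrom: dict mapping chromosome -> list of marker ids.
    The kinship markers for a chromosome are all markers on the other chromosomes.
    """
    return {
        test_chrom: [
            marker
            for chrom, markers in markers_by_chrom.items()
            if chrom != test_chrom
            for marker in markers
        ]
        for test_chrom in markers_by_chrom
    }


# Hypothetical marker ids on three chromosomes.
markers = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
loco = loco_marker_sets(markers)
```

Applied to the 22 autosomes, this yields 22 marker sets and hence 22 kinship matrices, as described above.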

For computational efficiency, and to avoid correlation effects due to LD, we LD-pruned the data prior to calculation of the GRM for each LOCO analysis. We carried out sensitivity analyses using different r<sup>2</sup> thresholds for pruning and examined the genomic inflation factors from QQ plots to check whether type I error was appropriately controlled. We finally used all markers with an MAF >1%, pruned to an r<sup>2</sup> threshold of 0.5 using PLINK (Purcell et al., 2007) with the flags --maf 0.01 and --indep-pairwise 100 10 0.5, where 0.01 is the minimum MAF threshold of 1% and 0.5 is the r<sup>2</sup> threshold within each 100-marker window, sliding by a step size of 10 markers during each iteration. All genomic inflation factors for traits were noted to be below 1.05 using this approach.

We also included a covariate indicating whether data originated from imputed genotyped individuals or sequenced individuals to allow for any systematic differences between the data (although the earlier PCA suggested no systematic effects in the filtered data). A MAF threshold of 0.5% was applied in the GEMMA analysis. The 20,674,434 variants that passed all quality control (QC) criteria were tested for associations using the standard univariate approach (UV-GWAS), the multivariate approach (MV-GWAS), and the principal component approach (PC-GWAS). These methods are described in the subsections "Univariate GWAS Method (UV-GWAS), Multivariate GWAS Method (MV-GWAS), and Principal Component GWAS Method (PC-GWAS)." For each analysis, the P-values were calculated using the likelihood ratio test.
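For orientation, the likelihood ratio test compares the maximized log-likelihoods of the null model (no SNP effect) and the alternative model (SNP effect included). The sketch below covers only the single-degree-of-freedom case of one SNP effect on one trait; a joint test on several traits would use more degrees of freedom, which is not covered here. It is not GEMMA's implementation, and the log-likelihood values are hypothetical.

```python
from math import erfc, sqrt


def lrt_p_value_1df(loglik_null, loglik_alt):
    """P-value of a likelihood ratio test with one degree of freedom.

    The statistic 2 * (loglik_alt - loglik_null) is referred to a chi-square
    distribution with 1 df, whose upper tail equals erfc(sqrt(stat / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return erfc(sqrt(stat / 2.0))


# Hypothetical log-likelihoods giving a statistic close to 3.84.
p_example = lrt_p_value_1df(loglik_null=-1000.0, loglik_alt=-998.079271)
```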

#### Univariate GWAS Method (UV-GWAS)

Here, we carried out a genome-wide association study of 15 FBC traits (**Table 1**) using the standard univariate approach. We examined the association between SNPs and a single trait at a time, taking into consideration issues with relatedness and population stratification. We show the distribution of association P-values for the 15 traits in QQ plots (**Supplementary Figure 1**). The genomic inflation factor for each analysis ranges from 0.99 to 1.01, suggesting that there is no genome-wide inflation due to population stratification. We show a summary of all genome-wide significant variants in **Table 2**.
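The genomic inflation factor referred to above is commonly computed as the median observed chi-square statistic divided by the median of the chi-square distribution with one degree of freedom (about 0.4549). The sketch below follows this common definition under the assumption that each P-value comes from a test with one degree of freedom; it is not taken from the study's code, and the P-values are synthetic.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df


def genomic_inflation_factor(p_values):
    """Lambda GC: median chi-square (1 df) implied by the p-values over its null median.

    Each p-value must lie in (0, 1].
    """
    nd = NormalDist()
    # For a 1-df test, chi-square = z**2 with z the normal quantile at p / 2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Synthetic, evenly spaced p-values that mimic a well-calibrated null distribution.
uniform_p = [(i + 0.5) / 999 for i in range(999)]
lambda_gc = genomic_inflation_factor(uniform_p)
```

A value close to 1, as reported for the 15 traits, indicates that the bulk of the test statistics follows the null distribution.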

#### Multivariate GWAS Method (MV-GWAS)

For the multivariate GWAS analysis, we started by testing for marker associations in a multivariate linear mixed model in GEMMA with all 15 FBC traits simultaneously, controlling for population stratification but without giving consideration to the level of correlation among these traits. We plotted the resulting P-values from this association analysis and show the Manhattan and QQ plots in **Figures 3A,B**. The Manhattan plot was unconventional (**Figure 3A**), showing genome-wide significant variants on almost every chromosome, and the QQ plot was visibly inflated, lifting off from the null line (**Figure 3B**). Since this could be due to multicollinearity, we calculated the correlation coefficients between all FBC traits (see **Figure 4**) in order to identify highly collinear variables. Hemoglobin (Hgb) was found to be highly correlated with PCV (r = 0.94), and MCH was highly correlated with MCV (r = 0.92) (full list in **Supplementary Tables 1a,b**). Repeating the analysis while excluding PCV, MCH, Hgb, and LYM showed an expected QQ plot (**Figure 4A**) and a conventional Manhattan plot with strong genetic signals on the expected chromosomes 11 and 16 (**Figure 4B**).

TABLE 2 | Description of genome-wide significant loci using the standard univariate GWAS approach.

<sup>∗</sup>Known association; AF, allele frequency.

In this analysis, we examined multiple correlated traits while accounting for relatedness and population stratification. We noted that multicollinearity, which manifests as inflated QQ plots and unconventional Manhattan plots, is driven mainly by rare variation. The inflation mostly affects variants with MAF < 1%, although not all variants causing the inflation fall into this category. Rare variants appear to be much more susceptible to unstable estimates under multicollinearity. This analysis provides a framework for combining multiple phenotypes in multivariate GWAS: we saw evidence of multicollinearity whenever the correlation between traits reached r<sup>2</sup> ≥ 0.75.
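The correlation screen used to identify collinear traits can be sketched as follows. This is a minimal illustration rather than the code used in the study: the trait values are invented toy numbers, the helper names `pearson_r` and `collinear_pairs` are hypothetical, and applying the 0.75 cut-off to the absolute value of r is an assumption about how the threshold is read.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(traits, threshold=0.75):
    """List trait pairs whose absolute correlation meets or exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(traits.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical toy measurements for three traits in six individuals.
toy_traits = {
    "Hgb": [12.1, 13.4, 14.0, 11.2, 15.3, 12.8],
    "PCV": [36.5, 40.1, 42.3, 33.9, 45.8, 38.2],
    "PLT": [250, 310, 180, 220, 270, 400],
}
flagged = collinear_pairs(toy_traits)
print(flagged)
```

In this toy example only the Hgb and PCV pair is flagged; one trait from each flagged pair would then be dropped before the multivariate model is refitted.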

#### Principal Component GWAS Method (PC-GWAS)

PCA is usually applied in GWAS to examine population structure, especially within ethnolinguistic groups. Previous studies (Biffi et al., 2010) have used PCs as covariates to correct for possible biases induced by sample collection or non-genetic geographical effects on phenotype. However, Ried et al. (2016) effectively applied PCA to four correlated anthropometric traits to capture body shape and recommended the approach for other correlated traits such as FBC traits. We explored this approach to complement the standard multivariate GWAS described in the section "Multivariate GWAS Method (MV-GWAS)." We applied PCA to the same 15 correlated FBC traits in the same transformed phenotypic dataset to generate a 15-dimensional set of uncorrelated outcome PCs (**Supplementary Table 2**). We then tested for marker associations with each PC in the univariate linear mixed model in GEMMA. We show that the FBC composite phenotype captured by each PC carries information that is not fully captured by the individual FBC phenotypes, as this approach identified genome-wide significant variants that were not identified by either the standard univariate or the multivariate GWAS.
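The construction of uncorrelated composite phenotypes can be illustrated with a small, dependency-free sketch. It is not the procedure used in the study: it assumes PCA on the trait correlation matrix, uses power iteration with deflation only to avoid external libraries, and all function names and toy values are hypothetical.

```python
from math import sqrt


def standardize(column):
    """Center a trait to mean 0 and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(z_cols):
    """Correlation matrix of already standardized trait columns."""
    n = len(z_cols[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_cols]
            for ci in z_cols]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    k = len(matrix)
    vec = [1.0 + i for i in range(k)]  # asymmetric start vector
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(k)) for row in matrix]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    m_vec = [sum(row[j] * vec[j] for j in range(k)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(vec, m_vec))  # Rayleigh quotient
    return eigenvalue, vec


def principal_components(columns, n_components):
    """Return (eigenvalues, PC score columns) for the first n_components PCs."""
    z = [standardize(c) for c in columns]
    mat = correlation_matrix(z)
    k, n = len(z), len(z[0])
    eigenvalues, scores = [], []
    for _ in range(n_components):
        lam, vec = leading_eigenpair(mat)
        eigenvalues.append(lam)
        scores.append([sum(vec[j] * z[j][i] for j in range(k)) for i in range(n)])
        # Deflation: remove the component just found before extracting the next.
        mat = [[mat[a][b] - lam * vec[a] * vec[b] for b in range(k)] for a in range(k)]
    return eigenvalues, scores


# Hypothetical toy measurements: three traits in six individuals.
toy_traits = [
    [12.1, 13.4, 14.0, 11.2, 15.3, 12.8],  # Hgb-like
    [36.5, 40.1, 42.3, 33.9, 45.8, 38.2],  # PCV-like
    [250, 310, 180, 220, 270, 400],        # PLT-like
]
eigenvalues, scores = principal_components(toy_traits, 2)
print([round(e, 3) for e in eigenvalues])
```

Each score column would then serve as a single quantitative phenotype in a univariate association test. Power iteration converges slowly when eigenvalues are close, so a numerical library would normally be preferred for real data.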

#### Significance Thresholds for Multiple Testing

Several methods, such as the Bonferroni or Šidák corrections, are available for multiple-comparison testing. In GWAS, such corrections are combined with an estimate of the effective number of independent tests, derived from the correlation structure between genetic variants, to calculate an appropriate significance threshold. In standard univariate GWAS (such as our UV-GWAS), the standard significance threshold of 5 × 10<sup>−8</sup> is mostly used. For our MV-GWAS, GEMMA appropriately adjusted for testing multiple phenotypes, so there was no need for an additional correction. For PC-GWAS, however, the Bonferroni correction for testing the 15 orthogonal phenotypes obtained from the principal components analysis of the 15 FBC phenotypes would be 5 × 10<sup>−8</sup>/15 (3.33 × 10<sup>−9</sup>). To limit the type II errors that this stringent correction could introduce, we present all results that met the standard genome-wide significance threshold of P ≤ 5 × 10<sup>−8</sup>, and we highlight those that also pass the Bonferroni-corrected threshold.
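The threshold arithmetic above is simple enough to verify directly. The sketch below reproduces the Bonferroni value and, purely for comparison, the corresponding Šidák value; the function names are hypothetical, and applying the Šidák formula to the genome-wide alpha is an illustration rather than something done in the study.

```python
from math import expm1, log1p


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Sidak threshold 1 - (1 - alpha)**(1 / n_tests), computed stably for tiny alpha."""
    return -expm1(log1p(-alpha) / n_tests)


GENOME_WIDE_ALPHA = 5e-8  # standard genome-wide significance threshold
N_PCS = 15                # number of orthogonal PC phenotypes tested

pc_gwas_bonferroni = bonferroni_threshold(GENOME_WIDE_ALPHA, N_PCS)
pc_gwas_sidak = sidak_threshold(GENOME_WIDE_ALPHA, N_PCS)
print(f"Bonferroni: {pc_gwas_bonferroni:.2e}")  # 3.33e-09
print(f"Sidak:      {pc_gwas_sidak:.2e}")
```

At thresholds this small the two corrections are practically indistinguishable, so the choice between them does not affect which loci are highlighted.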

#### RESULTS

For each strategy (UV-GWAS, MV-GWAS, PC-GWAS), we applied the typical significance threshold of P < 5 × 10<sup>−8</sup> to define association. We defined a locus as novel if it had not been associated with any FBC trait in any previous GWAS and its P-value was less than or equal to 5 × 10<sup>−8</sup>. To determine whether a locus was known or novel, we searched the NHGRI database for loci associated with FBC traits at a significance level of 5 × 10<sup>−8</sup>. This was supplemented by a literature review.

#### Results for UV-GWAS

With the UV-GWAS method, we analyzed each of the 15 traits individually and identified 4 novel association signals. This method also confirmed 4 known loci associated with blood traits (**Table 2**).

#### HBB

We identified important functional variants, such as the sickle cell variant (rs334) in the HBB gene, associated with RDW. The HBB locus was associated with RBC distribution width in our main standard univariate analysis. As previously observed in regions affected by malaria, this variant has reached high frequencies as a result of balancing selection because it provides resistance against the parasite and reduces the severity of malaria among carriers. This signal was also identified by MV-GWAS and PC-GWAS.

#### ITFG3

We found 230 genome-wide statistically significant variants in the known locus ITFG3 associated with RBC, MCV, MCH, and MCHC. Although the function of ITFG3 is not known, it is expressed in an erythroleukemia cell line, and other common SNPs in this gene have been implicated in red blood cell indices in European and Asian GWASs (Chen et al., 2013; Hodonsky et al., 2017). This signal was also identified by MV-GWAS and PC-GWAS.

#### R3HDM1

UV-GWAS identified 277 genome-wide statistically significant variants at the R3HDM1 locus on chromosome 2 associated with neutrophil count; the variant was common in African populations (MAF = 10%) and monomorphic in Europeans. This signal is reported in our study.

#### TPM4

UV-GWAS found 4 genome-wide statistically significant variants in the known locus TPM4. TPM4 plays a crucial role, in association with the troponin complex, in the calcium-dependent regulation of vertebrate striated muscle contraction (Crabos et al., 1991).

#### CTB-30L5.1 and AC008834.1


Both CTB-30L5.1 and AC008834.1 are uncharacterized and do not code for protein. CTB-30L5.1 is an RNA gene affiliated with the ncRNA class, while AC008834.1 is a processed pseudogene.

### Results for MV-GWAS

Three association loci were identified using the MV-GWAS approach, of which two (HBB and ITFG3) had been previously reported to be associated with at least one of the FBC traits (**Table 3**). These known associations were also identified using the standard univariate and PC-GWAS approaches.

#### ATF3

We identified a common variant, rs3123543, in ATF3 that is associated with blood traits (**Figure 5**). ATF3 interacts with TP53, the JunD proto-oncogene, the JUN oncogene, CEBPB, and STAT1, among others. Notably, CEBPB is an important transcriptional activator in the regulation of genes involved in hematopoiesis and in immune and inflammatory responses (Janz et al., 2006).

### Results for PC-GWAS

Five novel association signals were identified using the PC-GWAS method (**Table 4** and **Supplementary Figure 2**). It also found two known associations (HBB and ITFG3) that had previously been reported to be associated with at least one of the fifteen FBC traits. These known associations were also identified with the UV-GWAS and MV-GWAS approaches and are described in the sections "HBB" and "ITFG3."

<sup>∗</sup>Known association; AF, allele frequency.

TABLE 4 | Genome-wide significant loci using PC-GWAS approach.

<sup>∗</sup>Known association; AF, allele frequency. <sup>+</sup>Locus remains significant after applying the Bonferroni-corrected threshold of 3.33 × 10<sup>−9</sup>.

#### PDZRN4

Two genome-wide statistically significant SNPs were identified in PDZRN4 (**Figure 6**). The locus has been linked to increased epidermal growth factor receptor (EGFR) surface abundance, reduced homologous recombination repair frequency, a negative genetic interaction between MUS81−/− and MUS81+/+, decreased viability, and increased vaccinia virus (VACV) infection (Sivan et al., 2013). The gene is expressed in the lymph node, colon, bladder, and whole blood, among other organs.

#### ANKRD26

We identified 21 genome-wide significant variants at this locus (peak variant rs112505971, P-value 1.81 × 10<sup>−8</sup>) (**Figure 7**). ANKRD26 (Ankyrin Repeat Domain 26) is a protein-coding gene. The peak variant is common in Ugandan populations, with an allele frequency of 0.948 (a minor allele frequency of about 5%). In the 1000 Genomes Project, it is monomorphic in East and South Asian populations and very rare in Admixed American and European populations (MAF of 0%). In ClinVar, ANKRD26 is associated with thrombocytopenia 2, an autosomal dominant non-syndromic condition characterized by reduced numbers of normal platelets, resulting in a mild bleeding tendency (Pippucci et al., 2011).

#### TTLL11

rs4837892 in TTLL11 (tubulin tyrosine ligase-like family member 11) is associated with FBC traits (**Figure 8**). TTLL11 is expressed in 119 organs including whole blood, white blood cells, lymph node, and cervical spinal cord.

#### OTOR

We identified four novel genome-wide statistically significant variants on chromosome 20 in the gene OTOR. This gene is reported in the GWAS Catalog as associated with posttraumatic stress disorder (Xie et al., 2013).

#### COL1A1

One variant in the gene COL1A1 was associated with blood cell traits. This gene encodes the pro-alpha1 chains of type I collagen, a large molecule whose triple helix comprises two alpha1 chains and one alpha2 chain.

### Comparison of Genome-Wide Statistically Significant Association Loci Found by UV-GWAS, MV-GWAS, and PC-GWAS

Collectively, the three methods identified fifteen loci, including ten novel loci, associated with FBC traits. Two of the novel loci are intergenic variants and are not shown in **Figure 9**.

#### DISCUSSION

To assess the spectrum of genetic variants associated with FBC traits in Uganda, we applied the standard univariate method and two multivariate GWAS methods to 15 FBC traits in 6407 individuals, using pooled data from UGWAS and UG2G sequence data. Across the three methods, we identified ten novel loci: ATF3 (rs3123543) using the MV-GWAS strategy; PDZRN4 (rs7296503), ANKRD26 (rs112505971), TTLL11 (rs4837892), OTOR (rs9917425), and COL1A1 (rs3840870) using the PC-GWAS strategy; and AC008834.1 (rs7725036), CTB-30L5.1 (rs12534473), and two intergenic variants (rs142586351, rs2769976) using UV-GWAS. As a proof of concept, all three methods also identified the known loci HBB and ITFG3. In addition, only UV-GWAS identified variants at the known loci TPM4 and R3HDM1, while only PC-GWAS identified the known locus HLA-C. MV-GWAS has been reported to be especially powerful when the genetic correlation between traits differs from the environmental correlation. We suspect that this effect is absent in PC-GWAS because the PCs are built from phenotypic correlations. The two methods can therefore be sensitive to different correlation patterns between the traits. The methods complement one another and show that multivariate genotype-phenotype methods increase power and identify novel genotype-phenotype associations not found with univariate GWAS in the same dataset.

One limitation of the MV-GWAS approach is its sensitivity to highly correlated traits. Sensitivity analyses showed that multicollinearity can occur under the MV-GWAS strategy, manifesting as inflated QQ plots and unconventional Manhattan plots, driven particularly by rare variation. The inflation mostly affects variants with MAF < 1%, although not all variants causing the inflation fall into this category. Rare variants appear to be much more susceptible to unstable estimates under multicollinearity. Evidence of multicollinearity was seen whenever the correlation between traits exceeded the ±0.75 threshold in the MV-GWAS strategy. However, this approach exclusively identified the novel locus ATF3, with generally lower P-values compared with the standard univariate and PC-GWAS methods.

Although the PC-GWAS approach captured the variation across FBC traits well in this study and identified more novel loci than the other two methods, it cannot replace the standard univariate GWAS and MV-GWAS, because a number of known loci that were identified in the standard univariate GWAS (e.g., R3HDM1, TPM4) were not identified by PC-GWAS in our study.

To demonstrate the strength of these multivariate GWAS methods when used to complement each other, we collectively identified six novel loci (ANKRD26, PDZRN4, COL1A1, OTOR, TTLL11, ATF3), subject to replication, and the two methods together also identified three known association loci (HBB, ITFG3, HLA-C). These results provide evidence that multivariate genotype-phenotype methods increase power and thus identify novel genotype-phenotype associations not found with univariate GWAS in the same dataset. Although MV-GWAS yielded lower P-values, the PC-GWAS strategy found more novel loci.

These multivariate methods could maximize novel locus discovery for other correlated phenotypes, such as lipid traits, liver function, cancers, anthropometry, and immune disease. They might also help to speed up drug discovery across a range of cardiometabolic traits, as previous studies have shown that FBC traits may serve as markers of the proinflammatory state of metabolic syndrome and as promoters of atherosclerotic risk (Jesri et al., 2005; Kotani et al., 2007; Kelishadi et al., 2010).

### ETHICS STATEMENT

This study was approved by the Science and Ethics Committee of the UVRI, the Ugandan National Council for Science and Technology, and the East of England-Cambridge South NHS Research Ethics Committee, United Kingdom.

#### AUTHOR CONTRIBUTIONS

SF, DG, MS, and PK designed the study. SF performed the analyses. TC carried out the quality control and imputation. DG, MS, and PK directed the project. SF and ON wrote the manuscript. All authors contributed to the interpretation of the results and writing the article.

#### FUNDING

This work was funded by the Medical Research Council/Uganda Virus Research Institute Uganda Research Unit on AIDS core funding and Wellcome Sanger Institute (Grant No. WT098051), the National Institute for Health Research Cambridge Biomedical Research Centre, and the UK Medical Research Council (Grant MR/K013491/1). SF was funded by National Institutes of Health (NIH) grant U01MH115485 and The Uganda Medical Informatics Centre (UMIC). DG was funded by MR/S003711/1. Computational support from UMIC was made possible through funding from the Medical Research Council (MC\_EX\_MR/L016273/1).



### ACKNOWLEDGMENTS

SF wishes to acknowledge his exchange fellowship from the African Partnership for Chronic Disease Research (APCDR), MUII-plus bioinformatics grant; support of the H3Africa Bioinformatics Network (H3ABioNet) Abuja Node at the Center for Genomics Research and Innovation in Nigeria and useful personal communication with Dr. Monica Uddin, Dr. Adebowale Adeyemo, and Dr. Tinashe Chikowore. The authors wish to acknowledge the use of the UMIC computer cluster. The authors thank all the study research participants who contributed to this study.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00334/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Fatumo, Carstensen, Nashiru, Gurdasani, Sandhu and Kaleebu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Relationship Between Environmental Exposure and Genetic Architecture of the 2q33 Locus With Esophageal Cancer in South Africa

Marco Matejcic<sup>1</sup>†, Christopher G. Mathew<sup>2,3</sup> and M. Iqbal Parker<sup>1</sup>\*

<sup>1</sup> Division of Medical Biochemistry and Structural Biology, Institute for Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa, <sup>2</sup> Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa, <sup>3</sup> Department of Medical and Molecular Genetics, Faculty of Life Sciences and Medicine, King's College London, London, United Kingdom

#### Edited by:

Zané Lombard, University of the Witwatersrand, South Africa

#### Reviewed by:

Shelley Macaulay, National Health Laboratory Service, South Africa

Kathrine Elizabeth Scholtz, University of Limpopo, South Africa

#### \*Correspondence:

M. Iqbal Parker iqbal.parker@uct.ac.za

#### †Present address:

Marco Matejcic, Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA, United States

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 08 January 2019 Accepted: 12 April 2019 Published: 01 May 2019

#### Citation:

Matejcic M, Mathew CG and Parker MI (2019) The Relationship Between Environmental Exposure and Genetic Architecture of the 2q33 Locus With Esophageal Cancer in South Africa. Front. Genet. 10:406. doi: 10.3389/fgene.2019.00406

Esophageal squamous cell carcinoma (ESCC) has a high prevalence in several countries in Africa and Asia. Previous genome-wide association studies (GWAS) in Chinese populations have identified several ESCC susceptibility loci, including variants on chromosome 2q33 and 6p21, but the contribution of these loci to risk in African populations is unknown. In this study we tested the association of 10 genetic variants at these two risk loci with susceptibility to ESCC in two South African ethnic groups. Variants at 2q33 (rs3769823, rs10931936, rs13016963, rs7578456, rs2244438) and 6p21 (rs911178, rs3763338, rs2844695, rs17533090, rs1536501) were genotyped in a set of Black Xhosa (463 cases and 480 controls) and Mixed Ancestry (269 cases and 288 controls) individuals. Genotyping was performed using TaqMan allelic discrimination assays. Pearson's chi-squared test was used to compare allele frequencies between cases and controls. Gene-environment interactions with tobacco smoking and alcohol consumption were investigated in a case-control analysis. A logistic regression analysis was further performed to elucidate the independent effect of each association signal on the risk of ESCC. The 2q33 variants rs10931936, rs7578456, and rs2244438 were marginally associated with higher risk of ESCC in the Mixed Ancestry population (ORs = 1.39–1.58, p ≤ 0.035), of which rs7578456 and rs2244438 remained significant after correction for multiple testing (p < 0.005). The associations with rs7578456 and rs2244438 were also observed across strata of tobacco smoking (ORs = 1.47–2.75, p ≤ 0.035) and alcohol consumption (ORs = 1.45–2.06, p ≤ 0.085) status. However, only the association with rs2244438, which lies within an exon of TRAK2, remained significant after adjustment for the other variants in the region. Interestingly, none of the variants tested were significantly associated with ESCC in the Black South African population. These findings implicate TRAK2 as a causal gene for ESCC risk in the Mixed Ancestry population of South Africa and confirm prior evidence of population-specific differences in the genetic contribution to ESCC, which may reflect differences in genetic architecture and environmental exposure across ethnic groups.

Keywords: genetic association, single nucleotide polymorphism, esophageal squamous cell carcinoma, South African populations, major histocompatibility complex, 2q33, TRAK2

## INTRODUCTION


Esophageal squamous cell carcinoma (ESCC) is one of the most common malignancies in low- and middle-income countries and is a disease of major public health importance because of its poor prognosis and high mortality. The striking variation in the prevalence of ESCC between different ethnic groups suggests that population-specific environmental and dietary factors contribute to susceptibility to the disease. However, although individuals within a specific geographical area may be exposed to the same environmental factors and share similar dietary habits, not all of them have the same risk of developing ESCC. It is clear that a combination of genetic susceptibility and environmental risk factors, including diet, is key to the risk of developing ESCC (Dandara et al., 2006; Vogelsang et al., 2012, 2014).

Esophageal squamous cell carcinoma (ESCC) accounts for about 90% of the 456,000 esophageal cancer cases reported each year (Abnet et al., 2018), and approximately 80% of the cases worldwide occur in low-to-middle income countries (LMIC) including South Africa (Ferlay et al., 2015; Wong et al., 2018). Tobacco smoking and alcohol consumption are the major environmental risk factors in South Africa (Dandara et al., 2015).

There is also strong evidence for the role of genetic factors in the etiology of ESCC (Sampson et al., 2015). Studies investigating the association between single nucleotide polymorphisms in several drug-metabolizing genes and the risk of developing ESCC have shown clear population and ethnic variation in the risk profile. More than 200 xenobiotic-metabolizing enzymes are responsible for the metabolism and detoxification of dietary and environmental carcinogens, which, if not removed, can bind to DNA and may lead to cancer-causing mutations. Genes involved in the biosynthesis of these enzymes all carry polymorphic variants associated with altered gene expression or enzyme activity and may serve as molecular biomarkers that can provide important predictive information about carcinogenesis (reviewed in Matejcic and Iqbal Parker, 2015).

The development of genome-wide association studies (GWAS) has had a major impact on the discovery of multiple susceptibility loci for ESCC in Asian and Caucasian populations, including variants in PLCE1, C20orf54, PDE4D, RUNX1, and CASP8 (Abnet et al., 2010; Wang et al., 2010; Wu et al., 2011, 2012). However, the majority of these associations were not found in the Black and Mixed Ancestry populations of South Africa (Bye et al., 2011, 2012; Chen et al., 2019), suggesting the existence of genetic heterogeneity in the risk of ESCC.

A recent GWAS in a northern Chinese population identified common genetic variants on chromosome 2q33 that increased the risk of both ESCC and lung cancer (Zhao et al., 2017). These variants are therefore strong candidates for the study of pivotal biological mechanisms and pleiotropic effects associated with carcinogenesis. Previous GWAS and case-control studies have also identified the human major histocompatibility complex (MHC) region on chromosome 6p21 as a novel susceptibility locus for ESCC in high-risk populations from northern China (Wu et al., 2011; Shen et al., 2014; Zhang et al., 2017). Of these, the variant rs10484761, located upstream of the UNC5CL gene, was found to be significantly associated with increased risk of ESCC in the Mixed Ancestry South African population (Bye et al., 2012). Nonetheless, the contribution of other risk variants in this region to ESCC susceptibility in South African populations is unknown.

In this study we investigated whether single nucleotide polymorphisms (SNPs) at 2q33 and 6p21 reported to be associated with ESCC in northern Chinese populations also contribute to the increased risk of ESCC in South African populations. We explored the interaction between these genetic factors and environmental exposure and used haplotype analyses to investigate the combined effect of these variants on the risk of ESCC. Finally, SNPs with suggestive evidence for association were investigated in a logistic regression model to assess the independent effect of each variant on the risk. TRAK2 was identified as a causal gene for ESCC risk in the Mixed Ancestry population of South Africa, but not in the Black population of Bantu descent, which confirms prior evidence of population-specific differences in the genetic contribution to ESCC.

## MATERIALS AND METHODS

#### Study Group

The study comprised 463 ESCC patients and 480 controls from the Black population and 269 ESCC patients and 288 controls from the Mixed Ancestry population who provided blood samples at recruitment. Black subjects were mainly Xhosa speakers for the last two generations (98.6%) from the Western Cape province of South Africa, who had migrated over the past 1–2 generations from the Eastern Cape, where the majority of Xhosa speakers reside. The Mixed Ancestry subjects were from the Western Cape. This is an admixed population that originated ∼300 years ago from the union of different ethnic groups with major ancestral components from the indigenous Khoisan, Bantu-speaking Africans, Europeans and Asians. Analysis of 75,000 autosomal SNPs in the Mixed Ancestry population of the Western Cape (formerly described as the Cape Colored population) compared with populations represented in the International HapMap Project and the Human Genome Diversity Project revealed that the major ancestral components of this population are predominantly Khoisan (32–43%), Bantu-speaking Africans (20–36%), European (21–28%), and a smaller Asian contribution (9–11%). The Asian component is mainly from Indonesia, Malaysia and the Indian subcontinent (de Wit et al., 2010). All patients were histologically diagnosed with primary invasive ESCC and were recruited between 2000 and 2012 at Groote Schuur and Tygerberg Hospitals in Cape Town. The control group included healthy volunteers with no history of cancer, matched to cases for residential area, socioeconomic status, race, sex, and age.

All study participants completed a standardized questionnaire to collect demographic and lifestyle information. Data on alcohol consumption and tobacco smoking were available for both cases and controls. Alcohol drinkers were defined as subjects who consumed any alcoholic beverage at least once every week, and ever-smokers as those who had smoked at some point in their life; otherwise, subjects were defined as non-drinkers and never-smokers. Demographic and exposure data are presented in **Table 1**.

Ethical approval for the study was obtained from the joint University of Cape Town/Groote Schuur Hospital Research Ethics Committee and the University of Stellenbosch/Tygerberg Hospital Ethics Committee. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

#### Isolation and Purification of DNA

Peripheral blood samples were collected with informed consent from all participants, and DNA was extracted at the University of Cape Town using a standard protocol (Gustafson et al., 1987). All DNA samples were diluted to a final concentration of 20 ng/µl in 96-well plates and stored at −20°C until use.

#### SNP Selection and Genotyping

For the 2q33 locus, we selected five SNPs (rs3769823, rs10931936, rs13016963, rs7578456, and rs2244438) that were significantly associated (p < 0.05) with both esophageal and lung cancer risk in a northern Chinese population (Zhao et al., 2017). For the 6p21 locus, the selected SNPs were those with strongest evidence of association from published studies of ESCC in northern China (Shen et al., 2014; Zhang et al., 2017). These risk variants were investigated for their frequency in sample populations from the 1000 Genomes Project Phase 3<sup>1</sup>. We removed one SNP at 6p21 (rs6901869) that failed the frequency test (<5% frequency in African populations) as we would have limited power to detect any association of this SNP with ESCC. Since rs17533090 at 6p21 was highly correlated with rs35399661 in both Chinese and African populations (D′ = 1, r<sup>2</sup> = 1 in 1000 Genomes Project data), only rs17533090 was genotyped as a proxy for rs35399661. Finally, five SNPs from the 6p21 locus (rs17533090, rs911178, rs2844695, rs1536501, and rs3763338) were selected for genotyping.

Genotyping was performed using validated TaqMan allele discrimination assays (Applied Biosystems). Reactions were carried out in 2.5 µl volumes in 96-well plates. Each reaction contained 20 ng DNA diluted in distilled water (dH<sub>2</sub>O), 1X Universal PCR Master Mix, and 1X TaqMan SNP assay mix containing primers and TaqMan probes according to the manufacturer's protocol. The thermal cycling conditions consisted of an initial denaturation step at 95°C for 10 min followed by 40 cycles of a two-step reaction: denaturation at 92°C for 15 s and annealing/extension at 60°C for 60 s. Amplification reactions and fluorescence measurements at the PCR end-point were performed in a Roche LightCycler 480 II instrument, and genotypes were assigned using SP4 1.5.0 software (Roche). Genotype distributions in controls were tested for deviation from Hardy-Weinberg Equilibrium (HWE) using Pearson's chi-squared test with a cut-off of p < 0.001. All genotype frequencies were in HWE in both ethnic groups. Call rates for all SNPs genotyped were > 95%.
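The HWE check described above can be sketched with a Pearson chi-squared test on genotype counts. This is an illustration under stated assumptions rather than the authors' script: it assumes a biallelic SNP and a 1-degree-of-freedom test, and the function name `hwe_chi2_test` and the genotype counts are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Pearson chi-squared test (1 df) for deviation from Hardy-Weinberg Equilibrium.

    Assumes a polymorphic biallelic SNP; a monomorphic SNP would produce
    zero expected counts and a division by zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-squared variable with 1 df.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts in 480 controls.
chi2_ok, p_ok = hwe_chi2_test(270, 180, 30)    # matches HWE expectations
chi2_bad, p_bad = hwe_chi2_test(300, 120, 60)  # heterozygote deficit
HWE_CUTOFF = 0.001
print(p_ok >= HWE_CUTOFF, p_bad >= HWE_CUTOFF)
```

A SNP whose controls give a P-value below the 0.001 cut-off would be flagged as deviating from HWE and would typically be inspected for genotyping problems.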

#### Statistical Analysis

Allele frequencies in cases and controls by ancestry group were compared using Pearson's chi-squared test for association with ESCC. Odds ratios (OR) and 95% confidence intervals (CI) were estimated using the common allele as the reference. A Bonferroni-corrected P-value of < 0.005 (0.05/10) was used to determine the significance threshold for all association tests based on the number of SNPs tested. For SNPs with marginal association (p < 0.05), gene-environment interactions were investigated in a case-control analysis stratified by tobacco smoking and alcohol consumption status. Haplotypes and correlation coefficients (D′ and r<sup>2</sup>) in controls were estimated using Haploview 4.2 (Barrett et al., 2005). The haplotype analysis was performed using UNPHASED (Dudbridge, 2008).

<sup>1</sup>www.1000genomes.org

TABLE 1 | Descriptive characteristics of Black and Mixed Ancestry ESCC cases and controls.


<sup>a</sup>Age of diagnosis for cases; age of recruitment for controls. <sup>b</sup>Ever-smokers include subjects who had smoked at some point in their lives. <sup>c</sup>Not available due to missing data. <sup>d</sup>Drinkers include subjects who consumed alcohol at least once a week.

Only haplotypes with an estimated frequency in controls ≥5% were tested. A logistic regression analysis was then performed to examine the independence of the association evidence, using cancer status (affected or unaffected) as the binary dependent variable and the SNP as the independent variable, adjusting for smoking status, alcohol consumption status, and other SNPs in the locus. All reported p-values are two-sided, and statistical analyses were performed using the R statistical computing platform (version 3.4.2).
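The allelic case-control comparison described in this section can be illustrated in a few lines. The sketch below is not the authors' R code: it assumes a 2 × 2 table of allele counts, a Wald (Woolf) interval on the log odds ratio, and entirely hypothetical counts and function names.

```python
from math import erfc, exp, log, sqrt


def allelic_association(case_minor, case_major, control_minor, control_major):
    """Allelic chi-squared test (1 df) with odds ratio and 95% Wald CI.

    The common (major) allele is the reference, so OR > 1 means the minor
    allele is more frequent in cases.
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-squared with 1 df
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (exp(log(odds_ratio) - 1.96 * se_log_or),
          exp(log(odds_ratio) + 1.96 * se_log_or))
    return odds_ratio, ci, chi2, p_value


# Hypothetical allele counts: 269 cases (538 alleles), 288 controls (576 alleles).
odds_ratio, ci, chi2, p_value = allelic_association(200, 338, 160, 416)
BONFERRONI_ALPHA = 0.05 / 10  # ten SNPs tested
print(f"OR = {odds_ratio:.2f}, 95% CI = {ci[0]:.2f}-{ci[1]:.2f}")
print("significant after correction:", p_value < BONFERRONI_ALPHA)
```

The adjusted analysis with covariates and neighboring SNPs requires a logistic regression model and is not reproduced in this sketch.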

#### RESULTS

### Study Sample Characteristics

Characteristics of cases and controls by ancestry group are shown in **Table 1**. The mean age of diagnosis was similar in the Black and Mixed Ancestry samples (59.6 and 60.7 years, respectively). The male to female ratio was 0.98 in Black cases and 1.92 in Mixed Ancestry cases. In addition, Mixed Ancestry cases had higher rates of ever-smokers (94%) and alcohol drinkers (81%) compared with Black cases (61 and 62%, respectively).

#### Single Variant Association Analysis

The results for the case-control analysis in the two South African populations are summarized in **Table 2**. Three of the five 2q33 variants tested were marginally associated with a higher risk of ESCC in the Mixed Ancestry population. These were CASP8 rs10931936 (OR = 1.39, 95% CI = 1.02–1.90, p = 3.52 × 10<sup>−2</sup>), TRAK2 rs7578456 (OR = 1.58, 95% CI = 1.22–2.05, p = 2.59 × 10<sup>−4</sup>), and TRAK2 rs2244438 (OR = 1.55, 95% CI = 1.16–2.07, p = 2.30 × 10<sup>−3</sup>). Of these, rs7578456 and rs2244438 remained significant after Bonferroni multiple testing correction (P < 0.005). A suggestive association was also noted for CASP8 rs3769823 (OR = 1.26, 95% CI = 0.97–1.63, p = 0.076). Interestingly, none of these variants were associated with ESCC in the Black South African population. None of the SNPs on 6p21 tested positive for association with ESCC in either the Black or the Mixed Ancestry populations.

#### Gene-Environment Interaction Analysis

An analysis of alcohol consumption and cigarette smoking showed that there was no significant difference in association with rs10931936, rs7578456, and rs2244438 in the Mixed Ancestry sample across strata for tobacco smoking or alcohol consumption status, except for rs7578456, which was statistically significantly associated with risk in alcohol drinkers (OR = 1.61; 95% CI = 1.18–2.20, p = 0.003) but not in non-drinkers (OR = 1.55, 95% CI = 0.94–2.55, p = 0.085) (**Table 3**). As further evidence for the lack of gene-environment interactions, the effect sizes in smokers and alcohol drinkers were not substantially higher than those observed in all cases combined, while the numbers of individuals in the never-smoker and non-drinker categories were too low to provide informative risk estimates (see **Table 1**).


TABLE 3 | Case-control association results by tobacco smoking and alcohol consumption status in Mixed Ancestry South Africans<sup>a</sup>.


<sup>a</sup>Only SNPs associated with ESCC risk in the overall case-control analysis are shown. <sup>b</sup>Subjects who had smoked at some point in their life. <sup>c</sup>Minor allele frequency in cases and controls. <sup>d</sup>p-values from Pearson's chi-squared test. <sup>e</sup>Subjects who consumed any alcoholic beverage at least once every week.

#### Linkage Disequilibrium and Haplotype Analysis

Correlation coefficients (D′ and r<sup>2</sup>) and linkage disequilibrium (LD) plots for SNPs at 2q33 were computed using the African and Asian population samples from the 1000 Genomes Project (see Materials and Methods), allowing a comparison of the LD structure between these ethnic groups. This shows that Africans have a lower level of LD (r<sup>2</sup> range: 0.15–0.59) across the 2q33 locus compared with Asians (r<sup>2</sup> range: 0.51–0.96) (**Figure 1**). Notably, there was a moderate correlation between rs3769823 and rs7578456 (r<sup>2</sup> = 0.59) and between rs10931936 and rs2244438 (r<sup>2</sup> = 0.53) in Africans, suggesting that the association of these variants with ESCC risk in the current study may be driven by one or two independent association signals. No considerable correlation was observed for SNPs at the 6p21 locus (r<sup>2</sup> ≤ 0.01; data not shown).

In the Mixed Ancestry sample, a marginal association was observed for the haplotype consisting of all minor alleles at 2q33 (ATAAA; OR = 1.54, 95% CI = 1.06–2.25, p = 0.024) compared with carriers of all common alleles (GCGGG; **Supplementary Table 1**). However, no increased risk was achieved by co-occurrence of the five minor alleles on the same haplotype compared with the risk predicted in the single variant test. Haplotype analysis of the five SNPs at 6p21 showed no evidence of association with ESCC in either the Black or the Mixed Ancestry samples (p ≥ 0.09; **Supplementary Table 2**).

#### Logistic Regression Analysis

The multivariate logistic regression analysis revealed marginal associations with rs2244438 in the Mixed Ancestry sample after we controlled for the effect of the other SNPs in the region (rs3769823, rs10931936, rs13016963, rs7578456), although none of these associations were statistically significant after Bonferroni correction (highest OR = 1.19, smallest p = 0.008) (**Table 4**). Conversely, rs7578456 and rs10931936 were no longer significantly associated with risk after adjusting for rs2244438 (p = 0.399 and 0.354, respectively), indicating that rs2244438 was the variant driving the association at 2q33.

#### DISCUSSION

This study investigated the association between common genetic polymorphisms at 2q33 and 6p21 and the risk of ESCC in two South African populations. These variants were previously reported to be associated with higher risk for both ESCC and other cancer types in the Chinese population (Shen et al., 2014; Zhang et al., 2017; Zhao et al., 2017). Our analysis revealed three SNPs at 2q33 (rs10931936, rs7578456, rs2244438) that conferred an increased risk of ESCC in the Mixed Ancestry population, thus replicating the association signals identified in the Chinese study (Zhao et al., 2017). A similar pattern of association was also noted for rs3769823, but without statistical significance. Of these, rs2244438, which maps to a genomic region harboring TRAK2, was independently associated with ESCC risk after adjusting for the other SNPs at 2q33. These findings point to a single causal variant at 2q33 and suggest that multiple association signals at this locus may result from the high correlation with the causal variant. We should also note that rs2244438, unlike rs7578456 and rs10931936, lies within an exon of TRAK2 and therefore is more likely to have a functional effect on the encoded protein.

All associations observed in the Mixed Ancestry population were also detected by strata of smoking and alcohol exposure. In addition, these associations were not strengthened when the analysis was restricted to ever-smokers or alcohol drinkers relative to the full sample. Our findings are suggestive of no gene-environment interactions at the 2q33 locus, which is in line with the lack of interaction with smoking for Chinese lung cancer risk (Zhao et al., 2017). However, the proportion of never-smokers and non-drinkers in the Mixed Ancestry sample was very small, with low power to detect associations in these subgroups. Whether these two polymorphisms increase the risk of ESCC upon exposure to tobacco smoke or alcohol requires confirmation by analysis of a larger sample.

This is the first study to report that SNPs mapping to TRAK2 were significantly associated with ESCC risk in African populations, and it supports previous findings of an association between rs2244438 and ESCC and lung cancer risk in the Chinese (Zhao et al., 2017). The trafficking kinesin-binding protein 2 (TRAK2, also known as GRIF-1) is a member of a coiled-coil family of proteins with a role in regulating protein and organelle transport in cells (Brickley and Stephenson, 2011). Downregulation of this kinesin-associated protein may therefore cause dysfunctional cell signaling and potentially result in carcinogenesis. The variant rs2244438 is located within an exon with evolutionary constraints, and the G to A transition at this site results in the nonsynonymous change from threonine (Thr) to isoleucine (Ile) at residue position 528 of TRAK2. Functional effect prediction programs such as SIFT<sup>2</sup> and PolyPhen<sup>3</sup> suggested a damaging effect associated with this genetic polymorphism. Secondary structure prediction indicates that residue 528 is located in a disordered region of the molecular surface, and that the Thr to Ile change probably disturbs the adjacent secondary structure. The TRAK2 gene at 2q33 is therefore a credible candidate for containing the causal variant driving the association at this locus, and the effect of rs2244438 on gene expression will require functional follow-up to determine its potential pathogenic effect on the protein.

We also performed the first analysis of the MHC region in relation to ESCC risk in the two South African populations described in this paper. The MHC encodes a set of cell surface glycoproteins known as human leukocyte antigens (HLA), which are critical for the innate and adaptive immune response in humans (Horton et al., 2004). Loss of heterozygosity and DNA hypermethylation in the MHC region resulting in the downregulation of HLA class I and class II genes are common and well-recognized events in esophageal tumors (Nie et al., 2001; Yang et al., 2008; Zhao et al., 2011). Germline variants at this locus have been shown to confer higher risk of ESCC in Chinese populations (Shen et al., 2014; Zhang et al., 2017). However, we could not replicate such associations with these variants in South African populations, although they were observed at a high enough frequency (≥10%) to detect suggestive associations with ESCC.

Our study suggests that genetic risk variants at the 2q33 locus are shared between the Chinese populations and the Mixed Ancestry population of South Africa. The Mixed Ancestry population from the Western Cape is an admixed population that originated from the union of different ethnic groups, receiving 9–11% of the ancestral contribution from Asians (de Wit et al., 2010). Thus, it is possible that genetic risk markers commonly found in the Chinese could have been inherited in the Mixed Ancestry population of South Africa. The relatively small and variable Asian genetic component across Mixed Ancestry individuals may also explain the weaker genetic associations with ESCC commonly observed in this ethnic group compared with the Chinese (Zhao et al., 2017). Population stratification could have contributed to the associations observed in this ethnic group, but this could only be resolved by high throughput genotyping of very large numbers of SNPs with appropriate statistical correction for any differences observed (Teo et al., 2010). No evidence of association was observed in the Black population, which is consistent with our previous studies that failed to detect several associations reported in Chinese GWAS (Bye et al., 2011, 2012; Chen et al., 2019). The Black South African samples in this study are derived almost entirely from the Xhosa-speaking population, which is a genetically conserved population that received little or no ancestral contribution from other ethnic groups across generations. It is well established that African genomes have greater genetic diversity and lower LD compared with Asian and European genomes (Campbell and Tishkoff, 2008). As an example, the five SNPs at 2q33 tested in this study were in lower LD in Africans (max r<sup>2</sup> = 0.59) compared with the Chinese (max r<sup>2</sup> = 0.96), and the number of private haplotypes was higher in Black individuals as compared with Mixed Ancestry individuals. The genetic risk variants reported in the Chinese population may therefore not capture the actual causal variants in the Black African population. There is also a possibility that the causal variants may have arisen after the migration of humans out of Africa and are therefore not present in Black African populations. Finally, differences in genetic associations between Asian and African populations may in part reflect variability in environmental exposures between ethnic groups, or technical issues such as small sample sizes which are not well powered to detect modest genetic effects. Fine-mapping of ESCC susceptibility loci should be carried out to provide additional biological insights and identify the causal variants driving the association in African populations.

<sup>2</sup>http://sift.jcvi.org

<sup>3</sup>http://genetics.bwh.harvard.edu/pph

#### CONCLUSION

In conclusion, our study reports a possible association between a gene involved in cellular trafficking and ESCC in the Mixed Ancestry population from the Western Cape province of South Africa, as previously described in a Chinese population. If validated in larger independent studies, these variants may aid in the identification of individuals at high risk of developing ESCC, who would benefit from early screening and prevention strategies. These variants may also represent novel targets for functional follow-up aimed at elucidating the underlying biological mechanisms of esophageal carcinogenesis and identifying targeted therapies tailored toward ESCC patients with specific genetic markers. We did not detect associations with genetic variants at 6p21 that were previously reported in Chinese populations, thus providing further support for the lack of replication of genetic findings across ethnic groups that may reflect differences in genetic architecture and environmental exposure. Therefore, GWAS and fine-mapping studies in African populations are required to increase our understanding of the genetic contribution to ESCC and to gain further insights into the genetic heterogeneity of the disease. In addition, the combined effects of genes and common environmental factors may play a role in interpreting association data across populations and should always be considered in the study of complex diseases such as esophageal cancer.

#### ETHICS STATEMENT

Ethical approval for the study was obtained from the joint University of Cape Town/Groote Schuur Hospital Research Ethics Committee and the University of Stellenbosch/Tygerberg Hospital Ethics Committee. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

### AUTHOR CONTRIBUTIONS

MM performed the laboratory work, contributed to the statistical analysis, and wrote the first draft of the manuscript. CM revised the manuscript critically for important intellectual content. MP conceived and designed the study. All authors contributed to and read the final version of the manuscript revision.

### FUNDING

Research reported in this publication was jointly supported by the South African MRC with funds received from the National Department of Health and MRC (United Kingdom) with funds from the United Kingdom Government's Newton Fund and GSK. CM was supported by grants from the Cancer Association of South Africa (CANSA), the University of Witwatersrand Research Council and the South African National Research Foundation.

#### ACKNOWLEDGMENTS

We thank all the patients and their family members whose contributions made this work possible. We also wish to thank Daniela Mellis who assisted with the conditional logistic regression analysis.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00406/full#supplementary-material

#### REFERENCES

and tobacco smoking in oesophageal cancer risk. PLoS One 7:e36962. doi: 10.1371/journal.pone.0036962


in Chinese populations. PLoS One 12:e0177494. doi: 10.1371/journal.pone.0177494


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling Editor declared a shared affiliation, though no other collaboration, with one of the authors, CM, at the time of review.

Copyright © 2019 Matejcic, Mathew and Parker. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genetic Screening of the Usher Syndrome in Cuba

Elayne E. Santana<sup>1†</sup>, Carla Fuster-García<sup>2,3†</sup>, Elena Aller<sup>2,3</sup>, Teresa Jaijo<sup>2,3</sup>, Belén García-Bohórquez<sup>2</sup>, Gema García-García<sup>2,3</sup>, José M. Millán<sup>2,3\*‡</sup> and Araceli Lantigua<sup>4‡</sup>

<sup>1</sup> Centro Provincial de Genética, Universidad de Ciencias Médicas de Holguín, Holguín, Cuba, <sup>2</sup> Health Research Institute La Fe, University Hospital La Fe, Valencia, Spain, <sup>3</sup> Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER-ISCIII), Madrid, Spain, <sup>4</sup> Centro Nacional de Genética Médica, Havana, Cuba

Background: Usher syndrome (USH) is a recessive inherited disease characterized by sensorineural hearing loss, retinitis pigmentosa, and sometimes, vestibular dysfunction. Although the molecular epidemiology of Usher syndrome has been well studied in Europe and the United States, there is a lack of studies in other regions such as Africa or Central and South America.

#### Edited by:

Zané Lombard, University of the Witwatersrand, Johannesburg, South Africa

#### Reviewed by:

Miguel Carballo, Hospital Terrassa, Spain Claudio Graziano, Sant'Orsola-Malpighi Polyclinic, Italy

> \*Correspondence: José M. Millán millan\_jos@gva.es

†These authors have contributed equally to this work as first authors

‡These authors have contributed equally to this work as last authors

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 17 January 2019 Accepted: 07 May 2019 Published: 22 May 2019

#### Citation:

Santana EE, Fuster-García C, Aller E, Jaijo T, García-Bohórquez B, García-García G, Millán JM and Lantigua A (2019) Genetic Screening of the Usher Syndrome in Cuba. Front. Genet. 10:501. doi: 10.3389/fgene.2019.00501

Methods: We designed a NGS panel that included the 10 USH causative genes (MYO7A, USH1C, CDH23, PCDH15, USH1G, CIB2, USH2A, ADGRV1, WHRN, and CLRN1), four USH associated genes (HARS, PDZD7, CEP250, and C2orf71), and the region comprising the deep-intronic c.7595-2144A>G mutation in USH2A.

Results: NGS was performed in 11 USH patients from Cuba. All the cases were solved. We found the responsible mutations in the USH2A, ADGRV1, CDH23, PCDH15, and CLRN1 genes. Four mutations have not been previously reported. Two mutations are recurrent in this study: c.619C>T (p.Arg207<sup>∗</sup>) in CLRN1, previously reported in two unrelated Spanish families of Basque origin, and c.4488G>C (p.Gln1496His) in CDH23, first described in a large Cuban family. Additionally, c.4488G>C has been reported two more times in the literature in two unrelated families of Spanish origin.

Conclusion: Although the sample size is very small, it is tempting to speculate that the gene frequencies in Cuba are distinct from other populations mainly due to an "island effect" and genetic drift. The two recurrent mutations appear to be of Spanish origin. Further studies with a larger cohort are needed to elucidate the real genetic landscape of Usher syndrome in the Cuban population.

Keywords: retinitis pigmentosa, sensorineural hearing loss, Usher syndrome, deaf-blindness, molecular genetics

## INTRODUCTION

Usher syndrome (USH, OMIM 276900, OMIM 276905, OMIM 605472, ORPHA: 886) is the most prevalent deaf-blindness of genetic origin. It is a recessive inherited disease characterized by sensorineural hearing loss (HL), visual loss due to retinitis pigmentosa (RP), and, in some cases, vestibular dysfunction. Prevalence estimates range from 3.2 to 6.2/100,000 (Espinós et al., 1998; Keats and Corey, 1999).

Patients with USH are classified into three clinical subtypes (USH1, USH2, or USH3), based on the severity and progression of hearing impairment and the presence or absence of vestibular dysfunction. Usher syndrome type I (USH1) is the most severe type, characterized by severe to profound congenital sensorineural hearing loss, vestibular dysfunction, and prepubertal onset of RP eventually leading to legal blindness. USH2 is characterized by moderate to severe hearing impairment, normal vestibular function and later onset of retinal degeneration. USH3 displays progressive hearing loss, RP and variable vestibular phenotype (Saihan et al., 2009; Millán et al., 2010).

Currently, up to 13 genes have been associated with Usher syndrome: MYO7A, USH1C, CDH23, PCDH15, USH1G, and CIB2 are responsible for USH1, although the role of CIB2 in Usher syndrome has recently been called into question (Booth et al., 2018). USH2A, ADGRV1, and WHRN are the three genes responsible for USH2, and the CLRN1 gene is the only one associated with USH3 cases to date. In addition, PDZD7 has been reported to behave as a modifier of the retinal phenotype in conjunction with USH2A, and as a contributor to digenic inheritance with ADGRV1 (Ebermann et al., 2010). HARS was also postulated as a novel causative gene of USH3, based on a mutation found in two patients (Puffenberger et al., 2012). Finally, mutations in CEP250 have been reported to cause cone-rod dystrophy, isolated RP and atypical forms of USH, characterized by early onset hearing loss and mild RP (Khateb et al., 2014; Fuster-García et al., 2018; Kubota et al., 2018).

In recent years, next generation sequencing (NGS) techniques have revolutionized molecular genetic diagnosis, making whole genome, whole exome and targeted gene sequencing more feasible, and making the identification of disease genes and the underlying mutations easier, faster and more cost-effective. NGS has been especially useful in genetically heterogeneous diseases, such as hearing loss or retinal dystrophies (Choi et al., 2013; Fu et al., 2013; Mutai et al., 2013; Glöckle et al., 2014). We previously developed a targeted next generation sequencing method for Usher syndrome that proved to be highly efficient (Aparisi et al., 2014; Fuster-García et al., 2018).

Although the molecular epidemiology of Usher syndrome and the distribution of disease-causing mutations among these genes have been well studied in Europe and the United States, there is a lack of studies in other regions such as Africa or Central and South America.

Here, we present for the first time a molecular landscape of Usher syndrome in Cuba, and we also provide a clinical description of all the cases.

#### MATERIALS AND METHODS

#### Patients

A descriptive cross-sectional study was carried out in a series of 11 families from Holguín (Cuba) with patients clinically diagnosed with Usher syndrome. All 11 patients were Caucasian. The family trees are shown in **Figure 1**.

The variables collected in this study were: age, sex, ethnicity, birthplace of the patients and their ancestors, consanguinity, age at onset of HL and at diagnosis, HL degree, age at the first symptoms of RP and current clinical stage, and vestibular function. The institutional board of both the Ethics Committee of the University Hospital La Fe and the University of Holguín approved the study, according to the tenets of the Declaration of Helsinki and its revisions. A survey administered by the researchers was completed after informed consent had been signed.

Ophthalmological examination included visual acuity, funduscopy, visual field test with Goldmann perimetry, and electroretinogram (ERG). The audiological examination consisted of the vestibular function study through the caloric test and the study of brainstem auditory evoked potentials (BAEP).

Hearing loss evaluation was carried out using a radio audiometer MA31 (Grosses Klinisches Audiometer, Germany) in the Hospital "Vladimir Ilich Lenin." The BAEPs were obtained in response to monaural stimulation through TDH-39 earphones, with condensation clicks with a duration of 100 µs and an intensity of 95 dB pSPL. The hearing loss of each affected individual was quantified by performing a complete tonal audiometry. Hearing loss was classified as: mild (20–40 dB), moderate (40–70 dB), severe (70–90 dB), or profound (more than 90 dB).
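The severity classes above amount to a simple threshold lookup, sketched below. Because the stated ranges share their boundary values (40, 70 and 90 dB), the sketch assumes that a boundary value is assigned to the milder class; the function name and the "normal" label for values below 20 dB are likewise assumptions made for illustration, not part of the study protocol.

```python
def classify_hearing_loss(threshold_db: float) -> str:
    """Map a hearing threshold in dB to the severity classes listed in the text.

    Assumption: a value on a shared boundary (40, 70 or 90 dB) is assigned
    to the milder class; the text does not say how boundaries are handled.
    """
    if threshold_db < 20:
        return "normal"  # label for values below the mild range is an assumption
    if threshold_db <= 40:
        return "mild"
    if threshold_db <= 70:
        return "moderate"
    if threshold_db <= 90:
        return "severe"
    return "profound"  # more than 90 dB


for value in (30, 55, 80, 95):
    print(value, classify_hearing_loss(value))
```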

Peripheral blood was obtained and DNA was extracted in the National Center for Medical Genetics in Havana, and sent to the University Hospital La Fe in Valencia (Spain).

#### Targeted Exome Sequencing Design

We designed a customized AmpliSeq panel using the Ion AmpliSeq Designer tool from Thermo Fisher Scientific<sup>1</sup> to generate the targeted library, composed of all exons included in all isoforms with 10 bp padding of the flanking intron regions, and the additional locus comprising the c.7595-2144A>G intronic mutation (Vaché et al., 2012). These target regions were covered by 810 amplicons of 125–175 bp in length, giving a total panel size of 147.95 kb. The designed panel (**Table 1**) included 14 genes: 10 USH causative genes (MYO7A, USH1C, CDH23, PCDH15, USH1G, CIB2, USH2A, ADGRV1, WHRN, and CLRN1) and four USH associated genes (HARS, PDZD7, CEP250, and C2orf71).

#### Sequence Enrichment and Next Generation Sequencing

The amplification of the targets was performed according to the Ion AmpliSeq Library Kit 2.0 protocol (Thermo Fisher Scientific, Inc.) for Ion Torrent sequencing. The sequencing was carried out with a theoretical minimum coverage of 500× either on the PGM or Proton system.

#### Variant Filtering and Analysis

The resulting sequencing data were analyzed with the Ion Reporter Software tool<sup>2</sup> against the human genome assembly GRCh37 (also known as hg19). The annotated variants were filtered according to a Minor Allele Frequency (MAF) value ≤0.01, their annotation in dbSNP<sup>3</sup>, their description in the Usher syndrome mutation database<sup>4</sup> and the mutation type.

<sup>1</sup>www.ampliseq.com

<sup>2</sup>https://ionreporter.thermofisher.com/ir/

<sup>3</sup>https://www.ncbi.nlm.nih.gov/SNP/

<sup>4</sup>https://grenada.lumc.nl/LOVD2/Usher\_montpellier/

The disease-causing and suspected pathogenic variants were validated through conventional Sanger sequencing. For this, each DNA locus comprising a selected mutation was amplified by PCR with specific primers, and both forward and reverse strands were sequenced using the Big Dye 3.1 Terminator Sequencing Kit (Thermo Fisher Scientific, Inc.) after enzymatic PCR clean up with illustra ExoProStar 1-Step (GE Healthcare Life Sciences). The purified sequence products were analyzed on a 3500xL ABI instrument (Applied Biosystems by Thermo Fisher Scientific, Inc.).

Chr, chromosome number. <sup>∗</sup>Region of the USH2A PE (Pseudo-exon 40) where mutation c.7595-2144A>G is located. Targets are arranged according to the size of the covered region.

The novel variants found in the cohort of probands were categorized based on the guidelines of the clinical and molecular genetics society<sup>5</sup> and the Unknown Variants classification system (see text footnote 4) as pathogenic, probably pathogenic (UV4), possibly pathogenic (UV3), possibly non-pathogenic (UV2), and neutral (UV1), according to the type of mutation, bioinformatic predictions and segregation analysis. The four novel mutations were frameshift or nonsense mutations. Hence, they were automatically classified as pathogenic variants.

The annotation of the variants was performed according to the following isoform reference sequences for each gene: MYO7A (NM\_000260.3), USH1C (NM\_153676), CDH23 (NM\_022124.5), PCDH15 (NM\_033056.3), USH1G (NM\_173477), CIB2 (NM\_006383.2), USH2A (NM\_206933), ADGRV1 (NM\_032119.3), WHRN (NM\_015404), CLRN1 (NM\_174878), HARS (NM\_002109), PDZD7 (NM\_001195263.1), CEP250 (NM\_007186.4), and C2orf71 (NM\_001029883.2).

#### MLPA Complementary Analysis

To determine whether apparently homozygous mutations were in fact heterozygous variants masked by a large deletion on the other allele, we performed multiplex ligation-dependent probe amplification (MLPA; MRC-Holland) analysis for the only USH genes for which assays were available, USH2A and PCDH15.

#### RESULTS

Eleven index cases diagnosed with Usher syndrome from the province of Holguín, Cuba, were screened for mutations in the USH-associated genes of our custom-designed panel.

Details of the genes, number of amplicons and coverage are described in **Table 1**.

Five cases were diagnosed with USH1, four cases with USH2, and two cases were difficult to classify clinically. All eleven cases were solved and the specific causative mutations can be found in **Table 2**.

Six families were consanguineous (54.5%) and another two were probably consanguineous (18.2%), since the parents come from the same small village. In total, the consanguinity or probable consanguinity in the cohort is over 70%.

Among the USH1 cohort, two pathogenic mutations were found in CDH23 (US-4, US-5, US-6, US-7, and US-11). In the USH2 cohort, two pathogenic mutations were found in ADGRV1 (US-2) and USH2A (US-16 and US-9), and PCDH15 (US-10). Regarding the unclassified cases, two mutations were found in CLRN1 (US-8 and US-12). Patient US-2, who carried the mutation c.15448\_15449delCT in homozygosis in the ADGRV1 gene, also carried the c.3242G>A (p.Arg1081Gln) missense mutation in CDH23 in heterozygous state, which is predicted to be probably damaging by PolyPhen-2 and benign by SIFT and PROVEAN.

Four mutations are reported in this study for the first time, namely c.15448\_15449delCT (p.Leu5150Hisfs<sup>∗</sup>6) in ADGRV1, c.7730\_7734delTCAGT (p.Phe2577Serfs<sup>∗</sup>28) in CDH23, c.1624G>T (p.Glu542<sup>∗</sup>) in CDH23, and c.3661C>T (p.Gln1221<sup>∗</sup>) in PCDH15.

Two mutations have been found in several USH alleles in this study. The p.Arg207<sup>∗</sup> mutation in CLRN1 was found in homozygous state in two different families, both of them consanguineous. This represents 18.2% of the total mutated alleles and 40% among the non-USH1 mutated alleles. Among the USH1 cases, p.Gln1496His accounted for 80% of the USH1 alleles (eight out of 10) and 36.4% of the total USH alleles. All the USH1 patients bear mutations in CDH23.
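The allele fractions quoted above follow from simple counting, as the sketch below illustrates. It assumes that each of the 11 index cases contributes two mutated alleles and takes the genotype counts from the case descriptions (two p.Arg207<sup>∗</sup> homozygotes; three p.Gln1496His homozygotes plus two compound heterozygotes); the variable and function names are illustrative only.

```python
# Assumption: each index case contributes two mutated alleles.
N_CASES = 11
N_USH1_CASES = 5
total_alleles = 2 * N_CASES       # 22 alleles in the whole cohort
ush1_alleles = 2 * N_USH1_CASES   # 10 alleles among USH1 cases

# p.Arg207* (CLRN1): homozygous in two families (US-8 and US-12).
arg207_alleles = 2 * 2
# p.Gln1496His (CDH23): three homozygotes (US-5, US-6, US-11)
# plus one allele in each compound heterozygote (US-4, US-7).
gln1496his_alleles = 3 * 2 + 2 * 1


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(arg207_alleles, total_alleles))      # p.Arg207* among all alleles
print(pct(gln1496his_alleles, ush1_alleles))   # p.Gln1496His among USH1 alleles
print(pct(gln1496his_alleles, total_alleles))  # p.Gln1496His among all alleles
```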

The sequences of each mutation are shown in **Figure 2**.

MLPA assays in the patients US-9, US-10, and US-16, with homozygous mutations in either USH2A or PCDH15, revealed no copy number variations.

#### Clinical Description

The clinical features of the 11 index patients are shown in **Table 3**.

#### Mutation: c.15448\_15449delCT (p.Leu5150Hisfs∗6) in ADGRV1

Proband of family US-2: The subject comes from a non-consanguineous family (father from Mexico and mother from Cuba) and displays a typical USH2 phenotype. She presented with a postlingual moderate non-progressive HL, no vestibular dysfunction and postpubertal onset of RP. This patient carries the mutation p.Leu5150Hisfs<sup>∗</sup>6 in ADGRV1 in homozygosis.

<sup>5</sup>https://www.emqn.org/emqn/Best+Practice


TABLE 2 | Genetic findings of the patients screened in this study, mutations, their effect on the protein, genes mutated, and nature of the mutations.

#### Mutation: c.619C>T (p.Arg207<sup>∗</sup>) in CLRN1

Proband of family US-8: Patient coming from a consanguineous family, harboring the mutation p.Arg207<sup>∗</sup> in CLRN1 in homozygous state. She has a postlingual moderate HL with a progression in the last 10 years. This subject is 81 years old and the progression of the HL may be due to age-related hearing impairment. She noticed nyctalopia at 8 years old and the visual field was much reduced by the age of diagnosis. She did not report any balance problems.

Proband of family US-12: Patient carries the p.Arg207<sup>∗</sup> mutation in CLRN1 in homozygous state. The family is also consanguineous, since the parents are second cousins. HL is postlingual and severe. RP signs were similar to those of US-8, yet with a later age of onset of symptoms and reduced visual field at age 24. ERG is abolished for this subject. In addition, the delayed walking onset and the reported difficulties in holding the head up as a baby suggest balance dysfunction.

#### Mutation: c.1841-2A>G (p.Gly614Aspfs∗6) in USH2A

Proband of family US-16: The subject has a typical USH2A phenotype with a moderate postlingual non-progressive HL and typical RP with onset at puberty.

#### Mutation: c.2299delG (p.Glu767Serfs∗21) in USH2A

Proband of family US-9: The patient harbors the most common mutation in USH2A patients of European origin, namely c.2299delG in USH2A. She displays a typical USH2 phenotype, milder than that of US-16, with a mild postlingual HL and later onset of RP symptoms.

#### Mutation: c.3661C>T (p.Gln1221<sup>∗</sup>) in PCDH15

Proband of family US-10: This patient carries the p.Gln1221<sup>∗</sup> mutation in PCDH15 in homozygous state. PCDH15 is associated with the USH1 phenotype; however, this subject displayed postlingual moderate HL, normal vestibular function and relatively late onset of RP.

#### Mutation: c.7730\_7734delTCAGT (p.Phe2577Serfs∗28) in CDH23

Proband of family US-4: The patient is a compound heterozygote for the CDH23 mutations p.Phe2577Serfs<sup>∗</sup>28 and p.Gln1496His. He displays a typical USH1 phenotype with a prelingual, severe hearing loss, RP onset at puberty and vestibular dysfunction.

#### Mutation: c.1624G>T (p.Glu542<sup>∗</sup>) in CDH23

Proband of family US-7: Compound heterozygote for the CDH23 mutations p.Glu542<sup>∗</sup> and p.Gln1496His. Symptoms are distinctive of a typical USH1 phenotype with a prelingual severe HL, early onset of RP and vestibular dysfunction.

#### Mutation: c.4488G>C (p.Gln1496His) in CDH23

Besides the compound heterozygotes US-4 and US-7, who carry p.Gln1496His together with other CDH23 mutations, three more patients carry the mutation in homozygous state, namely those from families US-5, US-6, and US-11. All of them displayed a typical USH1 phenotype.

## DISCUSSION

In this work, we report the first study in a cohort of Usher syndrome patients from Cuba. We found a total of eight mutations in 11 cases, four of which are novel (p.Leu5150Hisfs<sup>∗</sup>6 in ADGRV1, p.Phe2577Serfs<sup>∗</sup>28 and p.Glu542<sup>∗</sup> in CDH23, and p.Gln1221<sup>∗</sup> in PCDH15).

The presence in homozygosis of p.Gln1221<sup>∗</sup> in PCDH15 led to a typical USH2 phenotype with a severe HL of postlingual onset, no vestibular dysfunction and late-onset RP, despite the causative mutation being a nonsense variant. Although it is not common, mutations in genes that usually lead to USH1 but cause a USH2 phenotype, and vice versa, have been reported (Bonnet et al., 2011; Aparisi et al., 2014; Fuster-García et al., 2018).

The mutations c.1841-2A>G (p.Gly614Aspfs<sup>∗</sup>6) and c.2299delG (p.Glu767Serfs<sup>∗</sup>21) in USH2A have been reported many times in the literature as pathogenic in many populations.

Notably, two mutations are recurrent in this study. The c.619C>T mutation (p.Arg207<sup>∗</sup>) in CLRN1 was described by García-García et al. and Licastro et al. almost simultaneously in two a priori unrelated Spanish families of Basque origin and one family of Italian origin, respectively (García-García et al., 2012; Licastro et al., 2012). This mutation was found in homozygous state in two Cuban families. In the first family reported by García-García et al., the only affected member carried the p.Arg207<sup>∗</sup> mutation together with p.Tyr63<sup>∗</sup>. The patient displayed bilateral severe progressive sensorineural HL corrected with hearing aids and was a candidate for cochlear implantation. She showed a delay in gait development and vestibular hyporeflexia, and she had displayed typical symptoms of RP from a young age. The onset of her RP was at 9 years old, including night blindness and peripheral visual loss. Fundus ophthalmoscopy showed pigmentary anomalies typical of RP with a visual acuity of 0.4 in both eyes and a rapid progression of the visual loss.

In the second family there were two affected sibs who were compound heterozygotes for p.Arg207<sup>∗</sup> and p.Ile168Asn. They displayed very discordant phenotypes. One brother had a typical RP and normal speech acquisition and motor milestones. At 13 years old he displayed a progressive bilateral HL that ranged 79–80 dB in the last clinical examination, and the vestibular function was normal. The other brother presented with a typical RP as well, but displayed a prelingual severe HL that required deaf school education.



These findings illustrate the impressively wide spectrum of sensorineural hearing impairment in type and degree, and the high degree of intersubject and intrafamilial variability due to CLRN1 mutations, as previously reported (Pennings et al., 2003).

The other mutation, c.4488G>C (p.Gln1496His) in CDH23, was described by Bolz et al. (2001) in a large Cuban family. That study allowed the identification of the CDH23 gene as responsible for Usher syndrome type 1. Although c.4488G>C is a missense mutation (p.Gln1496His), the G>C change affects the last nucleotide of the exon, and computational predictions and in vitro studies support the hypothesis of a splicing alteration leading to a truncated protein (Bolz et al., 2001).

Additionally, c.4488G>C has been reported two more times in the literature in two unrelated families of Spanish origin showing a typical USH1 phenotype (Astuto et al., 2002; Oshima et al., 2008).

It is noteworthy that the frequency of the mutated genes varies significantly when compared to other countries. In most populations MYO7A is the most prevalent gene among USH1 patients accounting for about 50% of the cases, except in some endogamic populations (Roux et al., 2011; Le Quesne Stabej et al., 2012; Glöckle et al., 2014; Yoshimura et al., 2014; Bonnet et al., 2016; Dad et al., 2016; Eandi et al., 2017; Sun et al., 2018). However, all the USH1 patients in this cohort carry mutations in CDH23. Furthermore, c.4488G>C accounts for 80% of USH1 alleles and no MYO7A mutations were detected in the cohort.
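The 80% figure can be re-derived from the genotypes listed in the Results. The snippet below is a minimal sketch of that arithmetic: the dictionary layout and the function name are illustrative assumptions, while the genotypes themselves are those reported for families US-4, US-5, US-6, US-7, and US-11.

```python
# CDH23 genotypes of the five USH1 probands, as listed in the Results.
# The data structure is an illustrative choice, not part of the original study.
ush1_genotypes = {
    "US-4": ("p.Phe2577Serfs*28", "p.Gln1496His"),   # compound heterozygote
    "US-7": ("p.Glu542*", "p.Gln1496His"),           # compound heterozygote
    "US-5": ("p.Gln1496His", "p.Gln1496His"),        # homozygote
    "US-6": ("p.Gln1496His", "p.Gln1496His"),        # homozygote
    "US-11": ("p.Gln1496His", "p.Gln1496His"),       # homozygote
}


def allele_fraction(genotypes, allele):
    """Return the share of all alleles in `genotypes` that equal `allele`."""
    alleles = [a for pair in genotypes.values() for a in pair]
    return alleles.count(allele) / len(alleles)


fraction = allele_fraction(ush1_genotypes, "p.Gln1496His")
print(f"p.Gln1496His: {fraction:.0%} of USH1 alleles")
```

Eight of the ten USH1 alleles carry p.Gln1496His (c.4488G>C), which gives the 80% quoted above.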

No conclusions can be drawn from the USH2 mutation distribution given the small size of the sample. In two of the three clear USH2 patients the disease is caused by mutations in USH2A, whereas the remaining case is due to a mutation in ADGRV1. Both USH2A mutations have been reported many times in the literature, with c.2299delG being the most frequent USH2 mutation in populations of European origin (Dreyer et al., 2000).

The frequency of Usher syndrome due to mutations in CLRN1 in our sample is 18% (two out of 11), considerably higher than the 5% or less in other populations. Usher syndrome resulting from mutations in CLRN1 is rare except in Finland and among Ashkenazi Jews, and its high frequency among USH3 patients in these populations is due to founder mutations (Joensuu et al., 2001; Ness et al., 2003). Here, the apparently high frequency of CLRN1 is attributable to the presence of another unique mutation that probably has a Spanish origin.
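To illustrate how much uncertainty a proportion of two out of 11 carries, the following sketch computes a 95% Wilson score interval for it. This calculation is not part of the original analysis; the choice of the Wilson interval and the function name are assumptions made here purely for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Two CLRN1 cases among the 11 solved cases reported in this cohort.
clrn1_low, clrn1_high = wilson_interval(2, 11)
print(f"CLRN1 share: 18% (95% Wilson interval {clrn1_low:.0%} to {clrn1_high:.0%})")
```

Under these assumptions the interval runs from roughly 5% to 48%, which is consistent with the caution expressed about the small sample size.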

It must be remarked that most of the mutations found in this study are homozygous, yet it is possible that these are in fact heterozygous variants occurring together with a large deletion, even where consanguinity is present. MLPA could be performed for mutations in USH2A and PCDH15, but there is no kit available to analyze the other implicated genes ADGRV1, CLRN1, and CDH23.

Segregation analysis would also help to resolve this issue and to confirm whether the compound heterozygous mutations are indeed in trans and, thus, causative of the disease. However, DNA samples from the relatives were not available.

Although the sample size is very small, it is tempting to speculate that the gene frequencies in Cuba are distinct from other populations, mainly due to an "island effect" and genetic drift. Further studies with a larger sample comprising different geographical regions of Cuba are needed to elucidate the real genetic landscape of Usher syndrome in the Cuban population.

#### ETHICS STATEMENT

The institutional boards of the Ethics Committees of the University Hospital La Fe and the University of Holguín approved the study, in accordance with the tenets of the Declaration of Helsinki and its revisions.

#### AUTHOR CONTRIBUTIONS

JM and AL conceived, designed, and supervised the study. AL provided the samples. ES did the clinical data curation. CF-G, GG-G, and BG-B performed the molecular experiments and analyzed the sequencing data. EA and TJ did the results validations. JM and GG-G obtained the funding. ES and CF-G wrote the initial manuscript. JM, AL, and GG-G reviewed and edited the manuscript.

### FUNDING

This work was financially supported by a grant of the Institute of Health Carlos III (ISCIII; Ref.: PI16/00539). CF-G is a recipient of a fellowship from the ISCIII (Ref.: IFI14/00021).

#### ACKNOWLEDGMENTS

We sincerely acknowledge the patients for their voluntary participation.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer CG declared a past co-authorship with several of the authors, TJ and JM, to the handling Editor.

Copyright © 2019 Santana, Fuster-García, Aller, Jaijo, García-Bohórquez, García-García, Millán and Lantigua. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Hydroxyurea-Induced miRNA Expression in Sickle Cell Disease Patients in Africa

Khuthala Mnika<sup>1</sup>, Gaston K. Mazandu<sup>1,2</sup>, Mario Jonas<sup>1</sup>, Gift D. Pule<sup>1</sup>, Emile R. Chimusa<sup>1</sup>, Neil A. Hanchard<sup>3</sup> and Ambroise Wonkam<sup>1</sup>\*

<sup>1</sup> Division of Human Genetics, Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa, <sup>2</sup> African Institute for Mathematical Sciences, Cape Town, South Africa, <sup>3</sup> Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States

#### Edited by:

Zané Lombard, University of the Witwatersrand, Johannesburg, South Africa

#### Reviewed by:

Neneh Sallah, London School of Hygiene and Tropical Medicine (LSHTM), United Kingdom Fan Jin, Zhejiang University, China

#### \*Correspondence:

Ambroise Wonkam ambroise.wonkam@uct.ac.za

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 16 November 2018 Accepted: 10 May 2019 Published: 28 May 2019

#### Citation:

Mnika K, Mazandu GK, Jonas M, Pule GD, Chimusa ER, Hanchard NA and Wonkam A (2019) Hydroxyurea-Induced miRNA Expression in Sickle Cell Disease Patients in Africa. Front. Genet. 10:509. doi: 10.3389/fgene.2019.00509

Hydroxyurea (HU) is clinically beneficial in sickle cell disease (SCD) through fetal hemoglobin (HbF) induction; however, the mechanism of HU is not yet fully elucidated. Selected miRNAs have been associated with HU-induced HbF production. We investigated differential HU-induced global miRNA expression in the peripheral blood of adult SCD patients from Congo living in South Africa. We found that 22 of the 798 miRNAs evaluated were differentially expressed under HU treatment, with the majority (13/22) being functionally associated with HbF-regulatory genes, including BCL11A (miR-148b-3p, miR-32-5p, miR-340-5p, and miR-29c-3p), MYB (miR-105-5p), KLF-3 (miR-106b-5), and SP1 (miR-29b-3p, miR-625-5p, miR-324-5p, miR-125a-5p, miR-99b-5p, miR-374b-5p, and miR-145-5p). This preliminary study provides potential additional miRNA candidates for therapeutic exploration.

Keywords: sickle cell disease, fetal hemoglobin, hydroxyurea, miRNA, Africa

## INTRODUCTION

Hydroxyurea (HU), the only Food and Drug Administration (FDA)-approved treatment for sickle cell disease (SCD), is beneficial primarily through its ability to induce fetal hemoglobin (HbF) (Platt et al., 1984; Charache et al., 1992; Zimmerman et al., 2004). Clinical trials have shown hydroxyurea to be efficacious for increasing HbF in children, adolescents, and adults with SCA (Charache et al., 1992; Lee and Ambros, 2001; Thornburg et al., 2009). However, the precise mechanism by which HU induces HbF in patients with SCA is not fully defined. Three main molecular pathways have been reported in the HU-mediated increase in HbF: (i) epigenetic modifications and transcriptional events, (ii) signaling pathways, and (iii) post-transcriptional pathways with regulation by small non-coding RNA oligonucleotides (miRNA) (Pule et al., 2015).

miRNAs have emerged as ubiquitous and potent molecular regulators that modulate the expression of many protein-coding genes by inhibiting mRNA translation (Lee and Ambros, 2001; Friedman et al., 2009). Multiple miRNAs have been implicated in the regulation of cell differentiation and maturation during hematopoiesis and erythropoiesis (Havelange and Garzon, 2010; Lawrie, 2010; Zhao et al., 2010). A few studies have demonstrated post-transcriptional regulation of HU-mediated γ-globin expression through miRNA in SCD patients; for example, miR-15a and miR-16-1 have been linked via the transcription factor MYB3 to elevated HbF (Zhu et al., 2014; Pule et al., 2016), and expression of miR-26b and miR-151-3p has been associated with HbF levels at the maximum tolerated dose (MTD) (Walker et al., 2011).

Studies have shown that miRNA expression in erythrocytes accounts for the majority of miRNA expression in whole blood (Juzenas et al., 2017). Because recent studies have identified miRNAs in mature erythrocytes that may reflect miRNA-regulated processes during early erythropoiesis (Walker et al., 2011), we investigated differential HU-induced miRNA expression using peripheral blood isolated from SCA patients before starting HU and after reaching the MTD. We identified novel miRNA expression changes after HU treatment, and their associated pathways, which mainly implicate HbF-regulatory genes. Our findings thus provide novel insights into the post-transcriptional mechanisms of action of HU.

#### MATERIALS AND METHODS

#### Ethics Statement

The study was performed in accordance with the Declaration of Helsinki and with the approval of the Faculty of Health Sciences Human Research Ethics Committee, University of Cape Town (HREC Ref. No. 132/2010). Written informed consent was obtained from all patients, all of whom were adults (>18 years).

#### Patients and HU Exposure

Ten patients, denoted GS01 to GS10, were enrolled in this study, all attending the adult hematological clinic of Groote Schuur Hospital in Cape Town (South Africa). All consenting patients were selected, and socio-demographic and clinical data were collected by means of a structured questionnaire. Adult SCA patients were interviewed, and their medical records were reviewed to delineate their clinical features over the past 3 years. Anthropometric variables [body mass index (BMI)] and blood pressure (BP) were measured in the outpatient setting. No incentive was provided for participation in the study. Only patients who were in a steady clinical state, without a current acute event such as vaso-occlusive painful crisis, and who had not received a blood transfusion or been hospitalized in the past 6 weeks were included. The hematological measures were those reported at the first visit to the hospital (**Supplementary Table S1**). Two patients, GS01 and GS04, were investigated at two stages: before HU administration and after HU at MTD (indexed as H); six patients were already on HU at MTD at the time of the study (GS02, GS03, GS07, GS08, GS09, and GS10); and lastly, two patients (GS05 and GS06) had never been on HU.

#### Molecular Method

#### Genotyping: Sickle Cell Disease Mutation, β-Globin Gene Cluster Haplotypes, and 3.7 kb α-Globin Gene Deletion

DNA was extracted from peripheral blood following the instructions of a commercial kit [QIAamp<sup>®</sup> DNA Blood Maxi Kit (Qiagen, United States)]. Molecular analysis to determine the presence of the sickle mutation was carried out by polymerase chain reaction (PCR), followed by DdeI restriction analysis (Saiki et al., 1985). Using published primers and methods, five restriction fragment length polymorphism (RFLP) sites in the β-globin gene cluster were amplified to analyze the HBB haplotype background (Bitoungui et al., 2015). The 3.7 kb α-globin gene deletion was screened using expand-long template PCR, as previously reported (Rumaney et al., 2014).

#### RNA Extraction and miRNA Expression Profile

Total RNA was isolated using the miRNeasy kit according to the manufacturer's protocol (QIAGEN, Hilden, Germany) and sequenced by the Genomic and RNA Profiling Core at Baylor College of Medicine, United States, using the NanoString Platform (NanoString Technologies, Inc., Seattle, WA, United States), according to the manufacturer's instructions. miRNA expression profile analyses were performed using the significance analysis of microarrays (SAM) tool (Thusher et al., 2001). A cross-sectional analysis of differential expression was performed between all the patients without HU and those under HU at MTD (**Table 1** and **Figure 1A**). In addition, a pair-wise analysis was performed for patients GS01 and GS04, before and after treatment with HU at MTD, for each patient alone (**Figures 1B1–B4**) and for both patients together (**Figure 1C**), looking mainly for miRNAs that were over- or under-expressed, using the paired Wilcoxon rank test. Differences in expression counts of differentially expressed miRNAs were tested using one-factor analysis of variance (ANOVA), after normalizing different samples based on their Fisher-Pearson skewness coefficient scores (Doane and Seward, 2011), adjusted for multiple comparisons with the significance level set to 0.05. We refer interested readers to the **Supplementary File (Section S2, sub-section 2)** for more information. Specifically, for the pair-wise analysis, we extracted sets of over- and under-expressed miRNAs in different sample pairs (e.g., GS01-GS01\_H, GS01-GS04\_H, and GS04-GS01\_H) using Pearson chi-square scores, and these sets were assessed using sample randomization to check whether the identified sets of over- and under-expressed miRNAs were larger than expected by chance (Yocgo et al., 2017).
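The Methods name the Fisher-Pearson skewness coefficient as the basis for sample normalization without spelling out the procedure in this section. As a minimal sketch, the coefficient itself, defined as the third central moment divided by the second central moment raised to the power 3/2, can be computed as shown below. The function name and the toy counts are illustrative assumptions and are not the authors' pipeline or data.

```python
def fisher_pearson_skewness(values):
    """Fisher-Pearson coefficient of skewness: g1 = m3 / m2**1.5,
    where m_k is the k-th central moment of the sample (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Toy expression counts (hypothetical, for illustration only).
symmetric = fisher_pearson_skewness([1, 2, 3, 4, 5])
right_skewed = fisher_pearson_skewness([1, 1, 1, 10])
print(symmetric, right_skewed)
```

A value near zero indicates a symmetric count distribution, while a positive value indicates a long right tail, which is common for expression counts.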

TABLE 1 | Differentially expressed microRNAs between SCD patients on HU and off HU in cross-sectional analysis.

microRNA-ID Fold-change q-value (%) p-values


In order to avoid a possible residual effect of previous HU exposure, patient GS01 was removed from this analysis because of non-compliance at initial administration of HU, which he stopped for 3 months before resuming treatment to achieve MTD.

FIGURE 1 | Profiles of hydroxyurea-induced miRNA expression in selected SCD patients. Panel (A). Heat map of the miRNA expression profile for patients that are under HU and off HU in the cross-sectional analysis. Panels (B,C). miRNA expression profile for patients that are under HU and off HU, excluding the relapsed patient, in the pair-wise analysis of patients GS01 and GS04. Using the paired Wilcoxon rank test, we found a significant difference between expression profiles of the two states (p = 0.032e-2), with distinctive miRNAs differentially over- or under-expressed in patients GS01 (Panel B1) and GS04 (Panel B3), as shown in the heat map (Panel B2) and the kernel density distribution (Panel B4) plots. Panel (C). miRNA expression levels of 29 miRNAs that are over- or under-expressed in GS04-HU vs. GS01. Panel (D). Networks linked to over- and under-expressed miRNAs under HU treatment. In the figure, 13 miRNAs mainly target four genes (BCL11A, MYB, KLF3, and SP1) that belong to the same network and influence erythropoiesis and HbF expression.

### Bioinformatics Pathway Analysis: HU Effects and Identifying Potential Biological Targets

All the differentially expressed miRNAs in the cross-sectional analysis (**Figure 1A**) and the miRNAs over- or under-expressed in the pair-wise analysis (**Figures 1B1,B2,C**) were used to retrieve potential post-transcriptionally regulated gene targets from the miRTarBase database (Chou et al., 2015), which stores experimentally validated miRNA-target interactions. For specific miRNAs that were over- and/or under-expressed in different sample pairs, we performed enrichment analyses using Gene Ontology (GO) process terms, the protein GO Annotation (GOA) mapping, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway datasets, in order to identify enriched biological processes and pathways in which gene targets are involved (Mazandu and Mulder, 2013).
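The exact enrichment statistic is described in the cited reference rather than here. One widely used way to test whether a set of target genes is over-represented in a pathway is the one-sided hypergeometric test, sketched below under that assumption. The function name, its parameters, and the toy numbers are hypothetical and do not reproduce the authors' computation.

```python
from math import comb


def hypergeom_enrichment_p(population, in_pathway, drawn, overlap):
    """P(X >= overlap) for X ~ Hypergeometric.

    population: number of background genes
    in_pathway: background genes annotated to the pathway
    drawn:      number of miRNA target genes under study
    overlap:    target genes that fall in the pathway
    """
    upper = min(in_pathway, drawn)
    # math.comb returns 0 when k > n, so impossible terms vanish on their own.
    favourable = sum(
        comb(in_pathway, i) * comb(population - in_pathway, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, drawn)


# Toy example: 10 background genes, 5 in the pathway, 5 targets, all 5 overlapping.
p_toy = hypergeom_enrichment_p(10, 5, 5, 5)
print(p_toy)
```

In this toy case only one of the 252 possible draws gives a complete overlap, so the p-value is 1/252.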

## RESULTS

### Patients' Description

Ten patients were investigated, all migrants from the Democratic Republic of Congo, with a median age of 25 years (95% CI: 23–26). All patients are homozygous for the sickle cell mutation (HbSS). Patients receiving HU had higher HbF levels than those without HU treatment (13 vs. 4.4%). Most patients had at least one Bantu haplotype in the beta-globin gene cluster; four patients were heterozygous for the 3.7 kb alpha-globin gene deletion; detailed clinical characteristics are shown in **Supplementary Table S1**.

### Cross-Sectional Analysis of the miRNA Expression Profiling

A total of 829 miRNAs were sequenced, and the 798 that passed quality control were analyzed (see **Supplementary File, Section S2** for more details). The cross-sectional analysis identified 8 miRNAs differentially (over-) expressed, with statistical characteristics shown in **Table 1** and expression levels shown in the heat map in **Figure 1A**.

### Pair-Wise Analysis of Differential Over-Under-Expressed miRs in Two SCD Patients

Comparing the states with and without HU exposure, we found a significant difference between the expression profiles (p-value = 0.03266e-2), with 12 and 10 distinctive miRNAs differentially (over- or under-) expressed in patients GS01 and GS04, respectively (**Figures 1B1–B4**). In order to elucidate miRNAs that influence the difference between the two patients' expression level profiles, we investigated miRNAs that are over- or under-expressed in GS04\_H vs. GS01. A total of 29 miRNAs met these criteria; most were under-expressed in GS04\_H (**Figure 1C**).
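For readers unfamiliar with the paired Wilcoxon test used for this comparison, the sketch below shows how the signed-rank statistic and a two-sided p-value can be obtained. It is a simplified illustration, assuming a normal approximation without tie or continuity corrections; the function name and the before/after counts are hypothetical and are not the study data.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Paired Wilcoxon signed-rank test (normal approximation, no corrections).

    Returns (W_plus, W_minus, two_sided_p). Zero differences are discarded.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_two_sided


# Hypothetical expression counts for six miRNAs before and after treatment.
before_hu = [10, 12, 9, 14, 11, 13]
after_hu = [12, 15, 13, 19, 17, 20]
w_plus, w_minus, p_value = wilcoxon_signed_rank(before_hu, after_hu)
print(w_plus, w_minus, round(p_value, 4))
```

With only a handful of pairs an exact test is usually preferable to the normal approximation; the approximation is used here only to keep the example short.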

### Genes Targets and Biological Pathways of miRNAs That Are Differentially Expressed Under HU Treatment

Next, we used miRNAs that were differentially expressed in the cross-sectional analysis of all patients, alongside over- and under-expressed miRNAs identified in the pair-wise analysis of GS01 and GS04, to retrieve potential post-transcriptionally regulated genes using datasets extracted from the miRTarBase database (Chou et al., 2015). We found 13 miRNAs that mainly targeted four genes (BCL11A, MYB, KLF3, and SP1) belonging to the same network and predicted to influence erythropoiesis and HbF expression (**Figure 1D**); most of these miRNAs were under-expressed with exposure to HU at MTD (**Supplementary Table S4**).

Additionally, we used genes targeted by miRNAs that were differentially expressed to identify enriched biological processes and pathways in which targeted genes are involved. We mostly found associations with cancer pathways. Other enriched biological pathways identified were pyrimidine metabolism (p-value = 0.00986), pathogenic Escherichia coli infection (p-value = 0.00072), and oxidative phosphorylation (p-value = 0.00032). The enriched biological process identified after Bonferroni correction for multiple testing was miRNA-mediated inhibition of translation, the main post-transcriptional mode of action of miRNA (GO:0035278, adjusted p-value = 0.0274).
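The Bonferroni correction mentioned above multiplies each raw p-value by the number of tests performed and caps the result at 1. The sketch below shows the arithmetic on toy values; the function name, the p-values, and the number of tests are hypothetical, since the number of terms tested in the study is not stated in this section.

```python
def bonferroni_adjust(p_values, n_tests=None):
    """Bonferroni-adjusted p-values: min(1, p * number_of_tests)."""
    if n_tests is None:
        n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Toy raw p-values for three tested terms (hypothetical, not study results).
toy_raw = [0.001, 0.02, 0.04]
toy_adjusted = bonferroni_adjust(toy_raw)
significant = [p < 0.05 for p in toy_adjusted]
print(toy_adjusted, significant)
```

In this toy case only the first term remains below the 0.05 threshold after adjustment, which illustrates how conservative the correction becomes as the number of tested terms grows.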

## DISCUSSION

The present study is the first to investigate in vivo miRNA expression in SCD patients in Africa exposed to HU. miRNA expression in erythrocytes is different from that of reticulocytes and leukocytes, but contributes the majority of the miRNA expression in whole blood (Chen et al., 2008; Juzenas et al., 2017). This supports the practical approach of using peripheral blood in this study. Most of the miRNAs found to be differentially expressed under HU treatment in the current study were also previously shown to be preferentially expressed in erythrocytes of SCD patients (Chen et al., 2008).

A major finding of the present study is the identification of specific and novel miRNAs that target HbF-regulating genes (**Figure 1D**), i.e., miR-125b (SP1), miR-199a, miR-7e, miR-106a, and miR-106b (KLF3), miR-140, miR-146, miR-188, miR-143, miR-125a, miR-19b, and miR-105 (MYB), miR-23b and miR-29a (BCL11A and SP1). We replicated previous findings that miR-148a, miR-29a, and miR-151-3p are differentially expressed in CD71+ erythroid cells, both before HU and after HU treatment at MTD in SCD-HbSS patients (Walker et al., 2011). Several other miRNAs are able to increase γ-globin gene expression, such as Lin28B and miR-486-3p, with the let-7 family participating in the regulation of the fetal-to-adult erythroid development process by increasing γ-globin gene expression through inhibitory effects on BCL11A (Lee et al., 2013; Ginder, 2015). miR-15a/16-1 restrain the MYB factor, which then causes loss of the inhibitory effect on the γ-globin gene and induces HbF in early erythroid progenitors (Sankaran et al., 2011).

Multiple miRNAs that target SP1 and KLF3 were differentially expressed under HU treatment; several of these are novel (**Figure 1D**) and will require further functional investigation. KLF3 and SP1 are transcription factors that belong to the family of β-like globin gene transcription regulators, which act by binding to the LCR regions of the ε, γ, and β-globin promoters (Hu et al., 2007). SP1 has been shown to be the main target for miR-23a, which increases γ and ε globin expression by SP1 inhibition and repression. KLF3, a negative regulator of the erythropoiesis process, is also specifically inhibited by miR-27a (Ma et al., 2013). Therefore, this translational study provides additional candidate miRNAs that may contribute to globin gene expression and subsequent HbF production, and thus stand as prospects for future post-transcriptional therapeutic approaches that could minimize the alterations of the whole cellular transcriptome and related HU side effects.

The association of differentially expressed miRNAs with cancer pathways might arise because cancer pathways are over-represented in the supporting literature. Other enriched biological pathways included associations with pathogenic Escherichia coli infection, pyrimidine metabolism, and oxidative phosphorylation. These pathways could be related to the known increased susceptibility to bacterial infection in patients with SCD, to folic acid metabolism, which is important for erythropoiesis, or to the oxidative stress associated with recurrent vaso-occlusive crisis (VOC), and deserve additional investigation in much larger samples.

There are a few limitations to the present study, the first of which is the modest sample size, which might result in false-positive associations and over-claiming of significance. With a larger sample size, it is also possible that additional miRNAs and biological pathways would be identified. We have provided a simulation of the statistical power needed in future studies in Section S3 of the **Supplementary File**. Even though the current pilot study did not achieve the expected statistical power because of its modest sample size, the results obtained are consistent with the literature, biologically relevant, and provide strong hypotheses for future studies. The second possible limitation is that by analyzing miRNAs that are likely from late-stage erythroblasts instead of erythroid progenitors from the bone marrow, epigenetic or molecular changes resulting from hydroxyurea treatment may have been missed. Lastly, the observed associations with targeted HbF gene regulators in the pathway analysis do not provide direct evidence linking miRNA expression with HbF production. Despite these limitations, the significant associations of a limited number of differentially expressed miRNAs that potentially target HbF gene regulators provide preliminary hypothesis-generating results that can be used to design future functional experiments. These results also emphasize the need for future studies to investigate epigenetic processes in mechanisms of HbF expression and induction.

#### CONCLUSION

The study has shown that the global analysis of microRNA expression in peripheral blood of SCD patients, in the African context, can provide valuable insights into the mechanism of action of HU treatment. The study has identified novel HU-induced miRNA that specifically target HbF regulatory genes (BCL11A, MYB, KLF-3, and SP1), and are therefore strong candidates for post-transcriptional therapeutic exploration in SCD.

#### ETHICS STATEMENT

The study was performed in accordance with the Declaration of Helsinki and with the approval of the Faculty of Health Sciences Human Research Ethics Committee, University of Cape Town (HREC Ref. No. 132/2010). Written informed consent was obtained from all patients, all of whom were adults (>18 years).

### AUTHOR CONTRIBUTIONS

AW conceived and designed the experiments. AW, KM, GP, and NH performed the experiments. AW, GP, and KM carried out patient recruitment and the collection and processing of samples and clinical data. MJ, GM, EC, KM, NH, and AW analyzed the data. AW, EC, MJ, and GM contributed reagents, materials, and analysis tools. KM, GP, GM, and AW wrote the manuscript. All authors revised and approved the manuscript.

#### FUNDING

The molecular experiments of the study were funded by the National Health Laboratory Services (NHLS), South Africa, and the NIH, United States, grant number 1U01HG007459-01, and NIH/NHLBI U24HL135600 to AW.

### ACKNOWLEDGMENTS

We thank all the patients who participated in the study.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00509/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Mnika, Mazandu, Jonas, Pule, Chimusa, Hanchard and Wonkam. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Potential Role of Regulatory Genes (*DNMT3A, HDAC5,* and *HDAC9*) in Antipsychotic Treatment Response in South African Schizophrenia Patients

*Kevin Sean O'Connell<sup>1</sup>, Nathaniel Wade McGregor<sup>1</sup>\*, Robin Emsley<sup>2</sup>, Soraya Seedat<sup>2</sup> and Louise Warnich<sup>1</sup>*

*<sup>1</sup> Department of Genetics, Stellenbosch University, Stellenbosch, South Africa, <sup>2</sup> Department of Psychiatry, Stellenbosch University, Tygerberg, South Africa*

#### *Edited by:*

*Zané Lombard, University of the Witwatersrand, South Africa*

#### *Reviewed by:*

*Monica Uddin, University of South Florida, United States Boer Xie, St. Jude Children's Research Hospital, United States*

*\*Correspondence: Dr. Nathaniel Wade McGregor nwm@sun.ac.za*

#### *Specialty section:*

*This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics*

*Received: 31 October 2018 Accepted: 18 June 2019 Published: 10 July 2019*

#### *Citation:*

*O'Connell KS, McGregor NW, Emsley R, Seedat S and Warnich L (2019) The Potential Role of Regulatory Genes (DNMT3A, HDAC5 and HDAC9) in Antipsychotic Treatment Response in South African Schizophrenia Patients. Front. Genet. 10:641. doi: 10.3389/fgene.2019.00641*

Despite advances in pharmacogenetics, the majority of heritability for treatment response cannot be explained by common variation, suggesting that factors such as epigenetics may play a key role. Regulatory genes, such as those involved in DNA methylation and transcriptional repression, are therefore excellent candidates for investigating antipsychotic treatment response. This study explored the differential expression of regulatory genes between patients with schizophrenia (chronic and antipsychotic-naïve first-episode patients) and healthy controls in order to identify candidate genes for association with antipsychotic treatment response. Seven candidate differentially expressed genes were identified, and four variants within these genes were found to be significantly associated with treatment response (*DNMT3A* rs2304429, *HDAC5* rs11079983, and *HDAC9* rs1178119 and rs11764843). Further analyses revealed that two of these variants (rs2304429 and rs11079983) are predicted to alter the expression of specific genes (*DNMT3A*, *ASB16*, and *ASB16-AS1*) in brain regions previously implicated in schizophrenia and treatment response. These results may aid in the development of biomarkers for antipsychotic treatment response, as well as novel drug targets.

Keywords: schizophrenia, epigenetics, neuropsychiatric genetics, gene expression, treatment response

## INTRODUCTION

The onset of schizophrenia is marked by a first psychotic episode, typically followed by subsequent relapse episodes, separated by intervals of remission (Lieberman et al., 2001). Diagnosis remains difficult due to a heterogeneity of symptoms as well as symptom overlap with other disorders (Tandon et al., 2009). Furthermore, treatment strategies are not optimal (Brandl et al., 2014), and it is estimated that approximately half of all patients with schizophrenia will not respond satisfactorily to antipsychotics (Lohoff and Ferraro, 2010), which are the mainstay of treatment and, as such, widely used. They are effective for positive symptoms (such as delusions and hallucinations); however, their efficacy for negative symptoms (such as apathy, anhedonia, and social withdrawal) is limited (Leucht and Davis, 2017). Moreover, antipsychotics are known to result in a number of adverse drug reactions (ADRs), including motor abnormalities (Chowdhury et al., 2011) and metabolic deficits (Tandon et al., 2010). These ADRs are often severe and long lasting, resulting in reduced compliance and diminished positive outcomes (Brandl et al., 2014). Considering the high rate of nonresponders to treatment and the potential severe side effects of treatment, there is a clear need to improve our understanding of antipsychotic treatment response.

Pharmacogenetics, the study of the effects of genetic variation on treatment outcomes, has been moderately successful in explaining inter-individual variability in treatment response. Variation within the dopaminergic pathway has been extensively investigated, and several variants within dopamine receptor genes are associated with treatment response (Lencz et al., 2010) and ADRs (Bakker et al., 2008). In addition, variation within genes encoding drug-metabolizing enzymes has yielded similar findings of association (Lohoff and Ferraro, 2010; Brandl et al., 2014). Larger, hypothesis-free genome-wide association studies (GWAS) have also associated common variation with antipsychotic treatment response (Liou et al., 2012; Zhang and Malhotra, 2013); however, there has been little validation or replication of these associations, and their biological relevance remains to be determined (Liou et al., 2012; Zhang and Malhotra, 2013). In addition to these challenges, the majority of treatment response heritability is not explained by common variation, suggesting that other factors must also play a role (Manolio et al., 2009). This underscores the complexity and multi-factorial nature of treatment response since common and rare genetic factors, environmental factors, and gene-environment interactions need to be considered (Manolio et al., 2009; Majchrzak-Celińska and Baer-Dubowska, 2017).

Epigenetics refers to molecular mechanisms that determine inherited cellular phenotypes without alteration of the genotype (Wu et al., 2012). These mechanisms include various molecular processes, such as histone modification, nucleosome remodeling, non-coding RNAs, and DNA methylation (Wu et al., 2012). Both unique and overlapping altered epigenetic modifications are associated with schizophrenia etiology and pathogenesis as well as antipsychotic treatment (Swathy and Banerjee, 2017). As such, an extremely complex, multi-directional relationship needs to be considered in studying the role of epigenetics in treatment response (pharmacoepigenetics), since genes that are implicated may be regulated by epigenetic modification independent of disease etiology, treatment outcomes, and/or ADRs (Kurita et al., 2012; Melas et al., 2012; Tang et al., 2014; Majchrzak-Celińska and Baer-Dubowska, 2017). For example, alterations of DNA methylation profiles (Melas et al., 2012; Tang et al., 2014) and altered chromatin structure (Kurita et al., 2012) may influence treatment response.

Although specific gene–environment interactions are required for epigenetic modifications to occur, it is important to note that specific genes directly or indirectly produce proteins and other products necessary for these modifications. For example, the *DNMT1* gene encodes DNMT1, which is responsible for the maintenance of existing methylation patterns during cell division (Bostick et al., 2007), while *DNMT3A* and *DNMT3B* encode the enzymes responsible for establishing *de novo* methylation patterns (Okano et al., 1999). Regulatory genes, such as those involved in DNA methylation and transcriptional repression, are therefore excellent candidates for investigation with regard to antipsychotic treatment response.

The aims of this study were, therefore, to identify candidate regulatory genes that may be involved in antipsychotic treatment response, and then to determine if variation within these genes is associated with treatment outcome. In order to address these aims, we first sought to identify regulatory genes that were differentially expressed between patients with schizophrenia (including chronically medicated patients and drug-naïve first-episode patients) and healthy controls. By identifying genes differentially expressed between the chronically medicated patients and both the drug-naïve first-episode patients and healthy controls, we identified a subset of candidate genes that may be implicated in treatment response independent of schizophrenia pathogenesis. Variation within these candidate genes was then investigated in an independent cohort for association with treatment response over time.

### MATERIALS AND METHODS

An outline of the approach used to perform this study is provided in **Figure 1**.

### Participants for Gene Expression Analyses

The cohort consisted of 20 unrelated, age-matched male South African participants of South African colored (SAC) descent [10 patients with schizophrenia (SCZ), 26.4 ± 7.9 years old, and 10 healthy controls (CON), 26.5 ± 7.6 years old]. Patients with SCZ were further divided into two equal groups consisting of five first-episode patients (FES) (24.8 ± 10.9 years old) and five chronic patients (CHR) (28.0 ± 4.1 years old). The FES patients were recruited and sampled within 1 week of their first episode of psychosis and were subsequently administered flupenthixol decanoate (Fluanxol, Lundbeck, Copenhagen, Denmark). The CHR patients were recruited 6.2 ± 0.4 years after their first episode, and all were treated with flupenthixol decanoate (Fluanxol, Lundbeck, Copenhagen, Denmark) in addition to other psychotropic medications (**Supplementary Table 1**). All patients were diagnosed using the *Diagnostic and Statistical Manual of Mental Disorders*, Fourth Edition, Text Revision (DSM-IV TR) (American Psychiatric Association, 2000) diagnostic criteria for SCZ, schizophreniform disorder, or schizo-affective disorder. Written informed consent was obtained from all patients, or their caregivers, prior to the study, and ethical approval was granted by the Human Research and Ethics Committee (HREC), Faculty of Medicine and Health Sciences, Stellenbosch University (N13/08/115).

### Total RNA Isolation and cDNA Synthesis

Whole blood was collected from all participants by venipuncture of a forearm vein into PAXgene Blood RNA tubes (Qiagen, California, USA), which were stored at −20°C until processed.

Total RNA was extracted using the PAXgene Blood RNA Kit IVD according to the manufacturer's instructions (Qiagen, California, USA). All samples were eluted in 80 µl of elution buffer and stored at −80°C until further analysis. RNA yield and quality were assessed using an Agilent Model 2100 Bioanalyzer (Agilent Technologies, California, USA) and a DropSense 16 spectrophotometer (TRINEAN, Belgium). All samples had 260/280 > 2.0 and RIN > 7.0.

Reverse transcription was performed using the High-Capacity cDNA Reverse Transcription Kit with RNase Inhibitor (Applied Biosystems, California, USA), according to the manufacturer's specifications. Briefly, for each sample, 100 ng of RNA was added to 2 μl of random and oligo (dT) primers in a final reaction of 20 μl. These reaction tubes were then placed in the GeneAmp® PCR Systems 2700 (Applied Biosystems, California, USA) thermocycler at 25°C for 10 min, 37°C for 120 min, and 85°C for 5 min to inactivate the reverse transcriptase enzyme. The cDNA samples were then stored at −20°C until analyzed.

### Quantitative Real-Time PCR

Quantitative real-time PCR (qRT-PCR), to determine the relative mRNA abundance, was performed using the StepOnePlus Real-Time PCR System and SDS Software version 2.3 (Applied Biosystems, California, USA). The commercially available fluorescence-based TaqMan® Human DNA Methylation and Transcriptional Repression microarray plates (Applied Biosystems, California, USA) were used to assess the relative mRNA content of 27 regulatory genes (**Table 1**). Each sample was analyzed in two independent experiments to control for experimental bias using the following PCR conditions: a 10-min heat activation step (95°C) followed by 40 cycles of 15 s at 94°C and 1 min at 63°C. Fluorescence data, indicative of the amount of PCR product, were captured at each cycle. The relative mRNA concentrations were then calculated based on the cycle number at which the threshold quantity of PCR product was reached (Ct). The RefFinder online tool was used to assess the stability of potential housekeeping genes (https://www.heartcure.com.au/reffinder) (Xie et al., 2012), and *GAPDH* was identified as the most stable. Ct values were therefore normalized to values of *GAPDH* and expressed relative to this control (Livak and Schmittgen, 2001).
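The normalization step can be illustrated with the 2^-ΔΔCt calculation described by Livak and Schmittgen (2001). The following is a minimal sketch only: the Ct values, variable names, and function names are hypothetical and are not taken from the study data, and averaging ΔCt per group is an assumption about how the comparison might be set up.

```python
from statistics import mean


def delta_ct(ct_target, ct_reference):
    """Normalize a target-gene Ct to the reference gene (GAPDH in this study)."""
    return ct_target - ct_reference


def fold_change(case_delta_cts, control_delta_cts):
    """Relative expression of cases versus controls by the 2^-ddCt method."""
    dd_ct = mean(case_delta_cts) - mean(control_delta_cts)
    return 2 ** (-dd_ct)


# Hypothetical (target Ct, GAPDH Ct) pairs, for illustration only.
patients = [(26.0, 18.0), (25.5, 18.5)]
controls = [(27.0, 18.0), (27.5, 18.5)]

patient_dct = [delta_ct(t, r) for t, r in patients]  # [8.0, 7.0]
control_dct = [delta_ct(t, r) for t, r in controls]  # [9.0, 9.0]

# A lower normalized Ct in patients means higher relative expression.
print(fold_change(patient_dct, control_dct))
```

A fold change above 1 indicates higher expression in the case group than in the control group; a value of exactly 1 indicates no difference.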

## Gene Expression Analyses

Differential gene expression between the CON and SCZ groups was determined using unpaired t-tests or Mann-Whitney U tests where appropriate. Additionally, one-way analysis of variance (ANOVA) and Tukey's *post hoc* tests were used to determine differences between the CON, FES, and CHR groups. The false discovery rate (FDR) correction (Benjamini and Hochberg, 1995) was used to correct for multiple testing in the CON *vs*. SCZ and CON *vs*. FES *vs*. CHR analyses (27 genes; FDR < 0.01). A significance threshold of p < 0.05 was used for the Tukey's *post hoc* tests since these were only performed in the case of a significant (FDR < 0.01) ANOVA result. Candidate genes for genetic association analyses with treatment response were selected if they met the following three criteria: i) significant differences in gene expression between the CHR and CON groups, ii) no significant differences in gene expression between the FES and CON groups, and iii) significant differences in gene expression between the CHR and FES groups.

TABLE 1 | Relative expression levels for DNA methylation and transcriptional repression genes between control participants and schizophrenia patients.

*Bold typeset indicates significant differences [false discovery rate (FDR) < 0.01 or p < 0.05]. Data are presented as mean ± standard deviation.*

*CON, control participants; SCZ, schizophrenia patients; FES, first episode schizophrenia patients; CHR, chronic schizophrenia patients; n.d., not determined.*

*1CON vs. SCZ, 2CON vs. FES vs. CHR, 3CON vs. FES, 4CON vs. CHR, 5FES vs. CHR. #Genes included for association analyses with treatment response.*
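The Benjamini-Hochberg correction applied to the per-gene tests can be sketched as follows. This is an illustrative implementation of the standard step-up procedure, not the authors' code; the gene labels and p-values are hypothetical, and only the FDR < 0.01 cut-off comes from the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes, for illustration only.
raw_p = {"gene_a": 0.0004, "gene_b": 0.03, "gene_c": 0.002, "gene_d": 0.2}

adjusted_p = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
significant = [gene for gene, q in adjusted_p.items() if q < 0.01]
print(significant)
```

In the study the same correction would span all 27 genes on the array, so the adjustment for any single gene depends on the full set of p-values rather than on that gene alone.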

#### Participants for Genetic Association Analyses

The patient cohort included 103 unrelated South African FES patients meeting DSM-IV TR (American Psychiatric Association, 2000) diagnostic criteria for SCZ, schizophreniform disorder, or schizo-affective disorder (80% SAC, 12% Xhosa, and 8% European descent) and have been described previously (Drogemöller et al., 2014; Chiliza et al., 2015; Ovenden et al., 2017; O'Connell et al., 2018). All patients received treatment with flupenthixol decanoate (Fluanxol, Lundbeck, Copenhagen, Denmark), a long-acting injectable antipsychotic, according to a fixed protocol. Treatment response was assessed using the Positive and Negative Syndrome Scale (PANSS) (Kay et al., 1987) over a period of 12 months, with measurements taken biweekly for the first 6 weeks, and every 3 months thereafter. Written informed consent was obtained from all patients, or their caregivers, prior to the study, and ethical approval was granted by HREC, Faculty of Medicine and Health Sciences, Stellenbosch University (N06/08/148).

#### Genetic Association Analyses

Variants within these candidate genes, obtained from each National Center for Biotechnology Information gene page (https://www.ncbi.nlm.nih.gov/gene/), were mined from available genome-wide genotype data. All 103 FES patients described above were previously genotyped with the Infinium OmniExpressExome-8 Kit (Illumina, California, USA) in accordance with the standard Illumina protocol. Variants were excluded from downstream analysis if they exhibited high rates of missing genotype data (> 5%), if their minor allele frequency (MAF) was < 1%, or if they showed departure from Hardy-Weinberg equilibrium (p < 1 × 10⁻⁴). For this study, only variants with a MAF greater than 5% and a call rate greater than 98% were included for the association analyses. Furthermore, only one representative variant was included when two or more variants were shown to be in linkage disequilibrium (LD, r² > 0.6) with one another. In total, 274 variants were included for analyses. All processing of genetic data was performed using Plink v1.9 (Purcell et al., 2007; Purcell and Chang).
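The study-specific inclusion thresholds can be expressed as a simple per-variant filter. The sketch below is an assumption-laden illustration, not the authors' Plink workflow: the `Variant` class, the `passes_qc` function, and the variant identifiers and statistics are all hypothetical, and LD pruning (r² > 0.6) is deliberately left out because it requires pairwise genotype data.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    rsid: str         # hypothetical identifier, not a real rsID
    call_rate: float  # fraction of samples with a genotype call
    maf: float        # minor allele frequency
    hwe_p: float      # Hardy-Weinberg equilibrium test p-value


def passes_qc(v, min_call_rate=0.98, min_maf=0.05, min_hwe_p=1e-4):
    """Apply the call-rate, MAF, and HWE thresholds stated in the text."""
    return (v.call_rate > min_call_rate
            and v.maf > min_maf
            and v.hwe_p >= min_hwe_p)


# Hypothetical summary statistics, for illustration only.
variants = [
    Variant("var_a", 0.995, 0.21, 0.40),
    Variant("var_b", 0.970, 0.30, 0.55),  # fails call rate
    Variant("var_c", 0.999, 0.03, 0.80),  # fails MAF
    Variant("var_d", 0.990, 0.12, 1e-6),  # fails HWE
]

kept = [v.rsid for v in variants if passes_qc(v)]
print(kept)  # ['var_a']
```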

Association analyses were conducted in R (R Core Team, 2017) using R packages lme4 (Bates et al., 2014) and lmerTest (Kuznetsova et al., 2016). Linear mixed-effects models were used to investigate the effect of genetic variants on change in PANSS scores for each subscale (positive, negative, and general) and total over the 12-month period, adjusting for age, gender, proportion ancestry, and baseline PANSS scores. Multiple modes of inheritance were investigated and the Bonferroni correction method was used to correct for multiple testing (274 variants, four modes of inheritance, four PANSS domains; threshold p = 1.14 × 10⁻⁵).
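The reported Bonferroni threshold can be reproduced with a short calculation. The sketch assumes a family-wise alpha of 0.05, which is not stated explicitly in the text but is consistent with the reported threshold; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, *test_dimensions):
    """Divide alpha by the product of all testing dimensions."""
    n_tests = 1
    for n in test_dimensions:
        n_tests *= n
    return alpha / n_tests, n_tests


# 274 variants x 4 modes of inheritance x 4 PANSS domains (from the text);
# alpha = 0.05 is an assumption.
threshold, n_tests = bonferroni_threshold(0.05, 274, 4, 4)
print(n_tests)             # 4384
print(f"{threshold:.2e}")  # 1.14e-05
```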

#### Bioinformatic Analyses

Significant associations were identified for intronic variants only. These variants were assessed for potential functionality as brain-specific eQTLs by interrogating the BRAINEAC (www.braineac.org) (Ramasamy et al., 2014) and GTEx (www.gtexportal.org) (GTEx Consortium, 2013, 2015) online databases.

#### RESULTS

#### Relative Gene Expression Results

After the initial analysis, comparing gene expression between the CON and SCZ groups, 14 genes were shown to have significantly different expression between these groups. Specifically, the *CH4*, *DNMT1*, *DNMT3A*, *DNMT3B*, *HDAC3*, *HDAC5*, *HDAC6*, *HDAC7*, *HDAC9*, *HDAC11*, *MBD3*, *RBBP7*, *SAP30*, and *SIN3A*  genes showed increased expression in the SCZ group when compared to the CON group (**Table 1**). When comparing the CON group to the FES and CHR groups, 19 genes were shown to be differentially expressed (**Table 1**). *Post hoc* analyses identified seven genes that were significantly over-expressed in the CHR group when compared to both the CON and FES groups.

Seven genes were selected for follow-up investigation as candidates for association with antipsychotic treatment response. Specifically, the *B2M*, *DNMT3A*, *HDAC5*, *HDAC9*, *MBD2*, *MBD3*, and *RPLP0* genes were selected since their expression levels were significantly different between the CHR and CON groups and between the CHR and FES groups (**Table 1**, **Figure 2**). Furthermore, no significant differences in gene expression were identified for these genes when comparing the FES and CON groups (**Table 1**, **Figure 2**).
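The three selection criteria amount to a boolean filter over the pairwise comparison results. The sketch below uses hypothetical gene labels and significance flags rather than the values in Table 1, and the function name is an assumption made for illustration.

```python
def is_candidate(chr_vs_con, fes_vs_con, chr_vs_fes):
    """True when a gene differs in CHR vs. CON and CHR vs. FES but not FES vs. CON."""
    return chr_vs_con and not fes_vs_con and chr_vs_fes


# Hypothetical significance flags in the order
# (CHR vs. CON, FES vs. CON, CHR vs. FES), for illustration only.
results = {
    "gene_a": (True, False, True),   # meets all three criteria
    "gene_b": (True, True, True),    # also altered in first-episode patients
    "gene_c": (True, False, False),  # CHR and FES not distinguishable
}

candidates = [gene for gene, flags in results.items() if is_candidate(*flags)]
print(candidates)  # ['gene_a']
```

Requiring no FES vs. CON difference is what separates expression changes that may relate to chronic illness or treatment from those already present at first episode.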

#### Genetic Association Results

Of the 274 variants investigated within the seven candidate genes, four were significantly associated with treatment response, defined as a change in PANSS scores over time, after correction for multiple testing (p < 1 × 10⁻⁵) (**Table 2**). All significant associations were identified with the PANSS-negative (PANSS-N) domain. Specifically, the *DNMT3A* rs2304429 *CC* genotype was significantly associated with an improved treatment response (greater reduction in PANSS-negative scores per month) when compared to the *TC* genotype. The rate of PANSS-N score reduction was significantly faster over time (an additional 1.92% per month) in patients with the *DNMT3A* rs2304429 *CC* genotype when compared to patients with the *TC* genotype. The *HDAC5* rs11079983 *TT* genotype was significantly associated with a poorer treatment response (less reduction in PANSS-negative scores per month) when compared to the *CC* genotype. Patients with the *HDAC5* rs11079983 *TT* genotype had a significantly slower rate of PANSS-N score reduction over time (less by 2.36% per month) when compared to patients with the *CC* genotype. Two variants within *HDAC9* (rs1178119 and rs11764843) were also significantly associated with poorer treatment response (a less improved PANSS-negative treatment trajectory), as shown in **Table 2**. For *HDAC9* rs1178119, patients with the *GA* genotype had a significantly slower rate of reduction in PANSS-N scores over time (less by 2.04% per month) than patients with the *AA* genotype. Presence of the *HDAC9* rs1178119 *G* allele was also significantly associated with a slower rate of reduction in PANSS-N scores over time (less by 1.52% per month per *G* allele). Similarly, a significantly slower rate of reduction in PANSS-N scores over time (less by 1.92% per month) was also identified for patients with the *HDAC9* rs11764843 *CA* genotype when compared to those with the *AA* genotype.

### Bioinformatics

The variants significantly associated with change in PANSS scores were assessed using the BRAINEAC (Ramasamy et al., 2014) and GTEx (GTEx Consortium, 2013, 2015) databases. Two variants (rs2304429 and rs11079983) were identified as potential brain-specific eQTLs. The rs2304429 variant was suggested by BRAINEAC to alter the expression of *DNMT3A* in particular brain regions, namely the putamen (PUTM), the cerebellar cortex (CRBL), the temporal cortex (TCTX), and the medulla (MEDU). Specifically, the *TT* genotype was associated with reduced expression of *DNMT3A* in these brain regions. In addition, the *HDAC5* rs11079983 variant was shown in the GTEx database to alter the expression of *ASB16* and *ASB16-AS1* in the cerebellum. Specifically, the *TT* genotype is associated with reduced expression of *ASB16* and increased expression of *ASB16-AS1* when compared to the *CC* genotype. None of the other variants were identified as brain-specific eQTLs in the BRAINEAC or GTEx databases.

#### DISCUSSION

We identified a number of DNA methylation and transcriptional repression genes that were significantly over-expressed in patients with SCZ (first-episode and chronic) when compared to healthy controls (**Table 1**). Specifically, seven candidate regulatory genes were identified, including *B2M*, *DNMT3A*, *HDAC5*, *HDAC9*, *MBD2*, *MBD3*, and *RPLP0*, and variation within these genes was assessed for association with antipsychotic treatment response. Four variants were found to be significantly associated with poorer treatment trajectory in the PANSS-negative domain.

In this study, significant increases in the expression of *DNMT1* and *DNMT3A* were identified between all patients with SCZ and controls; however, further analyses revealed that this increase was only present when considering the CHR patients and not in the FES patients. Increased *DNMT1* and *DNMT3A* gene expression in the GABAergic neurons of SCZ patients has been previously identified (Zhubi et al., 2009), while the increased expression of *DNMT1* was also established in the peripheral blood lymphocytes of patients with SCZ (Auta et al., 2013). When considering these previous studies, it is interesting to note that the mean ages of their study cohorts were 57 ± 11 and 56 ± 18 (Zhubi et al., 2009) and 43.6 ± 10.3 (Auta et al., 2013), respectively. Given the young age of onset of SCZ, it is likely that the patients in these cohorts are more indicative of chronic SCZ and that the results of this study therefore replicate these previous findings.

TABLE 2 | Variants significantly associated with treatment trajectory for the PANSS negative domain.

The expression of a number of *HDAC* genes (1–4, 6, and 9) was previously investigated in the prefrontal cortex of patients with SCZ (Sharma et al., 2008). Only *HDAC1* was shown to have increased expression in patients with SCZ when compared to controls, while no significant differences in expression were identified for the other *HDAC* genes (Sharma et al., 2008). One possible explanation for the differences between these results and those of our study is that these patients were subject to a range of different medications, including typical and atypical antipsychotics, mood stabilizers (including valproic acid), antidepressants, stimulants, and sedatives which all may have an effect on the gene expression observed (Majchrzak-Celińska and Baer-Dubowska, 2017). Of the other differentially expressed genes in this study, *SAP30* expression was previously investigated and no significant changes were identified (Vawter et al., 2006). These results highlight the need for well-defined and deep phenotyping when investigating the molecular etiologies of neuropsychiatric disorders since their molecular architecture is malleable to disorder progression as well as treatment and other environmental factors (Gurwitz and Pirmohamed, 2010).

Due to the nature of the genes investigated in this study, changes in gene expression that were identified are indicative of altered regulatory mechanisms in chronically medicated patients with SCZ (**Table 1**). The alteration of these regulatory mechanisms is likely the result of a combination of disease progression, antipsychotic medication (Swathy and Banerjee, 2017), and the influence of other environmental factors through the process of epigenetics (Feil and Fraga, 2012). To further elucidate these complex interactions, gene expression differences between first-episode patients, chronic patients, and healthy controls were assessed. Significantly increased expression of seven genes (*B2M*, *DNMT3A*, *HDAC5*, *HDAC9*, *MBD2*, *MBD3*, and *RPLP0*) was observed in patients with chronic SCZ when compared to FES patients and healthy controls (**Figure 2**). These seven genes were selected as candidates for association with antipsychotic treatment response.

Novel associations were identified between variants within *DNMT3A* (rs2304429), *HDAC5* (rs11079983), and *HDAC9* (rs1178119, rs11764843) and antipsychotic treatment response as defined by a change in PANSS scores over time (**Table 2**). Specifically, all four of these variants were associated with a significantly worse treatment trajectory in the negative PANSS symptom domain. Bioinformatics analyses of these variants revealed that two variants are predicted to exert functional changes as eQTLs. The *DNMT3A* rs2304429 and *HDAC5* rs11079983 variants were predicted to alter expression of specific genes in particular brain regions. Specifically, individuals with the rs2304429 *CC* genotype have increased *DNMT3A* gene expression in the PUTM, CRBL, TCTX, and MEDU when compared to individuals with the *TC* or *TT* genotypes. In addition, individuals with the *HDAC5* rs11079983 *TT* genotype have reduced expression of *ASB16* and increased expression of *ASB16-AS1* in the cerebellum when compared to the *CC* genotype. In this study, the rs2304429 *CC* and rs11079983 *CC* genotypes were associated with improved treatment response, indicating that the brain region-specific gene expression changes associated with these variants may play a role. From these results, it may be hypothesized that increased expression of *DNMT3A*, as a result of the rs2304429 *CC* genotype, in the abovementioned brain regions may result in *de novo* methylation patterns (Okano et al., 1999) that result in increased efficacy of antipsychotic medication. Similarly, reduced expression of *ASB16* and *ASB16-AS1* in the cerebellum, in the presence of the rs11079983 *CC* genotype, may result in altered ubiquitin-mediated pathways (Kohroki et al., 2005) and cytokine signaling (Babon et al., 2009) with beneficial effects when considering antipsychotic response. These brain regions have previously been implicated in SCZ (Hokama et al., 1995; Turetsky et al., 1995; Harrison, 2004; Yeganeh-Doost et al., 2011; Williams et al., 2014; Tohid et al., 2015) and suggested as targets for treatment (Buchsbaum et al., 2003; Hugdahl et al., 2009; Mitelman et al., 2009; Parker et al., 2014). These brain region-specific eQTLs should therefore be investigated as biomarkers and potential targets for antipsychotic treatment response.

The potential underlying mechanisms of action associated with the remaining two significant findings are not clear. Further studies investigating the exact role that these variants may have in antipsychotic treatment response are warranted. Furthermore, the novel associations identified in this study should be replicated and validated in additional independent cohorts. Moreover, the eQTL data presented were not generated from individuals of Sub-Saharan African ancestry, and further studies are required to confirm whether the variants presented here have similar eQTL effects in populations of African ancestry. Due to the small sample sizes used for the gene expression analyses in this study, these results should be interpreted with caution and require replication. Functional studies incorporating *in situ* and *in vivo* assays should also be considered for validation of these results. Moreover, the gene expression results presented in this study should be replicated in female patients with SCZ and controls.

In conclusion, this study identified significant differential expression of DNA methylation and transcriptional repression genes between first-episode patients with SCZ, chronically medicated patients with SCZ, and healthy controls. Variants within specific differentially expressed genes were significantly associated with antipsychotic treatment response, and highlighted particular brain regions in which altered expression of specific genes may play a role in treatment outcome. These results may aid in the development of biomarkers for antipsychotic treatment response, as well as novel drug targets and treatment strategies.

#### ETHICS STATEMENT

Written informed consent was obtained from all patients, or their caregivers, prior to the study, and ethical approval was granted by the Human Research and Ethics Committee (HREC), Faculty of Medicine and Health Sciences, Stellenbosch University (N13/08/115 and N06/08/148).

#### AUTHOR CONTRIBUTIONS

KO'C, NM, and LW conceived the study. KO'C performed the laboratory work and statistical analyses. RE and SS were involved in participant recruitment. KO'C wrote the initial draft of the manuscript. NM, LW, RE, and SS provided detailed critique of the manuscript. All authors approved the final manuscript for submission.

### FUNDING

This work was supported by the South African Medical Research Council "SHARED ROOTS" Flagship Project Grant no. MRC-RFA-IFSP-01-2013/SHARED ROOTS (Prof Soraya Seedat, Department of Psychiatry, Stellenbosch University) and the National Research Foundation (NRF): KO was funded by the Scarce Skills Post-Doctoral Fellowship (grant no. 96833). LW was funded by the Competitive Program for Rated Researchers (grant no. 93498) and the Bioinformatics and Functional Genomics Program (grant no. 93681). The opinions expressed and conclusions arrived at are those of the authors and are not necessarily attributed to these funding sources.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00641/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 O'Connell, McGregor, Emsley, Seedat and Warnich. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Systematic Review of Genetic Factors in the Etiology of Esophageal Squamous Cell Carcinoma in African Populations

#### *Hannah Simba1, Helena Kuivaniemi2, Vittoria Lutje3, Gerard Tromp2,4,5,6,7 and Vikash Sewram1\**

*1 African Cancer Institute, Division of Health Systems and Public Health, Department of Global Health, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa, 2 Division of Molecular Biology and Human Genetics, Department of Biomedical Sciences, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa, 3 Cochrane Infectious Diseases Group, Liverpool, United Kingdom, 4 Bioinformatics Unit, South African Tuberculosis Bioinformatics Initiative, Stellenbosch University, Cape Town, South Africa, 5 DST–NRF Centre of Excellence for Biomedical Tuberculosis Research, Stellenbosch University, Cape Town, South Africa, 6 South African Medical Research Council Centre for Tuberculosis Research, Stellenbosch University, Cape Town, South Africa, 7 Centre for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, South Africa*

*Edited by: Solomon Fiifi Ofori-Acquah, University of Ghana, Ghana*

#### *Reviewed by:*

*Clara S. Tang, The University of Hong Kong, Hong Kong Marco Matejcic, University of Southern California, United States*

*\*Correspondence: Vikash Sewram vsewram@sun.ac.za*

#### *Specialty section:*

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

*Received: 16 November 2018 Accepted: 18 June 2019 Published: 02 August 2019*

#### *Citation:*

*Simba H, Kuivaniemi H, Lutje V, Tromp G and Sewram V (2019) Systematic Review of Genetic Factors in the Etiology of Esophageal Squamous Cell Carcinoma in African Populations. Front. Genet. 10:642. doi: 10.3389/fgene.2019.00642*

Background: Esophageal squamous cell carcinoma (ESCC), one of the most aggressive cancers, is endemic in Sub-Saharan Africa, constituting a major health burden. It has the most divergence in cancer incidence globally, with high prevalence reported in East Asia, Southern Europe, and in East and Southern Africa. Its etiology is multifactorial, with lifestyle, environmental, and genetic risk factors. Very little is known about the role of genetic factors in ESCC development and progression among African populations. The study aimed to systematically assess the evidence on genetic variants associated with ESCC in African populations.

Methods: We carried out a comprehensive search of all African published studies up to April 2019, using PubMed, Embase, Scopus, and African Index Medicus databases. Quality assessment and data extraction were carried out by two investigators. The strength of the associations was measured by odds ratios and 95% confidence intervals.

Results: Twenty-three genetic studies on ESCC in African populations were included in the systematic review. They were carried out on Black and admixed South African populations, as well as on Malawian, Sudanese, and Kenyan populations. Most studies were candidate gene studies and included DNA sequence variants in 58 different genes. Only one study carried out whole-exome sequencing of 59 ESCC patients. Sample sizes varied from 18 to 880 cases and 88 to 939 controls. Altogether, over 100 variants in 37 genes were part of 17 case-control genetic association studies to identify susceptibility loci for ESCC. In these studies, 25 variants in 20 genes were reported to have a statistically significant association. In addition, eight studies investigated changes in cancer tissues and identified somatic alterations in 17 genes and evidence of loss of heterozygosity, copy number variation, and microsatellite instability. Two genes were assessed for both genetic association and somatic mutation.


Conclusions: Comprehensive large-scale studies on the genetic basis of ESCC are still lacking in Africa. Sample sizes in existing studies are too small to draw definitive conclusions about ESCC etiology. Only a small number of African populations have been analyzed, and replication and validation studies are missing. The genetic etiology of ESCC in Africa is, therefore, still poorly defined.

Keywords: esophageal squamous cell carcinoma, genetic association, somatic variant, germline mutation, sequence variants, systematic review, African populations

#### INTRODUCTION

Esophageal cancer is an aggressive and fatal cancer of the digestive tract. It accounts for an estimated 455,800 new cases and 400,200 deaths per year globally, making it the eighth most common cancer in the world (Murphy et al., 2017). The malignant tumors are classified into two major subtypes: esophageal squamous cell carcinoma (ESCC), which is the more common type and accounts for 90% of cases, and esophageal adenocarcinoma (EAC) (Kaz and Grady, 2014; Abnet et al., 2017). ESCC presents with poor prognosis and a low survival rate (<5%) in low-resource settings (Yazbeck et al., 2016; Murphy et al., 2017). Because ESCC develops asymptomatically, patients are typically diagnosed at a late stage, when the disease is characterized by dysphagia. At this stage, treatment is limited to palliative care.

ESCC is endemic in specific geographic locations worldwide and has the most divergence in cancer incidence globally, with high prevalence reported in East Asia, Southern Europe, as well as in Eastern and Southern Africa (Abnet et al., 2017). This peculiar distribution draws questions on the specificity of certain risk factors to particular populations. The African ESCC corridor, which includes Ethiopia, Rwanda, Burundi, Malawi, Kenya, Uganda, Tanzania, and South Africa, is an ESCC hotspot region (Munishi et al., 2015; Schaafsma et al., 2015). It has also been reported that in Sub-Saharan Africa, ESCC develops in younger patients than in other regions (Kayamba et al., 2015).

The etiology of esophageal carcinoma is multifactorial. The risk factors reported worldwide comprise several lifestyle, environmental, and genetic factors (Pink et al., 2011; Sewram et al., 2014; Chen et al., 2015; Sewram et al., 2016; Huang and Yu, 2018). Growing evidence supports the hypothesis that genomic alterations and epigenetic modifications contribute to tumor development (Baba et al., 2017). ESCC has both an inherited and cellular genetic basis (Abnet et al., 2017; Coleman et al., 2018). Familial syndromes associated with increased risk of malignancy include tylosis and Fanconi anemia (Abnet et al., 2017). The majority of genetic studies on ESCC have been case-control association studies analyzing single-nucleotide polymorphisms (SNPs) in various candidate genes. However, the reproducibility of these studies has been low. Some of the more common SNPs associated with ESCC have been identified in the aldehyde dehydrogenase 2 family gene (*ALDH2*) and the alcohol dehydrogenase 1B gene (*ADH1B*) (Abnet et al., 2017). Variants in these genes have been shown to increase susceptibility to ESCC development, and they are also associated with alcohol consumption (Abnet et al., 2017). Two meta-analyses published in 2018 reported associations between the genes *MTHFR* and *GSTT1* and esophageal cancer development (He et al., 2018; Kumar and Rai, 2018). However, the meta-analyses were done on predominantly Asian and Western populations. In recent years, the focus of ESCC research in Western and Asian countries has shifted from candidate gene studies to genome-wide association studies (GWAS) and whole-exome sequencing (WES) to identify variants associated with ESCC. Combined analysis of different study designs has provided a better understanding of ESCC etiology in Asian populations (Abnet et al., 2017). Genes with variants implicated in the development of ESCC in these populations include phospholipase C epsilon 1 (*PLCE1*), caspase 8 (*CASP8*), tumor protein 53 (*TP53*), and human leukocyte antigen (*HLA*) (Abnet et al., 2017).

The genetic etiology of ESCC in Africa is not well understood, since there have been very few studies on ESCC in African populations. This is in part due to the unavailability of adequate research infrastructure. A lack of comprehensive assessment and validation of existing evidence through systematic reviews has also contributed to this knowledge gap. A number of small studies on African populations have yielded varied associations between genetic variants and ESCC. There is, therefore, a need to systematically assess the current evidence in order to map out the contribution of genetic factors in the development of ESCC in African populations using critically appraised data.

The aim of the current systematic review was to assess all genetic (cross-sectional, case-control, and cohort) studies reporting on germline and somatic variants where risk factor estimates were calculated. This was achieved through the following: 1) critical appraisal of African literature on association of genetic factors to ESCC development; 2) comprehensive analysis of genetic (germline and somatic) variants in the reported studies; 3) data synthesis through pooled analysis, if feasible; and 4) comparison of genetic variants identified in African populations to those reported in other geographic regions.

#### MATERIALS AND METHODS

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Little et al., 2009). However, because PRISMA is not a quality assessment tool, other instruments were used to assess study quality.

#### Data Sources and Search Strategy

We carried out a literature search on all published African ESCC studies up to April 2019. We developed a comprehensive set of search terms subjectively and iteratively. We searched the following electronic bibliographic databases without time or language limits: Medline (PubMed), Embase (Ovid), Scopus, African Index Medicus, and Africa-Wide Information (EBSCOhost). We also checked the reference lists of potentially relevant articles for additional citations and used the "related citations" search key in PubMed to identify similar papers.

We checked Medline (PubMed) to identify controlled vocabulary (MeSH) terms related to esophageal cancer and also identified text keywords based on our knowledge of the field (**Table 1**). Medline search terms were modified for other electronic databases to conform to their search functions.

Screening for eligible studies was carried out by two authors (HS and HK). First, the two authors read the titles and abstracts independently and then met to finalize an initial list. Full articles of the studies selected based on the initial screening were read and assessed for inclusion to the systematic review. **Figure 1** shows the outline for selection of eligible studies.

#### Quality Control and Data Extraction

Quality of the methodology used in the published studies was assessed using a quality assessment tool adapted from the STrengthening the REporting of Genetic Association studies (STREGA) statement (Little et al., 2009). The quality assessment for genetic association studies to identify ESCC susceptibility loci included reporting on power calculations, detailed population characteristics for cases, description of ESCC diagnosis, screening of cases and controls, reporting a measure of association using odds ratios, adjustment of population stratification, assessment of genotyping error, reporting the Hardy–Weinberg equilibrium, correction for multiple testing, and reporting of National Center for Biotechnology Information (NCBI) rs numbers for variants (**Table S1**).

TABLE 1 | Medline (PubMed) search strategy to identify published African ESCC literature.

For somatic mutation studies, quality assessment included the following: description of ESCC diagnosis, reporting of tissues used [cancerous (Ca) and normal neighboring tissue (NET)], detailed population characteristics, variant classification and type, confirmation of variants identified, reporting of amino acid change, and use of pathogenicity scoring (**Table S2**).

Data extraction was carried out by two authors (HS and HK) using data extraction forms. Two separate extraction forms were prepared for the germline (genetic susceptibility) and somatic mutation studies. The data extraction form for the genetic susceptibility studies included the following: description of the population (age, sex, sample size, smoking, and alcohol use for cases and controls separately), genotyping method, statistical analysis test, minor allele frequency (MAF), genotype frequency, haplotype frequency, and gene–environment interaction frequency. The somatic mutation study extraction form had the same variables, excluding gene–environment interaction frequency and haplotype frequency.

The South African Admixed Population is reported as mixed ancestry in the tables according to how it was reported in the articles.

#### Data Analysis

A meta-analysis could not be performed as there were only two SNPs analyzed in more than one study and even those were analyzed in only two independent studies. For a meta-analysis to be carried out, SNPs have to be assessed in at least three separate case-control studies. *TP53* in the somatic variant studies was analyzed in four separate studies, but two of the studies had cases only with no controls, and the remaining two assessed different parts of the gene. The results of this systematic review will, therefore, be reported in a descriptive manner.
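The eligibility rule described above (a SNP must be assessed in at least three separate case-control studies) can be illustrated with a short sketch. The study labels, SNP labels, and the function name `eligible_for_meta_analysis` are hypothetical and are not taken from the reviewed articles.

```python
from collections import defaultdict


def eligible_for_meta_analysis(records, min_studies=3):
    """Return SNPs assessed in at least `min_studies` distinct studies.

    records: iterable of (study_label, snp_label) pairs (hypothetical layout).
    """
    studies_per_snp = defaultdict(set)
    for study, snp in records:
        # A set ensures each study is counted once per SNP.
        studies_per_snp[snp].add(study)
    return sorted(
        snp for snp, studies in studies_per_snp.items()
        if len(studies) >= min_studies
    )


# Invented example: only snp_2 is reported by three separate studies.
toy_records = [
    ("Study A", "snp_1"), ("Study B", "snp_1"),
    ("Study A", "snp_2"), ("Study B", "snp_2"), ("Study C", "snp_2"),
    ("Study C", "snp_3"),
]
eligible = eligible_for_meta_analysis(toy_records)
```

Under this rule, a SNP analyzed in only two independent studies, as was the case in this review, would not qualify.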

We were able to find rs numbers for most of the variants even if the authors of the original studies did not report them and have included them in the tables of this systematic review. We used the canonical SNP identifier (rs number) and dbSNP (version 152; April 2019) database at NCBI (https://www.ncbi.nlm.nih.gov/snp/) for this. We also determined the locus positions of the microsatellite markers reported in a study by Naidoo et al. (2005) using the primer-BLAST database at NCBI (https://www-ncbinlm-nih-gov.ez.sun.ac.za/tools/primer-blast).

To determine the linkage disequilibrium (LD) measures between the SNPs reported in the same genes, we obtained the imputed data set from the 1000 Genomes Project (1000 Genomes Release Phase 3 2013-05-02) and used bcftools to extract all individuals from African populations, not including African Americans, and the 77 SNPs discussed here using all synonyms (alternative rs IDs) for SNPs (Auton et al., 2015). We obtained a dataset of 504 individuals and 67 SNPs. We computed all pair-wise r² values using PLINK (v1.09) (Danecek et al., 2011; Chang et al., 2015).
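As an illustration of the pair-wise r² measure, the sketch below computes the squared Pearson correlation between allele counts (coded 0, 1, or 2) for every pair of SNPs. It is a simplified stand-in for a dedicated tool such as PLINK, not a reproduction of the analysis; the SNP labels, genotype values, and function names are invented for the example.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic SNP has no variance, so r^2 is undefined.
        return float("nan")
    return sxy ** 2 / (sxx * syy)


def pairwise_r2(genotypes_by_snp):
    """Map each unordered SNP pair to its r^2 value."""
    return {
        (a, b): r_squared(genotypes_by_snp[a], genotypes_by_snp[b])
        for a, b in combinations(sorted(genotypes_by_snp), 2)
    }


# Invented allele counts for six individuals at three SNPs.
toy_genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # identical to snp_a, so r^2 = 1
    "snp_c": [2, 1, 0, 2, 0, 1],
}
ld = pairwise_r2(toy_genotypes)
```

Pairs whose value exceeds a chosen threshold can then be flagged as being in LD.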


### RESULTS

#### Systematic Review Outline

The selection process for all the included studies is shown in **Figure 1**. The initial database search identified 2,235 articles. Titles and abstracts of these articles were reviewed, and 2,168 studies were removed for not being original genetic studies. The 67 articles that remained were selected for full-text eligibility assessment. This process resulted in the removal of 40 articles: 15 review articles, 18 chromosomal, gene or protein expression studies, 4 blood group studies, 1 duplicate, and 2 abstracts. A total of 27 full articles were then assessed for eligibility, and four articles were removed for not meeting the criteria, as follows: one study had no cancer patients/cases (Adams et al., 2003), one focused on the Chinese population (Li et al., 2016), one focused on protein expression (Jaskiewicz and De Groot, 1994; Huang and Yu, 2018), and one was a mathematical model study (Uys and Van Helden, 2003). In the end, 23 studies were included and analyzed in the systematic review.
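The selection counts reported above can be checked with a few lines of arithmetic. The sketch uses only the numbers stated in the text; the variable names are illustrative.

```python
# Counts taken from the selection process described in the text.
identified = 2235
removed_at_screening = 2168
full_text_exclusions = {
    "review articles": 15,
    "chromosomal, gene or protein expression studies": 18,
    "blood group studies": 4,
    "duplicates": 1,
    "abstracts": 2,
}
removed_at_eligibility = 4

# Step through the flow: screening, full-text assessment, eligibility.
after_screening = identified - removed_at_screening
after_full_text = after_screening - sum(full_text_exclusions.values())
included = after_full_text - removed_at_eligibility

print(after_screening, after_full_text, included)
```

Each intermediate value should match the corresponding figure in the flow (67, 27, and 23 articles).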

#### Study Characteristics

The characteristics of all the genetic susceptibility and somatic variant studies included are shown in **Tables 2** and **3**, respectively. The 23 studies included in the review were published between 1990 and 2019. There were 17 genetic susceptibility and eight somatic variant studies. Two studies reported on both genetic susceptibility and somatic variants.

#### Genetic Susceptibility Studies

The 17 genetic susceptibility studies (**Table 2**) were all case-control studies (Dietzsch et al., 2003; Vos et al., 2003; Dandara et al., 2005; Li et al., 2005; Zaahl et al., 2005; Chelule et al., 2006; Dandara et al., 2006; Li et al., 2008; Li et al., 2010; Bye et al., 2011; Matejcic et al., 2011; Bye et al., 2012; Eltahir et al., 2012; Strickland et al., 2012; Vogelsang et al., 2012; Matejcic et al., 2015; Chen et al., 2019) published between 2003 and 2019. Sixteen articles reported on the South African population and one article on the Sudanese population. The majority (13/17; 76%) of the studies reported on the main subject characteristics (ethnicity, sex, age, and type of clinical assessment). Sample sizes for ESCC patients ranged from 18 to 880 with six of the studies having over 200 patient samples. Sample sizes for controls ranged from 88 to 939 with nine of the studies having over 200 control samples. It is difficult to estimate the total number of patients analyzed in these 17 studies, since it appears that the same authors used the same sample set for different SNPs in different publications. Our assessment showed that Bye et al. (2011) and Bye et al. (2012) used the same participants. In addition, studies by Li et al. (2005) and Li et al. (2008) used the same participants as Dandara et al. (2005). The remaining 12 studies do not seem to have any obvious sample overlap.

Altogether, 16 of the 17 studies confirmed the ESCC diagnosis by histology. None of the studies clinically assessed controls for ESCC, with the exception of one study (Strickland et al., 2012), which assessed controls using a brush biopsy. Nine studies reported on smoking and alcohol consumption status for all participants (Dandara et al., 2005; Li et al., 2005; Dandara et al., 2006; Li et al., 2008; Li et al., 2010; Bye et al., 2012; Vogelsang et al., 2012; Matejcic et al., 2015; Chen et al., 2019), while three (Bye et al., 2011; Matejcic et al., 2011; Strickland et al., 2012) reported those risk factors for only the ESCC patients.

The Hardy–Weinberg equilibrium deviation was assessed in 11 (65%) studies; however, only six (35%) of the studies reported power calculations, and three (18%) studies reported the evaluation of a genotyping error. Detailed characteristics of the study population were reported in 12 of the studies for cases and 10 for controls. Correction for multiple testing was reported in only seven (41%) studies. NCBI rs numbers were reported in eight (47%) studies. Our quality assessment scoring had 11 items (**Table S1**), and each item had a weight of 1 point; therefore, total maximum quality score was 11. Overall, only seven of the 17 (41%) studies scored half or above half (5.5). The highest score was 9 (Vogelsang et al., 2012; Chen et al., 2019), and the lowest score was 1 (Vos et al., 2003; Zaahl et al., 2005).
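For readers unfamiliar with the Hardy–Weinberg check counted among the quality items, the following is a minimal sketch of the standard chi-square goodness-of-fit test with one degree of freedom. The genotype counts are invented, and the function name is an assumption; it does not reproduce the test used in any of the reviewed studies.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    Arguments are observed genotype counts (AA, AB, BB).
    Returns (chi-square statistic, p value for 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: the first set fits HWE exactly, the second has no heterozygotes.
chi2_fit, p_fit = hwe_chi_square(25, 50, 25)
chi2_dev, p_dev = hwe_chi_square(50, 0, 50)
```

A small p value in controls is commonly read as a warning sign of genotyping error or population stratification, which is why its reporting is treated as a quality item.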

#### Somatic Variant Studies

The somatic variant studies (**Table 3**) comprised eight studies published between 1990 and 2016 (Victor et al., 1990; Gamieldien et al., 1998; Dietzsch and Parker, 2002; Dietzsch et al., 2003; Vos et al., 2003; Naidoo et al., 2005; Patel et al., 2011; Liu et al., 2016). A total of 455 patients were assessed, with the control group comprising 200 NET and 146 blood samples. Of the 455 patient samples, one was reported to be an adenocarcinoma from one study; therefore, the exact ESCC patient population was 454. The study populations were from South Africa, Kenya, and Malawi.

Clinical diagnosis of ESCC was determined by histology in five (63%) studies, and the remaining three did not report on how clinical assessment was done. Four (50%) studies reported using both cancer tissue and NET for assessment. Three of these studies had an equal number of cancer tissue and NET samples. Two (25%) studies did not have any control samples, and the remaining two (25%) studies collected blood samples only as controls. Only two studies reported on smoking and alcohol consumption status. On patient characteristics, age and sex were reported in six (75%) of the studies. Variant classification and type were reported in all of the studies, but confirmation of results was reported in only two studies. No studies used pathogenicity scoring. Amino acid change was also reported in only two of the studies. Our quality assessment score had seven items (**Table S2**), and each item had a weight of 1 point; therefore, total maximum score for the quality assessment was 7. Overall, six of the eight (75%) studies scored half or above half (3.5). The highest score was 6 (Gamieldien et al., 1998), and the lowest score was 0 (Victor et al., 1990).

### Description of Genes Studied

A total of 58 genes were investigated in the 23 studies, which were selected for the systematic review, with 37 genes studied in the genetic susceptibility studies and 23 in the somatic variant studies. Two genes were investigated in both studies. In addition, the somatic studies investigated six genetic loci without specific gene names. A summary of SNPs analyzed in the genetic susceptibility studies is shown in **Table 4**. Over 100 SNPs were analyzed, and 25 SNPs were reported to be associated with ESCC (four SNPs using p values only, and 21 SNPs using p values and odds ratios). The 25 SNPs were in 20 genes: *ADH1B, ADH3, ALDH2, AR, CASP8, CHEK2, CP, CYP2E1, CYP3A5, GSTT2B, MGMT, MLH3, MSH3, NAT2, PTGS2 (also known as COX-2), PLCE1, PMS1, RUNX1, SLC11A1, and TP53*. The associations with all 25 SNPs were identified in South African populations, while none were found in the Sudanese population.

**Table 5** shows a summary of the pathways for the 20 genes. All the genes encode proteins. Three of the genes, *ADH1B*, *ADH3*, and *ALDH2*, are involved in alcohol metabolism (Li et al., 2008; Bye et al., 2011). Three mismatch repair genes, *MLH3*, *MSH3*, and *PMS1*, play a role in genomic integrity (Vogelsang et al., 2012). They are reported to also play a role in carcinogenesis. *MGMT* is involved in cell defense against mutagens, and mutations in the gene are reported to be associated with cancer formation (Bye et al., 2011). *NAT2* and *GSTT2B* play a role in the activation and deactivation of drugs and carcinogens, with reports of mutations being associated with carcinogenesis (Matejcic et al., 2015). Genes regulating cell apoptosis are *TP53*, *CHEK2*, and *CASP8* (Vos et al., 2003; Bye et al., 2011; Eltahir et al., 2012; Chen et al., 2019). *TP53* and *CHEK2* are also involved in gene expression and DNA repair. Regulation of gene expression is facilitated by *PLCE1* and *SLC11A1* (Zaahl et al., 2005; Bye et al., 2012). The *AR* gene regulates the sex hormones, androgens (Dietzsch et al., 2003), while *CYP2E1* and *CYP3A5* are involved in steroid, cholesterol, and lipid synthesis (Dandara et al., 2005; Li et al., 2005; Chelule et al., 2006). *CYP2E1* also metabolizes drugs and has been implicated in carcinogenesis. *CP* facilitates transportation of iron from organs into the blood cells; *RUNX1* plays a role in hematopoiesis and *PTGS2* in inflammation and mitogenesis (Bye et al., 2011; Bye et al., 2012; Strickland et al., 2012).

*<sup>1</sup>Only range of age was reported for the combined group of cases and controls.*

*<sup>2</sup>57 had ESCC.*

*<sup>3</sup>Same population as in Dandara et al. (2005) study.*

*<sup>4</sup>59+/–13 for male (n = 48) and 66+/– (n = 48) for female patients.*

*<sup>5</sup>326 had ESCC.*

*<sup>6</sup>182 had ESCC.*

*<sup>7</sup>Western and Eastern Cape Province Black Population.*

*<sup>8</sup>Gauteng Province Black Population.*

*Ctrl, controls; ESCC, esophageal squamous cell carcinoma; HEX, heteroduplex; KASP, competitive allele specific PCR; PAGE, polyacrylamide gel electrophoresis; PCR, polymerase chain reaction; RFLP, restriction fragment length polymorphism; SD, standard deviation; SSCP, single-strand conformation polymorphism.*

#### TABLE 3 | Characteristics of studies on somatic changes in ESCC in African populations.

*Ca, cancer tissue; HEX-SSCP, heteroduplex single-strand conformation polymorphism; NET, neighboring tissue; PAGE, polyacrylamide gel electrophoresis; PCR, polymerase chain reaction; WES, whole exome sequencing.*

*<sup>1</sup>57 had ESCC and 1 had adenocarcinoma.*

#### TABLE 4 | Summary of studies investigating genetic susceptibility of ESCC in African populations.

*<sup>1</sup>Increased risk among smokers with SULT1A1\*2/\*2 genotype, but sample size was small.*

*<sup>2</sup>When OR > 1, effect allele = increased risk; when OR < 1, effect allele = protective effect.*

*<sup>3</sup>rs3765525 has been merged into rs959421.*

*<sup>4</sup>Western and Eastern Cape Province Black Population.*

*<sup>5</sup>Gauteng Province Black Population.*

Nine of the 25 associated SNPs were from small studies with fewer than 150 cases and controls. These SNPs are in the following six genes: *ADH3*, *AR*, *CP*, *CYP3A5*, *SLC11A1*, and *TP53*. Because of the small sample size, the reliability and replicability of these results are uncertain. Sixteen of the SNPs came from studies with at least 150 cases and controls, and one study with 142 cases. These sample sizes could potentially give reliable and replicable results. The 16 SNPs were from the following genes: *ADH1B*, *ALDH2*, *CASP8*, *CHEK2*, *CYP2E1*, *GSTT2B*, *MGMT*, *MLH3*, *MSH3*, *NAT2*, *PLCE1*, *PMS1*, *PTGS2*, and *RUNX1*.

Two of the 16 SNPs are in the *ALDH2* gene and were analyzed in two different studies. However, it is not clear whether these two SNPs are the same because, while one study reported the NCBI rs number (rs886205) (Bye et al., 2011), the other study did not (Li et al., 2008). The two studies reported very different MAFs and odds ratios in opposite directions (2.35 and 0.70), indicating increased risk and a protective effect, respectively.
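To make the direction of effect concrete, the sketch below computes an odds ratio and an approximate 95% confidence interval from a 2×2 table using the standard log-odds (Woolf) method. The counts are invented for illustration and are not taken from any reviewed study; the function name is an assumption.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf-type confidence interval.

    a: cases carrying the allele      b: cases not carrying it
    c: controls carrying the allele   d: controls not carrying it
    All four cells must be non-zero for this approximation.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts: the allele is more frequent in cases (OR > 1).
risk_or, risk_lo, risk_hi = odds_ratio_ci(40, 60, 20, 80)
# Swapping cases and controls gives the reciprocal (OR < 1, protective).
prot_or, prot_lo, prot_hi = odds_ratio_ci(20, 80, 40, 60)
```

An interval lying entirely above 1 suggests increased risk, one entirely below 1 suggests a protective effect, and an interval that spans 1 is not statistically significant at the chosen level.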

Six of the 16 SNPs were reported to reduce the risk of ESCC, and they are the following: *ADH1B* (Arg48His; rs1229984), *ALDH2* (+82 A > G; rs886205), *GSTT2B* (deletion allele), *NAT2* (341T > C; rs1801280), *PTGS2* (-1195 A > G; rs689466), and *PLCE1* (Arg548Leu; rs17417407). The remaining 10 SNPs were reported to increase the risk of ESCC: *ALDH2* (ALDH2\*1/\*2), *CASP8* (Asp302His; rs1045485), *CHEK2* (rs4822983 C > T, and rs1033667, C > T), *CYP2E1* (7632T > A), *MGMT* (Leu84Phe; rs12917), *MLH3* (Arg797His; rs28756991), *MSH3* (Ala1045Thr; rs26279), *PMS1* (c.-21+639G > A; rs5742938), and *RUNX1* (rs2014300). Eleven of the 16 SNPs showed association in the South African Admixed population, while only four showed association in the Black South African population and one in a combined South African population. All the studies used PCR-based methods for genotyping. Using the 1000 Genomes Database, r² analysis was carried out on SNPs reported in the same gene to assess the LD between the SNPs. Thirteen pairs of SNPs in the *MSH2*, *CP*, *MSH3*, *PLCE1*, *CHEK2*, and *NAT1* genes had r² > 0.45, shown in **Figure 2** and **Table S3**.

Altogether 44 somatic changes were reported in the following 22 genes: *AR*, *CCND1*, *CDKN2A*, *COL1A2*, *EGFR*, *EP300*, *FAT1*, *FAT2*, *FAT3*, *FAT4*, *FBXW7*, *JAG1*, *KMT2C (MLL3)*, *KMT2D (MLL2)*, *MUC2*, *NFE2L2*, *NOTCH1*, *NOTCH3*, *PIK3CA*, *SERPINB4*, *TP53*, and *TP63*, and six genetic loci without specific gene names (**Table 6**). The specific locus positions with the corresponding microsatellite markers are as follows: 2p (D2S123), 3p13 (D3S659), 3p24.2-25 (D3S1255), 4q12 (Bat 25), 2p21-p16.3 (Bat 26), and 1p12-13.3 (Bat 40). These variants were reported in the South African (20 variants), Kenyan (three variants), and Malawian (21 variants) populations. While the majority of the studies used PCR-based methods, a more recent study used WES as the analysis method (Liu et al., 2016). A total of 18 of the 22 genes with somatic variants in cancer tissue were discovered using WES. Statistical significance was not reported for any of the 44 variants. The most common type of somatic variants was missense mutations, reported in 14 of the 22 genes (64%) (Patel et al., 2011; Liu et al., 2016). Other somatic changes included copy number gains (14%), copy number losses (5%), deletions (14%), insertions (14%), and frameshift mutations (14%). In three studies (Dietzsch and Parker, 2002; Dietzsch et al., 2003; Naidoo et al., 2005), microsatellite instability and loss of heterozygosity (LOH) were reported (14%).

**Table 7** shows a summary of the pathways in the 22 genes reporting somatic changes. Five genes, *AR*, *EP300*, *KMT2D*, *KMT2C*, and *TP53*, play a role in the regulation of transcription (Gamieldien et al., 1998; Dietzsch et al., 2003; Vos et al., 2003; Patel et al., 2011; Liu et al., 2016). The protein encoded by the *AR* gene functions as a steroid hormone-activated transcription factor, while *KMT2D* has a role in methylation. Both *TP53* and *EP300* have been implicated in a number of cancers (Gamieldien et al., 1998; Vos et al., 2003; Patel et al., 2011; Liu et al., 2016). *TP53* additionally functions in DNA repair, gene expression, and apoptosis. The mismatch repair genes also facilitate DNA repair (Naidoo et al., 2005). *CCND1*, *CDKN2A*, *FAT1/2/3/4*, and *Ras* genes are all reported to be involved in cell cycle pathways including regulation of mitotic events, cell proliferation, and cell growth and death (Victor et al., 1990; Gamieldien et al., 1998; Liu et al., 2016). *NOTCH1* and *NOTCH3* both facilitate cell and tissue development (Liu et al., 2016). *JAG1* plays a role in hematopoiesis, while *NFE2L2* is involved in response to inflammation including production of free radicals (Liu et al., 2016). *PIK3CA* is an oncogene implicated in tumor development, while *SERPINB4* modulates response against tumor cells (Liu et al., 2016). The *EGFR* and *COL1A2* genes encode the epidermal growth factor receptor and type 1 collagen, respectively (Dietzsch and Parker, 2002; Liu et al., 2016). *FBXW7* is a tumor suppressor involved in ubiquitin-mediated protein degradation (Liu et al., 2016). *MUC2* facilitates the formation of a mucous barrier that protects the gut lumen (Liu et al., 2016). The *TP63* gene is involved in tissue and organ development including skin and heart, and in adult stem cell regulation (Liu et al., 2016).



*LOH, loss of heterozygosity; MSI, microsatellite instability.*

#### Interaction Studies

Combinations of specific genotypes with environmental factors were also reported to be associated with ESCC in a number of studies (**Table 2**). The two main environmental factors studied were smoking and alcohol consumption. The interaction between smoking and alcohol status and specific genotypes was measured and reported as frequency (percentage) and assessed using p values and odds ratios in nine genetic susceptibility studies (Dandara et al., 2005; Li et al., 2005; Dandara et al., 2006; Li et al., 2008; Li et al., 2010; Bye et al., 2011; Matejcic et al., 2011; Vogelsang et al., 2012; Matejcic et al., 2015). Four studies showed statistically significant associations between both alcohol and smoking status and variants in the *CYP3A5*, *CYP2E1*, *GST*, and *NAT2* genes (Dandara et al., 2005; Li et al., 2005; Matejcic et al., 2015). *SULT1A1* variants were associated with smoking status only (Dandara et al., 2006). Other interaction studies included wood/charcoal use and mutations in the *GST* genes (Li et al., 2010), as well as red and white meat intake and SNPs in *NAT1/2* genes (Matejcic et al., 2015).

#### DISCUSSION

#### General Systematic Review Findings

In this study, we systematically evaluated the genetic variants reported to be associated with ESCC in African populations, providing the first systematic review on genetic factors of ESCC in this region. Of all studies that have been published on genetic association to ESCC in African populations, only 23 fit our selection criteria. It was clear from the beginning that there is a dearth of information on this topic. Our analysis showed that 25 germline SNPs were reported to be associated with ESCC in the South African population. However, none of these SNPs were repeated in three or more independent studies; hence, a meta-analysis was not possible. Additionally, only three (*ALDH2*, *PLCE1*, and *CYP2E1*) of the 20 genes were analyzed in two independent studies, and these studies tested different SNPs. We determined that it was unlikely that the two *ALDH2* SNPs analyzed were the same SNP. This is because the MAFs were significantly different and, while one SNP had a protective effect (reduced risk), the other increased risk. The lack of studies re-assessing the same genetic variants poses a major hurdle in validating existing evidence on the association between genetic variants and ESCC development. This makes resolving the genetic etiology of ESCC in African populations difficult.

#### Genetic Susceptibility to ESCC

Of the 25 SNPs from the genetic susceptibility studies that showed an association to ESCC, we concluded that results on 16 SNPs had the potential to be reliable and reproducible due to the larger sample sizes. Ten of the SNPs were reported to increase the risk of ESCC, while six were reported to reduce the risk. However, it was noted that the majority (11) of these SNPs showed association in the South African Admixed population and the studies did not report controlling for population stratification. This is a highly admixed population (Chimusa et al., 2013), in which the predominant ancestral lines are Khoesan (32–43%), Bantu-speaking Africans (20–36%), European (21–28%), and Asian (9–11%) (De Wit et al., 2010). This diverse population is a result of South Africa's colonial and trade history, and constitutes 9% of the total South African population (De Wit et al., 2010). Genetic variability can also be seen in the Black South African population (Chimusa et al., 2013). Without controlling for population stratification, the reproducibility of these results is questionable. It is, however, important to note that the majority of these studies were carried out several years ago, and information on population stratification and methods to detect it may not have been available as yet.

Re-examination of common SNPs from the Chinese population was done in three of the studies (Bye et al., 2011; Bye et al., 2012; Chen et al., 2019), but the findings were not conclusive. It is possible that there may be population-specific differences influencing the genetic etiology of ESCC in the African populations. This may also point to the role of environmental factors contributing to the genetic susceptibility to ESCC through gene-environment interactions.

### Somatic Changes in ESCC

Forty-four somatic variants were reported, but only two were significantly associated with ESCC. The paucity of information was also evident in the somatic variant studies. There were significantly fewer studies (8) on somatic variants than on genetic susceptibility (17). The molecular profiling of tumors is of great importance as it is relevant in the development of targeted cellular therapeutics. One gene (*CDKN2A*) was analyzed in two studies, but these studies focused on different variants. Another gene, *TP53*, was analyzed in four studies, but two studies analyzed different parts of the gene, and two had no control data. It was evident, however, that the WES study provided a wider variety of genetic variants associated with ESCC (Liu et al., 2016). The WES study overall had the largest number of genetic variants of all the 23 studies and was able to identify variants in an unbiased manner.

#### Common Limitations Among the African Studies

There were no GWAS among the studies we analyzed, but reports from the Chinese and European studies demonstrated that GWAS are able to successfully identify common genetic variants associated with ESCC (Abnet et al., 2017). To date, GWAS have successfully identified more than 700 loci for cancer risk. However, these studies have been predominantly done in populations of European ancestry (80%), with African and Latin American populations contributing less than 1% (Van Loon et al., 2018). A shift to WES and GWAS on the African populations might, therefore, yield better results in identifying variants that play a role in ESCC development. The African Esophageal Cancer Consortium, which was initiated in 2016 by African investigators and international partners, released a call to action to, among other priority activities, increase molecular research on esophageal cancer in Africa, particularly GWAS and genomic profiling (Van Loon et al., 2018).

One of the main deficiencies in the studies was that the majority of the genetic susceptibility studies did not report a power calculation, or a genotyping error, and this may have resulted in studies being underpowered and with increased type II error. Few studies reported correction for multiple testing; however, many of the studies were not analyzing multiple variants at the same time. The lack of correction for multiple testing, therefore, is not a reflection on the methodological quality. Very few studies reported NCBI rs numbers. In most studies, the diagnosis of ESCC in patients was adequately defined with no ambiguity on the number of patients with ESCC. There were, however, three studies that combined samples from patients with squamous cell and adenocarcinoma into one case group, which could introduce bias (Dietzsch et al., 2003; Eltahir et al., 2012; Vogelsang et al., 2012).

It is important to note that rs numbers were poorly documented in the majority of the studies assessed in this systematic review. Additionally, in many of these studies, the positions of the SNPs using genome coordinates were not reported, hence making it difficult to locate the SNPs. In the absence of an rs number, we recommend that authors report the position using genome coordinates and the version of the genome used as a reference.

The somatic variant studies also had adequately defined ESCC diagnosis for the majority of the studies. While the variant classification and type were reported by most studies, there was no confirmation of the results (except for two studies). Overall, for both the germline and somatic variant studies, the quality of reporting for the majority of the studies was not adequate. Other important limitations and biases are the lack of controlling for population stratification and small sample sizes in the study populations, which may have led to unreliable results.

## Limitations of the Systematic Review

While we did a comprehensive search in four of the main literature databases, it is possible that we could have missed some non-English studies on African populations. Because of the lack of replication and validation studies, we could not carry out a meta-analysis in the current study. Furthermore, we did not re-analyze the data and relied on reported p values and odds ratios for descriptive analysis.

## CONCLUSIONS

While this review has highlighted a number of genes that may be potentially associated with ESCC in the African populations, limitations such as lack of reproducibility, quality of reporting, and quality of assessment remain a major concern. The implications of having these inconsistencies and lack of reproducibility are that the genetic etiology of ESCC in Africa will continue to be unclear. The region lags behind in contributing to genetic knowledge and literature on ESCC. Importantly, any preventative, diagnostic, or therapeutic interventions cannot be effectively identified or applied in these populations.

The identification of genetic markers of esophageal cancer susceptibility has clear translational benefits to African populations in understanding the underlying disease risk and heritability. Benefits include the utilization of genetic information to improve risk prediction, which can be translated into prevention and screening programs relevant and specific to the African population. These studies also play a role in identifying and quantifying the modifiable environmental risk factors that interact with these genetic variants, and hence provide a platform for better targeted interventions. The ability to sufficiently translate genetic research on the African population depends on more genetic studies being done in the population.

Our recommendations are that more and larger genetic studies be done on the African populations, particularly focusing on WES and GWAS approaches. This will require multinational collaborations between the African countries.

### ETHICS STATEMENT

The study was approved by the Stellenbosch University Health Research Ethics Committee as part of the Doctoral Studies of HS (HREC Reference #: S18/10/250).

### AUTHOR CONTRIBUTIONS

VL, VS, and HS carried out literature searches. HS, VS, and HK appraised the articles, summarized the results, prepared the tables and figures, and drafted the manuscript. VS and VL reviewed the articles and edited the manuscript. VS and HK conceptualized the idea for the research, obtained funding, supervised the project, and wrote sections of the manuscript. VL provided specialist expertise and knowledge, and critically reviewed the manuscript. GT carried out the r2 analyses, prepared the r2 figure and table, and critically reviewed and revised the manuscript. All authors approved the final version of the manuscript.

#### FUNDING

This work was supported by the African Cancer Institute, Faculty of Medicine and Health Sciences, Stellenbosch University. HS acknowledges the Beit Trust Hardship Fund for providing a Doctoral Scholarship in part aid of tuition and registration fees and the Collaboration for Evidence-based Healthcare and Public Health in Africa (CEBHA+), as part of the Research Networks for Health Innovation in Sub-Saharan Africa Funding Initiative of the German Federal Ministry of Education and Research. GT was supported by the South African Tuberculosis Bioinformatics Initiative (SATBBI), a Strategic Health Innovation Partnership grant from the South African Medical Research Council and South African Department of Science and Technology.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00642/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Simba, Kuivaniemi, Lutje, Tromp and Sewram. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Internalizing Mental Disorders and Accelerated Cellular Aging Among Perinatally HIV-Infected Youth in Uganda

*Allan Kalungi1,2,3\*, Jacqueline S. Womersley1, Eugene Kinyanda2,4, Moses L. Joloba3,5, Wilber Ssembajjwe2,6, Rebecca N. Nsubuga6, Jonathan Levin7, Pontiano Kaleebu8, Martin Kidd9, Soraya Seedat1 and Sian M. J. Hemmings1*

*1 Department of Psychiatry, Stellenbosch University, Cape Town, South Africa, 2 Mental Health Project, MRC/UVRI and LSHTM Uganda Research Unit, Entebbe, Uganda, 3 Department of Medical Microbiology, Makerere University, Kampala, Uganda, 4 Department of Psychiatry, College of Health Sciences, Makerere University, Kampala, Uganda, 5 School of Biomedical Sciences, College of Health Sciences, Makerere University, Kampala, Uganda, 6 Statistics and Data Science Section, MRC/UVRI and LSHTM Uganda Research Unit, Entebbe, Uganda, 7 School of Public Health, University of Witwatersrand, Johannesburg, South Africa, 8 MRC/UVRI and LSHTM Uganda Research Unit, Entebbe, Uganda, 9 Centre for Statistical Consultation, Department of Statistics and Actuarial Sciences University of Stellenbosch, Cape Town, South Africa*

#### *Edited by:*

*Nicola Mulder, University of Cape Town, South Africa*

#### *Reviewed by:*

*Manasi Kumar, University of Nairobi, Kenya Celia Van Der Merwe, Broad Institute, United States*

*\*Correspondence: Allan Kalungi allankalungi1@gmail.com*

#### *Specialty section:*

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

*Received: 16 November 2018 Accepted: 03 July 2019 Published: 02 August 2019*

#### *Citation:*

*Kalungi A, Womersley JS, Kinyanda E, Joloba ML, Ssembajjwe W, Nsubuga RN, Levin J, Kaleebu P, Kidd M, Seedat S and Hemmings SMJ (2019) Internalizing Mental Disorders and Accelerated Cellular Aging Among Perinatally HIV-Infected Youth in Uganda. Front. Genet. 10:705. doi: 10.3389/fgene.2019.00705*

Introduction: Internalizing mental disorders (IMDs) in HIV+ children and adolescents are associated with impaired quality of life and non-adherence to anti-retroviral treatment. Telomere length is a biomarker of cellular aging, and shorter telomere length has been associated with IMDs. However, the nature of this association has yet to be elucidated.

Objective: We determined the longitudinal association between IMDs and relative telomere length (rTL) and the influence of chronic stress among Ugandan perinatally HIV-infected youth (PHIY).

Methods: IMDs (depressive disorders, anxiety disorders, and post-traumatic stress disorder) were assessed using the locally adapted Child and Adolescent Symptom Inventory-5. In 368 PHIY with any IMD and 368 age- and sex-matched PHIY controls without any psychiatric disorder, rTL was assessed using quantitative polymerase chain reaction. Hierarchical cluster analysis was used to generate the three chronic stress classes (mild, moderate, and severe). *t*-tests were used to assess the difference between baseline and 12-month rTL and the mean difference in rTL between cases and controls both at baseline and at 12 months. Linear regression analysis was used to model the effects of chronic stress on the association between IMDs and rTL, controlling for age and sex.

Results: We observed longer rTL among cases of IMDs compared with controls (*p* < 0.001). We also observed a statistically significant reduction in rTL between baseline and 12 months in the combined sample of cases and controls (*p* < 0.001). The same statistical difference was observed when cases and controls were individually analyzed (*p* < 0.001). We found no significant difference in rTL between cases and controls at 12 months (*p* = 0.117). We found no significant influence of chronic stress on the association between IMDs and rTL at either baseline or 12 months.

Conclusion: rTL is longer among cases of IMDs compared with age- and sex-matched controls. We observed a significant attrition in rTL over 12 months, which seems to be driven by the presence of any IMDs. There is a need for future longitudinal and experimental studies to understand the mechanisms driving our findings.

Keywords: internalizing mental disorders, relative telomere length, HIV+, perinatally HIV-infected youth, Uganda

### BACKGROUND

Human immunodeficiency virus/acquired immunodeficiency disease syndrome (HIV/AIDS) is a significant global health burden, with approximately 36.9 million people infected globally (UNAIDS, 2018). Both eastern and southern Africa remain the most affected regions, accounting for 45% of the world's HIV infections (UNAIDS, 2018). Of the over 2 million HIV-positive (HIV+) children globally, 90% reside in sub-Saharan Africa (UNAIDS, 2010). In Uganda, the country with the fifth-highest HIV prevalence in the region, an HIV prevalence of 0.5% has been reported among children aged 0–14, which corresponds to approximately 95,000 children living with HIV in the country (UPHIA, 2016–2017). The introduction of antiretroviral therapy (ART) has led to improved survival of HIV-infected youth (7–17 years); however, the mental health of these youth has received less attention (Mupambireyi et al., 2014). Perinatally HIV-infected youth (PHIY) are faced with a burden of psychiatric morbidity (Kamau et al., 2012), in addition to delayed motor and cognitive development (Le Doaré et al., 2012; Van Rie et al., 2007). Studies undertaken in both the developed (Europe and the United States) and developing world (sub-Saharan Africa) have documented depression rates of between 12.7% and 40% (Musisi and Kinyanda, 2009; Gadow et al., 2012; Kamau et al., 2012; Mellins et al., 2012; Nachman et al., 2012; Lwidiko et al., 2018; Kim et al., 2014) among PHIY. For anxiety disorders, rates of 9% to 32.2% have been reported among PHIY (Kamau et al., 2012; Mellins et al., 2012; Nachman et al., 2012; Kinyanda et al., 2019).

IMDs are associated with psychological distress (Musisi and Kinyanda, 2009), impaired quality of life, and non-adherence to ART (Walkup et al., 2009; Malee et al., 2011). In addition, patients with IMDs have higher mortality rates than the general population (Cuijpers and Smit, 2002; Colton and Manderscheid, 2006; Ahmadi et al., 2011; Druss et al., 2011).

IMDs are characterized by quiet, internal distress (Tandon et al., 2011), in contrast to externalizing disorders, where overtly socially negative or disruptive behavior is displayed (Tandon et al., 2011). IMDs with high levels of negative affectivity include depressive disorders (e.g., dysthymic disorder), anxiety disorders (e.g., generalized anxiety disorder and social anxiety disorder), and obsessive-compulsive disorder (Regier et al., 2013; Turygin et al., 2013). Despite intensive research, the diagnosis of IMDs is still largely based on clinical symptoms, with an absence of biological markers to facilitate diagnosis. This is largely because the pathophysiological mechanisms underlying IMDs, such as depression and anxiety, are still largely unknown. Several studies have investigated the association between telomere length (TL) and IMDs, and shorter TL has been reported in adults with depression (Simon et al., 2006; Verhoeven et al., 2014; Cai et al., 2015) and anxiety disorders (Kananen et al., 2010; Verhoeven et al., 2015).

Telomeres are protein-bound deoxyribonucleic acid (DNA) repeat structures at the ends of chromosomes (Lindqvist et al., 2015), and are important in preventing chromosomes from fusing together during mitosis, thus preventing loss of genetic data (Allsopp et al., 1992; Blackburn et al., 2006). They also regulate cellular replicative capacity (Allsopp et al., 1992; Blackburn et al., 2006). During somatic cell replication, telomeres progressively shorten due to the inability of the DNA polymerase enzyme to fully replicate the 3′ end of the DNA strand (Allsopp et al., 1992; Blackburn et al., 2006), a process termed the "end replication problem" (Watson, 1972). This results in a gradual decline in TL over time. Once a critically short TL is reached, the cell is triggered to enter replicative senescence and subsequently cell death (Allsopp et al., 1992; Blackburn et al., 2006). TL provides a metric of cellular age and accounts for roughly 15% of the variance of age (Epel and Prather, 2018). TL has been reported to shorten in a predictable way with chronological age by roughly 20–40 base pairs per year (Cesare and Reddel, 2010). TL is partially genetically determined, with heritability estimates ranging from 36% to 84% (Aviv, 2012) and is highly variable between individuals (Vasa-Nicotera et al., 2005; Njajou et al., 2007). The current study assessed TL as relative TL (rTL), with rTL being proportional to an individual's TL (Cawthon, 2009).

**Abbreviations:** µL, microliter; ART, anti-retroviral therapy; CASI-5, Child and Adolescent Symptom Inventory—edition 5; CD4, cluster of differentiation 4; CIs, confidence intervals; Ct, threshold cycle; DNA, deoxyribonucleic acid; DSM-5, *Diagnostic Statistical Manual for Mental Disorders*—edition 5; HBG, human β-globin gene; HCA, hierarchical cluster analysis; HIV/AIDS, human immunodeficiency virus/acquired immunodeficiency disease syndrome; HIV+, HIV positive; IMD, internalizing mental disorder; JCRC, Joint Clinical Research Centre; LTL, leucocyte telomere length; MRC/DfID, Medical Research Council/Department for International Development; ng, nanogram; PTSD, post-traumatic stress disorder; qPCR, quantitative polymerase chain reaction; rTL, relative telomere length; s, second; scg, single copy gene; TASO, The AIDS Support Organization; TERC, telomerase RNA complex; TERT, telomerase reverse transcriptase; TL, telomere length; UNAIDS, The Joint United Nations Programme on HIV and AIDS; UVRI, Uganda Virus Research institute.

Several studies in youth have reported associations between adversity and telomere shortening (Shalev et al., 2013; Theall et al., 2013; Drury et al., 2014; Mitchell et al., 2014). Adversity experienced in youth ranges from exposure to traumatic stressors, such as sexual and physical abuse, to social adversities that relate to family structure, parental mental distress, and socio-economic status (SES). Causal associations between stressful life events and early adversities, such as childhood sexual abuse and major depression, are well documented (Kendler et al., 1999; Fergusson and Mullen, 1999; Kendler et al., 2000), with evidence suggesting that molecular signatures of stress overlap with major depression (Cai et al., 2015). Biological processes, such as inflammation and oxidative stress, which have been observed in several psychiatric disorders, are also associated with telomere shortening (Wolkowitz et al., 2011a; Wolkowitz et al., 2011b), suggesting that telomere shortening may be related to certain psychiatric endophenotypes.

Depression has been considered a syndrome of accelerated aging (Heuser, 2002). The first study to examine leucocyte TL (LTL) in a group of subjects with either major depression or bipolar disorder and age-matched controls found shorter LTL among cases compared with healthy controls (Simon et al., 2006). A large longitudinal clinical cohort study found shorter LTL among groups who were currently depressed or had remitted depression compared with healthy controls (Verhoeven et al., 2014). However, there was no statistically significant difference in LTL between the currently depressed and remitted depression groups, suggesting that depression may leave an "indelible marker" on LTL. In the currently depressed group, a dose–response relationship was observed, with LTL inversely associated with both severity and duration of depression. This dose–response relationship was further supported by a longitudinal study by Shalev et al. (2014), where persistence of IMDs from 11 to 38 years predicted reduced LTL at 38 years of age in a dose-dependent manner among male participants. It is, however, not possible to rule out that LTL was already reduced at the first episode of depression, indicating that shorter LTL could be a risk factor for depression. Indeed, Gotlib et al. (2015) described shorter LTL as a risk marker for depression, where shorter LTL was observed among girls (aged 10–14 years) at increased risk for depression. High risk for depression was assessed as having a mother with a history of recurrent episodes of depression, while low risk was assessed as having a mother with no current or past Axis I disorder during a girl's lifetime. However, results across studies have been inconsistent. While several other studies have reported shorter LTL among currently depressed individuals compared with controls (Lung et al., 2007; Hoen et al., 2011; Wikgren et al., 2012; Garcia-Rizo et al., 2013), some studies have failed to find an association (Wolkowitz et al., 2011a; Teyssier et al., 2012; Needham et al., 2015; Schaakxs et al., 2015).

Accelerated aging has also been described in anxiety disorders. Using the same study population as described in Verhoeven et al. (2014), the authors reported shorter LTL among subjects with a diagnosis of current anxiety disorder than among controls (Verhoeven et al., 2015). There was, however, no statistically significant difference in LTL between the remitted anxiety disorder group and controls, suggesting that LTL shortening in anxiety disorders may be more reversible than that associated with depression. Needham et al. (2015) reported an association between shorter LTL and a diagnosis of generalized anxiety disorder and panic disorder among women. Kananen et al. (2010) reported shorter LTL among older anxiety disorder subjects (48–87 years of age) compared with controls, and a study by Okereke et al. (2012) reported a dose–response relationship where severe phobia was associated with shorter LTL.

PTSD has also been considered in the context of accelerated aging (Moreno-Villanueva et al., 2013; Miller and Sadeh, 2014). Shorter LTL has been implicated in PTSD, though the effects were primarily explained by early life stress (O'Donovan et al., 2011). Shorter LTL was reported among combat-deployed soldiers with PTSD, compared with those without PTSD (Zhang et al., 2014). There is a need to understand whether telomere shortening is a direct effect of PTSD, whether the development of PTSD and shortening of telomeres are simultaneous effects of increased stress reactivity (Zhang et al., 2014), or whether telomere shortening is a risk factor for PTSD (Malan et al., 2011).

HIV infection has also been found to be associated with shortened telomeres (Oeseburg et al., 2010; Auld et al., 2016). HIV/AIDS may be viewed as a chronic psychological stressor due to the illness and stigma that are associated with the disease (Varni et al., 2012). Since TL has been found to be a marker for chronic stress (Needham et al., 2015), shorter telomeres are expected in HIV/AIDS subjects as compared with the disease-free population.

We hypothesized that in PHIY in Uganda, attrition in rTL over a 12-month period would be greater in cases of IMDs compared with age- and sex-matched controls without any psychiatric disorder. We further hypothesized that cases would have shorter rTL than controls. We thus aimed to determine the longitudinal association between IMDs and rTL and the influence of chronic stress in this relationship.

### METHODS

#### Study Design

This case–control study was nested within a Medical Research Council/Department for International Development (MRC/DfID)-funded project that investigated mental health among children and adolescents living with HIV/AIDS in Kampala and Masaka districts of Uganda (CHAKA study), which enrolled 1,339 Ugandan PHIY (7–17 years) of black African ancestry (Kinyanda et al., 2019). All participants with any of the IMDs (368 cases) and an equal number of age- and sex-matched controls were selected from CHAKA (*N* = 736) and included in the present study. Both the baseline and 12-month archived blood samples for each included participant were retrieved, and genomic DNA was extracted from them.

#### Study Population

Study participants were recruited from two HIV clinics in urban Kampala [Joint Clinical Research Centre (JCRC) and Nsambya Home Care] and three HIV clinics in rural Masaka [The AIDS Support Organization (TASO), Kitovu Mobile Clinic, and Uganda Cares]. All study participants were on ART.

### Procedures

Consenting PHIY, as well as their caregivers, were interviewed using a structured questionnaire. The questionnaire included, among others, socio-demographic characteristics (sex, study site, age, caregiver level of education, and SES) and the depression, post-traumatic stress disorder, and anxiety modules from the DSM-5-referenced Child and Adolescent Symptom Inventory-5 (CASI-5) (Gadow, 2013). The CASI-5 was locally adapted for use in Uganda (Mpango et al., 2017). Trained psychiatric nurses and psychiatric clinical officers administered the CASI-5 at two time points (baseline and 12 months). The CASI-5 lists the symptoms of a wide range of psychiatric disorders including major depressive disorder, generalized anxiety disorder, PTSD, and attention-deficit/hyperactivity disorder, among others. Individual CASI-5 items are rated on a 4-point frequency of occurrence scale ranging from never (0) to very often (3). There are several CASI-5 scoring algorithms; however, in the present study we used symptom count cutoff scores that reflect the prerequisite number of symptoms for a clinical diagnosis. At each study visit, 4 ml of blood was withdrawn from each study participant through venipuncture into an EDTA vacutainer and was stored at −80 °C pending DNA extraction.

### Inclusion and Exclusion Criteria

*Inclusion criteria*: i) HIV-infected outpatients, registered with any of the HIV clinics at any of the study sites; ii) aged between 7 and 17 years at the time of enrolment; iii) conversant in English or Luganda, the language into which the research assessment tools were translated; and iv) able to provide written informed consent (caregivers)/assent (adolescents). Cases were subjects who had any depressive disorder [depression or dysthymia (persistent depressive disorder)] or anxiety disorder. Controls were age- and sex-matched subjects without any psychiatric disorder. Persistent IMDs were baseline cases that remained cases at 12 months, while remitted ones were baseline cases that lost disease status at 12 months.

*Exclusion criteria*: i) seriously ill, including being unable to understand study procedures, and ii) any psychiatric disorder other than the ones listed above.

### Ethical Considerations

Both CHAKA and the present study were conducted in compliance with the Code of Ethics of the World Medical Association (Declaration of Helsinki). The CHAKA study obtained ethical and scientific clearance from the Uganda Virus Research Institute (UVRI) Science and Ethical Committee (#GC/127/15/06/459) and the Uganda National Council of Science and Technology (#HS 1601). The present study obtained approval from the Higher Degrees Research & Ethics Committee, School of Biomedical Sciences, College of Health Sciences, Makerere University (#SBS 421) and the Health Research Ethics Committee of Stellenbosch University (#S17/09/179). Caregivers provided informed consent for their children/adolescents to participate in the study and for a blood specimen to be drawn for genetic analyses. Adolescents provided further assent to participate in the study. Study participants who were diagnosed with significant psychiatric problems were referred to mental health units at Entebbe and Masaka government hospitals.

### Measure of Chronic Stress

Chronic stress was measured as social disadvantage. Variables considered to confer social disadvantage were used to construct a composite index of chronic stress: orphanhood (double orphanhood carried a higher chronic stress score vs. single or not orphaned); food availability (not enough food carried a higher chronic stress score vs. enough food); study site (urban carried a higher chronic stress score than rural); and caregiver level of education (no formal education carried a higher chronic stress score than primary, and primary a higher score than secondary, etc.). Hierarchical cluster analysis (HCA) was used to generate the cutoff points for each chronic stress class.

### Chronic Stress Classes

The chronic stress index ranged from 0 to 3.75, with a normal distribution. A total of three chronic stress classes were generated during HCA, i.e., mild, moderate, and severe. The mild class had a chronic stress score of 0 to 1.375, the moderate class had a score of greater than 1.375 to 2.375, and the severe class had a score of greater than 2.375.
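The class boundaries above can be expressed as a small lookup. The following is a minimal sketch in Python; the function name `chronic_stress_class` is hypothetical, and the sketch only applies the reported cutoffs rather than reproducing the hierarchical cluster analysis that generated them.

```python
def chronic_stress_class(score: float) -> str:
    """Map a chronic stress index value to the class reported in the text.

    Cutoffs (from the text): mild 0 to 1.375, moderate >1.375 to 2.375,
    severe >2.375. The index is reported to range from 0 to 3.75.
    """
    if not 0 <= score <= 3.75:
        raise ValueError("score outside the reported index range 0 to 3.75")
    if score <= 1.375:
        return "mild"
    if score <= 2.375:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    for s in (0.0, 1.375, 1.5, 2.375, 3.0):
        print(s, chronic_stress_class(s))
```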

### Analysis of Relative Telomere Length

DNA was extracted from blood collected from each participant, using the QiAmp Mini DNA Extraction Kit (Qiagen GmbH, Germany). Extracted DNA was quantified by 260/280 and 260/230 ultraviolet spectrophotometry on the NanoDrop 1000 spectrophotometer V3.7 (Thermo Fisher Scientific, Wilmington, MA). The DNA was subsequently diluted to 5 ng/µl and amplified using the KAPA SYBR FAST qPCR Master Mix (Merck, Darmstadt, Germany) per Cawthon (2002), with slight modifications. Primers specific for telomeric repeats (T) (Cawthon, 2002) and a stably expressed single copy reference gene (S), the human β-globin (HBG1, 5′-GCTTCTGACACAACTGTGTTCACTAGC-3′; and HBG2, 5′-CACCAACTTCATCCACGTTCACC-3′), were used to amplify telomeric repeats and human β-globin, respectively. For the telomere assay, each reaction included 5 µl of KAPA SYBR FAST qPCR Master Mix (Merck, Darmstadt, Germany); 1.35 and 4.50 µM of forward and reverse primers, respectively; 5 ng of genomic DNA; and water in a 10-µl reaction volume. The human β-globin assay was identical to the telomere assay except that 2.0 µM of each of the forward and reverse primers were used. The reactions for the telomeric repeats and the human β-globin gene were amplified on the same 384-well plates. Each participant's DNA sample was amplified in triplicate. If the threshold cycle (Ct) values of the triplicates of a particular sample differed by more than 0.5, that sample was excluded. From the triplicate Ct values, the means were calculated for each sample and used in subsequent calculations. Amplification was performed on the ABI 7900HT Fast Real-Time PCR system (Applied Biosystems, Foster City, CA) using the following thermal cycling profile: 95 °C for 3 min, followed by 40 cycles of 95 °C for 3 s and 60 °C for 30 s, and a dissociation stage of 95 °C for 15 s, 64 °C for 15 s, and 95 °C for 15 s. A calibrator sample was prepared by pooling equal amounts of DNA from each participant for the construction of a standard curve. 
The calibrator DNA sample was serially diluted 1.68-fold per dilution, to produce a nine-point standard curve, with DNA amounts ranging from 50 to 0.79 ng/µl. After amplification of the serial dilutions, a linear plot of the Ct versus the log value of the input amount of DNA (standard curve) was constructed using ABI's SDS v.2.3 software. The efficiency of a reaction was also determined from the standard curve of that reaction. Threshold and baseline values were used as determined by the SDS v.2.3 software. All Ct values were corrected for the PCR efficiency, and interplate calibrations were performed using GenEx software (http://www.gene-quantification.de/datan.html).
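The standard-curve step can be illustrated with a short Python sketch. It reproduces the nine-point 1.68-fold dilution series, fits Ct against log10 of the input DNA by ordinary least squares, and converts the slope to an amplification efficiency using the conventional relationship E = 10^(−1/slope) − 1. That relationship and all function names are assumptions for illustration; the text states only that efficiency was derived from the standard curve in the SDS software, and the Ct values in the usage section are simulated, not study data.

```python
import math


def dilution_series(start=50.0, fold=1.68, points=9):
    """DNA concentrations (ng/µl) for a serial dilution, highest first."""
    return [start / fold ** i for i in range(points)]


def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def efficiency_from_slope(slope):
    """Conventional qPCR efficiency (assumed formula): 1.0 means perfect doubling."""
    return 10 ** (-1.0 / slope) - 1.0


if __name__ == "__main__":
    amounts = dilution_series()
    # Simulated Ct values for an ideal reaction (one cycle per 2-fold change).
    cts = [30.0 - math.log2(a) for a in amounts]
    slope, intercept = fit_line([math.log10(a) for a in amounts], cts)
    print("lowest standard (ng/µl):", round(amounts[-1], 2))
    print("slope:", round(slope, 3), "efficiency:", round(efficiency_from_slope(slope), 3))
```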

A validated qPCR method (Cawthon, 2002) was used to determine relative TLs (rTLs) in all samples. First, the mean telomere repeat copy number (tel, T) was normalized to a reference gene (single copy gene) (scg, S) copy number to control for differences in DNA quantity. The T/S ratio is proportional to the average TL. Thereafter, the factor by which the T/S ratio differs between the experimental sample and the calibrator sample is determined to provide an indication of relative average TL:

> $$T/S = 2^{-\Delta C_t}, \qquad \text{Relative average TL} = 2^{-\Delta\Delta C_t}, \qquad \text{where } \Delta C_t = C_t(\text{tel}) - C_t(\text{scg}).$$

A T/S > 1 indicates that the average rTL in the sample is greater than that of the reference sample, and a T/S < 1 indicates that the average rTL in the experimental sample is less than that in the reference sample.
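The calculation described above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: triplicates are taken to "differ by more than 0.5" when the range (maximum minus minimum Ct) exceeds 0.5; ΔΔCt is taken as the sample ΔCt minus the calibrator ΔCt; and a base of 2 (perfect doubling) is used, whereas the study corrected Ct values for PCR efficiency beforehand. Function names are hypothetical.

```python
from statistics import mean


def mean_ct(triplicate, max_spread=0.5):
    """Mean Ct of a triplicate, or None if the replicates are too far apart.

    Assumption: 'too far apart' means max - min > max_spread.
    """
    if max(triplicate) - min(triplicate) > max_spread:
        return None
    return mean(triplicate)


def t_s_ratio(ct_tel, ct_scg):
    """T/S = 2^-(Ct(tel) - Ct(scg)), assuming perfect doubling per cycle."""
    return 2.0 ** -(ct_tel - ct_scg)


def relative_tl(sample_tel, sample_scg, calib_tel, calib_scg):
    """Relative TL of a sample versus the calibrator, or None if QC fails."""
    cts = [mean_ct(t) for t in (sample_tel, sample_scg, calib_tel, calib_scg)]
    if any(ct is None for ct in cts):
        return None
    s_tel, s_scg, c_tel, c_scg = cts
    delta_delta_ct = (s_tel - s_scg) - (c_tel - c_scg)
    return 2.0 ** -delta_delta_ct


if __name__ == "__main__":
    # Illustrative Ct values only, not study data.
    rtl = relative_tl([15.0, 15.1, 15.2], [20.0, 20.1, 20.2],
                      [16.0, 16.1, 16.2], [20.0, 20.1, 20.2])
    print("relative TL:", rtl)
```

Under these assumptions, a returned value above 1 corresponds to a sample whose T/S ratio exceeds that of the calibrator.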

#### Power for the Study

We calculated the *post hoc* power for our study based on results from a study by Epel et al. (2004). We used the formula of sample size and power for difference in means in case–control studies. We worked on the assumption that cases (individuals with IMDs) would have higher levels of stress than controls (individuals without IMDs). Epel et al. (2004) found a 15% reduction in mean rTL among cases compared with controls. Given a 1:1 ratio of cases to controls and using a 5% level of significance, with 368 cases and controls, our study was well powered (power greater than 80%) to detect any reduction above 4.75% in mean rTL between cases and controls. For instance, a reduction of 5% in mean rTL between cases and controls provided a power of 83.8%.
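
For readers who want to see the shape of such a calculation, the sketch below uses the usual normal approximation for the power of a two-sided comparison of two means with equal group sizes. It is not the authors' computation: the standard deviation of rTL is not reported in this section, so the inputs shown are placeholders, and the function name is hypothetical.

```python
from statistics import NormalDist

def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a mean difference `delta` with a common SD `sd`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    se = sd * (2 / n_per_group) ** 0.5     # standard error of the difference in means
    return z.cdf(abs(delta) / se - z_crit)

# Placeholder inputs (assumption): a difference of half a standard deviation.
example_power = power_two_means(delta=0.5, sd=1.0, n_per_group=64)
```

The approximation ignores the negligible chance of rejecting in the wrong direction; power rises with the sample size and with the size of the difference relative to the standard deviation.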

#### Statistical Methods

Statistical analyses were conducted using Stata 15 (StataCorp, TX, USA).

Socio-demographic characteristics were described for cases and controls. Chi-square tests were used to assess the association between the socio-demographic characteristics and IMDs at baseline (cases vs. controls). SES was generated from a scale of nine household items (car, motorcycle, refrigerator, electricity, bicycle, radio, telephone, cupboard, and flask). Items were weighted in the order listed, from a maximum weight of 9 for a car to a minimum weight of 1 for a flask. A total item score was generated, and its median of 13 was used as the cutoff to classify low and high SES. A score of 13 or less was classified as low SES and a score greater than 13 as high SES, consistent with the footnote to Table 1. Our study group (Kinyanda et al., 2011) has previously used household items as a measure of SES in rural settings of Uganda. A *t*-test was used to compare CD4 counts between cases and controls to account for any disparity in HIV disease progression.
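
The SES scale can be illustrated with a short Python sketch. Two points are assumptions rather than statements from the text: that the total score is the sum of the weights of the items a household owns, and that a score of exactly 13 counts as low SES (following the Table 1 footnote, "low SES, 0–13"). The names `ITEM_WEIGHTS`, `ses_score`, and `ses_class` are hypothetical.

```python
# Weights follow the order in which the items are listed: car = 9 down to flask = 1.
ITEM_WEIGHTS = {
    "car": 9, "motorcycle": 8, "refrigerator": 7, "electricity": 6, "bicycle": 5,
    "radio": 4, "telephone": 3, "cupboard": 2, "flask": 1,
}
SES_CUTOFF = 13  # median score reported in the study

def ses_score(owned_items):
    """Sum of the weights of the household items owned (assumed scoring rule)."""
    return sum(ITEM_WEIGHTS[item] for item in set(owned_items))

def ses_class(score):
    """Low SES at or below the cutoff, high SES above it (assumed handling of 13)."""
    return "low" if score <= SES_CUTOFF else "high"

household = ["bicycle", "radio", "flask"]  # hypothetical household
score = ses_score(household)
label = ses_class(score)
```

A household with a bicycle, a radio, and a flask scores 5 + 4 + 1 = 10 and falls in the low SES class under these assumptions.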

Outliers were revealed by box and whisker plots and were all removed from the rTL data. The skewed rTL data became normally distributed after removal of outliers.
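
A box and whisker plot conventionally flags points lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that convention; the 1.5 multiplier, the quartile method, the function name, and the data are all assumptions, since the text does not state the exact rule used.

```python
from statistics import quantiles

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's box-plot fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]

# Hypothetical rTL values with one extreme observation.
rtl_values = [1.0, 1.1, 1.2, 0.9, 1.0, 1.05, 0.95, 5.0]
cleaned = remove_outliers(rtl_values)
```

Removing extreme values in this way narrows the tails of the distribution, which is consistent with the skewed rTL data becoming approximately normal afterwards.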

The distributions of rTL at baseline and at 12 months and the change in rTL were determined using a standardized normal probability plot (P-P plot) (see **Supplementary Materials**). The difference in rTL distribution at baseline and at 12 months was assessed using *t*-tests. The mean difference in rTL between cases and controls was also assessed using *t*-tests. One-way analysis of variance was used to assess whether change in rTL differed significantly across the categories of age, sex, study site, caregiver education level, and child education level.

Linear regression was used to i) assess the relationship between rTL and chronic stress, adjusting for sex and age; ii) model the effect of chronic stress on the association between IMDs and rTL, by comparing models without chronic stress to models with chronic stress; and iii) model the effect of age on the relationship between IMDs and rTL. Some participants had missing rTL values at baseline, at 12 months, or both. For all analyses that required confidence intervals, we computed 95% confidence intervals; statistical significance was set at a *p*-value less than or equal to 0.05, while a *p*-value greater than 0.05 but less than 0.07 was considered a trend towards significance.

### RESULTS

Socio-demographic factors were evenly distributed between cases and controls as shown in **Table 1**.

Tests of association between different socio-demographic variables and rTL were run to determine potential confounders (**Table 2**). None of the socio-demographic variables were associated with rTL. Study site, age, and SES were significantly associated with chronic stress (*p* < 0.001, *p* = 0.040, and *p* = 0.015, respectively).

#### Difference in rTL Between Cases and Controls

rTL was normally distributed at both baseline and 12 months. Mean rTL (95% CI) of the combined sample of cases and controls was 1.148 (1.119–1.176) at baseline and 0.905 (0.879–0.931) at 12 months. For cases, mean rTL (95% CI) was 1.198 (1.157–1.239) at baseline and 0.925 (0.886–0.965) at 12 months; while for controls, mean rTL (95% CI) was 1.097 (1.057–1.137) at baseline and 0.884 (0.851–0.917) at 12 months.

*CD4, cluster of differentiation 4; primary, 0–7 years of formal education; secondary, 8–14 years of formal education; low SES, 0–13; high SES, > 13. All numbers that do not add up were due to missing data.*

TABLE 2 | *p*-values for tests of association between socio-demographic variables and rTL change and chronic stress.

*SES, socioeconomic status.*

At baseline, we found a statistically significant difference in rTL between cases and controls (*p* < 0.001). However, contrary to what we expected, rTL was longer in cases compared with controls. There was, however, no statistical difference in rTL between cases and controls at 12 months (*p* = 0.117). In addition, the change between baseline and 12-month rTL (rTL change) did not differ statistically between cases and controls (*p* = 0.608) (**Table 3**).
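
To make the group comparison concrete, the arithmetic below takes the group means reported above and subtracts the 12-month mean from the baseline mean in each group. This is a back-of-the-envelope illustration only: differences of group means are not the same as the per-participant rTL change analyzed in Table 3, because some participants had missing values at one time point.

```python
# Group mean rTL values as reported in the text.
mean_rtl = {
    "cases": {"baseline": 1.198, "month_12": 0.925},
    "controls": {"baseline": 1.097, "month_12": 0.884},
}

# Drop in the group mean between baseline and 12 months.
drop = {
    group: round(m["baseline"] - m["month_12"], 3)
    for group, m in mean_rtl.items()
}
# Gap between cases and controls at each time point.
gap_baseline = round(mean_rtl["cases"]["baseline"] - mean_rtl["controls"]["baseline"], 3)
gap_month_12 = round(mean_rtl["cases"]["month_12"] - mean_rtl["controls"]["month_12"], 3)
```

On these figures the case mean falls by 0.273 and the control mean by 0.213, while the gap between the groups narrows from 0.101 at baseline to 0.041 at 12 months.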

#### Differences Between Baseline and 12-month rTL

In the combined analysis of baseline cases and controls there was significant attrition in rTL between baseline and 12 months (*p* < 0.001). This attrition did not differ by internalizing mental disorder (IMD) status (*p* = 0.608). A further stratified analysis of cases only and controls only yielded similar *p*-values of <0.001 (**Table 4**).

#### Association Between Chronic Stress and rTL

We observed a trend towards statistical significance between chronic stress and baseline rTL (*p* = 0.067). Severe stress was significantly associated with longer rTL (*p* = 0.028) (**Table 5**). However, chronic stress was not significantly associated with either 12-month rTL or a change in rTL (*p* = 0.147 and *p* = 0.455, respectively) (**Table 5**).

#### Association Between Chronic Stress and IMDs

We found a trend toward statistical significance between chronic stress and IMDs (**Table 6**).

#### The rTL and IMDs After 12 Months

We found no significant difference in baseline rTL between cases of IMDs that persisted compared to those that remitted after 12 months (*p* = 0.235). We also found no statistically significant association between 12-month rTL and 12-month IMD status (*p* = 0.090), as well as no association between disease severity and rTL at baseline (*p* = 0.238) and 12 months (*p* = 0.264).

#### Effect of Chronic Stress on the Association Between IMDs and rTL

We found no significant influence of chronic stress on the association between IMDs and rTL both at baseline and at 12 months (**Table 7**).




TABLE 5 | Assessing association between chronic stress and rTL, adjusted for age and sex.


*Reference, reference chronic stress class during regression analysis.*

TABLE 6 | Association between chronic stress and internalizing mental disorders.


#### Effect of Age on the Relationship Between IMDs and rTL

On stratifying our analyses for age [children (7–11 years) and adolescents (12–17 years)], we observed no statistically significant differences by age group for IMDs and rTL compared with those that were observed with both age categories combined (**Table 8**).

#### DISCUSSION

In this study, we investigated the association between chronic stress and rTL among PHIY cases with IMDs and age- and sex-matched controls in Uganda. To our knowledge, this is the first sub-Saharan African study to investigate the association between chronic stress with rTL and IMDs among PHIY.

*CS, chronic stress; reference, reference disease status/chronic stress class during regression analysis.*

*Children, 7–11 years; adolescents, 12–17 years; reference, reference IMDs status during regression analysis.*

Several studies have determined the association between TL and different internalizing psychopathologies. Shorter TL has been reported among cases of depression compared with controls (Garcia-Rizo et al., 2013; Shalev et al., 2014; Verhoeven et al., 2014), while others have failed to find significant associations (Wolkowitz et al., 2011a; Teyssier et al., 2012; Simon et al., 2015). Shorter TL has also been implicated in both anxiety disorders (Kananen et al., 2010; Verhoeven et al., 2015) and PTSD (O'Donovan et al., 2011; Zhang et al., 2014) and has been reported to confer risk for PTSD (Malan et al., 2011). Due to these reported associations of shorter TL in the different internalizing psychopathologies, we hypothesized that rTL would be shorter among cases of IMDs than controls in our study participants. Contrary to our hypothesis, we observed longer rTL among cases of IMDs compared with their controls (*p* < 0.001). Longer rTL among IMDs could be due to elevated telomerase levels. TL is maintained by a telomerase enzyme component known as telomerase RNA component (TERC) and a reverse transcriptase enzyme known as the telomerase reverse transcriptase (TERT) (Wang and Meier, 2004; Blackburn et al., 2006). Wolkowitz et al. (2012) indeed reported higher telomerase levels among people with depression than among healthy matched controls at baseline. After 8 weeks of treatment with selective serotonin re-uptake inhibitors, they found that telomerase levels became even more elevated as depression remitted. It has been speculated that elevated telomerase levels are a compensatory effort towards excessive loss of telomeres (Damjanovic et al., 2007; Lin et al., 2012).

We also observed a statistically significant reduction in rTL between baseline and 12 months in a combined sample of cases and controls (*p* < 0.001). A statistical difference was also observed when cases and controls were individually analyzed (*p* < 0.001). This difference was expected since TL generally decreases over the life span (Muezzinler et al., 2013). We found no significant difference in rTL between cases and controls at 12 months (*p* = 0.117). Since cases had significantly longer rTL than controls at baseline (*p* < 0.001), the lack of a significant difference at 12 months indicates greater rTL attrition among cases compared with controls. This is an interesting observation that points to the notion that IMDs are possibly driving accelerated cellular aging (rTL attrition). Indeed, telomere shortening has been reported to be strongly influenced by chronic stress exposure (Ridout et al., 2015), and suffering from a chronic disease, such as heart disease (Haycock et al., 2014) and diabetes (Zhao et al., 2013), has been conceptualized as a prolonged stress exposure that could explain their association with TL. IMDs have been reported as chronic stressors (McEwen, 2003) with chronic biological adaptations that result in long-term biological damage that could potentially explain rTL attrition due to IMDs. IMDs could also be leading to rTL attrition through inflammatory pathways. Depression has been reported to prime larger cytokine responses to stressors (Kiecolt-Glaser et al., 2015). Increased systemic inflammation has been associated with decreased TL among a prospective cohort of workers exposed to high levels of fine particulate matter (Wong et al., 2014), while interventions that attenuate inflammatory processes in fear- and anxiety-based disorders have been thought to be effective in mitigating the symptoms of anxiety disorders (Michopoulos et al., 2017).

If IMDs were driving rTL attrition, we would expect significant reduction in rTL among cases with no corresponding significant reduction among controls. Intriguingly, we observed significant reduction in rTL in both groups (*p* < 0.001). This is possibly due to general longitudinal reduction in rTL. However, study subjects were only followed up for 12 months, and a longer follow-up period may be required to see a true difference in rTL attrition between cases and controls. It needs to be borne in mind that other factors may be responsible for either the overall greater reduction in rTL over 12 months or the greater rTL reduction among cases than controls. For example, participants were all on ART, with the type of ART regimen not accounted for in the analysis. Also, factors known to affect rTL, such as diet (Shiels et al., 2011) and frequency of physical exercise (Cherkas et al., 2008) were not accounted for. In addition, effects on rTL may have been determined even before birth from maternal stress, or through direct transmission of maternal rTL.

Although previous studies among children have found associations between TL and socio-demographic variables, such as caregiver level of education (Needham et al., 2012), parental SES (Needham et al., 2012; Mitchell et al., 2014), sex (Drury et al., 2014), and living environments (Theall et al., 2013), we found no association between any baseline sociodemographic variables and rTL change in the present study. This discrepancy could be due to cultural context, as previous studies were carried out in developed world settings that differ from the African low-income setting of this study. For example, stress due to orphanhood in the Ugandan context may be experienced differently, as there is a strong extended family system in Uganda where orphans tend to be taken care of by their uncles or aunts, unlike in the developed world where orphans are often institutionalized. The latter has been associated with shorter TL (Drury et al., 2012). More studies are needed to understand factors that affect TL in the sub-Saharan African context.

We found no association between rTL change and persistence or remission of IMDs. This further suggests that rTL does not drive IMDs, but rather IMDs may be driving accelerated cellular aging. Higher mortality rates have been reported among patients with IMDs compared with the general population, and the mortality is mainly due to the same age-related diseases as in the general population, such as cancer, heart disease, and cerebrovascular disease. For example, a study by Colton and Manderscheid (2006) reported that clients with a diagnosis of major mental illness died 1 to 10 years earlier than did clients with no major mental illness. Another study reported that persons with mental disorders died an average of 8.2 years younger than did the rest of the population and that presence of a mental illness was associated with a hazard ratio of 2 over a 17-year study period (Druss et al., 2011), supporting the mediating role of IMDs in accelerated cellular aging.

Psychological stress (both perceived stress and chronicity of stress) has been significantly associated with lower telomerase activity and shorter TL (Epel et al., 2004). We investigated the association between chronic stress and rTL in our sample. We observed a marginally significant association between chronic stress and rTL (*p* = 0.067). However, contrary to expectation, severe chronic stress was associated with longer rTL (*p* = 0.028) (**Table 5**). Longer rTL was also associated with IMD caseness. Thus, if increased stress (chronic) is an acquired vulnerability factor for IMDs, then it stands to reason that severe chronic stress would be associated with longer rTL, an association that we indeed observed. Further, since IMDs are associated with impaired quality of life and negative clinical and behavioral outcomes among PHIY and poor adherence to ART (Malee et al., 2011; Walkup et al., 2009), we expected significantly lower CD4 counts among cases than controls. However, we found no significant difference in mean CD4 count between cases and controls (*p* = 0.939). We did not investigate other virologic markers of HIV disease severity, such as viral load. However, all study participants were on ART, and thus no difference would be expected if adherence to treatment was similar between cases and controls.

We observed associations between chronic stress and both study site and SES. Living in urban areas and having a high SES were associated with more chronic stress than living in rural areas and having a low SES. The association of both urban location and high SES with increased chronic stress may be due to a correlation between the two variables, as participants in urban areas are often of higher SES than their rural counterparts. The association of urban location with increased chronic stress could be due to ecological factors and pressures that are associated with urban life as compared with rural life. We also observed an association between age and chronic stress. Adolescents (12–17 years) were more stressed than children (7–11 years), possibly because adolescents were aware of their HIV status, so that their stress could be associated with the burden of being HIV+ and with stigma among these study participants (Knizek et al., 2017).

Lastly, since severe chronic stress is associated with longer rTL, we expected severe chronic stress to lower the *p*-value of the regression for the association between IMDs and rTL, an interaction we did not observe. We think that this could be due to the duration of chronic stress. Although the chronic stress variables used in the present study are known stressors in this population, the duration of the stressor was not assessed.

#### LIMITATIONS AND RECOMMENDATIONS

We defined IMDs as having any depressive disorder or anxiety disorder or PTSD. The inclusion of PTSD is contentious as the disorder has recently been delineated from IMDs in the DSM-5 and may have skewed our findings. We recommend that future studies undertake a comparative analysis of the different disorders that make up the IMD spectrum to elucidate the independent contribution of each particular disorder.

We did not investigate factors that are known to affect rTL, such as frequency of physical activity, medication, diet, and presence of other comorbid diseases. Also, although CD4 counts did not significantly differ between cases and controls, the ART regimen for each study participant was not accounted for in the analysis. Future studies should endeavor to consider these factors.

Both the duration and severity of IMDs have been shown to affect rTL. We did not assess the duration of IMDs. However, we think that this may not have greatly affected our findings because disease severity was not significantly associated with rTL in the present study. Future studies should, however, account for the duration of IMDs.

We suggest that the longer rTL observed among cases is due to elevated telomerase activity/levels. However, we did not investigate telomerase activity/levels between cases and controls. Future studies should investigate this possibility. Also, certain genes, such as the telomerase reverse transcriptase and telomerase RNA component, have been reported to influence TL biology. The role of polymorphisms in these genes influencing rTL needs to be investigated, and future studies should endeavor to address this.

Chronic stress was measured using a number of context-specific indicators because there is no locally adapted tool for assessing chronic stress in this setting. While this may be a limitation and may limit generalizability to other settings, the variables used to generate the chronic stress index are known stressors in this population. Validation of the chronic stress index tool will be required in future studies in Uganda.

#### CONCLUSIONS

rTL was longer in cases with IMDs compared with age- and sex-matched controls.

We observed significant attrition in rTL over 12 months. This rTL attrition seems to be driven by the presence of any IMDs, indicating that IMDs could be driving accelerated rTL attrition. Mechanisms that either directly influence rTL or alleviate the effects of IMDs on rTL attrition could explain our study findings, and longitudinal and experimental studies are needed to fully elucidate underlying mechanisms.

### ETHICS APPROVAL AND CONSENT TO PARTICIPATE

The study obtained ethics approval from the Health Research Committee of Stellenbosch University (#S17/09/179) and the Higher Degrees Research & Ethics Committee of the School of Biomedical Sciences, College of Health Sciences, Makerere University (#SBS 421). The parent study (CHAKA) obtained ethics approval from the Uganda Virus Research Institute's Science and Ethical Committee (#GC/127/15/06/459) and the Uganda National Council of Science and Technology (#HS 1601). All caregivers provided informed consent for their children/adolescents to participate in the study and for a blood specimen to be withdrawn from them (child/adolescent) for rTL and other genetics analyses. Adolescents further provided informed assent to participate in the study.

### CONSENT FOR PUBLICATION

No details, images, or videos relating to any of the study participants are included in this manuscript.

### ETHICS STATEMENT

The study obtained ethics approval from the Health Research Committee of Stellenbosch University (# S17/09/179) and the Higher Degrees Research & Ethics Committee, School of Biomedical Sciences, College of Health Sciences, Makerere University (# SBS 421). The parent study (CHAKA) obtained ethics approval from the Uganda Virus Research Institute (UVRI) Science and Ethical Committee (# GC/127/15/06/459) and the Uganda National Council of Science and Technology (# HS 1601). All study participants provided written informed consent/assent to participate in the study and for a blood specimen to be withdrawn from them for the rTL and other genetics analyses in accordance with the Declaration of Helsinki.

### AUTHOR CONTRIBUTIONS

Concept was provided by AK, SMJH, EK, and SS. Data collection was done by AK, EK, SMJH, JSW, and SS. Data analysis was done by WS, AK, RNN, SMJH, JSW, SS, MK, and JL. First draft was done by AK, SMJH, JSW, WS, EK, SS, MLJ, RNN, PK, MK, and JL. Final revision was done by AK, SMJH, JSW, EK, SS, WS, MLJ, RNN, PK, MK, and JL. All authors read and approved the final manuscript.

### FUNDING

The study was funded by Medical Research Council/Department for International Development—African Leadership Award to Prof. Eugene Kinyanda (grant number MR/L004623/1), the Alliance for Global Health and Science of the Center for Emerging and Neglected Diseases (grant number 50288/N7145), the South African Research Chairs Initiative in Posttraumatic Stress Disorder, funded by the Department of Science and Technology, and the National Research Foundation of South Africa.

### ACKNOWLEDGMENTS

We thank the study participants and research assistants of the mental health section of MRC/UVRI & LSHTM Uganda Research Unit. We thank the HIV clinics at the Joint Clinical Research Centre, Nsambya Home Care, TASO-Masaka, Kitovu Mobile Clinic and Uganda Cares-Masaka, for allowing us access to their patients. We also thank the members of the Neuropsychiatric Genetics Laboratory at Stellenbosch University, most especially Dr. Craig Kinnear; the Data and Statistics Section of the MRC/UVRI & LSHTM Uganda Research Unit; and the National Research Foundation of South Africa.

#### REFERENCES


### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00705/full#supplementary-material


from the National Health and Nutrition Examination Survey. *Mol. Psychiatry* 20 (4), 520. doi: 10.1038/mp.2014.89


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Kalungi, Womersley, Kinyanda, Joloba, Ssembajjwe, Nsubuga, Levin, Kaleebu, Kidd, Seedat and Hemmings. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Frequencies of the *LILRA3* 6.7-kb Deletion Are Highly Differentiated Among Han Chinese Subpopulations and Involved in Ankylosing Spondylitis Predisposition

*Han Wang1†, Yuxuan Wang2†, Yundi Tang2, Hua Ye2, Xuewu Zhang2, Gengmin Zhou1, Jiyang Lv1, Yongjiang Cai3, Zhanguo Li2\*, Jianping Guo2\* and Qingwen Wang1\**

#### *Edited by:*

*Zané Lombard, University of the Witwatersrand, South Africa*

#### *Reviewed by:*

*Guanglin He, Sichuan University, China Jacqueline Michelle Frost, Mount Sinai Medical Center, United States*

#### *\*Correspondence:*

*Zhanguo Li zgli99@aliyun.com Jianping Guo jianping.guo@bjmu.edu.cn Qingwen Wang wqw\_sw@163.com*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics*

*Received: 15 February 2019 Accepted: 20 August 2019 Published: 18 September 2019*

#### *Citation:*

*Wang H, Wang Y, Tang Y, Ye H, Zhang X, Zhou G, Lv J, Cai Y, Li Z, Guo J and Wang Q (2019) Frequencies of the LILRA3 6.7-kb Deletion Are Highly Differentiated Among Han Chinese Subpopulations and Involved in Ankylosing Spondylitis Predisposition. Front. Genet. 10:869. doi: 10.3389/fgene.2019.00869*

*1 Department of Rheumatism and Immunology, Peking University Shenzhen Hospital, Shenzhen, China, 2 Department of Rheumatology and Immunology, Peking University People's Hospital, Beijing, China, 3 Health Management Center, Peking University Shenzhen Hospital, Shenzhen, China*

Introduction: Leukocyte immunoglobulin-like receptor A3 (*LILRA3*) belongs to the LILR family and has the unique feature of a 6.7-kb deletion that varies among individuals. Frequencies of the 6.7-kb deletion vary widely across populations, but so far they have not been carefully investigated among Han Chinese subpopulations. Furthermore, we previously identified the non-deleted (functional) *LILRA3* as a novel genetic risk for multiple autoimmune diseases. The current study aimed to investigate (i) whether frequencies of the *LILRA3* 6.7-kb deletion differ within Han Chinese subpopulations and (ii) whether the functional *LILRA3* is a novel genetic risk for ankylosing spondylitis (AS).

Methods: The *LILRA3* 6.7-kb deletion was genotyped in two independent cohorts, including 1,567 subjects from Shenzhen Hospital and 2,507 subjects from People's Hospital of Peking University. Frequencies of the 6.7-kb deletion were first investigated in the combined healthy cohort according to the Chinese administrative district divisions. Association analyses were performed on the whole dataset and on subsets defined by geographic region. The impact of the functional *LILRA3* on AS disease activity was evaluated.

Results: Frequencies of the *LILRA3* 6.7-kb deletion were highly differentiated within Han Chinese subpopulations, decreasing gradually from Northeast (80.6%) to South (47.4%). Functional *LILRA3* seemed to be a strong genetic risk in susceptibility to AS under almost all the alternative genetic models, if the study subjects were not geographically stratified. However, stratification analysis revealed that the functional *LILRA3* was consistently associated with AS susceptibility mainly in the Northern Han subgroup under the alternative genetic models, but not in Central and Southern Hans. Functional *LILRA3* conferred an increased disease activity in AS patients (*P* < 0.0001 both for CRP and ESR, and *P* = 0.003 for BASDAI).

Conclusions: The present study is the first to report that the frequencies of the *LILRA3* 6.7-kb deletion vary among Chinese Hans across geographic regions. The functional *LILRA3* is associated with AS susceptibility mainly in Northern Han, but not in Central and Southern Han subgroups. Our finding provides new evidence that *LILRA3* is a common genetic risk for multiple autoimmune diseases and highlights the genetic differentiation among different ethnicities, even within the subpopulations of an ethnic group.

Keywords: *LILRA3*, genetic differentiation, genetic susceptibility, ankylosing spondylitis, Han subpopulations

#### INTRODUCTION

Ankylosing spondylitis (AS) is a chronic autoimmune disease characterized by new bone formation, progressively leading to ankylosis of the axial skeleton and functional disability. The disease predominantly affects young men. The etiology of AS is not completely understood, but genetic factors are believed to play a major role in AS pathogenesis, particularly the MHC class I allele *HLA-B27*, which has been recognized as the best genetic marker for AS susceptibility (reviewed in Brown et al., 2016). However, although over 80% of AS patients are *HLA-B27* carriers, only a small proportion of *HLA-B27*-positive individuals ever develop AS (reviewed in Reveille, 2012). Furthermore, genome-wide association studies (GWAS) have revealed that more than 60 additional genetic risk factors contribute to the disease, indicating a polygenic nature of AS. To date, only approximately 30% of AS heritability has been explained by the known genetic loci; many loci remain unidentified (reviewed in Li and Brown, 2017; Ranganathan et al., 2017).

The leukocyte immunoglobulin-like receptor genes (*LILRs*) form a highly homologous multigene family located on human chromosome 19q13.4 (Samaridis and Colonna, 1997; Kelley et al., 2005). One characteristic of the LILR family is its specific recognition of MHC class I molecules (Cosman et al., 1997). According to their signaling through immunoreceptor tyrosine-based activating or inhibitory motifs, two subgroups of the LILRs have been defined: activating (LILRA1-6) and inhibitory (LILRB1-5) receptors (Samaridis and Colonna, 1997; Nakajima et al., 1999). Of these, LILRA3 (OMIM 604818) is unique: a premature stop codon in the extracellular stalk region leads to loss of the transmembrane domain, so the protein is expressed only as a soluble receptor (Arm et al., 1997; Borges et al., 1997; Colonna et al., 1997). Furthermore, *LILRA3* exhibits a 6.7-kb presence/absence variation among individuals. The 6.7-kb deletion comprises the first six of the seven exons and removes all four Ig-like domains, leading to a truncated protein (Torkar et al., 2000; Wilson et al., 2000; Norman et al., 2003). Interestingly, the frequencies of the *LILRA3* 6.7-kb deletion vary widely among ethnic groups, being much higher in Northeast Asians such as Japanese (71%), Chinese Han (76%), Chinese Manchu (79%), and Koreans (84%), compared to Europeans (15–26%), South Asians (10%), or Africans (7%) (Hirayasu et al., 2006; Hirayasu et al., 2008; Du et al., 2014). However, so far, the frequencies of the *LILRA3* 6.7-kb deletion have not been carefully investigated among the Han Chinese subpopulations across the geographic regions.

To date, the function of LILRA3 remains obscure, but LILRA3 can bind to the HLA class I molecules HLA-G and HLA-C (Jones et al., 2011; Ryu et al., 2011) and may act as an antagonist of other LILRs or as a soluble ligand for other receptors (Torkar et al., 2000; Burshtyn and Morcos, 2016). In Caucasian populations, the *LILRA3* 6.7-kb deletion has been reported as a genetic risk for primary Sjogren's syndrome (pSS) (Kabalak et al., 2009) and multiple sclerosis (MS) (Koch et al., 2005; Ordonez et al., 2009; Wisniewski et al., 2013; Ortiz et al., 2015; An et al., 2016). Nevertheless, our previous studies have demonstrated that, in the Han Chinese population, the non-deleted (functional) *LILRA3* allele, rather than the 6.7-kb deleted *LILRA3*, was the genetic risk for pSS, systemic lupus erythematosus (SLE), and rheumatoid arthritis (RA) (Du et al., 2014; Du et al., 2015). A GWAS has also reported that the functional *LILRA3* is a risk factor for susceptibility to prostate cancer in the Han population (Xu et al., 2012). These reports have provided strong evidence that the functional *LILRA3* is a genetic risk for multiple chronic diseases. However, whether the functional *LILRA3* is a novel susceptibility factor for AS has not been investigated. We undertook the present study (i) to investigate the frequencies of the *LILRA3* 6.7-kb deletion among Han Chinese subpopulations across the geographic regions, (ii) to examine the possible genetic association between *LILRA3* and AS, and (iii) to examine whether *LILRA3* influences the disease activity in AS.

#### MATERIAL AND METHODS

#### Study Subjects

Two independent cohorts were enrolled: 1,567 subjects (821 cases and 746 healthy controls) from Peking University Shenzhen Hospital (SZH) and 2,507 subjects (300 cases and 2,207 healthy subjects) from Peking University People's Hospital (PH). In the PH cohort, all 2,207 healthy subjects were used for the subpopulation stratification analysis, and 995 of them were selected as controls for the case-control analysis by taking account of gender and age matching. All patients with AS fulfilled the 1984 Modified New York Criteria for the diagnosis of AS (van der Linden et al., 1984). All cases and healthy controls are Han Chinese.

In the SZH cohort, the patients were recruited from the Department of Rheumatology of Shenzhen Hospital and from both out-patient and in-patient departments between Jan 2012 and May 2019. The healthy controls were from the Health Care Center affiliated to Shenzhen Hospital.

In the PH cohort, the patients were recruited from the in-patient department of the Department of Rheumatology and Immunology of People's Hospital between Jan 2015 and May 2019. The healthy controls were recruited from the Health Care Center of PKU People's Hospital, the First Affiliated Hospital of Anhui Medical University, and Gulou Hospital Affiliated to Medical College of Nanjing University, respectively (Du et al., 2014; Du et al., 2015). The baseline demographic characteristics of patients and controls are shown in **Table 1**. The geographic characteristics of the patients and controls from the two independent cohorts, according to the Chinese administrative district divisions and the latitudes, are shown in **Supplementary Tables 4** and **5**, respectively.

**Abbreviations:** AS, ankylosing spondylitis; LILRA3, leukocyte immunoglobulin-like receptor A3; MHC, major histocompatibility complex; GWAS, genome-wide association study; CRP, C-reactive protein; ESR, erythrocyte sedimentation rate; MS, multiple sclerosis; BASDAI, the Bath Ankylosing Spondylitis Disease Activity Index.

The study was approved by the Medical Ethics Committee of Peking University Shenzhen Hospital. Written informed consent was obtained from all participants.

#### Subpopulation Stratifications

In the SZH cohort, the majority of the patients and healthy controls came from Guangdong, Hubei, Hunan, Jiangxi, Sichuan, and Fujian. In the PH cohort, the majority of cases and healthy individuals were from Beijing, Tianjin, Hebei, Shandong, Liaoning, Inner Mongolia, Jilin, and Heilongjiang. In addition, a small proportion of healthy individuals in the PH cohort were from Jiangsu and Anhui.

For the subpopulation stratification analysis, the two independent healthy cohorts were merged, and the total healthy subjects were then classified into subgroups according to the Chinese administrative district divisions, i.e., Northeastern China, Northern China, Eastern China, Central China, Western China, and Southern China (He et al., 2017). For the case-control stratification analysis, the two independent case-control cohorts were pooled, and the samples were then stratified into three case-control sub-cohorts according to latitude, i.e., (i) latitude ≥ 35° (roughly corresponding to the Northeastern and Northern regions), (ii) latitude 25–35° (roughly corresponding to the Western, Central, and Eastern regions), and (iii) latitude ≤ 25° (roughly corresponding to the Southern region). Of note, a number of cases and healthy controls came from the metropolises of Shenzhen and Beijing, where geographical location is no longer a good indicator of ancestral origin owing to the impact of modern immigration (He et al., 2017). These individuals self-identified their ancestral origins for the stratification analysis.
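The latitude-based stratification described above can be sketched as a small helper. This is an illustrative sketch only, not the authors' code: the function name is hypothetical, and because the stated ranges share their end points (25° and 35°), the handling of those boundary values is an assumption.

```python
def latitude_subgroup(latitude_deg_north: float) -> str:
    """Map an ancestral-origin latitude (degrees North) to a latitude stratum.

    Assumption: a latitude of exactly 35 is assigned to the northern stratum
    and exactly 25 to the southern stratum; the text does not specify this.
    """
    if latitude_deg_north >= 35:
        return "latitude >= 35 (Northeastern and Northern regions)"
    if latitude_deg_north > 25:
        return "latitude 25-35 (Western, Central, and Eastern regions)"
    return "latitude <= 25 (Southern region)"


# Example with approximate city latitudes (illustrative values only).
for place, lat in [("Beijing", 39.9), ("Wuhan", 30.6), ("Shenzhen", 22.5)]:
    print(place, "->", latitude_subgroup(lat))
```

For residents of Shenzhen and Beijing, the input would be the latitude of the self-identified ancestral origin rather than the place of residence.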

#### Determination of LILRA3-del by Sequence-Specific Primer-Polymerase Chain Reaction (PCR-SSP)

Genotypes of the *LILRA3* 6.7-kb deletion polymorphism were obtained by PCR amplification with modified sequence-specific primers (PCR-SSP) (Du et al., 2014; Du et al., 2015). The cases and controls from the SZH cohort were genotyped at Shenzhen Hospital, and the AS cases from the PH cohort were genotyped at People's Hospital. The genotyping success and confirmation rates were 99.1 and 100%, respectively. The genotyping dataset has been deposited in the figshare database (DOI: 10.6084/m9.figshare.9685619, https://figshare.com/s/0eab58f90f3b1b20a181).

*AS, ankylosing spondylitis; SZH, Shenzhen Hospital; PH, People's Hospital. #the total healthy subjects from PH for subpopulation analysis. \*Mean ± SD years.*

In the PH cohort, the genotyping data for healthy controls were cited from our previous publications (Du et al., 2014; Du et al., 2015).

#### Serum Dkk-1 Measurements

Serum Dkk-1 levels were measured in a total of 384 AS patients. Serum samples were collected and immediately stored at −80°C until use. All cases were genotyped for the *LILRA3* 6.7-kb deletion polymorphism. Serum Dkk-1 concentration was quantified using commercially available ELISA kits, according to the manufacturer's instructions (R&D Systems, Minneapolis, MN). The detection range for Dkk-1 is 31.3 to 2,000 pg/ml, with an assay sensitivity of around 15.6 pg/ml (catalogue number: DKK100).

#### Power Analysis

The power analyses were performed retrospectively for the available samples (cases and controls), using a fixed minor allele frequency of 31.0% (the MAF in healthy controls from the combined cohort), a type I error rate of 0.05, and an OR of 1.4. The PS software (version 3.0.14) was used for power calculation (available at http://www.mc.vanderbilt.edu/prevmed/ps).
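As a rough cross-check of such a calculation, the power of a two-sided comparison of two proportions can be approximated with the normal distribution. The sketch below is not the PS software's algorithm. It assumes that individuals are the sampling unit and that the combined cohort comprises 1,121 cases (821 + 300) and 1,741 controls (746 + 995), figures obtained here by summing the cohorts described under Study Subjects; all function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def case_frequency_from_or(control_freq: float, odds_ratio: float) -> float:
    """Frequency in cases implied by a control frequency and an odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def two_proportion_power(p_cases: float, p_controls: float,
                         n_cases: int, n_controls: int,
                         alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    Uses the pooled standard error under the null and the unpooled standard
    error under the alternative; the negligible opposite tail is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    pooled = (n_cases * p_cases + n_controls * p_controls) / (n_cases + n_controls)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    se_alt = sqrt(p_cases * (1 - p_cases) / n_cases
                  + p_controls * (1 - p_controls) / n_controls)
    return z.cdf((abs(p_cases - p_controls) - z_crit * se_null) / se_alt)


maf_controls = 0.31                      # MAF in healthy controls (from the text)
p_cases = case_frequency_from_or(maf_controls, 1.4)
power = two_proportion_power(p_cases, maf_controls, n_cases=1121, n_controls=1741)
print(f"implied frequency in cases: {p_cases:.3f}, approximate power: {power:.3f}")
```

Under these assumptions the approximation gives a power close to 0.99; a dedicated tool may differ slightly because of its exact test and design settings.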

#### Statistical Analyses

The Hardy–Weinberg equilibrium (HWE) test was performed for the polymorphism using Pearson's goodness-of-fit chi-square test. The Pearson chi-square test was used to compare allele frequencies between cases and controls. Odds ratios (OR) and 95% confidence intervals (CI) for the genetic model analysis were calculated using logistic regression, adjusting for age and sex. The independent-samples *t*-test was applied to compare serum levels of CRP, ESR, and Dkk-1, and the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI), between two genotypic groups. All statistical analyses were conducted using SPSS 16.0 (SPSS Inc., Chicago, IL, USA). A *P*-value < 0.05 was considered statistically significant.
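For readers unfamiliar with the HWE goodness-of-fit test, a minimal sketch for a biallelic variant follows. It is not the SPSS procedure used in the study; the function name and the genotype counts in the example are hypothetical, and the test uses one degree of freedom (three genotype classes, one estimated allele frequency).

```python
from math import erfc, sqrt


def hwe_chi_square(n_del_del: int, n_del_fun: int, n_fun_fun: int):
    """Pearson goodness-of-fit test for Hardy-Weinberg equilibrium.

    Arguments are observed genotype counts: deletion homozygotes,
    heterozygotes, and functional-allele homozygotes.
    Returns (chi-square statistic, P-value with 1 degree of freedom).
    Assumes both alleles are observed (no expected count is zero).
    """
    n = n_del_del + n_del_fun + n_fun_fun
    p = (2 * n_del_del + n_del_fun) / (2 * n)   # deletion allele frequency
    q = 1 - p                                    # functional allele frequency
    observed = [n_del_del, n_del_fun, n_fun_fun]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p_value = hwe_chi_square(490, 420, 90)
print(f"chi-square = {chi2:.3f}, P = {p_value:.3f}")
```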

#### RESULTS

The 6.7-kb deletion variant was in Hardy–Weinberg equilibrium (HWE) (*P* > 0.05) in healthy controls (data not shown). The study had a statistical power of 0.987 to detect a modest effect size (OR = 1.40) for the association between *LILRA3* and AS, assuming a fixed minor allele frequency (MAF) of 31.0% (the MAF in healthy controls from the combined cohort). However, the power within single subpopulations was generally low (0.769 in Northern Han, 0.708 in Central Han, and 0.520 in Southern Han).

#### Frequencies of *LILRA3* 6.7-kb Deletion Are Highly Differentiated Within Han Chinese Subpopulations

As the frequencies of the *LILRA3* 6.7-kb deletion differ greatly among populations worldwide, and the study subjects in the present study came from multiple geographical regions across China, we hypothesized that the frequencies of the *LILRA3* 6.7-kb deletion may also vary among Han Chinese. To test this, we first pooled the two healthy cohorts and roughly classified the subjects into six subgroups according to the Chinese administrative district divisions (He et al., 2017) (**Figure 1A**). Interestingly, we found that allele frequencies of the 6.7-kb deletion varied remarkably according to the administrative district divisions. As shown in **Figure 1B** and **Table 2**, the allele frequency of the 6.7-kb deletion decreased gradually from the Northeast (80.6%) to the South (47.4%). Accordingly, the frequency of the functional *LILRA3* allele increased gradually from the Northeast (19.4%) to the South (52.6%). Next, we investigated the geographical distribution of the *LILRA3* variants according to latitude. As shown in **Figure 1C**, the frequency of homozygosity for the functional *LILRA3* was about 4.5% at latitude ≥ 35° (roughly corresponding to the Northeastern and Northern regions), 10.4% at latitude 25–35° (roughly corresponding to the Western, Central, and Eastern regions), and up to 30.6% at latitude ≤ 25° (corresponding to the Southern region). Collectively, the frequency of the *LILRA3* 6.7-kb deletion was highly differentiated within Han Chinese subpopulations. The frequency of the functional *LILRA3* was inversely correlated with latitude in Han Chinese, with the highest frequency seen in Southern Han (52.6%).

#### Functional *LILRA3* Seems to Be a Strong Genetic Risk for Development of AS, If the Study Subjects Were Not Geographically Stratified

To investigate the possible genetic association between the functional *LILRA3* and AS, we first assessed the impact of *LILRA3* on AS susceptibility in the whole study population. Interestingly, the frequencies of the functional *LILRA3* were markedly higher in AS patients than in healthy controls, either under the allele model (40.2% *vs*. 31.0%, *P* = 1.28 × 10^−12, OR = 1.49, **Table 3**) or under almost all the alternative genotypic models (e.g., recessive model [+/+ *versus* -/- and +/-]: 17.6% *vs*. 10.3%, *P* = 1.17 × 10^−5, OR = 1.66; **Figure 2** and **Table 3**). Thus, without geographic stratification of the study subjects, the functional *LILRA3* appears to be a strong genetic risk factor for AS susceptibility in Han Chinese.

TABLE 2 | Geographical distribution of the *LILRA3* variations in Han Chinese healthy individuals, according to the Chinese administrative district divisions (n = 3,343).

*N°, North latitude; (−), 6.7kb-deletion; (+), non-deletion.*

TABLE 3 | Association analysis of *LILRA3* with AS in combined cohort, adjusting for sex and age.

*AS, ankylosing spondylitis; HC, healthy controls; OR (95% CI), odds ratio (95% confidence interval); (−), 6.7kb-deletion; (+), non-deletion.*
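The relationship between the allele frequencies and the odds ratio in the unstratified allele model can be illustrated with a crude (unadjusted) calculation. This is only a sketch: the reported OR was obtained by logistic regression adjusted for age and sex, so the crude value need not match it exactly, and the function name is illustrative.

```python
def allelic_odds_ratio(freq_cases: float, freq_controls: float) -> float:
    """Crude odds ratio for carrying an allele, from its frequency in
    cases and in controls (no covariate adjustment)."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls


# Functional LILRA3 allele frequencies quoted in the text: 40.2% vs. 31.0%.
crude_or = allelic_odds_ratio(0.402, 0.310)
print(f"crude allelic OR = {crude_or:.2f}")
```

With the frequencies quoted in the text, the crude value is close to the adjusted OR of 1.49 reported for the allele model.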

#### Stratification Analysis Reveals That Functional LILRA3 Is Associated With AS Susceptibility Only in North Han, but Not in Central and South Hans

Previous studies have shown that population stratification is a potential issue for genetic association studies and may confound results and cause spurious associations (Chen et al., 2009; Xu et al., 2009). As allele frequencies of *LILRA3* were highly differentiated in our healthy cohort, we next stratified the cases and healthy individuals into subgroups corresponding to the three ranges of latitude. The three subgroups were then renamed as: (i) Northern Han (corresponding to latitude ≥ 35°), (ii) Central Han (corresponding to latitude 25–35°), and (iii) Southern Han (corresponding to latitude ≤ 25°). As shown in **Figure 3A**, the frequencies of homozygosity for the functional *LILRA3* increased gradually in both healthy controls and AS patients from North to South. However, the associations between the functional *LILRA3* and AS susceptibility differed among the three subgroups. In the Northern Han subgroup, the functional *LILRA3* showed consistent association with AS susceptibility under almost all the alternative genetic models (allele model: *P* = 3.55 × 10^−3, OR = 1.33; recessive model: *P* = 0.076, OR = 1.64; dominant model: *P* = 0.013, OR = 1.36; co-dominant model: *P* = 6.25 × 10^−3, OR = 1.32; over-dominant model: *P* = 0.078, OR = 1.25), but not in the Central and Southern Han subgroups (**Figure 3B**, **Tables 4** and **5** and **Supplementary Table 1**). Our results indicate that the functional *LILRA3* may be a genetic risk factor for AS susceptibility in the Northern Han subpopulation but not in Central and Southern Han.

#### Functional LILRA3 Confers an Increased Disease Activity in AS Patients

We next examined whether the functional LILRA3 had an impact on disease activity in AS patients. As CRP and ESR are the two biomarkers most commonly used to evaluate AS disease activity, we evaluated the impact of *LILRA3* genotypes on serum levels of CRP and ESR in AS patients. As shown in **Figure 4**, the patients homozygous for the functional *LILRA3* had significantly higher levels of CRP and ESR than carriers of the non-functional *LILRA3* (*P* < 0.0001 for both CRP and ESR, **Figures 4A**, **B**). Interestingly, we further observed that the BASDAI (a validated diagnostic test and gold standard for measuring disease activity in AS) was also significantly increased in AS patients homozygous for the functional LILRA3 (*P* = 0.003, **Figure 4C**). We also evaluated the impact of *LILRA3* genotypes on serum Dkk-1, a molecule related to AS disease activity. As shown in **Figure 4D**, no significant differences in serum Dkk-1 levels were observed between LILRA3 genotypes (*P* = 0.764).

#### DISCUSSION

Several previous studies have reported that the frequency of the *LILRA3* 6.7-kb deletion varies widely across populations (Hirayasu et al., 2006; Hirayasu et al., 2008). Here, we demonstrate that allele frequencies of the 6.7-kb deletion also differ remarkably among Han Chinese subpopulations across geographic regions, being highest in Northeast China (80.6%) and lowest in South China (47.4%), and are positively correlated with latitude. Conversely, frequencies of the functional *LILRA3* were inversely correlated with latitude, being highest in the South (52.6%) and lowest in the Northeast (19.4%). When the study subjects were not geographically stratified, the functional *LILRA3* seemed to be a strong susceptibility factor for AS. However, after stratifying the cases and healthy individuals according to geographical region, we found that the functional *LILRA3* is mainly associated with AS susceptibility in Northern Han, but not in Central and Southern Han.

Genetic differentiation among Han Chinese subpopulations has been reported previously. Xu et al. (2009) reported that the Han Chinese population has a complex substructure, with the main clusters corresponding roughly to Northern Han, Central Han, and Southern Han. By simulated case-control analysis, that study showed that the genetic differentiation among these clusters was sufficient to produce spurious associations in GWAS if the Han population was not properly stratified. Thus, association studies in the Han Chinese population should be interpreted carefully, especially when sample sources are diverse. Chen et al. (2009) reported that the structure of the Han population is one-dimensional and clearly characterized by a continuous genetic gradient along the north-south geographical axis, rather than an east-west pattern. Interestingly, that study further showed that the Cantonese are the subpopulation most differentiated from the Northern Han. Our data are consistent with these findings; that is, the allele frequencies of *LILRA3* are differentiated mainly along the north-south gradient, with the functional allele being much more frequent in Southern Han. The mechanism underlying this differentiation within the Han Chinese population is unknown. We speculate that gene flow, environmental factors such as exposure to ultraviolet light, diet, and lifestyle, and different pressures on the immune system might account for the differentiation between Han Chinese subpopulations.

association analysis between functional *LILRA3* and AS susceptibility with or without geographic stratifications (recessive model).

TABLE 4 | Association analysis of *LILRA3* with AS, according to the latitudes and adjusting for sex and age (allele model).

*AS, ankylosing spondylitis; HC, healthy controls; OR (95% CI), odds ratio (95% confidence interval); (−), 6.7kb-deletion; (+), non-deletion; N°, North latitude.*

TABLE 5 | Association analysis of *LILRA3* with AS, according to the latitudes and adjusting for sex and age (recessive model).

*AS, ankylosing spondylitis; HC, healthy controls; OR (95% CI), odds ratio (95% confidence interval); (−), 6.7kb-deletion; (+), non-deletion; N°, North latitude.*

Despite the well-established genetic association between *LILRA3* and autoimmune diseases, the molecular function of LILRA3 remains undefined. However, *LILRA3* is highly homologous to *LILRB1* and *LILRB2* in the extracellular domains, suggesting that it may act as a soluble antagonist of these inhibitory receptors *via* shared ligands (Torkar et al., 2000; Burshtyn and Morcos, 2016). Previously, we and others have reported that the expression of *LILRA3* was significantly increased in RA and SLE patients and that *LILRA3* had an impact on disease activity in RA and SLE (An et al., 2010; Du et al., 2015). Functional *LILRA3* conferred a risk of disease severity in RA patients with early disease (Du et al., 2014). Serum LILRA3 was one of the strongest independent markers of disease severity in patients with MS (An et al., 2016). In the present study, we also found that the functional *LILRA3* has an impact on disease activity in AS patients.

Dkk-1 is a key inhibitory molecule in the Wnt pathway and is critically important in bone homeostasis. Therefore, Dkk-1 may play an important role in AS pathogenesis (Heiland et al., 2012). Increased Dkk-1 levels have been linked to bone resorption, whereas decreased levels are linked to new bone formation (Li et al., 2006; MacDonald et al., 2007). However, findings regarding the relationship between serum Dkk-1 levels and the occurrence of AS are inconsistent. For instance, several studies have reported that serum Dkk-1 levels were significantly increased in patients with AS compared with normal subjects or bone-related disease controls (Daoussis et al., 2010; Zhang et al., 2016). Yet the inhibitory effect of Dkk-1 in sera from AS patients on Wnt pathway activation was negligible, suggesting that it may be functionally impaired (Daoussis et al., 2010). In the present study, we did not find any difference in serum Dkk-1 levels between LILRA3 genotypes.

One limitation of the current study is that the power within single subpopulations was generally low, owing to the geographical stratification and the low frequency of the functional *LILRA3* allele in the Northern Han subgroup. This may increase the chance of type II errors, i.e., false negative results. Additional studies with larger sample sizes are needed to confirm our findings.

In summary, the present study provides the first evidence that the frequencies of the *LILRA3* 6.7-kb deletion vary widely among Han Chinese across geographic regions and are positively correlated with latitude. The functional *LILRA3* is associated with AS susceptibility in Northern Han, but not in the Central and Southern Han subpopulations. *LILRA3* has an impact on disease activity in AS patients. These findings suggest that *LILRA3* is a common genetic risk factor for multiple autoimmune diseases and provide clues for further functional studies. Our study further highlights the importance of genetic differentiation among ethnicities, even within the subpopulations of an ethnic group.

#### DATA AVAILABILITY

This manuscript contains previously unpublished data. The name of the repository and accession number are not available.

#### ETHICS STATEMENT

This study was performed in accordance with the Declaration of Helsinki, and the protocol was approved by the Medical Ethics Committee, Peking University Shenzhen Hospital. All patients provided informed consent to participate in the study.

#### AUTHOR CONTRIBUTIONS

HW contributed to the collection of DNA samples and clinical data from SZH cohort, participated in genotyping, ELISA experiments, statistical analysis and manuscript drafting. YW contributed to the collection of DNA samples from PH cohort, participated in genotyping, data analysis and manuscript drafting. YT, HY and XZ participated in the collection and interpretation of clinical data from PH cohort. GZ, YC and JL participated in the collection and interpretation of clinical data from SZH cohort. ZL participated in the study design and revised the manuscript. JG contributed to the study design and data interpretation, supervision of the data analysis, manuscript drafting and revision. QW participated in the study design, data interpretation, and manuscript revision. All authors read and approved the final manuscript.

#### FUNDING

This work was supported in part by the National Natural Science Foundation of China (No. 31470875, No. 31670915, No. 31870913, No. 31711530023, No. 31530020, No.81871289, No. 81771678), Beijing Natural Science Foundation (No. 7162192), the Shenzhen Science and Technology Program for Basic Research (No. JCYJ20170307112009204), Traditional Chinese Medicine Bureau of Guangdong Province (No.20183011), and Sanming Project of Medicine in Shenzhen (No. SZSM201612009).

#### ACKNOWLEDGMENTS

We thank the staff of the two Departments of Rheumatology and Immunology, Peking University Shenzhen Hospital and Peking University People's Hospital, for recruiting healthy controls and AS patients and managing the DNA samples. We are also grateful to all the healthy volunteers and AS patients for their consent to, and cooperation in, this study.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00869/full#supplementary-material

#### REFERENCES

syndesmophyte formation in patients with ankylosing spondylitis. *Ann. Rheum. Dis.* 71, 572–574. doi: 10.1136/annrheumdis-2011-200216


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Wang, Wang, Tang, Ye, Zhang, Zhou, Lv, Cai, Li, Guo and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# GJB2 and GJB6 Mutations in Non-Syndromic Childhood Hearing Impairment in Ghana

*Samuel M. Adadey1, Noluthando Manyisa2, Khuthala Mnika2, Carmen de Kock2, Victoria Nembaware2, Osbourne Quaye1, Geoffrey K. Amedofu3, Gordon A. Awandare1 and Ambroise Wonkam2\**

1 West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana, 2 Division of Human Genetics, Faculty of Health Sciences—University of Cape Town, Cape Town, South Africa, 3 Department of Eye, Ear, Nose and Throat, School of Medical Sciences, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

#### Edited by:

Zané Lombard, University of the Witwatersrand, South Africa

#### Reviewed by:

Aime Lumaka, University of Liège, Belgium Colleen Aldous, University of KwaZulu-Natal, South Africa

\*Correspondence: Ambroise Wonkam ambroise.wonkam@uct.ac.za

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 28 February 2019 Accepted: 13 August 2019 Published: 18 September 2019

#### Citation:

Adadey SM, Manyisa N, Mnika K, de Kock C, Nembaware V, Quaye O, Amedofu GK, Awandare GA and Wonkam A (2019) GJB2 and GJB6 Mutations in Non-Syndromic Childhood Hearing Impairment in Ghana. Front. Genet. 10:841. doi: 10.3389/fgene.2019.00841

Our study aimed to investigate GJB2 (connexin 26) and GJB6 (connexin 30) mutations associated with non-syndromic childhood hearing impairment (HI), as well as the environmental causes of HI, in Ghana. Medical reports of 1,104 students attending schools for the deaf were analyzed. Families segregating HI, as well as isolated cases of HI of putative genetic origin, were recruited. DNA was extracted from peripheral blood, followed by Sanger sequencing of the entire coding region of GJB2. Multiplex PCR and Sanger sequencing were used to analyze the prevalence of the GJB6-D13S1830 deletion. Ninety-seven families segregating HI were identified, with 235 affected individuals, and a total of 166 isolated cases of putative genetic cause were sampled from 11 schools for the deaf in Ghana. Environmental factors, particularly meningitis, remain a major cause of HI in Ghana. The male/female ratio was 1.49. Most patients (59.6%) had their first comprehensive HI test between 6 and 11 years of age. Nearly all the participants had sensorineural HI (99.5%; n = 639). The majority had pre-lingual HI (68.3%, n = 754), of which 92.8% were congenital. Pedigree analysis suggested autosomal recessive inheritance in 96.9% of the familial cases. The GJB2-R143W mutation, previously reported as a founder mutation in Ghana, was found in the homozygous state in 25.9% (21/81) of familial cases and in 7.9% (11/140) of non-familial non-syndromic congenital HI cases of putative genetic origin. In a control population without HI, the prevalence of heterozygous GJB2-R143W carriers was 1.4% (2/145). No GJB6-D13S1830 deletion was identified in any of the HI patients. The GJB2-R143W mutation accounted for over a quarter of familial non-syndromic HI in Ghana and should be investigated in clinical practice. The large connexin 30 gene deletion (GJB6-D13S1830) does not account for congenital non-syndromic HI in Ghana. There is a need to employ next-generation sequencing approaches and functional genomics studies to identify the other genes involved in most families and isolated cases of HI in Ghana.

Keywords: hearing impairment, genetics, GJB2 and GJB6, Ghana, Africa

## INTRODUCTION

Hearing impairment (HI) is a disabling congenital disease (Neumann et al., 2019), with the highest rate of age-standardized disability of life in the world (Murray et al., 2015; Vos et al., 2016). Globally, congenital HI has a prevalence of 1.3 per 1,000 population (James et al., 2018) and accounts for about 1 per 1,000 live births in developed countries, with a much higher rate of up to 6 per 1,000 in sub-Saharan Africa (Olusanya et al., 2014). To improve the cognitive, social, speech, and language development of children living with HI, early diagnosis and intervention are recommended (Barnard et al., 2015). However, in the absence of widely used newborn screening, the age at diagnosis is usually late in Africa, e.g., 3.3 years in Cameroon (Wonkam et al., 2013). In many populations, nearly half of congenital HI cases have a genetic etiology, of which 70% are non-syndromic (Bademci et al., 2016; Sheffield and Smith, 2018). Among non-syndromic (NS) HI, nearly 80% of the cases are inherited in an autosomal recessive (AR) mode (Wu et al., 2018; Zhou et al., 2019). To date, more than 98 genes have been identified among the ~170 NSHI loci mapped (Hereditary Hearing Loss Homepage; http://hereditaryhearingloss.org/). Nevertheless, in many populations of European and Asian descent, pathogenic variants in *GJB2* (connexin 26 gene) and *GJB6* are major contributors to autosomal recessive NSHI (ARNSHI) (Chan and Chang, 2014), with the *GJB6*-D13S1830 deletion, identified in up to 9.7% of cases, being the second most common genetic etiology of NS deafness in European populations (del Castillo et al., 2002; del Castillo et al., 2003).

The prevalence of *GJB2-* or *GJB6-*related NSHI is very low in most sub-Saharan African populations (Gasmelseed et al., 2004; Kabahuma et al., 2011; Bosch et al., 2014; Javidnia et al., 2014; Lasisi et al., 2014). Of interest, a previous study showed that a common founder mutation, p.R143W, accounted for about 16.2% of congenital HI in a random sample of Ghanaians affected by hearing loss (Hamelmann et al., 2001). To our knowledge, the contribution of connexin 30 to HI and the carrier frequency of the *GJB2* mutation in non-affected individuals have not been studied in Ghana (Adadey et al., 2017). In the present research, we aimed to investigate the putative environmental causes of childhood HI, to revisit the contribution of *GJB2*, and to investigate *GJB6* mutations in carefully selected samples of families segregating HI and in isolated cases of putative genetic origin, as well as in control populations not affected by HI, in Ghana.

### METHODS

### Patient Participants

Hearing-impaired patients were recruited from 11 schools for the deaf following procedures reported previously in Cameroon (Wonkam et al., 2013). Briefly, individuals with severe HI diagnosed before 15 years of age were enrolled in this study. For all participants, a detailed personal and family history was obtained, the medical records were reviewed by a medical geneticist and an ENT specialist, and relevant data were extracted, including a three-generation pedigree and perinatal history. If required, a general systemic and otological examination and an audiological evaluation were performed, including pure tone audiometry or an auditory brain stem response test. We followed recommendation number 02/1 of the *Bureau International d'Audiophonologie (BIAP)*, Belgium, to classify the hearing levels (Bureau International d'Audiophonologie, 1997; Wonkam et al., 2013). After consultation with the medical geneticist, individuals with syndromic deafness underwent additional assessment, when possible. As previously reported (Wonkam et al., 2013), HI was defined as: 1) acquired when associated with a putative environmental factor, such as clinical evidence of meningitis; 2) genetic when at least two cases were reported in the same family without an obvious environmental cause, in case of consanguinity, in case of dysmorphism or developmental problems in addition to HI, or in case of a clinically suspected well-defined syndrome; 3) of unknown etiology if neither an environmental nor a genetic origin was clearly established.
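The three-way etiological definition above amounts to a simple decision rule, sketched below. The record type, field names, and function name are hypothetical, and the order of precedence (an environmental factor is checked first) is an assumption drawn from the phrase "without an obvious environmental cause"; in the study itself, classification rested on clinical review, not on an automated rule.

```python
from dataclasses import dataclass


@dataclass
class HICase:
    """Hypothetical summary of one participant's history (illustrative fields)."""
    environmental_factor: bool = False          # e.g., clinical evidence of meningitis
    affected_in_family: int = 1                 # affected individuals, proband included
    consanguinity: bool = False
    dysmorphism_or_developmental_problems: bool = False
    well_defined_syndrome: bool = False


def classify_etiology(case: HICase) -> str:
    """Return 'acquired', 'genetic', or 'unknown' following the stated definitions."""
    if case.environmental_factor:
        return "acquired"
    if (case.affected_in_family >= 2
            or case.consanguinity
            or case.dysmorphism_or_developmental_problems
            or case.well_defined_syndrome):
        return "genetic"
    return "unknown"


print(classify_etiology(HICase(environmental_factor=True)))   # acquired
print(classify_etiology(HICase(affected_in_family=3)))        # genetic
print(classify_etiology(HICase()))                            # unknown
```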

### Control Participants

A total of 145 control participants without any personal or familial history of HI were randomly recruited in Ghana from apparently healthy individuals during a tuberculosis screening study.

### Molecular Methods

Genomic DNA was extracted from peripheral blood following the manufacturer's instructions [QIAamp DNA Blood Maxi Kit® (Qiagen, USA)], in the Laboratory of the Department of Biochemistry, University of Ghana, Accra, Ghana.

Previously reported primers for the *GJB2* gene were evaluated using BLAST® and other software, as recommended (Bosch et al., 2014). The entire coding region of *GJB2* (exon 2) was amplified, followed by sequencing using an ABI 3130XL Genetic Analyzer® (Applied Biosystems, Foster City, CA), in the Division of Human Genetics, University of Cape Town, South Africa.

Detection of del(*GJB6*-D13S1830) was performed using the method and primers described by del Castillo et al. (2002, 2003). The entire coding region of *GJB6* was amplified using the method described by Chen et al. (2012). The PCR results were validated by Sanger sequencing of 10% of the samples.

### Data Analysis

Descriptive statistics and non-parametric tests were used for comparisons. The level of significance was set at 5%.

## RESULTS

## Sex, Age of Onset of Hearing Impairment

A total of 1,104 participants were evaluated (**Figure 1**). The male/female ratio was 1.49 (660/444). Most deaf participants (59.6%) had their first comprehensive HI medical test between the ages of 6 and 11 years (**Table 1** and **Figure S1A**). The median age of the students at the first medical diagnosis was 9.0 years, within a range of 2 to 22 years. The majority had pre-lingual HI (68.3%, *n* = 754; **Figure S1B**), of which 92.8% were congenital.

### Audiometric Characterization of HI

Analysis of the students' medical data indicated that 642 of the 1,104 students had a comprehensive HI test (otoscopic ear examination, pure tone audiometry, and/or tympanometry), the characteristics of which are described in **Table S1**. Nearly all the participants had sensorineural HI (99.5%; *n* = 639). Only 1 and 2 students had conductive and mixed HI, respectively.

#### Major Etiologies of Childhood HI in the Study Population

The flowchart of the cohort is shown in **Figure 1**, and the major causes of HI are displayed in **Table 2**. A lower frequency of infectious causes of HI was observed in the present study compared with other studies from sub-Saharan Africa (**Table 2**). Convulsion (with undetermined medical cause) was the most common cause of post-lingual HI, followed by cerebrospinal meningitis (C.S.M.). Other diseases, such as cerebral/complicated malaria, otitis media, and mumps, were also reported as causes of post-lingual HI (**Figure S2**). Over 60% of the students had congenital HI of unknown origin (**Figure S2**).

TABLE 1 | Age at diagnosis and onset of HI.


### Familial HI With Possible Patterns of HI Inheritance

We identified 97 families segregating hearing impairment, accounting for 21.4% of the students. In these families, 50.9% (235/461) of children were living with HI, with an average family size of 6.9. Most of these familial cases were non-syndromic (92/97). The pedigree analysis of the non-syndromic familial cases suggested autosomal recessive inheritance in 96.7% (89/92), with only 2 families exhibiting a pattern compatible with non-syndromic autosomal dominant inheritance. One family exhibited a mitochondrial pattern of inheritance.

Waardenburg syndrome, an autosomal dominant condition, was the most evident syndromic familial condition identified, in 5.1% (5/97) of familial cases, with variable expression of heterochromia in affected members (**Figure 2**).

FIGURE 1 | Flowchart of the recruitment and molecular analysis of hearing impairment cases in Ghana. The GJB2-R143W mutation, previously reported as a founder mutation in Ghana, accounted for 27.2% (22/81) of familial and 7.9% (11/140) of non-familial non-syndromic congenital HI cases.

TABLE 2 | Comparison of our results to other studies in developing African countries.

#### Molecular Analysis Result of GJB2 and GJB6

A total of 81 families segregating non-syndromic hearing loss were molecularly investigated. Although samples were not collected from Adamarobe, the "Deaf Village", 27 of the 81 HI families screened for *GJB2* and *GJB6* were from the Eastern Region of Ghana (**Table S2**), where the "Deaf Village" is located (Kusters, 2012). One individual from each family was sequenced for *GJB2* mutations, and we found a pathogenic mutation in 27.2% (22/81): *GJB2*-R143W in the majority (21/22), in the homozygous state (**Table 3**), and the *GJB2* p.W44\* mutation in one case, also in the homozygous state.

In non-familial non-syndromic cases, the *GJB2*-R143W mutation was found in 7.9% (11/140) of patients (**Figure 1**). In the control population, 2 of the 145 individuals carried the *GJB2*-R143W mutation in the heterozygous state.
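The frequencies reported for familial cases, isolated cases, and controls can be recomputed from the counts given above. The Python sketch below does so, and additionally converts the control carrier count into an allele frequency. That last step is our illustration, not a figure reported by the authors; it assumes each heterozygous carrier holds exactly one mutant allele among the 2 × 145 control chromosomes.

```python
# Counts reported in the Results; no new data are introduced.
familial_screened, familial_positive, familial_r143w = 81, 22, 21
isolated_screened, isolated_r143w = 140, 11
controls, control_carriers = 145, 2


def pct(part, whole):
    """Return a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


familial_positive_pct = pct(familial_positive, familial_screened)  # any GJB2 mutation
familial_r143w_pct = pct(familial_r143w, familial_screened)        # R143W only
isolated_r143w_pct = pct(isolated_r143w, isolated_screened)
carrier_pct = pct(control_carriers, controls)

# Illustrative assumption: one mutant allele per heterozygous carrier,
# out of two alleles per control individual.
allele_freq_pct = round(100 * control_carriers / (2 * controls), 2)

print(familial_positive_pct, familial_r143w_pct,
      isolated_r143w_pct, carrier_pct, allele_freq_pct)
```

The first four values match the reported 27.2%, 25.9%, 7.9%, and 1.4%; under the stated assumption, the allele frequency in controls comes to about 0.69%.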

No *GJB6*-D13S1830 deletion was identified in the samples screened.

### DISCUSSION

The present report is the most comprehensive study of the causes of childhood HI in Ghana. Moreover, we investigated, for the first time, the prevalence of *GJB2* mutations in a non-affected group of individuals from Ghana.

In this study, we observed HI in more boys than girls, although gender has not been reported as a factor that predisposes children to the development of HI (Foerst et al., 2006; Le Roux et al., 2015). This may be because more boys than girls enroll in schools for the deaf, especially in resource-limited regions. Boys with disabilities are often given higher priority for formal education than girls (Groce, 1997; Nagata, 2003; Rousso, 2015). Although the "female protective model" is not common in HI studies, it has been proposed by some researchers to explain the higher prevalence of genetic disorders in males compared to females (Jacquemont et al., 2014; Werling and Geschwind, 2015). According to this model, females carry a higher rate of possible gene disruption but are less often affected by genetic disorders than males (Jacquemont et al., 2014).

FIGURE 2 | [...] represent patients expressing the typical bilateral striking blue eyes phenotype of Waardenburg syndrome, while (B) and (D) represent asymmetrical heterochromia, with patients expressing the phenotype in only one eye.

TABLE 3 | GJB2 mutations among 365 previously studied and 97 Ghanaian families with profound sensorineural hearing impairment.

Hearing impairment screening aims to detect permanent HI at early developmental ages so that appropriate intervention can follow (Sarant et al., 2008; Ching et al., 2017; Ma et al., 2018). There is no universal newborn HI screening program in Ghana, which explains the late diagnosis: most of the study participants had their first comprehensive hearing test at school age, that is, at 6–9 years of age. However, parents/guardians of these children provided the information on the onset of the condition. The late diagnosis of HI in Ghanaian children is partly tied to the limited number of hearing assessment facilities (Waller et al., 2017). In addition, the majority of the HI students were living in remote rural settlements, often with unmotorable roads, and hence had difficulty accessing quality health care.

Post-lingual HI in Africa is often caused by environmental factors (Wonkam et al., 2013). Similar to other reports, complicated malaria, cerebrospinal meningitis, and convulsion (with undetermined cause) were identified in our study as major environmental factors that contribute to post-lingual HI in Ghana (**Table 2**). The high number of congenital cases reported in our study may account for the reduced frequency of infectious causes of HI compared to other studies from Africa. Nonetheless, the identified environmental factors can be prevented by good health care systems as well as preventive health care practices. It is therefore important that governmental policies be implemented to minimize childhood morbidities, which will eventually reduce the prevalence of post-lingual HI.

Pre-lingual hearing impairment was common in our study population, which agrees with other findings (Chibisova et al., 2018). The majority of pre-lingual HI cases are congenital and are usually caused by genetic factors (Wonkam et al., 2013; Behlouli et al., 2016). Waardenburg syndrome was the most common syndromic HI identified among the congenital cases, in line with other African data (Noubiap et al., 2014).

Mutations in *GJB2* were investigated in Ghana 18 years ago, and a common founder mutation, p.R143W, was identified (Hamelmann et al., 2001). The present study revisited the contribution of *GJB2* mutations and confirmed the particularly high proportion of the founder mutation, found in more than ¼ of families segregating HI. This is much higher than what was previously reported (18%), owing to the stringent selection of familial cases in the present study. The majority of the families with HI, and of the families positive for the founder mutation, were from the Eastern Region of Ghana. It is from this region that a high prevalence of congenital HI was reported, hence the name "Deaf Village" (David et al., 1971; Kusters, 2012). There was a relatively high proportion of *GJB2* mutations among the isolated cases of putative genetic origin. This indicates an urgent need to implement *GJB2*-p.R143W testing for patients with HI in clinical practice in Ghana. The p.R143W mutation has also been reported in patients with HI in Japan (Zheng et al., 2015; Kasakura-Kimura et al., 2017), South Korea (Kim et al., 2016), and China (Luo et al., 2017). In addition, we report, in a Ghanaian family, a variant previously described as a Mayan founder *GJB2* nonsense mutation (p.W44\*). The *GJB2* p.W44\* mutation is the most common *GJB2* pathogenic variant in Guatemalan deaf populations and was also reported in Mexico (Martínez-Saucedo et al., 2015). Ghana is an African exception, as most studies in Africa have not identified *GJB2* as a major cause of HI in sub-Saharan African populations (Lebeko et al., 2015; Wonkam, 2015).

This is the first study to investigate the *GJB6*-D13S1830 deletion and *GJB6* coding region variations in Ghana, and we found no mutation, which is in line with previous African data (Bosch et al., 2014; Wonkam et al., 2015). Equally, the *GJB6*-D13S1830 deletion was not found in populations from China (Jiang et al., 2014), India (Padma et al., 2009), and Turkey (Tekin et al., 2003), or among African Americans and Caribbean Hispanics (Samanich et al., 2007). Therefore, the present data further support the hypothesis that the *GJB6*-D13S1830 deletion is a founder mutation (del Castillo et al., 2003).

The study also indicates that more than 2/3 of families with HI are eligible for next-generation sequencing, owing to the highly heterogeneous genetic nature of NSHI and the low proportion of families solved with the single-gene approach applied in this study. Nonetheless, the study did not exclude intronic variants in *GJB2*, which is a possible limitation. Future research should use either high-throughput sequencing platforms to investigate known genes (Shearer et al., 2010; Lebeko et al., 2016) or whole exome sequencing, which will allow identification of novel genes (Diaz-Horta et al., 2012). Indeed, based on the identification of specific inner ear transcripts, it is estimated that more than 1,000 NSHI genes are still to be identified (Hertzano and Elkon, 2012).

To contribute towards the reduction of HI incidence in Ghana, policy-makers must consider integrating newborn screening for HI into the health care system such that every child is screened for both genetic and acquired HI at birth. Early detection of the condition may lead to early intervention (Copley and Friderichs, 2010) which will eventually reduce the public health impact of this condition.

### CONCLUSION

The study showed that environmental factors remain a major cause of hearing impairment in Ghana. The study confirms that connexin 26 (*GJB2*) mutations are the most common cause of familial non-syndromic HI in Ghana, an exception in sub-Saharan Africa, where the frequency of *GJB2* mutations in HI patients is generally close to zero. The *GJB2* p.R143W founder mutation accounted for more than 25% of familial cases and close to 8% of isolated cases of putative genetic origin, and testing for it should be considered for implementation in clinical practice, particularly after newborn screening for HI. The frequency of the *GJB2* p.R143W founder mutation in the general population without a personal or family history of HI was relatively high: 1.4%. The study did not find any del(*GJB6*-D13S1830) deletion. Future studies should employ whole genome sequencing approaches and functional genomics studies to identify the other genes involved in most families, and in isolated cases of HI in Ghana.

### DATA AVAILABILITY

All datasets supporting the conclusions of this study are included in the manuscript and the **Supplementary Files**.

### ETHICS STATEMENT

The study was performed in accordance with the Declaration of Helsinki. Ethical approval was obtained from the Noguchi Memorial Institute for Medical Research Institutional Review Board, the University of Ghana, Accra (NMIMR-IRB-CPN 006/16-17 revd. 2018), and the University of Cape Town's Human Research Ethics Committee, reference 104/2018. Written informed consent was obtained from all participants if they were 18 years or older, or from the parents/guardians with verbal assent from children, including permission to publish photographs.

### AUTHOR CONTRIBUTIONS

Conceived and designed the experiments: GAA, GKA, AW. Performed the experiments: SA, OQ, NM, KM. Patient recruitment, sample and clinical data collection and processing: SA, GKA. Analyzed the data: SA, AW. Contributed reagents/materials/analysis tools: GAA, VN, CK, AW. Wrote the paper: SA, GAA, VN, CK, AW. Revised and approved the manuscript: SA, OQ, GAA, GKA, KM, VN, CK, NM, AW.

### FUNDING

The study was funded by the Wellcome Trust, grant number 107755Z/15/Z to GAA and AW (co-applicant); NIH, USA, grant number U01-HG-009716 to AW; and the African Academy of Science/Wellcome Trust, grant number H3A/18/001 to AW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00841/full#supplementary-material

FIGURE S1 | Onset and time of HI test. (A) Age of deaf students at the first medical HI test. (B) Onset of HI. A paired t-test was used to compare the mean number of students with pre-lingual (n = 754) and post-lingual (n = 336) HI from 11 schools for the deaf. There was a significant difference between the mean number of students with pre- and post-lingual HI, with a P value of 0.0001 (t = 7.68, df = 10).

FIGURE S2 | Major causes of childhood HI in Ghana. (A) Major causes of post-lingual HI in Ghana. (B) Major causes of pre-lingual HI in Ghana. Cerebrospinal meningitis is represented as C.S.M. The cause of HI labelled "accident" comprises motor accidents and medical accidents such as wrong medication, childbirth, and surgery. Diseases such as boil, anemia, Gilbertese, jaundice, measles, mumps, otitis media, and rubella were captured as "other diseases", while "undefined sickness" consists of individuals who developed the condition due to sickness, but the cause of the sickness was not determined.

### REFERENCES

achieve open-set speech recognition five years after cochlear implantation. *Otol. Neurotol. Off. Publ. Am. Otol. Soc. Am. Neurotol. Soc. Eur. Acad. Otol. Neurotol.* 36 (6), 985. doi: 10.1097/MAO.0000000000000723

GJB2, GJB6 and GJA1 in non-syndromic hearing loss in black Africans. *SAMJ S. Afr. Med. J.* 105 (1), 23–26.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Adadey, Manyisa, Mnika, de Kock, Nembaware, Quaye, Amedofu, Awandare and Wonkam. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Erratum: GJB2 and GJB6 Mutations in Non-Syndromic Childhood Hearing Impairment in Ghana

#### Frontiers Production Office

Frontiers Media SA, Lausanne, Switzerland

#### Keywords: hearing impairment, genetics, GJB2, GJB6, Ghana, Africa

#### **An Erratum On:**

**GJB2 and GJB6 Mutations in Non-Syndromic Childhood Hearing Impairment in Ghana.**  *by Adadey SM, Manyisa N, Mnika K, de Kock C, Nembaware V, Quaye O, Amedofu GK, Awandare GA and Wonkam A. Front. Genet.* (2019) 10:841. doi: 10.3389/fgene.2019.00841

Due to a production error, the phrase "GJB2 (connexin 30)" should be "GJB6 (connexin 30)." Furthermore, the phrase "GJB2-D3S1830" should be "GJB6-D3S1830."

A correction has been made to the **Abstract**:

"Our study aimed to investigate GJB2 (connexin 26) and GJB6 (connexin 30) mutations associated with non-syndromic childhood hearing impairment (HI) as well as the environmental causes of HI in Ghana. Medical reports of 1,104 students attending schools for the deaf were analyzed. Families segregating HI, as well as isolated cases of HI of putative genetic origin were recruited. DNA was extracted from peripheral blood followed by Sanger sequencing of the entire coding region of *GJB2*. Multiplex PCR and Sanger sequencing were used to analyze the prevalence of GJB6-D3S1830 deletion. Ninety-seven families segregating HI were identified, with 235 affected individuals; and a total of 166 isolated cases of putative genetic causes, were sampled from 11 schools for the deaf in Ghana. The environmental factors, particularly meningitis, remain a major cause of HI in Ghana. The male/female ratio was 1.49. Only 59.6% of the patients had their first comprehensive HI test between 6 to 11 years of age. Nearly all the participants had sensorineural HI (99.5%; *n* = 639). The majority had pre-lingual HI (68.3%, *n* = 754), of which 92.8% were congenital. Pedigree analysis suggested autosomal recessive inheritance in 96.9% of the familial cases. *GJB2*-R143W mutation, previously reported as a founder mutation in Ghana, accounted for 25.9% (21/81) in the homozygous state in familial cases, and in 7.9% (11/140) of non-familial non-syndromic congenital HI cases, of putative genetic origin. In a control population without HI, we found a prevalence of *GJB2*-R143W carriers of 1.4% (2/145), in the heterozygous state. No GJB6-D3S1830 deletion was identified in any of the HI patients. *GJB2*-R143W mutation accounted for over a quarter of familial non-syndromic HI in Ghana and should be investigated in clinical practice. The large connexin 30 gene deletion (GJB6-D3S1830 deletion) does not account for congenital non-syndromic HI in Ghana. There is a need to employ next generation sequencing approaches and functional genomics studies to identify the other genes involved in most families and isolated cases of HI in Ghana."

Additionally, in the **Results** section, the word "GJB2" should be italicized.

#### Approved by:

Frontiers Editorial Office, Frontiers Media SA, Switzerland

\*Correspondence: Production Office, production.office@frontiersin.org

#### Specialty section:

This article was submitted to Genetic Disorders, a section of the journal Frontiers in Genetics

Received: 18 October 2019 Accepted: 22 October 2019 Published: 26 November 2019

#### Citation:

Frontiers Production Office (2019) Erratum: GJB2 and GJB6 Mutations in Non-Syndromic Childhood Hearing Impairment in Ghana. Front. Genet. 10:1151. doi: 10.3389/fgene.2019.01151

A correction has been made to the **Results**, subsection **Molecular Analysis Result of GJB2 and GJB6:**

"A total of 81 families segregating non-syndromic hearing loss were molecularly investigated. Although samples were not collected from Adamarobe, the 'Deaf village,' 27 out of the 81 HI families screened for *GJB2* and *GJB6* were from the Eastern Region of Ghana (Table S2) where the 'Deaf Village' is located (Kusters, 2012). One individual from each family was sequenced for *GJB2* mutation and we found a pathogenic mutation in 27.2% (22/81) with *GJB2*-R143W in the majority (21/22) in the homozygous state (Table 3); *GJB2* p.W44\* mutation in one case, in the homozygous state." The original version of this article has been updated.

*Copyright © 2019 Frontiers Production Office. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Environmental Health Research in Africa: Important Progress and Promising Opportunities

#### *Bonnie R. Joubert1\*, Stacey N. Mantooth2 and Kimberly A. McAllister1*

1 National Institute of Environmental Health Sciences, National Institutes of Health, Durham, NC, United States, 2 VISTA Technology Services, Durham, NC, United States

The World Health Organization in 2016 estimated that over 20% of the global disease burden and deaths were attributed to modifiable environmental factors. However, data clearly characterizing the impact of environmental exposures and health endpoints in African populations are limited. To describe recent progress and identify important research gaps, we reviewed literature on environmental health research in African populations over the last decade, as well as research incorporating both genomic and environmental factors. We queried PubMed for peer-reviewed research articles, reviews, or books examining environmental exposures and health outcomes in human populations in Africa. Searches utilized Medical Subject Headings (MeSH) terms for environmental exposure categories listed in the March 2018 US National Report on Human Exposure to Environmental Chemicals, which includes chemicals with worldwide distributions. Our search strategy retrieved 540 relevant publications, with studies evaluating health impacts of ambient air pollution (n = 105), indoor air pollution (n = 166), heavy metals (n = 130), pesticides (n = 95), dietary mold (n = 61), indoor mold (n = 9), per- and polyfluoroalkyl substances (PFASs, n = 0), electronic waste (n = 9), environmental phenols (n = 4), flame retardants (n = 8), and phthalates (n = 3), where publications could belong to more than one exposure category. Only 23 publications characterized both environmental and genomic risk factors. Cardiovascular and respiratory health endpoints impacted by air pollution were comparable to observations in other countries. Air pollution exposures unique to Africa and some other resource-limited settings were dust and specific occupational exposures. Literature describing harmful health effects of metals, pesticides, and dietary mold represented a context unique to Africa. Studies of exposures to phthalates, PFASs, phenols, and flame retardants were very limited. 
These results underscore the need for further focus on current and emerging environmental and chemical health risks as well as better integration of genomic and environmental factors in African research studies. Environmental exposures with distinct routes of exposure, unique co-exposures and co-morbidities, combined with the extensive genomic diversity in Africa may lead to the identification of novel mechanisms underlying complex disease and promising potential for translation to global public health.

Keywords: G x E, Africa, environmental, pesticides, metals, mold, air pollution

#### Edited by:

Mayowa Ojo Owolabi, University of Ibadan, Nigeria

#### Reviewed by:

Robinson Odong, Makerere University, Uganda Orish Ebere Orisakwe, University of Port Harcourt, Nigeria

> \*Correspondence: Bonnie R. Joubert joubertbr@nih.gov

#### Specialty Section

This article was submitted to Evolutionary and Genetics, a section of the journal Frontiers in Genetics

Received: 17 November 2018 Accepted: 23 October 2019 Published: 16 January 2020

#### Citation:

Joubert BR, Mantooth SN and McAllister KA (2020) Environmental Health Research in Africa: Important Progress and Promising Opportunities. Front. Genet. 10:1166. doi: 10.3389/fgene.2019.01166

## INTRODUCTION

A global assessment by the World Health Organization (WHO) in 2016 estimated that 24% of the global disease burden and 23% of all deaths were attributed to modifiable environmental factors, including physical, chemical, and biological hazards to human health (Prüss-Ustün et al., 2016). The highest number of deaths per capita attributable to the environment was reported for sub-Saharan Africa, primarily reflecting infectious diseases, but also noncommunicable diseases and injuries. Disease burden was highest (36%) among children. In modern Africa, there has been rapid industrial development in the absence of health and environmental safety guidelines that parallel those in the United States, Canada, or Europe (World Health Organization, 2017). Heavy metals, pesticides, air pollution, water contaminants, and waste represent hazardous exposures that are increasing in Africa (Nweke and Sanders, 2009), but they have received limited research attention regarding the implications for human health. Many chemicals that pose health risks to exposed populations in Africa and around the world are known to be endocrine disrupting chemicals (EDCs). A meeting of scientists on this issue took place in South Africa in 2015, leading to a "call to action" to utilize available scientific knowledge to address the impact of EDCs on human as well as wildlife health in Africa (Bornman et al., 2017). This meeting report also called for a shift from reaction to prevention, with utilization of existing datasets, increased biomonitoring, and surveillance of environmental chemicals, as well as further research including the support of longitudinal studies (Bornman et al., 2017).

Often in parallel to environmental health research, genomic research related to The Human Genome Project has advanced our understanding of disease susceptibility with enormous productivity and ongoing promise. Initial research in genomics had limited participation from African study populations, despite the important genomic diversity represented by African populations. However, huge efforts to address this limitation took place in the last decade resulting in an ongoing genomic research revolution in Africa (Consortium et al., 2014). Much of that effort was enabled by investments from the African Society of Human Genetics, National Institutes of Health (NIH), and the Wellcome Trust through the Human Heredity and Health in Africa (H3Africa) consortium (www.h3africa.org). The H3Africa consortium began in June 2010 to support genomic and epidemiological research led by African scientists (Consortium et al., 2014). Genomic research in Africa is not limited to the bounds of this consortium, but it represents a research infrastructure that enables innovative science. For example, studies covering common diseases such as cardiovascular (Owolabi et al., 2014), neurological (Akinyemi et al., 2016), respiratory (Zar et al., 2016a; Zar et al., 2016b), kidney (Osafo et al., 2015), and other non-communicable diseases are represented in this consortium. Developments in pharmacogenomics (Warnich et al., 2011) and the human microbiome (Adebamowo et al., 2017) are also underway, and many studies incorporate information about HIV, malaria, tuberculosis, and other common infections in Africa. The H3Africa consortium also promotes opportunities for training in bioinformatics (Adoga et al., 2014; Oluwagbemi et al., 2014; Mulder et al., 2016), supports three biorepositories on the African continent, and facilitates policy and ethical recommendations (Consortium et al., 2014; Barchi and Little, 2016; Munung et al., 2016; de Vries and Pool, 2017).

Not only does Africa offer the richest genomic diversity in the world, it also has an extensive diversity of under-researched environmental exposures, including some exposures unique to the continent, which present important public health issues. Integration of genomic variants with environmental risk factors is vital to properly characterize disease risk in Africa. However, the starting point for incorporating genomic (G) and environmental (E) factors can be daunting. Important questions include: What environmental exposures are relevant to what African populations? What are the priorities? What has been studied and what are the relevant health outcomes? How do the exposures and health outcomes differ compared to populations in other regions of the world? How can genomics and environmental exposures be integrated?

The purpose of this review is to summarize and provide examples of the latest environmental health research and the G x E interactions that have been characterized this decade in Africa. In this paper we use the "G x E" terminology to broadly represent the integration of genomic and environmental data in a research project or study population. It can represent various statistical or data science methods for evaluating both genomic and environmental factors and is not strictly referring to the biological or statistical sense of the term interaction. Our review expands previous reviews describing the distribution of environmental exposures in selected African populations by focusing on the evaluated health outcomes related to environmental exposures and including all of Africa.

#### Literature Search Strategy

We queried the PubMed database to identify peer-reviewed research or review articles or books (referred to generally as publications) examining environmental exposures and health outcomes in human populations residing on the African continent. We searched for publications evaluating the following environmental exposure categories: Ambient air pollution, indoor air pollution, electronic waste, environmental phenols, flame retardants, dietary mold, indoor mold, pesticides, perfluoroalkyl substances (PFASs), phthalates, and heavy metals. All search strategies, which included keywords as well as Medical Subject Headings (MeSH), are provided in the Supplementary Text, pulling preliminary results. All African countries were represented in the query and no exclusions were made based on the language of publication. The date range searched was from January 1, 2010 to March 20, 2018. Research articles were excluded if they did not include a measure/data for the queried exposure(s) and/or any health outcome(s). For example, research articles describing biomonitoring efforts or surveillance of human exposure to chemicals were not included if they did not also measure at least one health endpoint in a study population. We further refined our search to examine a subset of research or review articles that incorporated genomics, representing G x E research articles.
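To make the structure of such a search concrete, the sketch below assembles one PubMed query string that combines exposure terms, a geographic term, and the stated publication-date window. It is illustrative only: the authors' actual search strategies are in their Supplementary Text, the function name `build_pubmed_query` and the choice of example terms are our assumptions, and the code builds the string without contacting PubMed.

```python
from datetime import date


def build_pubmed_query(exposure_terms, region_term, start, end):
    """Combine exposure terms (OR) with a region and a date range (AND)."""
    exposure = " OR ".join(exposure_terms)
    date_range = (
        f'("{start:%Y/%m/%d}"[Date - Publication] : '
        f'"{end:%Y/%m/%d}"[Date - Publication])'
    )
    return f"({exposure}) AND {region_term} AND {date_range}"


# Hypothetical example for the indoor air pollution category.
query = build_pubmed_query(
    exposure_terms=[
        '"Air Pollution, Indoor"[MeSH Terms]',
        '"indoor air pollution"[Title/Abstract]',
    ],
    region_term='"Africa"[MeSH Terms]',
    start=date(2010, 1, 1),   # start of the searched date range
    end=date(2018, 3, 20),    # end of the searched date range
)
print(query)
```

Pairing a MeSH term with a free-text keyword mirrors the statement that the strategies included keywords as well as MeSH; a real strategy would likely need additional terms, such as individual country names.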

**Abbreviations:** BC, black carbon; CO, carbon monoxide; NO2, nitrogen dioxide; O3, ozone; PAHs, polycyclic aromatic hydrocarbons; PM, particulate matter; SO2, sulfur dioxide; VOCs, volatile organic compounds.

## RESULTS

Our literature search identified a total of 540 publications, representing 482 research articles, 57 reviews, and 1 book. A full list of the publications is provided in **Supplementary Table 1**. The results per exposure category are displayed in **Table 1** and **Figures 1** and **2**, where publications could belong to more than one category. The largest numbers of publications identified in our search represented exposures to indoor air pollution (n = 166), heavy metals (n = 130), ambient air pollution (n = 105), pesticides (n = 95), and dietary mold (n = 61). Notably fewer publications were retrieved for the exposure categories perfluoroalkyl substances (n = 16 initially, 0 after restricting to only those evaluating health outcomes), electronic waste (n = 9), indoor mold (n = 9), flame retardants (n = 8), environmental phenols (n = 4), and phthalates (n = 3). When we further subset the overall results to publications also evaluating genomic susceptibility or G x E interactions, we identified only 23 publications (21 research articles, 2 reviews, and no books). To summarize the publications across exposure categories, we highlight the important health endpoints, diseases, or outcomes evaluated, some specific exposures measured (and when possible, how measured), important at-risk or vulnerable populations, and current research/data gaps.
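Because publications could belong to more than one category, the per-category counts do not sum to the 540 unique publications. The Python sketch below makes that bookkeeping explicit using only the counts reported above. The variable names are ours, and reading the surplus as overlap assumes that every publication was assigned to at least one category.

```python
# Per-category publication counts reported in the text.
category_counts = {
    "indoor air pollution": 166,
    "heavy metals": 130,
    "ambient air pollution": 105,
    "pesticides": 95,
    "dietary mold": 61,
    "electronic waste": 9,
    "indoor mold": 9,
    "flame retardants": 8,
    "environmental phenols": 4,
    "phthalates": 3,
    "perfluoroalkyl substances": 0,
}
by_type = {"research articles": 482, "reviews": 57, "books": 1}

unique_publications = sum(by_type.values())
category_assignments = sum(category_counts.values())

# Assignments beyond one per publication, i.e. the amount of overlap.
extra_assignments = category_assignments - unique_publications

# Share of publications that also addressed genomic factors (23 reported).
gxe_share_pct = round(100 * 23 / unique_publications, 1)

print(unique_publications, category_assignments,
      extra_assignments, gxe_share_pct)
```

The category counts total 590 against 540 unique publications, so there are 50 category assignments beyond one per publication, and the 23 G x E publications make up about 4.3% of the total.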

### INDOOR AIR POLLUTION

We identified a total of 166 publications describing indoor air pollution and health endpoints across the African continent (**Table 1**). A 2016 *Lancet* review of 79 metabolic risk factors in a systematic analysis of the global burden of disease indicated that between 1990 and 2015, global exposure to household air pollution as well as unsafe sanitation, childhood underweight status, childhood stunting, and smoking, each decreased by more than 25% (GBD 2015 Risk Factors Collaborators, 2016). Household air pollution was listed as one of the top ten largest contributors to global disability-adjusted life-years (DALYs), representing 85.6 million (66.7 million to 106.1 million) global DALYs (2016).

TABLE 1 | Summary of literature search results: Landscape of environmental health research in African populations. †

| Exposure category | Example exposure subcategories ‡ | Example sources of exposures | # Environmental health publications |
| --- | --- | --- | --- |
| Indoor air pollution | Particulate matter (PM2.5, PM10), carbon monoxide (CO), volatile organic compounds (VOCs), aeroallergens, dust mites, sulfur dioxide (SO2), nitrogen dioxide (NO2), black carbon (BC), polycyclic aromatic hydrocarbons (PAHs) | Cooking practices, cook stove type, environmental tobacco smoke, home heating practices, pests, domesticated and agricultural animals | 166 |
| Ambient air pollution | PM2.5, PM10, CO, SO2, NO2, ozone (O3), BC, PAHs | Vehicle emissions, wild fires, prescribed burning, biomass burning, tobacco smoking, cooking, and factory emissions | 105 |
| Heavy metals | Antimony, Arsenic, Cadmium, Chromium, Cobalt, Copper, Lead, Manganese, Mercury, Nickel, Selenium, Tin, Tungsten, Uranium, Zinc | Contaminated water, mining/occupational, diet, paint | 130 |
| Pesticides | Pyrethroids, organophosphates, organochlorines | Application of pesticides and exposure through agricultural occupations, indoor residual spraying, pest control | 95 |
| Dietary mold | *A. flavus* and *A. parasiticus* producing aflatoxin; mycotoxins | Storage of staple foods such as groundnuts/peanuts, corn, cassava | 61 |
| Indoor mold | Airborne *Aspergillus* species (*A. niger*, *A. flavus*, *A. fumigatus*) | Moist home/work conditions, flour mills and bakeries with grinding of grains | 9 |
| Electronic waste | Discarded electronic devices that can contain lead, cadmium, brominated flame retardants (BFRs), americium, mercury, hexavalent chromium, sulphur, perfluorooctanoic acid (PFOA), beryllium oxide | Discarded computers and accessories, mobile phones, audiovisual materials, or appliances | 9 |
| Environmental phenols | 2,5-Dichlorophenol, Benzophenone-3 (Oxybenzone), Bisphenol A, Bisphenol F, Bisphenol S, Triclosan, Ethyl paraben, Propyl paraben, Butyl paraben | Plastics, food packaging, personal-care products | 4 |
| Flame retardants | PBDEs, brominated flame retardants (BFRs), TBBPA, hexabromocyclododecanes (HBCDs), OPFRs | Indoor furniture, recycled materials (e-waste related plastic casings) | 8 |
| Phthalates | Mono-benzyl phthalate, Mono-n-butyl phthalate, Mono-isobutyl phthalate, Mono-ethyl phthalate, Mono-(2-ethylhexyl) phthalate, Mono-(2-ethyl-5-hydroxyhexyl) phthalate, Mono-(2-ethyl-5-oxohexyl) phthalate, Mono-(2-ethyl-5-carboxypentyl) phthalate, Mono-(3 carboxypropyl) phthalate | Vinyl flooring, detergents, plastics, personal-care products, food packaging | 3 |
| Perfluoroalkyl substances (PFASs) | Perfluorooctane sulfonate, Perfluorooctanoic acid | Manufacturing, industry, exposure through fish consumption | 0 |

†Table sorted by number of publications and exposure category. Exposure category and example chemicals largely reflected the chemicals listed in the Fourth National Report on Human Exposure to Environmental Chemicals (https://www.cdc.gov/exposurereport/index.html). Exposures in the following categories were not included: Food safety, sanitation, waste management, personal or second-hand tobacco smoke, climate, or weather-related events.

‡Full details of subcategories can be found in Supplementary Table 1, which details the search strategy.

### Health Outcomes

Across the indoor air pollution articles identified in our literature review, a critical health outcome noted was cardiovascular disease. Cardiovascular morbidities related to household air pollution have been identified in other countries, such as China, Bangladesh, and Pakistan, raising ongoing concern for these risks in Africa (Noubiap et al., 2015). Studies specific to African populations identified in our review evaluated the impact of indoor air pollution on cardiovascular endpoints, such as cardiac chamber structure and function (Agarwal et al., 2018), blood pressure (Quinn et al., 2016; Alexander et al., 2017; Quinn et al., 2017; Arku et al., 2018; Swart et al., 2018), and inflammatory biomarkers (Olopade et al., 2017). Five of these articles focused on exposures to cooking or biomass fuel use in the home (Quinn et al., 2016; Alexander et al., 2017; Olopade et al., 2017; Quinn et al., 2017; Arku et al., 2018). Respiratory disease represented another major health outcome impacted by indoor air pollution, evaluated as the primary outcome of interest or a relevant co-morbidity in 77 of the identified indoor air pollution articles. This included articles describing general child respiratory health (Albers et al., 2015), acute lower respiratory tract infections in children (Buchner and Rehfuess, 2015), shortness of breath (Das et al., 2017), and asthma. Asthma and related morbidities were characterized in 37 articles and included outcomes such as asthma diagnosis and severity (Oluwole et al., 2017), asthma control (Kuti et al., 2017), allergen sensitization (Mbatchou Ngahane et al., 2016), and atopy (Morcos et al., 2011). Indoor air pollution-related impairments of innate immunity were also noted in some studies. For example, Rylance et al. (2015) observed an association between household air pollution and inflammatory responses, assessed by IL6 and IL8 production, and altered phagocytosis in macrophages exposed *in vitro* to respirable-sized particulates.

### Exposures Measured

Most of the studies evaluating indoor air pollution focused on cooking practices including biomass fuel burning in indoor stoves. A total of 24 of the indoor air pollution research articles described exposure to dust. For example, dust was noted as a trigger for allergic rhinitis (Adegbiji et al., 2018) and house dust/dust mite exposure was associated with asthma (Bardei et al., 2016; Flatin et al., 2018). Particulate matter was evaluated in 24 of the indoor air pollution research studies, most focusing on PM10 (Abou-Khadra, 2013; Ibhafidon et al., 2014; Makamure et al., 2016; Jafta et al., 2017; Nkhama et al., 2017; Nkosi et al., 2017; Mentz et al., 2018) and PM2.5 exposures (Oluwole et al., 2013; Chafe et al., 2014; Ibhafidon et al., 2014; Dutta et al., 2017; Lacey et al., 2017; Lin et al., 2017a; Malley et al., 2017; Nkhama et al., 2017; Wylie et al., 2017a; Wylie et al., 2017b; Mentz et al., 2018). Some studies also measured NO, NO2, SO2, CO, and O3 (Jafta et al., 2017; Wylie et al., 2017a). DDT and DDE contamination from indoor residual spraying was found in household undisturbed dust and associated with DDT and DDE metabolites in serum of residents (Gaspar et al., 2015).

### At Risk Populations

Women who do most of the household cooking, and children who help with or stay close to the cooking, may be most affected by indoor air pollution, depending on family household practices.

### Research/Data Gaps

Although Rylance et al. (2015) described impairments to the immune system with exposure to indoor air pollution, the interaction between this impairment and susceptibility to infections such as HIV warrants further research. A review by El-Gamal et al. (2017) describes literature on a wide range of aeroallergens across Africa, but data on indoor aeroallergens are not included for all regions. The authors note the importance of characterizing genetic susceptibility in the context of immunodeficiencies in Africa, which has not received sufficient research attention.

## AMBIENT AIR POLLUTION

### Health Outcomes

We identified 105 articles describing health impacts of ambient air pollution in Africa (**Table 1**). Nine of these represented review papers, covering outcomes such as chronic lung diseases among HIV positive individuals (Attia et al., 2017), children's health such as pediatric asthma (Wolff et al., 2012; Jassal, 2015), biomarkers of genotoxicity (DeMarini, 2013), reproductive outcomes like preterm birth (Kumar et al., 2017; Malley et al., 2017), and severity of sickle cell disease (Tewari et al., 2015). Articles represented scientific depth and detail across the continent, covering key public health issues. Among all article types, notable endpoints evaluated were cardiovascular and cardiometabolic outcomes (Wichmann and Voyi, 2012; Benaissa et al., 2016), as well as broader burden of disease or life expectancy estimates (Berhane et al., 2016; Mokdad and GBD 2015 Eastern Mediterranean Region Lower Respiratory Infections Collaborators, 2018; Etchie et al., 2018). Some studies reported null findings. For example, an incremental life-time cancer risk was considered low in the context of exposure to PAHs from air pollution among city center residents of Kumasi, Ghana (Bortey-Sam et al., 2015). Additional outcomes evaluated included markers of oxidative stress, inflammatory cytokines, and chemokines (Cachon et al., 2014), chronic bronchitis from occupational exposures to dust (Hinson et al., 2016), elevated prostate specific antigen (PSA) among young men exposed occupationally to quarry pollutants (Ewenighi et al., 2017), chronic respiratory symptoms among limestone factory workers in Zambia (Bwalya et al., 2011), and allergic rhinitis in urban areas (Flatin et al., 2018). 
Exacerbation of silicosis due to higher doses of particulate matter exposure, impacts of exposure to prenatal air pollution on DNA methylation in the context of HIV status and antiretroviral treatment (Goodrich et al., 2016), asthma and asthma exacerbations, mortality, cerebrovascular outcomes, cardiovascular outcomes, and daily respiratory mortality were also evaluated.

### Exposures Measured

Several studies evaluated both indoor and ambient air pollution exposures, and articles covered both urban and rural settings (**Supplementary Table 1**). Ambient air pollution exposure in urban areas was noted in 36 publications, including a study of air pollution and sleep disorders in children living in Egypt (Abou-Khadra, 2013). Occupational exposures were another important source of ambient air pollution exposure. Activities included limestone processing in Zambia (Bwalya et al., 2011), exposure to desert dust in West Africa (reviewed by de Longueville et al., 2013), traffic exhaust (DeMarini, 2013), dust and fumes in artisanal mining (Ekosse, 2011), city transit-related air pollution (Elenge et al., 2011; Elenge and De Brouwer, 2011; Ekpenyong et al., 2012), stone quarrying industry exposures including deposition of inhaled aerosol particles at an industrial site in Egypt (Furi et al., 2017), sulfur dioxide (SO2) emissions from platinum group metal (PGM) smelting in Zimbabwe (Gwimbi, 2017), and charcoal processing activities in Namibia, including exposure to charcoal dust (Hamatui et al., 2016). DNA adducts were also used in some studies to measure air pollution exposure among urban and suburban residents (Ayi-Fanou et al., 2011).

### At Risk Populations

Across the articles evaluating air pollution exposure, occupationally exposed workers represented a critical population at risk. For example, exposure to pollutants through dust was mentioned in approximately one third of the ambient air pollution studies, half of which evaluated occupational exposures. Another study identified higher levels of DNA adducts related to air pollution among taxi-motorbike drivers, roadside residents, street vendors, and gasoline sellers, compared to suburban and village inhabitants in Benin (Ayi-Fanou et al., 2011). Importantly, the impact of occupation-related pollutant exposures is not limited to workers; people living near work sites may also be affected. For example, Durban, South Africa is one of Africa's busiest ports, and the combination of industry, traffic, and biomass burning has led to substantial air pollution. A study of school children in Durban observed associations between air pollution exposures and respiratory symptoms, with notable burden on children with asthma (Mentz et al., 2018). Immunocompromised individuals such as those living with HIV may also be more likely to experience chronic respiratory symptoms, abnormal spirometry, and chest radiographic abnormalities following air pollution exposures (Attia et al., 2017).

### Research/Data Gaps

Ambient air pollution exposure has been well characterized as an issue across Africa and around the world. Health impacts comparable to what has been identified in other populations were particularly clear for respiratory outcomes. Given the unique occupational settings in some regions of Africa, very high levels of exposure are of ongoing concern as is the peripheral impact on children and immunocompromised individuals.

## HEAVY METALS

### Health Outcomes

Reproductive outcomes have been associated with various high heavy metal exposures in Africa. For example, associations of impaired semen quality and possible infertility with higher levels of cadmium, lead, zinc, and selenium have been reported (Awadalla et al., 2011; Oluboyo et al., 2012; Abarikwu, 2013; Famurewa and Ugwuja, 2017). Elevated serum heavy metals (cadmium and lead) along with a reduction of essential micronutrients (zinc and copper) may also contribute to recurrent pregnancy loss (Ajayi et al., 2012). An association of lower maternal zinc, copper, and cadmium levels, as well as cord copper levels, with low birthweight newborns has also been observed (Abass et al., 2014; Rollin et al., 2015). Elevated lead and arsenic exposures may be associated with preterm birth and other birth outcomes in general (Kumar et al., 2017; Rollin et al., 2017), and cord blood mercury was significantly associated with birth weight, length, and head and chest circumference in a Nigerian study population (Obi et al., 2015). Several African countries have a high level of preeclampsia, and significant associations between preeclampsia and serum levels of calcium and magnesium or excretion of high amounts of several toxic metals, especially lead, have been identified (Ikechukwu et al., 2012; Motawei et al., 2013; Elongi Moyene et al., 2016). Egypt has one of the highest incidences of intrauterine growth retardation, and this appears to be positively correlated with heavy metal toxicity (El-Baz et al., 2015).

Lead toxicity (sometimes in combination with high cadmium exposures) has been shown to be associated with renal function impairment (Alasia et al., 2010b). Occupationally lead-exposed subjects have been shown to have significantly higher blood lead levels, as well as higher serum urea, creatinine, and serum uric acid levels, and other renal biomarkers and markers of nephrotoxicity. Multiple studies suggest a higher risk for developing hyperuricemia and renal impairment with high lead exposure (Alasia et al., 2010a; Cabral et al., 2012; Cabral et al., 2015). Workers in a variety of occupations, including automobile technicians, e-waste workers, miners, and shooting-range workers, are at risk for substantially high lead levels (Saliu et al., 2015; Obiri et al., 2016b; Mathee et al., 2017). Blood lead levels in school children have been associated with a variety of behavioral and cognitive outcomes, including lower IQ, poorer school performance, anti-social or violent tendencies, hearing deficiencies, and delayed onset of puberty (Naicker et al., 2010; Tomoum et al., 2010; Abdel Rasoul et al., 2012; Naicker et al., 2012; Kashala-Abotnes et al., 2016; AbuShady et al., 2017; Nkomo et al., 2017).

A high prevalence of acute lead poisoning in children has been an ongoing issue in many African countries (Bouftini et al., 2015; Bose-O'Reilly et al., 2018), with the lead poisoning crisis in Zamfara State, Northern Nigeria noted as one of the worst such cases in modern history. More than 400 children have died in Zamfara as a result of ongoing lead intoxication since early in 2010, and this acute lead poisoning is believed to be related to artisanal gold mining (Moszynski, 2010; Dooyema et al., 2012; Bartrem et al., 2014). Younger children with high venous blood lead level thresholds during the first year of the Zamfara outbreak response displayed a variety of neurological outcomes and were at higher risk for encephalopathy (Greig et al., 2014). Another recent lead poisoning outbreak reportedly occurred from consumption of an ayurvedic medicine in South Africa (Mathee et al., 2015).

A variety of cancers have also been associated with heavy metal exposure (Fasinu and Orisakwe, 2013; Obiri et al., 2016a; Obiri et al., 2016b). Low levels of selenium were associated with the development of breast cancer (Alatise et al., 2013), as were higher levels of lead for infiltrating ductal breast carcinoma (Alatise and Schrauzer, 2010). Cadmium and arsenic were found to be synergistically associated with bladder cancer, and both exposures are often also associated with smoking status (Feki-Tounsi et al., 2013a; Feki-Tounsi et al., 2013b; Feki-Tounsi et al., 2014). A higher serum selenium concentration and a deficiency of zinc and molybdenum were found to be associated with esophageal squamous dysplasia (Ray et al., 2012; Pritchett et al., 2017). A positive association between cadmium exposure and pediatric cancer may also be present (Sherief et al., 2015). High levels of some heavy metals (chromium, nickel, cadmium) were associated with head and neck cancer as well (Khlifi et al., 2013a; Khlifi et al., 2013b).

Many studies reported neurological outcomes associated with occupational exposure to mercury. Prominent symptoms among fluorescent lamp factory workers exposed to mercury included tremors, emotional lability, memory changes, neuromuscular changes, and performance deficits in tests of cognitive function (Al-Batanony et al., 2013). Neurological symptoms, memory disturbances, and anxiety and depression were found in dentists exposed to mercury. Bilateral and symmetric intentional tremor in both upper limbs was found in dentists exposed to particularly high levels of mercury (Chaari et al., 2015). Chronic mercury intoxication, with tremor, ataxia and other neurological symptoms, along with kidney dysfunction and immunotoxicity, has been identified in individuals with high body burdens of mercury living in or near artisanal small-scale mining communities. Exposed groups showed poorer results in different neuropsychological tests. Over half of amalgam burners (workers with highest mercury levels as a group) were found to have symptoms of mercury intoxication (Bose-O'Reilly et al., 2017), and a large proportion of small-scale gold miners have mercury exposures above occupational exposure limits (Tomicic et al., 2011; Gibb and O'Leary, 2014; Steckling et al., 2014; Mensah et al., 2016).

The early effects of methylmercury due to fish consumption and other possible sources of exposure have also been extensively studied. Some negative outcomes associated with growth and nervous system effects on fetuses and newborns, cognitive function, reproduction, and longer-lasting cardiovascular effects as adults have been observed (Karagas et al., 2012; Gonzalez-Estecha et al., 2014). However, other nutrients, particularly n-3 polyunsaturated fatty acids (PUFAs) in fish, may modify some of these health effects (Lynch et al., 2011; Gribble et al., 2015; Strain et al., 2015). For example, although an adverse association of educational measures with postnatal mercury exposure in males but not females was found in one study from the Seychelles Child Development Study (Davidson et al., 2010), a number of other studies from this cohort have found no significant associations between methylmercury exposure (either through fish consumption or prenatal exposure to dental amalgams) and neurodevelopmental outcomes (Watson et al., 2011; Watson et al., 2012; Watson et al., 2013; van Wijngaarden et al., 2013; van Wijngaarden et al., 2017).

A limited number of other studies have assessed various heavy metals and trace elements in relation to health outcomes. Alterations of some essential trace metals may play a role in the development of diabetes mellitus and obesity in children and older adults (El Husseiny et al., 2011; Harani et al., 2012; Azab et al., 2014; Badran et al., 2016). Arsenic and lead appear to impact diabetes and cardiovascular outcomes but have been studied very little in the African context (Ettinger et al., 2014). Exposure to arsenic was significantly associated with increased odds of asthma and tachycardia in one report (Bortey-Sam et al., 2018). Neurocognitive and motor impairments observed in konzo, a motor neuron disease associated with cassava cyanogenic exposure in nutritionally challenged African children, may possibly be driven by the combined effects of cyanide toxicity and selenium deficiency (Bumoko et al., 2015). Selenium and a number of other trace elements may also influence goiter development and general thyroid metabolism (Kishosha et al., 2011; Maouche et al., 2015; El-Fadeli et al., 2016; Gashu et al., 2016). Liver function may be compromised in nickel-plating workers (El-Shafei, 2011). Chronic neuropathology appears to be associated with chronic manganese exposure in South African mine workers (Gonzalez-Cuyar et al., 2014). Some trace metals may also play a role in the development of anemia (Henriquez-Hernandez et al., 2017). Low serum zinc levels were associated with acute lower respiratory infections (Ibraheem et al., 2014). Elevated blood lead levels seem to be associated with increased asthma severity (Mohammed et al., 2015). Selenium deficiency may be a risk factor for peripartum cardiomyopathy as well as other vascular complications and the impact of this may vary based on race (Karaye et al., 2015; Swart et al., 2018). 
An association of some metals with the risk of nasosinusal polyposis disease was observed for some genetic variants involved in DNA repair pathways affecting susceptibility (Khlifi et al., 2015; Khlifi et al., 2017). High concentrations of some harmful elements in geophagic clays eaten in Africa may be associated with cardiovascular outcomes (Olatunji et al., 2014). Mineral imbalances and lead exposure may also be associated with elevated blood pressure (Rebacz-Maron et al., 2013; Were et al., 2014). Disturbances in copper have been implicated in one study of Parkinson's disease as well (Younes-Mhenni et al., 2013).

A connection between autism and various metals has also been studied. Altered urinary porphyrins, biomarkers of mercury toxicity, were observed in Egyptian children with autism spectrum disorder (Khaled et al., 2016). Levels of mercury, lead, and aluminum in the hair of autistic patients were significantly higher than in controls in one study (Mohamed Fel et al., 2015). High exposures to some heavy metals, particularly lead and mercury, have been treated with chelating agents, which appeared to improve autistic symptoms (Yassa, 2014).

### Exposures Measured

Mercury was sometimes determined by using a direct mercury analyzer, while most heavy metals were measured by atomic absorption spectrophotometer in blood and serum (and sometimes hair, nails, and air/soil samples) (Ojo et al., 2014; Were et al., 2014; Sherief et al., 2015; Iwegbue et al., 2017). The quantification of metals in various tissues was also assessed by atomic absorption spectroscopy (Feki-Tounsi et al., 2014). A variety of biomarkers were incorporated into various studies, especially to monitor kidney injury or dysfunction (Samir and Aref, 2011; Cabral et al., 2012; Cabral et al., 2015). Some heavy metals' association with lipid peroxidation, DNA damage, oxidative stress, or apoptosis was assessed (El-Baz et al., 2015; Bortey-Sam et al., 2018) and the genotoxic impact of some occupational exposures was explored (El Shanawany et al., 2017).

### Vulnerable Populations

A variety of occupations clearly pose high risks for substantial exposure to heavy metals. Industrial metals are presently contaminating the environment and the water supplies, and a lack of worker education and of personal protective equipment has been reported (Alatise and Schrauzer, 2010; Mensah et al., 2016). Individuals living near landfills and e-waste sites, particularly children, are at risk for a variety of exposures, as e-waste constituents with heavy metal contamination can accumulate in soil and surrounding vegetation to toxic and genotoxic levels that could induce adverse health effects in exposed individuals (Alabi et al., 2012; Cabral et al., 2012). The outbreaks of fatal childhood lead poisoning illustrate the extreme vulnerability of young children (Dooyema et al., 2012; Bartrem et al., 2014). Other studies demonstrated the more subtle health outcomes related to lead exposures and suggest that even in the absence of overt clinical manifestations of lead toxicity, knowledge of lead exposure may influence the diagnosis in children presenting with anemia, intellectual impairment, poor academic performance, hearing impairments, and other outcomes (Abdel Rasoul et al., 2012).

### Research Gaps

Numerous studies suggest a variety of interactions among multiple heavy metals and trace elements, and an impact of these interactions on health outcomes. The interaction between lead and selenium is one of many interesting interactions associated with some cancers, as lead may abolish the natural inhibitory effect on carcinogenesis observed for selenium (Alatise and Schrauzer, 2010). A synergistic interaction between cadmium and arsenic is also associated with bladder cancer (Alatise and Schrauzer, 2010; Feki-Tounsi et al., 2013a; Feki-Tounsi et al., 2014). There was evidence that obese children may be at a greater risk of developing an imbalance (mainly deficiency) of trace elements, which may be playing an important role in the pathogenesis of obesity and related metabolic risk factors (Azab et al., 2014). The mechanistic interactions of many heavy metals and trace elements, and the impact of these complex co-exposures on a variety of health outcomes, are a substantial research gap in our current understanding.

The lead poisoning in Zamfara is an extreme example of both lead and multiple heavy metal mortality and morbidity, but highlights the importance of environmental remediation, chelation therapy, public health education, and control of mining activities to prevent future outbreaks (Dooyema et al., 2012; Bartrem et al., 2014). Furthermore, the primary source of lead pollution responsible for the lead poisoning of children in Nigeria appeared not to come from official mining activities but mainly from small scale operations conducted by local villagers, suggesting that some governmental regulation may be warranted (Moszynski, 2010). The oral chelating agent 2,3-dimercaptosuccinic acid (DMSA, succimer) appeared to be pharmacodynamically effective for the treatment of severe childhood lead poisoning in a resource-limited setting (Thurtle et al., 2014); in a number of situations, blood lead level monitoring has been used to show lower lead levels in children following implementation of such interventions (Brown et al., 2010; Bouftini et al., 2015).

The relationship between many metals and antioxidant enzymes and the role of the oxidative stress and inflammation pathways need to be further explored (Maouche et al., 2015). Molecular mechanisms of how oxidative stress acts as a driver for organ dysfunction and the impact of antioxidants to mediate the potential toxic effect of various metal exposures will be important research areas to continue to explore (Samir and Aref, 2011). As one example, strategies to prevent konzo have successfully included dietary supplementation with trace elements, preferentially those with antioxidant and cyanide-scavenging properties (Bumoko et al., 2015).

Research on the relationship between heavy metals and many disease outcomes is in preliminary stages in African studies and elsewhere. Other associations between heavy metals and some diseases have been established in predominantly European populations but have not been extensively studied in the African context. Reported associations of metals with autism, respiratory disease, and other health outcomes have been inconsistent and will require additional exploration. The impact of other nutrients in fish modifying methylmercury neurotoxicity is also an ongoing source of investigation (Lynch et al., 2011).

## PESTICIDES

### Health Outcomes

Pesticides, particularly the insecticide DDT and its breakdown product dichlorodiphenyldichloroethylene (DDE) and other endocrine-disrupting compounds, have been associated with numerous reproductive outcomes including male infertility, impaired semen quality, increased sperm defects, anogenital distance, mean penile length in baby boys, various urogenital malformations, and spontaneous miscarriages and infant deaths (Lubick, 2010; Naidoo et al., 2010; El-Helaly et al., 2011; English et al., 2012; Abarikwu, 2013; El Kholy et al., 2013; Bornman et al., 2017). One recent paper suggested decreased ovarian reserve associated with exposure to pyrethroid pesticides (Whitworth et al., 2015). Emerging evidence suggests that many endocrine-disrupting pesticides have effects on cardiometabolic outcomes (Azandjeme et al., 2013). For example, DDT concentration has been consistently and positively associated with body composition and body weight in young girls, and DDT and DDE were found to be associated with elevated risk of hypertensive disorders in pregnancy (Coker et al., 2018; Murray et al., 2018), while chronic exposure of non-diabetic farmers to organophosphorus malathion pesticides appears to induce insulin resistance (Raafat et al., 2012). One study examined a variety of biochemical effects of pesticides including hematological profile, lipid parameters, serum markers of nephrotoxicity and hepatotoxicity, as well as the activities of butyryl cholinesterase (BChE), acetylcholinesterase (AChE), and thiolactonase-paraoxonase (PON). The study concluded that long-term exposure to pesticides may play an important role in the development of vascular diseases *via* metabolic disorders of lipoproteins, lipid peroxidation and oxidative stress, inhibition of BChE, and decrease in thiolactonase-PON levels (Wafa et al., 2013).

Neurological outcomes were the most commonly associated health outcomes reported for cumulative exposure to both organophosphorus and pyrethroid compounds. Pesticide applicators and farm workers (including adolescent and child workers) exposed to these compounds exhibit neurological/neurobehavioral symptoms, deficits in neurobehavior performance tests, and neuromuscular disorders. These symptoms are often associated with greater inhibition of serum BChE and acetylcholinesterase activity (effect biomarkers often associated with neurotoxicity) and with cumulative TCPy, a biomarker of the organophosphorus pesticide chlorpyrifos (Sosan et al., 2010; Khan et al., 2014; Rohlman et al., 2014; Singleton et al., 2015; Manyilizu et al., 2016; Rohlman et al., 2016; Ismail et al., 2017b; Negatu et al., 2018). Some evidence for possible neurodevelopmental effects related to DDT in children has also been suggested (Osunkentan and Evans, 2015). Some associations were found between pesticide exposure and increased risks of various cancer outcomes, including bladder cancer, breast cancer, colorectal cancer, non-Hodgkin's lymphoma, and hepatocellular carcinoma (Lo et al., 2010; Awadelkarim et al., 2012; Amr et al., 2015; Arrebola et al., 2015; VoPham et al., 2017). Respiratory outcomes were also commonly associated with both cumulative and acute pesticide exposure, including associations with idiopathic pulmonary fibrosis, decreased lung function/increased wheeze, lower airway inflammation, chronic cough, and asthma (Awadalla et al., 2012; Callahan et al., 2014; Ndlovu et al., 2014; Okonya and Kroschel, 2015; Mamane et al., 2016; Quansah et al., 2016; Sankoh et al., 2016). Interestingly, a novel Hirmi Valley liver disease was identified in recent decades in Ethiopia, which may be partially caused by co-exposure to acetyllycopsamine and DDT (Robinson et al., 2014).
Perhaps most striking is the substantial literature on acute pesticide poisoning, both accidental and intentional, with adolescents intent on suicide (generally with the use of organophosphorus compounds and carbamates) contributing to an alarming increase in recent years (Balme et al., 2012; Azab et al., 2016; da Silva et al., 2016). In one study looking at acute pesticide poisoning in Kampala hospitals, 63% of cases of acute pesticide poisoning were intentional (Ssemugabo et al., 2017). The most common symptoms associated with accidental acute pesticide poisoning included skin and eye irritation, headaches, vomiting, nausea, chest pain, respiratory disorders, and blurred vision (Karunamoorthi et al., 2012; Okonya and Kroschel, 2015; da Silva et al., 2016; Sankoh et al., 2016; Manyilizu et al., 2017; Ssemugabo et al., 2017).

### Exposures Measured

Many of the reviewed studies evaluated chronic pesticide exposure and alteration in serum enzymes associated with detoxification of pesticides, particularly inhibition of butyryl cholinesterase activity (Araoud et al., 2010; Araoud et al., 2011; Araoud et al., 2012). Biomarkers of exposure to the organophosphorus pesticides chlorpyrifos (CPF) and profenofos (PFF) were evaluated in some studies by measuring urinary levels of 3,5,6-trichloro-2-pyridinol (TCPy), a specific CPF metabolite, and 4-bromo-2-chlorophenol (BCP), a specific PFF metabolite (Singleton et al., 2015). Inhibition of blood butyryl cholinesterase (BChE) and acetylcholinesterase (AChE) activities are effect biomarkers that were also evaluated in several of the reviewed studies (Ismail et al., 2010; Khan et al., 2014; Rohlman et al., 2014; Singleton et al., 2015; Rohlman et al., 2016; Ismail et al., 2017a; Ismail et al., 2017b). DDE/DDT was often assayed using ELISA (Bimenya et al., 2010).

### At Risk Populations

The *in utero* and early childhood effects of various pesticides and their impact on long-term health highlight early life as a key susceptible time window for pesticide exposure. Adolescents working seasonally or during certain periods on farms may have a higher risk of neurotoxic effects of pesticide exposure because of their rapidly developing brains and bodies (Ismail et al., 2010; Ismail et al., 2017a; Ismail et al., 2017b). Because of the high morbidity and mortality associated with childhood and adolescent poisoning with pesticides (sometimes intentional), targeted prevention initiatives should be a high priority (Balme et al., 2010; Balme et al., 2012).

### Research/Data Gaps

The health effects of many pesticides have not been as extensively studied in African countries and may have different etiologies and patterns of exposure compared to other parts of the world. For example, Sudan is experiencing a rapidly increasing cancer incidence, but little is known about tumor subtypes, epidemiology, or genetic or environmental cancer risk factors there or in other African countries (Awadelkarim et al., 2012).

Many of the reported agricultural pesticide studies in Africa were limited by exposure assessment methods (with many relying heavily on questionnaires alone to assess pesticide exposure and health risks). Future research could focus on improved pesticide exposure assessment methods, potentially incorporating multiple approaches and longitudinal studies to capture seasonal effects (VoPham et al., 2017). However, many opportunities exist now for comprehensive interventions to reduce both exposure and health risks associated with pesticide applications for both acute and cumulative exposures. In rural Tanzania, 93% of farmers reported past lifetime pesticide poisoning (Lekei et al., 2014). Several reports have demonstrated acute pesticide poisoning to be associated with behaviors including lack of protective clothing, poor pesticide handling, not washing vegetables before eating, and nozzle sucking (Magauzi et al., 2011; Oesterlund et al., 2014; Mekonen et al., 2015; da Silva et al., 2016; Sankoh et al., 2016; Manyilizu et al., 2017). One study from Sierra Leone reported that most farmers had no knowledge about the safe handling of pesticides, as 71% of them had never received any form of safety training (Sankoh et al., 2016). Comprehensive training, use of protective safety gear and clothing, and safe handling practices may substantially reduce farmers' health risks. In addition, given that chronic exposure to pesticides appears to affect several biochemical parameters, biomonitoring of effects in agricultural workers might be a useful way to assess the individual risk of handling pesticides. For example, BChE activity appears to be a useful indicator for monitoring workers chronically exposed to pesticides, as it is indicative of adverse effects of pesticides in agricultural workers and might detect these effects before adverse clinical health effects occur (Araoud et al., 2011).

Important data are still needed to help policy makers perform risk-benefit analyses of the use of DDT and other pesticides in areas of Africa most heavily impacted by malaria (Thompson et al., 2018). Indoor residual spraying with a variety of insecticides is associated with a substantially decreased risk of developing malaria (Kigozi et al., 2012; Loha et al., 2012), and a recent study in South Africa reported DDT to be the most effective for malaria control while acknowledging its detrimental health effects. Alternative prevention methods for controlling malaria are needed, as well as more studies illustrating the long-term impacts of DDT on health (Hlongwana et al., 2013).

## DIETARY MOLD

### Health Outcomes

Mycotoxins, particularly aflatoxin and fumonisins, are natural toxins that many people in Africa are exposed to because they contaminate the staple diet of groundnuts, maize, and other cereals (Darwish et al., 2014). Aflatoxin in particular, which is produced by the fungi *Aspergillus flavus* and *Aspergillus parasiticus* (Afum et al., 2016), is established as a cause of cirrhosis, human liver cancer (hepatocellular carcinoma, HCC), and growth faltering (perhaps due to micronutrient deficiencies) in young children (Obuseh et al., 2011; Bosetti et al., 2014; Shirima et al., 2015; Smith et al., 2015; Wirth et al., 2017). Adverse birth outcomes and anemia in pregnant women and acute aflatoxin poisoning in Africa are also concerns (Kimanya et al., 2010; Shuaib et al., 2010a; Shuaib et al., 2010b; Wild and Gong, 2010; Khlangwiset et al., 2011; Hoffmann et al., 2015). Several reports have investigated possible impaired semen quality (infertility) in men associated with aflatoxin (Abarikwu, 2013; Eze and Okonofua, 2015). There is a potential association of zearalenone (a non-steroidal estrogenic mycotoxin) with breast cancer risk (Belhassen et al., 2015). Ergotism has been associated with several species of *Claviceps* that are found in rye and other cereal grains (Belser-Ehrlich et al., 2013). Fumonisin B1 is a mycotoxin produced by *Fusarium* spp. molds, and it has been linked with primary liver cancer and esophageal cancer (Domijan, 2012). Fumonisins have also been associated with neural tube defects (Wild and Gong, 2010). Aflatoxin and other mycotoxins have been linked to possible neurotoxicological outcomes as well as chronic hepatomegaly (Gong et al., 2012). Ochratoxin A, a mycotoxin produced by several *Aspergillus* and *Penicillium* species, is associated with chronic interstitial nephropathy (Hmaissia Khlifa et al., 2012; Gil-Serna et al., 2018). Contaminated peanuts have been associated recently with growth faltering (Mupunga et al., 2017).
Wheat handlers exposed to *A. flavus* may have elevated risks of liver cancer as well (Saad-Hussein et al., 2014). HIV-positive and HBV/HCV-positive individuals exposed to aflatoxin are at substantially increased disease risk due to the established synergistic action of aflatoxin with HIV and HBV/HCV infection (Kew, 2013).

### Exposures Measured

Aflatoxin has been established as a potent liver carcinogen working through a genotoxic mechanism involving metabolic activation to an epoxide, formation of DNA adducts and, in humans, modification of the p53 gene. Extensive mechanistic research combined with molecular epidemiology has enabled quantitative risk assessment for aflatoxin. Molecular biomarkers to quantify aflatoxin exposure in individuals were essential to link aflatoxin exposure with liver cancer risk. Biomarkers were validated in populations with high HCC incidence, including in The Gambia, West Africa (Wogan et al., 2012). The aflatoxin metabolite aflatoxin M1 (AFM1) and other mycotoxin metabolites have been measured in breast milk (Adejumo et al., 2013), while aflatoxin-albumin adducts (AF-alb) and AFB1-lysine have typically been measured in blood plasma or serum through a variety of methods (see **Table 2**) (Mitchell et al., 2017; McMillan et al., 2018). Correlations between urinary AFM1 and AF-alb have been established and suggest that urinary AFM1 is a good biomarker of aflatoxin exposure. AFM1 appears to measure shorter-term exposure to aflatoxin, whereas AF-alb measures longer-term exposure (Chen et al., 2018). Serum levels of ochratoxin A might also serve as a useful biomarker of HCC risk (Matsuda et al., 2013).

### At Risk Populations

Young children are particularly vulnerable to mycotoxins, with growth faltering a key concern, because fetal and early postnatal growth and development appear to be affected and because aflatoxin is known to cross the placental barrier (Castelino et al., 2014; Castelino et al., 2015). Interventions should focus on reducing mold exposures during critical periods of fetal and infant development, particularly for nursing infants who may receive contaminated milk (Adejumo et al., 2013). HIV-positive and HBV/HCV-positive individuals are also at-risk populations for the health effects related to aflatoxin exposure (Kew, 2013). Agricultural workers and rural populations, particularly subsistence farming communities, are important at-risk populations as well.

### Research/Data Gaps

Mycotoxin risk management has been successful in West Africa and other African countries, and this has substantially reduced disease attributable to aflatoxin (Liu et al., 2012; McGlynn et al., 2015). Many intervention/prevention efforts (including postharvest storage measures) are now underway to reduce exposure to highly toxic and carcinogenic contaminants in staple diets in Africa, especially aflatoxin and fumonisins, which people are exposed to daily through grain and cereal staples in their diet. Aflatoxin biomarkers have also been used to show that primary prevention to reduce aflatoxin exposure can be achieved by low-technology approaches at the subsistence farm level in sub-Saharan Africa (Wogan et al., 2012). Daily urinary AFM1 levels have been shown to be useful as a biomarker of internal aflatoxin B1 exposure in short-term intervention trials to determine efficacy of interventions (Mitchell et al., 2013). Further application of knowledge to practice is currently underway with numerous intervention/prevention studies, clinical trials, and education (Wild and Gong, 2010; Hoffmann et al., 2015; Saleh et al., 2015). The comprehensive approach used to create many successful preventive interventions to reduce health risks associated with aflatoxin is a model for the development, validation, and application of biomarkers for other environmental exposures (Wogan et al., 2012).

#### TABLE 2 | Environmental exposures and health outcomes evaluated in African populations.

† Included in the table as an example study measuring exposure in humans but not health outcomes. We did not retrieve any studies specific to PFAS in African populations measuring both exposures and health endpoints in humans.

There is evidence that maternal exposure to aflatoxin during the early stages of pregnancy is associated with differential DNA methylation patterns in infants, including in genes related to growth and immune function, but how mycotoxin exposure in embryonic and fetal development may influence later disease risk needs to be explored (Hernandez-Vargas et al., 2015). The association between aflatoxin exposure and alteration in immune responses observed in humans suggests that aflatoxin could suppress the immune system and work synergistically with HIV to increase disease severity and progression to AIDS, but in general, the neurotoxicological and immunological/immunodepression aspects are not well understood (Jolly et al., 2015). While studies have shown synergism between aflatoxin and HBV in causing HCC, much less is known about whether aflatoxin and HCV synergize similarly (Palliyaguru and Wu, 2013). The relationship between HIV transmission frequency and fumonisin contamination also needs to be explored (Williams et al., 2010). Childhood immunization for hepatitis B in many West African countries is still lagging behind many other countries, and this vaccination alone could substantially impact health risks (Ladep et al., 2014). Some findings of a significant decrease in vitamin A associated with AF-alb suggest that aflatoxin exposure compromises the micronutrient status of people who are immunocompromised, including people living with HIV (Obuseh et al., 2011). The interaction between aflatoxin and micronutrient deficiencies warrants more investigation (Watson et al., 2016; Watson et al., 2017).

## INDOOR MOLD

### Health Outcomes

Indoor fungal-related outbreaks were measured and found to be associated with mucormycosis, endophthalmitis, aspergillosis, as well as asthma exacerbation and other infections in a variety of Sub-Saharan African samples (Gharamah et al., 2012; El-Mahallawy et al., 2016).

### Exposures Measured

Indoor mold was primarily measured as fungal spores in airborne samples, nasal swabs, and sputum samples (Niare-Doumbo et al., 2014; Diongue et al., 2015) (**Table 2**).

### At Risk Populations

At-risk populations that were examined included pediatric wards with leukemia patients and other immunocompromised or allergic patients, oncology wards, and ophthalmology operating rooms (Gharamah et al., 2012; Niare-Doumbo et al., 2014; Gheith et al., 2015). Occupational exposure to aflatoxin was found in textile workers and was associated with liver tumor biomarkers (Saad-Hussein et al., 2013). Airborne *Aspergillus* was also associated with higher serum aflatoxin B1 and several liver enzymes among workers handling wheat flour (Saad-Hussein et al., 2016), suggesting that workers in several occupations may be at increased risk for indoor mold exposures.

### Research/Data Gaps

Different sensitization rates have been observed in different classes of patients. The highest indoor mold counts in many studies were often associated with the rainy season, but more research exploring sensitization rates and seasonal variations is needed (Hasnain et al., 2012). Protective gear and safety measures to reduce exposure in some occupations are needed.

## PFAS

The literature describing PFAS-related health outcomes in Africans was extremely limited. Although our review did not find research articles evaluating PFAS and health outcomes in African populations, there has been increasing attention to PFAS exposure, including studies measuring PFAS in non-humans [e.g. crocodiles, fish (Ahrens et al., 2016)]. One study evaluated PFAS in maternal serum and cord blood in South Africa (Hanssen et al., 2010) but did not evaluate specific health endpoints in the study population where PFAS was measured.

### Health Outcomes

Ahrens et al. (2016) described a risk assessment strategy for evaluating potential human health outcomes related to the PFAS levels in different compartments of Ethiopia's largest lake, Lake Tana. Their findings do not indicate any elevated health risks, but the authors note the potential for harmful effects with increasing levels over time.

### Exposures Measured

Across the reviewed studies, perfluoroalkyl acids (PFAAs) were measured in water, sediment, and fish in Lake Tana, Ethiopia (Ahrens et al., 2016), in tilapia in South Africa (Bangma et al., 2017), and in wastewater and sludge from selected wastewater treatment plants in Kenya (Chirikona et al., 2015). Another study measured perfluorinated compounds (PFCs) in maternal serum and cord blood of South African mother-infant pairs. The authors did not report specific health outcomes but did note that the median maternal PFOS concentration was lower than has been reported in other studies, whereas the PFOA concentration was the same. They suggested that different exposure pathways (and sources) exist in this population compared to western-style study populations (Hanssen et al., 2010).

### At Risk Populations

Individuals with high fish consumption (e.g., those living near the lake or depending on the lake for food or occupation) are at higher risk of these exposures. Although the studies in our review did not evaluate specific health outcomes, one study examined PFAS levels between 1978 and 2001 in a study population in Southern Sweden that included women from countries of origin within and outside of Sweden, including Africa (Ode et al., 2013). PFOS levels increased over time, whereas PFOA and PFNA levels were unchanged over this period. The study observed higher levels in women with Sweden as the country of origin compared to women from the Middle East, North Africa, and sub-Saharan Africa.

### Research/Data Gaps

More research incorporating exposures and health endpoints measured in the same study population in Africa is needed. This gap may reflect potentially lower levels in African populations compared to U.S. and European populations where PFAS health studies have focused. However, as industrialization, urbanization, and globalization contribute to the growing ubiquity of many environmental chemical exposures, we anticipate PFAS exposures may increase in African populations.

## ELECTRONIC WASTE

### Health Outcomes

A variety of crude recycling operations in developing nations, including in Africa, have been reported to lead to multiple health risks. In many cases, e-waste workers are exposed to highly contaminated fumes due to burning practices (Akormedi et al., 2013). Self-reported hearing difficulties and stress associated with potential cardiovascular disease symptoms (including elevated blood pressure levels) have been reported in electronic waste recycling workers (Were et al., 2014; Burns et al., 2016). Workers burning e-waste products have been reported as having very high blood lead levels, and noise exposures often exceed recommended occupational and community noise exposure limits (Burns et al., 2016). Workers have reported moderate to prominent levels of perceived stress as measured *via* Cohen's Perceived Stress Scale (Burns et al., 2016). Higher levels of a few chemicals related to e-waste recycling have also been associated with increased cancer risks (Obiri et al., 2016a).

### Exposures Measured

Across the e-waste studies reviewed, levels of polyaromatic hydrocarbons (PAHs), polychlorinated biphenyls (PCBs), and polybrominated diphenyl ethers (PBDEs) were typically analyzed using gas chromatography/spectrophotometry. Heavy metals in soil and plant samples were measured using atomic absorption spectrophotometry, and DNA damage was assayed in human peripheral blood lymphocytes using an alkaline comet assay (Alabi et al., 2012). Exposure *via* oral and dermal contact to lead, cadmium, chromium, copper, arsenic, tin, zinc, and cobalt in bottom ash and soil was assessed using random sampling techniques and standard methods for chemical analysis prescribed by the American Water Works Association (Obiri et al., 2016a) (**Table 2**).

### At Risk Populations

In general, e-waste workers in many African countries are a vulnerable at-risk population that may have a limited social safety net or legal protections. The chemical exposures reported in e-waste studies are relevant not just to e-waste workers but also to traders and residents, including children living in neighboring areas.

### Research/Data Gaps

Exposures related to e-waste recycling are understudied, but the limited reported studies suggest clear health risks associated with this activity. Cleaner technologies and protective gear for workers as well as education efforts are needed. Several reports recognized the complicated e-waste infrastructure system in some African countries and the need to understand all stakeholders involved (Amankwaa et al., 2017). One review suggested approaching the e-waste crisis in sub-Saharan Africa with an ongoing health impact assessment that would address the health, environmental, and social aspects of the issue and in which all the steps of the assessment are performed with input from local communities (Tetteh and Lengel, 2017).

## FLAME RETARDANTS

### Health Outcomes

Several recent African studies have quantified concentrations of a variety of flame retardants and attempted to associate exposure levels with different health outcomes. Elevated concentrations of polybrominated diphenyl ethers (PBDEs), polychlorinated biphenyls (PCBs), and some organochlorine pesticides (OCPs) were not found in colorectal cancer patients in Egypt, compared to controls (Abou-Elwafa Abdallah et al., 2017). Potential health concerns related to estimated lifetime cancer risk and other risks were suggested for levels of some organochlorine pesticides observed in soil samples (Sun et al., 2016), as well as DDT and PCBs from dietary fish exposure in one study (Ben Ameur et al., 2013). However, other studies did not show levels of flame retardants exceeding safety guidelines from dietary fish intake (Asante et al., 2013; El Megdiche et al., 2017). PCBs, as well as brominated flame retardants such as PBDEs, hexabromocyclododecanes (HBCDs), hexabromobenzene (HBB), 2,3-dibromopropyl-2,4,6-tribromophenyl ether (DPTE), pentabromoethylbenzene (PBEB), and 2,3,4,5,6-pentabromotoluene (PBT), were also measured in breast milk in several studies, and the levels raised concern because they were unexpectedly high (with estimated hazard quotient values exceeding the threshold of 1 or US EPA reference doses being exceeded) (Asante et al., 2011; Muller et al., 2016).
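To make the hazard quotient threshold mentioned above concrete, the following minimal Python sketch shows the usual screening calculation: an estimated daily intake divided by a reference dose, with values above 1 flagged as a potential concern. The function names, the simplified breast milk intake formula, and every numeric input are hypothetical illustrations and are not taken from the cited studies, which may have used different intake models and reference values.

```python
def estimated_daily_intake(conc_ng_per_g_lipid, milk_g_per_day,
                           lipid_fraction, body_weight_kg):
    """Return an infant's estimated daily intake in ng per kg body weight per day.

    Simplified, assumed intake model: contaminant concentration in milk lipid
    multiplied by the daily mass of lipid consumed, scaled by body weight.
    """
    if body_weight_kg <= 0:
        raise ValueError("body weight must be positive")
    return conc_ng_per_g_lipid * milk_g_per_day * lipid_fraction / body_weight_kg


def hazard_quotient(intake, reference_dose):
    """Return the hazard quotient: intake divided by a reference dose (same units)."""
    if reference_dose <= 0:
        raise ValueError("reference dose must be positive")
    return intake / reference_dose


def exceeds_threshold(hq, threshold=1.0):
    """Return True when the hazard quotient is above the screening threshold."""
    return hq > threshold


if __name__ == "__main__":
    # All inputs below are hypothetical placeholders, not measured values.
    edi = estimated_daily_intake(conc_ng_per_g_lipid=50.0, milk_g_per_day=700.0,
                                 lipid_fraction=0.04, body_weight_kg=5.0)
    hq = hazard_quotient(edi, reference_dose=100.0)
    print(f"EDI = {edi:.1f} ng/kg/day, HQ = {hq:.2f}, "
          f"exceeds threshold: {exceeds_threshold(hq)}")
```

A hazard quotient is a screening ratio rather than a probability of harm, so a value above 1 indicates that intake exceeds the chosen reference dose and merits closer evaluation.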

### Exposures Measured

The concentrations of polybrominated diphenyl ethers (PBDEs) were commonly measured in the reviewed studies by using gas chromatography electron impact ionization mass spectrometry (Akortia et al., 2017).

### Vulnerable Populations

Potential health risks for children, particularly nursing infants, for a variety of flame retardants were observed. PCBs in dirty oils and obsolete equipment as well as new sources of DDT for malaria control in some countries in Africa were noted as potential sources of exposure (Asante et al., 2011; Sun et al., 2016).

## PHENOLS

Only four studies met the inclusion criteria for this review of measuring phenols in relation to health outcomes in Africa (Motsoeneng and Dalvie, 2015; Muller et al., 2016; Abou-Elwafa Abdallah et al., 2017; Kumar et al., 2017), one of which covered the topic in a recent review of environmental factors and global estimates of preterm birth (Kumar et al., 2017). Abou-Elwafa Abdallah et al. (2017) measured polychlorinated biphenyls (PCBs), some organochlorine pesticides (OCPs), as well as polybrominated diphenyl ethers (PBDEs, see flame retardants section) in serum of study participants in Egypt. Notably, concentrations of these chemicals were much lower in this Egyptian study population compared to other published concentrations in populations around the world.

### Health Outcomes

The health outcomes evaluated included colorectal cancer (Abou-Elwafa Abdallah et al., 2017), preterm birth (Kumar et al., 2017), birth weight and birth length (Muller et al., 2016), and neurological endpoints such as difficulty with buttoning, reading, or writing notes (Motsoeneng and Dalvie, 2015).

### Exposures Measured

Across the studies, phenols were measured in serum, breast milk, and urine. Some of these studies also measured PCBs and OCPs and are discussed in greater detail in other sections.

### At Risk Populations

Similar to other chemical exposure categories, high risk populations include pregnant women, nursing infants (early life exposures in general), and young children.

### Research/Data Gaps

The limited publications describing phenols and health outcomes in Africa likely reflect the limited data on phenol use, distribution, and concentrations in human urine, serum, or blood. Despite the variability in the use of these compounds in some regions, the lipophilic and persistent characteristics of some chemicals enable bioaccumulation in the food chain. Most are listed as persistent organic pollutants under the United Nations Environment Programme (UNEP) Stockholm Convention (UNEP, 2009) (https://www.wipo.int/edocs/lexdocs/treaties/en/unep-pop/trt\_unep\_pop\_2.pdf). There are very limited data for Africa evaluating health outcomes related to phenols. However, several studies document the existence of phenols in human samples, such as methylated polybrominated diphenyl ethers in human milk from Bizerte, Tunisia (Ben Hassine et al., 2015), dust exposure in Egypt (Hassan and Shoeib, 2015), and urinary bisphenol A (not persistent) concentrations in girls in rural and urban Egypt (Nahar et al., 2012). The levels in Egypt were lower than those in age-matched American girls in NHANES, but the authors noted associations with food storage in plastic containers, which may change over time in some African regions.

## PHTHALATES

### Health Outcomes

Data on the health effects of phthalates in Africa were also extremely limited: only three articles retrieved in our literature search evaluated the impact of exposure to phthalates and any health outcomes in an African study population (Colacino et al., 2011; Kumar et al., 2017; Van Zijl et al., 2017). Adverse health outcomes evaluated in these articles were preterm birth (Kumar et al., 2017) and estrogenic activity (Van Zijl et al., 2017). The third study focused on sources of exposure to phthalates among premenstrual girls in Egypt, reporting BMI, waist and hip circumference, and other anthropometric characteristics, and comparing rural and urban study participants. The authors also compared the phthalate levels in this Egyptian population to those of age-matched girls in U.S. NHANES data, identifying key sources of exposure (Colacino et al., 2011). Storage of food in plastic containers was a statistically significant predictor of monoisobutyl phthalate (MiBP) measured in urine of premenstrual girls, suggesting an important dietary route of exposure. The urinary measurements of phthalates were similar between the U.S. and Egyptian age-matched girls (Colacino et al., 2011). Kumar et al. (2017) reviewed potential contributing factors to preterm birth and suggested phthalates should be evaluated more extensively in Africa.

### Exposures Measured

Phthalates were measured in urine using enzymatic deconjugation of the metabolites from their glucuronidated form, solid-phase extraction, separation with high performance liquid chromatography, and detection by isotope-dilution tandem mass spectrometry as described previously (Silva et al., 2007; Van Zijl et al., 2017). Estrogenic activity was identified in drinking water from Pretoria and Cape Town that also contained detectable levels of estrogens, bisphenol-A, and phthalates. No harmful effects from these were detected in the study population; the health risk assessment revealed acceptable health and carcinogenic risks associated with the consumption of distribution point water.

### At Risk Populations

Early life exposure is an important consideration in this group, impacting pregnant women and young children.

### Research/Data Gaps

Much more work is needed to evaluate the health implications from exposure to phthalates in the African setting, as exposures may increase over time.

## G X E AND RELATED INTEGRATION OF GENOMIC AND ENVIRONMENTAL EXPOSURES

Only 23 of the identified studies in our literature review considered both genomic and environmental factors related to health outcomes in Africa. All of these articles are listed in **Table 3**. Although effects of *PON1* genotype on the effects of exposure to the organophosphorus pesticide chlorpyrifos (CPF) in Egyptian agricultural workers were found to be minimal (Ellison et al., 2012), several other studies reported significant modification of exposure-related risks by genotype. The *GSTP1* genotype appeared to modify the effects of ambient air pollutants PM10 and SO2 on lung function in South African children (Reddy et al., 2012). Genetic polymorphisms in *NAPH* and *SOD2* may modulate pesticide-associated risk for bladder cancer (Amr et al., 2015). The *TNF*-alpha 308 polymorphisms were associated with increased effects on lung function for several pollutants (SO2 and NO2) (Makamure et al., 2016). *PON1* 192RR and *CYP2D6* 1934A alleles were found to potentially alter susceptibility to organophosphate chronic toxicity in Egyptian agricultural workers as well (Tawfik Khattab et al., 2016). *ERCC3* and *ERCC2* polymorphisms modify the effect of cadmium exposure on nasal polyposis (Khlifi et al., 2017). Air pollution's effect on cardiovascular risk factors may be modulated by the *APOA5* 1131 polymorphism (Lin et al., 2017b). The *CD14* CT/TT genotype appears to be protective against the effects of increased exposure to some ambient air pollutants (Makamure et al., 2017). DNA variants in *NAT2*, *PON1*, and *GSTM1* may also modify organophosphate neurotoxicity (Glass et al., 2018).
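As a hedged illustration of what genotype modification of an exposure effect means statistically, one common formulation adds a product term to a regression model. This is an assumption made for exposition and is not necessarily the model used in any of the studies cited above; the symbols $Y$, $G$, $E$, and the $\beta$ coefficients are introduced only for this sketch.

```latex
\operatorname{logit} \Pr(Y = 1 \mid G, E) = \beta_0 + \beta_G\, G + \beta_E\, E + \beta_{GE}\,(G \times E)
```

Here $Y$ is a binary health outcome, $G$ codes the genotype (for example, carrier versus non-carrier), and $E$ is the exposure level. A nonzero $\beta_{GE}$ indicates that the association between exposure and outcome differs by genotype on the log-odds scale, whereas $\beta_{GE} = 0$ corresponds to no effect modification on that scale.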

A variety of other DNA and genomic biomarkers were also explored in relation to the health risks of various exposures. Aflatoxin adducts are known to be carcinogenic and mutagenic, have been associated with induction of the arginine to serine mutation in p53, and act synergistically with the hepatitis B virus to cause liver cancer (Kew, 2013). Repeated exposure to alpha-CYP pesticides appears to lead to p53 gene mutations (El Okda et al., 2017). A genotoxic impact was also reported for individuals occupationally exposed to antimony trioxide, with DNA damage detected in the form of increased apurinic/apyrimidinic sites (El Shanawany et al., 2017). Interindividual variation in adduct levels associated with benzene and PAHs may reflect genetic susceptibility as well (Ayi-Fanou et al., 2011). One review summarized a variety of studies looking at various genotoxic biomarkers (including cytogenetic endpoints and chromosomal aberrations), DNA damage markers (including comet assay and urinary 8-hydroxydeoxyguanosine), and genomic biomarkers (including leukocyte telomere length and gene expression) (DeMarini, 2013). These markers were often able to distinguish traffic-exposed individuals from controls, but only one of the 63 papers from this review was from an African-based study (DeMarini, 2013). Prenatal exposure to air pollution and HIV status of mothers appeared to lead to differential methylation in infants, particularly in certain biological pathways related to metabolic processes and viral regulation (Goodrich et al., 2016). Only one study evaluated epigenome-wide DNA methylation, and this study found differential methylation in genes related to growth and immune function for infants of aflatoxin-exposed mothers (Hernandez-Vargas et al., 2015). Only one study explored the possible effects of a particular exposure on the microbiome, and this report described changes in lung microbiome with high levels of black carbon particulates (Rylance et al., 2016).
No genome-wide association studies (GWAS), whole genome sequencing studies, or RNA sequencing studies were identified in this literature review.

## DISCUSSION

In this review, we summarize environmental health research in Africa covering the last decade, highlighting exposures unique to Africa with important health implications. Substantial progress has been made in identifying a wide range of health effects related to hazardous environmental exposures. In general, indoor and ambient air pollution across Africa was well characterized, and health impacts are comparable to what has been described in other regions around the world. Increased industrialization, traffic, and biomass fuel burning in parts of Africa will continue to contribute to substantial air pollution. Many industrial metals contaminate the environment in parts of Africa, with health effects comparable to those observed elsewhere, particularly cancer and neurological outcomes. Several reproductive outcome associations with heavy metals may be of particular interest in the African context. For example, the high levels of preeclampsia described in several African countries and the unusually high incidence of intrauterine growth retardation in Egypt may possibly be driven by toxic metal concentrations (El-Helaly et al., 2011; Ikechukwu et al., 2012; Motawei et al., 2013; El-Baz et al., 2015; Elongi Moyene et al., 2016). Acute lead poisoning in children is an urgent ongoing issue in many African countries, and prevention of exposure among children is critical. A variety of pesticide studies reported reproductive, neurological, respiratory, and cancer outcomes, with one novel liver disease reportedly associated with DDT (Robinson et al., 2014). The acute pesticide poisoning of adolescents (some intentional) is alarming and may reflect ease of access to these chemicals on the African continent (Balme et al., 2012; Azab et al., 2016; da Silva et al., 2016; Ssemugabo et al., 2017).
Extensive mechanistic research combined with human studies over many years has allowed aflatoxin and other mycotoxins to be accurately measured and has facilitated prevention and intervention strategies. The literature on PFOS, flame retardants, phenols, e-waste, and phthalates remains extremely limited.

A variety of research gaps across multiple exposure categories were identified. The role of the immune system and inflammation, and how they interact with various exposures, is an area that warrants more research. The role of endocrine disrupting chemicals in general is evolving and expanding with studies around the world, and the metabolic impacts of this class of compounds, particularly for obesity, diabetes, and cardiovascular outcomes, will need to be further explored. As industrialization, urbanization, and globalization continue to impact the African continent, many emerging exposures, including PFOS, flame retardants, phenols, e-waste products, and phthalates, may increase over time and will need further study in Africa.

Key susceptible/at-risk populations were similar across multiple exposure categories and include pregnant women, children (particularly at the *in utero* and early childhood stages), workers in specific occupational settings (agricultural, mining, street vendors, taxi-motorbike drivers, waste workers, etc.), and people living near urban areas who may be more highly exposed to particulate air pollutants such as benzene and PAHs. Immunocompromised individuals (people with HIV or other infections, cancer patients, etc.) may be particularly vulnerable to the effects of toxicants. The combined effects of environmental exposures and infections need to be further examined in African studies.

In general, the limited number of African studies exploring any integration or interaction of genomic and environmental factors suggests a substantial research gap. Extremely limited epigenomics and other omics applications were reported. The impact of possible transgenerational effects of some exposures by epigenomic processes has yet to be examined (Kabasenche and Skinner, 2014). The exploration of the interaction of genetic and environmental factors for disease susceptibility may enable future preventive measures. For example, the potential for agricultural workers exposed to high levels of pesticides to be screened based on genotype would be a way to help target protective measures for high risk groups and reduce disease burden (Tawfik Khattab et al., 2016). A better understanding of the regulation of biosynthetic genes related to some mycotoxins may also lead to new ways to monitor the food chain for mycotoxin contamination (Gil-Serna et al., 2018). The genetic diversity in Africa, combined with unique exposures and co-morbidities, can lead to novel G x E findings that cannot be discovered elsewhere.

### Future Directions

This review did not represent a systematic analysis of all findings reported in the literature. The purpose was to provide a broad scope of environmental health, including many complex exposure categories. Future systematic reviews could be implemented, focusing on one exposure category, a single chemical, or a collection of chemicals. The greatest detail was provided for the G x E articles retrieved in this review, which, as has been noted, represented a critical research gap. Importantly, the WHO report in 2016 (Prüss-Ustün et al., 2016) stated that the current statistics related to many disease outcomes likely underestimate the true burden due to inadequate coverage in the literature, the challenge of capturing emerging risks, and the fact that many exposures take years to manifest as presentable symptoms or disease.

A number of exposures have received substantial research attention in Africa, which is encouraging, and some studies have provided unique insights that will allow further translational efforts to occur. Aflatoxin interventions and prevention efforts are a model for what could potentially be done with other exposures in a resource-limited setting. Some reports were limited by exposure assessment methods (perhaps relying too heavily on questionnaires to assess exposure and health risks). Leveraging resources such as the Children's Health Exposure Analysis Resource (CHEAR) or Human Health Exposure Analysis Resource (HHEAR) (Balshaw et al., 2017) may enable critical gains in environmental exposure measurements in biospecimens collected in African studies. Increased environmental data in coordination with genomic infrastructure such as that in the H3Africa consortium offers a strong platform for building G x E research in Africa, although collaborations should not be limited to these resources alone.

#### TABLE 3 | G x E † and health outcomes evaluated in African populations.

† G x E: genomic and environmental factors integrated in some way and evaluated as risk factors for health outcomes. G x E does not necessarily imply a statistical interaction; rather, genomic and environmental risk factors were both evaluated in the study participants, enabling the assessment of G x E interactions or related strategies.

Another underrepresented area of research was geospatial methods and spatiotemporal modeling to evaluate health outcomes in African populations. The utilization of satellite data in combination with ground monitoring is challenged by inadequate coverage of ground monitoring in Africa. Involvement of data scientists and related experts is needed to leverage existing data to advance environmental health research in Africa. The application of these methods is increasingly important with ongoing and foreseeable changes in weather patterns, agriculture, industrial development, resource mining, drought, natural vegetation, and wildlife across Africa, all of which impact the habitats of vectors transmitting infectious diseases. Variability in nutrition, poverty, and infectious diseases that all impact immunity further emphasizes the importance of bolstering environmental health research capacity across the continent.

In the coming decade, we anticipate ongoing advancements in environmental health and genomics, in coordination rather than in parallel. Given the resource infrastructures within Africa and the growing global collaborations that consortia and bottom-up approaches make possible, the future for G x E research in Africa is promising.

### AUTHOR CONTRIBUTIONS

Conceived and designed the literature review: BJ, KM. Performed literature review and analysis of review results: SM. Wrote the paper: BJ, KM. Revised and approved the manuscript: BJ, KM, SM.

### ACKNOWLEDGMENTS

We thank Alyse Owoc and Bronwyn Cox from the NIEHS library who assisted with EndNote library organization tasks. We are grateful for the thoughtful manuscript review by our NIEHS colleagues Thad Schug and Harriett Kinyamu. We particularly thank the H3Africa consortium scientists, funders, staff, and study participants for stimulating the need for this review.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01166/full#supplementary-material

**Conflict of Interest:** Author SM was employed by the company Vista Technology Services.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2020 Joubert, Mantooth and McAllister. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Novel and Known Gene-Smoking Interactions With cIMT Identified as Potential Drivers for Atherosclerosis Risk in West-African Populations of the AWI-Gen Study

Palwende Romuald Boua<sup>1,2,3</sup>\*, Jean-Tristan Brandenburg<sup>2</sup>, Ananyo Choudhury<sup>2</sup>, Scott Hazelhurst<sup>2,4</sup>, Dhriti Sengupta<sup>2</sup>, Godfred Agongo<sup>2,3,5</sup>, Engelbert A. Nonterah<sup>5,6</sup>, Abraham R. Oduro<sup>5</sup>, Halidou Tinto<sup>1</sup>, Christopher G. Mathew<sup>2,7</sup>, Hermann Sorgho<sup>1</sup> and Michèle Ramsay<sup>2,3</sup>

#### Edited by:

Mayowa Ojo Owolabi, University of Ibadan, Nigeria

#### Reviewed by:

Liyong Wang, University of Miami, United States

Linda Polfus, University of Southern California, United States

\*Correspondence:

Palwende Romuald Boua romyboua@gmail.com

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 27 February 2019; Accepted: 10 December 2019; Published: 07 February 2020

#### Citation:

Boua PR, Brandenburg J-T, Choudhury A, Hazelhurst S, Sengupta D, Agongo G, Nonterah EA, Oduro AR, Tinto H, Mathew CG, Sorgho H and Ramsay M (2020) Novel and Known Gene-Smoking Interactions With cIMT Identified as Potential Drivers for Atherosclerosis Risk in West-African Populations of the AWI-Gen Study. Front. Genet. 10:1354. doi: 10.3389/fgene.2019.01354

<sup>1</sup> Clinical Research Unit of Nanoro, Institut de Recherche en Sciences de la Santé, Nanoro, Burkina Faso, <sup>2</sup> Faculty of Health Sciences, Sydney Brenner Institute for Molecular Bioscience (SBIMB), University of the Witwatersrand, Johannesburg, South Africa, <sup>3</sup> Division of Human Genetics, National Health Laboratory Service and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa, <sup>4</sup> School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa, <sup>5</sup> Navrongo Health Research Centre, Ghana Health Service, Navrongo, Ghana, <sup>6</sup> Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, Netherlands, <sup>7</sup> Department of Medical and Molecular Genetics, Faculty of Life Sciences and Medicine, King's College London, London, United Kingdom

Introduction: Atherosclerosis is a key contributor to the burden of cardiovascular diseases (CVDs) and many epidemiological studies have reported on the effect of smoking on carotid intima-media thickness (cIMT) and its subsequent effect on CVD risk. Gene-environment interaction studies have contributed towards understanding some of the missing heritability of genome-wide association studies. Gene-smoking interactions on cIMT have been studied in non-African populations (European, Latino-American, and African American) but no comparable African research has been reported. Our aim was to investigate smoking-SNP interactions on cIMT in two West African populations by genome-wide analysis.

Materials and methods: Only male participants from Burkina Faso (Nanoro = 993) and Ghana (Navrongo = 783) were included, as smoking was extremely rare among women. Phenotype and genotype data underwent stringent QC and genotype imputation was performed using the Sanger African Imputation Panel. Smoking prevalence among men was 13.3% in Nanoro and 42.5% in Navrongo. We analyzed gene-smoking interactions with PLINK after adjusting for covariates: age and 6 PCs (Model 1); age, BMI, blood pressure, fasting glucose, cholesterol levels, MVPA, and 6 PCs (Model 2). All analyses were performed at site level and for the combined data set.

Results: In Nanoro, we identified new gene-smoking interaction variants for cIMT within the previously described RCBTB1 region (rs112017404, rs144170770, and rs4941649) (Model 1: p = 1.35E-07; Model 2: p = 3.08E-08). In the combined sample, two novel intergenic interacting variants were identified, rs1192824 in the regulatory region of TBC1D8 (p = 5.90E-09) and rs77461169 (p = 4.48E-06) located in an upstream region of open chromatin. In silico functional analysis suggests the involvement of genes implicated in biological processes related to cell or biological adhesion and regulatory processes in gene-smoking interactions with cIMT (as evidenced by chromatin interactions and eQTLs).

Discussion: This is the first gene-smoking interaction study for cIMT, as a risk factor for atherosclerosis, in sub-Saharan African populations. In addition to replicating previously known signals for RCBTB1, we identified two novel genomic regions (TBC1D8 and a region near BCHE) involved in this gene-environment interaction.

Keywords: GWIS, atherosclerosis, smoking, carotid intima-media thickness, gene-environment interactions

#### INTRODUCTION

During the last two decades, the burden of cardiovascular diseases (CVDs) has increased considerably, and low- and middle-income countries are now experiencing about 80% of the worldwide burden. Sub-Saharan Africa (SSA) is undergoing a health and demographic transition that has shifted the major causes of death from communicable and nutritional diseases to noncommunicable diseases (NCDs). The mean age of death attributable to CVDs in SSA in 2010 was 64.9 years (95% CI, 64.4–65.4) compared with 67.6–81.2 years for the rest of the world (Roth et al., 2015), making it one of the youngest affected populations globally. Populations of African descent have been under-represented in genomic studies, representing only about 3% of the participants worldwide used for genome-wide association studies up to 2016 (Popejoy and Fullerton, 2016). This is a gap that needs to be filled, considering that Africa is the continent with the highest genetic diversity owing to its deep evolutionary roots, and African genomes generally have lower linkage disequilibrium (LD). A previous study reported that African populations were more diverse and had significantly more genes and pathways involved in extreme allele frequency differences (EAFD) (Sulovari et al., 2017). The African genome is therefore highly relevant for the discovery of new genetic associations and a better understanding of human disease mechanisms (Tekola-Ayele and Rotimi, 2015; Choudhury et al., 2018; Martin et al., 2018).

Genetic understanding of complex traits has developed immensely over the past decade but remains hampered by the fact that genetic variants still explain only a fraction of the heritability of a trait, a shortfall often referred to as the missing heritability. A contributor to this phenotypic variance is gene-environment interaction (GxE) (Duncan et al., 2014). GxE can be defined broadly as the interplay between the product of a genetic variant and an environmental factor as they affect a specific trait. GxE therefore refers to modification, by an environmental factor, of the effect of a genetic variant on a phenotypic trait. The phenotypic flexibility resulting from adjustments due to GxE could determine or modulate health or disease by modulating the adverse effects of a risk allele, or by exacerbating the genotype-phenotype relationship to increase risk. Environmental stimuli, acting over hundreds of generations, have promoted adaptation that is reflected in allele frequency shifts observed in current populations for traits and disease risk.

Identifying GxE represents a cornerstone of "Precision Public Health", which will allow individuals to adjust their exposure to a particular environmental factor involved in GxE interactions in order to reduce disease risk in accordance with their specific genotypes. However, although GxE testing represents an opportunity, challenges remain in interaction studies. The ability to detect interactions can be dependent on scale, SNP-based analyses can lack power, exposure measurements can be inconsistent and imperfect, and optimal software for efficient GxE analysis is lacking. Fortunately, recent advances in methodology development have boosted GxE interaction analysis and more studies are being published.

Smoking is an important risk factor for coronary heart disease (CHD) and CVD (Schroeder, 2013). Despite improved understanding, the pathophysiological mechanisms underpinning the association between smoking and CVD have yet to be elucidated fully. Nonetheless smoking is known to have an effect on endothelial cells, inflammatory states, platelet activation, procoagulant factors, and antifibrinolytic factors (Barua and Ambrose, 2013). Several studies have reported the effect of smoking on subclinical atherosclerosis (Liang et al., 2009; Yang et al., 2015; Hansen et al., 2016; Kianoush et al., 2017). Atherosclerosis is a complex, progressive disorder affecting large and medium-sized arteries. The disease has a silent progression, often with no clinical evidence until the occurrence of a vascular event.

Gene-smoking interactions for atherosclerosis have been reported. In the Bogalusa Heart Study, a variant located in the region of EDNRA was found to be associated with the status of the left cIMT (Li et al., 2015), and in the Northern Manhattan Study (NOMAS), RCBTB1 was reported as a modifier of the smoking effect on cIMT, and MXD1-JPH1 for carotid plaque burden in the presence of smoking (Wang et al., 2014; Della-Morte et al., 2014). Genes involved in inflammatory pathways mediated by the NF-κB axis (TBC1D4 and ADAMTS9) have been identified as displaying gene-smoking interactions (Polfus et al., 2013). In 2017, a case-control study on CHD identified variants in ADAMTS7 associated with a loss of cardio-protective effects resulting from gene-smoking interactions (Saleheen et al., 2017).

The Africa Wits-INDEPTH partnership for Genomic Studies (AWI-Gen), a Collaborative Centre of the Human Heredity and Health in Africa (H3Africa) Consortium, was developed to investigate the genomic and environmental risk factors for cardio-metabolic diseases in Africans. In this paper, we report on a genome-wide analysis of gene-environment interactions to explore the role of smoking on cIMT (a measure of atherosclerosis) in male participants from Nanoro (Burkina Faso) and Navrongo (Ghana) in West Africa, as part of the AWI-Gen study.

#### METHODS

#### Study Population

AWI-Gen is a cross-sectional study of adults (40 to 60 years of age) and in this study we used a subset of participants from the AWI-Gen study (Derra et al., 2012; Oduro et al., 2012; Ramsay et al., 2016; Ali et al., 2018), including male participants from the two study sites in West Africa. The participants for this study included 1,776 West African men from two rural settings, Nanoro (Burkina Faso) and Navrongo (Ghana). Only men were included in this study as smoking rates in women are very low in these communities. Participants completed a questionnaire with questions on demography, health history and behaviour. The reason for only performing the analysis on the two West African study sites, and not the other AWI-Gen study sites, is that they were comparable in terms of environmental exposures, genetic background, and prevalence of HIV infection (Derra et al., 2012; Oduro et al., 2012).

#### cIMT Measurement

cIMT was measured using dual B-mode ultrasound images of the carotid tree showing a typical double line for the arterial wall. cIMT is most clearly visible, with the lowest measurement variability, in the measurement segment of the distal common carotid artery. The measurement is most reliable over a one-centimeter segment and was performed by semi-automatic reading methods, which minimise reading errors. The far wall of both the left and right common carotid artery was imaged using a linear-array 12L-RS transducer with a GE Healthcare B-mode LOGIQe ultrasound machine (GE Healthcare, CT, USA). The participant was in a supine position for the measurements, with the head turned towards the left at a 45-degree angle to measure the right carotid. Operators used anatomical landmarks to identify the common carotid artery (CCA) on a longitudinal plane and the image was frozen. The operator then identified a continuous one-centimeter segment (10 mm) of the CCA far wall and placed a cursor between two points (10 mm apart) on this identified segment, with the proximal starting point 1 cm from the bulb of the CCA. The inbuilt software then automatically detected the intima-lumen and the media-adventitia interfaces and calculated the minimum, maximum and mean common cIMT in mm to two decimal places. To measure the left carotid, the participant's head was turned to the opposite side, and the process was repeated. The cIMT values underwent quality control according to the Mannheim Consensus defining the use of cIMT in population-based studies. We generated the mean cIMT as the average of the mean common right and left cIMT and used this variable for all analyses.

#### Smoking Status and Other Variables

Smokers (current) and nonsmokers (never and former) were classified based on self-reporting. A dichotomous categorization of smoking status was chosen over a quantitative measure (e.g., pack-years) owing to the inherent high dimensionality of GWAS analysis (13.98 million SNPs with two-fold main effects and interaction variables per SNP). Additionally, previous studies reported the reversal of the effect of smoking after several years of cessation. Smoking intensity was assessed using pack-years, calculated by multiplying the number of packs of cigarettes smoked per day by the number of years the person has smoked. The number of cigarettes smoked, or the number of times a tobacco product was consumed, was recorded at intervals of days (every day, 5-6 days, 1-4 days, 1-3 days/month, less than once a month).
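The two smoking variables described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the status labels ("current", "never", "former") are assumptions made for the example, and the 20-cigarettes-per-pack convention is taken from the pack-years note accompanying Table 1.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Pack-years = (cigarettes smoked per day / 20) x number of years smoked."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return (cigarettes_per_day / 20.0) * years_smoked


def is_current_smoker(self_reported_status: str) -> bool:
    """Dichotomous coding: 'current' is a smoker; 'never' and 'former' are nonsmokers.

    The label strings are hypothetical; the questionnaire coding is not given in the text.
    """
    return self_reported_status.strip().lower() == "current"
```

For example, half a pack per day (10 cigarettes) over 20 years corresponds to 10 pack-years.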

Other variables (height (m), weight (kg), BMI (kg/m<sup>2</sup> ), blood pressure, fasting glucose, total cholesterol and physical activity) were recorded as previously described (Ali et al., 2018).

#### Association and Follow-Up Analysis

#### Genotype Data and Imputation

The H3Africa genotyping array<sup>1</sup>, designed as an African-common-variant-enriched GWAS array (Illumina) with ~2.3 million SNPs, was used to genotype genomic DNA using the Illumina FastTrack Sequencing Service<sup>2</sup>. The following pre-imputation QC steps were applied to the entire AWI-Gen genotype data set. Individuals with a missing SNP calling rate greater than 0.05 were removed. SNPs with a genotype missingness greater than 0.05, MAF less than 0.01, or Hardy-Weinberg equilibrium (HWE) P-value less than 0.0001 were removed. Nonautosomal and mitochondrial SNPs, and ambiguous SNPs that did not match the GRCh37 reference alleles or strands, were removed. Imputation was performed on the cleaned data set (with 1,729,661 SNPs and 10,903 individuals from the AWI-Gen study) using the Sanger Imputation Server with the African Genome Resources as the reference panel. We selected EAGLE2 (Loh et al., 2016) for prephasing, and the default PBWT algorithm was used for imputation. After imputation, poorly imputed SNPs with info scores less than 0.6, MAF less than 0.01, or HWE P-value less than 0.00001 were excluded. The final QC-ed imputed data had 13.98 M SNPs, and data from male participants from the Nanoro and Navrongo study sites were extracted for analysis.
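The SNP-level thresholds listed above can be expressed as two simple predicates. The sketch below is illustrative only: the `SnpStats` record and function names are hypothetical, and in practice such filters are applied with dedicated genotype QC tools rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnpStats:
    """Per-SNP summary statistics (hypothetical container for this example)."""
    snp_id: str
    missingness: float           # fraction of missing genotype calls
    maf: float                   # minor allele frequency
    hwe_p: float                 # Hardy-Weinberg equilibrium P-value
    info: Optional[float] = None  # imputation info score, if imputed


def passes_pre_imputation_qc(s: SnpStats) -> bool:
    # Removed if missingness > 0.05, MAF < 0.01, or HWE P < 0.0001.
    return s.missingness <= 0.05 and s.maf >= 0.01 and s.hwe_p >= 1e-4


def passes_post_imputation_qc(s: SnpStats) -> bool:
    # Excluded if info < 0.6, MAF < 0.01, or HWE P < 0.00001.
    return (s.info is not None and s.info >= 0.6
            and s.maf >= 0.01 and s.hwe_p >= 1e-5)
```

Note that the HWE threshold is stricter before imputation (0.0001) than after it (0.00001), so a SNP with an HWE P-value between the two would fail the first filter but pass the second.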

#### Data Analysis

Descriptive statistics were used to summarise the population characteristics. Continuous variables were reported as medians and interquartile ranges, and categorical variables as percentages. All the data were analyzed per site before reporting for the combined group. We examined group differences using the Mann-Whitney test for continuous variables and the Pearson Chi-square test for categorical variables. Analyses were performed using mean cIMT and SNPs with a MAF of 0.05 or above.

<sup>1</sup> https://www.h3abionet.org/h3africa-chip

<sup>2</sup> https://www.illumina.com/services/sequencing-services.html

#### Gene-Environment Interaction (GxE) Analysis

Linear regression of mean cIMT was performed with covariates using R<sup>3</sup>. Residuals were extracted from the linear regression analyses and used for the GWAS analysis. Model 1 used the following covariates: age and six principal components (PCs) computed on the genetic data (to account for genetic structure). To check the consistency of our associations, Model 2 included further adjustment [Model 1 + BMI + systolic blood pressure + diastolic blood pressure + fasting glucose + total cholesterol + physical activity (moderate to vigorous physical activity (MVPA) in minutes per week)] and was used to analyze SNPs with p-values < 1E-06 in Model 1.
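The residualization step can be illustrated with an ordinary least-squares fit. The study performed this step in R; the Python/NumPy sketch below is a stand-in under that assumption, and the function name `residualize` is hypothetical. The covariate matrix would hold age and the six PCs for Model 1, with the additional columns for Model 2.

```python
import numpy as np


def residualize(phenotype, covariates):
    """Return residuals of an OLS regression of phenotype on covariates.

    An intercept column is added automatically. `covariates` has one row per
    participant and one column per covariate (e.g. age, PC1..PC6).
    """
    y = np.asarray(phenotype, dtype=float)
    x = np.asarray(covariates, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta
```

By construction the residuals have mean zero and are uncorrelated with every covariate, which is what allows them to be carried forward as the adjusted phenotype.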

GxE testing was performed using the PLINK "--gxe" option (Purcell et al., 2007; Chang et al., 2015) on the "awigen" branch of the automated workflow<sup>4</sup> (Baichoo et al., 2018). We screened the output for a genome-wide significance threshold (p-values < 5E-08).
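PLINK's documentation describes this option as comparing the genotype regression coefficient estimated in one group (here, smokers) with the coefficient estimated in the other (nonsmokers), using Z = (b1 − b2) / sqrt(SE1² + SE2²). The sketch below illustrates that comparison for a single SNP in plain Python; it is a simplified stand-in with hypothetical function names, not PLINK's implementation, and it assumes an additive genotype coding (0, 1, 2) and a two-sided normal p-value.

```python
import math


def slope_and_se(genotypes, phenotypes):
    """Simple linear regression slope and its standard error."""
    n = len(genotypes)
    if n < 3 or n != len(phenotypes):
        raise ValueError("need at least 3 paired observations")
    mx = sum(genotypes) / n
    my = sum(phenotypes) / n
    sxx = sum((g - mx) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype has no variation in this group")
    sxy = sum((g - mx) * (p - my) for g, p in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((p - intercept - slope * g) ** 2
              for g, p in zip(genotypes, phenotypes))
    return slope, math.sqrt(sse / (n - 2) / sxx)


def gxe_z_test(geno_smokers, pheno_smokers, geno_nonsmokers, pheno_nonsmokers):
    """Compare per-group slopes: Z = (b1 - b2) / sqrt(SE1^2 + SE2^2)."""
    b1, se1 = slope_and_se(geno_smokers, pheno_smokers)
    b2, se2 = slope_and_se(geno_nonsmokers, pheno_nonsmokers)
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p
```

A positive Z under this coding means the per-allele effect on the phenotype is larger in smokers than in nonsmokers; the resulting p-value is what would be screened against the 5E-08 threshold.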

To assess genomic inflation, the observed distribution of −log10(P) values was compared to that expected in the absence of association (lambda) and illustrated in QQ plots. We used a cross-replication approach between the two sites: suggestive signals (p-values < 1E-04) in one site were checked in the other site and vice versa.
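One common way to compute the genomic inflation factor is to convert each p-value to a 1-degree-of-freedom chi-square statistic and divide the median of those statistics by the median expected under the null (about 0.455). The sketch below uses that definition; the text does not state which estimator was used, so this is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231


def genomic_inflation(p_values):
    """Lambda = median(observed chi-square, 1 df) / expected median under the null."""
    normal = NormalDist()
    chi2 = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        z = normal.inv_cdf(p / 2.0)  # lower-tail quantile; squared below
        chi2.append(z * z)
    return median(chi2) / CHI2_1DF_MEDIAN
```

Values close to 1, such as those reported for the three analyses here, indicate little evidence of systematic inflation from population structure or other confounding.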

Power calculations were performed with Quanto<sup>5</sup> (Version 1.2.4). The study was powered at 80% to identify SNPs with MAF ≥0.05 and interaction effect size (OR) of >4, given our sample size and smoking prevalence in Nanoro. The power would be higher for Navrongo and the combined data set because of the increased number of smokers.

#### Functional Analysis of Associated Variants

The FUMA online platform<sup>6</sup> (Watanabe et al., 2017) was used to annotate, prioritize, visualize, and interpret GWAS results. Using GWAS summary statistics as input, it provided extensive functional annotation for all SNPs in genomic areas identified by lead SNPs. From the list of gene IDs (as identified by the SNP2GENE option in FUMA), FUMA annotated genes in biological context (Watanabe et al., 2017). We selected all candidate SNPs in the associated genomic region having r<sup>2</sup> ≥ 0.6 with one of the independently significant SNPs, with a suggestive P-value (P < 1E-05) and MAF ≥ 0.05 for annotation. Predicted functional consequences for these SNPs were obtained by matching the SNP's chromosome base-pair position, and reference and alternate alleles, to databases containing known functional annotations, including ANNOVAR (Wang et al., 2010), combined annotation-dependent depletion (CADD) scores (Kircher et al., 2014), and RegulomeDB (RDB) (Boyle et al., 2012) scores.

#### Functional Annotation of Mapped Genes

Genes implicated by mapping of significant GWAS SNPs were further investigated using the GENE2FUNC procedure in FUMA (Watanabe et al., 2017), which provides hypergeometric tests of enrichment of the list of mapped genes in 53 GTEx tissue-specific gene expression sets (The GTEx Consortium et al., 2015; GTEx Consortium et al., 2017), 7,246 MSigDB gene sets<sup>7</sup>, and chromatin states (Ernst and Kellis, 2013; Roadmap Epigenomics Consortium et al., 2015).
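A hypergeometric enrichment test asks how likely it is to see at least the observed overlap between the mapped genes and a gene set if the mapped genes were drawn at random from the background. The sketch below shows the upper-tail probability in plain Python; it is a generic illustration with hypothetical argument names and does not reproduce FUMA's choice of background genes or its multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_mapped, n_overlap):
    """P(overlap >= n_overlap) when n_mapped genes are drawn without replacement
    from n_background genes, of which n_in_set belong to the gene set."""
    if not (0 <= n_in_set <= n_background and 0 <= n_mapped <= n_background):
        raise ValueError("set sizes must not exceed the background size")
    if n_overlap < 0:
        raise ValueError("overlap must be non-negative")
    upper = min(n_in_set, n_mapped)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_mapped - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_mapped)
```

With a background of 10 genes, a gene set of 3, and 3 mapped genes that all fall in the set, the probability is 1/120, the chance of drawing exactly those three genes.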

#### GWAS Catalog Lookup

The GWAS Catalog database was downloaded from the website<sup>8</sup> (accessed on 12 Jul 2018), and a subset of the data set was generated using the following key words relevant to our study: genome-wide interaction, gene-environment interactions, atherosclerosis, coronary artery diseases, carotid atherosclerosis, cIMT, coronary artery calcification, abdominal artery aneurysm. Since our data set was in build 37, we performed a lift-over prior to comparison. In order to assess whether our study was replicating previous findings, we searched for the same marker or any markers within 100 kb of all suggestive index SNPs (p-value ≤ 1E-04) found in this study and further filtered the list using key words pertaining to coronary artery diseases and gene-environment interactions.
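The 100 kb window search can be sketched as a simple positional match, assuming both lists are already on the same genome build (that is, after the lift-over). The tuple layouts and the function name are hypothetical, chosen only for illustration.

```python
def catalog_hits_near(index_snps, catalog, window_bp=100_000):
    """Map each index SNP to catalog entries on the same chromosome within window_bp.

    index_snps: iterable of (snp_id, chromosome, position)
    catalog:    iterable of (snp_id, chromosome, position, trait)
    Positions must refer to the same genome build.
    """
    catalog = list(catalog)
    hits = {}
    for snp_id, chrom, pos in index_snps:
        hits[snp_id] = [
            entry for entry in catalog
            if entry[1] == chrom and abs(entry[2] - pos) <= window_bp
        ]
    return hits
```

An entry at the same position as the index SNP is included (distance zero), which covers the "same marker" case; the returned entries can then be filtered further by trait key words.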

#### Ethics and Consent

This study received the approval of the Human Research Ethics Committee (Medical), University of the Witwatersrand, South Africa (M121029), the Centre Muraz Institutional Ethics Committee, Burkina Faso (015-2014/CE-CM), the National Ethics Committee For Health Research, Burkina Faso (2014-08-096), the Ghana Health Service Ethics Review Committee (ID No: GHS-ERC:05/05/2015), and the Navrongo Institutional Review Board (ID No: NHRCIRB178). All the participants signed an Informed Consent Form before any study procedure was performed.

#### RESULTS

#### Characteristics of Participants

The participant characteristics are presented for Nanoro (n = 993), Navrongo (n = 783), and for the combined data (n = 1776) in Table 1. Only males were included. The prevalence of current smokers was 13.3%, 42.5% and 26.2% for Nanoro, Navrongo, and the combined data set, respectively. Smokers were younger than nonsmokers in Nanoro. Interestingly, although the prevalence of current smokers was lower in Nanoro, the smoking intensity (in pack-years) was much higher there, 6.6 (3.7–11.6) vs 1.7 (0–4.2) in Navrongo. Moreover, although there were no differences in mean cIMT values between smokers and nonsmokers, significant differences were observed for the following risk factors: BMI, systolic blood pressure, and diastolic blood pressure, for both sites and overall. Total cholesterol and low-density lipoprotein cholesterol (LDL-C) were lower in smokers compared to nonsmokers in Nanoro and the combined sample, whereas fasting glucose was lower for smokers in Navrongo and the combined sample. In spite of the geographic and genetic proximity of the two groups, the differences in smoking prevalence and intensity, as well as in other risk factors, suggest that the GxE interaction mechanism might not be the same in the two study centres.

<sup>3</sup> https://www.R-project.org/

<sup>4</sup> http://github.com/h3abionet/h3agwas/

<sup>5</sup> http://biostats.usc.edu/Quanto.html

<sup>6</sup> http://fuma.ctglab.nl/

<sup>7</sup> http://software.broadinstitute.org/gsea/msigdb

<sup>8</sup> https://www.ebi.ac.uk/gwas/

⌖Mann-Whitney rank-sum test, p-values at p < 0.05 are shown in bold.

ϒ Number of pack-years = (number of cigarettes smoked per day/20) × number of years smoked.

BMI, body mass index; BP, blood pressure; MVPA, moderate to vigorous physical activity.

#### Gene-Smoking Interactions

The association results of SNPs (MAF ≥ 0.05) are illustrated by Manhattan plots for each site and the combined data set in Figures 1A–C. Genomic inflation factors (GIFs) (lambda) were 0.994, 0.993, and 1.007 for Nanoro (7,828,913 SNPs), Navrongo (7,842,446 SNPs), and the combined sample (7,839,440 SNPs), respectively (Figures 1D–F).

The strongest signal in the Nanoro sample was found for the GxE where the C allele of rs7649061 (allele frequency (AF) = 0.25, p-value = 7.37E-08) was associated with a decrease of mean cIMT in the presence of smoking (Table 2A, Supplementary Table 1). This variant is located in an intergenic region between BCHE (butyrylcholinesterase) and ZBBX (zinc finger B-box domain containing). Four other regions had signals with p < 5E-07 in the analysis for Nanoro: rs7095209 (p-value = 1.23E-07), an intergenic SNP close to the SORCS3 gene (sortilin related VPS10 domain containing receptor 3), the G allele of which (AF = 0.42) was associated with a decrease of mean cIMT in the presence of smoking (Supplementary Figures 1 and 2); three chromosome 13 variants in high LD (rs112017404, rs144170770 and rs4941649), with the lowest p-value at 1.35E-07 (AF = 0.08), associated with an increase of mean cIMT in smokers, and located between RCBTB1 (Regulator of chromosome condensation (RCC1) and BTB (POZ) domain containing protein 1) and ARL11 (ADP-ribosylation factor-like 11) (Figure 2A); rs13268575 (p-value = 2.77E-07) in the CNBD1 (cyclic nucleotide binding domain containing 1) region, the A allele of which (AF = 0.23) was associated with an increase of mean cIMT in smokers; and a missense variant, rs17844302 (p-value = 4.07E-07), in PCDHA6 (protocadherin alpha 6), found to be associated with an increase of mean cIMT in smokers.

The associations for the analysis of the Navrongo data were less significant, with none reaching p < 1E-07 (Table 2B, Supplementary Table 1). The strongest association signal was for rs4869800 (p-value = 1.92E-06, AF = 0.92), an intergenic variant between RGS17 (regulator of G-protein signaling 17) and OPRM1 (opioid receptor, mu 1). Other suggestive signals included genes from the olfactory receptor (OR) family. While there could indeed be true signals in the OR gene family, the high false positive variant discovery rate in this gene family (Chen et al., 2017) makes it difficult to assess the robustness of this association using imputed data sets; we therefore excluded these variants from the downstream analyses.

In the combined sample, one SNP (rs1192824) reached the genome-wide significance level (Table 2C, Supplementary Table 1). The C allele of rs1192824 (AF = 0.69), located in the intergenic region between TBC1D8 (TBC1 domain family, member 8) and CNOT11 (CCR4-NOT Transcription Complex Subunit 11), showed a SNP-smoking interaction associated with a lower cIMT in smokers compared to the T carriers (p = 5.90E-09) (Figure 2B). Another variant in the promoter flanking region of TBC1D8, rs77461169, 5648 bp away from rs1192824, showed a suggestive interaction (p = 4.48E-06), and was located in an open chromatin region. The two variants were not in LD (Figure 2C). The distribution of mean cIMT for the three rs1192824 genotypes showed no difference in the nonsmokers, but there was a significant decrease of mean-cIMT for homozygote (C/C) and heterozygote (C/T) carriers among the smokers, when compared to the T/T genotype (Figure 3). This suggests a recessive mode of action for the risk allele (T). The second strongest signal was observed for rs12444312 (p-value = 1.27E-07), located in a noncoding RNA exon of LOC440390. This SNP is a regulatory region variant located in a CTCF binding site. A signal was found with rs11695675 (p-value = 3.48E-07) near FTCDNL1 (formiminotransferase cyclodeaminase N-terminal like). Near PCSK9, rs1158815 was found suggestive of the GxE interaction for cIMT in the combined sample at p-value = 6.22E-06 (Supplementary Figures 1 and 2).

base-pair positions along the chromosomes on the x-axis. The red line indicates Bonferroni-corrected genome-wide significance (p < 5E-08); the blue line indicates the threshold for suggestive association (p < 1E-04). Manhattan plot for Nanoro, 993 participants, 7828913 SNPs (A). Manhattan plot for Navrongo, 783 participants, 7842446 SNPs (B). Manhattan plot for combined set, 1776 participants, 7839440 SNPs (C). QQ plot for Nanoro, GIF = 0.9944 (D). QQ plot for Navrongo, GIF = 0.9934 (E). QQ plot for combined set, GIF = 1.0079 (F). GIF [genomic inflation factor (lambda)].

When additional covariates were adjusted for in Model 2 (Model 1 + BMI + systolic blood pressure + diastolic blood pressure + fasting glucose + cholesterol + physical activity [MVPA]), four variants in two loci reached genome-wide significance (rs7649061, p = 2.20E-08; rs112017404-rs144170770-rs4941649, p = 3.08E-08) in Nanoro. In the combined sample, rs1192824 (p = 2.70E-08) remained the single locus below the genome-wide significance threshold (Table 3). The direction of the allelic effect was the same in all cases where significant associations were identified in Nanoro and the combined sample.
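To make the notion of a SNP-smoking interaction concrete, the sketch below compares the per-allele effect on a quantitative trait between an exposed and an unexposed group, using a Z statistic for the difference between two regression slopes. It is a simplified illustration that ignores covariates and relatedness and is not the authors' analysis; the function names and the toy cIMT values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def slope_and_se(dosages, trait):
    """Simple linear regression of trait on allele dosage: (slope, standard error)."""
    n = len(dosages)
    g_bar, y_bar = mean(dosages), mean(trait)
    sxx = sum((g - g_bar) ** 2 for g in dosages)
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(dosages, trait))
    beta = sxy / sxx
    intercept = y_bar - beta * g_bar
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, trait))
    return beta, sqrt(rss / (n - 2) / sxx)


def gxe_z_test(g_exposed, y_exposed, g_unexposed, y_unexposed):
    """Return (slope difference, Z, two-sided p) for exposed minus unexposed."""
    b_exp, se_exp = slope_and_se(g_exposed, y_exposed)
    b_unexp, se_unexp = slope_and_se(g_unexposed, y_unexposed)
    z = (b_exp - b_unexp) / sqrt(se_exp ** 2 + se_unexp ** 2)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return b_exp - b_unexp, z, p


# Hypothetical allele dosages (0, 1, 2) and cIMT values in mm.
dosage = [0, 0, 1, 1, 2, 2]
cimt_smokers = [0.70, 0.72, 0.64, 0.66, 0.58, 0.60]     # decreases with dosage
cimt_nonsmokers = [0.60, 0.62, 0.61, 0.59, 0.60, 0.62]  # flat across genotypes
diff, z, p = gxe_z_test(dosage, cimt_smokers, dosage, cimt_nonsmokers)
print(f"slope difference = {diff:.3f}, Z = {z:.2f}, p = {p:.2g}")
```

A negative slope difference corresponds to the pattern described for rs1192824: no genotype effect in nonsmokers and lower cIMT per copy of the allele in smokers.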

#### Comparison of Association Signals Between Sites

We found eight SNPs from Nanoro with some evidence for interaction (p-values < 1E-04) which showed nominal replication (p-value ≤ 0.05) in Navrongo, with the strongest being rs77655815 (Nanoro p-value = 4.01E-05), replicated at a p-value of 6.83E-05, and rs79419964 (Nanoro, p = 8.74E-06; Navrongo, p = 8.23E-04), and with p = 1.06E-06 (2.66E-06 for Model 2) in the combined analysis. From Navrongo, 19 SNPs (p < 1E-04) were nominally replicated in Nanoro, of which rs12444312 (Nanoro p = 1.15E-03) was found associated at a p-value of 2.95E-07 in the combined sample (Supplementary Table 1). When significant, the direction of the allelic effect was the same in all cases.
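The cross-site comparison described above amounts to a two-threshold filter. The sketch below shows one way such a filter could be written, under the assumption that a concordant effect direction is also required; the SNP identifiers, effect sizes and p-values are placeholders, not study results.

```python
DISCOVERY_P = 1e-4    # suggestive threshold in the discovery site
REPLICATION_P = 0.05  # nominal threshold in the replication site


def nominally_replicated(discovery, replication):
    """Return SNP IDs passing both thresholds with the same effect direction.

    Each argument maps a SNP ID to a (beta, p_value) tuple.
    """
    hits = []
    for snp, (beta_d, p_d) in discovery.items():
        if p_d >= DISCOVERY_P or snp not in replication:
            continue
        beta_r, p_r = replication[snp]
        same_direction = (beta_d > 0) == (beta_r > 0)
        if p_r <= REPLICATION_P and same_direction:
            hits.append(snp)
    return sorted(hits)


# Hypothetical summary statistics for two sites.
site_a = {"snp_1": (-0.05, 3e-5), "snp_2": (0.04, 8e-5), "snp_3": (0.03, 2e-3)}
site_b = {"snp_1": (-0.03, 0.01), "snp_2": (-0.02, 0.04), "snp_3": (0.03, 0.001)}
replicated = nominally_replicated(site_a, site_b)
print(replicated)
```

In this toy input only `snp_1` passes: `snp_2` has opposite effect directions in the two sites, and `snp_3` does not meet the discovery threshold.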

#### GWAS Catalog Lookup for Gene-Environment Interaction

We replicated previously described gene-environment association loci for smoking interactions with body composition (BMI and waist circumference) (Justice et al., 2017), smoking or alcohol interaction with blood pressure (Taylor et al., 2016; Feitosa et al., 2018; Sung et al., 2018), coronary artery calcified plaque in type 2 diabetes (Divers et al., 2017) and peripheral arterial disease interaction with air pollution (Ward-Caviness et al., 2017) (Supplementary Table 5). No SNPs from these studies showed evidence of transference of the lead SNPs to African populations in our study, but none was specifically for cIMT as the main outcome. Interestingly, in Nanoro we identified 15 SNPs located within 100 kb of previously reported gene-environment interaction loci. The loci included a gene-alcohol interaction on blood pressure (Feitosa et al., 2018), gene-smoking interaction on waist circumference (Justice et al., 2017), gene-smoking interaction on lung cancer (Park et al., 2015) and gene-smoking interaction on blood pressure (Sung et al., 2018). In Navrongo, 13 SNPs replicated previous loci for gene-smoking interaction on BMI (Justice et al.,

#### TABLE 2A | Selected risk loci (p ≤ 1E-05) for SNP-smoking interactions on cIMT in Nanoro.


Gene-Smoking Interactions for Atherosclerosis in West Africa

\*Z\_GxE shows Beta interaction. For beta interaction, positive values indicate association of allele A1 and increased cIMT; negative values indicate association of allele A1 with decreased cIMT.

<sup>Y</sup>A1 is reflecting the alternative allele compared to the reference genome allele, all frequencies are reported for A1.

FIGURE 2 | Regional association plots for the RCBTB1 region in Nanoro (A). Regional association plots of TBC1D8 region in the combined data set (B). Distinct genomic risk loci were defined as linkage disequilibrium (LD)-independent regions (r<sup>2</sup>) separated by 100 kb and containing one or more SNPs with a suggestive association (p-values < 1E-05). For each locus, the plots show the –log10 transformed value of each SNP on the y-axis and base pair positions along the chromosomes on the x-axis. Genes overlapping the locus are displayed below the plot. SNPs are colored by their LD value with the lead SNP in the region, and those LD values were generated from the study populations. Haplotype blocks show that rs1192824 and rs77461169 are not in LD. Haplotype blocks were built using Haploview with LD values calculated from the two study populations together (C).

2017), lung cancer (McKay et al., 2017) and blood pressure (Taylor et al., 2016). Seventeen SNPs in the combined sample replicated previously reported interaction loci for gene-diabetes interaction for atherosclerotic plaque (Divers et al., 2017), gene-alcohol interaction for blood pressure (Feitosa et al., 2018), gene-smoking interaction for BMI (Justice et al., 2017) and gene-smoking interaction for blood pressure (Taylor et al., 2016; Sung et al., 2018).
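The lookup of study SNPs lying within 100 kb of previously reported interaction loci can be sketched as a simple positional match. The example below is an assumption about how such a lookup might be coded, with made-up identifiers and coordinates; it does not reproduce the GWAS Catalog query used in the study.

```python
WINDOW_BP = 100_000  # 100 kb window around each reported locus


def snps_near_reported_loci(study_snps, reported_loci, window_bp=WINDOW_BP):
    """Map each study SNP to reported loci on the same chromosome within window_bp."""
    matches = {}
    for snp_id, (chrom, pos) in study_snps.items():
        nearby = sorted(
            locus_id
            for locus_id, (locus_chrom, locus_pos) in reported_loci.items()
            if locus_chrom == chrom and abs(locus_pos - pos) <= window_bp
        )
        if nearby:
            matches[snp_id] = nearby
    return matches


# Hypothetical (chromosome, base-pair position) pairs.
study_snps = {
    "snp_a": ("2", 1_050_000),
    "snp_b": ("2", 5_000_000),
    "snp_c": ("13", 1_020_000),
}
reported_loci = {"locus_x": ("2", 1_000_000), "locus_y": ("7", 1_000_000)}
matches = snps_near_reported_loci(study_snps, reported_loci)
print(matches)
```

Matching on position alone does not establish that the same causal variant is involved, which is why such overlaps are better read as supportive rather than as strict replication.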


#### Functional Analysis

Functional annotation of SNPs with suggestive associations showed that these were mostly intronic or intergenic (Supplementary Tables 2A–C). Thirty SNPs displayed a CADD score above 12.37 (17 in Nanoro; 3 in Navrongo; 10 in Combined) and were therefore suspected to be deleterious. In the Nanoro sample, two SNPs (rs6701037, rs6677097), in high LD with rs10573305 (KIAA0040 region), had an RDB score of 1f, suggesting they were likely affecting binding sites and gene expression. Equally, for the combined sample two variants (rs10409209 and rs4807840), in LD with rs8111212, displayed an RDB score of 1f.

#### Functional Annotation of Mapped Genes

Genes implicated by mapping of significant SNPs were further investigated using the GENE2FUNC procedure in FUMA. Positional mapping, eQTL mapping (matched cis-eQTL SNPs) and chromatin interaction mapping (on the basis of 3D DNA–DNA interactions) are reported (Supplementary Tables 3A–C, Supplementary Figure 3). We found that rs1192824 and rs77461169 in TBC1D8 were implicated as eQTLs influencing the expression of TBC1D8, SNORD89, and RNF149 and also displaying chromatin interactions (Figure 4A). The RCBTB1 locus SNPs were eQTLs for RCBTB1, ARL11, CAB39L, and PSME2P2; their chromatin interactions included surrounding genes such as RCBTB1, KPNA3, CAB39L, SETDB2, MLNR, and CDADC1 (Figure 4B). A lookup in gene expression data sets revealed expression of genes of interest in specific relevant tissues such as arteries (Supplementary Figure 4). We also analyzed the functional significance of the associated variants using FUN-LDA (Backenroth et al., 2018) and the results were largely similar to FUMA tissue-specific annotation.

#### Gene Set Analysis

Gene set analysis was only reported when at least five genes were implicated. In Nanoro, we found significant Gene Ontology Biological Processes for biological adhesion (AdjP = 2.69E-11), cell-cell adhesion (AdjP = 8.03E-11), homophilic cell adhesion via plasma membrane adhesion molecules (AdjP = 1.29E-09) and cell-cell adhesion via plasma membrane adhesion molecules (AdjP = 5.52E-09) (Figure 5; Supplementary Tables 4A–C).

#### DISCUSSION

Atherosclerosis is a low-grade chronic inflammatory condition characterized by aberrant lipid metabolism and a maladaptive inflammatory response. Biologically, the disease involves the formation of plaques in arterial walls and thickening that narrows the arterial passage, restricting blood flow and increasing the risk of occlusion resulting in a myocardial infarction and other events. Although environmental factors such as diet and/or smoking play an important role in the development of atherosclerosis, genetic factors represent

#### TABLE 2B | Selected risk loci (p ≤ 1E-05) for SNP-smoking interactions on cIMT in Navrongo.

<sup>Y</sup>A1 is reflecting the alternative allele compared to the reference genome allele, all frequencies are reported for A1.

#### TABLE 2C | Selected risk loci (p ≤ 1E-05) for SNP-smoking interactions in combined sample.


\*For beta interaction, positive values indicate association of allele A1 and increased cIMT; negative values indicate association of allele A1 with decreased cIMT.

<sup>Y</sup>A1 is reflecting the alternative allele compared to the reference genome allele, all frequencies are reported for A1.

important determinants of atherosclerotic CVD risk. Key gene-environment interactions may increase the risk for adverse outcomes by contributing to an increase of cIMT, and identifying such interactions was the aim of this study.

performed using the Kruskal-Wallis test.

The gene-smoking interaction signals we identified for mean cIMT were with loci where the associated allele frequencies ranged from low to common, with most of them displaying effect sizes of over 4. This suggests that the effects were unlikely to be attributed to additive independent effects of genetic-association and smoking, but rather to the interaction. The high effects provided us with sufficient power to discover the gene-smoking interactions, given our sample size. Moreover, many previously reported loci for GxE with cardiovascular-related traits were replicated, and our study identified new GxE variants for cIMT. The loci identified are biologically relevant in terms of the pathophysiology of atherosclerosis and involve genes implicated in macrophage activation and recruitment in the endothelial layer, cholesterol metabolism at the cellular level, inflammation processes and signalling, and cell membrane activity.

Our study is the first to report an association of TBC1D8 in the GxE interaction with cIMT and consequent risk for atherosclerosis. TBC1D8 (also called Vascular Rab-GAP/TBC-containing protein) is a gene involved in blood circulation, intracellular protein transport and positive regulation of cell proliferation. The gene is regulated by vascular genes like VEGF, ACKR1, VEGFA, SIRT1, and TNF. Previous GWASs found variants in TBC1D8 associated with bone mass (Kiel et al., 2007), cognitive decline (Li et al., 2015) and osteoporosis (Hsu et al., 2010) and found that TBC1D8 expression was subject to change under environmental stress. A study on the effect of smoking on gene expression found that TBC1D8 was differentially expressed in lymphocytes of smokers and nonsmokers (Charlesworth et al., 2010), as well as in macrophages from atherosclerotic plaques (Puig et al., 2011), depending on the inflammatory status of patients. Later, Verdugo and colleagues (Verdugo et al., 2013) reported that TBC1D8 expression in monocytes was subject to a gene interaction with smoking among atherosclerotic patients, and that TBC1D8 was involved in one of the shortest gene paths between smoking and atherosclerotic plaques (smoking and plaques were separated by a relatively low number of genes). Their analysis of causality models provided evidence of gene expression partially mediating the relationship between smoking and atherosclerosis. Our study therefore confirms the importance of TBC1D8 gene-environment interaction in atherosclerosis pathophysiology.

RCBTB1, previously identified as a modifier for smoking on cIMT in multi-ethnic northern American populations, has been replicated in our study. The signal identified in our study is independent from the one previously described, although located in the same gene: the signal reported in the NOMAS study was led by the Hispanic population and located about 42 kb from our signal. However, the allele frequencies were higher in non-Hispanic blacks than in Hispanics (rs3751383, MAF: 0.44 vs 0.25). Our data showed no LD (r<sup>2</sup> < 0.02) between the lead SNPs from the two studies (rs3751383, rs112017404). The RCBTB1 gene encodes a protein with an N-terminal RCC1 domain and a C-terminal BTB (broad complex, tramtrack, and bric-a-brac) domain. In rat, overexpression of this gene in vascular smooth muscle cells induced cellular hypertrophy. These results suggest that gene-smoking interaction for atherosclerosis might be acting through an intensification of monocyte activation and recruitment under the endothelial layers before their differentiation into macrophages, a process known to trigger foam cell formation and subsequent plaques (Jia et al., 2017). The results from the functional analyses suggest that the gene-smoking interaction for cIMT is likely acting through a regulatory process, explaining the involvement of multiple loci displaying chromatin interactions and acting as eQTLs. We were able, in our study, to reproduce a GxE interaction for markers in RCBTB1, albeit with an independent signal in the gene, demonstrating that the association is more generalizable. Our study is the first independent validation of the involvement of RCBTB1 in gene-smoking interaction for cIMT.

#### TABLE 3 | Comparison of gene-smoking association results for Nanoro and the combined sample (All), based on Model 1 and Model 2 for selected SNPs.

\*For beta interaction, positive values indicate smoking interaction with allele A1 results in increased cIMT; negative values indicate smoking interaction with allele A1 results in decreased cIMT. In bold are the variants reaching genome-wide significance threshold (p < 5E-08) in Model 2.

There were differences in the association results between the two study sites, with low replication. These differences may be partly explained by differences in the prevalence of smoking, sample sizes and smoking intensity. Indeed, the median smoking intensity in Nanoro was three times higher than in Navrongo (6.6 pack/year vs 1.7 pack/year) (Table 1), whereas the number of smokers in Nanoro (n = 132) was less than half of that in Navrongo (n = 333). A previous study using a systems biology approach revealed that cigarette smoke induced a concentration-dependent (direct and indirect) biological mechanism that promotes monocyte–endothelial cell adhesion (Poussin et al., 2015). The influence of smoking intensity on the detection of gene-smoking interaction was also reported previously in a study of gene-smoking interaction for blood pressure in the Framingham Heart Study (Basson et al., 2015), which found different associated loci in the light smoker and the heavy smoker groups (>10 cigarettes per day). In the study by Wang et al. on gene-smoking interaction, the strongest association was among heavy smokers (≥20 pack/year) (Wang et al., 2014). This might explain why the RCBTB1 region was only replicated in Nanoro, where the smoking intensity was higher than in Navrongo.

We report a substantial number of suggestive GxE signals that may be African-specific as they have not yet been observed in non-African studies with larger cohorts. Since African genetic diversity is generally higher, it is possible that there are more novel gene variants that are related to pathways involved in complex diseases like atherosclerosis (Sulovari et al., 2017). Our study is restricted to men and is limited by the sample size and relatively low prevalence of smokers in Nanoro. There was, however, sufficient power to detect the effect sizes we observed, and our sample size exceeded several previously published studies. In the design of the study, an inclusion criterion was that participants should not be closely related (Ali et al., 2018); however the genetic data revealed several individuals with first and second degree relatedness (204 in nonsmokers, 62 in smokers; 15%). To mitigate the effect of relatedness, we ran the analysis using GEMMA, a program that adjusts using the kinship matrix (Zhou et al., 2012). The comparison of the results from GEMMA (gxe option) and from PLINK (gxe option) showed that the outputs were highly correlated, indicating that relatedness had little effect on the outcomes.

Probable contributors to the heterogeneity of signals between the two geographical groups include differences in the patterns of smoking exposure and the simplistic measure of smoking status that we used in this study (current smokers vs nonsmokers), over the use of a continuous measure (pack/year).

#### CONCLUSION

Our study provides the first report of gene-smoking interactions for cIMT in sub-Saharan African populations. We identified novel genome-wide significant variants in TBC1D8 for interactions with smoking for cIMT. The replication of eight previous signals identified in non-African populations demonstrates that these signals are transferable to West Africa. The discovery of the novel signals, on the other hand, indicates the possibility of African-specific associations. The strategies of functional annotation and gene mapping using biological data resources provided useful information on the likely consequences of relevant genetic variants and identified plausible gene targets and biological mechanisms for functional follow-up. Gene set analyses contributed novel insight into underlying pathways, confirming the importance of gene-environment interactions in atherosclerosis and pointing toward the involvement of specific cell types. Gene-environment GWASs will benefit from colocalization analyses for interpreting the biological and clinical relevance of the GWAS results. When variants associated with GxE are present at high frequency in target populations, this provides an opportunity for precision public health. Future studies based on populations from other African regions may provide validation of transferability to SSA more generally, identify further novel signals and generate more insights into the relationship between these associations and disease pathophysiology.

### DATA AVAILABILITY STATEMENT

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

### ETHICS STATEMENT

This study received the approval of the Human Research Ethics Committee (Medical), University of the Witwatersrand/South Africa (M121029), the approval of the Centre Muraz Institutional Ethics Committee/Burkina Faso (015-2014/CE-CM) and the approval of the National Ethics Committee For Health Research/Burkina Faso (2014-08-096), the Ghana Health Service Ethics Review Committee (ID No: GHS-ERC:05/05/2015) and the Navrongo Institutional Review Board (ID No: NHRCIRB178). All the participants signed an Informed Consent Form before any study procedure was performed.

### AUTHOR CONTRIBUTIONS

PB, HS, HT, AC, CM, and MR designed the study. PB and J-TB performed the analysis. DS performed the imputation. PB wrote the manuscript. PB, J-TB, AC, CM, HS, DS, SH, GA, EN, AO, HT and MR critically reviewed and approved the manuscript.

### FUNDING

This study was funded by the National Institutes of Health (NIH) through the H3Africa AWI-Gen project (NIH grant number U54HG006938) and the Wits Non-Communicable Disease Research Leadership Programme (NIH Fogarty International Centre grant number D43TW008330). AWI-Gen is supported by the National Human Genome Research Institute (NHGRI), Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD), Office of the Director (OD) at the National Institutes of Health. PB is funded by the National Research Foundation/The World Academy of Sciences "African Renaissance Doctoral Fellowship" (Grant no. 100004).

### ACKNOWLEDGMENTS

This study would not have been possible without the generosity of the participants who spent many hours responding to questionnaires, being measured and having samples taken. We wish to acknowledge the sterling contributions of our field workers, phlebotomists, laboratory scientists, administrators, data personnel, and all other staff who contributed to the data and sample collections, processing, storage, and shipping. Investigators responsible for the conception and design of the AWI-Gen study include the following: MR (PI, Wits), Osman Sankoh (co-PI, INDEPTH), Stephen Tollman, and Kathleen Kahn (Agincourt PI), Marianne Alberts (Dikgale PI), Catherine Kyobutungi (Nairobi PI), HT (Nanoro PI), AO (Navrongo PI), Shane Norris (Soweto PI), and SH, Nigel Crowther, Himla Soodyall, and Zane Lombard (Wits). We would like to acknowledge each of the following investigators for their significant contributions to this research, mentioned according to affiliation: Wits AWI-Gen Collaborative Centre: Stuart Ali, AC, SH, Freedom Mukomana, Cassandra Soo; Soweto (DPHRU): Nomses Baloyi, Yusuf Guman.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01354/full#supplementary-material

### REFERENCES


relative inflammation status. Circ. Cardiovasc. Genet. 4 (6), 595–604. doi: 10.1161/CIRCGENETICS.111.960773


multitissue gene regulation in humans. Science 348 (6235), 648–660. doi: 10.1126/science.1262110


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Boua, Brandenburg, Choudhury, Hazelhurst, Sengupta, Agongo, Nonterah, Oduro, Tinto, Mathew, Sorgho and Ramsay. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Spinal Muscular Atrophy in the Black South African Population: A Matter of Rearrangement?

Elana Vorster<sup>1</sup>\*, Fahmida B. Essop<sup>1</sup>, John L. Rodda<sup>2</sup> and Amanda Krause<sup>1</sup>

<sup>1</sup> National Health Laboratory Service and School of Pathology, University of the Witwatersrand, Johannesburg, South Africa, <sup>2</sup> Department of Paediatrics, University of the Witwatersrand, Johannesburg, South Africa

Spinal muscular atrophy (SMA) is a neuromuscular disorder, characterized by muscle atrophy and impaired mobility. A homozygous deletion of survival motor neuron 1 (SMN1), exon 7 is the main cause of SMA in ~94% of patients worldwide, but only accounts for 51% of South African (SA) black patients. SMN1 and its highly homologous centromeric copy, survival motor neuron 2 (SMN2), are located in a complex duplicated region. Unusual copy number variations (CNVs) have been reported in black patients, suggesting the presence of complex pathogenic rearrangements. The aim of this study was to further investigate the genetic cause of SMA in the black SA population. Multiplex ligation-dependent probe amplification (MLPA) testing was performed on 197 unrelated black patients referred for SMA testing (75 with a homozygous deletion of SMN1, exon 7; 50 with a homozygous deletion of SMN2, exon 7; and 72 clinically suggestive patients with no homozygous deletions). Furthermore, 122 black negative controls were tested. For comparison, 68 white individuals (30 with a homozygous deletion of SMN1, exon 7; 8 with a homozygous deletion of SMN2, exon 7 and 30 negative controls) were tested. Multiple copies (>2) of SMN1, exon 7 were observed in 50.8% (62/122) of black negative controls which could mask heterozygous SMN1 deletions and potential pathogenic CNVs. MLPA is not a reliable technique for detecting carriers in the black SA population. Large deletions extending into the rest of SMN1 and neighboring genes were more frequently observed in black patients with homozygous SMN1, exon 7 deletions when compared to white patients. Homozygous SMN2, exon 7 deletions were commonly observed in black individuals. No clear pathogenic CNVs were identified in black patients but discordant copy numbers of exons suggest complex rearrangements, which may potentially interrupt the SMN1 gene. Only 8.3% (6/72) of clinically suggestive patients had heterozygous deletions of SMN1, exon 7 (1:0) which is lower than previous SA reports of 69.5%. This study emphasizes the lack of understanding of the architecture of the SMN region as well as the cause of SMA in the black SA population. These factors need to be taken into account when counseling and performing diagnostic testing in black populations.

Keywords: spinal muscular atrophy, survival motor neuron 1, survival motor neuron 2, multiplex ligation-dependent probe amplification, copy number variations, rearrangement, South Africa

#### Edited by:

Nicola Mulder, University of Cape Town, South Africa

#### Reviewed by:

Brunhilde Wirth, University of Cologne, Germany Zeljka Pezer, Rudjer Boskovic Institute, Croatia

> \*Correspondence: Elana Vorster vorstere@ampath.co.za

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 16 November 2018 Accepted: 17 January 2020 Published: 13 February 2020

#### Citation:

Vorster E, Essop FB, Rodda JL and Krause A (2020) Spinal Muscular Atrophy in the Black South African Population: A Matter of Rearrangement? Front. Genet. 11:54. doi: 10.3389/fgene.2020.00054

### INTRODUCTION

Spinal muscular atrophy (SMA) is an autosomal recessive neurological disorder, characterized by the progressive degeneration of anterior horn cells (lower motor neurons) of the spinal cord, causing symmetrical muscle atrophy, weakness and paralysis. Historically, SMA was categorized into four clinical subtypes (SMA I–IV), ranging in severity, maximum muscle activity achieved, and age of onset, although it has been suggested that the SMA phenotype rather spans a continuum (Prior et al., 2004). SMA type I is the most severe form, with onset usually at birth or before six months with an average lifespan of two years. SMA type II is an intermediate form with an onset between 6 and 18 months (Fried and Emery, 1971); SMA type III is a mild form with onset after 18 months (Kugelberg and Welander, 1956) and SMA type IV is the mildest form with adult onset (Pearn et al., 1978).

A previous study suggested that the clinical presentation of SMA in black South African (SA) patients differs from worldwide reports with more frequent involvement of facial muscles in the severe infantile form of SMA leading to an expressionless facies (Moosa and Dawood, 1990). This is supported by clinical observation, but has not been scientifically documented.

SMA has been reported to be the second most common autosomal recessive disorder in Caucasian individuals after cystic fibrosis. The predicted birth incidence of SMA varies between 1 in 6,000 and 1 in 10,000 with a carrier frequency estimated at 1 in 40 to 1 in 60 worldwide (Hendrickson et al., 2009). The birth incidence of SMA in black SA patients has been estimated to be much higher at 1 in 3,574. This indicates that SMA may have a higher birth incidence than albinism (birth incidence: 1 in 3,900) in the black SA population. The carrier rate of SMA was previously estimated to be 1 in 23 in the white SA population and 1 in 50 in the black SA population (Labrum et al., 2007).
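The link between birth incidence and carrier frequency for a recessive disorder can be illustrated with a short Hardy-Weinberg calculation. The sketch below is only an approximation under stated assumptions (random mating, full penetrance, every case carrying two disease alleles, no de novo mutations); the function name is hypothetical, and empirically measured carrier rates such as those cited above need not match the values it returns.

```python
from math import sqrt


def carrier_frequency_from_incidence(births_per_case):
    """Expected carrier frequency (2pq) when incidence equals q squared."""
    q = sqrt(1.0 / births_per_case)  # disease-allele frequency
    p = 1.0 - q                      # normal-allele frequency
    return 2.0 * p * q


for births_per_case in (6000, 10000):
    carriers = carrier_frequency_from_incidence(births_per_case)
    print(f"incidence 1 in {births_per_case}: carrier frequency about {carriers:.4f}")
```

A gap between the frequency expected from incidence and the frequency measured by dosage testing is one reason to suspect that some carrier chromosomes are not being detected.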

SMA is caused by mutations within the survival motor neuron 1 gene (SMN1; OMIM #600354<sup>1</sup>). A homozygous deletion of SMN1, exon 7 is reported to cause SMA in ~94% of patients with SMA worldwide (Hendrickson et al., 2009). In contrast, only 51% of SA SMA cases have been reported to be caused by a homozygous deletion of SMN1, exon 7 (Stevens et al., 1999; Labrum et al., 2007). An SMN1 deletion in conjunction with a second mutation results in a compound heterozygote pattern and accounts for an additional 2–5% of patients with SMA worldwide (Wirth, 2000). The rate of heterozygous SMN1, exon 7 deletions in black SA patients with SMA who tested negative for the homozygous SMN1, exon 7 deletion was previously reported to be as high as 69.5%, supporting the diagnosis of SMA in these patients and suggesting that SMA is probably due to additional unidentified mutations in this region (Labrum et al., 2007).

The SMN1 gene and its highly homologous copy, survival motor neuron 2 (SMN2; OMIM #601627<sup>1</sup>), are located in the SMN region on chromosome 5q13. The SMN region consists of multiple copy genes, pseudogenes (Selig et al., 1995), repetitive sequences (Bürglen et al., 1996), and retrotransposon-like elements (Francis et al., 1995), resulting in a large 500 kb inverted duplication, containing both a telomeric copy (SMN1) and a centromeric copy (SMN2) of the region. The historic terms "telomeric" and "centromeric" refer to the relative positions of the SMN1 and SMN2 genes, respectively, within the SMN critical region at chromosome 5q13. As a result of the complexity and hypervariability of the region, there is currently no complete and accurate map of the SMN region.

Homozygous deletions of the centromeric SMN2, exon 7 are not thought to be pathogenic (Schwartz et al., 1997), but are commonly encountered in black SA patients referred for SMA testing (Stevens et al., 1999; Labrum et al., 2007). A number of studies have suggested that SMN2 acts as a disease modifying gene, as SMA disease severity is inversely correlated with SMN2 copy number (McAndrew et al., 1997; Wirth et al., 1999; Feldkötter et al., 2002; Jedrzejowska et al., 2008). SMA type I patients tend to have two copies of SMN2; SMA type II and type IIIa (onset before three years) patients, three copies; SMA type IIIb (onset after three years) patients, four copies; and SMA type IV patients, four to six copies (reviewed by Mercuri et al., 2018).

Recombination between SMN1 and SMN2 could potentially interrupt a critical region of SMN1, leading to the loss of full-length functional SMN transcripts. SMA type II and III patients have been shown to have gene conversions from SMN1 to SMN2 rather than deletions, resulting in a higher copy number of SMN2, which has been associated with a milder phenotype (Campbell et al., 1997). A high frequency (31.5%) of black SA patients with SMA were shown to have smaller deletions including SMN1, exon 7, but with exon 8 present, possibly due to gene conversions (Stevens et al., 1999). Additional evidence for this hypothesis was reported by Labrum et al. (2007) who observed a lower frequency of large deletions spanning SMN1, exons 7, 8, and the NLR family, apoptosis inhibitory protein gene (NAIP) in black SA patients (9.8%) when compared to white SA patients (41.7%).

Other genes located in the duplicated SMN region at chromosome 5q13 include NAIP, GTF2H2, and SERF1A and their multiple pseudo copies. The lack of understanding of the physical structure and orientation of these genes in the SMN region hampers a better understanding of their role in the SMA disease mechanism.

The SMN protein is present in both the cytoplasm and nucleus of all cells, but is particularly abundant in motor neurons. The SMN protein's main function involves the assembly of small nuclear ribonucleoprotein (snRNP) complexes important for pre-messenger RNA splicing (Markowitz et al., 2012).

SMN1 and SMN2 differ in only 5 nucleotides of sequence, with the critical difference being a silent C to T transition at cDNA position 840 (c.840C > T) in SMN2, resulting in the exclusion of exon 7 during splicing and causing the majority of SMN2 transcripts to be truncated and unstable (Lorson et al., 1999; Monani et al., 1999). Only 20% of the total full-length SMN (FL-SMN) transcript is produced from the SMN2 gene, which partly compensates for the lack of FL-SMN transcript produced from SMN1 in patients with SMA, but does not produce sufficient SMN protein levels in motor neurons for their survival (Zheleznyakova et al., 2011). A milder phenotype (SMA types III and IV) has been associated with four or more copies of SMN2 (Wirth et al., 2006). Recently, Nusinersen, an antisense oligonucleotide drug that modifies splicing of SMN2, has been shown to lead to an increase in the total FL-SMN transcripts of SMN2, leading to improvements in motor function (Finkel et al., 2017).

<sup>1</sup> Online Mendelian Inheritance in Man (OMIM), OMIM accession numbers: 600354, 601627, http://www.omim.org.

Approximately 4% of American and Canadian individuals have been found to have heterozygous SMN1 deletions with two SMN1 gene copies on a single chromosome in addition to a chromosome with a deletion of the SMN1 gene (2:0 genotype) (Scheffer et al., 2001). These individuals are SMA carriers since they have the ability to pass on a deletion chromosome to subsequent generations. Carrier testing is compromised since quantitative techniques cannot distinguish between two copies of SMN1 in trans (one copy on each chromosome) and two copies in cis (both copies on a single chromosome, with no copy on the second chromosome) (McAndrew et al., 1997). It is recommended that potential carriers with multiple copies of SMN1 be analyzed in a family context to clarify the phase of these copy number variations (CNVs) and to accurately assign carrier status.
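The ambiguity described above can be made concrete with a short sketch. It enumerates every way a measured total SMN1 copy number could be split across the two copies of chromosome 5 and flags the splits that include a zero-copy chromosome. The function names, and the simplification that only whole-gene copies are counted, are illustrative assumptions and not part of the method used in this study.

```python
def possible_phases(total_copies):
    """List (chromosome A, chromosome B) splits consistent with a total copy number.

    Assumption for illustration: a dosage assay reports only the total number of
    SMN1 copies, and the order of the two chromosomes does not matter.
    """
    return [(a, total_copies - a) for a in range(total_copies // 2 + 1)]


def is_carrier_phase(phase):
    """A split is a deletion-carrier configuration if one chromosome has 0 copies."""
    return 0 in phase


for total in (1, 2, 3):
    for phase in possible_phases(total):
        label = "carrier" if is_carrier_phase(phase) else "non-carrier"
        # Printed in the text's notation, larger count first (for example 2:0).
        print(f"total={total} phase={phase[1]}:{phase[0]} -> {label}")
```

In this sketch a total of two copies is consistent with both a 1:1 (non-carrier) and a 2:0 (carrier) arrangement, which is why dosage alone cannot assign carrier status.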

Studies performed on various American population groups, showed an unusually high frequency of multiple copies of SMN1 in the African American population when compared to other populations (Hendrickson et al., 2009; Sugarman et al., 2012). A study performed on unaffected individuals from various sub-Saharan African populations (Kenyan, Malian and Nigerian) confirmed this observation and showed a higher frequency of multiple copies of SMN1 and deletions of SMN2 than European populations (Sangaré et al., 2014).

SMA was previously thought to be rare in African populations, with limited studies performed in North and West Africa (Tunisia, Egypt, Nigeria, Algeria and Senegal), but this was likely an underestimation (Pelleboer et al., 1989; Tazir and Geronimi, 1990; Shawky et al., 2001; Ndiaye et al., 2002; Mrad et al., 2006).

It has been hypothesized that complex population-specific rearrangements of the SMN region could cause SMA in the black SA population (Labrum et al., 2007; Vorster et al., 2011). The main aim of this study was to investigate CNVs of the SMN region using the P021 multiplex ligation-dependent probe amplification (MLPA) probe mix (MRC Holland, Amsterdam, Netherlands), which has multiple probes spanning the SMN region, in an attempt to identify potential pathogenic CNVs which could contribute to the disease mechanism of SMA in the black SA population. A better understanding of potential pathogenic CNVs of the SMN region could improve diagnostic testing for the 49% of black SA patients affected with SMA who currently test negative for the common homozygous SMN1, exon 7 deletion.

#### SUBJECTS, MATERIALS, AND METHODS

#### Subjects

#### U/U<sup>b</sup> Patients

U/U<sup>b</sup> patients (Unidentified mutation/Unidentified mutation genotype) represent black patients who presented with symptoms clinically suggestive of SMA and who previously tested negative for a homozygous deletion of SMN1, exon 7 in a diagnostic setting using an in-house PCR and restriction enzyme assay. U/U<sup>b</sup> patients were identified and selected in collaboration with the Clinical Section of the Division of Human Genetics, molecular diagnostic laboratory, National Health Laboratory Service Johannesburg (NHLS), and the University of the Witwatersrand (WITS), henceforth referred to as "the Division" and in collaboration with the Departments of Paediatrics of the Chris Hani Baragwanath and Charlotte Maxeke Academic Hospitals.

In total, 72 U/U<sup>b</sup> patients were identified, nine of whom had muscle biopsies suggestive of SMA. MLPA analysis was performed on these patients to identify potential pathogenic CNV patterns. DNA samples of family members of U/U<sup>b</sup> patients were not available. These patients formed the main focus of this research study. DNA samples of all of these patients are stored in the Division.

#### Groups Used for Comparison

#### N/N<sup>b</sup> Individuals

N/N<sup>b</sup> individuals (Negative/Negative genotype) represent black controls negative for SMA. MLPA analysis was performed on family members of 61 N/N<sup>b</sup> families (200 individuals in total). In order to be included, DNA had to be available from two unrelated parents and at least one child. The unaffected parents of these families were used as negative controls in this study and consisted of a total of 122 unrelated N/N<sup>b</sup> individuals. Haplotypes were constructed from the MLPA data and family pedigrees were drawn to investigate potential novel CNV events in these families.

#### N/N<sup>w</sup> Individuals

N/N<sup>w</sup> individuals (Negative/Negative genotype) represent white controls negative for SMA. To compare the typical nonpathogenic CNV patterns of black and white individuals, 30 random unrelated N/N<sup>w</sup> individuals were tested on MLPA.

#### M1/M1 <sup>b</sup> Patients

M1/M1 <sup>b</sup> patients (Mutation 1: deletion of SMN1, exon 7/Mutation 1: deletion of SMN1, exon 7 genotype) represent black patients who were previously identified to have the common homozygous deletion of SMN1, exon 7 on a diagnostic PCR and restriction enzyme assay designed to detect and distinguish homozygous deletions of SMN1, exon 7 and SMN2, exon 7 (van der Steege et al., 1996). MLPA analysis was performed on 75 M1/M1 <sup>b</sup> patients to investigate the molecular structure of pathogenic CNV patterns, including the extent of homozygous deletions of SMN1, exon 7 and potential gene conversion events. Furthermore, 25 of these patients formed part of families (71 individuals in total). MLPA was performed on all family members and haplotypes were constructed from the MLPA data and family pedigrees to investigate the phase of potential common pathogenic CNV patterns.

#### M1/M1 <sup>w</sup> Patients

M1/M1 <sup>w</sup> patients (Mutation 1: deletion of SMN1, exon 7/Mutation 1: deletion of SMN1, exon 7 genotype) represent white patients who were previously identified to have a homozygous deletion of SMN1, exon 7 on the diagnostic PCR and restriction enzyme assay. Thirty random unrelated M1/M1 <sup>w</sup> patients were tested to compare the molecular structure of pathogenic CNV patterns between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> patients.

#### M2/M2 <sup>b</sup> Patients

M2/M2 <sup>b</sup> patients (Mutation 2: deletion of SMN2, exon 7/ Mutation 2: deletion of SMN2, exon 7 genotype) represent black patients who were previously identified to have a homozygous deletion of SMN2, exon 7 on the diagnostic assay. Fifty M2/M2 <sup>b</sup> patients were tested on MLPA to determine the underlying molecular structure of this common CNV and to understand the interaction between the SMN1 and SMN2 genes.

#### M2/M2 <sup>w</sup> Patients

M2/M2 <sup>w</sup> patients (Mutation 2: deletion of SMN2, exon 7/Mutation 2: deletion of SMN2, exon 7 genotype) represent white patients who were previously identified to have a homozygous deletion of SMN2, exon 7 on the diagnostic assay. Eight random unrelated M2/M2 <sup>w</sup> patients were tested on MLPA to compare the molecular structure of this CNV between M2/M2 <sup>b</sup> and M2/M2 <sup>w</sup> patients. Only eight M2/M2 <sup>w</sup> patients were included in this group, since they were the only M2/M2 <sup>w</sup> patients identified in the white SA population who were available.

#### Methods

#### DNA Extraction

Genomic DNA was extracted from whole blood using the salting out method (Miller et al., 1988), a commercial DNA extraction kit (High Pure PCR Template Preparation Kit, Roche Diagnostics), or in the case of chorionic villus sampling (CVS) and amniocyte material, the phenol-chloroform extraction method was used (Barker, 2004). All samples were processed and extracted in a diagnostic setting, with stringent quality control. The P021 probe mix was validated on blood, amniocyte material, and CVS samples. A quantity of 50–250 ng of DNA is recommended for MLPA<sup>2</sup> . All DNA samples were normalized in order to accurately compare probe copy numbers to each other.

#### MLPA Analysis

The MLPA P021 probe mix (MRC Holland, Amsterdam, Netherlands) is mainly designed to detect SMN1 and SMN2, exon 7 copy numbers. The P021 probe mix consists of a multiplex of 46 probes, consisting of seven DQ (dosage quality) control probes; two sex-chromosome specific probes (for gender determination and to detect sample mix-up); 22 internal reference probes, specific to various chromosomal regions and not associated with SMA; 15 probes specific to the SMN region (eight targeting the SMN1 and SMN2 genes and seven probes targeting neighboring genes). The DQ control probes amplify four Q fragments which determine whether sufficient DNA has been added to the reaction and whether ligation has been successful and two D fragments which determine whether successful denaturation of the DNA sample took place<sup>2</sup> .

Probes have been designed to target the critical one base pair difference between SMN1 and SMN2 in exon 7 and can therefore distinguish between exon 7 of SMN1 and SMN2. Similarly, probes have been designed to target a one base pair difference between SMN1 and SMN2, exon 8 and can therefore distinguish between exon 8 of SMN1 and SMN2. Probes specific to exons 1, 4, and 6 of the SMN1 and SMN2 genes as well as neighboring genes in the SMN region have been included to assist with determining the extent of deletions (He et al., 2013). Neighboring genes include the RAD17 checkpoint clamp loader component gene (RAD17) and telomeric as well as centromeric copies of the NAIP genes (NAIP/NAIPY), the general transcription factor IIH subunit 2 genes (GTF2H2), and the small EDRK-rich factor 1A genes (SERF1A/1B)<sup>3</sup> .

MLPA was performed using the Applied Biosystems (ABI) 9700 thermal cycler and fragment separation was performed using the ABI Genetic Analyzer 3130xl (Applied Biosystems, Foster City, CA, USA). Dosage analysis was performed using the freely available Coffalyser software (MRC Holland, Amsterdam, Netherlands) to quantify CNVs. By comparing the copy number of PCR products observed in a patient sample with endogenous reference probes and several external control samples, relative quantitative changes in DNA fragments can be determined (Schouten et al., 2002). The copy number of probe regions was determined using the parameters as set out in Table 1. MLPA results were analyzed by statistical analysis using Statistica (Dell, version 12.7) and Real Statistics Using Excel software<sup>4</sup> to compare dosage trends between different patient groups and to determine significant differences among patient and control groups.
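The normalization described above can be sketched as follows. This is a simplified illustration of a dosage quotient (DQ) calculation, not the Coffalyser algorithm: the function names, the data layout, the peak heights, and the assumption that the control samples carry two copies of the target are all hypothetical. The thresholds of Table 1 are deliberately not reproduced here.

```python
from statistics import mean


def normalized_signal(target_peak, reference_peaks):
    """Intra-sample normalization: target peak relative to the mean reference peak."""
    return target_peak / mean(reference_peaks)


def dosage_quotient(sample, controls, probe):
    """Inter-sample normalization against the mean of the control samples.

    Each sample is a dict with a 'targets' mapping (probe -> peak height) and a
    'references' list of reference-probe peak heights (hypothetical layout).
    """
    sample_ratio = normalized_signal(sample["targets"][probe], sample["references"])
    control_ratio = mean(
        normalized_signal(c["targets"][probe], c["references"]) for c in controls
    )
    return sample_ratio / control_ratio


# Invented peak heights, for illustration only.
controls = [
    {"targets": {"SMN1_ex7": 1000.0}, "references": [1000.0, 1100.0, 900.0]},
    {"targets": {"SMN1_ex7": 1050.0}, "references": [1050.0, 1000.0, 1100.0]},
]
patient = {"targets": {"SMN1_ex7": 1500.0}, "references": [1000.0, 1000.0, 1000.0]}

dq = dosage_quotient(patient, controls, "SMN1_ex7")
# Assuming the controls carry two copies, DQ * 2 approximates the copy number.
print(f"DQ = {dq:.2f}, approximate copy number = {dq * 2:.1f}")
```

Dividing by the reference probes first removes differences in the amount of DNA loaded per sample; dividing by the control samples then removes probe-specific amplification efficiency, so that the quotient reflects relative copy number only.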

MLPA has been shown to be a reliable technique to detect multiple copy numbers in regions of segmental duplication, without the need for repeat testing or replicates (Cantsilieris et al., 2014). Furthermore, the analytical sensitivity and specificity of the P021 probe mix have been reported to be >99%<sup>2</sup> . The P021 probe mix was validated using samples of individuals whose SMN1 copy number was previously identified using the in-house PCR and restriction enzyme diagnostic assay or through the ISO 17043 accredited external quality assessor, the European Molecular Genetics Quality Network (EMQN)<sup>3</sup> . Furthermore, negative, homozygous SMN1, exon 7 deletion and heterozygous SMN1, exon 7 deletion control samples were included in every experiment to ensure consistency among experiments. MLPA experiments were repeated and/or excluded when results did not adhere to quality requirements.

TABLE 1 | The relationship of the P021 probe mix dosage quotient with copy number.

Adapted from recommendations by MRC Holland (P021 SMA Product Description<sup>2</sup>).

<sup>2</sup> MRC Holland MLPA general protocol and P021 product description, http://www.mlpa.com/WebForms/.

<sup>3</sup> EMQN, https://www.emqn.org.

<sup>4</sup> Real Statistics Using Excel, http://www.real-statistics.com.

Haplotype analysis of pedigrees of N/N<sup>b</sup> and M1/M1 <sup>b</sup> families was performed to determine common chromosomal CNV patterns/haplotypes in the black SA population. Multiple copies of genes in the SMN region were assumed to be located on chromosome 5 and the gene order was based on current map data from the Ensembl genome browser (genome build: GRCh38)<sup>5</sup> . Family pedigrees were categorized as informative if a clear pattern of inheritance from parents to their children could be established for each of the MLPA probes. Family pedigrees were categorized as uninformative if the phase of the CNVs could not be correctly determined. Multiple copies of a specific probe region complicate the assignment of phase because various combinations are possible within a family, and apparent discrepant results could arise due to potential novel CNV events, family members who are not related as specified, or technical MLPA faults. All discrepant results were repeated on MLPA to ensure the accuracy of these results.
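The distinction between informative, uninformative, and discrepant pedigrees can be illustrated with a minimal sketch for a single probe and a single parent-child trio. It assumes no de novo events and correctly specified relationships; the function names and the one-probe simplification are illustrative assumptions, and the actual analysis considered all probes and all available family members together.

```python
from itertools import product


def splits(total):
    """Unordered ways to place `total` copies on two homologous chromosomes."""
    return [(a, total - a) for a in range(total // 2 + 1)]


def consistent_configurations(mother_total, father_total, child_total):
    """Parental phase assignments under which the child's total can be inherited.

    Assumes no de novo events and correct stated relationships (illustrative only).
    """
    configs = []
    for m_split, f_split in product(splits(mother_total), splits(father_total)):
        # The child receives one chromosome from each parent.
        if any(m + f == child_total for m in m_split for f in f_split):
            configs.append((m_split, f_split))
    return configs


def classify(mother_total, father_total, child_total):
    """Label a trio by how many parental phase assignments explain the child."""
    n = len(consistent_configurations(mother_total, father_total, child_total))
    if n == 0:
        return "discrepant"
    return "informative" if n == 1 else "uninformative"


print(classify(2, 2, 4))  # only a 2:0 arrangement in both parents fits
print(classify(2, 2, 2))  # 1:1 x 1:1 and 2:0 x 2:0 both fit
print(classify(2, 2, 5))  # no phase assignment fits
```

As the copy numbers in the parents increase, the number of possible splits grows, which is consistent with the observation that multiple copies of a probe region make phase assignment harder.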

An ethics application was approved unconditionally by the WITS Medical Human Research Ethics Committee (ethics clearance number: M130950).

#### RESULTS

#### MLPA Analysis

The results of MLPA analysis of SMN1 and SMN2 of the patient and control groups are summarized in Table 2. All MLPA data is available as Supplementary Material.

#### Comparative Analysis of N/N<sup>b</sup> and N/N<sup>w</sup> Individuals

The SMN1, exon 7 copy number was found to differ statistically between N/N<sup>b</sup> and N/N<sup>w</sup> individuals (Kruskal-Wallis test: H = 32.7; p < 0.0001). In this study, 50.8% (62/122) of N/N<sup>b</sup> individuals were found to have multiple copies (3–6) of SMN1, exon 7. These results stand in sharp contrast to trends observed in N/N<sup>w</sup> individuals, with only 3.3% (1/30) with multiple copies of SMN1, exon 7. Figure 1 compares the SMN1 copy numbers observed in SA populations with various international population groups.
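For readers unfamiliar with the test, the following is a minimal sketch of a tie-corrected Kruskal-Wallis comparison of two groups using only the Python standard library. It is not the Statistica or Real Statistics implementation used in this study, and the copy numbers in the example are invented. With two groups the H statistic is referred to a chi-square distribution with one degree of freedom, whose upper tail can be written with the complementary error function.

```python
from collections import Counter
from math import erfc, sqrt


def kruskal_wallis_two_groups(x, y):
    """Return (H, p) for two independent groups, with correction for ties."""
    pooled = list(x) + list(y)
    n = len(pooled)
    counts = Counter(pooled)
    if len(counts) < 2:
        raise ValueError("all observations are identical; H is undefined")
    # Give each distinct value the average of the ranks it occupies.
    rank_of = {}
    position = 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t + 1) / 2
        position += t
    h = 0.0
    for group in (x, y):
        rank_sum = sum(rank_of[v] for v in group)
        h += rank_sum ** 2 / len(group)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    # Ties are unavoidable with integer copy numbers, so apply the tie correction.
    tie_term = sum(t ** 3 - t for t in counts.values())
    h /= 1 - tie_term / (n ** 3 - n)
    # Upper tail of chi-square with 1 degree of freedom.
    p = erfc(sqrt(h / 2))
    return h, p


# Invented SMN1, exon 7 copy numbers, for illustration only.
group_b = [2, 3, 4, 2, 3, 5, 2, 3, 4, 6]
group_w = [2, 2, 2, 2, 3, 2, 2, 2, 1, 2]
h, p = kruskal_wallis_two_groups(group_b, group_w)
print(f"H = {h:.2f}, p = {p:.4f}")

# The reported H = 32.7 with 1 degree of freedom gives p well below 0.0001.
print(erfc(sqrt(32.7 / 2)) < 0.0001)
```

A rank-based test is a natural choice here because copy numbers are small integers with many ties and are not normally distributed.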

Similarly, the telomeric SMN1, exon 8 and NAIP, exon 5 copy numbers differed significantly between N/N<sup>b</sup> and N/N<sup>w</sup> individuals (Kruskal-Wallis test: H = 14.3; p = 0.0002 and H = 7.2; p = 0.0071, respectively). As seen with the SMN1, exon 7 region; 54.9% (67/122) of N/N<sup>b</sup> individuals were found to have multiple copies (>2 copies) of SMN1, exon 8 and 37.7% (46/122) were found to have multiple copies of NAIP, exon 5, which are both assumed to co-locate with SMN1 in the telomeric SMN region. Once again, these results stand in contrast to trends observed in N/N<sup>w</sup> individuals, with only 6.7% (2/30) of individuals with multiple copies of SMN1, exon 8 and only 13.3% (4/30) with multiple copies of NAIP, exon 5. The SMN1, exon 7 and 8 copy numbers did not correlate fully, suggesting that some of these copies may not be contiguous.

The centromeric SMN2, exon 7 copy number was not found to differ significantly between N/N<sup>b</sup> and N/N<sup>w</sup> individuals (Kruskal-Wallis test: H = 0.09; p = 0.7652). The majority of N/N<sup>b</sup> and N/N<sup>w</sup> individuals had two copies of SMN2, exon 7; 59.8% (73/122) and 60% (18/30), respectively. The SMN2, exon 8 copy number was found to differ significantly between N/N<sup>b</sup> and N/N<sup>w</sup> individuals (Kruskal-Wallis test: H = 11.1; p = 0.0009). N/N<sup>b</sup> individuals had a higher rate of homozygous SMN2, exon 8 deletions [27% (33/122)] than N/N<sup>w</sup> individuals [6.7% (2/30)]. This finding could be due to gene conversions from centromeric SMN2, exon 8 to telomeric SMN1, exon 8 resulting in hybrid genes consisting of SMN1, exon 7 and SMN2, exon 8. Deletions of these hybrid genes would result in a loss of SMN1, exon 7 in conjunction with SMN2, exon 8.

No N/N<sup>b</sup> (0/122) or N/N<sup>w</sup> (0/30) individuals appeared to have a detectable heterozygous deletion of SMN1, exon 7 (1:0 genotype), the genotype usually accepted to indicate an SMA carrier. Two N/N<sup>b</sup> families had de novo CNV events, giving a new mutation rate of 3.3% (2/61).

N/N<sup>b</sup> individuals had a significantly higher variance in copy number than N/N<sup>w</sup> individuals for the majority of MLPA probes. Whereas the copy numbers of the telomeric SMN1, exons 7, 8 and NAIP, exon 5 of N/N<sup>w</sup> individuals seem to cluster at two copies, as expected, the copy numbers of SMN1, exons 7, 8 and NAIP, exon 5 of N/N<sup>b</sup> individuals vary extensively from 1 to 6. Figure 2 shows the difference in variance in copy number of SMN1, exon 7 between N/N<sup>b</sup> and N/N<sup>w</sup> individuals. In N/N<sup>w</sup> individuals the SMN1, exon 7 copy number clusters around integers (one or two copy numbers) whereas the SMN1, exon 7 copy number varies extensively in N/N<sup>b</sup> individuals (one to six copy numbers). It is hypothesized that some of these multiple SMN1 gene copies may be partial discontinuous gene copies which may be non-functional due to interruptions in the coding region.

Further, the SERF1B and RAD17 copy numbers were not found to differ significantly between N/N<sup>b</sup> and N/N<sup>w</sup> individuals (Kruskal-Wallis test: H = 0.80; p = 0.367 and H = 0.00; p = 0.967, respectively). Multiple copies (more than 2) of RAD17 were observed in 14.8% (18/122) of N/N<sup>b</sup> individuals in contrast to no CNVs of RAD17 in N/N<sup>w</sup> individuals, suggesting that N/N<sup>b</sup> individuals have a higher variability of these regions than N/N<sup>w</sup> individuals.

<sup>5</sup> Ensembl Genome Browser, https://www.ensembl.org.

TABLE 2 | Summary of telomeric SMN1, exons 7 and 8 and centromeric SMN2, exons 7 and 8 copy number across various patient and control groups.


#### Comparative Analysis of M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> Patients

As expected, MLPA analysis confirmed SMN1, exon 7 homozygous deletions in all 75 M1/M1 <sup>b</sup> patients and all 30 M1/M1 <sup>w</sup> patients. The telomeric SMN1, exon 8 and NAIP, exon 5 copy numbers were found to differ statistically between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> patients (Kruskal-Wallis test: H = 7.8, p = 0.0053; H = 5.6, p = 0.0181, respectively). This significant difference could be due to homozygous SMN1, exon 8 deletions being more common in M1/M1 <sup>w</sup> patients [80% (24/30)] than M1/M1 <sup>b</sup> patients [50.7% (38/75)] and heterozygous deletions of NAIP, exon 5 being more common in M1/M1 <sup>w</sup> patients [56.7% (17/30)] than M1/M1 <sup>b</sup> patients [21.3% (16/75)].

The SMN2, exon 7 copy number was found to differ significantly between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> patients (Kruskal-Wallis test: H = 9, p = 0.0027), likely due to a higher frequency of multiple copies (> 2) of this region in M1/M1 <sup>w</sup> patients [36.7% (11/30)] when compared to M1/M1 <sup>b</sup> patients [10.7% (8/75)]. Multiple copies of SMN2, exon 7 in conjunction with homozygous deletions of SMN1, exon 7 suggest gene conversion from telomeric SMN1, exon 7 to centromeric SMN2, exon 7 being more common in M1/M1 <sup>w</sup> patients.

Due to identical sequences of exons 1, 4, 6, and another region of exon 8 of the SMN1 and SMN2 genes, the P021 probe mix cannot distinguish between SMN1 and SMN2 for these regions and will give a combined copy number result (representing both the SMN1 and SMN2 genes: SMN1/2). An absence of these probes could therefore represent a deletion in either or both copies of the SMN1 and SMN2, which complicates analysis.

Homozygous deletions of SMN1/2, exons 1, 4, and 6 were more frequently observed in M1/M1 <sup>b</sup> individuals [exon 1: 60% (45/75), exon 4: 61.3% (46/75) and exon 6: 62.7% (47/75)] than M1/M1 <sup>w</sup> individuals [exons 1, 4 and 6: 33.3% (10/30)]. The SMN1/2, exons 4 and 6 copy numbers were found to differ statistically between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> individuals (Kruskal-Wallis test: H = 4.4, p = 0.0355 and H = 4.2, p = 0.0399, respectively). The SMN1/2, exon 1 copy number was not found to differ significantly between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> individuals (Kruskal-Wallis test: H = 1; p = 0.3188). The MLPA probes for exons 1, 4, and 6 cannot distinguish between SMN1 and SMN2 but since these patients have homozygous deletions of SMN1, exons 7 and 8, the exons 1, 4, and 6 deletions are most likely located in SMN1. The SMN1/2, exons 1, 4, and 6 copy numbers did not correlate fully with each other or with the SMN1, exon 7 and 8 copy numbers in M1/M1 <sup>b</sup> or M1/M1 <sup>w</sup> individuals, suggesting that these copies may not be contiguous.

The GTF2H2, exon 5 and NAIP/NAIPY, exon 13 copy numbers were found to differ statistically between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> individuals (Kruskal-Wallis test: H = 10.8, p = 0.001 and H = 13.8, p = 0.0002, respectively). Deletions of GTF2H2, exon 5 and NAIP/NAIPY, exon 13 were more frequently observed in M1/M1 <sup>b</sup> individuals [66.7% (50/75) and 61.3% (46/75), respectively] than M1/M1 <sup>w</sup> individuals [50% (15/30) and 23.3% (7/30), respectively].

SERF1B copy numbers were found to differ significantly between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> individuals (Kruskal-Wallis test: H = 3.9, p = 0.0480), most likely due to a higher frequency of heterozygous deletions of SERF1B in 34.7% (26/75) of M1/M1 <sup>b</sup> individuals compared to 0% (0/30) in M1/M1 <sup>w</sup> individuals. There was no difference in copy number of RAD17 between M1/M1 <sup>b</sup> and M1/M1 <sup>w</sup> individuals.

A higher frequency of deletions extending into the rest of SMN1/2 (exons 1, 4, and 6), NAIP/NAIPY, exon 13, GTF2H2, exon 5 and SERF1B in M1/M1 <sup>b</sup> patients when compared to M1/M1 <sup>w</sup> patients suggests that large deletions are more common in M1/M1 <sup>b</sup> patients than M1/M1 <sup>w</sup> patients, in contrast to the results obtained from SMN1, exon 8 and NAIP, exon 5 analysis.

#### Comparative Analysis of M2/M2 <sup>b</sup> and M2/M2 <sup>w</sup> Patients

From a previous retrospective audit of patients referred to the Division for SMA testing, performed from September 1991 to October 2015, it was shown that homozygous deletions of SMN2, exon 7 were identified in 12.4% (123/991) of black patients, 4.7% (9/192) of white patients, 4% (2/50) of Indian patients and 18.8% (3/16) of patients with mixed ancestry. There is a significantly higher percentage of SMN2, exon 7 deletions in black patients when compared to white patients (Chi-square test: χ<sup>2</sup> = 11.64; p = 0.000645).
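The comparison of black and white patients above can be reproduced from the counts given in the text. The sketch below builds the 2 x 2 table (deletion present or absent, by group) and computes two common chi-square statistics with one degree of freedom. Which variant the statistical software reported is not stated in the text, so treating the table this way is an assumption; the function names are illustrative.

```python
from math import erfc, log, sqrt

# Rows: black, white patients; columns: deletion present, deletion absent.
observed = [[123, 991 - 123], [9, 192 - 9]]


def expected_counts(table):
    """Expected cell counts under independence of row and column."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def cells(table):
    """Yield (observed, expected) pairs for every cell of the table."""
    for row_obs, row_exp in zip(table, expected_counts(table)):
        yield from zip(row_obs, row_exp)


def pearson_chi2(table):
    return sum((o - e) ** 2 / e for o, e in cells(table))


def likelihood_ratio_chi2(table):
    # Requires all observed counts to be non-zero.
    return 2 * sum(o * log(o / e) for o, e in cells(table))


def p_value_df1(stat):
    """Upper-tail chi-square probability for one degree of freedom."""
    return erfc(sqrt(stat / 2))


pearson = pearson_chi2(observed)
g = likelihood_ratio_chi2(observed)
print(f"Pearson chi-square = {pearson:.2f}, p = {p_value_df1(pearson):.6f}")
print(f"Likelihood-ratio chi-square = {g:.2f}, p = {p_value_df1(g):.6f}")
```

Under these assumptions the likelihood-ratio statistic comes to about 11.64 with p of about 0.000645, which matches the value reported above, whereas the uncorrected Pearson statistic is about 9.68. This suggests, but does not confirm, that the reported figure is a likelihood-ratio chi-square; both variants indicate a significant difference at the 0.05 level.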

MLPA analysis confirmed homozygous SMN2, exon 7 deletions in 100% (50/50) of M2/M2 <sup>b</sup> patients and 100% (8/8) of M2/M2 <sup>w</sup> patients identified. Homozygous deletions of SMN2, exon 8 were detected in 98% (49/50) of M2/M2 <sup>b</sup> patients and 100% (8/8) of M2/M2 <sup>w</sup> patients. Only one M2/M2 <sup>b</sup> individual had a smaller deletion of SMN2, exon 7 which did not extend into exon 8. CNVs of SMN2, exons 7 and 8 were not found to differ statistically between M2/M2 <sup>b</sup> and M2/M2 <sup>w</sup> patients (Kruskal-Wallis test: H = 2.4; p = 0.1252 and H = 0.2; p = 0.6921, respectively).

CNVs of NAIP/NAIPY, exon 13 were found to differ statistically between M2/M2 <sup>b</sup> and M2/M2 <sup>w</sup> individuals (Kruskal-Wallis test: H = 5.6; p = 0.0177), most likely due to deletions being more frequently observed in M2/M2 <sup>b</sup> individuals [44% (22/50)] than M2/M2 <sup>w</sup> individuals [0% (0/8)].

CNVs of SMN1/2, exons 1 (Kruskal-Wallis test: H = 0.07; p = 0.7904), 4 (Kruskal-Wallis test: H = 2.5; p = 0.1107), and 6 (Kruskal-Wallis test: H = 0.8; p = 0.3579), GTF2H2, exon 5 (Kruskal-Wallis test: H = 1.98; p = 0.1596) and SERF1B (Kruskal-Wallis test: H = 0.02; p = 0.8855) were not found to differ statistically between M2/M2 <sup>b</sup> and M2/M2 <sup>w</sup> individuals. There was no difference in copy number of RAD17 between M2/M2 <sup>b</sup> and M2/M2 <sup>w</sup> individuals. The MLPA probes for exons 1, 4, and 6 cannot distinguish between SMN1 and SMN2, but since these patients have homozygous deletions of SMN2, exons 7 and 8, the exons 1, 4, and 6 deletions are most likely located in SMN2.

#### Comparative Analysis of U/U<sup>b</sup> Patients With Black Control Groups (N/N<sup>b</sup>, M1/M1 <sup>b</sup> and M2/M2 <sup>b</sup>)

Significant differences (p < 0.05) between U/U<sup>b</sup> patients and M1/M1 <sup>b</sup> and M2/M2 <sup>b</sup> patients were observed, suggesting that hypothesized novel pathogenic CNVs are distinct from the common homozygous deletions of SMN1, exon 7 and SMN2, exon 7. U/U<sup>b</sup> patients more closely resembled N/N<sup>b</sup> individuals. Multiple copies (>2 copies) were observed for SMN1, exon 7 [38.9% (28/72)] and SMN1, exon 8 [50% (36/72)] in U/U<sup>b</sup> patients, which is similar to that found in N/N<sup>b</sup> individuals. A significant difference of the SMN1, exon 7 copy number between U/U<sup>b</sup> patients and N/N<sup>b</sup> individuals was observed and could be attributed to the presence of heterozygous SMN1, exon 7 deletions in 8.3% (6/72) of U/U<sup>b</sup> patients, not observed in any N/N<sup>b</sup> individuals [0% (0/122)]. This result stands in contrast to a previous South African study that reported a heterozygous SMN1, exon 7 deletion rate of 69.5% (16/23) (Labrum et al., 2007).

#### Haplotype and CNV Pattern Analysis

#### N/N<sup>b</sup> Families

Only 31.7% (19/60) of N/N<sup>b</sup> families were completely informative where the phase of CNVs could be determined with certainty. For 60% (36/60) of N/N<sup>b</sup> families, multiple combinations were possible, due to the presence of multiple copies of one or more probe regions. The exact locations of these multiple copies are uncertain. Discrepant results were observed in 5% (3/60) of N/N<sup>b</sup> families possibly due to non-paternity or novel deletion/duplication events in the proband. Two N/N<sup>b</sup> families had clear novel results with a new mutation rate of 3.3% (2/60). In total, 19 N/N<sup>b</sup> families, consisting of 38 unrelated parents were found to be informative, from which 76 haplotypes were constructed. In total, 35 unique haplotypes were identified, emphasizing the high variability of this region.

#### M1/M1 <sup>b</sup> Families

Only 44% (11/25) of M1/M1 <sup>b</sup> families were completely informative where the phase of CNVs could be assigned with certainty. For 48% (12/25) of these M1/M1 <sup>b</sup> families, multiple combinations of CNVs were possible, due to the presence of multiple copies of one or more probes. Discrepant results were observed in 8% (2/25) of M1/M1 <sup>b</sup> families which could be due to a variety of causes such as non-paternity or novel deletion or duplication events in the proband. In total, 22 pathogenic haplotypes were constructed from probands from M1/M1 <sup>b</sup> families. Of these, 17 unique haplotypes were identified, once again emphasizing the high variability of this region.

#### DISCUSSION

The SMN1 gene is the key gene associated with SMA with the SMN2 gene thought to have a disease-modifying effect. Current drug therapies are aimed at increasing the FL-SMN transcripts produced from SMN2. Potential large complex rearrangements of the SMN region may play a role in the SMA disease mechanism in the black SA population and may influence diagnosis and potentially the effect of drug therapies. Therefore it is valuable to investigate the genetic CNV background of the black SA population.

A major limitation of previous quantitative studies of the SMN region performed in African-American (Hendrickson et al., 2009; Sugarman et al., 2012) and sub-Saharan African populations (Sangaré et al., 2014) was that CNV analysis was performed in unaffected individuals, with the exception of prenatal screening performed by Sugarman et al. This study focuses on comparing CNVs in black individuals who are negative for SMA (N/N<sup>b</sup>) to identify non-pathogenic CNVs as well as patients with known homozygous SMN1 and SMN2, exon 7 deletions (M1/M1 <sup>b</sup> and M2/M2 <sup>b</sup>, respectively) and patients who are clinically suggestive of SMA (U/U<sup>b</sup>) to delineate potential pathogenic CNVs.

#### Multiple Copies of the Telomeric Region (SMN1, Exons 7 and 8 and NAIP, Exon 5) Were Observed in N/N<sup>b</sup> Individuals and Could Complicate Analysis

In this study, 50.8% of N/N<sup>b</sup> individuals were found to have multiple (3–6) copies of SMN1, exon 7, which is similar to previous reports of 46.8% (Hendrickson et al., 2009) and 47.1% (Sugarman et al., 2012) in African-American individuals and a combined percentage of 48.6% in sub-Saharan African populations (Mali, Nigeria and Kenya) (adapted from Sangaré et al., 2014). In contrast, N/N<sup>w</sup> individuals have a much lower percentage of multiple SMN1, exon 7 copies of 3.3%, which is comparable to previous reports of 6.3% in white North-American populations (Hendrickson et al., 2009) and a combined percentage of 2.6% in European populations (Germany, France and Sweden) (summarized in Figure 1, Feldkötter et al., 2002; Corcia et al., 2012).

Similarly, multiple copies of SMN1, exon 8 and NAIP, exon 5 were more frequently observed in N/N<sup>b</sup> individuals (54.9% and 37.7%, respectively) when compared to N/N<sup>w</sup> individuals (6.7% and 13.3%, respectively).

No SMA carriers (individuals with a heterozygous deletion of SMN1, exon 7) were identified in either N/N<sup>b</sup> or N/N<sup>w</sup> individuals in this study, in contrast to the previously predicted SA carrier rate of 1/50 in the black population and 1/23 in the white population (Labrum et al., 2007). Small sample sizes could have caused this discrepancy in both N/N<sup>b</sup> (n = 122) and N/N<sup>w</sup> (n = 30) individuals in this study.

Further, the discrepancy in N/N<sup>b</sup> individuals could also be due to two additional reasons. Firstly, MLPA analysis is a very robust technique, which has built-in statistical tests and extensive normalization to multiple exogenous regions, which are likely to yield more reliable and accurate results than the previously used in-house dosage system, which normalized results against a single exogenous region (Labrum et al., 2007). Secondly, it is likely that there is a higher frequency of heterozygous SMN1 deletion carriers (2:0, 3:0, 4:0, 5:0, 6:0, etc.) in the black SA population, not detectable by either of these two assays. These results are supported by a previous study performed by Sugarman et al. (2012) who reported heterozygous SMN1 deletions (2:0) to be more common in African-American individuals (27%, n = 4 883) when compared to white individuals (3.6%, n = 24 471). As a result, the carrier detection rate in African-American individuals was lower at 70% versus 91% in other population groups.

MLPA cannot provide information on the location or phase of multiple SMN1, exon 7 copies on an individual's chromosomes and therefore these multiple copies could be located on a single chromosome, resulting in a heterozygous SMN1 deletion carrier profile. MLPA is therefore not a reliable technique to detect SMA carriers in the black SA population.
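The phase problem can be sketched as follows. This is a simplified model, assuming that a dosage assay reports only the sum of the SMN1, exon 7 copies on both chromosomes; the genotype tuples and function names are illustrative.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # SMN1, exon 7 copies on each chromosome, e.g. (2, 0) for 2:0


def total_copies(genotype: Genotype) -> int:
    """Total copy number, which is all that a dosage assay can report."""
    return genotype[0] + genotype[1]


def is_carrier(genotype: Genotype) -> bool:
    """A carrier has one chromosome without SMN1, exon 7 and one with at least one copy."""
    return min(genotype) == 0 and max(genotype) > 0


for g in [(1, 0), (1, 1), (2, 0), (2, 1), (3, 0)]:
    print(f"{g[0]}:{g[1]}  total = {total_copies(g)}  carrier = {is_carrier(g)}")
```

In this model the 1:1 non-carrier and the 2:0 carrier both give a total of two copies, and the 2:1 non-carrier and the 3:0 carrier both give three, so the total alone cannot separate them without family or phase information.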

#### Large Deletions Extending Into the Rest of the SMN Region Appear to Be More Common in M1/M1 <sup>b</sup> Patients

SMN1, exon 8 has been reported to be deleted together with SMN1, exon 7 in 93% of positive SMA cases (Lefebvre et al., 1995). Furthermore, NAIP deletions have been associated with SMN1 deletions in 67% of SMA type I patients (Roy et al., 1995). A previous SA study proposed that M1/M1 <sup>w</sup> patients had larger homozygous deletions of SMN1, exon 7 also encompassing the telomeric SMN1, exon 8 and NAIP more often than M1/M1 <sup>b</sup> patients (Labrum et al., 2007). Furthermore, M1/M1 <sup>b</sup> patients were previously reported to have a homozygous SMN1, exon 7 deletion in conjunction with a homozygous NAIP deletion (SMN1, exon 8 is present), suggestive of a gene conversion from SMN1, exon 7 to SMN2, exon 7, more often than M1/M1 <sup>w</sup> patients (Stevens et al., 1999; Labrum et al., 2007). CNV results of SMN1, exons 7 and 8 and NAIP from this study did not differ significantly from results from the previous study in either M1/M1 <sup>w</sup> (Chi-square test: χ<sup>2</sup> = 0.9, p = 0.8) or M1/M1 <sup>b</sup> patients (Chi-square test: χ<sup>2</sup> = 1.6, p = 0.7).
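As a check on how the chi-square statistics above relate to their p-values, the sketch below evaluates the upper-tail probability of a chi-square distribution. The text does not state the degrees of freedom; three degrees of freedom are assumed here purely for illustration, because that choice reproduces the rounded p-values. The closed-form expression used is specific to three degrees of freedom.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Statistics as reported in the text; df = 3 is an assumption, not a reported value.
for label, stat in [("M1/M1 w", 0.9), ("M1/M1 b", 1.6)]:
    print(f"{label}: chi2 = {stat}, p = {chi2_sf_df3(stat):.2f}")
```

Both values round to the reported p = 0.8 and p = 0.7, consistent with the conclusion that the CNV distributions of the two studies do not differ significantly.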

MLPA analysis using the P021 probe mix offers a more extensive look into the rest of the SMN region, suggesting a different hypothesis. NAIP and GTF2H2 deletions have been observed in patients with SMA (Roy et al., 1995; Carter et al., 1997) and could therefore provide some information on the extent of SMN1, exon 7 deletions. A higher frequency of homozygous and heterozygous deletions extending into the rest of SMN1/2 (exons 1, 4, 6, and 8), NAIP/NAIPY, exon 13, GTF2H2, exon 5 and SERF1B was observed in M1/M1 <sup>b</sup> patients compared to M1/M1 <sup>w</sup> patients. These observations suggest that large deletions are more common in M1/M1 <sup>b</sup> patients than in M1/M1 <sup>w</sup> patients, in contrast to the results obtained when only SMN1, exon 8 and NAIP, exon 5 are analyzed.

Hybrid genes could mask larger deletions of the SMN1 gene and could have confounded previous SA reports (Stevens et al., 1999; Labrum et al., 2007), creating the impression of smaller deletions in M1/M1 <sup>b</sup> patients when compared to M1/M1 <sup>w</sup> patients. In support of this hypothesis, the CNVs of the centromeric SMN2, exons 7 and 8 do not correlate in N/N<sup>b</sup>, M1/M1 <sup>b</sup>, and N/N<sup>b</sup> individuals, suggesting that these two regions may not be contiguous, potentially due to gene conversions or other rearrangements.

In contrast, M1/M1 <sup>w</sup> patients had a higher frequency of multiple copies of SMN2, exon 7 (3 copies: 33.3%, 4 copies: 3.3%) when compared to M1/M1 <sup>b</sup> patients (3 copies: 6.7%, 4 copies: 4%), suggesting that gene conversion from telomeric SMN1, exon 7 to centromeric SMN2, exon 7 might be more common in M1/M1 <sup>w</sup> patients.

#### Homozygous SMN2, Exons 7 and 8 Deletions Could Form Part of the Normal Variation in N/N<sup>b</sup> Individuals

The high frequency of homozygous SMN2, exon 7 deletions in N/N<sup>b</sup> individuals (27%) when compared to N/N<sup>w</sup> individuals (6.7%) suggests that these deletions form part of the general variation in the black SA population. These frequencies are similar to international reports of ~10% (Corcia et al., 2002; Gamez et al., 2002).

Primates only have one copy of the SMN1 gene. It has been hypothesized that the SMN region in early humans consisted of only the SMN1 gene. Due to the hypervariable nature of the SMN region, duplications of the region resulted in multiple copies of the SMN1 gene, often observed in individuals of African descent (Dennis et al., 2017). This scenario is supported by the observation of multiple copies of SMN1 in conjunction with the lack of SMN2 in black SA individuals (N/N<sup>b</sup> and U/U<sup>b</sup> individuals). The duplicated SMN1 gene diverged into the SMN2 gene due to mutations (more specifically, the critical c.840C > T change in exon 7). A CNV containing both the SMN1 and SMN2 genes is more commonly observed in individuals of European descent (Kelter et al., 2000). A loss of SMN1 could take place as a result of a deletion of SMN1 or a gene conversion from SMN1 to SMN2, an arrangement observed more frequently in individuals of European descent (van der Steege et al., 1996). The higher rate of gene conversion in white SA patients (M1/M1 <sup>w</sup>) supports this hypothesis. Figure 3 summarizes the different CNVs of the SMN region and their evolution.

Human-specific segmental duplication of the SMN region resulting in the inverted centromeric SMN duplication (including SMN2) has been estimated to have taken place 0.3 mya. The exact method of further duplication of the SMN ancestral structure to the structure of the human reference today has been difficult to determine due to polymorphic, palindromic duplications of the region (Dennis et al., 2017).

#### No Novel Pathogenic CNVs Were Identified in U/U<sup>b</sup> Patients

In contrast to previous SA reports of heterozygous SMN1, exon 7 deletions being present in 69.5% of U/U<sup>b</sup> patients (Labrum et al., 2007), only 8.3% of U/U<sup>b</sup> patients were confirmed to have heterozygous deletions of SMN1, exon 7 in this study. This discrepancy could firstly be due to MLPA analysis being a very robust technique, which has built-in statistical tests and extensive normalization, which are likely to yield more reliable and accurate results than the previously used in-house dosage assay (Labrum et al., 2007). In support of this hypothesis, seven individuals previously reported to have heterozygous deletions of SMN1, exon 7 on the in-house dosage assay were retested on MLPA of which four individuals had discrepant results. Two of these individuals had two copies and two individuals had three copies of SMN1, exon 7, which were mistaken for one copy on the in-house dosage assay.

Secondly, the presence of multiple copies of SMN1, exon 7 could mask the actual heterozygous SMN1, exon 7 deletion rate in U/U<sup>b</sup> patients.

No novel pathogenic CNVs were identified in U/U<sup>b</sup> patients. The presence of potential large complex rearrangements in the black SA population not detectable by current standard diagnostic techniques is supported by the high variability and lack of correlation of copy number between different genes and exons seen in black SA individuals. SMN genes and exons may not have contiguous coding regions and the relationship between these complex rearrangements and the effect on SMN protein expression needs to be further investigated.

#### Haplotype and CNV Pattern Analysis

As part of a previous study performed in the Division, linkage analysis, using two intragenic and six extragenic microsatellite markers across the SMN1 gene, was performed to see if a common chromosomal background could be established in U/U<sup>b</sup> patients. No clear haplotype or common allele was identified and it was reported that it was particularly difficult to construct haplotypes (Labrum et al., 2007).

Similarly, in this study, multiple gene and exon copies in the black SA population complicated haplotype construction. Only 44% of M1/M1 <sup>b</sup> families and 31.7% of N/N<sup>b</sup> families were completely informative where the phase of the haplotype could be determined with certainty. The orientation of genes in the SMN region is not known and this study could not predict the arrangement of genes or exons even though the phase could be determined.

Potential novel events (sporadic deletions or duplications) were observed in 3.3% (2/61) of N/N<sup>b</sup> families. A new mutation rate of 3.3% is not unexpected as novel mutations have been reported at a rate of 2% due to the high instability of this region (Wirth et al., 1997). Moosa and Dawood reported a paucity of family history in black SA families affected by SMA potentially due to SMA being more sporadic in this population (Moosa and Dawood, 1990) although this may be due to poor ascertainment. Sporadic mutations could be explained as a consequence of novel gene conversion and rearrangement events.
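The comparison between the observed rate (2/61) and the previously reported rate of 2% can be made more concrete with a confidence interval. The sketch below uses the Wilson score interval, which is a choice made here for illustration and is not a method described in the study; the function name and the 95% level are likewise assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Two potential novel events in 61 N/N b families, as reported above.
low, high = wilson_interval(2, 61)
print(f"observed rate = {2 / 61:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

With only 61 families the interval is wide (roughly 1% to 11%) and includes the 2% reported by Wirth et al. (1997), which is consistent with the statement that a rate of 3.3% is not unexpected.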

Two variants, c.885+83T > G and c.885+667delAT in exon 8 of the SMN1 gene, have been described to be associated with multiple SMN1 copies on a single chromosome in combination with a SMN1, exon 7 deletion on the other chromosome in African American and Ashkenazi Jewish population groups. It was suggested that these two variants are associated with heterozygous SMN1 deletions (2:0) and could be useful in refining the carrier risk in individuals who have multiple copies of SMN1, exon 7 (Luo et al., 2014). The association of the c.885+83T > G and c.885+667delAT variants with heterozygous SMN1 deletion (2:0) haplotypes in the black SA population was investigated as part of a previous unpublished study in the Division. Both the c.885+83T > G and c.885+667delAT variants were observed in 60% (3/5) of individuals with known heterozygous SMN1 deletions (2:0) compared to 0% (0/7) of individuals known to have two copies of SMN1, exon 7, with one copy on each of their two chromosomes (1:1), suggesting that the two variants are associated with duplicated SMN1, exon 7 alleles. These two variants could be useful in refining the carrier risk of black SA individuals who have multiple copies of SMN1, exon 7.
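To give a sense of how much weight counts as small as 3/5 versus 0/7 can carry, the sketch below applies a two-sided Fisher's exact test to those counts. This test is not reported in the text and is added here only as an illustration; the function name and the 2x2 table layout are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= observed * (1 + 1e-9)
    )


# Rows: 2:0 individuals, 1:1 individuals; columns: variants present, variants absent.
p_value = fisher_exact_two_sided(3, 2, 0, 7)
print(f"p = {p_value:.3f}")
```

With twelve individuals in total, such a calculation is best read as a rough indication that the pattern deserves follow-up in a larger sample rather than as a firm estimate of the association.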

### Implications for Diagnostic Testing

With the advent of new sequencing technologies, pan-ethnic population-based expanded carrier screening has been gaining momentum internationally. As an example, Israel has implemented a genetic screening program including carrier testing for SMA for couples of reproductive age (Zlotogora et al., 2016). The American College of Medical Genetics (ACMG) supports the inclusion of SMA into expanded carrier screening tests (Prior and Professional Practice and Guidelines Committee, 2008). Caution is advised against population screening in the black SA population, due to the presence of multiple copies of SMN1, exon 7, which could significantly impair accurate carrier detection and lead to false negative carrier results. MLPA may be useful in detecting the carrier risk in members of M1/M1 <sup>b</sup> families, but it is highly recommended that MLPA results of samples referred for prenatal and carrier testing be analyzed within a family context to identify the phase of multiple SMN1 gene copies in all SA populations.

FIGURE 3 | Hypothetical haplotypes representing the transformation of the SMN region from ancestral to modern populations. (A) represents the proposed order of genes in the SMN1 region. The SMN region in primates and early humans is thought to have consisted of only one copy of the SMN1 gene. (B) Duplications of the SMN1 region resulted in multiple copies of the SMN1 gene, frequently observed in individuals of African descent (Dennis et al., 2017). This is supported by the observation of multiple copies of SMN1 in conjunction with the lack of SMN2 in black SA individuals (N/N<sup>b</sup> and U/U<sup>b</sup> individuals). (C) Mutations in the duplicated SMN1 gene resulted in the SMN2 gene. A chromosome consisting of one SMN1 and one SMN2 gene is thought to be the most common genotype seen in populations of European descent (N/N<sup>w</sup> individuals). (D, E) represent individuals with a deletion of SMN1. A homozygous deletion of SMN1, exon 7 causes SMA (M1/M1 <sup>w</sup> and M1/M1 <sup>b</sup>). (E) Deletions of SMN1 in M1/M1 <sup>w</sup> individuals are commonly caused by gene conversions from SMN1 to SMN2, resulting in multiple copies of SMN2.

#### Challenges and Limitations of the Study

The SMN region is extremely complex, containing multiple pseudogenes (Selig et al., 1995) and repetitive sequences (Bürglen et al., 1996) within a large inverted segmental duplication. Due to this complexity, there is limited understanding of the exact order and location of genes in the SMN region. This is complicated even further since CNV trends observed in the various patient and control groups tested on MLPA suggest that large rearrangements in the SMN region form part of the general variation within the black SA population.

It is well established that African populations have a higher level of genetic diversity than any other population (Tishkoff and Williams, 2002; Tishkoff and Kidd, 2004; Conrad et al., 2006; Pickrell et al., 2014). A local group of researchers who investigated CNVs in SA populations found that haplotype block lengths are significantly smaller in African populations when compared to non-African populations. These regions seem to coincide with recombination hotspots (Chimusa et al., 2015). Very few of these recombination hotspots seem to be shared between African and other populations (Choudhury et al., 2014; Chimusa et al., 2015). Perhaps the high variability of the SMN region in the black SA population could be due to frequent recombination events in the SMN region. This hypothesis is supported by a new mutation rate of 3.3% in this study which is comparable to the high new mutation rate seen in other populations (Wirth et al., 1997). Novel events may also influence the recurrence risk in black SA SMA families.

We need to identify and comprehend non-pathogenic CNVs in the general SA population to fully understand disease mechanisms overlaying these variations, specifically in the SMN region. The Southern African Human Genome Project (SAHGP) shows some promise in creating a better understanding of the baseline CNVs in the general black SA population (Pepper, 2011) although it is unlikely to provide detailed information on the architecture of the SMN region.

The high sequence homology of the SMN1 and SMN2 genes, with only a five-nucleotide difference between the two genes, and the highly variable CNVs of these genes make molecular diagnosis extremely challenging. This limitation complicates and restricts the design of primers and probes in this region and limits the choice of laboratory techniques that can be used to understand this region better. The P021-A2 MLPA kit was mainly designed to distinguish between exons 7 and 8 of the SMN1 and SMN2 genes, but cannot distinguish between exons 1, 4, and 6 of the SMN1 and SMN2 genes. A combined result was therefore observed for these probes, which makes it difficult to assign multiple copies of specific exons to either SMN1 or SMN2. Similarly, the NAIP/NAIPY, exon 13 probe was designed to detect the combined copy number of the NAIP gene and its centromeric copy, NAIPY, and the GTF2H2, exon 5 probe was designed to detect the combined copy number of telomeric-GTF2H2 and centromeric-GTF2H2.

The copy numbers of SMN1/2 exons 1, 4, 6, and 8 do not correlate fully with each other or with the SMN1, exons 7 and 8 copy numbers in any of the groups, suggesting that CNVs of the SMN1 and SMN2 genes do not consist of complete gene copies and that the exons may be non-contiguous. This is further complicated by gene conversion events between SMN1 and SMN2. Other factors that could potentially influence within- and between-sample variance are sample quality and experimental design. Result interpretation is therefore incredibly difficult and it is not possible to construct accurate CNV patterns using MLPA.

#### Future Studies

MLPA testing cannot give us information about the functionality of potential multiple, partial copies. RNA expression studies may be able to quantify the expression of FL-SMN transcripts which may be a more accurate indication of the amount of SMN protein produced in U/U<sup>b</sup> patients even in the presence of multiple SMN1, exon 7 copies. If multiple copies of the SMN1 gene are present on MLPA, but there is no corresponding FL-SMN transcript, it could be indicative of partial/interrupted nonfunctional SMN1 copies. The expression of SMN transcripts using real-time reverse transcription PCR (qRT-PCR) is being investigated as part of a current study in the Division.

Sixteen additional genes with overlapping phenotypes to SMA have been shown to be associated with non-5q forms of SMA (Peeters et al., 2014). Due to the lack of clinical information on patients referred for SMA testing to the Division it may be more practical to perform testing by an NGS neuromuscular panel first to exclude other related neuromuscular diseases and other causes of SMA before continuing SMA testing in individuals who test negative for the homozygous SMN1, exon 7 deletion.

A previous SA study sequenced the SMN1 gene in patients found to have heterozygous SMN1, exon 7 deletions on the previously used in-house dosage assay to look for additional mutations in SMN1 (Labrum et al., 2007). No pathogenic mutations were identified. Since the accuracy of the previously used in-house dosage assay has been questioned by this study, all U/U<sup>b</sup> and M2/M2<sup>b</sup> individuals who were found to have heterozygous deletions of SMN1, exon 7 should be sequenced to try to find a potential second pathogenic mutation. The high homology of the SMN1 and SMN2 genes complicates sequencing analysis; however, this challenge can be overcome with long range PCR targeting the SMN1 gene, followed by nested PCR and Sanger sequencing of exons 1–8 (Kubo et al., 2015).

PacBio single molecule, real-time sequencing (SMRT) technology<sup>6</sup> has shown some promise in resolving large CNVs. This technology is currently limited to 20 kb reads, which may still be too small to detect the full sequence of the SMN region, which is at least 500 kb. The MinION (Oxford Nanopore Technologies) nanopore sequencer generates ultralong sequencing reads of up to 800 kb, which have been shown to improve the accuracy of, and to close gaps in, the reference human genome (Jain et al., 2018). These long range sequencing technologies could be investigated to try to further define the structure of the SMN region in the black SA population.

### CONCLUSION

This is the first report summarizing CNV patterns of the SMN region in African patients with known homozygous SMN1 and SMN2, exon 7 deletions (M1/M1 <sup>b</sup>, M2/M2 <sup>b</sup>) and patients who have features clinically suggestive of SMA (U/U<sup>b</sup>). This is also the first report of CNV patterns of the SMN region in the general black SA population.

Multiple copies of SMN1, exon 7 were observed as evidence of the marked hypervariability of the SMN region in the black SA population. These multiple copies potentially confound diagnostic and carrier testing and could potentially consist of partial, non-contiguous copies. Future studies investigating the expression of these multiple gene copies may provide information on their functional effect. No clear additional pathogenic CNV patterns were identified in U/U<sup>b</sup> patients. This study emphasizes the lack of understanding of the architecture of the SMN region and the composition of CNVs in the black SA population. These factors need to be taken into account when counselling and performing diagnostic, carrier and prenatal testing in the black SA population.

#### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article and the supplementary files.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the WITS Medical Human Research Ethics Committee. Written informed consent of patients was not required as all DNA samples were referred and banked in a diagnostic setting and have been anonymised for the purpose of this study. The protocol was approved by the WITS Medical Human Research Ethics Committee (ethics clearance number: M130950).

<sup>6</sup> PacBio, https://www.pacb.com/smrt-science/smrt-sequencing/

#### AUTHOR CONTRIBUTIONS

EV, FE, and AK contributed to the conception and design of the study. JR assisted in identifying appropriate patients for this study. EV performed all laboratory work, MLPA analysis, haplotype analysis, and statistical analysis as part of her MSc (Medicine) Human Genetics degree (obtained with distinction). EV wrote the first draft of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

### FUNDING

National Health Laboratory Service (NHLS) Research Trust Grant (2014-2DEV41-EVO1). WITS Faculty Research Council Grant.

#### ACKNOWLEDGMENTS

We thank the National Health Laboratory Service (NHLS) for the use of their facilities for this project; MRC Holland, who generously sponsored 800 MLPA probe mix reactions for this project; and Ms Haseena Khan, who investigated the association of the c.885+83T > G and c.885+667delAT variants with heterozygous SMN1 deletion (2:0) carrier haplotypes in the black SA population as part of her BSc (Hons) project. This manuscript is based on findings reported in a Masters dissertation: Vorster (2017) Determining the molecular basis of spinal muscular atrophy in the black South African population (University of the Witwatersrand, Johannesburg, South Africa): http://wiredspace.wits.ac.za/handle/10539/25810.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00054/full#supplementary-material




Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Vorster, Essop, Rodda and Krause. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
