# THE IMPORTANCE OF DIVERSITY IN PRECISION MEDICINE RESEARCH

EDITED BY: Dana C. Crawford, Jessica Nicole Cooke Bailey and William Scott Bush

PUBLISHED IN: Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-099-5 DOI 10.3389/978-2-88966-099-5



Topic Editors:

Dana C. Crawford, Case Western Reserve University, United States

Jessica Nicole Cooke Bailey, Case Western Reserve University, United States

William Scott Bush, Case Western Reserve University, United States

Citation: Crawford, D. C., Bailey, J. N. C., Bush, W. S., eds. (2020). The Importance of Diversity in Precision Medicine Research. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-099-5

# Table of Contents


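*Trans-Ethnic Polygenic Analysis Supports Genetic Overlaps of Lumbar Disc Degeneration With Height, Body Mass Index, and Bone Mineral Density*

Xueya Zhou, Ching-Lung Cheung, Tatsuki Karasugi, Jaro Karppinen, Dino Samartzis, Yi-Hsiang Hsu, Timothy Shin-Heng Mak, You-Qiang Song, Kazuhiro Chiba, Yoshiharu Kawaguchi, Yan Li, Danny Chan, Kenneth Man-Chee Cheung, Shiro Ikegawa, Kathryn Song-Eng Cheah and Pak Chung Sham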


Brittany M. Hollister, Eric Farber-Eger, Melinda C. Aldrich and Dana C. Crawford

*Systematic Review and Meta-Analysis to Establish the Association of Common Genetic Variations in Vitamin D Binding Protein With Chronic Obstructive Pulmonary Disease*

Ritesh Khanna, Debparna Nandy and Sabyasachi Senapati


Briseida E. Feliciano-Astacio, Katrina Celis, Jairo Ramos, Farid Rajabli, Larry Deon Adams, Alejandra Rodriguez, Vanessa Rodriguez, Parker L. Bussies, Carolina Sierra, Patricia Manrique, Pedro R. Mena, Antonella Grana, Michael Prough, Kara L. Hamilton-Nelson, Nereida Feliciano, Angel Chinea, Heriberto Acosta, Jacob L. McCauley, Jeffery M. Vance, Gary W. Beecham, Margaret A. Pericak-Vance and Michael L. Cuccaro

*Motivations for Participation in Parkinson Disease Genetic Research Among Hispanics versus Non-Hispanics*

Karen Nuytemans, Clara P. Manrique, Aaron Uhlenberg, William K. Scott, Michael L. Cuccaro, Corneliu C. Luca, Carlos Singer and Jeffery M. Vance

*Understanding Participation in Genetic Research Among Patients With Multiple Sclerosis: The Influences of Ethnicity, Gender, Education, and Age*

Michael L. Cuccaro, Clara P. Manrique, Maria A. Quintero, Ricardo Martinez and Jacob L. McCauley

# Editorial: The Importance of Diversity in Precision Medicine Research

Jessica N. Cooke Bailey<sup>1</sup>\*, William S. Bush<sup>1,2</sup> and Dana C. Crawford<sup>1,2</sup>\*

*<sup>1</sup> Department of Population and Quantitative Health Sciences, Cleveland Institute for Computational Biology, Case Western Reserve University, Cleveland, OH, United States, <sup>2</sup> Department of Genetics and Genome Sciences, Case Western Reserve University, Cleveland, OH, United States*

Keywords: precision medicine, diversity, genomics, personalized medicine, research participant, genetic ancestry, polygenic risk scores, social determinants of health

**Editorial on the Research Topic**

#### **The Importance of Diversity in Precision Medicine Research**

Personalized or precision medicine is meant to distinguish tailored treatment from trial and error. The contemporary concept has evolved to specifically include the 'omic profile of a patient in the prevention, diagnosis, and treatment of disease. Rapid genomic discoveries made possible through genome-wide association studies (GWAS) coupled with decreasing costs of sequencing and genotyping have shifted precision medicine from an academic exercise to clinical reality for some conditions (e.g., Wigle et al., 2017; Claassens et al., 2019; Hamdan et al., 2019; Lim, 2019; Roden, 2019), while others are not far behind. The emergence of electronic health records (EHRs) now makes it possible to both perform population-scale research and effectively deliver personalized medicine to the individual patient through clinical decision support.

#### Edited by:

*Daniel Shriner, National Human Genome Research Institute (NHGRI), United States*

#### Reviewed by:

*Nora Franceschini, University of North Carolina at Chapel Hill, United States*

#### \*Correspondence:

*Jessica N. Cooke Bailey, jnc43@case.edu; Dana C. Crawford, dana.crawford@case.edu*

#### Specialty section:

*This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics*

> Received: *22 June 2020* Accepted: *17 July 2020* Published: *26 August 2020*

#### Citation:

*Cooke Bailey JN, Bush WS and Crawford DC (2020) Editorial: The Importance of Diversity in Precision Medicine Research. Front. Genet. 11:875. doi: 10.3389/fgene.2020.00875*

While the promise of precision medicine is great, several identifiable gaps in current research limit its reach to all potential patients. One key deficiency is the lack of diversity among biomedical research participants, which limits both the generalizability and availability of genomic-based treatments or prevention strategies. The vast underrepresentation of diverse populations in genetic/genomic studies (e.g., Sirugo et al., 2019) is highly problematic because genetic information gleaned from one population is not automatically transferable across populations (Popejoy and Fullerton, 2016). Without sample diversity, signals revealing powerful insights into genetic association and/or drug response can go undetected due to differences in linkage disequilibrium, allele frequencies, and genetic architecture. New initiatives and studies are now in place to ensure the inclusion of traditionally underrepresented groups, defined by race/ethnicity, socioeconomic status and/or position, geography, and age, in genomic research (Bentley et al., 2020). We therefore anticipate a swell of new data and methodologies accelerating the already rapid pace of precision medicine research.

Our goal for this Research Topic was to present original research, commentaries, perspectives, and reviews on the impact and importance of diversity in precision medicine research. Below, we briefly overview the nine accepted manuscripts and the context in which they address this goal.

### IMPORTANCE OF RECRUITMENT AND RETENTION OF DIVERSE PARTICIPANTS

A 2009 analysis of GWAS participants revealed that only 4% of DNA samples were from non-European participants (Need and Goldstein, 2009). By 2016, 20% of DNA samples were from non-European participants; however, this roughly fivefold increase in the non-European share (from 4% to 20%) was mainly due to expansion of studies in primarily East Asian ancestry populations (Popejoy and Fullerton, 2016). Taken together, less than 4% of samples analyzed were from individuals of African and Latin American ancestry, Hispanic people, and native or indigenous peoples, even though these are the most vulnerable and traditionally underserved populations worldwide (Popejoy and Fullerton, 2016). Inclusion of diverse groups is key to diversifying the pool from which precision medicine can be developed. Discerning factors that influence participation and incorporating these findings into inclusive ascertainment strategies are crucial; efforts must be made to understand ways in which diverse groups can be accessed and invited to participate, as well as to identify motivators and/or barriers affecting willingness to participate and to remain in studies (Perreira et al., 2020). Four publications in this Research Topic addressed the topics "development of culturally appropriate consent and recruitment strategies for precision medicine research" and "barriers to participation in research (such as access to technology, genomic literacy, concerns for digital data privacy, and factors that impact time or means to participate in research)."

To identify addressable issues and adjust enrollment protocols to improve participation among Hispanics, Nuytemans et al. sought to identify what motivates patients and caregivers affected by Parkinson's disease (PD) to participate in genomic research. They administered surveys to patients in the University of Miami Health System's Movement Disorders Clinic, where approximately 35% of patients identify as Hispanic. Among the more than 150 self-identified white PD patients and caregivers, approximately 60% of whom were Hispanic, Hispanics and non-Hispanics were equally motivated to participate in genetic research for PD, but Hispanic patients were less likely to be influenced by the promise of scientific advancements. This apparent lack of scientific interest was likely confounded by lower levels of educational attainment. The authors suggest that one potential reason for the underrepresentation of Hispanics in genetic research is that they are less often invited to participate in studies.

Also focused on motivators for research participation within a patient population impacted by a chronic disease, Cuccaro et al. surveyed individuals with multiple sclerosis (MS) participating in a genetic study of MS. The majority of approached study participants (95/101) were willing to participate in the survey; of these, over 80% were Hispanic and female. Survey respondents were asked to identify the primary reasons or motivations for participation. The most frequently cited reason was finding a cure, equally endorsed by Hispanic and non-Hispanic participants; having MS and helping future generations were also highly endorsed motivators, with Hispanics more frequently citing having MS and non-Hispanics more frequently citing finding new/better treatments. Overall, ethnicity was the only significant factor associated with willingness to participate.

The dearth of genetic data available for populations other than those of European descent extends to pharmacogenomics (PGx), the study of genomic information relevant to drug response to tailor dosing. Scherr et al. review challenges to recruiting African American participants in genomic studies and extrapolate these findings to PGx. Consistent with prior reports, their review highlighted African American distrust of the healthcare system, medical research, organization, and researchers as barriers to study participation. Authentic, intentional collaborations between researchers and communities are suggested as a means by which to begin overcoming distrust. Another overarching barrier was lack of knowledge or awareness regarding genomic studies. To reduce distrust and increase awareness, they suggest transparent and clearly described study protocols, educational messaging, and recruitment efforts that directly address existing attitudes and beliefs of distrust. Importantly, there was no evidence of lack of interest in research study participation; conversely, they found that African Americans are aware that participation in medical research is crucial to medical and scientific advancement. Thus, Scherr et al. suggest a focused approach to recruiting African American research study participants, including messaging that highlights altruism.

Alzheimer disease (AD) is another common, complex disease with later-in-life onset and for which most genetic and genomic studies to date have focused on individuals of European descent (e.g., Beecham et al., 2017). Feliciano-Astacio et al. describe the ascertainment approach applied in the Puerto Rico Alzheimer Disease Initiative (PRADI), a multisource recruitment effort to increase participation by Puerto Ricans in genomic research of AD, which currently has >670 participants. PRADI's successful recruitment was attained by establishing strong community engagement relationships and tailored recruitment of AD patients and families across multiple sites in Puerto Rico. Focused and deliberate recruitment efforts such as these will help ensure the inclusion of Hispanic and Latino populations in future precision medicine research efforts.

### POPULATION DIVERSITY AND GENE EXPRESSION

One publication in this Research Topic addressed "statistical methods for genomic data from multiple populations." A popular statistical method, known as PrediXcan (Gamazon et al., 2015), infers gene expression using genetic data. While PrediXcan is now widely used on a variety of datasets derived from many different populations, most gene expression datasets are from majority European-descent populations, and thus the reference panels used by PrediXcan are constructed from European-descent data. Mikhaylova and Thornton evaluate the accuracy of PrediXcan in predicting or inferring gene expression in diverse populations. Using a combination of Genetic European Variation in Disease (Geuvadis) RNA sequencing data and 1000 Genomes Project whole genome sequencing data, Mikhaylova and Thornton demonstrate that the performance of PrediXcan varies by population, with lower performance for African-descent populations compared with others available in the 1000 Genomes Project. The data suggest that prediction models developed using European reference panels are not necessarily transferable to other populations due to differences in allele frequency, linkage disequilibrium, and genetic admixture.

### SOCIAL DETERMINANTS OF HEALTH AND GENETIC ASSOCIATION STUDIES

For complex diseases and traits, genetic variation alone does not sufficiently explain the totality of risk or variation. While this observation is widely accepted, few genetic association studies incorporate important measures of lifestyle, environmental exposures, or social determinants of health associated with disease risk and health disparities. Hollister et al. address this challenge by applying their recently validated algorithm that defines socioeconomic status using electronic health records (Hollister et al., 2017) to a large clinical population of African American patients. All patients were clinically screened for hypertension, a complex condition disproportionately prevalent in African Americans (Fryar et al., 2017) that is independently associated with many common genetic variants and environmental exposures such as diet and socioeconomic status (Aburto et al., 2013; Giri et al., 2019; de las Fuentes et al., 2020; Glover et al., 2020; Hollister et al.). In the work presented herein, Hollister et al. tested for and possibly identified a statistical interaction between education, a recognized social determinant of health, and genetic variants contributing to blood pressure, underscoring the need for additional study of the potentially modifying effects of non-genetic factors for diseases with noted population differences.

### CANDIDATE GENE VARIATION AND CHRONIC OBSTRUCTIVE PULMONARY DISEASE

Two publications in this Research Topic addressed "Genomic discovery in non-European populations." Khanna et al. present a meta-analysis of 14 published studies investigating the association between variants rs4588 and rs7041 in the vitamin D binding protein (GC) locus and chronic obstructive pulmonary disease (COPD). Both GC rs4588 and rs7041 are robustly associated with vitamin D levels in GWAS of mostly European-descent populations (Manousaki et al., 2017; O'Brien et al., 2018). The meta-analysis presented by Khanna et al. includes both European- and Asian-descent populations. Both single SNP tests of association for COPD and evaluations of linkage disequilibrium and haplotypes using publicly available genomic and in silico data are presented for multiple populations to more fully describe the genetic epidemiology of these loci.

Nandy et al. evaluated the association of serum surfactant protein D (SFTPD) concentration and SFTPD rs721917 with chronic obstructive pulmonary disease (COPD) and acute exacerbation of COPD (AECOPD). Recent large GWAS of mostly European-descent populations have identified SFTPD rs721917 as associated with COPD at genome-wide significance (Hobbs et al., 2017; Sakornsakolpat et al., 2019). Nandy and colleagues identified and meta-analyzed results from eight independent published reports, which included six with serum SFTPD concentrations and three with SFTPD rs721917 genotype data for Asian populations from China, Lebanon, and Pakistan. As expected, both COPD and AECOPD were associated with serum SFTPD. However, while SFTPD rs721917 was significantly associated with both COPD and AECOPD in this meta-analysis, the direction of effect was opposite to that previously reported by recent GWAS of COPD (Hobbs et al., 2017; Sakornsakolpat et al., 2019). While limited in sample size, this small meta-analysis underscores the importance of generalizing GWAS findings in diverse populations.

### POLYGENIC RISK SCORES AND DIVERSE POPULATIONS

One publication in this Research Topic addressed "The use of genetic ancestry for genomic discovery (such as admixture mapping)." Genetic and polygenic risk score studies aggregate cumulative effects across genetic loci; effect sizes are typically estimated from GWAS that have traditionally been performed in samples of European descent. Unfortunately, polygenic risk scores do not always replicate in non-European ancestral groups [reviewed in (Sirugo et al., 2019)]. Focusing on Chinese and Japanese samples, Zhou et al. evaluated lumbar disc degeneration (LDD), another complex, age-related phenotype. The focus of this work was to investigate genetic overlap between LDD and four related risk factors. Strong association between polygenic scores for the related risk factors, constructed with weights from European-ancestry studies, and the corresponding phenotypes was detected. However, phenotype variances explained were lower than in prior European studies, thus reducing power to detect genetic overlaps. This study again emphasizes the importance of genetic studies inclusive of populations other than Europeans.

Taken together, this Research Topic is composed of nine publications that further emphasize the importance of diversity in precision medicine research and offer solutions to better ensure these translational research efforts are realized in the clinic for all to benefit.

### AUTHOR CONTRIBUTIONS

JC, WB, and DC conceived the idea for and wrote this editorial. All authors contributed to the article and approved the submitted version.

### ACKNOWLEDGMENTS

The authors acknowledge the Cleveland Institute for Computational Biology and NIH grants R13HG009481 and R13HG010286 for supporting scholarly discussion and conferences associated with this Research Topic.

### REFERENCES

Aburto, N. J., Ziolkovska, A., Hooper, L., Elliott, P., Cappuccio, F. P., and Meerpohl, J. J. (2013). Effect of lower sodium intake on health: systematic review and meta-analyses. BMJ 346:f1326. doi: 10.1136/bmj.f1326

Beecham, G. W., Bis, J. C., Martin, E. R., Choi, S. H., DeStefano, A. L., Van Duijn, C. M., et al. (2017). Clinical/scientific notes: the Alzheimer's disease sequencing project: study design and sample selection. Neurol. Genet. 3:e194. doi: 10.1212/NXG.0000000000000194


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Cooke Bailey, Bush and Crawford. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Trans-Ethnic Polygenic Analysis Supports Genetic Overlaps of Lumbar Disc Degeneration With Height, Body Mass Index, and Bone Mineral Density

Xueya Zhou<sup>1,2</sup>, Ching-Lung Cheung<sup>3,4</sup>, Tatsuki Karasugi<sup>5</sup>, Jaro Karppinen<sup>6</sup>, Dino Samartzis<sup>7</sup>, Yi-Hsiang Hsu<sup>8,9,10</sup>, Timothy Shin-Heng Mak<sup>4</sup>, You-Qiang Song<sup>4,11</sup>, Kazuhiro Chiba<sup>12</sup>, Yoshiharu Kawaguchi<sup>13</sup>, Yan Li<sup>1</sup>, Danny Chan<sup>11</sup>, Kenneth Man-Chee Cheung<sup>7</sup>, Shiro Ikegawa<sup>14</sup>, Kathryn Song-Eng Cheah<sup>11</sup> and Pak Chung Sham<sup>1,4</sup>\*

#### Edited by:

*Dana C. Crawford, Case Western Reserve University, United States*

#### Reviewed by:

*Kenneth M. Weiss, Pennsylvania State University, United States; Jing Hua Zhao, University of Cambridge, United Kingdom; Anne E. Justice, Geisinger Health System, United States*

#### \*Correspondence:

*Pak Chung Sham pcsham@hku.hk*

#### Specialty section:

*This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics*

> Received: *13 April 2018* Accepted: *02 July 2018* Published: *03 August 2018*

#### Citation:

*Zhou X, Cheung C-L, Karasugi T, Karppinen J, Samartzis D, Hsu Y-H, Mak TS-H, Song Y-Q, Chiba K, Kawaguchi Y, Li Y, Chan D, Cheung KM-C, Ikegawa S, Cheah KS-E and Sham PC (2018) Trans-Ethnic Polygenic Analysis Supports Genetic Overlaps of Lumbar Disc Degeneration With Height, Body Mass Index, and Bone Mineral Density. Front. Genet. 9:267. doi: 10.3389/fgene.2018.00267*

*<sup>1</sup> Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, Hong Kong, <sup>2</sup> Department of Systems Biology, Department of Pediatrics, Columbia University Medical Center, New York, NY, United States, <sup>3</sup> Department of Pharmacology and Pharmacy, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, Hong Kong, <sup>4</sup> Li Ka Shing Faculty of Medicine, Center for Genomic Sciences, The University of Hong Kong, Hong Kong, Hong Kong, <sup>5</sup> Department of Orthopaedic Surgery, Faculty of Life Sciences, Kumamoto University, Kumamoto City, Japan, <sup>6</sup> Medical Research Center Oulu, University of Oulu and Oulu University Hospital, Oulu, Finland, <sup>7</sup> Department of Orthopaedics and Traumatology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, Hong Kong, <sup>8</sup> Hebrew SeniorLife, Institute for Aging Research, Roslindale, MA, United States, <sup>9</sup> Harvard Medical School, Boston, MA, United States, <sup>10</sup> Molecular and Integrative Physiological Sciences Program, Harvard School of Public Health, Boston, MA, United States, <sup>11</sup> Li Ka Shing Faculty of Medicine, School of Biomedical Science, The University of Hong Kong, Hong Kong, Hong Kong, <sup>12</sup> Department of Orthopedic Surgery, National Defense Medical College, Tokorozawa, Saitama, Japan, <sup>13</sup> Department of Orthopaedic Surgery, Toyama University, Toyama Prefecture, Japan, <sup>14</sup> Laboratory of Bone and Joint Diseases, Center for Integrative Medical Sciences, RIKEN, Tokyo, Japan*

Lumbar disc degeneration (LDD) is the age-related breakdown of the fibrocartilaginous joints between lumbar vertebrae. It is a major cause of low back pain and is conventionally assessed by magnetic resonance imaging (MRI). Like most other complex traits, LDD is likely polygenic and influenced by both genetic and environmental factors. However, genome-wide association studies (GWASs) of LDD have uncovered few susceptibility loci due to limited sample sizes. Previous epidemiological studies of LDD also reported multiple heritable risk factors, including height, body mass index (BMI), bone mineral density (BMD), and lipid levels. Genetics can help elucidate causality between traits and suggest loci with pleiotropic effects. One such approach is the polygenic score (PGS), which summarizes the effect of multiple variants as the sum of alleles weighted by effects estimated from GWAS. To investigate genetic overlaps of LDD and related heritable risk factors, we calculated the PGS of height, BMI, BMD and lipid levels in a Chinese population-based cohort with spine MRI examination and a Japanese case-control cohort of lumbar disc herniation (LDH) requiring surgery. Because most large-scale GWASs were done in European populations, PGS of corresponding traits were created using weights from European GWASs. We calibrated their prediction performance in independent Chinese samples, then tested associations with MRI-derived LDD scores and LDH affection status. The PGS of height, BMI, BMD and lipid levels were strongly associated with their respective phenotypes in Chinese, but the phenotype variances explained were lower than in Europeans, which would reduce the power to detect genetic overlaps. Despite this, the PGS of BMI and lumbar spine BMD were significantly associated with LDD scores, and the PGS of height was associated with increased liability to LDH. Furthermore, linkage disequilibrium score regression suggested that osteoarthritis, another degenerative disorder that shares common features with LDD, also showed genetic correlations with height, BMI and BMD. The findings suggest a common key contribution of biomechanical stress to the pathogenesis of LDD and will direct the future search for pleiotropic genes.

Keywords: polygenic score, genetic correlation, causality, pleiotropy, lumbar disc degeneration, osteoarthritis

### INTRODUCTION

Human intervertebral discs (IVDs) are fibrocartilaginous structures that lie between adjacent vertebrae. These IVDs hold the vertebrae together, facilitate some vertebral motion, and act as shock absorbers to accommodate biomechanical loads (Oxland, 2016). Each IVD is composed of a gel-like nucleus pulposus surrounded by an annulus fibrosus and separated from the vertebral body by a cartilaginous endplate (Humzah and Soames, 1988). During one's lifetime, due to excessive physical loading, occupational injuries, aging, genetics, and other factors, the IVDs may degenerate and display marked biochemical and morphological changes (Buckwalter, 1995; Urban and Roberts, 2003). Currently, magnetic resonance imaging (MRI) is the gold standard for evaluating disc degeneration. Based on this imaging, numerous methods are available to grade and summarize different features indicative of degeneration, including signal intensity loss, bulging and herniation, as well as disc space narrowing (Battié et al., 2004; Cheung et al., 2009). Lumbar disc degeneration (LDD) is of clinical importance because it is believed to be a major cause of low back pain (Luoma et al., 2000; Livshits et al., 2011; Samartzis et al., 2011; Takatalo et al., 2011). Its severe form, lumbar disc herniation (LDH), in which disc material herniates into the epidural space and compresses a lumbar nerve root, can cause neuropathic pain (sciatica) radiating to the lower extremity (Ropper and Zafonte, 2015).

Twin studies have demonstrated a strong genetic contribution to LDD (Sambrook et al., 1999; Battié et al., 2008). However, searching for genetic variants associated with LDD has been a challenge due to discrepancies and non-standardization of phenotype definitions, inconsistencies in imaging technology, and limited sample sizes in genome-wide association studies (GWASs) (Eskola et al., 2012, 2014; Williams et al., 2013). Similar to most other complex traits, LDD is likely to be polygenic, with thousands of trait-associated variants, each with a tiny effect size.

In addition to age, sex, and environmental influences, LDD is also associated with several heritable risk factors including body mass index (BMI) (Liuke et al., 2005; Samartzis et al., 2011, 2012; Takatalo et al., 2013), bone mineral density (BMD) (Harada et al., 1998; Pye et al., 2006; Wang et al., 2011), and serum lipid levels (Leino-Arjas et al., 2008; Longo et al., 2011; Zhang et al., 2016). However, it is not fully clear whether these phenotypic associations have a genetic basis. Identifying genetic overlaps between LDD and related traits is useful for elucidating cause and effect, because genetic markers are not subject to reverse causation or confounding and can be used as instruments to infer causality using Mendelian randomization (Davey Smith and Hemani, 2014). Such overlaps can also suggest pleiotropic loci that reveal novel insights into biology (Solovieff et al., 2013).

Several methods have been developed to evaluate genetic overlap between traits by exploiting the polygenic architecture (Dudbridge, 2016). A polygenic score (PGS) of a trait is the summation of alleles across loci weighted by their effect sizes estimated from GWAS (Purcell et al., 2009). In its typical application, GWAS of a base phenotype is first done in a discovery sample. PGS can then be calculated in an independent testing sample using single-nucleotide polymorphisms (SNPs) whose p-values are below some threshold in the GWAS of the discovery sample, and used as a predictor of target phenotypes in the testing sample by regression analysis. PGS has been widely used to predict disease risk (Chatterjee et al., 2016), evaluate genetic overlaps across traits (Krapohl et al., 2016), and infer genetic architectures (Stahl et al., 2012; Palla and Dudbridge, 2015). Because phenotyping of LDD by MRI is expensive and labor intensive, sample sizes are usually limited for well-phenotyped cohorts. PGS can leverage GWAS meta-analysis results from large consortia to maximize the power to detect genetic overlaps and is most suitable for the current study of LDD. Some other methods, such as the bivariate linear mixed-effect model (Lee et al., 2012b; Vattikuti et al., 2012), would require individual-level genotypes for both the base and target phenotypes. The recently developed linkage disequilibrium score (LDSC) regression (Bulik-Sullivan et al., 2015) makes use of only summary-level association statistics and can account for sample overlaps between different studies, but it requires very large sample sizes that have not been available for LDD.

In this study, we applied PGS to investigate the genetic overlap of LDD with four related risk factors using the GWAS data of the Hong Kong Disc Degeneration (HKDD) population-based cohort (Cheung et al., 2009; Samartzis et al., 2012; Li et al., 2016) and a Japanese case-control cohort of LDH that required surgery (Song et al., 2013). We selected BMI, BMD and serum lipid levels as base phenotypes, based on their previously reported associations with LDD (Pye et al., 2006; Longo et al., 2011; Samartzis et al., 2012). Height was also included because of its association with chronic low back pain (Hershkovich et al., 2013; Heuch et al., 2015). Two semi-quantitative scores that summarize different aspects of LDD from lumbar spine MRI were used as target phenotypes in the HKDD cohort; LDH affection status was used as the third target phenotype in the Japanese case-control cohort. Because GWASs of base phenotypes were done in European populations whereas our testing samples were of East Asian ancestry, the performance of PGS in predicting base phenotypes was first evaluated in independent Chinese samples. Then we applied the best performing PGS of the base phenotype to test association with target phenotypes in testing samples (**Figure 1**). Results were then interpreted in light of previous epidemiological evidence and statistical power to detect association. To better understand the mechanism implied by the genetic overlaps and motivated by the suggestion that LDD and osteoarthritis (OA) may share common pathophysiological features (Loughlin, 2011; Ikegawa, 2013), we further tested if the base phenotypes that had genetic overlaps with LDD also showed genetic correlations with OA using the GWAS summary data of the arcOGEN study (Zeggini et al., 2012). Finally, we evaluated the predictive power of trans-ethnic PGS to aid the design of future studies.

### MATERIALS AND METHODS

# Study Samples

#### HKDD Cohort

The HKDD Study was a population-based cohort of approximately 3,500 Southern Chinese initiated to assess spinal phenotypes and their risk factors. All participants underwent T2-weighted MRI examination of the lumbar spine assessed by expert physicians (JK and KMC). Sample recruitment and MRI procedures have been described in detail previously (Cheung et al., 2009; Samartzis et al., 2012; Li et al., 2016). For the current study, we focused on two major aspects of LDD captured by different MRI features (**Figure 2a**). The first was signal intensity loss within the nucleus pulposus, which may represent loss of water content of the IVD. Its presence and severity at each lumbar disc was assessed by Schneiderman's grades (Schneiderman et al., 1987). Based on this grading scheme, each disc was given a score of 0–3, whereby 0 indicated normal and higher scores indicated increased severity. A disc degeneration score for each individual was calculated by the summation of Schneiderman's grades over all five lumbar discs. We also assessed disc displacement, represented as a bulging/protrusion or extrusion of disc material. An ordinal grade from 0 to 2 was assigned to each lumbar disc to indicate normal, bulge/protrusion or extrusion of disc material; for each individual, a disc displacement score was calculated by the summation of the grades over all five lumbar discs (Cheung et al., 2009). Age, sex, physical workload based on occupation, history of smoking, and history of lumbar spine injury were obtained by a questionnaire for all participants. Body height and weight were measured at the time when each subject underwent MRI, and BMI was calculated by dividing weight by height squared (kg/m<sup>2</sup>). A subset of the cohort (N = 815) also had their blood metabolite profiles measured by a quantitative serum nuclear magnetic resonance (NMR) platform (Soininen et al., 2009, 2015). Low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), triglycerides (TG) and total cholesterol (TC) were obtained as part of the NMR metabolite measures. Associations between LDD scores and other covariates were analyzed using multiple linear regression to account for correlations between predictor variables. The best fitting model was selected using the Akaike information criterion.

FIGURE 1 | The analysis framework. GWAS summary statistics of base phenotypes were obtained from published studies in European populations. The polygenic score (PGS) in a testing sample of East Asian population was calculated by weighted summation of alleles at approximately independent SNPs whose association *p*-values fall below some threshold in the discovery GWAS done in European populations. The performance of PGS to predict the base phenotype in East Asians was first evaluated in a validation sample. Then the best performing PGS of the base phenotype was used to test genetic overlap with lumbar disc degeneration (LDD) in the testing samples. In this study, we selected height, body mass index (BMI), bone mineral density (BMD) and lipid levels as base phenotypes. The prediction performance of PGS was evaluated in the HKDD cohort for height, BMI, and lipid levels, and in the HKOS cohort for BMD. Three LDD phenotypes were used as target phenotypes, including disc displacement and disc degeneration scores in the HKDD cohort and affection status of lumbar disc herniation (LDH) in the Japanese LDH case-control cohort.

A total of 2,373 individuals from the HKDD cohort were genotyped by Illumina HumanOmni-ZhongHua-8 Beadchip.

FIGURE 2 | Summary of phenotypes in the HKDD cohort. (a) Examples of magnetic resonance imaging show two major aspects of LDD. Disc displacement (left) is shown as bulging of disc material beyond the confines of the annulus fibrosus. The loss of proteoglycan and water content (right) within the nucleus pulposus is reflected by the signal intensity loss. The lumbar spine has 5 intervertebral segments, termed L1 through L5. S1 stands for the first sacral segment, which is immediately below the lumbar spine. (b) The prevalence of signal intensity loss and disc displacement at different levels of lumbar spine discs. Two ordinal grades (0–3 for signal intensity loss, 0–2 for disc displacement) were assigned to each lumbar disc to indicate the presence and severity of LDD, where 0 indicated normal and higher scores indicated increased severity. (c) The distribution of the two LDD scores. The disc degeneration score and displacement score were defined by the summation of grades over all disc levels for signal intensity loss and disc displacement, respectively. The two LDD scores are correlated in the population. The age threshold divides the HKDD cohort into two parts with roughly equal sample sizes. The older subjects tend to have higher disc degeneration scores and disc displacement scores. (d) Pairwise Pearson correlations between original phenotypes (upper triangle) and between residual phenotypes after adjusting for age and gender (lower triangle).

Basic genotyping and quality control (QC) procedures have been described in our previous study (Li et al., 2016). In this study, we used more stringent QC criteria that kept only individuals with a call rate >99% and common SNPs with minor allele frequency (MAF) >0.01. The genotypes were imputed to over 8 million common variants in Phase 3 of the 1000 Genomes reference panel using the Michigan Imputation Server (Das et al., 2016) and filtered to keep only common bi-allelic SNPs (MAF >0.01) with imputation quality metric r<sup>2</sup> ≥ 0.3.

#### LDH Case-Control Cohort

The Japanese LDH case-control cohort was part of our previous genetic study of LDD (Song et al., 2013). Hospitalized patients with LDH were ascertained on the basis of sciatica or severe low back pain requiring surgical treatment and confirmed by lumbar spine MRI. The controls were unrelated individuals from the general Japanese population as part of the Japan Biobank Project. All individuals were genotyped by Illumina HumanHap550v3 BeadChip. A total of 366 cases and 3,331 controls passed QC and were used in the association analysis. Genotypes were imputed to 2.5 million SNPs in the Phase 2 HapMap Project using IMPUTE2 (Howie et al., 2009), and association analysis at each SNP was performed by logistic regression assuming an additive model using SNPTEST (Marchini et al., 2007).

#### HKOS GWAS

The Hong Kong Osteoporosis Study (HKOS) was a prospective cohort study of over 9,000 Southern Chinese residents in Hong Kong (Cheung et al., 2017). BMD of the lumbar spine (LS-BMD) and femoral neck (FN-BMD) were measured by dual-energy X-ray absorptiometry. The age-corrected and standardized BMD was generated for each gender. A total of 800 unrelated females with extreme BMD were selected in the previous GWAS (Kung et al., 2010). The low BMD subjects were those with BMD Z-score ≤−1.28 at either the LS or FN; the high BMD subjects were those with BMD Z-score ≥1.0 at either of the two skeletal sites. All individuals were genotyped by Illumina HumanHap610Quad Beadchip, whereby 780 individuals passed QC. Association analysis at each SNP was performed by linear regression using PLINK (Purcell et al., 2007). Detailed genotyping, QC, and imputation procedures have been described elsewhere (Kung et al., 2010; Xiao et al., 2012).

#### arcOGEN Study

The arcOGEN study (http://www.arcogen.org.uk/) was a collection of unrelated, UK-based individuals of European ancestry with knee and/or hip OA from the arcOGEN Consortium (Panoutsopoulou et al., 2011; Zeggini et al., 2012). Cases were ascertained based on clinical evidence of the need for joint replacement or radiographic evidence of disease (Kellgren-Lawrence grade ≥2); controls were from the ancestry-matched (UK) population. A GWAS meta-analysis that included 7,410 cases and 11,009 controls as the discovery sample has been described previously (Zeggini et al., 2012). The summary statistics of the discovery GWAS were obtained by application to the consortium.

All studies were approved by local ethical committees. Written informed consent was obtained from all participants.

#### Statistical Analysis

#### Polygenic Score Regression

We selected height, BMI, BMD, and serum lipid levels as base phenotypes, and obtained GWAS summary data (**Table 1**). PGS was created by the two strategies described below and used to predict phenotypes through linear regression after accounting for other covariates. The non-genetic covariates were selected based on their association with each phenotype in the baseline multiple linear regression model (listed in footnotes of **Table 3** and **Table S7**). We validated the prediction performance of PGS on base phenotypes in the HKDD cohort (for height, BMI, and lipid levels) and the HKOS cohort (for LS- and FN-BMD). Then, the PGS with the best prediction performance for each base phenotype was used to test genetic overlap with LDD by predicting the two LDD scores in the HKDD cohort and the LDH affection status in the Japanese case-control cohort. When individual-level genotype data in the testing sample were not available (for the HKOS GWAS and LDH case-control cohort), PGS regression was performed using a summary-statistics-based algorithm implemented in the gtx R package (Johnson, 2012), and SNP genotypes of HapMap 3 East Asian samples were used as the reference panel for SNP clumping.


*BMI, body mass index; BMD, bone mineral density; OA, osteoarthritis; LDH, lumbar disc herniation; GLGC, Global Lipids Genetics Consortium; HKOS, Hong Kong Osteoporosis Study.* §*The GIANT consortium's BMI GWAS included samples from multiple ethnicities; only the result of the European samples was used.*

#*BMI and lipids summary data were generated from GWAS*+*Metabochip joint analysis, so sample sizes can vary across different SNPs.*

¶*The HKOS GWAS was part of the GEFOS meta-analysis and was the only study of non-European population in that study.*

The first strategy for calculating PGS used only known trait-associated SNPs that reached genome-wide significance in previous studies (GWAS hits). Individual PGS profiles were calculated by summing up the dosage of trait-increasing alleles from imputed genotypes weighted by the reported effect sizes. This strategy has the advantage of including secondary signals within the same locus and of increased accuracy of effect size estimates from a larger independent replication sample (for LS- and FN-BMD).

As a second strategy, we performed genome-wide PGS analysis using PRSice (Euesden et al., 2015). Briefly, summary statistics of the base GWAS were first aligned with genotyped SNPs of the testing sample. Then SNPs were pruned by a p-value-informed clumping algorithm [linkage disequilibrium (LD) r<sup>2</sup> < 0.1 across 500 kb] that selected the SNPs most associated with the base phenotype in a locus to generate sets of independent SNPs. PGS was created using clumped SNPs whose p-values in the base GWAS were below a pre-specified threshold. We varied p-value thresholds (1.0E-7, 1.0E-5, and from 1.0E-4 to 0.5 with a step of 0.0001) to select the one that maximized the variance explained (R<sup>2</sup>) for the base phenotype in the validation sample.
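The thresholding step can be illustrated with a small sketch. This is not the PRSice implementation: it assumes the SNPs have already been clumped, it omits covariate adjustment, and the function names and toy data are hypothetical.

```python
from statistics import mean

def polygenic_score(dosages, weights, pvalues, p_threshold):
    """Per-individual sum of allele dosages weighted by base-GWAS effect
    sizes, using only SNPs whose base-GWAS p-value is below p_threshold.
    Assumes the SNP set has already been clumped."""
    keep = [j for j, p in enumerate(pvalues) if p < p_threshold]
    return [sum(row[j] * weights[j] for j in keep) for row in dosages]

def r_squared(x, y):
    """Squared Pearson correlation (R^2 of a simple linear regression)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant score or phenotype explains no variance
    return sxy ** 2 / (sxx * syy)

def best_threshold(dosages, weights, pvalues, phenotype, thresholds):
    """Return the (threshold, R^2) pair that maximizes R^2 in the
    validation sample."""
    results = [
        (t, r_squared(polygenic_score(dosages, weights, pvalues, t), phenotype))
        for t in thresholds
    ]
    return max(results, key=lambda pair: pair[1])

# Hypothetical toy data: 4 individuals x 3 clumped SNPs.
toy_dosages = [[0, 1, 2], [1, 0, 1], [2, 2, 0], [1, 1, 1]]
toy_weights = [0.5, -0.2, 0.1]   # effect sizes from the base GWAS
toy_pvalues = [1e-8, 1e-3, 0.3]  # p-values from the base GWAS
toy_phenotype = [1.0, 2.0, 3.0, 2.0]
toy_best = best_threshold(toy_dosages, toy_weights, toy_pvalues,
                          toy_phenotype, [1e-5, 0.01, 0.5])
```

In practice the regression would also include the non-genetic covariates, and R<sup>2</sup> would be the increment in variance explained over the covariate-only model.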

#### Correcting Sample Overlap and Extreme Selection

The HKOS GWAS was part of the BMD GWAS meta-analysis conducted by the GEFOS consortium. To make it an independent testing dataset, we inverted the fixed-effect meta-analysis to subtract the contribution of HKOS from the GEFOS summary statistics (**Appendix 1** in the **Supplementary Material**). When calculating PGS using the BMD GWAS hits, we used effect estimates from the stage II replication sample to avoid the issue of overlapping samples.
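A minimal sketch of this subtraction for a single SNP is given below. It assumes a standard inverse-variance-weighted fixed-effect meta-analysis with no further adjustment such as genomic control; the exact procedure used in the study is the one in Appendix 1, and the function names here are hypothetical.

```python
from math import sqrt

def combine_fixed_effect(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis of one SNP."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, sqrt(1.0 / total)

def subtract_study(beta_meta, se_meta, beta_k, se_k):
    """Remove study k from a fixed-effect meta-analysis estimate.

    Under inverse-variance weighting, the meta-analysis weight is the sum
    of the study weights, so the weight of the remaining studies is the
    difference between the two."""
    w_meta = 1.0 / se_meta ** 2
    w_k = 1.0 / se_k ** 2
    w_rest = w_meta - w_k
    if w_rest <= 0:
        raise ValueError("study weight is not smaller than meta-analysis weight")
    beta_rest = (w_meta * beta_meta - w_k * beta_k) / w_rest
    return beta_rest, sqrt(1.0 / w_rest)
```

Combining two studies with `combine_fixed_effect` and then removing one with `subtract_study` recovers the estimate and standard error of the other, which is a convenient check of the inversion.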

The HKOS GWAS adopted an extreme phenotype design to increase association power, which also resulted in upward-biased estimates of R<sup>2</sup> by PGS using linear regression. To get an estimate of R<sup>2</sup> in the unselected sample ($\hat{R}^2$) for comparison with the previous report, we corrected the R<sup>2</sup> estimate in the selected sample ($\hat{R}^{2'}$) by:

$$
\hat{R}^2 \approx \frac{\hat{R}^{2'}}{f - \left(f - 1\right)\hat{R}^{2'}}
$$

where f is the fold increase in phenotype variance due to extreme phenotype selection (f = 2.739 in the HKOS GWAS sample). The derivation and validation of this approximation formula are given in **Appendix 3** (**Supplementary Material**).
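The correction above is simple to apply numerically. The sketch below implements the formula as written, together with its algebraic inverse, which is obtained here by solving the same expression for the selected-sample R<sup>2</sup> and is not taken from the appendix. Function names are hypothetical.

```python
def correct_r2_for_selection(r2_selected, f):
    """Approximate R^2 in the unselected sample from the R^2 observed in
    an extreme-phenotype sample, where f is the variance inflation factor."""
    return r2_selected / (f - (f - 1.0) * r2_selected)

def inflate_r2_under_selection(r2_unselected, f):
    """Algebraic inverse of correct_r2_for_selection: the R^2 expected in
    the selected sample for a given R^2 in the unselected sample."""
    return f * r2_unselected / (1.0 + (f - 1.0) * r2_unselected)

F_HKOS = 2.739  # variance inflation reported for the HKOS GWAS sample
```

With f = 1 (no selection) the correction leaves R<sup>2</sup> unchanged, and for small R<sup>2</sup> it is close to dividing by f.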

#### R<sup>2</sup> for Case-Control Data on the Liability Scale

For the case-control data, it is meaningful to estimate the disease liability explained by PGS under the liability threshold model (Falconer and Mackay, 1996), so that the result can be compared to the heritability of LDH (Heikkila et al., 1989). We first converted summary statistics generated by the logistic regression to those of linear regression by first-order approximation. Then we used summary-statistics-based PGS regression to obtain an estimate of R<sup>2</sup> on the observed scale. Finally, the observed R<sup>2</sup> was converted to the liability scale using the transformation formula of Lee et al. (2012a), assuming a disease prevalence of 0.02 (Jordan et al., 2009). More details are given in **Appendix 4** (**Supplementary Material**).
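The final conversion step can be sketched as follows. The expression used is the commonly cited form of the Lee et al. (2012a) transformation for ascertained case-control samples, written from memory rather than from Appendix 4, so the appendix should be treated as authoritative. Here K is the population prevalence, P is the proportion of cases in the sample, and the function name is hypothetical.

```python
from statistics import NormalDist

def r2_observed_to_liability(r2_obs, prevalence, case_fraction):
    """Convert R^2 on the observed (0/1) scale in a case-control sample
    to R^2 on the liability scale.

    prevalence    -- population prevalence K of the disease
    case_fraction -- proportion P of cases in the sample
    """
    nd = NormalDist()
    t = nd.inv_cdf(1.0 - prevalence)   # liability threshold
    z = nd.pdf(t)                      # normal density at the threshold
    m = z / prevalence                 # mean liability of cases
    k, p = prevalence, case_fraction
    c = (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))
    shift = m * (p - k) / (1.0 - k)
    theta = shift * (shift - t)        # adjustment for ascertainment
    return r2_obs * c / (1.0 + r2_obs * theta * c)

# Sample composition of the Japanese LDH cohort and the assumed prevalence.
P_LDH = 366 / (366 + 3331)
K_LDH = 0.02
```

When the sample is not ascertained (P = K), the adjustment term vanishes and the expression reduces to multiplying the observed R<sup>2</sup> by K(1 − K)/z<sup>2</sup>.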

#### Inference of Genetic Architecture and Projecting Prediction Performance

We applied AVENGEME (Palla and Dudbridge, 2015) to estimate parameters of the genetic architecture of height and BMI from the PGS results. The procedure is described in **Appendix 2** (**Supplementary Material**). Briefly, for a presumed SNP heritability $h_1^2$, the method estimated the fraction of markers that are null ($\hat{\pi}_0$) and the genetic covariance between the discovery and testing samples ($\hat{\sigma}_{12}$). If the genetic architectures of the discovery and testing samples are the same, then the genetic correlation between the two samples can be estimated as $\hat{\rho}_G = \hat{\sigma}_{12}/h_1^2$.

The same model was also applied to predict the expected R<sup>2</sup> for height and BMI in the Chinese population under different study designs. To project R<sup>2</sup> using PGS created by weights from European GWAS, model parameters were set to the maximum likelihood estimates fitted to the observed PGS results. We also increased the discovery sample size by 500,000 to evaluate the increase of R<sup>2</sup> in the future. To predict R<sup>2</sup> using PGS created by weights from East Asian GWAS, we used the same set of model parameters, but set the discovery GWAS sample size to match the published studies of East Asians (Wen et al., 2014; He et al., 2015) and assumed no heterogeneity of effect sizes between the discovery and testing samples ($\sigma_{12} = h_1^2$). To incorporate between-sample heterogeneity within East Asians, we changed the between-population genetic correlation to 0.9 (so $\sigma_{12} = 0.9h_1^2$), which is a lower bound for height and BMI in Europeans (de Vlaming et al., 2017) and in different Chinese GWAS samples (data not shown).

#### SNP Heritabilities

Phenotypes analyzed in the HKDD cohort were adjusted for covariates associated with the phenotype (listed in **Table 2**) by linear regression, and residuals were inverse normal transformed when necessary. SNP heritabilities of the adjusted phenotypes were estimated using GCTA v1.25 (Yang et al., 2011a) after excluding individuals so that no pair of individuals had an estimated coefficient of relatedness >0.05, as recommended by the GCTA developers (Yang et al., 2017).
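For readers unfamiliar with the inverse normal transformation, a rank-based version can be sketched as below. The text does not state which variant was used; the Blom offset (c = 3/8) and the average-rank treatment of ties are assumptions made for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist

def inverse_normal_transform(values, c=0.375):
    """Rank-based inverse normal transformation.

    Each value is replaced by the standard normal quantile of
    (rank - c) / (n - 2c + 1). Tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]
```

The transformation preserves the ordering of the residuals while forcing their distribution toward normality, which is why it is applied after covariate adjustment.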

#### Estimating Genetic Correlation Between Anthropometric Traits and OA

To test the genetic overlaps of OA with height, BMI and BMD, we applied LDSC regression (Bulik-Sullivan et al., 2015) to the GWAS summary statistics following the recommended procedure. BMD summary statistics were corrected to remove the contribution from the HKOS GWAS (the only non-European study) as in PGS regression.

Due to sample overlaps between different European GWASs, PGS could not be applied to the genome-wide summary data. But for known BMD-associated SNPs whose effect size estimates from an independent replication sample were available (Estrada et al., 2012), we also used PGS regression under summary statistic mode to assess the genetic correlation between osteoarthritis and BMD.

#### Power Analysis

Power calculation was done assuming the test statistic follows a non-central chi-squared distribution under the alternative hypothesis. The non-centrality parameter for a quantitative trait is $NR^2/(1-R^2)$, where N is the sample size and R<sup>2</sup> is the phenotype variance explained by PGS. For a binary trait, R<sup>2</sup> in the above formula is on the observed scale and can be converted from the liability scale using the formula of Lee et al. (2012a) as described in **Appendix 4** (**Supplementary Material**).
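As an illustration of this calculation, the sketch below assumes a test with one degree of freedom (a single PGS coefficient), which the text does not state explicitly. It uses the fact that a non-central chi-squared variable with one degree of freedom and non-centrality λ is the square of a normal variable with mean √λ and unit variance, so only the normal distribution is needed. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def pgs_power(n, r2, alpha=0.05):
    """Power of a 1-df test of PGS association for a quantitative trait.

    n     -- sample size
    r2    -- phenotype variance explained by the PGS
    alpha -- two-sided significance level
    """
    nd = NormalDist()
    ncp = n * r2 / (1.0 - r2)              # non-centrality parameter
    crit = nd.inv_cdf(1.0 - alpha / 2.0)   # critical value on the z scale
    shift = sqrt(ncp)
    # P(|Z + shift| > crit) for standard normal Z
    return (1.0 - nd.cdf(crit - shift)) + nd.cdf(-crit - shift)
```

For example, `pgs_power(2054, 0.002)` gives a power slightly above 50% at α = 0.05 for a sample of 2,054 individuals and a PGS explaining 0.2% of the variance.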

#### RESULTS

#### Phenotype Summary of the HKDD Cohort

A total of 2,054 unrelated Chinese subjects in the HKDD cohort (60% were females) were included in the polygenic analysis. The basic demographic and phenotype summary is shown in **Table S1**. Both signal intensity loss and disc displacement showed a higher prevalence and severity at lower lumbar levels (**Figure 2b**). The disc degeneration and disc displacement scores for each individual were calculated by the summation of grades over all levels. Consistent with a major effect of aging, older individuals tended to have higher disc degeneration and displacement scores (**Figure 2c**). The two LDD scores were correlated with each other (r = 0.57; **Figure 2c**). Both of them were also positively correlated with height, body weight, BMI, and lumbar spine injury (P < 0.001; **Figure 2d**, **Table S2A**); the correlations remained significant for all except injury after correcting for the effect of age and gender (**Figure 2d**, **Table S2B**). Multiple linear regression analysis showed that the best fitting models for both disc degeneration and displacement scores included age, sex, lumbar injury, height and BMI as covariates (**Table S3**), which together explained 21.5% and 9.6% of the phenotype variance, respectively. The SNP heritability estimates for height and BMI in the HKDD cohort were 0.38 (±0.18) and 0.25 (±0.17), similar to the previous reports in Europeans (Yang et al., 2010, 2011b). For disc degeneration and displacement scores, after adjusting for known covariates, SNP heritability estimates were about 0.2∼0.3 (**Table 2**).

#### Evaluating the Prediction Performance of PGS of Anthropometric Traits

We first evaluated the prediction performance of PGS in Chinese samples and compared it with that in Europeans. PGS profiles of height and BMI were created in the HKDD cohort using known trait-associated SNPs identified by the GIANT consortium GWAS meta-analyses. They explained 5.7% and 1.2% of height and BMI variances, respectively (P < 1.0E-10 for height, P = 1.6E-07 for BMI), after adjusting for age, sex and principal components. These values are 2∼3-fold lower than previous reports in independent European samples, which were 16% for height (Wood et al., 2014) and 2.7% for BMI (Locke et al., 2015). The prediction performance of BMD-associated SNPs reported by the GEFOS consortium was tested in the HKOS GWAS sample (Kung et al., 2010). After correcting for extreme phenotype selection (**Appendix 3** in the **Supplementary Material**), the known BMD-associated SNPs explained 3.4% and 3.0% of the variance of LS-BMD and FN-BMD in the Chinese population (P < 1.0E-10), also lower than the previously reported ∼5% in Europeans (Estrada et al., 2012).

Since GWAS hits may only explain a small proportion of phenotype variance, we extended PGS analysis to make use of whole-genome summary statistics (**Figure 3**). As the p-value threshold of the discovery GWAS increases, both true and false positive SNPs will be included in the PGS. The p-value threshold that optimizes phenotype prediction depends on the discovery sample size and unknown genetic architecture (Chatterjee et al., 2013; Dudbridge, 2013), and should be determined empirically. At the optimal p-value threshold, we found that the phenotype variance explained was similar to that using GWAS hits for FN-BMD, marginally improved for height, slightly worse for LS-BMD, and more than doubled for BMI (R<sup>2</sup> = 2.6%, P < 1.0E-10). Theoretical model fitting under a range of plausible parameters (**Table S4**) suggested that BMD had a smaller fraction of trait-associated SNPs (0.5∼0.6%) with larger effect sizes compared with BMI and height (estimated fraction of non-null markers: 14∼17%), which explained why sparse PGS models showed better prediction performance for BMD. The trait variances explained by PGS predicted by the models generally capture the trend of empirical observations at different p-value thresholds (**Figure 3**). The estimate of between-population genetic covariance for each phenotype was consistently lower than the presumed heritability (**Table S4**), reflecting trans-ethnic heterogeneity in effect sizes.

### Testing Genetic Overlap Between Anthropometric Traits and LDD

We then applied PGS to test genetic overlaps between anthropometric traits and LDD (**Table 3**). In the HKDD cohort, the BMI PGS at its optimal threshold was positively associated with both disc displacement score (R<sup>2</sup> = 0.29%, P = 0.015) and disc degeneration score (R<sup>2</sup> = 0.31%, P = 0.011) after adjusting for sex, age and lumbar injury. The results are consistent with obesity as a major risk factor for LDD development and progression (Hassett et al., 2003; Hangai et al., 2008). The associations remained significant (P < 0.05) after further adjusting for height but disappeared after adjusting for BMI or body weight (**Table S5**). The PGS of LS-BMD was positively associated with disc displacement score (R<sup>2</sup> ≈ 0.2%; P < 0.05) and remained significant (P < 0.05) after further adjusting for height, BMI or weight (**Table S6**). The same trend was also observed for FN-BMD but did not reach significance. The finding supports the previously reported genetic correlation between BMD and disc bulge in a twin study (Livshits et al., 2010). The lack of association with disc degeneration score is also consistent with the previous study, which showed a smaller effect size between BMD and disc signal intensity on MRI (Livshits et al., 2010).

In addition to LDD scores in the general population, we also applied PGS to predict case-control status of symptomatic LDH (Song et al., 2013). The height PGS was positively associated with LDH (P < 0.01) and explained 0.35% of disease liability. The association cannot be explained by body weight, because the BMI PGS is better associated with weight but does not show association with LDH (P > 0.5). The result provides a genetic basis to the previous epidemiological observation that being tall is a risk factor for hospitalization due to LDH (Wahlstrom et al., 2012) and back surgery (Coeuret-Pellicer et al., 2010).

*<sup>d</sup>Association with disc displacement and degeneration scores was evaluated by inclusion of the polygenic profile score as a covariate in the multiple linear regression model of the target phenotype that adjusted for age, sex and lumbar spine injury. <sup>e</sup>For LDH, R<sup>2</sup> is the variance of disease liability explained by the PGS (Appendix 4 in the Supplementary Material). The HKOS GWAS sample was selected from the extreme ends of the BMD distribution. After correcting for extreme selection (Appendix 3 in the Supplementary Material), we estimate that R<sup>2</sup> by the PGS of GWAS hits is 3.52% for LS-BMD and 2.99% for FN-BMD.*

*For BMD, it was estimated in the HKOS GWAS sample (Kung et al., 2010; N = 780 females). LS-BMD and FN-BMD were adjusted for age and standardized into Z-scores.*

*f was adjusted for sex, age, age and the first PC. The residuals were then inverse normal transformed.*

### Testing Genetic Overlap Between Lipid Levels and LDD

Previous studies also reported that increased levels of LDL-C, TC, and TG were associated with increased risk of LDH (Leino-Arjas et al., 2008; Longo et al., 2011; Zhang et al., 2016). To test if serum lipid levels have genetic correlation with LDD, we did a similar PGS analysis using known lipid-associated SNPs and GWAS summary data from the Global Lipids Genetics Consortium (Willer et al., 2013). Prediction performance of PGS was first evaluated in a subset of the HKDD cohort (N = 620 with genotypes) whose lipid levels were measured by the high-throughput NMR approach. All PGS were significantly associated with the corresponding lipid levels. Except for LDL-C, the PGS of known lipid loci showed the best prediction performance (**Figure S1**). However, none of them was significantly associated in the expected direction with LDD scores in the HKDD cohort or with LDH in the Japanese case-control cohort (**Table S7**). Directly testing the phenotype association in the HKDD cohort by multiple linear regression also showed no association (**Table S8**). Therefore, our data do not support the previously suggested role of atherosclerotic lipids in LDD.

## Power Consideration

Given sample sizes and study designs, the two testing samples used in this study show similar profiles of statistical power (**Figure 4**). We have >50% power (at significance level α = 0.05) to detect genetic correlation if PGS explains >0.2% of the variance of adjusted LDD scores (or LDH disease liability). To achieve the same power at α = 0.01, PGS would need to explain >0.33% of the phenotype (liability) variance. But the current study does not have enough power to detect genetic overlap if the PGS explains less than 0.2% of the variance of LDD scores (or LDH liability). Therefore, we designed this study to test only phenotypes with previous epidemiological evidence for association with LDD. To further reduce the multiple testing burden, we used only the PGS that was optimal in predicting the corresponding base phenotype to test the genetic overlap with LDD. The results that were nominally significant and consistent with the expected phenotype correlations can be interpreted as supportive evidence of genetic overlaps.

### The Association of a Height Associated SNP rs6651255 With LDD Scores

The current study has no power to search for individual SNPs showing pleiotropic associations with LDD and related traits. But we noted that a recent GWAS of LDH with lumbar spine surgery in the Icelandic population (Bjornsdottir et al., 2017) identified a genome-wide significant SNP, rs6651255, which was also a known height-associated SNP (Wood et al., 2014). The risk allele T showed an odds ratio of 1.23 and was associated with increased height. The same study also reported that an increase in genetically determined height increased the risk of LDH with surgery, but the effect of rs6651255 on LDH was not mediated by height. To replicate this finding in our cohorts, we found that an LD proxy, rs4733724 (LD r<sup>2</sup> = 1 with rs6651255 in the 1000 Genomes CEU population), was directly genotyped in the HKDD cohort and reliably imputed in the Japanese case-control cohort. The allele A was coupled to the LDH risk allele and significantly increased both disc displacement and disc degeneration scores (P < 0.05; **Table 4**). The effects remained significant after further adjusting for height, BMI or body weight. The same allele was also weakly associated with increased height and increased risk of LDH requiring surgery (odds ratio = 1.11), but the results were not significant as the sample sizes limited the power to detect associations with small effect sizes.

## Genetic Correlation Between Anthropometric Traits and OA

Finally, LDD has been suggested to share common features with OA, which is also known as degenerative joint disease (Loughlin, 2011; Ikegawa, 2013). To test if osteoarthritis also showed genetic overlaps with the same set of traits as LDD, we assessed the genetic correlations of BMI, BMD and height with osteoarthritis using LDSC regression (**Table 5**). Significant positive genetic correlation was found between BMI and osteoarthritis ($\hat{r}_G$ = 0.255, P = 4.0E-07), which is expected given the strong evidence for a causal role of BMI (Panoutsopoulou et al., 2014). Suggestive positive genetic correlations with osteoarthritis were also observed for height (P = 9.5E-03) and LS-BMD (P = 0.012) but not for FN-BMD (P > 0.1).

The genetic correlation between osteoarthritis and LS-BMD was less significant, though its effect was stronger, than that between osteoarthritis and height, which was possibly due to the smaller sample size of the BMD GWAS. Since the genetic architecture of BMD was dominated by a smaller number of causal SNPs with larger effect sizes (**Table S4**), it is also possible that LDSC, which assumes an infinitesimal model, may be less optimal for detecting genetic correlations for BMD and other such traits. To support this, we calculated PGS of BMD GWAS hits using weights from the second-stage replication sample of the GEFOS consortium (Estrada et al., 2012) to predict OA. The PGS of both LS-BMD and FN-BMD were strongly associated with OA case-control status in the arcOGEN sample (R<sup>2</sup> = 0.13%, P = 7.8E-07 for LS-BMD and R<sup>2</sup> = 0.12%, P = 2.5E-06 for FN-BMD). Taken together, the results suggest that, like LDD, OA also shares genetic overlaps with height, BMI and BMD.

### DISCUSSION

### Between-Population Heterogeneity and Its Impact on Prediction Performance of PGS

In this study, we adopted a trans-ethnic PGS strategy to evaluate the genetic overlaps between different traits, in which the GWAS of base phenotypes were done in Europeans and the validation and testing samples were East Asians. Although most GWAS findings were generally replicated in populations different from the initial discovery population, heterogeneity commonly existed in the estimated effect sizes (e.g., Carlson et al., 2013; Marigorta and Navarro, 2013), which would reduce the power of PGS to predict phenotypes in populations of a different ethnicity (e.g., Johnson et al., 2015). Consistent with this, we found in Chinese validation samples that the variance of height, BMI and BMD explained by the PGS of the corresponding GWAS hits was in each case lower than in Europeans. For height and BMI, assuming the genetic architecture is the same between Europeans and Chinese, the observed PGS results suggest that between-population genetic correlations are about 0.4∼0.6 (**Table S4**, Materials and Methods). These rough estimates are within the range of previous estimates for type 2 diabetes and rheumatoid arthritis between Europeans and East Asians obtained using a different methodology (Brown et al., 2016).

TABLE 4 | Association of rs4733724-A allele with lumbar disc degeneration and height in East Asian samples.

*The SNP rs4733724 was genotyped in the HKDD cohort and reliably imputed in the Japanese LDH case-control cohort. The A allele was previously reported to be associated with increased height in Europeans (Wood et al., 2014). The rs4733724-A allele is coupled to rs6651255-T, the latter of which was recently found to increase the risk (odds ratio = 1.23) of LDH requiring surgery in Icelanders (Bjornsdottir et al., 2017). The frequency of the rs4733724-A allele is 0.72 in East Asians and 0.23 in Europeans. <sup>a</sup>Odds ratio = 1.11.*

TABLE 5 | Genetic correlations estimated by LD-score regression.

*BMI, body mass index; BMD, bone mineral density; LS, lumbar spine; FN, femoral neck; OA, osteoarthritis.*

The use of European GWAS in the current study is mainly due to their large sample sizes and publicly available summary statistics. GWAS meta-analyses of height and BMI were also conducted in East Asians (Wen et al., 2014; He et al., 2015). Although their sample sizes are much smaller (N≈36,000 for height, 87,000 for BMI), their effect sizes are expected to be more similar to those in the Chinese sample. To evaluate the tradeoff between sample size and effect heterogeneity, we projected the expected prediction performance (R<sup>2</sup>) of height and BMI using a theoretical model with parameters of genetic architecture compatible with the observed PGS results (Materials and Methods). Despite the smaller sample sizes, using East Asian GWAS as the discovery sample is expected to give a maximum R<sup>2</sup> comparable to that of European GWAS when predicting height in the Chinese population (**Figure 5**). For BMI, depending on the presumed SNP heritability, using East Asian GWAS shows a comparable or better maximum R<sup>2</sup> (**Figure 5**, **Figure S2**). When the European GWAS sample sizes are further increased by half a million, a scale similar to the ongoing UK Biobank study, the increase in R<sup>2</sup> for height is capped at 8%, but R<sup>2</sup> roughly doubles for BMI. Notably, when using East Asian GWAS as the discovery sample, the best prediction performance can only be achieved at p-value thresholds >0.01. However, whole-genome summary statistics of East Asian GWASs were not publicly available to us before the start of this study. Also consistent with the theoretical predictions, adding East Asian GWAS top hits to the PGS of GWAS hits only marginally increased R<sup>2</sup> in predicting height and BMI, and their associations with the LDD scores remained non-significant (**Table S9**).

Given the genetic architecture and sample sizes, the power of PGS to detect genetic overlaps is mainly determined by the performance of the PGS in predicting the corresponding base phenotype. Therefore, the theoretical results suggest that the use of European GWAS as the discovery sample in PGS analysis can still be a favorable approach for cross-trait analysis in the East Asian population. However, we caution that the trans-ethnic PGS strategy may not be suitable for other populations, such as African populations. Nevertheless, whenever possible, ancestry-matched GWAS of the base phenotype with large sample sizes should be used to improve power. Since summary data from large-scale GWAS in non-European populations have recently started to become available (e.g., Akiyama et al., 2017), new methods will be needed to integrate GWAS data from multiple ethnicities to further improve PGS prediction performance.

### The Influence of Phenotype Definition

In this study, we analyzed three LDD phenotypes, including two semi-quantitative scores derived from MRI assessment and one clinically defined symptom. The PGS of height, BMI and BMD were each associated with at least one LDD phenotype. This highlights the complexity of operationally defining LDD, as the current diagnostic approach only captures certain aspects of the degenerative process. Therefore, comparisons between different studies should clarify how phenotypes are defined, and it will be fruitful to jointly evaluate multiple MRI features in future genetic studies. However, although MRI is the current gold standard that gives the best resolution in defining LDD, it is too expensive to be carried out in large samples.

(red line) was calculated using the parameters best fit to Figure 3. In comparison, the latest East Asian GWAS has a sample size of only 36K; the expected *R*<sup>2</sup> was calculated using the same set of parameters except that we assumed no heterogeneity in effect sizes (i.e., genetic correlation = 1) between the discovery and testing samples (blue line). To predict the gain in *R*<sup>2</sup> when using even larger European GWAS in the future, we further increased the discovery GWAS sample size by 500K (red dashed line). We also relaxed the assumption of no heterogeneity within East Asians and calculated the expected *R*<sup>2</sup> assuming a genetic correlation of 0.9 (blue dashed line). (B) For BMI, the European GWAS has a sample size of 234K; the East Asian GWAS has a sample size of 87K. Expected *R*<sup>2</sup> values were calculated as for height, assuming a SNP heritability of 0.22.

An alternative strategy is to use a "proxy phenotype" such as patient-based LDH, for which a large number of cases can be identified from electronic medical records. The use of proxy phenotypes has been demonstrated to improve power in GWAS (e.g., Okbay et al., 2016). The increase in sample size can outweigh the dilution of genetic effects, but a proxy phenotype may also capture aspects of the trait that are irrelevant to the phenotype of interest (e.g., Kong et al., 2017). In the current and our previous study (Song et al., 2013), LDH requiring surgery was presumed to represent an extreme end of disc displacement in the population. In this regard, it is surprising that the PGS of height was strongly associated with LDH but not with LDD scores, and that the PGS of BMI and BMD were associated with LDD scores but not with LDH. Although the lack of expected associations could reflect false negatives due to insufficient power, we cannot rule out the possibility that ascertainment of LDH patients based on severe low back pain or sciatica may enrich for polygenic factors other than LDD.

### Biological Interpretations

The observed genetic overlaps can be explained by causality, genetic pleiotropy, or both. Interestingly, BMI, BMD and height also showed suggestive evidence of positive genetic correlation with OA. It is possible that these overlaps can be explained by some common mechanisms. Although formal assessment of causality could utilize the Mendelian randomization paradigm in larger samples, PGS can be used to nominate candidate phenotypes (Evans et al., 2013). Overweight or obesity has been established as one of the major risk factors for the development and progression of both LDD (Hassett et al., 2003; Hangai et al., 2008) and OA (Bierma-Zeinstra and Koes, 2007). It is commonly believed that increased body weight or BMI exerts more physical loading on the IVD and vertebral endplate (Videman et al., 2007) or joint cartilage (Guilak, 2011), and leads to increased wear and tear of these structures. For BMD, in addition to its correlation with LDD, previous studies also found an increase in BMD in OA patients and an inverse association between OA and osteoporosis (Hannan et al., 1993; Arden et al., 1996). It was postulated that increased BMD is associated with a loss of resilience of subchondral bone, which may result in increased mechanical stress on joint cartilage (Foss and Byers, 1972; Radin and Rose, 1986) and similarly on the IVD (Harada et al., 1998). The causal mechanism by which tall stature leads to LDH requiring hospitalization or surgery remains unclear. One possibility may be related to increased disc height, because a previous study using finite element modeling demonstrated that discs with taller height and smaller area were prone to larger motion, higher annular fiber stress and a larger degree of disc displacement (Natarajan and Andersson, 1999). Another possibility may be altered spinal alignment in taller individuals that predisposes them to lumbar spine injury. Notably, the postulated mechanisms all point to the pathophysiological role of biomechanical stress.
Some other mechanisms have also been proposed (Katz et al., 2010; Samartzis et al., 2013). For example, obesity is also believed to lead to local inflammatory response of secondary mediators secreted by adipocytes known as adipokines. The causal role of adipokines and inflammatory markers can also be tested using their genetic predictors as instrumental variables in future studies.

Alternatively, the observed genetic correlations are also consistent with genetic pleiotropy and shared pathways among skeletal phenotypes. In support of this notion, several individual OA-associated SNPs were associated with height or BMD (Reynard and Loughlin, 2013; Hackinger et al., 2017), and OA and LDD were found to share some common genetic risk factors (Song et al., 2008; Williams et al., 2011). At the single-SNP level, we also replicated the recent finding of Bjornsdottir et al. (2017) and showed that the height-increasing allele of SNP rs6651255 was associated with an increase in two LDD scores in the HKDD cohort. The previous study did not find an association of the same SNP with other related skeletal phenotypes, such as OA of the spine or osteoporotic vertebral fractures, and suggested that the association was driven by neuropathic pain rather than herniated lumbar discs. However, that study did not examine the association of the SNP with radiologically defined LDD phenotypes. Our results in a large population-based cohort with MRI assessment suggest that the same SNP also influences changes in the composition and morphology of lumbar discs. Future genetic studies on LDD with larger sample sizes should search for additional pleiotropic SNPs to better understand bone-cartilage relationships.

In summary, the current study is the first attempt to evaluate genetic overlap between LDD and related traits using GWAS data. Our trans-ethnic polygenic analysis supports the genetic correlations of height, BMI and BMD with LDD, and sheds new light on understanding the pathological mechanism of degenerative skeletal disorders.

### DATA AVAILABILITY

The genome-wide association summary statistics of the HKDD cohort are available at https://goo.gl/6gpt9g.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the ethical principles and guidelines for the protection of human participants of research, Human Research Ethics Committee, The University of Hong Kong. The protocol was approved by the Human Research Ethics Committee, The University of Hong Kong. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

### AUTHOR CONTRIBUTIONS

XZ and PCS conceived the study and coordinated the research. KS-EC obtained the funding. XZ developed the methods and performed the analysis. JK and KM-CC evaluated MRI images of the HKDD cohort. TK, Y-QS, KC, YK, and SI contributed the Japanese LDH GWAS summary data. C-LC contributed the HKOS GWAS summary data. Y-HH contributed the GEFOS consortium GWAS summary data. DS contributed the NMR data. YL and DC applied for the arcOGEN consortium GWAS summary data. XZ drafted the manuscript with input from PCS and KS-EC. DS, TS-HM, C-LC, and JK reviewed and revised the manuscript. PCS and KS-EC participated in the discussion.

### ACKNOWLEDGMENTS

This work was supported by Research Grant Council of Hong Kong Theme-based Research Scheme Functional Analyses of How Genomic Variation Affect Personal Risk for Degenerative Skeletal Disorders (T12-708/12N), and General Research Fund 776513M, 17128515 and 17124027. GEFOS study was funded by the European Commission (HEALTH-F2-2008-201865-GEFOS). arcOGEN study was funded by a special purpose grant from Arthritis Research UK (grant 18030). We thank Ms. Pei Yu for curating the HKDD phenotype database, and Dr. Eleftheria Zeggini for providing the arcOGEN GWAS summary data.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00267/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhou, Cheung, Karasugi, Karppinen, Samartzis, Hsu, Mak, Song, Chiba, Kawaguchi, Li, Chan, Cheung, Ikegawa, Cheah and Sham. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Accuracy of Gene Expression Prediction From Genotype Data With PrediXcan Varies Across and Within Continental Populations

Anna V. Mikhaylova\* and Timothy A. Thornton\*

*Department of Biostatistics, University of Washington, Seattle, WA, United States*

#### Edited by:

*Dana C. Crawford, Case Western Reserve University, United States*

#### Reviewed by:

*Georgios Athanasiadis, University of Copenhagen, Denmark Binglan Li, University of Pennsylvania, United States*

\*Correspondence:

*Anna V. Mikhaylova avmikh@uw.edu Timothy A. Thornton tathornt@uw.edu*

#### Specialty section:

*This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics*

Received: *30 November 2018* Accepted: *08 March 2019* Published: *03 April 2019*

#### Citation:

*Mikhaylova AV and Thornton TA (2019) Accuracy of Gene Expression Prediction From Genotype Data With PrediXcan Varies Across and Within Continental Populations. Front. Genet. 10:261. doi: 10.3389/fgene.2019.00261*

Using genetic data to predict gene expression has garnered significant attention in recent years. PrediXcan has become one of the most widely used gene-based methods for testing associations between predicted gene expression values and a phenotype, which has facilitated novel insights into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The gene expression prediction models for PrediXcan were developed using supervised machine learning methods and training data from the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) project, where the majority of subjects are of European descent. Many genetic studies, however, include samples from multi-ethnic populations, and in this paper we evaluate the accuracy of PrediXcan for predicting gene expression in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European ancestry populations for thousands of genes. We evaluate a range of models from the PrediXcan weight databases and use Pearson's correlation coefficient to assess gene expression prediction accuracy with PrediXcan. From our evaluation, we find that the predictive performance of PrediXcan varies substantially among populations from different continents (*F*-test *p*-value < 2.2 × 10<sup>−16</sup>), where prediction accuracy is lower in the Yoruban population from West Africa compared to the European-ancestry populations. Moreover, not only do we find differences in predictive performance between populations from different continents, we also find highly significant differences in prediction accuracy among the four European ancestry populations considered (*F*-test *p*-value < 2.2 × 10<sup>−16</sup>). Finally, while there is variability in prediction accuracy across different PrediXcan weight databases, we also find consistency in the qualitative performance of PrediXcan for the five populations considered, with the African ancestry population having the lowest accuracy across databases.

Keywords: transcriptome, expression quantitative trait loci (eQTL), genetic diversity, genetic mapping, complex traits

## 1. INTRODUCTION

In the past decade, genome-wide association studies (GWAS) have identified thousands of genetic variants significantly associated with a wide range of human phenotypes (Sudlow et al., 2015; NHLBI, 2016; MacArthur et al., 2017; Visscher et al., 2017). The vast majority of these studies, however, were conducted in samples from European ancestry populations (Need and Goldstein, 2009; Bustamante et al., 2011; Petrovski and Goldstein, 2016; Popejoy and Fullerton, 2016; Bentley et al., 2017; Hindorff et al., 2018). Differences in allele frequencies, genetic architecture, and linkage disequilibrium (LD) patterns across ancestries suggest that GWAS discoveries can fail to generalize across populations, and recent publications have provided compelling evidence that GWAS findings often do not transfer from European populations to other ethnic groups (Adeyemo and Rotimi, 2009; Li and Keating, 2014). For example, Carlson et al. analyzed multi-ethnic data from the PAGE Consortium and concluded that some variants identified in GWAS in European ancestry populations had different magnitude and direction of allelic effects in non-European populations and the differential effects were more persistent in African Americans (Carlson et al., 2013). Moreover, genetic risk prediction models derived from European GWAS were found to be unreliable when applied to other ethnic groups (Carlson et al., 2013). Martin et al. examined the impact of population history on polygenic risk scores and demonstrated that they can be biased and confounded by population structure (Martin et al., 2017). Since genetic risk prediction accuracy depends on genetic similarity between the target and discovery cohorts, Martin et al. advised against interpreting the scores across populations and recommended computing them in genetically similar cohorts.

Associations between genetic variation and molecular traits, such as gene expression, have advanced our understanding of the mechanisms underlying trait-variant associations (Nica et al., 2010; Torres et al., 2014; Albert and Kruglyak, 2015). Prior studies have shown that a large proportion of GWAS variants identified for complex traits are expression quantitative trait loci (eQTLs), i.e., they play a role in regulating gene expression (Nicolae et al., 2010). Thus, eQTLs can aid in prioritizing likely causal variants among the ones identified by GWAS, especially if they are found in non-coding regions, and can help uncover the mechanisms by which genotypes influence phenotypes (Albert and Kruglyak, 2015). As a result, having three types of data (genotype, phenotype, and gene expression) on the same set of subjects can be advantageous for improved understanding of the relationships between complex traits, the genetic backgrounds of study subjects, and the underlying biological processes. However, collecting all of these different types of data on the same study subjects is often not feasible due to cost and tissue availability. Additionally, eQTL studies have the same pitfalls as GWASs: the majority of the detected eQTLs are not causal, but may be in LD with causal variants. Similar to variants identified through GWAS, eQTL findings might fail to replicate in diverse populations due to differential LD patterns across populations (Kelly et al., 2017).

Recently, there has been increased interest in integrating eQTL studies and GWASs for improved complex trait mapping. PrediXcan (Gamazon et al., 2015) is one of the most widely used integrative methods for testing associations between a phenotype and gene expression values predicted from SNP genotyping or sequencing data. PrediXcan can have increased power over traditional GWAS methods, particularly when differential changes in gene expression are an intermediary stage of the causal pathway from genetic variation to the outcome of interest. A useful feature of PrediXcan (and other similar methods) is the ability to obtain predicted gene expression values on study subjects when tissue types relevant to phenotypes are not available. We now give a very brief overview of the PrediXcan method. PrediXcan uses machine learning methods and large reference datasets consisting of both genotype and transcriptome data for supervised training to construct prediction models for the expression of each gene. With PrediXcan, genetic training data is restricted to common cis-variants that are within 1 Mb upstream and downstream from the transcription region (Gamazon et al., 2015). Gene-specific derived SNP weights from the prediction models are then stored in databases, with separate sets of weights for different tissue types. Using these weights, PrediXcan allows for the prediction of gene expression values for study subjects with available genotype data, where predicted expression values are computed as a weighted linear combination of SNP dosages. Finally, the predicted expression values can then be used to test for associations with a phenotype of interest. By conducting tests on gene expression obtained from an aggregation of variants, PrediXcan dramatically reduces the multiple testing burden as compared to single variant association testing.
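The prediction step described above, a weighted linear combination of SNP dosages, can be illustrated with a minimal Python sketch. This is not the PrediXcan software or its API: the function name, the SNP identifiers, the weights, and the dosages are all hypothetical, and skipping SNPs that are absent from a subject's genotype data is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict


def predict_expression(weights: Dict[str, float], dosages: Dict[str, float]) -> float:
    """Predicted expression of one gene for one subject.

    Computed as the sum over SNPs of weight * dosage. SNPs that are missing
    from the subject's dosages are skipped (an assumption of this sketch).
    """
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


# Hypothetical gene-specific weights (SNP identifier -> weight).
gene_weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.1}

# Hypothetical dosages: expected count of the effect allele, between 0 and 2.
subjects = {
    "subject_1": {"rs_a": 2.0, "rs_b": 1.0, "rs_c": 0.0},
    "subject_2": {"rs_a": 0.0, "rs_b": 2.0},  # rs_c not available for this subject
}

predicted = {sid: predict_expression(gene_weights, d) for sid, d in subjects.items()}
for sid, value in predicted.items():
    print(sid, value)
```

Because the score is linear in the dosages, any difference between the training and target populations in allele frequencies or LD with the causal variants carries over directly into the predicted values.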

Previous studies have reported differences in gene expression levels across diverse populations from the HapMap3 project, noting that 77% of eQTLs are population specific and only 23% are shared between two or more populations (The International HapMap 3 Consortium et al., 2010; Stranger et al., 2012). More distantly related populations have more differentially expressed genes than closely related populations, although this can often be explained by the expression of different gene transcripts across populations (Lappalainen et al., 2013). One potential limitation of PrediXcan, however, is that the method may not perform well in diverse populations, as the supervised learning for PrediXcan was conducted using data from the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) Project, both of which consist primarily of European-ancestry subjects (Lonsdale et al., 2013; Battle et al., 2014). Many genetic studies include samples from multi-ethnic populations, and understanding the accuracy of gene expression prediction with PrediXcan across populations is of interest to many genetic researchers.

Recent works have evaluated the performance of PrediXcan in diverse populations (Gottlieb et al., 2017; Li et al., 2018). Li et al. evaluated PrediXcan whole-blood prediction models and investigated the factors that influence prediction accuracy using the Yoruban (YRI) and European (CEU) samples from the Genetic European Variation in Health and Disease (GEUVADIS) (Lappalainen et al., 2013) cohort. In that paper, the PrediXcan performance was reported to be unsatisfactory for most genes because predicted gene expression values did not correlate well with the observed values (Li et al., 2018). Differences in prediction accuracy with PrediXcan between the YRI and CEU, however, were not directly compared. Gottlieb et al. investigated the performance of PrediXcan for a small subset of 116 genes in the warfarin-response pathway in European and African American samples and concluded that PrediXcan performed poorly in African Americans (Gottlieb et al., 2017).

Here, we evaluate the predictive performance of PrediXcan both across and within continental populations using thousands of genes across the genome. Using the GEUVADIS transcriptome data and whole genome sequencing data from the 1000 Genomes Project (Lappalainen et al., 2013; Auton et al., 2015), we consider four closely related European ancestry populations and one African population. In our analysis, we test the null hypotheses of (1) no difference in prediction accuracy with PrediXcan across European and African continental populations; and (2) no difference in predictive performance among the four European derived populations. We obtain predicted gene expression levels using seven PrediXcan weight databases derived from whole blood and lymphoblastoid cell lines (LCL) transcriptome data for each individual. To evaluate differences in prediction accuracy among the populations, we use a linear mixed effects model framework where Pearson's correlation coefficients for observed and predicted gene expression levels are included as the outcome and the populations are included as categorical predictors. In addition, we evaluate the utility of whole-blood-based models when making predictions for LCL expression data. We find from our analyses that accuracy of PrediXcan for gene expression prediction not only differs between European and African continental populations, but also among closely related populations of European ancestry. Furthermore, prediction accuracy with PrediXcan is the lowest in Africans across all seven weight databases considered, which further illustrates the need to develop new predictive models using training data composed of individuals who have similar ancestry to the target sample for which gene expression is to be predicted (Mogil et al., 2018).

## 2. MATERIALS AND METHODS

### 2.1. Datasets

We obtained gene expression data from the GEUVADIS Consortium and whole genome sequencing data from the 1000 Genomes Project. The gene expression data consisted of RNA sequencing on lymphoblastoid cell line (LCL) samples for 464 individuals from five populations. Of these, 445 subjects were in the 1000 Genomes Phase 3 dataset, including 358 subjects of European descent, and 87 subjects of African descent. European samples included: Utah residents with Northern and Western European ancestry (CEU, n = 89), British individuals in England and Scotland (GBR, n = 86), Finnish in Finland (FIN, n = 92), and Toscani in Italy (TSI, n = 91). African samples included individuals of African descent from Yoruba in Ibadan, Nigeria (YRI, n = 87). Gene expression measurements were available for 23,722 genes.

We used seven PrediXcan weight databases: DGN whole-blood (further referred to as DGN), GTEx v6 1KG whole blood, GTEx v6 1KG LCL, GTEx v6 HapMap whole blood, GTEx v6 HapMap LCL, GTEx v7 HapMap whole blood (GTEx WB), and GTEx v7 HapMap LCL (GTEx LCL). The databases were downloaded from http://predictdb.org/.

### 2.2. Filtering Procedure for Poorly Predicted Genes

Linear regression models were used to identify genes whose predicted values were not associated with the observed values at significance level of 0.05 in order to filter out genes that have poor prediction accuracy across all subjects. For each gene, we fit a linear regression model with observed gene expression as the outcome and predicted gene expression as the predictor of interest. A Wald test was used to assess significance of the coefficient for each gene in the linear model. Genes with corresponding p-values that were higher than a nominal significance level of 0.05 were identified and labeled as "poorly predicted."
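As an illustration of this filter, the following Python sketch fits the simple regression of observed on predicted expression for each gene and labels the gene "poorly predicted" when the two-sided p-value for the slope exceeds 0.05. It is a sketch under stated assumptions, not the authors' code (their analyses were run in R): the gene names and data are made up, and the p-value uses a standard normal reference distribution, which only approximates the t-based Wald test and is reasonable for samples of several hundred subjects but not for the tiny toy data shown.

```python
import math
from statistics import NormalDist


def slope_wald_p(observed, predicted):
    """Two-sided p-value for the slope in the regression observed ~ predicted.

    Returns None when the predictions are constant (slope undefined).
    Uses a normal approximation to the Wald statistic (assumption of this sketch).
    """
    n = len(observed)
    mx = sum(predicted) / n
    my = sum(observed) / n
    sxx = sum((x - mx) ** 2 for x in predicted)
    if sxx == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(predicted, observed))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(predicted, observed))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:
        return 0.0 if slope != 0 else 1.0  # degenerate perfect fit
    z = slope / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05
xs = [float(i) for i in range(20)]                        # hypothetical predictions
wiggle = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]

genes = {
    "gene_good": ([x + w for x, w in zip(xs, wiggle)], xs),  # observed tracks predicted
    "gene_bad": (wiggle, xs),                                # observed unrelated to predicted
}

poorly_predicted = {}
for name, (obs, pred) in genes.items():
    p = slope_wald_p(obs, pred)
    poorly_predicted[name] = p is None or p > ALPHA
print(poorly_predicted)
```

Treating genes with constant predictions as poorly predicted is a choice made for this sketch only.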

We then calculated Pearson's correlation coefficient, r, between observed and predicted expression values for every gene, in each population separately. A few genes had the same predicted gene expression levels across all subjects. Since we could not calculate the correlation if one of the variables was constant, we excluded those genes. Thus, for every gene considered there were five Pearson's correlation coefficients, one for each population. Note that we used r instead of the square of Pearson correlation, r<sup>2</sup>, in order to take directionality of correlation into account when assessing predictive performance. We found that using r<sup>2</sup> as a measure of predictive accuracy can be misleading as there were genes for which predicted and observed expression values had a significant negative correlation.
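The following Python sketch shows, on made-up numbers, why r rather than r<sup>2</sup> is the safer accuracy measure: a gene whose predictions are perfectly anti-correlated with the observations has r = −1 but r<sup>2</sup> = 1. The helper name and the data are hypothetical, and returning `None` for a constant variable is simply how this sketch marks the undefined case described above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns None if either sequence is constant, since the correlation
    is undefined in that case.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (observed, predicted) values for one gene in three populations.
gene_data = {
    "CEU": ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),  # predictions track observations
    "YRI": ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]),  # predictions reversed
    "FIN": ([1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]),  # constant predictions
}

r_by_pop = {pop: pearson_r(obs, pred) for pop, (obs, pred) in gene_data.items()}
for pop, r in r_by_pop.items():
    r_squared = None if r is None else r ** 2
    print(pop, "r =", r, "r^2 =", r_squared)
```

In this toy case r<sup>2</sup> would rank the CEU and YRI predictions as equally accurate, while r separates them.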

It should be noted that we also performed an evaluation of the performance of PrediXcan without doing any filtering of genes in order to assess the impact on the analysis when poorly predicted genes are excluded, as discussed below.

### 2.3. Assessing Prediction Accuracy Differences Across Populations and Across Tissues

In the analyses described below to assess differences in prediction accuracy with PrediXcan across populations, two sets of genes were considered—all genes without any filtering and a subset of genes using the filtering process previously described.

We first compared prediction performance between the two continental groups—European and African. For each gene, we calculated two Pearson's correlation coefficients between observed and predicted gene expression levels—one based on all European samples and the other one based on the African samples. We then used a paired t-test to assess differences in mean prediction accuracy between the correlation coefficients for European samples vs correlation coefficients for African samples.
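A minimal Python sketch of the paired comparison follows. Each gene contributes one difference between its European-based and African-based correlation coefficients, and the t statistic is the mean difference divided by its standard error. The correlation values are invented for illustration, and the sketch stops at the statistic and its degrees of freedom because the Python standard library has no t-distribution CDF; a statistics package would be needed for the p-value.

```python
import math


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched sequences a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Hypothetical per-gene correlation coefficients (same gene order in both lists).
r_european = [0.30, 0.45, 0.10, 0.60, 0.25]
r_african = [0.20, 0.40, 0.12, 0.35, 0.15]

t_stat, df = paired_t_statistic(r_european, r_african)
print("t =", round(t_stat, 3), "df =", df)
```

Pairing by gene removes between-gene variation in predictability from the comparison, which is the same motivation as the per-gene random intercept in the mixed model.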

To assess differences in prediction accuracy across the five populations, we used a linear mixed effects model approach where we fit the following model:

$$r_{ij} = \beta_0 + \gamma_i + \beta_1 \mathbb{I}_{\text{FIN},i} + \beta_2 \mathbb{I}_{\text{GBR},i} + \beta_3 \mathbb{I}_{\text{TSI},i} + \beta_4 \mathbb{I}_{\text{YRI},i} + \epsilon_{ij}, \tag{1}$$

where r<sub>ij</sub> is the correlation coefficient for gene i in population j; and I<sub>FIN,i</sub>, I<sub>GBR,i</sub>, I<sub>TSI,i</sub>, and I<sub>YRI,i</sub> are indicator variables that are equal to 1 if the gene correlation was calculated on the population indicated in the subscript, and otherwise are equal to 0. Thus, we modeled population as a categorical predictor, with the CEU population as the reference. To account for variation between genes, we included a random intercept γ<sub>i</sub> for each gene and we assumed that γ<sub>i</sub> ∼ N(0, σ<sub>γ</sub><sup>2</sup>). We also included an error term ε<sub>ij</sub>, such that ε<sub>ij</sub> ∼ N(0, σ<sup>2</sup>). We used repeated measures ANOVA to test the null hypothesis of β<sub>1</sub> = β<sub>2</sub> = β<sub>3</sub> = β<sub>4</sub> = 0 for no difference in mean Pearson's correlation coefficients among the populations. A Wald test was used to assess significance of differences in mean Pearson's correlation coefficients between CEU, the reference population, and each of the other four populations.
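To make the reference coding in model (1) concrete, the Python sketch below builds the four indicator variables for a given population and computes point estimates of the fixed effects on invented data. It relies on an assumption that should be checked before reuse: when every gene has a correlation coefficient in all five populations (a balanced design), the fixed-effect estimates of a random-intercept model coincide with simple differences in population means relative to the reference. The sketch does not estimate the variance components or carry out the ANOVA and Wald tests, for which mixed-model software (the authors used R) is required; all names and numbers are hypothetical.

```python
POPULATIONS = ["CEU", "FIN", "GBR", "TSI", "YRI"]
REFERENCE = "CEU"


def indicator_row(population):
    """Indicator variables (FIN, GBR, TSI, YRI) for one observation."""
    return [1 if population == p else 0 for p in POPULATIONS if p != REFERENCE]


def fixed_effect_estimates(r_by_gene):
    """Point estimates of beta_0 and the population contrasts.

    r_by_gene maps gene -> {population -> r} and must be complete (balanced);
    under that assumption the estimates are differences in population means.
    """
    n_genes = len(r_by_gene)
    means = {
        p: sum(per_pop[p] for per_pop in r_by_gene.values()) / n_genes
        for p in POPULATIONS
    }
    beta0 = means[REFERENCE]
    contrasts = {p: means[p] - beta0 for p in POPULATIONS if p != REFERENCE}
    return beta0, contrasts


# Hypothetical correlation coefficients for three genes.
r_by_gene = {
    "gene_1": {"CEU": 0.50, "FIN": 0.60, "GBR": 0.55, "TSI": 0.50, "YRI": 0.30},
    "gene_2": {"CEU": 0.20, "FIN": 0.30, "GBR": 0.25, "TSI": 0.20, "YRI": 0.10},
    "gene_3": {"CEU": 0.80, "FIN": 0.90, "GBR": 0.85, "TSI": 0.80, "YRI": 0.50},
}

beta0, contrasts = fixed_effect_estimates(r_by_gene)
print("YRI indicators:", indicator_row("YRI"))
print("beta_0:", beta0)
print("contrasts:", contrasts)
```

A CEU observation has all four indicators equal to 0, so its expected value is β<sub>0</sub> plus the gene's random intercept; each other coefficient measures how far that population's mean accuracy sits from CEU.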

We also ran a similar analysis where we excluded the CEU population due to potentially lower quality of the CEU cell lines, as reported in the literature (Çaliskan et al., 2014; Yuan et al., 2015). We fit a model identical to (1), excluding the CEU and using the FIN population as a reference:

$$r_{ij} = \beta_0 + \gamma_i + \beta_1 \mathbb{I}_{\text{GBR},i} + \beta_2 \mathbb{I}_{\text{TSI},i} + \beta_3 \mathbb{I}_{\text{YRI},i} + \epsilon_{ij}, \tag{2}$$

where the notation is the same as above.

Additionally, we tested for differences in prediction accuracy across the four European populations. For this analysis, we included only individuals of European ancestry and fit the following linear mixed effects model:

$$r_{ij} = \beta_0 + \gamma_i + \beta_1 \mathbb{I}_{\text{FIN},i} + \beta_2 \mathbb{I}_{\text{GBR},i} + \beta_3 \mathbb{I}_{\text{TSI},i} + \epsilon_{ij}, \tag{3}$$

where CEU is included as the reference population in the model. As in the previously described analyses, a repeated measures ANOVA was used to test for differences in prediction accuracy across the four European populations.

To evaluate how PrediXcan performance with whole-blood (WB) databases differed from that with LCL databases, we restricted the set of genes to only those that were present in both the WB and LCL databases. First, we presented scatter plots of correlation coefficients comparing WB and LCL databases in the five populations separately. Then we recalculated Pearson's correlation coefficients between observed and predicted expression values with all five populations combined but separately for every database, so that we had two correlation coefficients per gene, one corresponding to a GTEx WB database and one to a GTEx LCL database. We compared each pair of GTEx WB and GTEx LCL databases using a paired t-test between LCL-based correlation coefficients and WB-based correlation coefficients. All the statistical analyses described above were performed in R version 3.3.3 (R Core Team, 2014). All plots were generated with ggplot2 (Wickham, 2016).

#### 3. RESULTS

#### 3.1. Overview of PrediXcan Weight Databases

In **Table 1**, we summarize the main features of the PrediXcan weight databases that we used in the analyses. Compared to the DGN database, the GTEx databases have fewer gene models and smaller training sample sizes. HapMap and 1KG-based models differ in the number of variants used for training: GTEx HapMap models were trained on the HapMap genotyping data while GTEx 1KG models were trained on the 1000 Genomes sequencing data, so the latter utilize more variants when predicting expression. While GTEx LCL databases are based on relatively small training sets, they are derived from the same tissue as the GEUVADIS RNA-seq data we analyzed. Lastly, the DGN and GTEx v7 sets of weights were trained only on European samples, while the GTEx v6 databases had a small fraction of non-Europeans.

TABLE 1 | Summary of PrediXcan databases used in analyses.

TABLE 2 | Number of genes for which Pearson correlation coefficients are available by population and by PrediXcan weight database.

To avoid repetition, results using the DGN, GTEx v7 WB, and GTEx v7 LCL databases are included in the main text, while the results for the other four databases are provided in the **Supplementary Material**.

#### 3.2. PrediXcan Prediction Accuracy Differs Across Diverse Populations

Using DGN, GTEx WB, and GTEx LCL models and sequence data, gene expression was predicted for 10,387, 5,432, and 2,777 genes, respectively (see **Table 2**). The number of genes with available predictions varied by population, where the four European populations had a similar number of gene predictions while the counts for YRI were slightly lower. We excluded 33 genes, 13 genes, and 10 genes from DGN, GTEx WB, and GTEx LCL, respectively, due to there being no variation in predicted gene expression values for at least one of the populations. For the remaining genes, we identified those that had poor prediction accuracy based on associations between observed and predicted values, as described in section Materials and Methods on filtering poorly predicted genes. From the genes predicted with the DGN database, two-thirds were labeled by this criterion as "poorly predicted," while slightly less than a half were labeled as such from gene sets predicted using the GTEx databases. As previously mentioned, we also considered the performance of PrediXcan without doing any filtering of the genes. For every weight database, we had two sets of genes—before and after filtering where the latter set is a much smaller subset of the former. Both versions were used and evaluated in downstream analyses.

We first evaluated performance of PrediXcan for the two continental populations, European and African. We compared Pearson's correlation of predicted and observed gene expression values for the combined sample consisting of all individuals from the four European-ancestry populations to Pearson's correlation calculated for the YRI African population sample. As only two groups were being compared in this analysis, a paired t-test was used to assess differences in prediction accuracy, where the pairing was based on the gene. With or without the filtering of genes, we find the mean difference in gene correlation coefficients between the European and African samples to be highly significantly different from zero, regardless of the weight database used (all p-values < 2.2 × 10<sup>−16</sup>), with the African population having lower prediction accuracy than the European samples.

Next, we computed gene correlation coefficients separately in each of the five populations. Violin plots display the correlation coefficients by population across genes before and after filtering (see **Figures 1A,B**, respectively). **Figure 1A** shows correlation coefficients for the genes before any filtering was done, and we observe that LCL-derived models perform better than WB-derived models: DGN and GTEx v7 WB correlation distributions are centered at values close to 0, whereas GTEx LCL correlation distributions are centered at higher values, especially for the four European populations. We also note that prediction accuracy is slightly lower for the African population than for any of the European populations across the three weight databases. This trend is even more obvious after the filtering process. As we can see in **Figure 1B**, the overall prediction accuracy improved after filtering in all the populations, as expected. However, the difference in prediction performance in Europeans vs. Africans is even more apparent. The four European populations have similar prediction accuracy, whereas it is lower for the African population. As in panel A, LCL-derived prediction models perform better than WB-derived models in filtered genes in **Figure 1B**.

Afterwards, we binned the genes into six categories based on the gene correlation coefficients (see **Table 3**). The majority of genes have very poor prediction accuracy—of the genes predicted with whole-blood databases, a third have negative correlations and a half have correlations between 0 and 0.2. Of the genes predicted with LCL, a fifth have negative correlations and over a third have correlations between 0 and 0.2. The distribution of gene correlation coefficients is fairly similar across the four European populations, although predictive accuracy seems worse in CEU compared to FIN, GBR, and TSI. The predictive accuracy is the lowest in the African sample. Across all populations, only a small number of genes were predicted with high accuracy (with r > 0.6). Furthermore, all European populations have a greater number of well-predicted genes than the African population, regardless of the weight database used.

Next, we assessed the association between the prediction accuracy (as gene correlation coefficients) and population category via repeated measures ANOVA and linear mixed models using both sets of genes, all and filtered. The results for unfiltered and filtered genes were comparable and led to equivalent conclusions. Based on the repeated measures ANOVA, we find that prediction accuracy differs across populations for filtered and unfiltered sets of genes, regardless of the weight database used (p-values for all databases were < 2.2 × 10<sup>−16</sup>). Below, we focus our attention on filtered genes and present the parameter estimates for model (1) and their 95% confidence intervals, calculated using model-based standard errors, in **Table 4**. From linear mixed model (1), we find that the prediction accuracy is significantly higher in FIN, GBR, and TSI and significantly lower in YRI, compared to CEU. This suggests that predictive performance varies not only among distant populations, but also among closely related populations. When we performed the analysis on the full set of genes, without any filtering, regression coefficients were slightly attenuated toward zero; however, the conclusions from hypothesis testing remained the same.

We repeated the analysis described above, this time excluding the CEU population. We present the parameter estimates and the corresponding 95% confidence intervals in **Table 5**. From the repeated measures ANOVA, we find that prediction accuracy differs across the four populations (p-values for all databases were < 2.2 × 10<sup>−16</sup>). Moreover, based on the coefficients and the corresponding p-values from linear mixed model (2), we estimate the prediction accuracy to be significantly higher in GBR and significantly lower in TSI and YRI, compared to the FIN population (see corresponding p-values in **Table 5**). This difference in prediction accuracy is the greatest between YRI and FIN when the GTEx v7 LCL weight database was used. As in the analysis above, we notice that predictive performance differs across populations, including European populations.

Finally, we evaluated PrediXcan prediction accuracy on a subset of subjects with European ancestry. Based on the repeated measures ANOVA test, prediction performance differs across the four European populations in genes before and after filtering, regardless of the weight database used (p-values for all databases were < 2.2 × 10<sup>−16</sup>). Because of potentially biased expression patterns of the CEU due to the previously mentioned age of these cell lines, we conducted an analysis where we omitted the CEU population and compared prediction accuracy among the other three European populations. The results were comparable to the analysis of the European populations that included CEU. With a repeated measures ANOVA, we find highly significant differences in prediction accuracy among the FIN, GBR, and TSI populations, with p-values less than 10<sup>−6</sup> across all weight databases with or without filtering of poorly predicted genes.

TABLE 3 | Gene counts per population, per database, per correlation category for the five populations using DGN, GTEx WB, and GTEx LCL weight databases.


TABLE 4 | Results from linear mixed models for population category (with CEU as a reference) and change in gene correlation coefficient among filtered genes.


TABLE 5 | Results from linear mixed models for population category (excluding CEU, with FIN as a reference) and change in gene correlation coefficient among filtered genes.


#### 3.3. PrediXcan Prediction Accuracy Differs Between Tissues

As can be seen in the violin plots in **Figure 1**, both databases based on whole blood perform similarly, while the LCL-based database displays improved prediction accuracy. In order to compare pairwise gene correlations, we restricted our analyses to the 1,595 genes common to both GTEx v7 WB and GTEx v7 LCL.

Scatter plots presented in **Figure 2** suggest that the majority of genes have very similar correlation coefficients when using WB and LCL databases across all populations. However, we see more genes in the upper left corner, above the dotted line, indicating that using the LCL database results in more genes with better prediction accuracy. This result is not surprising since the expression data we used were derived from LCL. The results of the paired t-test are consistent with the visual examination of the data: the mean difference between gene correlations based on the GTEx v7 LCL models and based on the GTEx v7 WB models is 0.03 (p-value < 2.2 × 10<sup>−16</sup>), with predictions based on the LCL model having higher performance.

#### 4. DISCUSSION

In this work, we evaluated the performance of PrediXcan and compared the prediction accuracy of the method across five geographically diverse populations from two continents for seven weight databases. Models from all weight databases considered were trained on subjects primarily of European ancestry; three of the databases were derived from LCL and the remaining four from whole blood. As a measure of prediction accuracy, we computed correlation coefficients for each gene in all populations and used both paired t-tests and linear mixed effects models to assess evidence of significant differences in prediction performance across populations. We also investigated whether whole blood models are appropriate for predicting gene expression levels in LCL.

We find highly significant differences in prediction accuracy with PrediXcan in the European ancestry populations as compared to the YRI African population, with the prediction accuracy being lower in YRI. The lower accuracy with PrediXcan in the African population is expected since the PrediXcan models were largely trained using European ancestry samples, and this result is consistent with recent works showing that prediction accuracy is expected to be higher when the training and testing cohorts are of similar ancestry (Gottlieb et al., 2017; Li et al., 2018; Mogil et al., 2018). Surprisingly, we also find highly significant differences in prediction accuracy with PrediXcan among the closely related European ancestry populations, with the Finnish, British, and Italian populations having significantly higher prediction accuracy than the CEU. These results are consistent across all seven PrediXcan weight databases we considered. Lastly, we also find that LCL-trained models outperformed whole-blood-trained models across populations, although the prediction accuracy was similar for many of the genes.

Among the European populations, we find that prediction accuracy for the CEU population was the lowest. LCLs are derived from B cells found in whole blood, and they provide a continuous supply of genetic material for GWAS and gene expression studies. However, they do undergo a transformation to become immortal that can change their biology and they do not have the same properties as native tissue (Kelly et al., 2017). Storage conditions, freeze-thaw cycles, and maturity of cell lines can also affect gene expression patterns (Çaliskan et al., 2014; Yuan et al., 2015). The CEU cell lines were collected much earlier than the other cell lines and LCL age can have a confounding effect and bias downstream analyses (Yuan et al., 2015). This factor could have contributed to the differences in prediction accuracy among European populations. We did, however, perform a sensitivity analysis that excluded the CEU population, and there were highly significant differences in prediction accuracy with PrediXcan among the FIN, GBR, and TSI populations, as well as between these three combined European populations and the YRI African population, with the YRI having the lowest accuracy.

Overall, PrediXcan accurately predicted gene expression for some genes; however, the majority of genes had very poor correlation between measured and predicted expression levels. For almost half the genes, for example, the correlation was negative. There are some important caveats and limitations to point out with the PrediXcan method. First, the prediction models of PrediXcan are based on common cis-variants and they do not take rare cis- and trans-regulatory elements into account. Common cis-eQTLs only account for 9–12% of genetic variance in gene expression, according to a large twin study (Grundberg et al., 2012). Another recent study demonstrates that trans-acting variants largely contribute to gene expression variation, with estimates of genetic variance in expression due to trans-acting variation ranging from 60 to 90% (Liu et al., 2018). However, individual effects of each trans-variant are very weak and difficult to map because they require well-powered studies.

We conclude this paper by highlighting that the lack of genomic data from diverse populations limits the ability to effectively interpret and translate genomic results into clinical applications for individuals from diverse populations, and particularly non-European ancestry populations. The results presented in this paper illustrate that gene expression prediction models are, in general, not transferable across diverse populations from different continents, and further corroborate the importance of including more ancestrally diverse individuals in medical genomics to ensure that everyone gets the benefits of precision medicine and to avoid further exacerbating healthcare inequality (Oh et al., 2015, 2016; Manrai et al., 2016). We also demonstrate that there can be differences in prediction accuracy among closely related European populations, suggesting that prediction models that take into account fine-scale ancestry differences among individuals may be important for improved prediction of gene expression from genetic data. Lastly, our study had only modest sample sizes and evaluated gene expression prediction accuracy with PrediXcan in European and African populations. Future transcriptomic studies with much larger sample sizes are needed for the development of improved gene expression prediction models for multi-ethnic populations, including admixed populations such as African Americans and Hispanic/Latino populations, who have recent ancestry derived from multiple continents.

### DATA AVAILABILITY

GEUVADIS expression data are available at ArrayExpress (E-MTAB-264 and E-GEUV-1) at https://www.ebi.ac.uk/arrayexpress/experiments/ and 1000 Genomes project genotype data are available at http://www.internationalgenome.org/.

### AUTHOR CONTRIBUTIONS

AM and TT conceived the idea, designed the analysis, interpreted the results, and wrote the paper. AM ran the analysis.

#### FUNDING

This work was supported by National Institutes of Health grant AG054074. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

### REFERENCES


### ACKNOWLEDGMENTS

We thank two reviewers for helpful comments and suggestions that improved the manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00261/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Mikhaylova and Thornton. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Systematic Review and Meta-Analysis Confirms Significant Contribution of Surfactant Protein D in Chronic Obstructive Pulmonary Disease

#### Debparna Nandy, Nidhi Sharma and Sabyasachi Senapati\*

*Department of Human Genetics and Molecular Medicine, Central University of Punjab, Bathinda, India*

#### Edited by:

*William Scott Bush, Case Western Reserve University, United States*

#### Reviewed by:

*Lifeng Tian, University of Pennsylvania, United States Lili Ding, Cincinnati Children's Hospital Medical Center, United States*

> \*Correspondence: *Sabyasachi Senapati sabyasachi1012@gmail.com; s.senapati@cup.edu.in*

#### Specialty section:

*This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics*

> Received: *13 July 2018* Accepted: *29 March 2019* Published: *17 April 2019*

#### Citation:

*Nandy D, Sharma N and Senapati S (2019) Systematic Review and Meta-Analysis Confirms Significant Contribution of Surfactant Protein D in Chronic Obstructive Pulmonary Disease. Front. Genet. 10:339. doi: 10.3389/fgene.2019.00339*

Background: Surfactant protein D (SFTPD) is a lung specific protein which performs several key regulatory processes to maintain overall lung function. Several infectious and immune mediated diseases have been shown to be associated with SFTPD. Recent findings have suggested the serum concentration of SFTPD can be used as a diagnostic or prognostic marker for chronic obstructive pulmonary disease (COPD) and acute exacerbation COPD (AECOPD). But these findings lack replication studies from different ethnic populations and meta-analysis, to establish SFTPD as reliable diagnostic or prognostic biomarker for COPD and associated conditions.

Methods: We performed a systematic literature search based on stringent inclusion and exclusion criteria to identify eligible studies for a meta-analysis. Our objective was to assess the ability of serum SFTPD concentration and SFTPD allelic conformation at rs721917 (C > T) to predict COPD and AECOPD outcome. These variables were compared between COPD patients and healthy controls, where mean difference (MD) and odds ratio (OR) were calculated to estimate the overall effect size. Review Manager (RevMan-v5.3) software was used to analyse the data.

Results: A total of eight published reports were included in this study. Comparative serum SFTPD concentration data were extracted from six studies and three studies were evaluated for assessment of genetic marker from SFTPD. Our study identified strong association of elevated serum SFTPD with COPD and AECOPD. Significant association of risk was also observed for "T" allele or "TT" genotype of rs721917 from SFTPD with COPD and AECOPD.

Conclusion: Serum concentration and allelic conformation of SFTPD have a significantly high predictive value for COPD and AECOPD. Thus, these can be tested further and could be applied as predictive or prognostic markers.

Keywords: SFTPD, COPD, AECOPD, rs721917, meta-analysis

## INTRODUCTION

Chronic Obstructive Pulmonary Disease (COPD) affects the lungs and exhibits irreversible airflow conditions that lead to improper respiratory function (Carolan et al., 2014). COPD is a global disease burden which accounts for ∼3 million deaths annually (Zemans et al., 2017) and is responsible for the increase in worldwide mortality and morbidity (Dickens et al., 2011). COPD is projected to be the third leading cause of death by 2020 (Dickens et al., 2011). COPD has multiple sub-phenotypic conditions like emphysema, lean body mass, mucus hypersecretion, and acute exacerbation (Dickens et al., 2011; Shakoori et al., 2012; Carolan et al., 2014). Each sub-phenotype is considered to be the outcome of different immune related pathways which are involved in COPD pathogenesis (Ishii et al., 2012).

Surfactant protein D (SF-D or SFTPD) is a highly lung specific glycoprotein secreted by type II alveolar cells and non-ciliated Clara cells and functionally involved in maintaining lung function (Shakoori et al., 2012; Akiki et al., 2016). This multimeric glycoprotein belongs to the lectin superfamily (Moreno et al., 2014) and takes part in immune regulation and maintenance of lung function (Ju et al., 2012). SFTPD is found to have three domains, namely the collagen-like domain, the neck domain, and the carbohydrate binding domain (Moreno et al., 2014). The carbohydrate binding domain is responsible for the maintenance of innate immune function in the lungs. Upon calcium binding, this calcium dependent protein cross-talks with defensin and other immunoregulatory molecules (Crouch and Wright, 2001; Jakel et al., 2013; Moreno et al., 2014). Due to its considerably high molecular stability, i.e., over 6 months in circulation, SFTPD has been investigated as a candidate biomarker for pulmonary function (Holmskov et al., 2003; Hoegh et al., 2010).

Most COPD patients belong to the stable COPD category (SCOPD) followed by acute exacerbation COPD (AECOPD). AECOPD is characterized by sudden worsening of respiratory conditions including secretion of greenish phlegm (Shakoori et al., 2009). Trends of elevated serum SFTPD concentration among AECOPD patients compared to SCOPD patients or healthy controls have been reported. Elevated serum SFTPD in the AECOPD patient group (n = 13; 227 ± 120 ng/ml), compared to the SCOPD group (n = 14; 151 ± 83 ng/ml) and the control group (n = 54; 127 ± 65 ng/ml), was reported among Pakistanis (Shakoori et al., 2009). In another case-control study on a Chinese population, a similar trend was observed, where serum SFTPD level was found to be significantly (p < 0.001) higher among AECOPD (n = 40; 235.22 ± 48.27 ng/ml) than SCOPD (n = 71; 153.54 ± 45.21 ng/ml) and control subjects (n = 60; 103.05 ± 24.97 ng/ml) (Ju et al., 2012). Serum SFTPD is often found to show association with different lung function parameters (Liu et al., 2014). Besides COPD, SFTPD is found to be associated with multiple pulmonary and other multifactorial diseases including lung cancer, interstitial pneumonia, asthma, viral infection, and other acute respiratory syndromes (Ishii et al., 2012; Carolan et al., 2014; Zemans et al., 2017). Serum SFTPD level can be pivotal in the diagnosis and monitoring of prognosis of various pulmonary conditions. Among COPD patients, serum concentration of SFTPD was found to be associated with the BODE (body mass index, airflow obstruction, dyspnea, exercise capacity) index of severity (Ju et al., 2012) and mortality (Celli et al., 2012). However, its association with COPD severity was not observed in several other studies (Lomas et al., 2009; Liu et al., 2014; Akiki et al., 2016).

Genetic variations in SFTPD have also been established as informative genetic markers for COPD in different populations (Shakoori et al., 2012; Fakih et al., 2018). A non-synonymous variation rs721917:c.92T>C (p.Met31Thr) is associated with altered serum concentration of SFTPD and its multimerization (Sorensen et al., 2009). Presence of "T" or "C" alleles of rs721917 codes for methionine or threonine amino acids, respectively, at 31st position of SFTPD protein. Degradation of SFTPD from its multimerized (high molecular weight) to non-multimerized (low molecular weight) form is associated with respiratory diseases, including COPD (Fakih et al., 2018). This variation was also found associated with COPD among Mexicans and Europeans and with emphysema among Japanese populations (Guo et al., 2001; Foreman et al., 2011; Ishii et al., 2012; Horimasu et al., 2014). Other intronic or synonymous variations (rs2245121, rs911887, rs6413520, and rs7078012) were also identified as associated with altered serum concentration of SFTPD among Europeans (NETT-NAS and ECLIPSE cohorts) (Foreman et al., 2011).

In this study, we performed a systematic review and meta-analysis with the objective of establishing the potential of a single biomarker, SFTPD, in identifying and stratifying the different COPD sub-phenotype(s). We attempt to establish an association of variation in serum SFTPD concentration and the SFTPD genetic variation rs721917 with COPD and its sub-phenotype(s). The rationale of this study is to increase the power by considering multiple studies in a meta-analysis with similar environmental conditions, thereby evaluating the significance of SFTPD as an important diagnostic biomarker.

### METHODOLOGY

### Identification and Eligibility of Relevant Studies

A search for eligible literature was carried out up to May 12th, 2018. Databases used for the retrieval of eligible articles were PubMed (along with the MeSH database) and Google Scholar. The following keywords were used to retrieve all the publications: "COPD and SFTPD"; "Chronic obstructive pulmonary disease and surfactant protein D"; "COPD and serum SFTPD"; "COPD and SFTPD genotypes." Publications with the desired keywords were selected. Further publications were added from the cross-references of the retrieved articles. Details are given in **Figure 1**.

### Study Inclusion/Exclusion Criteria

Scanning of publications with relevant titles and abstracts was done only for case-control studies encompassing COPD, and AECOPD as one of the major sub-phenotypes. No other sub-phenotypes of COPD were included in the study, in order to maintain the focus of the present meta-analysis. Only those studies were included where study participants were aged more than 35 years and all were smokers. We did not use sex as a selection criterion. For genetic polymorphisms, only studies with information about rs721917 were selected.

TABLE 1 | Summarized results for association of serum SFTPD concentration with COPD and AECOPD.

#### Data Extraction

Data extraction from the eligible publications was done by two investigators independently and conflicts were resolved through group discussions. The following data were extracted from the finally selected publications:

a. Protein biomarker: author names, number of participants, SFTPD serum/plasma level mean value (cases and controls), diagnostic criteria.

b. Genetic biomarker (rs721917): author names, number of participants, allele distribution among cases and controls, population, and diagnostic criteria.

### Statistical Analysis

For statistical analysis, Review Manager software (RevMan-v5.3; Copenhagen: The Nordic Cochrane Center, The Cochrane Collaboration, 2014) was used. As serum biomarker level is a continuous variable, the mean difference (MD) was calculated. For the genetic marker, the odds ratio (OR) for pooled data was calculated. Different genetic models, namely the allelic, dominant, recessive, and additive models, were used to analyze the association. Heterogeneity among studies was assessed using the I<sup>2</sup> and chi<sup>2</sup> tests, where I<sup>2</sup> of more than 50% and a chi<sup>2</sup> p-value < 0.05 were considered to indicate significant heterogeneity. Both analyses were done using a fixed effect model. Meta-OR or Meta-MD were calculated using the Z-test with a 5% level of significance and 95% confidence interval. Possible publication bias was evaluated through visual inspection of funnel plots generated using the same software.
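For orientation, the quantities named in this section can be written out in their generic inverse-variance form. This is a textbook sketch, not a description of the software's internals: the symbols ($\hat{\theta}_k$ for the effect estimate of study $k$, i.e., an MD or a log OR, $SE_k$ for its standard error, and $K$ for the number of studies) are introduced here for illustration, and the text does not state which fixed effect weighting scheme was selected for the odds ratios.

$$w_k = \frac{1}{SE_k^2}, \qquad \hat{\theta} = \frac{\sum_{k=1}^{K} w_k \hat{\theta}_k}{\sum_{k=1}^{K} w_k}, \qquad SE(\hat{\theta}) = \frac{1}{\sqrt{\sum_{k=1}^{K} w_k}}, \qquad Z = \frac{\hat{\theta}}{SE(\hat{\theta})}$$

$$Q = \sum_{k=1}^{K} w_k \left(\hat{\theta}_k - \hat{\theta}\right)^2, \qquad I^2 = \max\left(0, \frac{Q - (K - 1)}{Q}\right) \times 100\%$$

Under the fixed effect assumption, $Q$ is compared against a chi<sup>2</sup> distribution with $K - 1$ degrees of freedom, and the 95% confidence interval is $\hat{\theta} \pm 1.96 \, SE(\hat{\theta})$.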

### RESULTS

### Characteristics of Eligible Studies

Following the online literature search, a total of 97 publications were obtained. Additionally, seven publications were found through cross-references. However, based on our study inclusion-exclusion criteria, a total of 96 publications were excluded. Therefore, only eight publications were found eligible and taken forward for the meta-analysis (**Figure 1**). Eligible studies were reported between 2009 and 2017 (**Supplementary Tables 1**, **2**). Of these eight publications, six were assessed to evaluate the risk associated with SFTPD serum concentration and three were assessed to evaluate the risk associated with genetic variation in SFTPD (rs721917) for overall COPD and acute exacerbation of COPD (AECOPD). Detailed characteristics of these studies are presented in **Supplementary Tables 1**, **2**. No significant publication bias was observed among the studies included in this meta-analysis (**Supplementary Figure 1**).

### Association of Serum SFTPD With COPD

Serum concentration of SFTPD (mean and SD) was available for a total of 2,109 cases and 464 healthy controls reported in eligible studies. Elevated serum SFTPD values were found to be significantly associated [MD = 39.26 (36.97, 41.54); p\_Z < 0.00001] with overall COPD (**Table 1** and **Figure 2A**). Two of these studies were further assessed to evaluate the association of serum SFTPD level with AECOPD. This meta-analysis was performed on 53 cases and 114 controls. An elevated level of serum SFTPD was found to be associated with AECOPD [MD = 130.41 (114.62, 146.20); p\_Z < 0.00001] (**Table 1** and **Figure 2B**).
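The AECOPD estimate can be checked by hand with a short fixed effect, inverse-variance calculation. The sketch below rests on two assumptions that the text does not state outright: that the two pooled studies are the ones whose summary statistics are quoted in the Introduction (Shakoori et al., 2009; Ju et al., 2012; their AECOPD and control group sizes sum to 53 and 114), and that inverse-variance weighting was used. The function name `pooled_mean_difference` is hypothetical.

```python
import math


def pooled_mean_difference(studies, z_crit=1.96):
    """Fixed effect, inverse-variance pooling of mean differences.

    Each study is (n_case, mean_case, sd_case, n_ctrl, mean_ctrl, sd_ctrl).
    Returns (pooled MD, CI lower, CI upper, Z, I^2 in percent).
    """
    effects, weights = [], []
    for n1, m1, s1, n0, m0, s0 in studies:
        var = s1 ** 2 / n1 + s0 ** 2 / n0  # variance of the mean difference
        effects.append(m1 - m0)
        weights.append(1.0 / var)
    w_sum = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    # Cochran's Q and the I^2 heterogeneity statistic
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled - z_crit * se, pooled + z_crit * se, pooled / se, i2


# Serum SFTPD (ng/ml), AECOPD vs. control, as quoted in the Introduction.
aecopd_studies = [
    (13, 227.0, 120.0, 54, 127.0, 65.0),     # Shakoori et al., 2009
    (40, 235.22, 48.27, 60, 103.05, 24.97),  # Ju et al., 2012
]

md, lower, upper, z, i2 = pooled_mean_difference(aecopd_studies)
print(f"MD = {md:.2f} ({lower:.2f}, {upper:.2f}), Z = {z:.2f}, I2 = {i2:.0f}%")
```

Under these assumptions the pooled estimate and its 95% confidence interval agree with the reported MD of 130.41 (114.62, 146.20) to two decimal places, and the larger, more precise Chinese study dominates the weighting.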

### Association of SFTPD Genotype With COPD

All three reports on the association of SFTPD genetic variations with overall COPD were carried out on Asian populations. Cases in the Chinese and Lebanese populations were diagnosed according to both American Thoracic Society (ATS) and GOLD criteria, while cases in the Pakistani population were diagnosed solely on the basis of GOLD criteria (**Supplementary Table 2**). For rs721917, allelic and genotypic (dominant and recessive) associations are summarized in **Table 2**. Under the allelic model of association, the "T" allele was identified to confer risk for both COPD [OR = 1.34 (1.07–1.67); p\_Z = 0.01] and AECOPD [OR = 1.41 (1.09–1.83); p\_Z = 0.009] (**Figure 3**). Similarly, the dominant model identified an association of the "TT" genotype with both COPD [OR = 1.41 (1.00–1.99); p\_Z = 0.05] and AECOPD [OR = 1.50 (1.01–2.23); p\_Z = 0.04] with marginal significance. The recessive model confirmed the protective role of the "CC" genotype for both COPD [OR = 0.60 (0.39–0.94); p\_Z = 0.02] and AECOPD [OR = 0.55 (0.33–0.92); p\_Z = 0.02]. The dominant role of, and risk conferred by, the "T" allele was further confirmed by additive models of association (**Table 2**).
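To make the genetic models concrete, the sketch below shows how genotype counts for a biallelic SNP can be collapsed into 2x2 tables before an odds ratio is computed. The genotype counts are hypothetical, and each contrast is labeled explicitly because the text does not spell out which grouping it calls dominant and which recessive.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed and unexposed cases;
    c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)

def genetic_model_ors(cases, controls):
    """cases, controls: genotype counts keyed by 'TT', 'TC', 'CC'."""
    def allele_counts(g):
        # each homozygote carries two copies, each heterozygote one of each
        return 2 * g["TT"] + g["TC"], 2 * g["CC"] + g["TC"]

    t_cases, c_cases = allele_counts(cases)
    t_ctrls, c_ctrls = allele_counts(controls)
    return {
        "allelic: T vs C": odds_ratio(t_cases, c_cases, t_ctrls, c_ctrls),
        "TT+TC vs CC": odds_ratio(cases["TT"] + cases["TC"], cases["CC"],
                                  controls["TT"] + controls["TC"], controls["CC"]),
        "TT vs TC+CC": odds_ratio(cases["TT"], cases["TC"] + cases["CC"],
                                  controls["TT"], controls["TC"] + controls["CC"]),
        "CC vs TT+TC": odds_ratio(cases["CC"], cases["TT"] + cases["TC"],
                                  controls["CC"], controls["TT"] + controls["TC"]),
    }

# Hypothetical genotype counts, not taken from the included studies
ors = genetic_model_ors({"TT": 40, "TC": 40, "CC": 20},
                        {"TT": 30, "TC": 40, "CC": 30})
```

Note that the "CC vs TT+TC" contrast is the reciprocal of "TT+TC vs CC", so a risk effect for T carriers and a protective effect for CC homozygotes are two views of the same table.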

### DISCUSSION

Surfactant protein-D is a key innate immunity molecule with a significant role in host defense. It has been reported to be associated with several health conditions including COPD as well as various associated manifestations, such as AECOPD (Hartl and Griese, 2006). Recent genome-wide association studies have identified SFTPD as one of the most significant and well-replicated genes associated with COPD and its allied complications (Kim et al., 2012; Hobbs et al., 2017). The significant difference in SFTPD serum concentration between COPD patients and healthy controls can be used as a criterion for disease diagnosis and/or prognosis. So far, except for emphysema (alpha-1 antitrypsin), no other biomarker is available for the diagnosis of COPD or the evaluation of its prognosis and associated lung function. COPD and AECOPD are reported to be associated with elevated serum concentration of SFTPD (Lomas et al., 2009; Shakoori et al., 2009, 2012; Ju et al., 2012; El-Deek et al., 2013; Ozyurek et al., 2013). In contrast, emphysema patients are found to have lower levels of serum SFTPD than healthy people (Ishii et al., 2012).

TABLE 2 | Results of meta-analysis for alleles and genotypes of rs721917 under different genetic models.

*All analyses were done using a fixed effect model.*

This study is the first attempt to review and meta-analyze the existing published literature to assess the predictive value of SFTPD serum concentration or genotypes for COPD and AECOPD. Due to stringent study inclusion and exclusion criteria, only a limited number of articles were found eligible for this study. The present study identified strong associations of elevated serum SFTPD level with both COPD and AECOPD. Extracted data from all the eligible studies were homogeneous and no study selection bias was observed. As expected, serum SFTPD was observed to be more significantly associated with AECOPD (p < 0.00001; MD = 130.41) than with COPD (p < 0.0001; MD = 39.26) when compared to healthy controls. However, as this study was performed on reported case-control based cross-sectional studies, a cause-and-effect relationship between the serum SFTPD level and COPD could not be affirmed. Recent evidence confirmed that elevated serum SFTPD can be used as a prognostic marker for COPD, as its serum concentration has been found to be significantly elevated in AECOPD compared to COPD (Shakoori et al., 2012; Ou et al., 2015).

The present study suggests the use of SFTPD as a biomarker to evaluate COPD. Since different ranges of SFTPD concentration are found for COPD and AECOPD, a single biomarker can be used for the diagnosis of COPD and its prognosis. A scale of SFTPD serum concentration ranges could be developed to assess diagnosis and prognosis.

Limitations of this study include the small number of study populations, a consequence of the stringent inclusion and exclusion criteria. The population was considered as a whole and was not further stratified by ethnicity. Studies with larger cohorts need to be conducted to confirm the association of serum SFTPD and its allelic conformation with COPD and AECOPD. Furthermore, to generalize these findings, large-scale population-based replication studies are warranted.

#### AUTHOR CONTRIBUTIONS

SS conceptualized the study. DN and NS performed the systematic review and meta-analysis. SS, DN, and NS wrote the manuscript. All authors reviewed and finalized the manuscript for submission.

### FUNDING

Department of Science and Technology—Science and Engineering Research Board (ECR/2016/001660), New Delhi, India and University Grants Commission (F.30-4/2014(BSR)), New Delhi, India.

#### ACKNOWLEDGMENTS

Dr. Kavita Singh, Public Health Foundation of India, Gurugram, India, for helping in data analysis.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00339/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Nandy, Sharma and Senapati. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Social Determinant of Health May Modify Genetic Associations for Blood Pressure: Evidence From a SNP by Education Interaction in an African American Population

#### Edited by:

C. Charles Gu, Washington University in St. Louis, United States

#### Reviewed by:

Tesfaye B. Mersha, Cincinnati Children's Hospital Medical Center, United States Kenneth M. Weiss, The Pennsylvania State University, United States

#### \*Correspondence:

Melinda C. Aldrich melinda.aldrich@vumc.org Dana C. Crawford dcc64@case.edu; dana.crawford@case.edu

#### Specialty section:

This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics

Received: 30 November 2018 Accepted: 18 April 2019 Published: 10 May 2019

#### Citation:

Hollister BM, Farber-Eger E, Aldrich MC and Crawford DC (2019) A Social Determinant of Health May Modify Genetic Associations for Blood Pressure: Evidence From a SNP by Education Interaction in an African American Population. Front. Genet. 10:428. doi: 10.3389/fgene.2019.00428

#### Brittany M. Hollister<sup>1</sup> , Eric Farber-Eger<sup>2</sup> , Melinda C. Aldrich<sup>3</sup> \* and Dana C. Crawford<sup>4</sup> \*

<sup>1</sup> Social and Behavioral Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States, <sup>2</sup> Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States, <sup>3</sup> Department of Thoracic Surgery, Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, United States, <sup>4</sup> Department of Population and Quantitative Health Sciences, Cleveland Institute for Computational Biology, Case Western Reserve University, Cleveland, OH, United States

African Americans experience the highest burden of hypertension in the United States compared with other groups. Genetic contributions to this complex condition are now emerging in this as well as other populations through large-scale genome-wide association studies (GWAS) and meta-analyses. Despite these recent discovery efforts, relatively few large-scale studies of blood pressure have considered the joint influence of genetics and social determinants of health, even though extensive evidence supports their impact on hypertension. To identify these expected interactions, we accessed a subset of the Vanderbilt University Medical Center (VUMC) biorepository linked to de-identified electronic health records (EHRs) of adult African Americans genotyped using the Illumina Metabochip (n = 2,577). To examine potential interactions between education, a recognized social determinant of health, and genetic variants contributing to blood pressure, we used linear regression models to investigate two-way interactions for systolic blood pressure (SBP) and diastolic blood pressure (DBP). We identified a two-way interaction between rs6687976 and education affecting DBP (p = 0.052). Individuals homozygous for the minor allele and having less than a high school education had higher DBP compared with (1) individuals homozygous for the minor allele and high school education or greater and (2) individuals not homozygous for the minor allele and less than a high school education. To our knowledge, this is the first EHR-based study to suggest a gene-environment interaction for blood pressure in African Americans, supporting the hypothesis that genetic contributions to hypertension may be modulated by social factors.

Keywords: electronic health records, social determinants of health, African Americans, blood pressure, gene-environment, education

## INTRODUCTION


African Americans have a higher prevalence of hypertension, or chronically high blood pressure, compared with other racial/ethnic groups (Yoon et al., 2015; Writing Group Members et al., 2016). Despite this higher burden of disease in African Americans, early genome-wide association studies (GWAS) for hypertension and systolic blood pressure (SBP) and diastolic blood pressure (DBP) were limited to populations of European-descent (Levy et al., 2009; Newton-Cheh et al., 2009; Wang et al., 2009; International Consortium for Blood Pressure Genome-Wide Association Studies et al., 2011) or East Asian-descent (Kato et al., 2011). More recent GWAS have been performed in ancestrally diverse populations, including African Americans or African-descent populations (Adeyemo et al., 2009; Zhu et al., 2011, 2015; Kidambi et al., 2012; Franceschini et al., 2013; Hoffmann et al., 2017; Liang et al., 2017). Collectively, these associated common variants explain 3–6% of the variance for SBP and DBP, and in the largest European-descent study to date account for up to 27% of the estimated single nucleotide polymorphism (SNP)-wide heritability for these traits (Evangelou et al., 2018).

Current GWAS findings explain only a proportion of the expected contribution from additive genetic effects. Previous twin and family studies estimate these traits have moderate to high heritability (30–70%) (Fagard et al., 1995; Rotimi et al., 1999; Levy et al., 2000; Hottenga et al., 2005; Kupper et al., 2005), suggesting that additional genetic associations have yet to be discovered. Given that GWAS identify common single nucleotide variants (SNVs) for association, additional genetic associations may be found among rare SNVs (Doris, 2011; Russo et al., 2018). Importantly, most GWAS consider only main effects and do not consider interactions with relevant environmental exposures. Two recent and large GWAS of blood pressure have considered alcohol consumption (Feitosa et al., 2018) and smoking (Sung et al., 2018), both of which identified novel putative associations for these traits.

Here, we examine the modifying effects of education, a measure of socioeconomic status (SES) and recognized social determinant of health, on SBP and DBP traits among African Americans drawn from a clinical setting. Previous epidemiologic studies suggest that in addition to alcohol consumption and smoking, social environment and specifically SES has a strong influence on blood pressure and hypertension (Seeman et al., 2008; Cha et al., 2012; Non et al., 2012). Further, a GWAS in the Framingham Heart Study accounting for educational attainment identified novel associations for blood pressure traits among European Americans (Basson et al., 2014). Based on these prior findings, we hypothesized that educational attainment modifies associations between genetic variants and blood pressure among African Americans. To test this hypothesis, we accessed a large biobank linked to electronic health records (EHRs) in a racially diverse clinical population. We identified two associated SNPs, ARHGAP22 rs4593967 (SBP) and IQCK rs950928 (DBP), neither of which has been previously associated with blood pressure. We also identified a novel SNP-education interaction affecting DBP, suggesting social determinants of health may modify genetic effects contributing to complex human traits.

### MATERIALS AND METHODS

### Study Population and Data Collection

The study population is derived from BioVU, a DNA biobank of the Vanderbilt University Medical Center (VUMC) linked to de-identified EHRs. DNA samples are extracted from discarded blood samples drawn for routine clinical care (Roden et al., 2008). These samples are linked to the Synthetic Derivative (SD), the de-identified version of the VUMC EHR. Medical records within the SD are scrubbed of all Health Insurance Portability and Accountability Act (HIPAA) identifiers. This study was approved by the Vanderbilt University Institutional Review Board.

The study population consists of African American adults >18 years old drawn from a larger study of minority patients with DNA samples in BioVU (n = 15,863) (Crawford et al., 2015). We extracted relevant demographic variables, including race/ethnicity, sex, and age at data extraction available in the SD. Smoking status was extracted using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) tobacco use codes as previously described (Wiley et al., 2013). Education was extracted from the free text of EHRs using a recently validated text-mining algorithm (Hollister et al., 2016). Education was modeled as a categorical variable: less than high school, high school, and some college or above. All weight and height measures were extracted from the EHR, and after extensive quality control, as described in Goodloe et al. (2017), median values were used to represent individual-level body mass index (BMI).

The median value of all blood pressure measurements within an individual's EHR prior to a recording of blood pressure-altering medications in the patient's medication list was used in analyses. Medications included in the keyword list of anti-hypertensives were angiotensin converting enzyme inhibitors, angiotensin II receptor blockers, beta blockers, non-dihydropyridine calcium channel blockers, hydralazine, minoxidil, central alpha antagonists, direct renin antagonists, aldosterone antagonists, alpha antagonists, and diuretics including thiazides, K-sparing, and loop diuretics. Any blood pressure measurement found after any mention of these types of medications was excluded from analyses.
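The premedication phenotype described above can be sketched as a small function. This is an illustration under stated assumptions, not the authors' extraction code: the function name and record layout are hypothetical, and treating a measurement taken on the same day as the first medication mention as post-medication is an assumption.

```python
from datetime import date
from statistics import median

def premedication_median(bp_readings, medication_dates):
    """Median of blood pressure values recorded before the first mention
    of a blood pressure-altering medication.

    bp_readings: list of (date, value) pairs from one individual's record.
    medication_dates: dates on which a listed medication is mentioned.
    Returns None if no premedication reading exists.
    """
    first_med = min(medication_dates) if medication_dates else None
    kept = [value for when, value in bp_readings
            if first_med is None or when < first_med]
    return median(kept) if kept else None

# Hypothetical record: the third reading follows a medication mention
readings = [(date(2010, 1, 1), 120), (date(2010, 6, 1), 130),
            (date(2011, 1, 1), 150)]
sbp = premedication_median(readings, [date(2010, 12, 1)])
```

Individuals for whom the function returns None would have no usable premedication value and would drop out under a complete-data requirement.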

### Genotyping and Quality Control

Genotyping of 15,863 DNA samples from non-European descent individuals was performed using the Metabochip, a custom Illumina genotyping array designed to target SNPs and surrounding genomic regions associated with metabolic traits and cardiovascular disease (Buyske et al., 2012; Voight et al., 2012). We restricted the following quality control and statistical analyses to DNA samples from African Americans in BioVU (n = 11,301). All genotyping quality control was performed using PLINK 1.9 (Chang et al., 2015). After the removal of SNPs with a minor allele frequency of less than 5%, SNPs with a Hardy-Weinberg Equilibrium exact test p-value of less than 1 × 10<sup>−7</sup>, and SNPs with a genotyping call rate of less than 95%, a total of 115,834 variants remained (**Supplementary Figure S1**). We further removed 967 samples for either ambiguous sex, missing genotypes (>5%), or relatedness (twins, full siblings, parent/offspring) (**Supplementary Figure S1**). A total of 10,334 DNA samples passed genotyping quality control. After quality control, global ancestry was estimated using unsupervised ADMIXTURE analysis, assuming K = 2 (Alexander et al., 2009). Linkage disequilibrium (r<sup>2</sup>) was calculated using 1000 Genomes Phase 3 data and an expectation-maximization algorithm adapted from Haploview (Barrett et al., 2005) available through rAggr (Edlund et al., 2017).
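The variant filters described above amount to three threshold checks per SNP. The sketch below restates them in plain Python for clarity; it is not the PLINK workflow used in the study, and the SNP records and identifiers are hypothetical.

```python
MAF_MIN = 0.05        # remove SNPs with minor allele frequency below 5%
HWE_P_MIN = 1e-7      # remove SNPs with HWE exact test p-value below 1e-7
CALL_RATE_MIN = 0.95  # remove SNPs with genotyping call rate below 95%

def passes_snp_qc(maf, hwe_p, call_rate):
    """True if a SNP survives all three variant-level filters."""
    return (maf >= MAF_MIN and hwe_p >= HWE_P_MIN
            and call_rate >= CALL_RATE_MIN)

# Hypothetical per-SNP summaries
snps = [
    {"id": "snp_a", "maf": 0.12, "hwe_p": 0.40, "call_rate": 0.99},
    {"id": "snp_b", "maf": 0.01, "hwe_p": 0.40, "call_rate": 0.99},
    {"id": "snp_c", "maf": 0.20, "hwe_p": 1e-9, "call_rate": 0.99},
    {"id": "snp_d", "maf": 0.20, "hwe_p": 0.40, "call_rate": 0.90},
]
kept = [s["id"] for s in snps
        if passes_snp_qc(s["maf"], s["hwe_p"], s["call_rate"])]

# Sample-level bookkeeping reported in the text
samples_after_qc = 11_301 - 967
```

The sample arithmetic matches the 10,334 DNA samples reported as passing genotyping quality control.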

Local ancestry for rs6687976 was assigned as previously described (Fish et al., 2018). Briefly, SHAPEITv2 (Delaneau et al., 2013) and the 1000 Genomes Phase 3 reference panel<sup>1</sup> were used to phase the genotype data. RFMix (Maples et al., 2013) was used to assign local ancestry. Phased chromosomal haplotypes were matched to Yoruba and CEPH/European ancestral population panels from 1000 Genomes.

#### Statistical Analysis

Inclusion criteria included African American adults with available Metabochip genotyping data and complete information on age, sex, BMI, premedication SBP, premedication DBP, smoking status, and education level. A total of 2,577 African Americans met genotyping quality control and had relevant covariates (**Supplementary Figure S2**). All statistical analyses were performed using PLINK 1.9 (Chang et al., 2015) or R (R Core Team, 2008). Linear regression models were used to identify genetic variants associated with either premedication SBP or premedication DBP. A Bonferroni adjusted p-value of 4.32 × 10<sup>−7</sup> was used to determine significance. A main effects model included covariates for age, age squared, sex, BMI, smoking status, and percent global African ancestry:

$$\text{Premedication SBP or DBP} = \beta_0 + \beta_{\text{cov}} X_{\text{cov}} + \beta_1 \, \text{SNP} + e$$

A second main effects model included the same covariates, but also included education. To examine the interaction between genetic variants and education and how it may affect blood pressure, we modeled two-way interactions using a linear regression model and the same covariates as in our main effects model:

$$\text{Premedication SBP or DBP} = \beta_0 + \beta_{\text{cov}} X_{\text{cov}} + \beta_1 \, \text{SNP} + \beta_2 \, \text{Education} + \beta_3 \, \text{SNP} \times \text{Education} + e$$
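To show what the interaction term contributes, the sketch below builds one row of a design matrix for the interaction model and derives the per-allele SNP effect within each education stratum. Several details are assumptions made for illustration: the SNP is coded additively as 0, 1, or 2 minor alleles, education is dummy coded with less than high school as the reference level, and all names and coefficient values are hypothetical.

```python
EDU_LEVELS = ("less_than_hs", "high_school", "college_plus")  # first = reference

def design_row(snp, education, covariates):
    """Row layout: intercept, covariates, SNP, education dummies,
    SNP x education products."""
    edu = [1.0 if education == level else 0.0 for level in EDU_LEVELS[1:]]
    interaction = [snp * e for e in edu]
    return [1.0] + list(covariates) + [float(snp)] + edu + interaction

def predict(beta, row):
    """Fitted value: dot product of coefficients and the design row."""
    return sum(b * x for b, x in zip(beta, row))

def snp_slope(beta_snp, beta_interaction, education):
    """Per-allele SNP effect within one education stratum."""
    if education == EDU_LEVELS[0]:
        return beta_snp  # reference stratum: main effect only
    return beta_snp + beta_interaction[EDU_LEVELS.index(education) - 1]

# Hypothetical example: one covariate (age 40), two minor alleles
row = design_row(2, "high_school", [40.0])
slope = snp_slope(3.0, [-2.0, -2.5], "college_plus")
```

With this coding, a non-zero interaction coefficient means the SNP effect in that education stratum differs from the effect in the reference stratum, which is the pattern the interaction test looks for.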

The decision was made to focus on a set of SNPs which had a p-value of less than 1.4 × 10<sup>−5</sup> from the main effects model to reduce issues with multiple testing. This significance threshold was chosen based on a Bonferroni correction for the number of SNPs that would remain if SNPs with an r<sup>2</sup> value of greater than 0.1 were removed from our dataset. For this set of significant SNPs, we used a model which included the main effects of education and the SNP, as well as the interaction term between education and the genetic variants. The significance threshold for the interaction models was based on the number of SNPs tested for association with premedication SBP and DBP (p < 0.01 and p < 0.003, respectively).
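The thresholds in this section follow from simple Bonferroni arithmetic, sketched below. A family-wise alpha of 0.05 is assumed, since the text does not state it explicitly, and the helper name is illustrative.

```python
ALPHA = 0.05  # assumed family-wise error rate

def bonferroni(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Main effects models: one test per variant passing quality control
main_threshold = bonferroni(ALPHA, 115_834)

# Number of tests implied by the 1.4e-5 threshold used for SNP selection
implied_pruned_snps = round(ALPHA / 1.4e-5)
```

Under the same assumption, the interaction thresholds of 0.01 and 0.003 would correspond to roughly 5 and 17 tested SNPs for SBP and DBP, respectively; the exact counts are given by the authors' tables rather than by this calculation.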

### RESULTS

#### Population Characteristics

The final study population for analysis included 2,577 African American adults with Metabochip genotyping data and complete phenotype data (**Supplementary Figures S1, S2**). Among this study population, the majority were female (71%) with a median age of 38 years and median BMI of 26.8 kg/m<sup>2</sup> (**Table 1**). Compared with a larger African American BioVU population genotyped on the Metabochip (Crawford et al., 2015), this subset had proportionally more females, was younger, and had a lower median BMI. The median premedication SBP and DBP were within the normal clinical range (122 and 74 mmHg, respectively) and most of the population were never smokers (87%; **Table 1**). The median percent global African ancestry was 81.7%. The majority of participants had at least a high school degree (**Table 1**). The median premedication SBP and DBP in this final study sample were statistically different (p < 0.05) from the larger study sample missing education data in the EHR but varied by only 3 mmHg (**Supplementary Table S1**).

TABLE 1 | Study population characteristics representing African American adults from a biobank with electronic health record (EHR)-extracted blood pressure.


The population in this study was a subset of African Americans from the Vanderbilt University Medical Center (VUMC) biobank, BioVU. Samples were drawn from BioVU in 2011. All individuals had Metabochip genotype data which passed quality control measures. Individuals also had complete phenotype data which included age, sex, education level, smoking status, median body mass index (BMI), median premedication systolic blood pressure (SBP), and median diastolic blood pressure (DBP). These phenotypes were derived from the electronic health record. African ancestry was determined using ADMIXTURE. SD, standard deviation.

<sup>1</sup>https://mathgen.stats.ox.ac.uk/impute/impute\_v2.html#reference

### Predictors of Systolic and Diastolic Blood Pressure

In univariate analyses (**Table 2**), both premedication SBP and DBP were significantly associated with increasing age, male sex, and increasing BMI. SBP increased with age and DBP increased with age until around the age of 60, then began decreasing (**Supplementary Figures S3, S4**).

Neither premedication SBP nor premedication DBP was associated with smoking status or global African ancestry. Also, education was not significantly associated with either premedication SBP or premedication DBP (**Table 2** and **Supplementary Figures S5, S6**). Of all the variables tested, age and premedication DBP significantly co-varied with education (**Supplementary Table S2**).

### Education as a Modifier of Genetic Associations With Systolic and Diastolic Blood Pressure

To test for possible interactive effects between education and genetic variants associated with SBP and DBP, we examined three models: (1) initial single SNP tests of association without education as a covariate, (2) single SNP tests of association with education as a covariate, and (3) single SNP tests of association with SNP × education interaction terms. In the first model, single SNP tests of association were performed for SBP and DBP using linear regression adjusting for age, age squared, sex, BMI, smoking status, and percent global African ancestry. For both SBP (**Supplementary Figure S7**) and DBP (**Supplementary Figure S8**), only a single SNP was statistically significant using a Bonferroni correction (p < 4.32 × 10<sup>−7</sup>): ARHGAP22 rs4593967 and IQCK rs950928, respectively.

The second set of models included education in addition to other relevant covariates (**Supplementary Figures S9, S10**). The addition of education to the model did not change the most significantly associated SNPs for either SBP or DBP (**Table 3**). In the regression model for SBP that included education, rs4593967 again passed Bonferroni correction (p < 4.32 × 10<sup>−7</sup>), and two other SNPs (rs10921895 and rs3804485) were associated at a suggestive significance threshold (p < 7.24 × 10<sup>−6</sup>). For DBP, rs950928 and rs8056711 passed Bonferroni correction. However, these SNPs have the same effect size and are in perfect linkage disequilibrium (r<sup>2</sup> = 1.0), so they likely represent the same association.

In the final set of models, education × SNP interaction terms were examined using SNPs associated with SBP or DBP at p < 1.4 × 10<sup>−5</sup>, as described above. No interaction terms met a strict Bonferroni correction (**Supplementary Figures S11, S12**). However, we identified a potential SNP-education interaction affecting DBP, rs6687976 (p = 0.052; **Table 4**). This potential interaction remained with the addition of local ancestry to the model. Individuals homozygous for the minor allele and having less than a high school education had higher DBP compared with (1) individuals homozygous for the minor allele and high school education or greater and (2) individuals not homozygous for the minor allele and less than a high school education (**Supplementary Figure S13**). No statistically significant interactions were identified for SBP (**Table 4**).

## DISCUSSION

We sought to determine if education, a measure of SES and a recognized social determinant of health, modified genetic associations with SBP and DBP in African Americans. A previous study suggested gene × education interactions occur with blood pressure, but this study was conducted in a European-descent population (Basson et al., 2014). Associations between premedication SBP or premedication DBP and genetic variants from the Metabochip were examined, while including known predictors of blood pressure (age, BMI, sex, percent African ancestry, and smoking status) in the model. Results were compared with models which included a main effect for education, and a main effect for education plus a SNP-education interaction term. We observed a suggestive SNP by education interaction affecting DBP, a result not explained by local genetic ancestry. This potential interaction requires statistical replication and further investigation.

TABLE 2 | Univariate analyses between relevant covariates and median premedication blood pressure values among African American adults.


Prior to genetic analyses, covariates were examined to determine their association with the outcomes, premedication systolic (SBP) and diastolic blood pressure (DBP) in the study population, a subset of African Americans drawn from the Vanderbilt University Medical Center biobank BioVU (n = 2,577). Each linear regression model had either median premedication SBP or median premedication DBP as the outcome. The covariates included in each model were education level, median age, sex, body mass index (BMI), smoking status, and global African ancestry. Both premedication SBP and DBP were significantly associated with age, sex, and BMI. Premedication DBP is also significantly associated with education level. The symbol "<sup>∗</sup>" indicates statistical significance.

TABLE 3 | Characteristics of single nucleotide polymorphisms (SNPs) associated with premedication systolic and diastolic blood pressure with and without education in the model.


In both sets of linear regression models, median premedication systolic blood pressure (SBP) and median premedication diastolic blood pressure (DBP) were the outcomes. Additionally, both sets of linear regression models included age, age squared, sex, median body mass index (BMI), smoking status, and median percent global African ancestry as covariates. The first set of models did not include education level. The second set of models included education. The addition of education to the model did not change which SNPs were most associated with SBP or DBP. Bolded p-values are considered statistically significant after Bonferroni correction.

TABLE 4 | Single nucleotide polymorphisms (SNPs) examined for interactions with education level impacting median premedication systolic and diastolic blood pressure.


Median premedication systolic blood pressure (SBP) and diastolic blood pressure (DBP) were outcomes in the linear regression models. Covariates included in the models were age, age squared, sex, body mass index, smoking status, and African ancestry. The main effect of education and the SNP, as well as the SNP × education interaction term were also included in the model. Less than high school was the reference group within the regression models. The p-value for the potential SNP-education interaction is bolded.

#### Models Without Interaction

In univariate analyses the associations between premedication SBP and DBP and increasing age, male sex as well as increasing BMI were consistent with previous reports (August, 1999; Wright et al., 2011; Dua et al., 2014). The patterns of associations between SBP and DBP and age across the age continuum are also consistent with previous reports (Liang et al., 2017; Evangelou et al., 2018).

Intronic ARHGAP22 rs4593967 was significantly associated with SBP and has not been previously reported as associated with blood pressure or hypertension. The minor allele frequency for ARHGAP22 rs4593967 in this African American sample was 0.14, consistent with frequencies reported for African-descent populations included in The Genome Aggregation Database (0.148; Lek et al., 2016) and the 1000 Genomes Project (0.176; 1000 Genomes Project Consortium et al., 2015). Conversely, the minor allele is less frequently observed among populations of European (∼0.08) or East Asian-descent (<0.01). No other common (MAF > 1%) variants within 500 kb are in strong linkage disequilibrium (r<sup>2</sup> ≥ 0.80) with rs4593967 in African-descent populations from the 1000 Genomes Project. ARHGAP22 encodes the rho GTPase activating protein 22 and is widely expressed with highest expression levels in the brain. Variants within ARHGAP22 have been associated with diabetic retinopathy, conduct disorder, daytime sleep, and self-employment (Dick et al., 2011; Huang et al., 2011; Van Der Loos et al., 2013; Spada et al., 2016), but these associations have not been replicated.

Intronic IQCK rs950928 was significantly associated with DBP after adjusting for multiple testing. Like ARHGAP22 rs4593967, the minor allele frequency for IQCK rs950928 is higher among populations of African-descent (∼0.40) compared with European-descent populations (∼0.15). IQCK rs950928 is in perfect or strong linkage disequilibrium with rs8056711 and rs59009734 in African-descent populations, neither of which has been previously associated with human disease or traits. IQCK, which overlaps with several genes including KNOP1, encodes IQ motif containing K and serves as an EF hand protein binding site. Like ARHGAP22, IQCK is highly expressed in the brain. A search within the Genotype-Tissue Expression (GTEx Consortium, 2013) database suggests that both rs8056711 and rs59009734 may be expression quantitative trait loci (eQTL), where each additional copy of the minor allele is associated with higher gene expression in several tissues including the right atrium auricular region of the heart and the aorta. While IQCK rs950928 and its associated SNPs rs8056711 and rs59009734 have not been previously associated with any phenotypes, common variants within IQCK have previously been associated with blood pressure, BMI, bone density, heart rate, chronic obstructive pulmonary disease, bipolar disorder, and a BMI-education interaction (Cho et al., 2009; Liu et al., 2010; Wan et al., 2011; Boardman et al., 2014; Winham et al., 2014).

Despite the present study's small sample size (n = 2,577), there was sufficient power (80%) to detect significant associations with moderate effect size of 1.0 and a minor allele frequency of 0.20. For less common variants (MAF = 0.10), the study was powered to detect alleles with an effect size of 1.5 or greater. For variants with a MAF of 0.05, an effect size of 2.0 was needed in order to detect the variant's effect. This study was not powered to detect any of the variants reported in the recent one million-person GWAS of blood pressure, as the variant with the largest effect size in that study was less than 1.0, with a median effect size of 0.219 mmHg (Evangelou et al., 2018). The limited power due to small sample size and limited directly genotyped variants likely contributed to the lack of replication of SNPs known to be associated with blood pressure in African Americans from previous GWAS.
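The way power depends on sample size, allele frequency, and effect size can be illustrated with a normal approximation for an additive SNP effect in linear regression. This sketch is not an attempt to reproduce the authors' calculation, whose software and assumed trait variance are not stated; the trait standard deviation of 10 mmHg and the function name are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def approx_power(n, maf, beta, trait_sd, alpha):
    """Approximate two-sided power to detect a per-allele effect `beta`
    (same units as `trait_sd`) for an additive SNP under Hardy-Weinberg
    equilibrium, using a normal approximation (lower tail ignored)."""
    normal = NormalDist()
    genotype_var = 2 * maf * (1 - maf)          # variance of 0/1/2 coding
    ncp = n * genotype_var * beta ** 2 / trait_sd ** 2
    z_crit = normal.inv_cdf(1 - alpha / 2)
    return normal.cdf(math.sqrt(ncp) - z_crit)

# Hypothetical comparison at the study's sample size and threshold
small = approx_power(2577, 0.20, 1.0, trait_sd=10.0, alpha=4.32e-7)
large = approx_power(2577, 0.20, 2.0, trait_sd=10.0, alpha=4.32e-7)
```

The comparison shows the qualitative point made in the text: at a fixed sample size and a stringent threshold, power falls off sharply as the effect size or the minor allele frequency decreases.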

### SNP × Education Interactions

We identified a possible SNP-education interaction affecting DBP for rs6687976 (p = 0.052). As the addition of local ancestry to the model did not alter the association, we expect that this observation is a result of true modifying effects of SES rather than ancestry. Individuals with two minor alleles and less than a high school education had higher blood pressure compared to those with two minor alleles and a high school education or those with less than a high school education and fewer minor alleles (**Supplementary Figure S13**). SNP rs6687976 is located within an intergenic region of chromosome 1 (Chr1:105674536 in GRCh37.p13) and has not been previously associated with any human traits within the literature. It is also not identified as an eQTL in GTEx (GTEx Consortium, 2013). Despite the limited information known about rs6687976, this result suggests that interactions between markers of social determinants of health and genetic variants affecting blood pressure likely exist, consistent with the findings of other studies that have observed interactions between genetic variants and social factors such as depression (Smith et al., 2017), perceived discrimination (Taylor et al., 2017), and cigarette smoking (Taylor et al., 2016).

### Limitations

The present study has several limitations. Primarily, the sample size is limited, driven by the inclusion criterion of complete phenotype data for a specific racial/ethnic group within the larger clinical dataset. Therefore, we are unable to detect variants with smaller effect sizes. The requirement for complete data may have also introduced biases that limit the interpretation and generalizability of these data.

In addition to the limited sample size, the study population was also different compared with previously published studies of blood pressure in African American populations. While the proportion of females, median BMI, percent African ancestry, median SBP, and median DBP were comparable with previous studies (Parra et al., 1998; Dumitrescu et al., 2015; Baharian et al., 2016; Franceschini et al., 2016; Jones et al., 2018; Restrepo et al., 2018), the population in this study did have a much lower median age, over 15 years younger. Given that blood pressure increases with age, this younger study population may have reduced variability in blood pressure measurements compared with the older published study populations with right-skewed distributions (Wright et al., 2011).

Another limitation was the lack of a replication dataset; therefore, all associations reported here are putative pending statistical replication or corroborative functional data. To date, other studies comparable or larger in sample size have not yet reported associations between these SNPs and blood pressure (Hoffmann et al., 2017). Furthermore, the genotyping array used here was also designed to include rare variation collected from the African ancestry samples as part of the 1000 Genomes Project. Therefore, many of the variants on the Metabochip were rare in African ancestry populations (Buyske et al., 2012) and filtered out during the quality control process as the present study was not powered to detect associations for rare variation.

There were also limitations regarding the phenotype data. All the variables were extracted from EHRs. While these records contain extensive amounts of data, the data recorded by healthcare providers are not always accurate, and the ability to extract the data can be limited. Furthermore, although the positive predictive value of our algorithm was 80% (Hollister et al., 2016), there may have been inaccurate education information for the individuals within the dataset.

Determining which blood pressure measurements to use in the study is also a challenge, as measurements can vary widely across the EHR. The median blood pressure measurements were chosen for our study to reduce the influence of this variation. Beyond the inaccuracies and decisions to be made regarding the information within the EHR, blood pressure is difficult to measure within the clinic. Measurements of blood pressure can vary due to the calibration of instruments, the time of day at which they are taken, and illness (Jones et al., 2003). Patients also tend to have higher blood pressure within a clinical setting due to stress (Jones et al., 2003). To avoid these potential biases as much as possible, we chose median premedication blood pressure values for analysis, thereby avoiding outlier measurements and the changes introduced by blood pressure medications.

Finally, while education is a recognized social determinant of health, it is not a perfect proxy for social experiences. Still, evidence suggests that educational attainment can be a reflection of earning potential and social status (Shavers, 2007; Tamborini et al., 2015). Education has been shown to be associated with life expectancy, numerous biomarkers, and other health outcomes such as obesity and smoking (Seeman et al., 2008; National Center for Health Statistics, 2012). Low educational attainment itself is not the cause of poor health outcomes, but rather a variable often associated with individual-level behavioral determinants (e.g., smoking) or community-level determinants (e.g., racial segregation) that may influence blood pressure. Neither of these determinants is routinely recorded in the EHR; in contrast, educational attainment is often mentioned in the EHR. The availability of these data, coupled with the observation that individual educational attainment is often stable over time, makes this variable a robust albeit imperfect proxy for social experiences.

### Strengths

Despite the limitations within the study, there were also several strengths. Primarily, this is the first study to incorporate EHR-derived education information into a large-scale genetic investigation. This study is a proof of principle that EHR-derived social determinant information can be investigated in a GWAS setting, thus breaking new ground to incorporate social factors in genetic studies among biobank populations. This is also the first analysis to observe an interaction between education and a common genetic variant with blood pressure in an African American population.

Despite the consistent association between social environment and health, social determinants of health are typically not included in genetic studies of health outcomes. For studies that access biobanks, the lack of social determinant data is likely related to the difficulty in accessing these data within the EHR, where they are not usually recorded in structured fields. The algorithms used in our study are the first to extract these important data from EHRs for research purposes (Hollister et al., 2016).

This study paves the road for the incorporation of education, as well as other social determinants of health, into genetic studies using biobank populations. The SNP-by-education interaction we observed affecting DBP (rs6687976) could suggest an example of a possible biological impact of the adversity experienced due to lower educational achievement. Only individuals homozygous for the minor allele who had less than a high school education experienced an increase in DBP. This association needs to be replicated; however, it suggests a potential pathway for the biological embedding of stress experiences (represented by lower educational attainment) affecting blood pressure and risk for hypertension. Further studies are needed to support this hypothesis. We anticipate that this research will encourage other investigators to continue to study the genetics of health outcomes associated with racial health disparities and to incorporate social determinants of health within these studies.

### AUTHOR CONTRIBUTIONS

BH conducted the analyses and wrote the manuscript. EF-E helped to extract the phenotype data from the electronic health record. MA and DC provided guidance on the project and contributed to manuscript writing and editing.

### FUNDING

This work was supported by the National Institutes of Health (NIH) U01 HG004798 and its ARRA supplements (DC), as well as National Cancer Institute 1K07CA172294 (MA). This publication was also made possible by the Clinical and Translational Science Collaborative of Cleveland, 4UL1TR0002548, from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. The dataset(s) used for the analyses described were obtained from Vanderbilt University Medical Center's BioVU, which is supported by institutional funding and the National Center for Research Resources, grant UL1 RR024975–01 (now at NCATS, grant 2UL1 TR000445–06).

### ACKNOWLEDGMENTS

We would like to thank Drs. Alex Fish and William Bush for access to the local genetic ancestry data, and we further thank Dr. Bush for helpful comments during the revision process.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00428/full#supplementary-material

### REFERENCES

in African Americans. PLoS Genet. 5:e1000564. doi: 10.1371/journal.pgen.1000564

Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664. doi: 10.1101/gr.094052.109

August, P. (1999). Hypertension in men. J. Clin. Endocrinol. Metab. 84, 3451–3454.

perceived discrimination on blood pressure among African Americans in the Jackson Heart Study. Medicine 96:e8369. doi: 10.1097/MD.0000000000008369


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling Editor declared a past co-authorship with one of the authors DC.

Copyright © 2019 Hollister, Farber-Eger, Aldrich and Crawford. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Systematic Review and Meta-Analysis to Establish the Association of Common Genetic Variations in Vitamin D Binding Protein With Chronic Obstructive Pulmonary Disease

#### Ritesh Khanna, Debparna Nandy and Sabyasachi Senapati\*

*Department of Human Genetics and Molecular Medicine, Central University of Punjab, Bathinda, India*

#### Edited by:

*William Scott Bush, Case Western Reserve University, United States*

#### Reviewed by:

*Lijun Ma, Wake Forest University, United States; Renata Ferrari, Universidade Estadual Paulista (UNESP), Brazil; Suzana Erico Tanni, Universidade Estadual Paulista (UNESP), Brazil*

> \*Correspondence: *Sabyasachi Senapati sabyasachi1012@gmail.com*

#### Specialty section:

*This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics*

> Received: *04 June 2018* Accepted: *16 April 2019* Published: *16 May 2019*

#### Citation:

*Khanna R, Nandy D and Senapati S (2019) Systematic Review and Meta-Analysis to Establish the Association of Common Genetic Variations in Vitamin D Binding Protein With Chronic Obstructive Pulmonary Disease. Front. Genet. 10:413. doi: 10.3389/fgene.2019.00413*

Background: Vitamin-D binding protein (DBP), also known as GC protein, is a major determinant of vitamin-D metabolism and transport. GC1F, GC1S, and GC2 are the three allelic variants of GC (defined by rs4588 and rs7041) and are known to be associated with chronic obstructive pulmonary disease (COPD). However, contradictory reports and the population-specific risk attributed to these alleles warranted a detailed genetic epidemiology study to establish the association between GC variants and COPD. In this study, we performed a meta-analysis and investigated the genetic architecture of the GC locus to establish the association and uncover plausible reasons for allelic heterogeneity.

Methods: Published cross-sectional case-control studies were screened, and a meta-analysis of GC variants and COPD outcome was performed. RevMan-v5.3 software was used to fit random and/or fixed effect models and calculate the pooled odds ratio (Meta-OR). Linkage disequilibrium (LD) and haplotypes at the GC locus were evaluated using 1000 Genomes genotype data. *In silico* functional implications of rs4588 and rs7041 were tested using publicly available tools.

Results: GC1F allele and GC1F/1F genotype were found to confer COPD risk in overall meta-analysis. GC1S/1S was found to confer risk only among Europeans. *In silico* investigation of rs4588 and rs7041 identified strong eQTL effects and potential role in regulation of GC expression. Large differences in allele frequencies, linkage disequilibrium (LD) and haplotypes were identified at GC locus across different populations (Japanese, African, Europeans, and Indians), which may explain the variable association of different GC alleles in different populations.

Conclusion: GC1F and GC1F/1F impose significant genetic risk for COPD among Asians. Considerable differences in allele frequencies and LD structure at the GC locus may impose population-specific risk.

Keywords: vitamin D-binding protein, COPD, meta-analysis, linkage disequilibrium, genetic polymorphisms, allelic heterogeneity

### INTRODUCTION

Chronic obstructive pulmonary disease (COPD) is a complex disease affecting lung function. Genetically susceptible individuals develop COPD when they are exposed to environmental triggers, such as noxious gases or suspended particles. A decreased level of vitamin-D in serum is associated with COPD among individuals with a history of smoking (Janssens et al., 2010). Besides environmental and genetic factors, metabolic factors are also critical, and these factors interact with each other in the pathogenesis of COPD (Rabe et al., 2007).

Vitamin-D binding protein (DBP), also known as group-specific component (GC), belongs to a gene cluster family which is expressed in liver and other tissues (Chishimba et al., 2010). As the name suggests, it is known for binding circulating vitamin-D<sub>3</sub> and transporting it from the liver to other tissues during its metabolism (Daiger et al., 1975). GC is a highly polymorphic gene, and three of its allelic variants, namely GC1F, GC1S, and GC2, have been studied extensively for their association with vitamin-D deficiency (VDD) and other diseases including COPD (Chishimba et al., 2010; Wood et al., 2011). These variants correspond to different allelic arrangements of rs7041 and rs4588 (**Table 1**). These GC protein variants are reported to differ in their affinity for vitamin-D<sub>3</sub>, i.e., 25(OH)D<sub>3</sub>, and thus affect its serum concentration (Arnaud and Constans, 1993; Janssens et al., 2010). The circulating level of vitamin-D<sub>3</sub> is regulated by the synthesis of 25(OH)D<sub>3</sub> and its enzymatic degradation by catabolizing enzymes. More than 90% of circulating 25(OH)D<sub>3</sub> is present in a tightly bound form (K<sub>d</sub> ∼ 10<sup>−9</sup> M) with GC proteins (Arnaud and Constans, 1993). Therefore, different GC isoforms influence the serum concentration/bioavailability of 25(OH)D<sub>3</sub>.

A study performed on a north Indian cohort has shown the homozygous GC1F variant to confer risk, with disease severity observed in a variant-specific, dose-dependent manner: the geometric mean of serum 25(OH)D<sub>3</sub> was observed in ascending order of GC genotypes 1F/1F < 1S/1F < 1S/1S < 1S/2 < 2/2 among COPD patients (Maheswari et al., 2014). Recent reports also indicate a protective role of the GC2 variant among healthy individuals. Similar reports were published by studies done among Caucasian and north Indian cohorts (Schellenberg et al., 1998; Berg et al., 2013; Maheswari et al., 2014; Chen et al., 2015). Al-Azzawi et al. confirmed similar study outcomes in an Egyptian cohort, where GC1F and GC1F/1S variants were found to be associated with low serum vitamin-D<sub>3</sub> concentration (Al-Azzawi et al., 2017). In contrast, studies done among the Korean population have shown an association of GC2 variants with COPD progression, where GC2 and GC1F/1S variants were shown to be associated with a higher emphysema index, irrespective of VDD. These studies also identified an association of GC2 and GC1F/1S variants with lower and higher serum concentrations of vitamin-D<sub>3</sub>, respectively. GC2 showed significant association with VDD (Jung et al., 2014; Park et al., 2016).

TABLE 1 | Allelic arrangements corresponding to the three different GC variants implicated in COPD.

GC protein (or DBP) is also involved in inflammation through its conversion into macrophage activating factor (MAF) in the presence of enzymes secreted by leucocytes. The conversion of GC into GC-MAF has been found to be a deglycosylation process. Absence of a glycosylated Lys residue at position 420 in GC2 variants makes them an inappropriate reactant for the deglycosylation process, which makes them protective for COPD (Maheswari et al., 2014). Vitamin-D<sub>3</sub> is known to inhibit the expression of matrix metalloproteinases (MMPs), which are responsible for the degradation of lung alveoli in emphysema. Thus, an optimal serum concentration of vitamin-D<sub>3</sub> is critical among emphysema patients, and a trial of serum vitamin-D<sub>3</sub> intervention in a large participant group could further elucidate its role in COPD progression (Berg et al., 2013). Serum vitamin-D<sub>3</sub> is also found to have seasonal and geographical variations, which depend on the amount of sunlight reaching the skin (Jung et al., 2014; Al-Azzawi et al., 2017). The ECLIPSE cohort study did not find an association between serum DBP and emphysema or lung function, although a negative correlation was found between DBP and serum 25(OH)D<sub>3</sub> level (Berg et al., 2013). In contrast, another study in an alpha1-antitrypsin deficient Caucasian population showed an association of serum DBP with COPD conditions (Wood et al., 2011). A recent report indicated a strong relationship between serum 25(OH)D<sub>3</sub> and pulmonary function (FEV1 and FVC) in a well-defined COPD cohort (Janssens et al., 2010).

It is evident that GC is a major determinant of several health parameters, including those associated with COPD. However, contradictory findings on the association of different alleles with COPD, and non-replication across different populations, warranted further meta-analysis and detailed population genetics studies. In the present study, we aimed to explain the association of known GC alleles with COPD and to investigate the genetic and functional aspects of GC alleles. The locus architecture of different populations was also investigated to explain the non-replication/differential replication of GC alleles in different populations. We hypothesized that the genetic architecture at the GC locus leads to population-specific allelic variation in GC and its association with COPD. The study was performed with the following specific objectives: (i) perform a meta-analysis to establish the association of commonly studied GC alleles with COPD, and (ii) investigate the genetic heterogeneity at a functionally relevant GC locus that explains variability in GC protein and COPD.

## MATERIALS AND METHODS

### Literature Retrieval

Our objective was to identify research articles in which the genetic association of GC with COPD had been tested. We restricted our study to three major genetic polymorphisms of GC, namely the GC1F, GC1S, and GC2 alleles. Literature was searched online in the National Center for Biotechnology Information (NCBI-PubMed), Google Scholar and Medline. The major search language for the literature was English; papers in other languages were translated for further review. To obtain the best quality outcome, we included only peer-reviewed scientific literature. Literature was searched until May 2018. The keywords used for the literature search were as follows: Vitamin D binding protein and chronic obstructive pulmonary disease, DBP and COPD, GC alleles and COPD, COPD association GC. Cross references were also reviewed, and references from the retrieved articles were checked manually to find any relevant articles.

### Inclusion and Exclusion Criteria

Only case-control studies were included in this meta-analysis. Only those studies were included in which different alleles (1F, 1S, and 2) and genotypes (1F/1F, 1S/1S, 2/2, 1F/1S, 1F/2, 1S/2) of GC were studied for their association with COPD. Included studies clearly reported either the actual numbers or the percentages of cases and controls with different genotypes and alleles of GC. Included studies had both smokers and non-smokers among both cases and controls.

### Data Extraction

Data were extracted from eligible articles by two investigators independently, and differences and controversies were resolved by group discussion. We first validated the study types and then extracted author names, year of publication, and details of genotypes/alleles and their frequencies in COPD patients and controls.

### Statistical Analysis

Results of association for three distinct alleles were included in this study. These alleles were GC1F, GC1S and GC2, defined by the allelic arrangements of rs4588 and rs7041 in NCBI dbSNP (**Table 1**). Therefore, a total of six different genotypic combinations were studied: GC1F/1F, GC1F/1S, GC1S/1S, GC1F/2, GC1S/2, and GC2/2. The associations of these six genotypes and three alleles were each evaluated independently by meta-analysis. In each analysis, the experimental allele or genotype was tested against the total allele or genotype counts. Meta-analysis was performed using Review Manager (RevMan-v5.3; Copenhagen: The Nordic Cochrane Center, The Cochrane Collaboration, 2014). An additive genetic model with 95% confidence interval (CI) was used in each of these independent analyses. Heterogeneity between studies was calculated by the I<sup>2</sup> and chi<sup>2</sup> tests, where I<sup>2</sup> > 50% and chi<sup>2</sup> p < 0.05 were considered significant heterogeneity. Meta-analysis of odds ratios was performed using a random effect model where significant heterogeneity was observed; otherwise a fixed effect model was used. Overall effect size (Meta-OR) was calculated by Z-test with 5% alpha level. A sensitivity analysis was performed to assess whether meta-analysis results were substantially influenced by the presence of any single study. This was done by systematically excluding one study at a time and recalculating the significance (p-value of the χ<sup>2</sup> and Z-test) of the results. A funnel plot was used to assess publication bias. Subgroup analysis between Asian and Caucasian studies was also performed to identify any significant differences due to individual group stratification.
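The pooling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RevMan implementation: it uses generic inverse-variance weighting with a DerSimonian-Laird between-study variance, it switches models on I<sup>2</sup> alone (the chi<sup>2</sup> p-value is omitted to stay within the standard library), and the function names and 2×2 counts are hypothetical.

```python
import math

def log_or_and_se(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Log odds ratio and its standard error from one 2x2 table (Woolf method)."""
    log_or = math.log((case_exp * ctrl_unexp) / (case_unexp * ctrl_exp))
    se = math.sqrt(1 / case_exp + 1 / case_unexp + 1 / ctrl_exp + 1 / ctrl_unexp)
    return log_or, se

def pool_log_odds_ratios(effects, ses, i2_threshold=50.0):
    """Inverse-variance pooling; random effects are used when I^2 exceeds the threshold."""
    w = [1 / s ** 2 for s in ses]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    # Cochran's Q and I^2 quantify between-study heterogeneity.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    model = "fixed"
    if i2 > i2_threshold:
        # DerSimonian-Laird estimate of the between-study variance tau^2.
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
        w = [1 / (s ** 2 + tau2) for s in ses]
        model = "random"
    est = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    se = math.sqrt(1 / sum(w))
    z = est / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value of the Z-test
    return {
        "model": model,
        "or": math.exp(est),
        "ci": (math.exp(est - 1.96 * se), math.exp(est + 1.96 * se)),
        "z": z,
        "p": p,
        "q": q,
        "i2": i2,
    }

# Hypothetical counts: (cases with allele, cases without, controls with, controls without).
tables = [(30, 70, 20, 80), (45, 55, 35, 65), (60, 140, 50, 150)]
effects, ses = zip(*(log_or_and_se(*t) for t in tables))
print(pool_log_odds_ratios(list(effects), list(ses)))
```

Switching on I<sup>2</sup> alone keeps the sketch short; adding a chi<sup>2</sup> p-value on Q with `len(effects) - 1` degrees of freedom would bring it closer to the rule stated in the text.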

### Linkage Disequilibrium, Haplotypes, and Comparative Allele Frequency

The genetic architecture of the GC locus was evaluated to explain population-specific effects (if any) of GC alleles on its association with different human traits/diseases. To analyze linkage disequilibrium, LD plots and haplotypes were reconstructed using Haploview (Barrett et al., 2004). LD calculations and manipulation of genotype files were done using Plink 1.07 (Purcell et al., 2007). 1000 Genomes genotype information for four major populations, namely CEU (Utah residents with northern and western European ancestry), GIH (Gujarati Indians in Houston, USA), YRI (Yoruba in Ibadan, Nigeria), and JPT (Japanese in Tokyo), was evaluated for LD analysis. Raw genotype data for these populations were obtained from the 1000 Genomes FTP site through Ensembl. Genotype data were obtained for a 50 kb window on either side of rs7041, i.e., chr4:71702617-71802617 (GRCh38.p12). Comparative allele frequencies for GC1F, GC1S, and GC2 corresponding to rs4588 and rs7041 were evaluated from Ensembl (https://asia.ensembl.org/index.html) and HaploReg (http://archive.broadinstitute.org/mammals/haploreg/haploreg.php).

### In silico Functional Implication Assessment

Functional implications of rs4588 and rs7041 were analyzed using open source browsers. RegulomeDB (http://www.regulomedb.org/index) was used to analyze the regulatory function and GTEx portal (https://gtexportal.org/home/) was used to analyze single tissue or gene eQTL.

## RESULTS

### Characteristics of Eligible Studies

A total of 71 studies were identified initially after the online literature search. After screening and reviewing for eligible papers, 48 papers were excluded. There were two duplicate studies, six studies were on asthma, and 11 were on diseases other than asthma and COPD, such as osteomalacia, type II diabetes, adenocarcinoma, pulmonary tuberculosis and other non-relevant diseases. One non-human study done on mice was also excluded from the meta-analysis. Previous meta-analyses of COPD and GC (n = 5) were also excluded but were used to identify cross references. Fourteen studies were excluded because they were either cohort studies or randomized clinical trials of vitamin-D<sub>3</sub> supplementation. Eleven studies were found irrelevant due to insufficient information on case or control subjects, and one study was in another language and was excluded from the meta-analysis. A further seven studies were excluded as adequate/complete genotype and study participant information was not given. After this screening based on our inclusion/exclusion criteria, a total of 14 studies were found eligible for meta-analysis (Kueppers et al., 1977; Home et al., 1990; Ishii et al., 2001; Ito et al., 2004; Laufs et al., 2004; Lu et al., 2004; Korytina et al., 2006; Huang et al., 2007; Janssens et al., 2010; Shen et al., 2010; Jung et al., 2014; Li et al., 2014; Maheswari et al., 2014; Al-Azzawi et al., 2017) (**Figure 1**).

#### Genotypic and Allelic Association

A total of 14 studies were included in this meta-analysis, in which genotypes for the above-mentioned GC alleles were reported in both COPD patients and healthy controls. Of these 14 studies, nine were performed on different Asian populations and five on European populations. Details of the study participants and haplotype or allele frequencies are given in **Supplementary Table 1**. A random effect model was used to estimate the pooled effect size for the GC1F/1F, GC1F/1S, GC1F/2, and GC2/2 genotypes, and the GC1F, GC1S and GC2 alleles in COPD. For the remaining analyses, a fixed effect model was used due to insignificant study heterogeneity (chi<sup>2</sup> p > 0.05 and I<sup>2</sup> < 50%) (**Figure 2** and **Supplementary Figure 1**). Meta-analysis was performed separately for reports on Asians and Europeans to identify significant differences in effect size, if any.

#### Allelic Association

The GC1F allele was found to significantly predispose to COPD in the combined analysis (Meta-OR = 1.29; 95% CI = 1.09–1.55; Z p-val = 0.004). Analyzed independently, the GC1F allele was strongly associated among Asians (OR<sub>Asia</sub> = 1.45; 95% CI = 1.24–1.68; Z p-val < 0.00001), but not among Europeans (OR<sub>Europe</sub> = 1.02; 95% CI = 0.73–1.42; Z p-val = 0.92) (**Figures 2A,B**). Neither the GC1S nor the GC2 allele was significantly associated with risk of, or protection from, COPD (**Supplementary Figure 1**). However, considering the trend of association, both these alleles appeared protective in combined analyses.
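As a rough consistency check, a Z statistic and two-sided p-value can be recovered from a reported odds ratio and its 95% CI. The sketch below assumes the interval is symmetric on the log scale and that the Z-test is a normal test on the log odds ratio; the function name is illustrative, and rounding of the published limits means the results only approximate the reported p-values.

```python
import math

def z_and_p_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate Z and two-sided p-value from an odds ratio and its 95% CI."""
    log_or = math.log(odds_ratio)
    # The CI spans 2 * z_crit standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# GC1F allele estimates as reported in the text above.
reported = [
    ("combined", 1.29, 1.09, 1.55),
    ("Asian", 1.45, 1.24, 1.68),
    ("European", 1.02, 0.73, 1.42),
]
for label, odds_ratio, low, high in reported:
    z, p = z_and_p_from_ci(odds_ratio, low, high)
    print(f"{label}: Z = {z:.2f}, p = {p:.2g}")
```

Under these assumptions the recovered p-values are of the same order as those reported: about 0.005 for the combined analysis, below 0.00001 for Asians, and about 0.9 for Europeans.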

#### Genotypic Association

The homozygous GC1F/1F genotype was found to significantly predispose to COPD (Meta-OR = 1.61; 95% CI = 1.18–2.20; Z p-val = 0.002). Independent analysis found a significant association of this genotype among Asians (OR<sub>Asia</sub> = 1.93; 95% CI = 1.38–2.70; Z p-val = 0.0001), but it remained insignificant among Europeans (OR<sub>Europe</sub> = 1.11; 95% CI = 0.64–1.95; Z p-val = 0.71). Significant predisposition was observed for the GC1S/1S genotype among Europeans (OR<sub>Europe</sub> = 1.29; 95% CI = 1.00–1.68; Z p-val = 0.05); however, it remained insignificant among Asians (**Figure 2**). No significant associations were observed for any of the remaining alleles or genotypes, either in combined or in independent analyses of Asians and Europeans (**Supplementary Figure 1**).

#### Sensitivity Analysis and Publication Bias

Sensitivity analysis was performed for each study. No significant deviation in heterogeneity or study significance (p-value of the χ<sup>2</sup> and Z-test) was observed. Subgroup analyses did not identify any significant (p < 0.05) subgroup stratification (**Figure 2** and **Supplementary Figures 1A–G**). Further, manual inspection of the funnel plots did not identify any publication bias, as the shapes of the funnel plots were symmetrical (**Supplementary Figure 2**).
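A leave-one-out sensitivity analysis of the kind described above can be sketched as follows. For brevity this assumes a fixed effect inverse-variance pool of log odds ratios; the study labels, effects and standard errors are hypothetical and are not taken from the included studies.

```python
import math

def fixed_effect(effects, ses):
    """Inverse-variance fixed effect pool of log odds ratios with a two-sided p-value."""
    weights = [1 / s ** 2 for s in ses]
    est = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    p = math.erfc(abs(est / se) / math.sqrt(2))
    return est, se, p

def leave_one_out(labels, effects, ses):
    """Re-pool after dropping each study in turn; returns (omitted study, pooled OR, p)."""
    rows = []
    for i, label in enumerate(labels):
        est, _, p = fixed_effect(effects[:i] + effects[i + 1:], ses[:i] + ses[i + 1:])
        rows.append((label, math.exp(est), p))
    return rows

# Hypothetical per-study log odds ratios and standard errors.
labels = ["Study A", "Study B", "Study C", "Study D"]
effects = [0.25, 0.40, 0.10, 0.35]
ses = [0.12, 0.20, 0.15, 0.18]
for omitted, pooled_or, p in leave_one_out(labels, effects, ses):
    print(f"without {omitted}: OR = {pooled_or:.2f}, p = {p:.3g}")
```

If the pooled estimate and its significance stay stable across rows, no single study is driving the result, which is the pattern the sensitivity analysis in the text reports.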

#### Linkage Disequilibrium

Comparative LD analysis of the GC locus showed substantial differences in the background LD structure between the four reference populations. A comparatively similar LD structure was observed in CEU and GIH; however, the structure is further broken in JPT and YRI. The two variations, rs4588 and rs7041, do not constitute any likely haplo-blocks in JPT and YRI (**Supplementary Figure 3**). Haplotypes for GC1F, GC1S, and GC2 were found to be present at broadly similar frequencies in CEU (0.19, 0.57, and 0.24) and GIH (0.21, 0.46, and 0.32); however, these haplotypes were not found in JPT and YRI. Furthermore, moderate yet similar LD was observed between these two markers in CEU (r<sup>2</sup> = 0.42; D' = 1) and GIH (r<sup>2</sup> = 0.41; D' = 1); however, LD is completely broken in JPT (r<sup>2</sup> = 0.10; D' = 1) and YRI (r<sup>2</sup> = 0.00; D' = 0.53). Notable haplotypic variations were observed across the genomic region, whereas in JPT and YRI these two variations are not in tight linkage with neighboring markers (**Supplementary Figure 3**). Allele frequencies of rs4588 and rs7041 and the LD between them were seen to be very heterogeneous across the 26 different populations documented in the 1000 Genomes Project. Absolutely no LD (r<sup>2</sup> = 0) was observed among different African populations, whereas the highest degree of LD was observed among European and South Asian populations, followed by Americans (**Supplementary Table 2**).
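The relationship between the haplotype frequencies and the LD statistics reported above can be illustrated with a small calculation. The sketch assumes the usual haplotype structure in which GC1F and GC2 share one rs7041 allele, GC1F and GC1S share one rs4588 allele, and the fourth possible haplotype is absent; the function name is illustrative, and frequencies are renormalised because the published values are rounded.

```python
def ld_stats(f11, f12, f21, f22):
    """r^2 and |D'| for two biallelic loci from the four haplotype frequencies.

    f11: allele 1 at both loci; f12: allele 1 at locus 1, allele 2 at locus 2;
    f21: allele 2 at locus 1, allele 1 at locus 2; f22: allele 2 at both loci.
    """
    total = f11 + f12 + f21 + f22
    f11, f12, f21, f22 = (f / total for f in (f11, f12, f21, f22))
    p1 = f11 + f12  # frequency of allele 1 at locus 1
    q1 = f11 + f21  # frequency of allele 1 at locus 2
    d = f11 - p1 * q1
    denom = p1 * (1 - p1) * q1 * (1 - q1)
    if denom == 0:
        return 0.0, 0.0  # a monomorphic locus carries no LD information
    r2 = d * d / denom
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime

# Locus 1 = rs7041, locus 2 = rs4588 (assumed mapping):
# GC1F -> f11, GC2 -> f12, GC1S -> f21, fourth haplotype absent -> f22 = 0.
for pop, gc1f, gc1s, gc2 in [("CEU", 0.19, 0.57, 0.24), ("GIH", 0.21, 0.46, 0.32)]:
    r2, d_prime = ld_stats(gc1f, gc2, gc1s, 0.0)
    print(f"{pop}: r2 = {r2:.2f}, D' = {d_prime:.2f}")
```

Under these assumptions the reported haplotype frequencies give r<sup>2</sup> of about 0.42 in CEU and 0.41 in GIH with D' = 1, in line with the values in the text; D' equals 1 whenever one of the four haplotypes is absent.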

### In silico Functional Implications

GC (ENSG00000145321) is expressed mainly in the liver, with negligible expression in the pancreas and stomach. For the two missense SNPs, rs4588 and rs7041, no evidence was observed for a significant eQTL on GC in liver tissue; however, significant eQTLs were observed in subcutaneous adipose (p = 6.55E-6), sun-exposed skin (p = 1.67E-6), and stomach (p = 5.46E-9) tissues. SNP rs4588 was identified to alter motif-binding sites of transcription factors SP1 and SP3, and a transcription factor binding element (KLF16). rs4588 and rs7041 were both identified: (a) to localize in DNase hypersensitivity regions in a common set of cell types and tissues, and (b) to potentially alter histone modification in liver (strongly) and skin (quiescent/low) tissue.

### DISCUSSION

In this systematic review, we performed a meta-analysis and evaluated linkage disequilibrium at the GC locus in order to investigate the association of common GC polymorphisms with COPD. This meta-analysis established that the GC1F allele and GC1F/1F genotype confer risk of COPD. However, the association is largely restricted to Asians and is not seen in Europeans. On the contrary, the GC1S/1S genotype was observed to confer risk to Europeans only (with borderline significance). At least one copy of GC2 has been found to confer protection from COPD among both Asians and Europeans. Previous meta-analysis studies and independent reports have shown different results from Europeans and Asians, which could be due to differences in allelic segregation and haplotypic heterogeneity at a population level (Chen et al., 2015; Horita et al., 2015; Wang et al., 2015; Xiao et al., 2015; Xie et al., 2015). Large differences in LD structure and haplotypes were observed in different ethnic populations, such as CEU, GIH, JPT, and YRI. Although no reports are available (on association of GC variants and COPD) from African countries, we have included their representative genotypes for comparative genetic studies (**Supplementary Figure 3**). Notable differences in allele frequencies and LD between rs4588 and rs7041 among different populations suggest a significant population-specific genetic contribution in GC variants (**Supplementary Table 2**). These major differences in LD between these two critical variants resulted in different haplotype frequencies and an absence of any quantifiable haplotypes in JPT and YRI. This indicates that perhaps different haplotypes are associated with different ethnic populations, which requires further large-scale genetic studies to uncover novel alleles or haplotypes, if any. The overall trend shows relative similarity between CEU and GIH, while distinct differences were observed in YRI and JPT. Different allelic arrangements of GC result in different GC variants, which vary in their isoelectric points and binding efficiency to vitamin D<sub>3</sub> (Braun et al., 1992; Arnaud and Constans, 1993; Speeckaert et al., 2006). Furthermore, in different populations, rs4588 and rs7041 may tag different sets of regulatory and structural SNPs (in haplotypes) across GC, and thus could play a critical role in regulating expression and function of the GC protein.

Although VDD is found to be associated with COPD (Jolliffe et al., 2018), the mechanisms underlying this association remain unclear. Recent GWAS of COPD were unable to identify GC or the vitamin D receptor (VDR) as a significantly associated gene (Wain et al., 2017). However, genetic polymorphisms in these genes are found to be associated with VDD (Yousefzadeh et al., 2014; Zaki et al., 2017). In most studies, a low level of serum vitamin D<sub>3</sub> is reported to be associated with the severity of COPD. In particular, rs4588 has been shown to influence GC binding to vitamin D<sub>3</sub> (Nimitphong et al., 2013). It can be argued that, along with sufficient vitamin D<sub>3</sub> intake/supplementation, a functionally more potent form of GC is necessary to maintain optimal serum bioavailability of vitamin D<sub>3</sub>. Therefore, interindividual differences in the GC protein may act as a predisposing factor for COPD. Further genetic epidemiological studies are warranted to identify novel risk alleles from GC that are associated with GC function, and thus implicated in COPD. However, the differential LD structure of the GC locus needs to be considered as a major confounding factor.

FIGURE 2 | Assessment of risk for meta-analysis of (A) *GC1F* allele, and (B) GC1F/1F genotype with COPD.

#### AUTHOR CONTRIBUTIONS

SS conceptualized and designed the study. RK and DN performed literature screening and meta-analysis. SS performed the in silico genetic study. RK, DN, and SS interpreted the results and contributed to writing the manuscript. All authors reviewed the manuscript and finalized it for submission.

#### REFERENCES


#### FUNDING

We acknowledge financial support from DST-SERB (#ECR/2016/001660), a UGC-BSR grant (30-4/2014-BSR), and a research grant from Central University of Punjab (GP.25).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00413/full#supplementary-material


pulmonary disease susceptibility: a meta-analysis. Biomed. Rep. 3, 183–188. doi: 10.3892/br.2014.392


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Khanna, Nandy and Senapati. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Review of African Americans' Beliefs and Attitudes About Genomic Studies: Opportunities for Message Design

Courtney L. Scherr<sup>1</sup>\*, Sanjana Ramesh<sup>1</sup>, Charlotte Marshall-Fricker<sup>1</sup> and Minoli A. Perera<sup>2</sup>

<sup>1</sup> Department of Communication Studies, Center for Communication and Health, Northwestern University, Chicago, IL, United States, <sup>2</sup> Department of Pharmacology, Center for Pharmacogenomics, Feinberg School of Medicine, Chicago, IL, United States

Precision medicine, the practice of targeting prevention and therapies according to an individual's lifestyle, environment or genetics, holds promise to improve population health outcomes. Within precision medicine, pharmacogenomics (PGX) uses an individual's genome to determine drug response and dosing to tailor therapy. Most PGX studies have been conducted in European populations, but African Americans have greater genetic variation when compared with most populations. Failure to include African Americans in PGX studies may lead to increased health disparities. PGX studies focused on patients of African American descent are needed to identify relevant population-specific genetic predictors of drug responses. Recruitment is one barrier to African American participation in PGX. Addressing recruitment challenges is a significant, yet potentially low-cost, way to improve patient accrual and retention. Limited literature exists about African American participation in PGX research, but studies have explored barriers and facilitators to African American participation in genomic studies more broadly. This paper synthesizes the existing literature and extrapolates these findings to PGX studies, with a particular focus on opportunities for message design. Findings from this review can provide guidance for future PGX study recruitment.

Keywords: African American, genomics, health communication, pharmacogenomics, precision medicine, recruitment

### INTRODUCTION

Precision Medicine (PM) refers to the targeting of therapies according to an individual's genetics, lifestyle or environment and holds immense promise to improve population health outcomes (Khoury et al., 2016). A branch of precision medicine, pharmacogenomics (PGX) is the study of genetic information to determine individual response (e.g., efficacy/toxicity) to pharmaceutical agents with the goal of developing safe and effective medications and dosages that can be tailored based on an individual's genetics (Lee, 2003; Empey, 2016). In order to draw conclusions about gene interactions and genetic variation within and across ancestries, substantial and diverse patient data are needed (Jaffe, 2015; Khoury et al., 2016). To date, most PGX participants are of European ancestry (Perera et al., 2014). However, African Americans have greater genetic variation than European populations; therefore, results from existing PGX studies may not be as predictive in African American populations (Johnson et al., 2011; Perera et al., 2014). Under-representation of African American populations impairs the ability to translate PGX findings into clinical care, and will ultimately result in increased health disparities (Perera et al., 2014).

#### Edited by:

Jessica Nicole Cooke Bailey, Case Western Reserve University, United States

#### Reviewed by:

Satyanarayana M. R. Rao, Jawaharlal Nehru Centre for Advanced Scientific Research, India

Suzette J. Bielinski, Mayo Clinic, United States

\*Correspondence:

Courtney L. Scherr courtney.scherr@northwestern.edu

#### Specialty section:

This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics

Received: 02 October 2018 Accepted: 24 May 2019 Published: 14 June 2019

#### Citation:

Scherr CL, Ramesh S, Marshall-Fricker C and Perera MA (2019) A Review of African Americans' Beliefs and Attitudes About Genomic Studies: Opportunities for Message Design. Front. Genet. 10:548. doi: 10.3389/fgene.2019.00548

The challenge of recruiting minority populations likely stems from historic and contemporary mistreatment. For example, the Tuskegee Syphilis study has had a lingering effect on African Americans' trust of medical institutions and research (Gamble, 1997). In addition to historic mistrust related to clinical research more broadly, genomic studies are further problematized due to concerns about personal identification, disenfranchisement stemming from genomic-based policies, and the potential threat of eugenics (Jackson, 1999). Furthermore, concerns remain about the inability of genomic research to address issues of social justice, and about its potential to exacerbate health disparities (Jackson, 1999). Although few studies have examined the recruitment of African Americans to PGX studies, several have reported African American recruitment for genetic studies or biobanks (which we hereinafter refer to as genomic studies for simplicity).

Prior studies have reported demographic differences, for example, that African Americans are less likely to participate in research that includes a DNA sample or a biopsy compared with whites (Dye et al., 2016; Moledina et al., 2018). However, other studies have reported conflicting findings related to demographic factors influencing participation. One study related to prostate cancer genomics compared African American participants with white participants and found African American participants were younger, less educated, had lower income, and were less likely to be married compared with white participants (Patel et al., 2012). However, a different study found that African American women who provided a saliva sample for genomic research were older, regularly took a multivitamin, had a physician visit in the previous year, and reported a history of breast, colorectal, or cervical screening compared with African American women who did not provide a saliva sample (Adams-Campbell et al., 2016). While demographic differences are useful in the categorization of participants, they do not provide useful insight for recruitment efforts.

Literature on recruitment efforts often describes community-based approaches (CBAs) to engage participants in genomic studies by emphasizing intentional and meaningful community member engagement throughout the research process (Israel et al., 1998; Vadaparampil and Pal, 2010; Kiviniemi et al., 2013; Ochs-Balcom et al., 2015; McNeill et al., 2018). However, CBAs focus on broad methods for recruitment and less on message content. Existing studies also have reported on the use of educational materials and seminars to improve African American recruitment (Skinner et al., 2008; Halverson and Ross, 2012; Rodriguez et al., 2016; Radecki Breitkopf et al., 2018). Studies found pre-post increases in knowledge about genomic studies, more favorable attitudes (Patel et al., 2018) and less negative affect (Kiviniemi et al., 2013) after receiving an educational intervention. However, randomized controlled trials and other studies employing pre-post assessment found no changes in attitudes about genomic research because of educational interventions (Skinner et al., 2008; Halverson and Ross, 2012). Such findings are not surprising because attitudes do not correlate with knowledge, but are shaped by values and beliefs (Grimshaw et al., 2002; Marteau et al., 2002; Fishbein and Yzer, 2003). Therefore, recruitment messages that address beliefs and attitudes related to participation in PGX studies, in addition to providing education, may speak more directly to African Americans' concerns, and may more consistently improve recruitment efforts (Scherr et al., 2017).

Existing literature regarding African Americans' beliefs and attitudes about genomic studies is disparate, and sometimes conflicting. Aggregating existing information provides an opportunity to reflect on current findings and potentially guide recruitment message strategies. Therefore, the objective of this paper is to systematically review qualitative and quantitative literature on African Americans' beliefs and attitudes about genomic studies that may influence their decision to participate. We synthesized results from this review to highlight opportunities for the design of genomic study recruitment messages.

### MATERIALS AND METHODS

### Study Design

Studies that provided insight regarding African Americans' beliefs and attitudes toward participation in biobanks or genomic studies (inclusive of genetic or PGX) were included in this review. We focused on biobank and genomic studies because, to the best of our knowledge, no studies have exclusively explored African Americans' beliefs and attitudes about PGX. Qualitative and quantitative studies with original empirical data were included, but conference abstracts, reviews, commentaries, editorials, legal opinions, letters to the editors, case studies, dissertations, and theses were excluded. Given the potential influence of historical context, we excluded studies conducted outside the United States. We were interested in genetic studies that may be able to provide information on the treatment of chronic adult-onset conditions; therefore, we excluded studies related to behavioral, developmental, or mental health genomics because we believed contextual factors (e.g., stigma, environment) could impact the results of such studies. We also excluded studies that explored medical professionals' attitudes or beliefs about genomic studies because, while valuable, their attitudes and beliefs may be influenced by their additional education and training. We excluded studies in which African Americans made up less than 13% of the total sample, a threshold consistent with the proportion of African Americans in the United States population. Finally, we excluded studies in which we could not distinguish African Americans' responses from the responses of other study participants. Genomic studies have been conducted over a relatively limited period; therefore, we included all studies accepted for publication up to July 25, 2018 in this review.
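The eligibility criteria above can be read as a checklist applied to each candidate study. The sketch below is one hypothetical way to encode that checklist; the record fields, category labels and function name are illustrative assumptions, not the screening form actually used by the study team.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Category labels below are assumed encodings of the criteria in the text.
EXCLUDED_PUBLICATION_TYPES = {
    "conference abstract", "review", "commentary", "editorial",
    "legal opinion", "letter to the editor", "case study",
    "dissertation", "thesis",
}
EXCLUDED_TOPICS = {"behavioral", "developmental", "mental health"}
MIN_AA_PROPORTION = 0.13             # threshold stated in the text
ACCEPTANCE_CUTOFF = date(2018, 7, 25)  # cutoff stated in the text


@dataclass
class StudyRecord:
    """Hypothetical screening record for one candidate study."""
    publication_type: str
    country: str
    topic: str
    sample_is_medical_professionals: bool
    n_total: int
    n_african_american: int
    aa_responses_separable: bool
    accepted_on: date


def exclusion_reasons(study: StudyRecord) -> List[str]:
    """Return every criterion the study fails; an empty list means eligible."""
    reasons = []
    if study.publication_type in EXCLUDED_PUBLICATION_TYPES:
        reasons.append("excluded publication type")
    if study.country != "United States":
        reasons.append("conducted outside the United States")
    if study.topic in EXCLUDED_TOPICS:
        reasons.append("excluded genomics topic")
    if study.sample_is_medical_professionals:
        reasons.append("medical professional sample")
    if (study.n_total <= 0
            or study.n_african_american / study.n_total < MIN_AA_PROPORTION):
        reasons.append("less than 13% African American participants")
    if not study.aa_responses_separable:
        reasons.append("African American responses not distinguishable")
    if study.accepted_on > ACCEPTANCE_CUTOFF:
        reasons.append("accepted after cutoff date")
    return reasons
```

Returning the full list of failed criteria, rather than a single yes/no flag, keeps a record of why each study was excluded at each screening stage.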

### Information Sources and Search

A study team member worked with a university librarian and searched PubMed, Scopus, Web of Science, Embase, and Google Scholar for relevant citations. The search string was as follows: "African American" OR Black AND "genetic research" OR "pharmacogenomics research" OR "genomic research" OR "personalized medicine" OR "precision medicine" AND "study recruitment" OR "research participation". The initial search returned 1,179 total citations: 15 from PubMed, 14 from Scopus, 133 from Web of Science, 26 from Embase, and 990 from Google Scholar. After consolidating the lists, we removed 109 duplicate citations, for a final sample of 1,070 citations.
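The search string is printed without parentheses, so the grouping of its OR and AND operators is not shown. The sketch below assumes the conventional reading, in which each run of OR-joined terms forms one concept block (population, research type, recruitment) and the blocks are combined with AND. The helper name and this grouping are assumptions, and individual databases may require different syntax.

```python
from typing import Iterable, List


def build_query(concept_blocks: Iterable[List[str]]) -> str:
    """Join terms with OR inside each block and join blocks with AND.

    Multi-word terms are wrapped in double quotes so they are searched
    as phrases; single words are left bare.
    """
    groups = []
    for terms in concept_blocks:
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Terms taken from the search string in the text; the grouping is assumed.
POPULATION = ["African American", "Black"]
RESEARCH_TYPE = [
    "genetic research", "pharmacogenomics research", "genomic research",
    "personalized medicine", "precision medicine",
]
RECRUITMENT = ["study recruitment", "research participation"]

QUERY = build_query([POPULATION, RESEARCH_TYPE, RECRUITMENT])
```

Writing the grouping explicitly matters because Boolean operator precedence differs between search platforms, and an ungrouped string can return a different citation set on each one.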

#### Study Selection

We screened studies for eligibility by conducting a review of the study titles, followed by an abstract review, and finally a full text review. Reviewers were instructed to be conservative in their exclusion; when uncertain, the study was retained. One study team member conducted the review of titles and excluded those that did not meet eligibility criteria. A second team member reviewed 20% of the titles to confirm the reliability of the exclusion criteria. Krippendorff's α = 0.73 was achieved, an acceptable level of reliability (Krippendorff, 2004). Next, two study team members split the remaining abstracts evenly for review, and excluded those that did not meet eligibility criteria. Twenty percent of the abstracts overlapped for reliability calculation, and α = 0.86 was achieved. Finally, one study team member reviewed 92% and another study team member reviewed 28% of the full texts and excluded those that did not meet eligibility criteria. Twenty percent of the full texts overlapped to calculate reliability, and α = 0.85 was achieved.
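For readers unfamiliar with the reliability statistic, the following is a minimal sketch of Krippendorff's α for the simplest case that fits the screening described above: two reviewers, nominal include/exclude decisions, and no missing ratings. The text does not state which software or level of measurement was used, so the function name and these simplifications are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(coder_a: Sequence[Hashable],
                               coder_b: Sequence[Hashable]) -> float:
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must rate the same units")
    if len(coder_a) < 2:
        raise ValueError("at least two units are required")

    n = 2 * len(coder_a)  # total number of pairable values
    value_counts = Counter(coder_a) + Counter(coder_b)

    # Observed disagreement: each disagreeing unit contributes two
    # off-diagonal entries, (a, b) and (b, a), to the coincidence matrix.
    disagreeing_units = sum(1 for a, b in zip(coder_a, coder_b) if a != b)
    d_observed = 2 * disagreeing_units / n

    # Expected disagreement if values were paired by chance.
    d_expected = sum(
        n_c * n_k
        for c, n_c in value_counts.items()
        for k, n_k in value_counts.items()
        if c != k
    ) / (n * (n - 1))
    if d_expected == 0:
        raise ValueError("alpha is undefined when only one category is used")

    return 1.0 - d_observed / d_expected
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance; the function would be called with the two reviewers' decisions on the overlapping 20% of records, in the same order.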

#### Data Analysis

One study team member reviewed the final studies included in the analysis to extract information including the study design, the population setting, the total sample size, the sample race, and age. Two study team members conducted thematic analysis of the articles using MAXQDA to manage the data (VERBI Software, 2018).

### RESULTS

Of the 1,070 total titles screened, we removed 292 based on the title review, 558 based on the abstract review, and 197 based on the full text review, for a final sample of 24 articles (see **Figure 1**).

#### Review of Studies

Our review of the literature (**Table 1**) identified tensions in African Americans' beliefs and attitudes about genomic research. The overarching theme of trust (or lack thereof) was present across studies, and influenced subsequent attitudes about genomic research and participation. However, even with concerns about trust, African Americans believed their participation in genomic studies was critical. These negative and positive beliefs informed their attitudes about participation in genomic studies. What follows is a summary of the literature highlighting tensions between distrust and the value of their participation.

#### Distrust

We found a shadow of historic and continued injustice cast across studies. Distrust was ubiquitous in all facets of the research enterprise and extended from members of the research and medical communities (Skinner et al., 2015; Drake et al., 2017; Kraft et al., 2018), to medical or research institutions (Drake et al., 2017; Kraft et al., 2018), and the conduct of research and science in general (Skinner et al., 2015). The Tuskegee Study of Untreated Syphilis frequently functioned as a historical referent for the distrust of biomedical research, particularly among African Americans (Hoyo et al., 2003; Bates and Harris, 2004; Cohn et al., 2015; Kraft et al., 2018). One study found African Americans were significantly more concerned that something like Tuskegee could happen again than white participants (Hagiwara et al., 2014). More specific to genetics, revelations about Henrietta Lacks, and more recent and local race-related abuses by researchers, raised concerns about trust, privacy and the benefits of genomic studies (Buseh et al., 2013; Drake et al., 2017; Kraft et al., 2018; Lee et al., 2019). The impact of race-related injustice was apparent in two multi-race studies that found distrust was more salient among African American participants compared with their white counterparts (Bussey-Jones et al., 2010; Hagiwara et al., 2014). The salience of race in historic injustices in the United States raised suspicions about researchers' intentions, and the potential for race-based research to be used for maleficence ranging from racial discrimination to eugenics, or even genocide (Buseh et al., 2013; Isler et al., 2013; Kraft et al., 2018).

Distrust often was tied to fears about study processes and outcomes. Most frequently mentioned were fears of being experimented on or treated as a "guinea pig" or "lab rat" (Hoyo et al., 2003; Ochs-Balcom et al., 2011; Luque et al., 2012; Buseh et al., 2013; Erwin et al., 2013; Hagiwara et al., 2014; Walker et al., 2014), as was fear of exploitation (McDonald et al., 2012; Buseh et al., 2013). Several studies revealed beliefs that research is conducted at the expense of African Americans for the financial profit of those in power (Kraft et al., 2018; Lee et al., 2019), or to provide more effective treatments to white or privileged individuals (Luque et al., 2012; Halbert et al., 2016). Both African American and white participants in one study raised concerns about the possibility that genetic research could be used to discriminate against certain groups of people, with significantly more African Americans reporting that their concern about potential discrimination would influence their willingness to provide a blood sample for research (Goldenberg et al., 2011). Personal experiences with racial discrimination, and witnessing expanding health disparities in spite of medical advancements, added to beliefs that the medical and research communities were not trustworthy (Buseh et al., 2013). Among African Americans, increased distrust was significantly associated with reduced likelihood of biobank participation (McDonald et al., 2014; Halbert et al., 2016).

Despite concerns about trust and associated fears about participation, participants' relationship with medical research was complicated (McDonald et al., 2012). Tensions existed between distrust of medical research and beliefs that African American participation in research is imperative (Bates and Harris, 2004; Ochs-Balcom et al., 2011; McDonald et al., 2012; Erwin et al., 2013; Hagiwara et al., 2014). In particular, participants described the necessity of African American participation in order to determine the efficacy and optimal dosing of medications (i.e., PGX) and find more effective ways to treat and prevent diseases that frequently impact their race (Bates and Harris, 2004; Buseh et al., 2013; Erwin et al., 2013). In one study, neither concerns about exploitation nor distrust of medical research were associated with willingness to donate biological specimens for research (Hagiwara et al., 2014).

#### TABLE 1 | Studies included in Review.

<sup>∗</sup>AA, African American.

In contrast, some studies found African American participants trusted medical research and biobanks, and were favorable toward medical research (Hagiwara et al., 2014; Walker et al., 2014; Cain et al., 2016). More recent studies assessing African American community members' knowledge, beliefs, and attitudes about medical and genomic research found study participants did not believe they would be taken advantage of or harmed by research focused on minorities (Cain et al., 2016; Jones et al., 2017). Female members of The Links Incorporated (a not-for-profit African American service organization) who believed research conducted in the United States was ethical were more willing to participate in genomic studies (Brewer et al., 2014).

The overarching theme of distrust was present in most, but not all studies. Even among those with high levels of distrust, the importance of African Americans' participation in medical and genomic research was recognized. This dichotomy may explain why some studies found high levels of distrust and others did not. Participants' divergent views may underlie an attempt to reconcile beliefs about distrust of medical research with the importance of their participation in medical research to avoid cognitive dissonance.

#### Community Engagement

Participants described community engagement as one strategy to overcome distrust. Community members and leaders described how researchers often entered their community to obtain something from them, and then simply left (Buseh et al., 2013). Such interactions left the community feeling used and disrespected, and engendered continued distrust (Buseh et al., 2013). Failing to engage community members prior to conducting studies was viewed as a barrier (Hoyo et al., 2003), whereas genuine engagement, care and communication were viewed as facilitators that created trust (Walker et al., 2014). Participants desired "authentic collaboration," which means that researchers: (1) engage with community leaders and the community at the start of the project before major decisions are made, (2) ensure proper resources are available, (3) give credit to the communities, (4) maintain community engagement beyond the study, and (5) share study outcomes (Buseh et al., 2013; Cohn et al., 2015). Participants did not desire frequent contact, but they wanted to know how their participation contributed to the advancement of science (Cohn et al., 2015). Similarly, participants in a focus group study recommended working early in the research process to improve relationships between institutions and community members, citing existing strong relationships with local community hospitals as an example (Kraft et al., 2018).

#### Awareness and Knowledge

Awareness and knowledge of genomics, or a desire to learn more were associated with favorable attitudes toward genomic studies and/or intentions to participate (Hoyo et al., 2003; Ochs-Balcom et al., 2011; Cohn et al., 2015; Jones et al., 2017). Conversely, lack of education, understanding, awareness or knowledge were associated with less favorable attitudes and lower intentions to participate (Hoyo et al., 2003; Bates and Harris, 2004; Ochs-Balcom et al., 2011; Skinner et al., 2015; Drake et al., 2017). Participants noted that information about research studies was not readily available in their communities, or that African Americans are often not approached or asked to participate (Drake et al., 2017).

Participants described opportunities to overcome low levels of awareness, such as providing educational sessions to ensure informed participation of African Americans (Buseh et al., 2013). Participants in another study suggested that researchers could learn as much from the community as the community could learn from researchers, and advocated for bidirectional educational efforts (Buseh et al., 2013). Similarly, research targeting the African American community was viewed as an opportunity for collaboration between researchers and community members (Cohn et al., 2015). Tying together trust and education, participants suggested that one way to prevent mistreatment of African Americans was for them to request additional information about research studies during recruitment (Bates and Harris, 2004; McDonald et al., 2012). Given this finding, researchers should anticipate that African Americans will have a greater need for information about study procedures than white participants do.

### Process of Study Conduct

Across studies, African Americans described their attitudes and beliefs about particular aspects of the research process including research team members and/or the associated institution, study procedures and safeguards, participation risk and compensation. We describe each category next.

### Face of the Study

African Americans reported in two studies that they were more likely to participate in research conducted by Historically Black Colleges (HBCs) (Hoyo et al., 2003; Diaz et al., 2008). HBCs were viewed as more trustworthy, and participants believed the involvement of HBCs would ensure results and benefits from their participation would be returned to the African American community (Hoyo et al., 2003). Additionally, African Americans want to see African American physicians and/or researchers in leadership roles on the research team (Hoyo et al., 2003; Bates and Harris, 2004; Buseh et al., 2013; McDonald et al., 2014; Cain et al., 2016). Researchers from shared racial backgrounds were believed to be more likely to understand relevant cultural beliefs and experiences, and were viewed as more trustworthy (Hoyo et al., 2003; Bates and Harris, 2004; Buseh et al., 2013; McDonald et al., 2014; Cain et al., 2016; Kraft et al., 2018). In two studies, African Americans reported that they were more likely to participate if the investigator was African American (Diaz et al., 2008; McDonald et al., 2014), and one study found a decreased likelihood of participation if the study was conducted by a predominantly white college or a white investigator (Diaz et al., 2008).

Similarly, participants across several studies preferred information about genomic research or specific studies be delivered by African Americans (Diaz et al., 2008; Dash et al., 2014), particularly if the study was race specific (McDonald et al., 2014). Participants reported more favorable attitudes toward research, and an increased likelihood of enrollment when the study was introduced by a trusted other such as their physician, friends, family members, and/or community leaders (Hoyo et al., 2003; Diaz et al., 2008; Drake et al., 2017). Participants suggested that hearing about the research study within their community, and knowing others in their community who were involved in the study, would increase their likelihood of participation (Drake et al., 2017).

### Study Procedures and Safeguards

Given past injustices, African Americans held significant concerns about the use and accessibility of their data by other individuals or institutions. Due to racism and possible malevolent intent, across studies African Americans wanted to know specifically how their biological material might be used (Buseh et al., 2013; Hagiwara et al., 2014). Not knowing specifically how the specimen would be used was a barrier to participation (Dash et al., 2014). There were concerns about surreptitious use of genetic material for surveillance, to deny rights and privileges, in criminal investigations, and for other uses beyond the purpose of their original consent (Hoyo et al., 2003; Buseh et al., 2013; Cohn et al., 2015; Kraft et al., 2018). In addition to the aforementioned concerns, participants in one focus group study specifically mentioned concerns related to identity, cloning, and the use of their sample after death (Lee et al., 2019). In addition, not knowing who would have access to their personal information, and who might obtain access to their personal information (e.g., other medical entities like insurance companies) raised concerns, and in some cases, significantly decreased likelihood of participation (Ochs-Balcom et al., 2011; McDonald et al., 2014; Walker et al., 2014; Halbert et al., 2016). Across studies, transparency of study procedures and clear descriptions about safeguards to protect participant privacy were deemed essential for participation. Specifically, African Americans want transparency and to know as much as possible about the purpose and rationale for the study, how their specimen would be used and by whom, and the safeguards in place to protect their privacy (Dash et al., 2014; Hagiwara et al., 2014; Skinner et al., 2015; Kraft et al., 2018). Furthermore, continued and ongoing communication about changes to study protocols, or changes to sample access, and the specific studies for which their sample would be used was important, as was maintaining the option to opt in or out of particular studies (Kraft et al., 2018; Lee et al., 2019).

### Participation Risk

One study identified that beliefs about the risk of participation were negatively associated with willingness to participate (Brewer et al., 2014), but another study found concerns about the risk of participation were only a consideration when making participation decisions (McDonald et al., 2012). Another study found African Americans were specifically worried about the possible contamination of equipment used for biospecimen collection (Hagiwara et al., 2014). Aside from risk, concerns about procedures primarily focused on invasiveness. Studies found participants least preferred studies where methods were viewed as invasive (Cain et al., 2016), and were more favorable toward participating in studies they believed were less invasive in terms of procedure, privacy, and resources (Hoyo et al., 2003; Diaz et al., 2008; Cain et al., 2016). Although one study found blood donation for participation in a genomic study to be minimally invasive (McDonald et al., 2012), other studies identified fear of needles or the donation of blood as a barrier to study participation (Ochs-Balcom et al., 2011; Dash et al., 2014; Drake et al., 2017).

Concerns about invasiveness included the expenditure of resources, specifically, cost and time. Participants in one study raised concerns about the potential personal costs of participating, including costs associated with blood draws and genetic analysis (Skinner et al., 2015). Possible sustained participation in a longitudinal study evoked questions about the number of tasks and time required of participants (Hoyo et al., 2003; McDonald et al., 2012); participants were more favorable about participating in studies that lasted only a short period of time (McDonald et al., 2014). Participants viewed the distance they had to travel for study participation as a barrier to participation (McDonald et al., 2012; Cain et al., 2016). Any perceived expense to the participant, such as cost or time for participation, including time that would be taken from work (Walker et al., 2014; Skinner et al., 2015) and transportation issues (McDonald et al., 2014; Halbert et al., 2016), was a barrier to participation unless compensation could be provided (Cain et al., 2016).

### Compensation

African Americans expected compensation for participants' time for any study that required any type of time commitment, including travel. Compensation for such expenses was believed to increase participation (Erwin et al., 2013; Skinner et al., 2015; Cain et al., 2016; Drake et al., 2017; Jones et al., 2017), and in some cases, African Americans suggested profit sharing as a means of compensation (Buseh et al., 2013; Jones et al., 2017). However, across studies it was noted that the form of compensation did not always need to be direct participant payment. African Americans suggested that food, gas cards, healthcare and/or medication (Hoyo et al., 2003; Hagiwara et al., 2014; Drake et al., 2017), and even individual research results could be provided as a form of compensation (Skinner et al., 2015; Jones et al., 2017). Indeed, some studies found failure to provide research results to participants would prevent African Americans from participating (McDonald et al., 2014; Halbert et al., 2016).

## Individual Level Benefits and Drawbacks of Study Participation

African Americans' interest in participating in genomic studies often was driven by beliefs about benefits for themselves, family members, or future generations. In some cases, individual benefit was broadly or unclearly defined (McDonald et al., 2012, 2014; Skinner et al., 2015; Jones et al., 2017). In other studies, individual benefit included the belief that participation in research meant they would receive better health care (Brewer et al., 2014). Participants across several studies believed they would derive individual benefit by learning more about their genetic risk, which, depending on the results, could act as a motivator for making positive lifestyle changes (Buseh et al., 2013; Skinner et al., 2015). Studies conducted with affected participants, or those already at risk for a specific disease, found increased interest in participation when the study could provide knowledge about the particular condition, for example, cancer (Ochs-Balcom et al., 2011; McDonald et al., 2014; Halbert et al., 2016), asthma (Jones et al., 2017), cardiovascular disease, or type 2 diabetes (Skinner et al., 2015).

Aside from personal benefit, African Americans across studies believed participation in genomic or biobank studies could provide insight into disease that would ultimately benefit their family members or future generations (Ochs-Balcom et al., 2011; McDonald et al., 2012; Dash et al., 2014; Walker et al., 2014; Skinner et al., 2015; Drake et al., 2017; Kraft et al., 2018). They also suggested benefits to family members or future generations could be indirect or much further into the future, such as helping researchers develop medicine that may be used by future generations (Dash et al., 2014).

Notably, two studies found participants did not believe there would be a personal benefit from participating in a research study, and did not believe they would be a beneficiary of research outcomes (Halbert et al., 2016; Drake et al., 2017). African Americans believed they were unlikely to benefit personally from medical advancements due to insurance discrimination and the out-of-pocket costs associated with new pharmaceutical treatments (Halbert et al., 2016; Lee et al., 2019). In some cases, African Americans believed harm could come from finding out about a medical condition that they did not want to know about. As a result, in some studies, learning about personal genetic information was identified as a barrier to participation (Ochs-Balcom et al., 2011; Walker et al., 2014; Skinner et al., 2015; Drake et al., 2017; Jones et al., 2017).

### At the Community Level

The potential for genomic or biobank studies to improve health outcomes for their community was embraced by participants (Goldenberg et al., 2011; McDonald et al., 2012; Buseh et al., 2013; Walker et al., 2014; Cohn et al., 2015). Several studies highlighted participants' beliefs that African American participation in medical research, and genomic research in particular, is essential to address the health issues of traditionally underserved populations and to reduce health disparities (Ochs-Balcom et al., 2011; McDonald et al., 2012; Isler et al., 2013; Skinner et al., 2015). African Americans in one study held the belief that their participation in today's research would facilitate personalized medicine and more targeted prevention and treatment options for disease for future generations of African Americans (Buseh et al., 2013). While African Americans were favorable toward race-specific studies designed to improve health outcomes for their own race (Goldenberg et al., 2011; Ochs-Balcom et al., 2011; McDonald et al., 2014; Walker et al., 2014), results from one study found participants felt such studies were more likely to take advantage of or hurt minorities (Jones et al., 2017). Further, African Americans suggested that, despite their participation and advances in medicine, study results were unlikely to reach their community as a result of historic barriers to medical care (Luque et al., 2012). As a solution, African Americans suggested that any prevention or treatment innovations resulting from African American participation must be accessible and affordable for those community members (Buseh et al., 2013; Halbert et al., 2016). Yet, concerns were raised about whether genomic studies could address social determinants of health that are typically responsible for poor health outcomes and are often ignored (Buseh et al., 2013).

Related to the belief that their participation could benefit their community, favorable views about participation in genomic studies or biobanks most frequently stemmed from altruistic beliefs. Participants believed participation in genomic studies would help future patients or people in general (McDonald et al., 2012; Skinner et al., 2015; Kraft et al., 2018). Caring for others and the benefit of participation to society were central to motivating participation (Brewer et al., 2014; Jones et al., 2017), despite concerns about trust (Bates and Harris, 2004).

TABLE 2 | Summary of Beliefs and Attitudes and Message Design Opportunities.

### DISCUSSION AND CONCLUSION

Given favorable attitudes but low participation rates, culturally appropriate and ethical messages about PGX studies that facilitate recruitment of African Americans are needed (Halbert et al., 2016). Trust has often been cited as the leading barrier to African American participation in health-related research (George et al., 2014; Luebbert and Perez, 2016; Hughes et al., 2017; Jones et al., 2017). Consistently, our review found that distrust in the healthcare system, medical research, organizations, and researchers is a commonly held belief among many African Americans (Bates and Harris, 2004; Bussey-Jones et al., 2010; Hagiwara et al., 2014; McDonald et al., 2014; Cohn et al., 2015; Skinner et al., 2015; Halbert et al., 2016; Drake et al., 2017). We put forward several suggestions to overcome distrust (see **Table 2**). First, meaningful and intentional community collaboration can demonstrate value and meaning for African American participants (Walker et al., 2014). Indeed, a systematic review conducted by Johnson et al. (2011) identified community-based strategies, such as engaging community leadership, as one method for improving recruitment of African Americans into genomic research. However, results from our review suggest researchers must move beyond simply contacting community leaders at the time of the study. Instead, researchers should engage in what participants called "authentic collaboration," beginning before the start of the research study and extending after the study, as a means to foster trust, demonstrate respect, and honor the value of community contributions (Buseh et al., 2013; Cohn et al., 2015). These findings are consistent with the success of other studies, which have used CBA as a method to improve recruitment of African Americans (Israel et al., 1998; Vadaparampil and Pal, 2010; Kiviniemi et al., 2013; Ochs-Balcom et al., 2015; McNeill et al., 2018).

Our review also identified lack of knowledge or awareness about genomic studies as an overarching barrier (Hoyo et al., 2003; James et al., 2008; Skinner et al., 2008; Ochs-Balcom et al., 2011; Drake et al., 2017). However, educational interventions have demonstrated little impact on attitudes or beliefs, suggesting that messages that address existing attitudes and beliefs in addition to providing education may be more effective at addressing African Americans' concerns about participation in genomic studies (Skinner et al., 2008; Halverson and Ross, 2012). Furthermore, it could be argued that beliefs about the trustworthiness of research scientists or institutions (Luque et al., 2012; Erwin et al., 2013; Hagiwara et al., 2014; Walker et al., 2014) impact African Americans' expectations for research participation. For example, African Americans' concerns about being experimented on or exploited explain why they want complete transparency about study protocols and data sharing practices (Dash et al., 2014; Hagiwara et al., 2014; Skinner et al., 2015). As such, messages that are transparent and clearly describe the study protocol may reduce mistrust as a barrier. Based on our review, messages for African Americans about genomic studies should provide substantial information about the study purpose and procedure and describe processes and measures in place to safeguard their privacy. Previous research found that messages which intentionally highlight procedures and security are more likely to overcome concerns related to privacy and outcomes (McQuillan et al., 2006; George et al., 2014; Luebbert and Perez, 2016; Hughes et al., 2017; Jones et al., 2017).

Contrary to the belief that minority populations are not interested in participating in research studies, our review found African Americans were highly interested in participating (Wendler et al., 2005; Horowitz et al., 2017; Jones et al., 2017). Studies in our review indicated African Americans believed their participation in medical research was crucial for the advancement of science (Bates and Harris, 2004; McDonald et al., 2012; Erwin et al., 2013; Hagiwara et al., 2014). Thus, researchers should devote more attention to facilitators of African American participation in medical research. Specifically, as identified in our review, messages that highlight altruism or benefit for one's community and recognize the importance of including minority populations may promote participation in clinical studies of African Americans (George et al., 2014; Hughes et al., 2017; Jones et al., 2017).

Ultimately, one goal of PM research is to reduce health disparities (Collins and Varmus, 2015; Khoury et al., 2016). In particular, PGX uses personal genomic data to inform optimal tailoring of pharmaceutical agents to prevent adverse drug interactions (Perera et al., 2014). Despite the individualized focus of PGX, efforts require a population-based approach to better understand inter-population and intra-population diversity (Bonham et al., 2016; Khoury et al., 2016). This review drew upon existing literature to provide a consolidated overview of African Americans' beliefs and attitudes toward genomic research. This information can inform recruitment strategies and messages that may increase African American participation in genomic studies, and PGX studies in particular. Future research testing the message strategies identified in this review is needed to continue to understand best practices for communicating genomic research with the African American population. Additionally, future studies should explore African Americans' beliefs and attitudes regarding PGX studies. Such knowledge may contribute to the advancement of PM among minority populations.

### AUTHOR CONTRIBUTIONS

CS, SR, and MP conceptualized the study. CS, SR, and CM-F devised the methods, conducted the literature search, reviewed the literature, and conducted the analysis. CS and SR drafted the manuscript. MP reviewed and edited the manuscript.

## FUNDING

This study was supported by U54 MD010723 African American Cardiovascular pharmacogenetic CONsorTium (ACCOUNT): discovery and translation.

### REFERENCES



professionals," in Getting Research Findings into Practice, ed. A. Haines (London: John Wiley & Sons), 29–65.



VERBI Software (2018). MAXQDA Analytics Pro. Berlin: VERBI.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Scherr, Ramesh, Marshall-Fricker and Perera. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Puerto Rico Alzheimer Disease Initiative (PRADI): A Multisource Ascertainment Approach

Briseida E. Feliciano-Astacio<sup>1</sup>, Katrina Celis<sup>2</sup>, Jairo Ramos<sup>2</sup>, Farid Rajabli<sup>2</sup>, Larry Deon Adams<sup>2</sup>, Alejandra Rodriguez<sup>1</sup>, Vanessa Rodriguez<sup>2</sup>, Parker L. Bussies<sup>2</sup>, Carolina Sierra<sup>1</sup>, Patricia Manrique<sup>2</sup>, Pedro R. Mena<sup>2</sup>, Antonella Grana<sup>2</sup>, Michael Prough<sup>2</sup>, Kara L. Hamilton-Nelson<sup>2</sup>, Nereida Feliciano<sup>3</sup>, Angel Chinea<sup>1</sup>, Heriberto Acosta<sup>4</sup>, Jacob L. McCauley<sup>2</sup>, Jeffery M. Vance<sup>2</sup>, Gary W. Beecham<sup>2</sup>, Margaret A. Pericak-Vance<sup>2</sup>\* and Michael L. Cuccaro<sup>2</sup>

#### Edited by:

Kelli K. Ryckman, The University of Iowa, United States

#### Reviewed by:

Phillip E. Melton, Curtin University, Australia; Bethany Wolf, Medical University of South Carolina, United States

#### \*Correspondence:

Margaret A. Pericak-Vance MPericak@med.miami.edu

#### Specialty section:

This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics

Received: 18 September 2018; Accepted: 17 May 2019; Published: 19 June 2019

#### Citation:

Feliciano-Astacio BE, Celis K, Ramos J, Rajabli F, Adams LD, Rodriguez A, Rodriguez V, Bussies PL, Sierra C, Manrique P, Mena PR, Grana A, Prough M, Hamilton-Nelson KL, Feliciano N, Chinea A, Acosta H, McCauley JL, Vance JM, Beecham GW, Pericak-Vance MA and Cuccaro ML (2019) The Puerto Rico Alzheimer Disease Initiative (PRADI): A Multisource Ascertainment Approach. Front. Genet. 10:538. doi: 10.3389/fgene.2019.00538

<sup>1</sup> Department of Internal Medicine, Universidad Central Del Caribe, Bayamón, PR, United States, <sup>2</sup> John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, United States, <sup>3</sup> VA Caribbean Healthcare System, San Juan, PR, United States, <sup>4</sup> Clínica de la Memoria, San Juan, PR, United States

Introduction: Puerto Ricans, the second largest Latino group in the continental US, are underrepresented in genomic studies of Alzheimer disease (AD). To increase representation of this group in genomic studies of AD, we developed a multisource ascertainment approach to enroll AD patients and their family members living in Puerto Rico (PR) as part of the Alzheimer's Disease Sequencing Project (ADSP), an international effort to advance broader personalized/precision medicine initiatives for AD across all populations.

Methods: The Puerto Rico Alzheimer Disease Initiative (PRADI) multisource ascertainment approach was developed to recruit and enroll Puerto Rican adults aged 50 years and older for a genetic research study of AD, including individuals with cognitive decline (AD, mild cognitive impairment), their similarly aged family members, and cognitively healthy unrelated individuals aged 50 and older. Emphasizing identification and relationship building with key stakeholders, we conducted ascertainment across the island. In addition to reporting on PRADI ascertainment, we detail admixture analysis for our cohort by region, group differences in age of onset, cognitive level by region, and ascertainment source.

Results: We report on 674 individuals who met standard eligibility criteria [282 AD-affected participants (42% of the sample), 115 individuals with mild cognitive impairment (MCI) (17% of the sample), and 277 cognitively healthy individuals (41% of the sample)]. There are 43 possible multiplex families (10 families with 4 or more AD-affected members and 3 families with 3 AD-affected members). Most individuals in our cohort were ascertained from the Metro, Bayamón, and Caguas health regions. Across health regions, we found differences in ancestral backgrounds and select clinical traits.


Discussion: The multisource ascertainment approach used in the PRADI study highlights the importance of enlisting a broad range of community resources and providers. Preliminary results provide important information about our cohort that will be useful as we move forward with ascertainment. We expect that results from the PRADI study will lead to a better understanding of genetic risk for AD among this population.

Keywords: Alzheimer disease, ascertainment, PRADI, genetics, community resources, ADSP, diversity, health disparities

#### INTRODUCTION

Alzheimer disease (AD) is a progressive neurodegenerative disorder that affects 1 in 9 Americans over the age of 65. This disease has a significant impact on individuals with AD and their families and poses a huge financial and social burden on society. To date, over 20 loci have been identified as risk factors for AD in non-Hispanic White (NHW) genome-wide association studies (GWAS), with limited GWAS in other populations (Lambert et al., 2013). In addition, the only large AD sequencing effort to date, the Alzheimer's Disease Sequencing Project (ADSP) (Beecham et al., 2017), has focused its efforts on individuals of NHW descent, including a limited number of Hispanic (HI) and African American individuals. The importance of examining AD in other populations (Ramirez et al., 2008) is highlighted by findings that show Caribbean Hispanics from the Dominican Republic are twice as likely as NHW to have late-onset Alzheimer's Disease (LOAD) (Tang et al., 1998, 2001). Furthermore, the incidence of new LOAD cases in families from the Dominican Republic is three times larger than the incidence found in NHW families (Vardarajan et al., 2014) even though the genetic risk of LOAD is similar. Despite clear evidence that points to the importance of investigating AD in underserved populations, this work has lagged.

Although comparisons of risk among different ethnic groups are complicated by differences in the assessment of cognitive decline across studies and population differences in willingness to participate in medical research, there are several possible explanations for increased incidence in these specific ethnic groups (e.g., lower educational attainment, higher rates of cardio- and cerebrovascular disease, and metabolic syndrome). While the importance of diversity and inclusion in genomic research has been emphasized for more than two decades (NIH Revitalization Act of 1993, Public Law 103–43), many groups, including Hispanics, are underrepresented in biomedical research (Shavers et al., 2002; Sheppard et al., 2005; Calderon et al., 2006), including genomic and translational studies (Armstrong et al., 2005; Ricker et al., 2006; Armstrong et al., 2012). Further, this lack of participation has the potential to delay the application of novel treatments that may be relevant to these populations, exacerbating existing health disparities in a variety of diseases, including AD. Specifically, given the importance of genomic research in the development and implementation of precision medicine initiatives (Hampel et al., 2017), there is an urgency to engage with and include underserved and underrepresented groups in such research to enable access to these advanced treatments (Wilkins, 2018).

Alzheimer disease is the most common form of dementia and the fourth leading cause of death in Puerto Rico (PR) (Friedman et al., 2016). The population of PR was estimated at 3,474,182 individuals in 2015, with 617,007 over the age of 65, and AD prevalence of 12.5% (Puerto Rico Department of Health, 2015). Further, according to Perreira et al. (2017) the population of PR is aging and struggles with high rates of comorbid conditions (e.g., hypertension and diabetes) that contribute to dementia. These numbers underscore the need to investigate early risk factors and develop the necessary research to study the neurobiology of cognitive decline in Puerto Ricans and more broadly Hispanics. Furthermore, enriching AD genomic studies with Hispanic populations is fundamental for reducing health disparities, delivering precision medicine, and ultimately improving health outcomes for this community.

To address the range of disparities experienced by Hispanics due to under-representation in genomic studies of AD, we developed the Puerto Rico Alzheimer Disease Initiative (PRADI). The goal of this National Institute on Aging funded project is twofold. First, the PRADI study examines genomic risk for AD in Puerto Ricans and adds to the growing body of knowledge regarding Hispanic risk for AD. Second, the PRADI study makes comparisons using two types of controls: family-based (related controls) and case-control (unrelated controls), paralleling and building on the ongoing work of the ADSP (Beecham et al., 2017). Furthermore, Puerto Ricans are an admixed population, enriched for at least three ancestries (European Caucasian, Western African, and Amerindian/Taino), resulting in complex population substructure (Claudio-Campos et al., 2015; Rajabli et al., 2018). The use of population substructure (i.e., global and local ancestry) can allow for adjustment of models to improve genetic analyses. The importance of examining ancestral contributions in Hispanics can be seen in studies of complex diseases, including asthma (Gignoux et al., 2019), multiple sclerosis (Amezcua et al., 2018), and cancer (Salgado-Montilla et al., 2017; Diaz-Zabala et al., 2018). The usefulness of understanding and incorporating genotypic and admixture information into the conceptualization and management of disease among Puerto Ricans is becoming increasingly apparent (Morales-Borges, 2017; Diaz-Zabala et al., 2018).

In contrast to other studies of Puerto Ricans (Tucker et al., 2010), the current study focuses exclusively on participants from the island of PR. We describe the design and implementation of our multisource method for recruiting individuals for the genetic study of AD and our corresponding work in the community to increase study participation among eligible Puerto Ricans. Equally important, we describe our cohort with respect to clinical features and ancestral proportions by region. These results provide a preliminary picture of our PRADI cohort.

### MATERIALS AND METHODS

A multisource ascertainment approach was implemented to recruit and enroll participants into the PRADI study. As described below, the approach consisted of different phases that revolved around community engagement and included: (a) identification and relationship building with key stakeholders from several organizations; (b) collaborative agreement on ascertainment methods and formalization using memorandums of understanding; (c) targeted actions and recruitment events; and (d) education and dissemination of information about AD to health professionals and the general public. This approach was designed to establish and strengthen collaborative relationships with key stakeholders to facilitate ascertainment for this study and future studies.

Ascertainment efforts were carried out in PR and encompassed all seven health regions (Arecibo, Bayamón, Caguas, Fajardo, Mayagüez, Metro, and Ponce) as defined by the Puerto Rico Department of Health. Only bilingual personnel were sent to the sites and plain Spanish was used for all verbal and written study-related communication (materials for public dissemination were developed for a third-grade reading level). Standard screening and evaluation activities were performed, which included collection of clinical, family, and medical history and neurocognitive testing. Individuals were determined to be cases or controls with further specification depending on whether they were family history positive or negative for AD.

Finally, to investigate potential differences among our participants from different parts of the island, we tested for differences in age of onset and 3MS scores by health region and ascertainment source (i.e., AD specialist, adult care center, or community event/activity). We also conducted admixture analysis to examine the population substructure of our Puerto Rican cohort by region to evaluate differences in ancestry proportions among the health regions.

#### Ascertainment Procedures

#### Ascertainment Phase One: Getting to Know the Field Stakeholders From Multiple Sectors

In the initial phase of our multisource ascertainment approach, the local team identified potential sources of participants within PR communities by interacting with groups and providers that serve the AD population. There are multiple groups and ongoing community initiatives working to increase AD awareness in PR. Our goal was to establish collaborative relationships with stakeholders from different sectors (**Figure 1**). These interactions served as a starting point to disseminate information about the study, to identify sources for cases and controls, to build networks with potential collaborators, and to create opportunities for direct ascertainment. In addition, these initial meetings served as a venue for discussing the importance of inclusive recruitment in genetics research, especially how a lack of diversity can delay specific populations' access to personalized/precision medicine. The primary groups we approached included:

#### Governmental Stakeholders

We contacted central and local government representatives, including the PR office of the Ombudsman for the Elderly (OPPEA, for its Spanish acronym), a legal affairs office for older adults, and the AD Registry of the Health Department of PR. As an initial step, local team members joined the Health Department Alzheimer's Advisory Board. This process allowed us to meet with key stakeholders to discuss the PRADI study. Through these initial contacts, OPPEA provided us with additional contacts at the provider level to include various programs and adult care centers for older adults and those with AD and other cognitive problems. Through these contacts, we established ties with additional local government representatives of the municipalities, including Cidra, Fajardo, Carolina, Aguadilla, and Arecibo, among others.

#### Community Non-profit Organizations (NPO)

To establish community-based collaborations in the non-profit sector, we contacted multiple groups that serve older adults in PR, including the Puerto Rican Chapter of the AARP; Mente Activa (Active Mind), a non-profit organization that promotes physical and mental activity for older adults and those with dementia; and Organización Pro Ayuda a Personas con Alzheimer (OPAPA), another non-profit organization that provides education and support to people with AD and their families. Our team met with leadership in these organizations to provide information about the PRADI study.

#### Religious Groups

Our primary religious contact was Lutheran Social Services of PR, a non-profit faith-based organization involved in providing services to older adults. It is funded to provide programs to train dementia-capable personnel and service providers as well as programs to identify older adults with early signs of AD. In addition, we contacted the Catholic Church, especially the Seminary of PR and the Caguas Cathedral. Both groups agreed to assist with the study by providing access to participants and disseminating information about our study during religious services and through print media.

#### Ascertainment Phase Two: Defining and Formalizing Collaborations With Stakeholders

The next phase in our multisource ascertainment approach was seeking and using input from the stakeholders and organizations about best practices for ascertainment. This process typically involved in-person discussions between the local team (headed by Dr. Feliciano, a neurologist who specializes in the care of older individuals) and the organizations. This allowed us to define our ascertainment practices in alignment with accepted practices for the respective organizations and groups. In addition, it allowed us to address any concerns at the outset. Based on these discussions, we constructed memorandums of understanding (MOUs) to specify the nature of the relationship and outline collaborative activities with the stakeholders from different sectors. MOUs were signed with OPPEA, the Puerto Rican Chapter of the AARP, and Lutheran Social Services of PR. In addition, we established MOUs with mayors and their staff from several municipalities, including Cidra, Fajardo, Carolina, Aguadilla, and Arecibo. As part of the MOU, the Universidad Central del Caribe provided insurance endorsements for the use of their venues during recruitment events.

#### Ascertainment Phase Three: Targeted Actions and Direct Recruitment

Working with the various groups with whom we had MOUs, we set up multiple recruitment events. Depending on the site, pre-recruitment conferences were scheduled to educate center personnel (e.g., primary doctors, nurses, social workers, psychologists, and other dementia specialists) or the public. These pre-recruitment meetings were used to provide general information about AD and to clarify aspects of the study in person to healthcare providers as well as potential participants and their families. At meetings involving the public, potential participants, or family members, we gathered contact information for further follow-up, leading to recruitment of interested individuals. This also allowed us to estimate the number of participants and to plan our ascertainment resources accordingly.

#### Ascertainment Phase Four: Giving Back: Dissemination and Education

We conducted a number of follow-up events to provide information for caregivers and center personnel at the various recruitment sites. For physicians, we were able to provide continuing medical education through the Puerto Rican College of Physicians and Surgeons; for health professional staff, we provided participation certificates for early detection of AD and culturally relevant adaptation of the comprehensive and evidence-based community support strategies.

This follow-up allowed us to disseminate information about AD to the community. The provision of information about AD to non-AD healthcare workers and general communities will help us build local resource networks and empower them with knowledge about dementia capabilities to improve the quality of life of the participants and their caregivers. In addition, at select venues we have also organized educational outreach activities where we served as expert speakers, providing information about dementia research and care. Typical audiences included healthcare providers (e.g., nurses, social workers, case managers, and primary care physicians) and the public. We have also engaged in dementia-related initiatives via social media, like "Un café por el Alzheimer" (A cup of coffee for Alzheimer) (Friedman et al., 2016), which shares our study information on their social media platforms.

#### Study Population

A convenience sampling method with a geographic distribution throughout the island was used. PRADI participants were self-reported Puerto Rican adults aged 50 years and older, with no restrictions on gender or socioeconomic status. While the majority of participants were residents of PR, a small fraction of relatives of the Puerto Rican families living in the continental United States (Florida, New York, Connecticut, and Massachusetts) were enrolled. In addition, some individuals less than 50 years of age were enrolled. When conducting our analyses, we included only residents of PR who were 50 years of age or older.
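The analysis inclusion rule described above (residents of PR aged 50 years or older) can be sketched as a simple filter. This is a minimal illustration only: the `Participant` record, its field names, the residence codes, and the example rows are hypothetical and are not taken from the PRADI data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    """Hypothetical enrollment record; field names are illustrative only."""
    participant_id: str
    age: int
    residence: str  # assumed coding, e.g., "PR", "FL", "NY"


MIN_AGE = 50  # analyses included only participants aged 50 years or older


def eligible_for_analysis(p: Participant) -> bool:
    """Return True for residents of PR who are at least MIN_AGE years old."""
    return p.residence == "PR" and p.age >= MIN_AGE


# Made-up example rows, not study data.
enrolled = [
    Participant("P001", 72, "PR"),
    Participant("P002", 48, "PR"),  # enrolled, but younger than 50
    Participant("P003", 66, "FL"),  # relative living in the continental US
    Participant("P004", 50, "PR"),  # exactly 50 is included
]

analysis_set = [p for p in enrolled if eligible_for_analysis(p)]
print([p.participant_id for p in analysis_set])
```

Keeping enrollment and analysis eligibility as separate steps mirrors the text: relatives outside PR and younger individuals remain in the enrolled cohort but are excluded only at analysis time.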

Our cohort is further specified based on seven health regions as defined by the PR Department of Health<sup>1</sup>. These seven regions contain multiple municipalities and place this cohort in the context of the previously established health related structure. Each of the health regions is labeled by the major municipality within each region (with the exception of the Metro region). As seen in **Figure 1**, the most heavily populated areas per the 2010 census are the Metro, Caguas, and Bayamón regions, containing 22, 16, and 16% of the total population, respectively. Per the same census period, ∼15% of individuals in PR were over 65 years of age.

<sup>1</sup>http://www.salud.gov.pr/Pages/Regiones-de-Salud-y-Servicios-Directos.aspx

#### Ascertainment Sources

All participants were ascertained via three main sources: AD specialists, adult care centers, and community events. This approach allowed us to capture a wide range of AD cases from varied socioeconomic backgrounds and education levels. All individuals were recruited using site-specific IRB approved protocols.

#### AD Specialists

Several AD specialists (neurologists, psychiatrists, and geriatricians) served as collaborators and referred patients who met inclusion criteria and were interested in participating in the PRADI study. These included patients with AD, mild cognitive impairment (MCI), and dementia. As described below in the screening and evaluation section, we obtained clinical and medical records for patients who were recruited via AD specialists.

#### AD Centers and Adult Care Centers

To date, we have recruited participants from seven AD dedicated centers and advanced age nursing homes across the island, identified through the OPPEA directory of services website. The AD centers and nursing homes serve between 20 and 40 individuals who are typically older than 60 years of age (with or without the diagnosis of AD) on a daily basis. These centers focus on providing therapeutic, social, and recreational activities to improve quality of life, as well as educating, and supporting caregivers or family members.

#### Community Groups

We conducted recruitment events in various municipalities. Typically, these recruitment events were preceded by a prerecruitment event. The actual recruitment visits were then conducted at various centers or in private spaces. During these events, our multi-disciplinary teams consented participants (or their proxies), conducted cognitive screenings, and drew blood samples. These events ranged in size from small venues that attracted 20 or so individuals to much larger events that drew 60 or more individuals. We were able to enroll cases and controls during these events.

### Inclusion/Exclusion Criteria

Participants were enrolled in the following categories: cases (AD and MCI), unaffected family members of cases, or unrelated individuals with no cognitive problems. To be enrolled, participants had to meet basic inclusion criteria. All individuals had to: (a) be of Puerto Rican ancestry (with at least one grandparent born on the island); (b) be ≥50 years of age; and (c) be willing to participate and provide informed consent (or, in cases of serious cognitive impairment, have a family member who consents on their behalf as proxy).

To be included as a case, we required that individuals have a previous clinical diagnosis of AD, MCI, dementia, or show evidence of a memory disorder, and meet standard criteria for AD or MCI (McKhann et al., 1984; Albert et al., 2011; McKhann et al., 2011). We included cases from families (family history positive) as well as sporadic or isolated cases (family history negative). We excluded individuals whose memory and cognitive problems are secondary to other causes (e.g., stroke, psychoses, etc.) and those with a known mutation (e.g., PS1, PS2, or APP).

To be included as a control, individuals had to meet basic inclusion criteria, have no prior clinical diagnoses of a memory disorder or subjective memory complaints, demonstrate no cognitive problems on neurocognitive screening and assessment, and be unrelated to our cases. Unaffected family members had to meet the same inclusion criteria as the controls in addition to being a first- or second-degree relative of a case. For unaffected family members, we typically included the oldest available individual.

#### Screening and Evaluation

For participants enrolled as cases (i.e., with suspected memory problems or known dementias), we conducted a detailed chart review during which we corroborated clinical diagnoses and extracted current and past medical histories, current and past medications, family histories (pedigrees), and sociodemographic information. In addition, we collected clinical neurologic and neuropsychological test data, neuroimaging results, and pertinent lab values (e.g., hematology, thyroid function, lipid profile, vitamin D and B12 levels, and liver function tests; evidence of hypothyroidism or vitamin deficiency).

For presumptive cases, we conducted an initial screening with the Modified Mini-Mental State Examination (3MS) (Folstein et al., 1975; Teng and Chui, 1987) followed by a cognitive evaluation that included the NIA-LOAD cognitive battery (Morris et al., 2006; Weintraub et al., 2009). In addition, we administered the Clinical Dementia Rating Scale (CDR) (Yesavage, 1988). Individuals who were deemed cognitively normal were screened with the 3MS (Folstein et al., 1975; Teng and Chui, 1987) and the CDR. For most cognitively normal individuals, we administered the NIA-LOAD battery.

### Adjudication

All clinical, historical, and screening/evaluation test data (e.g., laboratory tests, neurologic examination, neuroimaging, and neuropsychological screening and testing) from individuals with a known or suspected dementia were reviewed by a clinical adjudication panel consisting of a neurologist, neuropsychologist, and clinical staff. The panel reviewed all data and assigned best-estimate diagnoses. To be classified as AD, individuals had to meet the current NIA-AA criteria (McKhann et al., 2011). They were further classified as definite (neuropathologic confirmation), probable, or possible AD. Diagnoses of MCI were assigned using the NIA-AA criteria (Albert et al., 2011). Cognitively normal individuals with no history of memory problems and MMSE or 3MS scores above clinical cutoffs were designated as unrelated controls for the study. Family-based controls were evaluated similarly for inclusion in family-based analyses (Beecham et al., 2017). In the course of adjudication meetings, team members discussed cases until a diagnostic classification was determined. For those cases in which the team was unable to arrive at a final decision, the team stipulated the reason and corrective actions were taken (e.g., obtaining a more detailed history, retesting, etc.). In the event of a disagreement, the team consulted with an independent dementia specialist.

### Analysis

To test for possible differences in our cohort related to where participants live and how they were ascertained, we compared mean 3MS scores and mean age of onset (AAO) for cases by region and recruiting source. Cases consisted of both AD and MCI phenotypes. In addition, for our controls we were able to compare mean 3MS scores by region. All analyses were performed using one-way ANOVA in SAS and SPSS (SAS Institute Inc., 2011; SPSS, 2013). P values lower than 0.05 were considered statistically significant.
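The one-way ANOVA named above can be sketched in a few lines. The authors report using SAS and SPSS; the pure-Python version below is only an illustration of the F statistic those packages compute, and the function name `one_way_anova` and the sample scores are hypothetical, not study data.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three groups (illustrative values only).
example_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova(example_groups)
print(f"F({df_between},{df_within}) = {f_stat:.2f}")
```

The p value would then come from the upper tail of the F distribution with these degrees of freedom, which a statistics package supplies.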

In addition, we conducted an admixture analysis to estimate the proportions of admixture (European, African, and Native American) in our cohort. Genotyping and quality control methods are described elsewhere (Alexander et al., 2009; Rajabli et al., 2018). Briefly, genotyping was performed on the Expanded Multi-Ethnic Genotyping Array and Global Screening Array (Illumina, San Diego, CA, United States) and quality was assessed using PLINK software, v.2. Using the reference panels (African, European, and Native American populations) from the Human Genome Diversity Project<sup>3</sup>, we conducted admixture analysis, using ADMIXTURE software (Alexander et al., 2009; Rajabli et al., 2018), to generate average ancestry proportions across PR's seven health regions.
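The final step described above, averaging per-individual ancestry fractions within each health region, can be sketched as follows. This is a minimal illustration that assumes each individual already has three ancestry fractions (European, African, Native American) of the kind an admixture analysis reports; the function name, record layout, and values are hypothetical and are not study data.

```python
from collections import defaultdict


def mean_ancestry_by_region(records):
    """Average (European, African, Native American) fractions per region.

    `records` is an iterable of (region, (eur, afr, nam)) pairs.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for region, proportions in records:
        for i, p in enumerate(proportions):
            sums[region][i] += p
        counts[region] += 1
    return {
        region: tuple(total / counts[region] for total in sums[region])
        for region in sums
    }


# Hypothetical individuals (illustrative values only).
example_records = [
    ("Metro", (0.70, 0.20, 0.10)),
    ("Metro", (0.60, 0.30, 0.10)),
    ("Ponce", (0.75, 0.10, 0.15)),
]
averages = mean_ancestry_by_region(example_records)
for region, (eur, afr, nam) in averages.items():
    print(f"{region}: EUR={eur:.2f} AFR={afr:.2f} NAM={nam:.2f}")
```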

## RESULTS

We have enrolled 770 individuals over a 30-month period, 710 of whom were from PR. After removing individuals <50 years of age (35 unaffected, 1 MCI), our current dataset consisted of 674 individuals. The distribution of enrollment across the seven health regions of PR, as seen in **Figure 1**, shows the heaviest ascertainment in the Metro (44%; N = 295), Caguas (20%; N = 134), and Bayamón (16%; N = 106) regions, which reflects the greater population densities of these regions and cities. Enrollment numbers for the seven health regions are presented in **Table 1**, which also provides the numbers for the respective municipalities within those health regions.

Among these 674 individuals, 282 (42%) were ascertained as AD, 115 (17%) were ascertained as MCI, and the remaining 277 (41%) were ascertained as unaffected. The majority of our cases (83%) had an age of onset ≥65 years of age. The greatest numbers of AD (N = 111; 39%) and MCI (N = 61; 53%) were ascertained in the Metro region. Equally high ascertainment numbers were also observed in Bayamón and Caguas (AD N = 54, 19%; MCI N = 17, 15%).

Participants were recruited from three sources: AD specialists (N = 261, 39%), adult care centers (N = 201, 30%), and community events (N = 202, 30%). Not surprisingly, as seen in **Table 2**, most of the AD cases were recruited via the AD specialist, while the largest number of MCI cases were ascertained through community events. **Figure 2** provides additional information regarding enrollment sources per the respective health regions.
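As a quick arithmetic check, the reported percentages can be recomputed from the three source counts given above. The snippet is only a sketch; the variable names are illustrative, and the denominator is the 664 participants with source data.

```python
# Counts by ascertainment source, as reported in the text.
source_counts = {
    "AD specialists": 261,
    "adult care centers": 201,
    "community events": 202,
}

total_with_source = sum(source_counts.values())
percentages = {
    name: round(100 * count / total_with_source)
    for name, count in source_counts.items()
}

print(total_with_source)
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```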


TABLE 2 | Ascertainment by source (N = 664)<sup>∗</sup>.

<sup>∗</sup>Our total ascertainment = 674; 10 individuals were missing source data. AD = Alzheimer disease; MCI = Mild cognitive impairment; UNAFF = Unaffected.

Finally, our cohort can be further delineated by whether individuals were part of a family or ascertained as an isolated/sporadic case. Of the 43 multiplex families that have been completed to date, 10 families contain four or more living individuals with AD, 3 families contain 3 living individuals with AD, and 31 families contain 2 living individuals with AD. The mean number of LOAD cases per multiplex family is 3.9. Among the 198 individuals from those multiplex families, 73 (37%) meet the criteria for LOAD, 19 (9%) meet the criteria for EOAD, 31 (16%) meet the criteria for MCI, and 75 (38%) meet the criteria for no cognitive problems.

#### Admixture Results

We examined the population structure of Puerto Ricans using the supervised ADMIXTURE analysis at K = 3. **Figure 3A** illustrates the results from the ADMIXTURE analysis as a bar plot. Each vertical bar represents an individual and the corresponding estimates of the fraction of continental ancestries (African, European, and Native American). On average, Puerto Ricans have mostly European ancestry with a mean value of 69.3% (SD = 12.2). Mean values for African and Native American ancestry are 17.3% (SD = 12.2) and 13.4% (SD = 4.2), respectively, as seen in the box plots (**Figure 3B**).

**Figure 4A** illustrates the bar-plots of admixed individuals across the Puerto Rican health zones and shows heterogeneous admixture patterns. Results of the admixture analysis are in general agreement with recent genetic studies showing a three-way admixture (European, African, and Native American) structure in Puerto Ricans (Via et al., 2011).

We observed a non-uniform distribution of European and African ancestral backgrounds across the health regions, with relatively high European and low African ancestral proportions in Mayagüez, Ponce, and Bayamón (**Figure 4B**). The average European and African ancestry fractions in these zones are 74.4% (SD = 5.8), 74% (SD = 8.1), 73.3% (SD = 8.7) and 11.9% (SD = 6.6), 11.6% (SD = 5.9), 11.0% (SD = 4.9), respectively. In contrast, the Native American ancestral background shows a nearly uniform distribution across the geographical zones (**Figure 4B**).

#### Clinical Comparisons

Separate one-way ANOVAs were conducted to test if mean values for AAO and the 3MS differed by (a) ascertainment region (i.e., the seven health regions of PR) and (b) ascertainment source (AD specialist, adult care center, and community).

#### Age at Onset (AAO)

The mean AAO values for our AD and MCI cases were 74.1 (SD = 9.4) and 71.2 (SD = 8.5), respectively. As noted above, for the purposes of analysis we combined these into one group (cases), which had a mean AAO value of 73.2 (SD = 9.2). The mean values for AAO for the seven ascertainment regions and sources are shown in **Table 3**.

Across the different regions, mean AAO values ranged from 70.3 (SD = 7.4) in Mayagüez to 75.9 (SD = 9.6) in Fajardo. A one-way ANOVA found no statistically significant differences in AAO across the different health regions, F(6,385) = 0.92, p = 0.48. The mean AAO values for the three ascertainment sources ranged from 70.6 (SD = 4.6) for AD specialists to 76.4 (SD = 9.0) for cases ascertained through adult care centers. A one-way ANOVA found significant group differences among the three ascertainment sources, F(2,382) = 16.29, p < 0.001. Post hoc tests showed that mean AAO was higher in patients recruited from the community sites (+4.1 years) and adult care centers (+6.0 years) than in patients ascertained from AD specialists.

#### Modified Mini Mental State Examination (3MS)

The mean 3MS scores for our AD and MCI cases were 52.6 (SD = 23.5) and 80.1 (SD = 12.2), respectively; the overall mean 3MS score for all cases was 63.5 (SD = 24). The mean 3MS scores for the seven ascertainment regions and sources are seen in **Table 3**.

Among the health regions, mean 3MS scores ranged from 46.3 (SD = 28) in Mayagüez to 69.5 (SD = 19.9) in the Metro region. Note that we dropped the Fajardo region, as there were only three 3MS scores. For these comparisons, the homogeneity of variances assumption was violated, as assessed by Levene's Test of Homogeneity of Variance (p = 0.008). The one-way Welch ANOVA results show statistically significant differences in mean 3MS scores between the health regions, Welch's F(5,52.96) = 3.81, p = 0.005. Games-Howell post hoc analysis revealed only one statistically significant comparison (p < 0.01), between the Metro and Mayagüez regions (23.3 ± 5.8; mean ± standard error). For source, the mean values ranged from 63.1 (SD = 24) for cases ascertained via the community to 64.2 (SD = 27.6) for cases ascertained through AD specialists. Again, Levene's Test of Homogeneity of Variance was significant (p = 0.03), indicating that the homogeneity of variances assumption was violated and prompting use of Welch's ANOVA. Welch's ANOVA found no statistically significant differences in 3MS means, Welch's F(2,130.5) = 0.04, p = 0.96.
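Welch's ANOVA, used here because the equal-variance assumption failed, weights each group by its sample size divided by its variance and yields fractional denominator degrees of freedom (hence values such as 52.96). The sketch below is a pure-Python illustration of the standard Welch formula, not the authors' SAS or SPSS code; the function name and example values are hypothetical.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_i / s_i^2 (sample variance).
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    weight_sum = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / weight_sum

    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    # Correction term shared by the denominator and the second df.
    lam = sum(
        (1 - w / weight_sum) ** 2 / (n - 1) for w, n in zip(weights, sizes)
    )
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Hypothetical scores for three groups (illustrative values only).
welch_f, welch_df1, welch_df2 = welch_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"Welch's F({welch_df1},{welch_df2:.2f}) = {welch_f:.2f}")
```

With equal group sizes and variances, as in this toy input, the weighted mean reduces to the ordinary grand mean; the correction matters when variances differ across groups.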

### DISCUSSION

Using a multisource approach that emphasized community engagement and was tailored to the Puerto Rican population, we were able to enroll eligible participants and their family members across PR. A major feature of our community engagement efforts was the development of partnerships with leaders of health initiatives in municipalities and resources within those municipalities. These included the health department, governmental organizations, community-based organizations, religious groups, and various healthcare providers. Establishing strong community partnerships allowed us to develop strategies with input from different parts of the community to achieve an ascertainment approach that was sensitive to the local culture.

<sup>∗</sup>Removed from analysis of 3MS score by region.

Our multisource approach emphasizes community engagement beginning with the identification of and establishment of relationships with key stakeholder groups and organizations. This allowed us to develop mutually agreed upon ways to implement research activities and create memorandums of understanding to formalize implementation. Working with these stakeholders and organizations enabled us to conduct outreach and ascertainment activities in the respective municipalities. Concurrent with the outreach activities and recruiting events (and as a way of giving back to the communities), we provided information and educational opportunities to healthcare providers and the public. This community engagement approach, developed for PRADI by AD clinicians and researchers in Puerto Rico and Miami, is a platform for our ongoing ascertainment efforts.

Using this approach, we have enrolled 674 individuals from PR over the age of 50 for our PRADI study. These individuals were recruited fairly evenly from the three ascertainment sources and are concentrated in the three health regions with the largest numbers of individuals: Metro, Bayamón, and Caguas. We also observed that the main ascertainment sources varied by health region, reflecting different resources in the respective regions. Further, while the percentage of individuals ascertained in select regions paralleled the percentage of the total population for the region, the Metro and Ponce regions were disparate: 44% of our participants were ascertained in the Metro region, which constitutes 22% of the population, whereas 3% of our participants were ascertained in the Ponce region, which constitutes 14% of the population. These ascertainment figures have already begun to inform our subsequent recruitment efforts, as we have emphasized the need to engage other sectors of PR (e.g., Ponce).
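The Metro and Ponce comparison can be expressed as a simple representation ratio: a region's share of participants divided by its share of the population, where 1.0 would indicate proportional enrollment. This ratio is an illustrative summary introduced here, not a statistic reported by the authors; the percentages are the ones quoted in the paragraph above.

```python
def representation_ratio(sample_share, population_share):
    """Share of participants divided by share of the population."""
    return sample_share / population_share


# (percent of participants, percent of population) as quoted in the text.
region_shares = {"Metro": (44, 22), "Ponce": (3, 14)}

ratios = {
    region: round(representation_ratio(sample, population), 2)
    for region, (sample, population) in region_shares.items()
}

for region, ratio in ratios.items():
    print(f"{region}: {ratio}")
```

On these figures the Metro region is enrolled at about twice its population share and Ponce at roughly one fifth of its share.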

The importance of recruiting in regions such as Ponce and Mayagüez is also reflected in the results of our admixture analysis showing differences in the proportion of European and African ancestry among individuals from these regions. The failure to ascertain participants from regions with different ancestral backgrounds could potentially limit the applicability of important findings to these groups. The significance of this for the PRADI study is reinforced by work showing that different ancestral backgrounds may play a significant role in modifying the effect of APOE on risk for AD (Rajabli et al., 2018). These results are preliminary and will need further investigation, in particular to specify area of origin for participants vs. current area.

In addition to potential ancestral differences across the different regions, we observed clinical differences in our cohort in relation to ascertainment region and sources. For instance, participants' mean 3MS scores varied by ascertainment region, although the only significant difference was between the Mayagüez and Metro regions. This may reflect differences in the sources of these participants, as most of the individuals from Mayagüez were ascertained in the community. While there were no significant differences in AAO among participants from these different regions, we observed that AAO varied according to ascertainment source. Specifically, individuals who had been seen by AD specialists were more likely to have been identified as having cognitive/memory problems at younger ages. Aside from differences in sample size, the observed differences in AAO and 3MS values by ascertainment region and source most likely reflect the complex interplay of multiple influences, including access to AD specialists, availability of dementia-related resources, and general knowledge and acceptance of AD.

The influence of knowledge and acceptance of AD is an important issue that is intertwined with efforts to recruit and enroll participants for genetic studies of AD in PR. While genetic studies of AD in PR have been undertaken by several groups as part of a larger emphasis on understanding AD in Caribbean Hispanics (Lee et al., 2006; Barral et al., 2015), the ascertainment approach developed for PRADI focuses solely on the island and intends to create a program that enhances knowledge of AD in PR.

Efforts to increase knowledge of AD in PR have grown in recent years, and the multisource approach to recruitment and enrollment is aligned with programs such as the Un Café por el Alzheimer program in PR, which provides education about AD at coffee shops and through social media (Friedman et al., 2016). The educational component that we include as part of our larger ascertainment approach is crucial for providing information about AD to healthcare providers and the public across the various communities and will potentially impact participation in biomedical research, including genetic studies (Karlawish et al., 2011).

The goal of the PRADI study is to investigate the genetics of AD in Puerto Ricans. AD is a complex disease with substantial burden on the population, particularly in PR, where there is a large aging population suffering from chronic diseases that may exacerbate existing risk (Perreira et al., 2017). To date, there has been a scarcity of genetic studies of complex traits (e.g., AD) in Puerto Ricans, which could exacerbate existing health disparities. Exceptions to this are the Boston Puerto Rican Health Study (BPRHS), a longitudinal cohort study that examines non-genetic and genetic influences on multiple health outcomes among mainland Puerto Ricans (Tucker et al., 2010), and the Hispanic Community Health Study (HCHS), a large longitudinal multi-cohort project that studies a variety of health outcomes among different Hispanic-Latino groups in the US, including Puerto Ricans (Lavange et al., 2010), both of which have extensive phenotypic and genotypic data. Using data from these cohorts, investigators have found links between select genes and obesity and asthma (Guo et al., 2018), lipid profiles (Graff et al., 2017), and blood pressure traits (Sofer et al., 2017). A large amount of research has examined genetic factors contributing to asthma and other pulmonary traits, which are a major health problem in Puerto Ricans. The involvement of Puerto Ricans in this work can lead to greater understanding of genetic contributions to disease in this population and to intervention opportunities. Central to the success of this research is ensuring participation (Karlawish et al., 2011).

Our results suggest the importance of engaging multiple stakeholders and communities across municipalities. Including stakeholders in the development of outreach and recruitment was an important part of the PRADI ascertainment approach. Another important aspect of our ascertainment approach was the provision of AD and dementia information to providers, care centers, and the public. While our ascertainment results cannot be directly attributed to our multisource approach, we have preliminary data that can guide more systematic evaluation of what works best as the PRADI study moves forward. Ultimately, this study and others like it are intended to inform and improve health outcomes and reduce health disparities for Puerto Ricans and other Hispanic-Latino populations who have been consistently underserved.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the National Institutes of Health Guiding Principles for Ethical Research Pursuing Potential Research Participants Protection and the 2016 National Institutes of Health Single Review Board (sIRB) Policy. This study received ethical approval from University of Miami Institutional Review Board (approved protocol #20070307) and Universidad Central del Caribe Institutional Review Board (approved protocol #2016-26). The Universidad Central del Caribe is relying on the designated UM-IRB by an Institutional Review Board Authorization Agreement (Protocol: Genetic Studies in Dementia). All subjects (participant or proxy) gave written informed consent. This study was carried out in accordance with the Declaration of Helsinki and amendments.

### AUTHOR CONTRIBUTIONS

MC helped with study design, assisted with clinical adjudication of patient and control data, and wrote and proofread the manuscript. BF-A and KC assisted with study design, ascertainment, and clinical adjudication of patient and control data, and wrote and proofread the manuscript. JR and FR performed statistical analyses and helped write the manuscript. LA and JV helped with study design, ascertainment, and clinical adjudication of patient and control data. PB, PRM, AR, and VR helped with ascertainment and clinical adjudication of patient and control data. CS, PM, AG, MP, and JM helped with ascertainment of patient and control data. KH-N compiled data for the publication and ran clinical queries. NF helped with ascertainment of patient and control data, and proofread the manuscript. AC and HA helped with ascertainment of patient and control data, diagnosis, and adjudication. GB conceived of and implemented the study design. MP-V conceived of and implemented the study design, assisted with ascertainment and clinical adjudication of patient and control data, and helped write the manuscript.

### FUNDING

Financial support for the research, authorship, and publication of this article was provided by the grant "Genomic Characterization of Alzheimer's Disease Risk in the Puerto Rican Population" (1RF1AG054074-01) from the National Institutes of Health (NIH) and National Institute on Aging (NIA).

### ACKNOWLEDGMENTS

We wish to acknowledge the community, faith, and government organizations, as well as the healthcare professionals and individuals who participated and collaborated in this research project. We are grateful to the families and staff who participated in this study. We also gratefully acknowledge the resources provided by the John P. Hussman Institute for Human Genomics and the Universidad Central del Caribe.

### REFERENCES

Albert, M. S., DeKosky, S. T., Dickson, D., Dubois, B., Feldman, H. H., Fox, N. C., et al. (2011). The diagnosis of mild cognitive impairment due to Alzheimer's disease: recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 3, 270–279. doi: 10.1016/j.jalz.2011.03.008

Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 9, 1655–1664. doi: 10.1101/gr.094052.109

Amezcua, L., Beecham, A. H., Delgado, S. R., Chinea, A., Burnett, M., Manrique, C. P., et al. (2018). Native ancestry is associated with optic neuritis and age of onset in Hispanics with multiple sclerosis. Ann. Clin. Transl. Neurol. 11, 1362–1371. doi: 10.1002/acn3.646



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Feliciano-Astacio, Celis, Ramos, Rajabli, Adams, Rodriguez, Rodriguez, Bussies, Sierra, Manrique, Mena, Grana, Prough, Hamilton-Nelson, Feliciano, Chinea, Acosta, McCauley, Vance, Beecham, Pericak-Vance and Cuccaro. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Motivations for Participation in Parkinson Disease Genetic Research Among Hispanics versus Non-Hispanics

*Karen Nuytemans<sup>1,2</sup>\*, Clara P. Manrique<sup>1</sup>, Aaron Uhlenberg<sup>1</sup>, William K. Scott<sup>1,2</sup>, Michael L. Cuccaro<sup>1,2</sup>, Corneliu C. Luca<sup>3</sup>, Carlos Singer<sup>3</sup> and Jeffery M. Vance<sup>1,2</sup>*

*<sup>1</sup> John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, United States, <sup>2</sup> Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, FL, United States, <sup>3</sup> Department of Neurology, University of Miami Miller School of Medicine, Miami, FL, United States*

#### *Edited by:*

*Jessica Nicole Cooke Bailey, Case Western Reserve University, United States*

#### *Reviewed by:*

*Kenneth M. Weiss, Pennsylvania State University, United States; Mark Z. Kos, University of Texas Rio Grande Valley Edinburg, United States*

*\*Correspondence:*

*Karen Nuytemans knuytemans@med.miami.edu*

#### *Specialty section:*

*This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics*

*Received: 29 November 2018 Accepted: 21 June 2019 Published: 16 July 2019*

#### *Citation:*

*Nuytemans K, Manrique CP, Uhlenberg A, Scott WK, Cuccaro ML, Luca CC, Singer C and Vance JM (2019) Motivations for Participation in Parkinson Disease Genetic Research Among Hispanics versus Non-Hispanics. Front. Genet. 10:658. doi: 10.3389/fgene.2019.00658*

Involvement of participants from different racial and ethnic groups in genomic research is vital to reducing health disparities in the precision medicine era. Racially and ethnically diverse populations are underrepresented in current genomic research, creating bias in result interpretation. Limited information is available on the motivations or barriers of these groups to participate in genomic research for late-onset neurodegenerative disorders. To evaluate willingness for research participation, we compared motivations for participation in genetic studies among 113 Parkinson disease (PD) patients and 49 caregivers visiting the Movement Disorders clinic at the University of Miami. Hispanics and non-Hispanics were equally motivated to participate in genetic research for PD. However, Hispanic patients were less likely to be influenced by the promise of scientific advancements (*p* = 0.01). This lack of scientific interest, but not other motivations, was found to be likely confounded by lower levels of obtained education (*p* = 0.001). Overall, these results suggest that underrepresentation of Hispanics in genetic research may be partly due to reduced invitations to these studies.

Keywords: participation, genetics, research, diversity, Parkinson disease

### INTRODUCTION

Disproportionate participation in genomic studies across different racial and ethnic groups has significant long-term implications for translational benefits associated with this research. Research that focuses on a limited pool of racially and ethnically diverse populations versus a broader array of populations may lead to potential biases in genomic research findings and restrict benefit to the limited population group (Bustamante et al., 2011). Often the same groups that are underserved in health care are underrepresented in genomic research, thus increasing the potential for future overall health disparities. With the development of a precision care model for health services, more genomic information will be integrated into health services (Biesecker and Green, 2014). This development emphasizes the importance of racial or ethnic specific genetic information with respect to disease risk. Our understanding of population-specific genomic information relies on the inclusion of all racial and ethnic groups in genomic research. The absence of such information will render the implementation of precision medicine in the understudied groups less effective at best. Recent summary studies on genomic reports confirm that approximately 80% of individuals included in genome-wide association studies are European or of European descent, with only 1% Hispanic representation (Bustamante et al., 2011; Popejoy and Fullerton, 2016; Sirugo et al., 2019). Even applications of the more recent next-generation sequencing technology still include over 60% European (descent) individuals (Bustamante et al., 2011; Popejoy and Fullerton, 2016; Sirugo et al., 2019). A common belief is that underrepresentation of racial and ethnic groups in biomedical research is the result of reduced willingness to participate because of mistrust and stigma (Shavers et al., 2002; George et al., 2014; Erves et al., 2017).
An alternative position is that underrepresentation can also be ascribed to limited access to research opportunities and reduced invitations to participate (Wendler et al., 2006; Ceballos et al., 2014). Previous studies focusing on participation in research in general have found that among non-white populations, willingness to participate is closely linked to and motivated by concern for personal or overall family or community health (Sanderson et al., 2013; Ulrich et al., 2013; Ceballos et al., 2014; George et al., 2014). Unwillingness, in contrast, is often driven by negative perception of research, lack of personal benefit, and/or fear of results (Sanderson et al., 2013; George et al., 2014; Erves et al., 2017).

Little is known about the perceived barriers and motivations of patients with late-onset neurodegenerative diseases to participate in research. Despite the higher prevalence of Parkinson disease (PD) and Alzheimer disease (AD) in Hispanics versus white non-Hispanics, reports discussing race or ethnicity in AD or PD health care provide evidence of disparities, benefitting non-Hispanic whites more than any other group, in prescribing medications (Hemming et al., 2011; Thorpe et al., 2016), referrals to clinical trials (Schneider et al., 2009), availability of resources (Graham-Phillips et al., 2016), and cost (Gilligan et al., 2013). Specifically for PD, disparities relating to referrals to deep brain stimulation surgery have also been observed (Chan et al., 2014). This disparity extends to genomic research, as there are few data on genetic variation (variants, frequency and/or effect size) contributing to PD (or AD) in non-whites. Interestingly, variants unique to a specific racial or ethnic background are reported for PD (e.g., *PINK1* in Asians; Nuytemans et al., 2010) as well as AD (e.g., *ABCA7* frameshift deletion in African Americans; Farrer et al., 1997; Collins, 1999; Calderon et al., 2006; Reitz et al., 2013; Cukier et al., 2016; Feliciano et al., 2016), indicating a clear need to increase research in non-white populations.

Hispanics can harbor variable levels of admixture of European, African, and Native American ancestry in their genetic background (Mao et al., 2007; Price et al., 2007; Bryc et al., 2010). Detailed analyses in the Hispanic population and other admixed populations can thus inform on genetic contributions to disease in the related ancestral populations. Therefore, these groups can be highly instructive in our understanding of genetic disease across race and ethnicity. For example, through analyses of local ancestry, Rajabli et al. found different risk effects associated with APOEε4 in Hispanic AD depending on the ancestral origin of the region on which the ε4 allele was located (European OR = 10 versus African OR = 3; Rajabli et al., 2018), consistent with the previously observed lower APOEε4 risk for AD in an admixed population of African Americans (Farrer et al., 1997). Additionally, after identifying a strong risk effect in African Americans for *ABCA7* (Reitz et al., 2013) (similar to *APOE* in WNH), we recently identified a pathogenic 44-bp deletion in *ABCA7* specific to the African American population and Caribbean Hispanics with an African ancestral background in the *ABCA7* region (Cukier et al., 2016).

To date, despite Hispanics being the largest minority group in the US (U.S. Census Bureau American Community Survey, 2017), only a handful of genomic studies of the major PD genes (*LRRK2*, *PARK2*, *PARK7*, *PINK1*, and *SNCA*) have focused on PD patients of Hispanic ancestry (Deng et al., 2006; Alcalay et al., 2010; Marder et al., 2010; Saunders-Pullman et al., 2011; Gatto et al., 2013; Duque et al., 2015; Cornejo-Olivas et al., 2017). These studies often present data from a small sample of Hispanic patients and summarize across all Hispanic PD patients, regardless of ancestry. Given the high variability of admixture in these populations, caution is warranted in the interpretation and extrapolation of these results. The only study of a large cohort of Hispanic patients (*N* = 1,150) originating from southern South America reports a highly variable contribution of LRRK2 p.G2019S (originally observed in European patients) to PD in different Latin American countries (Mata et al., 2011). Additionally, Mata et al. observed an enrichment of the LRRK2 variant p.Q1111H in Peruvian and Chilean, but not Uruguayan or Argentinian, PD patients (Mata et al., 2011), suggesting that this variant originated from the Native American genetic background in these patients. Follow-up analyses showed that this variant is common on the Native American background and does not contribute to disease (Cornejo-Olivas et al., 2017). In contrast, screening of GBA identified a population-specific variant (p.K198E) contributing to disease only in the Colombian population (Velez-Pardo et al., 2019). Taken together, the data presented above underscore the need to include admixed and non-European populations in biomedical research of PD and other neurodegenerative disorders to further our understanding of the genetic contribution to PD in these populations with complex genetic architectures as well as across all populations (i.e., transethnic).

Here, we wished to evaluate the willingness of patients affected by a late-onset, complex disease (PD) and their caregivers to participate in genomic research, and the main drivers of this willingness across race and ethnicity to potentially identify issues to address and adjust current enrollment protocols to improve participation across all populations.

#### MATERIAL AND METHODS

#### Human Subject Research Compliance

The presented study was approved by the Institutional Review Board at the University of Miami, and informed consent for the survey was obtained from all participants.

#### Participants and Enrollment

All patients were seen by physicians specializing in movement disorders (CS, CCL) at the University of Miami (UM) Health System's Division of Parkinson's Disease and Movement Disorders clinic. This division serves as the premier referral center for movement disorders patients from abroad with a particular connection to Latin America and the Caribbean. Both Dr. Singer and Dr. Luca speak Spanish and can address the patient in their preferred language. Summary data from the UM Health System suggest that approximately 35% of PD patients identify as Hispanic. Individuals were eligible for this study if they a) had a clinical diagnosis of PD or b) were caregivers of a person with PD and c) were 18 years of age or older. All eligible individuals were referred to the study by their physicians. Patients who agreed to contribute to this survey were approached about a proposed, hypothetical PD genetic research study closely resembling the one ongoing at the John P. Hussman Institute for Human Genomics at UM. All interviewees received the same information including description of the study purpose and requirements (e.g., a single blood draw, collection of personal and family medical history, and no return of personal results). They were then asked whether they were willing to participate in such a study and to complete a brief survey to indicate the reasons for their decision (i.e., participate vs. not participate). All interactions with participants were conducted in the preferred language of the participant.

#### Survey Items

Adapting from a prior in-house study (Cuccaro et al., 2014), we constructed a multi-item survey to assess influences on willingness to participate in genomic/genetic research. This survey asked participants to select reasons that influenced their decision to (refuse to) participate in the proposed genetic study of PD. Individuals who agreed to participate were asked to select from six predefined reasons that could have influenced their decision (e.g., "I want to help find a cure for PD" or "I want to help improve science and knowledge on PD") (**Table 1**, **Supplementary Table**). Individuals who declined participation were asked to select from 10 reasons for this decision (e.g., "I don't like having my blood drawn," "I am concerned my insurance company will find out my results," or "I don't trust what will happen with my sample") (**Supplementary Table**). In addition, we collected socio-demographic information including age, sex, race/ethnicity, and education level. To assess race/ethnicity, we asked participants to indicate what race they identify with, as well as to describe themselves as Latino (indicating geographical ancestral origin in Latin America), Hispanic (referring to Spanish-speaking populations in Latin America with ancestral origin in the Iberian Peninsula), neither, or unknown, and indicate country/region of ancestral origin if known (questions available in **Supplementary Data**).

#### Data Analysis

As described below, we restricted our analyses to individuals who agreed to participate in the proposed, hypothetical genetic study of PD. We tested whether the frequency of endorsement for each of the six reasons for participation differed based on ethnicity using Fisher's exact tests. Given that only 12 individuals indicated that they would not participate in the proposed genetic study, we did not include these data in the statistical analyses.
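As an illustration of the test named above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2×2 table (group by endorsed/not endorsed). The function name and the counts are hypothetical and are not taken from the study; the sketch uses only the Python standard library and follows the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g., two ethnic groups); columns are
    endorsed / not endorsed for a single survey reason.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is as likely or less likely than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )
    return min(1.0, p_value)


# Hypothetical counts, for illustration only (not study data):
# group A: 8 endorsed, 2 did not; group B: 1 endorsed, 5 did not.
p = fisher_exact_two_sided(8, 2, 1, 5)
print(f"two-sided p = {p:.4f}")
```

An exact test of this kind is a natural choice when some cells of the table are small, as is likely when a sample is split by both role (patient or caregiver) and ethnicity.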

### RESULTS

#### Participant Description

Over the course of 27 clinic days (1 day a week from November to July), we interviewed 162 individuals, of whom 113 were PD patients. The remaining 49 individuals presented themselves as caregivers for the patient (35 spouses/partners, 11 children, 3 other). The majority of patients and caregivers identified as white, Hispanic (WH; ~63% and ~59%, respectively) or white, non-Hispanic (WNH; ~30% and ~37%, respectively). Most of the Hispanic participants were of Cuban ancestry (~60%), followed by Colombian (~10%) and Puerto Rican (~9%) ancestry. These figures correspond to the demographic figures for the larger Miami area. Among remaining participants, 3% of individuals identified as Black/African American, 1.2% identified as Arab, and less than 1% identified as Asian. Given these small numbers, we restricted our analyses to WH (*N* = 101) and WNH (*N* = 52) participants.

TABLE 1 | Comparison of endorsement rates per reason in Hispanics versus non-Hispanics in patient and caregiver groups.

*\*Fisher's exact test; \*\*Higher education defined as education received after high school; \*\*\*Other (encompassing any other reasons participants indicated as motivation, not in the predetermined list) was not analyzed due to low number of endorsements.*

Mean age at interview did not differ by ethnicity within the patient (WH 66.7y/WNH 67.1y) or caregiver (WH 59.3y/WNH 59.9y) groups. However, a significant difference in reported education level was observed, with >88% of WNH participants reporting education beyond high school versus ~60% of WH participants (*p*~0.0001 across combined patient/caregiver group, **Table 1**).

#### Survey Results

Overall, ~91% of WH and WNH participants reported they would participate in the proposed PD genomic study. Specifically, 97% of the 106 patients would agree to participate, with no observed difference between the two ethnic groups (*p* = 1.00). Among patients, "*To help find a cure for PD*" and "*I suffer from PD*" were endorsed at similarly high frequencies between ethnic groups (~91%, *p* = 0.68 and ~83%, *p* = 0.78, respectively), making them the most frequently endorsed reasons for participating in the proposed study (**Table 1**). In contrast, nominally significantly higher frequencies of WNH versus WH patients endorsed reasons driven by scientific discovery: "*To find new/better treatments for PD*" (87.5% vs 61.4%, respectively; *p* = 0.02) and "*To improve science and knowledge about PD*" (75% vs 46%, respectively; *p* = 0.01). Additionally, we observed a nominally significant difference for "*I'm encouraged by family/friend to participate*" (WNH 30% vs WH 10%; *p* = 0.04).

In the caregivers group, no difference in willingness to participate in genetic studies was observed, though overall percentage was lower than in patients (86% in WH versus 83% in WNH). Interestingly, while a similar pattern of results was observed among WNH and WH caregivers for the same statements as for the patients, one exception was noted as 86.7% of WNH caregivers endorsed "*To help future generations with PD*" as a reason for participating in the proposed study versus only 44% of WH caregivers (*p* = 0.01).

Given the observed differences in level of education between the two groups, we also analyzed the data based on education status regardless of ethnicity to assess confounding effects (**Table 2**). We observed a nominally significant difference for motivation by "*To improve science and knowledge about PD*" for higher educated versus non-higher educated participants in the patient group (*p* = 0.001; 62.5% vs 26.7%) as well as overall (*p* = 0.002; 64.6% vs 34.9%). No difference in motivation was observed for the caregiver group based on education level.

#### DISCUSSION

Given the growing impact of genomic information on clinical care for increasing numbers of conditions, it is of utmost importance to recognize the genetic differences among racial and ethnic groups. Available research findings for PD or other neurodegenerative disorders on mostly WNH or Asian population groups are not necessarily generalizable to all individuals (Bustamante et al., 2011). Very recently, genetic research for the more common neurodegenerative disease AD in diverse populations of African Americans and Hispanics has shown the power of these analyses across race and ethnicity to identify variants contributing to disease and improve the field's understanding of disease mechanisms [e.g., GBA (p.K198E) in Hispanics (Velez-Pardo et al., 2019), ABCA7 in Africans and African Americans (Reitz et al., 2013; Cukier et al., 2016), differential risk of APOEε4 on different ancestral backgrounds (Rajabli et al., 2018)]. These data support the importance of extending genomic research to diverse populations for neurodegenerative disease to fully understand genetic risk factors contributing to disease.

More recently, the number of studies evaluating recruitment issues and methods in different racial and ethnic groups versus the traditional European research population has grown with the rise of precision medicine initiatives, though there are very few for complex, late-onset diseases (Zhou et al., 2016; Hughes et al., 2017). Our results indicate that WH individuals affected by or caring for someone with PD seen at the UM Movement Disorders clinic would be as willing as WNH individuals to participate in genomic research for the late-onset disease PD, given current enrollment protocols. One could argue that the high rate of willingness reflects an increased interest in research among individuals seeking treatment at an academic medical center. However, we observed a significant difference between WNH and WH participants in the endorsement of research progress as a reason to participate, indicating that interest in research has, at the very least, lower priority than other, more personal reasons for WH participants. Additional analyses showed that the lower endorsement of scientific progress as a motivation is likely correlated with lower educational level. This divergence could potentially be explained by an underlying lower level of knowledge of or familiarity with basic science and medical research in the WH participant group. Interestingly, the few participants who provided an open answer as a reason to participate ("*other*") indicated they are more willing to participate to help their doctor, with whom they have a good relationship. Though these were limited numbers, these data might suggest a higher level of trust between physicians/researchers and participants through a more personal relationship and being helped in the participant's language of choice.

Taking together the high willingness seen here and the current underrepresentation of WH participants in medical research, we propose that the underrepresentation of WH individuals in PD research is in part due to a reduced invitation to participate. It is therefore important for the medical and scientific fields to make a concerted effort to reach out to the different communities and truly establish a relationship as well as inform on and extend participation in (PD) studies to all races and ethnicities. This investment in community outreach will lead to a more equal representation in research and ultimately to a reduction in health disparities.

#### ETHICS STATEMENT

The presented study was approved by the Institutional Review Board at the University of Miami and informed consent for the survey was obtained from all participants.

### AUTHOR CONTRIBUTIONS

KN, MC, WS, CS, and CL contributed conception and design of the study; KN, CM, and AU managed and organized the project; KN performed the statistical analysis; KN and MC wrote the first draft of the manuscript; WS, JV, CS, and CL critically reviewed the manuscript. All authors contributed to manuscript revision and read and approved the submitted version.

### FUNDING

This research was funded by a National Parkinson Foundation Moving Day® grant (PI Nuytemans).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00658/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor declared a past co-authorship with one of the authors WS.

*Copyright © 2019 Nuytemans, Manrique, Uhlenberg, Scott, Cuccaro, Luca, Singer and Vance. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Understanding Participation in Genetic Research Among Patients With Multiple Sclerosis: The Influences of Ethnicity, Gender, Education, and Age

#### Edited by:

Jessica Nicole Cooke Bailey, Case Western Reserve University, United States

#### Reviewed by:

Satyanarayana M. R. Rao, Jawaharlal Nehru Centre for Advanced Scientific Research, India

Marsha Michie, Case Western Reserve University, United States

> \*Correspondence: Jacob L. McCauley jmccauley@med.miami.edu

#### Specialty section:

This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics

Received: 12 October 2018 Accepted: 31 January 2020 Published: 13 March 2020

#### Citation:

Cuccaro ML, Manrique CP, Quintero MA, Martinez R and McCauley JL (2020) Understanding Participation in Genetic Research Among Patients With Multiple Sclerosis: The Influences of Ethnicity, Gender, Education, and Age. Front. Genet. 11:120. doi: 10.3389/fgene.2020.00120

Michael L. Cuccaro<sup>1,2</sup>, Clara P. Manrique<sup>1</sup>, Maria A. Quintero<sup>1</sup>, Ricardo Martinez<sup>1</sup> and Jacob L. McCauley<sup>1,2</sup>\*

<sup>1</sup> John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, United States, <sup>2</sup> Dr. John T. Macdonald Foundation, Department of Human Genetics and Genomics, University of Miami Miller School of Medicine, Miami, FL, United States

This study examined reasons for participation in a genetic study of risk for multiple sclerosis (MS). Our sample consisted of 101 patients diagnosed with MS who were approached about enrolling in the Multiple Sclerosis Genetic Susceptibility Study. Participants were predominantly Hispanic (80%), female (80%), and well educated (71%), having at least some level of college education. Of these 101 individuals who were approached, 95 agreed to participate and are the focus of this report. Among enrollees, the most frequently cited reasons for participation were to find a cure for MS (56%), having MS (46%), and helping future generations (37%). In regression models comparing ethnic groups, Hispanics endorsed having MS as a reason to participate significantly more frequently than non-Hispanics (Hispanic 52%, non-Hispanic 19%, p = 0.015), while non-Hispanics endorsed finding new and better treatments significantly more frequently than Hispanics (Hispanic 17%, non-Hispanic 50%, p = 0.003). Among our three age groups, younger individuals endorsed finding a cure for MS significantly more frequently (74% of 18–35-year-olds vs. 56% of 36–55-year-olds vs. 39% of >55-year-olds). Our results suggest that motivations for participation in genetic research vary by ethnicity, and that these influences need to be considered in developing more inclusive programs of disease-related genetic research. Future efforts should focus on development of standard methods for understanding participation in genetic and genomic research, especially among underrepresented groups, as a catalyst for engaging all populations.

Keywords: participation, genetics, research, minorities, motivation, multiple sclerosis

## INTRODUCTION

It is widely believed that underrepresented groups are less willing to participate in biomedical research due to barriers such as mistrust, stigma, and competing demands, leading to underrepresentation (Shavers et al., 2002; George et al., 2014). However, under-representation in biomedical research is also a by-product of limited access to research opportunities and reduced invitations to participate (Wendler et al., 2006; Katz et al., 2007), which persists to this day (Jones et al., 2017). Thus, even in situations where willingness to participate in biomedical research among underrepresented populations is indistinguishable from other groups, levels of participation may differ for other reasons (Katz et al., 2009; Fisher and Kalbaugh, 2011). Importantly, it is not clear that underrepresented groups' attitudes about participation in biomedical research extend to participation in genetic research. Reduced willingness to participate in genetic research has generally been attributed to unfavorable attitudes about this type of research (Matsui et al., 2005). Clearly, there is much to be learned about why individuals from underrepresented populations participate in genetic research.

Among underrepresented populations, consistent themes for participation include altruism, benefit to family members, self-benefit, and personal curiosity (Sanderson et al., 2013; Walker et al., 2014). Similarly, concerns about individual and family health as well as helping the common good were primary motivations for participation in genetic research among African Americans enrolled in the Jackson Heart study (Walker et al., 2014). Respondents in this study also reported being motivated by the opportunity to get involved in something that would help African Americans across the country; most expressed a high confidence and trust in the study leaders and staff. Sanderson and colleagues conducted structured interviews to assess willingness to participate in genomics research on complex diseases among a diverse group of participants from an inner-city hospital, which included black, Hispanic, and non-Hispanic white individuals (Sanderson et al., 2013). Results showed that willingness to participate was motivated by altruism, benefit to family members, personal health benefit, personal curiosity and improving understanding. In contrast, unwillingness to participate was motivated by negative perceptions of research, lack of perceived personal relevance, negative feelings about procedures (e.g., blood draws), practical barriers, and fear of results (Sanderson et al., 2013).

The importance of participation in genetic research has implications for translational benefits associated with such research. For various groups that may already be under-served, an under-representation in genetic research can amplify future health disparities. For instance, Bustamante and colleagues report that failure to investigate a "broader ensemble of populations" will bias findings from genomic research and benefit only the privileged segment of the population who participate (Bustamante et al., 2011). While this situation has improved somewhat, there is still an underrepresentation of non-European populations in genetic research, which is crucial to ensuring that the benefits of research are available for all (Popejoy and Fullerton, 2016). The importance of genetics for health services has been anticipated for some time (Sterling et al., 2006). More than 10 years after Sterling and colleagues described the importance of genetics for health services (Sterling et al., 2006), the integration of genetics in health services has arrived as whole exome and whole genome sequencing technologies are increasingly present in clinical settings (Biesecker and Green, 2014; Krier et al., 2016). However, as noted by Landry and colleagues, a lack of equitable representation in this new era of precision medicine research will inhibit translational benefits for groups not represented (Landry et al., 2018).

Efforts to include underrepresented groups in genetic and genomic research have increased, albeit slowly. One line of study has examined influences on willingness to participate, including motivations. To date, findings from studies of motivation to participate in genomic research among underrepresented populations have been mixed, and some of the observed differences in outcomes may be attributable to study design. For example, some studies assess motivations to participate among individuals who enroll or decline participation in a genetic risk study (i.e., actual participation) (Parikh et al., 2017) while others survey intentions to participate (Halbert et al., 2016; Cooke Bailey et al., 2018). Similarly, some studies enroll patients who are from the general population of patients in both hospital and nonhospital settings (Sanderson et al., 2013; Walker et al., 2014; Jones et al., 2017), while others assess factors associated with participation among patients with specific diseases (Parikh et al., 2017). This is an important distinction as motivational factors vary considerably depending on the type of study and population (e.g., clinical trial vs. observational study, disease group vs. healthy population) (Goodman et al., 2018; Goodman et al., 2019). Further, the set of reasons that motivate healthy individuals to participate is likely very different from reasons that motivate individuals with specific diseases. To date, there have been limited studies using methods which directly ask individuals with specific diseases about reasons for participating in genetic research for those diseases. Acknowledging the concerns raised by Goodman and colleagues around conflating disease and healthy population studies and methods, we believe that asking patients who enroll in genetic studies about their reasons for enrollment is the most informative approach. This belief is supported by the work of the Clinical Sequencing Exploratory Research (CSER) consortium, which has investigated multiple facets of participation in genomic research, including why patients decline to participate (Amendola et al., 2018).

For this study, we asked patients with multiple sclerosis (MS) who were participating in a genetic risk study for MS to identify the primary reasons or motivations for participation using questions based on information from prior qualitative studies. We examined the frequencies of responses in relation to ethnicity, age, and gender. To date, incorporating genetics into precision medicine for MS is a work in progress (Giovannoni, 2017; Hansen and Okuda, 2018), but there has been considerable progress over the past several years (Matthews, 2015). As these genetic discoveries slowly accrue and become clinically useful, it is equally important that they are applicable across populations (Hindorff et al., 2018; Bonham et al., 2018). However, as noted above, the utility of genomic information in clinical settings rests on a foundation of established findings from prior studies and the absence of such information affects interpretation of clinical findings. Thus, a lack of diversity in research has the potential to exacerbate existing inequalities in health care (Popejoy and Fullerton, 2016). Given the under inclusion of non-European ancestry groups in genetic and genomic research, a necessary first step is to understand the factors that influence participation and then use this information to create more inclusive ascertainment.

#### METHODS

#### Human Subjects Research Compliance

All procedures followed were in accordance with the ethical standards of the Institutional Review Board at the University of Miami Miller School of Medicine, and with the Helsinki Declaration of 1975, as revised in 1999 (Human, 1999). Informed consent was obtained from all participants included in the study.

#### Participants and Enrollment

Participants for this study consisted of 101 patients with a diagnosis of MS who were ascertained through the University of Miami Health System's MS Center of Excellence, as well as the local community. Patients were eligible for this study if they had a clinical diagnosis of MS and were 18 years of age or older.

Potential enrollees in the genetic risk for MS study were recruited in the clinic setting or at community outreach events, at which time they were invited to participate. Most of our participants were enrolled in the clinic setting, indicative of the volume of patients available at that site. Once they indicated their decision, the clinical coordinator would ask individuals to select one or more reasons for their decision (i.e., to participate in the genetic research study or not) from a list of possible reasons (which were presented to the participant) and record their answers. Participants also provided socio-demographic information at that time. All materials were presented in the preferred language of the participant.

#### Measures

#### Sociodemographic information

Participants were asked their gender, race-ethnicity, and religious affiliation. In addition, they were asked to indicate their age group and education level.

#### Reasons for participation

We identified 11 possible reasons for participation (two of which were "other" and "not sure") in a genetic research study (see list of reasons in Supplementary Information). The reasons were derived from multiple studies of reasons for participating in biomedical research (e.g., clinical trials and observational studies) as well as biobank and genetic studies (Streicher et al., 2011; Lang et al., 2013; Sanderson et al., 2013; Walker et al., 2014) that were primarily conducted among convenience samples of individuals with no known disease or illness. Given the paucity of published methods for evaluating willingness to participate in clinical populations, we created questions that reflected the primary themes from other types of qualitative research (e.g., structured interviews and focus groups) that assessed willingness to participate in genetic research for reasons such as altruism (e.g., To help future generations), personal benefit (e.g., I suffer from MS), and advancing research (e.g., To help improve science and knowledge about MS). The questions were drafted by one of the investigators (clinical psychologist) and subsequently reviewed by other team members, including the director of patient and family ascertainment and senior clinical coordinators, all of whom have extensive experience in participant recruitment. Following revisions, the survey was administered to various staff to evaluate wording, item order, and item complexity.

#### Data Analysis

Our primary questions of interest involved whether endorsement of reasons for participating in the genetic risk for MS study differed by ethnicity, gender, education, and age. To answer these questions, we conducted separate logistic regression analyses using ethnicity, gender, and education as binary outcomes (i.e., Hispanic vs. non-Hispanic, male vs. female, any college vs. no college), and our survey items as predictor variables. For age, we conducted multinomial logistic regression with three levels of our outcome variable (young = 18–35 years, middle = 36–55 years, and older = >55 years). We tested each of the models for significance and report on those items which are significant contributors to the respective models (i.e., which items predict the outcomes of interest, e.g., Hispanic vs. non-Hispanic), thereby reducing the number of significance tests to those associated with the four overall tests (corrected significance level p = 0.0125). Odds ratios and confidence intervals are available for each model. All statistical analyses were performed using SPSS version 24 software (SPSS, 2013) and were restricted to individuals who agreed to participate (n = 95).
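The corrected significance level quoted above is consistent with a Bonferroni-style adjustment, assuming a conventional family-wise level of 0.05 divided across the four overall model tests (0.05 / 4 = 0.0125). The text does not name the correction, so this is an assumption. The sketch below shows that arithmetic; the function names and the model-level p-values are hypothetical and are not results from the study.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def flag_significant(p_values: dict, family_alpha: float = 0.05) -> dict:
    """Mark each test as significant if its p-value meets the corrected threshold."""
    threshold = bonferroni_alpha(family_alpha, len(p_values))
    return {name: p <= threshold for name, p in p_values.items()}


# Four overall model tests, as in the analysis plan; the p-values are
# made up for illustration and are NOT taken from the study.
hypothetical_model_p = {
    "ethnicity": 0.004,
    "gender": 0.300,
    "education": 0.020,
    "age": 0.010,
}

print(bonferroni_alpha(0.05, 4))
print(flag_significant(hypothetical_model_p))
```

Under this scheme a model with p = 0.020 would not count as significant, even though it falls below the uncorrected 0.05 level.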

#### RESULTS

Among the 101 individuals approached about participating in the genetic risk for MS study, 95 (94%) agreed to participate. All results are based on this group of 95 individuals. As seen in Table 1, most of our participants were Hispanic (N = 79; 83%) and female (N = 78; 82%). We tested whether our Hispanic and non-Hispanic participants differed with respect to gender and found no differences in the proportions of males and females by ethnicity (Fisher's Exact Test, p = 0.15). Similarly, while a large percentage of the sample was college educated (71%), we found that our Hispanic and non-Hispanic participants did not differ in education (p = 0.58). Finally, there were no differences in age by ethnic group (p = 0.47).
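As a quick check on the cohort figures in this paragraph, the Python sketch below recomputes the reported percentages from the stated counts. The complementary group sizes (non-Hispanic and male) are derived by subtraction and are not quoted in the text; the helper `pct` is illustrative only.

```python
approached = 101   # individuals approached about the study
enrolled = 95      # individuals who agreed to participate
hispanic = 79      # Hispanic participants
female = 78        # female participants


def pct(count: int, total: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


# Derived by subtraction; these counts are not stated in the text.
non_hispanic = enrolled - hispanic
male = enrolled - female

print(f"agreed to participate: {pct(enrolled, approached)}%")
print(f"Hispanic: {pct(hispanic, enrolled)}% (non-Hispanic n = {non_hispanic})")
print(f"female: {pct(female, enrolled)}% (male n = {male})")
```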

#### TABLE 1 | Cohort description (N=95).

Examination of overall endorsement patterns (Figure 1) showed that finding a cure, endorsed by 56% of participants, was the most frequently cited reason for participating in the study. In addition, having MS and helping future generations were endorsed by a majority of participants as reasons to enroll in the MS study.

Table 2 summarizes the endorsement patterns for the respective items by ethnicity, gender, education, and age. At the descriptive level, inspection of the frequencies of endorsements shows that Hispanic and non-Hispanic participants cited finding a cure equally often (56% per group), making it the most common reason in both groups. However, Hispanic participants endorsed having a disease as a reason to participate in the genetic risk for MS study more frequently than non-Hispanics (HI 52%, NH 19%). Conversely, non-Hispanic participants cited finding new/better treatments more frequently than Hispanics (NH 50%, HI 17%).

TABLE 2 | Percentage of endorsements for reasons to participate by ethnicity, sex, education, and age (N and % values).


Young=18–35 years.

Middle=36–55 years.

Old= > 55 years.

Endorsement patterns by sex, age, and education were similar to those identified in our ethnic groups as finding a cure and having multiple sclerosis were endorsed consistently as reasons for participating in the MS study.

To test for differences in reasons for participating in genetic research we conducted separate logistic regressions to ascertain the effects of the respective survey items (i.e., reasons for participating) on different binary (ethnicity, sex, and education groups) and multinomial (age groups) outcomes. For each of the respective analyses, we restricted our predictors to the following survey items: I want to help find a cure for MS; To help improve science and knowledge about MS; To find new/better treatments for MS; I suffer from MS; To help future generations; The doctor asked/recommended that I participate; and, Encouragement from a family member or friend. The remaining items were not cited as reasons for participating by more than one individual.

#### Ethnic Group

Our logistic regression model evaluating the ability of survey items to predict ethnic group (Hispanic vs. non-Hispanic) was statistically significant, χ<sup>2</sup> (6) = 20.61, p = 0.002. Of the six predictor variables (i.e., survey items that were reasons for participating in the study), three contributed significantly to the model: I suffer from MS, To find new/better treatments for MS, and Encouragement from a family member or friend. These items differed between our Hispanic and non-Hispanic participants. Among the three items, the largest OR (7.36; CI 1.52, 35.68) was found for the item I suffer from MS, indicating that endorsing this item as a reason was more likely among Hispanics vs. non-Hispanics. Conversely, To find new/better treatments for MS (OR = 0.15) and Encouragement from a family member or friend (OR = 0.13) were associated with a reduced likelihood of endorsement by Hispanics vs. non-Hispanics. Table 3 has the odds ratios and confidence intervals for these results.
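The following Python sketch illustrates one way to read the odds ratios in this paragraph. It assumes the reported interval for I suffer from MS is a Wald 95% interval, i.e., symmetric around the regression coefficient on the log-odds scale; the text does not state the interval type, and the helper name `wald_log_or` is hypothetical.

```python
import math


def wald_log_or(ci_lower: float, ci_upper: float, z: float = 1.96):
    """Recover the log-odds coefficient and its standard error from a Wald CI.

    Assumes the interval is exp(beta - z*se) to exp(beta + z*se).
    """
    log_lower, log_upper = math.log(ci_lower), math.log(ci_upper)
    beta = (log_lower + log_upper) / 2        # midpoint on the log scale
    se = (log_upper - log_lower) / (2 * z)    # half-width divided by z
    return beta, se


# Interval reported for the item "I suffer from MS".
beta, se = wald_log_or(1.52, 35.68)
print(f"implied OR = {math.exp(beta):.2f}, SE of log OR = {se:.3f}")

# An OR below 1 for Hispanics vs. non-Hispanics can be read in the opposite
# direction (non-Hispanics vs. Hispanics) by taking its reciprocal.
protective_items = {
    "To find new/better treatments for MS": 0.15,
    "Encouragement from a family member or friend": 0.13,
}
reciprocal_or = {item: 1 / value for item, value in protective_items.items()}
for item, value in reciprocal_or.items():
    print(f"{item}: non-Hispanic vs. Hispanic OR = {value:.1f}")
```

The wide interval reflects the small non-Hispanic group, so the point estimate should be read as imprecise even though the interval excludes 1.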

#### Sex

The logistic regression model evaluating the ability of survey items to predict sex was not significant, χ<sup>2</sup> (7) = 6.54, p = 0.478, as none of the items differed between males and females. The odds ratios and confidence intervals for the respective items are available in Supplementary material (Supplementary Table 1).

#### Education

Similar to the logistic regression model for sex, the model that evaluated the ability of survey items to predict educational group (college vs. no college) was not significant, χ<sup>2</sup> (7) = 7.33, p = 0.396. The odds ratios and confidence intervals for the respective items are also available in Supplementary material (Supplementary Table 2).

#### Age

As seen in Table 2, we collapsed the various age groups into three categories (18–35 years of age, 36–55 years of age, and >55 years of age). The likelihood ratio test of overall model fit was not significant, χ<sup>2</sup> (14) = 13.23, p = 0.508. For one of the predictors (I want to help find a cure for MS), we observed a trend in the comparison of the older and younger groups (p = 0.021), although given that the omnibus test was not significant, this finding did not survive correction for multiple tests. However, the odds ratio for selecting this item as a reason to participate among younger vs. older participants was 4.896 (95% CI 1.28, 18.79), suggesting that endorsement of this item is more likely among younger vs. older participants. These results along with the additional parameter estimates are available in supplementary material (Supplementary Table 3).
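The four omnibus statistics reported in the Results can be cross-checked against the chi-square distribution. The Python sketch below is not the authors' SPSS procedure; it uses the standard closed-form series for the chi-square upper tail with integer degrees of freedom, and the function name `chi2_sf` is illustrative. The threshold of 0.0125 is the corrected level given in the Data Analysis section.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        half = x / 2
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_r x^(r - 1/2) / (2r - 1)!!
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


CORRECTED_ALPHA = 0.0125  # corrected significance level from the Data Analysis section

# (chi-square statistic, degrees of freedom) as reported for each model.
reported = {
    "ethnic group": (20.61, 6),
    "sex": (6.54, 7),
    "education": (7.33, 7),
    "age": (13.23, 14),
}

p_values = {name: chi2_sf(stat, df) for name, (stat, df) in reported.items()}
for name, p in p_values.items():
    verdict = "significant" if p < CORRECTED_ALPHA else "not significant"
    print(f"{name}: p = {p:.3f} ({verdict})")
```

Only the ethnic group model falls below the corrected threshold, in line with the conclusions stated in the text.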

### DISCUSSION

Overall, our logistic regression analyses yielded only one significant model which showed that there were different reasons for participating in genetic research between Hispanics and non-Hispanics. Among the reasons for participating, personal experience with MS (i.e., I suffer from MS), was strongly associated with Hispanics vs. non-Hispanics with an odds ratio of 7.36. In contrast, non-Hispanics were significantly more likely to endorse helping to discover new treatments (OR = 0.15) as a reason to participate. While personal experience with MS and discovery of new treatments are generally aligned with a theme of deriving personal benefit, the differences may hint at subtle distinctions between Hispanics and non-Hispanics or how the items were interpreted. Certainly, our findings regarding Hispanics being motivated by having a disease (i.e., MS) are in line with prior research showing that Hispanics are more likely to participate in biomedical research if it is relevant to them (Ulrich et al., 2013). Note that one additional item, encouragement from others (OR = 0.13), was less likely to be endorsed by Hispanics as a reason to participate in genetic research, again possibly reflecting personal motivation. The second item, finding new and better treatments, was endorsed by 50% of non-Hispanics vs. only 17% of Hispanics, and has elements of personal benefit as well as altruism. Further, while not significant, 50% of non-Hispanics endorsed helping future generations as a reason for motivation compared to 34% of Hispanics. Even though this difference was not significant, when coupled with the results regarding the item finding new and better treatments, there is a suggestion that Hispanics and non-Hispanics with MS may have different perspectives on what they see as priorities for participation.

OR, odds ratio \*significant coefficients.

MS, multiple sclerosis.

Importantly, while interpretations of the above response patterns are reasonable and fit with previously published findings regarding personal meaningfulness and benefit to society (Goodman et al., 2019), we would encourage caution in interpretation of the results. In particular, given that we only asked participants to indicate if a particular reason motivated them to participate, endorsements could be interpreted in multiple ways. For instance, endorsement of I suffer from MS as a reason to participate could simply be acknowledging that their participation is important for research vs. a desire to derive personal benefit. Ultimately, in the absence of open-ended responses that could explain participant reasoning, multiple inferences about the meaningfulness of the data are possible.


At a descriptive level, our results show that among enrollees in an MS genetic risk study, the most frequently cited reason for participating was finding a cure for MS. While this reason for participation did not differ by ethnicity, sex, or education, there was a trend among participants in different age groups. Specifically, for the item I want to help find a cure for MS, a positive response was more likely among younger (i.e., 18–35 years) vs. older (>55 years) participants; our middle age group (36–55 years) did not differ from younger or older participants for this item. While it is not surprising that endorsement of finding a cure is high among respondents as a whole, especially given that seeking personal benefit is a powerful motivator for participation in biomedical and genetic research, an age-related effect has not been previously reported. Thus, while many studies adjust for age in their analyses to control for its influence on outcomes, this variable may be of value in terms of understanding the likelihood of participation. For instance, participants in the younger age groups may be more enthusiastic about finding a cure as they are still early in the disease process. At a minimum, investigators seeking to enroll participants for genetic studies should be aware of how age may affect motivations to participate in research when developing recruitment strategies.

The current study offers new information about motivations for participation in MS genetic research as a function of ethnicity and age. While the strengths of the study are its focus on individuals who have a disorder (MS) vs a hypothetical scenario, and the inclusion of Hispanics, the results should be interpreted with caution in light of several factors including small sample size, higher education levels, and a high rate of willingness to participate, raising the possibility of bias related to their being approached during a clinical encounter (i.e., at a neurology appointment). Consequently, our results may not be generalizable to individuals with MS who are receiving services outside of academic medical centers or those who are not receiving care. Moving forward, collecting more information such as duration and severity of illness, acculturation, and trust in the health care system, could reveal subtle influences on reasons for participation in genetic research. Finally, as noted in the Methods section, we developed the items (i.e., reasons for participation) based on themes from qualitative research conducted with mainly non-disease populations. Given the preliminary nature of our study, the questions have limited formal validation data. However, given the interesting results, we are expanding our efforts to learn more about participant motivations by providing participants an opportunity to explain their choices and recruiting both healthy individuals and those with diseases to compare response patterns. We believe these efforts will increase our ability to understand the nuances of why individuals participate in genetic studies and if those reasons vary by race and ethnicity.

In summary, this study adds to our understanding of influences on actual participation in research studies about genetic risk. Based on our study, it appears that ethnicity was the only factor significantly associated with the reasons participants gave for enrolling. Studies like this and others provide valuable information about why individuals ultimately participate in genetic research and can inform the development of recruitment strategies. Inclusive enrollment is critical to translational efforts that can play a major role in improving the health and wellbeing of all individuals.

## DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

### ETHICS STATEMENT

This research was approved by the Institutional Review Board, University of Miami Miller School of Medicine. MC, CM, MQ, RM, and JM declare that they have no conflict of interest. All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2000 (5). Informed consent was obtained from all patients included in the study.

### AUTHOR CONTRIBUTIONS

MC, CM, MQ, RM, and JM contributed to the design and implementation of the research, to the analysis of the results, and to the writing of the manuscript.

#### FUNDING

The research reported in this publication was supported by the National Institutes of Health (NIH) through the National Institute of Neurological Disorders and Stroke (NINDS) under award number 1R01NS096212, the National Institute on Minority Health and Health Disparities (NIMHD) and the National Human Genome Research Institute (NHGRI) under award number U54MD010722, and the National Multiple Sclerosis Society (NMSS) under award number RG4680A1. All content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NMSS.

#### ACKNOWLEDGMENTS

We gratefully acknowledge the resources provided by the John P. Hussman Institute for Human Genomics and the strong support of the South Florida chapter of the NMSS. We also thank the multiple sclerosis genetic study participants and their families for their willingness to participate in our research studies.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00120/full#supplementary-material

### REFERENCES

a population based cohort study involving genetic analysis. J. Med. Ethics 7, 385–392. doi: 10.1136/jme.2004.009530

genetics of glaucoma. Ethn. Health 24 (6), 1–11. doi: 10.1080/13557858.2017.1346189


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer MM and handling Editor declared their shared affiliation.

Copyright © 2020 Cuccaro, Manrique, Quintero, Martinez and McCauley. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
