# NEXT-GENERATION SEQUENCING IN PHARMACOGENETICS/GENOMICS

EDITED BY: Ulrich M. Zanger, José A. G. Agúndez and Amit V. Pandey

PUBLISHED IN: Frontiers in Pharmacology and Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-891-8 DOI 10.3389/978-2-88963-891-8



Topic Editors: Ulrich M. Zanger, Dr. Margarete Fischer-Bosch Institut für Klinische Pharmakologie (IKP), Germany; José A. G. Agúndez, University of Extremadura, Spain; Amit V. Pandey, University of Bern, Switzerland

Citation: Zanger, U. M., Agúndez, J. A. G., Pandey, A. V., eds. (2020). Next-Generation Sequencing in Pharmacogenetics/genomics. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-891-8

# Table of Contents

*Pharmacogenetic Variation in Over 100 Genes in Patients Receiving Acenocumarol*

Vanessa Gonzalez-Covarrubias, Javier Urena-Carrion, Beatriz Villegas-Torres, J. Eduardo Cossío-Aranda, Sergio Trevethan-Cravioto, Raul Izaguirre-Avila, O. Javier Fiscal-López and Xavier Soberon

*Multiplexed Nanopore Sequencing of HLA-B Locus in Māori and Pacific Island Samples*

Kim N. T. Ton, Simone L. Cree, Sabine J. Gronert-Sum, Tony R. Merriman, Lisa K. Stamp and Martin A. Kennedy


Yitian Zhou, Kohei Fujikura, Souren Mkrtchian and Volker M. Lauschke

*A New Panel-Based Next-Generation Sequencing Method for ADME Genes Reveals Novel Associations of Common and Rare Variants With Expression in a Human Liver Cohort*

Kathrin Klein, Roman Tremmel, Stefan Winter, Sarah Fehr, Florian Battke, Tim Scheurenbrand, Elke Schaeffeler, Saskia Biskup, Matthias Schwab and Ulrich M. Zanger

*Next-Generation Sequencing of* PTGS *Genes Reveals an Increased Frequency of Non-synonymous Variants Among Patients With NSAID-Induced Liver Injury*

María Isabel Lucena, Elena García-Martín, Ann K. Daly, Miguel Blanca, Raúl J. Andrade and José A. G. Agúndez

*Genetic Association of Olanzapine Treatment Response in Han Chinese Schizophrenia Patients*

Wei Zhou, Yong Xu, Qinyu Lv, Yong-hui Sheng, Luan Chen, Mo Li, Lu Shen, Cong Huai, Zhenghui Yi, Donghong Cui and Shengying Qin

*Actionable Pharmacogenetic Variation in the Slovenian Genomic Database*

Keli Hočevar, Aleš Maver and Borut Peterlin

*Integrating Next-Generation Sequencing in the Clinical Pharmacogenomics Workflow*

Efstathia Giannopoulou, Theodora Katsila, Christina Mitropoulou, Evangelia-Eirini Tsermpini and George P. Patrinos

*Star Allele-Based Haplotyping versus Gene-Wise Variant Burden Scoring for Predicting 6-Mercaptopurine Intolerance in Pediatric Acute Lymphoblastic Leukemia Patients*

Yoomi Park, Hyery Kim, Jung Yoon Choi, Sunmin Yun, Byung-Joo Min, Myung-Eui Seo, Ho Joon Im, Hyoung Jin Kang and Ju Han Kim

*Identification of Novel Biomarkers for Drug Hypersensitivity After Sequencing of the Promoter Area in 16 Genes of the Vitamin D Pathway and the High-Affinity IgE Receptor*

Gemma Amo, Manuel Martí, Jesús M. García-Menaya, Concepción Cordobés, José A. Cornejo-García, Natalia Blanca-López, Gabriela Canto, Inmaculada Doña, Miguel Blanca, María José Torres, José A. G. Agúndez and Elena García-Martín

# Pharmacogenetic Variation in Over 100 Genes in Patients Receiving Acenocumarol

Vanessa Gonzalez-Covarrubias<sup>1</sup>\*, Javier Urena-Carrion<sup>1</sup>, Beatriz Villegas-Torres<sup>1</sup>, J. Eduardo Cossío-Aranda<sup>2</sup>, Sergio Trevethan-Cravioto<sup>2</sup>, Raul Izaguirre-Avila<sup>2</sup>, O. Javier Fiscal-López<sup>2</sup> and Xavier Soberon<sup>1</sup>

<sup>1</sup> Instituto Nacional de Medicina Genomica, Mexico City, Mexico, <sup>2</sup> Instituto Nacional de Cardiologia, Mexico City, Mexico

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Marcelo Rizzatti Luizon, Universidade Federal de Minas Gerais, Brazil; Wenndy Hernandez, University of Chicago, United States

#### \*Correspondence:

Vanessa Gonzalez-Covarrubias vgonzalez@inmegen.gob.mx

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology

Received: 14 September 2017; Accepted: 09 November 2017; Published: 23 November 2017

#### Citation:

Gonzalez-Covarrubias V, Urena-Carrion J, Villegas-Torres B, Cossío-Aranda JE, Trevethan-Cravioto S, Izaguirre-Avila R, Fiscal-López OJ and Soberon X (2017) Pharmacogenetic Variation in Over 100 Genes in Patients Receiving Acenocumarol. Front. Pharmacol. 8:863. doi: 10.3389/fphar.2017.00863

Coumarins are widely prescribed worldwide, and in Mexico acenocumarol is the preferred form. Despite their efficacy, coumarins are well known to show high variability in dose requirements. We investigated the pharmacogenetic variation of 110 genes in patients receiving acenocumarol using a targeted NGS approach. We report relevant population differentiation for variants on CYP2C8, CYP2C19, CYP4F11, CYP4F2, PROS, GGCX, VKORC1, CYP2C18, and NQO1. Among 41 core pharmacogenomics genes related to the PK (29) and PD (3) of coumarins and to coagulation proteins (9), 10 genes showed a higher proportion of novel-to-known variants, including CYP1A1, CYP3A4, CYP3A5, and F8, whereas CYP2E1, VKORC1, and SULT1A1/2 showed a low proportion. Using a Bayesian approach, we identified variants influencing acenocumarol dosing on VKORC1 (2), SULT1A1 (1), and CYP2D8P (1), explaining 40–55% of dose variability. A collection of pharmacogenetic variation on 110 genes related to the PK/PD of coumarins is also presented. Our results offer an initial insight into the use of a targeted NGS approach in the pharmacogenomics of coumarins in Mexican Mestizos.

Keywords: coumarins, acenocumarol, pharmacogenomics, Mexican Mestizo, population differences, targeted sequencing

# INTRODUCTION

Anticoagulants such as warfarin, acenocumarol, and phenprocoumon inhibit the vitamin K reducing enzymes that regenerate vitamin K, a cofactor for several clotting proteins. Acenocumarol is the coumarin of choice in Europe and Latin America (Ufer, 2005). The heterogeneity in coumarin efficacy, safety, and dosing has been partly explained by clinical, demographic, and genetic parameters. Several polymorphisms on CYP2C9, VKORC1, and CYP4F2 can account for about 40–50% of coumarin dose differences (Scott et al., 2014). However, most studies have been performed for warfarin and in populations other than Mexican. Acenocumarol (4-nitrowarfarin) is the most commonly prescribed oral coumarin in the public health care system in Mexico. In contrast to warfarin, the more potent isomer, S-acenocumarol, is rapidly eliminated, and the drug's therapeutic effect is most likely due to R-acenocumarol. The R isomer is metabolized by several members of the cytochrome P-450 family, including CYP2C9, CYP2C19, CYP2C8, CYP2C18, CYP3A4, CYP1A1, and CYP1A2 (Tassies et al., 2002; Ufer, 2005). Hence, variation in the genes coding for these proteins should putatively influence acenocumarol dosing.

The collection of pharmacogenetic variation in Mexican populations is still scarce (Fricke-Galindo et al., 2016). Reports indicate that some of the actionable markers on VKORC1 and UGT1A1 present significant population differences (Bonifaz-Peña et al., 2014), suggesting the existence of variants with distinctive allele frequencies in these populations that potentially influence drug response. Endeavors are currently ongoing to amass a more comprehensive picture of the pharmacogenetic variation in Mexican Mestizos. Here, we investigated genetic variation in over 100 genes by targeted NGS in patients receiving acenocumarol, including genes involved in general pharmacokinetics and pharmacodynamics, vitamin K recycling, and coagulation proteins, the latter also potentially affecting acenocumarol response (Allan et al., 2005; Harrington et al., 2008; Carcao et al., 2015; Tong et al., 2016). Clinical and genetic data were used to develop an algorithm to explain dose variability in this group of patients.

Genomic data analyses provided a collection of pharmacogenetic variation for this population. This approach points toward considering multiple variants when assessing acenocumarol dosing for the individual patient.

# MATERIALS AND METHODS

#### Participants and DNA Extraction

The National Institute of Cardiology in Mexico City prescribes acenocumarol on a regular basis, mostly after stroke, stent implants, or for thrombosis. One hundred and fifty patients treated with acenocumarol between 2006 and 2010 were surveyed and monitored for acenocumarol efficacy through at least three consecutive INR measurements. Of these, 103 blood samples were available for DNA extraction using the DNeasy Blood & Tissue kit (Qiagen, Valencia CA, USA) from a routine blood sample in EDTA-Vacutainer collection tubes. Sample characteristics are shown in **Table 1**. All participants gave written informed consent in accordance with the Declaration of Helsinki. The project was reviewed and approved by The Research and Ethics Committees at The National Institute of Cardiology and The National Institute of Genomic Medicine (INMEGEN) Mexico City, project approval 25/2016/I.

#### Next Generation Sequencing

We investigated coding regions, 25 bp of adjacent introns, and 5′ and 3′ UTR regions of 110 genes related to general pharmacogenomics, including core pharmacokinetic and pharmacodynamic targets, in 100 DNA samples. We selected these genes according to general PGx and the PK/PD of acenocumarol by searching the available literature using the keywords pharmacogenetics, pharmacogenomics, acenocumarol, coumarin pharmacokinetics, and pharmacodynamics (van Leeuwen et al., 2008; Soria et al., 2009; Whirl-Carrillo et al., 2012; Tong et al., 2016). Regions of interest were captured using a Haloplex custom Target Enrichment System (Agilent Technologies, Santa Clara, CA, USA) defined for 150 × 2 paired-end reads, in a panel size of 1.1 Mbp. In addition, we included a set of 360 ancestry informative markers (AIMs) to assess genetic admixture using SNPs from the HapMap database for CEU and YRI populations, and Natives from Mexico (Galanter et al., 2012). Sequencing libraries were generated according to the manufacturer's protocol (version D.5, May 2013). Briefly, all 100 DNA samples (225 ng) were digested with 8-paired restriction enzymes, and the fragmentation pattern was analyzed in a 2100 Bioanalyzer (Agilent Technologies). DNA fragments were hybridized with the Haloplex synthetic probes, and adapters were ligated, followed by PCR amplification for library enrichment. Library quality in terms of fragment size and molarity was also assessed using the 2100 Bioanalyzer. Samples were pooled and sequenced in a Genome Analyzer II (Illumina, San Diego, CA, USA) according to the manufacturer's instructions. Targeted genes are listed in Supplemental Table ST1.

#### Bioinformatic and Statistical Analyses

Sequence reads were processed according to the Broad Institute recommended best practices workflow and the Genome Analysis ToolKit (GATK) (Acland et al., 2013; Van der Auwera et al., 2013). Briefly, paired-end reads were trimmed to remove adapters and low quality regions using Trimmomatic (Bolger et al., 2014); reads with an average Q ≤ 30 were discarded, followed by elimination of reads shorter than 36 nucleotides. Mapping and alignment of sequencing reads were performed with BWA, Samtools, and Picard using the hg19 human genome reference (dbSNP build 137) (McKenna et al., 2010). Base quality score recalibration and single nucleotide variant (SNV) calling were performed using GATK v3.3. Variants were confirmed visually in the Integrative Genomics Viewer, IGV (Robinson et al., 2011), and their functional impact was annotated using SnpEff and ranked as low, moderate, modifier, or high (Sherry et al., 2001; Cingolani et al., 2012; Exome Variant Server<sup>1</sup>).
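The two read-level filters described above (discard reads with an average Q ≤ 30, then reads shorter than 36 nucleotides) can be illustrated with a minimal sketch. This is not Trimmomatic and does not reproduce its trimming steps; the function names and the Phred+33 quality encoding are assumptions made for illustration.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def keep_read(sequence, quality_string, min_mean_q=30, min_length=36):
    """Return True if a read passes both filters described in the text.

    Reads with mean Q <= min_mean_q are discarded, as are reads shorter
    than min_length nucleotides. Thresholds default to the values in the text.
    """
    if len(sequence) < min_length:
        return False
    return mean_phred(quality_string) > min_mean_q
```

In a real workflow these thresholds would be passed to the trimming tool itself; the sketch only makes the acceptance rule explicit.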

**Abbreviations:** AIC, Akaike information criteria; BIC, Bayesian information criterion; DIC, deviance information criterion; FDR, false discovery rate; Fst, fixation index as a measure of genetic variance; GLM, generalized linear model; BMI, body mass index; INR, international normalized ratio based on prothrombin time; LD, linkage disequilibrium; MAF, minor allele frequency; NGS, next generation sequencing; PD, pharmacodynamics; PharmGKB, pharmacogenomics knowledge base; PK, pharmacokinetics; SNV, single nucleotide variant; SNP, single nucleotide polymorphism.

<sup>1</sup>Exome Variant Server. [updated 2014/01/02/10:04:37]. Available online at: http://evs.gs.washington.edu/EVS

The data analysis toolset PLINK was used to determine descriptive statistics, allele frequencies, Hardy-Weinberg equilibrium, and population differentiation. The latter was assessed with the FST statistic, which measures the genetic variance in this subpopulation relative to the population variance in other continental groups. A threshold value of P < 0.05 after FDR was considered statistically significant (Purcell et al., 2007).
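As an illustration of what a fixation index measures, the sketch below computes a textbook Wright-style FST for one biallelic variant from the allele frequencies of two populations, weighting both populations equally. The estimator the authors obtained from their software is not specified here, so this formula and the function name are assumptions, not the study's method.

```python
def wright_fst(p1, p2):
    """Textbook FST for one biallelic variant in two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    FST = (H_T - H_S) / H_T, where H_S is the mean within-population
    expected heterozygosity and H_T the heterozygosity of the pooled population.
    """
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # allele fixed or absent in both populations
    return (h_t - h_s) / h_t


# Example: an allele at 10% in one population and 90% in the other.
fst = wright_fst(0.1, 0.9)
highly_differentiated = fst > 0.25  # threshold used later in the Results
```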

#### Pharmacogenetic Model

We used a Bayesian statistical approach to incorporate genetic variants into an algorithm for acenocumarol dose estimation (Sebastiani et al., 2009; Chen et al., 2012). The rejection of the null hypothesis (lack of association between acenocumarol dose and genetic variants) was based on probabilities from stochastic computations using Markov Chain Monte Carlo (MCMC) methods. We also tested the association between dose and all SNVs, alone or in combination, including those previously identified via single-SNP analysis (P > 0.05, FDR). We considered 4614 SNPs and 815 variant interactions for model development. This strategy allows for the identification of genetic variants that influence dose variation independently or that depend on each other to do so. Variants were considered with a MAF <0.95 and >0.05, and interactions between variants within the same gene with frequencies <0.9 and >0.10, i.e., two or more variants in a gene may have additive or balancing effects on the dose. First, we used a Bayesian generalized linear model for variable selection to obtain the posterior probability of a gene variant affecting acenocumarol dose. Then we used Bayes Factors, a form of Bayesian hypothesis test, to prioritize a set of models. Finally, we evaluated the selected models through the Deviance Information Criterion (DIC), a measure for model selection related to the AIC and BIC criteria and commonly used in Bayesian hierarchical models.
Briefly, for the former, we used a gamma likelihood function with logarithmic link function, variance $\tau$, and mean conforming to Equation (1), where $v_{1,j}$ and $v_{2,j}$ represent binary variables for each genotype of SNP $j$; the sum over $j_g < k_g$ refers to interactions between variants; $G$ is a set of genes $g$, and $n_g$ the number of SNPs in $g$; $v_{r,j_g}$ represents genotype $r$ of SNP $j$ in gene $g$, and $\beta^{r_1,r_2}_{j_g,k_g}$ represents the effect size of genotypes $v_{r_1,j_g}$ and $v_{r_2,k_g}$; $I_j$ and $I_{j,k}$ are binary variables for the inclusion or exclusion of SNPs and SNP interactions; and $x_i$ and $\alpha_i$ represent the $m$ non-genetic covariates, including age, sex, BMI, and height, and their coefficients.

$$\begin{split} \log \left( \mathbb{E} \left[ y \mid \theta \right] \right) &= c + \sum_{i=1}^{m} \alpha_{i} x_{i} + \sum_{j=1}^{n} I_{j} \left( \beta_{1,j} v_{1,j} + \beta_{2,j} v_{2,j} \right) \\ &+ \sum_{g \in G} \sum_{j_{g} < k_{g}}^{n_{g}} I_{j_{g},k_{g}} \left( \beta_{j_{g},k_{g}}^{1,1} v_{1,j_{g}} v_{1,k_{g}} + \beta_{j_{g},k_{g}}^{1,2} v_{1,j_{g}} v_{2,k_{g}} \right. \\ &\left. + \; \beta_{j_{g},k_{g}}^{2,1} v_{2,j_{g}} v_{1,k_{g}} + \beta_{j_{g},k_{g}}^{2,2} v_{2,j_{g}} v_{2,k_{g}} \right) \end{split} \tag{1}$$

where $\tau \sim \mathrm{Gamma}(\lambda, \kappa)$, $c \sim \mathrm{Normal}(0, \tau_{\mu}^{2})$, $\beta_{r,j} \sim \mathrm{Laplace}(0, \tau_{\beta}^{2})$, $\beta^{r_1,r_2}_{j_g,k_g} \sim \mathrm{Laplace}(0, \tau_{\beta}^{2})$, $I_{j} \sim \mathrm{Bernoulli}(\pi)$, $I_{j_g,k_g} \sim \mathrm{Bernoulli}(\pi)$, $\pi \sim U(a, b)$, and $\alpha_{i} \sim \mathrm{Normal}(0, \tau_{\alpha}^{2})$.
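To make the structure of Equation (1) concrete, the following sketch evaluates the log-mean for a single patient given fixed parameter values. It only illustrates how the inclusion indicators switch main effects and within-gene interaction terms on or off; it performs no inference, and the function name, argument layout, and zero-based SNP indexing are assumptions made for this example.

```python
def log_expected_dose(c, alpha, x, include, beta, v,
                      pair_include=None, pair_beta=None):
    """Evaluate the right-hand side of Equation (1) for one patient.

    c            : intercept
    alpha, x     : coefficients and values of the m non-genetic covariates
    include      : I_j, one 0/1 inclusion indicator per SNP
    beta         : per SNP, a pair (beta_1j, beta_2j) of genotype effects
    v            : per SNP, a pair (v_1j, v_2j) of binary genotype indicators
    pair_include : {(j, k): 0/1} indicators for within-gene SNP pairs, j < k
    pair_beta    : {(j, k): {(r1, r2): effect}} interaction effects
    """
    eta = c + sum(a * xi for a, xi in zip(alpha, x))
    for i_j, (b1, b2), (v1, v2) in zip(include, beta, v):
        eta += i_j * (b1 * v1 + b2 * v2)
    for (j, k), i_jk in (pair_include or {}).items():
        if not i_jk:
            continue  # interaction excluded from the model
        for (r1, r2), effect in pair_beta[(j, k)].items():
            eta += effect * v[j][r1 - 1] * v[k][r2 - 1]
    return eta
```

Because the link is logarithmic, exponentiating the returned value gives the expected dose under the chosen parameter values.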

Next, we standardized clinical variables to zero mean and unit variance, and using JAGS 4.1.0 and R 3.2.0 we obtained MCMC samples from the posterior distribution. We ran five chains of 110,000 iterations each, including a burn-in period of 10,000 iterations and random initial values. Convergence was verified via the Gelman-Rubin statistic, $\hat{R} < 1.2$, followed by a series of Bayes Factors to condition on the presence or absence of variants, branching them into a decision tree as part of the pharmacogenetic model development. Further details on the model development are included in Supplemental Table ST2.
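The convergence check mentioned above can be sketched for a single parameter as follows. This is the basic (non-split) form of the Gelman-Rubin potential scale reduction factor; the exact variant computed by the authors' software is not stated, so treat the function and its `burn_in` argument as illustrative assumptions.

```python
from statistics import mean, variance


def gelman_rubin(chains, burn_in=0):
    """Basic Gelman-Rubin R-hat for one parameter.

    chains  : list of equal-length sequences of MCMC samples
    burn_in : number of initial samples to drop from every chain
    Assumes the samples within each chain are not all identical.
    """
    kept = [list(chain[burn_in:]) for chain in chains]
    n = len(kept[0])
    if len(kept) < 2 or n < 2 or any(len(chain) != n for chain in kept):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean([variance(chain) for chain in kept])       # W
    between = n * variance([mean(chain) for chain in kept])  # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5
```

Values close to 1 indicate that the chains agree; the text uses 1.2 as the acceptance threshold.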

# RESULTS

Demographic characteristics stratified by sex are presented in **Table 1**. Bioinformatic analyses revealed 5108 variants in 110 genes in 100 DNA samples, with an average depth of 250x and >98% coverage, although both varied widely depending on the gene (30x–600x, 80–100%). These 110 genes represent less than 1% of the coding genome, approximately 25% of a pharmacogenome, more than half of the Coriell reference list for pharmacogenomics, and include >20% of actionable pharmacogenetic markers listed by CPIC (Pratt et al., 2010; Relling and Klein, 2011). After quality control, variant calling, and annotation, 4290 SNVs were used for statistical analyses (ST3). There was complete agreement between genotypes of variants assessed by NGS and by allele discrimination performed for CYP2C9\*2, \*3, and \*5, CYP4F2 rs2108622, and VKORC1 rs9934438. Admixture analysis with 314 ancestry informative markers showed an average population structure of 50–92% Mexican Native and 6–54% Caucasian (CEU); all individuals showed less than 5% Sub-Saharan African (YRI) admixture.

Of these 4290 SNVs, 28% have not been reported before (1237 without an rs identifier), and 274 of these novel variants had a minor allele frequency (MAF) >1%. On average, each individual showed 908 SNVs, 534 heterozygous and 374 homozygous, of which 258 were present per individual (**Table 2**). Four hundred and seven variants in 65 genes deviated from Hardy-Weinberg equilibrium (9.8%, Supplemental Table ST4).
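For readers unfamiliar with the Hardy-Weinberg check, the sketch below computes the textbook chi-square goodness-of-fit statistic from genotype counts of a biallelic variant. The specific test applied by the authors' software is not stated here (exact tests are also common), so the choice of the chi-square form and the function name are assumptions.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from
    Hardy-Weinberg proportions, given counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# A statistic above 3.84 corresponds to P < 0.05 before any multiple-testing
# correction such as the FDR adjustment used in this study.
deviates = hwe_chi_square(50, 0, 50) > 3.84
```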

Variants were classified by SnpEff according to their in-silico functional impact as high, moderate, modifier, or low (Cingolani et al., 2012). We listed a total of 36 known SNVs (27 heterozygous and 9 homozygous) with a high functional impact (**Table 2** and ST3). These resequencing descriptive statistics appear comparable to other reports (Waldron, 2016).

#### Pharmacogenetic Variation

The FST statistic was used to evaluate genetic differentiation between Mexican Mestizos and three major continental populations, Han Chinese from Beijing (CHB), Yoruba from Ibadan, Nigeria (YRI), and Europeans from Utah, USA (CEU), using the 1000 Genomes database. YRI showed the largest differentiation, with 377 variants with an FST value above 0.25, followed by CEU (51 variants with FST > 0.25) and CHB (32 variants with FST > 0.25, ST5). FST values >0.25 were identified for variants on several genes related to the coagulation cascade or coumarin metabolism. For example, when comparing to Caucasians we found high population differentiation for variants on CYP2C8, CYP2C19, CYP4F11, CYP4F2, PROS, and GGCX. Comparing to CHB, differences arose on CYP2C8, VKORC1, CYP2C18, and NQO1, and for YRI differences were observed on major CYPs, FMOs, F13B, F8, PROS, and SERPINA10, among others (ST5). Allele frequency comparisons between Mexicans from the 1000 Genomes Project (MXL) and those in this study showed similar FST values for most variants, except for CYP2C18 rs2281889, CYP2C8 rs1891071, CYP4F2 rs309319, and CYP4F11 rs11086013, for which we observed FST values between 0.15 and 0.33.

TABLE 2 | Summary of NGS genetic variation.

HGVS, Human Genome Variation Society nomenclature; C, coding region; P, protein. The 1000 Genomes Project Consortium et al. (2015). POS indicates the position on the chromosome for novel variants.

We analyzed allele frequency variation for 30 major pharmacogenes and 10 genes related to the coagulation cascade. The largest number of variants per gene was observed on SULT1A1, CYP2E1, CYP1B1, CYP3A4, CYP3A5, F5, and F11. Interestingly, CYP1A2, CYP3A4, CYP3A5, CYP1A6, F11, F13B, and F8 showed a large proportion of novel variants compared to known variants. Genes with significantly fewer variants were CYP1A3, CYP1A5, CYP1A9, and F9; the first three did not show any novel variants (ST3). Next, we assessed the presence of known and novel variants in these genes, considering a MAF >5% and a predicted functional impact of high or moderate. For known variants, we list 7 with a high predicted functional impact, 2 on UGTs and one on SULT1A1, CYP2C9, CYP2C19, and CYP2C8 (**Table 3**).

Novel variants were observed on 37 of these 40 genes, in counts ranging from 1 on UGT1A1 to 23 on CYP3A5, 25 on F5, and 26 on F8. Of these, 2 showed a high functional impact, predicting a stop codon on CYP3A4 and CYP2B6, and a moderate impact was reported for 16 variants on 14 genes (**Table 3**). The proportion of novel-to-known variants and their functional impact for these pharmacogenes are represented in **Figure 1**.

#### Acenocumarol Pharmacogenetic Model

We developed a pharmacogenomic model to predict acenocumarol dose, using a Bayesian approach that included all SNP variants and the interactions among those on the same gene. We fitted the Bayesian GLM through five MCMC chains, and genetic variants were prioritized by their posterior inclusion probability. The higher the posterior probability of a variant, the larger its influence on dose. **Figure 2** shows a hierarchical tree indicating an ordered relevance of variants from VKORC1, CYP2D8P, and SULT1A1, followed by those on CYP4F12, F13B, and F8. Values of posterior probability for all variants are listed in ST7 and ST8.

The dosing algorithm accounted for age, sex, weight, and height. The interaction between VKORC1 variants rs8050894 and rs9934438, which are in LD (R = 0.492), showed the highest posterior inclusion probability (mean, 0.96), followed by a novel variant on CYP2D8P (POS42547668), and variants on SULT1A1 (rs11648192) and CYP2C8 (rs1058932 and rs2275620). Unfortunately, the pharmacogenetic variant VKORC1 rs9923231 did not pass NGS quality controls and thus was not modeled, but it is in complete LD with rs9934438, which was included in these analyses (Rieder et al., 2005).

For final model evaluation, we used R to implement the series of Bayes Factors as described in Supplemental Table ST2 with a cut-off value, $c_j = 3 + m_j$, where $m_j$ is the number of variants conditioned to be absent from the model, and 3 is a minimum cut-off value based on Harold Jeffreys' scale of interpretation for Bayes Factors (Baldi and Long, 2001). We selected two models according to the lowest DIC values. The first one included the SNV interaction between VKORC1 rs8050894 and rs9934438 and variants on SULT1A1 and CYP2D8P (ST7). The second model excluded variant interactions. Modeled variants and clinical parameters (age, sex, weight, and height) explain up to 55.9% of dose variation for this study group. High density intervals (HDI) are presented for a 95% posterior probability (ST7 and ST8). Adding further variants to the model increased DIC values significantly, which translates into a decreasing impact on acenocumarol dose (**Table 4**).

TABLE 3 | Novel and known variants on relevant pharmacogenes at MAF > 5%.

POS, position on the chromosome for novel variants. p.Met1?, change in the translation of the initiation codon with unknown effect. \*Insertion of a termination codon.

### Pharmacogenetic Model Considering Variant Interactions

Ln Dose = −0.6935 − 0.0071 × age (y) + 0.0035 × weight (kg) − 0.1136 (if male) + 1.0709 × height (m) − 0.213 (if VKORC1 rs8050894 is C/G and rs9934438 is G/A) − 0.719 (if VKORC1 rs8050894 is G/G and rs9934438 is A/A) + 0.899 (if CYP2D8P POS.42547668 is T/C) + 0.203 (if SULT1A1 rs11648192 is C/T). Variance explained: 55.9%.

### Pharmacogenetic Model without Variant Interactions

Ln Dose = −0.5846 − 0.0069 × age (y) + 0.0045 × weight (kg) − 0.0945 (if male) + 0.9795 × height (m) − 0.239 (if VKORC1 rs9934438 is G/A) − 0.529 (if VKORC1 rs9934438 is A/A) + 1.092 (if CYP2D8P POS.42547668 is T/C) + 0.188 (if SULT1A1 rs11648192 is C/T). Variance explained: 40.0%.
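The two published equations can be transcribed directly into code, which makes it easier to check individual estimates. The sketch below is a literal transcription of the coefficients above and nothing more: the function names, argument names, and genotype string format are assumptions, the dose unit is whatever the study used (the formulas do not state it), and the code is not a validated clinical tool.

```python
import math


def ln_dose_with_interactions(age_y, weight_kg, height_m, male,
                              rs8050894, rs9934438,
                              cyp2d8p_tc=False, sult1a1_ct=False):
    """Model considering the VKORC1 rs8050894 x rs9934438 interaction."""
    ln_dose = -0.6935 - 0.0071 * age_y + 0.0035 * weight_kg + 1.0709 * height_m
    if male:
        ln_dose -= 0.1136
    if rs8050894 == "C/G" and rs9934438 == "G/A":
        ln_dose -= 0.213
    elif rs8050894 == "G/G" and rs9934438 == "A/A":
        ln_dose -= 0.719
    if cyp2d8p_tc:   # CYP2D8P POS.42547668 is T/C
        ln_dose += 0.899
    if sult1a1_ct:   # SULT1A1 rs11648192 is C/T
        ln_dose += 0.203
    return ln_dose


def ln_dose_without_interactions(age_y, weight_kg, height_m, male,
                                 rs9934438,
                                 cyp2d8p_tc=False, sult1a1_ct=False):
    """Model without variant interactions."""
    ln_dose = -0.5846 - 0.0069 * age_y + 0.0045 * weight_kg + 0.9795 * height_m
    if male:
        ln_dose -= 0.0945
    if rs9934438 == "G/A":
        ln_dose -= 0.239
    elif rs9934438 == "A/A":
        ln_dose -= 0.529
    if cyp2d8p_tc:
        ln_dose += 1.092
    if sult1a1_ct:
        ln_dose += 0.188
    return ln_dose


# Hypothetical patient: 60-year-old woman, 70 kg, 1.60 m, VKORC1 G/G and A/A.
estimated_dose = math.exp(
    ln_dose_with_interactions(60, 70, 1.60, False, "G/G", "A/A"))
```

Because the outcome is modeled on the log scale, each genotype term acts multiplicatively on the dose: a coefficient of −0.719 scales the estimate by exp(−0.719), roughly a halving.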

Finally, we used these models to recalculate acenocumarol dose in patients with an INR of 2–3 receiving a stable dose. Pharmacogenetic dose estimates approximated the prescribed acenocumarol doses for all INR-stable patients (P > 0.05) except for one patient, R4, who needed 5.4 mg/day whereas the models estimated 2.47 mg/day. All individual dose estimations are listed in ST9.

# DISCUSSION

Here, we investigated pharmacogenetic variation in 110 genes by targeted NGS in patients treated with acenocumarol.

#### Novel Variants

Genetic variation analyses showed that the presence of novel variants varied widely among genes. For example, the largest number of novel variants (≥20) was observed on CYP3A4, CYP3A5, SULT1A1, GGCX, F11, F5, F8, and F9. High functional impact variants were present on CYP2B6 (MAF 50%), CYP2C8 (MAF 1%), CYP3A4 (MAF 8%), F5 (MAF 2%), and VKORC1 (MAF 1%). These are relevant for their allele frequency, the dozens of drugs they metabolize, and because their impact predicts a stop codon. CYP2B6 and CYP3A4 are among the most polymorphic genes; thus, the presence of relevant novel variants is not surprising. Similarly, population differentiation has been previously reported for VKORC1, and the presence of a novel variant with a high functional impact may be, in part, a consequence of this stratification (Bonifaz-Peña et al., 2014). Novel variants on major metabolizing genes, CBR1, CYP1A2, CYP3A4, CYP3A5, P2RY2, and UGT1A6, represented 40–50% of novel and known variants, suggesting that the collection of variation on these genes is probably not yet complete in Mestizos. Other metabolizing genes, CYP1B1, CYP2C18, CYP2C19, UGT1A members, CYP2E1, SULT2B1, and SULT1A1, showed a low proportion of novel-to-known variants (**Figure 1**). This does not necessarily mean that the catalog of genetic variation is complete for these genes. For example, SULT1A1 presented 63 known and 19 novel variants, ranking this gene second by number of variants.

We also confirmed previously reported population differences, with an Fst > 0.19, on VKORC1 rs9934438 and four variants on UGT2B15 when comparing to YRI and CHB (Bonifaz-Peña et al., 2014). Differences between Mestizos and CEU were observed for UGT2B15, CYP2E1, CYP1A2, CYP4F2, UGT2B7, F12, and F12 (ST5). Interestingly, a few variants showed an Fst > 0.20 between Mestizos from this study and Mexicans from Los Angeles (MXL) from the 1000 Genomes Project, on CYP2C8, CYP2C18, and CYP4F2, which are relevant for the pharmacokinetics of coumarins, phenytoin, vitamin K, and lipids. Allele frequencies of all variants are listed in ST6. Observations on these relevant pharmacogenes highlight the need for a cautious implementation of pharmacogenomics in Mexican Mestizos.

We developed a pharmacogenetic model to estimate acenocumarol dose, testing over four thousand variants. The model considered relevant variants on SULT1A1, CYP2D8P, and VKORC1. For the latter gene, variants rs8050894 and rs9934438 are well-known pharmacogenetic markers of coumarin dosing with the highest PharmGKB level of evidence. The interaction of these SNPs has already been reported as part of a haplotype (CG vs. TA) that helps classify patients into high and low dose requirements (Rieder et al., 2005).

Interestingly, the model did not associate variants on CYP4F2 or CYP2C9 with acenocoumarol dose (**Table 4**). This may be because CYP4F2 (rs2108622) has a lower impact (Danese et al., 2012), and because R-acenocoumarol is metabolized by several CYPs other than CYP2C9. Moreover, reported variants that impair CYP2C9 activity are present at low frequency in Mexican Mestizos (Villegas-Torres et al., 2015). Instead, we observed dose association with variants on SULT1A1 and CYP2D8P. CYP2D8P is a pseudogene in the CYP2D cluster comprising CYP2D6,

moderate impact, and black for high impact.

FIGURE 2 | Hierarchical tree of variants influencing acenocoumarol dose. The parent nodes (solid lines) represent gene variants on which other variants depend for acenocoumarol dose estimation, i.e., all probability statements in that branch are conditioned on the probabilistic event of the parent node (gene) Iparent = 1; dashed lines indicate that the parent node was conditioned to zero, Iparent = 0. The number on each line represents the Bayes Factor for the following branched-off node, given the cut-off values. For visualization purposes, we did not condition on CYP2D8P, which was strongly associated with high-dose values (≥40 mg/wk) in 3 of the 6 patients receiving high doses. Nevertheless, both frequentist and Bayesian hypothesis testing suggested a strong association to coumarin dose for this variant (P-value < 0.05 and

around 25% of all prescribed drugs. This cluster seems to have rapidly evolved due to environmental adversity with ethnic differences (Heim and Meyer, 1992). Wang et al. identified a CYP2D6 transcription enhancer in the CYP2D cluster, supporting the consideration of variants outside the CYP2D6 locus for functional genotyping (Wang et al., 2015). We identified a new variant on CYP2D8P, POS.42547668, 26 kbp upstream of CYP2D6, and although no xenobiotic metabolism has been reported for this pseudogene, we can speculate that this variant is in LD with another one affecting gene expression or drug metabolism. The inclusion of many variants to dissect a pharmacogenetic phenotype is becoming more common, as it increases our knowledge of pathways and network interactions not previously considered (Cruz-Correa et al., 2017; Oliveira-Paula et al., 2017).

Our model is similar to others in that it includes typical clinical variables (age, sex, weight, and height), predicts dose with around 50% confidence, and includes VKORC1 as the primary determinant of acenocoumarol dose. Although we report VKORC1 rs9934438 rather than the more commonly studied VKORC1 rs9923231 (Zhang et al., 2015; Tong et al., 2016), these two variants are in complete LD. Finally, dose assessment using this model closely approached the dose received to achieve an INR of 2–3, except for patient R4. Therefore, we delved into the genetic variability of this sample, observing 20 heterozygous and 9 homozygous variants, the latter on SULT1A1, FGB, CDH12, KCNJ6, CBR3, and CYP2E1. However, this variation does not necessarily explain a lack of dose prediction. We can speculate that it is the presence


TABLE 4 | Pharmacogenomic model parameters.


<sup>a</sup>The Bayes Factor corresponds to a hypothesis test with H<sub>0</sub>: β = 0 and H<sub>1</sub>: β = µ, where β is the coefficient and µ is the posterior mean. HDI is the high-density interval for µ.

of multiple variants on certain genes that affects several steps of the pharmacodynamics or pharmacokinetics and thus, drug efficacy.

We acknowledge the limited size of the closed patient group, which was studied retrospectively in individuals who had already been assigned a dose by trial and error, precluding a prospective use of the genetic information. These observations will require confirmation and replication. We provide a list of 20 variants in 18 genes ordered by their impact on acenocoumarol dose around the PK/PD of coumarins and the biochemistry of the coagulation cascade.

Our results offer an initial insight into the use of a genomic approach in pharmacogenetics, showing that next-generation sequencing may offer an alternative to identify and utilize individual variation to potentially explain a pharmacologically relevant phenotype (Cheng et al., 2015). Future endeavors should focus on confirming these observations in a larger population.

As of June 2017, fewer than a dozen studies of NGS in Mexican populations have been reported (NCBI). Here, we present a list of variants in 110 pharmacogenes in Mexican Mestizos, providing population information on allele frequency, differentiation from other continental groups, and phenotype associations, which may complement the current catalog of pharmacogenomic variation in different populations.

# AUTHOR CONTRIBUTIONS

VG-C: Performed genomic experiments, interpreted results, and wrote the manuscript; JU-C: Analyzed the data and developed the Bayesian model; BV-T: Coordinated demographic and clinical data, performed genotyping experiments; JC-A: Conceived, planned, and performed the clinical study; ST-C: Conceived, planned, and performed the clinical study; RI-A: Conceived, planned, and performed the clinical study; OF-L: Conceived, planned, and performed the clinical study; XS: Conceived and coordinated genomic analyses.

#### ACKNOWLEDGMENTS

This work was supported by Conacyt CB-2015-01 No.252952 to the project "Diversidad Farmacogenetica en Mexicanos, coleccion e interpretacion," and by INMEGEN to the Pharmacogenomics Laboratory, project No.22/2014/I. The technical expertise of Roberto Galindo from Winter Genomics during bioinformatic analyses is gratefully acknowledged. M.Sc. Alfredo Mendoza Vargas from the NGS unit kindly provided support and assistance during DNA sequencing. We also thank the staff at the National Institute of Cardiology, Marisol Serna Galarza, Sandra J. Rodriguez Duarte, MD. Jose Luis Lopez, and at INMEGEN Itsel A. Alva-Velazquez, and Anayelli Munoz-Rivas for their clinical and technical assistance during patient treatment, and sample management.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2017.00863/full#supplementary-material

#### REFERENCES

genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genet. 8:e1002554. doi: 10.1371/journal.pgen.1002554


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Gonzalez-Covarrubias, Urena-Carrion, Villegas-Torres, Cossío-Aranda, Trevethan-Cravioto, Izaguirre-Avila, Fiscal-López and Soberon. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiplexed Nanopore Sequencing of HLA-B Locus in Māori and Pacific Island Samples

Kim N. T. Ton<sup>1</sup> , Simone L. Cree<sup>1</sup> , Sabine J. Gronert-Sum<sup>2</sup> , Tony R. Merriman<sup>3</sup> , Lisa K. Stamp<sup>4</sup> and Martin A. Kennedy <sup>1</sup> \*

<sup>1</sup> Department of Pathology and Biomedical Science, University of Otago, Christchurch, New Zealand, <sup>2</sup> JSI Medical Systems GmbH, Ettenheim, Germany, <sup>3</sup> Biochemistry Department, University of Otago, Dunedin, New Zealand, <sup>4</sup> Department of Medicine, University of Otago, Christchurch, New Zealand

The human leukocyte antigen (HLA) system encodes the human major histocompatibility complex (MHC). HLA-B is the most polymorphic gene in the MHC class I region and many HLA-B alleles have been associated with adverse drug reactions (ADRs) and disease susceptibility. The frequency of such HLA-B alleles varies by ethnicity, and therefore it is important to understand the prevalence of such alleles in different population groups. Research into HLA involvement in ADRs would be facilitated by improved methods for genotyping key HLA-B alleles. Here, we describe an approach to HLA-B typing using next generation sequencing (NGS) on the MinION™ nanopore sequencer, combined with data analysis with the SeqNext-HLA software package. The nanopore sequencer offers the advantages of long-read capability and single molecule reads, which can facilitate effective haplotyping. We developed this method using reference samples as well as individuals of New Zealand Māori or Pacific Island descent, because HLA-B diversity in these populations is not well understood. We demonstrate here that nanopore sequencing of barcoded, pooled, 943 bp polymerase chain reaction (PCR) amplicons of 49 DNA samples generated ample read depth for all samples. HLA-B alleles were assigned to all samples at high resolution with very little ambiguity. Our method is a scalable and efficient approach for genotyping HLA-B and potentially any other HLA locus. Finally, we report our findings on HLA-B genotypes of this cohort, which add to our understanding of HLA-B allele frequencies among Māori and Pacific Island people.

#### Keywords: HLA-B, nanopore sequencing, Māori, Pacific Island, pharmacogenetics, Polynesian

# INTRODUCTION

The human leukocyte antigen (HLA) locus contains a large family of genes encoding the human major histocompatibility complex (MHC) proteins. It is located on chromosome 6p21 and divided into three classes: class I, class II, and class III. HLA molecules are extremely variable due to their peptide-binding function and are associated with autoimmune diseases and adverse drug reactions (ADRs) (Tiwari and Terasaki, 1985; Bharadwaj et al., 2010). HLA-B is the most polymorphic gene, with over 4,600 known alleles encoding 3,408 unique proteins (IMGT/HLA Database release 3.27 in January 2017) (Robinson et al., 2014). Previous studies have identified particular HLA-B alleles as risk factors for drug-induced hypersensitivity reactions (Alfirevic and Pirmohamed, 2010; Sukasem et al., 2014). For example, screening for the HLA-B<sup>∗</sup> 57:01

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Federica Sangiuolo, Università degli Studi di Roma Tor Vergata, Italy Louise Warnich, Stellenbosch University, South Africa

> \*Correspondence: Martin A. Kennedy martin.kennedy@otago.ac.nz

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Genetics

Received: 21 December 2017 Accepted: 12 April 2018 Published: 30 April 2018

#### Citation:

Ton KNT, Cree SL, Gronert-Sum SJ, Merriman TR, Stamp LK and Kennedy MA (2018) Multiplexed Nanopore Sequencing of HLA-B Locus in Māori and Pacific Island Samples. Front. Genet. 9:152. doi: 10.3389/fgene.2018.00152


A preprint of this manuscript is in bioRxiv with doi: https://doi.org/10.1101/169078

allele is recommended prior to abacavir treatment to decrease the risk of a hypersensitivity reaction (Martin et al., 2012). Strong associations between HLA-B<sup>∗</sup> 58:01 and allopurinol-induced severe cutaneous adverse reactions such as Stevens–Johnson syndrome (SJS) or toxic epidermal necrolysis (TEN) have been reported across different populations (Somkrua et al., 2011). However, some HLA alleles associated with drug-induced hypersensitivity can be ethnic group-specific. For example, the HLA-B<sup>∗</sup> 15:02 allele is a risk factor for carbamazepine-induced SJS/TEN found in several Asian populations but not in Caucasian and Japanese populations (Tassaneeyakul et al., 2010; Phillips et al., 2018). Given the correlation of ethnic-specific risk alleles with ADRs, a good knowledge of HLA allele frequencies, and the prevalence of susceptibility alleles in particular, is important for the study of pharmacogenetics.

The high level of polymorphism in the MHC family means HLA genotyping is complex. HLA alleles are mostly determined by the sequences of exons 2 and 3 in HLA class I genes and exon 2 in HLA class II genes (Shiina et al., 2012). Present DNA-based methods for HLA typing are polymerase chain reaction (PCR)-sequence-specific priming (PCR-SSP), PCR-sequence-specific oligo hybridization (PCR-SSO), PCR-restriction fragment length polymorphism (PCR-RFLP), and sequence-based typing (SBT) (Tait et al., 2009; Bontadini, 2012; Erlich, 2012). SBT is currently considered the gold standard method applied in high-resolution HLA typing (Erlich, 2012). However, this approach may generate ambiguous HLA typing due to haplotype phase issues and incomplete sequencing. Other approaches have various limitations in resolution, workflow complexity, probe design and testing requirements, as new HLA alleles are submitted to the IMGT/HLA sequence database (Erlich, 2012; Shiina et al., 2012).

Recently, next-generation sequencing (NGS) methods have become widely established for HLA typing (Abbott et al., 2006; Bentley et al., 2009; Erlich et al., 2011; Erlich, 2012; Shiina et al., 2012; Hosomichi et al., 2013, 2015; Schöfl et al., 2017). Such approaches reduce the risk of phase ambiguity and allow high-throughput, high-resolution HLA typing. Various approaches to HLA typing using NGS have been developed, including PCR-based HLA sequencing (Erlich et al., 2011; Boegel et al., 2012; Liu et al., 2012; Shiina et al., 2012; Hosomichi et al., 2013; Schöfl et al., 2017), whole exome sequencing (WES) or whole genome sequencing (WGS) data-derived typing (Liu et al., 2012; Major et al., 2013). However, these methods are of limited value for research studies of ADRs, where a more targeted screening for specific HLA alleles may be all that is required.

Here we describe the development of high-throughput HLA typing from next-generation DNA sequencing data, focusing on the HLA-B locus, and its application to identifying HLA-B alleles within the Māori and Pacific Island population of New Zealand. Our strategy took advantage of a recent iteration of the novel NGS platform, the MinION™ nanopore sequencer (Oxford Nanopore Technologies), and barcode sequences for labeling and simultaneously analyzing HLA-B amplicons from multiple samples. The MinION is a tiny, portable nanopore sequencer powered by a USB 3.0 port (Quick et al., 2014). It allows analysis of sequencing data in real time and generation of very long reads (Urban et al., 2015). A small pilot study used the device to examine HLA-A and HLA-B alleles from a single sample, using the earlier, quite error-prone R7.0 flow-cell chemistry (Ammar et al., 2015). Oxford Nanopore Technologies released a new chemistry (R9.4) in May 2016, which has proven to be more accurate and to offer higher throughput. This major update motivated us to examine the performance of this pocket-sized device on one of the most polymorphic genes in the human body, HLA-B.

New Zealand is a multi-ethnic country with people from many different nations. The Māori are the indigenous Polynesian people, who first settled in New Zealand. New Zealand is also home to many people from the Pacific Islands, with its main city of Auckland referred to as the "Polynesian capital of the world" (Anae, 2005). Polynesian people in New Zealand include people from Samoa, Cook Islands, Tonga, Fiji, Niue and Tokelau, which together account for 7.6% of the New Zealand population (Geck, 2017). To date, there is a paucity of studies providing prevalence data of HLA-B alleles in the Māori and Pacific Island population in New Zealand (Abbott et al., 2006; Edinur et al., 2013). Given that HLA-B alleles are so relevant to disease predisposition and ADRs, it is important to establish a prevalence dataset for HLA-B for these population groups. Therefore, we sought to develop an assay for HLA-B screening in this population, using the MinION.

# MATERIALS AND METHODS

# Participants

Forty unrelated Māori and Pacific Island individuals with no history of inflammatory disorders were recruited from the Otago and Auckland regions of New Zealand. The proportion of ancestry for each participant was estimated by recording the ethnicity of all four grandparents. This study was carried out in accordance with the recommendations of the Standard Operating Procedures for Health and Disability Ethics Committees (New Zealand), as reviewed by the Lower South Ethics Committee (New Zealand), with written informed consent from all subjects. Additionally, five individuals were included from a local study on ADRs called Understanding ADRs or responses Using Genome Sequencing (UDRUGS) (Maggo et al., 2017), which was approved by the Southern Health and Disability Ethics Committee (New Zealand), with written informed consent from all subjects. A further four samples of known HLA-B genotype were obtained from the Coriell Institute for Medical Research (Camden, NJ, USA). These nine individuals were used as a reference set for the MinION analysis, after confirmation by either Sanger sequence-based typing (SBT) or data retrieved from the 1000 Genomes project, or both.

# HLA-B Genotyping by Sanger Sequencing

We selected a subset of our participants (four Polynesian, five UDRUGS and two Coriell samples) to analyze by Sanger sequencing, as additional references. Nested PCR was used to amplify a 1,710 bp region spanning exon 2 and exon 3 of HLA-B. PCR products were diluted and used as templates in second round PCR to amplify a 943 bp amplicon. These amplicons were then directly sequenced in both forward and reverse directions using a set of six sequencing primers (**Table 1**). The primers used for amplification included some nucleotide redundancies at sites of HLA-B variation, to prevent allelic drop-out during the PCR step. All PCR primers and sequencing primers were derived from published work (Abbott et al., 2006; Cotton et al., 2012). The HLA-B genotypes of these 11 samples were generated from the Sanger sequence data using SBTengine v.3.10.0.2610 (GenDX, Utrecht, Holland).

#### MinION Library Construction

The primers used to amplify a 943 bp fragment spanning HLA-B exons 2 and 3 each included a specific sequence (**Table 1**) at the 5′ end, which is compatible with barcode sequences (Oxford Nanopore Technologies). The standard Kapa LongRange HotStart PCR protocol (Kapa Biosystems) was applied, consisting of 1X Kapa LongRange Buffer (without Mg<sup>2+</sup>), 2.0 mM MgCl<sub>2</sub>, 0.3 mM dNTPs (2.5 mM each dNTP), 0.5 µM of each primer, 1.25 U/50 µl Kapa LongRange HotStart DNA Polymerase, 50 ng genomic DNA, and water up to 50 µl. Thermal cycling conditions were 94°C for 3 min, 25 cycles at 94°C for 15 s, 68°C for 15 s, and 72°C for 1 min, with a final extension at 72°C for 1 min. The PCR products were visualized by electrophoresis on a 2% agarose gel stained with SYBR™ Safe DNA Gel Stain (Invitrogen), and then purified using 1x Agencourt AMPure XP beads (Beckman Coulter).

PCR products were quantified with a Qubit® 2.0 Fluorometer (ThermoFisher Scientific) and were diluted to 2 nM in water. A second PCR was performed to incorporate barcode sequences using the Oxford Nanopore PCR Barcoding kit (EXP-PBC096). Each 100 µl reaction contained 1X Kapa LongRange Buffer (without Mg<sup>2+</sup>), 2.0 mM MgCl<sub>2</sub>, 0.3 mM dNTPs (10 mM each dNTP), 0.2 µM PCR Barcode primers (from BC01 to BC49), 2.0 U Kapa LongRange HotStart DNA Polymerase, and 0.5 nM of first-round PCR product. The cycling parameters were an initial denaturation at 95°C for 3 min, followed by 15 cycles at 95°C for 15 s, 62°C for 15 s, and 65°C for 1 min, with a final extension at 65°C for 1 min. All 49 barcoded products were cleaned up with 1x Agencourt AMPure XP beads, then quantified. Purified PCR products were normalized by concentration before being pooled for library preparation.
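The dilution of amplicons to 2 nM relies on converting a fluorometric mass concentration into a molar concentration. A minimal sketch of that conversion follows, assuming the commonly used approximation of 660 g/mol per base pair of double-stranded DNA; the function names and the example concentration are hypothetical and not taken from the study.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (common approximation, an assumption here)


def ng_per_ul_to_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/µl) to a molar concentration (nM)."""
    if length_bp <= 0:
        raise ValueError("amplicon length must be positive")
    # ng/µl equals 1e-3 g/l; dividing by g/mol gives mol/l; multiply by 1e9 for nM
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def dilution_factor(conc_ng_per_ul, length_bp, target_nM=2.0):
    """Fold dilution needed to bring a sample down to target_nM."""
    return ng_per_ul_to_nM(conc_ng_per_ul, length_bp) / target_nM


# Hypothetical Qubit reading of 20 ng/µl for the 943 bp first-round amplicon
print(round(ng_per_ul_to_nM(20.0, 943), 2), "nM")
print(round(dilution_factor(20.0, 943), 1), "fold dilution to reach 2 nM")
```

A dilution factor below 1 means the sample is already under the target concentration, which corresponds to the low-yield samples discussed in the Results.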

The pooled library was prepared using the Oxford Nanopore Sequencing protocol (SQK-NSK007). We used 5 µg of library as an input, instead of the recommended 1 µg, to improve yield for downstream steps. Briefly, 5 µg of purified amplicon library was prepared with the NEBNext end repair module (New England Biolabs), then dA-tailed using the NEBNext dA-tailing module (New England Biolabs). The end-prepared, dA-tailed library was subsequently ligated with leader and hairpin adapters, followed by purification using Dynabeads® MyOne™ Streptavidin C1 beads (Invitrogen).

The final prepared library from 49 participants was loaded into the MinION R9.4 flowcell (Oxford Nanopore Technologies). The flowcell was run for 48 h using the MinKNOW software (0.51.1.39).

# Data Analysis

Raw sequence data were uploaded for base-calling using Metrichor software (2D Basecalling for SQKMAP007 - v1.107). Sequences in FASTA format were extracted from the raw FAST5 files using poretools v.0.6.0 (Loman and Quinlan, 2014). Summary statistics for the MinION sequencing data were generated and visualized using the tools described by De Coster et al. (2017). To determine error rates, base-calls in FASTQ format were extracted using poretools v.0.6.0 (Loman and Quinlan, 2014) and then aligned against the Sanger-sequenced references using BWA-MEM (version 0.7.12-41044) with the parameter "-x ont2d". Additional statistical analyses were performed with Python and R scripts available at https://github.com/camilla-ip/marcp2. We only used two-dimensional (2D) reads, which are consensus calls of the combined template and complement strands, to perform HLA-B locus high-resolution typing with SeqPilot


Underlined letters are IUPAC codes indicating base redundancies at positions corresponding to known HLA-B variants.

v.4.3.1 using default settings (JSI medical systems). HLA-B ambiguities were designated using G group nomenclature (http://hla.alleles.org/alleles/g\_groups.html). Samples that could not be assigned genotypes due to mismatches were re-analyzed for error correction. The Nanopolish pipeline was applied to the input reads of these samples to check for improvement in base-level accuracy (Loman et al., 2015). After polishing, consensus sequences were re-processed to assign HLA-B genotypes to these samples.
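Read-level quality filtering of FASTQ base-calls can be illustrated with a short sketch. It assumes Phred+33 quality encoding and computes a read's mean quality by averaging per-base error probabilities rather than raw Q-values; whether the tools used in this study compute read quality in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_read_quality(quality_line):
    """Mean read quality, obtained by averaging error probabilities, then converting back to Q."""
    scores = phred_scores(quality_line)
    if not scores:
        raise ValueError("empty quality string")
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


def passes_threshold(quality_line, min_q=15.0):
    """True if the read's mean quality reaches the chosen threshold."""
    return mean_read_quality(quality_line) >= min_q


# Toy quality strings: '5' encodes Q20 and '+' encodes Q10 under Phred+33
for qual in ("5555", "5+5+"):
    print(qual, round(mean_read_quality(qual), 2), passes_threshold(qual))
```

Averaging error probabilities penalizes reads with a few very poor bases more strongly than a plain arithmetic mean of Q-values would.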

### RESULTS

A total of 40 Māori and Pacific Island, four Coriell and five UDRUGS individuals were selected for library construction. We successfully amplified a region of 943 bp spanning exons 2 and 3 of HLA-B in a single PCR. After purifying, all 49 PCR products were diluted to reach the desired concentration (2.0 nM). Three gave insufficient yield, ranging from 0.19 to 1.47 nM. However, these three were still included to test whether they could be effectively amplified in the barcoding step.

Two primers used in the first PCR were tailed with the adapter sequences, which were compatible with Oxford Nanopore barcodes. Each PCR product was then amplified with barcode primers, at which point all 49 PCR products were tagged with barcodes, increasing their length to 1,063 bp (**Figure 1**).

PCR products were subjected to normalization prior to pooling and sequencing on the MinION (∼368 ng each). Five samples had significantly lower concentrations than other samples (range: 20.4–66.8 ng), but these were included in the pool (**Table 2**). The total DNA quantity in the pool was 7.5 µg, of which 5 µg was used for downstream steps. After end-repair, adapter ligation and purification steps, 585 ng of prepared library remained and was loaded into the MinION flow cell.

Given that for this version of the MinION chemistry, 2D reads were more accurate and had greater length than 1D reads, we extracted only the 2D reads for downstream analysis. After conversion, all 49 FASTQ files were imported into SeqPilot software for HLA-B allele assignment. The mean read depth was 5,807x and the mean read length was 1,029 bp, close to the expected size of all amplicons (**Figure 2**). An average of 5,854 sequence reads per barcoded sample was obtained from a total of 289,095 2D pass reads. The proportion of reads meeting a Q-value threshold of 15 was 83.3% (**Figure 3**). There were 286,852 reads with uniquely identified barcodes, of which 199,297 reads passed the quality filters and were aligned to the assigned allele sequences. The distribution histogram of both assigned and aligned reads for each barcoded sample is shown in **Figure 4**.
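The read counts reported above can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the text; the interpretation that the per-sample average of 5,854 corresponds to barcoded reads divided by the 49 samples is our assumption, and the variable names are hypothetical.

```python
# Figures as reported in the text
total_2d_pass_reads = 289_095
barcoded_reads = 286_852
aligned_reads = 199_297
n_samples = 49

# Average number of barcoded reads per sample (assumed basis of the reported 5,854)
reads_per_sample = barcoded_reads / n_samples
# Share of 2D pass reads that carried a uniquely identified barcode
frac_barcoded = barcoded_reads / total_2d_pass_reads
# Share of barcoded reads that passed quality filters and aligned
frac_aligned = aligned_reads / barcoded_reads

print(f"reads per sample:       {reads_per_sample:.0f}")
print(f"barcoded / 2D pass:     {frac_barcoded:.1%}")
print(f"aligned / barcoded:     {frac_aligned:.1%}")
```

Under these assumptions, nearly all 2D pass reads were assigned to a barcode, and roughly seven in ten barcoded reads were retained for allele assignment.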

The five samples that amplified poorly still produced ample reads for SeqPilot to generate HLA-B typing calls (**Table 2**). Notably, individual PI\_C3, which had the lowest number of mapped reads (80) that aligned to the reference sequence, was assigned the alleles HLA-B<sup>∗</sup> 44:04, 56:02:01. We are confident these alleles are correct as no mismatches occurred at key polymorphic sites in the reads.

TABLE 2 | DNA quantity and number of mapped reads of poorly amplified samples.

Using FASTQ files as inputs, SeqPilot effectively assigned HLA-B alleles for all 49 individuals. There were 4 homozygotes and 45 heterozygotes, resulting in 38 alleles called at third-field (formerly 6-digit) resolution and five alleles at second-field (4-digit) resolution (**Table 3**). Six samples could not be automatically called by the software due to mismatches with reference sequences. Notably, the phasing bias mostly occurred at nucleotide positions 130–136 of exon 2. We noticed an A/G heterozygous site at nucleotide 133, and the sequence around this position was a repeat of G and A (GAGAGRGGAG). We also observed that if nucleotide 133 was G, there was usually a deletion of two to four nucleotides (**Figure 5**). **Figure 6** illustrates the ambiguous results for sample PI\_A2, with several possible alleles and their corresponding mismatched locations. HLA-B<sup>∗</sup> 27:05:02G could not be confidently called due to two mismatches in the area mentioned above (**Figure 6**). Sanger sequencing of these six samples confirmed that these mismatches were not PCR artifacts, suggesting that the errors may have arisen during the nanopore sequencing step. To explore whether such errors could be corrected, we ran the Nanopolish pipeline to compute consensus sequences with improved base quality. After comparing these polished sequences with sequences obtained from Sanger sequencing, we found that Nanopolish did not further improve the accuracy in these error regions (**Figure 7**). In these cases, we manually assigned allele pairs from the suggested list, choosing those with the fewest mismatches.

Although other NGS approaches have been applied to HLA analysis, Sanger sequencing is regarded as the gold standard for HLA typing. Consequently, we selected 11 individuals for HLA-B genotyping using the Sanger SBT method for validation. These were four Polynesian, five UDRUGS and two Coriell samples. Six amplicons covering exons 2 and 3 of each individual were directly sequenced. HLA-B alleles were defined based on these polymorphic sequences. Using SBTengine (GenDX, Utrecht, Holland) for analysis, we found a high consistency between these variant calls and the calls derived from the MinION data. Our genotyping data for all four Coriell individuals were concordant with the data from these samples generated by the 1000 Genomes project. These results indicate that this MinION sequencing method is able to generate consensus sequences for high-resolution HLA-B typing with considerable accuracy.
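The GAGAGRGGAG motif reported around nucleotide 133 of exon 2 uses the IUPAC ambiguity code R (A or G). As an illustration only, the sketch below expands such a motif into its two concrete sequences and counts the leading GA repeat units in each. The helper names are hypothetical, and the code is not part of the SeqPilot or Nanopolish workflows.

```python
import re
from itertools import product

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_iupac(seq):
    """Return every concrete sequence encoded by an IUPAC-coded sequence."""
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


def leading_ga_repeats(seq):
    """Count consecutive 'GA' units at the start of a sequence."""
    match = re.match(r"(?:GA)*", seq.upper())
    return len(match.group(0)) // 2


for allele_seq in expand_iupac("GAGAGRGGAG"):
    print(allele_seq, "leading GA units:", leading_ga_repeats(allele_seq))
```

The variant carrying A at the ambiguous position starts with three GA units and the variant carrying G with two, which shows how a single heterozygous site changes the length of a short tandem repeat, a sequence context in which nanopore base-calling errors are plausible.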

There were 38 HLA-B alleles identified in the 40 Māori and Pacific Island individuals examined. Among these alleles, HLA-B<sup>∗</sup> 40:01:01 had the highest frequency (28.95%), followed by HLA-B<sup>∗</sup> 44:02:01 (21.05%) and HLA-B<sup>∗</sup> 07:02:01 (18.42%). According to the HLA allele frequency database (http://www.allelefrequencies.net), HLA-B<sup>∗</sup> 40:01:01 is the most prevalent allele in the Han Chinese population (allele frequency of 0.155). It has been reported that the Polynesian people are ancestrally related to Micronesia, Taiwanese Aborigines and East Asia (Kayser et al., 2008; Edinur et al., 2013). This is consistent with our observation on the frequency of HLA-B<sup>∗</sup> 40:01:01 in this population. On the other hand, the HLA-B<sup>∗</sup> 44:02:01 and HLA-B<sup>∗</sup> 07:02:01 alleles are present in multiple populations. When comparing with other studies on Polynesians, we found there were four previously observed alleles not represented in our data (Edinur et al., 2013). Presumably this is because our sample size is insufficient to reflect the full range of HLA-B alleles in Pacific Island or Māori populations. Nevertheless, the most common allele (HLA-B<sup>∗</sup> 40:01:01) in our study agrees with the findings of Edinur et al., suggesting that this is a common HLA-B allele in this population. Interestingly, neither HLA-B<sup>∗</sup> 15:02:01 nor B<sup>∗</sup> 58:01 was observed in the Polynesian individuals. The lack of HLA-B<sup>∗</sup> 58:01 in Polynesians has previously been reported in other studies (Abbott et al., 2006; Roberts et al., 2015). HLA-B<sup>∗</sup> 15:02 and B<sup>∗</sup> 58:01 have been implicated in carbamazepine- and allopurinol-induced severe cutaneous adverse reactions, respectively. However, HLA-B<sup>∗</sup> 57:01:01, important for abacavir-associated hypersensitivity reactions, was apparent in two samples.
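Allele frequencies such as those quoted above are typically obtained by counting alleles across all chromosomes typed. The sketch below shows one such calculation under the assumption that each individual contributes two alleles to the denominator; the text does not state the exact denominator used, and the genotypes listed are toy data rather than study results.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from (allele1, allele2) pairs, counting two alleles per individual."""
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.most_common()}


# Toy genotypes for three hypothetical individuals (not data from this study)
toy_genotypes = [
    ("B*40:01:01", "B*44:02:01"),
    ("B*40:01:01", "B*40:01:01"),  # homozygote contributes the allele twice
    ("B*07:02:01", "B*40:01:01"),
]

for allele, freq in allele_frequencies(toy_genotypes).items():
    print(f"{allele}\t{freq:.2%}")
```

Counting homozygotes twice matters in small cohorts, where a handful of homozygous individuals can noticeably shift the ranking of the most common alleles.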

# DISCUSSION

The primary goal of our study was to develop methods for HLA-B class 1 typing on the MinION nanopore sequencer. This study applied PCR across HLA-B exon 2 and exon 3, followed by barcoding and nanopore sequencing of 49 samples simultaneously. The high quality and good depth of coverage of our sequencing data for all samples, including several that were present at low concentration, enabled accurate assignment of HLA-B alleles. With ongoing improvements to the speed, throughput and workflow of MinION flowcells and the associated chemistry, it is likely that multiplexing could be extended to much greater numbers without compromising the ability for accurate typing.

The workflow we employed was straightforward and solely PCR-based. Though the ONT protocol requires 2.0 nM of input amplicon prior to the barcoding step, sample PI\_G2 was successfully amplified and produced sequencing data at 0.19 nM, roughly one-tenth of the recommended amount. Indeed, even these poorly represented samples generated sufficient data for HLA-B typing. Of all the reads generated from the sequencing device, 199,297 reads had good quality and passed the filters for alignment. Reads were ignored if they did not map to the region of interest. Among those usable reads, varied numbers of reads between samples were observed, ranging from 80 to 17,329 (**Figure 4**). Regardless of this read-depth variability between samples, adequate coverage was achieved for all samples and allowed for confident HLA-B phasing. It is also worth noting that individual PI\_C3 had only 80 reads that aligned with reference sequences, but alleles could still be assigned confidently.

Our protocol takes 3 days to complete, comprising 1–1.5 days for library construction, 1 day for sequencing and base calling, and a half-day for data analysis. The sequencing step takes up to 2 days if aiming for more reads; however, it is possible to analyze data prior to the end of the run if desired. It can be argued that the turnaround time of our method is still longer than that of the gold standard SBT method. Other NGS workflows for MinION library construction may take even longer (3–4 days) with the employment of biotinylated probes (Karamitros and Magiorkinis, 2015). Moreover, given that the capacity of the MinION can be expanded to analyze other HLA loci at high resolution and to take more samples (up to 96 at present), MinION-based HLA typing can overcome this limitation. The ability to confidently call the genotype with such a small number of reads suggests that it would be reasonable to increase throughput by sequencing additional HLA loci, or indexing up to 96 individuals per sequencing run.

TABLE 3 | Assignment result of HLA-B obtained from MinION sequencing and SBT.

A, African; Ch, Chinese; C, Caucasian; NM, NZ Māori; CM, Cook Island Māori; S, Samoan; N, Niuean; P, others; U, unknown. <sup>U</sup> Samples selected for validation. § Samples required manual allele assignment.

Our average read length was 1,028 bp, indicating that our reads were long enough to cover exon 2 and exon 3 of HLA-B. We are aware that ambiguous typing of HLA-B might occur due to variants outside the region analyzed. This resulted in the assignment of G codes in several samples (**Table 3**). In HLA typing, the letter "G" is used to report ambiguous alleles which have identical exon sequences encoding the peptide binding domains (exon 2 and exon 3 for HLA class I). Whole-gene sequencing is required to resolve this ambiguity and obtain full resolution of an HLA allele. However, as not all alleles have been completely sequenced, only exon sequences can be mapped in some cases. For example, there are 4,356 HLA-B alleles but only 384 alleles have complete sequence information (Robinson et al., 2014). It is therefore currently more practical to focus solely on exons than to sequence the entire gene for HLA-B typing with minimal ambiguity. If necessary, the read length can be increased for complete sequencing of HLA-B, as the MinION is capable of very long sequence reads (Carter and Hussain, 2017). Therefore, once all HLA-B alleles in the IMGT/HLA database have full sequence information, our method can be adapted to sequence full-length HLA-B with greater specificity and sensitivity.

Our results showed that SeqPilot was able to identify HLA-B alleles accurately using the MinION reads. Although the software could not automatically assign HLA-B alleles for all participants (six cases required manual assignment), it listed the most likely genotype combinations ranked by number of mismatch sites. This enabled us to manually assign HLA-B alleles based on this order. We found that these errors occurred only at nucleotide positions 130–136 of exon 2 and in samples which had a (GA)<sub>3</sub> repeat in one of the allele sequences. Deletion rates of the MinION using R7.3 and R9.0 chemistry are 4.1 and 3.5%, respectively (Jain et al., 2017). Therefore, it may be that deletion errors produced by the MinION device, combined with the complex nature of HLA-B, make accurate interpretation of genotype particularly difficult around this region. Of the six individuals that were manually phased, alleles from three had been identified either by the SBT method or the 1000 Genomes project or both. These alleles were all consistent with our manual assignments, suggesting this approach can be applicable in such circumstances.

The second aim of our study was to examine HLA-B alleles in individuals of Māori and Pacific Island descent living in New Zealand. Previous studies used allele-specific primer PCR for HLA-B genotyping, which provides typing at first field resolution (Edinur et al., 2013; Roberts et al., 2013). Here, we report a feasible method for high-throughput and high-resolution HLA-B typing using NGS. Though ours is a relatively small sample, this initial finding can be used as a reference for future studies on the prevalence of HLA-B in these ethnic groups.

In recent years, various high-throughput HLA typing studies have been conducted using different NGS technologies (Carapito et al., 2016). Though NGS-based HLA typing can be time-consuming (Chua and Ng, 2016), it offers high-resolution, unambiguous, phase-defined HLA alleles, overcoming some of the limitations of traditional approaches. For instance, the current gold standard method (SBT) may generate ambiguous typing due to genotype phase issues and incomplete sequencing. Modern NGS-based HLA typing methods were mostly developed on Illumina MiSeq/HiSeq or PGM Ion Torrent platforms, which employed short-to-medium sequencing read data as an input for HLA allele assignment (Hosomichi et al., 2015). At present, the Illumina platform has been widely adopted due to its high accuracy and high precision for HLA typing. However, the advent of long-read single-molecule sequencing holds the promise of achieving full-length phase-defined HLA genes as well as detecting novel and rare variants. An early, very preliminary study suggested the MinION nanopore sequencer has potential for HLA analysis (Ammar et al., 2015). The device and associated chemistry have been continuously improving (Jain et al., 2017). For example, the total error of 2D reads was reduced from 9.1% in R7.3 chemistry to 7.3% in R9.0 chemistry (now R9.4). Another advantage of the MinION device is its portability, which raises the possibility of using the device for HLA analysis in field situations or point-of-care settings.

Before this assay could be applied clinically, at least two things would need to occur. First, the nanopore technology is still in a state of relatively rapid development, and the MinION platform would need to stabilize before clinical implementation would be possible. For example, since the work described in this report was completed there have been various further iterations of chemistry and library preparation procedures for the MinION. In addition, newer versions of the nanopore sequencing equipment with higher throughput (such as the GridION and PromethION) have been recently released to the market. Second, a much larger study would be required to assess the sensitivity and specificity of the nanopore sequencing and allele calling procedure described in this paper. This would need to be carried out in a cohort which had undergone HLA typing using the current gold standard HLA typing approach of SBT (Erlich, 2012).

# CONCLUSION

We have described here the development and evaluation of a PCR-based HLA-B sequencing method using MinION Nanopore Technology on an R9.4 flow cell. We demonstrated that our method is relatively straightforward and can generate accurate sequencing data from many barcoded samples in a single run. We also reported that precise HLA-B alleles could be obtained from the MinION reads with minimal phase ambiguity. Our protocol can be easily adapted for other HLA loci, for full gene sequencing, or to employ greater levels of multiplexing. The method could be particularly valuable for research studies examining the role of HLA alleles in ADRs.

# DATA AVAILABILITY

The complete sequencing data can be accessed at the NCBI Sequence Read Archive database with the accession number SRP138979.

# AUTHOR CONTRIBUTIONS

KT carried out the laboratory work, data analysis and drafted the manuscript. SC advised on nanopore sequencing procedures and bioinformatic analyses. SG-S contributed to the analysis and assignment of HLA alleles. TM and LS recruited subjects and provided DNA for this analysis. MK supervised the work and contributed to preparation of the manuscript.

# ACKNOWLEDGMENTS

KT was supported by a University of Otago PhD scholarship. This work was also supported by funding from the Jim and Mary Carney Charitable Trust.


**Conflict of Interest Statement:** Author SG-S was employed by JSI medical systems GmbH. The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ton, Cree, Gronert-Sum, Merriman, Stamp and Kennedy. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# PASSPORT-seq: A Novel High-Throughput Bioassay to Functionally Test Polymorphisms in Micro-RNA Target Sites

Joseph Ipe<sup>1</sup>, Kimberly S. Collins<sup>1,2</sup>, Yangyang Hao<sup>3,4</sup>, Hongyu Gao<sup>3,4</sup>, Puja Bhatia<sup>1</sup>, Andrea Gaedigk<sup>5</sup>, Yunlong Liu<sup>3,4</sup> and Todd C. Skaar<sup>1</sup>\*

<sup>1</sup> Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>2</sup> Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>3</sup> Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>4</sup> Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>5</sup> Division of Clinical Pharmacology, Toxicology and Therapeutic Innovation, Children's Mercy Kansas City, Kansas City, MO, United States

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch-Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Dylan Glubb, QIMR Berghofer Medical Research Institute, Australia

Eric R. Gamazon, The University of Chicago, United States

\*Correspondence: Todd C. Skaar tskaar@iu.edu

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Genetics

Received: 30 November 2017 Accepted: 29 May 2018 Published: 15 June 2018

#### Citation:

Ipe J, Collins KS, Hao Y, Gao H, Bhatia P, Gaedigk A, Liu Y and Skaar TC (2018) PASSPORT-seq: A Novel High-Throughput Bioassay to Functionally Test Polymorphisms in Micro-RNA Target Sites. Front. Genet. 9:219. doi: 10.3389/fgene.2018.00219

Next-generation sequencing (NGS) studies have identified large numbers of genetic variants that are predicted to alter miRNA–mRNA interactions. We developed a novel high-throughput bioassay, PASSPORT-seq, that can functionally test in parallel 100s of these variants in miRNA binding sites (mirSNPs). The results are highly reproducible across both technical and biological replicates. The utility of the bioassay was demonstrated by testing 100 mirSNPs in HEK293, HepG2, and HeLa cells. The results of several of the variants were validated in all three cell lines using traditional individual luciferase assays. Fifty-five mirSNPs were functional in at least one of three cell lines (FDR ≤ 0.05); 11, 36, and 27 of them were functional in HEK293, HepG2, and HeLa cells, respectively. Only four of the variants were functional in all three cell lines, which demonstrates the cell-type specific effects of mirSNPs and the importance of testing the mirSNPs in multiple cell lines. Using PASSPORT-seq, we functionally tested 111 variants in the 3′ UTR of 17 pharmacogenes that are predicted to alter miRNA regulation. Thirty-three of the variants tested were functional in at least one cell line.

Keywords: SNP, functional testing, genetic variants, miRNA, high-throughput screening assays, 3′ UTR

# INTRODUCTION

Large scale sequencing studies and genome-wide association studies (GWASs) have identified 1000s of genotype–phenotype associations (Welter et al., 2014). Some of the phenotype-associated variants alter gene function and many of them are in linkage disequilibrium with the functional variants. The functional impacts of variants can be predicted using bioinformatic algorithms, but the in silico predictions are often incorrect and need experimental validation. While there are several experimental methods to functionally test variants, most do not have the capacity to simultaneously test the large number of variants.

Nearly 90% of genetic variants associated with phenotypes have been described to be located in non-coding regions such as the untranslated regions (UTRs) (Hindorff et al., 2009). Variants, including single nucleotide polymorphisms (SNPs) within non-coding regions, can impact gene expression in several ways; one example is by altering the interaction between mRNAs and micro-RNAs (miRNAs). Polymorphisms within miRNA-binding sites have been implicated in diseases such as cancer (Pelletier and Weidhaas, 2010; Iuliano et al., 2013), Alzheimer's disease (Liu et al., 2017), and diabetes (Elek et al., 2015).

miRNAs are 21–23 nucleotide long RNAs that post-transcriptionally silence genes or reduce their expression levels by complementarily binding to target sites within mRNAs. More than 29,000 human mRNAs are collectively targeted by over 2,500 miRNAs (Kozomara and Griffiths-Jones, 2014). Several different miRNA binding sites may be present on one mRNA and many contain genetic variations. To date, over 400,000 SNPs have been identified in miRNA binding sites (Liu et al., 2012). Interestingly, only about 32,000 have a minor allele frequency greater than 1%, classifying most of them as rare variants (Liu et al., 2012). Thus, tests to identify functional SNPs affecting miRNA binding, here referred to as mirSNPs, will involve screening a large number of variants.

Testing this large number of mirSNPs using GWAS requires statistical correction for multiple testing, such as the Bonferroni correction. The low minor allele frequency of many causal variants and routine multiple comparisons corrections make it very difficult, or impossible, to statistically identify functionally relevant variants in genome-wide studies. Consequently, in GWAS, impractically large numbers of subjects from diverse populations would be required to identify rare functional variants that are statistically significant. Lowering the statistical threshold or not correcting for multiple comparisons increases the sensitivity to detect rare variant associations, but results in the detection of many false-positive signals. Despite some technical challenges, high-throughput in vitro approaches have been implemented that are specific to variants in certain noncoding regions, such as splice-junctions (Soemedi et al., 2014) and promoters (Kwasnieski et al., 2012; Melnikov et al., 2012). However, we are not aware of any high-throughput assays available to functionally test variants in miRNA binding sites (Ipe et al., 2017).

We developed PASSPORT-seq (parallel assessment of polymorphisms in miRNA target-sites by sequencing), a high-throughput bioassay that involves pooled synthesis, parallel cloning and single-well transfection followed by next-generation sequencing (NGS) to functionally test 100s of mirSNPs at once. This assay produced results that are reproducible and consistent with luciferase reporter assays, a gold-standard platform widely used to assess gene expression in vitro. We also demonstrate the application of this assay to test 111 genetic variants that are predicted to alter miRNA regulation of 17 pharmacogenes.

# MATERIALS AND METHODS

#### Selection of mirSNPs

RNA samples from thirty human livers were sequenced using SOLiD® technology (Thermo Fisher Scientific, Waltham, MA, United States). SNPs in the 3′ UTRs were identified (O'Leary et al., 2016) and an 8-base pair region on either side of the reference and variant alleles was analyzed using TargetScan (Lewis et al., 2005) to identify SNPs that were in miRNA seed binding regions. SNPs that altered the predicted miRNA seed binding sites were considered for further analysis. For assay development, 84 SNPs that were associated with allele-specific expression in the sequencing dataset were selected. A flowchart representing the selection process of the 84 test mirSNPs is shown in Supplementary Figure 1. In addition, we selected 16 mirSNPs from the SomamiR database (Bhattacharya et al., 2013) that have been linked with cancer. The 100 SNPs used for assay development are listed in Supplementary Table 1. Similarly, to demonstrate the application of the assay, we selected 111 mirSNPs that showed allele-specific expression in the sequencing dataset and were located in the 3′ UTRs of 17 pharmacogenes: the core absorption, distribution, metabolism, and excretion (ADME) genes<sup>1</sup>, PXR, CAR, and HNF4α. These 111 SNPs are listed in Supplementary Table 4. The RNA analysis and genotyping were approved by the Indiana University Institutional Review Board.

#### Test Sequence Design

The 5′ and 3′ flanking regions for each SNP were obtained from dbSNP. A 32-nucleotide region which contained either the variant or reference nucleotide flanked by nine nucleotides on the 3′ end and 22 nucleotides on the 5′ end was selected as the test sequence. Two hundred such regions (100 reference and 100 variant) were selected to test 100 SNPs. Universal primer binding regions were added on the 5′ (GTAATTCTAGGAGCTC) and 3′ (CGTTCTAGAGTCGGG) end of each test region. The final test fragment was 63 nucleotides in length (see Supplementary Figure 2). The 200 test fragments were commercially synthesized as pooled single-stranded DNA oligonucleotides (Oligomix®, LC Sciences, Houston, TX, United States). The pool contained 10–50 attomoles of each sequence and was diluted 1:5. One µL of the diluted Oligomix® was amplified in a 50 µL PCR reaction using 0.3 µM universal primers and 25 µL 2X CloneAmp™ HiFi PCR premix (Takara, Mountain View, CA, United States). PCR conditions used were: 98°C (10 s), 53°C (5 s), and 72°C (5 s) for 35 cycles.
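The fragment layout described above (a 16-nt 5′ universal region, a 22-nt 5′ flank, the SNP position, a 9-nt 3′ flank, and a 15-nt 3′ universal region) can be sketched in Python. This is only an illustration: the function name and the example flanking sequences are hypothetical, and only the two universal primer-binding sequences and the segment lengths are taken from the text.

```python
UNIVERSAL_5P = "GTAATTCTAGGAGCTC"  # 16 nt, as given in the text
UNIVERSAL_3P = "CGTTCTAGAGTCGGG"   # 15 nt, as given in the text


def build_test_fragment(flank_5p: str, allele: str, flank_3p: str) -> str:
    """Assemble one 63-nt test fragment around a single-nucleotide allele."""
    if len(flank_5p) != 22 or len(allele) != 1 or len(flank_3p) != 9:
        raise ValueError("expected a 22-nt 5' flank, a 1-nt allele, and a 9-nt 3' flank")
    test_sequence = flank_5p + allele + flank_3p        # 22 + 1 + 9 = 32 nt
    return UNIVERSAL_5P + test_sequence + UNIVERSAL_3P  # 16 + 32 + 15 = 63 nt


# Hypothetical flanking sequences, for illustration only (not real SNP context).
flank_5p = "ACGTACGTACGTACGTACGTAC"
flank_3p = "TTGACCTGA"
reference = build_test_fragment(flank_5p, "C", flank_3p)
variant = build_test_fragment(flank_5p, "T", flank_3p)
```

The reference and variant fragments produced this way differ at exactly one position, which is what lets reads be assigned to an allele after sequencing.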

Seven SNPs were also tested in individual luciferase assays using 63-nucleotide-long single-stranded oligonucleotides that were individually synthesized (reference and variant; Integrated DNA Technologies, Coralville, IA, United States), made double-stranded as described for pooled oligonucleotides, and cloned into the pIS-0 plasmid.

#### Plasmid Library Preparation

The pIS-0 vector (plasmid 12178; Addgene, Cambridge, MA, United States) (Yekta et al., 2004) (see Supplementary Figure 3) was linearized with SacI-HF® and BmtI-HF® restriction endonucleases (New England Biolabs, Ipswich, MA, United States) and purified using QIAquick® PCR spin columns (Qiagen, Germantown, MD, United States). Plasmid assembly was performed using 40 ng of linearized plasmid and 2 µL of unpurified PCR product containing double-stranded test oligonucleotides using the NEBuilder® HiFi DNA assembly kit (NEB, Ipswich, MA, United States) per manufacturer's instructions. The universal primers used to amplify the test-oligonucleotide pool also served as the flanking homology regions for the NEBuilder® assembly. Two µL of the NEBuilder® assembly product were transformed into 60 µL chemically competent E. coli (transformation efficiency > 5 × 10<sup>8</sup> cfu/µg) (Takara, Mountain View, CA, United States) and plated on six standard 100 mm LB-agar plates containing 100 µg/ml ampicillin. After overnight incubation, all colonies were dislodged from the plates by adding 2 ml LB-broth containing 100 µg/ml ampicillin and agitating with 10–20 ColiRollers™ glass beads (EMD Millipore, Billerica, MA, United States). The colonies harvested from the six plates in LB-broth were pooled together. The liquid culture was incubated at 37°C for 5 h, after which plasmids were isolated using 10 QIAprep® Spin miniprep columns (Qiagen, Germantown, MD, United States) as per manufacturer's instructions. Column elutions were combined to create the plasmid library that was used for downstream experiments. The plasmid DNA concentration was determined using a Quant-iT™ DNA Broad Range kit (Thermo Fisher Scientific, Waltham, MA, United States).

<sup>1</sup>http://www.pharmaadme.org/joomla/

#### Sanger Sequencing

To determine the representation of the test constructs in the plasmid library, 28 individual colonies were grown in LB-broth containing 100 µg/ml ampicillin. Plasmids were isolated using QIAprep® Spin miniprep columns (Qiagen, Germantown, MD, United States) as per manufacturer's instructions and Sanger sequenced using a primer (GTGGTTTGTCCAAACTCATC) near the test insert (ACGT, Inc., Wheeling, IL, United States).

#### Transfection of Cells in Culture

The plasmid library was used to transfect three different human cell lines: HEK293 (embryonic kidney), HepG2 (liver carcinoma), and HeLa (cervical cancer). Cells were seeded at a density of 0.9 × 10<sup>5</sup> cells per well into 24-well plates. The cells were transfected 24 h after plating with 500 ng/well of the pIS-0 plasmids. Ten ng of pGL4.74, a Renilla luciferase reporter plasmid, was added to each well as a transfection control. Transfection was performed using 50 µL transfection-mix in Opti-MEM® (Life Technologies, Carlsbad, CA, United States) containing 1.5 µL Lipofectamine® 3000 (Life Technologies, Carlsbad, CA, United States) per the manufacturer's instructions. Opti-MEM® and culture media were used with no antibiotics.

#### RNA Isolation and cDNA Synthesis

Transfected cells were incubated for 48 h and lysed in situ, and total RNA was isolated using an RNeasy® purification kit with the optional DNase treatment (Qiagen, Germantown, MD, United States). RNA was quantified using the Quant-iT™ RNA Broad Range kit (Thermo Fisher Scientific, Waltham, MA, United States) and cDNA was synthesized from 800 ng of total RNA using the QuantiTect® Reverse Transcription kit (Qiagen, Germantown, MD, United States).

#### Molecular Barcoding

Using the cDNAs from the transfected cells, the miRNA binding sites within the 3′ UTR of the luciferase genes were amplified in 50 µL PCR reactions using 0.3 µM flanking universal primers and 25 µL 2X CloneAmp™ HiFi PCR premix (Takara, Mountain View, CA, United States). In a separate reaction for each sample, either 2 µL of cDNA or 1 pg of the input plasmid pool was used as the PCR template. PCR conditions used were: 98°C (10 s), 54°C (5 s), and 72°C (5 s) for 25 cycles. A 6-nucleotide unique molecular barcode was added to the 5′ end of both the forward and reverse primers (see Supplementary Table 2). The input pools (n = 4 replicates) and the five biological replicates in the three different cell lines were each 'barcoded' by a unique pair of sequences. The barcoded PCR products were purified using a MinElute® PCR Purification kit (Qiagen, Germantown, MD, United States). The barcoded libraries were combined in equimolar concentrations to create a sequencing pool with 19 different molecular barcodes. A schematic representation of the steps involved in creating this sequencing pool is shown in **Figure 1**.

#### Next-Generation Sequencing

The pooled PCR products were sequenced using a modified protocol for the Ion Proton™ system (Thermo Fisher Scientific, Waltham, MA, United States). Briefly, the sequencing library was created by end-polishing the barcoded PCR products, followed by adapter ligation and amplification. The resulting library was quantified and its quality assessed with the Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA, United States). Eight microliters of the 100 pM library were then applied to Ion Sphere Particles to prepare the sequencing template. The template was amplified using Ion OneTouch 2. The Ion Sphere Particles with the amplified template were loaded onto an Ion PI® chip and sequenced on the Ion Proton system per manufacturer's instructions. Approximately 41 million reads were generated from each sequencing run. Raw reads were generated as fastq files for bioinformatic analysis. Sequencing data have been made publicly available through GEO (Accession No. GSE111845).

#### Bioinformatic Analysis

The raw reads were aligned to the reference library containing the 200 test sequences (TMAP, Ion Torrent Suite®, Thermo Fisher Scientific, Waltham, MA, United States). The reads that aligned to the reference library were filtered to retain reads with a mapping quality greater than 20 and further filtered to include only those sequences with perfect barcodes at both ends.
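The two filters described above (mapping quality greater than 20, and a perfect barcode match at both ends) might be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the `AlignedRead` record, the sample name, and the barcode sequences are hypothetical, and the sketch assumes each read runs from the forward primer to the reverse primer, so that the reverse barcode appears reverse-complemented at the 3′ end of the read.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class AlignedRead(NamedTuple):
    sequence: str        # full read, barcodes included
    reference_name: str  # test sequence the read aligned to
    mapping_quality: int


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def assign_sample(
    read: AlignedRead,
    barcode_pairs: Dict[str, Tuple[str, str]],
    min_mapq: int = 20,
) -> Optional[str]:
    """Return the sample a read belongs to, or None if the read is filtered out."""
    if read.mapping_quality <= min_mapq:  # keep only mapping quality > 20
        return None
    for sample, (forward_bc, reverse_bc) in barcode_pairs.items():
        # Assumption: forward barcode at the 5' end, reverse barcode
        # reverse-complemented at the 3' end; both must match exactly.
        if read.sequence.startswith(forward_bc) and read.sequence.endswith(
            reverse_complement(reverse_bc)
        ):
            return sample
    return None


# Hypothetical barcode pair and reads, for illustration only.
barcode_pairs = {"HepG2_rep1": ("ACGTAC", "GATTCA")}
good_read = AlignedRead("ACGTAC" + "GGGTTTCCC" + "TGAATC", "example_ref", 42)
low_mapq_read = AlignedRead("ACGTAC" + "GGGTTTCCC" + "TGAATC", "example_ref", 20)
bad_barcode_read = AlignedRead("ACGTAA" + "GGGTTTCCC" + "TGAATC", "example_ref", 42)
```

Requiring an exact match at both ends discards reads with any barcode error, trading some read depth for a lower risk of assigning a read to the wrong sample.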

Differential expression analysis compared the expression of each variant to its respective reference allele for all SNPs. To account for differences in the concentrations of the variant and reference plasmids that were used for the transfections, a plasmid input correction factor for each target site was calculated as the average of the number of reads from the variant plasmid divided by the number of reads from the reference plasmid across four replicates of the plasmids. The reads from the variant alleles for all biological replicates were divided by the input correction factor. The corrected read counts were fit into a generalized linear model using edgeR (Robinson et al., 2010) assuming a negative binomial distribution. Biological replicates and the genotype were used as covariates. p-Values and log<sub>2</sub> fold-change of the variant alleles compared to the respective reference alleles were derived using a likelihood ratio test on the genotype variable in the generalized linear model. The p-values were corrected for a false discovery rate (FDR) using the Benjamini and Hochberg algorithm (Benjamini and Hochberg, 1995). The edgeR script can be found in Supplementary File 1.
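The input correction step can be illustrated with a short Python sketch. It assumes one reading of the description above, namely that the factor is the mean of the per-replicate variant/reference read ratios (a ratio of summed reads would be another reading). The function names and read counts are hypothetical, and the negative binomial model fitted in edgeR is not reproduced here.

```python
from typing import List, Sequence


def input_correction_factor(
    variant_input: Sequence[float], reference_input: Sequence[float]
) -> float:
    """Mean variant/reference read ratio across the input-plasmid replicates."""
    if len(variant_input) != len(reference_input) or not variant_input:
        raise ValueError("need one variant and one reference count per replicate")
    ratios = [v / r for v, r in zip(variant_input, reference_input)]
    return sum(ratios) / len(ratios)


def correct_variant_counts(variant_counts: Sequence[float], factor: float) -> List[float]:
    """Divide the variant reads of each biological replicate by the input factor."""
    return [count / factor for count in variant_counts]


# Hypothetical read counts for one target site (four input replicates).
variant_input = [1200, 1100, 1300, 1250]
reference_input = [1000, 1000, 1000, 1000]
factor = input_correction_factor(variant_input, reference_input)

# Hypothetical variant read counts from five biological replicates.
corrected = correct_variant_counts([2425, 2400, 2500, 2300, 2450], factor)
```

With these made-up numbers the variant plasmid is over-represented in the input by a factor of about 1.21, so the variant counts are scaled down by that amount before being compared with the reference counts.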

The two sequencing runs, each with five biological replicates, were analyzed together by fitting the number of reads for 10 pairs of variant and reference alleles (five from each experiment). The different sequencing runs were included as an additional covariate. The statistical analysis was performed as described above.

#### Luciferase Reporter Assay

Genetic variants, including mirSNPs, have been functionally tested using a reporter plasmid such as the pIS-0 vector (Yekta et al., 2004; Adams et al., 2007; Ramamoorthy et al., 2012). This plasmid contains the firefly luciferase gene whose expression can be quantified either by qPCR or by the luciferase reporter assay. The reference or variant allele version of the predicted miRNA binding sites were cloned into the 3′ UTR of the luciferase gene within the plasmid. The plasmids were then transfected into cells as described above. Forty-eight hours after transfection, cells were lysed in situ and Dual-Luciferase® assays were performed per manufacturer's instructions (Promega, Madison, WI, United States). The luciferase reporter activity was measured using a 96-well plate-reader (BioTek, Winooski, VT, United States). The firefly luciferase activity was normalized to that of Renilla luciferase in each well. The ratio of the normalized luciferase activity from the variant and reference plasmid provides a relative measure of SNP-mediated differential mRNA expression.
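The normalization described above amounts to two ratios, sketched here in Python. The function names and luminescence values are hypothetical, and averaging the normalized activity across replicate wells before taking the variant/reference ratio is an assumption, since the text does not say how replicate wells were combined.

```python
from typing import Sequence, Tuple

Well = Tuple[float, float]  # (firefly signal, Renilla signal) from one well


def normalized_activity(firefly: float, renilla: float) -> float:
    """Firefly luciferase activity normalized to the Renilla transfection control."""
    return firefly / renilla


def variant_to_reference_ratio(
    variant_wells: Sequence[Well], reference_wells: Sequence[Well]
) -> float:
    """Ratio of mean normalized activity: variant plasmid over reference plasmid."""
    variant_mean = sum(normalized_activity(f, r) for f, r in variant_wells) / len(variant_wells)
    reference_mean = sum(normalized_activity(f, r) for f, r in reference_wells) / len(reference_wells)
    return variant_mean / reference_mean


# Hypothetical luminescence readings, for illustration only.
variant_wells = [(500.0, 100.0), (600.0, 100.0)]
reference_wells = [(1000.0, 100.0), (1200.0, 100.0)]
ratio = variant_to_reference_ratio(variant_wells, reference_wells)
```

A ratio below 1, as in this made-up example, would indicate lower reporter expression from the variant allele than from the reference allele.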

# RESULTS

The traditional luciferase reporter assay is useful in low-throughput experiments, but is not a practical and cost-effective method to test the 1000s of mirSNPs identified at a genome-wide scale. As a novel approach, we modified the luciferase reporter assay to develop PASSPORT-seq, which can functionally test 100s of mirSNPs in parallel. Since one of the mechanisms of miRNA regulation is by degrading mRNA, this assay was specifically designed to evaluate the impact of genetic variation in miRNA binding sites on mRNA expression. We recognize that miRNAs also alter mRNA translation; however, measuring protein levels does not distinguish between the impact on mRNA vs. translation and thus would not provide the same mechanistic insights. We identified 100 variants in predicted seed sequences of miRNA binding sites, and cloned the binding sites into the pIS-0 luciferase plasmid; each contained either the reference or variant nucleotide of 100 selected mirSNPs. The pool of the resulting 200 plasmids was then transfected into three cell lines and the luciferase gene expression measured by NGS. A difference in mRNA expression between the reference and the variant plasmids indicated a functional mirSNP.

#### Cloning Efficiency and Plasmid Representation

To test the efficiency of the parallel cloning, plasmids from 28 individual colonies were isolated and the sequences of the inserts were determined by Sanger sequencing. Out of the 28 colonies, 25 contained inserts without errors; of those, 24 were unique sequences, suggesting that the cloning efficiency was high and there was negligible sequence bias in the plasmid pool. Furthermore, as described below, all 200 sequences were observed in the NGS of the entire pool.

#### Reproducibility Across Biological Replicates

To test the reproducibility of the PASSPORT-seq bioassay, we first performed the assay with five biological replicates in each of the three cell lines and compared the number of reads between pairs of the five biological replicates. The input plasmid libraries had representation of all 100 allelic pairs. For pairwise comparisons of the input-normalized biological replicates within the same sequencing run, the R<sup>2</sup> value was between 0.68 and 0.98 (p < 0.05; Supplementary Figure 4), demonstrating highly reproducible results within a run.

#### Reproducibility Across Runs

Next, we repeated the PASSPORT-seq assay with another five biological replicates in each of the three cell lines to validate the observed results. A separate cDNA and sequencing library was created for these experiments. A strong correlation (R<sup>2</sup> = 0.98; p < 0.05) was observed between the results from the plasmid pool in the first sequencing run and those in the second sequencing run (see Supplementary Figure 5). There was a high correlation (R<sup>2</sup> = 0.74; p < 0.05) in the results between the first and second set of biological replicates (**Figure 2** and Supplementary Figure 6). This strong correlation between results of the first sequencing run and those of the second sequencing run across the three cell lines demonstrates the high reproducibility of the observed results.
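As an illustration of how such a reproducibility statistic can be computed, the sketch below calculates R<sup>2</sup> as the squared Pearson correlation between paired measurements from two runs. That definition is an assumption, since the text does not state how R<sup>2</sup> was obtained, and the function name and example values are hypothetical.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two paired series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)


# Hypothetical log2 fold-changes for the same four variants in two runs.
run_1 = [1.0, 2.0, 3.0, 4.0]
run_2 = [1.0, 3.0, 2.0, 4.0]
agreement = r_squared(run_1, run_2)
```

For these made-up values the covariance term is 4 and each variance term is 5, giving R<sup>2</sup> = 16/25 = 0.64.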

#### Identification of Functional mirSNPs

Of the 100 mirSNPs tested, 69 showed statistically significant (p < 0.05) differences in expression between the variant and its respective reference allele in at least one cell line (**Figure 2B**).

In HEK293, HepG2, and HeLa cells, a significant effect was seen in 27, 44, and 39 mirSNPs, respectively (see Supplementary Figure 7). Due to the large number of mirSNPs tested, the results were corrected using the Benjamini and Hochberg procedure across cell lines. This conservative threshold (FDR ≤ 0.05) was met by 55 mirSNPs in at least one cell line, with 11, 36, and 27 in HEK293, HepG2, and HeLa cells, respectively. Because these variants were informatically predicted to be functional, this may be an overly conservative statistical correction. The effect of the mirSNPs was cell line-specific; four SNPs were functional across all three cell lines, while others were functional in either two cell lines or unique to one cell line (**Figure 3**).
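The Benjamini and Hochberg procedure referred to above can be sketched in a few lines of Python. This is a generic illustration of the standard step-up adjustment, not the script used in the study (which is provided as Supplementary File 1), and the p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four variants.
raw_p = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(raw_p)
functional = [q <= 0.05 for q in fdr]
```

In this made-up example three variants have raw p < 0.05 but only one passes FDR ≤ 0.05, the same kind of reduction seen between the uncorrected and corrected counts reported above.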

#### Validation With Traditional Luciferase Assays

Twenty-one of the results were validated using traditional individual luciferase transfection experiments. The variant and reference allele binding sites of the selected mirSNPs were individually cloned into the 3′ UTR of the luciferase gene within the pIS-0 plasmid and transfected into HEK293, HepG2, and HeLa cells. This included seven of the miRNA target sequences, each with reference and variant sequences, tested in three cell lines for a total of 21 validations. Within each cell line, the luciferase activity in the cells transfected with the reference plasmid was compared to the activity in the cells transfected with the variant plasmids. The effects of the variants in these individual luciferase assays were compared with the results from the PASSPORT-seq assay (**Figure 4**). In 17 of the 21 comparisons, the statistical significance of the results and the direction of the effect of the variant matched the PASSPORT-seq results (Supplementary Figure 8 and Supplementary Table 3). In an additional two comparisons (rs3134615 in HeLa and HEK293 cells), the results were statistically significant in one assay but not the other, yet the direction and magnitude of effect of the variant in the PASSPORT-seq assay were similar to those in the luciferase assay in both cell lines (Supplementary Figure 8). Thus, the results were very similar in 19 of the 21 comparisons (>90%).

# Application of PASSPORT-seq

To demonstrate the utility of this assay, mirSNPs predicted to alter miRNA regulation of 17 pharmacogenes were selected for functional testing from the RNA-seq dataset described earlier. These variants were functionally tested using the PASSPORT-seq assay in HeLa, HepG2, HEK293, and HepaRG (hepatic cells that retain characteristics of primary human hepatocytes) cells. Out of the 111 genetic variants tested, the effects of 33 variants were statistically significant in at least one cell line, including 6, 13, 12, and 27 in HeLa, HepG2, HEK293, and HepaRG cells, respectively (**Figure 5** and Supplementary Table 4). The effects of several mirSNPs were shown to be cell line-specific. Only four mirSNPs had significant effects in all four cell lines (**Figure 6**). The effect of a genetic variant (rs12979270), located in the 3′ UTR of the pharmacogene CYP2B6, was shown to be statistically significant in HepaRG cells. A recent study shows that this variant could explain part of the interindividual variability seen in the activity of this critical pharmacogene (Burgess et al., 2017). These results demonstrate the potential of this assay to identify clinically relevant functional genetic variants.

# DISCUSSION

We developed PASSPORT-seq to screen hundreds of SNPs that are predicted to alter miRNA–mRNA interactions. The availability of pooled oligonucleotide synthesis and the NEBuilder® gene assembly system have made this assay possible. Our assay builds on and substantially improves technologies that have had only limited success (Reid et al., 2009). For example, in previous high-throughput splicing assays, short inserts were underrepresented during library construction using traditional cloning methods (Chen and Chasin, 1994; Ke et al., 2011; Soemedi et al., 2014). In contrast, our library had representation of all allele pairs. This may be explained by the cloning method we utilized, i.e., the NEBuilder® gene assembly system, which produces covalently closed circular plasmids, as opposed to traditional cloning methods, which yield a nicked-relaxed plasmid topology. This change in topology may result in more efficient plasmid uptake by chemically competent bacteria (Hanahan, 1983; Xie et al., 1992; Kobori and Nojima, 1993). Better transformation efficiency increases the probability of both the variant and reference plasmids being represented in the resulting pool, which is a critical prerequisite for studying the allele-specific activity of miRNAs.

The activity of miRNAs on target sequences has been studied using reporter assays in which the target sites of interest are cloned into the 3′ UTR of a reporter gene, followed by quantification of the reporter activity. Typical reporter assays also overexpress the miRNAs that are predicted to regulate the target site (Cloonan et al., 2008; Loya et al., 2009; Baccarini and Brown, 2010). Such overexpression, however, may not provide a physiological context for the miRNA–mRNA interaction. For example, high miRNA expression levels may force interactions with mRNAs that do not normally occur. They can also compete with the miRNA processing machinery or binding sites and alter normal miRNA function. In contrast, our assay was performed in the endogenous miRNA expression background, which provides a more physiologically relevant context for the results. In addition, we used multiple different cell lines to allow parallel testing to identify cell line-specific effects of the mirSNPs.

miRNAs regulate gene expression by either degrading the target mRNAs or by binding to mRNAs and blocking translation. Traditional luciferase assays test the effect of miRNA by measuring differences in its target protein activity. However, the differences in protein activity due to mRNA degradation and those from translational blockage will be indistinguishable using luciferase activity assays. The PASSPORT-seq assay provides additional evidence of the mechanism underlying the effect of the variant by specifically detecting only the changes in mRNA transcript levels. We demonstrated that the results obtained with our PASSPORT-seq assay did reflect those obtained with the traditional luciferase reporter assay set-up. As described above, the PASSPORT-seq assay quantifies the relative expression of luciferase mRNAs, whereas the luciferase assay measures the luciferase enzyme activity. Thus, it is not surprising that there may be differences in the magnitude of effects between the different assays. For example, the effect on the luciferase activity could be larger due to both the degradation of the mRNA and the blockade of translation. In contrast, the effect on the protein levels and activity could be smaller due to delays from the time of changes in mRNA levels until the changes in protein levels and activities are observed. Typically, changes in mRNA expression due to endogenous miRNA-mediated regulation are subtle (<30%) (Baek et al., 2008; Bartel, 2009; Denzler et al., 2016). Consequently, one would expect relatively small effect sizes of the variants, which is what was seen for many of these variants. However, there are many examples demonstrating the clinical impact of these types of variants (Bhattacharya et al., 2014).

One of the key findings of these studies is the cell line-specific function of mirSNPs. This is likely in part due to the cell type-specific variation in miRNA expression profiles, resulting in the effect of a mirSNP being evident in one cell line and not in another (Landgraf et al., 2007; Ludwig et al., 2016). We observed that the identity and number of functional mirSNPs reproducibly varied across the different cell types. This demonstrates one of the strengths of PASSPORT-seq in that it can identify cell line-specific effects of mirSNPs. These differences were validated using a second PASSPORT-seq run that reproduced the cell line effect. Additionally, the cell line-specific effect was also observed in the application of the assay to test 111 mirSNPs in pharmacogenes. The tissue/cell specificity of mirSNP function could also explain why the effects of mirSNPs are not always consistent across studies. This further complicates the bioinformatics predictions of the functional impact of the mirSNPs. Thus, when using this assay, the cell line must be carefully chosen to reflect the cell type of interest regarding the central biological hypothesis of the study. Since studies have shown that mirSNPs affect a wide variety of biological processes and conditions such as cancer, neurodegenerative disorders, infectious diseases, cardiovascular disease, and metabolic disorders (Bhattacharya et al., 2014), the in vitro model for testing the mirSNPs is an important consideration.

As with any in vitro assay, PASSPORT-seq has some limitations. First, it detects only changes in mRNA, rather than protein levels. miRNA–mRNA interactions are known to cause mRNA destabilization, but can also lead to translational repression (Lim et al., 2005; Bartel, 2009). Since this assay does not detect changes in translation, it may underestimate the functional impact of some of the mirSNPs. Second, the variants could be affecting mRNA stability by mechanisms other than altering miRNA targeting. For example, a variant could alter the function of an RNA-binding protein, which in turn could alter the mRNA stability. Although additional validation experiments would be needed to determine the mechanism of action, such a finding would still be of biological value. Last, as in other studies using the pIS-0 plasmid, the miRNA binding site is tested in the context of the luciferase mRNA, rather than the endogenous mRNA.

In summary, the PASSPORT-seq assay is a powerful tool that bridges bioinformatic predictions and high-throughput mechanistic investigation of functional genetic variants that affect miRNA–mRNA interactions. Future efforts will be aimed toward further increasing the capacity of the assay and identifying translational effects. This assay also has the potential to be modified to be applicable to genetic variants in other functional genomic regions such as promoters and splice junctions. Collectively, these assays will be key to elucidating the mechanisms underlying the genetic contribution to the inter-individual phenotypic variability.

# AUTHOR CONTRIBUTIONS

JI, PB, AG, and TS designed the assay. JI, KC, and PB performed the in vitro experiments. YH, HG, and YL performed the bioinformatic analysis. JI, KC, and TS performed the data analysis. JI, KC, YH, YL, AG, and TS wrote and/or edited the manuscript.

# FUNDING

This work was supported by the National Institutes of Health [NIH/NIGMS R01-GM088076 (TS), NIH/NIGMS F31-GM119401 (Burgess), NIH/NCRR RR025761]; and the Vera Bradley Foundation for Breast Cancer (JI). Funding for open access charge: National Institutes of Health.

# ACKNOWLEDGMENTS

We thank the Center for Medical Genomics at Indiana University School of Medicine for their support on DNA sequencing. Biospecimens were stored in the CTSI Specimen Storage Facility, which is supported, in part, by grant NIH/NCRR RR025761. We also thank Drs. Michael Eadon, Eric Benson, Thomas De Luca, and Marelize Swart for their assistance via intellectual discussions.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00219/full#supplementary-material

# REFERENCES


frequency. Nucleic Acids Res. 21:2782. doi: 10.1093/nar/21.11.2782



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ipe, Collins, Hao, Gao, Bhatia, Gaedigk, Liu and Skaar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Development of an AmpliSeq™ Panel for Next-Generation Sequencing of a Set of Genetic Predictors of Persisting Pain

#### Dario Kringel<sup>1</sup>, Mari A. Kaunisto<sup>2</sup>, Catharina Lippmann<sup>3</sup>, Eija Kalso<sup>4</sup> and Jörn Lötsch<sup>1,3</sup>\*

<sup>1</sup> Institute of Clinical Pharmacology, Goethe-University, Frankfurt, Germany, <sup>2</sup> Institute for Molecular Medicine Finland, HiLIFE, University of Helsinki, Helsinki, Finland, <sup>3</sup> Fraunhofer Institute for Molecular Biology and Applied Ecology – Project Group Translational Medicine and Pharmacology, Frankfurt, Germany, <sup>4</sup> Division of Pain Medicine, Department of Anesthesiology, Intensive Care and Pain Medicine, University of Helsinki and Helsinki University Hospital, Helsinki, Finland

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch-Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Theodora Katsila, University of Patras, Greece Cheryl D. Cropp, Samford University, United States

> \*Correspondence: Jörn Lötsch j.loetsch@em.uni-frankfurt.de

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology

Received: 24 May 2018 Accepted: 17 August 2018 Published: 19 September 2018

#### Citation:

Kringel D, Kaunisto MA, Lippmann C, Kalso E and Lötsch J (2018) Development of an AmpliSeq™ Panel for Next-Generation Sequencing of a Set of Genetic Predictors of Persisting Pain. Front. Pharmacol. 9:1008. doi: 10.3389/fphar.2018.01008

Background: Many gene variants modulate the individual perception of pain and possibly also its persistence. The limited selection of single functional variants is increasingly being replaced by analyses of the full coding and regulatory sequences of pain-relevant genes accessible by means of next generation sequencing (NGS).

Methods: An NGS panel was created for a set of 77 human genes selected following different lines of evidence supporting their role in persisting pain. To address the role of these candidate genes, we established a sequencing assay based on a custom AmpliSeq™ panel to assess the exomic sequences in 72 subjects of Caucasian ethnicity. To identify the systems biology of the genes, the biological functions associated with these genes were assessed by means of a computational over-representation analysis.

Results: Sequencing generated a median of 2.85 · 10<sup>6</sup> reads per run with a mean depth close to 200 reads, mean read length of 205 called bases and an average chip loading of 71%. A total of 3,185 genetic variants were called. A computational functional genomics analysis indicated that the proposed NGS gene panel covers biological processes identified previously as characterizing the functional genomics of persisting pain.

Conclusion: Results of the NGS assay suggested that the produced nucleotide sequences are comparable to those obtained with the classical Sanger sequencing technique. The assay is applicable to small to large-scale experimental setups that aim to access information about any nucleotide within the addressed genes in a study cohort.

Keywords: pain, data science, knowledge discovery, functional genomics, next generation sequencing (NGS)

# INTRODUCTION

Persisting pain has been proposed to result from a gene–environment interaction in which nerve injuries or inflammatory processes act as triggers while the clinical symptoms develop only in a minority of subjects (Lee and Tracey, 2013). A role of the genetic background in pain is supported by evidence of many variants modulating the individual perception of pain and the development of its persistence (Diatchenko et al., 2005; Lötsch et al., 2009b; Mogil, 2012). Genetic variants have been reported to confer protection against pain, such as the rs1799971 variant in the µ-opioid receptor gene (OPRM1) (Lötsch et al., 2006), or to increase the risk for persisting pain, such as the rs12584920 variant of the 5-hydroxytryptamine receptor 2A gene (HTR2A) (Nicholl et al., 2011) or the rs734784 polymorphism in the voltage-gated potassium ion channel modifier, subfamily S member 1, gene (KCNS1) (Costigan et al., 2010). Nevertheless, the genetic background of persisting pain is still incompletely understood (Mogil, 2009; Lötsch and Geisslinger, 2010) and under intense discussion.

Until recently, research focused on the role of selected functional genetic variants as protective or risk factors of persisting pain. This has changed with the broader availability of next generation sequencing (NGS) (Metzker, 2010). To make use of these technical advancements, we developed a custom AmpliSeq™ library and sequencing assay for efficient detection of genetic variants possibly associated with persisting pain. We propose an assay of a set of 77 genes supported by evidence of an involvement in pain and its development toward persistence. The set size fully uses the technical specifications of the AmpliSeq™ gene sequencing library technique.

# MATERIALS AND METHODS

### Selection of Genes Relevant for Persisting Pain

A set of candidate genes with shown or biologically plausible relevance to persisting pain was created by applying a combination of criteria, which provided three different genetic subsets. **Subset 1** was chosen exclusively on the basis of computational functional genomics based on a recently published analysis of persisting pain regarded as displaying systemic features of learning and neuronal plasticity (Mansour et al., 2014). As discussed previously (Ultsch et al., 2016), the view of chronic pain as a dysregulation in biological processes of learning and neuronal plasticity (Alvarado et al., 2013) seems to be captured by the controlled vocabulary (Camon et al., 2004) of the Gene Ontology (GO) knowledge base by the GO terms "learning or memory" (GO:0007611)<sup>1</sup> and "nervous system development" (GO:0007399)<sup>2</sup>. An intersection of the genes annotated to these GO terms with a set of 539 "pain genes" identified empirically as relevant to pain provided the first subset of 34 genes described in detail previously (Ultsch et al., 2016). Briefly, the intersecting set of so-called "pain genes" consists of a combination of (i) genes listed in the PainGenes database (Lacroix-Fralish et al., 2007)<sup>3</sup>, (ii) genes causally involved in human hereditary diseases associated with extreme pain phenotypes, (iii) genes found to be associated with chronic pain in at least three human studies, and (iv) genes coding for targets of novel analgesics under clinical development (Lötsch et al., 2013).

**Subset 2** consisted of genes that were reported to carry variants modulating the risk or the phenotypic symptoms in at least two different clinical settings of persisting pain. They were obtained using a PubMed database search for the string "(chronic OR persisting OR neuropathic OR back OR inflammatory OR musculoskeletal OR visceral OR widespread OR idiopathic OR fibromyalgia) AND pain AND (polymorphism OR variant) NOT review," to which genes highlighted in overviews on pain genetics (e.g., Edwards, 2006) were added. The intersection of the queried genes with the set of 539 "pain genes" (see above) provided a subset of 13 genes (**Table 1**).

Finally, **subset 3** comprised genes that have consistently been included in human pain research projects over the last several years. One of them is the OPRM1 gene, which codes for the human µ-opioid receptor and which has been shown to modulate the time course of persisting cancer pain by delaying the necessity of opioid treatment (Lötsch et al., 2010). Further genes were also added, such as the GDNF gene coding for the glial cell derived neurotrophic factor, which has been shown to be involved in a glia-dependent mechanism of neuropathic pain (Wang et al., 2014), although no modulating human genetic variants have been reported so far. Following expert counseling within the EU-funded "glial-opioid interface in chronic pain, GLORIA" research consortium (Kringel and Lötsch, 2015)<sup>4</sup>, a subset of 30 genes (**Table 1**) was identified. Thus, the complete set as the union of the three subsets comprised 34 + 13 + 30 = 77 genes that are proposed to be included in an NGS panel of human persisting pain.
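
The set operations used for the gene selection (an intersection with the empirical "pain genes" and a union of the three subsets) can be sketched in Python as follows. The subset sizes of 34, 13 and 30 genes only add up to 77 if the subsets do not overlap, which the sketch checks explicitly. All gene identifiers and function names below are hypothetical placeholders, not the genes of the panel.

```python
from typing import Iterable, Set


def intersect_with_pain_genes(candidates: Iterable[str],
                              pain_genes: Iterable[str]) -> Set[str]:
    """Keep only the candidate genes that also belong to the pain gene set."""
    return set(candidates) & set(pain_genes)


def combine_subsets(*subsets: Set[str]) -> Set[str]:
    """Union of gene subsets, refusing subsets that share a gene."""
    panel: Set[str] = set()
    for subset in subsets:
        shared = panel & subset
        if shared:
            raise ValueError(f"genes present in more than one subset: {sorted(shared)}")
        panel |= subset
    return panel


if __name__ == "__main__":
    # Placeholder identifiers (assumption: illustration only).
    pain_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
    go_annotated = {"GENE_A", "GENE_B", "GENE_X"}
    subset_1 = intersect_with_pain_genes(go_annotated, pain_genes)
    subset_2 = {"GENE_C"}
    subset_3 = {"GENE_Y", "GENE_Z"}
    panel = combine_subsets(subset_1, subset_2, subset_3)
    # With disjoint subsets, the panel size equals the sum of the subset sizes.
    print(len(panel), len(subset_1) + len(subset_2) + len(subset_3))
```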

#### DNA Sample Origin

Due to the costs of assay development (for details, see second paragraph of the Discussion), the AmpliSeq™ panel was established in a limited number of n = 72 DNA samples. This corresponds to the number of samples used in comparable recent studies for NGS assay establishment and validation (Bruera et al., 2018; De Luca et al., 2018; Mustafa et al., 2018; Shah et al., 2018). To further limit the project costs, the AmpliSeq™ panel was established in a subset of samples originating from a clinical cohort of 1,000 women who had undergone breast cancer surgery (Kaunisto et al., 2013; Lötsch et al., 2018). The study followed the Declaration of Helsinki and was approved by the Coordinating Ethics Committee of the Helsinki University Hospital. Each participating subject had provided written informed consent that included genetic studies.

Specifically, for the presently reported method establishment, a subsample of 72 women (age 58.4 ± 8 years, mean ± standard deviation; weight 69.3 ± 11 kg) was drawn from the clinical subgroup not having developed persisting pain during the observation period. This was believed to come closer to a random sample than a mixture of patients with and without persisting pain. This limitation of the sample selection has probably affected which and how many variants were identified. However, it is unlikely to have jeopardized the general applicability of the gene selection heuristics, assay establishment and validation, and of the functional analysis of the selected subset of genes.

<sup>1</sup>http://amigo.geneontology.org/amigo/term/GO:0007611

<sup>2</sup>http://amigo.geneontology.org/amigo/term/GO:0007399

<sup>3</sup>http://www.jbldesign.com/jmogil/enter.html

<sup>4</sup>http://gloria.helsinki.fi

TABLE 1 | Genes included in the proposed NGS panel of persisting pain, combined from three subsets included on different bases.

Subset #1 comprises d = 34 genes that had resulted from a computational functional genomics analysis (Ultsch et al., 2016) pursuing the hypothesis that persisting pain displays systemic features of learning and neuronal plasticity (Mansour et al., 2014). Hence, from a set of genes identified empirically as relevant to pain and listed in the PainGenes database (http://www.jbldesign.com/jmogil/enter.html, Lacroix-Fralish et al., 2007), those were selected that are annotated to the Gene Ontology (Ashburner et al., 2000) terms "learning or memory" and "nervous system development." The references are those found to provide evidence for an association with pain, except for PTPRZ1, which was a novel finding in Ultsch et al. (2016). Subset #2 comprises d = 13 genes identified empirically as relevant to pain and listed in the PainGenes database (http://www.jbldesign.com/jmogil/enter.html, Lacroix-Fralish et al., 2007) and reported to carry variants that modulated the risk or the symptomatology in at least two different clinical settings of persisting pain. Subset #3 comprises d = 30 genes repeatedly shown during the last several years to play a role in the human genetics of persisting pain or recently reported as novel players.

#### DNA Template Preparation and Amplification

A multiplex PCR amplification strategy for the coding gene sequences was designed online (Ion AmpliSeq™ Designer)<sup>5</sup> to amplify the target region specified above (for primer sequences, see **Supplementary Table 1**) with 25 base pair exon padding. After a comparison of several primer design options, the design providing the maximum target sequence coverage was chosen. The ordered 1,953 amplicons covered approximately 97.5% of the target sequence (**Supplementary Table 2**). A total of 10 ng DNA per sample was used for the target enrichment by multiplex PCR, and each DNA pool was amplified with the Ion AmpliSeq™ Library Kit in conjunction with the Ion AmpliSeq™ "custom Primer Pool" protocols according to the manufacturer's procedures (Life Technologies, Darmstadt, Germany).

<sup>5</sup>http://www.ampliseq.com

After each pool had undergone 18 PCR cycles, the PCR primers were removed with FuPa Reagent and the amplicons were ligated to the sequencing adaptors with short stretches of index sequences (barcodes) that enabled sample multiplexing for subsequent steps (Ion Xpress™ Barcode Adapters Kit; Life Technologies). After purification with AMPure XP beads (Beckman Coulter, Krefeld, Germany), the barcoded libraries were quantified with a Qubit® 2.0 Fluorometer (Life Technologies, Darmstadt, Germany) and normalized for DNA concentration to a final concentration of 20 pmol/l using the Ion Library Equalizer™ Kit (Life Technologies, Darmstadt, Germany).

Equalized barcoded libraries from seven to eight samples at a time were pooled. To clonally amplify the library DNA onto the Ion Sphere Particles (ISPs; Life Technologies, Darmstadt, Germany), the library pool was subjected to emulsion PCR using an Ion PGM Hi-Q View Template Kit on a PGM OneTouch system (Life Technologies, Darmstadt, Germany) following the manufacturer's protocol.
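
As a rough illustration of the pooling scheme, the following Python sketch distributes the samples as evenly as possible over the sequencing runs. It assumes the 72 samples and ten runs reported in the Results; the function name is hypothetical, and the actual assignment of individual samples to runs is not described in the text.

```python
from typing import List


def samples_per_run(n_samples: int, n_runs: int) -> List[int]:
    """Spread n_samples over n_runs so that run sizes differ by at most one."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    base, extra = divmod(n_samples, n_runs)
    # The first `extra` runs carry one additional barcoded sample.
    return [base + 1] * extra + [base] * (n_runs - extra)


if __name__ == "__main__":
    layout = samples_per_run(72, 10)
    print(layout)       # two runs with 8 samples, eight runs with 7 samples
    print(sum(layout))  # all 72 samples are assigned
```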

#### Sequencing

Enriched ISPs, which carried many copies of the same DNA fragment, were sequenced on an Ion 318 Chip, with pooled libraries of seven to eight samples per chip. During this process, bases are inferred from the signals recorded by the chip's ion sensors as nucleotides are incorporated, a process commonly referred to as base-calling (Ledergerber and Dessimoz, 2011). The number of combined libraries that can be accommodated in a single sequencing run depends on the size of the chip, the balance of barcoded library concentration, and the coverage required. The high-capacity 318 chip was chosen (instead of the low-capacity 314 or the medium-capacity 316 chip) to obtain a high sequencing depth of coverage for a genomic DNA library with >95% of bases at 30×. Sequencing was performed using the sequencing kit (Ion PGM Hi-Q Sequencing Kit; Life Technologies, Darmstadt, Germany) as per the manufacturer's instructions with the 200 bp single-end run configuration. This kit contained the most advanced sequencing chemistry available to users of the Ion PGM System (Life Technologies, Darmstadt, Germany).

# Data Analysis

#### Bioinformatics Generation of Sequence Information

The raw data (unmapped BAM-files) from the sequencing runs were processed using Torrent Suite Software (Version 5.2.2, Life Technologies, Darmstadt, Germany) to generate read alignments which were filtered by the software into mapped BAM-files using the reference genomic sequence (hg19) of target genes. Variant calling was performed with the Torrent Variant Caller Plugin using as key parameters: minimum allele frequency = 0.15, minimum quality = 10, minimum coverage = 20 and minimum coverage on either strand = 3.
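
The key filtering thresholds listed above can be expressed as a simple predicate. This is a minimal sketch and not the Torrent Variant Caller itself: the record type, field names and the reading of each "minimum" as an inclusive lower bound are assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds as stated in the text.
MIN_ALLELE_FREQUENCY = 0.15
MIN_QUALITY = 10
MIN_COVERAGE = 20
MIN_STRAND_COVERAGE = 3


@dataclass(frozen=True)
class CandidateVariant:
    """Hypothetical record for one candidate variant (not a real plugin type)."""
    allele_frequency: float
    quality: float
    coverage: int
    forward_coverage: int
    reverse_coverage: int


def passes_filters(variant: CandidateVariant) -> bool:
    """Return True if the candidate meets all four minimum thresholds."""
    return (
        variant.allele_frequency >= MIN_ALLELE_FREQUENCY
        and variant.quality >= MIN_QUALITY
        and variant.coverage >= MIN_COVERAGE
        # "Coverage on either strand" is read as: both strands must reach the minimum.
        and min(variant.forward_coverage, variant.reverse_coverage) >= MIN_STRAND_COVERAGE
    )


if __name__ == "__main__":
    good = CandidateVariant(0.48, 35.0, 180, 95, 85)
    strand_biased = CandidateVariant(0.48, 35.0, 180, 179, 1)
    print(passes_filters(good), passes_filters(strand_biased))
```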

The annotation of called variants was done using the Ion Reporter Software (Version 4.4; Life Technologies, Darmstadt, Germany) for the VCF files that contained the nucleotide reads and the GenomeBrowse® software (Version 2.0.4, Golden Helix, Bozeman, MT, United States) to map the sequences to the reference sequences GRCh37 hg19 (dated February 2009). The SNP and Variation Suite software (Version 8.4.4; Golden Helix, Bozeman, MT, United States) was used for the analysis of sequence quality, coverage and for variant identification.

Based on the observed allelic frequency, the expected number of homozygous and heterozygous carriers of the respective SNP (single nucleotide polymorphism) was calculated using the Hardy-Weinberg equation. Only variants in Hardy-Weinberg equilibrium, as assessed using Fisher's exact test (Emigh, 1980), were retained.
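
The Hardy-Weinberg calculation can be sketched in Python as follows. The expected counts follow directly from the Hardy-Weinberg equation (p², 2pq, q²). The exact test shown here is the standard exact test for a biallelic locus conditional on the observed allele counts; whether it matches the implementation in the software used by the authors is an assumption, and all function names are hypothetical.

```python
from math import exp, lgamma, log
from typing import Tuple


def expected_genotype_counts(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float, float]:
    """Expected AA, AB and BB counts under Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    return n * p * p, 2 * n * p * q, n * q * q


def _log_factorial(k: int) -> float:
    return lgamma(k + 1)


def _log_prob_het(n_ab: int, n_a: int, n_b: int) -> float:
    """Log probability of n_ab heterozygotes given the allele counts."""
    n = (n_a + n_b) // 2
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    return (
        n_ab * log(2)
        + _log_factorial(n) - _log_factorial(n_aa)
        - _log_factorial(n_ab) - _log_factorial(n_bb)
        + _log_factorial(n_a) + _log_factorial(n_b)
        - _log_factorial(2 * n)
    )


def hwe_exact_p_value(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    observed = _log_prob_het(n_ab, n_a, n_b)
    p_value = 0.0
    # Heterozygote counts must have the same parity as the allele counts.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        log_prob = _log_prob_het(het, n_a, n_b)
        if log_prob <= observed + 1e-9:
            p_value += exp(log_prob)
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical genotype counts for 72 subjects (illustration only).
    print(expected_genotype_counts(4, 28, 40))
    print(hwe_exact_p_value(4, 28, 40))
```

A variant would then be retained when the p-value exceeds the chosen significance level; the level applied in the study is not stated in this section.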

# Assay Validation

Method validation was accomplished by means of Sanger sequencing (Sanger and Coulson, 1975; Sanger et al., 1977) in an independent external laboratory (Eurofins Genomics, Ebersberg, Germany). As performed previously with different AmpliSeq™ panels (Kringel et al., 2017) and other genotyping assays (Skarke et al., 2004, 2005), four DNA samples were chosen randomly from an independent cohort of healthy subjects and sequenced with the current NGS panel. For the detected variant type, single nucleotide polymorphisms from five different genomic regions for which clinical associations have been reported (**Table 2**), i.e., rs324420 (FAAH), rs333970 (CSF1), rs4986790 (TLR4), rs4633 (COMT), and rs17151558 (RELN), were chosen for external sequencing. Amplification of the respective DNA segments was done using PCR primer pairs (forward, reverse) of (i) 5′-TTTCTTAAAAAGGCCAGCCTCCT-3′ and 5′-AATGACCCAAGATGCAGAGCA-3′, (ii) 5′-GCCTTCAACCCCGGGATGG-3′ and 5′-CTCCGATCCCTGGTGCTCCTC-3′, (iii) 5′-TTTATTGCACAGACTTGCGGGTTC-3′ and 5′-AGCCTTTTGAGAGATTTGAGTTTCA-3′, (iv) 5′-CCTTATCGGCTGGAACGAGTT-3′ and 5′-GTAAGGGCTTTGATGCCTGGT-3′, and (v) 5′-GTTATTCCTCTGTAAGCAGCTGCCT-3′ and 5′-TGTTTGTTTTAGATTGTGGTGGGTT-3′. Results of Sanger sequencing were aligned with the genomic sequence and analyzed using Chromas Lite® (Version 2.1.1, Technelysium Pty Ltd, South Brisbane, QLD, Australia), and the GenomeBrowse® software (Version 2.0.4, Golden Helix, Bozeman, MT, United States) was used to compare the sequences obtained with NGS or Sanger techniques.

# RESULTS

The NGS assay of the proposed set of 77 human genes relevant to persisting pain was established in 72 genomic DNA samples. As applied previously (Kringel et al., 2017), only exons, including 25 bases of padding around all targeted coding regions, for which the realized read depth for each nucleotide was higher than 20 were considered successfully analyzed. With this acceptance criterion, complete or almost complete coverage of the relevant sequences was obtained (**Table 1**; for details on missing variants, see **Supplementary Table 3**). The NGS sequencing process of the whole patient cohort required ten separate runs, each with samples of n = 7 or n = 8 patients. Coverage statistics were similar across all runs and matched the scope of accepted quality levels [20–22]. A median of 2.85 · 10<sup>6</sup> reads per run was produced. The mean depth was close to 200 reads, the mean read length was 205 called bases, and the average chip loading was 71% (**Figure 1A**). To establish a sequencing output with a high density of ISPs on a sequencing chip, the chip loading value should exceed 60% (Life Technologies, Carlsbad, United States). The results of all NGS runs matched the results obtained with Sanger sequencing of random samples (**Figure 1B**), meaning that the concordance of nucleotide sequences between NGS and Sanger sequencing was 100% in all validated samples.
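
The acceptance criterion described above (every nucleotide of a padded exon covered by more than 20 reads) can be written as a short check. This is a minimal sketch, assuming that per-base read depths are available as a list of integers; the function names and the example depths are hypothetical.

```python
from typing import Sequence


def region_passes(depths: Sequence[int], min_depth: int = 20) -> bool:
    """True if every position in the region has a read depth above min_depth."""
    return len(depths) > 0 and all(depth > min_depth for depth in depths)


def fraction_above_depth(depths: Sequence[int], min_depth: int = 20) -> float:
    """Fraction of positions whose read depth is above min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for depth in depths if depth > min_depth) / len(depths)


if __name__ == "__main__":
    # Hypothetical per-base depths for one padded exon (illustration only).
    exon_depths = [180, 210, 195, 22, 19, 230]
    print(region_passes(exon_depths))         # one base is at or below 20 reads
    print(fraction_above_depth(exon_depths))  # share of bases above the threshold
```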

TABLE 2 | A list of coding human variants in the 77 putative chronic pain genes, found in the present random sample of 72 subjects of Caucasian ethnicity, for which clinical associations have been reported.



The selection is restricted to one or two publications per variant, and it focuses on a pain context corresponding to the main aim of the present NGS gene panel; however, functional variants highlighted in another clinical context are additionally provided in the lower part of the table. #Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD, United States): National Center for Biotechnology Information, National Library of Medicine. Available from: http://www.ncbi.nlm.nih.gov/SNP/ (Sherry et al., 2001).

Following elimination of nucleotides agreeing with the standard human genome sequence GRCh37 g1k (dated February 2009), the result of the NGS consisted of a vector of nucleotide information about the d = 77 genes for each individual DNA sample (**Figure 2**). This vector had a length equaling the set union of the number of chromosomal positions in which a nonreference nucleotide had been found in any probe of the actual cohort. Specifically, a total of 3,185 genetic variants was found, of

FIGURE 1 | Assay establishment and validation. (A) Pseudo-color image of the Ion 318TM v2 Chip plate showing percent loading across the physical surface. This sequencing run had a 76% loading, which ensures a high Ion Sphere Particles (ISP) density. Every 318 chip contains 11 million wells and the color scale on the right side conduces as a loading indicator. Deep red coloration stays for a 100% loading, which means that every well in this area contains an ISP (templated and non-templated) whereas deep blue coloration implies that the wells in this area are empty. (B) Alignment of a segment of the ion torrent sequence of the COMT gene as a Golden Helix Genome Browse <sup>R</sup> readout versus the same sequence according to an externally predicted Sanger electropherogram. Highlighted is the COMT variant rs4633 (COMT c.186C>T → p.His62 = ) as a heterozygous mutation and a non-mutated wild type. The SNP is part of the functional COMT haplotype comprising rs4633, rs4818 and rs4680, which showed >11-fold difference in expressed enzyme activity and was reported to be associated with different phenotypes of pain sensitivity (Diatchenko et al., 2005).

which 659 were located in coding parts of the genes, 1,241 were located in introns and 1,285 in the 3′-UTR, 5′-UTR, upstream or downstream regions. The coding variants for which a clinical or phenotypic association has been reported are listed in **Table 2** together with an example of each variant. Most of the observed coding variants were single nucleotide polymorphisms (d = 571), whereas mixed polymorphisms (d = 26), nucleotide insertions (d = 18) or nucleotide deletions (d = 44) were found more rarely.
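The per-sample vector described above, defined over the union of all positions at which any sample carried a non-reference nucleotide, can be illustrated with a minimal sketch. The sample names, chromosomal positions and genotype strings are toy values chosen for illustration, and `build_variant_matrix` is a hypothetical helper, not part of the authors' pipeline.

```python
from typing import Dict, List, Tuple

# A variant site is identified by (chromosome, position); all values are toy data.
Site = Tuple[str, int]


def build_variant_matrix(calls: Dict[str, Dict[Site, str]]):
    """Return the sorted union of non-reference sites and one vector per sample.

    Each vector has one entry per site in the union; samples without a
    non-reference call at a site receive the placeholder "ref".
    """
    sites: List[Site] = sorted({site for sample in calls.values() for site in sample})
    matrix = {
        sample: [alleles.get(site, "ref") for site in sites]
        for sample, alleles in calls.items()
    }
    return sites, matrix


# Hypothetical non-reference calls for three samples.
calls = {
    "sample_1": {("chr22", 100): "C/T", ("chr6", 200): "A/G"},
    "sample_2": {("chr22", 100): "T/T"},
    "sample_3": {},
}
sites, matrix = build_variant_matrix(calls)
for sample, vector in matrix.items():
    print(sample, vector)
```

Every vector has the same length as the union of sites, so a sample without any non-reference call is still represented at each position.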

#### DISCUSSION

In this report, development and validation of a novel AmpliSeq™ NGS assay for the coding regions and boundary parts of d = 77 genes qualifying as candidate modulators of persisting pain is described. The NGS assay produced nucleotide sequences that corresponded, with respect to the selected validation probes, to the results of classical Sanger sequencing. However, the NGS assay substantially reduced the laboratory effort to obtain the genetic information and provides the prerequisites to be used in high throughput environments. In particular, the presented NGS assay is convenient for small- to large-scale setups. As mentioned in the methods section, a limitation of the present results applies to the identified genetic variants, as only samples from Caucasian women were included. By contrast, the validity of gene selection and assay establishment is unlikely to be reduced by this selection, which was chosen to remain within the financial limits of the present project.

As observed previously (Kringel et al., 2017), the comprehensive genetic information and the high throughput are reflected in the assay costs. Specifically, sequencing of the 77 genes in 72 DNA samples required approximately € 18,000 for the AmpliSeq™ custom panel, € 5,500 for library preparation, € 700 for template preparation and € 700 for sequencing. Ten 318 sequencing chips cost around € 7,000, and basic consumables and laboratory supplies added approximately € 800. With 7–8 barcoded samples loaded on each of the ten chips, the expense to analyze the gene sequences for a single patient was around € 325. While NGS costs are likely to decrease in the near future (Lohmann and Klein, 2014), the present assay establishment was therefore applied in DNA samples planned for future genotype versus phenotype association analysis, which required using DNA from patients of a pain-relevant cohort instead of from a true random sample of healthy subjects.

As a result of the present assay development, a set of d = 77 genes was chosen as potentially relevant to persisting pain. The chosen set of genes differs from alternative proposals aiming at similar phenotypes (Mogil, 2012; Zorina-Lichtenwalter et al., 2016). However, when analyzing these alternatives for mutual agreement, only limited overlap could be observed (**Figure 3**). This emphasizes that the genetic architecture of persisting pain is incompletely understood, and several independent lines of research can be pursued. Of note, the present set showed the largest agreement with a set of d = 539 genes identified empirically as relevant to pain and listed in the PainGenes database (Lacroix-Fralish et al., 2007) <sup>6</sup> or recognized as causing human hereditary diseases associated with extreme pain phenotypes (Lötsch et al., 2013; Ultsch et al., 2016). Combining all proposals into a large panel was not an option due to the technical limitations of the IonTorrent restricting the panel size to 500 kb (pipeline version 5.6.2); therefore, further genes would need to be addressed in separate panels.

In the present study sample, selected with a certain bias by using, as explained above for cost saving, clinical samples from only women and only Caucasians, a total of 659 genetic coding variants were found. Regardless of the sample preselection, 105 clinical associations (**Table 2**) could be queried for the observed variants from openly obtainable data sources comprising (i) the

<sup>6</sup>http://www.jbldesign.com/jmogil/enter.html

included in the assay. The vertical size of the cells is proportional to the number of variants of a particular type; the horizontal size of the cells is proportional to the number of variants found in the respective gene. The location of the variants is indicated at the left of the mosaic plot in letters colored similarly to the respective bars in the mosaic plot. Variants were not found at all possible locations of each gene, which causes the reduction of several bars to dashed lines drawn as placeholders and indicating that at the particular location no variant has been found in the respective gene. The figure has been created using the R software package (version 3.4.2 for Linux; http://CRAN.R-project.org/, R Development Core Team, 2008). UTR: untranslated region. NCExonic: Non-coding exonic.

FIGURE 3 | Venn diagram (Venn, 1880) visualizing the intersections between the presently proposed set of human genes involved in modulating the risk or the clinical course of persisting pain ("Current set," green frame), and two alternative proposals ["Mogil" (Mogil, 2012), blue frame and "Zorina-Lichtenwalter" (Zorina-Lichtenwalter et al., 2016), violet frame]. In addition, a set of d = 539 genes is shown that were identified empirically as relevant to pain and either listed in the PainGenes database (http://www.jbldesign.com/jmogil/enter.html, Lacroix-Fralish et al., 2007) or added because they were recognized as causing human hereditary diseases associated with extreme pain phenotypes, found to be regulated in chronic pain in at least three studies including human association studies, or are targets of novel analgesics. The number of shared genes between data sets is numerically shown in the respective intersections of the Venn diagram. The figure has been created using the R software package (version 3.4.2 for Linux; http://CRAN.R-project.org/, R Development Core Team, 2008) with the particular package "Vennerable" (Swinton J., https://r-forge.r-project.org/R/?group\_id=474).

Online Mendelian Inheritance in Man (OMIM®) database<sup>7</sup>, (ii) the NCBI gene index database<sup>8</sup>, (iii) the GeneCards database<sup>9</sup> [27] and (iv) the "1000 Genomes Browser"<sup>10</sup> (all accessed in December 2017). The observation of functional variants in the present cohort preselected for the absence of pain persistence is plausible as (i) variants can exert protective effects against chronic pain and (ii) most genetic variants identified so far exert only small effects on pain, and the individual result of their functional modulations depends on their combined effects or on the sum of positive and negative effects on pain perception (Lötsch et al., 2009a).

The selection of genes (**Table 1**) relied on empirical evidence of their involvement in pain. For subset #1 (d = 34), this had been shown for 33 genes in the original paper (Ultsch et al., 2016). As the hypothesis that persisting pain displays systemic features of

<sup>9</sup>http://www.genecards.org

learning and of neuronal plasticity (Mansour et al., 2014) could be substantiated at a computational functional genomics level, the further gene (PTPRZ1, protein tyrosine phosphatase Z 1) can also be regarded as supported by prior knowledge to be included in the present set. The subset comprised, for example, genes associated with the mesolimbic dopaminergic system, i.e., DRD1, DRD2, DRD3, which code for dopamine receptors, and TH, which is the coding gene for tyrosine hydroxylase, a rate-limiting enzyme in dopaminergic pathways, which have been implicated in promoting chronic back pain (Hagelberg et al., 2003, 2004; Jaaskelainen et al., 2014; Martikainen et al., 2015). A further 14 genes were involved in the circadian rhythm, recognized as a modulatory factor in various pain conditions such as arthritis (Haus et al., 2012; Gibbs and Ray, 2013) and neuropathic pain (Gilron and Ghasemlou, 2014). The subset further included three NMDA receptor genes (GRIN1, GRIN2A, and GRIN2B) known to be major players in a number of essential physiological functions including neuroplasticity (Coyle and Tsai, 2004). In addition, metabotropic glutamate receptors (mGluR) have been implicated in several chronic pain conditions. One subtype, mGluR5, coded by GRM5, is of particular interest in the context of pain conditions, as recent studies showed a pro-nociceptive role of mGluR5 in models of chronic pain (Walker et al., 2001; Crock et al., 2012). Furthermore, genes associated with histaminergic signaling such as HRH3 have been implicated in pain transmission (Hough and Rice, 2011) and analgesia (Huang et al., 2007).

The second subset of genes relied on a new PubMed search rather than on a previously published and hypothesis-based selection of candidate genes. A computational functional genomics analysis of this subset (details not shown) suggested its involvement in (i) immune processes and (ii) nitric oxide signaling. The genes annotated to the GO term "immune system process" included interleukin (IL1B, IL4, IL6, IL10) (Dinarello, 1994; Choi and Reiser, 1998; Mocellin et al., 2004; Nemeth et al., 2004) and histocompatibility complex related (HLA-B) genes (Dupont and Ceppellini, 1989), which have been shown to be involved in immunological mechanisms of pain (Sato et al., 2002; de Rooij et al., 2009). This is also supported by published evidence for the further genes in this list, such as TNF (Vassalli, 1992; Franchimont et al., 1999), GCH1 (Schott et al., 1993) and P2RX7 (Chen and Brosnan, 2006). The second major process group emerging from the functional genomics analysis of the key evidence for genetic modulation of clinical chronic pain was nitric oxide signaling, in particular metabolic processes, summarized in this context under the GO term "reactive oxygen species metabolic process", which includes the genes IL6 (Deakin et al., 1995), TNF (Deakin et al., 1995; Katusic et al., 1998), ESR1 (Clapauch et al., 2014), IL10 (Cattaruzza et al., 2003), GCH1 (Katusic et al., 1998; Zhang et al., 2007), IL1B (Katusic et al., 1998), IL4 (Coccia et al., 2000), P2RX7 (Gendron et al., 2003) and SOD2 (Fridovich, 1978). Furthermore, catecholamines including noradrenaline, adrenaline and dopamine have multiple functions in the brain and spinal cord including pain perception and processing (D'Mello and Dickenson, 2008). Catechol-O-methyltransferase, encoded by the COMT gene, is one of several enzymes that degrade dopamine, noradrenaline and adrenaline

<sup>7</sup>http://www.ncbi.nlm.nih.gov/omim

<sup>8</sup>http://www.ncbi.nlm.nih.gov/gene

<sup>10</sup>https://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

and has become one of the most frequently addressed genes in pain research (Nackley et al., 2006).

Finally, subset #3 (d = 30) consists of genes repeatedly shown to play a role in the genetic modulation of persisting pain in humans or, by contrast, includes a few novel items only recently published in the context of pain. This included members of the transient receptor potential (TRP) family (TRPA1, TRPM8, TRPV4) that are expressed at nociceptors and are well-established players in the perception of pain via their excitation by chemical, thermal or mechanical stimuli (Clapham, 2003). This similarly applies to the opioidergic system, represented by the inclusion of the genes coding for the major opioid receptors (OPRM1, OPRK1, OPRD1), which have been associated with variations in pain or opioid response in various settings (Lötsch and Geisslinger, 2005). The most important of this group, the µ-opioid receptor encoded by the OPRM1 gene, carries several variants, of which the 118 A>G (rs1799971) has been studied most extensively since the early description of its association with a functional phenotype in humans (Lötsch et al., 2002).

Almost half of the present set of genes was chosen based on a computational functional genomics analysis that attributed persisting pain to the GO processes "learning or memory" and "nervous system development" (Ultsch et al., 2016) as likely to reflect systemic features of persisting pain. This implied a functional bias; therefore, the present set of d = 77 genes (**Figure 4**) was analyzed to determine whether this bias prevailed when comparing it with the alternative sets of human genes proposed to modulate persisting pain (Mogil, 2012; Zorina-Lichtenwalter et al., 2016). As applied previously (Lippmann et al., 2018), the biological roles of the set of d = 77 genes were queried from the Gene Ontology knowledgebase (GO)<sup>11</sup> (Ashburner et al., 2000), where the knowledge about the biological processes, the molecular functions and the cellular components of genes is formulated using a controlled and clearly defined vocabulary of GO terms. Particular biological roles of the set of d = 77 genes, among all human genes, were analyzed by

<sup>11</sup>http://www.geneontology.org/

TABLE 3 | Current targeting of the genes included in the proposed NGS panel of persisting pain by novel drugs that are currently under active clinical development and include analgesia as the main clinical target or at least as one of the intended clinical indications.




The information was queried from the Thomson Reuters Integrity database at https://integrity.thomson-pharma.com on July 11, 2018.

means of over-representation analysis (ORA). This compared the occurrence of the particular GO terms associated with the present set of genes with their expected occurrence by chance (Backes et al., 2007). In contrast to enrichment analysis, any quantitative criteria such as gene expression values are disregarded (Backes et al., 2007). The analyses were performed using our R library "dbtORA" (Lippmann et al., 2018)<sup>12</sup> on the R software environment (version 3.4.2 for Linux; R Development Core Team, 2008)<sup>13</sup>.
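The principle of over-representation analysis can be sketched with an upper-tail hypergeometric test, which asks how likely it is to observe at least the given number of annotated genes in a gene set by chance. This is a minimal illustration and not the dbtORA implementation; the function name `ora_p_value` and all gene counts are assumptions chosen for the example, and no correction for multiple testing is applied.

```python
from math import comb


def ora_p_value(n_universe: int, n_annotated: int, n_set: int, n_overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe  : number of genes in the background (e.g., all annotated human genes)
    n_annotated : background genes annotated to the GO term of interest
    n_set       : size of the gene set under study
    n_overlap   : genes of the set that are annotated to the GO term
    """
    upper = min(n_annotated, n_set)
    total = comb(n_universe, n_set)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20,000 background genes, 100 annotated to a term,
# a set of 77 genes of which 10 carry the annotation.
expected = 77 * 100 / 20000
p = ora_p_value(20000, 100, 77, 10)
print(f"expected overlap by chance: {expected:.3f}, p = {p:.3e}")
```

In practice, one such test is carried out per GO term, so the resulting p-values require an α-correction before terms are reported as over-represented.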

Surprisingly, the results of this analysis indicated that the functional bias of the present gene set toward "learning or memory" (GO:0007611) and "nervous system development" (GO:0007399) was not maintained against the alternative gene sets. Instead, a few more general GO terms such as "behavior" ("single organism behavior," GO:0044708), "response to organic cyclic compound" (GO:0014070) and "response to alkaloid" (GO:0043279), which could be identified as morphine and cocaine when repeating the analysis with a less conservative α-correction (further details not shown), were overrepresented, as well as the pain-specific term "sensory perception of pain" (GO:0019233). A possible explanation for why the selection bias of the present gene set was not maintained when comparing it with alternative proposals is that the two biological processes, "learning or memory" and "nervous system development," indeed reflect an important biological function of persisting pain, so that even when candidate genes are chosen without these processes in mind, as for the alternative gene sets, they are nevertheless included. This may be regarded as support for the present gene set as suitable candidates for future association studies with persisting pain phenotypes.

Although the present gene set has been assembled with a focus on relevance to pain, many of its members have pharmacological implications. Specifically, 58 of the 77 genes (75%) have been chosen as targets of analgesics, approved or under current clinical development (**Table 3**). Moreover, several of the genes in the present NGS panel have been implicated in pharmacogenetic modulations of drug effects (**Table 4**). Possibly the most widely studied gene in analgesic research is OPRM1 because it codes for the primary target of opioids (Peiro et al., 2016). Several polymorphisms have been described in OPRM1, among which the best characterized may be rs1799971 (OPRM1 118A>G), which leads to an asparagine to aspartate substitution at the extracellular terminal of the receptor protein (Bond et al., 1998). Many studies have addressed this variant (for reviews, see Walter et al., 2013; Somogyi et al., 2015).

<sup>12</sup>https://github.com/IME-TMP-FFM/dbtORA

<sup>13</sup>http://CRAN.R-project.org/


TABLE 4 | Summary of variants in genes included in the proposed NGS panel of persisting pain, that have been implicated in a pharmacogenetic context to modulate the effects of drugs administered for the treatment of pain or as disease modifying therapeutics in painful disease.

The information was derived by literature search and by querying the Pharmacogenetics Research Network/Knowledge base at http://www.pharmgkb.org (accessed in July 2018). Only key or example references are given.

Summarizing its effects, the variant is associated with decreased receptor expression and signaling efficiency (Oertel et al., 2012), which leads to reproducibly reduced pharmacodynamic effects in human experimental settings, while the effect size seems insufficient to be a major factor of opioid response in clinical settings, despite several reports of modulations of opioid demands or side effects. For example, subjects carrying the 118A>G variant were found to have a reduced response to morphine treatment (Hwang et al., 2014), a reduced analgesic response to alfentanil (Oertel et al., 2006) and demanded higher doses of morphine for pain relief (Klepstad et al., 2004; Hwang et al., 2014). However, the importance of this variant seems to be comparatively high in patients with an Asian ethnic background, which might be related to the higher allelic frequency as compared to other ethnicities. COMT is a key modulator of dopaminergic neurotransmission and of the signaling response to opioids. The Val158Met polymorphism (rs4680) causes an amino acid substitution in the enzyme, which reduces the enzyme activity to a fourth (Peiro et al., 2016). Carriers of the homozygous Met/Met variant had lower morphine requirements than those with the wild type COMT (Rakvag et al., 2005). Furthermore, a modulation of the effects of TRPV1 targeting analgesics is supported by observations that intronic TRPV1 variants were associated with insensitivity to capsaicin (Park et al., 2007), while the coding TRPV1 variant rs8065080 was associated with altered responses to experimentally induced pain (Kim et al., 2004). Moreover, gain-of-function mutations in TRPV1 have been associated with increased pain sensitivity (Boukalova et al., 2014), for which TRPV1 antagonists would enable a specific pharmacogenetics-based personalized cure.

### CONCLUSION


The breakthrough in mapping the whole human genome (Lander et al., 2001; Venter et al., 2001) along with genome wide association studies (GWAS) has led to rapid advances in the knowledge of the genetic bases of human diseases (Wellcome Trust Case Control and Consortium, 2007). Genetic research in pain medicine has led to the recognition of genes in which variants influence pain behavior, post-operative drug requirements, and the temporal developments of pain toward persistence (James, 2013). While many candidate gene association studies have identified multiple genes relevant for pain phenotypes (Fillingim et al., 2008), pain related genetic studies have so far been dominated by investigations of a limited number of genes. Roughly ten genes or gene complexes account for over half of the extant findings and several of these candidate gene associations have held up in replication (Mogil, 2012). The selection of variants has been limited, and the same variants have been addressed repeatedly in most studies, leading to the perception that genetic research in pain often produces unsatisfactory results (Mogil, 2009). However, this may soon change with the rise of new technologies. In this manuscript, we present a validated NGS assay for a set of 77 genes supported by empirical evidence and computational functional genomics analyses as relevant factors modulating the risk for persisting pain or its clinical picture.

#### REFERENCES

#### AUTHOR CONTRIBUTIONS

JL, DK, and EK conceived and designed the experiments. DK performed the experiments. JL and DK analyzed the data and wrote the paper. CL provided methodological expertise and bioinformatical tools. DK and JL interpreted the results. EK and MK provided DNA samples.

#### FUNDING

This work has been funded by the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 602919 ("GLORIA", EK and JL) and the Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz (LOEWE), LOEWE-Zentrum für Translationale Medizin und Pharmakologie (JL). These public funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.01008/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kringel, Kaunisto, Lippmann, Kalso and Lötsch. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational Methods for the Pharmacogenetic Interpretation of Next Generation Sequencing Data

Yitian Zhou<sup>1</sup> , Kohei Fujikura<sup>2</sup> , Souren Mkrtchian<sup>1</sup> and Volker M. Lauschke<sup>1</sup> \*

<sup>1</sup> Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden, <sup>2</sup> Department of Diagnostic Pathology, Kobe University Graduate School of Medicine, Kobe, Japan

Up to half of all patients do not respond to pharmacological treatment as intended. A substantial fraction of these inter-individual differences is due to heritable factors, and a growing number of associations between genetic variations and drug response phenotypes have been identified. Importantly, the rapid progress in Next Generation Sequencing technologies in recent years unveiled the true complexity of the genetic landscape in pharmacogenes with tens of thousands of rare genetic variants. As each individual was found to harbor numerous such rare variants, they are anticipated to be important contributors to the genetically encoded inter-individual variability in drug effects. The fundamental challenge, however, is their functional interpretation, as the sheer scale of the problem renders systematic experimental characterization of these variants currently unfeasible. Here, we review concepts and important progress in the development of computational prediction methods that allow evaluation of the effect of amino acid sequence alterations in drug metabolizing enzymes and transporters. In addition, we discuss recent advances in the interpretation of functional effects of non-coding variants, such as variations in splice sites, regulatory regions and miRNA binding sites. We anticipate that these methodologies will provide a useful toolkit to facilitate the integration of the vast extent of rare genetic variability into drug response predictions in a precision medicine framework.

Keywords: precision medicine, personalized medicine, variant effect prediction, ADME, NGS, rare variant analysis, noncoding variation, pharmacogenomics

#### INTRODUCTION

Inter-individual differences in drug response are clinically important phenomena that result in reduced efficacy or adverse reactions in 25–50% of all patients and genetic factors have been estimated to account for around 20–30% of these (Spear et al., 2001; Sim et al., 2013). Fueled by technological advances in Next-Generation Sequencing (NGS) technologies, the application of comprehensive sequencing approaches is on the rise for various applications, including studies of biodiversity, population genetics and biomedical research (Levy and Myers, 2016). Furthermore, plummeting costs to <1,000 USD per human genome and increasing worldwide sequencing capacities that we estimate to exceed 100 petabases per year (10<sup>15</sup> bases corresponding to the size of around 100,000 human genomes) open tremendous possibilities for NGS to revolutionize precision medicine.

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Theodora Katsila, University of Patras, Greece Greg Slodkowicz, MRC Laboratory of Molecular Biology (MRC), United Kingdom

\*Correspondence:

Volker M. Lauschke volker.lauschke@ki.se

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology

Received: 03 August 2018 Accepted: 20 November 2018 Published: 04 December 2018

#### Citation:

Zhou Y, Fujikura K, Mkrtchian S and Lauschke VM (2018) Computational Methods for the Pharmacogenetic Interpretation of Next Generation Sequencing Data. Front. Pharmacol. 9:1437. doi: 10.3389/fphar.2018.01437


Strikingly, these massive NGS data sets revealed that individuals harbored on average more than 3.7 million single nucleotide variants (SNVs) and more than 350,000 insertions and deletions across different populations, emphasizing the substantial variability of the human genome (The 1000 Genomes Project Consortium, 2012). Particularly genes involved in drug absorption, distribution, metabolism and excretion (ADME) proved to be highly diverse and genetically complex (Fujikura et al., 2015; Bush et al., 2016; Kozyra et al., 2017). Across 208 ADME genes more than 69,000 SNVs have been described, 98.5% of these being rare with minor allele frequencies (MAF) <1% (Ingelman-Sundberg et al., 2018). The overall pharmacogenetic variability was highly population specific, particularly for isolated populations, such as Ashkenazi Jews (Ahn and Park, 2017; Kozyra et al., 2017; Zhou and Lauschke, 2018). Given this enormous pharmacogenetic variability, one of the key frontiers of contemporary pharmacogenomics is the translation of these comprehensive genomic data into clinically actionable treatment recommendations (Lauschke and Ingelman-Sundberg, 2016a, 2018).

Heterologous expression in cell lines followed by quantitative determination of gene product functionality using appropriate end points is considered as the gold standard strategy to characterize the functional impact of pharmacogenetic variants. Furthermore, epidemiological association studies can provide additional indications about the consequences of genetic variants on drug metabolism related phenotypes in vivo. However, for the functional interpretation of rare variants these approaches suffer from multiple shortcomings:


Thus, in the absence of viable experimental strategies, computational prediction methodologies are routinely used to predict the functional impact of genetic variants. Most of these algorithms focus on predicting the functional consequences of variants that result in amino acid substitutions. However, recently much progress has also been made regarding the interpretation of non-coding variants that affect splice sites, promoters, enhancers or miRNA binding sites (**Figure 1**).

Prediction algorithms are generally trained on pathogenic variant sets and most tools base their conclusions, at least in part, on the evolutionary conservation of the respective sequence. Importantly, however, pharmacogenes are hallmarked by low evolutionary conservation and are generally not associated with human disease. These peculiarities result in specific problems for the interpretation of pharmacogenetic variants. Here, we provide an updated overview of computational approaches for the functional interpretation of genetic variants, specifically focusing on their suitability for pharmacogenetic predictions. We describe the underlying statistical frameworks and discuss their different bases for decision-making. Furthermore, we highlight important progress particularly in the interpretation of noncoding genetic variability. We conclude that computational tools are essential for the functional interpretation of an individual's pharmacogenotype and that their further improvement constitutes one of the most important frontiers for the clinical implementation of NGS-based genotyping.

# INTERPRETATION OF VARIANTS RESULTING IN AMINO ACID EXCHANGES

Genetic variants that result in amino acid substitution, henceforth termed missense variants, can impact the functionality of the respective protein by various mechanisms, including alterations in active sites, structural destabilization due to protein misfolding, perturbations in solvent accessibility or modification of post-translational processing. Each individual harbors 10,000–12,000 missense variants, many of which are rare (The 1000 Genomes Project Consortium, 2015). These rare variants have been suggested as important modulators of complex disease risk (Kryukov et al., 2007) and inter-individual differences in drug response (Kozyra et al., 2017). Among all variant classes, missense variants are the most extensively studied and a plethora of computational methods is available for their functional interpretation. Conceptually, these algorithms predict the functional impact of missense variants based on sequence information, primarily evolutionary conservation of the respective residues, and/or structural information of the corresponding gene product. In the following, we highlight recent progress, provide an overview of available tools and discuss their utility for pharmacogenetic predictions. For methodological details we refer the interested reader to excellent recent reviews (Ng and Henikoff, 2006; Peterson et al., 2013; Tang and Thomas, 2016).

# Predictions Based on Sequence Information

Evolutionary conservation scores are calculated by analyzing the evolutionary variation dynamics of DNA or amino acid sequences among homologs with the hypothesis that the extent of conservation is a strong predictor of the importance of the respective sequence for structure and function of the corresponding gene product. Thus, positions with a high evolutionary rate are thought to be dispensable, whereas slowly evolving, i.e., conserved sequences indicate a selective pressure against variation in these regions and thus deleterious effects if mutated.
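One simple way to quantify per-position conservation from a multiple sequence alignment is a normalized Shannon entropy score. The sketch below is only an illustration of the general idea under stated assumptions: the alignment is a toy example, the helper `column_conservation` is hypothetical, and published tools use considerably more elaborate models (e.g., sequence weighting and phylogeny-aware substitution rates).

```python
from collections import Counter
from math import log2


def column_conservation(alignment):
    """Return one conservation score per alignment column.

    The score is 1 - H / log2(20), where H is the Shannon entropy of the
    residue distribution in the column and 20 is the number of standard
    amino acids. Gap characters ("-") are ignored. A score of 1.0 means
    that the column is invariant across the aligned homologs.
    """
    scores = []
    for i in range(len(alignment[0])):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:  # column consisting of gaps only
            scores.append(0.0)
            continue
        total = len(residues)
        entropy = -sum(
            (count / total) * log2(count / total)
            for count in Counter(residues).values()
        )
        scores.append(1.0 - entropy / log2(20))
    return scores


# Toy alignment of four hypothetical homologous peptide sequences.
alignment = ["MKVLA", "MKILA", "MRVLS", "MKLLT"]
scores = column_conservation(alignment)
print([round(s, 3) for s in scores])
```

Under the hypothesis stated above, a substitution at an invariant column (score 1.0) would be considered more likely to be deleterious than one at a variable column.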

Evolutionary conservation as a metric to distinguish deleterious from neutral variants is considered by most computational prediction algorithms. The majority of approaches that focus on the functional interpretation of missense variants utilize amino acid sequence alignment, whereas others utilize nucleotide sequence alignments or a combination of both methods (**Table 1**). While alignment of amino acid sequences proved to be effective for the analysis of missense variants, genomic sequence alignments provide additional versatility and allow functional interpretations to be extended to variant classes that do not alter the amino acid sequence, such as synonymous and regulatory variants. Notably, commonly used conservation-based functionality predictors do not consider sequence interdependencies. Explicit integration of residue dependency information obtained from multiple sequence alignments was, however, recently shown to improve predictive performance (Hopf et al., 2017), emphasizing the added value of complementing conservation-based functionality predictions with variant interaction data.

On the basis of multiple sequence alignments, algorithms derive their functionality predictions either based on direct theoretical models, or by various machine-learning approaches. The former methods predict the functional impact of variants based on phenomenological scores derived from theoretical models that are known a priori. In contrast, machine learning methods search for patterns in multi-dimensional training data sets consisting of labeled deleterious and benign variations, which will then be used as the basis to generate predictions on new unlabeled data. Machine learning approaches include support vector machines, random forests, artificial neural networks, naive Bayes approaches, gradient tree boosting and regression models. With increasing wealth of large-scale data sets to learn from, machine learning methods become increasingly popular as versatile tools to generate predictive models in many areas of biomedicine (Camacho et al., 2018).

Commonly used algorithms are generally designed to flag deleterious variants, which are mostly assumed to result in reduced gene product function, and their performance on gain-of-function variants is substantially worse (Flanagan et al., 2010). Notably, the algorithm B-SIFT, a modified version of the widely used SIFT tool (Ng and Henikoff, 2001), was developed to overcome this limitation (Lee et al., 2009). Conceptually, B-SIFT identifies increased functionality variants based on protein sequence alignments by scoring whether a given mutation results in a change commonly present in protein homologs. The tool successfully identified experimentally validated gain-of-function variants in cancer.

While computational missense variant predictors are generally reported to achieve high predictive accuracies with areas under the receiver operating characteristic curve (AUCROC) that often pivot around 0.9, drastic drops in performance to AUCROC of 0.5–0.75 have been reported on independent, functionally determined human variant datasets (Mahmood et al., 2017). These findings were corroborated by a recent cross-comparison of 23 methods based on three independent pathogenicity datasets in which the authors found that REVEL and VEST3 performed best overall, whereas the most commonly used methods SIFT and PolyPhen-2 showed only intermediate performance (Li et al., 2018). Furthermore, no functional consequences could be detected using various in vitro or in vivo tools for 40% of variants predicted to be deleterious by common functionality prediction tools (Miosge et al., 2015). Thus, while current tools have proven powerful in clinical diagnostics to prioritize potentially causative mutations in genetic diseases for further analyses (Boycott et al., 2013), their predictive power is not yet sufficient to predict functional variant effects without substantial subsequent validations.
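To make the AUCROC metric concrete, the following sketch computes it from first principles as the probability that a randomly chosen deleterious variant receives a higher score than a randomly chosen neutral variant, counting ties as one half. It assumes that higher scores indicate deleteriousness and that labels are coded as 1 (deleterious) and 0 (neutral); the function name and the toy data are illustrative only.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: one deleterious variant (score 0.2) is ranked below a neutral one (0.8).
example = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a drop from about 0.9 to 0.5–0.75 on independent data represents a substantial loss of discriminative power.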

#### TABLE 1 | Methods to predict the functional effect of missense variants based on sequence information.

HMM, hidden Markov model; SVM, support vector machine; NB, naïve Bayes classifier; EL, ensemble learning; RF, random forest; RM, regression model; NN, neural networks; GTB, gradient tree boosting; HGMD, Human Gene Mutation Database; OMIM, Online Mendelian Inheritance in Man; ESP, Exome Sequencing Project; PMD, Protein Mutant Database.

Importantly, the quality of prediction models critically relies on accurate training data sets. For instance, models are commonly generated using training sets of pathogenic variants as positive controls and polymorphisms identified to be common in large-scale sequencing projects as negative, i.e., functionally neutral variants. For pharmacogenetic predictions such a strategy is associated with multiple problems: Firstly, training on disease-associated data sets will, in the best case, result in prediction models that accurately predict the pathogenicity of variants. However, only very few ADME genes are directly associated with disease, suggesting that pathogenicity is not the right endpoint to inform about variant effects in the pharmacogenetic arena. Secondly, while evolutionary conservation constitutes a useful metric to predict functional consequences in genes under purifying selection, evolutionary conservation in pharmacogenes is generally much lower (Fujikura, 2016), indicating that conservation cannot reliably inform about functional impacts of variations in pharmacogenes. Finally, the choice of common polymorphisms as neutral training sets is problematic. Genetic variants that occur with high frequencies are not necessarily functionally neutral, particularly in pharmacogenetic loci, as evidenced by a multitude of high-frequency loss-of-function variants in CYP genes, such as CYP3A5<sup>∗</sup>3 (MAF = 95% in Europeans), CYP2C19<sup>∗</sup>2 (MAF = 34% in South Asians) and CYP2D6<sup>∗</sup>4 (MAF = 16% in Latinos) (Zhou et al., 2017).

The indicated problems incentivized us to develop a prediction framework tailored specifically toward pharmacogenetic functionality assessment (Zhou et al., 2018). Specifically, the model was devised using a two-step procedure: Firstly, the functionality classification thresholds of 18 commonly used functional prediction algorithms were optimized by leveraging a dataset of 337 experimentally characterized pharmacogenetic variants using 5-fold cross-validation. In a second step, we integrated the best performing orthogonal algorithms following a strategy that had been shown to further improve predictive accuracy (Martelotto et al., 2014). The resulting method achieved 93% sensitivity and 93% specificity for the classification of loss-of-function and functionally neutral variants. Moreover, the returned score can provide quantitative estimates of the effect of the variant in question on gene function, thus facilitating the functional and personalized interpretation of an individual's NGS-based pharmacogenome.
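The two-step logic described above can be sketched in simplified form. The example below is not the published implementation: it assumes that higher scores indicate loss of function (scores of tools with the opposite orientation, such as SIFT, would need to be inverted first), it picks each threshold by maximizing Youden's J (sensitivity + specificity − 1) as an assumed objective, it omits cross-validation, and it combines tools by the fraction of deleterious calls. All function names and data are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when calling score >= threshold deleterious."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def best_threshold(scores, labels):
    """Step 1: choose the threshold that maximizes Youden's J on labeled variants."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, t)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def ensemble_score(tool_scores: dict[str, float], thresholds: dict[str, float]) -> float:
    """Step 2: fraction of tools that call the variant deleterious."""
    calls = [tool_scores[name] >= thresholds[name] for name in thresholds]
    return sum(calls) / len(calls)


# Toy data: experimentally characterized variants (1 = loss of function, 0 = neutral).
threshold = best_threshold([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1])
combined = ensemble_score({"tool_a": 0.8, "tool_b": 0.1}, {"tool_a": 0.7, "tool_b": 0.5})
```

In practice the thresholds would be estimated within cross-validation folds so that the reported sensitivity and specificity are not inflated by evaluating on the same variants used for optimization.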

Recent progress in large-scale experimental mutagenesis screens provides a promising approach to further expand the development of powerful training resources for missense variant effect predictors. While such a strategy has already been used to develop a prediction method based on 10 proteins from different species with disparate structures (Gray et al., 2018), we propose that deep mutational scanning data from ADME proteins is likely to substantially refine the resulting model for pharmacogenetic predictions. For such an endeavor, we recommend using multiple substrates for each protein, as correlations between prediction and experiments improved with more comprehensive interrogation of protein function (Gallion et al., 2017). Combined with ADME-optimized prediction models, we envision that such an approach can further enhance the predictive accuracy of in silico methods and yield sufficiently accurate tools to allow for the clinical implementation of computational pharmacogenetic predictions.

# Utilization of Structural Data

While evolutionary conservation scores can provide useful metrics to assess the pathogenicity of missense variants, they have limitations when applied to less conserved genes, such as most ADME genes, which prompted the search for additional orthogonal in silico methods. To this end, the analysis of predicted or experimental structural data provides an appealing concept, as the correct folding of polypeptide chains into three-dimensional tertiary structures is of paramount importance for their biological functions. Structure-based approaches either directly use known crystal or NMR structures, preferably at high resolution <2–3 Å (Wlodawer et al., 2008) or, should such data not be available, leverage knowledge of the experimental 3D structures of homologous sequences (**Table 2**).

The effect of variants is predicted by how the folding free energy difference between the unfolded and folded states ($\Delta G^{\circ}$) is modified upon point mutations ($\Delta\Delta G^{\circ}$), with negative and positive values of $\Delta\Delta G^{\circ}$ indicating destabilizing and stabilizing mutations, respectively. In recent years a large number of mechanistically diverse approaches have been presented, with machine learning-based strategies being most prevalent. SDM constitutes a statistical potential energy function that can estimate variant effects on protein stability (Topham et al., 1997). This approach pioneered the knowledge-based prediction of mutation effects on protein stability and has also been successfully used in combination with machine learning techniques (Pires et al., 2014a). An updated version of the tool, SDM2 (Pandurangan et al., 2017), with a 5-fold increase in underlying structural information as well as extensions for interaction modeling can be accessed through a free, publicly available web server interface. Similarly, the algorithm HOPE (Venselaar et al., 2010) can calculate structural and functional effects of amino acid exchanges based on homology modeling. It should, however, be noted that most of the current tools are strongly biased toward the detection of destabilizing effects (Pucci et al., 2018).
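The sign convention used above can be written out explicitly. The formulation below is a sketch under the assumption that the folding free energy is defined as the free energy of the unfolded state minus that of the folded state, so that a stably folded protein has a positive value; individual tools may report the opposite sign, so the convention should be checked before comparing outputs.

$$
\Delta G^{\circ} = G^{\circ}_{\text{unfolded}} - G^{\circ}_{\text{folded}}, \qquad
\Delta\Delta G^{\circ} = \Delta G^{\circ}_{\text{mutant}} - \Delta G^{\circ}_{\text{wild type}}
$$

As a purely illustrative calculation with hypothetical numbers, a wild-type protein with $\Delta G^{\circ}_{\text{wild type}} = 8.0$ kcal/mol and a variant with $\Delta G^{\circ}_{\text{mutant}} = 6.5$ kcal/mol give $\Delta\Delta G^{\circ} = 6.5 - 8.0 = -1.5$ kcal/mol, i.e., a destabilizing mutation under this convention.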

Approximately 70% of the human proteome can be structurally modeled by homology (Somody et al., 2017). Yet, the number of resolved 3D structures for genes involved in drug ADME remains relatively low, at least in part due to the membrane-bound nature of many of these proteins. Furthermore, as many metabolic enzymes, such as cytochrome P450s (CYPs), exhibit marked active-site flexibility, which often results in ligand-induced conformational changes, prediction of variant effects based on direct structural data is difficult for these proteins and substrate-specific effects have to be considered. Thus, while the prediction of the effects of amino acid exchanges on substrate metabolism remains difficult, folding stability of variant proteins of interest can be estimated using existing computational tools based on sequence homology modeling (Kulshreshtha et al., 2016).

# EVALUATION OF TRUNCATION VARIANTS

Drug metabolizing enzymes and transporters have been found to harbor a multitude of truncation variants, such as micro-insertions and micro-deletions (indels) causing frameshifts, stop-gain and start-lost variants. Some of these variants are clinically relevant and occur with high frequencies in specific populations, including the stop-gain variant CYP2C19<sup>∗</sup>3 in East Asians and the frameshift variants CYP2D6<sup>∗</sup>3 and CYP2D6<sup>∗</sup>6 in Europeans (Zhou et al., 2017). As most pharmacogenes have only minor endogenous functions, they are under low evolutionary pressure and, consequently, such loss-of-function variants are often not selected against (Lauschke et al., 2017). Moreover, it has been speculated that pharmacogenetic loss-of-function alleles can even be selected for in modern humans, possibly due to reduced bioactivation of dietary toxicants (Fujikura, 2016). Truncation variants are commonly assumed to have deleterious effects and only a few studies have presented approaches to quantitatively assess the functional consequences of such mutations (Cline and Karchin, 2011).

#### TABLE 2 | Methods to predict the functional effect of missense variants based primarily on structural features.

SVM, support vector machine; RM, regression model; GTB, gradient tree boosting.

Early bioinformatic tools, such as LOFTEE, prioritize truncation variants based on a set of empirical rules, including whether the variant of interest occurs in the last 5% of transcript or whether the truncating allele is the ancestral state (MacArthur et al., 2012). Other approaches, such as Likelihood-ratio scoring (Zia and Moses, 2011), SIFT Indel (Hu and Ng, 2012) and NutVar (Rausell et al., 2014), primarily utilize the evolutionary conservation of amino acid residues. However, predictive performance of these tools for loss-of-function mutations is limited when trained on only missense mutations. Moreover, these methods are trained on genes that have high-quality annotations, which poses problems for the functional interpretation of truncation variants in genes for which such annotations are not readily available.
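The rule-based prioritization of truncation variants mentioned above can be illustrated with a minimal sketch. It encodes only the two empirical rules named in the text (position in the last 5% of the transcript and ancestral state of the truncating allele); it is not LOFTEE's implementation, and the class, function and flag names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TruncatingVariant:
    transcript_position: int              # 1-based position of the truncating event
    transcript_length: int                # total transcript length in the same units
    truncating_allele_is_ancestral: bool  # True if the truncating allele is the ancestral state


def low_confidence_reasons(variant: TruncatingVariant, tail_fraction: float = 0.05) -> list[str]:
    """Return the reasons why a putative loss-of-function call is deprioritized.

    An empty list means that neither of the two illustrative rules applies.
    """
    reasons = []
    tail_start = variant.transcript_length * (1.0 - tail_fraction)
    if variant.transcript_position > tail_start:
        reasons.append("in_last_fraction_of_transcript")
    if variant.truncating_allele_is_ancestral:
        reasons.append("truncating_allele_is_ancestral")
    return reasons


# A stop-gain close to the 3' end of a 1,000-nt transcript is flagged.
flags = low_confidence_reasons(TruncatingVariant(960, 1000, False))
```

Such filters only deprioritize variants that are unlikely to abolish function; they do not quantify the residual activity of the truncated gene product.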

To overcome these shortcomings, CADD was developed by integrating many diverse functional genomics annotations into a single score for each variant, which allows the impact of all classes of genetic variation, including truncating variants, to be estimated (Kircher et al., 2014). Newer approaches, such as DDIG-in (Folkman et al., 2015) and VEST-Indel (Douville et al., 2016), supplement conservation-based features with information about sequence and structural properties at nucleotide and protein levels as well as intrinsic disorder predictions from the region affected by stop-gain and frameshift variants. Notably, the recently developed tool ALoFT (Annotation of Loss-of-Function Transcripts) can categorize the pathogenic importance of putative loss-of-function mutations by integrating variant information with redundancy and haplosufficiency data of the corresponding gene (Balasubramanian et al., 2017). However, the aforementioned methods are primarily focused on distinguishing benign and disease-causing mutations. Thus, future studies are needed to evaluate whether this emphasis on the pathogenicity of variants might affect the performance of these methods regarding the functionality prediction of truncating variants in genes not associated with disease, such as many ADME genes.

In addition to impacts on functional and structural properties of proteins, truncating variants can affect nonsense-mediated mRNA decay (NMD). NMD is a conserved translation-dependent mechanism that is responsible for recognizing and eliminating aberrant mRNA transcripts to prevent the production of truncated peptides, thereby playing a critical role in preventing the accumulation of misfolded protein and subsequent initiation of the unfolded protein response (UPR) (Kervestin and Jacobson, 2012; Schoenberg and Maquat, 2012). Recently, Hsu et al. presented NMD Classifier, a tool for the systematic classification of NMD events, which was reported to correctly identify 99.3% of the NMD-causing transcript structural changes (Hsu et al., 2017). The incorporation of this information alongside functional estimates is expected to not only increase discriminative power but also to suggest the nature of the functional impact of a given variant. Interestingly, there is evidence that NMD efficiency varies between individuals and that these differences correlate with response to NMD inhibitors in cystic fibrosis patients (Linde et al., 2007; Kerem et al., 2008). While this phenomenon has to the best of our knowledge not been explicitly tested in the context of pharmacogenomics, interindividual differences in NMD magnitude could, at least in part, explain the large differences in drug response between patients with loss-of-function genotypes (Jukić et al., 2018) and thus have important implications for therapy.

In summary, much progress has been made regarding the functional interpretation of variants causing truncations of the corresponding gene product and current computational tools are able to incorporate a variety of features into their predictions, including evolutionary conservation, sequence and structural information as well as putative effects on NMD. However, it remains to be demonstrated whether these available tools will also be suitable for the prediction of effects of truncation variants in poorly conserved pharmacogenetic loci.

# PREDICTION OF ABERRANT SPLICING EVENTS

Splicing of pre-mRNA is a critical step during mRNA maturation in which introns are excised and exons are ligated. This process necessitates the presence of 5′ and 3′ splicing signals and a branch point sequence and is further regulated by exonic and intronic splicing enhancers/silencers (ESE/ESS and ISE/ISS, respectively) (Lee and Rio, 2015; Shi, 2017). Mutations in these regions can disrupt the splicing process and result in aberrantly processed transcripts, which can trigger NMD or result in the production of dysfunctional proteins. The functional importance of genetic variants in splice sites is emphasized by estimates that around 15% of human pathogenic mutations cause dysregulation of splicing (Baralle et al., 2009).

Variants located in canonical splice sites are considered to have the largest effect on splicing events. Therefore, a multitude of computational algorithms were developed to handle the prediction of 5′ and 3′ splice sites, such as NNSplice (Reese et al., 1997), MaxEntScan (Yeo and Burge, 2004), GeneSplicer (Pertea et al., 2001), and SplicePort (Dogan et al., 2007; **Table 3**). Moreover, variants outside splice sites can have substantial effects on splicing (Soukarieh et al., 2016) and a variety of computational methods have been developed to predict the effect of variants in such regulatory sequences. Examples are the sequence conservation-based algorithm Skippy (Woolfe et al., 2010) and the machine learning tools MutPred Splice (Mort et al., 2014), scSNVEL (Jian et al., 2014b), SPANR (Xiong et al., 2015), and CryptSplice (Lee et al., 2017). Further tools are available for the identification of branch point sequences (Corvelo et al., 2010; Zhang et al., 2017). Lastly, the secondary structure of pre-mRNAs can interfere with splice-site recognition, modulate spliceosome binding or facilitate splicing efficiency by bringing splice donors and acceptors into close proximity (Warf and Berglund, 2010). Consequently, genetic variants that alter pre-mRNA structure were found to promote alternative splicing (Wan et al., 2014), incentivizing the incorporation of structural information provided by tools, such as TurboFold (Harmanci et al., 2011) or CentroidFold (Sato et al., 2009), into variant effect predictions. For a more detailed description of structural RNA analyses we refer the interested reader to excellent recent reviews (Jian et al., 2014a; Lorenz et al., 2016; Ohno et al., 2018).
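The simplest class of splice-disrupting variants, those that destroy a canonical splice site, can be illustrated with a short sketch. It assumes the common GT...AG intron boundary rule (GT at the 5′ donor site and AG at the 3′ acceptor site, on the coding strand) and checks only these invariant dinucleotides; it does not reproduce the probabilistic models of the tools listed above, and the function names and sequence are hypothetical.

```python
def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def apply_snv(sequence: str, index: int, alt: str) -> str:
    """Return the sequence with the base at 0-based index replaced by alt."""
    return sequence[:index] + alt + sequence[index + 1:]


def disrupts_canonical_site(intron: str, index: int, alt: str) -> bool:
    """True if a single-nucleotide variant destroys the GT or AG dinucleotide."""
    return has_canonical_splice_sites(intron) and not has_canonical_splice_sites(
        apply_snv(intron, index, alt)
    )


intron = "GTAAGTCCCTTTCAG"                           # toy intron sequence
donor_hit = disrupts_canonical_site(intron, 0, "A")  # G>A at the first donor base
deep_hit = disrupts_canonical_site(intron, 5, "C")   # variant outside the dinucleotides
```

Variants outside the invariant dinucleotides, such as the second example, are not captured by this check and require the scoring models or regulatory-element predictors described above.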

In ADME genes, dysregulation of splicing has long been recognized as a cause of inter-individual variability in drug metabolism (Hanioka et al., 1990) and toxicity (Raida et al., 2001) and the liver was found to be among the tissues with the highest levels of alternative splicing activity (Yeo et al., 2004). As splicing is highly tissue specific, these data indicate that algorithms for the prediction of variant splice effects in pharmacogenetics should ideally be trained on positive control sets for which aberrant splicing is confirmed in the tissue of interest, i.e., primarily liver. To this end, the GTEx project (GTEx Consortium, 2017) provides a rich resource that has already been successfully utilized for the identification of tissue-specific splice events in pharmacogenes (Chhibber et al., 2017).

In summary, the toolkit of available computational algorithms for the prediction of variant effects on splicing has grown rapidly and now allows evaluation not only of direct impacts on splice sites, but also of mutations in regulatory splice enhancers and silencers, as well as branch points. For the application of these methods to pharmacogenomics there is a need to benchmark available tools on splice variants in ADME genes. Moreover, we anticipate that the utilization of tissue-specific expression data will further refine splice site predictions.

# FUNCTIONAL IMPACT OF VARIANTS IN UNTRANSLATED REGIONS

miRNAs play important roles in the regulation of mRNA stability and translation. miRNA-mRNA interaction occurs through conserved miRNA binding sites in the 3′-UTRs and at least 10% of all SNPs are located in 3′-UTRs and might affect complementary miRNA-mRNA pairing (Xiao et al., 2009). Furthermore, miRNAs have been shown to be important modulators of ADME gene expression profiles (Rieger et al., 2013). Therefore, functional interpretation of genetic variations within miRNA target sites constitutes an important factor for the prediction of the fate of the corresponding transcript. To evaluate the potential relevance of genetic polymorphisms in UTRs, various databases, such as the PolymiRTS Database 3.0 (Bhattacharya et al., 2014) or MirSNP (Liu et al., 2012), provide useful resources that contain collections of experimentally confirmed SNPs and indels not only in miRNA target sites but also in miRNA seed regions responsible for mRNA binding. Furthermore, a variety of other SNP effect prediction servers are publicly available (Fehlmann et al., 2017).

In case no experimental data is available, various computational tools can be used to predict possible disruption of the miRNA-mRNA pairing for a given variant (**Table 3**). MicroSNiPer (Barenboim et al., 2010) and ImiRP (Ryan et al., 2016) identify and predict such disruptions by comparing the mutant 3′-UTR sequences with major variant databases.
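The principle of checking whether a 3′-UTR variant disrupts miRNA-mRNA pairing can be sketched as a simple seed-match test. The example assumes that the seed comprises miRNA nucleotides 2–8 and that a target site is a perfect reverse-complement match to this seed; real tools use more elaborate pairing and context models, and this sketch does not detect newly created sites. The function names and the sequences are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna: str) -> str:
    """Return the mRNA motif (5'->3') complementary to miRNA nucleotides 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]
    return seed.translate(_COMPLEMENT)[::-1]  # reverse complement of the seed


def variant_disrupts_seed_match(utr_ref: str, utr_alt: str, mirna: str) -> bool:
    """True if the reference UTR contains a seed match that is lost in the variant UTR."""
    site = seed_match_site(mirna)
    ref = utr_ref.upper().replace("T", "U")
    alt = utr_alt.upper().replace("T", "U")
    return site in ref and site not in alt


mirna = "UGAGGUAGUAGGUUGUAUAGUU"   # illustrative miRNA sequence
utr_ref = "AAACUACCUCAGG"          # toy reference 3'-UTR fragment with a seed match
utr_alt = "AAACUAGCUCAGG"          # same fragment carrying a C>G variant in the site
lost = variant_disrupts_seed_match(utr_ref, utr_alt, mirna)
```

Because perfect seed complementarity alone is a weak predictor of functional targeting, hits from such a check are best treated as candidates for experimental validation.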

#### TABLE 3 | Tools for the prediction of variant effects on splicing, transcript levels or translation.


HGMD, Human Gene Mutation Database; 1000G, 1000 Genomes Project; DBASS, Database for Aberrant Splice Sites; NMD, nonsense-mediated decay; HMM, hidden Markov model; RBP, RNA binding protein.

Similarly, mrSNP can predict the effect of any variant identified in NGS-based projects on miRNA-target transcript interaction (Deveci et al., 2014). However, it is important to note that miRNA target predictions seem to have a high false-positive rate (Pinzón et al., 2017), suggesting that these problems might persist in studies utilizing miRNA-target databases without stringent experimental validations. Besides predicting the effect of genetic variants in putative miRNA target sites, multiple online tools are available for inverse approaches, analyzing variants in miRNAs or pre-miRNAs for possible deleterious effects. For a more comprehensive collection of miRNA-related variant interpretation tools the reader is referred to recent reviews and online resources (Akhtar et al., 2016; Moszynska et al., 2017).

In addition, recent approaches expanded the methodological portfolio beyond miRNA binding site prediction to include effects of UTR variants on binding of RNA-binding proteins (RBPs), translational efficiency and ribosomal loading. Effects of indels on RBP binding can be evaluated using PinPor, which has been demonstrated to have some success in distinguishing disease-causing and neutral indels (Zhang et al., 2014). Furthermore, Sample et al. presented the preprint of a deep learning approach based on experimental polysome profiling to predict the impact of UTR sequence on translation (Sample et al., 2018). These developments illustrate the diversification of parameters that can be incorporated into variant effect predictions, thus further refining biological interpretation of NGS data sets.

# ANALYSIS OF REGULATORY VARIANTS

Non-coding regions account for more than 99% of the human genome and, consequently, their consideration substantially expands the analysis space of computational predictions. Variants in non-coding regions can affect regulatory elements, such as promoters, enhancers, silencers, and insulators, which, in turn, may alter their affinity for transcription factors or remodel the local chromatin structure (Zhang and Lupski, 2015; Deplancke et al., 2016). Accurate prediction of the functional consequences of such variants constitutes one of the major challenges in human genetics.

To interpret noncoding variants, a variety of different strategies have been presented. The first approaches, such as SiPhy (Garber et al., 2009), PhyloP (Pollard et al., 2010), PhastCons (Siepel et al., 2005), GERP++ (Davydov et al., 2010), or SCONE (Asthana et al., 2007), were based on evolutionary constraint using sequence alignments. However, the observation that no enhanced constraints were identified in regulatory elements at the level of DNA sequence despite conserved transcription factor binding led to the realization that conservation of regulatory regions can only be a weak indicator of the functional effects of SNVs in regulatory regions (Schmidt et al., 2010; Arbiza et al., 2013). Consequently, conservation metrics were complemented with additional functional genomics features, such as the sequence and genic context, transcription factor binding profiles (Johnson et al., 2007), histone modification data (Zhang et al., 2010) and DNase I hypersensitive sites (Boyle et al., 2008) in an attempt to improve prediction quality. Based on these rich data sets, a variety of ensemble classifiers were developed using various machine learning approaches that aim to distinguish neutral from pathogenic variants, including GWAVA (Ritchie et al., 2014), CADD (Kircher et al., 2014), FATHMM (Shihab et al., 2013, 2015; Rogers et al., 2018), DANN (Quang et al., 2015), DIVAN (Chen et al., 2016), and Genomiser (Smedley et al., 2016) (**Table 4**).

In contrast, other methods, such as gkm-SVM (Lee et al., 2015) and DeepSEA (Zhou and Troyanskaya, 2015), have been developed to predict regulatory elements based on primary sequence alone. Trained on publicly available cell type-specific chromatin data provided by ENCODE (The ENCODE Project Consortium, 2012) and the Roadmap Epigenomics Project (Roadmap Epigenomics Consortium et al., 2015) as well as transcription factor binding patterns accessible via JASPAR (Khan et al., 2018), these algorithms predict to what extent a genetic variant will cause changes to the local chromatin profiles and how these effects translate into functional consequences. The resulting data demonstrate that inferring consequences from functional genomics data is highly cell type and context specific and relies on biologically appropriate training sets. These findings incentivize the generation of functional genomics data from carefully phenotyped human tissues involved in drug ADME to derive tissue-specific regulatory lexica and we envision that training machine learning approaches on these data sets will substantially increase the power of regulatory pharmacogenetic prediction classifiers.

As with coding variants, the use of potentially biased training sets and multi-dimensional circularity between training and test data constitutes an inherent problem for current variant prediction tools (Grimm et al., 2015). For instance, a variety of algorithms consider common variants from the 1000 Genomes Project as functionally neutral control sets for model training. However, while these variants are likely to be depleted of pathogenic variants in haploinsufficient genes, many common variants entail functional consequences in their respective gene product, particularly if the gene is rapidly evolving, such as many CYP genes. Similar problems arise when the model is trained using phenotype-associated GWAS polymorphisms as functional variant sets, as only 5.5% of GWAS index SNPs are estimated to be causal whereas the remainder are only in linkage disequilibrium with the true functional variant in the locus (Farh et al., 2015).

To overcome these problems, unsupervised approaches have been developed that do not rely on the labeling of training data, thereby reducing the dependence on preexisting variant classifications and existing models of mutation. These unsupervised models, such as GenoCanyon (Lu et al., 2015) and Eigen (Ionita-Laza et al., 2016), represent powerful tools for the genome-wide interpretation of variants. However, as they are calibrated on genome-wide data, it remains to be determined whether gene class-specific peculiarities, such as low evolutionary conservation in ADME genes, might affect the predictive accuracy of these approaches for pharmacogenetic applications.

#### TABLE 4 | Algorithms for the functional interpretation of regulatory variants.


RF, random forest; SVM, support vector machine; HMM, hidden Markov model; EL, ensemble learning; NN, neural networks; INSIGHT, Inference of Natural Selection from Interspersed Genomically Coherent Elements (Gronau et al., 2011); US, unsupervised; HGMD, Human Gene Mutation Database; 1000G, 1000 Genomes Project; ESP, Exome Sequencing Project; TF, transcription factor; HSS, hypersensitive site; FAIRE, Formaldehyde-Assisted Isolation of Regulatory Elements (Giresi et al., 2007); NHGRI, National Human Genome Research Institute.

# CONCLUSIONS

Technical progress in NGS technology has resulted in its routine application in medical genetics and clinical diagnostics. In contrast, clinical implementation of NGS-based pharmacogenomics is largely lagging behind (Lauschke and Ingelman-Sundberg, 2016b; Ji et al., 2018). Most importantly, in order to utilize the major advantage of NGS-based genotyping, which is the discovery of the entire panorama of the individual's genetic portfolio, tools have to be in place that translate these variability data into functional consequences and clinical recommendations. Whereas the identification of rare putatively deleterious mutations in congenital diseases is aided by clear phenotypic alterations of the affected patient and the possibility to perform comparative genomic analyses of unaffected family members, pharmacogenomic phenotypes are generally more difficult to detect as they only present in a given context, such as exposure to specific medications. In the absence of drug response associations or experimental characterizations that support the functional interpretation of rare variants, there is thus an urgent need for reliable computational prediction tools to fill this space.

Importantly, recent developments in computational variant effect prediction methods promise to narrow the gap to meet the exacting demands on genomics applications in the clinics. Machine learning constitutes an important tool kit to fully harness the power of large data sets provided by NGS. However, these approaches rely on accurate labeling of input variants, i.e., training data need to be correctly classified into deleterious and functionally neutral variants. Thus, we advocate for approaches that leverage smaller data sets of variants for which comprehensive experimental or functional genomic data is available instead of training algorithms on large but functionally poorly annotated data, such as treating all common polymorphisms identified in the 1000 Genomes Project as functionally neutral. In addition, we endorse previous appeals for the sharing of codes and data sets, which will enable comparative benchmarking of newly developed tools and algorithms and will accelerate research progress within the area of computational pharmacogenomics and beyond (Kalinin et al., 2018).

The functional consequences of missense variants have been most extensively studied. Respective methods base their predictions on evolutionary conservation and structural information of the polypeptide encoded by the respective gene. Importantly, while evolutionary conservation is a suitable measure to inform about the deleteriousness of a variant, i.e., its effect on organismal fitness, it is not suitable for the prediction of variant effects in genes under low selective pressure, such as most pharmacogenes. Recognition of these conceptual problems resulted in the development of computational predictors trained specifically on ADME missense variants (Zhou et al., 2018). We envision that these approaches will become more powerful with increasing functionally annotated pharmacogenetic variant data.

FIGURE 2 | The past, present and future of pharmacogenetic phenotype predictions. (A) Conventionally, pharmacogenetic predictions were based on the interrogation of few common candidate SNPs, whose functional effects were predicted based on extensive literature evidence, resulting in high predictive accuracy but only few considered variations. (B) With increasing prevalence of whole exome sequencing (WES), a multitude of pharmacogenetic variants with unknown functional relevance are identified. These variants can be interpreted using computational methods. However, current algorithms are generally trained to detect the pathogenicity rather than the functionality of queried variants, resulting in overall relatively low predictive accuracy. Furthermore, only effects of missense and nonsense variants are evaluated. (C) In the near future, whole genome sequencing (WGS) will become the predominant genotyping methodology, revealing not only coding variants but also variants in regulatory regions and introns. To facilitate interpretation of this data, we envision that pharmacogenetic predictors will be directly trained on functionally annotated ADME data sets. Emerging technologies, such as deep mutational scanning for the systematic interrogation of missense variants or mutagenesis screens in microphysiological systems (MPS) for the characterization of variants in regulatory regions, provide powerful tools to generate these data, boosting the predictive performance of data-hungry machine learning tools. These advances make it possible to go beyond the interpretation of missense and nonsense variants and to also include non-coding and regulatory variations in pharmacogenetic assessments.

Furthermore, multiple strategies have been developed to analyze the functional impact of variants in non-coding regions of the genome, which are increasingly recognized as a substantial contributor to inter-individual variability. An increasing number of algorithms is by now available that base their predictions on a multitude of different parameters, including effects on miRNA binding or translational efficiency, modulation of splicing and impacts on transcriptional events by disruption of transcription factor binding sites or polymerase loading (**Figure 1**). While these developments provide a methodological arsenal to comprehensively characterize all different classes of genetic variants, these methods are generally trained on pathogenic variant sets and have not been benchmarked on independent data sets. Thus, their predictive power for pharmacogenetic assessments remains to be evaluated.

The prediction of drug metabolism phenotypes based on the genotype of the individual has made tremendous progress over the last decades (**Figure 2**). Conventional approaches use data from few candidate variants for which substantial in vitro or in vivo characterization data were available to predict drug response. While this strategy has been successful in incorporating common pharmacogenetic variability into clinical decision-making, it fails to address the functional effects of the vast number of rare genetic variants. To also include rare variants, pilot programs were initiated in which WES was used to comprehensively interrogate the genetic landscape of pharmacogenomic loci (Bielinski et al., 2014). However, analyses were restricted to pharmacogenetic missense variants, and the effects of SNVs with unknown functional relevance were interpreted using computational models trained on pathogenic data sets, with negative impacts on the accuracy of phenotype predictions, as discussed above. Thus, while these strategies constitute an important step toward the further personalization of genotype-guided treatment decisions, their predictive accuracy is rather low.

We expect that technological, methodological and analytical progress will contribute to a further refinement of NGS-guided drug treatment in the near future. Firstly, technological advances will result in an increasing dissemination of WGS, which facilitates the incorporation of the entire profile of an individual's genetic variability, including regulatory variants, into pharmacogenetic predictions. Secondly, we envision that novel high-throughput methodologies for functional characterizations, such as deep mutational scanning, will provide powerful approaches to generate large functionally annotated pharmacogenetic variant data sets. In addition, recent advances in the development of microphysiological systems (MPS) that allow modeling of key target tissues associated with drug metabolism or safety (Ewart et al., 2018) provide promising tools to generate tissue-specific and human-relevant data sets for studies of gene-drug interactions (Ingelman-Sundberg and Lauschke, 2018). Training machine learning models on this integrated wealth of functional pharmacogenetic data holds promise for high-accuracy predictions based on the entire genetic variability landscape of the respective patient.

Importantly, leveraging this information as guidance for clinical decision-making promises to increase treatment efficacy and reduce the risks of adverse events in carriers of pharmacogenetic variants whose effects have not been experimentally evaluated. Current market analyses suggest that implementation of artificial intelligence into the clinical decision support toolbox might increase average life expectancy in the Western World by 0.2–1.3 years and reduce total health care expenditures by 5–9%, corresponding to 2 trillion to 10 trillion USD globally per year (Bughin et al., 2017). However, in order to realize these exciting prospects, there is a need for prospective, randomized controlled trials that evaluate patient outcomes and cost-effectiveness of such preemptive advice across genes, drugs and health care systems.

In summary, computational prediction methods are essential for the implementation of NGS into clinical decision-making. While much progress has been made and a plethora of conceptually diverse tools is already available, there is a need to develop specialized methods that are optimized for the prediction of variant functionality rather than pathogenicity and are calibrated specifically on pharmacogenetic data. We envision that technological, methodological and analytical advances will soon allow comprehensive prediction of variant effects with sufficient accuracy to justify the design of trials in which the clinical value of NGS-guided treatment decisions can be tested in a prospective setting.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# FUNDING

The work in the authors' laboratory is supported by the Swedish Research Council [grant agreement numbers: 2016-01153 and 2016-01154], by the Strategic Research Programme in Diabetes at Karolinska Institutet, by the European Union's Horizon 2020 research and innovation program U-PGx [grant agreement No. 668353], and by the Lennart Philipson and Harald och Greta Jeansson Foundations.

#### REFERENCES

Ahn, E., and Park, T. (2017). Analysis of population-specific pharmacogenomic variants using next-generation sequencing data. Sci. Rep. 7: 8416. doi: 10.1038/s41598-017-08468-y

Akhtar, M. M., Micolucci, L., Islam, M. S., Olivieri, F., and Procopio, A. D. (2016). Bioinformatic tools for microRNA dissection. Nucleic Acids Res. 44, 24–44. doi: 10.1093/nar/gkv1221

Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P., et al. (2010). A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249. doi: 10.1038/nmeth0410-248


exploiting biased substitution patterns. Bioinformatics 25, i54–i62. doi: 10.1093/bioinformatics/btp190


not more) from published macromolecular structures. FEBS J. 275, 1–21. doi: 10.1111/j.1742-4658.2007.06178.x


**Conflict of Interest Statement:** VL is co-founder and owner of HepaPredict AB.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhou, Fujikura, Mkrtchian and Lauschke. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A New Panel-Based Next-Generation Sequencing Method for ADME Genes Reveals Novel Associations of Common and Rare Variants With Expression in a Human Liver Cohort

#### Edited by:

Rick Kittles, Irell & Manella Graduate School of Biological Sciences, United States

#### Reviewed by:

Jeannine S. McCune, Beckman Research Institute, United States; Wenndy Hernandez, The University of Chicago, United States; Jason Hansen Karnes, The University of Arizona, United States

\*Correspondence: Ulrich M. Zanger uli.zanger@ikp-stuttgart.de

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Genetics

Received: 26 October 2018; Accepted: 09 January 2019; Published: 31 January 2019

#### Citation:

Klein K, Tremmel R, Winter S, Fehr S, Battke F, Scheurenbrand T, Schaeffeler E, Biskup S, Schwab M and Zanger UM (2019) A New Panel-Based Next-Generation Sequencing Method for ADME Genes Reveals Novel Associations of Common and Rare Variants With Expression in a Human Liver Cohort. Front. Genet. 10:7. doi: 10.3389/fgene.2019.00007

Kathrin Klein<sup>1,2</sup>, Roman Tremmel<sup>1,2</sup>, Stefan Winter<sup>1,2</sup>, Sarah Fehr<sup>3,4</sup>, Florian Battke<sup>3,4</sup>, Tim Scheurenbrand<sup>3,4</sup>, Elke Schaeffeler<sup>1,2</sup>, Saskia Biskup<sup>3,4</sup>, Matthias Schwab<sup>1,2,5,6</sup> and Ulrich M. Zanger<sup>1,2</sup>\*

<sup>1</sup> Dr. Margarete Fischer-Bosch-Institute of Clinical Pharmacology, Stuttgart, Germany, <sup>2</sup> Medical School, University of Tübingen, Tübingen, Germany, <sup>3</sup> CeGaT GmbH, Tübingen, Germany, <sup>4</sup> Praxis für Humangenetik Tübingen, Tübingen, Germany, <sup>5</sup> Department of Clinical Pharmacology, University Hospital Tübingen, Tübingen, Germany, <sup>6</sup> Department of Pharmacy and Biochemistry, University of Tübingen, Tübingen, Germany

We developed a panel-based NGS pipeline for comprehensive analysis of 340 genes involved in absorption, distribution, metabolism and excretion (ADME) of drugs, other xenobiotics, and endogenous substances. The 340 genes comprised phase I and II enzymes, drug transporters and regulator/modifier genes, targeted within their entire coding regions, adjacent intron regions and 5′ and 3′ UTR regions, resulting in a total panel size of 1,382 kbp. We applied the ADME NGS panel to sequence genomic DNA from 150 Caucasian liver donors with available comprehensive gene expression data. This revealed an average read-depth of 343 (range 27–811), while 99% of the 340 genes were covered on average at least 100-fold. Direct comparison of variant annotation with 363 available genotypes determined independently by other methods revealed an overall accuracy of >99%. Of 15,727 SNV and small INDEL variants, 12,022 had a minor allele frequency (MAF) below 2%, including 8,937 singletons. In total we found 7,273 novel variants. Functional predictions were computed for coding variants (n = 4,017) by three algorithms (Polyphen 2, Provean, and SIFT), resulting in 1,466 variants (36.5%) concordantly predicted to be damaging, while 1,019 variants (25.4%) were predicted to be tolerable. In agreement with other studies we found that less common variants were enriched for deleterious variants. Cis-eQTL analysis of variants with MAF ≥ 2% revealed significant associations for 90 variants in 31 genes after Bonferroni correction, most of which were located in non-coding regions. For less common variants (MAF < 2%), we applied the SKAT-O test and identified significant associations with gene expression for ADH1C and GSTO1. Moreover, our data allow comparison of functional predictions with additional phenotypic data to prioritize variants for further analysis.

Keywords: ADME, next generation sequencing, pharmacogenomics, eQTL analysis, rare variants

# INTRODUCTION

fgene-10-00007 January 29, 2019 Time: 16:59 # 2

Genetic variation in genes that function in the absorption, distribution, metabolism, and elimination (ADME) of drugs contributes significantly to the interindividual variability in efficacy and toxicity of numerous drugs from practically all therapeutic categories. In the past half century, pharmacogenetic research has unraveled many clinically meaningful associations between germline genetic variants and pharmacokinetic or drug response phenotypes (Meyer, 2004; Zanger and Schwab, 2013; Alfirevic and Pirmohamed, 2017). Clinical implementation of this knowledge is currently being pursued worldwide by several consortia (Caudle et al., 2013; Dunnenberger et al., 2015; Relling and Evans, 2015; Cecchin et al., 2017; Swen et al., 2018). For example, the Clinical Pharmacogenetics Implementation Consortium (CPIC) has so far issued 65 dosing guidelines for 38 drugs and 15 relevant genes (October 2018<sup>1</sup>). Until recently, pharmacogenetics has mainly focused on common genetic variants, which can be relatively easily assessed for association with pharmacokinetic or drug response phenotypes. However, a considerable proportion of genetic variability remains unexplained even for well-studied genes like CYP2D6, as recently shown by twin studies (Matthaei et al., 2015). Currently, it is widely assumed that rare deleterious variants fill this gap and contribute significantly to functional variability, which is further supported by the fact that rare variants are enriched for deleterious alleles due to purifying selection (1000 Genomes Project Consortium et al., 2012; Lek et al., 2016; Ingelman-Sundberg et al., 2018).

Indeed, with the increasing availability of next-generation sequencing (NGS) technology, several studies explored genetic variability of pharmacologically relevant "pharmacogenes" and revealed large numbers of rare variants, most of which were previously unknown (Tennessen et al., 2012; Fujikura et al., 2015; Han et al., 2016; Kozyra et al., 2016; Hovelson et al., 2017; Schärfe et al., 2017). For statistical reasons it is intrinsically more difficult to investigate the functional significance of rare variants as compared to common variants, especially regarding pharmacogenetic phenotypes, for which studies including relevant phenotypic data are essentially lacking. On the other hand, in vitro testing of thousands of variants is currently prohibitive for time and financial reasons. Current hopes to integrate rare variants into clinical pharmacogenomics therefore rely mainly on computational prediction tools, many of which are publicly available (Ingelman-Sundberg et al., 2018; Zhou et al., 2018a). Computational predictions of "damaging" or "loss-of-function" (LOF) versus "tolerable" (TOL) functionality performed on ADME rare variants detected in genetic screens indicated that up to 30% of drug response variability could be due to rare variants and that likely every patient carries at least one "actionable" pharmacogenetic variant (Crosslin et al., 2015; Ji et al., 2016). However, data on the validity of functional prediction are scarce, and the performance of these predictions as well as the true contribution of rare variants to pharmacogenetic variability remain unclear, especially since current predictive algorithms rely largely on principles of evolutionary conservation, which may be more appropriate in the context of disease than for drug metabolism and response.

In this study we have developed a panel-based NGS pipeline for comprehensive sequence analysis of 340 ADME genes comprising all major genes known to be involved in phase 1 and phase 2 drug metabolism, drug transport and its regulation, as well as numerous additional genes of potential interest in this context. We applied our ADME NGS panel to genomic DNA from 150 human liver samples that we have previously genotyped by other methods and for which comprehensive mRNA expression data and some additional ADME phenotypes are available. This allowed us to directly compare genotype with expression for common and rare variants, unraveling numerous novel associations and potential candidates. In addition, we performed functional prediction for subsets of variants and, as examples, compared these with hepatic phenotypes. This type of analysis, which has rarely been done, should help to improve functional prediction and to prioritize interesting rare variants for further analysis.

#### MATERIALS AND METHODS

#### Patient DNA and Liver Samples

Liver tissues and corresponding blood samples were previously collected from patients of White European descent undergoing liver surgery at the Department of General, Visceral, and Transplantation Surgery (A. K. Nuessler, P. Neuhaus, Campus Virchow, University Medical Center Charité, Humboldt University Berlin, Germany) (Klein et al., 2012). The study protocol was approved by the ethics committees of the medical faculties of the Charité, Humboldt University, and the University of Tübingen. The study was conducted in accordance with the Declaration of Helsinki, and written informed consent was obtained from each patient. Only non-tumorous tissue was collected, as confirmed by histological examination, and stored at −80 °C. Available patient documentation includes sex, age, smoking habits, alcohol consumption, presurgery medication, diagnosis leading to liver resection, and serological liver function parameters. Samples from patients with hepatitis, cirrhosis, or chronic alcohol abuse were excluded. A summary of the data is presented in **Supplementary Table S1**.

Phenotypic data were available from previous studies. Genome-wide mRNA expression profiling was previously performed using Illumina Human-WG6v2 Expression BeadChip (see below). For selected genes quantitative mRNA levels were determined by real-time PCR, protein levels by Western blot, and enzyme activity levels by mass spectrometry (**Supplementary Table S2**).

**Abbreviations:** ADME, Absorption Distribution Metabolism Excretion; bp, basepair; CNV, copy number variant; eQTL, expression quantitative trait loci; HWE, Hardy–Weinberg equilibrium; INDEL, insertion/deletion; Kbp, kilo basepair; LOF, loss of function; MAF, minor allele frequency; NGS, next generation sequencing; RFLP, restriction fragment length polymorphism; SNP, single nucleotide polymorphism; SNV, single nucleotide variant; TOL, tolerated; UTR, untranslated region.

<sup>1</sup>www.pharmgkb.org/guidelines

Genomic DNA was isolated from corresponding blood samples as described previously (Gomes et al., 2009). Quality and concentration of gDNA were determined using both the Qubit Fluorometric Quantitation (Thermo Fisher Scientific, Dreieich, Germany) and the Nanodrop ND-8000 (Thermo Fisher Scientific, Dreieich, Germany). Gene expression and genotyping data assessed by Human-WG6v2 Expression BeadChip and HumanHap300 Genotyping BeadChip (Illumina, Eindhoven, Netherlands) were preprocessed as previously described (Schröder et al., 2013), and the data are accessible through GEO Series accession numbers GSE32504 and GSE39036, respectively.

#### Targeted ADME NGS Panel Sequencing

Genomic DNA was enriched using a custom-designed Agilent SureSelect XT in-solution kit (Agilent Technologies, Santa Clara, CA, United States). The design of the PGX panel for all relevant ADME-classified and ADME-related genes (340 genes in total) included publicly available gene lists of PharmaADME.org<sup>2</sup> (CORE/EXTEND, n = 236) and PharmGKB<sup>3</sup> [very important pharmacogenes (VIP), n = 36; Whirl-Carrillo et al., 2012], as well as additional genes with confirmed or putative ADME-related function according to literature search (n = 104; **Supplementary Table S2**). For analysis, the genes were assigned to functional groups as follows: ATP-binding cassette transporters (ABC; n = 45), solute carrier transporters, solute carrier organic anion transporters, and ion channels (SLC/SLCO; n = 64), members of phase I metabolism excluding cytochrome P450 and other modifying enzymes (Phase 1; n = 36), members of phase II metabolism (Phase 2; n = 53), cytochrome P450s/modifying enzymes (CYP/modifiers; n = 53), nuclear receptors/transcription regulators (NR/TR; n = 46), and genes of other background and potentially related to ADME (others; n = 43) (**Figure 1B** and **Supplementary Table S2**). Positions of exon regions and of 3′ and 5′ UTRs (untranslated regions) were based on RefSeq major transcript sequences (GRCh37; hg19; UCSC genome browser). Exon sizes were extended by 20 nucleotides on each side. The sequence of very short exons was symmetrically extended to at least 160 nucleotides. For selected genes (n = 29), 5′ regions were extended to cover 2 kbp. The total number of exons was 4,210 and the total target size reached 1,382 kbp (**Supplementary Table S2**). Panel details are available on request.
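The two padding rules for exon targets (20 nucleotides added on each side, and symmetric extension of very short exons to at least 160 nucleotides) can be illustrated with a minimal Python sketch. The function name `pad_target`, the half-open coordinate convention, and the order of operations (flanks first, then the minimum-length rule) are assumptions made for illustration; the panel design software actually used is not described in the text.

```python
def pad_target(start, end, flank=20, min_len=160):
    """Return padded (start, end) for one exon in half-open coordinates.

    Assumption: the 20-nt flanks are added first, and the 160-nt minimum
    is then enforced on the padded interval. The source does not state
    the order in which the two rules were applied.
    """
    start, end = start - flank, end + flank
    shortfall = min_len - (end - start)
    if shortfall > 0:
        # Distribute the missing bases as evenly as possible on both sides.
        left = shortfall // 2
        right = shortfall - left
        start, end = start - left, end + right
    return start, end


# A 200-nt exon only receives the flanks; a 50-nt exon is grown to 160 nt.
print(pad_target(1000, 1200))
print(pad_target(1000, 1050))
```

Under these assumptions, an exon that is already long enough grows by exactly 40 nucleotides, while a very short exon ends up at exactly 160 nucleotides, centered on the original exon.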

Target capturing was specifically designed for NGS of selected regions, and DNA libraries were generated using Agilent in-solution target capture technology from up to 1 µg high quality genomic DNA for each sample. NGS was carried out on the Illumina HiSeq2500 system (Illumina Inc., San Diego, CA, United States) at high depth with 2 × 100 bp paired-end reads. Raw sequencing reads generated by the Illumina platform were demultiplexed using Illumina bcl2fastq (1.8.2) (Illumina, San Diego, CA, United States). Adapter sequences were removed with cutadapt and the trimmed reads were mapped to the human reference genome (GRCh37 hg19) using the Burrows Wheeler Aligner (BWA-mem 0.7.2; Li and Durbin, 2010). Reads mapping to more than one location with identical mapping scores were discarded (in-house software). Read duplicates likely resulting from PCR amplification were removed (samtools 0.1.18). Variants were called using samtools and varscan (2.3.5)<sup>4</sup>. Technical artifacts were removed (in-house software) and the remaining variants were annotated based on several internal and external databases. We created a read count matrix for sequenced targets and 150 samples using the R package cn.mops (v1.12.0) and the BAM files to assess the quality of coverage per gene and per target region. Approximately 5.9 million on-target reads were generated per sample with a mean mapping quality of 58.2 and a mean coverage of 343 per target site. A Frequentist or a Bayesian algorithm was applied to call SNVs and small insertions/deletions (INDELs). Detection of insertions is limited by read length and no insertions above 50 bp were observed. Variant annotations were retrieved from the UCSC genome data browser<sup>5</sup> and dbSNP build 151 (March 22, 2018), and Sequence Ontology (SO) terms were used to describe the effect of each variant on genes in terms of transcript structure. Enrichment and sequencing procedures were established, validated, and provided by CeGaT GmbH, Tübingen, Germany.
CeGaT is accredited by DAkkS according to DIN EN ISO 15189:2014, accredited by the College of American Pathologists (CAP), and CLIA-certified (Dohrn et al., 2017). Sequence variant data have been deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001003426. Further information about EGA can be found at https://ega-archive.org (Lappalainen et al., 2015).

#### High Quality Variants

Only variants within the predefined target regions were selected and further analyzed (n = 16,928). Variant calls with sequencing coverage below 20× were regarded as invalid. Moreover, heterozygous calls were regarded as invalid when variant allele ratios were <5%. Invariant positions and variants with less than 70% valid values in all samples were excluded. Furthermore, 696 variants with HWE p-values < 10<sup>−5</sup> were considered suspicious and consequently excluded from all subsequent analyses. Finally, 13,838 SNVs and 1,889 INDELs were further investigated in this work. Genedata Profiler Analyst Module (V12.0.2; Genedata AG, Basel, Switzerland) and GraphPad Prism (V5.04; GraphPad Software Inc., La Jolla, CA, United States) were used for data filtering, visualization, and basic statistical calculations.
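The quality filters listed above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation (filtering was done in Genedata Profiler): the function names are hypothetical, "invariant" is interpreted as "no variant allele observed among valid calls", and the HWE p-value is computed with a one-degree-of-freedom chi-square test because the text does not specify which HWE test was used.

```python
import math

MIN_DEPTH = 20             # calls below 20x coverage are invalid
MIN_HET_RATIO = 0.05       # heterozygous calls need >= 5% variant allele ratio
MIN_VALID_FRACTION = 0.70  # variants need >= 70% valid calls across samples
HWE_P_CUTOFF = 1e-5        # variants with HWE p < 1e-5 are excluded


def call_is_valid(depth, alt_ratio, genotype):
    """Per-sample call filter; genotype = number of variant alleles (0, 1, 2)."""
    if depth < MIN_DEPTH:
        return False
    if genotype == 1 and alt_ratio < MIN_HET_RATIO:
        return False
    return True


def hwe_chi2_pvalue(n_ref_hom, n_het, n_alt_hom):
    """Chi-square HWE test with 1 degree of freedom (assumed test)."""
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_variant(valid_genotypes, n_samples):
    """valid_genotypes: genotypes (0/1/2) of samples that passed call_is_valid."""
    if len(valid_genotypes) < MIN_VALID_FRACTION * n_samples:
        return False
    counts = [valid_genotypes.count(g) for g in (0, 1, 2)]
    if counts[1] + counts[2] == 0:  # invariant position
        return False
    return hwe_chi2_pvalue(*counts) >= HWE_P_CUTOFF
```

A variant with genotype counts close to Hardy–Weinberg proportions passes, whereas a variant observed only as the two homozygous classes in equal numbers is rejected by the HWE filter.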

#### Global Validation

Evaluation of ADME panel sequencing data was performed by direct comparison of sample genotypes to available genome-wide SNP data (Illumina HumanHAP300 SNP; GEO Series accession number GSE39036; Schröder et al., 2013) as well as genotype data of 87 individual SNV determinations obtained with several other genotyping methods in former studies (RFLP, Sanger sequencing, TaqMan allelic discrimination, MALDI-TOF, and other arrays) from the same sample set. Array variant data

<sup>2</sup>http://pharmaadme.org

<sup>3</sup>https://www.pharmgkb.org/

<sup>4</sup>http://dkoboldt.github.io/varscan

<sup>5</sup>https://genome.ucsc.edu/cgi-bin/hgVai

were "lifted" to GRCh37 (hg19), and only SNVs within the target regions defined above and with HWE p-value > 10<sup>−5</sup> were extracted (n = 276). Finally, genotype data for 363 variants were available for validation. Concordance of genotype data from ADME NGS and results from orthogonal methods was evaluated by computing the percentage of identical genotype calls over all variants and samples. Variant positions within the above defined target boundaries were extracted from publicly available databases from the Exome Aggregation Consortium ExAC<sup>6</sup> (Lek et al., 2016) and the 1000 Genomes project<sup>7</sup> (1000 Genomes Project Consortium et al., 2015). In total, 11,558 and 68,918 variants were retrieved in the requested genomic regions from 1000G and ExAC, respectively. Chromosomal

<sup>6</sup>http://exac.broadinstitute.org/

<sup>7</sup>http://www.internationalgenome.org/

position and nucleotide change (reference/alternative) were used to identify corresponding variants in the ADME NGS panel data. After adjusting frequency data to MAF numbers ranging between 0 and 50%, MAF from European (EUR, 1000G) or non-Finnish European (NFE, ExAC) populations were compared to the observed MAF from our cohort. In addition, several well-known variants in CYP2D6, CYP2C9, CYP2C19, CYP2B6, NAT2 and DPYD were confirmed by Sanger sequencing. A concordance of 100% was observed covering 57 SNVs in 19 samples.
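Two of the calculations in this validation step are simple enough to sketch in Python: the concordance as the percentage of identical genotype calls over all variant/sample pairs, and the folding of an allele frequency into the 0–50% MAF range. The function names and the dictionary-based data layout are hypothetical, and restricting the comparison to pairs typed by both methods is an assumption.

```python
def genotype_concordance(ngs_calls, reference_calls):
    """Percentage of identical genotype calls.

    Both arguments map (variant_id, sample_id) -> genotype. Only pairs
    that are present and non-missing in both data sets are compared
    (an assumption; the handling of missing calls is not described).
    """
    shared = [
        key for key in ngs_calls
        if ngs_calls[key] is not None and reference_calls.get(key) is not None
    ]
    if not shared:
        raise ValueError("no overlapping genotype calls to compare")
    identical = sum(ngs_calls[key] == reference_calls[key] for key in shared)
    return 100.0 * identical / len(shared)


def fold_to_maf(allele_frequency):
    """Map an alternative-allele frequency in [0, 1] to a MAF in [0, 0.5]."""
    return min(allele_frequency, 1.0 - allele_frequency)


ngs = {("v1", "s1"): 0, ("v1", "s2"): 1, ("v2", "s1"): 2, ("v2", "s2"): 1}
ref = {("v1", "s1"): 0, ("v1", "s2"): 1, ("v2", "s1"): 2, ("v2", "s2"): 0}
print(genotype_concordance(ngs, ref))  # three of four calls agree
print(fold_to_maf(0.8))
```

Folding is needed because a database may report the frequency of the alternative allele even when that allele is the major one in the population being compared.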

#### In silico Prediction


The impact of coding variants on protein function was predicted using Polyphen 2 (PP2<sup>8</sup>; Adzhubei et al., 2013) as well as the Provean Human Genome Variants tool [Protein Variation Effect Analyzer (PROV)<sup>9</sup>; Choi et al., 2012], which provides Provean and, in addition, SIFT (Sorting Intolerant from Tolerant; Sim et al., 2012) scores. All algorithms are based, among other features, on sequence conservation and were used with default settings. For a total of 4,017 coding variants including missense (n = 3,893), frameshift (n = 37), initiator codon (n = 7), stop codon (n = 46) and other coding variants (**Table 1**), prediction was performed using chromosomal genomic positions, reference and variant nucleotide. Functional predictions of the type LOF versus tolerated (TOL) were retrieved from Provean (cutoff −2.5; deleterious/neutral), SIFT (cutoff 0.05; damaging/tolerated) and Polyphen 2 (probably and possibly damaging/benign). It must be pointed out that frameshift variants (n = 37) as well as mutations of stop codons (gain/loss; n = 46) are not predictable by these tools.
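How the three tool outputs combine into a concordant LOF or TOL call can be sketched as follows. This is a hedged illustration, not the authors' code: the function name and return labels are hypothetical, the PROVEAN threshold is taken as the tool's documented default of −2.5 (scores at or below it are deleterious), and treating scores exactly at a cutoff as damaging is an assumption.

```python
PROVEAN_CUTOFF = -2.5   # PROVEAN default: score <= -2.5 is "deleterious"
SIFT_CUTOFF = 0.05      # SIFT: score <= 0.05 is "damaging"
PP2_DAMAGING = {"probably damaging", "possibly damaging"}


def consensus_prediction(provean_score, sift_score, pp2_class):
    """Combine three predictors into 'LOF', 'TOL' or 'discordant'."""
    damaging_calls = (
        provean_score <= PROVEAN_CUTOFF,
        sift_score <= SIFT_CUTOFF,
        pp2_class.lower() in PP2_DAMAGING,
    )
    if all(damaging_calls):
        return "LOF"         # concordantly predicted to be damaging
    if not any(damaging_calls):
        return "TOL"         # concordantly predicted to be tolerated
    return "discordant"      # the tools disagree


print(consensus_prediction(-4.1, 0.01, "probably damaging"))
print(consensus_prediction(-0.3, 0.42, "benign"))
print(consensus_prediction(-3.0, 0.20, "benign"))
```

Requiring agreement of all three tools is a conservative choice: variants on which the predictors disagree are left unclassified rather than forced into either category.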

#### Cis-eQTL Analyses

Cis-eQTL analyses of the 15,727 variants (13,838 SNVs and 1,889 INDELs) and their corresponding genes were performed

<sup>8</sup>http://genetics.bwh.harvard.edu/pph2/

<sup>9</sup>http://provean.jcvi.org/index.php

with statistical software R-3.5.0 (R Core Team, 2018) and additional packages SNPassoc (v1.9-2; González et al., 2014), SKAT (v1.3.2.1; Lee, 2017), and illuminaHumanv2.db (v1.26.0; Dunning et al., 2015).

mRNA expression levels were assessed by Human-WG6v2 Expression BeadChip (Illumina, Eindhoven, Netherlands) and preprocessed as described (Schröder et al., 2013). Probe sets were re-annotated using the R package illuminaHumanv2.db (Dunning et al., 2015). Only probe sets with "good" or "perfect" probe quality as defined by illuminaHumanv2fullReannotation were considered for the eQTL analyses. Of the 340 ADME and ADME-related genes described above, 303 genes (89%) were represented on the Human-WG6v2 Expression BeadChip with at least one "good" or "perfect" probe set. If several "good" or "perfect" probe sets were annotated to a gene, the data of all these probe sets (i.e., log2 normalized expression signals) were averaged, finally resulting in an expression matrix of size 303 genes × 150 samples for the eQTL analyses. Of the 15,727 variants, 14,294 (90.9%) were annotated to one of the 303 genes.

For individual eQTL analyses, only variants with MAF ≥ 2% and annotated to one of the 303 genes (n = 3,241) were considered, in order to avoid testing variants with very few minor allele carriers (a MAF ≥ 2% in 150 patients corresponds to at least 3 minor allele carriers; in our dataset, all variants with MAF ≥ 2% actually comprised at least 4 minor allele carriers). For 8 of the 303 genes, only variants with MAF < 2% were annotated in the ADME NGS panel (ABCB9, ALDH2, CYP11A1, GSTK1, GSTM1, GSTT1, PRMT1, and SULT1A4), leaving 295 genes and 3,241 variants for individual cis-eQTL analyses. These analyses were performed using the generalized linear model framework of R-package SNPassoc (González et al., 2014), considering four different genetic models: codominant, dominant, recessive, and additive. Only the minimal p-value of the four genetic models for each SNP was reported. Besides univariate analyses, cis-effects of variants on mRNA expression were analyzed controlling for 10 covariates [sex, age, smoking, alcohol consumption, diagnosis, C-reactive protein (CRP) level, cholestatic liver disease, presurgical medication (no drugs, P450


TABLE 1 | Structural classification of ADME panel variants (n = 15,727).

<sup>a</sup>Classification nomenclature according to Ensembl variation sequence ontology terms. <sup>b</sup>Known/novel: with/without dbSNP database identifier. <sup>c</sup>Including: codingexon-variant, stop-retained.

FIGURE 2 | Variability of gene families. (A) Distribution of known and novel variants in ADME gene families. The numbers of observed known and novel variants (including SNVs and INDELs) per gene are shown for the seven major functional classes of ADME genes defined in Figure 1. Open boxes, known variants; filled boxes, novel variants; boxes show median with 75th and 25th percentiles and whiskers represent 10th and 90th percentiles. Lower part: statistical significance calculated by Kruskal–Wallis with Dunn's multiple comparison test of total number of variants per gene between family groups: \*P ≤ 0.05, \*\*\*P ≤ 0.001. (B) Functional categorization of variants. Total number and proportion of variants observed in each functional class is shown separately for known and novel variants. Functional classes are defined as follows: 5′UTR, upstream and 5′ untranslated region; MIS, initiator codon, missense and stop codon variants; SPLICE, variants in consensus splice site acceptor and donor regions; 3′UTR, downstream and 3′ untranslated region; OTHER, other functional classes (intronic, frameshift, synonymous, other coding and non-coding variants). (C) Comparison of minor allele frequencies (MAF) between novel and known observations. Total number of known observations with dbSNP identifier (open white bars; n = 8,454), novel observations (filled purple bars; n = 7,273); dotted lines mark MAF = 2% and 5%.

inducer and other drugs), serum total bilirubin (TBILI) level, and serum gamma glutamyl transferase (GGT) level; see further details in **Supplementary Table S1**]. We used the Bonferroni method for multiple testing correction and set the significance level at 0.05/3,241 = 1.54E-05.
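The arithmetic behind the MAF cutoff and the Bonferroni threshold, together with the genotype codings implied by the four genetic models, can be sketched in Python. This is only an illustration with hypothetical function names; the actual analyses were run with the R package SNPassoc, whose internals are not reproduced here, and the codings shown are the conventional ones for these model names.

```python
import math

# Conventional codings of a biallelic genotype g (0, 1 or 2 minor alleles).
# The codominant model treats genotypes as categories, hence two indicators.
MODEL_CODINGS = {
    "additive": lambda g: (g,),
    "dominant": lambda g: (int(g >= 1),),
    "recessive": lambda g: (int(g == 2),),
    "codominant": lambda g: (int(g == 1), int(g == 2)),
}


def encode(genotypes, model):
    """Design-matrix rows for a list of genotypes under one genetic model."""
    return [MODEL_CODINGS[model](g) for g in genotypes]


def min_minor_allele_carriers(maf, n_samples):
    """Fewest carriers compatible with a MAF (all carriers homozygous)."""
    minor_alleles = math.ceil(maf * 2 * n_samples - 1e-9)
    return (minor_alleles + 1) // 2


def bonferroni_threshold(alpha, n_tests):
    return alpha / n_tests


def best_model(pvalues_by_model):
    """Report only the model with the minimal p-value, as done in the text."""
    model = min(pvalues_by_model, key=pvalues_by_model.get)
    return model, pvalues_by_model[model]


print(min_minor_allele_carriers(0.02, 150))   # 6 minor alleles in 300 chromosomes
print(bonferroni_threshold(0.05, 3241))       # about 1.54E-05
```

Note that choosing the minimal p-value across four models adds a further layer of multiple testing beyond the per-variant Bonferroni correction; the sketch simply mirrors the reporting rule stated in the text.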

Moreover, we performed combined cis-eQTL analyses of the rare variants (MAF < 2%; n = 11,053) using the optimal unified association test framework for sets of variants (SKAT-O; Lee et al., 2012) implemented in R-package SKAT. To be more precise, for each of the 303 genes, the association of the set of all rare variants annotated to this gene and the corresponding mRNA expression data was investigated applying the SKAT-O test with standard weights. The same 10 covariates as in the eQTL analysis of common variants were used for an analogous multivariate SKAT-O analysis. For combined cis-eQTL analysis of rare variants, the Bonferroni-corrected significance level was set to 0.05/303 = 1.65E-04.

### RESULTS

#### Development and Performance of the Targeted ADME NGS Panel

**Figure 1A** gives an overview of the project workflow. The selection of genes was based on the PharmaADME.org gene lists "core" and "extend" and the PharmGKB VIP genes and was complemented with numerous additional genes of potential relation to drug metabolism (**Figure 1B**). All 340 genes finally included were targeted for all exons, exon/intron boundaries, as well as 5′ and 3′ UTRs. An extended 5′ region of 2 kb was included for a group of 29 selected genes. The total panel size comprised 1,382 kbp distributed over all chromosomes except the Y chromosome (**Figure 1C** and **Supplementary Table S2**). In our cohort of 150 liver samples, the gene target regions were covered to a mean read-depth of 343× (25th percentile = 265; 75th percentile = 398; **Supplementary Figure S1A**). More than 98% of the target regions were covered at more than 30×. The highest coverage was obtained for UGT2B11 (average 811), while GSTT2B showed the lowest average coverage of 27. These discrepancies did not hinder our analysis and can be resolved in a further iteration of the design. Overall, 99% of the genes were covered on average at least 100-fold. Direct comparison of variant annotation with 363 available genotypes determined independently by other methods revealed an overall concordance of >99% (**Supplementary Figure S2**). The accuracy obtained with data derived from the Illumina HumanHap300 genotyping platform (99.3%) was slightly lower compared to data from other genotyping methods (99.6%), which may be due to inaccurate genotype

TABLE 2 | eQTL analysis: Significant associations from multivariate regression models after Bonferroni correction (only minimal p-values of four genetic models used are listed).


(Continued)

#### TABLE 2 | Continued

fgene-10-00007 January 29, 2019 Time: 16:59 # 8


<sup>a</sup>Variant identifier "chromosome \_ position \_ reference nucleotide \_ variant nucleotide". <sup>b</sup>Genetic model with minimal p-value: A, additive; R, recessive; D, dominant; C, codominant.

calling by the array method. Further details on performance and validation of the ADME NGS panel are presented in the Sections "Materials and Methods" and **Supplementary Material**.

#### Analysis of DNA Variants

A total of 16,928 genetic variants were detected within the defined target regions. Of these, 1,201 were excluded from further analysis because of low genotype quality (n = 505) or due to HWE p-values below 10<sup>−5</sup> (n = 696). The remaining 15,727 variants comprised 13,838 SNVs and 1,889 variants classified as small insertions or deletions (INDELs). The length changes of these ranged from deletion of 33 nucleotides up to insertion of 20 nucleotides, with 1 bp deletions or insertions being the most frequent. Larger structural variants including copy number variations (CNVs) are currently under investigation using other methods.
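The Hardy-Weinberg filter described above can be illustrated with a short Python sketch. It assumes a one-degree-of-freedom chi-square test on biallelic genotype counts; the text does not state which HWE test was applied (an exact test is a common alternative), and the function names are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(counts, threshold: float = 1e-5) -> bool:
    """Keep a variant unless its HWE p-value falls below the threshold (1e-5 in the text)."""
    return hwe_chi2_pvalue(*counts) >= threshold


# Hypothetical genotype counts for 150 samples.
print(passes_hwe((54, 72, 24)))  # counts close to HWE expectations
print(passes_hwe((75, 0, 75)))   # no heterozygotes at all: strong deviation
```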

As expected, most SNVs were biallelic; only 62 were triallelic, and no tetraallelic variants were found. Among triallelic variants, transversions were more common (n = 80) than transitions, and G to T and G to A were the most common observations (n = 26 and n = 25, respectively).

None of the sequenced regions was invariant. On average, we observed 10.5 variants/kbp, corresponding to a mean distance of variants of 95 bp. Based on SNV density, the least variable genes were UGT1A9 and UGT1A10 with <2 SNVs/kbp and the genes with highest observed variant densities were CYP4F11 (42 SNVs/kbp) and CYP2D6 (31 SNVs/kbp) (**Supplementary Figure S1C**).

Variant annotation revealed that 7,273 (46.2%) of the variants were not yet annotated in the NCBI dbSNP database (dbSNP build 151, March 2018) and were thus considered novel observations. **Figure 2A** displays the number of variants per gene for known and unknown variants in the different ADME gene groups, while **Figure 2B** depicts the fraction of variants according to functional annotation. The number of variants per gene was highest in the ABC and SLC/SLCO transporters and lowest in phase II genes. As reported in several recent studies, the number of novel observations was substantial in all gene and functional groups (Fujikura et al., 2015; Gordon et al., 2016; Han et al., 2016). Of 15,727 SNV and small INDEL variants, 12,022 had a MAF below 2%, including 8,937 singletons. Of the 7,273 novel variants, 7,139 (>98%) had MAFs below 2% (**Figure 2C**), while 80 (1.1%) had MAFs ≥ 5%. Most of these were located in non-coding regions.

Functional classification based on major transcripts for each gene according to the UCSC database revealed 6,058 variants in coding regions (including 3,893 missense and 46 stop gain variants; **Table 1** and **Figure 2B**) and 9,669 variants in various non-coding regions (e.g., 1,000 in 5′ UTRs and 4,138 in 3′ UTRs; **Table 1** and **Figure 2B**). We also analyzed 36 VIP genes, derived from the PharmGKB and PharmaADME websites, separately for novel SNVs. In total we observed 502 unannotated variants in these genes (dbSNP151), 120 of them representing missense variants (**Supplementary Table S3**).

For comparison with publicly available population data, we extracted small variants from the 1000 Genomes (EUR population) and ExAC (NFE, non-Finnish European) databases for the ADME NGS panel target regions, resulting in 11,558 and 68,918 variants, respectively (**Supplementary Figure S3A**). The MAFs of the matching variants in our sample set (ExAC/NFE: n = 2,993; 1000G/EUR: n = 4,913) were in good correlation with published population frequency data (Pearson r = 0.96 and r = 0.98 for the EUR and NFE populations, respectively). The median MAF of these SNVs was 1.16% for NFE and 2.98% for EUR. We did not detect another 6,645 (EUR) and 65,925 (NFE) known variants with median MAFs of 0.1% (EUR) and 0.002% (NFE) (**Supplementary Figures S3A,B**). Together these data indicate that mainly very rare variants with allele frequencies below 0.1% were missed in our cohort.
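The frequency comparison described above amounts to matching variants by identifier and correlating the two MAF vectors. A minimal Python sketch under stated assumptions: variants are keyed by a "chromosome_position_ref_alt" string, and all identifiers and frequencies below are invented toy values, not data from the study.

```python
import math


def matched_mafs(cohort: dict, reference: dict):
    """Return paired MAF lists for variants present in both data sets."""
    shared = sorted(cohort.keys() & reference.keys())
    return [cohort[k] for k in shared], [reference[k] for k in shared]


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy frequencies; keys follow "chromosome_position_ref_alt".
cohort = {"chr1_100_A_G": 0.10, "chr1_200_C_T": 0.20,
          "chr2_300_G_A": 0.30, "chr3_50_T_C": 0.01}
reference = {"chr1_100_A_G": 0.12, "chr1_200_C_T": 0.22,
             "chr2_300_G_A": 0.32, "chr9_1_A_C": 0.001}

ours, published = matched_mafs(cohort, reference)
print(len(ours), round(pearson_r(ours, published), 3))
```

Variants found in only one of the two sets (here `chr3_50_T_C` and `chr9_1_A_C`) drop out of the correlation, which mirrors the distinction in the text between matching variants and known variants that were not detected in the cohort.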

#### Association With Expression Levels

To directly evaluate the functional impact of variants, we assessed liver mRNA expression in an existing dataset (Schröder et al., 2013). To ensure high data quality only mRNA expression data of genes with "perfect" or "good" probes (see section "Materials and Methods") were considered (available for n = 303 genes). Due to sample size and statistical power considerations, we performed separate analyses for less common (MAF < 2%) and more common (MAF ≥ 2%) variants.
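With 150 samples (300 alleles), the 2% MAF cut-off used for this split corresponds to a minor allele count of six. A small Python sketch of the classification follows; the function names are illustrative and are not taken from the study's pipeline.

```python
def minor_allele_frequency(alt_allele_count: int, n_samples: int) -> float:
    """MAF of a biallelic site genotyped in diploid samples."""
    total = 2 * n_samples
    if not 0 <= alt_allele_count <= total:
        raise ValueError("allele count out of range")
    return min(alt_allele_count, total - alt_allele_count) / total


def frequency_class(maf: float, cutoff: float = 0.02) -> str:
    """'less common' for MAF below the cut-off, 'more common' otherwise."""
    return "less common" if maf < cutoff else "more common"


# A singleton in 150 samples has MAF 1/300; six copies reach the 2% cut-off.
for count in (1, 5, 6):
    maf = minor_allele_frequency(count, 150)
    print(count, f"{maf:.4f}", frequency_class(maf))
```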

To evaluate the impact of more common variants (n = 3,241) on expression of the corresponding genes we performed cis-eQTL analysis using univariate regression models. This analysis revealed significant associations for 94 variants after Bonferroni correction. In multivariate analysis with correction for 10 covariates (see section "Materials and Methods") 90 variants in 31 genes remained significant after Bonferroni correction (minimal p-value of the four genetic models < 1.54E-05; **Figure 3** and **Table 2**). Interestingly, 62 (70%) of these were located in noncoding regions, and most of these (n = 40) in 3′ UTR regions. Of note, three eQTLs represented PharmGKB VIP genes (CYP2D6: rs1080985; CYP3A5: rs15524; VKORC1: rs7294).
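The four genetic models differ only in how a genotype is turned into regression predictors. The Python sketch below uses the standard textbook codings (additive, dominant, recessive, codominant) with respect to the minor allele; the exact parametrization used in the study is not given in this section, so this is an assumption, and the helper names and p-values are hypothetical.

```python
def encode_genotype(minor_allele_count: int, model: str) -> tuple:
    """Regression predictor(s) for a genotype carrying 0, 1 or 2 minor alleles."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":      # linear effect per minor allele
        return (minor_allele_count,)
    if model == "dominant":      # one copy is sufficient
        return (int(minor_allele_count >= 1),)
    if model == "recessive":     # two copies are required
        return (int(minor_allele_count == 2),)
    if model == "codominant":    # separate effects for het and hom carriers
        return (int(minor_allele_count == 1), int(minor_allele_count == 2))
    raise ValueError(f"unknown model: {model}")


def minimal_p_value(p_by_model: dict) -> tuple:
    """Return (model, p) for the model with the smallest p-value."""
    model = min(p_by_model, key=p_by_model.get)
    return model, p_by_model[model]


# Hypothetical p-values for one variant under the four models.
p_values = {"additive": 3e-7, "dominant": 2e-6, "recessive": 0.04, "codominant": 8e-7}
best_model, best_p = minimal_p_value(p_values)
print(best_model, best_p < 0.05 / 3241)  # compare with the Bonferroni threshold
```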

Association analysis of rare variants is challenging. To overcome the problem of limited sample size/statistical power, various methods have been developed to test sets of rare variants. Here we used the SKAT-O approach (Lee et al., 2012) for groupwise association of all rare variants in a gene with mRNA expression data. These variants are incorporated into a gene-wise test statistic via a weighted sum. Thus, p-values relate to genes, not to variants. SKAT-O combines the strengths of burden tests and of the variance-component test SKAT, thereby being powerful in different scenarios, i.e., when many variants of a gene are associated with expression levels and have the same effect direction, or when there are only few associated variants or variants that differ in effect direction. **Figure 4A** summarizes the results for univariate and multivariate SKAT-O analyses. After correction for multiple testing, two associations, for ADH1C and GSTO1, remained statistically significant. Further details showing expression levels of individual carriers are presented in **Figure 4B**. For example, five samples with a rather low expression were heterozygous carriers of the SNP chr10\_106027186 A > T (3′ UTR; rs17885600), including the two individuals with the lowest GSTO1 levels (**Figure 4B**). Hence, SKAT-O analysis resulted in identification of at least two genes with plausible genotype–phenotype correlations for variants with MAF < 2%.
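To make the weighted-sum idea concrete, the following Python sketch computes the two ingredients that SKAT-O interpolates between: a burden-type statistic and a SKAT-type statistic, combined as Q(ρ) = (1 − ρ)·Q_SKAT + ρ·Q_burden. It is a simplified illustration, not the R package SKAT: it assumes no covariates, Beta(1, 25) weights as the "standard weights", omits scaling constants and the p-value calculation, and uses invented toy data.

```python
def beta_1_25_weight(maf: float) -> float:
    """Beta(1, 25) density at the MAF: 25 * (1 - maf)**24, larger for rarer variants."""
    return 25.0 * (1.0 - maf) ** 24


def weighted_scores(genotypes, expression, mafs):
    """w_j * S_j with S_j = sum_i g_ij * (y_i - mean(y)); no covariates in this sketch."""
    mean_y = sum(expression) / len(expression)
    residuals = [y - mean_y for y in expression]
    scores = [
        sum(row[j] * r for row, r in zip(genotypes, residuals))
        for j in range(len(mafs))
    ]
    return [beta_1_25_weight(m) * s for m, s in zip(mafs, scores)]


def q_burden(ws) -> float:
    """Burden-type statistic: square of the summed weighted scores."""
    return sum(ws) ** 2


def q_skat(ws) -> float:
    """SKAT-type statistic: sum of the squared weighted scores."""
    return sum(v * v for v in ws)


def q_rho(ws, rho: float) -> float:
    """Interpolation between SKAT (rho = 0) and burden (rho = 1)."""
    return (1.0 - rho) * q_skat(ws) + rho * q_burden(ws)


# Toy data (hypothetical): two singleton variants with opposite effects.
genotypes = [[0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 1]]
expression = [5.0, 5.0, 5.0, 5.0, 7.0, 3.0]
mafs = [1 / 12, 1 / 12]

ws = weighted_scores(genotypes, expression, mafs)
print(q_burden(ws), q_skat(ws) > 0)
```

In this toy case the opposite effect directions cancel in the burden statistic while the SKAT-type statistic remains positive, which is the scenario in which a pure burden test loses power.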

#### Prediction of Functional Effects

We concentrated on coding variants resulting in amino acid change (missense), frameshift, or affecting initiator and stop codons, together accounting for 66% of coding variants and one fourth of all variants (**Figure 2B**). We used the common tools PolyPhen-2 (PP2), PROVEAN (PROV), and SIFT, which make dichotomous functional predictions of the type "loss of function" (LOF) versus "tolerated" (TOL) (Zhou et al., 2018a). Of the analyzed subset of 4,017 coding variants, more than 95% were predictable by these algorithms (PP2, n = 3,818; PROV, n = 3,874; SIFT, n = 3,881). LOF prediction was retrieved concordantly by all three algorithms for 1,466 variants (36.5%) and TOL was concordantly calculated for 1,019 variants (25.4%; **Figure 5A**). In agreement with other studies (Bush et al., 2016; Han et al., 2016; Hovelson et al., 2017) we found that the proportion of LOF- versus TOL-predicted variants was significantly higher among the less common (MAF < 2%) compared to more common variants (Chi-square test, p < 0.0001). With one exception (SLC28A1 G254V, MAF = 2.3%) all novel LOF-predicted variants were less common with MAF < 2% (**Figure 5B**).

Interestingly, transporters and nuclear receptors/transcriptional regulators had large proportions of predicted LOF variants that had not yet been listed in the dbSNP database. The highest number of predicted LOF variants in one gene was observed in NCOR2 (n = 47), and nine ABC transporters (A7, A2, A4, C1, C10, C8, A3, and C11) are found among the genes with the highest LOF-predicted variants (**Figure 5C**).
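The concordance rule used in this subsection (a variant counts as LOF or TOL only when all three tools agree) can be sketched in Python as follows. How variants with a missing prediction were handled is not stated in the text; treating them as "incomplete" is an assumption, and the function name is hypothetical.

```python
def consensus(calls: dict) -> str:
    """Combine per-tool calls ('LOF', 'TOL' or None) into one consensus label."""
    values = list(calls.values())
    if any(v is None for v in values):
        return "incomplete"          # at least one tool gave no prediction
    if all(v == "LOF" for v in values):
        return "LOF"
    if all(v == "TOL" for v in values):
        return "TOL"
    return "discordant"


print(consensus({"PP2": "LOF", "PROV": "LOF", "SIFT": "LOF"}))
print(consensus({"PP2": "LOF", "PROV": "TOL", "SIFT": "LOF"}))

# Shares reported in the text, recomputed from the counts.
n_analyzed = 4017
print(round(100 * 1466 / n_analyzed, 1), round(100 * 1019 / n_analyzed, 1))
```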

#### Integrating Prediction and Association

While the SKAT-O test identified only two significant associations, functional prediction indicated a much larger number of predicted LOF variants, as also reported by others (Han et al., 2016; Hovelson et al., 2017). In contrast to former studies, our data allow inspection of genotype-phenotype correlations individually for each variant and for several available phenotypes. While these extensive data are currently being analyzed, we illustrate here a typical example. Of particular interest are protein levels, as functionally damaging ADME gene variants are frequently associated with lower protein levels. As an example, **Figure 6** shows the correlation of all detected amino acid variants of ABCC11, encoding the drug transporter MRP8, with MRP8 protein levels obtained for the same liver cohort in a previous study (Magdy et al., 2013). Interestingly, carriers of concordantly LOF-predicted variants (n = 73) showed highly variable protein levels (23-fold; coefficient of variation 81%), essentially covering the entire range of MRP8 variability, while carriers of only TOL-predicted variants (n = 30) were spread across a smaller protein range (ninefold; coefficient of variation 53%). Of note, the median protein levels of carriers of LOF-predicted and TOL-only-predicted variants were similar (P = 0.73; **Figure 6**). Thus, our phenotypic data allow identification of several MRP8 low and high expressors in relation to genotype. While there does not seem to be a simple relation between functional prediction and phenotypic expression, our data should be helpful to prioritize variants for further investigation and to improve prediction tools.
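The two variability measures quoted above, fold range and coefficient of variation, can be computed as in the following Python sketch. It assumes the sample standard deviation (n − 1 denominator) and uses invented protein levels; neither detail is taken from the study.

```python
import statistics


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean, in percent."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero")
    return 100.0 * statistics.stdev(values) / mean


def fold_range(values) -> float:
    """Ratio of the highest to the lowest value (requires positive values)."""
    lowest, highest = min(values), max(values)
    if lowest <= 0:
        raise ValueError("values must be positive")
    return highest / lowest


# Hypothetical protein levels (arbitrary units) for two carrier groups.
lof_carriers = [0.5, 1.0, 2.0, 4.0, 8.0]
tol_carriers = [1.0, 1.5, 2.0, 2.5, 3.0]
for name, group in (("LOF", lof_carriers), ("TOL", tol_carriers)):
    print(name, f"{fold_range(group):.0f}-fold",
          f"CV {coefficient_of_variation(group):.0f}%")
```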

### DISCUSSION

In this study we designed a new panel to target 340 ADME genes for NGS. We tested and validated our ADME NGS panel on a cohort of 150 human liver specimens with comprehensive genetic, functional, and medical characterization. This allowed us not only to perform extensive genotype-phenotype correlations to identify novel relationships for common and rare variants but also to compare computational predictions of functional effects with real phenotypes, which should be useful to further develop and optimize prediction algorithms for variant effects.

We designed our ADME NGS panel to comprise 340 genes including most phase I and phase II enzymes, drug transporters and numerous transcriptional regulators and other modifiers of xenobiotics and endogenous substances. We used Agilent in-solution target capture technology to allow informed selection of relevant regions and optimization of coverage on targets. Only four genes, SULT1A3, SULT1A4, MIF, and CYP26C1, were covered below 100-fold. Low coverage of some genes was also observed by others who speculated that common null functional alleles, high sequence homology as well as pseudogenes may disturb capture of such regions (Han et al., 2016). Direct comparison of 363 genotype data available from previous pharmacogenetic studies in the liver cohort revealed an overall accuracy of the ADME NGS panel of >99%. The overall performance of our ADME NGS panel was comparable to other targeted capture sequencing panels (Bush et al., 2016; Gordon et al., 2016; Han et al., 2016; Hovelson et al., 2017). Compared to these other platforms we included a greater number of genes with the intention to investigate not only established ADME genes but also less well known ADME candidate genes.

While several NGS studies of different types recently explored genetic variation in ADME genes (Fujikura et al., 2015; Bush et al., 2016; Han et al., 2016; Kozyra et al., 2016; Hovelson et al., 2017; Schärfe et al., 2017), our study is, to our knowledge, the only one that provides phenotypic measurements in human samples. In this study we analyzed only SNVs and small INDELs, while larger structural variations will be analyzed separately (Tremmel et al., in preparation). For the more common variants (MAF ≥ 2%) multivariate eQTL analysis revealed 90 significantly associated variants, most of them located in noncoding regions. Six of these loci had already been described in our previous genome-wide association study, e.g., rs7294 in the VKORC1 3′ UTR, or rs1201559 (P516L) in SLC22A10 (Schröder et al., 2013). Interestingly, several of the SNVs located in 3′ UTRs (ARNT rs11552229, CYP3A5\*10 rs15524, EPHX2 rs1042032 and rs1042064, UGT2A1 rs4148312 and VKORC1 rs7294) are discussed as potential micro-RNA binding sites, partially proven by tissue eQTL (Wei et al., 2012). Furthermore, our data confirm predicted eQTL effects on expression in liver tissue in the Genotype-Tissue Expression portal (GTEx<sup>10</sup>; Lonsdale et al., 2013), e.g., for the EPHX2 variant rs1042032 and VKORC1 rs7294. Some other eQTLs we found had also been reported previously in the context of phenotype/genotype correlations. For example, rs1080985 in CYP2D6 corresponds to the −1584C > G variant that is linked to the low-expression CYP2D6\*41 allele (Raimundo et al., 2000; Raimundo et al., 2004); the PON1 rs854552 variant had been found in a nutrigenetic approach on markers of cardiovascular disease (Rizzi et al., 2016); and the AOC1 (diamine oxidase) variant rs10156191 was associated with hypersensitivity response to non-steroidal anti-inflammatory drugs (Agúndez et al., 2012).

In contrast to common variants, association of individual rare variants is greatly limited by sample size and thus presents a special challenge. The problem is aggravated by the fact that by far most rare variants occur in heterozygous condition, where any effect could be masked by the variability of the "normal" allele. Furthermore, rare variants can be damaging in many ways, affecting expression, protein abundance, or catalytic function. A single phenotype such as expression may thus not reveal the deleterious nature of a particular variant. Nevertheless, we assume that analysis of gene or protein expression should be most promising, because damaging variants often affect expression negatively. This is the case, for example, for most low-activity CYP variants (e.g., CYPs 2B6, 2C19, 2D6, 3A4, 3A5 mostly due to aberrant splicing; Zanger and Schwab, 2013), and many established variants of clinical relevance like UGT1A1\*28 and Gilbert's syndrome (Ehmer et al., 2012) and VKORC1 variants in warfarin metabolism (Li et al., 2009). Our statistical approach to relate rare variants to gene expression data by SKAT-O test revealed two significant associations for rare variants of ADH1C and GSTO1, both of which appear highly plausible and would not have been detected by the cis-eQTL analysis. The variant rs283413 in ADH1C, a stop gain mutation at protein position G78, is discussed as a risk factor for Parkinson's disease (Buervenich et al., 2005) and alcohol biodisposition (Martínez et al., 2010; Way et al., 2015). To our knowledge, the GSTO1 rare variants have so far not been reported to be associated with expression, but a significant genotype influence of the 3′ UTR SNP rs17885600 on expression of the adjacent GSTO2 in liver tissue supports a potential eQTL effect of this variant (Lonsdale et al., 2013).

As a further approach to identify deleterious ADME rare variants, we used computational prediction, which has recently been used in several studies (Bush et al., 2016; Han et al., 2016; Hovelson et al., 2017). However, none of these studies provided phenotypic information to compare prediction with a phenotypic parameter. Similar to other studies we found a considerable fraction of all variants (36.5%) to be predicted as damaging by all three prediction tools used. Somewhat unexpectedly, preliminary analyses did not reveal statistically significant associations between LOF-predicted variants and lower expression. As illustrated by the example of ABCC11 and MRP8 protein abundance, LOF-predicted variants were not more frequently associated with lower protein levels than TOL-predicted variants. Thorough analyses of these data are currently in progress. A recent advanced approach integrated prediction and functional activity data available from diverse sources to develop an improved prediction framework adapted to pharmacogenetic assessments (Zhou et al., 2018b). Our data should be highly valuable to test and further improve such approaches.

### CONCLUSION

We designed a new targeted NGS pipeline to determine SNVs and small INDELs for 340 ADME genes and used it to analyze 150 well characterized human liver samples. In addition to common known variants we confirmed the existence of large numbers of rare and previously unknown germline variants. Available phenotypic information on the samples allowed us to elucidate numerous novel eQTLs for common variants and to identify novel relationships between rare variants and expression. Furthermore our data allow direct comparison of computationally predicted functional effects for coding variants with actual phenotypes. Using data for the transporter ABCC11/MRP8, we showed that variants predicted as deleterious are present in both high and low expressors of MRP8. While this emphasizes challenges and current limitations of computational prediction approaches to integrate rare variants into pharmacogenomics, such data are important to assess and improve the current strategies.

#### AUTHOR CONTRIBUTIONS

KK, ES, MS, UZ, SF, and SB designed the study. KK, UZ, and MS provided DNA samples. SF, FB, TS, and SB designed the panel and generated sequencing data. KK, RT, SW, and SF analyzed the data. KK, RT, and UZ wrote the manuscript. All authors contributed to editing and final proofreading the manuscript.

# FUNDING

This study was supported by the Robert Bosch Foundation, Stuttgart, Germany and the European Commission Horizon 2020-PHC-2015 grant U-PGx 668353.

# ACKNOWLEDGMENTS

We thank Dr. Florian Büttner for help with large datasets (1000G, ExAC). The excellent technical assistance of Igor Liebermann is gratefully acknowledged.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00007/full#supplementary-material

<sup>10</sup>https://commonfund.nih.gov/gtex




**Conflict of Interest Statement:** SF, FB, TS, and SB were employed by CeGaT GmbH, Tübingen.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Klein, Tremmel, Winter, Fehr, Battke, Scheurenbrand, Schaeffeler, Biskup, Schwab and Zanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Next-Generation Sequencing of PTGS Genes Reveals an Increased Frequency of Non-synonymous Variants Among Patients With NSAID-Induced Liver Injury

María Isabel Lucena<sup>1†</sup>, Elena García-Martín<sup>2†</sup>, Ann K. Daly<sup>3</sup>, Miguel Blanca<sup>4</sup>, Raúl J. Andrade<sup>1</sup> and José A. G. Agúndez<sup>2</sup>\*

<sup>1</sup> Unidad de Gestión Clínica de Aparato Digestivo, Servicio de Farmacología Clínica, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas Málaga, Instituto de Investigación Biomédica de Málaga, Hospital Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain, <sup>2</sup> Instituto de Salud Carlos III, University Institute of Molecular Pathology Biomarkers, UNEx, ARADyAL, Cáceres, Spain, <sup>3</sup> Liver Research Group, Institute of Cellular Medicine, The Medical School, Newcastle University, Newcastle upon Tyne, United Kingdom, <sup>4</sup> Servicio de Alergología, Hospital Infanta Leonor, ARADyAL, Madrid, Spain

#### Edited by:

George P. Patrinos, University of Patras, Greece

#### Reviewed by:

Volker Martin Lauschke, Karolinska Institute (KI), Sweden Su-Jun Lee, Inje University, South Korea

> \*Correspondence: José A. G. Agúndez jagundez@unex.es

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Genetics

Received: 19 November 2018 Accepted: 08 February 2019 Published: 28 February 2019

#### Citation:

Lucena MI, García-Martín E, Daly AK, Blanca M, Andrade RJ and Agúndez JAG (2019) Next-Generation Sequencing of PTGS Genes Reveals an Increased Frequency of Non-synonymous Variants Among Patients With NSAID-Induced Liver Injury. Front. Genet. 10:134. doi: 10.3389/fgene.2019.00134

Purpose: The etiopathogenesis of drug-induced liver injury (DILI) is still far from being elucidated. This study aims to investigate genetic variations in DILI that are related to the drug target, specifically in the genes coding for the cyclooxygenase (COX) enzymes.

Methods: Using next-generation sequencing, we analyzed the genes coding for COX enzymes (PTGS1 and PTGS2) in 113 individuals, 13 of whom were patients with DILI caused by COX-inhibitors.

Results: The key findings of the study are the increased frequency, among DILI patients, of SNPs causing alterations in transcription factor binding sites and of non-synonymous PTGS gene variants, as compared to control subjects. Moreover, the association with non-synonymous SNPs was exclusive to DILI patients with late onset (50 days or more; Pc < 0.001 as compared to DILI patients with early onset or to control subjects).

Conclusions: Our findings suggest an interaction of long-term exposure to COX inhibitors combined with functional variants of the COX enzymes in the risk of developing DILI. This is a novel observation that might have been overlooked by previous genetic studies on DILI because of the limited coverage of PTGS genes in exome chips.

Keywords: PTGS1, PTGS2, next generation sequencing, drug-induced liver injury, COX1, COX2

# BACKGROUND

Although drug-induced liver injury (DILI) is a rare adverse drug event, it is often life-threatening because of the risk of developing acute liver failure. The mechanisms underlying DILI risk are not well understood and hence, the search for biomarkers of DILI risk is a major research field that aims to identify markers that could be used as both proof of the mechanisms involved and of the risk factors that can be used for DILI prediction, as has already been done with many pharmacogenomics biomarkers (Lucena et al., 2008, 2010; Agúndez, 2009; Agundez et al., 2009, 2011; Andrade et al., 2009; Robles-Diaz et al., 2016; Nicoletti et al., 2017). There are presently several independent hypotheses to explain idiosyncratic DILI, but none of these is able to explain all the circumstances in which DILI occurs.

Some genetic biomarkers for DILI have been identified, either with mechanistically based case-control strategies or with GWAS/exome sequencing approaches [for a review, see (Robles-Diaz et al., 2016)]. However, the involvement of genetic changes in DILI risk (for instance HLA risk alleles) has been documented for only a few drugs (Kaliyaperumal et al., 2018). On the other hand, case-control genotyping studies, GWAS and exome sequencing have important limitations because only some SNPs are tested, and most of the target sequence is not checked. To overcome this problem, deep sequencing comprising whole genes is necessary.

In this study, we analyzed the potential effect of mutations in cyclooxygenase genes (PTGS1 and PTGS2) on DILI risk related to NSAIDs. From a mechanistic point of view, such a risk could be related to genetic alteration in the arachidonic acid pathways, which are closely related to inflammation. On the other hand, adverse drug events for drugs acting on the COX enzymes (that is, COX inhibitors) may be more likely if COX activity is altered because of genetic variations. For this reason, we analyzed patients who developed DILI after the administration of COX-inhibitors and healthy individuals who tolerated COX-inhibitors.

# CASE PRESENTATION

Thirteen patients (8 women and 5 men) who experienced DILI caused by COX inhibitors and 100 individuals who tolerated COX-inhibitors at standard doses were included in this study. The culprit drug for DILI and clinical details of patients are shown in **Table 1**. Gender-matched control individuals who tolerated COX-inhibitors (62 women and 38 men) were recruited among staff and medical students of the Hospitals and the Universities participating in this study. Individuals who were considered healthy after a medical examination to exclude pre-existing disorders and a history of adverse events after the use of COX-inhibitors were asked to participate, and over 95% of these agreed to do so. We selected consecutive control subjects matched with patients for drug exposure: 50 control subjects who had received ibuprofen within the month before sample collection, 20 who had received diclofenac, 10 indomethacin, 10 naproxen, and 10 rofecoxib. These frequencies match the frequencies for the DILI patients, except that no control subject received nimesulide, since this drug was withdrawn from the Spanish market because of liver safety concerns. Both patients and controls were Caucasian Spanish individuals. Written informed consent for participation in this case report was obtained from all participants. The protocol for this study was in accordance with the Declaration of Helsinki and its subsequent revisions and was approved by the respective Ethics Committees of the participating Hospitals.

# DESCRIPTION OF LABORATORY INVESTIGATIONS AND DIAGNOSTIC TESTS

To achieve complete gene capture, we sequenced all exons, intron-exon boundaries as well as the 5′ and 3′ flanking regions for both genes. With reference to the GRCh37 assembly of the human genome, the sequences studied were the following: PTGS1: Chromosome 9: 125,131,159 to 125,158,017; PTGS2: Chromosome 1: 186,640,825 to 186,651,605. Partially overlapping amplicons with a size lower than 400 bp were designed. A total of 62 CS1/CS2 tagged primer pairs were synthesized and used to amplify 113 DNA samples using the Access Array platform (Fluidigm). During amplification, samples were labeled with standard MID barcodes designed for the FLX454 sequencing system. After amplification and MID-labeling, individual amplicon libraries were analyzed using a Bioanalyzer 2100 (Agilent) and bioanalyzer traces were used to estimate the amplicon concentration for each sample. Samples were then pooled, and libraries were purified by SPRI using Ampure beads to remove all possible traces of small molecules, primers, primer-dimers, or any other contaminants. The pooled library was again quantified and titrated so that a final amount of 1.95E+10 molecules with an enrichment percentage of 7% was loaded on a PicoTiterPlate (Roche) for a 200-cycle titanium-based sequencing run, performed on FLX-454 equipment. Reads were processed using an amplicon processing pipeline and sff files were used for further analyses. Coverage averaged around 50x for the whole project. Coverage for the SNPs identified (shown in **Supplemental Table 1**) was always over 50x. Sequencing reads were de-multiplexed and aligned using the Amplicon Variant Analyzer software v2.8 (Roche) so that reads for each particular sample-target region combination were analyzed in search of variants. Details of the amplification and sequencing primers are available in **Supplemental Table 1**.

The putative effect of the non-synonymous variants identified was assessed in silico by using the Sorting Intolerant From Tolerant (SIFT) and Polymorphism Phenotyping (PolyPhen) scores as shown on the 1000 Genomes website for every SNP, as well as the online application MutationAssessor (http://mutationassessor.org/r3/).

### RESULTS

The sequencing results (summarized in **Table 2**) reveal that PTGS genes are well conserved. Although dozens of PTGS1 and PTGS2 single nucleotide polymorphisms (SNPs) have been described to occur in Caucasian populations (see Agúndez et al., 2015), our findings show that most of these SNPs were not identified, or were extremely rare, in this cohort.

**Abbreviations:** COX, Cyclooxygenase, prostaglandin-endoperoxide synthase; NSAID, Non-steroidal anti-inflammatory drug; DILI, Drug-induced liver injury; GWAS, Genome-wide association study; SNP, Single nucleotide polymorphism; HLA, Human leukocyte antigen; PTGS1, Prostaglandin-Endoperoxide Synthase 1; PTGS2, Prostaglandin-Endoperoxide Synthase 2.


TABLE 1 | Demographic and clinical characteristics of 13 patients with NSAIDs-induced idiosyncratic liver injury.

TABLE 2 | PTGS1 and PTGS2 variant sequences identified in the study group.

Predicted consequences for missense variants: <sup>a</sup>Nonsynonymous (W8R), SIFT score = 0.85 (tolerated, low confidence), PolyPhen score = 0 (unknown), Mutation Assessor = neutral. <sup>b</sup>Nonsynonymous (P17L), SIFT score = 1.00 (tolerated, low confidence), PolyPhen score = 0 (unknown), Mutation Assessor = low impact. <sup>c</sup>Nonsynonymous (R108Q), SIFT score = 0.12 (tolerated), PolyPhen score = 0.21 (benign), Mutation Assessor = medium impact. <sup>d</sup>Nonsynonymous (K185T), SIFT score = 0.36 (tolerated), PolyPhen score = 0.007 (benign), Mutation Assessor = neutral. <sup>e</sup>Nonsynonymous (R228H), SIFT score = 1.00 (tolerated), PolyPhen score = 0.002 (benign), Mutation Assessor = neutral.

TABLE 3 | Detailed genotype distribution for relevant SNPs.


MAF, Minor allele frequency; UGV, Upstream gene variant; MSV, Missense variant.

TABLE 4 | Haplotype analysis.


Global haplotype association p < 0.0001.

NA, not applicable; \*any nucleotide.

Interestingly, most of the PTGS1 and PTGS2 SNPs included in the Illumina human exome chip or human core exome chip (Urban et al., 2012) are also absent in this study group. This raises doubts about the coverage of exome chips to identify genetic associations related to PTGS1 and PTGS2 genes.

In the whole study population, we identified 31 single nucleotide polymorphisms (SNPs) for PTGS1, including four non-synonymous SNPs. For PTGS2 we identified 31 SNPs, including one non-synonymous SNP. We observed an increased frequency of PTGS1 and PTGS2 mutations among DILI patients, as compared to that observed in control individuals. Most of the SNPs identified in patients were rare among control individuals and were also rare according to the 1000 Genomes database (as shown in **Table 2**). All patients but one (case 1 in **Table 2**) had mutations in the PTGS1 gene and all patients but one (case 5 in **Table 2**) had mutations in the PTGS2 gene. **Table 3** summarizes the comparison of relevant SNPs across patients with late-onset DILI, the rest of DILI patients and control individuals.

# DISCUSSION OF THE UNDERLYING PATHOPHYSIOLOGY AND THE NOVELTY OR SIGNIFICANCE OF THE CASE

The most remarkable findings in this study are the presence among DILI patients of SNPs that alter transcription factor binding sites, such as the PTGS1 SNP rs10306225 (Agundez et al., 2014) and the PTGS2 SNPs rs4648253, rs689466, and rs20417, as well as non-synonymous SNPs such as PTGS1 rs1236913 (W8R), rs3842787 (P17L), rs5787 (R108Q), rs3842792 (K185T), and PTGS2 rs3218622 (R228H). These missense variants are extremely rare among European individuals (Agúndez et al., 2015). The putative effects of the most relevant SNPs shown in **Table 3** have been reviewed elsewhere (Agúndez et al., 2015). In brief, besides the rs10306225 SNP, which is a promoter variant that modifies a CDX1 binding site (Agundez et al., 2014), the rest of the SNPs are non-synonymous. According to functional predictions and functional analyses (reviewed in Agúndez et al., 2015), the SNPs rs1236913 and rs3842787 have little functional effect, although clinical associations of these SNPs with urticaria induced by NSAIDs (Cornejo-Garcia et al., 2012) and myocardial infarction/stroke (Lee et al., 2008; Lemaitre et al., 2009; Gao et al., 2014), respectively, have been proposed. The functional effect of the rs5787 SNP is unknown, although functional prediction suggests a mild functional impact (see **Table 2**). The rs3842792 SNP is predicted to be functionally tolerated (**Table 2**), but in vitro findings suggest reduced functionality (Lee et al., 2007). No functional impact has been described for the PTGS2 SNP rs3218622.

No particular association of missense SNPs with culprit drug, age, gender, clinical presentation, type of liver injury, or severity of the disease was identified. However, as shown in **Table 1**, there is heterogeneity in the duration of treatment before DILI onset. This heterogeneity, rather than being a weakness, is a strong point of this study because it allowed us to compare the frequencies of PTGS gene variations in DILI patients with late and short-term onset. All five DILI patients with the longest times to DILI onset (50 or more days; patients no. 3, 8, 9, 10, and 12 in **Table 1**) had missense variants, and no patient with a shorter time to DILI onset had such variants. The intergroup comparisons for carriers of any non-synonymous PTGS variant were as follows: patients with late DILI onset (50 or more days) vs. the rest of DILI patients, P < 0.001; patients with late DILI onset vs. control individuals, P < 0.001. In contrast, no significant difference in carriers of non-synonymous PTGS variants was observed between patients with DILI onset shorter than 50 days and control subjects (P = 0.325). Haplotype analyses (**Table 4**) and linkage disequilibrium (LD) analyses (**Supplemental Table 2**) show that the risk is due to the presence of rare haplotypes (containing missense variants) in the group of patients with late-onset DILI, and not to LD variations for these variants. The strong association observed in this report, although based on only five cases, suggests a relationship between non-synonymous PTGS gene variations and DILI onset after long-term NSAID therapy. This is a novel observation that has not been raised by previous studies. Although the putative role of PTGS gene variations has been explored using the Illumina human exome chip or human core exome chip, chip coverage was very limited for PTGS genes (Urban et al., 2012). By contrast, this study has complete coverage, thus allowing the identification of previously disregarded SNPs. Another relevant difference from most DILI genetic studies is that in this report we stratified patients according to the time to onset. Heterogeneity in the etiopathogenesis of DILI cannot be ruled out, and it is conceivable that the mechanisms involved in DILI with a late onset might be different from those involved in immediate or short-latency reactions. This study, albeit with the inherent limitations of statistical power that case reports have, reinforces the view that complete gene coverage and detailed phenotype stratification of DILI patients could be essential to strengthen further genetic association studies.
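The late-onset comparison described in this discussion is a 2 × 2 table of carrier status by patient group, and the kind of test suited to such small counts can be sketched as follows. This is an illustration only, assuming a two-sided Fisher's exact test; the section does not name the test the authors used, and the size of the comparison group (`n_rest = 15`) is a hypothetical placeholder, not a figure taken from this report.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Five late-onset patients, all carriers of a missense variant (from the text);
# the 15 remaining patients without carriers are a hypothetical group size.
n_rest = 15  # assumption, not taken from the report
p_value = fisher_exact_two_sided(5, 0, 0, n_rest)
print(f"P = {p_value:.2e}")
```

With a perfectly separated table such as this one, the P value depends only on the two group sizes, which is why the size of the comparison group matters for interpreting a result based on five cases.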

### AUTHOR CONTRIBUTIONS

ML and EG-M participated in the design of the study, in data acquisition, and in critical revision for important intellectual content. AD, MB, and RA participated in the analysis and interpretation of the data and critical revision for important intellectual content. JA participated in the conception, design, data analysis and interpretation, the drafting of the manuscript and critical revision for important intellectual content. All authors approved the final version of the manuscript and all agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved.

# FUNDING

Financed in part by grants PI12/00241, PI12/00378, PI12/00324, PI15/00303, and RETICS RD16/0006/0004 from Fondo de Investigación Sanitaria, Instituto de Salud Carlos III, Spain, and IB16170, GR18145 from Junta de Extremadura, Spain. Financed in part with FEDER funds from the European Union.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00134/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Lucena, García-Martín, Daly, Blanca, Andrade and Agúndez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genetic Association of Olanzapine Treatment Response in Han Chinese Schizophrenia Patients

Wei Zhou<sup>1†</sup>, Yong Xu<sup>2†</sup>, Qinyu Lv<sup>3</sup>, Yong-hui Sheng<sup>4</sup>, Luan Chen<sup>1</sup>, Mo Li<sup>1</sup>, Lu Shen<sup>1</sup>, Cong Huai<sup>1</sup>, Zhenghui Yi<sup>3</sup>\*, Donghong Cui<sup>3,5</sup>\* and Shengying Qin<sup>1,6</sup>\*

<sup>1</sup> Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Bio-X Institutes, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China, <sup>2</sup> Department of Psychiatry, First Hospital of Shanxi Medical University, Taiyuan, China, <sup>3</sup> Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China, <sup>4</sup> Shandong Mental Health Center, Jinan, China, <sup>5</sup> Shanghai Key Laboratory of Psychotic Disorders, Shanghai, China, <sup>6</sup> The Third Affiliated Hospital, Guangzhou Medical University, Guangzhou, China

#### Edited by:

Amit V. Pandey, University of Bern, Switzerland

#### Reviewed by:

Ming Ta Michael Lee, Geisinger Health System, United States; Vindhya Udhane, Medical College of Wisconsin, United States; Julio Benitez, Universidad de Extremadura, Spain

#### \*Correspondence:

Zhenghui Yi, yizhenghui1971@163.com; Donghong Cui, manyucc@126.com; Shengying Qin, chinsir@sjtu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology

Received: 03 September 2018; Accepted: 11 February 2019; Published: 04 March 2019

#### Citation:

Zhou W, Xu Y, Lv Q, Sheng Y-h, Chen L, Li M, Shen L, Huai C, Yi Z, Cui D and Qin S (2019) Genetic Association of Olanzapine Treatment Response in Han Chinese Schizophrenia Patients. Front. Pharmacol. 10:177. doi: 10.3389/fphar.2019.00177

Olanzapine, a second-generation antipsychotic medication, plays a critical role in the current treatment of schizophrenia (SCZ). Responses to olanzapine differ across individuals, but predicting an individual's response requires in-depth knowledge of biomarkers of drug response. Here, we performed an integrative investigation of 238 Han Chinese SCZ patients to identify predictive biomarkers associated with the efficacy of olanzapine treatment. We applied HaloPlex technology to sequence 143 genes in 79 Han Chinese SCZ patients. The results suggested that 12 single nucleotide polymorphisms (SNPs) were significantly associated with olanzapine response in these patients. Using the MassARRAY platform, we tested whether these 12 SNPs were also significantly associated with response in 159 other SCZ patients (independent cohort) and in the combined 238 SCZ patients (both cohorts together). Two SNPs were significantly associated with olanzapine response in both the independent cohort (rs324026, P = 0.023; rs12610827, P = 0.043) and the combined patient population (rs324026, adjusted P = 0.014; rs12610827, adjusted P = 0.012). Our study provides a systematic analysis of genetic variants associated with olanzapine response in Han Chinese SCZ patients. The discovery of these novel biomarkers of olanzapine response may help advance olanzapine treatment tailored to Han Chinese SCZ patients.

Keywords: olanzapine, polymorphism, schizophrenia, pharmacogenetics, biomarker, association study

# INTRODUCTION

Schizophrenia (SCZ) is a severe chronic neuropsychiatric illness. According to a survey carried out in 33 countries, about 15 out of 100,000 individuals suffer from SCZ globally (McGrath et al., 2004). SCZ patients are estimated to have a risk of death about 2.5 times higher than that of healthy individuals (McGrath et al., 2004; Saha et al., 2007). Patients with SCZ require long-term treatment to prevent illness progression or symptom relapse (Howes et al., 2015; Chong et al., 2016). Although the development of SCZ has long been attributed to a combination of genetic and environmental factors, the detailed pathophysiological mechanism of SCZ remains unclear.

Second-generation antipsychotic (SGA) medications are widely considered the most advanced and effective treatment currently available for SCZ patients. SGAs include olanzapine (OLA), risperidone and quetiapine (Owen et al., 2016). However, SCZ patients who receive SGA treatment often experience severe adverse drug reactions (ADRs) (Zhang and Malhotra, 2011). In fact, current SGA therapy can be seen as a subjective "trial and error" process. For some SCZ patients, SGA treatment exerts no therapeutic effect on their symptoms. This individual-specific response to SGAs may be caused by an association between individual-specific genetic variation and the efficacy of SGAs (Zhang and Malhotra, 2011). To address this variability in SGA response across patients, researchers in the pharmacogenomics field are now exploring the possibility of predicting drug response using individual-specific genetic signatures (Arranz and de Leon, 2007).

Olanzapine is one of the most commonly used SGAs. It has shown relatively superior efficacy compared to other SGAs in various clinical trials (Lieberman et al., 2005; Leucht et al., 2009). OLA treatment has been reported to cause relatively few extrapyramidal side effects and to reduce the negative symptoms of SCZ more effectively when used at clinically effective doses (Meltzer, 1999). OLA binds to serotonin type 2 (5-HT2) and dopamine (D2) receptors with high affinity. Uridine diphosphate glucuronosyltransferases (UGTs), members of the cytochrome P450 family, and flavin-containing monooxygenase 1 (FMO1) catalyze the hepatic metabolism of OLA (Ring et al., 1996; Kassahun et al., 1997; Linnet, 2002). On the other hand, because of the heterogeneity among SCZ patients, not all patients respond adequately to OLA treatment. Some patients who receive OLA therapy even experience severe adverse effects that result in non-compliance with drug treatment (Zhang and Malhotra, 2011; Musil et al., 2015). If no other effective therapy is available for these patients, they face disease progression, relapses and potentially long-term hospitalization (Robinson et al., 1999; King et al., 2014).

Although numerous studies have examined the factors that influence the therapeutic efficacy of OLA, very few have focused on individual-specific genetic biomarkers of OLA response (Söderberg and Dahl, 2013). Earlier attempts to search for biomarkers of OLA response focused mainly on the relationship between OLA response and its metabolic pathways, including the glucuronidation, hydroxylation, N-demethylation and N-oxidation pathways (Laika et al., 2010; Haslemo et al., 2012; Mao et al., 2012; Söderberg et al., 2013; Brandl et al., 2015). A number of genetic variants, including UGT2B10 rs61750900 (UGT2B10\*2) (Erickson-Ridout et al., 2011), CYP1A2 rs762551 (CYP1A2\*1F) (Laika et al., 2010), DRD3 rs6280 (Adams et al., 2008), AHR rs4410790 (Söderberg et al., 2013), FMO3 K158–G308, FMO1 rs12720462 (FMO1\*6) and FMO1 rs7877, have been reported to play important roles in OLA metabolism (Söderberg et al., 2013). In addition, the drug response and the pharmacokinetics of OLA have also been found to be associated with genetic elements that are not directly involved in the metabolic pathway of OLA (Lin et al., 2006; Meary et al., 2008; Cabaleiro et al., 2013; Yu et al., 2018). For example, P-glycoprotein, a membrane protein that pumps foreign substances out of cells and is not directly related to the OLA metabolic pathway, affects the penetration of OLA into the central nervous system (Lin et al., 2006). These discoveries suggest that a comprehensive study of OLA response requires a clear understanding of the complex biological network composed of drug-metabolizing enzymes, drug transporters and drug-targeted receptors.

In this study, we investigated the associations between SNPs in 143 genes and the OLA response of 79 Han Chinese SCZ patients using targeted sequencing technology. The newly found biomarkers were considered genetic signatures of drug response to 8 weeks of OLA treatment and were validated in an independent Han Chinese SCZ patient cohort.

# MATERIALS AND METHODS

#### Subjects

In this study, we collected 2 independent sets of OLA response data from Han Chinese SCZ patients in order to validate our discoveries. We named the first set the 'discovery cohort' and the other the 'independent cohort.' The demographics and clinical details of both sets of patients are shown in **Table 1**.

The discovery cohort was composed of 79 Han Chinese SCZ patients treated with OLA who were recruited from the Shanghai Mental Health Center of China. It comprised 37 males and 42 females, with a mean age of 43.1 ± 18.3 years (**Table 1**). SCZ was diagnosed according to the criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) and confirmed by at least two experienced psychiatrists. Patients who had physical complications or substance abuse were excluded from this investigation. Our analysis only considered patients who had not been previously treated with atypical antipsychotics and had not received any medication for more than 4 weeks before their enrollment in this study.

The independent cohort contained 159 Han Chinese SCZ patients undergoing OLA monotherapy who were recruited from the Shanghai Mental Health Center of China and the First Hospital of Shanxi Medical University. It comprised 75 males and 84 females, with a mean age of 38.5 ± 16.4 years (**Table 1**). This cohort was used to validate the novel biomarkers found in the analysis of the discovery cohort. The selection criteria were similar to those of the discovery cohort. Finally, patient data from the discovery cohort and the independent cohort were combined to gain greater statistical power. Therefore, a total of 238 patients were used to perform another validation of the newly found biomarkers.

TABLE 1 | Demographic and clinical details of patients suffering from schizophrenia in three cohorts.


BMI, body mass index; T0 – baseline measurement, T1 – follow-up measurement (8 weeks). Bold font indicates statistically significant values (P < 0.05). Values are shown as means ± standard deviation.

<sup>a</sup>Gender was analyzed by χ<sup>2</sup> test and other characteristics were analyzed by Student's t-test.

#### Clinical Assessment

Clinical effects were evaluated using the Positive and Negative Syndrome Scale (PANSS) by two fully qualified psychiatrists during the 8 weeks of OLA treatment. The interrater reliability between the two raters was high, with intraclass correlation coefficients (ICCs) larger than 0.8. Based on Obermeier's method, patients were classified as good responders (reduction of PANSS score ≥ 50%) or poor responders (reduction of PANSS score < 50%) for analysis (Obermeier et al., 2010). The initial dose of OLA was 10 mg per day, which was gradually increased to 15 mg per day within the 1st week. After that, the dosage was adjusted based on individual tolerance to the treatment. During the medication period, nursing staff closely monitored medication compliance. No other drugs were administered during OLA monotherapy, except for sennoside for constipation, flunitrazepam or lorazepam for acute insomnia and biperiden for any extrapyramidal side effects.
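The responder classification above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and subtracting the scale minimum of 30 (the lowest possible PANSS total, 30 items each scored from 1 to 7) before computing the percentage is an assumption about how the reduction was calculated, since the text does not spell out the formula.

```python
def panss_percent_reduction(baseline, endpoint, floor=30):
    """Percent reduction in PANSS total score after subtracting the scale minimum.

    floor=30 is an assumption (lowest possible PANSS total); pass floor=0
    to obtain the uncorrected percentage instead.
    """
    if baseline <= floor:
        raise ValueError("baseline must exceed the scale minimum")
    return 100.0 * (baseline - endpoint) / (baseline - floor)


def classify_responder(baseline, endpoint, threshold=50.0, floor=30):
    """Return 'good' if the reduction reaches the threshold, otherwise 'poor'."""
    reduction = panss_percent_reduction(baseline, endpoint, floor)
    return "good" if reduction >= threshold else "poor"


# Hypothetical patient: PANSS total of 90 at baseline and 60 after 8 weeks.
print(classify_responder(90, 60))           # floor-corrected reduction
print(classify_responder(90, 60, floor=0))  # uncorrected reduction
```

The two calls show why the choice of formula matters: the same hypothetical patient falls on different sides of the 50% threshold depending on whether the scale minimum is subtracted.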

#### Ethics Statement

The study was approved by the Ethical Committee of Human Genetic Resources in Shanghai, China. All subjects or their legal guardians understood the procedure and had given written informed consent to their participation in this study according to the Declaration of Helsinki (Human, 1999).

#### Targeted Genes and Capture Design

One hundred and forty-three genes were selected for targeted sequencing based on their involvement in drug metabolism (including genes encoding drug-metabolizing enzymes, drug transporters and the receptors mediating drug responses) according to the PharmGKB database<sup>1</sup> and the relevant literature (Arranz et al., 2011; Li and Bluth, 2011; Arranz et al., 2016) reporting genes potentially related to SGA efficacy. We studied these genes in order to identify novel biomarkers of OLA response. The details of these 143 genes are listed in **Supplementary Table S1**.

Sequencing probes for the 143 targeted genes were designed using Agilent's SureDesign tool<sup>2</sup> . Targeted regions of these genes of interest included their coding regions ± 10 bp and untranslated regions (UTR) according to information from RefSeq, Ensembl, CCDS, and GENCODE databases (Harrow et al., 2012; Pruitt et al., 2014; Cunningham et al., 2015).

#### Library Preparation and Next Generation Sequencing

Genomic DNA was extracted from whole blood using a QIAamp DNA Blood Mini Kit (Qiagen GmbH, Hilden, Germany). The quantity and quality of the genomic DNA were measured with a Nanodrop 2000 (Thermo Scientific, United States). The genomic DNA was then adjusted to a final concentration of 100 ng/µl with high-purity water and stored at −20°C. Libraries were prepared with a HaloPlex Target Enrichment System Kit (Agilent Technologies, Santa Clara, CA, United States) following the manufacturer's instructions. Libraries were then quantified using the Agilent 2100 Bioanalyzer (Agilent Technologies). Sequencing was performed on the HiSeq 2500 platform (Illumina, San Diego, CA, United States) using paired-end libraries (2 × 101 bp).

Raw data were processed following standard protocols used in earlier reports (Gaynor et al., 2016). In short, raw image files were first converted to the FASTQ format and the reads were aligned to the human reference genome (hg19, GRCh37). SNPs were identified according to GATK standard hard filtering parameters (DePristo et al., 2011). On average, 99% of the targeted bases were covered at >80× and 81% at >200×, which suggested that the coverage was sufficiently high to detect variants with appropriate sensitivity. The program ANNOVAR was used to annotate SNVs covered at >20× using information from Ensembl Variation, dbSNP, and the 1000 Genomes database (Sherry et al., 2001; Abecasis et al., 2010; Wang et al., 2010; Flicek et al., 2012). Subsequently, individual- and SNP-level quality controls were performed using PLINK (v1.07) software (Purcell et al., 2007). SNPs were excluded if they met any of the following criteria: genotypic call rate < 95%, Hardy–Weinberg equilibrium (HWE) P < 0.001, or minor allele frequency (MAF) < 0.01.
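The three SNP-level filters can be illustrated with a short sketch that works on the genotype counts of a single SNP. It is not a substitute for PLINK: the function name and the example counts are hypothetical, and the HWE P value is computed with a simple one-degree-of-freedom chi-square test, which is an assumption made here for brevity and may differ from the test PLINK applies.

```python
import math


def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.95, min_hwe_p=0.001, min_maf=0.01):
    """Apply call-rate, HWE and MAF filters to one SNP's genotype counts."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = [n_called * freq_a ** 2,
                2 * n_called * freq_a * (1 - freq_a),
                n_called * (1 - freq_a) ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))

    return (call_rate >= min_call_rate
            and hwe_p >= min_hwe_p
            and maf >= min_maf)


# Hypothetical genotype counts (AA, AB, BB, missing) for four SNPs.
print(snp_passes_qc(25, 50, 25, 0))   # well-behaved SNP
print(snp_passes_qc(50, 0, 50, 0))    # no heterozygotes: strong HWE deviation
print(snp_passes_qc(99, 1, 0, 0))     # MAF below 0.01
print(snp_passes_qc(25, 50, 25, 10))  # call rate below 95%
```

A SNP is kept only when it passes all three thresholds, mirroring the exclusion criteria listed in the text.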

<sup>1</sup>https://www.pharmgkb.org/

<sup>2</sup>www.agilent.com/genomics/suredesign

After quality control of the sequencing data, 77 individuals (38 good responders and 39 poor responders) and 807 SNPs remained for later analysis.

#### Validation Trial


We identified 12 SNPs that were significantly associated with OLA response in the discovery cohort. To validate these newly found SNPs, a similar analysis was performed in both the independent cohort and the combined sample of the two cohorts. These 12 SNPs were genotyped using the Sequenom MassARRAY platform (Agena Bioscience, San Diego, CA, United States) following the manufacturer's instructions. MassARRAY primers were designed using the semi-automated software Assay Design Suite v2.0<sup>3</sup>. The primer sequences are listed in **Supplementary Table S2**. As in the processing of the discovery cohort, SNPs were excluded if they met any of the following criteria: genotypic call rate < 95%, Hardy–Weinberg equilibrium (HWE) P < 0.001, or minor allele frequency (MAF) < 0.01. Only the filtered data were used for our analysis.

#### Statistical Analyses

The demographic characteristics of the 'good responder' and 'poor responder' groups were examined to confirm the homogeneity of the data used in our analysis. The data were found to be normally distributed, allowing Student's t-tests to be performed (age, PANSS score, etc.). Gender differences were analyzed using the Chi-square test. SPSS software (version 11.0, Chicago, IL, United States) was used for the statistical analyses in this study. The association between genotype and OLA response was assessed using a logistic regression model in PLINK v1.07 software (Purcell et al., 2007). P-values were corrected for multiple testing using the Bonferroni method. Two-tailed P-values below 0.05 were considered statistically significant. Power analysis was performed with the software GPower 3.1.

# RESULTS

#### Patients Characteristics and Sequencing Profile

The demographic and clinical details of the patients included in this study are shown in **Table 1**. Among the 79 patients in the discovery cohort, 38 were defined as good responders to OLA and 41 as poor responders. Among the 159 patients in the independent cohort, 119 were good responders and 40 were poor responders. In the total cohort comprising both sets, there were 157 good responders and 81 poor responders out of 238 patients. There was no statistically significant difference in baseline characteristics between good responders and poor responders, except for the PANSS total scores at the 8-week endpoint, meaning that the population was homogeneous (**Table 1**).

<sup>3</sup>https://agenacx.com/

#### Effects of Individual Polymorphisms on the OLA Response in the Discovery Cohort

Twelve out of 807 tested SNPs were found to be significantly associated with the OLA response of Han Chinese SCZ patients in the discovery cohort. **Table 2** lists the results of the SNP association analysis of pharmacogenetic impact on OLA treatment response (P < 0.05). Two of the newly found variants were located in exonic regions (rs6280, P = 0.026, OR = 3.0, 95% CI = 1.14–7.87; rs2011404, P = 0.04, OR = 5.4, 95% CI = 1.08–26.93). The other 10 variants, which were not located in exonic regions, were also significantly associated with OLA response. However, no variant remained statistically significant after multiple-testing correction (data not shown).
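The odds ratios, confidence intervals and P values quoted above can be cross-checked against each other. The sketch below assumes the intervals are Wald intervals on the log-odds scale, which is a common default for logistic regression output but is not stated in the text; the function name is hypothetical. Under that assumption, the standard error can be recovered from the interval width and used to recompute a two-sided P value.

```python
from math import erfc, log, sqrt
from statistics import NormalDist


def wald_p_from_or_ci(odds_ratio, ci_low, ci_high, level=0.95):
    """Recover the SE, z statistic and two-sided Wald P value from an OR and its CI."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # A Wald interval spans 2 * z_crit standard errors on the log scale.
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)
    z = log(odds_ratio) / se
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return se, z, p


# Values as reported in the text for the two exonic variants.
for snp, (or_, low, high) in {
    "rs6280": (3.0, 1.14, 7.87),
    "rs2011404": (5.4, 1.08, 26.93),
}.items():
    se, z, p = wald_p_from_or_ci(or_, low, high)
    print(f"{snp}: SE = {se:.3f}, z = {z:.2f}, P = {p:.3f}")
```

Because the inputs are rounded, the recomputed P values only approximate the reported ones; agreement to the second decimal place suggests the three quantities are mutually consistent.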

#### Verification of the Genetic Variants Associated With the Response to OLA in the Independent and Total Cohort

We used the independent cohort and the total sample, composed of both the discovery cohort and the independent cohort, to validate the 12 SNPs identified in the analysis of the discovery cohort. The relevant clinical information is shown in **Table 1**. A total of 12 SNPs were genotyped in patients from these two sets. SNPs rs324026 and rs12610827 were significantly associated with OLA treatment response in the independent cohort (P = 0.023 and P = 0.043, respectively). In the combined cohort, 4 SNPs displayed a significant difference between good responders and poor responders. Two variants in the dopamine receptor D3 (DRD3) gene, rs324026 and rs6280, were significantly associated with OLA response in Han Chinese SCZ patients (P = 0.001 and 0.0047, respectively). In addition, SNPs rs12610827 (near PLK5) and rs1543494 (located in SUPT16H) were also significantly associated with the OLA treatment response (P = 0.001 and 0.038, respectively). Detailed information on these significant SNPs is shown in **Table 3**.
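The effect of the Bonferroni correction on the combined-cohort P values can be sketched as follows. The number of tests is set to 12, the number of SNPs genotyped in the validation stage; this is an assumption, since the text does not state the correction factor. The raw P values are the rounded figures quoted above, so the output will not reproduce the adjusted values in Table 3 exactly.

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each P value by the number of tests, capping the result at 1."""
    n = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * n) for p in p_values]


# Rounded combined-cohort P values as quoted in the text.
raw = {
    "rs324026": 0.001,
    "rs6280": 0.0047,
    "rs12610827": 0.001,
    "rs1543494": 0.038,
}
N_TESTS = 12  # assumption: one test per genotyped SNP

adjusted = dict(zip(raw, bonferroni(list(raw.values()), n_tests=N_TESTS)))
significant = [snp for snp, p in adjusted.items() if p < 0.05]

for snp, p in adjusted.items():
    print(f"{snp}: adjusted P = {p:.4f}")
print("Significant after correction:", significant)
```

Under this assumption only rs324026 and rs12610827 stay below 0.05 after correction, which agrees with the two SNPs the abstract reports as significant in the combined population.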

#### Power Analysis

Post hoc power analysis revealed that the statistical power of the discovery cohort (n = 79) to detect a significant association (P < 0.05) was 0.76 for a medium effect size (odds ratio = 2.0). The power of the independent cohort (n = 159) was 0.96 for the same effect size. These results indicate that the sample size in our study was adequate to keep the risk of a type II error reasonably low for effects of this magnitude.

# DISCUSSION

To date, most pharmacogenomic studies on the OLA response have focused on a few genes known to be relevant to OLA metabolism. Our study represents a more systematic survey of genetic biomarkers, including drug-metabolizing enzyme genes, receptor genes and other related genes. We sequenced 143 genes of interest using next-generation sequencing technology for our association analysis. This study is one of the most comprehensive pharmacogenetic analyses of the association between SNP variants and OLA response.

TABLE 2 | 12 SNPs that were found to be significantly associated with responses to olanzapine treatment, by targeted sequencing.

Chr, chromosome; SNP, single-nucleotide polymorphism; Chr. Pos, chromosome position; MA, minor allele; R\_freq, responder frequency; NR\_freq, non-responder frequency; OR, odds ratio; CI, confidence interval.

TABLE 3 | Validation of SNPs associated with the olanzapine response.

Chr, chromosome; SNP, single-nucleotide polymorphism; Chr. Pos, chromosome position; MA, minor allele; OR, odds ratio; CI, confidence interval; <sup>a</sup>P values were adjusted by the Bonferroni method. Bold font indicates statistically significant values.

Our results suggested that SNP rs324026 in the DRD3 gene was significantly associated with OLA response in the independent cohort. This association remained significant in the total cohort, comprising both cohorts, even after Bonferroni correction. However, the other variant, rs6280, was significantly associated with the response to 8 weeks of OLA treatment only in the combined cohort and showed no evidence of association in the independent cohort. This inconsistency may be due to the small sample size. Notably, these 2 SNPs are in strong linkage disequilibrium (r<sup>2</sup> > 0.9) according to the HaploReg database (Ward and Kellis, 2012). Therefore, both rs6280 and rs324026 may serve as biomarkers of OLA treatment response.
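For readers less familiar with the r<sup>2</sup> measure of linkage disequilibrium cited above, the following sketch shows how it is computed from haplotype and allele frequencies. The function name and the frequencies in the example are hypothetical and chosen only to illustrate a pair of SNPs with r<sup>2</sup> above 0.9; they are not the HaploReg values for rs6280 and rs324026.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Hypothetical frequencies: the two minor alleles almost always travel together.
print(round(ld_r_squared(p_ab=0.29, p_a=0.30, p_b=0.30), 3))
```

When r<sup>2</sup> is this high, the genotype at one SNP predicts the other almost perfectly, so an association test cannot easily tell which of the two variants drives the signal.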

The rs6280 variant leads to a serine-to-glycine substitution (ser-9-gly) and is associated with altered dopamine binding affinity. The glycine variant has been suggested to increase the density of the dopamine receptor D3 (DRD3) in some areas of the human brain (Jeanneteau et al., 2006). Adams et al. (2008) reported that the DRD3 gly/gly genotype and other polymorphisms in linkage disequilibrium with the ser-9-gly variant were significantly associated with an increase in PANSS total score. Our result is also consistent with earlier findings of an association between the ser-9-gly variant and response to clozapine, whose receptor binding profile is the most similar to that of OLA. Cerrato et al. (2017) surveyed 65 papers and found that rs6280 was successfully replicated as a prognostic biomarker of clozapine efficacy. In contrast, the rs324026 variant has never been reported to affect the therapeutic efficacy of OLA. Rs324026 is located next to exon 5. In our analysis, the association between rs324026 and OLA efficacy remained significant after Bonferroni correction only in the combined sample of the discovery and independent cohorts. Additionally, individuals carrying the C allele of rs324026 generally experienced significantly better efficacy of OLA treatment.

In this study, SNP rs12610827 was validated in patients from the independent cohort, and its association with OLA response remained significant in the total cohort after Bonferroni correction for multiple testing. The rs12610827 variant is located near the PLK5 gene. The Polo-like kinase (Plk) family, which consists of 5 members (Plk1-Plk5), has traditionally been regarded as important only in cell cycle progression. However, mounting evidence shows that Plk2 and Plk5 are also closely involved in neuron biology (de Carcer et al., 2011a). It has been suggested that Plk2 modulates neurite formation in response to the activity of brain-derived growth factor (BDGF) (Inglis et al., 2009). Additionally, Plk5 is highly expressed in the central nervous system and plays a Plk2-like role in the cerebellum according to an earlier report (de Carcer et al., 2011b). It has been suggested that Plk5 is regulated at the transcriptional level by CpG methylation of the promoter region (de Carcer et al., 2011a). The level of PLK5 gene expression may be influenced by the methylation status of this variant. In this study, Han Chinese SCZ patients who carried the T allele of SNP rs12610827 responded better to OLA treatment. However, no previous reports have found such an association between PLK5 variants and drug response. Therefore, we believe the rs12610827 variant is worthy of further investigation to verify its influence on OLA response.

This study has several limitations. First, our analysis did not consider some other genes that may be significantly associated with antipsychotic response. Our results may therefore have missed some important biomarkers of OLA response because of this incomplete gene collection. Second, a number of the identified SNP associations did not remain statistically significant after Bonferroni correction. This may reflect over-correction, because the sample size was relatively small. Strict multiple-testing corrections such as Bonferroni may therefore be too conservative for this specific study.

# CONCLUSION

In sum, we performed a comprehensive study on 238 Han Chinese SCZ patients in order to identify potential biomarkers of Han Chinese-specific OLA responses. The result showed that 143 genes were significantly associated with OLA. In addition, 2 variants (rs324026 and rs12610827) were found to have significant association with the OLA response. Future investigations with larger sample sizes and high-throughput methods such as high-density SNP arrays and whole exome sequencing are warranted to find more biomarkers to predict the efficacy of OLA in the Han Chinese population.

# DATA AVAILABILITY

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.


# AUTHOR CONTRIBUTIONS

WZ and YX performed the experiments and wrote the manuscript. CH and ML aided in processing the data. QL, Y-hS, and YX aided in the collection of the materials. LC and LS helped in revising the manuscript. ZY, DC, and SQ designed and revised the manuscript.

# FUNDING

This work was supported by grants from the 863 Program (Grant Numbers 2012AA02A515 and 2012AA021802), the National Natural Science Foundation of China (Grant Numbers 81421061, 81273596, J1210047, 30900799, 81361120389, 30972823, 81671326, and 81671336), National Key Research and Development Program (Grant Numbers 2016YFC0905000, 2016YFC0905002, 2016YFC1200200, 2016YFC0906400, and 2017YFC0909200), The 4th Three-Year Action Plan for Public Health of Shanghai (the Project No. 15GWZK0101), Shanghai Key Laboratory of Psychotic Disorders (Grant Number 13dz2260500), Public Science and Technology Research Funds (Grant Number 201210056), The Fourth Round of Shanghai Three-Year Action Plan on Public Health Discipline and Talent Program: Women and Children's Health (Grant Number 15GWZK0401), the Shanghai Jiao Tong University Interdisciplinary Research fund, and the Shanghai Leading Academic Discipline Project (Grant Number B205).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2019.00177/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhou, Xu, Lv, Sheng, Chen, Li, Shen, Huai, Yi, Cui and Qin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Actionable Pharmacogenetic Variation in the Slovenian Genomic Database

Keli Hočevar\*, Aleš Maver and Borut Peterlin\*

Clinical Institute of Medical Genetics, University Medical Centre Ljubljana, Ljubljana, Slovenia

Background: Genetic variability in some of the genes that affect absorption, distribution, metabolism, and elimination ("pharmacogenes") can significantly influence an individual's response to the drug and consequently the effectiveness of treatment and possible adverse drug events. The rapid development of sequencing methods in recent years and consequently the increased integration of next-generation sequencing technologies into the clinical settings has enabled extensive genotyping of pharmacogenes for personalized treatment. The aim of the present study was to investigate the frequency and variety of potentially actionable pharmacogenetic findings in the Slovenian population.

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Vanessa Gonzalez-Covarrubias, Instituto Nacional de Medicina Genómica (INMEGEN), Mexico Collet Dandara, University of Cape Town, South Africa

\*Correspondence:

Keli Hočevar kelihocevar@gmail.com; Borut Peterlin borut.peterlin@guest.arnes.si

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology

Received: 19 September 2018 Accepted: 26 February 2019 Published: 14 March 2019

#### Citation:

Hočevar K, Maver A and Peterlin B (2019) Actionable Pharmacogenetic Variation in the Slovenian Genomic Database. Front. Pharmacol. 10:240. doi: 10.3389/fphar.2019.00240

Methods: De-identified data from diagnostic exome sequencing in 1904 cases submitted to our institution were analyzed for variants within 293 genes associated with drug response. Filtered variants were classified according to population frequency, variant type, the functional impact of the variant, pathogenicity predictions and characterization in the Pharmacogenomics Knowledgebase (PharmGKB) and ClinVar.

Results: We observed a total of 24 known actionable pharmacogenetic variants (PharmGKB 1A or 1B level of evidence), associated with approximately 26 drugs; 12 of these variants were rare, with a population frequency below 1%. Furthermore, we identified an additional 61 variants with PharmGKB 2A or 2B clinical annotations. We detected 308 novel/rare potentially actionable variants: 177 protein-truncating variants and 131 missense variants predicted to be pathogenic based on several pathogenicity predictions.

Conclusion: In the present study, we estimated the burden of pharmacogenetic variants in nationally based exome sequencing data and investigated the potential clinical usefulness of detected findings for personalized treatment. We provide the first comprehensive overview of known pharmacogenetic variants in the Slovenian population, as well as reveal a great proportion of novel/rare variants with a potential to influence drug response.

Keywords: next-generation sequencing, pharmacogenomics, personalized medicine, Slovenian population, PharmGKB

# INTRODUCTION

fphar-10-00240 March 12, 2019 Time: 19:11 # 2

Genetic variation in genes associated with drug pharmacokinetics (i.e., absorption, distribution, metabolism, elimination) or pharmacodynamics (e.g., altering a drug's target or perturbing biological pathways) can significantly contribute to individual responsiveness to drugs, and thus to therapeutic efficacy and toxicity (Dunnenberger et al., 2015; Relling and Evans, 2015). As an integral part of personalized medicine, pharmacogenomics has great potential to enhance clinical benefit and to decrease adverse drug reactions and the cost of treatment by optimizing drug selection and dosing for an individual. The rapidly dropping cost of next-generation sequencing in recent years, and consequently the increased integration of these technologies into clinical settings, has offered an unprecedented opportunity for extensive genotyping of pharmacogenes. Exome sequencing is a powerful tool for gaining insight into both common and rare coding variation. However, there are presently no comprehensive studies regarding the application of exome sequencing data for reporting of pharmacogenetic variants. Two challenges therefore arise when analyzing exome sequencing data: first, how and which variants with established pharmacogenomic effects should be reported, and second, how to evaluate novel putatively functional variants or variants in genes with less established pharmacogenetic function.

The lack of a comprehensive overview of the distribution of rare variants in pharmacogenes among different ethnic groups and the lack of knowledge about their functional consequences and clinical actionability additionally limit the fully integrated use of pharmacogenomics in routine clinical practice. Therefore, pharmacogenomics usually remains focused on a limited number of common variants in a small number of genes, the detection of which is primarily based on targeted gene panels or genotyping arrays, such as the AmpliChip CYP450 test (Roche) or the Affymetrix DMET Plus Assay (Potamias et al., 2014). However, these approaches do not consider the complete heterogeneity of the variation within pharmacogenes and do not address the issue of rare and private variants with potentially large effects. Moreover, previous studies have shown that the vast majority of protein-coding variation is rare, previously unknown, population-specific and enriched for deleterious alleles (Nelson et al., 2012; Tennessen et al., 2012; Gordon et al., 2014; Fujikura et al., 2015). Thus, it is likely that rare variation contributes importantly to some currently unexplained differences in pharmacological responsiveness and metabolism. Consistent with this notion, recent research highlighted that rare variants account for 30–40% of the functional variability in the pharmacogenes (Kozyra et al., 2016). Ramsey et al. (2012) showed that rare variants account for 17.8% of the variability attributed to SLCO1B1, a gene associated with methotrexate clearance and the disposition of many other medications, including statins and irinotecan.

With the objective of enabling the clinical use of pharmacogenomics, projects like eMERGE are systematically documenting and evaluating both common and rare variants in pharmacogenes, thus creating clinically useful electronic networks of pharmacogenetic variation (Rasmussen-Torvik et al., 2014). Additionally, the Clinical Pharmacogenetics Implementation Consortium (CPIC<sup>1</sup>) (Caudle et al., 2014), a shared project between the Pharmacogenomics Knowledgebase (PharmGKB<sup>2</sup>) (Whirl-Carrillo et al., 2012) and the Pharmacogenomics Research Network (PGRN) (Shuldiner et al., 2013), started to develop peer-reviewed, evidence-based guidelines for specific gene/drug combinations. By September 2018, CPIC had published 65 dosing guidelines covering 15 genes and 38 drugs<sup>3</sup>. Efforts to facilitate implementation have also been undertaken by other nationwide networks such as the Royal Dutch Association for the Advancement of Pharmacy and the Canadian Pharmacogenomics Network for Drug Safety (Ross et al., 2010). Currently, 23 different genes have described actionable variants (corresponding to PharmGKB level 1A or 1B of evidence) for germline pharmacogenomics (last accessed on September 9th, 2018).

However, there are presently no consensus recommendations on which pharmacogenetic findings should be actively sought and reported back to patients when analyzing exome or genome sequencing data. Nevertheless, the potential usefulness of pharmacogenomic findings in exome sequencing data has recently been indicated. In the study of Lee et al. (2016), 21 potentially useful PharmGKB actionable variants (1A and 1B) were identified in 645 individuals who had undergone clinical exome sequencing. In a related study by Cousin et al. (2017), secondary pharmacogenetic findings from clinical whole exome sequencing (WES) testing were reported in a cohort of 94 primarily pediatric patients referred for a suspected genetic disorder. The study results showed that 91% of patients had at least one pharmacogenetic variant allele in the CYP2C19, CYP2C9, and VKORC1 genes and that 20% of them had potential immediate implications for current medication use. A study on 60,706 human exomes from the ExAC population dataset further estimated the prevalence of common as well as rare functional variants in 806 drug-related genes and its implications for 1236 FDA-approved drugs. The extended exome data analysis revealed that four in five patients are likely to carry a variant with possibly functional effects (Schärfe et al., 2017).

As population data are specific and cannot be generalized, even within closely related European populations (Mizzi et al., 2017), we used a genomic database of 1904 Slovenian individuals to comprehensively assess the population burden of pharmacogenetic variants. We conducted a nationally based survey of genetic variation within 293 genes known to influence drug response. Our additional motivation was to assess the applicability of pharmacogenetic reporting as part of the routine analysis of exomes and to gain insight into the opportunities and challenges that arise. Accordingly, we analyzed (1) which pharmacogenomic variants could be covered with exome sequencing data, (2) the frequency of known actionable variants in the Slovenian population, and (3) the frequency of rare variants with putative functional impacts and the possibilities for their interpretation.

**Abbreviations:** ABC, ATP-binding cassette; ACMG, American College of Medical Genetics and Genomics; BWA, Burrows-Wheeler algorithm; CADD, Combined Annotation–Dependent Depletion Score; CNV, copy number variation; CPIC, Clinical Pharmacogenetics Implementation Consortium; CYP, cytochrome P450 superfamily; FDA, Food and Drug Administration; MAF, minor allele frequency; MAFSlo, minor allele frequencies for the Slovenian population; gnomAD, the Genome Aggregation Database; PGRN, Pharmacogenomics Research Network; PharmGKB, Pharmacogenomics Knowledgebase; SLC, solute carrier; TCAs, tricyclic antidepressants; UGTs, UDP-glucuronosyltransferases; UCSC, University of California Santa Cruz.

<sup>1</sup>www.eu-pic.net

<sup>2</sup>www.pharmgkb.org

<sup>3</sup>www.pharmgkb.org/guidelines

# MATERIALS AND METHODS

#### Participants

Exome datasets from 1904 patients who were referred to the Clinical Institute of Medical Genetics, University Medical Centre, Ljubljana, Slovenia from July 2014 to October 2017 and had undergone clinical exome sequencing (Illumina TruSight One panel, targeting 4813 genes associated with Mendelian disorders) or whole exome sequencing (Agilent SureSelect All Exon V5 or Illumina Nextera coding exome capture) were included in this analysis. All patients gave informed consent for participation in accordance with the Declaration of Helsinki. The study was approved by the Slovenian National Medical Ethics Committee (0120-561/2016). Data were de-identified and phenotype data of the patients were not available.

# Panel Design

The panel of genes analyzed in the present study was selected to include genes that were captured with all of the protocols used (Illumina TruSight One protocol, Nextera Coding Exome, and the Agilent SureSelect All Exon v5 protocol). We established a pharmacogenetic list of 293 genes associated with pharmacological impacts, which is based on 33 genes from the VeraCode® ADME Core Panel Assay (Illumina) and supplemented with 260 additional genes from the PharmaADME (198 genes), PharmGKB (37 genes), and eMERGE-PGx (Sphinx) (25 genes) websites (**Supplementary Table S1**). Combining clinically relevant genes from these sources ensures that our gene set covers the majority of the key genes currently reviewed in pharmacogenomics research that are also captured by both clinical and whole exome sequencing.

#### Exome Sequencing

Of the 1904 samples sequenced, the majority (1582 samples, 83.1%) were enriched using the Illumina TruSight One protocol, followed by Nextera Coding Exome (188 samples, 9.9%); the remaining samples (134, 7.0%) were analyzed using the Agilent SureSelect All Exon v5 protocol. Raw sequence files were processed using a custom exome analysis pipeline based on the GATK best practices backbone. Reads were aligned to the UCSC hg19 human reference genome assembly using the Burrows-Wheeler (BWA) algorithm, and duplicate sequences were removed using Picard MarkDuplicates, followed by base quality score recalibration, variant calling, variant quality score recalibration, and variant filtering using elements of the GATK toolset (Depristo et al., 2011). In all cases, we attained a minimum median exome coverage of 60×, with over 95% of targets covered at a sequencing depth of at least 10×. Although the cytochrome genes are characterized by a high degree of homology, we were able to uniquely map over 90% of the reads in these regions, while non-uniquely mapped reads were assigned a mapping quality of 0 by BWA. The GATK variant caller did not emit sequence variants in these regions, thereby reducing the rate of low-quality variants in regions of high homology.

# Variant Analysis

Variants were stored and annotated in our in-house variant collection and annotation system, which is based on vTools software. Variant effect predictions were made using the snpEff (Cingolani et al., 2012) and ANNOVAR (Wang et al., 2010) tools and were based on RefSeq gene models (O'Leary et al., 2016), whereas annotations from dbSNP v141 were used for single nucleotide polymorphism (SNP) annotation. The Genome Aggregation Database (gnomAD) (Lek et al., 2016) was employed as a source of variant frequencies in worldwide populations. The consensus calls of dbNSFP v2 (Liu et al., 2013) precomputed pathogenicity predictions were used to predict the functional effect of missense variants, including SIFT (Sim et al., 2012), Polyphen-2 (Adzhubei et al., 2010), MutationTaster (Schwarz et al., 2014), CADD (Combined Annotation–Dependent Depletion Score) (Kircher et al., 2014), and MetaSVM (Dong et al., 2015). GERP++ rejected substitution (RS) scores were used as the source of information on evolutionary sequence conservation applicable to all types of variants (Davydov et al., 2010). Our pipeline included ClinVar as a source of known disease or drug response associations of identified variants. Variants with coverage below 20 and quality below 300 were excluded from the subsequent analysis.
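The exclusion rule at the end of this paragraph can be sketched as follows. This is a minimal illustration only: the `Variant` record, its field names, and the `strict` switch are hypothetical and are not part of the authors' vTools-based system. The text states that variants with coverage below 20 and quality below 300 were excluded; the `strict` flag shows the alternative reading in which missing either threshold is enough.

```python
from dataclasses import dataclass

MIN_DEPTH = 20   # coverage threshold stated in the text
MIN_QUAL = 300   # quality threshold stated in the text


@dataclass
class Variant:
    """Hypothetical minimal variant record (field names are assumptions)."""
    rsid: str
    depth: int
    qual: float


def is_excluded(v: Variant, strict: bool = False) -> bool:
    low_depth = v.depth < MIN_DEPTH
    low_qual = v.qual < MIN_QUAL
    # Literal reading: exclude only when both thresholds are missed.
    # strict=True: exclude when either threshold is missed.
    return (low_depth or low_qual) if strict else (low_depth and low_qual)


def filter_variants(variants, strict=False):
    """Return the variants that survive the quality filter."""
    return [v for v in variants if not is_excluded(v, strict)]
```

Which of the two readings matches the original pipeline is not stated in the text, so the choice is left as a parameter.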

#### Variant Filtration and Characterization

Firstly, we applied a 293-gene panel for the filtration of exome data. Next, variants were characterized according to PharmGKB levels of evidence for variant-drug associations<sup>4</sup> (Whirl-Carrillo et al., 2012) (accessed on 9th September 2018). Level 1A category includes variant-drug pairs with a CPIC pharmacogenetic guideline or variants implemented at a PGRN site or another major health system. Level 1B annotations comprise variant-drug combinations in which the preponderance of evidence shows an association that has been replicated in more than one cohort, with significant p-values and preferably with a strong effect size. Clinical annotation of Level 2A refers to variants within known pharmacogenes that are more likely to have a functional significance. Level 2B annotation refers to variant-drug pairs with moderate evidence of an association that has been replicated, but the results might not be statistically significant or the effect size may be small. Initially, we searched for actionable variants, defined as variants with PharmGKB 1A and 1B levels of evidence. The search was based on dbSNP accession numbers. Star allele assignments (e.g., CYP2C9<sup>∗</sup>3) were searched for corresponding rs numbers where possible, using star-allele nomenclature from the PharmVar Database<sup>5</sup> and TPMT Nomenclature Committee websites<sup>6</sup>. Additionally, we extracted variant-drug pairs with PharmGKB annotations of 2A or 2B. We also inspected how many variants with the ClinVar accession "drug response" were detected in our dataset. The dosing algorithms were obtained from CPIC guidelines and variant-drug-phenotype associations from the PharmGKB website<sup>7</sup>.

<sup>4</sup>www.pharmgkb.org/downloads
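The search strategy described above (rsID-based lookup, with star alleles translated to rs numbers and pairs grouped by level of evidence) might be sketched as below. The function name and the tuple layout are assumptions made for illustration; the three star-allele mappings are commonly cited defining variants, but a real analysis would load the complete tables from PharmVar and PharmGKB rather than hard-code them.

```python
ACTIONABLE_LEVELS = {"1A", "1B"}   # "actionable" as defined in the text
ADDITIONAL_LEVELS = {"2A", "2B"}   # extracted separately in the text

# Illustrative star-allele to rsID lookup (not a complete table).
STAR_TO_RSID = {
    "CYP2C9*3": "rs1057910",
    "CYP2C19*2": "rs4244285",
    "CYP2D6*4": "rs3892097",
}


def group_by_evidence(observed_rsids, annotations):
    """Group observed variant-drug pairs by PharmGKB level of evidence.

    annotations: iterable of (rsid, drug, level) tuples (hypothetical layout).
    """
    groups = {"actionable": [], "additional": []}
    for rsid, drug, level in annotations:
        if rsid not in observed_rsids:
            continue
        if level in ACTIONABLE_LEVELS:
            groups["actionable"].append((rsid, drug, level))
        elif level in ADDITIONAL_LEVELS:
            groups["additional"].append((rsid, drug, level))
    return groups
```

Keeping the two groups separate mirrors the way the paper reports 1A/1B findings apart from 2A/2B findings.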

Next, we filtered variants on the basis of their minor allele frequencies (MAFs), with an exclusion of variants with MAF > 0.01 in the gnomAD database. We also excluded variants that were detected in more than 19 (1%) individuals as heterozygous and variants detected in more than 15 individuals as homozygous in the Slovenian genomic database. We rated the variants according to variant functional impact, variant type, and theoretical pathogenicity predictions (PolyPhen-2, SIFT, MutationTaster, MetaSVM, CADD). Additionally, median Phred-normalized CADD annotation scores for each gene were calculated (Kircher et al., 2014). Subsequently, we examined the distribution of rare exonic and splicing variation across major pharmacogenetic gene groups, including the cytochrome P450 (CYP) superfamily, ATP-binding cassette (ABC) superfamily, solute carrier (SLC) superfamily, and UDP-glucuronosyltransferases (UGTs).

<sup>5</sup>www.pharmvar.org/genes

<sup>7</sup>www.pharmgkb.org
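The rarity filter described above can be written as a small predicate. The thresholds come from the text; the function name and the treatment of variants absent from gnomAD (passed as `None` and allowed through) are assumptions made for this sketch.

```python
GNOMAD_MAF_MAX = 0.01   # variants with gnomAD MAF > 0.01 are excluded
MAX_HET = 19            # about 1% of the 1904 individuals
MAX_HOM = 15            # homozygote limit stated in the text


def is_rare(gnomad_maf, n_het, n_hom):
    """Return True if a variant passes all three rarity criteria.

    gnomad_maf may be None when the variant is not present in gnomAD
    (assumption: such variants are kept).
    """
    if gnomad_maf is not None and gnomad_maf > GNOMAD_MAF_MAX:
        return False
    return n_het <= MAX_HET and n_hom <= MAX_HOM
```

For example, a variant with a gnomAD MAF of 0.005 that is seen in 20 heterozygous individuals would be dropped by the in-house count even though it passes the gnomAD criterion.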

## RESULTS

#### Overview

Using exome-sequencing data from 1904 individuals we detected a total of 72,293 high-quality variants in 293 pharmacogenes. Our data revealed that most of the variants in pharmacogenes were rare (n = 65,059, MAFgnomAD < 0.01), comprising about 90.0% of all variants. Of these rare variants 4360 were annotated as missense, 2239 as synonymous, 174 as frameshifts, and 127 as stop gained. Among the rare non-coding variants, 48,822 were classified as intronic, 1142 as upstream variants, 700 as downstream variants, 1914 as 3′UTR variants, and 735 as 5′UTR variants. Of the rare variants, 9229 (14.2%) were previously reported in the gnomAD database. The number of variants by type is summarized in **Table 1**. The counts by variant annotation impact and variant annotation type for rare exonic and splicing variation are presented in **Figure 1**. The distribution of rare variation across major pharmacogenetic gene groups is presented in **Figure 2**.

#### TABLE 1 | SNPEff Variant types.

gnomAD, Genome Aggregation Database; MAF, minor allele frequency; PharmGKB, Pharmacogenomics Knowledgebase.

<sup>6</sup>www.imh.liu.se/tpmtalleles
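As a quick arithmetic check of the proportions quoted in the overview, the sketch below recomputes them from the reported counts. The variable and function names are illustrative only.

```python
total_variants = 72_293   # high-quality variants in 293 pharmacogenes
rare_variants = 65_059    # variants with gnomAD MAF < 0.01
rare_in_gnomad = 9_229    # rare variants previously reported in gnomAD


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rare_share = percent(rare_variants, total_variants)      # share of rare variants
gnomad_share = percent(rare_in_gnomad, rare_variants)    # share already in gnomAD
print(rare_share, gnomad_share)
```

Both values agree with the percentages given in the text (about 90.0% and 14.2%).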

### Known Actionable Pharmacogenetic Variants

Firstly, we focused on variants featured in the PharmGKB and/or ClinVar databases. We looked for PharmGKB-annotated variant-drug combinations with the four highest levels of evidence: 1A, 1B, 2A, and 2B. Within the exome sequencing data we identified 24 unique PharmGKB variants with the highest levels of evidence, 1A or 1B, associated with response to about 26 drugs. Twelve of them were rare, with the MAF not exceeding 1% in gnomAD and the Slovenian genomic database. Rare actionable variants (PharmGKB 1A or 1B) located in the exonic or splicing regions were observed in the following genes: CYP2D6 (frameshift variant), CYP2C19 (one start lost and another missense variant), CFTR (disruptive inframe deletion and five missense variants), DPYD (one missense variant and another splice donor variant), and TPMT (one missense variant).

We estimated minor allele frequencies for the Slovenian population (MAFSlo) for each potentially actionable finding and the results are presented in **Table 2** and **Supplementary Tables S2**–**S4**. The most prevalent actionable variant (PharmGKB 1A or 1B level) in our database was a missense variant in the CYP4F2 gene (Val433Met, rs2108622) with a MAFSlo of 27.4%, associated with warfarin dosage (PharmGKB level 1A of evidence), followed by missense variants in the CYP2B6 (Gln172His, rs3745274, MAFSlo = 22.4%, PharmGKB 1B), CYP2D6 (Pro34Ser, rs1065852, MAFSlo = 19.5%, PharmGKB 1A), and SLCO1B1 (Val174Ala, rs4149056, MAFSlo = 19.2%, PharmGKB 1A) genes. The next most prevalent variants in the PharmGKB 1A or 1B category were a splice acceptor variant (c.506-1G>A, rs3892097) in the CYP2D6 gene (MAFSlo = 16.7%) and a synonymous variant (Pro227Pro, rs4244285) in the CYP2C19 gene (MAFSlo = 12.6%).
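The text does not spell out how MAFSlo was computed. A common way to estimate an allele frequency from the heterozygote and homozygote counts reported per variant is shown below as a hedged sketch; it assumes diploid genotypes and that every individual was successfully genotyped at the site, which may not hold across different capture kits.

```python
def allele_frequency(n_het, n_hom, n_individuals):
    """Allele frequency from diploid genotype counts.

    Each heterozygote carries one copy of the allele and each homozygote
    two; the denominator is the total number of chromosomes.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return (n_het + 2 * n_hom) / (2 * n_individuals)


# Hypothetical counts: 19 heterozygotes and no homozygotes among 1904 people.
example = allele_frequency(19, 0, 1904)
```

With these hypothetical counts the frequency is roughly 0.5%, which illustrates why a cap of 19 heterozygous carriers keeps variants below the 1% rarity threshold.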

Additionally, we identified 68 variants with PharmGKB 2A or 2B levels of evidence. Seven of them also appear in the first category (PharmGKB 1A or 1B) but with a different type of association or a different drug-variant pair, and were for that reason listed twice. Altogether, the most common pharmacogenetic variant detected in the Slovenian genomic database was a missense variant in the F5 gene (Gln534Arg, rs6025, PharmGKB 2A) with a MAFSlo of 88.4% (MAFgnomAD = 98.0%). This was followed by a synonymous variant in the ABCC4 gene (Lys1116Lys, rs1751034, PharmGKB 2B), associated with response to tenofovir, with a MAFSlo of 77.0% (MAFgnomAD = 81.0%). A missense variant in the TP53 gene (Pro72Arg, rs1042522, PharmGKB 2B) with a MAF in the Slovenian population of 71.8% (MAFgnomAD = 66.9%) was the third most frequently detected pharmacogenetic variant in Slovenian individuals. The variant is associated with the efficacy and toxicity/ADR of antineoplastic agents, such as cisplatin, cyclophosphamide, fluorouracil, and paclitaxel. A start lost variant in the VDR gene (Met1?, rs2228570) was the second most prevalent variant in the PharmGKB 2A category, with a MAFSlo of 52.8% (MAFgnomAD = 62.9%). The variant is associated with the efficacy of peginterferon alfa-2b in patients suffering from chronic hepatitis C. A missense variant in the COMT gene (Val158Met, rs4680, PharmGKB 2A) reached a MAFSlo of 48.6%, which is in line with the gnomAD MAF of 46.3%. Variant-drug pairs along with corresponding frequencies are summarized in **Table 2** and **Supplementary Tables S2**–**S4**.

We detected 89 ClinVar variants with at least one accession number 6 ("drug response"), 16 of them with MAF of less than 1% (ClinVar version 02.10.2017).

When we compared MAFs for each risk variant in the Slovenian genomic database to MAFs in the gnomAD database, we generally obtained consistent results for detected exonic and splicing variation. However, some inconsistencies in the MAFs among databases were apparent. For example, the variant in the ANKK1 gene (Glu713Lys, rs1800497), associated with the toxicity and ADR of antipsychotics, had a MAF of 26.4% in the gnomAD database (MAFEuropean(Non−Finnish) = 19.2%), but a MAF of only 17.3% in the Slovenian genomic database. A synonymous variant in the CYP2C19 gene (Pro227Pro, rs4244285) influencing the efficacy of several drugs, including amitriptyline, clopidogrel, citalopram, and clomipramine, had a MAF of 17.6% in the gnomAD database (MAFEuropean(Non−Finnish) = 14.7%) and 12.6% in the Slovenian database. The variant in the F5 gene (Gln534Arg, rs6025), associated with the adverse event of thrombosis in systemic hormonal contraceptive use, had a MAF of 98.0% in gnomAD (MAFEuropean(Non−Finnish) = 97.0%) and only 88.4% in the Slovenian population. In contrast, a missense variant in the SLCO1B1 gene (Val174Ala, rs4149056), associated with the adverse drug reaction and toxicity of simvastatin, had a MAF of 13.3% in gnomAD (MAFEuropean(Non−Finnish) = 15.6%) and 19.2% in the Slovenian database. It is important to note that some of the detected variants were intronic; their MAFSlo may therefore be unreliable owing to the lack of sequence coverage for these regions in exome sequencing data.
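The discrepancies listed in this paragraph can be tabulated with a few lines of code. The MAF values below are taken from the text; the dictionary layout and function name are assumptions for illustration, and no statistical test is implied.

```python
# (gene, rsID): (MAF in gnomAD, MAF in Slovenian database), in percent
MAFS = {
    ("ANKK1", "rs1800497"): (26.4, 17.3),
    ("CYP2C19", "rs4244285"): (17.6, 12.6),
    ("F5", "rs6025"): (98.0, 88.4),
    ("SLCO1B1", "rs4149056"): (13.3, 19.2),
}


def maf_differences(mafs):
    """Slovenian minus gnomAD MAF, in percentage points."""
    return {key: round(slo - gnomad, 1) for key, (gnomad, slo) in mafs.items()}


for (gene, rsid), diff in maf_differences(MAFS).items():
    print(f"{gene} {rsid}: {diff:+.1f} percentage points")
```

Three of the four variants are less frequent in the Slovenian database than in gnomAD, while the SLCO1B1 variant is more frequent.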

#### Functional Impacts of Variants

Next, we characterized the rare variants on the basis of predicted functional impacts. We detected 2101 variants that reached a CADD score above the cut-off value of 20 and thus ranked in the top 1% of the most deleterious variants (Kircher et al., 2014). Several in silico prediction algorithms (MutationTaster, Polyphen-2, SIFT, MetaSVM) predicted in consensus as pathogenic 565 missense variants that also reached a CADD score above 20. These included 131 novel variants not previously reported in the gnomAD, dbSNP or ClinVar databases (**Supplementary Table S5**). We further analyzed rare variants with protein-truncating effects, including frameshift variants, stop-gain variants, and variants affecting splicing, which resulted in an additional 429 variants predicted to be highly pathogenic based on their functional impact, of which 177 were novel (**Supplementary Table S6**). Most of the rare putatively functional variants were thus missense, followed by frameshift (n = 174) and stop-gain variants (n = 127).

#### TABLE 2 | Variant-drug pairs with 1A or 1B clinical annotation according to Pharmacogenomics Knowledgebase (PharmGKB).

HetSlo, number of heterozygotes in the Slovenian genomic database; HomSlo, number of homozygotes in Slovenian genomic database; MAF, minor allele frequency; 1A, variant-drug pairs with a CPIC pharmacogenetic guideline or variants implemented at a PGRN site or another major health system; 1B, variant-drug pairs in which the preponderance of evidence shows an association that has been replicated in more than one cohort, with significant p-values and preferably with a strong effect size.
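The consensus criterion used for missense variants can be sketched as follows. Two points are assumptions: "consensus" is read here as all four tools agreeing, and each tool's call is supplied as a precomputed boolean (hypothetical input format) rather than derived from raw scores. The Phred-style scaling of CADD, where a score of 20 corresponds to the top 1% of ranked substitutions, follows the definition of scaled CADD scores.

```python
CADD_CUTOFF = 20
PREDICTORS = ("MutationTaster", "PolyPhen-2", "SIFT", "MetaSVM")


def cadd_top_fraction(phred):
    """Fraction of substitutions ranked at or above a scaled CADD score."""
    return 10 ** (-phred / 10)


def consensus_pathogenic(calls, cadd_phred):
    """True if every predictor calls the variant damaging and CADD > 20.

    calls: dict mapping predictor name -> bool (hypothetical input format).
    A missing predictor call is treated as "not damaging".
    """
    return cadd_phred > CADD_CUTOFF and all(calls.get(p, False) for p in PREDICTORS)


# Counts reported in the text: novel missense plus novel protein-truncating.
novel_total = 131 + 177
```

The last line reproduces the 308 novel potentially actionable variants reported in the abstract.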

Additionally, we examined CADD annotation scores separately for each gene and further compared median CADD scores of rare variants with those of common variants. We observed the highest median CADD scores for rare variants in the following genes: CDA, SLC10A2, TPMT, and SULF1. Among CYP genes, the highest median score was detected in the CYP1A1 gene, followed by CYP24A1 and CYP2R1. In the SLC group, the SLC10A2, SLC22A2, and SLC22A6 genes ranked the highest. Altogether, the highest CADD score of 54 was detected for the known pathogenic variant in the ABCA4 gene (c.6445C>T, Arg2149<sup>∗</sup>, rs61750654), followed by the stop-gain variant in the EPHX2 gene (Arg467<sup>∗</sup>, CADD = 51). As expected, we identified significantly more damaging variants among the rare variants than among common variants with a population frequency exceeding 1%.
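A per-gene median of CADD scores, as described above, can be computed with the standard library. The function name and the `(gene, score)` input layout are assumptions; in the toy data, the scores 54 and 51 come from the text, while the value 10 is invented purely to show the median calculation.

```python
from collections import defaultdict
from statistics import median


def median_cadd_by_gene(records):
    """Median scaled CADD score per gene.

    records: iterable of (gene, cadd_phred) pairs (hypothetical layout).
    """
    scores = defaultdict(list)
    for gene, cadd in records:
        scores[gene].append(cadd)
    return {gene: median(values) for gene, values in scores.items()}


toy = [("ABCA4", 54), ("ABCA4", 10), ("EPHX2", 51)]
medians = median_cadd_by_gene(toy)
ranked = sorted(medians, key=medians.get, reverse=True)  # highest median first
```

Running the same function separately on rare and common variants would give the two sets of medians that the paragraph compares.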

#### DISCUSSION

The implementation of exome sequencing technologies into daily clinical practice makes the prospect of a personalized treatment increasingly available. Here, we showed that by using clinical and whole exome sequencing technologies it is possible to identify not only variants that are causative for patients' clinical presentation but also a considerable proportion of pharmacogenetics findings with established evidence and potential clinical utility.

Within the study population, we identified a high frequency of well-established examples of common genetic polymorphisms, as well as known rare actionable variants. We detected 24 variants with compelling evidence of pharmacogenetic significance (PharmGKB level 1A or 1B variants) associated with about 26 drugs, where 12 of them were rare, and 61 additional variants with a level 2A or 2B PharmGKB evidence. Our results are consistent with those of the previously published study by Lee et al. (2016), in which authors used combined SNP chip and exome sequence data of 1101 individuals. In their study, 29 variants were detected that ranked in the PharmGKB 1A and 1B categories; 21 of them were detected by exome sequencing technology. Similarly, 22 actionable clinical variants (PharmGKB 1A/1B) were found in 120 pharmacogenes when analyzing 1000 Genomes Phase 3 data of 2540 individuals (Wright et al., 2018).

Furthermore, our results are consistent with the known fact that rare variants are enriched for deleterious variation. We identified 308 novel variants of potential functional significance, including 131 missense variants (predicted in consensus as pathogenic by the functional prediction algorithms MutationTaster, Polyphen-2, SIFT, MetaSVM, and CADD) and 177 protein-truncating variants. We observed that, especially when testing an expanded set of genes, novel putatively functional variants and variants in genes with less established effects represent a considerable challenge in result interpretation and reporting. To date, very few studies have conducted a systematic overview of the distribution and frequency of genetic variation with potentially high impact over a large set of pharmacogenes (Kozyra et al., 2016; Schärfe et al., 2017; Wright et al., 2018). So far, studies examining rare genetic variation have been limited to small sets of genes or to gene groups, such as the extensively studied cytochrome P450 (CYP) gene family (Gordon et al., 2014; Fujikura et al., 2015). Further evaluation of functional consequences and clinical effects is required to extend our understanding of rare variants. Therefore, a considerable part of the variation in response to treatment still remains unclear and has not yet been integrated into routine clinical practice.

Moreover, we have observed that the MAFs of some known variants differ significantly in the Slovenian dataset when compared to gnomAD MAFs. This underlines the importance of establishing population-specific databases of pharmacogenomic variation. With growing pharmacogenetic databases and increased integration of sequencing technologies into clinical practice, we will also gain additional insight into rare pharmacogenetic variation. This will make publicly accessible and easily updatable data repositories such as CPIC, PharmGKB, ClinVar, and the Pharmacogene Variation (PharmVar) Consortium, as well as population-specific databases, essential for the accurate interpretation of pharmacogenomics results along with the subsequent integration of dosing recommendations and guidelines into electronic healthcare record systems.

Also, the identification and reporting of pharmacogenetic findings are in many aspects distinct from the reporting of disease-causative variants. The proposed American College of Medical Genetics and Genomics (ACMG) criteria for interpretation of sequence variants are not intended for pharmacogenomic findings (Richards et al., 2015). While comprehensive phenotyping data could be of particular value when interpreting putative disease-causative variants, the genotype-phenotype correlation for pharmacogenomic findings is apparent only when the patient is exposed to a specific drug. Furthermore, the results may not be useful at the time of reporting, but only when the particular drug is prescribed to the patient. However, if actionable variants from sequencing data are reported or stored preemptively, they may be available prior to prescription and for a wide range of medications, subsequently influencing decisions about treatment, which could be of significant medical benefit to patients (Dunnenberger et al., 2015; Ji et al., 2016).

Compared to approaches targeting only known pharmacogenetics variants, sequencing technologies are beneficial for a number of additional aspects. SNP genotyping assays may be unable to detect low-frequency variants with potential deleterious functional effects. Besides, the response to a majority of the drugs is influenced by several genes, including genes encoding drug metabolizing enzymes, transporters, drug targets, and disease-modifying genes, or by various variants within the same gene, which may not be detected using approaches targeting known pharmacogenetics variants (Relling and Evans, 2015). We have demonstrated that exome sequencing is an effective method for the detection of both rare and common pharmacogenetic variants in a large set of genes under one investigation.

The present study identifies a high number of clinically relevant highly actionable variant-drug associations, with already established dosing guidelines and recommendations applicable for the use in personalized treatment. Here we highlight the potential clinical utility for a selection of variants detected in the Slovenian database.

A decreased function missense variant in the SLCO1B1 gene (Val174Ala, rs4149056, allele <sup>∗</sup> 5) has a MAF of 19.2% in the Slovenian population. It was identified as heterozygous in 585/1904 (31%) individuals and as homozygous in 73/1904 (4%) individuals, who, therefore, have an intermediate and high myopathy risk, respectively, when receiving simvastatin treatment. Consequently, a lower dose or alternative statin (e.g., pravastatin or rosuvastatin) and routine creatine kinase (CK) surveillance are recommended for these individuals (Wilke et al., 2012; Ramsey et al., 2014).
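The reported allele frequency can be checked directly against the genotype counts. The short sketch below assumes a biallelic site at which all 1904 individuals were genotyped; the function name is illustrative and is not part of the study's pipeline.

```python
def minor_allele_frequency(n_het, n_hom, n_individuals):
    """Return the variant allele frequency at a biallelic site.

    Each heterozygote carries one copy of the allele and each homozygote
    two, out of 2 * n_individuals chromosomes in a diploid sample.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return (n_het + 2 * n_hom) / (2 * n_individuals)


# Genotype counts reported in the text for SLCO1B1 rs4149056 (allele *5).
slco1b1_maf = minor_allele_frequency(n_het=585, n_hom=73, n_individuals=1904)
print(f"SLCO1B1 *5 MAF: {slco1b1_maf:.1%}")
```

The counts correspond to 731 variant alleles among 3808 chromosomes, which agrees with the reported MAF of 19.2%.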

Variants of the CYP2D6 and CYP2C19 genes affect the exposure, efficacy, and safety of tricyclic antidepressants (TCAs) (Hicks et al., 2013). A synonymous variant (Pro227Pro, rs4244285) in the CYP2C19 gene defines the no-function allele <sup>∗</sup> 2 and thus greatly decreases the conversion of tertiary amines to secondary amines, which may cause a sub-optimal response. The MAF of the variant in the Slovenian population was estimated at 12.6%; 35/1904 (1.8%) individuals carried the variant in the homozygous state. These individuals should avoid tertiary amines, and alternative drugs that are not metabolized by CYP2C19 are recommended. Moreover, we detected the c.506-1G > A variant (allele <sup>∗</sup> 4, rs3892097) in the CYP2D6 gene, with an anticipated effect on splicing, resulting in greatly reduced metabolism of TCAs to less active compounds. The variant was found as heterozygous in 487/1904 (25.6%) Slovenian individuals and as homozygous in 74/1904 (3.9%). Additionally, because CYP2D6 has a major role in the activation of prodrugs such as codeine and tramadol, the variant is also relevant to these drugs.

Cytochrome P450 CYP2C19 also catalyzes the bioactivation of the antiplatelet prodrug clopidogrel, which inhibits the ADP-dependent P2Y<sub>12</sub> receptor. The CYP2C19 <sup>∗</sup> 2 loss-of-function allele impairs formation of active metabolites (Scott et al., 2013). Both heterozygous (411/1904, 21.6%) and homozygous (35/1904, 1.8%) carriers, when treated with clopidogrel for acute coronary syndromes, have significantly reduced platelet inhibition and thus an increased risk of serious adverse cardiovascular events. Alternative antiplatelet medication, such as prasugrel or ticagrelor, is strongly recommended in individuals with this variant.

Furthermore, we detected two rare variants with PharmGKB level 1A of evidence, one variant with the effect on splicing (c.1905+1G > A, allele <sup>∗</sup> 2A, rs3918290, MAFSlo = 0.263%) and another missense variant (c.2846A > T, Asp949Val, rs67376798, MAFSlo = 0.236%) in the DPYD gene. Heterozygotes for one of the detected variants in the DPYD gene have reduced leukocyte dihydropyrimidine dehydrogenase (DPD) activity (at 30–70% that of the normal population) and an increased risk of severe or even lethal drug toxicity when treated with fluoropyrimidine drugs. At least 50% reduction in starting dose is recommended, followed by titration of dose based on toxicity or pharmacokinetic test (Caudle et al., 2013).
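The variant-drug associations highlighted above can be arranged as a small lookup structure of the kind a reporting tool might use. This is a sketch only: the rule table restates the recommendations given in the text, while the data layout, the function name, and the zygosity labels are assumptions made for illustration. It is not clinical guidance and does not reproduce any CPIC or PharmGKB data format.

```python
# Each rule: (gene, allele, zygosities it applies to, drug, action).
# Entries paraphrase the text above; the structure itself is hypothetical.
RULES = [
    ("SLCO1B1", "*5", {"heterozygous", "homozygous"}, "simvastatin",
     "lower dose or alternative statin; routine creatine kinase surveillance"),
    ("CYP2C19", "*2", {"homozygous"}, "tertiary amine TCAs",
     "avoid; use an alternative drug not metabolized by CYP2C19"),
    # Stated in the text for patients with acute coronary syndromes.
    ("CYP2C19", "*2", {"heterozygous", "homozygous"}, "clopidogrel",
     "alternative antiplatelet medication (prasugrel or ticagrelor)"),
    ("DPYD", "*2A", {"heterozygous"}, "fluoropyrimidines",
     "at least 50% reduction in starting dose, then titrate"),
    ("DPYD", "c.2846A>T", {"heterozygous"}, "fluoropyrimidines",
     "at least 50% reduction in starting dose, then titrate"),
]


def recommendations_for(gene, allele, zygosity):
    """Return (drug, action) pairs matching a genotype, in table order."""
    return [
        (drug, action)
        for rule_gene, rule_allele, zygosities, drug, action in RULES
        if rule_gene == gene and rule_allele == allele and zygosity in zygosities
    ]


for drug, action in recommendations_for("CYP2C19", "*2", "homozygous"):
    print(f"{drug}: {action}")
```

Keeping one rule per gene-allele-drug combination lets a single genotype map to several drugs, which matches the way CYP2C19 <sup>∗</sup> 2 is discussed for both TCAs and clopidogrel.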

Our study also has some limitations. Firstly, the application of exome sequencing for the detection of pharmacogenomic variants is limited (Londin et al., 2014). Because of the lack of coverage of the exome test, we were not able to accurately detect the majority of the intronic variation (e.g., the rs9923231 variant of the VKORC1 gene). Due to technical limitations, structural variants and repetitive regions were not sufficiently assessed. We also recognize the limited sensitivity of exome sequencing in highly homologous regions of the human genome, including the cytochrome genes (e.g., part of the known actionable variability of the CYP2D6 gene). We recognize the possibility that we failed to detect a minor part of pharmacogenomic variation due to the limited detection of variants in these regions. Furthermore, in the present study, we did not extend the exome analysis to copy number variants (CNVs). Also, with de-identified data, we could not identify compound heterozygous states or assess the polygenic effects of variants. Nevertheless, when analyzing each patient's data separately, it will be possible to include multigenic effects, compound heterozygous states, and some of the risk haplotypes with established pharmacogenetic effects in future patients' records, which will add considerable value to the exome sequencing results. With such valuable data, we could significantly benefit future patients by increasing the efficacy of pharmacologic treatment and decreasing adverse drug responses.

# CONCLUSION

In conclusion, our results demonstrate that nationally based exome sequencing data represent a valuable source for identification of pharmacogenetic variants. The direct inclusion of actionable pharmacogenetic findings in patients' records could significantly improve the outcome in patients who underwent diagnostic exome and genome sequencing. Furthermore, our data provide the first comprehensive overview of the distribution of both rare and common variants within several pharmacogenes and the first estimates of their prevalence in the Slovenian population. We have shown that testing beyond known polymorphisms is warranted to gain further insight into rare variation and to facilitate more reliable future interpretation and reporting of pharmacogenetic findings. We anticipate that the present dataset will be of great importance for future research and validation of pharmacogenetic variation in the Slovenian population. Based on our results, we propose that known pharmacogenetic variants with well-established effects should be a part of every genetic report.

# AUTHOR CONTRIBUTIONS

BP, KH, and AM contributed to conception and design of the study. KH and AM performed the statistical analysis. KH wrote the first draft of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

# FUNDING

This study was funded by the Slovenian Research Agency (ARRS), grant no. P3-0326.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2019.00240/full#supplementary-material

# REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Hoˇcevar, Maver and Peterlin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Integrating Next-Generation Sequencing in the Clinical Pharmacogenomics Workflow

Efstathia Giannopoulou<sup>1</sup>, Theodora Katsila<sup>1</sup>, Christina Mitropoulou<sup>2</sup>, Evangelia-Eirini Tsermpini<sup>1</sup> and George P. Patrinos<sup>1,3,4</sup>\*

<sup>1</sup> Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece, <sup>2</sup> The Golden Helix Foundation, London, United Kingdom, <sup>3</sup> Department of Pathology, College of Medicine and Health Sciences, United Arab Emirates University, Al-Ain, United Arab Emirates, <sup>4</sup> Zayed Center of Health Sciences, United Arab Emirates University, Al-Ain, United Arab Emirates

#### Edited by:

Ulrich M. Zanger, Dr. Margarete Fischer-Bosch-Institut für Klinische Pharmakologie (IKP), Germany

#### Reviewed by:

Joseph Borg, University of Malta, Malta Volker Martin Lauschke, Karolinska Institute (KI), Sweden

> \*Correspondence: George P. Patrinos gpatrinos@upatras.gr

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology

Received: 09 January 2019 Accepted: 27 March 2019 Published: 05 April 2019

#### Citation:

Giannopoulou E, Katsila T, Mitropoulou C, Tsermpini E-E and Patrinos GP (2019) Integrating Next-Generation Sequencing in the Clinical Pharmacogenomics Workflow. Front. Pharmacol. 10:384. doi: 10.3389/fphar.2019.00384

Pharmacogenomics has been recognized as a fundamental tool in the era of personalized medicine, with up to 266 drug labels, approved by major regulatory bodies, currently containing pharmacogenomics information. Next-generation sequencing analysis assumes a critical role in personalized medicine, providing a comprehensive profile of an individual's variome, particularly that of clinical relevance, comprising pathogenic variants and pharmacogenomic biomarkers. Here, we propose a strategy to integrate next-generation sequencing into the current clinical pharmacogenomics workflow, from deep resequencing to pharmacogenomics consultation, according to the existing guidelines and recommendations.

Keywords: clinical pharmacogenomics, workflow, implementation, next-generation sequencing, clinical decision support tools

### INTRODUCTION

Since the 1950s, many pioneers in the biomedicine field have reported individual variability in disease management and envisioned personalized medicine in health care (Evans and Relling, 1999). Nevertheless, the realistic application of genomic findings and technologies in the clinic goes beyond the discovery of gene variants and their validation in clinical trials. Lam (2013) has suggested a series of stages regarding the development and implementation pathways for pharmacogenomic tests, namely: (i) discovery of pharmacogenomic biomarkers and validation in well-controlled studies with independent populations; (ii) replication of drug-gene(s) association and demonstration of utility in at-risk patients; (iii) development and regulatory approval of a companion diagnostic test; (iv) assessment of the clinical impact and cost-effectiveness of the pharmacogenomic biomarkers; and (v) involvement of all stakeholders in clinical implementation.

Notably, the scientific challenges and implementation barriers existing within the abovementioned stages remain largely unaddressed. Pharmacogenomic testing is performed by genotyping or sequencing and is mostly outsourced from hospitals to private companies, making it a time-consuming and costly process (Harper and Topol, 2012). Unfortunately, there is still a profound lack of understanding within the medical community regarding genomics and the impact of genomic variants in rationalizing drug prescription (Stanek et al., 2012; Mitropoulou et al., 2014).

On the other hand, the pharmacy benefit managers (involved with authorizing or fulfilling most prescriptions in the United States) have been particularly interested in the use of pharmacogenomic testing to save employers (their customers) the cost of a drug through genotyping, making the pharmacy benefit managers in question more competitive (Topol, 2010).

In 2013, the United States Food and Drug Administration (FDA) issued a guidance for industry entitled "Clinical Pharmacogenomics: Premarket Evaluation in Early-Phase Clinical Studies and Recommendations for Labeling"<sup>1</sup> in an effort to address the challenges that need to be met. The FDA has also established the Genomics and Targeted Therapy Group<sup>2</sup> toward the advancement of the application of genomic technologies in the discovery, development, regulation, and use of medications. In the same context, the United States National Cancer Institute has announced a rather similar research and development workflow toward treatment strategies in cancer, including: (i) the support of the routine collection of germline and tumor biospecimens from clinical trials or population-based studies, (ii) the support of efficacy/toxicity biomarker development, (iii) the incorporation of pharmacogenomic markers into clinical trials, and (iv) the consideration of ethical, legal, social, biospecimen, and data-sharing implications of pharmacogenomics research (Freedman et al., 2010). Today, the FDA has approved 266 drugs that include genetic information in their labels (Drozda et al., 2018) and the same is true for the European Medicines Agency (Ehmann et al., 2015). The distribution of these drugs between various target diseases indicates that oncology, cardiology, psychiatry, and neurology are among the most common areas in which pharmacogenomics is readily applicable for routine clinical care (Potamias et al., 2014).

#### NEXT-GENERATION SEQUENCING GENOTYPING IN PHARMACOGENOMICS

Considering the plummeting cost of genotyping, particularly in a high-throughput format, such as panel-based genotyping and/or next-generation sequencing, as well as data accuracy improvements, one would envisage that comprehensive pharmacogenomic testing using these approaches could be readily applicable in a clinical setting (Kitzmiller et al., 2011). Indeed, major academic institutions, government-sponsored as well as private organizations, and research consortia are engaged in collaborative programs that focus on next-generation sequencing of the cancer genome, aiming to describe the architecture of cancer-specific somatic alterations and, as such, aid clinicians in disease management (Simon and Roychowdhury, 2013). Others, such as the SEAPharm Consortium<sup>3</sup>, are currently using targeted resequencing of 100 pharmacogenes to explore the allelic architecture of pharmacogenomic variants and the most prevalent pharmacogenomic biomarkers in Southeast Asian populations.

Recently, by investigating the exome sequences of over 60,000 individuals, Ingelman-Sundberg et al. (2018) demonstrated that each individual harbors, on average, approximately 41 putatively functional pharmacogenomic variants, of which 10.8% are rare and found to be highly gene- and drug-specific, accounting for a substantial part of the unexplained inter-individual differences in drug metabolism phenotypes.

Still, in contrast to identifying the genetic basis of disorders characterized by a high degree of phenotypic and clinical variability and/or genetic heterogeneity (Ku et al., 2016), in the case of pharmacogenomic testing, where the role of several pharmacogenes is well established, targeted gene resequencing appears more relevant than whole exome sequencing, as it also captures rare pharmacovariants that lie outside the gene exons, such as in promoters, intronic and untranslated sequences, which have been shown to lead to drastic reduction of drug metabolizing enzyme activity. This is further highlighted in a recent study comparing the results obtained by whole genome sequencing, whole exome sequencing, and microarray-based genotyping, indicating that the performance of genotyping arrays is similar to that of whole genome sequencing, whereas whole exome sequencing is not suitable for pharmacogenomics predictions (Reisberg et al., 2019). In any case, novel and rare pharmacovariants that can only be identified by next-generation sequencing approaches are of utmost importance in personalized drug therapy, as they provide information that helps to avoid adverse drug reactions and lack of response (Lauschke and Ingelman-Sundberg, 2018).

Tumor samples are known to contain both acquired and inherited alterations, along with somatic DNA. Thus, cancer sequencing efforts also capture germline information. This germline information plays a crucial role in optimizing the dose and selection of therapy. A unique benefit of next-generation sequencing is the ability to discover rare variants in the genome (in cancer patients, germline DNA is also analyzed as a means to identify variants in the tumor) and then delineate their impact on drug response (Gillis et al., 2014). This has been previously demonstrated by Mizzi et al. (2014), indicating that novel and rare variants can exert a deleterious effect in drug metabolizing enzymes, such as CYP2D6, TPMT, and CYP2C19, involved in anti-cancer, psychiatric and cardiology drug treatment, among others, by introducing premature stop codons or frameshifts very close to the N-terminus of the enzyme. These authors also demonstrated that whole genome sequencing could identify novel CYP2C9 variants relevant to anticoagulation treatment that could not have been identified using microarray-based genotyping approaches and that could potentially guide the choice of alternative anticoagulation treatment modalities in two patients suffering from atrial fibrillation (Mizzi et al., 2014). Furthermore, compared with Sanger sequencing, next-generation sequencing technology yields more accurate quantitative results when somatic variation is considered and operates at a higher throughput (Simon and Roychowdhury, 2013). Indeed, findings in genes involved in the metabolism of anti-cancer drugs further demonstrate the potential applicability of whole genome sequencing for pharmacogenomic testing in a clinical setting in the not too distant future (McCarty et al., 2011; Mizzi et al., 2014; Karageorgos et al., 2015).

<sup>1</sup>https://federalregister.gov/a/2013-01638

<sup>2</sup>https://www.fda.gov/drugs/scienceresearch/ucm572617.htm

<sup>3</sup>http://www.pharmagtc.org/seapharm

## INFORMATION TECHNOLOGIES AND DATA INTERPRETATION

Currently, difficulties in pharmacogenomics data interpretation are held responsible for the slow clinical uptake of pharmacogenomics. Two main aspects of data interpretation have been identified to affect pharmacogenomics translation into clinical practice: (i) the interpretation of reported genetic results by clinicians and (ii) the interpretation of published research results. It has become evident that, although the vast majority of health professionals acknowledge that genetic variations may influence drug response, only a limited number of them feel adequately informed about pharmacogenomic testing and data interpretation (Stanek et al., 2012; Mitropoulou et al., 2014). So far, standardization in conducting pharmacogenomics studies is lacking, mainly due to inconsistencies in results reporting (O'Donnell and Ratain, 2012). These inconsistencies make data interpretation challenging or even chaotic for researchers, professional organizations, consortia and clinicians alike, and international efforts are currently ongoing to standardize the reporting of pharmacogenomics testing (Kalman et al., 2016).

With the advent of next-generation sequencing, collaboration toward data accumulation would help maximize its clinical benefit, as large sample sizes would provide the means to retrospectively analyze large patient cohorts for (i) discovery of common and rare variants, (ii) validation, and (iii) pharmacogenomics outcomes toward decision-making. Today, the Electronic Medical Records and Genomics (eMERGE) Network attempts to maximize the benefit from next-generation sequencing analyses, focusing on the combination of DNA biorepositories with electronic medical records to facilitate large-scale, high-throughput genetic research and return genetic testing results to patients in a clinical setting (McCarty et al., 2011). Such efforts deserve to be exploited further, covering somatic and germline variant discovery and implementation as well as clinical and uptake outcomes.

To this end, in the big data era, biomedical scientists need to critically appraise data, collaborate in an efficient and effective way, and make decisions. For this, large volumes of complex multi-faceted data need to be meaningfully assembled, mined, analyzed and provided in a user-friendly manner. An innovative web-based collaboration support platform that adopts a hybrid approach on the basis of the synergy between machine and human intelligence was previously reported, aiming to facilitate the underlying sense-making and decision-making processes (Tsiliki et al., 2014). Clinical decision support (CDS) tools have also proved valuable in the context of clinical pharmacogenomics, as they provide guidance on clinical decisions through electronic medical records (Bell et al., 2014). Again, these tools demand clear and precise algorithms based on scientifically robust findings, ideally combining different variant prediction tools so that novel and rare pharmacogenomic variants are taken into consideration and their pathogenicity is determined.

# VALIDATION AND ACCREDITATION OF SERVICES

The application of pharmacogenomics in personalized medicine is very challenging and influences medicine and biomedical research in many areas, namely clinical medicine, drug development, drug regulation, pharmacology, and toxicology (Tremblay and Hamet, 2013; Drozda et al., 2018). However, many issues have to be addressed, including genomic data quality and the accreditation of assays.

According to the European Medicines Agency (EMA) guidelines, there is a regulatory framework defined by Good Clinical Practice (GCP) compliance (European Medicines Agency, 2001/2005), Good Laboratory Practice (GLP) compliance (European Medicines Agency, 2015), Good Manufacturing Practice (GMP), and Good Distribution Practice (GDP) (European Medicines Agency, 2001), while recently a guideline for Good Pharmacogenomics Practice has been produced (European Medicines Agency, 2018). In particular, this guideline stresses the importance of all steps included in any next-generation sequencing protocol: DNA extraction, DNA processing, preparation of libraries, generation of sequence reads and base calling, sequence mapping, variant annotation and filtering, variant classification, and interpretation. According to this guideline, a crucial parameter for next-generation sequencing analysis is the minimum sequencing coverage, which in the case of germline pharmacovariants should be at least 30×, while in the case of rare variants, a higher coverage is needed in order to ensure that the rarer variants are also detected. Also, in the case of highly homologous genes and pseudogenes, which can contribute to miscalled variants due to sequencing artifacts, it is recommended to include methods that use substantially longer read lengths, i.e., fragments longer than 1000 base pairs.
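The minimum-coverage requirement lends itself to a simple quality-control check. The sketch below assumes that per-site read depths are already available as a mapping from variant identifier to depth; the function name and the depth values are hypothetical, and only the 30× threshold for germline pharmacovariants comes from the text.

```python
MIN_GERMLINE_COVERAGE = 30  # minimum depth stated for germline pharmacovariants


def low_coverage_sites(depth_by_site, min_coverage=MIN_GERMLINE_COVERAGE):
    """Return the sites whose read depth is below min_coverage, sorted by name.

    A depth equal to the threshold passes, matching "at least 30x".
    """
    return sorted(
        site for site, depth in depth_by_site.items() if depth < min_coverage
    )


# Hypothetical per-site read depths, invented for illustration.
example_depths = {
    "rs4149056": 84,
    "rs4244285": 27,
    "rs3892097": 30,
    "rs3918290": 12,
}
flagged = low_coverage_sites(example_depths)
print(flagged)
```

Sites flagged in this way would need resequencing or an orthogonal assay before a genotype is reported; a stricter `min_coverage` can be passed where rare variants are of interest, since the text notes that these require higher coverage without stating a figure.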

This guideline portfolio has been developed to ensure the quality of medical products and services. The transfer of this policy to pharmacogenomics assays is critical, since numerous studies have pointed to sources of inter- and intra-laboratory error and variability in experimental results (Ji and Davis, 2006). The quality of pharmacogenomics assays depends on the genomic complexity of the region of interest, which can impact the accuracy and precision of an assay. Consequently, it is important to understand and give due consideration to assay design (Pant et al., 2014), especially when it comes to next-generation sequencing.

Additionally, the validation of the discovery findings coming from pharmacogenomics studies in large randomized clinical trials is often difficult, due to high costs and ethical considerations (Wheeler et al., 2013). In the case of prospective clinical trials, specific drug-dosing schedules are used, providing consistent and well-maintained drug data for pharmacogenomics studies. To increase the sample size for a particular phenotype, it may be useful to combine data from the treatment arms of a clinical trial and then control for potential confounding owing to treatment differences during the statistical analyses. In this context, cancer pharmacogenomics studies have shown promising results, although replication may still be an issue (Hyman et al., 2015). Currently, there are not enough well-phenotyped patient data sets for most cancer drugs under investigation to make replication studies feasible, especially when effect sizes are small (Spencer et al., 2009; Daly, 2010). Despite the limitations and difficulties with sample size, cancer pharmacogenomics studies have demonstrated the potential to make therapy safer and more effective for patients (Spencer et al., 2009; Daly, 2010; Wheeler et al., 2013).

#### CONSULTATION

There is uncertainty about the ways in which the results of pharmacogenomics can be translated into clinical care decisions by government agencies. This reflects the complex genetic interactions, the paucity of evidence (in some cases) as well as the legal constraints imposed by the regulatory bodies. As a consequence, health professionals are in a vulnerable position (Maliepaard et al., 2013; Trent et al., 2013). This situation is reflected in the United States FDA policy, which requires every pharmacogenomics product to provide any relevant information available, but without any recommendation for use (Maliepaard et al., 2013; Trent et al., 2013). Uncertainty and lack of information put additional pressure on professional societies to develop appropriate clinical practice guidelines, so that patient care is not compromised and unnecessary genetic testing is avoided (Maliepaard et al., 2013; Trent et al., 2013). No doubt, multiple sources of information on pharmacogenomics tests can create confusion in clinical decision-making. To overcome this, PharmGKB<sup>4</sup> was established to consolidate datasets into one curated database, where users can query by drug, gene, disease or metabolic pathway to obtain information such as drug properties, pathway diagrams and related publications in a centralized manner. Also, the Clinical Pharmacogenetics Implementation Consortium (CPIC<sup>5</sup>) and the Dutch Pharmacogenetics Working Group (Dutch Pharmacogenetics Working Group, 2005) have issued guidelines per gene-drug combination, assisting healthcare professionals in interpreting pharmacogenomic testing results and adjusting the dose or selecting an alternative drug accordingly.

#### CONCLUDING REMARKS

In the era of big data and -omics technologies, the translation of pharmacogenomics into the clinic has yet to be achieved. This applies not only to next-generation sequencing-based genotyping but also to the more easily applicable low-to-medium throughput (single variant to panel-based) genotyping. Nevertheless, next-generation sequencing will soon be part of the clinical reality and, as such, one of the first areas in which it will be readily applicable is the rationalization of drug use.

Depending on the available resources and infrastructure, the application of next-generation sequencing in pharmacogenomics will vary. In low-resource settings, it will take the form of targeted pharmacogene resequencing, either in a panel-based format per drug category (e.g., cardiovascular diseases, oncology, psychiatric diseases, etc.) or in a more comprehensive preemptive pharmacogenomics format including as many pharmacogenes as possible. In settings where whole exome or, ideally, whole genome sequencing is available, pharmacogenomic variant identification will be performed simultaneously with the genetic diagnosis of disease, focusing only on those variants in the pharmacogenes. As such, the following workflow is recommended for clinical pharmacogenomics (outlined in **Figure 1**):

(1) Next generation sequencing (targeted pharmacogene resequencing, whole exome and/or whole genome sequencing) will be performed in duly accredited laboratories, following the established guidelines for good pharmacogenomics and other practices,




Such a pharmacogenomics scoring system is currently being developed (Patrinos GP, unpublished) to facilitate integration of next-generation sequencing for pharmacogenomics into the routine clinical care. In addition, there are further opportunities for omics-related disciplines, beyond genomics, to be employed for personalized drug response predictions, namely pharmacoepigenomics (Lauschke et al., 2018), pharmacometagenomics (Balasopoulou et al., 2016) and/or pharmacometabolomics (Balasopoulou et al., 2016; Balashova et al., 2018).

We feel that proper implementation of the proposed workflow for next-generation sequencing-based pharmacogenomic testing can occur only via the synergy of all stakeholders and their will to implement the current technological advances, in this case, next-generation sequencing and information technologies. In cancer, particularly, such a synergy would be greatly beneficial in addressing the complexity of the disease and its great inter-individual variability.

#### AUTHOR CONTRIBUTIONS

All authors have compiled and approved the manuscript.

#### FUNDING

This work has been partly funded by a European Commission grant (Ubiquitous Pharmacogenomics (U-PGx); H2020- 668353) to GP and encouraged by the Genomic Medicine Alliance Pharmacogenomics Working Group. GP is a Full Member of the European Medicines Agency, CHMP-Pharmacogenomics Working Party.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Giannopoulou, Katsila, Mitropoulou, Tsermpini and Patrinos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Star Allele-Based Haplotyping versus Gene-Wise Variant Burden Scoring for Predicting 6-Mercaptopurine Intolerance in Pediatric Acute Lymphoblastic Leukemia Patients

*Yoomi Park1, Hyery Kim2, Jung Yoon Choi3,4, Sunmin Yun1, Byung-Joo Min1, Myung-Eui Seo1, Ho Joon Im2, Hyoung Jin Kang3,4 and Ju Han Kim1,5\**

#### *Edited by:*

*Martin A. Kennedy, University of Otago, New Zealand*

#### *Reviewed by:*

*Chakradhara Rao Satyanarayana Uppugunduri, Université de Genève, Switzerland Maria J. Prata, University of Porto, Portugal William Newman, University of Manchester, United Kingdom*

> *\*Correspondence: Ju Han Kim juhan@snu.ac.kr*

#### *Specialty section:*

*This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology*

*Received: 14 January 2019 Accepted: 20 May 2019 Published: 11 June 2019*

#### *Citation:*

*Park Y, Kim H, Choi JY, Yun S, Min B-J, Seo M-E, Im HJ, Kang HJ and Kim JH (2019) Star Allele-Based Haplotyping versus Gene-Wise Variant Burden Scoring for Predicting 6-Mercaptopurine Intolerance in Pediatric Acute Lymphoblastic Leukemia Patients. Front. Pharmacol. 10:654. doi: 10.3389/fphar.2019.00654*

*1 Seoul National University Biomedical Informatics (SNUBI), Division of Biomedical Informatics, Seoul National University College of Medicine, Seoul, South Korea, 2 Department of Pediatrics, Asan Medical Center, University of Ulsan College of Medicine, Seoul, South Korea, 3 Department of Pediatrics, Seoul National University College of Medicine, Seoul, South Korea, 4 Seoul National University Cancer Research Institute, Seoul, South Korea, 5 Center for Precision Medicine, Seoul National University Hospital, Seoul, South Korea* 

*Nudix Hydrolase 15 (NUDT15)* and *Thiopurine S-Methyltransferase (TPMT)* are strong genetic determinants of thiopurine toxicity in pediatric acute lymphoblastic leukemia (ALL) patients. Since patients with *NUDT15* or *TPMT* deficiency suffer severe adverse drug reactions, star (\*) allele-based haplotypes have been used to predict optimal 6-mercaptopurine (6-MP) dosing. However, star allele haplotyping suffers from insufficient, inconsistent, and even conflicting designations, with alleles of uncertain and/or unknown function. Gene-wise variant burden (GVB) scoring enables us to utilize next-generation sequencing (NGS) data to predict 6-MP intolerance in children with ALL. Whole exome sequencing was performed for 244 pediatric ALL patients under 6-MP treatment. We assigned star alleles with the PharmGKB haplotype set translational table. GVB for *NUDT15* and *TPMT* was computed by aggregating *in silico* deleteriousness scores of multiple coding variants for each gene. Poor last-cycle dose intensity percent (DIP < 25%) was considered as 6-MP intolerance, resulting in therapeutic failure of ALL. DIPs showed significant differences (*p* < 0.05) among *NUDT15* poor (PM, *n* = 1), intermediate (IM, *n* = 48), and normal (NM, *n* = 195) metabolizers. *TPMT* exhibited no PM and only seven IMs. GVB showed significant differences among the different haplotype groups of both *NUDT15* and *TPMT* (*p* < 0.05). Kruskal–Wallis test for DIP values showed statistical significance for the seven different GVB score bins of *NUDT15*. GVB*NUDT15* outperformed the star allele-based haplotypes in predicting patients with reduced last-cycle DIPs at all DIP threshold levels (i.e., 5%, 10%, 15%, and 25%). In *NUDT15*-and-*TPMT* combined interaction analyses, GVB*NUDT15*,*TPMT* outperformed star alleles [area under the receiver operating curve (AUROC) = 0.677 vs. 0.645] in specificity (0.813 vs. 0.796), sensitivity (0.526 vs. 0.474), and positive (0.192 vs. 0.164) and negative (0.953 vs. 0.947) predictive values. Overall, GVB correctly classified five more patients (i.e., one into *below* and four into *above 25% DIP* groups) than did star allele haplotypes. GVB analysis demonstrated that 6-MP intolerance in pediatric ALL can be reliably predicted by aggregating NGS-based common, rare, and novel variants together without hampering the predictive power of the conventional haplotype analysis.

Keywords: 6-mercaptopurine, drug toxicity, variant burden, pharmacogenetics, pharmacogenomics, next-generation sequencing, Nudix Hydrolase 15 (NUDT15), Thiopurine S-Methyltransferase (TPMT)

# INTRODUCTION

6-Mercaptopurine (6-MP) is a commonly used drug in the maintenance therapy of pediatric acute lymphoblastic leukemia (ALL). Since patients have a potential to experience medication-induced life-threatening side effects including bone marrow suppression and hepatotoxicity, providing a tailored drug dosing regimen is essential in clinical practice (Vogenberg et al., 2010).

One of the strongest ways to determine the initial 6-MP dose is to assess experimentally the potential for adverse drug reactions, such as severe neutropenia, by monitoring 6-MP metabolite concentrations or using *in vitro* activity profiles (Dubinsky et al., 2000; Ansari et al., 2002; Cuffari, 2005; Bradford, 2011; Supandi et al., 2018). However, applying such methods in routine clinical practice to predict drug-induced toxicity is still challenging because they are extremely time-consuming, expensive, and inefficient (González-Lama and Gisbert, 2016).

As recent studies have demonstrated a strong association between genetic polymorphisms and inter-individual variability in 6-MP dose intensity, approaches to predict drug tolerance on the basis of individual genomic profiles have arisen. The primary genetic determinant of thiopurine toxicity is *TPMT*, which plays a crucial role in identifying patients with reduced enzyme activity who need treatment modification (Lennard, 2014). However, this has not been applicable to East Asian populations since the frequency of *TPMT* polymorphisms varies by ethnicity (Relling et al., 2013). Recently, a novel pharmacogenetic marker, *NUDT15*, has been shown to predict thiopurine toxicity in Asian populations (Yang et al., 2014; Yang et al., 2015; Zgheib et al., 2016; Kakuta et al., 2017). The Clinical Pharmacogenetics Implementation Consortium (CPIC) published an updated guideline for thiopurine dosing based on both *TPMT* and *NUDT15* genotypes using the star allele-based dose prediction method (Relling and Klein, 2009; Relling et al., 2018). This prevailing method provides therapeutic recommendations for dosing based on star allele genotypes. However, the utilization of star alleles in clinical practice faces many obstacles, mainly due to 1) the extremely complex nomenclature system, 2) the limited resolution of phenotype prediction due to many alleles of unknown or uncertain function, 3) disregard of the functional impacts of rare and/or novel variants, and 4) applicability limited to previously studied populations (Robarge et al., 2007). Next-generation sequencing (NGS) challenges the conventional star alleles, which are based on genotyping technologies and clinical studies in case–control settings.

In the era of NGS, the comprehensive genotyping capabilities of NGS platform have enabled us to capture the true diversity of gene variation, and researchers propose alternative ways to predict individual intolerance towards a drug. One promising method is a gene-wise variant burden (GVB) scoring approach that can calculate gene-wise cumulative variant deleteriousness scores including common, rare, and even novel genetic variants for each gene (Lee et al., 2016). Here, we assessed the utility of GVB scoring method in quantifying the potential contributing effect of variants on enzymatic activity. By combining the clinically proven and well-established associations between the two genes, i.e., *NUDT15* and *TPMT*, and 6-MP dose intensity percent (DIP, actual/planned dose) as a clinical endpoint, we performed a comparison study of the conventional star allele-based haplotyping and GVB scoring methods for predicting the last-cycle 6-MP DIP as an indicator for 6-MP intolerance of ALL patients with *NUDT15* and/or *TPMT*  deficiency. Overall, both star alleles and GVB showed significant correlations with 6-MP DIP values. Star allele-based haplotype groups showed significant correlation with GVB score groups. For predicting reduced last-cycle DIP values, GVB analysis outperformed the conventional star allele method for *NUDT15* and showed comparable result for *TPMT*. In *NUDT15*-and-*TPMT*  combined interaction analyses, GVB*NUDT15*,*TPMT* outperformed star allele-based predictions [area under the receiver operating curve (AUROC) = 0.677 vs. 0.645] in specificity (0.813 vs. 0.796), sensitivity (0.526 vs. 0.474), and positive (PPV; 0.192 vs. 0.164) and negative (NPV; 0.953 vs. 0.947) predictive values. 
It is demonstrated that gene-wise evaluation of *in silico* deleterious variant score burden can be a useful method for predicting 6-MP intolerance in pediatric ALL patients, considering NGS-based common, rare, and novel variants concurrently while not hampering the predictive power of the conventional haplotype analysis.

# MATERIALS AND METHODS

#### Patients and Clinical Data Collection

A total of 298 Korean pediatric ALL patients with 6-MP treatment during maintenance therapy were recruited in the present study from two major teaching hospitals, i.e., Asan Medical Center (AMC) and Seoul National University Hospital (SNUH). Of the 298 subjects, 244 individuals who did not meet the exclusion criteria (i.e., relapse of the disease, stem cell transplantation, Burkitt's lymphoma, mixed phenotype acute leukemia, infant ALL, or very high risk) were selected. All participants provided written informed consent. The study was approved by the AMC Review Boards and the SNUH Review Boards. The 6-MP dose per square meter of body surface area over a 12-week cycle was recorded. The maximum tolerated dose of 6-MP was defined as the dose at the last maintenance cycle for each patient. Patients from the two hospitals had received treatment under the same treatment protocol and dose adjustment guidelines to maintain absolute neutrophil count (ANC) levels within the target range (500–1,500/µL). Genotype-guided dose modification was not conducted. Additional demographic data are shown in **Table 1**.

#### TABLE 1 | Clinical characteristics of study subjects.


*†Data for age at diagnosis were not available for one subject. 6-MP, 6-mercaptopurine; AMC, Asan Medical Center; SNUH, Seoul National University Hospital.*

#### Data Generation and Sequencing

Exome sequencing was performed using the Ion AmpliSeq™ Exome panel to screen the coding sequence regions of the entire genome. This panel included the exome of 19,072 genes, and the size of the total targeted region was 57.7 Mb. The panel contained 293,903 primer pairs that were multiplexed into 12 pools to avoid primer-dimer formation and interference during PCR. The amplicons amplified by these oligo primer pairs ranged from 125 to 275 bp, and the rate of "on target" coverage for this panel was 95.69%. PCR assays were performed directly to amplify 100 ng of genomic DNA samples extracted from normal blood cells in bone marrow aspirates or peripheral blood so as to collect the target regions using the oligo primer pairs of the panel. Reaction parameters were as follows: 99°C for 2 min, followed by 10 cycles of 99°C for 15 s, 60°C for 16 min, and 10°C for 1 min. After amplification, library construction was performed by using the Ion AmpliSeq library kit plus as described in the manufacturer's instructions (Thermo Scientific, Waltham, MA). Libraries were quantified using an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA) and then diluted to ~10 pM. Subsequently, 33.3 μL of the barcoded libraries was combined in sets of three barcodes. The combined libraries were sequenced using the Ion Proton platform with PI chip V3, following the manufacturer's instructions (Thermo Scientific, Waltham, MA). Reads were mapped to the human reference genome build (hg19) with a mapping alignment program from Thermo Fisher (version 4.4, Torrent Suite Software) on germ-line and low stringency settings (minimum observed allele frequency required for a non-reference variant call: 0.18 for single-nucleotide variants (SNVs) and 0.23 for InDels; minimum Phred-scaled call quality: 14 for SNVs and 19 for InDels; minimum coverage for called variants: 35 for SNVs and 40 for InDels; and maximum strand bias: 0.95 for SNVs and 0.75 for InDels).
Single-nucleotide variants (SNVs) and short insertions/deletions (InDels) were identified *via* Genome Analysis Toolkit (GATK) 2.8-1 Unified Genotyper (DePristo et al., 2011). To estimate the pathogenicity of variants, two *in silico* variant deleteriousness prediction scores were annotated: sorting intolerant from tolerant (SIFT) (Ng, 2003) and combined annotation dependent depletion (CADD) (Kircher et al., 2014). The protein-coding gene region was defined using ANNOVAR (http://annovar.openbioinformatics.org/) (Wang et al., 2010). All the variants identified in 244 ALL samples are described in **Supplementary Table S1** and **S2**.
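The low-stringency calling thresholds listed above can be written down as a small filter. The following is only an illustrative Python sketch: the class, field, and function names are hypothetical and are not Torrent Suite or GATK parameter names, and treating each limit as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallThresholds:
    """Hypothetical container for the per-variant-type limits quoted in the text."""
    min_allele_freq: float   # minimum observed allele frequency
    min_quality: float       # minimum Phred-scaled call quality
    min_coverage: int        # minimum coverage for a called variant
    max_strand_bias: float   # maximum strand bias


# Values taken from the text; the dictionary keys are illustrative labels.
THRESHOLDS = {
    "SNV": CallThresholds(0.18, 14, 35, 0.95),
    "InDel": CallThresholds(0.23, 19, 40, 0.75),
}


def passes_filters(kind, allele_freq, quality, coverage, strand_bias):
    """Return True if a call meets all four limits (inclusive, by assumption)."""
    t = THRESHOLDS[kind]
    return (
        allele_freq >= t.min_allele_freq
        and quality >= t.min_quality
        and coverage >= t.min_coverage
        and strand_bias <= t.max_strand_bias
    )
```

For instance, a call with allele frequency 0.20, quality 20, coverage 40, and strand bias 0.5 would pass as an SNV but fail as an InDel, because the InDel allele-frequency limit is 0.23.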

#### Calculation of Gene-Wise Variant Burden Score

Gene-wise deleterious variant burden was computed for *NUDT15* and *TPMT* as described by Lee et al. (2016) and Seo et al. (2018). Under the hypothesis that variants with the potential to alter protein function do not necessarily cause, but can contribute to, harmful phenotypes, only variants with SIFT scores less than 0.7 were considered further.

$$G_i = \{\, v \mid v \text{ with a SIFT score less than } 0.7 \,\}$$

As SIFT does not provide functional scores for InDels, adj*v* for all InDel variants were assigned as 1e-8 under the hypothesis that InDels are more deleterious than single-nucleotide substitutions. Considering the dosage effects, adjusted SIFT score adj*v* was calculated for each variant according to their genotype.

$${}_{\text{adj}}v_j = \begin{cases} (\text{SIFT score})^{0.5}, & \text{if } v_j \in G_i \text{ and heterozygous} \\ \text{SIFT score}, & \text{if } v_j \in G_i \text{ and homozygous} \end{cases}$$

For each gene *Gi* with *n* deleterious variants, we calculated GVB(*Gi*), the cumulative genic effect of all coding variants of the gene, as the geometric mean of adj*v*. GVBg is set to 1 if the number *n* of variants with scores less than 0.7 in a gene is 0, indicating that the gene carries no deleterious variant.

$$\mathrm{GVB}(G_i) = \begin{cases} 1, & \text{if } n(G_i) = 0 \\ \left( \prod_{v_j \in G_i} {}_{\text{adj}}v_j \right)^{1/n(G_i)}, & \text{if } n(G_i) > 0 \end{cases}$$

We obtained GVBg values for each individual ranging from 0 to 1. To predict 6-MP sensitivity, GVB*NUDT15*,*TPMT* was generated by calculating the geometric mean of GVB*NUDT15* and GVB*TPMT*.

$$\mathbf{GVB}^{\text{NUDT15, TPMT}} = \left(\mathbf{GVB}^{\text{NUDT15}} \times \mathbf{GVB}^{\text{TPMT}}\right)^{\frac{1}{2}}$$
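The scoring steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `(sift, zygosity, is_indel)` tuple layout are hypothetical, and the sketch reads the text literally in assigning 1e-8 to every InDel regardless of zygosity.

```python
import math

SIFT_CUTOFF = 0.7    # only variants with SIFT < 0.7 enter the score
INDEL_SCORE = 1e-8   # adjusted score assigned to InDels (SIFT gives none)


def adjusted_score(sift, zygosity, is_indel=False):
    """Genotype-adjusted score for one variant.

    Heterozygous variants get the square root of the SIFT score (milder),
    homozygous variants keep the raw SIFT score.
    """
    if is_indel:
        return INDEL_SCORE  # assumption: applied irrespective of zygosity
    if zygosity == "het":
        return math.sqrt(sift)
    if zygosity == "hom":
        return sift
    raise ValueError(f"unknown zygosity: {zygosity!r}")


def gvb(variants):
    """Geometric mean of adjusted scores; 1.0 if no variant qualifies.

    `variants` is an iterable of (sift, zygosity, is_indel) tuples
    (a hypothetical layout chosen for this sketch).
    """
    scores = [
        adjusted_score(sift, zyg, is_indel)
        for sift, zyg, is_indel in variants
        if is_indel or sift < SIFT_CUTOFF
    ]
    if not scores:
        return 1.0
    return math.prod(scores) ** (1.0 / len(scores))


def combined_gvb(gvb_a, gvb_b):
    """Two-gene score: geometric mean of the per-gene GVB values."""
    return math.sqrt(gvb_a * gvb_b)
```

Under these assumptions, a single heterozygous SNV with a SIFT score of 0.01 gives a gene score of 0.1, a gene with no qualifying variant scores 1.0, and combining the two yields roughly 0.316. Lower values indicate a heavier predicted burden.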

#### Prediction of Star Allele Diplotypes for 244 Acute Lymphoblastic Leukemia Samples

To classify 244 ALL samples into three metabolizer groups, we inferred haplotypes using the PHASE 2.1.1 software (Stephens et al., 2001; Stephens and Scheet, 2005) (**Supplementary Figure S3**). On the basis of the inferred haplotype information, we extracted star alleles that matched the haplotype set translational table from PharmGKB (https://www.pharmgkb.org/) (Whirl-Carrillo et al., 2012). Predicted genotypes were translated into molecular phenotypes on the basis of the coded genotype–phenotype translation tables from Moriyama et al. (2016) for *NUDT15* and from PharmGKB tables for *TPMT*.

#### Estimation of Diagnostic Accuracies by Receiver Operating Curve Analyses

To assess prediction accuracies, we calculated DIP, the percentage of the actual administered dose to the planned dose, as an index for 6-MP drug toxicity. Dose in the last maintenance cycle was used, since the doses of 6-MP in the final maintenance cycle were supposed to be the maximum tolerated doses for patients (Kim et al., 2012). DIP prediction accuracies of GVB (GVB*NUDT15*, GVB*TPMT*, and GVB*NUDT15*,*TPMT*) and star allele-based predictions were compared using AUROC analysis with the R language pROC package (Robin et al., 2011). We computed specificity, sensitivity, PPV, and NPV under the binary classification model with nine different cutoff levels (i.e., 5%, 10%, 15%, 25%, 35%, 45%, 60%, 80%, and 100%) for defining high-risk DIP groups. All statistical analyses were performed using R version 3.5.1.
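To make the evaluation procedure concrete, the sketch below computes DIP, an AUROC, and a Youden-index cut-point in plain Python. It is not the pROC package used in the study but a from-scratch illustration. The toy cohort is entirely hypothetical, and the function names and the convention that a lower score means higher risk are assumptions of this sketch.

```python
def dose_intensity_percent(actual_dose, planned_dose):
    """DIP: actual administered dose as a percentage of the planned dose."""
    return 100.0 * actual_dose / planned_dose


def auroc_low_score_is_risk(pos_scores, neg_scores):
    """AUROC by pairwise comparison (Mann-Whitney form); ties count 0.5.

    A lower score is treated as higher risk, so a (positive, negative) pair
    is concordant when the positive patient has the smaller score.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def youden_cutpoint(pos_scores, neg_scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1,
    calling a patient high-risk when score <= threshold."""
    best = (None, -1.0)
    for t in sorted(set(pos_scores) | set(neg_scores)):
        sens = sum(s <= t for s in pos_scores) / len(pos_scores)
        spec = sum(s > t for s in neg_scores) / len(neg_scores)
        j = sens + spec - 1.0
        if j > best[1]:
            best = (t, j)
    return best


# Hypothetical toy cohort: (score, actual dose, planned dose). Not study data.
cohort = [
    (0.10, 5.0, 50.0), (0.25, 10.0, 50.0), (0.30, 30.0, 50.0),
    (1.00, 35.0, 50.0), (1.00, 7.5, 50.0), (0.25, 40.0, 50.0),
    (1.00, 45.0, 50.0), (1.00, 25.0, 50.0),
]
cutoff = 25.0  # one of the DIP cutoffs; patients below it are "positive"
pos = [s for s, a, p in cohort if dose_intensity_percent(a, p) < cutoff]
neg = [s for s, a, p in cohort if dose_intensity_percent(a, p) >= cutoff]

auc = auroc_low_score_is_risk(pos, neg)
threshold, j = youden_cutpoint(pos, neg)
```

Repeating the last four lines for each DIP cutoff gives one AUROC per threshold, which is the shape of the comparison reported in the Results.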

#### RESULTS

#### Relation of Gene-Wise Variant Burden and Star Allele-Based Molecular Phenotypes

*NUDT15* and *TPMT* haplotypes of each subject were first inferred from whole exome sequencing (WXS) data by using the PHASE tool, and matched star allele genotypes were assigned for each subject. The star allele genotypes were then translated into three molecular phenotype groups according to their allele combinations; poor (PM, No function|No function), intermediate (IM, Normal|No function or Normal|Decreased), and normal (NM, Normal|Normal) metabolizers. Six and four star alleles were identified for *NUDT15* and *TPMT* genes, respectively, from the 244 ALL patients with their frequencies (**Table 2**). **Table 3** shows the distribution of subsequently predicted enzymatic metabolizer phenotypes for *NUDT15* and *TPMT* among the 244 ALL patients.
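The diplotype-to-phenotype rule stated above amounts to a small lookup table. The Python sketch below is illustrative only: the function name and the lower-case function labels are hypothetical, and any allele-function combination not listed in the text (for example, two decreased-function alleles or an allele of uncertain function) is returned as "NA" rather than guessed.

```python
# Phenotype rules as stated in the text; keys are alphabetically sorted pairs.
PHENOTYPE_RULES = {
    ("no function", "no function"): "PM",  # poor metabolizer
    ("no function", "normal"): "IM",       # intermediate metabolizer
    ("decreased", "normal"): "IM",
    ("normal", "normal"): "NM",            # normal metabolizer
}


def metabolizer_phenotype(function_a, function_b):
    """Map the functions of the two alleles of a diplotype to a phenotype.

    The order of the two alleles does not matter. Combinations outside the
    stated rules return "NA".
    """
    key = tuple(sorted((function_a.strip().lower(), function_b.strip().lower())))
    return PHENOTYPE_RULES.get(key, "NA")
```

For example, `metabolizer_phenotype("Normal", "No function")` returns `"IM"`, whichever allele is listed first.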

While 49 (20.1%) of 244 ALL patients were classified into non-NM (one PM and 48 IMs) phenotype for *NUDT15*, only seven (2.9%) IMs were identified for *TPMT*, reflecting ethnic variation of *NUDT15* and *TPMT* variants, in a consistent manner (**Table 3**). Since individuals with *TPMT* homozygous mutant alleles are rarely observed in East Asian population, none of the patients were classified into the poor metabolizer group. IMs were stratified into two groups: 1) individuals carrying one copy of a normal function allele and one copy of a *decreased function* allele and 2) individuals carrying one copy of a normal function allele and one copy of *no function* allele. Carriers of non-functional allele, compared with carriers of decreased function allele, are considered to be at an increased risk for functional decline.

TABLE 2 | Alleles identified in 244 ALL samples with known allele functions.

*Haplotypes were inferred via PHASE 2. Star alleles were assigned by the PharmGKB haplotype set translational table. ALL, acute lymphoblastic leukemia.*

*Molecular phenotypes were assigned using the PharmGKB haplotype set translational table. Star (\*) allele genotype-to-phenotype correlation was adapted from information available at the Moriyama et al. (NUDT15) and the Clinical Pharmacogenetics Implementation Consortium (CPIC) guideline (TPMT); NA, not available.*

Patients with *NUDT15* normal metabolizing alleles (DIP = 67.608 ± 28.2, *n* = 195) tolerated significantly higher DIPs of 6-MP than did slow metabolizers [5.712 (PM, *n* = 1), 56.452 ± 28.2 (IM, *n* = 48)] (**Figure 1A**). Clinical usefulness of the conventional star allele-based classification was successfully demonstrated for *NUDT15* variants in the present study. Due to the small number of non-NM subjects for *TPMT* in Korean ALL patients, the difference of DIPs between NM (65.702 ± 28.4, *n* = 237) and IM (46.805 ± 35.7, *n* = 7) did not reach statistical significance (*p* = 0.10, **Figure 1B**).

GVB scores among different molecular phenotype groups for *NUDT15* (PM = 0.09, IM = 0.248 ± 0.1, and NM = 0.995 ± 0.1, **Figure 2A**) and for *TPMT* (IM = 0.229 ± 0.3, NM = 1 ± 0.0, **Figure 2B**) showed statistically significant differences. The observed positive correlation between our GVB score and the conventional enzymatic metabolizer phenotypes for both *NUDT15* and *TPMT* variants strongly supported our further analysis. Note that those pharmacogenetic star alleles have long been empirically developed by clinical case–control studies and/or animal and molecular studies. In contrast, the GVB analysis is based on purely theoretical *ab initio* and *in silico* methods without requiring empirical studies that are prohibitively costly considering the numerous drugs and genetic variants discovered by NGS technologies and the interactions. In the following sections, we explore the potential of the GVB scoring method for predicting DIPs as an indicator of 6-MP intolerance in pediatric ALL patients.

FIGURE 1 | Distribution of last-cycle dose intensity percent of 6-mercaptopurine according to star allele-based molecular phenotype groups in ALL. Dose intensity percent distribution across (A) *Nudix Hydrolase 15 (NUDT15)* and (B) *Thiopurine S-Methyltransferase (TPMT)* molecular phenotype groups. Normal metabolizers of *NUDT15* showed significantly higher dose intensity percent than did intermediate (*p* = 0.006) and poor (*p* = 0.090) metabolizers. \**p* < 0.1, \*\**p* < 0.05, and \*\*\**p* < 0.01 by Mann–Whitney *U* test.

(*NUDT15*, *p* = 4.17E−52; *TPMT p* = 5.84E−47) and poor (*NUDT15*, *p* = 1.9E−22) metabolizers. \**p* < 0.1, \*\**p* < 0.05, and \*\*\**p* < 0.01 by Mann–Whitney *U* test.

#### Gene-Wise Variant Burden Scores for Predicting Last-Cycle 6-Mercaptopurine Dose Intensity Percent

Since both *NUDT15* and *TPMT* genes are not highly variable, only seven and two GVB value groups for *NUDT15* and *TPMT*, respectively, were identified in the 244 ALL patients. GVB*NUDT15* demonstrated statistically significant positive correlation with DIP (*p* = 0.016 by Kruskal–Wallis test, *p* = 0.001 (*ρ* = 0.21) by Spearman's rank correlation, *p* = 0.001 (*τ* = 0.17) by Kendall's rank correlation) (**Figure 3A**). Exclusion of the two patients having both *NUDT15* and *TPMT* variants slightly improved statistical significance (**Supplementary Figure S4**). Due to the low frequency of *TPMT* alleles in East Asian population, 97.5% (*n* = 238) of all ALL patients were classified into wild type (GVB*TPMT* = 1.00 ± 0.00) and only six (2.50%) were classified into variant type (GVB*TPMT* = 0.10 ± 0.00) groups, resulting in poor statistical significance (*p* = 0.408 by *T*-test, *p* = 0.272 (*ρ* = 0.07) by Spearman's rank correlation, *p* = 0.271 (*τ* = 0.06) by Kendall's rank correlation) (**Figure 3B**).

#### Performance Comparisons Between Gene-Wise Variant Burden and Star Allele-Based Molecular Phenotypes Across Different Risk Group Decision Thresholds

Using ROC analysis, we evaluated the performances of GVB at nine cutoff levels (i.e., DIP < 5%, 10%, 15%, 25%, 35%, 45%, 60%, 80%, and 100%) for defining the 6-MP high-risk groups. Star allele-based classification was also applied for systematic comparison across different DIP threshold levels. DIP below 25% of the planned dose of 6-MP is a generally accepted threshold for predicting 6-MP intolerance. **Figure 4A** demonstrates that GVB*NUDT15* showed better AUCs at all threshold DIP levels below 25% [0.998 (DIP < 5%), 0.676 (DIP < 10%), 0.669 (DIP < 15%), and 0.653 (DIP < 25%)] than did the conventional star allele-based molecular phenotypes (AUC = 0.618). Moreover, exclusion of the two confounding patients with both *NUDT15* and *TPMT* variant alleles slightly improved performance relative to both the before-exclusion GVB*NUDT15* at all threshold DIP levels below 25% [AUC = 0.998 (DIP < 5%), 0.676 (DIP < 10%), 0.639 (DIP < 15%), and 0.627 (DIP < 25%)] and the star allele-based (AUC = 0.596) analyses (**Figure 4B**). Mainly due to the low frequency of *TPMT* variant alleles in East Asian population, both GVB*TPMT* and star allele-based predictions using *TPMT* seem to show poor AUCs for predicting DIP at all threshold levels (**Figure 4C** and **D**).

More importantly, we performed ROC analysis by aggregating the genetic effects of these two genes, *NUDT15* for East Asian and *TPMT* for European heritages. We computed and evaluated GVB*NUDT15*,*TPMT*, which outperformed GVB*NUDT15* or GVB*TPMT* alone as well as the combined molecular phenotypes of both *NUDT15* and *TPMT* at all DIP threshold levels (**Figure 5**). In summary, at the clinically important DIP level of below or above 25%, the best AUC values for GVB*NUDT15*,*TPMT*, GVB*NUDT15*, GVB*TPMT*, and combined star alleles were 0.677, 0.653, 0.574, and 0.645, respectively. GVB*NUDT15*,*TPMT* not only showed the best performance but also successfully included the two confounding patients with both *NUDT15* and *TPMT* variant alleles. While combining GVB scores of multiple genes is simple and straightforward, this is not the case for star alleles, which offer no uniform method for combining multiple genes.

#### Comparison of Prediction Accuracies Between Gene-Wise Variant Burden and Star Allele-Based Methods

To test the clinical utility of the GVB method for guiding 6-MP dosing and/or for providing a systematic framework for clinical studies of 6-MP intolerance and its genetic determinants of *NUDT15* and *TPMT* for predicting DIP groups, we evaluated the diagnostic characteristics of the conventional star allele-based and GVB scoring methods in a simulated clinical setting. **Table 4A** and **4B** exhibit diagnostic accuracies for star allele-based molecular phenotype groups and gene-wise variant burden score groups, respectively, for 6-MP intolerance among 244 pediatric ALL patients by the last-cycle DIP of 6-MP. Of the 244 ALL patients, 189 (84.4%) exhibited no *NUDT15* or *TPMT* variant and hence were classified as NMs for both genes (**Table 4A**). Of the remaining 55 non-NM patients, nine (16.4%) showed DIP below 25%, while 10 of 189 (5.3%) NM patients showed low DIP values.

Wallis *p*-value = 0.016, Spearman's rank correlation *p*-value = 0.001 (*ρ* = 0.21), and Kendall's rank correlation *p*-value = 0.001 (*τ* = 0.17)]. (B) GVB*TPMT* [Kruskal–Wallis *p*-value = 0.271, Spearman's rank correlation *p*-value = 0.272 (*ρ* = 0.07), and Kendall's rank correlation *p*-value = 0.271 (*τ* = 0.06)].

FIGURE 4 | Comparison of diagnostic accuracies between star allele-based molecular phenotyping and GVB scoring for 6-mercaptopurine intolerance in ALL. Diagnostic accuracies are measured by using AUROC analysis for (A) GVB*NUDT15* excluding two subjects with *TPMT* variants (DeLong's *p*-value = 0.163), (B) GVB*NUDT15* (DeLong's *p*-value = 0.163), (C) GVB*TPMT* excluding seven subjects with *NUDT15* variants (DeLong's *p*-value = 0.5), and (D) GVB*TPMT* (DeLong's *p*-value = 0.841). Numbers in the last parentheses indicate area under the curve (AUC) with 95% confidence intervals. DIP, dose intensity percent; AUC, area under the curve.

in the last parentheses indicate AUC with 95% confidence intervals. DIP, dose intensity percent; AUC, area under the curve.

Although many GVB threshold levels could be chosen, star alleles provide only a small number of categories; we therefore chose the most reliable binning threshold of GVB*NUDT15*,*TPMT* ≤ 0.3, the cut-point that maximizes the Youden index (**Supplementary Figure S5**), for classifying the patients into the below and above 25% DIP groups as shown in **Table 4B**. Coincidentally, Lee et al. (2016) also suggested GVB*Pharmacogenes* ≤ 0.3 as the threshold for predicting pharmaceutical market withdrawals in general. GVB*NUDT15*,*TPMT* correctly classified one more high-risk (DIP ≤ 25%) and four more low-risk (DIP > 25%) patients into the correct-risk groups (**Table 4B**) than did the traditional haplotype-based method (**Table 4A**), with sensitivity improving from 47.36% to 52.63% and specificity from 79.56% to 81.33%, though the difference did not reach statistical significance (*p*-value for sensitivity = 1 and *p*-value for specificity = 0.134, as determined using a McNemar test). PPV and NPV increased from 16.36% to 19.23% and from 94.70% to 95.31%, respectively. Overall, these results suggest that the "computational" GVB*NUDT15*,*TPMT* is an improved, or at least comparable, predictor relative to the "empirical" star allele-based haplotypes for identifying subjects with increased risk of 6-MP intolerance in pediatric ALL patients, as measured by the last-cycle 6-MP DIP.
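The reported sensitivities, specificities, and predictive values can be checked against 2 × 2 counts. The counts in the Python sketch below are not copied from Table 4; they are reconstructed here from figures given in the text (19 patients with DIP below 25%, 55 non-NM and 189 NM patients, and the one extra high-risk plus four extra low-risk patients correctly classified by GVB), so they should be read as an assumption. The function name is illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 diagnostic metrics; "positive" means predicted high risk."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Star allele-based: 55 non-NM patients predicted high-risk, 9 of them with
# DIP < 25%; 189 NM patients predicted low-risk, 10 of them with DIP < 25%.
star = diagnostic_metrics(tp=9, fp=46, fn=10, tn=179)

# GVB <= 0.3: one more true positive and four more true negatives than above.
gvb = diagnostic_metrics(tp=10, fp=42, fn=9, tn=183)

for name, m in (("star allele", star), ("GVB", gvb)):
    print(name, {k: round(100 * v, 2) for k, v in m.items()})
```

Both sets of counts sum to 244 patients, and the resulting percentages agree with the values quoted above to within rounding.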

#### DISCUSSION

An enduring challenge in precision medicine is to predict adequate drug responses for individual patients (Shah and Shah, 2012). Recent discoveries have revealed a few highly functional and clinically relevant novel variants associated with 6-MP intolerance. However, since inferring drug toxicity from a single variant is notoriously unreliable, as shown in **Supplementary Figure S1** for SIFT and **Supplementary Figure S2** for CADD, strategies that aggregate the key effects over a range of genomic regions are much needed. In the present study, we evaluated the utility of the gene-wise deleterious variant burden scoring method as a sequencing-based, simple, reliable, quantitative, and easy-to-compare score for predicting 6-MP intolerance of 244 pediatric ALL patients. In addition to DIP, GVB showed a statistically significant negative correlation with the incidence of grade 4 neutropenia (*p* = 1.43E−04 by Kruskal–Wallis test, *p* = 3.89E−07 (*ρ* = −0.32) by Spearman's rank correlation, and *p* = 8.06E−07 (*τ* = −0.27) by Kendall's rank correlation; **Supplementary Figure S6**). This implies that GVB is a reliable score that can predict hematological toxicity in pediatric ALL patients. When beginning treatment, NGS-based drug intolerance prediction is useful because it offers a practical way to detect patients at high risk of toxicity. For example, patients with low GVB have a high probability of 6-MP toxicity at the initial recommended dose range; thus, clinicians may attempt to reduce the initial target dose of 6-MP. After an initial target dose is determined, close therapeutic drug monitoring could help to avoid potential causes of toxicity, such as clinically relevant drug–drug interactions, reduced drug clearance due to liver and/or renal impairment, and altered drug utilization due to physiological conditions, as a complementary type of practice during the treatment (Ju-Seop Kang, 2009).

TABLE 4 | Comparison of star allele-based haplotyping versus gene-wise variant burden (GVB*NUDT15*,*TPMT*) analyses for 6-mercaptopurine intolerance measured by last-cycle dose intensity percent in ALL. Diagnostic accuracy table of (A) star allele-based haplotypes and dose intensity percent groups and (B) gene-wise variant burden score and dose intensity percent groups.

*PM, poor metabolizer; IM, intermediate metabolizer; NM, normal metabolizer; PPV, positive predictive value; NPV, negative predictive value.*

GVB analysis has several benefits over conventional star allele-based approaches. GVB 1) quantitates gene-wise variant burden with a single score; 2) provides a measure of inter-individual genetic variability for each gene; 3) considers common, rare, and novel genetic variants together; 4) provides an ethnic variability-neutral method for studying pharmacogenomics; 5) provides a systematic and reliable framework for designing further pharmacogenomics studies considering many gene interactions for clinical endpoints; and 6) adopts the contributing effect of novel low-frequency variants with potentially reduced function in predicting individual drug toxicity.

Based upon the very recent CPIC updates on *NUDT15*, three newly enrolled alleles were characterized (Moriyama et al., 2017). Since new haplotype designation is highly dependent on the characteristics of the study population, there will be restrictions in incorporating new or even as-yet-unidentified evidence in predicting future drug intensity. GVB can be used to develop a model to determine optimal doses without requiring a multi-ethnic population study, especially for under-studied subpopulations.

The following limitations are inherent in the present study. To evaluate the validity of GVB, independent replication studies for an expanded gene–drug set with sufficient sample sizes in diverse ethnic groups are required, as no novel variant was identified in the current study. A conventional single variant-based association test of rare variants requires an infeasibly large sample size (Bansal et al., 2010), but approaches that aggregate common, rare, and novel variants jointly will substantially reduce the required effective sample size (Witte, 2012). The robustness of the analysis framework will be further improved as novel prognostic markers for 6-MP DIP are acquired. One limitation in interpreting the score is that all InDels are treated as highly damaging, because SIFT provides scores only for single-nucleotide variants. As there are many *in silico* variant deleteriousness scoring methods based on different principles, a comprehensive evaluation of the different methods is required (**Supplementary Figure S7**). We also performed CADD-based computation of GVB values, which yielded similar results (**Supplementary Figures S8 and S9**). It has been reported that CADD tends to evaluate in-frame InDels as relatively benign (Kircher et al., 2014). However, a recent *in vitro* activity assay of *NUDT15* (Moriyama et al., 2017) showed that in-frame InDel carriers are more likely to be in states with severely diminished response to 6-MP. It is strongly recommended that, for clinical applications, the potential clinical impact of genetic variants on drug sensitivity be further examined to improve estimation accuracy, as *in silico* prediction scores can be incorrect. Producing a custom capture panel for clinically actionable genes could be more cost-effective than an exome-based approach.

One subject who was correctly classified by GVB carried a low-frequency novel deletion and was predicted to belong to the high-risk group by GVB, whereas star allele-based prediction classified this patient into the NM group for both *NUDT15* and *TPMT*. The patient required a lower dose than recommended (DIP = 23.7%), supporting the GVB-based prediction of 6-MP dose-related adverse drug reactions. The patient's variant was heterozygous p.Gly17\_Val18del, which was very recently assigned as NUDT15\*9 with uncertain functionality. The other four who were correctly classified by GVB had p.Arg139His on one allele, which assigned them to the IM (NUDT15 \*1/\*4) group. GVB classified them as relatively safe for drug toxicity, and none of them required a 25% reduction from the starting dose. Additionally, one patient who was classified as high risk by GVB was assigned to IM for both *NUDT15* and *TPMT* and required a severely reduced dose (14%), suggesting that GVB<sup>*NUDT15*,*TPMT*</sup> exhibits benefits in aggregating effects of many moderate genetic determinants into a single quantitative value.

#### ETHICS STATEMENT

The study was approved by the Asan Medical Center (AMC) Review Boards and the Institutional Review Board of Seoul National University Hospital (SNUH). Informed written consents for blood sampling and analyses were obtained from all participants.

# AUTHOR CONTRIBUTIONS

YP and JK designed the model and the framework. HK, JC, HI, and HK collected samples and clinical data. BM and MS carried out the experiment. YP and SY analyzed the data and carried out the implementation. YP performed the calculations. YP and JK wrote the manuscript. JK conceived the study and was in charge of overall direction and planning. All authors read and approved the final manuscript.

#### FUNDING

This research was supported by a grant (16183MFDS541) from the Ministry of Food and Drug Safety in 2019.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2019.00654/full#supplementary-material

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Park, Kim, Choi, Yun, Min, Seo, Im, Kang and Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Identification of Novel Biomarkers for Drug Hypersensitivity After Sequencing of the Promoter Area in 16 Genes of the Vitamin D Pathway and the High-Affinity IgE Receptor

Gemma Amo<sup>1,2</sup>, Manuel Martí<sup>1,2</sup>, Jesús M. García-Menaya<sup>3,4</sup>, Concepción Cordobés<sup>5,6</sup>, José A. Cornejo-García<sup>7,8</sup>, Natalia Blanca-López<sup>9,10</sup>, Gabriela Canto<sup>9,10</sup>, Inmaculada Doña<sup>11,12</sup>, Miguel Blanca<sup>9,10</sup>, María José Torres<sup>11,12</sup>, José A. G. Agúndez<sup>1,2</sup> and Elena García-Martín<sup>1,2</sup>\*

#### Edited by:

Marcelo Rizzatti Luizon, Federal University of Minas Gerais, Brazil

#### Reviewed by:

Ken Batai, University of Arizona, United States Meenal Gupta, The University of Utah, United States

#### \*Correspondence:

Elena García-Martín elenag@unex.es

#### Specialty section:

This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Genetics

Received: 12 November 2018 Accepted: 04 June 2019 Published: 25 June 2019

#### Citation:

Amo G, Martí M, García-Menaya JM, Cordobés C, Cornejo-García JA, Blanca-López N, Canto G, Doña I, Blanca M, Torres MJ, Agúndez JAG and García-Martín E (2019) Identification of Novel Biomarkers for Drug Hypersensitivity After Sequencing of the Promoter Area in 16 Genes of the Vitamin D Pathway and the High-Affinity IgE Receptor. Front. Genet. 10:582. doi: 10.3389/fgene.2019.00582

<sup>1</sup> University Institute of Molecular Pathology Biomarkers, UEx, Cáceres, Spain, <sup>2</sup> ARADyAL Instituto de Salud Carlos III, Cáceres, Spain, <sup>3</sup> Allergy Service, Badajoz University Hospital, Badajoz, Spain, <sup>4</sup> ARADyAL Instituto de Salud Carlos III, Badajoz, Spain, <sup>5</sup> Allergy Service, Mérida Hospital, Badajoz, Spain, <sup>6</sup> ARADyAL Instituto de Salud Carlos III, Cáceres, Spain, <sup>7</sup> Research Laboratory, IBIMA, Regional University Hospital of Málaga, UMA, Málaga, Spain, <sup>8</sup> ARADyAL Instituto de Salud Carlos III, Cáceres, Spain, <sup>9</sup> Allergy Service, Infanta Leonor University Hospital, Madrid, Spain, <sup>10</sup> ARADyAL Instituto de Salud Carlos III, Madrid, Spain, <sup>11</sup> Allergy Unit, IBIMA, Regional University Hospital of Málaga, UMA, Málaga, Spain, <sup>12</sup> ARADyAL Instituto de Salud Carlos III, Málaga, Spain

The prevalence of allergic diseases and drug hypersensitivity reactions (DHRs) has been increasing in recent years. Both allergic diseases and DHRs seem to be related to an interplay between environmental factors and genetic susceptibility. In recent years, a large effort has been made to elucidate the genetic mechanisms involved in these disorders, mostly based on case-control studies and typically focusing on isolated SNPs. These studies provide a limited amount of information, which can now be greatly expanded by the complete coverage that Next Generation Sequencing techniques offer. In this study, we analyzed the promoters of sixteen genes related to the Vitamin D pathway and the high-affinity IgE receptor, including FCER1A, MS4A2, FCER1G, VDR, GC, CYP2R1, CYP27A1, CYP27B1, CYP24A1, RXRA, RXRB, RXRG, IL4, IL4R, IL13, and IL13RA1. The study group was composed of patients with allergic rhinitis plus asthma (AR+A), patients with hypersensitivity to beta-lactams (BLs), patients with hypersensitivity to NSAIDs, including selective hypersensitivity (SH) and cross-reactivity (CR), and healthy controls without antecedents of atopy or adverse drug reactions. We identified 148 gene variations, 43 of which were novel. Multinomial analyses revealed that three SNPs corresponding to the genes FCER1G (rs36233990 and rs2070901) and GC (rs3733359) displayed significant associations and, therefore, were selected for a combined dataset study in a cohort of 2,476 individuals. The strongest association was found with the promoter FCER1G rs36233990 SNP that alters a transcription factor binding site. This SNP was over-represented among AR+A patients and among patients with IgE-mediated diseases, as compared with control individuals or with the rest of patients in this study. Classification models based on the above-mentioned SNPs were able to predict correct clinical group allocations in patients with DHRs, and patients with IgE-mediated DHRs. Our findings reveal gene promoter SNPs that are significant predictors of drug hypersensitivity, thus reinforcing the hypothesis of a genetic predisposition for these diseases.

Keywords: Next-Generation Sequencing (NGS), vitamin D, high-affinity IgE receptor (FCεRI), NSAIDs (non-steroidal anti-inflammatory drugs), beta-lactam antibiotic, drug hypersensitivity reactions, allergic rhinitis, asthma

# INTRODUCTION

The prevalence of atopy, allergic diseases, and drug hypersensitivity reactions (DHRs) is increasing worldwide. In Europe, studies have estimated a prevalence of 20–25% for allergic diseases in adults, with many young people being unaware of their disease (Linneberg, 2011; Kruse and Vanijcharoenkarn, 2018), which has an important economic impact on healthcare (European Commission, 2008; Bouvy et al., 2015), ranging from €55 to €151 billion per year in the European Union, including indirect costs related to absence or reduced productivity at work (Zuberbier et al., 2014; Kruse and Vanijcharoenkarn, 2018). Due to their complexity, it is difficult to understand the specific mechanisms and molecules involved in the development of these diseases or to establish a way to prevent or reduce them. Allergic rhinitis (AR) reduces the quality of life by affecting sleep, school, work productivity, and social life. AR is an immunoglobulin E (IgE) mediated inflammatory disease, which is associated with other inflammatory diseases such as asthma. It has been estimated that around 20% of people in the USA and Europe suffer from allergic rhinitis (Durham et al., 2012; Ozdoganoglu and Songu, 2012; Rondon et al., 2017). Taking this into consideration, AR has been classified as a major chronic respiratory disease (Brozek et al., 2017). DHRs account for ∼3 to 6% of all hospital admissions. These reactions occur in 10 to 15% of hospitalized patients (Gomes and Demoly, 2005; Szczeklik and Nizankowska-Mogilnicka, 2009; Doña et al., 2014). Beta-lactam antibiotics (BLs) are the most common cause of DHRs mediated by specific immunological mechanisms (Antúnez et al., 2006; Doña et al., 2012, 2014) and, although the mechanisms by which the immune system recognizes these drugs are not fully determined, BLs are considered the classical model of this type of reaction (Blanca et al., 2009).
Together with BLs, non-steroidal anti-inflammatory drugs (NSAIDs) account for the vast majority of DHRs (Gomes et al., 2004; Messaad et al., 2004; Chen et al., 2012; Doña et al., 2012), but, in this case, these DHRs are mediated not only by specific immunological mechanisms (selective hypersensitivity), involving a response to a single drug and good tolerance to other chemically unrelated NSAIDs (Canto et al., 2009; Cornejo-Garcia et al., 2009), but also by nonspecific immunological mechanisms (cross-reactions), which can be caused by more than one chemically unrelated NSAID (Agúndez et al., 2012; Kowalski et al., 2013).

Recent investigation proposes the vitamin D pathway among putative factors linked to allergic diseases, because of its important role in the immune system (Veldman et al., 2000; Cantorna et al., 2015) and its direct relation with allergic diseases (Black and Scragg, 2005; Camargo et al., 2007; Benson et al., 2012; Suaini et al., 2015). There are many molecules involved in the vitamin D pathway: hydroxylases from the CYP450 family, such as CYP27A1, CYP27B1, CYP2R1, and CYP24A1; the vitamin D binding protein (GC), which acts as a transporter; the vitamin D receptor (VDR); the retinoid X receptor (RXR); and interleukins that participate in the downstream pathway (IL4 and IL13). In addition, there are other target molecules and signaling pathways that could be involved in allergic mechanisms, such as the high-affinity IgE receptor (FCεRI), which plays a key role in allergic reactions. This receptor is stimulated by IgE, triggering mast cell and basophil activation and the consequent release of inflammatory mediators. In human mast cells and basophils, FCεRI consists of a heterotetramer composed of three subunits: FCεRIα, the ligand-binding subunit, which is encoded by the FCER1A gene; FCεRIβ, a signal-augmenting subunit encoded by MS4A2; and FCεRIγ, a signal-transducing subunit that is present as a dimer and is encoded by FCER1G (Kinet, 1999; Potaczek and Kabesch, 2012). Elevated levels of IgE have been detected in atopic conditions such as allergic rhinitis, asthma, atopic dermatitis, and anaphylaxis (Platts-Mills, 2001; Wallace et al., 2008), thus making FCεRI a plausible target molecule in the study of the mechanisms involved in the development and in the clinical presentation of allergy.

It could be hypothesized that variations related to expression and/or function in genes of the vitamin D signaling pathways or FCεRI might modify the risk of developing rhinitis or DHRs, and/or the presentation of clinical manifestations of these reactions. As a matter of fact, several studies demonstrated an association between different allergic diseases, including DHRs, and polymorphisms in these genes (Poon et al., 2004; Raby et al., 2004; Donfack et al., 2005; Bossé et al., 2009; Saadi et al., 2009; Pillai et al., 2011; Micheal et al., 2013; Berenguer et al., 2014; Amo et al., 2016a; Narozna et al., 2016). Several studies addressed the putative impact of exonic and intronic SNPs within the abovementioned genes and the risk of allergic diseases and/or DHR (Wjst, 2005; Wjst et al., 2006; Battle et al., 2007; Arshad et al., 2008; Sadeghnejad et al., 2008; Weidinger et al., 2008; Black et al., 2009; Bossé et al., 2009; Ferreira et al., 2009; Knutsen et al., 2010; Li et al., 2010, 2012, 2014, 2016; Michel et al., 2010; Moffatt et al., 2010; Cooper et al., 2011; Joubert et al., 2011; Liu et al., 2011; Lu et al., 2011; Park et al., 2011; Paternoster et al., 2011; Pillai et al., 2011; Burkhardt et al., 2012; Choi et al., 2012; Granada et al., 2012; Lasky-Su et al., 2012; Ramasamy et al., 2012; Robinson et al., 2012; Zhou et al., 2012; Anderson et al., 2013; Hur et al., 2013; Ismail et al., 2013; Movahedi et al., 2013; Potaczek et al., 2013; Sharma et al., 2014; Yang et al., 2014; Kumar et al., 2015; Papadopoulou et al., 2015; Pino-Yanes et al., 2015; Tian et al., 2015; Amo et al., 2016a,b; Han et al., 2016; Karaca et al., 2016; Narozna et al., 2016; Overton et al., 2016; Ådjers et al., 2017; Ashley et al., 2017; Park and Tantisira, 2017; Sun et al., 2017; Xu et al., 2017; Zhang et al., 2017; Zhao et al., 2017). However, there is little information about SNPs located in the promoters of these genes, which might have functional consequences.

In an attempt to identify genetic susceptibility factors associated with allergy and/or DHRs that may provide novel information for a better understanding of these pathologies, we carried out an exhaustive analysis of genetic variations situated in the promoter region of the mentioned genes by using Next Generation Sequencing (NGS) in patients with allergic rhinitis plus asthma (AR+A), BLs hypersensitivity, selective NSAIDs hypersensitivity (SH) and cross-reactions to NSAIDs (CR), as well as in healthy control individuals. The genes included in the study were FCER1A, MS4A2, FCER1G, VDR, GC, CYP2R1, CYP27A1, CYP27B1, CYP24A1, RXRA, RXRB, RXRG, IL4, IL4R, IL13, and IL13RA1. In addition, we also analyzed the interaction of genetic and non-genetic factors, such as age, gender, and antecedents of atopy, in the risk of developing these diseases.

# PATIENTS AND METHODS

#### Study Population

A total cohort of 2,476 individuals participated in this study. All were Caucasian Spanish individuals. These included 406 healthy controls without antecedents of atopy or adverse drug reactions, 528 patients with AR+A, 561 individuals with BLs hypersensitivity, 668 patients with NSAIDs cross-reactivity (CR), and 313 selective hypersensitivity patients (SH) which were single-NSAIDs responders. Written consent for participation was obtained for all participants. Patients were recruited at Hospitals participating in the study. All the patients who were invited to participate in the study agreed to do so. Control individuals were selected among students and staff in the University and Hospitals participating in the study. Characteristics of the study groups are summarized in **Table 1**. The diagnosis was carried out as described elsewhere (García-Martín et al., 2007; Doña et al., 2011; Amo et al., 2016a; Lacombe-Barrios, 2018). The protocol for this study was in accordance with the Declaration of Helsinki and its subsequent revisions and was approved by the respective Ethics Committees of the participating Hospitals.

To analyze the sample further, we pooled groups of patients that share a specific characteristic. Thus, we defined three new study groups: "DHR group," where we included all the patients with DHR, namely, patients with hypersensitivity to BLs and NSAIDs (both CR and SH); "DHR-IgE Group," which comprises patients with hypersensitivity to BLs and SH patients; and "IgE Mediated Group," where we included all the IgE-mediated reactions (AR+A, BLs and SH).
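The pooled groups defined above can be written down as a small lookup table. The following is a minimal sketch, not the authors' code: the dictionary names and group labels are illustrative, the subgroup sizes are those reported in the Study Population paragraph, and the pooled totals are plain sums computed here rather than figures stated in the article.

```python
# Subgroup sizes as reported in the Study Population paragraph.
COHORT = {"Control": 406, "AR+A": 528, "BLs": 561, "CR": 668, "SH": 313}

# Pooled groups as defined in the text; the keys are illustrative labels.
POOLED_GROUPS = {
    "DHR": ("BLs", "CR", "SH"),
    "DHR-IgE": ("BLs", "SH"),
    "IgE-mediated": ("AR+A", "BLs", "SH"),
}


def pooled_sizes(cohort, groups):
    """Return the number of individuals in each pooled group."""
    return {name: sum(cohort[member] for member in members)
            for name, members in groups.items()}


sizes = pooled_sizes(COHORT, POOLED_GROUPS)
total = sum(COHORT.values())
print(total, sizes)
```

Each pooled group is compared with the same 406 controls, so a patient can contribute to more than one comparison.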

#### Identification of Novel Variants Using NGS

A subset of participants was selected for this phase. A total cohort of 175 individuals participated in this NGS analysis. These included 22 healthy controls without antecedents of atopy or adverse drug reactions, 22 patients with AR+A, 43 individuals with BLs hypersensitivity, 41 patients with NSAIDs cross-reactivity (CR), and 46 selective hypersensitivity patients (SH) who were single-NSAID responders. Characteristics of the participants are summarized in **Table S1**. Genomic DNA was obtained from leukocytes and purified according to standard procedures. DNA samples were analyzed by NGS after specific enrichment based on the Haloplex design. Details of the areas sequenced are shown in **Table S2**. DNA was digested with restriction enzymes specific for this design (Haloplex, Agilent, Santa Clara, CA, USA), followed by hybridization with specific probes, DNA circularization and selection of the target areas, according to the protocol supplied. Sequencing was carried out in a MiSeq sequencer (Illumina, San Diego, CA, USA) using the paired-end format. The coverage was always higher than that recommended by the manufacturer (23.7 Mb per sample). All variants identified had at least 50× coverage and more than 95% of these had more than 100× coverage. The sequencing results were analyzed by using the application SureCall 4.0 (Agilent, Santa Clara, CA, USA), adapted to the analysis of enriched Haloplex sequences, and MiSeq Reporter V04 (Illumina, San Diego, CA). Sequence revision against the human genome was carried out by using the Integrative Genomics Viewer (Broad Institute, Cambridge, MA, USA).

#### Combined Dataset Analyses

All patients and controls participated in this phase. Analyses were carried out by using TaqMan genotyping focused on the SNVs that emerged from the multiple comparison analyses of the NGS phase (see the Results section for further details). The SNPs were analyzed in triplicate by using SNP TaqMan assays (Life Technologies S.A., Alcobendas, Madrid, Spain), following the conditions specified by the manufacturer. Assay details are as follows: FCER1G-rs36233990, Custom TaqMan® Assay; FCER1G-rs2070901 (C\_\_15867981\_20); and GC-rs3733359 (C\_\_25652813\_40).

#### Statistical Analysis

The R package SNPassoc (Gonzalez et al., 2014) was used to calculate allele and genotypic frequencies, to determine the Hardy-Weinberg equilibrium using the exact test (Wigginton, 2005) and to analyse differences between groups (González et al., 2007). The comparison between groups was performed with Fisher's Exact Test (FET) and the Likelihood Ratio Test (LRT), with an initial crude analysis followed by an adjusted analysis including gender as a categorical covariate when possible. False Discovery Rate (FDR) correction was used for the multiple comparison adjustments (Benjamini et al., 2001). The results were considered statistically significant when P-values were under 0.05. The association between SNPs and traits was estimated by odds ratio (OR) with a 95% confidence interval (CI) or by Relative Risk (RR) when the variation was not found in the control group. The Relative Risk was calculated by using EpiBasic, a tool for statistical analysis of tabular information, performing a stratified analysis with the inverse variance (1/SE<sup>2</sup>) as weight. This tool was developed as a companion to a Danish textbook on epidemiology (Juul, 2012), and the spreadsheet can be downloaded from the following link (Juul and Frydenberg, 2016): http://ph.medarbejdere.au.dk/undervisning-og-uddannelse/software/.

TABLE 1 | Characteristics of the participants.
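Two of the quantities named in the statistical analysis, the odds ratio with its 95% confidence interval and FDR-adjusted P-values, can be illustrated with a short standard-library sketch. This is not the SNPassoc or EpiBasic implementation: the function names are hypothetical, the interval is the usual Wald (Woolf) approximation, and the Benjamini-Hochberg step-up procedure is assumed for the FDR adjustment because the text does not state which variant was applied.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a, b: carriers / non-carriers among cases
    c, d: carriers / non-carriers among controls
    """
    if min(a, b, c, d) == 0:
        # With an empty cell (e.g., no carriers among controls) the odds
        # ratio is undefined, which is why the text falls back to the RR.
        raise ValueError("zero cell: odds ratio undefined")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts and P-values, for illustration only.
print(odds_ratio_ci(20, 80, 10, 90))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The explicit error on an empty cell mirrors the situation described in the text for variants absent from the control group.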

Association between each SNP and each clinical phenotype was assessed by using binary logistic regression. Then, predictive models based on Multinomial Logistic Regression (MLR) (Agresti, 2003) were built for SNPs showing association in the binary regression analyses by using SPSS (IBM SPSS Statistics for Windows, Version 22.0). The p-values associated with every SNP were calculated using the Chi-Square test. Each model has several associated pseudo-R<sup>2</sup> coefficients as indicators of the strength of the association between the response and the predictor variables. Cox and Snell is based on the log-likelihood for the model compared to the log-likelihood for a baseline model, and it has a theoretical maximum value of <1, even for a "perfect" model (Cox and Snell, 1989); Nagelkerke is an adjusted version of the Cox and Snell R-square that rescales the statistic to cover the full range from 0 to 1 (Nagelkerke, 1991). McFadden is another version, based on the log-likelihood kernels for the intercept-only model and the full estimated model. This is the most frequently used pseudo-R<sup>2</sup> coefficient; model fit is considered good when the values lie between 0.2 and 0.4, and better above 0.4 (McFadden, 1974, 1977). The first model includes all the groups separately, that is, AR+A, BLs, CR and SH. Model 2 considered two groups of patients: AR+A and a group combining all DHR patients. Model 3 considered three groups of patients: AR+A, patients with IgE-mediated DHR, and patients with DHR not related to IgE (that is, CR patients). For all models the control group was always the reference group. Coefficients were calculated by dropping samples with missing data in explanatory variables, which were selected using a stepwise regression method.
The statistical power was calculated from variant allele frequencies with a genetic model analysing the frequency for carriers of the disease gene with a RR value = 2 (p = 0.05) for the genetic associations identified in the combined dataset model as described elsewhere (Pértegas Díaz and Pita Fernández, 2003). These values are shown in **Table S3**. The functional impact of the gene variants was analyzed by using TRANSFAC (Matys et al., 2003, 2006).
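The three pseudo-R<sup>2</sup> coefficients described above can all be computed from the log-likelihoods of the intercept-only and fitted models. The sketch below uses the standard textbook definitions and a hypothetical function name; it is not taken from SPSS, and the log-likelihood values in the example are invented for illustration.

```python
import math


def pseudo_r2(ll_null, ll_model, n):
    """Pseudo-R2 coefficients from model log-likelihoods.

    ll_null:  log-likelihood of the intercept-only model (negative)
    ll_model: log-likelihood of the fitted model
    n:        number of observations
    """
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Upper bound of Cox and Snell, reached when ll_model == 0.
    cox_snell_max = 1.0 - math.exp(2.0 * ll_null / n)
    nagelkerke = cox_snell / cox_snell_max
    return {"McFadden": mcfadden,
            "CoxSnell": cox_snell,
            "Nagelkerke": nagelkerke}


# Invented log-likelihoods, for illustration only.
print(pseudo_r2(ll_null=-100.0, ll_model=-80.0, n=200))
```

Dividing by the Cox and Snell upper bound is what lets the Nagelkerke coefficient reach 1 for a model that fits perfectly.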

#### RESULTS

#### Identification of Novel Variants Using NGS

In this phase we identified 148 variations situated in the promoter regions of genes related to vitamin D and FCεRI. The information about the variations, their frequency in the whole sample and the Hardy-Weinberg equilibrium values is summarized in **Table 2**. It is to be noted that 84 of the 148 SNPs identified in this study (56.7%) were found in cases only and not in control individuals.

Among the 148 gene variations identified, 43 were novel. Of the 105 previously described SNPs, 25 have not been described or studied earlier in European individuals, although they show marginal MAF in our study (only three SNPs show MAF above 0.010). Regarding known SNPs, the frequencies are concordant with those previously described in the 1000 Genomes public database (http://grch37.ensembl.org/index.html) for individuals of European descent for all the variations except for the rs4020369 SNP in the GC gene, for which the described frequency for Europeans is 0, whereas in our population it shows a MAF close to 0.040, in agreement with the global frequency described in 1000 Genomes for all individuals.

One hundred and three out of the 148 variants identified were in Hardy-Weinberg equilibrium (HWE) in the overall study population (see **Table S4**), which is to be expected given the high number of SNPs analyzed and the limited sample size in the NGS analyses. Within the 45 variants that were not in HWE, only 7 showed a MAF>0.050 in agreement with frequencies described in literature. We carried out binary logistic regression analyses excluding those variants with MAF<0.02 (see **Table 2**). Results of the regression analyses are shown in **Table S5**. Among the 44 variants we selected those with adjusted P ≤ 0.10 for multinomial analyses. Therefore, 25 variants were included in the multinomial analysis as well as gender, antecedents of atopy and clinical group (Allergic rhinitis + Asthma; BLs, CR and SH). It is to be noted that some SNPs with a high significance after logistic binary regression analyses (see **Table S5**), such as rs1467664 (RXRG), rs3733359 (GC), rs2070874 (IL4), rs4303288, and rs4307775 (VDR) and rs2259735 (CYP24A1), were not significant after multinomial analysis. The statistically significant variables that emerged from this analysis were three SNPs (FCER1G rs36233990, FCER1G rs2070901, and GC rs3733359), as well as antecedents of atopy.
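To make the Hardy-Weinberg check concrete, the sketch below implements a basic exact test for a biallelic variant: it enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities of outcomes no more likely than the observed one. It follows the general idea of the exact test cited in the methods but is not the SNPassoc implementation; the function names and the genotype counts in the example are hypothetical.

```python
import math


def _log_prob(n_het, n_minor, n):
    """Log-probability of n_het heterozygotes given the minor-allele
    count n_minor in n individuals, under Hardy-Weinberg equilibrium."""
    n_hom_minor = (n_minor - n_het) // 2
    n_hom_major = n - n_het - n_hom_minor
    n_major = 2 * n - n_minor
    return (math.lgamma(n + 1)
            - math.lgamma(n_hom_minor + 1)
            - math.lgamma(n_het + 1)
            - math.lgamma(n_hom_major + 1)
            + n_het * math.log(2.0)
            + math.lgamma(n_minor + 1)
            + math.lgamma(n_major + 1)
            - math.lgamma(2 * n + 1))


def hwe_exact_pvalue(n_hom_a, n_het, n_hom_b):
    """Two-sided exact HWE P-value from genotype counts."""
    n = n_hom_a + n_het + n_hom_b
    n_minor = min(2 * n_hom_a + n_het, 2 * n_hom_b + n_het)
    observed = math.exp(_log_prob(n_het, n_minor, n))
    pvalue = 0.0
    # Heterozygote counts must share the parity of the minor-allele count.
    for het in range(n_minor % 2, n_minor + 1, 2):
        prob = math.exp(_log_prob(het, n_minor, n))
        if prob <= observed * (1.0 + 1e-9):
            pvalue += prob
    return min(1.0, pvalue)


# Hypothetical genotype counts, for illustration only.
print(hwe_exact_pvalue(25, 50, 25))   # close to equilibrium
print(hwe_exact_pvalue(50, 0, 50))    # no heterozygotes at all
```

With small samples and rare alleles only a few heterozygote counts are possible, which is one reason exact tests are preferred over the chi-square approximation in this setting.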

#### Combined Dataset Analyses

The three SNPs mentioned above were analyzed in the whole study group. The FCER1G rs36233990 SNP was monomorphic in the control group, whereas heterozygous subjects were identified in all subgroups of patients and homozygous individuals were identified in the AR+A and CR groups. Statistically significant differences were identified for this SNP in all subgroups of patients (**Table 3**), although after FDR correction differences remained significant only for the AR+A and BLs groups. The FCER1G rs2070901 SNP had a marginal trend toward higher frequency of the variant allele among CR patients, which was not statistically significant after FDR analysis. These two FCER1G SNPs are not in linkage disequilibrium (D′ = 0.3415; r = 0.0504). The minor allele frequency for the GC rs3733359 SNP was lower in the BLs group as compared to that of healthy individuals. This comparison remained statistically significant after FDR analysis.

TABLE 2 | SNPs with MAF ≥ 0.02 observed in the NGS study.

Comparison between groups of patients and controls. (1) Relative Risk was calculated when the reference group (control) shows only non-mutated frequency. (2) OR and P-value were adjusted for gender. (\*) OR adjusted by gender cannot be calculated due to lack of cases.
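For readers unfamiliar with the linkage disequilibrium measures reported for the two FCER1G SNPs, the following sketch shows how D, D′ and the squared correlation r<sup>2</sup> are commonly derived from allele and haplotype frequencies. It is illustrative only: the function name and input frequencies are hypothetical, haplotype frequencies are assumed to be already known (in practice they are estimated from unphased genotypes, which is not shown), and the text does not specify whether its "r value" denotes r or r<sup>2</sup>.

```python
def linkage_disequilibrium(p_a, p_b, p_ab):
    """Return (D, D', r^2) for two biallelic loci.

    p_a, p_b: frequencies of allele A at locus 1 and allele B at locus 2
    p_ab:     frequency of the A-B haplotype
    """
    d = p_ab - p_a * p_b
    # D' scales |D| by the largest value it could take given the
    # allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical frequencies: a rare allele found only on haplotypes
# carrying a more common allele gives a high D' but a modest r^2.
print(linkage_disequilibrium(0.1, 0.3, 0.1))
```

Because D′ is normalized by its maximum attainable value whereas r<sup>2</sup> also depends on how well the allele frequencies match, D′ is never smaller than r<sup>2</sup> for a given pair of loci.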

The patients were grouped according to the underlying mechanism of the reaction. The first group was composed of all patients with drug hypersensitivity (DHR group), that is, (BLs + CR + SH). The second group of patients was composed of all patients with IgE-mediated drug hypersensitivity (D-IgE) (BLs + SH). The third group of patients was composed of all patients with IgE-mediated reactions (BLs + SH + AR+A), which were compared vs. healthy controls (**Table 4**). The two FCER1G SNPs displayed statistically significant differences in DHR and IgE-mediated reactions as compared to control individuals, although the only difference that remained significant after FDR correction was that of the SNP FCER1G rs2070901 in patients with IgE-mediated diseases (**Table 4**).

#### Classification Models

We built models including the three SNPs with significant associations in the combined dataset analyses phase, as well as the antecedents of atopy and gender. The pseudo R-square values for each model (those which provide the best classification of each patient in its correct group) are shown in **Table 5**. All the models selected FCER1G rs36233990 as a good variable. The variable "Antecedents of atopy" was also selected, although this was expected because none of the control individuals had antecedents of atopy. Model 1 was made by including the three SNPs with a significant P-value and all the clinical groups separately, as compared to control subjects. Model 2 included all patients with DHR compared to control individuals, and model 3 included all patients with IgE-mediated diseases vs. control individuals. The classifications per group are shown in **Table 6**. The high percentage of correct allocations using the three SNPs only (that is, without considering antecedents of atopy and gender) is noteworthy. For comparison, we show in **Table 6** the results of the same models including covariates such as antecedents of atopy, age and gender. The fact that antecedents of atopy predicted 100% of AR+A patients has little value because control individuals had no antecedents. Age and gender were not good predictors either (**Table 6**). Therefore, these covariates did not improve the predictive capacity of the models based on the SNPs.

#### DISCUSSION

Genetic variation is a major cause of interindividual differences in the susceptibility to a number of disorders. In this regard, a huge number of genetic association studies related to allergic disorders and drug hypersensitivity events have been carried out. Most of these studies have a case-control design interrogating only a few polymorphisms, typically, a few SNPs located within the coding region. The use of NGS techniques allows for a complete coverage of large areas thus revealing novel SNPs or analysing SNPs that are not included in most studies. In a previous NGS study in the promoter area of the genes encoding the COX-1 and COX-2 enzymes, we identified several novel SNPs. More than 70 SNPs modified transcription factor binding sites, either by disrupting existing sequences or by creating new binding sites (Agundez et al., 2014).

\*The overall effectiveness of the model was assessed by using the Chi-square statistic.

TABLE 6 | Prediction models.

The present study aimed to analyse the promoter areas of 16 genes related to allergic diseases and drug hypersensitivity reactions (DHRs). We have focused on the promoter gene region due to its crucial role in transcriptional activity and expression of the gene, as has been observed for the SNPs located in the promoter of FCER1A (Potaczek et al., 2009), or IL13 (Cameron et al., 2006; Kiesler et al., 2009; Li et al., 2014). The rationale for the selection of the 16 genes included in this study is based on putative mechanisms involved in this type of reactions. FCεRI plays an essential role in IgE-mediated mechanisms and variants in FCεRI genes have been previously described as genetic factors related to asthma (Cui et al., 2003; Kim et al., 2006; Palikhe et al., 2008; Joubert et al., 2011; Ramphul et al., 2014; Yang et al., 2014), allergy (Hasegawa et al., 2003; de Guia et al., 2015; Liao et al., 2015; Amo et al., 2016a,b), and food sensitization (Liu et al., 2011; Hong and Wang, 2012). It has also been described that some variants in genes involved in the vitamin D pathway are related to asthma (Poon et al., 2004; Raby et al., 2004; Wjst, 2005; Wjst et al., 2006; Bossé et al., 2009; Saadi et al., 2009; Li et al., 2011; Pillai et al., 2011; Maalmi et al., 2013; Leung et al., 2015; Hutchinson et al., 2017), especially variations in genes regulated by vitamin D, such as IL4 and its receptor (Burchard et al., 1999; Donfack et al., 2005; Ober and Hoffjan, 2006; Battle et al., 2007; Michel et al., 2010; Baye et al., 2011; Hesselmar et al., 2012; Liu et al., 2012; Micheal et al., 2013; Nie et al., 2013; Zhu et al., 2013; Al-Muhsen et al., 2014; Berenguer et al., 2014; Klaassen et al., 2015; Zhang et al., 2015; Narozna et al., 2016) and IL13 (Black et al., 2009; Bottema et al., 2010; Palikhe et al., 2010; Cui et al., 2012; Accordini et al., 2016; Xu et al., 2017), which are also related to IgE (Marsh et al., 1994; Kabesch et al., 2006).
According to previous research, the mechanisms involved in cross-reactions and selective ones are different, (Doña et al., 2012; Ayuso et al., 2013; Torres et al., 2014; Nissen et al., 2015; Amo et al., 2016a) and previously published results show that some variations, either related to FCERI or to vitamin D, are strongly associated with IgE-mediated pathologies, like rs12135235 in FCER1A, rs144205117 in CYP2R1, rs1467664 in RXRG or rs4303288 in VDR. Association between the rs2070874 in IL4 and atopy and hypersensitivity, has been described in previously published works (Burchard et al., 1999; Donfack et al., 2005; Kabesch et al., 2006; Ober and Hoffjan, 2006; Kim et al., 2010; Madore and Laprise, 2010; Baye et al., 2011; Lu et al., 2011; Liu et al., 2012; Andiappan et al., 2013; Hsu et al., 2013; Micheal et al., 2013; Movahedi et al., 2013; Zhu et al., 2013; Berenguer et al., 2014; Caniatti et al., 2014; Li et al., 2014; de Guia et al., 2015; Klaassen et al., 2015; Zhang et al., 2015; Hua et al., 2016; Narozna et al., 2016).

Although binary logistic regression analyses pointed to six SNPs corresponding to RXRG, GC, IL4, VDR, and CYP24A1 (see the Results section), statistical significance for these SNPs was not supported after multinomial analysis, except for the GC SNP. In turn, two additional FCER1G SNPs, as well as the GC SNP, were statistically significant after multinomial analyses. It is important to note that the major findings obtained in the present study are novel, since only one of the three SNPs that remained after the multinomial analysis has been previously related to atopy or drug hypersensitivity. Among these, one of the FCER1G SNPs is novel and hence has not been studied before, and the other one has been related to food sensitization (Liu et al., 2011) and has been previously studied in patients with selective hypersensitivity to NSAIDs and allergic rhinitis without significant association (Amo et al., 2016a,b). After the NGS and combined dataset analysis phases, prediction models revealed that one of these SNPs, FCER1G rs36233990, was correct in all models and allowed an excellent prediction for patients with DHR, IgE-mediated DHR, and all IgE-mediated diseases analyzed. It should be taken into consideration that the significant p-values observed for rs36233990 in case-control association analyses might be inflated because this SNP was not observed in controls. However, this is a commonly observed SNP in European populations, which underscores the need for large control sets. This is a limitation of this study. The rs36233990 variation is located in a regulatory region where multiple transcription factor binding sites exist. The variant allele T triggers the appearance of binding sites for the E2F-3:Prrxl1 complex and GKLF (KLF4). On the other hand, the variant allele T leads to the disappearance of a binding site for the transcription factor p300.
Our own previous findings supported a role of FCER gene variations in patients with AR+A, but not in patients with SH (Park et al., 2011; Amo et al., 2016a), which is consistent with the findings of this study. The minor allele of rs2070901 in FCER1G triggers the disappearance of a transcription factor binding site for ELK-1:OC-2. The GC variation designated as rs3733359 is located in a splice region for transcripts 2 and 3 of GC, and in the 5' untranslated region for transcripts 1 and X1. This variant has been previously related to immune and other disorders (Jung et al., 2011; Wang et al., 2015; Xie et al., 2018). Our findings regarding the GC polymorphism support the hypothesis of a relevant role of vitamin D in allergy (Hall and Agrawal, 2017; Tian and Cheng, 2017).
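
The caveat raised above for rs36233990, namely that a variant absent from controls can inflate case-control statistics, can be illustrated with a small sketch. It is not the analysis performed in this study: it assumes a plain 2×2 carrier table and applies the Haldane-Anscombe correction (adding 0.5 to every cell) with a Wald-type 95% confidence interval. The function name and the counts are hypothetical and chosen only for illustration.

```python
import math


def haldane_odds_ratio(case_carriers, case_noncarriers,
                       ctrl_carriers, ctrl_noncarriers):
    """Odds ratio with Haldane-Anscombe correction and Wald 95% CI.

    Adding 0.5 to each cell keeps the estimate finite when one cell
    (for example, carriers among controls) is zero.
    """
    a, b, c, d = (x + 0.5 for x in (case_carriers, case_noncarriers,
                                    ctrl_carriers, ctrl_noncarriers))
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the corrected cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    ci_low = math.exp(log_or - 1.96 * se_log_or)
    ci_high = math.exp(log_or + 1.96 * se_log_or)
    return odds_ratio, ci_low, ci_high


# Hypothetical counts (NOT data from this study): a rare variant seen in
# 6 of 300 patients and in 0 of 100 controls.
odds_ratio, ci_low, ci_high = haldane_odds_ratio(6, 294, 0, 100)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

With these made-up counts the corrected odds ratio is above 1, but its confidence interval is very wide and includes 1, which is the practical reason why larger control sets are needed before a variant that is unobserved in controls can be interpreted.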

In summary, our findings show that the analysis of gene promoters is useful for the identification of genetic biomarkers of risk for DHRs and AR+A. Models using these gene variations allow a high degree of prediction, that is, correct group allocations (**Table 6**), based on these SNPs only. It should be kept in mind that the variant allele frequencies for these SNPs are relatively low, that is, the frequency of carriers of the risk variants is relatively low, especially for the most significant SNP, FCER1G rs36233990 (<2% of patients). The frequencies of carriers for the other two SNPs are 47 and 10.5% for FCER1G rs2070901 and GC rs3733359, respectively. Therefore, the presence of these gene variations cannot explain, by itself, the development of most cases of DHR. However, the SNPs identified in this study point to mechanisms involved in DHR and add novel information that can be used as a proof of mechanism.
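
Because the paragraph above quotes carrier frequencies rather than allele frequencies, the relation between the two can be sketched as follows. This is only an illustration under two assumptions that are not stated in the text: a carrier is anyone with at least one variant allele, and genotypes follow Hardy-Weinberg proportions, so that a carrier frequency c corresponds to a variant allele frequency q = 1 − √(1 − c). The function name is hypothetical.

```python
import math


def allele_freq_from_carrier_freq(carrier_freq):
    """Variant allele frequency implied by a carrier frequency.

    Assumes Hardy-Weinberg proportions and that carriers have one or two
    variant alleles: carrier_freq = 1 - (1 - q)**2, hence
    q = 1 - sqrt(1 - carrier_freq).
    """
    if not 0.0 <= carrier_freq <= 1.0:
        raise ValueError("carrier_freq must lie between 0 and 1")
    return 1.0 - math.sqrt(1.0 - carrier_freq)


# Carrier frequencies quoted in the text above.
carrier_freqs = {
    "FCER1G rs2070901": 0.47,
    "GC rs3733359": 0.105,
}

for snp, carrier_freq in carrier_freqs.items():
    q = allele_freq_from_carrier_freq(carrier_freq)
    print(f"{snp}: carriers {carrier_freq:.1%}, implied allele frequency {q:.3f}")
```

Under these assumptions, the implied allele frequencies are roughly half the carrier frequencies when the variant is rare and somewhat more than half when it is common. In a patient sample the Hardy-Weinberg assumption may not hold, so the values are approximate.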

#### AUTHOR CONTRIBUTIONS

EG-M and JA contributed conception and design of the study. JG-M, CC, JC-G, NB-L, GC, ID, MB, and MT recruited and characterized patients. MM performed the statistical analysis. GA wrote the first draft of the manuscript. MM, EG-M, and JA wrote sections of the manuscript. All authors contributed to manuscript critical revision with important intellectual contribution, read, and approved the submitted version.

#### FUNDING

This work was supported in part by Grants PI15/00303, PI15/00726, PI17/01593, PI18/00540 and ARADyAL RD16/0006/0001, RD16/0006/0004 and RD16/0006/0024 from Fondo de Investigación Sanitaria, Instituto de Salud Carlos III, Spain, IB16170 and GR18145 from Junta de Extremadura, Spain. Financed in part with FEDER funds from the European Union.

#### ACKNOWLEDGMENTS

JC-G is a researcher from the Miguel Servet Program (Ref CP14/00034), and ID from the Juan Rodés Program (Ref JR15/0036), both from the Carlos III National Health Institute, Spanish Ministry of Economy and Competitiveness.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00582/full#supplementary-material

#### REFERENCES

polymorphisms rs1805010 and rs1801275 are associated with increased risk of asthma in a Saudi Arabian population. Ann. Thorac. Med. 9, 81–86. doi: 10.4103/1817-1737.128849


binding protein 3 genes and the development of eczema during childhood. Br. J. Dermatol. 158, 1315–1322. doi: 10.1111/j.1365-2133.2008.08565.x


clinical manifestations of asthma and allergic rhinitis. Clin. Exp. Allergy 37, 1175–1182. doi: 10.1111/j.1365-2222.2007.02769.x


Juul, S. (2012). Epidemiologi Og Evidens. 2nd Edn. København: Munksgaard.


and asthma risk: an update meta-analysis. PLoS ONE 8:e69120. doi: 10.1371/journal.pone.0069120


childhood and adult asthma. Am. J. Respir. Crit. Care Med. 170, 1057–1065. doi: 10.1164/rccm.200404-447OC


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Amo, Martí, García-Menaya, Cordobés, Cornejo-García, Blanca-López, Canto, Doña, Blanca, Torres, Agúndez and García-Martín. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.