Genome-Wide Association Study of Smoking Behavior Traits in a Chinese Han Population

Tobacco use is one of the leading causes of preventable disease worldwide. Genetic studies have elucidated numerous smoking-associated risk loci in American and European populations. However, genetic determinants of cigarette smoking in Chinese populations are underinvestigated. In this study, a whole-genome sequencing (WGS)-based genome-wide association study (GWAS) was performed in a Chinese Han population comprising 620 smokers and 564 nonsmokers. Thirteen single-nucleotide polymorphisms (SNPs) of the raftlin lipid linker 1 (RFTN1) gene achieved genome-wide significance (P < 5 x 10^-8) for smoking initiation. rs139753473 in RFTN1 and six other suggestively significant loci in the CUB and sushi multiple domains 1 (CSMD1) gene were also associated with cigarettes per day (CPD) in an independent Chinese sample consisting of 1,329 subjects (805 smokers and 524 nonsmokers). When males were analyzed separately, associations between smoking initiation and PCAT5/ANKRD30A, two genes involved in cancer development, were identified and replicated. Within RFTN1, two haplotypes (i.e., C-A-C-G and A-G-T-C) formed by rs796812630-rs796584733-rs796349027-rs879511366 and three haplotypes (i.e., T-T-C-C-C, T-T-A-T-T, and C-A-A-T-T) formed by rs879401109-rs879453873-rs75180423-rs541378415-rs796757175 were strongly associated with smoking initiation. In addition, we identified two haplotypes in the CSMD1 gene (i.e., C-A-G-G and T-C-T-T, derived from rs4875371-rs4875372-rs17070935-rs11991366) that showed a significant association with smoking initiation. Further bioinformatics functional assessment suggested that RFTN1 may participate in smoking behavior through modulating immune responses or interactions with the glucocorticoid receptor alpha and the androgen receptor. Together, our results may help to clarify the mechanisms underlying smoking behavior in the Chinese Han population.


INTRODUCTION
Although many tobacco control programs and regulatory policies have been introduced, reducing smoking prevalence to a satisfactory level remains an unsolved issue in many countries, especially low- and middle-income countries (1). It has been reported that over 1.3 billion people were tobacco users in 2018 (2). Cigarette smoking is believed to have a wide range of deleterious health effects, such as cardiovascular and pulmonary diseases, and cancers (3)(4)(5)(6). Tobacco smoking and second-hand smoke exposure contribute to more than 6 million deaths worldwide annually, posing a serious threat to public health (7).
The establishment of daily smoking usually consists of three main stages: smoking initiation, transition from experimentation to regular smoking, and development of nicotine dependence (ND) (8,9). Both genetic and environmental factors have been shown to influence all smoking-related stages (9). Smoking initiation, smoking quantity, smoking cessation, and nicotine dependence are commonly studied phenotypes in research on smoking-related genetic predispositions. Of these, the best known is the association between ND and genetics. Because nicotinic acetylcholine receptors (nAChRs) are the primary targets through which nicotine exerts its biological effects in the brain, the genes encoding them represent one of the most investigated ND susceptibility gene families. nAChRs can trigger the release of dopamine and glutamate, and furthermore reinforce nicotine reward and addiction (10). Meta-analyses and genome-wide association studies (GWASs) have identified a variety of polymorphisms within nAChR genes associated with ND, e.g., rs3743075 in CHRNA3 (11,12) and rs2273500 in CHRNA4 (13). In addition to nAChRs, nicotine metabolizing enzymes (e.g., CYP2A6), dopamine receptors and transporters (e.g., DRD2, DRD4, and SLC6A3), and neuregulin signaling pathway proteins (e.g., NRG3) are also considered to strongly influence nicotine addiction (14)(15)(16)(17).
China is the world's largest producer of tobacco products, and smoking prevalence in Chinese males is among the highest in the world (18,19). Nevertheless, GWASs of smoking behaviors in the Han Chinese are reported far less often than those conducted in populations of European descent (20) or in European American and African American populations (21,22). In the present study, 1,184 Chinese Han adults (620 smokers and 564 nonsmokers) were recruited. Whole-genome sequencing (WGS) was performed to identify genome-wide variants. Genetic variants associated with smoking initiation and cigarettes per day (CPD) were determined by association tests. To verify our GWAS findings, a replication analysis was conducted in 1,329 subjects including 805 smokers and 524 nonsmokers. The possible mechanisms by which the observed variants are involved in smoking behavior are also briefly discussed.

Subjects and Phenotypes
The discovery sample included a total of 1,184 unrelated Han Chinese adults from a Yunnan cigarette factory. All participants consented to participate in this project and completed a self-administered survey questionnaire covering smoking status, smoking quantity, disease history, height, weight, and age (Supplementary Table 1). Among them, 620 were current smokers, 63 were former smokers, and 501 were never smokers. All smokers had smoked at least 100 cigarettes in their lifetimes. The most frequently reported disease among the 1,184 participants was hypertension (32 cases, Supplementary Table 1). The study was approved by the institutional review board on human studies.
Variant quality control used the standard recommended GATK filters, including variant quality score recalibration (VQSR), largest contiguous homopolymer run of the variant allele (HomopolymerRun), binomial test (GetHetCoverage), root mean square of mapping quality (RMSMappingQuality), and strand bias (FisherStrand). To further reduce bias, the following exclusion criteria were adopted: 1) minor allele average depth <4X; 2) average depth in case or control <8X; 3) eightfold rate for case or control <0.9; and 4) P-value of Hardy-Weinberg equilibrium test <10^-4. In addition, variants without dbSNP IDs [also with minor allele frequency (MAF) <0.005] were excluded.
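The four numbered exclusion criteria can be expressed as a simple per-variant filter. The sketch below is illustrative only and is not the authors' pipeline: the field names are hypothetical, the "eightfold rate" is assumed to mean the fraction of samples covered at 8X or more, and the Hardy-Weinberg P-value is computed with a 1-df chi-square test because the text does not say which test was used. The dbSNP/MAF rule is left out because its exact logic is ambiguous in the text.

```python
import math
from dataclasses import dataclass


def hwe_chi2_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Assumption: the paper does not state whether an exact or a
    chi-square test was used; chi-square is chosen here for brevity.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


@dataclass
class VariantStats:
    """Hypothetical per-variant summary; field names are not from the paper."""
    minor_allele_depth: float  # average depth supporting the minor allele
    case_depth: float          # average depth among cases
    control_depth: float       # average depth among controls
    case_8x_rate: float        # assumed: fraction of cases covered at >= 8X
    control_8x_rate: float     # assumed: fraction of controls covered at >= 8X
    hwe_p: float               # Hardy-Weinberg equilibrium P-value


def passes_qc(v: VariantStats) -> bool:
    """Apply the four numbered exclusion criteria from the text."""
    if v.minor_allele_depth < 4:
        return False
    if v.case_depth < 8 or v.control_depth < 8:
        return False
    if v.case_8x_rate < 0.9 or v.control_8x_rate < 0.9:
        return False
    if v.hwe_p < 1e-4:
        return False
    return True
```

A variant whose genotype counts are 50/0/50 has no heterozygotes at an allele frequency of 0.5, so `hwe_chi2_p(50, 0, 50)` falls far below the 10^-4 cutoff and the variant would be excluded.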

Replication Analysis
To replicate the GWAS associations, 1,329 participants including 805 smokers and 524 nonsmokers were recruited from local hospitals in Jincheng and Taiyuan of Shanxi Province in China during 2012-2013 (12). All 1,329 participants were males, aged 19-62 years (Supplementary Table 1). Participants with psychiatric diseases such as schizophrenia, Alzheimer's disease, and major depression diagnosed by the Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV criteria were excluded from enrollment. The project was approved by the Ethics Committee of First Affiliated Hospital of Zhejiang University School of Medicine. Responses to questions on age, education, income, medical history, environment, and smoking-related behaviors were collected by trained researchers. Sequencing was performed using Illumina HiSeq X10 and analyzed as reported previously (12).

Sequencing and Variant Discovery
WGS was carried out in the 1,184 subjects, yielding approximately 1.3 trillion clean reads with an average read length of 100 bp. In the quality control steps for the study participants, as displayed in Supplementary Table 2, 26 subjects whose mean sequencing depth was <8X coverage and 37 subjects whose 10X coverage was <80% were removed. The estimate of the inbreeding coefficient and principal component analysis (PCA) filtered out 21 and 3 outliers, respectively.
To better demonstrate the population structure of our subjects, we performed PCA on individuals from our study along with European (CEU), African (YRI), American (AMR), South Asian (SAS), and East Asian (EAS) samples obtained from the phase 3 release of the 1000 Genomes Project (29). The distribution of the subjects in this study was concordant with the East Asian cluster (Figure 1A). During the test of relatedness, 49 samples were excluded from the analyses because duplicates and cryptic relatedness were detected. A total of 126 samples were removed (data for four subjects were marked as low quality by more than one rule) (Supplementary Table 2). In the remaining 1,058 subjects, there were 573 current smokers, 44 ex-smokers, and 441 never smokers (Table 1). Smoking prevalence was much higher among men (ever smokers, 50.8%) than among women (ever smokers, 7.6%). The average sequencing depth was 34.78-fold coverage (Figure 1B).
Despite this, 45 variants reached a P value of <10^-5, with rs78955061, from the intergenic region between ACKR3 (atypical chemokine receptor 3) and LOC93463, having the smallest P value of 7.92 x 10^-7. With a cutoff of P <10^-3, 27 variants from 19 known smoking-associated genes were identified. When the study sample was restricted to males, 22 SNPs yielded a P value of <10^-5, of which rs143124048 from PALLD (a gene encoding the palladin protein) had the smallest P value of 1.70 x 10^-6 (Supplementary Data 1), and 28 SNPs from 15 known smoking-associated genes were identified at P <10^-3. However, none of these signals reached genome-wide significance, which may be due to a lack of statistical power (Supplementary Figure 1C).
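The distinction between the 10^-5 cutoff used above and genome-wide significance can be made concrete with a small classifier. This is a minimal sketch, not the authors' code: the tier labels are illustrative, and the 5 x 10^-8 threshold is treated as the conventional Bonferroni correction of 0.05 for roughly one million independent tests.

```python
GENOME_WIDE = 5e-8  # conventional genome-wide threshold, as cited in the abstract
SUGGESTIVE = 1e-5   # reporting cutoff used in the text; the label is illustrative


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def classify(p: float) -> str:
    """Assign a discovery-stage P-value to a significance tier."""
    if p < GENOME_WIDE:
        return "genome-wide significant"
    if p < SUGGESTIVE:
        return "suggestive"
    return "not significant"
```

Under these cutoffs, the two smallest CPD P-values reported above (7.92 x 10^-7 for rs78955061 and 1.70 x 10^-6 for rs143124048) both fall in the suggestive tier, consistent with the statement that no CPD signal reached genome-wide significance.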

Replication Study
Because very few women smoke in China, only males were included in the replication sample. Variants with a P-value lower than 10^-5 in the primary analysis were selected (for RFTN1 and CSMD1, all variants with P-value <10^-3 were tested). None of the smoking initiation- or CPD-associated loci from the total discovery sample (men and women combined) were significant in the replication sample. However, 18 male-specific smoking initiation-associated loci (i.e., rs10128115, rs10128145, rs72795203, rs12241402, rs10128497, rs16936694, rs7072685, rs12261634, rs11010478, rs11010482, rs112089093, rs12248963, rs1480525, rs10128398, rs12256178, rs10128169, rs1122458, and rs7071386) yielded significant results (P < 0.05, Supplementary Table 4). All of these SNPs were located in the intergenic region between PCAT5 and ANKRD30A. At a P-value threshold of 0.10, another 15 SNPs also showed evidence for replication, 14 of which were in the intergenic region between PCAT5 and ANKRD30A. The other one, rs4590382 (P = 0.06), was an intergenic SNP between LOC101928283 and GRM8. For male-specific CPD-associated loci in the primary analysis, no evidence of replication was observed. Intriguingly, in testing for an association with CPD, rs139753473 within RFTN1 and six SNPs within CSMD1 showed a P-value of <0.05 (Table 3). The six SNPs from CSMD1 were rs76965088, rs117740219, rs78094590, rs138695620, rs76195425, and rs148939406. Furthermore, one SNP from RFTN1 (rs796139390) and four SNPs from CSMD1 (i.e., rs114254701, rs10503200, rs56391646, and rs149909271) reached marginal significance (P < 0.10) for an association with CPD.
Because CSMD1 had the largest number of SNPs with P <10^-3 among the reported smoking-associated genes in the individual SNP-based association analysis of smoking initiation, haplotype-based association analysis was also performed on the 29 SNPs in CSMD1. One LD block exhibited a D' larger than 0.97 (Supplementary Figure 2). As shown in Table 4, two haplotypes, C-A-G-G and T-C-T-T, derived from rs4875371-rs4875372-rs17070935-rs11991366, were strongly associated with smoking initiation (Hap-Score = −3.35 and 3.77, P = 3.36 x 10^-4 and 1.61 x 10^-4, respectively).
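The selection and evaluation rules described in this paragraph can be sketched as two small functions. The function names and tier labels are hypothetical and serve only to illustrate the stated cutoffs: a discovery P-value below 10^-5 (relaxed to 10^-3 for RFTN1 and CSMD1) to enter replication, then P < 0.05 for replication and P < 0.10 for marginal evidence.

```python
from typing import Optional

# Genes for which the text applies the relaxed discovery cutoff.
RELAXED_GENES = {"RFTN1", "CSMD1"}


def selected_for_replication(discovery_p: float, gene: Optional[str] = None) -> bool:
    """Return True if a variant would be carried forward to replication."""
    cutoff = 1e-3 if gene in RELAXED_GENES else 1e-5
    return discovery_p < cutoff


def replication_tier(replication_p: float) -> str:
    """Label a replication-stage P-value using the thresholds in the text."""
    if replication_p < 0.05:
        return "replicated"
    if replication_p < 0.10:
        return "marginal"
    return "not replicated"
```

For example, `replication_tier(0.06)` returns `"marginal"`, which matches how rs4590382 (P = 0.06) is described above.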
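For readers unfamiliar with the D' statistic used to define the LD block, the sketch below computes |D'| for a pair of biallelic loci from one haplotype frequency and the two allele frequencies. It uses the standard textbook definition and is not taken from the authors' analysis software; the argument names are illustrative.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Absolute normalized linkage disequilibrium |D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and
          allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one locus is monomorphic
    return abs(d) / d_max
```

A value near 1, such as the D' > 0.97 reported for the CSMD1 block, indicates that at least one of the four possible two-locus haplotypes is nearly absent, so the SNPs are inherited together almost without historical recombination.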

Bioinformatics Functional Assessment of RFTN1
In silico functional analyses based on the RegulomeDB (https://regulomedb.org/) and HaploReg (https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php) databases were performed for SNPs with P <10^-8, i.e., rs139753473, rs200713609, rs116358832, rs796950514, rs796881087, rs796687837, rs796068970, and rs796689769. All of these SNPs were intron variants within RFTN1. The male-specific smoking-associated rs11010435 was omitted because the GWAS did not have adequate power and this locus was not replicated. Although evidence of regulatory potential was weak for rs139753473 (RegulomeDB score = 0.008), this SNP alters E2F, Egr-1, MOVO, Nrf1, UF1H3BETA, YY1, and SP1 transcription factor binding motifs according to HaploReg (Supplementary Table 5). A further investigation using the PROMO prediction tool (30) suggested that the rs139753473 locus interacts with two transcription factors, the glucocorticoid receptor alpha and the androgen receptor. Interestingly, the glucocorticoid receptor has been reported to be associated with the probability of smoking severity and cessation in a sample of obstructive airway disease patients (31). It has been suggested that cigarette smoking could increase androgen receptor activity (32). Additionally, rs200713609 had a RegulomeDB score of 0.61 and could alter the PEBP transcription factor binding motif. The RegulomeDB score for rs116358832 was 0.13. Motifs altered by rs116358832 included CEBPB and GATA. For rs796950514, rs796881087, rs796687837, rs796068970, and rs796689769, the RegulomeDB score ranged from 0.13 to 0.61, and no altered motif was found in either RegulomeDB or HaploReg. No expression quantitative trait loci (eQTL) data were available to examine the correlation between these variants and the expression of RFTN1.
Furthermore, according to the expression pattern retrieved from the GTEx Portal (https://gtexportal.org/home/), RFTN1 had the highest expression in lymphocytes and was also expressed in various brain tissues, such as the cortex, the frontal cortex (BA9), and cerebellum (Supplementary Figure 3).

DISCUSSION
Associations between genetic variants and cigarette smoking have been largely deciphered for European- and American-ancestry populations (21,33). For the Chinese Han population, studies on genetic factors conferring smoking susceptibility are still limited in the literature. Here we performed deep WGS of 1,184 Chinese samples and discovered 35 million variants. Of them, 1,882,293 (5%) and 26,578,915 (76%) were found to be low-frequency and rare variants, respectively. Follow-up replication analyses revealed risk alleles in the RFTN1, CSMD1, and PCAT5/ANKRD30A genes that likely contribute to smoking behavior. In the discovery stage, 13 SNPs from RFTN1 were significantly associated with smoking initiation, i.e., rs139753473, rs200713609, rs116358832, rs796950514, rs796881087, rs796687837, rs796068970, rs796689769, rs796931177, rs796525300, rs75180423, rs796757175, and rs541378415. rs11010435, from the intergenic region between PCAT5 and ANKRD30A, was also significantly associated with smoking initiation in male smokers. For CPD, we found no genome-wide significant signals, but there were 45 and 22 variants in the total sample and the male subgroup, respectively, at the threshold of P <10^-5. To validate the preliminary findings, we performed a replication study for the variants with a P-value less than 10^-5 (for RFTN1 and CSMD1, variants with P <10^-3 were included), using another Chinese Han sample containing 1,329 male subjects. Although variants associated with smoking initiation and CPD in the total discovery sample were not replicated, we replicated 18 loci for their association with smoking initiation in men, including rs10128115, rs10128145, rs72795203, rs12241402, rs10128497, rs16936694, rs7072685, rs12261634, rs11010478, rs11010482, rs112089093, rs12248963, rs1480525, rs10128398, rs12256178, rs10128169, rs1122458, and rs7071386 from the intergenic region between PCAT5 and ANKRD30A (P < 0.05).
For male-specific associations with CPD, no evidence of replication was found. Furthermore, although RFTN1 and CSMD1 were originally identified in the test of smoking initiation, in the replication test of CPD, rs139753473 in RFTN1 and rs76965088, rs117740219, rs78094590, rs138695620, rs76195425, and rs148939406 in CSMD1 reached a P-value of <0.05. Given that smoking initiation and CPD are moderately correlated smoking behavior traits (correlation coefficient r = 0.425, P = 2.6 x 10^-15) (34), these association results for RFTN1 and CSMD1 are of great interest, and further work is warranted to elucidate the biological roles of these genes in smoking.
Among the 24 SNPs observed in RFTN1, two LD blocks, rs796812630-rs796584733-rs796349027-rs879511366 and rs879401109-rs879453873-rs75180423-rs541378415-rs796757175, were uncovered. Two haplotypes (i.e., C-A-C-G and A-G-T-C) from the former and three (i.e., T-T-C-C-C, T-T-A-T-T, and C-A-A-T-T) from the latter were significantly associated with smoking initiation. Additionally, haplotype-based association analysis showed that two CSMD1-derived haplotypes, C-A-G-G and T-C-T-T, formed by rs4875371-rs4875372-rs17070935-rs11991366, were strongly correlated with smoking initiation.
The lead SNP (rs139753473) associated with smoking initiation is located within an intron of the RFTN1 gene. It may bind to two transcription factors, i.e., the glucocorticoid receptor alpha and the androgen receptor, which have been proposed to play a role in smoking. Nonetheless, the functional studies were carried out in silico and need experimental validation. Although RFTN1 had the highest expression in lymphocytes, its expression can also be found in brain tissues, e.g., the cortex, frontal cortex (BA9), and cerebellum. These regions are believed to be involved in the brain's reward and inhibitory control processes (35,36). In addition, RFTN1 contributes to multiple immune-related biological pathways, including B and T cell receptor (BCR and TCR, respectively) signaling, toll-like receptor (TLR) 3 signaling, and interleukin-17 (IL17) production (37)(38)(39)(40). It has been increasingly recognized that activation of central immune signaling by substances (e.g., opioids) can enhance drug reward (41,42). In mice, TLR3 modulates cocaine reward through proinflammatory immune signaling (43). In particular, BCR, TCR, TLR, and IL17 induce activation of nuclear factor kappaB (NF-κB) (44)(45)(46)(47), and NF-κB mediates the reward effects of drugs (e.g., cocaine) (48). Moreover, NF-κB is not only a transcription factor involved in inflammation and the immune response (49), but is also a regulator of synaptic plasticity and memory (50). It is plausible that RFTN1 could play a role in smoking initiation by regulating immune responsiveness.
CSMD1 is a complement-regulatory protein that is highly expressed in the central nervous system, contributing to addiction vulnerability (51). In an analysis of 4,122 psoriasis cases and 3,101 healthy controls, CSMD1 showed evidence of association with cigarette smoking (52).
Both PCAT5 and ANKRD30A have been reported to be related to cancer progression. For instance, PCAT5 is a long noncoding RNA regulated by ERG, an active transcription factor common in human prostate cancer (53). Similarly, ANKRD30A encodes a DNA-binding transcription factor implicated in breast cancer (54,55). Our findings indicate that these two genes could be potential targets for investigating the connection between smoking and cancers.
One major limitation of this study is the sample size. Due to the relatively high sequencing cost per sample and the difficulty of recruiting participants, our sample size was not large enough to cover a variety of phenotypes or to provide adequate statistical power. Although the application of WGS in genome-wide association analysis allowed for successful detection of a few novel or low-frequency variants associated with cigarette smoking, these SNPs have rarely been reported in previous studies of smoking or other psychiatric-related traits, which prevents further functional assessment. This also raises the question of whether the observed loci, especially those from RFTN1 and PCAT5/ANKRD30A, are specifically associated with smoking in the Chinese population. Further studies with particular attention to these genes are therefore required to address this issue. Another concern is that smoking prevalence usually differs by sex, yet association analyses were not performed in a female-specific manner in this study, because the discovery cohort included only 74 current female smokers and there were no females in the replication sample. In addition, the Fagerström Test for Nicotine Dependence (FTND) is a widely used instrument to estimate nicotine dependence, and a GWAS of this phenotype may produce more robust results. Finally, as a major metabolite of nicotine, cotinine represents a direct biomarker of smoking quantity. Measuring the cotinine level in serum, urine, or saliva could benefit further validation studies by providing more objective information on smoking quantity.
In summary, to the best of our knowledge, we have reported the first WGS-based GWAS of smoking phenotypes in a Chinese Han cohort. We provided exploratory evidence that RFTN1 and CSMD1 are involved in smoking. Associations between smoking initiation and PCAT5/ANKRD30A were also detected and replicated in a male-specific manner. RFTN1 might function in smoking initiation through interactions with the immune system, the glucocorticoid receptor alpha and androgen receptor signaling. These findings provide extensive insight into the biological mechanisms of smoking behavior in the Chinese Han population.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because, under the "Regulation of the People's Republic of China on the Administration of Human Genetic Resources", the sequencing data of this study cannot be shared publicly. Requests to access the datasets should be directed to zhuzhouhai@gmail.com.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Biomedical Ethics Committee of Joint Institute of Tobacco and Health. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
ML, JY, and ZZ conceived the study. YC, ML, and QL performed the data analysis. ML wrote the manuscript. All authors contributed to the article and approved the submitted version.