regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data

Xu, Siwen; Feng, Weixing; Lu, Zixiao; Yu, Christina Y.; Shao, Wei; Nakshatri, Harikrishna; Reiter, Jill L.; Gao, Hongyu; Chu, Xiaona; Wang, Yue; Liu, Yunlong

doi:10.3389/fbioe.2020.00886

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 29 July 2020

Sec. Computational Genomics

Volume 8 - 2020 | https://doi.org/10.3389/fbioe.2020.00886

regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data

SX
Siwen Xu ^1,2
WF
Weixing Feng ¹
ZL
Zixiao Lu ³
CY
Christina Y. Yu ^4,5
WS
Wei Shao ³
HN
Harikrishna Nakshatri ²
JL
Jill L. Reiter ⁶
HG
Hongyu Gao ⁶
XC
Xiaona Chu ⁶
YW
Yue Wang ⁶
YL
Yunlong Liu ^2,6^*

1. Institute of Intelligent System and Bioinformatics, College of Automation, Harbin Engineering University, Harbin, China
2. Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States
3. Regenstrief Institute, Indiana University School of Medicine, Indianapolis, IN, United States
4. Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States
5. Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
6. Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States

Article metrics

View details

Citations

5,9k

Views

2,2k

Downloads

Abstract

Expression quantitative trait loci (eQTL) analysis is useful for identifying genetic variants correlated with gene expression, however, it cannot distinguish between causal and nearby non-functional variants. Because the majority of disease-associated SNPs are located in regulatory regions, they can impact allele-specific binding (ASB) of transcription factors and result in differential expression of the target gene alleles. In this study, our aim was to identify functional single-nucleotide polymorphisms (SNPs) that alter transcriptional regulation and thus, potentially impact cellular function. Here, we present regSNPs-ASB, a generalized linear model-based approach to identify regulatory SNPs that are located in transcription factor binding sites. The input for this model includes ATAC-seq (assay for transposase-accessible chromatin with high-throughput sequencing) raw read counts from heterozygous loci, where differential transposase-cleavage patterns between two alleles indicate preferential transcription factor binding to one of the alleles. Using regSNPs-ASB, we identified 53 regulatory SNPs in human MCF-7 breast cancer cells and 125 regulatory SNPs in human mesenchymal stem cells (MSC). By integrating the regSNPs-ASB output with RNA-seq experimental data and publicly available chromatin interaction data from MCF-7 cells, we found that these 53 regulatory SNPs were associated with 74 potential target genes and that 32 (43%) of these genes showed significant allele-specific expression. By comparing all of the MCF-7 and MSC regulatory SNPs to the eQTLs in the Genome-Tissue Expression (GTEx) Project database, we found that 30% (16/53) of the regulatory SNPs in MCF-7 and 43% (52/122) of the regulatory SNPs in MSC were also in eQTL regions. The enrichment of regulatory SNPs in eQTLs indicated that many of them are likely responsible for allelic differences in gene expression (chi-square test, p-value < 0.01). In summary, we conclude that regSNPs-ASB is a useful tool for identifying causal variants from ATAC-seq data. This new computational tool will enable efficient prioritization of genetic variants identified as eQTL for further studies to validate their causal regulatory function. Ultimately, identifying causal genetic variants will further our understanding of the underlying molecular mechanisms of disease and the eventual development of potential therapeutic targets.

Introduction

Expression quantitative trait loci (eQTL) analysis has developed over the years into a powerful tool to investigate the effects of genetic variants on gene regulatory networks, identify quantitative traits for complex diseases, and derive causal inference frameworks for genomic markers and gene expression (Fagny et al., 2017; van der Wijst et al., 2020). Specifically, eQTL analysis is designed to investigate how single-nucleotide polymorphisms (SNPs) in regulatory elements directly modify the abundance of a gene transcript. The major goal of an eQTL study is to reduce the large number of variants identified from a genome-wide association study (GWAS) to a list of potential causal SNPs for further investigation into how the locus contributes to disease. The validity of eQTL analysis has been demonstrated in multiple tissue types, in which high heritability has been observed in a large variety of gene transcripts (Nica and Dermitzakis, 2013).

Despite recent technological and methodological advances, eQTL analysis cannot distinguish between causal and non-functional variants that are in strong linkage disequilibrium (LD). In this study, we aim to identify functional SNPs in key regulatory regions that alter transcriptional regulation and thus, potentially impact cellular function. Such variants would be important for investigating the etiology of the associated disease and for identifying potential therapeutic targets.

Surveys of GWAS indicate that about 93% of disease- and trait-associated variants lie within non-coding sequences, especially in intergenic and intronic areas (Maurano et al., 2012). Notably, 76.5% of all non-coding GWAS SNPs are either within or in perfect LD with DNase I hypersensitive sites, which correspond to open chromatin regions that contain transcription factor (TF) binding motifs (Maurano et al., 2012). Such variants are likely to disturb gene expression by modulating transcriptional regulatory elements, including promoters, silencers, and enhancers (Cookson et al., 2009; Musunuru et al., 2010; Degner et al., 2012). Intronic variants can affect gene splicing by altering canonical splice sites, activating non-canonical splice sites, or changing splicing regulatory elements. In addition, intronic variants can also affect transcription regulatory motifs resulting in altered gene expression (Vaz-Drago et al., 2017). We previously developed a computational framework called regSNPs-intron that showed high accuracy in predicting disease-causing intronic SNPs (Lin et al., 2019). Similar methods are needed to prioritize intragenic variants that alter binding sites of key DNA binding proteins, such as TFs, as an efficient way of identifying candidate disease-causing SNPs.

Assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) is a high-throughput technology that employs an engineered Tn5 transposase to map genome-wide chromatin accessibility and nucleosome positioning (Buenrostro et al., 2013). In open chromatin regions, DNA-binding proteins protect DNA from DNase I enzyme digestion in DNase-seq or Tn5 transposase insertion (Thurman et al., 2012; Sung et al., 2016). Such protection often results in an altered pattern of enzymatic cleavage at the binding site compared to the flanking genomic region, which is referred to as a footprint. To date, many studies have inferred TF binding sites from ATAC-seq data by analyzing the digital genomic footprints left by DNA-binding proteins (Cavalli et al., 2016; D’Antonio et al., 2017; Wei et al., 2018). For example, the algorithms HINT-ATAC, DeFCoM, and Mocap each focus on identifying TF binding sites from sequencing-based footprint data (Chen et al., 2017; Quach and Furey, 2017; Li et al., 2019). While these existing methods predict TF binding events at the single base-pair resolution, none of these tools were designed to evaluate the impact of genetic variants on TF footprints. Furthermore, like DNase I, Tn5 transposase is reported to have specific sequence bias (Adey et al., 2010; Lu et al., 2017), although the impact of this bias on ATAC-seq footprinting profiles has not been systematically investigated.

Recently, the Sasquatch algorithm was developed to predict the effect of non-coding variants on TF binding by analyzing differences in DNase footprints between samples with different genotypes (Schwessinger et al., 2017). This method improved the statistical power by effectively eliminating biases resulting from variations in the footprint patterns of different transcription factors. Here, we propose to further analyze the differences in the ATAC-seq read distribution between two heterozygous alleles from the same individual. This strategy further reduces the biases of the sequencing variation on different experiments and sample types.

In our analysis, we first systematically screened for functional variants by applying a generalized linear model (GLM) based on the ATAC-seq data from two different cell lines. We further evaluated our findings using RNA-seq data from the same cell line, as well as publicly available chromatin interaction data (Teng et al., 2015). The overall strategy presented in this study provides evidence for functional SNP activity, which can serve as the basis for generating testable hypotheses for experimental validation. Ultimately, findings generated from regSNPs-ASB are expected to aid in understanding the molecular mechanisms of complex diseases.

Results

Sequence variants within a TF-binding site may alter TF-binding affinity at that locus (Johnston et al., 2019). When functional SNPs are heterozygous and TFs preferentially bind to one allele, differential ATAC-seq cleavage patterns between the two alleles are expected to result (Figure 1). Based on this premise, we developed a computational model called regSNPs-ASB for detecting allele-specific differences in TF occupancy on a genome-wide scale.

FIGURE 1

Overall Strategy and Data Preprocessing for regSNPs-ASB

The overall strategy of regSNPs-ASB is shown in Figure 2, which consists of six steps. First, open chromatin regions and SNPs were extracted from ATAC-seq data. Second, open chromatin regions and SNPs were filtered by the quality control and the total number of cutting sites in each open chromatin region, respectively. Third, potential allele-specific TF binding sites were extracted by merging the loci of heterozygous variants and TF binding sites. Fourth, significant allele-specific TF binding events were detected using GLM. Fifth, empirical filtering was performed to avoid false positive events. The final output is a list of ASB events with associated TF and regulatory SNP information.

FIGURE 2

To identify potential allele-specific transcription factor binding sites, we first conducted ATAC-seq experiments on patient-derived mesenchymal stem cells (MSC) and a breast cancer cell line (MCF7). Using the MACS2 algorithm (Feng et al., 2011), we identified 33,919 and 8,746 open chromatin regions in MSC and MCF7 cells, respectively. Within these regions, heterozygous SNPs were derived from the ATAC-seq BAM files using samtools (Li et al., 2009). After quality control filtering, 1,119,374 and 815,852 heterozygous variants were identified in the open chromatin regions of MSC and MCF7 cells, respectively (Table 1). In each open chromatin region, we used FIMO (Grant et al., 2011) to scan both reference and alternative sequences for putative transcription factor binding sites. In total, 3,874,515 and 1,725,376 candidate binding sites were identified from MSC and MCF7 cell lines, respectively. Among these, 40,475 binding sites contained heterozygous variants in MSC and 12,402 in MCF7.

TABLE 1

Dataset	Chromatin accessibility regions	Heterozygous SNPs	Potential TF binding sites	Candidate allele-specific TF binding sites	Significant allele-specific TF binding sites	Filtered allele-specific TF binding sites	Regulatory SNPs
MSC	33,919	1,119,374	3,874,515	40,475	2417	406	125
MCF-7	8,746	815,852	1,725,376	12,402	922	122	53

regSNP-ASB filtering summary.

Detection of Allele-Specific TF Binding Events in MSC and MCF7 Cells

For each potential TF binding site containing heterozygous variants, we asked whether there were allele-specific differences in the open chromatin regions. To address this question, a generalized linear model (see section “Materials and Methods”) was implemented to examine the differences in the ratio of the cut site frequency in the flanking regions relative to the TF footprint (i.e., the flanks-to-footprint ratio) between the reference and alternative alleles (Figure 1B).

A negative binomial distribution was used in our model to account for the over-dispersion of the read counts. The coefficient of the interaction between the allele type (reference or alternative) and the region (flanking or footprint), β_int, was used to evaluate whether the footprints on the putative TF-binding sites were significantly different between the two alleles. This analysis resulted in 2,417 candidate allele-specific TF binding sites in MSC and 922 sites in MCF-7 cells (FDR < 0.05). Using the position weight matrices (PWM) of the candidate TFs retrieved from the JASPAR database (Khan et al., 2018), we further calculated the position-specific scoring matrices (PSSM) scores (see section “Materials and Methods”) of the candidate allele-specific TF-binding sites for both reference and alternative alleles. Overall, we detected 406 and 122 candidate TF-binding sites in MSC and MCF7 cell lines with a significant β_int value and a sizable PSSM score difference (| deltaPSSM| > 5). Since each variant can potentially disrupt the binding sites of multiple TFs, 125 and 53 heterozygous SNPs in MSC and MCF7 were identified with the potential to disrupt TF binding. The number of TFs whose binding sites were disrupted by each heterozygous variant are shown in Supplementary Figure S1. A complete list of these variants can be found in Supplementary Tables S1, S2. In addition, the β_int values, the ATAC-seq read distribution on and flanking the putative sites, and the deltaPSSM, can be found in an RShiny website¹.

Examples of the ATAC-seq signal distribution for two TF binding sites on the reference and alternative alleles identified by regSNPs-ASB are presented in Figure 3A. For SNP rs7164266, regSNPs-ASB detected a decrease in the footprint on the alternative allele. The matching score differences (deltaPSSM = 7.24) further supported the regSNPs-ASB prediction. Since rs7164266 occurs in a CTCF-binding motif, this finding suggests that the variant allele could potentially repress CTCF binding to this site. On the contrary, for rs6752740, a SNP located in a putative KLF14 binding site, a clear footprint was observed on the alternative allele, but not on the reference allele. This finding indicates that the rs6752740 SNP could enhance KLF14 binding, which was supported by the matching score differences (deltaPSSM = −5.48). The ratio of the number of ATAC-seq cutting sites between the two alleles on putative TF binding sites (X) versus the flanking regions (Y) was plotted for all of the identified events (Figure 3B). Taken together, these results indicate that the regSNPs-ASB algorithm can identify genome-wide allele-specific TF binding events from ATAC-seq data.

FIGURE 3

Correlation of Allele-Specific TF Binding and Allele-Specific Gene Expression

We next asked whether the allele-specific TF binding predicted by regSNPs-ASB correlated with allele-specific gene expression. To address this question, we first identified likely target genes for the heterozygous variants that exhibited allele-specific TF binding. For putative promoter variants, the closest gene to each variant was considered as the target gene. However, because enhancer variants can be far away from their target genes, Chromosome Conformation Capture (3C) and its derivative techniques have become the major biochemical approaches to study such distal transcriptional regulation and chromatin interactions. Therefore, potential enhancer regulatory SNPs were combined with 3C experimental interaction data obtained from the 4Dgenome database. For this analysis, we focused on MCF-7 cells since 3C data was not available for MSCs. In MCF7 cells, 8 of the 53 heterozygous variants that exhibited allele-specific TF binding were located in the promoter or 5′UTR regions, while the other 45 SNPs were located in intronic or distal intergenic regions and were considered potential enhancer regulatory SNPs. For these 45 putative enhancer SNPs, 66 potential target genes were identified in the 4Dgenome database.

Allele-specific expression analysis for the target genes was conducted using heterozygous variants in the open reading frame of the putative target genes. We found that 75% (6/8) of the target genes associated with promoter variants and 39% (26/66) of the target genes associated with putative enhancer variants showed significant allele-specific expression (p-value < 0.01, Supplementary Table S7).

One example of the chromatin interaction analysis is shown in Figure 4, where allele-specific binding on rs151202 was observed in the ATAC-seq data (Figure 4B). Motif analysis indicated that this SNP weakened the binding of transcription factor AP-2 gamma (TFAP2C). A previous 3C study reported that this SNP is located in a regulatory region that interacts with two target genes, DHX29 and SKIV2L2 (Teng et al., 2015). Interestingly, we observed strong allele-specific expression differences of both of these genes at SNP loci in the coding-regions, rs3761764 (p-value = 0.031) and rs2061242 (p-value = 1.35E-07), respectively (Figure 4B).

FIGURE 4

These results suggest that the rs151202 variant at chr5:54529604 likely disturbed TFAP2C binding, resulting in differential expression of DHX29 and SKIV2L2 genes. Interestingly, TFAP2C is a key regulator of hormone responsiveness in breast carcinoma cells through the control of multiple estrogen signaling pathways (Gee et al., 2009). It is also noteworthy that the overexpression of DHX29 can promote cancer cell growth in culture and in xenografts (Parsyan et al., 2009). Taken together, these results demonstrate that regSNPs-ASB has the potential to identify causal regulatory SNPs that affect TF binding and further impact the expression levels of target genes.

Correlation of Allele-Specific TF Binding and Allele-Specific Gene Expression

We next asked whether the regulatory SNPs identified by regSNPs-ASB could be found in a list of eQTL signals. To address this question, we compared all of the regulatory SNPs identified in MCF-7 and MSC to the eQTLs in the GTEx database. We found that 30.2% (16/53) of the regulatory SNPs in MCF-7 and 42.6% (52/122) of the regulatory SNPs in MSC were also eQTLs (Supplementary Tables S3, S4). This enrichment was significant when compared to non-regulatory SNPs (chi-square test, p-value = 1.52E-08). Thus, SNPs identified by regSNPs-ASB that are also in eQTLs could be prioritized for experimental validation of their functional role as the causal variants involved in the target gene regulation.

Based on the finding that regulatory SNPs are significantly enriched in eQTLs, we further tested whether our method could be used to identify plausible causal SNPs that modulate TF binding from a list of GWAS SNPs associated with specific cell systems. Notably, rs7943121 was identified as a SNP that weakens the binding of the transcriptional repressor CCCTC-binding factor (CTCF) (FDR = 0.033, Figure 5A). This SNP was also identified as an eQTL in breast mammary tissue that could modulate SPT2 chromatin protein domain containing 1 (SPTY2D1) expression (Figure 5B). Furthermore, we also observed that rs7943121 was in strong LD (r² > 0.8) with rs10832963, which is one of the most significant GWAS-SNPs associated with breast cancer (Figure 5C; Michailidou et al., 2017). Overall, these results provide compelling evidence that regulatory SNPs detected by regSNPs-ASB can potentially be the drivers of previous-reported eQTL and GWAS signals.

FIGURE 5

Discussion

eQTLs and GWAS are conventional methods for mapping association variants that have identified tens of thousands of loci that are highly correlated with gene expression and common diseases, respectively. However, screening for the true causal variants remains a challenging problem. In this study, we introduced a computational method, regSNPs-ASB, for detecting regulatory SNPs that affect transcription factor binding using ATAC-seq data. The major conclusion of this study is that by integrating ATAC-seq data with RNA-seq expression data, chromatin conformation data and LD information from SNP datasets, regSNPs-ASB is able to efficiently evaluate the regulatory functions of SNPs in disease contexts. This conclusion is based on the following evidence: First, we systematically detected potential regulatory SNPs with allele-specific differences in TF occupancy using a statistical test. Second, we provided evidence that regulatory SNPs can affect the expression of their target genes. Finally, we found that regulatory SNPs are significantly enriched within expression-correlated variants compared with non-regulatory SNPs in the GTEx database. Collectively, these findings demonstrate that the regSNPs-ASB algorithm can be used to identify causal variants from ATAC-seq data, and thereby, further our understanding of the molecular mechanisms of complex diseases.

Our method is developed based on the established strategy for conducting allele-specific analysis on many types of sequencing data, which specifically focus on the sequencing reads on heterozygous loci. This includes allele-specific gene expression (Zhang et al., 2014; Castel et al., 2015), allele-specific alternative splicing analysis (Nembaware et al., 2008; Li et al., 2012), allele-specific binding of ChIP-seq (de Santiago et al., 2017) and CLIP-seq data analysis (Yang et al., 2019), allele-specific chromatin interaction (Cavalli et al., 2019), and allele-specific chromatin accessibility (Harvey et al., 2014; Zhang et al., 2019). To our knowledge, our method is the first to analyze allele-specific footprint.

When designing regSNPs-ASB, we considered several potential issues that could arise from technical artifacts. On the one hand, the intrinsic sequence bias of DNase I and Tn5 enzymes has a striking effect on the average cut profile over a specific TF-binding motif, which is a major limitation of current genomic footprinting methods (Yardımcı et al., 2014; Xu et al., 2017). regSNPs-ASB effectively eliminates the impact of sequence preference bias, since this algorithm focuses on identifying the differences in the cut-site distribution between two alleles of a single locus, where the technical variability should be the same. For the same reason, our model is insensitive to the differences in the intrinsic footprint patterns from different TF domains. On the other hand, cancer cell lines often contain copy-number variants, which complicate the detection of allele-specific TF binding. Traditional ChIP-based methods for identifying allele-specific TF binding (Wei et al., 2012; Cavalli et al., 2016), cannot effectively distinguish gene expression changes that are caused by allele-specific copy number amplification from the effects caused by allele-specific TF binding. regSNPs-ASB ignores this feature of cancer genomes because it detects differences in footprint shape that result from TF-binding, rather than the imbalance of read counts from different alleles. By avoiding these technical problems, regSNPs-ASB effectively screens for allele-specific TF binding in an unbiased manner.

We also note that there are some limitations of regSNPs-ASB. First, although this model can effectively evaluate TFs with shallow footprints that are caused by transient interactions with DNA, regSNPs-ASB, like all other footprinting-based computational algorithms, cannot detect TFs that do not leave footprints because of their short occupancy time (on the order of seconds). In that case, the differential TF binding events cannot be detected by ATAC-seq and technological improvements will be needed to capture those events. Second, the proposed method can only be applied on the heterozygous loci in the genome, and cannot be used to identify the functional variants on the homozygous variants. In addition, regSNPs-ASB infers TF occupancy from open chromatin regions that contain a probabilistic match to TF consensus recognition motifs, which limits the ability to identify TFs whose motif sequences are not known. However, this limitation will diminish as new TF motifs are discovered and included in the JASPAR database.

It should be noted that the proposed methods are designed for the cell systems with diploid genomes, which may not be appropriate to study the cell systems with complicated chromosome rearrangement, including tumors and cancer cells that have aberrant CNV (copy number variation) and LOH (lost of heterogeneity). In theory, our method can be applied on the genomic loci with somatic mutations. At such loci, however, since only a proportion of the cells will carry the same mutation, the power of our method for detecting such variants may be compromised.

In summary, we show that the regSNPs-ASB algorithm is effective in identifying candidate causal SNPs from ATAC-seq data. This new computational tool will enable efficient prioritization of genetic variants identified by association studies for further studies to validate their causal regulatory function. Ultimately, identifying causal genetic variants will further our understanding of the underlying molecular mechanisms of disease.

Materials and Methods

Cell Lines

Human MCF-7 breast cancer cells were purchased from ATCC (Manassas, VA, United States) and were authenticated using cell line authentication services of Genetica (Burlington, NC, United States). Cells were maintained in minimal essential media (MEM) plus 10% fetal bovine serum (FBS) with penicillin and streptomycin. Media was changed to phenol red-free MEM with 5% charcoal-dextran treated FBS for at least 3 days prior to experiments. We generated 3 ATAC-seq libraries and 2 RNA-seq libraries from MCF-7 cells. Each individual library was derived from three technical replicates. Human mesenchymal stem cells (MSCs) were purchased from Lonza (Walkersville, MD, United States) and were tested for purity by flow cytometry. MSCs were thawed and the culture process was initiated by plating in tissue culture flasks (Corning, Corning, NY, United States) containing MSC growth medium (Lonza) at 37°C in 5% CO₂-90% humidity according to the manufacturer’s instructions. We generated 3 ATAC-seq libraries from MSC and each individual library was derived from two technical replicates.

ATAC-seq Experimental Procedure

Assay for transposase-accessible chromatin with high-throughput sequencing was performed according to the published protocol (Buenrostro et al., 2013). Briefly, cells were collected in cold PBS and cell membranes were disrupted in cold lysis buffer (10 mM Tris–HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl₂ and 0.1% IGEPAL CA-630). The nuclei were pelleted and resuspended in the transposase reaction mix containing 25 μL 2 × TD buffer, 2.5 μL transposase (Illumina) and 22.5 μL nuclease-free water. Directly following transposition, the sample was purified using a Qiagen MinElute kit. Following purification, libraries were amplified using 1 × NEBnext PCR master mix and 1.25 μM custom Nextera PCR primers. AMPure XP beads (Beckman Coulter) were used to purify the transposed DNA and the amplified PCR products. All libraries were sequenced on a 100-cycle paired-end run on an Illumina NovaSeq 6000 instrument.

RNA-seq Experimental Procedure

Total RNA was prepared using a RNeasy kit (Qiagen). The concentration and quality of total RNA samples was first assessed using an Agilent 2100 Bioanalyzer. A RIN (RNA integrity number) of five or higher was required to pass the quality control. A TruSeq Stranded mRNA Library Prep Kit (Illumina) was used to prepared single-indexed strand-specific cDNA libraries from 500 nanograms of RNA per sample. The resulting libraries were quantified using a Qubit and the size distribution was assessed using an Agilent 2100 Bioanalyzer. Pooled libraries (1.5 picomoles) were sequenced with 2 × 75 bp paired-end configuration on a HiSeq 4000 instrument (Illumina). A Phred quality score (Q score) was used to measure the quality of sequencing. More than 90% of the sequencing reads reached Q30 (99.9% base call accuracy).

Preprocessing for Identifying Allele-Specific Transcription Factor Binding

For ATAC-seq footprinting analysis, all of the read start sites were adjusted to represent the center of the transposon binding event; reads aligning to the forward strand were offset by + 4 bp and reads aligning to the reverse strand were offset by -5 bp (Buenrostro et al., 2013). MACS2 (Feng et al., 2011) was used with default parameters to identify all ATAC-seq peaks. Peaks with <200 cutting sites were removed from downstream analysis to minimize the effect of nucleosome-bound regions. Heterozygous SNPs were identified using samtools-1.6 and bcftools-1.6 with the parameters mpileup -uf and view -Nvcg, respectively (Li et al., 2009). SNPs were further filtered by VcfFilter with the parameter DP > 10 and MQ > 20 (Erik, 2012). In each open chromatin region, we used FIMO (Grant et al., 2011) to recognize potential TF binding sites from both reference and alternative alleles. Lastly, we used BEDTools IntersectBed (Quinlan and Hall, 2010) to merge the loci of heterozygous variants and potential TF binding sites. The set of TF binding sites intersecting SNPs were regarded as potential allele-specific TF binding sites.

Transcription Factor Motif Disruption Analysis

The position-specific scoring matrices (PSSM) of the candidate TFs were retrieved from the JASPAR database (Khan et al., 2018) and used to annotate the potential regulatory effects of the tested SNPs on TF motifs. The magnitude of the change in binding affinity was calculated as the absolute difference (delta) of PSSM scores, that is delta (PSSM) = PSSM(ref) – PSSM(alt).

regSNPs-ASB Computational Model Description

We explicitly modeled the counts of DNA fragments for each potential binding site using a generalized linear model. The total number of sequencing reads within a given region of the genome approximately follow a negative binomial distribution.

For each potential allele-specific TF binding site, we built the following generalized linear regression model to fit the DNA fragment counts:

where y is the number of ATAC-seq reads in which the Tn5 cleavage position mapped to a specific region and allele; x_r and x_a are binary predictor variables that indicate the region where a cleavage event happened (0 = binding site and 1 = flanking region) and the allele that the fragment mapped to (0 = reference allele and 1 = alternative allele), respectively (Figure 1B); β_r and β_a are the regression coefficients used to estimate the relationship between the scalar response and x_r and x_a. In addition, β₀ indicates the average sequencing depth around a heterozygous site and ε is the random error. Logarithm is the canonical link function when the response variable follows a negative binomial distribution. Our null hypothesis (H₀) is: β_int = 0. Rejecting the null hypothesis indicates that the allelic imbalance differs between reference and alternative alleles. A positive or negative β_int value indicates a gain or loss of binding ability to the variant, respectively. False discovery rate was calculated using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995).

Regulatory SNPs Target Gene Identification

For each regulatory SNP, we first used SnpSift (Cingolani et al., 2012) to retrieve the corresponding reference SNP ID number (rsID) before querying the SNPs to the Short Genetic Variations database (dbSNP). ANNOVAR (Hakonarson et al., 2010) was used to annotate the location of each SNP. Variants located within a 1 kb region upstream or downstream of a transcription start site were considered as potential promoter regions. The remaining regulatory SNPs were considered as enhancers using chromatin interaction data from the 4DGenome database². Each record in this database includes three parts: a chromatin interaction between two genomic regions, genes located in the interacting regions, and the experimentally-derived or computationally-predicted data used to detect the genomic interaction. BEDTools IntersectBed (Quinlan and Hall, 2010) was used to examine overlaps between regulatory SNPs and interacting chromatin regions. For any genomic region overlapping with regulatory SNPs, genes located in the paired interacting region were recognized as potential target genes for the corresponding regulatory SNPs.

Allele-Specific Expression Analysis

Typical allele-specific expression analysis seeks to capture allelic imbalance of reference and alternative alleles in RNA-seq read counts covering heterozygous sites. Under the null hypothesis of balanced expression, the fraction of allelic read counts is expected to fit a binomial distribution (N, 0.5). For each regulatory SNP target gene, we used a binomial test to detect whether allele-specific expression occurred at each coding SNP based on the corresponding RNA-seq data. All target genes that reject the null hypothesis (P value < 0.05) were considered genes with allele-specific expression.

SNP Linkage Disequilibrium Analysis

We used the 1000 Genome Project Phase 3 variants³ (Auton et al., 2015) and plink2 (Chang et al., 2015) to conduct the linkage disequilibrium analysis. First, vcf file formats were converted to the corresponding pgen, psam, and pvar file formats using plink –vcf. Then we calculated r² and D’ between each SNP and disease-correlated GWAS-SNP using plink –out. The linkage disequilibrium results were plotted by the genetic variant-centered annotation browser, SNiPA (Arnold et al., 2015).

eQTL Analysis

The eQTL analysis was performed using GTEx v7 data⁴. We classified a regulatory SNP as a genetic variant exerting regulatory effects on the expression of gene if it was completely overlapping with any eQTL in the corresponding tissue.

Statements

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author contributions

YL and SX conceptualized the project and drafted the manuscript. SX performed the bioinformatic analyses. SX, JR, ZL, CY, and YL wrote and edited the manuscript. HN, HG, YW, and YL contributed to the design of the experiments and data interpretation. WS participated in the development of methodology. YL and WF gave the research direction. All authors contributed to the article and approved the submitted version.

Funding

The funding for this study was partially supported by Indiana University Precision Health Initiative.

Acknowledgments

We thank Dr. Andy B. Chen for their help on R Shiny Development.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe.2020.00886/full#supplementary-material

FIGURE S1

(A) The histogram of the number of heterogeneous variants on each TF in MSC. (B) The histogram of the number of TF binding sites on each variant in MSC. (C) The histogram of the number of heterogeneous variants on each TF in MCF7. (D) The histogram of the number of TF binding sites on each variant in MCF7.

TABLE S1

GLM results of MCF-7.

TABLE S2

GLM results of MSC.

TABLE S3

eQTL analysis and regulatory SNP annotation of MCF-7.

TABLE S4

eQTL analysis and regulatory SNP annotation of MSC.

TABLE S5

Raw RNA-seq data of MCF-7.

TABLE S6

Chromatin interaction experimental data of MCF-7.

TABLE S7

ASE analysis for the target genes of regulatory SNPs in MCF-7.

Footnotes

1.^https://yunlongliulab.shinyapps.io/regsnps-asb/

2.^https://4dgenome.research.chop.edu/

3.^hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

4.^https://gtexportal.org/home/datasets

References

1
AdeyA.MorrisonH. G.Asan, XunX.KitzmanJ. O.TurnerE. H.et al (2010). Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition.Genome Biol.11:R119. 10.1186/gb-2010-11-12-r119
2
ArnoldM.RafflerJ.PfeuferA.SuhreK.KastenmüllerG. (2015). SNiPA: an interactive, genetic variant-centered annotation browser.Bioinformatics311334–1336. 10.1093/bioinformatics/btu779
3
AutonA.AbecasisG. R.AltshulerD. M.DurbinR. M.AbecasisG. R.BentleyD. R.et al (2015). A global reference for human genetic variation.Nature52668–74. 10.1038/nature15393
4
BenjaminiY.HochbergY. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing.J. R. Stat. Soc. B (Methodological)57289–300. 10.1111/j.2517-6161.1995.tb02031.x
- CrossRef
- Google Scholar
5
BuenrostroJ. D.GiresiP. G.ZabaL. C.ChangH. Y.GreenleafW. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin. DNA-binding proteins and nucleosome position.Nat. Methods101213–1218. 10.1038/nmeth.2688
6
CastelS. E.Levy-MoonshineA.MohammadiP.BanksE.LappalainenT. (2015). Tools and best practices for data processing in allelic expression analysis.Genome Biol.16:195. 10.1186/s13059-015-0762-766
- CrossRef
- Google Scholar
7
CavalliM.BaltzerN.UmerH. M.GrauJ.LemnianI.PanG.et al (2019). Allele specific chromatin signals, 3D interactions, and motif predictions for immune and B cell related diseases.Sci. Rep.9:2695. 10.1038/s41598-019-39633-39630
- CrossRef
- Google Scholar
8
CavalliM.PanG.NordH.Wallén ArztE.WallermanO.WadeliusC. (2016). Allele-specific transcription factor binding in liver and cervix cells unveils many likely drivers of GWAS signals.Genomics107248–254. 10.1016/j.ygeno.2016.04.006
9
ChangC. C.ChowC. C.TellierL. C. A. M.VattikutiS.PurcellS. M.LeeJ. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets.GigaScience4:7. 10.1186/s13742-015-0047-48
- CrossRef
- Google Scholar
10
ChenX.YuB.CarrieroN.SilvaC.BonneauR. (2017). Mocap: large-scale inference of transcription factor binding sites from chromatin accessibility.Nucleic Acids Res.454315–4329. 10.1093/nar/gkx174
11
CingolaniP.PlattsA.WangL. L.CoonM.NguyenT.WangL.et al (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms. SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.Fly680–92. 10.4161/fly.19695
12
CooksonW.LiangL.AbecasisG.MoffattM.LathropM. (2009). Mapping complex disease traits with global gene expression.Nat. Rev. Genet.10184–194. 10.1038/nrg2537
13
de SantiagoI.LiuW.YuanK.O’ReillyM.ChilamakuriC. S. R.PonderB. A. J.et al (2017). BaalChIP: bayesian analysis of allele-specific transcription factor binding in cancer genomes.Genome Biol.1839–39. 10.1186/s13059-017-1165-1167
- CrossRef
- Google Scholar
14
DegnerJ. F.PaiA. A.Pique-RegiR.VeyrierasJ.-B.GaffneyD. J.PickrellJ. K.et al (2012). DNase I sensitivity QTLs are a major determinant of human expression variation.Nature482390–394. 10.1038/nature10808
15
D’AntonioM.WeghornD.D’Antonio-ChronowskaA.CouletF.OlsonK. M.DeBoeverC.et al (2017). Identifying DNase I hypersensitive sites as driver distal regulatory elements in breast cancer.Nat. Commun.8:436. 10.1038/s41467-017-00100-x
16
ErikG. (2012). Vcflib: A C++ Library for Parsing and Manipulating VCF Files. GitHub. Available online at: https://github.com/ekg/vcflib
- Google Scholar
17
FagnyM.PaulsonJ. N.KuijjerM. L.SonawaneA. R.ChenC.-Y.Lopes-RamosC. M.et al (2017). Exploring regulation in tissues with eQTL networks.Proc. Natl. Acad. Sci.114E7841–E7850. 10.1073/pnas.1707375114
18
FengJ.LiuT.ZhangY. (2011). Using MACS to identify peaks from ChIP-Seq data.Curr. Protoc. Bioinformatics.Chap. 2: Unit2.14–2.14. 10.1002/0471250953.bi0214s34
19
GeeJ. M. W.ElorantaJ. J.IbbittJ. C.RobertsonJ. F. R.EllisI. O.WilliamsT.et al (2009). Overexpression of TFAP2C in invasive breast cancer correlates with a poorer response to anti-hormone therapy and reduced patient survival.J. Pathol.21732–41. 10.1002/path.2430
20
GrantC. E.BaileyT. L.NobleW. S. (2011). FIMO: scanning for occurrences of a given motif.Bioinformatics271017–1018. 10.1093/bioinformatics/btr064
21
HakonarsonH.LiM.WangK. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.Nucleic Acids Res.38:e164. 10.1093/nar/gkq603
22
HarveyC. T.MoyerbraileanG. A.DavisG. O.WenX.LucaF.Pique-RegiR. (2014). QuASAR: quantitative allele-specific analysis of reads.Bioinformatics311235–1242. 10.1093/bioinformatics/btu802
23
JohnstonA. D.Simões-PiresC. A.ThompsonT. V.SuzukiM.GreallyJ. M. (2019). Functional genetic variants can mediate their regulatory effects through alteration of transcription factor binding.Nat. Commun.10:3472. 10.1038/s41467-019-11412-11415
- CrossRef
- Google Scholar
24
KhanA.FornesO.StiglianiA.GheorgheM.Castro-MondragonJ. A.van der LeeR.et al (2018). JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.Nucleic Acids Res.46D260–D266. 10.1093/nar/gkx1126
25
LiG.BahnJ.LeeJ.-H.PengG.ChenZ.NelsonS.et al (2012). Identification of allele-specific alternative mRNA processing via transcriptome sequencing.Nucleic Acids Res.40:e104. 10.1093/nar/gks280
26
LiH.HandsakerB.WysokerA.FennellT.RuanJ.HomerN.et al (2009). The Sequence Alignment/Map format and SAMtools.Bioinform. (Oxford, England)252078–2079. 10.1093/bioinformatics/btp352
27
LiZ.SchulzM. H.LookT.BegemannM.ZenkeM.CostaI. G. (2019). Identification of transcription factor binding sites using ATAC-seq.Genome Biol.20:45. 10.1186/s13059-019-1642-1642
- CrossRef
- Google Scholar
28
LinH.HargreavesK. A.LiR.ReiterJ. L.WangY.MortM.et al (2019). RegSNPs-intron: a computational framework for predicting pathogenic impact of intronic single nucleotide variants.Genome Biol.20:254. 10.1186/s13059-019-1847-1844
- CrossRef
- Google Scholar
29
LuZ.HofmeisterB. T.VollmersC.DuBoisR. M.SchmitzR. J. (2017). Combining ATAC-seq with nuclei sorting for discovery of cis-regulatory regions in plant genomes.Nucleic Acids Res.45:e41. 10.1093/nar/gkw1179
30
MauranoM. T.HumbertR.RynesE.ThurmanR. E.HaugenE.WangH.et al (2012). Systematic localization of common disease-associated variation in regulatory DNA.Science3371190–1195. 10.1126/science.1222794
31
MichailidouK.LindströmS.DennisJ.BeesleyJ.HuiS.KarS.et al (2017). Association analysis identifies 65 new breast cancer risk loci.Nature55192–94. 10.1038/nature24284
32
MusunuruK.StrongA.Frank-KamenetskyM.LeeN. E.AhfeldtT.SachsK. V.et al (2010). From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus.Nature466714–719. 10.1038/nature09266
33
NembawareV.LupindoB.SchouestK.SpillaneC.SchefflerK.SeoigheC. (2008). Genome-wide survey of allele-specific splicing in humans.BMC Genomics9:265. 10.1186/1471-2164-9-265
34
NicaA. C.DermitzakisE. T. (2013). Expression quantitative trait loci: present and future.Philos. Trans. R. Soc. Lond. B Biol. Sci.368:20120362. 10.1098/rstb.2012.0362
35
ParsyanA.ShahbazianD.MartineauY.PetroulakisE.AlainT.LarssonO.et al (2009). The helicase protein DHX29 promotes translation initiation, cell proliferation, and tumorigenesis.Proc. Natl. Acad. Sci.10622217–22222. 10.1073/pnas.0909773106
36
QuachB.FureyT. S. (2017). DeFCoM: analysis and modeling of transcription factor binding sites using a motif-centric genomic footprinter.Bioinformatics33956–963. 10.1093/bioinformatics/btw740
37
QuinlanA. R.HallI. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features.Bioinformatics26841–842. 10.1093/bioinformatics/btq033
38
SchwessingerR.SuciuM. C.McGowanS. J.TeleniusJ.TaylorS.HiggsD. R.et al (2017). Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints.Genome Res.271730–1742. 10.1101/gr.220202.117
39
SungM.-H.BaekS.HagerG. L. (2016). Genome-wide footprinting: ready for prime time?Nat. Methods13222–228. 10.1038/nmeth.3766
40
TengL.HeB.WangJ.TanK. (2015). 4DGenome: a comprehensive database of chromatin interactions.Bioinformatics312560–2564. 10.1093/bioinformatics/btv158
41
ThurmanR. E.RynesE.HumbertR.VierstraJ.MauranoM. T.HaugenE.et al (2012). The accessible chromatin landscape of the human genome.Nature48975–82. 10.1038/nature11232
42
van der WijstM.de VriesD. H.GrootH. E.TrynkaG.HonC. C.BonderM. J.et al (2020). The single-cell eQTLGen consortium.eLife9:e52155. 10.7554/eLife.52155
43
Vaz-DragoR.CustódioN.Carmo-FonsecaM. (2017). Deep intronic mutations and human disease.Hum. Genet.1361093–1111. 10.1007/s00439-017-1809-1804
- CrossRef
- Google Scholar
44
WeiY.LiX.WangQ.-F.JiH. (2012). iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets.BMC Genom.13:681. 10.1186/1471-2164-13-681
45
WeiZ.ZhangW.FangH.LiY.WangX. (2018). esATAC: an easy-to-use systematic pipeline for ATAC-seq data analysis.Bioinformatics342664–2665. 10.1093/bioinformatics/bty141
46
XuS.WangY.LiuH.ChenD.BiH.FengW. (2017). “A new method for alleviating sequence-specific biases in DNase-seq,” in proceeedings of th 2017 First International Conference on Electronics Instrumentation & Information Systems (EIIS), (Harbin: IEEE), 1–5.
- Google Scholar
47
YangE.-W.BahnJ. H.HsiaoE. Y.-H.TanB. X.SunY.FuT.et al (2019). Allele-specific binding of RNA-binding proteins reveals functional genetic variants in the RNA.Nat. Commun.10:1338. 10.1038/s41467-019-09292-w
48
YardımcıG. G.FrankC. L.CrawfordG. E.OhlerU. (2014). Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection.Nucleic Acids Res.4211865–11878. 10.1093/nar/gku810
49
ZhangS.MoyW.ZhangH.McGowanH.ShiJ.LeitesC.et al (2019). Landscape of allele-specific open chromatin in human ipsc-differentiated neurons and it implication for mental disorders.Eur. Neuropsychopharmacol.29S799–S800. 10.1016/j.euroneuro.2017.08.034
- CrossRef
- Google Scholar
50
ZhangS.WangF.WangH.ZhangF.XuB.LiX.et al (2014). Genome-wide identification of allele-specific effects on gene expression for single and multiple individuals.Gene533366–373. 10.1016/j.gene.2013.09.029

Summary

Keywords

expression quantitative trait loci, allele-specific binding, transcription factor, ATAC-seq, functional single-nucleotide polymorphisms, computational biology, bioinformatics, transcriptional regulation

Citation

Xu S, Feng W, Lu Z, Yu CY, Shao W, Nakshatri H, Reiter JL, Gao H, Chu X, Wang Y and Liu Y (2020) regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data. Front. Bioeng. Biotechnol. 8:886. doi: 10.3389/fbioe.2020.00886

Received

06 April 2020

Accepted

09 July 2020

Published

29 July 2020

Volume

8 - 2020

Edited by

Mulin Jun Li, Tianjin Medical University, China

Reviewed by

Xinghu Shi, Temple University, United States; Lei Wei, University at Buffalo, United States

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yunlong Liu, yunliu@iu.edu

This article was submitted to Computational Genomics, a section of the journal Frontiers in Bioengineering and Biotechnology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Computational Genomics

ORIGINAL RESEARCH article

regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data

Abstract

Introduction

Results

Overall Strategy and Data Preprocessing for regSNPs-ASB

Detection of Allele-Specific TF Binding Events in MSC and MCF7 Cells

Correlation of Allele-Specific TF Binding and Allele-Specific Gene Expression

Correlation of Allele-Specific TF Binding and Allele-Specific Gene Expression

Discussion

Materials and Methods

Cell Lines

ATAC-seq Experimental Procedure

RNA-seq Experimental Procedure

Preprocessing for Identifying Allele-Specific Transcription Factor Binding

Transcription Factor Motif Disruption Analysis

regSNPs-ASB Computational Model Description

Regulatory SNPs Target Gene Identification

Allele-Specific Expression Analysis

SNP Linkage Disequilibrium Analysis

eQTL Analysis

Statements

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Supplementary material

Footnotes

References

Summary

Outline

Figures

Cite article

Article metrics

ORIGINAL RESEARCH article

regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data

Abstract

Introduction

Results

Overall Strategy and Data Preprocessing for regSNPs-ASB

Detection of Allele-Specific TF Binding Events in MSC and MCF7 Cells

Correlation of Allele-Specific TF Binding and Allele-Specific Gene Expression

Correlation of Allele-Specific TF Binding and Allele-Specific Gene Expression

Discussion

Materials and Methods

Cell Lines

ATAC-seq Experimental Procedure

RNA-seq Experimental Procedure

Preprocessing for Identifying Allele-Specific Transcription Factor Binding

Transcription Factor Motif Disruption Analysis

regSNPs-ASB Computational Model Description

Regulatory SNPs Target Gene Identification

Allele-Specific Expression Analysis

SNP Linkage Disequilibrium Analysis

eQTL Analysis

Statements

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Supplementary material

Footnotes

References

Summary

Outline

Figures

Cite article

Share article

Article metrics