The genetic determinants of the relative expression of alternative 3’UTR isoforms in human

In the last decades, genome wide association studies (GWAS) have uncovered tens of thousands of associations between common genetic variants and complex diseases. However, these statistical associations can rarely be interpreted functionally and mechanistically. As the majority of the disease-associated variants are located far from coding sequences, even the relevant gene is often unclear. A way to gain insight into the relevant mechanisms is to study the genetic determinants of intermediate molecular phenotypes, such as gene expression and transcript structure. We propose a computational strategy to discover genetic variants affecting the relative expression of alternative 3’ untranslated region (UTR) isoforms, generated through alternative polyadenylation, a widespread post-transcriptional regulatory mechanism known to have relevant functional consequences. When applied to a large dataset in which whole genome and RNA sequencing data are available for 373 European individuals, 2,530 genes with alternative polyadenylation quantitative trait loci (apaQTL) were identified. We analyze and discuss possible mechanisms of action of these variants, and we show that they are significantly enriched in GWAS hits, in particular those concerning immune-related and neurological disorders. Our results point to an important role for genetically determined alternative polyadenylation in affecting predisposition to complex diseases, and suggest new ways to extract functional information from GWAS data.

1 polyadenylation specificity factor (CPSF) that, together with other protein complexes, induces the cleavage of the transcript in correspondence of the downstream poly(A) site. The large majority of human genes have multiple poly(A) sites, so that alternative polyadenylation (APA) is a widespread phenomenon contributing to the diversification of the human transcriptome through the generation of alternative mature transcripts with different 3' ends. Such transcripts are translated into identical proteins, but protein level, localization, and even interactions can depend on the 3' end of the transcript [16].
APA events have been grouped into classes based on the location of the alternative poly(A) site and the type of change determined by their differential usage [12]. In this work we have taken into consideration only the simplest and most frequent mode (tandem 3'UTR APA), in which two poly(A) sites located within the same terminal exon, one in a proximal and one in a distal position, produce transcripts that differ only in the length of the 3'UTR. Such variation in 3'UTR length can have an important functional impact, for example by affecting the binding of microRNAs and RNA binding proteins and thus transcript abundance, translation and localization. Moreover, APA regulation is strongly tissue-and cell type-dependent [17,18,19,20,21] and several examples are known of altered APA regulation associated to human diseases [7,22].
How genetic variants influence APA has not been comprehensively investigated in a large human population yet. A recent analysis of whole genome sequencing (WGS) data from [8] found hundreds of common single nucleotide polymorphisms (SNPs) causing the alteration or degradation of motifs that are similar to the canonical PAS [23], but did not extend the analysis to other possible mechanisms. Other studies found strong associations between genetic variants and APA regulation [24,25,26,8,9,27], but a systematic investigation based on a large number of samples and variants, specifically targeted to APA rather than generically to transcript structure, and unbiased in the choice of variants to examine, is not yet available.
Here we propose a new computational strategy for the genome-wide investigation of the influence of genetic variants on the expression of alternative 3'UTR isoforms in a large population. In particular, we analyzed WGS data paired with standard RNA-Seq data obtained in 373 European (EUR) individuals [8]. Statistically, our approach is analogous to methods commonly implemented in eQTL mapping analysis and it aims to overcome the limitations illustrated above for the specific purpose of correlating variants to 3' UTR isoforms.
A central task, preliminary to the analysis of genetic variants, is thus the quantification of the alternative 3'UTR isoforms. Various strategies have been implemented to this end, from custom analysis pipelines for microarray data [28], to the development of next-generation sequencing technologies specifically targeted to the 3' end of transcripts, such as the serial analysis of gene expression (SAGE) [20] and sequencing of APA sites (SAPAs) [21], allowing also the identification of previously unannotated APA sites.
More recently, tools able to capture APA events from standard RNA-Seq data have been developed. In general, these approaches can be divided into two categories: those that exploit previous annotation of poly(A) sites [29,30], such the ones provided by PolyA_DB2 [31] and APASdb [32], and those that instead try to infer their location from the data [17]. Although the latter approach potentially allows analyzing also previously unannotated sites, the former leads to higher sensitivity [29,30], and was thus preferred in this study. Undoubtedly, approaches based on standard RNA-Seq are not as powerful and accurate as technologies that specifically sequence the 3' ends. However, they allow studying this phenomenon in an uncomparably larger number of samples and conditions, including the recently generated large-scale transcriptomic datasets of normal individuals that we use in this work.

Genetic variants affect the relative expression of alternative 3'UTR isoforms of thousands of genes
In order to investigate the effect of human genetic variants on the expression of alternative 3'UTR isoforms, we developed a computational approach similar to the one commonly used for eQTL analysis (Fig. 1). It was applied to a large dataset in which WGS data paired with RNA-Seq data are available for 373 European (EUR) individuals (GEUVADIS dataset [8]). A collection of known alternative poly(A) sites [31] was used, together with a compendium of human transcripts, to obtain an annotation of alternative 3'UTR isoforms that was then combined with RNA-Seq data in order to compute, for each gene, the expression ratio between short and long isoform (m/M value) in each individual.
Linear regression was then used to identify associations between the m/M values of each gene and the genetic variants within a cis-window including the gene itself and all sequence located within 1Mbp from the transcription start site (TSS) or the transcription end site (TES). This led to the fitting of ∼30 million linear models, involving ∼6,300 genes and ∼5.3 million variants. About 190,000 models, involving 2,530 genes and ∼160,000 variants, revealed a significant association Our set of significant genes shows only moderate overlap with genes for which eQTLs or transcript ratio QTLs (trQTLs) were reported in Ref. [8] from the same data (Fig. 3). Alternative polyadenylation can result in changes in gene expression levels as a consequence of the isoformdependent availability of regulatory elements affecting the stability of transcripts, such as mi-croRNA binding sites [13]. In this case, apaQTLs should also be eQTLs. However, APA may also have effects that do not imply changes in expression levels, including the modulation of Similarly, a complete overlap with trQTLs is not expected, because they were identified by taking into account all the annotated alternative transcripts of a gene including alternative splicing and transcription initiation: The identification of apaQTLs for several genes for which trQTLs were not identified suggests that focusing on a specific class of transcript structure allows higher sensitivity.
These results show that a large number of genetic determinants of alternative polyadenylation can be inferred from the analysis of standard RNA-Seq data paired with the genotypic characterization on the same individuals.

apaQTLs are preferentially located within active genomic regions
Just like eQTLs, we expect apaQTLs be located within genomic regions that are active in the relevant cell type (lymphoblastoid cells for our data). In order to verify this hypothesis, we superimposed the apaQTLs to the ChromHMM annotation of the human genome for the GM12878 cell line [37], and used logistic regression, as detailed in the Methods, to determine the enrichment or depletion of apaQTLs for each chromatin state, expressed as an odds ratio (OR). As expected, significant ORs > 1 were obtained for active genomic regions, such as transcribed regions, promoters and enhancers, suggesting that genetic variants have a higher probability of being apaQTLs when they are located in active regions. Conversely, apaQTLs were depleted in repressed and inactive chromatin states. Similar results were obtained using broad chromatin states (Fig. 4) Fig. 4 and Supplementary Fig. 5). Taken together, these results show that genetic variants affecting alternative polyadenylation tend to be located in cell-type specific active chromatin regions.
In the following, we will divide apaQTLs in two classes: Intragenic apaQTLs are those located inside one of the genes whose isoform ratio we are able to analyze, while all other apaQTLs will be referred to as extragenic (note that these might be located inside a gene for which we are unable to perform the analysis, for one of the reasons explained in the Methods).

Intragenic apaQTLs are enriched in coding exons and 3'UTRs
Having established that genetic variants have a widespread influence of the expression of alternative 3'UTR isoforms, we turned to their putative mechanisms of action. First of all, we considered the distribution of intragenic apaQTLs among regions contributing to the mRNA vs.
introns. As shown in Fig within both intronic sequences and internal exons, such sites were not taken into account in our analysis. Finally, the depletion of 5' UTRs might be due to the distance of these elements from the polyadenylation loci, and to the fact that these regions are mostly involved in other regulatory mechanisms, such as translational regulation [39]. In the following, we examine in more detail three possible mechanisms by which intragenic apaQTLs could exert their action.

Creation and destruction of PAS motifs
The first possibility is direct interference with the APA regulation, favoring the production of one of the two isoforms in individuals with a particular genotype. A comprehensive atlas of high-confidence PAS has been recently reported [40]. In addition to the canonical PAS motifs (AAUAAA and AUUAAA) it contains 10 previously known signals and 6 new motifs. Exploiting this resource, we were able to identify SNPs that cause the creation or the destruction of putative functional PAS motifs and, as expected, we found that they were enriched among apaQTLs (OR = 1.72, 95% confidence interval (CI) = 1.08 -2.75, P-value = 0.0215). In total, 42 PAS-altering variants were found to be apaQTLs of the gene in which they reside. While expected, this result can be considered to validate our strategy.
A few examples are worth discussing in detail. SNP rs10954213 was shown by several studies [41, 24,42] to determine the preferential production of the short isoform of the IRF5 transcription factor through the conversion of an alternative PAS motif (AAUGAA) into the canonical one (AAUAAA) in a proximal position within the 3'UTR. Consistently, we found that this variant is associated with higher prevalence of the short isoform (Fig. 6A,B). Moreover, the same variant was associated to higher risk of systemic lupus erythematosus (SLE), and higher IRF5 expression, that could be due to the loss of AU-rich elements (ARE) in the short transcript isoform [24].
Globally, these findings are in agreement with the known involvement of IRF5 in several pathways that are critical for the onset of SLE (Type I IFN production, M1 macrophage polarization, autoantibody production, and induction of apoptosis [43]).
A similar trend was detected in the case of the rs9332 variant, located within the 3'UTR of the MTRR gene, encoding an enzyme essential for methionine synthesis ( Supplementary Fig. 6A).
This variant was reported to be associated with a higher risk of spina bifida, along with other variants within the same gene [44]. We found that the variant is associated with the increased relative expression of the short isoform of the MTRR transcript, as a consequence of the creation of a proximal canonical PAS. We can thus speculate that, similarly to what was shown for IRF5, this post-transcriptional event could lead to a variation in the activity of the enzyme activity and ultimately to increased disease susceptibility.
The same mechanism might provide putative mechanistic explanations for associations found by GWAS studies. For example we found the variant rs5855 to be an apaQTL for the PAM gene ( Supplementary Fig. 6B

Alteration of microRNA binding
In an alternative scenario, genetic variants can influence the relative expression of alternative 3'UTR isoforms by acting on the stability of transcripts, for example through the creation or destruction of microRNA binding sites. For each gene with alternative 3'UTR isoforms, we divided the 3' UTR into two segments: the "PRE" segment, common to both isoforms, and the "POST" segment expressed only by the longer isoform. Variants altering microRNA binding sites located in the POST segment can result in the variation of the relative isoform expression since they affect only the expression of the long isoform.
For example, we found that the rs8984 variant is associated with an increased prevalence of the long transcript isoform of the CHURC1 gene, an effect that could be due to the destruction of a binding site recognized by microRNAs of the miR-582-5p family within the POST segment of the gene ( Supplementary Fig. 7). More generally, we found that apaQTLs are enriched, albeit slightly, among the genetic variants that create or break putative functional microRNA binding sites (OR = 1.15, 95% CI = 1.02 -1.30, P-value = 0.022). However, we could not find significant agreement between the predicted and actual direction of the change in isoform ratios for these cases. Together with the marginal significance of the enrichment, this result suggests that the alteration of microRNA binding sites is not among the most relevant mechanisms in the genetic determination of 3' UTR isoform ratios.

Extragenic apaQTLs act in-cis through the perturbation of regulatory elements
Understanding the function of extragenic apaQTLs is less straightforward because, although there are few examples of DNA regulatory elements contributing to APA regulation [14], it is commonly believed that APA is mainly controlled by cis-elements located within transcripts, both upstream and downstream of the poly(A) sites [13].
To further explore this aspect we took advantage of a different annotation of active genome regions, which includes the association between regulatory regions and target genes, namely the cis-regulatory domains (CRDs) identified in lymphoblastoid cell lines in Ref. [61]. Extragenic apaQTLs were indeed found to be enriched in CRDs (OR = 1.73, 95% CI = 1.69 -1.78, P-value < 10 −16 ). The 3D structure of the genome is a key aspect of gene regulation [62], as it determines physical contacts between distal regulatory regions and proximal promoters. In particular, CRDs have been described as active sub-domains within topologically associated domains (TADs), containing several non-coding regulatory elements, both proximal and distal. The perturbation of those regulatory elements by genetic variants can lead to the alteration of gene expression and perhaps interfere with other processes such as alternative polyadenylation, as suggested by our results. Importantly, CRDs have been assigned to the nearby genes they regulate. We could thus observe that extragenic apaQTLs tend to fall within CRDs that have been associated with their target genes much more frequently than expected by chance. Indeed, this correspondence was verified for 27,527 extragenic apaQTLs, while the same degree of concordance was never obtained in 100 permutations in which each extragenic apaQTL was randomly associated to a gene in its cis-regulatory window (median number of correspondences 12,571). These results suggest an important role of genetic variants located in active, non-transcribed cis-regulatory regions in regulating alternative polyadenylation of the target genes.

A role for apaQTLs in complex diseases
Since common genetic variation is involved in complex diseases, often by affecting gene regulation, a natural question is whether apaQTLs can be used to provide a mechanistic explanation for  Table 2). However, a strong enrichment was also detected for almost all the other We observed that apaQTLs that are also GWAS hits often map to genes in the human leukocyte antigen (HLA) locus, suggesting that in at least some cases the enrichment could be mostly driven by this genomic region. Somewhat unexpectedly, this was particularly evident for neurological disorders. In order to clarify this point, we evaluated all enrichments after excluding the variants in the HLA locus. Although in some cases the OR decreased after removing HLA variants, for most GWAS categories the enrichment was still significant (Supplementary Fig. 8 and

2.6
The effect of genetic variants on APA can be confirmed in patients As briefly discussed above, the rs10954213 variant is associated with a higher risk of SLE. Evidence about the related molecular mechanism arose from the analysis of cell lines derived from healthy individuals [41,42], and the effect of the variant on IRF5 expression in blood cells was confirmed in SLE patients [64,65]. However, direct evidence on the effect of this variant on APA regulation in SLE patients is still missing.
In order to assess whether rs10954213 affects IRF5 APA regulation in SLE patients, we analyzed RNA-Seq data derived from whole blood cells in 99 patients [66]. We detected a strong difference in IRF5 m/M values among the three rs10954213 genotypes, with the alternative allele associated with higher m/M values, i.e. shorter 3' UTR (Kruskal-Wallis test P-value = 9.21×10 −10 ; Fig. 8).
Therefore the variant has, at least qualitatively, the same effect in the whole blood of SLE patients as in lymphoblastoid cell lines of normal individuals.

Discussion
We used a new efficient strategy to study how human genetic variants influence the expression of alternative 3'UTR isoforms. This issue has been previously investigated with different approaches [26,25,27,8,9]. The method we propose combines wide applicability, being based on standard RNA-Seq data, with the high sensitivity allowed by limiting the analysis to a single type of transcript structure variant, namely 3' UTR length. Such higher sensitivity led us to discover thousands of variants associated with 3' UTR length that were not identified in a general analysis of trascripit structure from the same data in [8]. Moreover, the significant overlap between our apaQTLs and the eQTLs identified in [8] confirms the known relevant role of 3'UTRs in gene expression regulation. However, the regulation of 3' UTR length is known to affect regulatory processes that do not directly alter mRNA abundance, such as regulation of translation efficiency, mRNA localization and membrane protein localization [12,36]. Indeed most of the apaQTLs we found were not identified as eQTLs in [8].
The various mechanisms underlying the association between genetic variants and the relative abundance of 3'UTR isoforms can be classified in two main classes based on whether they affect the production or degradation rates of the isoforms. The production related-mechanisms include the alteration of APA sites, of cis-regulatory elements located in promoters and enhancers, and of binding sites of RBPs involved in nuclear RNA processing; the degradation-related mechanisms include the alteration of the binding sites of microRNAs and cytoplasmatic RBPs affecting mRNA stability. Taken together, our results suggest that the genetic effects on 3'UTR isoforms act prevalently at the level of production, as shown by the strong enrichment of apaQTLs in non-transcribed regulatory regions and among the variants creating or disrupting APA sites, and by the relatively weak enrichment of variants creating or disrupting microRNA binding sites.
Also the results on altered RBP binding sites confirm this picture, since most motifs altered by apaQTLs are associated to nuclear RBPs involved in nuclear RNA processing.
In particular, we identified several apaQTLs creating or destroying putative functional PAS motifs. However, it should be noted that our ability to detect these events is intrinsically limited by the motif repertoire that we used [40], which might miss some of the rarest alternative PAS motifs.
For example, we found that the rs6151429 variant is associated with the increased expression of the long isoform of the transcript codified by the Arylsulfatase A (ARSA) gene ( Supplementary   Fig. 6D), in agreement with previous evidence [67]. However, we did not include this variant among those disrupting a PAS motif since the disrupted motif (AAUAAC) is not included in the catalog that we used. In addition, we considered only PAS-altering single nucleotide substitutions, while also other types of genetic variants can modify the PAS landscape of a gene. For example, a small deletion (rs374039502) causes the appearance of a new PAS motif within the TNFSF13B gene, and has been associated with an higher risk of both multiple sclerosis and SLE in the Sardinian population [68].
We observed a strong enrichment of apaQTLs in regulatory regions such as promoters and enhancers, as previously found for variants generically affecting transcript structure in [8]. These results point to an important role of DNA-binding cis-acting factors in the regulation of 3' UTR length, and to the existence of a widespread coupling between transcription and polyadenylation [12,69]. Moreover, it has been shown that RBPs involved in APA regulation can interact with promoters [14].
Regarding the effect of genetic variants on mRNA stability, we focused on the perturbation of microRNA binding, taking into account both the creation and the destruction of microRNA binding sites within transcripts. The relevance of mRNA stability seemed to be confirmed by a modest enrichment of microRNA-altering SNPs among intragenic apaQTL, however the direction of their effect on microRNA binding is not statistically consistent with the expected direction of the change in 3' UTR isoform ratio. The same type of ambiguity has been previously reported with regard to the relationship between the effect of SNPs on microRNA binding and gene expression levels [70] and makes us doubt whether these microRNA-altering apaQTLs are truly causal for the associated gene. These results suggest that the alteration of microRNA binding may not be a predominant mechanism explaining the variation of the expression of alternative 3'UTR isoforms across individuals. Limitations in the accuracy of predicted micorRNA binding sites might also contribute to this result.
Another possible mechanism of action of intragenic apaQTLs is the perturbation of the regulatory action of RBPs, as indicated by the modest but highly significant enrichment of SNPs altering RNA-binding motifs. However, the lack of strong enrichments when considering each motif individually suggests that specific RBP motifs may have a small regulatory impact on APA that may also depend on the context, as recently suggested [30]. As in the case of microRNAs, also our limited knowledge of the binding preferences of RBPs might limit our power to detect their effects: More sophisticated models should take into account the highly modular structure of RBPs that often include multiple RNA binding domains (RBDs), the emerging importance of both the binding context and the RNA structure and even more sophisticated modes of RNA binding [71,72].
Furthermore, it is reasonable to assume that also non-canonical modes of APA regulation can be affected by genetic variants and therefore drive the detection of variable isoform expression ratios.
For example, it has been recently suggested that an epitranscriptomic event, the m 6 A mRNA methylation, can be associated with alternative polyadenylation [15]. In addition, recently published results suggest that genetic variants could affect APA regulation also in an indirect way, without affecting the regulatory machinery. Past studies have reported that a narrow range of 10-30nt between the PAS and the poly(A) site is required for efficient processing, however [73] suggested that also greater distances can sometimes be used thanks to RNA folding events that bring the PAS and the poly(A) site closer to each other. Therefore, we can speculate that if a genetic variant affects RNA folding in such a way as to modify the distance between the PAS and the poly(A) site, it could also influence APA regulation.
While the mechanisms discussed above act at the level of the primary or mature transcript, our results revealed a perhaps unexpectedly large number of extragenic apaQTLs, mostly located in regulatory regions. These apaQTLs point to an important role of DNA-binding elements such as transcription factors in regulating alternative polyadenylation through long-distance interactions with cleavage and polyadenylation factors. The investigation of these mechanisms is thus a promising avenue of future research.
Alternative polyadenylation can affect several biological processes, influencing mRNA stability, translation efficiency and mRNA localization [13]. Therefore, it is not surprising that its perturbation has been associated with multiple pathological conditions [7,22]. In the present study, we detected a strong enrichment of GWAS hits among apaQTLs, supporting the idea that 3' UTR length is a useful addition to the list of intermediate molecular phenotypes that can be used for a mechanistic interpretation of GWAS hits. In particular, we identified genetic variants previously associated to neurological disorders, such as autism, schizophrenia and multiple sclerosis, which may act by affecting the regulation of polyadenylation. The importance of post-transcriptional events in the onset of neurological diseases has been recently confirmed by two studies demonstrating that genetic variants affecting alternative splicing (sQTL) give a substantial contribution to the pathogenesis of schizophrenia [74] and Alzheimer's disease [75]. We also observed that the relevant apaQTLs often map to HLA genes, but that the enrichment is not explained by the HLA locus alone. On the other hand, examples of APA events involving HLA genes have been reported [76,77] and genes encoding antigen-presenting molecules account for the highest fraction of genetic risk for many neurological diseases [78].
A gene-based alternative approach to the interpretation of GWAS has been recently proposed.
In the original implementation of Transcriptome Wide Association Studies (TWAS) [79], eQTL data obtained in a reference dataset are used to predict the genetic component of gene expression in GWAS cases and controls, which is then correlated with the trait of interest, thus allowing the identification of susceptibility genes. More recently, the authors of [80] proposed a summary-based TWAS strategy in which the association between the genetic component of gene expression and a trait is indirectly estimated through the integration of SNP-expression, SNP-trait, and SNP-SNP correlation data. Furthermore, this kind of analysis has also been performed exploiting a collection of sQTLs, leading to the identification of new susceptibility genes for schizophrenia [81] and Alzheimer's disease [75]. In a similar way, apaQTLs could be used to discover cases in which the association between genes and diseases is driven by the alteration of the expression of alternative 3'UTR isoforms.
We are aware of some limitations of this study. First, the simple model that we used for the definition of alternative 3'UTRs isoforms limits the type of events that can be detected, because we can see only events involving poly(A) sites located within the transcript segments taken into account for the computation of the m/M values (the PRE and the POST segments). Nonetheless, the adoption of this simple model significantly reduces the computational burden and might be sufficient to indicate general trends that can be subsequently further investigated with more sophisticated models. Indeed, it has been previously shown, in a slightly different context (i.e. the comparison of APA events detected in different cellular conditions or tissues), that the results obtained with our model are comparable with those obtained exploiting a more complex model that takes into account all the possible APA isoforms of a gene, especially because also genes with multiple poly(A) sites mainly use only two or a few of them [29]. Second, our strategy depends on a pre-existing annotation of poly(A) sites: Methods that infer the location of poly(A) sites from RNA-Seq data are available, but they can have lower sensitivity in the detection of APA events [29,30]. In addition, we examined only a single cell type (lymphoblastoid cells) to demonstrate the feasibility of apaQTL mapping analysis. A broader investigation, exploiting data such as those provided by Genotype-Tissue Expression (GTEx) consortium [82], would be particularly valuable. Indeed, APA regulation seems to be significantly tissue-specific and global trends of poly(A) sites selection in specific human tissues have been described: For example transcripts in the nervous system and brain are characterized by preferential usage of distal PAS, whereas in the placenta, ovaries and blood the usage of proximal PAS is preferred [12].
In conclusion, we have identified thousands of common genetic variants associated with alter-

Annotation of alternative 3'UTR isoforms
We considered the human transcripts included in RefSeq and associated them with the corre- The coordinates of the human poly(A) sites were converted from hg17 to hg19 using liftOver [90] and then combined with the gene structures defined above to define the alternative 3'UTR isoforms. For the definition of alternative 3'UTR isoforms we adopted a simple model taking into account only two alternative poly(A) sites for each gene, because previous evidence suggests that also genes with multiple poly(A) sites mainly use only two of them [29].
where l P REa and l P OSTa are respectively the length of the PRE and POST segment of the gene a, #r P RE a,i and #r P OST a,i are respectively the number of reads mapped on the PRE and the POST segment of the gene a in the i th individual.
The m/M values were computed for 14,542 genes for which we were able to define alternative 3'UTR isoforms. Infinite and negative values of m/M (that happen when the POST region does not produce any reads, and when the POST region produces more reads than the PRE region after length normalization, respectively) were considered as missing values. Then only those on autosomal chromosomes (chr1-22) and with less than 100 missing m/M values were selected for the following investigation, leaving us with 6,256 genes.

Genotypic data pre-processing
Starting from the downloaded VCF files, we extracted genotypic data for 373 EUR individuals for whom also RNA-Seq data are available using VCFtools [91]. In addition, only common genetic variants with Minor Allele Frequency (MAF) higher than 5% were considered in all the following analyses; the MAF value of each genetic variant was reported in the VCF files, but we recomputed them taking into account that the reference allele in the VCF file may not always be the most frequent one in the EUR population considered by itself, and conservatively attributing the most frequent homozygous genotype to individuals for which the genotype was missing.

Principal Component Analysis of genotypic data
It is known that special patterns of linkage disequilibrium (LD) can cause artifacts when a Principal Component Analysis (PCA) is used to investigate population structure [92]. We filtered out all the genetic variants falling within 24 long-range LD (LRLD) regions whose coordinates were derived from [92]. In addition, following [93], we performed an LD-pruning of the genetic variants using the --indep-pairwise function from PLINK v1.9 [94] to recursively exclude genetic variants with pairwise genotypic R 2 >80% within sliding windows of 50 SNPs (with a 5-SNPs increment between windows). Also in this case VCFtools [91] was used to apply all these filters to the VCF files and finally EIGENSTRAT v6.1.4 [95] was used to run the PCA on the remaining genotypic data at the genome-wide level.

apaQTL mapping
From a statistical point of view, we adopted the same strategy used in standard eQTL mapping analyses [8] to identify genetic variants that influence the expression level of the alternative 3'UTR isoforms of a gene. For each of the 6,256 examined genes, we defined a cis-window as the region spanning the gene body and 1 Mbp from both its TSS and its TSE. Then, for each gene a linear model was fitted, independently for each genetic variant within its cis-window, using the genotype for the genetic variant as the independent variable and the log2-transformed m/M value of the gene as the dependent variable: where log 2 (m/M a,i ) is the log2-transformed m/M value computed for the a gene in the i th individual, g v,i is the genotype of the i th individual for the j th genetic variant, I i is the imputation status (0|1) of the i th individual, gP C n,i is the value of the n th Principal Component (PC) obtained from genotypic data for the i th individual, β 0 is the intercept, β 1 , β 2 and α n are the fitted regression coefficients and a is the error term for the gene a.
The fitting of the linear models was done using the CRAN R package MatrixEQTL [96]. Genotypes were represented using the standard 0/1/2 codification, referring to the number of alternative alleles present in each individual, and matrices with genotypic information were obtained from VCF files exploiting the Perl API (Vcf.pm) included in the VCFtools suite [91]. Following [8], in all our models we included both the imputation status of the individuals and the first three PCs obtained from genotypic data as covariates, in order to correct for possible biases due to population stratification ( Supplementary Fig. 1) or genotype imputation.
The observed distribution of nominal P-values was compared with the expected one in Quantile-Quantile plots (Q-Q plots), revealing the expected inflation due to the LD issue ( Supplementary   Fig. 2). A permutation-based procedure was implemented [97]: all the models were fitted again after the random shuffling of the m/M values of each gene across samples; then for each genevariant pair we counted how many times we obtained a random P-value less than its nominal P-value and divided this value by the total number of random tests done. Finally, to control for multiple testing, the empirical P-values were corrected with the Benjamini-Hochberg procedure [98] and models with a corrected empirical P-value less than 0.05 were considered statistically significant. Manhattan plots were drawn using the CRAN R package qqman [106].

Comparison with other molecular QTLs
In order to compare the genes for which we detected one or more apaQTLs with those for which eQTL/trQTL were reported [8], we translated the Ensembl Gene IDs (ENSG) to NCBI Entrez Gene IDs using Ensembl v67 [99] retrieved using the Bioconductor R package biomaRt v2.30 [100,101]. 229 ENSGs could not be translated with this procedure and were therefore excluded from this analysis.

Enrichment analyses
In order to functionally characterize the apaQTLs, we analyzed the enrichment of several features among such variants, including their genomic location, their ability to alter known regulatory motifs, and their association with complex diseases. All enrichments were evaluated through multivariate logistic regression to allow correcting for covariates. In this section we provide an overview of the method, but refer to the following subsections for details about each analysis.
For each feature we first established which genetic variants were potentially associated with the feature (for example only variants in the 3' UTR can alter microRNA binding sites). Therefore, each enrichment analysis started with the selection of the "candidate variants" that were subsequently subjected to an LD-based pruning, in order to obtain a subset of independent candidate variants (the same strategy was implemented for example in [102] to evaluate the enrichment of GWAS hits among eQTLs). LD-based pruning was always performed using PLINK with the same parameters used in the case of the PCA of genotypic data (see above), but applied in each case to the candidate variants only. To each candidate variant surviving pruning we attributed a binary variable indicating whether it has the feature under investigation. Finally, these variants are classified as apaQTLs (i.e. corrected empirical P-value < 0.05 for at least one gene) and null variants (i.e. nominal P-value > 0.1 in all the fitted models). We excluded the "grey area" variants with nominal P-value < 0.1 but empirical corrected P-value > 0.05 as they are likely to contain many false negatives. Finally we fitted a multivariate logistic model in which the dependent variable is the apaQTL/null status of the variant, and the independent variables are the feature of interest and covariates. The latter always include the MAF of the variant, since variants with higher MAF are more likely to be found as significant apaQTLs, and possibly other covariates depending on the feature under examination (see below).
The logistic model can thus be written as: where F eature j is a binary variable indicating whether the genetic variant j has the feature of interest, β 0 is the intercept, β 1 is the regression coefficient for the feature, j is the error term and P r(apaQT L) j is the fitted probability that the genetic variant j is an apaQTL. As expected, in our models the regression coefficient of the MAF was always positive. The regression coefficient of the F eature term and its associated P-value were used to establish if having the feature under investigation influences the probability of being an apaQTL, and to compute the corresponding odds ratio (OR).

Chromatin states
This analysis was performed independently for two cell types (the GM12878 and NHEK cell lines). In both cases, the candidate variants were virtually all the genetic variants for which the apaQTL models were fitted, but we excluded those not associated with any chromatin state and all the structural variants, because their length can prevent them from being univocally associated with a chromatin state.

Gene regions
The candidate variants were all the intragenic variants for which the apaQTL models were fitted.
We defined as intragenic all variants falling between the start and the end of the gene, plus 1,000 bps after the end (to take into account possible misannotations of the 3' UTR).
Independent enrichment analyses were performed for the following sequence classes: coding exons, introns, 5'UTR and 3'UTR. For each class the binary feature used as a regressor was assigned the value 1 for variants falling within the class and 0 otherwise. Only the MAF was included in the covariates.

Cis Regulatory Domains
The candidate variants were all the extragenic variants (i.e. all variants that are not intragenic according to the definition given above) for which an apaQTL model was fitted. The binary feature was given value 1 for variants falling within a CRD and 0 otherwise. Besides the MAF, the distance from the nearest gene was included as a covariate, since variants closer to a gene are more likely to be apaQTLs.
To verify that the apaQTLs tend to be included in the CRDs specifically associated to the gene on which they act, we translated the CRD-gene associations provided in [61] into Entrez Gene IDs, and we counted how many genetic variants fall within a CRD associated to at least one gene for which the variant is an apaQTL. This number was then compared with that obtained in the same way after randomly assigning a target gene to each extragenic variant within the cis-window used for apaQTL analysis (100 independent randomizations were used). PAS motifs. The PAS motif is always located upstream of its target poly(A) site. It has been suggested that a narrow range of 10-30 nt is required for efficient processing, but recent work suggests that also larger distances can be functional thanks to RNA folding processes bringing the poly(A) site closer to the PAS [73]. Assuming that a PAS-altering SNP would affect the usage of its nearest poly(A) site, we associated to each intragenic SNP the nearest downstream poly(A) site, selected those for which such poly(A) site was located within the PRE/POST segments, and retained as candidate variants only those whose distance from the corresponding poly(A) site was between 10 and 100 nt. PAS-altering variants were defined as those for which

GWAS hits
We considered only the GWAS catalog records referring to a single genetic variant on autosomal chromosomes for which all the fields CHR_ID, CHR_POS, SNPS, MERGED, SNP_ID_CURRENT, and MAPPED_TRAIT_URI were available, as well as the RSID. The coordinates of the selected genetic variants in hg19 were derived from dbSNP Build 151. We thus obtained 56,672 genetic variants associated with at least one complex trait. Furthermore, starting from the EFO URI(s) reported for each association, we obtained the corresponding EFO Parent URI(s) from the EFO annotation file.
All variants examined as potential apaQTLs were considered as our candidate variants. A binary feature value of 1 was attributed to each candidate variant surviving LD pruning and associated to a trait, or with a tagging variant associated to a trait, as in the case of motif-altering variants.
Enrichment was evaluated for all trait-associated variants together, for each single trait, and for trait categories defined based on the EFO ontology. Only traits and trait categories associated with at least 100 GWAS hits were analyzed. The same analysis was also performed after excluding all variants within the HLA locus, as defined by The Genome Reference Consortium (https: //www.ncbi.nlm.nih.gov/grc/human/regions/MHC?asm=GRCh370).

The rs10954213 variant in SLE patients
Since we were interested in the IRF5 gene only, RNA-Seq reads were aligned to a reduced genome comprising the gene sequence and an additional 50bp at its 3' end using Bowtie v2.2.3 [104] and TopHat v2.0.12 [105]. As genotypic data were not available for these individuals, we inferred the rs10954213 variant status from the relative proportion of A and G in the RNA-Seq reads.
Individuals were considered homozygous for the reference (G) or for the alternative (A) allele when the same nucleotide was present in all the reads, and a single read with a different nucleotide was considered sufficient to call an heterozygous individual ( Supplementary Fig. 9). In this way we obtained 10 homozygotes for the reference allele, 73 heterozygotes and 16 homozygotes for the alternative allele. The MAF thus obtained is consistent with that reported in the NCBI dpSNP database [89]. A Kruskal-Wallis test was then used to evaluate the differences in m/M values between genotypes. A different criterion for the assignment of genotypes (at least 20% of the reads carrying the less frequent allele required to call a heterozygous individual) gave comparable results ( Supplementary Fig. 10).
[30] Kevin C. H. Ha, Benjamin J. Blencowe, and Quaid Morris. "QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data". In: Genome            Supplementary figure 9: Genotypic information was not available for the SLE patients, therefore their genotype in correspondence of the rs10954213 genetic variant was inferred from RNA-Seq data. In the main analysis, having at least one aligned read with a different nucleotide was considered sufficient to call a heterozygous individual. The figure shows the alignment of RNA-Seq reads in a region around the variant with respect to a reduced genome including only the IRF5 gene and was generated using the Integrative Genomics Viewer (IGV) software [108]. For example, the SRR2443147 sample (top panel) was considered heterozygous, while the SRR2443196 sample (bottom panel) was considered homozygous for the alternative allele. in SLE patients as a function of the genotype of the individuals when an alternative criterium was adopted to define the genotypic classes.