ORIGINAL RESEARCH article
To ERV Is Human: A Phenotype-Wide Scan Linking Polymorphic Human Endogenous Retrovirus-K Insertions to Complex Phenotypes
- 1Division of Epidemiology, School of Public Health, University of California, Berkeley, Berkeley, CA, United States
- 2Division of Epidemiology, School of Community Health Sciences, University of Nevada, Reno, NV, United States
- 3Department of Epidemiology and Biostatistics, Helen Diller Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, United States
- 4Department of Neurosurgery, Duke University, Durham, NC, United States
- 5Department of Neurosurgery, Helen Diller Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, United States
Approximately 8% of the human genome is comprised of endogenous retroviral insertions (ERVs) originating from historic retroviral integration into germ cells. The function of ERVs as regulators of gene expression is well established. Less well studied are insertional polymorphisms of ERVs and their contribution to the heritability of complex phenotypes. The most recent integration of ERV, HERV-K, is expressed in a range of complex human conditions from cancer to neurologic diseases. Using an in-house computational pipeline and whole-genome sequencing data from the diverse 1,000 Genomes Phase 3 population (n = 2,504), we identified 46 polymorphic HERV-K insertions that are tagged by adjacent single nucleotide polymorphisms (SNPs). To test the potential role of polymorphic HERV-K in the heritability of complex diseases, existing databases were queried for enrichment of established relationships between the HERV-K insertion-associated SNPs (hiSNPs), and tissue specific gene expression and disease phenotypes. Overall, hiSNPs for the 46 polymorphic HERV-K sites were statistically enriched (p < 1.0E−16) for eQTLs across 44 human tissues. Fifteen of the 46 HERV-K insertions had hiSNPs annotated in the EMBL-EBI GWAS Catalog and cumulatively associated with >100 phenotypes. Experimental factor ontology enrichment analysis suggests that polymorphic HERV-K specifically contribute to neurologic and immunologic disease phenotypes, including traits related to intra cranial volume (FDR 2.00E-09), Parkinson's disease (FDR 1.80E-09), and autoimmune diseases (FDR 1.80E-09). These results provide strong candidates for context-specific study of polymorphic HERV-K insertions in disease-related traits, serving as a roadmap for future studies of the heritability of complex disease.
Retroviruses are a class of RNA virus that undergoes reverse transcription to DNA during the infectious cycle inside a host cell. At the proviral stage, retroviral DNA integrates into the host DNA to produce viral proteins. Integration into germ cells can result in endogenization, wherein the virus can be vertically transmitted via standard Mendelian inheritance mechanisms. Endogenous retroviruses (ERV) are ancient examples of proviruses that integrated and endogenized into the human genome >40mya (Bannert and Kurth, 2006). In modern humans, ERVs account for approximately 8% of the genome (Bannert and Kurth, 2006). Their relative stability, as well as the conservation of orthologs in other primate genomes suggests that they induce genome plasticity and can enhance evolutionary fitness (Cordaux and Batzer, 2009; Feschotte and Gilbert, 2012; Grow et al., 2015). Retroviruses are reliant on the fitness of their host for survival and the long-standing evolutionary cooperation between ERVs and humans may represent a symbiotic relationship (Jern and Coffin, 2008). The positive selection of persistent ERVs in the genome may have resulted from increasing the probability of survival to reproductive age [via adaptive effects on placentation (Simpson et al., 1996); and immune (Hurst and Magiorkinis, 2015) and brain development (Mortelmans et al., 2016)]. The phenotypic effects of ERVs on the post-reproductive adult, however, remain unclear and are of growing interest (Li et al., 2015; Bowen et al., 2016; Sekar et al., 2016).
Previous studies have described the potential mechanisms by which ERVs influence human phenotypes. ERV insertions introduce viral genes and, due to their inter-individual homology can generate copy-number variants via non-allelic homologous recombination (Campbell et al., 2014). They modify transcription by adding enhancers (Chuong et al., 2013) and promoters (Fuchs et al., 2013), disrupting intron structure, causing RNA interference (Ling et al., 2003) adding poly-A tails (Kim, 2012), and altering DNA methylation (Kreimer et al., 2013). ERV expression, typically restricted in healthy tissues except during placental development, is detected in diseases including cancers and autoimmune disorders [reviewed in (Cegolon et al., 2013; Nexø et al., 2016)].
While the functional effects of ERVs are well established, the vast majority of ERV insertions are fixed across individuals, and so their potential contribution to phenotypic variation has been largely overlooked. The HERV-K subfamily, however, contains human-specific, unfixed insertions, ranging from fully intact provirus to solo long terminal repeat (LTR) sequences (Wildschutte et al., 2016). HERV-K represents one of the most recent ERV integrations into the human genome and numerous insertions have neither been eliminated from the genome (via negative selection or drift), nor fixed (via positive selection or drift). The polymorphic nature of these insertions suggests a potential contribution to causal variation in the heritability of complex phenotypes. Targeted studies have identified specific HERV-K integrations that affect disease risk, for example a polymorphic HERV-KC4 inserted within the complement component 4 (C4A/B) gene appears to be involved in the genetic risk of schizophrenia (Sekar et al., 2016).
Technical limitations have proved a major obstacle in the untargeted identification of polymorphic HERV-K insertions for application to clinical and epidemiologic studies. With the emergence of next-generation sequencing technologies, methods are being developed for the untargeted identification of ERVs among other mobile genetic elements in human genomes (Witherspoon et al., 2010; Ray and Batzer, 2011; Wildschutte et al., 2016). Here, we examine phenotypic effects of all polymorphic HERV-K insertions identifiable from a large, publically available whole genome sequencing (WGS) dataset. With our computational pipeline, we identified HERV-K insertion locations using data from the diverse 1000 Genomes Phase 3 population (n = 2,504). By identifying a subset of polymorphic HERV-K insertions with strong associations to adjacent “tagging” single nucleotide polymorphisms (SNPs), we have leveraged several comprehensive SNP annotation databases to test for enrichment of established relationships between HERV-K insertion-associated SNPs (hiSNPs), tissue-specific gene expression, and diverse disease phenotypes.
HERVnGoSeq Computational Pipeline
To elucidate the broad phenotypic effects of polymorphic HERV-K insertions, we developed a computational pipeline, HERVnGoSeq, to identify the presence/absence of known and novel HERV-K insertions in individual WGS data (Figure S1). Quality-filtered raw WGS were aligned to HERV-K113, one of the youngest HERV-K elements in the human genome with a conserved intact LTR sequence that is also capable of producing viral particles in vitro (Boller et al., 2008). Reads that partially aligned to HERV-K113 - chimeric reads - were trimmed and the non-HERV portions of the reads were extracted. The trimmed chimeric reads were then aligned to the human genome (GRCh37/Hg19). The base-pair position of the trimmed end of the read where HERV-K sequence was removed was called as the insertion point. Insertion points were collected for both the forward and reverse complement alignments separately. Insertion points within 1,000 bp of each other were grouped to represent a single insertion point. The presence of putative HERV-K insertions were assigned to each individual if they had at least one chimeric read that aligned to that insertion point. Absence of an insertion was inferred for individuals when they lacked any chimeric reads representing the specific insertion. The complete pipeline and description is available at https://github.com/unreno/chimera.
Putative polymorphic HERV-K elements were nominated via HERVnGoSeq (Figure S1). Sequence similarity between the reference index, HERV-K113 LTR, and the HERV-K10 LTR, which was ancestrally co-opted to form another mobile element class, Sine-VNTR-Alu composite elements (SVA, Hancks and Kazazian, 2010), resulted in nomination of insertion sites of SVA-A, B, and C in addition to HERV-K when the LTR portion of the SVA element was sufficiently conserved. Thus, true HERV-K insertion sites nominated by HERVnGoSeq were identified as follows: HERV-K present in reference—Using the dataset of mobile genetic elements present in the GRCh37 derived from RepBase (Bao et al., 2015) (UCSC RepeatMasker track), HERVnGoSeq nominated sites were confirmed as HERV-K if mapped within a known HERV-K ± 100 bp. HERV-K absent from reference—The remaining polymorphic insertion sites were determined to be HERV-K only if the insertion ± 100 bp was previously reported and confirmed with sequencing by a previous study (Dangel et al., 1994; Barbulescu et al., 1999; Mayer et al., 1999; Turner et al., 2001; Bennett et al., 2004; Hughes and Coffin, 2004; Macfarlane and Simmonds, 2004; Mamedov et al., 2004; Belshaw et al., 2005; Moyes et al., 2005; Kidd et al., 2008; Lee et al., 2012; Marchi et al., 2014; Sudmant et al., 2015; Wildschutte et al., 2016). Otherwise, they could not be distinguished from SVAs in silico. Additional HERV-K insertions missed by HERVnGoSeq but identified by previous studies and genotyped in the 1KG population were also included (n = 5) (Sudmant et al., 2015; Wildschutte et al., 2016). Prevalence of each insertion site was estimated based on either the presence/absence calls from HERVnGoSeq or from genotypes of HERV-K insertions from previous studies (n = 5). To ensure that polymorphic HERV-K insertions could not be explained by larger deletions, each insertion site was compared to start and end locations of deletions called in 1KG structural variant VCF. Any HERV-K insertion with a flanking deletion larger than 1,000 bp could not be reliably called polymorphic and was excluded from downstream analyses.
Identification of SNPs Associated With Polymorphic HERV-K Insertions
Most studies of common genetic disease-risk variants published to-date rely on SNPs, which are easy and cheap to measure compared to other structural genetic variants. These SNP studies rely on linkage disequilibrium (LD) wherein the disease-associated SNP is not necessarily the causal variant but instead tags the causal variant (outlined in Figure S2). To test our underlying hypothesis that disease-associated SNPs are, in some cases, tagging polymorphic HERV-K insertions, which are the true causal variants, we next identified SNPs associated with each HERV-K insertion and queried existing SNP:disease databases for phenotypic associations.
All HERV-K insertion sites were tested for SNP associations. For each of the 2,504 individuals in 1000 Genomes Phase 3, a binary indicator of presence/absence of the HERV-K insertion was generated via HERVnGoSeq or by recoding genotypes generated from previous publications (Sudmant et al., 2015; Wildschutte et al., 2016). Variant files for 1000 Genomes Phase 3 were downloaded from the FTP site (NCBI FTP site: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp). After stratifying by continental population and removing related individuals (Gazal et al., 2015), all biallelic SNPs present in the 1,000 Genomes Phase 3 variant files were tested for association with the presence of each of the polymorphic HERV-K insertions using logistic regression adjusted for population stratification by including the first 6 multidimensional scaling (MDS) vectors in all models. MDS components were generated from all 1,000 Genomes variants following pruning for common SNPs (minor allele frequency MAF > 0.05) and for independence followed by random thinning to 10% of variants. All logistic regression modeling and MDS estimation were conducted in Plink 1.9 (Chang et al., 2015).
Manhattan plots were generated to visualize associations between genome-wide SNPs and polymorphic HERV-K insertions. For each of the “taggable” HERV-K insertion sites (i.e., those that showed a single, strong association peak in the Manhattan plot), hiSNPs were defined as all SNPs within a 1 Mb window of the insertion with p-value for association less than or equal to the Bonferroni adjusted p-value threshold for significance (0.05/Total SNPs in 1 Mb window).
Sensitivity of hiSNP Set Generation
To confirm that the binary HERV-K insertion presence/absence calls made by HERVnGoSeq generated hiSNP sets similar to hiSNPs for HERV-K insertions called using a different pipeline, we identified 12 HERV-K insertions detected by HERVnGoSeq and the 1000 Genomes by Sudmant et al. (2015). For the 12 HERV-K insertion sites detected by HERVnGoSeq and genotyped by Sudmant et al., logistic regression-based SNP associations were estimated from the binary HERVnGoSeq calls and the dichotomized 1,000 Genomes genotype calls within Europeans. hiSNPs generated by HERVnGoSeq and 1,000 Genomes calls using the method described above were compared. To ensure that hiSNPs associated by logistic regression were representative of SNPs that are in LD with polymorphic HERV-K insertions, complete genotype data for the 12 overlapping HERVnGoSeq/1,000 Genomes sites were used to identify tagging SNPs via the r2 measure of LD, defined as having an r2 > 0.2.
Expression Quantitative Trait Loci (eQTL)
Polymorphic HERV-K insertion hiSNPs across all HERV-K insertion sites were pooled and tested for eQTL (SNP-gene expression association with p < 0.05 adjusted for multiple tests) enrichment against all common SNPs included in the tissue-specific Genotype-Tissue Expression (GTEx) Project Version 6 (n = 11,555,102) (Carithers and Moore, 2015) using a Fisher's exact test. The null hypothesis for this test is that the odds ratio ([hiSNP&eQTL]x[Not hiSNP&Not eQTL])/([Not hiSNP&eQTL]x[Not eQTL&hiSNP]) = 1. Enrichment of hiSNPs annotated as GTEx eQTLs were also calculated separately by HERV-K insertion site and tissue type using Fisher's exact tests.
Genome-Wide Association and Experimental Factor Ontology Enrichment
To investigate whether hiSNPs for the polymorphic HERV-K insertion sites have established phenotypic associations, the EMBL-EBI GWAS Catalog (MacArthur et al., 2017) was queried for the presence of hiSNPs. To test for broader phenotypic enrichment across the HERV-K insertion sites, experimental factor ontology enrichment analyses were conducted for all pooled hiSNPs using the XGR online tool (http://galahad.well.ox.ac.uk:3020) with significant enrichments having a false discovery rate <0.05.
To determine whether the absence of hiSNPs for some HERV-K insertions sites was due to the absence of any proximal SNPs, SNP density was calculated for all insertion sites. All SNPs in the 1,000 Genomes Phase 3 dataset were counted within a 1 MB, 500 Kb, and 100 Kb window centered on each polymorphic HERV-K insertion site. The mean SNP densities for HERV-K insertions with hiSNPs was compared to HERV-K insertions with prevalence estimates from 0.2-0.8 and no hiSNPs using a Student's t-Test. Two HERV-K sites located in unlocalized contigs (chr1_gl00192_random) were excluded.
To examine whether some HERV-K insertions lacked hiSNPs due to their proximity to recombination hotspots, we selected two recombination hotspot maps and calculated the distance between the HERV-K insertion sites and recombination hotspots. We identified two datasets mapping the genomic locations of recombination hotspots genome-wide—one using a population-average LD-based mapping method (4,697 hotspots, Genomes Project et al., 2010) and the other using ChIP-seq to identify PRDM9 binding sites among five individuals (62,110 hotspots, Pratto et al., 2014). Repeated random sampling with replacement was used to estimate the distribution of mean distances between randomly selected genomic locations and the nearest recombination hotspot. In order to compare this distribution to the distribution of distances of polymorphic HERV-Ks, a pool of random genomic locations was created wherein locations were matched to the HERV-K sites by chromosome and GC content of flanking 2 Kb region. HERV-K insertion sites on chromosome Y and unmapped contigs were excluded (n = 11). First, the percentage GC content of the 2 kb flanking each HERV-K was calculated using data from the UCSC gc5Base Track (Karolchik et al., 2004) in Hg19. Next, all chromosomes were divided into 2 Kb segments and percentage GC content was calculated for each. GC content for all segments was calculated to the tenth of a percent. For each HERV-K insertion, 2 Kb genomic segments were added to the sampling pool if they had identical GC content and were on the same chromosome. This resulted in an average of 444 random locations from which to sample for each HERV-K site (~187,000 total). From this pool, 172 random genomic locations (defined as the center of the 2 Kb fragment), matched 1:1 to respective HERV-K insertions, were sampled and the mean distance to nearest recombination hotspots were calculated. This sampling procedure was repeated 1,000 times. The entire process was carried out independently with each of the recombination hotspot maps.
Polymorphic HERV-K Identification
Our computational pipeline, HERVnGoSeq, nominated 1,381 putative HERV-K insertion sites among 2,504 human whole genomes from the 1000 Genomes Phase 3 population where sequencing reads partially aligned to the HERV-K113 LTR. HERV-K113 represents the one of the most recent HERV-K integration and thus is most likely to be polymorphic and to have preserved function (Turner et al., 2001). Of the 1,381 sites, 403 HERV-K insertions mapped to reference HERV-K breakpoint sequences in GRCh37/Hg19. A total of 783 putative sites mapped to reference SINE-alu-VNTR (SVA) insertions and were discarded. Of the remaining 195 non-reference putative insertions, 28 had been previously annotated and confirmed as HERV-K via sequencing in recent studies (Sudmant et al., 2015; Wildschutte et al., 2016) and the remainder, many of which are likely SVA, will require future targeted sequencing to confirm. With the rapidly increasing rate of discovery of HERV-K insertions in human genomes, we were able to collate an additional 5 HERV-K insertion sites discovered in a parallel study (Wildschutte et al., 2016), which were also genotyped in the 1000 Genomes population. In total, the 431 (403 reference and 28 non-reference) HERV-K insertions were tested for SNP associations (Table S1). The dichotomized presence/absence, rather than genotypes, of the nominated 431 HERV-K insertion sites were tested for SNP associations to reduce potential misclassification induced by poor sensitivity of calls due to low sequencing depth (~4x). An additional 16 HERV-K insertion sites identified in independent populations of diseased individuals [The Cancer Genome Atlas (Marchi et al., 2014); dbRIP (Wang et al., 2006)] were not detected by HERVnGoSeq, nor any parallel study utilizing the 1000 Genomes Project, suggesting that the diversity of HERV-K insertion sites expands beyond what is represented in the 1000 Genomes population.
All HERV-K insertions detected in more than one individual (n = 431) were tested for SNP associations using logistic regression. HERVnGoSeq did not identify any HERV-K insertions occurring in 100% of 2,504 individuals; however, low coverage sequencing data likely resulted in an underestimation of the prevalence of some insertion sites. Thus it is likely that some sites with high prevalence called by HERVnGoSeq are actually fixed across populations. Genome-wide logistic regression stratified by continental population and adjusted for genetic ancestry revealed 46 polymorphic HERV-K insertion sites with significant SNP associations (hiSNPs) in at least one continental population after correction for multiple testing (Table 1, Table S2, Figures S3–S50). Figure 1 shows a Manhattan plot of the genome-wide SNP association results for a polymorphic HERV-K insertion located on chromosome 1.
Figure 1. Manhattan plot. Results from a genome-wide association study for a polymorphic HERV-K insertion discovered with HERVnGoSeq at chr1:111802591. The gray horizontal line represents the threshold for genome-wide significance of 5 × 10−8. Logistic regression was performed separately within each of the five super populations in the 1,000 Genomes. AFR, African; AMR, Ad Mixed American; EAS, East Asian; EUR, European; SAS, South Asian.
Of these, 13 were not previously known to be polymorphic. The majority of the remaining 385 HERV-K insertion sites with no identifiable hiSNPs were rare (prevalence < 0.2, n = 48), singletons (n = 14), or common and potentially fixed (prevalence > 0.8, n = 190), which may explain the lack of association with neighboring SNPs. However, 129 HERV-K sites appear to be common and unfixed (prevalence restricted to 0.2–0.8) yet have no hiSNPs and thus could not be evaluated for phenotype enrichment in this study (Figure 2).
Figure 2. (A) Ideogram. Relative genomic locations of HERV-K insertion locations with (green) and without (blue) hiSNPs identified via HERVnGoSeq. (B) Histogram. Frequency distribution of identified HERV-K insertion prevalences for insertions with (green) and without (blue) hiSNPs.
Among the 46 HERV-K insertion sites with strong SNP associations, hiSNP sets were selected within a 1-megabase window to ensure that no strongly associated SNPs were excluded arbitrarily due to genomic distance. The median number of hiSNPs associated with each of these 46 HERV-K insertions was 279 (Table 1). A complete summary including odds ratio and p-values for all hiSNP-HERV-K associations can be found in Table S3.
To ensure that the selected hiSNPs represented SNPs in true linkage disequilibrium with the HERV-K insertions, a subset of 12 sites that were identified by HERVnGoSeq and also genotyped in an independent study of structural variation in the 1,000 Genomes Project (Sudmant et al., 2015) were selected for sensitivity analyses. hiSNP sets selected by HERVnGoSeq were compared to hiSNP sets derived from logistic regression with outcome being dichotomized genotypes of insertions called by the 1,000 Genomes, and LD via r2 values based on maximum likelihood phasing. HERVnGoSeq logistic regression-based hiSNP sets consistently nominated the greatest number of hiSNPs across the 12 sites and, for 10 sites, >75% of HERVnGoSeq hiSNPs were also hiSNPs derived from the genotyped insertions via logistic regression or r2 (Table S2, Figures S51–S63).
Tissue-Specific Differential Gene Expression and Disease Enrichment
The Genotype-Tissue expression (GTEx) project provides expression quantitative trait loci (eQTL) analysis results from genotype and gene expression data derived from 449 individuals across 44 human tissues (Carithers and Moore, 2015). We tested whether the HERVnGoSeq derived hiSNPs for each of the HERV-K insertion sites were enriched for eQTLs based on these data. Because the GTEx individuals are ~85% white, we restricted these analyses to the hiSNP sets identified among the 30 polymorphic HERV-K insertion sites in the European continental population (Table S1) to reduce confounding by population stratification. We observed enrichment (p < 0.05) of hiSNPs for eQTLs in at least one tissue type for 22 of the 30 sites by Fisher's exact test (Figure 3). HERV-K insertion sites contributed the most eQTL associations to subcutaneous adipose tissue and thyroid tissue with 15 HERV-K sites each. Fifteen of the sites with hiSNP sets enriched for eQTLs also include SNPs associated with disease by GWAS (Figure 3). The number of genes for which individual HERV-K insertion hiSNPs served as eQTLs ranged from 1 to 75 (Table S4). Often in the instances where a large number of genes were affected, the HERV-K insertion and genes occurred on blocks of extended LD (i.e., the major histocompatibility complex).
Figure 3. Heat map of log-transformed p-values for enrichment of hiSNPs that are eQTLs. Includes the 30 polymorphic HERV-K insertion sites with strong SNP associations in the European continental population. Enrichment of SNPs that are both hiSNPs and eQTLs was determined using Fisher's exact test. Results are stratified across 44 human tissue types.
We next examined the 30 hiSNP sets identified in the European continental population for annotation in the NHGRI-EBI GWAS Catalog. Half of the HERV-K insertion sites had at least one hiSNP with a genome-wide significant association with a disease phenotype (Figure 4). In total, European polymorphic HERV-K insertions are associated with 80 human phenotypes (Table 2).
Figure 4. Manhattan plots. Plots of hiSNP sets for 15 HERV-K insertion sites among the 30 polymorphic HERV-K insertion sites with strong SNP associations in the European continental population. Vertical black lines denote HERV-K insertion locations. hiSNPs that are annotated in the NHGRI-EBI GWAS Catalog are represented by red points.
Experimental factor ontology enrichment analysis suggests that polymorphic HERV-K insertions broadly associate with neurologic and immunologic disease phenotypes, including traits related to intracranial volume (FDR 4.40E-08), Parkinson's disease (FDR 1.80E-09), and autoimmune diseases (FDR 1.80E-09) (Table S5, Figure S64).
Analyses of “Untaggable” HERV-K Insertion Sites
The majority of polymorphic HERV-K insertions identified via HERVnGoSeq were not associated with any nearby SNPs (n = 129) and could not be evaluated for existing phenotypic associations in this study. Fifty-three of these HERV-K insertions (estimated mean prevalence: 45.6%, range: <1–79.7%), occur within genes (Table S6).
The distribution of polymorphic HERV-K in specific chromosomal regions (ex: telomeres, centromeres) did not explain the lack of strong hiSNP associations in the 129 identified polymorphic sites (Figure 2A). However, we suspected that they might differ from sites with strong SNP associations in two respects—proximal SNP density and distance to nearest recombination hotspots, which both effect neighboring patterns of LD (Ardlie et al., 2002; Ke et al., 2004). We found significantly lower SNP densities in the areas around HERV-K insertion sites without hiSNPs (mean 2813.2) than the areas around HERV-K insertions with hiSNPs (mean 3392.7, difference in means: 579.5 SNPs, p = 0.0007). However, the presence of SNPs flanking the 129 HERV-K insertions without hiSNPs suggests that SNP density is not a sufficient determining factor.
ERVs can be involved in homologous and non-homologous recombination events (Campbell et al., 2014). Some are enriched for PRDM9 binding motifs (Campbell et al., 2014) and elimination through recombination is a major mechanism by which ERV sequences are removed from the human genome (Katzourakis et al., 2007). We investigated whether recombination at HERV-K insertion sites explained the lack of hiSNP associations with these 129 HERV-K insertions by measuring their proximity to known recombination hotspots.
HERV-K insertion sites without hiSNP associations were farther on average from mapped recombination hotspots than their hiSNP-associated counterparts (222.2 kb difference in distance to LD hotspots, p = 0.04, and 40.1 kb difference in distance to ChIP-seq hotspots, p = 0.008). To determine whether the distance of polymorphic HERV-K insertions from recombination hotspots was greater than expected by chance, we compared the mean distance of these 172 polymorphic HERV-K insertions (46 with hiSNPs and 129 without) to the distances from hotspots of repeated random samples of 172 genomic locations. To mitigate potential confounding, random genomic locations were matched to HERV-K insertions sites on chromosome and flanking 2 kb GC content. Polymorphic HERV-K without hiSNPs were farther from recombination hotspots than randomly selected genomic locations (p < 0.005), whereas insertions with hiSNPs were not farther (or closer) to recombination hotspots than expected by chance (Figure 5).
Figure 5. Mean distance to nearest recombination hotspot. Distances indicated for polymorphic HERV-K insertions with and without hiSNPs (dashed lines) and the distribution of mean distances of random genomic locations matched to HERV-K insertions on proximal GC content and chromosome. Distributions were derived from 1000 repeated random samples with replacement. (A) Distances from nearest ChIP-seq-based recombination hotspot, (B) Distances from nearest LD-based recombination hotspot.
This study shows that polymorphic HERV-K insertions occur in regions of the genome enriched for phenotypic function and, furthermore, that these insertion variants co-occur with established disease-risk variants, providing previously-untested candidates for the functional elements underlying the heritability of numerous complex diseases. Using our computational pipeline, HERVnGoSeq, and the diverse 1,000 Genomes population, we confirmed the presence of 33 known polymorphic HERV-K insertions and identified an additional 13 confirmed sites via strong SNP associations not previously recognized as polymorphic. Of the total 46 HERV-K insertions under investigation, 22 have hiSNP sets enriched for eQTLs and 15 contained disease-associated SNPs identified in prior GWAS.
The collective evidence put forth by annotated hiSNPs supports a role for HERV-K insertions in inducing phenotypic effects. There is previous evidence that polymorphic HERV-K insertions may affect brain function. Previous studies have established links between HERV-K and amyotrophic lateral sclerosis (Alfahad and Nath, 2013), HIV-associated dementia (Garrison et al., 2007), and Schizophrenia (Sekar et al., 2016). Our results further support these links and provide specific candidate polymorphisms that may explain these observations. We found two polymorphic HERV-K insertions (chr5:64388440 and chr6:32648036) whose hiSNP sets include a GWAS hit for schizophrenia. The association of the hiSNP at the insertion chr6:32648036 has already been attributed to the presence of a polymorphic HERV-K at the complement component 4 (C4) locus, resulting in altered expression at C4A and C4B, which we also observed in our eQTL enrichment results (Table S4) (Sekar et al., 2016). The hiSNP set at the second site at chr5:64388440 contains a SNP that is associated with schizophrenia symptoms relating to hallucination, delusion, and paranoia (Fanous et al., 2012) and both the SNP and the HERV-K are located directly upstream of and serve as eQTLs for ADAMTS6, a gene among a family that experimentally induces neurite growth in cultured neurons (Hamel et al., 2008). One of the most strongly associated hiSNPs for another HERV-K at chr4:120263688 was previously identified in a GWAS examining the genetics of cognitive performance using proxy-phenotypes (Rietveld et al., 2014). Experimental factor ontology enrichment analysis of hiSNPs also suggested a largely neurological phenotypic effect of polymorphic HERV-K wherein Parkinson's disease, intracranial volume, and temporal arteritis were the seconds, third, and fourth most significantly enriched terms, respectively. Enrichment analysis also suggests that HERV-K insertion sites may have a functional role in autoimmune diseases. The role of polymorphic HERV-K in immunity, particularly insertions within the HLA, is difficult to delineate. While no specific associations have been established between HERV-K and autoimmunity, strong evidence suggests a link between multiple sclerosis and expression of HERV-W (Schmitt et al., 2013); and the role of ERVs in autoimmunity has been long suspected, but its study has been hindered by technological limitations. With the increasing availability of next-generation sequencing data and computational methods like HERVnGoSeq, the time is ripe for a thorough investigation of polymorphic HERVs in autoimmune disease.
HERV-K expression has frequently been noted in human cancers and has also been of interest as an etiologic factor. We found hiSNP associations with Hepatocellular carcinoma (HCC) tagging two polymorphic HERV-K insertions. Recent studies identified an increase in HERV-K expression in HCC vs. normal tissue (Ma et al., 2016) and also discovered that HCC tumor mutations are frequently caused by APOBEC enzymes, a component of the human innate immune system primarily active against ERVs (Chiu and Greene, 2008). As such, the role of polymorphic HERV-K in interaction with hepatitis viruses and HCC appears warranted.
The HERV-K LTR is known to contain enhancer elements and thus the degree to which hiSNPs were enriched for eQTLs was not unexpected. An advantage of using the GTEx database is the ability to determine tissue-specific eQTL activity, which can help discern the phenotypic effects of polymorphic HERV-K. For example, we observe that the hiSNPs for a polymorphic HERV-K at chr19:22414379 are enriched for eQTLs in adipose tissue, suggesting that the insertion could affect fat storage. Indeed, the hiSNP set for this insertion also contains a GWAS SNP associated with changes in body mass index over time (McQueen et al., 2015).
It is possible that the HERV-K with hiSNP eQTL enrichment is not itself the variant altering cis-gene expression, particularly in cases where the HERV is inserted on a large LD block, for example within the HLA region (chr6:32505702, chr6:32648036, chr6:32746812). Often these regions are not fully explored in functional follow-up studies due to their complexity. Since HERV-K LTRs contain functional elements, they serve as strong candidates for eQTLs regardless of regional complexity, warranting additional studies. It is also possible that the observed relationships between eQTLs and HERV-K insertions differ across populations. However, the GTEx consortium primarily consists of samples derived from european individuals and thus the relationship between polymorphic HERV-K and gene expression can currently only be inferred in this population.
We could not leverage SNP annotations to illuminate the function of the majority of polymorphic HERV-K insertions because they did not associate with any neighboring SNPs. In addition to the 46 HERV-K insertions with hiSNPs, we also nominated 129 reference HERV-K insertions that are likely polymorphic and that may have yet-undetected phenotypic associations. It is possible that some of the lower prevalence sites without hiSNPs are fixed and were called polymorphic only because of poor detection sensitivity of HERVnGoSeq. However, this seems unlikely, as our prevalence estimates for previously recognized non-reference polymorphic HERV-K were usually within ~10% of previous studies' estimates (Figure S65). Pairwise correlations of SNPs directly adjacent to these HERV-K insertions suggest that there is LD in these regions (data not shown), but that the HERV-K insertions are “dark variants” that are not correlated with proximal SNPs. One potential explanation why the majority of polymorphic HERV-K insertions fail to have strong SNP associations is the greater than expected distance to the nearest recombination hotspot. Patterns of LD are known to strengthen the closer variants are to a hotspot, with complete loss of LD within the hotspot itself. The observed decay of HERV:SNP LD farther from hotspots requires further study. It is also possible that there is hotspot activity near or within these HERVs that were not identified and included in the two hot spot maps used for this study. Breakdown of LD patterns surrounding these HERV-K insertions may also be explained by other mechanisms that could not be investigated in the present study, including frequent sporadic non-allelic homologous recombination events, evolutionarily recent integration, off-target mutagenic activity of HERV-K repressors such as APOBEC enzymes, or hypermethylation resulting in sporadic deamination of methylated cytosines.
In our survey of phenotypic associations with polymorphic HERV-K insertions, the greatest limitation was the poor sensitivity of detection of HERV-K due to the low coverage of the sequencing data available for the 1,000 Genomes population. Consequently, we were not able to call genotypes or estimate prevalence with high precision. Similar pipelines that have used these data to genotype mobile genetic elements often include an imputation step. Our observation, that a significant number of HERV-K insertions lack SNP associations, likely impedes the ability and reliability of imputation-based methods for genotyping these polymorphisms. This may also explain why so few members of the HERV-K family have been recognized as polymorphic. We anticipate that the accuracy of genotyping HERV-K insertions will greatly increase with higher coverage sequencing data.
Polymorphic HERV-K elements are associated with the germ-line risk of myriad phenotypes. While this study provides a starting point for further investigation, disease-specific epidemiologic and functional studies are needed to elucidate the role of specific polymorphic HERV-K insertions in complex diseases.
This study was carried out in accordance with the recommendations of our institutional review board at both the University of California, San Francisco and University of Nevada, Reno. The protocol was approved by the UCSF and UNR IRB (IRB# 1178149-1, 992935-1). Only previously collected data was used in this analysis where original investigators required all subjects gave written informed consent in accordance with the Declaration of Helsinki.
SF and AW conceived of the study, conducted and assisted in analysis, and wrote the paper. GW developed bioinformatic pipelines, managed data, and assisted in manuscript preparation. LB, AdS, KW, CM, JC, JW assisted in analyses, provided technical expertise and aided in manuscript preparation and writing.
We thank our funders (NIH/NCI 5T32CA151022-05 PI:Costello, NIH/NIGMS 1R15GM126562-01 PI:Francis), and the countless other studies generating publicly available data to further our understanding of disease.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00298/full#supplementary-material
ERV, endogenous retrovirus; HERV-K, human endogenous retrovirus-K; SNP, single nucleotide polymorphism; hiSNP, HERV-K insertion associated single nucleotide polymorphism; SVA, sine-VNTR-alu element; eQTL, expression quantitative trait loci; GWAS, genome-wide association study; WGS, whole genome sequencing; MDS, multi-dimensional scaling; LD, linkage disequilibrium.
Barbulescu, M., Turner, G., Seaman, M. I., Deinard, A. S., Kidd, K. K., and Lenz, J. (1999). Many human endogenous retrovirus K (HERV-K) proviruses are unique to humans. Curr. Biol. 9, 861–868. doi: 10.1016/S0960-9822(99)80390-X
Belshaw, R., Dawson, A. L., Woolven-Allen, J., Redding, J., Burt, A., and Tristem, M. (2005). Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): implications for present-day activity. J. Virol. 79, 12507–12514. doi: 10.1128/JVI.79.19.12507-12514.2005
Bennett, E. A., Coleman, L. E., Tsui, C., Pittard, W. S., and Devine, S. E. (2004). Natural genetic variation caused by transposable elements in humans. Genetics 168, 933–951. doi: 10.1534/genetics.104.031757
Boller, K., Schonfeld, K., Lischer, S., Fischer, N., Hoffmann, A., Kurth, R., et al. (2008). Human endogenous retrovirus HERV-K113 is capable of producing intact viral particles. J. Gen. Virol. 89, 567–572. doi: 10.1099/vir.0.83534-0
Bowen, L. N., Tyagi, R., Li, W., Alfahad, T., Smith, B., Wright, M., et al. (2016). HIV-associated motor neuron disease: HERV-K activation and response to antiretroviral therapy. Neurology 87, 1756–1762. doi: 10.1212/WNL.0000000000003258
Campbell, I. M., Gambin, T., Dittwald, P., Beck, C. R., Shuvarikov, A., Hixson, P., et al. (2014). Human endogenous retroviral elements promote genome instability via non-allelic homologous recombination. BMC Biol. 12:74. doi: 10.1186/s12915-014-0074-4
Cegolon, L., Salata, C., Weiderpass, E., Vineis, P., Palù, G., and Mastrangelo, G. (2013). Human endogenous retroviruses and cancer prevention: evidence and prospects. BMC Cancer 13:4. doi: 10.1186/1471-2407-13-4
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., and Lee, J. J. (2015). SecondS-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4:7. doi: 10.1186/s13742-015-0047-8
Chiu, Y. L., and Greene, W. C. (2008). The APOBEC3 cytidine deaminases: an innate defensive network opposing exogenous retroviruses and endogenous retroelements. Annu. Rev. Immunol. 26, 317–353. doi: 10.1146/annurev.immunol.26.021607.090350
Chuong, E. B., Rumi, M. A., Soares, M. J., and Baker, J. C. (2013). Endogenous retroviruses function as species-specific enhancer elements in the placenta. Nat. Genet. 45, 325–329. doi: 10.1038/ng.2553
Dangel, A. W., Mendoza, A. R., Baker, B. J., Daniel, C. M., Carroll, M. C., Wu, L. C., et al. (1994). The dichotomous size variation of human-complement C4 genes is mediated by a novel family of endogenous retroviruses, which also establishes species-specific genomic patterns among old-world primates. Immunogenetics 40, 425–436. doi: 10.1007/BF00177825
Fanous, A. H., Zhou, B., Aggen, S. H., Bergen, S. E., Amdur, R. L., Duan, J., et al. (2012). Genome-wide association study of clinical dimensions of schizophrenia: polygenic effect on disorganized symptoms. Am. J. Psychiatry 169, 1309–1317. doi: 10.1176/appi.ajp.2012.12020218
Fuchs, N. V., Loewer, S., Daley, G. Q., Izsvák, Z., Löwer, J., and Löwer, R. (2013). Human endogenous retrovirus K (HML-2) RNA and protein expression is a marker for human embryonic and induced pluripotent stem cells. Retrovirology 10:115. doi: 10.1186/1742-4690-10-115
Garrison, K. E., Jones, R. B., Meiklejohn, D. A., Anwar, N., Ndhlovu, L. C., Chapman, J. M., et al. (2007). T cell responses to human endogenous retroviruses in HIV-1 infection. PLoS Pathog. 3:e165. doi: 10.1371/journal.ppat.0030165
Genomes Project, C., Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., et al. (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073. doi: 10.1038/nature09534
Grow, E. J., Flynn, R. A., Chavez, S. L., Bayless, N. L, Wossidlo, M, Wesche, D. J., et al. (2015). Intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells. Nature 522, 221–225. doi: 10.1038/nature14308
Hamel, M. G., Ajmo, J. M., Leonardo, C. C., Zuo, F., Sandy, J. D., and Gottschall, P. E. (2008). Multimodal signaling by the ADAMTSs (a disintegrin and metalloproteinase with thrombospondin motifs) promotes neurite extension. Exp. Neurol. 210, 428–440. doi: 10.1016/j.expneurol.2007.11.014
Hughes, J. F., and Coffin, J. M. (2004). Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: implications for human and viral evolution. Proc. Natl. Acad. Sci. U.S.A. 101, 1668–1672. doi: 10.1073/pnas.0307885100
Karolchik, D., Hinrichs, A. S., Furey, T. S., Roskin, K. M., Sugnet, C. W., Haussler, D., et al. (2004). The UCSC table browser data retrieval tool. Nucleic Acids Res. 32, D493–D96. doi: 10.1093/nar/gkh103
Ke, X., Hunt, S., Tapper, W., Lawrence, R., Stavrides, G., Ghori, J., et al. (2004). The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet. 13, 577–588. doi: 10.1093/hmg/ddh060
Kidd, J. M., Cooper, G. M., Donahue, W. F., Hayden, H. S., Sampas, N., Graves, T., et al. (2008). Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56–64. doi: 10.1038/nature06862
Kreimer, U., Schulz, W. A., Koch, A., Niegisch, G., and Goering, W. (2013). HERV-K and LINE-1 DNA methylation and reexpression in urothelial carcinoma. Front. Oncol. 3:255. doi: 10.3389/fonc.2013.00255
Lee, E., Iskow, R., Yang, L. X., Gokcumen, O., Haseley, P., Luquette, L. J. III., et al. (2012). Landscape of somatic retrotransposition in human cancers. Science 337, 967–971. doi: 10.1126/science.1222077
Ling, J., Pi, W., Yu, X., Bengra, C, Long, Q, Jin, H, et al. (2003). The ERV-9 LTR enhancer is not blocked by the HS5 insulator and synthesizes through the HS5 site non-coding, long RNAs that regulate LTR enhancer function. Nucleic Acids Res. 31, 4582–4596. doi: 10.1093/nar/gkg646
Li, W., Lee, M. H., Henderson, L., Tyagi, R., Bachani, M., Steiner, J., et al. (2015). Human endogenous retrovirus-K contributes to motor neuron disease. Sci. Transl. Med. 7:307ra153. doi: 10.1126/scitranslmed.aac8201
MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., et al. (2017). The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901. doi: 10.1093/nar/gkw1133
Mamedov, I., Lebedev, Y., Hunsmann, G., Khusnutdinova, E., and Sverdlov, E. (2004). A rare event of insertion polymorphism of a HERV-K LTR in the human genome. Genomics 84, 596–599. doi: 10.1016/j.ygeno.2004.04.010
Ma, W. J., Hong, Z. F., Liu, H. L., Chen, X., Ding, L., Liu, Z., et al. (2016). Human endogenous retroviruses-K (HML-2) Expression is correlated with prognosis and progress of hepatocellular carcinoma. Biomed. Res. Int. 2016:8201642. doi: 10.1155/2016/8201642
Mayer, J., Sauter, M., Rácz, A., Scherer, D., Mueller-Lantzsch, N., and Meese, E. (1999). An almost-intact human endogenous retrovirus K on human chromosome 7. Nat. Genet. 21, 257–258. doi: 10.1038/6766
McQueen, M. B., Boardman, J. D., Domingue, B. W., Smolen, A., Tabor, J., Killeya-Jones, L., et al. (2015). The national longitudinal study of adolescent to adult health (add health) sibling pairs genome-wide data. Behav. Genet. 45, 12–23. doi: 10.1007/s10519-014-9692-4
Moyes, D. L., Martin, A., Sawcer, S., Temperton, N., Worthington, J., Griffiths, D. J., et al. (2005). The distribution of the endogenous retroviruses HERV-K113 and HERV-K115 in health and disease. Genomics 86, 337–341. doi: 10.1016/j.ygeno.2005.06.004
Nexø, B. A., Villesen, P., Nissen, K. K., Lindegaard, H. M., Rossing, P., Petersen, T., et al. (2016). Are human endogenous retroviruses triggers of autoimmune diseases? unveiling associations of three diseases and viral loci. Immunol. Res. 64, 55–63. doi: 10.1007/s12026-015-8671-z
Pratto, F., Brick, K., Khil, P., Smagulova, F., Petukhova, V. G., Camerini-Otero, D. R., et al. (2014). Recombination initiation maps of individual human genomes. Science 346:826. doi: 10.1126/science.1256442
Rietveld, C. A., Esko, T., Davies, G., Pers, T. H., Turley, P., Benyamin, B., et al. (2014). Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Natl. Acad. Sci. U.S.A. 111, 13790–13794. doi: 10.1073/pnas.1404623111
Schmitt, K., Richter, C., Backes, C., Meese, E., Ruprecht, K., and Mayer, J. (2013). Comprehensive analysis of human endogenous retrovirus group HERV-W locus transcription in multiple sclerosis brain lesions by high-throughput amplicon sequencing. J. Virol. 87, 13837–13852. doi: 10.1128/JVI.02388-13
Sekar, A., Bialas, A. R., de Rivera, H., Avery Davis, A., Hammond, R. T., Kamitaki, N., et al. (2016). Schizophrenia risk from complex variation of complement component 4. Nature 530, 177–183. doi: 10.1038/nature16549
Simpson, G. R., Patience, C., Löwer, R., Tönjes, R. R., Moore, H. D., Weiss, R. A., et al. (1996). Endogenous D-type (HERV-K) related sequences are packaged into retroviral particles in the placenta and possess open reading frames for reverse transcriptase. Virology 222, 451–456. doi: 10.1006/viro.1996.0443
Sudmant, P. H., Rausch, T., Gardner, E. J., Handsaker, R. E., Abyzov, A, Huddleston, J., et al. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81. doi: 10.1038/nature15394
Turner, G., Barbulescu, M., Su, M., Jensen-Seaman, M. I., Kidd, K. K., and Lenz, J. (2001). Insertional polymorphisms of full-length endogenous retroviruses in humans. Curr. Biol. 11, 1531–1535. doi: 10.1016/S0960-9822(01)00455-9
Wang, J., Song, L., Grover, D., Azrak, S., Batzer, M. A., and Liang, P. (2006). dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329. doi: 10.1002/humu.20307
Wildschutte, J. H., Williams, Z. H., Montesion, M., Subramanian, R. P., Kidd, J. M., and Coffin, J. M. (2016). Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc. Natl. Acad. Sci. U.S.A. 113, E2326–E2334. doi: 10.1073/pnas.1602336113
Keywords: HERV-K, GWAS, polymorphism, eQTL, recombination
Citation: Wallace AD, Wendt GA, Barcellos LF, de Smith AJ, Walsh KM, Metayer C, Costello JF, Wiemels JL and Francis SS (2018) To ERV Is Human: A Phenotype-Wide Scan Linking Polymorphic Human Endogenous Retrovirus-K Insertions to Complex Phenotypes. Front. Genet. 9:298. doi: 10.3389/fgene.2018.00298
Received: 27 April 2018; Accepted: 16 July 2018;
Published: 14 August 2018.
Edited by:Robert Klein, Icahn School of Medicine at Mount Sinai, United States
Reviewed by:Sarven Sabunciyan, Johns Hopkins University, United States
Preetida J. Bhetariya, University of Utah, United States
Kazuaki Monde, Kumamoto University, Japan
Copyright © 2018 Wallace, Wendt, Barcellos, de Smith, Walsh, Metayer, Costello, Wiemels and Francis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.