To ERV Is Human: A Phenotype-Wide Scan Linking Polymorphic Human Endogenous Retrovirus-K Insertions to Complex Phenotypes

Approximately 8% of the human genome is comprised of endogenous retroviral insertions (ERVs) originating from historic retroviral integration into germ cells. The function of ERVs as regulators of gene expression is well established. Less well studied are insertional polymorphisms of ERVs and their contribution to the heritability of complex phenotypes. The most recent integration of ERV, HERV-K, is expressed in a range of complex human conditions from cancer to neurologic diseases. Using an in-house computational pipeline and whole-genome sequencing data from the diverse 1,000 Genomes Phase 3 population (n = 2,504), we identified 46 polymorphic HERV-K insertions that are tagged by adjacent single nucleotide polymorphisms (SNPs). To test the potential role of polymorphic HERV-K in the heritability of complex diseases, existing databases were queried for enrichment of established relationships between the HERV-K insertion-associated SNPs (hiSNPs), and tissue specific gene expression and disease phenotypes. Overall, hiSNPs for the 46 polymorphic HERV-K sites were statistically enriched (p < 1.0E−16) for eQTLs across 44 human tissues. Fifteen of the 46 HERV-K insertions had hiSNPs annotated in the EMBL-EBI GWAS Catalog and cumulatively associated with >100 phenotypes. Experimental factor ontology enrichment analysis suggests that polymorphic HERV-K specifically contribute to neurologic and immunologic disease phenotypes, including traits related to intra cranial volume (FDR 2.00E-09), Parkinson's disease (FDR 1.80E-09), and autoimmune diseases (FDR 1.80E-09). These results provide strong candidates for context-specific study of polymorphic HERV-K insertions in disease-related traits, serving as a roadmap for future studies of the heritability of complex disease.


INTRODUCTION
Retroviruses are a class of RNA virus that undergoes reverse transcription to DNA during the infectious cycle inside a host cell. At the proviral stage, retroviral DNA integrates into the host DNA to produce viral proteins. Integration into germ cells can result in endogenization, wherein the virus can be vertically transmitted via standard Mendelian inheritance mechanisms. Endogenous retroviruses (ERV) are ancient examples of proviruses that integrated and endogenized into the human genome >40mya (Bannert and Kurth, 2006). In modern humans, ERVs account for approximately 8% of the genome (Bannert and Kurth, 2006). Their relative stability, as well as the conservation of orthologs in other primate genomes suggests that they induce genome plasticity and can enhance evolutionary fitness (Cordaux and Batzer, 2009;Feschotte and Gilbert, 2012;Grow et al., 2015). Retroviruses are reliant on the fitness of their host for survival and the long-standing evolutionary cooperation between ERVs and humans may represent a symbiotic relationship (Jern and Coffin, 2008). The positive selection of persistent ERVs in the genome may have resulted from increasing the probability of survival to reproductive age [via adaptive effects on placentation (Simpson et al., 1996); and immune (Hurst and Magiorkinis, 2015) and brain development (Mortelmans et al., 2016)]. The phenotypic effects of ERVs on the post-reproductive adult, however, remain unclear and are of growing interest (Li et al., 2015;Bowen et al., 2016;Sekar et al., 2016).
Previous studies have described the potential mechanisms by which ERVs influence human phenotypes. ERV insertions introduce viral genes and, due to their inter-individual homology can generate copy-number variants via non-allelic homologous recombination (Campbell et al., 2014). They modify transcription by adding enhancers (Chuong et al., 2013) and promoters (Fuchs et al., 2013), disrupting intron structure, causing RNA interference (Ling et al., 2003) adding poly-A tails (Kim, 2012), and altering DNA methylation (Kreimer et al., 2013). ERV expression, typically restricted in healthy tissues except during placental development, is detected in diseases including cancers and autoimmune disorders [reviewed in (Cegolon et al., 2013;Nexø et al., 2016)].
While the functional effects of ERVs are well established, the vast majority of ERV insertions are fixed across individuals, and so their potential contribution to phenotypic variation has been largely overlooked. The HERV-K subfamily, however, contains human-specific, unfixed insertions, ranging from fully intact provirus to solo long terminal repeat (LTR) sequences (Wildschutte et al., 2016). HERV-K represents one of the most recent ERV integrations into the human genome and numerous insertions have neither been eliminated from the genome (via negative selection or drift), nor fixed (via positive selection or Abbreviations: ERV, endogenous retrovirus; HERV-K, human endogenous retrovirus-K; SNP single nucleotide polymorphism; hiSNP, HERV-K insertion associated single nucleotide polymorphism; SVA, sine-VNTR-alu element; eQTL, expression quantitative trait loci; GWAS, genome-wide association study; WGS, whole genome sequencing; MDS, multi-dimensional scaling; LD, linkage disequilibrium. drift). The polymorphic nature of these insertions suggests a potential contribution to causal variation in the heritability of complex phenotypes. Targeted studies have identified specific HERV-K integrations that affect disease risk, for example a polymorphic HERV-KC4 inserted within the complement component 4 (C4A/B) gene appears to be involved in the genetic risk of schizophrenia (Sekar et al., 2016).
Technical limitations have proved a major obstacle in the untargeted identification of polymorphic HERV-K insertions for application to clinical and epidemiologic studies. With the emergence of next-generation sequencing technologies, methods are being developed for the untargeted identification of ERVs among other mobile genetic elements in human genomes (Witherspoon et al., 2010;Ray and Batzer, 2011;Wildschutte et al., 2016). Here, we examine phenotypic effects of all polymorphic HERV-K insertions identifiable from a large, publically available whole genome sequencing (WGS) dataset. With our computational pipeline, we identified HERV-K insertion locations using data from the diverse 1000 Genomes Phase 3 population (n = 2,504). By identifying a subset of polymorphic HERV-K insertions with strong associations to adjacent "tagging" single nucleotide polymorphisms (SNPs), we have leveraged several comprehensive SNP annotation databases to test for enrichment of established relationships between HERV-K insertion-associated SNPs (hiSNPs), tissue-specific gene expression, and diverse disease phenotypes.

HERVnGoSeq Computational Pipeline
To elucidate the broad phenotypic effects of polymorphic HERV-K insertions, we developed a computational pipeline, HERVnGoSeq, to identify the presence/absence of known and novel HERV-K insertions in individual WGS data ( Figure S1). Quality-filtered raw WGS were aligned to HERV-K113, one of the youngest HERV-K elements in the human genome with a conserved intact LTR sequence that is also capable of producing viral particles in vitro (Boller et al., 2008). Reads that partially aligned to HERV-K113 -chimeric reads -were trimmed and the non-HERV portions of the reads were extracted. The trimmed chimeric reads were then aligned to the human genome (GRCh37/Hg19). The base-pair position of the trimmed end of the read where HERV-K sequence was removed was called as the insertion point. Insertion points were collected for both the forward and reverse complement alignments separately. Insertion points within 1,000 bp of each other were grouped to represent a single insertion point. The presence of putative HERV-K insertions were assigned to each individual if they had at least one chimeric read that aligned to that insertion point. Absence of an insertion was inferred for individuals when they lacked any chimeric reads representing the specific insertion. The complete pipeline and description is available at https://github. com/unreno/chimera.

HERV-K Identification/Validation
Putative polymorphic HERV-K elements were nominated via HERVnGoSeq ( Figure S1). Sequence similarity between the reference index, HERV-K113 LTR, and the HERV-K10 LTR, which was ancestrally co-opted to form another mobile element class, Sine-VNTR-Alu composite elements (SVA, Hancks and Kazazian, 2010), resulted in nomination of insertion sites of SVA-A, B, and C in addition to HERV-K when the LTR portion of the SVA element was sufficiently conserved. Thus, true HERV-K insertion sites nominated by HERVnGoSeq were identified as follows: HERV-K present in reference-Using the dataset of mobile genetic elements present in the GRCh37 derived from RepBase (Bao et al., 2015) (UCSC RepeatMasker track), HERVnGoSeq nominated sites were confirmed as HERV-K if mapped within a known HERV-K ± 100 bp. HERV-K absent from reference-The remaining polymorphic insertion sites were determined to be HERV-K only if the insertion ± 100 bp was previously reported and confirmed with sequencing by a previous study (Dangel et al., 1994;Barbulescu et al., 1999;Mayer et al., 1999;Turner et al., 2001;Bennett et al., 2004;Hughes and Coffin, 2004;Macfarlane and Simmonds, 2004;Mamedov et al., 2004;Belshaw et al., 2005;Moyes et al., 2005;Kidd et al., 2008;Lee et al., 2012;Marchi et al., 2014;Sudmant et al., 2015;Wildschutte et al., 2016). Otherwise, they could not be distinguished from SVAs in silico. Additional HERV-K insertions missed by HERVnGoSeq but identified by previous studies and genotyped in the 1KG population were also included (n = 5) (Sudmant et al., 2015;Wildschutte et al., 2016). Prevalence of each insertion site was estimated based on either the presence/absence calls from HERVnGoSeq or from genotypes of HERV-K insertions from previous studies (n = 5). To ensure that polymorphic HERV-K insertions could not be explained by larger deletions, each insertion site was compared to start and end locations of deletions called in 1KG structural variant VCF. Any HERV-K insertion with a flanking deletion larger than 1,000 bp could not be reliably called polymorphic and was excluded from downstream analyses.

Identification of SNPs Associated With Polymorphic HERV-K Insertions
Most studies of common genetic disease-risk variants published to-date rely on SNPs, which are easy and cheap to measure compared to other structural genetic variants. These SNP studies rely on linkage disequilibrium (LD) wherein the diseaseassociated SNP is not necessarily the causal variant but instead tags the causal variant (outlined in Figure S2). To test our underlying hypothesis that disease-associated SNPs are, in some cases, tagging polymorphic HERV-K insertions, which are the true causal variants, we next identified SNPs associated with each HERV-K insertion and queried existing SNP:disease databases for phenotypic associations. All HERV-K insertion sites were tested for SNP associations. For each of the 2,504 individuals in 1000 Genomes Phase 3, a binary indicator of presence/absence of the HERV-K insertion FIGURE 1 | Manhattan plot. Results from a genome-wide association study for a polymorphic HERV-K insertion discovered with HERVnGoSeq at chr1:111802591. The gray horizontal line represents the threshold for genome-wide significance of 5 ×10 −8 . Logistic regression was performed separately within each of the five super populations in the 1,000 Genomes. AFR, African; AMR, Ad Mixed American; EAS, East Asian; EUR, European; SAS, South Asian. was generated via HERVnGoSeq or by recoding genotypes generated from previous publications (Sudmant et al., 2015;Wildschutte et al., 2016). Variant files for 1000 Genomes Phase 3 were downloaded from the FTP site (NCBI FTP site: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp). After stratifying by continental population and removing related individuals (Gazal et al., 2015), all biallelic SNPs present in the 1,000 Genomes Phase 3 variant files were tested for association with the presence of each of the polymorphic HERV-K insertions using logistic regression adjusted for population stratification by including the first 6 multidimensional scaling (MDS) vectors in all models. MDS components were generated from all 1,000 Genomes variants following pruning for common SNPs (minor allele frequency MAF > 0.05) and for independence followed by random thinning to 10% of variants. All logistic regression modeling and MDS estimation were conducted in Plink 1.9 (Chang et al., 2015).
Manhattan plots were generated to visualize associations between genome-wide SNPs and polymorphic HERV-K insertions. For each of the "taggable" HERV-K insertion sites (i.e., those that showed a single, strong association peak in the Manhattan plot), hiSNPs were defined as all SNPs within a 1 Mb window of the insertion with p-value for association less than or equal to the Bonferroni adjusted p-value threshold for significance (0.05/Total SNPs in 1 Mb window).

Sensitivity of hiSNP Set Generation
To confirm that the binary HERV-K insertion presence/absence calls made by HERVnGoSeq generated hiSNP sets similar to hiSNPs for HERV-K insertions called using a different pipeline, we identified 12 HERV-K insertions detected by HERVnGoSeq and the 1000 Genomes by Sudmant et al. (2015). For the 12 HERV-K insertion sites detected by HERVnGoSeq and genotyped by Sudmant et al., logistic regression-based SNP associations were estimated from the binary HERVnGoSeq calls and the dichotomized 1,000 Genomes genotype calls within Europeans. hiSNPs generated by HERVnGoSeq and 1,000 Genomes calls using the method described above were compared. To ensure that hiSNPs associated by logistic regression were representative of SNPs that are in LD with polymorphic HERV-K insertions, complete genotype data for the 12 overlapping HERVnGoSeq/1,000 Genomes sites were used to identify tagging SNPs via the r 2 measure of LD, defined as having an r 2 > 0.2.

Expression Quantitative Trait Loci (eQTL)
Polymorphic HERV-K insertion hiSNPs across all HERV-K insertion sites were pooled and tested for eQTL (SNP-gene expression association with p < 0.05 adjusted for multiple tests) enrichment against all common SNPs included in the tissuespecific Genotype-Tissue Expression (GTEx) Project Version 6 (n = 11,555,102) (Carithers and Moore, 2015) using a Fisher's exact test. The null hypothesis for this test is that the odds ratio ([hiSNP&eQTL]x[Not hiSNP&Not eQTL])/([Not hiSNP&eQTL]x[Not eQTL&hiSNP]) = 1. Enrichment of hiSNPs annotated as GTEx eQTLs were also calculated separately by HERV-K insertion site and tissue type using Fisher's exact tests.

Genome-Wide Association and Experimental Factor Ontology Enrichment
To investigate whether hiSNPs for the polymorphic HERV-K insertion sites have established phenotypic associations, the EMBL-EBI GWAS Catalog (MacArthur et al., 2017) was queried for the presence of hiSNPs. To test for broader phenotypic enrichment across the HERV-K insertion sites, experimental factor ontology enrichment analyses were conducted for all pooled hiSNPs using the XGR online tool (http://galahad.well.ox.ac.uk:3020) with significant enrichments having a false discovery rate < 0.05.

SNP Density
To determine whether the absence of hiSNPs for some HERV-K insertions sites was due to the absence of any proximal SNPs, SNP density was calculated for all insertion sites. All SNPs in the 1,000 Genomes Phase 3 dataset were counted within a 1 MB, 500 Kb, and 100 Kb window centered on each polymorphic HERV-K insertion site. The mean SNP densities for HERV-K insertions with hiSNPs was compared to HERV-K insertions with prevalence estimates from 0.2-0.8 and no hiSNPs using a Student's t-Test. Two HERV-K sites located in unlocalized contigs (chr1_gl00192_random) were excluded.

Hotspot Distance
To examine whether some HERV-K insertions lacked hiSNPs due to their proximity to recombination hotspots, we selected two recombination hotspot maps and calculated the distance between the HERV-K insertion sites and recombination hotspots. We identified two datasets mapping the genomic locations of recombination hotspots genome-wide-one using a populationaverage LD-based mapping method (4,697 hotspots, Genomes Project et al., 2010) and the other using ChIP-seq to identify PRDM9 binding sites among five individuals (62,110 hotspots, Pratto et al., 2014). Repeated random sampling with replacement was used to estimate the distribution of mean distances between randomly selected genomic locations and the nearest recombination hotspot. In order to compare this distribution to the distribution of distances of polymorphic HERV-Ks, a pool of random genomic locations was created wherein locations were matched to the HERV-K sites by chromosome and GC content of flanking 2 Kb region. HERV-K insertion sites on chromosome Y and unmapped contigs were excluded (n = 11). First, the percentage GC content of the 2 kb flanking each HERV-K was calculated using data from the UCSC gc5Base Track (Karolchik et al., 2004) in Hg19. Next, all chromosomes were divided into 2 Kb segments and percentage GC content was calculated for each. GC content for all segments was calculated to the tenth of a percent. For each HERV-K insertion, 2 Kb genomic segments were added to the sampling pool if they had identical GC content and were on the same chromosome. This resulted in an average of 444 random locations from which to sample for each HERV-K site (∼187,000 total). From this pool, 172 random genomic locations (defined as the center of the 2 Kb fragment), matched 1:1 to respective HERV-K insertions, were sampled and the mean distance to nearest recombination hotspots were calculated. This sampling procedure was repeated 1,000 times. The entire process was carried out independently with each of the recombination hotspot maps.

Polymorphic HERV-K Identification
Our computational pipeline, HERVnGoSeq, nominated 1,381 putative HERV-K insertion sites among 2,504 human whole genomes from the 1000 Genomes Phase 3 population where sequencing reads partially aligned to the HERV-K113 LTR. HERV-K113 represents the one of the most recent HERV-K integration and thus is most likely to be polymorphic and to have preserved function (Turner et al., 2001). Of the 1,381 sites, 403 HERV-K insertions mapped to reference HERV-K breakpoint sequences in GRCh37/Hg19. A total of 783 putative sites mapped to reference SINE-alu-VNTR (SVA) insertions and were discarded. Of the remaining 195 non-reference putative insertions, 28 had been previously annotated and confirmed as HERV-K via sequencing in recent studies (Sudmant et al., 2015;Wildschutte et al., 2016) and the remainder, many of which are likely SVA, will require future targeted sequencing to confirm. With the rapidly increasing rate of discovery of HERV-K insertions in human genomes, we were able to collate an additional 5 HERV-K insertion sites discovered in a parallel study (Wildschutte et al., 2016), which were also genotyped in the 1000 Genomes population. In total, the 431 (403 reference and 28 nonreference) HERV-K insertions were tested for SNP associations ( Table S1). The dichotomized presence/absence, rather than genotypes, of the nominated 431 HERV-K insertion sites were tested for SNP associations to reduce potential misclassification induced by poor sensitivity of calls due to low sequencing depth (∼4x). An additional 16 HERV-K insertion sites identified in independent populations of diseased individuals [The Cancer Genome Atlas (Marchi et al., 2014); dbRIP (Wang et al., 2006)] were not detected by HERVnGoSeq, nor any parallel study utilizing the 1000 Genomes Project, suggesting that the diversity of HERV-K insertion sites expands beyond what is represented in the 1000 Genomes population.

hiSNP Identification
All HERV-K insertions detected in more than one individual (n = 431) were tested for SNP associations using logistic regression.
HERVnGoSeq did not identify any HERV-K insertions occurring in 100% of 2,504 individuals; however, low coverage sequencing data likely resulted in an underestimation of the prevalence of some insertion sites. Thus it is likely that some sites with high prevalence called by HERVnGoSeq are actually fixed across populations. Genome-wide logistic regression stratified by continental population and adjusted for genetic ancestry revealed 46 polymorphic HERV-K insertion sites with significant SNP associations (hiSNPs) in at least one continental population after correction for multiple testing (Table 1,  Table S2, Figures S3-S50). Figure 1 shows a Manhattan plot of the genome-wide SNP association results for a polymorphic HERV-K insertion located on chromosome 1.
Of these, 13 were not previously known to be polymorphic. The majority of the remaining 385 HERV-K insertion sites with no identifiable hiSNPs were rare (prevalence < 0.2, n = 48), singletons (n = 14), or common and potentially fixed (prevalence > 0.8, n = 190), which may explain the lack of association with neighboring SNPs. However, 129 HERV-K sites appear to be common and unfixed (prevalence restricted to 0.2-0.8) yet have no hiSNPs and thus could not be evaluated for phenotype enrichment in this study (Figure 2).
Among the 46 HERV-K insertion sites with strong SNP associations, hiSNP sets were selected within a 1-megabase window to ensure that no strongly associated SNPs were excluded arbitrarily due to genomic distance. The median number of hiSNPs associated with each of these 46 HERV-K insertions was FIGURE 4 | Manhattan plots. Plots of hiSNP sets for 15 HERV-K insertion sites among the 30 polymorphic HERV-K insertion sites with strong SNP associations in the European continental population. Vertical black lines denote HERV-K insertion locations. hiSNPs that are annotated in the NHGRI-EBI GWAS Catalog are represented by red points. Table 1). A complete summary including odds ratio and p-values for all hiSNP-HERV-K associations can be found in Table S3.

(
To ensure that the selected hiSNPs represented SNPs in true linkage disequilibrium with the HERV-K insertions, a subset of 12 sites that were identified by HERVnGoSeq and also genotyped in an independent study of structural variation in the 1,000 Genomes Project (Sudmant et al., 2015) were selected for sensitivity analyses. hiSNP sets selected by HERVnGoSeq were compared to hiSNP sets derived from logistic regression with outcome being dichotomized genotypes of insertions called by the 1,000 Genomes, and LD via r 2 values based on maximum likelihood phasing. HERVnGoSeq logistic regressionbased hiSNP sets consistently nominated the greatest number of hiSNPs across the 12 sites and, for 10 sites, >75% of HERVnGoSeq hiSNPs were also hiSNPs derived from the genotyped insertions via logistic regression or r 2 (Table S2, Figures S51-S63).

Tissue-Specific Differential Gene Expression and Disease Enrichment
The Genotype-Tissue expression (GTEx) project provides expression quantitative trait loci (eQTL) analysis results from genotype and gene expression data derived from 449 individuals across 44 human tissues (Carithers and Moore, 2015). We tested whether the HERVnGoSeq derived hiSNPs for each of the HERV-K insertion sites were enriched for eQTLs based on these data. Because the GTEx individuals are ∼85% white, we restricted these analyses to the hiSNP sets identified among the 30 polymorphic HERV-K insertion sites in the European continental population (Table S1) to reduce confounding by population stratification. We observed enrichment (p < 0.05) of hiSNPs for eQTLs in at least one tissue type for 22 of the 30 sites by Fisher's exact test (Figure 3). HERV-K insertion sites contributed the most eQTL associations to subcutaneous adipose tissue and thyroid tissue with 15 HERV-K sites each. Fifteen of the sites with hiSNP sets enriched for eQTLs also include SNPs associated with disease by GWAS (Figure 3). The number of genes for which individual HERV-K insertion hiSNPs served as eQTLs ranged from 1 to 75 (Table S4). Often in the instances where a large number of genes were affected, the HERV-K insertion and genes occurred on blocks of extended LD (i.e., the major histocompatibility complex).
We next examined the 30 hiSNP sets identified in the European continental population for annotation in the NHGRI-EBI GWAS Catalog. Half of the HERV-K insertion sites had at least one hiSNP with a genome-wide significant association with a disease phenotype (Figure 4). In total, European polymorphic HERV-K insertions are associated with 80 human phenotypes ( Table 2).

Analyses of "Untaggable" HERV-K Insertion Sites
The majority of polymorphic HERV-K insertions identified via HERVnGoSeq were not associated with any nearby SNPs (n = 129) and could not be evaluated for existing phenotypic associations in this study. Fifty-three of these HERV-K insertions (estimated mean prevalence: 45.6%, range: <1-79.7%), occur within genes ( Table S6).
The distribution of polymorphic HERV-K in specific chromosomal regions (ex: telomeres, centromeres) did not explain the lack of strong hiSNP associations in the 129 identified polymorphic sites (Figure 2A). However, we suspected that they might differ from sites with strong SNP associations in two respects-proximal SNP density and distance to nearest recombination hotspots, which both effect neighboring patterns of LD (Ardlie et al., 2002;Ke et al., 2004). We found significantly lower SNP densities in the areas around HERV-K insertion sites without hiSNPs (mean 2813.2) than the areas around HERV-K insertions with hiSNPs (mean 3392.7, difference in means: 579.5 SNPs, p = 0.0007). However, the presence of SNPs flanking the 129 HERV-K insertions without hiSNPs suggests that SNP density is not a sufficient determining factor.
ERVs can be involved in homologous and non-homologous recombination events (Campbell et al., 2014). Some are enriched for PRDM9 binding motifs (Campbell et al., 2014) and elimination through recombination is a major mechanism by which ERV sequences are removed from the human genome (Katzourakis et al., 2007). We investigated whether recombination at HERV-K insertion sites explained the lack of hiSNP associations with these 129 HERV-K insertions by measuring their proximity to known recombination hotspots.
HERV-K insertion sites without hiSNP associations were farther on average from mapped recombination hotspots than their hiSNP-associated counterparts (222.2 kb difference in distance to LD hotspots, p = 0.04, and 40.1 kb difference in distance to ChIP-seq hotspots, p = 0.008). To determine whether the distance of polymorphic HERV-K insertions from recombination hotspots was greater than expected by chance, we compared the mean distance of these 172 polymorphic HERV-K insertions (46 with hiSNPs and 129 without) to the distances from hotspots of repeated random samples of 172 genomic locations. To mitigate potential confounding, random genomic locations were matched to HERV-K insertions sites on chromosome and flanking 2 kb GC content. Polymorphic HERV-K without hiSNPs were farther from recombination hotspots than randomly selected genomic locations (p < 0.005), whereas insertions with hiSNPs were not farther (or closer) to recombination hotspots than expected by chance (Figure 5).

DISCUSSION
This study shows that polymorphic HERV-K insertions occur in regions of the genome enriched for phenotypic function and, furthermore, that these insertion variants co-occur with established disease-risk variants, providing previouslyuntested candidates for the functional elements underlying the heritability of numerous complex diseases. Using our computational pipeline, HERVnGoSeq, and the diverse 1,000 Genomes population, we confirmed the presence of 33 known polymorphic HERV-K insertions and identified an additional 13 confirmed sites via strong SNP associations not previously recognized as polymorphic. Of the total 46 HERV-K insertions under investigation, 22 have hiSNP sets enriched for eQTLs and 15 contained disease-associated SNPs identified in prior GWAS.
The collective evidence put forth by annotated hiSNPs supports a role for HERV-K insertions in inducing phenotypic effects. There is previous evidence that polymorphic HERV-K insertions may affect brain function. Previous studies have established links between HERV-K and amyotrophic lateral sclerosis (Alfahad and Nath, 2013), HIV-associated dementia (Garrison et al., 2007), and Schizophrenia (Sekar et al., 2016). Our results further support these links and provide specific candidate polymorphisms that may explain these observations. We found two polymorphic HERV-K insertions (chr5:64388440 and chr6:32648036) whose hiSNP sets include a GWAS hit for schizophrenia. The association of the hiSNP at the insertion chr6:32648036 has already been attributed to the presence of a polymorphic HERV-K at the complement component 4 (C4) locus, resulting in altered expression at C4A and C4B, which we also observed in our eQTL enrichment results (Table S4) (Sekar et al., 2016). The hiSNP set at the second site at chr5:64388440 contains a SNP that is associated with schizophrenia symptoms relating to hallucination, delusion, and paranoia (Fanous et al., 2012) and both the SNP and the HERV-K are located directly upstream of and serve as eQTLs for ADAMTS6, a gene among a family that experimentally induces neurite growth in cultured neurons (Hamel et al., 2008). One of the most strongly associated hiSNPs for another HERV-K at chr4:120263688 was previously identified in a GWAS examining the genetics of cognitive performance using   20228799, 23128233, 19915573, 23511034, 18836448, 20228798 (Continued) Frontiers in Genetics | www.frontiersin.org proxy-phenotypes (Rietveld et al., 2014). Experimental factor ontology enrichment analysis of hiSNPs also suggested a largely neurological phenotypic effect of polymorphic HERV-K wherein Parkinson's disease, intracranial volume, and temporal arteritis were the seconds, third, and fourth most significantly enriched terms, respectively. Enrichment analysis also suggests that HERV-K insertion sites may have a functional role in autoimmune diseases. The role of polymorphic HERV-K in immunity, particularly insertions within the HLA, is difficult to delineate. While no specific associations have been established between HERV-K and autoimmunity, strong evidence suggests a link between multiple sclerosis and expression of HERV-W (Schmitt et al., 2013); and the role of ERVs in autoimmunity has been long suspected, but its study has been hindered by technological limitations. With the increasing availability of nextgeneration sequencing data and computational methods like HERVnGoSeq, the time is ripe for a thorough investigation of polymorphic HERVs in autoimmune disease. HERV-K expression has frequently been noted in human cancers and has also been of interest as an etiologic factor. We found hiSNP associations with Hepatocellular carcinoma (HCC) tagging two polymorphic HERV-K insertions. Recent studies identified an increase in HERV-K expression in HCC vs. normal tissue (Ma et al., 2016) and also discovered that HCC tumor mutations are frequently caused by APOBEC enzymes, a component of the human innate immune system primarily active against ERVs (Chiu and Greene, 2008). As such, the role of polymorphic HERV-K in interaction with hepatitis viruses and HCC appears warranted.
The HERV-K LTR is known to contain enhancer elements and thus the degree to which hiSNPs were enriched for eQTLs was not unexpected. An advantage of using the GTEx database is the ability to determine tissue-specific eQTL activity, which can help discern the phenotypic effects of polymorphic HERV-K. For example, we observe that the hiSNPs for a polymorphic HERV-K at chr19:22414379 are enriched for eQTLs in adipose tissue, suggesting that the insertion could affect fat storage. Indeed, the hiSNP set for this insertion also contains a GWAS SNP associated with changes in body mass index over time (McQueen et al., 2015).
It is possible that the HERV-K with hiSNP eQTL enrichment is not itself the variant altering cis-gene expression, particularly in cases where the HERV is inserted on a large LD block, for example within the HLA region (chr6:32505702, chr6:32648036, chr6:32746812). Often these regions are not fully explored in functional follow-up studies due to their complexity. Since HERV-K LTRs contain functional elements, they serve as strong candidates for eQTLs regardless of regional complexity, warranting additional studies. It is also possible that the observed relationships between eQTLs and HERV-K insertions differ across populations. However, the GTEx consortium primarily consists of samples derived from european individuals and thus the relationship between polymorphic HERV-K and gene expression can currently only be inferred in this population.
We could not leverage SNP annotations to illuminate the function of the majority of polymorphic HERV-K insertions because they did not associate with any neighboring SNPs. In addition to the 46 HERV-K insertions with hiSNPs, we also nominated 129 reference HERV-K insertions that are likely polymorphic and that may have yet-undetected phenotypic associations. It is possible that some of the lower prevalence sites without hiSNPs are fixed and were called polymorphic only because of poor detection sensitivity of HERVnGoSeq. However, this seems unlikely, as our prevalence estimates for previously recognized non-reference polymorphic HERV-K were usually within ∼10% of previous studies' estimates ( Figure S65). Pairwise correlations of SNPs directly adjacent to these HERV-K insertions suggest that there is LD in these regions (data not shown), but that the HERV-K insertions are "dark variants" that are not correlated with proximal SNPs. One potential explanation why the majority of polymorphic HERV-K insertions fail to have strong SNP associations is the greater than expected distance to the nearest recombination hotspot. Patterns of LD are known to strengthen the closer variants are to a hotspot, with complete loss of LD within the hotspot itself. The observed decay of HERV:SNP LD farther from hotspots requires further study. It is also possible that there is hotspot activity near or within these HERVs that were not identified and included in the two hot spot maps used for this study. Breakdown of LD patterns surrounding these HERV-K insertions may also be explained by other mechanisms that could not be investigated in the present study, including frequent sporadic non-allelic homologous recombination events, evolutionarily recent integration, offtarget mutagenic activity of HERV-K repressors such as APOBEC enzymes, or hypermethylation resulting in sporadic deamination of methylated cytosines.
In our survey of phenotypic associations with polymorphic HERV-K insertions, the greatest limitation was the poor sensitivity of detection of HERV-K due to the low coverage of FIGURE 5 | Mean distance to nearest recombination hotspot. Distances indicated for polymorphic HERV-K insertions with and without hiSNPs (dashed lines) and the distribution of mean distances of random genomic locations matched to HERV-K insertions on proximal GC content and chromosome. Distributions were derived from 1000 repeated random samples with replacement. (A) Distances from nearest ChIP-seq-based recombination hotspot, (B) Distances from nearest LD-based recombination hotspot.
the sequencing data available for the 1,000 Genomes population. Consequently, we were not able to call genotypes or estimate prevalence with high precision. Similar pipelines that have used these data to genotype mobile genetic elements often include an imputation step. Our observation, that a significant number of HERV-K insertions lack SNP associations, likely impedes the ability and reliability of imputation-based methods for genotyping these polymorphisms. This may also explain why so few members of the HERV-K family have been recognized as polymorphic. We anticipate that the accuracy of genotyping HERV-K insertions will greatly increase with higher coverage sequencing data.
Polymorphic HERV-K elements are associated with the germline risk of myriad phenotypes. While this study provides a starting point for further investigation, disease-specific epidemiologic and functional studies are needed to elucidate the role of specific polymorphic HERV-K insertions in complex diseases.

ETHICS STATEMENT
This study was carried out in accordance with the recommendations of our institutional review board at both the University of California, San Francisco and University of Nevada, Reno. The protocol was approved by the UCSF and UNR IRB (IRB# 1178149-1, 992935-1). Only previously collected data was used in this analysis where original investigators required all subjects gave written informed consent in accordance with the Declaration of Helsinki.

AUTHOR CONTRIBUTIONS
SF and AW conceived of the study, conducted and assisted in analysis, and wrote the paper. GW developed bioinformatic pipelines, managed data, and assisted in manuscript preparation. LB, AdS, KW, CM, JC, JW assisted in analyses, provided technical expertise and aided in manuscript preparation and writing.

FUNDING
We thank our funders (NIH/NCI 5T32CA151022-05 PI:Costello, NIH/NIGMS 1R15GM126562-01 PI:Francis), and the countless other studies generating publicly available data to further our understanding of disease.