Related Endogenous Retrovirus-K Elements Harbor Distinct Protease Active Site Motifs

Background: Endogenous retrovirus-K is a group of related genomic elements descending from retroviral infections in human ancestors. HML2 is the clade of these viruses which contains the most intact provirus copies. These elements can be transcribed and translated in healthy and diseased tissues, and some of them produce active retroviral enzymes, such as protease. Retroviral gene products, including protease, contribute to illness in exogenous retroviral infections. There are ongoing efforts to test anti-retroviral regimens against endogenous retroviruses. Herein, we examine the potential activity and diversity of human endogenous retrovirus-K proteases, and their potential for impact on immunity and human disease. Results: Sequences similar to the endogenous retrovirus-K HML2 protease and reverse transcriptase were identified in the human genome, classified by phylogenetic inference and compared to Repbase reference sequences. The topologies of trees inferred from protease and reverse transcriptase sequences were similar and agreed with the classification using reference sequences. Surprisingly, only 62/480 protease sequences identified by BLAST were classified as HML2; the remainder were classified as other HML groups, with the majority (216) classified as HML3. Variation in functionally significant protease motifs was explored, and two major active site variants were identified – the DTGAD variant is common in all groups, but the DTGVD motif appears limited to HML3, HML5, and HML6. Furthermore, distinct RNA expression patterns of protease variants are seen in disease states, such as amyotrophic lateral sclerosis, breast cancer, and prostate cancer. Conclusion: Transcribed ERVK proteases exhibit a diversity which could impact immunity and inhibitor-based treatments, and these facets should be considered when designing therapeutic regimens.


INTRODUCTION
Retroviridae is a diverse family composed of both exogenous infectious viral species whose life cycle includes stages with a ssRNA genome inside virions which is converted into a dsDNA provirus, and endogenous proviral species transmitted in a Mendelian fashion. The evolutionary pressures on ERVs and exogenous retroviruses (XRVs) are very different; ERVs experience much stronger negative selection against pathogenicity since their survival depends directly on reproduction of their host (Holmes, 2011;Stewart et al., 2011;Barbeau and Mesnard, 2015). If fact, some ERVs are directly involved in placentation (Mi et al., 2000), and some host genes descend from retroviral ancestors (Kaneko-Ishino and Ishino, 2012). Indeed, the role of ERVs in the evolution and function of their host genome is becoming more clear (Bannert and Kurth, 2004;Cowley and Oakey, 2013). The apparent inactivity of many ERVs is probably a direct consequence of this pressure, as many ERV integrations contain inactivating mutations that disrupt the function or expression of viral genes. The most severe inactivating mutation is the case of solo LTRs, where the entire coding region of the ERV is deleted. Despite this, a surprising number of ERVs appear to lack obvious inactivating mutations, and some are even infectious, blurring the line between ERV and XRV (Boller et al., 2008;Jha et al., 2011;Denner and Young, 2013;Kozak, 2014). Although no infectious ERV has been proven to exist in the human genome, some loci are polymorphic and population genetics analysis and functional studies have not ruled out low-level ongoing infectious replication of some human ERVs (Turner et al., 2001;Jha et al., 2011;Naveira et al., 2014;Wildschutte et al., 2014Wildschutte et al., , 2016. Here we focus on the Betaretrovirus-like human ERVK, and particularly the HML2 clade (HK2), which includes loci encoding functional enzymes, and whose youngest members may be less than 200,000 years old (Franklin et al., 1988;Boller et al., 2008). Actively transcribed ERVs, such as ERVK-10, are assigned names by the Human Gene Nomenclature Committee as recommended by Mayer et al. (2011). The biological role of transcribed ERVK elements is becoming more clear in recent years (Ruda et al., 2004;Macfarlan et al., 2012;Manghera et al., 2014Manghera et al., , 2015Michaud et al., 2014;Schlesinger et al., 2014;Grow et al., 2015;Li et al., 2015;Bray et al., 2016;Christensen, 2016;Becker et al., 2017;Prudencio et al., 2017), but the effects of individual viral proteins is understudied as compared to their XRV counterparts. Protease-mediated maturation of retroviral structural proteins and enzymes is critical to the XRV life-cycle; viruses lacking PR activity produce virions with an immature morphology and are not infectious (Kohl et al., 1988). Even though ERVs are not known to replicate by infection of new cells, they can increase in copy number by reinfection of their host cell; this process called reintegration requires the activity of all the viral enzymes (Dewannieux et al., 2006). Because PR function is conserved across Retroviridae, a virus with an inactive protease may be complemented by another retroviral protease. The factors which permit or exclude this complementation must be context dependent; for example, the HK2 protease encoded by ERVK-10 cleaves HIV-1 Gag in vitro, although not in virio (Towler et al., 1998). The contextual nature of complementation is not perfectly understood, so care must be taken in applying findings across model systems, and particularly in the context of human disease (Towler et al., 1998;Contreras-Galindo et al., 2012;Morandi et al., 2015).
Structural analysis proceeds most easily using crystallographic structural determination, which is not yet available for the HK2 PR; however, biochemical and genetic analyses reveal a typical A2 aspartic protease, with each 106-residue monomer of the mature dimer processed autocatalytically from a larger precursor (Mueller-Lantzsch et al., 1993;Towler et al., 1998;Kuhelj et al., 2001;Dunn et al., 2002). The active HK2 PR is moderately sensitive to pepstatin, with a pH optimum around 4.5 (Towler et al., 1998;Kuhelj et al., 2001). Some HK2 elements (such as ERVK-10) encode a functional protease, and HK2 virions with condensed cores have been observed budding from human cells (Mueller-Lantzsch et al., 1993;Sauter et al., 1995;Towler et al., 1998;Boller et al., 2008). Furthermore, the two reconstructed HK2 viruses Phoenix and ERVK CON have a PRdependent infectivity (Dewannieux et al., 2006;Lee and Bieniasz, 2007). Although the substrate affinity of the HK2 PR has not been extensively studied, the PR cleavage sites in the HK2 Gag polyprotein are known (Kraus et al., 2011), as is its susceptibility to select PIs (Towler et al., 1998;Kuhelj et al., 2001).
In addition to catalyzing maturation of viral proteins, PR may play a role in immunity and pathology of human diseases. Protease pathogenicity has been best studied in HIV-1, whose PR is known to cleave a variety of host proteins (Shoeman et al., 1991;Impens et al., 2012), and to trigger apoptosis under specific conditions (Shoeman et al., 1991;Rumlova et al., 2014). HIV-1 PR also directly combats innate immunity by trafficking the dsRNA sensor RIG-I to the lysosome (Solis et al., 2011). Another example of HIV PR inhibition of innate immunity is its ability to cleave RIPK1 and RIPK2 proteins, thus successfully abrogating NF-κB signaling (Wagner et al., 2015). The effect of proteases is not limited to innate immune signaling proteins, but also effector proteins such as intrinsic restriction factors. Both FIV and MLV cleave APOBEC3 in their respective hosts as a means to evade this antiviral defense (Abudu et al., 2006;Yoshikawa et al., 2017). Given the known pathogenic potential of retroviral proteases, it seems prudent to explore the potential effects of PRs encoded by ERVK which, in contrast to HIV-1, are present in all human beings -and which have been associated with specific human diseases, such as ALS, schizophrenia, rheumatic disease and cancer Seifarth et al., 2005;Reynier et al., 2009;Douville et al., 2011;Li et al., 2015;Christensen, 2016).
Here, we undertake genomic, sequence-function, and transcriptomic analysis of the diverse ERV PRs in the human genome. We draw on the existing ERV and XRV literature to evaluate the phylogenetic distribution of sequence motifs with potential functional relevance, focusing on HK2 PRs. We accomplish this by identifying human ERV protease sequences, inferring their phylogenetic affiliation to each other, and comparing them to representative sequences. The classification process was carried out in parallel using RT as a standard comparator. Here we show that distinct ERVK PR variants with predicted differences in their functional activity are differentially transcribed in disease-relevant human tissues.

Representative Retroviral Motifs
With the goal of using the most recent human genome build GRCh38 to identify loci encoding retroviral PR and RT enzymes, HMMs were used as a search tool. Pfam-A was searched for all relevant protein families related to retroviruses. Table 1 describes the output from this search, revealing 92 HMMs in total, with 40 directly related to retroviral core and accessory proteins. Most important amongst these are the retroviral protease domain RVP, and the RT core domain RVT_1. Retrovirus-associated HMMs (Table 1), the PR sequence of ERVK-10 (Towler et al., 1998), and the RVT_1 domain of Phoenix (Dewannieux et al., 2006), were subsequently used to identify pro and pol sequences in representative retroviral sequences and in the human genome.

Representative Endogenous Retroviral Sequences
As expected, BLAST identified more, but less diverse sequences, than HMMER in both the human genome and Repbase consensa. tBLASTn for HK2 PR matched every ERVK clade except HK10, but no other ERV. HMMER identified an RVP domain in every ERVK clade, and most other ERVs. The regions identified by each method overlap. tBLASTn for the Phoenix RVT_1 domain matched most ERVs. HMMER also identified an RVT_1 domain in most ERVs. The results of these methods mostly overlapped, with some notable differences. BLAST identified only part of the RVT_1 domain on ERVE, and the RVT_1 domain on ERV9 and ERVFc covers only part of the BLAST result. Each of the BLAST and HMMER results from ERVFb only partially overlapped. HMMER identified an RVT_1 domain on ERVT, but BLAST found no match. Both methods disproportionately detected HML2 loci; BLAST did so because the query sequences were HML2-derived, whereas HMMER did so because HML2 loci are more intact and therefore more easily discovered by LTRharvest.
The nucleotide sequences from each search were aligned using their translation and their phylogeny was inferred using RAxML from both nucleotide and amino acid (AA) alignments (see Figure 1 for PR tBLASTn results). The resulting alignments and trees were used to classify the genomic sequences identified by the corresponding search. These reference trees have very low bootstrap support values; however, pol and pro trees have similar topology and the classifications based on evolutionary placement agree with the tree topology inferred directly from genomic pro sequences.

Protein
Pfam HMMs HMMs in the "other" category correspond to common functional motifs, nonretroviral proteins, or domains with unknown function.

Classification of ERV Sequences in the Human Genome
Having established a system to identify PR and RT elements, ERV nucleotide sequences were identified in GRCh38 using BLAST and HMMER as directed by GenomeTools. These were then curated to eliminate highly divergent sequences with rare insertions (less frequent than 1/20 sequences) and to minimize the size of their multiple alignment before phylogenetic inference. LTRdigest hits that did not co-occur with another retroviral domain were eliminated. The genomic distribution of the resulting ERV annotations is presented in Table 2, along with curation data. When examining PR hits, tBLASTn for ERVK-10 PR identified 480 sequences in GRCh38 (Additional Files 1, 2). Forty-two are identical to another PR sequence, of which 15 are from an alternative assembly. LTRdigest identified 150 RVP domains in GRCh38 (Additional Files 3, 4). Only 8 of these were shorter than half of the expected length of an RVP domain. Occasionally, FIGURE 1 | Sequences identified by tBLASTn which are similar to the HK2 protease. Representative ERV sequences from Repbase were searched using tBLASTn for the HK2 PR. These were aligned by MACSE and their phylogeny was inferred using RAxML under the JTT + G model of evolution selected using ProtTest 3.
Bootstrap support values are represented as branch labels. This figure was produced using Geneious, FigTree, and GIMP. "Raw" refers to all sequences identified, and "Curated" refers only to sequences that were aligned for further analysis. Credible RT sequences are those which were not eliminated for causing gappy sites. Curated RVT_1 sequences are those which co-occurred with another retroviral HMM match. LTRharvest and LTRdigest were not used to search alternative assemblies, so there are no results to report. Since this curation workflow did not exclude any protease search results, the number is shown only once.
sequences from different loci were indistinguishable; twelve were identical to another RVP locus. RVP positions 38-42 are absent in 83 PRs; 48 of these were confidently classified into one of ERVW, ERV9, ERVE, ERVT, or ERV3. The remaining 35 PR sequences were not confidently classified. In contrast, the PR sequences identified by LTRdigest are less numerous and more diverse than those identified by BLAST. There is a substantial overlap in the sequences recognized by each method, but LTRdigest identified HK2 PR with a disproportionately higher frequency than other ERVs, given their relative occurrence in the human genome (Gifford and Tristem, 2003;Bannert and Kurth, 2006). When examining RT hits, tBLASTn for the RVT_1 domain of Phoenix identified 1412 sequences in GRCh38, of which 840 were removed because they induced gaps (Additional Files 5, 6). LTRdigest identified 1766 RVT_1 domains in GRCh38, of which 1617 were removed either because they did not co-occur with another core retroviral domain or they contained uncommon insertions which induced gaps in the alignment (Additional Files 7, 8). Standalone RVT_1 domains were eliminated. This is because non-ERV retroelements, such as LINEs (Boissinot et al., 2000), also encode a RT with an RVT_1 domain, but lack other retroviral proteins. These non-ERV retroelements would skew the analysis, since they are much more numerous; retroelements as a whole make up 42.2% of the human genome, but ERVs make up less than 8.3% [47]. Although LINEs are not flanked by LTRs, they are nonetheless annotated when LTRharvest mistakes other appropriately spaced repetitive sequences for LTRs.
In order to assign appropriate nomenclature to the identified sequences, we classified 489 loci (one or more PR or RT encoding regions separated by no more than 10,000 bp) ( Table 3). We observe 53 loci contain only one BLAST or LTRdigest result, 324 loci contain 2, 57 loci contain 3, 53 loci contain 4, and one each contain 5 and 6 results. Four loci had all results placed on internal nodes by RAxML and were not classified at all. Eighteen loci contained sequences whose classification varied; in most cases these sequences were truncated or insertion of one retroelement inside another resulted in spurious association of sequences belonging to separate elements.
Totals are shown for each ERV supergroup. The total number of unplaced sequences is also shown along with the number of sequences placed on internal branches, which in both cases were between the H/F and W/9 clades. Sequences were considered unplaced if the probability of their placement into the tree was less than 90%. One sequence was placed on an internal node outside these clades. AA alignments are translated from nucleotide (NT) alignments.
Phylogenetic trees constructed from genomic sequences (Figure 2) shared similar trends as those observed in the reference trees (Figure 1 and Additional Files 9-16). The classifications assigned by placement of reference sequences into the phylogeny (Additional Files 17-32) agrees with the branching of the phylogeny inferred from genomic sequences directly. This can be clearly seen in Figure 2, in which the leaves of these phylogenies are colored according to their classification by evolutionary placement. Each phylogenetic tree (Additional Files 33-40) constructed from aligned nucleic acid sequences of either PR or RT from reference ERV sequences contained a bipartition segregating ERVK clades, although it should be noted that BLAST PR results include exclusively ERVK sequences. HK1, 2, 4, 9, and 10 commonly formed a clade within ERVK. In all four trees in Figure 2, ERVK leaves form a distinct clade, and likewise ERVW and ERV9 form a clade which is associated with ERVF. ERVS forms a clade with ERVL. ERVI forms a clade with ERVADP. ERVE forms a clade with ERV3. Figure 2 is intended to allow the reader to quickly see the broad trends; if the reader is interested in the placement of specific elements, the phylogenetic trees output by RAxML are included in Additional Files 33-40.

Alternate Genome Assemblies Reveal Additional PR Sequences
Although it fails to capture the real diversity of humanity, the reference human genome contains variable regions with multiple valid assemblies. New loci can be distinguished from loci which were duplicated by inclusion in both assemblies by examining the flanking genomic sequences, which should also be identical for technical duplicates. The 15 kilobases flanking the start site of each PR identified in GRCh38 by BLAST and GenomeTools was extracted and compared to alternative assemblies using blastn. Eight regions had a perfect match, of which five were originally identified by BLAST. Many BLAST hits surrounding identical PR sequences had partial matches indicative of closely similar insertions in divergent genomic loci. Moreover, 31 of the 52 FIGURE 2 | Amino acid sequence alignment derived phylogenies of human genomic endogenous retroviral sequences. A tree is shown for each search method (PR, protease BLAST; RT, reverse transcriptase BLAST; RVP, HMMER for RVP; RVT_1, HMMER for RVT_1). Leaves are colored by their classification through the evolutionary placement algorithm implemented by RAxML (see color legend at bottom of figure). Branch coloring represents bootstrap support, where red indicates poor support (closer to zero) and green indicates good support (closer to 100).
regions identified on alternative assemblies had no high scoring matches. The fact that many sequences could not be paired up is consistent with previous observation of unique insertions in a recent analysis of the ERV content within alternative assemblies (Wildschutte et al., 2014).

Diversity of Protease Sequences
The consensus sequences for each ERVK clade from Repbase contain many conserved sites (Figure 1). It is clear that PR functional motifs are less diverse than the surrounding sequences (Figure 3), and that variants occur with different frequencies ( Table 4). The α-helix (C2) has only one common variant (GRDLL). The active site loop (B1) has two; the DTGAD motif is most common and occurs in all clades, whereas the second most frequent (DTGVD) occurs only in the HK3, HK5, and HK6 clades. The tree in Figure 4 is displayed to clearly show the abundance of HK3 sequences (top of figure) versus all other clades. Additionally, Figure 4 highlights the distribution of common active site variants (DTGAD in blue, DTGVD in orange). Active site motifs were extracted from MACSE aligned protease sequences identified by BLAST in the human genome (GRCh38). Sequences containing stop codons and/or frameshift mutations were removed and their frequency was determined. This was done using standard UNIX utilities. Sequences were considered inactive if they contained mutations not observed in active proteases of other XRVs; otherwise, they were considered potentially active.

ERVK HML2 Proteases Exhibit Great Diversity
Sequences classified as HK2 by RAxML were examined in more detail to determine if their variability could have functional consequences if the protease is translated, especially if it could lead to differential drug susceptibility because the variable residue intrudes into the substrate binding site. Several columns of the alignment in Figure 5 represent variable sites which clearly subdivide the clade in two.
The most notable variation within HK2 sequences (Figure 5) is the variable AA (L, V or I) at position 89 which is predicted to help form the S3 and S1 binding subsites by homology (Menendez-Arias et al., 1994). Columns 55, 65, and 66 are also notable. The AA (V or I) at position 55 follows the flap interface (columns 51-54) predicted by homology (Kuhelj et al., 2001). Residues 65 and 66 are near the surface of PR in the N-terminal strand of A2; from this position, their variability could impact protein-protein interactions within the viral polyproteins or with host partners, and could potentially influence flap mobility during substrate binding via interactions with α-helix C1. Thus, the cellular complement of HK2 PR variants represent a continuum of enzymatic affinities and protein-protein interactions.

Variant Proteases Are Expressed in Human Health and Disease
We looked in publicly available RNA-Seq datasets ( Table 5) from diseases with some known association with ERV expression to establish how genomic diversity is transcribed in healthy and  Douville et al., 2011;Ren et al., 2012;Agoni et al., 2013;Wildschutte et al., 2014;Abba et al., 2015;Li et al., 2015;Brohawn et al., 2016). RNA-Seq results were narrowed to examine the expression of the two most highly expressed and well-studied groups HML2 and HML3. Further, ERVK loci (detectable by both BLAST and LTRharvest as previously described) were limited to proteases without gross inactivating mutations. Reanalysis of transcriptomics data from three independent studies focused on ALS, breast cancer and prostate cancer are shown in Figure 6; expression of other HML groups are in Additional File 42. Overall, the expression of loci encoding DTGAD and DTGVD motifs is significantly different, due to low expression of DTGVD loci, and each of these is divergent from the pooled expression of other less common motifs. Expression of protease encoding transcripts alone does not prove that these will be properly translated or proteolytically processed (Bauerova-Zabranska et al., 2005;Tien et al., 2018). Additional supporting evidence comes from examination of the upstream gag gene by searching reads aligning to the same locus using tBLASTn for the Gag sequence of ERV-K113. Several, but not all, loci expressed gag as well as pro. Individual experiments are required to confirm Gag-pro-pol polyprotein processing for each transcribed locus.
There is a trend toward increased ERVK transcription in cervical spinal tissues of ALS patients versus neuro-normal controls ( Figure 6); indeed, this enhancement is evident when patients are stratified by sex, with female cases having an increased burden of ERVK (Additional File 43, p < 0.01).
When tissues from cancer patients were examined, breast cancer biopsies revealed increased PR transcripts with DTGAD (p < 0.05) and alternate PR active site motifs (p < 0.05), as compared to cosmetic breast resection controls. Overall, HML-2 expression was enhanced in breast cancer (p < 0.01). This ERVK load difference in breast cancer appears to be driven by the increased expression of three loci in 8p23.1 flanked by LTR5A, and encoding a DTGSD active site motif. Interestingly, an HIV-1 PR variant which encodes a DTGSD motif is known to be active, although less so than the wild-type enzyme (Hong et al., 1998). Similarly, HML-2 transcripts are increased in prostate cancer specimens as compared to autologous non-cancerous adjacent tissue (p <0.01). These findings point toward a mixed profile of PR variants in disease states with elevated ERVK expression.

DISCUSSION
The human genome is home to a diversity of ERVs; this necessitates much effort in discriminating those of biological and/or clinical importance. Our identification of transcribed ERVK loci with a potential to encode functional PR enzymes will inform future studies examining their biological impact and drug susceptibility.
There remain technical challenges in assessing ERVs, as the results from LTRharvest depict that this paradigm is not ideal for analysis of more complicated loci containing recombination, serial insertions, or substantial deletions. Theoretically, each ERV should encode exactly 1 pro and 1 pol at the moment of insertion, FIGURE 4 | Phylogeny of human genomic nucleotide sequences derived using BLAST and the ERVK HK2 PR. Potentially active sequences with a DTGAD motif in their active site are colored blue, and those with DTGVD are colored orange. Sequences containing inactivating mutations are colored gray. The classification of each area of the tree is indicated by the name associated with the arc around the perimeter. Branch coloring represents bootstrap support, where red indicates poor support (closer to zero) and green indicates good support (closer to 100). Placement of the root at the branch connecting HK3 is only a matter of visual convenience -the real root branch is more likely that leading to HK8. This alignment is available in Additional File 41. but over time deletion and insertion events can remove or obscure these signatures. Despite this issue, the use of HMMs to identify ERVs within the human genome was effective. Indeed, RVP and RVT_1 annotations are more commonly identified within elements annotated by LTRharvest, since BLAST did not detect loci distantly related to the query HK2 sequences. The search results were improved by the merging of results from the same paralog and alignment-based extension. These sequences were grouped both by direct phylogenetic inference and by comparison to a reference tree. The phylogenetic patterns apparent in Figures 1, 2 broadly agree with previously recognized relationships between ERVs (Vargiu et al., 2016). Although trees from pro and pol differ in some respects, they generally agree at deeper nodes which are important for classification, despite the lower phylogenetic signal of protease due to its comparatively shorter length.
There is value in examining both reference genomes and alternate assemblies. The reference genome was constructed from few people (Osoegawa et al., 1998;Lander et al., 2001), suggesting that the ERV content of individual human genomes could vary considerably and deserves closer scrutiny; especially since unfixed loci are likely to be younger (Belshaw et al., 2005), and could encode functional enzymes. Future use of data from platforms such as the 1000 Genomes Project (2018) will provide an improved appreciation of the diversity of ERV content in human DNA.

ERVK Protease Sequence Diversity
The diversity of human genomic retroviral proteases reflects their evolutionary history and conserved structure-function relationships. Residues flanking the active site and those maintaining the dimer interface are less variable, a pattern clearly  discernible in the Pfam family RVP (Finn et al., 2014) and evident within Figure 1. The β-sheet at the "bottom" of the enzyme is an exception to this rule, as its residues are not as conserved, perhaps because this region of tertiary structure is driven by backbone interactions (Louis et al., 2007). Overall, the most notable variation in genomic protease sequences is the absence of RVP positions 38-42 in 83 PRs, none of which were classified as ERVK.
The sequence of the retroviral protease active site (B1) and the active-site associate helix (C2) are conserved, as seen in Figures 1, 3, 5 for ERVK. The most common active site motif of ERVK, DTGAD, is identical to that predominantly observed in exogenous retroviruses. The second most common motif, DTGVD (A29V), is observed in functional retroviral proteases; in fact, there are many examples of HIV proteases bearing this variation (e.g., Uniprot: P15833). Given the wide phylogenetic distribution of A29V and the similar chemical properties of alanine and valine residues, this is probably a functional variant of the ERVK protease. This motif could therefore represent pre-integration variation of the ancestral XRV which gave rise to ERVK, which is further supported by the concentration of DTGVD motif containing sequences in HK3, seen in Figure 4. Alternatively, the motif could represent a post-integration mutation predating the amplification of HK3 (Mayer and Meese, 2002). We also predict several α-helix (C2) variants could be functional in ERVK. In the HIV-1 PR (e.g., PDB: 1HVC), the C2 arginine residue (R94) fills the space between B1 and the loop connecting A1 to D2 while forming hydrogen bonds to residues in both. Another large polar residue might fill this role; indeed, lysine is known to do so in HIV-1 protease (Louis et al., 2007). Several genomic motifs fit these requirements, such as the HK2 R94Q and R94K variants, and the HK8 D95E variant. Systematic experimental verification of the activity of individual PR variants is thus warranted.
Nonetheless, we predict that most of the less common active site loop variants are inactive (Figure 4). This prediction is pulled from our understanding of the composition of this motif. Many variations are of a residue which catalyzes the hydrolytic reaction, or which is predicted to be essential to the dimer interface. The catalytic residues must be aspartic acid. The threonine residue which helps form the dimer interface is uncommonly replaced by serine (e.g., PDB: 2JYS, 2RSP) (Jaskolski et al., 1990;Hartl et al., 2008). Others could disrupt the active site by substituting a side-chain of very different size. The glycine residue must be small, as distortions to the structure of this region could disrupt the catalytic structure. This is also true for the alanine residue, although it is known to be replaced by serine in HIV-1 without a total loss of activity (Ido et al., 1991). The final aspartic acid residue participates in hydrogen bonds with the nearby arginine residues (Louis et al., 2007), and might be replaced by another FIGURE 6 | Expression of HML2 and HML3 by RNA-Seq. The expression of ERV loci in GRCh38 containing an uninterrupted protease was measured with bowtie2 using RNA-Seq data from different conditions (SRA accession numbers for each condition are: ALS -SRP064478, Breast Cancer -SRP058722, Prostate Cancer -ERP000550). Normalization to fragments per kilobase of exon per million mapped reads (FPKM) is relative to the entire locus. This graphic was made using R with the tidyverse library, and GIMP. The N's of each column are, from left to right, HML2 and then HML3: 104,8,0,7,29,7,118,9,0,10,35,8;369,27,0,17,108,25,146,11,0,4,40,5;195,14,0,4,64,13,177,14,0,2,52,13. Counts are also provided in CSV and Excel format in Additional Files 44, 45. small charged or polar residue; however, the character of this residue may not be critical, since it can vary with glycine in HIV-1 protease (Louis et al., 2007).
The young, polymorphic HK2 element ERVK113 is well studied and was reported to produce immature virions, perhaps due to a non-functional protease (Boller et al., 2008). This is in contrast to the known active HK2 protease of ERVK-10 (Ono, 1986;Towler et al., 1998). This is further supported by reported differences in the sequences of these two mature proteases; the G56S (reported as G234S) substitution in ERVK113 may disrupt the flap interface, compromising PR activity and conferring an immature particle morphology (Boller et al., 2008).

Implications of ERVK PR Diversity on Immunity
Retroviral proteases have numerous and varied protein targets which are difficult to predict, with correspondingly broad cellular effects stemming from their expression. Many proteases include targets with immune functions, but these vary based on hostvirus pairings. It remains to be seen how ERVK PRs may modulate innate immunity signaling in humans, as seen with XRVs (Abudu et al., 2006;Solis et al., 2011;Wagner et al., 2015;Yoshikawa et al., 2017). Based on our knowledge of HIV (Konvalinka et al., 1995;Snasel et al., 2000;Lin et al., 2003), it is even conceivable that each ERVK PR variant may target distinct panels of cellular proteins, due to their different substrate specificities.
The potential impact of ERVK PR not limited to the innate immune system. Observations from the recovery of immune function in patients on HAART containing or not containing PIs suggests that HIV PR can directly impact the functioning of adaptive immune cells independently of other viral proteins (Ananworanich et al., 2003;Chiodi, 2006). Furthermore, differential ERVK PR substrate specificities could lead to changes in the peptide fragment population from which MHC class I epitopes are drawn, in addition to the impact that PI use alters this system (Kourjian et al., 2014;Kourjian et al., 2016). Our observation of distinct ERVK PR variants expressed in disease highlights the need for future studies to consider their overlapping and distinct impacts on cell signaling and proteome profiles related to immunity.

Implications for Drug-Based Treatment of ERVK-Associated Disease
Many protease coding sequences could be translated into an active enzyme, and their sequence variability may impact their activity, and likely their susceptibility to antiretroviral drugs. Variation between DTGAD and DTGVD in the active site of sequences classified as HK3 is notable. This variation could have biological relevance when both HK2 and HK3 elements are expressed, such as in ALS  and schizophrenia . The HK2 PR is susceptible to some clinically relevant HIV-1 PIs, although not to the same degree as the HIV-1 PR (Towler et al., 1998;Kuhelj et al., 2001;Tyagi et al., 2017). Since ERVK expression is associated with many disease states, and the activity of retroviral enzymes can be pathogenic, it is possible that PIs may be useful in the future treatment of ERVK-associated conditions. In fact, at least one ongoing clinical trial (NCT02437110) plans to administer the PI Darunavir, along with other drugs, to patients with ALS. Another clinical trial (NCT01528865) targeting ERVK in lymphoma was withdrawn -but the intent to explore the clinical utility of ERVK PIs is becoming clear.
The rapid development of drug resistance is a major hurdle in the treatment of HIV-1 infections, but since ERVK is fixed in the host genome, it cannot rapidly evolve such drug resistance. However, ERVK elements are diverse and likely respond differently to PIs. This diversity is not represented in the ERVK literature, which is focused on the consensus HK2 PR (Towler et al., 1998;Kuhelj et al., 2001;Dewannieux et al., 2006;Lee and Bieniasz, 2007;Tyagi et al., 2017). One recent paper states that the active site loop of the HIV-1 and ERVK PR are identical (Tyagi et al., 2017), a finding which this report directly contradicts. We predict this diversity may necessitate the simultaneous use of multiple PIs in the treatment of ERVKassociated diseases. Consequently, if the drug regimen in the above-mentioned clinical trial is not effective in treating ALS, this cannot be taken as evidence that such therapy would not be effective using multiple PIs in a combination therapy.
We have established that a multitude of ERVK protease variants exist in the human genome, and that some of these variants can reasonably be predicted to have differing drug binding profiles. However, many ERVs are transcriptionally repressed (Bogerd et al., 2006;Schlesinger and Goff, 2013;Schlesinger et al., 2014); our analysis of publicly available RNA-Seq data shows not only that ERVs are expressed, but that the expression of loci with differing biochemical and evolutionary characteristics varies between and within different disease conditions, as well as in healthy controls. Due to the relatively recent insertion of ERVK sequences within the human genome, and the pathogenicity of XRV PRs, it is warranted that future studies consider their role in modulating immunity. Since distinct biochemical and ancestral features may impact the susceptibility of these enzymes to PIs, we propose that patientspecific drug regimens may be required in treatment of ERVK associated disease. Moreover, given the potential importance of protease sequence variability, the sequences of other ERVK proteins (particularly the drug targets reverse transcriptase and integrase) should also be explored.

CONCLUSION
Within the human genome, ERVK proteases exhibit a high degree of diversity. Specifically, two predominant PR active sites emerge, the DTGAD and DTGVD variants, which are differentially expressed in disease states. This study will also be an asset for inferring how ERVK PRs impact the human proteome, specifically as it pertains to immune function. It is possible to interpret this paper as casting a negative light on the prospect of inhibitor-based treatment of ERVK-associated disease, but this is not our intention. We do, however, caution against treatments and techniques that treat the numerous and diverse elements called ERVK as though they were a single invariable element. This variability is not endless; biomedical and clinical techniques to account for this enzymatic variation exist. We recommend that the results of inhibition assays against the specific ERVK proteases expressed in each disease state should be considered if anti-PR drug regimens are to be implemented for inflammatory disease.

MATERIALS AND METHODS
Sources for databases and software employed in the course of this study are listed in Table 6.
Sequences homologous to retroviral protease (pro) and reverse transcriptase (pol) were automatically extracted from the last major build of the reference human genome and classified with published sequences from Repbase. Structure-function analysis was applied to ERV PR sequences using the limited published data on them, and by transferring knowledge of better studied proteases.

HMMER and GenomeTools
HMMs for which to search were identified in Pfam-A using the search term "retrovirus" (Finn et al., 2014). ERV reference sequences were searched by HMMER with an e-value of 1.0. Putative LTR retroelements were identified in GRCh38 by LTRharvest, and these were subsequently analyzed by LTRdigest, which ran HMMER. LTRharvest and LTRdigest are part of Genometools v1.5.1 (Gremme et al., 2013). Hits from the same locus in different reading frames were merged and aligned with MACSE, then the regions which would make each sequence flush to the alignment were retrieved.

Phylogenetic Inference and Classification
Results from each of the four searches were curated by eliminating HMM matches not associated with another core retroviral domain and by removing BLAST results which induced gaps in ≥95% of alignment rows with MACSE. Sequences were then re-aligned with MACSE and models of evolution were selected using jModelTest (Darriba et al., 2012) and ProtTest (Guindon et al., 2010;Darriba et al., 2011). Maximum likelihood phylogenies were inferred using RAxML v7.2.8 (Stamatakis, 2014) for nucleotide and protein alignments of each search (8 trees in total). GRCh38 results were classified using RAxML's evolutionary placement algorithm with reference to the inferred phylogeny of representative ERV sequences. Misclassified or recombinant elements were sought by comparing the assignment of search results separated by less than 10 kbp. This locus-based definition is computationally simple, but potentially error prone; a more accurate element-based analysis would require algorithmic reconstruction of insertion architecture, as was undertaken in the construction of Dfam (Wheeler et al., 2013).

Measuring Expression in Publicly Available RNA-Seq Data
Publicly available RNA-Seq data from tissues associated with ERVK expression in human disease were obtained from the SRA using sratools. The resulting FASTQ files were aligned to the human genome (GRCh38) using bowtie2 following examination with FASTQC. The resulting SAM file was then indexed using samtools and finally the resulting indexed BAM file was queried for expression of ERV loci identified using the methods above. The raw expression values thus obtained were normalized by library size and locus length to produce FPKM values.
The resulting FPKM values were examined on density and QQ plots to confirm that they did not conform to a normal distribution. Non-parametric tests of statistical significance were used in this paper. For unpaired expression data (breast cancer, ALS) the Mann-Whitney U test was used, and for paired data (prostate cancer) the Wilcoxon Signed-Rank Test was used. In all cases tests were two-sided and p-values were corrected using the Bonferroni procedure, considering each set of contrasts involving the same variables to be a family.

AVAILABILITY OF DATA AND MATERIAL
The datasets used to generate the findings of this paper are available from NCBI by their accession numbers. SRA accession numbers for each RNA-Seq data are: ALS -SRP064478, Breast Cancer -SRP058722, Prostate Cancer -ERP000550. The human genome version 38 was obtained from the NCBI FTP site [79]. Datasets generated during the current study which are not included as additional files are available from the corresponding author upon request. Important scripts used in the course of this study are included in Additional File 48.

AUTHOR CONTRIBUTIONS
MT and RD conceived the experiments. MT performed the bioinformatics analysis. MT and RD wrote the manuscript.

ACKNOWLEDGMENTS
We would like to thank the Douville lab team for their input on this project. This work was supported by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca) by use of the computer cluster Orcinus. We acknowledge that the University of Winnipeg is in Treaty 1 territory and in the heart of the Métis nation. Orcinus is located on the traditional, ancestral, and unceeded territory of the Musqueam people. MT currently resides in the traditional territory of the Three Fire Peoples: the Ojibwa, Odawa, and Potawatomi.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018. 01577/full#supplementary-material       FILE 8 in FASTA (fa) format | Aligned genomic RVT_1 HMMER results. The sequences found by HMMER directed by LTRdigest for the protein family HMM RVT_1 in the human genome aligned by MACSE.
FILE 9 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 protease amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by BLAST using the mature HK2 protease as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.
FILE 10 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 protease nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by BLAST using the mature HK2 protease as a query as inferred by RAxML under the GTR model of evolution and aligned by MACSE.
FILE 11 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 reverse transcriptase amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.
FILE 12 in Newick (nwk) format | Phylogeny inferred from genomic BLAST results for the HK2 reverse transcriptase nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the GTR model of evolution and aligned by MACSE.
FILE 13 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVP amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVP as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.
FILE 14 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVP nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVP as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.
FILE 15 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVT_1 amino acid sequences. The phylogeny of amino acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVT_1 as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.
FILE 16 in Newick (nwk) format | Phylogeny inferred from genomic HMMER results for RVT_1 nucleic acid sequences. The phylogeny of nucleic acid sequences identified in the human genome GRCh38 by HMMER directed by LTRdigest using the protein family HMM RVT_1 as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE. FILE 17 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic protease BLAST amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the JTT + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.
FILE 18 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic protease BLAST nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model of evolution, and this file contains the likelihood of each potential placement for each sequence. FILE 19 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic reverse transcriptase BLAST amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the RtREV + G model of evolution, and this file contains the likelihood of each potential placement for each sequence. FILE 20 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic reverse transcriptase BLAST nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model. FILE 21 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVP HMMER amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the WAG + G + F model of evolution, and this file contains the likelihood of each potential placement for each sequence. FILE 22 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVP HMMER nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.
FILE 23 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVT_1 HMMER amino acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the RtREV + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.
FILE 24 formatted as described in the RAxML user manual (txt) | Classification likelihood weights for evolutionary placement of genomic RVT_1 HMMER nucleic acid sequences onto the corresponding reference tree. The placement of genomic sequences onto the tree of Repbase consensa was done using RAxML's evolutionary placement algorithm under the GTR + G model of evolution, and this file contains the likelihood of each potential placement for each sequence.
FILE 25 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic protease BLAST amino acid sequences. The reference tree for placement of genomic protease BLAST results labeled for use with the corresponding classification likelihood file.  FILE 32 in Newick (nwk) format | Original labeled reference tree for evolutionary placement of genomic RVT_1 HMMER nucleic acid sequences. The reference tree for placement of genomic RVT_1 HMMER directed by LTRdigest results labeled for use with the corresponding classification likelihood file.
FILE 33 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using BLAST with the HK2 protease as a query. The phylogeny of amino acid sequences identified in representative Repbase consensa by BLAST using the mature HK2 protease as a query as inferred by RAxML under the JTT + G model of evolution and aligned by MACSE.
FILE 34 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using BLAST with the HK2 protease as a query. The phylogeny of nucleic acid sequences identified in representative Repbase consensa by BLAST using the mature HK2 protease as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.
FILE 35 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using BLAST with the HK2 reverse transcriptase as a query. The phylogeny of amino acid sequences identified in representative Repbase consensa by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the RtREV + G model of evolution and aligned by MACSE.
FILE 36 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using BLAST with the HK2 reverse transcriptase as a query. The phylogeny of nucleic acid sequences identified in representative Repbase consensa by BLAST using the HK2 reverse transcriptase as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.
FILE 37 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using HMMER for RVP. The phylogeny of amino acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVP as a query as inferred by RAxML under the WAG + G + F model of evolution and aligned by MACSE.
FILE 38 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using HMMER for RVP. The phylogeny of nucleic acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVP as a query as inferred by RAxML under the GTR + G model of evolution and aligned by MACSE.
FILE 39 in Newick (nwk) format | Phylogeny inferred from amino acid sequences from representative Repbase consensa searched using HMMER for RVT_1. The phylogeny of amino acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVT_1 as a query as inferred by RAxML under the RtREV + G + I model of evolution and aligned by MACSE.
FILE 40 in Newick (nwk) format | Phylogeny inferred from nucleic acid sequences from representative Repbase consensa searched using HMMER for RVT_1. The phylogeny, inferred by RAxML under the GTR + G model of evolution, of nucleic acid sequences identified in representative Repbase consensa by HMMER using the protein family HMM RVT_1 as a query aligned by MACSE. FILE 42 in PNG (png) format | Expression of non-HML2/HML3 ERVs in human tissues. The expression of ERVK loci in GRCh38 containing an uninterrupted protease was measured with bowtie2 using RNA-Seq data from different conditions (SRA accession numbers for each condition are ALS -SRP064478, Breast Cancer -SRP058722, Prostate Cancer -ERP000550). Normalization to fragments per kilobase of exon per million mapped reads (FPKM) is relative to the entire locus. This graphic was made using R with the tidyverse library, and GIMP.
FILE 43 in PNG (png) format | Expression of ERVK in cervical spinal cord RNA-Seq of male and female donors with ALS. The expression of ERVK loci in GRCh38 containing an uninterrupted protease was measured with bowtie2 using RNA-Seq data from different conditions (SRA accession numbers for each condition are ALS -SRP064478, Breast Cancer -SRP058722, Prostate Cancer -ERP000550). Normalization to fragments per kilobase of exon per million mapped reads (FPKM) is relative to the entire locus. Samples were sexed using expression of XIST. This graphic was made using R with the tidyverse library, and GIMP.    FILE 48 in ZIP (zip) format | Important scripts used for this study. This ZIP archive contains the perl and bash scripts used to locate and align protease and reverse transcriptase using BLAST in the human genome and the scripts used to measure expression in RNA-Seq data.