Insights into HLA-G Genetics Provided by Worldwide Haplotype Diversity

Human leukocyte antigen G (HLA-G) belongs to the family of non-classical HLA class I genes, located within the major histocompatibility complex (MHC). HLA-G has been the target of most recent research regarding the function of class I non-classical genes. The main features that distinguish HLA-G from classical class I genes are (a) limited protein variability, (b) alternative splicing generating several membrane bound and soluble isoforms, (c) short cytoplasmic tail, (d) modulation of immune response (immune tolerance), and (e) restricted expression to certain tissues. In the present work, we describe the HLA-G gene structure and address the HLA-G variability and haplotype diversity among several populations around the world, considering each of its major segments [promoter, coding, and 3′ untranslated region (UTR)]. For this purpose, we developed a pipeline to reevaluate the 1000Genomes data and recover miscalled or missing genotypes and haplotypes. It became clear that the overall structure of the HLA-G molecule has been maintained during the evolutionary process and that most of the variation sites found in the HLA-G coding region are either coding synonymous or intronic mutations. In addition, only a few frequent and divergent extended haplotypes are found when the promoter, coding, and 3′UTRs are evaluated together. The divergence is particularly evident for the regulatory regions. The population comparisons confirmed that most of the HLA-G variability has originated before human dispersion from Africa and that the allele and haplotype frequencies have probably been shaped by strong selective pressures.


INTRODUCTION
Human leukocyte antigen G (HLA-G) belongs to the family of non-classical HLA class I genes, located within the major histocompatibility complex (MHC) at chromosomal region 6p21.3. The MHC segment is considered to be the most polymorphic region in vertebrate genome (1). Although the HLA-G product presents the same class I classical molecule structure, its main function is not antigen presentation. HLA-G function in the immune response regulation has been extensively studied since its discovery by Geraghty and colleagues in 1987 (2).
The HLA-G gene has been the target of most recent research regarding the function of class I non-classical genes. The main features that distinguish HLA-G from classical class I genes are (a) limited protein variability, (b) alternative splicing generating several membrane bound and soluble isoforms, (c) short cytoplasmic tail, (d) modulation of immune response (immune tolerance), and (e) restricted expression to certain tissues (3).
HLA-G role in immune tolerance was first studied in trophoblast cells at the maternal-fetal interface (9). Several studies reported an aberrant or reduced HLA-G expression in both mRNA and protein levels. This phenomenon was observed in pathological conditions such as preeclampsia (10) and recurrent spontaneous abortion (11) in comparison with normal placentas.

HLA-G GENETIC STRUCTURE
The HLA-G gene presents a structure that resembles other classical class I genes such as HLA-A, HLA-B, and HLA-C. HLA-G encodes for a membrane-bound molecule with the same extracellular domains presented by other class I molecules, including the association with the β2-microglobulin. However, its main function is not antigen presentation.
The HLA-G gene exon/intron structure and splicing patterns are well defined, but there are inconsistencies between the National Center for Biotechnology Information (NCBI) 1 , the International Immunogenetics Database (IMGT/HLA 2 ), and the Ensembl database 3 annotations regarding its structure, mainly because the IMGT/HLA database only presents sequences within 300 bases upstream the coding sequence (CDS) and the database does not consider most of the 3 untranslated region (UTR) segment. Therefore, in the present work, the structure defined by NCBI/Ensembl will be used throughout the text.
According to the NCBI reference sequence NC_000006.12 (GRCh38 or hg19) and transcripts such as NM_002127.5 (NCBI), ENST00000428701, and ENST00000376828 (Ensembl), the HLA-G gene (NCBI Gene ID: 3135) presents eight exons and seven introns, consistent with a classical class I gene structure, and encompasses a region of 4144 nucleotides between positions 29826979 and 29831122 at 6p21. 3 (GRCh38). This gene is surrounded by some of the most polymorphic genes in the human genome (Figure 1), such as HLA-A (115 Kb downstream), HLA-B (1526 Kb downstream), and HLA-C (1441 Kb downstream), and other non-classical HLA loci such as HLA-E (662 Kb downstream) and HLA-F (103 Kb upstream). According to the NCBI annotation and hg19, the HLA-G DNA segment encodes a full-length mRNA of 1578 nucleotides and alternative smaller ones, as discussed later. Considering the full-length mRNA, 1017 nucleotides represent the CDS encoding for a full-length protein of 338 amino acids, 178 nucleotides represent the 5 UTR segment, and 383 nucleotides represent the 3 UTR segment.
There is no consensus regarding the exact location where the HLA-G transcription may start. Considering the NCBI and Ensembl annotations, and the transcripts NM_002127.5 from NCBI and ENST00000428701 from Ensembl, the HLA-G transcription starts 866 nucleotides upstream the initial translated ATG (third * at Figure 1). However, other transcripts tell us a different story: ENST00000376828 indicates that the HLA-G transcription might start even earlier, while ENST00000360323 indicates that the transcription starts 24 nucleotides upstream the initial translated ATG. Given these contradictory information, it is possible that the HLA-G gene presents multiple transcription start points depending on the presence of specific transcription factors or other expression inducing mechanisms, but it probably presents only one translation start point as described further. Since there is no consensus, in the present work, we opt to use the annotation presented by both NCBI and Ensembl, considering NM_002127.5 and ENST00000428701 as references. Considering the transcription start site indicated by NM_002127.5/ENST00000428701 or ENST00000360323, HLA-G presents a large 5 UTR segment. Within this segment, there is an intron (intron 1) of about 688 nucleotides that is spliced out, giving rise to 5 UTR of about 178 nucleotides composed of DNA segments of two adjacent exons. Considering this transcription start point, the HLA-G 5 Frontiers in Immunology | Immunological Tolerance sequence presents at least three potential translation start points, i.e., two in the 5 UTR and the third one defining the beginning of the CDS. In the present work, we will consider the Adenine of this third ATG, i.e., the first base of the CDS, as nucleotide +1. Although conventional nomenclature would suggest the first transcribed base as nucleotide +1, our decision will avoid unnecessary confusion regarding the position of various well-established HLA-G variation sites. All nucleotides before the CDS will be noted as negative numbers and nucleotides in the CDS segment will be noted as positive numbers, using as a reference sequence the one available at the official human genome hg19 or NC_000006.12.
The first ATG is found between nucleotides −154 and −152 (mRNA) or nucleotides −842 and −840 (DNA). The second one is found between nucleotides −118 and −116 (mRNA) or nucleotides −806 and −804 (DNA). Both of these translation start points are in the same frame and are included in a sequence that does not resemble the preferred translation initiation sequence (Kozak consensus sequence) and might not initiate translation (62). Even if the first ATG is used, it would produce a peptide of only eight residues due to a stop codon found downstream in the reading frame. Alternatively, if the second ATG is used, a protein of about 136 amino acid residues would be produced. Although in a different frame from the main translation start point (the third one), this 136 amino acid molecule is quite similar to other human and primate class I molecule alpha-1 domains. The third and main ATG is compatible with the preferred Kozac sequence (62) and it initiates the translation of the full-length 338 amino acid residues protein and defines the beginning of the CDS segment.
The HLA-G CDS is composed of joining segments of six exons, in which the first contains the translation start point and the last one contains the stop codon (Table 1, Figure 1). It should be noted that there is no consensus regarding exon and intron nomenclature between NCBI/Ensembl and the IMGT/HLA databases. IMGT/HLA considers as exon 1 the first mRNA segment that is translated, i.e., exon 2 for NCBI/Ensembl (Figure 1). The actual exon 2, which encodes the final portion of the 5 UTR, contains the main translation start point and in fact encodes the HLA-G leader peptide (Figure 1). In addition, exons 3, 4, and 5 encode the alpha-1, alpha-2, and alpha-3 domains, respectively, exon 6 encodes the transmembrane domain, and exon 7 the cytoplasmic tail. A premature stop codon at exon 7 leads to a shorter cytoplasmic tail when compared to other class I molecules (Figure 1, Table 1). The segment downstream the stop codon at exon 7 extending to exon 8 composes the HLA-G 3 UTR. The HLA-G mRNA 3 UTR is short when compared to other class I genes. This gene structure description highlights one of the widely spread misconceptions regarding HLA-G gene structure: in 1987, Geraghty and colleagues proposed the existence of an exon 7 based on homology with classical class I genes (2). This "exon 7" was in fact part of the intron 7 (NCBI) and it is usually absent in most of the HLA-G transcripts. Although this "exon 7" segment has been found in alternative transcripts (e.g., ENST00000478519), other intron segments are also sometimes kept in rare alternative transcripts (e.g., ENST00000478355), since alternative splicing is an important characteristic of the HLA-G gene as described further. The HLA-G gene may produce at least seven protein isoforms generated by alternative splicing of the primary transcript (Figure 1). Four isoforms are membrane bound presenting the transmembrane domain and the short cytoplasmic tail. HLA-G1 is the full-length membrane-bound isoform with a structure that resembles classical class I molecules. HLA-G2 lacks alpha-2 domain, HLA-G3 lacks alpha-2 and alpha-3 domains, and HLA-G4 lacks alpha-3 domain. Three isoforms are soluble due to the lack of the transmembrane domain. The soluble HLA-G5 and HLA-G6 isoforms present the same extracellular domains of HLA-G1 and HLA-G2, respectively; however, both transcript variants retain intron 5 leading to a stop codon before the translation of the transmembrane domain, and a tail of 21 amino acids implicated in their solubility. HLA-G7 transcript variant retains intron 3 leading to a premature stop codon. Therefore, HLA-G7 isoform presents only the alpha-1 domain linked to two amino acids encoded by intron 2 (Figure 1) (63)(64)(65).
In the next sections, we will address the HLA-G variability and haplotype diversity among several populations around the world.

HLA-G VARIABILITY AS DESCRIBED IN THE 1000GENOMES PROJECT
The 1000Genomes Project is a large survey aiming to sequence the entire genome of thousands of individuals in several populations around the world (66). In the initial released data, the phased www.frontiersin.org genotypes of 1092 individuals from 14 populations were available. These data have driven several studies regarding HLA-G variability and evolutionary aspects (67)(68)(69).
The initial genotype published by the 1000Genomes Project was based on exome sequencing or whole genome low coverage sequencing and lacks several known HLA-G polymorphisms due to limitations in the genotype detection procedures at that moment. Among the missing polymorphic sites, we may highlight some known indels, such as the traditionally studied 14-bp presence or absence (insertion/deletion) in the HLA-G 3 UTR. In addition, the method used to infer genotypes and haplotypes failed to clearly distinguish triallelic SNPs, reporting them as biallelic ones (e.g., the HLA-G promoter SNP at position −725C/T/G, rs1233334).
Considering these technical limitations and considering the fact that most of the bioinformatics tools used in the initial survey are now more advanced and developed, we have reevaluated the 1000Genomes raw sequencing data regarding the HLA-G gene using a locally developed pipeline to get genotypes and haplotypes, to better understand the HLA-G variability around the world and to retrieve data regarding some HLA-G missed polymorphic sites.
First, by using Samtools (70) subroutine view, we downloaded the BAM files (binary alignment map) containing the 1000Genomes official alignment data for the HLA-G gene region (between positions 29793317 and 29799834 at chromosome 6) directly from the 1000Genomes server (ftp://ftptrace.ncbi.nih.gov/1000genomes/ftp/). The reads downloaded were already trimmed on both ends for primer sequences. The download was performed for each of the initial 1092 samples and included data from both low coverage whole genome and exome when available. It should be mentioned that we got the sequences (reads) from BAM files representing the HLA-G region, thus, the next step of our pipeline used only the reads that were previously mapped to the HLA-G region by the 1000Genomes Consortium. Each BAM file was converted into a Fastq format file retrieving all reads that were previously mapped to the HLA-G region. The BAM to Fastq conversion was made using Bamtools (https://github.com/pezmaster31/bamtools/) and Perl scripts (locally developed) to filter out duplicated reads and to classify the reads as paired or unpaired.
Both paired and unpaired Fastq files were mapped to a masked chromosome 6 (hg19), in which only the HLA-G region was available and the rest of the chromosome was masked with "N" to preserve nucleotide positions regarding hg19. To date, hg19 presents a HLA-G coding region sequence compatible with the widely spread HLA-G allele known as G*01:01:01:05. Mapping was performed using the application BWA, subroutine ALN (71), configured to allow the extension of a deletion up to 20 nucleotides, in order to evaluate the 14-bp polymorphism. The resulting BAM files from the newly mapped reads, from both paired-end and unpaired sequences, were joined using Picard-tools (http: //picard.sourceforge.net/index.shtml). Regions containing indels were locally realigned by using the application GATK (72), routines RealignerTargetCreator and IndelRealigner. This local realignment used as reference a file containing known HLA-G indels. The Bamtools software was also used to remove reads mapped with low mapping quality (MQ) scores (MQ < 40). After the procedure described above, 16 samples were discarded because all mapped reads (or most of them) were withdrawn due to poor MQ scores. The GATK routine UnifiedGenotyper was used to infer genotypes and a VCF file (variant call format) was generated.
Given the low coverage nature of the 1000Genomes data, some genotypes called by GATK are far uncertain, mainly in situations in which a homozygous genotype is inferred when that position presents low depth coverage. In addition, given the polymorphic nature and the high level of sequence similarity of HLA genes, some level of miss-mapped reads is expected and might bias genotype inference. To circumvent this issue, the VCF file generated by GATK was treated with a locally developed Perl script that applied the rules described below. This script uses the number of different reads detected for each allele at a given position (provided by GATK when the VCF file was generated).
-Homozygosity was only inferred when a minimal coverage of seven reads was achieved; otherwise, a missing allele was introduced in this genotype. This procedure assures (p > 0.99) that a homozygous genotype is called because of lack of variance at that position and not because the second allele was not sampled. -Genotypes, in which one allele was extremely underrepresented (proportion of reads under 5%), were considered as homozygous for the most represented allele. This procedure minimizes the influence of miss-mapped reads to the HLA-G region and the high level of sequencing errors that characterizes nextgeneration sequencing data, and such correction was applied only in situations characterized by high depth of coverage (20 or more reads available for the evaluated position). -For genotypes in which one allele was mildly underrepresented (with a proportion of reads between 5 and 20%), a missing allele was introduced representing this underrepresented allele. This procedure is particularly helpful in situations characterized by low depth of coverage (less than 20 reads available for the evaluated position), in which a single read may indicate the existence of an alternative allele, such read may be a miss-mapped read (false positive variant) or may represent a true unbalanced heterozygous genotype (true positive variant). Therefore, the definitive status of this kind of genotype (homozygous or heterozygous) was inferred during a final imputation step. -Genotypes in which the proportion of reads for the less represented allele was higher than 20% were considered to be heterozygous. This procedure assures that only high-quality heterozygous genotypes are passed forward to the imputation procedure.
After applying the rules described above, the HLA-G database presented 8.42% of missing alleles, i.e., alleles that were considered uncertain because of low coverage or bad proportions. Some single nucleotide variations (SNVs) previously detected (with low quality) were converted into monomorphic as the alternative allele was removed or coded as missing, thus, they were not considered for further analyses. By using the VCFtools package (73), we removed SNVs that were no longer variable or that were represented just once in the dataset (i.e., singletons). In addition, we predicted the functional effect of each SNV, i.e., they were classified as coding synonymous mutations, coding non-synonymous Frontiers in Immunology | Immunological Tolerance mutations, splice site acceptors, stop-codon generation, and others, by using Snpeff (74). The missing alleles were imputed as well as HLA-G haplotypes were inferred by using the PHASE algorithm (75) as previously described (76,77). For this purpose, a database containing high-quality genotype information for 133 SNVs for each of the 1076 remaining samples was used. The haplotyping procedure generated 200 haplotypes, with a mean haplotype pair probability of 0.7965 and with 524 samples (48.70%) presenting a haplotype pair with a probability higher than 0.9. The results of the procedure described above were presented separately for each HLA-G region (coding, 3 UTR and promoter) and, finally, as fully characterized extended haplotypes.
To characterize and explore global patterns of HLA-G diversity, a population genetics approach was performed using the ARLEQUIN 3.5.1.3 software (78,79). The frequencies of each HLA-G haplotype were computed by the direct counting method and adherences of diplotype proportions to expectations under Hardy-Weinberg equilibrium were tested by the exact test of Guo and Thompson (80). Intrapopulational genetic diversity parameters were assessed in each population by computation of gene diversity (average expected heterozygosity across variation sites), haplotype diversity, nucleotide diversity, and the number of private haplotypes. Interpopulation genetic diversity was explored by means of pair-wise F ST estimates (81), by the exact test of population differentiation (82), and by the analysis of molecular variance (AMOVA) (83), all based on haplotype frequencies.
Since the pair-wise F ST and the exact test of population differentiation between pairs of populations represent 91 statistical comparisons, the Bonferroni correction was used to adjust the significance level for multiple testing, resulting in a α = 0.0005 (i.e., 0.05/91). Reynolds' genetics distance was also estimated for each pair of population samples by the ARLEQUIN 3.5.1.3 software (78,79,84). The resulting matrix was used to generate a multidimensional scaling (MDS) using the PASW Statistics (17.0.2) software (SPSS Inc.).

HLA-G CODING REGION VARIABILITY AND HAPLOTYPES
In contrast to classical HLA class I genes, HLA-G presents low variability in its coding region. To date, only 50 coding alleles or haplotypes are officially recognized by the IMGT/HLA database 2 (version 3.17.0.1). Most of the SNVs in the HLA-G coding region are either coding synonymous mutations or intronic variants. Therefore, these 50 officially recognized HLA-G alleles encode only 16 different full-length proteins and two truncated molecules (null alleles). This is a distinctive feature of the HLA-G gene and also of other non-classical class I genes: only 36% of the known HLA-G alleles are associated with different HLA-G molecules when compared to classical class I genes, in which 75.4% for HLA-A, 77.8% for HLA-B, and 73.5% for HLA-C alleles are associated with different molecules (IMGT/HLA). The limited HLA-G coding region polymorphism is distributed among the alpha-1, alpha-2, and alpha-3 domains, while for classical class I genes, polymorphisms are found mainly around the region encoding the peptide binding groove, i.e., alpha-1 and alpha-2 domains (1). This is particularly evident for HLA-B, in which there is at least one recognized allele carrying a mutation for each nucleotide of exons 2 or 3, with few exceptions.
Among the high-frequency HLA-G coding alleles, we may find the G*01:01:01:01, G*01:01:01:04, G*01:01:01:05 (present at hg19), G*01:01:02:01, G*01:01:03:01, G*01:01:05, and G*01:01:07 alleles; all carrying intronic or synonymous mutations and encoding for the same full-length HLA-G molecule known as G*01:01. HLA-G*01:01:01:01 is the reference allele used by IMGT/HLA, it was the first one described (2) and usually the most common allele in all populations studied so far. Among the frequent ones, we also find the G*01:03:01:01 allele that is characterized by a non-synonymous mutation at position 292, codon 31, exchanging a Threonine by a Serine, encoding the full-length molecule known as G*01:03. Another group of alleles are represented by G*01:04:01, G*01:04:03, and G*01:04:04, all of them encoding the same molecule known as G*01:04. They are characterized by a non-synonymous mutation at position 755, codon 110, exchanging a Leucine by an Isoleucine, and by other synonymous mutations. The null allele, G*01:05N, which is associated with a truncated HLA-G molecule due to a deletion of a cytosine around codon 130 that changes the reading frame, is also very frequent in some African, Asian, and admixed populations. Finally, the last frequent allele is G*01:06, which is characterized by a nonsynonymous mutation at position 1799, codon 258, exchanging a Threonine by a Methionine, encoding a molecule known as G*01:06. Other HLA-G alleles are sporadically found around the world, but only the ones presented above have been described at polymorphic frequencies.
However, the variability in the HLA-G coding region may be higher than the one presented by IMGT/HLA, because IMGT/HLA only presents alleles that were cloned, sequenced, and properly characterized by the researchers. In addition, most of the known alleles are not fully characterized, presenting only some exons sequenced. Therefore, the variability at the HLA-G coding region may be greater than the one reported so far.
The reevaluation of the HLA-G sequencing data from the 1000Genomes Project indicated that the HLA-G coding region is indeed much conserved and just a few new coding alleles are frequently found worldwide. The approach described earlier evidenced the presence of 81 SNVs in the HLA-G coding region, as described in Table 2. Some of these variation sites are truly polymorphic, while some might be considered as mutations. In addition, some of these new sites are not represented in the IMGT/HLA database and might represent new HLA-G alleles.
As observed in Table 2, most of the 81 variation sites occur in introns (54 sites) or in exons as synonymous changes (16 sites). Thus, 86.4% of all variants are associated with the same HLA-G full-length molecule, unless they somehow influence HLA-G splicing pattern. Among the ones that might be related to different www.frontiersin.org   Frontiers in Immunology | Immunological Tolerance HLA-G full-length proteins, we may find two frameshift mutations: the first associated with the G*01:05N null allele and the second representing a low-frequency variation site not recognized by IMGT/HLA (genomic position 29797195); one variation site associated with a splicing acceptor site (genomic position 29795822, HLA-G position + 201) and eight non-synonymous modifications, most of them recognized by IMGT/HLA. Interestingly, one synonymous modification was found presenting a high frequency (2.93%) and is not associated with any known HLA-G allele described so far (HLA-G position + 2412, rs17179080, Table 2). Although a triallelic SNV is described at exon 2 (HLA-G position + 372), associated with the G*01:04:02 allele, we did not find the third allele in the present data. As described earlier, haplotypes were inferred considering all variation sites found in the HLA-G region. When the coding region is isolated from these haplotypes, we found 93 different HLA-G coding haplotypes, a number far higher than the number of HLA-G alleles officially recognized. The complete table of haplotypes is available upon request. Table 3 describes all coding haplotypes presenting a minimum global frequency of 1% and the closest known HLA-G allele in terms of sequence similarity. It should be mentioned that non-variable positions for the haplotypes presented in Table 3 were removed. Although 93 different haplotypes were inferred, only 11 present a frequency higher than 1%. Of those, 10 were compatible with a specific allele described at the IMGT/HLA database and mentioned earlier as high-frequency alleles that usually occur in any population, and 1 is a new allele that is close to G*01:01:01:01 but presents the frequent nucleotide change at position + 2412, not recognized by IMGT/HLA. As previously observed in other studies, the most frequent HLA-G allele is G*01:01:01:01, followed by G*01:01:02:01 and G*01:04:01. These 11 haplotypes or coding www.frontiersin.org Frontiers in Immunology | Immunological Tolerance   Table 3 do present heterogeneous frequencies among the 1000Genomes populations ( Table 4). The G*01:01:01:01 allele, for example, is very frequent among Europeans and Asians, presents intermediate frequencies among admixed populations and lower frequencies in African populations, while an opposite pattern is observed for the G*01:05N null allele. In addition, allele G*01:01:03:03 is absent or very rare in African populations, and the G*01:04:04, G*01:01:01:04, and G*01:01:01:01new alleles are absent in Asians.

HLA-G 3 UNTRANSLATED REGION VARIABILITY AND HAPLOTYPES
The reevaluation of the HLA-G sequencing data indicated that its 3 UTR presents several high-frequency variation sites in a short segment. The approach described earlier evidenced as much as 17 variation sites in this short region, as described in Table 5. Some of these variation sites are polymorphic and have been previously described in several studies that evaluated the HLA-G 3 UTR (38,69,76,88,(105)(106)(107)(108)(109)(110)(111)(112)(113)(114)(115)(116)(117), while some might be considered as mutations. In general, nine variation sites can be considered as true polymorphisms. It should be noted that the nomenclature used to designate HLA-G 3 UTR variation sites is based on our previous reports, being designated as UTR-1, UTR-2, and so forth (88). In this matter, the 14-bp insertion (rs371194629), although less frequent and not represented in the hg19 human genome, is considered to be the ancestral allele and should be counted for designate HLA-G 3 UTR positions.
When the 3 UTR segment is isolated from the 200 extended haplotypes found, we observe 41 different haplotypes for this region. Table 6 presents all haplotypes that reached a global frequency higher than 1% and the complete table of haplotypes is available upon request. Monomorphic positions considering these high-frequency haplotypes are removed from Table 6. Considering the global frequency of each haplotype, it is noteworthy that only nine haplotypes account for more than 95% of all haplotypes found. These haplotypes were named according to the previous studies addressing the HLA-G 3 UTR variability (38,69,76,88,(105)(106)(107)(108)(109)(110)(111)(112)(113)(114)(115)(116)(117).
The haplotypes found considering the reevaluation of the 1000Genomes data are consistent with the ones found in several other populations, and some haplotypes that were previously considered as rare ones (such as UTR-10 and UTR-18) are actually more frequent than previously thought considering all populations pooled together (global frequency). Some rare SNVs that were previously described using Sanger sequencing, such as the one at position +3001 (69,110,111), and others that were described in studies evaluating the 1000Genomes data, such as +3032, +3052, +3092, +3121, and +3227, were also detected in this reevaluation ( Table 5). In addition, it should be pointed out that the 14-bp polymorphism, which is absent at the 1000Genomes initial released VCF files, was retrieved from the raw sequence data and its genotypes were inferred for most of the samples.
Similar to the HLA-G coding region, a heterogeneous distribution of these nine 3 UTR haplotypes is observed among the 1000Genomes populations ( Table 7). The UTR-1 haplotype, for example, is very common in European populations, but presents lower frequencies in populations from Africa. The UTR-7 haplotype is absent or rare in populations of African ancestry, and haplotypes UTR-6 and UTR-18 are absent or rare in Asia. The 3 UTR haplotype frequencies in admixed populations are close to the ones reported for other admixed populations such as Brazilians (76,88,110,111). In addition, the frequencies observed for the 1000Genomes African populations are close to the ones reported for other African populations described in isolated reports (108,116,117). Moreover, the frequencies reported here are close to the ones presented for the same data in another manuscript (69), with some minor differences since this latter manuscript only imputed the 14-bp polymorphism and used the original 1000Genomes VCF data.
Haplotypes are ordered according to their global frequency.

HLA-G 5 PROMOTER REGION VARIABILITY AND HAPLOTYPES
As previously discussed, there is no consensus regarding where the HLA-G transcription starts. Considering NCBI and NM_002127.5, the HLA-G transcription starts 866 nucleotides upstream the initiation codon ATG. However, most of the studies performed so far regarding the HLA-G promoter structure did consider 1500 nucleotides upstream the main initiation codon ATG as the HLA-G promoter region. In this scenario, only SNVs above −866 should be considered as promoter SNVs (or SNVs from the upstream regulatory region) and the ones between −866 and −1 should be considered as 5 UTR SNVs. Nevertheless, despite of this inconsistency and considering the fact that there is no consensus yet regarding the HLA-G initial transcription starting point, in the present work we considered all SNVs upstream the main translation start point as promoter (5 upstream regulatory region) SNVs. The approach described earlier evidenced the presence of 35 SNVs in the HLA-G promoter region, as described in Table 8  as true polymorphisms (minor allele frequency above 1%), and at least 11 present frequencies around 50%. In addition, the trialleic SNP at position −725, as well as other known indels at the promoter region, was properly recovered. When the promoter region is isolated from the 200 extended haplotypes found, we observe 64 haplotypes for this region. Table 9 presents all haplotypes that reached a frequency higher than 1% and the complete table of haplotypes is available upon request. Monomorphic positions considering these frequent haplotypes were removed from Table 9. Considering the global frequency of each haplotype, it is worth mentioning that only nine haplotypes account for more than 95% of all haplotypes found. These haplotypes were named according to previously published works addressing the HLA-G promoter region variability (76,(118)(119)(120).

Frontiers in Immunology | Immunological Tolerance
As previously observed for both the coding and 3 UTR regions, promoter haplotype frequencies greatly vary among populations ( Table 10).

HLA-G EXTENDED HAPLOTYPES
As described earlier, 200 extended haplotypes were inferred considering the whole HLA-G sequence encompassing the promoter, coding, and 3 UTR segments. Since there is no official nomenclature for the entire MHC genes, the HLA-G extended haplotypes were named according to the nomenclature adopted for each HLA-G segment. As already observed for some populations (76,88,(118)(119)(120), the promoter haplotypes are usually associated with the same coding and 3 UTR haplotypes (Table 11). For example, promoter haplotype 010101a is usually associated with the coding www.frontiersin.org Table 9 | The most frequent HLA-G 5 promoter region haplotypes presenting frequencies higher than 1% considering all populations of the 1000Genomes Project (Phase 1).
Haplotypes are ordered according to their global frequency.
allele G*01:01:01:01 and the 3 UTR haplotype named UTR-1. The same phenomenon is observed for each of the main HLA-G promoter, coding, or 3 UTR haplotypes. In this matter, only 24 extended HLA-G haplotypes were found presenting a minimum frequency of 0.5% and representing more than 85% of all haplotypes, and only 15 present frequencies higher than 1%. The extended haplotypes shown in Table 11 were classified according to previously defined HLA-G lineages (76,118). It becomes clear that most of the extended haplotypes are associated with the same encoded full-length molecule and functional polymorphisms are mainly present at the regulatory regions. In fact, many polymorphisms in the regulatory regions do present high frequencies (around 50%), what is compatible with the evidence of balancing selection acting on the HLA-G regulatory regions (3,69,76,88,115,118,121). For example, lineages HG010101 (a, b or c) and HG010102 are associated with HLA-G coding alleles that usually encode the same HLA-G molecules (exception made to the G*01:06 and G*01:05N alleles), but the promoter and 3 UTR haplotypes are the most divergent ones compared to each other.
Recently, the Neanderthal genome sequence corresponding to a sample dating 40,000 years was published (122). The same pipeline described above was applied to this Neanderthal genome and we found that this unique sample does present a HLA-G haplotype found among modern humans with a frequency of 0.00604 (G010101f/G*01:01:01:04/UTR-6) and another haplotype that was not found in the present series and is composed of a recombined promoter, an unknown HLA-G coding allele close to G*01:01:02:01 and UTR-2.

HLA-G WORLDWIDE DIVERSITY
Human leukocyte antigen G worldwide intrapopulational genetic diversity was evaluated by means of different population genetics parameters ( Table 12). Except for the number of private Frontiers in Immunology | Immunological Tolerance www.frontiersin.org alleles, which is greatly influenced by sample sizes and the number of different samples from a same geographic area (group), African populations exhibited higher levels of genetic diversity in comparison with Europeans and Asians. Admixed populations sampled in America also revealed high levels of diversity. These findings are consistent with the current knowledge that older and admixed populations are prone to exhibit larger diversity than younger and non-admixed populations. Similar observations are made when the promoter ( Table 13) and coding ( Table 14) regions are considered separately. Since these differences between Africans and non-Africans are not as substantial as those observed for neutral markers (123), such similar levels of diversity may be reflecting both demographic events and the action of balancing selection. However, when the 3 UTR is considered (  (76) and also for the populations of the 1000Genomes Project (69), both the promoter and 3 UTR diversity have been shaped by a strong balancing pressure. The comparison of the three different HLA-G regions (Tables 13-15) also reveals interesting aspects. The average expected heterozygosity (gene diversity) for variation sites at the 3 UTR is~20% higher (0.2730) than the estimated ones for the promoter (0.2323) and coding (0.2244) regions. As a consequence, nucleotide diversity is 4.5 times higher for the 3 UTR (2.8640%) than for the promoter (0.6331%) and coding (0.6432%) regions. Nucleotide diversity at HLA-G 3 UTR is almost 40 times higher than the human genome average (0.075%) (118,124), resulting in an astonishing average of 8.19 differences when two randomly chosen 3 UTR (286-bp long) haplotypes are compared. Balancing selection favors the maintenance of different alleles in Frontiers in Immunology | Immunological Tolerance  www.frontiersin.org  Frontiers in Immunology | Immunological Tolerance a population, resulting in a proportionally higher average pairwise difference as compared with the measure of diversity based on the number of polymorphic sites. The worldwide nucleotide diversity at the whole HLA-G locus (0.7548%) is as expected slightly higher than that observed for the Brazilian population sample (0.00643%) (76). The direct comparison of haplotype diversity between the three regions could not be performed, since the very different lengths and number of variation sites of the three regions (Tables 2, 5, and 8) may bias any retrieved conclusions. Two independent approaches were used to evaluate the extent of differentiation between pairs of populations (interpopulation diversity): F ST and the exact test of population differentiation based on haplotype frequencies. Although these analyses have the same purpose and may provide similar results, both were performed to provide more reliable and robust conclusions. The analysis of the pair-wise F ST matrix revealed a large range of variation of F ST values: from −0.0150, between British from England and Scotland (GBR) and Iberian populations from Spain (IBS), to 0.2037, between Finnish (FIN) and Japanese (JPT) ( Table 16). While only 1 out of 6 (16.7%) pairs of admixed populations and 4 out of 10 (40%) European populations differed significantly at the 5% unadjusted significance level; it is noteworthy that the two African populations, as well as the three Asian populations, differed. IBS presented the lowest number of significant comparisons (2 out of 13), a fact that is clearly related to the lack of statistical power due to the small sample size. On the other hand, JPT (all comparisons), CHB (12 out of 13), CHS (12 out of 13), FIN (12 out of 13), and YRI (11 out of 13) presented the largest number of significant comparisons. An overall stronger differentiation was observed by the matrix composed of non-differentiation probability values obtained through the exact test of population differentiation ( Table 17). While only 3 out of 10 (30%) European populations differed significantly at the 5% significance level, it is noteworthy that the two African populations, as well as the three Asian populations and four admixed populations, differed. IBS presented the lowest number of significant comparisons (4 out of 13), while JPT, CHB, CHS and YRI differed in all pairwise comparisons including them. To sum up, both the exact test of population differentiation based on haplotype frequencies and the F ST estimate revealed the existence of highly significant difference between the 14 populations. Since the more frequent HLA-G haplotypes are shared between most of the populations, these pairwise population differences may be due to the existence of many low-frequency haplotypes that are restricted to two or three populations (22.5% of the 200 identified haplotypes) or are private to a single population (63% of the 200 haplotypes).
To further explore the genetic relationships between populations, an AMOVA was performed assuming a hierarchical structure in which the 14 populations were divided into four groups: African, Asian, European, and admixed populations ( Table 18). Considering the whole HLA-G gene, differences between the four groups account for only 2.45% of the variance, whereas 1.64% of the variance occurs as a consequence of differences between populations that belong to a same group. Almost all the variance (95.91%) is observed within populations. This same pattern is observed when each HLA-G region, i.e., promoter, coding, and 3 UTR, is considered separately, with the exception of the 3 UTR where the variance among groups (0.65%) gets even lower than the variance among populations that belong to a same group (1.32%), and is statistically non-significant.
Since the group composed of admixed populations represent an assembly of populations whose individuals present varying levels www.frontiersin.org  of ancestry that can be assigned to Africans, Amerindians/Asians, and Europeans, this group was removed from a second round of analysis ( Table 18). As a result, levels of variance between groups increased, although still lower than the expected ones for neutrally evolving sequences (123). Therefore, one may conclude that this analysis reflects the fact that most of the HLA-G diversity, particularly that from the 3 UTR, (a) originated from Africa before Homo sapiens dispersion to other continents and (b) has been maintained in worldwide populations by non-neutral evolutionary forces, particularly balancing selection. These conclusions are corroborated by previous data on HLA-G (68,69,76,89,121). Moreover, many different low-frequency haplotypes are being generated within populations by mutation and recombination. Frontiers in Immunology | Immunological Tolerance

HLA-G EVOLUTION ASPECTS
The MHC class I molecules evolved by a series of events that include chromosomal duplication, gene recombination, and selection probably driven by pathogens (125)(126)(127). Apparently, MHC-G, the HLA-G homologous sequence in non-human primates, is the oldest class I gene and it would be responsible for the origin of the whole class I loci (127). In fact, MHC class I genes from the New World primates, such as the cotton-top tamarin (Saguinus oedipus), are much closer to the human HLA-G than other human classical class I genes (127). This primate lineage separated from the one that gave rise to the Old World monkeys (or anthropoids) about 38 million years ago. It is noteworthy that the HLA-G and MHC-G molecules are functionally different despite the high identity among exonic sequences (128). New World primates' MHC-G plays a role in antigen presentation that is uncommon for human HLA-G, and this fact suggests that they are not orthologous as theorized in the past (129,130). In contrast, the cotton-top tamarin presents two MHC-C molecules with inhibitory properties that interact with KIR receptors (131). The regulation of MHC levels (in this case, MHC-C) in these non-human primates seems to be one of the responsible mechanisms for fetal acceptance as well as for the shorter pregnancy period (132). Old World primates have a peculiar MHC-G molecule. It presents just the α1 domain due to a stop codon at codon 164 (133), which may not hinder fetal protection against maternal NK cells, unless there is a mechanism in which the stop codon is ignored, allowing translation to continue (which is not discarded). In addition, gorillas and chimpanzees present a conserved MHC-G coding segment with few variations (3,128,129). Even the pregnancy period being shorter than in human beings, these species are polygamous, which would expose the female to different allogeneic fetuses during the fertile age. Orangutans on the other hand have long-lasting relationships and five MHC-G variants have been found so far -the polymorphism levels are low but more similar to human beings (3). Orangutans and humans are separated by about 15 million years of evolution. Possibly, the differences between maternal-fetal relationships among different species are responsible for each MHC-G peculiarities and for its function and variation levels.
In addition to alignments between human and other primates coding MHC-G sequences, analyses of HLA-G non-coding regions have proved to be highly informative about the evolutionary history of this gene. For example, the polymorphism of 14-pb located on HLA-G exon 8 (3 UTR) is exclusively found in the human lineage, suggesting that UTR haplotypes bearing the deletion such as UTR-1 are more recent than the ones that present the 14-bp fragment (134).
An interesting finding confirmed recently is that one of the most frequent HLA-G coding allele (global frequency of 0.24257), G*01:01:01:01, which is usually associated with UTR-1 and the promoter haplotype G010101a [described in Ref. (76) and Table 11], is probably the most recent haplotype. These data were established by the association between G*01:01:01:01/UTR-1 with an Alu insertion (AluyHG) that occurred before human dispersion from Africa, in a location 20 Kb downstream HLA-G 3 UTR. The frequency of this Alu element increases with distance from Africa (68).
Given the HLA-G immunomodulatory properties and the unique tissue expression patterns, HLA-G expression levels must be maintained under a fine regulatory control. In addition, the lack of variability found in its coding region and limited number of proteins coded by this gene lead us to believe that this region www.frontiersin.org is under tight evolutionary forces that limit variation. The differences on mammalian pregnancy and species-specific pathogens must be considered when studying the evolution of the immune system molecules.

HLA-G TRANSCRIPTION REGULATION
Most of the studies already performed to understand HLA-G regulation considered as the HLA-G promoter 200 nucleotides upstream the first translated ATG and within 1.5 Kb upstream the CDS. The HLA-G regulation is unique among all class I genes [reviewed at Ref. (67)]. Generally, HLA class I genes present two main regulatory modules in the proximal promoter region (within 200 bases upstream the CDS) that includes [reviewed at Ref. (67)] (a) the Enhancer-A (EnhA) that interacts with NF-κB family of transcription factors, which are important elements to induce HLA class I genes expression (135); (b) the interferon-stimulated response element (ISRE) that consists of a target site for interferon regulatory factors (IRF), which might act as class I activators (IRF-1) or inhibitors (IRF-2 and IRF-8) (135). The ISRE module is located adjacent to the EnhA element, and both work cooperatively controlling HLA class I genes expression; (c) the SXY module in which the transcription apparatus is mounted.
However, the HLA-G gene presents regulation peculiarities that differ from other class I genes [reviewed at Ref. (67)]. First, the HLA-G EnhA is the most divergent one among the class I genes and is unresponsive to NF-κB (136) and might only interact with p50 homodimers, which are not potent HLA class I gene transactivators (137). In addition, the HLA-G ISRE is also unresponsive to IFN-γ (138) due to modified ISRE. In fact, the HLA-G locus presents the most divergent ISRE sequence among the class I genes (135,136), what could explain the absence of IFN-γ induced transactivation. The ISRE is also a target for other protein complexes that may mediate HLA class I transactivation. However, both HLA-G EnhA and ISRE seem to bind only the expressed factor Sp1, which apparently does not modulate the constitutive or IFN-induced transactivation of HLA-G (136). Some polymorphisms in promoter region, such as −725 C > G/T, are close to known regulatory elements. In this matter, the −725 G allele was related with higher HLA-G expression levels (120).
The SXY module comprises the S, X1, X2, and Y boxes and is an important target for regulatory binding elements and HLA class I genes transactivation. Box X1 is a target for the multiprotein complex regulatory factor X (RFX), including RFX5, RFX-associated protein, and RFXANK (137,(139)(140)(141). The RFX members use to interact with an important element for HLA class II transactivation (CIITA), also important to HLA class I gene transactivation (139). The X2 box is a binding target for activating transcription factor/cAMP response element binding protein (ATF/CREB) transcription factor family (142) and Y box is a binding target for nuclear factor Y (NFY), which includes subunits alpha, beta, and gamma (NFYA, BFYB, and NFYC) (67,139). For HLA-G, the SXY module presents sequences compatible only with S and X1 elements, but divergent from X2 and Y. Because CIITA is dependent of a functional SXY module, which includes X2 and Y elements, the SXY module does not transactivate HLA-G gene (139,(143)(144)(145)(146).
Other regulatory elements within the HLA-G promoter have been described, such as heat shock element, located at −469/−454 position, that bind with heat shock factor-1 (HSF-1), important elements involved in immune responses modulation (147), and progesterone, which is a steroid hormone secreted from corpus luteum and placenta, involved with endometrium maintenance and embryo implantation [reviewed at Ref. (67)]. The mechanism involved in HLA-G expression induced by progesterone is primarily mediated by the activation of progesterone receptor and a subsequent binding to a progesterone response element, found in the promoter region (148). The transactivation of HLA-G transcription has also been demonstrated by leukemia inhibitory factor (LIF) (149) and methotrexate cell exposure (150). In addition, it was demonstrated an increased HLA-G transcription level in choriocarcinoma cell JEG3 line after the treatment with LIF. Furthermore, LIF induces HLA-G expression in the presence of endoplasmic reticulum aminopeptidase-1 (ERAP1), expressed in the endoplasmic reticulum, and repression of ERAP1 culminates in HLA-G downregulation, indicating that ERAP1 has an important role in HLA-G regulation (151). Finally, it is necessary to highlight the importance of methylation status of the HLA-G promoter, since it appears to be very important for HLA-G transcription (152,153).
Although some HLA-G regulatory elements are known, it is not clear why balancing selection is maintaining divergent lineages since most of the polymorphisms would not theoretically influence HLA-G transcription by the known mechanisms, mainly because they do not coincide with known regulatory elements [reviewed at Ref. (67)]. It should be noted that the same SNVs described for the HLA-G promoter in other manuscripts are also found in the present analysis.

HLA-G POST-TRANSCRIPTIONAL REGULATION
HLA-G might also be regulated by post-transcriptional mechanisms such as alternative splicing and microRNAs. Several studies have reported polymorphisms influencing splicing, mRNA stability, and also the ability of some microRNAs to bind to the HLA-G mRNA. The HLA-G 3 UTR segment is a key feature for its regulation mainly by the binding of microRNAs and influencing mRNA stability. HLA-G 3 UTR presents several polymorphic sites that influence gene expression [reviewed at Ref. (67)]. The 14-bp presence or absence (insertion or deletion) polymorphism was implicated in the HLA-G transcriptional levels and mRNA stability. The presence of the 14 bases segment in trophoblast samples has been associated with lower mRNA production for most membrane-bound and soluble isoforms (98,154), and the absence of this segment seems to stabilize mRNA with a consequent higher HLA-G expression (98,155,156). In addition, HLA-G transcripts presenting the 14 bases segment can be further processed with the removal of 92 bases from the complete mRNA (98), giving rise to a shorter HLA-G transcript reported to be more stable than the complete isoform (157). The alternative splicing associated with the presence of the 14 bases segment is probably driven by other polymorphic sites in Linkage Disequilibrium with this polymorphic site (3).
The SNP located at position +3142 has been associated with differential HLA-G expression, because it might influence microRNA binding (158). The presence of a Guanine at the + 3142 is associated with a stronger binding of specific microRNAs, Frontiers in Immunology | Immunological Tolerance such as miR-148a, miR-148b, and miR-152, decreasing HLA-G expression by mRNA degradation and translation suppression (3,158,159). In addition, the 14-bp region might also be a target for specific microRNAs and other 3 UTR polymorphisms might also influence microRNA binding (159). Another polymorphic site that would influence HLA-G expression is located at +3187. The allele +3187A is associated with decreased HLA-G expression because it extends an AU-rich motif that mediates mRNA degradation (106).
UTR-1 ( Table 6) is the only frequent 3 UTR haplotype that do not carry the 14-bp sequence, and both the high expression alleles +3142G and +3187A. Therefore, it was postulated that this haplotype would be associated with high HLA-G expression; this was confirmed by another study evaluating soluble HLA-G levels and 3 UTR haplotypes (109). In addition, as already introduced, this haplotype (together with the coding allele G*01:01:01:01) is probably the most recent one (109) and its frequency might be increased worldwide due to its high-expressing feature.

CONCLUDING REMARKS
Due to the key features of HLA-G on the regulation of immune response and immune modulation, particularly during pregnancy, the overall structure of the HLA-G molecule has been maintained during the evolution process. This is evident when the variability of more than a thousand individuals is taking into account, and only few encoded different molecules are frequently found. Most of the variation sites found in the HLA-G coding region are either synonymous or intronic mutations. The HLA-G promoter region presents numerous polymorphic sites, with several examples of variation sites in which both alleles are equally represented. Although the mechanisms underlying why some divergent promoter haplotypes are preferentially selected are still unclear, just a few divergent and frequent promoter haplotypes are found worldwide. The HLA-G 3 UTR variability is quite expressive considering the fact that most of the SNVs are true polymorphisms, they are equally represented, and this segment is of short size. These observations, for both promoter and 3 UTR, are compatible with the evidences of balancing selection acting on these regions. Finally, the population comparisons confirmed that most of the HLA-G variability has arisen before human dispersion from Africa and that the allele and haplotype frequencies might have been shaped by strong selective pressures.