HLA-A, -B, -C, -DRB1, -DQB1, and -DPB1 Allele and Haplotype Frequencies of 28,927 Saudi Stem Cell Donors Typed by Next-Generation Sequencing

Human leukocyte antigen (HLA) allele and haplotype frequency distribution varies widely between different ethnicities and geographical areas. Matching for HLA alleles is essential for successful related and unrelated stem cell transplantation. Among the Saudi population, data on HLA alleles and haplotypes are limited. A cross-sectional study was performed on 28,927 bone marrow donors. The most frequent HLA alleles were HLA-A*02:01:01G (20.2%), A*24:02:01G (7.5%); B*51:01:01G (19.0%), B*50:01:01G (12.3%); C*06:02:01G (16.7%), C*07:02:01G (12.2%); DRB1*07:01:01 (15.7%), DRB1*03:01:01G (13.3%); DQB1*02:01:01G (29.9%), DQB1*03:02:01G (13.2%); and DPB1*04:01:01G (35.2%), DPB1*02:01:02G (21.8%). The most frequent HLA-A~C~B~DRB1~DQB1 haplotypes were A*02:01:01G~C*06:02:01G~B*50:01:01G~DRB1*07:01:01G~DQB1*02:01:01G (1.9%) and A*02:05:01G~C*06:02:01G~B*50:01:01G~DRB1*07:01:01G~DQB1*02:01:01G (1.6%). The most frequent HLA-A~C~B~DRB1~DQB1~DPB1 haplotypes were A*02:01:01G~C*15:02:01G~B*51:01:01G~DRB1*04:02~DQB1*03:02:01G~DPB1*04:01:0G (1%) and A*02:01:01G~C*07:02:01G~B*07:02:01G~DRB1*15:01:01G~DQB1*06:02:01G~ DPB1*04:01:01G (0.9%). Based on these haplotype frequencies, we provide forecasts for the fraction of patients with full matching and single mismatched donors for 3 to 6 loci depending on the registry size. With one million donors, about 50% of the patients would find an 8/8 match and 90% a 7/8 match. These data are essential for registry planning, finding unrelated stem cell donors, population genetic studies, and HLA disease associations.


INTRODUCTION
The human leukocyte antigen (HLA) system is encoded by the most polymorphic genes known, located on the short arm of chromosome 6 (1). The HLA system is divided into class I (HLA-A, -B, and -C) and class II (HLA-DR, -DQ, and -DP) loci. Recent advances in sequencing technologies helped in unraveling a vast number of new HLA alleles. So far, 27,589 different HLA allele sequences have been reported based on the latest IPD-IMGT/HLA Database 3.41.0 (2). The HLA system plays an essential role in activating the immune system through the presentation of processed antigens to CD4+ and CD8+ T cells. On the other hand, HLA is a key player in the success of organ and stem cell transplantation. Mismatch in both HLA class I and II alleles between donor and recipient is a major cause of organ rejection, graft failure, and graft-vs. host disease following hematopoietic stem transplantation (3). In addition, certain HLA alleles were found to be associated with autoimmune diseases, such as ankylosing spondylitis, Type I diabetes, Behçet's disease, celiac disease, and rheumatoid arthritis (4).
Saudi society has a very high rate (more than 70%) of consanguineous marriages (5). Together with having large families, the chances of finding an HLA-matched relative are very high. However, a proportion (30-40%) of patients cannot find a matched donor (6). Therefore, the Saudi Stem Cell donor registry (SSCDR) was launched in 2011 in Riyadh, Saudi Arabia. SSCDR is a national organization that manages unrelated stem cell donors and cord blood units as a source for allogeneic stem cell transplantation. The aim of SSCDR establishment was to provide stem cells for those patients without a matching family donor. Today, SSCDR has more than 75,000 registered donors who are willing to donate their stem cells for any patient in need worldwide. Currently, SSCDR works closely with 14 donor recruitment centers in Saudi Arabia and has facilitated more than 67 successful stem cell transplants, 40 for national patients and 27 for international patients, including from Germany, the United Kingdom, the United States, Spain, Italy, Turkey, Sweden, Norway, India, and Australia. The goal of any stem cell registry is to reflect the HLA allele polymorphism of the population it serves. This can be challenging in countries such as Saudi Arabia with high HLA diversity due to a large number of ethnic groups from different countries that have settled there over the years in the holy cities of Mecca and Madinah.
The aim of this study was to evaluate the HLA class I and class II alleles and haplotypes from the existing database of registered Saudi stem cell donors. This study is essential for stem cell transplantation programs as the level of matching between donor and recipient is very important for the outcome of hematopoietic stem cell transplantation.

MATERIALS AND METHODS Population
A cross-sectional study was performed on 28,927 Saudi volunteers for bone marrow donors registered in the database of the SSCDR and typed for HLA-A, -B, -C, -DRB, -DQB1, and -DPB1 at a high-resolution level from 2013 to 2017. All donors included in the analysis fulfilled the eligibility criteria and were unrelated Saudi Arabs, 49% males and 51% females, 18-60 years old with an average age of 31 from different regions within the country, including Riyadh, Qasim, Dammam, Assire, Albaha, Jizan, Jeddah, and Madinah. These regions were not linked directly to each subject at that time; therefore, this cohort was not classified or grouped geographically.

HLA Class I and HLA Class II Typing
This project was approved by the local IRB at King Abdulaziz Medical City, Riyadh, National Guard Health Affair, Saudi Arabia. All donors were asked to sign a formal written consent. Blood samples, six drops on filter paper were collected. Total genomic DNA was extracted using QIAamp96 DNA Blood Mini Kit according to the manufacturer's instructions (Qiagen, Hilden, Germany). HLA-A, -B, -C, -DRB1, -DQB1, and -DPB1 genotyping was carried out by next-generation sequencing (NGS) at the laboratories of Histogenetics (Ossining, NY). Briefly, HLA alleles were typed using high-resolution (Class I: Exon 2 and 3, Class II: Exon 2) (7) using the Illumina platform (San Diego, CA). The sequencing templates (DNA libraries) for Illumina were prepared according to the manufacturer's protocols, involving two rounds of amplification. For sequence data analysis, MiSeq generates FastQ files that are used by the in-house NGSAutoTyper software to perform the final typing. HistoS software is used for further review and analysis, and the data is then transferred for final reporting to HistoTyper software. For certain typing results, we gave alternative combinations with G codes. Then, we selected the highest probable combination based on the frequency.
Homozygous typing results were controlled by the following method. For class I, exon 2 and 3 are repeated with alternative primers complementary to sequences in the middle of exon 2 and exon 3 to generate amplicons spanning mid exon 2intron 2-exon 3 (bridge amplification). In addition known HLA-B-C linkages are checked. If the bridge amplicon or B-C association confirms the homozygosity, then results are reported. If the results indicate heterozygous typing, a 1-KB amplicon is repeated spanning the whole exon 2-intron 2-Exon 3 with PacBio sequencing technology. For class II, DRB1 is analyzed with one generic and two group-specific amplifications. DQB1 and DPB1 are analyzed with one generic and one group-specific amplification. In addition, known DRB1-DRB3/4/5 and DRB1-DQB1 linkages are checked. In the presence of base occurrence more than 300 and a known linkage, homozygous results are accepted. In the absence of the latter, sequencing is repeated by a long-range amplicon spanning exon 2-intron 2-exon 3 using PacBio technology.

Criteria for Common and Well-Documented (CWD) Alleles
In this study, the CWD alleles were defined according to CIWD 3.0.0 inclusion criteria (8). Common alleles are those with ≥1 in 10,000, and well-documented alleles are those with ≥5 occurrences. Intermediate alleles (>1 in 100,000) were not calculated due to low sample size.

Statistical Analysis
The population analyses and genetic diversity measures were calculated using Arlequin 3.5 software (9). In particular, allele frequencies and three, five, and six locus haplotype frequencies were determined by the expectation-maximization (EM) algorithm (10).
A pairwise linkage disequilibrium (LD) test was performed with a Markov chain algorithm using estimated haplotypes for each individual obtained with the Excoffier-Laval-Balding (ELB) algorithm. Frequencies of common haplotypes were similar with both methods, and those with significant pairwise LD are shown in the tables. For testing the deviation from Hardy Weinberg equilibrium, a Markov chain algorithm was performed. A selective neutrality test could not be performed due to software allele and haplotype number limitations. Computed two loci (HLA-B-C and HLA-DRB1-DQB1) associations were listed as common (with observed frequency of more than 100 times) or rare (with observed frequency of <100 times).
The probability of finding a completely matching and single mismatched donor for 3-6 loci, depending on the registry size, was calculated based on the formula first shown in 1989 (11) and refined in 2014 (12). As is a common practice today (13), matching was based on the antigen recognition domain (exons 2 and 3 for class I and exon 2 only for class II).
An overview over the frequency distribution of all six loci is shown in Table 3 and Figures 1A,B. Apparently, HLA-DQB1 and HLA-DPB1 are much less polymorphic than the other loci, in particular due to a number of frequent alleles. HLA-A and HLA-DRB1 carry a very similar amount of information that is higher than in HLA-C but lower than in HLA-B, which is by far the most polymorphic of the six loci. As is well-known, a rare allele has to be seen at least three times in order to be more frequent than an unobserved one significantly with p < 0.05 according to the Poisson distribution. As a consequence, we have marked this frequency in Figure 1A and the corresponding cumulative frequency in Figure 1B. Figures 2A,B present two analogous plots for the haplotypes of 2-6 locus combinations frequently considered in donor-patient matching, and other combinations are shown in Supplementary Tables 1-3 and Supplementary Figures 1A,B, 2A,B. These figures demonstrate that our study covers more than 95% of all 2-locus haplotypes of the Saudi population; however, this fraction diminishes to about 80% for 4 and 5 loci and 70% for 6 loci.
The pairwise linkage disequilibrium (LD) parameters, frequency, observed, expected, D, D' , and p values for each possible pair of two HLA alleles are estimated in Supplementary Table 4. Most alleles at HLA-A and HLA-B are in strong LD with HLA-C. The relative strength of LD between two HLA loci was calculated based on the pairwise LD parameters for all the allelic pairs. Common and rare associations between HLA-DRB1-DQB1 and HLA-B-C in the Saudi population are presented in Tables 4, 5, respectively. CWD alleles are presented in Table 6. CWD alleles represent 60% of the allelic distribution in the Saudi population. The data used in this analysis are available in the Allele Frequencies Net Database (http://allelefrequencies.net/hla6006a.asp?hla_population=3685) and on a public website (https://www.ihiw18.org/).   Table 7 with a frequency >0.5%. Only one HLA-A∼C∼B∼ DRB1∼DQB1∼DPB1 haplotype was of >1% frequency in our population: HLA-A * 02:01:01G∼C * 15:02:01G∼B * 51:01:01G∼ DRB1 * 04:02∼DQB1 * 03:02:01G∼DPB1 * 04:01:0G. A full list of all HLA-A∼C∼B∼DRB1∼DQB1 and HLA-A∼C∼B∼DRB1∼ DQB1∼DPB1 haplotypes (observed ≥3 times) is shown in Supplementary Table 9.   For each of the six loci, the table shows Shannon's entropy of the frequency distribution as well as the number of the most frequent alleles required to cover 50, 75, 90, and 95% of the gene pool as well as the total number of alleles observed in the sample.

Haplotype Frequencies
For haplotypic diversity, the mean expected heterozygosity was 1.382 (±0.186). In haplotype-level computations, gene diversity and average gene diversity over loci were found to be 0.999 and 0.892 (±0.476), respectively.

Applications for Matching
The 6-locus haplotype frequencies were mapped to P-codes (implicitly ignoring the rare intron-based null alleles), and then, protein-level phenotype frequencies were derived for the purpose of the matching extrapolations. Figure 3 shows that, for example, with one million donors, only about 50% of the patients would find an 8/8 match, but already 90% would get a 7/8 match. Overall, registry sizes required for identical rates of full matches for 3-6 loci are roughly in proportions of 1:2:2:10, and with one mismatch accepted, this ratio is 1:4:10:40.
For HLA-B, only four alleles show a frequency >5%. These alleles are B * 51:01:01G, B * 50:01:01G, B * 08:01:01G, and B * 07:02:01G. These four alleles account for 43.4% of HLA-B diversity in this cohort. HLA-B * 50:01 is frequently seen in Arab populations. In Saudi Arabia, B * 50:01:01 accounts for almost all B * 50 alleles (19). In this cohort, HLA-B * 50:01 is the most common B * 50 allele (seen in 7170 individuals), and only 17 individuals carry the B * 50:02 allele. Moreover, B * 50:01:01G is also frequently seen in Caucasians, North Africans, and West-South Asians (20). HLA-B * 15 exhibits the highest number of alleles as 33 different alleles are detected in this cohort; however, B * 15 is not a common allele in Saudis. B * 51 alleles are the second highest polymorphic with 12 different alleles-only B * 51:01 is the most common. This has implications on finding a matched donor for those individuals carrying the rare B * 51 or B * 15 alleles.
For HLA-C, only five alleles show a frequency of >5%. The most frequent HLA-C alleles are C * 06:02:01G, followed by C * 07:02:01G, C * 04:01:01, C * 152:0, and C * 07:01:01G. These five alleles account for 61.4% of HLA-C diversity in this cohort. HLA-C * 07 exhibits the highest number of alleles; 13 different alleles are detected in in this cohort with C * 07:02 and C * 07:01 as the most common alleles. In a study comparing serology with molecular typing, we find the same HLA-C common alleles, including HLA-C * 15, which was not detected by serology (21).
For HLA-DPB1, only five alleles show a frequency of >5%. The most frequent HLA-DPB1 alleles are DPB1 * 04:01:01G, DPB1 * 02:01:02G, DPB1 * 03:01:01, DPB1 * 04:02:01, and DPB1 * 17:01:01. These five alleles account for 80.4% of HLA-DPB1 diversity in this cohort. All top 3 DPB1 alleles are frequent in Asians, especially Chinese, and in Caucasians (20). No single DPB1 type shows diversity; this might be a reflection of the DPB1 nomenclature, which unlike the other HLA genes, does not depend on exon 2. Typing for polymorphisms within DPB1 may not be resolved at the allele level using the current NGS solutions, in which an average fragment size of 200 bp may not resolve cis/trans polymorphisms. This issue, however, can be sorted by third-generation long-read technology (22). Hardy Weinberg equilibrium analysis shows an excess of homozygotes in this large cohort. This phenomenon was also observed previously by our group (17,18) and others (23) and might be explained by the high consanguinity marriages in the Saudi population (5). We observed different allele and haplotype frequencies between the central and Eastern provinces of Saudi Arabia (18,24). Thus, it is important to study different regions of Saudi Arabia independently. However, as with other populations, Saudis are not restricted to their area of origin and many move to Central, Eastern, and Western provinces seeking jobs. The most common HLA-A∼C∼B∼DRB1∼DQB1 haplotype seen in this study is A * 02:01:01G∼C * 06:02:01G∼B * 50:01: 01G∼DRB1 * 07:01:01G∼DQB1 * 02:01:01G. This haplotype is not common in other populations; however, it is seen at low frequency in American Hispanics, Indian Tamil Nadu, and Columbia cord blood (20). The second most common haplotype is A * 02:05:01G∼C * 06:02:01G∼B * 50:01:01G∼DRB1 * 07:01:01G∼DQB1 * 02:01:01G. This haplotype differs from the previous haplotype by the HLA-A * 02:05 allele and seems to have similar distribution (20). Interesting to note are the common HLA A∼C∼B∼DRB1∼DQB1∼DPB1 haplotypes; the top three haplotypes account for 3%, and all of them are A * 02:01:01G-based haplotypes, thus reflecting the high frequency of A * 02:01:01G, which accounts for 1/5 of the population.    Here, we show the number of CWD alleles. We applied CIWD catalogs 3.0 (8). CWD alleles can be calculated but not intermediate as intermediate alleles require a much larger sample size to be calculated (>1 in 100,000 sample size). CWD alleles for all loci, in our Saudi cohort are in the range of 60%. HLA-B has the highest number of common and highest number of welldocumented alleles, and HLA-DQB1 has the lowest numbers for both categories. Our raw data is available through version 3.0 of the CWD catalogs (8).
In any statistical analysis, detail and precision are competing properties, which is reflected in our study by the changing coverage of the gene pool with the significantly positive alleles and haplotypes. Although we are getting more than 99.5% for all individual loci and 97-99% for all two-locus combinations, this is gradually decreasing to 72% for 6-locus haplotypes. We compare that to the German population (12) in which a sample of about 30,000 individuals would cover a very similar part of the polymorphism now well-described based on a sample of several millions.
The different spread between the full match and the onemismatch curves in Figure 3 is primarily due to the strong linkage disequilibrium between HLA-B and -C as well as between HLA-DRB1 and -DQB1. As a consequence, single mismatches are much less likely than double mismatches. Basically, with a registry size of one million, 10/10 matches might be found for about half of the Arab patients while 60% of patients would find a 9/10 donor among only 100,000 donors. The main caveat in those theoretical calculations is that algorithms and formulae deriving haplotype frequencies from phenotypes and then applying those frequencies back to diploid individuals are all based on the concept of a Hardy Weinberg equilibrium, which our population does not fulfill due to regional subpopulations and non-random mating. On the other hand, it is probably the best extrapolation that can be made on the basis of today's data, and one of the major developments of our century in most regions of the globe will be the abrasion of deviations from HWE in all regions and at all scales.
In conclusion, the results of this study present information that can be used as a tool to identify a hematopoietic stem cell unrelated donor recruitment and selection strategy as well as a helpful tool for population genetic studies and HLA disease associations. Furthermore, knowledge of populationspecific allele and haplotype frequency provides hypothetical estimation of the chances of finding matched donors in the registry. There are limitations to our study as we could not stratify our subjects geographically as we have people moving routinely between regions. However, this will be looked at carefully in the future at the time of new donor registration. This may be achieved by asking the donors about the place of birth of both parents and grandparents.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.