A Detailed View of KIR Haplotype Structures and Gene Families as Provided by a New Motif-Based Multiple Sequence Alignment

Human chromosome 19q13.4 contains genes encoding killer-cell immunoglobulin-like receptors (KIR). Reported haplotype lengths range from 67 to 269 kb and contain 4 to 18 genes. The region has certain properties such as single nucleotide variation, structural variation, homology, and repetitive elements that make it hard to align accurately beyond single gene alleles. To the best of our knowledge, a multiple sequence alignment of KIR haplotypes has never been published or presented. Such an alignment would be useful to precisely define KIR haplotypes and loci, provide context for assigning alleles (especially fusion alleles) to genes, infer evolutionary history, impute alleles, interpret and predict co-expression, and generate markers. In order to extend the framework of KIR haplotype sequences in the human genome reference, 27 new sequences were generated including 24 haplotypes from 12 individuals of African American ancestry that were selected for genotypic diversity and novelty to the reference, to bring the total to 68 full length genomic KIR haplotype sequences. We leveraged these data and tools from our long-read KIR haplotype assembly algorithm to define and align KIR haplotypes at <5 kb resolution on average. We then used a standard alignment algorithm to refine that alignment down to single base resolution. This processing demonstrated that the high-level alignment recapitulates human-curated annotation of the human haplotypes as well as a chimpanzee haplotype. Further, assignments and alignments of gene alleles were consistent with their human curation in haplotype and allele databases. These results define KIR haplotypes as 14 loci containing 9 genes. The multiple sequence alignments have been applied in two software packages as probes to capture and annotate KIR haplotypes and as markers to genotype KIR from WGS.


INTRODUCTION
The proteins of the killer-cell immunoglobulin-like receptor (KIR) family are important to human health. Whether or not these transmembrane receptors on natural killer (NK) cells bind with peptide-bound human leukocyte antigen (HLA) class I molecules determines how they help educate, activate, and inhibit NK cell functionality, including cytotoxicity and cytokine release. The accumulated evidence seems to indicate that KIR evolved with HLA to balance its pathogen defense effects with its effects on reproduction via the embryo-uterus interface (1)(2)(3). KIR effects are generally tissue-specific and tissue-variable. Within a given NK cell, KIR expression is stochastic, depending on the epigenetic profiles, inter-and intra-genic content of the haplotypes, binding alleles or lack thereof, methylation status, alternative splicing, and randomness (4)(5)(6). Besides investigating KIR in the context of viruses and pregnancy, medicine is also studying its effects in cancer, hematopoietic stem cells transplants, various autoimmune diseases, and immunotherapy (7)(8)(9).
KIR gene names reflect their protein structures (10). First the prefix "KIR," followed by the number of extracellular domains ("2D" or "3D"), followed by a short or long intracellular domain ("S," L"), followed by an index. KIR2DS1, for example, has two extracellular domains ("2D") and a short intracellular domain ("S"); it is the first gene named with that structure ("1"), and KIR2DS2 is the second. The Immuno Polymorphism Database for KIR (IPD-KIR) names KIR gene alleles, records their DNA-RNA-protein relationships, and annotates each gene's alleles in a multiple sequence alignment (11). It contains over 300 full-length DNA and almost 1,000 protein reference alleles in the latest release, 2.9.0.
There is no equivalent of IPD's allele annotation for haplotypes. KIR haplotypes have no official nomenclature, although the majority of contributions to the human genome reference use a convention set by Pyo et al. in 2010 (12) and 2013 (13). The haplotype names reflect the two-part structure of the region: a proximal centromeric ("c") region is paired with ("~") a distal telomeric ("t") region. Each region is a variant of a "A" haplotype or "B" haplotype family, followed by a two-digit index. The haplotype named "cB02~tA01," for example, is comprised of the second centromeric B region ("cB02") in cis with ("~") the first telomeric A region ("tA01").
Although there are many publications that analyze gene or intergene allele alignments, none report full haplotype multiple sequence alignments (MSAs), and therefore the haplotype nomenclature has not been formalized. Allele gene assignments are evaluated largely by amplification primers or sequence similarly to other alleles, as opposed to location in its haplotype. To help solve these issues, we present a simple bioinformatics approach to represent KIR haplotypes as a string of alignmentbased motifs. We applied it by creating a full-haplotype DNA MSA for all 68 human haplotype sequences in the human genome project and a chimp haplotype for an outlier.
In our recent preprint describing a KIR long-read assembler (14), we show that 18 120 bp sequences can be used to capture full haplotype KIR DNA from PacBio circular consensus sequencing (CCS) reads. In this manuscript, we show when those probe sequences are aligned to assembled haplotypes, the pattern of the probes ("motifs") provides a structural annotation. These alignment locations of the motifs allow the haplotype sequences to be represented accurately and efficiently at the structural level. We leveraged that annotation to align the 68 published human KIR haplotypes and one chimpanzee that haplotype to within an average of 2,398 bases, and then we micro aligned each locus for precise full haplotype multiple sequence alignment. The results are concordant with the annotation in the human genome reference and reveal 13 structural haplotypes for the 68 human haplotype sequences. The MSA also shows the haplotypes have 14 loci containing 9 genes. KIR2DS4 shares a locus with KIR2DS3 and KIR2DS5, as opposed to KIR2DS1 or by itself. Sharing a gene motif with locus KIR2DS3/ KIR2DS4/KIR2DS5 is the shared locus KIR2DL2/KIR2DL3. KIR2DL1, KIR2DS1, and KIR2DS2 share a gene motif at different loci, as do KIR2DL5A and KIR2DL5B.
The MSA includes this study's contribution of 27 new haplotypes to the human genome project, including 24 haplotypes from 12 individuals of African American ancestry.

Source Sequences and Annotation
The source sequences consisted of two newly assembled haplotypes from an Ashkenazim individual (from the assembler preprint), along with all 66 full-sequence alternative haplotypes in the human genome reference (12,13,15,16) and a chimpanzee haplotype (GenBank accession AC155174.2) (17). The two Ashkenazim haplotypes were assembled with PacBio reads obtained from the Genome In a Bottle (GIAB) consortium (18) and have not yet been submitted to genetic databases. All other haplotypes were generated by physically separating chromosomes via fosmid cloning and then sequencing and assembling fragments into full haplotype sequences. The 68 include 27 new sequences first described in this manuscript, including 24 haplotypes from 12 individuals of African American ancestry that were selected by genotypic diversity and/or lack of representation in the human genome reference. An additional three haplotypes (one Asian, two European) of opportunity from two individuals were contributed by Scisco Genetics, who sequenced all non-GIAB haplotypes following previously reported protocols (12,13). The GenBank entries also provided structural and allelic annotation of all the human haplotypes, except for the two Ashkenazim haplotypes. Broken down by population, the haplotype counts include 31 African or African American, 3 Asian, 2 Ashkenazim, 24 European, 2 Guarani South Amerindian, 2 Romani in Spain, and 4 unknowns. The structures are depicted schematically in Figure 1, excluding cA01~tB04, which contains a large insertion and was excluded for compact visualization.

Workflow Overview
The result of the workflow was a multiple sequence alignment of 68 human and 1 chimp haplotypes. As shown in Figure 2, the first step was to align the 18 capture probes to the haplotype sequences ( Figure 2A). Next, the sequence between each pair of probes (e.g., probe 4 to probe 3) was assigned a letter (e.g., C) ( Figure 2B). In this way, each haplotype can be represented by a string ("motifs," e.g., "…CIJKL…") whose length is 105 characters or less. These haplotype motifs were aligned to produce a high-level MSA ( Figure 2C). The motifs allowed loci to be located in the haplotype sequences. The sequences for all alleles of each locus was aligned separately, and then the MSAs were concatenated to create the full haplotype DNA MSA ( Figure 2D). In this two-alignment approach, the motif alignment provides structural annotation and the locus-specific DNA alignment provides base-level accuracy. The following paragraphs detail each step.

Probe Alignment and Inter-Probe Naming
The assembler preprint describes a method that uses 18 120 bp probes to capture KIR PacBio long-read sequences and assemble them into haplotypes with an average of 97% coverage and 99.7% concordance compared with reference sequences. When the probes are aligned to the haplotypes (Figure 2A), they align every 2,398 bases on average. Probe locations are discovered by alignment with bowtie2 with options "-a -end-to-end -rdg 3,3rfg 3,3." The alignment order of the 18 probes across a haplotype sequence allows that haplotype structure to be succinctly annotated as sequences of probe pairs ( Figure 2B). For example, assume the alignment of probe 4 followed by probe 3 is called "C," probe 3 followed by probe 12 is called "I," and probe 12 followed by probe 10 is called "J." Then the region "probe 4 to probe 3 to probe 12 to probe 10" can be called "CIJ." Probe 1 repeating once in the alignment is "Z" and twice is "ZZ," etc.
There are 42 such probe pairs in this collection of 69 haplotypes.
In this way, haplotypes can be briefly annotated as strings of a 42-character alphabet. See Figure 3 for an example of KP420440, whose full motif is MHCIJKLFGHCIJKLAIRLFZGHCIJ KLMHCIRLMHCIJKLSCTUVWXYKLSCNOJKLAIRLFZGHC IJKLMHCIJKLFPCNOJQ. The motif pairs are defined in Supplementary Table 1; the probe sequences are defined in the assembly manuscript and are also available via GitHub at https://github.com/droeatumn/kpi/tree/master/input.

Inter-Probe Name Alignment
A full-haplotype multiple sequence alignment of the probe motifs was generated for the 68 human haplotypes plus the chimpanzee as an outgroup ( Figure 2C). Alignment of the motifs was created with MAFFT (19), but mostly aligned manually with Aliview 1.2.6; manual alignment was required to resolve ambiguities, as the motif alphabet is not directly supported by DNA or protein aligners. Ambiguous alignments were resolved by following the human curation of the reference haplotypes.

DNA Alignment
MAFFT was used to merge or add these haplotypes in the order cB02~tA01, cA01~tA01, cA02~tA01, cB02~tB01, cA0X~tB0X, cB0X~tA01, and cB0X~tB0X, where "X" a general number not already used. Then, the full-length DNA sequences were separated into sets as defined by their motif structures ( Figure  2D). Each set was aligned separately with MAFFT. The alignment was then edited manually by adding or deleting gaps to conform to locations in the probe motif MSA. Using the loci defined in the motifs, and the motif locations defined in the DNA, the alignment was refined for all alleles in each locus with MUSCLE in Aliview. A high-level depiction of the alignment was created with Jalview 2.11.0's "Overview Window" functionality and NCBI's Multiple Sequence Alignment Viewer 1.13.1. The MSA was validated at the structural level by showing that it recapitulates the human curation of the reference haplotypes in GenBank and the human genome reference. The alignment was validated at the allele level by showing each allele is assigned and annotated as expected with respect to IPD-KIR classifications.
The ability of existing software to align the 69 haplotypes was also evaluated with MAFFT stand-alone (-thread 19 -threadtb

Haplotype References
The African American haplotype sequences MN167504-MN167530 were deposited in GenBank; Supplementary Table 2 contains details, including allele calls as IPD-KIR names for all 68 haplotypes in the human genome reference. Structural annotation for the new sequences was confirmed by the workflow by Pyo et al. Step (A) depicts the alignment of 18 120 base probes to each haplotype sequence.
Step (B) shows how each ordered pair of probes in the alignment is assigned a letter. For example, the letter "I" represents the sequence between the alignment of probe 3 and probe 12.
Step (C) shows how each haplotype is depicted by a string of letters, how a haplotype motif MSA is generated from them, and how loci are defined in the MSA.
Haplotype sizes by number of loci (not including the KIR3DP1-KIR2DL4 intergenic region) range from four loci (cA04~tA03) to 18 (cA01~tB04). Figure 6 displays a phylogenetic tree from the motifs of the 69 haplotypes, including cA01~tB04 and the chimp haplotype; the clusters recapitulate the their human-annotated names shown on the right part of the figure.
The DNA alignment of the 67 human haplotypes (minus cA01~tB04) has 263,556 positions; it is included in Supplementary Data Sheet 3. Figure 7B shows an overview of the DNA haplotype alignment as generated by Jalview's "Overview Window" function; the names of the haplotypes were added after exporting the overview image from Jalview. The pattern of white (gaps) and gray (aligned sequences) mirror the patterns in the schematic depictions of the human curated  haplotypes in Figure 1 (recapitulated in Figure 7A). This evidence shows the DNA alignment matches the human curation at a structural level. Figure 8 shows a view of the alignment of the 67 haplotypes using NCBI's MSA Viewer. Positions that are in an agreement with consensus are colored in gray, positions that are not in an agreement with consensus are colored in red. In this tool, consensus includes gaps (lack of an allele at that position). For KIR, this causes the regions specific to B haplotypes to show as pure red, and the other regions to show as mixture of red and gray.

Validation of the Multiple Sequence Alignment-Defined Allele Assignments by Immuno Polymorphism Database for KIR Annotation
The haplotypes are annotated by the names and orders of their alleles in Supplementary Table 2. Of the 647 alleles in the 68 haplotypes, 556 (86%) can be assigned names via IPD-KIR 2.9.0, at least at protein resolution. Unnamed alleles occur either because they have not yet been included in the IPD-KIR database, or they are partial alleles in the case of KIR3DL3 and KIR3DL2. 97% of alleles can be named excluding KIR3DL3, KIR2DP1, and KIR3DL2. 39 KIR2DP1 alleles are unnamed. To check the gene assignment for the 91 alleles that could not be named, the sequences were aligned to a set of 17 full gene alleles, each the first named allele for its assigned gene in IPD-KIR. Each allele being evaluated was considered to be correctly annotated if the allele sequence aligned closet to the IPD-KIR reference to which it was assigned by the motifs. Those results show that every unnamed allele aligned to the reference allele predicted by its motif assignment. The only exception was the GU182360 KIR3DL2, which aligns closest to KIR3DP1; however, this haplotype sequence is incomplete on the 3' end, and the KIR3DL2 allele only contains the 2,221 5'-end sequences. Alleles that are a fusion of KIR3DL1 and KIR3DL2 (e.g., KIR3DL1 alleles 059-061) are classified as KIR3DL2 alleles using motifs but are classified as KIR3DL1 alleles in IPD-KIR.

Comparison with Existing Multiple Sequence Alignment Software
Of the existing alignment software that was evaluated on the 67 human haplotypes, only stand-alone MAFFT was able to generate an alignment, using the FFT-NS-2 strategy. The input size was too large for Clustal Omega local server, M-Coffee web server, MUSCLE local server, and WebPrank web server, and the other algorithms ran out of memory. The MAFFT alignment, output, and overview are included in Supplementary Data Sheet 4 and Supplementary Image 2.
Except for large portions of KIR3DL3, distal KIR2DL3/KIR2DS4/ 2DS3S5, and KIR3DL2, the MAFFT alignment does not recapitulate human-curated KIR haplotypes or gene alleles. It is 75% larger than our motif-guided alignment (960,221 vs. 263,556 positions), and its overview shows the alignment columns are mostly gaps compared with the more expected block shaped column in Figure 7.

DISCUSSION
Although our evaluation of existing MSA software methods was not exhaustive, we believe it is unlikely that any current general-purpose alignment software can align all human KIR haplotype sequences consistent with the human curation. Alignment of the 68 sequences, each 67-269 kb, each homologous with itself and every other haplotype over lengths of 15 kb is very challenging without prior knowledge. Scoring matrices and algorithms like dynamic programming do not generally support sequences of this size under conditions where relatively young gene regions have duplicated both within and between haplotypes. To the best of our knowledge, a MSA of KIR haplotypes has never been published, despite the fact that dozens of haplotype sequences have been public for years. Figures 5 and 6 demonstrate that the capture probe alignment motifs alone can annotate KIR haplotype sequences in a way that recreates the human curated annotation in GenBank. Figures 7 and 8 similarly show the accuracy of the DNA alignment; they demonstrate that the DNA annotated by the motifs recreates the assignment of the alleles in IPD-KIR. Also, for two practical examples, the capture probes have demonstrated the ability to assemble haplotypes with an average of 97% coverage (preprint), and the DNA MSA was used to discover gene markers that can genotype whole genome  sequences (WGS) with almost perfect accuracy (11) (preprint). The combination of multi-resolution validation and software applications demonstrates that both the motif and DNA alignments are likely to be accurate. Figure 5 (alignment in Supplementary Data Sheet 1 suggest 14 human KIR loci and 1 intergenic locus. The genic loci are abbreviated and labeled as 3DL3, 2DS2, 2DL2L3, 2DL5B, 2DS3S5, 2DP1, 2DL1, 3DP1, 2DL4, 3DL1S1, 2DS3S4S5, 2DS1, 3DL2, and the intergenic locus is between 3DP1 and 2DL4 (3DP1-2DL4). There are nine motif-defined genes in the 14 genic loci ( Figure 5). Loci abbreviated 2DL1, 2DS1, and 2DS2 share motif MHCIJKL. Loci 2DL2L3, 2DS3S5, and 2DS3S4S5 share variations of the FZ*GHCIJKL motif. Loci 2DL5A and 2DL5B share variations of FIRL. The motif MSA suggests KIR2DS4 on the A haplotype shares a locus with KIR2DS3 and KIR2DS5 (not KIR2DS1) on the B haplotypes. The MSA similarly suggests KIR2DL2 and KIR2DL3 share a locus as do KIR3DL1 and KIR3DS1. The Z character is unique to centromeric 2DS3S5 and telomeric 2DS3S4S5 loci and marks the alleles to a certain extent. In both, the KIR2DS3 alleles have 1 Z in their motifs, KIR2DS5 has 2 or 3, and KIR2DS4 alleles are more variable.
Although the motifs correctly annotate the haplotype structures, they do not always do so unambiguously, nor do they always agree with the existing nomenclature as utilized in IPD-KIR. In the current gene nomenclature, there are 16 KIR genes. Some researchers have previously considered KIR2DL2 and KIR2DL3 to occupy the same locus as well as KIR2DS3 and KIR2DS5 (20,21). Conversely, KIR2DL5 was once considered one gene but was split into KIR2DL5A for alleles in the centromeric locus and KIR2DL5B for alleles in the telomeric locus, creating new "A" and "B" designations that are independent from the "A" and "B" haplotype classifications (22). The motif MSA suggests the KIR region has 14 loci in 9 gene motifs. Under the existing gene nomenclature, KIR3DL1 and KIR3DL2 differ only by an index, since they are both three domain ("3D") long tail ("L") genes; since the extracellular domains of KIR3DL1/KIR3DL2 fusion alleles are from KIR3DL1, those fusion alleles are labelled in IPD-KIR as KIR3DL1. They are considered KIR3DL2 alleles under the motif  A total of 27 new haplotypes were added to the human genome reference as part of this study (accessions MN167504-30), 24 of which are from African Americans. The haplotypes cluster by structure, not population. Haplotypes from Africans or African Americans now constitute 47% of the KIR alternate references in the human genome project, and the human genome project contains more than three times as many alternative references for KIR than any full chromosome. These new haplotypes contain the first deposits of unusual linkages such as KIR2DL5B*00804 in the KIR2DL5A locus (MN167506), KIR3DL1 or KIR3DL2 (fusion) without KIR2DS4 (tA02), KIR2DL2 without KIR2DS2 (cA03), and a KIR2DS2 (fusion) without KIR2DL2 (cB05). MN167526's KIR2DS2 allele has a one base deletion after 285th base in the CDS sequence, which leads to a premature stop code at position 435 in the CDS.
Scisco Genetics' contribution to the human genome project of the KIR haplotypes has-like the human genome project itselfbeen an important advancement and will support the downstream discovery of the human immune system. Building on that work, the motifs and alignments presented here provide a means to help unify interpretation of the entire KIR region. They can be used to precisely define KIR haplotypes and loci, provide context for assigning alleles (especially fusion alleles) to genes, improve evolutionary inferences, improve imputation, interpret co-expression, and generate markers. The motif probes have been applied to a workflow to capture, assemble, and annotate KIR haplotypes at https://github.com/ droeatumn/kass; it includes the ability to annotate KIR contigs/haplotypes as a separate workflow. The DNA alignments have been applied to discover markers used in a workflow to genotype KIR presence/absence from WGS at https://github.com/ droeatumn/kpi.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by National Marrow Donor Program. The patients/ participants provided their written informed consent to