Race/ethnicity-stratified fine-mapping of the MHC locus reveals genetic variants associated with late-onset asthma

Introduction: Asthma is a chronic disease of the airways that impairs normal breathing. The etiology of asthma is complex and involves multiple factors, including the environment and genetics, especially the distinct genetic architecture associated with ancestry. Compared to early-onset asthma, little is known about genetic predisposition to late-onset asthma. We investigated the race/ethnicity-specific relationship among genetic variants within the major histocompatibility complex (MHC) region and late-onset asthma in a North Carolina-based multiracial cohort of adults. Methods: We stratified all analyses by self-reported race (i.e., White and Black) and adjusted all regression models for age, sex, and ancestry. We conducted association tests within the MHC region and performed fine-mapping analyses conditioned on the race/ethnicity-specific lead variant using whole-genome sequencing (WGS) data. We applied computational methods to infer human leukocyte antigen (HLA) alleles and residues at amino acid positions. We replicated findings in the UK Biobank. Results: The lead signals, rs9265901 on the 5’ end of HLA-B, rs55888430 on HLA-DOB, and rs117953947 on HCG17, were significantly associated with late-onset asthma in all, White, and Black participants, respectively (OR = 1.73, 95%CI: 1.31 to 2.14, p = 3.62 × 10−5; OR = 3.05, 95%CI: 1.86 to 4.98, p = 8.85 × 10−6; OR = 19.5, 95%CI: 4.37 to 87.2, p = 9.97 × 10−5, respectively). For the HLA analysis, HLA-B*40:02 and HLA-DRB1*04:05, HLA-B*40:02, HLA-C*04:01, and HLA-DRB1*04:05, and HLA-DRB1*03:01 and HLA-DQB1 were significantly associated with late-onset asthma in all, White, and Black participants. Conclusion: Multiple genetic variants within the MHC region were significantly associated with late-onset asthma, and the associations were significantly different by race/ethnicity group.

We used SHAPEIT 4.2.1 to perform haplotype phasing across 29,062,806 variants for the subgroup of PEGS participants with whole-genome sequencing data (n = 4,737), considering only variants that were genotyped in both the reference panel and PEGS. We estimated genome-wide local ancestry on the autosomes and X chromosome using RFMix, a discriminative modeling approach (Maples et al., 2013). The reference population was 2,504 unrelated individuals from the 1000 Genomes Project, Phase 3, with sequencing to a targeted depth of 30X (Byrska-Bishop et al., 2022). Based on the 1000 Genomes data, we classified genomic positions by five continental super populations: AFR (African), AMR (Admixed American), EAS (East Asian), EUR (European), and SAS (South Asian). Granular population groupings can be found in the International Genome Sample Resource (IGSR) (Fairley et al., 2020). We classified individuals as one of the five continental ancestral groups if any genomewide local ancestral proportion was greater than 0.50.

MHC region-wide allelic association tests
After quality control of sequencing data for the major histocompatibility complex (MHC) region, 41,012, 39,867, and 47,433 SNPs remained for pooled, EUR, and AFR ancestry, respectively. To correct for multiple hypothesis testing, we adopted a similar approach as previous studies (Lee et al., 2020). We used the coda package in R to determine the effective number of tests by fitting an autoregression model to the summary statistics. The effective number of tests is often smaller than the empirical number of tests since it accounts for the local genetic correlation between SNPs (linkage disequilibHCG17 rium) used in association tests.

Fine-mapping: conditional analysis
Using WGS data for PEGS participants, we conducted fine-mapping in the MHC region at 6p22. 1-21.3 (chr6:28,510,120-33,480,577, assembly of GRCh38.p14) to identify candidate causal variants for late-onset asthma. From the results of the allelic association tests, we rank-ordered SNPs by their strength of association (p-values). We performed stepwise regression by adding one SNP at a time until the effect of the lead signal (i.e., the SNP with the smallest p-value from the allelic association tests) reached p >0.05. During regression modeling, we considered the local linkage disequilibrium structure by filtering out SNPs based on their pairwise correlation (r 2 >0.8). The final model included independent SNPs associated with late-onset asthma adjusted for other SNPs and covariates.

Functional annotation of genetic variants
For annotation, we used public data from ENCODE (ENCODE Project Consortium, 2012;Luo et al., 2020), HUGin (Martin et al., 2017), GTEx (Genotype-Tissue Expression (GTEx) Project, 2022), RegulomeDB (Boyle et al., 2012), Open Target Genetics (Ghoussaini et al., 2021;Mountjoy et al., 2021) to delineate the potential biological links between genetic variants and late-onset asthma in PEGS participants. We used quantitative measures of functional importance (CADD score and RegDB rank and score), DNase I hypersensitive site sequencing, histone modification ChIP-seq (H3K4me3, H3K27ac), transcription factor binding ChIP-seq (CTCF), long-range chromatin interactions (Hi-C), and expression quantitative trait loci (eQTL) data to investigate each SNP's potential genomic regulatory/functional role. Based on functional annotation, we prioritized SNPs with higher scores and supportive evidence as candidate causal variants contributing to late-onset asthma association signals.
Kourami is a recently developed enrichment-free computational method that uses WGS data to directly assemble full sequences of the peptide-binding domain (exons 2 and 3 for class I human leukocyte antigen (HLA) genes and exon 2 for class II HLA genes) (Lee and Kingsford, 2018). We first built a reference panel using multiple sequence alignments of known HLA alleles from the Immuno Polymorphism Database-International ImMunoGeneTics (IPD-IMGT)/HLA project database (Robinson et al., 2020). Based on the reference panel, we constructed an initial directed acyclic HLA graph, which we further modified by alignment projection (deletion or mismatch, insertion into a gap column, or insertion into a new column). Using this modified HLA graph, we identified the best paths with the maximum weights (number of reads) and most supportive phasing information to assemble pairs of candidate HLA alleles. We filtered the assembled HLA alleles to exclude ambiguous calls where two or more haplotypes were considered equally likely as well as calls supported by a "MaxFlow" parameter less than 10 and those with less than 95% identity with the called haplotype sequence.

Amino acid position inference: CookHLA and HATK
The CookHLA and HLA analysis toolkit (HATK) enrichment-free computational HLA imputation/inference methods are data-based matching techniques that align WGS reads to known alleles. We used 1000 Genomes Project reference panels (SNPs in the MHC region and HLA types). For pooled ancestry (i.e., all study participants) (n=2,504), we used the ALL reference panel. For European and African ancestries, we used the EUR and AFR reference panels, respectively. We first inferred four-digit classical HLA alleles (HLA-A, -B, -C, -DQB, and -DRB1) with CookHLA. CookHLA uses the latest hidden Markov model, which incorporates genetic distance as an input. CookHLA also captures local genetic information through repeated imputation for each exon in HLA genes by embedding a marker set in the middle position of an exon and adaptively learning the genetic map of the MHC region. In this way, CookHLA accounts for variability in highly polymorphic exons and data-or population-specific linkage disequilibrium (LD) structure in the MHC region. The bestmatched HLA alleles are generated as binary markers (presence/absence) based on consensus posterior probabilities from the repeated imputation.
We then applied HATK using the HLA allele information generated from CookHLA (the HLA PED equivalent to a plink PED file) to infer residues at each amino acid position. We built a dictionary for amino acid and DNA sequences for HLA genes based on the IMGT/HLA database and cleaned the HLA allele names according to the nomenclature defined by the World Health Organization Nomenclature Committee for Factors of the HLA System (http://hla.alleles.org/nomenclature/committee.html). We generated the best-matched residues at amino acid positions as binary markers (presence/absence) and then conducted logistic regression using this dosage data, adjusting for the same covariates as for pooled, EUR, and AFR ancestries.

Supplementary Figures
Supplementary Figure S1. Overview of the study analyses/methodologies. Hispanic, and (f) participants self-identifying as other races/ethnicities. Figure S3. LocusFocus plots of colocalization analysis results. The lead SNP for all participants (rs9265901) was not associated with the expression of HLA-B in lung tissue or whole blood. However, an SNP that was highly correlated with the lead SNP (rs9265892) was associated with the expression of HLA-B in whole blood. The lead SNP (rs55888430) for White participants was not associated with the expression of HLA-DOB. An LD SNP for the lead SNP (rs73410789) was associated with HLA-DOB expression in whole blood. The lead SNP for Black participants, rs117953947, was not associated with the expression of HCG17 in lung tissue. Figure S4. Effects of individual residues at amino acid positions in HLA genes for pooled, European, and African ancestries. The color scheme represents the levels of effect and significance of the association between each amino acid residue at a given position for an HLA allele and late-onset asthma, adjusted for covariates.
Supplementary Table S2. Results of fine-mapping of the MHC region harboring race/ethnicityspecific signals using sequenced data. We evaluated the joint effect of multiple variants on late-onset asthma using stepwise regression models. We conducted conditional analyses on race/ethnicityspecific lead SNPs to identify potential causal SNPs in the MHC region for all, White, and Black participants.  rs9265901, rs55888430, rs117953947) and late-onset asthma, adjusted for covariates and additional SNPs, in forward stepwise regression analyses. We adjusted all regression models for age, sex, and four ancestries. Table S3. Functional annotation of variants identified from fine-mapping analyses in the MHC region (chr6: 28,510,480,577). We gathered information on each variant's potential genomic regulatory role from publicly available data, including quantitative measures of functional importance (CADD score and RegDB rank and score), indication of functional genomic elements (ENCODE), and long-range chromatic interactions (Hi-C data from HUGIn) for all, White/Non-Hispanic, and Black/Non-Hispanic participants.