METHODS article

Front. Genet., 16 January 2024

Sec. Statistical Genetics and Methodology

Volume 14 - 2023 | https://doi.org/10.3389/fgene.2023.1295327

Sister haplotypes and recombination disequilibrium: a new approach to identify associations of haplotypes with complex diseases

  • 1. Institute of Gerontology, Center for Genetics, Sichuan Academy & Sichuan Provincial People's Hospital, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

  • 2. Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA, United States

Abstract

Haplotype-based association analysis has several advantages over single-SNP association analysis. However, haplotype-disease association studies to date have not excluded recombination interference among multiple loci, and hence some results might be confounded by recombination interference. Here we present a test for association of sister haplotypes with a complex disease that is based on recombination disequilibrium (RD). Sister haplotypes can be determined by translating the notation of DNA base haplotypes into the notation of genetic genotypes. Sister haplotypes provide the haplotype pairs needed for haplotype-disease association analysis. After RD tests are performed in the control and case cohorts, a two-by-two contingency table can be constructed from a sister haplotype pair and the case-control pair. With this standard two-by-two table, one can perform a classical Chi-square test for statistical haplotype-disease association. Applying this method to a haplotype dataset of Alzheimer disease (AD), we identified an association of sister haplotypes containing ApoE3/4 with risk for AD under no RD. Haplotypes within gene IL-13 were not associated with risk for breast cancer in the case of no RD, and no association of haplotypes in gene IL-17A with risk for coronary artery disease was detected without RD. The previously reported associations of haplotypes within these genes with risk for these diseases might be due to strong RD and/or inappropriate haplotype pairs.

Introduction

High-throughput sequencing technologies enable us to easily genotype dozens of single nucleotide polymorphisms (SNPs) within any gene of interest. Such genome-wide SNP data are rapidly growing in disease association studies (Neale and Sham, 2004; Cheng et al., 2005). Association analysis includes single-SNP (Cordell and Clayton, 2002) and haplotype-based disease associations (Zhao et al., 2003a; Zhao et al., 2003b; Clark, 2004; Niu, 2004). Haplotype-based association analysis has several advantages over single-SNP association analysis (Clark, 2004; Yang et al., 2008). The theoretical argument is that haplotype-based tests should be more powerful because single-marker linkage-disequilibrium (LD)-based methods may not capture all of the available LD information, which is contained in multi-locus haplotypes (Akey et al., 2001; Schaid, 2006; Wen and Tsai, 2014). Consequently, many haplotype-disease association studies have been reported in recent years. However, to date, studies of associations between haplotypes and complex diseases have not excluded recombination interference among the multiple loci within haplotypes, and hence some results might be confounded by recombination interference. In addition, although many methods (Akey et al., 2001; Sham et al., 2004; Allen and Satten, 2005, 2007, 2009; Fardo et al., 2011; Wen and Tsai, 2014) can be used to test for haplotype-based association, inappropriate haplotype pairs have been widely used and might lead to spurious haplotype-disease associations. To exclude confounding by recombination interference in haplotype-disease association studies, we here introduce recombination disequilibrium (RD) (Tan, 2020).
Following the definitions of Hardy-Weinberg disequilibrium (HWD) at one locus and linkage disequilibrium (LD) between two loci (Robbins, 1918; Geiringer, 1944; Lewontin and Kojima, 1960; Lewontin, 1964; Hill and Robertson, 1968), recombination disequilibrium (RD) is defined among three or more loci (Tan, 2020). Although LD has been widely used in haplotype-disease association studies, LD among multiple loci becomes very complicated and is poorly understood because of recombination interference. Hastings (1984) indicated that commonly used measures of linkage disequilibrium are not appropriate for a multilocus system. Thomson and Baur (1984) also showed by example that combinations of allele frequencies and pairwise linkage disequilibrium terms that are permissible at the two-locus level may not be permissible at the three-locus level. LD between two loci is not important for haplotype association, whereas recombination interference is a key factor in haplotype analysis because it determines the frequencies of haplotypes (gametes) in populations. For example, double crossover types are less frequent under positive interference than under independence. The intensity of interference depends on the distance between two adjacent intervals. In classical genetics, the coefficient of coincidence is used to measure crossover interference because only positive interference had been discovered. With advances in molecular genetics, in particular the broad application of genotyping at molecular markers such as SNPs, negative interference has been observed in all species. Likewise, negative interference becomes stronger as the distance between adjacent intervals becomes shorter. The coefficient of coincidence is not suitable for describing negative interference because it is strongly asymmetric in the positive and negative directions. This asymmetry makes it difficult to test statistically for positive or negative interference.
However, RD can easily measure both positive and negative interference and can easily be tested with a Chi-square test (Tan, 2020). In single-locus disease association, a Hardy-Weinberg equilibrium (HWE) test is required because locus-disease associations can be regarded as true only when gene and genotype frequencies follow HWE. In genome-wide studies (GWS), linkage disequilibrium (LD) can produce false locus-disease associations because linkage of non-risk loci to a disease gene alters genotype frequencies. Frequencies of haplotypes in recombination disequilibrium contain linkage or recombination interference effects and hence can generate false haplotype-disease associations. Therefore, an RD test is required in haplotype-based association analysis of diseases. In addition, the choice of haplotype pairs strongly affects tests of association between haplotypes and diseases because a correct factor pair is a necessary condition for testing association between two factors. In this paper, we offer a new approach to studying haplotype-disease association. The new approach is based on RD and sister haplotypes. We used four public haplotype-based case-control datasets to show the power and robustness of this method.

Materials and methods

Data collection

In the current study, we collected four public haplotype datasets: 1) An SNP haplotype dataset of Alzheimer disease (AD) consisting of 210 cases and 159 non-demented elderly controls, taken from Fallin et al. (2001). These haplotype data comprise 8 SNPs (C19M1∼C19M8) in a 205 kbp region that contains the ApoE gene on chromosome 19 and were reported as two configures: M1M3M4*M6 forms configure 1 and M1M2M5M6 forms configure 2, where M4* is C19M4, which is part of ApoE-ε4, a gene that increases risk for AD. 2) Breast cancer haplotype data derived from 560 cases and 354 controls (Faghih et al., 2009), composed of 8 haplotypes containing three variants (−1512 A/C, −1055 C/T and 2044 G/A) in gene IL-13. 3) Haplotypes in the interleukin-17A (IL-17A) gene composed of four SNPs (rs8193036, rs3819024, rs2275913 and rs8193037), genotyped in 900 premature coronary artery disease (CAD) patients and 935 healthy persons (Vargas-Alarcon et al., 2015). 4) A COMT haplotype dataset published by Peterson et al. (2010). This dataset has 15 haplotypes consisting of 6 SNPs in the catechol-O-methyl transferase (COMT) gene: SNP1 (rs1544325), SNP2 (rs174674), SNP3 (rs7290221), SNP4 (rs2239393), SNP5 (rs4680) in exon 4 and SNP6 (rs46462316).

Haplotype data quality

Recently, many large-scale GWAS analyses have been carried out in samples of several thousand patients and normal individuals. Large SNP datasets make it possible to conduct large-scale haplotype association analysis of diseases. One can use existing haplotype estimation methods and software packages to create haplotype data from the SNP data. Before our method is applied to haplotype-disease association analysis, however, the haplotype data must be checked in the following respects: 1) since our method is based on biallelic haplotypes, SNPs with more than two alleles must be removed from the haplotypes; 2) data with fewer than 7 types of haplotypes cannot be used for the RD test; 3) haplotypes consisting of more than 3 SNPs should be dissected into three-SNP haplotypes.
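The three checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`is_biallelic`, `dissect`, `usable_for_rd`) are hypothetical and are not part of the SHAD package, and the four-SNP frequencies are made up for the example rather than taken from any dataset in this study.

```python
from collections import defaultdict
from itertools import combinations


def is_biallelic(hap_freqs):
    """Check 1: every site carries at most two distinct alleles."""
    m = len(next(iter(hap_freqs)))
    return all(len({hap[i] for hap in hap_freqs}) <= 2 for i in range(m))


def dissect(hap_freqs, k=3):
    """Check 3: collapse m-SNP haplotype frequencies onto every k-SNP subset of sites."""
    m = len(next(iter(hap_freqs)))
    result = {}
    for sites in combinations(range(m), k):
        collapsed = defaultdict(float)
        for hap, freq in hap_freqs.items():
            # Haplotypes that agree on the chosen sites are merged.
            collapsed["".join(hap[i] for i in sites)] += freq
        result[sites] = dict(collapsed)
    return result


def usable_for_rd(sub_freqs, min_types=7):
    """Check 2: at least 7 of the 8 possible three-SNP haplotypes are present."""
    return len(sub_freqs) >= min_types


# Made-up four-SNP frequencies, for illustration only (not data from the study).
toy = {"CTCA": 0.4, "TCTG": 0.3, "CTTA": 0.2, "TCCG": 0.1}
subsets = dissect(toy)
print(is_biallelic(toy), len(subsets))
for sites, freqs in subsets.items():
    print(sites, freqs, usable_for_rd(freqs))
```

With four SNPs there are four three-SNP combinations, and with six SNPs there are twenty, which is the number quoted for the COMT data in the Results.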

Construction of sister haplotypes

An important step in finding an association of haplotypes with a complex disease is to construct sister haplotypes. Since haplotypes consist of the four base types in a DNA sequence, unlike gametes in classical genetics, it is difficult to determine which haplotypes should be paired as sister haplotypes. To construct sister haplotypes, one first translates the notation of DNA base haplotypes into the notation of classical genetic genotypes. To do so, we use three pairs of capital and lowercase letters, for example, Aa at site 1, Bb at site 2 and Cc at site 3. At each site, the capital letter is assigned to one allele and the lowercase letter to the other allele. For example, in Table 1, M1M4*M6 has alleles C and T at sites 1 and 2, and alleles A and G at site 3. For ease of understanding, the best assignment is to give the capital letters to the alleles of the parental haplotype and the lowercase letters to the mutant alleles. The parental type is the haplotype with the largest frequency. In our current example, the parental haplotype is TTA, so we set T = T and t = C at site 1, B = T and b = C at site 2 and A = A and a = G at site 3 (Table 1 writes the same gametes with the generic letters A/a, B/b and C/c, so that TTA is ABC). Thus, we can translate the 8 DNA haplotypes into 8 gamete genotypes and determine the sister gametes, which are the two gametes that differ at all three sites (for example, ABC and abc).
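The translation step can be illustrated with a minimal Python sketch. The overall frequencies are those of the M1M4*M6 block of Table 1; the helper names `translate` and `sister` are hypothetical, and the sketch assumes biallelic sites, so any base that differs from the parental allele receives the lowercase letter.

```python
def translate(haplotypes, parental, letters="ABC"):
    """Map DNA haplotypes to gamete notation: parental alleles get capital letters."""
    out = {}
    for hap in haplotypes:
        out[hap] = "".join(
            letter if base == ref else letter.lower()
            for base, ref, letter in zip(hap, parental, letters)
        )
    return out


def sister(gamete):
    """The sister gamete carries the other allele at every site."""
    return gamete.swapcase()


# Overall frequencies of the M1M4*M6 haplotypes (Table 1).
freq = {"TTA": 0.224, "TTG": 0.155, "TCA": 0.016, "TCG": 0.090,
        "CTA": 0.202, "CTG": 0.190, "CCA": 0.044, "CCG": 0.081}

parental = max(freq, key=freq.get)          # haplotype with the largest frequency
gametes = translate(freq, parental)
pairs = sorted({tuple(sorted((g, sister(g)))) for g in gametes.values()})

print(parental)
print(gametes)
print(pairs)
```

Three-SNP haplotypes always give four sister pairs, which is why four two-by-two tables are built in the next step.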

TABLE 1

M1M3M4*
Hap   Gamete   Freq   Overall   Case    Control
TCC   aBc      p4′    0.009     0.013   0
CCC   ABc      p2     0.015     0.023   0
TTC   abc      p1′    0.097     0.11    0.063
CTC   Abc      p3     0.11      0.141   0.042
TCT   aBC      p3′    0.362     0.285   0.417
CCT   ABC      p1     0.378     0.288   0.447
TTT   abC      p2′    0.017     0.124   0.017
CTT   AbC      p4     0.014     0.014   0.014

M1M4*M6
Hap   Gamete   Freq   Overall   Case    Control
CCA   abC      p2′    0.044     0.063   0
CCG   abc      p1′    0.081     0.101   0.042
CTA   aBC      p3     0.202     0.173   0.221
CTG   aBc      p4′    0.19      0.129   0.24
TCA   AbC      p4     0.016     0.019   0.007
TCG   Abc      p3′    0.09      0.105   0.056
TTA   ABC      p1     0.224     0.175   0.258
TTG   ABc      p2     0.155     0.235   0.176

M3M4*M6
Hap   Gamete   Freq   Overall   Case    Control
CCA   AbC      p4     0.02      0.03    0
CCG   Abc      p3′    0.004     0.006   0
CTA   ABC      p1     0.422     0.343   0.477
CTG   ABc      p2     0.318     0.23    0.387
TCA   abC      p2′    0.04      0.052   0.007
TCG   abc      p1′    0.167     0.2     0.098
TTA   aBC      p3     0.004     0.005   0.002
TTG   aBc      p4′    0.027     0.133   0.029

M1M3M6
Hap   Gamete   Freq   Overall   Case    Control
CCA   aBC      p3     0.211     0.211   0.219
CCG   aBc      p4′    0.182     0.139   0.228
CTA   abC      p2′    0.035     0.055   0.002
CTG   abc      p1′    0.089     0.12    0.054
TCA   ABC      p1     0.231     0.209   0.258
TCG   ABc      p2     0.14      0.127   0.159
TTA   AbC      p4     0.009     0.009   0.007
TTG   Abc      p3′    0.105     0.255   0.073

Data of haplotypes consisting of three SNPs derived from four-SNP haplotypes in configure 1 (M1M3M4*M6) where M4* is C19M4 that is part of ApoE-ε4.

Construction of two-by-two contingency tables

After sister haplotypes have been constructed with the above method, two-by-two contingency tables are constructed. As an example, two-by-two contingency tables (Table 2) with sister haplotypes in rows and case-control status for AD in columns were made using the data in Table 1.

TABLE 2

ControlCaseControlCaseControlCaseControlCase
TBA364tbA250tBA323TbA347
tba1351TBa747Tab1451tBa307

Four two-by-two tables made by using sister haplotypes and case-control.

a Haplotype data from Fallin et al. (2001).
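As an illustration of how such tables can be derived, the sketch below rebuilds the four two-by-two tables for M1M4*M6 from the case and control frequencies in Table 1. It rests on an assumption that is not stated in the text: each count is taken as the haplotype frequency multiplied by the number of individuals in the cohort (159 controls, 210 cases) and rounded to the nearest integer. This choice appears consistent with the odds ratios reported later in Table 4, but the function name `two_by_two` and the rounding rule are ours.

```python
N_CONTROL, N_CASE = 159, 210  # cohort sizes from Fallin et al. (2001)

# gamete: (case frequency, control frequency), M1M4*M6 block of Table 1
freq = {
    "ABC": (0.175, 0.258), "abc": (0.101, 0.042),
    "ABc": (0.235, 0.176), "abC": (0.063, 0.000),
    "aBC": (0.173, 0.221), "Abc": (0.105, 0.056),
    "AbC": (0.019, 0.007), "aBc": (0.129, 0.240),
}


def two_by_two(first, second):
    """Rows: the two sister haplotypes; columns: (control count, case count)."""
    def row(gamete):
        case_f, control_f = freq[gamete]
        # Assumed conversion: frequency x cohort size, rounded to an integer.
        return (round(control_f * N_CONTROL), round(case_f * N_CASE))
    return [row(first), row(second)]


tables = {
    f"{g}/{g.swapcase()}": two_by_two(g, g.swapcase())
    for g in ("ABC", "ABc", "aBC", "AbC")
}
for pair, table in tables.items():
    print(pair, table)
```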

Chi-square test for association between haplotypes and diseases

A pair of sister haplotypes is similar to a pair of alleles at a locus; therefore, a two-by-two contingency table constructed from sister haplotypes and the case-control status of a disease is suitable for a Chi-square test of independence between two variables. Using these contingency tables, the null hypothesis that a pair of sister haplotypes is not associated with the disease of study can be tested with a Chi-square statistic with 1 degree of freedom. For haplotypes constructed from three SNPs, we have four pairs of sister haplotypes and hence four null hypotheses, each tested with a Chi-square test. To exclude false associations due to recombination interference, testing for RD in the haplotypes of the control and case cohorts is required. The method of testing for RD can be found in Tan (2020). RD is recombination disequilibrium among multiple loci. Like linkage disequilibrium (LD), strong RD also results in spurious findings in haplotype-disease associations because strong RD would significantly change the frequencies of haplotypes: $RD = P_1 P_4 - P_2 P_3$, where $P_1 = p_1 + p_1'$, $P_2 = p_2 + p_2'$, $P_3 = p_3 + p_3'$, and $P_4 = p_4 + p_4'$. $P_1$ is the frequency of parental types, $P_4$ is the frequency of double crossovers, and $P_2$ and $P_3$ are the frequencies of the two single crossovers. $RD$ reflects the difference between the frequencies of double crossovers and single crossovers. The frequency of double crossovers measures the linkage intensity of three loci on a chromosome. Strong positive or negative interference would significantly change the frequency of double crossovers. From $RD$, we can infer whether the loci in the haplotypes are strongly linked. Therefore, a test for $RD$ can exclude spurious association between haplotypes and disease due to linkage. A diagram of the construction of sister haplotype pairs, the conversion of haplotypes to genotypes of three loci, the RD test, and the Chi-square test for association between sister haplotype pairs and a disease of study, including a practical example, is given in the Supplementary Material.
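A small numerical check may help here. The RD values in Tables 3 and 6 are numerically consistent with the expression RD = P1·P4 − P2·P3, where each Pi pools the frequencies of one sister pair (Pi = pi + pi′). The Python sketch below evaluates that expression for the control and case columns of Table 6. The function name `rd_coefficient` is ours, and the Chi-square test of RD itself (Tan, 2020) is deliberately not reproduced.

```python
def rd_coefficient(P1, P2, P3, P4):
    """RD as inferred from Tables 3 and 6: parental x double crossover
    minus the product of the two single-crossover classes."""
    return P1 * P4 - P2 * P3


# Pooled sister-pair frequencies (P1, P2, P3, P4) from Table 6 (IL-13).
table6 = {
    "control": (0.672464, 0.278261, 0.031884, 0.043478),
    "case": (0.667857, 0.214286, 0.053571, 0.064286),
}

for cohort, classes in table6.items():
    print(cohort, round(rd_coefficient(*classes), 6))
```

The values agree with the RD row of Table 6 for the control and case cohorts to the printed precision.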

R package SHAD

The R package SHAD (sister haplotype-based association of disease) was designed to implement RD tests and association analysis of haplotypes with disease in case and control populations. The SHAD package works in the R environment and has two functions for haplotype association analysis: one is applied to three-SNP haplotypes and the other to m-SNP haplotypes, where m > 3. Function hapAnalysis is used to analyze the association of three-SNP haplotypes with disease. Three-SNP haplotypes have four pairs of sister haplotypes. The function outputs RD with its Chi-square statistic and p-value, and the odds ratio (OR) with its Chi-square statistic and p-values for the case-control comparison. Function hapADA is used to dissect m-SNP haplotypes into n combinations of three-SNP haplotypes and to perform association analysis of sister haplotype pairs with disease in all combinations. The SHAD package is available on request.

Results

In natural populations, sister gametes may have different frequencies because of mutation, deletion, gene conversion and selection. This disequilibrium between sister gametes allows us to develop a statistical approach to test for association of sister gametes with a complex disease of study. Under the null RD, if the difference between sister gametes in a patient (case) population differs significantly from that in the healthy (control) population, then the sister-gamete disequilibrium is associated with the disease. Current SNP data provide broad opportunities to study haplotype-disease association. Fallin et al. (2001) reported an SNP haplotype dataset of 210 Alzheimer disease (AD) cases and 159 non-demented elderly controls. They used an EM algorithm to estimate the frequencies of haplotypes consisting of 8 SNPs (C19M1∼C19M8) in a 205 kbp region that contains the ApoE gene on chromosome 19. Since they reported haplotype data only for configures 1 and 2 (configure 1: M1M3M4*M6 and configure 2: M1M2M5M6), where M4* is C19M4, which is part of ApoE-ε4, a gene found to increase risk for AD (Corder et al., 1993; Saunders et al., 1993; Strittmatter et al., 1993; Farrer et al., 1997), we did not consider the other configures here. We used the haplotype data of these two configures to test for RD among SNPs and for associations between haplotypes and risk for AD. We constructed four combinations of three-locus haplotypes from configure 1 by collapsing identical haplotypes and generated three-locus haplotype data (Table 1). According to Fallin et al. (2001), SNPs C19M1, C19M2, C19M5, and C19M6 followed HWE. No LD occurred between C19M1 and C19M4, between C19M1 and C19M5, between C19M1 and C19M6, between C19M2 and C19M3, between C19M2 and C19M5, and between C19M2 and C19M6, but LD existed between C19M4 and C19M6, between C19M3 and C19M4, between C19M3 and C19M5 and between C19M3 and C19M6.
The loci C19M1 and C19M8 flank a physical interval of 205 kbp on chromosome 19. Our RD analysis shows that there is no RD among loci C19M1, C19M4, and C19M6 or among loci C19M1, C19M3, and C19M6 in the case, control, and overall populations, whereas loci C19M3, C19M4, and C19M6 had significant RD in all three populations (p = 0.0014 in the overall population, p = 8.8E-06 in the case population and p = 0.044 in the control population, Table 3), which is consistent with the significant LD between them reported by Fallin et al. (2001). In the M1M3M4* haplotype combination, we detected RD only in the case population (p = 0.0076, Table 3). This may be attributed to strong linkage between C19M3 and C19M4. From the two-by-two data, we calculated odds ratios and their Chi-square statistics (Table 4). In the three-SNP haplotype combination M1M3M4*, sister haplotypes CCT and TTC (ABC and abc) and sister haplotypes CTC and TCT (Abc and aBC) were associated with risk for AD (p < 0.05). In the three-SNP haplotype combination M1M4*M6, sister haplotypes TTA and CCG (ABC and abc) and sister haplotypes TTG and CCA (ABc and abC) were found to be associated with risk for AD (p < 0.05). Both of these three-SNP combinations contain the AD risk factor ApoE-ε4, and M1M4*M6 showed no recombination interference among its three loci. Although the three-SNP haplotype combination M1M3M6 does not contain the AD risk factor ApoE-ε4 (M4), its sister haplotypes TCA and CTG (ABC and abc) were also associated with risk for AD (p < 0.05) without RD confounding. Sister haplotypes TCG and CTA (ABc and abC) and sister haplotypes CCA and TTG (aBC and Abc) were very significantly associated with risk for AD (p < 0.01). This result indicates that M3 is also a risk factor for AD (called ApoE-ε3) because in configure 2 (M1M2M5M6), which lacks M3 and M4, none of the sister haplotype pairs was significantly associated with risk for AD and there was no RD among triplet SNPs in any of the four haplotype combinations (Supplementary Tables S1–S3).
As M3, M4 and M6 are tightly linked, the associations of sister haplotypes CTA and TCG (ABC and abc) and sister haplotypes CTG and TCA (ABc and abC) with risk for AD in the three-SNP haplotype combination M3M4*M6 (p < 0.01) were confounded by RD.

TABLE 3

M1M3M4*
          Overall   Case      Control
P1        0.475     0.398     0.51
P2        0.032     0.148     0.017
P3        0.472     0.427     0.459
P4        0.023     0.028     0.014
RD        −0.0042   −0.052    −0.0007
χ²        0.237     7.135     0.0048
p-value   0.626     0.0076    0.944

M1M4*M6
          Overall   Case      Control
P1        0.305     0.276     0.3
P2        0.199     0.297     0.176
P3        0.292     0.278     0.277
P4        0.206     0.146     0.247
RD        0.0047    −0.042    0.0253
χ²        0.041     1.959     0.461
p-value   0.839     0.162     0.497

M3M4*M6
          Overall   Case      Control
P1        0.589     0.5431    0.575
P2        0.358     0.2818    0.394
P3        0.008     0.0116    0.002
P4        0.047     0.1636    0.029
RD        0.0248    0.086     0.0159
χ²        10.247    19.72     4.042
p-value   0.0014    8.8E-06   0.044

M1M3M6
          Overall   Case      Control
P1        0.32      0.329     0.312
P2        0.175     0.182     0.161
P3        0.316     0.466     0.292
P4        0.191     0.148     0.235
RD        0.0058    −0.036    0.0263
χ²        0.067     1.488     0.527
p-value   0.795     0.222     0.467

RD and chi-square testing RD among three SNPs in four haplotypes (M1M3M4*M6) where M4* is C19M4 that is part of ApoE-ε4.
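The p-values in Table 3 follow from the Chi-square statistics with 1 degree of freedom. As a minimal sketch, the upper tail of a Chi-square distribution with 1 degree of freedom can be written with the complementary error function from the Python standard library, since a Chi-square variable with 1 degree of freedom is the square of a standard normal variable. The function name `chi2_pvalue_df1` is ours.

```python
import math


def chi2_pvalue_df1(x):
    """Upper-tail probability of a Chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square statistics of the RD test for M3M4*M6 (Table 3).
for label, stat in [("overall", 10.247), ("case", 19.72), ("control", 4.042)]:
    print(label, stat, chi2_pvalue_df1(stat))
```

The results are close to the tabulated p-values (0.0014, 8.8E-06 and 0.044); small differences are expected because the statistics in the table are rounded.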

TABLE 4

M1M3M4*
Sister gametes   OR       Z-value   p-value   χ²       p-value
ABC/abc          0.3674   2.399     0.0165    5.1034   0.0239
ABc/abC          NA       NA        NA        0        1
Abc/aBC          4.7143   3.4       0.0007    11.633   0.0006
AbC/aBc          NA       NA        NA        0.1778   0.6733

M1M4*M6
Sister gametes   OR       Z-value   p-value   χ²       p-value
ABC/abc          0.3008   2.442     0.0146    5.2545   0.0218
ABc/abC          0        NA        NA        5.2704   0.0216
Abc/aBC          0.4208   1.876     0.0607    2.8333   0.0923
AbC/aBc          5.629    1.508     0.1316    1.443    0.2297

M3M4*M6
Sister gametes   OR       Z-value   p-value   χ²       p-value
ABC/abc          0.3609   3.027     0.0025    8.5851   0.0034
ABc/abC          0.0704   2.499     0.0125    8.164    0.0043
Abc/aBC          NA       NA        NA        NA       NA
AbC/aBc          NA       NA        NA        0.1277   0.7208

M1M3M6
Sister gametes   OR       Z-value   p-value   χ²       p-value
ABC/abc          0.3863   2.136     0.0327    3.8709   0.0491
ABc/abC          0        NA        NA        7.5555   0.0059
Abc/aBC          0.2794   3.259     0.0011    10.04    0.0015
AbC/aBc          2.4828   0.728     0.4669    0.0246   0.8753

Chi-square test of associations between sister haplotypes and Alzheimer disease.

M4* is C19M4 that is part of ApoE-ε4.
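To make the quantities in Table 4 concrete, the sketch below recomputes one row (M1M4*M6, ABC/abc). It relies on assumptions that the text does not state: counts are frequency × cohort size rounded to integers (control ABC 41, case ABC 37, control abc 7, case abc 21), the Z-value is the log odds ratio divided by its standard error, and the Chi-square statistic uses Yates' continuity correction. Under these assumptions the output is close to the tabulated OR = 0.3008, Z = 2.442 and χ² = 5.2545. The function name `analyse` is ours, and the sketch requires all four cells to be non-zero.

```python
import math


def analyse(table):
    """table = [(control, case) for first sister, (control, case) for second sister]."""
    (a, b), (c, d) = table
    odds_ratio = (b * c) / (a * d)                  # odds of case for first vs second sister
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = abs(math.log(odds_ratio)) / se_log_or
    p_z = math.erfc(z / math.sqrt(2.0))             # two-sided normal p-value
    n = a + b + c + d
    # Pearson Chi-square with Yates' continuity correction (assumed, see lead-in).
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_chi2 = math.erfc(math.sqrt(chi2 / 2.0))       # 1 degree of freedom
    return odds_ratio, z, p_z, chi2, p_chi2


# M1M4*M6, sister pair ABC/abc: counts derived from Table 1 as described above.
odds_ratio, z, p_z, chi2, p_chi2 = analyse([(41, 37), (7, 21)])
print(round(odds_ratio, 4), round(z, 3), round(chi2, 4))
print(p_z, p_chi2)
```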

Another haplotype dataset, published by Faghih et al. (2009), provides an opposite example. Using a differential analysis method, Faghih et al. (2009) found that two haplotypes (ACA and CCA) of three variants in gene IL-13 were significantly associated with risk for breast cancer. Using our method, we obtained four pairs of sister haplotypes and their frequencies in the case and control populations (Table 5). As we predicted, our RD analysis showed that RD > 0.02 was extremely significant (p = 2.81e-06, 5.12e-05, and 1.53e-07 in the overall, control, and case populations, respectively, Table 6). These three variants lie in a very short interval of about 3.5 kbp (457 bp + 3099 bp), so that extremely strong negative recombination interference occurred. Interestingly, none of the sister haplotype pairs was associated with risk for breast cancer (Table 7). The significant differences in the frequencies of haplotypes ACA and CCA between the case and control groups in Faghih et al. (2009) were therefore due to RD and/or the use of inappropriate haplotype pairs. We did not find any other reports that variants in gene IL-13 are associated with risk for breast cancer. A similar example can be found in the report by Vargas-Alarcon et al. (2015) of an association of haplotypes in the interleukin-17A gene with risk for premature coronary artery disease (CAD). Four SNPs (rs8193036, rs3819024, rs2275913 and rs8193037) in gene IL-17A were genotyped in 900 premature CAD patients and 935 healthy persons. Vargas-Alarcon et al. (2015) performed haplotype-based association analysis of premature CAD using pairs of an individual haplotype and the common haplotype (called individual-common haplotype pairs). The common haplotype is TAGG. They found that TAGA was associated with risk for CAD at a significance level of p < 0.05. However, TAGA differs from the common haplotype TAGG at only one locus.
This association, which is equivalent to an SNP-disease association, conflicts with the fact that none of the SNPs within gene IL-17A was associated with CAD. Our haplotype analysis indicates that these four SNPs should form 16 haplotypes, of which only 10 were observed with hapview; hence only rs8193036, rs3819024, and rs2275913 are valid for constructing 8 haplotypes (see Supplementary Material). The RD test shows that in the premature CAD and control populations very strong negative recombination interference occurred among these three SNP loci within gene IL-17A (Supplementary Material). The RD results (0.0199 in the case population with p = 7.90E-06 and 0.0274 in the control population with p = 1.59E-09) agree well with the fact that these SNPs lie in a very short region within gene IL-17A, as indicated by high LD values (r2 > 0.9 and D′ > 0.8). As in gene IL-13, none of the sister haplotype pairs was associated with risk for CAD (Supplementary Material). This result is consistent with the finding that none of the SNPs was associated with risk for CAD (Vargas-Alarcon et al., 2015).

TABLE 5

Haplotypes   Patient n = 560   Normal n = 354   Sister gametes^b   Frequency
ACA          78 (14%)          69 (20%)         ABg                p2
ATA          15 (3%)           4 (1%)           Abg                p3'
ACG          302 (54%)         182 (50%)        ABG                p1
ATG          29 (5%)           15 (4%)          AbG                p4
CCA          7 (1%)            0 (0%)           aBg                p4'
CTA          72 (13%)          50 (15%)         abg                p1'
CCG          15 (3%)           7 (2%)           aBG                p3
CTG          42 (7%)           27 (8%)          abG                p2'

Eight kinds of haplotypes consisting of three SNPs in IL-13 and their distribution in patient and normal populations^a.

a Haplotype data from Faghih et al. (2009).

b site 1 (locus: -1512 A/C): A = A, C = a; site 2 (locus: -1055 C/T): C = B, T = b; site 3 (locus: -2044 G/A): A = g and G = G.

TABLE 6

              Overall    Control    Case
P1 = p1+p1′   0.67016    0.672464   0.667857
P2 = p2+p2′   0.246273   0.278261   0.214286
P3 = p3+p3′   0.042728   0.031884   0.053571
P4 = p4+p4′   0.053882   0.043478   0.064286
RD            0.02591    0.020365   0.031454
χ²            21.93546   16.40147   27.54278
p-value       2.81e-06   5.12e-05   1.53e-07

RD test for recombination interference among the three loci in gene IL-13.

TABLE 7

Sister gametes   Odds ratio   χ²       p-value
ABG/abg          1.15232      2.0896   0.1483
ABg/abG          0.726708     0.8649   0.3524
aBG/Abg          0.571428     0.5414   0.4618
AbG/aBg          -            1.9380   0.1639

Results for association between sister gametes and risk for breast cancer.
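The odds ratios in Table 7 can be traced back to the counts in Table 5. The sketch below pairs each haplotype with its sister and computes the cross-product ratio of patient and normal counts. It is an illustration only: the helper name `odds_ratio` is ours, and returning `None` whenever a cell is zero is our assumption for the pair that Table 7 lists without an odds ratio.

```python
# (patient count, normal count) for each IL-13 haplotype, from Table 5.
counts = {
    "ACG": (302, 182), "CTA": (72, 50),   # ABG / abg
    "ACA": (78, 69),   "CTG": (42, 27),   # ABg / abG
    "CCG": (15, 7),    "ATA": (15, 4),    # aBG / Abg
    "ATG": (29, 15),   "CCA": (7, 0),     # AbG / aBg
}
sister_pairs = [("ACG", "CTA"), ("ACA", "CTG"), ("CCG", "ATA"), ("ATG", "CCA")]


def odds_ratio(first, second):
    """Cross-product ratio for a sister pair; None if any cell is zero (assumed rule)."""
    case1, control1 = counts[first]
    case2, control2 = counts[second]
    if 0 in (case1, control1, case2, control2):
        return None
    return (case1 * control2) / (control1 * case2)


for first, second in sister_pairs:
    print(first, second, odds_ratio(first, second))
```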

To further demonstrate that our method is broadly useful, we constructed an R package, SHAD (Supplementary Package and Material), and applied it to a COMT haplotype dataset published by Peterson et al. (2010). This dataset has 15 haplotypes consisting of 6 SNPs in the catechol-O-methyl transferase (COMT) gene. Gene COMT has 6 exons and 5 introns (McGregor, 2014). SNP1 (rs1544325), SNP2 (rs174674) and SNP3 (rs7290221) are located in intron 1, and the intervals between SNPs 1 and 2 and between SNPs 2 and 3 are 2357 bp and 12447 bp, respectively. SNP4 (rs2239393) is located in intron 3, SNP5 (rs4680) in exon 4 and SNP6 (rs46462316) in intron 5. The intervals between SNP2 and SNP4, between SNP4 and SNP5, and between SNP5 and SNP6 are 16414 bp, 833 bp, and 861 bp, respectively. Because Peterson et al. (2010) did not construct sister haplotypes, they used individual-common haplotype pairs in the case and control groups to calculate ORs and found that haplotypes GAGAGC and AGCGAC were significantly associated with risk for breast cancer. Our sister haplotype analysis was again based on the three-SNP system. Haplotypes consisting of 6 SNPs give 20 three-SNP haplotype combinations, more than the 15 haplotypes observed, so many haplotypes were missing. In theory, each three-SNP combination should have 8 haplotypes. In the list of haplotype combinations (Supplementary Table S4), 11 combinations had 6 haplotypes, 8 combinations had 7 haplotypes and only one had 8 haplotypes. Since 6 haplotypes cannot form valid sister gamete pairs, we removed those combinations from our analysis. For the combinations with 7 haplotypes, we assigned the frequencies of rare haplotypes in the case and control groups to the missing haplotype in each combination, so that each of these 8 combinations had 8 haplotypes. Using our R package SHAD (Sister-haplotype Association of Disease), we obtained the results of the RD and disease association tests.
The results summarized in Supplementary Table S5 show that, except for combination 19, which had no significant RD, the other 7 combinations had very significant RD. Combination 6 (SNP1, SNP3 and SNP5), combination 13 (SNP2, SNP3 and SNP6), and combination 16 (SNP2, SNP5 and SNP6) had very strong negative recombination interference, whereas in combination 9 (SNP1, SNP4 and SNP6), combination 10 (SNP1, SNP5 and SNP6), combination 11 (SNP2, SNP3 and SNP4), and combination 12 (SNP2, SNP3 and SNP5) there was very strong positive recombination interference among the three SNPs. Unsurprisingly, in all combinations none of the sister haplotype pairs was associated with risk for breast cancer (Supplementary Table S5). These results are as predicted from the recombination interference that occurs in such short intervals within the gene and within introns. To our knowledge, COMT is chiefly produced by nerve cells in the brain, and its variants have been found to be associated with risk for mental illness and schizophrenia, other disorders that affect thought (cognition) and emotion, bipolar disorder, panic disorder, anxiety, obsessive-compulsive disorder (OCD), eating disorders, and attention deficit hyperactivity disorder (ADHD) (http://ghr.nlm.nih.gov/gene/COMT). So far we have not found any other evidence that variants of COMT are associated with risk for breast cancer.

Discussion

Theoretically, RD reveals recombination interference among multiple loci in an ideal population because in such a population RD is completely derived from recombination interference. In a natural population, however, in addition to recombination interference, RD may also be derived from selection, mutation, gene conversion, migration and/or genetic drift in a small population because these factors can also alter frequencies of gametes or haplotypes (Tan, 2020). In human local populations, these factors may also result in haplotype-based association of complex diseases. Therefore, RD test is required in haplotype-based association of disease.

Frequencies of haplotypes in natural or human populations can be estimated with existing methods such as PHASE (Stephens et al., 2001), fastPHASE (Scheet and Stephens, 2006), BEAGLE (Browning and Browning, 2007), IMPUTE2 (Howie et al., 2009), RCEH (Gao et al., 2009) and MaCH (Li et al., 2010). However, current statistical methods for haplotype-disease association analysis, as seen in the above examples, do not consider recombination interference even though LD has been excluded in haplotype-based association analysis of diseases. LD can easily be tested between two loci (Robbins, 1918; Geiringer, 1944; Lewontin and Kojima, 1960; Lewontin, 1964; Hill and Robertson, 1968) but becomes very complicated among multiple loci because LD cannot measure recombination interference. Recombination interference becomes strong in a short interval. Recombination interference changes the frequencies of haplotypes, which can lead to spurious association between haplotypes and a complex disease. One example is the association of a haplotype in gene IL-17A with CAD reported by Vargas-Alarcon et al. (2015), which was due to recombination interference within gene IL-17A. In addition, small populations also show changes in haplotype frequencies because of genetic drift, which leads to false association of haplotypes with the disease. Therefore, in a small population, testing for RD in haplotypes can exclude false haplotype-disease associations. If no RD in haplotypes is found in the control and case populations, an identified association of sister haplotypes with a disease of study is statistically acceptable. For example, the M1M3M4* haplotype containing the risk factor apoE-ε4 and the M1M3M6 haplotype containing the risk factor apoE-ε3 were found to be associated with risk for AD in a small human population (210 AD cases and 159 non-demented elderly controls) using our sister haplotypes and RD test.
ApoE-ε3 (Huang et al., 1995; DeMattos et al., 2001; Hopkins et al., 2002; Sen et al., 2012; Pedachenko et al., 2015; Mahan et al., 2022; Sepulveda-Falla et al., 2022; Mulgrave et al., 2023) and apoE-ε4 (Ayyubova, 2023; Chen et al., 2023; Hamza et al., 2023; Koutsodendris et al., 2023; Pires and Rego, 2023; Sun and Xie, 2023; Zhou et al., 2023) have been verified to be risk factors for AD. Fallin et al. (2001), however, found that 3 haplotypes in configure 2, which flanks M3 and M4, were significantly associated with risk for AD by using individual-others pairs. Haplotypes in configure 2 (M1M2M5M6) should not be associated with risk for AD because they do not contain M3 and M4. To see why, note that three SNPs can form 8 genotypes: ABC, abc, ABc, abC, aBC, Abc, AbC and aBc. If we consider only SNP1 and SNP3 and ignore SNP2, we have four two-SNP genotypes: AC (ABC and AbC), ac (aBc and abc), Ac (ABc and Abc) and aC (aBC and abC), each containing both the B and b alleles at the SNP2 locus. If SNP2 is assumed to be the risk factor, then there should be no association between SNP1-SNP3 haplotypes and risk for the disease. The haplotypes that Fallin et al. (2001) found to be associated with AD in configure 2 are therefore incorrect.

A null hypothesis for haplotype-disease association is that under recombination equilibrium, if disequilibrium between two sister haplotypes does not result in disease, then difference in frequency between sister haplotypes in the case population should be independent of that in the control population. Since two sister haplotypes, like a pair of alleles at a locus, are respectively derived from father and mother and hence are genetically a pair of sister gametes. It is reasonable to construct two-by-two contingency tables with sister haplotypes and case-control for association test. Therefore, inappropriate haplotype pairs would result in false findings of haplotype-disease associations. For example, in individual-common haplotype pairs (Gaudet et al., 2006; Peterson et al., 2010), only one haplotype (e.g., CTA in Table 5) has different alleles at all three loci from the common haplotype (e.g., ACG in Table 5), while the others have the same alleles at two or one locus with the common haplotype. This means that only one haplotype can be paired with the common haplotype in biology. Individual-others pairs (Fallin et al., 2001), as seen in configure 2, would create an incorrect association between haplotypes and risk for the disease because most of the other haplotypes are irrelevant to this haplotype and cannot be paired with it in biology. In order to validate this conclusion, we applied individual-common haplotype pair and individual-others pair methods to the haplotype data (Table 5) of Faghih et al. (2009) and to a new haplotype dataset (Supplementary Table S6) created by assigning 500 patients to the 8 three-SNP haplotypes using their frequencies in the case population and 400 health individuals to the same 8 haplotypes using their frequencies in the normal population. 
In the original haplotype data (Table 5 or Supplementary Table S6), the individual-common pair and sister haplotype pair methods did not find any association between haplotypes and risk for breast cancer, but the individual-others pair method identified ACA as associated with risk for breast cancer (p = 0.03254) (Supplementary Table S7). In the new haplotype data (Supplementary Table S6), both the individual-common pair and individual-others pair methods found that haplotypes ACA and ATA were very significantly associated with risk for breast cancer (p ≤ 0.005191). The inconsistent results between two datasets with the same haplotype frequencies in the case and control populations indicate that both individual-common pairs and individual-others pairs are incorrect haplotype pairs for association analysis. In contrast, none of the four pairs of sister haplotypes was associated with risk for breast cancer in either the original or the new haplotype data (Supplementary Table S7), suggesting that sister haplotype pairs are the correct pairs for testing for association between haplotypes and risk for disease. The four examples above show that our sister haplotype method based on RD has high sensitivity and high specificity. Theoretical analysis shows that our method satisfies the conditions for independence of two random variables, that is, the two sister haplotypes are paired and the case and control groups are also paired. In a future study, we will use simulated data to evaluate whether our method has higher power, a higher ROC curve, and a lower FDR in multiple haplotype-disease tests than other haplotype-based methods.
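
As a minimal sketch of the pairing and test described above, the following Python code pairs each three-SNP haplotype with the haplotype that differs from it at all three loci and then applies a classical chi-square test to the resulting two-by-two table. Several points are assumptions made for illustration only: the allele pairs per locus are inferred from the haplotypes named in the text (ACG, CTA, ACA, ATA), the counts are hypothetical and are not taken from Table 5 or the Supplementary Tables, the function names are invented here, and the RD tests that precede the association test are omitted.

```python
import math
from itertools import product

# Assumed allele pairs at the three loci, inferred from haplotypes named in the text.
LOCI = [("A", "C"), ("C", "T"), ("G", "A")]

def sister_of(hap):
    """Return the haplotype carrying the alternative allele at every locus."""
    out = []
    for allele, (first, second) in zip(hap, LOCI):
        if allele not in (first, second):
            raise ValueError(f"unexpected allele {allele!r} in {hap!r}")
        out.append(second if allele == first else first)
    return "".join(out)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Raises ZeroDivisionError if any row or column total is zero.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Enumerate the 8 haplotypes and collect the 4 unordered sister pairs.
haplotypes = ["".join(h) for h in product(*LOCI)]
sister_pairs = {frozenset((h, sister_of(h))) for h in haplotypes}

# Hypothetical (cases, controls) counts for one sister pair; NOT data from the article.
counts = {"ACG": (10, 20), "CTA": (20, 10)}
hap = "ACG"
sis = sister_of(hap)
chi2, p_value = chi_square_2x2(*counts[hap], *counts[sis])
print(hap, sis, round(chi2, 3), p_value)
```

The rows of the table are the two sister haplotypes and the columns are case and control counts, so each of the four sister pairs yields one test with one degree of freedom.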

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

S-YL: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Resources, Writing–original draft, Validation. Y-DT: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Resources, Writing–original draft, Methodology, Software, Supervision, Writing–review and editing.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This study was supported by the Sichuan Science and Technology Program (2022NSFSC0679).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2023.1295327/full#supplementary-material

References

  • 1

    Akey, J., Jin, L., and Xiong, M. (2001). Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet. 9 (4), 291–300. doi: 10.1038/sj.ejhg.5200619

  • 2

    Allen, A. S., and Satten, G. A. (2005). Robust testing of haplotype/disease association. BMC Genet. 6 (Suppl. 1), S69. doi: 10.1186/1471-2156-6-S1-S69

  • 3

    Allen, A. S., and Satten, G. A. (2007). Inference on haplotype/disease association using parent-affected-child data: the projection conditional on parental haplotypes method. Genet. Epidemiol. 31 (3), 211–223. doi: 10.1002/gepi.20203

  • 4

    Allen, A. S., and Satten, G. A. (2009). A novel haplotype-sharing approach for genome-wide case-control association studies implicates the calpastatin gene in Parkinson's disease. Genet. Epidemiol. 33 (8), 657–667. doi: 10.1002/gepi.20417

  • 5

    Ayyubova, G. (2023). APOE4 is a risk factor and potential therapeutic target for Alzheimer's disease. CNS Neurol. Disord. Drug Targets 23, 342–352. doi: 10.2174/1871527322666230303114425

  • 6

    Browning, S. R., and Browning, B. L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81 (5), 1084–1097. doi: 10.1086/521987

  • 7

    Chen, F., Chen, Y., Ke, Q., Wang, Y., Gong, Z., Chen, X., et al. (2023). ApoE4 associated with severe COVID-19 outcomes via downregulation of ACE2 and imbalanced RAS pathway. J. Transl. Med. 21 (1), 103. doi: 10.1186/s12967-023-03945-7

  • 8

    Cheng, R., Ma, J. Z., Elston, R. C., and Li, M. D. (2005). Fine mapping functional sites or regions from case-control data using haplotypes of multiple linked SNPs. Ann. Hum. Genet. 69 (Pt 1), 102–112. doi: 10.1046/j.1529-8817.2004.00140.x

  • 9

    Clark, A. G. (2004). The role of haplotypes in candidate gene studies. Genet. Epidemiol. 27 (4), 321–333. doi: 10.1002/gepi.20025

  • 10

    Cordell, H. J., and Clayton, D. G. (2002). A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am. J. Hum. Genet. 70 (1), 124–141. doi: 10.1086/338007

  • 11

    Corder, E. H., Saunders, A. M., Strittmatter, W. J., Schmechel, D. E., Gaskell, P. C., Small, G. W., et al. (1993). Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science 261 (5123), 921–923. doi: 10.1126/science.8346443

  • 12

    DeMattos, R. B., Rudel, L. L., and Williams, D. L. (2001). Biochemical analysis of cell-derived apoE3 particles active in stimulating neurite outgrowth. J. Lipid Res. 42 (6), 976–987. doi: 10.1016/s0022-2275(20)31622-9

  • 13

    Faghih, Z., Erfani, N., Razmkhah, M., Sameni, S., Talei, A., and Ghaderi, A. (2009). Interleukin13 haplotypes and susceptibility of Iranian women to breast cancer. Mol. Biol. Rep. 36 (7), 1923–1928. doi: 10.1007/s11033-008-9400-7

  • 14

    Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., et al. (2001). Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer's disease. Genome Res. 11 (1), 143–151. doi: 10.1101/gr.148401

  • 15

    Fardo, D. W., Druen, A. R., Liu, J., Mirea, L., Infante-Rivard, C., and Breheny, P. (2011). Exploration and comparison of methods for combining population- and family-based genetic association using the Genetic Analysis Workshop 17 mini-exome. BMC Proc. 5 (Suppl. 9), S28. doi: 10.1186/1753-6561-5-S9-S28

  • 16

    Farrer, L. A., Cupples, L. A., Haines, J. L., Hyman, B., Kukull, W. A., Mayeux, R., et al. (1997). Effects of age, sex, and ethnicity on the association between apolipoprotein E genotype and Alzheimer disease. A meta-analysis. APOE and Alzheimer Disease Meta Analysis Consortium. JAMA 278 (16), 1349–1356. doi: 10.1001/jama.278.16.1349

  • 17

    Gao, G., Allison, D. B., and Hoeschele, I. (2009). Haplotyping methods for pedigrees. Hum. Hered. 67 (4), 248–266. doi: 10.1159/000194978

  • 18

    Gaudet, M. M., Chanock, S., Lissowska, J., Berndt, S. I., Peplonska, B., Brinton, L. A., et al. (2006). Comprehensive assessment of genetic variation of catechol-O-methyltransferase and breast cancer risk. Cancer Res. 66 (19), 9781–9785. doi: 10.1158/0008-5472.CAN-06-1294

  • 19

    Geiringer, H. (1944). On the probability theory of linkage in Mendelian heredity. Ann. Math. Stat. 15, 25–57. doi: 10.1214/aoms/1177731313

  • 20

    Hamza, E. A., Moustafa, A. A., Tindle, R., Karki, R., Nalla, S., Hamid, M. S., et al. (2023). Effect of APOE4 allele and gender on the rate of atrophy in the hippocampus, entorhinal cortex, and fusiform gyrus in Alzheimer's disease. Curr. Alzheimer Res. 19, 943–953. doi: 10.2174/1567205020666230309113749

  • 21

    Hastings, A. (1984). Linkage disequilibrium, selection and recombination at three loci. Genetics 106 (1), 153–164. doi: 10.1093/genetics/106.1.153

  • 22

    Hill, W. G., and Robertson, A. (1968). The effects of inbreeding at loci with heterozygote advantage. Genetics 60 (3), 615–628. doi: 10.1093/genetics/60.3.615

  • 23

    Hopkins, P. C., Huang, Y., McGuire, J. G., and Pitas, R. E. (2002). Evidence for differential effects of apoE3 and apoE4 on HDL metabolism. J. Lipid Res. 43 (11), 1881–1889. doi: 10.1194/jlr.m200172-jlr200

  • 24

    Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5 (6), e1000529. doi: 10.1371/journal.pgen.1000529

  • 25

    Huang, D. Y., Weisgraber, K. H., Goedert, M., Saunders, A. M., Roses, A. D., and Strittmatter, W. J. (1995). ApoE3 binding to tau tandem repeat I is abolished by tau serine262 phosphorylation. Neurosci. Lett. 192 (3), 209–212. doi: 10.1016/0304-3940(95)11649-h

  • 26

    Koutsodendris, N., Blumenfeld, J., Agrawal, A., Traglia, M., Grone, B., Zilberter, M., et al. (2023). Neuronal APOE4 removal protects against tau-mediated gliosis, neurodegeneration and myelin deficits. Nat. Aging 3 (3), 275–296. doi: 10.1038/s43587-023-00368-3

  • 27

    Lewontin, R., and Kojima, K. (1960). The evolutionary dynamics of complex polymorphisms. Evolution 14, 458–472. doi: 10.2307/2405995

  • 28

    Lewontin, R. C. (1964). The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49 (1), 49–67. doi: 10.1093/genetics/49.1.49

  • 29

    Li, Y., Willer, C. J., Ding, J., Scheet, P., and Abecasis, G. R. (2010). MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34 (8), 816–834. doi: 10.1002/gepi.20533

  • 30

    Mahan, T. E., Wang, C., Bao, X., Choudhury, A., Ulrich, J. D., and Holtzman, D. M. (2022). Selective reduction of astrocyte apoE3 and apoE4 strongly reduces Aβ accumulation and plaque-related pathology in a mouse model of amyloidosis. Mol. Neurodegener. 17 (1), 13. doi: 10.1186/s13024-022-00516-0

  • 31

    McGregor, N. R. (2014). Catechol O-methyltransferase: a review of the gene and enzyme. J. J. Dent. Res. 1 (1), 1–18.

  • 32

    Mulgrave, V. E., Alsayegh, A. A., Jaldi, A., Omire-Mayor, D. T., James, N., Ntekim, O., et al. (2023). Exercise modulates APOE expression in brain cortex of female APOE3 and APOE4 targeted replacement mice. Neuropeptides 97, 102307. doi: 10.1016/j.npep.2022.102307

  • 33

    Neale, B. M., and Sham, P. C. (2004). The future of association studies: gene-based analysis and replication. Am. J. Hum. Genet. 75 (3), 353–362. doi: 10.1086/423901

  • 34

    Niu, T. (2004). Algorithms for inferring haplotypes. Genet. Epidemiol. 27 (4), 334–347. doi: 10.1002/gepi.20024

  • 35

    Pedachenko, E. G., Biloshytsky, V. V., Mikhal'sky, S. A., Gridina, N. Y., and Kvitnitskaya-Ryzhova, T. Y. (2015). The effect of gene therapy with the APOE3 gene on structural and functional manifestations of secondary hippocampal damages in experimental traumatic brain injury. Zh Vopr. Neirokhir Im. N. N. Burdenko 79 (2), 21–32. doi: 10.17116/neiro201579221-32

  • 36

    Peterson, N. B., Trentham-Dietz, A., Garcia-Closas, M., Newcomb, P. A., Titus-Ernstoff, L., Huang, Y., et al. (2010). Association of COMT haplotypes and breast cancer risk in Caucasian women. Anticancer Res. 30 (1), 217–220.

  • 37

    Pires, M., and Rego, A. C. (2023). APOE4 and Alzheimer's disease pathogenesis-mitochondrial deregulation and targeted therapeutic strategies. Int. J. Mol. Sci. 24 (1), 778. doi: 10.3390/ijms24010778

  • 38

    Robbins, R. B. (1918). Applications of mathematics to breeding problems II. Genetics 3 (1), 73–92. doi: 10.1093/genetics/3.1.73

  • 39

    Saunders, A. M., Strittmatter, W. J., Schmechel, D., George-Hyslop, P. H., Pericak-Vance, M. A., Joo, S. H., et al. (1993). Association of apolipoprotein E allele epsilon 4 with late-onset familial and sporadic Alzheimer's disease. Neurology 43 (8), 1467–1472. doi: 10.1212/wnl.43.8.1467

  • 40

    Schaid, D. J. (2006). Power and sample size for testing associations of haplotypes with complex traits. Ann. Hum. Genet. 70 (Pt 1), 116–130. doi: 10.1111/j.1529-8817.2005.00215.x

  • 41

    Scheet, P., and Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78 (4), 629–644. doi: 10.1086/502802

  • 42

    Sen, A., Alkon, D. L., and Nelson, T. J. (2012). Apolipoprotein E3 (ApoE3) but not ApoE4 protects against synaptic loss through increased expression of protein kinase C epsilon. J. Biol. Chem. 287 (19), 15947–15958. doi: 10.1074/jbc.M111.312710

  • 43

    Sepulveda-Falla, D., Sanchez, J. S., Almeida, M. C., Boassa, D., Acosta-Uribe, J., Vila-Castelar, C., et al. (2022). Distinct tau neuropathology and cellular profiles of an APOE3 Christchurch homozygote protected against autosomal dominant Alzheimer's dementia. Acta Neuropathol. 144 (3), 589–601. doi: 10.1007/s00401-022-02467-8

  • 44

    Sham, P. C., Rijsdijk, F. V., Knight, J., Makoff, A., North, B., and Curtis, D. (2004). Haplotype association analysis of discrete and continuous traits using mixture of regression models. Behav. Genet. 34 (2), 207–214. doi: 10.1023/B:BEGE.0000013734.39266.a3

  • 45

    Stephens, M., Smith, N. J., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68 (4), 978–989. doi: 10.1086/319501

  • 46

    Strittmatter, W. J., Saunders, A. M., Schmechel, D., Pericak-Vance, M., Enghild, J., Salvesen, G. S., et al. (1993). Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. Proc. Natl. Acad. Sci. U. S. A. 90 (5), 1977–1981. doi: 10.1073/pnas.90.5.1977

  • 47

    Sun, R., and Xie, C. (2023). Peripheral ApoE4 leads to cerebrovascular dysfunction and Aβ deposition in Alzheimer's disease. Neurosci. Bull. 39 (8), 1330–1332. doi: 10.1007/s12264-023-01058-1

  • 48

    Tan, Y. D. (2020). Recombination disequilibrium in ideal and natural populations. Genomics 112, 3943–3950. doi: 10.1016/j.ygeno.2020.06.034

  • 49

    Thomson, G., and Baur, M. P. (1984). Third order linkage disequilibrium. Tissue Antigens 24 (4), 250–255. doi: 10.1111/j.1399-0039.1984.tb02134.x

  • 50

    Vargas-Alarcon, G., Angeles-Martinez, J., Villarreal-Molina, T., Alvarez-Leon, E., Posadas-Sanchez, R., Cardoso-Saldana, G., et al. (2015). Interleukin-17A gene haplotypes are associated with risk of premature coronary artery disease in Mexican patients from the Genetics of Atherosclerotic Disease (GEA) study. PLoS One 10 (1), e0114943. doi: 10.1371/journal.pone.0114943

  • 51

    Wen, S. H., and Tsai, M. Y. (2014). Haplotype association analysis of combining unrelated case-control and triads with consideration of population stratification. Front. Genet. 5, 103. doi: 10.3389/fgene.2014.00103

  • 52

    Yang, Y., Li, S. S., Chien, J. W., Andriesen, J., and Zhao, L. P. (2008). A systematic search for SNPs/haplotypes associated with disease phenotypes using a haplotype-based stepwise procedure. BMC Genet. 9, 90. doi: 10.1186/1471-2156-9-90

  • 53

    Zhao, H., Pfeiffer, R., and Gail, M. H. (2003a). Haplotype analysis in population genetics and association studies. Pharmacogenomics 4 (2), 171–178. doi: 10.1517/phgs.4.2.171.22636

  • 54

    Zhao, L. P., Li, S. S., and Khalid, N. (2003b). A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am. J. Hum. Genet. 72 (5), 1231–1250. doi: 10.1086/375140

  • 55

    Zhou, X., Shi, Q., Zhang, X., Gu, L., Li, J., Quan, S., et al. (2023). ApoE4-mediated blood-brain barrier damage in Alzheimer's disease: progress and prospects. Brain Res. Bull. 199, 110670. doi: 10.1016/j.brainresbull.2023.110670

Summary

Keywords

sister haplotypes, complex disease, association, recombination interference, Alzheimer disease, linkage disequilibrium, SNPs, coronary artery disease

Citation

Liao S-Y and Tan Y-D (2024) Sister haplotypes and recombination disequilibrium: a new approach to identify associations of haplotypes with complex diseases. Front. Genet. 14:1295327. doi: 10.3389/fgene.2023.1295327

Received

16 September 2023

Accepted

13 December 2023

Published

16 January 2024

Volume

14 - 2023

Edited by

Peng Wang, Harbin Medical University, China

Reviewed by

Cecilia Contreras-Cubas, National Institute of Genomic Medicine (INMEGEN), Mexico

Sergio Flores, Autonomous University of Chile, Chile

Copyright

*Correspondence: Yuan-De Tan,
