Assortative mating on ancestry-variant traits in admixed Latin American populations

Background Assortative mating is a universal feature of human societies, and individuals from ethnically diverse populations are known to mate assortatively based on similarities in genetic ancestry. However, little is currently known regarding the exact phenotypic cues, or their underlying genetic architecture, which inform ancestry-based assortative mating. Results We developed a novel approach, using genome-wide analysis of ancestry-specific haplotypes, to evaluate ancestry-based assortative mating on traits whose expression varies among the three continental population groups – African, European, and Native American – that admixed to form modern Latin American populations. Application of this method to genome sequences sampled from Colombia, Mexico, Peru, and Puerto Rico revealed widespread ancestry-based assortative mating. We discovered a number of anthropometric traits (body mass, height, facial development and waist-hip ratio) and neurological attributes (educational attainment and schizophrenia) that serve as phenotypic cues for ancestry-based assortative mating. Major histocompatibility complex (MHC) loci show population-specific patterns of both assortative and disassortative mating in Latin America. Ancestry-based assortative mating in the populations analyzed here appears to be driven primarily by African ancestry. Conclusions This study serves as an example of how population genomic analyses can yield novel insights into human behavior.


Background
Mate choice is a fundamental dimension of human behavior with important implications for population genetic structure and evolution [1][2][3]. It is widely known that humans choose to mate assortatively rather than randomly; that is, humans tend to choose mates who are more similar to themselves than would be expected by chance. Historically, assortative mating was based largely on geography, whereby partners were chosen from a limited set of physically proximal individuals [4]. Over millennia, assortative mating within groups of geographically confined individuals contributed to genetic divergence between groups, and to the establishment of distinct human populations, such as the major continental population groups recognized today [5][6][7].

However, the process of geographic isolation followed by population divergence that characterized human evolution has not been strictly linear. Ongoing human migrations have continuously brought previously isolated populations into contact; when this occurs, the potential exists for once isolated populations to admix, thereby forming novel population groups [8]. Perhaps the most precipitous example of this process occurred in the Americas, starting just over 500 years ago with the arrival of Columbus in the New World [...] numerous studies have demonstrated an influence of similarities in height and body mass on mate choice [2][16][17][18]. In addition, assortative mating has been observed for diverse neurological traits, such as educational attainment, introversion/extroversion and even neurotic tendencies [19][20][21][22][23][24]. Harder-to-classify traits related to personal achievement (income and occupational status) and culture (values and political leanings) also impact patterns of assortative mating [19,25,26].
Odor is one of the more interesting traits implicated in mate choice, and it has been linked to so-called disassortative (or negative assortative) mating, whereby less similar mates are preferred. Odor-based disassortative mating has been attributed to differences in genes of the major histocompatibility complex (MHC) locus, which functions in the immune system, based on the idea that combinations of divergent human leukocyte antigen (HLA) alleles provide a selective advantage via elevated host resistance to pathogens [27,28].

Ancestry is a particularly important determinant of assortative mating in modern admixed populations [29,30]. Studies have shown that individuals in admixed Latin American populations tend to mate with partners that have similar ancestry profiles. For example, partners from both Mexican and Puerto Rican populations have significantly higher ancestry similarities than expected by chance [24,31]. In addition, a number of traits that have been independently linked to assortative mating show ancestry-specific differences in their expression [32]. Accordingly, ancestry-based mate choice has recently been related to a limited number of physical (facial development) and immune-related (MHC loci) traits [24].

The studies that have uncovered the role of genetic ancestry in assortative mating among Latinos have relied on estimates of global ancestry fractions between mate pairs [24,31]. Given the recent accumulation of numerous whole genome sequences from admixed Latin American populations, along with genome sequences from global reference populations [7], it is now possible to characterize local genetic ancestry [...] (Additional file 1: Figure S1). The program ADMIXTURE [39] was used to infer the continental genetic ancestry fractions (African, European and Native American) for individuals from the four Latin American populations (Additional file 1: Figure S2).
Distributions of individuals' continental ancestry fractions illustrate the distinct ancestry profiles of the four populations (Fig. 1). Puerto Rico and Colombia show the highest European ancestry fractions along with the highest levels of three-way admixture. These two populations also have the highest African ancestry fractions, although all four populations have relatively small fractions of African ancestry. Peru and Mexico show more exclusively Native American and European admixture, with Peru having by far the largest Native American ancestry fraction.

The program RFMix [35] was used to infer local African, European and Native American genetic ancestry for individuals from the four admixed Latin American populations analyzed here. RFMix uses global reference populations to perform chromosome painting, whereby the ancestral origins of specific haplotypes are characterized across the entire genome for admixed individuals. Only haplotypes with high confidence ancestry assignments (≥99%) were taken for subsequent analysis. Examples of local ancestry assignment chromosome paintings for representative admixed individuals from each population are shown in Additional file 1: Figure S3. The overall continental ancestry fractions for admixed genomes calculated by global and local ancestry analysis are highly correlated, and in fact virtually identical, across all individuals analyzed here, in support of the reliability of these approaches to ancestry assignment (Additional file 1: Figure S4).

Assortative mating and local ancestry in Latin America
We analyzed genome-wide patterns of local ancestry assignment in order to assess the evidence for assortative mating based on local ancestry in Latin America (Fig. 2a).
For each individual, the ancestry assignments for pairs of haplotypes at any given gene were evaluated for homozygosity (i.e., the same ancestry on both haplotypes) or heterozygosity (i.e., a different ancestry on each haplotype) (Fig. 2b). For each gene, across all four populations, the observed values of ancestry homozygosity and heterozygosity were compared to the expected values in order to compute gene- and population-specific assortative mating index (AMI) values. AMI is computed as a log odds ratio as described in the Methods. The expected values of local ancestry homozygosity and heterozygosity used for the AMI calculations are based on a Hardy-Weinberg triallelic model with the three allele frequencies computed as the locus-specific ancestry fractions. High positive AMI values result from an excess of observed local ancestry homozygosity and are thereby taken to indicate assortative mating based on shared local genetic ancestry. Conversely, low negative AMI values indicate excess local ancestry heterozygosity and disassortative mating.

While we were interested in exploring the relationship between local genetic ancestry and assortative mating, we recognized that mate choice is based on phenotypes rather than genotypes per se. Since phenotypes are typically encoded by multiple genes, expressed in the context of their environment, we used data from genome-wide association studies (GWAS) to identify sets of genes that function together to encode polygenic phenotypes (Fig. 2c). We combined data from several GWAS database sources in order to curate a collection of 106 gene sets that have been linked to the polygenic genetic architecture of a variety of human traits. These gene sets range in size from 2 to 212 genes and include a total of 986 unique genes (Additional file 1: Figure S5).
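As a minimal illustrative sketch (not the authors' code), the per-gene tally of ancestry homozygosity and heterozygosity described above could be written in Python as follows. The function name, the ancestry labels and the use of `None` for haplotypes without a high confidence assignment are assumptions made for illustration.

```python
from collections import Counter

def tally_ancestry_genotypes(pairs):
    """Count ancestry-homozygous and ancestry-heterozygous individuals for one gene.

    pairs: iterable of (hap1, hap2) ancestry labels, one tuple per individual.
    A label of None (hypothetical convention) marks a haplotype without a
    high confidence ancestry assignment; such individuals are skipped.
    """
    counts = Counter()
    for hap1, hap2 in pairs:
        if hap1 is None or hap2 is None:
            continue  # gene lacks two ancestry assignments in this individual
        counts["hom" if hap1 == hap2 else "het"] += 1
    return counts["hom"], counts["het"]

# Toy data for a single gene in five individuals (labels are illustrative).
gene_pairs = [
    ("AFR", "AFR"),
    ("EUR", "NAT"),
    ("NAT", "NAT"),
    ("EUR", None),
    ("EUR", "AFR"),
]
obs_hom, obs_het = tally_ancestry_genotypes(gene_pairs)
print(obs_hom, obs_het)
```

Only individuals with an ancestry assignment on both haplotypes contribute to the observed counts, which matches the requirement that AMI be computed from genes with two ancestry assignments.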
We focused on phenotypes that are known or expected to influence mate choice and thereby impact assortative mating patterns. These phenotypes fall into three broad categories: anthropometric traits (e.g., body shape, stature and pigmentation), neurological traits (e.g., cognition, personality and addiction) and immune response (HLA genes). Finally, we used a meta-analysis of the AMI values for the sets of genes that underlie each polygenic phenotype in order to evaluate the impact of local ancestry on assortative mating (Fig. 2d).

We compared the distributions of observed versus expected AMI values to assess the overall evidence for local ancestry-based assortative mating in Latin America. Expected AMI values were computed via permutation analysis by randomly combining pairs of haplotypes into diploid individuals in order to approximate random mating. The distribution of the expected AMI values is narrow and centered around 0, whereas the observed AMI values have a far broader distribution and tend to be positive (expected mean AMI = -0.01, σ = 0.03; observed mean AMI = 0.11, σ = 0.14; Fig. 3a). When all four admixed Latin American populations are considered together, the mean observed AMI value is significantly greater than the expected mean AMI (t=18.14, P=8.12e-56). The same trend can be seen when all four populations are considered separately (Additional file 1: Figure S6). Mean observed AMI values vary substantially across populations, with Mexico showing the highest levels of local ancestry-based assortative mating and Puerto Rico showing the lowest (Fig. 3b). There is also substantial variation in the extent of assortative mating among the three broad functional categories of phenotypes (Fig. 3c).
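The permutation idea described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it shuffles per-gene ancestry labels rather than whole phased chromosomes, and the function names and toy data are hypothetical. Python's `random.shuffle` performs a Fisher-Yates shuffle.

```python
import random

def permute_random_mating(haplotypes, seed=None):
    """Shuffle a population's haplotype labels and re-pair them into diploids."""
    rng = random.Random(seed)
    pool = list(haplotypes)
    rng.shuffle(pool)  # Fisher-Yates shuffle of the haplotype pool
    # Pair consecutive shuffled haplotypes; an unpaired last haplotype is dropped.
    return [(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]

def homozygosity(pairs):
    """Fraction of diploid individuals carrying the same ancestry on both haplotypes."""
    return sum(h1 == h2 for h1, h2 in pairs) / len(pairs)

# Toy observed ancestry genotypes for one gene (illustrative labels).
observed = [("AFR", "AFR"), ("EUR", "EUR"), ("NAT", "NAT"), ("EUR", "NAT")]
haps = [h for pair in observed for h in pair]

# Null distribution of ancestry homozygosity under approximate random mating.
null = [homozygosity(permute_random_mating(haps, seed=s)) for s in range(1000)]
print(homozygosity(observed), sum(null) / len(null))
```

In this toy case the observed homozygosity sits well above the permuted mean, which is the pattern that yields a positive AMI.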
Local ancestry-based assortative mating is particularly variable for HLA genes, with high levels of assortative mating seen for Mexico and evidence for disassortative mating seen for Colombia and Puerto Rico. Anthropometric traits tend to show higher levels of local ancestry-based assortative mating across all four populations compared to neurological traits.

In addition to the permutation test that we used to compute expected AMI values based on randomly paired haplotypes, we also performed a simulation analysis using a population genetic model of assortative mating in order to validate the performance of the AMI test statistic (Additional file 1: Figure S7). We were particularly interested in exploring the potential effects of different ancestry proportions among the populations analyzed here, and different gene set sizes, on computed AMI values. The population genetic model that we used to simulate assortative mating combines Hardy-Weinberg genotype expectations with a single parameter that represents the fraction of the population that mates assortatively. Details of how this model was implemented to simulate AMI values for the four populations can be found in the Methods section. The population genetic simulation shows that our AMI test statistic is fairly sensitive to low values of the assortative mating parameter. We also show that AMI values are not biased in any particular direction based on the overall ancestry fractions observed for each population. For example, according to the simulation, Colombia should have the highest overall AMI values, followed by Puerto Rico, Mexico and Peru. This order is completely different from what is seen for the observed AMI values, where Mexico shows the highest mean value, followed by Peru, Colombia and Puerto Rico (Fig. 3b) [...]

There are 15 polygenic phenotypes that have statistically significant AMI values, after correction for multiple tests, indicative of local ancestry-based assortative mating (q<0.05; Fig. 4a). The majority of the statistically significant cases of assortative mating are seen in the Mexican population (8 out of 15), and the anthropometric functional category is most commonly seen among the significant phenotypes (12 out of [...] three out of the four populations analyzed here (Colombia, Mexico and Peru). Body mass index is the next most common phenotype, with four significant cases in two populations (Mexico and Peru). The only neurological traits that show significant evidence of assortative mating are schizophrenia (Mexico and Peru) and educational attainment (Mexico). Puerto Rico was the only population that did not show any individual phenotypes with significant evidence of assortative mating, consistent with its low overall AMI values (Fig. 3b and Additional file 1: Figure S6). A list of these significant traits, including references to the literature where the trait single nucleotide polymorphism (SNP) associations were originally reported, is provided in Additional file 1: Table S1.

In addition to evaluating individual phenotypes for statistically significant AMI values, we also looked for polygenic phenotypes that showed the most similar or dissimilar patterns of assortative mating across the [...] represented among the highly population variant phenotypes, and the highly variant phenotypes consist of both assortative and disassortative mating cases (specifically the HLA genes that are described in more detail below). Neurological phenotypes are particularly enriched among the variant cases, including temperament and several addiction-related phenotypes: opioid sensitivity, alcohol dependence and general addiction.
Interestingly, all of the least variant phenotypes (height, waist-hip ratio and schizophrenia) are also found among the most significant cases of assortative mating, attesting to a pervasive role in ancestry-based assortative mating for these traits. A list of the population (in)variant traits, including references to the literature where the trait SNP associations were originally reported, is provided in Additional file 1: Table S1.
Given the evidence of significant local ancestry-based assortative mating that we observed for a number of traits, we evaluated whether there were particular ancestry components that were most relevant to mate choice. In other words, we asked whether the excess counts of observed ancestry homozygosity or heterozygosity are linked to specific local ancestry assignments: African, European and/or Native American. For significant polygenic phenotype gene sets of interest, we computed the observed versus expected ancestry homozygosity for each ancestry separately across all genes in the set [...]

More recent studies of assortative mating, powered by advances in human genomics, have begun to explore the genetic architecture underlying the human traits that form the basis of mate choice [2,21]. In addition, recent genomic analyses have underscored the extent to which human genetic ancestry influences assortative mating [24,30,31]. However, until this time, these two strands of inquiry have not been brought together. The approach that we developed for this study allowed us to directly assess the connection between local genetic ancestry (i.e., ancestry assignments for specific genome regions or haplotypes) and the human traits that serve as cues for assortative mating.

Our approach relies on the well-established principle that assortative mating results in an excess of genetic homozygosity [29]. However, we do not analyze homozygosity of specific genetic variants per se, as is normally done; rather, we evaluate excess homozygosity, or the lack thereof, for ancestry-specific haplotypes (Fig. 2b). By merging this approach with data on the genetic architecture of polygenic human phenotypes, we were able to uncover specific traits that inform ancestry-based assortative mating.
This is because, when individuals exercise mate choice decisions based on ancestry, they must do so using phenotypic cues that are ancestry-associated. In other words, ancestry-based assortative mating is, by definition, predicated upon traits that vary in expression among human population groups. An obvious example of this is skin color [32], and studies have indeed shown skin color to be an important feature of assortative mating [42][43][44][45]. It follows that the assortative mating traits that our study uncovered in admixed Latin American populations must be both genetically heritable and variable among African, European and Native American population groups.

The anthropometric traits found in our study (body mass, height, waist-hip ratio, and facial development) are both heritable and known to vary among the continental population groups that admixed to form modern Latin American populations. This implies that the genetic variants that influence these traits should also vary among these populations. Accordingly, it is readily apparent that mate choice decisions based on these physical features could track local genetic ancestry. Interpretation of the neurological traits that show evidence of local ancestry-based assortative mating (schizophrenia and educational attainment) is not quite as straightforward. For schizophrenia, it is far more likely that we are analyzing genetic loci associated with a spectrum of personality traits that influence assortative mating, as opposed to mate choice based on full-blown schizophrenia, and indeed personality traits are widely known to impact mate choice decisions [19,22,25]. In addition, since schizophrenia prevalence does not vary greatly world-wide [46], it is more likely that ancestry-based assortative mating for this trait is tracking an underlying endophenotype rather than the disease itself.
While educational attainment outcomes are largely [...] show evidence for disassortative mating (Fig. 5c). Interestingly, disassortative mating for HLA loci in [...] higher levels of African ancestry compared to Mexico and Peru. The population- and ancestry-specific dynamics of MHC-dependent mate choice revealed here underscore the complexity of this issue.

Assortative mating alone is not expected to change the frequencies of alleles, or ancestry fractions in the case of our study, within a population. Assortative mating does, however, change genotype frequencies, resulting in an excess of homozygous genotypes. Accordingly, ancestry-based assortative mating is expected to yield an excess of homozygosity for local ancestry assignments (i.e., ancestry-specific haplotypes) (Fig. 2b). By increasing homozygosity in this way, assortative mating also increases the population genetic variance for the traits that influence mate choice. In other words, assortative mating will lead to more extreme, and fewer intermediate, phenotypes than expected by chance. This population genetic consequence of assortative mating allowed us to evaluate the extent to which specific continental ancestries drive mate choice decisions in admixed populations, since specific ancestry drivers of assortative mating are expected to have increased variance. We found that the fractions of African ancestry have the highest variance among individuals for all four populations, consistent with the idea that traits that are associated with African ancestry drive most of the local ancestry-based assortative mating seen in this study (Fig. 6). Indeed, our results underscore the prevalence of ancestry-based assortative mating in modern Latin [...] by allowing us to home in on the phenotypic cues that underlie ancestry-based assortative mating.
Our method also illuminates the specific ancestry components that drive assortative mating for different traits and makes predictions regarding traits that should vary among continental population groups. [...]

Whole genome sequence data and genotypes were merged, sites common to all datasets were kept, and single nucleotide polymorphism (SNP) strand orientation was corrected as needed, using PLINK version 1.9 [37]. The resulting dataset consisted of 1,645 individuals from 38 populations with variants characterized for 239,989 SNPs. The set of merged SNP genotypes was phased, using the program SHAPEIT version 2.r837 [38], with the 1KGP haplotype reference panel. This phased set of SNP genotypes was used for local ancestry analysis. PLINK was used to further prune the phased SNPs for linkage, yielding a pruned dataset containing 58,898 linkage-independent SNPs. This pruned set of SNP genotypes was used for global ancestry analysis.

To infer continental (global) ancestry of the four admixed Latin American populations, ADMIXTURE [39] was run on the pruned SNP genotype dataset (n=58,898). ADMIXTURE was run using K=4, yielding African, [...]

Continental African, European, and Native American populations were used as reference populations, and contiguous regions with the same ancestry assignment, i.e., ancestry-specific haplotypes, were delineated where the RFMix ancestry assignment certainty was at least 99%.

Autosomal NCBI RefSeq coding genes were accessed from the UCSC Genome Browser and mapped to the ancestry-specific haplotypes characterized for each admixed Latin American individual. For each diploid genome analyzed here, individual genes can have 0, 1 or 2 ancestry assignments depending on the number of high confidence ancestry-specific haplotypes at that locus. Our assortative mating index (AMI, see below) can only be computed for genes that have 2 ancestry assignments in any given individual, i.e., cases where the ancestry is assigned for both copies of the gene. Thus, for each Latin American population, the mean ($\mu$) and standard deviation ($\sigma$) of the number of genes with 2 ancestry assignments were calculated and used to compute an ancestry genotype threshold for the inclusion of genes in subsequent analyses. Genes were used in subsequent assortative mating analyses only if they were present above the ancestry genotype threshold of $\mu - \sigma$.
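The mean-minus-one-standard-deviation inclusion rule described above can be illustrated with a short Python sketch. It rests on an assumption about the unit of counting: here each gene is scored by the number of individuals in a population who carry two ancestry assignments for it, and the mean and standard deviation are taken across genes. The function name, the gene names, the toy counts and the use of the sample standard deviation are all hypothetical choices.

```python
from statistics import mean, stdev

def genes_passing_threshold(n_two_assignments):
    """Keep genes whose count of fully assigned individuals exceeds mean - sd.

    n_two_assignments: dict mapping gene name -> number of individuals in one
    population with ancestry assigned on both copies of that gene (assumed unit).
    """
    values = list(n_two_assignments.values())
    threshold = mean(values) - stdev(values)  # sample sd; an assumption
    kept = {gene for gene, n in n_two_assignments.items() if n > threshold}
    return kept, threshold

# Toy counts for four hypothetical genes in one population.
counts = {"GENE_A": 90, "GENE_B": 85, "GENE_C": 40, "GENE_D": 88}
kept, threshold = genes_passing_threshold(counts)
print(sorted(kept), round(threshold, 2))
```

A gene with unusually few fully assigned individuals, such as the hypothetical GENE_C, falls below the threshold and is excluded.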

Gene sets for polygenic phenotypes
The polygenic genetic architectures of phenotypes that could be affected by assortative mating were characterized using a variety of studies taken from the NHGRI-EBI GWAS Catalog [40], the Genetic [...]

For each polygenic phenotype, all SNPs previously implicated at genome-wide significance levels of $P \le 10^{-8}$ were collected as the phenotype SNP set. The gene sets for the polygenic phenotypes were collected by directly mapping trait-associated SNPs to genes. SNPs were used to create a gene set only if the SNP fell directly within a gene; thus, no intergenic SNPs were used in creating gene sets. Gene sets from the GWAS Catalog were mapped from SNPs using EBI's in-house pipeline. Sets from GIANT were mapped according to the specifications of each individual paper. Gene sets from literature searching were mapped using NCBI's dbSNP. For each Latin American population, phenotype gene sets were filtered to only include genes that passed the ancestry genotype threshold, as described previously. Finally, the polygenic phenotype gene sets were filtered based on size, so that all polygenic phenotypes included two or more genes. The final data set contains gene sets for 106 polygenic phenotypes, hierarchically organized into three functional categories, including 986 unique genes (Additional file 1: Figure S5).
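The genic-SNP rule and the minimum gene set size described above can be sketched in Python as follows. This is only an illustration of the filtering logic, assuming simple inclusive gene intervals; it does not reproduce the EBI, GIANT or dbSNP mapping pipelines, and every name and coordinate below is hypothetical.

```python
def map_snps_to_genes(snps, genes):
    """Return the genes that directly contain at least one trait-associated SNP.

    snps: list of (chrom, pos); genes: list of (name, chrom, start, end) with
    inclusive coordinates. Intergenic SNPs match no gene and are ignored.
    """
    hits = set()
    for chrom, pos in snps:
        for name, g_chrom, start, end in genes:
            if chrom == g_chrom and start <= pos <= end:
                hits.add(name)
    return hits

def build_gene_sets(trait_snps, genes, min_genes=2):
    """Map each trait's SNPs to genes and keep traits with >= min_genes genes."""
    gene_sets = {t: map_snps_to_genes(s, genes) for t, s in trait_snps.items()}
    return {t: g for t, g in gene_sets.items() if len(g) >= min_genes}

# Hypothetical gene intervals and trait SNP sets.
genes = [("GENE_A", "1", 100, 200), ("GENE_B", "1", 500, 900), ("GENE_C", "2", 50, 150)]
trait_snps = {
    "trait_x": [("1", 150), ("1", 600), ("1", 300)],  # third SNP is intergenic
    "trait_y": [("2", 100), ("2", 400)],              # maps to a single gene
}
print(build_gene_sets(trait_snps, genes))
```

Here the hypothetical trait_y is dropped because only one of its SNPs falls within a gene, leaving a gene set smaller than two genes.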

Assortative mating index (AMI)
To assess local ancestry-based assortative mating, we developed the assortative mating index (AMI), a log odds ratio test statistic that computes the relative local ancestry homozygosity compared to heterozygosity for any given gene. Ancestry homozygosity occurs when both copies of a gene in a genome have the same local ancestry, whereas ancestry heterozygosity refers to a pair of gene copies in a genome with different local ancestry assignments. The assortative mating index ($AMI$) is calculated as:

$$AMI = \log\left(\frac{obs_{hom}/exp_{hom}}{obs_{het}/exp_{het}}\right)$$

where $obs_{hom}/exp_{hom}$ is the ratio of the observed and expected local ancestry homozygous gene pairs and $obs_{het}/exp_{het}$ is the ratio of the observed and expected local ancestry heterozygous gene pairs.

The observed values of local ancestry homozygous and heterozygous gene pairs are taken from the gene- [...] The expected values follow the Hardy-Weinberg triallelic model, with the three allele frequencies given by the locus-specific African ($a$), European ($e$) and Native American ($n$) ancestry fractions, so that the expected genotype frequencies are $a^2 + 2ae + 2an + e^2 + 2en + n^2$. Accordingly, the expected frequency of homozygous pairs is $a^2 + e^2 + n^2$ and the expected frequency of heterozygous pairs is $2ae + 2an + 2en$. For each gene, in each population, the expected homozygous and heterozygous frequencies are multiplied by the number of individuals with two ancestry assignments for that gene to yield the expected counts of gene pairs in each class.
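A worked Python sketch of the AMI calculation for a single gene is given below, assuming a log odds ratio of the observed-to-expected homozygous ratio over the observed-to-expected heterozygous ratio, with expectations from the Hardy-Weinberg triallelic model. The function names, the symbols `a`, `e` and `n` for the locus-specific African, European and Native American ancestry fractions, the use of the natural logarithm and the toy counts are assumptions for illustration.

```python
import math

def expected_frequencies(a, e, n):
    """Hardy-Weinberg triallelic expectations for ancestry genotypes.

    a, e, n: locus-specific ancestry fractions (assumed to sum to 1).
    """
    hom = a * a + e * e + n * n
    het = 2 * a * e + 2 * a * n + 2 * e * n
    return hom, het

def ami(obs_hom, obs_het, a, e, n):
    """Log odds ratio of ancestry homozygosity versus heterozygosity.

    Natural log is an assumption; counts of zero are not handled here.
    """
    total = obs_hom + obs_het  # individuals with two ancestry assignments
    f_hom, f_het = expected_frequencies(a, e, n)
    exp_hom, exp_het = f_hom * total, f_het * total
    return math.log((obs_hom / exp_hom) / (obs_het / exp_het))

# Toy gene: ancestry fractions 0.1 / 0.5 / 0.4 give expected hom = 0.42, het = 0.58.
print(ami(42, 58, 0.1, 0.5, 0.4))  # observed matches expected
print(ami(60, 40, 0.1, 0.5, 0.4))  # excess homozygosity
print(ami(30, 70, 0.1, 0.5, 0.4))  # excess heterozygosity
```

When the observed counts match the Hardy-Weinberg expectation the index is approximately zero; an excess of ancestry homozygosity gives a positive value and an excess of heterozygosity gives a negative one.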
For each polygenic phenotype, a meta-analysis of gene-specific AMI values was conducted to evaluate the effect of all of the genes involved in the phenotype on assortative mating, using the metafor [41] package in R. The 95% confidence intervals for each gene, meta-gene AMI values, significance P-values, and false discovery rate q-values were computed using the Mantel-Haenszel method under a fixed-effects model.
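To make the pooling step concrete, the following Python sketch computes a Mantel-Haenszel pooled log odds ratio across the genes of one phenotype. It shows the standard Mantel-Haenszel estimator only, not the metafor implementation, and it omits confidence intervals and q-values. The layout of each gene's 2x2 table as (observed homozygous, observed heterozygous, expected homozygous, expected heterozygous) is an assumption, as are the function name and the toy counts.

```python
import math

def mantel_haenszel_log_or(tables):
    """Pooled log odds ratio over 2x2 tables using the Mantel-Haenszel estimator.

    tables: list of (obs_hom, obs_het, exp_hom, exp_het), one per gene
    (assumed layout). Each gene acts as one stratum.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return math.log(numerator / denominator)

# Toy phenotype with two hypothetical genes.
gene_tables = [(60, 40, 42, 58), (55, 45, 50, 50)]
pooled = mantel_haenszel_log_or(gene_tables)
per_gene = [math.log((a * d) / (b * c)) for a, b, c, d in gene_tables]
print(round(pooled, 3), [round(x, 3) for x in per_gene])
```

The pooled estimate lies between the two per-gene log odds ratios, with each gene weighted by its table size, which is how a fixed-effects summary combines evidence across the genes of a polygenic phenotype.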

Permutation of random mating
A standard permutation testing framework was adopted for the approximation of random mating in each of the four Latin American populations. Random mating was approximated by randomly combining pairs of individual phased haplotypes from a population to yield permuted diploid genotypes. Haploid chromosomes were permuted randomly within each population using the Fisher-Yates shuffle. After [...]

Significance testing for the difference between the observed and expected AMI distributions was completed using the t.test function in R. The metafor package, used for calculating the meta-analysis AMI values, also calculates a P-value and a false discovery rate q-value to correct for multiple statistical tests, which were used for identifying polygenic phenotypes that are significantly influenced by local ancestry-based assortative mating in each Latin American population. The variance of AMI values across the four populations for each phenotype was calculated as it is implemented in R and used for identifying [...] based assortative mating patterns. The coefficient of variation was used to measure the inter-individual variance for each of the three continental ancestry components within the four admixed Latin American populations analyzed here.

Declarations
Ethics approval and consent to participate
The de-identified human genome sequence data analyzed here are made publicly available as part of the 1000 Genomes Project and the Human Genome Diversity Project.

Consent for publication
Not applicable

Availability of data and materials
1000 Genomes Project data are available from http://www.internationalgenome.org/data/
Human Genome Diversity Project data are available from http://www.hagsc.org/hgdp/
Previously published Native American genotype data can be accessed through a data use agreement governed by the University of Antioquia, as previously described [36].

Competing interests
The authors declare that they have no competing interests.

Figure: Assortative mating meta-analysis.

Figure caption: The top 20 phenotypes with the highest or lowest, and most statistically significant, AMI variance levels across populations. Across-population variance levels are normalized using the average AMI population variance level for all phenotypes. All AMI variance levels shown are significant at q<0.05. The phenotypes with the highest AMI variance (most dissimilar patterns) are at the top, while those with the lowest AMI variance (most similar patterns) are at the bottom.