# POLYPLOID POPULATION GENETICS AND EVOLUTION - FROM THEORY TO PRACTICE

EDITED BY: Hans D. Daetwyler and Richard John Abbott

PUBLISHED IN: Frontiers in Ecology and Evolution, Frontiers in Genetics and Frontiers in Plant Science

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-390-6 DOI 10.3389/978-2-88963-390-6


# POLYPLOID POPULATION GENETICS AND EVOLUTION - FROM THEORY TO PRACTICE

Topic Editors: Hans D. Daetwyler, La Trobe University, Australia; Richard John Abbott, University of St Andrews, United Kingdom

Citation: Daetwyler, H. D., Abbott, R. J., eds. (2020). Polyploid Population Genetics and Evolution - From Theory to Practice. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-390-6

# Table of Contents

*Editorial: Polyploid Population Genetics and Evolution—From Theory to Practice*

Abdulqader Jighly, Richard J. Abbott and Hans D. Daetwyler


Ivan Jelenić, Anna Selmecki, Liedewij Laan and Nenad Pavin


Brian J. Knaus and Niklaus J. Grünwald


Abdulqader Jighly, Reem Joukhadar, Sukhwinder Singh and Francis C. Ogbonnaya

# Editorial: Polyploid Population Genetics and Evolution—From Theory to Practice

#### Abdulqader Jighly <sup>1</sup> \*, Richard J. Abbott <sup>2</sup> and Hans D. Daetwyler <sup>1,3</sup>

*<sup>1</sup> Agriculture Victoria, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, Australia, <sup>2</sup> School of Biology, University of St Andrews, St Andrews, United Kingdom, <sup>3</sup> School of Applied Systems Biology, La Trobe University, Melbourne, VIC, Australia*

Keywords: allopolyploids, autopolyploids, polyploidization, mathematical biology, simulation

#### **Editorial on the Research Topic**

#### **Polyploid Population Genetics and Evolution—From Theory to Practice**

Despite polyploids being widespread and of great importance in eukaryotic diversification, our understanding of the dynamics of the evolution and inheritance of polyploids is less advanced than for diploids. The challenges in studying the population genetics and evolution of polyploids reside in the presence of more than two homoeologous ("diverged but related") chromosome copies in allopolyploids or homologous ("identical") chromosome copies in autopolyploids. Moreover, diploidization processes following polyploidy create further challenges in inferring paleo-polyploidization (ancient polyploidization) events, which complicate the study of diverged homo(eo)logous genes and the modeling of ecological factors affecting polyploids and their interactions with diploid ancestors. Statistical methods originally developed for the diploid mode of inheritance are generally biased when applied to polyploids, creating an urgent need to develop new methods for studying the evolutionary dynamics and modes of inheritance of polyploids (Dufresne et al., 2014; Meirmans et al., 2018). The aim of this Research Topic is to enhance our current understanding of the population genetics and evolution of polyploids and to highlight the practical applications that flow from such understanding. The collection of 12 papers covers four main areas of investigation: (1) the establishment of polyploids and long-term evolutionary consequences of polyploidy; (2) the evolution of gene expression, gene families, and chromosomes in polyploids; (3) the development of novel statistical polyploid-friendly population genetics models; and (4) the practical applications of different statistical models in polyploid trait evolution, quantitative genetics, and plant breeding.

With regard to the first of these topics, Baduel et al. provide a comprehensive review of factors affecting the successful establishment of newly formed polyploids in the wild and the short- and long-term costs and benefits that emanate from polyploidy. In this context, they discuss recent relevant ecological, physiological, cytological and genomic research, and underline the "wondrous cycles" of polyploidy (Wendel, 2015) in which polyploidization repeatedly happens after diploidization events over long evolutionary timescales. The advantages and disadvantages of polyploidy are further considered by Gaynor et al., but from the standpoint of a macro-scale study of the effects of polyploidization on the geographical community structure of two widely distributed flowering plant families, the Brassicaceae and Rosaceae, both of which have experienced multiple rounds of polyploidization events in the past. By combining cytogeographical information with phylogenetic analyses of plant communities in these two families across the USA, they show that communities may be shaped in diverse ways by polyploidy, but that impacts of genome duplication are not clear cut and are lineage specific. They highlight the need for much greater information on ploidal variation across species' ranges to provide a deeper understanding of the effects of genome duplication on plant community structure.

#### Approved by:

*Genlou Sun, Saint Mary's University, Canada*

> \*Correspondence: *Abdulqader Jighly abdulqader.jighly@agriculture.vic.gov.au*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Ecology and Evolution*

> Received: *06 September 2019* Accepted: *15 November 2019* Published: *28 November 2019*

#### Citation:

*Jighly A, Abbott RJ and Daetwyler HD (2019) Editorial: Polyploid Population Genetics and Evolution—From Theory to Practice. Front. Ecol. Evol. 7:460. doi: 10.3389/fevo.2019.00460*

Following polyploidization, alterations to chromosome number and structure as well as gene function, expression, and copy number may occur and feature prominently in diploidization (Ohno, 1970; Tate et al., 2009; Conant et al., 2014; Jighly et al., 2019). With regard to changes in chromosome number, Jelenić et al. develop a mitotic mathematical model to predict the chromosome loss rate in polyploids before testing it in polyploid cells of the yeast, Saccharomyces cerevisiae. The model depends on spindle dynamics and the maximum duration of mitotic arrest. They show that a small change in spindle assembly time can cause a massive increase in the rate of chromosome loss in tetraploid cells. Focusing on gene expression and function, Takahagi et al. analyze 727 previously published RNA sequence datasets of hexaploid wheat collected from different developmental stages, tissues, and environmental conditions to examine differences in expression profiles. They observe genes that are present and expressed in triplets, doublets, or specifically in one subgenome, contributing to broad biological functions and annotations. With regard to gene family changes, Mable et al. report an analysis of European diploid and tetraploid Arabidopsis lyrata and Arabidopsis arenosa populations to infer the complex evolution of the "S-receptor kinases" (SRK) gene family. This gene family is involved in the female component of genetically controlled self-incompatibility and is subject to strong balancing selection (Castric and Vekemans, 2007). In turn, they examine how the diversity of SRK alleles in tetraploids compares with that in diploid relatives, whether there is increased trans-specific polymorphism in tetraploids for these genes, if introgression occurs among species and ploidy levels, and whether copy number variation exists among paralogs.

Developing and extending widely used diploid theories and statistical models to fit polyploids is an important aim of the Research Topic. Meirmans and Liu extend the widely used analysis of molecular variance (AMOVA) to autopolyploids. This can be regarded as a significant step forward, given that AMOVA has been widely employed in analyzing the population genetics of diploids since it was first developed by Excoffier et al. (1992). Similarly, site frequency spectrum (SFS) based methods such as the neutrality tests Tajima's D (Tajima, 1989) and Fay and Wu's H (Fay and Wu, 2000), as well as heterozygosity-based measures of allelic variation such as Tajima's estimator of nucleotide diversity (Tajima, 1983), are widely used in diploid population genetics. Together with other SFS methods applied to high-throughput sequencing data, Ferretti et al. extend their application to autopolyploid populations and discuss their bias when applied to small populations. Detecting gene copy number variation is one of the most challenging tasks in the population genetic analysis of autopolyploids, leading Knaus and Grünwald to develop an R package, "vcfR", to infer copy number variation in polyploids. The novelty of their method is that it does not require the copy number of genomic regions (or alleles) to be specified a priori; instead, vcfR infers them from the frequency of the most abundant alleles. Bourke et al. review the existing methods applied to experimental autopolyploid populations, such as breeding populations. They focus on methods of genotyping of polyploids, physical and genetic mapping procedures, simulating polyploid breeding populations, and quantitative genetic analyses including quantitative trait loci (QTL) mapping, genome wide association studies (GWAS) and genomic prediction in polyploids.
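The idea of inferring copy number from the frequency of the most abundant allele can be illustrated with a minimal Python sketch. This is not the vcfR interface (vcfR is an R package); the function names, the table of expected ratios, and the use of a per-window median are all assumptions made for illustration. At a heterozygous site, the most abundant allele is expected to account for about 1/2 of reads in a diploid region, 2/3 in a triploid region, and 3/4 in a tetraploid region.

```python
from statistics import median

# Expected read fraction of the most abundant allele at a heterozygous site.
# Illustrative assumption: a balanced 2/4 tetraploid site is indistinguishable
# from a diploid site by allele balance alone, so 1/2 is assigned to copy number 2.
EXPECTED_MAJOR_BALANCE = {2: 1 / 2, 3: 2 / 3, 4: 3 / 4}


def major_allele_balance(ref_depth, alt_depth):
    """Fraction of reads supporting the most abundant of the two alleles."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("site has no reads")
    return max(ref_depth, alt_depth) / total


def infer_copy_number(sites):
    """Assign a copy number to one genomic window.

    sites: list of (ref_depth, alt_depth) pairs for heterozygous sites.
    The window is summarised by the median major-allele balance, which is
    then matched to the nearest expected ratio.
    """
    if not sites:
        raise ValueError("no heterozygous sites in window")
    observed = median(major_allele_balance(r, a) for r, a in sites)
    return min(
        EXPECTED_MAJOR_BALANCE,
        key=lambda k: abs(EXPECTED_MAJOR_BALANCE[k] - observed),
    )
```

For example, a window whose heterozygous sites have read depths such as (66, 34), (30, 70) and (65, 35) would be assigned a copy number of 3 under these assumptions. A real analysis would also need to filter on sequencing depth and handle ambiguous windows.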

The final papers that comprise this Research Topic focus on the applications of polyploid population genetics in plant breeding. Ferrão et al. use a large breeding population of 1,575 autotetraploid blueberry individuals to dissect the genetic basis of eight fruit related traits and detect QTL associated with genotyping-by-sequencing based single nucleotide polymorphism (SNP) markers. They call their SNPs twice, with diploid and with tetraploid genotype coding, to compare the effect of diploid-like calling on GWAS results. Diploid coding resulted in shorter linkage disequilibrium blocks and a much smaller number of significantly associated QTL, indicating the importance of using a tetraploid model. As an alternative to using tetraploid SNP coding, Manrique-Carpintero et al. developed a dihaploid potato population and conducted QTL mapping for vigor, height, and different tuber traits. Finally, in an examination of a population of synthetic allohexaploid wheat (Triticum turgidum – AABB × Aegilops tauschii – DD), Jighly et al. divided the additive variance for 12 biotic and abiotic stresses among the 21 chromosomes representing the A, B, and D subgenomes. They found that the wild D subgenome had the highest contribution to the additive variance in most traits, while the A subgenome had the lowest. They also reported a weak but significant positive correlation between the cumulative size of each of three homoeologous chromosomes and their cumulative additive variance.

The articles published on this Research Topic provide a body of knowledge in the field of polyploid population genetics and evolution. Though much progress has been made in this area, many challenges remain. Of particular importance will be the further development of robust statistical models for polyploids and the effective and efficient simulation of their population genetic and genomic complexities (Dufresne et al., 2014; Jighly et al., 2018). This will allow the testing of models and assumptions under complex evolutionary and demographic scenarios with validation using empirical data. Given the economic, ecological, and evolutionary importance of polyploidy, further concentrated research efforts are required to advance population genetic theories and applications that relate directly to polyploid species.

# AUTHOR CONTRIBUTIONS

AJ provided the first draft. RA and HD edited the manuscript and provided additional text. All authors revised and approved the manuscript for submission.

# REFERENCES


and their ancestors. Mol. Ecol. Res. 18, 1157–1172. doi: 10.1111/1755-0998.12896


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Jighly, Abbott and Daetwyler. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The "Polyploid Hop": Shifting Challenges and Opportunities Over the Evolutionary Lifespan of Genome Duplications

#### Pierre Baduel <sup>1</sup> , Sian Bray <sup>2</sup> , Mario Vallejo-Marin <sup>3</sup> , Filip Kolář <sup>4,5</sup> and Levi Yant <sup>2,6</sup> \*

<sup>1</sup> Institut de Biologie de l'École Normale Supérieure, Paris, France, <sup>2</sup> Department of Cell and Developmental Biology, John Innes Centre, Norwich, United Kingdom, <sup>3</sup> Biological and Environmental Sciences, University of Stirling, Stirling, United Kingdom, <sup>4</sup> Department of Botany, University of Innsbruck, Innsbruck, Austria, <sup>5</sup> Department of Botany, Faculty of Science, Charles University in Prague, Prague, Czechia, <sup>6</sup> School of Life Sciences and Future Food Beacon, University of Nottingham, Nottingham, United Kingdom

#### Edited by:

Richard John Abbott, University of St Andrews, United Kingdom

#### Reviewed by:

Barbara K. Mable, University of Glasgow, United Kingdom; Christian Parisod, University of Neuchâtel, Switzerland; Andreas Madlung, University of Puget Sound, United States

> \*Correspondence: Levi Yant leviyant@gmail.com

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Ecology and Evolution

> Received: 25 April 2018 Accepted: 23 July 2018 Published: 20 August 2018

#### Citation:

Baduel P, Bray S, Vallejo-Marin M, Kolář F and Yant L (2018) The "Polyploid Hop": Shifting Challenges and Opportunities Over the Evolutionary Lifespan of Genome Duplications. Front. Ecol. Evol. 6:117. doi: 10.3389/fevo.2018.00117

The duplication of an entire genome is no small affair. Whole genome duplication (WGD) is a dramatic mutation with long-lasting effects, yet it occurs repeatedly in all eukaryotic kingdoms. Plants are particularly rich in documented WGDs, with recent and ancient polyploidization events in all major extant lineages. However, challenges immediately following WGD, such as the maintenance of stable chromosome segregation or detrimental ecological interactions with diploid progenitors, commonly do not permit establishment of nascent polyploids. Despite these immediate issues, some lineages nevertheless persist and thrive. In fact, ecological modeling commonly supports patterns of adaptive niche differentiation in polyploids, with young polyploids often invading new niches and leaving their diploid progenitors behind. In line with these observations of polyploid evolutionary success, recent work documents instant physiological consequences of WGD associated with increased dehydration stress tolerance in first-generation autotetraploids. Furthermore, population genetic theory predicts both short- and long-term benefits of polyploidy, and new empirical data suggest that established polyploids may act as "sponges" accumulating adaptive allelic diversity. In addition to their increased genetic variability, introgression with other tetraploid lineages, diploid progenitors, or even other species further increases the pool of genetic variants available to polyploids. Despite this, the evolutionary advantages of polyploidy are still questioned, and the debate over the idea of polyploidy as an evolutionary dead-end carries on. Here we broadly synthesize the newest empirical data moving this debate forward. Altogether, evidence suggests that if early barriers are overcome, WGD can offer instantaneous fitness advantages opening the way to a transformed fitness landscape by sampling a higher diversity of alleles, including some already preadapted to their local environment. This occurs in the context of intragenomic, population genomic, and physiological modifications that can, on occasion, offer an evolutionary edge. Yet in the long run, early advantages can turn into long-term hindrances, and without ecological drivers such as novel ecological niche availability or agricultural propagation, a restabilization of the genome via diploidization will begin the cycle anew.

Keywords: polyploidy, selection, population genetics, evolution, autopolyploidy, genome duplication

# INTRODUCTION

Whole genome duplication (WGD) is a pervasive event in the evolution of eukaryotes, with an especially strong representation throughout the plant kingdom. Despite this prevalence, however, WGD is no easy victory: the presence of an extra set of chromosomes creates numerous biological challenges ranging from chromosome mis-segregation to altered gene expression, changes in cell size or in intracellular physiology (Ramsey and Schemske, 2002; Osborn et al., 2003; Adams and Wendel, 2005; Comai, 2005; Chen and Ni, 2006; Chen, 2007; Otto, 2007; Parisod et al., 2010; Brownfield and Köhler, 2011; Hollister, 2015). Numerous studies have shown evidence of dysfunction in newly formed polyploids across kingdoms: in plants, fungi and animals, including notably cancer cells (Ramsey and Schemske, 2002; Storchova and Pellman, 2004; Comai, 2005; Yant and Bomblies, 2017). Even more, polyploid establishment is substantially constrained by overall low chances of polyploid mutants persisting among their diploid progenitors (Levin, 2002; Ramsey and Schemske, 2002). This stems from both direct competition between the two cytotypes (Yamauchi et al., 2004) and frequency dependent selection (Levin, 1975), which suggests most autopolyploids are likely to go extinct before establishment (Levin, 1975; Husband, 2000).

Despite these challenges, WGD events have occurred repeatedly throughout the evolution of eukaryotes (Gregory and Mable, 2005; Wood et al., 2009; Wendel, 2015), leading to an abundance of established polyploid species in the wild. There is clear evidence of WGD in the ancestry of most plant lineages (Vision et al., 2000; Bowers et al., 2003; Paterson et al., 2004; Schlueter et al., 2004; Pfeil et al., 2005; Barker et al., 2008; Burleigh et al., 2008; Shi et al., 2010). Among angiosperms, it is estimated that 30–70% have undergone additional WGD events (Stebbins, 1938; Grant, 1963; Goldblatt, 1980; Masterson, 1994; Wood et al., 2009; Mayrose et al., 2011; Ruprecht et al., 2017). Polyploidy is also common among important crops (e.g., wheat, maize, sugar cane, coffee, cotton, potato, and tobacco), suggesting that WGD is often a key factor in successful crop domestication (Salman-Minkov et al., 2016). Obtaining precise estimates of the extent of polyploidy can be complicated, in part due to difficulties obtaining direct empirical evidence. For example, autopolyploids, resulting from within-species genome duplication, are often not considered a separate species from their diploid progenitors. As a result, their overall abundance compared to allopolyploids (where two distinct genomes are combined) has historically been underestimated (Soltis et al., 2007; Barker et al., 2016).

#### Polyploidy, a High-Risk, High-Gain Path

Frequency estimates of WGD increase in habitats affected by environmental disturbances (Favarger, 1984; Brochmann et al., 2004; Parisod et al., 2010). Concordant with this observation, in diploid-polyploid systems overlapping former glaciation limits, polyploids are found more frequently in the previously glaciated areas while their diploid progenitors commonly remain or retreat within former refugia (Ehrendorfer, 1980; Kadereit, 2015). For example, in Arabidopsis, both auto- and allopolyploidization events were estimated to coincide with glacial maxima (Novikova et al., 2018). Beyond the fact that environmental stochasticity both increases the rate of WGD and provides new space for colonization (Baack, 2005; Fawcett et al., 2009; Oswald and Nuismer, 2011), such observations implicate WGD in speciation and adaptive radiation (Wood et al., 2009) and support the long-standing hypothesis that WGD per se can potentiate evolutionary adaptation, although evidence for this is somewhat mixed. Clear empirical evidence from in vitro evolution experiments in yeast, demonstrating that tetraploids adapted faster than lower ploidies (Selmecki et al., 2015), has bolstered this hypothesis. However, complementary approaches such as ecological niche modeling do not always support niche innovation in polyploids (Glennon et al., 2014). For example, in primroses, the niches occupied by the three polyploid species (tetraploid, hexaploid and octoploid) were distinct relative to the diploid progenitor but they were also narrower (Theodoridis et al., 2013).

Here we synthesize recent advances in polyploid research from new genomic, ecological, and cytological analyses with older observations and theoretical arguments into two primary dimensions (**Figure 1**): consequences (challenges vs. gains) of WGD and their time-span (short-term vs. long-term). To address specifically the effects of WGD, we focus on autopolyploids, which arise from within-species WGD events and thus carry four or more homologous copies of each chromosome (for a clear depiction, see Bomblies and Madlung, 2014). We thus largely set aside allopolyploids (polyploid hybrids), which strongly confound the effects of WGD with hybridization. On many fronts, recent results from autopolyploid systems have confirmed earlier theoretical predictions, but some have unveiled surprising new results in the context of a wide range of biological processes. Most strikingly, the population genetic consequences of WGD have been the subject of ample theoretical arguments despite thin experimental support to date, but this is changing. Our synthesis paints an overall picture of autopolyploidy as a high-risk high-gain path, where long-term complications often outweigh initial benefits while paving the way for re-diploidization. This depiction strengthens the idea of polyploidy as a transitionary state, a "hop," which has seen growing support from the polyploid community (Escudero et al., 2014; Wendel, 2015).

# THE SHORT-TERM CHALLENGES

## Meiosis

Perhaps the most stringent challenge faced by a nascent autopolyploid is directly tied to the very process of reproduction. A sudden doubling of homologous chromosome number disrupts regular meiotic pairing and segregation: instead of each chromosome having only one homolog with which to pair, in autotetraploids there are suddenly three.

The situation in most diploids is relatively straightforward, with proper chromosome pairing during synapsis typically relying on programed double-strand DNA breaks and a sequence-based homology search for the homolog using these broken fragments (Grelon et al., 2001; Page and Hawley, 2003; Stacey et al., 2006; Hartung et al., 2007). Once homologous chromosomes have aligned, a small fraction of these breaks mature into crossovers (COs) between homologs, thus creating bivalent chromosome pairs physically linked by the CO. These bivalents then align parallel to the poles, at which point the COs are essential for the creation of mechanical tension between the assembling spindles via connections to each centromere. This tension transmitted through the obligate COs ensures the correct orderly segregation of chromosomes and further acts as an essential cell cycle checkpoint allowing progression to anaphase (Lampson and Cheeseman, 2011; Campbell and Desai, 2013).

In newly formed polyploids however, the presence of multiple equivalent partners for each chromosome leads to more complex arrays of CO formations. If there is more than one crossover per bivalent, a single homolog can have two separate partners, resulting in a multivalent. Most multivalent configurations are not conducive to the formation of regular tension (Bomblies et al., 2016) and some entirely fail to involve one homolog, leading to mis-segregation and aneuploidy. Compounding this, in nascent autopolyploids entanglements and interlocks can occur between non-homologs. If left unresolved these can result in catastrophic chromosome damage, losses and rearrangements. Such entanglements and interlocks do occur in diploids but are much more commonly resolved (Storlazzi et al., 2010). Thus, meiotic stability is a key hurdle that must be overcome following WGD and is one of the hallmarks of an adapted polyploid. Indeed, loci that encode genes controlling meiotic recombination and crossovers are strongly implicated in adaptation to WGD (Hollister et al., 2012; Yant et al., 2013).

Of course, one way to bypass unstable meiosis is to simply not use it. It has long been recognized that asexual reproduction (vegetative propagation and agamospermy) and WGD are correlated, with polyploids displaying elevated rates of asexual propagation compared to diploid relatives (Manning and Dickson, 1986; Schinkel et al., 2016; Herben et al., 2017). Such a reproductive strategy may even confer short-term benefits. Asexual reproduction is considered advantageous during range expansion (Baker, 1967), and this may go some way in explaining the invasive nature of many polyploids. Likewise, vegetative propagation could be a short-term fix, buying time for stable meiosis to evolve (Vallejo-Marín and Hiscock, 2016) and reducing the frequency of mating with the diploid progenitors (see section Cytotype and Competitive Exclusion below). Despite potential short-term gains, however, asexuality may not be a viable long-term strategy, and many polyploids are sexual.

In established sexual autopolyploids meiotic instability is often resolved (Yant et al., 2013; Bomblies et al., 2015), with chromosomes forming bivalents or multivalents that segregate regularly. Indeed, there are multiple conceivable ways to modify meiosis and escape genomic instability, but there is considerable empirical work yet to do to learn how many of these exist in nature, much less the mechanistic basis of different solutions. An elegant theory suggests that simply increasing the degree of CO interference could solve this problem (Bomblies et al., 2016). Under this theory if the range of CO interference is greater than the distance to the end of the chromosome, the number of COs would be reduced to one, and if the range of CO interference is only slightly smaller than the chromosome length the COs will be terminalized. This would favor conformations that produce appropriate tension leading to orderly anaphase (Bomblies et al., 2016). Whatever the mechanism, stabilizing meiosis would seem the best solution given the advantages of sex in the long run.
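The geometric logic of the CO interference argument can be sketched in a few lines of Python. The sketch treats interference as a hard exclusion zone, meaning that no two COs on the same chromosome may lie closer together than the interference range, which is a strong simplifying assumption; real interference is probabilistic. Function names and units are hypothetical.

```python
def max_crossovers(length, interference):
    """Largest number of COs that fit on a chromosome of the given length
    when any two COs must be at least `interference` apart."""
    if length <= 0 or interference <= 0:
        raise ValueError("length and interference must be positive")
    return int(length // interference) + 1


def second_crossover_possible(first_pos, length, interference):
    """True if a second CO can still form given the first CO position,
    i.e. if at least one chromosome end lies outside the exclusion zone."""
    return max(first_pos, length - first_pos) >= interference


def terminal_zone(length, interference):
    """Maximum distance from a chromosome end at which either CO can sit
    when two COs occur; None if only a single CO fits."""
    if interference > length:
        return None
    return length - interference
```

With a chromosome of length 100 (arbitrary units), an interference range of 120 allows only one CO, whereas a range of 90 allows two only if each lies within 10 units of an opposite end. This reproduces the two cases described above: a single CO when interference exceeds the available distance, and terminalized COs when it is only slightly shorter than the chromosome.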

## Gene Dosage

It was historically expected that an increase in gene number would result in a uniform increase in gene expression (Comai, 2005). This would correspond to a 1:1 dosage effect, where 1x diploid expression per genome results in double the total gene expression per cell in an autotetraploid. However, other dosage responses are also possible (Coate and Doyle, 2010): for example, dosage compensation could reduce the per-genome expression by half to match overall diploid expression levels per cell (0.5:1 in autotetraploids). This compensation could be partial (ratio between 0.5:1 and 1:1) or even negative (ratios below 0.5:1). In the other direction, dosage effects could also amplify the expression level increase resulting from polyploidy (ratios above 1:1). The impact of these effects could vary across the transcriptome, with some gene categories more likely to follow a 1:1 response while others respond differently. The evidence for an overall 1:1 dosage response to WGD in autopolyploids stems primarily from one study of a synthetic polyploid series in maize (Guo et al., 1996). Among the 18 genes followed, most exhibited a 1:1 dosage response, but there were several exceptions and the dosage compensation response of some genes varied over different ploidy levels. Compared to the extensive literature on gene expression changes in allopolyploids (e.g., expression level dominance, genome-wide transcriptomic rewiring, biased fractionation), this represents a major gap in our understanding of autopolyploids. This gap should be closed, as recent empirical evidence points clearly to selective sweeps in transcription-related loci. This suggests that adaptation of the transcriptional machinery to cope with gene dosage effects may be important in neo-autopolyploids. Indeed, one of the most dramatic genome-wide selective sweeps associated with adaptation to WGD in A. arenosa is at the locus encoding the Transcription initiation factor IIF (TFIIF) beta subunit, a key member of the complex that drives RNA synthesis during transcription (Hollister et al., 2012; Yant et al., 2013).
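The dosage-response categories described above can be made concrete with a short sketch. It assumes an autotetraploid compared with its diploid progenitor, and expresses each response as the per-genome expression ratio (the "x" in "x:1"). The function names and the tolerance parameter are hypothetical and chosen only for illustration.

```python
def per_cell_expression(per_genome_ratio: float, ploidy: int = 4, base_ploidy: int = 2) -> float:
    """Total expression per cell relative to the diploid (diploid = 1.0).

    per_genome_ratio is the per-genome expression in the polyploid
    relative to the diploid, i.e. the "x" in an "x:1" dosage response.
    """
    return per_genome_ratio * ploidy / base_ploidy


def classify_dosage_response(per_genome_ratio: float, tol: float = 1e-9) -> str:
    """Name the autotetraploid dosage response for a given per-genome ratio.

    Thresholds follow the text: 1:1 is a pure dosage effect and 0.5:1 is
    full compensation (diploid-like expression per cell).
    """
    if abs(per_genome_ratio - 1.0) <= tol:
        return "1:1 dosage effect"
    if per_genome_ratio > 1.0:
        return "amplified dosage effect"
    if abs(per_genome_ratio - 0.5) <= tol:
        return "full dosage compensation"
    if per_genome_ratio > 0.5:
        return "partial dosage compensation"
    return "negative dosage effect"


if __name__ == "__main__":
    # Illustrative ratios only; these are not measured values.
    for ratio in (1.2, 1.0, 0.75, 0.5, 0.3):
        print(ratio, per_cell_expression(ratio), classify_dosage_response(ratio))
```

Under these assumptions, a 1:1 response doubles expression per cell, whereas a 0.5:1 response leaves per-cell expression at the diploid level.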

This gap in our understanding of the mechanistic basis behind dosage compensation is partly the result of technical difficulties. Methods commonly used to evaluate genome-wide expression patterns (microarrays and RNAseq) rely on extensive normalization of the RNA input and therefore are perfectly appropriate to detect relative changes in gene expression but not at all to measure variations in absolute transcriptome size (Lovén et al., 2012; Coate and Doyle, 2015). Only recently has a directed investigation of expression differences between diploids and autotetraploids been reported, using three normalization procedures to take into account transcriptome size, biomass, and cell density (Visger et al., 2017). This allows a clearer discrimination of expression differences than was previously possible, as concentration-based normalizations can mask up to 50% of expression differences. Indeed, in previous relative transcriptome comparisons (Stupar et al., 2007; del Pozo and Ramirez-Parra, 2014; Zhang et al., 2014) <10% of the transcriptome undergoes expression changes in response to WGD. However, when absolute differences were measured by Visger et al. approximately 1.5 times more genes showed expression differences; 80% of which had a dosage sensitive response with a ratio >0.5:1 (overexpressed in autotetraploid cells compared to diploids) thus compensating for the lower cell density of tetraploids (similar expression per biomass). As the first global analysis of gene dosage response to WGD properly taking into account potential changes in overall transcriptome sizes, this study by Visger et al. effectively demonstrates that 1) most genes are compensating for gene dosage (83% with no differences between diploid and tetraploid cells) and 2) that the genes which do not (17%) mostly present increased expression per cell (ratio >0.5:1) somehow compensating for gene dosage per biomass. 
This data thus does not support a general trend for 1:1 dosage effects, and also shows other responses are possible (underexpression per cell in tetraploids). More empirical work in this area is required (**Figure 2**) as antagonistic dosage effects in particular may open possibilities for WGD-induced transcriptomic innovations. In particular, some functional categories were enriched for dosage-sensitivity, mostly in relation to photosynthesis or the chloroplast.

Indeed, parallel evidence from the study of patterns of gene retention following allo-polyploidization (Maere et al., 2005; Thomas et al., 2006; Coate et al., 2011) or from more manageable experimental systems like the X chromosome (Birchler, 2014) supports the idea that dosage responses are selectively constrained by genetic pathways. This idea was formalized as the dosage balance hypothesis by Papp et al. (2003), who argued that greater fitness loss would result from perturbing the relative abundance of components of a signaling cascade or of a multi-subunit protein complex than from absolute but concerted concentration changes that would maintain overall stoichiometry. For genes under this selection for gene dosage balance, WGD itself may not greatly disrupt relative abundances, as all genes would see their dosage increased more or less proportionally to one another. If that is the case, however, gene dosage compensation mechanisms (from reduction of gene expression to loss of gene duplicates) would be strongly constrained following WGD, as these would need to be concerted across all interacting subunits to maintain stoichiometry. Hence, the gene duplicates of interacting proteins have been well-preserved even over long time-spans following polyploidization (Papp et al., 2003; Thomas et al., 2006; Birchler and Veitia, 2010). This could be a major hindrance for polyploids, as this selection would limit their ability to rectify deleterious gene dosage effects. One way to circumvent this selection would be a uniform reduction of gene expression across the genome, which, Freeling et al. (2015) argue, is precisely one of the effects predicted for transposition bursts (see section Transposition Burst and the Generation of a High-Effect Mutation Pool below). Supporting this potential importance of TEs in dosage response, dosage-dependent genes in an A. thaliana × A. arenosa polyploid series containing from 4 to 0 copies of the A.
thaliana genome were depleted of TEs while dosage-independent genes were enriched in TEs (Shi et al., 2015). Under the assumption that TEs are equally likely to insert near both categories of genes, these data suggest that TE insertions near dosage-dependent genes are selected against. This is consistent with the gene dosage balance hypothesis, as genes within a network would be unlikely to be affected synchronously, except in the case of a transposition burst where the effect of TEs would be evenly scattered across the genome. This hypothesis remains to be explored further (**Figure 2**), as there could also be reverse effects, where the presence of TEs in the vicinity of a gene might itself influence the gene-dosage response of that gene. Such effects are yet to be tested, but they could help explain the diversity of transcriptomic responses to WGD observed even between accessions of the same species. In A. thaliana, Yu et al. (2010) detected ∼500 genes differentially expressed in the Col-0 genotype after WGD but only nine in Ler. Notably, the three genes Yu et al. identify as highly but differentially ploidy-responsive across Col-0, Ler, and a panel of seven other accessions (AT1G53480, AT4G32280, AT5G18030) are all located within 1 kb of a TE insertion (anno-j.org).

## Cytotype and Competitive Exclusion

Newly formed polyploids are expected to suffer a mating disadvantage when they are relatively rare in the population (Husband, 2000). This type of frequency-dependent disadvantage is known as minority cytotype exclusion (Levin, 1975). The minority cytotype principle is based on the idea that, under random mating, rare cytotypes are expected to be involved in interploidy matings more often than common cytotypes. Assuming that interploidy matings are more likely to produce inviable or sterile offspring, rare cytotypes should have reduced relative fitness. Such a frequency-dependent mating disadvantage was described from experimental and natural mixed-ploidy populations (Hagberg and Elleström, 1959; Maceira et al., 1993; Husband, 2000; Baack and Stanton, 2005; Mráz et al., 2012), but only a few studies further evaluated its significance for polyploid establishment. Interestingly, studies of mixed-ploidy populations of Chamerion angustifolium indicate a surprising asymmetry in this relationship between ploidies. In experimental arrays in field conditions, diploid fitness was frequency-dependent, while fitness in tetraploids was unaffected by their relative frequency. This was likely a result of pollinators preferentially visiting flowers of tetraploid individuals, particularly when rare, and also due to skewed pollen competition favoring tetraploids (Husband, 2000; Husband et al., 2002). Interestingly the effect was mirrored in natural mixed-ploidy populations, where tetraploid mothers produced fewer triploid hybrids than diploid mothers (Husband and Sabara, 2003; Sabara et al., 2013). These studies thus provide a demonstration of minority cytotype exclusion in action and a novel mechanism by which polyploids may avoid its consequences through assortative mating. 
Indeed, given that WGD can yield larger flowers through the gigas effect (Ramsey and Schemske, 2002; Simon-Porcar et al., 2017), and pollinators often show a preference for visiting larger flowers, non-random mating in mixed-ploidy populations may be important for alleviating the costs of rarity. Additional mechanisms to reduce the costs of inter-cytotype mating are to shift toward self-pollination (Barringer, 2007), or to bypass sex altogether, either through agamospermy (Thompson and Whitton, 2006; Kao, 2007) or through increased vegetative reproduction, as shown both by association studies (Herben et al., 2017) and in synthetic polyploids (Drunen and Husband, 2018).
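The frequency dependence behind minority cytotype exclusion can be illustrated with a minimal sketch. It assumes strictly random mating between two cytotypes and a single viability penalty on interploidy offspring. Both function names and the `hybrid_viability` parameter are hypothetical simplifications, not a model taken from the cited studies.

```python
def interploidy_mating_fraction(cytotype_freq: float) -> float:
    """Fraction of a cytotype's matings that involve the other cytotype.

    Under random mating, an individual meets partners in proportion to
    their frequency, so a cytotype at frequency f mates with the other
    cytotype with probability 1 - f.
    """
    return 1.0 - cytotype_freq


def relative_mating_fitness(cytotype_freq: float, hybrid_viability: float = 0.0) -> float:
    """Expected offspring viability for a cytotype at the given frequency.

    Within-cytotype matings yield fully viable offspring (weight 1.0);
    interploidy matings yield offspring with viability hybrid_viability.
    """
    within = cytotype_freq
    between = interploidy_mating_fraction(cytotype_freq)
    return within + between * hybrid_viability


if __name__ == "__main__":
    # Illustrative frequencies only.
    for f in (0.05, 0.25, 0.5, 0.75, 0.95):
        print(f, interploidy_mating_fraction(f), relative_mating_fitness(f, 0.1))
```

In this sketch the rarer cytotype always has the lower expected fitness. Assortative mating, selfing, or asexual reproduction would each reduce the interploidy fraction below 1 - f and so weaken the disadvantage.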

In addition, direct competition between the parental diploid and its derivative autopolyploid can hinder the establishment of a nascent polyploid, as predicted by theory (Rodríguez, 1996; Yamauchi et al., 2004). This of course depends on niche overlap between the diploid and polyploid. Unlike allopolyploids, where hybridization is expected to create novel genetic combinations unique to the hybrids, autopolyploids may not immediately possess such dramatic genetic differentiation from their progenitors. On the other hand, ploidy-altered traits may translate to better polyploid performance in competition either with its diploid progenitor or with other species. Studies experimentally addressing competition between diploids and their naturally-occurring, recently arisen autopolyploid derivatives are, however, very rare and either support this view (Maceira et al., 1993) or show no difference (Thompson et al., 2015). Alternatively, ploidy-altered traits may also help to cope with competition with other species and may broaden niches, opening the possibility to escape from minority cytotype exclusion. This notion is supported by theoretical models (Rodríguez, 1996) and the observation that polyploids are more frequent in competitive, demanding, and human-disrupted habitats than their diploid relatives (Ehrendorfer, 1980). However, despite the frequent invocation of superior competitive ability to explain polyploid success, this has only rarely been addressed experimentally and available results speak against this trend in autopolyploids (Münzbergová, 2007; Fialová and Duchoslav, 2014), in contrast to allopolyploids (Rey et al., 2017).

# THE SHORT-TERM GAINS

## Masking of Deleterious Mutations

Haldane pointed out in 1933 that in the short-term polyploidy should greatly reduce the effect of genetic load by masking recessive or partially-recessive deleterious mutations behind an increased allelic multiplicity (Haldane, 1933). Indeed, at a given allele frequency q, the proportion of homozygotes in a randomly mating diploid population will be q<sup>2</sup>, but this drops to q<sup>4</sup> in an autotetraploid population. Thus, for deleterious recessive alleles, the frequency of autotetraploids expressing the associated phenotype will be lower by a factor of 1/q<sup>2</sup> (about one order of magnitude at q ≈ 0.3, two at q = 0.1). This means that deleterious recessive alleles can reach much higher frequencies in autotetraploid populations before being exposed to strong selection, and equilibrium frequencies are higher in autotetraploids vs. diploids. As newly formed polyploids initially inherit genetic load from a diploid genomic background where the equilibrium frequency is much lower, genetic load will be relieved in young polyploids, providing an early benefit (**Figure 1**). As a result, as long as most deleterious alleles are at least partially recessive (which is the case in both Mimulus and yeast; Willis, 1999; Agrawal and Whitlock, 2011), WGD is predicted to lead to temporary fitness increases (Korona, 1999; Otto and Whitton, 2000; Mable and Otto, 2001), although empirical evidence for this is still missing (**Figure 2**).
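The masking argument can be checked numerically with a short sketch. It assumes random mating, random chromosome segregation with no double reduction, and a fully recessive allele, so that only individuals carrying the allele on every homologous copy express the phenotype. The function names are hypothetical.

```python
def recessive_phenotype_frequency(q: float, ploidy: int) -> float:
    """Frequency of individuals homozygous for an allele at frequency q.

    Under the stated assumptions this is q**ploidy: q**2 in diploids
    and q**4 in autotetraploids.
    """
    return q ** ploidy


def masking_fold_reduction(q: float) -> float:
    """How many times rarer the recessive phenotype is in autotetraploids
    than in diploids at the same allele frequency (equals 1 / q**2)."""
    return recessive_phenotype_frequency(q, 2) / recessive_phenotype_frequency(q, 4)


if __name__ == "__main__":
    # Illustrative allele frequencies only.
    for q in (0.5, 0.3, 0.1, 0.01):
        print(
            q,
            recessive_phenotype_frequency(q, 2),
            recessive_phenotype_frequency(q, 4),
            masking_fold_reduction(q),
        )
```

The fold reduction grows as the allele becomes rarer, which is why masking is strongest for the low-frequency deleterious alleles that make up most of the genetic load.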

## Instantly Altered Physiological Properties

Both population genetic theory and emerging empirical evidence suggest that a broad set of factors interact to alter the genomic landscape of autopolyploids. However, understanding the effect of WGD in isolation from separate yet correlated effects has only recently made major progress. While it was suggested 35 years ago that biochemical and physiological changes resulting from WGD might underlie polyploid adaptability (Levin, 1983), the best evidence of a direct link took three decades to emerge, when Chao et al. (2013) elegantly demonstrated that A. thaliana first generation autotetraploids have instantaneously enhanced salt tolerance compared to isogenic diploids. Neo-autotetraploid A. thaliana lines were shown to experience a tradeoff, being less fit compared to diploid progenitors under non-saline conditions, but more fit in response to saline challenge (Chao et al., 2013). The authors proposed that in conditions of salinity stress the autopolyploid lineages would benefit from a fitness advantage that could contribute to their establishment and persistence, thanks to an improved ability to accumulate potassium and exclude sodium. Indeed, the following year it was shown that autotetraploid A. thaliana are additionally more drought tolerant (del Pozo and Ramirez-Parra, 2014). A major challenge now is to determine the molecular events that link WGD to this enhanced stress tolerance. It appears that the key tissue to investigate is the root, where salinity and drought tolerance meet potassium homeostasis and ABA signaling (Saleh et al., 2008; Meng et al., 2011; Allario et al., 2013; Chao et al., 2013; Wang et al., 2013; del Pozo and Ramirez-Parra, 2014). Work there promises to reveal mechanistically how WGD has an immediate effect on cellular physiology that is independent of increased genetic diversity.

There is also good evidence that somatic WGD may enhance stress resilience. For example, in Medicago and sorghum root endopolyploidy correlates with salt tolerance (Ceccarelli et al., 2006; Elmaghrabi et al., 2013) and can be induced by salt in tolerant, but not sensitive, strains (Ceccarelli et al., 2006). Thus, the ability to induce endopolyploidy may be responsible for salinity tolerance, perhaps due to cell size changes in the roots that alter ion uptake. Higher proportions of endopolyploid cells also correlate with greater drought tolerance (Cookson et al., 2006; Saleh et al., 2008; Meng et al., 2011; Chao et al., 2013).

Equally importantly, some effects have been disentangled from polyploidy and shown to be unrelated. A recent study by Solhaug et al. (2016) demonstrated that the allopolyploid Arabidopsis suecica had enhanced carbon assimilation via photosynthesis and elevated respiration rates relative to its progenitors A. arenosa and A. thaliana. This enhanced photosynthetic capacity was environment specific (dependent on high light levels) suggesting a potential mechanism for range expansion by the allopolyploid into novel niches. This advantage was not the direct result of polyploidization, as shown by comparing 12 accessions of isogenic diploid A. thaliana to colchicine-generated neo-polyploids. These autopolyploids showed no difference in carbon assimilation by photosynthesis compared to their diploid progenitors, suggesting that the photosynthetic vigor of A. suecica is a result of hybridization and not WGD.

Despite this progress, the majority of functional studies do not capture the final link: proof that the observed WGD-associated change is adaptive in the natural environment (**Figure 2**). An exception was provided by Ramsey, who used reciprocal transplants involving tetraploids, hexaploids, and neo-hexaploids (produced from the tetraploids) of Achillea borealis to show a link between WGD and increased fitness in the native environment (Ramsey, 2011). There, WGD itself accounted for 70% of the fitness difference, while the remaining variation (i.e., difference between neo-hexaploids and native hexaploids) could be ascribed to subsequent evolution of the native polyploid. However, the physiological mechanism and its genetic basis in this case remains unknown, which highlights the difficulty of comprehensive inter-disciplinary studies combining genetics, physiology, and ecology.

## Transposition Burst and the Generation of a High-Effect Mutation Pool

The hypothesis that WGD presents a genomic shock that activates transposable elements (TEs) across the genome was first proposed by Barbara McClintock (1984) to explain the association between polyploidy and increased TE content. Resident in virtually all genomes, TEs are highly mobile, making them powerful endogenous mutagens. To repress their activity, organisms target TEs with epigenetic silencing mechanisms such as DNA methylation (Bennetzen and Wang, 2014; Ito and Kakutani, 2014). However, the efficiency of TE silencing can be influenced by a number of factors, including environmental or cellular stressors. In some cases, the reactivation of TEs can be explained by the presence of stress-associated transcription factor binding sites in TE promoters (reviewed by Horváth et al., 2017). However, a more global impact of stress on the efficiency of TE-silencing mechanisms has also been suggested (Tittel-Elmer et al., 2010). In particular, genomic stress brought on by hybridization or polyploidization has global effects on epigenetic regulation and may thereby lead to TE reactivation (Kashkush et al., 2003; Madlung et al., 2005; Lopes et al., 2013; Springer et al., 2016; Edger et al., 2017). However, genome shock in polyploids has been studied primarily in allopolyploid contexts, where hybridization is the major contributor, as observed in Senecio cambrensis (Hegarty et al., 2006) and Spartina anglica (Parisod et al., 2009). To date, very few studies addressed the effect of WGD per se on TE transpositions apart from Bardil et al. (2015) who demonstrated an activation of LTR-retroelements following WGD, along with a contribution of gene-flow at the origin of polyploids.

Although most of the mutations TEs generate are deleterious, there is some evidence that TE insertions can be beneficial. The best example of this adaptive potential can be found in the classic case of industrial melanism in the peppered moth, where a young TE insertion that appeared and rapidly rose to fixation during the industrial revolution (∼200 years ago) has been proposed as responsible for the dark morph providing camouflage from predators (Van't Hof et al., 2016). The variation produced by TE activity can thus become a fruitful target of natural selection, providing adaptive solutions to the very stresses that initiated their reactivation (Ito et al., 2016). Thus, a global transposition burst triggered by genomic shock could immediately provide nascent polyploids with a pool of high-effect mutations to test against new challenges. In addition, the reactivation of TEs in young polyploids may also contribute to the stabilization of the neo-polyploid genome. First, TE insertions close to genes are known to have an impact, mostly negative (Hollister and Gaut, 2009) but sometimes positive (Quadrana et al., 2016), on the expression of nearby genes. Therefore, the global array of new insertions resulting from a transposition burst might result in broad re-wiring of gene expression and thereby contribute to the rebalancing of gene-dosages (Kashkush et al., 2003; Freeling et al., 2015), as was suggested by recent observations in rice neopolyploids (Zhang et al., 2015; see section Gene Dosage above). Second, TE content in centromeric regions contributes to the bulk of centric heterochromatin that is essential for the separation of sister chromatids during meiosis. Heterochromatin resists the pull exerted by microtubules and the resultant tension silences the spindle checkpoint, allowing meiosis to proceed (Stephens et al., 2013). 
Increased TE content generated from a transposition burst in neo-polyploids and distributed across chromosomes may thus lead to an overall strengthening of the meiotic spindles and contribute to stabilizing chaotic meiosis following WGD (see Meiosis section above).

# THE LONG-TERM GAINS

## Enhanced Invasiveness and Colonization Potential

Polyploids are over-represented among invasive plants. While in many cases diploids and tetraploids co-exist in the native range, the tetraploids are more often found alone in the invaded range than the contrary (e.g., Hollingsworth and Bailey, 2000). Consistent with this, polyploidy is associated with a potential for habitat colonization and transitions to weediness (Brown and Marshall, 1981; Soltis and Soltis, 2000; Pandit et al., 2006, 2011; Prentis et al., 2008). Common physiological factors contributing to invasiveness are associated with the necessary tolerance to environmental variation, including stress resilience, phenotypic plasticity, or rapid cycling (early and prolific flowering aids in coping with or escaping from unpredictable environmental conditions; Baker, 1965; Grotkopp et al., 2002; Blair and Wolfe, 2004; Burns, 2004; Hall and Willis, 2006; Sherrard and Maherali, 2006; Franks et al., 2007). Such life history adaptations can help mediate trade-offs between resource accumulation and stress-avoidance and are important for wild species as well as for crops (Jung and Müller, 2009).

Are polyploids pre-adapted or innately more capable of acquiring such traits? This is an open question, but the many cases where both cytotypes occur in their native range but only polyploids do in the invasive ranges (Lafuma et al., 2003; Mandák et al., 2005; Kubátová et al., 2007; Schlaepfer et al., 2008; Treier et al., 2009) suggest a potential pre-adaptation of polyploids for invasiveness (te Beest et al., 2012). However, environmental stresses also increase the rate of unreduced gamete formation and thus of polyploidization events (Bretagnolle and Thompson, 1995; Ramsey and Schemske, 1998). Therefore, polyploidization has also been viewed as a post-colonization process (Mandák et al., 2003) even if through hybridization (e.g., Hahn et al., 2012). Here we focus on the genetic factors implicated in invasiveness that are likely impacted by WGD (**Figure 1**). In particular, because the invasion of novel habitats typically proceeds from a small number of founders, some genetic properties of autopolyploids can enhance their chances of successful colonization. These include larger effective population sizes, a greater tolerance for selfing (and inbreeding depression), the ability to recover from the genetic bottlenecks, potentially enhanced sampling from existing standing variation, as well as expected lower levels of linked selection (below and te Beest et al., 2012).

## Increased Diversity and Tolerance for Selfing

In allopolyploids, where two distinct genomes are united, fitness advantages have often been attributed to interspecific hybridization rather than WGD (Barker et al., 2016). A conservative back-of-the-envelope calculation by Barker et al. (2016) estimated that the rate of production of autopolyploid cytotypes could be 40–80 times greater than that of allopolyploids. Given the approximate parity of allo- and auto-polyploids in nature, this suggests a large advantage to hybridization over the benefits directly attributable to WGD.

Because the two sub-genomes typically do not recombine, allopolyploids can continue to enjoy the advantage of heterosis and a stable multi-allelic state over many generations. Autopolyploids on the contrary, do not benefit from fixed heterozygosity. Nevertheless, it has been proposed that polysomic inheritance alone leads to higher genetic diversity (Haldane, 1932), and experimental comparisons between autotetraploids with tetrasomic inheritance and their diploid parents validate this theoretical expectation (Soltis and Soltis, 2000). This increased diversity has been linked to both an immediate increase in the number of mutational targets (doubled number of chromosomes in the case of autotetraploids), that in the long-run provide increased effective population sizes, and an expected reduced efficiency of purifying selection (Ronfort, 1999). This rise in genetic diversity in autopolyploids is proposed as a cause of the observed successes of tetraploids compared to their diploid sister lineages (Roose and Gottlieb, 1976; Soltis and Soltis, 1989, 2000; Soltis et al., 1993; Brochmann et al., 2004). The positive relationship between phenotypic plasticity and invasiveness introduced first by Baker (1965) is now well-documented through numerous physiological and morphological comparisons of invasive and native species (reviewed in Richards et al., 2006). Increased diversity in polyploids is often invoked to explain an increased plasticity and ability of polyploids to sustain range expansions into disturbed habitats. This is a tempting speculation, but a causal link demonstrating that the increased diversity in tetraploids confers an adaptive advantage is lacking (**Figure 2**).

In addition, reduced homozygosity in autopolyploids is expected to reduce the potential inbreeding depression associated with genetic load (Charlesworth and Charlesworth, 1987). This is because at any locus the increase in copy number in autopolyploids increases the probability of heterozygous offspring, even during selfing (Moody et al., 1993). As a result, the fitness cost associated with selfing (inbreeding depression) may be ameliorated (Lande and Schemske, 1985; Schemske and Lande, 1985) or at worst unchanged (Ronfort, 1999), depending on the range of dominance effects impacting fitness. This prediction has been confirmed in ferns, where the self-fertilization of the gametophyte makes it possible to directly measure the impact of selfing on survival rates in the resulting sporophytes. In two different diploid-tetraploid fern pairs, selfing survival rates were nearly 100% in the tetraploid races, while they ranged from 5 to 60% in the diploids (Masuyama and Watano, 1990). Similarly, a reduction of inbreeding depression is observed in other polyploid-diploid comparisons (Husband and Schemske, 1997; Galloway et al., 2003; Husband et al., 2008), even though there are cases where the opposite is observed (Johnston and Schoen, 1996).

Tolerance to selfing is of major importance in the ability of a population to colonize new habitats, a consideration known as "Baker's rule" (Baker, 1967). Indeed, during the colonization process, early invaders are likely to be isolated with little opportunity for outcrossing. Selfing therefore provides reproductive assurance for the dispersed invaders (Barrett et al., 2008), and this translates to a high rate of co-occurrence between selfing or asexual propagation and low-density conditions or frequent colonization bouts (Baker, 1967; Price and Jain, 1981; Pannell and Barrett, 1998). For example, Daehler found that low inbreeding depression in hexaploid smooth cordgrass populations invading the San Francisco Bay area in California was associated with higher self-fertility and a higher fitness advantage for founding populations in the field (Daehler, 1998; Renny-Byfield et al., 2010).

## Reductions in Linked Selection: An Advantage in Changing Environmental Conditions?

As a mirror image to the reduced efficiency of selection against deleterious mutations, the increase in frequencies of beneficial alleles, even when dominant, will be slower in polyploids under tetrasomic inheritance than in diploids (Hill, 1971). Therefore, the time to fixation for an allele during a selective sweep can be greatly increased in autopolyploids. This prolonged rise in allele frequency might lead to more opportunities for mutation and recombination with other haplotypes, which are even further enhanced by the increased mutation and recombination rates resulting from greater ploidy levels. Weaker linkage thus may promote adaptation through reduced interference between alleles, allowing greater opportunity for a beneficial allele to recombine onto haplotypes with fewer deleterious mutations (**Figure 1**). However, increased recombination can lead to lower fitness in constant environments by breaking down beneficial associations (Lewontin, 1971; Feldman et al., 1980). Therefore, increased recombination may only be selected for in environments with fluctuating conditions (Charlesworth, 1976; Otto and Michalakis, 1998; Lenormand and Otto, 2000), which also happen to be environments with higher incidences of polyploids (Favarger, 1984; Brochmann et al., 2004; Parisod et al., 2010). This association between increased recombination and adaptation to environmental variation would strongly favor long-term evolution of autopolyploids, but remains to be experimentally tested (**Figure 2**).
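The slower spread of a beneficial allele under tetrasomic inheritance can be illustrated with a deterministic sketch. It is not the model of Hill (1971). It is a simplified recursion that assumes random mating, binomial genotype frequencies in every generation (no double reduction), complete dominance, and a hypothetical selection coefficient `s`. The function names are illustrative.

```python
from math import comb


def next_allele_frequency(p: float, s: float, ploidy: int) -> float:
    """One generation of selection on a fully dominant beneficial allele.

    Genotypes carrying at least one copy have fitness 1 + s; genotypes
    with no copy have fitness 1. Genotype frequencies are assumed to be
    binomial in the allele frequency p (a simplifying assumption).
    """
    q = 1.0 - p
    mean_fitness = 0.0
    weighted_allele_freq = 0.0
    for copies in range(ploidy + 1):
        genotype_freq = comb(ploidy, copies) * p**copies * q ** (ploidy - copies)
        fitness = 1.0 + s if copies >= 1 else 1.0
        mean_fitness += genotype_freq * fitness
        weighted_allele_freq += genotype_freq * fitness * copies / ploidy
    return weighted_allele_freq / mean_fitness


def generations_to_reach(target: float, p0: float, s: float, ploidy: int,
                         max_generations: int = 1_000_000) -> int:
    """Count generations until the allele frequency reaches the target."""
    p = p0
    generations = 0
    while p < target and generations < max_generations:
        p = next_allele_frequency(p, s, ploidy)
        generations += 1
    return generations


if __name__ == "__main__":
    # Illustrative parameters only: start at 1%, stop at 90%, s = 0.1.
    for ploidy in (2, 4):
        print(ploidy, generations_to_reach(0.9, 0.01, 0.1, ploidy))
```

Under these assumptions the sweep is slower in the autotetraploid. The difference is largest late in the sweep, when the unfavored genotype lacking the allele becomes very rare (q<sup>4</sup> rather than q<sup>2</sup>) and selection against it is correspondingly weak.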

## Sampling of Standing Variation From Local Introgression

As an autopolyploid lineage expands its range, it may encounter populations of its diploid progenitor or other species with which hybridization is possible. Provided that such populations are locally adapted, introgression may then supply genetic variants that facilitate persistence. Although polyploidization is traditionally viewed as a means of instant speciation (Coyne and Orr, 2004), the ploidy barrier is permeable (reviewed in Ramsey and Schemske, 1998; Kolár et al., 2017). While adaptive introgression is increasingly recognized by genomic studies as an important force in the evolution of haploid and diploid organisms (reviewed by Arnold and Kunte, 2017; Schmickl et al., 2017), empirical genomic evidence for gene flow between a diploid and its autopolyploid derivative is lacking. The ability to accept genetic variation from an alternative cytotype might be beneficial, as it could provide preadapted local alleles upon which selection may act and/or may alleviate inbreeding associated with founding events during range expansions (Parisod et al., 2010). However, we lack well-documented examples of traits and underlying loci that may explain the evolutionary significance of gene flow for the establishment and further spread of a polyploid. The only case to our knowledge, although confined to allopolyploids, is the across-ploidy transfer of potentially adaptive floral genes, RAY1 and RAY2, from diploid Senecio squalidus into the allotetraploid Senecio vulgaris, which has given rise to a novel variety of S. vulgaris with ray florets (Chapman and Abbott, 2010). An additional hint, coming from an autopolyploid system, is the likely uptake of a diploid-like CONSTANS allele during the colonization of railways by a distinct lineage of autotetraploid A. arenosa (Baduel et al., 2018). This allele may allow the railway ecotype to escape the repression exacted on flowering by FLOWERING LOCUS C and underlie the observed rapid and repeated flowering.
These two examples indicate that this may be a fruitful area for future research. An alternative benefit of interploidy hybrids for polyploid establishment may result from their contribution to recurrent formation of polyploids. Triploid hybrids, if fertile, often produce unreduced (2n = 2x) gametes (Ramsey and Schemske, 1998; Chrtek et al., 2017) that can merge with reduced (n = 2x) gametes of a tetraploid leading to formation of novel tetraploids (Husband, 2004). We note, however, that much gene flow may be neutral or even maladaptive. For example, if a tetraploid has adapted to problems associated with meiotic segregation during its establishment (Yant et al., 2013), later diluting of such co-adapted gene networks by introgression of diploid-like alleles would lead to reductions in fitness.

Even when assuming beneficial consequences, interploidy gene flow would provide relative advantages to the polyploid only in cases when (potentially adaptive) alleles flow more often into the polyploid than into progenitor diploids (**Figure 1**). Indeed, this seems to be the case, and it was recognized as early as 1971 by Stebbins that gene flow among cytotypes is asymmetrical (Stebbins, 1971). A mechanistic explanation for this asymmetry is that while there are direct pathways for gene flow in the 2x → 4x direction, the reverse is more convoluted. The unreduced 2n = 2x gamete of a diploid and the reduced n = 2x gamete of a tetraploid can combine, leading to one-step formation of a tetraploid interploidy hybrid (2x + 2x = 4x; Koutecký et al., 2011; Chrtek et al., 2017; Sutherland and Galloway, 2017). However, a triploid hybrid, capable of forming reduced n = x gametes, is an essential stepping-stone for the creation of a diploid hybrid (Kolár et al., 2017). Thus, gene flow in the 4x → 2x direction is less likely as it involves two separate crossing steps (4x → 3x and 3x → 2x). Moreover, the triploid hybrid is often either non-viable (triploid block) or unfit (Ramsey and Schemske, 1998; Köhler et al., 2010). Indeed, the few available empirical genetic studies document either much stronger (in autopolyploid systems: Ståhlberg, 2009; Jørgensen et al., 2011; Arnold et al., 2015) or exclusively unidirectional gene flow from the diploid into the polyploid (in allopolyploid systems: Slotte et al., 2008; Chapman and Abbott, 2010; Zohren et al., 2016). Over a longer evolutionary timespan, recurrent origins of autopolyploid lineages from different diploid sources followed by hybridization among these polyploids (Soltis and Soltis, 2009) would also lead to enrichment of the tetraploid gene pool by alleles from distinct diploid lineages, similar to direct unidirectional gene flow from diploid to polyploid.

If higher polyploids are formed (i.e., hexaploids, octoploids, etc.) they may hybridize with the tetraploid or among one another and further enhance variation of the polyploid lineages. The few empirical studies available show that the postzygotic barrier, both in terms of rate of hybrid formation and its fitness, is lower among the various polyploid cytotypes than it is between diploids and their polyploid derivatives (Greiner and Oberprieler, 2012; Sonnleitner et al., 2013; Hülber et al., 2015; Sutherland and Galloway, 2017). This corresponds well with the explanation of maternal:paternal genome imbalance in the endosperm as a primary cause of the postzygotic barrier (Köhler et al., 2010; Greiner and Oberprieler, 2012). Because the magnitude of endosperm imbalance in tetraploid–hexaploid hybrids is approximately one third lower than in diploid–tetraploid hybrids (Sonnleitner et al., 2013), these higher-ploidy hybrids may be more fit than diploid–tetraploid hybrids.

Aside from intraspecific gene flow, polyploidy may also break down systems of reproductive isolation present in diploid progenitors and thus increase interspecific gene flow. For example, although the reproductive isolation in diploid lineages of Arabidopsis arenosa and Arabidopsis lyrata is near complete, tetraploid A. lyrata can form viable hybrids with both diploid and tetraploid A. arenosa, likely due to the disruption of an endosperm-based barrier (Lafon-Placette et al., 2017). Interestingly, hybridization between those species appears to have donated beneficial alleles contributing to local adaptation to harsh serpentine soils in the tetraploid A. arenosa (Arnold et al., 2016). In this study Arnold et al. (2016) found that several genes exhibiting signatures of selection for adaptation to serpentine soils also appeared to have been introgressed from A. lyrata. Finally, the tendency of polyploids to expand into novel niches may further increase chances of encountering foreign lineages with which hybridization may occur. Although the cause of this is unclear, the heightened adaptability of many polyploids fueled by introgression may provide positive feedback, allowing further spread and hybridization. Altogether, these examples illustrate a tendency for polyploids to act as evolutionary "sponges," accumulating variation through introgression across both ploidy and species barriers. This supports the view of polyploids as diverse evolutionary amalgamates from multiple distinct ancestral lineages—a property advantageous for further expansions.

### THE LONG-TERM CHALLENGES

This begs the question: if WGD events are common, and polyploids display advantageous traits, why are established autopolyploids relatively uncommon and paleo-polyploids so frequent? Transitions to polyploidy tend to be observed at the tips of phylogenies (Escudero et al., 2014), suggesting that polyploid lineages typically do not survive as such over longer evolutionary timescales. Consequently, the growing consensus is that polyploidy is an ephemeral but repeatedly appearing state (Wendel, 2015). This could be the result of both pervasive polyploid extinction, as there is some suggestion that recently arisen polyploids experience lower diversification rates and higher extinction rates relative to congeneric diploids (Mayrose et al., 2011, 2015; Arrigo and Barker, 2012), as well as repeated returns to diploidy and disomic inheritance after transitionary polyploid phases (e.g., Haufler, 1987; Wendel, 2015; Soltis et al., 2016). To date, such a transition has only been mathematically modeled in autopolyploids (Le Comber et al., 2010); empirical evidence is lacking. Thus, in addition to the short term biological challenges faced by newly-arisen polyploids, longer-term challenges may help explain the transience of the polyploid state, even reviving the idea of polyploidy as an evolutionary "dead-end" (Wagner, 1970; Stebbins, 1971; Mayrose et al., 2011, 2015). Ironically, many of these postulated longer-term negative effects result from the continuation of earlier beneficial population genetic mechanisms.

#### Increased Genetic Load

If in the short term polysomic masking results in a fitness increase, a reduced strength of purifying selection (Ronfort, 1999) would eventually lead to the slow rise of recessive deleterious mutations until mutational load reaches a new, higher equilibrium (Otto and Whitton, 2000). This new polyploid equilibrium may take hundreds of thousands of generations to establish (Otto and Whitton, 2000), but would ultimately produce a genetic load proportional to ploidy level and the mutation rate µ per haploid genome (Haldane, 1937). This effect would be particularly strong for TEs: their distribution of fitness effects is much more heavily skewed toward highly deleterious mutations than that of single-nucleotide polymorphisms, so they are strongly affected by purifying selection. This has been demonstrated recently in A. thaliana, where it was shown that TEs insert throughout the genome, but are rapidly purged from gene-rich regions and chromosome arms, most likely due to the deleterious consequences of insertions near or within genes (Quadrana et al., 2016). Although evidence of this comes mostly from allopolyploid systems (wheat, Brassica, etc.), these long-term effects are likely to be similar in autopolyploids, and in the long run we can expect the initial differences in transposition burst triggered by the two modes of polyploidization to be less important than the relaxation of purifying selection shared by both systems. Indeed, in the allotetraploid Capsella bursa-pastoris, an increase in TE content was observed around genes compared to its two parental diploid species, C. grandiflora and C. orientalis, which was attributed primarily to a relaxation of purifying selection and not to any change in TE activity (Ågren et al., 2016), and there is accumulating evidence of TE proliferation over long timespans following polyploidization (Sarilar et al., 2011; Yaakov and Kashkush, 2012; Piednoël et al., 2015). 
However, this does not seem to apply to all TE families equally. For example, some gypsy-like retro-elements proliferated in Aegilops tetraploids while others remained quiescent (Senerchia et al., 2014). This could be due to differences in insertion preferences between TE families or more simply to the fact that many TEs are actually defective. Indeed, most of the TEs carried by a genome have lost their transpositional capacities and are fossilized: in the human genome <0.05% of TEs remain active (Mills et al., 2007). Between two active families, differences in their regulation, copy number, chromosome localization, etc. may also explain different responses to relaxed purifying selection. For example, a family inserting more commonly into genes will be more strongly purged, and therefore more strongly affected by a relaxation of purifying selection, than TE families that inherently avoid inserting into gene-coding loci. Such differences in insertion preferences have been observed in one LTR retrotransposon family between A. thaliana, where genic insertions are strongly selected against, and A. lyrata, where gene-poor centromeric regions are preferentially targeted, reaching much higher copy numbers (Tsukahara et al., 2012).

Eventually, it is thought that the reduced strength of purifying selection from polysomic masking may overshadow the early advantages of low mutational load, which begins at the lower diploid equilibrium levels immediately following WGD (Otto, 2007; Gerstein and Otto, 2009). At equilibrium (**Figure 1**), polyploids are predicted to suffer from the increased frequency of deleterious mutations, which are introduced in higher numbers (doubled in the case of autotetraploids). However, given the difficulties of finding an ancient enough system where autopolyploids have reached such an equilibrium but are still ecologically comparable to their diploid progenitors, empirical support for this remains sparse (**Figure 2**).
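The expected change in load between the moment of WGD and the eventual equilibrium can be sketched numerically. The Python snippet below is a minimal illustration, not a population genetic model: it assumes that the equilibrium load is simply the product of ploidy level and µ (a proportionality constant of 1, chosen here for convenience), and the function name `equilibrium_load` and the example value of µ are hypothetical.

```python
def equilibrium_load(ploidy: int, mu: float) -> float:
    """Equilibrium genetic load, assumed proportional to ploidy times mu.

    mu is the deleterious mutation rate per haploid genome.
    The proportionality constant is set to 1 purely for illustration.
    """
    return ploidy * mu


mu = 0.001  # hypothetical mutation rate per haploid genome

load_2x = equilibrium_load(2, mu)  # load inherited by a new autotetraploid
load_4x = equilibrium_load(4, mu)  # load expected once the new equilibrium is reached

print(load_2x, load_4x, load_4x / load_2x)
```

Under these assumptions a neo-autotetraploid starts at the diploid value and drifts toward a load twice as high, whatever the value of µ.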

#### Slower Selection on Recessive Beneficial Mutations

In addition to hampering selection against deleterious mutations, polysomic masking can also prevent recessive beneficial mutations from reaching fixation. This may even effectively counter the increased input of beneficial mutations arising from the increased number of haploid genomes (Haldane, 1932; Gerstein and Otto, 2009). Whereas in haploids the rate of fitness increase only depends on the rate of appearance of beneficial mutations and their fitness effect (Haldane et al., 1927), in diploids it also depends on the dominance level of mutations. This is further intensified in polyploids (Gerstein and Otto, 2009). For example, in autotetraploids with tetrasomic inheritance, the rate of fitness increase (w) can be written as a function of the rate of appearance, the fixation probability, and the fitness effect (s) as follows:

$$
\Delta w_{4x} = 4N\nu \cdot 2h_1 s \cdot s
$$

where N is the population size, ν is the beneficial mutation rate, and h<sub>1</sub> is the dominance of the new allele in simplex (for example Aaaa for tetraploids). Therefore, polyploids would adapt faster only when mutations are at least partially dominant (h<sub>1</sub> > 0.5) and thus not hindered by polysomic masking. In an attempt to test this prediction experimentally, Schoustra et al. (2007) observed the fastest rates of loss of a costly resistance allele in diploid strains of the fungus Aspergillus nidulans that periodically reverted to haploidy. These strains accumulated multiple recessive beneficial mutations in the diploid state that were exposed to positive selection in the haploid state. This pattern is reminiscent of the transitionary polyploid phases postulated to have occurred throughout the evolution of plants (Haufler, 1987; Wendel, 2015; Soltis et al., 2016).
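To see how strongly the simplex dominance coefficient controls the autotetraploid rate of fitness increase, the expression above can be evaluated directly. The following Python sketch simply multiplies the three terms named in the text (rate of appearance, fixation probability, fitness effect); the function name and all parameter values are hypothetical and chosen only for illustration.

```python
def rate_of_fitness_increase_4x(N: float, nu: float, h1: float, s: float) -> float:
    """Rate of fitness increase in an autotetraploid with tetrasomic inheritance."""
    rate_of_appearance = 4 * N * nu      # 4N gene copies, each mutating at rate nu
    fixation_probability = 2 * h1 * s    # depends on dominance in simplex (Aaaa)
    return rate_of_appearance * fixation_probability * s


# Hypothetical parameter values.
N, nu, s = 1000, 1e-6, 0.01

mostly_recessive = rate_of_fitness_increase_4x(N, nu, h1=0.1, s=s)
partially_dominant = rate_of_fitness_increase_4x(N, nu, h1=0.5, s=s)

print(mostly_recessive, partially_dominant)
```

Because the rate is linear in the simplex dominance coefficient, a fivefold drop in dominance produces a fivefold drop in the rate of fitness increase, which is the sense in which polysomic masking slows adaptation from recessive mutations.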

Further amplifying this effect of reduced positive selection, lower linkage in autopolyploids (see Reductions in Linked Selection: An Advantage in Changing Environmental Conditions?) increases the likelihood of recombination breaking down favorable haplotypes as they slowly rise in frequency. As a consequence, beneficial mutations in close vicinity and positively selected in autotetraploids are unlikely to remain linked to each other for long. This may be beneficial in the early stages of invasion (directional selection) or under fluctuating environments, but in the long run it is predicted to be unfavorable. Indeed, once established in their new range and closer to a new fitness optimum, selection is predicted to favor increased linkage and reduced recombination (Feldman et al., 1980, 1996).

#### Bigger Genomes, Slower Growth Rates

With their doubled genomes, autopolyploids are likely to face the general rule in animals and plants dictating that increased genomic content results in decreased growth and division rates (Gregory and Mable, 2005; Otto, 2007). If the impact of genome size is clear at the cellular level it is less evident at the organism level (Knight and Beaulieu, 2008) and exceptions to this rule can easily be found: first in the growth form (e.g., trees have small genomes, Beaulieu et al., 2008) and in the environment (Zörgö et al., 2013). This led some to suggest the overall negative relationship between genome size and metabolic rate across gymnosperms and angiosperms may be the result of a rather indirect effect through other traits such as growth form (Beaulieu et al., 2007a). It should be noted, however, that these rules are based on the observation of established polyploids and therefore the impact of genome size itself remains to be directly assessed independently of the potentially confounding effects of subsequent evolution. On the cellular level, it seems clear that increased nuclear content leads to increased cell volume (Beaulieu et al., 2008; Knight and Beaulieu, 2008) and slower growth rates (Cavalier-Smith, 1978; Gregory, 2001), which have long been observed in polyploids as well (Müntzing, 1936; Stebbins, 1971; Garbutt and Bazzaz, 1983). At the organismal level, older observations have illustrated that polyploids often flower later (Smith, 1946) and are more frequently perennial (Hagerup, 1932; Müntzing, 1936; Sano et al., 1980), but the role of WGD itself has been rarely experimentally evaluated since then. For example, in synthetic A. thaliana tetraploids, there was no consistent trend in flowering time over 12 ecotypes investigated in two common gardens (Solhaug et al., 2016) and similarly no differences in this trait were found between diploid and synthetic polyploids of Chamerion angustifolium (Husband et al., 2016). 
On the other hand, a recent study leveraged parallel altitudinal clines and intraspecific genome size variation in maize landraces to show repeated reductions in genome size in high-altitude populations, most likely via selection on flowering time (Bilinski et al., 2018). Furthermore, in growth chamber experiments Bilinski et al. were able to confirm an association, even if modest, between genome size, cell production, and cell sizes. Therefore, such constraints may turn out to be particularly costly for polyploids that successfully switched to invasiveness thanks to early advantages (see section Enhanced Invasiveness and Colonization Potential above). Indeed, invasive species commonly exhibit early flowering (Pyšek et al., 2009), smaller seeds with higher dispersal abilities, and annual life cycles, which are also the prerogative of small-genome species (Grotkopp et al., 2002; Knight et al., 2005; Beaulieu et al., 2007b). Even if more research is needed to clarify the direct impact of polyploidy, evidence so far suggests that the potential slowing of growth rates may negatively affect long-term fitness. Thus, selection would likely push for a reduction of genome size, especially in transitions to invasiveness. This process may be very long and stochastic, however, as evidence in the Nicotiana genus shows genome downsizing is minimal in young polyploids (∼200,000 years old), only appearing in polyploids approximately 4.5 million years old, at which point genome size increases are also observed (Leitch et al., 2008).

#### Post-polyploidy Diploidization, a Cradle for Diversification

It is now clear that nearly all plant lineages are paleo-polyploids, with their evolutionary histories including at least one round of WGD (Wendel, 2015). However, numbers of past WGD events do not correlate with chromosome numbers nor genome sizes. For example, given the three rounds of genomic multiplications that have occurred in Brassica genomes (α, β, and γ, Franzke et al., 2011; Jiao et al., 2011), and assuming ancestral angiosperms had between 5 and 7 chromosomes (Stebbins, 1971; Raven, 1975), we would expect, without reductions in chromosome numbers along the way, resultant species to carry between 40 and 56 chromosomes, when some carry as few as six (Anderson and Warwick, 1999). A similar reasoning holds with genome sizes (Wendel, 2015): thus it is apparent that past polyploidization events were followed by massive genome downsizing, both in chromosome numbers and in absolute size (Leitch and Bennett, 2004; Leitch and Leitch, 2008). This genome downsizing ultimately leads to the diploidization of descendants (Soltis et al., 2015). These paleo-polyploids then commonly undergo further rounds of polyploidization, generating a cyclical process described as the "wondrous cycles of polyploidy" and occurring repeatedly over long evolutionary timescales (Wendel, 2015).
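The arithmetic behind the expected range of 40 to 56 chromosomes can be reproduced with a short Python sketch. It assumes, as the figures quoted in the text imply, that each of the three rounds is an exact doubling and that no chromosomes are lost or fused along the way; the function name `expected_chromosomes` is hypothetical.

```python
def expected_chromosomes(ancestral_number: int, wgd_rounds: int) -> int:
    """Chromosome number after repeated exact doublings with no subsequent reduction."""
    return ancestral_number * 2 ** wgd_rounds


# Ancestral angiosperms assumed to carry between 5 and 7 chromosomes,
# followed by three rounds of multiplication (alpha, beta, gamma).
low = expected_chromosomes(5, 3)
high = expected_chromosomes(7, 3)

print(low, high)  # 40 56
```

The gap between this expectation and the observed minimum of six chromosomes is what points to extensive reduction in chromosome number after each polyploidization.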

Several mechanisms have been proposed to underlie the diploidization process, all of them relying on non-homologous translocations (Mandáková and Lysak, 2018). TEs are one contributor to these illegitimate recombination events, since homology between TE copies can lead to spurious recombination events between non-homologous chromosomes (Vicient and Casacuberta, 2017). Some of the rearrangements resulting from these non-homologous recombinations (inversions, reciprocal translocations, deletions, and duplications) do not affect chromosome numbers, but others (end-to-end translocations, EETs, nested chromosome insertions, NCIs, and Robertsonian translocations) are seen as the mechanistic basis of "polyploid drop" (Mandáková and Lysak, 2018). Indeed, all three processes result in the merger of two chromosomes into one via non-homologous recombinations between two distal regions (EETs), two distal regions with a pericentromeric region (NCIs), or between a distal and a pericentromeric region (Robertsonian translocations). Distal and pericentromeric regions are particularly prone to ectopic homologies due to their enrichment in repetitive elements, in particular TEs (Quadrana et al., 2016; Vicient and Casacuberta, 2017). Therefore, even though most recombination events between TEs will lead to small indels, the possibility of large-scale chromosomal rearrangements may represent a major driver of genome restructuring during diploidization (Vicient et al., 1999). In fact, evidence supporting a role for TEs during diploidization has been observed in Nicotiana (Lim et al., 2007) and maize (Bruggmann et al., 2006). However, these dysploidy events have an immediate fitness cost, as the merging of two chromosomes leads to obvious chromosome segregation issues. In outcrossers in particular, the probability of forming non-aneuploid offspring is very low, and newly formed dysploids are likely to suffer from woes similar to those of newly formed autopolyploids (Mandáková and Lysak, 2018). 
This is why it was theorized that the establishment of dysploids would be relatively favored in selfers (Charlesworth, 1992). By increasing homozygosity of the offspring, selfing indeed reduces the fitness cost of dysploidy by increasing the probability of producing offspring homozygous for the merged chromosome. Extending this reasoning, we can expect higher rates of dysploidy among weedy invasives, due to both their propensity for selfing and their often faster cycling (e.g., Grant, 1981), which increases the probability of spurious recombinations (Mandáková and Lysak, 2018). This relationship between life-history and dysploidy rate has been confirmed (e.g., Luo et al., 2015) even though some examples show this is not always straightforward (slow polyploid drop rate in rice despite being annual, Murat et al., 2010). Furthermore, the advantage of a reduced number of chromosomes may be particularly valuable for colonizers (see section Bigger Genomes, Slower Growth Rates above).

These considerations become particularly relevant for aging polyploids, which both carry an increased TE content (see sections Transposition Burst and the Generation of a High-Effect Mutation Pool and Increased Genetic Load above) and are more likely to tolerate selfing (section Increased Diversity and Tolerance for Selfing above). Thus, factors that initially represented an advantage for the establishment of recent autopolyploids may transform into the very drivers of polyploid drop and return to diploidy (**Figure 1**).

Compared to WGD, which leads to an exact doubling of chromosome numbers, polyploid drop is more erratic and can produce broad variation in chromosome number. In Brassica for example, the variation in base chromosome numbers is the result of multiple and independent diploidizations from the mesohexaploid ancestor (Lysak et al., 2007; Mandáková et al., 2017). Indeed, the stochasticity of polyploid drop, not WGD, is thought to be a major contributor to speciations and radiations (**Figure 1**). However, polyploid drop is of course not possible without an earlier WGD. Accordingly, recently revised phylogenetic evidence convincingly supports the occurrence of WGDs significantly before large angiosperm radiations, sometimes by millions of years (Tank et al., 2015; Clark and Donoghue, 2017). These reports strengthen the WGD Radiation Lag-Time Model formalized by Schranz et al. (2012), who found that in six examples of angiosperm radiations a species-poor sister-group shared a WGD event with the species-rich crown group, directly contradicting the notion that WGD was the sole immediate cause of these radiations. The lag between WGD and subsequent radiations thus has been proposed as evidence that the long and stochastic process of polyploid drop is the proximal engine of speciation and cladogenesis (Dodsworth et al., 2016; Clark and Donoghue, 2017; Mandáková and Lysak, 2018).

### CONCLUSION

As best expressed by Jonathan Wendel, the "wondrous cycles" of polyploidy have gained increasing attention and support, both theoretical and empirical, over the earlier ideas that polyploids were evolutionary dead-ends. Excellent recent reviews have discussed the complex mixture of advantages and disadvantages of polyploidy (see especially Spoelhof et al., 2017), and here we aimed to extend this with the most recent evidence considered explicitly in the scope of the dynamic temporal nature of shifting costs and benefits. In doing so, we hope to bring to light the importance of the timescales at which evolutionary dynamics are at play over the lifespan, from dawn till dusk, of any given genome duplication, thus creating the conditions for these wondrous cycles to emerge. We see a picture of each cycle of WGD-diploidization as a temporary but powerful engine of evolutionary diversification. Eventually, without specific selective pressures maintaining a strong advantage for polyploids, each hop to polyploidy is restabilized in a drop to diploid form, but there are plenty of evolutionary opportunities for speciation and radiation along the way.

#### AUTHOR CONTRIBUTIONS

PB drafted the manuscript and realized the figures with help from all other authors. SB drafted the manuscript with help from all other authors. MV-M drafted the manuscript with help from all other authors. FK drafted the manuscript with help from all other authors. LY drafted the manuscript with help from all other authors.

#### FUNDING

LY acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 679056) and the UK Biotechnology and Biological Sciences Research Council (BBSRC) via grant BB/P013511/1 to the John Innes Centre. Additional support was provided by the Czech Science Foundation (project 17-20357Y to FK).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Baduel, Bray, Vallejo-Marin, Kolář and Yant. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Phylogenetic Structure of Plant Communities: Are Polyploids Distantly Related to Co-occurring Diploids?

#### Michelle L. Gaynor<sup>1</sup>, Julienne Ng<sup>2</sup> and Robert G. Laport<sup>2</sup>\*

<sup>1</sup> Department of Biology, University of Central Florida, Orlando, FL, United States, <sup>2</sup> Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, CO, United States

Polyploidy is widely acknowledged to have played an important role in the evolution and diversification of vascular plants. However, the influence of genome duplication on population-level dynamics and its cascading effects at the community level remain unclear. In part, this is due to persistent uncertainties over the extent of polyploid phenotypic variation, and the interactions between polyploids and co-occurring species, and highlights the need to integrate polyploid research at the population and community level. Here, we investigate how community-level patterns of phylogenetic relatedness might influence escape from minority cytotype exclusion, a classic population genetics hypothesis about polyploid establishment, and population-level species interactions. Focusing on two plant families in which polyploidy has evolved multiple times, Brassicaceae and Rosaceae, we build upon the hypothesis that the greater allelic and phenotypic diversity of polyploids allows them to successfully inhabit a different geographic range compared to their diploid progenitor and close relatives. Using a phylogenetic framework, we specifically test (1) whether polyploid species are more distantly related to diploids within the same community than co-occurring diploids are to one another, and (2) whether polyploid species tend to exhibit greater ecological success than diploids, using species abundance in communities as an indicator of successful establishment. Overall, our results suggest that the effects of genome duplication on community structure are not clear-cut. We find that polyploid species tend to be more distantly related to co-occurring diploids than diploids are to each other. However, we do not find a consistent pattern of polyploid species being more abundant than diploid species, suggesting polyploids are not uniformly more ecologically successful than diploids. 
While polyploidy appears to have some important influences on species co-occurrence in Brassicaceae and Rosaceae communities, our study highlights the paucity of available geographically explicit data on intraspecific ploidal variation. The increased use of high-throughput methods to identify ploidal variation, such as flow cytometry and whole genome sequencing, will greatly aid our understanding of how such a widespread, radical genomic mutation influences the evolution of species and those around them.

Keywords: Brassicaceae, genome duplication, non-native species, phylogenetic community ecology, polyploidy, Rosaceae

#### Edited by:

Hans D. Daetwyler, La Trobe University, Australia

#### Reviewed by:

Jacob A. Tennessen, Oregon State University, United States; Rubén Torices, Université de Lausanne, Switzerland

> \*Correspondence: Robert G. Laport rob.laport@gmail.com

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Ecology and Evolution

> Received: 22 January 2018 Accepted: 11 April 2018 Published: 30 April 2018

#### Citation:

Gaynor ML, Ng J and Laport RG (2018) Phylogenetic Structure of Plant Communities: Are Polyploids Distantly Related to Co-occurring Diploids? Front. Ecol. Evol. 6:52. doi: 10.3389/fevo.2018.00052

### INTRODUCTION

Polyploidy, or whole genome duplication, has been an important force shaping the evolutionary history of vascular plants (Adams and Wendel, 2005; Rieseberg and Willis, 2007; Soltis et al., 2009; Ramsey and Ramsey, 2014). Not only is polyploidy considered an important mechanism of speciation (Coyne and Orr, 2004; Soltis et al., 2014; Zhan et al., 2016), it is also often associated with major phenotypic shifts such as in size, flower color, water use, reproductive system, pollinator specialization, herbivore resistance, and phenology (Levin, 1983; Masterson, 1994; Segraves and Thompson, 1999; Husband et al., 2007; Maherali et al., 2009; Balao et al., 2011; Ramsey and Ramsey, 2014). Genome duplication has also been associated with novel alterations to genomic architecture and regulation that may affect adaptation (Comai, 2005; Madlung, 2013). However, despite the prevalence of polyploid events, the biodiversity implications of genome duplication, and the phenotypic differences often observed between diploids and polyploids, much remains unknown about how far reaching the impact of whole genome duplication is on interactions with other species and at the community level (Laport and Ng, 2017; Segraves, 2017).

Renewed interest in studying polyploidy over the last several decades has bent recent opinion toward acknowledging the significance of genome duplication on patterns of biodiversity (Coyne and Orr, 2004; Soltis et al., 2007; Ramsey and Ramsey, 2014; Laport and Ng, 2017; Segraves, 2017). Yet, the influence of genome duplication on population- and community-level dynamics remains unclear, in part because the evolutionary origin of polyploids may strongly influence the extent of polyploid phenotypic variation. Polyploids formed via the hybridization of two closely related species with partially diverged genomes (allopolyploidy) often exhibit phenotypes that are intermediate to, or outside the range of (i.e., transgressive), the parental species. In contrast, polyploids formed via the union of unreduced gametes within a population (autopolyploidy) often exhibit more subtle phenotypic differences when compared to their diploid progenitors. Historically, the more pronounced phenotypic variation among allopolyploids was considered as being important for interspecific interactions and patterns of biodiversity (Soltis et al., 2007; Ramsey and Ramsey, 2014). Research over the last few decades has shown, however, that the phenotypic, and underlying genetic, variation associated with both allo- and autopolyploids has the potential to influence ecological affinities, and play an important role in facilitating the establishment of new cytotypes, their expansion into a broader range of environmental conditions, and consequently their interactions with other species.

From the extensive body of empirical and theoretical work on the ecology and evolution of polyploids, whole genome duplication can be expected to have cascading effects on interspecific interactions and community-level dynamics (Ramsey and Ramsey, 2014; Čertner et al., 2017; Laport and Ng, 2017; Segraves, 2017). Although the direction and strength of the effect remains unclear, a number of predictions can be made about how species interactions and co-occurrence may be shaped by whole genome duplication based on previous species- and population-level work. For example, when considering first generation polyploids (i.e., tetraploids), it is thought that these neopolyploids must immediately compete with their co-occurring diploid progenitor upon formation while suffering a distinct frequency-dependent reproductive disadvantage. Because the relatively rare tetraploids are most likely to mate with more abundant diploids, this disadvantage, known as minority cytotype exclusion, arises from the lower fitness realized through the production of inviable or infertile triploid hybrid offspring (Hagberg and Ellerström, 1959; Levin, 1975). With few or no potential mates with which to reproduce, the neopolyploid is effectively "bred to death." However, even slight differences between polyploids and diploids, such as phenological shifts in the timing of reproduction, reproductive strategy (e.g., sexual vs. asexual), and ecological differences, may satisfy theoretical requirements for successful escape from minority cytotype exclusion (Husband, 2000). By easing direct ecological competition and promoting assortative mating, phenotypic differences may allow neopolyploids to persist within the range of their diploid progenitors. Present day communities would therefore reflect signatures of these historic events, whereby polyploids will often co-occur with their close diploid relatives.
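The frequency dependence behind minority cytotype exclusion can be illustrated with a deliberately simplified recursion. The Python sketch below is not taken from the cited studies: it assumes random mating, equal fecundity of both cytotypes, no selfing, no new formation of unreduced gametes, and that every diploid × tetraploid mating yields no viable offspring. The function names are hypothetical.

```python
def next_tetraploid_frequency(p: float) -> float:
    """Tetraploid frequency in the next generation under the assumptions above.

    Only within-cytotype matings contribute offspring: 4x x 4x matings occur
    with probability p * p and 2x x 2x matings with probability (1 - p) ** 2.
    """
    within_4x = p * p
    within_2x = (1 - p) * (1 - p)
    return within_4x / (within_4x + within_2x)


def trajectory(p0: float, generations: int) -> list:
    """Tetraploid frequencies over successive generations, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_tetraploid_frequency(freqs[-1]))
    return freqs


rare = trajectory(0.1, 5)    # tetraploids start as the minority cytotype
common = trajectory(0.6, 5)  # tetraploids start as the majority cytotype

print([round(p, 4) for p in rare])
print([round(p, 4) for p in common])
```

Under these assumptions whichever cytotype starts below a frequency of one half declines every generation, which is why assortative mating, selfing, asexual reproduction, or dispersal away from the progenitor are invoked as routes of escape.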

Alternatively, polyploids may overcome minority cytotype exclusion by dispersing to new, unexploited habitats and maintaining the exclusion of their progenitors. Theoretical work predicts that the likelihood of neopolyploids becoming established at their site of origin is very low (Fowler and Levin, 2016), while dispersal and exploitation of novel habitat due to phenotypic differences either accompanying or arising rapidly after genome duplication greatly increases the probability of persistence (Lewis, 1962; Kay, 1969; Leitch and Leitch, 2008; Levin and Soltis, 2017). Indeed, the phenotypic differences associated with polyploidy may be great enough to facilitate the establishment of new cytotypes and their expansion into a broader range of ecological and environmental conditions. For example, polyploids have been documented to differ in ecological niche affinities and adaptive traits (Ramsey, 2011; McIntyre, 2012; Laport et al., 2013; Glennon et al., 2014; Marchant et al., 2016), experience morphological and physiological differences affecting phenological and physiological rates (Beaulieu et al., 2008; Manzaneda et al., 2012; Laport et al., 2016; Rey et al., 2017), have unique interactions with herbivores and pollinators (Thompson et al., 1997; Kennedy et al., 2006; Arvanitis et al., 2007; Halverson et al., 2008; Thompson and Merg, 2008; Roccaforte et al., 2015), and exhibit unique water relations (Maherali et al., 2009) and mycorrhizal associations (Těšitelová et al., 2013). Shifts from sexual to asexual reproduction, or a breakdown of self-incompatibility systems (Comai, 2005; Otto, 2007), could further promote the establishment of polyploids in geographic areas isolated from their diploid progenitors by providing a means of reproduction and population increase. 
If strong ecological differentiation between cytotypes and establishment in geographically isolated areas is the predominant mode of neopolyploid success and persistence, polyploids should more often occur in different communities than their close diploid relatives.

In addition to phenotypic differences between polyploids and diploids, variation at the molecular level also likely bears strongly on community assembly. In particular, genetic changes associated with whole genome duplication could increase the ecological success of polyploids in novel communities. Doubled nuclear DNA content on its own can have cellular phenotypic consequences that alter intracellular stoichiometric relationships and physiological rates, causing shifts to growth rate, gas exchange, and flowering time (Comai, 2005; Beaulieu et al., 2008; Madlung, 2013; Bilinski et al., 2018), which may allow polyploids to outcompete co-occurring diploids. The novel genetic architecture and regulatory environment of duplicated genomes may also lead to greater adaptability, and the larger genome size may be a larger target for functional mutations that could influence adaptation (Comai, 2005; Madlung, 2013; Soltis et al., 2015; Song and Chen, 2015; Mei et al., 2018). For example, the increased genomic content of polyploids presents potential opportunities for rapid paralog subfunctionalization or neofunctionalization that could lead to greater competitive ability or ecological success, and even invasiveness, relative to diploid progenitors (Thompson and Lumaret, 1992; Schlaepfer et al., 2010; te Beest et al., 2011; Green et al., 2013; Pyšek et al., 2013; Nagy et al., 2017). Indeed, the increased genomic/allelic diversity of larger genomes, decreased inbreeding depression, multisomic inheritance, intergenomic recombination, and accelerated epigenetic processes of polyploids have been identified as major factors that may predispose polyploid populations to rapidly exploit novel ecological niches (Comai, 2005; Soltis et al., 2009; Parisod et al., 2010; Green et al., 2013; Madlung, 2013). 
Thus, while the genetic changes associated with whole genome duplication and their influence over ecologically relevant phenotypic shifts may be used as a basis to make predictions about the ecological success of polyploids within communities, it remains relatively unexplored whether polyploids are indeed better competitors in a community context.

One way to investigate the influence of genome duplication on community structure is by analyzing diploid and polyploid co-occurrence within multiple communities using a comparative phylogenetic framework. Although there has been an increase in studies integrating phylogenetic data with questions about community ecology over the last decade (Webb et al., 2002; Cavender-Bares et al., 2006; Emerson and Gillespie, 2008; Vamosi et al., 2009), no studies have explicitly included ploidal information to assess the influence of genome duplication (and associated phenotypes) on community structure. Here, we use a novel approach to examine how polyploids influence phylogenetic community structure by combining ploidal information with phylogenetic analyses of plant communities across the United States. Specifically, we focus on two large plant families that are well represented across North American biomes and in which polyploidy has evolved multiple times, Brassicaceae and Rosaceae, to test (1) whether polyploid species are more distantly related to diploids within the same community than co-occurring diploids are to one another. We expect this phylogenetic pattern if polyploids escaped minority cytotype exclusion by inhabiting a different geographic range compared to their diploid progenitor and close relatives. We also test (2) whether polyploid species tend to exhibit greater ecological success than diploid species, using the relative abundance of polyploids vs. diploids as an indicator of successful establishment within communities. We further compare the abundance of native and non-native species to examine whether species experiencing recent ecological range expansions (i.e., non-native species) also tend to be polyploid.

# MATERIALS AND METHODS

## Community Data Collection

We obtained species composition and abundance data for Brassicaceae and Rosaceae communities across the United States from the National Ecological Observatory Network (NEON; https://www.neonscience.org; Keller et al., 2008). NEON has established sites across the United States and conducted plant surveys of replicated 400 m<sup>2</sup> plots across each site. We specifically focused on Brassicaceae and Rosaceae communities because they are polyploid-rich, broadly represent contrasting life histories, and were present in a large number of NEON communities. We focused on 16 communities (**Figure 1**), each of which had three or more representatives from the respective family for which we could obtain ploidal data (6 Brassicaceae communities, 11 Rosaceae communities; **Figure 2**). For each species, we determined its ploidal level based on scientific literature and online databases (Kew C-value database, http://data.kew.org/cvalues/; Chromosome Count Database, https://ccdb.tau.ac.il; Table S1), as well as its native status following the designation assigned in the United States Department of Agriculture (USDA) PLANTS database (https://plants.usda.gov) (**Figure 2**). While mode of polyploid origin is likely important for interspecific interactions and ecological success, we were unable to consider differences in origin for this study as we could not consistently determine whether a species was an allo- or autopolyploid. As geographic variation in ploidy can be common (Baack, 2005; Kolář et al., 2009; Ståhlberg, 2009; Trávníček et al., 2011; Castro et al., 2012; Laport et al., 2012; Ramsey and Ramsey, 2014; Zozomová-Lihová et al., 2015; Wefferling et al., 2017), we aimed to determine the community-specific ploidal level of each species. When species were reported to comprise multiple ploidal levels for the region around a NEON site (**Figure 2**), we repeated analyses with each ploidy.
When assigning native status to each species, we considered species to be either native or non-native to the lower 48 states.
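Repeating analyses with each reported ploidy amounts to enumerating the alternative ploidy assignments for a community. The sketch below is only an illustration of that bookkeeping: the function name `ploidy_scenarios` and the placeholder "Species B" are hypothetical, and the original analyses were run in R.

```python
from itertools import product


def ploidy_scenarios(reported):
    """Yield one {species: ploidy} assignment per combination of reported levels.

    reported -- dict mapping species name to a list of ploidal levels
                reported for the region around a site
    """
    species = list(reported)
    for combo in product(*(reported[s] for s in species)):
        yield dict(zip(species, combo))


if __name__ == "__main__":
    # Crataegus spathulata was treated as diploid or triploid at JERC;
    # "Species B" and its ploidy are hypothetical placeholders.
    community = {"Crataegus spathulata": [2, 3], "Species B": [4]}
    for scenario in ploidy_scenarios(community):
        print(scenario)
```

Each yielded assignment can then be passed through the same downstream analysis, so that results can be compared across ploidy assumptions.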

## Phylogenetic Reconstruction

As published phylogenies of Brassicaceae and Rosaceae did not include all members of our study communities (Huang et al., 2016; Zhang et al., 2017), we reconstructed phylogenies for each family using sequence data from GenBank and newly generated sequence data for species that did not have publicly available sequence data for our target genetic loci. We focused on one nuclear locus, ITS (internal transcribed spacer), and two chloroplast loci, rbcL (ribulose bisphosphate carboxylase large chain) and matK (maturase K). To generate our own sequences, leaf tissue was obtained from the Rocky Mountain Herbarium (RM) and the Missouri Botanical Gardens Herbarium (MO).

FIGURE 1 | … Forest (BART), Central Plains Experimental Range (CPER), Disney Wilderness Preserve (DSNY), Harvard Forest (HARV), Jones Ecological Research Center (JERC), Moab (MOAB), Klemme Range Research Station (OAES), Onaqui-Ault (ONAQ), Oak Ridge National Laboratory (ORNL), Ordway-Swisher Biological Station (OSBS), Smithsonian Environmental Research Center (SERC), Smithsonian Conservation Biology Institute (SCBI), North Sterling (STER), Talladega National Forest (TALL), Woodworth (WOOD), and University of Notre Dame Environmental Research Center (UNDE).

DNA was extracted using the Qiagen Plant Mini Kit or CTAB DNA extraction method (Doyle and Doyle, 1987). We amplified the gene regions using previously published primers (Table S2) and following PCR protocols available in the Supplementary Materials (Supplementary Material 1). As we had difficulty amplifying ITS for Rosaceae due to polymorphisms in binding sites, we designed a new primer using Primer 3 (Koressaar and Remm, 2007; Untergasser et al., 2012) based on previously sequenced Rosaceae species: ITS\_SGR (5′-AGG TTT GAC AAC CAC CGA TT-3′). We sent PCR products to Genewiz (Cambridge, Massachusetts) for purification and sequencing, and checked sequence quality in Geneious v6.0.5 (Biomatters Ltd., Auckland, NZ).

To ensure that the evolutionary relationships among members of the community were consistent with known relationships, we reconstructed phylogenies that included all species in this study, as well as any other available sequences from GenBank for each family. The inclusion of additional species not occurring within the communities of focus in phylogenetic reconstruction has been shown to reduce error in node age estimates, and consequently in calculations of community phylogenetic diversity metrics (Park et al., 2018). High quality sequence data for the targeted genetic loci were downloaded from GenBank using the PHLAWD pipeline (Smith et al., 2009). We combined GenBank sequences with newly generated sequences, aligned them in Mafft v7 (Katoh et al., 2002) and concatenated the gene regions in Mesquite v3.10 (Maddison and Maddison, 2017). The final data set included 1,912 species for Brassicaceae including five outgroup members (Cleome lutea Hook., Cleome viscosa L., Cleome rutidosperma DC., Moringa oleifera Lam., Polanisia dodecandra (L.) DC.). For Rosaceae, the final data set included 1,450 species including four outgroup members (Rhamnus cathartica L., Ceanothus verrucosus Nutt., Pisum sativum L., Astragalus membranaceus (Fisch.) Bunge).

We used Bayesian inference to reconstruct a time-calibrated phylogeny for each family using BEAST2 v2.4.5 (Bouckaert et al., 2014) on the CIPRES Science Gateway (www.phylo.org). For the phylogenetic reconstruction of Brassicaceae, the stem and crown nodes were constrained with a lognormal offset of 59.5 and 42.0 million years ago (Ma) (mean 0.01, standard deviation 1.0), respectively, following Huang et al. (2016). For the phylogenetic reconstruction of Rosaceae, the stem and crown nodes were constrained with a lognormal offset of 106.50 and 95.09 Ma (mean 0.01, standard deviation 1.0), respectively, following Zhang et al. (2017). We conducted two runs of 120 million generations and sampled trees every 12,000 generations. We used Tracer v1.6 (Rambaut et al., 2014) to verify that both runs reached stationarity and converged on the posterior distribution of trees. As identified in Tracer, we discarded 10% of the trees from each run as burn-in, then combined and summarized trees as a maximum clade credibility (MCC) tree using LogCombiner and TreeAnnotator (included as part of the BEAST2 package). We pruned all species that were not included in each of our study communities from the trees prior to site-specific analyses.
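The sampling scheme above implies a specific number of posterior trees per family. The following is a small arithmetic sketch only: it ignores the extra sample logged at state 0, and the function name is hypothetical.

```python
def posterior_tree_count(chain_length, sample_every, burnin_percent, n_runs):
    """Return (trees sampled per run, trees kept per run, trees combined)."""
    per_run = chain_length // sample_every
    discarded = per_run * burnin_percent // 100  # burn-in trees removed per run
    kept = per_run - discarded
    return per_run, kept, kept * n_runs


if __name__ == "__main__":
    # Two runs of 120 million generations, sampled every 12,000, 10% burn-in.
    print(posterior_tree_count(120_000_000, 12_000, 10, 2))
```

With the settings reported here, each run yields about 10,000 sampled trees, of which 9,000 remain after burn-in, so roughly 18,000 trees would be summarized in the MCC tree.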

## Diploid and Polyploid Phylogenetic Relationships

We used two approaches to determine whether polyploid species are more distantly related to diploids within the same community than co-occurring diploids are to one another. First, we used a broad-scale approach to investigate patterns of phylogenetic relatedness across all sites by calculating the phylogenetic distance between diploids and their closest diploid relative within the same community (nearest taxon distance; NTD<sub>2x−2x</sub>), and comparing these distances to the phylogenetic distance between polyploids and their most closely related, co-occurring diploid species (NTD<sub>polyploid−2x</sub>). We pooled these values across sites and compared NTD<sub>2x−2x</sub> to NTD<sub>polyploid−2x</sub> by conducting a Mann-Whitney U test using the wilcox.test function in R. We also evaluated our hypothesis of closer relationships between co-occurring diploids than among co-occurring diploids and polyploids by comparing the proportion of NTD<sub>2x−2x</sub> and NTD<sub>polyploid−2x</sub> comparisons that fell below a threshold of the mean nearest taxon distance (MNTD) of the family-level phylogeny. MNTD was calculated using the cophenetic.phylo function in the ape R package (Paradis et al., 2004).
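The nearest taxon distances described above can be sketched for a single community as follows. This is an illustrative outline, not the authors' R code: the species labels, the pairwise distances, and the threshold value are all hypothetical, and in practice the distances would come from the pruned time-calibrated tree.

```python
def nearest_diploid_distances(dist, ploidy):
    """Split nearest-diploid distances into diploid and polyploid focal species.

    dist   -- dict mapping frozenset({a, b}) to a phylogenetic distance
    ploidy -- dict mapping species name to ploidal level (2 = diploid)
    Returns (distances for diploid focal species,
             distances for polyploid focal species).
    """
    diploids = [s for s, level in ploidy.items() if level == 2]
    ntd_2x, ntd_poly = [], []
    for sp, level in ploidy.items():
        others = [d for d in diploids if d != sp]  # never compare to itself
        if not others:
            continue  # no co-occurring diploid to compare against
        nearest = min(dist[frozenset((sp, d))] for d in others)
        (ntd_2x if level == 2 else ntd_poly).append(nearest)
    return ntd_2x, ntd_poly


def proportion_below(values, threshold):
    """Fraction of distances that fall below a threshold (e.g., family MNTD)."""
    return sum(v < threshold for v in values) / len(values)


if __name__ == "__main__":
    # Hypothetical community: two diploids (A, B) and one tetraploid (C).
    ploidy = {"A": 2, "B": 2, "C": 4}
    dist = {frozenset(("A", "B")): 4.0,
            frozenset(("A", "C")): 20.0,
            frozenset(("B", "C")): 22.0}
    ntd_2x, ntd_poly = nearest_diploid_distances(dist, ploidy)
    print(ntd_2x, ntd_poly)
    print(proportion_below(ntd_2x, 10.0), proportion_below(ntd_poly, 10.0))
```

Pooling the two returned lists across communities gives the samples that would be compared with a Mann-Whitney U test and against the family-level MNTD threshold.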

Second, we examined patterns of phylogenetic relatedness within each site by testing whether the MNTD between polyploids and diploids (MNTD<sub>polyploid−2x</sub>) exceeded MNTD<sub>2x−2x</sub> by more than expected by chance. We employed a simulation approach by comparing the observed metric MNTD<sub>polyploid−2x</sub> / MNTD<sub>2x−2x</sub> within each community to a null distribution generated by replacing polyploid community members with randomly drawn species from a pool of polyploids from all sites, and recalculating the MNTD<sub>polyploid−2x</sub> / MNTD<sub>2x−2x</sub> metric for the new community. Our null distribution comprised 1,000 random communities per site. We considered polyploids to be more distantly related to diploids than expected by chance if the MNTD<sub>polyploid−2x</sub> / MNTD<sub>2x−2x</sub> metric was greater than 1 and was greater than 95% of the null distribution (P < 0.05). Any communities that did not have both diploid and polyploid species (Rosaceae: DSNY; Brassicaceae: STER, OAES) or only had one diploid or polyploid representative (Rosaceae: WOOD) were excluded from these analyses. All MNTD calculations were performed using the ses.mntd function in the picante R package (Kembel et al., 2010).
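The randomization described above can be outlined as follows. This is a minimal sketch under stated assumptions rather than the picante-based analysis itself: all function names are hypothetical, the toy distance is a one-dimensional stand-in for cophenetic distances on the tree, and the P-value is taken as the share of null ratios at least as large as the observed ratio.

```python
import random


def mean_nearest_diploid(focal, diploids, dist):
    """Mean distance from each focal species to its nearest other diploid."""
    nearest = []
    for sp in focal:
        # Requires at least one diploid other than sp, which matches the
        # exclusion of communities with a single diploid representative.
        nearest.append(min(dist(sp, d) for d in diploids if d != sp))
    return sum(nearest) / len(nearest)


def mntd_ratio(polyploids, diploids, dist):
    """MNTD(polyploid-2x) divided by MNTD(2x-2x) for one community."""
    return (mean_nearest_diploid(polyploids, diploids, dist)
            / mean_nearest_diploid(diploids, diploids, dist))


def null_test(polyploids, diploids, pool, dist, n_iter=1000, seed=1):
    """Return (observed ratio, share of null ratios >= observed)."""
    rng = random.Random(seed)
    observed = mntd_ratio(polyploids, diploids, dist)
    null = [mntd_ratio(rng.sample(pool, len(polyploids)), diploids, dist)
            for _ in range(n_iter)]
    return observed, sum(r >= observed for r in null) / n_iter


if __name__ == "__main__":
    # Hypothetical species placed on a line; a real analysis would use
    # pairwise distances from the time-calibrated phylogeny.
    position = {"D1": 0.0, "D2": 2.0, "P1": 1.0, "P2": 3.0, "P3": 5.0, "P4": 40.0}

    def toy_dist(a, b):
        return abs(position[a] - position[b])

    observed, p = null_test(["P4"], ["D1", "D2"],
                            ["P1", "P2", "P3", "P4"], toy_dist)
    print(observed, p, observed > 1 and p < 0.05)
```

The final line applies the two-part criterion from the text: the ratio must exceed 1 and must also exceed 95% of the null distribution.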

## Tests of Polyploid Ecological Success

We identified whether polyploid species showed patterns consistent with having greater ecological success than diploids by using species abundance as an indicator of successful establishment within a community (Levin, 1975; Callaway and Aschehoug, 2000; Cleland et al., 2004). Specifically, we tested whether polyploids occurred at greater total relative abundance than diploids within each community by conducting a Mann-Whitney U-test with the wilcox.test function in R. We further assessed whether differences in abundance could be attributed to non-native species, reflecting ecological success of recent range expansions, by testing whether the total relative abundance of diploid and polyploid species significantly differed between natives and non-natives. We tested significance using a Kruskal-Wallis test and when appropriate, followed the analysis with Dunn's post-hoc test. This was performed using the kruskal.test and the dunnTest (FSA package) functions, respectively, in R.
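The rank-based comparison of abundances can be sketched in a few lines. The analyses reported here used wilcox.test in R. The version below is a standard-library illustration that computes the Mann-Whitney U statistic and a two-sided permutation P-value for small samples. The abundance values are hypothetical, not data from this study.

```python
from itertools import combinations


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for sample x: count of (x, y) pairs with x > y, ties 0.5."""
    ranks = rank_with_ties(list(x) + list(y))
    return sum(ranks[:len(x)]) - len(x) * (len(x) + 1) / 2


def exact_p_value(x, y):
    """Two-sided P-value from enumerating every relabelling of the pooled data."""
    ranks = rank_with_ties(list(x) + list(y))
    n1, n2 = len(x), len(y)
    centre = n1 * n2 / 2  # expected U under the null hypothesis

    def u_of(indices):
        return sum(ranks[i] for i in indices) - n1 * (n1 + 1) / 2

    observed = abs(u_of(range(n1)) - centre)
    splits = list(combinations(range(n1 + n2), n1))
    extreme = sum(abs(u_of(s) - centre) >= observed - 1e-12 for s in splits)
    return extreme / len(splits)


if __name__ == "__main__":
    # Hypothetical total relative abundances in four communities.
    polyploid = [0.80, 0.75, 0.90, 0.70]
    diploid = [0.20, 0.25, 0.10, 0.30]
    print(mann_whitney_u(polyploid, diploid), exact_p_value(polyploid, diploid))
```

Full enumeration is only practical for small samples such as the handful of communities per family considered here; larger samples would call for a normal approximation or a statistics library.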

# RESULTS

## Community Data Collection

Our six Brassicaceae communities comprised 3–8 Brassicaceae species while our eleven Rosaceae communities comprised 3–24 Rosaceae species (**Figure 2**). Most Brassicaceae in our communities were 2x, 4x, or 6x, with the exception of one species where the ploidy ranged from 20x to 30x [Cardamine concatenata (Michx.) O. Schwarz.; Kreiner et al., 2017; Table S1]. In Brassicaceae communities, 33–100% of the species were polyploid, and 22–75% of the species were non-native (**Figure 2**). In Rosaceae communities, species ranged in ploidal level from 2x to 12x, with 33–86% of the species being polyploid. These communities also ranged from not having any non-native species to 44% of the species being non-native.

## Phylogenetic Reconstruction

We generated 63 new sequences for species missing sequence data for our target loci (GenBank accessions KY427264–KY427326; Table S3). The final Brassicaceae alignment comprised 1,912 species and was 8,242 basepairs (bp) in length, while the final Rosaceae alignment comprised 1,450 species and was 12,007 bp in length (TreeBASE accession: S22405). All study species within the communities were represented in our time-calibrated phylogenetic trees, and both phylogenies for Brassicaceae and Rosaceae community members were congruent in topology to previously published phylogenies (Huang et al., 2016; Zhang et al., 2017; **Figure 2**).

## Diploid and Polyploid Phylogenetic Relationships

Our broad-scale analysis examining phylogenetic patterns of relatedness between co-occurring polyploids and diploids vs. co-occurring diploids found that across all sites, NTD<sub>polyploid−2x</sub> was significantly greater than NTD<sub>2x−2x</sub> for both Brassicaceae (P < 0.05) and Rosaceae (P << 0.01; **Figure 3**). Further supporting this result for both families was that a larger proportion of NTD<sub>2x−2x</sub> comparisons fell below the MNTD threshold compared to NTD<sub>polyploid−2x</sub>. For Brassicaceae communities, 76.9% of diploid-diploid comparisons and 34.1% of polyploid-diploid comparisons fell below the Brassicaceae MNTD, while for Rosaceae, 84.1% of diploid-diploid comparisons and 53.8% of polyploid-diploid comparisons fell below the Rosaceae MNTD (**Figure 3**). This pattern suggests that fewer polyploids co-occur with a close diploid relative compared to diploids.

When examining each site, MNTD<sub>polyploid−2x</sub> was greater than MNTD<sub>2x−2x</sub> for three of the four Brassicaceae communities (MNTD<sub>polyploid−2x</sub> / MNTD<sub>2x−2x</sub> > 1; **Figure 4**). However, MNTD<sub>polyploid−2x</sub> / MNTD<sub>2x−2x</sub> was only significantly greater than expected by chance at one site (ONAQ; P < 0.05) in our simulation analyses. At MOAB, although MNTD<sub>polyploid−2x</sub> was greater than MNTD<sub>2x−2x</sub>, the phylogenetic distance was smaller than expected by chance (lower 2.5% of the null distribution).

Within Rosaceae communities, MNTD<sub>polyploid−2x</sub> was greater than MNTD<sub>2x−2x</sub> for seven of the nine communities, but none of these differences were significantly different from the random expectation in our simulation analyses (**Figure 4**). For one community (JERC), we considered Crataegus spathulata Michx. to be either a diploid or a triploid. When it was analyzed as a diploid, we found that MNTD<sub>polyploid−2x</sub> was smaller than expected by chance, although overall, MNTD<sub>polyploid−2x</sub> was still greater than MNTD<sub>2x−2x</sub> (MNTD<sub>polyploid−2x</sub> / MNTD<sub>2x−2x</sub> > 1). However, we did not find any significant patterns when C. spathulata was treated as a triploid in the analyses. At another community (HARV), we performed analyses with Rubus setosus Bigelow as either diploid or triploid; however, there was no effect on the overall results.

## Tests of Polyploid Ecological Success

In Brassicaceae communities, polyploids tended to be more abundant than diploids (P > 0.05; **Figure 5A**). Though not a significant pattern, in all communities that included both diploid and polyploid species, ≥70% of the individuals were polyploid. The greater abundance of polyploids appears to be driven by non-native polyploids, which tended to be greater in number than native polyploids (**Figure 5C**). However, we did not find a significant difference in the abundance of non-native and native diploid or polyploid individuals within any of the communities (P > 0.05). This may be due to the small number of Brassicaceae communities included in the analysis, or could suggest that the ecological success of polyploid species is not the result of non-native species experiencing recent range expansions.

In Rosaceae communities, we found no significant difference between diploid and polyploid abundance (P > 0.05; **Figure 5B**). When polyploids and diploids were categorized as native or non-native, however, we found that native species were significantly more abundant than co-occurring non-native species for both diploids and polyploids (P < 0.05; **Figure 5D**) suggesting that ecological success is not necessarily associated with genome duplication.

FIGURE 5 | Relative abundance of diploids and polyploids in (A,C) Brassicaceae and (B,D) Rosaceae communities. (A) In Brassicaceae communities, polyploids tend to be more abundant than diploids, though the difference was not significant (P > 0.05). (B) In Rosaceae communities, diploids and polyploids do not significantly differ in abundance (P > 0.05). (C) In Brassicaceae communities, non-native (NN) polyploids tend to occur at a greater abundance than the other groups, but the difference is not significant (P > 0.05). (D) In Rosaceae communities, native species (N) are significantly more abundant than non-native species for both diploid and polyploid species (P < 0.05). Letters above the distributions in (D) indicate significantly different groups. The diamond and error bars indicate the mean of the distribution ± 1 standard error.

# DISCUSSION

Polyploidy is now widely accepted as a mechanism of reproductive isolation and plant speciation, but much remains to be clarified about the influence of genome duplication on population- and community-level dynamics. In this study, we draw upon the extensive body of work conducted on the ecology and evolution of polyploids to predict and test how genome duplication may affect phylogenetic community structure. By examining two large flowering plant families with high incidences of polyploidy using phylogenetic data and cytogeographic information from a diversity of sources, we found that communities may be shaped in diverse ways by genome duplication and that the impacts of polyploidy are far from clear-cut. Polyploidy appears to influence patterns of phylogenetic relationships and species co-occurrence in Brassicaceae and Rosaceae communities, but these patterns appear to be lineage-specific rather than due to properties intrinsic to all genome duplication events. These results reflect the complexities and multifaceted consequences of polyploidy (Soltis et al., 2016), but our study also highlights the current paucity of information on ploidal variation at fine spatial scales (especially at cytotype contact zones), which may have contributed, in part, to some inconsistencies in our results.

## Patterns of Polyploid Community Structure Are Lineage-Specific

For both Brassicaceae and Rosaceae, we found that ploidal variation is a common feature of communities across the United States. We especially observed a higher diversity of ploidal levels, and higher overall ploidies, among the Rosaceae. Of the 11 Rosaceae communities, all but one comprised both diploid and polyploid species, while two of the six Brassicaceae communities were either composed of only polyploid species or of only diploid species. It is not immediately clear why Rosaceae species would exhibit a greater diversity of ploidies and higher ploidal complements, or why Rosaceae communities almost always included polyploids. This pattern may simply be due to the greater number of Rosaceae species present in the included communities, or to Rosaceae being an older family than Brassicaceae (∼95 vs. ∼42 million years old, respectively; Huang et al., 2016; Zhang et al., 2017), allowing more time for the evolution of greater ploidal diversity. However, it is notable that, compared to Brassicaceae, Rosaceae species tend to have perennial life histories. The longer-lived life histories of perennial species may satisfy conditions that promote unreduced gamete and polyploid formation, or polyploid phenotypes may achieve higher fitness when paired with longer-lived life histories. Previous studies suggest that polyploid populations may arise more regularly in herbaceous species, but not necessarily in short-lived or annual species (Stebbins, 1938; Grant, 1981; Ramsey and Schemske, 2002; Zenil-Ferguson et al., 2017). It is possible that, on average, the Rosaceae species included in our analyses fall into a "sweet spot" of non-woody perennial life-history traits favoring genome duplication.

Our phylogenetic analyses of Brassicaceae and Rosaceae community structure indicate that in both families, polyploid species tend to be more distantly related to co-occurring diploids than diploids are to each other. Indeed, the proportion of diploid-diploid relationships falling below the MNTD of the family-level phylogeny was greater than the proportion of polyploid-diploid relationships (**Figure 3**). This suggests that the polyploid members of these communities may not have arisen in situ, but rather these polyploids are likely to have arisen in disjunct communities, or from interspecific hybridizations (i.e., allopolyploidy; Symonds et al., 2010), before dispersing to the surveyed communities. This is consistent with polyploids escaping minority cytotype exclusion by inhabiting differing geographic or ecological areas compared to their close relatives (Levin, 1975; Husband, 2000; Čertner et al., 2017). Alternatively, this phylogenetic pattern could have arisen if polyploids did establish within the same community as their diploid ancestors, but interploidal competition resulted in the local extinction of the diploid. Further studies incorporating a temporal aspect to community structure to capture interspecific interactions through time would allow us to distinguish between these two alternatives.

When considering each site separately, we did not find a consistent pattern in our simulation analyses. Although one Brassicaceae community showed polyploids to be more distantly related to diploids than expected by chance, at all other sites, we did not find a significant pattern, or found that polyploids were more closely related to diploids than expected, despite phylogenetic distances between polyploids and diploids being larger than the distances among diploids. Together, the results from our broad-scale analyses and site-specific simulation analyses suggest that polyploidy can play an important role in shaping community structure but that the effect is species-specific. For example, the extent to which polyploids differ in phenotype and genetic composition could influence interactions with co-occurring species and the mode of escape from minority cytotype exclusion. Polyploids can exhibit wider variation in phenotypes compared to diploids, ranging from striking to subtle, which may depend in part upon the mode of polyploid formation. While allopolyploids often exhibit phenotypes that are intermediate to the parental species, the combination of two evolutionarily differentiated genomes, and their attendant regulatory elements, can sometimes produce transgressive phenotypes outside the range of variation harbored by either parental species (McCarthy et al., 2015, 2017). In contrast, autopolyploids often exhibit more subtle phenotypic differences when compared to their diploid progenitors (Maherali et al., 2009; Thompson et al., 2015). Therefore, our lack of a consistent result could be due, at least in part, to the inherent genetic differences between autopolyploids and allopolyploids, and further investigations examining how these two modes of polyploid formation may differ in their influence on community structure would go far toward illuminating interspecific interactions involving polyploids.

The apparent lineage-specific effect of polyploidy on phylogenetic community structure may also be due to varying ecological niche affinities and/or differences in life history. The hypothesized association between greater ploidal diversity and perennial life history (Müntzing, 1936; Stebbins, 1938; Grant, 1981) may mean that genome duplication shapes communities dominated by perennial species more strongly than communities comprising mostly annual species (Stebbins, 1938; Leitch and Leitch, 2012; Zenil-Ferguson et al., 2017). Though the incidence of polyploidy among woody species (which also tend to be perennial) is lower than among herbaceous species, this may consequently mean that the ecoregions or habitats dominated by perennial species (e.g., forests, woodlands, shrublands) are influenced more strongly by genome duplication than habitats where annual and herbaceous species dominate (e.g., grasslands, meadows). Our findings clearly provide motivation for broader investigations of the differences in impact on community structure between polyploid plant species with differing life histories.

## Polyploids Not Consistently More Ecologically Successful Than Diploids

Polyploidy has classically been argued to be an important enabler of plant invasions and the exploitation of novel ecological niches (Pandit et al., 2014; Ramsey and Ramsey, 2014). Indeed, chromosome number has been identified as a correlate of invasiveness (Pyšek et al., 2013), and non-native polyploids in some flora are more likely to successfully become naturalized than diploid species (Nagy et al., 2017). As a measure of ecological success, a greater relative abundance of non-native species within a community should reflect their ability to successfully occupy and exploit novel habitat or outcompete and displace resident species (Levin, 1975; Callaway and Aschehoug, 2000; Cleland et al., 2004; te Beest et al., 2011). In our study, we found opposing patterns within Brassicaceae and Rosaceae communities. Brassicaceae polyploids showed patterns of abundance consistent with being more ecologically successful than diploids, which may have been driven by non-native species. Although there was not a significant difference between diploid and polyploid abundance, perhaps due to the relatively small sample of Brassicaceae sites analyzed, it is striking that in all communities that had both diploid and polyploid species, polyploids made up over 70% of the total relative abundance. On the other hand, Rosaceae diploid species were just as abundant as polyploids, and native species appeared to be more ecologically successful, with higher abundances, than the non-native species regardless of ploidy, suggesting that ecological success is not always a correlate of non-native and/or polyploid species.

The lack of a clear pattern for greater non-native polyploid abundance relative to diploids in Brassicaceae and Rosaceae communities is consistent with the varying findings of prior studies on invasive polyploids. For example, although many polyploids are invasive (Thompson, 1991; Pandit et al., 2006), species with smaller genome sizes have also been found to occur at higher species abundance, especially among annual species (Herben et al., 2012), and are more likely to be invasive (Grotkopp et al., 2002; Pandit et al., 2006, 2014; Kubešová et al., 2010; Lavergne et al., 2010; Herben and Goldberg, 2014; Schmidt et al., 2017). These counterintuitive findings may also reflect species-specific effects where a polyploid's potential for successful establishment and population expansion within a community may be highly dependent upon species-specific attributes, life histories, source locations, or the local environment of the community. For example, in anthropogenically disturbed habitats, non-native or invasive species are often polyploid (Lumaret and Borrill, 1988; Ramsey and Schemske, 1998). The importance of source locations and the ecology of the non-native range can also be seen in English Ivy (Hedera spp.), where the observation that diploids are invasive on the east coast of North America and tetraploids are invasive on the west coast of North America is thought to be due to adaptation that has occurred within the native European range, followed by subsequent exploitation of similar habitat within the invasive range (Green et al., 2013). Moreover, different cytotypes can also vary in ecological attributes and fitness across their range (McIntyre and Strauss, 2017), further nuancing the probability of establishment success within a community.

Conflicting observations of polyploid ecological success relative to diploids may also be due to the eco-evolutionary dynamics that occur over ecological timescales that affect interspecific competition and adaptation (Yoshida et al., 2003; Hairston et al., 2005; Reznick, 2013; DeLong et al., 2016). It is possible that when considered over time, the polyploid species observed in Brassicaceae and Rosaceae communities may be superior competitors that are in the process of displacing resident diploid species (or other ploidies). Alternatively, the polyploid species may be transient or ephemeral community members, documented at the present moment in time, and will eventually be displaced by the resident diploid species (Čertner et al., 2017). Additional studies incorporating phenotypic traits, and temporal data on species occurrence and abundance are needed to parse these alternatives and identify the underlying drivers of community structure. NEON's mission to repeatedly survey these sites over the next 30 years may provide an avenue to examine how community structure changes temporally, and offer insight into how polyploids and diploids interact within communities.

Observations that polyploids are not always ecologically superior suggest that polyploidy per se may have limited influence on the successful establishment of a population, or that the effects of genome duplication may not be uniformly predictable after polyploidy "primes the pump." This can be seen in studies explicitly examining ecological differences between diploids and polyploids that show variable patterns of ecological niche divergence for both auto- and allopolyploids (Glennon et al., 2014; Marchant et al., 2016). Studies involving synthetically generated polyploids have further demonstrated that interploidal trait differences only partially arise as a direct consequence of polyploidy, and similar studies in established polyploids are consistent with genome duplication either representing or generating intra-population variation that can be elaborated upon by natural selection (e.g., Husband and Schemske, 2000; Raabová et al., 2008; Ramsey, 2011; Laport et al., 2016). Additional studies incorporating ecological data (i.e., climate, soil, water availability, pollinators, etc.) would likely provide greater detail about diploid and polyploid differences at the community level in both native and non-native systems, and should be undertaken for a broader range of species (Kolář et al., 2017). Yet, additional comparative studies examining multiple diploid-polyploid pairs would go far in disentangling the influence of lineage- or cytotype-specific life history attributes, functional traits, and genomic contributions on the adaptive potential of genome duplication for range expansion and the establishment of non-native species within communities.

# The Need for Greater Documentation of Geographic Ploidal Variation

Our study highlights the need for better documentation of intraspecific ploidal variation in a geographical context to better understand the role of genome duplication on plant community structure. Our characterization of members within a community was reliant upon local-scale documentation of ploidal variation, but we often found a paucity of available geographically explicit intraspecific ploidy data. Despite the known prevalence of geographic variation in ploidy within species (e.g., Baack, 2005; Kolář et al., 2009; Ståhlberg, 2009; Trávníček et al., 2011; Castro et al., 2012; Laport et al., 2012; Zozomová-Lihová et al., 2015; Wefferling et al., 2017; reviewed in Ramsey and Ramsey, 2014), species harboring populations differing in ploidy have historically been geographically under-sampled. Modern technologies, such as high throughput flow cytometry screening for DNA content (Kron et al., 2007), have improved our ability to identify intraspecific ploidal variation, representing potential cryptic biodiversity, and can facilitate tying phenotypic variation to different ploidies within polyploid complexes. Furthermore, new genomic tools and the ever-increasing trove of genomic data for non-model organisms could be used in post-hoc analyses to further reveal novel cytotype variation (e.g., modifications to genotype-by-sequencing approaches; Gompert and Mock, 2017). The implementation of these approaches, paired with broader usage of electronic databases (e.g., Kew C-value database, Chromosome Count Database) and inclusion of ploidy or genome size information on herbarium specimens will facilitate the documentation of polyploid complexes and further aid explorations of polyploid biodiversity and its influence on community structure.

#### CONCLUSION AND PERSPECTIVE

This is an exciting time to study the ecological and evolutionary implications of polyploidy at the population and community level. The growing body of work on polyploid evolution and population-level dynamics suggests that polyploidy may potentially have cascading effects on communities, yet few studies have explicitly tested the effect genome duplication has on community structure. Our novel study on Brassicaceae and Rosaceae communities suggests that the effects of genome duplication on community structure may often be lineage-specific, but polyploidy should still be considered as a potentially important driver of biodiversity patterns given the pervasiveness of genome duplication among vascular plants. Our findings contribute to the increasing number of studies highlighting the complexity and multifaceted consequences of whole genome duplication (reviewed in Ramsey and Ramsey, 2014; Soltis et al., 2016). Although explicitly population-level studies may reveal the processes underlying the pattern (e.g., inter-trophic-level interactions such as with herbivores, pollinators, mycorrhiza, and other microbial symbionts; reviewed in Segraves, 2017), macro-scale studies such as ours complement the many population-level studies of polyploids by providing a "zoomed out" perspective on general patterns and a comparative evaluation of a greater diversity of plant species and life histories, and by offering nuance on how different evolutionary lineages may interact within communities comprising multiple ploidies.

At the same time, our understanding of the effect of polyploidy on community structure may have been hindered by the paucity of available geographically meaningful data on intraspecific ploidal variation, and the difficulty in compiling existing data from scattered literature reports. Alongside the recognized need to characterize intraspecific genetic and trait variation to understand their subsequent effects on community structure (Hughes et al., 2008; Bolnick et al., 2011), we urge continued emphasis on the characterization and documentation of ploidal variation across species' ranges. Such information will greatly aid comparative studies at the population and community level, and help shed light on how such a common, but profound, mutation influences the evolution of species and those around them.

### AUTHOR CONTRIBUTIONS

All authors contributed to the design of the research and writing the paper. MG collected the data, and MG and JN performed statistical analyses.

# FUNDING

This research was supported by a National Science Foundation grant to JN and RL (NSF-EF 1550813), an REU supplement to that grant, and a National Science Foundation grant to RL (NSF-DEB-1556371).

## ACKNOWLEDGMENTS

We thank Vivianna Sanchez and William Weaver for their assistance in data collection and interpretation, and Chelsea Pretz for fruitful discussions in planning this study. We also thank Stacey D. Smith for generously sharing laboratory space to conduct this research.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2018.00052/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Gaynor, Ng and Laport. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Spindle Dynamics Model Explains Chromosome Loss Rates in Yeast Polyploid Cells

Ivan Jelenić<sup>1</sup>, Anna Selmecki<sup>2</sup>, Liedewij Laan<sup>3</sup>\* and Nenad Pavin<sup>1</sup>\*

*<sup>1</sup> Department of Physics, Faculty of Science, University of Zagreb, Zagreb, Croatia, <sup>2</sup> Department of Medical Microbiology and Immunology, Creighton University Medical School, Omaha, NE, United States, <sup>3</sup> Department of Bionanoscience, Faculty of Applied Sciences, Kavli Institute of NanoScience, Delft University of Technology, Delft, Netherlands*

Faithful chromosome segregation, driven by the mitotic spindle, is essential for organismal survival. Neopolyploid cells from diverse species exhibit a significant increase in mitotic errors relative to their diploid progenitors, resulting in chromosome nondisjunction. In the model system *Saccharomyces cerevisiae,* the rate of chromosome loss in haploid and diploid cells is measured to be one thousand times lower than the rate of loss in isogenic tetraploid cells. Currently it is unknown what constrains the number of chromosomes that can be segregated with high fidelity in an organism. Here we developed a simple mathematical model to study how different rates of chromosome loss in cells with different ploidy can arise from changes in (1) spindle dynamics and (2) a maximum duration of mitotic arrest, after which cells enter anaphase. We apply this model to *S. cerevisiae* and show that it can explain the observed rates of chromosome loss in cells of different ploidy. Our model describes how small increases in spindle assembly time can result in dramatic differences in the rate of chromosome loss between cells of increasing ploidy and predicts the maximum duration of mitotic arrest.

Keywords: polyploidy, spindle assembly, chromosome loss, chromosome segregation, cell cycle regulation, theoretical modeling, genome instability

### INTRODUCTION

Chromosome segregation is an important, highly conserved cellular function. A complex network of interacting components segregates chromosomes with high precision. However, rare errors in chromosome segregation are observed, and the error rate generally increases when the number of sets of chromosomes (ploidy, n) increases within the cell (Comai, 2005). Increased rates of chromosome loss are observed in autopolyploid cells, within yeasts, plants, and human cells (Mayer and Aguilera, 1990; Song et al., 1995; Ganem et al., 2009). For example, autopolyploidization of Phlox drummondii results in an immediate loss of approximately 17% of genomic DNA in the first generation and up to 25% after three generations (Raina et al., 1994). Autopolyploidization can also cause tumorigenesis, and these tumors are marked by significant chromosome gain/loss events (Fujiwara et al., 2005; Zack et al., 2013). Therefore, the general observation is that many newly formed polyploid cells have increased chromosome segregation errors relative to isogenic diploid cells, and the cause of these errors is not known.

#### Edited by:

*Richard John Abbott, University of St Andrews, United Kingdom*

#### Reviewed by:

*Pirita Paajanen, John Innes Centre (JIC), United Kingdom Yongshuai Sun, Xishuangbanna Tropical Botanical Garden (CAS), China*

#### \*Correspondence:

*Liedewij Laan l.laan@tudelft.nl Nenad Pavin npavin@phy.hr*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

> Received: *02 March 2018* Accepted: *13 July 2018* Published: *06 August 2018*

#### Citation:

*Jelenić I, Selmecki A, Laan L and Pavin N (2018) Spindle Dynamics Model Explains Chromosome Loss Rates in Yeast Polyploid Cells. Front. Genet. 9:296. doi: 10.3389/fgene.2018.00296*


The normal sexual life cycle of the budding yeast Saccharomyces cerevisiae includes haploid (n = 1, 16 chromosomes) and diploid cells (n = 2, 32 chromosomes). In addition, tetraploid cells (n = 4, 64 chromosomes) are rarely found in nature, but can be generated in the lab by mating two diploid cells. In this organism, the effect of ploidy on the rate of chromosome loss is very pronounced: haploid and diploid cells have rates of chromosome loss around 10<sup>−6</sup> chromosomes per cell per cell division, whereas tetraploid cells have a rate around 10<sup>−3</sup> (Mayer and Aguilera, 1990; Storchová et al., 2006). The rate of chromosome loss was measured with isogenic haploid, diploid, and tetraploid strains that each contained a single genetically marked chromosome. In these assays the cells that have lost the chromosome markers are quantified, and the rate of loss is determined by fluctuation analysis (Lea and Coulson, 1949). Moreover, polyploid laboratory yeast strains tend to lose chromosomes and reduce to a diploid level in experimental evolution studies (Gerstein et al., 2006; Selmecki et al., 2015). Thus, the genomic stability of a cell line is to a large extent related to cellular ploidy, but how ploidy alters chromosome segregation is not known (Otto and Whitton, 2000).

Chromosome segregation is driven by the mitotic spindle, a self-organized micro-machine composed of microtubules and associated proteins (Pavin and Tolić, 2016; Prosser and Pelletier, 2017). In budding yeast, during spindle assembly, spindle poles nucleate microtubules, which grow in a direction parallel with the central spindle or in arbitrary directions within the nucleus (Winey et al., 1995; O'Toole et al., 1997). A microtubule that comes into the proximity of a kinetochore (KC), a protein complex at the sister chromatids, can attach to the KC and thus establish a link between chromatids and spindle poles, as shown in vitro (Mitchison and Kirschner, 1985; Akiyoshi et al., 2010; Gonen et al., 2012; Volkov et al., 2013), in vivo (Tanaka et al., 2005), and theoretically (Hill, 1985). Theoretical models have quantitatively shown that this process can contribute to spindle assembly in yeasts and in mammalian cells (Wollman et al., 2005; Paul et al., 2009; Kalinina et al., 2013; Vasileva et al., 2017). Prior to chromosome separation, all connections between chromatids and the spindle pole must be established, and erroneous KC-microtubule attachments must be corrected, for which several theoretical models have been proposed (Zaytsev and Grishchuk, 2015; Tubman et al., 2017). These connections are monitored by the spindle assembly checkpoint (Li and Murray, 1991). Once KCs are properly attached and chromosomes congress to the metaphase plate (Gardner et al., 2008), the spindle assembly checkpoint is silenced and microtubules separate the sister chromatids (Musacchio and Salmon, 2007).

Cells that cannot satisfy the spindle assembly checkpoint are arrested in mitosis. However, cells can break out of the arrest after several hours, an event that is often referred to as "mitotic slippage" (Minshull et al., 1996; Rudner and Murray, 1996; Rieder and Maiato, 2004), and this mitotic exit is molecularly regulated (Novák et al., 1999; Rudner et al., 2000). Even though the molecular mechanisms that regulate cell cycle and spindle assembly are emerging, it is an open question as to how changes in ploidy can have such a dramatic effect on the rates of chromosome loss.

In this paper, we introduce a theoretical model for chromosome loss in cells with different ploidy. We test the hypothesis that polyploidy limits faithful chromosome segregation by the combination of dynamics of spindle assembly and a maximum time of mitotic arrest. Our model predicts that for increasing ploidy, spindle assembly time scales linearly with the number of chromosomes, which results in exponential changes in the rate of chromosome loss. Our model quantitatively reproduces the increase in chromosome loss observed in tetraploid S. cerevisiae cells relative to haploid and diploid cells.

#### MATERIALS AND METHODS

#### Model for Chromosome Loss

In our model we describe the dynamics of spindle assembly including KC attachment and detachment (**Figure 1A**), silencing of the spindle assembly checkpoint and the maximum duration of mitotic arrest after which cells enter anaphase regardless of whether all KCs are attached, allowing for chromosome loss in our model. To make a prediction for chromosome loss, we describe populations of cells in prometaphase, metaphase, and anaphase with either all KCs attached to the spindle, or with at least one unattached KC, and we calculate the fraction of cells in each population (**Figure 1B**). Transitions between these populations arise from spindle dynamics (**Figure 1A**).

#### Dynamics of Spindle Assembly

To describe dynamics of spindle assembly, we calculate the rate of KC capture, k<sub>i</sub><sup>+</sup>, by taking into account known microtubule dynamics and geometry of yeast spindles (**Figure 1A**). Here, index i denotes the number of left sister KCs attached to the spindle; analogous calculations are applied to right sister KCs. Microtubules nucleate from the spindle pole body at rate ν<sub>i</sub> and extend toward the spindle equator. They can attach to an unattached KC with probability p. The rate of KC attachment is the probability of attachment to at least one of the unattached KCs multiplied by the microtubule nucleation rate, which for C chromosomes and C − i unattached KCs reads

$$k\_i^+ = \left[1 - \left(1 - p\right)^{C-i}\right] \nu\_i, \quad i = 0, \ldots, C - 1. \tag{1}$$

For other values of the index i the rate of KC attachment is zero to exclude unrealistic cases, with a negative number of chromosomes or with more than C chromosomes. In the case of euploid cells, the number of chromosomes is related to the ploidy as C = 16 · n. We calculate the nucleation rate at the spindle pole body as ν<sub>i</sub> = ν · (M − i), where we assume that a spindle pole body has a constant number of M nucleation sites with M − i unoccupied nucleation sites. To determine M for different numbers of chromosomes, we introduce a linear relationship between the number of chromosomes and nucleation sites, M = α · C + 4, which is based on experimental findings (Storchová et al., 2006; Nannas et al., 2014). The parameter α is typically around 1. We also assume the nucleation rate for one nucleation site, ν, to be constant as in previous studies (Kitamura et al., 2010; Vasileva et al., 2017). In our model, attachment occurs when a microtubule contacts the KC (Tanaka et al., 2005). The probability of attachment is calculated based on spindle geometry as the ratio of the cross-section area of the KC, S<sub>KC</sub>, and the total area of the spindle, p = S<sub>KC</sub>/(S · M + S<sub>KC</sub>). Here S denotes the cross-section area occupied by one microtubule. Values for these parameters are estimated from electron microscopy studies (O'Toole et al., 1997; Storchová et al., 2006; Gonen et al., 2012). We assume that microtubules detach from one KC at constant detachment rate, k<sup>−</sup>, because our model does not include forces at the KC (Akiyoshi et al., 2010).
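As a minimal sketch of how equation (1) combines with the relations for M, ν<sub>i</sub>, and p given above, the rate of KC capture could be computed as follows. The function names and all numerical defaults are illustrative assumptions; the parameter values used in the study are those listed in Figure 1C and are not reproduced here. Clamping the number of free nucleation sites at zero is also an assumption of this sketch, made so that the rate never becomes negative when M < C.

```python
def nucleation_sites(C, alpha=1.0):
    """Nucleation sites per spindle pole body, M = alpha * C + 4."""
    return alpha * C + 4


def attachment_probability(M, S_kc, S_mt):
    """Geometric attachment probability, p = S_KC / (S * M + S_KC)."""
    return S_kc / (S_mt * M + S_kc)


def capture_rate(i, C, nu, alpha=1.0, S_kc=1.0, S_mt=1.0):
    """Rate of KC capture k_i^+ from equation (1).

    i     : number of sister KCs already attached on this side
    C     : number of chromosomes (C = 16 * ploidy for euploid yeast)
    nu    : nucleation rate of a single nucleation site (placeholder units)
    S_kc, S_mt : cross-section areas of a KC and of one microtubule
                 (placeholder values, only their ratio matters here)
    """
    # Equation (1) is defined for i = 0, ..., C - 1; the rate is zero otherwise.
    if i < 0 or i > C - 1:
        return 0.0
    M = nucleation_sites(C, alpha)
    p = attachment_probability(M, S_kc, S_mt)
    # nu_i = nu * (M - i); clamped at zero (assumption of this sketch).
    nu_i = nu * max(M - i, 0.0)
    # Probability that a nucleated microtubule attaches to at least one
    # of the C - i unattached KCs, times the nucleation rate.
    return (1.0 - (1.0 - p) ** (C - i)) * nu_i
```

With these placeholder areas, `capture_rate(0, 16, nu)` is considerably larger than `capture_rate(15, 16, nu)`, which reflects that the last unattached KCs are the slowest to be captured.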

#### Silencing the Spindle Assembly Checkpoint and Chromosome Loss

Cells proceed from metaphase to anaphase by silencing the spindle assembly checkpoint at a constant rate, k<sub>0</sub>. They can also proceed from prometaphase to anaphase when they spend a prolonged time in mitotic arrest (Minshull et al., 1996; Rudner and Murray, 1996; Rieder and Maiato, 2004), which in our model results in chromosome loss. We distinguish these two cases by introducing a rate of anaphase entry given by

$$\begin{Bmatrix} k\_L \\ k\_A \end{Bmatrix} = k\_0 \begin{Bmatrix} f(t) \\ 1 + f(t) \end{Bmatrix},\tag{2}$$

where in the top and bottom row we calculate rates at which cells leave prometaphase and metaphase, respectively. We describe bypassing the checkpoint in mitotic arrest with a function of time f(t), irrespective of whether cells are in prometaphase or metaphase. Because this function is not known, we choose a simple mathematical form f(t) = exp[(t − t<sub>0</sub>)/t<sub>c</sub>], which accounts for the increase of the rate of anaphase entry with time. Here, parameters t<sub>0</sub> and t<sub>c</sub> denote the duration of mitotic arrest and the characteristic timescale, respectively.
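The two rates of anaphase entry can be sketched in a few lines, taking the bracketed terms of equation (2) to be the function f(t) defined in this paragraph (an interpretation adopted here because it keeps both rates in the units of k<sub>0</sub>). The function names and the numbers in the usage note are illustrative assumptions, not values from the study.

```python
import math


def bypass(t, t0, tc):
    """Checkpoint bypass function f(t) = exp((t - t0) / tc)."""
    return math.exp((t - t0) / tc)


def anaphase_entry_rates(t, k0, t0, tc):
    """Return (k_L, k_A): rates of leaving prometaphase and metaphase.

    k_L = k0 * f(t)        entry into anaphase with unattached KCs
    k_A = k0 * (1 + f(t))  entry into anaphase with all KCs attached
    """
    f = bypass(t, t0, tc)
    return k0 * f, k0 * (1.0 + f)
```

Well before t<sub>0</sub> the bypass term is negligible, so k<sub>L</sub> is close to zero and k<sub>A</sub> is close to k<sub>0</sub>; at t = t<sub>0</sub> the sketch gives k<sub>L</sub> = k<sub>0</sub> and k<sub>A</sub> = 2k<sub>0</sub>, and both rates grow exponentially afterwards.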

#### Fraction of Cells in Prometaphase, Metaphase, and Anaphase With and Without Lost Chromosomes

In our model, we denote the fractions of cells in prometaphase and metaphase by ρ<sub>i,j</sub>. The fraction of cells in anaphase with at least one KC unattached to the spindle, ρ<sub>L</sub>, represents the fraction with lost chromosomes. The fraction of cells in anaphase with all KCs attached is denoted ρ<sub>A</sub>. The indices i and j denote the number of left and right sister KCs attached to the spindle, respectively, in cells with C chromosomes (i = 0, . . . , C and j = 0, . . . , C). The combination of indices i = j = C describes cells with all KCs attached, which corresponds to metaphase cells. All the other combinations of indices describe cells with at least one unattached KC, which correspond to prometaphase cells. As time, t, progresses (i) KCs attach to or detach from the spindle, or (ii) cells enter anaphase, changing the fractions of cells in the populations (**Figure 1B**). In our model, attachments of different KCs as well as their detachments are independent. We describe these processes by a system of rate equations:

$$\begin{split} \frac{d\rho\_{i,j}}{dt} &= k\_{i-1}^{+}\rho\_{i-1,j} + k\_{j-1}^{+}\rho\_{i,j-1} + (i+1)k^{-}\rho\_{i+1,j} + (j+1)k^{-}\rho\_{i,j+1} \\ &\quad - \left(k\_i^{+} + ik^{-} + k\_j^{+} + jk^{-} + k\_{\rm L,A}\right)\rho\_{i,j}, \quad i,j = 0, \ldots, C, \\ k\_{\rm L,A} &= \begin{cases} k\_{\rm A}, & \text{if } i=j=C, \\ k\_{\rm L}, & \text{otherwise}, \end{cases} \end{split} \tag{3}$$

$$\frac{d\rho\_{\rm L}}{dt} = k\_{\rm L} \sum\_{i,j=0}^{C} \rho\_{i,j} (1 - \delta\_{i,C} \delta\_{j,C}),\tag{4}$$

$$\frac{d\rho\_{\rm A}}{dt} = k\_{\rm A}\rho\_{\rm C,C} \,. \tag{5}$$

Here δ denotes the Kronecker delta, which has value 1 when its two indices have the same value and 0 otherwise. Note that equation (3) describes a situation where only one KC can attach to or detach from the spindle at a time, which can be used if KCs attach and detach independently of each other. We also introduce the average time of both prometaphase and metaphase, which we term the time of spindle assembly,

$$\langle t \rangle = \frac{\int\_0^{\infty} t \, \frac{d\rho\_{\rm A}}{dt} \, dt}{\int\_0^{\infty} \frac{d\rho\_{\rm A}}{dt} \, dt}.$$

Please note that the model does not take cell division into account and therefore the total number of cells is conserved.
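The rate equations (3) to (5) can be integrated numerically in many ways; the following is a minimal sketch using an explicit forward-Euler step in pure Python, not the numerical method used in the study (which is not specified here). It takes the bracketed terms of equation (2) to be f(t), clamps the number of free nucleation sites at zero, and uses hypothetical names and placeholder parameters throughout. It is intended for small C; for C = 64 a dedicated stiff ODE solver would be a better choice.

```python
import math


def solve_chromosome_loss(C, nu, k_minus, k0, t0, tc, alpha=1.0,
                          S_kc=1.0, S_mt=1.0, dt=1e-3, t_max=500.0, tol=1e-12):
    """Forward-Euler integration of equations (3)-(5).

    Returns (rho_L, rho_A, mean_t, remaining), where mean_t is the time of
    spindle assembly <t> and remaining is the fraction of cells that have
    not yet entered anaphase when the integration stops.
    """
    M = alpha * C + 4                      # nucleation sites per pole
    p = S_kc / (S_mt * M + S_kc)           # attachment probability
    # k_plus[i] for i = 0..C-1 from equation (1); k_plus[C] = 0.
    k_plus = [(1.0 - (1.0 - p) ** (C - i)) * nu * max(M - i, 0.0)
              for i in range(C)] + [0.0]
    rho = [[0.0] * (C + 1) for _ in range(C + 1)]
    rho[0][0] = 1.0                        # all cells start with no KC attached
    rho_L = rho_A = t_weighted = 0.0
    t, remaining = 0.0, 1.0
    while remaining > tol and t < t_max:
        f = math.exp((t - t0) / tc)
        k_L, k_A = k0 * f, k0 * (1.0 + f)
        if k_A * dt > 0.5:
            raise RuntimeError("dt too large for explicit Euler at t = %g" % t)
        new = [row[:] for row in rho]
        for i in range(C + 1):
            for j in range(C + 1):
                k_exit = k_A if (i == C and j == C) else k_L
                gain = 0.0
                if i > 0:
                    gain += k_plus[i - 1] * rho[i - 1][j]
                if j > 0:
                    gain += k_plus[j - 1] * rho[i][j - 1]
                if i < C:
                    gain += (i + 1) * k_minus * rho[i + 1][j]
                if j < C:
                    gain += (j + 1) * k_minus * rho[i][j + 1]
                out = k_plus[i] + k_plus[j] + (i + j) * k_minus + k_exit
                new[i][j] += dt * (gain - out * rho[i][j])
        flux_A = k_A * rho[C][C]                               # equation (5)
        flux_L = k_L * (sum(map(sum, rho)) - rho[C][C])        # equation (4)
        rho_A += dt * flux_A
        rho_L += dt * flux_L
        t_weighted += dt * t * flux_A      # numerator of <t>
        rho = new
        t += dt
        remaining = sum(map(sum, rho))
    mean_t = t_weighted / rho_A if rho_A > 0 else float("inf")
    return rho_L, rho_A, mean_t, remaining
```

For example, `solve_chromosome_loss(C=1, nu=1.0, k_minus=0.01, k0=0.5, t0=60.0, tc=5.0)` (placeholder values, times in minutes) returns the fraction of cells with a lost chromosome, the fraction that entered anaphase with all KCs attached, and the time of spindle assembly. Because every transition removes from one population exactly what it adds to another, the returned fractions sum to one up to rounding, which mirrors the conservation of the total number of cells noted above.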

#### RESULTS

## Chromosome Loss in Cells With One Chromosome

To illustrate how chromosome loss occurs during the transition from prometaphase to anaphase, we numerically solve our model first for cells with only one chromosome, C = 1, for parameters given in **Figure 1C**. We discuss the time course for different populations of cells. Initially, cells have no chromosome attached to the spindle. In prometaphase, when spindle assembly starts and KCs attach to the spindle, the fraction of cells in this population decreases, while the fraction of cells in the other populations increases (compare the light and dark purple lines in **Figure 1D**). After an initial increase, the fraction of cells in prometaphase starts decreasing as more KCs attach, and cells switch to metaphase (compare purple and black lines in **Figure 1D**). Finally, cells switch to anaphase. The fractions of cells in anaphase increase and asymptotically approach a limit value because the model does not describe cells leaving anaphase (orange and blue lines in **Figure 1D**). In this case with only one chromosome, the fraction of cells with a lost chromosome is very low.

### Dramatic Increase in the Rate of Chromosome Loss With an Increase in Ploidy

To explore the relevance of our model for haploid, diploid, and tetraploid yeast cells, we further solve our model for the respective number of chromosomes in each ploidy type, C = 16, 32, and 64 (**Figure 2A**). We find that cells with an increasing number of chromosomes spend a longer time in prometaphase and metaphase, though the general trend is similar to the case with C = 1 (**Figure 1D**). Additionally, there is a rapid decrease in the fraction of cells in prometaphase and metaphase, which occurs around the maximum time of mitotic arrest, t = t<sub>0</sub>, which is visible for cells with 64 chromosomes. After cells pass the maximum time of mitotic arrest, they predominantly enter anaphase regardless of whether all KCs are attached. Thus, the more cells are still in prometaphase, the more cells will enter anaphase with unattached KCs. Because populations of cells with more chromosomes spend more time in prometaphase, they also enter anaphase later (**Figures 2A,B**). This time delay results in an increasing fraction of cells in anaphase with at least one lost KC because these cells have a greater chance to proceed to anaphase without a completely formed spindle (**Figure 2B**).

To explore which processes included in our model are responsible for significant chromosome loss, we determine the relevance of our model parameters. As our model describes both KC capture and transition to anaphase, we separately analyse the contribution of each process. We introduce the average time of both prometaphase and metaphase, which we refer to as the time of spindle assembly (Methods). We find that the time of spindle assembly increases with the number of chromosomes. Changing the chromosome number from 16 to 32 increases the time of spindle assembly approximately 2-fold, whereas, for a change from 32 to 64, it increases 5-fold (**Figure 2C**). Next, we explored how ploidy variations affect chromosome loss. We find that haploid (C = 16) and diploid (C = 32) cells have the same order of magnitude for the fraction of the population with at least one lost chromosome (**Figure 2D**). Interestingly, the fraction of cells with at least one lost chromosome increases dramatically for cells with higher ploidy, such as tetraploid cells (C = 64). When we plot the fraction of cells with lost kinetochores against spindle assembly time, we find that linear-scale changes in spindle assembly time result in exponential-scale changes in the rate of chromosome loss (**Figure 2E**). To summarize, our combined results show that small changes in spindle assembly time result in dramatic differences in the rate of chromosome loss as soon as prometaphase time approaches the maximum time of mitotic arrest.
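A rough heuristic, which is not part of the model analysis itself, may help to see why a linear change in spindle assembly time translates into an exponential change in chromosome loss. Assume, as a simplification, that a cell remains in prometaphase for a fixed time τ, and that during this time it can enter anaphase prematurely at the rate k<sub>0</sub> f(t) with f(t) = exp[(t − t<sub>0</sub>)/t<sub>c</sub>]. The symbols τ and P<sub>loss</sub> are introduced here only for illustration. The probability of premature anaphase entry is then

$$P\_{\rm loss}(\tau) = 1 - \exp\left[-\int\_0^{\tau} k\_0\, e^{(t - t\_0)/t\_c}\, dt\right] = 1 - \exp\left[-k\_0 t\_c \left(e^{(\tau - t\_0)/t\_c} - e^{-t\_0/t\_c}\right)\right] \approx k\_0 t\_c\, e^{(\tau - t\_0)/t\_c},$$

where the approximation holds when τ is several times larger than t<sub>c</sub> and the loss probability is small. Under these assumptions,

$$\ln P\_{\rm loss}(\tau) \approx \ln (k\_0 t\_c) + \frac{\tau - t\_0}{t\_c},$$

so the logarithm of the loss probability grows linearly with the time spent in prometaphase, with slope 1/t<sub>c</sub>. In the full model the prometaphase duration is distributed rather than fixed, so this estimate indicates only the scaling, not the numerical values.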

### Relevance of Parameters on the Time of Spindle Assembly and the Chromosome Loss Rate

Figure 2 (caption fragment): ... inset legends. (C) Time of spindle assembly as a function of the number of chromosomes. (D) Rate of chromosome loss for cells as a function of the number of chromosomes. Arrowheads denote haploid, diploid and tetraploid number of chromosomes. (E) Rate of chromosome loss for cells as a function of the time of spindle assembly. Data points are obtained from (C,D), and correspond to *C* = 4, . . . , 64. Cases with *C* = 16, 32, 64 are shown in blue. At *t* = 0, ρ<sub>0,0</sub> = 1 and all other populations are 0. The other parameters are given in Figure 1C.

Figure 3 (caption fragment): ... arrest, *t<sub>c</sub>*. Three different shades correspond to different values of the parameter *t<sub>c</sub>* = 8 min, 10 min, 12 min. For color-codes see inset legend. The other parameters are given in Figure 1C.

As our model describes spindle formation, we explore the relevance of parameters on the time of spindle assembly. We varied the parameter that links the number of chromosomes and microtubule nucleation sites, α, for different numbers of chromosomes. For the parameter value α = 1.0 the time of spindle assembly increases with the number of chromosomes (**Figures 2C**, **3A**). By increasing α to values >1 the assembly speeds up, but the influence is noticeable for a larger number of chromosomes (**Figure 3A**). By decreasing the parameter to the value α = 0.9 the assembly time dramatically increases with the number of chromosomes and goes to infinity when there are more than 40 chromosomes. The infinite time of spindle assembly occurs for cells in which the number of microtubule nucleation sites at one pole is smaller than the number of chromosomes. Interestingly, in yeast the value of the parameter α in cells is close to 1 (**Figure 1C**).
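The threshold of 40 chromosomes for α = 0.9 can be checked directly from the definitions in the Methods. With ν<sub>i</sub> = ν · (M − i), nucleation stops once i = M sister KCs are attached at a pole, so a pole with fewer nucleation sites than chromosomes can never attach all C of its KCs. Using M = α · C + 4, this condition reads

$$M < C \iff \alpha C + 4 < C \iff C > \frac{4}{1 - \alpha} \quad (\alpha < 1),$$

which for α = 0.9 gives

$$C > \frac{4}{1 - 0.9} = 40.$$

For α ≥ 1 the inequality cannot be satisfied, so there are always enough nucleation sites and the time of spindle assembly stays finite in this respect.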

We next explore the relevance of geometry by varying the cross-section area of the KC, S<sub>KC</sub>. We find that geometry has a small contribution for a small number of chromosomes, but for a larger number of chromosomes, the time of spindle assembly decreases with the increase of the cross-section area (**Figure 3B**). The role of the cross-section area occupied by one microtubule, S, can be inferred from these data because both parameters, the cross-section area occupied by one microtubule and the cross-section area of the KC, contribute to attachment probability p.

Further, we explore how the choice of the function that describes bypassing the checkpoint in mitotic arrest, f(t), affects the chromosome loss rate. We find that for a linear function the chromosome loss rate increases as the number of chromosomes increases (**Figure 3C**). However, in this case the model cannot explain experimental results quantitatively. For example, when the number of chromosomes changes from 32 to 64 the chromosome loss rate increases approximately 20 times with the linear function, whereas when ploidy in experiments changes from diploid to tetraploid the loss rate increases a thousand times. The chromosome loss rate in the model is more similar to the experimental results for nonlinear functional forms, such as quadratic and cubic functions (**Figure 3C**). Because we cannot predict a functional form for f(t) from this analysis, we choose an exponential function as a simple function that provides agreement with experiments.

Finally, we explore how the parameters that describe bypassing the checkpoint in mitotic arrest, t<sub>0</sub> and t<sub>c</sub>, affect the chromosome loss rate. We find that cells with shorter duration of mitotic arrest have an increased chromosome loss rate, irrespective of ploidy (**Figure 3D**). We also find that cells with a smaller characteristic timescale of mitotic arrest have a smaller rate of chromosome loss (**Figure 3E**).

#### DISCUSSION

Here we introduced a model in which we explored chromosome loss dynamics by accounting for key aspects of spindle assembly, including microtubule nucleation and KC attachment/detachment, together with a maximum time of mitotic arrest. Our theory provides a plausible explanation for experiments in yeast tetraploid cells, where there is a 1,000-fold increase in the rate of chromosome loss relative to haploid and diploid cells (Mayer and Aguilera, 1990; Storchová et al., 2006). Our model quantitatively predicts not only an increase in chromosome loss in cells with an increasing chromosome number, but also a longer spindle assembly time. Indeed, the doubling time of yeast increases with ploidy in S. cerevisiae. For example, doubling times of haploid, diploid and tetraploid yeast cells in YPD are approximately 130, 146, and 171 min, respectively (Mable, 2001). This suggests that cells with increasing ploidy have an increased spindle assembly time, with differences in the same order of magnitude as in our model. However, this prediction needs to be further verified by direct measurements of average spindle assembly time in haploid, diploid, and tetraploid yeast cells. Key parameters of cytoplasmic microtubule dynamics were measured previously for diploid and tetraploid S. cerevisiae cells, including the rates of microtubule growth, shrinkage, catastrophe and rescue during G1 and mitosis (Storchová et al., 2006). We hypothesize that changes in these parameters may cause a change in the average spindle assembly time in a population of cells, but experimental validation in yeast is also needed.

In yeast cells of different ploidy, chromosome loss can occur for many reasons. Configurations with syntelic attachments can also appear and lead to chromosome loss. Storchová et al. detected an increased frequency of erroneous KC attachments in polyploid cells and suggested an important role for syntelic attachments based on increased activity of Ipl1, the yeast homolog of Aurora B (Storchová et al., 2006). Additionally, microtubules can detach from KCs during anaphase, which can further increase chromosome loss events. Thus, identifying experimentally which of these configurations are predominant in cells with lost chromosomes is crucial for establishing a complete picture of chromosome loss.

Laboratory tetraploid yeast cells have an increased rate of chromosome loss. However, a recent experimental evolution study with laboratory yeast cells found that some tetraploid cell lines could maintain their full chromosome complement (C = 64) for >1,000 generations (Lu et al., 2016). The evolved, stable tetraploid cells had elevated levels of the Sch9 protein, one of the major regulators downstream of TORC1, which is a central regulator of cell growth. Interestingly, the evolved stable tetraploid cells also had increased resistance to the microtubule depolymerizing drug benomyl relative to the ancestor tetraploid cells, indicating that increased Sch9 activity may, at least in part, rescue spindle formation defects observed in the ancestral tetraploid cells (Storchová et al., 2006; Lu et al., 2016). This is consistent with our model, where chromosome stability in tetraploid cells can be obtained by increasing the rate of spindle assembly.

This is the first theoretical study of the mechanism driving high rates of chromosome loss in polyploid yeast cells. Our approach for within-species ploidy variation can be applied to other species, including plants (Hufton and Panopoulou, 2009), where rates of chromosome loss are also higher in polyploid cells than in diploid cells, if the details of spindle self-organization are adjusted for the specific organism and cell-type. For example, for cells with more than one microtubule per KC, merotelic attachments need to be taken into account as well (Gregan et al., 2011). Future models will show the extent to which spindle assembly time influences the rate of chromosome loss for a variety of systems.

#### AUTHOR CONTRIBUTIONS

NP, LL, and AS conceived the project. NP and LL developed the model. IJ solved the model. All authors wrote the paper.

#### FUNDING

This research was supported by the QuantiXLie Centre of Excellence, a project co-financed by the Croatian Government and European Union through the European Regional Development Fund – the Competitiveness and Cohesion Operational Programme (Grant KK.01.1.1.01.0004, NP), the Netherlands Organization for Scientific Research (NWO/OCW) as part of the Frontiers of Nanoscience program (LL), a Nebraska EPSCoR First Award (AS) and an LB692-Nebraska Tobacco Settlement Biomedical Research Development Fund (AS).

#### ACKNOWLEDGMENTS

We thank Timon Idema, Judy Berman and Iva Tolić for critical reading of the manuscript, and Ivana Šarić for the drawings.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Jelenić, Selmecki, Laan and Pavin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Gene Co-expression Network Analysis Suggests the Existence of Transcriptional Modules Containing a High Proportion of Transcriptionally Differentiated Homoeologs in Hexaploid Wheat

#### Kotaro Takahagi<sup>1,2,3</sup>, Komaki Inoue<sup>1</sup> and Keiichi Mochida<sup>1,2,3,4</sup>\*

Edited by: Hans D. Daetwyler, La Trobe University, Australia

#### Reviewed by:

Divya Mehta, Queensland University of Technology, Australia Wilco Ligterink, Wageningen University & Research, Netherlands

> \*Correspondence: Keiichi Mochida keiichi.mochida@riken.jp

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Plant Science

> Received: 31 January 2018 Accepted: 23 July 2018 Published: 08 August 2018

#### Citation:

Takahagi K, Inoue K and Mochida K (2018) Gene Co-expression Network Analysis Suggests the Existence of Transcriptional Modules Containing a High Proportion of Transcriptionally Differentiated Homoeologs in Hexaploid Wheat. Front. Plant Sci. 9:1163. doi: 10.3389/fpls.2018.01163

<sup>1</sup> Bioproductivity Informatics Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Japan, <sup>2</sup> Graduate School of Nanobioscience, Yokohama City University, Yokohama, Japan, <sup>3</sup> Kihara Institute for Biological Research, Yokohama City University, Yokohama, Japan, <sup>4</sup> Institute of Plant Science and Resources, Okayama University, Kurashiki, Japan

Genome duplications aid in the formation of novel molecular networks through regulatory differentiation of the duplicated genes and facilitate adaptation to environmental change. Hexaploid wheat, Triticum aestivum, contains three homoeologous chromosome sets, the A-, B-, and D-subgenomes, which evolved through interspecific hybridization and subsequent whole-genome duplication. The divergent expression patterns of the homoeologs in hexaploid wheat suggest that they have undergone transcriptional and/or functional differentiation during wheat evolution. However, the distribution of transcriptionally differentiated homoeologs in gene regulatory networks and their related biological functions in hexaploid wheat are still largely unexplored. Therefore, we retrieved 727 publicly available wheat RNA-sequencing (RNA-seq) datasets from various tissues, developmental stages, and conditions, and identified 10,415 expressed homoeologous triplets. Examining the co-expression modules in the wheat transcriptome, we found that 66% of the expressed homoeologous triplets possess all three homoeologs grouped in the same co-expression modules. Among these, 15 triplets contain co-expressed homoeologs with differential expression levels between homoeoalleles across ≥ 95% of the 727 RNA-seq datasets, suggesting a consistent trend of homoeolog expression bias. In addition, we identified 2,831 differentiated homoeologs that showed gene expression patterns that deviated from those of the other two homoeologs. We found that seven co-expression modules contained a high proportion of such differentiated homoeologs, which accounted for ≥ 20% of the genes in each module. We also found that five of the co-expression modules are rich in genes involved in biological processes such as chloroplast biogenesis, RNA metabolism, putative defense response, putative posttranscriptional modification, and lipid metabolism, suggesting that the differentiated homoeologs might contribute substantially to these biological functions in the gene network of hexaploid wheat.

Keywords: allopolyploidization, co-expression gene network, hexaploid wheat, homoeolog, transcriptional module

#### INTRODUCTION

Interspecific hybridization and polyploidization have played important roles in the evolution and diversification of plants (Soltis and Soltis, 2009; Van de Peer et al., 2009). Allopolyploids originate from hybridization between different species followed by whole-genome duplication (Ramsey and Schemske, 1998; Comai, 2005). Despite the multiple conditions that need to be met for allopolyploidization to occur, including existing populations of parental lines in the same area, overcoming hybrid incompatibility, gametic non-reduction, and chromosome doubling (Osabe et al., 2012), allopolyploids are widespread in various taxonomic groups of plants (Leitch and Leitch, 2008; Barker et al., 2016). Therefore, it has been hypothesized that allopolyploid species have evolutionary advantages compared to their diploid ancestral species (Wendel, 2000; Doyle et al., 2008).

Improved traits that evolved in allopolyploid plants enhanced their productivity and have contributed to the domestication of many crops (Chen, 2010; Renny-Byfield and Wendel, 2014). For example, the allotetraploid Arabidopsis suecica has more vigorous growth and produces more seeds than its ancestral species (Solhaug et al., 2016), whereas the allotetraploid Coffea arabica can better adapt to changes in temperature than its diploid ancestors (Combes et al., 2013). In allohexaploid wheat (Triticum aestivum), both natural and synthetic plants have higher tolerance to salt stress than their diploid and tetraploid ancestors (Dubcovsky and Dvorak, 2007; Yang et al., 2014). These examples suggest that allopolyploidization often leads to increased productivity through fixation of genomic heterozygosity, which improves environmental fitness and contributes to the habitat expansion of a species.

Allopolyploidization can give rise to transcriptional and/or functional changes in homoeologs (genes that are duplicated due to allopolyploidization) (Mochida et al., 2003; Adams and Wendel, 2005; Moore and Purugganan, 2005). Homoeologs can undergo accelerated evolution due to redundant genetic codes that can evolve new functions without constraints (Kaessmann, 2010; Naseeb et al., 2017). A number of studies have revealed their fates as non-functionalized (loss of function of one of the duplicated genes), subfunctionalized (partitioning of function between duplicated genes), and/or neofunctionalized (diversification of function between the duplicated genes) (Lynch and Conery, 2000; Blanc and Wolfe, 2004; Cusack and Wolfe, 2007). Homoeologs in plants often show different expression patterns across tissues, developmental stages, and conditions, suggesting that they have undergone sub- and/or neofunctionalization (Madlung, 2013). The differential employment of homoeologs through dynamic transcriptional regulation may contribute to the enhanced evolutionary adaptability of allopolyploid species.

A number of studies based on homoeolog-specific gene expression analysis have reported the evolutionary fates of homoeologs in various allopolyploid plants (Adams, 2007; Hughes et al., 2014; Takahagi et al., 2018). Transcriptome analysis has revealed that the expression of multiple ribosomal protein-coding homoeologs in Brassica napus is tissue-dependent (Whittle and Krochko, 2009). An investigation of the relative levels of allelic and homoeologous gene expression in cotton revealed that subfunctionalized genes are mainly expressed in reproductive tissues, and non-functionalized alleles are typically derived from the A-genome, indicating potential genome-of-origin bias for neofunctionalization (Chaudhary et al., 2009). Differentiation of expression patterns of homoeologs in allopolyploid species might effect changes in their gene regulatory networks owing to transcriptional and/or functional divergence. The evolutionary changes in gene regulatory networks are thought to facilitate responses to developmental programs and environmental cues in allopolyploids (Chen and Ni, 2006).

Hexaploid wheat, Triticum aestivum, is a widely cultivated allohexaploid crop (2n = 6x = 42, AABBDD) that originated from hybridization between the domesticated allotetraploid Triticum turgidum (2n = 4x = 28, AABB) and the diploid goat grass Aegilops tauschii (2n = 2x = 14, DD) approximately 10,000 years ago, followed by genome duplication (Matsuoka, 2011; Feldman and Levy, 2012). Pfeifer et al. (2014) generated a co-expression gene network of hexaploid wheat and examined the contribution of expression of each homoeolog. They found that several network modules exhibit unbalanced homoeolog expression, which might be associated with biological functions and tissue types (Pfeifer et al., 2014). Recently, Tanaka et al. (2016) reported homoeolog-specific regulation of the floral MADS-box genes in wheat, and differential expression patterns of homoeologs were consistently observed in both natural and synthetic allohexaploid wheat varieties (Tanaka et al., 2016). Moreover, Powell et al. (2017) demonstrated that the wheat transcriptome has homoeolog expression bias toward the B- and D-subgenomes in response to pathogen infection (Powell et al., 2017). The divergent expression patterns between homoeologs suggest that they have undergone transcriptional and/or functional differentiation. However, the distribution of transcriptionally differentiated homoeologs in gene regulatory networks and their related biological functions in hexaploid wheat are still largely unexplored.

In this study, to elucidate homoeologous networks in hexaploid wheat and to explore their differentiation, we retrieved publicly available RNA-sequencing (RNA-seq) datasets from various tissues, developmental stages, and conditions. We categorized hexaploid wheat genes to construct homoeologous groups and identified expressed homoeologous triplets. We also identified differentiated homoeologs that show gene expression patterns that deviate from those of the other two homoeologs. In addition, we explored gene network modules containing a high proportion of differentiated homoeologs in the transcriptome of hexaploid wheat. We assessed enriched functions in the network modules and discussed the evolution of such network modules resulting from transcriptional differentiation of homoeologs in hexaploid wheat.

#### MATERIALS AND METHODS

#### Data and Data Processing

All publicly available wheat transcriptome sequence datasets were retrieved from the NCBI Sequence Read Archive (April 26, 2017)<sup>1</sup>. To standardize the input data, the datasets were screened according to the following criteria: (1) RNA-seq data only (i.e., no EST, FL-cDNA, etc.) from Triticum aestivum samples, (2) a total number of sequence reads ≥ 10,000,000, and (3) an average sequence read length of 70–1,000 bases. RNA-seq datasets with the following characteristics were also removed from the analyses, as they were considered inappropriate for gene expression profiling: (1) datasets from pooled samples taken at different time points, (2) datasets obtained from chromosome deletion and chromosome addition lines, and (3) datasets with poorly described methodologies. RNA-seq reads of the screened datasets were trimmed using Trimmomatic (v.0.32) (Bolger et al., 2014) with one thread and the following settings: LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:50. To obtain high-quality sequence datasets, the trimmed datasets were further screened according to the following criteria: (1) ≥ 70% of raw reads retained after the trimming step and (2) an average sequence read length of 70–1,000 bases after trimming. The trimmed reads that passed the second screening were mapped to the representative cDNA sequences annotated in the genome assembly of Chinese Spring wheat (International Wheat Genome Sequencing Consortium, 2014), downloaded from Ensembl (v.35)<sup>2</sup>, using the BWA program (v.0.7.8) (Li and Durbin, 2009) with its mem command. To retain only high-quality alignments, reads that were not uniquely mapped and/or not mapped as pairs were removed from the read alignment datasets using custom Perl scripts. In total, 727 read alignment datasets (**Supplementary Table S1**), for which ≥ 50% of raw reads remained after the read removal step, were subjected to further analysis.
The reads per million mapped reads (RPM) values were calculated for all genes in the 727 read alignment datasets. Genes with an RPM ≥ 3 in at least eight datasets (≥ 1% of the 727 RNA-seq datasets) were identified as significantly expressed genes.
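The RPM normalization and the expression filter described above can be illustrated with a minimal sketch. It assumes that each dataset is available as a dictionary of per-gene read counts covering all retained mapped reads; the function names `rpm` and `expressed_genes` and the toy counts are hypothetical and are not taken from the authors' scripts.

```python
def rpm(counts):
    """Convert per-gene read counts of one dataset to reads per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def expressed_genes(datasets, min_rpm=3.0, min_datasets=8):
    """Return genes with RPM >= min_rpm in at least min_datasets datasets."""
    hits = {}
    for counts in datasets:
        for gene, value in rpm(counts).items():
            if value >= min_rpm:
                hits[gene] = hits.get(gene, 0) + 1
    return {gene for gene, n in hits.items() if n >= min_datasets}


# Toy input (hypothetical): eight identical datasets of one million mapped reads each.
toy = [{"geneA": 30, "geneB": 1, "geneC": 999_969} for _ in range(8)]
print(sorted(expressed_genes(toy)))
```

With the thresholds stated in the text (RPM ≥ 3 in at least eight datasets), `geneB` is excluded in this toy input because its RPM never reaches 3.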

#### Identification of Homoeologous Groups

To identify homoeologous groups, representative protein sequences of the A-, B-, and D-subgenomes annotated in the genome assembly of Chinese Spring wheat (International Wheat Genome Sequencing Consortium, 2014) downloaded from Ensembl (v.35)<sup>2</sup> were compared against each other using BLASTP (v.2.6) (McGinnis and Madden, 2004), applying an e-value cut-off of 1e-5 and a sequence identity cut-off of 90%. Sets of three homoeologs that were reciprocal best hits in all pairwise comparisons were identified as homoeologous triplets (ABD type in **Figure 1B**). Sets of two homoeologs with reciprocal best hits for two subgenomes and without hits for the other subgenome were identified as homoeologous doublets (AB, AD, and BD types in **Figure 1B**). Genes without hits in any of the other two subgenomes were identified as subgenome-unique genes (A, B, and D types in **Figure 1B**).
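The reciprocal-best-hit rule for triplets can be sketched as follows. This is an illustration only: it assumes the BLASTP results have already been filtered by e-value and identity and reduced to one best hit per query, stored in a hypothetical dictionary `best` keyed by (query subgenome, target subgenome). The gene names are invented.

```python
def is_rbh(best, gene_x, sub_x, gene_y, sub_y):
    """True if gene_x (subgenome sub_x) and gene_y (subgenome sub_y) are each other's best hit."""
    return (best[(sub_x, sub_y)].get(gene_x) == gene_y
            and best[(sub_y, sub_x)].get(gene_y) == gene_x)


def find_triplets(best):
    """Return (A, B, D) gene tuples that are reciprocal best hits in all pairwise comparisons."""
    triplets = []
    for a, b in best[("A", "B")].items():
        if not is_rbh(best, a, "A", b, "B"):
            continue
        d = best[("A", "D")].get(a)
        if d is None:
            continue
        if is_rbh(best, a, "A", d, "D") and is_rbh(best, b, "B", d, "D"):
            triplets.append((a, b, d))
    return triplets


# Toy best-hit tables (hypothetical gene names).
best = {
    ("A", "B"): {"a1": "b1", "a2": "b2"},
    ("B", "A"): {"b1": "a1", "b2": "a9"},  # b2 prefers another A gene, so a2/b2 fail
    ("A", "D"): {"a1": "d1", "a2": "d2"},
    ("D", "A"): {"d1": "a1", "d2": "a2"},
    ("B", "D"): {"b1": "d1", "b2": "d2"},
    ("D", "B"): {"d1": "b1", "d2": "b2"},
}
print(find_triplets(best))
```

Doublets and subgenome-unique genes would follow from the same tables by checking which pairwise comparisons return a reciprocal hit and which return none.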

#### t-distributed Stochastic Neighbor Embedding (t-SNE) Analysis

To summarize expression patterns of the genes with an RPM ≥ 3 in a range of 1–7 datasets (spatiotemporally expressed genes), t-SNE analysis was performed using the Rtsne package (v.0.13)<sup>3</sup> in R (v.3.4.3). The number of iterations was set at 10,000, and parameter theta was set at 0.0.

#### Co-expression Network Analysis

To compute co-expression modules of homoeologs, WGCNA analysis (Langfelder and Horvath, 2008) was performed based on the normalized RPM using the one-step automatic network construction method with the following parameters: power = 9, networkType = "signed", TOMType = "unsigned", minModuleSize = 30, reassignThreshold = 0, mergeCutHeight = 0.25, numericLabels = TRUE, pamRespectsDendro = FALSE. A soft-thresholding power was selected by evaluating the scale-free topology model fit.
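To make the role of the soft-thresholding power concrete, the sketch below computes the adjacency between two expression profiles as defined for signed WGCNA networks, a = ((1 + cor) / 2)^power, with power = 9 as in the parameters above. It is a plain-Python illustration of that single formula, not a reimplementation of the WGCNA package; the helper names and example profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def signed_adjacency(x, y, power=9):
    """Signed-network adjacency: ((1 + cor) / 2) ** power."""
    return ((1 + pearson(x, y)) / 2) ** power


profile = [1.0, 2.0, 3.0]
print(signed_adjacency(profile, [2.0, 4.0, 6.0]))  # perfectly correlated profiles
print(signed_adjacency(profile, [1.0, 0.0, 1.0]))  # uncorrelated profiles
print(signed_adjacency(profile, [3.0, 2.0, 1.0]))  # anti-correlated profiles
```

In a signed network, anti-correlated genes receive an adjacency near zero, and raising to the ninth power pushes even uncorrelated pairs (0.5 before exponentiation) close to zero, so only strongly positively correlated genes remain well connected.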

#### Identification of Differentially Expressed Genes

For identification of the homoeologous triplets containing co-expressed homoeologs with differential expression levels between homoeoalleles, the gene expression fold changes between homoeologs across the 727 RNA-seq datasets were calculated based on RPM. Pairs of homoeologs with a fold change ≥ 3 and RPM ≥ 3 for at least one of the homoeologs were identified as differentially expressed homoeologs. For the examination of expression bias between homoeologs in the homoeologous triplets, reads used for RPM calculation in a series of RNA-seq datasets (SRR1542404-SRR1542417) (Liu et al., 2015) were subjected to differential gene expression analysis performed by using the edgeR package (v.3.20.9) (Robinson et al., 2010) in R (v.3.4.3). Pairs of homoeologs with a false discovery rate (FDR) ≤ 0.001 and RPM (average of 2 biological replicates in the RNA-seq datasets) ≥ 3 for at least one of the homoeologs were identified as significantly differentially expressed homoeologs.

<sup>1</sup>https://www.ncbi.nlm.nih.gov/sra

<sup>2</sup>http://plants.ensembl.org/Triticum\_aestivum/Info/Index

<sup>3</sup>https://github.com/jkrijthe/Rtsne
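The fold-change criterion for a pair of homoeologs (fold change ≥ 3, with RPM ≥ 3 for at least one member) can be sketched as below, together with the fraction of datasets in which a pair meets it. The handling of a zero RPM value is an assumption made for this sketch, since the text does not specify it, and the function names and profiles are hypothetical.

```python
def is_differential(rpm_x, rpm_y, min_fold=3.0, min_rpm=3.0):
    """True if the two RPM values differ by >= min_fold and the larger one reaches min_rpm."""
    high, low = max(rpm_x, rpm_y), min(rpm_x, rpm_y)
    if high < min_rpm:
        return False
    # Assumption: a silent partner (RPM of 0) counts as an unbounded fold change.
    return low == 0 or high / low >= min_fold


def fraction_differential(profile_x, profile_y):
    """Fraction of datasets in which the pair is called differentially expressed."""
    calls = [is_differential(x, y) for x, y in zip(profile_x, profile_y)]
    return sum(calls) / len(calls)


# Toy RPM profiles of two homoeologs across four datasets (hypothetical values).
homoeolog_a = [30.0, 9.0, 2.0, 12.0]
homoeolog_b = [5.0, 3.0, 1.0, 10.0]
print(fraction_differential(homoeolog_a, homoeolog_b))
```

A pair whose fraction reaches 0.5 or 0.95 across the full dataset collection would correspond to the ≥ 50% and ≥ 95% categories reported in the Results.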

#### Gene Ontology (GO) Enrichment Analysis

The closest homologs of wheat genes in Arabidopsis and rice were identified by BLASTP (v.2.6) (McGinnis and Madden, 2004) searches, applying an e-value threshold of ≤ 1e-5. GO terms of the best-hit genes in Arabidopsis and rice were used as the customized annotations for wheat genes. To reduce bias, GO terms that were assigned to more than 5,000 wheat genes were excluded. Enriched GO terms were identified for selected genes using BLAST2GO (v.4.1.9) (Conesa et al., 2005) with the customized annotations of wheat genes. For the estimation of the enriched GO terms of genes that are spatiotemporally expressed (representing genes with an RPM ≥ 3 in less than 1% (eight datasets) of the 727 RNA-seq datasets) or non-significantly expressed (representing genes with an RPM < 3 in all of the 727 RNA-seq datasets), all of the annotated genes in the Chinese Spring wheat chromosomes were used as a reference set. For estimation of the enriched GO terms of the other sets of genes, those in the expressed homoeologous triplets were used as a reference set. The significance threshold was set at FDR ≤ 0.001. The enriched GO terms were summarized based on their semantic similarities using the web-based tool REVIGO<sup>4</sup> (Supek et al., 2011).
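As an illustration of how an FDR threshold such as ≤ 0.001 is applied in an enrichment analysis, the sketch below combines a one-sided hypergeometric test with Benjamini-Hochberg adjustment. It is a generic stand-in written for this example and should not be read as BLAST2GO's internal procedure; the function names and input values are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N reference genes, K of which carry the GO term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 40 of 200 selected genes carry a term found in 300 of 30,000 reference genes.
p = hypergeom_pvalue(40, 200, 300, 30_000)
print(p < 0.001)
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

The choice of reference set matters here: as the text notes, the background is either all annotated genes or only the genes in expressed homoeologous triplets, which changes `K` and `N` for the same term.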

#### RESULTS

#### Homoeologous Triplets in Hexaploid Wheat

To explore the distribution of transcriptionally differentiated homoeologs in gene regulatory networks and their related biological functions in hexaploid wheat, we identified expressed homoeologous triplets using publicly available RNA-seq datasets. We gathered 727 RNA-seq datasets from hexaploid wheat composed of as many as 517 biosamples relating to various tissues, developmental stages, and conditions, which enabled us to comprehensively explore functional differentiation of transcription regulatory networks in hexaploid wheat (**Supplementary Table S1**). We mapped the quality-checked reads of the RNA-seq datasets to the set of representative cDNA sequences annotated in the genome assembly of Chinese Spring wheat. Using a threshold of RPM ≥ 3 in at least eight datasets (≥ 1% of the 727 RNA-seq datasets), we found that 73,329 genes (74% of the 99,308 genes corresponding to the representative cDNA sequences assigned to each chromosome) are significantly expressed in hexaploid wheat. To construct putative homoeologous groups and estimate the number of expressed homoeologs at each homoeolocus, we clustered all 99,308 genes into 49,710 gene groups based on sequence similarity, using a reciprocal BLAST homology search (**Figure 1A**). Approximately 38% of the genes were classified into gene groups composed of three homoeologs, one from each subgenome (homoeologous triplets, ABD type in **Figure 1B**), and 84% of these triplets (10,415 triplets) contained three homoeologs significantly expressed in the RNA-seq datasets (expressed homoeologous triplets; **Figure 1B**). We also observed that 31,738 genes (39% of the 82,012 genes assigned to the homoeologous groups) are expressed from one or two homoeologous loci on the subgenomes, which suggests that approximately 40% of the homoeologous groups contain homoeologs rarely expressed or silenced in the wheat transcriptome (**Figure 1B**).

<sup>4</sup>http://revigo.irb.hr/

#### Spatiotemporally Expressed Genes in Wheat

To characterize the genes found in the wheat transcriptome that are rarely expressed or silenced, we investigated the chromosomal distribution and function of these genes. Using the threshold to identify significantly expressed genes, we classified 25,979 genes as rarely expressed or silenced, which suggests either non-functionalization or increasingly restricted spatiotemporal transcriptional regulation. To further investigate the properties of such genes, we assessed their chromosomal distribution; however, no biased distribution of these genes was found across the 21 wheat chromosomes (**Figure 2A**). We found that 44% of the 25,979 genes were expressed in at least one RNA-seq dataset with an RPM ≥ 3, whereas the remaining 56% of the genes showed an RPM < 3 in all of the RNA-seq datasets, suggesting spatiotemporal expression and insignificant expression, respectively (**Figure 2B**). To summarize the expression patterns of the spatiotemporally expressed genes across the 727 RNA-seq datasets, we clustered and visualized the expression profiles of these genes using the t-SNE algorithm, and detected several clusters corresponding to the RNA-seq datasets from particular tissues, such as roots, stamens, and anthers (**Figure 2C**), suggesting tissue-specific expression. To assess gene functions over-represented in the spatiotemporally or non-significantly expressed genes, we performed GO enrichment analysis, and found some enriched GO terms related to the response to abiotic stresses, metabolism, and organ development (**Figure 2D**).

#### Expression Bias Between Homoeologs in Hexaploid Wheat

To examine expression bias between homoeologs in the expressed homoeologous triplets, we identified co-expressed homoeologs and differentially expressed homoeologs based on the 727 RNA-seq datasets. For identification of the co-expressed homoeologs, we applied the WGCNA algorithm, and identified 22 co-expression modules. The results of WGCNA analysis indicated that 66% of the expressed homoeologous triplets possess all three homoeologs grouped in the same co-expression modules (co-expressed triplets, ABD type in **Figure 3A**). For 27% of the triplets, two out of three homoeologs were grouped in the same co-expression modules (AB-D, AD-B, and BD-A types in **Figure 3A**), whereas for a further 5% of the triplets, all three homoeologs were assigned to different modules (A-B-D type in **Figure 3A**). To further identify homoeologs that are co-expressed while differentially expressed (representing similar expression patterns across the 727 RNA-seq datasets and differential expression levels between homoeoalleles), we identified differentially expressed homoeologs (fold change ≥ 3) in the co-expressed triplets, and found that at least 258 triplets contained co-expressed homoeologs with differential expression levels between homoeoalleles across ≥ 50% of the 727 RNA-seq datasets (**Figure 3B**). We also found that 15 co-expressed triplets contained such homoeologs observed in ≥ 95% of the datasets, suggesting a consistent trend of homoeolog expression bias (**Figures 3B,C**). On the basis of our GO enrichment analysis of these genes, we observed several over-represented functions, such as biotin metabolism, protein modifications, and response to gibberellin stimulus (**Figure 3D**).
Moreover, to illuminate statistically supported homoeolog-specific expression patterns relative to a particular tissue type, we examined the expression bias between homoeologs in the homoeologous triplets in a series of RNA-seq datasets related to multiple abiotic stress conditions such as drought, heat, and combined heat and drought (SRR1542404-SRR1542417) (Liu et al., 2015), and found that an increased number of homoeologous triplets contained differentially expressed homoeologs (FDR ≤ 0.001) in response to the drought and heat stress conditions, suggesting differentiation of transcriptional responsiveness to environmental stresses between homoeologs (**Supplementary Table S2**).
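The assignment of each triplet to one of the types ABD, AB-D, AD-B, BD-A, A-B-D, or "Not clustered" follows directly from the module labels of its three homoeologs. The sketch below assumes numeric module labels in which 0 marks genes not assigned to any module (the usual convention with numericLabels = TRUE) and treats a single unassigned homoeolog as sitting apart from the other two; that second point is an assumption of this sketch, and the function name is hypothetical.

```python
def classify_triplet(mod_a, mod_b, mod_d, unassigned=0):
    """Classify a homoeologous triplet from the module labels of its A, B, and D homoeologs."""
    modules = (mod_a, mod_b, mod_d)
    # Two or three homoeologs without a module: the triplet is not clustered.
    if sum(m == unassigned for m in modules) >= 2:
        return "Not clustered"
    if mod_a == mod_b == mod_d:
        return "ABD"
    if mod_a == mod_b:
        return "AB-D"
    if mod_a == mod_d:
        return "AD-B"
    if mod_b == mod_d:
        return "BD-A"
    return "A-B-D"


# Toy module labels (hypothetical) for five triplets.
for labels in [(7, 7, 7), (7, 7, 3), (7, 3, 7), (3, 7, 7), (1, 2, 3)]:
    print(labels, classify_triplet(*labels))
```

Counting the returned types over all 10,415 expressed triplets would yield the proportions summarized in Figure 3A.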

#### Transcriptional Modules Containing a Number of Differentiated Homoeologs

We constructed co-expression gene networks based on the 727 RNA-seq datasets, and found that differentiated homoeologs were unevenly distributed among the co-expression modules and that several modules contained high proportions of differentiated homoeologs. On the basis of the co-expression modules established from our WGCNA analysis, we identified 2,831 homoeologous triplets containing one homoeolog for which the expression pattern deviated from those of the other two homoeologs; the differentiated homoeolog was located in the A-, B-, and D-subgenomes in 9, 10, and 8% of the expressed triplets, respectively (BD-A, AD-B, and AB-D types, respectively, in **Figure 3A**). We also found that such differentiated homoeologs accounted for approximately 9% of all genes used for the WGCNA analysis (10,415 homoeologous triplets; 31,245 genes), whereas seven co-expression modules contained a high proportion of differentiated homoeologs, accounting for ≥ 20% of the genes in each module (**Figure 4A**). To estimate enriched biological functions for the genes within the co-expression modules containing a number of differentiated homoeologs, we performed GO enrichment analysis, and found that five of the co-expression modules are rich in genes involved in biological processes such as chloroplast biogenesis (module 7; **Figure 4B**), RNA metabolism (module 8; **Figure 4C**), putative defense response (module 10; **Figure 4D**), putative posttranscriptional modification (module 15; **Figure 4E**), and lipid metabolism (module 18; **Figure 4F**). These findings suggest that differentiated homoeologs might contribute substantially to these biological functions in the gene network of hexaploid wheat.
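The ≥ 20% criterion for modules with a high proportion of differentiated homoeologs amounts to a per-module fraction. A minimal sketch, assuming a hypothetical mapping `gene_module` from gene to module label and a set `differentiated` of differentiated homoeologs (both invented for this example):

```python
from collections import Counter


def enriched_modules(gene_module, differentiated, min_fraction=0.20):
    """Return {module: fraction} for modules whose share of differentiated homoeologs is >= min_fraction."""
    sizes = Counter(gene_module.values())
    diff_counts = Counter(gene_module[g] for g in differentiated if g in gene_module)
    fractions = {module: diff_counts[module] / size for module, size in sizes.items()}
    return {module: f for module, f in fractions.items() if f >= min_fraction}


# Toy data: module 7 has 4 genes (1 differentiated), module 1 has 10 genes (1 differentiated).
gene_module = {f"g{i}": 7 for i in range(4)}
gene_module.update({f"h{i}": 1 for i in range(10)})
differentiated = {"g0", "h0"}
print(enriched_modules(gene_module, differentiated))
```

For comparison, the overall share reported in the text is 2,831 of 31,245 genes (about 9%), so modules passing the 20% cut-off carry more than twice the background proportion.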

#### DISCUSSION

Through our homoeologous gene expression analysis of hexaploid wheat based on a large number of RNA-seq datasets, we described a landscape of transcriptional differentiation among homoeologs. Our comprehensive list of genes that were significantly expressed from one or two homoeologous loci enabled us to identify genes that may have undergone transcriptional suppression or become restricted to spatiotemporal expression. Leach et al. (2014) reported that 55% of genes in hexaploid wheat are expressed from one or two homoeologous loci on the subgenomes in root and shoot tissues (Leach et al., 2014). Using the RNA-seq datasets of 90 wheat lines, Wang et al. (2017) found that approximately 60% of wheat genes are expressed from one or two homoeologous loci in reproductive tissues (Wang et al., 2017). Our findings, based on more comprehensive transcriptome datasets, showed that a smaller proportion of genes than previously observed (∼40% of genes assigned to the homoeologous groups) are expressed from one or two homoeologous loci (**Figure 1**). These observations suggest that approximately 15–20% of wheat genes, including the silenced loci considered in previous studies, may contain homoeologs that can be expressed in specific tissues, at different developmental stages, or under different conditions. Among the spatiotemporally expressed and non-significantly expressed genes in our list, as many as 44% were expressed (RPM ≥ 3) in 1–7 of the 727 RNA-seq datasets, and some of these appear to be expressed specifically in tissues such as roots, stamens, and anthers (**Figures 2B,C**). Although we used a threshold of RPM ≥ 3 in fewer than 1% (eight datasets) of the 727 RNA-seq datasets to identify spatiotemporally or non-significantly expressed genes, this threshold depends on the proportion of samples from similar tissues in the dataset, and might therefore highlight genes specifically expressed in rarely sequenced samples. To further explore spatiotemporally expressed genes, transcriptome datasets obtained from anatomically or seasonally distinct samples should be analyzed using emerging technologies such as laser-capture microdissection RNA-seq (LCM RNA-seq) (Zhan et al., 2015) and field transcriptome sequencing (Plessis et al., 2015). These findings may suggest that genes expressed from only one or two homoeoalleles undergo transcriptional silencing, probably through differentiation of expression patterns and specialization of spatial expression. Consequently, such duplicated genes might be non-functionalized through promoter malfunctions or repression of other transcriptional machineries as a process of functional diploidization (Levy and Feldman, 2002; Rajkov et al., 2014).

**Figure 3** […] homoeologous triplets. ABD, homoeologous triplets in which all three homoeologs are grouped in the same co-expression module; AB-D, homoeologous triplets in which A- and B-homoeologs are grouped in the same co-expression module while D-homoeolog is in another co-expression module; AD-B, homoeologous triplets in which A- and D-homoeologs are grouped in the same co-expression module while B-homoeolog is in another co-expression module; BD-A, homoeologous triplets in which B- and D-homoeologs are grouped in the same co-expression module while A-homoeolog is in another co-expression module; A-B-D, homoeologous triplets in which all three homoeologs are assigned to different modules; Not clustered, homoeologous triplets in which two or all three homoeologs are not assigned to a co-expression module. (B) Number of co-expressed triplets containing differentially expressed homoeologs across ≥ 50% of the 727 RNA-seq datasets. (C) Box plot of the expression levels of the homoeologs in 15 homoeologous triplets showing a consistent trend of homoeolog expression bias across ≥ 95% of the 727 RNA-seq datasets. (D) Enriched GO terms in the biological processes of genes in the 258 co-expressed triplets containing differentially expressed homoeologs across ≥ 50% of the 727 RNA-seq datasets.

Our gene co-expression network analysis enabled us to identify homoeologous triplets containing homoeologs that are co-expressed while differentially expressed (2.5% of the 10,415 expressed homoeologous triplets), as well as differentiated homoeologs that are classified into co-expression modules that differ from those of the other two homoeologs (27% of the 10,415 expressed homoeologous triplets) (**Figures 3A,B**). The results of our comprehensive analysis suggest that most of the differential expression observed between homoeologs represents an alteration of expression patterns in hexaploid wheat. Our co-expression network analysis also enabled us to identify transcriptional modules that contain abundant differentiated homoeologs involved in several particular biological processes, which might have evolved such biological functions in hexaploid wheat through its allopolyploidization (Chen et al., 2007; Feldman and Levy, 2012). Multiple studies have provided evidence to suggest that homoeolog subfunctionalization may be related to enhanced adaptability to adverse environmental conditions in various allopolyploid species, such as tetraploid cotton, tetraploid coffee, and hexaploid wheat (Liu and Adams, 2007; Hu et al., 2011; de Carvalho et al., 2014; Liu et al., 2015). Consequently, our results suggest that along with other genes, such differentiated homoeologs may have innovated transcriptional networks, which may have contributed to adaptation to environmental change as well as to enhanced productivity during the evolution of hexaploid wheat.

The large number of RNA-seq datasets analyzed in the current study allowed us to integrate the transcriptional properties of each homoeologous triplet into a dataset (**Supplementary Table S3**), thereby providing a useful information resource for understanding the evolution and function of duplicated genes in hexaploid wheat. Moreover, our analyses using the datasets enabled us to demonstrate the presence of co-expression modules containing a high proportion of differentiated homoeologs in hexaploid wheat, which in turn allowed us to dissect its complex transcriptome derived from duplicated genomes. The considerable recent advances in whole-genome assembly in Triticeae species, including hexaploid wheat and its ancestors (Ling et al., 2013; Mochida and Shinozaki, 2013; International Wheat Genome Sequencing Consortium, 2014; Luo et al., 2017), provide us with an opportunity to further explore sub-/neofunctionalized homoeologs and elucidate the diploidization process that occurred during the evolution of hexaploid wheat after allopolyploidization. Such analysis will enable us to identify genes and transcriptional modules that may be associated with adaptive traits in hexaploid wheat. Such genes and transcriptional modules might also prove useful in enhancing the adaptation of staple crops to counter the potentially adverse impacts of global climate change and improve their productivity.

#### AUTHOR CONTRIBUTIONS

KT and KM designed the work. KT and KI performed the bioinformatics analysis. KT and KM wrote the manuscript.

#### FUNDING

This work was partially supported by grants-in-aid for Young Scientists (A) (Grant No. 26712003 to KM) from the Japan Society for the Promotion of Science (JSPS) and by funds to KM from the CREST (JPMJCR16O4) of the Japan Science and Technology Agency (JST).

#### ACKNOWLEDGMENTS

The authors thank Shun Takaya for his help in data analysis. This work was supported by RIKEN Junior Research Associate Program.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2018.01163/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Takahagi, Inoue and Mochida. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Adding Complexity to Complexity: Gene Family Evolution in Polyploids

Barbara K. Mable<sup>1</sup>\*, Anne K. Brysting<sup>2</sup>, Marte H. Jørgensen<sup>2</sup>, Anna K. Z. Carbonell<sup>1,3</sup>, Christiane Kiefer<sup>4</sup>, Paola Ruiz-Duarte<sup>4</sup>, Karin Lagesen<sup>2,5</sup> and Marcus A. Koch<sup>4</sup>

<sup>1</sup> Institute of Biodiversity, Animal Health & Comparative Medicine, University of Glasgow, Glasgow, United Kingdom, <sup>2</sup> Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, Oslo, Norway, <sup>3</sup> Department of Biological and Environmental Sciences, University of Stirling, Stirling, United Kingdom, <sup>4</sup> Centre for Organismal Studies Heidelberg, Department of Biodiversity and Plant Systematics, Botanic Garden and Herbarium Heidelberg, University of Heidelberg, Heidelberg, Germany, <sup>5</sup> Department of Bioinformatics, University of Oslo, Oslo, Norway

Comparative genomics of non-model organisms has resurrected whole genome duplication (WGD) from being viewed as a somewhat obscure process that happens in plants to a primary driver of eukaryotic diversification. The shadow of past ploidy increases has left a strong signature of duplicated genes organized into gene families, even in small genomes that have undergone effectively complete rediploidization. Nevertheless, despite continually advancing technologies and bioinformatics pipelines, resolving the fate of duplicate genes remains a substantial challenge. For example, many important recognition processes are driven not only by allelic expansion through retention of duplicates but also by diversification and copy number variation. This creates technical difficulties with assembly to reference genomes and accurate interpretation of homology. Thus, relatively little is known about the impacts of recent polyploidization and hybridization on the evolution of gene families under selective forces that maintain diversity, such as balancing selection. Here we use a complex of species and ploidy levels in the genus Arabidopsis (A. lyrata and A. arenosa) as a model to investigate the evolutionary dynamics of a large and complicated gene family known to be under strong balancing selection: the receptor-like kinases, which include the female component of genetically controlled self-incompatibility. Specifically, we question: (1) How does diversity of S-receptor kinase (SRK) alleles in tetraploids compare to that in their close diploid relatives? (2) Is there increased trans-specific polymorphism (i.e., sharing of alleles that transcend speciation, characteristic of balancing selection) in tetraploids compared to diploids due to the higher number of copies they carry? (3) Do these highly variable loci show evidence of introgression among extant species/ploidy levels within or outside known zones of hybridization? (4) Is there evidence for copy number variation among paralogs? We use this example to highlight specific issues to consider when interpreting gene family evolution, particularly in relation to polyploids but also more generally in diploids. We conclude with recommendations for strategies to address the challenges of resolving such complex loci in the future, using advances in deep sequencing approaches.

Keywords: polyploidy, gene family evolution, self-incompatibility, copy number variation, trans-specific polymorphism, balancing selection, introgression

#### Edited by:

Richard John Abbott, University of St Andrews, United Kingdom

#### Reviewed by:

Céline Poux, Université de Lille, France

Baocheng Guo, Institute of Zoology (CAS), China

\*Correspondence: Barbara K. Mable barbara.mable@glasgow.ac.uk

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Ecology and Evolution

> Received: 29 March 2018 Accepted: 17 July 2018 Published: 07 August 2018

#### Citation:

Mable BK, Brysting AK, Jørgensen MH, Carbonell AKZ, Kiefer C, Ruiz-Duarte P, Lagesen K and Koch MA (2018) Adding Complexity to Complexity: Gene Family Evolution in Polyploids. Front. Ecol. Evol. 6:114. doi: 10.3389/fevo.2018.00114

# INTRODUCTION

## Background and Aims

The sequencing of the human genome in 2001 (Lander et al., 2001) promised to revolutionize modern medicine and lead to a new era in understanding the complexity of genetic control of complex phenotypes. While this has certainly been true, it is really the comparative genomics of non-model organisms that has led to a complete revolution in understanding (e.g., Seeb et al., 2011; da Fonseca et al., 2016). One unexpected finding was that whole genome duplication (WGD) has been an important process contributing to the genomic history of all eukaryotes, including those with relatively small genomes, such as the yeast Saccharomyces cerevisiae (Wolfe and Shields, 1997) and the model plant Arabidopsis thaliana (Blanc and Wolfe, 2004). Although Susumu Ohno in the late 1960s had emphasized the central role of gene duplication in the evolutionary history of vertebrates (Ohno, 1970), it wasn't until after his death in 2000 that comparative genomic studies confirmed that fish had undergone multiple rounds of WGD (e.g., Meyer and Van de Peer, 2005), as he had predicted. He also had predicted that effective rediploidization following duplication was inevitable but that some duplicates would be retained to perform new or specialized functions, leaving a footprint of past duplications and organization of genes into gene families. His ideas about the fates of duplicate genes to include specialization of function (now known as "subfunctionalization"; Force et al., 1999) also have been resurrected and form the basis for understanding the history of complex genomes such as salmonids, which underwent an independent WGD after the last teleost specific duplication (Hermansen et al., 2016; Lien et al., 2016). Comparative studies of vertebrates have thus been critical for establishing polyploidization as a creative evolutionary force shaping the genomes of all eukaryotes (Van de Peer et al., 2017), as had long been recognized for plants (e.g., Soltis et al., 1992; Adams, 2007).

Nevertheless, despite recognition that duplicated genes are critical for understanding genome structure and function (Van de Peer et al., 2017), the practicalities of assembling duplicates in genomic resequencing studies, resolving orthology, and interpreting their potentially redundant effects on phenotypes remain a substantial challenge (da Fonseca et al., 2016). Retention of duplicate genes following genomic or tandem duplication is non-random (Adams, 2007) and is both constrained and promoted by achieving appropriate levels of expression (e.g., Gout and Lynch, 2015; Mattenberger et al., 2017; Rodrigo and Fares, 2018). The "gene balance" hypothesis, for example, predicts that loci involved in regulating levels of expression of integrated genetic pathways (such as transcription factors or members of signal transduction pathways) should show increased retention of duplicates to maintain coordinated function (Birchler and Veitia, 2010). Genes for which high expression is advantageous might be expected to retain expression in duplicated copies whereas divergence in patterns of expression could be advantageous for others. Genes that are retained in duplicate through one round of WGD also have been found to be preserved through later rounds (Seoighe and Gehring, 2004). Thus, not considering the role of gene copies retained in duplicate could alter interpretation of regulatory processes associated with adaptation.

One type of adaptive process often associated with large and complex gene families is recognition of self vs. non-self, where high polymorphism is favored by continually changing selection pressures, and retention of duplicate copies could be beneficial for increasing allelic repertoire. For example, the "big bang" theory of the emergence of the adaptive immune systems in vertebrates invokes multiple rounds of WGD as the major source of this potential (Flajnik and Kasahara, 2010). Similarly, investigation of the genomic repertoire of pathogen-associated genes (R genes) in several crop plants through targeted sequence capture (Jupe et al., 2012, 2013; Giolai et al., 2016; Van Weymers et al., 2016) has revealed much more extensive gene families than was previously predicted based on whole genome resequencing studies. R genes have also been demonstrated to show signatures of adaptive introgression between closely related species of Arabidopsis, with extensive trans-specific sharing of alleles across species (Bechsgaard et al., 2017). An added complication for these types of gene families is that copy number can be variable even among individuals within a species (e.g., Mable et al., 2015), meaning that genome references will not always include the full complement of copies. Copy number variation has been linked to disease severity in humans (Beckmann et al., 2007; Wheeler et al., 2008) and adaptive processes in other organisms (Saintenac et al., 2011; Zmienko et al., 2014; Duvaux et al., 2015; Hull et al., 2017) but methods that can reliably distinguish between lack of coverage and variation in presence of a particular gene copy are required to fully evaluate the evolutionary significance of presence/absence polymorphisms following gene duplication.

The high polymorphism expected for recognition genes means that they are prime candidates to be "lost" in genomic resequencing studies, even in diploids. For example, genes controlling sporophytically controlled self-incompatibility (SI) in plants have been found to be missing from resequencing assemblies because they are too divergent from the reference genome and so trawling in the unassembled reads is necessary to characterize these highly polymorphic genes (Mable et al., 2017). Both male and female components are members of large gene families that show extensive trans-specific polymorphism, with highly similar alleles shared across species and even genera but high divergence between functional specificities (Schierup et al., 1998; Paetsch et al., 2006; Castric and Vekemans, 2007; Busch et al., 2008; Guo et al., 2011; Tedder et al., 2011; Leducq et al., 2014). The gene controlling female specificity (S-receptor kinase, SRK) is part of a large family of receptor kinases, which evolved through a complex history of gene duplication and loss, followed by gene fission and fusion (Xing et al., 2013). Gene conversion between SRK and other members of the gene family is also thought to have contributed to expansion of functional allelic diversity (Prigoda et al., 2005; Guo et al., 2011). This creates additional challenges with interpreting which variants are parts of the functional locus regulating the SI response and which are functionally unlinked but show high sequence similarity. For sporophytic SI, the phenotype of the pollen is determined by the genotype of the diploid (or tetraploid) parent, so there can be dominance in both pollen and stigma. Dominance is known to be complex, with non-linear interactions that can differ between pollen and stigma (Lewis, 1947; Stevens and Kay, 1989; Hatakeyama et al., 1998; Shiba et al., 2002; Mable et al., 2003; Llaurens et al., 2009; Schoen and Busch, 2009). 
Trans-specific polymorphism (i.e., sharing of alleles that transcends speciation) of SRK alleles has been well established for diploids (Charlesworth et al., 2006; Boggs et al., 2009; Castric et al., 2010), and is thought to be a key indicator of the action of balancing selection (Takahata, 1990). However, the strength of balancing selection on tetraploids has not been assessed specifically. Since tetraploid individuals can carry up to four different SRK alleles, there is potential for increased sharing across species, at least of recessive alleles. They can also carry multiple copies of recessive alleles (Mable et al., 2004), which could result in the maintenance of more variants within specificities than for diploids. While previous work has demonstrated that linkage and dominance work similarly in tetraploid compared to diploid Arabidopsis lyrata (Mable et al., 2004), the evolutionary dynamics of S-alleles in tetraploids have not been studied.

In addition, interpreting the fate of duplicate genes in polyploids is complicated by the fact that hybridization is often associated with WGD and so it can be difficult to disentangle the effects of combining and duplicating genomes on patterns of duplicate gene expression or dynamics of gene families (e.g., Evans, 2007; Guggisberg et al., 2009; Mable, 2013). Fortunately, rapid advances in sequencing technology and bioinformatic processing mean that the toolbox available to resolve such challenges continues to improve. Targeted sequence capture, for example, has been used effectively to investigate genomic changes in polyploids (Salmon et al., 2012; Gardiner et al., 2016; Krasileva et al., 2017). However, even with these advances in technology there are important issues to consider when resolving and interpreting evolutionary dynamics of gene families, particularly for systems in which recent polyploidization and hybridization could complicate accurate assembly into orthologs and subsequent genotyping within and between copies.

The purpose of this paper is to discuss these issues in the context of understanding the evolutionary dynamics of the SRK gene family in a species complex (A. lyrata and A. arenosa) that includes both diploids and tetraploids, with tetraploids showing extensive introgression in a hybrid zone in central Europe (Schmickl et al., 2010; Jørgensen et al., 2011; Schmickl and Koch, 2011; Hohmann et al., 2014; Muir et al., 2015; Novikova et al., 2016; Hohmann and Koch, 2017). In A. arenosa, tetraploids have been predicted to have arisen through autopolyploidisation (Arnold et al., 2015); secondary contact with A. lyrata during interglacial and postglacial range contractions and expansions has subsequently led to introgression between tetraploids in the two species. Our intent was to use investigation of S-receptor kinase evolution in this species complex as a model for understanding how balancing selection operates in polyploid genomes and to determine whether these highly polymorphic gene families could be useful indicators of hybridization and introgression. Specifically, our objectives were to question: (1) How does diversity of SRK-related alleles in tetraploids compare to that in their close diploid relatives? (2) Is there increased trans-specific polymorphism of SRK alleles in tetraploids compared to diploids because of the increased number of copies they can carry? (3) Do these highly variable loci show evidence of introgression among extant species/ploidy levels within or outside known zones of hybridization? (4) Is there evidence for copy number variation among paralogs?

We use these questions to highlight the challenges for interpreting gene family evolution, particularly in polyploids, but also relevant to diploids. We conclude with recommendations for how some of these challenges might be overcome using deep sequencing approaches. We reiterate the recommendation from others (Salmon et al., 2012; Jupe et al., 2013; Gardiner et al., 2016; Van Weymers et al., 2016; Krasileva et al., 2017) that non-amplicon based targeted sequence capture (e.g., whole genome exon capture or targeting of particular gene families) is the most promising method for tackling the full complexity of gene family evolution in complex genomes but suggest cautionary strategies that should be considered when interpreting evolutionary patterns.

## Notes on Terminology and Known Challenges Associated With the SRK Gene Family

A complication with understanding the evolution of complex gene families is distinguishing what is meant by an "allele." For SRK, there can be sequence variation within "specificities," which are SRK types that confer a specific SI phenotype (i.e., a protein expressed on the surface of the stigma that is recognized as self by the comparable protein expressed on the surface of the pollen grain). These specificities (which we will refer to as "alleles") can be as divergent from one another as they are from other genes (which we will refer to as "loci") in the same gene family. Moreover, phylogenetic clustering alone is not sufficient to predict which sequence variants represent SRK alleles because gene conversion with unlinked loci has resulted in higher similarity between paralogs than among SRK alleles (Prigoda et al., 2005). Diploid individuals should contain only two functional SRK alleles but could contain varying numbers of loci in the gene family that are not linked to the SI phenotype; since tetraploids can contain multiple copies of the same allele without altering the specificity or dominance (Mable et al., 2004), the number of SRK alleles expected in a polyploid cannot be predicted. Thus, assigning "sequence variants" to gene family loci or SRK alleles is even more complicated in polyploids than for diploids. SRK alleles have been grouped into four different dominance classes (A1, A2, A3, B; Prigoda et al., 2005). Polymorphisms within specificities/alleles (which we will refer to as "haplotypes") are more apparent for recessive than dominant alleles because the former are expected to occur at higher frequency and show more sharing between populations (Bechsgaard et al., 2006; Castric and Vekemans, 2007; Castric et al., 2008, 2010; Stoeckel et al., 2008; Llaurens et al., 2009; Goubet et al., 2012). 
There is a single most recessive allele (S1, Class A1; Prigoda et al., 2005) that is found globally and in multiple species in the genus Arabidopsis (Mable et al., 2003; Dart et al., 2004; Prigoda et al., 2005; Mable and Adam, 2007; Castric et al., 2010; Foxe et al., 2010). Alleles in Class B are recessive to all other classes except S1 but are more similar to unlinked loci (Aly13-2 and Aly13-7) than to the other classes (Prigoda et al., 2005) and show more intra-allele polymorphisms than dominant alleles (Classes A2 and A3; Prigoda et al., 2005; Castric et al., 2010). The high trans-specific polymorphism also means that naming of alleles can be confusing because a variant found in a certain species is often given a specific number before it is discovered that it potentially represents the same specificity as an already named allele in another species (Castric et al., 2010). Thus, alleles are named with the species in which they were originally described as a prefix (e.g., Aly refers to A. lyrata, Aha refers to A. halleri, Ath refers to A. thaliana, Aar refers to A. arenosa). Finally, since the SI phenotype is determined by a combination of variants at the female SRK and male SCR genes, phenotypic specificities are labeled only "S#" (e.g., S1) for segregation analyses.
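
The prefix convention described above can be illustrated with a short sketch. The function name `species_of` and the choice to match prefixes case-insensitively (so that database names written in capitals, such as AHASRK04, are also recognized) are assumptions made for illustration only; they are not part of any published naming tool.

```python
# Illustrative only: maps the three-letter allele prefix to the species in
# which the allele was first described, following the convention in the text.
SPECIES_PREFIXES = {
    "Aly": "Arabidopsis lyrata",
    "Aha": "Arabidopsis halleri",
    "Ath": "Arabidopsis thaliana",
    "Aar": "Arabidopsis arenosa",
}


def species_of(allele_name):
    """Return the species of first description, or None if the prefix is unknown.

    Assumption: the prefix is always the first three characters and case is
    not informative (e.g., "AHASRK04" is treated like "AhaSRK04").
    """
    prefix = allele_name[:3].capitalize()
    return SPECIES_PREFIXES.get(prefix)
```

Note that the prefix records only where an allele was first named; because of trans-specific polymorphism, an "Aly" allele may well be sampled from A. arenosa.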

From our previous studies on the evolutionary dynamics of SRK alleles in diploids, we have already described challenges in generating robust data for interpreting these complex gene families in diploids, relevant for the sequencing strategies we apply here: (1) Primers designed to be general enough to recognize all SRK alleles also amplify the rest of the gene family, so a major challenge is assigning sequence variants to loci (Schierup et al., 2001; Charlesworth et al., 2003b; Mable et al., 2003, 2017; Mable and Adam, 2007). (2) This is complicated by the fact that, due to the extensive polymorphism in SRK and evidence that gene conversion has contributed to allelic repertoire, paralogs that are not linked to the SI phenotype can be more similar to "real" alleles than "real" alleles are to one another, so similarity can't always be used to assign functionality (Schierup et al., 2001; Mable et al., 2003; Prigoda et al., 2005). (3) Amplicon-based approaches are inherently at risk of generating PCR recombinants between copies, making it difficult to distinguish errors from actual recombination, introgression in hybrids, or gene conversion between sequences. (4) It is also difficult to distinguish presence/absence of paralogs from amplification biases during PCR (Mable et al., 2017). (5) There is extensive length heterogeneity within and between members of the gene family, so it can be difficult to establish the positional homology necessary to interpret patterns of selection (Charlesworth et al., 2003a). (6) The highly polymorphic nature of SRK alleles means that they are sometimes too divergent from the reference genome to be assembled using standard filtering strategies; this means that these types of alleles might frequently be found in the unassembled reads for resequencing projects (Mable et al., 2017).

# MATERIALS AND METHODS

## Sampling and Overview of Methods

Samples were obtained from both diploid and tetraploid populations of A. lyrata and A. arenosa sampled from Central Europe (**Table 1**). Although current systematics suggests separation of diploid and tetraploid A. arenosa into distinct species taxonomically (Koch et al., 2008), for simplicity, we will refer to both as A. arenosa here. We sampled individuals from 3–5 populations of each "type": A2x refers to diploid A. arenosa, A4x to tetraploid A. arenosa, L2x to diploid A. lyrata and L4x to tetraploid A. lyrata. Tetraploid populations occurring in a hybrid zone between the two species (Schmickl, 2009; Schmickl and Koch, 2011; Hohmann et al., 2014; Muir et al., 2015; Novikova et al., 2016) were included to test for patterns of introgression. Diploids have not been found to hybridize (Jørgensen et al., 2011) and so were considered "pure" populations. To test patterns of linkage of sequence variants with the SI phenotype, we also included 104 individuals from crosses between A. lyrata tetraploid parents whose genotypes had been partially resolved by cloning and Sanger sequencing; we performed diallel crosses within these families to establish SI phenotypes that could be compared to the 454 genotypes.

We used a combination of approaches to address the main research questions: (1) 454 pyrosequencing using degenerate primers (**Supplementary Table 1**) targeting the SRK gene family (Jørgensen et al., 2012) to characterize diversity and patterns of allele sharing in diploids and polyploids; (2) direct Sanger sequencing to investigate signatures of introgression in shared haplotypes and for segregation analyses to test linkage to the SI phenotype; (3) cloning and Sanger sequencing using degenerate primers (**Supplementary Table 1**) to obtain longer products than possible with 454 pyrosequencing to further characterize potentially new alleles; and (4) using data from a recent genomic resequencing study (Novikova et al., 2016) to search for the SRK gene family using novel assembly approaches, to test whether copy number variation and patterns of introgression can be mined using existing genomic data. We focused on variation in exon 1 (the S-domain) because it contains the sites used for recognition of self vs. non-self (Schierup et al., 2001; Charlesworth et al., 2003a). However, we also used the genome mining approach to determine whether we could pull out full-length sequences that include the functional kinase domain (exons 3–7).

While 454 pyrosequencing has largely been replaced by methods demonstrated to show higher accuracy such as Illumina (Schirmer et al., 2015, 2016; D'Amore et al., 2016), we use results from this study as a platform to highlight considerations for working with gene families that should apply across methods. We thus haven't focused on attempting to resolve 454-specific problems but instead on general issues with clustering and assigning sequence variants to loci and designating allelic specificities for interpretation of gene family evolution. We include these as "challenges" in relation to the methods used to address each objective.

## Detailed Methodology

#### Clustering and SRK Genotyping Strategies

Crosses performed between individuals sampled from Mödling and Rauheneck Ruin near Baden were used to test segregation of genotypes resolved using 454 and SI phenotypes.

<sup>a</sup>Not included in Schmickl (2009) but collected from Rauheneck Ruin, near Baden.

<sup>b</sup>Insufficient DNA remained after the 454 sequencing to screen for AlySRK01.

To increase the probability of amplifying all variants of SRK present in the populations sampled, we used 454 pyrosequencing of pooled amplicons from four sets of degenerate primers that share a common reverse sequence, SLGR (**Supplementary Table 1**; Schierup et al., 2001). Detailed methods for the 454 analyses are described in Jørgensen et al. (2012), including estimation of error rates and the use of segregation within known families to test the reliability of genotyping. That initial paper described the strategies used for clustering reads into contigs and filtering to reduce errors. We recommended clustering with a 90% sequence similarity criterion and excluding sequences present at a frequency of <7% of the total reads for an individual; these conclusions were based on a subset of the original data that included repeated runs involving the same individuals. We also recommended that clustering should be conducted after reads were trimmed to 200 bp from the "common primer" end (SLGR in this case).
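
The three pilot-study recommendations (trimming to 200 bp, a 90% similarity criterion, and excluding variants below 7% of an individual's reads) can be sketched as follows. This is a toy illustration, not the pipeline of Jørgensen et al. (2012): the greedy seed-based clustering and the ungapped identity measure are stand-ins chosen for brevity, all function names are hypothetical, and reads are assumed to be already oriented so that they start at the common primer.

```python
TRIM_LENGTH = 200      # bp retained from the common-primer end
MIN_IDENTITY = 0.90    # similarity criterion for joining a cluster
MIN_FREQUENCY = 0.07   # clusters below 7% of an individual's reads are dropped


def trim_read(read, length=TRIM_LENGTH):
    """Keep the first `length` bases (assumes the read starts at the common primer)."""
    return read[:length]


def identity(a, b):
    """Proportion of matching positions over the shorter of two ungapped sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(reads, min_identity=MIN_IDENTITY):
    """Add each read to the first seed it matches at >= min_identity, else seed a new cluster."""
    clusters = {}  # seed sequence -> list of member reads
    for read in reads:
        for seed in clusters:
            if identity(read, seed) >= min_identity:
                clusters[seed].append(read)
                break
        else:
            clusters[read] = [read]
    return clusters


def filter_by_frequency(clusters, min_frequency=MIN_FREQUENCY):
    """Drop clusters that make up less than min_frequency of all reads for the individual."""
    total = sum(len(members) for members in clusters.values())
    return {
        seed: members
        for seed, members in clusters.items()
        if len(members) / total >= min_frequency
    }
```

In practice, reads would be aligned before computing similarity, since the length heterogeneity in this gene family makes position-by-position comparison unreliable.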

Although the crosses between tetraploid A. lyrata individuals confirmed presence of the expected SRK alleles known to be present in the parents, they also indicated some inaccuracy in allele calls in relation to barcodes; a number of alleles that were not in the parents were assigned to individuals from the crosses, sometimes at high read numbers (see Jørgensen et al., 2012). We concluded that this was due to tag switching between barcodes, as had been suggested from other studies (van Orsouw et al., 2007; Carlsen et al., 2012). Blank lanes (negative controls) also sometimes contained sequences matching known SRK alleles, again often at high read numbers. We thus modified our filtering and clustering strategies in the analysis of the full dataset.

Reads were initially assembled into contigs based on clustering to sequences from a reference database of known SRK alleles and known members of the gene family that have been characterized in other studies and from our unpublished data from Sanger sequencing. A second iteration then used newly sequenced reads as seeds for clustering, in order to identify putatively new alleles (generating "read-only" contigs). BLAST analyses of "read-only" contigs indicated that some known alleles (both SRK and paralogs) had been fragmented into multiple contigs. In such cases, contigs for a particular allele were combined, sequences sorted by barcode, and read numbers counted for each individual that contained a particular sequence type. Remaining "read-only" contigs that did not show at least 80% similarity to S-related kinases from GenBank were not considered further. Final contigs were then sorted into putative "types": known SRK alleles, putatively new SRK-like variants, or known paralogs. Contigs assigned to SRK alleles whose dominance had been established previously (Prigoda et al., 2005; Goubet et al., 2012) were further sorted into the following classes: (1) A1, consisting of a single most recessive allelic specificity that has been found globally in Arabidopsis species (SRK01); (2) A2, dominant to all other classes; (3) A3, recessive only to class A2; and (4) B, recessive to all except A1 and showing high similarity to unlinked loci (Aly13-2 and Aly13-7). Contigs were also inspected for clustering of more than one named SRK allele from the database.
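
The class-level dominance hierarchy listed above (A2 over A3 over B over A1) can be written as a simple ranking. The sketch below is a deliberate simplification: the names are hypothetical, it compares classes only, and it ignores the non-linear interactions and pollen/stigma differences in dominance noted in the Introduction.

```python
# Higher rank = more dominant, following the class descriptions in the text:
# A1 is the most recessive, B is recessive to all except A1,
# A3 is recessive only to A2, and A2 is dominant to all other classes.
DOMINANCE_RANK = {"A1": 0, "B": 1, "A3": 2, "A2": 3}


def most_dominant_class(allele_classes):
    """Return the highest-ranked dominance class among the alleles an individual carries."""
    if not allele_classes:
        raise ValueError("at least one allele class is required")
    return max(allele_classes, key=DOMINANCE_RANK.__getitem__)
```

For a tetraploid carrying, for example, two Class B alleles, one Class A1 allele and one Class A3 allele, the sketch returns "A3"; it makes no statement about relationships among alleles within a class.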

The next step was to subdivide variants within contigs into individual haplotypes, in order to test patterns of trans-specific polymorphism and to assess evidence for introgression between species. In our pilot study (Jørgensen et al., 2012) we recommended that only sequence variants present in at least 7% of the reads for an individual should be "counted" as true variants. However, in the full analysis, inspection of the contigs associated with particular alleles revealed very uneven read numbers both between individuals (ranging from a minimum of a single read to a maximum of 1,126 reads in the 465 individuals screened; average 344 ± 156) and across loci (i.e., SRK alleles and paralogs) within individuals. Low read numbers of particular alleles were also not directly proportional to the overall read numbers in the individual. The strict 7% threshold would have excluded some alleles that amplified in multiple individuals but were only present at low read numbers within individuals. A striking example was SRK01: it was fragmented across multiple contigs but when reassembled, it tended to be found at very low read numbers within individuals but was found across a wide range of individuals and showed population- and species-specific variants, as expected for a recessive allele (Billiard et al., 2007; Goubet et al., 2012). Many individuals showed <20 reads but the individuals that showed high read numbers (>100) tended not to show amplification of any other alleles, suggesting competition in the PCR when other alleles were present.

For haplotype calling, we thus also considered genotype calls at thresholds of at least 4% of reads and between 0 and 4% of reads. A problem with assessing such optimization strategies when including tetraploids is that there is not a robust basis for excluding individuals based on numbers of expected haplotypes. Although we could use diploids to determine thresholds of read numbers that minimized calling of more than two SRK alleles per individual and predicting homozygosity only for recessive alleles, this was confounded by the difficulties of predicting linkage of newly identified alleles (Charlesworth et al., 2003b; Prigoda et al., 2005). Tetraploids are expected to have up to four copies of SRK per individual but they can also contain multiple copies of recessive alleles (Mable et al., 2004), precluding extrapolating "confidence thresholds" based on diploids. We thus decided on a conservative threshold of at least 20 reads for a given haplotype to make relative comparisons among populations and species in the frequency of presence of particular variants. For reconstruction of evolutionary relationships among alleles, haplotypes present in <20 reads in a single individual and individuals with <200 total reads were excluded.
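The read-count filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout (nested dictionaries of read counts), the function names, and the example individuals are hypothetical, while the numeric thresholds (7% and 4% of reads, 20 reads per haplotype, 200 reads per individual) are those quoted in the text.

```python
MIN_HAPLOTYPE_READS = 20    # per-haplotype read threshold quoted in the text
MIN_INDIVIDUAL_READS = 200  # per-individual total used before tree building


def passes_fraction(haplotype_reads, total_reads, threshold=0.07):
    """Proportion-based rule from the pilot study (7%) or the relaxed rule (4%)."""
    return haplotype_reads / total_reads >= threshold


def call_haplotypes(read_counts, total_reads):
    """Apply the absolute filters used for reconstructing relationships.

    read_counts: {individual: {haplotype: reads}}  (hypothetical layout)
    total_reads: {individual: total reads for that individual}
    Returns {individual: set of retained haplotypes}.
    """
    calls = {}
    for individual, haplotypes in read_counts.items():
        if total_reads[individual] < MIN_INDIVIDUAL_READS:
            continue  # whole individual excluded
        calls[individual] = {
            name for name, reads in haplotypes.items()
            if reads >= MIN_HAPLOTYPE_READS
        }
    return calls
```

In this sketch a haplotype with 8 of 400 reads (2%) fails both the proportional and the absolute rule, which mirrors the difficulty reported for SRK01.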

#### Statistical Analyses

To investigate whether there were differences in sequencing quality, detection biases, or real differences in the frequency of sequence variants found, we used generalized linear models to test whether the variation was significantly explained by ploidy, species or their interaction. Since multiple 454 runs were used for genotyping, we included barcoding tag number and lane as random effects, to account for any variation they explained. Analyses were conducted using JMP version 10.0 (SAS Institute, Incorporated).

#### Reconstructing Evolutionary Relationships Among Alleles

To establish phylogenetic relationships of newly identified alleles and to predict their dominance, we aligned the 454 sequences to the reference set (**Supplementary Data Sheet 1**) and reconstructed phylogenetic trees, using MEGA 7.0 (Kumar et al., 2016). We extracted consensus sequences for each haplotype of the SRK-like alleles identified and initially performed multiple alignments using the online version of Clustal Omega (Sievers et al., 2011); alignments were then optimized by eye to establish positional homology and to set the correct reading frame to minimize stop codons, using Se-Al version 2.0 (Rambaut, 1996) and MacClade version 4.0 (Maddison and Maddison, 2000). To assess patterns of trans-specific polymorphism, if there was an exact match of a sequence to the reference database used for clustering, we named the haplotype "REF\_HAP1" but if there was no exact match we retained the database allele (just named "REF"). We also added homologs from A. lyrata, A. arenosa, A. halleri and A. thaliana from Genbank for each specificity identified among the 454 samples (e.g., AHASRK04 and ATH-haplogroup A have been identified as homologs of AlySRK37; Bechsgaard et al., 2006). As implemented in MEGA, the best fitting substitution model was identified using ModelTest and then Maximum Likelihood was used to cluster sequences, using 1,000 bootstrap replicates. Due to the reticulate nature of evolution in this gene family, a strictly bifurcating evolutionary history is not expected but a tree-like representation is useful for identifying clusters of similar sequences. In previous studies, we have found that phylogenetic clustering is informative about dominance for Class A3 and B alleles but that Class A2 are paraphyletic based on alignments of approximately 900 bp of sequences in exon 1 of SRK (Prigoda et al., 2005). We thus used phylogenetic clustering to predict dominance of new specificities identified or known specificities for which dominance had not been established.
We calculated genetic distances within and between dominance classes using both the best fitting substitution model and raw % similarity, using MEGA. We then mapped relative frequency of each haplotype in the four types of populations onto the tree, using Evolview in the Evolgenius package (He et al., 2016).
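As a rough illustration of the raw (uncorrected) comparison, the following Python sketch computes percent divergence between aligned sequences and averages it within and between groups such as dominance classes. It is not the MEGA implementation: the function names are hypothetical, and the choice to skip alignment columns containing gaps or ambiguous bases is an assumption that may differ from the settings used in the study.

```python
from itertools import combinations, product


def p_distance(seq_a, seq_b):
    """Percent of differing sites between two aligned sequences.

    Columns where either sequence has a gap or a non-ACGT symbol are skipped
    (an assumption made for this sketch).
    """
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return 100.0 * sum(x != y for x, y in pairs) / len(pairs)


def mean_within(seqs):
    """Mean pairwise divergence within one class (needs at least two sequences)."""
    distances = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(distances) / len(distances)


def mean_between(seqs_1, seqs_2):
    """Mean pairwise divergence between two classes."""
    distances = [p_distance(a, b) for a, b in product(seqs_1, seqs_2)]
    return sum(distances) / len(distances)
```

Model-corrected distances, as also reported in the text, would require a substitution model and are not covered by this sketch.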

#### Testing the Accuracy of 454 Genotyping Using Segregation Analyses

We used the 454 pyrosequencing to genotype SRK from 11 families raised from crosses between tetraploid A. lyrata individuals whose grandparents had at least partially resolved SRK genotypes, in order to test segregation of alleles and as an additional test of reliability of the clustering thresholds set. Given the low read numbers found for SRK01, we established genotypes by a combination of allele-specific Sanger sequencing for this allele with the 454 sequencing for other alleles to compare segregation of alleles within families and to aid in excluding spurious allele calls. For a subset of these crosses, we performed controlled pollinations among all pairwise combinations of individuals, in order to test linkage of the variants identified to the SI phenotype and to predict dominance relationships (as in Mable et al., 2004).

#### Direct Sanger Sequencing of SRK01

To complement the 454 sequencing, we used targeted direct Sanger sequencing to resolve SRK01 genotypes to be able to investigate signatures of introgression of this recessive allele. We screened all individuals raised from the crosses between tetraploid A. lyrata individuals to aid in segregation analyses and a subset of individuals from the population survey to confirm haplotype calls and obtain more accurate frequencies of variants within and between individuals (**Table 1**).

We amplified products using an allele-specific primer (qtAlSRK01F: TCCTACATCATCGCAG) with the general reverse primer (SLGR: ATCTGACATAAAGATCTTGACC) that had been used for 454 sequencing. The 20 µL PCR reactions (using reagents from Invitrogen, Inc., Paisley, UK) consisted of 1 µL template, 2 µL 10x PCR buffer, 2 µL 10 mM dNTPs, 1 µL 50 mM MgCl2, 0.2 µL 10 µM of each primer, and 0.2 µL Taq polymerase. The PCRs were run in MJ Research thermocyclers using the following program: initial denaturing phase of 3 min at 94°C, 1 min annealing at 54°C, 2 min extension at 72°C; followed by 34 cycles of 30 s at 94°C, 30 s at 54°C, 2 min at 72°C; and a final extension step of 6 min at 72°C.

Individuals that showed amplification of products of the expected size (∼500 bp) were sent for sequencing to The GenePool in Edinburgh, using the reverse primer SLGR. Chromatograms were checked for base-calling errors using Sequencher 4.7 (Gene Codes Corporation, Ann Arbor, MI) and BLAST was used to confirm sequence identity.

Sequences were aligned using Sequencher, version 4.7, and heterozygous positions were recorded using IUPAC (International Union of Pure and Applied Chemistry) ambiguity codes. The phase of heterozygous positions was resolved by matching to variants found in the 454 sequencing and to homozygous sequences found in the Sanger sequencing. Genotypes predicted based on this process were then aligned to the specific 454 sequences for each individual. Species-specific variants were identified in diploids based on private haplotypes for the two species. We used the datamonkey server (www.datamonkey.org; Delport et al., 2010), which implements statistical tests associated with the programme HyPhy (Pond et al., 2005), to test for evidence of recombination using GARD (Genetic Algorithm for Recombination Detection; Pond et al., 2006). In addition, we manually inspected alignments for evidence of potential breakpoints and, in such cases, aligned each "section" independently to the other haplotypes identified for a particular specificity. Where a putatively recombinant type showed similarity to two or more species-specific haplotypes in different regions of the sequence, it was classified as potentially introgressed. A minimum spanning network (Bandelt et al., 1999) was drawn using PopArt (Leigh and Bryant, 2015) to resolve the relationships among the SRK01 haplotypes.
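The use of IUPAC ambiguity codes for heterozygous positions, and the idea of resolving phase by matching against known haplotypes, can be sketched as follows. The code table is the standard IUPAC nucleotide set; the function names and the toy sequences are hypothetical, and the sketch assumes gap-free, equal-length aligned sequences rather than real chromatogram data.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases present.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}
BASES_FOR_CODE = {code: bases for bases, code in IUPAC.items()}


def genotype_consensus(haplotypes):
    """Collapse aligned haplotypes (two or more) into one IUPAC-coded genotype."""
    if len({len(h) for h in haplotypes}) != 1:
        raise ValueError("haplotypes must be aligned to the same length")
    columns = zip(*(h.upper() for h in haplotypes))
    return "".join(IUPAC[frozenset(column)] for column in columns)


def is_compatible(genotype, haplotype):
    """True if a known haplotype could be one phase of an IUPAC-coded genotype."""
    if len(genotype) != len(haplotype):
        raise ValueError("sequences must be aligned to the same length")
    return all(base in BASES_FOR_CODE[code]
               for code, base in zip(genotype.upper(), haplotype.upper()))
```

For example, a directly sequenced genotype coded `AYGT` is compatible with the haplotypes `ACGT` and `ATGT` but not with `AAGT`, which is the kind of check implied by matching Sanger genotypes to variants found in the 454 data.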

#### Cloning and Sanger Sequencing of Longer SRK Alleles

As the 454 sequences were too short to be informative for future population genetics analyses and tests for selection, we used degenerate primers (**Supplementary Table 1**) to amplify longer products from tetraploid A. lyrata and A. arenosa sampled from the hybrid zone in the Wachau region of Austria (∼600 bp, also described in Ruiz-Duarte, 2012). We then used these products as seeds for the genome mining (see section Mining SRK Alleles From Genome Resequencing Data) to test whether we could determine the genomic location of the "new" alleles found, as an indication of linkage to the S-locus.

Genomic DNA was extracted from three to four leaves from plants of tetraploid A. lyrata and A. arenosa individuals using a modified CTAB protocol (Doyle and Doyle, 1987). Degenerate primers known to amplify a number of different gene family copies and SRK alleles (Schierup et al., 2001) in A. lyrata and A. halleri (Forward: 13SeqF1, 5′-ccgacggtaaccttgtcatcctc-3′ and Reverse: SLGR, 5′-atctgacataaagatcttgacc-3′) were used (Charlesworth et al., 2000). Genomic DNA was mixed with a pair of primers, 10 µmol each, 4 µl of 5x buffer (ready-made), 50 mM MgCl2, 0.4 µl of 10 mM dNTP mixtures, 0.1 µl Taq DNA Polymerase (Mango Taq, Bioline). PCR amplification conditions were as follows: denaturation at 94°C for 2 min followed by 34 cycles of 94°C for 30 s, 50°C for 30 s, and 72°C for 30 s; a final extension at 72°C for 5 min.

PCR products were cloned into pGEM®-T Vector Systems (Promega Inc.). Colony PCR (20–30 colonies per individual) was conducted to test for inserts using SP6 and T7 primers, followed by AluI digestion to identify clones carrying different putative SRK alleles. To avoid errors that might occur during PCR amplification and sequencing, a minimum of three independent clones with the same restriction profile were sequenced at the GATC BIOTECH facility. SeqMan software (DNASTAR, Inc) was used to clean and create consensus sequences.

We created separate alignments for each allele that was found both in the 454 and the Sanger sequencing by aligning the new sequences to references from Genbank and to the 454 sequences, in order to confirm shared specificity (**Supplementary Data Sheet 2**).

#### Mining SRK Alleles From Genome Resequencing Data

The 454 pyrosequencing data was not appropriate for determining presence and absence of paralogs because of: (1) the difficulty of distinguishing gene copies from new alleles at the SRK locus; and (2) amplification biases that made it difficult to set thresholds for reliability. Several known paralogs (Aly8, Aly9, Aly13-2/13-7) were expected to amplify with the primer set used. Polymorphic regions like the S-locus are known to be difficult to assemble in genome resequencing studies due to divergence from the reference genome (Mable et al., 2017) but we tested whether de novo assemblies from a genome resequencing study (Novikova et al., 2016) could be used to assess copy number of the SRK–related kinase gene family. We also attempted to pull out full-length sequences that spanned the S-domain (exon 1), transmembrane (exon 2) and kinase domains (exons 3-7) (Charlesworth et al., 2003a).

There are currently 28 fully resequenced genomes available from diploid and tetraploid A. lyrata and A. arenosa, from which we selected three or four individuals from each species and ploidy level to test whether we could obtain useful information on copy number and complete gene sequences. We used our paired end read data (Genbank SRR2040821, SRR2040822, SRR2040825, SRS945917, SRS1256176, SRS1256175, SRR2020827, SRR2040828, SRR2040829, SRR2040830, SRR2040791, SRR3111440, SRR3111441) and trimmed the reads for adapter contamination using cutadapt (Martin, 2011) and the respective adapter sequences. To obtain SRK alleles from these data we attempted two different approaches: mapping-based and de novo assembly. On average we used ∼110 million paired end reads for the tetraploid accessions and ∼60 million reads for the diploid accessions, corresponding to an average coverage of 20x.

In the initial mapping strategy we used as reference the S-locus region containing SRK on scaffold 7 of the MN47 reference genome (which was originally sampled from a North American outcrossing population and has the S13 allele of the genes AL7G32720 = SCR, AL7G32730 = SRK, AL7G32710 = ARK3; Mable et al., 2017). Upon mapping we intended to extract reads that mapped to SRK and adjacent sequences in pairs and to perform a de novo assembly of these sequences only. In a first attempt we mapped reads using bwa (Li and Durbin, 2009). However, this approach yielded no reads, or only an extremely low number of reads, mapping to SRK, while adjacent regions were covered by the expected number of sequencing reads. Since bwa expects reads to have an identity of 90% or more to the reference and SRK alleles show much lower similarity (as little as 70% identity), we were not successful in mapping SRK reads to the reference. In a second attempt we used Next Gen Mapper (Sedlazeck et al., 2013), which only requires 65% identity between read and reference. With this approach we were able to map reads to the S-locus including SRK, but a de novo assembly of these reads into complete or partial copies of the SRK locus nevertheless failed.

We used CLC genomics workbench (https://www.qiagenbioinformatics.com/) to perform de novo assemblies using standard settings (automatic word and bubble size, minimum contig length 500 bp, reads were mapped back to contigs setting mismatch costs, insertion costs and deletion costs to 3 and length fraction as well as similarity fraction were set to 0.9) and the scaffolding option. Resulting scaffolds/contigs were indexed as BLAST libraries. We initially used FJ867321 (the S-domain from AlySRK30) to BLAST against these libraries to pull out sequences predicted to be SRK based on more than 50% coverage of the query sequence (filtered for low complexity, expect set to 10, word size to 11, match to 2, mismatch to −3, gap existence to 5, gap extension to 2). These hits were aligned to the first exon of AL7G32730 (AlySRK13 from the MN47 reference genome) to identify intron/exon boundaries and then trimmed if necessary. This approach yielded a total of 66 sequences across the 13 accessions analyzed (**Supplementary Table 10**), more than the four SRK copies per individual that would be expected even if every accession were tetraploid (13 × 4 = 52). Our BLAST search must therefore also have identified other S-domain encoding genes besides SRK.
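The coverage criterion (more than 50% of the query covered) can be sketched in Python as follows. Several details are assumptions made for illustration only: the hits are taken to be in the standard 12-column tabular BLAST layout (query start and end in columns 7 and 8), coverage is computed as the union of all alignment intervals per contig, and the query length passed in by the caller is hypothetical since it is not stated in the text.

```python
def query_coverage(intervals, query_length):
    """Fraction of the query covered by the union of 1-based inclusive intervals."""
    covered = 0
    last_end = 0
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        start = max(start, last_end + 1)  # count overlapping stretches only once
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_length


def contigs_passing(blast_lines, query_length, min_coverage=0.5):
    """Return contigs whose alignments cover more than min_coverage of the query.

    blast_lines: tab-separated lines in the standard 12-column tabular layout
    (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, ...).
    """
    intervals_by_contig = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        contig, q_start, q_end = fields[1], int(fields[6]), int(fields[7])
        intervals_by_contig.setdefault(contig, []).append((q_start, q_end))
    return sorted(contig for contig, intervals in intervals_by_contig.items()
                  if query_coverage(intervals, query_length) > min_coverage)
```

Taking the union of intervals means that a contig hit by two overlapping alignments is not credited twice for the shared stretch.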

In order to obtain an overview on the presence of S-domain encoding genes we performed another BLAST search using the first exon of the MN47 SRK against the MN47 reference genome. This search revealed five genes encoding proteins that have an S-domain (AL7G32730 = SRK, AL7G32710 = Aly8, AL6G48380 = Aly3, AL3G23610 = Aly9, AL2G23090 = Aly10.2). From this result we expected that our contigs identified in the 13 resequenced accessions should have their best BLAST hit with one of these five loci. So, we aligned the 66 contig sequences to the first exon of the MN47 SRK and trimmed them in length to the first exon. Then we performed a BLAST search of the 66 trimmed sequences against the MN47 reference genome. All of the 66 sequences had their best BLAST hit with one of the five loci we had identified beforehand. Typically hits for AL7G32710, AL6G48380, AL3G23610, AL2G23090 showed a very small E-value and a high score, while AL7G32730 hits were characterized by a lower score and a larger E-value due to the lower conservation among alleles of this locus.
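Assigning each trimmed contig to the locus of its best hit can be sketched as below. The gene identifiers and locus names are those listed in the text; the input layout (tuples of contig, gene identifier and bit score) and the function name are hypothetical, and the sketch assumes that hits have already been annotated with the gene they overlap.

```python
# Gene identifiers and locus names as listed in the text.
S_DOMAIN_LOCI = {
    "AL7G32730": "SRK",
    "AL7G32710": "Aly8",
    "AL6G48380": "Aly3",
    "AL3G23610": "Aly9",
    "AL2G23090": "Aly10.2",
}


def assign_best_hits(hits):
    """Map each contig to the locus of its highest-scoring hit.

    hits: iterable of (contig, gene_id, bit_score) tuples (hypothetical layout).
    Contigs whose best hit is not one of the five loci are labeled "unassigned".
    """
    best = {}
    for contig, gene_id, score in hits:
        if contig not in best or score > best[contig][1]:
            best[contig] = (gene_id, score)
    return {contig: S_DOMAIN_LOCI.get(gene_id, "unassigned")
            for contig, (gene_id, _score) in best.items()}
```

Because scores for SRK hits are expected to be lower than for the paralogs, the comparison is made only among hits of the same contig, never against a fixed score cutoff.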

We initially used the BLAST results to predict similarity to known SRK alleles and related receptor kinase gene family members available in Genbank for each of the contigs. However, since we had identified potentially new variants in this study, we also aligned sequences pulled out from the resequenced genomes to our reference database and to the sequences found using 454 and the longer Sanger sequences to confirm sequence identity (**Supplementary Data Sheet 2**; **Supplementary Table 12**). We used clustering in phylogenetic trees (reconstructed using Maximum Likelihood in MEGA 7.0) to predict SRK specificity and to determine presence/absence of other members of the gene family.

One of the paralogs (Aly9) is known to amplify in all A. lyrata individuals that have been tested using PCR-based screening (Mable, personal observation). We thus used identification of this locus as a control for whether it was likely that the genome-mining approach could be reliably used to detect copy number variation in highly polymorphic gene families. The approach described initially only identified this locus in three of 12 genomes so we trialed another approach, using the sequences in **Supplementary Data Sheet 2**, along with the 66 contigs originally identified to BLAST the de novo assemblies for each genome. This resulted in an additional 102 contigs, which then were aligned back to the reference database and identities confirmed using cluster analysis. In this analysis Aly9 was resolved for all individuals and more complete genotypes were obtained for SRK and the other paralogs screened, so only the results from this final analysis are presented.

### RESULTS AND DISCUSSION

#### Objectives 1 and 2: Diversity and Allele Sharing of SRK in Diploids and Tetraploids

After filtering and assigning variants to alleles based on sequence similarity and predicting dominance classes and linkage to the SI phenotype based on phylogenetic clustering, we identified 107 haplotypes (unique sequence variants) that could be grouped into 63 potential alleles (specificities) that were at least 80% similar to SRK (**Figure 1**; **Supplementary Table 2**). Seventeen were potentially new specificities that were <90% similar to the A. lyrata, A. halleri or A. arenosa reference sequences included (**Supplementary Table 3**). However, seven of these new variants were predicted not to be linked to SI based on phylogenetic clustering and so could represent other members of the gene family. All of the new potentially unlinked alleles were found in diploid and/or tetraploid A. arenosa, with two of them also occurring at high frequency in L4x populations but only a single L2x individual sharing one of the new unlinked alleles with A4x individuals. The new alleles predicted to be linked to SI were distributed more evenly between the two species.

evolution with rate heterogeneity modeled under a gamma distribution and with proportion of invariant sites estimated. Bootstrap proportions above 70% are indicated as filled circles on nodes. The tree was rooted with the unlinked paralogs Aly8 (Ark3 in A. thaliana) and Aly10.1 (Ark1 in A. thaliana). Alleles for each SRK specificity are assigned to a dominance class based on previous studies of A. lyrata (Prigoda et al., 2005) and A. halleri (A1 = yellow; A2 = red; A3 = green; B = blue; unlinked = gray); new alleles or previously identified alleles where dominance has not been confirmed are colored according to the class predicted by their position in the tree. Tip labels are colored according to the species in which they were found in the 454 sequences (lyrata = red; lyrata+arenosa = purple; arenosa = blue) or the origin of the reference allele in cases where there was no exact match (halleri = green; thaliana = black). Also shown is the frequency of a particular haplotype in each of the four groups compared (diploid arenosa, A2x = dark blue; tetraploid arenosa, A4x = light blue; diploid lyrata, L2x = dark red; tetraploid lyrata, L4x = light red). Due to the high number of haplotypes but low read numbers for AlySRK01 and the unlinked loci Aly13-2 and Aly13-7, only a subset of haplotypes are included and frequencies are not indicated.

When accounting for variation due to lane and tag as random effects using generalized linear models, we found no evidence for significant differences between species or ploidy levels or their interactions in terms of number of reads, total number of contigs resolved (indicative of the wider gene family), the number of SRK-like alleles (i.e., variants showing at least 80% similarity to known SRK sequences, so including unlinked alleles), or the number of alleles or haplotypes per individual predicted to be linked to SRK (**Supplementary Table 4**). There was a significant interaction between ploidy and species in the proportion of contigs resolved that were at least 80% similar to SRK (i.e., more reads were SRK-like than similar to other members of the gene family), with a significantly higher proportion in tetraploids compared to A. arenosa diploids but no significant difference compared to A. lyrata diploids. Since the primers used were developed based on variation within A. lyrata (Schierup et al., 2001; Charlesworth et al., 2003a, 2006), this could be an indication that not all SRK-like alleles were amplified for A. arenosa due to variation in the primer regions, resulting in resolution of more spurious contigs due to non-specific amplification. However, overall, there was very little evidence that tetraploids were fundamentally different to diploids in terms of sequence quality or the ability to resolve variants.

The 200 bp sequences produced similar resolution in phylogenetic clustering as previous studies using 600 bp (Tedder et al., 2011) and resulted in consistent patterns of polymorphism expected for dominant and recessive alleles at SRK. Examination of relative frequency distributions also generally met theoretical expectations but indicated no obvious differences in diversity between ploidy levels. There was extensive variability in relative frequencies of each haplotype, with some being restricted to certain species or populations and some being found across both species and ploidy levels (**Figure 1**; **Supplementary Table 2**). We predicted that there should be highest interspecific sharing of individual haplotypes among tetraploids due to their known introgression (Schmickl, 2009; Schmickl et al., 2010; Jørgensen et al., 2011; Schmickl and Koch, 2011) but also because they can maintain more allelic copies within individuals. We found that 23 haplotypes were shared between A4x and L4x compared to 12 between A2x and L2x, including seven that were shared among all four population types (**Supplementary Table 2**). Sharing between the two types of tetraploids was similar to that among ploidy levels within species (24 among A. lyrata and 22 among A. arenosa). The highest number of private haplotypes was also found for diploids: 19 for A2x and 15 for L2x, compared to 12 for A4x and 8 for L4x. These results are consistent with predicted patterns of introgression among the tetraploids in northeastern Austria (Wachau region and Forealps; Schmickl et al., 2010; Jørgensen et al., 2011; Schmickl and Koch, 2011).
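Counts of shared and private haplotypes of the kind reported above reduce to set operations on the haplotypes observed in each population type. The sketch below is illustrative only: the function name and the input layout (one set of haplotype names per group) are hypothetical, and no values from the study are reproduced.

```python
from itertools import combinations


def sharing_summary(groups):
    """Summarize haplotype sharing among groups such as A2x, A4x, L2x and L4x.

    groups: {group name: set of haplotype names observed in that group}
    Returns (pairwise shared counts, private counts, count shared by all groups).
    """
    names = list(groups)
    shared = {(a, b): len(groups[a] & groups[b])
              for a, b in combinations(names, 2)}
    private = {}
    for name in names:
        elsewhere = set().union(*(groups[other] for other in names if other != name))
        private[name] = len(groups[name] - elsewhere)
    shared_by_all = len(set.intersection(*groups.values()))
    return shared, private, shared_by_all
```

A haplotype counts as private here only if it is absent from every other group, which matches the usual meaning of the term but is an assumption about how the published counts were derived.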

Although it is difficult to separate increased trans-specific polymorphism from this introgression, we found some evidence that there might be more differences in selection pressure or demographic history between species than between ploidy levels. Plotting allele frequency distributions for each ploidy and species combination demonstrated an excess of intermediate frequency alleles in both diploid and tetraploid A. arenosa (**Figure 2**), as expected for a locus under balancing selection (Mable and Adam, 2007). However, the pattern was more skewed toward low frequency alleles in A. lyrata, particularly in tetraploids. In North American populations of A. lyrata, a difference in allele frequency spectrum for SRK was found between inbreeding and outcrossing populations (Mable and Adam, 2007) but the latter showed patterns more similar to those observed for A. arenosa in this study. Since shifts toward intermediate frequencies are also expected for population bottlenecks (Luikart et al., 1998), it is possible that diploid A. arenosa in particular experienced a larger decline in population numbers since the past glaciation. What was striking in the current study was that tetraploids did not have a dramatically higher number of alleles or haplotypes within populations or alleles or haplotypes per individual than diploids, regardless of dominance class (**Table 2**). Furthermore, for neutral genes, there is a steep gradient of increasing genomic contribution of A4x found within introgressed A. lyrata along a transect in the hybrid zone (Schmickl, 2009; Schmickl et al., 2010; Jørgensen et al., 2011; Schmickl and Koch, 2011; Hohmann et al., 2014; Muir et al., 2015) but this is not reflected in the SRK distribution; i.e., SRK alleles are more mixed than would be predicted based on neutral patterns, as might be expected under balancing selection.
This suggests that tetraploids are not fundamentally different from diploids in their capacity for maintaining diversity of SRK, as suggested previously from segregation analyses within tetraploid families based on crosses involving one of the same tetraploid populations studied here (L4\_AUT2) and a tetraploid population from Aggsbach, Austria (Mable et al., 2004).

Consistent with theory (Billiard et al., 2007), recessive alleles in diploids have been demonstrated to occur at higher frequency, to show shallower branch lengths in phylogenetic analyses, and more extensive polymorphism within specificities than dominant alleles (Llaurens et al., 2008, 2009; Castric et al., 2010; Vekemans et al., 2011; Goubet et al., 2012). In our study, Class B alleles (recessive to A2 and A3 classes) showed lower intraclass polymorphism (13% average pairwise sequence divergence, compared to 25% for Class A2 and 15% for Class A3) but more haplotypes per allele than the two dominant classes (2.56 ± 1.33 compared to 1.59 ± 0.75 in Class A2 and 1.56 ± 0.89 in Class A3, **Table 2**) and there was high divergence between classes (26–29%; **Table 3**). The paralogous locus identified in previous studies that is similar to class B alleles (Aly13-2) showed similar within locus variation (13%) as for class B alleles and lower divergence from class B than the other dominance classes (16% compared to at least 27% to the others). There was a higher proportion of alleles restricted to only one of the species among the dominant (29% for Class A2 and 50% for Class A3) than recessive (20% for Class B) alleles but a majority of the unlinked alleles (67%) were only found in A. arenosa (**Supplementary Table 2**). Thirteen alleles were found only in tetraploids, but none were Class B and only four (three Class A2 and one Class A3) were shared between the two species. Thus, results were consistent with the increased trans-specific polymorphism expected for recessive alleles at a locus under balancing selection (Billiard et al., 2007; Llaurens et al., 2008; Castric et al., 2010; Goubet et al., 2012).

Overall, these results suggest that tetraploids do not show increased mate availability due to an increase in S-locus repertoire but instead might be constrained by the potential mate limitation caused by having "too many" S-alleles. This is similar in theory to expectations for immune genes in animals, where an optimal number of alleles has been suggested as conferring higher fitness than maximizing allelic diversity (Reusch et al., 2001; Aeschlimann et al., 2003; Wegner et al., 2003; Kalbe et al., 2009). The high allele sharing among ploidy levels precluded testing of whether there is relaxed balancing selection acting in tetraploids but this was not suggested by the site frequency distributions, which indicated a stronger species than ploidy effect. Nevertheless, there are some important caveats to consider in the interpretation of these results, due to particular challenges when working with gene families of this type (see Challenges below).

In the crosses between tetraploid A. lyrata individuals, we found the same three SRK01 haplotypes using both 454 and targeted Sanger sequencing (haplotypes 1, 2, and 3). This allowed us to test the accuracy of the 454 genotyping despite the low read numbers for SRK01 and provide more complete data for segregation analyses. For 50% of the individuals identical genotypes were predicted using the two approaches, with 14% testing negative for the allele-specific PCR but positive using 454, compared to 10% showing the opposite pattern (**Table 4**). Different haplotypes were predicted by the two methods only for a single individual. However, the direct sequencing was more sensitive, resolving heterozygotes in 24% of the individuals that were predicted to be homozygous based on the 454 sequencing (compared to only 2% showing the opposite pattern). Segregation of SRK01 genotypes in the crosses confirmed previous predictions (Mable et al., 2004) that tetraploids could harbor multiple copies of haplotypes for this recessive specificity (**Table 5**). These data were then combined with segregation of the haplotypes resolved using 454 pyrosequencing (**Supplementary Table 5**). After excluding 454 alleles not present in the parents, the majority of individuals showed four or fewer expected haplotypes. Comparison of segregation of predicted genotypes with self-incompatibility phenotypes (**Figure 3**; **Supplementary Tables 6**, **7**), confirmed linkage of two alleles previously tested in other crosses (SRK16 and SRK29) and one that had been identified in the grandparents but had not been deposited to Genbank (SRK48). However, the segregation analyses suggested that not all alleles were detected by 454 and suggested that the stringent filtering in some cases omitted alleles that must have been present based on the incompatibility phenotypes.
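The comparison between the two genotyping methods amounts to sorting each individual into a concordance category and tabulating proportions. The following sketch shows one way this could be done; the category labels, function names and input layout are hypothetical, and treating an individual that is negative by both methods as "identical" is an assumption rather than something stated in the text.

```python
from collections import Counter


def compare_calls(sanger, ngs):
    """Classify agreement between Sanger and 454 haplotype calls for one individual.

    Each argument is a collection of haplotype names; empty means not detected.
    """
    sanger, ngs = set(sanger), set(ngs)
    if sanger == ngs:
        return "identical"
    if not sanger:
        return "454 only"
    if not ngs:
        return "Sanger only"
    if ngs < sanger:
        return "additional haplotype by Sanger"
    if sanger < ngs:
        return "additional haplotype by 454"
    return "different haplotypes"


def concordance(pairs):
    """Proportion of individuals in each category; pairs = [(sanger, ngs), ...]."""
    counts = Counter(compare_calls(sanger, ngs) for sanger, ngs in pairs)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

An individual called heterozygous by direct sequencing but homozygous by 454 falls into the "additional haplotype by Sanger" category in this scheme.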

#### Challenge: Filtering Decisions for Clustering

Despite recommendations from our pilot study that a threshold of 90% similarity would be appropriate for clustering (Jørgensen et al., 2012), our analyses of the full dataset suggested that a single threshold may not be appropriate for gene families that include different levels of divergence among classes or copies; for example, in relation to dominance (Prigoda et al., 2005). In our study, BLAST analysis of "read-only" contigs demonstrated that some known alleles were fragmented across multiple contigs. For recessive alleles (Class B, SRK01) and unlinked loci (Aly9, Aly13-2 and Aly13-7), combining contigs resulted in mixtures of haplotypes from different alleles (specificities), making it challenging to assign sequence variants to alleles. While several dominant alleles (Aly16, Aly30, and Aly42) also showed fragmentation, there was no ambiguity in assigning sequence variants to alleles. Resolving recessive alleles into unique contigs thus required more manual manipulation and sorting of variants into haplotypes. Since recessive alleles also had on average more haplotypes per allele (2.44 ± 1.42) than dominant alleles (1.57 ± 0.74 for A2; 1.56 ± 0.89 for A3) (**Supplementary Table 2**), read numbers per haplotype were often lower, which made setting a single threshold for reducing spurious genotyping difficult.

TABLE 2 | Distribution of alleles (A) and haplotypes (B) across diploid (A2x, L2x) and tetraploid (A4x, L4x) populations (POP) for different predicted dominance classes (A2 and A3 are dominant to B), excluding Class A1, which is represented only by SRK01; read numbers were too low to be certain about presence or absence for that allele.

Alleles that did not appear to fall under any of the known dominance classes are not included, as they were predicted to be unlinked to the SI phenotype. Also shown is the number of alleles or haplotypes per individual.

TABLE 3 | Percent sequence divergence within and between dominance classes, for alleles identified using 454 sequencing.

Divergence within classes is shown on the diagonal. An unlinked locus that shows polymorphism among haplotypes (Aly13-2) is included for comparison.

TABLE 4 | Proportion of individuals that tested positive for SRK01 specificity using direct Sanger and 454 sequencing, indicating the population (A2x = diploid A. arenosa; A4x = tetraploid A. arenosa; L2x = diploid A. lyrata; L4x = tetraploid A. lyrata), sample sizes (N-direct, N-454) and % of individuals that tested positive for SRK01 in each.

#### Challenge: Amplicon Based Errors and Biases

From previous studies we anticipated that the single most recessive allele, SRK01, would be present at high frequency and would show a higher number of haplotypes than other specificities (Billiard et al., 2007; Castric and Vekemans, 2007; Llaurens et al., 2008; Castric et al., 2010; Goubet et al., 2012; Vekemans et al., 2014). In our 454 data, SRK01 was present in all populations surveyed and we identified 15 unique variants that were present in more than one individual; however, read numbers tended to be very low (often with <10 reads per individual) and fell well below the thresholds set for considering "real" presence of a given haplotype used for other loci for most individuals. Although multiple haplotypes differing by a single or few bp are expected for recessive alleles (Castric and Vekemans, 2007), the low read numbers made it difficult to distinguish PCR errors from actual polymorphism. High read numbers were found for some individuals, but they tended to show the presence of few other sequence variants. In addition, several known paralogs that should be present in all individuals (Aly8, Aly9; Charlesworth et al., 2003b) were expected to amplify with the primer set used but this was very inconsistent. Aly9 was present in the majority of individuals but read numbers varied dramatically from 0.5 to 92% of the total reads in an individual. There was a significant difference in the proportion of reads that were Aly9, with A. arenosa tetraploids showing a higher proportion than both diploids, which showed a significantly higher proportion than A. lyrata tetraploids (**Supplementary Table 4**). Whether this is due to an amplification bias or expansion of the gene family is difficult to distinguish. For Aly8, only 41/460 sequenced individuals showed any amplification and most were present at only low read numbers (maximum 15%). 
We thus could not assess presence or absence of other members of the gene family based on the 454 sequencing or use the paralogs to make inferences about introgression in the tetraploids to avoid the confounding effects of balancing selection. Even after correcting for chimeras, there was some evidence for recombination in some of the specificities showing polymorphism among populations (e.g., SRK01, some of the class B alleles) but this was difficult to distinguish from PCR recombinants, particularly with only 200 bp of sequence.

TABLE 5 | Segregation of SRK01 genotypes within families raised from crosses between tetraploid A. lyrata individuals, as determined by direct Sanger sequencing; the number of individuals where a particular genotype was found is indicated in parentheses.

Complete segregation of haplotypes found using 454 sequencing, combined with this genotyping, is detailed in Supplementary Table 5. \*homozygous for SRK01 in 454 sequencing.

#### Challenge: Assessing the Accuracy of Genotyping

Although arguably more problematic for 454 pyrosequencing than for more recently developed approaches due to tag switching of barcodes, which we previously found could occur for up to 7% of samples (Jørgensen et al., 2012) and has been reported in other studies (Carlsen et al., 2012), the biggest challenge was deciding on thresholds and criteria for assessing accuracy of genotyping and efficiency of filtering strategies. The 200 bp sequences resolved were useful for assessing haplotype diversity within alleles, identifying putatively new alleles, predicting dominance based on phylogenetic clustering, and the distribution of allele and haplotype frequencies among populations. The results also generally fit with theoretical predictions. However, there was less certainty for determining individual genotypes; the crosses, for example, included more alleles than should have been present in some individuals, including alleles that were not identified in the parents (**Supplementary Table 5**). The haplotype frequencies indicated in **Figure 1** and **Supplementary Table 2** are thus based on a conservative threshold of at least 20 reads per individual but this likely underestimates patterns of haplotype sharing across populations and species. Nevertheless, an advantage of studying gene family evolution in SI genes over comparable systems like the MHC in vertebrates is that linkage of each new variant could be tested by segregation analyses to a known phenotype (Schierup et al., 2001; Mable et al., 2003, 2004; Prigoda et al., 2005). In our study, the low amplification of SRK01, which we otherwise knew from Sanger sequencing based genotyping of the parents should have multiple variants within families, precluded confidence in segregation analyses based only on the 454 data. However, targeted Sanger sequencing for this allele aided in interpretation of the segregation analyses. 
Unfortunately, as we performed crosses before the 454 sequencing, we could not test linkage of all new variants found to the SI phenotype. It was also not feasible to determine when unlinked alleles were amplified based on the presence of "too many" haplotypes.
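The read-count threshold discussed above can be illustrated with a minimal sketch. The function names, haplotype labels and read counts below are hypothetical and chosen only for illustration; the 20-read default follows the conservative threshold mentioned in the text, and the ploidy check is a simple consistency flag rather than a method used in the study.

```python
from typing import Dict, List


def call_haplotypes(read_counts: Dict[str, int], min_reads: int = 20) -> List[str]:
    """Return haplotypes supported by at least `min_reads` reads, sorted by name."""
    return sorted(h for h, n in read_counts.items() if n >= min_reads)


def exceeds_ploidy(haplotypes: List[str], ploidy: int) -> bool:
    """True when more haplotypes were called than a single locus could carry."""
    return len(haplotypes) > ploidy


# Hypothetical read counts for one tetraploid individual (illustrative only).
example = {"SRK01_hap1": 8, "SRK15_hap1": 240, "SRK42_hap2": 35, "Aly13-2_hap1": 19}
called = call_haplotypes(example)
print(called, exceeds_ploidy(called, ploidy=4))
```

In this toy example the low-coverage SRK01 variant is discarded by the threshold, which mirrors the concern that a single conservative cut-off can underestimate the presence of poorly amplified alleles.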

#### Objective 3: Introgression of SRK Alleles

For the population survey, the 454 genotyping identified 22 SRK01 variants. Using targeted amplifications and Sanger sequencing we identified 24 haplotypes. All but seven of these had been found using the 454 pyrosequencing, including five that matched the 454 sequences but had additional polymorphisms outside of the shared sequence region (indicated by distinct letters after the haplotype name; **Supplementary Tables 8**, **9**). However, only 11 of the 22 variants found by 454 sequencing were confirmed by direct sequencing and there was a higher proportion of PCR positive results among the 454 than the Sanger sequences (**Table 4**).

Using the diploids as a guide, we identified "arenosa" and "lyrata" specific haplotypes, as well as three that appeared to be recombinants between species-specific variants (haps 7, 8, and 10; **Supplementary Data Sheet 3**), two of which were identified from a single A4x population that was predicted to be introgressed (A4X\_AUT1, from Kerhnoff; Schmickl, 2009). Although analyses using GARD in the HYPHY package did not find statistical evidence for recombination breakpoints, this might have been because of the short tracts of introgression. The minimum spanning network indicated that haps 7 and 8 did in fact fall between species-specific clusters whereas hap10 was on a tip in the A. arenosa part of the network (**Figure 4**). What is striking is that reticulation in the network involved primarily A. arenosa tetraploids and that diploids had a lower diversity of SRK01 haplotypes compared to tetraploids. There was also some haplotype sharing among tetraploids but not between the diploids. Since the crosses established that individual tetraploids could harbor up to three different SRK01 haplotypes and many were heterozygous for two, this higher diversity among tetraploids could be because SRK01 is effectively neutral and so could accumulate more mutations in tetraploids because of the higher copy number maintained (Mable et al., 2004). Crossing data suggest that SRK is functional in individuals sampled from the hybrid zone (Ruiz-Duarte, 2012), but it is also possible that selection pressure to maintain restricted recombination in the S-locus region (Charlesworth et al., 2006) would be relaxed with the increased copy number in tetraploids. Moreover, introgression of recessive alleles between A. lyrata and A. halleri has been found in diploids (Castric et al., 2010), suggesting that hybridization might disrupt linkage. Although the crosses we performed only included tetraploid A. 
lyrata from outside of the known hybrid zone, two individuals in one family were self-compatible (**Supplementary Table 6**). It is thus also possible that increased recombination at the S-locus occurs with spontaneous loss of SI in some individuals.

The presence of A. arenosa like haplotypes in two of the A. lyrata tetraploid populations and the most frequent A. lyrata haplotype (hap1) in most A. arenosa populations from the hybrid zone (**Table 4**, **Supplementary Table 9**) could suggest more recent and secondary hybridization while the introgressed haplotypes (i.e., those that appeared to be recombinants between the species-specific variants) could reflect older events. One A. arenosa-like haplotype (hap2) was found in an A. lyrata tetraploid population in the Northeastern Austrian Forealps (L4\_AUT4 from Lilienfeld), and in the crosses, which involved individuals from two peripheral A. lyrata populations (L4\_AUT2 from Mödling and L4\_AUT5 from Rauheneck Ruin, near Baden). This could suggest undetected hybridization within these "pure" populations, as also suggested by whole-genome data (Hohmann and Koch, 2017). While these results fit with expectations based on predicted patterns of hybridization in tetraploid populations from Austria (Schmickl, 2009; Schmickl et al., 2010; Schmickl and Koch, 2011; Muir et al., 2015), there are similar caveats about the use of PCR-based genotyping as raised for the 454 sequences, as described below.

#### Challenge: PCR Based Approaches to Genotyping

Overall, there was not much consensus between the SRK01 genotypes resolved using 454 and direct sequencing. While the crosses demonstrated that the latter was more sensitive in detecting heterozygotes when products were amplified, the population survey revealed a potential bias against amplifying variants found in A2x populations. A much lower proportion of individuals from these populations tested positive than from other populations, and many of the haplotypes found using 454, but not direct sequencing, were from A2x populations. This potential bias reduced the sample sizes that could be used to classify haplotypes showing species-specific presence. In the segregation analyses (**Table 5**), two individuals had all three SRK01 haplotypes segregating in the parents: one individual did not show presence of other alleles expected in the parents based on the 454 sequencing but showed some unexpected alleles; the other individual showed more than four expected haplotypes (**Supplementary Table 5**). Thus, we cannot rule out contamination. Moreover, interpretation of introgressed haplotypes could have been confounded by PCR-based recombination, although they were found only in a stabilized hybrid population. In addition, some haplotypes were only resolved from direct sequences of heterozygotes; in those cases cloning would be required to absolutely confirm the full range of haplotypes present. We had originally intended to also test the utility of other polymorphic members of the gene family (e.g., Aly9); however, since there were even more haplotypes predicted by the 454 sequencing but separated by fewer variants (data not shown), there would have been too much reliance on accurately identifying singletons.

#### Objective 4: Copy Number Variation in the SRK-Related Gene Family

Clustering of contigs resolved from the de novo assembly approach to genome mining of our database of SRK and its paralogs (i.e., all unique variants found using the 454 pyrosequencing, targeted sequencing of SRK01, cloning of longer products using degenerate primers, and additional sequences available in Genbank) was used to uncover receptor-like kinases from published diploid and tetraploid genome sequences (**Supplementary Table 10**). This resulted in identification of 1-2 predicted SRK alleles in the diploid and 1-4 in the tetraploid accessions for both species among the 13 short read sets screened (**Table 6**; **Supplementary Table 11**). In total 29/177 contigs were assigned as SRK, but 12 of these would have been mis-assigned based only on BLAST (**Supplementary Table 10**). Aly13-2-like sequences were pulled out in seven accessions, but would have been classified as SRK based only on BLAST (**Table 6**; **Supplementary Table 10**). This locus is not present in all individuals, so copy number variation is expected (Mable et al., 2017). Other alleles predicted to be unlinked to the SI phenotype were also resolved by the clustering analysis but none of these would have been assigned as SRK-like based on BLAST (**Table 6**; **Supplementary Table 10**). One published allele (AlySRK32) whose phylogenetic position and dominance have not been resolved in previous studies was pulled out from six accessions; based on the length of its branch to other SRK sequences, it has been predicted to be unlinked to the SI phenotype (Tedder et al., 2011). AlySRK47 (found in four of the accessions) is also predicted to be unlinked, based on its phylogenetic position relative to linked sequences. The other four paralogs tested were present in all accessions, except for one diploid A. lyrata that lacked ARK3 (Aly8). Since this latter locus is tightly linked to SRK in some specificities and shows high polymorphism (Kusaba et al., 2001; Charlesworth et al., 2003b; Guo et al., 2011; Vekemans et al., 2014), this could be due to divergence from the reference sequence. AL2G2623090 included sequences similar to both Aly10.1 (ARK1 in A. thaliana) and Aly10.2 (ARK2 in A. thaliana), which were detected in all individuals. Aly10.2 is a suspected pseudogene in A. lyrata due to a large deletion and does not amplify in all individuals (Charlesworth et al., 2003b), whereas Aly10.1 is predicted to be functional and amplifies in more individuals. Clustering suggested that only four individuals had both genes but not all contigs could be resolved due to missing parts of the sequence. AL6G484380 (Aly3) was found in all individuals. Fourteen of the contigs clustered into two distinct clades that did not show similarity to any known paralogs (contig-only clusters; **Supplementary Table 10**). One was found in 10/13 accessions while the other was only found in four; these could represent previously uncharacterized members of the S-receptor kinase gene family.

appears intermediate and so could be another introgressed haplotype but it was only found in a single individual. Note that extensive reticulation was found predominantly among sequences found in A4x populations and that there is less variation among haplotypes restricted to diploids than those found in tetraploids.

For the SRK sequences, five of the putatively new alleles found by 454 pyrosequencing were pulled out, all of which also were detected by cloning and sequencing using degenerate primers; multi-exon sequences were mined from the genomes for two of them (NEW2, NEW16; **Table 6**). Multi-exon sequences were also pulled out for AlySRK01, AlySRK15, Aly13-2, and Aly13-7. While the genome mining approach seems promising, the presence of homozygotes for SRK for three individuals suggests that not all SRK alleles were identified within individual genomes: one L2x and one A2x individual had a single dominant allele each (AlySRK15, in dominance Class A2 and AlySRK13 in dominance Class A3, respectively). One L4x individual was homozygous for AlySRK01, which is plausible, as homozygotes for this recessive allele have been found in previous segregation-based analyses of tetraploid A. lyrata from Austria (Mable et al., 2004).

Contigs in red would have been mis-assigned based only on BLAST; those in blue were not resolved by BLAST. Cloned sequences were also obtained from contigs indicated in bold; asterisks indicate sequences where multi-exon sequences were pulled out using the genome mining. Alleles showing high similarity to SRK but not predicted to be linked to the SI phenotype are also indicated (Unlinked). The total number of linked and unlinked alleles resolved per accession is also indicated. <sup>a</sup>Also has AlySRK10 and AlySRK28.

#### Challenge: Extracting Full-Length Sequences of Polymorphic Genes From Short Read Data

While the genome mining holds promise for investigating copy number variation and obtaining full-length sequences from new alleles, the approach that worked best required a detailed reference database of alleles in order to accurately assign sequences to loci. BLAST analyses alone resulted in mis-assignment of SRK alleles to other paralogs and other paralogs were sometimes assigned as SRK alleles. While part of this was because not all sequences were available in Genbank for BLAST analysis, the gene conversion with unlinked loci that makes similarity alone unreliable (Prigoda et al., 2005) remained problematic in these analyses. For example, Aly13-2/13-7 sequences (which are not linked but are highly similar to Class B SRK alleles) were assigned as SRK in the initial analyses using only the five genes extracted from the MN47 genome. Manual alignments and phylogenetic clustering were required to determine allelic identities and to assign sequences to paralogous loci. However, there were clues in the BLAST analyses that suggested mis-assignment of SRK-like alleles; a signature of high E-value and low score in all cases predicted clustering to SRK-like sequences (although including unlinked loci such as Aly13-2). Nevertheless, the presence of only a single dominant allele in some accessions suggested that the genome mining did not pull out all SRK sequences that should have been present (since homozygotes should only be possible for recessive alleles).

While we had hoped also to be able to use this approach to map the potentially new alleles found using 454 sequencing to genomic regions to predict linkage to the S-locus, the failure of the mapping approach meant that this was not possible. However, when amplifying longer sequences using degenerate primers, we were able to obtain full-length sequences for some of the potentially new specificities predicted from the 454 analyses that we could use to BLAST the de novo assemblies (**Table 6**).
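The BLAST signature described above (a high E-value combined with a low score) could be screened for automatically before manual alignment. The sketch below assumes BLAST tabular output with the default 12 columns of `-outfmt 6`; the cut-off values, contig names and example lines are hypothetical and would need calibrating against a curated reference database such as the one used here.

```python
from typing import Dict, List, NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def parse_outfmt6(lines: List[str]) -> List[BlastHit]:
    """Parse BLAST tabular output (default 12 columns of -outfmt 6)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 11 and 12 hold the E-value and the bit score.
        hits.append(BlastHit(fields[0], fields[1], float(fields[10]), float(fields[11])))
    return hits


def flag_weak_hits(hits: List[BlastHit], max_evalue: float = 1e-20,
                   min_bitscore: float = 200.0) -> List[str]:
    """Return queries whose best hit (highest bit score) looks unreliable."""
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return sorted(q for q, h in best.items()
                  if h.evalue > max_evalue or h.bitscore < min_bitscore)


# Hypothetical example lines (illustrative values only).
example_lines = [
    "contig1\tAlySRK01\t99.1\t600\t5\t0\t1\t600\t1\t600\t0.0\t1090",
    "contig2\tARK3\t78.2\t310\t60\t4\t1\t310\t20\t330\t3e-12\t71.6",
]
print(flag_weak_hits(parse_outfmt6(example_lines)))
```

Flagged contigs are only candidates for closer inspection; as noted in the text, phylogenetic clustering against known alleles and paralogs was still needed to assign them.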

#### CONCLUSIONS AND RECOMMENDATIONS

The results presented here suggest that the highly polymorphic SRK alleles could be useful for interpreting evolutionary patterns of gene flow among populations, species and ploidy levels. We have demonstrated that tetraploids show no apparent advantage in terms of allelic or haplotypic repertoire due to more relaxed selection than diploids but that there is increased evidence for introgression (at least based on the most recessive SRK allele) among tetraploids from suspected hybrid populations. We also demonstrated that following up high throughput genotyping with targeted PCR can help to increase accuracy and completeness. We also identified new alleles not previously characterized and predicted dominance based on phylogenetic clustering.

Nevertheless, there are some important caveats from the analyses, which highlight considerations for future studies based on more robust approaches to high throughput genotyping. We make the following recommendations for future investigations of gene family evolution, in diploids as well as polyploids: (1) applying a hierarchical strategy to filtering decisions for cluster analyses could improve assignment of sequence variants to allelic variants, similar to suggestions for hierarchical AMOVA or STRUCTURE analyses (Holsinger and Mason-Gamer, 1996; Herdegen et al., 2014); (2) amplicon-based approaches for genotyping using deep sequencing should be avoided if there are other options available, as differential amplification and the difficulty of distinguishing PCR errors from real biological processes are difficult to overcome by any current sequencing technology; (3) due to the difficulty of assigning variants to gene copies, interpretation of gene family evolution should always be accompanied by co-segregation of sequence variants with the phenotype, whenever possible; (4) genome mining of resequenced genomes has the potential to investigate copy number variation and obtain full-length sequences that would be useful for population genetics analyses and tests for selection but lack of assembly of highly polymorphic genes to references means that this might only be practical for genes where there is already extensive knowledge about the components of the gene family.

While our results have demonstrated some useful insights into the dynamics of a complex gene family in polyploids and hybrids, we suggest that non-PCR-based sequence capture approaches hold the most promise for assessing patterns of selection on genes under balancing selection, where trans-specific polymorphism, reduced differentiation among alleles, and intermediate frequency alleles are predicted. Such approaches, for example, have been successfully applied to investigating R-gene variation in crop plants (Jupe et al., 2012, 2013; Andolfo et al., 2014; Giolai et al., 2016; Russell et al., 2016; Van Weymers et al., 2016). Whole genome resequencing approaches could be useful for setting the genomic context and fate of duplications, but there are still substantial challenges to resolve in distinguishing loss of copies from lack of coverage or lack of assembly to the reference due to high sequence divergence. A hierarchical approach to filtering or assembly to multiple references (e.g., multiple individuals or multiple alleles or gene family members) could help to overcome such difficulties but resolving fine-scale variation among variants from errors (e.g., haplotypes within specificities) and resolving complete heterozygous genotypes (particularly in polyploids) will require some creative bioinformatic solutions.

### DATA AVAILABILITY STATEMENT

The 200 bp fragments generated by 454 sequencing are too short for submission to Genbank but a full alignment of the sequences identified has been provided as **Supplementary Data Sheet 1** (including only the 454 sequences and references), 2 (including all unique alleles found across analyses), and 3 (SRK01 sequences). All new Sanger sequences have been deposited to Genbank (Accession numbers: MH507371-MH507400). Accession numbers and details for all unique sequences identified, along with those for reference sequences already available in Genbank are provided in **Supplementary Table 12**.

#### AUTHOR CONTRIBUTIONS

BM wrote the paper and performed the bulk of the sequence analyses described. MJ, AC, and PR-D generated sequence data for objectives 1, 2, and 3, respectively. MJ also performed the crosses for segregation analyses. KL and CK developed bespoke bioinformatics pipelines for objectives 1 and 4, respectively. AB contributed conceptually and financially to objectives 1 and 2. MK contributed conceptually to all aspects of the project and provided samples and advice on sampling locations and interpretation of the hybrid zones. AB, MK, and CK contributed to writing the manuscript.

#### FUNDING

This project was funded by a NERC Advanced Research Fellowship (NE/B50094X/1) to BM, and support from the Centre for Ecology and Evolutionary Synthesis (RCN/179569) to AB. Further support for genome resequencing was granted through a DFG grant (DFG SPP 1529; KO2302-14) to MK.

#### ACKNOWLEDGMENTS

We thank Aileen Adam and Elizabeth Kilbride for technical assistance.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2018.00114/full#supplementary-material


#### REFERENCES

of Arabidopsis lyrata (Brassicaceae) with sporophytic control of self-incompatibility. Heredity 90, 422–431. doi: 10.1038/sj.hdy.6800261

the Illumina MiSeq platform. Nucleic Acids Res. 43:1341. doi: 10.1093/nar/gku1341


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Mable, Brysting, Jørgensen, Carbonell, Kiefer, Ruiz-Duarte, Lagesen and Koch. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Analysis of Molecular Variance (AMOVA) for Autopolyploids

Patrick G. Meirmans <sup>1</sup> \* and Shenglin Liu<sup>2</sup>

*<sup>1</sup> Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, Netherlands, <sup>2</sup> Department of Bioscience, Aarhus University, Aarhus, Denmark*

Autopolyploids present several challenges to researchers studying population genetics, since almost all population genetics theory, and the expectations derived from this theory, has been developed for haploids and diploids. Also many statistical tools for the analysis of genetic data, such as AMOVA and genome scans, are available only for haploids and diploids. In this paper, we show how the Analysis of Molecular Variance (AMOVA) framework can be extended to include autopolyploid data, which will allow calculating several genetic summary statistics for estimating the strength of genetic differentiation among autopolyploid populations (*F*ST, ϕST, or *R*ST). We show how this can be done by adjusting the equations for calculating the Sums of Squares, degrees of freedom and covariance components. The method can be applied to a dataset containing a single ploidy level, but also to datasets with a mixture of ploidy levels. In addition, we show how AMOVA can be used to estimate the summary statistic ρ, which was developed especially for polyploid data, but unfortunately has seen very little use. The ρ-statistic can be calculated in an AMOVA by first calculating a matrix of squared Euclidean distances for all pairs of individuals, based on the within-individual allele frequencies. The ρ-statistic is well suited for polyploid data since its expected value is independent of the ploidy level, the rate of double reduction, the frequency of polysomic inheritance, and the mating system. We tested the method using data simulated under a hierarchical island model: the results of the analyses of the simulated data closely matched the values derived from theoretical expectations. The problem of missing dosage information cannot be taken into account directly in the analysis, but can be remedied effectively by imputation of the allele frequencies. We hope that the development of AMOVA for autopolyploids will help to narrow the gap in availability of statistical tools for diploids and polyploids. We also hope that this research will increase the adoption of the ploidy-independent ρ-statistic, which has many qualities that make it better suited for comparisons among species than the standard *F*ST, both for diploids and for polyploids.

Keywords: genetic differentiation, population structure, FST, double reduction, polysomic inheritance, polyploidy, AMOVA

#### Edited by:

*Richard John Abbott, University of St. Andrews, United Kingdom*

#### Reviewed by:

*Lindsay V. Clark, University of Illinois at Urbana-Champaign, United States Peter E. Smouse, Rutgers University, The State University of New Jersey, United States*

\*Correspondence: *Patrick G. Meirmans p.g.meirmans@uva.nl*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Ecology and Evolution*

Received: *01 February 2018* Accepted: *30 April 2018* Published: *23 May 2018*

#### Citation:

*Meirmans PG and Liu S (2018) Analysis of Molecular Variance (AMOVA) for Autopolyploids. Front. Ecol. Evol. 6:66. doi: 10.3389/fevo.2018.00066*

# INTRODUCTION

Autopolyploidy is an important, but often overlooked, aspect of the evolution of all major groups of Eukaryotes (plants, animals, and fungi) and may constitute an underappreciated source of biodiversity (Hardy, 2015). There are many species in which multiple ploidy levels (cytotypes) exist and often each cytotype itself conforms to the requirements of several widely used species concepts (Soltis et al., 2007). Autopolyploidy has many effects on the mechanisms of evolution, not only because of the increase in genomic content and the flexibility for developing new traits (Larkin et al., 2016), but also because, compared to diploidy, it generates different dynamics of allele frequencies that interact with various demographic processes, influencing adaptation and speciation (Parisod et al., 2010). In a species with different ploidy levels, the different cytotypes often show intricate geographical patterns in their distribution, which may be the result of historical, demographic, ecological, or genetic processes (Glennon et al., 2014; Kolár et al., 2017). The analysis of population genetic structure of autopolyploids may therefore reveal a lot about these processes. However, polyploids also present several challenges to the researchers studying their population genetics (Dufresne et al., 2014). This is because population genetic theory, the expectations derived from this theory, and the statistical tools for data analysis were developed mostly for haploids and diploids and require translation for polyploids (Meirmans et al., 2018).

Several of the basic genetic processes work differently in autopolyploids than in diploids (Meirmans et al., 2018). The higher number of chromosomes means that for each gene a higher total number of copies is present in a population. This increases the number of mutation events per population, and also increases the impact of migration as each migrant individual carries more chromosome copies to its new population. Conversely, the higher total number of chromosome copies is akin to a higher effective population size and therefore reduces the force of genetic drift, compared to a diploid population with the same number of individuals. Mendelian segregation also works differently in autopolyploids, since it is not necessarily completely random, as is almost always the case in diploids. Instead, there may be disomic inheritance, polysomic inheritance, or a combination of the two, where the rate of polysomy varies across the genome (Stift et al., 2008; Meirmans and van Tienderen, 2013). In addition, autopolyploids may show double reduction, a process where two copies of the same chromatid segment end up in the same gamete (Bever and Felber, 1992; Hardy, 2015). For example, in an autotetraploid with genotype ABCD this may lead to the production of homozygous (AA, BB, CC, and DD) gametes, in addition to the expected heterozygous gametes (e.g., AB, AD). A more practical problem in the genetic analysis of polyploids is that it is often difficult to estimate the dosage of the different alleles in a genotype (Dufresne et al., 2014). For example, it may be impossible to distinguish between the triploid genotypes AAB and ABB since they both share the marker phenotype AB. Missing dosage information may introduce a bias in the subsequent analysis, though depending on the type of analysis this bias may be corrected for quite effectively when random mating in populations can be assumed (De Silva et al., 2005; Meirmans et al., 2018). However, when the assumption of Hardy-Weinberg equilibrium cannot be made for a species, accounting for the missing dosage information becomes more problematic, though in some cases it is possible to adjust the calculations specifically to take the missing dosage into account (Hardy, 2015; Field et al., 2017).
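The two examples in the paragraph above (gametes under double reduction, and genotypes hidden behind a marker phenotype) can be enumerated with a short sketch. The function names are hypothetical, genotypes are written as strings of single-letter alleles, and gamete frequencies are deliberately ignored; only the set of possible outcomes is listed.

```python
from itertools import combinations, combinations_with_replacement
from typing import List


def tetraploid_gametes(genotype: str, double_reduction: bool = False) -> List[str]:
    """Distinct diploid gametes from a four-copy genotype such as 'ABCD'."""
    gametes = {"".join(sorted(pair)) for pair in combinations(genotype, 2)}
    if double_reduction:
        # Double reduction can place two copies of the same allele in one gamete.
        gametes |= {allele * 2 for allele in genotype}
    return sorted(gametes)


def genotypes_for_phenotype(alleles: str, ploidy: int) -> List[str]:
    """All genotypes of the given ploidy showing exactly the observed alleles."""
    observed = set(alleles)
    return ["".join(g)
            for g in combinations_with_replacement(sorted(observed), ploidy)
            if set(g) == observed]


print(tetraploid_gametes("ABCD"))
print(tetraploid_gametes("ABCD", double_reduction=True))
print(genotypes_for_phenotype("AB", ploidy=3))
```

Without double reduction the genotype ABCD yields only the six heterozygous gametes; allowing it adds AA, BB, CC and DD. The last call lists AAB and ABB, the two triploid genotypes that share the marker phenotype AB.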

Estimating the strength of the genetic population structure is usually done using F-statistics that decompose the genetic variance into within-individual, within-population and among-population components (Wright, 1969). Autopolyploidy affects the way these statistics should be estimated (Meirmans et al., 2018), but also their expected values under a given model of population structure, when compared to the same model for diploids (Ronfort et al., 1998). For example, the expected value of FST, which quantifies the degree of population differentiation, depends on the balance among migration, mutation, and drift. In autopolyploids, the increased effects of mutation and migration, in combination with the reduced force of drift, cause the expected value of FST to be lower than the corresponding value for diploids (Meirmans et al., 2018). This difference in expectation complicates comparisons of the strength of population structure among species or sets of populations with different ploidy levels.
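The direction of this effect can be illustrated with the textbook infinite-island approximation, extended here by treating a deme of N individuals with ploidy k as kN gene copies. This is an assumption made for illustration, not an expression taken from this article: it ignores mutation, double reduction and inbreeding, and the function name and parameter values are hypothetical.

```python
def expected_fst(n_individuals: float, migration_rate: float, ploidy: int) -> float:
    """Approximate island-model F_ST as 1 / (1 + 2 * k * N * m)."""
    return 1.0 / (1.0 + 2.0 * ploidy * n_individuals * migration_rate)


# Same number of individuals and migration rate, different ploidy levels.
for k in (1, 2, 4):
    print(k, round(expected_fst(n_individuals=100, migration_rate=0.01, ploidy=k), 3))
```

Under these simplifying assumptions the expected value decreases as ploidy increases for the same N and m, which is why raw FST values are hard to compare across cytotypes.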

To enable a better estimation of the degree of population differentiation across ploidy levels, Ronfort et al. (1998) developed an alternative summary statistic, which they called ρ, for which the expected value is independent of the ploidy level. The ρ-statistic is comparable to FST in that it estimates the degree of population differentiation and (barring estimation error) ranges between 0 and 1. For haploid data, the value of ρ is exactly the same as the value of FST; for higher ploidy levels, the value of ρ is generally slightly higher than that of FST. The ploidy independence of ρ is achieved by disregarding the within-individual variation (illustrated by Equation 14 below). Another advantage of the ρ-statistic that makes it suitable for the analysis of polyploid data is that its value is independent of both the rate of double reduction (Ronfort et al., 1998) and the frequency of polysomic inheritance (Meirmans and van Tienderen, 2013). The ρ-statistic also has a major advantage that is applicable to diploid as well as polyploid data: its value is independent of the rate of self-fertilization or other forms of inbreeding. This means that under a given model of population structure, ρ will have the same value for a strict inbreeder as for an obligate outcrosser, whereas FST gives higher values for inbreeders than for outcrossers. This is especially useful in comparative studies, where a comparison of FST and ρ can be used to see whether differences in population structure are due to differences in mating system or due to differences in population connectivity. Unfortunately, ρ is not very widely used, possibly because there are only a few computer programs that allow estimation of ρ from genetic marker data. The only two such programs that we are aware of are SPAGEDI (Hardy and Vekemans, 2002) and GENODIVE (Meirmans and van Tienderen, 2004).

One of the most popular methods for estimating F-statistics is via Analysis of Molecular Variance (AMOVA) (Excoffier et al., 1992; Peakall et al., 1995; Michalakis and Excoffier, 1996). This popularity is probably due to the remarkable flexibility of the AMOVA framework: it can be used for the estimation of different types of F-statistics (FST, ϕST, RST) and can easily incorporate additional hierarchical levels of population structure (e.g., testing for differentiation among groups of populations). In addition, AMOVA can be used to detect population clustering in a genetic dataset (Dupanloup et al., 2002; Meirmans, 2012). However, AMOVA has been described only for haploid and diploid data and the link between the AMOVA framework and the ρ-statistic has not been explored theoretically.

In this paper, we outline how the AMOVA framework can be extended to include autopolyploid data. We start by discussing how the standard AMOVA, for calculating FST, ϕST, or RST, can be easily adapted for use with autopolyploids. We then show how the ploidy-independent ρ-statistic can be calculated in AMOVA by using a matrix of squared Euclidean distances between individuals, calculated from the within-individual allele frequencies. Finally, we show the application of the method by calculating both FST and ρ for simulated datasets and discuss how to deal with the polyploidy-specific complication of missing dosage information.

### THE AMOVA FRAMEWORK

#### General Approach

In AMOVA (Excoffier et al., 1992; Michalakis and Excoffier, 1996), F-statistics are calculated from a set of covariance components, corresponding to the different hierarchical levels assumed to be present in the population structure (following Cockerham, 1973; Weir and Cockerham, 1984). So under a simple model of population structure where individuals are distributed over a number of populations, we can decompose the total genetic variance ($\sigma\_T^2$) into among-populations ($\sigma\_a^2$), among-individuals within populations ($\sigma\_b^2$), and within-individuals ($\sigma\_c^2$) covariance components, such that $\sigma\_T^2 = \sigma\_a^2 + \sigma\_b^2 + \sigma\_c^2$. The F-statistics can then be calculated as simple ratios of those covariance components:

$$F\_{ST} = \frac{\sigma\_a^2}{\sigma\_T^2} \tag{1a}$$

$$F\_{\rm IS} = \frac{\sigma\_b^2}{\sigma\_b^2 + \sigma\_c^2} \tag{1b}$$

$$F\_{IT} = \frac{\sigma\_a^2 + \sigma\_b^2}{\sigma\_T^2} \tag{1c}$$
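Equations (1a–c) can be illustrated with a minimal Python sketch. The function and argument names are hypothetical and chosen only for this illustration; the covariance components are assumed to have been estimated already.

```python
def f_statistics(var_among_pops, var_among_inds, var_within_inds):
    """Return (F_ST, F_IS, F_IT) from the three covariance components.

    Arguments correspond to sigma_a^2, sigma_b^2, and sigma_c^2
    in Equations (1a-c); the names are illustrative only.
    """
    total = var_among_pops + var_among_inds + var_within_inds
    f_st = var_among_pops / total
    f_is = var_among_inds / (var_among_inds + var_within_inds)
    f_it = (var_among_pops + var_among_inds) / total
    return f_st, f_is, f_it
```

With these definitions the familiar relation (1 − FIT) = (1 − FIS)(1 − FST) holds by construction.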

When the populations can be clustered into multiple groups, an extra hierarchical level is added and the total genetic variance is decomposed into among-groups ($\sigma\_a^2$), among-populations within-groups ($\sigma\_b^2$), among-individuals within-populations ($\sigma\_c^2$), and within-individuals ($\sigma\_d^2$) covariance components, such that $\sigma\_T^2 = \sigma\_a^2 + \sigma\_b^2 + \sigma\_c^2 + \sigma\_d^2$. The corresponding F-statistics are then:

$$F\_{CT} = \frac{\sigma\_a^2}{\sigma\_T^2} \tag{2a}$$

$$F\_{\rm SC} = \frac{\sigma\_b^2}{\sigma\_b^2 + \sigma\_c^2 + \sigma\_d^2} \tag{2b}$$

$$F\_{\rm IS} = \frac{\sigma\_c^2}{\sigma\_c^2 + \sigma\_d^2} \tag{2c}$$

$$F\_{IT} = \frac{\sigma\_a^2 + \sigma\_b^2 + \sigma\_c^2}{\sigma\_T^2} \tag{2d}$$
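The hierarchical ratios of Equations (2a–d) follow the same pattern. The sketch below uses hypothetical names and assumes the four covariance components are already available.

```python
def hierarchical_f_statistics(var_groups, var_pops, var_inds, var_within):
    """Hierarchical F-statistics from sigma_a^2 ... sigma_d^2 (Equations 2a-d)."""
    total = var_groups + var_pops + var_inds + var_within
    return {
        "F_CT": var_groups / total,
        "F_SC": var_pops / (var_pops + var_inds + var_within),
        "F_IS": var_inds / (var_inds + var_within),
        "F_IT": (var_groups + var_pops + var_inds) / total,
    }
```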

This follows the Analysis of Variance framework that was developed earlier by Cockerham (1973) and Weir and Cockerham (1984). However, whereas Weir and Cockerham calculated these covariance components from a linear vector of allele frequencies, AMOVA calculates them using a matrix **D** of pairwise squared Euclidean distances. This is based on previous work by Li (1976) showing that conventional Sums of Squares can be calculated from a matrix of pairwise squared Euclidean distances. These Sums of Squares can then be used to calculate the Expected Mean Squares, which in turn can be used to calculate the covariance components (Weir and Cockerham, 1984).
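The identity underlying this step can be checked numerically: the conventional sum of squared deviations from the centroid equals the sum of all pairwise squared Euclidean distances (each pair counted once) divided by the number of observations. The sketch below is a generic illustration of that identity with hypothetical function names, not code from any AMOVA software.

```python
def sum_of_squares_direct(points):
    """Conventional sum of squared deviations from the centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dims)]
    return sum((p[k] - centroid[k]) ** 2 for p in points for k in range(dims))


def sum_of_squares_from_distances(points):
    """Same quantity, computed only from pairwise squared Euclidean distances."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            total += sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return total / n
```

Because only the distances enter the second function, the coordinates themselves are never needed, which is what allows AMOVA to swap in different distance metrics.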

The use of a distance metric is actually what gives AMOVA its remarkable flexibility, as the distance metric can be changed, depending on the type of data under analysis. A simple matching distance can be used for a single locus with allelic data, for example SNPs (Peakall et al., 1995; Michalakis and Excoffier, 1996). Multilocus values of the F-statistics can then be obtained by summing the covariance components over loci. A distance metric for haplotypic data was described in the original paper by Excoffier et al. (1992), based on the phenetic distance between the pair of haplotypes. This is also the most frequently used method for sequence data, though more complex distance metrics can be used as well, e.g., by incorporating a specific mutational model or by tracing distances along a connecting network or tree (Excoffier and Smouse, 1994). A distance metric for microsatellite loci can be calculated by taking the squared difference in repeat number between alleles (Michalakis and Excoffier, 1996).

The interpretation of the F-statistics returned by AMOVA depends strongly on the choice of distance metric used. This means that from the wide array of available estimators for FST, different estimators are obtained by different distance metrics. For allelic data, where the simple matching distance is used, the resulting F-statistics are mathematically equivalent to the estimators of Weir and Cockerham (1984). In contrast, for haplotypic/sequence data, the distances are indicative of the evolutionary relationships between haplotypes/sequences (Whitlock, 2011); to reflect this, the F-statistics are generally referred to with the Greek letter ϕ. Finally, when for microsatellites the difference in repeat number is used, the estimator corresponds to the RST-statistic (Slatkin, 1995).

#### Adaptation to Autopolyploids

For autopolyploids, AMOVA can be performed using the same methods as above for calculating the pairwise distances among alleles, yielding estimates of FST, ϕST, or RST. However, the higher ploidy means that the overall size of the complete distance matrix increases. So what is needed to adapt a standard diploid AMOVA to autopolyploid data is to account for this larger overall sample size in all the calculations, which is very straightforward when the data contain only a single ploidy level. For a total sample size of N diploids, the distance matrix is of size 2N × 2N, whereas for autopolyploids with ploidy level x, the matrix is of size xN × xN (for computational efficiency it is also possible to only use the lower or upper half of the matrix). The Sums of Squares are therefore calculated by summing over a larger number of pairwise distances, though this follows the same approach as outlined by Excoffier et al. (1992; their equations 8a–8c) by summing over groups, populations, and individuals as necessary.

The higher ploidy level results in a larger number of allele copies within individuals, populations, and groups. This larger number of allele copies needs to be reflected in the degrees of freedom used for calculating the Expected Mean Squares; however, this is only the case for the within-individual and total degrees of freedom, as the others are only determined by the higher-level sample sizes. In **Table 1** we give generic formulas for the degrees of freedom for any ploidy x > 1 for a model with a single group of populations, and compare this to the diploid case described in the original papers (Excoffier et al., 1992; Peakall et al., 1995; Michalakis and Excoffier, 1996); the notation follows the notation used in the documentation of the software Arlequin (Excoffier and Lischer, 2010, 2015). **Table 2** shows the same, but then for multiple groups of populations. The Expected Mean Squares can then be obtained by dividing the corresponding Sum of Squares by the degrees of freedom.

For calculating the covariance components from the expected mean squares it is necessary to incorporate the sample sizes at the different hierarchical levels included in the analysis. The simplest case is a single group of populations all of the same ploidy level x and all with the same sample size Np. In this case (**Table 1**), the multiplication factor n is defined as xNp, the number of allele copies sampled per population. However, when there is unbalanced sampling (**Table 1**), this multiplication factor has to take the sample sizes for all populations separately into account:

$$n = \frac{xN - \sum\_{p \in P} \frac{x N\_p^2}{N}}{P - 1} \tag{3}$$

Here, $N\_p$ is the number of individuals sampled in population p.
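As a small illustration of Equation (3), the following Python sketch computes the coefficient n from a list of per-population sample sizes. The function name and input layout are assumptions made for this example.

```python
def coefficient_n(pop_sizes, ploidy):
    """Multiplication coefficient n of Equation (3).

    pop_sizes: number of individuals sampled per population (N_p).
    ploidy:    the single ploidy level x shared by all individuals.
    """
    n_total = sum(pop_sizes)   # N
    n_pops = len(pop_sizes)    # P
    weighted = sum(ploidy * n_p ** 2 / n_total for n_p in pop_sizes)
    return (ploidy * n_total - weighted) / (n_pops - 1)
```

For balanced sampling the function returns x times the per-population sample size; for example, three populations of 10 tetraploids give n = 40, whereas two diploid populations of 5 and 15 individuals give n = 15 rather than the balanced value of 20.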

When there are multiple groups of populations (**Table 2**) there are three coefficients: $n$, $n'$, and $n''$. When sampling is balanced, so with the same number of individuals sampled for every population and the same number of populations sampled in each of the G groups, $n$ and $n'$ are defined as $xN\_p$. As above, this is simply the number of allele copies sampled per population. The value of $n''$ is then defined as $xN\_g$, the number of allele copies sampled per group of populations. However, when sample sizes within populations and/or groups are unbalanced (**Table 2**), the sample sizes have to be taken into account for the calculation, and the three coefficients are defined as:

$$n = \frac{xN - \sum\_{g \in G} \sum\_{p \in g} \frac{x N\_p^2}{N\_g}}{P - G} \tag{4a}$$

$$n' = \frac{\sum\_{g \in G} \frac{(N - N\_g)}{N\_g} \sum\_{p \in g} x N\_p^2}{N(G - 1)} \tag{4b}$$

$$n'' = \frac{xN - \frac{\sum\_{g \in G} x N\_g^2}{N}}{G - 1} \tag{4c}$$

TABLE 1 | Outline of the AMOVA framework for a single group of populations with the degrees of freedom (d.f.) both given for diploids and generalized for any ploidy level *x* (except haploid).


*P is the number of populations and N the number of individuals; the value of the multiplication coefficient n is calculated using Equation (3). This method can be used to obtain estimates of FST, ϕST, or RST.*

TABLE 2 | Outline of the AMOVA framework for multiple groups of populations with the degrees of freedom (d.f.) both given for diploids and generalized for any ploidy level *x* (except haploid).


*G is the number of groups, P the number of populations, and N the number of individuals; the values of the multiplication coefficients n, n′, and n″ are calculated using Equations (4a–c). This method can be used to obtain estimates of FST, ϕST, or RST.*

where $N\_g$ is the number of individuals sampled in group g. For haploid and diploid data (x = 1 and x = 2), these equations are the same as for the standard AMOVA (Michalakis and Excoffier, 1996; Excoffier and Lischer, 2015).
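Equations (4a–c) can be sketched in Python as follows. The nested-list input (groups containing per-population sample sizes) and the function name are assumptions made for this illustration only.

```python
def hierarchical_coefficients(groups, ploidy):
    """Coefficients n, n', n'' of Equations (4a-c).

    groups: list of groups; each group is a list with the number of
            individuals sampled per population (N_p).
    ploidy: the single ploidy level x shared by all individuals.
    """
    x = ploidy
    group_sizes = [sum(g) for g in groups]    # N_g
    n_total = sum(group_sizes)                # N
    n_groups = len(groups)                    # G
    n_pops = sum(len(g) for g in groups)      # P

    # Equation (4a)
    within = sum(x * n_p ** 2 / n_g
                 for g, n_g in zip(groups, group_sizes) for n_p in g)
    n = (x * n_total - within) / (n_pops - n_groups)

    # Equation (4b)
    numerator = sum((n_total - n_g) / n_g * sum(x * n_p ** 2 for n_p in g)
                    for g, n_g in zip(groups, group_sizes))
    n_prime = numerator / (n_total * (n_groups - 1))

    # Equation (4c)
    n_double = (x * n_total
                - sum(x * n_g ** 2 for n_g in group_sizes) / n_total) / (n_groups - 1)
    return n, n_prime, n_double
```

For balanced sampling the three values reduce to the allele copies per population (n and n′) and per group (n″); two groups of two populations with 10 tetraploids each give 40, 40, and 80.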

#### Mixed Ploidy Datasets

TABLE 3 | Outline of the AMOVA framework for a single group of populations with the degrees of freedom (d.f.) for a mixture of individuals with different ploidy levels, based on the total number of allele copies sampled (*C*).


*P is the number of populations and N the number of individuals; the values of the multiplication coefficients n, n′, and n″ are calculated using Equations (5a–d). This method can be used to obtain estimates of FST, ϕST, or RST.*

Slightly more complicated evolutionary scenarios involve multiple ploidy levels, either occurring in separate populations, or co-occurring in populations. In such a case, there is no single ploidy level x that can be used to calculate the degrees of freedom and the multiplication coefficients. However, when the ploidy level of every genotyped individual is known (e.g., through flow cytometry), this problem can be solved by using the number of allele copies sampled per population (C), rather than the number of individuals (N). **Table 3** shows the formulas for the degrees of freedom for any mixture of ploidy levels (though all should be at least diploid) for a model with a single group of populations. The corresponding coefficients $n$, $n'$, and $n''$ are defined as (again following the notation from Excoffier and Lischer, 2015):

$$S\_P = \sum\_{p \in P} \sum\_{i \in p} \frac{C\_i^2}{C\_p} \tag{5a}$$

$$n = \frac{C - S\_P}{N - P} \tag{5b}$$

$$n' = \frac{S\_P - \sum\_{i \in N} \frac{C\_i^2}{C}}{P - 1} \tag{5c}$$

$$n^{\prime\prime} = \frac{C - \sum\_{p \in P} \frac{C\_p^2}{C}}{P - 1} \tag{5d}$$

The F-statistics can then be calculated in the normal way, using Equations (1a–c). Note that when the significance of the population differentiation is tested by permuting individuals over populations, the number of allele copies in the permuted populations may differ from the original values. Therefore, the coefficients $n$, $n'$, and $n''$ will have to be recalculated for every permutation.
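A minimal Python sketch of Equations (5a–d) is given below. It assumes, for illustration, that each population is supplied as a list of per-individual ploidy levels, so that the ploidy of an individual equals its number of allele copies at a locus; the function name is hypothetical.

```python
def mixed_ploidy_coefficients(populations):
    """Coefficients n, n', n'' of Equations (5b-d) for mixed-ploidy data.

    populations: list of populations; each population is a list holding
                 the number of allele copies (ploidy) C_i of each individual.
    """
    pop_copies = [sum(pop) for pop in populations]   # C_p
    total_copies = sum(pop_copies)                   # C
    n_inds = sum(len(pop) for pop in populations)    # N
    n_pops = len(populations)                        # P

    # Equation (5a)
    s_p = sum(c_i ** 2 / c_p
              for pop, c_p in zip(populations, pop_copies) for c_i in pop)
    # Equation (5b)
    n = (total_copies - s_p) / (n_inds - n_pops)
    # Equation (5c)
    squares = sum(c_i ** 2 for pop in populations for c_i in pop)
    n_prime = (s_p - squares / total_copies) / (n_pops - 1)
    # Equation (5d)
    n_double = (total_copies
                - sum(c_p ** 2 for c_p in pop_copies) / total_copies) / (n_pops - 1)
    return n, n_prime, n_double
```

When all individuals share one ploidy level x, n and n′ both reduce to x and n″ reduces to the value of Equation (3). In a permutation test, a function of this kind would simply be called again on each permuted assignment of individuals to populations.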

#### Ploidy-Independent ρ-Statistic

In addition to the above-developed method that yields estimates of FST, ϕST, or RST, AMOVA can also be used to obtain estimates of the ploidy-independent ρ-statistic (Ronfort et al., 1998). Here we show that this can be done by performing AMOVA on a matrix of squared Euclidean distances calculated from the within-individual allele frequencies. Unlike the above methods of calculating distances, where each distance is calculated between a pair of alleles or haplotypes, here each squared Euclidean distance (denoted as $d\_{ij}^2$) is calculated between a pair of individual genotypes at a locus. The metric is calculated as

$$d\_{ij}^2 = \sum\_{a=1}^A \left(p\_{ia} - p\_{ja}\right)^2 \tag{6}$$

where $p\_{ia}$ is the frequency of the $a$th allele ($a \in \{1, 2, \ldots, A\}$) within individual $i$. In diploids, these frequencies can take the values 0, 0.5, and 1; in triploids the values 0, 1/3, 2/3, and 1; in tetraploids the values 0, 0.25, 0.5, 0.75, and 1; etc. For haploids, the only two possible values are 0 and 1 and therefore for haploids this metric is the same as the simple-matching distance; by extension this means that for haploid data the value of ρ equals that of FST.
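The distance of Equation (6) can be sketched in Python as follows. Genotypes are represented here as sequences of allele labels (for example the string "AABB" for a tetraploid); this representation and the function names are assumptions made for the illustration.

```python
from collections import Counter


def allele_frequencies(genotype):
    """Within-individual allele frequencies p_ia for one locus."""
    counts = Counter(genotype)
    ploidy = len(genotype)
    return {allele: count / ploidy for allele, count in counts.items()}


def squared_distance(genotype_i, genotype_j):
    """Squared Euclidean distance of Equation (6) between two individuals."""
    p_i = allele_frequencies(genotype_i)
    p_j = allele_frequencies(genotype_j)
    alleles = set(p_i) | set(p_j)
    # Alleles absent from an individual have frequency 0 in that individual
    return sum((p_i.get(a, 0.0) - p_j.get(a, 0.0)) ** 2 for a in alleles)
```

Because only frequencies are used, individuals of different ploidy can be compared directly: the diploid AB and the tetraploid AABB are at distance 0, whereas the tetraploids AABB and AAAB are at distance 0.125. Two haploids carrying different alleles are at distance 2 under this definition.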

This distance metric yields, for any ploidy level, only a single distance value per pair of individuals. As a result, the distance matrix is only of size N × N, whereas the approach above resulted in a matrix of xN × xN, for data of ploidy level x. The N × N matrix can then be used to perform AMOVA using the equations (not shown here) originally developed for haploid data in the paper by Excoffier et al. (1992). This approach also allows ρ to be calculated at different hierarchical levels, e.g., to compare differentiation among clusters of populations. For such use, we will adopt the convention of adding subscripts to indicate which levels are compared, though Ronfort et al. (1998) did not use any such subscripts in their original description of ρ. Note that since the within-individual component is disregarded, there are no ρ equivalents of FIS and FIT in such a hierarchical analysis.
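For a single group of populations, the haploid-style AMOVA on an N × N matrix can be sketched as below. This is a minimal illustration of the standard two-level computation (sums of squared deviations from pairwise distances, mean squares, then variance components), written with hypothetical names; it is not the code of GENODIVE or any other program, and it handles a single locus only.

```python
def rho_st_amova(dist, pops):
    """Two-level AMOVA on a matrix of squared distances between individuals.

    dist: N x N list of lists with squared distances (e.g., Equation 6).
    pops: list with the population label of each individual.
    Returns the ratio of the among-population variance component to the total.
    """
    n_total = len(pops)
    labels = sorted(set(pops))
    n_pops = len(labels)
    members = {lab: [i for i, p in enumerate(pops) if p == lab] for lab in labels}

    def ssd(indices):
        # Sum of squared deviations from all ordered pairs: sum(d^2) / (2n)
        return sum(dist[i][j] for i in indices for j in indices) / (2 * len(indices))

    ssd_total = ssd(range(n_total))
    ssd_within = sum(ssd(idx) for idx in members.values())
    ssd_among = ssd_total - ssd_within

    msd_among = ssd_among / (n_pops - 1)
    msd_within = ssd_within / (n_total - n_pops)
    n_coef = (n_total
              - sum(len(idx) ** 2 for idx in members.values()) / n_total) / (n_pops - 1)

    var_within = msd_within
    var_among = (msd_among - msd_within) / n_coef
    return var_among / (var_among + var_within)
```

Two populations fixed for different alleles give a value of 1, while estimates for undifferentiated populations can be negative because of the sampling correction.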

When the two individuals have the same ploidy level, the squared Euclidean distance metric proposed here is a simple linear transformation of the squared Euclidean distance metric of Smouse and Peakall (1999). Since a linear transformation of the distance matrix does not affect the relative sizes of the variance components, this means that the Smouse and Peakall distance can also be used for AMOVA. However, the metric from Smouse and Peakall has only been defined for cases where the two individuals have the same ploidy level, whereas the metric proposed above is also suited to mixtures of different ploidy levels.

The mathematical relationship between the squared Euclidean distance metric and ρ can be deduced as follows. Again, $p\_{ia}$ refers to the frequency of the $a$th allele ($a \in \{1, 2, \ldots, A\}$) in the $i$th individual ($i \in \{1, 2, \ldots, N\}$). The sum of the **D** matrix can then be transformed as:

$$\begin{split} \sum\_{i=1}^{N} \sum\_{j=1}^{N} d\_{ij}^{2} &= \sum\_{i=1}^{N} \sum\_{j=1}^{N} \sum\_{a=1}^{A} \left( p\_{ia} - p\_{ja} \right)^{2} \\ &= 2N^{2} \sum\_{a=1}^{A} \left( \frac{1}{N} \sum\_{i=1}^{N} p\_{ia}^{2} - \left( \frac{1}{N} \sum\_{i=1}^{N} p\_{ia} \right)^{2} \right) \end{split} \tag{7}$$

If we define

$$\check{H}\_O = \frac{1}{N} \sum\_{i=1}^{N} \left( 1 - \sum\_{a=1}^{A} p\_{ia}^2 \right) \tag{8}$$

and

$$\check{H}\_E \equiv 1 - \sum\_{a=1}^{A} \left( \frac{1}{N} \sum\_{i=1}^{N} p\_{ia} \right)^2 \tag{9}$$

then the sum of squared distances in Equation (7) can be simplified to:

$$\sum\_{i=1}^{N} \sum\_{j=1}^{N} d\_{ij}^{2} = 2N^{2} \left( \check{H}\_{E} - \check{H}\_{O} \right) \tag{10}$$
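The identity in Equation (10) can be verified numerically. In the sketch below each individual is represented by a tuple of within-individual allele frequencies; the function names are hypothetical and the example is only a check of the algebra.

```python
def h_check_observed(freqs):
    """Equation (8): mean over individuals of 1 - sum_a p_ia^2."""
    return sum(1 - sum(p ** 2 for p in ind) for ind in freqs) / len(freqs)


def h_check_expected(freqs):
    """Equation (9): 1 - sum_a (unweighted mean of p_ia over individuals)^2."""
    n = len(freqs)
    n_alleles = len(freqs[0])
    means = [sum(ind[a] for ind in freqs) / n for a in range(n_alleles)]
    return 1 - sum(m ** 2 for m in means)


def sum_squared_distances(freqs):
    """Left-hand side of Equation (10): sum of d_ij^2 over all ordered pairs."""
    return sum(sum((pi - pj) ** 2 for pi, pj in zip(ind_i, ind_j))
               for ind_i in freqs for ind_j in freqs)
```

For any set of individuals, `sum_squared_distances(freqs)` should equal `2 * N**2 * (h_check_expected(freqs) - h_check_observed(freqs))` up to floating-point error, including when the individuals differ in ploidy.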

$\check{H}\_E$ and $\check{H}\_O$ as defined here are analogous, but not equivalent, to the standard $H\_E$ and $H\_O$ as defined by Nei (1987) for diploids and by Moody et al. (1993) for polyploids (see also Meirmans et al., 2018):

$$H\_O = \frac{1}{N} \sum\_{i=1}^{N} \left( \left( 1 - \sum\_{a=1}^{A} p\_{ia}^2 \right) \cdot \frac{x\_i}{x\_i - 1} \right) \tag{11}$$

$$H\_E = 1 - \sum\_{a=1}^{A} \left( \frac{\sum\_{i=1}^{N} \left( x\_i \cdot p\_{ia} \right)}{\sum\_{i=1}^{N} x\_i} \right)^2 \tag{12}$$

While $H\_E$ and $H\_O$ attempt to correct the calculation of allele frequency or heterozygosity using individual ploidy information, $\check{H}\_E$ and $\check{H}\_O$ ignore such information, which gives ρ its ploidy-independent nature.

In the Island model where the number of populations is r (each population has a size of N), ρST can be calculated as

$$\rho\_{ST} = 1 - \frac{(r \cdot N)^2 \cdot \pi^2 \sum\_{k=1}^r \sum\_{i=1}^N \sum\_{j=1}^N d\_{kij}^2}{r \cdot N^2 \cdot \pi^2 \sum\_{i=1}^{r \cdot N} \sum\_{j=1}^{r \cdot N} d\_{ij}^2} \tag{13}$$

Using the link between the sum of squared distances and the $\check{H}$-statistics that was established in Equation (10), Equation (13) can be transformed into:

$$
\rho\_{\rm ST} = \frac{\check{H}\_T - \check{H}\_S}{\check{H}\_T - \check{H}\_O} \tag{14}
$$

The statistic $\check{H}\_T$ is defined in the same vein as $\check{H}\_E$ but then for all populations together; $\check{H}\_S$ is the average of $\check{H}\_E$ calculated over populations. If the populations contain only a single ploidy level, Equation (14) can be transformed into

$$\rho\_{ST} = \frac{H\_T - H\_S}{H\_T - H\_O \cdot \frac{x - 1}{x}} \tag{15}$$

which is the same as Equation (6) in Meirmans et al. (2018).
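The equivalence between the distance-based form and Equation (14) can be illustrated with a short Python sketch. It assumes equally sized populations, represents each individual by a tuple of within-individual allele frequencies, and computes the distance-based version as one minus the ratio of the mean within-population to the mean overall squared distance. Both functions are moment-based illustrations with hypothetical names, without the sampling corrections that a full AMOVA applies.

```python
def rho_st_from_h(pops):
    """Equation (14) from the H-check statistics; pops is a list of populations."""
    def h_o(inds):
        return sum(1 - sum(p * p for p in ind) for ind in inds) / len(inds)

    def h_e(inds):
        n_alleles = len(inds[0])
        means = [sum(ind[a] for ind in inds) / len(inds) for a in range(n_alleles)]
        return 1 - sum(m * m for m in means)

    everyone = [ind for pop in pops for ind in pop]
    h_t = h_e(everyone)                               # all populations together
    h_s = sum(h_e(pop) for pop in pops) / len(pops)   # average over populations
    return (h_t - h_s) / (h_t - h_o(everyone))


def rho_st_from_distances(pops):
    """One minus mean within-population d^2 divided by mean overall d^2."""
    def mean_d2(inds):
        total = sum(sum((a - b) ** 2 for a, b in zip(i, j))
                    for i in inds for j in inds)
        return total / len(inds) ** 2

    everyone = [ind for pop in pops for ind in pop]
    within = sum(mean_d2(pop) for pop in pops) / len(pops)
    return 1 - within / mean_d2(everyone)
```

Under the stated assumption of equal population sizes the two functions return the same value for any input.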

### APPLICATION TO DATA

#### Simulations Under a Hierarchical Island Model

To test how well the above-developed AMOVA framework performs for data with different ploidy levels, we simulated data under a standard hierarchical island model (Slatkin and Voelm, 1991; Vigouroux and Couvet, 2000). A set of 20 populations was simulated, divided into two archipelagoes, both having 10 populations. All populations had the same size of N = 100; mating within populations was completely random, including a probability of self-fertilization of 1/N. Genetic markers were simulated at 1,000 independently segregating loci; mutation followed a K-alleles model with 100 possible allelic states and a mutation rate of µ = 0.0001. Migration took place at different rates among populations from the same archipelago (m1) and among populations from different archipelagoes (m2).

The model was population-based, so individuals were not explicitly modeled but instead the populations were represented by a set of vectors containing the allele frequencies of all possible allelic states at all loci. Under the assumption of random mating, one generation of genetic drift can then easily be simulated by drawing random numbers from a multinomial distribution. For the expected values in the multinomial, we used the current population allele frequencies, after incorporating the expected effects of migration and mutation. For the number of draws in the multinomial, we used the number of chromosome copies in the population, so the population size multiplied by the ploidy level. The model was written in R, using the rmultinom() function for drawing random numbers; the R script used is available in online Supplement 1 (Data Sheet 1).
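The drift step described above can be sketched in Python as follows. This is not the R script from the supplement: it is an illustrative re-expression for one population and one locus, migration is omitted for brevity, and the particular parameterization of the K-alleles mutation step (each copy mutates with probability µ to one of the other K − 1 states with equal probability) is an assumption. Function names are hypothetical.

```python
import random
from collections import Counter


def mutate_k_alleles(freqs, mu):
    """Expected allele frequencies after K-alleles mutation (assumed form)."""
    k = len(freqs)
    return [p * (1 - mu) + (1 - p) * mu / (k - 1) for p in freqs]


def drift_one_generation(freqs, pop_size, ploidy, rng):
    """Multinomial sampling of pop_size * ploidy chromosome copies."""
    n_copies = pop_size * ploidy
    # Drawing n_copies alleles with the given weights is a multinomial draw
    draws = rng.choices(range(len(freqs)), weights=freqs, k=n_copies)
    counts = Counter(draws)
    return [counts.get(a, 0) / n_copies for a in range(len(freqs))]
```

A generation for a tetraploid population of 100 individuals would then be, for example, `drift_one_generation(mutate_k_alleles(freqs, 0.0001), 100, 4, random.Random(1))`; the only place where ploidy enters is the number of copies drawn.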

The model was run for diploids, tetraploids, and hexaploids, for values of m2 of 0.001, 0.0001, and 0.00001; per value of m2 a range of values of m1 was used with a maximum of 0.1 and a minimum equal to the value of m2 (so m1 ≥ m2). Per scenario, the model was run once for 20,000 generations; replication was provided by the use of the 1,000 independent loci. After the last generation, genotypes were constructed by randomly distributing the alleles over individuals and written to a file. The software GENODIVE v. 2b27 (Meirmans and van Tienderen, 2004) was used to perform a hierarchical AMOVA on the resulting genotypes. The results were compared to the theoretical expectations for FSC and FCT derived by Vigouroux and Couvet (2000). Though these expectations were only derived for diploids, general results for any ploidy level x can be obtained by substituting all occurrences of the term "4N" in the equations by the term "2xN" (see Meirmans et al., 2018). The expectations for ρ for any ploidy level are equivalent to the expectation for FST under haploidy (Ronfort et al., 1998; Meirmans et al., 2018), so can also be derived from the equations of Vigouroux and Couvet (2000) by substituting every "4N" by "2N."

#### Simulation Results

When applying the AMOVA framework to the simulated data for several ploidy levels, the results closely matched the theoretical expectations (**Figure 1**), indicating that AMOVA correctly estimates the variance components and the F-statistics. For all three values of m2, FSC showed a monotonic decrease with increasing values of m1 (**Figure 1**, top row), whereas FCT showed a monotonic increase (**Figure 1**, bottom row). As random mating within populations was assumed, the values of FIS were close to zero for all simulated scenarios (not shown). The only slight deviation between the results of the simulation and the theoretical expectations was observed for the FCT-statistic when the migration rate within archipelagoes (m1) was close to or equal to the migration rate between archipelagoes (m2). This deviation can easily be explained since the theoretical derivations of Vigouroux and Couvet (2000) assume that m1 > m2. For the cases where m1 = m2, the simulations consistently show a FCT value that is close to zero, whereas the expected values are slightly higher.

As expected, there is a strong difference between the F-statistics (**Figure 1**) and the ρ-statistics (**Figure 2**) in how they behave under different ploidy levels. For the F-statistics, at a given set of migration rates, the values decrease with increasing ploidy level. This is due to the increased impact of migration at higher ploidy levels combined with a decrease in the force of genetic drift (Meirmans et al., 2018). On the other hand, the ρ-statistics generally have similar values for all ploidy levels when calculating the differentiation among subpopulations within clusters (ρSC) or the differentiation among clusters (ρCT). This ploidy-independence of the ρ-statistic is immediately evident from the almost completely overlapping lines in **Figure 2**. As we saw above for FCT, the estimates of ρCT from the simulated data show a slight deviation from the expected values when the assumption of m1 > m2 is violated.

### DISCUSSION

#### Expanding the AMOVA Framework

In this paper, we showed how the AMOVA framework (Excoffier et al., 1992; Peakall et al., 1995; Michalakis and Excoffier, 1996) can be used for autopolyploids of any ploidy level by adapting the way the Sums of Squares and resulting variance components are calculated. This method can be used with any distance metric that is normally used with haploid or diploid data, which means that the method can be used to obtain estimates of FST, ϕST, or RST. In addition, we showed that the use of a simple squared Euclidean distance metric defined here will yield an estimate of the ploidy-independent ρ-statistic. For both approaches (FST and ρ), AMOVA can be used for datasets from a single cytotype or a mixture of cytotypes. Since the covariance components are calculated separately for each locus, the method can even be used with species where there is ploidy variation within the genome, such as the salmonid fishes (Allendorf et al., 2015).

We tested the developed method with datasets simulated under a hierarchical island model of migration, for multiple ploidy levels. The results of the simulations closely matched those from the theoretical derivation of Vigouroux and Couvet (2000; see also Slatkin and Voelm, 1991), showing that the method correctly estimates the variance components. A slight deviation was only observed when the assumption of m1 > m2 that was made by Vigouroux and Couvet was violated. The violation of this assumption was done on purpose as the simulations where m1 = m2 allowed us to test the AMOVA in scenarios without any hierarchical population structure. In these cases, the AMOVA correctly showed the absence of any differentiation between clusters (FCT = 0); the theoretical expectation in these cases was slightly higher. Interestingly, this is the first study—as far as we are aware—that has compared the theoretical expectations for the hierarchical island model with simulated data; even though hierarchical F-statistics are widely used in analyses of genetic marker data, the theoretical derivations have received very little attention, for autopolyploids as well as for diploids.

#### The Ploidy-Independent ρ-Statistic

Though the ρ-statistic that was developed by Ronfort et al. (1998) is ideally suited to analyze autopolyploid data, it has seen relatively little use for this purpose. We hope that the possibility of calculating ρ using AMOVA will help to make it more widely adopted. For calculating ρ we described a simple squared Euclidean distance metric based on within-individual allele frequencies. This is closely related to the metric of Smouse and Peakall (1999), which uses allele counts rather than frequencies. As we describe above, for any single ploidy level our metric is a simple linear transformation of the metric of Smouse and Peakall, and so for a single-ploidy dataset the two metrics give identical results in AMOVA. However, one problem with the Smouse and Peakall metric (and AMOVA based on it) is that it cannot be used for analyses with mixed ploidy levels, as that will lead to a bias. The metric from Smouse and Peakall (1999) has received some criticism because it is founded on geometric rather than biological principles (Kosman and Leonard, 2005; Dufresne et al., 2014). However, these criticisms are unjustified since our deductions above (Equations 7–15) recovered the biological meaning of this method by linking the metric with the calculation of ρ.

In our derivations above we have only focused on autopolyploids. However, for many polyploid species it is not known whether the species is an allopolyploid or an autopolyploid. In addition, species may show inheritance patterns that are intermediate between these two extremes, with partly disomic and partly polysomic inheritance (segmental allopolyploids; Stebbins, 1947). Furthermore, the frequency of polysomic inheritance may even vary among loci within a genome (Stift et al., 2008). Meirmans and van Tienderen (2013) used simulations of tetraploids where the rate of tetrasomy varied between full disomic and full tetrasomic inheritance to test the presence of bias in several genetic summary statistics. They found that an assumption of autopolyploidy for a species that is in fact an allopolyploid can give a strong downward bias in the value of FST. On the other hand, the ρ-statistic was almost completely free of such a bias and is therefore the statistic of choice when the exact mode of segregation of a polyploid is unknown. Of course, this does not mean that the mode of segregation becomes irrelevant for the analysis of polyploid data; for a true understanding of the genetic processes within a polyploid species, studying the segregation mode is indispensable.

The greatest strength of ρ lies in comparisons across species or sets of populations with different ploidy levels. For a given migration rate, population size, and mutation rate, the value of ρ will be the same in diploids as in polyploids. Comparisons of ρ across species with different ploidy levels therefore permit assessing whether the impact of these processes is different in the different species. The ρ-statistic can also be easily calculated for mixed-ploidy data. However, in such cases there is an important caveat. Whereas the same set of allele frequencies always yields the same value of FST, regardless of the ploidy level, this is not the case for ρ. So in a case where there are multiple ploidy levels, calculating ρ separately for each ploidy level will give different values, even when within populations there is complete admixture among the cytotypes (see Meirmans et al., 2018). Another limitation of ρ is that it is currently only defined under the Infinite Allele and K-allele models of mutation. This means that it is not applicable to markers that follow a Stepwise Mutation Model (as is the case for RST) or to sequence data (as is the case for ϕST). This is because there are no Euclidean distances among individual genotypes that can take these mutational processes into account.

### Polyploid AMOVA in Practice: Software

AMOVA for autopolyploids has been implemented in the software GENODIVE (Meirmans and van Tienderen, 2004), which is freely available for Mac computers from http://www.patrickmeirmans.com/software. In addition to FST and ρ, GENODIVE can also use AMOVA to calculate the F′ST statistic for autopolyploids, which is FST standardized relative to the level of within-population variation (Meirmans, 2006; Meirmans and Hedrick, 2011). Besides the standard AMOVA, where the degree of differentiation is calculated based on an a priori defined hierarchical population structure, GENODIVE also offers AMOVA-based K-means clustering (Meirmans, 2012) for autopolyploid data based on the ρ-statistic. This analysis allows clustering of individuals or populations into k groups, where the algorithm finds the clustering with the highest value of the ρ-statistic. The autopolyploid AMOVA has also been implemented in the R-package POPPR V. 2.7.0 (Kamvar et al., 2014).

The ρ-statistic is also applicable to haploid and diploid data, and for such datasets it can be estimated using AMOVA with the software GENALEX (Peakall and Smouse, 2012). For haploid data, the ρ-statistic is simply equal to FST obtained from running an AMOVA. For diploid data, the option to calculate genetic distances among individuals should first be run, which calculates the metric of Smouse and Peakall (1999). When an AMOVA is subsequently performed using this distance matrix, the resulting differentiation statistics—labeled ϕ in the output—are equivalent to ρ.

#### Dealing With Missing Dosage Information

One of the major practical challenges of working with autopolyploids is the problem of missing dosage information for alleles (Dufresne et al., 2014). Depending on the type of marker and, for genotyping-by-sequencing data, on the sequencing depth, often only marker phenotypes are available and not the complete genotypes. This missing dosage information may cause a bias in the estimation of allele frequencies in samples from autopolyploid populations; in AMOVA, this will cause a bias in the estimation of the covariance components. This is because individuals with different genotypes can have the same phenotype: AAAB, AABB, and ABBB all have phenotype AB. This will lead to an underestimation of the distance between individuals and the corresponding Sums of Squares, and hence to an underestimation of FST and ρ. It is, as yet, not possible to correct for this bias directly in the calculation of AMOVA.

It is possible to correct for this bias in an indirect way by completing the genotypes via random imputation of the missing alleles, when Hardy-Weinberg equilibrium can be assumed within populations. For this, bias-corrected allele frequencies should first be estimated based on the set of phenotypes, e.g., using the maximum likelihood method of De Silva et al. (2005). Then for every individual, the phenotype should be filled in by randomly drawing alleles based on the expected frequency (under HWE) of the different genotypes that can be constructed from this phenotype, given the estimated frequencies of the alleles present in the phenotype. So for example when a tetraploid has phenotype AB and allele A is very common in the population and allele B is very rare, it is much more likely that the genotype will be randomly filled to AAAB than to AABB or ABBB. If this imputation is done for all individuals in the dataset and the sample sizes per population are sufficient, the allele frequencies in the imputed dataset will closely match the estimated allele frequencies and the imputed dataset can be used to perform AMOVA. Simulations have shown that this type of imputation can successfully remove bias caused by missing dosage for both FST and ρ (Meirmans et al., 2018). The procedure has been implemented in the AMOVA and AMOVA-based K-means clustering functions of the software GENODIVE (Meirmans and van Tienderen, 2004). Since it involves randomly drawing alleles, it may be prudent to repeat the procedure a number of times and calculate the average values of the F-statistics across replicates. Nevertheless, it's important to realize that the assumption of random mating, necessary for such imputation, is likely to be violated for many polyploids. Therefore, a next major step in the field would be the development of a method that can take the missing dosage into account directly without an assumption of HWE.
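The random imputation step described above can be sketched as follows. This is a minimal illustration, not the GENODIVE implementation: the function names `genotype_weight` and `impute_genotype` are hypothetical, the bias-corrected allele frequencies are assumed to be supplied by the user (the maximum likelihood estimation step is not shown), and genotype probabilities are taken to be multinomial under HWE without double reduction.

```python
import random
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod

def genotype_weight(genotype, freqs):
    """Multinomial HWE probability of an unordered genotype (tuple of alleles)."""
    counts = Counter(genotype)
    coef = factorial(len(genotype))
    for c in counts.values():
        coef //= factorial(c)          # number of orderings of the allele copies
    return coef * prod(freqs[a] ** c for a, c in counts.items())

def impute_genotype(phenotype, freqs, ploidy, rng=random):
    """Draw one full genotype compatible with an observed marker phenotype."""
    alleles = sorted(phenotype)        # each observed allele is present at least once
    extra = ploidy - len(alleles)      # copies whose identity is unknown
    candidates = [tuple(sorted(alleles + list(fill)))
                  for fill in combinations_with_replacement(alleles, extra)]
    weights = [genotype_weight(g, freqs) for g in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical frequencies: allele A common, allele B rare
freqs = {"A": 0.9, "B": 0.1}
rng = random.Random(1)
draws = Counter(impute_genotype("AB", freqs, 4, rng) for _ in range(2000))
print(draws.most_common())
```

With these hypothetical frequencies, the phenotype AB is completed to AAAB far more often than to AABB or ABBB, as in the example in the text. Repeating the whole imputation several times and averaging the resulting F-statistics follows the recommendation given above.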

# CONCLUSIONS

The statistical tools available for polyploids still lag behind those available for diploids (Dufresne et al., 2014; Meirmans et al., 2018). Hopefully, the Analysis of Molecular Variance for autopolyploids that we described here will help to narrow this gap as developers of statistical software that allows polyploid data (e.g., Jombart, 2008; Clark and Jasieniuk, 2011; Kamvar et al., 2014) implement this method more widely. We also hope that our description of the link between the squared Euclidean distances, calculated from the within-individual allele frequencies, and the ρ-statistic will help advocate the use of this statistic. Its independence of the ploidy level, the rate of double reduction, the frequency of polysomic inheritance, and the mating system makes ρ better suited for comparisons among species than the standard FST, both for diploids and for polyploids.

# AUTHOR CONTRIBUTIONS

PM: developed the general method for calculating the Sums of Squares for polyploids; SL: derived the proof linking the ρ-statistic to the Euclidean distances; PM: wrote the manuscript with input from SL.

# ACKNOWLEDGMENTS

We would like to thank Marc Stift, Filip Kolář, and the participants of the 2017 polyploidy workshop in Prague for stimulating discussions about all aspects of polyploidy research. We also would like to thank Zhian Kamvar for implementing the polyploid AMOVA in the POPPR package.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2018.00066/full#supplementary-material

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Meirmans and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Site Frequency/Dosage Spectrum of Autopolyploid Populations

#### Luca Ferretti <sup>1</sup> \*, Paolo Ribeca<sup>1</sup> and Sebastian E. Ramos-Onsins <sup>2</sup>

*<sup>1</sup> The Pirbright Institute, Woking, United Kingdom, <sup>2</sup> Centre for Research in Agricultural Genomics, Barcelona, Spain*

The Site Frequency Spectrum (SFS) and the heterozygosity of allelic variants are among the most important summary statistics for population genetic analysis of diploid organisms. We discuss the generalization of these statistics to populations of autopolyploid organisms in terms of the joint Site Frequency/Dosage Spectrum and its expected value for autopolyploid populations that follow the standard neutral model. Based on these results, we present estimators of nucleotide variability from High-Throughput Sequencing (HTS) data of autopolyploids and discuss potential issues related to sequencing errors and variant calling. We use these estimators to generalize Tajima's *D* and other SFS-based neutrality tests to HTS data from autopolyploid organisms. Finally, we discuss how these approaches fail when the number of individuals is small. In fact, in autopolyploids there are many possible deviations from the Hardy–Weinberg equilibrium, each reflected in a different shape of the individual dosage distribution. The SFS from small samples is often dominated by the shape of these deviations of the dosage distribution from its Hardy–Weinberg expectations.

Keywords: autopolyploidy, dosage distribution, Hardy-Weinberg equilibrium, high-throughput sequencing, site frequency spectrum, heterozygosity, neutrality tests, allelic dosage

# 1. INTRODUCTION

The study of nucleotide variability in polyploid species is a convoluted task that requires solving a number of methodological and analytical difficulties related to the specific nature of the species (detailed in the reviews of Dufresne et al., 2014; Meirmans et al., 2018). The impact of diploidy on the evolutionary dynamics is well-known, but the complexity of the impact of higher ploidy on the genetic variability of polyploid organisms is even higher. An example is provided by autopolyploid species: as they contain copies originating from genome duplication of the same species, the inheritance is expected to be polysomic (all the variants of the same chromosome can pair in the meiosis process) but it is not rare to find preferential pairs (Stift et al., 2008; Chester et al., 2012), resulting in partial polysomic or even disomic inheritance. The different inheritance types, which may simultaneously occur in the same species, could generate differences in the effective population size at different loci and consequently different patterns of genetic variability. Another distinctive aspect of polyploid species that impacts their genetic variability patterns is the process of double reduction, where the two copies of the same chromatid migrate to the same gamete (Haldane, 1930). As a consequence, this process will increase drastically the homozygosity of the gametes for the involved segment.

#### Edited by:

*Hans D. Daetwyler, La Trobe University, Australia*

#### Reviewed by:

*Barbara K. Mable, University of Glasgow, United Kingdom Paul David Blischak, The Ohio State University, United States Polina Yu. Novikova, VIB-UGent Center for Plant Systems Biology, Belgium*

> \*Correspondence: *Luca Ferretti luca.ferretti@gmail.com*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

> Received: *20 April 2018* Accepted: *28 September 2018* Published: *23 October 2018*

#### Citation:

*Ferretti L, Ribeca P and Ramos-Onsins SE (2018) The Site Frequency/Dosage Spectrum of Autopolyploid Populations. Front. Genet. 9:480. doi: 10.3389/fgene.2018.00480*


High-Throughput Sequencing (HTS) has facilitated the study of genome data in general and that of polyploid species as well. Still there are difficulties, mainly assigning the sequence reads to homologous (rather than homeologous) loci and/or dealing with relatively high rates of sequencing error (You et al., 2018). The amount of software available to correctly assemble sequences and detect variants (e.g., GATK from Broad Institute) is increasing, although the task remains challenging (Mielczarek and Szyda, 2016; You et al., 2018). These methodological problems are expected to be (at least partially) solved in the coming years with the technological progress of the sector, including long reads and linked reads to improve phasing and increased throughput of sequencing runs (Dufresne et al., 2014; Shendure et al., 2017).

The study of polyploid variability from HTS data and the development of statistical methods based on these sequencing methodologies are driving current genetic studies of polyploids (Dufresne et al., 2014; Hardy, 2016) and will continue to have a fundamental impact on the field. Nevertheless, much work is still needed, especially on the topic of allelic dosage, that is, the number of copies of each allele in a heterozygous individual (Blischak et al., 2016). Since the development of HTS, a number of studies developing computational and statistical methods that account for polyploidy have been published. Examples are statistics to estimate the levels of variability (Ferretti and Ramos-Onsins, 2015) and heterozygosity (e.g., De Silva et al., 2005; Hardy, 2016) with different approaches to take into account the allelic dosage, or the detection of population structure (e.g., Falush et al., 2003; Gao et al., 2007) and comparative measures of these differences between populations/species/individuals (e.g., Jost, 2008; Meirmans and Hedrick, 2011). Arnold et al. (2012) showed that autotetrasomic inheritance can be modeled using Kingman's standard coalescent (Kingman, 1982). Their results can be generalized to autopolyploid species of different ploidy and are especially useful as a null model to predict the neutral patterns of genetic diversity in polyploid species. Additional phenomena specific to polyploids, such as double reduction, can also be modeled in a way resembling partial self-fertilization (Arnold et al., 2012).

Nevertheless, a major gap in the population genetic analysis of polyploid organisms is the application of methods based on the Site Frequency Spectrum (SFS). Of special interest is the generalization to polyploid organisms of Tajima's D (Tajima, 1989), Fay and Wu's H (Fay and Wu, 2000) and other neutrality tests based on the SFS (Achaz, 2009; Ferretti et al., 2010, 2012). The SFS and the heterozygosity of allelic variants are among the most important statistics for population genetic analysis of diploid organisms and have been commonly used for describing the genetic variability of genomic data and for inferring the parameters of evolutionary models (e.g., Nielsen, 2000). Indeed, the combination of these two statistics (frequency and heterozygosity) describes completely the genotype of a diploid population for a given genomic position.

In this paper we consider a single population of autopolyploid organisms. Compared to the diploid case, the genotypes of variants in polyploid organisms present a more complex structure resulting from a combination of internal spectra for each individual. We discuss this genotype structure and its decomposition into different statistics, including the SFS and a generalization of the distribution of heterozygosity that we call the Site Dosage Spectrum (SDS).

For samples of large size, we argue that the details of deviations from Hardy–Weinberg equilibrium have a relatively small impact on the SFS. The expected value of the SFS of autopolyploid individuals is derived for a panmictic, neutral population of constant size. We also derive the expected value of the most general spectrum for autopolyploids, i.e., the joint Site Frequency-Dosage Spectrum (SFDS), which represents a combination of the SFS and the SDS. We use these results as a null model to build estimators of nucleotide diversity and neutrality tests for HTS data and we discuss the robustness of estimators of genetic variability.

For small samples, violations of Hardy–Weinberg in the dosage distribution have a strong impact on the SFS. We show how autopolyploid populations have the potential to harbor a wide range of deviations from Hardy–Weinberg equilibrium due e.g., to inbreeding, population structure, selection, dominance, modes of inheritance, or combinations of these causes. We discuss the impact of some of these violations on dosage and on SFS-based neutrality tests.

A synopsis of symbols and abbreviations used in both text and formulas can be found in **Table 1**. It should be noted that to the best of our knowledge most of the equations that follow (all but 2, 3, 7, 11, and 13) are original work presented in this paper for the first time. More details about their derivations can be found in the **Appendix**.

# 2. SFDS STRUCTURE IN AUTOPOLYPLOIDS

### 2.1. SFS and Heterozygosity in Diploids

Individuals are often sampled from a wild population without prior studies of the subpopulation structure or phenotypic differences. In this case, it is usually assumed for population genetic analysis that all individuals are equivalent and that any summary statistic should treat all sequences equally. To our knowledge, all statistics for sequences sampled from a single population that exist at the time of this writing (such as estimators of variability, neutrality tests, estimators based on linkage disequilibrium and haplotype-based statistics) rely implicitly on this assumption.

TABLE 1 | List of the main symbols and abbreviations used throughout the text.

These statistics can also be classified in terms of the number of sites involved in each individual computation. The frequency of a SNP requires information only on the alleles at a single genomic site, while linkage disequilibrium requires a comparison of alleles at two sites. On the other extreme, haplotype statistics require information on all sites in the sequence.

In this manuscript we will focus on the simplest statistics, i.e., those which can be computed independently for each site (and eventually averaged over all sites in the sequence to obtain summary statistics). We will also consider only biallelic variants (one ancestral and one derived/mutated allele present at each site) in our analysis. Biallelic SNPs represent by far the most common type of variant in eukaryotic genomes, hence this assumption is not particularly restrictive. This is true also for autopolyploid organisms, since it relies on the low mutation rates per base and the corresponding low variability at the population level.

A simple explanation for the prevalence of biallelic variants is the following. Under the usual assumptions for the Kingman coalescent, which describes autopolyploid populations as well (Arnold et al., 2012), SNPs are generated by at least one mutation at a given site along the tree. The tree length in coalescent units is a number of order O(1), while the effective mutation rate in coalescent units is represented by the parameter of genetic variability θ = 2pN<sub>e</sub>µ, where N<sub>e</sub> is the effective population size, p is the ploidy and µ is the mutation rate per base. For most eukaryotic organisms, θ is around 10<sup>−3</sup> (Lynch, 2005). This estimate is based on diploids, but the order of magnitude would be the same for most autopolyploids. The fraction of sites containing a SNP in a finite sample is the product of θ and tree length, and therefore proportional to θ. However, for a triallelic SNP to occur, two mutations should appear on the tree, hence only a fraction O(θ<sup>2</sup>) of sites contains a SNP with three or more alleles, i.e., only a fraction O(θ) of the SNPs is triallelic. This argument is valid for autopolyploids, but not for allopolyploids, since it does not take into account the divergence between homeologous chromosomes.
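The order-of-magnitude argument above can be checked numerically. The sketch below assumes that the number of mutations hitting one site is Poisson distributed with mean θL/2, where L is the total tree length in coalescent units (the factor 1/2 is a common scaling convention and is an assumption here, as is the value chosen for `tree_length`). Sites hit at least twice are only an upper bound on triallelic sites, since the second mutation may produce an already present base.

```python
from math import exp, expm1

def snp_fractions(theta, tree_length):
    """Poisson sketch: mutations at one site ~ Poisson(theta * tree_length / 2)."""
    lam = theta * tree_length / 2.0
    p_one_plus = -expm1(-lam)                    # P(at least one mutation): site is a SNP
    p_two_plus = p_one_plus - lam * exp(-lam)    # P(at least two mutations)
    return p_one_plus, p_two_plus

theta = 1e-3
tree_length = 8.0   # hypothetical total tree length in coalescent units
snp, multi = snp_fractions(theta, tree_length)
print(f"SNP fraction ~ {snp:.2e}, multiply-hit share of SNPs ~ {multi / snp:.2e}")
```

Under these assumptions the share of SNPs hit more than once is close to θL/4, i.e., of order θ, in line with the argument in the text.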

In haploid populations, the only statistic based on information at a single position of nucleotide sequences is the frequency of the mutated/derived allele f(x) at a given site x. In fact, once the frequency in the sample is known, the genotypes of all individuals are known up to permutations of the individuals. The summary statistic is the so-called SFS, which is the number of sites with a mutation of (derived) frequency j/n in a sample of n individuals, denoted by ξ<sub>j</sub>. For the whole population, the equivalent spectrum is the density of sites in the sequence with a mutation of (derived) frequency between f and f + df, denoted by ξ(f).

In diploid populations, however, the frequency of a mutation at a given site x is not sufficient to fully determine the genotypes of the n individuals in the sample. The reason is that each individual can be homozygous for either the ancestral or the mutated allele or it can be heterozygous, i.e., it is characterized by an internal count of the mutated allele at that site (which can be 0, 1, or 2) and a corresponding internal frequency (0, 1/2, or 1). Taken together, all individuals in the sample carry an "internal spectrum" I<sub>d</sub>(x) with d = 0, 1, 2, defined as the count of individuals with internal count d for the mutation at position x, which is of course normalized as Σ<sub>d=0</sub><sup>2</sup> I<sub>d</sub>(x) = n. This individual spectrum is related to the global frequency of the mutation through its mean count Σ<sub>d=0</sub><sup>2</sup> d I<sub>d</sub>(x) = 2nf(x).

The diploid genotype at position x is fully determined by I<sub>d</sub>(x) up to permutations of the individuals. Given that I<sub>d</sub>(x) has three components (number of ancestral homozygotes I<sub>0</sub>, of heterozygotes I<sub>1</sub> and of derived homozygotes I<sub>2</sub>) but one is constrained by the number of individuals and another combination corresponds to the frequency, there is only one independent component left, for instance the number of heterozygotes I<sub>1</sub>(x). The information contained in this spectrum is therefore equivalent to the two statistics f(x) and h(x), where h(x) is the heterozygosity (the fraction of heterozygous individuals in the sample) defined as h(x) = I<sub>1</sub>(x)/n.

Heterozygosity is another very well-known statistic in the population genetics of diploid organisms. If the alleles at site x are in Hardy–Weinberg equilibrium (i.e., under random mating and without selection), the expected fraction of heterozygotes is given by the standard formula E[h(x)] = 2f(x)(1 − f(x)), i.e., it corresponds to the pairwise nucleotide diversity in the population at that site. Its distribution for a discrete sample is a binomial with the same mean 2f(1 − f) in terms of the population frequency.

Deviations from the expectation h ≈ 2f(1 − f) are signatures of violations of some of the assumptions of the Hardy–Weinberg equilibrium. For example, a deficit of heterozygotes h < 2f(1−f) is expected if there is sub-population structure in the sample, violating the "random mating" assumption.
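As a small illustration of these definitions, the following sketch computes the internal spectrum, the frequency f, the heterozygosity h and the Hardy–Weinberg expectation 2f(1 − f) for one site. The function name and the example genotypes are hypothetical.

```python
from collections import Counter

def diploid_site_summary(dosages):
    """dosages: derived-allele count (0, 1 or 2) for each sampled individual."""
    n = len(dosages)
    counts = Counter(dosages)
    internal = [counts.get(d, 0) for d in (0, 1, 2)]          # I_0, I_1, I_2
    f = sum(d * internal[d] for d in (0, 1, 2)) / (2 * n)      # mean count / 2n
    h = internal[1] / n                                        # fraction of heterozygotes
    return internal, f, h, 2 * f * (1 - f)

# Hypothetical sample of 8 diploid individuals at one site
internal, f, h, h_hw = diploid_site_summary([0, 0, 1, 2, 2, 0, 1, 0])
print(internal, f, h, h_hw)
```

In this made-up sample h is lower than 2f(1 − f), the kind of heterozygote deficit that the text associates with sub-population structure. A single site in a small sample is of course not evidence of anything by itself.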

Note that the most general summary single-site statistic for diploids is neither the SFS nor the heterozygosity, but rather the joint site frequency-heterozygosity spectrum ψ(f, h) or its corresponding version ψ<sub>j,I<sub>1</sub></sub> for a finite sample. This joint spectrum is defined as the number of sites with a derived variant at frequency f = j/2n and where a fraction h = I<sub>1</sub>/n of the individuals are heterozygous.

The neutral expectation for this frequency-heterozygosity spectrum in finite samples can be found from the known theory of the frequency spectrum in haploids (Fu, 1995; Ewens, 2004) combined with simple combinatorial arguments applied to the Hardy–Weinberg equilibrium (Weir, 1996). This combination gives

$$\mathrm{E}[\psi_{j,\mathcal{I}_1}] = \frac{\theta \, 2^{\mathcal{I}_1} \, \dfrac{n!}{\mathcal{I}_1! \left(\frac{j-\mathcal{I}_1}{2}\right)! \left(n-\frac{j+\mathcal{I}_1}{2}\right)!}}{j \binom{2n}{j}} \tag{1}$$

Note the constraint that j − I<sub>1</sub> should be a multiple of 2.
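Equation (1) is straightforward to evaluate numerically. The sketch below (the function name is hypothetical) returns 0 whenever the parity constraint is violated or a genotype count would be negative, and checks that summing over the number of heterozygotes recovers the neutral SFS θ/j.

```python
from math import comb, factorial

def expected_psi_diploid(j, i1, n, theta):
    """Expected number of sites with derived count j and i1 heterozygotes (Eq. 1)."""
    if (j - i1) % 2 != 0:
        return 0.0                       # j - I_1 must be a multiple of 2
    hom_der = (j - i1) // 2              # derived homozygotes, I_2
    hom_anc = n - (j + i1) // 2          # ancestral homozygotes, I_0
    if i1 < 0 or hom_der < 0 or hom_anc < 0:
        return 0.0
    ways = 2**i1 * factorial(n) // (
        factorial(i1) * factorial(hom_der) * factorial(hom_anc))
    return theta / j * ways / comb(2 * n, j)

n, theta = 5, 1.0
for j in range(1, 2 * n):
    total = sum(expected_psi_diploid(j, i1, n, theta) for i1 in range(n + 1))
    print(j, round(total * j / theta, 12))   # each row should give 1.0
```

The check works because the genotype configurations with I<sub>1</sub> heterozygotes partition all the ways of placing j derived alleles on 2n chromosomes.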

In **Figure 1**, we illustrate how this spectrum appears under neutrality for a single population of constant size, both in the standard model and under two demographic models: recent admixture and population structure. The latter shows a clear violation of Hardy–Weinberg equilibrium due to a lack of heterozygotes, the so-called Wahlund effect (Rosenberg and Calabrese, 2004).

FIGURE 1 | … under two demographic models: recent admixture (B) and population structure (C). In both cases, we assume two well-separated populations with divergence equal to θ, the effective population size of the first population being twice the size of the other. In the former case, we assume instantaneous admixture of the two populations and random mating thereafter. In the latter case, the consequence of the absence of mating between different populations is a reduction of heterozygotes in the pooled population, known as the Wahlund effect.

In diploids, not much attention has been devoted to this joint spectrum, and the two quantities f and h are usually studied separately. One of the possible reasons is that the Hardy–Weinberg equilibrium is reached in a single generation for diploids, hence heterozygosity and deviations from Hardy–Weinberg equilibrium are affected by phenomena acting on short time scales, while the SFS contains information on evolution at larger scales. However, the difference between these quantities becomes more blurred in autopolyploids, as we will discuss in the rest of this paper.

### 2.2. SFDS in Autopolyploids

In autopolyploids, the framework for single-site statistics is reminiscent of the diploid case. The main difference is that at each position of each individual genome the mutated allele can be present in a number of copies from 0 to the ploidy p. In polyploids, the frequency of an allele within an individual is often called its allelic dosage.

The internal spectrum I<sub>d</sub>(x), defined as the count of individuals with allelic dosage d for the mutation at position x, now covers a broader range of dosages d = 0, 1, 2 ... p. For this reason, we will call it the Dosage Distribution (DD). As before, this spectrum is normalized as Σ<sub>d=0</sub><sup>p</sup> I<sub>d</sub>(x) = n and it is related to the global frequency of the mutation by Σ<sub>d=0</sub><sup>p</sup> d I<sub>d</sub>(x) = pnf(x).

Specification of these two conditions can be avoided if we discard the homozygote counts from the DD, since such counts are completely determined by sample size and frequency together with the rest of the DD. The heterozygous part of the SDS plays the same role as heterozygosity in diploids; however, it has the form of a frequency spectrum, hence an additional complexity with respect to the one-dimensional heterozygosity statistic.

An illustration of the DD and its complexity can be found in **Figure 2**. In this hypothetical example, we consider a panmictic population with mixed mating (partly selfing, partly outcrossing) and distributed according to a spatial density gradient away from a central region. If the selfing rate depends on the density, being low in dense regions and high in sparse ones, then individuals in dense regions will show a pattern consistent with Hardy–Weinberg equilibrium in the DD, while those in sparse regions will show an excess of homozygotes due to selfing.

For large populations, we can define a normalized DD as i<sub>d</sub> = I<sub>d</sub>/n. The most general single-site statistic for autopolyploids is therefore the joint Site Frequency-Dosage Spectrum (SFDS) ψ(f, {i<sub>d</sub>}<sub>d=1...p−1</sub>) or its discrete version ψ<sub>j,{I<sub>d</sub>}<sub>d=1...p−1</sub></sub> for a finite sample. Similar to the diploid case, this joint SFDS is defined as the number of sites with a derived variant at frequency f = j/(pn) where the dosage distribution across individuals is i<sub>d</sub> = I<sub>d</sub>/n. If we condition on a given frequency, we obtain the Site Dosage Spectrum (SDS) p({i<sub>d</sub>}<sub>d=1...p−1</sub> | f).

An important and subtle point that should be clear from **Figure 3** is that the SDS is the distribution of the DD, and hence it cannot be reliably summarized as a single average DD. Reducing the SFDS for a given frequency to the average DD over all variants of that frequency is the equivalent of summarizing the distribution of heterozygosity in diploids by providing the average heterozygosity only. In fact the SFDS is a full p-dimensional spectrum whose components are the frequency (one component) and the heterozygous part of the DD (p − 1 components), the latter representing the SDS.
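The distinction between the SDS and the average DD can be made concrete with a toy dataset in the spirit of Figure 3. The dosage vectors below are hypothetical: they were chosen only so that their heterozygous DDs match the three listed in the figure caption, and the function names are illustrative. The spectrum is normalized by the number of SNPs, as in that caption.

```python
from collections import Counter
from fractions import Fraction

def site_key(dosages, ploidy):
    """Return (j, heterozygous part of the normalized DD) for one site."""
    n = len(dosages)
    j = sum(dosages)                                   # derived-allele count in the sample
    counts = Counter(dosages)
    het_dd = tuple(Fraction(counts.get(d, 0), n) for d in range(1, ploidy))
    return j, het_dd

def sfds(dosage_matrix, ploidy):
    """Fraction of sites falling in each (j, DD) class."""
    keys = Counter(site_key(site, ploidy) for site in dosage_matrix)
    total = sum(keys.values())
    return {k: Fraction(c, total) for k, c in keys.items()}

# Three SNPs at 50% frequency in four tetraploid individuals (n = 4, p = 4)
sites = [[1, 3, 4, 0], [1, 2, 2, 3], [2, 2, 2, 2]]
spectrum = sfds(sites, ploidy=4)
for (j, dd), weight in spectrum.items():
    print(j, [str(x) for x in dd], weight)
```

All three sites share j = 8 yet fall into three different DD classes, each with weight 1/3. Averaging the three DDs would collapse them into a single vector and lose this information.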

FIGURE 2 | Illustration of the Dosage Distribution (including homozygotes) in a panmictic autotetraploid population with density-dependent selfing rates. In this example, we assume for simplicity that segregating alleles are at intermediate frequency in the population; their dosage in each individual is represented by the color lightness. Since the average frequency is the same everywhere, the average dosage also is. However, by contrast, the DD depends strongly on the sampling location because of variations in the local spatial density. Sampling individuals at random across different locations would result in an average DD like the one in the top-right inset. On the other hand, sampling around a given location would result in different DDs, as illustrated. Locations in the central region tend to have DDs similar to the Hardy–Weinberg ones, while peripheral locations show a large excess of homozygotes because of sampling.

FIGURE 3 | Illustration of the relation between the Dosage Distribution and the Site Dosage Spectrum. On the left, homologous sequences from 4 tetraploid individuals are shown (*n* = 4, *p* = 4), containing 3 SNPs of frequency 50%. On the right, the three DDs (one for each SNP) are shown at the top. The SDS at the bottom is the distribution of these DDs (which in this example is given by the three DDs with probability 1/3 each). Note that the SDS bears no relation with the average DD, which is shown in the middle. In this example, the Site Frequency/Dosage Spectrum would be ψ<sub>8,{1/4,0,1/4}</sub> = 1/3, ψ<sub>8,{1/4,1/2,1/4}</sub> = 1/3, ψ<sub>8,{0,1,0}</sub> = 1/3 and ψ<sub>8,{I}</sub> = 0 for other choices of I.

### 2.3. The SFDS of the Standard Neutral Model

The expected value of the SFDS under the standard neutral model is a simple generalization of the diploid frequency-heterozygosity spectrum presented before. In an infinite population and in the absence of double reduction, the Dosage Distribution for a mutation of frequency f under Hardy–Weinberg equilibrium is well-known (Haldane, 1930):

$$i_d = \binom{p}{d} f^d (1-f)^{p-d} \quad \text{for } d = 0 \dots p \tag{2}$$

and the expected value of the neutral SFS has the standard shape

$$\mathrm{E}[\xi(f)] = \frac{\theta}{f};\tag{3}$$

hence the expected population SFDS is simply

$$\mathrm{E}[\psi(f,\{i_d\})] = \frac{\theta}{f} \prod_{d=1}^{p-1} \delta\left(i_d - \binom{p}{d} f^d (1-f)^{p-d}\right) \tag{4}$$

where δ(z) is the Dirac delta function, which represents a distribution concentrated at z = 0.
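The Hardy–Weinberg Dosage Distribution of Equation (2) is a binomial distribution over dosages and can be evaluated directly. A minimal sketch (the function name is hypothetical):

```python
from math import comb

def hw_dosage_distribution(f, ploidy):
    """Normalized DD under Hardy-Weinberg equilibrium without double reduction (Eq. 2)."""
    return [comb(ploidy, d) * f**d * (1 - f)**(ploidy - d)
            for d in range(ploidy + 1)]

dd = hw_dosage_distribution(0.5, 4)
print(dd)                                       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(d * x for d, x in enumerate(dd)))     # mean dosage p*f = 2.0
```

Only the entries for d = 1 ... p − 1 enter Equation (4). The homozygote classes d = 0 and d = p are fixed by the normalization and by the frequency.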

For finite samples the expected values are slightly more complex. A combinatorial argument similar to the diploid case, based on the ways to assign the j mutated alleles across the pn homologous chromosomes, provides the following formula for the SDS, i.e., the distribution of the Dosage Distribution {I<sub>d</sub>}<sub>d=1...p−1</sub> in finite samples of size n:

$$\mathrm{E}[p(\{\mathcal{I}_d\}|j)] = \frac{\dfrac{n!}{\prod_{d=1}^{p-1}\mathcal{I}_d! \, \left(\frac{j-\sum_{d=1}^{p-1}d\,\mathcal{I}_d}{p}\right)! \, \left(n - \frac{j}{p} - \sum_{d=1}^{p-1}\left(1 - \frac{d}{p}\right)\mathcal{I}_d\right)!} \, \prod_{d=1}^{p-1} \binom{p}{d}^{\mathcal{I}_d}}{\binom{pn}{j}} \tag{5}$$

where the above expression should be interpreted as 0 if it contains factorials of non-integer numbers. More details can be found in the **Appendix**.

The SFDS in finite samples can be found combining (5) with the known neutral expected SFS θ/j:

$$\mathrm{E}[\psi_{j,\{\mathcal{I}_d\}}] = \frac{\theta}{j} \, \mathrm{E}[p(\{\mathcal{I}_d\}|j)] \tag{6}$$

Note that in finite samples frequency and DD are under the constraint that j − Σ<sub>d=1</sub><sup>p−1</sup> d I<sub>d</sub> should be a multiple of p.

# 3. SFS ESTIMATORS AND NEUTRALITY TESTS FOR LARGE SAMPLES

For large samples n ≫ 1, the exact shapes of the DD and the SDS often have a negligible impact on tests based on the shape of the SFS and their normalization. In fact, most of these tests place weights on ξ(f) that change gradually with the frequency. There are a few exceptions, for instance tests that assign very different weights on singletons, such as Fu and Li's F and D tests for background selection (Fu and Li, 1993), and the expansion test R<sub>2</sub> (Ramos-Onsins and Rozas, 2002). The shape of Hardy–Weinberg violations affects the SFS on a scale Δf ≲ p/(pn) = 1/n. Since most tests weight frequencies in a smooth way over scales of Δf ∼ 1/n for n large enough, the DD can usually be ignored in large samples.

However, unbiased sequence data from a large number of individuals is typically obtained by High-Throughput Sequencing (HTS) at low to moderate coverage. HTS data at low coverage is usually unbalanced and more prone to be significantly impacted by sequencing errors, thus requiring tailored approaches. Hence in this section we focus on SFS-based estimators of genetic variability and neutrality tests adapted to HTS data.

SNP calling is usually required prior to population genetic analysis. It is even more relevant for HTS data, due to the typical amount of sequencing errors for these technologies. It is key that only methods developed specifically for polyploids (e.g., GATK from Broad Institute) or for pooled data (e.g., Raineri et al., 2012) are used, since the accuracy of SNP calling algorithms depends on the ploidy. Algorithms for diploids are usually unsuitable to analyse data from organisms with higher ploidy.

Allelic dosage estimation could also be performed (e.g., Blischak et al., 2016), but it is unreliable at low coverage and can be challenging even at high coverage. In fact, dosage uncertainties represent one of the biggest hurdles when dealing with polyploid population genetics (Blischak et al., 2016). However, an accurate estimate of allelic dosage for each individual is not needed to estimate genetic diversity at population level. In fact, none of the methods we discuss in this section requires an explicit estimation of dosage. All these methods work directly on short-read data after SNP calling and filtering of unreliable low-frequency variants.

The estimators of variability proposed in this section take read depth explicitly into account and are unbiased at low coverage as well. Hence there is no need to filter regions of low coverage, although excluding regions with read depth lower than the ploidy could increase the accuracy of the results. However, since our estimators do not take sequencing errors into account, we strongly suggest performing SNP calling prior to analysing variability with them. For such analyses SNPs can be filtered with moderately conservative parameters, e.g., excluding only SNPs with posterior probability <0.95, or equivalently p-value >0.05, or PHRED quality score <15.
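The PHRED scale in the last threshold maps a quality score Q to an error probability 10^(−Q/10). As a quick way to compare the quoted cut-offs, the conversion can be sketched in Python; the function names are illustrative and are not taken from any SNP caller.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a PHRED-scaled quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(prob):
    """PHRED-scaled quality score corresponding to an error probability."""
    return -10 * math.log10(prob)


print(round(phred_to_error_prob(15), 4))    # error probability at Q = 15
print(round(error_prob_to_phred(0.05), 1))  # Q corresponding to p = 0.05
```

Under this convention Q = 15 corresponds to an error probability of about 0.03 and an error probability of 0.05 to Q ≈ 13, so the thresholds are close but not strictly identical.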

In this section we consider an experimental setup where every polyploid individual of ploidy p in a sample of n individuals is sequenced separately with a read depth of r<sub>i</sub>(x) at position x, where i = 1 ... n. The count of the alternative (derived) alleles within reads from the ith individual at position x is c<sub>i</sub>(x). If the position x has been filtered out during SNP calling, we discard the SNP and consider c<sub>i</sub>(x) = 0 for all individuals.

#### 3.1. Estimators of Variability

#### 3.1.1. Watterson's Estimator

The classical estimator of variability based on the SFS is the Watterson estimator (Watterson, 1975), which is based on the number of segregating sites S in a sample of size n. Under an infinite sites model and a panmictic stationary and neutral scenario with population size N, where mutations occur randomly and independently at a rate µ per non-overlapping generation (i.e., a Wright-Fisher model), the expected variability level θ = 2pN<sub>e</sub>µ can be estimated by:

$$
\theta\_W = \frac{\mathcal{S}}{a\_n},
\tag{7}
$$

where a<sub>n</sub> = Σ<sub>j=1</sub><sup>n−1</sup> 1/j. This estimator is based on the expected neutral spectrum of mutations and is sensitive to the presence of an excessive number of singletons, which can be observed, for example, under demographic expansion scenarios (Ramos-Onsins and Rozas, 2002) or in the presence of high rates of artifactual sequencing errors (Achaz, 2008).
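As a small illustration of Equation (7), the following Python sketch computes a<sub>n</sub> and the Watterson estimator from a count of segregating sites. The function names and the example numbers are hypothetical; for polyploid individuals with known genotypes, the number of sequences would presumably be the number of homologous chromosomes np.

```python
def harmonic_number(n_sequences):
    """a_n = sum_{j=1}^{n-1} 1/j for a sample of n sequences."""
    return sum(1.0 / j for j in range(1, n_sequences))


def watterson_theta(num_segregating, n_sequences):
    """Watterson's estimator theta_W = S / a_n (Equation 7)."""
    if n_sequences < 2:
        raise ValueError("at least two sequences are needed")
    return num_segregating / harmonic_number(n_sequences)


# Hypothetical example: 16 segregating sites observed in 10 sequences.
print(round(watterson_theta(16, 10), 3))
```

Because every segregating site counts equally, a handful of spurious singletons changes S, and hence the estimate, directly.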

A generalization of the Watterson estimator for autopolyploids, in the form of a Maximum Composite Likelihood estimator, has been derived in Equation (34) of Ferretti and Ramos-Onsins (2015). However, this estimator suffers from a strong bias due to sequencing errors. In fact, sequencing errors appear as low frequency variants which increase the estimate of S. Two strategies could be applied to reduce this dependence: either S should be estimated using only filtered SNPs obtained from SNP calling algorithms, or low frequency variants should be removed with an approach similar to that used in Achaz (2008).

#### 3.1.2. Tajima's Estimator of Nucleotide Diversity

Tajima's estimator (Tajima, 1983), or the pairwise nucleotide difference statistic (Π), is also a relevant estimator of nucleotide diversity and is defined as the average number of differences between sequences. In fact, for each position i it estimates the level of heterozygosity in the population [2f<sub>i</sub>(1 − f<sub>i</sub>), where f<sub>i</sub> is the relative frequency of a given variant allele at position i]. In the infinite-site and stationary neutral model, the expected value of Tajima's estimator (θ<sub>Π</sub>) is equal to that of Watterson's estimator (that is, under the ideal Wright-Fisher scenario E[θ<sub>Π</sub>] = E[θ<sub>W</sub>] = θ). Tajima's estimator for a region of size L is given by:

$$\theta\_{\Pi} = \frac{n}{(n-1)} \sum\_{i=1}^{L} 2f\_i (1 - f\_i). \tag{8}$$
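To make the link between Equation (8) and the average number of pairwise differences concrete, here is a minimal Python sketch that evaluates both on a toy alignment. The function names and the four example sequences are hypothetical.

```python
from itertools import combinations


def tajima_theta_pi(frequencies, n_sequences):
    """Equation (8): n/(n-1) * sum_i 2 f_i (1 - f_i)."""
    if n_sequences < 2:
        raise ValueError("at least two sequences are needed")
    n = n_sequences
    return n / (n - 1) * sum(2 * f * (1 - f) for f in frequencies)


def mean_pairwise_differences(sequences):
    """Average number of differing positions over all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


# Toy alignment: 4 sequences, 3 biallelic sites (1 = derived allele).
sequences = ["000", "100", "110", "111"]
n = len(sequences)
frequencies = [sum(int(s[i]) for s in sequences) / n for i in range(3)]

print(tajima_theta_pi(frequencies, n))
print(mean_pairwise_differences(sequences))
```

The two printed values agree, which is the sense in which the heterozygosity formula and the pairwise-difference definition describe the same statistic.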

Results from Ferretti et al. (2013) can be combined to build an unbiased estimator of pairwise nucleotide diversity for multiple polyploid individuals:

$$\hat{\theta}\_{\Pi} = \frac{2}{n(n-1)} \left[ \frac{p}{p-1} \sum\_{j=1}^{n} \pi\_j + 2 \sum\_{j=1}^{n-1} \sum\_{k=j+1}^{n} \pi\_{j,k} \right] \tag{9}$$

where π<sub>j</sub> is the average pairwise difference between reads from the jth individual, and π<sub>j,k</sub> is the average pairwise difference between pairs of reads from the jth and kth individual (Ferretti et al., 2013). Both these quantities account naturally for dosage. The factor p/(p − 1) is the same factor that appears between the estimates of sample and population heterozygosity in the above formula (8) (Nei and Roychoudhury, 1973).

The above estimator weights the information from all individuals equally, irrespective of their coverage and dosage. It is possible to build less noisy unbiased estimators by considering further assumptions on the variance of the pairwise differences. Given the average coverage per base r̄<sub>j</sub> of the jth individual, the variances can often be approximated by inverse powers of this coverage: Var(π<sub>j</sub>) ∝ 4/r̄<sub>j</sub> + 4/p, Var(π<sub>j,k</sub>) ∝ 1/r̄<sub>j</sub> + 1/r̄<sub>k</sub> + 2/p (see **Appendix**). Hence, an approximate Minimum Variance Unbiased Estimator for the pairwise diversity can be obtained by weighting the terms in the above estimator by the inverse of their variance:

$$\hat{\theta}\_{\Pi} = \frac{\sum\_{j=1}^{n} \pi\_{j} \frac{\bar{r}\_{j}(p-1)}{2(\bar{r}\_{j} + p)} + 2 \sum\_{j=1}^{n-1} \sum\_{k=j+1}^{n} \pi\_{j,k} \left(\frac{1}{\bar{r}\_{j}} + \frac{1}{\bar{r}\_{k}} + \frac{2}{p}\right)^{-1}}{\sum\_{j=1}^{n} \frac{\bar{r}\_{j}(p-1)^{2}}{2p(\bar{r}\_{j} + p)} + 2 \sum\_{j=1}^{n-1} \sum\_{k=j+1}^{n} \left(\frac{1}{\bar{r}\_{j}} + \frac{1}{\bar{r}\_{k}} + \frac{2}{p}\right)^{-1}} \tag{10}$$

As both versions of this estimator assign a negligible weight to low frequency alleles, they are much more robust with respect to sequencing errors and uncertainties in SNP calling. Hence in the presence of significant rates of sequencing errors, or other related causes of incorrect base calling, any of these estimators should be preferred to the Watterson estimator discussed above.
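Equation (10) can be evaluated directly once the within-individual and between-individual read diversities are available. The Python sketch below assumes these have already been computed; the argument names and the example values are hypothetical, and the code is only a transcription of the weights in Equation (10), not the authors' implementation.

```python
def weighted_theta_pi(pi_within, pi_between, mean_depth, ploidy):
    """Coverage-weighted pairwise diversity, following Equation (10).

    pi_within[j]        -- pi_j, diversity between reads of individual j
    pi_between[(j, k)]  -- pi_{j,k} for j < k
    mean_depth[j]       -- average read depth of individual j
    """
    p = ploidy
    n = len(pi_within)
    numerator = 0.0
    denominator = 0.0
    # Within-individual terms.
    for j in range(n):
        r = mean_depth[j]
        numerator += pi_within[j] * r * (p - 1) / (2 * (r + p))
        denominator += r * (p - 1) ** 2 / (2 * p * (r + p))
    # Between-individual terms.
    for j in range(n - 1):
        for k in range(j + 1, n):
            weight = 1.0 / (1.0 / mean_depth[j] + 1.0 / mean_depth[k] + 2.0 / p)
            numerator += 2 * pi_between[(j, k)] * weight
            denominator += 2 * weight
    return numerator / denominator


# Hypothetical tetraploid example with three individuals.
theta = 0.01
within = [theta * 3 / 4] * 3            # expectation (p-1)/p * theta
between = {(0, 1): theta, (0, 2): theta, (1, 2): theta}
print(weighted_theta_pi(within, between, mean_depth=[5, 20, 60], ploidy=4))
```

In this example the inputs are set to their expected values, so the estimator returns θ whatever the depths are. This is a convenient sanity check that the weights in the numerator and denominator match.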

#### 3.2. Neutrality Tests

#### 3.2.1. Tajima's D

Tajima's D test (Tajima, 1989) was the first neutrality test based on the frequency spectrum and it is still the most popular one. It is based on the difference between Tajima's estimator θ<sub>Π</sub> and the Watterson estimator θ<sub>W</sub>. As explained above, under the stationary neutral model it is expected that this difference would be zero. However, empirical data violating the theoretical assumptions can result in significant differences. This test can discriminate among some selective and/or demographic processes. The Tajima's D statistic is given by:

$$D = \frac{\hat{\theta}\_{\Pi} - \hat{\theta}\_{W}}{\sqrt{\text{Var}(\hat{\theta}\_{\Pi} - \hat{\theta}\_{W})}} \tag{11}$$

where the denominator is computed under the standard neutral model and is a function of θ and np.

For HTS data, the numerator of the test can be simply obtained from the difference of the Tajima's and Watterson's estimators presented above.

Obtaining the exact denominator is computationally tricky. A practical approximation is to use the standard denominator for the test, but replacing the "haploid" sample size np by an effective sample size n<sub>eff</sub> defined as the average number of homologous chromosomes that have been actually sequenced at every position, i.e.,

$$n\_{\rm eff} = \frac{1}{L} \sum\_{\mathbf{x}=1}^{L} \sum\_{j=1}^{n} p \left[ 1 - \left( 1 - \frac{1}{p} \right)^{r\_j(\mathbf{x})} \right] \tag{12}$$
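The approximation described above can be sketched in Python as follows. The effective sample size follows Equation (12); the denominator uses the standard constants of Tajima (1989). Two choices are assumptions of this sketch rather than statements of the text: the effective sample size is rounded to the nearest integer before the harmonic sums are computed, and the number of segregating sites entering the variance is taken as θ<sub>W</sub>·a<sub>1</sub>. Function names are illustrative.

```python
import math


def effective_sample_size(depths, ploidy):
    """Equation (12). depths[x][j] is the read depth of individual j at site x."""
    p = ploidy
    total = 0.0
    for site in depths:
        total += sum(p * (1.0 - (1.0 - 1.0 / p) ** r) for r in site)
    return total / len(depths)


def tajima_d(theta_pi, theta_w, n_eff):
    """Tajima's D with the standard denominator evaluated at n_eff."""
    n = int(round(n_eff))  # assumption: round to the nearest integer
    if n < 2:
        raise ValueError("effective sample size must be at least 2")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = theta_w * a1  # assumption: implied number of segregating sites
    return (theta_pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Three tetraploids, each covered by a single read at every site:
# only one chromosome per individual is observed.
print(effective_sample_size([[1, 1, 1], [1, 1, 1]], ploidy=4))
```

Both estimators passed to `tajima_d` must refer to the same region and be expressed in the same units (per region, not per site), since the standard variance formula is written in terms of counts of segregating sites.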

#### 3.2.2. Fay and Wu's H

Fay and Wu's H test (Fay and Wu, 2000) was designed to detect derived allele frequencies much higher than expected under a neutral scenario. A large number of variants at high frequencies can be a consequence of positive selection, although it could also occur in the presence of signals of population structure (e.g., introgression). The test compares the levels of variability of Tajima's estimator (θ<sub>Π</sub>) vs. another variability estimator (here named θ<sub>H</sub>) that weights the number of segregating sites quadratically with the frequency of derived alleles. The normalized version of this test (Zeng et al., 2006) is:

$$H = \frac{\hat{\theta}\_{\Pi} - \hat{\theta}\_{H}}{\sqrt{\text{Var}(\hat{\theta}\_{\Pi} - \hat{\theta}\_{H})}}\tag{13}$$

For HTS data, we apply the same approach as for Tajima's D. The only difference is that we use the alternative definition of the numerator 2(θ<sub>Π</sub> − θ<sub>L</sub>), where θ<sub>L</sub> is Zeng's estimator, which is linear in the derived frequency (Zeng et al., 2006). An unbiased version of θ<sub>L</sub> for HTS data is

$$\hat{\theta}\_L = \sum\_{\mathbf{x}=1}^L \frac{\sum\_{j=1}^n c\_j(\mathbf{x})}{\mathcal{N}\_L(\mathbf{x}) \sum\_{j=1}^n r\_j(\mathbf{x})} \tag{14}$$

where the normalization factor

$$\mathcal{N}\_{L} = \sum\_{k=1}^{pn-1} \frac{1}{k} \sum\_{k\_1=0}^{p} \dots \sum\_{k\_n=0}^{p} \delta\_{k, k\_1 + \dots + k\_n} \frac{\prod\_{i=1}^{n} \binom{p}{k\_i}}{\binom{pn}{k}} \left[ 1 - \prod\_{i=1}^{n} \left( \frac{k\_i}{p} \right)^{r\_i(\mathbf{x})} \right] \tag{15}$$

is the probability that a segregating site is not interpreted as a fixed derived variant based on the reads. Note that δ<sub>i,j</sub> is the Kronecker delta, which is 1 if i = j and 0 otherwise.
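For small samples, Equations (14) and (15) can be evaluated by brute-force enumeration of all dosage configurations. The Python sketch below does exactly that; its cost grows as (p + 1)^n, so it is only meant to illustrate the formulas. The function names are hypothetical, and skipping sites without any reads is an assumption of this sketch.

```python
from itertools import product
from math import comb


def normalization_nl(depths, ploidy):
    """Equation (15) at one site; depths[i] is the read depth of individual i."""
    p, n = ploidy, len(depths)
    total = 0.0
    for dosages in product(range(p + 1), repeat=n):
        k = sum(dosages)
        if k < 1 or k > p * n - 1:
            continue  # only segregating configurations contribute
        config_prob = 1.0
        for k_i in dosages:
            config_prob *= comb(p, k_i)
        config_prob /= comb(p * n, k)
        # Probability that every read carries the derived allele.
        all_reads_derived = 1.0
        for k_i, r_i in zip(dosages, depths):
            all_reads_derived *= (k_i / p) ** r_i
        total += config_prob * (1.0 - all_reads_derived) / k
    return total


def theta_l(derived_counts, depths, ploidy):
    """Equation (14). derived_counts[x][j] = c_j(x), depths[x][j] = r_j(x)."""
    total = 0.0
    for c_site, r_site in zip(derived_counts, depths):
        reads = sum(r_site)
        if reads == 0:
            continue  # assumption: sites without reads carry no information
        total += sum(c_site) / (normalization_nl(r_site, ploidy) * reads)
    return total


# Two tetraploids at very high depth: the factor approaches a_{pn}.
print(normalization_nl([200, 200], ploidy=4))
```

At very high depth the bracket in Equation (15) tends to 1 and the factor reduces to the harmonic sum over k = 1 ... pn − 1, while with no reads at all it vanishes.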

An approximate version of the denominator of the test can be derived by inserting n<sub>eff</sub> in the standard denominator, as described above for Tajima's D.

#### 4. SMALL SAMPLES AND HARDY–WEINBERG VIOLATIONS IN THE SDS

For small autopolyploid samples, deviations from the neutral SFS cannot be clearly discriminated from violations of Hardy–Weinberg equilibrium. In fact, in the smallest possible sample of a single individual, the Dosage Distribution coincides with the SFS! More precisely, the SFS for a single individual corresponds to the heterozygous components of the Dosage Distribution averaged across sites. Hence, the features of the DD have a huge impact on the SFS.

This impact is two-fold. On a practical side, if it is not possible to estimate allelic dosage with sufficient accuracy, then uncertainties in individual dosage result in large uncertainties in the determination of allele frequencies, and therefore of the SFS. However in principle, even if dosage could be accurately inferred, the shape of the SFS for a few individuals would still be largely determined by the effect on the DD of the deviations from Hardy–Weinberg equilibrium. We will discuss such deviations in this section.

For diploid organisms there is only one possible direction for Hardy–Weinberg violation, i.e., excess or deficit of heterozygotes. However, in autopolyploids, many different deviations from Hardy–Weinberg equilibria are possible, resulting in different deviations from the neutral SFS. In fact, in this section we present four examples of possible mechanisms of violation of Hardy–Weinberg equilibrium which correspond to four different directions in the space of expected DDs. These examples are (i) inbreeding; (ii) inbreeding with mixed disomic/polysomic inheritance; (iii) heterozygote advantage; (iv) selection against recessive mutations. In tetraploids, combinations of these mechanisms span the whole space of all possible deviations from Hardy–Weinberg.

The shapes of the deviations of the expected DD from a Hardy–Weinberg equilibrium are shown for these mechanisms in **Figure 4**, both in tetraploids and hexaploids. The corresponding directions of the deviations of SFS-based tests from their null values are shown in the same figure for Tajima's D and Fay and Wu's H for a range of ploidy from 4 (tetraploids) to 10 (decaploids).

#### 4.1. Inbreeding

Inbreeding is a well-known cause of violation of Hardy–Weinberg equilibrium. Both in diploids and in polyploids, selfing and other mechanisms such as subpopulation structure cause a lack of heterozygotes, as discussed in relation to the Wahlund effect (Rosenberg and Calabrese, 2004).

As an example of its consequences on the DD, we can model a small rate of selfing in a population with polysomic inheritance by assuming an equilibrium in the DD given the frequency of the variant, with an approach similar to the one used in De Silva et al. (2005):

$$\mathcal{Z}\_k^{\text{eq}} = \sum\_{k'=0}^{p} \sum\_{k''=0}^{p} \mathcal{Z}\_{k'}^{\text{eq}} \mathcal{Z}\_{k''}^{\text{eq}} \sum\_{a=0}^{p} \text{Hyp}(a|k', p/2, p) \text{Hyp}(k - a|k'', p/2, p) \tag{16}$$

where Hyp(·) is the hypergeometric distribution that corresponds to the sampling of chromosomes in gametes. Note that all the Hardy–Weinberg equilibrium distributions

$$\mathcal{Z}\_k^{\text{eq}} = \binom{p}{k} f^k (1-f)^{p-k}$$

discussed before are solutions of the equation above. Here and in the rest of this section, we ignore the possibility of double reduction, since it requires a separate modeling of its impact on allele frequencies (Butruille and Boiteux, 2000).

Then we can perturb the equilibrium by occasional selfing events with a small probability p<sub>s</sub>, obtaining:

$$\Delta \mathcal{Z}\_k = -p\_s \mathcal{Z}\_k^{\rm eq} + p\_s \sum\_{k'=0}^p \mathcal{Z}\_{k'}^{\rm eq} \sum\_{a=0}^p \text{Hyp}(a|k', p/2, p) \text{Hyp}(k - a|k', p/2, p) \tag{17}$$

The shape of this violation of Hardy–Weinberg is shown in **Figure 4**. As expected, it results in an excess of homozygotes in the population. For a single individual, it has a positive impact on both Fay and Wu's H and Tajima's D. For tetraploids, the deviations from the null value are more apparent in H, while in organisms with ploidy higher than 6, violations tend to be larger in D.
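The random-mating equilibrium and the selfing perturbation of Equation (17) can be checked numerically for small ploidies. The following Python sketch is a direct transcription under the stated assumptions (polysomic inheritance, no double reduction); the function names are hypothetical and the code is not the R code used for **Figure 4**.

```python
from math import comb


def hyp(a, k, m, p):
    """Probability that a gamete of m chromosomes, drawn without replacement
    from a parent of ploidy p and dosage k, carries a copies of the variant."""
    if a < 0 or a > m:
        return 0.0
    return comb(k, a) * comb(p - k, m - a) / comb(p, m)


def hwe(p, f):
    """Binomial (Hardy-Weinberg) dosage distribution at variant frequency f."""
    return [comb(p, k) * f ** k * (1 - f) ** (p - k) for k in range(p + 1)]


def offspring(k1, k2, p):
    """Dosage distribution of the offspring of parents with dosages k1 and k2."""
    m = p // 2
    return [sum(hyp(a, k1, m, p) * hyp(k - a, k2, m, p) for a in range(m + 1))
            for k in range(p + 1)]


def random_mating(dd, p):
    """One generation of random mating applied to a dosage distribution."""
    out = [0.0] * (p + 1)
    for k1 in range(p + 1):
        for k2 in range(p + 1):
            for k, prob in enumerate(offspring(k1, k2, p)):
                out[k] += dd[k1] * dd[k2] * prob
    return out


def selfing_perturbation(dd_eq, p, p_s):
    """Equation (17): first-order change in the DD for a small selfing rate p_s."""
    selfed = [0.0] * (p + 1)
    for k1 in range(p + 1):
        for k, prob in enumerate(offspring(k1, k1, p)):
            selfed[k] += dd_eq[k1] * prob
    return [p_s * (selfed[k] - dd_eq[k]) for k in range(p + 1)]


eq = hwe(4, 0.5)
print(random_mating(eq, 4))               # unchanged by random mating
print(selfing_perturbation(eq, 4, 0.01))  # homozygote classes gain weight
```

For a tetraploid at f = 0.5, the perturbation is positive for dosages 0 and 4 and negative for the heterozygous classes, while the variant frequency is left unchanged.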

#### 4.2. Intermediate Disomic/Polysomic Inheritance

Not only the rates of selfing/outcrossing, but also the mode of inheritance could impact on the violation of Hardy–Weinberg. Mixed disomic/polysomic inheritance is an example of an alternative inheritance mode that appears to be less rare than expected (Meirmans and Van Tienderen, 2013).

Without inbreeding, partial disomic inheritance alone does not lead to violations of the Hardy–Weinberg equilibrium. Hence to study deviations from Hardy–Weinberg we model mixed disomic/polysomic inheritance but with a small selfing rate p<sub>s</sub>, similar to the case above. We denote the probability of disomic and polysomic inheritance by p<sub>2</sub> and 1 − p<sub>2</sub> respectively. For small selfing rate, it is easy to argue that the violations would be a combination of purely disomic and purely polysomic violations with weights p<sub>2</sub> and 1 − p<sub>2</sub> respectively, i.e.,

$$
\Delta \mathcal{Z}\_k = (1 - p\_2) \Delta \mathcal{Z}\_k^{\text{polysomic}} + p\_2 \Delta \mathcal{Z}\_k^{\text{disomic}} \tag{18}
$$

assuming that p<sub>s</sub> ≪ 1.

Purely disomic violations would satisfy similar equations as the purely polysomic ones in the previous section, although with slightly different inheritance terms. Similar to what happens in diploid organisms, sampling of the new generation occurs separately for each heterozygous pair of disomically homologous chromosomes:

$$\Delta \mathcal{Z}\_k = -p\_s \mathcal{Z}\_k^{\text{eq}} + p\_s \sum\_{k'=0}^p \mathcal{Z}\_{k'}^{\text{eq}} \sum\_{h=0}^{p/2} \frac{2^h \binom{p/2}{h;\, \frac{k'-h}{2};\, \frac{p-k'-h}{2}}}{\binom{p}{k'}} \binom{h}{\frac{k-k'+h}{2}} 2^{-h} \tag{19}$$

The corresponding shape of Hardy–Weinberg violations shown in **Figure 4** is similar to the one of selfing in polysomic organisms, but with an excess of homozygous pairs of disomically homologous chromosomes that translates into an excess in the components of even dosage in the spectrum. The impact on Fay and Wu's H and Tajima's D is similar to that of purely polysomic inheritance.

#### 4.3. Heterozygote Advantage

Heterozygote advantage, or overdominance, is a form of "hybrid vigor" where individuals heterozygous for the locus considered acquire a higher fitness than those provided by the two homozygous genotypes. For simplicity, we can assume the two differences in fitness to be the same. Unsurprisingly, this effect tends to increase the amount of intermediate-frequency alleles and heterozygotes (Kaplan et al., 1988).

Modeling selection dependent on the allelic dosage can be done via an approach similar to the one employed above, but is trickier. Selection is not a one-off or rare event but permanently perturbs the equilibrium I<sub>k</sub><sup>eq</sup>, hence a self-consistent version of the perturbative equations should be employed. Assigning a fitness φ<sub>k</sub> = 1 + s<sub>k</sub> to each allelic dosage, we obtain the equilibrium condition

$$\mathcal{Z}\_k^{\text{eq}} = \sum\_{k'=0}^p \sum\_{k''=0}^p \frac{\mathcal{Z}\_{k'}^{\text{eq}} \phi\_{k'} \mathcal{Z}\_{k''}^{\text{eq}} \phi\_{k''}}{\left(\sum\_{l=0}^p \mathcal{Z}\_l^{\text{eq}} \phi\_l\right)^2} \sum\_{a=0}^p \text{Hyp}(a|k', p/2, p) \text{Hyp}(k-a|k'', p/2, p) \tag{20}$$

We can then perturb at linear order in s<sub>k</sub> and compute ΔI<sub>k</sub> = I<sub>k</sub><sup>eq</sup> − I<sub>k</sub><sup>0</sup>, with I<sub>k</sub><sup>0</sup> being a solution of Equation (16). After using the fact that Σ<sub>k=0</sub><sup>p</sup> I<sub>k</sub><sup>0</sup> = 1, we obtain the linear system

$$\begin{aligned} \Delta \mathcal{Z}\_k = {} & 2 \sum\_{k'=0}^p \sum\_{k''=0}^p \mathcal{Z}\_{k'}^0 \left( \mathcal{Z}\_{k''}^0 s\_{k''} + \Delta \mathcal{Z}\_{k''} \right) \sum\_{a=0}^p \text{Hyp}(a|k', p/2, p) \text{Hyp}(k - a|k'', p/2, p) \\ & - 2 \mathcal{Z}\_k^0 \sum\_{l=0}^p \left( \mathcal{Z}\_l^0 s\_l + \Delta \mathcal{Z}\_l \right) \end{aligned} \tag{21}$$

This equation describes how perturbations to the neutral equilibrium driven by weak selection increase, which is a good proxy for the shape of Hardy–Weinberg violations in the DD.

An example of a fitness assignment that leads to heterozygote advantage is s<sub>k</sub> = s for k = 1 ... p − 1 but s<sub>0</sub> = 0, s<sub>p</sub> = 0. This gives a constant fitness advantage to all heterozygotes, independently of their dosage.

We report the Hardy–Weinberg violations for this example in **Figure 4**. As expected, heterozygote advantage increases the number of alleles at all frequencies while reducing homozygotes. Surprisingly enough, despite the intuition that the effect would be to increase Tajima's D due to the excess of intermediate-frequency variants, the final spectrum impacts negatively on Fay and Wu's H and only weakly on Tajima's D, as shown in **Figure 4**.
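The effect of a dosage-dependent fitness on the DD can be explored numerically with a simple selection-then-mating map, whose fixed points satisfy the equilibrium condition of Equation (20). The Python sketch below applies one generation of this map to a Hardy–Weinberg population; it is a rough illustration under polysomic inheritance without double reduction, not a solution of the linear system (21). The function names and the value s = 0.1 are hypothetical.

```python
from math import comb


def hyp(a, k, m, p):
    """Probability that a gamete of m chromosomes from a parent of ploidy p
    and dosage k carries a copies of the variant."""
    if a < 0 or a > m:
        return 0.0
    return comb(k, a) * comb(p - k, m - a) / comb(p, m)


def hwe(p, f):
    """Binomial (Hardy-Weinberg) dosage distribution at variant frequency f."""
    return [comb(p, k) * f ** k * (1 - f) ** (p - k) for k in range(p + 1)]


def selection_then_mating(dd, fitness, p):
    """Weight each dosage class by its fitness, then mate at random."""
    m = p // 2
    mean_fitness = sum(z * w for z, w in zip(dd, fitness))
    parents = [z * w / mean_fitness for z, w in zip(dd, fitness)]
    gametes = [sum(parents[k] * hyp(a, k, m, p) for k in range(p + 1))
               for a in range(m + 1)]
    return [sum(gametes[a] * gametes[k - a]
                for a in range(m + 1) if 0 <= k - a <= m)
            for k in range(p + 1)]


p, s = 4, 0.1
# Heterozygote advantage: s_k = s for k = 1 ... p-1, s_0 = s_p = 0.
fitness = [1.0] + [1.0 + s] * (p - 1) + [1.0]
start = hwe(p, 0.5)
after = selection_then_mating(start, fitness, p)
print([round(x - y, 5) for x, y in zip(after, start)])
```

In this tetraploid example both homozygous classes lose weight after a single generation. Any other fitness vector of length p + 1 can be passed to the same function.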

#### 4.4. Recessive Deleterious Mutations

It is possible to use the same approach as in the previous subsection to deal with selection against derived homozygotes. If the mutation is deleterious but recessive, there will be a fitness gap between the homozygotes for the derived allele, which would show the phenotypic effects of the mutation, and all other genotypes, that would not. This is another classical cause of violation of Hardy–Weinberg equilibrium, although in practice it is difficult to detect since the mutations involved tend to be at low frequency and therefore the lack of derived homozygotes could be attributed to the Hardy–Weinberg equilibrium itself.

The fitness assignment for a recessive deleterious allele is s<sub>p</sub> = −s but s<sub>k</sub> = 0 for k = 0 ... p − 1. This describes a selection pressure against derived homozygotes only.

The shape of the Hardy–Weinberg violations in this case shows the expected reduction in derived homozygotes and an excess in intermediate-dosage heterozygotes. This causes a reduction in Fay and Wu's H, as shown in **Figure 4**. Ironically, negative values of Fay and Wu's H are also one of the typical signatures of selection and genetic hitchhiking.

#### 5. DISCUSSION

In order to advance our understanding of the evolutionary processes affecting the genome of polyploid species, an important step is to gain a deeper knowledge of the way these processes modulate the fate of genetic variants, and consequently the levels and patterns of genetic variability. Two of the main descriptive statistics used in population genetics to summarize genetic variability are the SFS and the heterozygosity (h), which contain information on the global and internal allelic spectra, respectively. The expected patterns of these statistics have not been studied in detail for polyploids; that is especially true for many conditions commonly found in empirical studies of autopolyploid species, for instance small sample sizes and violations of the Hardy–Weinberg equilibrium such as inbreeding. In addition, understanding the expected patterns in commonly used statistics such as Tajima's D or Fay and Wu's H tests is of great relevance for the correct interpretation of the evolutionary processes occurring in autopolyploid populations. Typical patterns there could well be different from the expected patterns in diploid populations, simply because genetic and evolutionary processes have different peculiarities in the two cases.

Studies focused on the analysis of nucleotide variability in polyploid species present special difficulties in comparison to diploid species, as is extensively reviewed in Dufresne et al. (2014). These difficulties partly explain the relative scarcity of publications on HTS analysis of genomic variability among wild autopolyploid populations. Nevertheless polyploid plant species in particular are of great interest, given their high economic and strategic impact. In recent years there has been a proliferation of studies on related model species such as Arabidopsis (e.g., Hollister et al., 2012; Arnold et al., 2015), other relatively simple species (e.g., Cornille et al., 2016; Kasianov et al., 2017), but also economically important species with more complex genetics (e.g., Raman et al., 2014; Rocher et al., 2015; Kamneva et al., 2017; Krasileva et al., 2017). Although the number of relevant datasets deposited in sequence databases is constantly growing, their adequate analysis will require the further development of specific statistical tools, especially to infer sequence variability and population genomics.

In this manuscript we outlined the rich structure of frequency spectra in autopolyploids. The combination of global and internal spectra—i.e., mutation frequency in the population for the SFS, and allelic dosage in individuals for the SDS—contributes to the complexity of the polyploid SFDS.

The intricacy of the SFS structure and the challenges posed by its correct inference are possibly the reasons why this summary statistic has been given scant attention in polyploids so far (Dufresne et al., 2014; Meirmans et al., 2018), despite the fact that it represents one of the classical statistics in population genetics (Nielsen, 2005; Casillas and Barbadilla, 2017).

In this paper we also discussed some of the challenges related to the analysis of autopolyploid data generated by HTS technologies. However, our discussion is restricted to the simplified case of Hardy–Weinberg equilibrium, which is likely to be violated in many real populations of autopolyploid plants e.g., because of selfing. Even for purely outcrossing autopolyploid organisms, violations of Hardy–Weinberg could be caused by widespread mechanisms such as a large number of recessive deleterious alleles. Similarly, the interplay between the SFS and the Dosage Distribution has been discussed here only in the simplified case of small perturbations of Hardy–Weinberg equilibrium in a single individual. These assumptions allow us to present for the first time a systematic picture of the issues; on the other hand, more work is required to build a theoretical understanding of the SFDS and of SFS-based inference in polyploids, especially for small samples.

One of the most important consequences of the present work is the different interpretation of the neutrality tests under deviations from a neutral panmictic model in Hardy–Weinberg equilibrium (**Figure 4**). For a low number of samples, the SFS tends to be dominated by the SDS. Deviations from Hardy–Weinberg equilibrium within each individual distort the full SFS and result in values of neutrality tests that are different from those expected in diploid populations undergoing the same processes. For instance, heterozygote advantage in a small sample of diploid individuals is expected to result in an increase of heterozygotes and therefore a deviation of Tajima's D toward positive values. On the other hand, in a single autopolyploid individual with the same number of homologous chromosomes, this effect would be close to zero or negative. The reason is two-fold: homozygote alleles would not be classified as polymorphisms and therefore would not be included in the spectrum, while the impact of heterozygote advantage on dosage itself is complex. Generally speaking, the impact of Hardy–Weinberg violations on allelic dosage tends to affect deeply the SFS of the global sample when the sample size is small, complicating the interpretation of the results of neutrality tests. Note that the Hardy–Weinberg equilibrium is not reached in a single generation for autopolyploid species, leaving a longer signal in the genome patterns in relation to diploid species.

The role of allelic dosage uncertainties should be emphasized once more. Despite being challenging, the inference of individual genotypes (i.e., allelic dosage) by likelihood estimation can be obtained from HTS datasets using several algorithms. Recently, Maruki and Lynch (2017) developed a genotype calling algorithm that has proven useful for population genetic analysis. Nevertheless, accurate inference can only be obtained with high read depths and high cost, which usually implies the analysis of just a few individuals. Even in such a case, as shown in this paper, the inference of genotype likelihoods could be hindered by conservative assumptions on the Hardy–Weinberg patterns of the DD, which can generate systematic biases especially in relation to low frequency variants. Focusing on the analysis of variability, the real genotype of each individual is not as important as the pattern of the whole SFS, considering the uncertainties produced by deviations from Hardy–Weinberg equilibrium and other random processes. That is the reason why the equations presented here make performing genotype inference for each autopolyploid individual unnecessary.

Another reason why allelic dosage uncertainty is not a limitation for SFS inference can be illustrated by the following general argument. By definition, the frequency of an allele is the sum of its allelic dosages across individuals divided by the total number of homologous chromosomes in the sample, i.e., np. This implies a relation between frequencies and their uncertainties: more precisely, by classical probability arguments, the standard deviation of the frequency is the quadratic mean of the standard deviation of the allelic dosage divided by p√n. Hence, no matter how large the allelic dosage uncertainty is for each individual, the accuracy in the reconstruction of the frequency is always good for samples of large enough size. In fact, the maximum standard deviation of allelic dosage is p/2, i.e., the uncertainty in frequency is at most 1/(2√n). This means that 25 individuals are sufficient to estimate allele frequencies with an uncertainty of about 0.1, even in the worst-case estimate of allelic dosage uncertainties.
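The arithmetic behind this argument is short enough to check directly. The Python sketch below assumes independent dosage errors across individuals; the function name is illustrative.

```python
import math


def frequency_sd(dosage_sds, ploidy):
    """Standard deviation of the allele frequency, given the standard
    deviation of the dosage estimate of each individual."""
    n = len(dosage_sds)
    quadratic_mean = math.sqrt(sum(s * s for s in dosage_sds) / n)
    return quadratic_mean / (ploidy * math.sqrt(n))


p = 4
# Worst case: every individual has the maximum dosage uncertainty p/2.
print(frequency_sd([p / 2] * 25, p))  # 0.1
```

Since the worst-case bound 1/(2√n) does not depend on the ploidy, the same 25 individuals give the same bound for hexaploids or octoploids.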

How large the actual sample should be depends on the actual uncertainties in dosage and the evolutionary dynamics of the population. The typical uncertainties in dosage inference from HTS are expected to be around p/√r̄, where r̄ is the average read depth per individual, hence they decrease with the sequencing depth of the experiment. However, if the dynamics is driven by rare variants, a larger number of individuals is needed to obtain an accurate estimate of their frequency, since the unavoidable variance in frequency due to the sampling process of individuals from the whole population is between f(1 − f)/(pn) (under Hardy–Weinberg equilibrium) and f(1 − f)/n (if the Hardy–Weinberg conditions are strongly violated).
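These orders of magnitude can be tabulated for a planned experiment with a few lines of Python. The function names and the example values (a variant at frequency 0.05 in 50 tetraploids sequenced at depth 16) are hypothetical.

```python
import math


def sampling_variance_bounds(f, n, ploidy):
    """Range of the sampling variance of the frequency of a variant:
    f(1-f)/(pn) under Hardy-Weinberg equilibrium, f(1-f)/n if it is
    strongly violated."""
    under_hwe = f * (1 - f) / (ploidy * n)
    strongly_violated = f * (1 - f) / n
    return under_hwe, strongly_violated


def typical_dosage_sd(ploidy, mean_depth):
    """Order of magnitude of the dosage uncertainty, p / sqrt(mean depth)."""
    return ploidy / math.sqrt(mean_depth)


low, high = sampling_variance_bounds(f=0.05, n=50, ploidy=4)
print(math.sqrt(low), math.sqrt(high))  # standard deviations of the frequency
print(typical_dosage_sd(4, 16))         # 4 / sqrt(16) = 1.0
```

For a rare variant the sampling standard deviation is of the same order as the frequency itself, which is why read depth alone cannot make up for a small number of individuals.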

At present, the complexity of most analyses implies that good-quality population genetic data of samples of multiple autopolyploid organisms from the same natural population are hard to obtain. Most of the efforts so far were focused on the relation between different populations (Meirmans and Hedrick, 2011) and the comparison between different levels of ploidy, which require the sequencing of single samples from multiple populations. On a broader evolutionary scale, polyploidization during speciation and its evolutionary consequences were also studied in several biological systems (Parisod et al., 2010; Barker et al., 2016). However, there is a general lack of good datasets, and theoretical approaches to understand the microevolutionary picture are lagging behind (Dufresne et al., 2014; Meirmans et al., 2018), with the possible exception of linkage and QTL mapping. We hope that this paper will raise some awareness of the issues involved and clarify the relation between important quantities such as the frequency spectrum, the heterozygosity and the distribution of allelic dosage.

In conclusion, considering spectra of allelic dosage such as the SDS is of fundamental importance for the study of the evolutionary processes in autopolyploids. These internal spectra have a large impact on the global SFS for small sample sizes (for large sample size, the SFS can be reliably inferred and should not be strongly affected by Hardy–Weinberg violations). In this framework, we have proposed a set of estimators of variability and neutrality tests for autopolyploid HTS samples, based on well-known tests such as Tajima's D and Fay and Wu's H. Additionally, we have shown how different deviations from Hardy–Weinberg equilibrium and other uncertainties are reflected in the dosage distribution at the level of single individuals. In general, we bring attention to the importance of the study of the joint SFDS in polyploid species in order to correctly interpret the patterns of population variability.

#### AUTHOR CONTRIBUTIONS

LF and SR-O conceived the paper. LF and PR developed the theory. LF implemented it. LF, PR, and SR-O wrote the paper.

#### FUNDING

This work was supported by grant AGL2016-78709-R (MEC, Spain) to SR-O. We also acknowledge the financial support of the Spanish Ministry of Economy and Competitivity for the Center of Excellence Severo Ochoa 2016-2019 (SEV-2015-0533) grant awarded to the Center for Research in Agricultural Genomics and by the CERCA Programme/Generalitat de Catalunya. The Pirbright Institute receives grant-aided support from the Biotechnology and Biological Sciences Research Council of the United Kingdom (projects BB/E/I/00007035, BB/E/I/00007036 and BBS/E/I/00007039).

#### ACKNOWLEDGMENTS

We thank Emanuele Raineri and Miguel Pérez-Enciso for past discussions on the development of individual HTS estimators. We also thank the editors of this Special Issue on Polyploid Population Genetics and Evolution for their suggestion to submit this paper. Further details on the mathematical derivations and the R code used to generate the figures can be found in **Supplementary Material**.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00480/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ferretti, Ribeca and Ramos-Onsins. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Inferring Variation in Copy Number Using High Throughput Sequencing Data in R

#### Brian J. Knaus and Niklaus J. Grünwald\*

Horticultural Crops Research Unit, United States Department of Agriculture-Agricultural Research Service, Corvallis, OR, United States

Inference of copy number variation presents a technical challenge because variant callers typically require the copy number of a genome or genomic region to be known a priori. Here we present a method to infer copy number that uses variant call format (VCF) data as input and is implemented in the R package vcfR. This method is based on the relative frequency of each allele (in both genic and non-genic regions) sequenced at heterozygous positions throughout a genome. These heterozygous positions are summarized by using arbitrarily sized windows of heterozygous positions, binning the allele frequencies, and selecting the bin with the greatest abundance of positions. This provides a non-parametric summary of the frequency that alleles were sequenced at. The method is applicable to organisms that have reference genomes that consist of full chromosomes or sub-chromosomal contigs. In contrast to other software designed to detect copy number variation, our method does not rely on an assumption of base ploidy, but instead infers it. We validated these approaches with the model system of Saccharomyces cerevisiae and applied it to the oomycete Phytophthora infestans, both known to vary in copy number. This functionality has been incorporated into the current release of the R package vcfR to provide modular and flexible methods to investigate copy number variation in genomic projects.

Keywords: bioinformatics, computational biology, copy number variation (CNV), high throughput sequencing (HTS), Phytophthora, ploidy, R package

#### Edited by:

Hans D. Daetwyler, La Trobe University, Australia

#### Reviewed by:

Beniamino Trombetta, Sapienza Università di Roma, Italy

Garrett McKinney, University of Washington, United States

#### \*Correspondence:

Niklaus J. Grünwald nik.grunwald@ars.usda.gov

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 31 January 2018 Accepted: 26 March 2018 Published: 13 April 2018

#### Citation:

Knaus BJ and Grünwald NJ (2018) Inferring Variation in Copy Number Using High Throughput Sequencing Data in R. Front. Genet. 9:123. doi: 10.3389/fgene.2018.00123

# INTRODUCTION

Investigations into the variation in the number of copies of genes, chromosomes, or genomes are well-established research topics, yet they continue to present technical challenges to molecular genetic analysis. Many examples provide evidence of how copy number affects the phenotype. For example, schizophrenia in humans is thought to be caused by variation in copy number of certain genes (Sekar et al., 2016). Presence of an additional chromosome (aneuploidy) results in Down syndrome in humans (Hassold and Hunt, 2001). Existence of an extra copy of all chromosomes (triploidy) is used in agriculture to produce sterile organisms such as seedless watermelons (Varoquaux et al., 2000) or sterile salmon (Johnstone, 1992; Cotter et al., 2000). Whole genome duplication (polyploidy) results in every chromosome being duplicated, a phenomenon observed throughout plants, animals, and fungi (Todd et al., 2017; Van de Peer et al., 2017). Although this phenomenon is well established, it presents a challenge to high throughput sequencing projects in that most popular genomic variant callers, such as the GATK (DePristo et al., 2011) or FreeBayes (Garrison and Marth, 2012), require the a priori specification of how many alleles to call. While the inference of copy number may be an important precursor to point mutation discovery, many authors argue that copy number variation may be more abundant throughout a genome than point mutations (Katju and Bergthorsson, 2013), making it an important facet in the investigation of genomic architectures.

Existing software for determining the number of copies at a locus from high throughput sequencing data can be broadly classified into two categories: copy number variation detection and whole genome ploidy inference. The important difference among these categories is the form of data they use. Copy number variation detection software uses per position sequence depth (Yoon et al., 2009; Abyzov et al., 2011; Klambauer et al., 2012; Li et al., 2012) while whole genome ploidy inference software uses the relative frequency of the two most abundant alleles sequenced at a locus (Zohren et al., 2016; Gompert and Mock, 2017; Weiß et al., 2018). Copy number variation detection methods group the per position sequence depth into windows and attempt to sort these into base-ploid (typical depth) windows or windows that deviate from base-ploid. They generally require the investigator to specify a priori what copy level the base-ploid state is. If the research question is to determine how many copies occur at the base-ploid state, these methods will not be appropriate. Whole genome ploidy inference methods use the frequency at which the two most abundant alleles were sequenced at heterozygous positions, or allele balance, and summarize this information throughout the genome. (Here we use the term 'allele balance' where other authors have used 'allele frequency' to distinguish the measure from the use of 'allele frequency' in population genetics.) For example, for heterozygous alleles we would expect to observe an approximate frequency of one half for diploids, ratios of thirds for triploids, and ratios of quarters for tetraploids (**Figure 1**). Whole genome ploidy inference uses all of the genomic information to infer a single copy number for the entire genome. A third hybrid method uses allele balance (referred to as allelic ratio) and heterozygosity to assign copy number to populations of data (McKinney et al., 2017). However, if the research question is to explore copy number variation within a population, this method will not be relevant. Therefore, there are at least two distinct approaches to determine the number of copies present in genomes, and more currently being proposed, each with different strengths and limitations.
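As a small illustration of the allele balance expectations described above, the following base R sketch lists the values expected at heterozygous positions for a given copy number. The helper name expected\_balance() is hypothetical and is not part of vcfR or of any of the cited software.

```r
# Hypothetical helper (not part of vcfR): expected allele balance values
# at heterozygous positions for an individual with `n` copies.
# A heterozygous position carries k copies of one allele and n - k copies
# of the other, with k = 1, ..., n - 1, so the expected values are k / n.
expected_balance <- function(n) {
  stopifnot(n >= 2)
  seq_len(n - 1) / n
}

# Print the expectations for diploids, triploids, and tetraploids.
for (n in 2:4) {
  cat(n, "copies:", paste(round(expected_balance(n), 3), collapse = ", "), "\n")
}
```

Observed values scatter around these expectations because of sampling and sequencing noise, which is why the data are summarized rather than compared position by position.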

Our research presented us with the need to determine if copy number varied throughout genomes, where we did not have prior knowledge of what the actual base-ploidy might be. We therefore combined the windowing functionality from copy number variation detection methods with the allele balance concept from whole genome ploidy inference methods. We use a non-parametric approach to infer copy number given that empirical explorations of available data indicated that common distributions, particularly at low sequence depth, do not fit well. Our method is implemented in a new update to the package vcfR in the R software environment (R Core Team, 2018). R is an established and growing language facilitating the analysis of population genetic and genomic data (Paradis et al., 2017a,b). We demonstrate the utility of this method using genomes from the model fungus Saccharomyces cerevisiae and our ongoing work with the oomycete plant pathogen Phytophthora infestans. Both of these organisms show variation in ploidy across individuals as well as within regions within a genome.

# MATERIALS AND METHODS

#### Methodology

We developed new functionality added to the current release of the vcfR package that can be used to infer copy number or ploidy in R. We initially developed vcfR for VCF data import/export, quality control, visualization and general manipulation (Knaus and Grünwald, 2017). vcfR now includes a range of new functions useful for binning variants into windows, summarizing the frequency that alleles were sequenced at, and assigning a closest expected copy number value to these windows (**Table 1**).

FIGURE 1 | Allele balance (e.g., the distribution of the frequency at which the most abundant allele and the second most abundant allele were sequenced) at heterozygous positions in three Saccharomyces cerevisiae genomes. For each heterozygous genotype the frequency at which the most abundant allele was sequenced at (light blue) and the frequency at which the second most abundant allele was sequenced at (dark blue) were recorded. This information was then summarized with a histogram. Expectations for allele balance are 1/2 for diploids, 1/3 and 2/3 for triploids, and 1/4, 1/2, and 3/4 for tetraploids. This approach provides a dominant copy number for each genome but no information about variation within each genome. Expectations and critical values for binning allele balance information are presented below the histograms.

High throughput sequencing (HTS) projects on populations typically result in variant calls that may include single nucleotide polymorphisms (SNPs), indels, and inversions. Output from popular variant callers is presented in files that adhere to the variant call format (VCF) specification (Danecek et al., 2011). This specification provides the option to include counts for how many times each allele was sequenced for each genotype. For example, output from the GATK's HaplotypeCaller (McKenna et al., 2010) includes allele depth (AD) as a comma delimited string of counts. This VCF data can be imported into R using our function read.vcfR(). Once any desired quality control steps have been performed on the data (Knaus and Grünwald, 2017), such as omitting variants of unusual sequence coverage, this allele depth data can be extracted using the vcfR function extract.gt(). We then use the function is\_het() to set homozygous positions in the allele depth matrices as missing data (NA) so we can focus our analysis on the heterozygous positions. The allele depth is reported as a comma delimited string, the individual elements of which can be isolated with the function masplit(). Dividing the count for each allele by the sum of the counts for the two most abundant alleles results in the frequency at which each allele was sequenced, or allele balance. This data can now be plotted as histograms for visualization.
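The arithmetic behind allele balance can be sketched in base R without vcfR. The example below is only an illustration under stated assumptions: the AD strings are toy values, the helper nth\_count() is hypothetical, and homozygous positions are approximated as those where the second allele has no read support, whereas is\_het() works from the called genotypes.

```r
# Toy allele depth (AD) strings as they might appear in VCF data
# (illustrative values only): one row per variant, one column per sample.
ad <- matrix(c("30,31", "40,0", "21,39,1", NA),
             ncol = 1, dimnames = list(NULL, "sample1"))

# Hypothetical helper: the n-th largest count in each comma delimited string.
nth_count <- function(x, n) {
  vapply(strsplit(x, ",", fixed = TRUE), function(v) {
    v <- sort(as.numeric(v), decreasing = TRUE)  # sort() drops missing values
    if (length(v) < n) NA_real_ else v[n]
  }, numeric(1))
}

first  <- nth_count(ad[, "sample1"], 1)  # most abundant allele
second <- nth_count(ad[, "sample1"], 2)  # second most abundant allele

# Simplifying assumption: no reads for the second allele means homozygous,
# so the position is set to missing and excluded from the analysis.
second[!is.na(second) & second == 0] <- NA

# Allele balance: each count divided by the sum of the two largest counts.
ab1 <- first / (first + second)
ab2 <- second / (first + second)
```

With real data, vectors such as ab1 and ab2 could be passed to hist() to produce histograms like those in Figure 1.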

TABLE 1 | Functions available to analyze copy number variation and mixed copy number data in the current release of vcfR.

Determining copy number for sub-genomic regions requires the genome to be divided into sub-genomic windows and, because this typically results in many windows per sample, it requires a numeric method of summarizing this data. This goal is accomplished with the function freq\_peak(). This function takes as input a matrix of allele balance data, as described above, a vector of chromosomal positions for each variant, a window size, and a bin width for summarizing the allele balance values. The vector of chromosomal positions is used to assign variants to windows. The window size specifies how large the genomic windows should be. This will in part be based on the frequency of heterozygous positions observed in the target sample as well as a balance between the conflicting desires for small windows that provide fine scale resolution and large windows that provide a large number of variants (i.e., support) for a determination. Within each window the allele balance values are summarized by bins from 0 to 1 and of the width specified by the bin width parameter. The bin with the greatest number of variants is selected as the peak location. Here, again, a balance must be found between resolution (small bins) and support (large bins). Default values are provided based on what we have determined to work in our study systems, but we highly encourage adjusting the parameters based on the specifics of each project. These parameters are expected to be context specific to each study system. This function returns three matrices, one containing the window coordinates, one containing the peak locations and one containing the count of variants that resulted for each window. The matrix of variant counts per window can be used to help determine optimal window size and to censor windows that resulted in a low number of variants. The peaks can then be assigned to their nearest expected value (1/5, 1/4, 1/3, 1/2, 2/3, 3/4, 4/5) using the function peak\_to\_ploid(). This is accomplished by using critical values that are half way between each expected value (**Figure 1**). Once a copy number has been assigned, its confidence is measured by creating a distance from expectation. The distance from expectation is the observed value minus the expectation it was assigned to, which is then divided by the critical value on the side of the expectation where the observed value was (**Figure 1**). Dividing by the critical value scales the difference from expectation from zero (exactly at our expectation) to one (half way between expectations). This can also be used to remove border cases where the observed value is intermediate to the expected values and we therefore have low confidence in the determination. The results from the function freq\_peak() can be visualized using freq\_peak\_plot(). This last function was inspired in part by BAF plots (Laurie et al., 2010).
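The windowing and peak selection described above can be sketched in base R. This is a minimal illustration of the idea, not the vcfR implementation: the helper window\_peak(), the toy positions, and the toy allele balance values are all assumptions made for the example.

```r
# Hypothetical helper (not part of vcfR): midpoint of the most populated
# allele balance bin within one window.
window_peak <- function(ab, bin_width = 0.02) {
  ab <- ab[!is.na(ab)]
  if (length(ab) == 0) return(NA_real_)
  n_bins <- round(1 / bin_width)
  breaks <- seq(0, 1, length.out = n_bins + 1)
  bin    <- findInterval(ab, breaks, rightmost.closed = TRUE)
  counts <- tabulate(bin, nbins = n_bins)
  i      <- which.max(counts)            # first bin with the greatest count
  (breaks[i] + breaks[i + 1]) / 2
}

# Toy data (illustrative values only): variant positions in bp and the
# allele balance of one allele at heterozygous positions in one sample.
pos <- c(100, 2500, 4000, 7000, 12000, 15000, 18000, 19000)
ab  <- c(0.49, 0.47, 0.49, 0.33, 0.33, 0.33, 0.35, 0.31)

winsize <- 10000
win_id  <- (pos - 1) %/% winsize + 1     # window index of each variant

peaks  <- vapply(split(ab, win_id), window_peak, numeric(1))
counts <- lengths(split(ab, win_id))     # variants supporting each window

# Assign each peak to its nearest expected value; choosing the nearest
# expectation is equivalent to using critical values half way between them.
expected <- c(1/5, 1/4, 1/3, 1/2, 2/3, 3/4, 4/5)
nearest  <- vapply(peaks, function(p) expected[which.min(abs(expected - p))],
                   numeric(1))
```

In this toy example the first window peaks near 1/2 and the second near 1/3. With only four variants per window the support is far too low for a real determination, which is the trade-off between window size and support discussed above.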

Theoretical population genetics is based largely on haploid and diploid organisms. Investigations into populations that consist of higher ploidy individuals, or populations with a mixture of copy numbers, present a methodological challenge in that few applications are available to analyze them. We have extended Nei's GST (Nei, 1973, 1987) and Hedrick's G'ST (Hedrick, 2005) to address this challenge. These measures of population subdivision are based on ratios of heterozygosity. Because heterozygosity is based on the number and type of alleles found in a population it provides a convenient way to analyze populations of mixed copy number. Our implementation is inspired by the implementation in adegenet (Jombart, 2008) which weights the heterozygosities by their sample size. This is an attempt to correct for unbalanced sample sizes, situations where a different number of individuals were sampled from different populations. We instead weight the heterozygosities by the observed number of alleles in each population to correct both for unbalanced samples and for instances where individuals vary in copy number. An unbalanced design occurs when different amounts of data are collected for different populations. For example, one sample may have consisted of 20 individuals while another may have only consisted of 10. This imbalance may have occurred due to logistical reasons or technical issues in sample preparation. When copy number is unknown, the investigator may sample the same number of individuals in the populations, but if one population turns out to have four copies where the other has only two, the population with four copies will have twice as much information as the other. Weighting each population by the number of alleles observed is an attempt to mitigate these issues. The function genetic\_diff() uses a vcfR object and a factor that indicates population membership (VCF data typically does not include population information) and returns a table including heterozygosities, Nei's GST, and Hedrick's G'ST.
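To make the weighting concrete, the following base R sketch computes the quantities for a single biallelic locus in two populations that differ in copy number. It is an illustration under assumptions rather than the code of genetic\_diff(): the allele counts are toy values, total heterozygosity is taken from the pooled allele counts, and the maximum heterozygosity uses the standard form of Hedrick's (2005) standardization for k populations.

```r
# Toy allele counts at one biallelic locus (illustrative values only).
# Population a: 10 diploid individuals (20 alleles observed).
# Population b: 10 tetraploid individuals (40 alleles observed).
counts <- rbind(a = c(ref = 14, alt = 6),
                b = c(ref = 10, alt = 30))

n_alleles <- rowSums(counts)         # alleles observed per population
freqs     <- counts / n_alleles      # within-population allele frequencies
h_pop     <- 1 - rowSums(freqs^2)    # heterozygosity per population

# Within-population heterozygosity weighted by the number of alleles
# observed, so the tetraploid population counts twice as much here.
hs <- sum(n_alleles * h_pop) / sum(n_alleles)

# Total heterozygosity from the pooled allele counts (assumption).
p_total <- colSums(counts) / sum(counts)
ht <- 1 - sum(p_total^2)

gst <- (ht - hs) / ht                # Nei's GST

# Hedrick's standardization for k populations.
k       <- nrow(counts)
ht_max  <- (k - 1 + hs) / k          # maximum total heterozygosity
gst_max <- (ht_max - hs) / ht_max    # maximum GST given hs
gprime  <- gst / gst_max             # Hedrick's G'ST
```

Weighting by individuals instead of alleles would give both populations equal weight in this example, even though the tetraploid population contributed twice as many allele observations.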

#### Example Data

To demonstrate our method, we tested it on three data sets. The first data set consisted of three samples of Saccharomyces cerevisiae (CBS7837, CBS2919, and CBS9564) from Zhu et al. (2016) that were reported as diploid, triploid and tetraploid by Weiß et al. (2018). We also included an additional sample (YJM1098) that was reported by Zhu et al. (2016) as being predominantly diploid but demonstrating aneuploidy for chromosome XII. These samples represent an organismal system where the genome is of relatively small size (12 Mbp), high quality (in its 64th revision; Engel et al., 2014) and where the samples were sequenced with a goal of attaining 80X sequence depth with Illumina GAII reads.

A second data set consisted of two samples of the plant pathogen Phytophthora infestans (99189 and 88069) that were reported by Weiß et al. (2018) as being diploid and triploid. The P. infestans system represents a more modestly sized genome (240 Mbp) that remains in its first draft (Haas et al., 2009), but where the samples were sequenced with the intent of attaining 100X sequence depth for each haplotype using Illumina HiSeq 3000 sequencing (Weiß et al., 2018).

The third dataset included 17 samples of P. infestans and one sample of P. mirabilis collected from the literature, subset to Supercontig\_1.50, and made available as an R package (Knaus and Grünwald, 2017). This represents a set of samples that were of more typical sequence depth for genomics projects than we might expect from investigations that were specifically interested in copy number.

For the first two datasets, the data were downloaded from the NCBI sequence read archive and FASTQ data were extracted using the sratoolkit. These reads were mapped to the yeast genome (S288C) or the P. infestans genome (T30-4) using bwa 0.7.10-r789 mem (Li, 2013). The resulting SAM file had mate pair information updated, was sorted and converted to BAM format using samtools 1.3.1 (Li et al., 2009). Duplicates were marked using picard-tools-2.5.0 and the files were indexed using samtools. For each sample, a g.VCF file was created from its BAM file using the GATK's (3.5-0-g36282e4) HaplotypeCaller (McKenna et al., 2010). Read processing for the pinfsc50 dataset was described previously (Knaus and Grünwald, 2017). Briefly, the reads were mapped using bwa mem and variants were called using the GATK's HaplotypeCaller resulting in VCF data. The g.VCF and VCF data were processed in vcfR (Knaus and Grünwald, 2017) with the methods described above, using the functions freq\_peak(), peak\_to\_ploid(), and freq\_peak\_plot(). For the S. cerevisiae samples, a window size of 40 kbp was used while a window size of 200 kbp was used for the P. infestans samples.

# Performance

We assessed performance of our method over a range of genome sizes. Data used for the benchmarking were subset from the 99189 P. infestans sample including the entire data set (240 Mbp genome) and subsets of this dataset to represent genomes of 100, 10, and 1 Mbp. Each data set was processed 20 times and this processing was implemented using an R markdown script. The use of R markdown, as opposed to a pure R script, likely incurred a performance cost as our timing included the compilation of the R markdown to a web page. We advocate that using tools like R markdown should be considered a best practice and hope that this will characterize typical use. Benchmarking was performed on an Intel® Core™ i7-4790 CPU at 3.60 GHz with 32 GB of RAM running Ubuntu 16.04 LTS. Results were visualized in R and a linear regression was performed using the R function stats::lm().

# RESULTS

### Implementation

A new update for the R package vcfR was recently released including several new functions (**Table 1**). The function freq\_peak() returns the peaks called for each window as well as diagnostic information. The data in VCF files only includes information for the variable positions. This means that not all positions in a window will be present in VCF data. A lookup table is created and returned that includes the genomic coordinates for each window, the row number of the first and last rows of VCF data that were analyzed, and the genomic position of the first and last variant in each window. This information is intended to coordinate comparisons among data extracted from VCF files and genomic windows. A matrix of variant counts per sample and window is also provided. Because heterozygosity may not be known and some windows may have mapping issues (e.g., high variant counts) or regions of loss of heterozygosity or a high number of missing or ambiguous nucleotides in the reference (low variant counts), this information can be used to help determine optimal window size for a particular organism. Furthermore, this approach can help identify anomalous regions in the genome that may require further scrutiny. Lastly, a matrix of frequencies of allele balance is generated.
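A lookup table of this kind can be sketched in base R as follows. The column names and toy positions are hypothetical and chosen only for illustration; they are not claimed to match the object returned by freq\_peak(). In this sketch, windows that contain no variants are simply absent from the table.

```r
# Toy variant positions on one chromosome (bp), sorted as in VCF data.
pos     <- c(120, 950, 4800, 10200, 10900, 31000)
winsize <- 10000

# Assign each variant (a row of the VCF data) to a window.
win_id <- (pos - 1) %/% winsize + 1
rows   <- split(seq_along(pos), win_id)
w      <- as.integer(names(rows))

# Hypothetical lookup table: window coordinates, first and last VCF rows,
# and position of the first and last variant in each window.
lookup <- data.frame(
  window    = w,
  win_start = (w - 1) * winsize + 1,
  win_end   = w * winsize,
  first_row = unname(vapply(rows, min, integer(1))),
  last_row  = unname(vapply(rows, max, integer(1))),
  first_pos = unname(vapply(rows, function(i) min(pos[i]), numeric(1))),
  last_pos  = unname(vapply(rows, function(i) max(pos[i]), numeric(1)))
)

# Variants per window, useful when choosing a window size.
lookup$n_variants <- lookup$last_row - lookup$first_row + 1L
```

Because the row numbers index the VCF data directly, a table like this allows any per-variant matrix extracted from the same file to be subset by window.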

Results of the above process can be visualized and postprocessed to obtain copy number calls and quality assessment. The function freq\_peak\_plot() can be used to visualize the combined VCF derived data and the results of the windowing and peak calling operations. Because the result is a simple data structure (a list of matrices) the universe of R packages that can be used with matrix data are also available to explore the data. The data can also be post processed with the function peak\_to\_ploid(), which converts the allele balance frequency data to an integer copy number as well as distances from expectation:

$$\text{Distance from expectation} = \frac{\text{observed allele balance} - \text{expected value}}{\text{critical value}}$$

The distance from expectation is the observed allele balance frequency minus the frequency expected based on the final determination. This value is then divided by its bin width, that is, the distance from the expectation to the critical value on the same side (**Figure 1**), in order to scale it from zero to one, where zero represents an allele balance that is exactly on our expectation (e.g., 1/4, 1/3, 1/2, etc.) and one is half way between two expectations. This value can then be used as a measure of confidence in our copy number determination and to omit border cases (instances where the distance from expectation is close to one).
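A worked example may help here. The base R sketch below interprets the denominator as the distance between the assigned expectation and the critical value on the same side, which is an assumption made so that the result runs from zero to one as described; the helper dist\_from\_expectation() is hypothetical and is not the code of peak\_to\_ploid().

```r
# Expected allele balance values and the critical values half way between them.
expected <- c(1/5, 1/4, 1/3, 1/2, 2/3, 3/4, 4/5)
critical <- (head(expected, -1) + tail(expected, -1)) / 2

# Hypothetical helper: scaled distance between an observed peak `p` and the
# expectation it is assigned to.
dist_from_expectation <- function(p) {
  i <- which.min(abs(expected - p))   # nearest expectation
  e <- expected[i]
  if (p == e) return(0)
  # Index of the critical value on the side of the expectation where p lies.
  j <- if (p > e) i else i - 1
  # Outside the outermost expectations there is no critical value to scale by.
  if (j < 1 || j > length(critical)) return(NA_real_)
  (p - e) / (critical[j] - e)
}

# A peak at 0.49 sits close to 1/2; a peak at 0.40 is assigned to 1/3 but
# lies most of the way toward the critical value at 5/12.
sapply(c(0.50, 0.49, 0.40), dist_from_expectation)
```

Under this interpretation, a peak at 0.49 is assigned to 1/2 with a distance of (0.49 - 0.5) / (5/12 - 0.5) = 0.12, while a peak at 0.40 is assigned to 1/3 with a distance of 0.8 and would be a candidate border case.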

#### Saccharomyces cerevisiae Dataset

Analysis of the Saccharomyces cerevisiae dataset validated previous reports and revealed new features. The S. cerevisiae samples were sequenced at about 100X at variable positions (**Figure 2**) making it a high coverage dataset. The samples were determined to consist of individuals that were predominantly diploid (CBS7837), triploid (CBS2919), and tetraploid (CBS9564), confirming previous reports (**Figure 1**; Weiß et al., 2018). The samples had a heterozygosity of around 0.003–0.008 heterozygous positions per site (**Figure 3**). Because the variant caller (the GATK's HaplotypeCaller) tends to aggressively call variants, this estimate may include false positives and therefore may be an overestimate of the true biological value. We have previously discussed strategies we feel may improve the quality of called variants to attain a production data set (Knaus and Grünwald, 2017). Current functionality in vcfR allowed for convenient reproduction of figures previously reported (**Figure 4**; Zhu et al., 2016) that indicated intragenomic variation in copy number. This copy number variation was demonstrated to be minor relative to the entire genome (**Figure 5**), indicating that while sample YJM1098 may be predominantly diploid, it still contains variation that would not be apparent from whole genome summaries. The use of the vcfR functions freq\_peak() and peak\_to\_ploid() provided a sliding window analysis that revealed intragenomic variation in copy number. **Figure 6** demonstrated the results of the function freq\_peak\_plot() that revealed a sample that appeared diploid, but contains regions of low heterozygosity such that inferences cannot be made (CBS7837 chromosome XI at around 200 kbp and around 350 kbp). The sample CBS2919 appeared predominantly triploid, consistent with previous findings (Weiß et al., 2018), but also included a region on chromosome VII from its origin to around 400 kbp that appeared to have four copies. The sample CBS9564 was reported by Weiß et al. (2018) to be tetraploid, which is in agreement with our results, but also appeared to have regions on chromosome IX that had three or five copies. These findings confirm previous reports and also reveal that new information can be found by investigating specific regions within each genome.

#### Phytophthora infestans Dataset

The two P. infestans samples were sequenced at almost 200X (99189) and 300X (88069) or approximately 100X per expected chromosome (**Figure 7**; Weiß et al., 2018). The genomes had heterozygosities of around 0.003–0.006 heterozygous positions per site (**Figure 8**). Because the variant caller tends to aggressively call variants, this estimate may include false positives and therefore may be an overestimate of the true biological value.

Examination of the genomic distribution of allele balance values confirmed the report of Weiß et al. (2018) that isolate 99189 was predominantly diploid while 88069 was predominantly triploid (**Figure 9**). However, through windowing across the supercontig, we were able to observe that while isolate 99189 does appear to be predominantly diploid, a large portion of its supercontig\_1.29 appears to have three copies (**Figure 10**) demonstrating previously uncharacterized intragenomic variation in copy number.

### Pinfsc50 Dataset

The pinfsc50 dataset provides an opportunity to evaluate data with more moderate and more typical lower read depths. This data represents samples for a population of P. infestans at supercontig 50 that were sequenced at between ca. 10X and 70X coverage (**Figure 11**). The distribution of allele balance values for these samples (**Figure 12**) demonstrated a range of copy numbers from diploid (e.g., strain P17777us22) to triploid (strain P13626). However, several samples (e.g., strains P1362 or t30-4) appeared to be ambiguous as to their copy number. This demonstrates that not all samples that have been sequenced from typical sequencing projects may be of suitable quality for copy number determination.

### Population Differentiation

The function genetic\_diff() calculates genetic differentiation for mixed copy number populations (**Table 2**). It retains the chromosome and position information from the VCF data to maintain the coordinate system. Heterozygosities as well as the number of alleles observed in each population are returned. If the number of alleles in the data is unknown, these allele counts may be used to summarize it. For larger data sets, quantiles can be calculated to identify loci of unusual allele counts. The function reports GST, maximum heterozygosity, and maximum GST, and uses these to calculate G'ST. The returned data structure is a simple data.frame, which should facilitate further analysis and presentation of this information with the universe of R functionality.

#### Performance

Regression analysis revealed that execution time scaled linearly with genome size (**Figure 13**). There was a highly significant relationship between execution time and genome size (**Table 3**) indicating that our benchmarking may be a good predictor of how the method will perform with other genomes.

bottom panel is chromosome IX from sample CBS9564. This chromosome appears to consist of regions that have three copies as well as regions with five copies.

FIGURE 7 | The distribution of sequence depths at variable positions for P. infestans samples produced by Weiß et al. (2018). These plots are similar to the S. cerevisiae plots in that most of the genome appears to have been sequenced at a base ploidy level, but long tails indicate that regions above and below this level exist.

FIGURE 10 | Supercontig\_1.29 of P. infestans isolate 99189 appears predominantly triploid in contrast to the rest of its genome that appeared to be diploid (compare with Figure 9). Values of 0 (no read support for the allele) and 1 (all reads support one allele) are expected to be homozygous calls. Because this is an analysis of heterozygous positions these have been omitted from this plot.

# AVAILABILITY

Version 1.7.0 of the package vcfR had been released at the time of submission of this manuscript and contains all of the novel features described here. This version is available on CRAN (https://CRAN.R-project.org/package=vcfR) and at the Grünwald lab's GitHub site (https://github.com/grunwaldlab/vcfR). More information and example code can be found at: https://knausb.github.io/vcfR\_documentation/. Data and scripts used to produce figures in this manuscript are available at the project's Open Science Framework site (Knaus and Grünwald, 2018).

# REQUIREMENTS


• R version 3.0.1 or greater and vcfR 1.7.0.

# INSTALLATION

At the R console, vcfR can be installed from CRAN and then loaded as follows:

install.packages('vcfR')

library('vcfR')

# DISCUSSION

Numerous studies have used high throughput sequencing to study genetic diversity in populations based on genotypes, or single nucleotide polymorphisms, inferred by variant callers. To our knowledge there is currently no variant caller that can infer the number of alleles to call. Instead, the investigator must specify the number of alleles to call a priori. Here we present novel methodologies to infer genomic and subgenomic copy number using HTS data as well as to visualize these data in the R environment.

TABLE 2 | Genetic differentiation as reported by the function genetic\_diff().

The chromosome (CHROM) and position (POS) are retained from the VCF data. Heterozygosities for each population (a and b) and total heterozygosity are reported. The number of alleles (n\_a, n\_b) observed in each population are reported. Lastly, GST, maximum heterozygosity (Htmax), maximum GST (Gstmax) and G'ST (Gprimest) are calculated.

Our method builds on existing methods by using a sliding window approach to infer copy number based on the frequency that the most abundant and second most abundant alleles were sequenced at. While we designed this method to work with VCF data (Danecek et al., 2011) using the R package vcfR (Knaus and Grünwald, 2017), we feel an important role of our method is to help make this data available to the existing universe of R packages. VCF data only includes information on variable positions within the genome. We therefore produce a lookup table to identify which genomic windows variants belong to. Other functions convert the VCF data into numeric matrices. In theory, this information could be used to implement other functionality, such as applying mixture models (Leisch and Gruen, 2012; Fraley et al., 2012) to the data. It also means that other visualization tools available to the R environment can be used beyond those provided here. Because characterization of copy number may be challenging in certain regions of the genome, e.g., regions rich in transposable elements or problematic assemblies, we provided the count of heterozygous positions for each window as well as the distance from expectation. These metrics provide tools to help judge whether certain regions may have well predicted copy numbers or which regions may require further investigation.
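As an illustration of how these two metrics might be used together, the base R sketch below sets copy number calls to missing where a window has little support or lies far from its expectation. The per-window values and both thresholds are hypothetical; suitable thresholds are expected to depend on the study system.

```r
# Toy per-window results for one sample (illustrative values only).
copy_number <- c(2, 2, 3, 2, 3)                   # assigned copy number
het_count   <- c(45, 3, 60, 38, 52)               # heterozygous positions per window
dfe         <- c(0.10, 0.20, 0.15, 0.85, 0.05)    # distance from expectation

# Hypothetical thresholds chosen only for this example.
min_count <- 20    # minimum number of heterozygous positions
max_dfe   <- 0.5   # maximum tolerated distance from expectation

# Censor windows with too little support or with border-case peaks.
censored <- copy_number
censored[het_count < min_count | dfe > max_dfe] <- NA
censored
```

Here the second window is censored for low support and the fourth because its peak lies close to a critical value, leaving three windows with calls.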



The intercept was not significantly different from zero while the slope was highly significantly different from zero.

The existing methods most similar to ours include those of Zohren et al. (2016), Gompert and Mock (2017), and Weiß et al. (2018) because they are all based on the frequency that alleles were sequenced at. Zohren and colleagues used allele balance (which they referred to as allelic ratio) and fit betabinomial distributions to model diploid individuals and betabinomial mixture models (the fitting of multiple distributions to a population of data) to model triploid and tetraploid individuals. Likelihoods for each ploidy model were compared using AIC (Akaike, 1974), resulting in a single ploidy call for each sample. R code to implement their method is available at Dryad. Gompert and Mock model the ratio of the abundance of the nonreference allele (from biallelic SNPs) to the total number of reads sequenced at each variant using binomial distributions in a Bayesian framework resulting in a single ploidy call for each sample. Their method is implemented in R using rjags (Plummer, 2016) and is available on CRAN as the package gbs2ploidy. The method of Weiß and colleagues is similar to that of Zohren and colleagues in that it employs mixture models; however, it differs in that it uses Gaussian components. It also differs in that it is written in C and designed to work on the BAM files as opposed to heterozygous positions determined by a variant caller. Because it is implemented in a compiled language it is very fast relative to the R implementations. It is also unique in that it employs a uniform noise component. The sample CBS7837 in **Figure 1** has a well-defined peak, yet the base of the peak varies almost from zero to one indicating a substantial amount of data that deviates from any of our expectations. Similarly, the sample CBS2919 in **Figure 1** has two well defined peaks but the data does not go to zero between these peaks. This phenomenon can be seen in Zohren and colleagues' **Figure 2** and Yoshida et al. 
(2013) **Figure 8** and is part of our justification for the use of a non-parametric method. Weiß and colleagues fit this uniform component in an attempt to capture the noise in the data leaving the putatively cleaner data for their Gaussian mixture model. Their software is available on GitHub in the repository named nQuire.
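All of the methods compared above rest on the same expectation: at a heterozygous position, the fraction of reads carrying one allele should cluster around k/ploidy. The following minimal Python sketch lists those expected peak positions. It is an illustration only and is not taken from any of the cited tools; the function name `expected_allele_balance` is hypothetical.

```python
from fractions import Fraction


def expected_allele_balance(ploidy):
    """Return the allele balance values expected at heterozygous sites.

    For a site heterozygous in an individual of the given ploidy, an allele
    can be present in 1 .. ploidy - 1 copies, so its expected read fraction
    is k / ploidy for each of those copy numbers.
    """
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2")
    return [Fraction(k, ploidy) for k in range(1, ploidy)]


if __name__ == "__main__":
    for ploidy in (2, 3, 4):
        peaks = ", ".join(str(f) for f in expected_allele_balance(ploidy))
        print(f"ploidy {ploidy}: {peaks}")
```

Because the tetraploid peaks include the diploid peak at 1/2, peak positions alone cannot separate the models, which is one reason the model-based methods above compare likelihoods rather than simply matching peaks.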

The method presented has been designed to work with VCF data (Danecek et al., 2011) that contains the number of times each allele was sequenced for each variant. In theory, any method that produces a valid VCF file, or the counts of times the most abundant and second most abundant allele were sequenced in a format that can be read into R, can be analyzed. While the examples presented here are based on whole genome sequencing our method should be applicable to data generated with reduced representation libraries. For example, we've also used the method with genotyping-by-sequencing data (Elshire et al., 2011) processed with TASSEL (Bradbury et al., 2007). However, there are some practical matters to consider. This is an analysis of heterozygous positions. Homozygous positions will appear similar regardless of copy number and are uninformative. Organisms that are inbred or have a mode of reproduction that includes selfing may have a low density of heterozygous positions making inferences using our method challenging. The use of reduced representation libraries may also contribute to a lower number of observed heterozygous positions requiring use of larger windows ultimately resulting in a lower resolution to the inference of copy number variation.
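To make the input requirement concrete, the sketch below extracts allele counts from a single VCF sample field and computes the balance of the two most abundant alleles. It assumes the variant caller writes per-allele read depths in an `AD` FORMAT field and genotypes in `GT`; the function name `allele_balance` and the choice to normalize by the sum of the two most abundant alleles are illustrative assumptions, not a description of the published implementation.

```python
def allele_balance(format_field, sample_field):
    """Return (major, minor) read fractions for one heterozygous genotype.

    format_field: the VCF FORMAT column, e.g. "GT:AD:DP"
    sample_field: the matching sample column, e.g. "0/1:12,4:16"
    Returns None for missing, homozygous, or zero-depth genotypes, which are
    uninformative for this kind of analysis.
    """
    record = dict(zip(format_field.split(":"), sample_field.split(":")))

    # Keep heterozygous calls only; phased and unphased separators both occur.
    alleles = record.get("GT", ".").replace("|", "/").split("/")
    if "." in alleles or len(set(alleles)) < 2:
        return None

    depths_raw = record.get("AD")
    if depths_raw is None:
        return None
    depths = sorted(
        (int(x) for x in depths_raw.split(",") if x != "."), reverse=True
    )
    if len(depths) < 2:
        return None

    # Use the most abundant and second most abundant alleles.
    first, second = depths[0], depths[1]
    total = first + second
    if total == 0:
        return None
    return first / total, second / total


if __name__ == "__main__":
    print(allele_balance("GT:AD:DP", "0/1:12,4:16"))
    print(allele_balance("GT:AD:DP", "0/0:20,0:20"))
```

Applied across all heterozygous positions in a window, the resulting fractions are the raw material for the peak-finding step.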

A variety of methods is currently available for the analysis of high-throughput sequencing data, and these methods vary in performance. This variation exists in de novo assembly software (Earl et al., 2011; Bradnam et al., 2013), variant callers (Pabinger et al., 2014), copy number variation callers (Duan et al., 2013; Pabinger et al., 2014), and metagenomic pipelines (Edgar, 2017). This diversity is likely due to the nascent nature of the data and methods used to analyze it. We hope our method will contribute to the analysis of CNV, but also hope it will stimulate the development of new tools or the integration of these existing methods into new tools to explore copy number variation. Perhaps future improvements can be found by integrating sequence coverage and allele balance data as some authors have already done graphically (Zhu et al., 2016).

#### AUTHOR CONTRIBUTIONS

BK conceived the project, wrote code, wrote the documentation, and wrote the manuscript. NG conceived the project, coordinated the collaborative effort, discussed interpretation, wrote the manuscript, and obtained funding.

#### FUNDING

This research is supported in part by U.S. Department of Agriculture (USDA) Agricultural Research Service Grant 5358-22000-039-00D and USDA National Institute of Food and Agriculture Grant 2011-68004-30154.

#### ACKNOWLEDGMENTS

Mention of trade names or commercial products in this manuscript is solely for the purpose of providing specific information and does not imply recommendation or endorsement.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Knaus and Grünwald. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Tools for Genetic Studies in Experimental Populations of Polyploids

#### Peter M. Bourke, Roeland E. Voorrips, Richard G. F. Visser and Chris Maliepaard\*

Plant Breeding, Wageningen University & Research, Wageningen, Netherlands

Polyploid organisms carry more than two copies of each chromosome, a condition rarely tolerated in animals but which occurs relatively frequently in the plant kingdom. One of the principal challenges faced by polyploid organisms is to evolve stable meiotic mechanisms to faithfully transmit genetic information to the next generation upon which the study of inheritance is based. In this review we look at the tools available to the research community to better understand polyploid inheritance, many of which have only recently been developed. Most of these tools are intended for experimental populations (rather than natural populations), facilitating genomics-assisted crop improvement and plant breeding. This is hardly surprising given that a large proportion of domesticated plant species are polyploid. We focus on three main areas: (1) polyploid genotyping; (2) genetic and physical mapping; and (3) quantitative trait analysis and genomic selection. We also briefly review some miscellaneous topics such as the mode of inheritance and the availability of polyploid simulation software. The current polyploid analytic toolbox includes software for assigning marker genotypes (and in particular, estimating the dosage of marker alleles in the heterozygous condition), establishing chromosome-scale linkage phase among marker alleles, constructing (short-range) haplotypes, generating linkage maps, performing genome-wide association studies (GWAS) and quantitative trait locus (QTL) analyses, and simulating polyploid populations. These tools can also help elucidate the mode of inheritance (disomic, polysomic or a mixture of both as in segmental allopolyploids) or reveal whether double reduction and multivalent chromosomal pairing occur. An increasing number of polyploids (or associated diploids) are being sequenced, leading to publicly available reference genome assemblies. Much work remains in order to keep pace with developments in genomic technologies. 
However, such technologies also offer the promise of understanding polyploid genomes at a level which hitherto has remained elusive.

Keywords: polyploid genetics, polyploid software tools, autopolyploid, allopolyploid, segmental allopolyploid

# INTRODUCTION

One of the most fundamental descriptions of any organism is its ploidy level and chromosome number, generally written in the form 2n = 2x = 10 (here, for the ubiquitous model plant species Arabidopsis thaliana L.). Plant scientists in particular will be familiar with this representation of the chromosomal constitution of the sporophyte generation (i.e., the adult plant). The second term in this seemingly simple equation describes the normal complement of chromosomal copies possessed by a member of that species, which is generally 2× ("two times") for diploids. Species where this number exceeds two are collectively referred to as polyploids. Not unexpectedly, each polyploid individual is the product of the fusion of gametes from two parents, just like its diploid counterparts. In other words, polyploids can also be defined as individuals derived from non-haploid gametes (in the case of triploids derived from diploid × tetraploid crosses, only one gamete satisfies this condition). The transmission of non-haploid gametes is one of the main "complexifying" features of polyploidy, leading to a whole range of implications for the genetic analysis of these "hopeful monsters" (Goldschmidt, 1933).

#### Edited by:

Hans D. Daetwyler, La Trobe University, Australia

#### Reviewed by:

Keiichi Mochida, RIKEN Center for Sustainable Resource Science (CSRS), Japan

Patricio Ricardo Munoz, University of Florida, United States

> \*Correspondence: Chris Maliepaard chris.maliepaard@wur.nl

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Plant Science

> Received: 25 January 2018 Accepted: 04 April 2018 Published: 18 April 2018

#### Citation:

Bourke PM, Voorrips RE, Visser RGF and Maliepaard C (2018) Tools for Genetic Studies in Experimental Populations of Polyploids. Front. Plant Sci. 9:513. doi: 10.3389/fpls.2018.00513

The ongoing genomics revolution can be seen as a rising tide which has also lifted the polyploid genetics boat, although not quite to the same level as for diploids. Most genetic advances are made in model organisms, among which self-fertilizing diploid species predominate. It is therefore not surprising that most tools and techniques for molecular-genetic studies are specific to diploids. However, polyploid species are particularly important to mankind in the provision of food, fuel, feed, and fiber (not to mention "flowers," if ornamental plant species are also included), making the genetic analysis of polyploid species an important avenue of research for crop improvement.

Although a collective term such as "polyploidy" has its uses, it tends to obscure some fundamental differences between its members. For example, polyploids are generally subdivided into autopolyploids and allopolyploids (Kihara and Ono, 1926). Autopolyploids arise through genomic duplication within a single species, generally through the production of unreduced gametes (Harlan and De Wet, 1975) and exhibit polysomic inheritance, meaning pairing and recombination can occur between all homologous copies of each chromosome during meiosis. One of the most well-studied examples is autotetraploid potato (Solanum tuberosum L.). Allopolyploids, on the other hand, are the product of genomic duplication between species [usually through hybridisation involving unreduced gametes (Harlan and De Wet, 1975)] and display disomic inheritance, where more-related chromosome copies ("homologs") may pair and recombine during meiosis, whilst less-related chromosome copies ["homoeologs," also spelled "homeologs" (Glover et al., 2016)] do not. Among allopolyploids, allohexaploid wheat (Triticum aestivum L.) is probably the most well-studied. If pairing and recombination between homoeologs occurs to a limited extent, the species may be referred to as "segmental allopolyploid" (Stebbins, 1947), traditionally deemed to have arisen from hybridisation between very closely related species (Stebbins, 1947; Chester et al., 2012) but which may also be the result of partially diploidised autopolyploidy (Soltis et al., 2016). In many cases, a species cannot be clearly designated as one type or another, leading to uncertainty or debate on the subject (Barker et al., 2016; Doyle and Sherman-Broyles, 2016). From the perspective of genetics and inheritance, allopolyploids behave much like diploid species and therefore many of the tools developed for diploids can be directly applied. 
The main challenge that faces allopolyploid geneticists is in distinguishing between homoeologous gene copies carried by sub-genomes within an individual (Kaur et al., 2012; van Dijk et al., 2012; Rothfels et al., 2017). Autopolyploids (and segmental allopolyploids) do not behave like diploids, and are therefore in most need of specialized methods and tools for subsequent genetic studies. In this review we focus primarily on the availability of tools and resources amenable to polysomic [and "mixosomic" (Soltis et al., 2016)] species, with less emphasis on allopolyploid-specific solutions. Although the development of novel methodologies for the genetic analysis of polyploids is interesting, without translation into a software tool for use by the research community such methodologies remain purely conceptual, with limited impact. We therefore try to limit our attention to the tools currently available rather than cataloging descriptions of unimplemented methods.

Experimental populations, in use since Mendel's groundbreaking work (Mendel, 1866), are traditionally derived from a controlled cross between two parental lines of interest (either directly studying the F<sub>1</sub> or some later generation). We use the term here to distinguish our subject matter from "wild" or "natural" populations, which would necessitate sampling individuals from an extant population in the wild. Quantitative genetics, particularly the genetics of human pathology, has greatly benefitted from the use of large panels of individuals to perform so-called "genome-wide association studies" (GWAS). The use of such panels offers to complement the experimental toolbox of polyploid geneticists as well, and although perhaps not strictly speaking an "experimental" population, we consider them relevant to the current discussion.

Here, we review three main areas: (1) polyploid genotyping, including the scoring of marker dosage (allele counts) and generation of haplotypes; (2) genetic and physical mapping, where we look at the possibilities for linkage mapping as well as the availability of reference sequences; and (3) quantitative trait analysis and genomic selection, including tools that perform quantitative trait locus (QTL) analysis in bi-parental populations, genome-wide association analysis (GWAS) and genomic selection and prediction. We also consider the current tools to simulate polyploid organisms for in silico studies, as well as those that can help determine the mode of inheritance of the species being studied. We reflect on current and future developments, and the tools that will be needed to keep pace with the innovations we are witnessing in genomic technologies.

#### POLYPLOID GENOTYPING

One of the most crucial aspects in the study of polyploid genetics is the generation of accurate genotypic data. However, it is also fraught with difficulties, not least the detection of multiple loci when only a single locus is targeted (Mason, 2015; Limborg et al., 2016). Various technologies exist, with almost all current applications aimed at identifying single nucleotide polymorphisms (SNPs). Although many genomic "service-providers" (e.g., companies or institutes that offer DNA sequencing) have their own tools to analyze and interpret raw data, these tools are not always suitable for use with polyploid datasets. Gel-based marker technologies continue to be used and retain certain advantages (e.g., low costs associated with small marker numbers, requiring only basic laboratory facilities, multi-allelism etc.). However, most studies now rely on SNP markers for genotyping due to their great abundance over the genome, their high-throughput capacity and their low cost per data point. Targeted genotyping such as SNP arrays (a.k.a. "SNP chips") rely on previously identified and selected polymorphisms, usually identified from a panel of individuals chosen to represent the gene pool under investigation. In contrast, untargeted genotyping generally uses direct sequencing of individuals, albeit after some procedure to reduce the amount of DNA to be sequenced [e.g., by exome sequencing (Ng et al., 2009) or target enrichment (Mamanova et al., 2010)]. The disadvantages of targeted approaches have been well explored (particularly regarding ascertainment bias, where the set of targeted SNPs on an array poorly represents the diversity in the samples under investigation due to biased methods of SNP discovery) (Albrechtsen et al., 2010; Moragues et al., 2010; Didion et al., 2012; Lachance and Tishkoff, 2013), although there are advantages and disadvantages to both methods (Mason et al., 2017). Apart from costs, differences exist in the ease of data analysis following genotyping, with sequencing data requiring greater curation and bioinformatics skills (Spindel et al., 2013; Bajgain et al., 2016) as well as potentially containing more erroneous and missing data (Spindel et al., 2013; Jones et al., 2017).

In polyploids, SNP arrays have been developed in numerous species [recently reviewed by (You et al., 2018)], which include both autopolyploid (or predominantly polysomic polyploids) and allopolyploid species. Examples of the former include alfalfa (Li et al., 2014), chrysanthemum (van Geest et al., 2017b), potato (Hamilton et al., 2011; Felcher et al., 2012; Vos et al., 2015), rose (Koning-Boucoiran et al., 2015) and sour cherry (Peace et al., 2012). Examples of allopolyploid SNP arrays include cotton (Hulse-Kemp et al., 2015), oat (Tinker et al., 2014), oilseed rape (Dalton-Morgan et al., 2014; Clarke et al., 2016), peanut (Pandey et al., 2017), strawberry (Bassil et al., 2015) and wheat (Akhunov et al., 2009; Cavanagh et al., 2013; Wang et al., 2014; Winfield et al., 2016). Untargeted approaches such as genotyping using next-generation sequencing have also been applied, for example in autopolyploids such as alfalfa (Zhang et al., 2015; Yu et al., 2017), blueberry (McCallum et al., 2016), bluestem prairie grass (Andropogon gerardii) (McAllister and Miller, 2016), cocksfoot (Dactylis glomerata) (Bushman et al., 2016), potato (Uitdewilligen et al., 2013; Sverrisdóttir et al., 2017), sugarcane (Balsalobre et al., 2017; Yang et al., 2017b) and sweet potato (Shirasawa et al., 2017), and in allopolyploids such as coffee (Moncada et al., 2016), cotton (Islam et al., 2015; Reddy et al., 2017), intermediate wheatgrass (Thinopyrum intermedium) (Kantarski et al., 2017), oat (Chaffin et al., 2016), prairie cordgrass (Spartina pectinata) (Crawford et al., 2016), shepherd's purse (Capsella bursa-pastoris) (Cornille et al., 2016), wheat (Poland et al., 2012; Edae et al., 2015), and zoysiagrass (Zoysia japonica) (McCamy et al., 2018) (noting that the precise classification of some of these species as auto- or allopolyploids has yet to be conclusively determined). 
Whatever the technology used, it is clear that we are currently witnessing an explosion of interest in polyploid genomics. However, the critical issue of how to make sense of this data remains, starting with the assignment of marker dosage, a.k.a. "genotype calling."

#### Assignment of Dosage

One of the key distinguishing features of polysomic polyploidy is the fact that there are multiple heterozygous conditions possible in genotyping data. We use the term marker "dosage" to denote the minor allele count of a marker; a species of ploidy q possesses q + 1 distinct dosage classes in the range 0 to q (**Figure 1**). Of course the concept of marker dosage could also be used in diploid species, but coding systems such as the lm × ll / nn × np / hk × hk system (Van Ooijen, 2006) predominate. Marker dosage is generally understood to apply to bi-allelic markers (such as single SNPs), although it is conceivable to score marker dosage at multi-allelic loci. If marker dosage cannot be accurately assessed, genotypes would likely have to be dominantly scored (i.e., all heterozygous classes would be grouped with one of the homozygous classes), resulting in a loss of information (Piepho and Koch, 2000).
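The relationship between ploidy, dosage classes, and dominant scoring can be written out in a few lines. The Python sketch below is illustrative only: the function names are hypothetical, and the class names beyond "nulliplex" and "simplex" (duplex, triplex, quadruplex) are the conventional tetraploid terms rather than names defined in this review.

```python
TETRAPLOID_NAMES = ["nulliplex", "simplex", "duplex", "triplex", "quadruplex"]


def dosage_classes(ploidy):
    """A species of ploidy q has q + 1 dosage classes: 0, 1, ..., q."""
    return list(range(ploidy + 1))


def dominant_score(dosage):
    """Collapse a dosage to presence (1) or absence (0) of the marker allele.

    Every class carrying at least one copy becomes indistinguishable from
    the homozygous-present class, which is the loss of information that
    accurate dosage scoring avoids.
    """
    return 0 if dosage == 0 else 1


if __name__ == "__main__":
    for dosage in dosage_classes(4):
        print(dosage, TETRAPLOID_NAMES[dosage], "->", dominant_score(dosage))
```

For a tetraploid, five dosage classes collapse to two observable classes under dominant scoring.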

All available dosage-calling tools rely on a population in order to determine marker dosage. In other words, calibration between the various dosage classes is performed across the population (for which we are not implying any degree of relatedness in the population other than coming from the same species). All current tools are designed to process genotyping data from SNP arrays, using the relative strength of two allele-specific (fluorescent) signals to assign a discrete dosage value. With increasing interest in genotyping using next generation sequencing (GNGS), we anticipate that tools which use read-counts of potentially multiple SNPs (or multi-SNP haplotypes) will soon be developed, although these have yet to appear. One of the current challenges under investigation regarding GNGS-based genotype calling is the accurate determination of dosage (Kim et al., 2016), which may require relatively deep sequencing [e.g., 60–80 × coverage estimated in autotetraploid potato (Uitdewilligen et al., 2013)].
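Why read-count-based dosage calling may need deep sequencing can be illustrated with a simple binomial model. The sketch below is a hypothetical illustration, not the method of any tool cited here: it assumes reads sample the two alleles in proportion to dosage, a symmetric per-read error rate (`error_rate`, an assumed parameter), and a flat prior over dosage classes.

```python
import math


def dosage_posteriors(alt_reads, total_reads, ploidy, error_rate=0.01):
    """Posterior probability of each dosage class given allele read counts.

    Assumes binomial sampling of reads, with the expected alternative-allele
    fraction equal to dosage / ploidy, blurred by a symmetric error rate so
    that the homozygous classes never have probability exactly 0 or 1.
    """
    log_liks = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        log_liks.append(
            alt_reads * math.log(p_alt)
            + (total_reads - alt_reads) * math.log(1 - p_alt)
        )
    # Normalize in a numerically stable way (flat prior over classes).
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    # Same observed allele fraction (0.3) at two sequencing depths.
    for alt, depth in ((3, 10), (24, 80)):
        post = dosage_posteriors(alt, depth, ploidy=4)
        print(depth, [round(p, 3) for p in post])
```

In this toy model the same observed allele fraction supports a simplex call only weakly at 10 reads but strongly at 80 reads, in line with the deep coverage reported as necessary for tetraploid potato.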

Returning to the SNP array-based tools, the two main service providers for high-density SNP arrays, Illumina and Affymetrix, both offer proprietary software solutions for analyzing polyploid datasets. Affymetrix's Power Tools and Illumina's GenomeStudio (with its Polyploid Genotyping Module) have both been developed with both diploid and polyploid datasets in mind. However, there have also been a number of genotyping tools that have been put into the public domain. One of the first of these to be released was fitTetra (Voorrips et al., 2011), a freely available R package (R Core Team, 2016) designed to assign genotypes to autotetraploids that were genotyped on either Illumina's Infinium or Affymetrix's Axiom arrays. fitTetra fits mixture models to bi-allelic SNP intensity ratios either under the constraint of Hardy-Weinberg equilibrium within the population, or as an unconstrained fit, using an expectation-maximization (EM) algorithm in fitting. This can have the drawback of requiring significant computational resources for high-density marker datasets, although it is automated and can therefore process large datasets in a single run. The original release was specific to tetraploid data only. However, an updated version (fitPoly) can process genotyping data of all ploidy levels and has recently been made available as a separate R package on CRAN<sup>1</sup>. The SuperMASSA application (Serang et al., 2012) can also process data from all ploidy levels (as it was initially developed to dosage-score sugarcane data, notorious for its cytogenetic complexity) and is currently hosted online by the Statistical Genetics Laboratory in the University of São Paulo, Brazil. One of the interesting features of SuperMASSA is that prior knowledge of the exact ploidy level is not needed (useful for a crop like sugarcane). Instead, the genotype configuration which maximizes the posterior probability across all specified ploidy levels is chosen. In practice, most researchers will already know the ploidy of their samples (although aneuploid progeny in some species may occur) and can constrain the model search. A drawback of the online implementation is that markers are analyzed one-by-one, and results need to be copied from the webpage each time. However, a command-line version of SuperMASSA is currently under development.
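To illustrate what a Hardy-Weinberg constraint on a dosage mixture model can look like, the sketch below computes the expected frequency of each dosage class from a single allele frequency. It assumes polysomic inheritance with random chromosome segregation and no double reduction, in which case the class frequencies are binomial. This is a generic illustration with a hypothetical function name and should not be read as the fitTetra implementation.

```python
from math import comb


def hwe_dosage_frequencies(allele_freq, ploidy):
    """Expected dosage class frequencies under Hardy-Weinberg equilibrium.

    Assumes random chromosome segregation without double reduction, so the
    number of copies of the allele follows Binomial(ploidy, allele_freq).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must lie between 0 and 1")
    return [
        comb(ploidy, d) * allele_freq**d * (1 - allele_freq) ** (ploidy - d)
        for d in range(ploidy + 1)
    ]


if __name__ == "__main__":
    for d, freq in enumerate(hwe_dosage_frequencies(0.5, 4)):
        print(d, freq)
```

In a constrained fit, the mixing proportions of the components would be tied to one parameter in this way, whereas an unconstrained fit estimates each proportion freely.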

The R package polysegRatioMM (Baker et al., 2010) generates marker dosages for dominantly scored markers using the JAGS software (Plummer, 2003) for Markov Chain Monte Carlo (MCMC) generation. Fully polysomic behavior is assumed, and segregation ratios of marker data are used to derive the most likely parental scores. Although able to process data from all even ploidy levels, the software only considers a subset of marker types (markers that are nulliplex in one parent or simplex in both parents). Nowadays, there is a move away from dominantly scored markers to co-dominant marker technologies like SNPs, and parental samples are usually included in multiple replicates (and so can be genotyped directly with offspring, rather than imputed from the offspring). The package is therefore of questionable use for modern genotyping datasets. An unrelated R package, beadarrayMSV (Gidskehaug et al., 2010), was developed to handle Illumina Infinium SNP array data from "diploidising" tetraploid species such as the Atlantic salmon. The software was designed to score markers which target multiple loci (so-called multi-site variants, or MSVs), as well as single-locus markers displaying disomic inheritance. In a comparison with fitTetra, beadarrayMSV was unable to accurately genotype autotetraploid data from potato, although conversely fitTetra performed poorly on salmon data (Voorrips et al., 2011). This demonstrates that appropriate software is needed for specific situations (indeed, in many cases specific scenarios have motivated the development of specialized software).

Having prior knowledge about the expected meiotic behavior of the species is always advantageous when it comes to analyzing any polyploid data. This is especially true for the latest dosage-calling software to be released, the ClusterCall package for R (Schmitz Carley et al., 2017). Here, prior knowledge of the meiotic behavior of the species is required, since the expected segregation ratios of an F<sub>1</sub> autotetraploid population are used to assign dosage scores to the clusters identified through hierarchical clustering. In well-behaved autotetraploids such as potato (Swaminathan and Howard, 1953; Bourke et al., 2015) this is arguably not a problem (as long as skewed segregation does not occur), and indeed can lead to increased accuracy in genotype calling (Schmitz Carley et al., 2017). However, in less well-characterized species such as leek, alfalfa, or many ornamental species, the precise meiotic behavior may not always follow the expected tetrasomic model, causing potential problems with fitting. The authors are aware of this and suggest that alternatives like fitTetra or SuperMASSA be used in circumstances where a tetrasomic model no longer holds. Unfortunately, such prior knowledge is not always available before genotyping takes place; meiotic behavior can even differ between individuals of a species that was thought to display meiotic homogeneity (e.g., complete tetrasomy) (Bourke et al., 2017).
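The expected segregation ratios referred to above follow from simple combinatorics when inheritance is fully tetrasomic. The Python sketch below derives them under the assumptions of random bivalent pairing, no double reduction, and no skewed segregation. The function names are hypothetical and the code is a textbook calculation, not the ClusterCall implementation.

```python
from fractions import Fraction
from math import comb


def gamete_distribution(parent_dosage, ploidy=4):
    """Probability of each gamete dosage for a parent with the given dosage.

    A gamete receives ploidy / 2 of the parent's homologs at random, so the
    number of allele copies it carries is hypergeometric.
    """
    half = ploidy // 2
    total = comb(ploidy, half)
    dist = {}
    for g in range(half + 1):
        ways = comb(parent_dosage, g) * comb(ploidy - parent_dosage, half - g)
        if ways:
            dist[g] = Fraction(ways, total)
    return dist


def f1_segregation(dosage_p1, dosage_p2, ploidy=4):
    """Expected F1 dosage class frequencies for a cross of two parents."""
    offspring = {}
    for g1, p1 in gamete_distribution(dosage_p1, ploidy).items():
        for g2, p2 in gamete_distribution(dosage_p2, ploidy).items():
            offspring[g1 + g2] = offspring.get(g1 + g2, 0) + p1 * p2
    return dict(sorted(offspring.items()))


if __name__ == "__main__":
    for cross in ((1, 0), (2, 0), (1, 1), (2, 2)):
        ratios = {d: str(p) for d, p in f1_segregation(*cross).items()}
        print(cross, ratios)
```

A simplex × nulliplex cross yields the familiar 1:1 ratio and a duplex × nulliplex cross yields 1:4:1. Clusters whose sizes depart strongly from such ratios are one sign that the tetrasomic model may not hold.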

#### Haplotype Assembly

Although bi-allelic SNP markers have many practical advantages, they carry less inheritance information than multi-allelic markers. Crop researchers and breeders often wish to develop a simple diagnostic marker test for a trait of interest. Unfortunately, the chance of having a single SNP in complete linkage disequilibrium with a favorable or causative allele of a gene of interest is very small. Markers which have been found to uniquely "tag" a favorable allele in one population may not do so in another. For more than a decade, the increased power of haplotype-based associations has been known and reported in human genetic studies (Zhang et al., 2002; de Bakker et al., 2005), with the term "haplotype" denoting a unique stretch of sequence. Translating haplotyping approaches from diploid to polyploid species has been a non-trivial exercise, requiring novel algorithms to handle the overwhelming range of possibilities that can arise [especially when allowing for sequencing errors and (possible) recombinations]. Multi-SNP haplotypes can be assembled from single dosage-scored SNPs (originating from SNP array data), although haplotypes are more commonly generated using overlapping sequence reads (**Figure 2**).

A number of different polyploid haplotyping tools (for sequence reads) have been developed in recent years, including polyHap (Su et al., 2008), SATlotyper (Neigenfind et al., 2008), HapCompass (Aguiar and Istrail, 2013), HapTree (Berger et al., 2014), SDhaP (Das and Vikalo, 2015), SHEsisplus (Shen et al., 2016), and TriPoly (Motazedi et al., unpublished). Three of these tools (HapCompass, HapTree, and SDhaP) were recently compared and evaluated over a range of different simulated read depths, ploidy levels and insert sizes for paired-end reads (Motazedi et al., 2017). The authors found that each of these software programs had particular advantages, for example HapTree was found to produce more accurate haplotypes for triploid and tetraploid data, whilst HapCompass performed best at higher ploidies (6× and higher) (Motazedi et al., 2017). Both SHEsisplus and TriPoly have yet to be independently tested. For allopolyploid species, the user-friendly Haplotag software has been designed to identify both single SNPs and multi-SNP haplotypes from genotypes developed using next generation sequencing data (Tinker et al., 2016). An interesting feature is the use of a simple "heterozygosity filter" that excludes haplotypes with higher than expected heterozygosity across a population (suggesting paralogous loci). Currently, however, data from outcrossing or autopolyploid species is not suitable for this software.

<sup>1</sup>https://cran.r-project.org/package=fitPoly

The input data of haplotyping software can be grouped into two types. Individual SNP genotyping data (with a known marker order) was used by the first wave of polyploid haplotyping implementations such as polyHap and SATlotyper. More recently, haplotyping tools use sequence reads as their input, although some pre-processing is required: reads must first be aligned followed by extraction of their SNPs (i.e., masking of non-polymorphic sites) to generate a SNP-fragment matrix with individual reads as rows and SNP positions as columns [as described for HapCompass (Aguiar and Istrail, 2013)]. In other words, all haplotyping tools [apart perhaps from Haplotag (Tinker et al., 2016)] require that users possess a certain level of bioinformatics skills. Although we expect polyploid haplotypes to become increasingly used in the future, the development of user-friendly and computationally efficient tools is first needed before haplotype-based genotypes become truly mainstream.
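The pre-processing step described above can be sketched as follows. This Python example builds a SNP-fragment matrix from already aligned reads, under simplifying assumptions that are ours rather than those of any cited tool: reads are given as a 0-based start position plus sequence, alignments contain no insertions or deletions, and reads covering fewer than `min_snps` SNP positions are dropped because they carry no phasing information. The function name and the `"-"` symbol for uncovered positions are hypothetical.

```python
def snp_fragment_matrix(reads, snp_positions, min_snps=2):
    """Build a matrix with reads as rows and SNP positions as columns.

    reads: iterable of (start, sequence) pairs, start being the 0-based
        reference coordinate of the first base (gapless alignment assumed).
    snp_positions: sorted 0-based reference coordinates of the SNPs.
    Entries are the observed base, or "-" where the read does not cover the
    SNP. Non-polymorphic sites are masked simply by never being looked up.
    """
    matrix = []
    for start, sequence in reads:
        row = []
        for position in snp_positions:
            offset = position - start
            covered = 0 <= offset < len(sequence)
            row.append(sequence[offset] if covered else "-")
        # A read linking fewer than two SNPs cannot inform phasing.
        if sum(base != "-" for base in row) >= min_snps:
            matrix.append(row)
    return matrix


if __name__ == "__main__":
    snps = [3, 6, 10]
    aligned_reads = [(0, "ACGTACGT"), (5, "TGCATTA"), (8, "AAC")]
    for row in snp_fragment_matrix(aligned_reads, snps):
        print(" ".join(row))
```

Real data would additionally require handling of indels, base qualities, and paired-end fragments, which is part of why these tools demand some bioinformatics experience.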

One interesting development is the application of haplotyping to whole genome assemblies (as opposed to genotyping a population). This has recently been attempted in the tuberous hexaploid crop sweet potato (Ipomoea batatas) (Yang et al., 2017a). The authors first produced a consensus assembly to which reads were re-mapped for variant calling, followed by a phasing algorithm which resolved the six haplotypes of the sequenced cultivar for about 30% of the assembly (Yang et al., 2017a). Ultimately, about half of the assembled genome could be haplotype-resolved. Future sequencing (or re-sequencing) efforts in polyploid species should produce more phased genomes, which will no doubt be useful for haplotyping applications (for example in validating predicted haplotypes).

# GENETIC AND PHYSICAL MAPPING OF POLYPLOID GENOMES

One of the first steps in understanding the genetic composition of any species is the development of a map, be it a genetic map based on information about linkage and co-inheritance of specific DNA locations, or a physical map giving a reference DNA sequence for the species. In polyploid species, numerous technical and methodological complications arise that make the mapping of polyploids a much more complex endeavor than diploid mapping. However, there is currently an upsurge in interest in polyploid mapping, which has led to much progress in recent years.

#### Linkage Maps

Although the first genetic linkage map was developed more than 100 years ago (Sturtevant, 1913), the use of linkage maps in genetic and genomic studies has persisted into the "next-generation" era. This can be attributed to a number of factors. A linkage map is a description of the recombination landscape within a species, usually from a single experimental cross of interest. For breeders, knowledge of genetic distance is arguably more important than physical distance, as it reflects the recombination frequencies in inheritance studies as well as describing the extent of linkage drag around loci of interest. Many software tools for performing QTL analysis require linkage maps of the markers, not physical maps. This is because co-inheritance of markers and phenotypes within a population is assumed to be coupled: a physical map gives less precise information about the co-inheritance of markers than a linkage map does, since physical distances do not directly translate to recombination frequencies (particularly in the pericentromeric regions). Another reason why linkage maps continue to be developed is that they are often the first genomic representation of a species, upon which more advanced representations can be built. They provide useful long-range linkage information over the whole chromosome which is often missing from assemblies of short sequence reads. This fact has been repeatedly exploited in efforts at connecting and correctly orientating scaffolds during genome assembly projects (Bartholomé et al., 2015; Fierst, 2015).

As mentioned in the Introduction, polyploids can be divided into disomic or polysomic species, with the additional possibility of a mixture of both inheritance types in the case of segmental allopolyploids. Many linkage maps in polyploids have been based exclusively on 1:1 segregating markers, also known as simplex markers [because the segregating allele is in simplex condition (one copy) in one of the parents only]. These markers possess a number of advantages over other marker segregation types, but also some distinct disadvantages. In their favor, coupling-phase simplex markers in polyploid species behave just like they would in diploid species, regardless of the mode of inheritance involved (repulsion-phase recombination frequency estimates are not invariant across ploidy levels or modes of inheritance, but exert less influence on map construction due to lower LOD scores). The advantage of this is clear: in unexplored polyploid species for which the mode of inheritance is uncertain, simplex markers allow an "assumption-free" linkage map to be created, following which the mode of inheritance can be further explored. The only exception to this is if double reduction occurs, i.e., when a segment of a single chromosome gets transmitted with its sister chromatid copy to an offspring, a consequence of multivalent pairing and a particular sequence of segregation and division during meiosis (Haldane, 1930; Mather, 1935). Double reduction occurs randomly in polysomic species and only introduces a small bias into recombination frequency estimates (Bourke et al., 2015). This means that, ignoring the possible influence of double reduction, diploid mapping software can generally be used for simplex marker sets at any ploidy level and for any type of meiotic pairing behavior (**Figure 3**), opening up a very wide range of diploid-specific software options (Cheema and Dicks, 2009).
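Because coupling-phase simplex markers behave as in a diploid backcross, their pairwise linkage analysis reduces to counting discordant offspring. The Python sketch below estimates the recombination frequency and the LOD score against free recombination for two such markers scored as presence (1) or absence (0) in F<sub>1</sub> offspring. It ignores double reduction, as discussed above, and the function name is hypothetical.

```python
import math


def coupling_rf_and_lod(marker_a, marker_b):
    """Recombination frequency and LOD for two coupling-phase simplex markers.

    marker_a, marker_b: per-offspring scores (1 present, 0 absent, None
    missing). In coupling phase, offspring carrying exactly one of the two
    alleles are recombinants.
    """
    pairs = [
        (a, b) for a, b in zip(marker_a, marker_b)
        if a is not None and b is not None
    ]
    n = len(pairs)
    if n == 0:
        raise ValueError("no offspring scored for both markers")
    recombinants = sum(a != b for a, b in pairs)
    r = recombinants / n

    # Maximum likelihood estimate is capped at 0.5 (free recombination).
    theta = min(r, 0.5)
    log10_lik = 0.0
    if recombinants:
        log10_lik += recombinants * math.log10(theta)
    if n - recombinants:
        log10_lik += (n - recombinants) * math.log10(1 - theta)
    lod = log10_lik - n * math.log10(0.5)
    return r, lod


if __name__ == "__main__":
    a = [1] * 10 + [0] * 10
    b = [1] * 9 + [0] + [0] * 9 + [1]  # two discordant offspring out of 20
    r, lod = coupling_rf_and_lod(a, b)
    print(f"r = {r:.2f}, LOD = {lod:.2f}")
```

The same counting does not carry over to repulsion-phase pairs, whose expected recombinant classes depend on ploidy and mode of inheritance.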

However, simplex marker sets have some limitations. Firstly, in selecting only simplex markers, a large proportion of markers with different segregation patterns are not used. This usually reduces the map coverage (while increasing the per-marker costs of the final set of mapped markers). More importantly, simplex markers give limited information about linkage in repulsion phase, particularly at higher ploidy levels (van Geest et al., 2017a). This means that homolog-specific maps can be produced, but they are unlikely to be well-integrated between homologs in a single parent, and impossible to integrate across parents. In other words, the chromosomal numbering will most likely be inconsistent between parental maps if only simplex markers are used. Producing a consensus or fully integrated map is desirable for many reasons, including being able to detect and model more complex QTL configurations than just simplex QTL. Therefore, a truly polyploid linkage mapping tool should be able to include all marker segregation types, not just 1:1 segregating markers.

#### Polyploid Linkage Mapping Software

Linkage mapping can be broken into three steps – linkage analysis, marker clustering and marker ordering. There are still relatively few software tools that can perform all three of these steps for polysomic species. Perhaps the most well-known and widely used software tool is TetraploidMap for Windows (Hackett and Luo, 2003; Hackett et al., 2007). As well as producing linkage maps for autotetraploid species, this software also performs QTL interval mapping (returned to later). Recently, TetraploidMap was updated to enable the use of dosage-scored SNP data (Hackett et al., 2013). The updated version, TetraploidSNPMap (Hackett et al., 2017), is freely available to download from the Scottish BioSS website<sup>2</sup>, and possesses a sophisticated graphical user interface (GUI) which will be extremely welcome for users in both the research and breeding communities. Apart from its dependency on the Windows platform, the main drawback of TetraploidSNPMap (TSNPM) is that it is programmed to analyze autotetraploid data only, and there is no indication when or if it will be expanded to other ploidy levels or modes of inheritance. However, tetraploidy is the most common polyploid condition (Comai, 2005) and therefore this software is still relevant for a broad range of species.
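To make the first two steps concrete, the sketch below estimates a two-point recombination frequency and LOD score for coupling-phase simplex markers (the diploid-like case) and then groups markers whose LOD exceeds a threshold. It is an illustration under simplifying assumptions, not the procedure of any package named here: the function names, the threshold of 3 and the example scores are all hypothetical, and missing data are not handled.

```python
import math

def two_point(marker_a, marker_b):
    """Recombination frequency and LOD for two coupling-phase simplex markers.

    Each marker is a list of 0/1 offspring scores; differing scores count as recombinants.
    """
    n = len(marker_a)
    recombinants = sum(1 for a, b in zip(marker_a, marker_b) if a != b)
    r = min(recombinants / n, 0.5)  # estimates above 0.5 are truncated
    if r >= 0.5:
        return r, 0.0
    log_lik = 0.0
    if recombinants:
        log_lik += recombinants * math.log10(r)
    log_lik += (n - recombinants) * math.log10(1.0 - r)
    # LOD: log10 likelihood at the estimate minus log10 likelihood at r = 0.5
    return r, log_lik - n * math.log10(0.5)

def cluster(markers, lod_threshold=3.0):
    """Linkage groups: markers connected by any chain of pairs with LOD >= threshold."""
    names = list(markers)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            _, lod = two_point(markers[a], markers[b])
            if lod >= lod_threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())

# Hypothetical scores for 20 offspring.
m1 = [0, 1] * 10
m2 = [1] + m1[1:]          # one recombinant relative to m1
m3 = [0, 0, 1, 1] * 5      # unlinked to m1 and m2
m4 = list(m3)              # completely linked to m3
markers = {"m1": m1, "m2": m2, "m3": m3, "m4": m4}
print(cluster(markers))
```

Marker ordering within each group is a separate step and is not attempted here.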

Recently, an alternative linkage mapping package called polymapR was released, which is described in a pre-print manuscript (Bourke et al., unpublished). Like TSNPM, polymapR uses dosage-scored marker information from F<sup>1</sup> populations to estimate recombination frequencies by maximum likelihood in a two-point linkage analysis. It can perform linkage analysis for polysomic triploids, tetraploids and hexaploids as well as segmental allotetraploid populations. As an R-based package it requires some level of user familiarity with R, but comes with a descriptive vignette which should make it accessible even to novice R users. It uses the same high-speed map ordering algorithm as TSNPM, namely MDSMap (Preedy and Hackett, 2016), and produces both integrated and phased linkage maps (i.e., separate maps for each parental homolog that are also integrated into a single consensus map). So far, developmental versions of this software have been used to generate high-density linkage maps in tetraploid potato (Bourke et al., 2016), tetraploid rose (Bourke et al., 2017), and hexaploid chrysanthemum (van Geest et al., 2017a).

<sup>2</sup>https://bioss.ac.uk/knowledge/tetramap.html

Another recently released R package that can perform linkage map construction is the netgwas package, also described in a pre-print manuscript (Behrouzi and Wit, 2017a). netgwas claims to be able to construct maps at any ploidy level in both inbred and outbred bi-parental populations, and rather than computing recombination frequencies and LOD scores, it uses conditional dependence relationships between markers based on discrete graphical models. The algorithm automatically detects linkage groups (which are traditionally identified by a user-specified LOD threshold) and does not rely on knowledge of parental dosage scores (which should offer robustness against parental genotyping errors). The output of netgwas is clustered and ordered marker names, but without assigning genetic positions (centiMorgans) or marker phasing, which are part of the TSNPM and polymapR output. The lack of marker phasing in particular is a major drawback, as phase considerations are crucial in polyploid genetic analyses. However, given its novel and computationally efficient approach to map construction, it appears to be a very interesting addition to the current range of polyploid mapping tools.

Another software program that is able to perform all three major steps in polyploid linkage mapping is the PERGOLA package in R (Grandke et al., 2017). This software can analyze marker data from all ploidy levels and modes of inheritance, but is limited to populations derived from completely inbred (homozygous) founder parents, such as F<sup>2</sup> or BC<sup>1</sup> populations. While these sorts of experimental population are common in diploid plant species, they are much less common in polyploids due to the difficulty in reaching homozygosity through selfing (Haldane, 1930). Generally speaking, polyploids are more heterozygous than diploids (Soltis and Soltis, 2000) although there is no general consensus regarding their tolerance of inbreeding (Krebs and Hancock, 1990; Soltis and Soltis, 2000; Galloway et al., 2003; Galloway and Etterson, 2007). There are indications that polyploid plant species self-fertilize more often than their diploid relatives (Barringer, 2007). However, regardless of whether polyploids tolerate some levels of inbreeding or not, heterozygosity is maintained for many more generations in repeatedly selfed polyploids than in selfed diploids (**Figure 4**). It therefore appears likely that PERGOLA was developed for newly formed polyploids derived from inbred diploid lines. The complexities facing extant (or heterozygous) polyploid species such as unknown marker phasing, or variable marker information contents are ignored by PERGOLA, making it doubtful that this tool will have a wide impact on linkage mapping in existing polyploid populations.
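The slower loss of heterozygosity under selfing at higher ploidy can be explored numerically. The sketch below is not a reproduction of Haldane's (1930) expressions or of Figure 4. It assumes a single biallelic locus, polysomic inheritance with random bivalent pairing, no double reduction, no selection, and a starting parent that carries the allele on half of its homologs. All function names are hypothetical, and the generation counts it returns depend on these assumptions, so they need not coincide with the figure.

```python
from math import comb

def gamete_distribution(dosage, ploidy):
    """P(gamete carries k copies) when half of the homologs are sampled at random."""
    half = ploidy // 2
    total = comb(ploidy, half)
    return [comb(dosage, k) * comb(ploidy - dosage, half - k) / total
            for k in range(half + 1)]

def self_once(freqs, ploidy):
    """Dosage-class frequencies after one generation of selfing."""
    new = [0.0] * (ploidy + 1)
    for dosage, freq in enumerate(freqs):
        if freq == 0.0:
            continue
        gametes = gamete_distribution(dosage, ploidy)
        for i, p_i in enumerate(gametes):
            for j, p_j in enumerate(gametes):
                new[i + j] += freq * p_i * p_j
    return new

def generations_to_homozygosity(ploidy, target=0.95, max_generations=200):
    """Selfing generations until the fully homozygous fraction reaches the target."""
    freqs = [0.0] * (ploidy + 1)
    freqs[ploidy // 2] = 1.0  # assumed starting genotype: allele on half of the homologs
    for generation in range(1, max_generations + 1):
        freqs = self_once(freqs, ploidy)
        if freqs[0] + freqs[ploidy] >= target:
            return generation
    return None

for ploidy in (2, 4, 6):
    print(ploidy, generations_to_homozygosity(ploidy))
```

Changing the starting genotype or the target fraction shifts the counts, but the ordering (diploid fastest, hexaploid slowest) is the point of the illustration.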

One final software tool that should be mentioned is PolyGembler, recently described in a pre-print manuscript (Zhou et al., unpublished). It proposes a novel approach to the creation of linkage maps in outcrossing polyploids, and is also suitable for diploid mapping. Interestingly, it first uses a haplotyping algorithm [derived from the polyHap algorithm (Su et al., 2008)] to generate phased multi-marker scaffolds or haplotypes. These are then used to calculate recombination frequencies by counting recombination events both within and between these scaffolds, leading to an extremely simple estimate of r which has no corresponding LOD score. Scaffolds are clustered using a graph partitioning algorithm, and thereafter, the computationally efficient CONCORDE traveling-salesman solver is employed to order markers [as is done for example in TSPmap (Monroe et al., 2017)]. This assumes that the variance of all r estimates is equal and that weights are not required – which may well be the case if the haplotype scaffolds are correctly constructed. PolyGembler claims to be able to handle the high levels of missing data and genotyping errors associated with next-generation sequencing data. Although it is applicable to multiple ploidy levels, the authors point out that mapping at the hexaploid level becomes computationally difficult due to the huge number of possible combinations in the formation of haplotypes. However, it appears to be a very promising tool which combines both genetic and bioinformatic approaches in a single pipeline.

FIGURE 4 | [...] from repeated rounds of inbreeding/selfing, using expressions derived by Haldane (1930). For autotetraploids (red line), 95% homozygosity (horizontal dotted line) is achieved after on average 19 generations of selfing, while for a hexaploid (blue line) 95% homozygosity is reached after approximately 32 generations. By contrast, a diploid reaches 95% homozygosity after approximately 5 generations of selfing (black dashed line).

Apart from those tools which constitute a complete linkage mapping pipeline, there have been some specific tools recently developed which we predict will have an important impact on future polyploid mapping applications. One of the most significant of these is the MDSMap package in R (Preedy and Hackett, 2016), a novel approach for determining a map order using multi-dimensional scaling. Marker data in polyploid species possesses variable information content, a fact that can be appreciated by considering the haplotype origin of markers of dosage 1 from a duplex marker in a tetraploid species. Certain combinations of markers provide very unambiguous information about co-inheritance, whereas others do not. Therefore, weights are required to prevent imprecise combinations from exerting a large influence on the map order. Before MDSMap was developed, the only reliable algorithm for ordering weighted recombination frequencies was the weighted regression algorithm from the original JoinMap implementation (Stam, 1993; Van Ooijen, 2006). However, this has the disadvantage of being very slow for larger numbers of markers and is therefore of limited use with current high-density marker datasets. The MDSMap approach can achieve similar results in a fraction of the time, and takes as its input the same information as JoinMap does, namely the pairwise recombination frequency estimates and logarithm of odds (LOD) scores, making this tool suitable for linkage map construction at any ploidy level, provided pairwise linkage analysis can be performed.
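The idea of ordering markers by multi-dimensional scaling can be illustrated with a deliberately simplified sketch. It uses classical, unweighted scaling on Haldane map distances and reads the order off the first axis. This is a swapped-in textbook technique, not the MDSMap algorithm itself, which additionally uses LOD-based weights. The function names and the marker positions in the example are hypothetical.

```python
import math

def haldane_distance(r):
    """Haldane map distance (in Morgans) for a recombination frequency r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)

def mds_order(names, rec_freq):
    """Order markers along the first axis of a classical (unweighted) MDS.

    rec_freq[i][j] is the pairwise recombination frequency between markers i and j.
    """
    n = len(names)
    d2 = [[haldane_distance(rec_freq[i][j]) ** 2 if i != j else 0.0
           for j in range(n)] for i in range(n)]
    # Double centring: B = -1/2 * J * D^2 * J, with J the centring matrix.
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Leading eigenvector of B by power iteration gives the first MDS axis.
    start = max(range(n), key=lambda i: b[i][i])
    v = b[start][:]
    for _ in range(200):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [names[i] for i in sorted(range(n), key=lambda i: v[i])]

# Illustrative data: five markers at assumed positions (in Morgans), listed out of order.
positions = {"C": 0.25, "A": 0.0, "E": 0.6, "B": 0.1, "D": 0.4}
names = list(positions)
rec_freq = [[0.5 * (1.0 - math.exp(-2.0 * abs(positions[a] - positions[b])))
             for b in names] for a in names]
print(mds_order(names, rec_freq))  # map order; the orientation is arbitrary
```

With noisy estimates of unequal precision, an unweighted scaling of this kind lets poorly informative pairs distort the order, which is the motivation for the weighting discussed above.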

One final tool that has also proven useful for polyploid linkage map construction is the LPmerge package in R (Endelman and Plomion, 2014). LPmerge uses linear programming to remove the minimum number of constraints in marker order in order to create a conflict-free consensus map. It was originally developed to create integrated genetic maps from multiple (diploid) populations. That said, polyploids contain multiple copies of each chromosome and therefore also present a similar challenge if we consider each homolog map as originating from a different population, with non-simplex markers as bridging markers (mapped in more than one population). Homolog-specific maps are still regularly generated in polyploid mapping studies [e.g., in potato (Bourke et al., 2015, 2016), rose (Vukosavljev et al., 2016) or sweet potato (Shirasawa et al., 2017)], for which LPmerge (or a similarly efficient integration algorithm) could then be used to generate chromosomally integrated maps.

#### Physical Maps

Arguably, one of the most important "tools" in current genomics studies is access to a high-quality reference genome assembly. Species for which a reference genome assembly exists have even been classified as "model organisms" (Seeb et al., 2011), such is the importance and impact a genome can bring to research on that species. Without a reference sequence available, the scope of genomic research remains limited. For example, GWAS rely on knowledge of the relative position of SNP markers (usually on a physical map), and many sequencing applications rely on a reference assembly on which to map reads. A reference genome also facilitates the development of molecular markers (e.g., primer development), the comparison of results between different genetic studies (by providing a single reference map), as well as allowing comparisons of specific sequences such as genes, enabling prediction of gene function across related species.

Polyploid genomes are by definition more complex than diploid genomes, having multiple copies of each homologous chromosome. Many polyploid species are also outbreeding, leading to increased heterozygosity which is problematic in de novo assemblies and necessitates specialized approaches (Kajitani et al., 2014). The most common solution until now has been to sequence a representative diploid species. For example, in highly heterozygous autotetraploid potato, a completely homozygous doubled monoploid (S. tuberosum group Phureja DM1-3) was sequenced (Potato Genome Sequencing Consortium, 2011), which still represents the primary reference sequence today<sup>3</sup>.

<sup>3</sup>http://solanaceae.plantbiology.msu.edu/

In the case of allopolyploids, multiple diploid progenitor species are often sequenced instead [e.g., peanut (Bertioli et al., 2016)]. The emergence of the pan-genome concept, originally proposed for microbial species (Tettelin et al., 2005), has interesting implications for how highly heterozygous polyploid genomes will be presented in future. We have already mentioned the arrival of phased genomics with the sweet potato genome, which aimed to generate six chromosome-length phased assemblies for each of its 15 chromosomes (Yang et al., 2017a). In future, both pan-genomes and phased genomes are likely to play a bigger role in polyploid reference genomics. Examples of polyploid species that have so far been "sequenced" are listed in **Table 1**. This is by no means an exhaustive list, nor does it describe all developments for the listed species. For example, the sequence of allotetraploid Coffea arabica (which accounts for roughly 70% of all coffee production) has recently been assembled, with a draft assembly (C. arabica UCDv0.5) available on the Phytozome database<sup>4</sup>. What **Table 1** highlights is that at the time of writing, there was already a wide range of polyploid crop species with well-developed genomic resources, despite the fact that in many cases these are from closely related or progenitor diploid species. In time, just like for coffee, we predict that direct sequencing of polyploid species themselves will gradually replace the haploidised reference sequences in importance and application, leading to more insights of direct relevance to polyploids.

#### QUANTITATIVE TRAIT ANALYSIS AND GENOMIC SELECTION

One of the main goals of genetic studies is to find causative associations between DNA polymorphisms and phenotypic traits. In domesticated species in particular, these studies are often performed with a practical aim: to develop marker-based methods of selecting superior lines in a breeding program. Traditional approaches such as bi-parental QTL mapping have been complemented in recent years by new methodologies such as GWAS and genomic selection. However, all these approaches require polyploid-specific solutions which can capture the increased complexity of polysomic inheritance. We look at the three most commonly used approaches for identifying quantitative trait variation and how specific software tools are helping to revolutionize polyploid plant breeding programs.

### QTL Analysis

The term "QTL analysis" usually refers to studies that aim to detect regions of the genome [so-called quantitative trait loci (Geldermann, 1975)] that have a significant statistical association with a trait in specifically constructed experimental populations. These populations are most often created by crossing two contrasting parental lines ("bi-parental" populations), although there is increasing interest in using more complex population designs in order to increase the range of alleles and genetic backgrounds being studied [e.g., "MAGIC" populations (Huang et al., 2015)]. As already discussed, there is great difficulty in developing inbred lines by repeatedly selfing polyploids due to the sampling of alleles during polyploid gamete formation [in a diploid this sampling generates $\binom{2}{1} = 2$ combinations; for a tetraploid this rises to $\binom{4}{2} = 6$ and in a hexaploid to $\binom{6}{3} = 20$ combinations, resulting in protracted heterozygosity (**Figure 4**)], not to mention the problem of inbreeding depression associated with many outcrossing polyploid species. Therefore, most QTL analyses in polyploid species have been performed using the directly segregating F<sup>1</sup> progeny of a cross between heterozygous parents (a "full sib" population). This leads to poorer resolution of QTL positions than in the more popular diploid inbred populations such as RILs, and it means that populations must be vegetatively propagated if replication over years or different growing environments is desired. For many polyploid species, vegetative propagation is indeed possible (Herben et al., 2017) and F<sup>1</sup> populations have the added advantage of being relatively quick and simple to develop, while, because of a generally high level of heterozygosity, many loci will be segregating in the F<sup>1</sup>. Therefore, despite their drawbacks, F<sup>1</sup> populations remain the bi-parental population of choice for mapping studies.

<sup>4</sup>www.phytozome.net

The methods for QTL analysis in diploid species have become increasingly convoluted (van Eeuwijk et al., 2010); in polyploid species such theoretical complexities have yet to be attempted, given the more immediate difficulties in accurately genotyping as well as modeling polyploid inheritance. Just like for linkage mapping and GWAS, the range of software tools available for QTL analysis in polyploids remains rather limited, although there are a number of recent developments that are helping transform the field.

One of the few dedicated software tools for tetraploid QTL analysis is the already-mentioned TetraploidMap software (Hackett et al., 2007). This software enables interval mapping to be performed in autotetraploid F<sup>1</sup> populations (as well as a simple single-marker ANOVA test), using a restricted range of markers (1 × 0, 2 × 0, and 1 × 1 markers only, where 1 × 0 denotes a marker dosage of 1 in one parent and 0 in the other, etc.). Although still available, it has been superseded by the TetraploidSNPMap software (Hackett et al., 2017). TetraploidSNPMap (TSNPM) uses SNP dosage data to either construct a linkage map (as already described) or perform QTL interval mapping. In contrast to its predecessor, TSNPM can analyze all marker segregation types, and allows the user to explore different QTL models at detected peaks. At its core is an algorithm to determine identity-by-descent (IBD) probabilities for the offspring of the population, which are then used in a weighted regression performed across the genome.

An independent software tool that has been developed to determine IBD probabilities in tetraploids is TetraOrigin (Zheng et al., 2016), implemented in the Mathematica programming language. TetraOrigin relaxes the assumption of random bivalent pairing during meiosis (which TSNPM employs) to allow for both preferential chromosomal pairing as well as multivalent formation and the possibility of double reduction. Although not programmed in a user-friendly format like TSNPM, it is relatively straightforward to use, taking an integrated linkage map and marker dosage matrix as input. It does not perform QTL analysis directly, but the resulting IBD probabilities can then be used to model genotype effects in a QTL scan either using a weighted regression approach like TSNPM, or in a linear mixed model setting. IBD probabilities allow interval mapping since they can be interpolated at any desired intervals on the linkage map.

TABLE 1 | Some examples of publicly available reference sequences for polyploid species.

For ploidy levels other than tetraploid, there are currently no dedicated software tools available for QTL analysis or IBD probability estimation. Single-marker approaches such as ANOVA on the marker dosages [assuming additivity – various dominant models could also be explored; see, e.g., (Rosyara et al., 2016)] are of course possible and require access to basic statistical software packages such as R (or even Excel). However, such approaches are not ideal – they are only effective if marker alleles are closely linked in coupling with QTL alleles, and offer no ability to predict the QTL segregation type or mode of gene action as is done for example in TSNPM (Hackett et al., 2017). As interest increases in the genetic dissection of important traits in polyploid species, we anticipate that it is only a matter of time before more flexible cross-ploidy solutions are developed. Methodologies developed for tetraploid species often claim that "extension to higher ploidy levels is straightforward." These sorts of disingenuous claims attempt to mark new research territory as already solved. If extensions to higher ploidy levels were indeed straightforward we would already be reporting on a wider range of tools available for them – as far as we can tell, so far there are none.
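As an illustration of the single-marker approach under an additive model, the sketch below regresses phenotype on marker dosage by ordinary least squares. The function name and the example values are hypothetical, and no significance test or correction for multiple testing is included.

```python
def dosage_regression(dosages, phenotypes):
    """Least-squares fit of phenotype = intercept + slope * dosage (additive model)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    slope = sxy / sxx                       # estimated effect per allele copy
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)   # fraction of variance explained by the marker
    return slope, intercept, r_squared

# Hypothetical hexaploid marker dosages (0..6) and trait values for eight offspring.
dosages = [0, 1, 2, 3, 3, 4, 5, 6]
phenotypes = [10.2, 11.9, 13.1, 14.8, 15.3, 16.0, 17.7, 19.1]
slope, intercept, r_squared = dosage_regression(dosages, phenotypes)
print(f"effect per allele copy: {slope:.2f}, R^2: {r_squared:.2f}")
```

Because the marker dosage stands in for the QTL genotype, such a fit is only informative when the marker allele is tightly linked in coupling with the QTL allele, as noted above.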

Returning to the topic of population types, we also anticipate that more powerful QTL analyses can be performed by combining information over multiple populations. Approaches such as pedigree-informed analyses, implemented for diploids in the FlexQTL software (Bink et al., 2008), could overcome some of the limitations imposed by the restrictions on population types in software for polyploids. However, it may take some time before such tools become translated to the polyploid level.

#### Genome-Wide Association Studies

Genome-wide association studies have emerged as a powerful tool for detecting causative loci underlying phenotypic traits. They have been particularly popular in species where the generation of experimental populations is problematic (such as humans). GWAS has been readily adopted across a broad spectrum of species since then, due to the promise of increased mapping resolution, a more diverse sampling of alleles and a simplicity in population creation (no crossing required) (Bernardo, 2016). There are certain disadvantages though, particularly in how rare (and potentially important) variants can be missed (Ott et al., 2015) and the confounding effect of population structure on results (Korte and Farlow, 2013). Nevertheless, GWAS continues to be an important analytical option to help shed greater light on genotype-phenotype associations. The application of GWAS in polyploid species is relatively new, although there have already been a number of studies published in various crop species, for example in potato, oilseed rape, wheat, and oats (Uitdewilligen et al., 2013; Gajardo et al., 2015; Sukumaran et al., 2015; Tumino et al., 2016, 2017). GWAS studies usually need to account for population structure and relatedness to prevent spurious associations, often in the context of linear mixed models (Yu et al., 2006; Bradbury et al., 2007; Zhang et al., 2010).

One challenge in applying GWAS to polyploid species is how to define a relatedness metric between polyploid individuals (i.e., how to generate the kinship matrix, K). So far, there have been two software tools released for polyploid GWAS, namely the R package GWASpoly (Rosyara et al., 2016) and the previously mentioned SHEsisPlus (Shen et al., 2016). Of these, only GWASpoly looks critically at the form of the kinship matrix K. Three different forms of K were tested in the development of the package, with the canonical relationship matrix (VanRaden, 2008) [termed the realized relationship matrix by the authors (Rosyara et al., 2016)] found to best control against inflation of significance values. This is also the default K provided in the GWASpoly package. An alternative approach to GWAS mapping for polyploids is provided by the netgwas package (Behrouzi and Wit, 2017b), previously mentioned for its linkage mapping capacity. Again, graphical models form the basis of the approach, which goes beyond single-marker association mapping to investigate genotype-phenotype interactions using all markers simultaneously in a graph structure. There is almost no discussion of how confounding between population structure and phenotypes is handled, but the authors claim the detection of false positive associations is not problematic.
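For readers unfamiliar with marker-based relationship matrices, the sketch below computes a ploidy-generalized form of VanRaden's (2008) matrix from dosage scores: dosages are centered on their expected value and the cross-products are scaled by ploidy × Σ p(1 − p). This scaling is an assumption made for illustration. The normalization used inside GWASpoly may differ, and the function name and example data are hypothetical.

```python
def realized_relationship(dosages, ploidy):
    """Marker-based relationship matrix from dosages[i][j] (individual i, marker j)."""
    n = len(dosages)
    m = len(dosages[0])
    # Allele frequency per marker, estimated from the mean dosage.
    freqs = [sum(row[j] for row in dosages) / (n * ploidy) for j in range(m)]
    # Centre each marker on its expected dosage (ploidy * allele frequency).
    z = [[row[j] - ploidy * freqs[j] for j in range(m)] for row in dosages]
    # Assumed scaling: ploidy * sum of p(1 - p) over markers.
    scale = ploidy * sum(p * (1.0 - p) for p in freqs)
    return [[sum(z[a][j] * z[b][j] for j in range(m)) / scale
             for b in range(n)] for a in range(n)]

# Hypothetical tetraploid dosages for four individuals at five markers.
example = [
    [0, 1, 2, 4, 3],
    [1, 1, 2, 3, 3],
    [4, 3, 0, 0, 1],
    [3, 2, 1, 1, 0],
]
K = realized_relationship(example, ploidy=4)
for row in K:
    print(" ".join(f"{value:6.2f}" for value in row))
```

Individuals with similar dosage profiles receive positive off-diagonal values and dissimilar ones negative values; in a mixed-model GWAS a matrix of this kind models the covariance among individuals.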

One final aspect worth considering is the issue of deploying an adequate number of markers in a polyploid GWAS, which potentially represents a much larger genomic space. In A. thaliana, it was estimated that between 140K and 250K SNPs would be needed to fully cover the genome based on a study of linkage disequilibrium in that species (Kim et al., 2007). Modeling the decay of linkage disequilibrium in polyploid species is a more complex exercise. It was previously suggested that estimates of linkage disequilibrium may be inflated in polyploid species (Jannoo et al., 1999; Flint-Garcia et al., 2003). A more recent survey of linkage disequilibrium in autotetraploid potato using SNP dosages estimated that at most 40K SNPs would be needed for QTL discovery in potato (Vos et al., 2017), a much lower estimate than for Arabidopsis (Kim et al., 2007). The discrepancy comes in part from the differences in how these figures were estimated, using a 'hide-the-SNP' simulation for Arabidopsis versus a 'rule of thumb' calculation for potato, but mainly from the difference in the extent of LD between the two species [estimated at ∼10 Kb in A. thaliana versus ∼2 Mb in S. tuberosum (Kim et al., 2007; Vos et al., 2017)]. Detecting or even defining linkage disequilibrium between markers linked in repulsion phase is non-trivial in autopolyploids (Vos et al., 2017), which is analogous to the problem of detecting and estimating recombination frequency between such markers in a linkage mapping study. So far, we are not aware of any software tool that has been developed to estimate the extent of linkage disequilibrium in polyploids, which would complement the design of future GWAS studies in polyploid species.

#### Genomic Prediction and Genomic Selection

There has been much attention given to the advantages of using all marker data to help predict phenotypic performance, rather than focussing on single markers (or haplotypes) that are linked to QTL as was previously advocated. The motivation behind this is clear – many of the most important traits in domesticated animal and plant species are highly quantitative, with far too many small-effect loci present to be able to tag them all with single markers (Bernardo, 2008). One of the most important traits in any breeding program is also a famously quantitative trait: yield. It has been suggested that despite many years of phenotypic selection, crop yield in tetraploid potato has essentially remained unchanged (Jansky, 2009; Slater et al., 2016). This is a remarkable indictment of traditional selection methods, yet offers much-needed impetus for the development and deployment of new paradigms in breeding for quantitative traits.

Genomic prediction first arose in animal breeding circles (Meuwissen et al., 2001), where the concept of estimating breeding values from known pedigrees was already well-established. However, the estimation of breeding values in polyploid species requires special consideration due to the complexity of polysomic inheritance and the possibility of double reduction. In practice, breeding values are usually estimated using restricted maximum likelihood (REML) to solve mixed model equations, requiring the generation of the inverse A<sup>−1</sup> of the additive relationship matrix A, also called the numerator relationship matrix. The form of A<sup>−1</sup> depends on, among other things, whether the inheritance is polysomic or disomic, and whether double reduction occurs (Kerr et al., 2012; Amadeu et al., 2016; Hamilton and Kerr, 2017). The R package AGHmatrix was developed in order to compute the appropriate A matrix for autotetraploids with a known pedigree (Amadeu et al., 2016), using theory developed in (Kerr et al., 2012). In applying their approach to an autotetraploid blueberry (Vaccinium corymbosum L.) population, the authors determined the A matrix under various levels of double reduction, afterwards selecting the model which maximized the likelihood of the data (Amadeu et al., 2016). More recently, an alternative R package polyAinv was released which computes A<sup>−1</sup> as well as the kinship matrix K and the inbreeding coefficients F (Hamilton and Kerr, 2017). polyAinv claims to be applicable to any ploidy level (rather than just autotetraploids) and can accommodate sex-based differences in IBD probabilities (Hamilton and Kerr, 2017). Like AGHmatrix, it also incorporates double reduction in its calculations. However, in one study of nine common traits in autotetraploid potato, the inclusion of double reduction, or even the adoption of an autotetraploid-appropriate relationship matrix, was found to have a minimal impact on the results (Slater et al., 2014). Studies which ignore the specific complexities of autopolyploids may still benefit from genomic prediction and selection, as for example was demonstrated in tetraploid potato (Sverrisdóttir et al., 2017).

Commonly used software tools for estimating breeding values at the diploid level include ProGeno (Maenhout, 2018) and ASReml (VSN International, 2018), which could be suitable for polyploid breeding programs, although this has yet to be conclusively demonstrated.

## POLYPLOID INHERITANCE AND SIMULATION

As a final section we look at two topics which are important to the development of polyploid genetic resources – the mode of inheritance and the availability of simulation software for polyploid species. Although these topics do not necessarily go together, they represent very important considerations in themselves. The mode of inheritance is a polyploid-specific topic, with no equivalent issue arising in diploid genetic studies. Simulation studies, on the other hand, have been used repeatedly at the diploid level to test new methodologies, determine empirical thresholds, evaluate competing methods etc. The availability of a range of software options to simulate polyploid genetic behavior is crucial if polyploid genetics is to flourish.

#### Mode of Inheritance

The term "mode of inheritance" refers to the randomness of meiotic pairing processes that give rise to gametes, and is often used to distinguish between disomic (diploid-like) inheritance, and polysomic (all allele combinations equally possible) inheritance. As alluded to already, intermediate modes of inheritance are theoretically possible if partially preferential pairing occurs between homologs, resulting in on average more recombination events between certain homologs, and fewer between others (putative homoeologs). This intermediate inheritance pattern, originally termed segmental allopolyploidy (Stebbins, 1947) and more recently termed mixosomy (Soltis et al., 2016), poses additional challenges over those of purely polysomic or disomic behavior. One of the main complications is the lack of fixed segregation ratios to test markers against (Allendorf and Danzmann, 1997), a test that is often used as a measure of marker quality (Stringham and Boehnke, 1996; Pompanon et al., 2005). Currently there are no dedicated tools available to ascertain the most likely mode of inheritance in polyploids. Some "traditional" approaches to predict the mode of inheritance are summarized in (Bourke et al., 2017), many of which are relatively straightforward to implement using a statistical programming environment like R (R Core Team, 2016). In that study, TetraOrigin (Zheng et al., 2016) was used to estimate the most likely pairing configuration that gave rise to each offspring in an F<sup>1</sup> tetraploid population. This enabled the authors to test whether there were deviations from the expected patterns of homolog pairing under a tetrasomic model (Bourke et al., 2017). A simple alternative using closely linked repulsion-phase simplex marker pairs was also proposed and has been implemented in the polymapR package (Bourke et al., unpublished).
Apart from preferential pairing, TetraOrigin can also predict whether marker data arose from bivalent or multivalent pairing during meiosis, facilitating an analysis of the distribution of double reduction products. However, apart from its restriction to tetraploid data, an integrated linkage map is required before TetraOrigin can be employed. In severe cases of mixosomy, it is not obvious how a reliable linkage map should be generated. Corrections for mixosomy in a tetraploid linkage analysis are possible in polymapR, but in extreme cases marker clustering will also be affected, making map construction quite challenging. A confounding complication is the possibility of variable chromosome counts (aneuploidy), as for example encountered in sugarcane (Grivet et al., 1996; Grivet and Arruda, 2002) or in ornamentals such as Alstroemeria (Buitendijk et al., 1997), which makes the diagnosis of the mode of inheritance even more difficult. As more polyploid species begin to be genotyped, the issue of unknown mode of inheritance will likely exert more influence, further necessitating the development of software tools that can provide an accurate assessment of the inheritance mode using marker data, and that can accommodate the full spectrum of polyploid meiotic behaviors.
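The test of markers against fixed segregation ratios mentioned above is, in its simplest form, a chi-square goodness-of-fit test. The sketch below is a generic illustration rather than the procedure of any cited package. The function names and offspring counts are hypothetical, and the p-value is only implemented for one or two degrees of freedom, where closed forms exist. The 1:4:1 ratio used in the example is the expectation for a duplex × nulliplex marker under tetrasomic inheritance without double reduction.

```python
import math

def segregation_test(observed, expected_ratio):
    """Chi-square statistic and degrees of freedom for observed class counts."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

def p_value(chi2, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and df = 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

# Simplex x nulliplex marker: offspring dosage classes 0 and 1, expected 1:1.
chi2, df = segregation_test([58, 42], [1, 1])
print(f"1:1 test: chi2 = {chi2:.2f}, p = {p_value(chi2, df):.3f}")

# Duplex x nulliplex marker, tetrasomic expectation 1:4:1 for dosage classes 0, 1 and 2.
chi2, df = segregation_test([15, 70, 15], [1, 4, 1])
print(f"1:4:1 test: chi2 = {chi2:.2f}, p = {p_value(chi2, df):.3f}")
```

Under mixosomy the expected ratio itself is unknown, which is precisely why such a test cannot be applied directly.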

### Simulation Software

As with any software tool, developing standards and scenarios upon which the performance of the tool can be judged is vital to ensure reliable results. In this final section we consider the range of simulation tools currently available for polyploids. Probably the most widely used polyploid simulation software currently available is PedigreeSim (Voorrips and Maliepaard, 2012). Originally developed to generate diploid and tetraploid populations, the current release (PedigreeSim V2.0) can simulate populations of any even ploidy level (2, 4, 6, ...). What makes PedigreeSim particularly attractive is its ability to simulate a diversity of meiotic pairing conditions, including quadrivalents (which can result in double reduction) or preferential chromosome pairing. It takes four input files (which are relatively simple to generate) that provide a description of the desired simulation parameters and the input marker data. The software then creates (dosage-scored) genotype data for any pedigreed population, e.g., an F<sub>1</sub> population of specified size (Voorrips and Maliepaard, 2012). Some authors have used PedigreeSim to simulate multiple generations of random mating, allowing an investigation of population structure and linkage disequilibrium in polyploid species (e.g., Rosyara et al., 2016; Vos et al., 2017), which can be implemented quite easily with some basic programming knowledge. PedigreeSim is written in Java and can run on all major operating systems.
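As a rough illustration of what such a simulator does at a single locus, the toy sketch below pairs the four homologs of a tetraploid parent into two bivalents and passes one homolog from each bivalent to the gamete. It is not PedigreeSim's algorithm or input format: the function names, the `pref` parameter (the probability of forcing the pairing of homologs 0 with 1 and 2 with 3), and the omission of recombination and quadrivalents are all simplifying assumptions.

```python
import random

# The three ways to pair four homologs (indexed 0..3) into two bivalents.
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def make_gamete(parent, rng, pref=0.0):
    """Return the two alleles transmitted by one tetraploid parent.

    parent: tuple of four alleles (0 or 1), one per homolog.
    pref:   probability of forcing the first pairing (a crude stand-in for
            preferential pairing); otherwise a pairing is drawn at random.
    """
    if rng.random() < pref:
        pairing = PAIRINGS[0]
    else:
        pairing = rng.choice(PAIRINGS)
    # One homolog of each bivalent goes to the gamete.
    return tuple(parent[rng.choice(bivalent)] for bivalent in pairing)


def simulate_f1(parent1, parent2, n, seed=1, pref=0.0):
    """Simulate the dosages of n F1 offspring at a single biallelic locus."""
    rng = random.Random(seed)
    return [
        sum(make_gamete(parent1, rng, pref)) + sum(make_gamete(parent2, rng, pref))
        for _ in range(n)
    ]


# Duplex parent (alternative allele on homologs 0 and 1) x nulliplex parent.
random_pairing = simulate_f1((1, 1, 0, 0), (0, 0, 0, 0), n=6000)
strict_pairing = simulate_f1((1, 1, 0, 0), (0, 0, 0, 0), n=100, pref=1.0)
print(sum(d == 1 for d in random_pairing) / len(random_pairing))
print(set(strict_pairing))
```

With random pairing the simplex class should make up roughly two thirds of the offspring (the 1:4:1 tetrasomic ratio), whereas fully preferential pairing of the two allele-carrying homologs yields only simplex offspring.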

A Windows-based software Polylink, which originally performed two-point linkage analysis and simulation of tetraploid populations (He et al., 2001), is no longer available. The R package polySegratio (Baker, 2014) simulates dominantly scored marker data in autopolyploids of any even ploidy level. Generating the dosage data is straightforward: only the expected proportion of marker types (simplex, duplex, triplex, ...) as well as the ploidy is required. However, the markers are essentially completely random, with no connection to any linkage map, which is arguably of limited use for any application that requires some degree of linkage between markers. The simulation capacities of polySegratio therefore appear to be most useful for testing functions within the package itself, namely those designed to impute parental dosages given the observed segregation ratios in offspring scores.

A final polyploid simulation tool that has recently been developed is the HaploSim pipeline, which includes the HaploGenerator function (Motazedi et al., 2017). HaploGenerator is designed to generate sequence-based haplotypes in a polyploid of any even ploidy, taking the FASTA file it is provided with as a reference from which haplotypes are built. The software generates random SNP mutations at a specified distribution before simulating next-generation sequencing (NGS) reads in formats corresponding to a number of current sequencing technologies such as Illumina or Pacific Biosciences (PacBio). The pipeline was originally developed to compare the performance of a number of haplotype assembly algorithms (Motazedi et al., 2017), but could also be useful for testing the performance of any other tool which uses NGS reads as genotypes.

#### FUTURE PERSPECTIVES

In this review we have attempted to describe the most important software tools that are currently available to the polyploid genetics community. There are likely to be tools that were missed and tools that have subsequently been released; this is the danger of such a review. However, we have tried where possible to also discuss the gaps that are apparent in the current set of available tools, which will hopefully help guide their development in future. Polyploid genotyping arguably remains the most critical step, as without accurate genotype data there is little point in building models for polyploid inheritance. However, we are now witnessing the slow emergence of tools that take polyploid genotypes and use them to make inferences on the transmission of alleles and the effects of such alleles in polyploid populations. As genotyping technologies continue to evolve, so too should the suite of tools developed to analyze those genotypes. Tools for analyzing SNP dosage data from SNP arrays are well-established. The coming decade will likely see a move away from SNP array-based genotyping to the use of sequence-read based genotypes, although this will require that all tools heretofore developed be updated to accommodate the new type of data. Information on the mode of inheritance from marker data is also needed for each population studied, which deserves more attention than it currently receives. A move from diploid-based reference genomes to fully polyploid (and haplotype-resolved) reference genomes would also help broaden the boundaries of polyploid genetics away from the diplo-centric view of genomics which currently dominates. Although there have been many exciting discoveries and developments in polyploid genetics in the past decade or more, we feel its golden age has yet to arrive, an age which will be heralded all the sooner by the provision of robust and user-friendly tools for the genetic dissection of this fascinating group of organisms.

### AUTHOR CONTRIBUTIONS

PB wrote this review, with input from REV, RGFV, and CM. All authors read and approved the final manuscript.

#### FUNDING

Funding for this research was provided through the TKI polyploids project "Novel Genetic and Genomic Tools for Polyploid Crops" (project numbers BO-26.03-009-004 and BO-50-002-022). The support of the companies participating in the polyploids projects is gratefully acknowledged.

#### ACKNOWLEDGMENTS

The authors wish to thank Dr. Jeffrey Endelman (University of Wisconsin–Madison) for the helpful remarks regarding tetraploid kinship matrices, and Dr. Heleen Bastiaanssen (Anthura B.V.) for the helpful feedback on the manuscript.






**Conflict of Interest Statement:** The authors of this review have been involved in the development of a number of the software tools mentioned, namely fitTetra, fitPoly, PedigreeSim, TetraOrigin, polymapR, HaploSim, and TriPoly. We have tried to give an impartial perspective where these and alternative tools are concerned. Apart from this, the authors do not have any further conflicts of interest to declare.

Copyright © 2018 Bourke, Voorrips, Visser and Maliepaard. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Insights Into the Genetic Basis of Blueberry Fruit-Related Traits Using Diploid and Polyploid Models in a GWAS Context

Luís Felipe V. Ferrão<sup>1†</sup>, Juliana Benevenuto<sup>1†</sup>, Ivone de Bem Oliveira<sup>1,2</sup>, Catherine Cellon<sup>3</sup>, James Olmstead<sup>4</sup>, Matias Kirst<sup>5,6</sup>, Marcio F. R. Resende Jr.<sup>7</sup> and Patricio Munoz<sup>1</sup>\*

<sup>1</sup> Blueberry Breeding and Genomics Laboratory, Horticultural Sciences Department, University of Florida, Gainesville, FL, United States, <sup>2</sup> Plant Genetics and Genomics Laboratory, Agronomy College, Federal University of Goias, Goiania, Brazil, <sup>3</sup> Duda Farm Fresh Foods, Oviedo, FL, United States, <sup>4</sup> Driscoll's Inc., Watsonville, CA, United States, <sup>5</sup> Forest Genomics Laboratory, School of Forest Resources and Conservation, University of Florida, Gainesville, FL, United States, <sup>6</sup> Genetics Institute, University of Florida, Gainesville, FL, United States, <sup>7</sup> Sweet Corn Genomics and Breeding, Horticultural Sciences Department, University of Florida, Gainesville, FL, United States

#### Edited by:

Hans D. Daetwyler, La Trobe University, Australia

#### Reviewed by:

Peter Bradbury, United States Department of Agriculture, United States; Ross Houston, University of Edinburgh, United Kingdom

\*Correspondence: Patricio Munoz, p.munoz@ufl.edu

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Ecology and Evolution

Received: 31 January 2018; Accepted: 02 July 2018; Published: 24 July 2018

#### Citation:

Ferrão LFV, Benevenuto J, de Bem Oliveira I, Cellon C, Olmstead J, Kirst M, Resende MFR Jr and Munoz P (2018) Insights Into the Genetic Basis of Blueberry Fruit-Related Traits Using Diploid and Polyploid Models in a GWAS Context. Front. Ecol. Evol. 6:107. doi: 10.3389/fevo.2018.00107

Polyploidization is an ancient and recurrent process in plant evolution, impacting the diversification of natural populations and plant breeding strategies. Polyploidization occurs in many important crops; however, its effects on inheritance of many agronomic traits are still poorly understood compared with diploid species. Higher levels of allelic dosage or more complex interactions between alleles could affect the phenotype expression. Hence, the present study aimed to dissect the genetic basis of fruit-related traits in autotetraploid blueberries and identify candidate genes affecting phenotypic variation. We performed a genome-wide association study (GWAS) assuming diploid and tetraploid inheritance, encompassing distinct models of gene action (additive, general, different orders of allelic interaction, and the corresponding diploidized models). A total of 1,575 southern highbush blueberry individuals from a breeding population of 117 full-sib families were genotyped using sequence capture and next-generation sequencing, and evaluated for eight fruit-related traits. For the diploid allele calling, 77,496 SNPs were detected; while 80,591 SNPs were obtained in tetraploid, with a high degree of overlap (95%) between them. A linear mixed model that accounted for population and family structure was used for the GWAS analyses. By modeling tetraploid genotypes, we detected 15 SNPs significantly associated with five fruit-related traits. Alternatively, seven significant SNPs were detected for only two traits using diploid genotypes, with two SNPs overlapping with the tetraploid scenario.
Our results showed that the importance of tetraploid models varied by trait and that the use of diploid models has hindered the detection of SNP-trait associations and, consequently, the genetic architecture of some commercially important traits in autotetraploid species. Furthermore, 14 SNPs co-localized with candidate genes, five of which lead to non-synonymous amino acid changes. The potential functional significance of these SNPs is discussed.

Keywords: autopolyploid, allelic dosage, SNP calling, genetic association, gene action, breeding, Vaccinium

# INTRODUCTION

Polyploidy is a widespread phenomenon among the flowering plants. Rounds of ancient and recent polyploidization events have been shaping the genomes and the evolutionary trajectories of plant lineages, driving phenotypic diversification (Adams and Wendel, 2005; Paterson, 2005; Jiao et al., 2011; Blischak et al., 2016). Expansion of phenotypic range and novel phenotypes often arise with polyploidization (Spoelhof et al., 2017). The genomic redundancy created by polyploidy allows relaxed selective constraints and functional divergence of gene copies, which can generate new phenotypes in the long-term evolutionary process (Adams and Wendel, 2005; Comai, 2005). Immediate phenotypic effects of polyploidy are also observed compared to their diploid progenitors, such as increased cell and organ size, changes in flowering time, and greater vigor and biomass (Osborn et al., 2003; Tamayo-Ordóñez et al., 2016). The molecular mechanisms contributing to phenotypic variation shortly after polyploidization are not well-understood, but probably involve more complex genetic and epigenetic effects of higher allelic dosage and heterosis (Osborn et al., 2003; Jackson and Chen, 2010; Renny-Byfield and Wendel, 2014; Fort et al., 2016). For example, genome-wide gene expression studies in resynthesized polyploid plants and yeasts have shown ploidy-dependent gene expression alterations, which likely affect the phenotype (Guo et al., 1996; Galitski et al., 1999; Osborn et al., 2003; Pumphrey et al., 2009; Jackson and Chen, 2010).

Polyploids exhibiting new phenotypic traits can outperform their diploid counterparts, occupy new niches, and become ecologically and agriculturally important (Tamayo-Ordóñez et al., 2016; Spoelhof et al., 2017). Many important crops are polyploids with varied ploidy levels and modes of origin (i.e., auto- or allopolyploids). However, despite the economic importance of polyploids and the impact that ploidy can have on phenotypic expression, the effects of allelic dosage on quantitative traits remain largely unexplored. Most genetic studies in polyploids have so far relied on diploid models to simplify the polyploid data. The complex nature of polyploid genetic data (e.g., multiple alleles and mixed inheritance patterns) has hindered the understanding of the genetic architecture of important traits (Dufresne et al., 2014). Moreover, limitations in molecular techniques and statistical methodologies, such as the difficulty of determining allelic dosage, have also constrained polyploid studies (Garcia et al., 2013; Lu et al., 2013; Dufresne et al., 2014; Li et al., 2014a; Annicchiarico et al., 2015; Uitdewilligen et al., 2015; Schulz et al., 2016).

Due to the advances in new genotyping technologies, it is now possible to generate high-density single nucleotide polymorphism (SNP) data and evaluate the relative abundance of each allele based on read sequencing depth to infer the allelic dosage. Genome-wide association studies (GWAS) that consider allelic dosage can help uncover the genetic basis of complex traits by considering more realistic genetic models, and hence improving the signal-to-noise ratio (Garcia et al., 2013; Grandke et al., 2016). Moreover, the effect of the genotype classes on the phenotypic variation can be tested under different gene action models to gain additional insights into additive and non-additive effects (Rosyara et al., 2016). The present study aimed to understand how modeling the allelic dosage influences the identification of SNPs significantly associated with blueberry fruit-related traits through GWAS analyses.

Blueberry has been recognized worldwide for its health benefits, becoming one of the crops with the highest consumer demand and production trends (USDA, 2016). During blueberry improvement in the United States, interspecific hybridizations have been used for the development of "southern" highbush cultivars adapted to warmer climates. Crosses primarily involved the autotetraploid "northern" highbush blueberry (Vaccinium corymbosum L.) and the diploid evergreen blueberry (V. darrowii Camp) (Sharpe and Darrow, 1959). Tetraploid hybrids were achieved by the occurrence of unreduced gametes during pollen formation in the diploid species (Ortiz et al., 1992). Despite interspecific hybridizations, blueberry cultivars are considered autotetraploids with non-preferential bivalent chromosome pairing during meiosis and the absence of chromosome structural differentiation (Qu and Hancock, 1995; Qu et al., 1998; Lyrene et al., 2003). The conventional breeding program employs phenotypic recurrent selection, and the release of a new cultivar can take up to 15 years (Hancock et al., 2008). In a perennial polyploid species, such as blueberry, marker-assisted selection has the potential to accelerate the cultivar development process. In this sense, the GWAS analyses can also assist in the identification of causal polymorphisms or molecular markers associated with fruit-related traits relevant for blueberry breeding. The objective of this study is two-fold: (i) to compare the effects of diploid and tetraploid marker calling in population genetics and GWAS analysis; and (ii) to perform the first GWAS analysis for fruit-related traits in southern highbush blueberry.

# MATERIALS AND METHODS

# Plant Material and Trait Phenotyping

The southern highbush blueberry population used in this study was generated as part of the breeding program at the University of Florida. For this study, 124 controlled crosses were made among 148 selected parents in February 2011. Seeds from each cross were cold-stratified for 5 months and planted in a greenhouse as a family in 2 L pots in November 2011. One hundred seedlings from each family were later transplanted to a high-density nursery (∼20,000 plants per 0.2 ha) in a row-column design at the University of Florida Plant Science Research and Education Unit in Citra, Florida. In May 2013, a first round of selection was performed. Unselected plants were removed from the field and the remaining individuals constituted the 1,575 plants from 117 crosses used in this study.

The phenotypic evaluations were conducted during fruit ripening (6 weeks from the beginning of April to mid-May 2014) and flowering (January 2015) periods when the plants were in their third growing season. Eight fruit-related traits were measured: weight, size, firmness, stem scar diameter, pH, soluble solids content, flower bud density, and yield. Yield was evaluated using a 1-to-5 rating scale, where 1 indicates none or very few berries on the plant and 5 is a yield comparable to standard commercial cultivars. The flower bud density refers to the number of flower buds on the top 20 cm of one representative upright shoot from the main stem, and was reported as number of buds per 20 cm of shoot. For the fruit traits, the average of five berries randomly selected from each genotype was calculated. Weight (g) was measured using an analytical scale (CP2202S, Sartorius Corp., Bohemia, NY). The same five berries were equatorially oriented to measure fruit size diameter (mm) and firmness (g·mm<sup>−1</sup> compression force), with a minimum and maximum force threshold of 50 and 350 g, respectively, using the Firm-Tech II (BioWorks Inc., Wamego, KS). The picking stem scar was positioned upward on a tray in a light box with a digital SLR camera (Pentax K-x, Ricoh Imaging, Denver, CO) placed 50 cm above the berry. A ruler was also placed in each image as a size reference. The images were uploaded into FIJI (Schindelin et al., 2012), the scale was set using the ruler, and the scar diameter (mm) was measured for each berry. The blueberry juice was used to measure traits related to sensory quality. The soluble solids content (°Brix), an approximate surrogate measure of sugar content, was assessed using a digital pocket refractometer (Atago U.S.A, Inc., Bellevue, WA). The juice pH was measured using a glass pH electrode (Mettler-Toledo, Inc., Schwerzenbach, Switzerland).

#### Capture-Seq Genotyping and SNP Calling

Total genomic DNA was extracted from leaf tissue of each plant using the E-Z 96 Plant DNA Kit (Omega Bio-Tek, Norcross, GA). Genotyping was performed by RAPiD Genomics (Gainesville, FL, USA) using sequence capture. Briefly, 31,063 custom-designed biotinylated probes of 120-mer were developed based on the scaffolds of the blueberry draft genome sequence (2013 version) (Bian et al., 2014; Gupta et al., 2015). Sequencing was carried out on the Illumina HiSeq2000 platform using 100 cycle paired-end runs. Raw reads were first trimmed for minimum base quality of 20, demultiplexed, and barcodes were removed. Subsequently, reads were aligned to the blueberry genome (2013 version) using BWA v.0.7.12 (Li and Durbin, 2009).

Polymorphisms and genotypes were called using FreeBayes v.1.0.1, selecting the diploid (-p 2) and the tetraploid (-p 4) options (Garrison and Marth, 2012). Genotypes were represented by the count of alternative alleles. Therefore, for the diploid calling, genotypes were coded as 0 (AA), 1 (AB), or 2 (BB), where "A" and "B" refer to the reference and alternative alleles, respectively. The genotypes for the tetraploid calling were coded as 0 for nulliplex (AAAA), 1 for simplex (AAAB), 2 for duplex (AABB), 3 for triplex (ABBB), and 4 for quadruplex (BBBB). We performed a sample filtering by excluding individuals with more than 90% of missing data across SNPs (sample call rate = 0.9). SNPs were further filtered by: (i) minimum depth of coverage of 40; (ii) minimum genotype quality score of 10; (iii) biallelic loci only; (iv) maximum missing data of 0.7; (v) minimum minor allele frequency of 0.05. The remaining missing genotypes were imputed with the mode of each locus as suggested by Rosyara et al. (2016).
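The dosage coding, the missing-data and minor-allele-frequency filters, and the mode imputation described above can be sketched as follows. This is an illustrative sketch only: the function names and the toy genotype calls are hypothetical, the depth and genotype-quality filters are omitted, and it is not the pipeline used in the study.

```python
from collections import Counter


def dosage(genotype, alt="B"):
    """Count alternative alleles, e.g. 'AABB' -> 2; None marks a missing call."""
    return None if genotype is None else genotype.count(alt)


def keep_snp(dosages, ploidy, max_missing=0.7, min_maf=0.05):
    """Apply the missing-data and minor-allele-frequency filters to one SNP."""
    called = [d for d in dosages if d is not None]
    if not called:
        return False
    missing = 1 - len(called) / len(dosages)
    if missing > max_missing:
        return False
    freq = sum(called) / (ploidy * len(called))
    return min(freq, 1 - freq) >= min_maf


def impute_mode(dosages):
    """Replace missing calls with the most frequent dosage at the locus."""
    called = [d for d in dosages if d is not None]
    mode = Counter(called).most_common(1)[0][0]
    return [mode if d is None else d for d in dosages]


# Hypothetical tetraploid calls for five individuals at one SNP.
calls = ["AAAB", "AABB", None, "AAAA", "AAAB"]
snp = [dosage(g) for g in calls]
if keep_snp(snp, ploidy=4):
    snp_imputed = impute_mode(snp)
    print(snp_imputed)
```

The same functions apply to diploid calls by passing two-letter genotypes and `ploidy=2`.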

#### Population Genetics Analyses

Population genetics parameters were computed considering the polyploid and diploid scenarios. We estimated: (i) allele frequency; (ii) heterozygosity; (iii) linkage disequilibrium (LD) decay; and (iv) population structure. The allele frequency for each locus was obtained by counting the number of alternative alleles, divided by sample size and ploidy level. The observed heterozygosity was calculated as a fraction of the number of heterozygote classes by the total number of loci. Pearson correlation tests (r<sup>2</sup>) were performed for pairwise LD estimation within scaffolds. All scaffolds were pooled to plot a genome-wide LD decay and boxplots of r<sup>2</sup> values for categories of marker distances. The decay of LD over genetic distance was determined as the mean distance associated with an empirical LD threshold of r<sup>2</sup> = 0.2. To assess the genetic structure of the blueberry population, Principal Components Analysis (PCA) was performed using the marker-based relationship matrix as input. Diploid and tetraploid genomic relationship matrices were computed with the AGHmatrix R package (Amadeu et al., 2016). The Discriminant Analysis of Principal Components (DAPC) was conducted to cluster genetically similar individuals using the Bayesian Information Criterion (BIC) to select the best supported model, as implemented in the R package adegenet v. 1.3-1 (Jombart and Ahmed, 2011).
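The per-locus summaries described above can be sketched in a few lines. The function names and toy dosage vectors are hypothetical, and heterozygosity is computed here under one reading of the text, namely the share of genotype calls at a locus that fall in a heterozygous class (dosage strictly between 0 and the ploidy).

```python
def allele_frequency(dosages, ploidy):
    """Alternative-allele count divided by sample size and ploidy."""
    return sum(dosages) / (len(dosages) * ploidy)


def observed_heterozygosity(dosages, ploidy):
    """Fraction of calls in a heterozygous class (0 < dosage < ploidy)."""
    return sum(1 for d in dosages if 0 < d < ploidy) / len(dosages)


def ld_r2(x, y):
    """Squared Pearson correlation between two dosage vectors.

    Assumes both markers are polymorphic, as guaranteed by a MAF filter.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical tetraploid dosages for six individuals at two nearby SNPs.
snp1 = [0, 1, 2, 3, 4, 2]
snp2 = [0, 1, 2, 3, 4, 1]
print(allele_frequency(snp1, 4))
print(observed_heterozygosity(snp1, 4))
print(ld_r2(snp1, snp2))
```

Because a tetraploid call has three heterozygous classes rather than one, the same read data can yield a higher observed heterozygosity when called as tetraploid.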

#### GWAS Analyses

The SNP-trait association analyses were based on a linear mixed model, accounting for population structure (**Q**) and relative kinship (**K**) matrices as implemented in the GWASpoly R package (Rosyara et al., 2016). The **Q**+**K** linear mixed model was:

$$\mathbf{y} = \mathbf{Z}\mathbf{S}\boldsymbol{\tau} + \mathbf{Z}\mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$$

where **y** is a vector of observed phenotypes; ε is a vector of random residual effects, with a multivariate normal distribution with a zero mean vector and an identity variance-covariance (VCOV) matrix; **v** is a vector of subpopulation effects, with incidence matrix **Q**; and **u** is a random polygenic effect, with a multivariate normal distribution with a zero mean vector and VCOV matrix proportional to a kinship matrix (**K**-matrix). The **Z** incidence matrix maps genotypes to observations, and the SNP effects are represented by the fixed vector τ. As pointed out by Rosyara et al. (2016), the matrix **S** depends on the genetic model assumed. In order to compare diploid and tetraploid pipelines, the **Q**+**K** model was implemented in both scenarios. For tetraploid, the **K**-matrix was constructed assuming tetrasomic inheritance (Slater et al., 2013), while for the diploid model it was built considering the algorithm proposed by VanRaden (2008). Both matrices were computed using the AGHmatrix R package (Amadeu et al., 2016). To correct for population structure, PCA was computed internally using the GWASpoly package and the first four principal components were further used in GWAS analyses.
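To make the role of the **K**-matrix more concrete, the sketch below builds a VanRaden-style genomic relationship matrix from a dosage matrix: markers are centered by ploidy times the allele frequency and the cross-product is scaled by ploidy times the sum of p(1 − p). With `ploidy=2` this is the first method of VanRaden (2008). Scaling by ploidy for tetraploids is an assumption made here for illustration; it is not the Slater et al. (2013) construction used in the study, and the function name is hypothetical rather than part of AGHmatrix.

```python
def vanraden_matrix(dosages, ploidy=2):
    """VanRaden-style relationship matrix.

    dosages[i][j] is the alternative-allele count of individual i at marker j.
    """
    n, m = len(dosages), len(dosages[0])
    freqs = [sum(row[j] for row in dosages) / (n * ploidy) for j in range(m)]
    # Center each marker by its expected dosage (ploidy * frequency).
    centered = [[row[j] - ploidy * freqs[j] for j in range(m)] for row in dosages]
    scale = ploidy * sum(p * (1 - p) for p in freqs)
    return [
        [sum(a * b for a, b in zip(ri, rk)) / scale for rk in centered]
        for ri in centered
    ]


# Hypothetical diploid dosages: four individuals by three markers.
geno = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 1, 1],
    [1, 1, 1],
]
K = vanraden_matrix(geno, ploidy=2)
for row in K:
    print([round(v, 3) for v in row])
```

In the mixed model above, a matrix of this kind defines the covariance of the polygenic effect **u**, so that related individuals are allowed to have correlated phenotypes.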

Eight gene action models were tested for the tetraploid genotype calling: general, additive, simplex dominant alternative (simplex-dom-alt), simplex dominant reference (simplex-dom-ref), duplex dominant alternative (duplex-dom-alt), duplex dominant reference (duplex-dom-ref), diplo-additive, and diplo-general. According to Rosyara et al. (2016), the general type of genetic model allows the SNP effect for each genotypic class to be arbitrary and statistically equivalent. In the additive model the SNP effect is proportional to the dosage of the minor allele. In the simplex dominant models, all the heterozygotes (AAAB, AABB, ABBB) are equivalent to one of the homozygotes (AAAA or BBBB). In the duplex dominant models, the duplex state (AABB) has the same effect as either the simplex (AAAB) and nulliplex (AAAA) or the triplex (ABBB) and quadruplex (BBBB) states. In the diploidized models (diplo), all heterozygous classes have the same effect, resembling a traditional diploid dosage model (AA, AB, BB), and have gene action models encompassing the general and additive effects. The diploid genotype calling was also used for GWAS analyses, using the following gene actions: diplo-general, diplo-additive, simplex-dom-alt, and simplex-dom-ref.
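One way to picture these gene action models is as different recodings of the tetraploid dosage (0 to 4) into a row of the design matrix **S**. The sketch below is an assumed coding written for illustration, not GWASpoly's internal implementation: the function name is hypothetical, the additive model uses the alternative-allele dosage, and the "alt"/"ref" suffix is taken to name the allele that behaves as dominant.

```python
PLOIDY = 4


def design_row(dosage, model):
    """Return the row of S for one tetraploid individual at one SNP."""
    het = 0 < dosage < PLOIDY
    if model == "additive":
        return [dosage]
    if model == "general":
        # One indicator per genotype class, with nulliplex as the baseline.
        return [1 if dosage == k else 0 for k in range(1, PLOIDY + 1)]
    if model == "simplex-dom-alt":
        # All heterozygotes behave like BBBB.
        return [1 if dosage >= 1 else 0]
    if model == "simplex-dom-ref":
        # All heterozygotes behave like AAAA.
        return [1 if dosage == PLOIDY else 0]
    if model == "duplex-dom-alt":
        # AABB groups with ABBB and BBBB.
        return [1 if dosage >= 2 else 0]
    if model == "duplex-dom-ref":
        # AABB groups with AAAB and AAAA.
        return [1 if dosage >= 3 else 0]
    if model == "diplo-additive":
        # All heterozygous classes collapse to a single intermediate value.
        return [0 if dosage == 0 else 1 if het else 2]
    if model == "diplo-general":
        return [1 if het else 0, 1 if dosage == PLOIDY else 0]
    raise ValueError(f"unknown gene action model: {model}")


for model in ("additive", "simplex-dom-alt", "duplex-dom-ref", "diplo-additive"):
    print(model, [design_row(d, model)[0] for d in range(PLOIDY + 1)])
```

The general and diplo-general models use more than one column because each non-baseline genotype class receives its own effect; the remaining models test a single degree of freedom.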

Correction for multiple testing using a q-value threshold of 0.05 was applied to determine significant associations using the qvalue R package (Storey and Tibshirani, 2003). We also explored more and less conservative thresholds for declaring significance by using a Bonferroni correction of 0.05 and a q-value of 0.1, respectively. QQ-plots were used to evaluate the presence of confounding factors leading to an excess of significant associations.
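The difference between the two kinds of threshold can be sketched as follows. Note that the code uses Benjamini-Hochberg adjusted p-values as a simplified stand-in for Storey q-values: the qvalue method additionally estimates the proportion of true null hypotheses, which this sketch effectively fixes at 1, making it somewhat more conservative. The function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that controls the family-wise error rate."""
    return alpha / n_tests


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four SNPs.
pvals = [0.001, 0.04, 0.03, 0.20]
adj = bh_adjust(pvals)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)

# Bonferroni cutoff for the number of SNPs in the tetraploid set.
print(bonferroni_threshold(0.05, 80591))
```

With tens of thousands of markers, the Bonferroni cutoff becomes very small, which is why it is treated above as the more conservative option.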

The proportion of phenotypic variation explained by significant SNPs was approximated by the coefficient of determination (R<sup>2</sup>). The R<sup>2</sup> was estimated considering a linear regression model that included the first four principal components from PCA analyses, the SNP marker parameterized in accordance with the gene action and a vector of random residual effects.
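A minimal sketch of this kind of R<sup>2</sup> calculation is given below, using ordinary least squares with an intercept, covariate columns standing in for the principal components, and one SNP column. The helper names and the toy data are hypothetical, only one principal component is used for brevity, and the normal equations are solved directly, which is adequate for a handful of well-conditioned predictors but is not how a statistics package would do it.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 of an OLS fit of y on an intercept plus the given predictor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: phenotype, one principal component, additive SNP dosage.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
pc1 = [0.3, -0.1, 0.2, -0.4, 0.0]
snp = [0, 1, 2, 3, 4]
print(r_squared(y, [pc1]))
print(r_squared(y, [pc1, snp]))
```

For the general gene action model, the single `snp` column would be replaced by one indicator column per genotype class.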

### Candidate Gene Mining

SNPs were characterized in silico for their genomic position and functional effect. SNPs were annotated using snpEff v.4.3 (Cingolani et al., 2012), using the blueberry draft genome (2013 version) and gene predictions. Predicted gene models were retrieved from the bitbucket repository https://bitbucket. org/lorainelab/blueberrygenome (Gupta et al., 2015). Candidate genes surrounding significantly associated SNPs were annotated using the Blast2GO tool with BLASTp search against the nonredundant protein database (Götz et al., 2008). We also searched for Arabidopsis thaliana v. TAIR10 orthologs using Phytozome v.12.1 BLASTp tool (https://phytozome.jgi.doe.gov).

## RESULTS

#### Phenotypic Variation

A total of 1,575 blueberry plants from 117 crosses were phenotyped for eight fruit-related traits (yield, flower bud density, fruit weight, firmness, size, soluble solids content, pH, and scar diameter). Most traits followed a normal distribution, except yield which was evaluated on a 1-to-5 rating scale, and flower bud density which followed a Poisson distribution (**Figure 1**). High phenotypic correlation was only found between berry size and weight (r = 0.94) (**Figure 1**).

#### Genotypic Data

After filtering the genotypic data, a total of 1,557 individuals and 77,496 SNPs were maintained for the diploid analyses; while 1,559 individuals and 80,591 SNPs were considered for tetraploid analyses. SNPs were sampled throughout the genome, although not evenly distributed, which was expected due to the target design strategy used in this study (**Figure 2**).

Tetraploid and diploid pipelines identified 74,941 common SNPs (around 95% of overlap). We assumed that the differences between pipelines were due to the algorithms implemented in the FreeBayes software, which considers different criteria to define a SNP in each parameterization. As a consequence of the high degree of overlap, few differences were observed regarding the position and functional characterization of the SNPs in the blueberry genome (**Figures 3A,B**). Most SNPs were detected in non-coding regions; around 7% targeted exonic regions, mostly causing missense mutations. The distribution of the alternative allele frequency across loci was also similar for both approaches (**Figure 3C**), with the tetraploid model showing a slightly lower mean allele frequency (0.25 vs. 0.27). The main difference between the tetraploid and diploid scenarios was in the genotype calling (**Figure 3D**). For biallelic SNPs in autotetraploids, there are five possible genotypes, with three possible heterozygous states. For diploids, there are only three possible genotypes, with one heterozygous class. The probabilistic assignment of genotypes based on sequence read depth led to a higher heterozygosity for tetraploid compared to diploid genotype calling (0.42 vs. 0.34).

# Linkage Disequilibrium and Population Structure

The LD and population structure were consistent between the ploidy models. The trend of LD decay in the blueberry breeding population can be observed by the r<sup>2</sup> distribution across categories of base pair distances between SNPs in **Figure 4**. At the empirical threshold (r<sup>2</sup> = 0.2), LD decayed at a distance of 73 kb for the diploid model and 80 kb for the tetraploid model (Supplementary Figure 1). In order to verify the possible influence of the population structure in the GWAS analysis, we performed PCA and DAPC cluster analyses. The results for both standardizations were very similar, with the tetraploid matrix explaining slightly more of the population genetic variation (28.19 vs. 24.78%) (**Figure 5A**). The comparison of the BIC values for the DAPC analysis suggested the presence of 50 groups in the population (**Figure 5B**), which showed similarities with the pedigree recorded in the population. Hence, in the GWAS analyses, we used the PCA scores to control for population stratification and the genomic relationship matrix to control for cryptic relatedness.

FIGURE 2 | Distribution of filtered SNPs from the tetraploid pipeline in 100 kb windows across the 20 largest blueberry scaffolds (gray). The x-axis represents the distance in base pairs.

downstream regions refer to distances less than 5 kb from surrounding genes. (B) Functional effects of SNPs located in exonic regions. (C) Distribution of alternative ("B") allele frequency. (D) Distribution of genotypic frequencies across loci.

the first and second principal components. (B) Bayesian Information Criterion (BIC) for number of clusters ranging from 0 to 156.

# Associations Detected by Polyploid and Diploid Gene Action Models

We performed GWAS analyses for eight fruit-related traits using the **Q**+**K** linear mixed model. A total of 77,496 and 80,591 SNPs were regressed individually in the diploid and tetraploid GWAS models, respectively. Manhattan plots displaying the significance of each locus at its genomic location are shown in Supplementary Figures 2, 3. The inspection of QQ-plots did not show evidence of systematic bias in any trait or model evaluated (Supplementary Figures 4, 5).

pipelines, respectively. SNPs were concatenated by their position in the genomic scaffolds and are displayed along the circular Manhattan plots according to their adjusted p-value. The significance threshold (q-value = 0.05) is represented by the gray circle in each layer. Vertical dashed gray lines highlight the significant SNPs. The names of significant SNPs are listed outside of the plot. SNPs identified for diploid and tetraploid pipelines are in orange and blue, respectively; while the common SNP identified in both pipelines is in black.

Association analyses using the tetraploid genotypes and a q-value threshold of 0.05 allowed the identification of 23 significant SNPs associated with five traits, of which 11 were also significant after Bonferroni correction (**Table 1**). Six SNPs were identified by more than one gene action model. A total of 15 distinct SNPs were identified: seven for fruit size, two for scar diameter, three for soluble solids content, one for pH, and two for flower bud density (**Figure 6A**, **Table 1**). For fruit size, soluble solids content, and pH traits, dominance models were effective for detecting at least one association. However, the general model was the most effective at detecting associations. This class of model assumes that each genotype has its own effect and hence encompasses different gene actions. The inspection of the phenotypic variation across genotypes for significant SNPs identified by the general model suggested degrees of overdominance for some traits (e.g., see SNPs scaffold13749-868 and scaffold00818-130228 for the flower bud density trait) (Supplementary Figure 6). Under a less conservative threshold, the number of distinct associations increased from 15 (q-value < 0.05) to 37 (q-value < 0.1) and new associations were detected for fruit weight and firmness traits (**Figure 6A**, Supplementary Table 1). It is also noteworthy that the same SNP, located at scaffold00697, position 151000, was significantly associated with both fruit size and fruit weight, two highly correlated traits (**Figure 1**).
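The difference between the q-value and Bonferroni thresholds can be illustrated with a minimal sketch. It assumes that the q-values behave like Benjamini-Hochberg adjusted p-values; this section does not state which estimator was used, and the p-values below are toy numbers, not results from this study.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (used here as a stand-in for q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for six markers (illustration only).
toy_p = [0.0001, 0.004, 0.012, 0.03, 0.2, 0.6]

hits_fdr = [i for i, q in enumerate(bh_adjust(toy_p)) if q < 0.05]
hits_bonf = [i for i, b in enumerate(bonferroni(toy_p)) if b < 0.05]

print("significant at adjusted p < 0.05 (FDR):", hits_fdr)
print("significant after Bonferroni:", hits_bonf)
```

In this toy case every Bonferroni-significant marker is also FDR-significant, but not the reverse, which mirrors the pattern of 23 versus 11 significant SNPs reported above.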

Considering the diploid genotype calling and a q-value threshold of 0.05, we detected seven significant SNPs associated with two fruit-related traits (**Table 1**). Out of these, one association was significant after Bonferroni correction. We found three distinct SNPs associated with scar diameter and four with flower bud density (**Figure 6A**, **Table 1**). The general model was the most effective for all traits. Under a less conservative threshold, the number of distinct associations increased from 7 (q-value < 0.05) to 14 (q-value < 0.1) and new associations were detected for berry size, firmness, pH, and soluble solids content traits (**Figure 6A**, Supplementary Table 1).

Overall, more SNP-trait associations were identified by modeling the genotypes as tetraploid than as diploid (**Figures 6A,B**). Associations for fruit size, soluble solids content, and pH were only detected using tetraploid models, considering a q-value threshold of 0.05. However, there were four SNPs for flower bud density and one for scar diameter that were only detected by modeling diploid genotypes. Moreover, both models were able to detect the same two SNPs for scar diameter (**Figure 6C**, **Table 1**). No significant association was found for firmness, fruit weight, and yield traits with any ploidy and model tested under this moderate threshold.

## Candidate Genes Underlying Fruit-Traits Variation

We identified candidate genes flanking SNPs significantly associated with traits based on the annotation of the blueberry genome (see **Table 1** for q-value < 0.05 and Supplementary Table 1 for q-value < 0.1). Among the protein-coding genes surrounding the seven distinct SNPs associated with the fruit size trait, we found a putative lipase (CUFF.5533.1), a RING-type E3 ubiquitin ligase (CUFF.6059.2), a xyloglucan endotransglucosylase (CUFF.38641.1), a hypersensitive-induced response protein 1 (gene.g14573.t1), and a chloroplast rhomboid-like protease (CUFF.39364.1). Two SNPs in high LD and a few base pairs apart were located at the gene encoding the chloroplast rhomboid-like protease, one of them leading to a missense mutation (**Figure 7**).

For scar diameter, three distinct SNP-trait associations were detected. Annotation was found only for one of the surrounding genes, which encoded a pentatricopeptide repeat-containing protein (CUFF.20851.1).

Three significant SNPs were found for soluble solids content. Two SNPs occurred at genes potentially encoding proteins with a role in the ubiquitin-mediated protein degradation pathway: a ubiquitin-activating enzyme E1 (CUFF.53548.1) and an E3 ubiquitin ligase (CUFF.16799.1).

For the flower bud density trait, six significant SNPs were found, with four potentially causing missense mutations. Out of those, two SNPs in high LD were located at a gene encoding a zf-RVT domain-containing protein (CUFF.60704.1), one at a gene encoding heat shock protein hsp83-90 (CUFF.13871.1), and another at a gene encoding a kinase U-box domain-containing protein (CUFF.57663.1).

For the pH trait, no functional annotation was found for the flanking gene.

# DISCUSSION

GWAS analyses in autopolyploids impose additional steps not required in diploids, including the estimation of allele copy number and the use of genetic models that account for dosage effects (Garcia et al., 2013; Dufresne et al., 2014; Rosyara et al., 2016). To circumvent this problem, an alternative has been to use knowledge and methods applied to diploid species in polyploid analyses (Mollinari and Serang, 2015). In this work, we have demonstrated that imposing a diploid parameterization on a tetraploid species affects the results of a GWAS study. Furthermore, this study is the first to utilize association genetics to understand the genetic architecture and molecular basis of fruit-related traits in blueberry.

# How Does Ploidy Affect Population Parameter Estimation?

Prior to performing a SNP-trait association analysis, a detailed understanding of population structure and linkage disequilibrium is essential (Flint-Garcia et al., 2003). Therefore, we compared diploid and polyploid pipelines in terms of marker characterization and estimation of population genetic parameters.

The high degree of overlap between SNP loci identified by both pipelines suggested that the SNP calling step is not drastically affected by ploidy level. However, differences were observed in the genotype calling step, which affected the magnitude of the population parameters. The lower heterozygosity estimated by using diploid (0.34) rather than tetraploid (0.42) genotypes indicates that diploid standardization may cause an underestimation of the heterozygosity rates. Although heterozygosity is a populational parameter and therefore depends on the genetic background under analysis, the heterozygosity estimated in the tetraploid standardization is more in accordance with previous results reported for blueberry (Debnath, 2014; Tailor et al., 2017). Tetraploid highbush blueberry is primarily an outcrossing species with early-acting inbreeding depression (Krebs and Hancock, 1990). Therefore, higher levels of heterozygosity are indeed expected. Moreover, it is reasonable to assume a greater degree of heterozygosity in autopolyploid species in general, since more alleles at one locus are expected when compared to diploids (Gallais, 2003). High levels of heterozygosity have been reported in polyploid species due to its associated benefits, including buffering of deleterious mutations and heterosis (Comai, 2005).

In terms of population-based genomic association studies, it is well-known that population structure is one factor that can result in spurious associations, i.e., associations between a phenotype and markers that are not linked to any causative loci (Pritchard et al., 2000; Sillanpää, 2011). For diploid and tetraploid pipelines, the most likely number of groups in the DAPC analyses was in accordance with the pedigree recorded in the population. Based on the QQ-plot results, we inferred that the first four principal components and the genomic relationship matrices in each parameterization were sufficient to account for sample structure confounders. However, it is noteworthy that this conclusion is limited to our breeding population. In more complex pedigrees, for example, the use of a relationship matrix for autotetraploids might impact the final results (Kerr et al., 2012; Amadeu et al., 2016).

explained by the marker; NA, no annotation; \*Significant after Bonferroni correction (p < 0.05).

LD is another population parameter that significantly affects GWAS results. Assuming that association analyses rely on non-random association between SNPs and causative genes, determining the extent of LD is important to define strategies in GWAS analyses. For both pipelines, we observed a rapid LD decay across the blueberry scaffolds. Accordingly, low LD is reported in other outcrossing species (Gupta et al., 2005). For practical purposes, short LD blocks require a higher number of individuals with records and higher marker density in order to identify causal variants (Goddard et al., 2016). Hence, the usage of a high number of individuals and a high throughput genotyping method was consistent with our research scenario. The LD pattern can also provide information about the genetic diversity in our breeding population. Assuming that the expectation of r² can be expressed as a function of the effective population size (Ne), faster LD decay is expected as Ne increases (Flint-Garcia et al., 2003). Empirically, the short-range LD observed in our population suggests a large Ne value. This is in accordance with the breeding strategy at the University of Florida, as parental selection has been performed in order to decrease inbreeding depression, therefore maintaining genetic diversity (Cellon et al., 2018).
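The link between Ne and LD decay can be made concrete with Sved's classical approximation, E[r²] ≈ 1 / (1 + 4·Ne·c), where c is the recombination fraction between two loci. The sketch below is illustrative only: the formula assumes a random-mating diploid population at equilibrium (so it is at best a rough guide for an autotetraploid), it ignores the sampling correction for finite sample size, and the Ne and c values are hypothetical rather than estimates from this study.

```python
def expected_r2(ne, c):
    """Sved's approximation of the expected r^2 between two loci.

    ne: effective population size (assumed value, not estimated here)
    c:  recombination fraction between the two loci
    """
    return 1.0 / (1.0 + 4.0 * ne * c)


# Hypothetical scenario: loci separated by a recombination fraction of 0.001.
c = 0.001
for ne in (100, 1000, 10000):
    print(f"Ne = {ne:>5}: expected r^2 = {expected_r2(ne, c):.3f}")
```

For a fixed distance between loci, the expected r² falls as Ne grows, which is why short-range LD is read as evidence of a large effective population size.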

# SNP-Trait Associations in Autotetraploid Blueberries

Polyploid studies considering the relative abundance of each allele at a particular locus in the genome allow the testing of more realistic genetic models. For example, the usage of allele dosage has impacted the construction of genetic linkage maps (Mollinari and Serang, 2015), the computation of observed and expected allele frequencies (Dufresne et al., 2014), and the inference of population structure and patterns of historical demography (Blischak et al., 2016). On the basis of genome-wide association studies, our results supported the importance of including allelic dosage to identify significant SNP-trait associations. By modeling tetraploid genotypes under a q-value threshold of 0.05, at least one SNP-trait association was detected for five traits in a blueberry breeding population, and no associations were detected for fruit size, pH, and soluble solids traits when the dosage effect was omitted.

In addition to the allelic dosage, we also tested different gene action models. It is noteworthy that the genotypic value of an individual is estimated differently in polyploid and diploid species. In autotetraploids, the higher number of alleles per locus is reflected in different coefficients of dominance, increasing the range of genetic models to describe one-locus genotypic value (Gallais, 2003). In this study, dominance gene actions were addressed through the simplex and duplex dominance models. Simplex dominance represents the first order interaction among alleles and may be modeled regardless of the ploidy. Nevertheless, duplex dominance arises when heterozygotes are affected only if they have two unfavorable alleles; therefore, it is a model that can only be tested in polyploid systems. Associations under duplex dominance models were detected at a q-value threshold of 0.1 for flower bud density and firmness traits. Hence, our results reinforce the importance of considering an autotetraploid parameterization in blueberry.

We also tested "diploidized models" or "pseudodiploid models" using the tetraploid genotype calling, as they are widely used in polyploid analyses due to straightforward implementation in diploid software (Li et al., 2014b; Biazzi et al., 2017). This parameterization disregards the allele dosage and all heterozygotes are grouped into the same genotypic class, which is at the midpoint between the two homozygotes (Rosyara et al., 2016; Slater et al., 2016). In diploid species, this is equivalent to the additive model (parameterized as {0,1,2} and assuming that the SNP effect is proportional to the dosage of the minor allele). In autotetraploids, this parameterization might be interpreted as a partial dominance model suggesting that any order of interaction between alleles reduces the genotypic value (Gallais, 2003; Slater et al., 2016). Our results showed that "diploidized models" were valid for scar diameter, fruit size, and soluble solids traits under a q-value threshold of 0.05. Interestingly, the standard assumption of additivity was not the most appropriate to describe the phenotypic variation observed in blueberry. Divergent results were described in autopolyploid potatoes, for which most of the QTLs were identified considering additive models (Rosyara et al., 2016). Based on our results, we might infer that non-additive effects have a key role in understanding the genetic architecture of blueberry fruit traits.
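The gene action models discussed above differ only in how the five tetraploid dosage classes are mapped to a marker covariate. The sketch below shows one plausible coding, assuming the dosage counts copies of a single allele and that effects are rescaled to the 0 to 1 range; the function names are hypothetical, and the exact coding used in this study may differ (for instance, each dominance model also has a mirror-image version for the other allele).

```python
PLOIDY = 4  # autotetraploid: dosage of the counted allele runs from 0 to 4


def additive(dosage):
    """Effect proportional to allele dosage, rescaled to the 0..1 range."""
    return dosage / PLOIDY


def simplex_dominant(dosage):
    """One copy is enough: every heterozygote matches the quadriplex homozygote."""
    return 1.0 if dosage >= 1 else 0.0


def duplex_dominant(dosage):
    """At least two copies are needed to reach the quadriplex value."""
    return 1.0 if dosage >= 2 else 0.0


def diploidized_additive(dosage):
    """Dosage ignored: all heterozygotes sit midway between the two homozygotes."""
    if dosage == 0:
        return 0.0
    if dosage == PLOIDY:
        return 1.0
    return 0.5


MODELS = {
    "additive": additive,
    "simplex dominant": simplex_dominant,
    "duplex dominant": duplex_dominant,
    "diploidized additive": diploidized_additive,
}


def design_table():
    """Covariate value for each dosage class (0..4) under each model."""
    return {name: [f(d) for d in range(PLOIDY + 1)] for name, f in MODELS.items()}


for name, row in design_table().items():
    print(f"{name:<22} {row}")
# The general model is not a single covariate: it fits a separate effect
# for each of the five dosage classes.
```

Laid out this way, the diploidized coding collapses dosages 1, 2, and 3 into a single class, which is why it cannot distinguish simplex from duplex effects.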

Although we did not explicitly address models of partial interactions among alleles, they are potential models to be further implemented in GWAS analyses. Overdominance is particularly complex, since it can be explored by restricting interactions among alleles to different orders (Gallais, 2003). In this study, these genetic assumptions were implicitly considered in the general model. The general model is a generic class that also encompasses other models with no genetic assumptions (Rosyara et al., 2016). Not surprisingly, this model was able to identify the highest number of significant SNP-trait associations, with some overlap with the competing models.

However, considering a q-value threshold of 0.1, significant associations were identified by simplex and duplex models for soluble solids, flower bud density, and pH, but not by the general model. According to Rosyara et al. (2016), there is a trade-off between flexibility and power, because the general model requires a higher number of degrees of freedom, resulting in a lower statistical power.
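The trade-off can be sketched with a large-sample approximation. This is an illustration under stated assumptions, not the test used in this study: the marker test statistic $T$ is treated as chi-square distributed with $\nu$ degrees of freedom and non-centrality $\lambda$, where $\nu = 1$ for the additive, simplex, and duplex models and $\nu = 4$ for the general model (five dosage classes minus one).

$$
\text{power}(\nu) = \Pr\!\left[\chi^2_{\nu}(\lambda) > \chi^2_{\nu,\,1-\alpha}\right],
\qquad
\text{power}(1) > \text{power}(4) \ \text{ for equal } \lambda
$$

For the same signal strength $\lambda$, the critical value $\chi^2_{\nu,\,1-\alpha}$ grows with $\nu$, so the general model needs a stronger signal to reach significance. It gains power only when the true gene action departs enough from the single-parameter models to raise $\lambda$.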

The heritability estimate provided some insights into the results. Heritability is a population parameter that measures the degree of variation in a phenotypic trait that is due to genetic variation (Falconer and Mackay, 1996). Therefore, it is reasonable to expect a positive relation between heritability and ability to detect associations. In the current population, low to mid narrow-sense heritability was found for the traits, varying from 0.16 for flower bud density to 0.57 for scar diameter (for details, see Cellon et al., 2018). In line with this, individual markers explained a small portion of the phenotypic variation (less than 5%). These results suggest that all fruit-related traits analyzed herein are quantitative, which means that phenotypic variation depends on the cumulative actions of many genes with small effects and their interaction with environment.

# Biological Insights Into the Genetic Basis of Fruit-Related Traits in Blueberry

Among the significant SNPs associated with blueberry fruit-related traits, some did not lie in protein-coding regions and others caused synonymous changes. In the majority of the GWAS studies in plants, significant associations were also detected for variants in introns, untranslated, or intergenic regions (Ingvarsson and Street, 2011). Many of these variants can be in LD with an untyped causal non-synonymous mutation or might cause changes in gene expression (Gilad et al., 2008). In the case of blueberry, the absence of a high-quality reference genome is an additional challenge for GWAS analysis and biological interpretation. The currently available genome is very fragmented and many predicted genes are incomplete (Gupta et al., 2015). Hence, the biological significance of the associations found herein is still limited and speculative, but we point out some insights into the potential molecular mechanisms underlying the variation of each trait.

Larger fruits are a consumer-desired trait in the fresh blueberry market. Among the significant SNPs associated with fruit size, one caused a non-synonymous mutation in the putative gene encoding a chloroplast-located rhomboid-like protease. In A. thaliana, the lack of a rhomboid protease was associated with reduced fertility and aberrations in flower morphology (Knopf et al., 2012; Thompson et al., 2012). Changes in floral morphology and development can affect the fruit size and shape, as reported in tomato (Tanksley, 2004). However, to our knowledge, no study has reported the role of a rhomboid-like protease in fruit size variation. Another SNP associated with berry size occurred at a gene encoding a RING-type E3 ubiquitin ligase. Interestingly, a QTL for rice grain width and weight was also mapped to a RING-type protein with E3 ubiquitin ligase activity (Song et al., 2007). Song et al. (2007) suggested that this protein negatively regulates cell division by targeting its substrate(s) to proteasome degradation, since its loss of function resulted in increased cell number and larger (wider) rice spikelet hull. Another interesting SNP was the one located in a gene encoding a xyloglucan endotransglucosylase. This enzyme catalyzes the molecular grafting between xyloglucan molecules in the plant cell-wall matrix, allowing expansive cell growth by restructuring the cell wall (Miedes et al., 2011; Ohba et al., 2011). In transgenic tomatoes with modified expression of a xyloglucan endotransglucosylase gene, fruit size was positively correlated with the expression level of this enzyme (Ohba et al., 2011).

The picking scar size also affects blueberry commercialization, as bigger scars increase perishability and pathogen penetration (Parra et al., 2007). Among the associations detected for scar diameter, the most interesting was the SNP detected under a q-value threshold of 0.1, upstream of an auxin transporter 3, which controls cellular auxin influx. The major form of auxin, IAA (indole-3-acetic acid), is known to delay fruit abscission from the receptacle by reducing the sensitivity of cells in the abscission zone to ethylene (Blanusa et al., 2005; Kühn et al., 2016). The inhibition of polar auxin transport in grapevine fruitlets resulted in fruit drop (Kühn et al., 2016).

Soluble solid content and pH are important sensory quality factors affecting blueberry fruit flavor. Sweetness perception of fruits depends on the balance between sugars and acids (Cirilli et al., 2016; Farneti et al., 2017). For the sugar content, measured as the soluble solids content, two significant SNPs occurred at genes encoding proteins with a role in the ubiquitin-mediated protein degradation pathway. The attachment of ubiquitin molecules to selected proteins can have diverse regulatory functions, influencing the protein activity, abundance, trafficking, or localization (Stone, 2014). The ubiquitin-proteasomal degradation machinery is also involved in the regulation of sugar signaling pathways, which primarily targets the source-to-sink carbon partitioning (Rolland et al., 2006). The role of proteolysis in controlling sugar accumulation was also reported in tomato fruits (Ariizumi et al., 2011). For pH variation, no annotation was found for the predicted gene harboring the significant SNP, hindering biological insights at this point.

Flower bud density can be useful to estimate potential yield in the next harvest (Salvo et al., 2012). Among the significant associations with this trait, we found SNPs leading to missense mutations. One missense mutation occurred at the gene encoding a heat shock protein (hsp83-90). In Ipomoea nil (formerly Pharbitis nil, the Japanese morning glory), hsp83 was upregulated upon exposure to a photoperiod that induces flowering (Felsheim and Das, 1992). The heat shock protein Hsp90 was also reported to act as an environmental signal sensor regulating flowering time (Sangster et al., 2007) and flower development (Margaritopoulou et al., 2016). Another missense variant was found at a gene encoding a U-box domain-containing protein kinase. The U-box domain has ubiquitin ligase activity and the kinase motif suggests that this protein participates in signal transduction cascades via phosphorylation. The potential ortholog in Arabidopsis thaliana (At1g16760) is expressed during the pollen stage (Wang et al., 2008).

Fruit firmness is a trait of commercial importance as it directly affects fruit quality, shelf life, and transportability (MacLean and NeSmith, 2011); therefore, it is a key target for blueberry breeding. In this work, we identified associations only when we used a less stringent q-value threshold of 0.1; two missense variants were detected. One of the SNPs causing missense mutations was located at a putative ubiquitin-like-specific cysteine proteinase. Recent studies have shown the role of proteolysis in the regulation of fruit ripening in tomato (Wang et al., 2014, 2017). Particularly, a vacuolar cysteine proteinase (SlVPE3) was shown to affect the accumulation of numerous ripening-related proteins, acting as a posttranscriptional regulator (Wang et al., 2017). Moreover, Salentijn et al. (2003) found cysteine proteinases differentially expressed between firm and soft strawberry cultivars. The other missense variant associated with firmness was located in a SAM-MTase. SAM-MTases are ubiquitous enzymes that catalyze the transfer of methyl groups from S-adenosyl methionine (SAM) to a myriad of compounds (e.g., DNA, RNA, proteins, sterols, pectin, lignin, flavonoids, phenylpropanoids, and alkaloids) and also act in the biosynthesis pathway of ethylene and polyamines. Many of those compounds have an important role in fruit ripening (Moffatt and Weretilnyk, 2001; Roje, 2006; Teyssier et al., 2008; Singh et al., 2010; MacLean and NeSmith, 2011; Paul et al., 2012; Van de Poel et al., 2013; Zhang et al., 2015).

#### Current Challenges and Perspectives of GWAS in the Blueberry Breeding Program

Two of the major challenges faced in this study were the absence of a high-quality genome assembly for blueberry and the allelic dosage calling. We expect that improved genome contiguity will raise read alignment quality, providing more accurate SNP calling and a more precise location of the markers associated with traits. Dosage calling has also been recognized as a major challenge in genomic studies of polyploid species (Bourke et al., 2018), and it is an area that, when fully developed, could contribute significantly to association studies in autopolyploids. Population structure is another issue that could be affecting the current results. Controlling for population structure is a standard procedure in GWAS analyses, as we did by using the Q+K model; however, it reduces the statistical power to detect associations when phenotypes strongly correlate with relatedness (Reif et al., 2010; Brachi et al., 2011; Würschum et al., 2012; Ogut et al., 2015; Han et al., 2016; Klasen et al., 2016).

Our results suggested that blueberry fruit quality traits have a complex genetic basis. Therefore, the traditional implementation of marker-assisted selection using our GWAS results seems limited at this point. However, we emphasize that new associations with larger effects could be detected in future GWAS analyses using a complete genome assembly, higher marker density, and a more accurate dosage calling method. Alternatively, genomic selection is a promising approach for prediction of complex traits and it is an opportunity for future studies.

#### CONCLUSION

Altogether, in this study we demonstrated that simplifying tetraploid data as diploid can have significant consequences for some population genetic parameters and for the ability to detect marker-trait associations. The absence of associations detected by the conventional additive gene action model suggests that non-additive effects might play a key role in understanding the genetic architecture of blueberry fruit traits. Some of the significant SNPs were detected within and around biologically plausible candidate genes. The encoded proteins may act on pathways that affect the traits, as suggested by studies in other plant species. However, better gene prediction and functional validation of these genes will further improve our understanding of the variation of fruit-related traits in blueberry.

## DATA AVAILABILITY

Phenotypic and genotypic datasets used for diploid and tetraploid analyses are available from the Dryad Digital Repository (accession number doi: 10.5061/dryad.kd4jq6h).

### AUTHOR CONTRIBUTIONS

PM and JO designed the study. CC and JO conducted the field experiment and collected the phenotypic data. CC performed the DNA extraction. MR performed the SNP calling and filtering. LF, JB, and IdB performed the data analyses and interpretation. JB, LF, IdB, and PM wrote the paper. MR and MK provided analytical expertise and edited the manuscript. PM supervised the whole study. All authors read and approved the final version of the manuscript for publication.

#### FUNDING

This work was funded by the UF royalty fund generated by the licensing of blueberry cultivars. CC was partially supported, while at the University of Florida, by the National Institute of Food and Agriculture, United States Department of Agriculture, under award number 2014-67013-22418 to PM and JO. IdB was supported by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) [PSDE scholarship: 88881.131685/2016-01].

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2018.00107/full#supplementary-material

#### REFERENCES

their roles in controlling sugar accumulation in tomato fruits. J. Exp. Bot. 62, 2773–2786. doi: 10.1093/jxb/erq451


in higher polyploids: a comparative study in hexaploid chrysanthemum. BMC Genomics 17:672. doi: 10.1186/s12864-016-2926-5


Zhang, Z., Jiang, S., Wang, N., Li, M., Ji, X., Sun, S., et al. (2015). Identification of differentially expressed genes associated with apple fruit ripening and softening by suppression subtractive hybridization. PLoS ONE 10:e0146061. doi: 10.1371/journal.pone.0146061

**Conflict of Interest Statement:** JO and CC were affiliated to the University of Florida when the study started and by the time the manuscript was submitted they were employed by Driscoll's Inc., and Duda Farm Fresh Foods, respectively.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ferrão, Benevenuto, de Bem Oliveira, Cellon, Olmstead, Kirst, Resende, and Munoz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Genome Reduction in Tetraploid Potato Reveals Genetic Load, Haplotype Variation, and Loci Associated With Agronomic Traits

Norma C. Manrique-Carpintero<sup>1</sup>, Joseph J. Coombs<sup>1</sup>, Gina M. Pham<sup>2</sup>, F. Parker E. Laimbeer<sup>3</sup>, Guilherme T. Braz<sup>2</sup>, Jiming Jiang<sup>2,4</sup>, Richard E. Veilleux<sup>3</sup>, C. Robin Buell<sup>2,5</sup> and David S. Douches<sup>1</sup>\*

*<sup>1</sup> Potato Breeding and Genetics Program, Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, United States, <sup>2</sup> Department of Plant Biology, Michigan State University, East Lansing, MI, United States, <sup>3</sup> Department of Horticulture, Virginia Tech, Blacksburg, VA, United States, <sup>4</sup> Department of Horticulture, Michigan State University, East Lansing, MI, United States, <sup>5</sup> Plant Resilience Institute, Michigan State University, East Lansing, MI, United States*

#### Edited by:

*Richard John Abbott, University of St Andrews, United Kingdom*

#### Reviewed by:

*Laura M. Shannon, University of Minnesota Twin Cities, United States Ek Han Tan, University of Maine, United States*

> \*Correspondence: *David S. Douches douchesd@msu.edu*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Plant Science*

> Received: *15 March 2018* Accepted: *12 June 2018* Published: *03 July 2018*

#### Citation:

*Manrique-Carpintero NC, Coombs JJ, Pham GM, Laimbeer FPE, Braz GT, Jiang J, Veilleux RE, Buell CR and Douches DS (2018) Genome Reduction in Tetraploid Potato Reveals Genetic Load, Haplotype Variation, and Loci Associated With Agronomic Traits. Front. Plant Sci. 9:944. doi: 10.3389/fpls.2018.00944*

The cultivated potato (*Solanum tuberosum*) has a complex genetic structure due to its autotetraploidy and vegetative propagation, which lead to accumulation of mutations and a highly heterozygous genome. A high degree of heterozygosity has been considered to be the main driver of fitness and agronomic trait performance in potato improvement efforts, which is negatively impacted by genetic load. To understand the genetic landscape of cultivated potato, we constructed a gynogenic dihaploid (2*n* = 2*x* = 24) population from cv. Superior, prior to development of a high-density genetic map containing 12,753 single nucleotide polymorphisms (SNPs). Common quantitative trait loci (QTL) were identified for tuber traits, vigor and height on chromosomes 2, 4, 7, and 10, while specific QTL for number of inflorescences per plant, and tuber shape were present on chromosomes 4, 6, 10, and 11. Simplex rather than duplex loci were mainly associated with traits. In general, the Q allele (main effect) detected in one or two homologous chromosomes was associated with lower mean trait values suggesting the importance of dosage allelic effects, and the presence of up to two undesired alleles in the QTL region. Loss of heterozygosity has been associated with a lower rate of fitness, yet no correlation between the percent heterozygosity and increased fitness or agronomic performance was observed. Based upon linkage phase, we reconstructed the four homologous chromosome haplotypes of cv. Superior, revealing heterogeneity throughout the genome yet nearly duplicate haplotypes occurring among the homologs of particular chromosomes. These results suggest that the potentially deleterious mutations associated with genetic load in tetraploid potato could be mitigated by multiple loci, which is consistent with the theory that epistasis complicates the identification of associations between markers and phenotypic performance.

Keywords: haplotype, dihaploid, genetic load, complex traits, linkage map

# INTRODUCTION

Cultivated potato is an autotetraploid, highly heterozygous, and vegetatively propagated species. Tetrasomic inheritance comprises multiple genotypic configurations with up to four alleles and various combinations of alleles and dosage per locus. The more diverse the alleles at a locus, the greater the heterozygosity and number of allelic and epistatic interactions (Carputo and Frusciante, 2011). At any given locus of a tetraploid clone, there are up to three types of intra-locus interactions that could result in non-additive effects: first order (between two alleles), second order (among three alleles), and third order (among four alleles), while allelic dosage could mediate additive effects of intra-locus interactions. There may be more complexity for the optimal allelic combinations, locus interactions, and genetic effects when modeling quantitative traits. Elevated heterozygosity and genetic load have also been considered the main drivers of high and low vigor, respectively, associated with agronomic trait performance of cultivated potato. Inbreeding depression after self-pollination, and the superiority of tetraploid potato due to heterozygosity and polyploidy, established a breeding bias toward increased heterotic diversity (De Jong and Rowe, 1971; Mendoza and Haynes, 1974). Besides the effects of epistasis, gain or loss of allelic diversity could be responsible for heterosis and inbreeding depression, respectively; recessive undesired alleles in the homozygous state would be expected to decrease fitness whereas allelic diversity at heterozygous loci facilitates both dominant and overdominant effects (Miranda Filho, 1999; Ceballos et al., 2015). Consistent with this hypothesis, a large-scale genome resequencing survey of genetic load in asexually propagated cassava revealed that the number of deleterious mutations is greater in cultivated cassava compared to wild progenitors, as has been found in maize, sunflower and rice, and that cultivated cassava has a markedly greater number of mutations in the heterozygous rather than the homozygous state, which could mask the lethal effects of recessive deleterious mutations (Ramu et al., 2017; Wang et al., 2017). The asexual propagation and polyploidy of cultivated potato give the potential of retaining greater mutational load, and also the generation of genome plasticity that enhances adaptation to environmental changes. Copy number variation (CNV) in cultivated tetraploid potato is widely distributed throughout the genome and has been associated with lowly expressed genes and genes that respond to biotic and abiotic stress (Pham et al., 2017).

Dihaploids (2n = 2x = 24) from the cultivated tetraploid potato Solanum tuberosum (2n = 4x = 48) have been a valuable tool for genetic and cytogenetic studies as well as for breeding. Peloquin et al. (1991) reviewed the use of dihaploids to support evidence of tetrasomic inheritance, determine the basic chromosome number within the Solanum genus, discover meiotic mutations, understand ploidy and evolution, and assess sexual compatibility and hybridization barriers in potato. Dihaploid progeny of potato can be produced by anther culture or by chromosome elimination, sometimes referred to as "prickle pollination." Specific haploid-inducer lines induce chromosome elimination; following fertilization from a cross of a tetraploid maternal clone with a haploid inducer, the paternal chromosomes are selectively eliminated from the developing hybrid embryo. By introduction of a homozygous, dominant embryo spot marker into haploid inducers, gynogenic dihaploid seed can be selected by the absence of the purple embryo spot visible on the hypocotyl of embryos or seedlings (Hermsen and Verdenius, 1973). As gametic genotypic representations of autotetraploid potato, dihaploid populations can facilitate determination of the complex genetic structure of cultivated potato. The reduced genome complexity of dihaploids enables simpler segregation ratios than tetraploids and a better understanding of the genetic factors controlling traits of interest. Several dihaploid populations have been used to decipher monogenic or polygenic inheritance and gene action effects associated with morphologic, agronomic and disease resistance traits (Cipar and Lawrence, 1972; Matsubayashi, 1979; De Maine, 1984; Pineda et al., 1993; Song et al., 2005; Velasquez et al., 2007). 
Unilateral (4x × 2x) and bilateral (2x × 2x) crosses using dihaploids have also served as a bridge to generate simpler and more efficient breeding schemes, overcome hybridization barriers, and achieve introgression of adaptive traits in the cultivated potato (Chase, 1963; Peloquin et al., 1991; Rokka, 2009).

The potato cultivar "Superior" was released in 1962 by the University of Wisconsin as a round white variety with scab resistance and medium maturity (Rieman, 1962). Currently, it is grown in the USA and Canada as a fresh market variety. Dihaploid populations of a potato variety exhibit uniparental segregation following genome reduction. Using cv. Superior as a model, we generated a dihaploid population to observe the effects of unmasking genetic load on different agronomic traits and to identify the main genomic regions associated with trait performance, in order to understand the genetic complexity of tetraploid potato.

## MATERIALS AND METHODS

#### Plant Material

A gynogenic dihaploid (2n = 2x = 24) population of 95 individuals was created from S. tuberosum Group Tuberosum tetraploid cv. Superior. The S. tuberosum Group Phureja haploid inducer IVP101, homozygous dominant for an embryo seed spot marker, was used as the pollinator. Seeds lacking a purple spot were grown, and leaf tissue from in vitro plantlets was subjected to flow cytometry to identify dihaploids (Owen et al., 1988). Peaks were compared to known monoploid and diploid controls.

#### Genotyping

DNA was isolated from leaf tissue of the "Superior" parent and 95 dihaploid gynogenic progeny and Illumina compatible paired end libraries were constructed as described previously (Hardigan et al., 2016). Libraries were skim-sequenced on the Illumina HiSeq 2000 platform at low coverage, with a theoretical approximation of 8x coverage of the genome, to identify single nucleotide polymorphic (SNP) segregating markers. Adapters and low quality bases were removed from the raw reads using Cutadapt v. 1.8.1 (Martin, 2011) and cleaned reads were aligned using BWA-MEM v. 0.7.11r1034 (Li, 2013) to the S. tuberosum Group Phureja DM 1-3 516 R44 reference genome v4.04 (Hardigan et al., 2016). Genotypes were called using the GATK Unified Genotyper (McKenna et al., 2010). Markers with unexpected segregation, distorted segregation (Chi-square threshold P-value <0.01), and singleton markers (SNPs without any duplicates) or markers with just one duplicate were removed. The remaining high quality markers were used for map construction after excluding duplicate co-segregating markers. Raw sequences are available in the National Center for Biotechnology Information Sequence Read Archive under BioProject ID PRJNA335821.
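The segregation-distortion filter can be illustrated with a chi-square goodness-of-fit check. The following is a minimal sketch under stated assumptions, not the pipeline used in this study: the function names and example counts are hypothetical, the expected ratios (1:1 for simplex and 1:4:1 for duplex markers) are those described for this population in the linkage map section, and the P-value uses the closed forms available for 1 and 2 degrees of freedom.

```python
import math

def chi_square_stat(observed, expected_ratio):
    """Pearson chi-square statistic for observed counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    stat = 0.0
    for obs, part in zip(observed, expected_ratio):
        expected = total * part / ratio_sum
        stat += (obs - expected) ** 2 / expected
    return stat

def chi_square_p(stat, df):
    """Upper-tail P-value; closed forms exist for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

def keep_marker(observed, expected_ratio, alpha=0.01):
    """Return True when segregation is NOT significantly distorted at alpha."""
    stat = chi_square_stat(observed, expected_ratio)
    p_value = chi_square_p(stat, df=len(observed) - 1)
    return p_value >= alpha

# Hypothetical genotype counts in a 95-individual population.
print(keep_marker([50, 45], (1, 1)))         # simplex marker, AA:AB
print(keep_marker([70, 25], (1, 1)))         # strongly distorted simplex marker
print(keep_marker([16, 63, 16], (1, 4, 1)))  # duplex marker, AA:AB:BB
```

Only the distortion criterion is sketched here; the removal of singleton and near-singleton markers described above is a separate step.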

#### Linkage Map and Quantitative Trait Locus Analysis

The TetraploidSNPMap software for biallelic SNP markers (Hackett et al., 2017), informative for allele dosage in an autotetraploid species, was used to generate the genetic map and quantitative trait locus (QTL) analysis. As a unique parent population, only simplex (AAAB, ABBB) and duplex (AABB) marker configurations in "Superior" were segregating in the dihaploid population. The expected segregation for simplex markers in the diploid progeny corresponded to a 1:1 homozygous: heterozygous genotypic ratio (AA:AB, BB:AB), and for duplex markers to a 1:4:1 genotypic ratio (AA:AB:BB). However, four homologs per parental chromosome are segregating in this population. Thus, the segregation obtained in the dihaploid progeny fits the autotetraploid segregation for a cross with a null male parent for simplex (AAAB × AAAA, ABBB × BBBB) and duplex (AABB × AAAA, AABB × BBBB) markers. The marker configurations of the different genotypes were recoded according to TetraploidSNPMap code (AAAA = 0, AAAB = 1, AABB = 2, ABBB = 3, and BBBB = 4). For simplex segregation (AA:AB = AAAA:AAAB, BB:AB = BBBB:ABBB), genotypes were recoded as 0 and 1; while for duplex segregation (AA:AB:BB = AAAA:AABB:BBBB), genotypes were recoded as 0, 1, and 2.
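The expected 1:1 and 1:4:1 ratios follow from drawing two of the four parental homologs at random. The sketch below enumerates the six possible pairs; it assumes random chromosome segregation with no double reduction, and the function name `dihaploid_ratios` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def dihaploid_ratios(parent):
    """Expected dihaploid genotype frequencies from a tetraploid parent.

    `parent` is a four-character string such as "AAAB". Each of the six
    ways of choosing 2 of the 4 homologs is taken as equally likely
    (no double reduction is modeled).
    """
    counts = Counter("".join(sorted(pair)) for pair in combinations(parent, 2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

for parent in ("AAAB", "AABB", "ABBB"):
    print(parent, dihaploid_ratios(parent))
```

For a simplex parent the two genotype classes are equally frequent, and for a duplex parent the heterozygote accounts for four of the six combinations, which gives the 1:4:1 ratio.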

The linkage map was constructed according to Hackett et al. (2013). The different mapping steps were implemented in TetraploidSNPMap: analysis of single marker segregation; clustering into linkage groups; estimation of recombination frequency and the logarithm (base 10) of the odds ratio for linkage (LOD score); and ordering and inference of SNP linkage phase (Hackett et al., 2017). A preliminary clustering test of simplex SNPs was done using JoinMap 4.1 (Van Ooijen, 2006). Markers were coded for cross-pollinated population type (<lmxll>). This step allowed identification and exclusion of problematic markers that did not cluster as part of linkage groups. In the mapping process in TetraploidSNPMap, problematic and near-duplicate markers were also detected and excluded; these mainly corresponded to outliers in the clustering and metric multidimensional scaling (MDS) ordering steps. A high concordance between genetic and physical maps has been reported for potato mapping populations (Felcher et al., 2012; Sharma et al., 2013). As a final quality control of the generated linkage maps, marker genetic positions (cM) were plotted against their physical positions (Mb) on each chromosome to generate Marey maps (Chakravarti, 1991).

Square root transformation of phenotypic data was performed to improve QTL detection. QTL interval mapping with a step size of 1 cM was used to identify QTL. A logarithm of the odds (LOD) threshold calculated from 500 permutations was used to detect significant marker associations. Next, the trait was modeled as an additive function of the QTL allele effect on each of eight homologous chromosomes (four for each parent). In addition to a full additive model, any of four simple QTL models could be fit in this population with single-parent segregation: a simplex QTL model (Qqqq × qqqq), where the Q allele drives the main effect; a duplex QTL model (QQqq × qqqq) with additive effects of the Q allele (the QTL genotypes qqqq, Qqqq, QQqq have means of m, m+Q, m+2Q); a duplex QTL model with non-additive effects of the Q allele (the QTL genotypes qqqq, Qqqq, QQqq have different means m1, m2, m3); and a duplex QTL model with dominant effects of the Q allele (two mean categories for the QTL genotypes qqqq and Q\_qq). A QTL was considered to fit a simple model when its Schwarz Information Criterion (SIC) value (Schwarz, 1978) was smaller than or close to that of the full model and differed by at least 2 units from those of the other simple models.
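The idea behind a permutation-based LOD threshold can be sketched as follows. This is an illustration only and not the TetraploidSNPMap implementation: it substitutes a single-marker one-way model for interval mapping, the function names are hypothetical, and the 95th percentile of the permutation maxima is an assumed significance level that is not stated in the text.

```python
import math
import random

def single_marker_lod(genotypes, phenotypes):
    """LOD for a one-way model in which the trait mean differs by genotype class."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    if rss_null == 0.0:
        return 0.0  # constant phenotype: no evidence for any effect
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    rss_full = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rss_full += sum((y - mean) ** 2 for y in values)
    if rss_full == 0.0:
        return math.inf  # perfect fit of the genotype-class means
    return (n / 2.0) * math.log10(rss_null / rss_full)

def permutation_threshold(marker_matrix, phenotypes, n_perm=500, quantile=0.95, seed=1):
    """Quantile of the genome-wide maximum LOD over shuffled phenotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        max_lods.append(max(single_marker_lod(m, shuffled) for m in marker_matrix))
    max_lods.sort()
    index = min(len(max_lods) - 1, int(quantile * len(max_lods)))
    return max_lods[index]

# Simulated example: 40 individuals, two markers, trait driven by the first marker.
rng = random.Random(0)
marker_a = [i % 2 for i in range(40)]
marker_b = [(i // 2) % 2 for i in range(40)]
trait = [10.0 + 3.0 * g + rng.gauss(0.0, 1.0) for g in marker_a]
threshold = permutation_threshold([marker_a, marker_b], trait, n_perm=500)
print(single_marker_lod(marker_a, trait) > threshold)
```

Taking the maximum LOD across all markers in each permutation is what makes the threshold genome-wide rather than per-marker.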

#### Field Evaluations

The dihaploid population was grown from greenhouse- or field-produced tubers at the Montcalm Research Center, Lakeview, MI (MRC) and the Botany and Plant Pathology Farm, East Lansing, MI (BPP) of Michigan State University over 2 years. In 2014, greenhouse-grown tubers were harvested in March and planted at MRC. In 2015, tubers produced in the field in 2014 in addition to greenhouse-grown tubers, harvested between February and March, were planted at MRC and BPP, respectively. In 2015, greenhouse tubers were subjected to a Rindite treatment to break dormancy prior to planting (Varga and Ferenczy, 1956). Thus, a total of three location/year datasets is reported in this study. All trials had a randomized complete block design using plots of eight plants as the experimental unit and three replications per clone. Parents and progeny were evaluated for eight traits: total tuber yield (TTY) measured as g/plant, average tuber weight (ATW) in g, tuber set (TS) as number of tubers per plant, plant vigor (Vigor) scored as overall plant canopy development ∼3 months after planting using a 1–5 scale (1: low vigor, 5: high vigor), plant height (Height) in cm assessed when plants started flowering, number of inflorescences per plant counted after a line initiated flowering in the plot (Infl/plant), specific gravity (SPGR) calculated using the formula [air weight/(air weight − water weight)] for a minimal sample size of 1 kg/plot, and tuber shape (Shape) scored using a 1–5 scale (1 = compressed, 2 = round, 3 = oval, 4 = oblong and 5 = long).
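As a small worked example of the specific gravity formula given above, the sketch below computes SPGR from the weight of a tuber sample in air and in water. The function name and the example weights are hypothetical and are not data from this study.

```python
def specific_gravity(air_weight, water_weight):
    """Specific gravity = air weight / (air weight - water weight).

    Both weights must be in the same unit (e.g., grams).
    """
    if air_weight <= water_weight:
        raise ValueError("air weight must exceed the weight in water")
    return air_weight / (air_weight - water_weight)

# Hypothetical 1,000 g sample that weighs 75 g when submerged.
print(round(specific_gravity(1000.0, 75.0), 3))
```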

#### Heritability and Correlation Analysis

The restricted maximum likelihood method (REML) was used to calculate broad-sense heritability (H<sup>2</sup> ) with clones as random effects and site-year environments as random fixed effects. The heritability was estimated on a genotype mean basis as the ratio of:

$$H^2 = \frac{\sigma_g^2}{\sigma_g^2 + \frac{\sigma_{g \times sy}^2}{m} + \frac{\sigma_e^2}{rm}}$$

where σ<sub>g</sub><sup>2</sup>, σ<sub>g×sy</sub><sup>2</sup>, and σ<sub>e</sub><sup>2</sup> are the genetic, genotype × site-year environment interaction, and residual variance components, respectively, m is the number of site-year environments and r is the number of replications.
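The genotype-mean heritability can be computed directly from the three variance components. The sketch below assumes, following the definitions given with the formula, that the interaction variance is divided by m and the residual variance by r × m; the function name and the example values are hypothetical.

```python
def broad_sense_heritability(var_g, var_gxe, var_e, n_env, n_rep):
    """Broad-sense heritability on a genotype-mean basis.

    var_g   : genetic variance component
    var_gxe : genotype x site-year interaction variance component
    var_e   : residual variance component
    n_env   : number of site-year environments (m)
    n_rep   : number of replications per environment (r)
    """
    denominator = var_g + var_gxe / n_env + var_e / (n_rep * n_env)
    return var_g / denominator

# Hypothetical variance components with m = 3 environments and r = 3 replications.
print(broad_sense_heritability(4.0, 2.0, 6.0, n_env=3, n_rep=3))
```

Because the interaction and residual terms shrink as m and r grow, the same variance components yield a higher H<sup>2</sup> in trials with more environments or replications.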

Pearson correlation was used to estimate correlations between traits among site-year environments using the REML method when samples were missing. Means, variances, correlation and distribution analyses were calculated using JMP® 10 (SAS Institute Inc., Cary, NC, USA).

#### Fluorescence in Situ Hybridization

Chromosome preparation and fluorescence in situ hybridization (FISH) were performed using published protocols (Braz et al., 2018). Individual potato chromosomes of "Superior" were identified using two "barcode probes," which contain 27,306 and 27,366 oligonucleotides (45 nt), respectively, derived from 26 different regions on the 12 potato chromosomes. These two probes produce 26 distinct FISH signals. Each of the 12 potato chromosomes is labeled with a distinct signal pattern (Braz et al., 2018). FISH images were captured using a QImaging Retiga EXi Fast 1394 CCD camera and were processed with Meta Imaging Series 7.5 software. The final contrast of the images was processed using Adobe Photoshop CS3 software.

#### Rescue of Dwarf Mutant With Gibberellic Acid Treatment

Plantlets of the dihaploid VT\_SUP\_46 from the "Superior" dihaploid population, maintained in vitro, were obtained after subculture on regular MS medium (Murashige and Skoog, 1962) (MS basal medium with vitamins + 3% sucrose + 0.6% plant agar micropropagation grade, pH 5.8, reagents from Phytotechnology Laboratories, Shawnee Mission, KS, USA). Assay tubes with plantlets were placed in a growth room at 22°C with a 16-h photoperiod. Plantlets were grown in regular MS medium and in medium supplemented with 0.3 mg/l of zeatin riboside (ZR, trans isomer, Sigma, St Louis, MO, USA) and two concentrations of gibberellic acid (GA<sub>3</sub>, Research Products International, Mt Prospect, IL, USA), 0.02 or 0.2 mg/l, to rescue the unique dwarf phenotype observed in this clone.

## RESULTS

#### Phenotypic Performance

A total of 95 dihaploid clones was generated from crosses of cv. Superior with IVP101; however, due to the low vigor of many of the dihaploids, there was limited production of planting material in the greenhouse and/or delayed emergence in the field, such that between 50 and 75 individual clones were evaluated for the various traits under field conditions. For SPGR, 39 clones could be assessed in the MRC-2014 trial (**Table 1**). A wide range of variation of traits was observed in the population (**Table 1**). The quantile diagnostic plots showed a trend toward normal distribution for the population means of TS, Height, SPGR, and Shape, while bimodal normal distributions were observed for the means of TTY, ATW, Vigor and Infl/plant (Supplementary Figure 1). Transgressive segregation for TS, Vigor, Height, Infl/plant, and SPGR was detected in the progeny. For TTY and ATW, a few progeny performed similarly to the parental line, skewing the distribution toward greater values. Similarly, a few individuals with many Infl/plant skewed the distribution of this trait, especially in the MRC-2014 and BPP-2015 site-year environments. Even though the normal distribution for SPGR was not affected by the skewness, there was a tendency toward low values, as shown by the negative skewness.

Overall, the number of days after planting required for 75% of plants per clone in a plot to emerge varied from 24 to 100 days. MRC-2014 had a significantly greater average number of days (63) to emerge compared with 52 and 42 days for BPP-2015 and MRC-2015, respectively (P-value <0.0001). The higher correlation between MRC-2014 and BPP-2015 emergence data (0.53, P-value <0.0001), and the lower correlations of these locations with MRC-2015 (0.36 and 0.30, respectively; P-value < 0.01 and 0.02, respectively), showed that the greenhouse-produced tubers used as planting material at both locations had a similarly longer emergence period compared to the field-grown tubers from the previous season that were used as seed for MRC-2015; this seed source appeared to be the main driver of more efficient emergence for most of the dihaploid clones (35.9 days for 75% of the population). The planting material did not have a critical effect on the reproducibility of data, as the broad-sense heritability was greater than 0.7 for all traits (**Table 2**). Comparison of correlations among data from different locations for the same trait revealed that nearly all correlations were greater than 0.5 with P-values <0.0001. A low correlation was observed only for SPGR in BPP-2015 with MRC-2014 and MRC-2015 (0.24 and 0.4; P-values < 0.01 and 0.004, respectively), whereas the correlation between MRC-2014 and MRC-2015 for SPGR was 0.85 (P-value <0.0001). The low tuber yield for many individuals limited the total number of progeny evaluated for SPGR; therefore, this trait was excluded from the QTL analysis.

High positive correlations among TTY, TS, ATW, Height, and Vigor were observed for all 3 site-year environments (**Tables 3**–**5**). Infl/plant showed high and moderate positive correlations for all 3 site-year locations. Tuber shape had low to no correlation with the other traits. The longer emergence period was highly correlated with low Height and Vigor for all three environments, while moderate to low negative correlations were observed between emergence and the three tuber traits, TTY, ATW, and TS.

A site-year environmental effect was detected for all traits except ATW. MRC-2015 showed significantly greater mean values and MRC-2014 the lowest for TS, Height, and Vigor, while BPP-2015 had the greatest values for Infl/plant and Shape, and BPP-2015 and MRC-2015 for TTY [P-values <0.001 for all except ATW (0.072) and Shape (0.008)]. For 53 dihaploid clones with full data, we detected significant (P-value <0.0001) genotype-environment interactions in all locations for TTY, TS, ATW, Vigor, Height, and Shape.

#### Linkage Map

A high-density genetic map was built for the 95-progeny "Superior" dihaploid population. After filtering to identify high-quality segregating markers (Supplementary Table 1), we identified 12,753 polymorphic SNPs that were successfully mapped (**Table 6** and Supplementary Table 2). The SNPs were located mainly in intergenic regions (10,159, 79.7%) rather than genic regions (2,594, 20.3%), with 746 in exons and 1,970 in introns, and 120 overlapping both positions due to alternative splicing. The genetic map has a length of 1,299.1 cM with 819 to 1,374 SNPs per chromosome. The average inter-locus distance was 0.7 cM with a genome coverage of 99.3% relative to the 12 chromosomes in the current potato genome assembly.

TABLE 1 | Frequency distribution statistics of cv. Superior and its dihaploid population (Pop) for 3 site-year environments [Montcalm Research Center (MRC) in 2014 and 2015, and Botany and Plant Pathology Farm (BPP) in 2015].

*Total tuber yield (TTY) in g/plant, average tuber weight (ATW) in g, tuber set (TS) as number of tubers per plant, plant height (Height) in cm, plant vigor (Vigor) 1: low vigor, 5: high vigor, number of inflorescences per plant (Infl/plant), specific gravity (SPGR), and tuber shape (Shape) 1 = compressed, 2 = round, 3 = oval, 4 = oblong and 5 = long.*

TABLE 2 | Heritability for eight agronomic traits evaluated in the "Superior" dihaploid population.

*Total tuber yield (TTY) in g/plant, average tuber weight (ATW) in g, tuber set (TS) as number of tubers per plant, plant height (Height) in cm, plant vigor (Vigor) 1: low vigor, 5: high vigor, number of inflorescences per plant (Infl/plant), specific gravity (SPGR), and tuber shape (Shape) 1 = compressed, 2 = round, 3 = oval, 4 = oblong and 5 = long.*

#### QTL Identified

In general, common QTL were identified on chromosomes 2, 4, 7, and 10 for TTY, TS, ATW, Height, and Vigor, while specific QTL were identified for Infl/plant and Shape on chromosomes 4, 6, 10, and 11 (**Figure 1**). In some cases, the QTL were not identified in all site-year environments for each trait, as the peak was not always significant or not detected. **Table 7** summarizes the QTL chromosome locations, phenotypic variation, QTL genetic model and homologous chromosomes associated with the Q allele effect. For most of the QTL, the closest SNPs to the QTL peak, cosegregating in phase with the Q alleles, were also reported.

#### Fluorescence in Situ Hybridization

Even though a strict quality filtering process was used in the selection of markers for linkage mapping, the construction of genetic maps for chromosomes 4 and 11 was especially problematic. We conducted oligo-based fluorescence in situ hybridization (Oligo-FISH) (Braz et al., 2018) to examine whether the four copies of chromosomes 4 and 11 in "Superior" show visible structural variation. All 48 chromosomes could be individually identified based on the Oligo-FISH signal patterns (**Figure 2**). We did not observe any unambiguous chromosome structural changes associated with chromosomes 4 and 11. However, three copies of chromosome 4 contain a visible heterochromatic knob in the short arm, whereas the remaining copy of chromosome 4 does not contain the knob (**Figure 2**).

TABLE 3 | Correlation analysis for field season at Montcalm Research Center in 2014 for nine traits.

*Total tuber yield (TTY) in g/plant, average tuber weight (ATW) in g, tuber set (TS) as number of tubers per plant, plant height (Height) in cm, plant vigor (Vigor) 1: low vigor, 5: high vigor, number of inflorescences per plant (Infl/plant), specific gravity (SPGR), and tuber shape (Shape) 1 = compressed, 2 = round, 3 = oval, 4 = oblong and 5 = long, 75% of emergence number of days after planting (75% Emerg). Significant positive correlation dark green P-value < 0.0001, light green P-value < 0.05, significant negative correlation dark red P-value < 0.0001, intermediate red P-value < 0.001, and light red P-value < 0.001.*

#### Double Reduction Leads to a Dwarf Mutant

A dark green and rosette dwarf phenotype that can be rescued by GA<sub>3</sub> application has been reported in hybrid progeny of cv. Superior as well as some other potato diploid and tetraploid clones (Bamberg and Hanneman, 1991; Valkonen et al., 1999). A single dihaploid, VT\_SUP\_46, within our "Superior" dihaploid population has a strong dwarf phenotype (**Figure 3**). Treatment of in vitro plantlets of VT\_SUP\_46 on propagation medium supplemented with GA<sub>3</sub> (0.02 and 0.2 mg/l) resulted in rescue from the dwarf phenotype (**Figure 3**).

## DISCUSSION

#### Genetic Load Unmasked in the cv. Superior Dihaploid Population

As reported previously (Peloquin and Hougas, 1960; De Maine, 1984; Kotch et al., 1992; Hutten et al., 1995), segregation of a tetraploid parent configuration in a gametic dihaploid population leads to breakdown of allelic combinations and interactions, and to unmasking of the genetic load due to homozygosity of recessive alleles and/or the effects of dysfunctional alleles. A dihaploid population has an expected reduction of heterozygosity equivalent to three generations of self-pollination of an autotetraploid, which increases the probability of a homozygous state of recessive and deleterious alleles (Peloquin and Hougas, 1960). A recessive, sub-lethal allele present in duplex configuration at a locus in the parental line becomes homozygous, causing weakness or loss, in 17% of the progeny; this fraction rises to 50% when the allele is in triple dose in a triplex parent genotype, and the allele would not be detected in a simplex configuration (Hutten et al., 1995). In complex traits, several genes and their contribution to the genetic structure of the trait will influence the magnitude of the effect of recessive or sub-lethal alleles, producing a wide range of variation in phenotype. Hutten et al. (1995) evaluated 31 different dihaploid populations, reporting some levels of dwarfism, wide variation between populations in tuberization ability, and low frequencies of flowering and pollen stainability. In fact, low fitness phenotypes prevented 35.8% (MRC-2014), 22.1% (BPP-2015), and 40% (MRC-2015) of the "Superior" dihaploid population from being evaluated in the different site-years. Almost 20% of "Superior" dihaploids were never evaluated in the field due to extremely low vigor (**Figure 4**).

TABLE 5 | Correlation analysis for field season at Montcalm Research Center in 2015 for nine traits.

*Total tuber yield (TTY) in g/plant, average tuber weight (ATW) in g, tuber set (TS) as number of tubers per plant, plant height (Height) in cm, plant vigor (Vigor) 1: low vigor, 5: high vigor, number of inflorescences per plant (Infl/plant), specific gravity (SPGR), and tuber shape (Shape) 1 = compressed, 2 = round, 3 = oval, 4 = oblong and 5 = long, 75% of emergence number of days after planting (75% Emerg). Significant positive correlation dark green P-value < 0.0001, light green P-value < 0.05, significant negative correlation dark red P-value < 0.0001, intermediate red P-value < 0.001, and light red P-value < 0.001.*

TABLE 6 | "Superior" linkage map length in centimorgans (cM), physical length in megabase pairs (Mb), and features of mapped single nucleotide polymorphisms (SNPs).

*Chromosome (Chr), potato genome sequence consortium assembly version 4.03 (PGSC v 4.03), unique segregating markers (Seg).*

Using the theory proposed by Fasoulas (1988) where greater genetic load affecting a trait would increase the coefficient of variation (%CV), causing negative kurtosis and positively skewing the trait frequency distribution of a dihaploid population, Kotch et al. (1992) studied the frequency distribution statistics of several dihaploid potato populations. Skewness, kurtosis, and the inbreeding depression coefficient (relative percentage of dihaploid population mean compared to tetraploid parent mean) were used to indicate the type of gene action affecting different traits. In general, a negligible or low inbreeding depression coefficient, close to zero or minor skewness, negative or zero kurtosis, and low %CV indicated that the 4x parent had genes with primarily additive effects and low genetic load associated with the trait. In contrast, significant positive skewness, positive kurtosis, high %CV, and a high inbreeding depression coefficient are suggestive that the 4x parent primarily has genes with non-additive effects associated with the trait. In this "Superior" dihaploid population (**Table 1**), SPGR is potentially a trait mainly governed by genetic factors with additive effects and low genetic load, while for TTY and ATW genes that have non-additive effects and greater genetic load are suggested. For traits where the distribution statistics fit in the middle of these parameters, both additive and non-additive genetic effects could be mediating the phenotype; this is the case for TS, Height, Vigor, and Infl/plant.
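The distribution statistics discussed above can be computed from clone means as in the following sketch. It rests on several assumptions that are not specified in the text: skewness and excess kurtosis are calculated from population (biased) moments, %CV uses the sample standard deviation, and the comparison with the parent is reported simply as the population mean expressed as a percentage of the tetraploid parent mean, which may differ from the exact inbreeding depression coefficient of Kotch et al. (1992). The function name and the example values are hypothetical.

```python
import math

def distribution_stats(values, parent_mean):
    """Summary statistics for a trait measured on dihaploid clones."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sample_sd = math.sqrt(m2 * n / (n - 1))
    return {
        "cv_percent": 100.0 * sample_sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
        "percent_of_parent": 100.0 * mean / parent_mean,
    }

# Hypothetical yield-like data: most clones low, a few approaching the parent.
clone_means = [1.0, 1.0, 1.0, 2.0, 10.0]
print(distribution_stats(clone_means, parent_mean=12.0))
```

A distribution in which a few clones approach the parental value while most remain low, as in this toy example, produces the positive skewness described for TTY and ATW.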

By comparing the performance of different populations, Kotch et al. (1992) highlighted that a trait with similar %CV could have different inbreeding depression coefficients, which implies the importance of non-additive gene control rather than genetic load (fixation of deleterious genes). Evaluations of trait performance in inbred generations and diallel crosses of outcrossing species (e.g., maize, cassava) suggest that the relevance of non-additive effects increases with the genetic complexity of a trait, and that a strong inbreeding depression effect will also be associated (Ceballos et al., 2015). Non-additive effects driving heterosis (dominance, overdominance, and epistasis) are particularly important for grain yield and fresh root/tuber yield. A similar complexity is suggested in potato yield traits, such as tuber yield with strong inbreeding depression as described in this analysis and reported in populations of self-pollinated tetraploids (Golmirzaie et al., 1998).

#### Genome Heterogeneity of cv. Superior

A high-density genetic linkage map was built for the 95-progeny "Superior" dihaploid population. For several chromosomes, it was difficult to order and estimate the linkage phase of the markers, particularly for chromosomes 4 and 11. With the MDS ordering approach (Preedy and Hackett, 2016), a continuous curve plot is expected because of the low linkage between markers located at opposite chromosome ends. The MDS graph showed sub-clusters of markers producing extra sub-curves within the general curve or outlier points in some instances. Excluding problematic markers solved this problem. Genotype errors or distorted segregation are the primary factors that could affect marker quality and the mapping process. This is not the case in our population because, in addition to applying the threshold (P-value < 0.01) to eliminate markers with distorted segregation, we did not detect any pattern of meaningful distorted segregation that could limit transmission of specific genomic regions. However, inversions in some homologs or structural variation between homologous chromosomes could also be associated with problems during linkage mapping. In fact, an inversion resulted in a large gap on chromosome 11, while on chromosome 4 we observed a tendency of independent clustering and mapping of the homologous chromosomes. Both chromosomes 4 and 11 showed a greater length than normally reported in previous diploid and tetraploid linkage maps (Hackett et al., 2013; Sharma et al., 2013; Manrique-Carpintero et al., 2015; Massa et al., 2015; Da Silva et al., 2017). For chromosome 4, this may be due to the large heterochromatic knob on three of the four homologous chromosomes (**Figure 2**).

Based on the linkage phase generated in the mapping process, we reconstructed the four haplotypes for each of the 12 homologous chromosomes of the "Superior" parent for the mapped loci (**Figure 5**). Then genetic distances were calculated between different pairs of homologs per chromosome using GGT 2.0 software (Van Berloo, 2008). Different patterns of differentiation among homologs per chromosome were observed based on the simple matching coefficient (the number of shared alleles as proportion of all alleles) distance measurement (**Table 8**). For instance, for chromosomes 10 and 12, only one homolog was markedly different from the other three, while for chromosomes 1, 2, and 5, a pair of homologous chromosomes were highly similar and the other types of homologs were distant. This analysis revealed a novel observation of a high level of heterogeneity among homologous chromosomes in a tetraploid potato cultivar.
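The pairwise comparison of homologs can be illustrated with the simple matching coefficient defined above. This sketch is not the GGT 2.0 implementation: it assumes haplotypes are phased strings of biallelic calls with no missing data and reports the distance as one minus the coefficient; the function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations

def simple_matching(hap_a, hap_b):
    """Proportion of loci at which two haplotypes carry the same allele."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same loci")
    shared = sum(a == b for a, b in zip(hap_a, hap_b))
    return shared / len(hap_a)

def pairwise_distances(haplotypes):
    """Distance (1 - simple matching coefficient) for every pair of homologs."""
    return {
        (i, j): 1.0 - simple_matching(haplotypes[i], haplotypes[j])
        for i, j in combinations(sorted(haplotypes), 2)
    }

# Toy example: four homologs of one chromosome scored at four loci.
homologs = {"h1": "AABA", "h2": "AABA", "h3": "ABBA", "h4": "BBAB"}
for pair, distance in pairwise_distances(homologs).items():
    print(pair, distance)
```

In this toy set, h1 and h2 are identical while h4 differs from them at every locus, which mirrors the pattern in which one homolog is markedly different from the others.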

#### QTL Analysis of Agronomic Traits

Highly correlated traits shared QTL with similar positions and effects. For most of the QTL (chromosomes 2, 4, and 7) for TTY, ATW, TS, Height, and Vigor, the Q allele was in simplex configuration and associated with lower trait mean values in heterozygous genotypes. When the Q allele was detected on two homologous chromosomes, the presence of either or both Q alleles was associated with lower mean values. This resulted in a marker segregation in which 50 or 16.7% of the evaluated population showed a lower fitness phenotype associated with the Q allele. This could be explained by the importance of allelic dosage effects in the genotype configuration of the tetraploid parent, or by the tetraploid parent having mainly one and up to two weak or dysfunctional alleles in the QTL regions. Nevertheless, if the recessive detrimental alleles are in simplex configuration, we do not expect homozygous allelic states unless double reduction has occurred. In contrast, for the QTL on chromosome 10, the Q alleles in heterozygous genotypes were associated with greater mean values of these traits. Based on the analysis of the statistics of distribution of phenotypic data, dominance, intra-locus interactions, and epistatic interaction effects were considered as the main types of gene action associated with TTY and ATW, while a combination of additive, dominance, intra-locus interactions, and epistatic interaction effects was evident for TS, Height and Vigor. Either additive or dominant effects could explain the QTL with simplex allelic effects detected for most of the traits, while the duplex QTL effects were explained by dominant, additive and interaction effects.

TABLE 7 | QTL identified in the "Superior" dihaploid mapping population, chromosome (Chr), and genetic centimorgan position (cM), logarithm of the odds (LOD) significance, variance explained (*R*<sup>2</sup>).

*Total tuber yield (TTY) in g/plant, average tuber weight (ATW) in g, tuber set (TS) as number of tubers per plant, plant height (Height) in cm, plant vigor (Vigor) 1: low vigor, 5: high vigor, number of inflorescences per plant (Infl/plant), specific gravity (SPGR), and tuber shape (Shape) 1 = compressed, 2 = round, 3 = oval, 4 = oblong and 5 = long, main Q allele effect associated with lower (↓) or greater (↑) mean trait values.* \**Significant marker even though a specific model was not detected.*

We did not find any specific QTL for TTY and ATW, the traits with the greatest inbreeding depression. We hypothesize that multiple loci with a low percentage of explained variance, together with their epistatic interactions, could underlie the lack of power to detect these QTL. Similarly, major QTL may not be segregating in this specific population. A clear example is the maturity locus on chromosome 5 associated with the Dof Zinc Finger Protein-StCDF gene (Kloosterman et al., 2013). We did not identify a QTL in that region even though three alleles for cv. Superior were reported by Hardigan et al. (2017). The "Superior" alleles have polymorphisms (non-synonymous SNPs and truncations) compared to the allele associated with short day tuberization photoperiod control CDF in Solanum tuberosum Group Andigena. Therefore, all of these alleles should have similar additive effects, such that no combination of those alleles in the diploid progeny is associated with a segregating phenotype. Infl/plant was a trait for which we observed no inbreeding depression. The distribution statistics suggested that this trait should have gene actions associated with additive effects. Simplex and duplex with no additive allelic effects were the main types of gene action identified in the QTL analysis. Considering that several loci contribute to the genetic structure of a quantitative trait, we expect that epistatic interactions may play a major role in the genetic structure of the evaluated traits (best allelic combinations at different loci). In fact, only a few individuals in the progeny reached a genetic structure that generated a phenotype similar to the tetraploid parent.

FIGURE 2 | Fluorescent *in situ* hybridization visualization of cv. Superior chromosomes. The knobs on three copies of chromosome 4 are indicated by an arrow, the knob-less copy of chromosome 4 is indicated by a large arrowhead. Four copies of chromosome 11 are also indicated.

The common QTLs for TTY, ATW, TS, Vigor and Height on chromosomes 2, 4, 7, and 10 co-localized with previous QTL reported for one or a few of the evaluated agronomic traits. Interestingly, the single parent tetraploid segregation revealed that the QTL were collectively associated with all of these traits. A QTL on chromosome 2 was reported for TTY and tuberization (Van den Berg et al., 1996; McCord et al., 2011; Manrique-Carpintero et al., 2015), on chromosome 4 for ATW, tuber size and tuberization (Van den Berg et al., 1996; D'hoop et al., 2014; Manrique-Carpintero et al., 2015), on chromosome 7 for tuber yield (Schäfer-Pregl et al., 1998), and on 10 for tuber yield, tuber set, and Vigor (Schäfer-Pregl et al., 1998; Manrique-Carpintero et al., 2015; Rak et al., 2017). Bonierbale et al. (1993) reported QTL on chromosomes 2, 4, and 7 for TTY, TS, and ATW, although the QTL on chromosome 2 does not match the chromosome arm location of our QTL. Similarly, several authors have reported a QTL on chromosome 10 for tuber shape (Van Eck et al., 1994; Prashar et al., 2014; Lindqvist-Kreuze et al., 2015), separating mainly compressed, round, and oval from the more elongated shape types oblong and long.

#### Importance of Inbreeding

Loss of heterozygosity has been associated with lower fitness. Considering that the homozygous alleles in the "Superior" parent were also homozygous in the progeny, we tested whether the amount of segregating heterozygosity inherited from the tetraploid parent was associated with any trait. There was no correlation for most of the traits and only a weak correlation between the percentage of inherited heterozygosity and increasing TTY and TS trait values for all 3 site-years (R<sup>2</sup> = 0.07–0.09, P-value < 0.03). As reported by Bonierbale et al. (1993), the additivity of a certain number of heterozygous loci rather than total heterozygosity makes a greater contribution to overall trait performance, along with the dominant alleles and epistatic effects. For instance, the weakest dihaploid clone (VT\_SUP\_46) had greater inherited heterozygosity than a high vigor and high-yielding dihaploid (VT\_SUP\_19), 60 and 55%, respectively. Haplotype analysis of "Superior" chromosomes showed a high level of heterogeneity in the parental genome. Cross-pollinated mating type, vegetative propagation, and polyploidy of cultivated potato contribute to retention of greater mutational load that is further complicated by rampant structural variation throughout the genome (Pham et al., 2017). Genetic load due to deleterious allelic mutations in the simplex configuration could be compensated by the alternative allele, but also by multiple loci with similar function(s) in the polyploid genome. At a given locus, it is possible to have: (i) duplicate alleles or alleles with synonymous nucleotide polymorphisms that will not affect the functionality at the protein level, (ii) alleles with polymorphisms that alter functionality at the protein level, and/or (iii) alleles with no functionality (i.e., a null allele). In principle, any alternative functional allele would compensate for dysfunctionality in a dihaploid or tetraploid individual when present with the lethal allele, at the same or a different locus. Therefore, the combination of alleles at multiple loci determines the trait phenotype. However, epistasis complicates the identification of associations between markers and phenotypic performance (Ceballos et al., 2015). Inbreeding can be the most efficient method to organize the genome to combine favorable alleles interacting in a stable epistatic system; therefore, high fitness progeny would have the best genetic structure (Jansky et al., 2016). By design, we examined only biallelic SNPs, thereby disregarding the contributions of multiallelic loci to yield-attributing traits. For triallelic loci, 1/6th of the dihaploid progeny would be expected to be homozygous at any given SNP site, whereas for tetraallelic loci, all dihaploid progeny would remain heterozygous, albeit with different combinations of alleles.
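The segregation fractions quoted in this discussion (50% and 16.7% for simplex and duplex Q alleles, 1/6 homozygous progeny for a triallelic locus, none for a tetraallelic locus) can be checked by enumerating the six homolog pairs a tetraploid parent can transmit to a dihaploid. The following is a minimal sketch, assuming random pairing of the four homologs with equal probability and no double reduction; the function names and the allele labels are hypothetical and are not taken from the original analysis.

```python
from fractions import Fraction
from itertools import combinations


def dihaploid_pairs(parent):
    """All six homolog pairs transmitted by a tetraploid parent.

    `parent` is a 4-character string, one character per homolog.
    Assumption: every pair is equally likely and double reduction is ignored.
    """
    return [tuple(sorted(pair)) for pair in combinations(parent, 2)]


def fraction_of_progeny(parent, predicate):
    """Expected fraction of dihaploid progeny satisfying `predicate`."""
    pairs = dihaploid_pairs(parent)
    hits = sum(1 for pair in pairs if predicate(pair))
    return Fraction(hits, len(pairs))


def is_homozygous(pair):
    return pair[0] == pair[1]


def carries_q_allele(pair):
    return "Q" in pair


# Simplex Q allele (Qqqq): half of the progeny carry Q.
simplex_carriers = fraction_of_progeny("Qqqq", carries_q_allele)
# Duplex Q allele (QQqq): one sixth of the progeny carry no Q at all.
duplex_non_carriers = fraction_of_progeny("QQqq", lambda p: not carries_q_allele(p))
# Triallelic locus (AABC): one sixth of the progeny are homozygous.
triallelic_homozygous = fraction_of_progeny("AABC", is_homozygous)
# Tetraallelic locus (ABCD): no progeny are homozygous.
tetraallelic_homozygous = fraction_of_progeny("ABCD", is_homozygous)

print(simplex_carriers, duplex_non_carriers,
      triallelic_homozygous, tetraallelic_homozygous)
```

Under these assumptions a homozygous recessive dihaploid from a simplex parent (for example `qQQQ` read as one recessive copy) never appears in the enumeration, which is why such genotypes are attributed to double reduction in the text.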

#### Candidate Genes

In the common QTL regions identified in this study (**Figure 1**) for TTY, ATW, TS, Height and Vigor, we hypothesized that candidate genes associated with overall plant growth and development, as well as tuberization (Supplementary Table 3), would be present. Hormonal regulation, sucrose metabolism, photoperiod, circadian clock, and age-dependent signaling pathways are involved in tuber initiation and growth (Navarro et al., 2015), and some of the genes involved have been identified. In the QTL regions on chromosomes 2 and 7, candidate genes in the photoperiod regulatory pathway associated with length of plant cycle and tuberization were identified (Dof Zinc Finger Protein-StCDF3, CONSTANS-CO, and miRNA156) around 46 and 2 Mb, respectively. High accumulation of sucrose and starch in terminal sink organs is enhanced by efflux from the leaves, promoting tuberization; down-regulation of the phloem Sucrose transporter 4 (SUT4) gene is critical to the switch from apoplastic to symplastic phloem unloading (Chincinska et al., 2013). SUT4 follows a circadian expression pattern, has reciprocal regulation with gibberellic acid (GA), and affects the expression of circadian-regulated genes, flowering, tuberization and shade avoidance. SUT4 is located at 65.8 Mb on chromosome 4, in the region where a QTL was detected. The breakdown of active GA is required for tuberization and gibberellin 2-oxidase genes are part of the mechanism that controls endogenous levels of GA (Kloosterman et al., 2007); we identified a Gibberellin 2-oxidase 2 (GA2ox2) candidate gene at 51.9 Mb in a QTL region on chromosome 7. Interestingly, in the other QTL region of chromosome 7, at 1.9 Mb, is a Trehalose-phosphate synthase 1 (TPS1) gene with a potential role in the T6P regulatory pathway that was recently associated with flowering and tuberization in potato (Seibert et al., 2017). Ectopic expression of Lonely Guy 1 (LOG1), a cytokinin-activating enzyme, drove the formation of aerial minitubers in tomato (Eviatar-Ribak et al., 2013). The plants displayed a unique transcriptome signaling network probably associated with the appropriate local hormonal balance for tuber formation. Differential expression and pleiotropic effects of LOG genes showed their major role in cytokinin metabolism to modulate plant growth and development in Arabidopsis thaliana (Kuroha et al., 2009). A cytokinin riboside 5′-monophosphate phosphoribohydrolase LOG3 gene is located in the QTL region at 56 Mb on chromosome 10. For tuber shape, several candidate genes associated with cell structure and function, and pectin metabolism have been reported in the major QTL located around 48 Mb on chromosome 10 (Lindqvist-Kreuze et al., 2015). Similarly, in the QTL region discovered in our analysis on chromosome 6, a Pectinesterase gene is located at 58 Mb.

TABLE 8 | Genetic distance between homologs (H) of each chromosome (Chr) of cv. Superior.

*Distance color code: shorter genetic distance toward red while greater genetic distance greener.*

#### Dwarf Phenotype

There is strong evidence that a dwarf phenotype observed in our "Superior" dihaploid population is the result of GA<sub>3</sub> deficiency. The dark green and rosette dwarf phenotype has been reported in potato in hybrid progeny of cv. Superior as well as some other potato diploid and tetraploid clones (Bamberg and Hanneman, 1991; Valkonen et al., 1999). In all cases, reversion of the dwarf phenotype occurred following GA<sub>3</sub> application. A single recessive locus encoding ga1 was proposed to cause the dwarf phenotype, which was confirmed by evaluation of test segregation in several crosses (Bamberg and Miller, 2012). The study also revealed that a gibberellin deficiency allele was in simplex configuration (GGGg) in "Superior." The homozygous state gg of the recessive allele of a simplex locus in a dihaploid population is expected only due to double reduction; therefore, only a small proportion of dwarf phenotypes would be observed in the dihaploid progeny. In fact, VT\_SUP\_46 is the only clone in our "Superior" dihaploid population with a strong dwarf phenotype. Examination of the regions with potential double reduction in VT\_SUP\_46 revealed the end of chromosome 6 as the candidate region. However, a few other clones also showed double reduction but did not have a dwarf phenotype, suggesting that other loci could compensate for the GA<sub>3</sub> supply in those dihaploid clones. When in vitro plantlets of VT\_SUP\_46 were grown on propagation medium supplemented with GA (0.02 and 0.2 mg/l), the plants elongated to a normal phenotype (**Figure 3**).

# CONCLUSION

Genetic load in the "Superior" cultivar was unmasked through the generation of a dihaploid population. The segregation of the parental tetraploid configuration identified major QTL regions associated with most of the evaluated agronomic traits. Interestingly, four chromosomes were identified with common QTL that could elucidate interconnected metabolism. Candidate genes regulating plant development and tuberization were identified in the QTL regions. Complementation of gene function due to homozygous deleterious alleles could play a major role in trait performance in polyploid potato.

# AUTHOR CONTRIBUTIONS

RV, CRB, and DD planned and designed the project. NM-C drafted the manuscript. JC and NM-C conducted phenotypic data analysis, linkage mapping, and QTL analysis. NM-C conducted the tissue culture experiment. GB and JJ performed the cytogenetic analysis. DD and JC were involved in field experiments. GP generated genotypic data. RV and FL generated the Superior dihaploid population. All authors contributed to the editing of the manuscript and approved the final draft.

#### FUNDING

This research was supported by the National Science Foundation under Grant No. IOS-1237969 to CRB, DD, Yuehua Cui, and RV.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2018.00944/full#supplementary-material

#### REFERENCES

potato genome sequence. PLoS ONE 7:e36347. doi: 10.1371/journal.pone.0036347

locus analysis of agronomic traits in a diploid potato population using single nucleotide polymorphism markers. Crop Sci. 55, 2566–2579. doi: 10.2135/cropsci2014.10.0745


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Manrique-Carpintero, Coombs, Pham, Laimbeer, Braz, Jiang, Veilleux, Buell and Douches. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Decomposing Additive Genetic Variance Revealed Novel Insights into Trait Evolution in Synthetic Hexaploid Wheat

#### Abdulqader Jighly<sup>1,2</sup>\*, Reem Joukhadar<sup>1,3</sup>, Sukhwinder Singh<sup>4</sup> and Francis C. Ogbonnaya<sup>5</sup>

*<sup>1</sup> Agriculture Victoria, Agriculture Research Division, AgriBio, Centre for AgriBiosciences, Bundoora, VIC, Australia, <sup>2</sup> School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia, <sup>3</sup> Department of Animal, Plant and Soil Sciences, La Trobe University, Bundoora, VIC, Australia, <sup>4</sup> International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico, <sup>5</sup> Grains Research and Development Corporation, Kingston, ACT, Australia*

#### Edited by:

*Richard John Abbott, University of St Andrews, United Kingdom*

#### Reviewed by:

*Margarida Matos, Universidade de Lisboa, Portugal*

*Zhihong Zhu, The University of Queensland, Australia*

\*Correspondence:

*Abdulqader Jighly abdulqader.jighly@ecodev.vic.gov.au*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

Received: *11 September 2017* Accepted: *22 January 2018* Published: *06 February 2018*

#### Citation:

*Jighly A, Joukhadar R, Singh S and Ogbonnaya FC (2018) Decomposing Additive Genetic Variance Revealed Novel Insights into Trait Evolution in Synthetic Hexaploid Wheat. Front. Genet. 9:27. doi: 10.3389/fgene.2018.00027*

Whole genome duplication (WGD) is an evolutionary phenomenon, which causes significant changes to genomic structure and trait architecture. In recent years, a number of studies decomposed the additive genetic variance explained by different sets of variants. However, they investigated diploid populations only and none of the studies examined any polyploid organism. In this research, we extended the application of this approach to polyploids, to differentiate the additive variance explained by the three subgenomes and seven sets of homoeologous chromosomes in synthetic allohexaploid wheat (SHW) to gain a better understanding of trait evolution after WGD. Our SHW population was generated by crossing improved durum parents (*Triticum turgidum;* 2n = 4x = 28, AABB subgenomes) with the progenitor species *Aegilops tauschii* (syn *Ae. squarrosa*, *T. tauschii*; 2n = 2x = 14, DD subgenome). The population was phenotyped for 10 fungal/nematode resistance traits as well as two abiotic stresses. We showed that the wild D subgenome dominated the additive effect and this dominance affected the A more than the B subgenome. We provide evidence that this dominance was not inflated by population structure, relatedness among individuals or by longer linkage disequilibrium blocks observed in the D subgenome within the population used for this study. The cumulative size of the three homoeologs of the seven chromosomal groups showed a weak but significant positive correlation with their cumulative explained additive variance. Furthermore, an average of 69% for each chromosomal group's cumulative additive variance came from one homoeolog that had the highest explained variance within the group across all 12 traits. We hypothesize that structural and functional changes during diploidization may explain chromosomal group relations as allopolyploids keep balanced dosage for many genes. Our results contribute to a better understanding of trait evolution mechanisms in polyploidy, which will facilitate the effective utilization of wheat wild relatives in breeding.

Keywords: polyploidy, synthetic hexaploid wheat, diploidization, additive variance, heritability

# INTRODUCTION

Polyploidization, or whole genome duplication (WGD), is a natural process in which a single genome can be duplicated to form autopolyploids with more than two homologs for each chromosome, or multiple genomes are duplicated following hybridization between two or more species to form allopolyploids with multiple pairs of homologs derived from different ancestral genomes, termed homoeologs. Following WGD, multiple copies of duplicated genes may be lost, diverge in function, or be silenced through a phenomenon called "diploidization" in which balanced dosages for many genes can be retrieved (Ohno, 1970; Lynch and Conery, 2000; Tate et al., 2009; Conant et al., 2014). Rapid genomic rearrangements and epigenetic changes have been observed directly after WGD (Ozkan et al., 2001; Shaked et al., 2001; Kashkush et al., 2002; Hegarty et al., 2008), which can cause changes in the architecture of different traits (Weiss-Schneeweiss et al., 2013).

WGD can be induced in laboratories to generate new taxa such as triticale (Stace, 1987), or to introduce new variation into known taxa such as bread wheat (Triticum aestivum, 2n = 6x = 42, AABBDD) which suffered a severe genetic bottleneck during its origin (Yang et al., 2009). Synthetic hexaploid wheat (SHW) can be generated by crossing Triticum turgidum (2n = 4x = 28, AABB) with Aegilops tauschii (2n = 2x = 14, DD), mimicking the natural evolutionary origin of bread wheat. SHW germplasm is a proven source of genetic diversity to improve yield (Gororo et al., 2002; Dreccer et al., 2007; Ogbonnaya et al., 2007, 2013), soil-borne pathogen (Mulki et al., 2013), insect (El-Bouhssini et al., 2013; Joukhadar et al., 2013), and fungal disease resistance (Zegeye et al., 2014; Jighly et al., 2016), as well as boron (Emebiri and Ogbonnaya, 2015) and salinity tolerance (Dreccer et al., 2004; Ogbonnaya et al., 2008a). However, it remains uncertain how the three subgenomes (A, B, and D) of bread wheat contribute to observed phenotypes or whether the wild Aegilops parent makes a considerable contribution to the additive genetic variance for different traits especially when crossed with an improved or elite durum wheat parent. This can be investigated by partitioning the total additive trait variance into different chromosomes in a SHW population.

Recently, a number of studies partitioned the additive variance of different traits captured by multiple sets of markers in both human and animal quantitative genetics studies. Applications varied from differentiating the variance captured by different chromosomes (Robinson et al., 2013), genotyped and imputed variants (Lee et al., 2012), genic and intergenic variants (Yang et al., 2011b), and different SNP chips (Chen et al., 2014), to differentiating the variance of common and rare variants (Lee et al., 2013; Yang et al., 2015). In general, almost all studies reported a medium to high correlation between chromosome size and its explained additive variance for the studied traits. Yet, this approach has not been applied to any plant population, particularly among polyploid species such as wheat, where considerable efforts have gone into exploiting valuable sources of new genes from its progenitor species for cultivated wheat improvement (Ogbonnaya et al., 2013). Applying this approach to allopolyploids can provide a better understanding and a new way of differentiating the additive effects captured by different subgenomes.

In this research, we used a SHW population to investigate the contribution of each subgenome to trait variation. The SHW population was derived from crosses between wild Ae. tauschii parents and improved durum cultivars and was phenotyped for resistance to 10 different diseases and tolerance to two abiotic stresses. The same dataset was previously characterized in multiple genome-wide association studies (GWAS) for major genes associated with these different stresses (Mulki et al., 2013; Emebiri and Ogbonnaya, 2015; Jighly et al., 2016). However, the GWAS approach does not adequately provide the precise contribution of each chromosome/subgenome to the total heritability as genes identified through GWAS represent only a small proportion of the total heritability (Goldstein, 2009; Yang et al., 2017). Such information is critical to understanding trait evolution in newly synthesized allopolyploids and to efficiently utilize wild relatives in wheat breeding. In the present paper, we investigated this by partitioning the additive variance into each of the 21 SHW chromosomes. The relation between partitioned additive variance and chromosome, subgenome and chromosomal group size was also investigated. To the best of our knowledge, this is the first study to use this approach in polyploid or plant populations.

# MATERIALS AND METHODS

#### SHW Phenotyping and Genotyping

The SHW population consists of 173 crosses between different Ae. tauschii accessions and elite durum cultivars (**Table S1**). The population was genotyped with DArTSeq, a genotyping-by-sequencing (GBS) approach developed by Diversity Arrays Technology (DArT, http://www.diversityarrays.com/). The full method is described in Sehgal et al. (2015). In brief, restriction enzymes were used first to reduce the complexity of the wheat genome and the Pst1-RE adapters were tagged with 96 barcodes. This strategy allows for multiplexing 96 samples in a single Illumina HiSeq2500 lane to generate around 0.5 million 77 bp reads per sample. The generated FASTQ files were trimmed at Phred score 30, and further filtering steps and SNP calling were conducted using scripts developed by DArT P/L. Only SNPs with <20% missing data and >5% minor allele frequency were used in subsequent analyses. The SNP dataset used for the current study was previously published as a supplement in Jighly et al. (2016).
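The two SNP inclusion thresholds stated above (<20% missing data, >5% minor allele frequency) can be illustrated with a short sketch. This is not the DArT P/L pipeline; it assumes genotypes are coded as allele dosages 0, 1, or 2 with `None` for a missing call, and the function name and example calls are hypothetical.

```python
def passes_filters(calls, max_missing=0.20, min_maf=0.05):
    """Return True if a SNP meets the missing-data and MAF thresholds.

    `calls` holds one entry per line: an allele dosage (0, 1, 2) or None
    for a missing call. The coding scheme is an assumption for illustration.
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        return False  # nothing was genotyped at this site
    missing_rate = 1 - len(observed) / len(calls)
    alt_freq = sum(observed) / (2 * len(observed))
    maf = min(alt_freq, 1 - alt_freq)
    return missing_rate < max_missing and maf > min_maf


# Hypothetical SNPs scored on ten or twenty lines.
common_snp = [0, 0, 2, 2, None, 0, 2, 0, 0, 2]   # 10% missing, MAF about 0.44
rare_snp = [0] * 19 + [2]                        # MAF exactly 0.05, not above it
sparse_snp = [None, None, None, 0, 2, 0, 2, 0, 2, 0]  # 30% missing

kept = [name for name, calls in
        [("common", common_snp), ("rare", rare_snp), ("sparse", sparse_snp)]
        if passes_filters(calls)]
print(kept)
```

Both thresholds are treated as strict inequalities here, matching the "<20%" and ">5%" wording in the text.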

The SHW population was phenotyped for aluminum (Al) and boron (Br) tolerance, stem (Sr), yellow (Yr) and leaf (Lr) rusts, crown rot (Cr), yellow leaf spot (YLS), septoria nodorum leaf blotch (SNL) and septoria nodorum glume blotch (SNG), root lesion nematodes [Pratylenchus neglectus (Pn) and Pratylenchus thornei (Pt)] and cereal cyst nematode (CCN) resistance. Experimental details were described previously (Ogbonnaya et al., 2008b; Emebiri and Ogbonnaya, 2015; Jighly et al., 2016). Briefly, the germplasm was screened in three replicates for the three rust diseases under field conditions. The most commercially important fungal pathotypes used for infection were 104–1,2,3,(6), (7), 11, 13 (accession number 200347) for Lr; 98–1,2,3,5,6 (accession number 781219) for Sr; and 134 E16A (021510) for Yr. Four different isolates (WAC 4302, WAC 4305, WAC 4306, and WAC 4309) were used in four replicates under greenhouse conditions for SNG and SNL. YLS was also screened in a controlled environment against isolates 03–0148, 03–0152, and 03–0053. For CCN, plants were considered resistant if they had fewer than five cysts per plant root and susceptible if they had more than 30 cysts. Plants with 5–30 cysts were considered moderately resistant to moderately susceptible. The severity of Pn and the number of Pt nematodes per plant were used to infer the resistance score by comparing the plant response to resistant and susceptible checks. Br tolerance was phenotyped by measuring root growth at the seedling stage on a filter paper soaked with boron, while Al tolerance was measured using the hematoxylin staining of root apices method (Raman et al., 2010).

#### Statistical Analysis

We estimated 21 genetic relatedness matrices (GRMs) from SNPs located on each one of the SHW chromosomes following the method described by Yang et al. (2010, 2011a). The variance explained by each chromosome was estimated using the genomic-relatedness-based restricted maximum likelihood (GREML) analysis by fitting all 21 GRMs simultaneously in the mixed linear model (Lee et al., 2012; Lee and van der Werf, 2016):

$$y = X\beta + \sum_{i=1}^{n} g_i + \varepsilon$$

where y is a vector of phenotypes, n is the number of chromosomes (21 in our case), β is a vector of fixed effects, X is an incidence matrix that relates individuals to fixed effects, and ε is a vector of random errors. g<sub>i</sub> is a vector of random additive genetic effects attributed to chromosome i. The variance structure of the phenotype is equal to:

$$V = \sum_{i=1}^{n} A_i \sigma_{g_i}^2 + I \sigma_{e}^2$$

where A<sub>i</sub> is the GRM for chromosome i, σ<sub>gi</sub><sup>2</sup> is the additive genetic variance captured by SNPs on chromosome i, I is an identity matrix, and σ<sub>e</sub><sup>2</sup> is the error variance.
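To make the variance structure concrete, the sketch below builds one GRM per chromosome from a dosage matrix and assembles V from given variance components. It is an illustration only, not the GREML software used in the study: the GRM uses the allele-frequency standardization of Yang et al. (2010) applied to every element (the published method treats diagonal elements slightly differently), genotypes are assumed to be complete 0/1/2 dosages, the data are random toy values, and the function names and variance components are hypothetical.

```python
import numpy as np


def grm(genotypes):
    """Genetic relatedness matrix from an (individuals x SNPs) dosage array.

    A_jk = mean over SNPs of (x_ij - 2 p_i)(x_ik - 2 p_i) / (2 p_i (1 - p_i)).
    Assumes no missing calls; monomorphic SNPs are dropped before scaling.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                 # allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)             # avoid division by zero
    x, p = x[:, keep], p[keep]
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ z.T / x.shape[1]


def phenotypic_covariance(grms, sigma2_g, sigma2_e):
    """V = sum_i A_i * sigma2_g[i] + I * sigma2_e."""
    n = grms[0].shape[0]
    v = sigma2_e * np.eye(n)
    for a_i, s2 in zip(grms, sigma2_g):
        v = v + s2 * a_i
    return v


rng = np.random.default_rng(0)
# Toy data: 5 individuals, 2 "chromosomes" with 40 SNPs each (21 in the study).
chrom_genotypes = [rng.integers(0, 3, size=(5, 40)) for _ in range(2)]
grms = [grm(g) for g in chrom_genotypes]
# Hypothetical variance components, chosen only to show the bookkeeping.
V = phenotypic_covariance(grms, sigma2_g=[0.3, 0.2], sigma2_e=0.5)
print(V.shape)
```

In an actual GREML fit the variance components are unknown and are estimated by restricted maximum likelihood; here they are supplied only to show how each chromosome's GRM enters V.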

We ran the analysis twice, with and without including the first 10 principal components (PCs) as fixed effects. Including a number of PCs in the model can control for population structure in the germplasm; thus, the effect of population structure will be minimal if the model that fits PCs gives results similar to the model that does not include PCs (Lee et al., 2012). The first 10 PCs were calculated using PLINK 1.9 (http://www.cog-genomics.org/plink/1.9/). To further investigate the effect of the correlation between different chromosomes due to shared structure among chromosomes (Lee et al., 2012; Yang et al., 2017), we calculated the conditional effect for each chromosome based on the other 20 chromosomes. This was done by fitting 21 different models, each of which excluded a different GRM from the joint analysis. If the SNPs located on the excluded chromosome were correlated with SNPs on the other 20 chromosomes, the conditional effect analysis would overestimate the additive variance for the 20 chromosomes. Subtracting the conditional additive variance from the overall additive variance inferred from the full model gives the proportion of additive variance of the excluded chromosome that is not correlated with other chromosomes. This value can be used to investigate dependency among chromosomes and to confirm differences among subgenomes.
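The bookkeeping of the conditional analysis can be sketched as follows. The code only lists which chromosomes enter each of the 21 reduced models and performs the subtraction described above; the variance values are hypothetical placeholders, not estimates from this study, and the function names are illustrative.

```python
def leave_one_out_sets(n_chrom=21):
    """For each dropped chromosome, the chromosomes whose GRMs stay in the model."""
    all_chroms = range(1, n_chrom + 1)
    return {drop: [c for c in all_chroms if c != drop] for drop in all_chroms}


def uncorrelated_variance(total_full, total_without):
    """Additive variance of the dropped chromosome not shared with the others.

    total_full: total additive variance from the model with all GRMs.
    total_without: total additive variance from the model without one GRM.
    """
    return total_full - total_without


models = leave_one_out_sets()
# Hypothetical totals: full model 0.50, model without one chromosome 0.46.
unique_part = uncorrelated_variance(0.50, 0.46)
print(len(models), len(models[1]), round(unique_part, 2))
```

If the unique part is close to the variance assigned to that chromosome in the full model, little of its signal is shared with the remaining chromosomes.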

The D subgenome in our germplasm had very large LD blocks compared to the A and B subgenomes (Jighly et al., 2016), which may lead to overestimation of the heritability for the D subgenome (Speed et al., 2012). Thus, we repeated the analysis after randomly omitting 20% of the whole SNP dataset, omitting 20% of SNPs located on A and B subgenomes only, or omitting 50% of SNPs located on the D subgenome. The three analyses showed similar results; thus, only results of the first analysis are presented in the present paper. The idea is that if we do not have enough SNP density to cover all LD blocks in both A and B subgenomes, omitting a considerable proportion of the SNPs will mask the variance captured by the deleted SNPs while keeping the D subgenome unaffected. Obtaining the same results from the original and the masked analyses suggests that each LD block is covered with an adequate number of SNPs and, as such, the majority of its variance can be captured with the available SNPs.

Analysis of covariance (ANCOVA) was used to determine significant differences among the three subgenomes considering (1) the subgenome size as a covariate or (2) the chromosome size as a covariate. The fitted model for the first ANCOVA analysis was: Additive Effect ∼ subgenome + subgenome size. For the second analysis, we fitted the model twice, with and without including the interaction between chromosome size and subgenome. Thus, the models were: Additive Effect ∼ subgenome + chromosome size; and Additive Effect ∼ subgenome \* chromosome size.

For each trait, a Chi-square test was performed to test whether the actual additive variance explained by the three subgenomes lies within the expected range for their values. The genome size for A, B and D subgenomes is 5727, 6274, and 4945 Mb, respectively. Thus, the expected contribution for each subgenome to the additive variance was calculated as the proportion of the subgenome size to the whole genome size, which was 33.8, 37, and 29.2% for A, B, and D subgenomes, respectively.
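The expected contributions follow directly from the subgenome sizes, and the test statistic can be sketched in a few lines. The observed percentages below are hypothetical, and treating percentages as counts out of 100 is an assumption made for illustration; the source does not state how the observed values were scaled. With three categories there are two degrees of freedom, for which the chi-square tail probability is exactly exp(-χ²/2).

```python
import math

SUBGENOME_SIZE_MB = {"A": 5727, "B": 6274, "D": 4945}


def expected_percent(sizes):
    """Expected share of additive variance, proportional to subgenome size."""
    total = sum(sizes.values())
    return {name: 100 * size / total for name, size in sizes.items()}


def chi_square_2df(observed, expected):
    """Chi-square statistic and p-value for three categories (2 df)."""
    stat = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)
    p_value = math.exp(-stat / 2)  # exact survival function for 2 df
    return stat, p_value


expected = expected_percent(SUBGENOME_SIZE_MB)
# Hypothetical observed percentages for one trait (must sum to 100).
observed = {"A": 25.0, "B": 33.0, "D": 42.0}
stat, p_value = chi_square_2df(observed, expected)
print({k: round(v, 1) for k, v in expected.items()}, round(stat, 2), round(p_value, 3))
```

The rounded expected values reproduce the 33.8, 37, and 29.2% given in the text.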

To further confirm that the differences among subgenomes are true and have not been inflated because of relatedness among individuals, we ran 100 replicates of the GREML analysis using phenotypes randomly sampled from the normal distribution N(0, 1). This analysis allows us to compare our findings to the null hypothesis given our data. True differences among subgenomes/chromosomal groups should be detected when using our empirical phenotypes and not simulated ones.

Finally, the reliability of the GREML analysis was estimated by running 100 replicates of the analysis in which we omitted one random individual for each replicate (reduced model). Pearson correlation coefficients between additive variances of both models (full and reduced) for all chromosomes across all traits were computed. The reliability was estimated as the square of the average Pearson correlation coefficient over the 100 replicates. The reliability was used to calculate the "attenuated correlation" for all our correlation analyses following Charles (2005) as implemented in Fisher (2014). Calculating the attenuated correlation avoids overestimating the significance of the correlation analysis by adjusting its value according to the standard deviation of our additive variance estimation.
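A minimal sketch of the reliability calculation is given below. The reliability function follows the definition in the text (square of the mean Pearson correlation across replicates). The attenuation step is an assumption: it scales an observed correlation by the square root of the product of the two reliabilities, taking chromosome size as measured without error, which is consistent with the reliability of 0.45 (0.67 squared) and the change from r = 0.34 to r = 0.23 reported in the Results. The exact procedure of Charles (2005) and Fisher (2014) is not reproduced here, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def reliability(replicate_correlations):
    """Square of the average full-vs-reduced correlation over replicates."""
    mean_r = sum(replicate_correlations) / len(replicate_correlations)
    return mean_r ** 2


def attenuated_correlation(r_observed, reliability_x, reliability_y=1.0):
    """Assumed attenuation: r_observed * sqrt(rel_x * rel_y)."""
    return r_observed * math.sqrt(reliability_x * reliability_y)


# 100 replicates that all gave r = 0.67 (placeholder for the real replicates).
rel = reliability([0.67] * 100)
r_adjusted = attenuated_correlation(0.34, rel)
print(round(rel, 2), round(r_adjusted, 2))
```

In practice each replicate correlation would come from `pearson` applied to the per-chromosome additive variances of the full model and of one reduced model.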

# RESULTS

The SHW dataset included 6,176 GBS-based SNPs with missing data <20% and minor allele frequency >5%. The total heritability values ranged from 44.8 to 60.5% for resistance to Sr and SNG, respectively (**Table 1**), with an average value of 50.4%. All estimated heritabilities were significantly higher than the heritability obtained under the null model with simulated phenotypes, which had an average of 22% and a 95% confidence interval of 16.3 to 27.7%. However, it is worth noting that these values should be lower than the actual heritabilities as they depend on the genotyped SNPs only (Manolio et al., 2009). The numbers presented in **Table 1** represent the proportion of the total additive variance explained by each chromosome, which sum to 100 for each trait, with negative values recorded as zeroes (plotted in **Figure 1**). The original estimates and their standard deviations can be found in **Table S2**. The average standard deviation across chromosomes and traits was equal to 0.077, while the reliability of the GREML analysis given the standard deviation was equal to 0.45 (0.67<sup>2</sup>). The relatively low reliability is a result of small population size and relatedness among individuals.

TABLE 1 | The additive variance for different traits and its partitioning (as percentage of the total heritability) into different chromosomes, chromosomal groups, and genomes.

*Negative estimations were set to 0 in this table but detailed information can be found in* Table S2*. The last row represents Chi square p-value which compares the actual fractional contribution of A, B, and D subgenomes to the additive variance with the expected one which assumes the percentage of the subgenome size, 33.8, 37, and 29.2% for A, B, and D subgenomes, respectively. NS: not significant at 0.05.*

For the 21 chromosomes across all traits, we found no correlation between chromosome sizes and their explained additive variance (**Figure 1A**; **Table 2**). For individual traits, only Sr resistance showed a significant correlation between the sizes of all 21 chromosomes and their fractional contribution to the additive variance, with p-value = 0.04 and r = 0.45 (**Table 2**; **Figure S1**). The median r value between chromosome size and fractional additive variance for all traits was equal to 0.005. When chromosomes within each subgenome were considered, only the additive variance explained by the B subgenome chromosomes showed a significant but weak correlation with chromosome size (p-value = 0.02 and r = 0.25; **Figure 1A**; **Table 2**). Neither the Sr correlation nor the B subgenome correlation was significant after adjusting for attenuation following Charles (2005).

A significant correlation was evident between the cumulative size for each chromosomal group and the fractional additive variance explained by the group with p-value = 0.01 and r = 0.27 (**Figure 1B**, **Table 2**). Removing two outliers (the contribution of group 4 for Cr and Pt resistance which are highlighted in yellow, **Figure 1B**) strengthened this correlation with p-value = 0.001 and r = 0.34. However, when correcting the correlation for attenuation, it was significant only after removing the two outliers with p-value = 0.037 and r = 0.23. A single chromosome with the highest contribution within each group can explain about 69% of the total group additive variance on average across all traits. The relationship between fractional additive variance and the chromosomal group cumulative size for individual traits had a median value of 0.43 (**Table 2**) and is plotted in **Figure S2**.

The cumulative fractional additive variance significantly varied between the three subgenomes. The median values for the percentage of additive variance contributed by the A, B, and D subgenomes were 23.7, 33, and 38.7%, respectively (**Figure 1C**). These values changed to 23.8, 31.8, and 41.3%, respectively, after omitting stem rust resistance, an outlier compared to other traits. ANCOVA analysis that considered the genome size as a covariate confirmed the significant differences among the three subgenomes across all 12 traits with p-value = 0.01. This was the only significant component in the model. The ANCOVA analysis that considered the size of chromosomes as a covariate had a p-value of 0.006 (same value with and without including the interaction between genome and chromosome size in the model), which was the only significant component in both models.

TABLE 2 | Pearson correlation coefficients (*r* values) between explained additive variance and chromosome size for all 21 chromosomes (column All) and for chromosomes within each subgenome (A, B, and D), and between explained additive variance and chromosomal group size (Groups).

*The final row represents the r values considering all traits together (visualized in* Figures 1A,B*). † Represents the correlation coefficient after removing the two outliers in* Figure 1B*; this was significant at p-value* < *0.05 after correcting for attenuation.* \**Significant at p-value* < *0.05;* \*\**Significant at p-value* < *0.01. ††Not significant after correcting for attenuation.*

For individual traits, Chi-square tests showed significant differences between the actual and the expected subgenome contribution to all traits except for Br, Lr, and SNG. For Al, CCN, Cr, Pt, and Yr, only the contribution of the D subgenome was higher than expected, while the contributions of the B and D subgenomes were higher than expected for Pn, SNL, and YLS (**Table 1**). Br, Lr, and SNG resistances were not significantly different from the expected contribution, but the actual contribution of the D subgenome for all of them was slightly higher than expected (**Table 1**).
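The comparison behind these Chi-square tests can be sketched as a goodness-of-fit test with two degrees of freedom, for which the p-value has the closed form exp(-χ²/2). The example uses the Sr contributions quoted in the Discussion (B = 66.8%, D = 13.8%); the A value of 19.4% is inferred from the percentages summing to 100. Treating percentages directly as counts is an assumption about how the test was run, and the function name is hypothetical.

```python
import math

# Expected share of additive variance if it simply tracked subgenome size
# (percentages given in the Table 1 footnote).
EXPECTED = {"A": 33.8, "B": 37.0, "D": 29.2}


def chi_square_subgenomes(observed, expected=EXPECTED):
    """Return (chi2, p) for observed vs. expected percentage contributions."""
    chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in expected)
    # Three categories give 2 degrees of freedom, for which the
    # chi-square survival function is exactly exp(-chi2 / 2).
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Sr resistance: A is inferred as 100 - 66.8 - 13.8 (assumption).
sr_observed = {"A": 19.4, "B": 66.8, "D": 13.8}
chi2, p = chi_square_subgenomes(sr_observed)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

A contribution pattern identical to the expected percentages gives a statistic of zero and a p-value of one, which is a quick sanity check on the function.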

Population structure, linkage disequilibrium, and relatedness among individuals did not have an effect on our results. The inclusion of the first 10 principal components as covariates in the model did not have a large effect on heritability estimates (data not shown) which means that population structure has minimal effect on the heritability estimations. Similarly, further analysis with a randomly chosen subset of SNPs did not affect the results either (**Table S3**), indicating that the extended linkage disequilibrium observed in the D subgenome in this population did not overestimate the contribution of the D subgenome. Furthermore, under the null hypothesis using simulated phenotypes, the cumulative additive variance was 0.0698 (±0.026), 0.0735 (±0.027) and 0.0766 (±0.029) for the A, B, and D subgenomes, respectively, indicating true differences among subgenomes observed with empirical phenotypes that are not affected by relatedness among individuals.

Estimating the conditional effect for each chromosome based on the other 20 chromosomes showed considerable correlation among chromosomes (**Table 3**; **Table S2**). On average for all chromosomes across all traits, 46% of chromosome additive variance can be explained by other chromosomes. This value ranged from 20.6% for Yr resistance to 57.3% for Br tolerance (inferred from **Table 3**). Interestingly, even for the conditional analysis after excluding correlated additive variances, our conclusion that the D genome had the highest contribution to the total heritability did not change, with 22.3, 31.9, and 44.8% of the total additive variance attributed to the A, B, and D subgenomes, respectively. Removing Sr increased the D subgenome contribution to 45.7% and reduced the B subgenome contribution to 30.1%. The correlation among all 21 GRMs also supports these results (**Figure 2**). All GRMs for the A and B subgenome chromosomes clustered together while GRMs for D subgenome chromosomes formed another cluster. Thus, the correlated additive variance can be explained by common ancestry, supporting the superiority of the D subgenome regardless of the low reliability of the GREML analysis.
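The bookkeeping behind the conditional analysis can be illustrated with a small sketch. Following the Table 3 footnote, the independent contribution of a chromosome is taken as the total additive variance of the full model minus the total from the model fitted without that chromosome's GRM; the remainder of its full-model estimate is treated as shared with other chromosomes. All numbers and function names are hypothetical, and clamping negative values to zero (by analogy with Table 1) is an assumption.

```python
def independent_variance(full_total, total_without_chromosome):
    """Variance lost when one chromosome's GRM is dropped (clamped at zero)."""
    return max(full_total - total_without_chromosome, 0.0)


def shared_fraction(chromosome_full_estimate, independent):
    """Share of a chromosome's full-model variance that other chromosomes absorb."""
    if chromosome_full_estimate <= 0:
        return 0.0
    return max(1.0 - independent / chromosome_full_estimate, 0.0)


# Hypothetical values for one trait and one chromosome (not taken from Table 3).
full_total = 0.50    # total additive variance, model with all 21 GRMs
without_chr = 0.473  # total additive variance, model with the other 20 GRMs
chr_estimate = 0.05  # the chromosome's own estimate in the full model

indep = independent_variance(full_total, without_chr)
shared = shared_fraction(chr_estimate, indep)
print(round(indep, 3), round(shared, 2))
```

Averaging such shared fractions over chromosomes and traits would give a summary comparable in spirit to the 46% reported above, although the exact averaging used in the study is not specified here.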

## DISCUSSION

Decomposing additive genetic variance based on different sets of SNPs has become a commonly used method in quantitative genetics in recent years (Yang et al., 2010, 2011a,b, 2015; Lee et al., 2012). Researchers usually remove related individuals to ensure that they are capturing SNP-based heritability only (Yang et al., 2017). Although this is possible in human genetics and some animal populations that have large effective population sizes, it is impossible to assemble such optimal populations of only distantly related individuals in species such as bread wheat with extremely small effective population sizes (Joukhadar et al., 2017). For this reason, the heritability estimated with this method in populations of species such as bread wheat will be a mixture of SNP-based heritability from phenotypic correlation due to unrelated individuals and pedigree-based heritability from phenotypic correlation due to relatedness (Yang et al., 2017). One advantage of using related individuals is that the analysis requires smaller populations to obtain an acceptable standard error (SE), because SE is negatively correlated with the average relatedness among individuals. Yang et al. (2017) pointed out that the SE can be further decreased if rare SNPs are excluded from the analysis.

Linkage disequilibrium (LD) can strongly bias the decomposition of additive variance, as the variance estimation depends on the LD between the causal variant and the closest genotyped SNPs (Speed et al., 2012). The D subgenome in our population showed large LD blocks (Jighly et al., 2016) but this did not result in overestimating its contribution because there were sufficient SNPs to capture most additive variance in the A and B subgenomes (**Table S3**). This is not unexpected for populations with small effective population size like SHW. For example, randomly selecting 10K out of 354K SNPs reduced the captured additive variance by only 1% for different traits in chickens (Abdollahi-Arpanahi et al., 2014). Population structure also did not affect the estimation as the estimations were very similar to the model that involved the first 10 PCs as covariates (Lee et al., 2012), although considerable correlation between different chromosomes was observed in this germplasm (**Table 3**; **Table S2**). On the other hand, this correlation did not affect our conclusion that the D subgenome had a higher contribution to the total additive variance relative to the A and B subgenomes (**Table 3**; **Table S2**), especially as GRMs of the D subgenome chromosomes were clustered together and were not correlated with any of the 14 GRMs of the A and B subgenome chromosomes (**Figure 2**).

TABLE 3 | The heritability estimation using the conditional effect model (excluding the GRM of one chromosome).

*The values between brackets describe the additive variance inferred from the full model (the first row in the table) minus the conditional total additive variance. The last three rows represent the contribution of each subgenome to total independent additive variance (values between brackets). The last row gives the Chi-square p-value comparing the conditional contribution of the A, B, and D subgenomes to the additive variance with the expected contribution, which assumes the percentage of the subgenome size, 33.8, 37, and 29.2% for the A, B, and D subgenomes, respectively. NS, not significant at 0.05.*

Almost all studies that have partitioned additive variance have shown that a significant correlation exists between chromosome size and variance (e.g., Yang et al., 2011b; Lee et al., 2012; Robinson et al., 2013). In the present study using SHW, however, chromosome size was not correlated with explained additive variance for any trait, although a weak correlation was observed for chromosomes within the B subgenome. The significant correlation for Sr (**Table 2**) cannot be attributed to chromosome size directly, but rather to differences in size between the D and B subgenomes, which explained 13.8 and 66.8% of the additive variance, respectively (**Figure S1**; **Table 1**). Both of these correlations became non-significant after correcting for attenuation.

In contrast to what we found for all individual chromosomes, a significant but weak correlation was found between the cumulative sizes and cumulative additive variances for each chromosomal group (**Figure 1B**). In polyploids, the balanced dosage hypothesis, which involves gene loss, functional divergence and epigenetic changes in newly synthesized polyploids, has been widely discussed and has been proven for many gene families (Ohno, 1970; Lynch and Conery, 2000; Tate et al., 2009; Buggs et al., 2010, 2012; Xiong et al., 2011; Feldman and Levy, 2012; Conant et al., 2014; Dodsworth et al., 2016). We hypothesize that these structural and functional changes during diploidization keep a single functional copy of each gene in one homoeolog and thus, larger chromosomes may not necessarily have a higher contribution to the additive variance if functional copies are not distributed equally among the three homoeologs. Instead, when considering the three homoeologs together, all genes will have functional copies. Thus, larger chromosomal groups may have a higher contribution to the additive variance. This may explain the correlation between group size and effect. Another important finding is that one homoeolog can dominate the group additive effect within each chromosomal group, with an average of 69% of the total group additive variance (inferred from **Table 1**). Future research using larger populations should consider the relation between variance and chromosome size in both SHWs and their progenitors to further confirm this finding and to better understand the underlying mechanisms that allow one homoeolog to dominate the group additive effect.

Pont et al. (2013) showed that the D subgenome generally dominated the tetraploid A and B subgenomes in hexaploid wheat by analyzing synteny and conserved orthologous gene data. Our results also showed this for stress resistance traits and that the dominance effect of the D subgenome was greater with regard to the A than the B subgenome, with the median percentage of additive variance across all traits for the A subgenome being 23.7% (**Figure 1C**). However, this cannot be generalized for all traits. For instance, the A subgenome contributed 9.6% more than the D subgenome to Lr resistance, whereas the B subgenome dominated the A and D subgenomes for Sr resistance (**Table 1**). Lagudah et al. (1993) showed that Sr and Lr resistance transferred from Ae. tauschii to hexaploid wheat is partially or fully suppressed by unknown mechanisms, while Kerber and Green (1980) reported a suppressor for A and B subgenome Sr resistance in chromosome 7D. Later studies have indicated that suppression of the resistance of one subgenome of bread wheat by the other subgenomes is affected by SHW parents and pathogen isolates (Kema et al., 1995; Badebo et al., 1997; Ogbonnaya et al., 2013). Thus, efficient implementation of SHW in breeding programs should combine superior chromosomes within each chromosomal group for each trait independently, although the general trend showed that the D subgenome had a higher contribution to the additive variance. Future research should investigate suppression mechanisms and whether the generally superior additive contribution of the D subgenome is a result of suppression of A and B subgenome resistance to different biotic and abiotic stresses.

### AUTHOR CONTRIBUTIONS

AJ: suggested and planned the study, analyzed the data and drafted the manuscript; RJ: assisted with R scripting and drafted the manuscript; SS: provided the GBS data; FO: planned the study, provided the phenotypic data, drafted the manuscript and gave the final acceptance for the manuscript to be submitted; All authors read and approved the final copy of the manuscript.

#### ACKNOWLEDGMENTS

We thank Seeds of Discovery—Sustainable Modernization of Traditional Agriculture project (MasAgro), Mexico for supporting the genotyping work. The Grains Research and Development Corporation funded Synthetic Evaluation Project in Australia.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00027/full#supplementary-material

Figure S1 | Percentage of individual chromosome contribution to the additive variance for each trait as a function of chromosome size. Colors represent different subgenomes; red: "A" subgenome chromosomes; green: "B" subgenome chromosomes; and purple: "D" subgenome chromosomes. The gray line represents the correlation for all 21 chromosomes.

Figure S2 | Percentage of chromosomal group contribution to the additive variance for each trait as a function of chromosome size.

Table S1 | Pedigree and passport information for the SHW population.

Table S2 | The first line for each chromosome contains the estimated additive variance for different traits and their standard deviations, between brackets, using the full model (the model that fits 21 GRMs). The second line for each chromosome is the heritability estimation using the conditional effect model (excluding the GRM of one chromosome). Values between brackets describe the additive variance inferred from the full model (the first row in the table) minus the conditional total additive variance. The second line is exactly the same as Table 3 in the paper but is repeated here for easier comparison between the full and the conditional models.

Table S3 | The additive variance for different traits and its partitioning (as percentage of the total heritability) into different chromosomes, chromosomal groups and subgenomes for a subset of the whole data set that includes 80% of our SNPs.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Jighly, Joukhadar, Singh and Ogbonnaya. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.