<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2020.603056</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Excision Dominates Pseudogenization During Fractionation After Whole Genome Duplication and in Gene Loss After Speciation in Plants</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yu</surname> <given-names>Zhe</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1107237/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zheng</surname> <given-names>Chunfang</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1165787/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Albert</surname> <given-names>Victor A.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/121109/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Sankoff</surname> <given-names>David</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/116599/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Mathematics and Statistics, University of Ottawa</institution>, <addr-line>Ottawa, ON</addr-line>, <country>Canada</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Biological Sciences, University at Buffalo</institution>, <addr-line>Buffalo, NY</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Chuang Ma, Northwest A&amp;F University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Vladimir A. Trifonov, Institute of Molecular and Cellular Biology (RAS), Russia; Weilong Hao, Wayne State University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: David Sankoff <email>sankoff&#x00040;uottawa.ca</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics</p></fn></author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>12</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>11</volume>
<elocation-id>603056</elocation-id>
<history>
<date date-type="received">
<day>04</day>
<month>09</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>11</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Yu, Zheng, Albert and Sankoff.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Yu, Zheng, Albert and Sankoff</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract><p>We take advantage of synteny blocks, the analytical construct enabled at the evolutionary moment of speciation or polyploidization, to follow the independent loss of duplicate genes in two sister species or the loss through fractionation of syntenic paralogs in a doubled genome. By examining how much sequence remains after a contiguous series of genes is deleted, we find that this residue remains at a constant low level independent of how many genes are lost&#x02014;there are few if any relics of the missing sequence. Pseudogenes are rare or extremely transient in this context. The potential exceptions lie exclusively with a few examples of speciation, where the synteny blocks in some larger genomes tolerate degenerate sequence during genomic divergence of two species, but not after whole genome doubling in the same species where fractionation pressure eliminates virtually all non-coding sequence.</p></abstract>
<kwd-group>
<kwd>gene loss</kwd>
<kwd>fractionation</kwd>
<kwd>polyploidization</kwd>
<kwd>whole genome duplication</kwd>
<kwd>plant evolution</kwd>
<kwd>synteny</kwd>
<kwd>pseudogene</kwd>
<kwd>genomics</kwd>
</kwd-group>
<contract-sponsor id="cn001">Natural Sciences and Engineering Research Council of Canada<named-content content-type="fundref-id">10.13039/501100000038</named-content></contract-sponsor>
<counts>
<fig-count count="9"/>
<table-count count="0"/>
<equation-count count="0"/>
<ref-count count="20"/>
<page-count count="9"/>
<word-count count="4555"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>The evolutionary process of gene loss, through DNA excision or sequence elimination (Eckardt, <xref ref-type="bibr" rid="B2">2001</xref>), pseudogenization (Jacq et al., <xref ref-type="bibr" rid="B4">1977</xref>), or some other mechanism, is the obverse of gene acquisition by a genome through processes such as tandem or remote duplication of individual genes, whole genome doubling (WGD), neo- and sub-functionalization, and horizontal transfer. Loss serves a number of functional and structural roles, such as in the reconfiguring of regulatory or metabolic networks or in compensating for the energetic, material, and structural costs of gene complement expansion.</p>
<p>A longstanding controversy in evolutionary genomics (Byrnes et al., <xref ref-type="bibr" rid="B1">2006</xref>; van Hoek and Hogeweg, <xref ref-type="bibr" rid="B11">2007</xref>) involves the question of whether duplicated genes are deleted through random excision (&#x0201C;elimination of excess DNA&#x0201D;), namely the deletion of chromosomal segments containing one or more genes, which we have termed the &#x0201C;structural&#x0201D; mechanism, or through targeted, possibly gene-by-gene, events such as regulatory epigenetic silencing and pseudogenization, which we call &#x0201C;functional&#x0201D; mechanisms. Because it is often hard to ascertain whether a single-copy gene is the result of the deletion of a duplicate copy, and because the outcomes of the two kinds of process may appear similar, it is difficult to discern which one is operating.</p>
<p>The alignment of the gene orders of homologous genes in two related genomes, or subgenomes of an (ancient) polyploid, such as that provided by the S<sc>yn</sc>M<sc>ap</sc> program on the C<sc>o</sc>G<sc>e</sc> platform (Lyons and Freeling, <xref ref-type="bibr" rid="B5">2008</xref>; Lyons et al., <xref ref-type="bibr" rid="B6">2008</xref>), is a uniquely reliable first step in the assessment of gene conservation or loss after speciation or polyploidization. The homology of pairs of genes in the chromosomal fragments, or &#x0201C;synteny blocks,&#x0201D; making up such an alignment is doubly confirmed: first by the common level of sequence similarity of all the gene pairs in the block, and second by the common chromosomal context, namely the common order of the homologous genes in the two fragments, represented as follows:</p>
<p><inline-graphic xlink:href="fgene-11-603056-i0001.tif"/></p>
<disp-quote>
<p>Synteny block on homeologous regions of two chromosomes.</p>
<p>Dark circles indicate retained genes, white circles deleted genes.</p>
<p>There are five retained duplicate gene pairs, four singletons on the</p>
<p>lower chromosome and one singleton on the upper chromosome.</p>
</disp-quote>
<p>In synteny blocks, it is relatively easy to see where duplicate genes have been deleted, and how many genes in a row have been lost. In this paper, we use this property of synteny blocks in devising a simple method to distinguish clearly between genomes where excision is the main mechanism for gene loss, and those where pseudogenization may also play a role.</p>
<p>Although the basics of polyploidy in plants have been understood for over a century (Winge, <xref ref-type="bibr" rid="B13">1917</xref>), and though this process is well-attested across the entire evolutionary spectrum, from bacteria (Hansen, <xref ref-type="bibr" rid="B3">1978</xref>; Tobiason and Seifert, <xref ref-type="bibr" rid="B10">2006</xref>) to pre-mammalian vertebrates (Ohno, <xref ref-type="bibr" rid="B7">1970</xref>), the statistical study of conservation and reduction at the genome level originates with the discovery and analysis by Wolfe and Shields of an ancient WGD in the <italic>Saccharomyces cerevisiae</italic> genome sequence (Wolfe and Shields, <xref ref-type="bibr" rid="B14">1997</xref>). Starting with the first few plant genomes to be sequenced (<italic>Arabidopsis, Oryza, Populus</italic>), the realization has grown that all flowering plant species are &#x0201C;paleopolyploids,&#x0201D; re-diploidized descendants of one or more ancient polyploidization events. It is in the context of the Angiosperm/Magnoliophyte phylum or division that we have attempted to resolve the structure-function controversy (Byrnes et al., <xref ref-type="bibr" rid="B1">2006</xref>; van Hoek and Hogeweg, <xref ref-type="bibr" rid="B11">2007</xref>) using several modeling and statistical approaches (Zheng et al., <xref ref-type="bibr" rid="B20">2009</xref>; Sankoff et al., <xref ref-type="bibr" rid="B9">2010</xref>, <xref ref-type="bibr" rid="B8">2015</xref>; Yu and Sankoff, <xref ref-type="bibr" rid="B18">2016</xref>; Yu et al., <xref ref-type="bibr" rid="B17">2020</xref>). In the present paper, however, our focus is less on how fractionated gene pairs are organized within synteny blocks than on what happens to these genes: do they degenerate in place, or are they simply removed from the DNA sequence of the genome?</p>
<p>Our claim is that the predominant loss process is the latter: the complete excision of the gene from the genome, the elimination of the sequence of the entire gene. As such, we do not adopt any restrictive definition of a pseudogene or quantification of the various types of pseudogenes in plants, which was done in the recent definitive study of Xie et al. (<xref ref-type="bibr" rid="B16">2019</xref>); here we simply examine whether any DNA, and how much, remains when one member of a pair of homeologous genes, as identified by S<sc>yn</sc>M<sc>ap</sc>, is absent from a syntenic block. We will show that in the large majority of cases, there is a drastic loss of DNA, leaving only a small stretch of intergenic sequence, so that no kind of pseudogene, whatever its definition, except for very small fragments of cDNA, can be present. In other words, fractionation, and most gene loss in ancient genomes, does not tend to result in long-lasting full-length or partial-length degenerate genes, but in a relatively complete loss of the DNA. This does not mean that pseudogenes are absent or even rare in these and other genomes. Many of these may persist over many millions of years. Nevertheless, Xie et al. (<xref ref-type="bibr" rid="B16">2019</xref>) found that poplar has almost 25,000 pseudogenes, but &#x0003C;1,500 of these stem from the Salix whole genome doubling, and most of these are presumably small fragments of coding sequence.</p>
</sec>
<sec sec-type="methods" id="s2">
<title>2. Methods</title>
<sec>
<title>2.1. Sampling of Plant Species</title>
<p>In each of four core eudicot plant families (or orders), we selected a pair of genomes for which annotated genome sequences are available:</p>
<list list-type="order">
<list-item><p><italic>Populus trichocarpa</italic> (poplar) CoGe ID 25127, and <italic>Salix purpurea</italic> (willow) CoGe ID 52439 in the rosid family Salicaceae,</p></list-item>
<list-item><p><italic>Salvia splendens</italic> (scarlet sage) CoGe ID 55705, and <italic>Tectona grandis</italic> (teak) CoGe ID 55706 in the asterid family Lamiaceae,</p></list-item>
<list-item><p><italic>Linum usitatissimum</italic> (flax) CoGe ID 16772 and <italic>Hevea brasiliensis</italic> (rubber tree) CoGe ID 16772 in the order Malpighiales, also rosids, and</p></list-item>
<list-item><p><italic>Malus domestica</italic> (apple) CoGe ID 54783 and <italic>Pyrus</italic> &#x000D7; <italic>bretschneideri</italic> (pear) CoGe ID 37224 belonging to the same subtribe Malinae of another rosid family Rosaceae.</p></list-item>
</list>
<p>All these genomes have undergone at least one whole genome duplication since the ancient whole genome triplication &#x0201C;gamma&#x0201D; at the origin of the core eudicots.</p>
</sec>
<sec>
<title>2.2. Construction of Synteny Blocks</title>
<p>For each of the eight genomes individually we first carried out a self-comparison of the unmasked sequences using the S<sc>yn</sc>M<sc>ap</sc> program on the C<sc>o</sc>G<sc>e</sc> platform (Lyons and Freeling, <xref ref-type="bibr" rid="B5">2008</xref>; Lyons et al., <xref ref-type="bibr" rid="B6">2008</xref>) to construct paralogous syntenic blocks. Based on the distribution of gene pair similarities, also output by S<sc>yn</sc>M<sc>ap</sc>, we retained only those blocks for which the average similarity confirmed that the duplication occurred at the time of the most recent polyploidization event experienced by the genome.</p>
<p>For each of the four pairs of genomes, we then used S<sc>yn</sc>M<sc>ap</sc> to compare the two and construct orthologous synteny blocks. We again referred to the distribution of gene pair similarities in selecting only those blocks likely to have been created at the time of the speciation event at the origin of the diverging lineages leading to the two species being studied. We thus aimed to exclude synteny blocks created by polyploidization in the common ancestor of the two, including the gamma triplication, as well as blocks created in either of the two genomes by post-speciation polyploid events.</p>
<p>The stringent criteria incorporated in S<sc>yn</sc>M<sc>ap</sc>, such as a minimum number of contiguous pairs (default = 5), tend to exclude some of the homologous gene pairs created by these genomic events (represented in <xref ref-type="fig" rid="F1">Figure 1A</xref>), especially after some time has elapsed. Inversions, translocations and other chromosomal rearrangement events in a genome, or in either of two related genomes, break synteny blocks into smaller pieces that may not satisfy the criteria, as illustrated in <xref ref-type="fig" rid="F1">Figures 1B,E</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Mechanisms of loss of gene pair homology after speciation or polyploidization. <bold>(A)</bold> Synteny block made up of co-linear homologous pairs. <bold>(B)</bold> erosion of synteny block by translocation to a remote chromosomal location of a portion of sub-threshold length. <bold>(C)</bold> Pseudogenization. Genes rendered inoperable represented by gray dots. <bold>(D)</bold> Excision of DNA fragment including one or more genes. Arrows represent a new adjacency after the loss of the excised genes. <bold>(E)</bold> Jump of one member of pair to a different genomic location, or loss of only one of two or more homologs of the same gene.</p></caption>
<graphic xlink:href="fgene-11-603056-g0001.tif"/>
</fig>
<p>We have assessed the effect of the default S<sc>yn</sc>M<sc>ap</sc> requirement&#x02014;at least five closely spaced gene pairs for a synteny block to be identified&#x02014;by increasing and decreasing this threshold (see <xref ref-type="fig" rid="F2">Figure 2</xref>). A slight decrease in the number of genes in blocks when the threshold is increased to 6 is simply due to the elimination of a few blocks of length 5. But as we decrease the threshold to 3, the algorithm starts to capture blocks made up of independently created but coincidentally neighboring pairs, as well as pairs where one member is already in a larger block, since a gene can be in more than one block. It becomes increasingly difficult to disentangle the behavior of duplicate gene pairs created by polyploidization from other processes of duplication and loss. Thus, we retained the default value, 5.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Effect of minimum block size (number of genes) on the number of genes incorporated into synteny blocks.</p></caption>
<graphic xlink:href="fgene-11-603056-g0002.tif"/>
</fig>
<p>Since we will be focusing on pseudogenization and excision in our analysis (<xref ref-type="fig" rid="F1">Figures 1C,D</xref>), we developed a method that does not favor the identification of one over the other.</p>
</sec>
<sec>
<title>2.3. Identification of Deletion Intervals and Their Lengths</title>
<p>We scanned the output of the retained synteny blocks for homeologous segments on two chromosomes (or two disjoint regions of one chromosome) bounded by one or (usually) more duplicate gene pairs at both ends, where all the genes in one segment, the fractionated side, are absent, i.e., not detected by S<sc>yn</sc>M<sc>ap</sc>. (No gene can be absent from the other segment; otherwise the ancient gene pair, if it ever existed, would not be visible.) We call the number of contiguous single-copy genes in the unfractionated side of the segment the <italic>length</italic> of the interval. This is the same as the number of genes that are missing from the fractionated side.</p>
<p>For both sides of the segment, we also determine the amount of DNA between the pairs that bound the segment. For the unfractionated side, with all the single-copy genes, this is just the size (in base pairs) of the genes plus the intergenic regions, including the initial region, after one bounding pair, and the final region, before the other bounding pair, in the segment. In the fractionated side, this includes whatever DNA remains between the two bounding pairs, which does not include any genes, according to S<sc>yn</sc>M<sc>ap</sc>.</p>
<p>Two possibilities are represented by <xref ref-type="fig" rid="F1">Figures 1C,D</xref>. In the former case, pseudogenization, a gene is rendered inoperable, such as by a point mutation that creates a stop codon inside an erstwhile coding region. In the latter, a chromosomal fragment containing one or more genes is simply physically excised. To assess which of these two processes accounts for the data, we note that pseudogenization through acquiring a gene-internal stop codon, or a frameshift, leaving the gene intact, at least initially, does not shorten the length of the chromosomal region it is in. The average length of a pseudogene is roughly half of that of a functional gene (Xie et al., <xref ref-type="bibr" rid="B16">2019</xref>), but this average includes the very numerous short fragments. In contrast, excision of genes, including some or all of the flanking intergenic DNA, will definitely shorten the region, leaving at most a short stretch of non-coding sequence.</p>
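<p>The interval scan described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the data layout (a synteny block as an ordered list of aligned slots, each holding gene coordinates or nothing on either side) and the function name <monospace>deletion_intervals</monospace> are assumptions made for the example, and the coordinates are invented.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import List, Optional, Tuple

Gene = Optional[Tuple[int, int]]   # (start, end) in base pairs; None if no gene was detected
Slot = Tuple[Gene, Gene]           # one aligned position: (gene on side 0, gene on side 1)


def deletion_intervals(block: List[Slot]) -> List[Tuple[int, int, int, int]]:
    """Return (length, conserved_bp, reduced_bp, fractionated_side) for each interval.

    Hypothetical layout: coordinates increase along both segments, and every
    slot has a gene on at least one side.
    """
    # Slots where both copies are retained are the candidate bounding pairs.
    pairs = [i for i, (a, b) in enumerate(block) if a is not None and b is not None]
    found = []
    for left, right in zip(pairs, pairs[1:]):
        run = block[left + 1:right]
        if not run:
            continue  # adjacent retained pairs: nothing is missing between them
        missing = {0 if a is None else 1 for a, _ in run}
        if len(missing) != 1:
            continue  # losses on both sides: not a clean one-sided interval
        frac = missing.pop()
        kept = 1 - frac
        # DNA between the bounding pairs: end of the left gene to start of the right gene.
        conserved_bp = block[right][kept][0] - block[left][kept][1]
        reduced_bp = block[right][frac][0] - block[left][frac][1]
        found.append((len(run), conserved_bp, reduced_bp, frac))
    return found


# Invented coordinates: two genes lost from side 1, then one gene lost from side 0.
example_block = [
    ((100, 200), (1000, 1100)),   # retained pair
    ((300, 400), None),
    ((500, 600), None),
    ((700, 800), (1150, 1250)),   # retained pair
    (None, (1400, 1500)),
    ((900, 1000), (1600, 1700)),  # retained pair
]
print(deletion_intervals(example_block))
# [(2, 500, 50, 1), (1, 350, 100, 0)]
```
]]></preformat>
<p>In this toy block the first interval has length 2, with 500 bp on the conserved side and only 50 bp left on the reduced side, the pattern expected under excision. Under pseudogenization without sequence loss, the two base-pair counts would instead be of comparable size.</p>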
</sec>
<sec>
<title>2.4. The Visualization of Gene Density and Pseudogene Density</title>
<p>By plotting the average number of base pairs in the unfractionated, or totally conserved, intervals of a given length against the length of the interval, we estimate the average size of a gene (plus the following intergenic region). In most cases we expect this plot to be approximately linear, with slope giving the average base pairs per gene. This is just the inverse of the gene density for that interval. For the fractionated, or totally reduced, side, the number of base pairs per missing gene provides an upper limit (via its inverse) on the number of full-length pseudogenes that may be in the interval. Although most pseudogene tools were developed in the context of human or vertebrate genomes, and have limited applicability for plant genomes (Xiao et al., <xref ref-type="bibr" rid="B15">2016</xref>), Xie et al. have succeeded in implementing P<sc>seudo</sc>P<sc>ipe</sc> (Zhang et al., <xref ref-type="bibr" rid="B19">2006</xref>) for surveying pseudogenes in a range of plant species, and their results will be seen to be consistent with ours in the analyses below.</p>
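<p>As a rough illustration of the plot described above, the following Python sketch averages the base-pair counts for each interval length and fits an ordinary least-squares line to the averages. The helper names and the numbers are invented for the example; the slope on the conserved side plays the role of average base pairs per gene, while a slope near zero on the reduced side corresponds to a residue that does not grow with the number of missing genes.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def mean_bp_by_length(intervals: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average base pairs for each interval length (number of genes)."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for length, bp in intervals:
        sums[length] += bp
        counts[length] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


def least_squares(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Invented (length, base pairs) observations for the two sides.
conserved = [(1, 4000), (1, 6000), (2, 10000), (3, 15000)]
reduced = [(1, 300), (2, 300), (3, 300)]

for name, data in (("conserved", conserved), ("reduced", reduced)):
    means = mean_bp_by_length(data)
    slope, intercept = least_squares(list(means), list(means.values()))
    print(name, slope, intercept)
# conserved 5000.0 0.0
# reduced 0.0 300.0
```
]]></preformat>
<p>With these made-up values the conserved side grows by 5,000 bp per gene, whereas the reduced side stays at a constant 300 bp, which leaves no room for even one full-length pseudogene of typical size.</p>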
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3. Results</title>
<sec>
<title>3.1. Willow and Poplar</title>
<p><xref ref-type="fig" rid="F3">Figure 3</xref> contains the results of our analysis of the <italic>Salix</italic> and <italic>Populus</italic> genomes. The two panels on the left show the expected approximate linear growth in the number of base pairs in the unfractionated side of the interval. The great variability of the individual regions simply reflects the inhomogeneity of gene density along the length of the chromosome. In contrast, the regions in both <italic>Salix</italic> and <italic>Populus</italic> that have lost annotated genes show zero growth, with relatively little variability, as a function of the number of missing genes; they have lost almost all their DNA sequence. There cannot be significant numbers of pseudogenes, full or reduced, or other relics of the missing genes. This is striking evidence in favor of the predominance of excision.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Comparison of DNA content in unfractionated and fractionated intervals in the <italic>Salix</italic> and <italic>Populus</italic> genomes. Linear regression fits are indicated. Self-comparisons on the left do not distinguish between subgenomes since these are hard to identify across chromosomes and are generally mingled due to interchromosomal rearrangements, such as reciprocal translocation and chromosome fission and fusion. The two comparisons between genomes on the right hand side analyze gene loss from each genome separately. We use the terms &#x0201C;fractionated&#x0201D; and &#x0201C;unfractionated&#x0201D; in these two panels to mean &#x0201C;reduced&#x0201D; and &#x0201C;conserved,&#x0201D; even though the polyploidization-induced fractionation does not play a role here.</p></caption>
<graphic xlink:href="fgene-11-603056-g0003.tif"/>
</fig>
</sec>
<sec>
<title>3.2. <italic>Salvia</italic> and Teak</title>
<p><xref ref-type="fig" rid="F4">Figure 4</xref> contains the results of the corresponding analysis of the <italic>Salvia</italic> and <italic>Tectona</italic> genomes. The figures are very similar to those from the Salicaceae. Some of the curves show great fluctuation of the values for the longer intervals, but this is likely due to smaller sample size. Of interest is that the DNA content of the fractionated (read: &#x0201C;reduced&#x0201D;) intervals formed after speciation shows a small but steady increase, though it remains orders of magnitude smaller than the sizes of the unfractionated (&#x0201C;conserved&#x0201D;) intervals.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Comparison of DNA content in unfractionated and fractionated intervals in the <italic>Salvia</italic> and <italic>Tectona</italic> genomes.</p></caption>
<graphic xlink:href="fgene-11-603056-g0004.tif"/>
</fig>
</sec>
<sec>
<title>3.3. Flax and Rubber</title>
<p><xref ref-type="fig" rid="F5">Figure 5</xref> repeats the same analysis, this time applied to the <italic>Linum</italic> and <italic>Hevea</italic> genomes. The results parallel those of the two other pairs of genomes, except for the apparently anomalous behavior of the <italic>Hevea</italic> intervals, where the number of base pairs attains the same level as the conserved genes in <italic>Linum</italic>. This, however, may be seen as an artifact of the disproportionately large genome of <italic>Hevea</italic> with respect to that of <italic>Linum</italic>. The intergenic space in <italic>Hevea</italic> is four or five times as great as that of <italic>Linum</italic>, and there is much scope for retention or acquisition of repetitive elements and other sequence over the long period since the speciation event, which occurred much earlier than the other events we study.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Comparison of DNA content in unfractionated and fractionated (conserved and reduced) intervals in the <italic>Linum</italic> and <italic>Hevea</italic> genomes.</p></caption>
<graphic xlink:href="fgene-11-603056-g0005.tif"/>
</fig>
<p>To put this disproportion in perspective, we can normalize the <italic>Hevea</italic> results by a factor which measures the difference in sizes of the two genomes. This produces the comparisons in <xref ref-type="fig" rid="F6">Figure 6</xref>, which better resemble those of the Salicaceae and Lamiaceae.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Comparison of DNA content in unfractionated and fractionated (conserved and reduced) intervals in the <italic>Linum</italic> and normalized <italic>Hevea</italic> genomes.</p></caption>
<graphic xlink:href="fgene-11-603056-g0006.tif"/>
</fig>
</sec>
<sec>
<title>3.4. Pear and Apple</title>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> shows the analysis of the <italic>Pyrus</italic> and <italic>Malus</italic> genomes. Here again, we have an anomalously large amount of DNA in the <italic>Malus</italic> reduced gene intervals after speciation. It is true that the <italic>Malus</italic> genome is larger than that of <italic>Pyrus</italic>, but explaining this through normalization (<xref ref-type="fig" rid="F8">Figure 8</xref>) is not completely satisfactory. This is the only trend out of the thirty-two we have presented that departs from our main narrative.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Comparison of DNA content in unfractionated and fractionated (conserved and reduced) intervals in the <italic>Pyrus</italic> and <italic>Malus</italic> genomes.</p></caption>
<graphic xlink:href="fgene-11-603056-g0007.tif"/>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Comparison of DNA content in conserved and reduced intervals in the <italic>Pyrus</italic> and normalized <italic>Malus</italic> genomes.</p></caption>
<graphic xlink:href="fgene-11-603056-g0008.tif"/>
</fig>
</sec>
<sec>
<title>3.5. Comparisons Across Genome Pairs</title>
<p>To compare the results from the four pairs of genomes, we must take into account the diverse genome sizes, number of genes in a genome, and the resulting gene densities. <xref ref-type="fig" rid="F9">Figure 9</xref> shows that gene density (or rather its inverse: base pairs per length of conserved fragment) in unfractionated and conserved intervals closely tracks the average gene density (or its inverse) for the entire genome. At the same time, the residual sequence length in intervals where fractionation or gene loss has taken place is not sensitive to gene density; it remains very close to zero, as expected from an excision explanation.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Comparison of gene density in unfractionated regions and the whole genome. Diagonal represents equality between the two densities.</p></caption>
<graphic xlink:href="fgene-11-603056-g0009.tif"/>
</fig>
<p>We can also report, although it seems superfluous after examining <xref ref-type="fig" rid="F3">Figures 3</xref>&#x02013;<xref ref-type="fig" rid="F7">7</xref>, <xref ref-type="fig" rid="F9">9</xref>, that a <italic>t</italic>-test confirms at a very high level of significance that the slopes of the two regressions in each panel are different.</p>
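<p>The text does not specify which form of <italic>t</italic>-test was applied. One common way to compare two regression slopes, shown here only as a sketch under that assumption, divides the difference of the slopes by the square root of the sum of their squared standard errors. The function names and the data are invented for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Sequence, Tuple


def slope_and_se(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and its standard error (needs at least 3 points)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope, se


def slope_difference_t(xs1, ys1, xs2, ys2) -> float:
    """t statistic for the difference between two independent regression slopes.

    Assumed form: (b1 - b2) / sqrt(se1^2 + se2^2); other variants pool the
    residual variance instead.
    """
    b1, se1 = slope_and_se(xs1, ys1)
    b2, se2 = slope_and_se(xs2, ys2)
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)


# Invented mean base-pair counts for interval lengths 1 to 4.
lengths = [1, 2, 3, 4]
conserved_bp = [5200, 9800, 15100, 19900]
reduced_bp = [310, 290, 305, 295]
t = slope_difference_t(lengths, conserved_bp, lengths, reduced_bp)
print(round(t, 1))
```
]]></preformat>
<p>A large absolute value of the statistic, judged against a <italic>t</italic> distribution with the appropriate degrees of freedom, indicates that the conserved and reduced sides grow at different rates with interval length.</p>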
</sec>
<sec>
<title>3.6. Occurrences of Gene Translocation</title>
<p>To exclude other explanations of our syntenic block data, such as that in <xref ref-type="fig" rid="F1">Figure 1E</xref>, we looked further into the fate of the fractionated genes in the <italic>Populus-Salix</italic> comparison. By setting the minimum block size to 1 in the S<sc>yn</sc>M<sc>ap</sc> self-comparison, we could detect all pairs of gene duplicates, not only those in synteny blocks. We then searched for pairs to the singletons identified in the original (default 5) construction of synteny blocks that we analyzed in section 3.1 above. Of the 8,307 <italic>Salix</italic> singletons, we found only 429, or 5%, that were paired elsewhere in the genome at approximately the expected similarity level. Of the 10,737 <italic>Populus</italic> singletons, only 742, or 7%, were paired elsewhere. Moreover, some of the pairs that were identified could have been distinct paralogs that were part of a pre-existing triplet before fractionation; such triplets or higher sets of paralogs are not uncommon. We can conclude that translocation as an alternative explanation to excision can account for only a very small fraction of the gaps in synteny blocks.</p>
<p>There remains the possibility that if the missing genes did not translocate out of the synteny block, the singletons may have migrated in, after the polyploidization or speciation event (Vicient and Casacuberta, <xref ref-type="bibr" rid="B12">2017</xref>). The main mechanism for this would be retrotransposition. However, retroposons are generally not annotated as genes in the C<sc>o</sc>G<sc>e</sc> database, even in the unmasked genome sequences we studied, and thus would not show up as singletons. Neither are many of the singletons likely to be translocated genes: a large proportion of genes in these genomes are paired, and an equal proportion of the putatively translocated singletons would show up as pairs elsewhere in the genome in the minimum block size 1 analysis. We have already seen that this is not the case.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s4">
<title>4. Conclusions</title>
<p>The statistical evaluation of the massive duplicate gene cohorts created by speciation or polyploidization shows that pseudogenization is either a very rare process or does not result in much stable structure. By the present time, the clear impression is that fractionation simply excises the DNA of a gene or several contiguous genes. Ongoing work to be reported elsewhere suggests that this elimination of sequence does occur piecemeal over 30 million years or even 1 million years. It is of course still possible that once a pseudogene is created, or a gene otherwise silenced, its DNA is immediately vulnerable to repeated small deletions, so that the pseudogene itself would be transient. The distinction between this and some single-event excision becomes a matter of semantics.</p>
<p>More surprising perhaps is that gene loss after speciation, occurring independently in two sister genomes, seems to follow the same trajectory. There is of course no genomic interaction between species pairs like <italic>Salvia</italic> and <italic>Tectona</italic>, but their common origin allows us to use one to track the gene loss pattern in the other. There remain questions of how universal excision is; in the <italic>Salvia</italic>-<italic>Tectona</italic> and <italic>Poplar</italic>-<italic>Salix</italic> comparisons it is very clear. Because of the genome size differential, it is harder to determine in <italic>Linum</italic>-<italic>Hevea</italic>, while in the case of <italic>Malus</italic>, though fractionation proceeds by excision, further gene loss may involve other mechanisms as well. We note that the role of differential amounts of repetitive sequence and active retroposon activity can impact this type of comparison between species, less so within one species.</p>
<p>Although it is difficult to say whether it has any impact on our analysis, we note that the speciation of apple and pear came later than their common whole genome duplication. The same holds for poplar and willow. The teak whole genome duplication occurred before speciation, but the <italic>Salvia</italic> duplication came after. This means that we analyzed the more recent <italic>Salvia</italic> fractionation rather than the earlier one that it shares with teak. The rubber-flax speciation is much more ancient than their individual whole genome duplications.</p>
</sec>
<sec sec-type="data-availability-statement" id="s5">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found at: <ext-link ext-link-type="uri" xlink:href="https://genomevolution.org/coge/">https://genomevolution.org/coge/</ext-link>.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>ZY and DS planned the research, carried it out, and wrote this article. CZ developed and organized the data and participated in planning and devising the analyses. VA contributed to the interpretation of the analyses and to understanding the pertinence of the results. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Byrnes</surname> <given-names>J. K.</given-names></name> <name><surname>Morris</surname> <given-names>G. P.</given-names></name> <name><surname>Li</surname> <given-names>W. H.</given-names></name></person-group> (<year>2006</year>). <article-title>Reorganization of adjacent gene relationships in yeast genomes by whole-genome duplication and gene deletion</article-title>. <source>Mol. Biol. Evol.</source> <volume>23</volume>, <fpage>1136</fpage>&#x02013;<lpage>1143</lpage>. <pub-id pub-id-type="doi">10.1093/molbev/msj121</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eckardt</surname> <given-names>N. A.</given-names></name></person-group> (<year>2001</year>). <article-title>A sense of self: the role of DNA sequence elimination in allopolyploidization</article-title>. <source>Plant Cell</source> <volume>13</volume>, <fpage>1699</fpage>&#x02013;<lpage>1704</lpage>. <pub-id pub-id-type="doi">10.1105/tpc.13.8.1699</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hansen</surname> <given-names>M. T.</given-names></name></person-group> (<year>1978</year>). <article-title>Multiplicity of genome equivalents in the radiation-resistant bacterium <italic>Micrococcus radiodurans</italic></article-title>. <source>J. Bacteriol.</source> <volume>134</volume>, <fpage>71</fpage>&#x02013;<lpage>75</lpage>. <pub-id pub-id-type="doi">10.1128/JB.134.1.71-75.1978</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jacq</surname> <given-names>C.</given-names></name> <name><surname>Miller</surname> <given-names>J. R.</given-names></name> <name><surname>Brownlee</surname> <given-names>G. G.</given-names></name></person-group> (<year>1977</year>). <article-title>A pseudogene structure in 5S DNA of <italic>Xenopus laevis</italic></article-title>. <source>Cell</source> <volume>12</volume>, <fpage>109</fpage>&#x02013;<lpage>120</lpage>. <pub-id pub-id-type="doi">10.1016/0092-8674(77)90189-1</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lyons</surname> <given-names>E.</given-names></name> <name><surname>Freeling</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>How to usefully compare homologous plant genes and chromosomes as DNA sequences</article-title>. <source>Plant J.</source> <volume>53</volume>, <fpage>661</fpage>&#x02013;<lpage>673</lpage>. <pub-id pub-id-type="doi">10.1111/j.1365-313X.2007.03326.x</pub-id><pub-id pub-id-type="pmid">18269575</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lyons</surname> <given-names>E.</given-names></name> <name><surname>Pedersen</surname> <given-names>B.</given-names></name> <name><surname>Kane</surname> <given-names>J.</given-names></name> <name><surname>Alam</surname> <given-names>M.</given-names></name> <name><surname>Ming</surname> <given-names>R.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>Finding and comparing syntenic regions among <italic>Arabidopsis</italic> and the outgroups papaya, poplar and grape: CoGe with rosids</article-title>. <source>Plant Physiol.</source> <volume>148</volume>, <fpage>1772</fpage>&#x02013;<lpage>1781</lpage>. <pub-id pub-id-type="doi">10.1104/pp.108.124867</pub-id><pub-id pub-id-type="pmid">18952863</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ohno</surname> <given-names>S.</given-names></name></person-group> (<year>1970</year>). <source>Evolution by Gene Duplication</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sankoff</surname> <given-names>D.</given-names></name> <name><surname>Zheng</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>B.</given-names></name> <name><surname>Fernando Buen Abad Najar</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <article-title>Structural <italic>vs</italic>. functional mechanisms of duplicate gene loss following whole genome doubling</article-title>. <source>BMC Genomics</source> <volume>15</volume>:<fpage>915</fpage>. <pub-id pub-id-type="doi">10.1109/ICCABS.2014.6863915</pub-id><pub-id pub-id-type="pmid">26680009</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sankoff</surname> <given-names>D.</given-names></name> <name><surname>Zheng</surname> <given-names>C.</given-names></name> <name><surname>Zhu</surname> <given-names>Q.</given-names></name></person-group> (<year>2010</year>). <article-title>The collapse of gene complement following whole genome duplication</article-title>. <source>BMC Genomics</source> <volume>11</volume>:<fpage>313</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2164-11-313</pub-id><pub-id pub-id-type="pmid">20482863</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tobiason</surname> <given-names>D. M.</given-names></name> <name><surname>Seifert</surname> <given-names>H. S.</given-names></name></person-group> (<year>2006</year>). <article-title>The obligate human pathogen, <italic>Neisseria gonorrhoeae</italic>, is polyploid</article-title>. <source>PLoS Biol.</source> <volume>4</volume>:<fpage>e185</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pbio.0040185</pub-id><pub-id pub-id-type="pmid">16719561</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van Hoek</surname> <given-names>M. J.</given-names></name> <name><surname>Hogeweg</surname> <given-names>P.</given-names></name></person-group> (<year>2007</year>). <article-title>The role of mutational dynamics in genome shrinkage</article-title>. <source>Mol. Biol. Evol.</source> <volume>24</volume>, <fpage>2485</fpage>&#x02013;<lpage>2494</lpage>. <pub-id pub-id-type="doi">10.1093/molbev/msm183</pub-id><pub-id pub-id-type="pmid">17768305</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vicient</surname> <given-names>C. M.</given-names></name> <name><surname>Casacuberta</surname> <given-names>J. M.</given-names></name></person-group> (<year>2017</year>). <article-title>Impact of transposable elements on polyploid plant genomes</article-title>. <source>Ann Bot.</source> <volume>120</volume>, <fpage>195</fpage>&#x02013;<lpage>207</lpage>. <pub-id pub-id-type="doi">10.1093/aob/mcx078</pub-id><pub-id pub-id-type="pmid">28854566</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Winge</surname> <given-names>&#x000D6;.</given-names></name></person-group> (<year>1917</year>). <article-title>The chromosomes: their number and general importance</article-title>. <source>Comptes Rendus des Travaux Lab. Carlsberg</source> <volume>13</volume>, <fpage>131</fpage>&#x02013;<lpage>275</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wolfe</surname> <given-names>K.</given-names></name> <name><surname>Shields</surname> <given-names>D.</given-names></name></person-group> (<year>1997</year>). <article-title>Molecular evidence for an ancient duplication of the entire yeast genome</article-title>. <source>Nature</source> <volume>387</volume>, <fpage>708</fpage>&#x02013;<lpage>713</lpage>. <pub-id pub-id-type="pmid">9192896</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiao</surname> <given-names>J.</given-names></name> <name><surname>Sekhwal</surname> <given-names>M. K.</given-names></name> <name><surname>Li</surname> <given-names>P.</given-names></name> <name><surname>Ragupathy</surname> <given-names>R.</given-names></name> <name><surname>Cloutier</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Pseudogenes and their genome-wide prediction in plants</article-title>. <source>Int. J. Mol. Sci.</source> <volume>17</volume>, <fpage>1991</fpage>&#x02013;<lpage>2006</lpage>. <pub-id pub-id-type="doi">10.3390/ijms17121991</pub-id><pub-id pub-id-type="pmid">27916797</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Ingvarsson</surname> <given-names>P. K.</given-names></name></person-group> (<year>2019</year>). <article-title>Evolutionary origins of pseudogenes and their association with regulatory sequences in plants</article-title>. <source>Plant Cell</source> <volume>31</volume>, <fpage>563</fpage>&#x02013;<lpage>578</lpage>. <pub-id pub-id-type="doi">10.1105/tpc.18.00601</pub-id><pub-id pub-id-type="pmid">30760562</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Zheng</surname> <given-names>C.</given-names></name> <name><surname>Sankoff</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>Gaps and runs in syntenic alignments</article-title>, in <source>International Conference on Algorithms for Computational Biology</source>. Lecture Notes in Computer Science <volume>12099</volume>, <fpage>49</fpage>&#x02013;<lpage>60</lpage>.</citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Z. N.</given-names></name> <name><surname>Sankoff</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>A continuous analog of run length distributions reflecting accumulated fractionation events</article-title>. <source>BMC Bioinformatics</source> <volume>17</volume>(<supplement>Suppl. 14</supplement>):<fpage>412</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-016-1265-5</pub-id><pub-id pub-id-type="pmid">28185566</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Carriero</surname> <given-names>N.</given-names></name> <name><surname>Zheng</surname> <given-names>D.</given-names></name> <name><surname>Karro</surname> <given-names>J.</given-names></name> <name><surname>Harrison</surname> <given-names>P. M.</given-names></name> <name><surname>Gerstein</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <article-title>PseudoPipe: an automated pseudogene identification pipeline</article-title>. <source>Bioinformatics</source> <fpage>1437</fpage>&#x02013;<lpage>1439</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btl116</pub-id><pub-id pub-id-type="pmid">16574694</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>C.</given-names></name> <name><surname>Wall</surname> <given-names>P. K.</given-names></name> <name><surname>Leebens-Mack</surname> <given-names>J.</given-names></name> <name><surname>dePamphilis</surname> <given-names>C.</given-names></name> <name><surname>Albert</surname> <given-names>V. A.</given-names></name> <name><surname>Sankoff</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>Gene loss under neighbourhood selection following whole genome duplication and the reconstruction of the ancestral <italic>Populus</italic> diploid</article-title>. <source>J. Bioinform. Comput. Biol.</source> <volume>7</volume>, <fpage>499</fpage>&#x02013;<lpage>520</lpage>. <pub-id pub-id-type="doi">10.1142/s0219720009004199</pub-id><pub-id pub-id-type="pmid">19507287</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> Research and publication costs were supported in part by Discovery Grant number 8867-200 from the Natural Sciences and Engineering Research Council of Canada. DS holds the Canada Research Chair in Mathematical Genomics.</p>
</fn>
</fn-group>
</back>
</article>