Potential of Herbariomics for Studying Repetitive DNA in Angiosperms

Repetitive DNA has an important role in angiosperm genomes and is relevant to our understanding of genome size variation, polyploidisation and genome dynamics more broadly. Much recent work has harnessed the power of high-throughput sequencing (HTS) technologies to advance the study of repetitive DNA in flowering plants. Herbarium collections provide a useful historical perspective on genome diversity through time, but their value for the study of repetitive DNA has not yet been explored. We propose that herbarium DNA may prove as useful for studies of repetitive DNA content as it has for reconstructed organellar genomes and low-copy nuclear sequence data. Here we present a case study in the tobacco genus (Nicotiana; Solanaceae), showing that herbarium specimens can provide accurate estimates of the repetitive content of angiosperm genomes by direct comparison with recently-collected material. We show a strong correlation between the abundance of repeat clusters, e.g. different types of transposable elements and satellite DNA, in herbarium collections versus recent material for four sets of Nicotiana taxa. These results suggest that herbarium specimen genome sequencing (herbariomics) holds promise for both repeat discovery and analyses that aim to investigate the role of repetitive DNAs in genomic evolution, particularly genome size evolution and/or contributions of repeats to the regulation of gene space.

require prior amplification. These can then be universally amplified in order to get an appropriate concentration to sequence, and provided this is not overdone, the distribution of fragments across the genome should represent those that were present in the original sample. Provided that patterns of degradation are mostly stochastic (Staats et al., 2011), then the resulting sequence data should be suitable for many studies of genome evolution (including nuclear and organellar). Most studies that have thus far utilized HTS for herbarium specimen sequencing (herbariomics) have focussed on genome skimming (Dodsworth, 2015a) for organellar genome reconstruction, in particular for assembling the plastid genome (Bakker et al., 2016;Bakker, 2017). This is due to several factors, including ease of assembly and the predominance of plastid regions in angiosperm phylogenetics. But there has also been interest in examining high-copy DNA sequences in this material, which can be examined in cost-effective yet low-coverage genome "skims" (Straub et al., 2012;Dodsworth, 2015a). High-copy DNA is also present in such samples, from both the mitochondrial genome and nuclear genomes, of which the chief component is repetitive elements.
The predominant sequences in angiosperm genomes generally are repetitive elements, which contribute a majority of nuclear DNA (Pellicer et al., 2018). These repeats include both class I (retrotransposons) and class II elements (DNA transposons), as well as satellite and other tandemly repeated sequences. The majority of DNA in most angiosperms studied to date is from two superfamilies of retrotransposons, the Ty1/Copia and Ty3/Gypsy families (Renny-Byfield et al., 2011;Novák et al., 2014;Macas et al., 2015;Mccann et al., 2018). Consequently, characterizing and understanding the repeat landscape in angiosperm genomes is essential for understanding the bulk of DNA in angiosperm genomes. Genome size varies over 2,400-fold in angiosperms (Kelly and Leitch, 2011;Dodsworth et al., 2015b;Pellicer et al., 2018). Thus changes in repetitive DNA content and dynamics are important for understanding aspects of genome size variation (Kelly et al., 2015;Macas et al., 2015;Dodsworth et al., 2016;Pellicer et al., 2018), genome dynamics and genomic processes post-polyploidization (Renny-Byfield et al., 2013;Dodsworth et al., 2017), as well as the impact of repeats on gene space evolution (Lisch, 2013;Dodsworth et al., 2015b).
Despite the importance of repetitive DNA for genome and organismal evolution in plants, few studies have looked at repeat sequences in detail in DNA samples obtained from herbarium specimens. Potentially, patterns of degradation could be uneven and biased in a non-stochastic manner that would hinder, for examples, the accurate estimation of the abundance of different repeat types, but this is currently not well characterized.
Nevertheless, there appears to be no obvious bias in degradation between the three genomic compartments in plants (i.e., plastome, mitogenome, and nuclear genome) from comparative studies of herbarium material through time, although this may be subject to particular treatment contexts (Staats et al., 2011). Most studies that do concentrate on the nuclear genome are focussed mostly on reconstruction of whole-genome sequence data (Staats et al., 2013), but dips in coverage across lowcopy genic regions are certainly more pronounced compared to fresh material or genome references. If there are no significant sequence-dependent biases in degradation patterns across the nuclear genome then the majority of abundant repeats should be present in similar abundance as in freshly collected samples. Here we aim to test this idea by using a case study of four species from the tobacco genus (Nicotiana section Suaveolentes; (Solanaceae), where we directly compare herbarium collections with recently collected material of the same taxa.

MATERIALS AND METHODS
Four species of Nicotiana section Suaveolentes (small-medium sized herbs- Figure 1A) were sequenced, for which there were both recently collected silica-dried leaf material (hereafter "fresh") and also herbarium collections of approximately 10 years of age (hereafter "herbarium"). The current consensus is that most damage is done during the initial specimen collection/preservation stage (in particular the heat treatment used in the drying process) (Staats et al., 2011(Staats et al., , 2013Bakker, 2017), hence the young age of specimens chosen here should not confound some conclusions regarding herbarium DNA more broadly. Large variation amplitude of environmental conditions during long-term storage would also contribute to further DNA degradation, but the storage conditions are generally relatively constant in most herbaria. Herbarium specimens were prepared in 2005 (housed at MELU) and sampled for DNA in 2015 from the following duplicate collections housed at BM: N. heterantha voucher specimens for all samples are held at K and AD. A sample from freshly collected N. africana (USDA line TW6) was used as an outgroup taxon (voucher at QMUL). Genomic DNA was extracted using a modified CTAB protocol (Wang et al., 2013) and Illumina TruSeq PCR-free kits were used to prepare all libraries with 350-550 bp average insert sizes after Covaris sonication. Libraries were single or dual-indexed, multiplexed and sequenced either on a NextSeq (V2-300 cycles 2 × 150 bp PE; N. africana and N. velutina [both] samples) or MiSeq (V2-300 cycles at 2 × 150 bp PE; all other samples) at Queen Mary University of London Genome Center.
Phylogenetic relationships were reconstructed using the large subunit of ribosomal DNA (rDNA) assembled for each sample. First, a hybrid full-length unit was reconstructed using Nicotiana sylvestris and Nicotiana tabacum clones, following (Lunerová et al., 2017), and reads were mapped to this consensus for each sample in Geneious v. 9.1.4 (Kearse et al., 2012) for each sample, with the map-to-reference tool, 5 iterations and using the default settings. Resulting consensus sequences were aligned using MAFFT (Katoh et al., 2017) and ambiguous portions of the alignment were removed using Gblocks (Castresana, 2000;Talavera and Castresana, 2007) in SeaView v. 4.5.4 (Gouy et al., Frontiers in Ecology and Evolution | www.frontiersin.org 2010). The final alignment consisted of 8,458 bp. Phylogenetic inference was performed using RAxML v. 8.2.10 (Stamatakis, 2014) with the GTR+G substitution model and 100 bootstrap replicates, as implemented on the CIPRES web server (Miller et al., 2010).
Clustering analyses were performed using the RepeatExplorer2 (RE2) pipeline on the Galaxy server (www. repeatexplorer.org) (Novák et al., 2010(Novák et al., , 2013. Reads were pre-processed to remove those with minimum quality less than 10 over 95% of the read length, with maximum 5 Ns permitted in any read. Reads for each taxon sample were prefixed with unique 6-letter codes, to enable comparative analysis, and 125,000 reads were randomly sub-sampled for each taxon sample. These sub-samples were repeated in triplicate for a total dataset size of 3,375,000 reads, and mean number of reads and coefficients of variation were calculated per sample and repeat type. Ideally reads should be taken in proportion to genome size (i.e., genome proportion, see Dodsworth et al., 2015a). For this dataset we chose equal-read number sampling, however, due to a lack of genome size data for the taxa used in this study, and no attempt at phylogenetic reconstruction from the repeat abundances (sensu Dodsworth et al., 2015a). Default clustering parameters were used as per . Automatic annotations were scrutinized with respect to protein domain hits and pairedend read information in order to provide annotations for the top 15 most-abundant clusters ( Figure 1C). Annotation in RE2 is a development of existing classifications (Jurka et al., 2005;Wicker et al., 2007;Llorens et al., 2011), based on a custom protein domain database in RE2 and phylogenetic lineages-particularly for retroelement (Ty1/Copia and Ty3/Gypsy) classification. Contaminating clusters (e.g., Illumina PhiX control) and organellar clusters (plastid, mitochondrial) were removed prior to further analyses. Graphical plots and statistical analyses were performed in R version 3.3.0 (R Development Core Team, 2016). Mean number of reads between fresh and herbaria collected samples were fitted as bivariate regressions. CL1 (satellite) was removed as an outlier. To compare variation between species, replicates, and sample type, number of reads were fitted in a negative binomial regression to accommodate over dispersion in a Poisson model, with the MASS package (Venables and Ripley, 2002). A linear regression was fitted to test effects of species, sample type, and cluster size on the coefficient of variation (CoV) between replicates. CoV was log-transformed and fitted with a third degree polynomial. In all analyses, residuals were viewed in diagnostic plots to ascertain that model assumptions were met.

RESULTS AND DISCUSSION
Phylogenetic analyses of rDNA confirmed the sister relationship between samples of the same species (Figure 1B) as expected. Plotting the comparative sizes of the 15 most-abundant clusters ( Figure 1C) showed a strong similarity in cluster size between herbarium and fresh samples of each species, but also the similar genomic composition across all four Australian section Suaveolentes taxa (N. heterantha, N. cavicola, N. rosulata subsp. ingulba, and N. velutina). Section Suaveolentes is a monophyletic group of allotetraploid origin, approximately 6-7 million years' old (Clarkson et al., 2017). Most of the Australian species have probably originated in the last 2-3 million years, and although there is considerable chromosome number differentiation (from n = 24 to 15) in these species, their general genomic composition appears strikingly similar (Dodsworth, 2015b). Notably this does exclude N. africana, the outgroup, which has a highly divergent genomic composition. It is the sole member of Nicotiana in Africa, found in Namibia and sister to the rest of section Suaveolentes in all phylogenetic analyses to date (Clarkson et al., 2010(Clarkson et al., , 2017Kelly et al., 2013). The most abundant mobile elements in all genomes were LTR retrotransposons: Ty1/Copia retroelements of the Maximus clade (Figure 1C), followed by Ty3/Gypsy chromoviruses, and other LTR retroelements (Ty1/Copia-Bianca). A large satellite was present in all Australian taxa (Figure 1C-CL1, green rectangle) and in much lower abundance in N. africana. One non-LTR LINE element was also recovered as abundant in all taxa including N. africana ( Figure 1C-CL7, red rectangle), and one DNA transposon (CL15).
To investigate further the relationship between herbarium and fresh samples for each species, read counts (cluster abundance) were plotted for each species (Figure 2). All clusters representing at least 0.01% of the genome were plotted (232 clusters). Lines of best-fit from linear regression models were plotted (solid line) vs. the 1:1 line (dashed line) for comparison between correlation and the expected 1:1 ratio. A strong correlation was found between cluster abundance in the fresh samples and the herbarium samples for each set of taxa (Figure 2), with Pearson's coefficients of 0.97 or greater (Figure 2; Table S1). However, in most cases (with perhaps the exception of N. rosulata subsp. ingulba), the line obviously deviated from the 1:1 line. In all cases the regression line was above the 1:1 line, indicating slightly higher abundance of repeats estimated for each herbarium sample vs. the fresh samples ( Figure 2E). This could be due to high-copy regions of the DNA being more frequent following DNA degradation than low-copy regions, thereby marginally biasing the DNA samples in favor of highcopy repetitive sequences. Potentially genic DNA has a more open chromatin confirmation, is less protected by histones, and more vulnerable to DNA fractionation with aging. However, DNA copy number fluctuations are not seriously influencing the overall abundance of repeats, neither for particular types of element (Figure 1C), nor particular cluster sizes (Figure 2)for elements that constitute at least 0.01% of the genome. Furthermore, the differences we do observe in repeat copy numbers (Figures 1C, 2) could reflect slight differences in genome size between samples given that they were not from the same individual (nor even the same population). This is most pronounced for CL1, a satellite repeat, for which expansion and contraction is more common due to unequal recombination or sister chromatid exchange. Thus, one might expect an even tighter correlation if the exact same individuals were used for the comparison. No herbarium DNA-specific clusters were found amongst the clusters analyzed representing at least 0.01% of the genome. There was no obvious pattern  (Table S4).
regarding repeat type (Figure 2) and deviation across all four comparisons.
Variation between technical replicates was generally low ( Figure 1C;Table S2), although the coefficient of variation was significantly affected by cluster size (mean number of reads), increasing as read number decreases ( Figure 2F; Table S3). Herbarium samples had significantly less variation (c. 8.4%) across replicates, compared to fresh samples ( Table S3). The abundance of elements was consistently higher in herbarium samples compared to fresh samples by c. 17.7% (Table S4) as indicated by regression lines always above the 1:1 line (Figure 2). This was variable though, with the most variation between herbarium and fresh samples found for N. rosulata subsp. ingulba (reduced values for Pearson's correlation coefficient of 0.97 and R 2 of 0.95; Table S1). The difference between herbarium and fresh samples was generally slightly smaller than the difference between species (c. 18-19%), apart from N. heterantha, which was not significantly different to N. cavicola (Table S4). Nicotiana section Suaveolentes taxa were previously found to have an usually static genomic repeat composition between species of the core Australian clade (Dodsworth, 2015b), and therefore it is not surprising that the differences between species are similar to differences within species. Whether or not these differences would have any significance to, e.g., evolutionary studies, would entirely depend on the comparisons being made and whether studies incorporate many comparisons between both herbarium and non-herbarium DNA samples.
The library preparation process is also known to influence the sequencable fragments retrieved, and this may be in a nonstochastic manner, and dependent on repeat type. For instance, Macas et al. (2015) found up to four-fold variation (1.7-2.0 on average) for satellite DNA clusters between different libraries prepared from the same DNA sample, for two species of Vicia beans. In contrast, they found only up to two-fold variation between replicates for mobile elements , and if summing read counts for whole groups of repeats then the variation was mostly eliminated. This variation is likely due to PCR steps that can introduce biases particular in relation to GC content. This was not directly addressed with library replicates here, although the same PCR-free kits were used for library preparation of all samples, which should alleviate PCR bias.
Whilst there are only small differences between repeat copy numbers in herbarium and fresh material assessed using genome skimming, the copy number variation will have an impact on attempts to assemble whole genomes or plastomes (Bakker, 2017), where amount of plastid reads and fractionation may also be affected by growing conditions at the time of specimen fixation. Studies looking at patterns of degradation in herbarium collections over time, including of the same individual plant (e.g., Staats et al., 2011), concluded that most double-strand breaks occur directly after the specimens are fixed (i.e., the heat or other treatment), and that the age of specimens is of less consequence. Thus, the genome skimming results we show here will likely hold for herbarium specimens that are much older than used in this study. They may even hold for cases where there is a correlation between age and fragmentation over time, as other studies have found in herbarium material ranging over 300 years in age (Weiß et al., 2016). Other considerations regarding singlestrand damage and base changes, such as C → T transitions as a result of cytosine deamination toward the end of reads, should be considered. However, Staats et al. (2011) found that C → T and G → A transitions were only increased in plastid-derived herbarium reads and not for nuclear or mitochondrial ones, and estimated this transition rate to be very low, at approximately 1.53 x 10 −6 nucleotide −1 year −1 . This was not directly addressed in this study, due to the clustering method used for repetitive element analysis, whereby some single base substitutions are unlikely to have a material impact on the data. This is because repeat types and abundances are a result of clusters that include reads with a hit for >90% similarity and >55% of the read length (Novák et al., 2010(Novák et al., , 2013.

CONCLUSIONS
We found a clear correlation between repeat cluster size in herbarium material and fresh material, although this deviated slightly from a 1:1 relationship in the four cases analyzed. Clusters included different types of class I and class II repetitive elements, including in the most abundant clusters, several types of LTR and non-LTR retrotransposons and a large satellite DNA. No obvious bias was found for neither particular types of repetitive element, nor particular sizes of clusters (representative of element abundance), in most cases. Small differences in element abundance were found, with herbarium specimens generally having higher abundances. Further investigations across different angiosperm taxa, with a variety of secondary chemistry, as well as explicitly testing different drying methods, would be valuable to test how generalisable these results are. Overall, we believe that herbarium specimens show promise not only for characterizing the types of repeats present in angiosperm genomes, but also for comparative studies investigating genome evolution.

DATA AVAILABILITY
Datasets are in a publicly accessible repository: The datasets for this study can be found in the NCBI Sequence Read Archive (SRA) with the following SRA accession: SRP157901 AUTHOR CONTRIBUTIONS SD, AL, FF, and MC conceived the study, with input from SK, OM, MG, and RC. Access to herbarium material at BM was provided by SK. MWC and MC provided access to fresh material. MS, RC, and SD conducted lab work. SD and MG performed analyses. All authors contributed to writing and editing the final manuscript.

FUNDING
This work was funded by a NERC studentship to SD.