Characterization of Hemerocallis citrina Transcriptome and Development of EST-SSR Markers for Evaluation of Genetic Diversity and Population Structure of Hemerocallis Collection

Hemerocallis spp. commonly known as daylilies and night lilies, are among the most popular ornamental crops worldwide. In Eastern Asia, H. citrina is also widely cultivated as both a vegetable crop and for medicinal use. However, limited genetic and genomic resources are available in Hemerocallis. Knowledge on the genetic diversity and population structure of this species-rich genus is very limited. In this study, we reported transcriptome sequencing of H. citrina cv. ‘Datonghuanghua’ which is a popular, high-yielding variety in China. We mined the transcriptome data, identified and characterized the microsatellite or simple sequence repeat (SSR) sequences in the expressed genome. From ∼14.15 Gbp clean reads, we assembled 92,107 unigenes, of which 41,796 were annotated for possible functions. From 41,796 unigenes, we identified and characterized 3,430 SSRs with varying motifs. Forty-three SSRs were used to fingerprint 155 Hemerocallis accessions. Clustering and population structure analyses with the genotypic data among the 155 accessions reveal broader genetic variation of daylilies than the night lily accessions which form a subgroup in the phylogenetic tree. The night lily group included accessions from H. citrina, H. lilioasphodelus, and H. minor, the majority of which bloom in the evening/night, whereas the ∼100 daylily accessions bloomed in the early morning suggesting flowering time may be a major force in the selection of night lily. The utility of these SSRs was further exemplified in association analysis of blooming time among these accessions. Twelve SSRs were found to have significant associations with this horticulturally important trait.


INTRODUCTION
Hemerocallis species are among the most popular ornamental crops worldwide because of the large, conspicuous flowers and their adaptation to a wide range of soils and climates. In Eastern Asian countries, some species, especially H. citrina (night lily, or long yellow daylily) are important vegetables or medicinal plants with a long history of cultivation. For example, the first reference of H. fulva (daylily) was found in a writing in the Chou Dynasty of   (Hu, 1968;Kitchingnan, 1985). The unopened flower buds (fresh or dried) of the night lily are consumed as a special vegetable (also known as "golden needle vegetable"). The medicinal value of Hemerocallis species has received more attention in recent years. The flower buds seem to be enriched with antioxidants, such as stelladerol and caffeoylquinic acid derivatives Hsu et al., 2011;Lin et al., 2011;Jiao et al., 2016). These secondary metabolites are used to treat anxiety and swelling (Uezu, 1998;Gu et al., 2012;Yi et al., 2012;Du et al., 2014;Liu et al., 2014), and for other applications in modern medicine and biology (Venugopalan and Srivastava, 2015). Thus, Hemerocallis flowers have considerable potentials as "nutraceutical" or "functional" foods (Rop et al., 2011).
In addition to their economic and medical values, some features of Hemerocallis species are of particular biological significance. For example, the day and night lilies provide an excellent model for understanding the genetic and molecular mechanisms of flower opening and flower senescing. All known species in this genus show strict circadian rhythm of flowering with the rapid opening and withering that lasts only a few hours indicting precise regulation of floral death probably by a programmed cell death system (Hasegawa et al., 2006;Nitta et al., 2010;Hirota et al., 2012Hirota et al., , 2013). All Hemerocallis species exhibit self-incompatibility. As such, Hemerocallis was proposed as a future model system to study these biologically interesting phenomena (Rodriguez-Enriquez and Grant-Downton, 2012).
As an ornamental plant, extensive breeding efforts during the last century have resulted in varieties with different flower size, color, shape, scent and blooming time. In the US, over 83,000 daylily varieties have been registered in the American Hemerocallis Society online database 1 . Some morphological characteristics of daylily and night lily plants and flowers are exemplified in Figure 1. Despite the economic and biological importance as well as extensive breeding efforts on Hemerocallis species, not much has been accomplished on genetic studies in this genus. As demonstrated in numerous other horticulture crops, molecular markers are important tools for genetic research and breeding in Hemerocallis with many potential applications such as evaluation of genetic diversity and population structure of germplasm collection, development of linkage maps, gene and QTL (quantitative trait loci) mapping or cloning, as well as marker-assisted selection of horticulturally important traits in breeding. Nevertheless, limited work has been done in the development of molecular markers for Hemerocallis species. The natural range of Hemerocallis encompasses temperate and subtropical Asia, with the main center of diversity of the genus in China, Korea and Japan (Flora of China, 2000). However, reports on the genetic diversity and population genetic analysis of Hemerocallis especially for the collections in China are sporadic (e.g., Tomkins et al., 2001;Podwyszynska et al., 2009;Vendrame et al., 2009;Miyake and Yahara, 2010). In these early reports, molecular markers employed included random amplified polymorphic DNA (RAPD) and amplified fragment 1 http://www.daylilies.org/ length polymorphism (AFLP), each of which has limitations in practical uses (Varshney et al., 2007). Despite the increased use of single nucleotide polymorphism (SNPs), SSR markers remain popular in plant genetic and breeding studies because of their many desirable attributes, including hyper variability, a multi allelic nature, co-dominant inheritance, reproducibility, relative abundance, extensive genome coverage (including organellar genomes), relatively low cost, and amenability to high throughput genotyping (Parida et al., 2009). However, very few SSRs have been reported for Hemerocallis (e.g., Zhu et al., 2009). While there are several methods to develop SSR markers, transcriptome sequencing (RNA-Seq) has been an efficient and affordable way for large-scale, and genome-wide SSR discovery and marker development for various marker-based studies in non-model plants (e.g., Hudson, 2010;Strickler et al., 2012;Yuan et al., 2013;Zhang et al., 2013;Guo et al., 2016;Smithunna et al., 2016). Thus, the objective of the present study is to develop EST (expressed sequence tag)-SSR markers through transcriptome sequencing and examine their utility in evaluation of genetic diversity and population structure of our Hemerocallis collection. We also conducted association analysis to identify potential association of molecular markers with horticulturally important traits.

Plant Materials
One hundred and fifty-five Hemerocallis accessions were employed for genetic diversity analysis with 43 EST-SSRs. These accessions belong to at least 13 species from different geographic regions around the world including commercial varieties, landraces, and collections from the wild. The details of these 155 accessions are presented in Supplementary Table  S1, and their geographic distributions are illustrated in Figure 2. The taxonomic status of more than half of the 155 accessions was labeled uncertain during collection (Hemerocallis spp. in Supplementary Table S1). All accessions were grown in the Hemerocallis Germplasm Resource Nursery located on the campus of Shanxi Agricultural University (Taigu, Shanxi Province, China). All plants used for sample collection have been grown for 3-5 years in the nursery. At full flowering stage, the fresh roots, flower buds, and leaves of the H. citrina cv 'Datonghuanghua' were collected, flash frozen in liquid nitrogen, and used for total RNA extraction and RNA-Seq.

RNA-Seq, Transcriptome Assembly, and Annotation
For RNA-Seq, total RNA was isolated from each sample with three plants pooled with a modified CTAB protocol (Tel-Zur et al., 1999). RNA quantity and quality were evaluated using the Agilent Bioanalyzer (Agilent Technologies, CA, United States). The integrity of the total RNA was also assessed through agarose gel electrophoresis. The cDNA library for RNA-Seq was prepared using Illumina Truseq TM RNA sample prep kit (Illumina, CA, United States) following manufacturer's protocols. The concentration and insert size of the library were assessed using Qubit2.0 and Agilent 2100. Pair-end (125 PE) Illumina high-throughput sequencing was performed with a Hi-Seq 2500 sequencing machine. All samples were sequenced in the same instrument (HWI-7001455), same run (316), same follow cell (HH7VHADXX), and same lane (2).
After resequencing, the adaptors and low quality reads were filtered from the raw reads. The clean reads were de novo assembled into contigs with an optimized k-mer length = 25 and group pairs distance = 300 (Grabherr et al., 2011) using the Trinity program 2 . Unigene sequences were aligned against databases of the Clusters of Orthologous Groups (COG) (Tatusov et al., 2000), Gene Ontology (GO) (Michael et al., 2000), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2004), euKaryotic Orthologous Groups (KOG) (Eugene et al., 2004). The functions of unigenes were predicted by BLAST against the NCBI non-redundant protein (Nr), and Swiss-Prot databases at an E-value of 10 −5 . The resulting datasets were further aligned to the Protein family (PFAM) database with HMMER (E-value 10 −10 ). For further quantitative assessment of the completeness of the assembly and annotation, the H. citrina cv 'Datonghuanghua' transcriptome was subjected to BUSCO (Benchmarking Universal Single-Copy Orthologs) analysis (Simão et al., 2015) against the Viridiplantae_odb10, embryophyta_odb10, and liliopsida_odb10 databases.

Microsatellite Sequence Identification and EST-SSR Genotyping
SSR identification among the unigene sequences (>1 kb) was performed with the MIcroSAtellite (MISA) program 3 .
All microsatellites containing di-, tri-, tetra-, penta-, hexanucleotide, and compound motifs with more than five repeats were included. The Primer3 program 4 was used to design primers for the identified SSRs. Each designed primer pair was first evaluated with in silico PCR procedure 5 using unigene sequences as the template. Primers with multiple amplicons were filtered out.
Forty-three EST-SSR markers were selected for genotyping 155 Hemerocallis accessions. Genomic DNA was extracted from fresh leaves of each accession (pooled from five plants) using the CTAB method. A touch-down PCR procedure (Azhaguvel et al., 2009) was employed and the amplicons were separated using 9% non-denaturing polyacrylamide gel electrophoresis (PAGE) with 0.5 × TBE buffer for 1.5 h in 250V, and then visualized with silver staining.

Cluster and Population Structure Analyses
The SSR genotypic data were organized in a matrix in which 0 and 1 representing absent and present alleles, respectively (9 = missing data). The genetic distances were calculated through the SM coefficient by the Sim Qual procedure in NTSYSpc 2.10 ( Rohlf and Jensen, 1989). The dendrogram was constructed using the unweighted pair-group method with arithmetic mean (UPGMA) clustering and drawn by NTSYSpc 2.10. We also applied Neighbor Joining (NJ) clustering on the dataset, which was implemented in the MEGA5.05 software package. A model-based Bayesian clustering was applied to infer genetic structure and define the number of clusters (gene pools) in the dataset using STUCTURE v.2.3.4 (Pritchard et al., 2000). No prior information was used to define the clusters. Independent runs were done by setting the number of clusters (K) from 1 to 15. Each run was comprised of a burn-in length of 10,000 followed by 100,000 MCMC (Monte Carlo Markov Chain); each replicate at a particular K-value was repeated 20 times. An ad hoc statistic K based on the rate of changes in the log probability of data among successive K-values (Evanno et al., 2005) was calculated through Structure Harvester v.0.9.93 and used to estimate the most likely number of clusters (K). K was calculated using K = m[| L(K+1)-2L(K)+L(K-1)|]/s| L(K)|, where (L(K) is the logarithm of K, s is the standard deviation, and m is the mean.

Association Analysis for Blooming Time in 155 Hemerocallis Accessions
We performed association analysis of blooming time with SSR markers among the 155 accessions of the Hemerocallis collection.
During flowering season, the time of flower opening was continuously monitored real-time using a digital video camera (360 Smart Camera, D606, 360 China). The flowering time of each plant from each accession was recorded. The blooming time of at least five flowers per accession was collected. The blooming time of each accession is provided in Supplementary Table S1. In association analysis, the blooming time was treated a qualitative trait. Accessions with blooming time from 4:00 to 10:00 and 16:00 to 24:00 were assigned "0" and "1, " respectively.
Pairwise linkage disequilibrium (LD) and association analyses were performed using TASSEL4.3.6 (Bradbury et al., 2007;Dent et al., 2012), and calculation of LD and P-values. Association analysis was performed using both the general linear model (GLM, Q model) and the mixed linear model (MLM, Q+K model) in TASSEL. The comparison wise significance was computed using 1,000 permutations as implemented in GLM. The kinship matrix was generated by NTSYSpc 2.10 with option P3D for variance component estimation in MLM (Kang et al., 2008). Significance of marker-trait associations was determined at P ≤ 0.05 (Jaiswal et al., 2012). False discovery rate (FDR) was also used to detect true associations.

Gene Expression in Different Tissues of H. citrina cv. 'Datonghuanghua'
The expression level of blooming time-related gene c33464.graph_c0 in different tissues (fresh roots, flower buds and leaves) of H. citrina cv. 'Datonghuanghua' (H0006) was carried out with quantitative real-time PCR (RT-qPCR) on an ABI prism 7500 Fast Real-time PCR system 147 machine (Thermo Fisher Scientific Inc., Waltham, MA, United States). Sequences of the primers for c33464.graph_c0 were GGCGAATTAGTCTGGAAAGAACTAGG (forward primer) and TGTTATGTTCCTCGTCCGTCCAC (reverse primer). The H. citrina actin gene, HcACT, was used as the reference. The primers sequences for this gene were GAGCAAGGAAATCACGGCACT (forward primer) and GGAACCTCCAATCCAAACACTGTAC (reverse primer). RT-PCR procedure followed Hou et al. (2017). Each sample had three biological and three technical replications.

Transcriptome Sequencing, de novo Assembly, and Unigenes Annotation
After filtering of the raw sequencing data, nearly 56 million clean reads (14.15 Gb) were obtained from root, bud and leaf transcriptomes. Main RNA-Seq statistics for the three tissues are presented in Supplementary Table S2. In all three transcriptomes, >90% reads had Q30 or higher quality scores. Using the Trinity de novo assembly program, the ∼14.15 Gb high-quality reads were assembled into contigs, transcripts and unigenes. Main statistics of the assemblies are shown in Table 1. From 6,390,477 contigs (>25 bp in length), 164,723 transcripts were assembled (mean length 838 bp) representing 92,107 unigenes. The length distribution of these unigenes is illustrated in Figure 3. The transcripts with >500 bp in length accounted for 47.06% of all transcripts. There were 13,724,749 (76.9%), 15,731,521 (79.2%), and 14,841,142 (80.4%) reads from the roots, buds and leaves mapped to the assembly transcripts and unigenes, respectively.
In addition, 23,716 unigenes matched with the GO database were categorized into 51 functional sub-groups of the three main GO groups: cellular component, molecular function, and biological process ( Figure 5). The majority of the unigenes was assigned to biological processes (63,089, 43.7%), followed by cellular components (35,677, 37.2%), and molecular functions (27,719, 19.2%). Under the category of biological processes, metabolic process (15,976, 25.3%) and cellular process (13,749, 21.8%) were predominant. In the cellular component category, cell part (12,837, 23.9%) and cell (12,746, 23.8%) were the most abundant classes. As for the molecular function, catalytic activities (12,157, 43.9%) and binding (11,677, 42.1%) were the top two categories in numbers.
In KEGG pathway analysis of the 41,796 unigenes, 9,375 (22.4%) could be assigned to 117 pathways that belong to five main categories including metabolism (6,783), genetic information processing (3,013), cellular processes (495), environmental information processing (321), and organismal systems (305) (see details in Figure 6). The colchicine content is a very important trait for commercial production of night lily. Among annotated unigenes, 20 were involved in the isoquinoline alkaloids metabolic pathway that is related with colchicine biosynthesis. The category with the largest number of unigenes was metabolism, in which the most represented five pathways were oxidative phosphorylation (ko00190) (336, 4.9%), glycolysis/Gluconeogenesis (ko00010) (325, 4.8%), carbon fixation in photosynthetic organisms (ko00710) (306, 4.5%), purine metabolism (ko00230) (242, 3.6%), and starch and sucrose metabolism (ko00500) (228, 3.4%).  We conducted BUSCO analysis to evaluate completeness of the transcriptome using the near-universal orthologous gene groups. The details are presented in Supplementary Table  S3 and illustrated in Supplementary Figure S1. For example, among the 430 BUSCOs in the viridiplantae_odb10 database, H. citrina cv. 'Datonghuanghua' transcriptome had 388 (90.2%) complete single or duplicated copies, 27 were fragmented, and 15 were missing, which indicated that this transcriptome assembly had adequate coverage and quality to represent the H. citrina transcriptome for various downstream analyses.

Identification and Characterization of EST-SSRs in Datonghuahua Transcriptomes
Using MISA, we mined microsatellite-containing sequences from 13,019 unigenes with >1,000 bp in length. As a result, 3,430 potential SSRs were identified from 2,977 unigenes, 453 of which contained more than one SSR (detailed in Supplementary Table S4). Details of the 2,977 unigenes, associated SSRs, and motif statistics are provided in Supplementary Tables S5, S6, respectively. The sequences of these unigenes in fasta format are provided in Supplementary Table S7. Among those SSRs, those with the tri-nucleotide repeat motifs were the most abundant (54.29%, 1,862), followed by SSRs with dinucleotide (31.75%, 1,089), and tetra-nucleotide (3.24%, 111) motifs. Other types were rare. Among the di-and tri-nucleotide repeats, (GAA/TTC) n , and GA/TC) n or (AG/CT) n were the most common repeat motifs, respectively (Supplementary Table S6).
From the 2,977 SSR-containing unigene sequences, primers were designed for 3,577 SSRs. The primer sequences for each SSR and associated information are provided in Supplementary  Table S8. From the list, we tested 192 experimentally in six accessions. Based on the preliminary data, we finally selected 43 SSRs that had clear, and unique amplicons with expected size. Information for the 43 EST-SSRs is provided in Table 3.

Genetic Diversity of Hemeracallis Germplasm Collection
The forty-three EST-SSRs were used to fingerprint 155 Hemeracallis accessions. A total of 396 alleles were detected  Frontiers in Plant Science | www.frontiersin.org FIGURE 6 | Functional classification of the assembled unigenes based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. 9,375 unigenes are annotated to five categories: metabolism, genetic information processing, cellular processes, environmental information processing and organic systems. There are a large number of pathways of metabolism and genetic information processing (in total 106 pathways) but the picture is limited, so the first five pathways with the largest number of each category are selected for plotting. The value to the right of each bar indicates the number of unigenes annotated in the pathway.
at the 43 marker loci. The number of alleles detected at each locus varied from 2 to 25 with an average of 9.21 per locus. The polymorphism information content (PIC) ranged from 0.907 to 0.259 with an average of 0.606 (Table 4). Mean Shannon's information index (I) was 1.294 (0.477-2.653). Mean observed heterozygosity (Ho) and mean expected heterozygosity (He) was 0.1757 (0.0000-0.5175) and 0.5935 (0.2104-0.9109). These indicated that the 43 SSRs provided rich genetic information (Table 4).
Pairwise genetic similarity coefficients were calculated among all samples which are provided in Supplementary Table  S9. The mean genetic similarity coefficient among these accessions was 0.8642 (range: 0.7828-1.000). The least genetic similarity coefficient was 0.7928 between "Suqian 1-C" and US4 suggesting they are genetically the most distant among these accessions. Meanwhile, the genetic similarity coefficient between 'Datonghuanghua' (H0006) and 'Qiaotouhuanghua' (H0010) was 1.000 indicating they may actually belong to the same accession.
A UPGMA dendrogram of the 155 accessions based on 43 EST-SSR markers was constructed (Supplementary Figure S2).  Table S1).
We also constructed a Neighbor-Joining (NJ) tree for the 155 accessions, which is presented in Figure 7. The groupings were largely the same as the UPGMA dendrogram (Supplementary Figure S2) with all night lily accession in one clade. These data suggested that night lily has narrower genetic base than daylily, and the night lily is likely the result of human selection from daylilies for human consumption during crop evolution.

Population Structure of 155 Hemeracallis Accessions
A model-based Bayesian clustering approach was used to analyze 155 accessions by STRUCTURE. The logarithm of the likelihood [Ln P(D)] on average continued to increase with increasing numbers of assumed subpopulations (K) from 2 to 20 ( Figure 8A). Difference between Ln P(D) values at two successive K-values became non-significant after K = 5. As the ad hoc statistic K preferentially detects the uppermost level of structure and gave the highest value at K = 2 ( Figure 8B; Lia et al., 2009;Emanuelli et al., 2013). Thus, K = 2 was considered as the most probable prediction for the number of subpopulations. Indeed, in the first round of structure analysis, the 155 accessions were split into two major clusters. Group I contained 53 accessions with an average Q value of 0.9867 (86.79% of them with Q > 0.98), and Group II contained 102 accessions with an average Q value of 0.9213 (78.43% of them with Q > 0.90) (Figure 8C and Supplementary Table S1). Group I consisted of 53 accessions, all of which were blooming at night (night lily group). Groups II consisted of 102 accessions, 97 accessions of them were blooming in the morning All number are the number of alleles detected by each marker among 155 accessions. PIC = polymorphism information content for each marker. I = Shannon's Information index (Lewontin 1972). Ho is observed heterozygosity, and He is expected heterozygosity that was computed using Levene 1949. Table S1). Some accessions were admixtures of the two groups.

Association Analysis of Blooming Time Among 155 Accessions
Blooming time is a very important horticultural trait for both daylily and night lily varieties. We recorded blooming time of all 155 accessions, and the data were presented in Supplementary   Table S1. Among them, 96 bloomed (almost all daylily) in the morning (7-10 a.m.), and the rest (almost night lily) in the later afternoon or evening (4-11:30 p.m.). To confirm the utility of the EST-SSR markers developed in the present study, we conducted association analysis of blooming time using these markers. We first examined linkage disequilibrium (LD) in the Hemerocallis population. Of the 903 pairwise combinations generated for 43 the EST-SSR loci, 786 (87.0%) showed LD (Supplementary Table S10 and Supplementary Figure S3). At P ≤ 0.001, 74 pairs (8.2%) had mean r 2 of 0.4266 indicating strong LD. At P ≤ 0.001, the standardized disequilibrium coefficients (D') was 0.8435 which was close to 1.0 suggesting low chance of recombination between pairs of loci, which may be caused by the long-term asexual reproduction among the Hemerocallis collection. Association analysis was performed using both the general linear model (GLM, Q model) and the mixed linear model (MLM, Q+K model) in TASSEL. The results are presented in Table 5, and illustrated in Figure 9. In the GLM model, 12 markers showed significant association with blooming time, 8 of which had extremely high significance (P ≤ 0.01; mean r 2 = 0.0459) (Points above the blue line in Figure 9c). Similarly, under the MLM model, 10 markers were significantly associated with blooming time and 5 had extremely significant association (P ≤ 0.01; mean r 2 = 0.0560) (Points above the blue line in Figure 9d). The 10 markers with significant associations with the blooming time in both GLM and MLM were SAU00008, SAU00045, SAU00047, SAU00045, SAU00063, SAU00064, SAU00109, SAU00121, SAU00143, and SAU00150. Among them, SAU00109 had the strongest association (P << 0.001 in MLM model) that could explain 8.8% phenotypic variance (8.0% in GLM) ( Table 5). The predicted functions of the 12 unigenes associated with these SSRs are provided in Table 5. Both SAU00064 and SAU00109 were associated with the same unigenes c22731.graph_c0, which seems to encode a eukaryotic translation initiation factor 2D. Another unigene, c33464.graph_c0, associated with marker SAU00063 was annotated to encode a LHY-like protein, which is known to be regulated by circadian clock. The expression of c33464.graph_c0 in flower buds was significantly higher than that of roots and leaves in both RNA-Seq and RT-qPCR experiments (Figure 10). The unigene c47752.graph_c0 associated with SAU00008 was predicted to encode phosphatase 2C 35/65, and its expression level in roots was significantly different from that in flower buds and leaves.

DISCUSSION
In this study, 92,107 Hemerocallis transcriptomic unigenes were assembled using the Illumina Hi-Seq 2500 platform ( Table 1). The N50 length of these unigenes was 908 bp, and the average length was 575 bp. These results were comparable to what obtained in other recent studies in species of the Liliaceae family, such as Lilium regale (N50 = 920 bp, average length = 682 bp) (Cui et al., 2018). The number of unigenes and contained SSRs obtained from the present study (92,107 unigenes with   'Datonghuanghua'. The results show the mean ± SD (error bars) and were generated from three biological replicates. ** mean significantly difference, P ≤ 0.01. HcACT was used as an internal control for RT-qPCR.
3,430 SSRs) was also close the number identified in other three Liliaceae species (L. formolong, L. longiflorum, and L. longiflorum) (average unigenes = 72,256) (Biswas et al., 2018). These results suggested that closely related species share similar gene contents and distribution of SSR sequences in their genomes.
Fingerprinting is among the most popular uses of molecular markers. In Hemerocallis, hundreds of varieties have been released. Many varieties have the same names but may be morphologically different, or have the same appearances but with different names. Marker-based fingerprinting can help resolve these issues. In the study, the two landraces 'Malinhuanghua (H0029)' and 'Malinhuanghua2 (H0125)' had the same name (in Chinese), but have different geographic origins. Clustering analysis indicated that they were located in different clades (Figure 7 and Supplementary Figure S2). Therefore, these two accessions do not seem to share any common parent, but may have the same name by chance. The three accessions, 'Taiguxuancao (H0038), ' 'Stella de Oro 2 (H0077), ' and 'Stella de Oro 3 (H0139)' may have a similar situation. In addition, three accessions, H. minor Mill. (H0007) and H. minor Mill.2 (H2802) were all introduced from Qingyang, Gansu Province, China, but clustering analysis did not support their close relatedness. These apparent mis-identifications were probably because the original names of these foreign varieties were changed right after their introduction. This is also likely true for many accessions that were introduced from one place to another place inside China. Work in the present study presents a good example to show the power of molecular markers in variety identification and protection.
Both clustering analysis and STRUCTURE analysis place ∼50 accessions into one group (Figure 7 and Supplementary Figure S2) that bloom in the evening, while the rest of ∼100 accession all bloom in the morning (Supplementary Table S1).
Marker-based phylogenetic analysis revealed that the night lily accessions were more genetically related, which seems to be consistent with their geographical distribution. Accessions in the daylily group were collected from more geographically diverse locations (Supplementary Table S1). Based on our multiple observations (Ji et al., 2018), accessions in the daylily group are also morphologically more diverse than those in the night lily group. For example, the colors of petals and sepals of daylilies may vary from yellow, yellow-orange, orange, orangered, red, purple-red to purple. For flower size, the width variation coefficient was high in petals and sepals. Among them, the orange petal width variation coefficient was the highest of 42.15% and the average width was 35.90 mm (13.58-75.56 mm). Similarly, the variation coefficient of the orange sepal width was the highest at 38.74%, and the average width was 23.40 mm (13.20∼48.34 mm). The difference between stalk height and leaf width were highly significant. The inflorescences were rich in morphological variation ranging from mini inflorescence, extremely short branches, to large branched inflorescence, with varying amounts of blossoms. On the other hand, the 55 accessions in the night lily group (branches in red in Figure 7 and Supplementary Figure S2 were collected from nine provinces of China, all of which were diploids and are edible. For these accessions, both petals and sepals were yellow while the width of petals and sepals were very small, with the mean values 17.46 and 13.77 mm, respectively. As compared with accessions in the daylily group, the stalk height (main stem) of night lily group was in general higher (129.08 cm), the leaf was narrower (3.16 cm), and the average number of blossoms was 22.10. Fifty-three accessions were blooming at night, except for 'Panlonghua' (H0004) and 'Panlonghua 2' (H0111) that bloomed in the morning. The results from both marker analysis (Figure 7 and Supplementary Figure S2) and morphological observations indicate that the night lily group may have narrower genetic base than the daylily group accessions in this collection. This seems reasonable because the daylily group accessions had a more diverse geographic origin (Supplementary Table S1). Also, many of the night lily accessions may share common parents during their breeding and selection. From the crop evolution perspective, night lily was probably selected from the genetically more diverse daylily gene pool for vegetable use. During this process, blooming at night may be a main target of selection which is critical for vegetable use to collect the tender unopened flower bud during day time.
A few accessions were placed in different clades when different programs were used in clustering analysis, for example, 'Panlonghua' and 'Panlonghua 2' were clustered into night lily group by NTSYS and MEGA, which were in daylily group by STRUCTURE (Supplementary Table S1, Figures 7, 8, and Supplementary Figure S2). 'Panlonghua' and 'Panlonghua 2' were edible landraces, which were probably the hybrids between the night lily and the daylily. They displayed characteristics of both parents including morning blooming and the narrow perianth traits. 'Stella de Oro 2' (H0077), 'Xue Qiu Hong' (H0092), 'Beijing 6' (H0101), and 'Little Bumble Bee' (H0141) were daylily varieties that blossom at night, but they were clustered together with other daylily accessions that bloomed during the day. These observations suggest gene flow between the two cultivated groups during the long-term breeding and selection. Hermerocallis varieties are often recognized as either daylily or night lily based on their blooming time. Probably only a few genes with strong phenotypic effects are underlying the blooming time (Masahiro et al., 2006). In this study, we identified 12 SSRs with strong association with blooming time ( Table 5). The SSR marker SAU00063 with the strongest association is in the unigene (c33464.graph_c0) that was annotated to encode a LHYlike protein ( Table 5). The LHY (Late Elongated Hypocotyl) gene has been shown to regulate flower timing in a number of plants (e.g., Ramos-Sanchez et al., 2019;Zhang et al., 2019). Our work in the present study provides a foundation for gene discovery of flower timing in Hermerocallis. However, additional work is needed to validate this result.

CONCLUSION
This is the first report on the Hemerocallis transcriptome, and a large number of EST-SSR markers have been developed from this transcriptome, which will provide an excellent resource for researchers and breeders focusing on Hemerocallis. More importantly, our results showed that this strategy using highthroughput sequencing technology is feasible and extremely convenient and efficient in the genus Hemerocallis. The genes associated with circadian rhythm provide germplasm resources and genes basis for further study on flowering rhythm of this genus.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in NCBI under accession numbers SRR11610941, SRR11610942, and SRR11610943 (Submission ID: SUB7327030; BioProject ID: PRJNA628147).

AUTHOR CONTRIBUTIONS
SL designed and conducted the whole research. FJ collected the phenotype data and conducted the GWAS analysis. FH collected some accessions and developed the EST-SSR. HC analyzed the transcriptome data. QS ran the PAGE gel. GX and XK kept the Hemerocallis germplasm. YW advised the project and revised the manuscript.