Constructing a Core Collection of the Medicinal Plant Angelica biserrata Using Genetic and Metabolic Data

Angelica biserrata is an important medicinal plant in Chinese traditional medicine. Its roots, which are known as Duhuo in Chinese, are broadly applied to treat inflammation, arthritis, and headache. With increasing market demand, the wild resources of A. biserrata have been overexploited, and conservation, assessment of genetic resources and breeding for this species is needed. Here, we sequenced the transcriptome of A. biserrata and developed simple sequence repeat (SSR) markers from it to construct a core collection based on 208 samples collected from Changyang-related regions. A total of 132 alleles were obtained for 17 SSR loci used with the polymorphic information content (PIC) ranging from 0.44 to 0.83. Abundant genetic diversity was inferred by Shannon’s information index (1.51), observed (0.57) and expected heterozygosity (0.72). The clustering analysis resulted into two sample groups and analysis of molecular variance (AMOVA) showed only 6% genetic variation existed among populations. A further metabolic analysis of these samples revealed the main coumarin contents, such as osthole and columbianadin. According to the genetic and metabolic data, we adopted the least distance stepwise sampling strategy to construct seven preliminary core collections, of which the 20CC collection, which possessed 42 A. biserrata individuals accounting for 90.20% of the genetic diversity of the original germplasm, represented the best core collection. This study will contribute to the conservation and management of A. biserrata wild germplasm resources and provide a material basis for future selection and breeding of this medicinal plant.


INTRODUCTION
Angelica biserrata (R. H. Shan & C. Q. Yuan) C. Q. Yuan & R. H. Shan (Umbelliferae) is a medicinal herb widely used in traditional Chinese medicine (TCM). The medicinal properties of A. biserrata are mainly derived from its roots, which are known as Duhuo in Chinese and have special aromatic smell when processed. Duhuo is broadly applied in TCM to treat inflammation, arthritis, and headache due to its diverse functions of winddispelling, dampness-eliminating, cold-dispersing, and painrelief (Chen et al., 1995;Li et al., 2017;Ma J. H. et al., 2019). Moreover, the aromatic contents of A. biserrata are also useful for both medicinal and industrial applications (Zia-Ul-Haq et al., 2014;Senkal et al., 2019). Over 500 compounds have been identified in A. biserrata, of which the coumarins osthole and columbianadin, and volatile oils are the main active ingredients (Liu et al., 1995;Yang et al., 2015;Markus et al., 2017;Chen et al., 2018). Osthole and columbianadin are also the index compounds in the 2020 Chinese Pharmacopoeia for quality evaluation of Duhuo. These compounds reveal diverse pharmacological effects, including anti-tumor, antiinflammatory, antioxidant, anti-bacterial, immunomodulatory, sedative, and analgesic effects (Chen et al., 1995;Ma J. H. et al., 2019). Besides its medicinal use, Duhuo has been applied in plant protection and cosmetics. Some traditional Chinese wines and teas are made using Duhuo as the main raw material (Feng et al., 2018). Alternatively, Duhuo is used to produce plant-based pesticides because its ethanol extract has significant inhibitory effects on some bacteria (Feng et al., 2018).
Angelica biserrata naturally occurs in south-central and southeast China, with a central distribution area within and around the Wuling Mountains, including the joining region of Hubei, Chongqing, Hunan, and Guizhou Provinces (Zaugg et al., 2011;Guo et al., 2018). Changyang, a county in Hubei Province, is a popular planting area for A. biserrata. The Ziqiu town of Changyang is historically known for its role as a local market and circulation terminal for Duhuo, which is collected and transported from the surrounding wild and planted areas. Given the good quality and yield of Duhuo within and around the Changyang region, "ZiqiuDuhuo" has been developed as a Geographical Indication Product for protection. The wild resource of A. biserrata within the Wuling Mountains is rich but has been greatly reduced recently due to severe excavation by local people to meet the increasing market demand. Artificial cultivation of A. biserrata has been rapidly developed but the medicinal quality of cultivated Duhuo is much poorer than that of wild plants. Moreover, as a member of the Umbelliferae plants, the mating system of A. biserrata are thought to be outcrossing by wind or insect pollinations (Zych et al., 2019), while self-pollination in field are less successfully (data not shown). Therefore, a comprehensive genetic assessment of the wild and cultivated germplasms is needed for conservation and breeding of this important medicinal herb.
Diverse germplasm resources are critical for crop breeding and improvements (Kumar et al., 2016). However, the conservation and management of genetic resources for most crops are tedious and costly. Establishing core collections provides a convenient solution to conserve and effectively utilize genetic resources (Yuan et al., 2010). The concept of "core collection" was first proposed by Frankel and Brown (1984), and further supplemented and broadened by Brown (1989). A core collection is a subset of germplasm resources that fully represents the genetic diversity of the entire population with the fewest samples (Guzmán et al., 2017;Liu et al., 2019). The core collection should have a sufficiently small sample size, but preserve overall genetic variation. Genetic diversity is the basis of biological diversity which determines the ability of a species to adapt to the external environment and evolve (Ndjiondjop et al., 2017). A lower genetic diversity of a species means less genetic variation and adaptability, which threatens its long-term survival (Linde and Selmes, 2012;Rodríguez-Nevado et al., 2018). Genetic diversity is also an important factor affecting synthesis of secondary metabolites in medicinal plants (Yuan et al., 2010). Therefore, analyzing the genetic diversity of plant populations is a critical step to construct a core collection.
Crops, such as soybean, rice, wheat, peanut, and cucumber Zhang et al., 2011;Kaga et al., 2012;Kobayashi et al., 2016) have been successfully constructed core collections. In contrast, core collections of medicinal plants are still limited although a few have been reported, such as those of Scutellaria baicalensis (Bai et al., 2010), Glycyrrhiza species (Liu et al., 2020), and Dalbergia odorifera (Liu et al., 2019). The results of these studies have consistently revealed the efficiency of core collections for conservation and management of germplasm resources, and to provide material basis for breeding (Campoy et al., 2016;Jeong et al., 2019;Miyatake et al., 2019). Generally, constructing a core collection use both phenotypic trait and molecular marker data. Phenotypic traits are simple and intuitive but unstable and vulnerable to environmental influences, while molecular markers are stable and reliable and more truly reflect genetic differences between germplasms (Xu et al., 2016;Chen et al., 2017;Guo et al., 2017). The metabolic content of medicinal plants is the most important phenotypic trait when constructing a core collection, and therefore is slightly different for that of other traditional crops.
Here, we combined genetic and metabolic analyses of A. biserrata samples collected from areas within and around the Changyang region to construct core collection. Due to the limited genomic resources of this medicinal plant, we first conducted RNA-sequencing of a representative sample of A. biserrata to develop simple sequence repeat (SSR) markers based on the assembled transcripts to investigate the overall genetic variation of the samples. Moreover, six main kinds of coumarins previously reported in Duhuo (Pan et al., 1987;Liu et al., 1995;Yang et al., 2008) were characterized in all corresponding samples used in the genetic analysis. A core collection was finally prepared using data from a least distance stepwise sampling strategy (LDSS) . Our study will promote the protection and management of the extant A. biserrata germplasm resource, and provide core materials for future selection and breeding of this plant. Our method of constructing a core collection will also serve as a reference for other medicinal plants in terms of germplasm conservation and utilization.

Plant Materials and RNA Extraction
A total of 208 wild A. biserrata samples were collected from three different locations of Puling, Gaofeng, and Langping (Supplementary Table 1) in the Changyang region and its adjoining Wufeng country, Hubei Province. These counties have the most concentrated distribution of wild A. biserrata in China. Four tissues, including the roots, stems, flowers, and leaves were selected from a representative A. biserrata individual, quickly frozen in liquid nitrogen, and stored at −80 • C until the RNA-sequencing analysis. Moreover, at least 10 young leaves collected from each individual were dried in silica gel and ground into powder in liquid nitrogen to extract genomic DNA as a template for polymerase chain reaction (PCR) amplification. The roots collected from each plant were sampled and washed with ultrapure water and dried to the no-moisture state in an oven at 50 • C. The dried root samples were pulverized into powder with an automatic sample rapid grinding instrument and passed through a 65-mesh sieve to determine the metabolic components.
Total RNA was extracted separately from roots, stems, flowers, and leaves using TRIzol reagent (Thermo Fisher Scientific, Waltham, MA, United States) according to the manufacturer's instructions and mixed equally to build up the RNA libraries accordingly. The purity, content, and integrity of total RNA were checked with the NanoPhotometer R spectrophotometer (Implen, Westlake Village, CA, United States), a Qubit R Flurometer (Life Technologies, Carlsbad, CA, United States), and the Agilent Bioanalyzer 2100 system (Agilent Technologies, Palo Alto, CA, United States), respectively.

Transcriptome Sequencing, Assembly and Annotation
For RNA-sequencing, oligo (dT) magnetic beads were used to enrich the eukaryotic mRNA. The first-strand of cDNA was synthesized using random hexamers and dNTPs, DNA polymerase I, and RNase H were then added to synthesize the second-strand, which was further purified using AMPure XP beads. cDNA libraries were prepared according to the manufacturer's protocol, and sequencing was performed using the Illumina Hiseq 4000 sequencing platform (Illumina, Inc., San Diego, CA, United States). To obtain high-quality RNA-Seq data, the raw sequencing reads were filtered by deleting the invalid reads with adaptor contaminations, reads containing polyA, reads with "N" bases ("N" means that the base information cannot be determined) at a ratio > 5%, and reads with a mass ratio < 5 and a base ratio > 50%.
Trinity software 1 with default parameters was used to de novo assemble the clean data from four plant tissues to generate the contigs and singletons. The functional annotations of the assembled unigenes were performed through BLASTX analyses on the corresponding NCBI non-redundant (Nr) protein 2 and SWISS-PROT protein 3 databases with the best similar hit of an E-value of 1e − 05. The classification of protein function was searched in the Gene Ontology (GO) database 4 and the Clusters of Orthologous Groups (COG) database 5 . The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database 6 was used to determine the biological pathways of unigenes.

DNA Extraction and SSR Mining
Genomic DNA was extracted by the traditional CTAB method with minor modifications. The quality and purity of the isolated DNA samples were examined by 1% agarose gel electrophoresis and the concentration of each sample was determined using a NanoDrop 2000 spectrophotometer (Ouyang et al., 2018). The final concentration of each DNA sample was diluted to approximately 50 ng/µl for subsequent PCR amplification experiments.
The forward and reverse primers were synthesized commercially, in which the forward primer was added with an M13 tail sequence (GTAAAACGACGGCCAGT) labeling with FAM (blue), HEX (green), and ROX (red). The genomic DNA was PCR amplified in a 10 µl solution whose reaction system contained: 2 µl genomic DNA, 5 µl 2 × Taq PCR MasterMix, 0.04 µl forward primer, 0.25 µl reverse primer, 0.15 µl M13-FAM/M13-HEX/M13-ROX, and 2.6 µl ultrapure water. The mixed PCR amplifications were performed in a BiometraTone 96G (Analytik Jena AG, Jena, Germany) with a PCR amplification cycle of 94 • C pre-denaturation for 5 min, followed by 35 cycles of 94 • C for 30 s, 54 • C annealing for 30 s, 72 • C extension for 30 s, and a 72 • C extension for 10 min. The PCR products were detected by automatic fluorescence using an ABI 3730XL Sequence Analyzer. GeneMapper3.0 software was used to analyze the amplified fragment size of the different samples at each SSR locus.

High Performance Liquid Chromatography (HPLC) Analysis of the Metabolic Contents
A 0.5 g portion of plant tissue (root) powder was accurately weighed and ultrasound extracted with 20 ml of 50% methanoldichloromethane solution (v/v) for 30 min at room temperature. Extraction solvent was added, and the extract was shaken and filtered through a 0.45 µm PTFE filter. Standards of umbelliferone (L/N: A04A6L1), 8-methoxypsoralen (L/N: P30J7M16998), bergapten (L/N: W30M9Z57633), columbianetin acetate (L/N: P08F9F54518), osthole (L/N: T08M8B30733), and columbianadin (L/N: P13J9F65515) were purchased from Shanghai Yuanye Bio-Technology Co., Ltd. (Shanghai, China) and were dissolved in 50% methanol-dichloromethane to make stock solutions, which were mixed and diluted with methanol to prepare final standard solutions at a concentration of 100 µg.mL −1 for the HPLC analysis.
The chromatographic analysis was performed using an Ultimate XB-C 18 column (250 mm × 4.6 mm, 5 µm; Shanghai Welch Technology Co., Ltd., Shanghai, China) on a Shimadzu LC-20AD HPLC system equipped with SPD-20A detector and CTO-20A thermostatic column compartment (Shimadzu, Kyoto, Japan). The binary gradient elution system consisted of methanol (solvent A) and water (solvent B), and the linear elution procedures were: 0-4 min, 40% A; 4-8 min, 40-50% A; 8-12 min, 50% A; 12-24 min, 50-55% A; 24-28 min, 55% A; 28-32 min, 55-70% A; 32-52 min, 70% A. Column temperature was set to 25 • C at a flow rate of 1.0 mL/min, and injection volume was 20 µL. The UV detection wavelength was set to 330 nm. A linear regression analysis was performed with the concentrations of the six coumarin standard solutions as the abscissa (x) and the peak area as the ordinate (y). The limits of detection (LODs) and limits of quantification (LOQs) were determined at signal-to-noise ratios of 3 and 10, respectively.

Statistical Analysis
POPGENE version 1.3.2 (Yeh et al., 1997) was used to calculate the genetic diversity parameters, including allele frequency, number of alleles (Na), effective number of alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), Shannon's information index (I), and Nei' s gene diversity index (H). Polymorphism information content (PIC) was estimated using PIC Cale software (Nagy et al., 2012). NTSYS-pc software version 2.1 (Rohlf, 2000) was used to perform the cluster analysis using the unweighted pair group method with the arithmetic method (UPGMA) based on similarity matrices calculated according to the simple matching coefficient. GenAlEx version 6.5 (Peakall and Smouse, 2012)was used to perform the principal coordinates analysis (PCoA) based on the Nei's genetic distance. The population structure was determined using STUCTURE version 2.0 (Pritchard et al., 2000) and further hierarchical analyses of molecular variance (AMOVA) was carried out using GenAlEx.

Construction of Core Collections
Previous studies have shown that the size of a core collection generally accounts for 5-30% of the original population samples (Brown, 1989;Ortiz et al., 1998;Hu et al., 2000). According to the sample size and the level of genetic diversity of A. biserrata in our study, we selected 10-40% of the original population samples as the core sample set. The LDSS method  was used to identify a group with the smallest genetic distance, and delete the sample with the lowest coumarin content. The seven preliminary core collections with different sampling ratios accounted for 40%, 35%, 30%, 25%, 20%, 15%, and 10% of the original germplasm and were retained by repeated clustering. The retention degree of genetic diversity of these preliminary core collections was manually checked by determining their corresponding genetic diversity indices (Na, Ne, Ho, He, H, I, and PIC). The average content of each coumarin and the total coumarins in the seven preliminary core collections were also calculated using GraphPad Prism version 6.0 software (Lazareno, 1994). The core collection was screened for the retention degree of genetic diversity and the coumarin compound contents.
To investigate any significant differences between the core collections and the original samples, SPSS version 24.0 (IBM Crop, 2012) was used to perform a t-test on three main genetic parameters (Ne, H, and I) of them. Further comparative analysis of genetic variation between the original germplasm and the core collection or the removed samples (The remainder of the original germplasm after removing the core collection) was also conducted. A PCoA analysis for the original and core germplasms to verify the sample distributions. The difference in coumarin contents between the core collection and original germplasm was assessed by comparing single coumarin and total coumarin contents using GraphPad Prism.

Transcriptome Assembly and Annotation of A. biserrata
After filtering, a total of 5.9 Gb clean reads were retained and used for the de novo assembly, leading to 39,748 unigenes were finally obtained with a N50 sequence length of 1,636 bp, and the percentage of GC was 40.36%. The benchmarking universal single-copy orthologs (BUSCO) analysis based on plant gene models showed that 73.71% of the BUSCO sequences were completely present in the A. biserrata transcripts (Supplementary Table 2). About 70. 42% (27,989) of the unigenes were annotated in five public databases, in which 27,385 unigenes (68.90%) matched in the Nr database and 25,780 (64.86%) can be annotated in the Swissprot database ( Figure 1A). Furthermore, a total of 19,193 unigenes (48.29%) could be classified into 57 functional groups categorized in three major GO categories based on sequence homology ( Figure 1B). Amongst, biosynthetic process and cellular nitrogen compound metabolic process, nucleus, ion binding are the most abundant GO terms ( Figure 1B). We also found that a total of 8,371 unigenes were clustered into 24 function categories in the COG database, and signal transduction mechanisms (473) was the major category, followed by translation, ribosomal structure and biogenesis (354), transcription (311), posttranslational modification, protein turnover, chaperones (279) and Carbohydrate transport and metabolism (211) (Figure 1C). The annotated unigenes of A. biserrata were also mapped into 28 KEGG pathways (Figure 1D), of which metabolism was the most represented category. Twelve metabolic pathways were included in the category and the carbohydrate metabolism was the most dominant pathway (Figure 1D).

Characterization of the Transcriptome Derived SSRs
For the assembled A. biserrata transcripts, 6,651 sequences contained SSRs and 1,343 sequences contained more than one SSR loci. A total of 8,371 SSR loci were detected with a frequency of 21.06% and an average of one locus per 4.9 kb. Among all SSRs, 7,566 (90.38%) were simple repeat motifs and 805 (9.62%) presented in compound formation (Supplementary Table 3). The repeat types of these SSRs were diverse, of which mononucleotide (3,045, 36.38%) and dinucleotide (3,713, 44.35%) repeats were the major types, followed by trinucleotides (1,472, 17.58%). The total proportion of tetranucleotides (1.03%), pentanucleotides (0.3%), and hexanucleotides (0.36%) was <2% (Figure 2A and Supplementary Table 4). The number of motif repetitions for different SSR types varied broadly, as a whole with a maximum frequency repeat for six repetitions (19.08%), followed by ten (17.86%), seven (10.80%), five (10.74%), and eleven (9.35%) repetitions. Among the three SSR types with the highest frequency, the primary numbers of repetitions of mononucleotides, dinucleotides, and trinucleotides were ten, six and five, respectively ( Figure 2B and Supplementary Table 5). Moreover, the A/T repeat type was predominant (96.95%) among mononucleotides, and AG/CT (61.24%) was the primary repeat type among dinucleotides, while AAG/CTT accounted for 23.23% of trinucleotides (Supplementary Table 6).

SSR Markers Validated for the Genetic Analysis
We randomly selected 146 pairs of SSR primers for primary screening by 1% agarose gel electrophoresis, among which 42 pairs of primers failed to amplify. The remaining primers were further screened using the fluorescent capillary electrophoresis detection method, and 17 pairs of SSR primers with high amplification efficiency, good reproducibility, and high polymorphism were retained for the overall genetic analysis (Supplementary Table 7).
A total of 132 alleles were obtained for all 17 SSR loci among all population samples, and the number of alleles (Na) varied from 4 (Abssr107, Abssr128) to 15 (Abssr39) with an average value of 7.76 alleles per locus (Supplementary Table 8). The Abssr39 locus revealed the most abundant genetic variation among the 17 loci investigated, followed by the Abssr33 locus. Both loci had high identification efficiency and therefore are valued for applications to identify A. biserrata samples/taxon  and closely related species. We further compared the genetic diversity of samples from the three different sampling sites.
Although 113 alleles were observed in the Puling population, the overall genetic diversity of the Langping population was the highest among the three sampling sites, suggesting richer genetic variation presented in the A. biserrata samples from Langping (Table 1).

Population Structure of the A. biserrata Germplasm
The genetic correlation coefficients between the 208 A. biserrata samples varied from 0.72 to 0.94, suggesting a similarly genetic background of them. The UPGMA cluster analysis revealed two groups of samples, of which one group included the materials collected from the Puling and Gaofeng locations, while the other group included samples mainly from the Langping (Figure 3A). The PCoA plot showed a similar result ( Figure 3B). The first principal coordinate of the PCoA analysis explained 27.44% of the variation, while the second principal coordinate accounted for 17.27% of the variation. The vast majority of the Puling and Gaofeng germplasms were concentrated together in the PCoA plat, while those from Langping were concentrated in another cluster ( Figure 3B). In the STRUCTURE analysis, the most appropriate population number for the data set was K = 2 (Supplementary Figure 1), suggesting two main genetic groups present for the A. biserrata samples investigated. Similarly, the germplasms from the Puling and Gaofeng populations grouped into one genetic group, while those mainly from the Langping population group in the other ( Figure 3C).
The AMOVA analysis showed that the vast majority of genetic variation was presented within populations, while only 6% of the genetic variation occurred among populations (Supplementary Table 9). The highest level of pairwise F st was 0.083, which appeared between the Gaofeng and Langping populations (Supplementary Table 10).

HPLC Analysis of the Metabolic Contents in Duhuo
The results of the regression equations and the LODs and LOQs for the six coumarin standards are shown in Supplementary Table 11. Precision was evaluated by the relative standard deviation (RSD) values of the peak area of the standard solutions, which were 1.10-1.70%. The RSD repeatability values were 1.39-2.44%, which were indirectly assessed by detecting six samples on the same day. RSD values of stability evaluated in the same sample solution at different time points (0, 2, 4, 6, 8 and 12 h) were <1.89%. A recovery experiment was conducted in which the standards were added to sample powder with known content to evaluate accuracy of the method, and percent recovery was 96.50-102.41%. The peak times of the target peaks on the chromatogram were appropriate with good resolution and there was no negative interference [50% methanol-dichloromethane solution (v/v) as a negative control] (Supplementary Figure 2). Among the six coumarins investigated in the samples, columbianadin was the most abundant compound (up to 8.27 mg/g in samples), followed by osthole (average 1.17 mg/g), columbianetin acetate (average 0.56 mg/g), 8-methoxypsoralen (average 0.55 mg/g), and bergapten (average 0.20 mg/g). In addition, osthole and columbianadin contents revealed clear differences among Duhuo samples (Supplementary Table 12). Among the 208 Duhuo samples collected from three sampling locations, most samples from the Puling and Gaofeng population had higher columbianadin content, in particular the Puling049 sample revealed the highest columbianadin content (18.58 mg/g). In contrast, Duhuo samples from the Langping location had higher osthole content but relatively lower columbianadin content (Supplementary Table 12).

Construction of the Core Collection
An LDSS sampling strategy was adopted to obtain seven preliminary core collections named 40CCC, 35CC, 30CC, 25CC, 20CC, 15CC, and 10CC. Each collection included 83,73,63,52,42,31, and 21 accessions, accounting for 40%, 35%, 30%, 25%, 20%, 15%, and 10% of the original germplasm, respectively ( Table 2). As the sampling ratio decreased, the number of alleles (Na) and the retention rates of the alleles (Ra) decreased gradually. Na ranged from 7.41 (40CC) to 6.65 (10CC) in different collections. Neither 15CC nor 10CC reached 90% of the original germplasm for Ra, and others varied from 90.20% (20CC) to 95.50% (40CC) of the original germplasm. The genetic parameters of the seven preliminary core collections were all higher than those estimated for the original germplasm. In particular, the genetic diversity estimates, including all genetic parameters investigated except Ne, were higher in 20CC than the others, suggesting better representativeness by the 20CC set than the others ( Table 2).
As the sampling ratio decreased, the average contents of the six coumarins increased gradually (1.81-2.46 mg/g), which may be the reason why individuals with lower total coumarin content were removed during the clustering process. However, when looking at the content of a single coumarin component, umbelliferone content was essentially unchanged, which may have been caused by the low content itself (Figure 4 and Supplementary Table 13). According to the genetic data obtained here, 20CC had the richest genetic diversity among all of the preliminary core collections and retained the maximum genetic diversity of the original germplasm. The average total coumarin contents of 10CC reached the maximum value, followed by 15CC and 20CC. However, the retention rates (Ra) of 10CC and 15CC alleles did not reach 90%, which means that the alleles of the original germplasm were not effectively retained, and genetic diversity was not as rich as that of 20CC. In this study, we simultaneously focused on genetic diversity and the metabolic ingredient contents to select the best core collection. Not only the 20CC collection preserved the overall genetic diversity of the original germplasm to the greatest extent, but also the average content of total coumarins increased from 1.81 to 2.20 mg/g compared to the original germplasm. As a consequence, 20CC composed of 42 accessions was the first choice for the best core collection, in which approximately 50% of the germplasm came from the Langping location ( Figure 5).

Quality Evaluation of the Constructed Core Collection
The main genetic parameters, including Ne, H, and I were not significantly different between the core collection and the original germplasm (Supplementary Table 14), indicating no significant genetic difference between them. The retention ratio of the core collection to the original germplasm was greater than 100% for each genetic parameter evaluated, except Na (Supplementary Table 15). The seven genetic diversity parameters of the removal samples were all lower than the original germplasm, which means that these removal samples can lead to an increase in the genetic duplication rate of the population (Supplementary Table 15). Moreover, it could be seen from the PCoA diagram that the materials in the core collection were not only similar to the distribution of the original germplasm, but also had a relatively comprehensive and uniform coverage, which represented the geometric distribution of the original germplasm, indicating that the genetic structure of the original germplasm was well maintained by the core collection and further confirming the representativeness of the selected A. biserrata core collection ( Figure 3B).

SSR Development of the A. biserrata Transcriptome
Microsatellite tandem repeats are widely distributed throughout plant genome and SSR markers derived from these repeats are the most frequently used molecular tool for evaluating plant genetic diversity and population structure (Filippi et al., 2015), for constructing genetic maps (Wu et al., 2014), and for marker-assisted breeding (Kumar et al., 2015). SSRs generally consist of one to six nucleotide basic units repeated multiple times (Feng et al., 2016;Vieira et al., 2016), and are codominance with good repeatability and high polymorphism Ouyang et al., 2018). However, developing SSR markers is relatively difficult in non-model plants without genomic resources (Wang et al., 2019). Most medicinal plants are non-model organisms and their genetic information is scarce, causing a lag in molecular biology and genetic breeding researches (Xin et al., 2017). Given the cost of RNA-seq have fallen dramatically in recent years, transcriptome sequencing therefore provided a convenient and economical way for quickly obtaining SSR markers in non-model plants. Moreover, most SSR markers derived from transcriptome are directly or indirectly related to functional loci. Compared to those SSRs derived from the neutral parts of a plant genome, these transcriptome derived SSRs are potentially more useful, in particular for constructing FIGURE 4 | The average content of individual coumarin, average content of total coumarins of the original germplasm and seven preliminary core collections. 100% represents the original germplasm, 40%∼10% represent seven preliminary core collections with different sampling ratios. X-axis represents different species of coumarins, Y -axis represents content (mg/g).
crop core collections in which selection in relation to focused phenotypic traits is the primary purpose. Currently, transcriptome derived SSR markers have been widely applied in population genetic analyses of medicinal plants, such as Picria felterrae (Yan et al., 2019), Gynostemma pentaphyllum , and Perilla frutescens (Sa et al., 2018). Researches on constructing core collections of medicinal plants using transcriptome derived SSRs are also quickly increased in recent years. An example came from the successful development of a core collection of Dalbergia odorifera, in which 19 microsatellite markers developed from its transcriptome were used (Liu et al., 2019). For A. biserrata, no SSR markers have been previously reported and the genomic resources of this species is also limited. To our knowledge, this study represented the first report for the genome information (transcriptome) of A. biserrata. The transcriptome information and the developed SSR markers presented here therefore provide important molecular resources for future genetic analysis of A. biserrata germplasm.

Population Genetic Variation of A. biserrata
Previous studies of A. biserrata mainly focused on its chemical and pharmacological aspects, while there are few reports in relation to molecular and genetic diversity of this medicinal plant. Population genetic diversity plays an irreplaceable role in the survival and evolution of organisms and it is extremely important for genetic innovation and breeding of plants. Here, we found a relatively high level of genetic diversity occurred in the wild A. biserrata germplasm (average PIC value: 0.67, average Shannon's information index I: 1.52; Table 1), which was comparable to those of other plants within the same Angelica genus. For example, using a similar method of ten transcriptome derived SSRs, a genetic analysis of 56 A. dahurica ecotype samples revealed an average PIC value was 0.41 and the Shannon's information index I was 0.85 (Chen et al., 2019). Given the different sample numbers investigated within both analyses, the levels of genetic diversity of these two Angelica plants are therefore consistently high.
Our analysis further revealed a relatively higher ratio of genetic variation presented within populations than those between populations (Supplementary Table 10). This result was similar to those reported from Angelica sinensis (Zhu et al., 2018) and Daucus carota (Chaitra et al., 2020), of which most of genetic variation was found to be present within wild populations rather than between populations. Similar to many other Umbelliferae plants, A. biserrata is a cross-pollinated plant, and both the pollen and seeds spread through wind or insects, leading to long-distance dispersal or exchanges of genetic materials between populations is possible. A high level of gene flow therefore possibly occurred across the populations of A. biserrata within and around the Changyang region we investigated. Continuously high level of gene flow can strongly weaken population genetic differences and keep a balance of genetic variation between populations in geographically adjacent localities. Further analysis is needed to investigate the population gene flow pattern of this species in future.

Core Collection Construction of A. biserrata
Core collections are the most representative subsamples from the original germplasm, which must be as comprehensive as possible to effectively represent the characteristics of the original germplasm and to ensure that not too much genetic diversity is lost (Frankel and Brown, 1984). For medicinal plants, the intrinsic quality (in terms of the contents of active ingredients) and appearance characteristics are both important phenotypic traits (Huang et al., 2005). Therefore, for constructing a core collection of a medicinal herb, it is necessary to consider both genetic and metabolic diversity simultaneously. Many previous studies of medicinal plants for constructing core collections are only based on genetic diversity, such as those reported in Scutellaria baicalensis (Bai et al., 2010) and Dalbergia odorifera (Liu et al., 2019). In the present study, we simultaneously determined genetic variation and metabolic contents of the A. biserrata germplasm, which may be more comprehensive and indicative for selection of core samples. Moreover, a series of results (Table 2 and Figure 4) further demonstrated that our trade-off method which was mainly based on genetic diversity and supplemented by metabolic components was feasible and reliable.
To construct a core collection of plants, the sampling strategy, including determining a reasonable sampling ratio is also important for the selection of core materials. Different sampling methods can directly affect the extraction of core materials. Due to the good representativeness and stability during the clustering procedure, the LDSS method is currently frequently used as a clustering procedure for constructing core collections of crops. It makes heterogeneous between core materials through deleting maximum genetic duplication of germplasms. Moreover, similar to the maximization (M) strategy which maintains diverse alleles through retaining maximum number of alleles at each locus (Duan et al., 2017), the LDSS method also preserve a sufficient number of alleles on the basis of avoiding genetic duplication to the greatest extent. For medicinal plants such as A. biserrata, this method can further combine both genetic and metabolic data in the core collection construction. Moreover, the sampling ratio is often adjusted according to the scale of the original population and richness of genetic diversity during construction of core collections (Hu et al., 2000). Collectively, our strategy of constructing a core collection of A. biserrata is valued for selecting materials for conservation and breeding of this plant, and also provides reference for other medicinal plants in terms of core collection construction.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi.nlm. nih.gov/genbank/, PRJNA655389.

AUTHOR CONTRIBUTIONS
YL designed the research. XP provided sampling support. XH, XW, and YL contributed to the sampling in the wild. ML and XH performed the experiments. ML analyzed the data and wrote the article. YL, ZH, and JZ revised the manuscript. All authors read and approved the submitted version.

ACKNOWLEDGMENTS
We sincerely appreciated staffs from Hubei Kangnong Seed Co., Ltd., provided assistance for sampling plant materials of wild A. biserrata.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2020. 600249/full#supplementary-material Supplementary Figure 1 | STRUCTURE analysis for estimate of the best K value. K values were set from 1 to 10 and the best one was predicated as K = 2 according to the K value.