Development and Application of EST-SSR Markers in Cephalotaxus oliveri From Transcriptome Sequences

Cephalotaxus oliveri is an endemic conifer of China, which has medicinal and ornamental value. However, the limited molecular markers and genetic information are insufficient for further genetic studies of this species. In this study, we characterized and developed the EST-SSRs from transcriptome sequences for the first time. The results showed that a total of 5089 SSRs were identified from 36446 unigenes with a density of one SSR per 11.1 kb. The most common type was trinucleotide repeats, excluding mononucleotide repeats, followed by dinucleotide repeats. AAG/CTT and AT/AT exhibited the highest frequency in the trinucleotide and dinucleotide repeats, respectively. Of the identified SSRs, 671, 1125, and 1958 SSRs were located in CDS, 3′UTR, and 5′UTR, respectively. Functional annotation showed that the SSR-containing unigenes were involved in growth and development with various biological functions. Among successfully designed primer pairs, 238 primer pairs were randomly selected for amplification and validation of EST-SSR markers and 47 primer pairs were identified as polymorphic. Finally, 28 high-polymorphic primers were used for genetic analysis and revealed a moderate level of genetic diversity. Seven natural C. oliveri sampling sites were divided into two genetic groups. Furthermore, the 28 EST-SSRs had 96.43, 71.43, and 78.57% of transferability rate in Cephalotaxus fortune, Ametotaxus argotaenia, and Pseudotaxus chienii, respectively. These markers developed in this study lay the foundation for further genetic and adaptive evolution studies in C. oliveri and related species.


INTRODUCTION
Usually, successful conservation strategies require obtaining the genetic information, of which genetic diversity and population structure are essential parts. Molecular markers become useful tools to study genetic diversity and population structure of natural germplasm resources in non-model plants with no reference genomes (Parida et al., 2009). Compared to the other makers, SSRs possess the advantages of relative abundance, high polymorphism, codominant inheritance, and reproducibility, which have gained considerable importance in plant genetics and breeding (Taheri et al., 2018). In addition, SSRs have also been applied for discovering quantitative trait loci (QTL), linkage map construction between gene and marker, marker assisted selection for desired traits (MAS), and so forth (Kalia et al., 2011).
According to the source, SSRs are classed into genomic SSRs (gSSRs) and expressed sequence tag SSRs (EST-SSRs), the latter of which are located in the genic transcribed regions and are identified by NGS technology (Varshney et al., 2005;Wei et al., 2011). In general, EST-SSRs have been found to be more development-inexpensive, more evolutionarily conserved, and have higher transferability to related species than traditional anonymous gSSRs (Scott et al., 2000;Gadissa et al., 2018). Moreover, they are expected to be less polymorphic than gSSRs because of greater DNA sequence conservation in transcribed regions, but also less prone to null alleles (Postolache et al., 2014). The location of EST-SSRs determines their functional roles. EST-SSRs in CDSs affect the inactivated or activated genes or truncate proteins, in 3′UTR are involved in gene silencing or transcription slippage, and in 5′UTR impact gene transcription and/or translation (Lawson and Zhang, 2006;Gao et al., 2013).
Cephalotaxus oliveri (Cephalotaxaceae) is an endemic conifer species of China. It is a perennial shrub or small tree which is 4 m tall with white stomatal bands between the midvein and marginal bands of abaxial leaves. This plant can be cultivated as an ornamental in gardens, and its wood can be used for the manufacture of farm tools and furniture (Fu et al., 1999). Moreover, a variety of plant alkaloids (such as anticancer alkaloid harringtonine) can be extracted from leaves, branches, seeds, and roots, which have certain curative effects on leukemia and lymphoid sarcoma (Ni et al., 2016;Xiao et al., 2019). The species is distributed in broad-leaved or coniferous forests to the south of Qinling mountains-Huaihe River line and west of Wuyi Mountains at an altitude of 300-1800 m, and the distribution locations mainly include southeastern and northeastern Yunnan, Guizhou, southern and western Sichuan, northwestern Hubei, Hunan, eastern Jiangxi, and northern Guangdong (Fu et al., 1999). However, its natural population sizes decreased significantly in the recent years due to overexploitation, deforestation, and climate change. This species has been listed as vulnerable in the IUCN Red List. Thus, effective strategies are important to ensure the conservation and scientific utilization of C. oliveri. EST-SSRs have been widely developed and applied to numerous coniferous species. Zhang et al. (2015) developed nine EST-SSRs of Larix gmelinii and used them to evaluate the genetic parameters and transferability to three related species. Zeng et al. (2018) developed 11 EST-SSRs of T. grandis, which revealed a moderate level of genetic diversity and two different genetic groups within this species. Li et al. (2021) also investigated the genetic variation and population structure of 20 EST-SSRs. All these studies proved that EST-SSRs are effective molecular markers for studying the genetic diversity of conifers with high transferability. Although, hitherto, few SSRs (Pan et al., 2011;Miao et al., 2012) and ISSRs (Wang et al., 2016) have been developed and applied, there have been no reports on EST-SSR markers for C. oliveri, even for Cephalotaxaceae, using transcriptome data, greatly limiting research on genetic diversity, germplasm preservation, and molecular breeding of this species.
In this study, we used the leaf transcriptome data by Illumina sequencing from C. oliveri, and the objectives were to 1) characterize frequency, distribution, and function of the SSR motifs from transcriptome unigenes; 2) develop and characterize novel EST-SSRs and examine the level of polymorphism; 3) analyze the cross-species transferability of the polymorphic EST-SSRs; and 4) explore the genetic diversity and structure of C. oliveri by polymorphic EST-SSRs.

Plant Materials and DNA Extraction
In 2019, 134 C. oliveri individuals from seven natural sampling sites in China were collected for development and characterization of EST-SSRs; detail sample information is shown in Supplementary Figure S1 and Supplementary  Table S1. Young leaves were sampled and placed into sealed bags containing dry silica gel and stored at 20°C for later use. The distance between individuals was more than 10 m. Total genomic DNA was extracted using the modified cetyltrimethylammonium bromide (CTAB) method (Su et al., 2005). DNA quality and concentration were determined by 1% agarose gel electrophoresis and Nanodrop 2000c (Thermo scientific, MA, United States), respectively. Then, all samples of DNA were diluted to a desired working concentration (50 ng/μl) and maintained at −20°C for PCR amplification.

EST-SSR Detection and Primer Design
The unigenes of leaf transcriptome data used for SSR development in this study came from the study conducted in our laboratory by He et al. (He et al., 2021) (Accession Number: SRR12058210). Potential SSRs were detected from 36446 unigenges using the the microsatellite tool (MISA, http://pgrc. ipk-gatersleben.de/misa/misa.html). The identification criteria for SSRs were set at a minimum number of 10, 6, 5, 5, 5, and 5 repeat units for mono, di, tri, tetra, penta, and hexanucleotide motifs, respectively. Primers were designed using Primer 3 software with major parameters as follows: primer length of 18-27 bp, PCR product size ranging from 100 to 280 bp, GC content of 40-60%, and annealing temperature ranging from 57 to 62°C.

Functional Annotations and Classification of EST-SSRs
All unigenes containing SSRs were compared against six public databases by BLAST, including NCBI nonredundant protein sequences (NR), NCBI nonredundant nucleotide sequences (NR), Kyoto Encyclopedia of Genes and Genomes (KEGG), SWISS-PROT, Protein family (Pfam), and Clusters of eukaryotic Orthologous Groups (KOG). Blast2GO was used to perform Gene Ontology (GO) terms based on the Nr annotation (Conesa et al., 2005).

EST-SSR Amplification and Validation
Totally, 238 pairs of primers were randomly selected (SSRs containing mononucleotides were not considered because of the high error rate of PCR product) and synthesized by TSINGKE Biological Technology Co., Ltd. (Beijing, China) for polymorphic EST-SSR development. PCR amplification was performed in a total reaction volume of 25 μl that included 1 μl of template DNA (50 ng/μl), 12.5 μl of 2×Taq Master Mix (Vazyme Biotech; Nanjing, China), 0.5 μl of each primer (10 μM), and 10.5 μl of ddH 2 O. Amplifications were carried out applying the following procedures: 5 min at 95°C for initial denaturation, followed by 30 cycles of 30 s at 95°C, 30 s at the annealing temperature, 30 s at 72°C, and 10 min at 72°C for final extension. The successfully amplified products were processed in 6% denaturing polyacrylamide gel electrophoresis for polymorphism screening by 14 individuals from seven sampling sites of C. oliveri. Polymorphic primers were further used for genotyping, and the PCR products with fluorescence labeling were separated on an ABI 3730xl DNA Analyzer (Applied Biosystems, CA, United States), using GeneScan LIZ500 (Applied Biosystems) as the internal lane size standard. The genotyping data were obtained through GeneMapper v5 (Applied Biosystems).

Analysis of Data
We used the 28 EST-SSR markers to analyze genetic diversity and structure among 134 individuals from seven sampling sites. POPGENE v1.32 (Yeh et al., 1999) was used to calculate population genetic parameters, including the number of alleles (A), number of effective alleles (Ae), observed heterozygosity (Ho), expected heterozygosity (He), and Shannon's information index (I). The polymorphism information contents (PICs) of each marker were calculated using PIC_CALC v0.6 (Botstein et al., 1980). MICRO-CHECKER 2.2.3 (Van Oosterhout et al., 2004) was used to estimate the null allele frequency for each marker. Linkage disequilibrium (LD) for locus pair and the departure from Hardy-Weinberg equilibrium (HWE) were detected by GENEPOP v4.7 ( (Raymond and Rousset, 1995), Bonferroni corrections were performed to determine significance levels for all tests at p-value < 0.05 (Rice, 1989). Analysis of molecular variance (AMOVA) and principal coordinate analysis (PCoA) were performed in GenAlEx v6.5 (Peakall and Smouse, 2012).
Based on Bayesian clustering analysis, STRUCTURE v2.3.4 (Pritchard et al., 2000) with the default setting of the admixture model was used to analyze the genetic structure of natural populations within the species. Ten independent runs were performed with K from 1 to 10. Each run was estimated with Markov chain Monte Carlo steps of 100,000 iterations and burnin period of 100,000 iterations. The optimal K value was evaluated in STRUCTURE HARVESTER (http://taylor0.biology.ucla.edu/ structureHarvester/). CLUMPP (Jakobsson and Rosenberg, 2007) and Distruct (Rosenberg, 2004) were used to estimate the averaged admixture coefficients for each K value and visualize the clustering results, respectively.

Transferability in Cross-Species
Three different species, namely, C. fortunei, Ametotaxus argotaenia, and P. chienii were used to analyze the transferability of the 28 EST-SSR markers. Young leaves of 17, 12, and 12 individuals were collected. Genomic DNA extraction, PCR amplification, and separation and size reading of target products for all these samples were performed as described above.
In addition, the physical positions of these 5089 SSRs in the unigenes were also identified, and 671, 1125, and 1958 SSRs were located in CDS, 3′UTR, and 5′UTR, respectively, while the remaining 1335 SSRs had no sufficient information to determine their position. In CDS, trinucleotide repeats were the dominant type. Most of mononucleotide and dinucleotide repeats were located in 3′UTR and 5′UTR, whereas trinucleotide repeats were also abundant in 5′UTR (Figure 2).

SSR Annotation and Classification
To explore the potential function of SSR-containing unigenes, all these unigenes were annotated by comparison against the seven functional databases: Nr, Nt, KEGG, Swiss-Prot, Pfam, KOG, and GO. As indicated in Table 2 Classification of all SSR-containing unigenes was performed using their annotations with GO, KEGG, and KOG databases. These SSR-containing unigenes were classified to three major GO functional categories: biological process (7287, 44.14%), cellular component (5350, 32.40%), and molecular function (3874, 23.46%), which were further classified into 25, 16, and 11  different sub-categories, respectively ( Figure 3). Of these, unigenes related to cellular process, metabolic process, and single-organism process accounted for the largest proportion in biological processes. In the cellular component, the most enriched sub-category was the cell part, followed by the cell and membrane. The molecular function category mainly represented the genes involved in binding, catalytic activity, and transporter activity. The unigenes were annotated in 115 KEGG metabolism pathways that were classified into five categories, including cellular processes (55, 4.47%), environmental information processing (71, 5.77%), genetic information processing (379, 30.79%), metabolism (678, 55.08%), and organismal systems (48, 3.90%) (Figure 4). In the second level of the pathway, carbohydrate metabolism represented the largest pathway, followed by translation and transcription. Among these 115 pathways, the top five were spliceosome, plant hormone signal transduction, starch and sucrose metabolism, mRNA surveillance pathway, and plant-pathogen interaction.

Development and Validation of EST-SSR Markers
Among the 5089 SSRs, 3900 pairs of primers were successfully designed. 238 pairs of primers were randomly selected for amplification and validation, and the results showed that 127 (53.36%) primers produced expected size bands. Of these 127 pairs of primers, 80 were monomorphic and 47 were identified as polymorphic. Finally, 28 high-polymorphic primers were selected and used for genetic analysis (Table 3).
In total, 118 alleles were detected in all samples by the 28 EST-SSR markers. A per locus ranged from 2 to 7 with an average of 4.214. Ae ranged from 1.087 to 4.595, with an average value of 1.889. I varied from 0.215 to 1.676, with an average value of 0.724. Ho and He were calculated as 0.279 and 0.382, ranging from 0.052 to 0.836 and 0.080 to 0.785, respectively. The PIC ranged from 0.079 to 0.752, with an average of 0.345. In addition, four loci (Co229, Co274, Co82, and Co244) were found to have null alleles. Significant departures from HWE were detected for 18 of 28 EST-SSR loci ( Table 3). Among 378 pairs of loci, 44 pairs showed LD and only two (Co14&Co257 and Co229&Co222) remained significant after Bonferroni correction (p-value < 0.05), indicating that the EST-SSR loci used in this study were independent of each other.

Genetic Diversity and Structure
For each sampling site, the mean A, Ae, I, Ho, and He ranged from 2.071 to 2.679, 1.451 to 1.720, 0.396 to 0.587, 0.208 to 0.311, and 0.243 to 0.352, respectively. Overall, SX had the greatest genetic diversity, whereas the lowest genetic diversity was found in PB (Supplementary Table S3). AMOVA showed that 71% of the variation was found within populations and 29% among populations (Supplementary Table S4).
The population structure of C. oliveri was analyzed in STRUCTURE, and the optimal K value was observed at K 2 with maximum Delta K value ( Figure 6A). The seven sampling sites were divided into two genetic groups ( Figure 6B and Supplementary Figure S1). Group Ⅰ included the sites, namely, SX, HNG, WYH, LP, EMS, and WGS, while Group Ⅱ only included site PB. Consistent with the STRUCTURE analysis,    Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 759557 6 the result of PCoA also revealed two groups based on genetic distance (Figure 7). The first and second axes explained 29.55 and 12.75% of the total variation, respectively. HWE was further tested for these two genetic groups, and the result showed that 13 loci deviated from HWE in Group Ⅰ, while only one locus was deviated from HWE in Group Ⅱ (Supplementary Table S5).

Cross-Species Transferability
The 28 pairs of primers were also evaluated for transferability in the three species (C. fortune, A. argotaenia and P. chienii). The results showed that 15 primer pairs were successfully amplified in all species (all the templates). Twenty-seven pairs were applicable for (had transferability in) C. fortunei, with the highest rate of 96.43%. 20 (71.43%) and 22 (78.57%) pairs had transferability in A. argotaenia and P. chienii, respectively (Supplementary Table S6).
A microsatellite locus typically varies in length between 5 and 40 repeats, but longer strings of repeats are possible (Selkoe and Toonen, 2006). In this study, the repeat units of 10, 5, 6, 11, and 12 accounted for 77.42% of the total SSR loci, with the size mainly ranging from 10 to 18 bp. The predominant type was trinucleotide repeats (26.19%), followed by dinucleotide repeats (15.35%) and hexanucleotide repeats (4.30%). This result was similar to that of pervious reports that trinucleotide repeats were the abundant type for other conifers, including C. hainanensis (Qiao et al., 2014), A. argotaenia (Ruan et al., 2019), P. koraiensis , and P. chienii (Xu et al., 2020). Among the trinucleotide repeats, the most abundant motif was AAG/CTT, which was identical to previous findings in Cryptomeria japonica (Ueno et al., 2012), P. halepensis (Pinosio et al., 2014), L. gmelinii , T. grandis (Zeng et al., 2018), A. argotaenia (Ruan et al., 2019), and P. chienii (Xu et al., 2020). In addition, this motif was the second abundant motif in C. hainanensis (Qiao et al., 2014) and P. dabeshanensis (Xiang et al., 2015). These results showed that AAG/CTT was conserved in conifers. Of the dinucleotide repeats, AT/AT and AG/CT exhibited high frequency. However, like most conifers, including T. contorta (Majeed et al., 2019) and P. koraiensis , the CG/CG motif was found to be rare in this study. It might be due to methylation of cytosine in CpG sequences, which might potentially inhibit transcription (Gonzalez-Ibeas et al., 2007;Xing et al., 2017).
Some studies have shown that SSRs are much more abundant in the UTRs than in the CDSs of many plants (Li et al., 2004). In the present study, 17.85% of SSRs were found to be located in CDSs in contrast to 82.13% in UTRs. Moreover, trinucleotide repeats were mostly accumulated in the CDSs and other types had  a small proportion. This result was consistent with those of the studies in T. contorta (Majeed et al., 2019), Elaeagnus mollis , and P. chienii (Xu et al., 2020). This might explain that non-trinucleotides negatively selected frameshift mutations in coding regions, contributing to changes in SSR length, and therefore to expression. In contrast, trinucleotides did not generate frameshifts and did not affect gene expression through single-motif length mutations (Metzgar et al., 2000;Taheri et al., 2019). All SSR-containing unigenes in leaves of C. oliveri were annotated through seven functional databases. 86.72% of these unigenes were annotated in at least one database, which was similar to the annotation result of all unigenes in leaf (He et al., 2021). In the GO classification, most of SSR-containing unigenes were categorized in the cellular process, metabolic process, singleorganism process, cell part, cell, binding, and catalytic activity, which are involved in the basic metabolism in the plant cells. Furthermore, KEGG and KOG classification suggested that SSRcontaining unigenes were involved in growth and development of C. oliveri with various biological functions.
The transferability rate of the markers corresponds to genomic similarity and can reflect the genetic relationships and extent of sequence conservation between species (Zhang et al., 2014). EST-SSR markers are expected to have a high transferability rate due to conservation of transcribed regions among related species (Kalia et al.,   . In this study, the transferability of 28 EST-SSRs from C. oliveri to C. fortune was 96.43%, confirming that C. oliveri had a closer relationship with C. fortune than the other two species. This transferability rate was consistent with that of Taxodium 'zhongshansa' (100%) (Cheng et al., 2015), while higher than that of Abies alba (75-81%) (Postolache et al., 2014) and P. chienii (70%) (Xu et al., 2020). Significantly, the transferability rate was more than 70% both for A. argotaenia and P. chienii belonging to Taxaceae, indicating that there may be a high genomic similarity between C. oliveri and Taxaceae species. These results indicated that the markers developed in this study would provide a powerful molecular tool for evolutionary adaptation and genetic relationship analysis in C. fortune, A. argotaenia, and P. chienii. Genetic diversity is an important indicator in the conservation and management of plant genetic resources .In this study, genetic parameters were detected in 134 individuals by the 28 EST-SSRs. The mean PIC value was 0.345, indicating a mean level of genetic information, and similar results were observed in C. japonica (0.325) (Ueno et al., 2012), A. argotaenia (0.455) (Ruan et al., 2019), and P. koraiensis (0.404) (Li et al., 2021). Mean A (4.214) was higher than that reported for T. grandis (2.636) (Zeng et al., 2018) and L. principis-rupprechtii (3.850) (Dong et al., 2018), but lower than that for P. koraiensis (6.45)  and P. chienii (6.4) (Xu et al., 2020). The average of Ho and He were 0.279 and 0.382, respectively, which was lower than that for P. dabeshanensis (0.445 and 0.486) (Xiang et al., 2015) and L. principis-rupprechtii (0.487 and 0.490) (Dong et al., 2018). These results revealed that the 134 individuals had a moderate level of genetic diversity, which might be caused by small population size and fragmentation; meanwhile, habitat heterogeneity might restrict gene flow between populations and had an impact on genetic diversity . However, this study on genetic diversity of C. oliveri showed a lower level than that in the previous study employing gSSRs (Ho 0.570, He 0.568) (Miao et al., 2012). These differences might be due to the different types and numbers of genetic markers used in the studies or to the different populations and sample sizes (Tong et al., 2020). Furthermore, the genetic diversity of the population level was lower than that at the species level, which might be due to limited numbers of SSR loci and small population size.
The distribution geographic range, habitat types, and species characteristics may have a significant influence on the population structure, as well as genetic diversity (Jia et al., 2016;Li et al., 2019). A previous report found that 22 populations of C. oliveri were clustered into two groups by constructing a dendrogram, and the YNdws population (Yunnan Province) was clustered into the separated group (Wang et al., 2016). In this study, STRUCTURE and PCoA analyses also revealed a distinct genetic population (site PB) among the seven sampling sites of C. oliveri. The main reason was that site PB was in the southern edge of the distribution area, as the most geographically distant population, which limited the gene flow between site PB and other sites. However, detailed reasons for the population structure need to be further investigated with an increased number of population and materials.
According to the results of genetic diversity and population structure of C. oliveri, we gained the following management implications and recommendations: for the sites (SX, LP) with higher genetic diversity, in situ conservation strategies should be designed to protect the habitats from human disturbance and prevent the loss of genetic diversity. Ex situ conservation programs should be carried out for the sites (EMS, PB) with lower genetic diversity in order to increase the species numbers of these sampling sites and improve their adaptability to the environment. Furthermore, special attention should be given to site PB because of its isolation from the other sampling sites; it is necessary to save the seeds of this site for breeding.

CONCLUSION
In this study, 5089 EST-SSRs were identified from transcriptome data of C. oliveri, and the distribution, frequency, location, and function of motifs were characterized and evaluated., Twentyeight EST-SSR markers were developed for C. oliveri with abundant polymorphisms and showed a moderate level of genetic diversity. Genetic structure and PCoA analysis revealed two different genetic groups of natural C. oliveri. In addition, more than 70% of these EST-SSRs could be transferred to other species. These EST-SSRs would enable further genetic investigation in C. oliveri and related species and could be used to ensure effective conservation and breeding applications of C. oliveri in the future.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: Twenty-eight EST-SSR sequences generated for this study were deposited at GenBank with Accession numbers MZ773613-MZ773640.

AUTHOR CONTRIBUTIONS
HL and YZ performed the experiments and wrote the manuscript, and the two authors contributed equally to this study. ZW analyzed the data. YS and TW conceived and designed the experiments and revised the manuscript.