Development of simple sequence repeat markers for sugarcane from data mining of expressed sequence tags

Sugarcane (Saccharum spp. hybrids) is an important agricultural crop grown worldwide, primarily for sugar and biofuel production. Its genetic complexity, aneuploidy, and extreme heterozygosity make the development of improved varieties challenging. Molecular breeding programs promise to deliver nutritionally improved varieties for both direct consumption and commercial application, and simple sequence repeats (SSRs) have proven to be a powerful molecular tool for this purpose in sugarcane. In this study, we collected 285216 expressed sequence tags (ESTs) from sugarcane and assembled them into 23666 unigenes, including 4547 contigs. We identified 4120 unigenes containing a total of 4960 SSRs; the most abundant repeat types were monomeric (44.33%), trimeric (39.68%), and dimeric (13.10%). We then chose 173 primer pairs to analyze banding patterns in 10 sugarcane accessions by polyacrylamide gel electrophoresis (PAGE). Functional annotation showed that 71.07%, 53.6%, and 10.3% of unigenes were annotated by UniProt, GO, and KEGG, respectively. GO annotations and KEGG pathways were distributed across three functional categories: molecular (46.46%), cellular (33.94%), and biological pathways (19.6%). Cluster analysis grouped the selected sugarcane accessions into four distinct clusters, with maximum genetic distance observed among the varieties. We believe that these EST-SSR markers will serve as valuable references for future genetic characterization, species identification, and breeding efforts in sugarcane.


Introduction
Sugarcane (Saccharum spp. hybrids) is a global economic and energy crop, with China ranking third in sugar production. This perennial herb is known for its photosynthetic efficiency, high biomass accumulation, aneuploid polyploidy (≥8), and genetic heterogeneity (Cordeiro et al., 2003). Conventional sugarcane breeding, which relies primarily on stem cuttings, is laborious and time-consuming, often taking decades to produce new varieties. The complex genetic background of sugarcane cultivars derives from interspecific hybridization of S. spontaneum L. and S. officinarum L. (Garsmeur et al., 2018). Commercial sugarcane cultivars inherit ~70-80% of their chromosomes from S. officinarum, ~10-15% from S. spontaneum, and the remaining ~5-10% from interspecific recombination (D'hont et al., 1996). Limited introgression in sugarcane breeding has led to a narrow genetic basis in commercial cultivars (Singh et al., 2013).
Genome sequencing has revolutionized the discovery and application of SSRs in various plant species, including sugarcane. The release of the S. spontaneum genome in 2018 (Zhang et al., 2018) provided a valuable resource for sugarcane breeders. Previous efforts have yielded a relatively small number of SSR markers in sugarcane: for instance, 351 EST-SSRs were identified from 4085 EST sequences (Singh et al., 2013), 406 EST-SSR markers were developed, of which 63 were verified as polymorphic (Ul Haq et al., 2016), and 2005 markers were identified from EST sequences, of which 65.5% showed polymorphism (Oliveira et al., 2009). Therefore, the development of markers from a comprehensive set of EST information to assess genetic relationships has become an imperative task. In this study, we screened EST-SSRs based on sugarcane unigenes, particularly those associated with functional genes, and assessed the genetic diversity among 10 sugarcane accessions that have previously been overlooked. We also investigated the evolutionary relationship between the sugarcane genome and those of sorghum and maize. We believe that these newly developed EST-SSR markers will provide a valuable reference for sugarcane breeding programs and facilitate species screening and identification.

Plant materials
A panel of diverse sugarcane accessions, including wild accessions and eight cultivars, was sourced from Guangxi, Yunnan, Taiwan, and Fujian. These accessions were maintained at Guangxi University, Nanning, China (Table 1).

Identification of SSR motifs and primer pair design
The assembled EST sequences were searched for SSR motifs using the MIcroSAtellite identification tool (MISA; http://pgrc.ipk-gatersleben.de/misa/) with the following minimum repeat numbers: 10 for monomeric repeats, 6 for dimeric repeats, and 5 each for trimeric, tetrameric, pentameric, and hexameric repeats. Subsequently, primer pairs were designed in Primer 3.0 using the following criteria: primer size of 18 to 27 bp (optimum approximately 20 bp), PCR product size of 100 to 300 bp, GC content of 40-60%, and melting temperature (Tm) of 57-63°C.
For each SSR locus, we designed three primer pairs, and the highest-scoring pair was selected for subsequent SSR marker studies. In-silico PCR analysis of the SSR primer pairs was performed using MFEprimer 3.2.6 (https://mfeprimer3.igenetech.com/) with default parameter settings, except that the Tm was set to 50°C (Yu and Zhang, 2011). The primers were synthesized by Sangon Biotech (Shenzhen, China).
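The repeat thresholds described above can be illustrated with a minimal Python sketch. This is not MISA itself: it only detects perfect repeats, ignores compound SSRs, and the function name `find_ssrs` and the toy sequence are assumptions introduced for illustration.

```python
import re

# Minimum repeat counts per motif length, as quoted in the text
# (10 for monomeric, 6 for dimeric, 5 for tri- to hexameric repeats).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Start positions are 0-based. Illustrative sketch only, not MISA.
    """
    seq = seq.upper()
    hits = []
    for unit, min_rep in sorted(MIN_REPEATS.items()):
        # Lookahead so that candidate runs starting at every position are seen.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (unit, min_rep - 1))
        end_of_last = 0
        for m in pattern.finditer(seq):
            start, run, motif = m.start(), m.group(1), m.group(2)
            if start < end_of_last:
                continue  # still inside a run already reported for this unit
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            hits.append((start, motif, len(run) // unit))
            end_of_last = start + len(run)
    return sorted(hits)


# Hypothetical toy sequence with one monomeric, one dimeric and one trimeric SSR.
toy = "GGC" + "A" * 12 + "CGT" + "AG" * 6 + "TTC" + "CCG" * 5 + "A"
for start, motif, count in find_ssrs(toy):
    print(start, motif, count)
```

Each reported tuple gives the position, the repeat unit, and the number of repeats, which is the information needed to place primers in the flanking sequence.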

Genomic DNA extraction and SSR genotyping
Genomic DNA (gDNA) was extracted from young sugarcane leaves using the cetyltrimethylammonium bromide (CTAB) method. A Nanodrop spectrophotometer (Thermo Scientific, USA) was used for gDNA quantification, followed by 1% agarose gel electrophoresis to check gDNA quality. Finally, the DNA was normalized to 10 ng µL−1 for PCR amplification. PCR was performed in a total reaction volume of 10 µL containing 30-50 ng of gDNA, 2.0 µL of 10× Taq buffer (Mg2+), 0.2 mM each of dNTPs, 0.5 µM each of forward and reverse primer, and 0.5 U of Taq DNA polymerase (Clontech, Takara, Shanghai). The resulting PCR products, along with a 2000 bp DNA marker, were separated on an 8% polyacrylamide gel by electrophoresis and visualized using silver staining.
SSR genotyping data were recorded as one (band present) and zero (band absent). The Polymorphism Information Content (PIC) values were computed using the following formula: PIC = 1 − Σ Pg² (summed over g = 1 to n), where Pg represents the frequency of a unique genotype if each SSR marker represents a single locus with n SSR genotypes.
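As a worked illustration of the PIC calculation, the following Python sketch assumes the commonly used form PIC = 1 − Σ Pg², with Pg the frequency of each distinct genotype (banding pattern) at a marker. The function name `pic` and the example patterns are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def pic(genotypes):
    """PIC = 1 - sum(Pg^2), where Pg is the frequency of each distinct genotype.

    `genotypes` holds one hashable banding pattern per accession
    (for example a tuple of 0/1 band scores).
    """
    counts = Counter(genotypes)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical marker scored in 10 accessions: four distinct banding patterns.
patterns = [(1, 0, 1)] * 4 + [(1, 1, 0)] * 3 + [(0, 1, 1)] * 2 + [(1, 1, 1)]
print(round(pic(patterns), 3))
```

Under this form, a marker for which every accession shows the same pattern has a PIC of 0, and the value approaches 1 as the number of equally frequent genotypes grows.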
The SSR presence/absence data were used to construct a phylogenetic tree of the 10 sugarcane accessions using the neighbor-joining (NJ) method based on Nei's genetic distance in MEGA X.
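To make the step from band scores to a distance matrix concrete, the sketch below computes pairwise distances from 0/1 profiles. The text does not state which of Nei's distances was applied, so the Nei and Li (1979) band-sharing coefficient is used here purely as an assumption; the function names and the toy profiles are likewise hypothetical, and tree building itself (done in MEGA X in the study) is not reproduced.

```python
def nei_li_distance(a, b):
    """Distance 1 - 2*n_ab / (n_a + n_b) between two 0/1 band profiles.

    Assumption: Nei and Li (1979) band-sharing coefficient; the source only
    says "Nei's genetic distance".
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of bands")
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # no bands in either profile: treat as identical
    return 1.0 - 2.0 * shared / total


def distance_matrix(profiles):
    """Return {(name_i, name_j): distance} for every ordered pair of accessions."""
    names = list(profiles)
    return {(p, q): nei_li_distance(profiles[p], profiles[q])
            for p in names for q in names}


# Hypothetical band scores for three accessions (not data from the study).
toy_profiles = {
    "acc1": [1, 1, 0, 1, 0],
    "acc2": [1, 0, 0, 1, 0],
    "acc3": [0, 1, 1, 0, 1],
}
for pair, d in sorted(distance_matrix(toy_profiles).items()):
    print(pair, round(d, 3))
```

A matrix of this kind is the input that a neighbor-joining implementation would consume.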

Unigenes annotation in sugarcane and comparison with sorghum and maize
We annotated all the unigenes containing SSRs against the Gene Ontology (GO, http://www.geneontology.org) and Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/) databases. To assess the conservation of sugarcane unigenes, we conducted BLASTN searches against the sorghum (Z3116) and maize (B73) genomes, using an e-value threshold of 1e-15 for sorghum and 1e-10 for maize. Our selection criteria included a sequence identity of more than 80% and an alignment length exceeding 100 bp.
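The hit-selection criteria can be sketched as a small filter over BLAST tabular output. The sketch assumes the standard 12-column tabular layout (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and reads the thresholds as 1e-15 and 1e-10; the function name `filter_hits` and the example rows are hypothetical.

```python
import csv


def filter_hits(lines, max_evalue, min_identity=80.0, min_length=100):
    """Keep hits with identity > min_identity, length > min_length, evalue <= max_evalue.

    `lines` is any iterable of tab-separated rows in 12-column BLAST tabular format.
    """
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        pident, length, evalue = float(row[2]), int(row[3]), float(row[10])
        if evalue <= max_evalue and pident > min_identity and length > min_length:
            kept.append((row[0], row[1], pident, length, evalue))
    return kept


# Hypothetical rows: only the first one passes all three criteria.
rows = [
    "uni1\tchr1\t92.5\t250\t10\t1\t1\t250\t500\t749\t2e-80\t300",
    "uni2\tchr2\t78.0\t300\t40\t2\t1\t300\t100\t399\t1e-30\t150",
    "uni3\tchr3\t95.0\t80\t4\t0\t1\t80\t10\t89\t1e-20\t100",
    "uni4\tchr4\t90.0\t150\t15\t0\t1\t150\t10\t159\t1e-12\t90",
]
for hit in filter_hits(rows, max_evalue=1e-15):
    print(hit)
```

Running the same filter with `max_evalue=1e-10` would correspond to the looser maize threshold.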

Distribution of SSR markers
For SSR analysis, a dataset of 285,216 EST sequences retrieved from the NCBI was subjected to quality filtering and redundancy removal using the CAP3 program. A total of 23666 unigene sequences, including 4547 contigs, were generated (Supplementary Table S1). Unigene length ranged from a minimum of 101 bp to a maximum of 4040 bp, with approximately 17467 unigenes between 600 and 1200 bases long and 826 unigenes measuring 1800-2400 nucleotides in length (Figure 1A). A summary of the sequencing results is presented in Table 2. Using the MISA identification tool, we predicted 4120 unigenes containing 4960 SSRs, a frequency of one SSR per 4.43 kb of the available ESTs. Among these sequences, 685 ESTs contained more than one SSR, with 415 being compound SSRs featuring multiple types of repeat motifs.

Validation and polymorphisms of SSR primers
The results from in-silico PCR analysis showed that 235 of 240 SSR primer pairs had potential amplicons in at least one of the three genomes (Supplementary Table S4A). Subsequent analysis of the predicted SSR motifs within the potential amplicons generated by SSR primer pairs showed that 219 of 235 SSR primer pairs had the predicted SSR motif in potential amplicons. Of these, 16 SSR primer pairs only existed in both S. spontaneum and maize, while 19 SSR primer pairs were found in the sorghum genome. Similarly, 34 SSR primer pairs had SSR motifs present in both maize and sorghum genomes, and 17 SSR primer pairs shared SSR motifs in both S. spontaneum and sorghum genomes. Additionally, 21 SSR primer pairs showed SSR motifs in both S. spontaneum and maize genomes. In contrast, 106 SSR primer pairs presented SSR motifs in all three genomes (Supplementary Table S4A).

Among the 235 primer pairs with potential amplicons, 40 SSR primer pairs had at least one base of the primer sequence that did not match the amplicon. Further analysis of the binding sites of SSR primer pairs with the potential amplicons showed that 53, 10, and 17 SSR primer pairs fully matched at least one of the potential amplicons in the S. spontaneum, maize, and sorghum genomes, respectively. Nine SSR primer pairs fully matched in both maize and sorghum genomes. Thirty-two SSR primer pairs showed full matches in both S. spontaneum and sorghum genomes, and 20 SSR primer pairs fully matched in both S. spontaneum and maize genomes. Intriguingly, 54 SSR primer pairs fully matched in all three genomes (Supplementary Table S4A).
To test the applicability of the deduced SSR markers, we selected 173 primer pairs for analysis in 10 sugarcane accessions, together with maize and sorghum, using PAGE (Figure 4). After optimization, we retained 163 of the 173 primer pairs because of their clear banding patterns and ease of identification. Among these, 4 were monomeric, 16 were dimeric, 125 were trimeric, 4 were tetrameric, 1 was pentameric, and 3 were hexameric, with lengths ranging from 21 to 109 bp. These 163 SSR loci amplified 3-21 alleles within the selected accessions, with an average of 9.46 alleles per locus. These SSR markers can be used effectively in genetic diversity analysis, population genetics, and germplasm identification. The polymorphism information content (PIC) values for these SSR loci ranged from 0.292 to 0.972, with an average PIC value of 0.808, indicating a high level of genetic diversity (Supplementary Table S4B).
Additionally, we gained more insight by integrating in-silico PCR analysis with amplification of three primer pairs for each locus. We detected the expected PCR products containing SSR loci in both in-silico PCR analysis and PCR amplification for all three primer pairs in sugarcane. Notably, unexpected PCR bands were amplified for three primer pairs. However, potential amplicons with long fragments, especially those longer than 1000 bp, were not amplified in maize and sorghum (Table 4 and Figure 4). Additionally, the 265/266 bp bands of primer pair PW2-23, which fully matched in sugarcane and maize, were amplified successfully, and the partially matched potential amplicon of 265 bp in sorghum was also amplified. For primer pairs PW2-28 and PW2-29, however, no potential amplicon of the expected PCR products was found in maize and sorghum (Table 4). Nonetheless, a PCR pattern almost identical to that of sugarcane was observed in maize and sorghum (Figure 4).

Functional annotation of sugarcane unigenes harboring the SSRs
To explore the potential functions of SSR-containing unigenes, all of these unigenes were annotated against publicly available functional databases. This analysis indicated that 38.75% of unigenes were associated with GO, while 43.96% were linked to KEGG. These SSR-containing unigenes were further classified into three major GO functional categories: biological process, cellular component, and molecular function (Figure 5A and Supplementary Table S5A). Within biological processes, unigenes related to post-embryonic development, photosynthesis, fruit ripening, DNA metabolic process, flower development, and regulation of molecular function accounted for the largest proportion. The cellular component category primarily represented unigenes involved in peroxisome, cytoskeleton, and mitochondrion. In the molecular function category, the most enriched unigenes were involved in signaling receptor activity, protein binding, structural molecule activity, and transporter activity binding.
Furthermore, these unigenes were annotated to 195 KEGG metabolism pathways, which were classified into six categories: cellular processes, environmental information processing, genetic information processing, metabolism, organismal systems, and BRITE hierarchies (Figure 5B). At the second level of the pathway classification, prominent categories included carbohydrate metabolism, translation, signal transduction, transport and catabolism, environmental adaptation, protein families: genetic information processing, and protein families associated with signaling and cellular processes. Additional details of each category are provided in Figure 5B and Supplementary Table S5B.

Genetic diversity and relationships among genotypes
To explore the genetic similarity of the sugarcane accessions, we conducted a cluster analysis based on a presence/absence matrix of the deduced alleles. Figure 6 presents the clustering results in the form of a phylogenetic tree. The phylogenetic clustering revealed four distinct accession clusters: "S. robustum", "Yunrui05-767", "ROC10", "ROC22", "Guatang28", and "Guatang32" formed a major cluster (Cluster-I); "Funong40" and "Funong39" were placed in Cluster-II; and "Yunrui05-782" and "S. spontaneum" each formed a separate cluster (Cluster-III and Cluster-IV) at the bottom of the phylogenetic tree (Figure 6). The accessions in Cluster-I showed a genetic distance of 7.4 from the accessions in Cluster-II. Notably, the Taiwan accessions "ROC22" and "ROC10", as well as the Fujian varieties "Funong40" and "Funong39", showed a genetic distance of 2.5 between them, indicating a higher degree of similarity as determined by the studied SSR markers. The largest genetic distances were recorded between Yunnan varieties clustered in different clades.
The 163 EST-SSR primer pairs proved effective for distinguishing the 10 sugarcane accessions, demonstrating the suitability of transcriptome sequences as a valuable resource for SSR marker development.
The cluster results aligned well with the origins and pedigrees of the 10 sugarcane accessions, providing insights into their relationships (Figure 6 and Table 1). For instance, sugarcane accessions from Fujian, Guangxi, and Taiwan clustered according to their breeding regions, while those with common parents clustered together. Additionally, our analysis revealed that S. officinarum shared a closer relationship with cultivated sugarcane than S. spontaneum did. Interestingly, two cultivated sugarcane lines (Yunrui05-782 and Yunrui05-767) derived from hybrid wild species were distinct from other cultivated sugarcane varieties, highlighting the potential of wild species for expanding the genetic basis of cultivated sugarcane through sexual hybridization.

We also explored the distribution of SSRs within the genomes of 10 sugarcane cultivars, observing a relatively high frequency of SSRs, approximately 1/4.43 kb. This frequency is comparable to that of certain other plant species such as P. violascens (1/4.45 kb), Chinese cabbage (1/4.67 kb), and wheat (1/5.46 kb), but considerably higher than in Arabidopsis (1/13.83 kb) (Cardle et al., 2000; Peng and Lapitan, 2005; You et al., 2015; Cai et al., 2019). The types of repeat motifs in this study were not uniformly distributed in the sugarcane genome. In general, unlike former studies on sugarcane (Table 4) by Singh et al. (2013), Xiao et al. (2020), Ukoskit et al. (2012), and Ul Haq et al. (2016), we found that monomeric repeats accounted for the largest proportion, at 44.33%, followed by trimeric and dimeric repeats at 39.68% and 13.10%, respectively (Table 3). These results differ from those of Xiao et al. (2020), in which trimeric repeats were most abundant. Dimeric and trimeric repeats were predominant when monomeric repeats were excluded. Additionally, we found that the proportion of tetrameric, pentameric, and hexameric repeats was considerably lower than that reported by Xiao et al. (2020) and in other species (Table 5). Overall, our findings contribute to a deeper understanding of the SSR landscape in sugarcane and its implications for genetic studies and breeding programs.

As shown in Figure 2, the A/T motif was the predominant monomeric repeat (88%). In contrast, the GC/CT repeats accounted for 56%, which was higher than that reported by Xiao et al. (2020) in sugarcane. This abundance also exceeded that in other species such as taro (52.86%) (You et al., 2015), pigeon pea (16.7%) (Dutta et al., 2011), and wheat (8.7%) (Peng and Lapitan, 2005). Of the trimeric repeats, CCG/CGG was the most predominant (48%), higher than the previous findings in taro (You et al., 2015), sugarcane (4.84%) (Xiao et al., 2020), and rice and maize (Cardle et al., 2000). The CGC/GCG trimeric repeat, at 17%, was the second most abundant, which was higher than in P. violascens (3.45%) (Cai et al., 2019) and sugarcane (4.74%) (Xiao et al., 2020). Our results confirm the prevalence of the CCG/CGG trimeric repeat, which is characteristic of monocots but rare in dicotyledonous plants (You et al., 2015; Cai et al., 2019; Xiao et al., 2020). The PIC is a critical metric for assessing the level of polymorphism of SSR markers, with a PIC value greater than 0.5 indicating a high level of polymorphism (Botstein et al., 1980). In our study, based on 163 EST-SSR markers, PIC values ranged from 0.292 to 0.972 with an average PIC value of 0.809 (Table S4).

FIGURE 6 Cluster analysis of sugarcane accessions by SSR markers. The scale at the bottom represents the genetic distance between all the accessions.

In general, EST-SSR primer pairs and the corresponding SSR loci were designed and aligned in S. spontaneum, sorghum, and maize in this study, which provides a possible route for sugarcane breeders to develop EST-SSRs. First, we developed EST-SSRs using sugarcane EST sequences or functional genes in the sugarcane genome. Some of the EST-SSR primer pairs were then synthesized and successfully amplified by PCR in 10 sugarcane cultivars together with sorghum and maize. Interestingly, our analysis revealed that a subset of SSR primer pairs (9 in S. spontaneum, 13 in maize, and 7 in sorghum) produced potential amplicons exclusively in one of these genomes. This observation suggests that while these species share some genetic similarities, they have also undergone unique evolutionary processes that have led to the development of distinct SSR loci. Such species-specific SSR markers can serve as important indicators of genetic divergence and could shed light on the evolutionary history of these species.
In sunflowers, most SSR-containing genes are involved in various biological processes such as cellular and metabolic processes (Lulin et al., 2012). Parmar et al. (2022) reported that most SSR-containing genes are involved in biological regulation and metabolic processes, which is consistent with the present study. The most important molecular functions of the GO-enriched genes in the present study are transport activity, binding, signaling receptor activity, protein activity, and catalytic activity. Additionally, the key biological processes associated with the GO-enriched genes include fruit ripening, post-embryonic development, photosynthesis, and regulation of molecular functions. KEGG analysis of SSR-containing genes highlighted important metabolic pathways such as carbohydrate metabolism and amino acid metabolism. The genetic information processing category was the second largest group.

Conclusion
In the present study, we achieved several significant outcomes. We successfully aligned sugarcane unigenes with sorghum and maize, leading to the identification and development of a valuable set of EST-SSR markers in sugarcane. A total of 4960 potential SSR markers were identified, and of 240 randomly selected primer pairs, 173 were assessed for polymorphism. Among these, 163 primer pairs exhibited polymorphism when applied to 10 sugarcane accessions. Furthermore, we annotated 4203 SSR-containing unigenes against the GO and KEGG databases, shedding light on their potential functions and pathways. Notably, we found that 56.43% of sugarcane unigenes mapped to a single locus in the maize genome, 29.11% to two loci, 5.6% to three loci, and 8.58% to other loci. This suggests a distinct evolutionary relationship between sugarcane and sorghum, with more duplication events occurring in maize chromosome segments. We believe these results have broad implications, not only contributing an important resource for future genomic and genetic studies in sugarcane but also serving as a powerful tool for studying evolutionary adaptation and genetic relationships in other related species.
FIGURE 1 Characteristics of unigenes and SSRs from the Saccharum spontaneum ESTs. (A) The number of unigenes based on the number of nucleotides in each, (B) the number of SSR motifs in monomeric, dimeric, trimeric, tetrameric, pentameric, and hexameric repeats, and (C) the number of different SSR motifs in the unigenes of Saccharum spontaneum.

FIGURE 4
FIGURE 5 Summary of functional annotation of SSR-containing unigenes. (A) GO and (B) KEGG represent different classes based on the predicted function of the top 50 SSR-containing unigenes. The y-axis indicates the number of genes in each specific category.

TABLE 1
Information of sugarcane accessions.

TABLE 2
Details of ESTs and SSRs identified in sugarcane.

TABLE 3
Summary of frequencies of different SSR repeat motif types.

TABLE 5
Comparison of frequency of microsatellites of different species.