Comprehensive Characterization of Simple Sequence Repeats in Eggplant (Solanum melongena L.) Genome and Construction of a Web Resource

We have characterized the simple sequence repeat (SSR) markers of the eggplant (Solanum melongena) using a recent high quality sequence of its whole genome. We found nearly 133,000 perfect SSRs, a density of 125.5 SSRs/Mbp, and also about 178,400 imperfect SSRs. Of the perfect SSRs, 15.6% were complex, with two stretches of repeats separated by an intervening block of <100 nt. Di- and trinucleotide SSRs accounted, respectively, for 43 and 37% of the total. The SSRs were classified according to their number of repeats and overall length, and were assigned to their linkage group. We found 2,449 of the perfect SSRs in 2,086 genes, with an overall density of 18.5 SSRs/Mbp across the gene space; 3,524 imperfect SSRs were present in 2,924 genes at a density of 26.7 SSRs/Mbp. Putative functions were assigned via ontology to genes containing at least one SSR. Using this data we developed an “Eggplant Microsatellite DataBase” (EgMiDB) which permits identification of SSR markers in terms of their location on the genome, type of repeat (perfect vs. imperfect), motif type, sequence, repeat number and genomic/gene context. It also suggests forward and reverse primers. We employed an in silico PCR analysis to validate these SSR markers, using as templates two CDS sets and three assembled transcriptomes obtained from diverse eggplant accessions.


INTRODUCTION
The eggplant (Solanum melongena L., 2n = 2x = 24), also referred to as aubergine or brinjal, belongs to the Solanaceae family. After potato and tomato, eggplant represents the third most important solanaceous crop species and its bulk production is concentrated in China, India, Iran, Egypt and Turkey, with Italy and Spain representing the most important European Union producers 1 . Unlike tomato (S. lycopersicum) and potato (S. tuberosum), which belong to the Potato clade and are native to South America, eggplant belongs to the spiny Solanums (subgenus Leptostemonum) and it is native to the Old World. A minimum of two domestication events from the wild species S. insanum has been supported: one in India and one in southern China/SE Asia, with a possible additional and independent center of domestication in the Philippines (Meyer et al., 2012). Two other eggplant species are commonly grown in sub-Saharan Africa: the scarlet eggplant (S. aethiopicum L.) and the gboma eggplant (S. macrocarpon L). Both of them can be inter-crossed with S. melongena producing hybrids with intermediate fertility (Acquadro et al., 2017).
Advances in next-generation sequencing (NGS) platforms, through multiplexed sequencing of barcoded samples in a single run, have driven the costs down and considerably reduced the time needed for the whole genome sequencing of plant species. Genomes provide opportunities to develop a huge amount of novel molecular markers, which can be used for assessing the genetic variations within cultivars and germplasms, for the development of molecular genetic and physical maps, for identifying genes and quantitative trait loci controlling economically important traits and to assist breeding for crop improvement. Among them, microsatellites or simple sequence repeat (SSR) markers represent one of the most informative, versatile, and practical DNA-based markers used in plant breeding programs, since they are easy to score and have wide genomic distribution, codominant inheritance and a multi-allelic nature (Xu et al., 2013). The first sets of eggplant microsatellites was developed from the screening of small insert genomic libraries with di-and trinucleotide probes (Nunome et al., 2003a,b). Afterwards, Stàgel et al. (2008) developed a small set of SSR markers from genic DNA sequence lodged in public access databases, while Nunome et al. (2009) reported the identification of over 1,000 SSR markers from a screen of enriched gDNA and cDNA libraries. An extensive set of nearly 2,000 putative eggplant SSRs was later on isolated by Barchi et al. (2011) from RAD (restriction-site associated DNA) tags, of which a sub-set showed to be polymorphic among the parents of mapping populations. Additional informative SSR markers were also developed from S. melongena genomic libraries enriched for AG/CT motif  and, more recently, from transcriptome analysis of S. incanum and S. aethiopicum, two close relatives of the common eggplant (Gramazio et al., 2016). In eggplant SSR markers have been applied in diversity, phylogenetic as well as linkage mapping studies (Barchi et al., 2010(Barchi et al., , 2012Munoz-Falcon et al., 2011;Fukuoka et al., 2012;Hurtado et al., 2012;Prohens et al., 2012;Cericola et al., 2013;Gramazio et al., 2017). Recently a high quality genome draft of eggplant, covering 1.06 Gb of gap-free sequence with an N50 of 2.9 Mb, was produced by a combination of Illumina sequencing and optical mapping and was anchored to 12 pseudomolecules 2 (The Eggplant Genome Consortium, 2017).
Here we report on the identification of single locus SSR markers in an eggplant genome-wide survey as well as on the development of a public dynamic microsatellite database 3 , which represents a one-stop resource for the global community of scientists and breeders. The web resource provides user needbased primer designing facilities with mobile-friendly features, to 2 www.eggplantgenome.org 3 www.eggplantmicrosatellite.org/ facilitate rapid selection of suitable custom markers for a wide range of genetic analyses.

The SSR Content of the Eggplant Genome
The high quality genome of the eggplant inbred line "67/3, " recently sequenced (The Eggplant Genome Consortium, 2017) and available in the public domain at www.eggplantgenome.org, was downloaded in FASTA format. Its 12 pseudomolecules (representing partial sequences of each of the species' chromosomes), along with all unmapped scaffolds, were chopped into manageable pieces using SciRoKo tool 4 . Perfect, compound, and imperfect SSRs were identified in silico using the SciRoKo SSR-search module (see footnote 4). A minimum of four repetitions together with a minimum length of 15 nt was requested. Any sequence was considered as a perfect SSR where a motif was repeated at least fifteen times (1 nt motif), eight times (2 nt), five times (3 nt) or four times (4-6 nt), allowing for only one mismatch. For compound repeats, the maximum default interruption (spacer) length was set at 100 bp. The coordinates (start/end position) of each SSR were matched with those of the gene space using Bedtools intersect 5 , using the default parameters with -loj (left outer join) option: where the overlap comprised at least 1 nt, the repeat was designated as a genic SSR. The putative function of genes carrying at least one SSR was defined by the GO (gene ontology) approach. Enrichment for GO terms was recognized by examining the set of genic SSRs upon the whole genome GO annotation dataset using the AgriGO suite 6 , collecting enriched GO terms with P-values < e−5 and a false discovery rate <0.01, and visualized through ReviGO 7 .

Collection of Genomic Sequences From Different Solanaceae Species
For comparison purposes, the draft genome sequence of the Asian eggplant cultivars 'Nakate-Shinkuro' (Hirakawa et al., 2014), together with the full genome sequences of 12 other plant species from the Solanaceae family (Solanum lycopersicum, S. pimpinellifolium, S. pennellii, S. tuberosum, Capsicum annuum, C. chinense, C. baccatum, Nicotiana tabacum, N. attenuata, N. benthamiana, Petunia axillaris, P. inflata) and their closely related species Coffea canephora, were collected from the related public database and scanned for the presence of perfect SSRs, using the same procedures described above. The source of each of the above full genome sequences is given in Supplementary  Table S1.

EgMiDB, an SSR Database for Eggplant
The Eggplant Microsatellite DataBase (EgMiDB) was developed to provide browsable access to the SSR data. This web application, based on a LAMP stack, comprises a client tier (client browser), a middle tier (Apache web server with PHP interpreter) and a database tier (MySQL DBMS). A user-friendly interface was developed using PHP, which is an open-source server-side scripting language. The set of in silico detected SSRs were stored in the MySQL database, using PHP scripts to parse the text file from SciRoKo. User need-based customized queries can be generated from the web interface and allow users to search the microsatellite marker information in MySQL database. A standalone version of Primer3 8 is also provided to design primer pairs for any given SSR: its output lists alternative sets of primer pairs, and the characteristics of the expected amplicon.

Marker Validation
An in silico validation was adopted to validate designed SSR primers. About 1000 SSR primer pairs, amplifying in genic regions, were randomly selected and tested with the primer search tool (version EMBOSS:6.6.0.0) in three transcriptomes (Yang et al., 2014;Ramesh et al., 2016;PRJNA247728, unpublished) and two CDS set from eggplant genomic projects published by Hirakawa et al. (2014) and The Eggplant Genome Consortium (2017). Custom scripts were used to count positive PCR results and ambiguous amplifications.

The SSR Content of the Eggplant Genome and Cross-Species Comparison
In the ∼1.06 Gb of the gap-free eggplant genomic sequence, we identified 132,831 perfect SSR motifs (125.5 SSR/Mb), which included 20,670 compound SSRs ( Table 1). The number of imperfect SSR motifs was 178,407 ( Table 2). The content and distribution of SSRs in the genome sequence of the eggplant 8 https://sourceforge.net/projects/primer3 breeding line "67/3" were compared with those of 14 other plant genomes, related to different degrees (25.3 Gb of sequence in total, about 2.2 million SSRs) and retrieved from databases (Supplementary Table S1). The number of perfect SSRs found in the "67/3" genome was more than one-third higher than the number detectable in another eggplant cultivar, "Nakate-Shinkuro" ( Table 1). This indicates that the high quality of the recently released eggplant genome sequence markedly influenced the number of detectable microsatellites.
The "67/3" eggplant genome was also found to include almost twice the number of perfect microsatellites compared to the wild species S. pimpinellifolium (54,829) and N. attenuata (67,198), but only half of those in N. tabacum (266,145) and C. baccatum (277,513). The cumulative length of the full collection of eggplant SSRs was 3.4 Mbp, which is 0.32% of the assembled genome, a percentage analogous to those of tomato (0.31%) and potato (0.28%), but considerably higher than found in C. chinense and C. annuum (0.15 and 0.12%, respectively). Compound SSRs represented 15.6% of the eggplant perfect SSRs, a proportion exceeded only in S. lycopersicum (27.7%) and S. pennellii (37.6%) ( Table 1).
Considering all 14 species, genome size was found to be positively associated with the number of identified SSR motifs (R 2 = 0.948, P < 0.01). However, as a general trend, species possessing larger genomes show lower SSR density (SSRs/Mb) (Morgante et al., 2002). This is the case for the three Capsicum species and for N. tabacum and N. benthamiana, but S. pimpinellifolium and N. attenuata are exceptions; their genome sizes are, respectively, 688.9 and 830.4 Mbp, but their microsatellite densities are comparable to those found in plant species with genomes three to five times larger.
However, although differences in genome size may contribute to the level of repetition of microsatellites, density of SSRs has been found not to be related to genome size (Zhao et al., 2012;Behura and Severson, 2014;Portis et al., 2016). The density of perfect microsatellites in the genome of the eggplant breeding  line "67/3" is the highest observed within the Solanaceae family, although similar to those detected in S. pennellii and P. axillaris, and, in the context of this study, inferior only to that found in Coffea canephora (139.1 SSRs/Mb) ( Table 1).

Characterization of SSR Motifs by Different Length and Repeats
Eggplant SSR motifs which predominated were the di-and trinucleotides (respectively 43 and 37% of all SSRs, with densities of 53.7 and 46.4 SSRs/Mbp), with lesser proportions of mono-(8%) and tetranucleotides (7%); the penta-and hexanucleotide repeats contributed <5% ( Table 2). Dinucleotide sequences were in the majority, amounting to 1.7 Mbp (51.1% of the cumulative length of all SSR motifs). Dinucleotides were also the most common type in tomato and in Nicotiana species. Among the imperfect SSR motifs the occurrence of mono-to trinucleotide motifs was low compared to that seen among the group of perfect SSRs. There were relatively larger amounts of longer motifs: together the penta-and hexanucleotide SSRs represented 21% of the total, and 17% of the cumulative length of all imperfect SSR motifs ( Table 2). We found that the sum of di-and trinucleotides formed the majority of perfect SSRs in all the genomes of Solanum, Capsicum and Nicotiana species searched, ranging from 67% in potato to 81% in N. benthamiana. However, in Petunia species and Coffea canephora, mononucleotides were the most frequent type (Figure 1). The variation of perfect microsatellites in the eggplant genome with regard to the number of repeat units is shown in Figure 2 and Supplementary Table S2. In all SSR classes, it was observed that longer repeats are less abundant; a trend of decrease in SSR frequency with the increase of their repeat number has been observed in other species (Shi et al., 2013;Cheng et al., 2016;Portis et al., 2016). For example, the number of SSRs with ≤8 repeats accounted for about half the total, while those with >20 repeats accounted for less than 12%. The decrease was less marked for mononucleotides and dinucleotide motifs than for longer repeat types, with tetra-to hexanucleotide motifs showing the sharpest frequency reduction with increase of repeat (Figure 2A). As a consequence, the mean number of dinucleotide repeat units (15.3) was more than twice the number of trinucleotide repeat units (7.4), and about three times higher than for tetra-, penta-and hexanucleotides (5.0-5.5) ( Table 2). Based on the length of the perfect repeat motifs, 19.7% of SSRs were considered to belong to hypervariable class I (≥30 nt). Another 22.7% were assigned as potentially variable class II (20-30 nt) types, while the remaining 57.7% represented variable class III (<20 nt) types ( Figure 2B). Most of the mononucleotides (83.6%) belonged to class III, while for hexanucleotides classes I and II preponderated (19.1 and 80.9%, respectively) ( Figure 2B). Class I consisted mainly of di-(77.2%) and trinucleotides (17.2%), with all the other motifs accounting for <2% (Figure 2C).

Characterization of SSRs by Classified Type
Grouping of repeats into classes was carried out according to the method of Jurka and Pethiyagoda (1995). Thus the trinucleotide repeat class AAT includes (TTA)n, (TAT)n, (ATT)n, (ATA)n, and (TAA)n, which are equivalent either in different reading frames or via complementarity. A total of 336 kinds of SSR motif were found, with all the possible base combinations of mono-(only 2 types), di-(4) tri-(10) and tetranucleotide (33), plus 80 and 207 variations of penta-and hexanucleotide repeats ( Table 2).
We also evaluated in detail the individual repeat motifs for each type of SSR in the eggplant genome (Figure 3 and Supplementary Table S3). The base composition of eggplant SSR motifs is strongly biased toward A and T; the most frequent mono-to hexa-nucleotides motifs were A/T (85.1%), AT/TA (85.9%), AAC/GTT (50.4%), AAAT/ATTT (47.1%), AAAAT/ATTTT (19.9%) and AACAAT/ATTGTT (7.1%). In terms of the distribution of different motifs, AT repeats were not only the predominant dinucleotides, but they were also the most frequent motif in the entire genome, accounting for 36.8% of the total SSRs. On the other hand, CG repeats were exceedingly rare (0.04%). Within the trinucleotide motifs, AAC, AAT, and AAG repeats were the most abundant (together representing 89%), whereas GC-rich repeats such as AGC, CCG, and ACG were uncommon.
In the same manner AT-rich tetranucleotide motifs such as AAAT, ACAT, AAAG, and AATT predominated in the eggplant genome (79% of total) while the motifs AAAAT, ATATC, AAACC, and AATAT represented nearly half (49.0%) of the total pentanucleotide repeats. Only three hexanucleotide motifs AACAAT, AAAAAT, and AAGAGG were present with a relative frequency higher than 5% (Figure 3). Analogous motif type distributions have been found for all the other related species analyzed here (Supplementary Table S4). It has been reported that AT-rich repeats prevail in dicot but not in monocot species, and that this difference may be explained partially by the nucleotide composition of their genomes: 34.6% GC content in dicots vs. 43.7% in monocots (Cavagnaro et al., 2010). When comparing 15 highly diverse plant species, we recently reported that among the mononucleotide motifs, A/T was predominant in all the species with the exception of rice (61.8% of C/G), while the motif AT/TA was the most frequent dimer across the genomes examined except in date palm (51.6% AG/CT). A content of CG/GC > 1% was observed only in rice, while this motif was almost absent in the other species .

The Distribution of SSRs in the Chromosomes
The eggplant SSR loci identified were additionally classified according to their motif and distribution over the pseudomolecules (Table 3 and Supplementary Table S5). About 21% of the identified SSRs were present in unanchored scaffolds (chromosome E00), while the average number detected on pseudomolecules E01-E12 was 8,746 and 11,673 for perfect and imperfect SSRs, respectively. We estimated the relationship between chromosome length and number of SSRs present on each of them and found a high correlation: R 2 = 0.96, P < 0.01 ( Figure 4A). The longest linkage group E01 contained the greatest number of SSRs (13,244 perfect, 17,432 imperfect, 103.9 Mbp), while the shortest (E09 and E05; 28.3 and 33.9 Mbp, respectively) hosted the fewest. However, substantial differences among chromosomes in the densities of SSRs were observed, ranging from 116.9 (E12) to 166.6 (E09) perfect SSR/Mbp (Table 3). On the whole a general trend of higher SSR density for short LGs (and vice versa) was observed, with the exceptions of E06 and E10 ( Table 3). The distribution of motif types within individual chromosomes was very similar to the pattern found over the whole genome ( Figure 4B), with di-and trinucleotide repeats the most frequent and penta-/hexanucleotides the scarcest. Mono-to trinucleotide SSRs exhibited maximum variation among LGs, with E03 and E09 showing, respectively, the lowest percentages for di-(38%) and tri-(32%) and the highest for mononucleotides (ca 12%). When considered the distribution of different motifs on a chromosome basis, the percentages of the most frequent mono-di-and tri-nucleotides were similar to those detected in the whole genome, while the relative contributions of the main tetra-, penta-and hexanucleotides varied between LGs (Supplementary  Table S3).

Gene Context of SSRs
Using data from assembled chromosomes of the eggplant breeding line "67/3" (see footnote 2), the genomic distribution of SSRs was compared to their association with individual genes. In all, 2,449 perfect SSRs (1.84% of the total) and 3,524 imperfect SSRs (1.98%) were associated with, respectively, 2,086 and 2,924 genes (  Table S6 reports the distribution over each pseudomolecule of the identified gene SSRs classified on the basis of their repeat motif. SSRs were widely scattered on the chromosomes, except for the mono-and part of the dinucleotide motifs, which were concentrated toward the chromosome ends (Figure 5), similar to the gene space distribution detected in other species (The Potato Genome Sequencing Consortium, 2011; The Tomato Genome Consortium, 2012; Kim et al., 2014). Some microsatellite hotspots were observed, mostly due to very long stretches of compound microsatellites (e.g.: motifs longer than 100 repetitions) and, since they were often found between neighbor genes (e.g.: SMEL_002g164820.1 and SMEL_002g164830.1), they could be involved in gene regulation (Gao et al., 2013;Sawaya et al., 2013). Moreover, they could be exploited as putative highly polymorphic markers, since most of the highly mutable loci are compound microsatellites, comprising two or more repeated motifs (Eckert and Hile, 2009). A comparison between SSR motifs detected in the global set of genomic and genic SSRs is reported in Figure 6. The total populations of SSRs across the genome and gene space were classified into triplet repeats (tri-and hexanucleotides) and non-triplet repeats (mono-, di-, tetra-and penta-nucleotides); a higher frequency in gene sequences of triplet repeats was detected for both perfect (73.1%) and imperfect (68.6%) motifs ( Figure 6A). Trinucleotides were the most common, representing 65.4% (12.2 SSR/Mbp) and 49.8% (13.3 SSR/Mbp) of perfect and imperfect genic SSRs, respectively; the next most frequent group within perfect motifs was dinucleotides (11.1%, 2.1 SSR/Mbp), while for imperfect SSRs hexanucleotides were the second most frequent group (18.8%, 5.0 SSR/Mbp) ( Table 4 and Figure 6B).
In analogy with other species (Toth et al., 2000;Morgante et al., 2002;Mun et al., 2006;Cavagnaro et al., 2010), even though dinucleotides are the most common repeats in the eggplant genome, tri-and hexanucleotides prevail in the gene space. This has been attributed to negative selection against frameshift mutations in coding regions; trinucleotides have increased their frequency in the coding portion as a result of mutation pressure and, possibly, positive selection for specific single amino-acid stretches (Morgante et al., 2002;Subramanian et al., 2003). The most frequent genic SSR motif types were the trinucleotides AAG/CTT (22.2%), ATC/ATG (9.5%), AGG/CCT, and ACC/GGT. AG/CT (5.1% of the total genic SSRs) were the most common dinucleotides (Figure 6C). Similar patterns of motif distribution have been observed in other species. In the those compared by Morgante et al. (2002), Cavagnaro et al. (2010), and Portis et al. (2016), AT/TA repeats seem to be typical of non-transcribed regions; AG/CT prevail in gene (B) REVIGO summary of "biological process" and "molecular function" enriched terms. The bubble size is proportional to the log 10 (p-value) of enrichment analyses and its color is also a function of log 10 (p-value) of enrichment analyses (blue: low, -red: high p-value). The x and y-axes reflect semantic similarity according to the REVIGO algorithm (similar GO terms appear close together).
sequences, while AC/GT and CG/GC repeats are the least frequent dinucleotides in both genomic and gene sequences. In the same way a strong bias in the distribution toward GC-rich motifs, most evident for di-tri-and hexanucleotides, was found in eggplant gene sequences. For example, trinucleotide SSRs in genes had a GC content of 43%, whereas a GC content of only 27% was found in the whole genomic trimeric SSRs.
Compared to non-coding microsatellites, genic SSR markers exhibit a higher portability among related species, boosting their utilization as anchor markers suitable for comparative genetics purposes (Varshney et al., 2005). As opposite, coding SSRs are under higher selection pressure (Toth et al., 2000), so they may not provide an adequate level of polymorphism to distinguish between closely related ecotypes/varieties. Indeed, eggplant genic SSRs provide a reduced set of potentially variable SSRs, as their repeat numbers are lower than in the whole genome, with 84.2% of SSRs containing ten or fewer repeats and only 3.3% having 20 or more repeats. However, genic SSRs represent a relevant class of 'functional markers' as their presence in transcripts has been shown to play a critical role in gene expression/function in both humans (i.e.: microsatellite instability, MSI; Brouwer et al., 2009;Nelson et al., 2013) and plants, where MSI is known to increase with plant development in Arabidopsis (Golubov et al., 2010). With this aim, we compared the set of SSR-containing genes in the reference eggplant gene space, assessing the specific gene regulation functions which are frequently present.
The genes containing one or more SSRs were found within 38 sub-GO categories of three main GO categories ["biological processes" (BP), "molecular functions" (MF) and "cellular components" (CC)]. Over-representation was found for a number of gene families (Figure 7 and Supplementary Table S7); for BP in the sub-categories "regulation of gene expression" (GO:0010468), "regulation of transcription" (GO:0045449), and "transcription" (GO:0006350); for MF, "nucleic acid binding" (GO:0003676) and "transcription regulator activity" (GO:0030528). No enrichment was observed for CC. The occurrence of SSRs within specific gene functions has previously been observed (Yu et al., 2010;Kujur et al., 2013;Liu et al., 2015;Portis et al., 2016;Scaglione et al., 2016) and transcription factors form a significant class of genes containing SSRs. Moreover, the critical role of transcription factors carrying microsatellites has been noted (Li et al., 2004), and this requires clarification in relation to species diversification within Solanaceae.

EgMiDB System Architecture, Features and Utility
The public domain EgMiDB available at www. eggplantmicrosatellite.org provides a searchable interface to the microsatellite data reported in this study. It offers similar features to the CyMSatDB database 9 . In brief, it can be used to retrieve SSRs based on a wide range of simple and complex searches, with single or chained queries. More than one chromosome may be selected. Researches can limit the search via location and number of markers in the required range. The output lists a wide range of all necessary information, including an optional download of the flanking sequences. A 'Design Primers' button provides detailed data including sequences, melting temperatures, etc., with the facility to download in Excel format (Figure 8).
EgMiDB is highly user-friendly, optimized to be accessed on all mobile devices. It implements input checking with autocorrection of errors. It: (i) displays default values or ranges; (ii) auto-corrects data immediately on entry rather than when a query is run; (iii) provides comprehensible error messages; and (iv) indicates when a field is valid. It also provides a simple interface for use by administrators when importing data into the system, and facilitates conversion of SciRoKo output into data formatted for EgMiDB.
With the marker information present in EgMiDB, the standard linkage map's marker density can be increased, and this will contribute to QTL and gene mapping studies. In order to facilitate the use of these markers, we have implemented several plugins 9 www.artichokegenome.unito.it/cymsatdb/ to generate primers at user-defined chromosomal locations, which can be directly exported for use in genotyping assays. Additionally, the developed SSR markers can also be used in DUS testing for variety identification in multiplex mode. From our large marker dataset, the identification of markers with thermodynamic compatibility for multiplex designing can be also accomplished.

Marker Validation
In addition to understanding the characteristics of SSR distribution, the development of molecular markers for eggplant based on these loci was one of the most important objectives in this study. An in silico validation exercise was carried out, taking advantage of available transcriptomic/genomic resources. Normal methods for validating SSR markers involve employing them to prime PCRs containing appropriate template DNA. Here, in silico validation was possible because of the sequence data available from five eggplant accessions. The CDS set derived from the inbred "67/3" line (The Eggplant Genome Consortium, 2017) provided validation of over one-third (385/999) of the markers, while the CDS set from cv. "Nakate-Shinkuro" (Hirakawa et al., 2014) validated a slightly lower proportion (322/999). The other three assembled transcriptomes gave from 233/464 to 283/283 non-ambiguous in silico PCR signals (Figure 9). By taking into account ambiguous amplification, validation percentages were substantially increased in fragmented sets (i.e., transcriptomes).
By using the genes predicted from the two sequenced eggplant varieties, almost 32-39% of primer pairs were predicted to be functional. When data from the three different assembled transcriptomes was included, the validation level was reduced to 23-28%, which can be considered satisfactory, since in previous similar studies validation success rates ranging from 50% FIGURE 9 | Simple sequence repeat in silico validation. Relative number of microsatellites showing positive signal following in silico validation of 1000 randomly selected loci in three transcriptomes (Yang et al., 2014;Ramesh et al., 2016;PRJNA247728, unpublished) and two CDS sets (from eggplant genomic projects; Hirakawa et al., 2014;The Eggplant Genome Consortium, 2017).  to 11% (Iquebal et al., 2013) have been reported. The presence of non-validated markers may be attributed to the presence of a large intron(s) between the pair of primer sites or the presence of primers in intron regions, because the sequences used for primer design are genomic-based. All the 999 validated SSRs were loci on genes, but only 345 (35%) of these genes were mono-exonic, a percentage close to the success rate found in the validation tests. Differences in validation between cultivar-derived datasets will also result from the genetic structure of the varieties used to assemble/predict genes. Even taking this into account, our first EgMiDB provides key information for genetics and breeding studies as well as the integration of un-mapped scaffolds for improvement and optimization of the eggplant genome sequence.

CONCLUSION
The main goal of the present study was to identify a large set of valid SSR markers in eggplant, by adopting key standards including: (i) an SSR length of not less than 15 bp and (ii) a minimum of four repeat units per SSR. These parameters were chosen since microsatellite polymorphisms and mutation rates correlate positively with increase in the number of repeat units (Weber, 1990) as this promotes higher rates of strandslippage during replication (Whittaker et al., 2003). A total of 133k/178k eggplant-specific perfect/imperfect microsatellite loci are reported here, which were mined using a whole genome bioinformatics survey. A large set of long/hypervariable and potentially variable SSRs has been identified. Because microsatellites are, as expected, ubiquitously distributed, we detected higher SSR repeat content for longer chromosomes (Zietkiewicz et al., 1994) as well as homogeneous distribution of SSRs. These inherent attributes of microsatellites make them desirable markers.
The database EgMiDB contains all the available information related to genomic and genic perfect/imperfect microsatellite loci, with unrestricted public access. Through an easy-to-use web interface and a highly customized primer-designing tool, it is optimized to cater for the needs of plant biologists and breeders. It is very flexible and allows access with user-defined options. The database will help in selection of microsatellites of a particular type and also in identification of all putative microsatellite sites within specific genomic regions, such as functional microsatellites present in gene sequences. Furthermore, it makes it possible the identification of markers with thermodynamic compatibility for multiplex designing by allowing a large number of samples to be screened in a given experiment. EgMiDB will thus contribute to uniform bioinformatics workflow strategy to achieve SSR genotyping tailored to eggplant, and represents a key tool for making progress in both basic genomics research and molecular breeding of the species. The database will be constantly updated as additional sequencing data for eggplant genotypes become available, allowing rapid and simple in silico identification of SSRs including those polymorphic.

AUTHOR CONTRIBUTIONS
EP, SL, FP, GLR, and AA conceived and designed the research. EP, LB, LT, and AA analyzed the data. FP and LV designed and constructed the web database. EP and AA drafted the manuscript. SL, LT, and GLR reviewed and edited the manuscript. All authors read and approved the manuscript.