Development of Genome-Wide Functional Markers Using Draft Genome Assembly of Guava (Psidium guajava L.) cv. Allahabad Safeda to Expedite Molecular Breeding

Guava (Psidium guajava L.), a rich source of nutrients, is an important tropical and subtropical fruit of the Myrtaceae family and exhibits magnificent diversity. Genetic diversity analysis is the first step toward the identification of parents for hybridization, genetic mapping, and molecular breeding in any crop species. A diversity analysis based on whole-genome functional markers increases the chances of identifying genetic associations with agronomically important traits. Therefore, here, we sequenced the genome of guava cv. Allahabad Safeda on an Illumina platform and generated a draft assembly of ~304 MB. The assembly of the Allahabad Safeda genome constituted >37.95% repeat sequences, gene prediction with RNA-seq data as evidence identified 14,115 genes, and BLAST n/r, Interproscan, PfamScan, BLAST2GO, and KEGG annotated 13,957 genes. A comparative protein transcript analysis of tree species revealed the close relatedness of guava with Eucalyptus. Comparative transcriptomics-based SSR/InDel/SNP-PCR ready genome-wide markers in greenish-yellow skinned and white fleshed-Allahabad Safeda to four contrasting cultivars viz apple-color-skinned and white-fleshed-Lalima, greenish-yellow-skinned and pink-fleshed-Punjab Pink, purple-black-skinned and purple-fleshed-Purple Local and widely used rootstock-Lucknow-49 were developed. The molecular markers developed here revealed a high level of individual heterozygosity within genotypes in 22 phenotypically diverse guava cultivars. Principal coordinate, STRUCTURE clustering, and neighbor-joining-based genetic diversity analysis identified distinct clusters associated with fruit skin and flesh color. The genome sequencing of guava, functional annotation, comparative transcriptomics-based genome-wide markers, and genetic diversity analysis will expand the knowledge of genomes of climacteric fruits, facilitating trait-based molecular breeding and diversifying the nutritional basket.


INTRODUCTION
Guava (Psidium guajava L.), a member of the Myrtaceae family, is a cross-pollinated perennial fruit tree and has the chromosome number 2n = 22 (Kumar and Ranade, 1952). Except for the triploid seedless type (Raman et al., 1971), almost all of its commercial varieties are diploid. A genome size of ∼495-538 MB has been reported for Brazilian white and red-fleshed diploid cultivars; however, tetraploid species such as Psidium cattleianum and Psidium acutangulum, have a larger genome size of ∼1,030-1,144 MB (Da Costa et al., 2008). In India, guava is the fourth most significant fruit crop after mango, banana, and citrus (Ray, 2002). Known as the "apple of tropics" (Nakasone and Paull, 1998), guava is grown in tropical and sub-tropical regions of the world, and is a rich source of nutrients, minerals, and vitamins (Thaipong et al., 2006;Mittal et al., 2020). Despite high economic importance, genomic information regarding guava remains scarce, reducing the slope of genetic gain.
Guava fruit consumer prefers less and soft seeded, prolonged shelf life, colored flesh, apple color fruits and grower demands wilt tolerant, fruit fly resistant, high yielding and climate resilient stress tolerant cultivars. Guava exhibits variability in the color of peel and flesh, flesh thickness, seed number, distribution of seeds, seed strength, and flavor that ranges from sweet to sour in different cultivars (Mehmood et al., 2014). The guava crop is frequently threatened by guava wilt, a complex problem caused by Fusarium sp., nematodes, and water logging conditions, demanding the development of resistant genetic rootstocks. Fruit flies, major pests of fruit and vegetable crops, decimate the rainy season guava crop every year in the subtropics. Furthermore, the lower shelf life of guava threatens its export potential, rendering it an underutilized fruit. Natural genetic variability and selection of favorable agronomic traits via traditional breeding programs is a long process that relies on the arbitrary rearrangement of alleles between two closely related parent plants (Nimisha et al., 2013). Also, traditional breeding is expensive and timeconsuming, particularly for fruit crops where the ontogeny cycle requires years due to the long juvenile phase (Longhi et al., 2013). These constraints can be overcome by resourceful strategies such as genomics-assisted breeding (GAB) (Varshney et al., 2005;Kole et al., 2015) or marker-assisted breeding (MAB) (Baumgartner et al., 2016), where the selection of fruit quality-related traits and resistance to biotic and abiotic stresses can be made at the seedling stage.
A tremendous surge in the availability of genomic resources in the last decade has been translated in accessing the variability in the germplasms (Jain et al., 2019). A shift from isozyme and random amplified polymorphic DNA (RAPD)based molecular markers to gene-based simple sequence repeat Abbreviations: AS, Allahabad Safeda; NGS, next generation sequencing; SNPs, single nucleotide polymorphisms; InDel, insertion/deletion; SSR, simple sequence repeat; MAB, marker-assisted breeding; GAB, genomics-assisted breeding; BLAST, Basic Local Alignment Search Tool; KASP, Kompetitive Allele-Specific PCR; PAGE, polyacrylamid gel electrophoresis; KEGG, Kyoto Encyclopedia of Genes and Genomes; BUSCO, Benchmarking Universal Single-Copy Orthologs; PCoA, principal coordinate analysis; QTLs, quantitative trait loci; RAPD, random amplified polymorphic DNA.
The possibility of genome and transcriptome assemblies due to decrease in the cost of next-generation sequencing (NGS) has significantly advanced genomic studies progress in the last few years. NGS has led to the decoding of many genomes and enhanced the knowledge of genome architecture and the development of many molecular markers (Goodwin et al., 2016). In the last decade, many fruit crops such as grapevine (Jaillon et al., 2007), papaya (Ming et al., 2008), apple (Velasco et al., 2010), strawberry (Shulaev et al., 2011;Hirakawa et al., 2014), Japanese apricot (Zhang et al., 2012), peach (Verde et al., 2013), pear Chagné et al., 2014), and mango (Wang et al., 2020) had been sequenced. In Myrtaceae species, Eucalyptus grandis (Myburg et al., 2014), Leptospermum scoparium (Thrimawithana et al., 2019), and "New Age" Chinese guava (Feng et al., 2020) have been sequenced. Here, we have sequenced the commercially important Indian guava cultivar Allahabad Safeda (AS).
Allahabad Safeda is the gold standard cultivar that has been commercially grown for ∼4 decades in India owing to attractive fruit size, high total soluble sugars, vitamin C content, and better organoleptic traits. It is involved in several breeding programs at the national level. To understand the genetics of guava and boost GAB/MAB programs, we generated a draft genome assembly of AS using Illumina Paired-end reads. A comparative transcriptome analysis of AS, Purple Local (PL), CISH-G5 (Lalima), Punjab Pink (PP), and Lucknow-49 (L-49) mapped to an AS draft genome assembly developed SSR-, InDel-, and SNP-based functional markers. An evaluation of a set of 233 markers was performed to study genetic diversity in a diversity panel of 22 P. guajava and 2 P. cattleianum genotypes to validate the markers. The markers developed in the study would be helpful in future breeding programs for guava.

Allahabad Safeda Genome Sequencing and Assembly
Allahabad Safeda leaf samples were collected from a 10-yearold tree maintained in the orchards of Punjab Agricultural University, Ludhiana, in Punjab (India). The leaf samples were flash-frozen in liquid nitrogen and stored at −80 • C before genomic DNA extraction. Following the instructions of the manufacturer, total DNA was extracted with DNeasy R Plant Mini-Kit (QIAGEN, Hilden, Germany). Two paired-end (PE) libraries with an insert size of 300 and 500 bp, respectively, were prepared using TruSeq R stranded DNA Library Prep (Illumina, San Diego, CA, United States) and sequenced using a HiSeq 2500 Illumina sequencer (Illumina, San Diego, CA, United States). All the raw files have been submitted to NCBI under Bioproject PRJNA557348 with Biosample SAMN12395251 under SRA SRR9865865 and SRR9865866.

Allahabad Safeda Gene Prediction and Genome Annotation
The de novo repeat identification was performed for the AS genome assembly by RepeatModeler v1.0.11 with RepeatMasker's repeat library downloaded from RepBase (https://www.girinst. org). RepeatMasker-version 2.1 (http://www.repeatmasker.org) was then used for complete identification of repetitive sequences, and the masked genome was further used for annotation.
Transfer RNA (tRNA) genes were predicted with tRNAscan-SE version 2.0.5 (Chan et al., 2021) using the prediction method infernal. The MAKER v 2.31.10 genome annotation pipeline (Cantarel et al., 2008) was used for gene prediction in the masked genome assembly. A Trinity-based AS transcriptome assembly (Mittal et al., 2020) was used as EST evidence. To generate ab initio gene predictions with the repeat masked assembly, SNAP (Korf, 2004) and AUGUSTUS v 3.2.2 (Stanke et al., 2006) were used. AUGUSTUS was trained using tomato training parameters. The MAKER pipeline was re-run with repeat, EST, SNAP hidden Markov models (hmm), and guava training parameters developed over tomato using AUGUSTUS as evidence.
All the predicted genes were annotated by comparing the protein sequences (output of MAKER-third round) against the Pfam hmm library of PfamScan (http://www.ebi.ac.uk/Tools/ pfa/pfamscan). Protein domain identification was performed by comparing the protein sequences to the InterPro database using InterProScan version 5.39-77.0 (http://www.ebi.ac.uk/interpro/ download/), with additional parameters of generating GO ids and detecting pathways. The transcript sequences of predicted genes were subjected to DIAMOND (Buchfink et al., 2014) BLAST against the n/r database of NCBI. The XML results from the DIAMOND BLAST were mapped and annotated using Blast2GO ver-5.2.5 (Conesa et al., 2005) to perform Gene Ontology functional classification of the genes, and Go ids were subsequently plotted with Web Gene Ontology Annotation Plot (WEGO2.0) (http://wego.genomics.org.cn/cgibin/wego/index.pl) to visualize the distribution of gene functions. Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) (Moriya et al., 2007) was used to perform the functional classification by assigning the guava protein transcript sequences to a pathway based on the KEGG database (www.genome.jp/kegg).  (Letunic and Bork, 2021).

Allahabad Safeda Phylogenetic Analysis and Generation of Pseudochromosomes
Draft pseudochromosomes of AS were generated using Chromosomer (version-0.1.3) (Tamazian et al., 2016). The genome of E. grandis (Myburg et al., 2014) was downloaded from Phytozome and used as a reference to generate the draft chromosomes of the AS genome.

Development of EST-SSRs/EST-InDel Markers
Ribonucleic acid sequencing for AS, PP, and Apple Color (AC) described previously (Mittal et al., 2020), PL (genotype that accumulates purple pigmentation in the foliage, flower buds, petals, and fruits at immature, mature, ripe, and overripe stages; Mittal et al., unpublished), and L-49 used as root stock (Mittal et al. unpublished) was performed. Trinitybased transcriptome assemblies for PP, AC, PL, and L-49 were generated as that for AS [described in (Mittal et al., 2020)]. PL, L-49, AC, and PP RNA-seq assemblies subjected to BLASTn against AS transcriptome identified polymorphic InDels. Primer3 version 0.40 (http://primer3.sourceforge.net/) was used to design the primers from positions flanking Indels/SSRs with gaps ≥ 8 bp and with a product size range of 80-120 bp, primer length of 20-27 bp with Tm ranging from 50 to 60 • C, and GC content of 40-60%. The InDel/SSR sites were mapped to the AS genome assembly to identify the coordinates and distribution over 11 pseudochromosomes.
The designed 93 InDel and 15 SSR markers were tested for polymorphism on 24 guava genotypes. PCR products were amplified in a thermocycler with denaturation at 95 • C for 3 min, 95 • C for 30 s, 55 • C as annealing for 30 s, and 72 • C as amplification step for 1 min with a repeat of 35 cycles and a final elongation step of 5 min at 72 • C in a 10-µl reaction. The reaction was performed with 7 ng DNA as a template and .5µM forward and reverse primers in GoTaq R Green Master Mix (Promega Biotech India Pvt. Ltd., New Delhi, India) in the presence of 10 mg/ml BSA (0.5 µl) and 10 mg/ml PVP (0.5µl) as adjuvants. The amplified products were resolved on 6% polyacrylamide gel containing ethidium bromide in a vertical gelelectrophoresis system (C.B.S Scientific Co., San Diego, California, USA) and visualized with an Alphaimager HP gel documentation system (ProteinSimple, San Jose, CA, United States) with the AlphaView software.

Development of EST-SNP Markers and KASP Assay
Bowtie2 (Langmead and Salzberg, 2012) was used for indexing the genome and mapping the RNA-seq reads of five genotypes on the AS genome. To obtain a variant call format (VCF) file that includes SNP information, the sequence alignment/map format (SAM) files were converted to binary sequence alignment/map format (BAM) files with SAMtools (Li et al., 2009) and subjected to SNP calling with Freebayes (Garrison and Marth, 2012). The high-confidence SNPs were selected with VCFtools (Danecek et al., 2011) with criteria such as Indels and SNPs above quality score > 60 and read depth >20. Polymarker (Ramirez-Gonzalez et al., 2015) was used to design the Kompetitive Allele Specific PCR (KASP) assay with the diploid genome parameter.
To confirm that the SNPs obtained were not because of heterozygosity within the genome, the RNA-seq reads of AS, PP, AC, PL, and L-49 were further screened with visualization tool Integrative Genomics Viewer (IGV) (Robinson et al., 2011) with the AS genome as a reference. SNPs that were not called because of heterozygosity within the genome and were flanked by a region of conserved 50 bp with SNP depth >25 were selected for the KASP assay. Near equidistant markers spanning the 10 pseudochromosomes were synthesized (Integrated DNA Technologies, IDT R , Coralville, IA, United States). The KASP assay was subjected to 24 genotypes in 384 well format ABIthermocycler with PCR profile 95 • C for 15 min for enzyme activation and DNA denaturation, 95 • C for 20 s for subsequent DNA denaturation and 64 • C for 1 min for coupled annealing and amplification as a touchdown for 10 cycles, and 57 • C for 1 min for the next 30 cycles. The 4-µl reaction was set up as a 10-ng genomic DNA template, 2-µl LGC master mix (LGC Biosearch Technologies, Hoddesdon, United Kingdom), .054 µl of the primer mix [forward1 (12 µM)/forward2 (12 µM)/reverse common primer (30 µM)]. After PCR, the plate was read in an Infinite F200 Pro (Tecan, Mannesdorf, Switzerland) fluorescent reader and then assessed for polymorphism on KlusterCaller version 3.4.1.36 by the LGC Genomics software.

Annotation of Synthesized SSR/InDel/SNP Markers
All the synthesized primers were subjected to blastn with default parameters against the AS transcriptome. The components with full query coverage were searched for Gene Ontology (Mittal et al., 2020). Subsequently, GO ids were visualized with WEGO for binning into cellular components, biological functions, and molecular processes. All the genes harboring molecular markers mapping to the genome were analyzed for transcript expression in colored vs. non-colored tissues with edgeR (Robinson et al., 2009).

Genetic Diversity Analysis in Guava
Allelic information on the 22 guava genotypes (variable for flesh and peel color; Table 1) generated by SSR/InDel/SNP markers was computed with the STRUCTURE 2.3.4 software (Pritchard et al., 2000) with Burnin 250,000, Monte Carlo Markov Chain (MCMC) 750,000, and K 1-15 with 10 iterations each. The most probable K value was inferred using the DELTAk "Evanno method" (Evanno et al., 2005) using the STRUCTURE HARVESTER web (Earl and vonHoldt, 2012). Principal coordinate analysis was run for all the genotypes with GenAlEx6.5.1b2 (Peakall and Smouse, 2012). The neighborjoining tree was constructed using MEGAX (Kumar et al., 2018). Calculations of Shannon's Information Index (I), effective alleles (Ne), different alleles (Na), number of loci with private alleles (PA), unbiased expected heterozygosity (uHe), the expected heterozygosity (He), and polymorphic information content (PIC) were also performed. Na was calculated by the direct count of alleles across subpopulations per loci and averaged by the arithmetic mean across loci per sub-population. Ne was calculated using He by the formula: Here, p is the frequency of the allele. I was calculated (per locus and averaged across the number of loci) using the formula: where ln is the natural logarithm of p i , i.e., is frequency of ith allele, and private alleles are the alleles unique to the subpopulation. The expected heterozygosity (He) provides the probability that the two individuals would be different and was calculated using the formula: Unbiased expected heterozygosity (uHe) was calculated using the allele frequency and sample size (n) with the formula: An analysis of molecular variance for white flesh, apple color, and pink flesh subgroups for within population, among population, and within an individual was performed, and the number of migrants (Nm) was calculated. The PIC value for molecular markers was calculated with PowerMarkerV3.25 (Liu and Muse, 2005).

RESULTS
The AS Table 2). A BUSCO analysis (using 2,121 genes of eudicotyledons) revealed that 89.9% of the genes in the guava genome were conserved with 87.3% complete and single-copy (S), 2.6% complete and duplicated (D), 5.8% fragmented (F), and only 4.3% missing (M) gene models (Figure 1, Supplementary Figure 1). Taken together, it suggests a good-quality draft genome assembly of AS for genome annotation and molecular marker development.   tRNAs were identified as coding for 20 amino acids; however, 53 were pseudogenes. Only 29 tRNA genes exhibited the presence of introns. Gene prediction in the masked draft genome assembly in the first round of MAKER identified 16,392 ab initio gene models with EST evidence from the previous RNA-seq of AS (Mittal et al., 2020). Using gene models computed with SNAP (hmm file), retraining parameters of S. lycopersicon as a training model with AUGUSTUS on the repeat masked AS genome, EST-GFF, and RM GFF, the second round of MAKER developed 9,509 trained gene models. Re-running SNAP and AUGUSTUS on secondround results of MAKER developed new hmm and guava-based retraining parameters. The re-run of MAKER (third round) with new hmm guava retraining parameters, new EST-GFF, and new RM-GFF files predicted 14,115 genes.
A functional annotation for the predicted genes discovered 11,686, 10,981 (8,534 domains, 1,746 repeats, 7,421 families, and 126 motifs), 13,854, and 13,840 genes displaying significant similarities to known proteins in the InterPro, Pfam, NCBI n/r, and GO databases, respectively (Figure 2A,  Supplementary Data Files 3-6), and 13,957 (98.8%) genes were annotated in at least one of these four databases. A Gene Ontology functional classification of the predicted genes with WEGO assigned 7,766, 7,322, and 8,440 genes to the GO classes of biological process, cellular component, and molecular function, respectively ( Figure 2B). A KEGG classification with KAAS assigned annotated genes to 389 pathways (Supplementary Data File 7).

Guava Exhibits a Close Evolutionary Relationship With Eucalyptus
To investigate the relationship of the guava genome with other species, we performed gene family clustering of P.  Table 2). The rooted species tree obtained with the STAG inference of orthofinder was generated with iTOL, which placed guava adjacent to E. grandis (Figure 3).

Fifty Percent Draft Guava Genome Maps Over 11 Chromosomes of Eucalyptus
Myrtaceae is a large family of dicotyledonous woody plants containing 130-150 genera (Grattapaglia et al., 2012), with Eucalyptus and Psidium being economically important genera. Also, owing to the close evolutionary relationship depicted by Orthofinder, we developed pseudochromosomes in guava using Eucalyptus as reference. Chromosomer (Tamazian et al., 2016) builds pseudochromosomes from genome contigs or scaffolds using alignments to the chromosomes of reference provided by a closely related species. With an average nucleotide identity of 81.411% between AS and E. grandis, the ∼152 Mb guava genome was mapped on E. grandis, resulting in 11 pseudochromosomes.

Development of Genomic-SSR and Functional SSR/InDel/SNP Markers in Guava Deploying Comparative Transcriptomics
A total of 188,183 SSR loci were identified in the AS genome with the MISA script (Beier et al., 2017) and included 21,295   Data File 8). Of the repeat motifs observed, mono-nucleotide repeats were the most abundant, constituting 70.3% of all the SSRs, followed by di-21.1%, tri-6.7%, tetra-1.1%, penta-0.3%, and hexa-0.02% nucleotide motifs (Figure 4A, Supplementary Data File 8).
Out of the 108 primers tested, 90 (84.11%) were able to detect polymorphism in 24 genotypes (Table 1), with alleles ranging from 1 to 4 and average PIC value of 0.267. A subset of 8 such markers exhibiting biallelic polymorphism is shown in Figure 4B.
A total 386,952 sequence variants were obtained with Freebayes using variant calling format files generated on PL, PP, L-49, and AC compared with AS. The variants identified included 368,556 SNPs and 18,396 InDels. After filtering with VCF tools, all the InDels were removed and based on depth (≥20 reads) and quality score (≥60), 231,632 SNPs were obtained. Among the SNPs, transitions accounted for 58% of the SNPs, where A↔G transitions were more as compared with C↔T. Transversions accounted for 41.35% of the SNPs, where T↔A transversions were the highest followed by G↔T, A↔C, and G↔C. The high-quality SNPs were used as an input with a 50-bp flanking sequence region in the Polymarker software resulting into 62,722 primer sets distributed over the guava genome (Supplementary Data File 10). The primer sets over near-equal distances spanning the 11 pseudochromosomes of guava were further screened with Integrated Genome Viewer (Supplementary Figure 2) for high-quality SNPs flanked by ≥50-bp homozygous regions. A total of 130 such high-quality near equidistant markers (Supplementary Table 4) were amplified in 24 genotypes, and 106 were found polymorphic distinguishing the genotypes, genetically. More than 61% of the markers exhibited a PIC value > 0.3. Figure 5 shows the scatter plots generated with Klustercaller by running the KASP assay with primers on all the 11 pseudochromosomes demonstrating the utility of such biallelic markers for guava molecular breeding.
The genome wide genomic SSR and EST-InDel/SNP markers spanned over the 11 guava chromosomes are shown in Figure 6. Also, markers exhibiting polymorphism in the 24 genotypes are shown in the inner rings of the circos plot. The functional analysis of 233 tested markers with WEGO shows that 130 of them are involved in important cellular components, biological processes, or molecular functions ( Figure 7A). We compared the transcript expression of a mature fruit of PL to AS, PP to AS, and red peel of Lalima to green peel. The heat map ( Figure 7B) shows that the differential expression of genes in colored vs. non-colored tissues is associated with many markers.

Newly Developed Markers Clustered Diverse Guava Genotypes/Cultivars in Accordance to Flesh/Peel Color
The principal coordinate analysis (Peakall and Smouse, 2012) was performed using the 186 validated biallelic SSR/InDel/SNP markers to investigate population clusters across 22 genotypes ( Figure 8A, Table 1, Supplementary Data File 11). Accordingly, the PCoA plot indicated that the 5 fruit-flesh/skin-based genotypes generally clustered separately. The apple color peel genotypes clustered together, compared with the pink-fleshed being relatively scattered. Pink-fleshed genotypes were more closely related to the white-fleshed than the apple color, yellow, and purple-fleshed. Purple/yellow-fleshed and apple color genotypes were the most diverse and did not show overlap ( Figure 8B).
The population structure of the 22 genotypes was inferred using the clustering program STRUCTURE, testing for 1 to 15 clusters (K) (Figure 8C). Evanno's correction revealed the peak of delta K at K = 3, suggesting the presence of three main clusters (Supplementary Figure 3), and an additional peak at K = 6 suggests the presence of a substructure. Based on the optimal K = 3 by STRUCTURE, cluster 1 consisted of four pinkfleshed (PP, Hisar Surkha, Punjab Kiran and 17-16), a yellowfleshed (Portugal), and six white fleshed genotypes (Hisar Safeda, Arka Amulya, Punjab Safeda, L-49, Seedless, and AS). Cluster 2 consisted of PL and VNR Bihi, and cluster 3 consisted of all apple color genotypes, Shweta, and the remaining pink-fleshed genotypes (Supplementary Figure 4).
At K = 5, cluster 1 divided into three separate clusters where one cluster had three pink-(PP, Punjab Kiran, and 17-16), two white-(Punjab Safeda, L-49), and a yellow fleshed genotype (which remained the same even at K = 10). Another cluster had AS and Arka Amulya clustered together, and the last cluster had Hisar Surkha, Hisar Safeda, and Seedless (Supplementary Figure 4). At K = 10, all the apple-colored genotypes formed one cluster; two white-fleshed (Shweta, Seedless), two pink-(Lalit and Arka Kiran), and two white-fleshed (AS and Arka Amulya) formed three clusters separately, while Seedless displayed maximum admixture (Supplementary Figure 4). At K = 15 the trend of clusters remained the same. However, Portugal, Arka Amulya, and Shweta also displayed a high level of admixture. Interestingly, the apple-colored genotypes were never assigned to a new independent cluster as K was increased. The results of STRUCTURE were in coherence with the results of the PCoA analyses. The neighbor-joining tree constructed with MEGAX (Kumar et al., 2018) was also consistent with the results of the PCoA and STRUCTURE analyses ( Figure 8D). In conclusion, the analysis with three independent methods supports the division of these diverse genotypes correlated with fruit color; however, pink-and white fleshed ones show an overlap.
Furthermore, we examined the allelic patterns of the subgroups and found them highly diverse. The mean Shannon's information index (I), the expected heterozygosity (He), and the unbiased expected heterozygosity (uHe) were high (0.289, 0.198, and 0.255, respectively), suggesting a high level of diversity among and within the genotypes ( Table 4). The highest diversity was shown by the subgroup consisting of white flesh color with I, He, and uHe of 0.506, 0.342, and 0.365, respectively. Interestingly, despite the average Na, Ne, and PA of 1.499, 1.345, and 0.12 in the 22 P. guajava cultivars, the highest Na, Ne, and PA (1.909, 1.594, and 0.011) was observed for the subgroup with white flesh color. With 11 unique loci contributing to private alleles, eight genotypes across four out of five subgroups carried private alleles (Table 5). With six being the highest number of private alleles for PL, these might be linked with the unique characteristics of the genotype (Mittal et al. Unpublished). The analysis of molecular variance (AMOVA) showed that out of the total variation in the subgroups white flesh, apple color, and pink flesh, only 12% variation was among the subgroups and 19% variation was among the individuals, while most of the variation was within each individual (69%), depicting high-level heterozygosity of the examined loci ( Table 6). The Nm was low at 1.905, depicting a low number of gene migrations among the subgroups (Table 6). Also, the PCoA graph exhibited very less overlap among the subgroups (Figure 8B). A low number of gene 4 | Allelic pattern diversity indexes for each subgroup based on the number of different alleles (Na), number of effective alleles (Ne), Shannon's information index (I), private alleles frequency (PA), expected heterozygosity (He), and unbiased expected heterozygosity (uHe).

Allahabad Safeda Evolutionary Relationship
A draft genome of heterozygous Psidium guajava cultivar AS has been successfully assembled by Illumina-based NGS at ∼50x coverage. Owing to the scarce genomic resources in tropical and subtropical fruit species as compared with temperate families such as Vitaceae, Rosaceae, and Rutaceae, the genome assembly of guava provides a foundation for evolutionary studies, comparative genomic investigation of the unique biological characteristic, intraspecific genome diversity, and molecular breeding in this nutraceutical rich crop. Repeat content in guava (37.95%) is at par with the repeat content of smaller genomes of similar sizes such as 35% in rice (Matsumoto et al., 2005), 40.5% in mango (Wang et al., 2020), 41.4% in grapevine (Jaillon et al., 2007), 44.5% in Eucalyptus (Myburg et al., 2014), and 45% in Citrus clementina as compared with the larger genomes that consist almost entirely of repetitive sequences such as wheat and maize with a repeat content of ∼85% (Schnable et al., 2009;Appels et al., 2018). Therefore, the guava genome provides additional evidence that there is a linear increase in the repeat-sequence content of the genome with genome size (Novák et al., 2020). Comparing predicted proteins in AS with 18 other plant species demonstrates close evolutionary relatedness between E. grandis and P. guajava. All the monocots M. acuminata, A. comosus, Z. mays and O. sativa grouped together, while the Rosaceae species P. persica, M. domestica, and F. vesca clustered separately. The precision of comparative genome analysis depends on deciphering proteomes as to how their genomes were assembled and annotated (Chagné et al., 2014). In AS, we identified 764 species-specific genes that did not have orthologs detected in the other species (Supplementary Table 2). Further analysis of these proteins using a more comprehensive array of species for comparison would be required to determine whether these proteins encode for traits specific to guava.

Guava Expressed Genome-Based Markers-Leveraging Targeted Molecular Breeding
Owing to the long juvenility period and difficulty in developing standard mapping populations like that in cereals, pulses, and other annuals, we need a large number of molecular markers to make MAS possible in fruit tree species. In guava RAPD, ISSR, AFLP, COS (Prakash et al., 2002;Risterucci et al., 2005;Rai et al., 2012), and, very recently, genomic SSR markers (Kumar et al., 2020) have been developed and used for germplasm characterization and genetic diversity analysis. Under the aegis of the European Union Project "GUAVAMAP, " SSR markers for P. guajava were developed with a traditional approach involving the construction of an SSRenriched library followed by cloning and Sanger sequencing. The first molecular linkage map was developed with AFLP and COS library based on MADS-, HOMEO-box, and RGL sequence-derived markers in a bi-parental F 1 population (MP1) of a cross "Enana Cubana roja × N6" (Valdés-Infante et al., 2003). Next-generation sequencing tools have equipped the researchers with precise technologies for marker development, thus evading laborious methods that involve the construction of genomic and cDNA libraries followed by cloning and sequencing. The annotation of coding sequences and NGS-based EST comparison among colored vs. white flesh, apple color vs. green skin color, soft seed vs. hard seed, and short vs. long shelf life should lead to the generation of EST-SSR and EST-SNP makers. Such markers may directly be associated with the traits of interest (Varshney et al., 2005). However, in the absence of genome assembly, the generation of markers for targeted breeding is difficult. Therefore, the ultimate goal of the assembled genome here is to serve as a guideline in developing tools for MAB in guava. We identified an abundant number of SSRs and SNPs, and structural variations such as InDels spread across the whole genome, which will be highly useful in developing functional markers for guava breeding. MYB-based (Takos et al., 2006;Ban et al., 2007;Chagné et al., 2016) and red TE-based specific markers (Zhang et al., 2019) developed to distinguish the red skin from non-red in apple; SNP-based markers for Alternaria brown spot in citrus (Cuenca et al., 2016), markers for citrus Tristeza virus resistance, and for demarcating polyembryonyin citrus (Gentile et al., 2020) are several utilized examples emphasizing the necessity of marker development for the pre-selection of hybrid seedlings for commercially important traits. Here, we developed high throughput genomic SSR markers from genome assembly (Supplementary Data File 9). The Validation of 15 SSRs on the 24 genotypes promises their utilization for genetic mapping, as already shown independently (Kumar et al., 2020). The comparative transcriptomic-based InDel and KASP assay markers on a genome-wide basis make this study the first report on the development of function-based markers in guava. Out of the 233 markers tested, 195 were able to detect polymorphism in the 24 genotypes studied. The biallelic markers exhibit a PIC value of 0-0.5 (Botstein et al., 1980). In the case of the authors, they identified a high average PIC value of 0.279, emphasizing the high potential utility of the newly developed markers for genetic diversity analysis. Moreover, 130 of these polymorphic markers were involved in cellular, biological, and molecular functions. This novel set of SSR/InDel/SNP-based markers ensures expedited molecular breeding programs in guava. Furthermore, the identification of >60,000 KASP assay-ready SNPs in phenotypically diverse genotypes (Supplementary Data File 10) provides a gold mine for developing a plethora of genome-wide biallelic markers, the heart of targeted association/bi-parental mapping.

Guava Population Structure and Diversity of Indian Cultivars
Population structure analysis is often the primary step for understanding genetic diversity, performing genome-wide association mapping to identify true marker-trait associations, and identifying genes associated with traits (Luo et al., 2019). The genetic diversity and population structure of guava genotypes in India are underexplored, so germplasm characterization should be instrumental in selecting diverse parental genotypes for varietal improvement (Kumar et al., 2020). SSR markers developed (Risterucci et al., 2005) under the aegis of GUAVAMAP have been utilized for characterization and genetic diversity analysis of guava germplasm worldwide (Viji et al., 2010;Priya et al., 2011;Coser et al., 2012;Sitther et al., 2014;Kherwar et al., 2018;Ma et al., 2020). Recently developed 26 g-SSRs (Kumar et al., 2020) were also used to estimate genetic diversity. However, 186 novel EST-based InDels and SNP markers for population structure analysis in guava is an important step forward for functional marker utilization in this fruit crop. These EST-markers clustered distinctly the apple-colored, purple-fleshed, yellow-fleshed, white-fleshed, and pink-fleshed genotypes (Figure 8). The NJ tree placed all the apple-colored genotypes next to each other, and white fleshed AS and Arka Amulya together with L-49, explaining the fact that L-49 is a chance seedling from AS. The genotype 17-16 is the female parent of Punjab Kiran and HB88, and the clustering showed all three in a single group. Hisar Safeda and Hisar Surkha grouped together explaining the fact that other than the difference in flesh color, the tree habit, foliage, fruit shape is so similar that it is hard to differentiate between two cultivars without cutting open the fruit. A similar genetic inference has also been derived independently (Kumar et al., 2020). Most of the pink-fleshed and white-fleshed genotypes showed dispersed clusters rather than clustering independently, raising the demand for a larger number of markers on a genome-wide scale. However, it might even be more complicated like Hisar Safeda and Hisar Surkha that pink-fleshed and white-fleshed genotypes are genetically similar (Kherwar et al., 2018;Kumar et al., 2020).
Private alleles provide important information on identifying distinctive genetic variability at loci and diversified genotypes, which could be employed in crop breeding to enhance allele affluence in a population (Borba et al., 2009;Salem and Sallam, 2016). The calculation of private alleles reveals the allelic information of a certain predefined subpopulation. The six loci bearing the private alleles in PL signify that these loci are critical for explaining the phenotypic diversity of this unique cultivar. Thus, studying the genes in the genomic regions bearing these loci is important to establish the role of these genes in providing unique phenotypic qualities to PL.
Overall, the draft genome, gene information linked to biological and other resources such as high-throughput and ESTbased InDel/SNP markers, developed in the this study provides crucial information on the genus Psidium as well as the Myrtaceae family and lays the ground for improvement of quality and agronomic traits by gene mapping in biparental populations for fruit quality, biotic and abiotic stresses, genome-wide association, and comparative genomics. The draft genome provided here can be used as a reference to re-sequence the Psidium germplasm for mining candidate gene-based high-throughput markers and developing SNP arrays for guava breeders. The availability of such information would make genomic selection possible for guava breeding programs.

DATA AVAILABILITY STATEMENT
Data has been submitted to NCBI under Bio project PRJNA557348, Biosample SAMN12395251 under SRA SRR9865865, SRR9865866, and genome assembly PAU_PgAS_1 with NCBI accession VSKU00000000. All data and materials generated or analyzed during this study are included in this article or are available from the corresponding author on reasonable request. The Guava Genome Annotation & Markers are available at https://doi.org/10.6084/m9.figshare.14573262.v1.

AUTHOR CONTRIBUTIONS
AM and IY conceived the idea. AM, MJ, ST, NA, and RB collected the plant material. IY, ST, PS, and AM performed the bioinformatics analysis. ST, MJ, and AM performed the wet lab experiments. NA, RB, and MG maintained the orchards, provided the plant material, and coordinated the phenotypic analysis. PC, GD, and ST conducted KASP marker development and genetic analysis. ST and AM wrote the paper. AM, IY, PC, and MG coordinated the overall experimentation. All authors read the manuscript and approved it.

FUNDING
The experiments were supported by the self-financing scheme of Punjab Agricultural University, Ludhiana and Department of Biotechnology Grant No: BT/PR24373/AGIII/103/1012/2018 to AM.