A chromosome-level genome assembly of Chinese quince (Pseudocydonia sinensis)

Introduction Pseudocydonia sinensis, also known as Chinese quince, is a perennial shrub or small tree highly valued for its edibility and medicinal properties. Method This study presents the first chromosome-level genome assembly of P. sinensis, achieved using HiFi sequencing and Hi-C scaffolding technology. Results The assembly resulted in a high-quality genome of 576.39 Mb in size. The genome was anchored to 17 pseudo-chromosomes, with a contig N50 of 27.6 Mb and a scaffold N50 of 33.8 Mb. Comprehensive assessment using BUSCO, CEGMA and BWA tools indicates the high completeness and accuracy of the genome assembly. Our analysis identified 116 species-specific genes, 1196 expanded genes and 1109 contracted genes. Additionally, the distribution of 4DTv values suggests that the most recent duplication event occurred before the divergence of P. sinensis from both Chaenomeles pinnatifida and Pyrus pyrifolia. Discussion The assembly of this high-quality genome provides a valuable platform for the genetic breeding and cultivation of P. sinensis, as well as for the comparison of the genetic complexity of P. sinensis with other important crops in the Rosaceae family.


Introduction
Pseudocydonia sinensis (Thouin) C. K. Schneid., also known as Chinese quince, is a shrub or small tree belonging to the genus Pseudocydonia in the subfamily Maloideae of Rosaceae (Figure 1A).It is the only species in the genus Pseudocydonia.This species is native to the central and eastern regions of China, and introduced to numerous countries around the world (Lu et al., 2003).P. sinensis blooms from March to April and fruits from June to July.The plant produces pale red flowers, and its fruits are yellow-green, oblong in shape, emit distinct fragrance and have a noticeable tart taste.P. sinensis is valued for its ornamental beauty, edibility, and medicinal value.In various regions of China, the fruit is consumed by locals through boiling in water or preserving in sugar.Extracts from the fruit are also widely used in the production of candies, beverages and desserts.Medically, the dried P. sinensis fruit is known for its hangover relief and expectorant properties (Grygorieva et al., 2020).
Karyotype analysis of P. sinensis revealed that the species has a small genome size, with a diploid number of 34 chromosomes, ranging from 1.0 to 1.8 mm in length (Iwatsubo et al., 2022).Despite being widely cultivated as an ornamental and dual-purpose plant, no complete genome of P. sinensis has been sequenced to date, posing a significant limitation for the breeding and evolutionary studies of this species.Here, we present the first chromosome-level genome assembly of P. sinensis, relying on Hifi sequencing and Hi-C scaffolding technology.Utilizing this high-quality genome assembly, we conducted a comparative genomic analysis between P. sinensis and 12 other Rosaceae species, and investigated the genomic structural differences between Pseudocydonia sinensis, Crataegus pinnatifida and Pyrus pyrifolia.

Samples collection and DNA extraction
Samples were collected from an agricultural plantation in Dali, Yunnan province (E 100.191766, N 25.690538), comprising leaves, stems, and fruits of a single P. sinensis tree.high-quality genomic DNA was extracted from the sampled leaves using the CTAB method (Porebski et al., 1997).Subsequently, the purity and concentration of DNA were assessed with Nanodrop (Technologies, Wilmington, DE), Qubit 3.0 fluorometer, and electrophoresis on a 1% agarose gel.

Genome survey of Pseudocydonia sinensis
The genome size of P. sinensis was estimated using k-mer method (Marcais and Kingsford, 2011) based on Illumina genomic DNA sequencing data.A high-quality Illumina DNA library was constructed and sequenced using Illumina NovaSeq platform with the PE150 layout.This process yielded a total of 70.3 Gb of raw sequencing data.To ensure data reliability, stringent quality filtering was applied to the raw data.Ultimately, 70.2 Gb of high-quality clean reads were obtained for use in genomic exploration and refinement.Quality-filtered reads were subjected to k-mer analysis using Jellyfish 2.0 program (http:// www.genome.umd.edu/jellyfish.html).

DNA and RNA extraction and sequencing
For PacBio HiFi sequencing, high quality genomic DNA was extracted and purified from P. sinensis leaves.DNA samples that passed quality checks (main band >30kb) were selected to be randomly fragmented into pieces (15-18kb).The DNA fragments were enriched and purified followed by end repaired.Adapters were ligated to both ends of the nucleic acid fragments, and a library was constructed by removing unsuccessfully ligated fragments with exonuclease.The constructed library was then sequenced on the PacBio Sequel II platform, and the raw data was processed using the CCS program (https://github.com/PacificBiosciences/ccs) to generate HiFi reads.
The Hi-C library was prepared according to Belton et al. (2012) and Shi et al. (2019) with a modification.The extracted genomic DNA was randomly fragmented into pieces and labeled with biotin-14-dCTP.A library was constructed followed by blunt-end repaired, A tailing, adapter ligation, purification and PCR amplification.The Hi-C libraries were quantified and sequenced on the Illumina NovaSeq platform using the PE 150 layout, yielding 65.5 Gb of data.Quality control of Hi-C raw data was performed using the software HiC-Pro v2.8.0 (Belton et al., 2012).
Total RNA Extraction Kit (RNAprep Pure DP441) was used to isolate total RNA from three different tissues (leaf/stem/fruit) of a single P. sinensis tree.Eukaryotic mRNA was enriched from the total RNA.Single-stranded cDNA was synthesized using random hexamers as primers with the mRNA as a template, and thus to synthesize double-stranded cDNA, resulting in the final sequencing library.The qualified libraries were pooled and sequenced on the Illumina platform at Novogene Bioinformatics Technology Co., Ltd.(Beijing, China).
Hi-C technology was used to provided long-range interaction information to achieve global phasing of the genome (Burton et al., 2013).The sequenced Hi-C data were assembled at the chromosomal level using Allhic software (https://github.com/tangerzhang/ALLHiC) (Dudchenko et al., 2017).The Juicebox software (Dudchenko et al., 2018) was used to manually correct the chromosome interaction intensity, and finally obtain the genome at the chromosome level.Chromosomal interaction heatmap was used to visualize the interaction matrices of each chromosome (Wolff et al., 2020).
The time calibration was conducted based on available TimeTree (http://timetree.org/)fossil records.The fossil time calibration was conducted based on the root nodes of A.thaliana and Rosaceae species, as well as the root nodes of C. pinnatifida and P. pyrifolia, resulting in a rooted tree with fossil time calibrations.The divergence times between species were calculated using the MCMCTree in the PAML software (Yang, 2007).FigTree (http:// tree.bio.ed.ac.uk/) was used to visualize the phylogenetic tree with divergence times.The calculation of contraction and expansion gene families for each lineage was conducted using CAFE5 software (https://github1s.com/hahnlab/CAFE)(De Bie et al., 2006).Contraction and expansion genes families of P. sinensis were further analyzed for the GO enrichment analysis using ClusterProfile software (Yu et al., 2012).
The genomes of P. sinensis, C. pinnatifida and P. pyrifolia were selected for the analysis of whole-genome duplication (WGD) events and selective pressure analysis.The WGD event was determined by the fourfold synonymous third-codon transversion (4DTv) values (Yang and Nielsen, 2000).JCVI (https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version)) was used to identify the homologs among three species.ParaAT (https:// ngdc.cncb.ac.cn/tools/paraat) was used to perform protein-coding DNA alignments for these homologs.The non-synonymous/ synonymous substitution (Ka/Ks) values were calculated to assess the selective pressure using the YN00 program of the PAML software with default parameters (Yang, 2007).The Ks values and 4DTv were visualized using the ggplot2 package in R. MCScanX (Wang et al., 2012) was used to identify syntenic regions and generate a synteny plot between the three species.

Genome sequencing and assembly
Illumina sequencing yielded 70.2 Gb of clean reads with a coverage depth of 119.25X.Genome survey analysis using kmer (k=17) indicated a primary peak around depth=36, and the estimated genome size calculated by the formula Kmer-number/ depth is approximately 679.04 Mb, with a corrected genome size of 664.59 Mb.The genome heterozygosity rate is 0.62%, and the proportion of repetitive sequences is 55.05% (Supplementary Table 2, Supplementary Figure 1).
A high-quality DNA sample was used to construct a SMRTbell library, which was sequenced on the PacBio Sequel II platform with a coverage depth of 35.78X, yielding a total of 21 Gb of HiFi reads after a series of processing steps.The HiFi reads comprised 2,456,334 reads, with an average read length of 8,660 bp, a N50 read length of 12,347 bp, and the longest read length being 49,180 bp (Supplementary Table 3, Supplementary Figure 2).Hifiasm resulting in a primary assembly comprising 301 contigs, with a total contig length of 576.39 Mbp and a contig N50 length of 27.6 Mbp (Supplementary Table 4).
The Hi-C library sequencing yielded a total of 67.07 Gb of raw sequencing data, 66.66 Gb of clean Hi-C data were obtained after filtering.The P. sinensis genome was assembled to chromosomelevel resolution with the aid of Hi-C data.The resulting chromosomal genome assembly yielded a total contig length of 576,387,120 bp and a contig N50 size of 27,604,817 bp; the total scaffold length amounted to 576,390,020 bp, with a scaffold N50 size of 33,874,332bp.The genome anchoring rate was 97.62% (Supplementary Tables 5, 6).The chromosomal interaction heatmap displayed 17 clear chromosomal clusters, with interactions within individual chromosomes being markedly higher than that between chromosomes (Figure 1B).
The BUSCO assessment yielded a completeness score of 99.00%.The CEGMA analysis utilized a core gene set consisting of 248 conserved genes obtained from six eukaryotic model organisms.In P. sinensis, 241 out of 248 Core Eukaryotic Genes (CEGs) genes were assembled, achieving a final ratio of 97.18%.BWA software aligned short-read library data with the assembled genomic sequence of P. sinensis, achieving an approximate read alignment rate of 99.35% and a genome coverage of about 99.97% (Supplementary Table 7, Supplementary Figure 3).The QV value derived from the mergury-mash module of Merqury software was 48.8606.By calculating the GC content and average depth of the assembled genome sequence, it is proved that there is no GC bias and possible contamination in the analyzed sequencing data (Supplementary Figure 4).Overall, the chromosomal assembly and genome anchoring rates were favorable, indicating that the genome assembly was highly satisfactory.The overall statistics for the P. sinensis genome are listed below (Table 1).

Genome annotation
313,631,790 bp tandem repeat sequences were obtained, makes up 54.41% of the P. sinensis genome (Supplementary Table 8).For a total of 280,849,280 bp long terminal repeats (LTRs) were get, which is the most abundant class for the repeat sequences, comprising 48.73% of the genome (Supplementary Table 9).A total of 37,779 protein-coding genes were predicted, of which 22,301 could be structurally predicted by all three methods (de novo prediction, homology prediction, and transcriptome-assisted prediction).Within the coding genes, the average lengths of transcripts and CDS are 3,116.70bp and 1,189.34bp, respectively.The average lengths of exons and introns are 230.81bp and 464.10 bp, respectively, with an average of 5.15 exons per gene (Supplementary Table 10).37,398 out of the 37,779 protein-coding genes could be annotated, while 381 remained unannotated.The probability of predicting gene function was 98.99% (Supplementary Table 11).As for the non-coding genes, 637 miRNAs, 1,408 tRNAs, 599 snRNA and 5,572 rRNA were identified from the P. sinensis genome (Supplementary Table 12).An overview of the genome assembly and annotation is shown in Figure 2.

Gene family and evolution analysis
Orthologous clustering results identified a total of 36,911 orthologous gene families across these 14 species.9,192 gene families were identified to be shared by all species (see Figure 3A).116 species specific gene families of P. sinensis were identified for the 13 species closely related to P. sinensis (Supplementary Table 13).GO enrichment suggests that these species-specific genes are primarily involved in the regulation of redox reactions, cleavage reactions, ion channel regulation, signal transduction, and methylation modification (Figure 3B).
A total of 165 orthologous single-copy genes were identified in the 14 species, (Figure 4B; Supplementary Table 14).The phylogenetic results indicate that P. sinensis share a common ancestor with P. pyrifolia, M. sieversii and M. domestica.P. sinensis and P. pyrifolia diverged from their common ancestor around 11.6 Mya.Meanwhile, P. sinensis and C. pinnatifida diverged from a shared ancestor around 30 Mya (Figure 4A; Supplementary Figure 5).
In the analysis of expansion and contraction gene families, 1196 significantly expanded and 1732 significantly contracted gene families were detected for P. sinensis (Figure 4A).According to the GO enrichment results, the expanded gene families are primarily associated with protein synthesis, regulation, and degradation.In contrast, the contracted gene families are related to nucleotide metabolism and cellular energy metabolism (Figure 5).
The Ks and 4DTv values exhibited same trend (Figure 6A).The 4DTv values for P. sinensis, C. pinnatifida, and P. pyrifolia indicated that C. pinnatifida, P. pyrifolia and P. sinensis showed a single peak at 0.07, suggesting that P. sinensis has undergone only one recent whole genome duplication (WGD) event in its evolutionary history.The most recent duplication event occurred before the divergence of P. sinensis from both C. pinnatifida and P. pyrifolia.Both P. sinensis-C.pinnatifida and P. sinensis-P.pyrifolia comparisons revealed a similar peak, with their 4DTv value distributions being roughly equivalent, indicating that these species experienced similar genome duplication events at close evolutionary points (see Figure 6B).Based on Ka/Ks analysis, the selective pressures experienced by P. sinensis, C. pinnatifida, and P. pyrifolia were estimated.We found that most Ka/Ks values of P. sinensis, C. pinnatifida, and P. pyrifolia are less than 1.0.27 genes of P. sinensi were identified positively selected (Ka/Ks>1).When the Ka/Ks values are between 0.2 and 1.1, P. sinensis has fewer genes than P. pyrifolia and more genes than C. pinnatifida (Figure 7).

Genomic synteny analysis
P. sinensis, C. pinnatifida and P. pyrifolia share 16,978 similar genes in total (Figure 8A).Homologous genes between C. pinnatifida and P. sinensis, as well as P. sinensis and P. pyrifolia, were identified based on genomic collinearity analysis.A total of 56,547 collinear genes between C. pinnatifida and P. sinensis were identified, accounting for 72.17% of the total gene count (78,350), and 58,461 collinear genes between P. sinensis and P. pyrifolia, comprising 70.73% of the total gene count (82,655).Additionally, there was a significant amount of chromosomal rearrangement events among them, such as translocations (Figure 8).

Discussion
P. sinensis serves not only as an ornamental crop, but also valued for its medicinal and edible properties, making it of substantial economic value.However, genetic breeding process in this plant has been relatively low, primarily due to our limited understanding of its genomic background.The sequencing and assembly of present genome provide valuable insights to improve the genetic breeding and cultivation of P. sinensis for optimal utilization.
The estimated genome sizes of P. sinensis determined by k-mer analysis was 664.59 Mb.The chromosome-level genome of P. sinensis assembled with its 576.39 Mb sequence, with a contig N50 size of 27.6 Mb.Hi-C library sequencing generated a total of

A B
GO enrichment shows the functions of the expansion and contraction genes of P. sinensis.(A) the functions of contraction genes, (B) the functions of expansion genes.
67.07 Gb of raw sequencing data, from which 66.66 Gb of clean Hi-C data were obtained.The assembly resulted in a total contig length of 576,387,120 bp and a contig N50 size of 27,604,817 bp; the total scaffold length amounted to 576,390,020bp, with a scaffold N50 size of 33,874,332bp.The genome anchoring rate was 97.62%.The results of chromosome heat map showed 17 distinct chromosome sets, with significantly stronger interaction intensity within chromosomes compared to between chromosomes.The BUSCO assessment indicated a completeness assembly score of 99.00%.the BWA software aligned short-read library data with the assembled genomic sequence, achieving an approximate read alignment rate of 99.35% and a genome coverage of about 99.97%.The comprehensive assessment results indicate that the genome assembly is highly complete and accurate.
The k-mer analysis showed that the heterozygosity rate of the P. sinensis genome was 0.62%, which is higher than that in the genomes of peach (0.31%) (Lian et al., 2022), loquat (0.31%) (Wang, 2021), and lower than that in the genomes of pear (0.89%) (Gao et al., 2021), P. mume (0.75%) (Zheng et al., 2022), and Chaenomeles speciosa (2.1%) (He et al., 2023).The genome size of P. sinensis was smaller comparable to that of Chaenomeles speciosa (632.3Mb) (He et al., 2023), apple (652-668 Mb) (Zhang et al., 2019;Sun et al., 2020b), hawthorn (779.24Mb) (Zhang et al., 2022), loquat (733.32Mb) (Wang, 2021), but larger than that of pear (496.9-541.34Mb) (Dong et al., 2020;Gao et al., 2021).We postulated that the smaller genome size might be due to internal standard differences, experimental errors, or other variations among studies.116 species-specific genes were identified within the P. sinensis through orthologous clustering analysis.GO enrichment suggests that these species-unique genes are primarily involved in the regulation of redox reactions, cleavage reactions, ion channel regulation, signal transduction, and methylation modification.1196 significantly expanded and 1732 significantly contracted gene families were detected in the P. sinensis genome.The expanded gene families are primarily associated with protein synthesis, regulation, and degradation.In contrast, the contracted gene families are associated with nucleotide metabolism and cellular energy metabolism.These expansion and contraction play crucial roles in the functional diversification of genes in Rosaceae plants.Previous studies have reported the involvement of expanded gene families in the biosynthetic pathways of plant natural products in the hawthorn genome (Zhang et al., 2022).In the loquat genome, expanded gene families were discovered to be associated with monoterpenoid biosynthesis as well as starch and sucrose metabolism (Wang, 2021).
According to the phylogenetic analysis in this study, P. sinensis is found to be closely related to pear and hawthorn (branch support value=100).He et al. (2023) suggest that Chaenomeles speciose is more closely related to apple.Due to the unavailability of the whole genome information of Chaenomeles speciose at present, it is challenging to conduct a comprehensive comparison of these two species.Nevertheless, considering the available information, it can be inferred that there may not be a strong relationship between the genus Pseudocydonia and genus Chaenomeles.
WGD events occurred in the genome of P. sinensis.According to the 4DTv values, the most recent duplication event occurred before the divergence of P. sinensis from both C. pinnatifida and P. pyrifolia.Consistent with expectations, the distribution of Ks values showed similar trends to that of the 4DTv results.The distribution of Ka/Ks ratios indicates strong negative purifying selection for most genes of P. sinensis genome (Ka/Ks<1), while 28 genes were identified as positively.When comparing the syntenic patterns of P. sinensis with C. pinnatifida and P. pyrifolia (Figure 8C).Collinearity analysis revealed a large number of homologous gene blocks between each pair of species.We found numerous chromosomal rearrangement and translocation among them, with more chromosomal rearrangement events found between hawthorn and P. sinensis than between P. sinensis and pear.In the Rosaceae family, conserved syntenic relationships were frequently found between species in the same genus (Rubus rugosa and Rubus chinensis) (Chen et al., 2021)).However, when comparing with species of other genus in Rosaceae, chromosomal rearrangement and translocation are commonly found, such as between rosa, peach and strawberry (Chen et al., 2021), as well as between Chaenomeles speciose, pear and apple (He et al., 2023).These results suggest that the Rosaceae family exhibits conserved synteny within the genus but shows significant genetic variation among genera.

TABLE 1
Summary statistics for the P. sinensis genome.