Genomic analyses of agronomic traits in tea plants and related Camellia species

The genus Camellia contains three types of domesticates that meet various needs of ancient humans: the ornamental C. japonica, the edible oil-producing C. oleifera, and the beverage-purposed tea plant C. sinensis. The genomic drivers of the functional diversification of Camellia domesticates remain unknown. Here, we present the genomic variations of 625 Camellia accessions based on a new genome assembly of C. sinensis var. assamica (‘YK10’), which consists of 15 pseudo-chromosomes with a total length of 3.35 Gb and a contig N50 of 816,948 bp. These accessions were mainly distributed in East Asia, South Asia, Southeast Asia, and Africa. We profiled the population and subpopulation structure in tea tree Camellia to find new evidence for the parallel domestication of C. sinensis var. assamica (CSA) and C. sinensis var. sinensis (CSS). We also identified candidate genes associated with traits differentiating CSA, CSS, oilseed Camellia, and ornamental Camellia cultivars. Our results provide a unique global view of the genetic diversification of Camellia domesticates and provide valuable resources for ongoing functional and molecular breeding research.


Introduction
Human ancestors working in agriculture transformed wild plant species into new forms and established all major domesticated crops with desirable traits to meet human needs worldwide (Doebley et al., 2006).Prominent examples include rice, wheat, and maize as staple foods (Doebley et al., 2006), peanut and sesame for oil production (Wei et al., 2015), and cotton for natural fibers (Wang et al., 2019).Through artificial selection, these crops obtain better grain quality (changes in starch and other compounds), higher yield, and altered flowering time or plant height (Gross and Olsen, 2010) due to underlying genetic changes.Crop improvement and diversification would produce different varieties within a single species, but the genetic drivers of trait diversification remain unknown in many less-studied horticultural plants.
The Camellia plants in the Theaceae family are native to East Asia, with more than 300 species worldwide (Zeng and Endo, 2019).Three types of plants, including ornamental Camellia, oilseed Camellia, and tea plants, played indispensable roles in meeting the varied needs of human life (Teixeira and Sousa, 2021).The ornamental Camellia, prized for its exquisite, long-lasting flowers and deep green foliage, boasts over 30,000 cultivars in various flower forms and colors.The oilseed Camellia (i.e., C. oleifera) has been continuously selected for larger seeds with high content and quality of unsaturated fatty acid over the centuries (Lin et al., 2022).The tea plant (C.sinensis), including two varieties, C. sinensis var.assamica (CSA) and C. sinensis var.sinensis (CSS), has been commercially grown in more than 60 countries as a popular beverage (Drew, 2019).Despite the trait diversification of Camellia domesticates, the underlying genomic basis for the evolutionary course remains poorly understood.
Researchers reported many Camellia genomes recently from tea plants (Xia et al., 2017;Wei et al., 2018;Wang et al., 2020;Xia et al., 2020;Zhang et al., 2020aZhang et al., , 2020Zhang et al., , 2021) ) and oilseed Camellia (Lin et al., 2022).These studies identified genes involved in the biosynthesis of terpenoids and fatty acids, but it is unknown if they bear signatures of human selection.Among tea plants, 'Yunkang10' (CSA, 'YK10') is a widely cultivated cultivar in Southwestern China.It was recognized as a national fine variety and bred from Fengqing large leaf variety of Yunnan Academy of Agricultural Sciences Tea Research Institute (Zhang et al., 2020c;Lei et al., 2021).It bears the advantages of wide-range adaptability, low-cost cultivation, and quick growth rate over other cultivars (Zhang et al., 2020c).The reported genome of 'YK10' (Xia et al., 2017) had poor continuity (contigs N50 = 19,958 bp), and the lack of high-quality CSA genomes warrants obtaining a high-quality genome assembly of the 'YK10'.
In addition, analyses of population structure and genetic diversity in tea plants yielded two competing theories for its evolution history (Wang et al., 2020;Xia et al., 2020;Zhang et al., 2020bZhang et al., , 2021)).The researchers proposed that either two tea varieties (CSS and CSA) underwent parallel domestication (Zhang et al., 2021) or they resulted from a single origin in Southwestern China (Xia et al., 2020).In comparison, oilseed Camellia and ornamental Camellia populations showed a high genetic diversity (Lin et al., 2022;Yang et al., 2022).Previous reports also showed that simple sequence repeat (SSR) markers could differentiate ornamental Camellia populations in northern and southern China according to their geographical origin and genetic background (Yang et al., 2022).In contrast, the C. oleifera cultivars were mainly distinguished by their morphological characteristics rather than geographic origin, based on 25,581 SNPs from transcriptome sequencing data of 221 oilseed Camellia cultivars (Lin et al., 2022).Despite these advances, additional analyses of a comprehensive collection of cultivars may add to our understanding of the complex evolutionary history of Camellia.
Here, we report the genome of 'YK10', the resequencing of 625 Camellia accessions, and the pan-genome of 206 core cultivars.We evaluated the phylogeny, population structure, and trait diversification of Camellia species, which provided insights into the evolution, domestication, and dispersal of this economically significant perennial horticultural plant.

DNA isolation, sequencing, and assembly
The leaves of 'YK10' were sampled from Lincang, Yunnan province of China.Genomic DNA for sequencing was extracted from leaves using the cetyltrimethylammonium bromide (CTAB) method.50 mg DNA was used to construct the PacBio Sequel sequencing libraries, producing raw PacBio reads after sequencing.Paired-end libraries with insert sizes of 292,319,340,394,448,514,578,812, and 904 bp were constructed using NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB, USA).All the constructed libraries were sequenced on an Illumina HiSeq2500 platform (Illumina, USA) with a PE-100 module.The raw data were filtered by removing reads with more than 5% N or more than 40 bp low-quality bases (Q20).Moreover, only one copy of any duplicated paired-end reads was retained.The clean data was employed to correct the CANU-assembled genome.
The contigs were scaffolded with optical mapping using the SAPHYR optical mapping technology (BioNano Genomics).High molecular weight DNA with a fragment distribution greater than 150 kb was isolated using BioNano sample preparation kits (BioNano Genomics).Sequence-specific labeling of genomic DNA (gDNA) for BioNano mapping was prepared by Nicking, Labeling, Repairing, and Staining (NLRS).600 ng of the isolated DNA was used for subsequent DNA nicking using Nt.BssSI (NEB) incubating for 2 hours at 37°C.The nicks were labeled according to the instructions provided with the BioNano Prep NLRS Labeling Kit (BioNano Genomics).The labeled DNA sample was loaded onto the Saphyr Chip nanochannel array (BioNano Genomics) and imaged using the Saphyr system and associated software (BioNano Genomics) according to the Saphyr System User Guide.The Sovle (Version: 3.1_08232017; BioNano Genomics) software package (parameters: -B 1 -N 2) was used to assemble the maps and construct scaffolds.
The software Redundans.pywith parameters "-identity 0.55overlap 0.80 -noscaffolding -nogapclosing" was performed to resolve the redundancy in the genome first, and then the scaffolds were aligned all-to-all using BWA-MEM (Version: 0.7.12) (Li and Durbin, 2009).The coverage ratio was calculated as the length of synteny region/total length for an aligned length greater than 1 kb and identity above 96%.The sequences with more than 99% coverage ratio were filtered out.

RNA-seq
'YK10' leaves were collected at the shoot, mature, and late developmental stages.3mg of total RNA per sample was used as input material for the RNA sample preparation.Beads with oligo (dT) were used to isolate poly(A) mRNA from total RNA.RNA sequencing libraries were constructed from these mRNA using the TruSeq RNA Sample Preparation Kit (Illumina, San Diego, USA).The resulting libraries were sequenced using the PE150 module of the Illumina platform.In addition, the Iso-Seq library for leaf tissue was prepared according to the Isoform Sequencing protocol (Iso-Seq) using the Clontech SMARTer PCR cDNA Synthesis Kit and the BluePippin Size Selection System protocol as described by Pacific Biosciences.Then, the library was sequenced on a PacBio Sequel System.
Finally, these TE sequences were classified according to their characteristics using TEsorter (https://github.com/zhangrengang/TEsorter) based on the REXdb database (viridiplantae_v3.0)(Neumann et al., 2019).For non-autonomous TEs that lack protein domains, such as SINEs, were classified according to their best-hit sequences using the method above.LTR-retrotransposon sequences encoding reverse transcriptase (RT) protein domains were extracted and used to construct a phylogenetic tree using FastTree (Price et al., 2010).The tree was created for each lineage of TEs separately.For rooting, we included ten sequences from other lineages.Ty3/gypsy Retand sequences were included as an outgroup for Ty1/Copia Angela and Ty1/Copia Ivana trees.Ty3/Gypsy Athila trees were rooted using Ty1/copia Angela.

Gene prediction, functional annotation, and evolution
We employed EVidence Modeler (EVM) (Haas et al., 2008) to consolidate RNA-seq-based, de novo, and homolog-based predictions into a final gene set.For RNA-seq-based gene prediction, the reads of Illumina transcriptome were cleaned with Trimmomatic (Bolger et al., 2014) and aligned to the genome with HISAT2 (Version: 2.2.0) (Kim et al., 2019).Alignments were assembled independently with StringTie (Version: 2.1.4)(Kovaka et al., 2019).For de novo gene prediction, the clean reads of Illumina transcriptome were assembled using Trinity (Version: v2.14.0) (Grabherr et al., 2011).Full-length transcriptomes were managed with the SMRT Analysis software suite (PacBio, Version: release_6.0.0.47841) and resulted in a set of full-length transcripts.RNA-seq assemblies and full-length transcripts were combined and further refined using PASA (Haas et al., 2008), and a high-quality training set was generated using the PASA assemblies.Then Augustus (Stanke et al., 2008) was performed with this training set.Three additional ab initio gene predictions, GlimmerHMM (Majoros et al., 2004), GENSCAN (Burge and Karlin, 1997), and SNAP (Korf, 2004) were also used.Protein sequences of C. sinensis var.assamica (CSA) (Xia et al., 2017) and C. sinensis var.sinensis (CSS) (Wei et al., 2018) were used for homolog-based gene annotation.After mapping the protein sequences to the 'YK10' genome using tblastn, protein-coding regions were obtained by extending 2,000 bp upstream and downstream of the matching DNA sequences.GeneWise (Birney and Durbin, 2000) was then used to predict gene structure within each protein-coding region.All lines of evidence were then fed to EVM using intuitive weight (RNAseq > cDNA/protein > de novo gene predictions).Finally, EVM models were updated with PASA.The genes were filtered if the proportion of repeats was more than 50% or the CDS was less than 300 nt.
The gene families cluster was performed with protein-coding genes of 'YK10', TGY, DASZ, CON, C. japonica, and Actinidia chinensis cultivar 'Hongyang' (Kiwifruit) (Yue et al., 2022) using OrthoFinder (Version: 2.5.4) (Emms and Kelly, 2019).The alignment of multiple single-copy orthologous genes and corresponding rooted species tree generating with OrthoFinder were fed to MCMCTREE in the PAML package (Version: 4.9j) (Yang, 2007) to infer the species divergence times.To estimate each species' divergence time, the known divergence times between these species were collected from http://timetree.org.Based on the calculated phylogeny and divergence times, CAFE5 (Mendes et al., 2020) was applied to identify whether gene families had undergone expansion or contraction.The Gamma model with two gamma rate categories and the root equilibrium frequency with Poisson distribution were set.
All-to-all BLASTP analysis of protein sequences was performed within YK10, TGY, DASZ, and CON, respectively, using an e-value cutoff of 1e-10.Syntenic regions within each species were then identified using MCscanX (Wang et al., 2012) based on the all-to-all BLASTP results.Protein sequences of homologous gene pairs in the identified syntenic regions were first aligned using MUSCLE (Edgar, 2004), and the protein alignments were then converted to the CDS alignments.Finally, synonymous substitution rates (Ks) were calculated on these CDS alignments using KaKs_Calculator (Version: 1.2) (Zhang et al., 2006) with the NG method.

DNA sample preparation and sequencing for resequencing
486 Camellia accessions were obtained from the sampling place, containing 15 Other Camellia accessions (other species in Camellia), 21 ornamental Camellia accessions, 23 oilseed Camellia accessions, 54 Wild Tea accessions, 20 Hybrid accessions (C.sinensis var.assamica × C. sinensis var.sinensis), 246 CSA accessions (C.sinensis var.assamica) and 107 CSS accessions (C.sinensis var.sinensis).In addition, 139 accessions (1 oilseed Camellia accession, 13 Wild Tea accessions, 5 Hybrid accessions, 37 CSA accessions, and 83 CSS accessions) in a previous paper (Wang et al., 2020) were also used for resequencing analysis.Young leaves were collected from the plants and snap-frozen in liquid nitrogen.Total DNA was extracted with the DNAsecure plant kit (Tiangen, Beijing).2 µg genomic DNA from each accession was used to construct a sequencing library following the manufacturer's instructions using NEBNext Ultra DNA Library Prep Kit (NEB, USA).Paired-end sequencing libraries with an insert size of approximately 400 bp were sequenced on an Illumina NovaSeq 6000 sequencer at Novogene (Beijing, China).Paired-end resequencing reads were filtered using NGSQCToolkit_v2.3.3 (Patel and Jain, 2012), removing reads containing adapter or poly-N and low-quality reads (reads with >40% bases having Phred quality ≤ 20) from the raw data.Reads shorter than 70bp were discarded, and the yielded clean data were used for downstream analyses.

Population genetics analysis
Whole-genome SNPs were used to construct the ML (Maximum likelihood method) phylogenetic tree with 100 bootstraps using SNPhylo (Version: 20140701) (Lee et al., 2014).C. cuspidata (KM6) was used to provide outgroup information.The iTOL tool (http://itol.embl.de)was used to color the phylogenetic tree.
SNPs in linkage disequilibrium (LD) were filtered using PLINK (Version v1.90b3.38)(Purcell et al., 2007) with a window size of 50 SNPs (advancing 5 SNPs at a time) and an r 2 threshold of 0.5.Population structure was analyzed using ADMIXTURE (Version: 1.3) (Alexander et al., 2009) with a block-relaxation algorithm.To explore the convergence of individuals, we predefined the number of genetic clusters K from 2-8 and ran the cross-validation error (CV) procedure with default methods and settings.

Clustering and discretization
Genetic distances among and within each accession were calculated based on LD pruned SNPs using PLINK with the formulation 1-IBS, where IBS is identity by state.The CMDSCALE function in R was used to calculate eigenvectors (Mardia, 1978) based on the distance matrix.Formal clustering with the PAM method and filtering based on each cluster's silhouette scores computed by the DISCRETIZE algorithm were performed with the IBS distance matrix using an available R script (https://github.com/grafau/discretize),which was previously described (Gutaker et al., 2020).Clustering and discretization were carried out independently for clusters k, ranging from 2 to 9. The remaining 443 accessions in the corresponding group with k=5 were selected and filtered to remove outliers based on the phylogenetic tree.The filtered accessions were used for subsequent analysis.

Relationship inference
Relationships between each accession were verified using KING (Version: 2.2.5) (Manichaikul et al., 2010) based on the basic set of SNPs.Close relatives were inferred based on the estimated kinship coefficients as shown in the following algorithm: an estimated kinship coefficient range >0.354, [0.177, 0.354], [0.0884, 0.177] and [0.0442, 0.0884] correspond to duplicate, 1st-degree, 2nddegree, and 3rd-degree relationships, respectively.The duplicate accessions were removed from subsequent group analysis.

Genome scanning for selective sweep signals
RAiSD (Raised Accuracy in Sweep Detection, Version 2.9) (Alachiotis and Pavlidis, 2018) was used to detect signatures of selective sweeps based on the m statistics in ornamental Camellia, oilseed Camellia, Wild Tea, CSA 1, CSA 2, and CSS group, respectively.The significant threshold for m statistic score was set as top 0.5%.

Linkage disequilibrium
LD was calculated using PopLDdecay (Version: v3.41) (Zhang et al., 2019).The pairwise r 2 values within and between different chromosomes were calculated.The LD for each group was calculated using SNP pairs only from the corresponding group.

Demographic history reconstruction using the PSMC approach
The Pairwise Sequentially Markovian Coalescent (PSMC) model (Sun and An, 2005) was used to reconstruct the demographic history of each tea accession.Parameters were set as follows: -N25, -t15, -r5 and -p '4 + 25*2 + 4+6'.The estimated time to the most recent common ancestor (TMRCA) was given in units of 2N 0 time, and the relative population size (N e ) at state t was scaled to N 0 (the present effective population size).Then, the neutral mutation rate (6.1 × 10 -9 mutations per site per year) was used to scale the TMRCA and N e values into chronological time.For plotting, accessions were grouped by their populations.

Demographic history inference using SMC++
SMC++ (version: v1.15.4.dev16+g72ea2e2) (Terhorst et al., 2017) was employed to infer population size histories and split times between the CSA 2 and CSS groups.The analysis was performed by testing the treatment of runs of homozygosity longer than 50, 100, and 200 kb as missing or not (no mask).A mutation rate of 6.1 × 10 -9 mutations per site per year was used to convert the scaled times and population sizes into real times and sizes.

Treemix
We estimated admixture graphs in SOM-defined tea clusters using TreeMix (Version: 1.12) (Pickrell and Pritchard, 2012), employing a Maximum Likelihood (ML) method based on a Gaussian model of allele frequency change.Migration events (m) were set from 1 to 5. The bootstrap values were calculated with 1,000 replicates, using the Wild Tea group as the root.Each migration event was run ten times with a random seed.The optimal number of migration edges was estimated using an R package "OptM" (Version: v0.1.6)(Fitak, 2021).Node robustness was estimated with 100 bootstrap replicates and plotted using the treemix.bootstrapfunction in BITE (Version: v1.2.0008) (Milanesi et al., 2017).

Pan-genome construction
The accessions used for pan-genome analysis were selected according to manual screening and SVCollector (Ranallo-Benavidez et al., 2021).The manual screening followed these principles: (1) remove accessions with sequencing depth lower than 7 ×; (2) remove hybrid accessions; (3) remove the potentially wrong accessions whose position in the SNPs' phylogenetic tree was incongruent with the accession information; (4) just keep one accession if there were multiple cultivars; (5) retain one accession per species in a clade of the SNPs' phylogenetic tree for ornamental Camellia, oilseed Camellia, and Wild Tea groups.After combining accessions selected by the two methods, accessions were further removed if they were not in the accession pool, resulting in clustering, discretization, and removing duplicate accessions.After the above filtering step, sequences of 206 tea accessions were used for the pan-genome construction.
Raw reads of selected accessions were processed to remove duplicated reads using Nubeam-dedup (Dai and Guan, 2020).Then, the adapters and low-quality sequences were trimmed using Trimmomatic (Version: 0.39) (Bolger et al., 2014) with parameters 'SLIDINGWINDOW:4:20 MINLEN:50'.The final cleaned reads of each accession were de novo assembled using Megahit (Version: v1.2.9) (Li et al., 2016a) with default parameters.The non-reference sequences were identified according to the previous description (Gao et al., 2019).In brief, the assembled contigs with lengths longer than 500 bp were selected and aligned to 'YK10' genomes using Mummer (Version: 4.0.0beta2)(Delcher et al., 2003).If the continuous alignment was longer than 300 bp with sequence identity higher than 86%, and the continuous unaligned regions were longer than 500 bp, then the unaligned regions were extracted as unaligned sequences.These unaligned sequences and other unaligned contigs were then searched against the GenBank nucleotide database using blastn (Version: 2.9.0+) (Camacho et al., 2009).Sequences with the best hits outside the green plants or covered by known plant mitochondrial or chloroplast genomes were removed.The cleaned non-reference sequences from all accessions were combined and then processed to remove redundant sequences using CD-HIT (Version: 4.8.1)(Li and Godzik, 2006) with an identity threshold of 90%.The resulting non-redundant sequences and the reference 'YK10' genome were merged as the pan-genome.

Annotation of the pan-genome
A custom repeat library was constructed by screening the pangenome using EDTA (Version: v1.9.6) (Ou et al., 2019) and used to screen the non-reference genome to identify repeat sequences with RepeatMasker (Version: 4.1.2-p1)(Tarailo-Graovac and Chen, 2009).Protein-coding genes were predicted from the repeat-masked nonreference genome using BRAKER (Version: 2.1.6)(Brůna et al., 2021) with two lines.One of the lines was running BRAKER with RNA-Seq data.RNA-seq data included Illumina transcriptome for leaves sequencing in our paper and reads of 40 tea samples downloaded from NCBI (Supplementary Table S1).The reads were cleaned using Trimmomatic (Version: 0.39) (Bolger et al., 2014) with parameters 'SLIDINGWINDOW:4:15 MINLEN:70' and aligned to the pangenome with HISAT2 (Version: 2.2.0) (Kim et al., 2019).Then, the alignments were converted to a hint file for AUGUSTUS in gff format.All gff files were merged and fed to BRAKER.Another line was running BRAKER with OrthoDB Viridiplantae protein database.The results were then combined using TSEBRA (https://github.com/Gaius-Augustus/TSEBRA).The genes were filtered if the proportion of repeat sequences in gene sequences was greater than 50% or the CDS length was less than 300 nt.
Gene functions were assigned according to the best match using blastp against KEGG databases.InterPro functional analysis and Gene Ontology IDs were obtained using InterProScan (Zdobnov and Apweiler, 2001).The pathway to which the gene might belong was derived from the matching genes in KEGG.

PAV analysis
Sequencing reads from each accession were aligned to the pangenome using Bowtie2 (Version: 2.4.4) (Langmead and Salzberg, 2012) with "-end-to-end -sensitive" parameters.The gene presence/ absence variation was characterized with the SGSGeneLoss package (Version: v0.1) (Golicz et al., 2015).For a specific gene within a given accession, if less than 20% of its exon regions were covered by at least two reads (minCov = 2, lostCutoff = 0.2), this gene was treated as absent in that accession; otherwise, it was determined as present.A maximum-likelihood phylogenetic tree was constructed based on the PAVs with 1,000 bootstraps using IQ-TREE (Version: 1.6.12)(Nguyen et al., 2015) with C. costei (KM7) as outgroup.Population structure was determined using ADMIXTURE (Version: 1.3) (Alexander et al., 2009).Principal component analysis was performed with TASSEL5 (Bradbury et al., 2007).
The presence frequencies within each pair of groups (ornamental Camellia versus oilseed Camellia and CSA versus CSS) were derived to identify genes under selection.The significance of the difference in the presence frequencies for each gene between the two compared groups was determined using Fisher's exact test.The raw P values of all genes in each comparison were corrected using a false discovery rate (FDR).Genes under selection were identified with significantly different frequencies (FDR < 0.001 and fold change >2).KEGG enrichment analysis was performed using TBtools (Chen et al., 2020).

Pan-genome modeling
Pan-genome and core-genome size curves were fitted using the nlsLM function in the R package minpack.lmaccording to the steps (Hurgobin et al., 2018).The combinations of genomes were obtained according to the formula: 206!/(n!(206-n))!, n = [1,206].The pan-genome size was modeled using the power law regression: y = Ax B + C, and the core genome size was modeled using the exponential regression y = Ae Bx + C (Tettelin et al., 2005;Zhao et al., 2014).
We predicted 42,536 protein-coding genes (54,598 transcripts) following homology and de novo methods (Supplementary Tables S9 and S10; 79.5% of core BUSCO genes).In total, 35,328 (83.05%) predicted genes had functional annotation in public databases (Supplementary Table S11).The species phylogeny using singlecopy orthologous genes of reported Camellia genomes (Figure 1B, Supplementary Table S12) revealed an estimated divergence time of ~7.15 million years ago (MYA) between 'YK10' and 'Tieguanyin'.The divergence time between C. japonica and C. oleifera var.'Nanyongensis' was estimated at ~10.47 MYA.Tea trees diverged from other Camellia species at ~13.88 MYA.Whole genome duplication (WGD) analysis indicated that YK10 (and other Camellia plants) had experienced the same genome duplication event, in which the Ks peak was 0.386-0.414(Supplementary Figure S2).3,919 and 2,158 gene families of 'YK10' underwent expansion and contraction, respectively (Supplementary Figure S3), with 757 exhibiting significant expansion and 178 showing significant contraction.The significantly expanded genes were related to phenylpropanoid biosynthesis, glycan and glycosaminoglycan metabolism, ubiquinone and other terpenoid-quinone biosynthesis, and so on (Supplementary Table S13).Five gene families, including 26 NUCLEOTIDE-BINDING SITE AND LEUCINE-RICH REPEAT (NBS-LRR) genes, also underwent significant expansion.In comparison, the significantly contracted genes were related to monoterpenoid biosynthesis, metabolism of terpenoids and polyketides and others (Supplementary Table S13).
We predicted ~2.83Gb repetitive elements in the 'YK10' genome (Table 1, Supplementary Table S14), among which 82.64% repeats were transposable elements (TEs; Supplementary Table S15).Long-terminal repeat (LTR) retrotransposons represented 58.48% of the genome, with Ty3/gypsy Ogre elements being the major lineage (10.08%,Supplementary Table S15).We showed a universal distribution of major TE families across the genome (Supplementary Figure S4) and analyzed the divergence of the reverse transcriptase sequences of different TE lineages (Supplementary Figure S5).Notably, Athila elements were evolutionarily young, but Angela and Ivana were comparably more ancient (Supplementary Figure S5).The LTR retrotransposon burst event in 'YK10' was consistent with other reported genomes (Figure 1C), around 0.331 MYA.
We collected 486 Camellia accessions worldwide and generated whole-genome sequencing data at an average depth of 18.28×, and we also included 139 accessions from a previous study (Wang et al., 2020) to maximize the genetic diversity for population genomic analyses (Figure 2A, Supplementary Table S16).The mean mapping rate of these reads to the 'YK10' reference genome was ~98.63% (Supplementary Table S17), and the properly paired mapping rate was 49.99-93.89%and showed significant difference among groups (Supplementary Figure S6, Supplementary Table S17).After applying filtering criteria to the called variants, we identified 651,670,193 single-nucleotide polymorphisms (SNPs), among which 80,356,229 SNPs had minor allele frequencies (MAF) more than 0.05 and <50% missing calls (Supplementary Table S18).About 91.87% of SNPs were intergenic.The nonsynonymous-tosynonymous substitution ratio for the SNPs in the coding regions was 1.419.We also found 3,727,828 indels (< 40 bp), 84.30% of which are in the intergenic regions and 0.83% in the coding regions.An estimated 72.82% of indels in the coding regions could cause frameshift mutations (Supplementary Table S18).

Camellia population structure
We constructed a maximum likelihood (ML) phylogenetic tree using C. cuspidata as an outgroup to investigate the phylogenetic relationships among Camellia accessions.We found that many Camellia species in the section Theopsis of subg.Thea (hereafter termed 'Other') were closer to the phylogenetic tree root (Supplementary Figure S7).In addition, the ornamental Camellia species form two polyphyletic clusters.The oilseed Camellia species also did not cluster together.This result suggested rampant genetic introgression with congeners or potential misclassification of accessions based on morphology during sample collection.In the section Thea of subgenus Thea, wild tea tree species, CSA, and CSS accessions formed three independent clades, with CSA and CSS being sister clades.
To identify discrete subpopulations (k d ) of Camellia accessions, we clustered all accessions based on genomic distances by partitioning around medoids (PAM) and a silhouette-based procedure (Gutaker et al., 2020).This discretization procedure removed genetic gradients between subpopulations (Figures 2B,  C), corresponding to the removal of accessions with mixed genetic ancestry assessed by ADMIXTRUE (Figures 2D-F, Supplementary Figure S8).In the end, we obtained five discrete subpopulations: The base group containing Other/ornamental/oilseed Camellia species, a second group containing wild tea trees (hereafter Wild Tea), the third and fourth groups containing CSA accessions (hereafter CSA 1 and 2), and the last group containing CSS accessions (hereafter CSS).
Characterization of the linkage disequilibrium (LD, expressed as r 2 ) pattern showed that the LD decay for all tea tree subpopulations was very rapid.The LD reached half the maximum average r 2 at a distance of 400 bp for the CSA 1 group, 100 bp for the CSA 2 group, and 100 bp for the CSS group.In comparison, the LD reached half of the maximum r 2 at a distance of 11.7 kb for ornamental Camellia, 3.4 kb for oilseed Camellia, and 400 bp for the Wild Tea group (Supplementary Figure S10).

Demographic history of CSA and CSS
We applied the pairwise sequentially Markovian coalescent model (Li and Durbin, 2011) to analyze the Camellia groups (Figure 2G).The results were scaled to real time, assuming a neutral mutation rate of 6.1 × 10 -9 nucleotides per year.Almost all groups experienced a steady decline in effective population size (N e ) from the highest point at 300-500 thousand years ago (Kya) to the nadir at 10-30 Kya.The Wild Tea group had a smaller N e at the highest point.We also explored the population split between CSA and CSS populations with unphased SNP data using SMC++ (Terhorst et al., 2017).This approach revealed that these two groups diverged at about 11 Kya (Supplementary Figure S11).3F), which may be related to dwarf phenotype (Guo et al., 2020).

Subpopulation structure in tea tree Camellia
Ecological niche modeling revealed a significant suitable habitat (suitability score >0.75) for tea trees in the Southern regions of East Asia and Southeast Asia from the Last Glacial Maximum (LGM) to the Holocene period, consistent with the collected records from GBIF (Figure 4A, Supplementary Figure S12, Supplementary Table S20).It is also apparent that the suitable habitat gradually separated into two smaller regions since the Late Holocene.The west covers the Eastern Himalayan region and the Ganges Basin, whereas the east mainly covers South China.This result corresponds to the sister relationship between CSA and CSS (Figure 2D), hinting at the parallel domestication of tea trees.
We applied a Self-Organizing Map (SOM) approach (Wirth et al., 2011) to investigate the genetic diversity and the subpopulation-level distributions of characteristic alleles in tea cultivars.A gallery of SNP portraits helped visualize the genotypes (Supplementary Figures S13-S15) using color-coded allelic landscapes to depict major homozygous (blue), heterozygous (green), and minor homozygous (red), respectively.The accession portraits were very similar among accessions from the same geographic region but progressively different among  accessions from different regions with increasing geographic distance.This result revealed three, two, and six smaller clusters in the CSA 1, CSA 2, and CSS groups, respectively.The geographic distribution of the CSA 1 group was mainly in Yunnan and Southeast Asia (Supplementary Figure S16A), whereas the CSA 2 group was in Africa, Bangladesh, Nepal, and Assam.By contrast, the CSS clusters were scattered across South China (Supplementary Figure S16B).
The population differentiation (F ST ) level was more prominent between the CSA 1 and the CSA 2 clusters (Supplementary Figure S17A), indicating the genetic separation between Yunnan and South Asia tea tree plants.The F ST among CSS clusters was relatively lower (F ST < 0.15), showing a low level of genetic differentiation (Supplementary Figure S17B).Most CSS clusters showed higher nucleotide diversity (Supplementary Figure S18A), probably reflecting historical introgression and a more extensive habitat throughout evolution history in South China.The differences were significant between/among most CSA and CSS clusters (Supplementary Figure S18B).
To illustrate the high level of gene flow between the CSA and CSS groups, we estimated admixture graphs of SOM-defined clusters using TreeMix.The optimal number of migration events was four (Figure 4B, Supplementary Figure S19), including events from CSA 2 into CSA 1 (clusters 2 and 3; migration weight 0.393) and from CSS to CSA 1 (cluster 3; migration weight 0.340).This result suggested that CSA 1 cluster 3 might result from CSS and CSA 2 hybridization.We also observed an admixture event from CSS cluster 1 to CSA 2 cluster 1 (migration weight 0.154) and a gene flow from CSA 2 to CSS cluster 4 (migration weight 0.132).

Camellia pan-genome construction and analysis
We selected 206 Camellia accessions (Supplementary Table S16) for the pan-genome construction, including eight accessions in the Other Camellia, 21 accessions in the ornamental Camellia, 12 accessions in the oilseed Camellia, 26 accessions in the Wild Tea group, 68 CSA accessions, and 71 CSS accessions.The average genome size of accessions was ~2.30Gb for the de novo assemblies, and the average N50 size was 1.46 kb (Supplementary Figure S20).We mapped assembled contigs to the 'YK10' reference genome and identified ~43.33 Gb non-reference sequence sharing < 86% identity with the reference (Supplementary Figure S21).After removing redundancies, we obtained 9,275,358 sequences (8.58 Gb) representing the non-reference genome and annotated 216,302 protein-coding genes.The 'YK10' reference and the final nonreference genome formed the Camellia pan-genome and were used for the following analyses.We categorized pan-gnome genes according to their presence frequencies, including 26,330 core genes shared by all 206 accessions (100% accessions), 4,497 softcore genes (99-100% accessions), 206,044 shell genes (1-99%), and 21,916 cloud genes (<1%), respectively (Supplementary Figure S22).Iteratively random sampling of accessions showed the plateau of both pan and core genes, suggesting a finite pan-genome size (Figure 5A).
The tea tree Camellia accessions have significantly fewer genes than the Other Camellia species (Figure 5B, Supplementary Figure S23), suggesting a general trend of gene loss during tea tree evolution.Phylogenetic, principal component analyses, and ADMIXTRUE clustering (Figures 5C, D, Supplementary Figure S24) using the presence/absence variations (PAVs) showed clear separation of the Wild Tea, CSA, and CSS accessions except for re255, which may be a misidentified accession.By contrast, most Other Camellia accessions were clustered together, corresponding to the results from similar analyses using SNPs (Supplementary Figure S7).
We identified PAVs under selection by screening genes with significantly different frequencies between CSA and CSS (Supplementary Figure S25).We identified 11,060 CSA-favorable genes (higher frequencies in CSA) and 10,972 CSS-favorable genes (higher frequencies in CSS).The CSA-favorable genes were mainly involved in carotenoid biosynthesis, metabolism of terpenoids and polyketides, carbohydrate metabolism, and benzoxazinoid biosynthesis (Supplementary Table S21).The CSS-favorable genes were mainly involved in carotenoid biosynthesis, steroid biosynthesis, metabolism of terpenoids and polyketides, cytochrome P450, nitrogen metabolism, and monoterpenoid biosynthesis (Supplementary Table S21).Similarly, we identified 210 ornamental Camellia-favorable genes and 2,945 oilseed Camellia-favorable genes (Supplementary Figure S26).The pathways of fatty acid degradation and carbohydrate metabolism were enriched in oilseed Camellia, but no significant pathways related to secondary metabolism were enriched in ornamental Camellia.

Discussion
Tea is a popular non-alcoholic beverage worldwide.Because of the health benefit, economic value, and progress of sequencing technique, lots of Camellia genomes, including 'YK10', CSA, CSS, and oilseed Camellia, have been decoded (Xia et al., 2017;Wei et al., 2018;Wang et al., 2020;Xia et al., 2020;Zhang et al., 2020aZhang et al., , 2020Zhang et al., , 2021;;Lin et al., 2022;Chen et al., 2023).However, most CSA assemblies have lower contig N50 lengths (~881 kb) (Chen et al., 2023).Our new 'YK10' assembly has improved forty-fold in terms of contig N50 length (817 kb) compared to a previous report (Xia et al., 2017), and its quality was comparable with other reported CSA (Chen et al., 2023) and CSS genomes (Wei et al., 2018;Wang et al., 2020;Xia et al., 2020;Zhang et al., 2020a).TE burst events were estimated at around 0.331 MYA using our genome, which was younger than that obtained from a previous 'YK10' genome (Xia et al., 2017).Furthermore, the Ks peak of WGD analysis in this study (0.386-0.414) was larger than that in the previous studies (0.31-0.36) (Wang et al., 2020;Xia et al., 2020), suggesting an older genome duplication event.Observably, our results showed that some NBS-LRR and phenylpropanoid biosynthesis-related gene families were expanded in the 'YK10' genome, which may be beneficial for disease resistance and tea flavor, respectively, given the fact that NBS-LRR genes are the most important disease resistance genes in plants (Wu et al., 2017).The cardiovascular health benefits of drinking tea are mainly due to flavonoids (Hodgson and Croft, 2010), which are synthesized through a specific branch of the phenylpropanoid pathway (Li et al., 2019).
Trait diversification of Camellia domesticates was visualized for three main types according to human needs, but the potential genomic drivers remained covered.Ornamental Camellia cultivars exhibit diverse flowering times, flower sizes, petal colors, and petal forms.These traits are putative targets of human selection throughout history.We identified abundant selective sweep genes related to floral development in this group, among which the ARF protein plays a role in the auxin-mediated pathway in promoting floral fate (Chung et al., 2019) in addition to its role in regulating gynoecium morphogenesis, self-incompatibility, de novo organ regeneration, and organ polarity (Zhang et al., 2018).MYB proteins regulate anthocyanin biosynthesis in fruits and flowers, and GST proteins transport anthocyanin from the endoplasmic reticulum to facilitate the coloration of flowers, leaves and stems (Lu et al., 2021).Other selected genes in the group encoding beta-ring hydroxylase protein and ethylene-responsive transcription factor protein also affected flowering in Camellia (Wang et al., 2016;Huang et al., 2022), thus possibly influencing the ornamental value of flowers in ornamental Camellia.
The oilseed Camellia produces high-temperature cooking oil with a mild tea aroma.It is a crucial source of dietary fats for the local Southeast Asian people.Not surprisingly, some selective sweep genes in this group were found to be related to FA biosynthesis.In Camellia, phospholipase D (PLD) became important in regulating triacylglycerol (TAG) production.Transgenic Camelina plants expressing Arabidopsis PLD produced 2-3% higher levels of TAG (Deepika and Singh, 2022).In addition, acyl-CoA synthetase mediates de novo lipogenesis and glycerol lipid assembly in developing seeds (Ichihara et al., 2003).Furthermore, the genes encoding fatty acid omega-hydroxylase proteins and fatty acyl-ACP thioesterase B protein are also related to FA metabolism (Salas and Ohlrogge, 2002;Osborne et al., 2022).
The details of tea tree domestication remain inconclusive, as the true progenitor of the cultivated tea tree is unknown.New evidence emerged to propose the parallel cultivation of CSA and CSS in the past and hypothesize that selected genes possibly affected flavor, leaf size, and tree height (Zhang et al., 2021).Our results revealed that the function enrichment of selective sweep genes showed differences in these two groups.We also found that CSA and CSS were sister clades in the phylogenetic tree, and the divergence time between CSA and CSS predated the earliest documentation of tea cultivation.Ecological niche models showed that the suitable habitat of tea plants gradually separated into two smaller regions since the Late Holocene.All of these supported the parallel domestication processes for CSA and CSS.However, the location and time of domestication still require verified progenitor populations for elucidation.Especially, a previous study found that both cell length and the cell number of internodes were reduced in a dwarf mini plant 1 (mnp1) mutant of Medicago truncatula, which involved a putative CPS gene in the first step of gibberellin biosynthesis (Guo et al., 2020).A CPS gene was found in the selective sweep region in the CSS group in the present study, providing indirect evidence for the relatively shorter plant height in CSS.
The kinship of important cultivars contributed to understanding the history of breeding and screening germplasm resources, e.g., identifying the offspring of the cultivar 'Yabukita'.The pattern of geographic distribution among different CSA and CSS subpopulations showed significant differences, suggesting that established cultivars traveled to far-away tea-growing regions, possibly with early farmers.In addition, we investigated the population differentiation, nucleotide diversity, and gene flow among these subpopulations.These results reflected a complex hybridization history during tea tree domestication and breeding history.
Furthermore, we provided a landscape of Camellia pangenome.Though a graphical pangenome based on the de novo assembly of 22 elite tea cultivars has been released (Chen et al., 2023), our Camellia pan-genome will expand the Camellia gene pool, thus helping subsequent functional research.Genetic structure analysis based on PAVs of Camellia pan-genome corresponded to the results based on SNPs, proving the capacity and value of this genetic variation.The group-favorable genes also provided valuable genetic resources for explaining the trait diversification of Camellia domesticates.

Conclusion
In conclusion, we presented a high-quality genome sequence for the CSA tea tree.According to resequencing and pan-genome analyses for Camellia species, the genetic diversity and population structure of a worldwide panel were revealed, and genetic drivers of selection during domestication were identified.Our results offered a valuable genomic resource for ongoing functional research of Camellia species in agriculture.Many aspects of the tea origin remain unknown, but our results agreed with the parallel domestication of CSA and CSS.Because the progenitor of tea trees is unidentified, two-population analyses of selective sweeps and the dispersal history of tea trees remained to be examined.

FIGURE 5
FIGURE 5Feature of tea pan-genome.(A) Simulations of the increase of the pan-genome size and the decrease of core-genome size.Accessions were sampled as 10,000 random combinations of each given number of accessions.The upper and lower edges of the purple and green areas correspond to the maximum and minimum numbers of genes, respectively.Solid black lines indicate the pan-and core-genome curves fitted using points from all random combinations.(B) Violin plots showing the number of detected genes in each group.Three lines (from the bottom to the top) in each violin plot show the location of the lower quartile, the median, and the upper quartile, respectively.(C) Principal component analysis based on PAVs.The core and all absent genes were excluded.(D) Maximum-likelihood tree and model-based clustering of the 206 accessions with different numbers of ancestral kinships (K = 2, 3, 4, and 5) using the identified PAVs.The core and all absent genes were excluded.

TABLE 1
Summary of genome assembly and annotation.
copalyl-diphosphate synthase (CPS) protein in the selective sweep region of the CSS group (Figure