Revealing Distinctions in Genetic Diversity and Adaptive Evolution Between Two Varieties of Camellia sinensis by Whole-Genome Resequencing

Camellia sinensis var. sinensis (CSS) and C. sinensis var. assamica (CSA) are the two most economically important tea varieties. They have different characteristics and geographical distribution. Their genetic diversity and differentiation are unclear. Here, we identified 18,903,625 single nucleotide polymorphisms (SNPs) and 7,314,133 insertion–deletion mutations (indels) by whole-genome resequencing of 30 cultivated and three wild related species. Population structure and phylogenetic tree analyses divided the cultivated accessions into CSS and CSA containing 6,440,419 and 6,176,510 unique variations, respectively. The CSS subgroup possessed higher genetic diversity and was enriched for rare alleles. The CSA subgroup had more non-synonymous mutations and might have experienced a greater degree of balancing selection. The evolution rate (dN/dS) and KEGG enrichment indicated that genes involved in the synthesis and metabolism of flavor substances were positively selected in both CSS and CSA subpopulations. However, there are extensive genome differentiation regions (2959 bins and approximately 148 M in size) between the two subgroups. Compared with CSA (141 selected regions containing 124 genes), the CSS subgroup (830 selected regions containing 687 genes) displayed more selection regions potentially related to environmental adaptability. Fifty-three pairs of polymorphic indel markers were developed. Some markers were located in hormone-related genes with distinct alleles in the two cultivated subgroups. These identified variations and selected regions provide clues for the differentiation and adaptive evolution of tea varieties. The newly developed indel markers will be valuable in further genetic research on tea plants.


INTRODUCTION
The tea plant (Camellia sinensis (L.) O. Kuntze) is a perennial evergreen woody plant (2n = 2x = 30) that originated in southwest China. The plant was first cultivated over 2,000 years ago (Xia et al., 2017;Wei et al., 2018). Tea is one of the most popular non-alcoholic drinks worldwide, reflecting its economic and cultural importance, as well as health benefits. C. sinensis var. sinensis (CSS) and C. sinensis var. assamica (CSA) are two main varieties of tea trees and are the most widely cultivated varieties globally. CSA are mainly cultivated in tropical regions, while CSS are introduced into most tropical and subtropical regions due to its stronger cold resistance. The distinct morphological and biochemical characteristics between CSS and CSA are mainly attributed to their different geographical distributions and growth environments (Hao et al., 2018;Li et al., 2019). For example, CSA is generally rapidly growing with a larger leaf area compared with CSS. Significant differences in the contents of flavonoids, caffeine, and theanine have been demonstrated between the two varieties (Wei et al., 2018;Zhu et al., 2019). Tea plants are self-incompatible, which promotes relatively highly heterozygous progenies and may produce rich genetic variations. Therefore, revealing the interspecific differences and genetic diversity at the whole-genome level is important for the understanding of the genetic mechanism of tea plant growth and development.
With the loss and acquisition of genes, almost all cultivated crops have undergone different degrees of domestication and improvement to adapt to the environment and meet human needs. Different domestication and improvement purposes will result in different genome characteristics. With the rapid development of sequencing technology, whole-genome sequencing and large-scale resequencing can be applied to various plant species. Sequence alignment can detect many sites of variation in genetic and intergenic regions. This includes some non-synonymous mutations located in the coding or regulatory regions, which have become important resources for studying gene functional and genetic differentiation. Wholegenome resequencing has been widely applied in various plant species, such as the detection of variation sites in cotton (Shen et al., 2017;Wang et al., 2017;Du et al., 2018), genome association analysis in rice (Huang et al., 2013;Wang et al., 2016), origin and evolutionary analysis in pears , and quantitative trait locus (QTL) mapping of important agronomic traits in melo (Hu et al., 2017). A study reported that CSA and CSS may have diverged 22,000 years ago and were subsequently domesticated independently (Meegahakumbura et al., 2016). This may have resulted in different tree height, leaf area, resistance, and contents of quality component between the two subgroups. However, due to insufficient reliable evidence, the regions of divergence between CSS and CSA in the genome have not been elucidated.
Due to the high heterozygosity (2.7%) and large genome size (approximately 2.94 Gb), it is very difficult to explore the genetic differentiation of tea plants on the whole-genome scale (Xia et al., 2020b). With the determinations of the sequences of the "Yunkang 10" and "Shuchazao" genomes (Xia et al., 2017;Wei et al., 2018), it has become possible to study the domestication and improvement of tea plants using a whole-genome resequencing strategy. Previously, the domestication origin and breeding history of tea plants were mainly studied with a small number of simple sequence repeat (SSR) or single nucleotide polymorphism (SNP) markers. For instance, the comparison of the genetic diversity of Chinese and Indian tea plants with 23 pairs of SSR primers indicated the possibility of three independent domestication events for Chinese tea, Chinese CSA, and Indian CSA (Meegahakumbura et al., 2016(Meegahakumbura et al., , 2017. A total of 15,444 SNPs from 18 cultivated and wild tea accessions were identified based on RAD sequencing. Thirteen genes containing missense mutations had strong artificial selection signals (Yang et al., 2016). Genetic diversity, linkage disequilibrium, and population structure analyses were performed on 415 accessions from Guizhou Province, China, using 79,016 high-quality SNPs by the genotyping by sequencing method (Niu et al., 2019). However, due to the insufficient sample size or uneven distribution of markers, many important sites of variation or genetic regions between CSS and CSA have not been detected. To further clarify genetic differentiation and adaptive evolution of CSS and CSA, whole-genome resequencing of different tea tree cultivars is needed.
Recently, Pacbio sequencing and high-throughput/resolution chromatographic capture (Hi-C) technology has more accurately assembled the tea tree genome into 15 pseudo chromosomes with a genome size of approximately 2.94 G (Xia et al., 2020a). In the present study, the SNPs and insertion-deletion mutations (indels) of variant sites were obtained by aligning the sequenced fragments of 33 samples to the latest reference genome. These variant sites, especially those located in genetic regions, will provide important genetic resources for the functional research and breeding projects of tea trees. Novel SNP and indel patterns of tea cultivated species were revealed by comprehensive resequencing. The genetic structure of CSS and CSA was analyzed using a high confidence SNP variation site set. A large number of mutation sites unique to CSS or CSA were identified. In addition, through the evaluation of F-Statistics (Fst), some divergence genetic regions were identified from the genome. Genes located in these regions may produce functional changes in CSS and CSA, including the acquisition, silencing, enhancement, and inhibition of gene functions. Others, in this project, a large number of stable and reliable indel markers located in these genes have been developed, which will provide important support for exploring the domestication and improvement, mining functional genes, and QTL mapping of tea trees.

Sample Collection and Genome Resequencing
from ancient tea area in Menghai County, Yunnan Province. The above 33 germplasms were submitted for re-sequencing, and their detailed information was shown in Supplementary  Table 4 and Supplementary Figure 1. The plucked leaf samples were stored at −80 • C until their genomic DNA was extracted by cetyl trimethylammonium bromide and a sequencing library was constructed. Qualified genomic DNA was cut into 350 bp fragments by ultrasound to build the DNA libraries. Paired-end sequencing libraries with an insert size of 150 bp were constructed according to the manufacturer's instructions for sequencing on the Hiseq 2,500 platform (Illumina, San Diego, CA, United States). After removing the reads with an adapter, the reads with N content >10% and low-quality reads (sequences whose bases with a quality value of ≤10 account for more than 50% of the total sequence length), clean reads were obtained.

Sequence Alignment, Variant Calling From Resequencing Data, and Filtering
The clean reads were mapped to a chromosome level tea tree reference genome of 'Shuchazao' using BWA-MEM (version 0.7.10) with parameters "−T 20 −k 30". Picard-1.124 was used to filter redundant reads (MarkDuplicates) to ensure the accuracy of the variant calling. The SNP and small indel variant locus were called by the GATK-4.1.8.1-HaplotypeCaller method to generate a gvcf format file for each sample. They were merged by GATK-CombineGVCFs to obtain the population variations vcf file. The mutation sites were filtered using the following criteria: (1) SNPs within 5 bp near indel and adjacent indels within 10 bp were filtered out based on a perl script included in bcftools; (2) the number of variations in the 5 bp window should not exceed 2; (3) QUAL < 30| | QD < 2.0| | MQ < 40| | FS > 60.0 was used in variant filtration in GATK-4.1.8.1, the Ts/Tv (transition/transversion), and the heterozygous rate of each sample was recorded; (4) the plink-1.9 was used to filter the population variant locus with MAF < 0.05 and missing data rate >10%.

Population Stratification, Phylogenetic Analysis, and PCA
Considering the reliability of the analysis results, we first removed the linkage disequilibrium SNP markers using a sliding window method with plink-1.9. The window length, step size, and r 2 threshold were set to 50, 10, and 0.2 bp, respectively. The newly obtained vcf files were converted into phylip, map, ped, and bed files using tassels-5.2.50 and plink-1.9. Finally, variant sites of SNPs were used for population stratification, phylogeny analysis, and PCA. The optimal subgroup k value tested from 2 to 8 and population stratification analysis were performed by admixture-1.3.0 using default parameters. The result was drawn using the pophelper-2.3.0 package. MEGAX software was used to build a phylogenetic tree of 33 tea germplasms with 1000 bootstrap replicates by the neighbor-joining method. The phylogenetic tree was adjusted using the FigTree-1.4.3 software. PCA was performed by GCTA-1.9.

Genomic Diversity, Unique Mutation Sites, and Genetic Differentiation Region Whole-Genome Screening
Based on the VCFtools and SnpEff packages, the nucleotide diversity (π), Tajima's D, and pairwise F-Statistics (Fst) values were calculated with a 500 kb window size and 50 kb step. The genomic regions with a Fst value ≥0.368 (top 5% of the Fst value) were used as differentiation bins. Unique SNPs and indels between CSA and CSS were screened and annotated separately. According to the annotated vcf file, genes with dn/ds (non-synonymous/synonymous) >1 and located in differentiation regions were picked from the genome for KEGG annotation and enrichment analysis using a web program 1 . Selective scan analysis was performed as previously described (Xia et al., 2020a).

Primer Design and Validation of the Indel Markers
To verify the accuracy of genotyping and develop indel markers for genetic research, indels with a length ≥5 bp were used as candidate loci. A total of 120 unique primer pairs were designed based on the flanking sequences of indel loci through the Primer 3.0 program. The length of the primers was set as 20 to 22 with an optimum length of 21 bp. The primer pairs were preliminary screened on eight tea cultivars, including four CSS ("Anhui 3, " "Longjing 43, " "Tieguanyin, " and "Shancha 1") and four CSA ("Hekai 33, " "Nanuoshan 14, " "Yunkang 10, " and "Zijuan"). The primer pairs with polymorphism and unambiguous amplification bands were screened by polymerase chain reaction (PCR) and the Fragment Analyzer 96 (Advanced Analytical Technologies, Inc., Ames, IA, United States) for further identification of 30 tea samples. Detailed information of the 30 tea samples is provided in Supplementary Table 5. Subsequently, according to the PCR amplification results, He and Ho were calculated using Popgene32 software. The number of alleles (Na), MAF, and PIC were calculated using PowerMarker3 software. A more detailed approach was described in previous studies (Liu et al., , 2019. Transcriptome data used in this study were downloaded from NCBI (PRJNA52233).

Detection of Genome-Wide Variant Sites
Genomes of 33 tea accessions, including 15 CSS and 15 CSA, from 14 provinces in China were sequenced using the Hiseq2500 platform. Three wild taliensis tea (CTS1, CTS2, and CTS3) were used as the outgroup (Table 1). A total of 1570 Gb high-quality clean reads were obtained with an average Q30 value of 91.3% and average GC content of 39.4%. Approximately 48 Gb of clean data was obtained for each sample, with an average sequencing depth of 16 and 1× coverage of more than 88.7% (Supplementary Table 1). These highquality sequences were aligned to the chromosome level highprecision genome (Xia et al., 2020a). The average mapping rate produced 96.53% alignment and proper mapping reached 84.68% (Supplementary Table 1). Among the 33 tea accessions, 18,903,625 SNPs were identified with an average density of 8.5/kb. Except for the 1,116,245 multiple allele SNPs (5.9%), the remaining 17,787,380 were classified into transitions (Ts: G/A and C/T) and transversions (Tv: A/C, A/T, C/G, and G/T). Transitions and transversions accounted for 67.6% (12,785,494) and 26.5% (5,001,886) of the total number of SNPs, respectively, with a transition/transversion (Ts/Tv) ratio of 2.56. A total of 7,314,133 indels were identified, including 2,095,200 insertion variants, 3,784,855 deletion variants, and 1,434,078 multiple allele indel loci, with an average indel density of 4.9/kb. The number of mutation sites contained in different samples varied from the lowest of 4,926,922 to the highest of 67,405,39 (Table 1). Calculations based on SNP mutations showed that, except for observed heterozygosity (Ho) with large variations with values of 0.08 to 0.21, the expected heterozygosity (He), Ts/Tv, and non-synonymous/synonymous (Non-syn/Syn) ratios were almost equal for all cultivated varieties ( Table 1).
According to the genetic background, in addition to the three CTS as the outgroup, all other samples were divided into two subgroups of CSS and CSA, with 15 samples in each subgroup. The number of SNPs in the intergenic region and single nucleotide indels were predominantly large in each subgroup. Although the average number of SNPs (3,867,825) and indels (1,708,296) in the CSS subgroup for each individual was smaller than the average (4,308,533 SNPs and 1,726,145 indels) of each individual in the CSA subgroup, the total SNPs (13,034,839) and indels (4,326,311) were more abundant than the CSA subgroup (12,507,724 SNPs and 4,189,516 indels). These findings suggested that the CSS subgroup had a larger number of rare alleles. However, for the CSA subgroup, there were more SNPs located in the genic region (963,993) and non-synonymous mutations (132,905 non-synonymous mutations with a Non-syn/Syn ratio of 1.408), while the Non-syn/Syn ratio (1.424) of the CSS subgroup was only slightly and insignificantly higher (Figure 1). A large number of unique variations were found between CSS (5,099,732 unique SNPs and 1,340,687 unique indels) and CSA (4,572,617 unique SNPs and 1,603,893 unique indels). These loci showed similar patterns in the two subgroups (Figure 1).
The phylogenetic analysis showed that the clustering patterns obtained by the three analyses were highly consistent. CSS species grown in tropical and subtropical regions were classified into the first category, and CSA species grown in tropical regions were classified into the second category. These classifications were highly consistent with their geographical distributions (Figure 2). The phylogenetic analysis revealed a complex interspecific relationship of CSS or CSA. "Xicha 5" from Jiangsu Province and "Longjing 43" from Zhejiang Province were clustered into one group with a bootstrap support of 65, "Anhui 3" from Anhui Province and Henan Province "Xinyang 10" was clustered into one group with a bootstrap support of 66, "Zhennong 117" with the blood of "Fudingdabai" from Zhejiang Province, "Baihaozao" from Hunan Province, and "Fudingdabai" from Fujian Province and "Chuanmu 28" from Sichuan Province were closely clustered into one group with a bootstrap support of 100. These findings implied that tea varieties from different regions may share the same parent. Interestingly, two tea varieties ("Guihong 3" and "Yaoshanxiulv") from Guangxi Province closely clustered into one group. One of two tea varieties from Guangdong Province formed a single clade ("Xiuhong") and the other variety ("Yinghong 9") as well tea varieties from Yunnan Province were tightly clustered with a bootstrap support of 94 (Figure 2A). These findings suggested a different genetic origin, even for varieties from the same cultivated area.

Genomic Genetic Differentiation and Evolutionary Patterns Between CSS and CSA
To further evaluate the genetic diversity and population differentiation of CSS and CSA, we calculated the nucleotide diversity (π) and Tajima's D values (Figure 3). The CSS group showed slightly higher nucleotide diversity (π mean = 0.744 × 10 3 , median = 0.776 × 10 3 , and std = 0.329 × 10 3 ) than the CSA group (π mean = 0.683 × 10 3 , median = 0.683 × 10 3 , and std = 0.316 × 10 3 ). This result was consistent with the number of mutation sites they contained. Interestingly, the values of π and Tajima's D showed opposite trends, compared with CSS (Tajima's D mean = 0.208, median = 0.143, and std = 0.838), and higher Tajima's D in CSA (mean = 0.436, median = 0.421, and At the same time, the ratio of non-synonymous and synonymous mutations (dN/dS = ω) was calculated to explore the evolutionary direction of genes (Xanthopoulou et al., 2019;Ogutu et al., 2020). When ω > 1, 14,594 and 14,865 genes were selected from CSS and CSA subgroups, respectively. These genes were considered to have undergone positive selection. When ω < 1, 7,202, and 7,414 genes from CSS and CSA, respectively, that may have undergone purification selection were found. Kyoto Encyclopedia of Genes and Genomes (KEGG) of the positively selected genes revealed that genes related to tea quality, such as terpenes, phenylalanine, and other secondary metabolites biosynthesis, were significantly enriched in CSS or CSA subpopulations (Figure 4).
In addition, to estimate population differentiation and to screen differentiated genetic regions between the two groups, pairwise F-Statistics (Fst) values (Fst mean = 0.2336, median = 0.2285, and std = 0.0825) were calculated based on the whole genomic SNP data. The degree of differentiation of the two subpopulations differed in 15 putative chromosomes.
It should be noted that Fst may be <0 when the degree of variation in the subpopulation is greater than that of the whole population ( Figure 3C). When the threshold was set to 0.368 (top 5% of the Fst value), 5121 genes (10.3% of the total genes in tea plant) contained in 2959 bins were identified as candidate genes. Subsequently, KEGG enrichment analysis of these genes revealed that the top 20 KEGG enrichment pathways involved plant hormone signal transduction, pyruvate metabolism, cutin, suberin and wax biosynthesis, anthocyanin biosynthesis, photosynthesis, and others (Supplementary Figure 2).

Selection Sweep Between CSS and CSA
Adaptation to nature and artificial selection will leave traces on the gene pool of the population. Xia et al. identified 510 windows and 1569 genes related to domestication and improvement (Xia et al., 2020a). The selection region between CSS and CSA subsets has not yet been identified. Based on Fst and θπ for selective sweeping analysis, we identified 830 and 141 selection regions in CSS and CSA populations, which contained 687 and 124 genes, respectively (Figure 5). Only 85 genes were described in previous studies. Further functional annotation of these genes revealed their involvement in environmental adaptation, metabolism, and genetic information processing. Some of these genes that are related to growth and development, such as carotenoid synthesis, cell division, photorespiration, and stress resistance, may lead to different environmental adaptability and morphological characteristics of the two subgroups. Detailed selection regions and gene information are presented in Supplementary Table 2.

Development of Indel Markers Related to Genetic Differentiation
We randomly selected 120 indel loci located in the differentiation region to develop indel markers. These primers were preliminarily screened with eight tea varieties. After removing 34 non-polymorphic and 33 ineffective amplification, 53 primer pairs with clear polymorphic amplification bands were successfully selected and further applied to 30 tea varieties to detect their cross-varieties/species transfer capability (Figure 6  and Supplementary Figure 3).
Among them, 24, 20, and 9 indel markers showed high polymorphism, moderate polymorphism, and low polymorphism in 30 tea varieties, respectively . The polymorphism information content (PIC) values ranged from the lowest 0.0624 (ChrInDel16) to the highest 0.7883 (ChrInDeL11), with an average of 0.4494. In addition, the major allele frequency (MAF), number of alleles (Na), observed heterozygosity (Ho), and expected heterozygosity (He) were also used to evaluate newly developed indel markers ( Table 2). The MAF values ranged from 0.30 (ChrInDeL11, ChrInDel35) to 0.9667 (ChrInDeL16) with an average value of 0.6050. The Na values ranged from 2 (ChrIndeL15, ChrIndeL16, ChrIndeL21, ChrIndeL24, ChrIndeL30, ChrIndeL32, ChrIndeL33, and ChrInDeL53) to 8 (ChrInDeL12 and ChrIndeL13) with an average of 4. Ho ranged from 0 (ChrIndeL16 and ChrIndeL35) to 1 (ChrInDeL42 and ChrInDel51) with an average value of 0.449. He ranged from 0.0655 (ChrInDeL16) to 0.8266 (ChrInDeL11) with an average value of 0.5137. It should be pointed out that the PIC and He values of these markers were not only similar but also have the same change trend, while the change trend of MAF was exactly opposite to this trend. In addition, some markers had higher application value due to different allele patterns in the CSS and CSA subgroups. For example, ChrInDeL01 correctly divided the 30 samples used for marker verification into the CSS and CSA subgroups. ChrInDeL02 and ChrInDeL29 had opposite allelic distribution characteristics and the expression levels of the genes in which they are located differed significantly between the two CSS and CSA cultivars (Figure 6). More detailed genetic information and amplification of these newly developed markers are shown in Table 2 and Supplementary Figure 3. The sequences of 53 primer pairs and their genomic positions are listed in Supplementary Table 3. Using these newly developed indel markers, we performed phylogenetic construction on 30 tea plants. These 30 tea plant varieties were clearly divided into the CSS and CSA subgroups in a manner that was strongly consistent   with their breeding history, geographic distribution, and leaf shape characteristics (Supplementary Figure 4). These results suggested that the markers have high polymorphism and good mobility, and can be valuable for molecular-assisted breeding, fingerprinting, and functional research.

DISCUSSION
A long breeding history usually reduces the genetic diversity of crops. Therefore, it is important to study the origin and domestication of species for crop improvement and domestication (Zsogon et al., 2018). Since no credible wild ancestor of CSS has been found, it is generally believed that CSS originated from southwest China and underwent independent domestication to adapt to various living environments (Meegahakumbura et al., 2016). However, it is puzzling that cultivated tea plants showed higher genetic diversity than wild tea plants in previous studies (Yang et al., 2016;Niu et al., 2019). More critically, important differentiation regions and genes of CSS and CSA variants in their genomes have not been identified.
In this study, 15 CSS, 15 CSA, and 3 CTS tea varieties were subjected to high-depth whole-genome resequencing to explore the evolutionary patterns of the different subpopulations. After filtering out low-quality reads, a total of 1570 Gb of high-quality resequencing clean data was obtained, with an average depth of 16× for each sample. By aligning clean reads to the reference genome, and then detecting and strictly filtering mutation sites with various software, we finally identified 18,903,625 SNPs and 7,314,133 indels among 33 tea varieties, with average SNP and indel densities of 8.5 and 4.9/kb, respectively. In the process of evolution, structurally similar bases are easily replaced, and short indel mutations are less harmful to species (Yang and Yoder, 1999;Saxena et al., 2014). Therefore, the most abundant types of SNP and indel mutations are transition (Ts: G/A and C/T) and single nucleotide indels, respectively. Similar research results have been reported in rice , cotton (Shen et al., 2017), tomato (Causse et al., 2013), quinoa , mume and other crops . It should be noted that for each sequenced sample, He, Ts/Tv, and the ratio of nonsynonymous substitutions to synonymous substitutions were almost the same, except that Ho was different (from 0.08 to 0.21). This may be caused by long-term interspecific hybridization and suggests that different tea tree varieties may have common ancestry and origin. Compared with other plants, such as pear  and soybean (Lam et al., 2010), we observed that the ratio of non-synonymous substitutions to synonymous substitutions of tea trees was higher. This ratio is comparable to millet (Bai et al., 2013) and, but lower than that of lotus .
Mutation sites in genic regions may have important effects on plant growth and development (Jin et al., 2016(Jin et al., , 2018Xanthopoulou et al., 2019;Wang et al., 2020). For example, a single nucleotide mutation in gibberellic acid receptor GID1 could cause a gibberellic acid-insensitive dwarf phenotype in peaches (Cheng et al., 2019). Shang et al. (2014) reported that SNP-1601, located upstream of the start codon of the gene Csa5G157230 can regulate the bitter taste of cucumber through extensive resequencing and genome-wide association study. Presently, up to 91.5% of SNPs were determined to be located in intergenic regions, with only 8.5% of SNPs located in genes, and 329,698 SNPs located in CDS regions were annotated as non-synonymous mutations. Thus, tea plants may have diversified phenotypes and adaptability. Up to 6,440,419 and 5,776,510 unique mutations were identified in CSS and CSA subpopulations, respectively, indicating that the two subpopulations have undergone extensive genetic differentiation at the genome-wide level. Interestingly, compared with the CSA subgroup, the total number of mutation loci in the CSS subgroup was larger, while each sample contained fewer mutation sites on average. One possible explanation is that the samples in the CSS subgroup have a wider geographical origin, and the complex natural environment drives the formation of more adaptive mutations. The data will be important in studies to genetically improve tea trees.
Cultivars of sinensis tea and CSA are the two most important tea tree varieties. They vary widely in geographical distribution, morphological characteristics, and flavor substance content. Generally, CSS, which features small leaf morphology and higher cold resistance, is widely distributed in tropical and subtropical regions. CSA, with larger leaves and stronger growth but lower cold resistance, is distributed in tropical regions. Long-term outcrossing has formed a complex blood relationship between CSS and CSA (Zhao et al., 2014a;Xia et al., 2017;Wei et al., 2018;Liu et al., 2019). In this study, we used the high confidence of 552,750 non-linkage disequilibrium SNPs obtained from sequencing to perform population structure analysis, PCA, and phylogenetic tree construction of 33 samples to evaluate their genetic diversity. Despite different evaluation strategies being used, the 33 samples were clearly divided into three subgroups of CSS, CSA, and CTS, which proved the accuracy of our genotyping. In the CSS and CSA subgroups, 4 and 9 varieties were mixed in the pedigree analysis (Figure 2). This indicated frequent gene communication between the CSS and CSA subgroups. Their genetic background was obviously different from that of the CTS subgroup.
The domestication history of tea plants has not been clarified. It is generally believed that to improve the flavor of tea, people have only made changes to very limited number regions and these regions occupy a very low ratio in the genome. The significant difference between CSS and CSA may be mainly affected by natural selection. By calculating the π, Tajima's D, and Fst values of CSS and CSA subgroups independently, we found that the nucleic acid diversity of the CSA (π mean = 0.683 × 10 3 ) subgroup was slightly lower than that of CSS (π mean = 0.744 × 10 3 ), but was significantly higher than that of pear and cotton (Du et al., 2018;Wu et al., 2018). Interestingly, the CSA subgroup displayed a higher Tajima's D value (mean = 0.436) and the Tajima's D-value of the CSS subgroup (mean = 0.208) was approximately half that of CSA, indicating that the CSA subgroup lacks rare alleles due to stronger natural selection. This is probably because the CSA tea varieties in China are mainly cultivated in fewer provinces that feature similar climatic conditions. The 2959 differentiation and 971 selection regions identified from the genome will inform further studies to understand the genetic differentiation and environmental adaptability between the CSS and CSA subgroups. Enrichment analysis revealed that auxin-associated genes or pathways accounted for the highest abundance, suggesting that there may be different growth and development mechanisms between CSS and CSA species because they play important roles in plant root growth, leaf development, and biological and abiotic resistance.
Hormones play an indispensable role in the processes of plant growth and resistance to stress (Brian, 1978;Dreher and Callis, 2007;Blázquez et al., 2020). Tea trees with small leaves cultivated in subtropical regions stop growing due to the cold and reduced light time, while tea trees cultivated in tropical regions maintain a strong growth potential and sprout young shoots. The plant hormone signal transduction pathway contains some of the genes involved in root (GO:0048767 and GO:0010086) and leaf (GO:0048366, GO:0010150, and GO:0048831) development, ultraviolet (UV) and far-red light response (GO:0010224 and GO:0010218), cold (GO:0009409 and GO:0070417), and drought resistance (Supplementary Figure 2). We speculate that these genes may be related to the developmental regulation and ecological adaptability of the two subpopulations. For instance, UV-B light and the abscisic acid receptor PYL8 can regulate cotyledon root development and stress adaptation by MYB13 interaction (Zhao et al., 2014b;Qian et al., 2020). The abscisic acid receptor PYL9 improves drought resistance by regulating leaf senescence . Transcriptome analysis of tea trees has revealed that the plant hormone signal transduction pathway contains many key genes in response to cold treatment (Zheng et al., 2016;Ban et al., 2017;Hao et al., 2018). Interestingly, we found that PYL8 and PYL9 showed significant expression differences in some CSS and CSA varieties, and the distinct genotype characteristics of the two indel markers (ChrInDeL02 and ChrInDeL29) may cause the two genes to show opposite expression trends, thus regulating the growth potential and environmental adaptability of CSS and CSA (Figure 6). In addition, a unique indel mutation was identified in the exon of the AUX/IAA18 gene (relating to plant root development) between the CSS and CSA varieties (Figure 6 and Supplementary  Figure 3), implying that IAA18, PYL8, and PYL9 may coregulate the growth and development of the root system of tea plants (Wang et al., 2005;Ploense et al., 2009;Notaguchi et al., 2012). Although many mutation sites have not been verified, the results provide important data and clues to understand the genetic differentiation and application of marker-assisted selection of tea plants.
Molecular markers are widely used in genetic map construction, map-based cloning, and population diversity analysis. As a third-generation molecular marker, indels have the characteristics of wide genome distribution, high polymorphism, and good reproducibility (Branham et al., 2018;Deokar et al., 2019). Through preliminary screening, 53 indel markers were selected. They were able to amplify clear polymorphic bands in 30 tea varieties (Supplementary Figure 3). The average PIC content of these indel markers was 0.451, slightly lower than previously developed SSR markers . Nonetheless, they are more valuable considering that many markers are located in introns or annotated as non-synonymous mutations . These markers were used for population structure analysis and phylogenetic tree construction. CSS and CSA varieties were correctly divided into two categories (Supplementary Figure 4), which proves that these newly developed markers have sufficient identification for different tea varieties and can be applied to more extensive genetic research.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi.nlm. nih.gov/, PRJNA597714.

AUTHOR CONTRIBUTIONS
YA, SL, and CW conceived and designed the research project. YA and XM performed data analysis and manuscript drafting. YA conducted DNA extraction, primer design, PCR amplification, SNP, and indel marker validation. XM, SZ, XX, and RG were involved in sample collection and data analysis. All authors read and approved the manuscript.