A phylogenomic study of Iridaceae Juss. based on complete plastid genome sequences

The plastid genome has proven to be an effective tool for examining deep correlations in plant phylogenetics, owing to its highly conserved structure, uniparental inheritance, and limited variation in evolutionary rates. Iridaceae, comprising more than 2,000 species, includes numerous economically significant taxa that are frequently utilized in food industries and medicines and for ornamental and horticulture purposes. Molecular studies on chloroplast DNA have confirmed the position of this family in the order Asparagales with non-asparagoids. The current subfamilial classification of Iridaceae recognizes seven subfamilies—Isophysioideae, Nivenioideae, Iridoideae, Crocoideae, Geosiridaceae, Aristeoideae, and Patersonioideae—which are supported by limited plastid DNA regions. To date, no comparative phylogenomic studies have been conducted on the family Iridaceae. We assembled and annotated (de novo) the plastid genomes of 24 taxa together with seven published species representing all the seven subfamilies of Iridaceae and performed comparative genomics using the Illumina MiSeq platform. The plastomes of the autotrophic Iridaceae represent 79 protein-coding, 30 tRNA, and four rRNA genes, with lengths ranging from 150,062 to 164,622 bp. The phylogenetic analysis of the plastome sequences based on maximum parsimony, maximum likelihood, and Bayesian inference analyses suggested that Watsonia and Gladiolus were closely related, supported by strong support values, which differed considerably from recent phylogenetic studies. In addition, we identified genomic events, such as sequence inversions, deletions, mutations, and pseudogenization, in some species. Furthermore, the largest nucleotide variability was found in the seven plastome regions, which can be used in future phylogenetic studies. Notably, three subfamilies—Crocoideae, Nivenioideae, and Aristeoideae—shared a common ycf2 gene locus deletion. Our study is a preliminary report of a comparative study of the complete plastid genomes of 7/7 subfamilies and 9/10 tribes, elucidating the structural characteristics and shedding light on plastome evolution and phylogenetic relationships within Iridaceae. Additionally, further research is required to update the relative position of Watsonia within the tribal classification of the subfamily Crocoideae.

Prior to the molecular approach, the association of Iridaceae within the order Asparagales was not widely accepted solely based on morphology (Dahlgren and Bremer, 1985). Recent molecular analyses of monocotyledons have consistently defined Iridaceae in the order Asparagales, among the non-core asparagoid grade, associated with families such as Doryanthaceae, Ixilirionaceae, and Tecophiliaceae (Kim et al., 2010). Goldblatt significantly discussed the morphological characteristics of Iridaceae and assumed that the putative inherited characteristics support a place in the Asparagales rather than Liliales (Goldblatt et al., 1998;Tillich, 2003). In addition, several molecular studies have undergone revised subfamilial circumscriptions and have been classified by various authors (Dahlgren and Bremer, 1985;Goldblatt, 1990;Rudall, 1994;Rudall, 1995;Souza-Chies et al., 1997;Reeves et al., 2001). The most updated phylogenetic reconstruction of Iridaceae is that of Goldblatt, using molecular data derived from five coding regions (rbcL, rps4, trnL-F, matK, and rps16), validating the majority of classification aspects and recognizing seven subfamilies within Iridaceae (Isophysidoideae consisting of the monospecific Isophysis tasmanica, solely with superior ovary, Nivenioideae, Iridoideae, Crocoideae, Geosiridoideae, Aristeoideae, and Patersonioideae) (Goldblatt et al., 2008). Although these studies provide insights into the molecular systematics of Iridaceae and employ limited DNA regions, comprehensive phylogenetic studies with plastid genome sequences have not yet been conducted.
Chloroplasts are a significant semi-autonomous plant cell organelle with unique genetic information. Plastid (cp) genomes are maternally, paternally, or biparentally inherited within the angiosperm lineage (Birky Jr, 1995). In recent years, the plastid genome database has provided new opportunities for determining the evolutionary and phylogenetic relationships among plants because of their unique characteristics, such as small size, highly conserved gene content and order, and low recombination rates. Plasmid genomes have been used to investigate evolution and phylogeny. Typically, angiosperms have highly conserved cp or plastid genomes that are circularly organized and contain a large single-copy region (LSC) and a small single-copy region (SSC) separated by paired inverted repeats (IRs). In general, the cp genomes of terrestrial plant species include 120-130 distinct genes, the majority of which encode proteins associated with photosynthesis (approximately 79), with the remaining genes encoding transfer RNAs (tRNAs) (approximately 30) and ribosomal RNAs (rRNA) (4) (Dyer, 1984;Sugiura, 1992). The rapidly increasing accessibility of characterized plastomes determined by cost-effective next-generation sequencing (NGS) can help elucidate our understanding and facilitate insights into chloroplast gene function and evolution.
Currently, the genome sequences of some genera of the subfamily Iridoideae (Iris spp.), Geosiridoideae (Geosiris australiensis and Geosiris aphylla), and Crocoideae (Crocus spp.) are openly accessible in the National Center for Biotechnology Information (NCBI) (Joyce et al., 2018;Nemati et al., 2019;Kang et al., 2020). Furthermore, independent mycoheterotrophic and nonphotosynthetic Geosiris species have had significant gene losses (Joyce et al., 2018). Therefore, more information should be acquired regarding the plastid genome resources of Iridaceae species. A comparative examination of the plastome sequences of various species can help investigate sequence variability and molecular evolutionary processes related to gene duplication, transfer, and rearrangement measures (Walker et al., 2014;Yan et al., 2018). We annotated and compared 24 newly sequenced representative taxa to understand the plastome evolution of Iridaceae in terms of their structure, gene contents and rearrangements, and IR expansion and contraction. Based on new genomic data and already available genomes in NCBI, we performed the first comprehensive phylogenomic analysis representing all seven subfamilies and tribes, except Tritoniopsis, to identify molecular signatures comprising insertions and deletions (INDELS), gene losses, and pseudogenization in DNA sequences that are uniquely shared by different species within Iridaceae. In this study, the cp genome sequence was obtained, and its gene content and organization were assessed. The second step involved a comparative analysis of several Iridaceae species. Finally, high nucleotide diversity allowed for the identification of genes that were extremely variable and informative. The plastid genome sequences can help reveal the evolutionary and phylogenetic relationships among the seven subfamilies and tribes of Iridaceae.

Taxon sampling and DNA extraction
Fresh leaves of the representative taxa were collected and dried in silica gel until DNA extraction (Table 1). For the ingroup dataset, a total of 31 species, including 24 novel plastid genomes representing all seven Iridaceae subfamilies, were selected for analysis. Eleven outgroup species were used for comparison. The modified cetyl trimethyl ammonium bromide method was used to isolate total DNA (Doyle and Doyle, 1987) from the silica-dried samples. We also extracted the total cellular DNA from the three closely related Asparagales families (listed in Table 1) to choose the ideal set of outgroup species to construct phylogenomics. The extracted DNA concentration was measured using a Biospec-nano (Shimadzu) Spectrophotometer and tested using 1% agarose gel electrophoresis.

Genome sequencing, assembly, and annotation
Libraries were sequenced using the Illumina MiSeq sequencing platform (Illumina, Seoul, Korea). Each sample produced at least 1 GB of raw data. The average length of the paired-end reads (raw reads) was 301 bp. Before assembly, the raw reads were trimmed to sequences with more than 3% probability of error per base using Geneious prime 2020.1.2 (Kearse et al., 2012) to exclude poor-quality reads. The cp genome of Iris sanguinea (GenBank accession KT626943) was used as a reference for assembled reads by performing a "map to reference". The isolated reads were then extracted and reassembled using the option "De novo Assembly" in Geneious prime 2020.1.2 (Kearse et al., 2012), followed by the completion of this step until quadripartite structures were obtained. We checked and analyzed the gene boundaries for each protein-coding gene by using the Iris sanguinea plastome as a reference with default settings using 80% similarity to annotate genes. The assembled plastid genomes were annotated using GeSeq (Tillich et al., 2017), and all tRNAs were set using tRNAScan-SE (Chan and Lowe, 2019) by default search. Full circular maps were generated using OGDRAW software (Greiner et al., 2019).

Comparative genomic analysis
The 24 complete plastid genome sequences ranging from 150,062 to 164,622 bp were compared and aligned in MAFFT (Kearse et al.,  2012) and visualised in an mVISTA plot using Shuffle-LGAN mode (Frazer et al., 2004). Iris loczyi was used as the reference genome. The nucleotide diversity (Pi values) of protein-coding genes and RNA genes (tRNA and rRNA) among Iridaceae species was calculated using the DnaSP 6.0 program (Rozas et al., 2017). Any major structural changes, including junctions, boundaries (LSC/IRs and SSC/IRs), and IR expansions and contractions, were investigated using IRscope (Amiryousefi et al., 2018).

Relative synonymous codon usage analysis and selective pressure estimation
The relative synonymous codon usage (RSCU) was examined from the 79 protein-coding genes using DAMBE program to determine the comparative inclination of a specific base encoding the necessary amino acid codon (Xia and Xie, 2001). The values of RSCU greater than one were regarded as having a higher codon frequency.
The synonymous (Ka) and non-synonymous (Ks) substitution ratios of ycf2 gene were calculated using the KaKs calculator (http:// services.cbu.uib.no/tools/kaks). The average K a /K s of ycf2, however, was >1, indicating positive selection. Thereafter, selective pressure estimation was carried out to investigate positive selection on the ycf2 gene of 12 species covering three subfamilies. The ycf2 sequences were individually extracted and aligned using MEGA6 (Tamura et al., 2013). Positive selection was identified using the branch-site model in the codeml program, which was implemented in the PAML v4.9 package (Yang, 2007). The codon frequencies were predicted using the F3 × 4 model. Selective pressure is calculated by the ratio (w) of the nonsynonymous substitution rate (dN) to the synonymous substitutions rate (dS) which is used in most adaptive evolutionary studies. In this study, there were three sets of comparison models: M0 (single ratio) and M3 (discrete), M7 (beta) and M8 (beta and w), and M1 (near neutral) and M2 (selection), which were used to find the sites under positive selection (w > 1) in the ycf2 gene. An amino acid site with posterior probabilities >0.95 was considered as positively selected.

Phylogenomic analysis
A phylogenetic analysis of 42 plastid genome sequences was performed. In this study, 31 ingroup taxa from Iridaceae and three outgroup taxa (Doryanthaceae, Tecophiliaceae, and Xeronemataceae) were newly sequenced, and the data for the remaining species were downloaded from NCBI GenBank. To strengthen the representation of the closely related Asparagales family, 11 outgroup taxa were included based on previous phylogenetic studies (Chen et al., 2013). We extracted 79 protein-coding genes and used MAFFT version 7.309 to align each gene individually. Then, we analyzed the alignment and adjusted it manually using Geneious Prime (Kearse et al., 2012). The combined alignment matrix comprised 42 taxa and 71,829 characters.
Maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI) analyses were performed for phylogenetic analysis. MP analysis was performed using PAUP* version 4.0 (Wilgenbusch and Swofford, 2003) using comparable character allowance, and omitted data were treated as gaps. Tree bisection and reconnection branch swapping were used for searches of 1,000 randomly chosen additions, and MulTree allowed 10 trees to be held at each stage. Bootstrap analyses (1,000 pseudoreplicates) were performed to determine the relative bootstrap support with similar parameters. We utilized jModel test version 2.1.7 for ML and BI (Darriba et al., 2012) analyses to obtain suitable models based on the Akaike information criterion and selected GTR+I+G as the best model. ML analysis was conducted using IQ-TREE (Trifinopoulos et al., 2016), and support values were calculated with 1,000 bootstrap replicates. BI analysis was performed using MrBayes (Ronquist et al., 2012). Trees were sampled every 1,000 generations during two independent runs of random trees with a minimum of 1 × 10 6 generations. The first 25% of the trees were removed as burn-in, and the remaining trees were used to build a consensus tree with 50% majority-rule, with the proportion of bifurcations in the consensus tree provided as a posterior probability (PP) to evaluate the robustness of half of the BI tree. The model parameters were then verified using the effective sample size values. Finally, phylogenetic trees were modified using Figtree program (Rambaut, 2020).

Plastid genome organization and gene features
The plastid genomes of 24 species of Iridaceae, including representative taxa from the subfamilies Isophysioideae, Patersonioideae, Aristeoideae, Nivenioideae, Crocoideae, and Iridoideae, were constructed. The genomes of additional species, including Crocus spp., Iris spp., and subfamily Geosirioideae (Geosiris australiensis and G. aphylla), were obtained from NCBI GenBank. All the newly assembled contigs formed a quadripartite structure. The resultant plastomes were composed of aLSC, SSC, and a pair of inverted repeats (IRa and IRb) ( Figure 1). The size of the CP genomes varied from 119,004 to 164,622 bp (Table 1). Among the seven Iridaceae subfamilies, Geosirioideae had the smallest cpDNA (Geosiris australiensis NC_042825; 119,004 bp) and the largest cpDNA (Tigridia pavonia; 164,622 bp). In addition, the GC content of the cpDNA sequence in Iridaceae was approximately 38% for all the examined taxa. The plastid genome of the autotrophic Iridaceae comprised 113 genes, including 79 protein-coding genes, 30 tRNA genes, and four rRNA genes distributed in the IR region (Table 2). Among them, 18 were intron-containing genes, including six tRNA and 12 protein-coding genes. Three of the latter genes-clpP, rps12, and ycf3contained two introns. The 5′ exon of the trans-spliced rps12 gene is located in the LSC area, whereas the 3′ exon and intron are located in the IR regions. In addition, significant inversions and variable INDEL patterns of resident pseudogenes were found in some Iridaceae species.

Pseudogenization and inversions
Although the overall pattern of genes was similar between species, we found that several protein-coding genes with frameshifts, substitutions and point mutations, internal stop codons, and unclear start codons were classified as pseudogenes. Several genes from the Representative chloroplast genome of Iridaceae. The colored bars indicate protein-coding genes, tRNA genes, and rRNA genes. Genes drawn outside of the map circle are transcribed clockwise, while that drawn inside are transcribed counterclockwise. The darker gray in the inner circle corresponds to the GC content, while the lighter gray corresponds to the AT content.  examined species had pseudogenization annotated as ndhG (Diplarrena moraea and Nivenia stokoei), ndhD (Hesperantha coccinea), rpoC2 (Gladiolus communis), rpoB (Chasmanthe floribunda), ndhK (Witsenia maura), and ycf1 (Trimezia steyermarkii) ( Table 2). Based on the mVISTA data, a detailed comparison study was performed, as demonstrated in Figure 2, where there were similarities in coding genes; however, some differences among Iridaceae species were identified. We observed several structural variations, such as inversions in genome structure among species. Unlike other Iridaceae species, Moraea polystachya and M. spathulata possessed intergenic spacer inversions from ycf2 to ndhB (approximately 1.7 kb). In addition, Watsonia longifolia and Gladiolus communis exhibited significant inversions from petN to trnE-UUC (approximately 3 kb) (Figure 2). In many studies, ycf2 gene alone provides well-supported phylogeny because of its large size and extensive sequence variation (containing many INDELS) uniquely shared by some species and is frequently employed in phylogenetic research on angiosperms (Huang et al., 2010;Zhong et al., 2019;Tang et al., 2022). Our findings revealed significant deletions and inconsistent variations in the ycf2 gene region. The ycf2 gene of Iridaceae species was compared, and it was observed that the subfamilies Crocoideae, Nivenioideae, and Aristeoideae had 237 base sequence deletions in the same locus ( Figure 2; Supplementary Figure S1). Similar deletion events may provide a plausible explanation for the sister relationships among these three subfamilies. To examine the existence of ycf2 gene based on selective pressure (positive or purifying selection), we generated the K a /K s (synonymous and nonsynonymous) ratio values of the representative taxon of Iridaceae (Supplementary Table S4-4.1). Thereafter, the average K a /K s of ycf2, however, was >1, indicating positive selection.
The PAML v4.9 package (Yang, 2007) was used to measure the logarithmic likelihood value (InL) and estimation of positive selection for the complete sequence data of the ycf2 protein-coding region of 12 species in the Iridaceae family. There were several positive selection loci found in models M2 (positive selection) and M8 (beta and w>1) with the posterior probability superior to 95% and 99% as calculated by Bayes empirical and naïve empirical estimation (Supplementary Table S4-4.2).

Nucleotide diversity and IR boundaries
We analyzed the nucleotide diversity of protein-coding genes, tRNA, and rRNA of the 31 ingroup taxa comprising seven subfamilies of Iridaceae to identify phylogenetically informative genes (Figure 3; Supplementary Table S1). In total, 113 genes from different regions, that is, IR, LSC, and SSC, were aligned to calculate pi values. The nucleotide diversity of the IR region in protein-coding genes was relatively low compared with that of the LSC and SSC regions. In the CDS, the highest pi values were recorded for matK (0.07907), followed by NADH dehydrogenase genes (ndhK, ndhF, ndhD, and ndhG), ycf1, and rps15. Among RNAs, the pi value ranged from 0.04019 (trnQ-UUG) to 0.00267 (rrn16rRNA and rrn5rRNA), whereas two genes, trnA-UGC and trnN-GUU, exhibited no nucleotide diversity (Pi value: 0) (Figure 3). The detailed values are shown in Supplementary Table S1.

FIGURE 2
Sequence identity plot comparing 31 CP genomes with I. loczyi as a reference using mVISTA. CDS, coding sequences; CNS, non-coding sequences.
The primary cause of size differences in cp genomes is the contraction and extension of IR boundaries, which are frequent evolutionary events (Wang et al., 2008;Zong et al., 2019). The boundaries (LSC/IRs and SSC/IRs) were similar among Iridaceae species (Figure 4). Despite the boundary of plastome sequences of the Iridaceae species being largely constant, we found expanded IR regions, whereas the rpl22 gene in the LSC region was lengthened by 104 bp only in Libertia pulchella. A partially duplicated copy of the ycf1 gene was observed at the IRb-SSC junctions of all plastid genomes. Furthermore, rps19 gene was duplicated in the IR region. Of these, Tigridia pavonia presented a large genome, owing to the enlarged IR length. Notably, the Australian species Patersonia fragilis, Isophysis tasmanica, and Diplarrena moraea had LSC with more base pairs than the other species (Figure 4) due to INDEL mutations of nucleotides in the non-coding region.

Relative synonymous codon usage assembly
A total of 79 protein-coding genes of 31 representative taxa of Iridaceae were used to compare the values of RSCU (Supplementary Figure S4). The stop codons UAA, UAG, and UGA were excluded.
The total observed frequency ranged from 22,805 (Isophysis tasmanica) to 6,194 (Geosiris australiensis) ( Figure 5). Leucine (L) was the most prevalent amino acid in the autotrophic Iridaceae (10.25 to 11.08%), whereas isoleucine (I) and leucine (L) were the most extensive amino acid in the mycoheterotrophic species (Geosiris). Notably, cysteine (C) was the least abundant amino acid in both photosynthetic and non-photosynthetic groups (1.09 to 1.20%) (Supplementary Figure S4, Supplementary Table S5). Both methionine (M) and tryptophan (W) show an RSCU value (equal to 1) that exhibited no codon preferences in all the examined species.

Phylogenetic analysis
The phylogenetic reconstructions based on 79 protein-coding genes for 31 ingroup (Iridaceae) and 11 outgroup species contained 71,829 characters, resulting in phylogenies with identical topologies using MP, ML, and BI with bootstrap values of 100% and PP of 1.0 ( Figure 6). Iridaceae was recovered as monophyletic with strong support (100%, 1) as in previous studies performed by Goldblatt on the basis of the selected genes.
This study, with relatively more coding gene sequences than previous molecular studies, represents the extent of relationships Graphs showing the variation in nucleotide diversity (Pi) in the protein-coding region, transfer RNA, and ribosomal RNA in 24 representative taxa of Iridaceae. Kamra et al. 10.3389/fpls.2023.1066708 Frontiers in Plant Science frontiersin.org within Iridaceae. The evolutionary relationships found in this study at the subfamily level were notably consistent with prior extensive investigations (Isophysioideae, Isophysis tasmanica, sister to other Iridaceae species, 100%, 1). The three subfamilies Patersonioideae, Geosirioideae, and Aristeoideae comprised a well-supported clade (100%, 1), together with the two subfamilies Crocoideae and Nivenioideae (100%, 1). Geosiris australiensis and G. aphylla exhibited substantial plastome rearrangements and gene losses relative to other Iridaceae species, but they were nonetheless highly supported as a clade (100%, 1). Diplarrena moraea was sister to other genera in the Iridoideae clade. Diplarreneae was further subdivided into two distinct clades: one Irideae and three important tribal clusters, including Sisyrinchieae, sister to Trimezieae, and the remaining Tigrideae with high support values in MP, ML, and BI analyses ( Figure 6).

Plastome structure and comparisons
Our first comprehensive representative plastome sampling from the basal clade Isophysioideae to the higher Crocoideae clade offers an excellent prospect for investigating plastome evolution. Overall, the plastome lengths in the photosynthetic group ranged between 150,820 and 164,622 bp ( Table 1). The LSC and SSC regions are comparatively longer with a higher AT content than the IR region, with the notable exception of the plastomes in the non-photosynthetic group, which contain significantly reduced SSC and extended IR regions (e.g., Geosirioideae) wherein photosynthetic genes have been extensively dropped (Joyce et al., 2018). In the evolution of land plants for both autotrophic and heterotrophic taxa, substantial extension of the IR region has occurred several times through SSC gene transporters (Naumann et al., 2016;Zhu et al., 2016;Sinn et al., 2018). Notably, the published chloroplast genome of Sisyrinchium angustifolium (Kang et al., 2020) contained gaps or missing nucleotide bases (129) between the sequences; however, our newly sequenced genome of this species substantially filled the existing gaps using NGS data.

Pseudogenization and sequence inversion
Larger genes encoding protein products (up to 1,000 amino acids) have been demonstrated to be more likely to undergo pseudogenization (Khachane and Harrison, 2009). Events such as INDEL and pseudogenization are also thought to be critical for understanding the evolutionary history (van der Burgt et al., 2014). Several protein-coding genes from the examined Iridaceae species exhibited pseudogenization (Table 3)-for instance, in Trimezia steyermarkii, the ycf1 gene contained several internal stop codons due to INDELS. In Gladiolus communis, rpoC2 only had one mutation [GAA (glutamate) ! TAA (stop codon)], which was recognized as a pseudogene. We likewise identified ndhK as a pseudogene in Witsenia maura due to a substitution mutation. We also found that Hesperantha coccinea had a dissimilar frontal region of ndhD gene (up to 66 bp). The ndhG gene was functional in Iridaceae species; however, it underwent pseudogenization by base insertion, changing the reading frame in Diplarrena moraea. Within Iridaceae, Geosiris species also showed pseudogenization of NADH dehydrogenase and other photosynthetic genes (Joyce et al., 2018). In addition, heterotrophic plants frequently experience ndh gene Comparison of the borders of large single copy, small single copy, and inverted repeat regions of chloroplast genomes of Iridaceae.
Studies have shown that plastomes in many angiosperm lineages have genomic structural differences, called inversions, ranging from small to large inversions, which provide important phylogenetic information (Figure 2). These include Onagraceae (Hupfer et al., 2000), Asteraceae (Walker et al., 2014), Fabaceae (Wang et al., 2018;Charboneau et al., 2021), and Commelineaceae (Jung et al., 2021). Previously, Jung et al. (2021) reported that AT-rich sequences caused genomic rearrangements, such as inversions. Our results also identified two inversion events, ycf2-ndhB intergenic region inversion, in two Moraea species, M. spathulata and M. polystachya, that showed high AT-rich content due to weak hydrogen bonding (Supplementary Figures S2, S3). Unlike other Iridaceae species, Watsonia and Gladiolus have one large inversion (petN-trnE-UUC), and both species have reversed orientation in the same loci, demonstrating high AT content and forming a clade sister in Crocoideae. Therefore, genome sequence comparisons are appropriate for analyzing structural variations.

Gene locus deletion in ycf2 gene
The ycf2 gene is the largest protein-coding gene reported in flowering plants (Drescher et al., 2000). The function of ycf2 gene is unknown, but it does not seem to be specifically linked to photosynthesis (Drescher et al., 2000). In our study, ycf2 sequences contained many INDELS that were distinctively shared by some examined species. Members of the Crocoideae, Nivenioideae, and Aristeoideae subfamilies shared a 237-bp deletion in the ycf2 gene (Supplementary Figure S1) as we moved from the basal (Isophysioideae) to the higher clade (Crocoideae, Nivenioideae, and Aristeoideae) ( Figure 6). Zhong et al. (2019) performed selective locus detection of the ycf2 gene in eight species of the composite family and reported that the ycf2 gene has undergone adaptive evolution. Hu et al. (2015); Fan et al. (2018); Zhong et al. (2019), and Wu et al. (2020) investigated the positive selection of the protein-coding ycf2 gene and reported its role in adaptation to diverse ecosystems. In our study, the gene ycf2 was subjected to a selective pressure test using a sequence alignment of ycf2 gene that codes for protein to estimate the K a /K s ratio, and the results showed K a /K s >1, indicating positive selection (Supplementary Table S4-4.1). Furthermore, ycf2 was used for selective pressure estimation using branch-site model, and it was found that the ycf2 gene was a positively selected gene with several positive sites (Supplementary  Table S4 -4.2). Notably, ycf2 showed the greatest degree of differentiation, positive selection, and high conservation, which may promote adaptive evolution. Further extensive investigation and detailed discussion on the adaptive evolutionary analysis of the ycf2 gene should be conducted in future studies on Iridaceae.

Nucleotide diversity and relative synonymous codon usage analysis
The nucleotide diversity of CDS, tRNA, and rRNA among the Iridaceae plastomes was evaluated to determine genetic divergence. The IR regions had lower nucleotide diversity than the LSC and SSC regions because of their highly conserved composition. We found a phylogenetically informative dataset composed of matK and ndh genes (ndhK, ndhE, ndhD, and ndhG), ycf1, and rps15, which had high values (Pi > 0.07). The genes listed above were further used to analyze phylogenetic relationships (Figure 3).
These events in protein-coding sequences have helped us to better understand the evolutionary relationships among Iridaceae species and expand our understanding of plastid genome evolution in angiosperms. Our findings offer new evidence for the evolutionary development and history of the core Iridaceae based on specific inversion events and pseudogenization of some genes, which are confined to some species, in contrast to the rest of the studied species.
Codon usage in plastid protein-coding genes plays an important role to infer the evolutionary forces such as mutation-selection and genetic drift within species (Komar, 2016). Most of the amino acids were encoded by two to six synonymous codons, with the exception of methionine (M) and tryptophan (W). For the codons, the RSCU values >1 indicate that the codon are frequently used, and the values <1 imply that the codon is used less frequently than expected (Sharp and Li, 1986). Our results showed that there were 31 codons in autotrophic species and 29 codons in mycoheterotrophic Geosiris that have a high value (>1) and ended with A/U. However, most of the codons with low value (<1) ended with G/C (Supplementary Figure S4). The most commonly used codon in Iridaceae was Leucine (Leu), except in non-photosynthetic Geosiris aphylla (isoleucine, Ile) due to the substantial plastome rearrangements and gene losses relative to other Iridaceae species.

Phylogenetic relationships
In this study, we used plastid protein-coding sequences to analyze representative sampling datasets of Iridaceae. The comparative trial of phylogenomics using plastid genome sequences significantly supported the monophyly of Iridaceae and was in accordance with recently published results ( Figure 6). The highly supported phylogenetic Estimated phylogenetic tree of Iridaceae based on maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI) analyses using 79 proteincoding genes. The asterisks (*) around nodes indicate a bootstrap value of 100% for MP and ML analyses and a posterior probability of 1.0 for BI analysis.
relations reconstructed by MP, ML, and BI analyses are consistent with previous studies estimated by Goldblatt et al. (2008) and Joyce et al. (2018). The location of Watsonia Mill. is the only point of incongruence; Goldblatt et al. (2008) discovered Watsonia with other related species in the tribe Watsonieae with weak support (72%). Notably, Joyce et al. (2018) found Watsonia and Schizorhiza in one clade with strong support (99%, 1), whereas we recovered a clade comprising Watsonia and Gladiolus with high support values (100%, 1). Watsonia pillansii or Beatrice Watsonia is a species of cormous perennial geophyte native to southern Africa (South Africa, Lesotho, and Swaziland) in the Iridaceae family, which is closely related to Gladiolus (Chen, 2021). Our analysis described Gladiolus and Watsonia in one clade and presented a similar inversion in comparison with other Crocoideae members ( Figure 6). Further studies are needed to revise the position of Watsonia.

Conclusion
This study is the first report of a comparative genomic analysis of the second largest family in Asparagales, revealing significant genomic structural characteristics, nucleotide diversity, pseudogenization-inversion events, and variable INDEL patterns using complete plastid genomes of representative taxa of Iridaceae. In addition, relationships among subfamilies and tribes [except Tritoniopsis (Tritoniopsideae)] were determined with high support in the current study using 79 proteincoding genes. Plastome data generated new and robust insights into the phylogeny of Gladiolus-Watsonia clade (Iridaceae, Crocoideae). The three subfamilies share common ycf2 gene locus deletion. The complete plastome investigated in this study will facilitate further research on Iridaceae and enhance our understanding of plastome evolution.

Data availability statement
The names of the repository/repositories and accession number(s) can be found at: National Center for Biotechnology Information (NCBI) Bioproject database under accession number OP710243-OP710248; OP729159-OP729169; OP745518-OP745527 (Supplementary Table S3).

Groups of genes
Names of genes No.

Author contributions
KK: experimentation, formal analysis, data curation and interpretation, preparation of figures and tables, and writing of the original draft. JJ: authored or reviewed the drafts of the paper and approved the final draft. J-HK: project administration, conceptualization, supervision, writing-review, and draft approval. All authors contributed to the article and approved the submitted version.