The Repeat Sequences and Elevated Substitution Rates of the Chloroplast accD Gene in Cupressophytes

The plastid accD gene encodes a subunit of the acetyl-CoA carboxylase (ACCase) enzyme. The accD gene has been reported to be expanded in Cryptomeria japonica, Taiwania cryptomerioides, Cephalotaxus, Taxus chinensis, and Podocarpus lambertii, mainly because of the presence of tandemly repeated sequences. However, it is still unknown whether accD has also expanded in other cupressophytes. Here, to investigate how widespread this phenomenon is, we sequenced and analyzed accD and its surrounding regions from 18 cupressophyte species. Together with 39 sequences from GenBank, our taxon sampling covered all extant gymnosperm orders. Analysis of the repetitive elements and substitution rates of accD among 57 gymnosperm species showed that: (1) the reading frame of accD has also expanded in the 18 newly sequenced cupressophyte species; (2) many repetitive elements are present in accD of cupressophyte lineages; (3) the synonymous and non-synonymous substitution rates of accD are accelerated in cupressophytes; and (4) accD is located at rearrangement endpoints. These results suggest that repetitive elements may mediate chloroplast genome rearrangement and accelerate substitution rates.


INTRODUCTION
Cupressophytes, also called non-Pinaceae conifers, comprise about 380 species in 58 genera of five families: Araucariaceae, Podocarpaceae, Sciadopityaceae, Taxaceae (including Cephalotaxaceae), and Cupressaceae (including Taxodiaceae) (Christenhusz et al., 2011). Most species of Araucariaceae and Podocarpaceae are distributed in the Southern Hemisphere, while the other three families are found in the Northern Hemisphere. Some cupressophyte species are of economic and ecological value. For instance, most species of Cupressaceae are valued as timber or ornamentals. The secondary metabolite paclitaxel (taxol), extracted from the bark of Taxus, is a chemotherapy drug used to treat ovarian and breast cancer.
Acetyl-CoA carboxylase (ACCase) catalyzes the conversion of acetyl-CoA to malonyl-CoA and is thought to regulate de novo fatty acid biosynthesis (Konishi and Sasaki, 1994; Sasaki and Nagano, 2004). Most higher plants, except for Gramineae, have two forms of ACCase: a prokaryotic type made up of several subunits in the stroma of plastids, and a eukaryotic type composed of a single multifunctional polypeptide located in the cytosol (Konishi et al., 1996). The prokaryotic ACCase is composed of the α-carboxyl transferase, the biotin carboxyl carrier, the biotin carboxylase, and the β-carboxyl transferase subunits (Gornicki et al., 1997). The β-carboxyl transferase is encoded by the plastid accD gene, whereas the other three subunits are all nucleus-encoded. The plastid-localized accD gene is essential for leaf growth and for maintaining the plastid compartment in tobacco (Kode et al., 2005). Elevated accD expression raised the total amount of ACCase in plastids and significantly increased the fatty acid content of tobacco leaves (Madoka et al., 2002). Furthermore, expression of accD is considered essential during embryo development in Arabidopsis (Bryant et al., 2011).
The accD gene is widely distributed in plants, including in the reduced chloroplast genomes of parasitic and non-photosynthetic plants (Wolfe et al., 1992; de Koning and Keeling, 2006). However, accD has been lost several times from the chloroplast genomes of some angiosperm lineages: Acoraceae (Goremykin et al., 2005), Poaceae (Konishi and Sasaki, 1994; Harris et al., 2012), Campanulaceae (Haberle et al., 2008), Geraniaceae, and Fabaceae (Magee et al., 2010). In Poaceae, the plastid-located prokaryotic form of ACCase is functionally replaced by the nuclear-encoded eukaryotic type (Konishi et al., 1996; Gornicki et al., 1997). The loss of accD from the chloroplast genomes of Campanulaceae and Fabaceae is also consistent with the presence of an additional ACCase counterpart in the nucleus (Magee et al., 2010; Rousseau et al., 2013). In Trifolium repens (Fabaceae), scanning of high-throughput EST sequence data showed that accD is fused with a nuclear gene for plastid lipoamide dehydrogenase (LPD2) (Magee et al., 2010); in Trachelium caeruleum (Campanulaceae), a transit peptide is combined with a truncated accD gene that encodes only 331 amino acids (Rousseau et al., 2013). In contrast to the loss of this gene in the above species, accD has increased in length in cupressophyte species including Cryptomeria japonica, Taiwania cryptomerioides, Cephalotaxus wilsoniana, C. oliveri, T. chinensis, and P. lambertii (Hirao et al., 2008; Wu et al., 2011; Yi et al., 2013; Vieira et al., 2014; Zhang et al., 2014). This extension is mainly caused by the insertion of a large number of tandemly repeated sequences. However, the repetitive elements of the gene differ among Cephalotaxus, T. cryptomerioides, T. chinensis, and P. lambertii (Yi et al., 2013; Vieira et al., 2014; Zhang et al., 2014). The evolutionary mechanisms underlying the occurrence of repetitive elements in cupressophyte accD therefore remain poorly studied. Sequence data from a wider phylogenetic breadth of cupressophytes are needed to clarify the evolutionary history of the accD gene.
In a study of four mammalian genomes and one bird genome, regions surrounding tandem repeats were suggested to evolve faster than non-repeat-containing regions (Simon and Hancock, 2009). One explanation is that regions near repeat sequences have evolved under weaker negative selection than the rest of the sequence in which they are embedded (Djian et al., 1996; Faux et al., 2007). Another explanation is that repeat sequences give rise to more substitutions in their flanking sequences (Huntley and Clark, 2007). Recent evidence also suggests that the insertion of repeat sequences elevates the substitution rate of the entire sequence (Huntley and Clark, 2007). Repeat sequences themselves have also been reported to evolve faster than their flanking sequences (Huntley and Golding, 2000). Although accD contains many repeat elements, it is unknown whether the substitution rates of the repeat sequences or of their flanking sequences have accelerated. To elucidate the evolutionary patterns of repeat sequences in the chloroplast genome, we examined the substitution rate pattern of the accD gene in this study.
To gain better insight into the evolutionary history of accD in cupressophytes, we sequenced accD genes from 18 cupressophyte species. The aims of this study were: (1) to investigate whether accD gene length in cupressophytes tends to increase; (2) to explore whether accD in other cupressophyte species has specific repetitive elements like those of Cephalotaxus, T. cryptomerioides, T. chinensis, and P. lambertii; (3) to determine the substitution rate pattern of accD in cupressophytes; and (4) to identify the gene order states around accD and examine the association among repetitive elements, substitution rates, and genome rearrangement.

Plant Sampling
Fresh leaves of 18 conifer species were sampled from Wuhan Botanical Garden, Chinese Academy of Sciences (CAS); the Institute of Botany, CAS; and Sun Yat-sen University (Table 1). The materials used for DNA extraction were preserved in silica gel.

DNA Extraction and Sequencing
Total genomic DNA was isolated from the leaves using the CTAB method (Gawel and Jarret, 1991). The quality of the genomic DNA was assessed by 1% agarose gel electrophoresis. The accD gene was amplified by polymerase chain reaction (PCR). PCR primers (Supplementary Table 1) were designed from conserved regions of four gymnosperm sequences (C. japonica, NC_010548; T. cryptomerioides, NC_016065; C. wilsoniana, NC_016063; C. oliveri, KC136217). The PCR system was as described in a previous study (Li et al., 2016). The PCR products were then cloned into the pCR 2.1 plasmid vector (Invitrogen, Carlsbad, CA, United States) and transformed into E. coli DH5α. At least three randomly chosen positive clones were sequenced using an ABI 3730xl DNA Analyzer (Applied Biosystems, Foster City, CA, United States).

Sequence Assembly and Annotation
The sequences generated from different primers were assembled into a single sequence in BioEdit (Hall, 1999), with overlaps of 150-300 bp. Contigs were initially annotated with DOGMA (Dual Organellar GenoMe Annotator). Genes that could not be confirmed by DOGMA were identified using Blastx (http://blast.ncbi.nlm.nih.gov/Blast.cgi) and ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/). The tRNA genes were annotated with tRNAscan-SE v1.21 (Lowe and Eddy, 1997).
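The assembly step can be illustrated with a minimal sketch of joining two reads through a shared end. This is not BioEdit's procedure: the function name `merge_by_overlap` and the toy reads are hypothetical, and exact matching is assumed, whereas real Sanger reads may need mismatch-tolerant alignment.

```python
def merge_by_overlap(left: str, right: str, min_overlap: int = 150) -> str:
    """Join two reads if a suffix of `left` exactly matches a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first so the merge is as tight as possible.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    raise ValueError("no overlap of the required length was found")


# Toy reads sharing a 6-base end ("GATTAC"); the reads in this study
# overlapped by 150-300 bp, so min_overlap is lowered only for illustration.
read_a = "ATGCCGGATTAC"
read_b = "GATTACTTGACC"
contig = merge_by_overlap(read_a, read_b, min_overlap=6)
print(contig)
```

Requiring a minimum overlap guards against joining reads through a short chance match.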

Repeat Sequence Analyses
The sequences were initially scanned with REPuter for repeats of at least 20 bp with a similarity above 90% (Kurtz et al., 2001). The sequences were then further analyzed with Tandem Repeats Finder (Benson, 1999).
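To make the idea of a tandem repeat concrete, the following is a simplified stand-in for the repeat search, not the algorithm of REPuter or Tandem Repeats Finder. It assumes exact, directly adjacent copies only (the real tools tolerate mismatches), and the function name, thresholds, and toy peptide are hypothetical.

```python
def find_tandem_repeats(seq: str, min_unit: int = 3, min_copies: int = 2):
    """Return (start, unit, copies) for exact, directly adjacent repeats."""
    hits = []
    i = 0
    while i < len(seq):
        best = None
        # A unit can be at most this long if min_copies copies must still fit.
        longest_unit = (len(seq) - i) // min_copies
        for unit_len in range(min_unit, longest_unit + 1):
            unit = seq[i:i + unit_len]
            copies = 1
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            span = copies * unit_len
            # Keep the repeat covering the most sequence; ties keep the shorter unit.
            if copies >= min_copies and (best is None or span > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best:
            hits.append(best)
            i += len(best[1]) * best[2]  # continue after the repeat array
        else:
            i += 1
    return hits


# Hypothetical peptide: a 5-residue unit repeated three times between unique flanks.
toy = "MKT" + "DEEKS" * 3 + "GLV"
print(find_tandem_repeats(toy))
```

The same function works on nucleotide or amino acid strings, because it only compares characters.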

The Estimation of Substitution Rate
For the analysis in Figure 2, we first constructed a maximum likelihood (ML) tree from rbcL sequences. The analysis was performed in RAxML v8.1.x with the GTR+I+G model. In addition, following the strongly supported relationship published elsewhere (Lu et al., 2014), Podocarpaceae and Araucariaceae were placed as sister groups in the rbcL ML tree. This tree was then used for the subsequent substitution rate calculations. To compare the substitution rate of accD with those of two other widely used chloroplast gene markers, rbcL and matK, we also downloaded sequences of these two genes from GenBank. Branch lengths in terms of non-synonymous (dN) and synonymous (dS) nucleotide substitutions for the accD, matK, and rbcL trees were calculated using the free-ratio model implemented in the codeml program of PAML.
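As an illustration of the free-ratio setup, the sketch below writes a codeml control file from Python. The option names are standard codeml control variables, and `model = 1` selects the free-ratio branch model. The file names and every setting marked "assumed" are hypothetical, since the control files used in this study are not given.

```python
def codeml_free_ratio_ctl(seqfile: str, treefile: str, outfile: str) -> str:
    """Build the text of a codeml control file for the free-ratio branch model."""
    settings = {
        "seqfile": seqfile,    # codon alignment
        "treefile": treefile,  # fixed (constrained) topology
        "outfile": outfile,
        "runmode": 0,    # evaluate the user-supplied tree
        "seqtype": 1,    # codon sequences
        "CodonFreq": 2,  # F3x4 codon frequencies (assumed; not stated in the text)
        "model": 1,      # free-ratio: an independent omega for every branch
        "NSsites": 0,    # no variation in omega among sites
        "icode": 0,      # universal genetic code (assumed)
        "fix_kappa": 0,  # estimate kappa
        "fix_omega": 0,  # estimate omega
        "cleandata": 0,  # keep alignment columns with gaps (assumed)
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


# Hypothetical file names for the accD analysis.
ctl_text = codeml_free_ratio_ctl("accD_codon.phy", "rbcL_constrained.tre", "accD_free.out")
print(ctl_text)
```

The text would be saved to a file such as `accD.ctl` and run with `codeml accD.ctl`; repeating the run with the rbcL and matK alignments on the same tree keeps the three genes comparable.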

The General Features of accD Gene in Cupressophytes
The sequences acquired in this study were deposited in GenBank under the accession numbers KT30780-KT30797. A comparison of 57 gymnosperm accD sequences showed that approximately 200 amino acids at the end of the gene were highly conserved (Supplementary Figures 1-9, 1-10; positions 1200-1400 in the alignment). This C-terminal region is functionally important for the ACCD protein (Zhang et al., 2003). In contrast, the residues in the N-terminal and middle regions showed low similarity (Supplementary Figures 1-1 to 1-8). The major difference among the 57 gymnosperm accD sequences is a large insertion in the N-terminal and middle regions of cupressophyte accD (Supplementary Figures 1-1 to 1-8). Furthermore, the open reading frame has not been disrupted by these inserted sequences.
The accD gene in cupressophytes has undergone an extraordinary expansion. In the Podocarpaceae lineage, accD exceeds 600 codons (Table 1). Cephalotaxus hainanensis, analyzed in this study, has the largest accD gene, reaching 1070 codons (Table 1), approximately three times the length found in Pinaceae species. The accD gene length also varies considerably within families. In Taxus, accD has expanded dramatically, reaching 735, 736, 753, 759, and 767 codons in T. mairei, T. wallichiana, T. media, T. chinensis, and T. yunnanensis, respectively (Table 1). In general, our results support the hypothesis of Hirao et al. (2008) that accD gene length has expanded in cupressophytes.

Repetitive Amino Acid Elements in accD
To investigate the mechanisms underlying accD length variation, REPuter and Tandem Repeats Finder were used to search for repetitive sequences. As expected, accD length variation is explained by insertions consisting of tandemly repeated sequences. The repetitive sequences in accD fall into a total of 31 categories, each present in 2-13 nearly identical copies, all of which are in the same (i.e., direct) orientation relative to each other (Supplementary Table 2 and Figure 1). Cycadaceae, Ginkgoaceae, and Pinaceae species, which have a relatively small gene size (Table 1), do not have repetitive elements. In comparison, accD in the cupressophytes investigated in this study possesses many repetitive sequences. Ten repetitive elements were identified in accD from the Cephalotaxaceae (Supplementary Table 2 and Figure 1A). Some repetitive elements, represented by R5, R8, and R9, were found exclusively in either C. wilsoniana or C. hainanensis, whereas others, such as R1, R2, R3, R4, and R10, were found in all Cephalotaxus species. R1, R2, and R10 each occurred twice in all five Cephalotaxus species. The copy numbers of R3, R4, R6, and R7 vary among species. For instance, C. hainanensis has 13 copies of R3, while C. sinensis, C. wilsoniana, and C. fortunei have 12 copies, and C. oliveri has only six.
Three copies of R11 and four copies of R12 were found in Cunninghamia lanceolata and Calocedrus formosana, respectively (Supplementary Table 2 and Figure 1B). Juniperus virginiana has two copies of R14, while the other Juniperus species have only one copy of this repetitive element (Supplementary Table 2 and Figure 1B). The main difference in repetitive elements between the two Taiwania species is the copy number of R16 and R18. T. cryptomerioides has six copies of R16 and three copies of R18, while T. flousiana has seven copies of R16 and two copies of R18 (Supplementary Table 2 and Figure 1B). The R19 repetitive element is common among Cupressaceae species, except for C. lanceolata and Taiwania (Supplementary Table 2 and Figure 1B).
In Taxus and Pseudotaxus, accD contains four kinds of repetitive elements: R20, R21, R22, and R23 (Supplementary Table 2 and Figure 1C). Two copies of R21 were found in T. wallichiana, while the other Taxus species have only one copy. T. chinensis and T. cuspidata both have four copies of R20 and T. yunnanensis has only two, while the remaining four Taxus species each have three copies of this repetitive element. P. chienii also has the R20 element, but in only one copy. The copy number of R22 also differs within Taxus, ranging from two in T. wallichiana to four in T. yunnanensis.
The largest tandem repetitive element, R25, spans 59 amino acids and is present in Torreya and Amentotaxus (Supplementary Table 2 and Figure 1D). Two copies of R25 were identified in both genera. Amentotaxus has two genus-specific repetitive elements, R24 and R26, whose copy numbers also differ between Amentotaxus argotaenia and Amentotaxus formosana (Supplementary Table 2 and Figure 1D). Podocarpaceae has only a few repetitive elements (Supplementary Table 2 and Figure 1E). P. macrophyllus and P. neriifolius each contain three copies of R27, while P. lambertii contains two. Other Podocarpaceae species contain two copies of R28. A. cunninghamii and A. dammara have eight and seven copies of R31, respectively, and this element is also lineage-specific (Supplementary Table 2 and Figure 1E). The consensus sequence of R31 was also found in Podocarpaceae, but only as a single copy (not repeated), suggesting that the R31 repetitive element was duplicated only in Araucariaceae. Furthermore, no pairs of direct repeat sequences were identified flanking the inserted repetitive elements of cupressophytes.

Rapid Evolution of accD in Cupressophytes
The dN and dS values for the accD, rbcL, and matK genes are represented as branch lengths in Figure 2. In the dN trees, rbcL and matK have relatively low substitution rates across the entire tree. The branch leading to the common ancestor of Cupressaceae, Taxaceae, and Cephalotaxaceae in the accD dN tree is longer than other branches, suggesting that accD evolves faster in this clade. In addition, the branch leading to Podocarpaceae and Araucariaceae in the accD dN tree is longer than those of other gymnosperms (Figure 2). Interestingly, accD gene length also begins to expand in the lineage of Podocarpaceae and Araucariaceae. In the matK and rbcL dS trees, most gymnosperm species evolve slowly and at consistent rates, except for the branch leading to Podocarpaceae. However, the dS of accD in cupressophytes is much higher than in many of the Pinaceae species. Compared with rbcL and matK, accD shows a high level of divergence among cupressophyte species. In general, accD has experienced an acceleration of substitution rates, and this acceleration is locus- and lineage-specific.

Gene Order Around accD in Gymnosperms
The gene order around accD can be classified into six types (Figure 3). At high taxonomic levels, the gene order is conserved across Cycadaceae, Ginkgoaceae, and Pinaceae, with the arrangement rbcL-trnR-accD-psaI. The gene order in Araucariaceae and Podocarpaceae, excluding Podocarpus totara, is nearly identical to that of Cycadaceae, Ginkgoaceae, and Pinaceae, except that an extra trnD gene is present between rbcL and trnR. In P. totara, the gene order is psbM-trnD-accD-psaI, which differs from that of the other three Podocarpus species despite their membership in the same genus. In Taxaceae, C. japonica, Taiwania, M. glyptostroboides, and C. lanceolata, rbcL and clpP are adjacent to accD. The gene order of Cephalotaxus differs from that of Taxaceae by the inversion of clpP and the translocation of rps16. Compared with Taxaceae, rpl23 takes the place of clpP in Juniperus and C. formosana, giving the gene order rbcL-accD-rpl23. Because the gene organization surrounding accD differs so much among gymnosperm chloroplast genomes, we speculate that accD has been involved in rearrangement events of the gymnosperm chloroplast genome.

The arrows indicate repetitive elements that have only one copy. The spacer between two fragments is indicated by three dots.

FIGURE 2 | dN and dS trees for the accD, matK, and rbcL genes. Branch lengths are in terms of dN and dS as estimated by PAML under a constrained topology. The topologies of the accD dN tree, accD dS tree, rbcL dN tree, and rbcL dS tree were identical to each other. The matK dN and dS trees were similar to the rbcL and accD trees after removing C. hainanensis. The gray boxes denote branches whose dN or dS has been accelerated.

Frontiers in Plant Science | www.frontiersin.org

FIGURE 3 | Gene organization around the accD locus in gymnosperms. The green star indicates the lineage in which accD has expanded. Genes shown above the line are transcribed from left to right, while those below the line are transcribed in the opposite direction. The half-height region in rps16 represents an intron. The topology (not drawn to scale) on the left is the same as that of the accD dN tree in Figure 2. The Roman numerals I-VI denote the six types of gene organization around accD.
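The classification of gene orders into types can be sketched as a grouping of taxa by the arrangement of genes flanking accD. The example below uses only the arrangements spelled out in the text. The helper names are hypothetical, and transcription direction is ignored, which is a simplifying assumption.

```python
from collections import defaultdict


def canonical(order):
    """Treat a gene order and its reverse as the same arrangement."""
    order = tuple(order)
    return min(order, order[::-1])


def group_by_gene_order(orders):
    """Map each distinct arrangement to the taxa that share it."""
    groups = defaultdict(list)
    for taxon, order in orders.items():
        groups[canonical(order)].append(taxon)
    return dict(groups)


def neighbors(order, gene="accD"):
    """Return the genes immediately flanking `gene` (None at a segment end)."""
    i = list(order).index(gene)
    left = order[i - 1] if i > 0 else None
    right = order[i + 1] if i + 1 < len(order) else None
    return left, right


# Arrangements as stated in the text; taxa whose full order is not given are omitted.
orders = {
    "Cycadaceae": ["rbcL", "trnR", "accD", "psaI"],
    "Ginkgoaceae": ["rbcL", "trnR", "accD", "psaI"],
    "Pinaceae": ["rbcL", "trnR", "accD", "psaI"],
    "Araucariaceae": ["rbcL", "trnD", "trnR", "accD", "psaI"],
    "Podocarpus totara": ["psbM", "trnD", "accD", "psaI"],
    "Juniperus": ["rbcL", "accD", "rpl23"],
}

groups = group_by_gene_order(orders)
for arrangement, taxa in groups.items():
    print("-".join(arrangement), taxa)
print(neighbors(orders["Juniperus"]))
```

Comparing the flanking genes returned by `neighbors` across taxa gives a quick view of which side of accD has been affected by a rearrangement.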

The accD Gene Length and Repetitive Elements
In gymnosperms, the reading frame length of accD varies considerably. At present, six complete chloroplast (cp) genomes of Gnetales have been published. However, accD could not be found in these cp genomes, suggesting that it has been lost from the cp genomes of Gnetales (Wu et al., 2009). The accD genes of Cycas (359 codons) and Ginkgo (323 codons) are relatively short. In Pinaceae, accD gene length ranges from 319 (Picea abies) to 326 (Picea morrisonicola) codons. In contrast, we found that accD in cupressophytes has undergone an extraordinary expansion. From the alignment of 57 gymnosperm accD sequences, we infer that the enlarged accD gene in cupressophytes is mainly caused by numerous repetitive sequences inserted in the middle region. Many different repetitive elements were identified in the inserted sequences. The repetitive elements have relatively low similarity among genera (Supplementary Table 2 and Figure 1), suggesting that they likely do not share a common origin and formed independently.
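The scale of the expansion can be checked with simple arithmetic on the codon counts reported in this paper. In the sketch below, the baseline is the mean of the two Pinaceae lengths quoted above, which is a choice made here for illustration rather than a reference value used in the study.

```python
# Codon counts quoted in the text (Table 1 values for a subset of taxa).
codons = {
    "Cycas": 359,
    "Ginkgo": 323,
    "Picea abies": 319,
    "Picea morrisonicola": 326,
    "Taxus mairei": 735,
    "Taxus yunnanensis": 767,
    "Cephalotaxus hainanensis": 1070,
}

# Assumed baseline: mean of the shortest and longest Pinaceae accD genes.
pinaceae = [codons["Picea abies"], codons["Picea morrisonicola"]]
baseline = sum(pinaceae) / len(pinaceae)

# Fold expansion of each gene relative to the Pinaceae baseline.
fold = {taxon: round(n / baseline, 2) for taxon, n in codons.items()}
for taxon, value in fold.items():
    print(f"{taxon}: {value}x")
```

On this baseline, accD of C. hainanensis is about 3.3 times as long as in Pinaceae, in line with the roughly threefold difference noted for this species.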

The Function of Repetitive Elements in accD
In addition to cupressophytes, repeat sequences in accD have also been reported in two legume species (P. sativum and L. sativus) (Magee et al., 2010), pepper (C. annuum) (Jo et al., 2011), and M. truncatula (Gurdon and Maliga, 2014). This is consistent with the idea that some proteins generate repeats more readily during evolution (Mularoni et al., 2010). The accD genes of both P. sativum and L. sativus contain many repeat sequences in their middle regions, but the repetitive elements of these two species show low similarity to each other. They also differ from those in cupressophytes, suggesting that the repetitive elements are species-specific. In pepper, seven repeats of an 18-bp element were observed. Interestingly, one pair of short direct repeats was located near the inserted repeat sequences. No such sequences were found near the inserted repeat sequences of accD in cupressophytes and legumes, suggesting that these direct repeats are not necessary for the formation of repeat sequences. Transcription of accD in pepper was confirmed by reverse transcriptase PCR, so the expanded accD gene in pepper is presumed to be functional. Furthermore, a large number of complex repeats were found in different ecotypes of M. truncatula. It has been suggested that these inserted repeat sequences are not very important for ACCase function (Gurdon and Maliga, 2014). On the other hand, the reading frame of this gene has not been disrupted, so we speculate that the repetitive elements in these species may play a role in regulating protein function. Together, these results suggest that accD is a gene that readily forms repeat sequences independently, and that these repeat sequences are species-specific, being detected only in some species.
The accD gene encodes the carboxyltransferase β subunit of ACCase. It is essential for leaf development in tobacco, as knocking out accD may be lethal (Kode et al., 2005). Three lines of evidence indicate that the function of this gene has not been destroyed in cupressophytes. First, despite containing repeats, the original reading frame of accD is maintained, suggesting that the gene remains functional in cupressophytes. Second, three sites are considered important for accD in potato: an acetyl-CoA binding site, a CoA-carboxylation catalytic site, and a carboxy-biotin binding site. All three sites are located in the C-terminal region of the protein in all gymnosperm species. Third, only the cupressophyte lineages contain a large number of complex repeats. The accD genes of Cycas (359 amino acids), Ginkgo (323 amino acids) (Table 1), and most angiosperms have not expanded and do not contain repeat sequences. Yi et al. (2013) confirmed that accD in C. oliveri is still functional after expansion. The alignment of accD from 57 gymnosperms shows that the gene is highly conserved at the 3′ end. In comparison, the nuclear copies of accD in Trachelium and Trifolium each encode only the 3′ end region of the chloroplast gene (Rousseau et al., 2013). The 3′ end region of accD encodes the carboxyl transferase, the most important functional region identified in this gene (Zhang et al., 2003). It is therefore reasonable that the 3′ end of accD is more conserved than the highly variable 5′ end, owing to functional constraint.

The Acceleration of Substitution Rate
Gene-specific rate acceleration is considered a common feature of chloroplast genome evolution (Park et al., 2017). Among the three genes (rbcL, matK, and accD) analyzed in this study, only accD shows an obvious acceleration in both dN and dS in cupressophytes (Figure 2), which is also the lineage with abundant repeat sequences (Supplementary Table 2 and Figure 1). The rbcL and matK genes were also searched for repetitive elements using Tandem Repeats Finder, but none were found. This suggests a positive association between repetitive elements and substitution rate acceleration. Many studies suggest that species-specific rate acceleration is associated with genomic rearrangement (Guisinger et al., 2008, 2010). However, the relationship between repeat sequences and rate acceleration has been little documented. Park et al. (2017) attributed the accelerated substitution rate of accD in Geraniaceae to inserted sequences. However, we found that repetitive elements also exist in these inserted sequences. The insertion of repetitive elements may have made the sequence more variable, thus leading to the acceleration of substitution rates.

Genome Rearrangement Happened Near accD
Dispersed repeat elements are thought to be located at rearrangement endpoints. In this study, we also sequenced the genes near accD and found six types of gene order around accD (Figure 3). This suggests that accD is located at inversion endpoints. Apart from the relocation or complete loss of trnD or trnR, rbcL is generally located on one side of accD. In Cephalotaxus, a large inversion relative to other gymnosperms has relocated accD near rps16 rather than rbcL (Wu et al., 2011). We speculate that repetitive elements may induce rearrangements near accD. One explanation for this association is that recombination between repeat sequences can lead to genome rearrangements (Rogalski et al., 2006; Gray et al., 2009). In addition to accD, other rearrangement endpoints caused by inversions also exist. For instance, two inversions were identified between the Agathis dammara and Nageia nagi chloroplast genomes (Wu and Chaw, 2014), which places the ycf1-clpP intergenic region, rpl23, rpl20, and petG at rearrangement endpoints. However, repeat elements were detected only in the ycf1-clpP intergenic region; rpl23, rpl20, and petG did not contain repeat sequences. These observations suggest that not all rearrangement endpoints contain repeat elements.

CONCLUSION
The accD gene in cupressophytes has undergone an extraordinary length expansion, which was mainly caused by abundant, independently formed repetitive elements. Along with the repetitive elements, the dN and dS of accD are also accelerated. In addition, accD has been involved in many rearrangement events. Together, these results suggest that repetitive elements may promote the acceleration of substitution rates and mediate genome rearrangement. Our study provides a representative case for research on the relationship among repetitive elements, genome rearrangement, and substitution rate.