Large Differences in the Haptophyte Phaeocystis globosa Mitochondrial Genomes Driven by Repeat Amplifications

The haptophyte Phaeocystis globosa is a well-known species for its pivotal role in global carbon and sulfur cycles and for its capability of forming harmful algal blooms (HABs) with serious ecological consequences. Its mitochondrial genome (mtDNA) sequence has been reported in 2014 but it remains incomplete due to its long repeat sequences. In this study, we constructed the first full-length mtDNA of P. globosa, which was a circular genome with a size of 43,585 bp by applying the PacBio single molecular sequencing method. The mtDNA of this P. globosa strain (CNS00066), which was isolated from the Beibu Gulf, China, encoded 19 protein-coding genes (PCGs), 25 tRNA genes, and two rRNA genes. It contained two large repeat regions of 6.7 kb and ∼14.0 kb in length, respectively. The combined length of these two repeat regions, which were missing from the previous mtDNA assembly, accounted for almost half of the entire mtDNA and represented the longest repeat region among all sequenced haptophyte mtDNAs. In this study, we tested the hypothesis that repeat unit amplification is a driving force for different mtDNA sizes. Comparative analysis of mtDNAs of five additional P. globosa strains (four strains obtained in this study, and one strain previously published) revealed that all six mtDNAs shared identical numbers of genes but with dramatically different repeat regions. A homologous repeat unit was identified but with hugely different numbers of copies in all P. globosa strains. Thus, repeat amplification may represent an important driving force of mtDNA evolution in P. globosa.


INTRODUCTION
The haptophyte Phaeocystis globosa is a cosmopolitan phytoplankton that plays a pivotal role in global carbon and sulfur cycles by releasing substantial quantities of dimethylsulfide propionate (DMSP) (Schoemann et al., 2005). It is also an important HAB species that causes blooms in various ocean regions including the Arabian sea, the Southern North Sea, and the coastal waters of East Asia and Southeast Asia (Madhupratap et al., 2000;Chen et al., 2002;Schoemann et al., 2005;Hai et al., 2010;Rousseau et al., 2013;Shen and Qi, 2021). P. globosa blooms outbreak frequently, bringing serious ecological hazards, and causing great economic losses of aquaculture industry and nuclear power facility threats (Shen et al., 2000;Qi et al., 2004;Hai et al., 2010;Cao et al., 2017).
Accumulating evidence suggested the P. globosa has high level of genetic diversity. Various physiological and ecological characteristics were observed among different P. globosa strains. For example, P. globosa strains collected in different ocean regions (e.g., Hong Kong and Shantou, China) have different acute toxic effects to Artemia sinica (Wei and Jiang, 2005). Sizes of P. globosa colonies found in the Chinese coastal waters can reach 3 cm in diameters, which are substantially larger than those found in European waters (Qi et al., 2004;Rousseau et al., 2007). The temperatures at which P. globosa blooms occurred were different for P. globosa blooms recorded in different geographical regions. For example, most P. globosa blooms occurred in Chinese coastal waters when temperatures were 16-22 • C (Shen et al., 2000;Qi et al., 2004), which were higher than that in European waters (Seuront et al., 2006;Blauw et al., 2010). Cell surface structures of P. globosa strains isolated in Europe and Beibu Gulf, China were different. For example, small circular bulges were observed in small flagellate cells of China strains, while such characteristics were not observed in European strains (Hu et al., 2019).
Much effort has been made to ascertain P. globosa genetic diversity as a way to understand phenotypical differences. Common molecular markers (18S rDNA, 28S rDNA, ITS, and rbcL) have been applied to distinguish P. globosa strains with limited success (Hu et al., 2019;Song et al., 2020). The ITS region has been demonstrated to be a poor molecular marker because it has been found to have high intra-genome variation, while 18S rDNA, 28S rDNA and rbcL have been found to be over conservative that they cannot be used to effectively distinguish different strains (Hu et al., 2019;Song et al., 2020). Thus, these common molecular markers could not be adequately used to probe P. globosa genetic or phenotypic diversity. Molecular markers with high resolution are urgently needed to resolve P. globosa genetic diversity in environment.
Comparative analysis of the organelle genomes has been fruitful for identifying molecular markers with improved resolution. The complete chloroplast genome (cpDNA) of the first P. globosa strain was published in 2014, which was 107,461 bp in length (Smith et al., 2014). One molecular marker rbcS-rpl27 based chloroplast sequence intergenic spacer was found to be able to separate phylogeographic populations of P. globosa (Zhang et al., 2021). Comparative analysis of the cpDNAs of P. globosa strains facilitated the development of a new molecular marker pgcp1 with even higher resolution for tracking P. globosa genetic diversity .
Relative mutation rate of mtDNAs has been found to be higher than that of cpDNAs in P. globosa (Smith et al., 2014). The mtDNAs of P. globosa and P. antarctica have been published but they are still incomplete, due to the complex repeat regions (Smith et al., 2014). Repeat regions are ubiquitous in algal mtDNAs, some of which can be very large in length. For example, mtDNA of the cryptophyte Rhodomonas salina contains a 4.7 kb long repeat region, 11 repeat motifs ∼40-700 bp long occurring up to 31 times, forming a complex repeat structure. Tandem repeats are the major arrangement but the region also includes a large, 3 kb, inverted repeat and several potentially stable 40-80 bp long hairpin structures (Hauth et al., 2005). The cryptophyte Hemiselmis andersonii has a 20 kb repeat region, comprised of numerous tandem and dispersed repeat units of between 22 and 336 bp, and occurring up to 100 times. Adjacent to these repeats are 27 copies of palindromic sequences predicted to form stable DNA stem-loop structures (Kim et al., 2008). The mtDNA of Phaeodactylum tricornutum has an even larger repeat region of 35 kb, consisting almost entirely of various combinations of several kinds of tandem repeat, including five elements 33-404 bp (Oudot-Le Secq and Green, 2011). The chlorophyte Pedinomonas mtDNA contains a 9 kb-repeated sequence, 13 repeat elements range in size from 6 to 389 bp (ele-01 to ele-13) were characterized. Although most members within an element family are identical, some are not (Turmel et al., 1999). We hypothesize that change in repeat regions may be, at least partially, due to differential repeat unit amplification, which is an important driving force for P. globosa mtDNA differences in evolution.
In this study, we constructed the first complete mtDNA of P. globosa, which contained two large repeat regions of 6.7 kb and ∼14.0 kb in length, respectively. We identified several repeat units with various numbers of copies. Comparative analysis of six P. globosa mtDNAs revealed that the copies numbers varied substantially. Our results revealed that change in repeat units is a driver of P. globosa mtDNA diversity.

Phaeocystis globosa Strain Isolation, Culture and Characterization
The five P. globosa strains (CNS00066, CNS00069, CNS00087, CNS00262, and CNS00266) described herein were isolated from samples collected in 2019 from P. globosa bloom waters of the Beibu Gulf, China and Vietnam sea area (Table 1). Single-cell isolation was achieved by selecting individual cells using a micropipette, followed by multiple washes before transferring each single cell to 24-well culture dishes for growth and characterization. They grew in L1 seawater culture medium (without Na 2 SiO 3 ) (Guillard and Hargreaves, 1993). Cultures were maintained at 20 • C under a 12:12 h light: dark cycle with approximately 30 µM s −1 m −2 of cool white fluorescent illumination.
Identification of the cultured algae was done based on microscopic morphological characters and phylogenetic analysis based on 18S rDNA. ZEISS Axio Imager Z2 light microscope (Zeiss, Sliedrecht, Netherlands) were used for morphological observation. 18S rDNA was assembled basing Illumina reads with GetOrganelle (Jin et al., 2020), with publicly available 18S rDNA sequences of P. globosa as reference sequences. The assembled sequences were further validated by aligning sequencing reads against the assembled result using BWA (Li and Durbin, 2009) and SAMtools (1.9)  and visualized using IGV (Thorvaldsdottir et al., 2013). For phylogenetic analysis, 40 18S rDNA sequences were used. The sequences were aligned using MAFFT (Katoh and Standley, 2013). The ambiguously aligned regions were further manually edited and adjusted using MEGA7.0 (Kumar et al., 2016). There were a total of 1562 positions in the final dataset. The maximum likelihood (ML) tree and aBayes (BI) tree based on 18S rDNA were inferred using W-IQ-TREE, with best substitution model TNe + I + G4 was selected using default parameters of W-IQ-TREE (Trifinopoulos et al., 2016). The support for nodes was assessed by performing 1000 bootstrap replicates. For analysis of sequence variability between Phaeocystis strains, 1633 positions in the final 18S rDNA sequence matrix, 420 positions in the 18S rDNA V4 sequence matrix were obtained, 171 positions in the 18S rDNA V9 sequence matrix were obtained. Numbers of pair-wise variations sites among P. globosa strains were calculated using BioEdit (Hall, 1999).

DNA Preparation and Genome Shotgun Sequencing
Cultures at the exponential growth phase were harvested and concentrated by centrifugation. Two sequencing methods were used for the genomes in this study. All five strains obtained in this study were sequenced using Illumina sequencing, while the strain CNS00066 was also sequenced using PacBio Sequel II sequencing technology. For Illumina sequencing, total nucleic acids were extracted using the OMEGA HP Plant DNA Mini Kit (Omega Biotek, Inc., United States) and quantified using a NanoDrop One spectrophotometer (Labtech International Ltd., Uckfield, United Kingdom). DNA samples of five P. globosa strains were prepared for whole genome sequencing. DNA degradation and contamination were monitored on 1% agarose gels. DNA concentration was measured using Qubit R DNA Assay Kit in Qubit R 2.0 Flurometer (Life Technologies, CA, United States). Sequencing libraries were generated using NEBNext R DNA Library Prep Kit following manufacturer's recommendations and indices were added to each sample. The genomic DNA was randomly fragmented to a size of 350 bp by shearing, then DNA fragments were end polished, A-tailed, and ligated with the NEBNext adapter for Illumina sequencing, and further PCR enriched by generic adapter P5 and P7 oligos. The PCR products were purified (AMPure XP system) and resulted in libraries were analyzed for size distribution by Agilent 2100 Bioanalyzer system and quantified using real-time PCR. Qualified libraries were sequenced on an Illumina platform according to effective concentration and data volume at Novogene (Beijing, China). Sequencing was done using NovaSeq PE150 (Illumina, San Diego, CA, United States).
For strain CNS00066, the long-read DNA sequencing technology PacBio Sequel II was used for constructed the complete mitochondrion genome. For PacBio Sequel II sequencing, high quality genomic DNA was extracted using a modified CTAB method. RNase A was used to remove RNA contaminants. The quality of the DNA was checked by agarose gel electrophoresis. One long insert library (15 kb) was prepared, which was sequenced using the PacBio Sequel II platform (Frasergen, Wuhan, China). We generated 6.44 Gb HiFi reads.

Quality Control and Assembly
For Illumina sequencing data, methods of quality control and assembly were similar with Song et al. (2020). Raw reads in FASTQ format were first processed through a series of quality control (QC) procedures with in-house C scripts by (1) removing reads with ≥ 10% unidentified nucleotides (N); (2) removing reads with > 50% bases having Phred quality < 5; (3) removing reads with > 10 bp aligned to the adapters, allowing ≤ 10% mismatches; (4) removing putative PCR duplicates generated by PCR amplification in the library construction process (read 1 and read 2 of two paired-end reads that were completely identical). The filtered reads were assembled with SPAdes (Bankevich et al., 2012) and GetOrganelle (Jin et al., 2020).
The resulting assembly scaffolds from the mtDNA were determined using BLAST (Johnson et al., 2008) of the complete mtDNA of the strain CNS00066 and publicly available partial mtDNA of P. globosa strain Pg-G(A) (NC_021637) (Smith et al., 2014). One mtDNA scaffold which contained one N region was obtained for each strain, respectively. Finally, clean reads were mapped to the assembled mtDNA to validate and to correct the wrong bases, using BWA (Li and Durbin, 2009) and SAMtools (1.9)  and visualized using IGV (Thorvaldsdottir et al., 2013). We successfully completed the repeat region 1 in the strains CNS00069, CNS00087, CNS00262, and CNS00266, all of which were validated using Sanger DNA sequencing. The sequences were further validated by aligning sequencing reads against the assembled sequences using BWA (Li and Durbin, 2009) and SAMtools (1.9) ) and visualized using IGV (Thorvaldsdottir et al., 2013).
For PacBio Sequel II sequencing data, data were mapped to partial P. globosa mtDNA (KC967226) with minimap2 (version 2.5) (Li et al., 2018) using the default parameters, and a total 7.03Mb of potential mitochondrial HiFi reads were obtained. Reads depth of coverage was > 160×. The mtDNA HiFi reads were locally assembled with MECAT2 software (default parameter) (Xiao et al., 2017) to obtain circular mitochondrial contig 43,585 bp. The mtDNA was validated through an additional round of alignment using BWA (Li and Durbin, 2009) and SAMtools (1.9) ) and visualized using IGV (Thorvaldsdottir et al., 2013).
In order to facilitate comparative analysis, the starting sites of mtDNAs were all adjusted to cox1.

Synteny Analysis
The synteny comparison of seven mtDNAs of Phaeocystis was visualized using Mauve ver. 2.3.1 under the progressive mode (Darling et al., 2010). The gene content and synteny of strain CNS00069, strain Pg-G(A) and P. antarctica strain CCMP1374 compared to that of strain CNS00066, respectively, was performed using circos-0.69 (Krzywinski et al., 2009).

Repeat Analysis
A dot-plot similarity matrix of P. globosa strain CNS00066 mtDNA against itself was done with Dotter ( Barson and Griffiths, 2016), in order to show repeat regions. The repeat regions were analyzed using Tandem Repeats Finder (Benson, 1999) and local BLAST using this mtDNA as the database. For stemloop structure prediction and calculation of minimum Gibbs free energies in DNA, we utilized RNAstructure 3 used default parameters (Bellaousov et al., 2013). Here, we considered only stem-loop structures which were max expect results predicted, although numerous smaller and less stable structures. The repeat units sequence matrix was aligned by MAFFT (Katoh and Standley, 2013), and the ambiguously aligned regions were further manually edited and adjusted by eye. The sequence similarity was analyzed using BioEdit (Hall, 1999).

Structural Features of the mtDNAs and Base Composition
Five P. globosa strains (CNS00066, CNS00069, CNS00087, CNS00262, and CNS00266) obtained in this study were identified based on comparative analysis of 18S rDNA sequences. The five P. globosa strains were nested within P. globosa monophyletic clade with robust support (Figure 1). The numbers of nucleotide differences between P. globosa strains were 0-16 nucleotides, which were much smaller than that of other congeneric species (19-74 nucleotides). V4 region of 18S rDNA was also analyzed, and the numbers of nucleotide differences between P. globosa strains were 0-6 nucleotides, which were much smaller than that of other congeneric species (11-20 nucleotides). 18 of 25 Phaeocystis 18S rDNA sequences have complete V9 region, and we analyzed the V9 region of these 18 strains. The number of V9 region nucleotide differences between P. globosa strains were 0-2 nucleotides, which was indistinguishable from other congeneric species (0-20 nucleotides). This indicated that V9 region does not provide enough resolution for species identification in Phaeocystis.
Of these five P. globosa strains, CNS00066 was a solitarycelled (Figure 1), which never formed colonies under laboratory culture condition, although it was originally isolated from a P. globosa colony. CNS00066 cells were 3-4.5 µm in diameter with two flagella and one haptonema emerging from a depression in the cell body. Strains CNS00069, CNS00087, CNS00262, and CNS00266 had both solitary cells and colonies, with cells being 4-8 µm in diameter, and colonies being up to 0.3 cm in diameter. Some cells in colonies had flagella and haptonema, while other did not.
The mtDNA of P. globosa strain CNS00066 was a circular molecule of 43,585 bp (Figure 2A). It had two large blocks of repeat regions. The repeat regions segregated the coding region into two parts and the genes were encoded on two strands, respectively, suggesting that the genome may be transcribed in two units. Aside from the repeat region, the mtDNA was relatively compact with 92.8% of sequence specifying genes and structural RNAs, with only 7.2% non-coding regions. The overall AT content of the mtDNA of CNS00066 was 70.6%, which was similar to mtDNAs of other species in the FIGURE 1 | The phylogenetic analysis of the haptophyte species based on 18S rDNA sequences. The evolutionary history was inferred by using the Maximum Likelihood method and aBayes (BI) method based on TNe + I + G4 substitution model. The analysis involved 40 18S rDNA sequences. There were a total of 1562 positions in the final dataset. Evolutionary analyses were conducted using W-IQ-TREE. Numbers at the branches represent aBayes support/ultrafast bootstrap support (%), respectively. Only bootstrap values above 0.50 aBayes support and 50% ML bootstrap support are shown. The strains obtained in this study were bold. Morphology of strain CNS00066 was shown.
Frontiers in Microbiology | www.frontiersin.org A B FIGURE 2 | Circular map of the P. globosa mtDNA and phylogenetic analysis of haptophyte species. (A) Circular map of the mtDNA of P. globosa strain CNS00066. Genes were colored according to the functional categories listed in the index below the map. The GC content is indicated on the inner circle. The repeat regions (repeat 1 and repeat 2) is in gray. Genes showed inside and outside the circle are transcribed clockwise and counter-clockwise, respectively. (B) Phylogenetic tree of haptophyte species based on 13 mitochondrial proteins (ATP6, ATP9, COX1, COX2, COX3, COB, NAD1, NAD2, NAD3, NAD4L, NAD4, NAD5, and NAD6). Phylogenetic tree was reconstructed on the IQ-TREE web server (Trifinopoulos et al., 2016). Numbers at the branches represent aBayes support/ultrafast bootstrap support (%), respectively. Our strains were shown in bold.

Phylogenetic Analysis of Haptophyte Species
To explore the evolutionary relationship between P. globosa and other haptophyte members, we constructed a phylogenetic tree using protein sequences encoded by 13 PGCs in the mtDNAs that were shared by all haptophyte members ( Figure 2B). The mtDNAs of five strains obtained in this study clustered with other P. globosa strains in a monophyletic clade with robust support. The mtDNAs of strain CNS00066 (MW435860) clustered with P. globosa (KC967226) with moderate support. The mtDNAs of four strains CNS00069 (MW435863), CNS00087 (MW435865), CNS00262 (MW435861), and CNS00266 (MW435862), which were isolated from the Beibu Gulf in China and Vietnam sea area, clustered together in a different clade. The species in Coccolithophyceae, Pavlovophyceae, and Rappephyceae all formed well supported branches, and Rappephyceae was sister to the Pavlovophyceae. The phylogenetic relationship was generally agreed with the 18S rDNA-based phylogenetic relationship in this study and previous results basing mitochondrial and plastid dataset (Kawachi et al., 2021) with minor differences, which Rappephyceae was sister to the Coccolithophyceae.

Gene Content and Synteny Analysis of Phaeocystis Species
The six mtDNAs of P. globosa (five obtained in this study and KC967226) shared an identical number (19) of PCGs (Figures 3A-C). Four of these genes (atp4, atp6, atp8 and atp9), which encode subunits of the ATP synthase complex, were identified. Seven genes encoding NADH dehydrogenase subunits, nad1-6 and nad4L, were detected. Three genes encoding complex IV (cytochrome c oxidase subunits), cox1, cox2 and cox3 were identified. Four genes encoding ribosomal proteins, rpl16, rps12, rps14 and rps3 were detected. One gene encoding complex III (cytochrome c reductase) cob was identified. A total of 25 tRNA genes and two rRNA genes (rrnL and rrnS) are present in the P. globosa mtDNA. 5S rRNA gene was not detected. None of the genes contained intron except rrnL (23S rRNA) gene in mtDNA of P. globosa strain Pg-G(A), which contained a single intron.
We performed a comparative analysis of seven mtDNAs of the genus Phaeocystis. Comparison of genome organization and gene order were shown in Figure 3A and Supplementary Figure 1. Five conserved gene blocks (blocks a-e) were identified among mtDNAs of Phaeocystis. Within P. globosa strains, two gene arrangement events were discovered. The first was mtDNAs of strain CNS00066 and strain Pg-G(A), which showed identical gene orders (blocks a-b-c-d-e), and the second were mtDNAs of strains CNS00069, CNS00087, CNS00262, and CNS00266, which showed another identical gene order, and blocks order was a-bd-e-c. Compared with the first genome organization, the block b was inverted and reversed, and the block c and the blocks d-e were reciprocally translocated in the second genome organization. A high degree of syntenic conservation was identified between P. globosa strains CNS00066, Pg-G(A) and P. antarctica, except that block e (nad5) was inverted and reversed in P. antarctica. Compared to P. globosa, another additional tRNA gene trnE-UUC was found in P. antarctica.

Variable Repeat Lengths and Locations Among mtDNAs of Phaeocystis globosa Strains
To test our hypothesis that plasticity of repeat regions may be an important driver of P. globosa mtDNA diversity in evolution, we constructed mtDNAs of four additional P. globosa strains (CNS00069, CNS00087, CNS00262, and CNS00266). The mtDNAs of these strains were still incomplete, suggesting the presence of complex repeat sequences that remain to be resolved.
Nevertheless, all genes were successfully identified in these sequences, suggesting that the missing sequences were repeat regions most likely without containing any genes. Aside from the repeat regions, mtDNAs were also relatively compact which was similar to that of strain CNS00066. AT contents of these strains were 66.7-67.2%, which were a little lower than the complete mtDNA of CNS00066.
The mtDNAs of all P. globosa strains contained two noticeable repeat regions with varying lengths in different strains: repeat 1 and repeat 2 ( Figure 6A). The length of complete repeat regions repeat 1 (larger than 250 bp were analyzed here) was rather variable among different P. globosa strains, which varied substantially between 334 bp-6.7 kb, with the longest repeat found in the strain CNS00066, and the shortest repeat found in strain CNS00266, suggesting that the repeat regions might be plastic. The gray rectangles in Figure 6A showed the end of the scaffold (represented using a segment of 1000N), and these regions were repeat 2 which are not yet constructed. The still unresolved repeat 2 within all P. globosa strains was located at the beginning of two transcription units, while repeat 1 was located at the ending of two transcription units. Within the mtDNA of the P. antarctica strain CCMP1374, three repeat regions were found, among which only one complete repeat region was obtained (763 bp in length), and two unfinished repeat regions were located at the ends of two transcription units. The current known repeats length of Phaeocystis strains was substantially shorter than the corresponding repeat in strain CNS00066.
The locations of the repeat regions were also variable among different strains. Within the strains CNS00066 and Pg-G(A), two repeat regions were located between block c and block d, block e and block a, respectively. Within strains (CNS00069, CNS00087, CNS00262, and CNS00266), two repeats were between block a and block b, block e and block c, respectively. Gene structure rearrangement was close to repeat regions, such as the block b in strains (CNS00069, CNS00087, CNS00262, and CNS00266) was inverted compared with that of CNS00066. This suggested that rearrangement in mtDNA of Phaeocystis may have occurred through repeat-mediated recombination.

Characterization of Repeat Units in the P. globosa Strain CNS00066
Two complex repeat regions were observed in mtDNA of strain CNS00066 (Figure 6B). Six repeat units (A-F) of different length (13 bp-203 bp) were identified in repeat regions in the strain CNS00066 ( Figure 6C and Table 3). The number of copies of each repeat unit varied dramatically and the copies of each repeat unit were not identical. For example, repeat unit A (∼163 bp) had 21 copies with sequence identity ranging from 94.4 to 100%; Repeat unit B (∼51 bp) had 252 copies with sequence identity ranging from 63 to 100%; Repeat unit C (∼63 bp) had 48 copies with sequence identity ranging from 92 to 100%; Repeat unit D (34 bp) had 22 copies with sequence identity ranging from 91.1 to 100%; Repeat unit E (13 bp) had 18 copies with sequence identity ranging from 92.3 to 100%; Repeat unit F (∼203 bp) had only two copies with sequence identity of 99.0%. Interestingly, the 5 26 bp of the repeat unit C showed high similarity to repeat unit B, suggesting possible evolutionary relationship between these two repeat units.

FIGURE 6 | Continued
Homologous Repeat Sequence Found in P. globosa mtDNAs Because repeat regions were identified in all sequenced mtDNAs of Phaeocystis species, we hypothesized that the repeat units identified in different mtDNAs share homology. To test this hypothesis, we analyzed the repeats of mtDNAs of five P. globosa strains obtained in this study, in addition to the P. globosa strain Pg-G(A) and the P. antarctica strain. Results showed that homologous repeat unit A could be identified in all P. globosa mtDNAs (Figure 6E), but not in the P. antarctica mtDNA.
The repeat unit A showed variable copy numbers, different length and sequence similarities among different strains ( Figure 6E). The lengths varied between 55 bp-161 bp. The repeat unit A appeared three times in strain CNS00069, 12 times in strain CNS00087, six times in strain CNS00262, three times in strain CNS00266, and five times in strain Pg-G(A). No homologous sequences were found within P. antarctica. The results indicated the repeat unit A may share a common ancestor within P. globosa.

DISCUSSION
The mtDNAs are generally compact with short intergenic regions (Gray et al., 1998;Boore, 1999). However, the sizes of mtDNAs   can vary considerably among related species and even between strains of same species, which have been found to be driven by differences in repeat content or the inclusion of introns (Liu et al., 2017). The mtDNAs sequences of two Ulva pertusa strains Up1 and Up2 were different by 4.7 kb different in length, which was due to two additional group II introns in two genes (cox1 and cox2) and tandem duplication mutations in non-coding regions in the larger mtDNA (Liu et al., 2017). A pair of 3.5 kb-long, cox1-harboring repeats were found only in two Nannochloropsis strains CCMP527 and CCMP537, while intergenic regions of the six mtDNAs were generally conserved (Wei et al., 2013). Metschnikowia yeasts mtDNAs vary extensively in size and mainly contributed by different numbers and sizes of introns (Lee et al., 2020). This study reported the first full-length mtDNA of P. globosa, which was 43,585 in size. This large mtDNA size was driven by two large complex repeat regions, whose combined length was almost half of the entire mtDNA size. Indeed, the length of repeat sequences in P. globosa was the longest repeat region among all revolved haptophyte mtDNAs. The repeat regions were results of amplification of a few repeat units. The sequences among different repeat unit copies within strain were not identical, and base substitutions and indels existed. Repeat unit could be ubiquitous mutational hotspots. We also identified sequence homology among different repeat regions in the same strains, such as repeat unit B both exist in repeat 1 and repeat 2 in strain CNS00066. This indicated that repeat unit may have been amplified among different repeat regions within mtDNA. The structural features of the repeat regions of P. globosa have some similarities compared to other algae such as cryptophyte Rhodomonas salina (Hauth et al., 2005) and Hemiselmis andersenii (Kim et al., 2008), diatom Phaeodactylum tricornutum (Oudot-Le Secq and Green, 2011), and Chlorophyte Volvox carteri (Smith and Lee, 2010) which all were composed of several different repeat units, and usually include tandem repeat, reverse repeat and palindromic sequences.
Among P. globosa different strains, the length and location of repeat regions, number of repeat units were highly variable, but there is some regularity. For example, repeat unit A was identified in all P. globosa strains. It was not found in P. antarctica, suggesting that it may be P. globosa-specific. The repeat unit A retained obvious sequence similarity among different strains, and this indicated repeat unit A may have a shared origin, existing in the mtDNA of the most recent common ancestor of strains. The sequence similarity of repeat unit A among different strains was often lower than that among different copies within strains, suggesting that the repeat regions acquired mutations in different strains. The fast mutation rates of repeat regions have also been reported in species Ulva lactuca  and Ulva pertusa (Liu et al., 2017). Sequence similarity of repeat regions have been reported between related chlorophyte species Haematococcus lacustris and Stephanosphaera pluvialis, suggesting recent common ancestor of these two species (Zhang et al., 2019;Smith, 2020).
All P. globosa strains shared same numbers genes, and these gene sequences were mostly conserved among all P. globosa strains, consistent with previous observations that mtDNAs are generally compact (Gray et al., 1998). Genes are encoded on both DNA strands and their orientation suggests the presence of two transcription units within P. globosa mtDNAs, e.g., starting from the repeat region 2 and ending in repeat region 1 in both transcription directions in strain CNS00066.
In addition to differences in repeat content among mtDNAs of different strains, genome rearrangement events were also identified. Repeat regions, especially palindromic sequences were adjacent to rearrangement of the mtDNA, suggesting that they may play an important role in genomic rearrangements, transcription and replication, as proposed for other algae including red alga Chondrus crispus (Leblanc et al., 1995) and cryptophyte algae Rhodomonas salina (Hauth et al., 2005). Microhomologies have also been associated with tandem duplications and structural variation in plant mtDNAs (Xia et al., 2020). Interesting, the palindromic sequence p2 are coincident with a change in the direction of "cumulative GC skew." Additionally, the palindromic sequence p2 had minimum CG content (Supplementary Figure 3), suggesting that the palindromic sequence p2 may correspond to the origin of replication.
Among the known 18 mtDNAs of haptophyte species, nine were reported as complete sequences, and other nine were reported as partial sequences. Nine full-length mtDNAs all contain repeat regions of various lengths ranging from 1.9 to 20.7 kb (Puerta et al., 2004;Smith and Keeling, 2012;Hovde et al., 2014Hovde et al., , 2019Nishimura et al., 2014;Smith et al., 2014;Hulatt et al., 2020;Wideman et al., 2020). The mtDNA of the P. globosa strain CNS00066 had the longest repeat region among all constructed haptophyte mtDNAs. The repeat regions were usually composed of tandem repeat formed by multiple repeat units, such as Phaeocystis strains and Chrysochromulina tobinii CCMP291, Pavlova strains, and Diacronema viridis strains. Chrysochromulina tobinii CCMP291 has the second longest repeat regions in Haptophyta, which was ∼9.3 kb and consisted of four repeat units ranging from 85 bp to ∼1.5 kb. Chrysochromulina sp. NIES-1333 showed a very different mitochondrial structure from other haptophyte strains described above. The Chrysochromulina sp. NIES-1333 genome (34,291 bp) has two small intergenic repeats of 1624 bp and 1630 bp, and missing a large, complex repeated sequence array.
Seven of nine partial mtDNAs were obviously truncated in the repeat regions, including those of Pavlova lutheri CCMP 1325 and all Phaeocystis strains. No obvious repeat regions were found in Chrysochromulina parva UW 1161, Haptophyta sp. isolate H2 and P. ranunculiformis NIES-3900, and their mtDNAs are partial sequences and have not been successfully assembled. Homologous sequences of the repeat region of P. globosa have not been discovered in other haptophyte species. Length of repeat region of different P. globosa strains was variable. The repeat length polymorphism maybe used as molecular marker for specific strains.
Phylogenetic analysis using mtDNAs revealed that Rappephyceae was sister to the Pavlovophyceae, while Prymnesiophyceae split from their common ancestor earlier. However, the support was only moderate (Figure 2B), suggesting that the evolutionary relationships among these three classes (Rappephyceae, Pavlovophyceae, and Prymnesiophyceae) might be more complicated. Indeed, our 18S rDNA-based phylogenetic analysis and a recent study suggested that Rappephyceae was sister to the Prymnesiophyceae, while Pavlovophyceae split from their common ancestor earlier (Kawachi et al., 2021). Their exact evolutionary relationship will be resolved with more genomes are sequenced and assembled.

CONCLUSION
In this study, we successfully constructed the full-length mtDNA of P. globosa for the first time. It was a circular genome with a size of 43,585 bp, encoding 19 PCGs, 25 tRNA genes and two rRNA genes. Surprisingly it was found to contain two large-sized repeat regions of 6.7 kb and ∼14.0 kb in length, respectively. The combined length of these two repeat regions accounted for almost half of the entire mtDNA. The length and the repetitive nature of these two repeat regions made it challenging to construct the complete length of full-length mtDNA of P. globosa. Further analysis revealed that the repeat regions consisted of combinations of several of repeat units. To explore the nature of repeat sequences in other P. globosa strains, we have also constructed mtDNAs of four additional P. globosa strains (one from Beibu Gulf, China and three from Vietnam). Comparative analysis revealed that corresponding repeat regions were readily identified in all strains but with different lengths. A homologous repeat unit A was found to be prevalent among all P. globosa strains, but with different numbers of copies and length. In addition to differences in repeats, these mtDNAs also differed in gene structure rearrangement. Another surprising result identified in this project was the plasticity of the repeat regions of the mtDNAs of different P. globosa strains. Thus, repeats may be an important driving force of mtDNA evolution. In addition, the mtDNAs of different P. globosa strains showed variations, suggesting that the mtDNAs can be used as a super molecular marker for distinguishing different strains.

AUTHOR CONTRIBUTIONS
NC conceived of the project. HS and YC organized the strain selection, cultivation, DNA preparation, and genome sequencing. HS organized the assembly, annotation, quality control, and analyzed the data with suggestions from others. HS and NC wrote the manuscript with inputs from others. FL revised the manuscript. All authors read and approved the final version of the manuscript.

FUNDING
Supplementary Figure 2 | The partial palindromic (contains the repeat units near loop) sequences of p1 and p2. They exhibit a strong potential to form hairpin secondary structures.
Supplementary Figure 3 | GC content and GC skew of P. globosa strain CNS00066 mtDNA. GC content (black) and GC skew (green and purple). For GC skew, the center line indicates the average GC skew value for the genome. Green shading above the line denotes GC skew values greater than the genome average, whereas purple shading below the line denotes GC skew values less than the genome average. Major and minor tick marks on the outermost and innermost circles show nucleotide position on the genome in 1 kb increments. The arrow points to loop of the palindromic sequence p2 in repeat region 2.