DNA Barcoding Diatoms From China With Multiple Genes

Diatoms play a key role in water quality assessments and algae bloom. However, taxonomic confusion often exists for diatoms, and morphological characters are extremely diverse for species identification. DNA barcoding with multiple genetic markers can contribute much to diatom diversity investigation. In this study, we employed sequences of four genetic markers (COI, rbcL, SSU, and LSU) to discriminate diatom strains from both marine and freshwater environments of China, by tree, distance, and character-based barcoding methods. The available published diatom sequences were also incorporated into our new sequences. A total of 93 rbcL, 81 COI, 83 SSU, and 75 LSU sequences of diatom samples were obtained in this study. The multiple genetic markers discriminated most species clearly. The identification of species by micrographic observation was generally consistent with the DNA barcoding analysis except that some potential cryptic species were revealed by DNA barcoding. The COI, rbcL, and LSU sequences all showed high taxonomic resolution at the species level by phylogenetic and character-based analysis. Some potential identification errors in public diatom sequences were also found. The phylogenetic and character-based analysis revealed consistent species identification and showed clearer species discrimination than the distance-based method. In conclusion, our study evaluated the efficiency of four genetic markers in barcoding 11 genera within Bacillariophyta isolated from China and complemented many diatom reference sequences to public databases.


INTRODUCTION
Diatoms are photosynthetic secondary endosymbionts found throughout marine and freshwater environments and are believed to be responsible for around one-fifth of primary productivity on earth and the occurrence of blooms (Bowler et al., 2008;Casteleyn et al., 2010). Diatoms are also frequently used for water quality assessments for marine as well as freshwater environments (Kawecka and Olech, 1993;Spaulding and McKnight, 1999). While some diatom species have broad ecological plasticity, others, including closely related species, are adapted to specific environmental conditions (Vanelslander et al., 2009). There are, estimated 200,000 diatom species, living in terrestrial, freshwater, and marine systems as benthos or phytoplankton (Dam et al., 1994;Potapova and Charles, 2007;Zalack et al., 2010;Hamsher et al., 2011). Diatom-based indices require unambiguous identification at the species level. However, the species identification of diatoms is time-consuming and needs in-depth knowledge of organisms under investigation, such as bacteria (Zhang et al., 2018). Thus, taxonomic confusion often exists for diatoms, while a large number of morphological characters are extremely diverse (Evans et al., 2007).
The identification of diatoms has been somewhat improved by molecular tools, e.g., the discovery of cryptic diversity (Medlin et al., 1991;Behnke et al., 2004;Beszteri et al., 2005;Sarno et al., 2005;Amato et al., 2007;Evans et al., 2007;Poulíčková et al., 2010). For many years DNA barcoding has been proved as a promising approach for species identification and detection of cryptic species, particularly for microbial communities (Hebert et al., 2003a,b;Zou et al., 2016aZou et al., ,b, 2018. Our previous studies have shown that it is important to combine different analytical tools for the DNA barcoding of microalgae (Zou et al., 2016a(Zou et al., ,b, 2018. While the tree-based approach uses neighbor-joining (NJ), Bayesian, or maximum-likelihood trees for species identification, the distance-based approach calculates a genetic distance between species and assigns a cutoff value (the "barcode gap") to discriminate species. The character-based approach discriminates species by the fundamental concept that members of a given taxonomic group share diagnostic characters (more than three bases) that are absent from comparable groups (Rach et al., 2008;Sarkar et al., 2008). A program based on the Characteristic Attributes Organization System (CAOS) algorithm (Sarkar et al., 2002a,b) was developed to implement a character-based approach for DNA barcoding . CAOS is an automated systematic method for discovering conserved character states from cladograms (i.e., trees) or groups of categorical information, and defines attribute tests at each node in a phylogenetic tree, similar to decision tree algorithms. Character states, called "attribute tests" in decision trees, are termed "Characteristic Attributes" (CAs) in CAOS . Although it remains argued which analytical method of DNA barcoding is more precise, it is unquestionable that comparison of multiple analytical methods would be important for taxonomic assignments.
While there is no single conserved gene that could be used for barcoding all phytoplankton taxa, multiple genetic markers (like rbcL and SSU) have been proposed as potential markers for barcoding diatoms (Mónica and Kaczmarska, 2009;Hamsher et al., 2011;Tamura et al., 2011;Guo et al., 2015;Li et al., 2015). Within Bacillariophyta, it was indicated that ITS was a potential marker for the DNA barcoding of Thalassiosirales and that COI could just barcode some genera (Guo et al., 2015). Trobajoa et al. (2011) showed that although COI was more variable than LSU and rbcL for barcoding Nitzschiapalea, it was difficult to recover cox1 sequences. Hamsher et al. (2011) suggested that rbcL-3P should be used as the primary marker for barcoding Sellaphora. Within Chlorophyta, recommended that tufA be adopted as the standard marker for the routine barcoding of green marine macroalgae (excluding the Cladophoraceae). Thus, genetic markers that have universal primers for PCR easy amplification and are variable enough for species discrimination should be further selected. Another issue is that the current reference database is incomplete so some molecular sequences cannot be matched to species level or even higher level. In this case, new DNA marker sequences of various taxa need to be added to the public reference library. In recent years, metabarcoding has developed as a new identification tool for environmental samples (Zimmermann et al., 2015;David and Jed, 2016). For example, Liu et al. (2020a) employed metabarcoding to identify forensic discrimination of drowning incidents. However, one substantial limitation of metabarcoding is exactly the limited reference sequences in public libraries that are used for read assignments (Liu et al., 2020b).
China has large sea areas and many freshwater lakes. Algae bloom in China is becoming a serious environmental problem (Qin et al., 2011;Duan et al., 2015). The cyanobacteria, Chlorophyta and Bacillariophyta, are the main microalgae for bloom. While most researchers focused on the cyanobacteria diversity study in China, the taxonomy of Chlorophyta and Bacillariophyta is lagged by molecular tools. Our previous studies have just identified some genera of Chlorophyta by DNA barcoding (Zou et al., 2016a). The identification of comprehensive species of diatoms from China is important for aquatic ecology.
In this study, we employed sequences of four genetic markers (COI, rbcL, SSU, and LSU) to barcode diatoms from a wide distribution of marine and freshwater environments from China by tree-, distance-, and character-based analytical methods. The available published diatom sequences were also incorporated into our new sequences for better analysis. We aim to (1) evaluate the efficiency of the four genetic markers in barcoding some genera within Bacillariophyta collected by us in this study; (2) contribute new reference sequences of multiple genetic markers of various diatoms species to the public database.

Sample Collection and Culture
We collected diatoms from both marine and lake environments in Qingdao, Nantong, Wuhan, and Zhoushan, China, where the locations in Qingdao, Nantong, Zhoushan, Lianyungang, and Ningbo were marine regions and the location in Wuhan was a lake region (Supplementary Table 1; Supplementary Figure 1). Following Andersen (2005), the diatom strains collected were isolated first. After isolation, the strains were cultured in a 250-ml flask containing a medium. Then, the cultured strains were identified using an electron microscope (40 × zoom), where we assigned the strains to species first by their general shape characteristics and then compared the micrographic observations with the barcoding identification. The detailed sampling information, including GenBank numbers, is shown in Supplementary Table 1

PCR Amplification, Sequencing, and Sequence Alignment
After DNA extraction with the Qiagen DNEasy Plant Extraction kit (Qiagen Inc., Valencia, CA, United States), each marker of COI, rbcL, SSU, and LSU was amplified with multiple primers ( Table 1). PCR reactions and conditions also followed Zou et al. (2016a,b), with different annealing temperatures ( Table 1). A 1.5% agarose gel was used to confirm PCR products producing a single band, and the products were sent to the Beijing Genomics Institute (BGI) for bidirectional sequencing. A set of publicly available sequences of diatom for each gene marker downloaded from GenBank was added to the new sequences produced in this study to be analyzed together. MAFFT (Katoh et al., 2009) was employed for alignment and trimming. The sequences of the four genetic markers were also joined together as an integrated target (COI + rbcL + SSU + LSU) for barcoding analysis.

Barcoding Assignments
The phylogenetic-distance-and character-based barcoding analyses were conducted for each of the four genetic markers and the combined fragment (COI + rbcL + SSU + LSU). Neighbor-joining (NJ), Bayesian, and maximum-likelihood (ML) were employed for phylogenetic barcoding analysis, where NJ trees were constructed based on the Kimura two-parameter (K2P) distance model (Hebert et al., 2003a) with MEGA (Tamura et al., 2011); Bayesian analyses were performed with MrBayes 3.1.2 (Ronquist and Huelsenbeck, 2003); and ML searches were performed with PhyML 3.0 (Guindon et al., 2010). jModeltest v.0.1.1 (Posada, 2008) was used to estimate the most appropriate models for both Bayesian and ML tree construction. The most appropriate models for rbcL, COI, LSU, and SSU were GTR + G, TVMef + I + G, GTR + G, and GTR + G, respectively. The distance-based barcoding analysis was performed for each of the four markers in MEGA (Tamura et al., 2011), where the intraspecific and interspecific distances were analyzed. The character-based analysis was performed for each of the four genetic markers and the combined target in Characteristic Attribute Organization System (CAOS) and CAOS-Analyzer . The datasets in NEXUS files and their DNA data matrices were produced in MacClade v4.0659 (Mindell, 1994), which were carried out in the CAOS system to get the characteristic attributes at the nucleotide positions (Bergmann et al., 2009).

RESULTS
A total of 93 rbcL, 81 COI, 83 SSU, and 75 LSU sequences of diatom samples were obtained in this study (Supplementary Table 1 Table 1).
Generally, the identification of species for each strain by micrographic observations was consistent with the identification by DNA barcoding of all the four gene loci, except that some potential cryptic species were found within some species. We also found some misidentifications of diatom sequences from public databases. The detailed barcoding results for each gene locus are shown below individually, where the names of species for our newly-obtained sequences in the phylogenetic trees were based on micrographics observations. Some potential cryptic species revealed in the phylogenetic trees are indicated as species names (I, II, III. . . ). For the sequences downloaded from NCBI, their GenBank numbers are shown beside the name of a strain. The species for character-based analysis were from the phylogenetic trees for each gene locus, where the cryptic species were included.
A total of 11 species, 10 genera, 10 families, seven orders, and three classes were recovered from all the samples collected from each location (Supplementary Table 2). It was indicated that the diversity of species was high in Lianyungang, Jiangsu (Supplementary Figure 2).

rbcL Barcoding Assignments
The phylogenetic analysis of rbcL recovered a generally clear assignment resolution within Bacillariophyta (Figure 1). At the species level, most species analyzed were distinguished as separate clades. The rbcL sequences of the 11 species were newly obtained in this study, including Asteroplanus karianus, Cerataulina pelagica, Chaetoceros muellerii, Cyclotella sp., Entomoneis sp., Licmophora paradoxa, Melosira varians, Navicula bottnica, Phaeodactylum tricornutum, Skeletonema costatum, and Thalassiosira gravida. The strains within C. muellerii from different sea areas clustered together as one clade (Figure 1). For species whose data were downloaded from GenBank, most of them could be assigned as monophyletic clades, but some of them clustered together as one group (e.g., A. karianus, Asterionellopsis glacialis, and Asterionellopsis socialis). Sequences of L. paradoxa from this study clustered together with that from published papers. Sequences of Navicula ramosissima from this study were also separated clearly from a published sequence. Additionally, T. rotula, T. gravida, and Thalassiosira delicata, including samples from this study and GenBank, were FIGURE 1 | Bayesian phylogenetic trees for the rbcL and COI genes. The NJ, Bayesian, and Maximum Likelihood bootstraps are indicated for species clades recovered, where the order is NJ/Bayesian/ML. Some potential cryptic species revealed are indicated by species name (I, II, III…). For sequences downloaded from NCBI, their GenBank numbers are shown besides the name of strain.
Frontiers in Marine Science | www.frontiersin.org closely related in the phylogenetic trees. At the genus level, all the genera that were analyzed clustered as monophyletic clades, except for Licmophora, Thalassiosira, and Asteroplanus, which gathered as paraphyletic clades, (Figure 1).
The intraspecific and interspecific distances were calculated separately (Figure 2). Most interspecific distances were higher than 0.02. However, no apparent barcoding gap existed between the intraspecific and interspecific distances, and several species within certain genera were separated by interspecific distances lower than 0.02, such as Skeletonema and Thalassiosira. On the other hand, most of the species had intraspecific distances lower than 0.02, as expected.
The character analysis showed general consistent taxonomic assignments with the phylogenetic-based identification ( Table 2). Species that were clearly assigned as monophyletic clades in the phylogenetic trees were also separated with more than three characters attributes (CAs), such as M. varians, Melosira nummuloides, Licmophora normanina, and Entomonesis ornata. L. paradoxa and Navicula ramosissma, which were divided into two clades in the phylogenetic trees, were also separated as two clades, which showed more than three CAs. For species that could not be distinguished by tree-based barcoding, the character analysis also shows the same CAs for them, e.g., A. karianus, A. glacialis, and A. socialis.

COI Barcoding Assignments
Most species included in COI were separated clearly in the NJ, ML, and Bayesian trees, and all formed monophyletic clades with high support, including species assigned from this study, such as C. muellerii, Cyclotella meneghiniana, and S. costatum (Figure 1). These species were also discriminated by more than three CAs from positions 9 to 426 of the COI fragments (Table 2). However, several species were divided into separate clades that could be cryptic species, e.g., M. varians and P. tricornutum. These potential cryptic species were also shown as separate clades in the character analysis where they were distinguished by more than three characters ( Table 2). For example, we identified all strains of P. tricornutum I, II, III in Figure 1 as P. tricornutum by micrographic observation. However, all their sequences were assigned to separate clades, which did not cluster with any other species. Thus, we consider the separated clades as cryptic P. tricornutum species that need to be noticed and confirmed in future studies related to species identification. It was also shown that the separated clades of P. tricornutum were from  Figure 3 (LSU and SSU) by Characteristic Attributes Organization System (CAOS) analysis.

Chaetoceros gracilis T T A G T A
Frontiers in Marine Science | www.frontiersin.org  A  T  T  T  C A  T  A  T  A  T  A  C/T  T  T  C  A  A  T  T A T

Nucleotide numbers cover 15 selected positions from 3 to 636 on the rbcL sequences. Nucleotide numbers cover 22 selected positions from 9 to 426 on the COI sequences.
different sea areas, e.g., the strains of P. tricornutum II were from Zhoushan, Zhejiang, and the strains of P. tricornutum III were from Lianyungang, Jiangsu. At the genus level, for all the genera analyzed, Chaetoceros, Cyclotella, and Skeletonema were assigned as paraphyletic clades (Figure 1). Compared with rbcL, the COI marker also produced higher distances for both intraspecific and interspecific comparisons, and no gap appeared between the intraspecific and interspecific distances (Figure 2). Almost all the interspecific distances were higher than the threshold of 0.02, except for T. rotula and A. karianus, which had an interspecific distance of 0.0189. For the intraspecific distance, three species (M. varians, L. paradoxa, and Entomoneis sp) had values higher than 0.02, and all the rest had values lower than 0.02.

LSU Barcoding Assignments
At the species level, while most species clustered as monophyletic clades, several species were divided into separate groups (e.g., Thalassiosia rotula) and available GenBank sequences clustered as one group (e.g., T. gravida, T. delicate) (Figure 3). These phylogenetic assignments were consistent with the character analysis, where C. pelagica and Entomoneis. sp. and T. rotula were also separated as different clades with more than three CAs, and T. gravida, T. delicate, and T. punctigera showed the same CAs from positions 35 to 532 of the fragment (Table 3). At the genus level, almost all the genera clustered as monophyletic clades except for the cryptic species in Thalassiosira (Figure 3).
For LSU, most species (96%) had interspecific distances above 0.02 (Figure 2). Of the 15 species, 6 had intraspecific distances higher than 0.02, and 9 had intraspecific distances lower than 0.02. Thus, there was an overlap between the intraspecific and interspecific distances.

SSU Barcoding Assignments
In comparison with rbcL, COI, and LSU, SSU produced less resolved tree topologies (Figure 3), where some species could not be separated clearly as phylogenetic clades (e.g., Chaetoceros gracilis and C. muellerii). However, C. gracilis and C. muellerii, and C. cryptica and C. cryptica were clearly discriminated by more than three CAs (Table 3). Some species that were divided into several separate clades in the phylogenetic trees also differed from each other by more than three CAs, such as Melosira vaians, Cyclotella gamma, and L. paradoxa (Table 3). At the genus level, many of the genera analyzed were assigned as paraphyletic clades.
A portion of (97%) the species had interspecific distances above 0.02 (Figure 2). However, some species that could not be separated by the phylogenetic trees also had interspecific distances lower than 0.02. Thus, it is clear that there is much overlap between the intraspecific and interspecific distances.

Combined Barcoding Assignments
The phylogenetic and distance-based barcoding of the combination of rbcL, COI, LSU, and SSU was also conducted (Figure 4) for further verification. The samples that had all sequences from the four genes were collected for the combined analysis. The distance-based method was not used, because the number of samples analyzed is limited. It was indicated that the phylogenetic tree of the combined sequences showed a clear topological structure. The species analyzed were separated as monophyletic clades with higher support.

DISCUSSION
Although diatom species are distributed globally and play an important role in aquatic ecology (Zalack et al., 2010), many remain undiscovered or unassigned yet (Smetacek,   Melosira varians-I Frontiers in Marine Science | www.frontiersin.org

Species Positions
Entomoneis ornata 1999). The diatom diversity needs to be investigated globally, especially for courtiers that have large areas of water. DNA barcoding has provided a convenient tool for species identification (Hebert et al., 2003a,b;Zou et al., 2016a,b). Here, we employed four genetic markers for assigning diatoms from China with phylogenetic, distance, and character-based methods. The identification of species for each strain by micrographic observations was generally consistent with the identification through phylogenetic-based trees by DNA barcoding. For phylogenetic-based barcoding, rbcL, COI, and LSU were able to discriminate most of the species clearly within Bacillariophyta. At the species level, both rbcL and COI phylogenetic barcoding analyses showed better resolution in discriminating all the species. Nevertheless, some available sequences from NCBI could not be separated in the rbcL and COI phylogenetic trees, which suggests that some of the sequences submitted to NCBI are possibly misidentified. Additionally, all the four genetic markers assigned some species as cryptic, which were divided into several monophyletic clades in the phylogenetic trees. The character barcoding analysis and phylogenetic barcoding analysis obtained consistent species identification accordingly. All the species identified as clearly monophyletic clades in the phylogenetic trees were also assigned as separate clades by character analysis with more than three CAs. The potential cryptic species revealed by the phylogenetic analysis were also divided into separate clades in the character analysis with more than three CAs. All the cryptic species need to be noted in future studies. While barcoding analytical methods are argued, our study suggests that the combination of phylogenetic and character analyses gives more accurate species identification results.
All the results provide us with the understanding that different barcoding genetic markers give different identification resolutions for diatoms at both high and low taxonomic levels. By comparison, rbcL, LSU, and COI proved more effective in barcoding diatoms, which is partly consistent with the previous results that rbcL should be used as the primary marker for diatom barcoding (Hamsher et al., 2011;MacGillivary and  Kaczmarska, 2011). For example, MacGillivary and Kaczmarska (2011) suggested that a small rbcL fragment could be used for a dual-locus barcode with the more variable 5.8S + ITS-2 to discriminate diatom species, and Guo et al. (2015) showed that rbcL performed well in clustering some lower taxa. In Guo et al. (2015), it was also demonstrated that genetic loci had different assignment efficiency for different genera. For example, the COI region could just discriminate some genera within Bacillariophyceae, and ITS was a potential marker for barcoding some genera of Thalassiosirales (Cyclotella, Skeletonema, and Stephanodiscus). In our study, it was also indicated that different genetic loci had different identification efficiency at the genus level. Generally, LSU performed well in barcoding most of the genera within Bacillariophyta, but rbcL, COI, and SSU could not assign some of the genera as monophyletic clades, e.g., the Licmophora in rbcL phylogenetic analysis and Cyclotella in COI phylogenetic analysis. On the other hand, rbcL, COI, and SSU performed well in barcoding diatoms at the species level. Thus, we suggest the combination of rbcL, COI, LSU, and SSU for DNA barcoding the 11 genera of diatoms, since they are easily amplified by PCR and have enough variation for identifying different genera. The efficiency of barcoding entire Bacillariophyta should be tested by employing more species belonging to more different genera. We also merged the four genetic markers to conduct the phylogenetic and character analysis to verify the identification of species. The NJ, ML, and Bayesian trees of the merged sequence assigned all the species as clear monophyletic clades. The clear topology from the combined data was possible because the samples analyzed were limited. But the analysis from the combined data was generally consistent with that from the single gen. For the distance-based approach, the genetic distance of 0.02 between interspecific and intraspecific comparisons is proposed as a criterion for barcoding (Hebert et al., 2003a,b), which means that the intraspecific distance should be lower than 0.02 and the interspecific distance should be higher than 0.02. However, for all the genetic markers, some interspecific distances were lower than 0.02 and some intraspecific distances were higher than 0.02, without an obvious distance gap between the interspecific and intraspecific distances. This suggests that the distance criterion of 0.02 cannot always discriminate the species of diatoms. Thus, our study provides information that the phylogenetic and character-based methods are more effective for barcoding diatoms. In future studies, we can try to use other distance-based tools for barcoding diatoms, such as ABGD or Spider (Boyer et al., 2012;Puillander et al., 2012). However, in our previous studies, it was also indicated that the phylogenetic and character-based barcoding methods showed more advantages than the ABGD method for barcoding Chlorophyta (Zou et al., 2016a). Thus, in our opinion, we recommend the phylogenetic and character-based barcoding approaches for barcoding microalgae.
Here, we perform a comprehensive diversity investigation of diatoms from China, which will greatly contribute to the classification of diatoms. Most of the samples were collected from sea areas of the Yellow Sea and East China sea where algae bloom often occurs. The rest of the samples were collected from typical freshwater lakes in China. Therefore, the samples studied could represent the diverse diatom in China. Compared with previous studies that just used limit genetic markers or analytical methods Kaczmarska, 2009, 2010;Hamsher et al., 2011), we discriminated most diatom species clearly and revealed some cryptic species. For some strains from different habits (e.g., different marine sea areas and lakes) within one species, there was not much difference in their identification, such as C. muellerii and T. rotula, but for P. tricornutum, the strains from different sea areas were revealed as cryptic species. These suggest that the external habits possibly also contribute to the species diversity of diatoms. However, our study focused on accurate species identification and the complementary of diatom sequences to reference databases. The amount of the samples studied was not substantial to conduct a comprehensive diatom diversity investigation in China. In future studies, we will employ metabarcoding to monitor diatom diversity by Next Generation Sequence with a large amount of sequences. The available diatom sequences in public databases were also incorporated into our newly obtained sequences, the comprehensive analysis of which showed some possible identification errors of public diatom sequences. In conclusion, our study reports the accurate identification of diatoms from China comprehensively by DNA barcoding, which is important for well-understanding algae blooms and aquatic ecology.
Finally, with the development of Next Generation Sequencing, metabarcoding is becoming more efficient for species assignment with markers such as 16S, COI, and 18S, etc (Gogarten et al., 2020). However, metabarcoding often has a bias in accurate species identification for a large amount of reads because of incomprehensible reference sequences (Rachel et al., 2019;Gogarten et al., 2020). It is important to complement the reference sequences in public databases with more gene sequences of more species. In our study, the new sequences of multiple markers from a large number of samples provide much assistance for metabarcoding diatoms.

AUTHOR CONTRIBUTIONS
SZ designed the experiment, analyzed the data, and wrote the manuscript. YB and XW conducted the experiment. CW helped to revised the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
The financial support from the China Postdoctoral Science Foundation (2014M561661 and 2015T80558) and the Fundamental Research Funds for the Central Universities (KJQN201742 and Y0201600141) was gratefully acknowledged.