First Draft Genome Assembly of the Seaweed Sargassum fusiforme

Wang, Shengqin; Lin, Lidong; Shi, Yijian; Qian, Weiguo; Li, Nan; Yan, Xiufeng; Zou, Huixi; Wu, Mingjiang

doi:10.3389/fgene.2020.590065

DATA REPORT article

Front. Genet., 23 October 2020

Sec. Genomics of Plants and Plant-Associated Organisms

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.590065

First Draft Genome Assembly of the Seaweed Sargassum fusiforme

SW
Shengqin Wang ^1,2^†
LL
Lidong Lin ^2,3^†
YS
Yijian Shi ⁴
WQ
Weiguo Qian ²
NL
Nan Li ^1,2
XY
Xiufeng Yan ^1,2
HZ
Huixi Zou ^1,2^*
MW
Mingjiang Wu ^1,2^*

1. National and Local Joint Engineering Research Center of Ecological Treatment Technology for Urban Water Pollution, Wenzhou University, Wenzhou, China
2. College of Life and Environmental Science, Wenzhou University, Wenzhou, China
3. Dongtou Fisheries Science and Technology Research Institute, Wenzhou, China
4. College of Electrical and Electronic Engineering, Wenzhou University, Wenzhou, China

Article metrics

View details

Citations

7,2k

Views

1,9k

Downloads

Introduction

Sargassum fusiforme is a major edible brown macroalga grown in lower intertidal zones along coastlines. The macroalga is cultivated in Japan, Korea, and China (Ma et al., 2017). S. fusiforme is a source of bioactive compounds, such as polysaccharides, which have various functions, including antioxidants, anti-aging, memory improvement and immune regulation (Chen et al., 2016; Bogie et al., 2019; Zhang et al., 2020). It is also an essential herb used to disperse phlegm in traditional Chinese medicine (Li and Wei, 2002). However, genomic and genetic resources for this economically important seaweed are limited. This limitation could be attributed to, in part, the high levels of polysaccharides in the brown algae, which complicate DNA extraction. Recent research has shown that S. fusiforme contains potentially toxic quantities of inorganic arsenic, although the levels differ significantly between different growing areas (Yokoi and Konomi, 2012). Notably, a study on the molecular genetics and genomics of S. fusiforme could give a clue on strategies for reducing the levels of inorganic arsenic.

The genus Sargassum (Fucales, Phaeophyceae) consists of over 300 species of multicellular organisms, which are widely distributed in marine habitats. Many of these species are essential in the ecosystem and are economically important marine crops. For example, Sargassum horneri is the dominant golden-tide seaweed in the ocean (Liu et al., 2018). Despite its wide distribution, there are no available complete or draft genomes for any of the Sargassum genus microalgae to date. Besides, the entire Phaeophyceae class only has six decoded complete or draft genome sequences of brown algae species. These species include Ectocarpus siliculosus (Cock et al., 2010), Saccharina japonica (Ye et al., 2015), Macrocystis pyrifera (https://www.ncbi.nlm.nih.gov/genome/?term=Macrocystis+pyrifera), Nemacystus decipiens (Nishitsuji et al., 2019), Cladosiphon okamuranus (Nishitsuji et al., 2016), and Undaria pinnatifida (Shan et al., 2020). The availability of genomic resources for these essential algae will assist future selective breeding programs and fundamental genomic and evolutionary studies, among other roles.

The complete chloroplast and mitochondrial genome sequences of S. fusiforme have been decoded and phylogenetically analyzed (Liu et al., 2016; Yonghua et al., 2019). Recently, genome assembly using a sequencing technology that combines PacBio and Ilumina reads has shown high performance. In this study, de novo whole-genome sequencing in a strain of S. fusiforme was performed using PacBio and Illumina reads. The PacBio long reads were assembled using wtdbg2, and the Illumina short reads were used to correct the primary assembly by Pillon (Walker et al., 2014; Ruan and Li, 2020). To our knowledge, this is the first reported genome assembly in a member of the Fucales order.

Value of the Data

The genomic sequence data generated in the present study provides essential information for further S. fusiforme studies, such as functional genomic analysis and genome-assisted breeding. Since these results offer the first reference genome of the order Fucales, they can be used in evolutionary studies in the Chromista kingdom. Further, the draft genome can be used as a genomic reference for the discovery of functional products and other genome mining applications.

Materials and Methods

Materials and DNA Extraction

The S. fusiforme strain used for genome sequencing in this study was collected from the Wenzhou Dongtou District, Zhejiang Province of China. Dozens of picked vesicles were cleaned in ddH₂O, then frozen and stored in liquid nitrogen for genomic DNA extraction. The genomic DNA was extracted by the cesium chloride density gradient centrifugation method (Jagielski et al., 2017). DNA quantitation and quality checks were performed using a NanoDrop 2000 microspectrophotometer, a Qubit fluorometer, and 1% agarose gel electrophoresis (Novogene Corporation, Beijing, China).

Illumina Sequencing and Genome Size Estimation

The isolated genomic DNA was subjected to short-read library preparation with stranded Illumina paired-end protocols (Illumina Inc., San Diego, CA, USA). Subsequently, libraries with insert lengths of 500 bp were constructed and sequenced by the Illumina NovaSeq platform. The raw reads were processed with the fastp program with the following parameters: -g -q 5 -u 50 -n 15 -l 150 (Chen et al., 2018), to obtain more than 197.5 million paired-end clean reads. The kmer frequencies (k = 21) were calculated using Jellyfish v2.2.10 (Marçais and Kingsford, 2011). The kmer-based statistical method from Genomescope (Vurture et al., 2017), predicted the S. fusiforme haploid genome length to be ~ 450 MB, with heterozygosity of ~1.0%. Notably, this estimated size is larger than the genome size of the Sargassum genus (196–319 Mb) previously estimated in three other species by static microspectrophotometry (Phillips et al., 2011).

PacBio Sequencing and Genome Assembly

For long-read sequencing, the isolated genomic DNA was subjected to a large insert library preparation using the PacBio Sequel sequencing platform (PacBio, Menlo Park, CA, USA). A total of ~16 GB of polymerase reads were obtained following the manufacturer's instructions. Long reads were assembled by wtdbg2 (Ruan and Li, 2020). Due to wtdbg2 constructed the consensus using a fuzzy Bruijn graph and assembled raw reads without error correction, the primary assembly was realigned against the long reads using Minimap2, then polished with Racon, which does three replications(Vaser et al., 2017; Li, 2018). Finally, the Illumina paired-end clean reads were mapped onto the assembled contigs using bowtie2 and further corrected by Pilon (Langmead and Salzberg, 2012; Walker et al., 2014). To remove potential contaminants, contigs with a biased GC content (>0.6 or <0.4) were aligned to the NCBI non-redundant nucleic acid database with an E-value 1e-5. Then, a total of 112 contigs with the top 10 no matches to eukaryotes were regarded as contaminant contigs and removed from the final assembly. At last, 6,750 contigs were retained as the final assembly, with a total length of ~394.4 MB, which has an N50 value of ~142.1 KB (Table 1). Therefore, the estimated coverage for PacBio data and Illumina data were ~42 x and 180 x, respectively.

Table 1

Species	*E. siliculosus*	*S. japonica*	*N. decipiens*	*C. okamuranus*	*U. pinnatifida*	*M. pyrifera*	*S. fusiforme*
Total genome size (Mbp)	197	537	154	130	502.8	409.1	394.4
Scaffold N50 length (Kbp)	6,528	252	1,863	418	16,510	–	–
Contig N50 length (Bp)	–	58,867	6,265	21,705	1,707,374	2,573	142,085
Number of scaffolds	30	13,327	685	541	114	–	–
Number of contigs	–	–	411,597	31,858	515	246,830	6,750
Length of longest contig (Kbp)	324	466	–	–	7,337	110	931
Length of shortest contig (Bp)	–	–	–	–	–	280	319
Length of longest scaffolds (Kbp)	18,946	1,469	–	–	32,303	–	–
Length of shortest scaffolds (Bp)	869,870	500	–	–	1,000	–	–
GC contents (%)	56	49.6	54	54	50.14	50	48.4
The identified complete BUSCOs (%)	–	–	–	–	82.9	–	90.4

Comparison of genome assemblies of seven brown algae.

Repeat Region Prediction

A de novo repeat library for S. fusiforme was conducted using RepeatModeler (Flynn et al., 2020), which employs three repeat-finding methods; RECON (Bao and Eddy, 2002), RepeatScout (Price et al., 2005), and TRF (Benson, 1999). The repeat library was then subjected to RepeatMasker to find and mask homologous repeats in the assembled genome using rmblast as the default search engine. Finally, the total length of repetitive sequences was ~239.4 Mb, accounting for ~60.7% of the draft genome size.

Gene Prediction and Annotation

The BRAKER1 pipeline was used to perform gene prediction by integrating ab initio gene prediction and RNA-seq based prediction, which combined the advantages of both GeneMark-ET and AUGUSTUS (Hoff et al., 2016). Raw RNA-seq data from three independent transcriptomic projects were used for gene prediction: ERR2041176, SRR12206544, and SRR5357673 (One Thousand Plant Transcriptomes Initiative, 2019). Transcript alignment was performed on the repeat masked assembly based on Tophat2, with default parameters (Kim et al., 2013). The aligned results, combined with transcript data, were used to generate initial gene structures using the GeneMark-ET tool (Lomsadze et al., 2014). The initial gene structures were used for training by AUGUSTUS to produce the final gene predictions (Stanke and Waack, 2003). Finally, 20,222 putative genes were detected. Of these putative genes, 84.45% genes had Nr homologs, 69.93% had Swiss-Prot homologs, 56.57% had Pfam homologs, 67.65% had gene ontology (GO) homologs, 51.11% had KOG homologs, and 42.60% had COG homologs (Figure 1A). In Nr homolog analysis (with an E-value of 0.0001), the best hit species with 15,284 genes belong to the Ectocarpus genus (Figure 1B). Here, the GO term distribution of the putative genes into the molecular function, cellular component, and biological process categories (Figure 1C) was determined by the Blast2GO suite with an E-value of 0.0001 and visualized with the WEGO 2.0 web tool at the macro-level (Conesa and Götz, 2008). COG and KOG analyses were performed with an E-value of 0.001 by the WebMGA server (Wu et al., 2011). KAAS web services were used to map the putative S. fusiforme genes onto the KEGG metabolic pathways along with genes in other plant species, including Amborella trichopoda, Chlamydomonas reinhardtii, Ostreococcus lucimarinus, Ostreococcus tauri, Micromonas commoda, Cyanidioschyzon merolae, Galdieria sulphuraria, and Arabidopsis thaliana. A total of 3,796 genes had homologs in the KEGG database, of which 1,140 were mapped onto 119 enzymes listed in the pathway categorized “Metabolism.”

Figure 1

Completeness and Accuracy of the Assembly

The predicted genes from the BRAKER1 pipeline were subject to BUSCO version 3.0.2 to evaluate the completeness of the assembled genome, based on the eukaryota_odb9 database. More than 90% complete BUSCOs were detected at the protein level, with the single-copy, duplicated, fragmented, and missing accounting for 82.8, 7.6, 6.3, and 3.3%, respectively. The accuracy of the genome assembly was evaluated by mapping the clean reads against the genome with bowtie2. Finally, more than 90.7% of reads were remapped.

Comparative Genomics

Ortholog analysis was conducted using OrthoMCL v2.0.9 based on protein datasets from the BRAKER1 pipeline and six other brown algae species: E. siliculosus, S. japonica, M. pyrifera, N. decipiens, C. okamuranus, and U. pinnatifida. Here, the protein sequences for S. japonica were determined by combining transcriptome data from the 1 KP (One Thousand Plant Transcriptomes Initiative, 2019), and the protein sequences for M. pyrifera were called using FRAGGENESCAN from pyrosequencing technology data (Konotchick et al., 2013). CD-HIT was used to remove redundant sequences (90% identity or more) in each organism. Then, non-redundant protein sequences were subjected to all-again-all Blastp with an E-value of 1e-5. Finally, a total of 1,4819 groups were identified by OrthoMCL and ortholog intersections across the seven brown algae, as shown in Figure 1D. Subsequently, 2,287 groups were identified as one to one orthologs from these species. These sequences were concatenated into a supergene and used for multiple sequence alignment by MAFFT. Then, a phylogenetic tree was built by PhyML (Figure 1E) using the maximum likelihood method.

Statements

Data availability statement

The data used to support the findings of this study are available from Short Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra/) under the accession number: PRJNA597239. The raw SMRT sequencing data is available in NCBI SRA database with accession number: SRR12124361. Genome assembly and annotation data has been deposited at Figshare repository (https://figshare.com/s/517f180deb56a5b31b14).

Author contributions

LL, XY, HZ, and MW conceived the project and designed the objectives. LL, WQ, and NL collected the specimens. WQ and NL perform DNA/RNA extraction and sequencing. SW and YS implemented genome assembly and data analysis. SW drafted the manuscript. All authors edited and contributed to the article.

Funding

This work was supported by the Special Project on Blue Granary Science and Technology Innovation under the National Key R&D Program (2018YFD0901501), the Special Science and Technology Innovation Project for Seeds and Seedlings of Wenzhou City (N20160016), the National Natural Science Foundation of China (41876197 and 31670402), and the Key Fishing and Agricultural Science and Technology Project in Dongtou District, Wenzhou (N2018Y03A).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1
BaoZ.EddyS. R. (2002). Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res.12, 1269–1276. 10.1101/gr.88502
2
BensonG. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res.27, 573–580. 10.1093/nar/27.2.573
3
BogieJ.HoeksC.SchepersM.TianeA.CuypersA.LeijtenF.et al. (2019). Dietary Sargassum fusiforme improves memory and reduces amyloid plaque load in an Alzheimer's disease mouse model. Sci. Rep.9:4908. 10.1038/s41598-019-41399-4
4
ChenP.HeD.ZhangY.YangS.ChenL.WangS.et al. (2016). Sargassum fusiforme polysaccharides activate antioxidant defense by promoting Nrf2-dependent cytoprotection and ameliorate stress insult during aging. Food Funct.7, 4576–4588. 10.1039/c6fo00628k
5
ChenS.ZhouY.ChenY.GuJ. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics34, i884–i890. 10.1093/bioinformatics/bty560
6
CockJ. M.SterckL.RouzéP.ScornetD.AllenA. E.AmoutziasG.et al. (2010). The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature465, 617–621. 10.1038/nature09016
7
ConesaA.GötzS. (2008). Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int. J. Plant Genomics2008:619832. 10.1155/2008/619832
8
FlynnJ. M.HubleyR.GoubertC.RosenJ.ClarkA. G.FeschotteC.et al. (2020). RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. U.S.A.117, 9451–9457. 10.1073/pnas.1921046117
9
HoffK. J.LangeS.LomsadzeA.BorodovskyM.StankeM. (2016). BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics32, 767–769. 10.1093/bioinformatics/btv661
10
JagielskiT.GaworJ.BakułaZ.ZuchniewiczK.ZakI.GromadkaR. (2017). An optimized method for high quality DNA extraction from microalga prototheca wickerhamii for genome sequencing. Plant Methods13, 77–78. 10.1186/s13007-017-0228-9
11
KimD.PerteaG.TrapnellC.PimentelH.KelleyR.SalzbergS. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol.14:R36. 10.1186/gb-2013-14-4-r36
12
KonotchickT.DupontC. L.ValasR. E.BadgerJ. H.AllenA. E. (2013). Transcriptomic analysis of metabolic function in the giant kelp, Macrocystis pyrifera, across depth and season. New Phytol.198, 398–407. 10.1111/nph.12160
13
LangmeadB.SalzbergS. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Meth.9, 357–359. 10.1038/nmeth.1923
14
LiH. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics34, 3094–3100. 10.1093/bioinformatics/bty191
15
LiX.WeiW. (2002). Chinese Materia Medica: Combinations & Applications. Donica Publishing Ltd.; Elsevier Health Sciences.
- Google Scholar
16
LiuF.LiuX.WangY.JinZ.MoejesF. W.SunS. (2018). Insights on the Sargassum horneri golden tides in the Yellow Sea inferred from morphological and molecular data. Limnol. Oceanograp.63, 1762–1773. 10.1002/lno.10806
- CrossRef
- Google Scholar
17
LiuF.PangS.LuoM. (2016). Complete mitochondrial genome of the brown alga Sargassum fusiforme (Sargassaceae, Phaeophyceae): genome architecture and taxonomic consideration. Mitochondrial DNA A DNA Mapp. Seq. Anal.27, 1158–1160. 10.3109/19401736.2014.936417
18
LomsadzeA.BurnsP. D.BorodovskyM. (2014). Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res.42, e119–e119. 10.1093/nar/gku557
19
MaZ.WuM.LinL.ThringR. W.YuH.ZhangX.et al. (2017). Allelopathic interactions between the macroalga Hizikia fusiformis (Harvey) and the harmful blooms-forming dinoflagellate Karenia mikimotoi. Harmful Algae65, 19–26. 10.1016/j.hal.2017.04.003
20
MarçaisG.KingsfordC. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics27, 764–770. 10.1093/bioinformatics/btr011
21
NishitsujiK.ArimotoA.HigaY.MekaruM.KawamitsuM.SatohN.et al. (2019). Draft genome of the brown alga, Nemacystus decipiens, Onna-1 strain: fusion of genes involved in the sulfated fucan biosynthesis pathway. Sci. Rep.9, 4607–4611. 10.1038/s41598-019-40955-2
22
NishitsujiK.ArimotoA.IwaiK.SudoY.HisataK.FujieM.et al. (2016). A draft genome of the brown alga, Cladosiphon okamuranus, S-strain: a platform for future studies of “mozuku” biology. DNA Res.23, 561–570. 10.1093/dnares/dsw039
23
One Thousand Plant Transcriptomes Initiative (2019). One thousand plant transcriptomes and the phylogenomics of green plants. Nature574, 679–685. 10.1038/s41586-019-1693-2
24
PhillipsN.KapraunD. F.Gómez GarretaA.Ribera SiguanM. A.RullJ. L.Salvador SolerN.et al. (2011). Estimates of nuclear DNA content in 98 species of brown algae (Phaeophyta). AoB Plants 2011:plr001. 10.1093/aobpla/plr001
25
PriceA. L.JonesN. C.PevznerP. A. (2005). De novo identification of repeat families in large genomes. Bioinformatics21(Suppl. 1), i351–i358. 10.1093/bioinformatics/bti1018
26
RuanJ.LiH. (2020). Fast and accurate long-read assembly with wtdbg2. Nat. Meth.17, 155–158. 10.1038/s41592-019-0669-3
27
ShanT.YuanJ.SuL.LiJ.LengX.ZhangY.et al. (2020). First genome of the brown alga undaria pinnatifida: chromosome-level assembly using PacBio and Hi-C technologies. Front. Genet.11:140. 10.3389/fgene.2020.00140
28
StankeM.WaackS. (2003). Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics19(Suppl. 2), ii215–ii225. 10.1093/bioinformatics/btg1080
29
VaserR.SovićI.NagarajanN.ŠikićM. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res.27, 737–746. 10.1101/gr.214270.116
30
VurtureG. W.SedlazeckF. J.NattestadM.UnderwoodC. J.FangH.GurtowskiJ.et al. (2017). GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics33, 2202–2204. 10.1093/bioinformatics/btx153
31
WalkerB. J.AbeelT.SheaT.PriestM.AbouellielA.SakthikumarS.et al. (2014). Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE9:e112963. 10.1371/journal.pone.0112963
32
WuS.ZhuZ.FuL.NiuB.LiW. (2011). WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics12, 444–449. 10.1186/1471-2164-12-444
33
YeN.ZhangX.MiaoM.FanX.ZhengY.XuD.et al. (2015). Saccharina genomes provide novel insight into kelp biology. Nat. Commun.6:6986. 10.1038/ncomms7986
34
YokoiK.KonomiA. (2012). Toxicity of so-called edible hijiki seaweed (Sargassum fusiforme) containing inorganic arsenic. Regul. Toxicol. Pharmacol.63, 291–297. 10.1016/j.yrtph.2012.04.006
35
YonghuaZ.ShengqinW.WeiguoQ.YijianS.NanL.XiufengY.et al. (2019). Characterization of the complete chloroplast genome of Sargassum fusiforme and its phylogenomic position within phaeophyceae. Mitochondrial DNA Part B4, 3258–3259. 10.1080/23802359.2019.1671246
- CrossRef
- Google Scholar
36
ZhangR.ZhangX.TangY.MaoJ. (2020). Composition, isolation, purification and biological activities of Sargassum fusiforme polysaccharides: A review. Carbohyd. Poly.228:115381. 10.1016/j.carbpol.2019.115381

Summary

Keywords

sargassum fusiforme, genetic breeding, brown algae, phaeophyceae, sargassaceae

Citation

Wang S, Lin L, Shi Y, Qian W, Li N, Yan X, Zou H and Wu M (2020) First Draft Genome Assembly of the Seaweed Sargassum fusiforme. Front. Genet. 11:590065. doi: 10.3389/fgene.2020.590065

Received

31 July 2020

Accepted

17 September 2020

Published

23 October 2020

Volume

11 - 2020

Edited by

Hongjian Wan, ZheJiang Academy of Agricultural Sciences, China

Reviewed by

Peng Xu, Guangzhou University, China; Lijing Bu, University of New Mexico, United States

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Huixi Zou zjuzhx@wzu.edu.cnMingjiang Wu wmj@wzu.edu.cn

This article was submitted to Plant Genomics, a section of the journal Frontiers in Genetics

†These authors have contributed equally to this work

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Genomics of Plants and Plant-Associated Organisms

DATA REPORT article

First Draft Genome Assembly of the Seaweed Sargassum fusiforme