First Genome of the Brown Alga Undaria pinnatifida: Chromosome-Level Assembly Using PacBio and Hi-C Technologies

INTRODUCTION
The brown alga Undaria pinnatifida (Harvey) Suringar is an economically important kelp species native to the Northwest Pacific and has been extensively farmed as human food in East Asia for more than half a century (Yamanaka and Akiyama, 1993). It is also an important source of biologically active compounds such as fucoidans, which have diverse applications in the pharmaceutical and cosmetic industries (Zhao et al., 2018; Yoo et al., 2019). Its annual worldwide yield has exceeded two million tons since 2012 (http://www.fao.org/fishery/species/2777/en). U. pinnatifida has become a cosmopolitan species owing to its worldwide spread in recent decades, attracting increasing public attention (South et al., 2017). It has been listed as one of the world's 100 worst invasive species (Lowe et al., 2000), and in Europe it is regarded as one of the top 10 worst invasive species (Gallardo, 2014).
As a member of Laminariales, U. pinnatifida has a life history involving the alternation between two heteromorphic stages, namely the macroscopic sporophyte and the microscopic gametophyte. The haploid gametophyte was preliminarily determined to possess 30 chromosomes (Yabu et al., 1988). Sexual reproduction occurs in the gametophytic phase, in which the eggs discharged by female gametophytes are fertilized by sperm released by male gametophytes. In addition to this major reproductive pattern, parthenogenesis and apogamy have long been known to be important components of the life history (Fang and Dai, 1959; Fang et al., 1979; Nakahara, 1984; Shan et al., 2013). Recently, an unusual monoecious phenomenon has been observed in zoospore-derived gametophytes, which are able to form oogonia and antheridia simultaneously and give rise to sporophytes by selfing (Li et al., 2014; Li et al., 2017). These sporophytes can mature and release zoospores. All of these spores grow into male gametophytes at first, and monoecious phenomena are observed in some of them under developmental conditions. This finding makes the life cycle of U. pinnatifida more complicated than traditionally thought. On the one hand, novel breeding methods have been developed based on these findings (Shan et al., 2013; Li et al., 2017); on the other hand, the versatile reproductive modes are suggested to be beneficial for its worldwide spread. However, the molecular mechanisms underlying the various means of reproduction remain unknown. The lack of genomic sequence information hinders such fundamental study in U. pinnatifida. Herein, we report for the first time the complete genome of a male gametophyte of U. pinnatifida at the chromosomal level.

Value of the Data
The genomic sequence data can be used for genetic breeding applications and for elucidating the sex-determination and invasion mechanisms of U. pinnatifida. It is the first reference genome for the family Alariaceae and can be used in comparative genomics and evolutionary studies of Laminariales (kelp) species.

Sample Collection and DNA Extraction
One male gametophyte clone (designated as M23) of U. pinnatifida, which was established from a single zoospore originating from a cultivated mature sporophyte (Pang et al., 2008) in Dalian, China, was used for genome sequencing (Figure 1A). Genomic DNA was extracted using the cetyl trimethyl ammonium bromide (CTAB) method according to Shan and Pang (2009). The DNA quantity and quality were assessed with Qubit 3.0 (Thermo Fisher Scientific Inc., Carlsbad, CA, USA) and agarose gel electrophoresis, respectively.

Genome Survey
A survey of the genome was conducted using Illumina sequencing. A short-insert DNA library (~280 bp) was constructed and sequenced on the Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA). After discarding low-quality and redundant reads, we obtained 24.3 Gb of high-quality paired-end (150 bp) clean reads (Table 1). These reads were used to analyze the distribution of k-mer (k = 19) frequency (Marçais and Kingsford, 2011). The 19-mer peak was at a depth of 34X, and the genome size was predicted to be ~539 Mb. The heterozygosity and the percentage of repeated sequences were estimated to be 0.48% and 34.2%, respectively. A pilot assembly of the clean reads resulted in a genome of 551 Mb, similar to the size estimated by the k-mer method.
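The k-mer estimate above rests on a simple relation: genome size is approximately the total number of k-mer occurrences divided by the depth of the main peak. The following is a minimal sketch of that relation, not the authors' pipeline. The histogram is an invented toy example (the real 19-mer histogram is not reported here), and the `min_depth` cutoff for error k-mers is a hypothetical parameter.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size (bp) from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth (a hypothetical cutoff) are treated as
    sequencing errors and ignored.
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after filtering")
    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences divided by the peak depth approximates
    # the number of k-mer positions in the genome.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (invented): error k-mers at depth 1-2, a main peak at 34,
# and a small repeat peak at twice that depth.
toy_histogram = {1: 900_000, 2: 50_000, 33: 200, 34: 400, 35: 200, 68: 50}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In practice the error cutoff and the treatment of heterozygous and repeat peaks affect the estimate, so results from a sketch like this should be read as approximate.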

Long-Read Genome Sequencing With PacBio Technology
Genomic DNA was mechanically fragmented into sizes of ~20 kb using a Covaris g-tube (Covaris Inc., Woburn, MA, USA). The Pacific Biosciences single-molecule real-time (SMRT) Bell™ sequencing library was constructed using a Template Prep Kit (PacBio, Menlo Park, CA, USA). After DNA damage repair and end repair, SMRT adaptors were ligated to generate SMRT Bell™ templates. The Blue Pippin (Sage Science Inc., Beverly, MA, USA) was used for size selection of the fragments (> 15 kb). After a second round of DNA end repair, the SMRT Bell™ templates were purified for final sequencing with the PacBio Sequel system (PacBio, Menlo Park, CA, USA). Ten SMRT cells were used to obtain a total of 62.6 Gb (~120X) of raw polymerase reads.
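The stated depth of ~120X can be checked with a one-line calculation: total sequenced bases divided by genome size. The sketch below assumes the two genome sizes reported in this article (the ~539 Mb k-mer estimate and the 511.0 Mb final draft assembly) as denominators. Which of the two the authors used is not stated, so both are shown.

```python
def sequencing_depth(total_bases_gb, genome_size_mb):
    """Return fold coverage given yield in Gb and genome size in Mb."""
    return total_bases_gb * 1000 / genome_size_mb


# 62.6 Gb of raw polymerase reads against two candidate genome sizes.
depth_vs_kmer_estimate = sequencing_depth(62.6, 539.0)
depth_vs_final_assembly = sequencing_depth(62.6, 511.0)
print(f"{depth_vs_kmer_estimate:.1f}X vs k-mer estimate")
print(f"{depth_vs_final_assembly:.1f}X vs final assembly")
```

Both values fall close to the rounded ~120X figure.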

De Novo Genome Assembly
After removal of short and low-quality reads and the adaptor sequences, the raw polymerase reads were converted to 62.3 Gb of subread data, with an N50 length of 11,463 bp. Preliminary assembly was conducted using Falcon v1.2.4 (Chin et al., 2016). All the clean sequencing data were aligned to the assembled contigs with BLASR (Chaisson and Tesler, 2012), and errors in the contigs were corrected using Arrow (SMRT Link v6). The Illumina sequencing data were aligned to the contigs using BWA v0.7.15 (Li and Durbin, 2009) for further correction with Pilon v1.22 (Walker et al., 2014). A draft genome of 616.6 Mb consisting of 807 contigs was obtained, with an N50 length of 1.8 Mb. The gametophytes of kelp species are known to contain symbiotic bacteria, and thus a filtration procedure was conducted to remove potential bacterial contamination (Ye et al., 2015). The contigs were searched against the non-redundant nucleotide (NT) database of the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/) with BLASTN, and those with best-hit matches to bacteria were discarded from the genome. The final draft genome was 511.0 Mb and consisted of 515 contigs with an N50 length of 1.71 Mb (Table 1).
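The N50 values quoted for the subreads and contigs follow the standard definition: the length L such that sequences of length L or longer contain at least half of the total bases. The following is a minimal sketch of that definition. The contig lengths are invented toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running sum reaches half of the total; the length of the sequence
    that crosses that threshold is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty collection")


# Toy contig lengths (invented): total 24, so the half-way point is 12.
toy_contigs = [8, 5, 4, 3, 2, 2]
print(n50(toy_contigs))
```

Because N50 depends on the whole length distribution, removing contigs (as in the bacterial filtration step) can shift it in either direction.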

High-Throughput Chromosome Conformation Capture (Hi-C) Library Construction and Chromosome-Level Assembly
Gametophytic cells were fixed with formaldehyde (1.44%) and lysed with a tissue lysis solution (40 mM CaCl2, 1 mg mL^-1 collagenase). The cross-linked DNA was digested with the restriction enzyme DpnII. Biotinylated residues were added during repair of the sticky ends, and the resulting blunt-end fragments were ligated under dilute conditions (Lieberman-Aiden et al., 2009). The DNA was extracted and randomly sheared into fragments of 250-500 bp. The biotin-labeled fragments were isolated with magnetic beads, and end repair, dA tailing, adaptor ligation, PCR amplification, and purification were conducted for final construction of the Hi-C library. The DNA quantity was preliminarily estimated with Qubit 3.0, and the insert size was checked with an Agilent 2100 (Agilent Technologies, Santa Clara, CA, USA). The library concentration was accurately quantified by quantitative PCR. The qualified library was sequenced to produce 150 bp paired-end reads on the Illumina NovaSeq 6000 platform. A clean dataset of 57.1 Gb in total was obtained, containing 190,379,612 reads (Table 1).
The Hi-C sequence data were aligned against the draft genome using JUICER v1.6.2 (Durand et al., 2016). In total, 161,141,067 reads (84.6%) were mapped to the genome, and 110,009,243 reads (57.8% of all reads) were uniquely mapped. The uniquely mapped sequences were analyzed with the 3D-DNA software to assist genomic assembly (Dudchenko et al., 2017). The algorithms "misjoin" and "scaffolding" were used to remove misjoins and obtain scaffolds at the chromosomal level. The algorithm "seal" was employed to recover scaffolds that had been incorrectly removed by "misjoin". A heatmap of chromosome interactions was constructed to visualize the contact intensity among chromosomes using JUICER v1.6.2 (Figure 1B). As a result, 114 scaffolds were assembled with an N50 length of 16.5 Mb. Finally, a total of 502.8 Mb of genomic sequence was located on 30 chromosomes, accounting for 98.4% of the whole assembled length (Table 1).
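The percentages in this paragraph can be reproduced from the reported counts. The short check below uses only numbers stated in this article. It takes the total Hi-C read count as the denominator for both mapping rates and the 511.0 Mb final draft assembly as the denominator for the anchored fraction. Both choices are assumptions inferred from the arithmetic.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_hic_reads = 190_379_612    # clean Hi-C reads
mapped_reads = 161_141_067       # reads mapped to the draft genome
unique_reads = 110_009_243       # uniquely mapped reads

mapped_pct = percent(mapped_reads, total_hic_reads)
unique_pct = percent(unique_reads, total_hic_reads)
# Sequence anchored on 30 chromosomes relative to the final draft assembly.
anchored_pct = percent(502.8, 511.0)

print(mapped_pct, unique_pct, anchored_pct)
```

The unique-mapping rate matches the reported value only when computed against all reads rather than against mapped reads.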
Functional annotation of the predicted protein-coding genes was conducted by aligning them against the non-redundant protein (NR), SwissProt, evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG) (Huerta-Cepas et al., 2015), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases using BLASTX with an E-value cutoff of 10^-5. Annotation against the Gene Ontology (GO) database was performed using the Blast2GO software. In total, 12,402 genes (87.4%) were annotated in at least one database (Figure 1D).

Completeness and Accuracy of the Assembly
The previously obtained full-length transcripts were aligned against the assembled genome with GMAP, and 87.7% of the transcripts could be mapped.
The paired-end reads obtained in the genome survey were also aligned against the assembled genome with BWA, and 92.2% of them were successfully mapped. Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 (Simão et al., 2015), in combination with TBLASTN, Augustus, and HMMER v3.1b2 (Finn et al., 2011), was used to evaluate the completeness of the assembled genome based on the eukaryota_odb9 database. The percentage of complete BUSCOs identified was 82.9% at the protein level, with fragmented and missing BUSCOs accounting for 8.9% and 8.2%, respectively.

DATA AVAILABILITY STATEMENT
The raw genome sequencing data have been deposited in the NCBI SRA database under the BioProject accession number PRJNA575605. Genome assembly and annotation data have been deposited at Figshare (https://figshare.com/s/94aebbd77f374b9c6faf). The raw SMRT sequencing data for full-length transcriptome analysis are available in the NCBI SRA database under accession numbers SRR8083207, SRR8083208, and SRR8083209.

AUTHOR CONTRIBUTIONS
SP and TS conceived the study. TS cultured and maintained the gametophyte samples. XL, YZ, and HG cultured the sporophytes. TS, JY, LS, and JL extracted the DNA and performed genome assembly and data analysis. TS, JY, and SP wrote the manuscript.