Genome Sequencing of Fiber Flax Cultivar Atlant Using Oxford Nanopore and Illumina Platforms

Citation: Dmitriev AA, Pushkova EN, Novakovskiy RO, Beniaminov AD, Rozhmina TA, Zhuchenko AA, Bolsheva NL, Muravenko OV, Povkhova LV, Dvorianinova EM, Kezimana P, Snezhkina AV, Kudryavtseva AV, Krasnov GS and Melnikova NV (2021) Genome Sequencing of Fiber Flax Cultivar Atlant Using Oxford Nanopore and Illumina Platforms. Front. Genet. 11:590282. doi: 10.3389/fgene.2020.590282


INTRODUCTION
Flax (Linum usitatissimum L.) has been grown for seeds and fiber since ancient times (Vaisey-Genser and Morris, 2003). Fiber flax is taller than linseed and has branches only in the upper part of the stem. Linseed branches begin from the middle part of the stem, and these plants produce many large seeds (Diederichsen and Richards, 2003). Flax seeds are rich in omega-3 fatty acids and lignans, the health benefits of which have been proven in numerous studies (Caligiuri et al., 2014; Goyal et al., 2014; Kezimana et al., 2018; Parikh et al., 2019). Therefore, linseed is used in the food and pharmaceutical industries, animal feeds, and the production of eco-friendly paints and composites (Singh et al., 2011; Corino et al., 2014; Goyal et al., 2014; Campos et al., 2019; Fombuena et al., 2019). Flax fibers are hollow tubes that mainly consist of cellulose; they have high strength and durability, which allows their use in the production of high-quality textiles (Vaisey-Genser and Morris, 2003). Flax fiber has a high absorbent capacity owing to the wicking and movement of moisture along the surface, enabling its use in cloth for hot climates, sails, tents, and rugs (Atton, 1989). However, it is possible to obtain a long fiber only from a part of the flax stem with no branches; therefore, despite their high quality, linen fibers have to a large extent been displaced by synthetic fibers (Muir and Westcott, 2003). Nevertheless, awareness of ecological problems has attracted attention to the use of more sustainable materials, and interest in flax fibers is reviving. Additionally, in the last few years, flax fiber has been actively used as a component of composite materials with good potential for automotive, aerospace, and packaging applications in which high fiber length is not very important (Zhu et al., 2013; Mokhothu and John, 2015; Wu et al., 2016; Dhakal and Sain, 2019; Fombuena et al., 2019; Goudenhooft et al., 2019; Zhang et al., 2020a).
The genome of linseed cultivar CDC Bethune was sequenced on an Illumina platform in 2012, using paired-end and mate-pair libraries. This resulted in an assembly of 302 Mb with scaffold N50 of about 700 kb, contig N50 of ∼20 kb, and 81% coverage of the flax genome estimated at 370 Mb (Wang et al., 2012). Chromosome-level assembly for 15 chromosome pairs of CDC Bethune was obtained in 2018, using BioNano genome optical, BAC-based physical, and genetic mapping (You et al., 2018). Scaffold-level genome assemblies of linseed cultivar Longya-10, fiber cultivar Heiya-14, and pale flax were generated in 2020, based on Illumina sequencing, Hi-C technology, and genetic mapping (Zhang et al., 2020b). These results are extremely important for further progress in molecular studies of flax, the development of genome editing, and marker-assisted and genomic selection (Saha et al., 2019; Morello et al., 2020; You and Cloutier, 2020). A high-quality genome can be used as a reference for genome and transcriptome assemblies of different flax cultivars/lines, and the identification of polymorphisms and differences in gene expression within L. usitatissimum genotypes (Dmitriev et al., 2020b; Guo et al., 2019; Wu et al., 2019). Genome sequences of flax are necessary for the identification of particular gene families or repeat classes in species of the genus Linum and cultivars/lines of L. usitatissimum (Bolsheva et al., 2019; Novakovskiy et al., 2019; Ali et al., 2020; Dmitriev et al., 2020a).
Recent studies have shown that different genotypes of the same crop can diverge greatly at the genome level, not only in terms of SNPs and small indels but also long insertions and deletions, which can be identified by comparing high-quality genome assemblies (Zhang et al., 2019; Song et al., 2020). Next-generation sequencing platforms, such as Illumina, SOLiD, 454, Ion Torrent, and BGISEQ, have enabled the determination of genomic sequences for thousands of plant genotypes using short reads, whereas the development of third-generation sequencing platforms, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), which produce long reads of up to hundreds of thousands of bases, has facilitated accurate genome assembly (Goodwin et al., 2016; Li et al., 2017; Belser et al., 2018; Li and Harkess, 2018). Despite the wide use of third-generation sequencing approaches in studies of plant genomes, we did not find such sequencing data for flax. To fill this gap, we sequenced the genome of fiber flax cultivar Atlant using ONT and Illumina platforms to obtain a combination of long but less accurate reads and short high-accuracy reads, which is extremely important for high-quality genome assembly.

Plant Material
Fiber flax cultivar Atlant (alias-l. 23-4 Saldo × Mogilevskij) is characterized by high values of parameters that determine the quality of fiber, including flexibility, metric number, linear density, and calculated relative breaking load. Additionally, this cultivar has low variability of morphological and anatomical characteristics under stress conditions, especially unfavorable soil pH, compared to optimal ones (Ryzhov et al., 2012;Rozhmina et al., 2020). These characteristics of cultivar Atlant are important for the guaranteed production of high-quality fibers that meet the requirements of the textile industry.
Atlant seeds were obtained from the Institute for Flax (Torzhok, Russia), which is the originator of this cultivar. Seeds were sterilized in 1% sodium hypochlorite for 2 min and planted in 20 cm pots with sterile soil. Plants were grown in a climate chamber (Daihan LabTech, South Korea) for 2 weeks, and then leaves were collected from individual plants, frozen in liquid nitrogen, and stored at −80 °C until DNA extraction.

DNA Extraction
The DNA extraction method included the homogenization of 0.1 g of leaves from a single plant in liquid nitrogen followed by DNA isolation using a DNA-EXTRAN-3 kit (Synthol, Russia), DNA precipitation with CTAB-containing buffer (1% CTAB, 50 mM Tris-HCl pH 8.0, and 10 mM EDTA), and purification in ion-exchange columns from the Blood and Cell Culture DNA Mini Kit (Qiagen, USA). The DNA concentration and quality were evaluated using a Qubit 2.0 fluorometer (Life Technologies, USA) and NanoDrop 2000C spectrophotometer (Thermo Fisher Scientific, USA). The DNA length and the absence of RNA were assessed via electrophoresis in a 0.8% agarose gel.

Genome Sequencing on ONT Platform
Library preparation was performed using an SQK-LSK109 Ligation Sequencing Kit (ONT, UK) for 1D genomic DNA sequencing. Minor modifications were introduced to the basic protocol for library preparation. The incubation time was increased to 20 min at the DNA recovery step and 60 min at the adaptor ligation step. A MinION (ONT) instrument with an R9.4.1 flow-cell (ONT) was used for sequencing.

Genome Sequencing on Illumina Platform
DNA was fragmented on an S220 ultrasonic homogenizer (Covaris, USA), and 1 µg of fragmented DNA was used for library preparation using a NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, UK) with a size selection of adaptor-ligated DNA of ∼600-800 bp. The DNA library concentration and quality were evaluated on a Qubit 2.0 fluorometer (Life Technologies) and 2100 Bioanalyzer (Agilent Technologies, USA), respectively. Sequencing was performed on a HiSeq 2500 instrument (Illumina, USA) with a read length of 250 + 250 bp.

Preliminary Data Analysis
For successful Nanopore sequencing, DNA quality is crucial. We developed a protocol for the isolation of long high-purity DNA from a single flax plant and obtained DNA of ∼50 kb with A260/A280 of 1.9 and A260/A230 of 2.0. The DNA concentrations measured with a NanoDrop spectrophotometer (Thermo Fisher Scientific) and Qubit fluorometer (Life Technologies) had similar values, which is an important criterion of DNA purity. The sequencing of the obtained DNA on the ONT platform produced 8.4 Gb with N50 of 12 kb, corresponding to ∼23× flax genome coverage. On the Illumina platform, 30× genome coverage was obtained with 22.6 million 250 + 250 paired-end reads. The raw data were deposited in the NCBI Sequence Read Archive (SRA) under the BioProject accession number PRJNA648016.
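The coverage figures above can be checked with simple arithmetic. The following is a minimal sketch, assuming that coverage was computed against the 370 Mb genome size estimate cited from Wang et al. (2012) and that "22.6 million" refers to read pairs before trimming; the function name `fold_coverage` is hypothetical and not part of any tool used in the study.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumption: the 370 Mb estimate (Wang et al., 2012) is the denominator.
GENOME_SIZE = 370e6

# ONT: 8.4 Gb of sequence in total.
ont_cov = fold_coverage(8.4e9, GENOME_SIZE)

# Illumina: 22.6 million pairs (assumed), each pair contributing 250 + 250 bp.
illumina_bases = 22.6e6 * (250 + 250)
illumina_cov = fold_coverage(illumina_bases, GENOME_SIZE)

print(f"ONT: {ont_cov:.1f}x, Illumina: {illumina_cov:.1f}x")
# prints "ONT: 22.7x, Illumina: 30.5x"
```

Under these assumptions the values round to the reported ∼23× and 30×.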
First, the MinION fast5 files were processed using Guppy 3.6.1 (https://community.nanoporetech.com/protocols/Guppyprotocol/v/gpb_2003_v1_revt_14dec2018) with the high-accuracy flip-flop algorithm (dna_r9.4.1_450bps_hac.cfg configuration file). Then, adapter sequences were removed using Porechop (https://github.com/rrwick/Porechop), and low-quality reads (average Q < 6) were filtered out using Trimmomatic 0.32 (Bolger et al., 2014). Illumina reads were also filtered (minimum read length of 50) and trimmed (trailing, minimum Q of 28) using the Trimmomatic tool. Genome assemblies based on the Nanopore reads were performed using four assemblers: Canu 2.0, Flye 2.7 (Kolmogorov et al., 2019), Shasta 0.5.0 (Shafin et al., 2020), and wtdbg2 2.5 (Ruan and Li, 2020). The default parameters were used, except for the minimum read length for Shasta (set to 3,000 bp) and the expected genome size for Flye and wtdbg2 (set to 400 Mb). The statistics for the genome assemblies were calculated using QUAST 5.0.2 (Gurevich et al., 2013), and are presented in Table 1. Canu produced the longest assembly (361.7 Mb for contigs and 393.9 Mb for unitigs, i.e., high-confidence contigs) and the largest contig of 5 Mb, and was among the best in most Nx and Lx statistics. The highest N50 value of 365 kb was obtained using wtdbg2; however, the total assembly length was only 212.2 Mb, well below the estimated size of the flax genome (∼370 Mb; Wang et al., 2012). Canu was second in N50 value, with 350 kb for contigs and 225 kb for unitigs.
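Because the comparison above relies on Nx and Lx statistics, a short illustration of how they are defined may help. This is a minimal sketch of the standard definition, not QUAST's own code; the function name `nx_lx` and the toy contig lengths are hypothetical.

```python
from typing import Iterable, Tuple


def nx_lx(lengths: Iterable[int], x: float = 50.0) -> Tuple[int, int]:
    """Return (Nx, Lx) for a collection of contig lengths.

    Nx is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least x% of the total assembly length;
    Lx is the number of contigs in that set.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    threshold = sum(ordered) * x / 100.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reachable through floating-point rounding when x is close to 100.
    return ordered[-1], len(ordered)


# Toy assembly (hypothetical lengths): total 300, half of which is 150.
toy_contigs = [10, 80, 30, 70, 20, 50, 40]
n50, l50 = nx_lx(toy_contigs)
print(n50, l50)  # prints "70 2", since 80 + 70 reaches 150
```

A high N50 alone does not indicate a good assembly: as with the wtdbg2 result, the value is computed relative to the assembled length, so an assembly that misses a large part of the genome can still score well.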
The misassembly rates between our assemblies and the NCBI representative genome for L. usitatissimum (cultivar CDC Bethune, GenBank: GCA_000224295.2) were evaluated using QUAST (Supplementary Data 1). Canu resulted in the best coverage of the reference genome (∼95% for both contigs and unitigs) and the largest alignment (662 kb for both contigs and unitigs) and was second only to Shasta in one of the key parameters, NA50, which is an analog of N50 for fragments successfully mapped to the reference. Considering the rate of misassemblies larger than 1 kb and the duplication ratio, Canu was only third after Shasta and wtdbg2; however, the latter demonstrated very low coverage of the reference genome (only 63.25%). It should be taken into account that we compared Atlant assemblies with the genome of another cultivar; therefore, it is natural that some under- and misassemblies are present. These statistics therefore served primarily to compare the Atlant assemblies produced by the different tools with one another.
Thereafter, to improve the contig accuracy, the assemblies were polished using Nanopore reads with Racon 1.4.3 (Vaser et al., 2017) and/or Medaka 1.0.3 (https://github.com/nanoporetech/medaka), and using Illumina reads with the POLCA tool from the MaSuRCA 3.3.9 assembler (Zimin et al., 2017). The assembly completeness was evaluated as the content of universal single-copy genes inherent to land plants using BUSCO v4 with the embryophyta_odb10 dataset (Seppey et al., 2019). The results are presented in Figure 1. For assemblies before polishing, the best results were obtained for Canu unitigs (93.74%), Canu contigs (93.62%), and Flye contigs (93.56%), whereas the worst result was shown for contigs assembled by wtdbg2 (59.73%). The most efficient polishing was achieved with the combination of Racon + Medaka + POLCA, which improved the completeness of the assembly from 93.62 to 97.40% (Canu contigs). This result was the best among all assembler-polisher combinations. The totality of the parameters, including Nx and BUSCO statistics, as well as the misassemblies, suggested that the Canu genome assembly of flax cultivar Atlant polished according to the Racon + Medaka + POLCA scheme was the best, and it was used for further genome annotation.
The large percentage of duplicated BUSCOs (68% for the polished Canu assembly) is noteworthy. This is in good agreement with the statement that L. usitatissimum originated as the result of the hybridization of two diploid Linum species, from each of which it received a whole set of chromosomes (Bolsheva et al., 2017).
In the NCBI genome database, assemblies of only three L. usitatissimum genomes are presented: linseed cultivar CDC Bethune (representative genome, chromosome level, GenBank: GCA_000224295.2), linseed cultivar Longya-10 (scaffold level, GenBank: GCA_010665275.1), and fiber flax cultivar Heiya-14 (scaffold level, GenBank: GCA_010665265.1). No annotations have been submitted for any of these three genomes, which complicates the use of these data in studies of flax. In the present study, we annotated the assembled genome of fiber flax cultivar Atlant using the funannotate 1.8.0 pipeline (https://funannotate.readthedocs.io/en/latest/). Immediately before the annotation, repeat masking was performed with TANTAN (http://cbrc3.cbrc.jp/~martin/tantan/). Approximately 7.6% of the genomic sequence was masked. For the annotation, we used our previously obtained transcriptome sequencing data for five different tissues of cultivar Atlant (NCBI SRA: SRX8380594-shoots of seedlings, SRX8380593-roots of seedlings, SRX8380592-flowers of adult plants, SRX8380591-stems of adult plants, and SRX8380590-leaves of adult plants). For genome-guided transcriptome assembly, we mapped the RNA-Seq reads to the assembled genome using HISAT2 2.2.0 (Kim et al., 2019). About 96% of reads (54.0M of 56.2M) were successfully mapped. In total, 82,290 transcripts corresponding to 69,143 genomic loci were assembled using Trinity 2.8.5 in genome-guided mode. Based on the transcript data and mapped RNA-Seq reads, a total of 77,522 gene models were predicted using PASA 2.4.1, Augustus 3.3.3, GlimmerHMM 3.0.4, SNAP v. 2006-07-28, GeneMark 4.61, and CodingQuarry 2.0 (the results were combined and analyzed using EvidenceModeller 1.1.1). Among them, 1,182 were annotated as tRNA genes.
In total, 18,946 gene models were successfully annotated using the Pfam database (as of June 2020), 19,741 using eggNOG (as of June 2020), 953 using the BUSCO embryophyta_odb10 dataset, and 3,725 using UniProt (as of June 2020). The summary statistics of the functional annotation of predicted genes are presented in Supplementary Data 2. The assembled genome was deposited in the NCBI database under the BioProject accession number PRJNA648016.

CONCLUSIONS
In this study, the genome of fiber flax cultivar Atlant was sequenced for the first time, using both Oxford Nanopore and Illumina platforms. For successful Nanopore sequencing, a protocol for extraction of pure high-molecular-weight DNA from the leaves of a single flax plant was developed. Sequencing of this DNA on the ONT platform resulted in 23× flax genome coverage (8.4 Gb, N50 = 12 kb). On the Illumina platform, 30× genome coverage was obtained (22.6 million 250 + 250 bp paired-end reads). Genome assemblies were performed using Canu, Flye, Shasta, and wtdbg2. Subsequent polishing by Racon, Medaka, and POLCA was used to improve the contig accuracy. The most complete and accurate assembly was achieved by Canu with the polishing scheme Racon + Medaka + POLCA: total length = 361.7 Mb, N50 = 350 kb, and 97.40% completeness according to BUSCO. The genome was annotated using the funannotate pipeline and our transcriptome sequencing data for five different tissues of cultivar Atlant. The obtained results are useful for the evaluation of L. usitatissimum polymorphism at the genome level, the identification of sequences specific to fiber flax, as a reference in studies of fiber flax cultivars, and the development of flax genomic selection and genome editing. These data can also be used for the analysis of flax DNA methylation at the whole-genome level, as information on this DNA modification can be derived from Nanopore reads.

DATA AVAILABILITY STATEMENT
The raw sequencing data and the assembled genome are deposited in the NCBI database under the BioProject accession number PRJNA648016.

AUTHOR CONTRIBUTIONS
AD, TR, and NM conceived and designed the work. EP, RN, AB, TR, NB, LP, ED, PK, AS, and NM performed the experiments. AD, EP, TR, AZ, OM, AK, GK, and NM analyzed the data. AD, EP, TR, GK, and NM wrote the manuscript. All authors read and approved the final manuscript.

FUNDING
This work (genome sequencing, assembly, and annotation) was financially supported by the Russian Science Foundation, grant 16-16-00114. Flax cultivar Atlant was bred under the financial support of the Ministry of Science and Higher Education of the Russian Federation, state assignment number 075-00853-19-00.

ACKNOWLEDGMENTS
We thank the Center for Precision Genome Editing and Genetic Technologies for Biomedicine, EIMB RAS for providing genome sequencing techniques and computing power. This work was performed using the equipment of EIMB RAS Genome center (http://www.eimb.ru/ru1/ckp/ccu_genome_ce.php).