Long-Read Sequencing Reveals the Repertoire of Long-Chain Polyunsaturated Fatty Acid Biosynthetic Genes in the Purple Land Crab, Gecarcoidea lalandii (H. Milne Edwards, 1837)

Citation: Ting SY, Lau N-S, Sam K-K, Quah ESH, Ahmad AB, Mat-Isa M-N and Shu-Chien AC (2021) Long-Read Sequencing Reveals the Repertoire of Long-Chain Polyunsaturated Fatty Acid Biosynthetic Genes in the Purple Land Crab, Gecarcoidea lalandii (H. Milne Edwards, 1837). Front. Mar. Sci. 8:713928. doi: 10.3389/fmars.2021.713928 Long-Read Sequencing Reveals the Repertoire of Long-Chain Polyunsaturated Fatty Acid Biosynthetic Genes in the Purple Land Crab, Gecarcoidea lalandii (H. Milne Edwards, 1837)


INTRODUCTION
Phylogenetically, crabs (Brachyura) are among the most diverse crustaceans, with representatives in marine, freshwater, and terrestrial niches, encompassing over 7,000 species in 98 families (Ng et al., 2008;De Grave et al., 2009). The Gecarcinidae crab family, consist of six genera encompassing 20 species, showing distinct adaptations for a terrestrial lifestyle, with larval development taking place in the oceans (Hartnoll, 1988;Greenaway, 1999;Ng and Shih, 2015). Interestingly, while the conquest of terrestrial habitats is reported to encompass long-term adaptations, a much more rapid process has been reported to occur in terrestrial crabs (Schubart et al., 1998). During this sea to land transition, animals have developed a number of physiological and anatomical adaptations associated with gas exchange, salt and water balance, nitrogenous excretion, thermoregulation, reproduction, feeding, and diet (Bliss and Mantel, 1968;Powers and Bliss, 1983;Greenaway, 1999;Richardson and Araujo, 2015). The Gecarcoidea genus consists of Gecarcoidea natalis, which is endemic to Christmas Island and the Cocos (Keeling) Islands while Gecarcoidea lalandii is widely distributed in the Indo-West pacific region (Lai et al., 2017). Both these species release larvae from above the water, a behavior speculated to confer protection to the ovigerous females (Liu and Jeng, 2007). Land crabs are considered opportunistic omnivores, feeding on carrion, insects, mammalian feces, and plant material (Wolcott and Wolcott, 1984;Ortega-Rubio et al., 1997). Nevertheless, the nature of their habitat has driven some of the species to become more herbivorous, foraging mostly on leaf litter, vascular plants foliage, seeds, and fruits (Linton and Greenaway, 2007;Wolcott and O'connor, 2015). This shift was shown to be crucial in promoting the diversification of crustaceans (Poore et al., 2017).
A dichotomy between terrestrial and marine-based diets is the concentration of longchain polyunsaturated fatty acids (LC-PUFA) such as eicosapentaenoic acid (EPA), docosahexaenoic acid (DHA) and arachidonic acid (ARA) in the food chain (Hixson et al., 2015). The importance of this difference is magnified by the reliance of terrestrial organisms on the aquatic environment for supply of LC-PUFA (Gladyshev et al., 2009).
Aquatic primary producers and invertebrates, especially marine species, possess all the necessary enzymes required to biosynthesize the LC-PUFA from shorter chain fatty acid substrates. Two classes of enzymes, the fatty acyl desaturase (Fads) and elongases of very long-chain fatty acid (Elovl), work together sequentially, inserting a double bond at defined locations of the fatty acyl backbone and elongating the fatty acyl chain through addition of two carbon units. Work on various vertebrate species, mainly on teleost, have elucidated the complete gamut of Fads and Elovls responsible for the biosynthesis of EPA/DHA and ARA from linolenic acid (LNA) and linoleic acid (LA), respectively. Similar corresponding works on various aquatic invertebrates such as molluscs and echinoderms have also revealed the presence of Fads and Elovl orthologs and LC-PUFA biosynthesis activities (Monroig and Kabeya, 2018). The presence of functional Elovl in crustaceans were only recently shown in two marine brachyuran crabs (Mah et al., 2019;Sun et al., 2020;Ting et al., 2020). Given the limited amount of LC-PUFA in the terrestrial food chain, very little is known about how terrestrial crustaceans adapt in terms of LC-PUFA biosynthesis capacity.
Transcriptome reconstruction is a rapid and economical approach useful for systematic gene models characterization in species without any genome reference. RNA sequencing via next-generation sequencing technology is widely used for this purpose (Grabherr et al., 2011;Mutz et al., 2013). However, short reads require large computational assembly and are often insufficient to capture entire end-to-end transcripts, limiting the accuracy of gene model prediction (Steijger et al., 2013). The long-read isoform (Iso-Seq) sequencing method (Pacific Biosciences, PacBio) analyse full-length transcripts as a single sequence read without the need for further assembly, making it an ideal for identifying and characterizing novel transcripts and transcript isoforms (Eid et al., 2009;Roberts et al., 2013;Rhoads and Au, 2015). This technique was used for transcriptome analysis in crustacean species such as Litopenaeus vannamei (Zeng et al., 2018;Wan et al., 2019), Penaeus monodon (Pootakham et al., 2020), and Scylla paramamosain (Wan et al., 2019). In consideration of the lack of any terrestrial crab reference genome, the aim of this study was to apply the PacBio Iso-Seq technique to profile the transcriptome of G. lalandii (Decapoda, Brachyura, Gecarcinidae) with the objective of identifying relevant transcripts for LC-PUFA biosynthesis Fads and Elovl enzymes. Our study provides the first gene catalogs for a terrestrial crab species, adding to currently available Brachyura transcriptome datasets.

Sample Collection and Total RNA Extraction
G. lalandii sample was collected at Rawa Island, Johor, Malaysia (2.5204 • N, 103.9760 • E) in June 2018. Three adult male crabs weighing 120-150 g were sampled. We selected hepatopancreas tissue to obtain transcript library as hepatopancreas is responsible for the storage of organic matter and metabolism of nutrients (Vogt, 1994;Wen et al., 2001;Abol-Munafi et al., 2016). The tissue was sampled immediately after euthanization, snap-frozen in liquid nitrogen, and stored at −80 • C prior to RNA extraction. Total RNA was isolated from the tissues of three crabs using the Qiagen RNeasy mini kit (Qiagen, Germany) following manufacturer's recommendations and pooled together for library preparation. Potential DNA contamination was eliminated by applying on-column DNase digestion using RNase-Free DNase Set (Qiagen, Germany). RNA concentration was determined using a Qubit R 2.0 Fluorometer (Life Technologies, USA), and RNA quality was assessed using an Agilent 2,100 Bioanalyzer (Agilent Technologies, USA). RNA with RNA integrity number (RIN) value above 9 was used for library construction.

PacBio cDNA Library Preparation and Sequencing
Sequencing library was prepared according to the PacBio Iso-Seq protocol. The Clontech SMARTer PCR cDNA Synthesis Kit with Oligo(dT) primers was used to generate the first and second cDNA strand from polyA mRNA. Size fractionation and selection on ≤ 4 kb and ≥ 4 kb bin were performed using the BluePippin TM Size selection system (Saga Science, USA). SMRT bell library was constructed with the PacBio DNA Template Prep Kit 1.0, and sequencing run was performed on a PacBio Sequel platform.

Iso-Seq Data Analyses
The sequence data was processed through the RS_IsoSeq (version 2) protocol. Reads of insert ROIs (previously known as circular consensus sequence) were generated from raw subreads using the SMRT Link (version 5.1) software (Gordon et al., 2015). A total of 11.43 Gb raw data was generated by 5,821,087 of subreads, which were classified into 446,716 of ROI reads ( Table 1). The ROIs were classified based on the presence of 5 ′ and 3 ′ adapters as well as the poly(A) tail into full-length and non-full length reads. Sequences containing both the 5 ′ and 3 ′ primers and having a poly(A) tail signal preceding the 3 ′ primer were considered to be full-length (FL) ROIs. ROI reads comprised of 295,220 full-length non-chimeric transcripts with an average read length of 2,448 bp. FL ROIs sequences were then passed through the isoform-level clustering (ICE algorithm). ROI sequences were used to correct errors in the isoform sequences using the Quiver algorithm. The Quiver polishing process produced high-quality and low-quality isoform sequences corresponding to a predicted accuracy of ≥ 99% or below, respectively. The isoform-level clustering and final polishing steps yielded 18,552 and 2,76,668 of high-quality and low-quality consensus isoforms, respectively. The completeness of consensus transcripts was assessed by benchmarking universal single-copy orthologs (BUSCO) (version 3.1.0) (Simão et al., 2015). Our G. lalandii hepatopancrease transcriptome only captured 67% of the conserved arthropod genes, likely due to our sampling from a single tissue (Supplementary Figure 1). The isoform sequences were mapped to the Portunus trituberculus reference genome (Tang et al., 2020) (a chromosome-level genome assembly of Brachyura) using GMAP (version 2015-09-29) (Wu and Watanabe, 2005). More than 92% of consensus transcripts were mapped to the P. trituberculus genome

Functional Annotation of Isoforms
The obtained high-quality isoforms were annotated by conducting a local BLASTx (version 2.7.1) search against the protein databases, namely the GenBank NCBI non-redundant protein sequences (NR), SwissProt, TrEMBL, euKaryotic Ortholog Groups (KOG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. The BLAST hits to NR were processed by functional annotation with BLAST2GO (version 5.1) (Conesa et al., 2005). The coding region prediction of high-quality isoforms was performed using ANGLE (version 2.2) (Shimizu et al., 2006). Hmmscan (version 3.1b) was used to search sequence protein domains against the Pfam database.
In general, most of the transcripts (89.91%) exhibited homology with at least one protein database. Matches to the specific databases were as follows: 89.41% to NR,88.78% to TrEMBL,86.92% to KEGG,84.98% to Pfam,83.75% to SwissProt, and 83.57% transcripts aligned to the KOG databases. Further, 86.59 % of transcripts were assigned to multiple Gene Ontology (GO) classification terms. Among them, "metabolic process, " "cellular process, " and "response to stimulus" were the most represented terms in the biological process (Supplementary Figure 2). In the cellular component, the majority of the transcripts were represented by "cellular anatomical entit, " "intracellular, " and "protein-containing complex." Within the molecular function category, "catalytic activity, " "binding, " and "transporter activity" had the highest number of transcripts. The top hit species distribution of matches with known sequences indicates that the G. lalandii transcripts had the highest number of hits to the Eriocheir sinensis, Penaeus vannamei and Metacarcinus magister (Supplementary Figure 3).

Analyses of Fads and Elovl Orthologs in G. lalandii
KEGG annotation revealed the presence of several genes implicated in the polyunsaturated fatty acid biosynthesis pathway of G. lalandii (Supplementary Figure 4). The transcriptomic databases for selected crustacean species used for comparative analyses were downloaded from the NCBI and JGI databases. The downloaded sequences were analyzed using BLAST for Fads and Elovl sequences. Fads and Elovl identified were aligned using MUSCLE in MEGAX (Kumar et al., 2018) ( Supplementary Figures 5-8). Two Elovl elongases were found in the G. lalandii transcripts with highest BLAST hits matching Elovl 6 and a putative novel Elovl, respectively. The best fit model was predicted using ModelFinder (Kalyaanamoorthy et al., 2017), and phylogenetic tree was built using IQ-TREE (version 1.6.12) based on the predicted model (Nguyen et al., 2015). Maximum likelihood phylogenetic analysis of Elovl from various crustacean transcriptomes also placed the two Elovl sequences in their respective ortholog clusters ( Figure 1A). Elovl6, Elovl7, ElovlA, Elovl B and the putative new Elovl all shared the common ancestors while the Elovl4 formed a separate clade. To date, the Elovl4 and Elovl6 have been shown to have in vitro PUFA elongation capacities in marine brachyuran species (Sun et al., 2020;Ting et al., 2020). Interestingly, while G. lalandii appear to have a Elovl6 ortholog, there was no Elovl4.
Sequence alignment analysis was also performed with Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/), and the conserved motif was highlighted. Motif analysis was done using WebLogo (https://weblogo.berkeley.edu/examples.html), where the amino acids of histidine boxes were represented based on their frequency of occurrence. Alignment of the conserved histidine box in each respective Elovl revealed the conservation of the LHxxHH motif in all the elongases (Supplementary Figure 9). Functional characterization of these elongases for comparison with the marine crab species will yield insights on the LC-PUFA biosynthesis capacities of brachyuran living in habitats with different availability of LC-PUFA in the food chain.
Frontiers in Marine Science | www.frontiersin.org which has high affinity for saturated fatty acids (Monroig et al., 2017). Collectively these transcripts will facilitate the deciphering of LC-PUFA biosynthesis in regard to adaptation to terrestrial living.

RE-USE POTENTIAL
Here, we present a long-read RNA sequencing dataset of G. lalandii hepatopancreas that was generated by using the PacBio platform. Specifically, our analyses exemplify applicability of the dataset for mining two orthologs of two classes of enzymes known to work in concert in LC-PUFA biosynthesis, the elongase and fatty acyl desaturase. This transcript assembly serves as a resource for gene mining, which will aid in the functional characterization of various enzymes relevant to the biosynthetic pathway.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www. ncbi.nlm.nih.gov/, SRR10903072; https://www.ncbi.nlm.nih. gov/, GIKV00000000.

AUTHOR CONTRIBUTIONS
EQ and AA collected and identified the samples. ST and N-SL designed the experiments and analyzed the data. K-KS and M-NM-I participated in the data analysis and checking. AS-C conceived and supervised the study. ST, N-SL, and AS-C wrote the manuscript. Revision of data and manuscript after first round of peer-review were by all authors.

FUNDING
We thank the Malaysian Ministry of Higher Education for funding the project under the FRGS Grants Scheme (203/PCCB/6711650).