Genome Size Estimation and Full-Length Transcriptome of Sphingonotus tsinlingensis: Genetic Background of a Drought-Adapted Grasshopper

Sphingonotus Fieber, 1852 (Orthoptera: Acrididae), is a grasshopper genus comprising approximately 170 species, all of which prefer dry environments such as deserts, steppes, and stony benchlands. In this study, we aimed to examine the adaptation of grasshopper species to arid environments. The genome size of Sphingonotus tsinlingensis was estimated using flow cytometry, and the first high-quality full-length transcriptome of this species was produced. The genome size of S. tsinlingensis is approximately 12.8 Gb. Based on 146.98 Gb of PacBio sequencing data, 221.47 Mb full-length transcripts were assembled. Among these, 88,693 non-redundant isoforms were identified with an N50 value of 2,726 bp, which was markedly longer than previous grasshopper transcriptome assemblies. In total, 48,502 protein-coding sequences were identified, and 37,569 were annotated using public gene function databases. Moreover, 36,488 simple tandem repeats, 12,765 long non-coding RNAs, and 414 transcription factors were identified. According to gene functions, 61 cytochrome P450 (CYP450) and 66 heat shock protein (HSP) genes, which may be associated with drought adaptation of S. tsinlingensis, were identified. We compared the transcriptomes of S. tsinlingensis and two other grasshopper species which were less tolerant to drought, namely Mongolotettix japonicus and Gomphocerus licenti. We observed the expression of CYP450 and HSP genes in S. tsinlingensis were higher. We produced the first full-length transcriptome of a Sphingonotus species that has an ultra-large genome. The assembly characteristics were better than those of all known grasshopper transcriptomes. This full-length transcriptome may thus be used to understand the genetic background and evolution of grasshoppers.


INTRODUCTION
Sphingonotus Fieber, 1852 (Orthoptera: Acrididae), is a species-rich genus of grasshoppers that comprises 192 valid species and subspecies (Cigliano et al., 2017). This genus shows extensive radiation in the Palearctic, and many species are endemic to islands (Husemann et al., 2015). These grasshoppers are distributed in arid zones of the Northern Hemisphere, with diverse hotspots in the Mediterranean and Central and Eastern Asia . Single species have been recorded in the Galapagos Islands, Mexico, the Caribbean, Brazil, and northwestern Australia (Rentz, 1996;Cigliano et al., 2017). Their wide distribution, regional high incidence of endemism, and polymorphic stridulation organs make Sphingonotus an attractive study system for biogeographic and evolutionary questions (Johnsen, 1985;Benediktov, 2009;Husemann et al., 2011). The most intriguing characteristic of these species is that they prefer dry environments such as deserts, steppes, and dry stony benchlands (Husemann et al., 2014). Recently, several studies examined the phylogeography and evolution of this genus. However, it remains unclear how its species have adapted to drought environments and radiated to such richness. Sphingonotus tsinlingensis (Orthoptera: Acridoidea) is a grasshopper endemic to China, where it occurs on sandy and stony benchlands along the northern Qinling Mountains (Zheng et al., 1963). This species belongs to the oriental lineage of Sphingonotus, represents the adaptive status of Sphingonotus spp. in the East, and may serve as an ideal model for studying the evolution and ecology of grasshoppers in arid environments (Husemann et al., 2014;Moussi et al., 2018). Nonetheless, only few studies and sequences of respective functional genes of this species are available (Cui and Huang, 2012;Shah et al., 2019). Thus, further reference sequences and molecular genetic studies are required to elucidate its evolutionary and ecological characteristics.
Full-length transcriptome analysis can help identify coding and non-coding RNA and quantify differential gene expression. Moreover, it plays an important role in deciphering genomic functions with respect to physiological mechanisms and responses to environmental challenges (Jiang et al., 2015). Full-length transcriptomes of the two grasshoppers Gomphocerus licenti and Mongolotettix japonicus have been published previously, which yielded 590,112 and 566,165 circular consensus sequences (CCSs) as well as 458,131 and 428,979 full-length non-chimeric (FLNC) reads, respectively. In total, 17,970 and 16,766 unigenes were identified, with 17,495 and 16,373 coding sequences (CDSs), 1,082 and 813 transcription factors (TFs), 11,840 and 10,814 simple sequence repeats (SSRs), and 905 and 706 long non-coding RNAs (lncRNAs) by analyzing the transcriptomes of G. licenti and M. japonicus, respectively; 15,803 and 14,846 respective unigenes were annotated in public databases (Yuan et al., 2020).
In this study, the genome size (GS) and transcriptome of S. tsinlingensis were produced and examined for further functional and ecological studies using the deep-coverage PacBio isoform sequencing technique (Camacho et al., 2015;Yuan et al., 2019). In addition to protein-coding genes, different types of genetic elements such as TFs, SSRs, and lncRNAs were identified and classified. The produced full-length transcriptome of S. tsinlingensis was compared to published transcriptomes of other grasshoppers for quality evaluation, and heat shock protein (HSP) and cytochrome P450 (CYP450) genes were analyzed to investigate drought adaptation in this species.

Sample Collection
Six adult S. tsinlingensis males were collected from a natural population at a pebble beach in Xi'an (34 • 02 05.2 N, 108 • 33 04.3 E) on September 14, 2020. Some grasshoppers were kept alive in insect mesh cages until flow cytometry (FCM) was performed in the laboratory, while others were dissected, immediately immersed in liquid nitrogen, and stored at −80 • C.

GS Estimation Using FCM
Flow cytometry was used to investigate the GS of S. tsinlingensis (Dolezel and Bartos, 2005). DNA content is directly proportional to FCM fluorescence intensity. Thus, the GS of an organism can be calculated by comparison of a sample's fluorescence intensity with that of an internal standard of known GS. L. migratoria (1C = 6.5 G) (Wang et al., 2014) was chosen as an internal standard. FCM was performed as previously described (Gregory and Johnston, 2008;Hare and Johnston, 2011). To prepare single cell suspensions, head tissues of L. migratoria and S. tsinlingensis were removed, placed in a tissue grinder with 1 mL Galbraith buffer (Galbraith et al., 1983), and ground 20 times. Then, the solution was filtered through a 38-µm nylon mesh to remove cellular debris and stained using 50 µg/mL propidium iodide. The above steps were performed on ice. The solution was then stored in the dark at 4 • C for 30 min. The GS was assessed using a flow cytometer (Cytoflex S; Beckman Coulter, Krefeld, Germany) with three technical replicates, which were activated with a 488-nm laser and low flow rates.

RNA Extraction
Total RNA was isolated from muscle tissues of the intact body of all individuals after removing the guts. Extractions were conducted using TRIzol reagent (Invitrogen, Carlsbad, CA, United States), following the manufacturer's instructions. RNA degradation and contamination were screened using 1% agarose gel electrophoresis, and RNA integrity and purity were determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, United States) and a NanoDrop 2000 device (Thermo Scientific, Wilmington, DE, United States), respectively. Only total RNA samples with an RNA integrity index of ≥8 were used for producing high-throughput sequencing libraries.

Library Preparation and Sequencing
The protocol of SMARTer PCR cDNA Synthesis Kit (TaKaRa, Dalian, China) was used to synthesize full-length cDNA and cDNA fractions. The BluePippin Size Selection system (Sage Science, Massachusetts, United States) was used to select the PCR products. SMRTbell (Pacific Biosciences, Menlo Park, CA, United States) template libraries were produced using the SMRTbell Template Prep Kit (Pacific Biosciences) according to the manufacturer's instructions. An Agilent 2100 Bioanalyzer and a Qubit 2.0 device (Life Technologies, Carlsbad, CA, United States) were used to assess library quality and concentration, respectively. SMRT sequencing was performed using a PacBio Sequel platform (PacBio) at Novogene Technology Co. (Novogene, Beijing, China). An Illumina (San Diego, CA, United States) sequencing library was constructed using a Gene Expression Sample Prep Kit (Illumina), according to the manufacturer's instructions. The qualified library was paired-end sequenced (2 × 150 bp) on an Illumina HiSeq X Ten (Illumina) platform by a commercial provider (Novogene).

Data Analyses
PacBio data were processed using the SMRT Link 5.1 software pipeline (Pacific Biosciences). First, subreads were identified, and CCSs were produced using corrections between subreads. The CCSs were divided into FLNC and NFL sequences according to whether they contained 5'-primers, 3'-primers, and poly-A tails. FLNC reads were clustered using the ICE algorithm to obtain consensus sequences; Arrow software 1 was used to refine consensus isoforms using the NFL to produce refined consensus sequences. Illumina RNA-seq short reads were filtered to remove adaptor sequences, ambiguous reads with "N" bases, and low-quality reads. Filtered Illumina data were then used to refine consensus sequences using Proovread (Hackl et al., 2014). Redundant isoforms (identity < 0.9; coverage age < 0.85) were eliminated using the CD-HIT program without considering the 5'-difference. To process raw Illumina sequencing data, Trimmomatic and FastQC (Poluri et al., 2019) software programs were used. HISAT (Pertea et al., 2016) was then used to map the filtered reads to the genome. Fragments per kilobase of exon per million fragments mapped (FPKM) values were calculated using StringTie (Pertea et al., 2015), and 25,761 genes had FPKM values.

Gene Functional Annotation
Gene functions were annotated using the following databases: non-redundant protein sequences (NR) (Li et al., 2002), nonredundant nucleotide sequences (NT), Protein Family (Pfam), clusters of orthologous groups of proteins (KOG) (Tatusov et al., 2003), Swiss-Prot (Bairoch and Apweiler, 2000), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2004), and Gene Ontology (GO) (Ashburner et al., 2000). The Basic Local Alignment Search Tool (BLAST) was used with an e-value of 1 −10 in the NT database analysis, and the Diamond BLASTX software was used with an e-value 1 −10 in the NR, KOG,

Protein-CDS Prediction
ANGEL software (Shimizu et al., 2006) was used to determine CDSs among cDNA sequences. ANGEL calculates the coding potential of all codons using information from a short region around the target codon. Short regions are generated using a sliding window. All codons were labeled as CDS or ELSE, according to their coding potential. The most probable path was then traced using a Markov chain model with dynamic programming. Positions where a frame was changed were detected as the rough positions of frameshift errors in the path. Finally, each rough position was modified by selecting the most probable position from the candidate positions located near the rough positions.

TF and lncRNA Analysis
Transcription factors were predicted using the AnimalTFDB 2.0 database 2 . Hmmscan software was used for comparison with the AnimalTFDB database to screen third-generation sequences (e-value < 0.0001) for TF identification and assign transcripts to different families. The Coding-Non-Coding Index (Sun et al., 2013), coding potential calculator (Kong et al., 2007), Pfam-scan (Finn et al., 2016), and predictors of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme (Li et al., 2014) tools were used to predict the coding potential of transcripts. Transcripts with predicted coding potential according to all four tools were filtered out, and those without coding potential constituted the candidate set of lncRNAs.

SSR Analysis
Simple sequence repeats of the transcriptome were identified using MIcroSAtellite 3 , which allows the identification and localization of perfect microsatellites as well as compound microsatellites that are interrupted by a certain number of bases.

GS Estimation of S. tsinlingensis
The GS of S. tsinlingensis was estimated using FCM. Two peaks were identified, with P1 and P2 representing the counted cell intensities of L. migratoria and S. tsinlingensis, respectively. The mean fluorescent intensities of peaks P1 and P2 were 8.11 × 10 5 and 15.98 × 10 5 (low coefficient of variation, less than 5%), respectively. The ratio of GS of S. tsinlingensis to L. migratoria was equal to that of P2 to P1. The GS of S. tsinlingensis was calculated to be approximately 12.81 Gb (Supplementary Figure 1). This GS was considerably larger than that of L. migratoria and was one of the largest among Acrididae.

Full-Length Transcriptome Assembly of S. tsinlingensis
In total, 146.98 Gb raw polymerase reads were obtained using the PacBio isoform sequencing platform. After filtering and self-correcting the raw data, 28.6 million subreads were processed into 901,383 CCS reads. These CCSs were then clustered and polished into 88,693 non-redundant full-length, non-chimeric isoforms, which was the final molecular sequence pool for screening gene components (Supplementary Table 1;  Supplementary Figure 2). The full-length transcriptome of S. tsinlingensis showed better quality characteristics than other reported grasshopper genomes and transcriptomes. The transcript number was 88,693, and the percentage of long nonredundant isoforms (over 1,000 bp) was 99.28%. The most outstanding parameter was an average length of 2,497 bp and N50 value of 2,726 bp, which was longer than other transcriptomes ( Table 1).

Prediction of CDSs
Among full-length, non-chimeric isoforms of S. tsinlingensis, 48,502 transcripts with CDSs were identified, accounting for approximately 54.68% of the total isoforms. The N50 length of CDSs was 1,230 bp, and each CDS encoded 229.3 amino acids on average (Supplementary Figure 3). In total, 37,569 CDSs were retrieved after integrated annotations using seven gene databases, accounting for 78.18% of the total CDSs ( Figure 1A; Supplementary Table 2). A Venn diagram was generated to visualize the respective contribution of different databases to the annotations, and 3,575 genes were annotated by the five most commonly employed databases ( Figure 1B).
The main annotations were from the NR database (34,731 genes). Among these, S. tsinlingensis produced the most hits with Zootermopsis nevadensis (12.86%, 4,459 genes; Supplementary Figure 4A). The pattern of hits suggested that these species shared close phylogenetic positions, and the percentages corresponded to the number of their indexed sequences in the database, which also revealed that the proportion of Acrididae was not as high as that of other species. GO and KEGG annotations were used to describe the genome composition of S. tsinlingensis. Cells and cell parts (2,865 genes) were the most represented terms in the cell component categories, and binding (7,457 genes) was the most represented molecular function (Supplementary Figure 4B). In the KEGG annotations, 30,637 annotations were assigned to 355 signaling pathways. The most enriched KEGG pathways were associated with basic metabolism processes, including signal transduction (1,008 genes), transport and catabolism (608 genes), amino acid metabolism (606 genes), cancer overview (578 genes), and endocrine system (571 genes; Supplementary Figure 4C). Several highly conserved signaling pathways that play a critical role in insect growth and body development were markedly enriched, including the Wnt, Notch, transforming growth factor-β, Janus kinase/signal transducer and activator of transcription, mitogen-activated protein kinase, and Hedgehog pathways.

CYP450 and HSP Genes
Drought adaption-related genes and their annotations were identified from the isoform sequences of S. tsinlingensis, including 61 xenobiotic CYP450 and 66 HSP genes. The lengths of CYP450 and HSP sequences varied from 1,019 to 5,346 bp and from 1,438 to 4,428 bp, respectively. Most BLAST identities of these target-focused sequences were below 95%, which suggested that they were novel to the current understanding of genetic mechanisms in S. tsinlingensis (Supplementary Figure 5).
To confirm that these sequences were associated with adaptation to drought in S. tsinlingensis, we compared the expression of CYP450 and HSP genes among the transcriptomes of S. tsinlingensis and two other grasshopper species that were less tolerant to dry areas (M. japonicus and G. licenti). The box plot illustrates that the distribution of expressions had a wider range with respect to CYP450 and HSP genes in S. tsinlingensis than in M. japonicus and G. licenti. In particular, the distribution ranges Q1 to Q3 of the FPKM values of CYP450 and HSP genes were 6.38 to 48.10 and 34 to 65.36, respectively, in S. tsinlingensis. However, these values were only 2.12 to 11.74 (CYP450) and 11.89 to 64.53 (HSP) in M. japonicus and 2.32 to 8.30 (CYP450) and 1.67 to 8.05 (HSP) in G. licenti, respectively (Figure 2). These results suggested that more members of CYP450 and HSP genes  in S. tsinlingensis participated and contributed to its adaptation to drought. In addition, the overall expression levels of CYP450 and HSP genes were higher than those in other grasshoppers such as M. japonicus and G. licenti.

LncRNA Identification
LncRNAs were identified using a combination of the Coding Potential Calculator, Coding-Non-Coding Index, Coding Potential Assessment Tool, and Pfam methods. In total, 12,765 (22.94%) valid lncRNAs (>200 bp long and with more than two exons) were identified (Figure 3). A length distribution analysis of lncRNAs revealed that their lengths ranged from 212 to 8,365 bp, with a mean length of 2,090 bp. The N50 length of these identified lncRNAs was 2,185 bp. The number of lncRNAs was significantly lower than that of mRNAs. To verify the two possible conditions, i.e., each  type of lncRNA regulated multiple mRNAs or the valid lncRNA was not fully determined, the linkages between the 12,765 lncRNAs and 48,502 CDSs were checked. The lncRNAs were only related to 9,347 CDSs, suggesting that lncRNAs were not fully sequenced. Four HSP and no CYP450 gene-associated lncRNAs were identified, suggesting that the regulatory mechanisms of these HSP and CYP450 genes were not well characterized.

TF Identification
Transcription factors participate in gene expression regulation by linking lncRNAs and mRNAs. In this study, 414 putative TFs belonging to 38 TF gene families were predicted (Supplementary Table 3). zf-C2H2 (20.29%, 84/414) was the most abundant TF family, followed by ZBTB (16.18%, 67/414) and THAP (15.22%, 63/414; Supplementary Figure 6). The relationships among lncRNAs, TFs, and mRNAs are not discussed here because of the limited number of TFs. Instead, a GO enrichment analysis was conducted using genes in which TFs have been determined. Although lower in numbers, these genes did not present functional bias and covered almost all basic metabolic functions. Survival-dependent mechanisms such as multicellular organismal processes, developmental processes, and immune system processes were the most enriched mechanisms (Figure 4).

DISCUSSION
To the best of our knowledge, this is the first genomic study on Sphingonotus spp. Although S. tsinlingensis is an oriental species endemic to China, it represents the most popular morphological type of the grasshopper genus Sphingonotus and the subgenus Spingonotus. These taxa show hind wings with a curved black bind and blue color at the base (Zheng et al., 1963). A comprehensive transcriptome comprises all expressed gene sequences of an organism. These sequences can act as gene pools to provide and complement speciesspecific sequences. In addition, full-length transcriptome has better connectivity and integrity compared to Illumina short reads, and play an important role in gene annotation. Therefore, the transcriptome of S. tsinlingensis may provide a reference for the study of Sphingonotus species. Data, including lncRNAs, TFs, and SSRs, may be used as a genetic background for gene identification, homologous gene screening, phylogenetics, and adaptive evolution analysis. Moreover, through wellconnected and accurate sequences and short-read quantitative analyses, these elements were found to be involved in the unique physiological processes of this species. These results suggest that our data are reliable and may be of use for future studies.
Based on the ultra-large GS, we suggest using full-length transcriptomes instead of whole genomes when conducting genomic studies on grasshoppers of this genus. In general, PacBio sequences are 85% accurate at 30-fold coverage of the transcriptome (Midha et al., 2019). In this study, the full-length transcriptome was successfully produced at 60-fold coverage. The success of this strategy will help optimize the experimental design for similar future studies. Because grasshoppers have very large genomes and genome assemblies are difficult to construct owing to the highly repetitive regions (Schatz et al., 2010;Shah et al., 2019), grasshopper genome assembly requires a considerable amount of sequencing and computation and is thus time-consuming and expensive. Even with respect to mRNA, sequencing frequently fails to enrich effective sequences owing to the large GS, and rare genes are missed because of sequencing bias (Gao et al., 2016). To address this problem, sequencing depth FIGURE 4 | Gene ontology (GO) enrichment reveals the functions of genes that have been determined to be transcription factors (TFs). The rich factor refers to the ratio of the number of differentially expressed transcripts to the total number of annotated transcripts located in the GO term, and the Q-value is the P-value corrected by multiple hypothesis tests, with a value range of 0 to 1. The closer the Q-value is to zero, the more significant is the enrichment.
was increased in the current study to produce a sufficient amount of effective sequences. The number of protein-expressing genes in the functional gene dataset observed in the current study was the same as that found in previous studies (Yuan et al., 2020). The proportion of annotated functional genes was increased, and the average length of genes was markedly improved. From the composition of gene functions, gene detection was complete, and many basic metabolism-related genes were detected. Among grasshopper species for which both transcriptome and genome was reported, the GS of S. tsinlingensis was the largest. Its GS of approximately 12.81 Gb was almost twice that of L. migratoria (Wang et al., 2014). A comprehensive exploration of the effective sequences in the genome was conducted when the ratio of transcript size to GS was less than 2%.
These results suggest that the full-length transcriptome is an effective method for studying the genomics of grasshopper species with ultra-large and complex genomes. However, there are some limitations to this study. For example, lncRNAs, miRNAs, TFs, and mRNAs obtained from the full-length transcriptome typically play important roles in regulating gene expression at epigenetic, transcriptional, and post-transcriptional levels and constructing miRNA-lncRNAs-mRNAs-TF networks (Ye et al., 2018). Nonetheless, no association between these four regulatory elements was identified in the present study. The regulatory elements detected using functional enrichment reflected only the basic metabolic processes. To obtain better data in future studies and better enrich the results of the full-length transcriptome, a combination of multi-omics methods should be used, such as competing endogenous RNA for identification (Jiang et al., 2020) supplemented with mRNA next-generation sequencing for expression profile analysis. This is the first GS and full-length transcriptome study on a species of the genus Sphingonotus. S. tsinlingensis has an ultralarge genome (12.81 Gb). Deep PacBio sequencing is currently the best method for retrieving nucleic acid sequences from species with ultra-large and complex genomes. The full-length transcriptome produced in the present study provides a reference resource for future studies on gene identification and comparison and will help improve our understanding of the mechanisms by which grasshoppers adapt to arid environments.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi.nlm. nih.gov/, PRJNA707366 and https://zenodo.org/record/4588269# .YEdESKSwM3s, 4588269.

AUTHOR CONTRIBUTIONS
LZ, D-LG, and S-QX conceived the study and designed the experiments. D-LG analyzed the data. LZ wrote the manuscript. LZ, HW, PL, and KS performed the sequencing experiments. D-LG and S-QX revised the manuscript. All authors read and approved the final manuscript.

FUNDING
This work was supported by the Excellent Doctor Innovation Project of Shaanxi Normal University (S2015YB03) and Fundamental Research Funds for the Central Universities (GK201903063 and GK202105003). This work was also partly supported by the National Natural Science Foundation of China (No. 31872273).