De novo Transcriptome Assembly of Myllocerinus aurolineatus Voss in Tea Plants

Myllocerinus aurolineatus Voss is a species of the insecta class in the arthropod. In this study, we first observed and identified M. aurolineatus Voss in tea plants in Guizhou, China, where it caused severe quantity and quality losses in tea plants. Knowledge on M. aurolineatus Voss genome is inadequate, especially for biological or functional research. We performed the first transcriptome sequencing by using the Illumina Hiseq™ technique on M. aurolineatus Voss. Over 55.9 million high-quality paired-end reads were generated and assembled into 69,439 unigenes using the Trinity short read software, resulting in a cluster of 1,207 bp of the N50 length. A total of 69,439 genes were predicted by BLAST to known proteins in the NCBI database and were distributed into Gene Ontology (20,190), eukaryotic complete genomes (12,488), and the Kyoto Encyclopedia of Genes and Genomes (3,170). We also identified 96,790 single-nucleotide polymorphisms and 13,121 simple sequence repeats in these unigenes. Our transcriptome data provide a useful resource for future functional studies of M. aurolineatus Voss for dispersal control in tea plants.


INTRODUCTION
Tea plant Camellia sinensis is a major economic crop in China and accounts for half of the total plantation area worldwide . Many nutritional ingredients are produced in tea shoots, which are used as raw materials for commercial tea processing (Comblain et al., 2016). However, various tea plant pests jeopardize the quantity and quality of tea, such as the tea leaf beetle, one of the most serious pests affecting tea plants in China, Vietnam, Indonesia, Japan, and other teaproducing countries in Asia (Sun et al., 2010;Yang et al., 2013;Roy and Muraleedharan, 2014). Here, we reported a leaf beetle, Myllocerinus aurolineatus Voss in tea plant in Guizhou province, China. M. aurolineatus was first reported in China in 1991, where the species emerged in tea plants (Zhang, 1991).
M. aurolineatus is an exceptionally destructive and productive pest that shows strong tolerance and adaptability against environment challenges, including those with pesticide treatment, extreme temperatures, and volatile compound attractions (Sun et al., 2010(Sun et al., , 2017. In Meitan district, Guizhou, M. aurolineatus pupates from March to April, then larvae become adult and begin to mate from June to July. Finally, the outbreak of this pest will be from July to August. Pesticide resistance, the diapause process, and other biological flexibilities are moderated by series molecular mechanisms (Casida, 2017;Jugulam and Gill, 2018;Zhang et al., 2018). However, lack of bona fide reference regarding gene annotation and genomic information restricts our research on the molecular mechanism of these physiological processes for exploring new management approaches for this species. To date, only 157 nucleotide sequences including 51 expressed sequences (EST) have been deposited in GenBank for this beetle. In addition, only few genetic markers have been reported for M. aurolineatus (Ma et al., 2014;Mukhopadhyay et al., 2016).
De novo transcriptome assemblies constitute a valid technology to study reference-free genomes in non-model organisms and provided abundant sequence and gene expression information for further functional research (Gayral et al., 2013;Smith-Unna et al., 2016). With the development of sequencing technologies, such as Illumina and 454 Life Sciences, genomic information in model and non-model organisms have been broadly uncovered along with numerous genes involved in biotic and abiotic stresses, pesticide resistance, developmental pathways, diapause transitions, and hormone regulations by homology blast with related organisms (Ekblom and Galindo, 2011;Tarrant et al., 2016). In this study, Illumina Hiseq TM technique was used for de novo transcriptome assembly and analysis of M. aurolineatus, a new reported tea plant pest. Sequences were performed with a BLAST search to the known proteins of the NCBI database and the CDS fragments were predicted. Gene Ontology (GO), euKaryotic Ortholog Groups (KOG), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) were annotated. Furthermore, we used these data to generate simple sequence repeat (SSR), and single-nucleotide polymorphism (SNP) markers, which will be potential resources for trait mapping. We believe that the transcriptome assembly of M. aurolineatus obtained in our study represents crucial resources for research into molecular pathway, gene function, metabolic regulation associated with pesticide resistance, and biocontrol areas.

M. aurolineatus Collection
In August, the mature M. aurolineatus were captured from the top leaves of the 20-year tea plant in Meitan District, Guizhou, China. Three beetles were kept in liquid nitrogen as a group and subsequently used for transcriptome sequencing.

Nucleic Acid Isolation
Total RNA was extracted using the Trizol kit (Sangon, China) according to the manufacturer's protocol, and treated with RNase-free DNase I to remove genomic DNA contamination. RNA integrity was evaluated with a 1.0% agarose gel. Thereafter, RNA quality and quantity were assessed using a Nanodrop (Thermo, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). The high-quality RNA samples were subsequently submitted for library preparation and sequencing.

Library Preparation and Sequencing
A total of 2 µg of RNA sample was used as input material for the RNA sample preparations. Sequencing libraries were generated using VAHTSTM mRNA-seq V2 Library Prep Kit for Illumina R following the manufacturer's protocols. Firststrand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase. Second-strand cDNA synthesis was subsequently performed using DNA polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. To select cDNA fragments of preferentially 150-200 bp in length, the library fragments were purified with AMPure XP system (Beckman Coulter, Beverly, USA). Then, 3 µl of USER Enzyme (NEB, USA) was utilized with size-selected, adaptor-ligated cDNA at 37 • C for 15 min followed by 5 min at 95 • C in PCR cycler. Finally, PCR products were purified (AMPure XP system) and the library quality was assessed on the Agilent Bioanalyzer 2100 system. The libraries were then quantified and pooled. Paired-end sequencing of the library was performed on the HiSeq XTen sequencers (Illumina, USA).

Data Assessment and Quality Control
FastQC (version 0.11.2) was used for evaluating the quality of sequenced data. Raw reads were filtered by Trimmomatic (version 0.36) according to five steps: (1) removing adaptor sequence; (2) removing low quality bases from reads 3 ′ to 5 ′ (Q < 20); (3) removing low-quality bases from reads 5 ′ to 3 ′ (Q < 20); (4) using a sliding window method to remove the base value <20 of reads tail (window size is 5 bp); (5) removing reads with reads length <35 nt and its pairing reads. The remaining clean data were used for further analysis.

Transcriptome Assembly and Gene Annotation
The remaining clean reads were de novo assembled into transcripts using Trinity short read software (version 2.0.6) (parameter: min_kmer_cov 2). Transcripts with a minimum length of 200 bp were clustered to minimize redundancy. For each cluster (representing the transcriptional complexity for the same gene), the longest sequence was preserved and designated as a unigene. Unigenes were blasted against the NCBI nonredundant protein database, SwissProt, TrEMBL, Conserved Domain Database (CDD), Protein family (Pfam), and KOG databases (E-value < 1e −5 ). According to the priority order of the best aligned results, the Unigene ORF (open reading frame) was determined using the ORF-predictor server. At the same time, TransDecoder (version 3.0.1) was used to predict the CDS sequences of the un-aligned Unigenes. GO functional annotation information was obtained according to transcript annotation results of SwissProt and TrEMBL. KEGG Automatic Annotation Server (version 2.1) was used for the KEGG annotation.

M. aurolineatus Identification
M. aurolineatus were captured in the tea plant in Meitan District, Guizhou, China, and the morphological character was described as follows: body, gray black; head with two longitudinal lines yellow pale, with metallic shimmer in dorsal view; antennae geniculate, with scapus straight and thin, expanded in apical three flagella; pronotum with four longitudinal lines yellow pale; elytra with several yellow pale long patches, with black traverse patches in the center in dorsal view (Figure 1).

Illumina Sequencing and De novo Assembly
We    were annotated in at least one database ( Table 3). These results were similar to other Arthropoda species, such as 10,218 unigenes were identified in Dendroctonus ponderosae (Regniere and Bentz, 2007), 4,251 in Tribolium castaneum (Li et al., 2008), 1,207 in Papilio xuthus , and 353 in Acyrthosiphon pisum (Bishnoi and Singla, 2017). A total of 43.56% annotated unigenes were matched from D. ponderosae according to the proportion on the distribution of best-matched species, followed by 18.12, 5.15, 1.51, and 1.27% sequences that matched sequences from T. castaneum, P. xuthus, A. pisum, and Oryctes borbonicus, respectively. The remaining 30.39% of sequences were matched from other species (Figure 3).

GO, KOG, and KEGG Classifications of Unigenes
Cluster groups of the unigenes were calculated against GO, KOG, and the KEGG databases using BLASTx search with a cutoff E value of 10 −5 . A total of 29.08% unigenes (20,190) were annotated in these three databases ( Table 3).
As an international standardized gene functional classification system, GO determines the functional description of genes and their products (Tarrant et al., 2016). Three independent classifications, namely, "biological process, " "cellular component, " and "molecular function, " are included in this process. As shown in Figure 4 and Supplementary Table 1, 20,190 (29.08%) unigenes were assigned to the main GO classifications, including 71 sub-classifications. Within the FIGURE 3 | Species distribution of top BLASTx results. The pie chart shows the species distribution of unigenes top BLASTx results against the NR protein database with a cutoff E value <1e −3 . molecular function group, binding (GO:0005488) and catalytic activity (GO:0003824) processes were the main proportions with 12,042 and 9,471 unigenes, respectively. In the biological process, cellular (GO:0009987) and metabolic processes (GO:0008152) were the largest two groups including 13,156 and 10,939 unigenes, respectively. Cell (GO:0005623, 13,874 unigenes) and cell part (GO:0044464, 13,837 unigenes) represented the majority of processes in the cellular component category.
To further identify the predicted proteins' biological pathways, the KEGG pathway database was employed to annotate the unigenes by analyzing the gene functions systematically and integrating the chemical and genomic information (Du et al., 2016;Kanehisa et al., 2016). In our results, 3,170 unigenes were assigned and annotated to 279 different biological pathways (Supplementary Table 3). "Signal transduction" (667 unigenes) was the most enriched pathway, followed by "Translation" (415 unigenes), "Transport and catabolism" (364 unigenes), and "Folding sorting and degradation" (329 unigenes) (Figure 6).

CDS Prediction
BLAST alignment was used to search the coding sequences (CDS) in NR, SwissProt, and TrEMBL databases, and the remaining undefined sequences were assigned by using TransDecoder software. A total of 39,876 CDS sequences were predicted in our data (Supplementary Tables 4, 5). After transcript combination, the gene coverage was searched by BEDTools software. From our data, 63,340 genes can be detected in the 90-100% gene distribution rate, followed by 3,314 genes in the 80-90% gene distribution rate, 1,538 genes in the 70-80% gene distribution rate, 867 genes in the 60-70% gene distribution rate, and only few genes in the gene distribution rate under 50% (Supplementary Figure 1).

SSR Discovery
SSRs or microsatellite sequences, represented by 2-6 base pairs in length of core sequences, are polymorphic loci present in genomic DNA (Powell et al., 1996). To date, SSRs have been used to genotype numerous species for functional analysis (Kantety et al., 2002). Our results obtained 13,848 SSRs, including 9.40% dinucleotide repeats, 10.32% trinucleotide repeats, and 1.05% tetra/penta/hexanucleotide repeats. A total of 1,355 SSRs (9.78%) with more than 15 base pairs in length and 1,525 SSRs (11.01%) with <15 base pairs in length were found (Figure 7). Among the dinucleotide repeat motifs, (AT/AT)n (855 SSRs) and (AG/CT)n (271 SSRs) were the most common types and significantly outnumbered the other types of dinucleotide repeat motifs. Among the trinucleotide repeat motifs, (AAT/ATT)n (547 SSRs) and (ACC/GGT)n (197 SSRs) were the most common types (Figure 7). A total of 6,512 unique sequences with microsatellites flanking sequences on both sides of the microsatellites were obtained after removing the microsatellites that lacked sufficient flanking sequences for primer design, which allows for the design of primers for genotyping.

SNP Identification
The type of variation in the genome was commonly analyzed by using SNPs (Williams et al., 2010;Gimode et al., 2016). SNPs were generated by aligning sequences used for contig assembly. We obtained 96,790 SNPs after removing those that had a base mutation frequency lower than 1% (Supplementary Table 6). The proportions of transition substitutions of A->G, T->C, G->A, and C->T were 16.95% (16,408), 16.50% (15,970), 15.94% (15,430), and 15.82% (15,313), respectively (Figure 8). A total transition:transversion ratio was 1.87:1. A small proportion of transversion-type SNPs and a large proportion of transition-type SNPs exist because of the differences in the numbers of hydrogen bonds and base structures between different bases.

CONCLUSION
In the present study, we identified a new pest in tea plants in Guizhou, China and first reported the transcriptome of M. aurolineatus Voss by using de novo assembly technology with next-generation sequencing. We identified 69,439 unigenes, which constitute an important database for future studies on M. aurolineatus gene function analysis. Furthermore, we annotated these unigenes in the GO, KOG, and KEGG databases. In addition, to explore the molecular biology research of M. aurolineatus, we predicted many SSRs and SNPs that will provide a basal foundation for genetic analysis.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: NCBI BioProject ID: PRJNA694293.

AUTHOR CONTRIBUTIONS
XL: conceptualization. XX: methodology. MC: software. XX: validation. JJ and MH: formal analysis. JJ: investigation. LJ: resources. XX and JJ: writing-original draft preparation. XL: writing-review and editing and project administration. XL and XX: funding acquisition. All authors contributed to the article and approved the submitted version.