Draft Genome of a Blister Beetle Mylabris aulica

Mylabris aulica is a widely distributed blister beetle of the Meloidae family. It has the ability to synthesize a potent defensive secretion that includes cantharidin, a toxic compound used to treat many major illnesses. However, owing to the lack of genetic studies on cantharidin biosynthesis in M. aulica, the commercial use of this species is less extensive than that of other blister beetle species in China. This study reports a draft assembly and possible genes and pathways related to cantharidin biosynthesis for the M. aulica blister beetle using nanopore sequencing data. The draft genome assembly size was 288.5 Mb with a 467.8 Kb N50, and a repeat content of 50.62%. An integrated gene finding pipeline performed for assembly obtained 16,500 protein coding genes. Benchmarking universal single-copy orthologs assessment showed that this gene set included 94.4% complete Insecta universal single-copy orthologs. Over 99% of these genes were assigned functional annotations in the gene ontology, Kyoto Encyclopedia of Genes and Genomes, or Genbank non-redundant databases. Comparative genomic analysis showed that the completeness and continuity of our assembly was better than those of Hycleus cichorii and Hycleus phaleratus blister beetle genomes. The analysis of homologous orthologous genes and inference from evolutionary history imply that the Mylabris and Hycleus genera are genetically close, have a similar genetic background, and have differentiated within one million years. This M. aulica genome assembly provides a valuable resource for future blister beetle studies and will contribute to cantharidin biosynthesis.

INTRODUCTION Cantharidin (C 10 H 12 O 4 , CTD) is a monoterpene chemical compound produced by some species of the Meloidae family from Coleoptera, commonly known as blister beetles or Spanish flies (Crowson, 1970;Carrel et al., 1993;Karras et al., 1996;Tagwireyi et al., 2000;Jiang et al., 2019). CTD is antiinflammatory, antiviral, and increases immune-regulating activities (Moed et al., 2001); therefore, it has been widely used to treat a variety of diseases including skin-related diseases (furuncles and piles) (Su et al., 2016), tuberculous scrofuloderma (Silverberg et al., 2000;Moed et al., 2001;Richard et al., 2014) and erectile dysfunction (Wang, 1989;Liu and Chen, 2009). New research has also reported that CTD and its derivatives inhibit the proliferation of many types of cancers including pancreatic , liver (Zhang et al., 2014), lung (Luan et al., 2010), and breast cancer (Dorn et al., 2009;Li et al., 2017). CTD has grown in popularity as an alternative to anticancer drugs, and increasing attention is being paid to methods for acquiring this chemical owing to its promising prospects as an anti-tumor agent (Kadioglu et al., 2014).
As a natural source for obtaining CTD, the CTD biosynthesis mechanism in certain blister beetles has been the focus of multiple studies (Agranoff et al., 1960;McCormick et al., 1986;Lu et al., 2016). The partial genomes of the most commercially important species in Chinese traditional medicine, Hycleus cichorii and Hycleus phaleratus, have been determined to provide the genetic background of blister beetles . Moreover, according to the transcriptome analysis of Mylabris cichorii, another blister beetle also widely used in clinical treatment, the mevalonate (MVA) pathway and juvenile hormone biosynthesis may contribute to CTD biosynthesis (Huang et al., 2016). Although these findings provide a references for the experimental study of biosynthetic pathway mechanisms in different meloid beetles (Yin and Jin, 2010;Lu et al., 2016;Huang et al., 2016), the whole genome data required to study the genetic basis of the entire process remains unavailable (Wu et al., 2018). In addition, although many genes involved in CTD biosynthesis have been identified, new genes may still be identified using a newly obtained genome (Yin and Jin, 2010;Huang et al., 2016;. Therefore, utilizing multiple genomes from blister beetle species able to produce CTD will enable comparative genomic research to provide better understanding of the entire production mechanism. A whole gene set is helpful to accelerate research and provide useful background information on the systematic and evolutionary processes of CTD producing species. Additional genomes from the Meloidae family will enable researchers to trace the development of CTD biosynthesis. Generation of a genome from the Mylabris genera will provide a reference for other related species and value for further physiological and evolutionary research experiments. Mylabris aulica is a commonly distributed blister beetle with a large population in inner-Mongolia, China, and the same ability to produce CTD as other blister beetles; therefore, this species has great research potential as a prospect to obtain natural CTD. This species has been utilized for medical purposes, but not for commercial trade . Here, we report the first draft genome assembly of M. aulica (NCBI: txid1914941) using long reads. This study obtained and assembled the M. aulica genome sequence, then annotated the identified genes to further explore the characteristics of the M. aulica genome while paying particular attention to CTD biosynthesis. By combining time-tree construction and orthologous analyses with related species, we tentatively explored the time frame when CTD biosynthesis likely appeared in these species. To learn more about the genes related to CTD biosynthesis, we analyzed the similarity and differences of genes in M. aulica and related species. Knowledge of the genetic background of this and similar species will not only contribute to the study and usage of blister beetles, but also aid in supplying the increasing demand for naturally derived CTD, thereby lowering prices for this medicinal material.

Sample Collection and Sequencing
Twenty-one adult M. aulica (NCBI txid1914941) beetles were collected from Hailaer, Inner-Mongolia Province, China in August 2018. Genomic DNA was extracted from each individual male beetle using DNAeasy Tissue Kits (Qiagen, Halden, Germany). With the retrieved genomic DNA, two DNA libraries of different insert sizes were constructed and the Illumina X-ten (Illumina HiSeq X-Ten, San Diego, California) and Nanopore promethION (Oxford Nanopore Technologies) (Wouter et al., 2019) platforms were used for DNA sequencing. The short-read Illumina sequencing library was obtained by performing g-TUBE fragmentation, repair, adaptor connection, digestion with exonuclease, and recycling approximately 350 bp sequences using approximately 1.5 mg DNA according to standard sequencing kit protocol (NEBNext Ultra DNA Library Prep Kit for Illumina). The long-read Nanopore sequencing library was constructed using 5 mg DNA and the SQK-LSK109 sequencing preparation kit (Ligation Sequencing Kit). The retrieved library had a mean DNA fragment length of approximately 20 kb.

Genome Assembly, Polishing, and Completeness Assessment
After sequencing, a strict quality control on the raw Illumina and Nanopore sequencing data was performed using Trimmomatic v0.39 (Bolger et al., 2014) and Nanofilt v2.3.0 (De Coster et al., 2018), respectively. Reads with low quality (Q30 < 90%) or those that contained more than 5% unknown bases were removed. Environmental microbe contamination was removed by deleting sequences that provided hits in the GenBank env_nt database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). Before assembly, a k-mer based analysis was performed to estimate genome size using GCE (genome characteristics estimation) Manekar and Sathe, 2018) using all the short-read DNA sequences. The estimated size of the genome guided further assembly by helping with software parameter adjustments. A pipeline integrating CANU (Koren et al., 2017) and MECAT (Xiao et al., 2017) was then used to conduct the assembly using genome sequencing data with default parameters. CANU was used to generate more accurate self-corrected reads with a corrected error rate equal to 0.050. MECAT was used to generate contigs. To improve assembly accuracy, the generated Nanopore sequenced data assembly was polished using Pilon (Walker et al., 2014) with next generation data. The Nanopore sequenced data was mapped back to the assembly with Minimap2 (Li, 2018) to check the correctness. Whole genome completeness was assessed using BUSCO (benchmarking universal single-copy orthologs) v3 (Simao et al., 2015).

Repetitive Elements Determination
Combined specific repeated sequence database and repetitive element identification was performed using the Repeatmasker (Chen, 2004) and Repeatmodeler (Castoe et al., 2011) pipelines. First, a novel library for repetitive elements was constructed using Repeatmodeler. Then, this novel library was merged into a database with all the known repetitive elements from the Insecta phylum. Then, using this database, a thorough search for repetitive elements was conducted in Repeatmasker. As a supplemental search, other types of repetitive sequences were also identified using LTR FINDER (Xu and Wang, 2007), MISA (http://pgrc.ipk-gatersleben.de/ misa/) (Beier et al., 2017), and RepeatScout (Price et al., 2005) Finally, the results generated from the previous search were merged with these supplemental repetitive sequences and masked from the genome. The masked repeat sequences were further classified according to their different types ( Table S1).

Gene Finding and Annotation
A high-quality gene set was obtained using the combined methods of ab initio and homology-based predictions. The RNA-seq (Huang et al., 2016) data of M. cichorii was used to search for the best gene model in PASA (Haas et al., 2003). Then, this model was used in Augustus (Burge and Karlin, 1997) and SNAP (http://korflab.ucdavis.edu/) (Goodswen et al., 2012) for ab initio prediction. For the homology-based method, we downloaded the gene sets of H. cichorii (Wu et al., 2018), H. phaleratus, Aethina tumida (Evans et al., 2018), Anoplophora glabripennis (Mckenna et al., 2016), Tribolium castaneum (Richards et al., 2008), Dendroctonus ponderosae (Keeling et al., 2013), Leptinotarsa decemlineata (Schoville et al., 2018), Diabrotica virgifera virgifera (Coates et al., 2012), and Bombyx mori  from the GenBank database. These homologous protein sequences were concatenated and imported into GeneWise (Trapnell et al., 2012) to search for genes. The pseudogenes were filtered by determining whether they could be correctly translated and had mature stop codons. In addition, a concatenated ab initio and homology-based gene prediction pipeline identified genes using MAKER (Campbell et al., 2014) (http://www.yandell-lab.org/software/maker.html) software. Finally, all gene prediction results were merged to yield a non-redundant reference gene set using Evidence Modeler (Haas et al., 2008) (Table S2). The protein sequences were extracted and subsequently put into functional annotations (Data Sheet 1).

Functional Annotation of Protein-Coding Genes
Functional annotation was performed by NCBI BLAST (ftp://ftp. ncbi.nlm.nih.gov/blast/db/) (Lobo, 2012) to query the predicted gene sequences to functional databases such as NR (Li et al., 2013), KEGG (Kyoto Encyclopedia of Genes and Genomes (Keller et al., 2011), andSwiss-prot (UniProt Consortium, 2010). The Pfam database was used to annotate proteins using HMMER (Birney et al., 2004). Gene ontology (Walker et al., 2014) terms were obtained using Blast2GO (Pertea et al., 2015). Metabolic pathways of these genes were assigned using the KAAS (Moriya et al., 2007) tool provided by the KEGG database. All of these annotations can be found in Table S3.

Orthologous Analysis and Time-Tree Reconstruction
Gene families were identified using OrthoFinder software (Rice et al., 2000). The protein sequences of H. cichorii, H. phaleratus, A. tumida, A. glabripennis, T. castaneum, D. ponderosae, L. decemlineata, D. virgifera virgifera, and B. mori were used for all-2-all comparisons to search for orthologous gene families ( Table S4). The expansion and constriction of the orthologous gene families were determined and counted (Table S5). Then, the coding sequences of the single-copy genes in the 10 species were identified, extracted, and aligned using the MAFFT program (Katoh, 2002) to construct one super-matrix. The BEAST program (Yang and Rannala, 2006)

Genome Assembly and Completeness
The M. aulica genome assembly was 288.5 Mb in size with a scaffold N50 length of 467.8 kb and an L50 value of 44. The repeat sequence in the assembly was 153.1 Mb and occupied 50.62% of the total length. All of these parameters were greatly improved compared to those reported for the H. cichorii and H. phaleratus genomes (N50 values of 79.3 kb and 56.1 kb and identified repetitive content of 22.73% and 13.47%, respectively) ( Table 1). It is obvious that the Nanopore long-read sequencing technology used in this study has exponentially improved assembly connectivity, proving that it is a better strategy for obtaining highly qualified genomes.
Evaluation of assembly completeness and correctness showed that the M. aulica assembly spanned 91% of the estimated genome size (318.4 Mb), which proves that it is highly complete with few missing sequences. Assembly error rate was estimated by mapping the genome sequencing data back to the assembly. The results showed a re-mapping coverage of 101.4× (30.41 Gb of data), and an extremely low error rate (0.09%) that  (Figure 1) suggests that not only the connectivity, but the completion for the obtained M. aulica genome assembly is superior. The assembly yielded by Nanopore sequencing shows that this approach is advantageous in all respects.

Genetic Background of M. aulica
Using the obtained high-quality assembly, we explored the basic genetic background of M. aulica. First, we identified the gene set embedded in this assembly and retrieved 16,500 genes. The average length of these genes was 5,940.82 bp and each of them contained about 6.62 exons. The average length of coding domains and exons were 1,958.63 bp and 295.72 bp, respectively. All these parameters were comparable to those of the H. cichorii and H. phaleratus assemblies. More genes were identified in the M. aulica genome assembly, which confirms our previous finding that our assembly has better continuity and completion than other available assemblies and could serve as the most complete genetic research material for blister beetles. Gene function retrieval from multiple databases showed that almost all genes (16,444 of 16,500, 99.7%) could be annotated. Among them, 15,579 genes could be assigned to at least one GO term, and 6269 could be assigned to different KEGG pathways. Additionally,15,153,13,445,and 16,131 of the genes could be annotated using Pfam (El-Gebali et al., 2018), Swiss-prot (Boutet et al., 2016), and the Genbank NR databases, respectively. The functional traits of all genes were further explored according to their annotations. The summarized GO terms prevalent in most insects and responsible most basic life activities were present and essential in M. aulica (Figure 2A). There were no specialized functional basis characters in M. aulica. Further, as we classified the source species of the NR annotations, we found that 79.85% of the genes resembled sequences in T. castaneum and A. glabripennis genomes, which are not blister beetles ( Figure  2B). This result indicates that the functional basis for M. aulica is not largely deviated from that of common beetles, and that the genes contributing to specialized CTD synthesis only comprise a small portion of the whole genome.

Comparative Genomic Analysis Based on Ortholog Gene Families
Identification and analysis of ortholog gene families is the foundation of comparative genomics that enables a better understanding of the evolution of traits among different species. Overall, we identified 12,726 ortholog gene families in M. aulica and nine related species. Among them, 6,204 members including 48.75% of total gene families were commonly shared in M. aulica, H. cichorii, H. phaleratus, T. castaneum, and A. glabripennis (Figure 3). The results showed that these species have broad genetic connections and are suitable for phylogenetic analysis and divergence time estimation.
Thus, using the 435 identified single copy ortholog genes, a creditable phylogenetic tree was created (Figure 4). Based on this tree, we found that M. aulica, H. cichorii, and H. phaleratus formed a single clade with extremely close phylogenetic positions and a short divergence time. A common ancestor for these three blister beetles was suggested. Moreover, the speciation for these species were estimated to have occurred within one million years, which suggested their hypothetical common ancestor was not much older than the establishment of their species. The metabolic and biological characters of M. aulica should be similar to those of H. cichorii and H. phaleratus, which were inherited from their common ancestor, and the CTD synthesis mechanism could be a part of that inheritance.
To determine whether this CTD synthesis mechanism was solely the result of inheritance and not affected by adaption or the unique physiology formation of M. aulica, we identified the significantly expanded ortholog gene families within this species. The ultra-metric time tree showed that 377 of the 1,783 expanded gene families were significant. The GO and KEGG enrichment analyses of these gene families showed that none of the known genes related to CTD biosynthesis were included, and the function of the most significantly expanded genes were related to metabolic functions necessary for survival ( Figures  5A, B). The CTD synthesis mechanism should have been a mature system before the species differentiation of blister beetles, which further leads to the reasonable speculation that this mechanism might be conserved in all blister beetles, except for those that have lost the ability to synthesize CTD.

Genes Related to CTD Biosynthesis in M. aulica
Identifying genes related to CTD biosynthesis in M. aulica is the major aim of our study and the primary value of our retrieved FIGURE 1 | Summarized benchmarks in the benchmarking universal singlecopy orthologs (BUSCO) assessment for M. aulica, H. cichorii, and H. phaleratus genomes. These estimations used 42 species from Insecta as the database and 1,658 BUSCOs were searched. The orange, blue, and green sections of each bar represent complete percentage, fragmented percentage, and missing percentage, respectively. genomic data. The results provide direct evidence for the implication that the M. aulica has the potential to replace or fill the current medical market for blister beetles. To enable this potential, identification of the CTD biosynthesis genes in the M. aulica genome is required. Based on previous research, we tried to locate genes belonging to the "terpenoid backbone biosynthesis" pathway, as it is considered responsible for CTD synthesis in blister beetles (Huang et al., 2016) (Figure 6). In total, 30 genes in M. aulica matching this criterion were screened out. These genes are involved in both the MVA and MEP/DOXP pathways. These two paths synergistically participate with each other and form a complete metabolic chain in accordance with the current understanding of CTD biosynthesis in blister beetles. We then determined whether any of these 30 screened genes are unidentified genes. The results showed that all genes resembled reported sequences in the NR database, which suggested they have already been revealed by previous molecular analysis. However, two sequences (BMGene00496 and BMGene01890, Figure 7) were tagged as uncharacterized proteins, indicating that their biological functions remain unknown. Therefore, we performed a function search and provide their protein information for the first time. The results showed that they are    enzymes that contribute to the synthesis of acetoacetyl-CoA and farnesal, respectively. The Pfam domains embedded in these two sequences show that and they might work as thiolase and short chain dehydrogenase, respectively.

CONCLUSION
The ability to synthesize multiple bioactive substances, especially CTD, is an interesting and attractive biological mechanism in blister beetles; genomics research is highly significant in the study of CTD biosynthesis. However, there are few genomic studies on CTD, and complete genomic blister beetle research is even scarcer. This study provides the first M. aulica genome with high continuity and integrity and performs preliminary comparative genomic analyses based on this genome. Compared to previous genomic data (Huang et al., 2016), this study generated a highlyqualified genome and identified novel genes that will provide both a theoretical basis and technical support for further analyses on CTD biosynthesis or natural CTD resource mining. The continuity and completion for the M. aulica assembly is superior to other blister beetle genome assembly results. The findings provided by our study also provide a new understanding of the genetic background and evolution of blister beetles. Our findings imply that all blister beetles share a common ancestor, and the biological synthesis mechanism and evolution of CTD in blister beetles may just be the result of inheritance with minimal adaptive functional modification afterward. M. aulica has the same metabolism for CTD biosynthesis as the currently utilized species. This proves that M. aulica has the potential to replace the  Frontiers in Genetics | www.frontiersin.org January 2020 | Volume 10 | Article 1281 current medical use of H. cichorii and H. phaleratus. Moreover, we found 30 gene families in the "terpenoid backbone biosynthesis" pathway that contribute to CTD biosynthesis and provided functional traits for two previously uncharacterized genes that facilitate existing CTD synthesis research. These findings will pave the way for the comprehensive future study of CTD and associated biosynthesis pathways in blister beetles.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the NCBI BioProject accession number PRJNA558450.

AUTHOR CONTRIBUTIONS
D-LG, and X-QH conceived the study and designed the experiments, they should be considered as co-first author. D-LG, JP, and DM performed the experiments. X-QH, D-LG and YL analyzed the data. X-QH and HH wrote the manuscript. S-QX and J-YX revised the manuscript. All authors read and approved the final manuscript.

FUNDING
This work was financially supported by the Excellent Doctor Innovation Project of Shaanxi Normal University (S2015YB03), the Fundamental Research Funds for the Central Universities (2018CSLZ004), and the Fundamental Research Funds for the Central Universities (GK201604008; GK201702010; GK201701006). This work was supported in part by the National Natural Science Foundation of China (No. 31372250).

ACKNOWLEDGMENTS
We thank NextOmics Biosciences (Wuhan, China) for providing technical support.