Discovery of 194 Unreported Conopeptides and Identification of a New Protein Disulfide Isomerase in Conus caracteristicus Using Integrated Transcriptomic and Proteomic Analysis

Current ConoServer database accumulates 8,134 conopeptides from 122 species of cone snail, which are pharmaceutically attractive marine resource. However, many more conopeptides remain to be discovered, and the enzymes involved in their synthesis and processing are unclear. In this report, firstly we screened and analyzed the differentially expressed genes (DEGs) between venom duct (VD) and venom bulb (VB) of C. caracteristicus, and obtained 3,289 transcripts using a comprehensive assembly strategy. Then using de novo deep transcriptome sequencing and analysis under a strict merit, we discovered 194 previously unreported conopeptide precursors in Conus caracteristicus. Meanwhile, 2 predicted conopeptides from Consort were verified using liquid chromatography-mass spectrometry/mass spectrometry (LC-MS/MS). Furthermore, we demonstrated that both VD and VB of C. caracteristicus secreted hundreds of different conotoxins, which showed a high diversity among individuals of the species. Finally, we identified a protein disulfide isomerase (PDI) gene, which, functioning for intramolecular disulfide-bond folding, was shared among C. caracteristicus, C. textile, and C. bartschi and was the first PDI identified with five thioredoxin domains. Our results provide novel insights and fuel further studies of the molecular evolution and function of the novel conotoxins.

A variety of interpretations on the genetic diversity of conotoxins were proposed, such as gene duplication, gene loss, code shift mutations, early termination, and recombination (Espiritu et al., 2001;Stanley, 2008;Puillandre et al., 2010Puillandre et al., , 2014Chang and Duda, 2012). Conotoxin folding occurs mainly in the endoplasmic reticulum, where hundreds of different cysteinerich polypeptides are folded effectively (Tayo et al., 2010;Lu et al., 2014;Zhang et al., 2019). How these domains are correctly oxidized and folded in the poison gland tissues remains unclear, but it is known that molecular chaperones in the endoplasmic reticulum are involved in toxin folding (Gao et al., 2017). In vitro folding of these toxins usually results in low folding yield, misfolding, or accumulation of polymer products . Cone snails can secrete a variety of different toxins in a short period of time (Duda and Palumbi, 2004;Puillandre et al., 2014). To date, more than 15 different posttranslational-modification enzymes have been implicated in the post-translational modification of conotoxins (Buczek et al., 2005). Several previous studies have examined the molecular relationship between PDIs (protein disulfide isomerase) and conotoxin diversity (Hatahet and Ruddock, 2009;Safavi-Hemami et al., 2010;Wang et al., 2017).
To the best of our knowledge, an in-depth comparison of conopeptides among individual Conus caracteristicus, a vermivorous species, has yet to be performed. Furthermore, it is interesting that previous studies have identified hundreds of unique toxins or conopeptides from a single Conus specimen using genomic or proteomic methods (Biass et al., 2009;Davis et al., 2009;Kaas et al., 2010;Lavergne et al., 2015;Peng et al., 2021). Besides the diversity on gene or protein sequence level, the high toxin diversity is attributed to the oxidative folding of proteases in the venom (Buczek et al., 2005;Bulaj and Olivera, 2008). It is assumed that many novel natural conopeptides and post-translational modification enzymes involved in oxidative conotoxin folding remain to be discovered (Peng et al., 2016).
Here, we report a set of novel conopeptides isolated from the venom duct (VD) and venom bulb (VB) of vermivorous C. caracteristicus. We also present conopeptide diversity among individuals and verify these superfamilies using liquid chromatography-mass spectrometry/mass spectrometry (LC-MS/MS). During this process, we have accidentally identified a new protein disulfide isomerase (PDI) gene, which is highly homologous to sequences in C. textile and C. bartschi. Our results increase the available data concerning marine conopeptides and may provide insight into their potential applications.
Using a comprehensive assembly strategy, we analyzed the differentially expressed genes between the VB group and the VD group; triplicates (three individuals) were set in each group. The correlation coefficient of the transcriptomes within each group (VB or VD) was greater than 0.9 ( Figure 1A). We identified 3,289 transcripts that were at least 2-fold differentially expressed between VD and VB (P < 0.01); moreover, 1,207 transcripts were upregulated in the VD group, and 2,082 transcripts were upregulated in the VB group ( Figure 1B). Venn diagrams showed that 10,058 genes from the assembled transcripts were commonly annotated in Nt (non-redundant nucleotide sequences), Swiss-Prot (blastx), Swiss-Prot (blastp), and Nr (non-redundant proteins) (Supplementary Figure 1B).
The Gene Ontology (GO) Consortium terms most enriched in the VB group were cytoskeletal protein binding, contractile fiber, and supramolecular polymer, all of which are closely related to the cytoskeleton (Supplementary Figure 2). In contrast, signaling receptor ligand precursor processing, calcium ion binding, and serine hydrolase activity were the main terms enriched in the VD group (Supplementary Figure 3). These GO terms are closely related to the cleavage of a peptide bond in a precursor.
In order to compare 3,289 differentially expressed transcripts with the genome annotation data of C. betulinus, we downloaded the EST data (ID: PRJNA290540) associated with the article (Peng et al., 2021) and compared it through BLASTN. The threshold value is set to 1e-5. We found that 163 reported EST sequences were detected in our VD group, and 409 reported EST sequences were detected in our VP group. These corresponding sequences deserve more followup attention. To further categorize possible functions, we searched the 3,289 differentially expressed transcripts against the EuKaryotic Orthologous Groups (KOG) database. We successfully annotated 328 gene sequences from the VD and 85 gene sequences from the VB with KOG functions in 25 functional categories. The most significantly enriched category in the VB was cytoskeleton (93 sequences), followed by signal transduction mechanisms (67 sequences) and general function prediction only (48 sequences). In contrast, the most significantly enriched category in the VD was post-translational modification, followed by protein turnover, chaperones (16 sequences), general function prediction only (12 sequences), intracellular trafficking, secretion, and vesicular transport (11 sequences) ( Figure 1C).
Compared with the VB cells, there were complex regulatory networks related to peptide synthesis and processing in the VD  cells. Thus, although both organs produce venom, they have subtly different functions in defense.
to illustrate the relationships among the annotated conopeptide precursor sequences (Figure 2A). Across the three individuals, 39 peptides were shared in the VD and 51 were shared in the VB ( Figure 2B). This clearly indicates that conopeptides are not only diverse within a single tissue of one individual but are also heterogeneous among individuals.
The ConoServer database (See Text Footnote 1) indicates that  Figure 2C).
Most of the conopeptides obtained herein were reported for the first time from C. caracteristicus (Supplementary Table 2). In total, 2,194 candidate toxin genes were identified in Frontiers in Marine Science | www.frontiersin.org C. caracteristicus, more than 15 times the number of toxin genes identified to date. In particular, Lt9a (Pi et al., 2006) in the P gene superfamily was identified for the first time in C. caracteristicus. After removing duplicate sequences and reported sequences, we collectively identified 1,330 candidate conopeptide precursors. With the 1,330 sequences, the software ConoPrec in Conosever (See Text Footnote 1) was used to retain the sequences with signal peptide, pre-peptide and mature peptide. As a result, we finally obtained 249 relatively reliable gene sequences of natural conotoxin, and the remaining 1,081 sequences were considered as of candidate conotoxin. Of the 249 sequences, 55 sequences were recorded as conopeptide precursors and 194 were non-recorded conopeptide precursors (Supplementary Table 3).
The total ion current traces from Ca-2-VD and Ca-3-VB are shown in Figures 3A,B. No target conopeptide was detected in other mass spectra, and no conopeptide fragment was identified in the mass spectrometry data of Ca-1-VD, Ca-1-VB, Ca-2-VB, and Ca-3-VD. The identification of the peptide sequences using MS confirmed the expression of the associated conopeptide genes (Figures 3C,D). According to the conservation of signal peptides, the Ca2-VD-c49774_g1_5_3 were not a member of any known gene superfamilies by evolutionary analysis and may thus belong to as-yet undescribed gene superfamilies (Figure 4). The Ca3-VB-c93596_g1_5_1 (Ca6.14) was classified into the O1 gene superfamily. In addition, we also found another 14 O1 superfamily member by sequence alignment with 249 sequences obtained above.
Because there was no enrichment of target conopeptides when the samples were processed, eventually only two conopeptides were identified using LC-MS/MS. The abundances of the observed conopeptides were relatively high, suggesting that these were the primary conotoxins used by C. caracteristicus. Due to the low ratio of conopeptides to total proteins, much fewer protein sequences were identified by MS than gene sequences by RNAseq.

Identification of csPDIA5, a New Protein Disulfide-Isomerase
Protein disulfide isomerase is one of the most critical posttranslational modification enzymes affecting conopeptide secondary structure, maturation, and functional activity (Buczek et al., 2005;Bulaj and Olivera, 2008). PDIs are highly diverse both within and among Conus species (Figueroa-Montiel et al., 2016;Safavi-Hemami et al., 2016). We designed PCR primers based on PDI-A5-like, a PDI coding transcript sequence from C. caracteristicus, to isolate cDNA from various Conus species. Putative specific PDI products (length > 2,000 bp) were successfully obtained from C. caracteristicus and C. bartschi (Supplementary Figure 4). Based on homologous sequence alignment, we also extracted the highly homologous sequence of PDI from assembled transcripts of C. textile.
The full-length sequences, obtained using Sanger sequencing (Supplementary Document 1), differed slightly. The distance matrix for the multiple sequence alignments showed that sequence homology among different Conus species was greater than 95% but was less than 80% among Conus species, P. canaliculata, and A. californica (Supplementary Document  2). Based on evolutionary phylogenetic analysis of the three sequences, as well as the characteristics of their thioredoxinlike domains, we identified the novel conotoxin-specific PDI (csPDIA5) (Figures 5, 6). Importantly, these csPDIA5s shared five thioredoxin-like domains: 'CGYC, ' 'CGHC, ' 'CGHC, ' 'CGHC, ' and 'CGHC'. For example, csPDIA5 from C. textile had a total length of 2,301 bp and encoded 767 amino acids. In addition, the active site 'CGH(G)C' included the conserved sequence 'GY(F)PTL(F)K(Y)YF, ' and the 3-terminal was rich in proline (Supplementary Figure 5).

Analysis of the Secondary and Tertiary Structure of the csPDIA5 Protein
The secondary protein domains were predicted using SMART software, and the domains of csPDI (GHGH), PDI (GHGH), and csPDIA5 were compared ( Figure 7A). We found that csPDIA5 included five thioredoxin domains. However, previous studies have shown that PDI generally contains only three thioredoxin domains (Figueroa-Montiel et al., 2016;Safavi-Hemami et al., 2016). For example, the PDI sequence of another cone snail was highly similar to the human P4HB gene, with only two "CGXC" active sites (Wang et al., 2017).
Protein structure determines protein properties and functions. To further explore csPDIA5 domains, we compared the tertiary structures of csPDI to that of the csPDIA5 protein from C. textile (Figures 7B,C). We found that csPDIA5 incorporated six functional structural regions, including five 'CGH(G)C' sites, while csPDI included four domains with two "CGHC" sites.

DISCUSSION
Transcriptomics and proteomics are different approaches to gene identification, each with special features. The combination of these two approaches has contributed greatly to the discovery of novel conotoxins (Violette et al., 2012;Jin et al., 2015). Transcriptome sequencing is useful for the identification of rare transcripts, while MS protein sequencing reveals the final secreted peptides. The number of conotoxin genes in VD and VB varies from tens to thousands in previous studies, indicating the potentially high diversity of conotoxin genes and products with each individuals (Lavergne et al., 2015;Peng et al., 2016;Yao et al., 2019;Zhang et al., 2019Zhang et al., , 2021. In this study we initially found 1,330 candidate conopeptide precursors on mRNA levels. It seems that the shared number (51) in the VB samples is too big, when compared with previously reported data in other Conus species (such as only few in Peng et al., 2016). It is speculated that there are two reasons. The first is the difference in the methods for data analysis. Peng et al. used BLASTX search and HMMER analysis to predict the assumed conopeptide sequence, and then manually checked it with Conoprec, while we used ConoSort and Conoprec. The second reason is that there are conotoxin differences between different sizes of cono snails and between different species. In addition, not only de novo RNAseq assembly without additional verification may lead to some errors (Xie et al., 2014), but also machine learning method is regarded as imperfect source with over-classification, thus often gives misleading results (Safavi-Hemami et al., 2015). Using a more strict and conservative restriction for new gene calling, we identified 249 relatively reliable natural conotoxin genes and the remaining 1,081 sequences were considered as candidate genes. Among them, 194 were non-recorded conopeptide precursors.
Sequence alignment and evolutionary tree analysis showed that we identified one putative novel superfamilies. The P gene superfamily, which has been identified in other cone snails (Pi et al., 2006), was herein reported for the first time from C. caracteristicus.
One concern is that we only identified two conopeptides using LC-MS/MS. This may due to the fact that when preparing the total protein, we did not collect the crude venom by centrifugation, but directly processed the protein. Some major unrelated proteins may mask the signal of conotoxin.
Our initial design idea was to study both conotoxins and related post-translational-modification enzymes. In this report, the analysis of two complete organs (VD and VB) from three C. caracteristicus individuals revealed a high level of intraspecies variation among conopeptide precursors in this vermivorous species. Importantly, the analysis of differentially expressed genes showed that the most significantly enriched signaling pathways in the VD were peptide synthesis and processing, which reflected the function of this organ in the synthesis and processing of many conotoxins rich in disulfide bonds.
Post-translational-modification enzymes play a critical role in the maturation and oxidative folding of conopeptides, which are cysteine-rich . PDIs are the primary regulators of this type of post-translational modification (Buczek et al., 2005;Bulaj and Olivera, 2008). Transcriptome sequencing and bioinformatics analysis showed that genes in the disulfideisomerase family were upregulated compared to any other molecular chaperone genes (Zhang et al., 2019). In human PDIs, the most common active domain is ERDJ5, which includes four catalytic active sites in the thioredoxin domain (Cunnea et al., 2003;Benham, 2012). Herein, we identified the csPDIA5 family, which includes a thioredoxin domain with five catalytic active sites, in a cone snail for the first time. Cone snails are neogastropods in the genus Conus. The mesogastropod P. canaliculate lacks a highly homologous csPDIA5 gene. In addition, we did not identify any highly homologous csPDIA5 genes in any other species in the UniProt protein database or the NCBI nucleic acid database. Finally, it is possible that the csPDIA5 gene family may be unique to the neogastropod Conus genus.
The PDIs and csPDIs are ubiquitously expressed in the VD and play a major role in catalyzing the oxidation of cysteines into their native disulfides (Wang et al., 2007;O'Brien et al., 2018). csPDIs are preferentially expressed in the VD with very low expression levels in other tissues (Safavi-Hemami et al., 2016). In addition, the combinatorial effect of cone snail endoplasmic reticulum oxidoreductin-1 (Conus Ero1) and csPDI provided higher folding yields than Ero1 and PDI in vitro (O'Brien et al., 2018). The active domains of both PDI and csPDI is thioredoxin domains, and usually there are only two domains in each PDI/csPDI identified by now. Notably, we for the first time found csPDIA5 with five thioredoxin domains. We speculate that some cone snails may have evolved the csPDIA5 protein to meet the extreme needs of the oxidative folding process during conopeptide synthesis. In addition, it is worth to further investigate in the future the effects of PDI, csPDI and csPDIA5 on folding rates of conotoxins in vitro, the synergistic effect of other post-translational-modification enzymes in the oxidative folding of conotoxins.

CONCLUSION
In conclusion, we found 194 previously unreported conopeptide precursors in Conus caracteristicus, identified two conotoxins at the protein level. In addition, we demonstrated that conotoxin diversity among both VD and VB of C. caracteristicus. Finally, we identified the first protein disulfide isomerase (PDI) with five thioredoxin domains. We expect that additional conopeptides will be identified. Nevertheless, this work provides transcriptomic and proteomic data to support future investigations of Conus gene evolution and novel toxin function. Further explorations of the effects of csPDIA5 on the oxidative folding and functional activities of conotoxins may update the current in vitro approaches for the production of conotoxins in pharmaceutics and pharmacologic field.

RNA Extraction
Three adult specimens of C. caracteristicus and one specimen of C. textile were collected in the South China Sea. Four VDs and three VBs were extracted and identified as Ct, Ca-1-VD, Ca-2-VD, Ca-3-VD, Ca-1-VB, Ca-2-VB, and Ca-3-VB. VDs and VBs were separated immediately after dissection. Each sample was divided into two parts, and both were stored at −80 • C until use. One part was used for RNA extraction, while the other part was used for LC-MS/MS. Total RNA was extracted using TRIzol (Invitrogen), following the manufacturer's instructions. Agilent 2100 Bioanalyzer and Agilent RNA 6000 nano kit (Agilent Technologies, Santa Clara, CA, United States) were used for the concentration and integrity of the extracted RNA.

Transcriptome Sequencing and Bioinformatics Analysis
The transcriptomes of the C. caracteristicus individuals were paired-end sequenced on an Illumina HiSeq platform. The size of the target fragment was 125 bp. The NGS QC Toolkit (Patel and Jain, 2012) and Trimmomatic-0.60 (Bolger et al., 2014) were used for quality control. The reads were processed to ensure data quality for transcriptome assembly. The filtering criteria were as follows: (1) Remove the adaptor-polluted reads (Reads containing more than 5 adapter-polluted bases were regarded as adaptor-polluted reads and would be filtered out); (2) Remove leading low quality or N bases (below quality 3) (LEADING:3); (3) Remove trailing low quality or N bases (below quality 3) (TRAILING:3); (4) Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15); (5) Drop reads below the 100 bases long (MINLEN:100). Both reads of paired-end sequencing data were fifiltered if either read of the paired-end reads was adaptor polluted. The clean reads were checked using FastQC 2 .
Trinity version 2.3 3 was used for de novo assembly. For alignment: clean data mapped to assembled transcripts using Bowtie2 version 2.2.3 4 . The alignment rate of more than 75% are considered to meet the downstream analysis criteria. The longest sequences are defined as potential transcripts (unigenes). The open reading frames (ORFs) were predicted by TransDecoder-4.1.0 5 . Then all ORFs and unigenes were used as queries to align against sequences in the databases of Swiss-Prot 6 , EuKaryotic Orthologous Groups (KOG) 7 using the BLAST algorithm with a cut-off e-value of < 10 −5 .
R packages edgeR was used for differential gene analysis between three VDs and three VBs. The log2fold-change was set to 2, and the P value was set to 0.05. The six unigenes (Ca-1-VD, Ca-2-VD, Ca-3-VD, Ca-1-VB, Ca-2-VB, and Ca-3-VB) were used downstream to conotoxin genes.

Screening New Conotoxins Using ConoSorter
ConoSorter is a machine learning program for the largescale identification of conopeptide precursors based on their signal, pro-and mature region sequence composition (Lavergne et al., 2013). First, ConoSorter directly analyzes unique transcripts. We chose sequences 40-147 amino acids long. Second, predicted conopeptide precursors with hydrophobicity ≥65% were selected. The predicted genes obtained here are considered to be candidate conotoxin gene databases. Finally, based on the predictions of ConoPrec (Kaas et al., 2012), the sequences with signal peptides, pre-peptides and mature peptides were retained.

Protein Fractionation and Preparation for LC-MS/MS Analysis
The VD samples were stored at −80 • C. Sample preparation and analysis were performed following the requirements of the Thermo Scientific EASY-nLC 1,000 system and the Thermo LTQ Orbitrap Elite mass spectrometer, as well as the methods summarized in our previous studies (Zhang et al., 2019). The samples were alkylated at 25 • C for 45 min in the presence of 25 mM iodoacetamide in the dark and then were solubilized in lysis buffer [8 M urea, pH 8.00] containing 5 mM DTT at 60 • C for 45 min. At last, the obtained protein solutions were reconstituted in 50 mM ammonium bicarbonate with 0.5 M urea, pH 7.8, and then digested (trypsin: protein = 1:100) for 10 h at 37 • C. The pH of the samples was adjusted to 7.8. The running time for each sample was 50 min.

Proteomic Data Analysis
The total ion current traces of the VD and VB were visualized using Thermo Xcalibur 2.2 sp1.48. pFind Studio 3.1.5 8 was used for peaklist generation and sequence identification against candidate conopeptide databases (Supplementary Table 2; Chi et al., 2018). For the alkylated samples, the fixed modification was set to carbamidomethyl (+ 57.021 Da). The MS instrument was CID-ITMS (Collision-induced dissociation-Ion Trap Mass Spectrometer). Precursor tolerance was set to 20 ppm and fragment tolerance was set to 20 ppm when searching the databases. The dynamic modification was set to oxidation (+ 15.995 Da,methionine). The maximum modifications per peptide was 4. Precursor tolerance was set to 20 ppm, and fragment tolerance was set to 20 ppm. Up to three missed cleavages were allowed. The enzyme was trypsin, and the FDR (False Discovery Rate) was set to 1%. The peptide mass was 600-10,000 da, and the peptide length was 6-100 amino acids.

Polymerase Chain Reaction and Sanger Sequencing of the New Protein Disulfide Isomerase
From the blast annotation results from Ca-1-VD unigene, we extracted a PDI transcript of more than 2,000 bp. We designed upstream and downstream primers to amplify the full-length sequence from the cDNA template of C. caracteristicus and C.bartschi. The upper-primer was ATGGCGTTGCCCTGGAAAG, and the lower-primer was TCTGATAAAAAGACGAATTGTGA. The 50 µL PCR (Polymerase Chain Reaction) amplification system: 10 × LA Buffer 5 µl, LA Taq DNA Polymerase 0.5 µl, Upper-Primer (10 µM) 1 µl, Lower-Primer (10 µM) 1µl, dNTP 1 µl, cDNA 1∼10 ng (1 µl), Sterile Water 40.5 µl. PCR reaction procedure: 94 • C 2 min; 94 • C 30 s, 56 • C 30 s, 72 • C 2 min 20 s, 35 cycle; 72 • C 10 min. The effective length for Sanger sequencing was only about 800 bp. We also designed two intermediate sequencing primers (TGATGGGTCTGAGAACCCTGTACA and TGTTTGGGGCTGTGGACTGCAC) based on the full-length gene sequence. An Applied Biosystems 3730 Series Genetic Analyzer was used for Sanger sequencing.

Bioinformatics Analysis of the New Protein Disulfide Isomerase
The open reading frames (ORFs) in the DNA sequences were identified using ORF Finder 9 . SMART 10 was used to predict the secondary domain of the target protein. SWISS-MODEL in SMART is a fully automated protein structure homologymodeling server, accessible via the Expasy web server 11 . Typically four steps are followed to build a homology model 12 : (i) identification of structural template(s), (ii) alignment of target sequence and template structure(s), (iii) model-building, and (iv) evaluation of model quality. The template 3boa.1.A was choose. The 3boa.1.A was crystal structure of yeast protein disulfide isomerase. The sequences of the new PDIs obtained using PCR and Sanger sequencing were compared and analyzed with the CLUSTALW 2.1 (Sievers et al., 2011). All distances were between 0.0 and 1.0.

Phylogenetic Tree
We analyzed signal sequences from the three novel precursor conopeptides identified herein, signal sequences from the gene superfamilies in ConoServer, and 35 previously identified superfamilies (Bernáldez et al., 2013;Lavergne et al., 2015;Prashanth et al., 2016). The nucleic acid sequences of PDI and csPDI were obtained from a previous study (Safavi-Hemami et al., 2016). The remaining sequences (from mouse, Danio rerio, rat, P. canaliculata, A. californica, and human) were obtained from the NCBI and UniProt databases. The rat disulfide-isomerase gene TMX1 was used as the outgroup for the neighbor-joining phylogenetic tree. Amino acid sequences were aligned using CLUSTALW 2.1 (Sievers et al., 2011). All distances were between 0.0 and 1.0.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.