In silico analysis of crustacean hyperglycemic hormone family G protein-coupled receptor candidates

Ecdysteroid molting hormone synthesis is directed by a pair of molting glands or Y-organs (YOs), and this synthesis is inhibited by molt-inhibiting hormone (MIH). MIH is a member of the crustacean hyperglycemic hormone (CHH) neuropeptide superfamily, which includes CHH and insect ion transport peptide (ITP). It is hypothesized that the MIH receptor is a Class A (Rhodopsin-like) G protein-coupled receptor (GPCR). The YO of the blackback land crab, Gecarcinus lateralis, expresses 49 Class A GPCRs, three of which (Gl-CHHR-A9, -A10, and -A12) were provisionally assigned as CHH-like receptors. CrusTome, a transcriptome database assembled from 189 crustaceans and 12 ecdysozoan outgroups, was used to deorphanize candidate MIH/CHH GPCRs, relying on sequence homology to three functionally characterized ITP receptors (BNGR-A2, BNGR-A24, and BNGR-A34) in the silk moth, Bombyx mori. Phylogenetic analysis and multiple sequence alignments across major taxonomic groups revealed extensive expansion and diversification of crustacean A2, A24, and A34 receptors, designated CHH Family Receptor Candidates (CFRCs). The A2 clade was divided into three subclades; the A24 clade was divided into five subclades; and the A34 clade was divided into six subclades. The subclades were distinguished by conserved motifs in extracellular loop (ECL) 2 and ECL3 in the ligand-binding region. Eleven of the 14 subclades occurred in decapod crustaceans. In G. lateralis, seven CFRC sequences, designated Gl-CFRC-A2α1, -A24α, -A24β1, -A24β2, -A34α2, -A34β1, and -A34β2, were identified; the three A34 sequences corresponded to Gl-GPCR-A12, -A9, and -A10, respectively. ECL2 in all the CFRC sequences had a two-stranded β-sheet structure similar to human Class A GPCRs, whereas the ECL2 of decapod CFRC-A34β1/β2 had an additional two-stranded β-sheet. We hypothesize that this second β-sheet on ECL2 plays a role in MIH/CHH binding and activation, which will be investigated further with functional assays.


Introduction
Molting processes in decapod crustaceans are controlled by ecdysteroids synthesized by a pair of molting glands, or Y-organs (YOs) (1-3). Molt-inhibiting hormone (MIH), released from the X-organ/sinus gland in the eyestalks, inhibits YO ecdysteroidogenesis through a cyclic nucleotide-dependent signaling pathway (4). In a proposed model, MIH binding to a high-affinity G protein-coupled receptor (GPCR) induces a cAMP/Ca2+-dependent triggering phase that leads to a prolonged NO/cGMP-dependent summation phase, which maintains the YO in the basal state between MIH pulses (2,3,5). It is hypothesized that activation of cGMP-dependent protein kinase leads to inhibition of mechanistic Target of Rapamycin Complex 1 (mTORC1)-dependent ecdysteroidogenesis (6). When conditions are suitable for molting, reduced MIH release activates the YO; rising hemolymph ecdysteroid titer drives the transition from the intermolt stage to the premolt stage (2,3,7).
MIH is a member of the crustacean hyperglycemic hormone (CHH) superfamily of neuropeptides. CHHs are characterized by six conserved cysteines that form three intramolecular disulfide bridges in the mature peptide (8,9). They are classified into two types based on the transcript processing, precursor protein sequence, and post-translational modifications (8). Type I peptides (CHH and insect ion transport peptide or ITP) have an N-terminal signal peptide sequence, a precursor-related peptide sequence, a KR cleavage site, and a mature peptide (10,11). Isoforms are generated by alternative mRNA splicing, and chemical modifications of the N- and C-termini are common (5,10,11). Type II peptides (MIH, gonad-inhibiting hormone or GIH, and mandibular organ-inhibiting hormone or MOIH) lack the precursor-related peptide and KR cleavage site, having only the signal peptide and mature peptide sequences (3,9). In addition, the N-terminal sequences of the Type II mature peptides have a conserved glycine (Gly12) inserted at the fifth position after the first cysteine (3). No isoforms are generated by alternative splicing in Type II peptides, and post-translational modifications are uncommon (3,5). CHH superfamily mature peptides have a compact native conformation stabilized by the three disulfide bridges and nine conserved hydrophobic residues (3,9,12,13).
Type I peptides have four α-helices, and Type II peptides have the four α-helices and an additional short α1/3₁₀-helix located around the conserved Gly12 (5,9). Functional studies of expressed mutant MIH recombinant constructs show that both the N- and C-terminal regions, which come in close apposition in the native structures, contribute to MIH activity (9,13,14). Interestingly, the two residues at positions #13 and #14 in the α1 helix, but not the Gly12 itself, are critical for full MIH activity (9,14).
Any effort to identify the MIH receptor must start with a comprehensive search for ITP receptor homologs in crustacean transcriptomes, particularly those detected in the YO. Previous efforts using sequence homology, though partially successful, were hampered by fragmented and siloed databases, representing a relatively small number of species and taxonomic groups (17,19,21-25). CHH/MIH/GIH/MOIH peptides probably arose after the Hexapoda-Malacostraca split approximately 515 million years ago (39,40). Therefore, it is likely that receptors to the CHH superfamily are ancient and that their lineage can be traced back to ecdysozoan ancestors in the Cambrian Period. Here we report the use of CrusTome, a multi-species, multi-tissue transcriptome database of 201 assembled mRNA transcriptomes from 189 crustaceans and 12 ecdysozoan outgroups (41), to in silico deorphanize candidate CHH family Class A GPCRs, relying on sequence homology to the three B. mori ITP receptors. Putative homologs of BNGR-A2, BNGR-A24, and BNGR-A34 were identified in transcriptomes across Crustacea and annotated as CHH family receptor candidates (CFRCs). Among decapod crustaceans that have historically served as model organisms for molt regulation, seven CFRCs in G. lateralis, eight CFRCs in P. clarkii, and eight in C. maenas were identified. Multiple sequence alignments, phylogenetics, and molecular modeling of predicted receptor proteins identified structural features and conserved motifs in ECL2 and ECL3, which form the ligand-binding region. These features and motifs can be used to distinguish members of CFRC clades and subclades, and suggest mechanisms for ligand binding specificity. In silico modeling of Gl-MIH, Gl-CHH, and Gl-CFRC-A24 and -A34 protein structures was conducted, as G. lateralis is an established model for the study of molting physiology and endocrinology (2,7,42). Based on phylogeny, sequence analysis, and molecular modeling, we hypothesize that CFRC-A34β1 and CFRC-A34β2 are the MIH receptors in decapod crustaceans and should be prioritized for functional assays.

Data sourcing
Transcriptomic datasets from the model crab species G. lateralis arising from previous work were obtained from public repositories and incorporated into the analyses. These datasets included transcriptomes for G. lateralis eyestalk ganglia (Supplementary Data 1) and YO under different experimental conditions (43-45).
Subject hits with an e-value < 1e-10 were selected for further screening. The screening process consisted of the following: 1) InterProScan (version 5.64) analysis to determine whether the BLAST hits were all Class A GPCRs and contained the seven-transmembrane domain region (IPR000276/PF00001: 7tm_1); 2) removal of any sequences that had fewer than six transmembrane regions as predicted by TMHMM v2; and 3) manual removal of redundant sequences from the same species, following evaluation of percent sequence homology in multiple sequence alignments and construction of maximum-likelihood phylogenetic trees (see below). Exceptions were made in steps #1 and #2 if fragmented sequences were from brachyuran crabs. Hits were refined to retain only those that were complete or nearly complete sequences based on their length and domain regions, with the aim of maximizing the phylogenetic diversity and signal of the dataset while preserving representation of focal clades, such as the order Decapoda and the infraorder Brachyura (true crabs).
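The screening rules above can be expressed as a simple filter. The following is a minimal sketch rather than the pipeline used in this study: the `Hit` record, its field names, and the example identifiers are hypothetical, the e-value cutoff is read as 1e-10, and the manual redundancy-removal step (step 3) is not modeled.

```python
from dataclasses import dataclass

EVALUE_CUTOFF = 1e-10   # assumption: "e-value < e-10" read as 1e-10
MIN_TM_REGIONS = 6      # sequences with fewer TM regions are discarded


@dataclass
class Hit:
    """Hypothetical summary record for one BLAST subject hit."""
    seq_id: str
    evalue: float
    has_7tm1: bool      # InterProScan reported IPR000276/PF00001 (7tm_1)
    n_tm: int           # number of predicted transmembrane regions
    brachyuran: bool    # sequence comes from a brachyuran crab


def passes_screen(hit: Hit) -> bool:
    """Apply the e-value cutoff, then steps 1 and 2 of the screen."""
    if hit.evalue >= EVALUE_CUTOFF:
        return False
    if hit.brachyuran:
        # Exception described in the text: fragmented brachyuran
        # sequences are exempt from steps 1 and 2.
        return True
    return hit.has_7tm1 and hit.n_tm >= MIN_TM_REGIONS


hits = [
    Hit("full_length_gpcr", 1e-40, True, 7, False),
    Hit("weak_hit", 1e-5, True, 7, False),
    Hit("too_few_tm", 1e-30, True, 4, False),
    Hit("crab_fragment", 1e-20, False, 3, True),
]
kept = [h.seq_id for h in hits if passes_screen(h)]
print(kept)
```

Encoding the brachyuran exception as an early return keeps the two automated criteria in a single expression, which makes the exemption easy to audit.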

Multiple sequence analyses and phylogenetics
The resulting putative CFRC sequences were aligned using the multiple sequence aligner Multiple Alignment using Fast Fourier Transform (MAFFT; v.7.490) (47). Other subclasses of Class A GPCRs from G. lateralis, which were previously annotated, served as outgroups (Proctolin, FMRF, Allatostatin, and HPR1 receptors) (17). The parameters for MAFFT alignment were chosen to prioritize accuracy over speed and to allow for large unaligned regions if encountered ("--dash --ep 0 --genafpair --maxiterate 10000") (47), thus fine-tuning the process for proteins that are typically challenging to align due to their particular structural characteristics (e.g., GPCRs) (48). The --dash parameter equipped MAFFT with the capability to query a 'Database of Aligned Structural Homologs,' thereby integrating structural data to guide and optimize the alignment process (49). Subsequently, the generated alignment was trimmed using ClipKIT in smart-gap mode (50), an alignment trimming tool proficient at discerning and preserving phylogenetically informative sites and facilitating more accurate phylogenetic inference. Maximum-likelihood phylogenies were reconstructed using IQ-TREE2 (51), applying a Jones-Taylor-Thornton amino acid replacement matrix under a FreeRate model with 9 rate categories [JTT+R9 (52-55)], as suggested by ModelFinder for the trimmed alignment (56). The phylogenetic tree derived from this initial reconstruction was subjected to TreeShrink for automated detection and removal of outliers/paralogs, setting the α value at 0.05 (57). The pruned alignment was then subjected to a second, final round of phylogenetic reconstruction using IQ-TREE2 (51), with the same model parameters as in the first round [JTT+R9 (52-55)], to enable confident characterization and annotation of the target proteins in a phylogenetic context. The branch support for this final phylogeny was assessed in two ways, using the Ultra-Fast Bootstrap approximation (UFBoot; 10,000 iterations) and an approximate Bayes test (58-60). The process was repeated independently for those sequences falling within the A24 and A34 clades to obtain better resolution of terminal branches in clade-specific phylogenies. Reference sequences and their corresponding Accession Identifiers are cataloged in Supplementary Data S2.
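The alignment, trimming, and tree-building steps described above can be sketched as command lines assembled in Python. This is an illustration under stated assumptions, not the authors' scripts: the file names are hypothetical, the MAFFT options are those quoted in the text (written with the two-hyphen form that MAFFT's long options take), and the ClipKIT and IQ-TREE2 invocations are one plausible way to request smart-gap trimming, the JTT+R9 model, 10,000 UFBoot replicates, and the approximate Bayes test. The TreeShrink step is omitted.

```python
import shlex


def mafft_command(fasta_in):
    """MAFFT call with the options quoted in the text.

    MAFFT writes the alignment to standard output, so the caller
    is expected to redirect it to a file.
    """
    return ["mafft", "--dash", "--ep", "0", "--genafpair",
            "--maxiterate", "10000", fasta_in]


def clipkit_command(aln_in, aln_out):
    """ClipKIT trimming in smart-gap mode (file names are hypothetical)."""
    return ["clipkit", aln_in, "-m", "smart-gap", "-o", aln_out]


def iqtree_command(aln_in, model="JTT+R9", ufboot=10000):
    """IQ-TREE2 run with a fixed model, UFBoot, and the aBayes test."""
    return ["iqtree2", "-s", aln_in, "-m", model,
            "-B", str(ufboot), "--abayes"]


commands = [
    mafft_command("cfrc_candidates.fasta"),
    clipkit_command("cfrc_aligned.fasta", "cfrc_trimmed.fasta"),
    iqtree_command("cfrc_trimmed.fasta"),
]
for cmd in commands:
    # shlex.join quotes arguments safely for display or logging
    print(shlex.join(cmd))
```

Building each command as an argument list (rather than a single string) allows it to be passed directly to `subprocess.run` without shell quoting issues.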
Multiple sequence alignments were produced for selected species across the phylogeny to compare the diversity of the ECL2 regions among the A2, A24, and A34 clades. These alignments were generated with the previously mentioned MAFFT strategy and subsequently visualized with a custom script to assess sequence content and conservation across clades and species (code available at: https://github.com/invertome/scripts/tree/main/plots). In addition, the script generates sequence logo plots depicting the proportion of each residue found per site in the alignment. Amino acid residue colors that are proximal in color space, in both the alignments and logo plots, denote similarities in physicochemical characteristics of the corresponding residues (61). Additionally, a deep-learning algorithm was employed to detect, predict, and annotate the topology of the candidate GPCRs (62) to delineate intracellular, extracellular, and transmembrane regions. Similarly, a subset of decapod species was selected to generate and plot multiple sequence alignments of the CHH and MIH peptides, as well as the A34 clade, to further compare the sequences in the ECL2, TMM5, ICL3, TMM6, and ECL3 regions.
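The per-site residue proportions that underlie a sequence logo can be computed directly from an alignment. The sketch below is not the custom script referenced above; the function name and the toy alignment are hypothetical, and gaps are simply excluded from the denominator, which is an assumption about how proportions are defined.

```python
from collections import Counter


def column_proportions(alignment, gap="-"):
    """Return, for each alignment column, the proportion of each residue.

    `alignment` is a list of equal-length aligned sequences.
    Gap characters are ignored when computing proportions.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    proportions = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        counts = Counter(residues)
        total = len(residues)
        proportions.append(
            {res: n / total for res, n in counts.items()} if total else {}
        )
    return proportions


# Toy alignment (hypothetical sequences, for illustration only)
toy = ["WPDGC", "WPDGC", "WADGC", "W-DGC"]
props = column_proportions(toy)
print(props[1])
```

A fully conserved column yields a single residue with proportion 1.0, which corresponds to a full-height letter in a proportion-based logo.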

Protein structural modeling
Neural network-based methods AlphaFold and RoseTTAFold have outperformed homology modeling programs like Modeller for GPCR modeling in the absence of good templates (26, 63-65). Consequently, we used AlphaFold2 for the structural modeling of G. lateralis CFRC, Gl-MIH, and Gl-CHH sequences. UCSF ChimeraX version 1.5, a free multi-platform molecular modeling program developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, was used to model and visualize three-dimensional structures (66,67). Full-length predicted proteins of Gl-CFRC-A24α, -A24β1, -A34α2, -A34β1, and -A34β2 were selected for structural modeling; partial sequences of Gl-CFRC-A2α1 and -A24β2 were excluded. The sequences were truncated at the N- and C-termini due to intrinsic disorder of the N-terminal and C-terminal domains. The truncated sequences (Supplementary Data 3) were submitted to AlphaFold2 through the ChimeraX interface, which runs the three-dimensional structure prediction on Google Colab (68). The predicted structures were energy minimized and the best model out of five was selected for further analysis. AlphaFold2 has been used to predict either active or inactive states of GPCRs (69). However, the use of these sophisticated techniques was not feasible for this study, due to the lack of three-dimensional structures for crustacean GPCRs with their ligands. The RoseTTAFold web service was used with default settings to predict five structures of each protein sequence, and the structural features were compared with the AlphaFold2 models.
Published CFRC sequences, as well as additional CFRC sequences, from seven decapod species are summarized in Table 1. The sequences are organized according to the proposed classification nomenclature. Most of the published sequences were in the A34 clade; the lone exception was Pc-GCRC-A9, which was in the A24 clade (Table 1). New contigs encoding A2 and A24 sequences were identified in five species from transcriptomes in the CrusTome database. No new CFRCs were identified in P. trituberculatus, as the RNA-seq data for this species were not included in CrusTome (41). CrusTome also did not include transcriptomic data for S. verreauxi; the new sequence, designated Sv-CHHR3, was provided by Dr. Tomer Ventura for the phylogenetic analysis (Figure 3). Unpublished CFRC sequences in the A2 clade were identified in P. clarkii (A2α1 and A2β), C. maenas (A2α1 and A2β), and G. lateralis (A2α1). New A24 CFRC sequences were identified in P. clarkii (A24β1 and A24β2), S. paramamosain (A24α), E. sinensis (A24α and A24β1), S. verreauxi (A24α), C. maenas (A24α, A24β1, and A24β2), and G. lateralis (A24α, A24β1, and A24β2). One new A34 CFRC sequence was identified in C. maenas (A34β3).
Contigs encoding CFRC sequences were identified in the transcriptomes from 37 decapod species representing six infraorders. The decapod sequences were assigned to 11 of the 14 subclades (Table 4). The three subclades lacking decapod CFRC sequences were A2α2 (Figure 2), A24β (Figure 3), and A34α (Figure 4). None of the 37 decapod species expressed all 11 CFRCs. The number of CFRC sequences for a species ranged from one in Neocaridina denticulata, Munida microphthalma, and Acanthephyra stylorostratis to nine in P. clarkii. CFRC-A24α and CFRC-A34β1 were the most common, with CFRC-A24α identified in 22 species and CFRC-A34β1 identified in 27 species from all six infraorders (Table 4). Members of the A2 subclades (A2α1 and A2β) were identified in seven species each, but not always in the same species; only in C. maenas and P. clarkii were both A2α1 and A2β expressed. A24β3 was the least common sequence in the A24 clade; it was identified in only three species (one astacidean and two carideans). CFRC-A34α1 was identified in nine species from four infraorders, whereas CFRC-A34α2 was identified in 12 species from five infraorders (Table 4). CFRC-A34β2 was identified only in the Brachyura, with six of the 10 species expressing both A34β1 and A34β2. CFRC-A34β3 was the least common of the sequences in the A34 subclades; it was identified in only seven of the 37 species (Table 4).

Sequence analysis of the ECL2, ICL3, and ECL3 regions of decapod CFRCs
Multiple sequence alignment of the ECL2 region compared the sequence content, conservation, and the annotation of putative novel structures in decapod CFRCs with the B. mori BNGR-A2, -A24, and -A34 sequences. A common feature shared by all the CFRCs, including Bombyx, was a conserved cysteine (C) in ECL2 (Figure 5, reference alignment position #565). The A2, A24, and A34 CFRC clades displayed unique ECL2 amino acid compositions that were consistent across taxa and within each subclade, as depicted by the alignments and logo plots (Figure 5, reference alignment positions #544 to #591). The A2α1 and A2β sequences had an insertion of three or four amino acids unique to the A2 clade (Figure 5, reference alignment positions #553 to #556). All the A24 sequences, including BNGR-A24, had a pair of threonine residues (T) at reference positions #546 and #547, and a nine-amino acid sequence (WPDGxxxxS), starting four residues C-terminal to the conserved cysteine (Figure 5, reference alignment positions #569 to #577). The A24α, A24β1, and A24β2 subclades had a conserved tyrosine (Y) that distinguished them from the A24β3 sequences (Figure 5, reference alignment position #552). All the A34 sequences, including BNGR-A34, had a conserved tryptophan (W) eight residues N-terminal to the conserved cysteine (Figure 5, reference alignment position #552). Within the A34 clade, the A34α subclade had the shortest ECL2 sequence, while the A34β1/β2 subclades had the longest ECL2 sequence (Figure 5). The length of the A34β3 ECL2 sequence was intermediate between A34α and A34β1/β2. Moreover, the ECL2 sequences of the A34β subclades had a second conserved cysteine (C) absent in the A34α ECL2 sequences (Figure 5; cysteine located at reference alignment
Brachyuran A34 sequences were selected for multiple sequence alignment to compare the ECL2, ECL3, ICL3, TMM5, and TMM6 regions in greater detail. Crayfish (P. clarkii) sequences (Pc-CFRC-A34α1, -A34α2, -A34β1, and -A34β3) were included for reference. All the A34 ECL2 sequences had a conserved arginine (R), tryptophan (W), and cysteine (C) located at reference alignment positions #471, #478, and #486, respectively, as shown on the CFRC-A34 clade-specific alignment (Figure 6). All the CFRCs had a conserved CWxP motif in TMM6 (Figure 6; reference alignment positions #588 to #591). The five A34 subclades (A34α1, A34α2, A34β1, A34β2, and A34β3) were differentiated by amino acid sequence and length of the ICL3 region (Figure 6). The A34β3 subclade was distinguished from the other A34 subclades by a seven-amino acid sequence between the tryptophan (W) and the cysteine (C) in ECL2 (WxDLVEESC; Figure 6, reference alignment positions #478 to #486) and by a four- or six-amino acid insertion in ICL3 (Figure 6; reference alignment positions #555 to #560). The A34β1/β2 subclades had a 14- to 16-amino acid insertion containing a second conserved cysteine (C) (Figure 6, reference alignment positions #489 to #506). This insertion forms a second two-stranded β-sheet (see "Structural modeling of Gl-CFRC-A24 and -A34 proteins" section below). Additionally, the ECL2 region of the brachyuran A34β1/β2 subclades had a conserved 14-amino acid sequence that included the first cysteine (WxNFTxWxCxExFP; Figure 6, reference alignment positions #478 to #491; Supplementary Data 4). The A34α1/α2 sequences had a six-amino acid insertion in ECL3 that was not present in the A34β subclades (Figure 6, reference alignment positions #606 to #611).
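Motifs written in the notation used here, with "x" standing for any residue, can be searched for in a candidate sequence by converting them to regular expressions. The following is a minimal sketch under that assumption; the function names and the test sequence are hypothetical and are not taken from the study's data.

```python
import re


def motif_to_regex(motif):
    """Convert a motif such as 'WxNFTxWxCxExFP' to a regex string.

    Lowercase 'x' is treated as a wildcard for any single amino acid;
    every other character must match literally.
    """
    return "".join("[A-Z]" if ch == "x" else re.escape(ch) for ch in motif)


def find_motif(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead is used so that overlapping occurrences are reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]


# Hypothetical sequence containing one instance of the brachyuran motif
seq = "MKWANFTLWSCAEAFPQ"
print(find_motif("WxNFTxWxCxExFP", seq))
print(find_motif("CWxP", "AAACWLPAAA"))
```

The same helper can be applied to any of the motifs named in the text (for example WPDGxxxxS or CWxP) to flag candidate subclade membership before a full alignment is built.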
Table 5 summarizes the conserved motifs identified via multiple sequence alignments in ECL2 and ECL3 that distinguish the 11 decapod CFRC subclades. The analysis included the CFRC sequences from 37 decapod species, with the number of sequences analyzed ranging from three for A24β3 to 27 for A34β1 (Table 4). All the ECL2 motifs contained the conserved cysteine (Table 5). CFRC-A2α1 and -A2β had four conserved amino acid residues (PxGxDxP; Table 5), which distinguished them from the members of the A24 and A34 clades. The four A24 subclades (CFRC-A24α, -A24β1, -A24β2, and -A24β3) had "WPDG" in the ECL2 motif that was absent in the A2 and A34 subclade sequences (Table 5). These ECL2 and ECL3 motifs distinguished the CFRC-A24α subclade from the CFRC-A24β subclades. The CFRC-A24β1 and -A24β2 sequences showed similarities, with "STTVS" in ECL2 and "HNS" and "IQH" in ECL3 (Table 5). The CFRC-A24β3 ECL3 motif was similar to those of CFRC-A24β1 and -A24β2, but its ECL2 motif differed from that of CFRC-A24β1/β2 in both length and amino acid composition. The ECL2 and ECL3 motifs among the three CFRC-A34 subclades varied in length and sequence. CFRC-A34β1 and -A34β2 had a longer ECL2 with a second conserved cysteine (Table 5). The "CVVTxDAK" sequence of this region in CFRC-A34β2 only occurred in brachyurans (Table 5). CFRC-A34β3 showed an ECL2 motif length closer to that of CFRC-A34α than those of CFRC-A34β1/β2. Similar to CFRC-A34β1/β2, the CFRC-A34β3 ECL2 motif sequence had a second conserved cysteine (Table 5).
6) (8,9). MIH had a conserved Gly12 that was absent in CHH. Nine hydrophobic residues, which stabilize peptide structure (12), are indicated in Table 6. MIH and CHH differed in the number and lengths of the α-helical regions. CHH had four α-helices, while MIH had five α-helices, with the additional short α1/3₁₀-helical turn before the Gly12 (Table 6).

Multiple sequence alignments and structural modeling of Gl-MIH and Gl-CHH proteins
Structural models of Gl-MIH and the eyestalk Gl-CHH isoform mature peptides are shown in Figure 7. Conserved surface-exposed residues from Table 6 were included in the Gl-MIH and Gl-CHH structural models, as these residues have roles in their structure, function, and protein-receptor interactions (9, 72). The N-terminal sequences of Gl-MIH and Gl-CHH were conserved across other brachyuran species and were usually found on the external surface of the protein (Table 6; Figure 7). The C-terminal region of Gl-CHH, which included the α4- and α5-helices, was less conserved in comparison to its N-terminal region and the C-terminal region of Gl-MIH (Figure 7). Gl-MIH also had a longer and slightly

Structural modeling of Gl-CFRC-A24 and -A34 proteins
The structures of the G. lateralis A24α/β1, A34α2, and A34β1/β2 CFRCs were modeled using AlphaFold2. Gl-CFRC-A2α1 and Gl-CFRC-A24β2 were not included in the modeling, as they were partial sequences with incomplete open reading frames (Table 2; Supplementary Data 2). Each full-length CFRC sequence consisted of a single polypeptide with seven transmembrane domains and a topology with the N-terminus oriented on the extracellular surface and the C-terminus on the intracellular surface (Figures 8, 9; depicted as ribbon diagrams on the left for each receptor). The AlphaFold2 models of these GPCRs were evaluated using a per-residue confidence score (pLDDT) between 0 and 100, and the results are shown in the structures on the right for each receptor (Figures 8, 9). Regions corresponding to α-helical transmembrane domains showed very high confidence (pLDDT > 90), representing over one-third of the three-dimensional structure. The ECL2 β-sheets had very high (pLDDT > 90) to confident (90 > pLDDT > 70) scores. The other parts of the models, such as ECL3 and intracellular loops, were mostly represented as unresolved loops with low (70 > pLDDT > 50) scores. It should be noted that AlphaFold2 introduces bias in modeling the TMM6 and ICL3 regions. This is attributed to the fact that most currently available high-resolution structures were obtained with engineered GPCRs that lacked major portions of the ICL3 and the C-terminal domains (37, 73).
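The pLDDT confidence bands quoted above can be applied to a list of per-residue scores to summarize a model. This is a small illustrative sketch: the function names and example scores are hypothetical, the "very low" band (scores of 50 or below) is inferred from the figure legends, and the treatment of scores falling exactly on a boundary (90, 70, 50) is an assumption.

```python
from collections import Counter


def plddt_category(score):
    """Map a per-residue pLDDT score (0-100) to a confidence band."""
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must lie between 0 and 100")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"  # assumption: boundary values fall in the lower band


def fraction_by_category(scores):
    """Return the fraction of residues that falls in each confidence band."""
    counts = Counter(plddt_category(s) for s in scores)
    total = len(scores)
    return {band: n / total for band, n in counts.items()} if total else {}


# Hypothetical per-residue scores for a short stretch of a model
example_scores = [95.2, 92.0, 80.5, 60.1]
print(fraction_by_category(example_scores))
```

Summaries of this kind make statements such as "over one-third of the structure showed very high confidence" straightforward to check against the model's B-factor column, where AlphaFold2 stores pLDDT in PDB output.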
The colors in Table 4 match the colors in the trees for the A2, A24, and A34 clades (Figures 1-4).
Multiple sequence alignments of the Extracellular Loop 2 (ECL2) sequences in clades A2, A24, and A34 of Bombyx mori ITP GPCR homologs, depicting the diversity of ECL2 types found in decapod crustaceans. ECL2 residues are highlighted in bold. MSA color scheme corresponds to similarities in physicochemical properties of amino acid residues (see Materials and Methods). All the CFRC sequences have a highly conserved cysteine at reference position #565.
insertions in the two sequences (Figures 9B, C). This feature distinguished CFRC-A34β1 and -A34β2 from all the other crustacean GPCRs. Consistent with the AlphaFold2 models, RoseTTAFold also predicted two two-stranded β-sheets in the ECL2 of Gl-A34β1/β2 (data not shown).
Multiple sequence alignment of ECL2 sequences of decapod CFRC-A34β1/β2 identified a conserved motif of 41-42 amino acid residues in A34β1 and 40-41 amino acid residues in A34β2 (Table 5; Figure 6). When the alignment was restricted to the A34β1 and A34β2 sequences from brachyuran species (Table 4), two consensus sequences were identified within the motif. The conserved residues were included in structural models of the Gl-A34β1/β2 ligand-binding domain (Figure 10). The YRxxYxxxWxNFTxWxCxExFP brachyuran consensus sequence included the cysteine (C) in β-sheet #1, while the CxVxxDAK sequence included the cysteine (C) in β-sheet #2 5). A notable feature of both receptors was that ECL2 had conserved hydrophobic residues projecting from both β-sheets (Figure 10). The sequences in the ECL2 motif were highly conserved in the Gl-CFRC-A34β1 and -A34β2 proteins, with YRxxYxxxWxxFTxWxCDExFP and VGCVVTY residues identified (Figure 10, compare A and B). There were two acidic residues in the first motif (D353 and E354 in A34β1 and D288 and E289 in A34β2) located in the center of the binding pocket formed by the two β-sheets (Figure 10).

Multiple sequence alignments of the ECL2, TMM5, ICL3, TMM6, and ECL3 regions across subclades A34β1, A34β2, and A34β3 in representative decapod species (Procambarus clarkii, Scylla paramamosain, Eriocheir sinensis, Gecarcinus lateralis, and Carcinus maenas). The alignment illustrates the composition and length of the ECL regions that reflect putative differences in ligands and/or binding affinities. Note that the sequence reference numbers differ from those in Figure 5 due to the sequences selected for the alignments. ECLs are highlighted in bold. MSA color scheme corresponds to similarities in physicochemical properties of amino acid residues (see Materials and Methods).

CFRC Clade ECL2 Motif Sequences ECL3 Motif Sequences
The ECL2 motifs are centered around a conserved Cys (C). A second cysteine in CFRC-A34β1, -A34β2, and -A34β3 is indicated with double underline. Sequences shared between two or more of the A2 or A24 CFRCs are indicated with double underline. Brackets indicate sites with possible indels or residues seen in equal proportions.

Gl-MIH and Gl-CHH sequences (70, 71) and locations of α-helical regions (α1, α2, α3, α4, and α5) are indicated by lines above the sequences. The six conserved cysteines are underlined. Surface-exposed conserved amino acids in the Gl-MIH and Gl-CHH sequences are indicated by red bold font. Nine conserved hydrophobic residues that stabilize peptide conformation are indicated by blue bold font. The four amino acids (RKKK) at the C-terminus of Gl-CHH were not included in the modeling (Figure 7).

6). Gl-MIH had an additional short α1-helix, specifically a 3₁₀-helix, at the N-terminus (circled) that is absent from Gl-CHH. The C-terminal four amino acids of Gl-CHH (RKKK) were not included in the modeling. Molecular graphics images were produced using the Chimera package (see Materials and Methods). Hydrogens are not shown for clarity. Coordinates of the AlphaFold2 models in PDB format are available in Supplementary Data 6.

Discussion
Phylogenetic analysis using CrusTome identified homologs of insect ITP GPCRs in crustacean taxa, including copepods, isopods, amphipods, euphausiids, and decapods. They were organized into three large clades named after the B. mori BNGR-A2, -A24, and
Structural models of the G. lateralis A24α (A) and A24β1 (B) receptors. On the left, ribbon diagrams are shown with colors ranging from blue for the N-terminus to red for the C-terminus. Images were produced using the Chimera package. On the right, three-dimensional prediction using AlphaFold2 (see Materials and Methods). Per-residue confidence score (pLDDT) designates the estimation of confidence on a scale from 0 to 100, with colors representing pLDDT confidence scores from very low (orange) to very high (dark blue; see legend). All receptors showed a common topology of seven transmembrane (TMM) α-helices connected by three extracellular loops (ECLs) and three intracellular loops (ICLs). The N-terminus is in the extracellular space and the C-terminus is in the cytosol. The ECL2 has a two-stranded β-sheet, designated β#1. The disulfide bridge that anchors the ECL2 β#1 to TMM6 is shown as ball and sticks (circle); it is located between C148 and C227 in Gl-CFRC-A24α and between C178 and C258 in Gl-CFRC-A24β1.
Structural models of the G. lateralis A34α (A), A34β1 (B), and A34β2 (C) receptors. On the left, ribbon diagrams are shown with colors ranging from blue for the N-terminus to red for the C-terminus. Images were produced using the Chimera package. On the right, three-dimensional prediction using AlphaFold2. Per-residue confidence score (pLDDT) designates the estimation of confidence on a scale from 0 to 100, with colors representing pLDDT confidence scores from very low (orange) to very high (dark blue; see legend). All receptors showed a common topology of seven transmembrane α-helices connected by three ECLs and three ICLs, with the N-terminus in the extracellular space and the C-terminus in the cytosol. β-sheet #1 in ECL2 is present in the three A34 receptors. The ECL2 in A34β1/β2 receptors had an additional two-stranded β-sheet, designated β#2. The disulfide bridge that anchors the ECL2 β#1 to TMM6 is shown as ball and sticks (circle); it is located between C85 and C164 in Gl-CFRC-A34α2; between C352 and C273 in Gl-CFRC-A34β1; and between C208 and C287 in Gl-CFRC-A34β2.
Utilization of the CrusTome database has greatly expanded the number of known decapod CFRCs. Previous studies identified a BNGR-A24 homolog in P. clarkii and BNGR-A34 homologs in C. maenas, P. clarkii, S. verreauxi, G. lateralis, S. paramamosain, P. trituberculatus, E. sinensis, H. americanus, C. sapidus, and P. argus, but no homologs in the BNGR-A2 clade (17,19,21-24). One hundred and seventeen sequences from 37 decapod species were organized into 11 CFRC subclades (Table 4). This includes the 23 published CFRC sequences and 18 newly identified sequences in seven decapod species (Table 1). The additional sequences, except Cm-CFRC-A34β3, were in the A2 and A24 clades (Table 1). It should be noted that no new sequences were identified for P. trituberculatus and only one new sequence (Sv-CHHR3), provided by T. Ventura, was identified in S. verreauxi (Table 1), as transcriptomic data from both species were not included in the current version of CrusTome (41). The contigs were assigned to ten of the 11 decapod CFRC subclades (Table 1). CFRC-A24β3 was not expressed in the seven species; it appears to be relatively rare, as it was found in only three decapod species (Table 4). None of the 37 decapod species expressed sequences for all 11 CFRC subclades; the number ranged from one in N. denticulata and two other species to nine in P. clarkii (Tables 1, 4). The absence of sequences in the transcriptomes may be due to the tissue source, low expression level, and/or sequencing depth.
Activation of vertebrate Class A GPCRs involves three conserved motifs located in the transmembrane domain and cytoplasmic region, forming an activation pathway that transmits ligand binding to G proteins (30). These motifs are an E/DRY motif located at the boundary of TMM3 and ICL2; a CWxP motif and a conserved phenylalanine (F) that interacts with the tryptophan (W) in TMM6; and an NPxxY motif located at the boundary between TMM7 and the C-terminus (29,30,36,74). The CWxP motif and a conserved phenylalanine (F) were retained in all the CFRCs (Figure 6; reference alignment positions #585 to #591), which supports their critical role in receptor activation (36). The NPxxY motif was also present in all the CFRCs (Supplementary Data 5). Upon activation, the tyrosine (Y) in the NPxxY motif interacts with hydrophobic residues between TMM6 and TMM7 to stabilize conformational changes in the transmembrane domain (29). The arginine (R) in the E/DRY motif acts as a microswitch; upon receptor activation, it interacts with a conserved tyrosine (Y) located at the boundary of TMM5 and ICL3 and participates in the binding of G proteins (29). The tyrosine was present in all CFRCs (Figure 6; reference alignment position #534). However, the DRY sequence in the Gl-CFRC-A24 sequences was replaced with GRF in the Gl-CFRC-A34 sequences (Supplementary Data 5). The conservation of the arginine (R) and tyrosine (Y) residues suggests that the activation mechanism in the A24 and A34 receptors is retained. However, the replacements of the aspartate (D) with glycine (G) and the tyrosine (Y) with phenylalanine (F) suggest that the CFRC-A24 and -A34 receptors differ in G protein binding affinity and/or specificity.
The expansion and diversity of CFRCs reflect the large variety of arthropod neuropeptides that bind GPCRs (11,17,24,25,75,76). The CHH neuropeptide superfamily is unique to arthropods, but it is greatly expanded in crustaceans. ITPs occur in insects, whereas CHH, MIH, MOIH, and GIH occur only in decapods (8,22,77-79). These large neuropeptides have a unique compact core structure consisting of four or five α-helical regions and stabilized by three intramolecular disulfide bridges (3,9,12,13). However, differences in N- and C-terminal sequences, chemical modifications, and distribution of surface amino acid residues confer ligand/receptor binding affinity and specificity. The N- and C-termini of MIH and CHH are essential for biological activity and likely contribute to their binding to distinct high-affinity membrane receptors (8,9,80). The differences in the consensus sequences of brachyuran CHH and MIH peptides (Table 6) raise the possibility that coevolutionary processes have resulted in complementary changes in the receptor regions involved in binding and/or in discriminating structurally similar neuropeptides. In G. 
lateralis, the N-terminal regions of MIH and CHH were highly divergent (Table 6; Figure 7). This suggests that the N-terminal sequences of these neuropeptides contribute to interactions with the ECL2 and ECL3 regions of Gl-CFRC-A24a, Gl-CFRC-A24b1, and Gl-CFRC-A34b1/b2. Cryoelectron microscopy of human chemokine/GPCR complexes has shed light on the peptide-binding mechanism in CFRCs (28,33). Initially, the chemokine core binds to the N-terminus and ECL2 of the receptor; these regions determine GPCR ligand specificity and affinity (28,29,33). This is followed by interactions between the flexible N-terminus of the chemokine and negatively charged residues located on the extracellular regions of the transmembrane core (29,35). Most of the brachyuran A34b1 and A34b2 proteins had two acidic residues located at the bottom of the binding pocket formed by the two b-sheets, suggesting that there are similar interactions between MIH/CHH ligands and these CFRCs (Figures 6, 10). In two structures of chemokine ligands bound to their Class A GPCRs, both in the presence of Gi/o proteins, the peptide or protein ligand binds to extracellular pockets formed by ECL2, ECL3, and the transmembrane core. Specifically, in the CC motif ligand 20 (CCL20)/CC motif receptor 6 (CCR6) complex, the N-terminus of CCL20 interacts with the extracellular crevice of the seven transmembrane core of CCR6, forming crucial interactions with ECL2 and the receptor's N-terminus (81). Likewise, in the CCL15/CCR1 complex, the N-terminal region and 30s loop of CCL15 are positioned within the seven transmembrane pocket of CCR1, making contact with ECL2 and ECL3, as well as with TMM5 and TMM6 through an extensive network of hydrogen bonds and hydrophobic interactions (82). This suggests that the N-terminal sequences of CHH superfamily neuropeptides may determine binding to specific CFRC subclades, which differ in the ECL2 and ECL3 regions.
A common approach taken for the study of ligand/receptor evolution compares receptors and ligands in non-model organisms, using knowledge from well-studied models, such as mammals and a limited number of arthropods (e.g., Bombyx, Daphnia, and Drosophila). However, these pair-to-pair comparisons between classical models and non-model organisms have limitations (83). The approach taken here, which involves comparisons of multiple related organisms in a coherent phylogenetic framework, can provide more accurate reconstructions of ligand/receptor evolution (35). Incorporating hormone signaling mechanisms within an interspecific context can inform biological principles that guide species diversification, adaptation, and survival (84). Thus, analyzing these peptides and their GPCR partners within an evolutionary context provides additional insights regarding gene duplication and functional diversification across invertebrates and Arthropoda, which in turn significantly expands our understanding of the molecular evolution of neuropeptide signaling systems and the co-evolutionary dynamics of peptide-receptor pairs.

Phylogenetic analysis helped narrow the number of potential CHH superfamily receptors in decapods. MIH, CHH, GIH, and MOIH are unique to decapods (5,8). Assuming ligand/receptor co-evolution, it follows that peptide ligands unique to decapods would bind to receptors that would also be unique to decapods. Of the 11 CFRCs identified in decapods (Table 4), eight were decapod-only. The CFRC-A2 subclades were not restricted to decapods. The A2a1 subclade included decapods, hexapods, euphausiids, and peracarids, whereas the A2b subclade included decapods and copepods (Figure 2). Three of the four A24 subclades (A24b1/b2/b3) were restricted to decapods; A24a included decapods, copepods, and hexapods (Figure 3). All five A34 subclades (A34a1/a2 and A34b1/b2/b3) were restricted to decapods (Figure 4). The three A24 subclades and the five A34 subclades varied in the 
sequence and structure of the ECL2 and ECL3 regions, suggesting that they bind different ligands (Figures 5, 6, 8, and 9). Among the three extracellular loops, ECL2 stands out as the longest and most diverse in terms of sequence length, composition, and structural shape (27,28,32,34,35). In human Class A GPCRs, the ECL2 region is organized into seven clusters with the peptide and protein GPCRs forming the largest cluster (32). The ECL2 of the Gl-CFRC-A24 and -A34 models exhibited a b-sheet structure, similar to the majority of Class A human GPCRs, and also featured a conserved cysteine (C) that serves as an anchor, tethering ECL2 to the helical bundle in TMM3 (Figures 8, 9) (26,27,32,34,35,82). This anchoring may affect ligand binding and receptor function, suggesting a potentially crucial role for ECL2 in ligand/receptor interactions.
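The count of eight decapod-only subclades among the 11 found in decapods follows directly from the taxon membership of each subclade stated above. The minimal Python sketch below encodes that membership and recovers the count. The dictionary `SUBCLADE_TAXA` and the function `restricted_to` are hypothetical names introduced for illustration; the membership data are taken only from the text, not from the underlying phylogeny files.

```python
# Taxon membership of the 11 CFRC subclades found in decapods, as stated in the text.
SUBCLADE_TAXA = {
    "A2a1": {"decapods", "hexapods", "euphausiids", "peracarids"},
    "A2b": {"decapods", "copepods"},
    "A24a": {"decapods", "copepods", "hexapods"},
    "A24b1": {"decapods"},
    "A24b2": {"decapods"},
    "A24b3": {"decapods"},
    "A34a1": {"decapods"},
    "A34a2": {"decapods"},
    "A34b1": {"decapods"},
    "A34b2": {"decapods"},
    "A34b3": {"decapods"},
}


def restricted_to(taxon, table):
    """Return the sorted subclade names whose only member taxon is `taxon`."""
    return sorted(name for name, taxa in table.items() if taxa == {taxon})


DECAPOD_ONLY = restricted_to("decapods", SUBCLADE_TAXA)

if __name__ == "__main__":
    print(len(DECAPOD_ONLY), "of", len(SUBCLADE_TAXA), "subclades are decapod-only:")
    print(", ".join(DECAPOD_ONLY))
```

Under the co-evolution assumption stated above, the subclades returned by this filter are the ones to prioritize as receptors for decapod-specific ligands.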
The CFRC-A34b subclades appear to be the best receptor candidates for MIH and other decapod CHH family neuropeptides. Compared to the A24 clade, the A34 clade showed the greatest expansion and diversification, potentially producing CFRCs with ECL2 and ECL3 regions that can distinguish CHH family neuropeptides (Figures 4, 6) (9). The similarity in structures of the CFRC-A34 subclades (Figure 9) with chemokine receptors suggests that the ECL2 forms a lid-like structure over the binding pocket. Interactions between the surface amino acid residues on the neuropeptide with conserved residues projecting from the ECL2 b-sheets (Figures 7, 10) likely contribute to peptide-receptor specificity. As GPCRs often bind to multiple ligands, and vice versa, disentangling the precise mechanisms by which these receptors modulate their binding affinities and specificities becomes of utmost importance to identify optimal ligand-receptor pairs (28). The sequence identity of the ECL2 motif in Gl-CFRC-A34b1 and -A34b2 suggests that the receptors bind the same ligand(s) (Figure 10). Peptide binding by GPCRs is a dynamic process that involves conformational changes of receptor and ligand structures (28-30, 33, 75), which cannot be easily simulated in silico when protein crystal structures are not available, as is the case for crustacean GPCRs. These limitations further highlight the significance of integrative approaches within an evolutionary context for the study of non-traditional model organisms.
An aim of this study was to identify potential MIH receptor candidates in two brachyuran species, G. lateralis and C. maenas. Both species are important models for understanding the endocrine control of molting (2-4, 6, 85-91). Moreover, C. maenas is an invasive species aided by anthropogenic range expansion to temperate coastal regions globally (92). Its rapid growth rate, in which the animal nearly doubles its size by drinking large quantities of seawater at each ecdysis, has contributed to its success (93, 94). Contigs encoding CFRC-A24b1/b2 in C. maenas and G. lateralis, A34a2 in G. lateralis, A34b1/b2 in C. maenas and G. lateralis, and A34b3 in C. maenas (Tables 2 and 3) should be considered putative receptors for CHH superfamily neuropeptides, including MIH, GIH, MOIH, and the eyestalk ganglia and pericardial CHH isoforms generated by alternative splicing (5,10,80). As CFRC-A34b1 and -A34b2 are expressed in the YOs of both species (Tables 2, 3) (17,23), they should be considered candidates for the MIH receptor. However, deorphanizing CFRCs requires functional assays, such as in vitro receptor activation assays using recombinant neuropeptide with CFRCs expressed in a cell reporter system and/or in vivo studies using double-stranded RNA to knock down receptor expression.
Although none of the receptors for CHH superfamily neuropeptides has been identified in decapods, the identity of the MIH receptor(s) has received the most attention (5,6). As functional assays are laborious and time consuming, it is useful to consider criteria in prioritizing CFRCs for testing:

1. The MIH receptor(s) should be preferentially expressed in the YO (3). In G. lateralis, Gl-CFRC-A34a2, -A34b1, and -A34b2 were expressed in the YO and ESG transcriptomes [Table 2 (17)]. By contrast, contigs encoding Gl-CFRC-A2a1, -A24a, -A24b1, and -A24b2 were present in the ESG transcriptome (Table 2). Endpoint RT-PCR showed qualitative differences in tissue expression of Gl-CFRC-A34a2 and Gl-CFRC-A34b1. Gl-CFRC-A34a2 (formerly Gl-CHHR-A12) is expressed in YO, hindgut, hepatopancreas, and testis, whereas Gl-CFRC-A34b1 (formerly Gl-CHHR-A9) is expressed in YO, eyestalk ganglia, gill, heart, midgut, and thoracic ganglion (17). The tissue expression of Gl-CFRC-A34b2 was not determined (17). In C. maenas, only CFRC-A24b1 and CFRC-A34b1 were present in the YO transcriptome [Table 3 (23)]. Although differential tissue expression of CFRCs is reported for E. sinensis, S. paramamosain, and P. trituberculatus, expression in the YO was not included in the analysis (22).

2. CFRC expression may change over the molt cycle, reflecting the decrease in sensitivity of the YO to MIH during mid- and late premolt (2). In G. lateralis YO, MIH signaling genes, such as adenylyl cyclases, protein kinase A, nitric oxide synthase, calcineurin, and protein kinase G, are downregulated during premolt (43). Gl-CFRC-A34a2, -A34b1, and -A34b2 show different patterns of relative expression over the molt cycle, with Gl-CFRC-A34b1 showing a pattern consistent with the downregulation of other MIH signaling genes. Expression of Gl-CFRC-A34b1 (formerly Gl-GPCR-A9) is highest at intermolt, decreases during premolt, and is lowest at postmolt (17). Expression of Gl-CFRC-A34b2 (formerly Gl-GPCR-A10) is highest during premolt and is lowest at postmolt (17). Expression of Gl-CFRC-A34a2 (formerly Gl-GPCR-A12) is low at intermolt, early premolt, and mid-premolt, highest at late premolt, and lowest at postmolt (17). It is worth noting that GPCRs are generally expressed at very low levels (17), suggesting that any change in expression may not translate to meaningful changes in the number of receptors in the membrane. For example, binding of radiolabeled Cm-MIH to C. maenas YO membrane preparations is not affected by molt stage (95).

3. The high conservation of brachyuran MIH and CHH sequences and structure, as well as biological activity, suggests a strong ligand/receptor co-evolution. For example, an antibody raised against a conserved N-terminal peptide sequence in Gl-MIH (amino acid residues #7 to #20 in the mature peptide) cross-reacts with Cm-MIH (86).

Conclusions
The MIH receptor is a critical component of the signal transduction pathway that regulates YO ecdysteroid synthesis (2,6). Assuming that the MIH receptor is a Class A GPCR, the challenge has been identifying potential candidates from among the large number of YO Class A GPCRs for functional analysis (6,17,23). Phylogenetic analysis has been used to characterize homologs of Bombyx ITP GPCRs in decapod transcriptomes. Previous studies have used this approach, mostly identifying homologs in the A34 clade (17,19,21-24). Phylogenetic analysis with the CrusTome database greatly expanded the number of CFRC homologs in the Crustacea, which were organized into a classification nomenclature corresponding to the Bombyx ITP BNGR-A2, -A24, and -A34 phylogeny (Figure 1, Table 4, and Supplementary Data 2). This nomenclature provides a framework for characterizing new homologs/orthologs as more transcriptomic data become available. A total of 11 CFRC subclades were identified in decapod crustaceans, although none of the 37 decapod species expressed all 11 (Table 4). This suggests that expression of certain CFRCs is restricted to specific tissues, enabling target tissues to respond to neuropeptides that control physiological processes, such as molting, reproduction, metabolism, ion and water balance, and responses to environmental stress (3,5,10). Analysis of the ECL2 and ECL3 regions, which mediate ligand binding, identified motifs that can be used to distinguish members of the A2, A24, and A34 clades and subclades (Table 5; Figures 5, 6). Structural modeling of the G. 
lateralis CFRCs showed that the ECL2 of A34b1 and A34b2 had a second b-sheet not found in hexapod and other crustacean GPCRs. The two b-sheets form a deep pocket on the extracellular surface of the receptor to accommodate large neuropeptides, such as CHH and MIH. Conserved residues in both b-sheets may stabilize neuropeptide binding with the receptor. These studies, in concert with earlier YO expression analyses, support prioritizing the A34b CFRC subclades as potential MIH receptor(s) for functional assays and structural modeling simulations of ligand/receptor binding.

FIGURE 1 Phylogeny of ITP GPCR homologs in crustaceans, depicted as a circular cladogram, showing the major clades following the Bombyx mori nomenclature for class A GPCRs: A2 (yellow), A24 (blue), A34 (green). The positions of Bombyx mori reference sequences and Gecarcinus lateralis homologs are indicated by a blue circle and a red star, respectively. Maximum-likelihood phylogenetic reconstruction was performed with IQtree2 and a JTT+R9 model of evolution, and a total of 424 best Nearest Neighbor Interchange optimization iterations; branch support was assessed via 10,000 UltraFast bootstrap approximations and an aBayes parametric test (see Materials and Methods). Support values for the depicted splits are the following (aBayes/UFboot): A = 1/100; B = 0.999/90; C = 1/100; D = 1/100; E = 0.997/79; F = 1/100; and G = 1/98. Support values within clades are shown in Figures 2-4. Full annotated phylogeny available in Supplementary Data 4.

FIGURE 2 Expanded phylogram of the A2 clade from the phylogenetic tree in Figure 1. The Bombyx mori Bommo_BNGR_A2 reference sequence and G. lateralis homolog are indicated by blue and red font colors, respectively. Maximum-likelihood phylogenetic reconstruction was performed with IQtree2 and a JTT+R9 model of evolution, and a total of 424 best Nearest Neighbor Interchange optimization iterations; branch support was assessed via 10,000 UltraFast bootstrap approximations and an aBayes parametric test. Full annotated phylogeny available in Supplementary Data 4.

FIGURE 3 Phylogram of the A24 clade and subclades. The Bombyx mori Bommo_BNGR_A24 reference sequence and G. lateralis homologs are indicated by blue and red font colors, respectively. Subclades with the b designation represent crustacean-specific lineages that do not include hexapods. Maximum-likelihood phylogenetic reconstruction was performed with IQtree2 and JTT+I+G4 as the best-fit model and a total of 274 best Nearest Neighbor Interchange optimization iterations. Branch support was assessed via 10,000 UltraFast bootstrap approximations and an aBayes parametric test. Full annotated phylogeny available in Supplementary Data 4.

FIGURE 4 Phylogram of the A34 clade and subclades. The G. lateralis homologs are identified by red font color. The Bombyx mori Bommo_BNGR_A34 reference sequence is within the collapsed Hexapoda clade. Subclades with the b designation represent crustacean-specific lineages that do not include hexapods. Additionally, subclade A34b2 is restricted to true crabs (Malacostraca: Decapoda: Brachyura). Maximum-likelihood phylogenetic reconstruction was performed with IQtree2 and JTT+F+R6 as the best-fit model, and a total of 685 best Nearest Neighbor Interchange optimization iterations. Branch support was assessed via 10,000 UltraFast bootstrap approximations and an aBayes parametric test. Full annotated phylogeny available in Supplementary Data 4.

FIGURE 10 Structure of the ligand-binding region of G. lateralis A34b1 (A) and A34b2 (B) CFRCs. Ribbon diagrams include the side chains of conserved amino acids in the b-sheets of the ECL2 region (YRxxYxxxWxxFTxWxCDExFP in b-sheet #1; VGCVVTY in b-sheet #2). In A34b1, C352 in b-sheet #1 formed a disulfide bridge with C273 in TMM6. In A34b2, C287 in b-sheet #1 formed a disulfide bridge with C208 in TMM6. The distal region of b-sheet #2 had a conserved cysteine located at position #368 in A34b1 and at position #303 in A34b2. Two acidic residues (D353 and E354 in A34b1 and D288 and E289 in A34b2) were located at the bottom of the pocket formed by the b-sheets. Images were produced using the Chimera package. Intracellular regions are not shown for clarity.
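The b-sheet #1 motif in the Figure 10 legend is written with "x" for any residue, so it can be converted mechanically into a search pattern. The minimal Python sketch below does this and reports the positions of the cysteine and the two acidic residues (the "CDE" run) within a sequence. The function names `motif_to_regex` and `acidic_positions` and the toy sequence are hypothetical; the toy sequence is not a real CFRC, so the reported positions do not correspond to C352/D353/E354 or C287/D288/E289.

```python
import re

# b-sheet #1 motif from the Figure 10 legend; lowercase "x" means any residue.
SHEET1_MOTIF = "YRxxYxxxWxxFTxWxCDExFP"


def motif_to_regex(motif):
    """Compile a motif written with lowercase 'x' wildcards into a regex."""
    return re.compile("".join("." if ch == "x" else re.escape(ch) for ch in motif))


def acidic_positions(seq, motif=SHEET1_MOTIF):
    """Return 1-based positions of the C, D, and E of the 'CDE' run, or None if no match."""
    match = motif_to_regex(motif).search(seq)
    if match is None:
        return None
    start = match.start() + motif.index("CDE")  # 0-based index of the cysteine
    return {"C": start + 1, "D": start + 2, "E": start + 3}


# Hypothetical toy sequence (assumption): glycine flanks around a filled-in motif.
toy_ecl2 = "GGGGG" + "YRAAYAAAWAAFTAWACDEAFP" + "GG"

if __name__ == "__main__":
    print(acidic_positions(toy_ecl2))
```

Applied to full-length receptor sequences, the same search would give residue numbers directly comparable to those in the legend, provided the sequence numbering matches that of the structural model.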

TABLE 1
Classification of CHH Family GPCR candidates in seven decapod species.
DNA and amino acid sequences of the contigs are given in Supplementary Data 2. aa, amino acids; bp, base pairs; CNS, central nervous system; and YO, Y-organ. 1 Asterisk (*) indicates partial sequence; ORF incomplete.

TABLE 4
Classification of CHH family receptor candidates in decapod species.

TABLE 5
Sequences of conserved motifs in the ECL2 and ECL3 of decapod CFRCs.

TABLE 6
Consensus sequences of brachyuran MIH and eyestalk CHH isoform mature peptides showing conserved amino acids.