Computational Identification of Functional Centers in Complex Proteins: A Step-by-Step Guide With Examples

In proteins, functional centers consist of the key amino acids required to perform molecular functions such as catalysis, ligand-binding, hormone- and gas-sensing. These centers are often embedded within complex multi-domain proteins and can perform important cellular signaling functions that enable fine-tuning of temporal and spatial regulation of signaling molecules and networks. To discover hidden functional centers, we have developed a protocol that consists of the following sequential steps. The first is the assembly of a search motif based on the key amino acids in the functional center followed by querying proteomes of interest with the assembled motif. The second consists of a structural assessment of proteins that harbor the motif. This approach, that relies on the application of computational tools for the analysis of data in public repositories and the biological interpretation of the search results, has to-date uncovered several novel functional centers in complex proteins. Here, we use recent examples to describe a step-by-step guide that details the workflow of this approach and supplement with notes, recommendations and cautions to make this protocol robust and widely applicable for the discovery of hidden functional centers.


KEY POINTS
-Functional centers have key roles in catalysis, ligand-binding, hormone-and gas-sensing, and are difficult to identify through standard homology approaches. -Functional centers are often hidden in complex multi-domain proteins where they perform important molecular and cellular functions thereby enabling temporal and spatial modulation of signaling molecules and networks. -Here we present a method for the detection of functional centers and provide a step-by-step guide that systematically describes the workflow and the computational tools employed in this protocol. -This protocol also provides informative notes, recommendations and cautions at the appropriate steps to allow for a broad and robust application.

INTRODUCTION
Functional centers are at the core of domains that contain the key amino acids required to perform a molecular function including, but not limited to, catalysis, ligand-binding, hormone-and gassensing . They are often embedded within complex multi-domain proteins and offer cryptic but important cellular signaling functions that afford intricate temporal and spatial regulation of signaling molecules and their signaling networks (Brady et al., 2007;Levskaya et al., 2009;Irving et al., 2012;Zhang and Ma, 2012;Freihat et al., 2014;Muleya et al., 2014). The lack of overall conservation of functional centers and the fact that they are often only a small part of complex proteins, have left them undetected by standard homology searches (Ludidi and Gehring, 2003;Guo and Fang, 2014;Wong et al., 2018). For instance, in higher plants, enzymes that generate and degrade cyclic mononucleotides or heme-based gassensing proteins were not identified until recently although their corresponding homologs exist in diverse groups of organisms ranging from bacteria to animals and humans (Martinez-Atienza et al., 2007;Domingos et al., 2015;Gehring and Turek, 2017;Swiezawska et al., 2018;Wong et al., 2020;Swiezawska-Boniecka et al., 2021;Turek and Irving, 2021). This is perplexing since the products and biological effects of the functional centers such as cyclic nucleotides-mediated cellular events (Bowler et al., 1994;Neuhaus et al., 1997;Durner et al., 1998;Maathuis and Sanders, 2001;Donaldson et al., 2004;Ederli et al., 2009;Isner and Maathuis, 2011;Pasqualini et al., 2011;Isner et al., 2012;Hartwig et al., 2014;Hussain et al., 2016;Marondedze et al., 2016) and nitric oxide mediated pollen tube chemotropic responses (Feijo et al., 2004;Prado et al., 2004Prado et al., , 2008McInnis et al., 2006;Wang et al., 2009Wang et al., , 2012Pasqualini et al., 2011Pasqualini et al., , 2015Domingos et al., 2015;Wong et al., 2020Wong et al., , 2021 have long been documented. Lately, a fundamentally different approach to the discovery of these molecular functions has been employed and has identified novel signaling components in complex proteins (Wong and Gehring, 2013b;Wong et al., 2015Wong et al., , 2018. Based on the assumption that in complex multi-domain proteins only amino acids that directly perform a molecular function are conserved in the function centers, consensus sequence motifs that include only these key conserved amino acid residues, have been constructed and applied (Ludidi and Gehring, 2003;Gehring, 2010;Wong and Gehring, 2013a). These motifs are constructed by alignments of annotated functional centers from different and distantly related species and can then be used to query target proteomes-e.g., a model plant like Arabidopsis thaliana-to retrieve candidate proteins. The candidate proteins can subsequently be further assessed using homology modeling and molecular docking simulations prior to experimental testing. Functional annotation can also be done through the construction of sequence profiles on PROSITE which is a database of protein domains, families, and functional sites (Sigrist et al., 2013). These regions are better conserved throughout evolution especially in proteins from the same families and have largely similar three-dimensional structures which are crucial for a common molecular function (Wu et al., 2003;Lee et al., 2007;Marchler-Bauer et al., 2011;Mahlich et al., 2018). Functional centers deriving from annotated domains are either less obvious or have evolved beyond recognition in complex proteins because they often occupy a small part (< 5%) of the entire protein and only harbor amino acids that are critical for functionality, which may explain why they were unaccounted for during de novo and homology sequence annotation. As they get incorporated into complex proteins of varying primary domains, their tertiary structures may also differ (Jeffery, 1999;Irving et al., 2018;Turek and Irving, 2021) (see Conclusion and Future Perspective for a specific example).
The sequential use of a motif search and structural assessment can overcome the challenges associated with identifying functional centers in complex proteins and indeed, several been identified in recent years as emerging evidence also suggests that many more await discovery (Turek and Gehring, 2016;Ooi et al., 2017;Wheeler et al., 2017;Al-Younis et al., 2018;Chatukuta et al., 2018;Bianchet et al., 2019;Freihat et al., 2019;Ruzvidzo et al., 2019;Wong et al., 2020). Given the highly varied nature of the functional centers, an automated pipeline is currently not feasible for all applications. Here, we use recent examples to document a step-by-step guide of the workflow and supplement it with detailed application notes, recommendations, and cautions. This will enable users to seamlessly apply this approach to the discovery of novel hidden functional centers in reference proteomes.

Required Software
In this protocol, the computer program required to generate the protein 3D models by homology modeling, is MODELLER (Sali and Blundell, 1993) which is available for download at https://salilab.org/modeller/download_installation.html. Users are required to register with a valid institutional email at https:// salilab.org/modeller/registration.html prior to installation. To simulate docking of small molecules to the functional centers of the 3D models, AutoDock Vina (Trott and Olson, 2010) is used. This open-source program is available for download at http://vina.scripps.edu/download.html. AutoDockTools (ADT) (Morris et al., 2009) is a required graphical front-end for setting up and running Vina and can be downloaded as a package known as MGLTools at http://mgltools.scripps.edu/downloads. The package includes structure analysis and visualization programs such as Python Molecular Viewer (PMV) (Sanner, 1999). Alternatively, UCSF Chimera (Pettersen et al., 2004), which is available for download at https://www.cgl.ucsf.edu/chimera/ download.html, can also be used to visualize protein structures and docking outcomes.

General Workflow
The computational approach to identify functional centers in complex proteins includes five general steps which are detailed in this section (Figure 1).
Step I: Alignment of Functional Centers The first step of this approach involves the alignment of functional centers from a wide range of distantly related FIGURE 1 | A general workflow for the computational identification of functional centers in complex proteins. The approach begins with an alignment of functional centers from proteins across species and followed by the construction of a consensus sequence that includes only conserved key amino acids which are separated by gaps as determined from the alignment. The consensus sequence now serves as a search motif to be queried on various databases to identify candidate functional centers and after which, they are screened on webtools if available, to assign confidence levels to the retrieved candidates. In the final step, top candidates are subjected to structural assessments that include model generations and docking simulations.
organisms. NOTE: The functional centers do not have to come from orthologs, and it is important to include sequences of experimentally validated proteins preferably from both prokaryotes and eukaryotes. This will increase the confidence of prediction by the amino acid motif constructed in the next phase. Functional centers may include catalytic sites, ligand-and hormone-binding sites or gas-sensing regions, as long as they are annotated as having a specific molecular function. NOTE: Only regions that directly participate in the molecular function should be considered as this protocol assumes a minimalistic strategy for functionality in highly diverse protein architectures. CAUTION: Functional centers typically range from 12 to 50 amino acids; an overly long sequence may reduce the chances of identifying promising candidates while a shorter sequence increases the chances of false positives.
Step II: Motif Construction From the aligned sequences, highly conserved amino acids at each position in the alignment are included in a consensus sequence for the particular functional center. Amino acids that have been experimentally proven to perform key functions such as direct binding to ligands or essential for preserving certain charge and spatial configurations at the centers, are also included. These amino acids are indicated in square brackets []. NOTE: Key amino acids performing crucial molecular functions are also normally highly conserved at the functional centers of non-orthologous sequences and it is not uncommon for two or three amino acids with similar physicochemical properties to occupy a position. RECOMMENDATION: To be inclusive, we recommend including amino acids with similar chemical properties in the consensus sequence e.g., [RK] or [DE]. Detailed description of motif construction and its application have been described in Wong and Gehring (2013a,b), Wong et al. (2015Wong et al. ( , 2018. Amino acids between the conserved residues and whose biochemical functions are unknown, may be excluded from the consensus sequence. They are indicated as "X" which stands for any of the 20 amino acids. From the aligned sequences, "X" may be several amino acids long and could assume differing ranges in centers from different species or non-orthologous sequences and the gap size is noted in brackets e.g., (N,M) or {N,M}, with "N" and "M" representing the minimum and maximum number of amino acids. CAUTION: Consult the pattern syntax of the respective tools to avoid errors.
After evaluating the conserved and non-conserved amino acids at each position of the alignment, the consensus sequence for the particular functional center is ready for application. CAUTION: Check the consensus sequence for errors and inconsistencies, e.g., odd or contrasting amino acid properties, at each position. RECOMMENDATION: A 12-50 amino acid long consensus sequence is typically a good starting point and motifs can be modified to increase or loosen stringency. NOTE: Shorter sequences can be considered if there are many specifically conserved residues and longer sequences can be considered if the conserved residues are too few and far apart. For a specific example of motif construction, please refer to the "Identification of heme-containing gas sensors in complex proteins" section in Results and Discussion and the sequence alignment of H-NOX domains in Supplementary Figure 1 or contact the corresponding author directly to obtain the most up to date datasets and output files.
Step III: Database Query The consensus sequence now serves as a search motif to query any protein databases for candidate proteins harboring given functional centers. RECOMMENDATION: We recommend the use of the ScanProsite tool (Gattiker et al., 2002) available at: https://prosite.expasy.org/scanprosite that allows a search against protein database for known motifs (option 1) or with custom motifs (option 2). For proteins from Arabidopsis thaliana, the PatMatch tool (Yan et al., 2005) available at: https://www. arabidopsis.org/cgi-bin/patmatch/nph-patmatch.pl can be used. CAUTION: Consult the pattern syntax of the respective tools to avoid errors. NOTE: If the retrieved candidates are few, the consensus sequence (see Step II) can be revisited to relax the stringency of the motif by varying the gap sizes between conserved amino acids and/or by adding amino acids with similar chemical and physical properties to the conserved positions (e.g., instead of [IL] expand to [VIL]). CAUTION: A long list of candidates may indicate high false positive rates but at this point, this is not a concern because the retrieved hits can be conveniently screened and filtered with various webtools that provide statistics for confidence levels and candidate rankings in the next step. The motif only needs to be revised to increase stringency if no such webtools are currently available.
Step IV: Webtool Screening and Filtering Candidates retrieved from the motif search can be screened with predictive tools for the particular molecular function if such tools are available. These webtools consider the physical and chemical properties of non-conserved amino acids in experimentally validated functional centers in addition to their conserved residues where their algorithms generate statistics that compare the queried sequence to the mean values of the experimentally validated pool of proteins (Xu et al., 2018a,b). Users can therefore use these statistics to rank retrieved hits from high to low confidence levels. RECOMMENDATION: We recommend the use of these webtools not only to increase confidence in the retrieved candidates but also to enable high-throughput processing of a long list of candidates. CAUTION: Although these webtools are designed to predict hits with high confidence, they remain predictive in nature and thus may not detect some positive hits especially in poorly characterized proteins. RECOMMENDATION: If the number of retrieved candidates is small and there are biochemical and cellular justifications supporting the functions of those hits, we recommend users to proceed with the next step involving structural assessment even if the webtool screening returns no hits. NOTE: At this point, it may be advantageous to inspect the candidate list and sort them into gene ontology (GO) categories to help selection of proteins with functions of particular interest before proceeding to the structural examination.
Step V: Structural Assessment Candidates will now be assessed structurally and since this is the most computationally intense part and requires the most interpretation, the list of candidates should ideally be small. RECOMMENDATION: For a longer list of candidates, the top 5-10 proteins determined with the webtool (Step IV) can be selected for structural assessments. The assessment includes the generation of three-dimensional (3D) structures in case there are no known crystal structures for the candidates and conducting molecular docking simulations to predict functionality. NOTE: This phase requires installation of several software such as MODELLER, AutoDock Vina, MGLTools and UCSF Chimera, and a level of familiarity with these software. RECOMMENDATION: Users are recommended first to familiarize themselves with the operations of the required software packages before attempting this step, as this is a crucial phase for predicting functionality.
3D models should first be generated for candidates that have no known crystal structures by homology modeling using MODELLER (Sali and Blundell, 1993). This software generates protein 3D structures from amino acid sequences based on spatial restraints in template structures provided by the user. NOTE: See required software section for installation and registration guide. The five steps in homology modeling by MODELLER are: selection of crystal structures related to the candidate, template structure selection, alignment of candidate amino acid sequence to the template, model generation, and model evaluations. A detailed tutorial, the required program files, example input and output files in.zip format (for Windows) or.tar.gz format (for Unix/Linux), and the relevant scripts to run MODELLER can be found at: https://salilab.org/modeller/tutorial/basic.html. NOTE: These are the complete steps for homology modeling, but users can choose to perform alignments and template selection with other programs such as BLASTp (McGinnis and Madden, 2004) available at: https://blast.ncbi.nlm.nih.gov/Blast. cgi?PAGE=Proteins. If using BLASTp, the "Protein Data Bank" database should be selected to only retrieve results with known crystal structures. CAUTION: Selection of template structure is a critical step in determining the quality of the model. If using BLASTp, a crystal structure with high identity with the queried candidate in the particular region where the functional center resides will be more suitable than another structure with higher overall coverage and/or max. score but lower identity at the corresponding functional center region. This is particularly relevant since moonlighting functional centers constitute only a small part of complex multi-domain proteins (Su et al., 2019).
Once a high-quality 3D model is obtained, molecular docking simulation can be performed using AutoDock Vina (Trott and Olson, 2010) which is an automated docking software for the prediction of the binding ligands to a receptor protein. NOTE: See required software for download and installation guides. CAUTION: Prior to running AutoDock Vina, necessary files must be prepared using AutoDockTools (ADT) (Morris et al., 2009) which is a graphical front-end software for setting up and running AutoDock. RECOMMENDATION: We recommend users to download the entire MGLTools package which include ADT and Python Molecular Viewer (PMV) (Sanner, 1999) for visualization of protein structures. Alternatively, users can also install UCSF Chimera (Pettersen et al., 2004) which is a tool for interactive visualization and analysis of molecular structures. All software downloads and installations are detailed in the software description. A detailed step-by-step tutorial on ".pdbqt" file preparation with ADT, example input and output files in.zip format, and running docking simulations using Vina and PMV is available at: http://vina.scripps.edu/tutorial.html. CAUTION: When setting up the grid box, the position and size determination is critical for the success of molecular docking. An overly small or off-position grid box will result in failure to identify suitable binding poses during docking simulation. The grid box volume should not only cover the entire area of the functional center but also be large enough to allow free rotation of the ligand at its most extended configuration.
After docking simulations with AudoDock Vina, binding poses can be evaluated on PMV or Chimera. By default, AutoDock Vina will rank binding poses based on free energies. NOTE: Binding poses of ligands at some functional centers are already known based on experimental data and, in such instances, these ligand orientations should be given priority over their free energy calculations. Further docking simulations can be conducted to evaluate the effect of mutations of key residues at the functional centers.

RESULTS AND DISCUSSION
Using the general workflow as a guide, we describe specific applications using recent examples to detail the step-by-step process that led of the identification of catalytic centers and gas-sensing and hormone-interacting sites in complex proteins.

Identification of Catalytic Centers
Cyclic mononucleotide cyclases (guanylate cyclase and adenylate cyclase; GC and AC) are enzymes that catalyze the conversion of GTP and ATP to their cyclic forms, cGMP, and cAMP, respectively. In line with the concept of functional centers , only catalytic centers of canonical GCs and ACs are conserved non-orthologous cyclases. Currently available experimental data indicate that these enzymatic centers are often part of complex proteins with other primary domains and highly varied domain architectures (Figure 2).

A
Step-by-Step Guide for the Identification of Cyclic Mononucleotide Cyclases 1. Alignment of catalytic centers of GCs or ACs from organisms across species including prokaryotes and eukaryotes. NOTE: For a specific example of motif construction, please refer to the "Identification of heme-containing gas sensors in complex proteins" section in Results and Discussion and the sequence alignment of H-NOX domains in Supplementary Figure 1 or contact the corresponding author directly to obtain the most up to date datasets and output files. 2. Determining the consensus sequence of GC or AC catalytic centers by including only the conserved key amino acids and separated by rational gaps made up of non-conserved amino acids. NOTE: The current consensus sequences for GC and AC functional centers are 14-amino acids long with negatively charged amino acids D or E (written as [DE] in the motif to indicate that both amino acids can function in this position) 0-3 positions downstream of the centers (Figure 3A)  tools to avoid errors. Derivatives of these motifs justified by experimental evidence or rational modifications have also been described in detail (Kwezi et al., 2007;Mulaudzi et al., 2011;Bianchet et al., 2019;Ruzvidzo et al., 2019).

Application of GCPred and/or ACPred webtools [available
at: http://gcpred.com (Xu et al., 2018a) and http://gcpred. com/acpred (Xu et al., 2018b)] for the assessment of confidence levels of retrieved candidates. These webtools , assumes an alpha-helical secondary fold that is followed by a loop. At the tertiary level, the AC center typically forms a clear cavity that could dock with the substrate ATP in a binding pose where the adenine points into the cavity toward the amino acid at the first position of the motif, and the phosphate points outwards toward the positively charged [KR] amino acid at the solvent exposed region of the cavity. Negatively and positively charged amino acids that are crucial for the interactions with the substrate are colored red and blue, respectively. (B) The putative phosphodiesterase center in an Arabidopsis potassium channel AtKUP5 identified through a 27-47 amino acid long search motif, assumes an alpha-helical secondary fold that is followed by a loop which forms the latch region enclosing the docked cAMP substrate within a distinct cavity . Negatively and positively charged amino acids crucial for the interactions with the substrate are colored red and blue, respectively.
contain algorithms that compare the physicochemical values of amino acids such as their molecular weights, isoelectric points and hydrophobicity to mean values of experimentally validated functional centers. RECOMMENDATION: We propose selecting the "cation-binding" option because current experimental data indicate that GCs and ACs of such nature have higher activities when these negatively charged amino acids [DE] are present at 1, 2, or 3 amino acids downstream of the functional centers. NOTE: According to the interpretation guide (Xu et al., 2018b), a score of 0-1 for each physicochemical parameter is generated, where 1 is closest to the mean of experimentally validated GC or AC functional center. Green scores represent "high" confidence and red ones represent "low" confidence. We also recommend that the users select candidates with at least two green physicochemical values in addition to the compulsory green overall mean score. Good candidates should contain no red scores. CAUTION: Although these webtools are designed to predict hits with high confidence, they remain predictive in nature and thus may not detect some positive hits especially in lesser studied proteins. RECOMMENDATION: If the retrieved list of candidates is small and there are justifications, e.g., from the literature, supporting the plausibility of the functions, we recommend that users proceed with the next phase involving structural assessment.

Model generation and docking simulation with MODELLER
and AutoDock Vina to evaluate the structures of individual candidates and to predict the binding of substrate i.e., GTP or ATP, since this is a prerequisite for catalysis. NOTE: Based on experimentally validated GC and AC centers, they contain a characteristic alpha-helix barrel followed by a loop secondary fold (Wong and Gehring, 2013b;Wong et al., 2015Wong et al., , 2018Al-Younis et al., 2018) (Figure 3A). Furthermore, experimentally validated GC and AC centers have the following substrate binding pose: purine bases adenine or guanine facing into the cavity toward amino acid at the first position of the motif in the hydrophobic interior while the phosphate end facing the positively charged amino acid [KR] at the opening of the cavity (Al-Younis et al., 2018;Ruzvidzo et al., 2019). If structural evaluations determine that such substrate orientations are feasible in new candidates, then the confidence level is further increased.
In general, candidates retrieved from the motif search that rank highly in the webtool predictions and display the described structural features, will be considered suitable candidates for experimental verifications to ascertain their GC or AC activities both in vitro and in vivo.
Another example to illustrate the search for enzymatic functional centers are cyclic nucleotide phosphodiesterases (PDEs) which are enzymes that degrade the cyclic mononucleotides cGMP or cAMP. Much like nucleotide cyclase functional centers, PDEs may only retain the key conserved amino acids at the catalytic centers in complex multi-domain proteins and are therefore beyond the detection limit of BLAST searches. The steps involved in the identification of PDE centers follow the same guide as the cyclic mononucleotide cyclases. NOTE: Currently, the motif for PDE center is 27-47 amino acid long and is determined to be [YFW]Hx [YFW]Rx(20,40)[HRK] [DE] (Figure 3B). CAUTION: The current PDE motif is stringent and identifies candidates with high confidence. As experimental evidence for PDE centers begin to emerge, the motif may eventually be relaxed to allow for the identification of novel candidates. In the absence of webtools for the prediction of PDE centers, retrieved candidates will be structurally evaluated in the next step. NOTE: Current model for PDE centers has a distinct cavity that accommodate cyclic mononucleotides within a latch. The PDE motif can form an alpha-helix secondary fold that is followed by a loop that makes up the latch   (Figure 3B). Since PDE centers are less characterized than the centers of mononucleotide cyclases, the motif and structure presented here may yet undergo modifications that will improve prediction accuracy and coverage of hidden PDE centers.

Identification of Heme-Containing Gas Sensors in Complex Proteins
Heme-nitric oxide/oxygen (H-NOX) centers are hemecontaining regions that can sense gases including the signaling molecule nitric oxide (NO). In line with the concept of functional centers , only amino acids at the H-NOX center occupied by the heme moiety, are conserved. The steps involved in the identification of heme-containing gas sensors follow the same guide as the cyclic mononucleotide cyclases. In this case, we show how an H-NOX consensus sequence can be constructed from the alignment of the heme-binding centers of H-NOX proteins from prokaryotes and eukaryotes. From the alignment, the "H" and "P" amino acids, as well as the "YxSxR" signature, are highly conserved across distantly related species ( Figure 4A). Furthermore, these amino acids have been determined experimentally to have direct heme-and/or gasbinding functions at the H-NOX centers. For instance, the "H" residue in the motif serves as the distal ligand for NO (Olea et al., 2010) while the YxSxR signature stabilizes the porphyrin ring (Pellicena et al., 2004). Therefore, these amino acids are included in the consensus sequence with the non-conserved flanking amino acids that are not known to play functional roles at the H-NOX centers, assigned as "X" which stands for any of the 20 amino acids, and the gap size noted in brackets (N,M) or {N,M}, with "N" and "M" representing the minimum and maximum number of amino acids (Figure 4A). The constructed H-NOX consensus sequence was then used as a search term to identify heme-containing gas sensors in plants by querying the proteome of Arabidopsis thaliana and from which three candidates have since been experimentally confirmed to sense NO. NOTE: The current consensus sequence H-NOX center is 33-35 amino acids long and is determined to be Hx(12)Px(14,16)YxSxR (Figure 4). CAUTION: Consult the pattern syntax of the respective tools to avoid errors. Since there is currently no known webtools for the prediction of H-NOX centers, retrieved candidates will be directly subjected to structural evaluations in the next step. Current experimental data indicate that these H-NOX centers  (Zarban et al., 2019), identified through a 33-35 amino acid long search motif, assumes a long loop that wraps the docked heme-Fe moiety within a clearly defined pocket. The "H" residue in the motif is the distal ligand that binds to the iron and the YxSxR signature stabilizes the heme through hydrogen bonding. Negatively and positively charged amino acids crucial for the interactions with the substrate are colored red and blue, respectively. exist in complex proteins with other primary domains and varied architectures (Figure 4B). Mutations to either the "H" or "Y" residues impaired heme binding and concomitantly also the ability to sense NO ( Figure 4C) (Zarban et al., 2019;Wong et al., 2020). In general, candidates retrieved from the motif search that display the described structural features, will  (Ooi et al., 2017) identified through a 26-28 amino acid long search motif occupies a clear cavity that could dock with ABA with the "Y" and "K" residues being crucial for maintaining ABA affinity. Negatively and positively charged amino acids crucial for the interactions with the substrate are colored red and blue, respectively. be considered suitable candidates for experimental verifications to ascertain their gas-binding affinities both in vitro and in vivo. Currently, H-NOX motif is intentionally constructed to be rigid to identify candidates of high confidence especially in the absence of other predictive tools. Relaxation of the motif based on more data are likely to identify more heme-containing H-NOX proteins . RECOMMENDATION: We recommend omitting the "P" residue from the motif to increase coverage as this residue appears the least essential in terms of functionality.

Identification of Hormone Binding Sites
In this case, we use the amino acids directly involved in binding to the plant hormone abscisic acid (ABA) from the well-characterized START protein family PYR/PYLs (Melcher et al., 2009;Park et al., 2009), to identify other plant proteins that can bind to or interact with ABA ( Figure 5A). The steps involved in the identification of ABA-interacting centers follow the same guide as the cyclic mononucleotide cyclases. NOTE: The current consensus sequence for the ABA-interacting center is 26-28 amino acids long and is determined to be [DE]x(7,8)Rx(3,4) [DE]x(5)Yx(6)H (Figure 5). CAUTION: Consult the pattern syntax of the respective tools to avoid errors. Since there is currently no known webtools for the prediction of ABA-interacting centers, retrieved candidates will be directly subjected to structural evaluations in the next step. NOTE: Based on experimentally validated proteins containing the ABAinteracting centers, the "Y" residue in the motif together with the "K" located three amino acids upstream, are crucial for ABA binding (Figure 5B). Mutations to either the "Y" or "K" residues impair ABA binding and concomitantly other functions of the candidate protein e.g., K + transport (Ooi et al., 2017). Currently, the ABA-interacting motif is intentionally constructed to be relatively rigid to identify candidates of high confidence especially in the absence of predictive tools, but it can conceivably be made more relaxed to identify more ABA-interacting proteins. RECOMMENDATION: We recommend increasing the gaps upstream of the "Y" and the "H" residues in the motif to allow for a more inclusive prediction.

CONCLUSION AND FUTURE PERSPECTIVE
We have developed a general workflow and presented a step-by-step guide for the applications of this computational approach to identify functional centers in complex proteins using recent examples as case studies. While this protocol leverages on existing software and follows standard methods in protein functional annotations, the focus is however on the identification of functional centers that unlike canonical domains, consist of only the key amino acids required to perform certain molecular functions in complex proteins. Instead of detecting broadly the common patterns or motifs in proteins, we derived our motifs from key residues of functional centers in existing domains that have been experimentally proven to be fully functional.
The initial motivation was to identify the corresponding signaling components of animals and bacteria (Loewenstein et al., 2009) in plant systems many of which, are long taught to be absent because there were no orthologs in plants. It was hypothesized that these domains have become markedly altered to retain only their functional centers as they get incorporated into complex multi-domain proteins such as that in the wellcharacterized Arabidopsis BRASSINOSTEROID INSENSITIVE 1 (AtBRI1; UniProt ID: O22476). The 1,196 amino acid long AtBRI1 consists of an extracellular brassinosteroid receptor region, a transmembrane, and an intracellular kinase domain with the GC center embedded within the primary kinase domain. Both the hormone receptor region and kinase domain take up ∼62.7 and 23.0% of the protein while the GC center makes up only 0.12% of the entire protein (Supplementary Figure 2). Due to the diverse architecture of complex proteins, this protocol identifies functional centers that are more targeted and may not necessarily have similar 3D structures as it would be with canonical domains. Their incorporation into multi-domain complex proteins of varying primary functions, make these functional centers "hidden" or undetectable, hence not annotated in protein domains and families of existing databases.
Our application-specific CAUTIONS, NOTES and RECOMMENDATIONS will enable users to seamlessly apply the out-lined protocol and thereby increase the chances of success in the discovery of hidden functional centers or ligands in complex proteins as it informs users of potential pitfalls and gives suggestions for optimization. For instance, the recent identifications of functional nucleotide cyclase functional centers in crop plants such as tomato (Rahman et al., 2020) and rice (Malukani et al., 2020), and in other economically important plants such as ornamentals Hippeastrum sp. (Swiezawska et al., 2015(Swiezawska et al., , 2017 and Japanese morning glory (Szmidt-Jaworska et al., 2009) and in the monocot grass model Brachypodium (Swiezawska et al., 2020), as well as in a human immune-responsive protein IRAK3 (Freihat et al., 2019), have demonstrated the robustness of this approach and its significance in biological discoveries across different systems. Furthermore, the method enables candidate selections for follow-up in vivo and in planta studies that will eventually reveal the biological roles of these functional centers such as those reported in Joudoi et al.  Lee et al. (2020), and Turek et al. (2020). We foresee that emerging experimental data will inform and strengthen motif refinement efforts, and enable the development of modern machine learning techniques that incorporate multiple features ranging from the classical physicochemical properties of protein domains and protein-protein interaction (PPI) networks to GO based function predictions, to not only automate annotations for uncharacterized proteins, but also to identify hidden functional centers in complex multi-functional proteins (Rifaioglu et al., 2019;Bonetta and Valentino, 2020;Cai et al., 2020;Littmann et al., 2021).

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
AW conceived the project. WC, WZ, WS, WD, JW, and AW collected the data. AW, XT and CG analyzed the data. AW and CG wrote the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
Work in the authors' laboratory was supported by the National Natural Science Foundation of China (31850410470) and the Zhejiang Provincial Natural Science Foundation of China (LQ19C130001) awarded to AW.