Using Genetic Code Expansion for Protein Biochemical Studies

Protein identification has gone beyond simply using protein/peptide tags and labeling canonical amino acids. Genetic code expansion has allowed residue- or site-specific incorporation of non-canonical amino acids into proteins. By taking advantage of the unique properties of non-canonical amino acids, we can identify spatiotemporally specific protein states within living cells. Insertion of more than one non-canonical amino acid allows for selective labeling that can aid in the identification of weak or transient protein–protein interactions. This review discusses recent studies that apply genetic code expansion to protein labeling and to the identification of protein–protein interactions, and offers considerations for future work on extending genetic code expansion methods.


INTRODUCTION
Living cells contain heterogeneous, ever-changing environments that affect the expression, dynamics, and state of proteins. However, the consequences and significance of these internal cellular changes are still unclear. Here, we introduce recent progress in observing the internal conditions of living cells using genetic code expansion (GCE).
The site-specific incorporation of non-canonical amino acids (ncAAs) using orthogonal aminoacyl-tRNA synthetase (aaRS)/tRNA pairs allows observation of a specific protein (Lang and Chin, 2014; Saleh et al., 2019b). However, this method is applicable only to genetically manipulatable organisms in which the aaRS/tRNA pairs are orthogonal. In recent years, progress has been made in creating aaRS/tRNA pairs that are orthogonal in more organisms (Gohil et al., 2020), including multicellular organisms such as Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, and Mus musculus (Greiss and Chin, 2011; Bianco et al., 2012; Ernst et al., 2016; Chen et al., 2017). Stop and sense codon suppression is commonly used for site-specific incorporation of ncAAs. Although these techniques lead to site-specific incorporation of ncAAs, they are hindered by competition with intrinsic factors such as release factors and native tRNAs. Efforts over the past decade have attempted to address this problem by reassigning rare codons to encode ncAAs through genome engineering and synthesis (reviewed in Mukai et al., 2017). Previous studies have accomplished this by removing the assignment of the amber stop codon as a translation termination signal or by reassigning rare arginine codons (AGG, AGA) to encode ncAAs (Mukai et al., 2015a,b; Wang and Tsao, 2016). In a recent study, two sense codons (UCG, UCA) and the amber stop codon were completely eliminated from the synthetic E. coli MDS42 strain for GCE (Fredens et al., 2019). Various reports have utilized cell-free protein synthesis and in vitro transcribed tRNA sets to successfully eliminate or reassign rare codons to encode ncAAs as well as to swap codons for canonical amino acids (Iwane et al., 2016; Cui et al., 2017; Fujino et al., 2020; Hibi et al., 2020). These engineered strains and techniques will contribute to the observation of protein dynamics and the development of protein biochemical studies.
This review will present recent studies that develop and utilize various GCE methods to study the cellular dynamics of protein expression and interactions.
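The idea of freeing codons for reassignment, as in the elimination of UCG, UCA, and the amber codon described above, can be illustrated with a minimal Python sketch. It walks an open reading frame codon by codon and swaps each target codon for a synonymous one. The replacement codons chosen here (AGC, AGU, and UAA), the function name recode_orf, and the example sequences are illustrative assumptions and are not taken from the text; genome-scale recoding must also handle overlapping genes and regulatory elements, which this sketch ignores.

```python
# Illustrative recoding scheme (assumption): each target codon is replaced by a
# synonymous codon so that the target codon no longer appears in frame.
RECODING_SCHEME = {
    "UCG": "AGC",  # serine -> serine
    "UCA": "AGU",  # serine -> serine
    "UAG": "UAA",  # amber stop -> ochre stop
}


def recode_orf(orf: str, scheme: dict = RECODING_SCHEME) -> str:
    """Replace in-frame target codons of an mRNA open reading frame."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of three")
    # Split into in-frame codons; out-of-frame matches are left untouched.
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    return "".join(scheme.get(codon, codon) for codon in codons)


def count_target_codons(orf: str, scheme: dict = RECODING_SCHEME) -> int:
    """Count in-frame codons that the scheme would still need to replace."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    return sum(codon in scheme for codon in codons)


if __name__ == "__main__":
    toy_orf = "AUGUCGGCUUCAUAG"  # hypothetical ORF: Met-Ser-Ala-Ser-stop
    recoded = recode_orf(toy_orf)
    print(recoded)
    print(count_target_codons(toy_orf), count_target_codons(recoded))
```

Once no in-frame copies of a codon remain, that codon is in principle available to be assigned to an ncAA without competing with its original meaning.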

PURIFICATION OF NEWLY SYNTHESIZED PROTEINS
Metabolic labeling allows cellular protein synthesis to be studied under different environmental or metabolic conditions of cell growth (Beynon and Pratt, 2005). Bioorthogonal ncAA tagging (BONCAT) was developed to label newly synthesized proteins in mammalian cells for purification and identification (Dieterich et al., 2006, 2007) (Figure 1A). Azidohomoalanine (Aha) (4), a methionine (Met) analog, is charged onto tRNA Met and incorporated into newly synthesized proteins at Met codons. The azide group of Aha is used to selectively purify the proteins produced within a desired timeframe during growth of mammalian cells. Tagged proteins can subsequently be identified through mass spectrometry (Dieterich et al., 2006). Moreover, azidonorleucine (Anl) (5) has been used in various cell-type-specific labeling studies. This azide-containing Met analog has been demonstrated to be charged onto initiator tRNA Met i by a mutant methionyl-tRNA synthetase (MetRS) from E. coli (Ngo et al., 2013) and onto elongator tRNA Met by a mutant murine MetRS (Mahdavi et al., 2016) in mammalian cells for identification of newly synthesized proteins. In both studies, only expression of the mutant MetRS is required to incorporate Anl, and cell-specific labeling can be achieved via Cre-induced expression of mutant MetRS (Erdmann et al., 2015; Alvarez-Castelao et al., 2017; Evans et al., 2020). Alternatively, the alkyne-bearing amino acid homopropargylglycine (HPG) (6) has been suggested as a different Met analog for affinity purification (Landgraf et al., 2015).
While BONCAT was developed in mammalian cell cultures (Dieterich et al., 2006), the method can be adapted for other systems. The labeling method has been combined with Cre-induced MetRS expression for cell-specific labeling in mice (Alvarez-Castelao et al., 2017) and used to study protein synthesis in neurodegenerative disease mouse models (Evans et al., 2019) and during mouse and zebrafish development (Hinz et al., 2012; Saleh et al., 2019a). In each of those studies, the authors were able to identify proteins within specific timeframes and observe changes in the proteome under different conditions. In the plant Arabidopsis thaliana, BONCAT was used to isolate and identify proteins expressed under different conditions of stress via click chemistry to dibenzocyclooctyne beads and mass spectrometry. The authors were able to demonstrate that stress-related proteins were increased upon stress induction (Glenn et al., 2017).
Light-activated BONCAT is a modified version of BONCAT in which photocaged Aha (7) can only be incorporated into proteins upon uncaging by light exposure. The amount of protein labeling in HeLa cells could be controlled by varying the light intensity, and proteins in specific regions could be labeled by aiming the light source, allowing spatial and temporal control of Aha incorporation (Adelmund et al., 2018). However, a limitation of Met-analog labeling is that proteins whose sequences do not contain at least one internal Met residue will not be detected, as the N-terminal Met is often cleaved off (Wingfield, 2017). This can be circumvented by utilizing a different amino acid analog, such as the insertion of p-azido-L-phenylalanine (Azf) (8) at Phe codons in C. elegans by a mutant C. elegans phenylalanyl-tRNA synthetase (Yuet et al., 2015). Nevertheless, BONCAT provides a snapshot of proteins that are produced under various conditions. This can lead to the discovery of proteins that are only produced under specific conditions, and help researchers understand how proteins can play multiple roles.
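The Met-codon limitation noted above can be checked for a given protein with a short Python sketch. It assumes, for simplicity, that the N-terminal Met is always removed, whereas the text says only that it is often cleaved. The function names and the example sequences are hypothetical.

```python
def labeling_sites(protein_seq: str, residue: str = "M",
                   nterm_met_cleaved: bool = True) -> int:
    """Count residues available for analog incorporation in the mature protein.

    residue: one-letter code of the canonical amino acid replaced by the
             analog ("M" for Aha/Anl/HPG, "F" for Azf).
    nterm_met_cleaved: simplifying assumption that the initiator Met is removed.
    """
    seq = protein_seq.upper()
    if nterm_met_cleaved and seq.startswith("M"):
        seq = seq[1:]  # drop the initiator Met before counting
    return seq.count(residue.upper())


def detectable(protein_seq: str, residue: str = "M") -> bool:
    """True if at least one labeling site survives N-terminal processing."""
    return labeling_sites(protein_seq, residue) > 0


if __name__ == "__main__":
    no_internal_met = "MKTAYIAKQRFG"   # hypothetical sequence, only the initiator Met
    with_internal_met = "MKTAYMAKQRG"  # hypothetical sequence with an internal Met
    print(detectable(no_internal_met))          # Met analog: not detectable
    print(detectable(with_internal_met))        # Met analog: detectable
    print(detectable(no_internal_met, "F"))     # Phe analog rescues detection
```

Running the same check with residue="F" mirrors the switch to a Phe analog such as Azf: a protein that is invisible to Met-analog labeling may still carry Phe residues.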

PROTEIN VISUALIZATION USING FLUORESCENT MOLECULES
Genetic code expansion has made it possible to selectively label proteins in living cells by taking advantage of the unique properties of the encoded ncAA. A protected selenocysteine (Sec) (9) was site-specifically incorporated in the outer membrane protein eCPX (enhanced circularly permuted outer membrane protein OmpX) of E. coli by recoding the UAG stop codon (Liu et al., 2018). By taking advantage of the different pKa values of cysteine (pKa 8.3) and Sec (pKa ∼5.2) (10), selective labeling of the chemically deprotected Sec with a fluorescent dye through thiol chemistry was achieved at pH 5.0 (Liu et al., 2018).
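The basis for this pH selectivity can be made concrete with the Henderson-Hasselbalch relationship, under which the fraction of a side chain in its deprotonated (nucleophilic) form is 1 / (1 + 10^(pKa - pH)). The Python sketch below applies it to the pKa values quoted above. It is only a rough estimate: it treats both side chains as simple monoprotic acids with those bulk pKa values and ignores the protein microenvironment and intrinsic differences in reactivity. The function name is hypothetical.

```python
def deprotonated_fraction(pka: float, ph: float) -> float:
    """Henderson-Hasselbalch: fraction of an acid present as its conjugate base."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


if __name__ == "__main__":
    PH = 5.0        # labeling pH quoted in the text
    PKA_CYS = 8.3   # cysteine thiol
    PKA_SEC = 5.2   # selenocysteine selenol (approximate)

    cys = deprotonated_fraction(PKA_CYS, PH)
    sec = deprotonated_fraction(PKA_SEC, PH)
    print(f"Cys deprotonated: {cys:.4%}")
    print(f"Sec deprotonated: {sec:.1%}")
    print(f"Sec/Cys ratio:    {sec / cys:.0f}")
```

With these values, roughly 39% of Sec but only about 0.05% of Cys is deprotonated at pH 5.0, a difference of several hundred-fold that is consistent with the selective labeling reported.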
To visualize the spatial localization of newly synthesized proteins, BONCAT was modified into a method termed fluorescence non-canonical amino acid tagging (FUNCAT) (Dieterich et al., 2010). In fixed cells from mammalian cell cultures and mouse models, newly synthesized proteins containing Aha or HPG are tagged with fluorescent dyes via click chemistry, and this labeling can be combined with immunochemistry to visualize specific proteins (Dieterich et al., 2010; Hinz et al.; Figure 1B). FUNCAT has also been coupled with the proximity ligation assay (PLA) to visualize the spatial localization of specific proteins in fixed cells (tom Dieck et al., 2015). Primary antibodies recognize either the Aha-tagged protein or a specific protein, and each primary antibody is recognized by a secondary antibody that carries a different DNA adaptor. The adaptors are used for rolling circle amplification, and the resulting DNA product is labeled with complementary fluorescent probes, producing an amplified signal for each target protein molecule (tom Dieck et al., 2015). The authors noted that although the PLA signal could result from two closely positioned proteins, it is more likely that both primary antibodies bind to the newly synthesized target protein to produce the PLA signal. The use of Met analogs to label newly synthesized proteins has expanded to the study of bacteria and other microbes in environmental and human samples. The fluorescently labeled Met analog can be combined with other techniques such as FISH or FACS to study translationally active cell populations (Couradeau et al., 2019; Sebastian and Gasol, 2019; Valentini et al., 2020). Thus, labeling of the Met analog shows promise as a tool that can be widely used in different biological systems in combination with other visualization techniques.
In an effort to further improve upon the specificity of FUNCAT-PLA, SCROL (Stop-Codon-Read-thrOugh-Label) was developed to study expression of a specific protein of interest (Schneider et al., 2018) (Figure 1C). A cassette containing the UAG amber codon, an HA-tag, and an alternative stop codon (e.g., the UGA opal codon) was genomically inserted at the 3'-end of the gene encoding the target protein using CRISPR/Cas9 in HEK293 cells. It is important to note that, even in the presence of an orthogonal aaRS/tRNA pair, the tagged protein will only be expressed upon addition of the ncAA, and thus the protein can be labeled within specific timeframes. This ensures that the cell and the protein of interest are undisturbed in the absence of the ncAA. In this study, the ncAA was not used for labeling or detecting the proteins but as a means to control expression of the epitope tag. This differs from many other studies, where the ncAA is incorporated so that it can be used directly in downstream experiments. Thus, additional specialized aaRS/tRNA pairs and ncAAs would not need to be developed for this technique.
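The read-through logic behind SCROL can be sketched in Python as a toy translation routine. Without the ncAA, translation stops at the UAG codon and the native, untagged protein is made; with the ncAA (and the orthogonal pair), UAG is decoded, the HA-tag is appended, and translation ends at the downstream UGA codon. The target ORF, the HA-tag codon choices, the placeholder letter "X" for the ncAA, and the function name are illustrative assumptions; real suppression is incomplete and competes with release factors, which this sketch does not model.

```python
from itertools import product

# Standard genetic code, built in the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str, ncaa_present: bool) -> str:
    """Translate an mRNA; UAG is read through as 'X' only if the ncAA is present."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon == "UAG" and ncaa_present:
            protein.append("X")  # ncAA inserted by the orthogonal aaRS/tRNA pair
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":       # any remaining stop codon terminates translation
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    target_orf = "AUGGCUAAA"                     # hypothetical target: Met-Ala-Lys
    ha_tag = "UACCCAUACGAUGUUCCAGAUUACGCU"       # encodes YPYDVPDYA (codon choice assumed)
    mrna = target_orf + "UAG" + ha_tag + "UGA"   # SCROL-style cassette after the ORF

    print(translate(mrna, ncaa_present=False))   # untagged protein
    print(translate(mrna, ncaa_present=True))    # read-through product with HA-tag
```

The two outputs differ only by the ncAA placeholder and the HA-tag, which is why the tag reports on protein made during the window in which the ncAA was supplied.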
An important component of labeling ncAAs in vivo for protein visualization under different cellular conditions is the bioorthogonal dye. Tetrazine dyes (11) have been extensively characterized for their use in labeling trans-cyclooct-2-ene (TCO*) modified ncAAs (12) via click chemistry in a variety of mammalian cells for live-cell imaging (Beliu et al., 2019). An attractive trait of the dyes is their small size compared to traditional labeling molecules such as antibodies. By choosing cell membrane-permeable or impermeable dyes based on the target protein, selective labeling can be enhanced. The authors demonstrated selective and efficient labeling of the target proteins that produced superior readings compared to immunolabeling (Beliu et al., 2019). A similar study utilized bicyclo-non-yne lysine (BCN-Lys) (13) to attach a tetrazine dye to the protein of interest via an N-terminal tag in HEK293T and Cos7 cells (Segal et al., 2020). Interestingly, BCN-Lys inserted in the N-terminal tag was labeled more efficiently than BCN-Lys inserted within the target protein. In addition, the modified tag was found to be compatible with labeling organelle proteins within their acidic environments. Although the authors demonstrated the ability to switch out the HA-tag with other commonly used tags (Myc and FLAG), the results were not as successful, indicating a need for further optimization (Segal et al., 2020). The ease of simply inserting a tag encoding an ncAA led the authors to propose genomically inserting the modified tag to monitor endogenously expressed proteins. This would minimize any effects on protein folding and interactions, and the use of tetrazine dyes allows live-cell imaging. Live-cell imaging using fluorescently labeled proteins provides a story of how proteins travel and react within the cell under various conditions, rather than a picture obtained from fixed cells. This could be helpful in studying disease-associated proteins and understanding how cellular and environmental changes affect their spatial distribution.
Insertion of different ncAAs within a single protein allows different molecules to be attached, permitting the application of various techniques. However, the limited number of available codons for reassignment/recoding, orthogonal aaRS/tRNA systems, and orthogonal labeling reactions restricts the number of ncAAs that can actually be inserted. Recently, three different ncAAs were inserted within a single protein using three different orthogonal aaRS/tRNA pairs in E. coli (Italia et al., 2019). The same group had previously engineered an E. coli strain (ATMW1) where the endogenous tryptophanyl-tRNA synthetase/tRNA Trp (TrpRS/tRNA Trp ) pair was replaced by the TrpRS/tRNA Trp from Saccharomyces cerevisiae, rendering the E. coli TrpRS/tRNA Trp pair available for ncAA insertion (Italia et al., 2017). As three distinct ncAAs were inserted using all three nonsense suppressors, homogeneous translation termination was achieved by inserting a TEV cleavage site at the C-terminal end of sfGFP-His followed by TEV protease and three consecutive UAA stop codons, and all three ncAAs could be labeled in a one-pot reaction (Italia et al., 2019). More recently, two separate classes of pyrrolysyl-tRNA synthetase (PylRS) with the N-terminal domain missing and their corresponding tRNAs were identified to be mutually orthogonal to each other and to a Methanosarcina mazei PylRS/SpetRNA Pyl pair in E. coli. By co-expressing all three pairs and ribo-Q1 with a GFP reporter, three distinct ncAAs were incorporated by recoding the UAG stop codon and two quadruplet codons (Dunkelmann et al., 2020). The use of quadruplet codons would allow protein termination with a triplet stop codon. Alternatively, competitive labeling of one ncAA reserves a nonsense codon for translation termination while still allowing production of the multi-labeled protein for different studies.
By taking advantage of the ability of BCN-Lys to be labeled by various tetrazine-conjugated fluorescent dyes via click chemistry, two different dyes could be conjugated simultaneously to allow the use of two different techniques that require different fluorescent parameters. This was demonstrated to be feasible in live Cos7 cells, allowing real-time high-resolution imaging at the single-molecule level. These studies demonstrate the ability to insert multiple ncAAs within a single protein where each ncAA can take part in a specific chemical reaction. This would greatly increase the yield of samples that require multiple downstream processing steps, as repeated protein purification for different labeling reactions would not be required.

IDENTIFYING PROTEIN-PROTEIN INTERACTIONS
Studying the interactions of proteins aids in the understanding of their cellular roles under normal and stress conditions. Co-immunoprecipitation, crosslinking, and BioID are some of the techniques used to identify interacting proteins (Lin and Lai, 2017; Yu and Huang, 2018; Sears et al., 2019). GCE can be used to insert crosslinking ncAAs at known or potential binding interfaces to increase the chances of identifying interacting proteins as well as to minimize interference with protein folding and activity. The light-activated crosslinker p-azidophenylalanine (Azi) (14) was demonstrated in E. coli to be efficiently incorporated and to crosslink with amines of interacting proteins (Chin et al., 2002). Recently, Azi was successfully inserted at recoded UAG stop codons and used to map protein–protein interactions within the inner membrane complex of the protozoan Toxoplasma gondii (Choi et al., 2019). Chemical crosslinkers are an alternative to light-activated crosslinkers. Two chemical crosslinkers have been developed that spontaneously react with cysteine residues that are in close proximity: BprY (15) and EB3 (16). Both crosslinkers are alkyl bromides, with EB3 containing an alkyne group for enrichment with biotin. Their applicability in detecting weak and transient interactions was demonstrated in live E. coli cells (Yang et al., 2017). The drawback of these chemical crosslinkers is that they rely on cysteine residues being present at the binding interface. Replicating experiments with different types of crosslinkers could increase the coverage of potential interacting proteins. Using genetically encoded crosslinkers offers a high degree of specificity that can capture strong and transient interactions.

DISCUSSION
The studies discussed in this review demonstrate the potential of utilizing GCE to label proteins in an effort to illuminate the heterogeneous intracellular environment. We present the following ideas for consideration to improve the current methods.
(1) Although aaRS/tRNA pairs are tested to ensure their orthogonality within a system, cross-reactions may still occur. Developing more specific aaRS/tRNA pairs that minimally cross-react with endogenous aaRSs and tRNAs would lessen the impact on cellular metabolism and allow a more "natural" representation of the protein(s) of interest due to minimal disturbance of the intracellular environment. In addition, there is a need to produce more aaRS/tRNA pairs that are orthogonal in a greater variety of organisms (Greiss and Chin, 2011; Bianco et al., 2012; Ernst et al., 2016; Chen et al., 2017; Gohil et al., 2020) to study proteins in their natural environment with any necessary post-translational modifications that are often missing when produced in model expression organisms such as E. coli.
(2) Development of more sensitive detection techniques would enable a greater understanding of localization for more proteins. Although FUNCAT coupled to PLA allows amplification of a signal for a protein of interest (tom Dieck et al., 2015), a high-affinity antibody must be available. For many proteins this is not the case: only low-affinity antibodies are available, or none have been produced so far. While SCROL offers an alternative where the peptide tag is targeted by an antibody (Schneider et al., 2018), the SCROL cassette must be genomically incorporated to produce a stable cell line.
(3) The increasing number of ncAAs allows for the development and optimization of techniques for specific purposes. However, this is often hindered by the limited availability and high price of ncAAs. Many ncAAs are expensive to buy or to chemically synthesize within the lab. It would be beneficial to develop a method to produce greater quantities of ncAAs, biologically or chemically. A related issue is the development of ncAAs for organelle-specific labeling. These ncAAs must be stable in the organellar environment, and chemically reactive ncAAs would have to be designed to react specifically within the organellar environment. Light-activated ncAAs would circumvent this issue and would likely offer greater control in activating the ncAA.

CONCLUSION
In conclusion, GCE has become a versatile tool that can be modified to suit individual needs and incorporated into existing techniques. In the future, it is expected that more tools will be developed for expansion of GCE to other organisms and to improve on existing GCE methods for understanding heterogeneous environments within living cells.

AUTHOR CONTRIBUTIONS
CZC and KA wrote the manuscript. DS edited it. All authors contributed to the article and approved the submitted version.

FUNDING
Work in the authors' laboratory was supported by the National Institute of General Medical Sciences (R35GM122560) and the DOE Office of Basic Energy Sciences (DE-FG0298ER2031).