Genome-wide identification and comparative analyses of key genes involved in C4 photosynthesis in five main gramineous crops

Compared to C3 species, C4 plants showed higher photosynthetic capacity as well as water and nitrogen use efficiency due to the presence of the C4 photosynthetic pathway. Previous studies have shown that all genes required for the C4 photosynthetic pathway exist in the genomes of C3 species and are expressed. In this study, the genes encoding six key C4 photosynthetic pathway enzymes (β-CA, PEPC, ME, MDH, RbcS, and PPDK) in the genomes of five important gramineous crops (C4: maize, foxtail millet, and sorghum; C3: rice and wheat) were systematically identified and compared. Based on sequence characteristics and evolutionary relationships, their C4 functional gene copies were distinguished from non-photosynthetic functional gene copies. Furthermore, multiple sequence alignment revealed important sites affecting the activities of PEPC and RbcS between the C3 and C4 species. Comparisons of expression characteristics confirmed that the expression patterns of non-photosynthetic gene copies were relatively conserved among species, while C4 gene copies in C4 species acquired new tissue expression patterns during evolution. Additionally, multiple sequence features that may affect C4 gene expression and subcellular localization were found in the coding and promoter regions. Our work emphasized the diversity of the evolution of different genes in the C4 photosynthetic pathway and confirmed that the specific high expression in the leaf and appropriate intracellular distribution were the keys to the evolution of C4 photosynthesis. The results of this study will help determine the evolutionary mechanism of the C4 photosynthetic pathway in Gramineae and provide references for the transformation of C4 photosynthetic pathways in wheat, rice, and other major C3 cereal crops.

Compared to C 3 species, C 4 plants showed higher photosynthetic capacity as well as water and nitrogen use efficiency due to the presence of the C 4 photosynthetic pathway. Previous studies have shown that all genes required for the C 4 photosynthetic pathway exist in the genomes of C 3 species and are expressed. In this study, the genes encoding six key C 4 photosynthetic pathway enzymes (b-CA, PEPC, ME, MDH, RbcS, and PPDK) in the genomes of five important gramineous crops (C 4 : maize, foxtail millet, and sorghum; C 3 : rice and wheat) were systematically identified and compared. Based on sequence characteristics and evolutionary relationships, their C 4 functional gene copies were distinguished from non-photosynthetic functional gene copies. Furthermore, multiple sequence alignment revealed important sites affecting the activities of PEPC and RbcS between the C 3 and C 4 species. Comparisons of expression characteristics confirmed that the expression patterns of nonphotosynthetic gene copies were relatively conserved among species, while C 4 gene copies in C 4 species acquired new tissue expression patterns during evolution. Additionally, multiple sequence features that may affect C 4 gene expression and subcellular localization were found in the coding and promoter regions. Our work emphasized the diversity of the evolution of different genes in the C 4 photosynthetic pathway and confirmed that the specific high expression in the leaf and appropriate intracellular distribution were the keys to the evolution of C 4 photosynthesis. The results of this study will help determine the evolutionary mechanism of the C 4 photosynthetic pathway in Gramineae and provide references for the transformation of C 4 photosynthetic pathways in wheat, rice, and other major C 3 cereal crops.

Introduction
Photosynthesis is one of the key biological innovations that allows organisms to convert light energy into chemical energy (ATP and NAD(P)H) for metabolic activities on a massive scale (Gest, 2002;Shih, 2015). During oxygenic photosynthesis, with the oxidation of water and the reduction of CO 2 , chemical energy generated from light energy is captured and used to synthesize organic compounds in higher plants in 'dark reactions' (Ashida et al., 2005;Paul, 2012;Blankenship, 2002). There are many different photosynthetic pathways in higher plants, the most widely known of which are the traditional C 3 , C 4 , and CAM (Crassulacean Acid Metabolism) types, and an in-depth study of the photosynthetic model species Flaveria identified the C 3 -C 4 intermediate type .
Compared with C 3 plants, C 4 plants such as maize show higher photosynthetic capacity and greater water-and nitrogenuse efficiency, especially in suboptimal environments, which result in greater biomass production (Long, 1999;Paulus et al., 2013;Wang et al., 2014). Unlike C 3 plants, which use the three-carbon molecule 3-phosphoglycerate (3-PGA) for carbon fixation, the carbon of C 4 plants is first fixed in the four-carbon molecule oxaloacetate (OAA) in mesophyll cells (MCs) due to the presence of a CO 2 concentrating mechanism (CCM). Immediately thereafter, OAA is transferred from the outer MCs to the bundle sheath cells (BSCs) in the form of malate (NADPdependent malic enzyme subtype) or aspartate (NAD-dependent malic enzyme subtype). Then, CO 2 is released in the chloroplasts of BSCs through decarboxylation by the malic enzyme, producing a microenvironment with a high CO 2 concentration around the Rubisco enzyme (ribulose-1,5-bisphosphate carboxylase/ oxygenase), thereby performing a Calvin cycle. The CCM enables Rubisco to function near its enzymatic V-max and greatly reduces the adverse effects of photorespiration, thus allowing for a greatly reduced investment of nitrogen in Rubisco proteins (Wang et al., 2014). These elegant biological innovations, including biochemical reactions (C 4 carbon shuttle) and anatomical structure (Kranz anatomy), allow for more efficient water and nitrogen use by C 4 plants in extreme climates (Hibberd and Covshoff, 2010). C 4 photosynthesis evolved from ancestral C 3 photosynthesis during a global CO 2 reduction and temperature increase (Sage et al., 2012). Most plants are C 3 plants; however, it has been reported that the C 4 pathway independently evolved in angiosperms, and the characteristics of this multisource evolution indicate that the transition of the photosynthetic pathway from the C 3 pathway to the C 4 pathway is relatively simple (Kellogg, 1999;Hibberd et al., 2008). In addition, some C 3 plants exhibit C 4 characteristics in specific environments, while C 4 plants show C 3 differentiation at specific growth stages, and some plants can convert between the C 3 and C 4 photosynthetic pathways, all of which indicate that the photosynthetic characteristics of C 3 and C 4 plants have great plasticity (Cheng et al., 1989;Ueno, 1998;Pyankov et al., 1999;Hibberd and Quick, 2002;Hibberd et al., 2008).
Gramineae is one of the most important model groups for world research because it contains a large number of important crop species. C 4 grasses, including maize, sorghum, and foxtail millet, and C 3 crops, such as wheat and rice, are all widely cultivated in modern agriculture and are major food crops critical to global food security (Leakey, 2009). The current annual low yield growth rates of wheat and rice have fallen far short of the target of doubling production by 2050 (Ray et al., 2013;FAO, 2020). Meanwhile, the increasing global population, unpredictable severe weather, and continuous reduction in water and arable land resources generate an extremely urgent need to advance main crop productivity (Parry et al., 2011). C 4 photosynthetic transformation of C 3 crops is the most likely method through which crop yield will be increased on a large scale to ensure global food security (von Caemmerer et al., 2012;Long et al., 2015). Engineering C 3 food crops such as wheat and rice to use the C 4 photosynthetic pathway has long been explored. To date, the identification and comparisons of C 4 photosynthetic genes, their non-C 4 types in C 4 lineages and their close non-C 4 relatives in Gramineae have been reported in several studies. Transcriptome comparison of 10 independent C 4 origins and their 9 non-C 4 relatives showed that the most highly expressed gene lineages in non-C 4 ancestors may generate their C 4 pathway by repeated co-optation (Moreno-Villena et al., 2017). Another comparative study, based on Gramineae C 4 lineages and their non-C 4 relatives, also confirmed that the same gene lineages were recruited in independent C 4 origins despite the existence of multiple copies (Christin et al., 2013). Furthermore, previous studies based on single C 4 photosynthetic key genes in several gramineous species, such as the carbonic anhydrase and phosphoenolpyruvate transporter genes, also confirmed that key genes are recruited into the C 4 photosynthetic pathway by the acquisition of high expression levels and tissue expression characteristics (Ludwig, 2016;Lyu et al., 2020). All these results indicate that the recruitment and selection of C 4 photosynthetic pathway genes are identical and affected by the expression abundance of gene lineages before C 4 evolution. Therefore, accurate identification of C 4 orthologues in C 3 crops and comparative analysis with C 4 genes in C 4 crops are of great interest for understanding the evolution of the gramineous C 4 pathway and improving the photosynthetic efficiency of C 3 species.
In this study, the genes encoding the six key enzymes in the NADP-ME subtype of the C 4 photosynthesis pathway in five important gramineous crops were identified and characterized. These genes were systematically classified into C 4 -type and nonphotosynthetic gene copies based on their amino acid sequence characteristics and evolutionary relationships. In addition, their tissue expression patterns and sequence features in promoter regions were also compared, and a schematic diagram of the necessary steps for the transformation of the C 4 photosynthetic pathway in gramineous C 3 crops was proposed. The results will provide a foundation for understanding the C 4 photosynthetic gene characteristics in Gramineae C 3 and C 4 crops.
2 Materials and methods 2.1 Identification of the genes encoding six C 4 photosynthetic enzymes Wheat genome and protein sequences (IWGSC v1.1) were obtained from the Wheat URGI database (Alaux et al., 2018;IWGSC, 2018), and the sequences of four other Gramineae crops (rice, foxtail millet, sorghum, and maize) were downloaded from the Ensemble Plants database (release 39, http://plants.ensembl.org). Then, six important enzymes (beta carbonic anhydrase, b-CA; ribulose bisphosphate carboxylase small subunit, RbcS; phosphoenolpyruvate carboxylase, PEPC; NADP-dependent malic enzyme, NADP-ME; malate dehydrogenase, MDH; and pyruvate, orthophosphate dikinase, PPDK) involved in the C 4 photosynthetic pathway were identified (Kersey et al., 2015). First, five local protein databases were constructed using these sequences for homologous alignment of cloned sequences downloaded from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/) through a local protein basic local alignment search (BLASTP) program with an E-value cut-off < 10 -5 and an identity of 60% as the threshold. The Hidden Markov Model (HMM) models of conserved domains of all six genes were obtained from the PFAM database (http://pfam.xfam.org/), and all genes predicted by BLASTP were further screened by their conserved domains using the HMMER search tool (Wheeler and Eddy, 2013). The NCBI-Conserved Domain Database (CDD) search was also used to check the conserved protein domains of these candidate genes (https:// www.ncbi.nlm.nih.gov/cdd), and sequences lacking either of these domains were excluded. The PFAM ID and CDD Accession ID of the conserved domains of these genes are listed in Table 1. In addition, based on the understanding of the special dual promoter structure of PPDK genes, the gene structure annotations from NCBI (Maize, NCBI B73_v4 annotation release 102; Foxtail Millet, Setaria_italica_2.0 annotation release 103) were used to correct the PPDK genes from maize and foxtail millet. After manual curation, the nonredundant sequences were considered putative genes. Finally, gene duplication events of these genes in each species were determined by MCScanX, and manual screening was performed according to Wang et al. (2016). 2.2 Analysis of gene structure, conserved motifs, and the physical and chemical properties of encoded proteins The gene structures were determined and displayed using the online tool Gene Structure Display Server (GSDS) 2.0 (http:// gsds.cbi.pku.edu.cn/) (Guo et al., 2007). The multiple EM for motif elicitation (MEME) program (v5.5.0) was used to determine the conserved protein motifs of these genes, and the parameters were as follows: the optimal motif width was between 6 and 200 residues, allowing the presence of any number of repeating motif sites, and the maximum motif number was 20 (Bailey et al., 2009).
The biochemical parameters and subcellular localization were calculated by the Computer pI/MW tool in the ExPASy database (https://web.expasy.org/compute_pi/) (Gasteiger et al., 2005;Chou and Shen, 2008). The ChloroP server was used to predict the presence of chloroplast transit peptides (cTP) and the location of potential cTP cleavage sites (Emanuelsson et al., 1999). The protein sequences used to compare the differences between C 3 and C 4 plants were downloaded from the NCBI and UniProt databases (https:// www.uniprot.org/).

Phylogenetic analysis and identification of C 4 -orthologous genes in C 3 crops
To evaluate the evolutionary relationships among these genes, multiple sequence alignments were performed using the Clustal Omega program (https://www.ebi.ac.uk/Tools/msa/clustalo/) with the default parameters (Sievers et al., 2011). Phylogenetic analyses were conducted using both the neighbour-joining (NJ) method and maximum likelihood (ML) method. The ML trees were constructed using PhyML 3.1 (http://www.atgc-montpellier.fr/phyml/ versions.php) with the JTT model. The NJ trees were constructed using MEGA6.06 with 1000 bootstrap replications, the JTT model, and the pairwise deletion option (Saitou and Nei, 1987;Tamura et al., 2013). The trees of these six genes were constructed using iTOL v6 software (https://itol.embl.de). In addition, MCscanX was used to compare the homology relationships between genes in different species (Wang et al., 2012).

Sequence analysis of the promoter region
The 2000 bp upstream sequence of the initiation codon was considered the promoter region for each gene and was extracted from the genome using the SAMtools program (v1.12) (Li, 2011). The cis-acting regulatory elements in the promoter region were predicted using PlantCARE (http://bioinformatics.psb.ugent.be/ webtools/plantcare/html/) (Lescot et al., 2002). In addition, MEME was used to identify the conserved sequences in the promoter regions with an optimal motif width between 6 and 50 residues and a maximum motif number of 20.

Tissue-specific expression patterns
To investigate their expression profiles in different tissues, publicly available RNA-seq datasets of four tissues (root, shoot, spike, and leaf) in maize, sorghum, wheat, foxtail millet, and rice were downloaded from the following databases: the NCBI sequence read archive (SRA) (https://www.ncbi.nlm.nih.gov/sra), Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/gds), S o r g h u m F u n c t i o n a l G e n o m i c s D a t a b a s e ( h t t p : / / structuralbiology.cau.edu.cn/sorghum), Setaria italica Functional Genomics Database (http://structuralbiology.cau.edu.cn/SIFGD), MaizeGDB (https://www.maizegdb.org), expVIP (http:// www.wheat-expression.com), and Rice Expression Database (http://expression.ic4r.org). In addition, the partial expression matrices of wheat, maize, and rice were directly obtained from public databases, and the raw RNA-seq data of sorghum (SRR959796, SRR959782, and SRR959765), foxtail millet (SRR442161-SRR442164) and rice (SRP028766 and SRP049212) were downloaded for reanalysis. The quality of raw reads was evaluated using FastQC v0.11.9, and the low-quality reads were filtered by Trimmomatic v0.36 (Andrews, 2010;Bolger et al., 2014). Clean reads were mapped onto their reference genomes using Hisat2 v2.2.1.0 (Pertea et al., 2016). The read numbers mapped to each gene were counted using HTSeq v0.11.1, and then the FPKM value of each gene was calculated based on the length of the gene and the read count mapped to the gene (Trapnell et al., 2010;Anders et al., 2015). To compare the tissue expression characteristics of homologous genes in different species, EL (expression level) values were used to draw a heatmap to display the tissue-specific patterns of expression. EL values were calculated using the following formula: Expression Level (EL) = −log2(FPKM + 1) −log2(sum(FPKM + 1)) Â 100 % where the EL value represents the expression in a certain tissue in proportion to that in all tissues and FPKM represents fragments per kilobase of exon model per million mapped fragments.

Identification of the genes encoding six key C 4 photosynthetic enzymes
The six key C 4 photosynthetic enzymes (b-CA, RbcS, PEPC, NADP-ME, MDH, and PPDK) are all encoded by small family genes. The availability of the genome sequences of the five gramineous crops made it possible to identify all family members of the six key enzymes, including C 4 and non-photosynthetic gene copies. A total of 216 genes with complete conserved domains were identified in five crop genomes (Table 1, Tables S2-S7). The MDH family was the largest, and the PPDK family had the fewest members among the six gene families of the five Gramineae crop genomes. For the convenience of description, all genes were renamed based on their homologous relationships. In short, the orthologous genes with high homology among the five crops were first numbered, and then the paralogous genes in each species were numbered. For example, b-CA3 enzymes were encoded by seven genes, namely, SiCA3, ZmCA3, SbCA3, TaCA3-3A, TaCA3-3B, TaCA3-3D and OsCA3, while b-CA5 enzymes were encoded by only three genes in wheat, TaCA5-7A, TaCA5-7B and TaCA5-7D (Table 2). Compared with other crops, wheat possessed the most copies as a result of its allohexaploid genome and complex evolutionary process.
Although the number of these genes encoding key C 4 photosynthetic enzymes in different species was not identical, except for RbcS, the number and distribution of the other five gene family members showed similar trends in the C 3 and C 4 crop genomes ( Table 1). The copy number of the RbcS gene in different grass species showed a large difference, and there were 24 RbcS copies in wheat and only one copy in sorghum. In addition, 30 of the 36 RbcS genes were associated with tandem duplication events, accounting for 83.33% of all RbcS genes (Table S3). Similarly, tandem duplicated genes accounted for 73.53% (25/34) of all b-CA genes ( Table 2, Table S2).
Earlier reports indicate that the subcellular-specific expression of the PPDK gene is regulated by a dual promoter located upstream of the first exon and on the first intron (Sage et al., 2012). The first exon encodes a chloroplast transit peptide; the copy regulated by the first promoter exhibits a chloroplast type, and the copy regulated by the second promoter exhibits a cytoplasmic type. However, our results showed that maize has two chloroplast copies, while foxtail millet has two mitochondrial copies. To further explain this discrepancy, the NCBI-annotated maize genome sequence (Maize, NCBI B73_v4 annotation release 102, ftp://ftp.ncbi.nlm.nih.gov/ genomes/Zea_mays) and foxtail millet genome sequence (Foxtail Millet, NCBI Setaria_italica_v2.0 annotation release 103, ftp:// ftp.ncbi.nlm.nih.gov/genomes/Setaria_italica) were downloaded and analysed. Further alignment analysis revealed that Zm00001d010321 (ZmPPDK2) in the current genome (Zm-B73-REFERENCE-GRAMENE-4.0) had a false annotation, and this gene (LOC103635678; NM_001358399.1) annotated by NCBI was shorter in length and lacked the first exon sequence, proving that ZmPPDK2 encoded a cytoplasmic PPDK isoform (Table S7).     (Table  S7). In general, all genes encoding six key C 4 photosynthetic enzymes were identified completely and accurately in these five Gramineae species.

Phylogenetic analysis and identification of C 4 -orthologous genes in C 3 crops
The full-length amino acid sequences of these identified genes were aligned, and then phylogenetic trees were constructed using both the maximum likelihood (ML) and neighbour-joining (NJ) methods. NJ trees and ML trees of different genes all showed very consistent topological structures ( Figure 1, Figure S1). Except for RbcS and PPDK, the other genes could be divided into three to five distinct lineage homologous branches, including three for b-CA, five for PEPC, four for ME, and four for MDH, as detailed below.

Beta carbonic anhydrase (b-CA)
The b-CA genes in the five crops were clustered into three major groups, of which b-CA1 (8), b-CA2 (6), and other members (20) clustered together into one category. The b-CA1 and b-CA2 branches were relatively conserved and evenly distributed in each species, while the third branch produced more copy number variations due to tandem duplication events ( Figure 1A, Table  S2). All b-CA3, b-CA4, and b-CA5 genes from the same species were tandem duplicated genes, such as TaCA5-7A, TaCA6-7A, and TaCA7-7A. This indicated that tandem duplication was an important mechanism for the expansion and functional diversification of the b-CA family, and the third branch was an important source of new functional generation. In addition, except for TaCA7, which only existed in the wheat A subgenome, other Phylogenetic relationships between six gene families involved in the C 4 photosynthetic pathway from common wheat (Ta), maize (Zm), foxtail millet (Si), rice (Os), sorghum (Sb), and Arabidopsis (At) based on their amino acid sequences. Multiple sequence alignment and phylogenetic tree construction were performed using MUSCLE and MEGA 6.0 (maximum likelihood method, JTT model), respectively. Scale bars indicate the number of amino acid substitutions per site. Genes presented in bold blue fonts represent C 4 -type copies in C 4 species. TaCAs had paralogous copies in the A, B, and D subgenomes, confirming that TaCAs were relatively conserved during wheat polyploidization ( Figure 2A).
Phylogenetic analysis of the b-CA genes showed that TaCA3, OsCA3, SiCA3, SbCA3, and ZmCA3 were clustered together, and collinearity analysis based on MCscanX also showed that these genes were orthologous ( Figure 1A, Table S2). In addition, the b-CA3 genes of the C 4 species were highly expressed in leaves; therefore, they were determined to be gene copies involved in C 4 photosynthesis ( Figure 3A).

Ribulose bisphosphate carboxylase small subunit (RbcS)
The phylogenetic tree showed that RbcS genes of the same species were clustered into one class, which confirmed that the expansion of the RbcS family occurred after the division of the grass species ( Figure 1B). Since the expansion of the RbcS family occurred after species differentiation, it is relatively difficult to identify its orthologous genes. Therefore, we believed that gene copies with specifically high expression in leaves were predominantly RbcS genes. Finally, based on their leaf-specific high expression, TaRbcS8, OsRbcS1, SbRbcS1, ZmRbcS2, and SiRbcS5 were considered putative genes encoding the ribulose bisphosphate carboxylase small subunit ( Figure 3B).

Phosphoenolpyruvate carboxylase (PEPC)
Phylogenetic analysis showed that PEPC genes were divided into five classes, which contained 9, 7, 7, 6, and 5 members ( Figure 1C). The first three groups each contained at least one PEPC gene copy from each species (PEPC1, PEPC2, and PEPC3), which indicated that the function of PEPC in these three groups should be relatively conserved. The fifth class contained only PEPC gene copies from wheat, rice, and foxtail millet. PEPC genes in the fifth class should not perform the C 4 function. In addition, the PEPC genes of each group in wheat contained three copies from the A, B, and D subgenomes.
Previous studies on C 3 and C 4 Flaveria species indicated that the serine (S, Ser) residue in the C-terminus is an important symbol of C 4 PEPC isoforms (Bläsing et al., 2002;Gowik and Westhoff, 2011). The same serine residue was found in three PEPC4 copies from C 4 crops (at position 780 of ZmPEPC4), which suggested that the PEPC4 isoforms were C 4 -type, and this was confirmed by their high expression in leaves ( Figure 3C).

NADP-dependent malic enzyme (NADP-ME)
ME genes of five crops were clustered into four groups, each containing 7, 7, 8, and 11 genes. Similar to PEPC genes, ME genes from C 4 and C 3 crops were evenly distributed in the first three groups (ME1, ME2, and ME3), while more genes from C 4 species were included in the fourth group (ZmME4-6, SbME4-6, SiME4, OsME4, and TaME4), which suggested that there might be more functional differentiation in the fourth group ( Figure 1D).
Previous studies have shown that the evolution of C 4 NADP-ME originates from the acquisition of the chloroplast transit peptide of the cytosolic ancestor and subsequent new functionalization, and C 4 NADP-ME is specifically located in the chloroplast (Tausta et al., The gene structures and conserved protein domains of putative C 4 /C 4 -homologous genes in five gramineous crops. (A-E) represent b-CA, RbcS, PEPC, ME, MDH, and PPDK, respectively. The graph on the left is the distribution of conserved motifs on these proteins, and the right panel is the gene structure.

2002).
Only ME gene copies in the fourth group were found to be chloroplast type, and transcriptome data also confirmed that ME4 genes from C 4 species were highly expressed in leaves, indicating that ME4s function in the plant C 4 photosynthetic pathway ( Figure 3D, Table S5).

Malate dehydrogenase (MDH)
Multiple forms of MDHs are present in higher plants with different subcellular localizations and coenzyme specificities. Based on the difference in coenzymes, they can be divided into two types, NAD-dependent MDHs (NAD-MDHs) and NADP-dependent MDHs (NADP-MDHs). NADP-MDHs are mainly present in chloroplasts and cytoplasm, while NAD-MDHs are distributed in mitochondria, the cytoplasm, peroxisomes, and plastids (Gietl, 1992). In C 4 photosynthesis, the chloroplastic NADP-MDH present in MCs converts OAA into malic acid to perform the C 4 function. Based on the prediction of subcellular localization and the coenzyme-specific binding site (at position 43 of OsMDH1, D and G for NAD-dependent MDH and NADP-dependent MDH, respectively), these MDHs were well classified (Table S6).
The phylogenetic tree showed that all different types of MDHs could be assigned to different branches ( Figure 1E, Table S6). MDH1, MDH2, ZmMDH10, and ZmMDH11 were NAD- Expression levels of all putative genes encoding key C 4 photosynthetic enzymes in four tissues (leaf, root, stem, and spike) from RNA-seq data. The heatmaps were drawn based on the EL values after the normalization of different tissues, with their original FPKM values listed in the corresponding grid. All genes were clustered according to their phylogenetic relationship. The blue and yellow colours represent higher and lower relative abundance for each transcript in each sample, respectively, and the dark red colour represents no data support. Based on the special double promoter structure of PPDK, the expression levels of all transcripts are listed.
dependent cytoplasmic isoforms, and MDH6 and MDH7 were NAD-dependent peroxisome isoforms. In terms of evolutionary relationships, cytoplasmic NAD-MDHs and chloroplast NADP-MDHs (MDH3, SiMDH13, SbMDH14, and TaMDH12) were located in one branch, which indicated that chloroplast NADP-MDHs are more closely related to cytoplasmic NAD-MDHs. The phylogenetic data combined with the finding of ultrahigh expression levels of MDH3 genes in the leaves of three C 4 crops, indicated that MDH3 genes are C 4 photosynthetic gene copies ( Figure 3E). Although both SiMDH13 and SbMDH14 were located in the same branch as MDH3 genes, neither of them exhibited leaf-specific high expression and were therefore excluded.

Pyruvate, orthophosphate dikinase (PPDK)
The PPDK gene directs the transcription of different subcellulartargeted PPDK isoforms through a specific dual-promoter structure, so different subcellular-localized PPDK proteins (PPDK1.1 and PPDK1.2) encoded by PPDK1 were also used to construct the phylogenetic tree ( Figure 1F). Eleven genes encoding PPDK were identified in five species, of which two genes were identified in all C 4 species, while only one gene was identified in rice, and four genes were identified in wheat. Using multiple sources of genome information to study PPDK genes in maize and foxtail millet, it was confirmed that PPDK1 contained a dual-promoter structure and was transcribed to a chloroplast PPDK isoform and a cytoplasmic PPDK isoform, while PPDK2 encoded another cytoplasmic PPDK isoform. PPDK1 was highly expressed in the leaves of all C 4 crops, while rice and wheat lacked some PPDK2 copies, so PPDK1 was identified as a C 4 type gene ( Figure 3F, Table S7).

Comparative analysis of sequence variations among C 4 and C 4 orthologue gene copies in C 4 and C 3 crops
Based on the prediction of protein conserved domains by MEME software, the annotation information of gene structure, the sequence, and their physicochemical characteristics, differences between C 4 genes in C 4 species and C 4 -orthologous genes in C 3 crops were compared in detail.

Carbonic anhydrase b type (b-CA)
The gene structures of b-CA3 of C 3 and C 4 crops showed large differences, but they were relatively conserved among crops with the same photosynthetic type (Figure 2A). TaCA3 and OsCA3 genes both contained 8 exons, while SiCA3, ZmCA3, and SbCA3 contained 7, 13, and 13 exons, respectively. In general, for all CA3 genes, the sequences from exon 1 to exon 7 were similar, although the lengths of these exons were different in C 3 and C 4 crops. For the conserved domains, a significant difference was that C 4 b-CA3s lacked motif 6, while C 4orthologous b-CA3s contained motif 8 at their N-terminus in C 3 crops (Figure 2A). Prediction of chloroplast transit peptides (cTPs) revealed that all TaCA3s had cTPs of approximately 48 amino acid residues at the N-terminus (TaCA3-3A, 49; TaCA3-3B, 48; TaCA3-3D, 48), and OsCA3 had a cTP of 68 amino acid residues, whereas all b-CA3s of C 4 crops did not have cTPs ( Figure S2). This was consistent with the position of motif 7-motif 6, confirming that C 4 b-CA3s could fix CO 2 in the cytoplasm by the deletion of motif 6 (Ludwig, 2016). In addition, the deletion of motif 6 improved the isoelectric point (pI) of C 4 b-CA3s, which ranged from 8.74 to 8.80, while the pI of C 4 -homologous b-CA3s ranged from 8.35 to 8.41 (Table 3).

Ribulose bisphosphate carboxylase small subunit (RbcS)
C 4 and C 4 -homologous RbcS genes did not differ significantly in gene structure, and both contained two exons ( Figure 2B). In contrast, two additional motifs were observed in C 4 -homologous RbcSs in C 3 species, motif 5 and motif 4, located near the Nterminus and at the C-terminus, respectively. Comparison of the mutation sites between C 4 -homologous and C 4 RbcS proteins found variations at multiple sites, such as (T/S)24G and N56I on OsRbcS1 ( Figure S3). To identify the specificity of these sites, more RbcS protein sequences of Gramineae species were downloaded and compared, including 172 sequences from 40 species of 18 genera. The first site (24 on OsRbcS1) was found to be highly conserved with a G (glycine) or D (aspartate) among C 3 species, while it showed differences among C 4 species, T (threonine) for Sorghum bicolor and maize, S (serine) for Saccharum hybrid, G (glycine) and R (arginine) for foxtail millet, and G (glycine) for Echinochloa crusgalli and Panicum hallii ( Table 4). Considering that the two species only have one protein sequence and considering the close relationship between grass and rice, Panicum hallii, and Echinochloa crusgalli, it is believed that there is a non-G site RbcS isoform in these two species. The second site (56 on OsRbcS1) could strictly distinguish the C 4 and C 3 species in all Gramineae RbcS sequences obtained, with N (asparagine) for C 4 species and I (isoleucine) for C 3 species. In addition, the prediction of protein structure suggested that the 8 amino acid residues at positions 51-59 of OsRbcS1 constituted a protein binding site, and the flanking sequences were highly conserved, so the variation at position 56 may result in important differences in function.

Phosphoenolpyruvate carboxylase (PEPC)
There were no significant differences in gene structure or conserved domains observed between C 4 and C 4 -orthologous PEPC genes/proteins, with both containing 10 exons in genes and 32 conserved domains in proteins ( Figure 2C). The glycine residue (G) at position 842 of ZmPEPC4 was found to be specific in the C 4 species of Gramineae (Rangan et al., 2016). Multiple alignments of the PEPC sequences of 65 species downloaded from the NCBI, UniProt, and Esembl Plants databases (Table 5) revealed that this G residue was relatively conserved in the PEPCs of C 3 species, except soybean. Distinctly, at this site, A was conserved in dicotyledons, the Flaveria lineage with the C 3 -C 4 intermediate photosynthetic type, and the unicellular C 4 lineage Hydrilla, whereas G was conserved in monocotyledonous plants, especially all C 4 grass species. A previous study showed that mutations in G could enhance tolerance to malic acid (Rangan et al., 2016); therefore, the mutation of A842G may be critical for the function of the C 4 photosynthetic PEPC.

NADP-dependent malic enzyme (NADP-ME)
No significant differences in gene structure or conserved domains were observed between C 4 and C 4 -orthologous ME genes/proteins in C 4 and C 3 species, with both containing 20 exons ( Figure 2D) encoding proteins with 591 to 648 aa (Table  S5). Although differences in multiple amino acid residues were found between the C 4 and C 4 -orthologous ME protein sequences in gramineous crops, such as C234V, A260S, R299K, and Q506E on OsME4, there was no evidence that these changes could impact enzyme kinetics. These variations could help to quickly distinguish C 4 photosynthetic gene copies from non-photosynthetic gene copies in gramineous crops.

Zea may AKG(N/D)PGIA(G a /A)(L/V) C4
( Continued) with 493 aa, the lengths of the other MDHs range from 431 to 434 aa (Table S6). The differences in conserved domains between C 4 and C 4 -orthologous MDHs were mainly concentrated in the N-terminal non-conserved regions. Differences in multiple amino acid residues were also found between the C 4 MDH and C 4 -orthologous MDH protein sequences, such as D197E, A167V, and V267M.

Pyruvate, orthophosphate dikinase (PPDK)
C 4 and C 4 -orthologous PPDKs had no significant differences in gene structure or conserved domains, with both containing 19 exons ( Figure 2F) encoding proteins of 936 to 948 aa in length. Similar to C 4 PPDK, PPDK genes of C 3 crops had special double promoter structures. There were differences in multiple amino acid residues between the C 4 and C 4 -orthologous PPDK protein sequences, such as (T/S)92A, D95E, Q160A, and Q279E on OsPPDK1, with Q160A being present in highly conserved regions, but the effect of these differences on PPDK function needs to be confirmed.

Tissue-specific expression of these C 4 and C 4 -orthologous genes
To compare the expression characteristics of these genes in C 3 and C 4 species, RNA-seq data from four tissues (leaf, root, spike, and shoot) of the five species were used. As FPKM values from different species could not be directly compared, the tissue expression characteristics of genes were determined based on inter-tissue standardization, and the original FPKM values are listed in the heatmap (Figure 3).

b-CA
The b-CA genes in different branches showed great differences in tissue-specific expression. b-CA1 genes were expressed in various tissues with high expression levels in leaves, roots, and spikes, while b-CA2 genes had low expression levels in the four tissues and did not show significant differences between C 3 and C 4 crops. Unlike the relatively low expression levels of b-CA1 and b-CA2, b-CA3 exhibited ultrahigh expression levels in the leaves of both C 3 and C 4 crops, which might accelerate the diffusion of inorganic carbon to the carboxylase (Rubisco, PEPC) site and increase the CO 2 fixation rate by increasing the concentration of inorganic carbon around the carboxylase ( Figure 3A). The high expression of C 4 -orthologous b-CA3 genes in C 3 crop leaves suggested that they play an indispensable role in maintaining photosynthesis stability.

RbcS
The RbcS genes of both C 3 and C 4 crops showed consistent tissue-specific expression patterns, and all were highly expressed in leaves ( Figure 3B). Different RbcS gene copies of the same species showed different expression levels. For example, the expression levels of the two copies in the fifth homologous group (TaRbcS7 and TaRbcS8) in wheat were much higher than those in the second homologous group (TaRbcS4, TaRbcS5, and TaRbcS6), and SiRbcS5 in foxtail millet showed preferential expression. This indicated that the nonprimary gene copies may be functionally redundant or have a complementary function.

PEPC
Each group of PEPC genes had a clear distinction in the specificity of tissue expression, while PEPC genes clustered in the same branch showed similar expression patterns in C 3 and C 4 crops ( Figure 3C). For example, PEPC1s and PEPC3s were highly expressed in the spike and root, while PEPC2s were highly expressed only in the spike, and PEPC4s and PEPC5s were specifically expressed in the leaf. The current view is that nonphotosynthetic PEPC copies are widely present in C 3 and C 4 species and are expressed in a variety of tissues to perform multiple functions (Aubry et al., 2011;Williams et al., 2012). The nonphotosynthetic function of non-C 4 PEPC copies not only replenishes tricarboxylic acid (TCA) cycle carbon skeletons, which is the traditional view of non-photosynthetic PEPC function, but also plays a nonnegligible role in seed formation and germination, fruit ripening, maintenance of cell ion balance, regulation of stomatal opening and provision of respiratory substrates for symbiotic nitrogen-fixing bacteroids of root nodules (Latzko and Kelly, 1983;Lepiniec et al., 2003). Based on sequence characteristics and expression patterns, the functions of these nonphotosynthetic PEPCs were predicted. PEPC1 and PEPC3 may play a very important role in seed germination and the interaction of root microbes, and PEPC5 may contribute to maintaining the pH environment and stomatal opening of leaf cells (O'Leary et al., 2011). Although PEPC4 in C 3 crops also showed a certain degree of leaf-specific expression, its expression level was much lower than that of C 4 PEPC4 in C 4 species, which suggested that improving the expression level of PEPC4 in C 3 crops might be the first step for PEPC transformation.

NADP-ME
ME gene copies within each group exhibited similar tissue expression specificity. ME1 gene copies were expressed at very low levels in all tissues of all species ( Figure 3D). Most ME gene

Class
Family Species Amino acid with flanking region a P-type b Zea mays subsp. mexicana AKGDPGIAG a L C 4 Zea mays subsp. parviglumis AKGDPGIAG a L C 4 a This region is located in the 834th-843th residue of ZmPEPC4. b P-type represents the type of photosynthesis. copies in the second and third groups were specifically expressed in roots, while some gene copies from maize, sorghum, and rice (ZmME2, OsME2, and SbME3) were also highly expressed in leaves and spikes. Interestingly, ME gene copies in the fourth group exhibited different tissue expression characteristics; those from C 3 species were specifically expressed in spikes (OsME4 and TaME4), while those from C 4 species were specifically expressed either in leaves or spikes, with SiME4, ZmME4, and SbME4 being exclusively expressed in leaves and ZmME5, ZmME6, SbME5, and SbME6 being highly expressed in spikes. ME4 was suggested as the C 4 photosynthetic orthologous group by evolutionary relationship analysis and subcellular localization ( Table 2, Table S5). Therefore, the specific expression of ME4 in leaves is the key to the C 4 photosynthetic transformation of C 3 ME.

MDH
For NAD-dependent cytoplasmic MDHs, only MDH1 genes encoding proteins of approximately 330 aa in length with a molecular weight of approximately 35 kDa could be expressed normally, while the other MDH gene copies (MDH2, MDH10, and MDH11) were not expressed or weakly expressed in various tissues of each species ( Figure 3E, Table S6). Studies based on Chinese cabbage and alfalfa have confirmed that cytosolic MDH can affect grain development and response to salt stress (Luo et al., 2004;Jia and Zhang, 2009). High expression of MDH1 in spikes and roots indicated that the cytoplasmic MDH of gramineous crops may have similar features. Seventeen NAD-dependent plastid MDH genes were highly expressed in three tissues: MDH5s and OsMDH4 in the leaf; OsMDH16, ZmMDH4, SbMDH4, and ZmMDH18 in the root; and TaMDH4 and TaMDH15 in the spike. Previous studies have shown that transgenic Arabidopsis, which lacks plastid NADdependent MDH, exhibits a variety of negative effects, such as reduced chlorophyll content, decreased photosynthetic rate, disordered chloroplast ultrastructure and undevelopable seeds (van Roermund et al., 2004). Plastid NAD-MDH gene copies with leaf-specific expression were essential for plant photosynthesis, while those with spike-specific expression affected grain development. In addition, the higher number of leaf-specific cytoplasmic MDH gene copies in C 4 species may ensure the stability of dimorphic chloroplasts in MCs and BSCs. There were two highly conserved copies of peroxisome MDH in each species: MDH6 was specifically expressed in the root, while MDH7 was specifically expressed in the spike. The root growth of Arabidopsis lacking peroxisome MDH was severely hindered due to incomplete b-oxidation. Sequence alignment showed that At2g22780 had higher homology with MDH6, so MDH6 might play an important role in the b-oxidation process (Pracharoenwattana et al., 2007). In addition, quantitative proteomic analysis of peroxisomes in Arabidopsis thaliana confirmed that another peroxisome MDH (At5g09660) is essential for photorespiration, suggesting that MDH7 might perform similar functions in gramineous crops, which is consistent with the high expression level of MDH7 in C 3 crops (Eubel et al., 2008). Mitochondrial MDH was mainly expressed in the spike and root. As a traditional TCA cycle enzyme in plants, its high expression in the spike and root may be related to the active energy metabolism of these two tissues. As a key gene of the C 4 photosynthetic pathway, NADP-dependent chloroplast MDH4 was highly expressed in the leaves of C 4 species but not in C 3 species, which indicated that increasing the expression level of MDH4 in C 3 species is essential for C 4 photosynthetic transformation.

PPDK
Because of the existence of a dual-promoter structure, two PPDK gene copies in C 4 crops encoded three PPDK isoforms, and there were two and seven PPDK isoforms in rice and wheat, respectively (Table S7). Both chloroplast mRNAs (PPDK1.1) in C 3 and C 4 crops were highly expressed in leaves, while the cytoplasmic mRNAs were highly expressed in spikes ( Figure 3F). The chloroplast PPDK gene copies in C 3 crops were expressed specifically in leaves, similar to the PPDK gene copies in C 4 crops, but their expression levels were much lower. This indicated that the high expression of chloroplast PPDK was key to the conversion of C 3 to C 4 .

Comparison of the promoter sequences of C 4 and C 4 -orthologous gene copies
The completeness of the C 4 photosynthetic pathway involves the improved expression levels and specific-tissue/cell expression of the key pathway genes, and the noncoding region of the gene harbours the information for its gene expression and cell-specific expression. Therefore, we compared the noncoding regions in the upstream promoter regions of C 4 orthologues in C 3 crops and C 4 genes in C 4 crops.
Differences in the -1000 to -2000 region upstream of the initiation codon were found in the promoter region between C 4orthologous b-CA in C 3 crops and C 4 b-CA in C 4 crops ( Figure S4). The prediction of cis-regulatory elements found that C 4orthologous b-CA genes had more CAAT-box elements (C(C/A) AAT) and more light-responsive elements in the -1000 to -2000 region than C 3 b-CA genes, which suggests that the C 4 -orthologous b-CA could be easily recruited into the C 4 photosynthetic pathway.
There was a significant difference in the region approximately 300 bp upstream of the initiation codon between the C 4 RbcS genes in C 4 crops and the C 4 -orthologous RbcS genes in C 3 crops. A conserved domain cluster consisting of 3 conserved motifs (motif 6 +motif 1+motif 7) was found to be specific to C 4 RbcS genes, and further prediction revealed that this region contained multiple specific elements, such as abscisic acid-responsive (ABRE) and MeJA responsive (CGTCA-motif) elements ( Figure S5), as well as a CAT-box, which controls meristem-specific expression. This may be one of the key regulatory elements for the high expression of C 4 RbcS in BSCs.
More elements interacting with MYB transcription factors were observed in the promoter regions of C 4 PEPC genes of C 4 crops than in those of C 4 -orthologous PEPC genes of C 3 crops, which were enriched in three regions of the initiation codon, -100 to -600 bp, -1000 to -1400 bp and -1600 to -2000 bp ( Figure S6). The multiple MYB binding sites might be induced by drought and light, which indicated that MYB transcription factors may participate in the regulation of C 4 PEPC expression levels under stress conditions.
Large differences were observed in the promoter regions of C 4 / C 4 -orthologous ME and MDH genes from different C 3 and C 4 crops, and no common motifs or cis-acting elements were found in the promoter regions of C 4 ME and MDH genes. In general, multiple abiotic elements involved in the abiotic stress response, hormone response, and light response were found in the promoter regions of all ME and MDH genes, which suggested that they may be regulated in multiple ways.
There was a difference in the expression levels of the C 4 PPDK and C 4 -orthologous PPDK genes. However, no significant difference in the promoter region was found between the C 4 and C 4orthologous PPDK genes. This result suggested that the leafspecific expression of PPDK might be regulated in other ways.

The expression synergy of C 4 genes and C 4 orthologues in leaves
The C 4 photosynthetic pathway is a complete and stable system, and its smooth operation requires synergy in the expression levels of multiple genes involved in this pathway. Therefore, we explored the difference in the expression synergy in leaves between C 4 genes in C 4 crops and C 4 orthologues in C 3 crops. As no C 4 -orthologous gene copy was found in rice, a closely orthologous gene (OsPEPC3) was used instead. All six C 4 genes were highly expressed cooperatively in the leaves of C 4 crops, while for the C 4orthologous genes, only RbcS and b-CA in C 3 crops had similar expression levels in leaves to those of C 4 genes in the leaves of C 4 crops, and the remaining genes, PEPC, MDH, ME, and PPDK, were all expressed at much lower levels in leaves of C 3 crops (Figure 4). This indicated that although C 4 -orthologous genes existed in C 3 crops, their expression could not be well coordinated to form the full C 4 photosynthetic pathway. Therefore, increasing and coordinating the expression levels of these C 4 -orthologous genes are of great importance for the activation of the C 4 pathway in C 3 crops.

Discussion
4.1 Key enzymes in the C 4 photosynthetic pathway of C 3 and C 4 Gramineae crops In this study, almost all homologous genes encoding the six key enzymes of the C 4 photosynthetic pathway were identified in C 3 crops, consistent with the finding that all genes required for the C 4 pathway are present and expressed in C 3 species (Aubry et al., 2011;Williams et al., 2012). Based on genome-wide analysis, more gene copies were found than previously reported, which is highly advantageous. Moreover, the annotations of PPDK were revised based on our in-depth analysis of PPDK genes.
Almost all the C 4 -homologous genes in C 3 crops of wheat and rice were identified based on the analysis of phylogenetic relationships, tissue expression characteristics, and subcellular localization. In addition to their C 4 photosynthetic gene copies, at least three non-photosynthetic paralogous copies were found in the b-CA, PEPC, ME, and MDH gene families, and similar gene structure and protein sequence characteristics were observed among the orthologous genes from different crops, which indicated that the non-photosynthetic homologous genes were relatively conserved in C 3 and C 4 crops during the evolution of the C 4 photosynthetic pathway. Consistent with previous studies, four gene lineages of independent origin from those of C 4 grass crops, b-CA3, PEPC4, ME4, and MDH3, were found to have been recruited to the C 4 photosynthetic pathway, confirming the identity of the gene lineages during C 4 evolution, and these gene copies in C 3 crops are important candidates for gene manipulation (Christin et al., 2013;Moreno-Villena et al., FIGURE 4 The expression synergy of the C 4 genes and C 4 orthologues in leaves of C 4 and C 3 species. The different coloured lines/dots represent the C 4 /C 4orthologues encoding six key enzymes in different gramineous crops. Wheat-A, wheat-B, and wheat-D represent three copies of the different subgenomes of bread wheat. Millet represents foxtail millet. The genes presented include b-CA3, PEPC4, ME4, MDH3, and PPDK1.1 orthologous groups, and the RbcS genes with the highest expression in each species. Because no PEPC4 orthologue was found in rice, the nearest paralogue, OsPEPC3, was used instead. 2017). Moreover, the positive effects of gene duplication on the expansion of these gene families were confirmed, as 83.33% of RbcS genes and 73.53% of b-CA genes were associated with tandem duplication events, with more duplicated genes in the branches with C 4 photosynthetic gene copies. As previously reported, gene duplication was the first step in the evolution of C 4 photosynthesis (Sage et al., 2012).

4.2
The difference between the C 4ortholous genes in C 3 crops and the C 4 genes in C 4 crops With the evolution of the C 4 photosynthetic pathway, the ancestral C 3 genes underwent modification of the regulatory region or mutation of the coding region, resulting in the adjustment of the intracellular position of key enzymes, obvious kinetic characteristics, and cell-specific expression patterns (Ludwig, 2013;Ludwig, 2016). These changes were also found among the six gene families between the C 3 and C 4 Gramineae crops investigated in this study; for example, ZmCA3, SiCA3, and SbCA3 were all cytoplasmic, while OsCA3 and TaCA3s were located in chloroplasts. Based on the alignment of the conserved domains of C 4 b-CA and C 4 -orthologous b-CA, C 4 b-CA was found to be accumulated in the cytoplasm due to the deletion of a conserved motif at the N-terminus, which led to loss of the chloroplast transit peptide.
Previous studies showed that the encoded enzymes likely varied in their kinetic properties in addition to their leaf and cell specificities. The enzymatic kinetic changes in the C 4 photosynthetic pathway mainly occurred in the two major carboxylases PEPC and Rubisco (Bläsing et al., 2002;Kapralov et al., 2011). The PEPCs from C 3 and C 4 Flaveria species exhibit different kinetic properties in terms of substrate saturation and response to activators and inhibitors, while the PEPC of C 3 -C 4 Flaveria species showed features between those of the C 3 and C 4 congeners, suggesting that the C 4 characteristics were gradually obtained during the evolution process (Engelmann et al., 2003). It has been confirmed that the serine residue (serine 774 of the C 4 F. trinervia PEPC) at the C-terminus of the C 4 PEPC protein determines the difference in substrate affinity (Bläsing et al., 2002); similarly, this feature was found in the present study. In addition, based on large-scale multiple sequence alignment, two ((T/S)24G and N56I on OsRbcS1) and one (G842A on ZmPEPC4) amino acid mutations in C 4 RbcS and PEPC were found to be highly specific to C 4 Gramineae crops and were located in highly conserved regions of these proteins. The 8 amino acid residues at positions 52-59 of OsRbcS1 constitute a protein binding site, and the significant variation at position 56 may have a critical impact on the function of Rubisco. In addition, specific amino acid substitutions have been associated with C 4 functionality. In the C 4 PEPC isoform of maize, Increased tolerance to feedback inhibition by malate involves specific G884 (Glycine) in C 4 -isoforms (Paulus et al., 2013). The G842A mutation in ZmPEPC4 may have a similar effect on PEPC enzyme kinetics. In summary, the discovery of these mutation sites may provide a basis for the interpretation of the novel kinetic characteristics of key C 4 carboxylases.

4.3
The promoter regions and the expression characteristics of C 4 genes and C 4 -orthologous genes in C 4 /C 3 crops Different C 4 orthologues in C 3 crops exhibited different expression characteristics. All C 4 genes in C 4 crops were highly expressed in leaves. RbcS and b-CA in C 3 and C 4 crops showed similar expression levels, while the C 4 orthologues of PEPC, ME, MDH, and PPDK exhibited much lower expression in the leaves of C 3 crops than that observed for the C 4 genes in the leaves of C 4 crops. C 4 -orthologous RbcS in C 3 crops was highly expressed in MCs, and C 4 RbcS in C 4 species was highly expressed in BSCs. In this study, we observed differences in tissue expression and subcellular localization of gene copies of the same lineage between C 3 and C 4 crops, and high levels of expression in leaves and localization to specific organelles were confirmed to be key for recruitment to the C 4 photosynthetic pathway, which is consistent with previous reports (Ludwig, 2016;Moreno-Villena et al., 2017). Multiple elements related to hormone response-related and meristem-specific expression were found 300 bp upstream of the C 4 RbcS initiation codon, which may directly regulate the specific expression of C4 RbcS in BSCs. Previous promoter deletion experiments on the maize RbcS promoter region confirmed that deletion of the -211 to +434 region resulted in the accumulation of RbcS in BSCs, which is consistent with our results (Purcell et al., 1995). However, regional differences exist in the three representative C 4 crops, confirming that high expression of C 4 RbcS in BSCs is regulated similarly.
In addition to the discovery that the promoter region regulated the cell expression targeting of the C 4 gene, regions that may regulate PEPC expression levels were also found. The C 4 PEPC promoter regions had more cis-regulatory elements related to MYB transcription factors, including multiple light-and droughtinducible MYB binding sites. This indicated that regulation by MYB transcription factors may play a positive role in the recruitment of PEPC to the C 4 pathway. Studies have shown that the 0.6 kb upstream sequence of the maize PEPC promoter, which corresponds to region 2, can bind to proteins to regulate MCspecific expression of PEPC (Bläsing et al., 2002), and a mesophyll enhancement module (MEM1) at the distal end (-1566 to -2141) of the C 4 PEPC promoter of F. bidentis, which corresponds to region 3, was predicted in this study (Akyildiz et al., 2007). In general, the Gramineae C 4 species and F. bidentis C 4 species seem to have similar regulatory patterns for C 4 PEPC.
Unfortunately, no additional elements that may regulate the expression levels of C 4 MDH, ME, or PPDK were found in their promoter regions. The manner in which the promoter regions of these genes are regulated in Gramineae C 4 species may be different, with the regulation possibly being more determined by the process of DNA methylation and trans-acting factors.

4.4
The synergistic expression of C 4 and C 4 -orthologous genes in leaves of C 4 /C 3 crops As a complete system, the expression of individual genes involved in the C 4 photosynthetic pathway must be well coordinated to form a complete and stable metabolic pathway. Our analysis showed that the expression levels of multiple C 4 orthologues, including PEPC, PPDK, MDH, and ME, were much lower and not well coordinated in the leaves of C 3 crops; therefore, improving and coordinating their expression might be a strategy for activating the C 4 pathway in C 3 crops.
Based on the analysis in this study, the necessary steps for the conversion of the C 4 -homologous photosynthetic pathway to the C 4 photosynthetic pathway in the currently examined C 3 crops were proposed ( Figure 5). The acquisition of the C 4 photosynthetic pathway involves not only the improvement of the expression level of the key genes in the pathway but also a variety of changes, including cell-specific expression patterns, adjustments in the intracellular location of the enzymes, and their kinetic characteristics. Therefore, altering the expression level of one or two genes is not sufficiently effective for the establishment of the entire C 4 system. In addition, studies based on maize, rice, and Cleome species have shown that some regulatory information determining cell-specific expression already exists in the ancestral C 3 gene, and the expression characteristics and compartmentalization modification observed in the C 4 species are regulated by at least one trans-acting factor or transcription factor (Aubry et al., 2011). Therefore, a wider range of in-depth comparisons of C 3 crops and their related C 4 crops is necessary for the transformation of the C 4 photosynthetic pathway in C 3 crops. In addition, great progress has been made in improving Rubisco's CO 2 specificity by genetic manipulation and significantly improving the photosynthetic efficiency of C 3 crops (Martinavila et al., 2020); the redesign of the bypass pathway based on photorespiration also greatly improved the photosynthetic efficiency of rice (Shen et al., 2019), which may also be an effective way to improve the photosynthetic efficiency of wheat and C 3 crops. Although in this study, a large number of key regions and amino acid variations that may affect the transformation of C 4homologous genes to C 4 genes were observed in the C 4 -orthologous genes of C 3 crops, more experiments should be performed to verify these findings in the future.

Conclusion
In this study, the genes encoding the 6 key enzymes of the C 4 photosynthetic pathway in five important gramineous crops were systematically identified and characterized. The C 4 functional gene copies were distinguished from the non-photosynthetic functional gene copies based on phylogenetic relationships, subcellular localization, and expression characteristics. Two mutations ((T/S) 24G and N56I on OsRbcS1) of RbcS and one mutation (G842A on ZmPEPC4) of PEPC were found to be specific to the C 4 crops, which may affect the enzymatic kinetics of major carboxylases. Possible cis-acting elements regulating the specific expression of C 4

FIGURE 5
Schema of C 3 photosynthesis and NADP-malic enzyme (NADP-ME)-type C 4 photosynthesis in gramineous crops. From top to bottom are the anatomical structure and photosynthetic model. The dotted line in the middle marks the necessary steps for the photosynthetic pathway to evolve from C 3 to C 4 . Cp is the chloroplast, cyt is the cytoplasm, and the red box and blue box represent high/low expression levels, respectively. genes were found in their promoter regions. The expression of most C 4 orthologues in the leaves of C 3 crops was much lower and not well coordinated, and the necessary steps for C 4 photosynthetic transformation of C 3 crops were proposed. The results of this study will help progress research into the C 4 photosynthetic transformation of important C 3 species, such as rice and wheat.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.