Molecular Evolution and Functional Diversification of Replication Protein A1 in Plants

Replication protein A (RPA) is a heterotrimeric, single-stranded DNA binding complex required for eukaryotic DNA replication, repair, and recombination. RPA is composed of three subunits, RPA1, RPA2, and RPA3. In contrast to single RPA subunit genes generally found in animals and yeast, plants encode multiple paralogs of RPA subunits, suggesting subfunctionalization. Genetic analysis demonstrates that five Arabidopsis thaliana RPA1 paralogs (RPA1A to RPA1E) have unique and overlapping functions in DNA replication, repair, and meiosis. We hypothesize here that RPA1 subfunctionalities will be reflected in major structural and sequence differences among the paralogs. To address this, we analyzed amino acid and nucleotide sequences of RPA1 paralogs from 25 complete genomes representing a wide spectrum of plants and unicellular green algae. We find here that the plant RPA1 gene family is divided into three general groups termed RPA1A, RPA1B, and RPA1C, which likely arose from two progenitor groups in unicellular green algae. In the family Brassicaceae the RPA1B and RPA1C groups have further expanded to include two unique sub-functional paralogs RPA1D and RPA1E, respectively. In addition, RPA1 groups have unique domains, motifs, cis-elements, gene expression profiles, and pattern of conservation that are consistent with proposed functions in monocot and dicot species, including a novel C-terminal zinc-finger domain found only in plant RPA1C-like sequences. These results allow for improved prediction of RPA1 subunit functions in newly sequenced plant genomes, and potentially provide a unique molecular tool to improve classification of Brassicaceae species.


INTRODUCTION
Replication protein A (RPA) is a heterotrimeric, single-stranded DNA (ssDNA) binding protein complex that is highly conserved in eukaryotes (Wold, 1997). It was first identified as an indispensable component for the in vitro replication of simian virus 40 (SV40) DNA (Wobbe et al., 1987). Its primary function is to protect ssDNA from nucleolytic damage and hairpin formation during DNA metabolism. Consistent with its function, RPA plays essential roles in almost all DNA metabolic pathways including DNA replication, transcription, recombination, DNA damage surveillance and recognition, cell-cycle checkpoints, and in all major types of DNA repair including base excision, nucleotide excision, mismatch, and double-strand break repair (Wold, 1997;Iftode et al., 1999;Zou et al., 2006). RPA participates in these diverse DNA metabolic pathways through its ability to interact with DNA and numerous proteins involved in these processes (Iftode et al., 1999;Zou et al., 2006). In both animal and yeast (Saccharomyces cerevisiae) cells, RPA is hyper-phosphorylated upon DNA damage or replication stress by checkpoint kinases including ATM, ATR, and DNA-PK (Binz et al., 2004;Vassin et al., 2004;Liu et al., 2006). The hyper-phosphorylation may switch the function of RPA from DNA replication to DNA repair by inducing changes in the RPA structure and RPA-DNA and RPA-protein interactions, suggesting that animal and yeast cells use a structure-based modulation mechanism of RPA (Binz et al., 2004;Vassin et al., 2004).
Why do plants contain multiple functional RPA1 gene paralogs? Paralogous genes are generated by events such as whole-genome duplication, segmental duplication, and tandem gene duplication. Unlike animals, genome duplication is prominent in plants (Lockton and Gaut, 2005;Cui et al., 2006). For example, Arabidopsis has experienced at least three events of whole genome duplications (Vision et al., 2000;Simillion et al., 2002). There are three possible fates of genes after duplication. These are (i) nonfunctionalization/pseudogenization or loss, where one copy loses its function(s) by degenerative mutations, (ii) subfunctionalization where both copies become partially compromised by mutation accumulation to the point at which their total capacity is reduced to the level of the single-copy ancestral gene, and (iii) neofunctionalization where one copy acquires a novel, beneficial function and become preserved by natural selection, with the other copy retaining the original function (Lynch and Conery, 2000;Zhang, 2003;Moore and Purugganan, 2005;Louis, 2007). The subfunctionalization model also predicts that duplicate genes will share overlapping redundant functions early in the process of functional divergence (Moore and Purugganan, 2005).
Using functional genetic analysis we previously showed that all of the five Arabidopsis RPA1 paralogs are functional (Aklilu et al., 2014). We also reported that these paralogs have undergone subfunctionalization but also share different degrees of overlapping redundant functions. For instance, the rpa1c mutant displays hypersensitivity to DNA double-strand breaks induced by both ionizing radiation and camptothecin, while the rpa1e mutant shows hypersensitivity only to ionizing radiation. Combination of rpa1c and rpa1e results in additive hypersensitivity to a variety of DNA damaging agents. Overall, the results suggest that RPA1C and RPA1E each play unique roles in the repair of DNA damage, with RPA1C playing a leading role in promoting double-strand break repair (Aklilu et al., 2014).
Conversely, the rpa1b rpa1d double mutant has a severely defective growth and developmental phenotype (reduced fitness) under normal growth conditions. However, it displays similar sensitivity as wild-type plants for DNA damaging agents. The growth and developmental abnormalities of the rpa1b rpa1d double mutant are likely a result of defects in DNA replication that generate abnormal cell division suggesting both RPA1B and RPA1D are required for normal DNA replication (Aklilu et al., 2014).
In addition to DNA repair and DNA replication activities, RPA1 proteins play essential roles in progression of meiosis. Of all the five rpa1 single mutants, only the rpa1a single mutant displays an obvious defective meiotic phenotype, observed as lower seed set and reduced crossover formation (Osman et al., 2009;Aklilu et al., 2014). However, the rpa1a rpa1c double mutant is completely sterile, with incomplete synapsis and chromosome fragmentation observed during meiosis (Aklilu et al., 2014). This suggests that both RPA1A and RPA1C play important and perhaps overlapping roles during meiosis.
In order to understand the evolutionary history and identify specific sequence variations responsible for the functional diversification of RPA1 throughout plants, we have analyzed the nucleotide and amino acid sequences, gene conservation, and structural features of all available plant RPA1 paralogs. The sequence and evolutionary analysis together with our new additional genetic analysis reveals that plant RPA1 gene family is divided into three main evolutionary groups (A, B, and C): Group A (including A. thaliana RPA1A) is primarily responsible for meiotic progression. Group B (including A. thaliana RPA1B and RPA1D) is responsible for normal DNA replication. Group C (including A. thaliana RPA1C and RPA1E) is primarily responsible for DNA damage repair. As we show here, these groups have unique coding and regulatory sequences, gene and protein structure, and different level of gene conservation and expression that are consistent with these proposed functions, and highlight how evolution of this gene family occurred into specialized members. These data provide needed sequence structure context for future biochemical studies of RPA function in plants.

Plant Materials and Growth
Plant growth conditions and plant lines, including Wild-type (Wt) and all RPA1 single mutant lines used are previously described (Aklilu et al., 2014).

Microscopy
Meiotic chromosome spreads were prepared as described (Armstrong et al., 2001). Slides were mounted with DAPI (2.5 mg/ml) in Vectashield. Chromosomes were visualized with a confocal microscope (Zeiss LSM 510) using the halogen light. Images were then captured using an externally mounted digital camera (Olympus CKX41) and processed with a SPOT microscope imaging software.

Sequences Sources
Promoter, coding, and amino acid sequences were obtained from the National Center for Biotechnology Information (NCBI) and The Arabidopsis Information Resources (TAIR) database. Orthologous RPA1 sequences were identified by employing NCBI BLAST searches. Full length Arabidopsis RPA1 amino acid sequences were used for the BLAST search against the genome of each organism. In addition, sequences identified at each step are used to refine the search. Orthology was confirmed by reverse BLAST. The lowest e-value and maximum percent identity BLAST hit was chosen as the putative ortholog of the respective Arabidopsis RPA1 protein.

Phylogenetic Analysis
Phylogenetic analysis of the RPA1 protein sequences was performed with the MEGA5 software package (Tamura et al., 2011). Amino acid sequences were aligned using ClustalW (default parameters) with in the MEGA5 software package. Trees were constructed using maximum likelihood methods with Jones-Taylor-Thornton (JTT) amino acid substitution model and 1000 bootstrap replicates. Except from bootstrapping and choice of model, all other parameters were left at default settings. Additional neighbor joining (NJ) trees were constructed with the same amino acid substitution model, bootstrap replicates, and NJ tree construction default parameters in MEGA5 to confirm the results of the ML analysis.

Codon Bias/Usage
Frequency of optimal codons (F OP ) was calculated based on optimal codons identified for Arabidopsis (Wright et al., 2004) and rice (Liu et al., 2004).

Gene Structure and Protein Domains
Introns in RPA1 genes were identified using the NCBI gene database. RPA1 protein domains and subdomains were identified using the NCBI Conserved Domain Database [CDD] (Marchler-Bauer et al., 2011).

Sequence Distance and Natural Selection
Analysis of synonymous (dS) and nonsysnonymous (dN) substitution rate were performed with the MEGA5 software package using the Nei-Gojobori substitution model (Tamura et al., 2011). Natural selection (ω) was calculated by dividing dN by dS (dN/dS).

Gene Expression
Level and pattern of RPA1 genes expression at different developmental stages and tissues were retrieved from genevestigator (https://www.genevestigator.com) and Arabidopsis eFP browser (http://bar.utoronto.ca/efp/cgibin/efpWeb.cgi). Genevestigator is an online visualization tool that summarizes results from thousands of high quality transcriptomic experiments often done by cDNA microarrays (Hruz et al., 2008). Arabidopsis eFP browser is also a data visualization tool for exploring degree and location of gene expression (Winter et al., 2007).

Analysis of cis-Elements in Arabidopsis RPA1 Promoter Regions
Promoter elements for all the sequences were analyzed using the PLACE (Plant Cis-acting Regulatory DNA Elements) database (http://www.dna.affrc.go.jp/PLACE/; Higo et al., 1999).

Arabidopsis RPA1A Plays a Leading Role in Meiosis
We previously described the role of RPA1A and RPA1C during meiotic progression employing homozygous mutants rpa1a, rpa1c, and the double mutant rpa1a rpa1c (Aklilu et al., 2014). These results suggest a genetically redundant role for both RPA1A and RPA1C during early stages of meiosis (crossing over and repair). A previous study suggests RPA1A plays a unique role during later stages of meiosis (second-end capture; Osman et al., 2009), but there is currently no evidence to support a role for RPA1C at later stages of meiosis. To determine which if any of these two subunits plays a more predominant role during early (repair) stages of meiosis (and thereby providing additional clarity into the functional classification of these genes), we generated two additional heterozygous mutant combinations, rpa1a (+/−) rpa1c (−/−) and conversely rpa1a (−/−) rpa1c (+/−). By comparing these two combinations, we sought to determine the relative dominance ("gene dosage") of each in promoting early meiotic repair. As shown in Figures 1A,B, phenotypic analysis of these mutant combinations show that while rpa1a (+/−) rpa1c (−/−) is fully fertile, the rpa1a (−/−) rpa1c (+/−), displays reduced fertility (∼92% reduction in fertility) vs. the rpa1a (−/−) single mutant.
To evaluate meiotic integrity at the chromosomal level, we prepared chromosomal spreads of pollen-mother cells from WT and mutant lines. In comparison to WT, rpa1a, rpa1c, and rpa1a rpa1c, the rpa1a (+/−) rpa1c (−/−) combination displays normal meiotic progression, while the rpa1a (−/−) rpa1c (+/−) displayed both univalents and fragmented chromosomes during metaphase I (Supplementary Figure S1). In addition, rpa1a (−/−) rpa1c (+/−) combination showed highly fragmented chromosomes and unequal segregation during the subsequent stages anaphase I and II (Supplementary Figure S1). These results show that the rpa1c mutation in the rpa1a background is semidominant during meiotic repair, and suggest that RPA1C is either less effective at meiotic repair than is RPA1A, or that RPA1C is only required for repair of a small subset of meiotic double-strand breaks. We therefore propose that RPA1A plays the leading role in both early and late stages of meiotic crossing over, but that RPA1C can either partially or completely fulfill the role of RPA1A during early meiotic stages (DNA repair and initiation of recombination) in its absence. This is consistent with rice rpa1a single mutants that show defective DNA repair during meiosis (Chang et al., 2009). However, it would be useful to determine combined effects of RPA1A and RPA1C deficiency in rice for a more direct comparison of rice and Arabidopsis (monocot vs. dicot) meiotic progression, since rice rpa1c knock-down lines also display meiotic deficiencies (Li et al., 2013).

The Plant RPA1 Gene Family is Composed of Three Distinct Groups
Animal and yeast RPA1 is generally encoded by a single gene (Wold, 1997;Iftode et al., 1999). Genetic studies in rice and Arabidopsis suggest two general groups of RPA1 encoding genes, one responsible for DNA replication and one for DNA repair (Ishibashi et al., 2006;Aklilu et al., 2014). In order to determine conservation of these groups throughout the plant kingdom, we conducted BLAST searches employing known RPA1 genes in Arabidopsis and constructed maximum likelihood phylogenetic trees from identified sequences to determine phylogenetic relationships. Full-length Arabidopsis RPA1 amino acid sequences were employed for BLAST searches against the complete genome of individual plant and algae species found in the NCBI database. In addition sequences identified at each step were used to detect additional homologs. Orthology was confirmed by reverse BLAST. The lowest e-value and maximum percent identity sequence identified was chosen as the putative ortholog of the respective Arabidopsis RPA1 protein (Supplementary Table S1). RPA1 has four conserved domains; DBD-F, DBD-A, DBD-B, and DBD-C. Some of the putative RPA1 orthologs have either a DBD-F or a DBD-C deletion. Human RPA1 containing a DBD-F deletion retains replication activity, while a DBD-C deletion is non-functional (Gomes and Wold, 1996;Haring et al., 2008). To be as inclusive as possible while eliminating potential psuedogenes or unrelated groups, we included those putative sequences that do not contain a DBD-F domain sequence in our analysis (Figures 2, 3).
To address phylogenetic relationships of all identified sequences, we employed maximum likelihood and neighborjoining analyses as described in the Materials and Methods section. Shown in Figure 2 is our resulting maximum-likelihood phylogenetic analysis (and is very similar employing neighborjoining methods) of RPA1 from all identified plant-related sequences. This analysis shows the plant RPA1 family is generally divided into three distinct groups composed of RPA1A (A group), RPA1B (B group), and RPA1C (C group) (  Table S1). RPA1D and RPA1E, which are found only in the Brassicaceae family, are members of the B and C group, respectively. Within each major clade (for example, RPA1B) each taxa are arranged in a sub-clade of dicot, monocot, primitive plant (Physicomitrella patens and Selaginella mollendorffii), and unicellular green algae (V. carteri, O. lucinarinus and C. subellipsoida). The common ancestor of RPA1B of dicots and monocots appears diverged from an RPA1Blike progenitor of primitive plants. This ancestral RPA1B in primitive plants appears to have diverged from an RPA1B-like progenitor in unicellular green algae (Figure 2, Supplementary Figure S2). The evolutionary history for RPA1A and RPA1C appears different from RPA1B. The common ancestor of RPA1A of dicots and monocots appears diverged from RPA1A of primitive plants and RPA1C of dicots. In turn, RPA1A of primitive plants and RPA1C of dicots likely diverged from the common ancestor leading to RPA1C of monocots which itself appears to have diverged from the RPA1A/C progenitor in unicellular green algae (Figure 2, Supplementary Figure S2).
Interestingly, some plants contain additional RPA1 group members. Soybean (Glycin max) and maize (Zea mays) have two RPA1B-like sequences; sorghum (Sorghum bicolor) and millet (Setaria italica) have four and two RPA1C-like sequences, respectively. Members of the Brassicaceae family (A. thaliana, Arabidopsis lyrata, and Capsella rubella) have two additional RPA1 types, RPA1D and RPA1E. In terms of both sequence and functional similarity RPA1D and RPA1E are close paralogs of RPA1B and RPA1C, respectively (Figure 2; Aklilu et al., 2014), suggesting these are duplications of RPA1B and RPA1C that occurred only in the Brassicaceae family. Unlike other plants, barrel clover (Medicago truncatula) has only two types FIGURE 2 | Evolutionary relationships of RPA1 proteins. (A) RPA1A group, (B) RPA1B group, (C) RPA1C group. The evolutionary history was inferred using the Maximum-Likelihood method performed with MEGA5.2 software package. Amino acid sequences were aligned using ClustalW and used to produce phylogenetic trees using the Jones-Taylor-Thornton (JTT) of RPA1 (RPA1A and RPA1C) and it does not appear to have an obvious RPA1B. Given the importance of RPA1B in DNA replication (Aklilu et al., 2014) and the fact that other member of the Fabaceae family (Soybean) has RPA1B, it is possible that lack of this sequence is due to sequencing/annotation errors. Alternatively, barrel clover may actually lack RPA1B since the RPA1A-like sequence from barrel clover contains a subdomain that is mostly found only in the N-terminal domain of RPA1B group (see Discussion of subdomains below). If so, this might suggest that Mt RPA1A evolved to perform dual functions in both meiosis and replication. From our analysis we find that primitive plant genomes generally contain two types of RPA1 sequences, RPA1A/C-like and RPA1B-like. However, only P. patens and S. mollendorffii from this group have a completed genome sequence (Goodstein et al., 2012). Additional genome sequencing of more species would be needed to determine if RPA1C is present in primitive plants.
Interestingly the RPA1A/C-like progenitor sequences of unicellular green algae are more similar to each other than to their orthologous RPA1A/C counterparts in dicots, monocots, and primitive plants (Figure 2, Supplementary Figure S2). This suggests that the plant RPA1A and RPA1C groups have evolved significantly since divergence from its counterpart in unicellular green algae, or alternatively the algae ancestor has evolved more rapidly since splitting from higher plant lineages. To determine the early evolutionary relationship of RPA1 proteins, we searched for RPA1 orthologs from two additional algal completed genomes (employing the method described above) to construct a maximum likelihood phylogenetic tree focused on algal RPA1 sequences. Interestingly, all five unicellular green algae that represent five different genera have two types of RPA1 sequences, an RPA1B-like group common to all the algae representatives in this study, and an RPA1A/C-like group (Figure 3; Supplementary Table S1; Supplementary Figure S2). This suggests that the first duplication event of RPA1 that led to diversification in plants likely occurred during the evolution of green algae. This is further supported by the fact that red algae generally contain only a single RPA1-like sequence (Supplementary Table S2). We propose a model (described below) for the expansion of RPA1 from green algae to higher plants based upon the phylogenetic relationships and sequence groups described above, in combination with RPA1 domain structure described in the next five sections below.

Structure and Domains of Plant RPA1 Proteins
Eukaryotic RPA1 contains four highly-conserved domains termed the N-terminal domain or DNA Binding Domain F (DBD-F), two structurally similar central DNA Binding Domains (DBD-A and DBD-B), and a C-terminal DNA Binding Domain C (DBD-C) (Wold, 1997;Takashi et al., 2009;Figure 4). Studies in yeasts and animals suggest that the N-terminal domain (DBD-F) is primarily involved in protein-protein interactions, the middle domains (DBD-A and DBD-B) function in ssDNA binding, while the C-terminal domain (DBD-C) is involved in DNA damage recognition during nucleotide excision repair and subunit interaction (Wold, 1997;Lao et al., 2000;Zou et al., 2006).
To characterize conservation of these domains throughout plant RPA1 sequences as well as relate known functional characteristics to domain structure, we compared all A. thaliana RPA1 sequence domain structure to human and yeast RPA1as summarized in Figure 4. We employed the NCBI Conserved Domain Database [CDD] (Marchler-Bauer et al., 2011) to identify RPA1 structure, domains, and sub-domains. All sequences contain these four conserved domains (DBD-F, DBD-A, DBD-B, DBD-C). However, we find that each group contains unique subdomains or motifs that may contribute to functional differences. In the following sections we describe these subdomains and motifs in detail. With very few exceptions (Sorghum RPA1C and P. patens RPA1B contain DBD-F deletion), the general structure, domains, subdomains, and motifs of the three groups of A. thaliana RPA1 proteins are conserved throughout all plants examined here (Supplementary Figures  S3-S6).
RPA1B is Unique Among Other RPA1 Sequences Through Loss of the Binding Surface I Domain DBD-F of human RPA1 contains a subdomain called "Binding Surface I" (BS-I) or basic cleft (Figure 4, pink inset box) (Jacobs  Marchler-Bauer et al., 2011). Multiple studies demonstrate that the BS-I subdomain in human RPA1 and a similar region in yeast RPA1 is required for DNA repair activity but not required to support DNA replication (Longhese et al., 1994;Umezu et al., 1998;Bochkareva et al., 2005;Haring et al., 2008;Xu et al., 2008). Five highly conserved amino acid residues constitute the protein interaction sites of BS-I; position 1 is generally a polar residue (N/Q/S/T), position 2, 3, and 5 are basic (R/K), and position 4 is hydrophobic (L/V/I) (Supplementary Figure S3A; Supplementary Table S3). In all plants analyzed here, except Sorghum RPA1C-3, sequences phylogenetically predicted to fall into the RPA1A and RPA1C groups generally display these same conserved features as found in animal and fungal RPA1 (Figure 4; Supplementary Figures S3B,C). One notable exception among plant sequences is at position 1 where the conserved polar residue (N/Q/S/T) is replaced by a negatively charged residue (D/E). Interestingly, the predicted plant RPA1B group members do not display the conserved features of BS-I in most or all five positions, being replaced by chemically non-similar residues (Supplementary Figure S3D). For instance, RPA1B sequences generally contain a negatively charged amino acid (D) at position 1 instead of a conserved polar residue. In addition, position 3 in most RPA1B sequences contains a polar residue, while position 5 contains a negatively charged residue (E/D), instead of the conserved basic residues (R/K). Similarly, the hydrophobic residue found at position 4 (L/V/I) is in most cases replaced by a polar residue (T/S). These data suggest that plant RPA1B evolved into a more specialized member of the RPA1 family through loss of this domain (BS-I) to interact primarily with DNA replicationoriented pathways consistent with proposed functions of this group (Aklilu et al., 2014). In unicellular green algae only two of the five RPA1A/C progenitor sequences (O. lucimarinus RPA1E and M. pussila RPA1C) contain BS-I. The other sequences have either an N-terminal (DBD-F) deletion (C. reinhardtii RPA1C) and thus do not contain BS-I, or have N-terminal domain but do not contain BS-I (V. cateri RPA1C and C. subellipsoida RPA1A).
RPA1B Sequences Contain a Conserved N-Terminal (DBD-F) Nucleic Acid Binding Surface Termed Generic Binding Surface I OB-fold domains are known for their ssDNA-binding activity. However, no clear conservation of amino acid residues that directly interact with nucleic acids has been identified among the domains (Theobald et al., 2003). For example, except for two aromatic residues that stack with DNA bases, there is no conservation among the rest of the 6-9 amino acids that are predicted to contact ssDNA and are found in the two main ssDNA-binding OB fold domains (DBD-A and DBD-B) of human RPA1 (Bochkarev et al., 1997). Nevertheless, alignments of multiple OB fold domains reveal a pattern of hydrophobic residues, mostly flanked by polar or charged residues, in alternating amino acid positions that are conserved for short stretches of sequence around the site of ssDNA-binding (Theobald et al., 2003). These patterns are generally referred to as Generic Binding Surface I (GBS-I), and are found in DBD-A, DBD-B, and DBD-C of eukaryotic RPA1.
Analysis of plant RPA1 sequences employing NCBI CDD reveals that an additional GBS-I domain is present in DBD-F of RPA1B group sequences [ Figure 4 (blue inset box); Supplementary Figure S4A]. This domain displays the classic pattern of hydrophobic residues flanked by polar or charged residues found in all other GBS-I domains, although the conservation of amino acid identity is unique among this particular domain in plant DBD-F. Interestingly, some plant RPA1A-and RPA1C-like sequences also contain this GBS-I in DBD-F (Supplementary Figure S4C) and these special cases are discussed below. In unicellular green algae only three of the five RPA1B-like sequences (RPA1B of O. lucimarinus, C. reinhardtii, and C. subellipsoida) contain GBS-I. The other two sequences (RPA1B of M. pussila and V. cateri) do not contain GBS-I as they have a truncated N-terminal domain (DBD-F). Interestingly, however, RPA1A of C. subellipsoida does contain a GBS-I domain in DBD-F. In general, this domain structure is unique to plant and green algae, and not found in animals and fungi.
Human RPA binds to ssDNA in two modes. The first mode has an occluded size of 8-10 nucleotides (Blackwell and Borowiec, 1994) and accomplished by the two major ssDNA-binding domains, DBD-A and DBD-B (Bochkareva et al., 2001;Hass et al., 2012). The second binding mode has an occluded binding size of ∼30 nucleotides (Kim et al., 1992(Kim et al., , 1994 and involves DBD-A, -B, and -C of RPA1 and DBD-D of RPA2 (Bastin-Shanower and Brill, 2001;Cai et al., 2007). DBD-F has a weak ssDNA-binding affinity (Daughdrill et al., 2001;Bochkareva et al., 2005) and is not involved in either binding modes (Cai et al., 2007). The ability of RPA binding to short nucleotides (8-10 nt) is required for DNA repair function, but not DNA replication (Hass et al., 2012). Conversely, it is proposed that the ability of RPA to bind to long stretches of ssDNA (∼30 nt) is required for its function in DNA replication rather than in DNA repair, as there is extensive DNA unwinding and exposure of long ssDNA intermediates during DNA replication vs. DNA repair (Hass et al., 2012). Accordingly, our result suggests that in plants, DBD-F of the RPA1B group may participate in ssDNA-binding (as it contains GBS-I) so that together with other DBDs (DBD-A, DBD-B, DBD-C, and DBD-D in RPA2) it may result in an occluded size of more than 30 nucleotides.
The presence of GBS I in DBD-F of barrel clover and Brachypodium distachyon RPA1A-like sequences, and millet and sorghum RPA1C-like sequences (Supplementary Figure  S4C) suggests these sequences could participate in DNA replication activities. For example, since we were unable to identify a predicted RPA1B sequence in barrel clover, it is possible the RPA1A-like sequence might have gained the GBS I subdomain to function in place of RPA1B. B. distachyon has two RPA1A (RPA1A-1 and RPA1A-2), perhaps allowing the paralogs to accumulate small gradual mutations and thereby gain new domains and functions as purifying selection is weaker on duplicate genes (Castillo-Davis et al., 2004). This same explanation can be argued for RPA1C-2 and RPA1C-4 of millet and sorghum, respectively.

Plant RPA1 Sequences Contain Unique Zinc-Finger Motifs Within the DBD-C C-Terminal Domain
RPA1 sequences from yeasts and animals contain a C4-type (C-X 2 -C-X 13 -C-X 2 -C) zinc-finger motif (ZFM) within the DBD-C domain (Wold, 1997). This motif is required for DNA replication (Lin et al., 1996(Lin et al., , 1998Walther et al., 1999), DNA damage recognition (Lao et al., 2000), and for proper structural formation of RPA complexes (Lin et al., 1996). Employing NCBI CDD protein domain analysis we find that indeed all plant RPA1 sequences also contain ZFM sequences. However, RPA1A and RPA1C contain a C6-type (C-X 3 -C-X 8 -C-X 13 -C-X 2 -C-X 6 -C) ZFM, while RPA1B contains a C5-type (C-X 2 -C-X 13 -C-X 2 -C-X 6 -C) ZFM. Conservation of these domains among all plant species analyzed is high, with only a few exceptions seen in sorghum RPA1C-3 and RPAC-4, rice RPA1C, and in the Lycophyte S. moellendorffii RPA1B-like sequences (Supplementary Figure S5). RPA1A/C-like progenitor sequences of unicellular green algae contain a C-4 type ZFM (Supplementary Figure S5E). However, in M. pussila RPA1C and O. lucimarinus RPA1E, the fourth residue of the conserved motif is histidine (H) instead of cysteine (C). Interestingly, RPA1B-like sequences of unicellular green algae do not contain a conserved C4-type ZFM in DBD-C (Supplementary Figure S5F).
ZFMs are known for their role in protein-DNA and proteinprotein interactions (Krishna et al., 2003;Gamsjaeger et al., 2007). The human RPA1 ZFM interacts with both normal and damaged ssDNA with low and high affinity, respectively, thereby contributing to optimal ssDNA-binding activity (Dong et al., 1999;Walther et al., 1999;Lao et al., 2000). Human RPA1 that contains a mutation in either the two cysteine amino acids of the ZFM, or a complete deletion of the motif, retains ssDNA binding activity, heterotrimeric complex formation, and DNA repair promoting activities, but does not support DNA replication (Lin et al., 1996(Lin et al., , 1998Walther et al., 1999). The latter replication defect may be due to a failure to load polymerase δ at replication forks during the elongation step of DNA replication (Lin et al., 1998), or due to improper heterotrimeric complex formation, as the ZFM generally affects the structure of the RPA complex (Dong et al., 1999;Walther et al., 1999;Bochkareva et al., 2000). In plants, it is possible that the different types of ZFMs in the RPA1A/1C and RPA1B group may contribute to the functional specificity of these proteins in DNA metabolism through direct and specific protein-DNA and/or protein-protein interactions. The ZFMs may also contribute to the formation of different types of RPA heterotrimeric complexes. In support of this, Arabidopsis RPA1A preferentially forms a complex with RPA2B, and RPA1B preferentially forms a complex with RPA2A (Eschbach and Kobbe, 2014). Also in rice, each of the three RPA1 subunits preferentially forms a complex with a specific RPA2 subunit (Ishibashi et al., 2006).

RPA1C Group has a C-Terminal Extension Region that Contains a Glycine/Serine-Rich Domain Interspersed by a CCHC-Type ZFM
In contrast to other RPA1 sequences examined here, plant RPA1C-like sequences (including RPA1E in Brassicaceae) contain a unique C-terminal extension region that contains an average amino acid sequence length of ∼176 in RPA1C and ∼119 in RPA1E-like sequences (Figure 4; Supplementary Table S4). CLUSTAL alignments of these C-terminal extensions (beginning at the end of the putative DBD-C domain in each case to the end of the predicted sequence), NCBI CDD domain searches, and manual comparisons were employed to identify common sequence domains or regions. Common among all of these sequences is the presence of at least one zinc-knuckle CCHC-type (CX 2 CX 4 HX 4 C) ZFM. All Brassicaceae members display one ZFM in RPA1C, and one ZFM in RPA1E C-terminal extensions, are roughly at the same position within the alignments (both C-terminal only, and the full-length sequence alignments), and share high sequence identity. Interestingly, in plants that do not have the RPA1C/RPA1E paradigm (non-Brassicaceae), the RPA1C sequence(s) contain multiple ZFM that fall into two clusters (termed here C1 and C2) each with unique sequence identity (Supplementary Figure S6). In most cases (tomato, cucumber, and maize for example) there are two unique ZFM separated by ∼30 amino acids (C1 followed by C2). In other cases, such as rice for example, there is a C1 followed by multiple (three in this case) C2s. The fact that there are two unique cluster types of the ZFM suggests either (1) that these motifs arose independently through acquisition of ZFM-like sequences from two independent sources (genes) in the ancestral gene, or (2) were duplicated from the same ancestral gene but evolved early into two unique ZFM sequences. In the case of Brassicaceae, a likely scenario is that the ancestral RPA1C that gave rise to current RPA1C and RPA1E contained both C1 and C2, but in each case lost C2. If so, this would suggest C2 is only necessary in the presence of C1 to carry out multiple DNA repair-related functions. In unicellular green algae only one of the RPA1A/C progenitor sequences (M. pussila RPA1C) contains CCHC-type ZFM in the C-terminal extension region. The other sequences either do not contain a C-terminal extension region (O. lucimarinus RPA1E) or contain a C-terminal extension region that does not have CCHC-ZFM (V. cateri RPA1C).
This analysis also revealed that the C-terminal extension regions in RPA1C sequences are glycine-and serine-rich (Supplementary Figures S6A,B). Glycine-rich regions are found in RNA-Binding Proteins (RBPs) and they may be enriched by additional polar (hydrophilic) residues, such as serine, arginine, asparagine, glutamine, and tyrosine. However, the function of these additional polar residues is poorly understood (Rogelj et al., 2011). Glycine-rich domains interspersed by CCHC-type ZFMs are found in RNA-binding plant proteins involved in posttranscriptional regulation of gene expression under various stress conditions (Karlson et al., 2002;Karlson and Imai, 2003;Kim et al., 2005Kim et al., , 2007Kim and Kang, 2006). Therefore, it is possible that RPA1C may also bind to RNA and play a role in posttranscriptional regulation. Furthermore, CCHC-type ZFMs are predicted to bind to both normal GT-rich ssDNA (Rajavashisth et al., 1989;Tzfati et al., 1995;Kim et al., 2005) and damaged ssDNA [e.g., Arabidopsis DDB2 (Ly et al., 2013), yeast RAD18 (Jones et al., 1988), and human PARP-1 (Langelier et al., 2011)]. This suggests that besides binding to RNA and ssDNA and thereby playing regulatory role, the CCHC-type zinc-finger motif may also play a role in DNA damage recognition, as suggested from genetic analysis (Aklilu et al., 2014).
In summary, the protein structure analysis result is consistent with genetic and phylogenetic data that plant RPA1 proteins fall into three distinct groups with unique functions (RPA1A, RPA1B, and RPA1C). Based on these data we propose that duplication of RPA1 in unicellular green algae led to two main progenitor groups in primitive plants, and later diverged into three groups in higher plants with specialized functions (Figure 5).

The RPA1B Group has Higher Levels of Gene Expression, Intron Frequency, and Optimal Codon Usage
Depending on their functional specificity and biochemical role, genes in general and duplicated genes in particular can have different temporal, spatial, and levels of gene expression (Chiapello et al., 1998;Casneuf et al., 2006;Hyun et al., 2014). As discussed above, genetic analysis of rpa1 mutants suggests specialized functions for individual RPA1 subunits. Based on this we predict unique gene expression patterns of RPA1 genes that are consistent with their proposed function. For example, we might expect to see higher basal expression of individual RPA1 members if they participate in "housekeeping" type functions, such as normal DNA replication activity. To this end, we examined expression patterns and levels of the Arabidopsis, soybean, rice, and P. patens RPA1 paralogs (Figure 6) from Genevestigator, an online gene expression visualization tool that summarizes results from thousands of high quality transcriptomic experiments, typically employing cDNA microarrays (Hruz et al., 2008). In general the RPA1B group displays higher basal expression levels in most developmental stages in comparison with the RPA1A and RPA1C groups of all four included plant species, in both shoots and roots (Figures 6A,B). In contrast, members of the RPA1C group and to a lesser extent the RPA1A group are induced by DNA damage (Culligan et al., 2006). The higher overall level of gene expression of the RPA1B group suggests a requirement throughout the developmental growth of the plant, and is consistent with the proposed function of RPA1B in normal DNA replication functions (Aklilu et al., 2014).
In A. thaliana, RPA1B and RPA1D display ∼10-fold more introns vs. RPA1A, RPA1C, and RPA1E (Supplementary Table  S1). To determine if this paradigm is consistent throughout plants, we established intron frequencies in all three major groups of RPA1 paralogs (A, B, and C) as predicted by our phylogenetic analysis above, employing NCBI sequence databases. As shown in Figure 7, RPA1B-like genes from 20 plant species contain ∼six-fold more introns on average than either group RPA1A, or group RPA1C. However, in unicellular green algae, there is no clear pattern of intron frequency between the B-like group and A/C-like group (Supplementary Table S1). In many eukaryotes, intron frequency is generally proportional to gene expression levels (Callis et al., 1987;Brinster et al., 1988;Duncker et al., 1997;Juneau et al., 2006;Shabalina et al., 2010), and mRNA stability (Le Hir et al., 2003;Niu and Yang, 2011). For example, highly expressed genes in both Arabidopsis and rice show increased intron frequency (number of introns per kilobase of coding sequence) vs. lower expressed genes (Ren et al., 2006). Therefore, these data are consistent with higher gene expression levels of RPA1B group members vs. RPA1A or RPA1C members.
Codon preference (optimal codons) is determined by tRNA abundance and gene expression level and occurs with unequal frequencies (Ikemura, 1985;Duret and Mouchiroud, 1999). Highly expressed genes in both prokaryotes and eukaryotes have increased frequency of codons that match abundant tRNAs (Ikemura, 1985;Bulmer, 1988;Wright et al., 2004). Codon bias may reflect a selective pressure to enhance translational efficiency for highly expressed genes (Bulmer, 1988;Marais and Duret, 2001;Wright et al., 2004;Sablok et al., 2013). Since the RPA1B group displays higher overall gene expression levels vs. RPA1A and RPA1C, we further hypothesized that RPA1 genes may reflect this through a bias in optimal codon usage. In order to test this we calculated the frequency of optimal codons (F OP ) for Arabidopsis and rice RPA1 genes, since these species have the most complete data set of developmental stages from Genevestigator (Hruz et al., 2008) and optimal codons for highly expressed genes have been identified in these organisms (Liu et al., 2004;Wright et al., 2004).  Hruz et al., 2008) and Arabidopsis eFP Browser (B; Winter et al., 2007). Error bars indicate standard error. F OP is calculated as the number of occurrences of optimal codons divided by the total number of codons (Ikemura, 1985). As shown in Figure 8, we find that the RPA1B group has the highest number of optimal codons (F OP = 0.54) followed by RPA1A (F OP = 0.45) and RPA1C (F OP = 0.41). F OP -values for the Arabidopsis RPA1D (0.45) and RPA1E (0.42) are not included in the F-test as they are only found in Brassicacea. This suggests that RPA1B group is likely under the control of translational selection due to a demand for its higher translational protein product during DNA replication.
RPA1A and RPA1B are More Conserved than RPA1C Highly expressed genes are usually under a high degree of selective constraint and thus display a higher degree of conservation (Pál et al., 2001;Drummond et al., 2005). As described above, plant RPA1 groups have unique expression patterns (Figure 6). Accordingly, we hypothesize unique sequence conservation of RPA1 subunits that is consistent with their expression level. To test this, we analyzed and compared the type and degree of natural selection applied on  Table S1) are included in the analysis. Data are mean ± SE. To analyze statistical difference F-test (ANOVA) and LSD were carried out at P ≤ 0.05. Bars with different letters indicate significant differences. each RPA1 group. Natural selection is measured by the ratio of non-synonymous (dN) and synonymous (dS) nucleotide substitution rates (ω or dN/dS) and its value ranges from zero (0) to infinity (∞). While a value of <1 is a sign of purifying selection (where sequence conservation is preferred), a value of >1 is a sign of positive selection (where change in sequence is preferred). We carried out the analysis by dividing plants into two groups. (1) The Brassicacea group (A. thaliana and A. lyrata), RPA1 from this group is analyzed separately because it has five paralogs, and (2) Other dicot group that contains only three RPA1 paralogs (Tomato, Cucumber, Strawberry, Castor oil plant, Grape, Cacao, Peach, and California Poplar). Plants that have only two RPA1 paralogs or more than three, as a result of lineage specific duplication, were not included in the analysis since these cases would affect sequence conservation and selection, and comparison of orthologs would lead to unreliable comparisons. dN and dS analyses employed RPA1 coding sequence comparisons to a common "outgroup" orthologous RPA1 (pair-wise comparison). To this end, RPA1 of C. rubella and rice were used as an "outgroup" to the first and second group, respectively.
All RPA1 are under purifying selection, as they all have ωvalues < 1 (Figure 9). However, the degree of selection pressure or conservation is stronger on RPA1B followed by RPA1A and RPA1C. This can be due to the difference in gene expression level and consistent with the high transcript accumulation of RPA1B group vs. the RPA1A/C group (Figure 6). However, gene FIGURE 8 | Frequency of optimal codon (F OP ) values for RPA1 genes of Arabidopsis and rice. Data are mean ± SE. To analyze statistical difference F-test (ANOVA) and LSD were carried out at P ≤ 0.05. Bars with different letters indicate significant differences. expression alone does not explain the sequence conservation difference seen among the groups as, for example, there is no clear expression difference between the RPA1A group and RPA1C group (Figure 6). In this case, functional differences may play a role. While RPA1A is primarily responsible for the highly conserved meiotic DNA recombination process, RPA1C likely functions in many types of DNA repair pathways and interacts with various proteins found in each pathway. Proteins involved in many biochemical pathways that employ various interaction partners tend to be more conserved (Krylov et al., 2003). However, since some repair pathways and associated proteins are relatively less conserved across plants (Singh et al., 2010), lineage specific co-evolution of RPA1C may result in less conservation of the protein across species.

Arabidopsis RPA1 Paralogs Contain Unique cis-Acting Element Composition Associated with Biological Function
As discussed above, RPA1 members display unique gene expression patterns consistent with their proposed function in meiosis, DNA replication and repair. Accordingly, we hypothesize that respective RPA1 paralog members should display cis-acting elements that regulate their spatial, temporal and induced expression pattern consistent with proposed activities. To identify cis-acting regulatory elements, sequences were analyzed using PLACE (Plant cis-acting regulatory DNA elements) database (Higo et al., 1999). For this analysis we used two sets of promoter sequences. The first set includes predicted promoter sequences upstream of transcriptional start site for each paralogs retrieved from TAIR (The Arabidopsis Information Resource) database. Since the predicted promoter sequence of each paralog is different in length, we also analyzed a second set with an equal length (1794 bp) of promoter sequence for each paralog (Table 1). This is based on the predicted promoter sequence length of RPA1A that contains the largest promoter FIGURE 9 | Natural selection (ω) values for RPA1 genes. (A) ω-values for Arabidopsis and A. lyrata RPA1 genes. C. rubella RPA1 genes were used as a reference for ortholog pairwise sequence distance measurement (dN and dS). (B) ω-values for RPA1 genes of eight plants (Tomato, Cucumber, Strawberry, Castor oil plant, Grape, Cacao, Peach, and California Poplar). Rice RPA1 genes were used as a reference for ortholog pairwise sequence distance measurement (dN and dS). dN and dS analyses were conducted in MEGA5 using the Nei-Gojobori model. Data are mean ± SE. To analyze statistical difference F-test (ANOVA) and LSD were carried out at P ≤ 0.05. Bars with different letters indicate significant differences. region of the group. A complete list of cis-acting elements along with their description is presented in Supplementary Table S5.
As is summarized in Table 1, RPA1A promoter sequences are enriched in cis-elements related to reproductive phase transition, flower, and senescence vs. other RPA1B and RPA1C group members. In addition, these sequences appear enriched in pollen-related cis-acting elements vs. other groups, but only when limited to predicted promoter sequences ( Table 1). These data are consistent with a proposed leading role for RPA1A during meiosis, and coincides with RPA1A expression up-regulation immediately after reproductive phase transition, and during flowering and pollen formation stage ( Figure 6A). Interestingly, RPA1A promoter enrichment of cis-acting elements related to senescence is well correlated to its late-developmental stage (senescence) transcriptional up-regulation (Figure 6A), suggesting a potential role for RPA1A in regulation of senescence, but this would need to be tested biologically. Cis-elements related to DNA synthesis and cell cycle regulation were only identified in the promoters of RPA1B and RPA1D, consistent with their proposed primary role in DNA replication (Aklilu et al., 2014). Interestingly, RPA1D appears to be uniquely enriched in cis-elements related to biotic stress responses.
Lastly, both RPA1C and RPA1E display enrichment of ciselements related to seed development and germination and abiotic stress (Table 1). Obviously, abiotic stress encompasses a wide spectrum of insults to tissues (Waterworth et al., 2011;Roy, 2014), but includes agents that damage DNA, such as UV light or ionizing radiation for example. RPA1C and RPA1E are predicted to play a leading role in DNA repair activities (Aklilu et al., 2014) and are induced by ionizing radiation (Culligan et al., 2006), suggesting perhaps many of these abiotic stress-related elements are directly involved in the transcriptional response to DNA damage. Interestingly, DNA repair pathways are known to regulate seed quality and longevity by repairing DNA damage accumulated during storage and imbibition (Cheah and Osborne, 1978;Dandoy et al., 1987;Waterworth et al., 2010Waterworth et al., , 2015Chen et al., 2012;Bueso et al., 2014). The observed enrichment of seed related cis-elements in the promoters of both RPA1C and RPA1E raises the question of whether RPA could play a role in seed quality and longevity.

Summary
We describe in this study the evolution of plant RPA1 family and the unique sequence variations that can be used to categorize RPA1 sequences into three general groups in plants, RPA1A, RPA1B, and RPA1C. Our data suggest that these three groups of plant RPA sequences evolved from two groups of green algal RPA1 progenitor sequences, RPA1B-like and RPA1A/C-like sequences, by losing and gaining unique motifs, domains, and subdomains ( Figure 5).
The unique sequence variations that exist within each group of plant RPA1 can be employed to predict biochemical function of RPA1 genes from newly sequenced genomes. These predictions will be valuable in future biochemical characterization studies of RPA complexes, which are ultimately necessary to accurately determine RPA functions in plants. Since the expansion of RPA1B and RPA1C into RPA1D and RPA1E respectively, appears to have occurred early and specifically in the Brassicacea family, these sequences could also prove useful in determining phylogenetic relationships within this family.

AUTHOR CONTRIBUTIONS
BA designed and carried out the experiments and wrote the paper. KC designed experiments and wrote the paper.

ACKNOWLEDGMENTS
We thank Dr. Vaughn S. Cooper for helpful comments and critical reading during the preparation of this manuscript.