Selection Signatures in the First Exon of Paralogous Receptor Kinase Genes from the Sym2 Region of the Pisum sativum L. Genome

During the initial step of the symbiosis between legumes (Fabaceae) and nitrogen-fixing bacteria (rhizobia), the bacterial signal molecule known as the Nod factor (nodulation factor) is recognized by plant LysM motif-containing receptor-like kinases (LysM-RLKs). The fifth chromosome of barrel medic (Medicago truncatula Gaertn.) contains a cluster of paralogous LysM-RLK genes, one of which is known to participate in symbiosis. In the syntenic region of the pea (Pisum sativum L.) genome, three genes have been identified: PsK1 and PsSym37, two symbiosis-related LysM-RLK genes with known sequences, and the unsequenced PsSym2 gene which presumably encodes a LysM-RLK and is associated with increased selectivity to certain Nod factors. In this work, we identified a new gene encoding a LysM-RLK, designated as PsLykX, within the Sym2 genomic region. We sequenced the first exons (corresponding to the protein receptor domain) of PsSym37, PsK1, and PsLykX from a large set of pea genotypes of diverse origin. The nucleotide diversity of these fragments was estimated and groups of haplotypes for each gene were revealed. Footprints of selection pressure were detected via comparative analyses of SNP distribution across the first exons of these genes and their homologs MtLYK2, MtLYK3, and MtLYK4 from M. truncatula retrieved from the Medicago Hapmap project. Despite the remarkable similarity among all the studied genes, they exhibited contrasting selection signatures, possibly pointing to diversification of their functions. Signatures of balancing selection were found in LysM1-encoding parts of PsSym37 and PsK1, suggesting that the diversity of these parts may be important for pea LysM-RLKs. The first exons of PsSym37 and PsK1 displayed signatures of purifying selection, as well as MtLYK2 of M. truncatula. Evidence of positive selection affecting primarily LysM domains was found in all three investigated M. truncatula genes, as well as in the pea gene PsLykX. The data suggested that PsLykX is a promising candidate for PsSym2, which has remained elusive for more than 30 years.


INTRODUCTION
One of the hallmark features of legumes (Fabaceae) is their ability to form beneficial symbioses with nitrogen-fixing soil bacteria collectively known as rhizobia. The establishment of such symbioses is a very complex process that involves many genes, as it requires proper recognition of the symbiotic partners, with subsequent development of novel symbiotic organs and partial integration of metabolic pathways in organisms belonging to different domains (Oldroyd et al., 2011;Suzaki et al., 2015;Zipfel and Oldroyd, 2017). The main function of these symbioses is the fixation of atmospheric nitrogen, making it a significant evolutionary advantage for legumes (Wheatley and Sprent, 2010;Werner et al., 2015).
Many genes encoding LysM-RLKs have been described in model legumes, such as, barrel medic (Medicago truncatula Gaertn; Arrighi et al., 2006), and Lotus japonicus (Regel) K. Larsen (Lohmann et al., 2010). Some are known to participate in symbiosis, while others have non-symbiotic functions. Interestingly, each legume species seems to have its own nuances in the organization and function of symbiotic LysM-RLKs; for example, mutations in the orthologous LysM-RLK genes NFR1 (NOD FACTOR RECEPTOR 1) from L. japonicus and MtLYK3 (LysM DOMAIN-CONTAINING RECEPTOR-LIKE KINASE 3) from M. truncatula manifested differently, blocking the penetration of rhizobia into plant cells at distinct stages of symbiotic development (Smit et al., 2007).
Garden pea (Pisum sativum L.) is a valuable pulse crop (http:// www.fao.org/faostat). Studies on its symbiosis with rhizobia are very important (Borisov et al., 2004(Borisov et al., , 2007Smýkal et al., 2012;Zhukov et al., 2016), since effective symbioses can increase the yield and quality of pea crops. Because of the long history of pea domestication and its wide distribution (Jing et al., 2010;Smýkal et al., 2012), many research centers around the world have large collections of pea genetic resources, making it possible to study the natural variability of particular pea genes and corresponding traits using sample sets close to genetic saturation (Jing et al., 2010). The N. I. Vavilov Institute of Plant Genetic Resources (VIR) in St. Petersburg, Russia, maintains a uniquely valuable collection of crop germplasm that includes pea and other legumes (Plekhanova et al., 2017). Studies on large ecotype collections and the use of association genetics may help uncover valuable alleles related to responses to environmental stresses (Gentzbittel et al., 2015). Recent genome-wide association (GWA) studies of the chickpea (Cicer arietinum L.) germplasm collection of the VIR revealed potential candidate genes likely to affect traits of agricultural importance (Plekhanova et al., 2017). Unfortunately, many molecular genetic methods are unsuitable for analyses of the pea genome because it is very large, congested with repetitive elements (Macas et al., 2007), without even a draft assembly. However, the high degree of synteny between the genomes of pea and M. truncatula, a well-studied model legume, makes it possible to investigate pea genes using the M. truncatula genome as a reference (Young and Udvardi, 2009).
The first discovered symbiosis-related pea gene was PsSym2, which determines increased selectivity toward rhizobia in the pea cv. Afghanistan (Govorov, 1928;Razumovskaya, 1937;Lie, 1984). Although the phenotype conferred by the allelic state of PsSym2 is clear, the molecular basis of its function is not. It is thought to encode a LysM-RLK, because the selectivity inherent to cv. Afghanistan is associated with the ability to distinguish among the structural features of Nod factors produced by different (effective, ineffective, or even pathogen-like) rhizobial strains. Previously, PsSym2 was localized in the linkage group I (LG I) of the pea genome (Kozik et al., 1996). Later, two paralogous genes encoding LysM-RLKs (PsSym37 and PsK1) were found to co-localize there (Zhukov et al., 2008). However, neither of them was shown to play the same role as PsSym2, although both were shown to participate in bacterial signal recognition (Zhukov et al., 2008). In M. truncatula, the genomic region corresponding to the pea PsSym2 location (LYK region) contains at least seven highly similar RLK genes; MtLYK1-MtLYK7 (Limpens et al., 2003). One of these genes, MtLYK3, was identified as a symbiotic gene that controls bacterial penetration into the infection thread (Limpens et al., 2003;Smit et al., 2007). Considering these facts, one could expect to find more genes encoding RLKs in pea LG I, including genes related to symbiosis.
The early domestication and subsequent global dissemination of pea has led to significant and multifaceted selection pressure, resulting from both natural and breeding selection (Smýkal et al., 2012). Several loci associated with agronomically important traits have been found to play crucial roles in the domestication process (Weeden, 2007). However, the loci responsible for symbiosis, such as, bacterial recognition genes, are unlikely to be directly affected by breeding, although their influence on plant productivity should result in some selective pressure. Thus, the aim of this work was to estimate the genetic diversity of the genes encoding LysM-RLKs in the Sym2 region of the pea genome and corresponding genes from the LYK region in the M. truncatula genome, and to evaluate and compare the selection pressures affecting different parts of the sequences. To this end, we (i) examined the region containing PsSym37 and PsK1 in pea LG I to identify new LysM-RLK genes; (ii) studied the polymorphism of the first exon sequences (corresponding to the receptor domain of the encoded protein) of three LysM-RLK genes (two known genes and one newly identified gene) in a large set of pea genotypes; and (iii) compared the selection signatures in pea genes with those in their M. truncatula homologs. As a result, we found a new gene, which we named PsLykX (for P. sativum LysM kinase eXclusive), and detected signatures of positive, balancing, and negative selection in the gene sequences. The polymorphism levels and patterns of selective signatures were unique to each of the three pea genes, but remarkably similar for the Medicago genes MtLYK3 and MtLYK4. The results of the polymorphism analyses indicate that PsLykX is a promising candidate for PsSym2, the determinant of the unique symbiotic phenotype of cv. Afghanistan.

Screening of the Pea BAC Library
To find novel LysM-RLK genes in the Sym2 region, the pea Psa-B-Cam BAC library screening was performed at the INRA-CNRGV Plant Genomic Center (Toulouse, France). The screening was carried out via quantitative PCR, using BAC DNA pools amplified with Phi29 DNA polymerase and primer pairs for PsSym37 and PsK1 genes (Table S1). Information about the BAC library can be found online: (http://cnrgv.toulouse.inra.fr/).
The whole insertion in the identified BAC Psa-B-Cam 446P4 was sequenced using the 454 GS-Junior system with the Titanium kit in CNRGV, yielding 11697 reads with a mean length of 500 bp. Reads were then quality-trimmed using AdapterRemoval software (v. 2.1.7) (Lindgreen, 2012) and mapped onto the Escherichia coli DH5α strain genome using hisat2 mapper (Kim et al., 2015) to eliminate bacterial contamination. Then, 10,927 non-aligned reads were assembled using SPAdes (v 3.9.1) assembler (Bankevich et al., 2012), resulting in three contigs, which were uploaded to the NCBI database as a single entry under the accession number MF185734. Three genes (PsSym37, PsK1, and the novel gene PsLykX) were found via BLASTn searches (e-value cutoff 10 −6 ) using the sequence of PsK1 as the query (Figure 1). Additional BLAST searches against the NCBI non-redundant nucleotide database revealed fragmented transposable elements (Figure 1), but no additional genes.
The 3 ′ -and 5 ′ -ends of PsLykX cDNA were obtained by rapid amplification of cDNA ends (RACE) using total cDNA from inoculated roots of cv. Cameor. The Mint kit was used for cDNA synthesis and Mint kit primers were used for amplification of cDNA ends (Evrogen, Moscow, Russia). To obtain RNA for cDNA synthesis, five cv. Cameor plants were inoculated with Rhizobium leguminosarum bv. viciae strain RCAM 1026 and then grown in sterile sand for 10 days (Afonin et al., 2017). The specific primers used for RACE PCR are listed in Table S1.

Plant Material
The pea genotypes were previously selected by Borisov et al. (2002) from the collection of cultivated peas of the VIR (Saint Petersburg, Russia). The criterion for selection was diversity of places of origin. Preference was given to wild genotypes or primitive varieties, the so-called landraces. This set of 99 genotypes has already been tested for the efficiency of symbiosis with arbuscular-mycorrhizal (AM) fungi under the conditions of rhizobia inoculation (Jacobi et al., 2000;Borisov et al., 2002), and also for their responses to the toxic heavy metal cadmium under symbiotic conditions (Belimov et al., 2015). The originally obtained samples were propagated by selfing for several generations at the All-Russia Research Institute for Agricultural Microbiology (ARRIAM, Saint Petersburg, Russia). The genotypes and their places of origin (when known) are shown in Table S2.

Plant DNA Extraction
DNA was extracted from 3-to 4-day old seedlings of each pea genotype. The seedlings were cultivated in Petri dishes filled with sterile vermiculite at 28 • C. Three to five seedlings per genotype were bulked and used for DNA extraction using the slightly modified CTAB protocol described elsewhere (Rogers and Bendich, 1985). Briefly, the steps were as follows: seedlings, 3-4 glass beads, and 700 µl 2 × CTAB buffer were placed into flat-bottomed microcentrifuge tubes and ground with the Fastprep-24 homogenizer (MP Biomedicals, Irvine, CA USA; 2 × 60 s at maximum frequency), then incubated in water bath for 2 h at 65 • C. Then, 700 µl chloroform/isoamyl alcohol (24:1 v:v) was added to each tube, the mixture was vortexed, and then centrifuged for 10 min at 16,873 g. The upper fraction was transferred to a fresh microcentrifuge tube, and the previous step was repeated with 300-500 µl chloroform instead of chloroform/isoamyl alcohol. The upper fraction was again transferred to a fresh tube, and 1,000 µl 96% ethanol and 20 µl 5 M NaCl were added. The tubes were gently inverted two to three times to mix the contents, and then the mixture was centrifuged (10 min, 16,873 g). The supernatant was discarded, and 500 µl 70% ethanol was added to the pellet. The samples were vortexed, incubated at room temperature for 30 min, and then centrifuged (10 min, 16,873 g). The ethanol was discarded, and the tubes were incubated at 37 • C for 20-30 min. Finally, the samples were dissolved in 20-30 µl TE buffer (pH 8.0). In total, DNA samples from 93 separate genotypes were used in further analyses.

Data Collection from M. truncatula Hapmap Project
The Medicago Hapmap project, a long-term, communityaccessible GWA mapping resource, is based on re-sequencing of 384 inbred lines spanning the range of Medicago diversity using Illumina next-generation sequencing technology (Stanton-Geddes et al., 2013). To study the polymorphisms of genes encoding LysM-RLKs, sequences of the first exons of MtLYK2 (Medtr5g086310), MtLYK3 (Medtr5g086130), and MtLYK4 (Medtr5g086120) were retrieved from the Medicago Hapmap project website, assembly Mt4.0 (http://www.medicagohapmap. org/). Single nucleotide variants at particular sites were considered different from the reference M. truncatula genome if they were detected in more than 50% of corresponding Illumina reads. In total, 116 sequences for MtLYK2, 220 for MtLYK3, and 196 for MtLYK4 were included in subsequent analyses.

Pairwise Comparison of LysM-RLKs
Pairwise comparison of all LysM receptor kinases was performed using the Needleman-Wunsch global alignment algorithm from the EMBOSS suite v.6.3.1 (Rice et al., 2000), using standard parameters. Comparisons of the three pea and three M. truncatula LysM-RLKs were performed separately for whole sequences, the kinase part, and the receptor part. The results from each comparison were combined using a custom python script. The results for full genes are shown in Table 1, and the results for the separate regions are shown in Tables S3, S4.
Pairwise comparison of the known pea Sym2 region and the M. truncatula LYK region was performed using BLAST and visualized using ACT software (release 13.0.0; Carver et al., 2005). The sequence of the Sym2 region was obtained from the BAC Psa-B-Cam 446P4 clone (see Results). The M. truncatula LYK region sequence (g5:37100000-g5:537450000) was retrieved from the Phytozome website (www.phytozome. net/). The minimal identity cut-off was set to 70%. The exon/intron structure of genes in these regions was modeled by the Exonerate package (http://www.ebi.ac.uk/about/vertebrategenomics/software/exonerate) with the -est2genome option enabled to search for gene mRNAs in the respective genomes.

Polymorphism Analysis and Detection of Selection Pressures
Neighbor-joining (NJ) trees were constructed used MEGA v.6.60 software (Tamura et al., 2013) with the assumption that substitutions followed the Jukes-Cantor model and had uniform rates among sites. Bootstrap tests of NJ trees were performed with 500 bootstrap replications. The branches were cut off at 70% bootstrap support. Trees were visualized using FigTree (http:// tree.bio.ed.ac.uk/software/figtree/).
The codon-based Z-test considering the average rate of synonymous (dS) and nonsynonymous (dN) substitutions per site was performed in MEGA v.6.60 using the modified Nei-Gojobori method (Nei and Gojobori, 1986), with Jukes-Cantor correction (Jukes and Cantor, 1969) for multiple substitutions. Standard errors were estimated from 1,000 bootstrap replicates.
In all cases, separate estimates were made for the whole sequences of the coding part of the first exons and for sequences corresponding to LysM modules and intermediate parts. The borders of LysM modules were drawn as described previously (Zhukov et al., 2008).

Inoculation Experiment
An inoculation experiment was performed to determine the symbiotic phenotype of the pea line K-6883 (84). Seeds of K-6883 and cv. Cameor used as the inoculum control were planted in 2-L pots with sterile sand and inoculated with either strain RCAM1026 or A1 (Chetkova and Tikhonovich, 1986) of R. leguminosarum bv. viciae. Each pot contained five seeds, and the The "heat map" coloring indicates the degree of identity and similarity. experiment was carried out with two technical replicates. After 28 days, the plants were removed from the pots and the phenotype of the root system was examined. The average number of nodules was calculated in SigmaPlot 12.0 (Systat Software, Inc., San Jose, CA, USA).

Discovery of a New LysM-RLK Gene in the Sym2 Region Using Pea BAC Library
To detect novel genes encoding LysM-RLKs in the Sym2 region of pea LG I, the pea Psa-B-Cam BAC library was screened at the INRA-CNRGV Plant Genomic Center using quantitative PCR with primer pairs for PsSym37 and PsK1 (see section Materials and Methods). In the BAC clone Psa-B-Cam 446P4, we identified the two previously known LysM-RLK genes PsSym37 and PsK1 (used as probes), and a new LysM-RLK gene that we named PsLykX (for LysM kinase eXclusive; see Figure 1). No other BAC clones were detected during further screening with a part of PsLykX as the probe. This may indicate a lack of saturation of the Psa-B-Cam library in this genomic region.
PsLykX was located close to PsK1; the stop codon of PsK1 and the start codon of PsLykX were separated by 531 bp (Figure 1, also see MF185734 at GenBank). Despite this, PsLykX appeared to be a fully functional gene, because sequences perfectly matching its cDNA were found in pea transcriptome assemblies generated from nitrogen-fixing nodules (Alves-Carvalho et al., 2015;Sudheesh et al., 2015;Zhukov et al., 2015). The PsLykX complete open reading frame was successfully amplified from cDNA generated from RNA extracted from pea roots inoculated with nodule bacteria, and the sequences of PsLykX cDNA ends were verified using 3 ′ -and 5 ′ -RACE. Alignment of the genomic sequence and the cDNA of PsLykX revealed a gene structure similar to that of PsSym37, with 12 exons and 11 introns (Zhukov et al., 2008), which is typical of symbiotic LysM-RLK genes (Figure 2A). The sequence of PsLykX has been deposited in GenBank under the accession number MF135533.
The putative PsLykX protein contained three LysM modules encoded by the first exon and forming the receptor domain, along with transmembrane and kinase domains ( Figure 2B). Multiple alignment between PsLykX and known LYKs from the syntenic LYK region of the M. truncatula genome and LysM-RLKs from other organisms revealed that PsLykX was closest to MtLYK4, MtLYK1 and MtLYK5 (Figure 3). In the neighbor-joining phylogenetic tree, PsLykX and MtLYK4 were grouped apart from PsSym37 and PsK1 which formed a distinctive clade with

Polymorphism of First Exon Sequences of LysM-RLK Genes in Sym2 Region
To assess the level of polymorphism in parts of pea LysM-RLK genes encoding ligand binding structures, we sequenced the first exons (corresponding to the receptor domain) of the three identified LysM-RLK genes from the Sym2 region in a subset of cultivated pea accessions (Table S2). To fully characterize the extent of polymorphism, previously obtained sequences of PsSym37 and PsK1 (Zhukov et al., 2008) available from online databases were included in the analysis. Overall, 101 sequences were analyzed for PsK1, 90 sequences for PsLykX, and 89 sequences for PsSym37 (Table S2).
The nucleotide sequences of the first exons were aligned using the codon-based ClustalW algorithm, and phylogenetic trees based on LysM-RLKs variability were constructed (Figures 4-6 for PsSym37, PsK1, and PsLykX, respectively). The clades of the resulting trees were considered to represent distinct haplotypes. Comparison of amino acid sequences confirmed the haplotype distribution pattern ( Table 2). PsSym37 and PsK1 sequences formed two main groups, each with several subgroups (Figures 4, 5), while PsLykX sequences appeared to be more diverse and formed six distinct groups (Figure 6).
The amino acid alignment analyses showed that PsLykX was the most variable among the three genes, with 16 amino acid polymorphisms in the putative protein sequence. According to these, the whole sample was divided into 13 groups, seven of which were represented by single unique genotypes with additional polymorphic sites. PsK1 had 17 amino acid polymorphisms in the putative protein sequence, dividing sample into 12 groups (with six single-genotype groups). Interestingly, although group B of PsK1 split into two subgroups, B1 and B2, according to nucleotide-based phylogeny, the haplotypes in these subgroups were identical at the amino acid level. PsSym37 appeared to be more uniform, with only six groups based on 10 amino acid polymorphisms (with one group being represented by a sole genotype).
In all three genes, the most polymorphic region was the sequence corresponding to the first LysM domain (LysM1); it contained the majority of non-synonymous substitutions ( Table 2). In PsK1 and PsLykX, several non-synonymous variations were detected in the part corresponding to the signal peptide (SP) as well, although these changes were unlikely to affect protein localization. Further research is required to explore the possible functional significance of these changes.
We identified PsLykX as a new gene from the Sym2 region encoding an LysM-RLK. This gene is a promising candidate for the PsSym2 gene, which is responsible for increased selectivity toward rhizobia in several genotypes of pea originating from Afghanistan and neighboring areas (Lie, 1984;Kozik et al., 1995). Plants carrying the so-called "Afghan" allele of PsSym2 (Sym2 A ) form nodules only with strains producing a double-acetylated Nod factor, while plants with the most common "European" allele (Sym2 E ) are able to perceive the mono-acetylated Nod factor as well (Davis et al., 1988). Bearing this in mind, we sequenced the first exon of PsLykX of cv. Afghanistan (NGB2150), and found that it contained a unique haplotype of PsLykX, which was shared with only one other line, K-6883 (84). We investigated the symbiotic phenotype of K-6883 in an experiment with two strains of R. leguminosarum bv. viciae: RCAM1026 (Afonin et al., 2017), which produces a mono-acetylated Nod factor, and A1, which produces a double-acetylated Nod factor (Chetkova and Tikhonovich, 1986;Ovtsyna et al., 1999). K-6883 failed to form nodules with RCAM1026 but formed 105.0 ± 22.3 nodules with A1, while cv. Cameor (control belonging to another PsLykX haplotype) formed 151.6 ± 14.1 nodules with RCAM1026 and 105.2 ± 11.7 nodules with A1. Thus, K-6883 showed "Afghan" selectivity toward rhizobia, providing further evidence that PsLykX may actually be PsSym2.

Identification of Selection Signatures
Investigation into the Correspondence between Pea and M. truncatula LysM-RLK Genes Polymorphism analyses can provide insights into the evolutionary pathways leading to the contemporary state of genes in related species. The fact that pea and barrel medic LysM-RLK genes are paralogs located close to each other made it difficult to estimate selection signatures in their sequences. The Sym2 region of pea LG I is syntenic to the M. truncatula LYK region on chromosome 5 (Gualtieri et al., 2002), where seven LysM-RLK genes are located (Limpens et al., 2003), while only three genes are known to be present in the Sym2 region of the pea genome. Thus, to compare the diversity of the pea LysM-RLK genes with that of the M. truncatula ones, we first determined the relationship between PsSym37, PsK1, and PsLykX and the M. truncatula LYK genes. To this end, we compared the whole sequences of LysM-RLK putative proteins (three from pea, seven from Medicago), as well as their receptor and kinase domains separately (Table 1, Tables  S3, S4). We included LjNFR1 as an ortholog of PsSym37 and MtLYK3 (Zhukov et al., 2008), and AtCERK1 as non-symbiotic LysM-RLK of a distant species in these comparisons. The identity (strict matching of amino acids in corresponding positions) and similarity (functional analogy of different amino acids in corresponding positions) percentages were taken into account.
In addition, the whole BAC clone Psa-B-Cam 446P4 insertion was compared to the LYK region of the M. truncatula genome ver. 4.0 (chr5:37,100,000-37,450,000) using BLAST. The beginning of PsSym37 showed very strong similarity to MtLYK3, even in the upstream region containing the putative promoter ( Figure  S1A). Exon 2, exon 3, and exons 6-8 of PsSym37 were similar to corresponding exons in both MtLYK2 and MtLYK3, while exon 5 and exon 11 were similar only to exon 5 and exon 11 of MtLYK3, and exon 4 was similar only to exon 4 of MtLYK2.
This, together with phenotypic data obtained for mutants (Smit et al., 2007), corroborates the idea that PsSym37 is the most likely ortholog of MtLYK3. For the remaining two pea genes, the similarity patterns were more complicated. In PsK1, exons 6-8 and exon 10 were similar to the corresponding exons in MtLYK3 and MtLYK2, while exon 9 was similar only to exon 9 of MtLYK2 ( Figure S1B). Exons 6-9, 10, and 11 of PsLykX were similar to the uncharacterized transcribed sequence (Mtr.51442.1.S1_at at MtGEA v. 3.0; https://mtgea.noble.org/v3/ He et al., 2009) in the LYK region of M. truncatula (see Figure S1C), annotated in Mt4.0 as Medtr5g086080. However, exons 6-8, exon 10, and exon 11 were also similar to respective exons in MtLYK4. It is important to note that the first exon showed significant uniformity (at least 80%) both among the genes in one species (pea or M. truncatula) and between genes in different species (Table S3).
Our observations confirmed the previous finding that PsSym37 is the most likely ortholog of MtLYK3 (Zhukov et al., 2008). PsLykX was more similar to MtLYK4 than to other LYK genes, whilst PsK1 showed high similarity to both MtLYK3 and MtLYK2 (and also to PsSym37) at the putative protein level ( Table 1). Taking into consideration the mosaic nature of genes in both the LYK and Sym2 regions, the complex evolutionary history of these genes, and the possibility of concerted evolution of paralogous genes, we are cautious to postulate orthology between pea and M. truncatula LysM-RLK genes, aside from PsSym37 and MtLYK3, which have conserved their basic function of Nod factor recognition and downstream signal transduction. Based on this, and since the first exons of all analyzed genes were >80% identical, we chose outgroups for polymorphism analyses as follows: PsSym37 for all barrel medic genes, and MtLYK3 for all pea genes.
The receptor parts of LysM-RLK contain three functionally important modules (LysM domains). Therefore, we assessed overall polymorphism in the first exon and polymorphism of separate parts of the first exon encoding LysM domains, as well as intermediate parts.

Codon-Based Tests for Selection Signatures
The ratio between synonymous and non-synonymous substitutions is often used to detect negative, or purifying, selection (i.e., selection against non-synonymous changes). The codon-based Z-test relies on a statistically significant difference between the rate of synonymous substitutions per synonymous site (dS) and the rate of non-synonymous substitutions per non-synonymous site (dN). We applied this test to the full sequences and to separate LysM module-encoding parts of the first exon of three pea and three M. truncatula genes. The results are shown in Table 3.
The overall sequences of the first exons of PsK1 and PsSym37 showed a clear departure from neutrality in favor of purifying selection, while the first exon of PsLykX showed a tendency toward purifying selection ( Table 3). Analysis of separate modules indicated that in both PsK1 and PsSym37, the part encoding the LysM2 module (and, for PsSym37, also the part preceding LysM1 module) has undergone strong purifying selection. In PsK1, the intermediate part between LysM2 and LysM2-encoding regions showed signs of positive selection. There was also a tendency toward purifying selection in LysM3 and between LysM2 and LysM3 of PsSym37, which may indicate the importance of this gene. Indeed, mutations in PsSym37 were shown to markedly decrease the number of nodules in symbiotic conditions (Zhukov et al., 2008). In addition, the site between LysM2 and LysM3 of PsK1 appeared to be under    Figures 4-6. Colors indicate functional parts of corresponding protein ; also see Figure 2B).

Numbers indicate the amino acid position in sequence of corresponding protein. Letters indicate groups of haplotypes, as shown in
positive selection, suggesting that the structure of LysM modules is more important than that of intermediate regions in case of this particular LysM-RLK.
In Medicago, MtLYK2 has undergone purifying selection, which has strongly affected LysM2; this was also the case for PsK1 and PsSym37. It is important to note that the function of MtLYK2 is unknown; however, its similarity to MtLYK3 and the results of our analyses suggest that MtLYK2 may participate in the same symbiotic signal cascade as MtLYK3. However, MtLYK3 clearly demonstrates neutral evolution, even though it is essential for Nod factor perception (Smit et al., 2007). A possible explanation is that MtLYK3 has a low basic level of polymorphism (Table 4), which results in a statistically insignificant difference between dN and dS. MtLYK4 has undergone neutral evolution, like MtLYK3 and pea PsLykX. However, the neutral evolution signal may be a result of interference of two oppositely directed signals implying positive or purifying selection in different lineages in the dataset.
We also used the McDonald-Kreitman test (MK-test) as a statistical test of synonymous and non-synonymous changes. This test compares genes in two related species. The MK test did not detect departure from a neutral model, with the exception of the LysM1 region for the gene pair PsLykX and MtLYK4 (P = 0.0492).

SNP-Based Tests
We used several molecular evolution tests to analyze the distribution of SNP sites across the dataset, regardless of whether they represented synonymous or non-synonymous changes. These tests can identify either balancing selection (in favor of two or more alleles) or positive selection (in favor of one allele) and distinguish these types of selection from the neutral evolution model. First, nucleotide diversity (Pi, or π) was assessed in all sequences of the first exons and their separate parts, considering changes in synonymous and non-synonymous sites. In general, the rate of π was higher in pea genes than in M. truncatula ones ( Table 4), even though the M. truncatula sequence dataset was larger than the pea dataset. In addition, haplotype diversity was higher for M. truncatula genes than for pea ones, possibly reflecting the prevalence of rare nucleotide variants (singletons) in the Medicago sequence dataset. The π a /π s ratios for LysM1 in MtLYK3 and MtLYK4 were very high (4.1 and 8.7, respectively), indicating an excess of nonsynonymous substitutions. This result indicated that natural selection has tended to support new variants of LysM1 in these genes.
All the sequence sets were subjected to several neutrality tests (see section Materials and Methods, Table 5). In addition to analyses of the whole sequences and parts corresponding to LysM modules, a sliding window approach was used to visualize the obtained criteria values (Figures 7, 8).
In PsSym37, the value of Tajima's D suggested a departure from the neutral model for the LysM1 domain, its positive value indicating balancing selection (in favor of several different variants, namely haplotypes A and B; Table 2). No deviations from neutral evolution were identified in other regions, except for a weak signal of positive selection upstream of LysM1 (FW-H criterion = −2.34, P = 0.046).
In PsK1, the Tajima's D value indicated balancing selection in LysM1, which appeared to be the most polymorphic part of this gene, and the region responsible for the segregation of two main haplotype groups A and B (Table 2). Interestingly, for the region between LysM2 and LysM3, Tajima's D value fell below the critical value, suggesting rapid spreading of a particular allelic state due to positive selection (selective sweep), or due to a recent population expansion. The FW-H value did not exceed the critical value, suggesting that population expansion has shaped the DNA sequence in this region. Thus, the low Tajima's D value should be attributed to genetic drift rather than to a selective advantage of a particular allelic state of PsK1. In other regions, no departures from the neutral evolution model In case of statistically significant deviation from the neutral model, the values of criteria are given in bold and marked with asterisks accordingly to the P-value: *P < 0.05; **P < 0.01; ***P < 0.001.
were identified, except for a signal of positive selection upstream of LysM1 (FW-H criterion = −3.74, P = 0.018), similar to PsSym37.
PsLykX, like M. truncatula MtLYK3 and MtLYK4, also showed a π a /π s ratio >1 in the LysM1-encoding region (Table 4). Remarkably, a clear signal of positive selection was found in both the LysM2 and LysM3 domains (FW-H criterion below critical values; Figure 7), possibly indicating the importance of these parts for binding the Nod factor. Using a sliding window approach, declining values of modified Fu and Li's D and F criteria were detected at the LysM1 domain. However, in the absence of signals from powerful criteria (Fay and Wu's H), this was not considered as a footprint of positive selection. Instead, it could be attributed to a population expansion or a non-random selection of the samples in the sequence dataset.
In Medicago, all three examined genes had significantly low Fay and Wu's H values, indicating strong positive selection at the first exons (corresponding to the receptor parts of the LysM-RLKs). The strongest pressure was at LysM2 of MtLYK2 and at LysM1 of MtLYK3. In contrast to the pea sequences, the M. truncatula gene sequences showed significantly low values of Fu's Fs (Table 5), which is sensitive to population expansion. This implied that recent demographic processes have contributed to the molecular evolution of M. truncatula LysM-RLK genes. Interestingly, the similar pattern of nucleotide variant distribution in MtLYK3 and MtLYK4 genes led to comparable Tajima's D profiles in the sliding window analysis. Tajima

DISCUSSION
Both natural and artificial selection tend to affect sequence variability at selected genomic loci and at neutral loci linked to them. These selection signatures are widely used to identify loci subjected to selection, thereby giving researchers an insight into the evolutionary process, as well as providing functional information about genes or genomic regions (Kreitman, 2000;Nielsen, 2005).
The most crucial step in the symbiosis between legume plants and nitrogen-fixing bacteria (rhizobia) is the initial mutual recognition of the macro-and micro-symbionts. The importance of this step should not be underestimated, as partnerships with ineffective bacteria can be lethal for plants growing in harsh environments. Thus, plants have developed multicomponent receptor systems for the bacterial signal to identify the most suitable and beneficial partner. As bacteria tend to evolve rapidly, the genes involved in their recognition should be "hot spots" under constant selective pressure. Therefore, one can expect to find footprints of selection in the nucleotide sequences of genes encoding legume symbiotic receptors. In this work, we studied the polymorphism of the first exon of three LysM-RLK genes in large set of pea genotypes, and compared them with their homologs in M. truncatula. Although all these genes appeared to be quite similar to each other, our data suggest that they differ significantly in their evolutionary patterns, consistent with the symbiotic partner recognition unique to each legume species.
A single amino acid replacement in the receptor part of the LysM-RLK can change its recognition of the Nod factor (as shown for L. japonicus LysM-RLK NFR5, Radutoiu et al., 2007). Therefore, we expected to find a relatively low level of polymorphism in the first exon (corresponding to the receptor domain) of the three LysM-RLK genes from the Sym2 region. However, several "legal" haplotypes were detected for all three pea genes, implying that the structure of receptors may be variable. Such variability is possibly advantageous. Previously, Li et al. studied the polymorphism of the LysM-RLK gene PsSym37 in 10 pea genotypes, and detected two groups of haplotypes (at the protein level) correlated with the ability of the plant to perceive Nod factors containing a C18:1 rather than a C18:4 (number of carbon atoms:number of unsaturated bonds) acyl group at the non-reducing end of molecule (Li et al., 2011). Despite our sample set being more than 10 times larger, we were unable to identify any new PsSym37 haplotypes in addition to those already discovered, which may indicate a certain fundamental base of this division among pea genotypes. However, it is hard to tell exactly how, where, and why these two symbiotic groups may have arisen, as the history of pea domestication, spreading, and crossing is long and unclear. Although our results indicate generally neutral evolution of PsSym37, the overall positive Tajima's D value indicates that the sequence dataset may be not uniform. Independent analyses of two lineages (haplotype groups A and B) may be required to identify footprints of positive selection.
Interestingly, PsK1 shows a pattern of polymorphism similar to that of PsSym37, with two basic groups of haplotypes. However, the haplotypes do not correlate with any symbiotic feature of pea known so far. The function of PsK1 is also unclear, although preliminary data (MSc. Anna Kirienko, Dr. Elena Dolgikh, personal communication) indicate that it may function prior to PsSym37, as suggested by phenotypic analyses of mutants. Since PsK1 is similar to the M. truncatula gene MtLYK2, whose function is also unknown, it is possible that these proteins participate in the initial capture and recognition of the Nod factor, together with the "early" symbiotic LysM-RLKs PsSym10 (in pea) and MtNFP (in barrel medic). These kinases are characterized by an inactive kinase domain (lacking the activation loop; Arrighi et al., 2006), while this domain appears to be fully functional in PsK1 and MtLYK2. Therefore, the observed variability of the receptor part of PsK1 may be essential for recognition of diverse symbiotic bacteria.
The newly discovered pea LysM-RLK gene PsLykX is unique in that it has six equivalent major haplotype groups (each containing several genotypes) and about the same number of unique haplotypes (corresponding to a single genotype). This suggests that our sample of 90 PsLykX sequences does not cover all the possible "legal" allelic variants of this gene. Another important fact is the correlation of one of the identified PsLykX haplotypes with the "Afghan" phenotype. This makes PsLykX a promising candidate for PsSym2, and, among other things, may indicate that other PsLykX haplotypes are also related to nuances of perception of bacterial signals. The function of MtLYK4, the closest homologous LYK gene of PsLykX, is unclear, despite its significant similarity to MtLYK3 both in its sequence and in its discernable selection signatures. In roots of MtLYK4-knockdown plants, infection thread morphology was affected but nodulation occurred normally (Limpens et al., 2003). Furthermore, evidence of positive selection in the LysM2 and LysM3 modules suggests that MtLYK4 has recently acquired a new function, probably associated with the expansion of the M. truncatula population. Remarkably, the kinase domains of both putative PsLykX and MtLYK4 proteins lack the YAQ motif, which is inherent to symbiotic kinases in L. japonicus and is absent from nonsymbiotic ones (Nakagawa et al., 2011). On the other hand, the receptor system in legumes with indeterminate nodules (e.g., pea and barrel medic) appears to be more complicated than that of legumes with determinate nodules, such as, L. japonicus (Ardourel et al., 1994). The presence of PsLykX transcripts in pea nodules, together with the lyk4 RNA-interference phenotype in M. truncatula (disruption of infection process; Limpens et al., 2003), provide evidence for the participation of YAQ-lacking LysM-RLKs in indeterminate nodule development.
For all three genes, the haplotype distribution was not uniform among our samples, resulting in strictly correlating haplotype groups: for instance, group E of PsLykX (see Figure 6, Table 2) was found exclusively with group A2 of PsK1 and B0 of PsSym37. However, considering that: (i) these genes are clustered within approximately 20 kb; (ii) the genotype sampling was not random; and (iii) the VIR collection does not represent a single population of pea, this non-uniformity can be explained by factors other than the evolutionary advantage of a particular haplotype combination, e.g., genetic linkage or founder effect. On the other hand, the fact that we observed so many haplotype combinations despite the close proximity of these genes suggests that there may be a selective advantage of symbiotic LysM-RLK haplotype shuffling in pea.
Despite the fact that the LYK region of M. truncatula and the Sym2 region of pea are clearly syntenic (Gualtieri et al., 2002), it is nearly impossible to determine the orthologous relationships between genes contained in those regions. As shown in this work, LysM-RLK genes from these genomic regions are characterized by mutual mosaic similarity, with different parts of different genes being similar to each other both within a species and between species. Since the LYK and Sym2 regions are represented primarily by gene clusters that apparently originated via multiplication of ancestral LysM-RLK gene(s), the genes in those regions have undergone concerted evolution (Liao, 1999) implying genetic conversion. This has led to the diversification of LysM-RLK gene clusters in different legume species. In other words, each cluster represents an "evolution crucible" unique to each species, where genes and gene parts have been shuffled, combined, or broken apart. Thus, divergent evolution may lead to the loss of earlier versions of genes in one species, and the retention of genes and acquisition of unique functions in another species through neofunctionalization. This is a promising explanation for the unique PsSym2 A phenotype that is not observed in any other legume species.

CONCLUSION
As a complex multi-stage process, the symbiosis of legumes with rhizobia is not yet fully understood. There is still much to learn about the functions of many genes and the population genetics of the symbiosis. In M. truncatula (De Mita et al., 2011;Ho-Huu et al., 2012) and other legumes with well-studied genomes like L. japonicus, chickpea (Cicer arietinum L.), and soybean [Glycine max (L.) Merr.], surveys of large ecotype collections and association genetics analyses have been widely used to identify loci of interest (Kale et al., 2015;Li et al., 2017;Plekhanova et al., 2017). However, knowledge about the diversity of certain genes, especially those related to symbiosis, is still insufficient. The aim of this study was to help to fill this knowledge gap. Accordingly, we evaluated the genetic diversity of the pea LysM-RLK-encoding genes from the Sym2 region and corresponding barrel medic genes from the LYK region, detected the signatures of positive, balancing and negative selection in the gene sequences, and compared the patterns of selection pressure affecting particular modules in the sequences.
The cultivated pea accessions analyzed in this work have been previously characterized to determine their productivity under symbiotic conditions (Borisov et al., 2002). Therefore, information on symbiotic LysM-RLK gene polymorphisms may be useful for studies on associations between gene variants and the formation of highly effective symbioses with beneficial microorganisms. Considering the importance and economic value of pea in modern agriculture, the value of such research activities should not be underestimated.
Analyses of the polymorphism of crucial symbiotic genes and the identification of selection signatures have allowed us to formulate new hypotheses about their roles in symbiosis, which will be tested experimentally in the near future. On the basis of our results, we propose that the newly discovered pea gene PsLykX could in fact be PsSym2, which is responsible for increased selectivity toward a symbiotic partner in plants with the characteristic phenotype of cv. Afghanistan. However, more detailed analyses are required to confirm that PsLykX is the elusive PsSym2 gene. To confirm our hypothesis, the polymorphism of PsLykX should be studied in a larger sample of genotypes with high selectivity toward microsymbionts. In addition, it is important to analyze the phenotypes of lines with mutations in the PsLykX gene. The pea TILLING mutants collection (Dalmais et al., 2008) provides opportunities for such analyses. Complementation of the PsLykX mutations by either "European" or "Afghan" alleles resulting in nodulation or nonnodulation phenotypes would confirm the characteristics of particular groups of pea genotypes.

AUTHOR CONTRIBUTIONS
VZ, IT, and LL designed and conceived the study; AS, AA, and AZ conducted experiments; AS, AZ, AA, VZ, and IT analyzed the data. AS, VZ, and AA drafted the manuscript, IT and LL critically revised the manuscript. All authors read and approved the final manuscript.

FUNDING
The work on pea genes sequencing and analysis was funded by the Russian Science Foundation (grant # 16-16-00118); the work of M. truncatula genes analysis was funded by the Russian Foundation for Basic Research (grant # 15-29-02737 ofi_m).