A re-sequencing based assessment of genomic heterogeneity and fast neutron-induced deletions in a common bean cultivar

A small fast neutron (FN) mutant population has been established from Phaseolus vulgaris cv. Red Hawk. We leveraged the available P. vulgaris genome sequence and high throughput next generation DNA sequencing to examine the genomic structure of five P. vulgaris cv. Red Hawk FN mutants with striking visual phenotypes. Analysis of these genomes identified three classes of structural variation (SV); between cultivar variation, natural variation within the FN mutant population, and FN induced mutagenesis. Our analyses focused on the latter two classes. We identified 23 large deletions (>40 bp) common to multiple individuals, illustrating residual heterogeneity and regions of SV within the common bean cv. Red Hawk. An additional 18 large deletions were identified in individual mutant plants. These deletions, ranging in size from 40 bp to 43,000 bp, are potentially the result of FN mutagenesis. Six of the 18 deletions lie near or within gene coding regions, identifying potential candidate genes causing the mutant phenotype.


INTRODUCTION
Common bean, Phaseolus vulgaris L., is an important source of proteins and carbohydrates for over three million people worldwide (Broughton et al., 2003). Despite its dietary importance, genetic resources for common bean have lagged behind those of "model legumes" soybean, Medicago truncatula, and Lotus japonicus. However, next generation sequencing (NGS) technologies now make genomic studies applicable to any species of interest.
Mutants are important tools in deciphering gene functions. In common bean, individual mutants can be created for a gene of interest through plant transformation (Aragao et al., 1996;Kwapata et al., 2012). Gene expression patterns of various genes in common bean can also be knocked out or down through the use of virus induced gene silencing (Diaz-Camino et al., 2011;Zhang et al., 2013). These methods both require prior knowledge of genes of interest. In contrast, mutant populations can be screened for phenotypes of interest and genes responsible identified through various approaches. The analysis of traits in various species including those related to plant architecture, yield, and stress response genes have been improved by utilizing mutant screens (Papdi et al., 2010;Bolon et al., 2011;Ma et al., 2013).
Here, we present a small fast neutron (FN) mutant population for common bean and demonstrate how NGS technologies, such as DNA-seq provide for fast, high quality analysis of genomic variation to identify potential candidate genes for observed phenotypes. Additionally, the DNA-seq data allowed us to examine the natural variation existing within the Red Hawk cultivar. Such natural variation is a rich source of genomic diversity that can be utilized in future cultivar development.

DEVELOPMENT OF Phaseolus vulgaris FAST NEUTRON POPULATION
Ten thousand P. vulgaris cv. Red Hawk seeds (Kelly et al., 1998), an Andean cultivar adapted for growth in the upper Midwest, were sent to the McClellan Nuclear Radiation Center at the University of California-Davis for irradiation. Five thousand seeds were treated with either 16 or 32 Gys of FN radiation. Treated seeds were sent to the Illinois Crop Improvement Association (ICIA) facility in Puerto Rico in November 2009 along with 200 wild type seeds from the same seed lot. Approximately 70% of the 5000 seeds treated with 16 Gys germinated, while none of the seeds treated with 32 Gys germinated. Seedlings were allowed to mature at the ICIA facility, where plants were phenotyped and seeds were collected from all mature plants. Seeds from ∼88 plants with striking mutant phenotypes such as developmental delays, plant stature, pod set, and pod size variations, were harvested individually. Remaining mutant plants were bulk harvested. Wild type plants grown at ICIA were also bulk harvested.
Seeds from individually collected plants were planted at the University of Minnesota Experiment Station in St. Paul in 2010. Approximately 10,000 seeds from the bulk collection of mutants were also planted. Phenotyping was performed throughout the growth season, complemented by photographs. Selected individuals with visible and/or maturity phenotype variations were harvested. In 2011, seeds from selected 2010 M2 individuals were planted in 10 ft rows (∼20 seeds). Phenotypes observed throughout the 2011 growth season were compared to documented phenotypes from previous years to determine if trait expression was consistent. Additionally, segregation among the 20 plants per mutant line was noted. Three to four individuals in each row with visible/stable traits were tagged, photographed and seed was harvested.

DNA-SEQ ANALYSIS OF FAST NEUTRON MUTANTS
Five FN mutant plants with different, stable, obvious phenotypes (Figure 1) were chosen for paired end sequence analysis. The following FN mutant plants chosen: 1R5C01r5CPVMN11, a plant with decorative chlorotic leaves early in the growing season ( Figure 1A), 1R19C15r28CPVMN11, a small plant with lanceolate leaves ( Figure 1B); 1R22C04r31CPVMN11, an upright plant with rugose leaves ( Figure 1C); 2R29C12r78CPVMN11; which phenotypically resembled the wild type plant but was delayed in maturity ( Figure 1D); and 3R5C25r87CPVMN11, which exhibited interveinal chlorosis ( Figure 1E). The mutant plants will respectively be referred to as lanceolate, rugose, decorative, maturity, and chlorotic throughout the rest of the manuscript. M3 seeds of the plants chosen for sequencing were collected and planted at the University of Minnesota Experiment Station in St. Paul in 2012 to ensure the phenotype was maintained through the M3 generation. Leaf tissue from a representative wild type plant and from each of the chosen mutant plants at the M2 generation was collected from 2011 field-grown plants early in the morning and immediately placed at −80 • C to inhibit DNA degradation. DNA from all six plants (WT and five mutants) was extracted using the phenol:chloroform method as described (Liu et al., 1997). Each DNA sample was visually inspected on a 1% agarose gel, to ensure that the samples were not degraded. DNA concentration and purity was assessed using an Agilent 2100® Bioanalyzer™ (Agilent®, Santa Clara, CA). DNA samples were submitted to the molecular biology core at the Mayo Clinic, Rochester, MN for paired end sequencing on an Illumina HiSeq 2000. To reduce variability, DNA from all samples were multiplexed and run in a single lane. Low quality reads and adaptor sequences were removed, resulting in 31 million paired end reads per sample.
Paired-end genome sequences were mapped to the P. vulgaris G 19833 genome sequence available at www.phytozome.net using BWA (Li and Durbin, 2009) with default parameters. The resulting mapping files were further sorted, indexed, and translated to binary format (BAM files) using samtools ). The sequence alignments were visualized using IGV (Robinson et al., 2011). This approach aligned 70% of all Red Hawk DNA sequences to 88% of each of the 11 P. vulgaris chromosomes with 12X sequence depth. Regions of genomic deletions were identified using custom perl scripts. To confirm these deletions, the program, CREST  was also used to screen the FN mutant plants. This program identifies genomic deletions, insertions, inversions, and translocations by identifying soft clipped reads and the read coverage at potential breakpoints to calculate if the probability of observing the number of soft clipped reads at a given location is >0.05 based on a binomial distribution. Statistically significant (P < 0.05) SV are retained for further consideration. CREST analysis was performed for each mutant compared to the wild type control. Genomic deletions resulting from differences between cultivars were masked using the -g function. We chose to focus our analysis on genomic deletions as these are called with the greatest confidence and can be confirmed by visual screening of genomic alignments. Deletions >40 bp identified by both perl scripts and CREST analysis were further characterized by genic location: intergenic, promoter, exon, intron, and 3 UTR. Genomic regions of natural variance within the cultivar were identified by identifying deletions common to multiple, but not all, FN mutant plants (Figures 2A, 3). Single nucleotide polymorphisms (SNPs), small (<40 bp) insertions and deletions (INDELs) (either unique to a single plant or common to multiple plants) were identified using the pileup function in SAMtools . Unique or common SNPs and INDELs were identified using the compareBed function of BEDtools (Quinlan and Hall, 2010) and custom perl scripts. SNPs and INDELs with a read depth <6 (half the average genome coverage) were removed from further analysis. Custom perl scripts were used to characterize SNPs and INDELs by genic locations as described above.

MUTANT COLLECTION
FN mutant populations have proven to be a valuable asset for genetic studies in a variety of crop species (Starker et al., 2006;Bolon et al., 2011;Xiao et al., 2011). We have established a small FN mutant population for common bean using P. vulgaris cv. Red Hawk. Seeds from 88 plants with stable, visual phenotypes are available for public use ( Table 1) by contacting Dr. Carroll Vance at vance004@umn.edu or Jeff Roesler at roess001@umn.edu. Phenotypes observed in the field include varying degrees and types of chlorosis and altered maturity (usually delayed). Additionally, bulk seed from M2 and M3 plants is available for researchers wishing to screen mutants for a particular trait of interest.

USE OF DNA-SEQ TO IDENTIFY REGIONS OF STRUCTURAL VARIATION
The DNA-seq reads from the five FN mutant lines and one wildtype cv. Red Hawk individual were mapped to the common bean reference genome sequence (accession G 19833). This analysis allowed us to identify three classes of SV ( The Class 1 group primarily represents intra-specific structural genomic differences between accession G 19833 and cv. Red Hawk. We identified over one-thousand of these regions, in which all six Red Hawk individuals were missing >100 bp that is present in the reference genome. This is an interesting group of SV to catalog and may have important implications for understanding inter-cultivar phenotypic variation. However, these features do not inform our understanding of the mutant phenotypes in the FN lines. Therefore, we chose to focus the data analysis on the Class 2 and Class 3 groups, which exhibited structural polymorphism among the FN individuals. The Class 2 group consists of DNA segments present in some FN individuals, but missing in at least two others (Figure 2A). It is highly unlikely the same genomic regions would be deleted in multiple plants by FN irradiation. Our analysis identified 24 genomic deletions belonging to Class 2, which illustrates the natural variation within the common bean cultivar Red Hawk. Specifically, ten unique deletions (P < 0.05) are shared by the lanceolate and chlorotic mutants on chromosome 1, nine genomic deletions were identified on chromosomes 2 and 11 in three mutant plants (lanceolate, maturity, and chlorotic), four deletions on chromosome 4 are common to the maturity and chlorotic mutants, and one deletion on chromosome 11 is shared by the rugose and maturity mutants (Figure 3, Table 2). Class 2 deletions range in size from 41 base pairs (bp) to 12,111 bp. Of all of these deletions, two are within gene introns and two are within 1000 bp upstream of the start codon or 1000 bp downstream of the stop codon, regions involved in regulating gene expression FIGURE 2 | Example of sequence alignments illustrating cultivar heterogeneity and fast neutron induced deletions. Paired end sequences from six Red Hawk individuals were aligned to the G 19833 genome sequence available at www.phytozome.net using BWA (Li and Durbin, 2009) with default parameters. Using CREST and custom perl scripts, we identified statistically significant deletions illustrating three classes of SV. SV was visualized using IGV (Robinson et al., 2011). These regions of SV were all identified on chromosome 1, from the two regions highlighted by filled black boxes above the alignments. The double vertical line within the alignments represents chromosomal region between the two genomic regions depicted. For each individual, a histogram plot illustrates the read depth with individual reads plotted below. Black boxes highlight the three classes of SV. (A) Class 2, regions of genomic heterogeneity within the Red Hawk cultivar. These deletions were identified in two or more Red Hawk individuals and represent the residual heterogeneity in the fast neutron mutant population. This deletion spans 3408 bp on chromosome 1 and is only evident in the lanceolate and chlorotic individuals. (B) Class 1, sequences present in the reference genome, but missing in all Red Hawk individuals. These regions illustrate the differences between the common bean cultivars. This particular region spans 5000 bp of the reference genome. (C) Class 3, sequences missing in a single individual but present in all other lines. This class is most likely the result of fast neutron mutagenesis and may be responsible for the mutant phenotype. This particular deletion is approximately 1500 bp long in the chlorotic mutant and is immediately downstream of the predicted gene Phvul.001G128600 (shown in blue below alignments), which encodes a RecA protein.
patterns ( Table 2). All Class 2 deletions on chromosome 2 lie within 2.7 million bps of each other. This is a region exhibiting high heterogeneity (Figure 3). Within this region three individual FN lines share eight Class 2 deletions. The region is unaffected in the remaining two FN lines and the wild type plant.
The remaining 18 deletions identified by CREST (P < 0.05) belong to Class 3. This class is composed of sequences absent from a single FN line, but present in all other samples ( Figure 2C, Table 3). Deletions belonging to Class 3 range in size from 41 to over 43,000 bp and are found on chromosomes 1, 4, 5, 7, 8, and 9. Twelve of the eighteen Class 3 deletions are in intergenic regions of the genome, though as genome annotation improves these regions may contain as of yet unpredicted genes. Six deletions belonging to Class 3 have the potential to alter gene expression patterns. Regions immediately upstream or downstream of gene coding regions are likely involved in regulating gene expression. Three deletions are located within 1000 bp of gene start or stop codons of Phvul.001G128600, Phvul.004G029200, Phvul.004G031900, and Phvul.004G032000 in the chlorotic and maturity mutants. The latter two genes are tightly linked in the genome, so a single deletion may affect the expression of multiple genes. Two Class 3 deletions shorten the introns of Phvul.001G050100 in the lanceolate mutant and Phvul.004G031200 in the maturity mutant. Finally, a 43,034 bp deletion on chromosome 8 in the chlorotic mutant removes the entire sequence for Phvul.008G141500. In summary, we were able to identify statistically significant SV belonging to Class 3 within either the coding region or the potential regulatory region of predicted genes for three of the five mutant plants (lanceolate, maturity, and chlorotic). Genes likely impacted by these deletions represent the most likely candidates responsible for the mutant phenotype of the lanceolate, maturity, and chlorotic mutants. For decorative and rugose, our analysis pipeline failed to identify Class 3 deletions within the regulatory or coding region of genes. These phenotypes may be a result of a heterozygous deletion, a small (<40 bp) INDEL, or a SNP.   genomic architecture corresponded to a genic region (Tables 4, 5). As was observed with the larger deletions, the majority of the INDELs and SNPs mapped to intergenic regions. Confirmation by PCR will be necessary to determine the false discovery rate of the SNP and INDEL identification.

DISCUSSION
We demonstrate the feasibility of utilizing high throughput DNA sequencing to analyze FN mutant plants. The use of high throughput sequencing allows the scale of our analysis to be reduced to a base-pair level, providing for the identification and analysis of SNPs, indels, and larger genomic deletions using a single experimental platform. Our analysis identified SV in each Red Hawk individual (wild type and FN mutants) in comparison to the P. vulgaris (accession G 19833) reference genome sequence (available at www.phytozome. net). These SV are inferred to be sequences lost (deleted) from the respective Red Hawk individuals or sequences recently gained by G 19833. (Identifying sequences missing in G 19833 that are present in Red Hawk requires a de-novo assembly of the Red Hawk genome, which is beyond the scope of this analysis.) One may assume that the majority of the SV is either natural or FN-induced deletions in Red Hawk; therefore these events will be referred to as "deletions" throughout the discussion. Our analysis identified three classes of SV (Figure 2). Class 1 represents the putative inter-cultivar SV, large sequence segments that are missing from all sequenced Red Hawk individuals in this study, but are present in the current G 19833 assembly (Figure 2B). Class 2 represents the intra-cultivar SV exhibiting differences among Red Hawk individuals (Figure 2A). Class 3 represents SV specific to a single Red Hawk individual, which were potentially generated by the FN irradiation ( Figure 2C). We focused our analysis on the 23 Class 2 and 18 Class 3 SV that were identified in this analysis. However, it is important to recognize that these classifications are tentative, as a deeper sampling of Red Hawk individuals may re-classify some variants. For example, the limited sampling of one wild-type and five mutated Red Hawk individuals suggests that some of our Class 3 SV may be low frequency natural variants that would have been identified in more than one Red Hawk individual (and thereby be a Class 2 SV) if a larger number of genotypes had been sampled. Similarly, some Class 1 SV may be present at low frequency in Red Hawk, suggesting that these would be re-classified as Class 2 SV in a deeper sampling of genotypes. It is probable that increasing the number of sequenced individuals from Red  Hawk and/or G 19833 individuals would identify many additional Class 1 and Class 2 SV, and strengthen the confidence of the Class 3 calls. A particularly interesting heterogeneous mixture of deletions was identified in a 2.7 million bp region on chromosome 2 of Red Hawk (Figure 3), and represents an intriguing cluster of Class 2 SV. Sequence analysis of the 159 genes within this region revealed 21 are involved in disease resistance response (one dirigent like protein, six leucine rich repeat proteins, and 14 NB-ARC-LRR domain containing disease resistance proteins). This is consistent with previous studies which identified an over-abundance of disease resistance genes located within regions of natural variation (McHale et al., 2012). Additional regions of Class 2 SV are apparent on chromosomes 1, 4, and 11. Analysis of the DNA-seq data for the five individual FN lines suggests the level of natural variation present within the Red Hawk cultivar is higher than that induced by FN mutagenesis. However, aside from the region on chromosome 2, most of the Class 2 SV is in intergenic regions, not likely impacting gene expression or function. The Class 3 deletions are particularly interesting, as they may be associated with the mutant phenotypes found in the FN irradiated individuals. For two of the FN lines, no unique deletions >40 bp were found within or near gene coding regions (although it is still possible that these lines carry heterozygous deletions within genic regions). For the remaining three FN lines, however, we identified candidate deletions within geneencoding regions that may be associated with the resulting phenotype.
The lanceolate mutant (1R19C15r28CPVMN11) exhibited a shorter than wild type stature with elongated, or lanceolate, shaped leaves ( Figure 1B). CREST analysis identified a single Class 3 SV potentially altering the expression of Phvul. 001G050100 (Table 3, Figure 3). The deletion is entirely contained within the last intron region of the gene. The gene is a member of the glycosyltransferase family 47 subgroup C. Proteins encoded by members of this family are bound to the golgi and are involved in cell wall biosynthesis (Jensen et al., 2008). Altering the expression of member of this family/subgroup would impact the physical property of the pectin matrix (Jensen et al., 2008). However, there are no reports of altered leaf phenotypes or plant height in Arabidopsis knockout populations.
Three regions of SV belonging to Class 3 potentially affecting gene expression in the maturity mutant (2R29C12r78CPVMN11) were identified by CREST analysis. In the field, this plant showed no phenotypic derivation from the wild type, except a delay in maturity ( Figure 1D). These three deletions, all on chromosome four, potentially affect the expression pattern of four candidate genes (Figure 3, Table 3). The 1st deletion lies immediately upstream of Phvul.004G029200, which encodes a 60S ribosomal protein. This is an unlikely candidate gene for the phenotype observed. The second deletion is in the intron of Phvul.004G031300. Sequence analysis of this gene failed to identify any functional annotations associated with this gene. The final deletion is immediately downstream of both Phvul.004G032000 and Phvul.004G031900. The Arabidopsis homologs of these genes (At3g01090 and At5g19790) encode a protein kinase and an AP2 transcription factor respectively. AP2 transcription factors are involved in regulating flowering and fruit ripening (Huijser and Schmid, 2011). In Arabidopsis, the overexpression of AP2 results in delayed flowering and maturity (Wollmann et al., 2010). It's possible, the deletion immediately down-stream of Phvul.004G031900 causes an increase of AP2 protein accumulation resulting in delayed flowering and maturity.
Finally, the mutant plant 3R05C25r87CPVMN11 exhibited interveinal chlorosis under standard field conditions ( Figure 1E). Interveinal chlorosis is often an indicator of nutrient deficiencies. However, DNA-seq analysis of this mutant revealed two SV belonging to Class 3 potentially affecting gene expression patterns ( Table 3). The first is a 1518 bp deletion immediately downstream of Phvul.001G128600. The homolog of this gene in Arabidopsis thaliana (At3g10140) encodes a RecA protein, involved DNA repair by binding ssDNA (Miller-Messmer et al., 2012). The second Class 3 SV is a large deletion spaning 43,034 bp on chromosome 8, encompassing the entire sequence for Phvul.008G141500 (Table 3). Sequence analysis of this gene determined it is a member of the SNF2 helicase family. The Arabidopsis homolog of this gene, At2g44980, is the only member of the ALC1 SNF2 subfamily (Knizewski et al., 2008). There is no reported data on the phenotype of an ALC1 knockout in Arabidopsis. However, in mammalian systems, ALC1 is essential for repairing DNA damage (Ryan and Hughes, 2011). Down-regulation of ALC1 protein results in hypersensitivity to damaging agents (Ryan and Hughes, 2011). We hypothesize a similar function is conserved in common bean. Based on the interveinal chlorotic leaf patterning when this gene is completely excised it's possible Phvul.008G141500 is involved in repairing damage to DNA in the leaves, possibly caused by UV radiation.
We identified Class 3 SV, likely a result of FN mutagenesis, ranging from 1 bp substitutions to 43,034 bp deletions, though changes <40 bp have not been visually verified. In the mutated plants, the level of FN irradiation used (16 Gys) induced far more small (<50 bp) deletions, including single base pair substitutions and deletions, than large genomic deletions. These results are consistent with FN induced deletions identified in a recent paper in Arabidopsis (Belfield et al., 2012). The lack of large SV regions belonging to Class 3, as seen in the wellcharacterized soybean population (Bolon et al., 2011), may be due to our analysis pipeline requiring the deletion to be complete (i.e.,: homozygous). It is possible some larger deletions are present in these lines, but are not yet homozygous in the M2 generation and, as such, were not identified by our analysis. In the soybean FN population, 52 and 38% of the mutants with visual phenotypes were identified from seeds treated with 16 and 32 Gys respectively (Bolon et al., 2011). None of the P. vulgaris plants irradiated at 32 Gys germinated in the field, suggesting the common bean genome may be less resilient to interruption than the soybean genome. The soybean genome may accommodate larger genomic deletions by compensating for gene loss through altered expression patterns of duplicated genes.

SUMMARY
A FN population with 88 individuals or bulk seed from M2 and M3 generations is available upon request. We've illustrated the utility of DNA-seq to identify three classes of SV in P. vulgaris individuals. These analyses were greatly facilitated by the availability of the P. vulgaris genome sequence. In the Red Hawk cultivar, natural variation is clustered in regions throughout the genome. These regions of natural variation illustrate the existing genetic potential of common bean germplasm. Our analyses also identified genomic deletions resulting from FN mutagenesis and candidate genes potentially responsible for the altered phenotype in three of the plants selected.