Large-Scale Investigation of Soybean Gene Functions by Overexpressing a Full-Length Soybean cDNA Library in Arabidopsis

Molecular breeding has become an important approach for crop improvement, and a prerequisite for molecular breeding is elucidation of the functions of genetic loci or genes. Soybean is one of the most important food and oil crops worldwide. However, due to the difficulty of genetic transformation in soybean, studies of its functional genomics lag far behind those of other crops such as rice, which severely impairs the progress of molecular improvement in soybean. Here, we describe an effective large-scale strategy to investigate the functions of soybean genes via overexpression of a full-length soybean cDNA library in Arabidopsis. The overexpression vector pJL12 was modified for use in the construction of a normalized full-length cDNA library. The constructed cDNA library showed good quality; repetitive clones represented approximately 4%, insertion fragments were approximately 2.2 kb, and the full-length rate was approximately 98%. This cDNA library was then overexpressed in Arabidopsis, and approximately 2000 transgenic lines were preliminarily obtained. Phenotypic analyses of the positive T1 transgenic plants showed that more than 5% of the T1 transgenic lines displayed abnormal developmental phenotypes, and approximately 1% of the transgenic lines exhibited potentially favorable traits. We randomly amplified 4 genes with obvious phenotypes (enlarged seeds, yellowish leaves, more branches, and dense siliques) and repeated the transgenic analyses in Arabidopsis. Subsequent phenotypic observation demonstrated that these phenotypes were indeed due to the overexpression of soybean genes. We believe our strategy represents an effective large-scale approach to investigate the functions of soybean genes and further reveal genes favorable for molecular improvement in soybean.


INTRODUCTION
Conventional crop breeding is challenging and often takes a great deal of time. Molecular breeding approaches, such as marker-assisted or transgenic breeding, have become important tools for current crop improvement. The prerequisite for molecular breeding is elucidation of the functions of genetic loci or genes. Among the different crop species where molecular breeding has been successfully applied, rice is a good example. For seed-related traits, several loci affecting grain size have been successfully identified and elucidated, such as GW7 (Wang S. et al., 2015), GW2 (Song et al., 2007) and GL7 (Si et al., 2016). Further transgenic application of these genes greatly enhanced rice grain yield (Song et al., 2007; Si et al., 2016). Plant architecture is another important factor affecting grain yield. Ideal Plant Architecture 1 (IPA1) is an important dominant gene that reduces tiller number and enlarges panicles (Jiao et al., 2010). Practical application of this gene could greatly enhance grain yield compared to that of the control.
Soybean (Glycine max (L.) Merr.) is one of the most important food and oil crops worldwide, providing abundant proteins, oil and nutrition sources for animal feed and the human diet (Lam et al., 2010; Korir et al., 2013). At present, over one-third of edible oils and two-thirds of protein meal in the world are derived from soybean (Sobhanian et al., 2010). Soybean researchers have made some progress in soybean functional genomics through either forward genetic or reverse genetic studies. For example, a series of soybean gene functions have been elucidated, including GmphyA1 and GmphyA2 (Liu et al., 2008), the J locus, a homolog of Arabidopsis thaliana EARLY FLOWERING 3 (ELF3), GmGIP1, GmAP1 (Chi et al., 2011), and GmZF351. However, due to the very large and complex genome, low transformation efficiency, and long transgenic process in soybean, studies of soybean functional genomics have lagged far behind those in other crops, such as rice (Oryza sativa). In addition, soybean is an ancient tetraploid plant with many redundant genes, which seriously hinder the elucidation of gene function through the study of loss-of-function soybean mutants. To date, many kinds of RNA and protein profiling in soybean have been conducted, and numerous regulated genes have been identified (Smith-Hammond et al., 2014; Lanubile et al., 2015; Yu et al., 2016). However, confirmation of the functions of these genes has been rare, let alone the application of these genes in molecular improvement. Therefore, a strategy is needed to conduct large-scale and efficient studies in soybean functional genomics and, especially, to mine a large number of candidate favorable genes for soybean genetic improvement.
Arabidopsis is a model plant for studies of dicotyledons and thus represents an important intermediate tool to confirm the functions of soybean genes (Liu et al., 2014; Wang F. et al., 2015; Wei et al., 2015; Pan et al., 2017). In this study, we overexpressed a full-length soybean cDNA library in Arabidopsis and investigated the functions of soybean genes by evaluating the phenotypes of transgenic Arabidopsis. Our strategy could provide an efficient and high-throughput approach to mining candidate favorable soybean genes that could facilitate soybean molecular improvement.

Plant Materials and Growth Conditions
The roots, stems, leaves and shoot apices of soybean cultivar Williams 82 were used as materials for RNA extraction.
Arabidopsis ecotype Col-0 was used as the transformation recipient. Arabidopsis Col-0 was planted in a greenhouse with a temperature of 23 °C, humidity of approximately 75%, and a 16 h light and 8 h dark photoperiod.

Vector Modification and Construction of the Full-Length cDNA Library
For vector modification, we first analyzed the sequence of vector pJL12 and found an SfiI recognition site (downstream of the NOS terminator at position 3859). However, there was no SfiI site downstream of the 35S promoter. Thus, we needed to mutate the existing SfiI site (at position 3859) and add another downstream of the 35S promoter for the integration of cDNA inserts. Briefly, we artificially synthesized a 1397-nucleotide fragment from position 2563 to 3907 in pJL12, which contained BamHI and Sse232I adapters. This fragment was designed to mutate the SfiI recognition site at position 3896 by changing G to T and to introduce a new SfiI site downstream of the 35S promoter, where the sequence ACTAGTTCTAGA was replaced by GGCCATTACGGCCAAGCTTGATATCGGCCGCCTCGGCC. Next, the pJL12 fragment from position 2563 (BamHI) to 3927 (Sse232I) was replaced by the newly synthesized fragment using BamHI and Sse232I digestion followed by ligation with T4 DNA ligase.
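The effect of the sequence replacement described above can be checked in silico. The following is a minimal sketch, not the authors' procedure: it assumes the standard SfiI recognition pattern GGCCNNNNNGGCC and scans the original and replacement sequences quoted in the text. The function and variable names are illustrative.

```python
import re

# SfiI recognizes the interrupted palindrome GGCCNNNNNGGCC (N = any base).
# A lookahead is used so that overlapping candidate sites are not skipped.
SFII = re.compile(r"(?=(GGCC([ACGT]{5})GGCC))")


def find_sfii_sites(seq):
    """Return (0-based start, 5-nt spacer) for every SfiI site in seq."""
    seq = seq.upper()
    return [(m.start(), m.group(2)) for m in SFII.finditer(seq)]


# Sequences quoted in the vector modification paragraph.
original_mcs = "ACTAGTTCTAGA"
new_mcs = "GGCCATTACGGCCAAGCTTGATATCGGCCGCCTCGGCC"

original_sites = find_sfii_sites(original_mcs)
new_sites = find_sfii_sites(new_mcs)

print("original:", original_sites)
print("replacement:", new_sites)
```

Under this assumption the original sequence contains no SfiI site, whereas the replacement contains two, with different spacers (ATTAC and GCCTC). Because SfiI cuts within the spacer, two sites with different spacers leave non-identical ends, which is the usual basis for directional insertion of cDNA.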

Large-Scale Genetic Transformation of the cDNA Library Into Arabidopsis
The concentrations of the plasmids harboring the cDNA library were measured with a spectrophotometer (Nanodrop, Thermo Fisher), and the plasmids were diluted to between 100 and 200 ng/µl. Then, 1 µl of each plasmid was transformed into Agrobacterium strain GV3101 using an electrotransformation apparatus (MicroPulser, Bio-Rad). After Agrobacterium colonies appeared, all colonies were scraped into 1 L LB medium (kanamycin 30 ng/µl, rifampicin 50 ng/µl) and propagated in an incubator. To prevent faster-growing clones from dominating the culture during a prolonged culture period, the culture time for Agrobacterium propagation was limited to 4 to 6 h. The transformation experiment was performed as described by Clough and Bent (1998), except that we used 500 µl Silwet L-77 per liter LB medium when performing the transformation, because Silwet L-77 can greatly improve the transformation rate (Martinez-Trujillo et al., 2004).
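The dilution step follows the usual C1V1 = C2V2 relationship. As a minimal sketch, the helper below computes how much diluent to add to bring a plasmid stock into the 100 to 200 ng/µl window; the stock concentration, stock volume and the 150 ng/µl target are hypothetical values chosen for illustration, not values from this study.

```python
def diluent_volume(stock_ng_per_ul, stock_volume_ul, target_ng_per_ul):
    """Volume of diluent (µl) to add so that the stock reaches the target concentration."""
    if target_ng_per_ul <= 0 or target_ng_per_ul > stock_ng_per_ul:
        raise ValueError("target must be positive and no higher than the stock")
    # C1 * V1 = C2 * V2  ->  V2 = C1 * V1 / C2
    final_volume_ul = stock_ng_per_ul * stock_volume_ul / target_ng_per_ul
    return final_volume_ul - stock_volume_ul


# Hypothetical reading: 850 ng/µl stock, 10 µl on hand, target in the middle of the window.
to_add = diluent_volume(850.0, 10.0, 150.0)
print(f"add {to_add:.1f} µl of diluent")

# 500 µl of Silwet L-77 per litre of medium, expressed as percent (v/v).
silwet_percent = 500e-6 / 1.0 * 100
print(f"Silwet L-77: {silwet_percent:.2f}% (v/v)")
```

The same arithmetic shows that 500 µl Silwet L-77 per liter corresponds to 0.05% (v/v).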

Phenotypic Evaluation and Genetic Analyses of Transgenic Arabidopsis
The transformed seeds overexpressing the full-length cDNA library were planted in a greenhouse. After screening with Basta, the surviving positive seedlings were compared to the seedlings expressing the empty vector at all developmental stages.
For the screening of homozygous transgenic lines, approximately 30 seeds derived from T2 plants were plated on Basta selective plates (1:1000 dilution), and seeds were harvested from the plants whose seeds showed a 3:1 survival ratio. The phenotypes of the homozygous transgenic lines were further investigated without Basta selection.
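Whether an observed survival count is compatible with a 3:1 ratio can be assessed with a chi-square goodness-of-fit test. The sketch below is one possible way to do this and is not taken from the original methods; the seed counts are hypothetical. With one degree of freedom, the p-value can be computed from the complementary error function in the standard library.

```python
import math


def segregation_chi2(resistant, sensitive, ratio=(3, 1)):
    """Chi-square goodness-of-fit test (1 degree of freedom) against an expected ratio.

    Returns (chi2 statistic, p-value).
    """
    total = resistant + sensitive
    if total <= 0:
        raise ValueError("at least one seed must be scored")
    expected_res = total * ratio[0] / sum(ratio)
    expected_sen = total * ratio[1] / sum(ratio)
    chi2 = ((resistant - expected_res) ** 2 / expected_res
            + (sensitive - expected_sen) ** 2 / expected_sen)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical plate: 23 of 30 seedlings survive Basta selection.
chi2_fit, p_fit = segregation_chi2(23, 7)
# Hypothetical plate where every seedling survives.
chi2_all, p_all = segregation_chi2(30, 0)

print(f"23:7  chi2={chi2_fit:.3f}  p={p_fit:.3f}")
print(f"30:0  chi2={chi2_all:.3f}  p={p_all:.4f}")
```

With only about 30 seeds per plate the test has limited power, so a non-significant result is best read as "consistent with 3:1" rather than as proof of a single insertion.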
For the A13, B12, C15 and D70 overexpression plasmids, the primers A13-f/A13-r, B12-f/B12-r, C15-f/C15-r and D70-f/D70-r were used to amplify each gene individually (Supplementary File 1). The PCR products were integrated into linearized pJL12 vector using the ClonExpress II One Step Cloning Kit (Vazyme, cat#c112) after digestion by the restriction enzyme BamHI. The correct constructs were retransformed into Col-0 to validate the observed phenotypes.

Chlorophyll Content Analysis
The total chlorophyll of the 4th and 5th rosette leaves was extracted as described in a previous report (Sakuraba et al., 2014b). Subsequently, the chlorophyll content was quantified according to another published report (Lichtenthaler, 1987).

Construction of Soybean Full-Length cDNA Library
The vector pJL12, harboring the 35S promoter, is widely used in studies of functional genomics (Qiu et al., 2002, 2008a; Collins et al., 2003; Zou et al., 2011; Guo et al., 2014). To make this vector suitable for the construction of a full-length cDNA library, we first modified this vector. Because the cohesive ends produced by SfiI digestion were required, we first analyzed the sequence of the original pJL12 vector and found an SfiI restriction site in pJL12 (position 3859-3907, Figure 1A). However, downstream of the 35S promoter in the MCS position, there was no SfiI site. Therefore, we needed to mutate the existing SfiI restriction site (G-to-T substitution) and add another in the MCS (Figure 1A). Considering the above, a nucleotide fragment with a length of 1397 containing the BamHI and Sse232I recognition sites was artificially synthesized (Supplementary File 2), in which one SfiI site was mutated by a G-to-T substitution at position 3859-3907 and another SfiI site was introduced downstream of the 35S promoter at position 2568. This fragment was later integrated into pJL12 by BamHI and Sse232I digestion. Thus, the modified pJL12 vector could allow the insertion of the cDNA library after digestion by the SfiI restriction enzyme and the overexpression of the library in plants through the 35S promoter.
We collected the roots, stems, leaves and shoot apices of Williams 82 plants for RNA extraction. Different adapters with SfiI cohesive ends at the 5′ and 3′ ends were used to perform reverse transcription (see section "Materials and methods"). Next, the cDNA was digested with SfiI and then separated by use of agarose gel electrophoresis. The approximately 2 kb fragments were recovered, normalized and ligated to the modified pJL12 vector (Figure 1B). In the end, we transformed the cDNA library into DH5α. In total, we obtained approximately 10⁶ clones and retained all of the clones for propagation and storage (Table 1).

Assessment of cDNA Library Quality
We randomly picked 96 colonies and used primers pJL12-f (in the 35S promoter) and pJL12-r (in the NOS terminator) to amplify the cDNA inserts. All 96 colonies were successfully amplified, which showed that each clone carried a cDNA insertion (Supplementary File 3). To evaluate the normalization rate of the constructed cDNA library, we sequenced the 96 PCR products. As shown in Figure 2A, non-repetitive cDNA sequences in the constructed cDNA library were approximately 96% and repetitive sequences were approximately 4%. To evaluate the genome coverage, we blasted the sequenced results against the Wm82.a2.v1 Transcript Sequences database. Because soybean is a tetraploid plant, most of our sequencing data could match more than one gene locus. We retrieved gene matches with more than 95% identities for further analyses. All of the chromosomes had similar cDNA hit numbers (Figure 2B), which suggested that our constructed soybean full-length overexpressed cDNA library had good coverage. Later, we evaluated the full-length rate of our constructed cDNA library. Among the sequenced data, approximately 98% included full-length sequences, indicating that the majority of the inserts in the cDNA library were intact (Figure 2C). We also estimated the average insert size of our constructed cDNA library, which was approximately 2.2 kb (Table 1). Based on the above results, we concluded that the constructed full-length soybean cDNA library had high quality with high genome coverage, low repetitive frequency and high full-length ratio, and it could be used for further study.
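The redundancy and chromosome-coverage summaries described above can be derived from a list of best-hit gene identifiers, one per sequenced clone. The sketch below illustrates one way to do so. It assumes that a clone counts as repetitive when its best hit has already been seen, and that the chromosome is encoded in the two digits after "Glyma."; the exact definitions used for Figure 2 are not stated in the text. The input list is toy data that reuses gene identifiers mentioned in this paper.

```python
import re
from collections import Counter

# Gene models such as Glyma.06g119500 carry the chromosome number after "Glyma.".
GLYMA_ID = re.compile(r"^Glyma\.(\d{2})g\d+$", re.IGNORECASE)


def library_stats(best_hits):
    """Summarize redundancy and per-chromosome hit counts for a list of best-hit gene IDs."""
    total = len(best_hits)
    if total == 0:
        raise ValueError("no clones supplied")
    unique_genes = set(best_hits)
    # Assumption: every clone beyond the first for a given gene is counted as repetitive.
    repetitive = total - len(unique_genes)
    per_chromosome = Counter()
    for gene in best_hits:
        match = GLYMA_ID.match(gene)
        if match:
            per_chromosome[int(match.group(1))] += 1
    return {
        "total": total,
        "repetitive_rate": repetitive / total,
        "per_chromosome": dict(per_chromosome),
    }


# Toy input: five clones, one of which repeats an earlier gene.
toy_hits = [
    "Glyma.06g119500",
    "Glyma.06g062700",
    "Glyma.11g199100",
    "Glyma.05g231900",
    "Glyma.06g119500",
]
stats = library_stats(toy_hits)
print(stats)
```

For a real run, the input would be the best hits that pass the identity threshold (more than 95% in this study), and chromosomes with no entry in the result would indicate gaps in coverage.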

Transformation and Evaluation of Transgenic Arabidopsis
We extracted the plasmids from the cDNA library and transformed them into Agrobacterium cells. Next, the library was transformed into Arabidopsis through the Agrobacterium tumefaciens-mediated transformation method with a slight modification (see section "Materials and methods") (Clough and Bent, 1998). To obtain T1 positive transgenic lines overexpressing the soybean full-length cDNA library, T0 generation seeds were harvested and sown in a greenhouse, and Basta (1:1000) was sprayed for selection. In total, we obtained nearly 2000 T1 positive transgenic lines (Table 2). To observe phenotypes in the T1 transgenic lines, empty vector pJL12 (modified)-transformed lines were sown in parallel with positive cDNA-transformed T1 seedlings. Approximately 5% of the T1 positive seedlings exhibited abnormal phenotypes, including late flowering and yellowish leaves (Table 2), whereas another 1% exhibited favorable traits, including large seeds, high silique density, strong drought resistance and vigorous growth appearance (Figure 3 and Table 2).
To inspect the distribution of cDNA inserts in the soybean genome, we extracted DNA from 96 randomly selected individual T1 seedlings and amplified their cDNA inserts using the primers pJL12-f and pJL12-r. Most of the 96 transgenic plants had successfully amplified products (Supplementary File 5A). Then, the amplified products were sequenced and blasted against the Wm82.a2.v1 Transcript Sequences database. The sequencing data showed that 97% of the cDNA inserts were non-repetitive, and all of the sequenced data were full-length (Supplementary File 5B). To examine the coverage of the cDNA inserts, we analyzed the distribution of sequencing data in the soybean genome. The results showed that all 20 soybean chromosomes had good match numbers (Supplementary File 5C).

Validation of Phenotypes of T1 Transgenic Arabidopsis by Retransformation of Corresponding Genes
To confirm whether the phenotypes of transgenic Arabidopsis were caused by the ectopic expression of soybean genes, we randomly selected 4 lines with obvious phenotypes for further study.
A13 homozygous transgenic plants displayed a yellowish phenotype (Figures 4A,B). The sequence analysis of the cDNA fragments showed that the A13 yellowish phenotype was caused by Glyma.06g119500, a homolog of Arabidopsis SGR1 (Shimoda et al., 2016). We reconstructed the overexpression vector for Glyma.06g119500 and transformed it into Arabidopsis. Phenotypic analysis showed that all homozygous transgenic lines exhibited a yellowing leaf phenotype (Figures 4A,B) consistent with the one from the transgenic line overexpressing the cDNA library.
B12 homozygous transgenic plants displayed a dwarfing and highly branched phenotype (Figures 4C,D). Sequence analysis showed that it might be caused by a predicted ANDROGEN INDUCED INHIBITOR OF PROLIFERATION (AS3) / PDS5-RELATED gene, Glyma.06g062700. To validate whether the B12 phenotype was caused by overexpression of Glyma.06g062700, we retransformed Glyma.06g062700 into Arabidopsis. Among 26 homozygous lines, 19 displayed obvious dwarfing and highly branched phenotypes (Figures 4C,D), suggesting that the B12 phenotype was due to overexpression of Glyma.06g062700.
C15 T1 transgenic plants were thin, with small stature and early flowering (Figures 4E,F). Sequence analysis revealed that the inserted gene was Glyma.11g199100, a nucleic acid binding protein with ATP-dependent RNA helicase activity. Retransformed Arabidopsis lines with overexpression of Glyma.11g199100 also showed the same thin, small and early-flowering phenotype (Figures 4E,F), demonstrating that the C15 phenotype was caused by overexpression of Glyma.11g199100.
The D70 T1 transgenic line displayed larger seeds (Figures 4G,H). Sequence analysis showed that the D70 insertion gene was Glyma.05g231900, which encoded an aldehyde dehydrogenase. To validate whether the large seeds in D70 were caused by Glyma.05g231900, we retransformed the overexpression vectors of Glyma.05g231900 into Arabidopsis. The results showed that 19 out of 24 homozygous lines showed larger seed sizes than those of the controls (Figures 4G,H), demonstrating that the aldehyde dehydrogenase Glyma.05g231900 exerts a specific influence on seed size.
Taking the above results together, we concluded that the phenotypes found in the T1 generation plants were most likely caused by the overexpression of the genes from the soybean full-length cDNA overexpression library.
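Because the retransformation results are reported as counts of lines (19 of 26 for Glyma.06g062700 and 19 of 24 for Glyma.05g231900), the uncertainty around these proportions can be illustrated with a confidence interval. The sketch below uses the Wilson score interval at the 95% level; this analysis is an illustrative addition and was not part of the original study.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Counts reported in the text for the retransformed lines.
b12_low, b12_high = wilson_interval(19, 26)
d70_low, d70_high = wilson_interval(19, 24)

print(f"B12 (19/26): {b12_low:.2f} to {b12_high:.2f}")
print(f"D70 (19/24): {d70_low:.2f} to {d70_high:.2f}")
```

Both intervals lie above one half, which is in line with the conclusion that most independent retransformed lines reproduce the original phenotype.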

DISCUSSION
Soybean is an important crop that provides a well-balanced source of protein and oil. In addition, most of the components of soybean, such as α-linolenic acid and isoflavones, have beneficial health effects (Xia et al., 2013). However, studies of the functional genomics of soybean have lagged far behind those of other crops because of the relatively low efficiency and long transformation period and the relative complexity of the genome (Schmutz et al., 2010). Although some genes have been cloned (Lee et al., 2004; Gao et al., 2005; Gao and Bhattacharyya, 2008; Cook et al., 2012; Liu et al., 2012; Lu et al., 2017; Ma et al., 2017), the functions of many soybean genes remain unknown. As the demand for soybean is increasing, how to improve soybean yield and edible quality is an urgent challenge confronted by soybean breeders.

Rationality and Advantage of Our Strategy for Studies of Soybean Functional Genomics
Currently, when we mine favorable genes in soybean through the method of reverse genetics, we usually select dozens of candidate genes from transcriptome data or known homologous genes, and then investigate their functions one by one by use of Arabidopsis as the model plant. Subsequently, we confirm their functions in soybean. This process requires large amounts of time and labor, and the positive efficiency is relatively low, which severely influences the mining of favorable genes. On the other hand, soybean is an ancient tetraploid plant, and most of the soybean genes have duplicated copies, meaning that loss-of-function mutations in soybean genes usually have no apparent phenotypes. In addition, the successes of QTL cloning in soybean have been few to date, largely due to the complexity of the soybean genome (Kato et al., 2014; Lee et al., 2015). Due to the above, elucidation of the functions of soybean genes and, further, the mining of favorable genes have been much slower than those in other crops. Developing a large-scale and efficient strategy for studies of soybean functional genomics is an urgent need for soybean researchers.
Arabidopsis has diverse genetic resources and a large number of mutants created using various mutagenesis strategies. Moreover, Arabidopsis has a small genome, short life cycle, and easy and effective transformation, making Arabidopsis a useful model plant (Koornneef and Meinke, 2010; Liepman et al., 2010). To date, many gene functions in Arabidopsis have been dissected and provided important references for other plant species. Both soybean and Arabidopsis are dicotyledonous plants and share some similarities. These similarities provide a substantial potential for the use of Arabidopsis to carry out studies of soybean functional genomics. By using Arabidopsis as a model plant, the functions of many soybean genes have already been successfully elucidated (Li et al., 2016; Takeshima et al., 2016; Chu et al., 2017). For example, GmNAC20 and GmNAC11 are two NAC transcription factors in soybean. When GmNAC20 was overexpressed in Arabidopsis, it greatly enhanced salt and freezing tolerance, whereas overexpression of GmNAC11 improved only salt tolerance in transgenic Arabidopsis. The roles of these two NAC family transcription factors were also confirmed in soybean (Hao et al., 2011).
To mine gene functions at a larger scale, we constructed a full-length soybean cDNA library with a modified overexpression vector, pJL12, which contains a 35S promoter (Figure 1A). We assessed the resulting library and found that it had high quality (Figure 2). After transformation into Arabidopsis, we first obtained ∼2000 transgenic lines (Table 2), among which 5% of transgenic plants showed abnormal developmental phenotypes, whereas 1% of them showed potentially beneficial traits (Figure 3 and Table 2). Sequence analyses showed that all overexpressed soybean genes were different rather than redundant, suggesting the effectiveness of our strategy. We noticed that some soybean genes could be missed, including genes that exhibit phenotypic changes only in the loss-of-function condition and genes that have different functions in soybean from those in Arabidopsis. Moreover, some phenotypes exhibited by the transgenic Arabidopsis lines might be attributed to the insertion of vectors in the Arabidopsis genome rather than the overexpression of soybean genes. In addition, ectopic overexpression of a gene might not reflect its intrinsic function, which could be solved by studying the corresponding loss-of-function mutant or using the native promoter. Despite these imperfections, our results prove that this strategy is reasonable and advantageous for use in investigating the functions of soybean genes and further mining favorable soybean genes for future molecular improvements.

Dissection of Four Potentially Favorable Soybean Genes
In this study, we transformed the soybean full-length cDNA overexpression library into Arabidopsis. Many T1 generation positive seedlings exhibited various phenotypes (Table 2). The occurrences of phenotypes in these transgenic lines might be attributed to the overexpression of the soybean genes or to the destruction of Arabidopsis gene expression by T-DNA insertion in the Arabidopsis genome.
To confirm that the phenotypes occurring in the T1 positive seedlings were caused by cDNA ectopic overexpression, we randomly selected 4 lines for confirmation. Genetic retransformation proved that the overexpressed soybean cDNAs were responsible for each of the observed phenotypes. A13 encodes an Arabidopsis SGR1 homolog protein. In Arabidopsis, SGR1 interacts with Chl catabolic enzymes (CCE) to regulate chlorophyll metabolism, leading to a pale and yellowing leaf pattern during leaf senescence (Balazadeh, 2014; Sakuraba et al., 2014a,b, 2015; Bell et al., 2015). Furthermore, SGR1 regulates chlorophyll degradation in other plant species, such as rice (Park et al., 2007), pea (Pisum sativum) (Sato et al., 2007), tomato (Solanum lycopersicum), bell pepper (Capsicum annuum) (Barry et al., 2008), tall fescue (Festuca arundinacea) and Medicago truncatula (Zhou et al., 2011). In this study, it also caused a yellowish appearance, indicating a conserved role of SGR1. B12 encodes an ANDROGEN INDUCED INHIBITOR OF PROLIFERATION (AS3) / PDS5-RELATED protein, which may inhibit cell proliferation, consistent with the dwarfing and highly branched phenotype of B12 (Yuan et al., 1993; Szelei et al., 1997). C15 encodes an ATP-dependent RNA helicase. RNA helicase plays diverse roles in plant growth and development (Linder and Owttrim, 2009), which may be a plausible reason for the small stature and early-flowering phenotype of C15. D70 encodes an aldehyde dehydrogenase (Brocker et al., 2013), which may affect the metabolism of energy in seeds. All of these genes are worthy of further study to elucidate their detailed functional mechanisms.

CONCLUSION
In this study, we used Arabidopsis as a recipient and developed a high-throughput strategy to investigate the functions of soybean genes by overexpressing a full-length soybean cDNA library in Arabidopsis. Many transgenic Arabidopsis lines exhibited abnormal phenotypes, among which some were found to contain potentially favorable genes for the molecular improvement of soybean. Although the functions of these genes must be further confirmed in soybean, we believe that this strategy may provide valuable sources of soybean genetic improvement.

AUTHOR CONTRIBUTIONS
XL performed the majority of the experimental work. LH helped to construct the cDNA library. JL conducted part of the transformation work. YC constructed the overexpression vector. QY, LW, XS, and XZ gave valuable advice. XL and YJ wrote and revised the paper. YJ conceived the project and supervised the work.