Shallow Whole Genome Sequencing for the Assembly of Complete Chloroplast Genome Sequence of Arachis hypogaea L.

The chloroplast (CP) is a plant organelle originated from cyanobacteria through symbiosis and had become an important component of the plant cell. It is the reaction center for the photosynthesis and also for several steps in the biosynthetic pathways of fatty acids, vitamins, pigments and amino acids. The CP genome is highly conserved in land plants (Raubeson and Jansen, 2005). The CP genome is circular and exhibits a quadripartite genome structure consisting of a large single copy region (LSC) and a small single copy region (SSC), separated by a pair of inverted repeats (IRs) with a few exceptions where loss of an IR or the SSC was observed. The size of the CP genome varies from 19 to 217 Kb in land plants, and the IRs are usually 20–26 kb in length (http://www.ncbi.nlm.nih.gov/genome/organelle/). Lack of recombination makes the CP genome an ideal target for phylogenetic studies (Ravi et al., 2008; Wu and Ge, 2012). 
 
Arachis hypogaea L. also known as groundnut is an herbaceous plant belonging to the Fabaceae family. It has an allotetraploid genome (AABB; 2n = 4x = 40) with a size of about 2.8 Gb. There have been many speculations regarding the ancestors of A and B subgenomes of A. hypogaea and proved to have originated through a hybridization event between Arachis ipaensis L. (B subgenome) and Arachis duranensis L. (A subgenome) (Kochert et al., 1996; David et al., 2016). It is one of the major edible oilseed crops in the world, and India is the second largest producer accounting for about 15% of the world production (FAOSTAT, 2015). Kernels of A. hypogaea L. contains 43–50% oil and 23–26% proteins. The oil comprises majorly of palmitic acid (16:0), stearic acid (18:0), oleic acid (18:1), linoleic acid (18:2), arachidic acid (20:0), eicosenoic acid (20:1), behemic acid (22:0), and lignoseric acid (24:0) along with trace amounts of palmitoleic acid (16:1). The mono and poly-unsaturated fatty acids, oleic acid and linoleic acid constitute about 75% of the total oil content (Shiv, 1982). Many attempts have successfully been made to improve the crop yield, drought resistance, disease resistance and other characteristics of A. hypogaea L. using classical breeding as well as genetic engineering using nuclear transformation. Chloroplast transformation by homologous recombination for producing transgenic plants is also possible due to the presence of candidate loci on the CP genome. Additionally, Genetic engineering of chloroplast genome when compared to nuclear transformation is environment-friendly; it minimizes the pleiotropic effects along with containment of the foreign genes (Daniell et al., 2005). Hence, the availability of the complete chloroplast genome of A. hypogaea L. will be an invaluable resource for designing and evaluating efficient chloroplast transformation experiments.


INTRODUCTION
The chloroplast (CP) is a plant organelle originated from cyanobacteria through symbiosis and had become an important component of the plant cell. It is the reaction center for the photosynthesis and also for several steps in the biosynthetic pathways of fatty acids, vitamins, pigments and amino acids. The CP genome is highly conserved in land plants (Raubeson and Jansen, 2005). The CP genome is circular and exhibits a quadripartite genome structure consisting of a large single copy region (LSC) and a small single copy region (SSC), separated by a pair of inverted repeats (IRs) with a few exceptions where loss of an IR or the SSC was observed. The size of the CP genome varies from 19 to 217 Kb in land plants, and the IRs are usually 20-26 kb in length (http://www.ncbi.nlm.nih.gov/genome/organelle/). Lack of recombination makes the CP genome an ideal target for phylogenetic studies (Ravi et al., 2008;Wu and Ge, 2012).
Arachis hypogaea L. also known as groundnut is an herbaceous plant belonging to the Fabaceae family. It has an allotetraploid genome (AABB; 2n = 4x = 40) with a size of about 2.8 Gb. There have been many speculations regarding the ancestors of A and B subgenomes of A. hypogaea and proved to have originated through a hybridization event between Arachis ipaensis L. (B subgenome) and Arachis duranensis L. (A subgenome) (Kochert et al., 1996;David et al., 2016). It is one of the major edible oilseed crops in the world, and India is the second largest producer accounting for about 15% of the world production (FAOSTAT, 2015). Kernels of A. hypogaea L. contains 43-50% oil and 23-26% proteins. The oil comprises majorly of palmitic acid (16:0), stearic acid (18:0), oleic acid (18:1), linoleic acid (18:2), arachidic acid (20:0), eicosenoic acid (20:1), behemic acid (22:0), and lignoseric acid (24:0) along with trace amounts of palmitoleic acid (16:1). The mono and poly-unsaturated fatty acids, oleic acid and linoleic acid constitute about 75% of the total oil content (Shiv, 1982). Many attempts have successfully been made to improve the crop yield, drought resistance, disease resistance and other characteristics of A. hypogaea L. using classical breeding as well as genetic engineering using nuclear transformation. Chloroplast transformation by homologous recombination for producing transgenic plants is also possible due to the presence of candidate loci on the CP genome. Additionally, Genetic engineering of chloroplast genome when compared to nuclear transformation is environment-friendly; it minimizes the pleiotropic effects along with containment of the foreign genes (Daniell et al., 2005). Hence, the availability of the complete chloroplast genome of A. hypogaea L. will be an invaluable resource for designing and evaluating efficient chloroplast transformation experiments.

Plant Material and Genome Sequencing
The seeds of A. hypogaea L. Co7 variety were obtained from Tamilnadu Agricultural University, Coimbatore, India. The plants were grown in the green house facility at SRM University, Kattankulathur, India. Leaves from 1-month old plant were used for total genomic DNA isolation using DNeasy Plant Mini Kit (Qiagen, Germany). A paired-end library with an average insert size of about 400 bp was constructed as per the manufacturer's protocol (Illumina Inc., USA). The library quality was assessed on CaliperLabChip GX using High Sensitivity Assay Kit (Caliber, USA). It was then hybridized on a flow cell for generating clonal clusters on cBOT using Truseq PE Cluster Kit v3-cBot-HS (Illumina Inc., USA). Sequencing by synthesis was performed on Illumina Hiseq 2500 using Truseq v3-HS kit to generate 100 bp paired end reads (Illumina Inc., USA).

Genome Assembly and Validation
The per base quality of the raw paired-end reads (51,650,486) of 100 bp was assessed by FastQC v0.11.2 (Andrews, 2010). The adapter trimming and quality filtering was done using Cutadapt v1.7.1 (Martin, 2011) and Sickle v1.33 (Joshi and Fass, 2011) tools respectively. A phred score of 20 was used for quality filtering. The quality filtered paired-end reads (49,299,308) were subjected to de novo assembly using three different de novo assemblers such as Velvet v1.2.10 (Zerbino and Birney, 2008), SOAPdenovo v2.04 (Luo et al., 2012) and Edena v3.131028 (Hernandez et al., 2008). The assembled contigs were pooled and ordered against the complete CP genome of closest relative Acacia ligulata L. as the reference using Mauve v2.3.1 tool (Darling et al., 2010;Williams et al., 2015). The gaps in the genome were filled by manual alignment of paired-end reads using overlapping method (Natarajan and Parani, 2015) and primer walking (Sanger sequencing method). Validation of the junctions between the single copy regions and the inverted repeats was done by Sanger sequencing using specific primers. The filtered reads were mapped against the assembled CP genome of A. hypogaea L. to calculate the genome coverage. The complete CP genome of A. hypogaea L. was annotated using DOGMA (Wyman et al., 2004).

RESULTS AND DISCUSSION
The size of the complete CP genome of A. hypogaea L. was found to be 156,391 bp. The genome coverage was calculated to be 2122x with 3,863,475 quality filtered reads mapped to the assembled CP genome. The CP genome exhibited a quadripartite structure consisting of LSC and SSC regions of 85,946 bp and 18,797 bp respectively, with a pair of inverted repeats (IRa and IRb) of 25,824 bp each separating them. The overall GC content of the complete chloroplast genome was 36.4% and the individual GC content for LSC, SSC, and IRs was 33.8%, 30.2%, and 42.8% respectively. A total of 110 genes were annotated including 76 protein coding genes, 30 tRNA genes, and 4 rRNA genes. Six of the protein coding genes and the 3' exon of rps12 are duplicated in the IR regions. Six of the tRNA genes and four of the rRNA genes are also duplicated in the IR regions. The presence of one or two introns were identified in the 13 genes, which includes 8 protein coding genes and 5 tRNAgenes ( Table 1). The complete CP genome sequence of A. hypogaea that is reported here for the first time will be an invaluable resource for designing and evaluating efficient chloroplast transformation experiments and to improve the desired traits.

DEPOSITED DATA AND INFORMATION TO THE USER
The complete data from the current study was submitted at NCBI under the BioProject ID PRJNA314013 and BioSample ID SAMN04527043. The assembled complete chloroplast genome sequence was submitted to NCBI Genbank with an accession number KX257487 (http://www.ncbi.nlm.nih.gov/nuccore/KX257487).
The raw reads in compressed FASTQ were submitted to SRA database at NCBI under the accession number SRP076091 (http://www.ncbi.nlm.nih.gov/sra/SRP076091).

ACKNOWLEDGMENTS
This project was supported by Department of Biotechnology (DBT), Government of India (BT/PR6394/GBD/27/422/2012). The High Performance Cluster Computing Facility at SRM University was used for the genome assembly and analysis.