A Case Study into Microbial Genome Assembly Gap Sequences and Finishing Strategies

This study characterized regions of DNA which remained unassembled by either PacBio and Illumina sequencing technologies for seven bacterial genomes. Two genomes were manually finished using bioinformatics and PCR/Sanger sequencing approaches and regions not assembled by automated software were analyzed. Gaps present within Illumina assemblies mostly correspond to repetitive DNA regions such as multiple rRNA operon sequences. PacBio gap sequences were evaluated for several properties such as GC content, read coverage, gap length, ability to form strong secondary structures, and corresponding annotations. Our hypothesis that strong secondary DNA structures blocked DNA polymerases and contributed to gap sequences was not accepted. PacBio assemblies had few limitations overall and gaps were explained as cumulative effect of lower than average sequence coverage and repetitive sequences at contig termini. An important aspect of the present study is the compilation of biological features that interfered with assembly and included active transposons, multiple plasmid sequences, phage DNA integration, and large sequence duplication. Our targeted genome finishing approach and systematic evaluation of the unassembled DNA will be useful for others looking to close, finish, and polish microbial genome sequences.

The PCR/Sanger sequencing based validations were particularly challenging when there was no bioinformatics evidence available for the contig extension/overlap. In such case (e.g. finishing for B. cellulosolvens DSM 2933), several experimental modifications were applied to standard PCR which includes: a. Use of strand-displacing Pfu DNA polymerase for uncoiling of double-stranded DNA b. Use of nucleotide analogue 7-deaza-2′-dGTP instead of standard guanine nucleotide which forms comparatively weaker bond with cytosine nucleotide and helps to break strong DNA hairpin structures 1 . c. Determination of appropriate PCR conditions through multiple experimental parameter optimizations (primer concentrations, annealing temperatures, and extended time for DNA denaturation) as described previously 1 .

Manual finishing of C. thermocellum AD2
The best automatic assembly for strain AD2 using PacBio data contained 10 contigs. Mapping of the draft assemblies (generated by SPAdes and ABySS) to the 10 contig reference sequence derived extension to the 7 contigs. Super-assembly of all 10 contigs (including 7 extended contigs) generated 4 super-contigs. The longest super-contig (AD2_SC1) was of size 2.06 Mb and derived from the assembly of three extended draft contigs ( Figure S1). Validation of two putative overlaps (AD2_overlap1 and AD2_Overlap2) between three extended draft contigs and accuracy of the resulting consensus sequence (AD2_SC1) was performed by PCR and Sanger sequencing approach ( Figure S1).

Figure S1
-AD2 genome finishing overview. Top row shows the 3.5 Mb consensus derived from the super assembly of two longest contigs AD2_SC1 and AD2_HC1 with 780 kb overlap. The AD2_SC1 contig was derived from super assembly of three contigs with extended ends (highlighted in green). The super assembly of AD2_SC1 was validated by PCR and Sanger sequencing of contigs overlaps AD2_overlap1 and AD2_overlap2. Zoomed view shows the corresponding annotations, and alignment of Sanger reads.
During manual inspection of mapping results (Illumina draft contigs mapped to PacBio assembly), a long 2.27 Mb contig (AD2_HC1) was identified in the SPAdes hybrid assembly. The two longest contigs (AD2_SC1 and AD2_HC1) represented two opposite ends of the genome and super-assembly derived a 3.5 Mb consensus sequence with ~780 Kb overlapping sequence ( Figure S1). The 780 Kb overlapping region contained only a handful of base-pair mismatches which were manually corrected by mapping of the Illumina reads. The expected genome size for the AD2 genome was in the ~3.5 Mb range. However, dot plot and other tests for genome circularity could not confirm the circular nature of the consensus sequence. Therefore, additional PCR amplification reactions were performed to extend the ends of the 3.5 Mb consensus sequence. Sanger sequencing of a ~1 kb purified PCR product revealed the sequence for this gap region (AD2_Gap1) comprising total 1,069 bp. After gap-closure, a 3,554,860 bp circular genome sequence for strain AD2 was obtained. Annotation of the 1,069 bp gap sequence corresponded to gene models for repetitive transposon DNA -"transposon mutator type CDS". Sequences derived for (AD2_overlap1, AD2_Overlap2 and AD2_Gap1) constituted the gaps present within the PacBio assemblies.

Manual finishing of B. cellulosolvens DSM 2933
The best assembly for B. cellulosolvens DSM 2933 using PacBio data contained 3 contigs. Scaffolding through AHA protocol 2 could not improve assembly any further. Mapping of the draft or hybrid contigs to a three contig reference could not extend any contig ends. However, super-assembly of 3 contigs detected a 6.7 kb overlap (BC_overlap1) between contigs BC_C1 and BC_C3. Validation of contig overlap and consensus sequence was performed through PCR and Sanger sequencing ( Figure S2). The remaining contig (BC_C2) could not be joined together using bioinformatics approaches and indicated presence of an unknown gap (BC_Gap1). Oligonucleotide primers were designed on either ends of the two contigs followed by the multiple rounds of PCR with various primer combinations. A successful PCR amplification and separation of PCR products on agarose gel obtained a clean band at ~400 bp location. Sanger sequencing uncovered the 406 bp sequence for the BC_Gap1 (Figure S3). In this study, the BC_Gap1 region was the most difficult region to resolve by PCR/Sanger approach and perhaps for the multiple reasons. First of all, there was no bioinformatics based evidence available in terms of overlapping contigs, the order and orientation of adjoining contigs was unknown, and other genomes from the same lineage had low (below 90%) average nucleotide identity making predictions about contig order difficult. The closure of BC_Gap1 permitted a single contig assembly for the strain DSM 2933, however; further finishing approaches to obtain the circular chromosome were unsuccessful and genome was deposited at Genbank with near-finished status. Two sequences derived through manual finishing (BC_overlap1 and BC_Gap1) constituted the gaps present within the PacBio assemblies.
Note: In current manual finishing protocol, we have extensively used commercial licensed software Geneious (Biomatters, Auckland, New Zealand) 3 which was convenient to use. However, all the functionalities and analysis can be also performed through various open source tools may be without graphical user interface. Mapping reads to reference can be achieved using bowtie2 4 , mapping of contigs to reference can be performed through mauve plugin 5 , assembly of contigs can be performed through CAR 6 or other assemblers and pairwise alignment of sequences through MUSCLE aligner 7 . Read quality assessment and trimming can be performed through FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tools. Therefore, the availability of the commercial tool does not limit application of current protocol.
1. Differences between protein coding potential of any two assemblies Reciprocal BLASTP 8 analysis was performed using proteins predicted from the draft and the finished genome assemblies to gain insights into potential protein encoding differences. The blast hits across two genomes were analyzed using custom script to identify protein sequences which are same across two genomes, longer proteins, and shorter protein. Generally, the finished genome assembly was used as the reference to identify same, shorter, or longer proteins.
2. Estimation of positive/negative influence of Pilon change The BLASTN analysis of predicted genes before and after Pilon change was performed against the NCBI non-redundant database. The effect of Pilon call on BLASTN results was determined in terms of e-value, or percent similarity, or percent identity or subject hit length. The Pilon corrections which improved e-value cutoffs, percent similarity or percent identity scores and increased subject hit lengths are considered as having positive influence on gene-calling accuracy. On the other hand, Pilon corrections which diminished the significance of BLASTN results (e-value cutoffs, percent similarity or percent identity scores or decreased subject hit lengths) are considered to have negative influence on genecalling accuracy. Positive and negative influence of Pilon correction is indicated by the "Yes" and "No" values, respectively in supplementary table S11-S14. Pilon --genome genome.fasta \ --frags frags.bam \ --changes