Last-gen nostalgia: a lighthearted rant and reflection on genome sequencing culture

I sometimes see them in my dreams. The colorful peaks and troughs, the sharp, crisp waves spread across my computer screen, the rolling nitrogenous mountains, each with its own nucleotide sitting solidly on the summit. I'm talking about electropherograms, of course. Remember them? Those beautiful but oh so “old-gen” bioinformatics data generated from automated Sanger sequencing machines, such as the Applied Biosystems 370, the geriatric of genome sequencers. Don't laugh. It was these capillary-based electrophoretic technologies that gave us the draft human genome sequence (Lander et al., 2001) and the genome maps of many other model organisms, from the bacterium Haemophilus influenzae to the yeast Saccharomyces cerevisiae to the multicellular green alga Volvox carteri (Fleischmann et al., 1995; Goffeau et al., 1996; Prochnik et al., 2010).

As a grad student, I spent countless hours pruning, editing, assembling, and occasionally oohing and aahing over Sanger sequences (Sanger et al., 1977; Smith et al., 1986; Prober et al., 1987). These 800-nucleotide genetic snippets intrigued, inspired, and motivated me. They contained just enough data to pique my interest (a novel exon, strange repeat, or foreign gene) and always left me craving a bit more: one additional sequencing read to extend that PCR product, find that stop codon, or join those lonely contigs. Usually, it would take weeks or months to get that extra read, and when it arrived I would savor the experience, exploring and analyzing it like a new book from a favorite author. After I devoured the data, I would say to myself, "If only I could get my hands on a great number of sequencing reads from my organism of interest, then all of my genomic woes would be over." Naively, I believed that the more sequencing data I had, the more productive I would be. Be careful what you wish for from the genome gods. The onslaught of next-generation sequencing (NGS) technologies (Metzker, 2010; Koboldt et al., 2013) and the access to previously unfathomable amounts of genomic data have made me dizzy, disillusioned, and anything but efficient.
Like the proverbial boiling frog, my mind is gradually overheating from an accumulation of NGS reads (Liu et al., 2012). It's a paired-end nightmare, a SOLiD pain in the neck, and a massively parallel migraine. All this HiSeq and MiSeq is clogging up my internal drive and external disks. I've taken vacations and returned home only to find that my Illumina reads still haven't finished downloading. I can't move or back up a FASTQ file without needing a coffee break. Last month it got so bad that I tried calling 911 on my 454. I'm certain that I would have had two Nature papers by now if it weren't for that pestering computer cursor that keeps spinning around and around, reminding me of my small memory and pitiful processing power.
With all this NGS information, what have I gained (apart from being a chronic user of SEQanswers.com)? Well, I'm a co-investigator of half a dozen highly fragmented nuclear genome assemblies for various green algae, with no genome papers anywhere in sight. And don't get me started on the number of transcriptome projects waiting to be written up. What's worse is that I'm still sending more samples for sequencing. It's become my default setting: when in doubt, sequence. If a colleague drops by my office and says, "Smitty, you interested in milkweeds?" my first response is, "You betcha. Let's send some for sequencing." Student asks: "Professor Smith, do you have any ideas for my honors thesis?" "Hmmm," I say, "how about we sequence another green alga." Grant money left over, what do I do? You guessed it: two-for-one RNA-seq at the campus sequencing facility. And if the data come back contaminated or the quality is poor? Easy, I sequence more! It's gotten to the point where I should begin my conference presentations with, "Hello, my name is David and I'm an NGS addict."
There are some positives to being NGS obsessed. I'm constantly testing and learning the newest bioinformatics software and genome assembly programs. I know all of the hippest genome slang and genetic acronyms. I have learned more than I ever wanted to about Linux, Unix, and Perl, although, as my students regularly point out, I'm still a hack in all three of those areas. I love that I can go to the Sequence Read Archive at the National Center for Biotechnology Information (Leinonen et al., 2011) (I visit the site incessantly) and in seconds access endless amounts of raw genomic and transcriptomic data from some of the coolest and most bizarre species on earth, and then use these data to mine genes for phylogenetic and other comparative analyses. I'm also an organelle genome junkie, and NGS techniques have made it quick and easy for me to sequence or data mine complete mitochondrial and plastid DNAs from a diversity of interesting taxa throughout the eukaryotic tree of life (Smith, 2012).
Sequencing nuclear DNAs has been a different story. Even with huge datasets, state-of-the-art assembly programs, and intricate annotation pipelines, I'm incapable of producing decent nuclear genome assemblies. It doesn't help that the species I choose to investigate are poorly studied and poorly sequenced. For researchers investigating organisms for which high-quality nuclear genome assemblies already exist (i.e., assemblies based on Sanger sequencing), the payoffs of NGS have been great (Koboldt et al., 2013). Perhaps as sequencing technologies improve, personal computing power increases, and bioinformatics software becomes more user friendly, it will soon be easier for small labs to assemble publication-quality nuclear genomes of non-model taxa. For now, however, the promises of NGS have, at least for me, not lived up to their hype and often resulted in disappointment, frustration, and a loss of perspective.
Don't get me wrong, NGS has revolutionized, accelerated, and, in many ways, simplified scientific research. Moreover, new (and soon to come) long-read technologies will alleviate many of the current limitations of NGS (English et al., 2012), such as the absence of a reference genome map. But no matter how long sequencing reads get, NGS will probably never be the panacea of genetics that some claim it to be (Koboldt et al., 2013).
I was taught to approach research with specific hypotheses and questions in mind. In the good ol' Sanger days it was questions that drove me toward the sequencing data. But now it's the NGS data that drive my questions. I recently sequenced the transcriptome of a saltwater Chlamydomonas alga and have been knocking my head against the laboratory door asking, "What is the best way to market, package, and publish these data?" I'm trapped in a cycle where hypothesis testing is a postscript to senseless sequencing (Smith, 2013).
As we move toward a world with infinite amounts of nucleotide sequence information, beyond bench-top sequencers and hundred-dollar genomes, let's take a moment to remember a simpler time, when staring at a string of nucleotides on a screen was special, worthy of celebration, and something to give us pause. When too much data were the least of our worries, and too little was what kept us creative. When the goal was not to amass but to understand genetic data.
I have a colleague on the inside who works at a big genome-sequencing centre in California. We had lunch recently and during one of my rants he stopped me and said, "Dave, take it easy, we still got them, a whole factory floor of AB3730xl Sanger sequencers!" Later that month, for old times' sake, I sent him a few PCR products, which were kicking around the lab, and, sure enough, 2 weeks later three electropherograms arrived in my inbox, like long-lost friends. Anyway, for all those Sanger sequencing geeks out there, caught in a next-gen maze of short reads and long headaches, this one's for you.

ACKNOWLEDGMENTS
David Roy Smith is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada. He can be found online at www.arrogantgenome.com.