# INTRODUCTION TO SINGLE CELL OMICS

EDITED BY: Xinghua Pan, Shixiu Wu and Sherman M. Weissman

PUBLISHED IN: Frontiers in Cell and Developmental Biology, Frontiers in Genetics and Frontiers in Bioengineering and Biotechnology

ISSN 1664-8714 ISBN 978-2-88945-920-9 DOI 10.3389/978-2-88945-920-9


# INTRODUCTION TO SINGLE CELL OMICS

Topic Editors:

Xinghua Pan, Southern Medical University, Guangdong Provincial Key Lab of Single Cell Technology and Application, China; Yale University School of Medicine, United States

Shixiu Wu, Hangzhou Cancer Hospital, China

Sherman M. Weissman, Yale University School of Medicine, United States


Single-cell omics is an advancing frontier that stems from the sequencing of the human genome and the development of omics technologies, particularly genomics, transcriptomics, epigenomics and proteomics, with sensitivity now improved to the single-cell level. The new generation of methodologies, especially next generation sequencing (NGS) technology, plays a leading role in genomics-related fields; however, conventional omics techniques require large numbers of cells, usually on the order of millions, which are hard to obtain in some cases. More importantly, harnessing the power of omics technologies and applying them at the single-cell level is crucial, since every cell is specific and unique, and almost every cell population in every system, derived either in vivo or in vitro, is heterogeneous. Deciphering the heterogeneity of the cell population hence becomes critical for recognizing the mechanism and significance of the system. Without an extensive examination of individual cells, however, a bulk analysis of a cell population gives only an average output of the cells and neglects the differences among them.

Single-cell omics seeks to study a number of individual cells in parallel for the different dimensions of their molecular profile on a genome-wide scale, providing unprecedented resolution for the interpretation of both the structure and function of an organ, tissue or other system, as well as the interaction (and communication) and dynamics of single cells or subpopulations of cells and their lineages. Importantly, single-cell omics enables the identification of a minor subpopulation of cells that may play a more critical role in a biological process than the dominant subpopulation, for example in a cancer or a developing organ. It provides an ultra-sensitive tool to clarify specific molecular mechanisms and pathways and to reveal the nature of cell heterogeneity. It also empowers clinical investigation when only a very small quantity of cells is available for analysis, such as noninvasive cancer screening with circulating tumor cells (CTC), noninvasive prenatal diagnostics (NIPD) and preimplantation genetic testing (PGT) for in vitro fertilization. Single-cell omics greatly promotes the understanding of life at a more fundamental level and brings vast applications in medicine. Accordingly, single-cell omics is also called single-cell analysis or single-cell biology.

Within only a few years, single-cell omics, especially transcriptomic sequencing (scRNA-seq) and whole genome and exome sequencing (scWGS, scWES), has become robust and broadly accessible. Beyond these established technologies, multiplexed barcode design and combinatorial indexing, either combined with a microfluidic platform as exemplified by Drop-seq or performed without microfluidics in a regular PCR plate, have greatly increased the capacity of single-cell analysis, from one cell to thousands of cells in a single test. Unique molecular identifiers (UMIs) allow the amplification bias among the original molecules to be corrected faithfully, resulting in a reliable quantitative measurement of omics in single cells. More recently, a variety of single-cell epigenomic analyses have matured, particularly single-cell chromatin accessibility (scATAC-seq) and CpG methylation profiling (scBS-seq, scRRBS-seq). High-resolution single-molecule fluorescence *in situ* hybridization (smFISH) and its derivatives (e.g., seqFISH and MERFISH), in addition to spatial transcriptome sequencing, allow the native relationships among the individual cells of a tissue to be clarified visually and quantitatively in 3D or 4D. CRISPR/Cas9 editing-based *in vivo* lineage tracing methods, in turn, enable the dynamic profile of a whole developmental process to be displayed accurately. Multi-omics analysis facilitates the study of the multi-dimensional regulation and relationships of the different elements of the central dogma in a single cell, and permits a clear dissection of the complicated omics heterogeneity of a system. Last but not least, technical and biological noise, sequence dropout, and batch effects pose a huge challenge to the bioinformatics of single-cell omics. While significant progress in data analysis has been made, revolutionary theories and algorithms for single-cell omics are still expected. Indeed, single-cell analysis exerts considerable impact on many fields of biological study, particularly cancer, neurons and the nervous system, stem cells, embryonic development and the immune system; it also strongly motivates pharmaceutical R&D, clinical diagnosis and monitoring, and precision medicine.

This book summarizes recent developments and general considerations in single-cell analysis, with detailed presentations of selected technologies and applications. Starting with experimental design for single-cell omics, the book then emphasizes the heterogeneity of cancer and other systems. It also introduces the basic methods and key facts of bioinformatics analysis. Second, the book summarizes two types of popular technologies, the fundamental tools for single-cell isolation and the developments in single-cell multi-omics, followed by descriptions of FISH technologies; other popular technologies are not covered here because they have been described extensively elsewhere in the recent literature. Finally, the book illustrates an elastomer-based integrated fluidic circuit that connects single-cell functional studies, combining stimulation, response, imaging and measurement, with the corresponding single-cell sequencing. This is a model system for single-cell functional genomics. In addition, the book reports a pipeline for single-cell proteomics with an analysis of the early development of the *Xenopus* embryo, a single-cell qRT-PCR application that defined subpopulations related to cell cycling, and a new method for synergistic assembly of single-cell genomes from sequencing of the amplification products of phi29 DNA polymerase. Owing to the tremendous progress of single-cell omics in recent years, the topics covered here are incomplete, but each individual topic is thoroughly addressed and should be of interest and benefit to scientists working in or affiliated with this field.

Citation: Pan, X., Wu, S., Weissman, S. M., eds. (2019). Introduction to Single Cell Omics. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-920-9

# Table of Contents

### CHAPTER 1

### EXPERIMENTAL DESIGN AND BIOINFORMATIC ANALYSIS


Olivier B. Poirion, Xun Zhu, Travers Ching and Lana Garmire

### CHAPTER 2

### TECHNOLOGIES FROM CELL ISOLATION, MULTI-OMICS TO FLUORESCENCE IN SITU HYBRIDIZATION


Chenghua Cui, Wei Shu and Peining Li

*Single-Cell* in Situ *RNA Analysis With Switchable Fluorescent Oligonucleotides*

Lu Xiao and Jia Guo

*Fluidic Logic Used in a Systems Approach to Enable Integrated Single-Cell Functional Analysis*

Naveen Ramalingam, Brian Fowler, Lukasz Szpankowski, Anne A. Leyrat, Kyle Hukari, Myo Thu Maung, Wiganda Yorza, Michael Norris, Chris Cesar, Joe Shuga, Michael L. Gonzales, Chad D. Sanada, Xiaohui Wang, Rudy Yeung, Win Hwang, Justin Axsom, Naga Sai Gopi Krishna Devaraju, Ninez Delos Angeles, Cassandra Greene, Ming-Fang Zhou, Eng-Seng Ong, Chang-Chee Poh, Marcos Lam, Henry Choi, Zaw Htoo, Leo Lee, Chee-Sing Chin, Zhong-Wei Shen, Chong T. Lu, Ilona Holcomb, Aik Ooi, Craig Stolarczyk, Tony Shuga, Kenneth J. Livak, Cate Larsen, Marc Unger and Jay A. A. West

### CHAPTER 3

### REPORTS ON SINGLE CELL PROTEOMICS, RNA ANALYSIS, AND GENOMICS

*High-Sensitivity Mass Spectrometry for Probing Gene Translation in Single Embryonic Cells in the Early Frog (*Xenopus*) Embryo*

Camille Lombard-Banek, Sally A. Moody and Peter Nemes

*Cell Cycle and Cell Size Dependent Gene Expression Reveals Distinct Subpopulations at Single-Cell Level*

Soheila Dolatabadi, Julián Candia, Nina Akrap, Christoffer Vannas, Tajana Tesan Tomic, Wolfgang Losert, Göran Landberg, Pierre Åman and Anders Ståhlberg

*Efficient Synergistic Single-Cell Genome Assembly*

Narjes S. Movahedi, Mallory Embree, Harish Nagarajan, Karsten Zengler and Hamidreza Chitsaz

## Experimental Considerations for Single-Cell RNA Sequencing Approaches

### Quy H. Nguyen<sup>1</sup> , Nicholas Pervolarakis<sup>2</sup> , Kevin Nee<sup>1</sup> and Kai Kessenbrock<sup>1</sup> \*

<sup>1</sup> Department of Biological Chemistry, University of California, Irvine, Irvine, CA, United States, <sup>2</sup> Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, United States

Single-cell transcriptomic technologies have emerged as powerful tools to explore cellular heterogeneity at the resolution of individual cells. Previous scientific knowledge in cell biology is largely limited to data generated by bulk profiling methods, which only provide averaged read-outs that generally mask cellular heterogeneity. This averaged approach is particularly problematic when the biological effect of interest is limited to only a subpopulation of cells such as stem/progenitor cells within a given tissue, or immune cell subsets infiltrating a tumor. Great advances in single-cell RNA sequencing (scRNAseq) enabled scientists to overcome this limitation and allow for in depth interrogation of previously unexplored rare cell types. Due to the high sensitivity of scRNAseq, adequate attention must be put into experimental setup and execution. Careful handling and processing of cells for scRNAseq is critical to preserve the native expression profile that will ensure meaningful analysis and conclusions. Here, we delineate the individual steps of a typical single-cell analysis workflow from tissue procurement, cell preparation, to platform selection and data analysis, and we discuss critical challenges in each of these steps, which will serve as a helpful guide to navigate the complex field of single-cell sequencing.

Keywords: single-cell genomics, single-cell analysis, cell isolation, computational biology, cellular heterogeneity

### Edited by:

Xinghua Victor Pan, Yale University, United States

### Reviewed by:

Lasse Dahl Ejby Jensen, Linköping University, Sweden; Alexander D. Borowsky, University of California, Davis, United States

> \*Correspondence: Kai Kessenbrock kai.kessenbrock@uci.edu

### Specialty section:

This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology

Received: 29 April 2018 Accepted: 20 August 2018 Published: 04 September 2018

### Citation:

Nguyen QH, Pervolarakis N, Nee K and Kessenbrock K (2018) Experimental Considerations for Single-Cell RNA Sequencing Approaches. Front. Cell Dev. Biol. 6:108. doi: 10.3389/fcell.2018.00108

### INTRODUCTION

Elucidating cellular heterogeneity represents a major scientific challenge in many areas of biology and biomedical research including developmental and stem cell biology, immunology, neurobiology, and cancer research (Wagner et al., 2016). Recent convergence of next generation sequencing (NGS) and bioengineering approaches to manipulate individual cells has led to unbiased single-cell DNA (Navin et al., 2011), RNA (Pollen et al., 2014; Treutlein et al., 2014; Tanay and Regev, 2017), and ATAC (Buenrostro et al., 2015) sequencing. These technological advances are redefining our understanding of how biological systems function and have formed the basis for large-scale, international collaborations such as the Human Cell Atlas project (Rozenblatt-Rosen et al., 2017). In this spirit, a recent endeavor using microwell-based single-cell RNAseq (scRNAseq) created the first cell atlas to map out most tissues of the mouse (Han et al., 2018). Moreover, scRNAseq has provided critical new insights into key developmental processes such as the earliest steps of cardiovascular lineage segregation in mice (Lescroart et al., 2018), and our recent work utilized scRNAseq to reveal the spectrum of cellular heterogeneity within the human breast epithelium, identifying three major cell types each harboring multiple distinct cell states (Nguyen et al., 2018).

Due to the high sensitivity of these methods, in particular scRNAseq, it can be difficult to choose an adequate approach to minimize batch effects and unwanted technical variation that may overshadow true biological insights. Here, we provide helpful insights and delineate a step-wise approach for designing single-cell analysis workflows (**Figure 1**).

### CELL DISSOCIATION AND SINGLE-CELL PREPARATION

The process of single-cell preparation is arguably the greatest source of unwanted technical variation and batch effects in any single-cell study (Tung et al., 2017). Different tissues can vary significantly in extracellular matrix (ECM) composition, cellularity, and stiffness, and therefore dissociation protocols must be optimized for the specific tissue type of interest. Conventional protocols for single-cell preparation typically involve the following steps: (1) tissue dissection, (2) mechanical mincing, (3) enzymatic/proteolytic ECM breakdown (e.g., dispase, collagenase, trypsin) often accompanied by mechanical agitation, and (4) optional enrichment for cell types of interest by flow cytometry, bead-based immune-selection, differential centrifugation, or sedimentation. Each step can affect the cells' expression signatures, and should therefore be carefully optimized to introduce the least artifact. An optimal tissue dissociation protocol will yield as many viable cells as possible in the shortest possible duration without preferentially depleting or significantly altering the frequencies of certain cell types.

Recent advances in bioengineering of innovative microfluidic cell dissociation devices (Qiu et al., 2014) have the potential to radically change the way tissue samples are dissociated into single cells, while avoiding inter-assay variation due to human handling of the tissue. Several microfluidic devices have been optimized for streamlined tissue digestion, cell dissociation, filtering, and polishing. In brief, these devices were designed to work with tissue sequentially through progressively smaller size scales, starting from tissue specimen, through cellular aggregates and clusters, and finally eluting a solution containing close to 100% single cells, which will be ideal for scRNAseq applications. In addition, new semi-automated commercially available systems can help streamline tissue dissociation (e.g., Miltenyi gentleMACS). These devices offer tissue-type specific kits that may allow more reproducible, time-saving and efficient tissue dissociation and single-cell preparation (Meeson et al., 2013; Baldan et al., 2015). Ultimately, determining a "best practices" dissociation strategy through heuristic optimization will be critical for downstream single-cell library quality.

### Cell Type Enrichment

There are various methods for isolating specific cell populations or removing unwanted populations, and these should be optimized for each specific tissue type. Manual isolation utilizing magnetic beads or gradient purification are potential methods for removal of unwanted cells such as dead cells. Flow cytometry is a widely used, high-throughput method to enrich for rare cells such as hematopoietic stem cells (Radbruch and Recktenwald, 1995; Will and Steidl, 2010). However, these methods are not without drawbacks, since they can introduce artificial stress on cells and change their expression profile (Van Den Brink et al., 2017). Methods that involve antibody binding for purification can also affect the cell expression profile if binding of the antibodies to cell surface molecules induces intracellular signaling (Kornbluth and Hoover, 1989; Christaki et al., 2011). Flow cytometry-isolated cells are exposed to high pressure during sorting, and the osmotic and pressure changes introduced during cell sorting and handling can alter the expression profile of multiple cell types (Xiong et al., 2002; Romerosantacreu et al., 2009; Van Den Brink et al., 2017).

### Quality Control

Due to the high cost of single-cell sequencing experiments, careful quality control measurements should be executed. The performance of alternative protocols can be assessed using a number of readouts. A useful first metric can be acquired using imaging of viability such as using the Countess platform (Thermo Fisher Scientific). Flow cytometry is particularly valuable to measure several critical metrics simultaneously, such as cell viability, and contamination with doublets and small cell clusters which can confound single-cell sequencing results. Flow cytometry can also be used to evaluate whether cell populations of interest, such as immune cells, stromal fibroblasts, or stem cell populations, are maintained in the cell preparation and in the appropriate frequency. Finally, an additional metric on RNA quality can be acquired using the RNA integrity number (RIN) method (Schroeder et al., 2006).

### SINGLE-CELL TRANSCRIPTOMIC PLATFORM

Protocols for transcriptome analysis have advanced rapidly, resulting in several robust methods that vary in cell and mRNA capture strategy, barcoding, throughput, and level of automation (Fan et al., 2015; Macosko et al., 2015). Selection of the optimal approach depends largely on the research question. Recent high-throughput protocols for scRNAseq have dramatically increased scalability through automation, increasing the number of cells that can be processed simultaneously, and decreasing reagent cost through reaction miniaturization. Using microwell-based (Cytoseq, Wafergen), microfluidics-based (Fluidigm C1 HT), or droplet-based (inDrop, Drop-seq, and 10× Chromium) approaches, hundreds to thousands of cells can be captured in a single experiment (Islam et al., 2014; Picelli et al., 2014; Klein et al., 2015; Heath et al., 2016; Zheng et al., 2017). The newest of these protocols utilize beads functionalized with oligonucleotide primers, each of which contains a universal PCR priming site, a cell-specific barcode, an mRNA capture sequence, and a Unique Molecular Identifier (UMI). Individual cells are captured in wells or droplets with a single bead. The cell-specific barcode is identical for all primers within a droplet, but the unique UMI sequence on each primer allows individual transcripts within a cell to be counted. This provides a quantitative readout of the number of transcripts of each gene detected in a cell, thereby reducing the effects of amplification duplicates that occur with earlier technologies (Ramsköld et al., 2012; Patel et al., 2014). High-throughput 3′-end counting approaches have several important limitations. Since only the 3′-end of each mRNA is sequenced, differential splicing analyses are not feasible (Macosko et al., 2015; Heath et al., 2016). High-throughput approaches typically only achieve ∼10% transcriptome coverage, relative to ∼40% for full-length scRNAseq protocols that use Switching Mechanism at 5′ End of RNA Template (SMART) chemistry (Tirosh et al., 2016; Yuan et al., 2017). This is partly due to lower mRNA capture efficiency, but also due to lower sequencing depth. Single-cell qPCR platforms (e.g., Fluidigm C1 and Biomark) remain superior in sensitivity for detecting low-expressed genes (Lawson et al., 2015).
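The UMI-based counting described above can be illustrated with a minimal Python sketch. It assumes each aligned read has already been reduced to a `(cell_barcode, umi, gene)` tuple; the function name `count_umis` and the toy reads are hypothetical and made up for illustration. Real pipelines also correct sequencing errors in barcodes and UMIs, which this sketch deliberately ignores.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse reads into molecule counts per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples, one per
    aligned read. Reads that share all three fields are treated as PCR
    duplicates of a single captured mRNA molecule and counted once.
    """
    molecules = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        molecules[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}


# Toy reads, invented for illustration only.
reads = [
    ("AAAC", "TTGCA", "KRT18"),
    ("AAAC", "TTGCA", "KRT18"),  # PCR duplicate of the read above
    ("AAAC", "GGATC", "KRT18"),  # second molecule of the same gene
    ("AAAC", "CCATG", "ACTB"),
    ("CCGT", "TTGCA", "KRT18"),  # same UMI, but a different cell
]
counts = count_umis(reads)
for (cell_barcode, gene), n_molecules in sorted(counts.items()):
    print(cell_barcode, gene, n_molecules)
```

Grouping by cell barcode first and UMI second is what lets the same UMI sequence be reused across cells without confusing the counts.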

Protocols for processing rare cells usually involve an upstream capture step by flow cytometry or micromanipulation, followed by dispensing single cells into microtubes or microwell plates. Studies investigating rare cell populations that require selection via specific markers (e.g., adult tissue stem cell populations) are best performed using these protocols. Single-cell libraries are prepared using SMART-based chemistry, which utilizes a template-switching oligonucleotide (TSO) (Tirosh et al., 2016). This TSO can be used to prime off of the untemplated nucleotides added by the reverse transcriptase, enabling subsequent PCR using a single primer and capture of full length transcripts (Tirosh et al., 2016; Yuan et al., 2017). cDNAs are then amplified by PCR and libraries are prepared for sequencing using standard protocols. Although there have been several large scale projects utilizing these protocols, because they are manual in nature and utilize larger microliter reaction volumes, they limit the number of cells that can be processed at reasonable cost.

Another area of ongoing debate is how to determine how many cells should be analyzed to reach sufficient statistical power. Several methods have been developed using power analysis statistics, such as Scotty<sup>1</sup> or web-based tools<sup>2</sup>, but one must estimate the number and expected frequencies of cell populations present in the sample, and such information is often not available. Therefore, these decisions are usually made based on logistical constraints (i.e., the number of cells available), financial considerations, or iterative experiments where an initial sample of cells is sequenced to get a sense for overall population structure, and then increasing numbers of cells are sequenced until one is satisfied that all the main populations have been identified.
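To make the dependence on expected population frequency concrete, the following Python sketch estimates how many cells must be sampled to observe at least a minimum number of cells from one population. It is a simplified binomial calculation under the assumption that cells are sampled independently; it is not the method implemented by the tools cited above, and the names `prob_at_least` and `cells_needed` and the default 95% confidence are hypothetical choices.

```python
from math import comb


def prob_at_least(n_cells, freq, min_cells):
    """P(X >= min_cells) for X ~ Binomial(n_cells, freq)."""
    p_fewer = sum(
        comb(n_cells, i) * freq**i * (1 - freq) ** (n_cells - i)
        for i in range(min_cells)
    )
    return 1.0 - p_fewer


def cells_needed(freq, min_cells, confidence=0.95, max_cells=1_000_000):
    """Smallest number of sampled cells that yields at least `min_cells`
    cells of a population with frequency `freq`, with probability
    `confidence`. Uses a plain linear search, which is adequate for the
    small `min_cells` values typical of this kind of estimate."""
    n = min_cells
    while n <= max_cells:
        if prob_at_least(n, freq, min_cells) >= confidence:
            return n
        n += 1
    raise ValueError("no sample size up to max_cells reaches the target")


# Example: a population assumed to make up 5% of the sample.
print(cells_needed(freq=0.05, min_cells=1))
print(cells_needed(freq=0.05, min_cells=10))
```

The calculation still requires a guess at `freq`, which is exactly the information that is often unavailable before a pilot experiment.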

### SINGLE NUCLEI ISOLATION AND SEQUENCING

Single-cell RNA sequencing methods are optimal when cells can be harvested intact and viable (Grindberg et al., 2013). However, certain cell types (e.g., neurons, adipocytes) are not amenable to standard organ dissociation protocols, since enzymatic and mechanical forces easily disrupt the cytoplasmic contents (Habib et al., 2017). In these cases, an option could be to isolate intact nuclei for single-nucleus RNAseq (snRNAseq) (Grindberg et al., 2013; Habib et al., 2016, 2017; Krishnaswami et al., 2016; Lacar et al., 2016; Lake et al., 2016). To prepare single nuclei, cells are lysed with detergent and dounce homogenized to expel cytoplasmic contents and nuclei from the cellular membrane (Habib et al., 2016), which may avoid transcriptomic changes (Van Den Brink et al., 2017). Nuclei can then be purified by flow cytometry or gradient centrifugation (Grindberg et al., 2013; Ambati et al., 2016; Habib et al., 2016). When cell-type specific nuclear proteins exist, they can be used for nuclei isolation from specific cell types using antibody labeling (Lacar et al., 2016; Habib et al., 2017).

Single-nucleus RNAseq is not only amenable for difficult to isolate cell types, but can also be used for archived tissues such as flash-frozen clinical samples. Individual nuclei isolated from frozen adult mouse and human brain tissues have been successfully sequenced, demonstrating that snRNAseq has sufficient resolution to identify many different cell types from frozen and post-mortem tissue (Grindberg et al., 2013). With the rapid development of many applications for snRNAseq, nuclei are amenable to other studies not easily done by scRNAseq.

<sup>1</sup>http://scotty.genetics.utah.edu/

An important question remains: To what degree is the nuclear transcriptome representative of the whole cell? Recent studies have demonstrated that many transcripts of cell and nucleus are equally represented and that nuclear RNA represents an important and significant population of transcripts that contribute greatly to the overall diversity of transcripts (Barthelson et al., 2007; Trask et al., 2009). Comparative studies of scRNAseq and snRNAseq in neural progenitor cells have also demonstrated that genes are expressed in equal proportion between whole cell and nuclei (Grindberg et al., 2013). Nanogrid single-cell and nuclei RNA sequencing studies in the same breast cancer lines found that overall copy number, expression level, and abundance had a high Spearman's correlation (r<sub>s</sub> = 0.95) (Gao et al., 2017). Similarly, the transcriptomes of single cells and nuclei of 3T3 cells have also demonstrated high correlation (Pearson, r = 0.87) (Habib et al., 2017). Together these results suggest that nuclei and cells have highly correlated relative gene expression.
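The two correlation measures quoted above answer slightly different questions: Pearson's r measures linear agreement of expression values, whereas Spearman's r<sub>s</sub> measures agreement of gene ranks. The pure-Python sketch below computes both for a pair of expression profiles. The expression values are invented toy numbers, not data from the cited studies, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-gene expression values (made up): whole cell vs. nucleus.
cell = [0, 5, 12, 40, 100]
nucleus = [1, 3, 15, 30, 300]
print(round(pearson(cell, nucleus), 3), round(spearman(cell, nucleus), 3))
```

In this toy example the gene order is identical in both profiles, so the rank-based measure is maximal even though the values themselves are not linearly related.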

Despite the similarities between single-cell and nuclei transcriptomic profiles, there remain notable differences. Not surprisingly, nuclear transcriptomes are enriched for several types of nuclear RNAs (Grindberg et al., 2013; Habib et al., 2016, 2017; Krishnaswami et al., 2016; Gao et al., 2017). Since ncRNAs are only polyadenylated in the nucleus, snRNAseq provides a feasible strategy to capture the heterogeneity of ncRNA transcription at single-cell resolution (Krishnaswami et al., 2016). In addition, nuclear transcriptomes are enriched for lncRNAs and nuclear-function genes (Gao et al., 2017). Another difference between cell and nuclear RNAseq is the higher abundance of intronic sequences in snRNAseq, which ranged between 10–40% of mapped reads (Grindberg et al., 2013; Gao et al., 2017; Habib et al., 2017). These features need to be accounted for when comparing datasets from cellular versus nuclear transcriptome analyses.

In conclusion, snRNAseq has emerged as a promising avenue for profiling archived samples or cell types that are hard to viably isolate from tissues.

### SINGLE-CELL LIBRARY SEQUENCING

The next critical part of designing single-cell workflows is to align the analysis pipeline with the respective NGS platform and sequencing depth. It is important to confirm that the chemistry used for library construction is compatible with the sequencing technology. Currently, there are two major outputs for libraries from scRNAseq: full-length transcript or 3′-end counted libraries, which each require different read depths (Haque et al., 2017). Full-length transcript libraries are typically sequenced at a depth of 10<sup>6</sup> reads per cell, but may still yield important biological information at as low as 5 × 10<sup>4</sup> reads per cell (Pollen et al., 2014). For specific applications such as alternative splicing analysis on the single-cell level, much higher sequencing depth up to 15–25 × 10<sup>6</sup> reads per cell is necessary. On the other hand, 3′-end counting libraries are sequenced at much lower depth of around 10<sup>4</sup> or 10<sup>5</sup> reads per cell (Haque et al., 2017). Reaching the optimal sequencing depth can be an iterative process and may require multiple rounds of optimization. Sequencing saturation can be estimated by plotting down-sampled sequencing depth in mean reads per cell (e.g., 10× Genomics Cell Ranger).
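The down-sampling idea can be sketched in a few lines of Python. The sketch assumes that saturation is defined as one minus the ratio of unique molecules to reads, which is a commonly used definition; the exact formula used by any particular pipeline should be checked in its documentation. The function names, the `(cell_barcode, umi, gene)` read representation and the toy reads are all hypothetical.

```python
import random


def saturation(reads):
    """Fraction of reads that duplicate an already observed molecule.

    Each read is a (cell_barcode, umi, gene) tuple; identical tuples are
    treated as repeated observations of the same molecule.
    """
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


def saturation_curve(reads, fractions, seed=0):
    """Estimate saturation at several down-sampled sequencing depths."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for fraction in fractions:
        subsample = rng.sample(reads, int(len(reads) * fraction))
        curve.append((fraction, saturation(subsample)))
    return curve


# Toy library: 50 distinct molecules, each observed four times.
reads = [("AAAC", f"UMI{i % 50}", "KRT18") for i in range(200)]
curve = saturation_curve(reads, [0.25, 0.5, 1.0])
for fraction, value in curve:
    print(f"{fraction:.2f} of reads -> saturation {value:.2f}")
```

A curve that has flattened at the full depth suggests that additional sequencing would mostly re-read known molecules, whereas a curve that is still rising suggests that deeper sequencing would recover new ones.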

<sup>2</sup>http://satijalab.org/howmanycells

### STUDY DESIGN AND DATA ANALYSIS

In the following section, we highlight several key considerations from a data analysis perspective for adequately designing a successful scRNAseq study. As mentioned, many single-cell technologies can be greatly affected by technical variation, and without proper study design the results can be difficult to interpret. One critical aspect of this is the separation of batch and condition. Batch refers to a library that was singularly generated in a contained workflow (i.e., harvesting tissue specimen, dissociating into single-cell suspension, and generating scRNAseq library). Condition refers to a biological state or experimental treatment that is being analyzed in the study. Technical variation can be difficult to separate from relevant biological variation when conditions are interrogated individually. To help correct for this, the generation of replicates (biological or technical) whenever possible is strongly recommended.
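As a small illustration of the batch versus condition distinction, the Python sketch below flags conditions that appear in only one batch, where technical and biological variation cannot be separated. The design table, the library names and the function `unreplicated_conditions` are hypothetical and serve only to make the definition concrete.

```python
from collections import defaultdict


def unreplicated_conditions(design):
    """Return conditions represented by fewer than two batches.

    `design` is a list of (batch, condition) pairs, one pair per
    condition present in each library.
    """
    batches_per_condition = defaultdict(set)
    for batch, condition in design:
        batches_per_condition[condition].add(batch)
    return sorted(
        condition
        for condition, batches in batches_per_condition.items()
        if len(batches) < 2
    )


# Hypothetical study design: control is replicated, treated is not.
design = [
    ("library_1", "control"),
    ("library_2", "control"),
    ("library_3", "treated"),
]
print(unreplicated_conditions(design))
```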

In addition to replicates, an option is to mix samples and conditions within a batch, such that they can be treated without confounding each other (Hicks et al., 2015). One example is the Demuxlet workflow, where samples from genetically distinct individuals can be processed within the same library generation protocol and sequenced together (Kang et al., 2018). Prior to library generation, genotyping of distinct samples is performed and subsequently used in conjunction with the scRNAseq library to demultiplex the mixed cell sample into the samples of origin. In situations where genetically identical samples are used, or genotypic data is not readily available, cellular hashing can be employed (Stoeckius et al., 2017). This involves labeling each sample in the study with its own oligo-tagged antibodies and then pooling the samples and generating the scRNAseq library from the mixture. Because the antibodies carry unique barcodes, each cell can be traced back to its sample of origin (Stoeckius et al., 2017).
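The hashing idea described above can be sketched as a simple assignment rule: each cell is attributed to the sample whose hashtag barcode dominates its counts. The thresholds (`min_count`, `min_ratio`), the function name, and the toy counts below are hypothetical, and this rule is a simplification for illustration rather than the statistical procedure used in the published workflow.

```python
def assign_samples(hto_counts, sample_names, min_count=10, min_ratio=3.0):
    """Assign each cell to the sample whose hashtag dominates its counts.

    hto_counts: dict mapping cell barcode -> list of hashtag counts,
                one entry per sample, in the order of sample_names
                (at least two samples are assumed).
    Returns a dict mapping cell barcode -> sample name, "negative", or "multiplet".
    """
    calls = {}
    for cell, counts in hto_counts.items():
        ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
        top, second = counts[ranked[0]], counts[ranked[1]]
        if top < min_count:
            calls[cell] = "negative"    # too few hashtag reads to call
        elif top < min_ratio * max(second, 1):
            calls[cell] = "multiplet"   # two hashtags at comparable levels
        else:
            calls[cell] = sample_names[ranked[0]]
    return calls

# Toy hashtag counts (hypothetical) for three pooled samples.
toy = {
    "cell_1": [250, 4, 2],
    "cell_2": [3, 1, 180],
    "cell_3": [120, 95, 5],
    "cell_4": [2, 0, 1],
}
calls = assign_samples(toy, ["sampleA", "sampleB", "sampleC"])
```

Cells flagged as multiplets or negatives would typically be excluded before downstream analysis.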

Efforts can be made computationally to mitigate batch-to-batch variation. Batch effects are not unique to scRNAseq data, but the assumptions made by correction algorithms are not always appropriate for the bimodality of gene expression in zero-inflated scRNAseq data. Here, we highlight recent analytical frameworks that may be used to correct for this phenomenon. A recently developed approach by Haghverdi et al. (2018) identifies mutual nearest neighbors between cells in different datasets or samples and does not require known or equal proportions of cell types between datasets. In addition, the widely used Seurat pipeline for scRNAseq analysis recently employed canonical correlation analysis (CCA), which allows for discovery of co-correlated gene modules between datasets that can then be used for clustering (Butler et al., 2018). This approach identifies the cell types common between datasets and samples, as well as those that are unique to an individual set, by finding common sources of variation in gene expression. As an illustration of this method, we applied CCA to our recently published droplet-enabled scRNAseq dataset from four individual primary human breast tissue samples (**Figure 2**). Finally, the single-cell batch correction framework MAST (Finak et al., 2015) models the positive expression mean and the over-the-background expression of transcripts, and calculates the fraction of detected genes per cell and uses this as a covariate that is independent of a previously specified control set of genes. Together, these methods serve as recent examples to handle batch-to-batch variation computationally, resulting in improved dimensionality reduction and clustering for meaningful scRNAseq data analysis.
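To make the mutual nearest neighbor idea concrete, the following minimal sketch finds pairs of cells, one from each batch, that are each among the other's nearest neighbors. It uses Euclidean distance on toy two-dimensional coordinates; the function names and data are hypothetical, and the published method involves further steps (such as normalization and estimation of correction vectors) that are not shown here.

```python
import math

def nearest_neighbors(query, reference, k):
    """For each query point, return the set of indices of its k nearest reference points."""
    result = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Return (i, j) pairs where b_j is among a_i's k nearest neighbors and vice versa."""
    a_to_b = nearest_neighbors(batch_a, batch_b, k)
    b_to_a = nearest_neighbors(batch_b, batch_a, k)
    return sorted((i, j) for i, nbrs in enumerate(a_to_b) for j in nbrs if i in b_to_a[j])

# Toy coordinates (hypothetical): batch B equals batch A shifted by a batch
# effect of (+1, +1), plus one cell type that is absent from batch A.
batch_a = [(0.0, 0.0), (5.0, 0.0)]
batch_b = [(1.0, 1.0), (6.0, 1.0), (20.0, 20.0)]
pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
```

In this toy case the batch-specific cell in batch B forms no mutual pair, which reflects why the approach does not need equal cell-type composition between batches.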

Beyond accounting for technical variation, a common question that researchers address is the relatedness of described cell populations through the lens of a differentiation process. The key assumption of pipelines that seek to address this is that the tissue sample analyzed using scRNAseq contains cell types/states that represent not only the ends of a differentiation process, but also stem/progenitor cells and transitional cell states along the path of differentiation. Common analysis suites that seek to reconstruct these differentiation trajectories are Monocle (Qiu et al., 2017), TSCAN (Ji and Ji, 2016), and CellTree (duVerle et al., 2016). Each uses different methods, but their shared goal is to visualize differentiation trajectories and identify expression signatures that change through pseudotime.

### CONCLUSION

To fully harness the potential of single-cell analysis tools to decipher complex biological systems on the level of individual cells, careful study design and rigorous optimization of every step along the experimental procedure are mandatory. Here, we delineate a step-wise experimental approach for optimizing tissue handling, cell dissociation and enrichment, single-cell platform selection, library sequencing, and data analysis for designing single-cell workflows. A move toward standardized and automated processing of tissues will minimize changes introduced by tissue handling that may obscure biologically relevant transcriptomic profiles. For tissues that are problematic to dissociate into high-quality and viable single-cell suspensions, snRNAseq offers a solution to this problem, and can be used to achieve uniform extraction and sequencing of multiple cell types for cross comparison. Numerous computational frameworks are currently emerging that help mitigate batch effects to separate biological variation from unwanted technical variation.

### AUTHOR CONTRIBUTIONS

KK outlined concept and overview of review. QN, NP, and KN wrote the manuscript. KK and QN designed and prepared the figures.

### FUNDING

This study was supported by funds from the National Cancer Institute (R00 CA181490), Chan/Zuckerberg Initiative (HCA-A-1704-01668), and the University of California Cancer Research Coordinating Committee (CTN-18-515073).



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Nguyen, Pervolarakis, Nee and Kessenbrock. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Impact of Heterogeneity on Single-Cell Sequencing

Samantha L. Goldman<sup>1,2</sup>, Matthew MacKay<sup>1,2</sup>, Ebrahim Afshinnekoo<sup>1,2,3</sup>, Ari M. Melnick<sup>4</sup>, Shuxiu Wu<sup>5,6</sup> and Christopher E. Mason<sup>1,2,3,7</sup>\*

<sup>1</sup> Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, United States, <sup>2</sup> The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, United States, <sup>3</sup> WorldQuant Initiative for Quantitative Prediction, Weill Cornell Medicine, New York, NY, United States, <sup>4</sup> Department of Medicine, Weill Cornell Medicine, New York, NY, United States, <sup>5</sup> Hangzhou Cancer Institute, Hangzhou Cancer Hospital, Hangzhou, China, <sup>6</sup> Department of Radiation Oncology, Hangzhou Cancer Hospital, Hangzhou, China, <sup>7</sup> The Feil Family Brain and Mind Research Institute, New York, NY, United States

The importance of diversity and cellular specialization is clear for many reasons, from population-level diversification, to improved resiliency to unforeseen stresses, to unique functions within metazoan organisms during development and differentiation. However, the level of cellular heterogeneity is just now becoming clear through the integration of genome-wide analyses and more cost effective Next Generation Sequencing (NGS). With easy access to single-cell NGS (scNGS), new opportunities exist to examine different levels of gene expression and somatic mutational heterogeneity, but these assays can generate yottabyte scale data. Here, we model the importance of heterogeneity for large-scale analysis of scNGS data, with a focus on the utilization in oncology and other diseases, providing a guide to aid in sample size and experimental design.

### Edited by:

Xinghua Victor Pan, Southern Medical University, China

### Reviewed by:

Saheli Sarkar, Northeastern University, United States Guangshuai Jia, Max-Planck-Institut für Herz- und Lungenforschung, Germany

### \*Correspondence:

Christopher E. Mason chm2042@med.cornell.edu

### Specialty section:

This article was submitted to Genomic Assay Technology, a section of the journal Frontiers in Genetics

Received: 20 July 2018 Accepted: 09 January 2019 Published: 01 March 2019

### Citation:

Goldman SL, MacKay M, Afshinnekoo E, Melnick AM, Wu S and Mason CE (2019) The Impact of Heterogeneity on Single-Cell Sequencing. Front. Genet. 10:8. doi: 10.3389/fgene.2019.00008

Keywords: single-cell sequencing, heterogeneity, scRNA-seq, NGS, RNA, single cells

### INTRODUCTION

It has been well-documented, both theoretically (Elsasser, 1984) and experimentally, that nearly all cellular systems are heterogeneous (Altschuler and Wu, 2010). Heterogeneity may arise for a number of different reasons, and at many different levels, in order to improve survival and functionality. Both single-celled and multicellular organisms employ population-level survival strategies such as bet-hedging in order to achieve a better chance of survival when faced with new stresses through having a diverse population (Grimbergen et al., 2015). At a single-organism level, diversity further enables the existence of specialization and, within metazoan organisms, differentiation (Hadjantonakis and Arias, 2016).

Cellular heterogeneity can be measured in several different ways, most commonly via genomic, epigenomic, transcriptomic, and proteomic studies. However, the level of heterogeneity at one level of expression or regulation may not be the same at another level. Cells within a given person have nearly identical genomes, yet through specific modifications throughout development and disease, may generate many distinct cell types with unique expression profiles. Even the genome itself may be specifically rewired to generate increased genetic diversity within specific cell types, most notably B- and T-cells through V(D)J recombination. Uncovering the true diversity of cells is crucial to better understand cellular communication and responsibility within both healthy and disease states. It is now well understood that differentiation throughout development allows for the necessary cellular specialization required for complex multicellular system function. Further, specific epigenomic modifications allow for this precise differentiation, which inevitably results in the cascade of cellular diversity present in humans and is also important in cancer (Li et al., 2014, 2016).

Next generation sequencing (NGS) is increasingly used owing to its rapidly decreasing costs and its ability to generate large amounts of data (Mason et al., 2014), with new data sets even being generated in zero gravity (McIntyre et al., 2016; Castro-Wallace et al., 2017). Within bulk-NGS analyses, many cells, typically hundreds of thousands to millions, are analyzed at once. This generates an averaged picture of a given population of cells, and thus the majority of our understanding of different cell and tissue types comes from the analysis of bulk experimentation, which may underestimate the true heterogeneity of cells. Bulk-NGS is simply ill-equipped to address some important questions revolving around cellular heterogeneity. Single-cell NGS (scNGS) attempts to resolve issues facing bulk-NGS through the ability to relate sequences to a given cell, across the genetic, transcriptomic, epigenomic, and proteomic levels. This approach reduces the issue of data generalization which is prevalent in some bulk-NGS studies. However, scNGS is not without its faults. One of the main issues with scNGS is its cost, which has decreased considerably in recent years but is still a large factor when designing experiments; technical issues and challenges in sensitivity also remain. Here, we will outline the importance of cellular heterogeneity, assess factors of scNGS heterogeneity, and provide a practical sample size guide to aid in experimental design.

### THE IMPORTANCE OF CELLULAR HETEROGENEITY

Having a heterogeneous (i.e., diverse) population is beneficial for cellular systems for the same reason that variation among the organisms of a single species is beneficial – bet-hedging (Beaumont et al., 2009). Bet-hedging is a population-level survival strategy in which less-fit individuals are maintained in a population as a precaution; if the environment were to drastically change, the originally less-fit organisms may be adapted to the new environment, thereby assuring the survival of the population (Grimbergen et al., 2015). In an ever-changing environment, a population has a greater overall fitness if there is greater diversity. In this way, the evolutionary adaptation of all cellular systems can be modeled in terms of Darwinian evolution.

There are many causes of cellular heterogeneity. Firstly, populations of cells will naturally contain individuals that develop random mutations. These unique subclones can become significant portions of the population if that mutation confers a selective advantage and proliferates. However, not all cellular heterogeneity is genetic. Rather, much heterogeneity is phenotypic, and is frequently expressed in transcriptomes that vary from cell to cell. This heterogeneity can arise via external or internal factors. Extrinsic heterogeneity can lead to phenotypic plasticity in response to an environmental change, and only affects the part of the population that is exposed to the causative environment (Huang, 2009). It can also include variables such as cell-cycle stage and cell size (Singh and Soltani, 2013). Intrinsic heterogeneity is a more nuanced phenomenon, and is a result of stochastic events, such as gene expression noise (Huang, 2009), rather than a changing intracellular environment (Elowitz et al., 2002).

Because of stochastic gene fluctuation, there are varying levels of protein abundance in different cells in a population at any given time. This is most easily visualized via flow cytometry, which yields a bell-shaped curve (Brock et al., 2009). Stochastic gene expression may have its evolutionary advantages, as well. In the same way that populations of cells maintain random mutations in bet-hedging, populations of clonal, unicellular organisms may maintain variation via stochastic gene expression to ensure overall survival (Raj and van Oudenaarden, 2008). Although stochastic gene expression is a significant contributor to heterogeneity, it is not the only cause. The sub-state of any given genome/cell depends on a number of factors, including epigenetics, alternative splicing sites, posttranslational modifications, and sometimes even microbial interactions (Shabaan et al., 2018). These processes are not always stochastic, and can therefore lead to "directed" heterogeneity, instead of the more random "non-directed" heterogeneity of stochastic gene expression (Chang and Marshall, 2017).

Interestingly, non-genetic cellular heterogeneity also plays an important role in development. Early in the developmental process, before the small population of cells begins to differentiate, these cells are theoretically identical. However, as the cells begin to differentiate, they display non-genetic heterogeneity. The body of research on the role of heterogeneity in development is largely focused on transcriptional heterogeneity (Griffiths et al., 2018), which is a driver of differentiation of pluripotent stem cells. More recent work has also shown that RNA modifications, called the epitranscriptome (Saletore et al., 2012), can also lead to differential response of human cells to both disease and infection (Gokhale et al., 2016; Vu et al., 2017). Also, some transcriptional sub-states are heritable through several generations of cell divisions. Signaling factors, developmental regulators, and chromatin regulators contribute to transcriptional heterogeneity in stem cells (Kumar et al., 2014). "Directed" heterogeneity has been shown to lead the process behind the development of a body plan in Drosophila melanogaster (Chang and Marshall, 2017).

Even after development, all human tissue systems experience some level of differentiation. This allows cells to specialize, leading to a more flexible biological system. This principle has been most notably studied in the nervous and immune systems. In the central nervous system, for instance, there are dozens of different types of neurons. Subsets of these neurons form the myriad different regions within the brain (Emery and Barres, 2008). One phenotypic hallmark of heterogeneity in the nervous system, for example, is the distribution of mitochondria within the neuron. This heterogeneity is exhibited both regionally within the brain (e.g., brain regions that require more energy are composed of neurons with more mitochondria) (Dubinsky, 2009) and within individual neurons. This distribution differs greatly depending on the immediate and current needs of the neuron, and is regulated by a complex system of proteins (Course and Wang, 2016). In the immune system, monocytes, macrophages (Gordon and Taylor, 2005), B-cells, and T-cells show heterogeneity. As an example, T-cell heterogeneity is essential for an effective immune response, since subtle differences in T-cell receptors (TCRs) enable the identification and elimination of foreign invaders (Durlanik and Thiel, 2015). However, in autoimmune disease, faulty TCR diversification can result in the improper identification of "self" as an invader, resulting in normal tissue destruction.

Different diseases leverage heterogeneity to their advantage. A "survival of the fittest" model for cellular heterogeneity can be applied not only to populations of single-celled organisms, but also to tumors. Cancer cells continuously acquire and pass down genetic and epigenetic modifications to subsequent generations of cancer cells, resulting in heterogeneity. These genetic mutations and epigenetic shifts may further lead to changes in fitness (Li et al., 2016). Cancer cells are often exposed to hostile environments, such as chemotherapy and radiation, during treatment (Afshinnekoo and Mason, 2016). Through bet-hedging, and therefore maintenance of a heterogeneous population, the chance of resistance or relapse from treatment is dramatically increased. As these cancer cells are all in the same small environment and are all competing for the same limited resources, there are complex interactions between different subclones that further reinforce these Darwinian relationships (Tabassum and Polyak, 2015). Cancer cells can be further driven into a "survival of the fittest" scenario via treatment with a chemotherapeutic drug, as this may lead to the selection for cancer-variants that are resistant to the drug. Over time, this could lead to chemotherapeutic resistance within the whole tumor (Dagogo-Jack and Shaw, 2017), as well as tumor subtypes (Shih et al., 2017). Indeed, it has been shown that a single tumor biopsy dramatically underrepresents the genetic diversity present within an entire tumor (Gerlinger et al., 2012). However, heterogeneity is not only clinically relevant in regards to chemotherapy. Immunotherapies can also be profoundly impacted by heterogeneity. Liver cancer-targeted immunotherapy is designed around tumor-infiltrating T-cells. Through the use of single-cell RNA sequencing, 11 tumor-infiltrating T-cell sub-states have been identified. Each of these sub-states has a unique profile of up- and downregulated genes, which may impact the efficacy of any immunotherapies (Zheng et al., 2017).

Intratumoral heterogeneity has been extensively studied through single-cell sequencing methods. For example, single-cell RNA sequencing has revealed significant heterogeneity in primary glioblastomas (Patel et al., 2014). Additionally, increased levels of heterogeneity in these tumors were inversely correlated with survival, indicating that intratumor heterogeneity should be an essential clinical factor, including events from DNA transposition (Henssen et al., 2017). Metastatic melanoma is also highly transcriptionally heterogeneous, and this heterogeneity is multifaceted; it is associated with a number of factors, including cell cycle stage, location, and chemotherapeutic resistance (Tirosh et al., 2016). The use of RNA sequencing here is key, as transcriptomics captures fine details of non-genetic heterogeneity that other sequencing methods may have missed. Shifting of cellular heterogeneity is a hallmark not only of cancer but of many other diseases; here, we focus on its relevance for cancer.

### ASSESSING HETEROGENEITY

Heterogeneity itself is a gradient which may be based on variable changes in the transcriptome or more permanent changes within the genome. Differences seen between cells may be temporal due to cell-cycle states, or spatial due to external stimuli (Dagogo-Jack and Shaw, 2017). Also, differences between cells may exist at any processing level of the cell, from the genome to transcriptome to proteome, or due to any additional modifications which may exist. With this in mind, it could be possible to define all cells as heterogeneous. However, two disparate cells might not behave functionally differently, and their heterogeneity would therefore not be considered impactful (Altschuler and Wu, 2010). The overall assessment of cellular heterogeneity is therefore context-specific, and the technologies used to assess cellular differences need to be considered carefully.

Proteomic and cell-marker classification has been historically used to discern cell types. Immunohistochemistry (IHC) can be used to distinguish immune cell types within healthy systems (Reuben et al., 2017b) or even for cancer subtyping, such as by HER2 expression within breast cancer (Potts et al., 2012). Surface markers help to sort cell types into broad classifications, but this type of analysis requires prior gene expression knowledge and specific antibody usage. Other approaches, such as whole genome sequencing (WGS), bisulfite sequencing, and RNA sequencing, allow for genome-wide analysis (Mason et al., 2017). Historically, these techniques have been performed on heterogeneous tissue samples, generating an averaged picture of the tissue of interest (bulk-NGS). Although bulk-NGS has a tendency to generalize heterogeneity, certain biological understanding and computational modeling can mitigate this effect within genomic and epigenomic analyses.

Bulk-WGS can be directly used to assess the existence of subclonal mutations through the use of variant allele frequencies (VAFs). Through the modeling of VAFs and copy number changes, an understanding of the clonal architecture may be inferred from such bulk-NGS data. One such method, Canopy, uses a Bayesian analysis to identify subpopulations and build a phylogenetic tree detailing their likely evolutionary history (Jiang et al., 2016). Long read bulk TCR sequencing can also be used directly to assess clonal structures under the assumption that there is a unique V(D)J recombination per subclone. As such, the quantity of a given TCR gene can be directly related to the abundance of that subclone, and the number of different TCR genes relates to the overall heterogeneity and diversity of the T-cell population. TCR sequencing has also shown intratumoral heterogeneity in localized lung carcinomas, which may confer post-surgical recurrence (Reuben et al., 2017a). As epigenetics also plays a significant role in heterogeneity, bisulfite sequencing can be used to study patterns of DNA methylation and estimate clonality, such as with the algorithm methclone (Li et al., 2014). Bisulfite sequencing has also been used to reveal heterogeneity in DNA methylation of the MLH1 (a mismatch repair gene) promoter across several endometrial tumors (Varley et al., 2009).
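The relationship between TCR read counts, subclone abundance, and overall diversity described above can be illustrated with a short sketch. It assumes that each read has already been assigned a clonotype label (the labels below are hypothetical) and summarizes the repertoire with Shannon entropy and a normalized clonality score, which are common summaries but are not taken from the cited studies.

```python
import math
from collections import Counter

def clonotype_frequencies(clonotype_reads):
    """Relative abundance of each clonotype, estimated from its share of reads."""
    counts = Counter(clonotype_reads)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}

def shannon_diversity(freqs):
    """Shannon entropy (natural log) of the clonotype frequency distribution."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)

def clonality(freqs):
    """1 - normalized entropy: near 0 for an even repertoire, 1 for a single clone."""
    n = len(freqs)
    if n <= 1:
        return 1.0
    return 1.0 - shannon_diversity(freqs) / math.log(n)

# Toy repertoire (hypothetical clonotype labels): one expanded clone and two minor ones.
reads = ["clone_A"] * 6 + ["clone_B"] * 3 + ["clone_C"] * 1
freqs = clonotype_frequencies(reads)
richness = len(freqs)          # number of distinct clonotypes observed
skew = clonality(freqs)        # higher values indicate a more oligoclonal population
```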

While many bulk-NGS methods rely on mixture models of the VAFs to analyze small indels and point mutations, these methods often depend on the copy number of the gene in question, which can be altered in cancers, and are unable to relate multiple mutations which exist at low frequencies (Jiang et al., 2016). Additionally, bulk sequencing has a tendency to report what an "average" cell in a population would look like and for that reason would not be usable in the analysis of an all-or-nothing response (Altschuler and Wu, 2010). For example, Xenopus oocytes have a binary response when signaled by progesterone to begin a process of maturation; they either mature or they do not (Ferrell and Machleder, 1998). In this case, looking at an average of two distinct oocyte subpopulations – one that has been signaled to mature and one that has not – would artificially yield a biologically impossible "mean oocyte" that has committed to maturation half-way (Altschuler and Wu, 2010).
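The "mean oocyte" problem can be shown with a few lines of arithmetic. The sketch below uses a hypothetical maturation readout (0 for immature, 1 for mature) and an assumed 50/50 split of the population; the numbers are illustrative only.

```python
from statistics import mean

# Hypothetical all-or-nothing readout per oocyte: 0.0 = immature, 1.0 = mature.
population = [0.0] * 50 + [1.0] * 50

# A bulk measurement reports only the population average.
bulk_average = mean(population)

# Count how many individual cells actually resemble that average.
cells_near_average = sum(1 for x in population if abs(x - bulk_average) < 0.25)
```

Here the bulk value sits half-way between the two states, yet no single cell lies anywhere near it, which is exactly the information that single-cell measurements preserve.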

There has been a significant effort within the field to quantitatively measure heterogeneity and relate it to a functional change. One approach to this is to quantify stochastic gene expression. This has been done through dividing stochastic gene expression into its intrinsic and extrinsic components via a two-color reporter experiment and deriving analytical formulas to measure each component of noise (Singh and Soltani, 2013). Systems have also been developed to quantify the individual contribution of unique processes to stochastic gene expression, and therefore to heterogeneity. For example, experimentally generated models have been used to quantify the individual contribution of chromatin dynamics in isogenic chicken-cell populations (Viñuelas et al., 2013). Also, shifted gene expression dynamics have been shown to drive cell fate choice for hematopoietic progenitors (Kleppe et al., 2017), induced pluripotent stem cells (iPSCs), and the mouse inner-cell mass during embryogenesis (Mojtahedi et al., 2016; Bargaje et al., 2017; Mohammed et al., 2017).
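A minimal sketch of the two-color decomposition is given below. It uses the estimators commonly associated with the dual-reporter assay (Elowitz et al., 2002), in which `c` and `y` are the per-cell intensities of two identically regulated reporters; whether these match the exact formulas derived by Singh and Soltani (2013) is an assumption, and the toy intensities are hypothetical.

```python
from statistics import mean

def noise_decomposition(c, y):
    """Return (intrinsic, extrinsic, total) squared noise from paired reporter levels.

    intrinsic^2 = <(c - y)^2> / (2 <c><y>)
    extrinsic^2 = (<c y> - <c><y>) / (<c><y>)
    total^2     = intrinsic^2 + extrinsic^2
    """
    norm = mean(c) * mean(y)
    intrinsic = mean((ci - yi) ** 2 for ci, yi in zip(c, y)) / (2 * norm)
    extrinsic = (mean(ci * yi for ci, yi in zip(c, y)) - norm) / norm
    return intrinsic, extrinsic, intrinsic + extrinsic

# Toy data (hypothetical): both reporters always agree within a cell, so all of
# the cell-to-cell variation is attributed to the extrinsic component.
c = [10.0, 20.0, 30.0, 40.0]
y = [10.0, 20.0, 30.0, 40.0]
intrinsic, extrinsic, total = noise_decomposition(c, y)
```

Differences between the two reporters within the same cell contribute to the intrinsic term, while variation that both reporters share across cells contributes to the extrinsic term.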

### UTILIZATION OF scNGS

To best understand cellular heterogeneity, single cells must be studied individually through the use of scNGS. Since the difficulty of assessing cellular co-occurrence is the main drawback of bulk-NGS, many studies have also been conducted to further elucidate clonal structures using single-cell DNAseq [including whole exome sequencing (WES) or WGS], bisulfite sequencing, and ATACseq (assay for transposase-accessible chromatin, ATAC). Given the variability and importance of gene expression, sc-RNAseq is one of the most used single-cell sequencing techniques (**Supplementary Table S1**). Single-cell multi-omic analyses are also possible to uncover the true level of heterogeneity across expression levels within cells (Macaulay et al., 2017), which enable examination of the genome, transcriptome, and epigenome at once. scNGS has the ability to resolve noise in bulk-NGS through the additional ability to trace generated reads back to their cell of origin. However, this added benefit comes at a steep monetary cost, as single-cell sequencing is still much more expensive than more traditional bulk NGS given the need to sequence more (**Supplementary Table S2**). Also, subpopulations of cancer cells can be found by scATAC-seq, which has the power to identify specific chromatin motifs. Indeed, when combined with RNAseq, it has been used to identify epigenetic plasticity between two cell subpopulations (Litzenburger et al., 2017).

There are currently dozens of variations of techniques to study the genome, epigenome, transcriptome, and epitranscriptome of cells, and here, we focus on those most commonly in use (**Supplementary Table S1**). Each of these technologies has had a significant impact on numerous fields, including immunology, oncology, and microbiology. Because the scope of the benefits of single-cell analysis is so wide, there is tremendous pressure to advance the technologies in the field. This is evident in the dramatic increase in recent years in publications referencing single-cell technologies (Wang and Navin, 2015). These techniques are highly varied, from manual manipulation (Pan et al., 2013) to droplet microfluidics used for sc-WGS (Hosokawa et al., 2017) to the creation of an RNA-library (Hedlund and Deng, 2018). Other methods, such as bisulfite sequencing, can also be used on the single-cell level (Clark et al., 2017). A novel approach that combines Raman spectroscopy with an algorithmic biomolecular component analysis (microRaman-BCA) allows for the profiling of single organelles from a cell. Because this technique does not destroy the cell during analysis, the study can be performed multiple times on the same cell, providing a better picture of heterogeneity over time (Kuzmin et al., 2017).

While much of the current knowledge of cellular heterogeneity is transcriptional, newer techniques such as single-cell epigenomics have tremendous potential to study heterogeneity (Hassan et al., 2017) and may be able to provide further insights into the characterization and mechanisms of heterogeneity (Clark et al., 2016). Several topics in epigenomics are best suited to study with single-cell methods, including the relationship between transcriptional heterogeneity and epigenetic heterogeneity, which may vary greatly from cell to cell. Another application of single-cell sequencing is to study tumor resistance and therapeutic response to decrease the chance of resistance or relapse. scNGS can be used not only to detect heterogeneous subclones within a tumor, but also to characterize these cells. Additionally, it can be used to characterize metastases and to create an effective treatment plan that minimizes the chance of chemotherapeutic resistance of specific subclones (Liang and Fu, 2017). In one study, analysis via deep whole-exome sequencing revealed that 75% of relapsed tumors in pediatric B-acute lymphoblastic leukemia were descendants of originally rare subclones (Ma et al., 2015). Given technical and sampling limitations, it is possible that resistant subclones existed within more patients. Although scNGS is currently expensive, treatment for cancer is often much more expensive. For this reason, any possible technique that could lead to a more effective therapy (even an expensive one like scRNA-seq) has clinical potential (Shalek and Benson, 2017).

Additionally, subclones can communicate and interact with each other, leading to complex relationships that may only be fully elucidated via scNGS. Although some of these interactions are neutral, they can also be positive (leading to a commensalistic/mutualistic relationship in which one or both of the subclones benefit), or negative (e.g., leading to competition between subclones), and can contribute to the chemotherapeutic resistance of one or more subclones within a tumor (Tabassum and Polyak, 2015). For instance, one study demonstrated that various clonal lineages in a case of colorectal cancer responded differently to treatment with chemotherapy (Kreso et al., 2013). Furthermore, there is evidence that parallel evolution of various subclones within a tumor can lead to polyclonal resistance (Gerlinger et al., 2014). Finally, intratumor heterogeneity makes it more difficult to precisely identify a tumor, either histologically or genetically, via a traditional biopsy (Tellez-Gabriel et al., 2016).

The implications of tumor heterogeneity in cancer evolution, clinical treatment, and tumoral spatial organization are not yet fully understood (Alizadeh et al., 2015), but scNGS provides a mechanism for beginning to unravel these relationships. Although heterogeneity makes the histological and genetic identity of a tumor more ambiguous, if the mechanisms driving heterogeneity are further elucidated, they may lead to a better understanding of carcinogenesis (Gay et al., 2016). Moreover, data gathered from single-cell sequencing may help to clarify the methods of cancer progression and subclone resistance to chemotherapeutic treatment by sequencing both smaller transcripts and whole genomes in single cellular representatives of heterogeneous populations (Baslan and Hicks, 2017).

Interestingly, scNGS also has implications in lineage tracking in the development of differentiated tissues, as it may help to further clarify the developmental pathways involved in tissue differentiation (Kester and van Oudenaarden, 2018). As discussed, the nervous and immune systems are both well-studied examples of cellular systems that display cellular heterogeneity. For example, this technique can be used to study the central nervous system, and has the potential to not only molecularly classify various neurons or groups of neurons, but also to further study the molecular mechanisms behind, and possible therapies for, neurological diseases (Ofengeim et al., 2017). Indeed, this application can also be utilized to type sperm and oocytes, allowing for the confirmation and subsequent study of recombination events and polymorphisms in these haploids (Zhang et al., 1992).

### DESIGN OF scNGS EXPERIMENTS

One of the key questions in planning the methodology of a single-cell study is how many cells to sequence. Sequencing more cells enables a greater representation of the cells in a population, giving a more accurate model of the diversity of subclones. The number of single cells sequenced in a study has scaled exponentially with the development of new technologies. In 2009, for example, only one cell could be sequenced at a time. By 2017, however, the technology had advanced enough to permit the analysis of hundreds of thousands of cells at once (Svensson et al., 2018), raising the possibility of generating exabytes and even yottabytes of data in the future.

Many complexities exist with scNGS analyses and need to be carefully considered. Other works have covered the specific differences, benefits, and drawbacks of the various scNGS protocols (Kanter and Kalisky, 2015; Clark et al., 2016; Haque et al., 2017; Liang and Fu, 2017). Previous data have shown that the scNGS technology best suited to a given hypothesis should be used, in tandem with a proper experimental design for the number of cells. Because of this, the number of cells necessary to address a given question or tissue model will vary widely depending on the overall hypothesis. However, the question of "how many cells should I sequence" can be simplified to how many cells you need to sample in order to capture at least one subclonal cell. The chance of sampling a subclone from a tissue of interest depends on the subclonal prevalence, the size of the tissue, and the size of the sample. Therefore, this question can be modeled using the hypergeometric distribution with varying degrees of probability (**Figure 1A**). It is common within scNGS analysis to require multiple cells to contain a given phenotype, and it may therefore be more appropriate to ask "how many cells should I sample to capture at least three subclonal cells" (**Figure 1B**).

We have built a model to demonstrate that the number of cells required for a sampling design can vary widely. As an example, if the goal were to sample a tissue of 1 billion cells for a previously undefined stem cell that exists at a prevalence of 0.01%, you would have a 99% chance of sampling at least one stem cell if you analyzed approximately 46,000 cells. However, to truly characterize and identify this subclonal population, or to detect one at a lower prevalence threshold, the number of cells required could easily reach, or even surpass, 100,000 depending on tissue size (**Figure 1B**). Given the recent advances in scNGS and decreases in cost, this is now feasible. Such a design, while completely impossible 5 years ago, should be strongly considered when designing experiments today.
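The sampling question can be explored numerically. The following is a minimal sketch, not the model behind Figure 1: it assumes simple random sampling without replacement from a well-mixed tissue, and the function names (`log_comb`, `prob_at_least`, `min_cells_to_sample`) are hypothetical. Probabilities are computed through log-gamma, so for a tissue of a billion cells they are accurate only to a few parts in 100,000, which is ample for planning purposes.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def prob_at_least(k, tissue_cells, subclone_cells, sample_cells):
    """P(X >= k) when X ~ Hypergeometric(N=tissue, K=subclone, n=sample)."""
    N, K, n = tissue_cells, subclone_cells, sample_cells
    log_total = log_comb(N, n)
    p_fewer = 0.0
    for j in range(k):
        # Exactly j subclonal cells can only be drawn inside the support.
        if j <= min(K, n) and n - j <= N - K:
            log_p = log_comb(K, j) + log_comb(N - K, n - j) - log_total
            p_fewer += math.exp(log_p)
    return 1.0 - p_fewer


def min_cells_to_sample(k, tissue_cells, subclone_cells, confidence=0.99):
    """Smallest sample size with P(at least k subclonal cells) >= confidence."""
    if subclone_cells < k:
        raise ValueError("the tissue holds fewer than k subclonal cells")
    lo, hi = k, tissue_cells
    while lo < hi:  # binary search: the probability grows with sample size
        mid = (lo + hi) // 2
        if prob_at_least(k, tissue_cells, subclone_cells, mid) >= confidence:
            hi = mid
        else:
            lo = mid + 1
    return lo


tissue = 10**9               # cells in the tissue (example from the text)
subclone = tissue // 10_000  # 0.01% prevalence, i.e. 100,000 cells
print(min_cells_to_sample(1, tissue, subclone))  # at least one subclonal cell
print(min_cells_to_sample(3, tissue, subclone))  # at least three subclonal cells
```

Under these assumptions the first call returns a value close to 46,000, in line with the figure quoted above, and requiring three subclonal cells instead of one raises the required sample size substantially.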

### THE FUTURE OF SINGLE-CELL ANALYSES

While single-cell sequencing has many advantages, it is certainly not a perfect technique. There are many different techniques for obtaining single-cell sequencing data, including single-cell whole genome sequencing (sc-WGS), and each of these methods presents its own unique strengths and weaknesses. Multiple displacement amplification (MDA) and PCR-based amplification techniques often experience significant amplification bias (de Bourcy et al., 2014; Ahsanuddin et al., 2017). This could lead to incorrect interpretation of the prevalence and diversity of certain genes. Nonetheless, thanks to the breakthroughs in scNGS, the long-sought goal of sequencing single cells is now possible. This has created significant opportunities for advancement in the study of heterogeneity, especially as it applies to cancer. While it may be necessary to sample thousands or even millions of cells to encounter a unique subclone at low prevalence within a large tissue, sequencing continues to get cheaper, and thus scNGS will continue to open up many new research directions into the mechanisms of heterogeneity and the study of variation at cell-by-cell resolution.

### AUTHOR CONTRIBUTIONS

CM and SG conceived and designed the study. CM, SG, and MM analyzed the data. SG, MM, EA, AM, and CM wrote the paper. All authors reviewed, edited, and approved the manuscript.

### FUNDING

This work was supported by funding from the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, Bert L and N Kuggie Vallee Foundation, the WorldQuant Foundation, The Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G and NNX17AB26G), the National Institutes of Health (R25EB020393, R01NS076465, R01AI125416, R01ES021006, 1R21AI129851, and 1R01MH117406), and the Bill and Melinda Gates Foundation (OPP1151054).

### ACKNOWLEDGMENTS

We would like to thank the Epigenomics Core Facility at Weill Cornell Medicine, as well as the Starr Cancer Consortium (I9-A9-071).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00008/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Goldman, MacKay, Afshinnekoo, Melnick, Wu and Mason. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Single-Cell Transcriptomics Bioinformatics and Computational Challenges

### Olivier B. Poirion<sup>1†</sup>, Xun Zhu<sup>1,2†</sup>, Travers Ching<sup>1,2</sup> and Lana Garmire<sup>1</sup>\*

*<sup>1</sup> Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, USA, <sup>2</sup> Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, USA*

The emerging single-cell RNA-Seq (scRNA-Seq) technology holds the promise to revolutionize our understanding of diseases and associated biological processes at an unprecedented resolution. It opens the door to revealing intercellular heterogeneity and has been applied in a variety of settings, ranging from characterizing cancer cell subpopulations to elucidating tumor resistance mechanisms. In parallel with improving experimental protocols to deal with technological issues, deriving new analytical methods to interpret the complexity in scRNA-Seq data is just as challenging. Here, we review current state-of-the-art bioinformatics tools and methods for scRNA-Seq analysis, and address some critical analytical challenges that the field faces.

### Edited by:

*H. Steven Wiley, Pacific Northwest National Laboratory, USA*

### Reviewed by:

*Seth G. N. Grant, University of Edinburgh, UK Milind Ratnaparkhe, Indian Institute of Soybean Research (ICAR), India*

> \*Correspondence: *Lana Garmire lgarmire@cc.hawaii.edu*

*† These authors have contributed equally to this work.*

### Specialty section:

*This article was submitted to Genomic Assay Technology, a section of the journal Frontiers in Genetics*

Received: *01 May 2016* Accepted: *02 September 2016* Published: *21 September 2016*

### Citation:

*Poirion OB, Zhu X, Ching T and Garmire L (2016) Single-Cell Transcriptomics Bioinformatics and Computational Challenges. Front. Genet. 7:163. doi: 10.3389/fgene.2016.00163*

Keywords: single-cell genomics, single-cell analysis, bioinformatics, heterogeneity, microevolution

### INTRODUCTION

Characterization of genomic signatures in individual patients is a key step toward the realization of precision medicine. Recently, next-generation sequencing (NGS) based RNA expression profiling (RNA-seq) has made broad impacts on biomedical fields. However, population-averaged RNA-seq has limited discovery power, and it can also mask the presence of rare subpopulations of cells (such as cancer stem cells) and thus may overlook important biological insights. The emerging single-cell RNA-Seq (scRNA-Seq) technology is designed to overcome these limitations by investigating expression profiles at the cell level. In just a few years, the number of scRNA-Seq experiments has grown at a faster-than-exponential rate. This new approach offers the potential to revolutionize our understanding of diseases and associated biological processes, with the capacity to reveal the intercellular heterogeneity within a specific tissue at an unprecedented resolution (Yan et al., 2013; Trapnell et al., 2014). Using single-cell level features, we can infer cell lineages (Treutlein et al., 2014), identify subpopulations (Trapnell et al., 2014) and highlight cell-specific biological characteristics (Tang et al., 2010). Moreover, single-cell analyses have already demonstrated their utility in clinical applications, ranging from characterizing cancer cell subpopulations (Navin et al., 2011; Patel et al., 2014; Ting et al., 2014) and highlighting specific resistance mechanisms (Kim, K. T. et al., 2015; Miyamoto et al., 2015) to serving as diagnostic tools (Ramsköld et al., 2012; Kvastad et al., 2015).

Despite the expansion of scRNA-Seq studies and the rapid maturing of experimental methods, major analytical challenges remain as a consequence of the experimental procedures. One major challenge is that scRNA-Seq datasets present a very high level of noise (Brennecke et al., 2013; Kharchenko et al., 2014). Much of the noise is due to the nature of single-cell technologies. Because of the extremely low amount of starting biological material in a single cell, amplification processes are required. These procedures are prone to distortion and contamination (Leng et al., 2015). To tackle these issues, rigorous efforts have been made to develop analytical methods for scRNA-Seq data. Here, we summarize current state-of-the-art bioinformatics analysis tools and methods for scRNA-Seq (**Figure 1** and **Table 1**), and address some critical analytical challenges that we are facing. The first section describes specific pre-processing steps for noise removal of scRNA-Seq datasets. The second section reviews specific scRNA-Seq bioinformatics analysis procedures with emphasis on subpopulation detection. The third section focuses on microevolution analysis for scRNA-Seq data. In the last section, we highlight the challenges to be addressed and the work to be accomplished in the scRNA-Seq bioinformatics field.

### DATA PREPROCESSING AND NOISE REMOVAL

### Quality Control

scRNA-Seq experiments generate FASTQ files from the sequencing machine, which contain millions of reads composed of RNA sequences and add-on sequences (UMI tag, cell tag, etc.). These reads need to be pre-processed before being aligned back to the reference genome. For scRNA-seq, pre-processing and quality control (QC) analyses similar to those for bulk RNA-seq are used. Cutadapt (Martin, 2011) is a tool that removes adapter sequences, and Trimmomatic (Bolger et al., 2014) performs quality-based trimming in addition to removing adapter sequences. These tools are commonly used in scRNA-seq experiments (Treutlein et al., 2014; Handel et al., 2016; Hou et al., 2016). Other generic quality control tools such as FastQC or HTQC (Yang et al., 2013) might also be useful to produce quality metrics. Finally, it is worth noting that platform-specific QC tools such as SolexaQA (Cox et al., 2010) provide QC pipelines specific to Illumina sequencing, with trimming and quality-based filtering.

Other QC procedures for scRNA-seq involve the analysis of the expression of housekeeping genes (Ting et al., 2014; Treutlein et al., 2014), overall gene expression patterns (Zeisel et al., 2015) and the number of genes or reads detected per cell (Kumar et al., 2014). However, one issue with these approaches is that the thresholds chosen for filtering are arbitrary and should differ according to the dataset (Jiang, P. et al., 2016). SinQC (Jiang, P. et al., 2016) and SCell (Diaz et al., 2016) are two QC tools specifically designed for scRNA-seq data. SinQC uses sequencing library quality to confirm gene expression outliers. It computes different quality metrics (e.g., total number of mapped reads, mapping rate and library complexity) to identify a user-specified fraction of the dataset as noise. SCell is a versatile tool that allows for outlier detection. It estimates which genes are expressed at the background level using the Gini index, which measures statistical dispersion, and removes samples whose background fraction is significantly higher than the average. Recently, a new mapping and quality assessment pipeline, Celloline, was introduced to detect low-quality cells from expression profiles, using curated biological and technical features (Ilicic et al., 2016).

### Alignment

To our knowledge, there are currently no aligners dedicated to scRNA-seq, and scRNA-seq studies use existing aligners made for bulk RNA-Seq. TopHat is one of the most popular aligners capable of detecting novel splice junctions (Trapnell et al., 2009; Kim et al., 2013), and it is widely used in scRNA-seq studies (Treutlein et al., 2014; Fan et al., 2016; Freeman et al., 2016; Handel et al., 2016; Hou et al., 2016). RNA-Seq by Expectation Maximization, or RSEM, is a popular framework that includes an aligner (Li and Dewey, 2011). It is also used in some scRNA-seq studies (Gao et al., 2016; Kimmerling et al., 2016; Meyer et al., 2016). Other aligners used in scRNA-Seq studies include MapSplice (Wang et al., 2010), GSNAP (Brennecke et al., 2013; Buettner et al., 2015; Wu et al., 2016), and STAR (Dobin and Gingeras, 2015; Moignard et al., 2015; Petropoulos et al., 2016). Among these aligners, TopHat and STAR were found to be about one to two orders of magnitude faster than GSNAP and MapSplice (Engström et al., 2013). More recently developed aligners include Kallisto (Bray et al., 2016) and HISAT (Kim, D. et al., 2015). Kallisto uses pseudo-alignment with hashing of de Bruijn graphs and avoids alignment altogether, which drastically improves the speed of expression quantification. HISAT (hierarchical indexing for spliced alignment of transcripts) also seems promising in terms of speed and accuracy. It is worth mentioning that some major scRNA-Seq methods do not obtain enough coverage across the gene to measure alternative splicing; therefore, algorithms for isoform measurements are not as critical in scRNA-Seq, at least at this stage.

### Feature Quantification

Feature quantification is the process of converting alignment results into a gene expression profile. An expression profile is conventionally represented as a numeric matrix where rows are genes and columns are cells. Each entry in the matrix is the abundance of a particular gene or transcript in a particular sample. Just as is the case for aligners, most scRNA-Seq studies use canonical feature quantification methods applied to bulk RNA-Seq.

Quantification methods for gene expression differ dramatically. The simplest approach, employed by programs such as HTSeq (Anders et al., 2014) and FeatureCounts (Liao et al., 2013), is to count the number of reads located within the boundaries of a gene. These programs have simple but flexible parameters for determining read counts in the case of overlapping genes, and were used in some scRNA-Seq studies (Brennecke et al., 2013; Moignard et al., 2015; Fan et al., 2016; Handel et al., 2016). More sophisticated approaches calculate probabilistic estimates of gene expression. For example, RSEM and Cufflinks both employ a maximum likelihood approach (Trapnell et al., 2010; Li and Dewey, 2011). These programs are based on statistical models where reads in an RNA-Seq sample are observed random variables predicted from latent variables, such as the transcript sequence, strand and length. The new Kallisto pipeline (Bray et al., 2016), described above, was shown to be up to two orders of magnitude faster than previous aligner-quantifier combinations (Ntranos et al., 2016). Interestingly, while probabilistic approaches are conceptually more refined, simple counting programs such as HTSeq and FeatureCounts showed comparable or even stronger performance (Chandramohan et al., 2013; Fonseca et al., 2014), suggesting that these probabilistic models are yet to be improved.

Given the uncertainties of quantifying fragments post-amplification, a new technique was shown to reduce amplification noise by introducing random sequences called unique molecular identifiers, or UMIs (Islam et al., 2014). UMIs are attached to individual RNA molecules before amplification and are used to track transcripts directly rather than relying on sophisticated statistical modeling. This approach may lead to a workflow that differs from conventional fragment-based quantification methods (e.g., in gene filtering and normalization).

### Gene Filtering

Due to the high level of noise in scRNA-Seq datasets, it is necessary to filter out low-quality genes and samples. Various practices have been adopted to filter out genes that are expressed in too few samples (Brennecke et al., 2013; Treutlein et al., 2014; Petropoulos et al., 2016). Usually, a gene is defined as "expressed" by a minimal expression level threshold. For experiments that quantify gene expression with fragment counting, an FPKM (Fragments Per Kilobase of transcript per Million mapped reads) threshold is appropriate. Common FPKM thresholds are 1 (Freeman et al., 2016) and 10 (Petropoulos et al., 2016). Other studies set the threshold in Transcripts Per Million (TPM) instead of FPKM (Meyer et al., 2016). A better filtering reference could come from External RNA Controls Consortium (ERCC) spike-ins added to the experiment, which provide calibration of the relative amount of starting material (Brennecke et al., 2013; Treutlein et al., 2014).
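The threshold-based filtering described above can be sketched in a few lines. This is an illustration only: the function name `filter_genes`, the `min_cells` parameter, and the toy FPKM values are assumptions, and the cited studies differ in the thresholds and minimum cell numbers they use.

```python
def filter_genes(expression, threshold=1.0, min_cells=2):
    """Keep genes whose expression reaches `threshold` in at least `min_cells` cells.

    `expression` maps a gene name to a list of per-cell values (e.g. FPKM).
    """
    kept = {}
    for gene, values in expression.items():
        # A gene counts as "expressed" in a cell when it meets the threshold.
        n_expressing = sum(1 for value in values if value >= threshold)
        if n_expressing >= min_cells:
            kept[gene] = values
    return kept


# Toy FPKM matrix: three genes measured in four cells (made-up numbers).
fpkm = {
    "GeneA": [0.0, 12.4, 3.1, 0.0],
    "GeneB": [0.2, 0.0, 0.0, 0.9],
    "GeneC": [15.0, 0.0, 0.0, 0.0],
}
print(sorted(filter_genes(fpkm, threshold=1.0, min_cells=2)))  # ['GeneA']
```

Raising `threshold` to 10 and lowering `min_cells` to 1 would instead keep the two genes that are strongly expressed in at least one cell, which shows how sensitive the retained gene set is to these choices.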

### TABLE 1 | List of single-cell analytical tools mentioned in this chapter.

*Links for their availability are attached.*

Recently, specific methods have been developed to filter genes from scRNA-seq datasets. OEFinder is designed to identify artifact genes from scRNA-seq experiments using the Fluidigm C1 platform for cell capture (Leng et al., 2016). For experiments that quantify gene expression with UMI counting, one can directly set up a molecule number threshold, e.g., 25 (Zeisel et al., 2015). It is also recommended to remove UMIs whose read count is less than 1/100 of the average non-zero UMI read count, in order to avoid erroneous UMIs generated during amplification.
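The read-support rule for UMIs can be illustrated as follows. This is a sketch under assumptions: the text does not say whether the average is taken per gene, per cell, or over the whole library, so the example pools all UMIs passed to it, and the function name `drop_low_read_umis` and the example counts are hypothetical.

```python
def drop_low_read_umis(umi_reads, fraction=0.01):
    """Remove UMIs supported by fewer reads than `fraction` of the mean.

    `umi_reads` maps a UMI sequence to its number of supporting reads.
    The mean is taken over UMIs with a non-zero read count.
    """
    nonzero = [count for count in umi_reads.values() if count > 0]
    if not nonzero:
        return {}
    cutoff = fraction * sum(nonzero) / len(nonzero)
    return {umi: count for umi, count in umi_reads.items() if count >= cutoff}


# Made-up read counts: the mean is 800, so the cutoff is 8 reads.
reads = {"AACGT": 950, "AACGA": 4, "GGTCA": 1200, "TTAGC": 1046}
print(drop_low_read_umis(reads))
```

In this toy input, `AACGA` differs from `AACGT` by one base and has very few reads, the pattern expected from an amplification or sequencing error, and it is the only UMI removed.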

### Removal of Confounding Factors

When the entire data set consists of several runs of experiments with potentially varied conditions, systematic variations called batch effects might be introduced. These artifacts may pose substantial problems to downstream statistical analysis, or even mask biological signals. For studies concerning over-dispersion of gene expression, it is necessary to factor out the extra variance caused by the systematic differences between batches (Fan et al., 2016). The appropriate way to compensate for batch effects depends on the quantification method as well as the downstream analysis. For most studies, batch effects can be eliminated by down-sampling, although this reduces the complexity of the data (Wang et al., 2012; Dey et al., 2015; Grün and van Oudenaarden, 2015). For studies that use traditional fragment counting, COMBAT (Johnson et al., 2007) is a batch-effect removal method that is based on an empirical Bayes framework and purports to be robust to outliers for small sample sizes. It was originally designed for microarray data but has been used in scRNA-Seq experiments (Kim, K. T. et al., 2015). Although unsupervised batch effect detection or removal methods exist (Leek, 2014), the batches called by such methods often correlate highly with subpopulations detected by other scRNA-Seq methods (Finak et al., 2015). Since subpopulations are usually of interest for the biological insights they provide, unsupervised batch effect removal methods should be used with discretion in single-cell experiments.

Besides batch-effect removal, it is also important to separate the technical component of the noise from biological variability. The technical noise level of a gene correlates with its average expression level. Thus, a probabilistic model can be built to fit this correlation using technical spike-ins and further infer the biological variability of each gene (Brennecke et al., 2013). For most studies, it is also desirable to prevent the ubiquitous cell-cycle-induced variation from masking other interesting biological variations. scLVM is a package that introduces a cell-cycle factor removal step before subpopulation detection (Buettner et al., 2015). Recently, a new package called ccRemover was developed to remove the principal components that are identified as cell-cycle affected; it is claimed to perform better than scLVM on several simulated and real datasets (Barron and Li, 2016).

### Normalization

In scRNA-seq experiments, technical factors such as read depth, cell capture efficiency, and 3′ bias or full sequence coverage due to particular library prep methods might differ among scRNA-Seq data sets. Thus, raw read counts should be normalized before downstream analyses. This procedure helps to ensure that the differences between the values in the matrix correctly reflect the abundance differences of transcripts or genes between the cells. When experiments are designed with ERCC spike-ins, ERCC can be used as internal controls and serve as anchors for normalization. GRM is a scRNA-seq normalization tool that fits a Gamma Regression Model between the reads (FPKM, RPKM, TPM) and spike-ins (Ding et al., 2015). The trained model is then used to estimate gene expression from the reads. BASICS, another recent workflow, provides a Bayesian model that infers cell-specific normalization factors (Vallejos et al., 2015). This workflow estimates the technical variability using spike-ins. Finally, SAMstrt (Katayama et al., 2013) is an earlier algorithm that applies to spike-ins the resampling normalization procedure of SAMseq, an algorithm originally developed for bulk RNA-seq (Li and Tibshirani, 2013).

For experiments without spike-ins, if the quantification is count-based, one can normalize the expression profile by the scaling methods used in DESeq, edgeR, etc. (Love et al., 2014). A new scRNA-seq-specific procedure proposes a de-convolution approach on the pooled counts of gene expression for multiple cells, which allows the size factor of individual cells to be inferred without using spike-ins (Aaron et al., 2016). The authors claimed that their approach improved the accuracy of the normalization compared with existing methods. However, experiments designed with UMIs, as mentioned earlier, quantify gene expression on an absolute basis and thus do not need computational normalization.
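As an illustration of count-based scaling, the sketch below computes median-of-ratios size factors, the scaling idea used by DESeq. It is a simplified stand-in rather than that package's implementation: the function name `size_factors` and the toy counts are assumptions, and only genes detected in every cell are used to form the reference.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per cell.

    `counts` is a list of genes, each a list of raw counts per cell.
    Only genes with a non-zero count in every cell contribute.
    """
    n_cells = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene is detected in every cell")
    # Log of each usable gene's geometric mean across cells (the reference).
    log_reference = [sum(math.log(c) for c in row) / n_cells for row in usable]
    factors = []
    for j in range(n_cells):
        log_ratios = [math.log(row[j]) - ref
                      for row, ref in zip(usable, log_reference)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: cell 2 was sequenced twice as deeply as cell 1.
raw = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(raw)
normalized = [[c / f for c, f in zip(row, factors)] for row in raw]
print(factors)
```

Dividing each cell's counts by its size factor puts the two cells on a common scale. In sparse single-cell matrices few genes may be detected in every cell, which is one practical limitation of applying this kind of bulk-oriented scaling directly.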

### Differential Expression

Differential expression (DE) analysis is the process of calling genes that show statistically significant differences in expression between pre-specified groups of samples. Although DE is typically not the main objective of a single-cell experiment design, as it requires pre-defined grouping information among cells of interest, it is nevertheless common in scRNA-Seq experiments. Simple statistical methods such as the t-test and the Wilcoxon rank sum test are used in scRNA-Seq workflows such as SINCERA (Guo et al., 2015). Interestingly, edgeR and DESeq2, two DE methods developed for bulk RNA-Seq, gave the best results for some scRNA-Seq data (Schurch et al., 2016).

The dropout event is a type of noise characteristic of scRNA-Seq that rarely occurs in bulk RNA-Seq experiments. It refers to the phenomenon that a gene is observed to be abundantly expressed in one cell but is not detectable in another cell, as a consequence of transcript loss in the reverse-transcription step. To account for frequent dropout events and biological variability within a cell population, more sophisticated algorithms have been developed for scRNA-Seq data. Single-Cell Differential Expression (SCDE) is a package developed specifically for single-cell differential expression (Kharchenko et al., 2014). The model assumes that observed expression levels in scRNA-Seq data follow a mixture of a negative binomial distribution for amplified genes, as proposed before (Anders and Huber, 2010), and a low-mean Poisson distribution for dropout genes, as is observed in transcriptionally silenced genes. This model is then fit using the Expectation Maximization (EM) algorithm (Kharchenko et al., 2014). The authors reported higher sensitivity in detecting differentially expressed genes compared to DESeq and CuffDiff. More recently, PAGODA improved upon SCDE's method in several aspects, including optimization of the computational process and a refined model for better fitting (Fan et al., 2016). MAST is another scRNA-Seq differential expression detection method that uses a two-part generalized linear model and adjusts for the fraction of cells that express a certain gene (Finak et al., 2015).

Another challenge unique to scRNA-Seq is that some genes may exhibit bimodality, meaning that the expression levels across a group of cells concentrate around two modes instead of one. A beta-Poisson distribution was proposed in order to provide a more accurate differential expression analysis that captures bimodality (Vu et al., 2016). Another tool, Monocle (Trapnell et al., 2014), also has a module for differential expression, which fits the data with a non-parametric generalized additive model. Finally, the BASICS workflow described earlier provides a criterion to detect high- or low-variability genes within a single-cell dataset (Vallejos et al., 2015). However, it is not clear which methods have generally superior performance.

### SUBPOPULATION AND MODULE DETECTION

### General Machine-Learning Approaches

Different classical unsupervised approaches have been used to highlight single-cell subgroups within a population. Principal Component Analysis (PCA) and its variants (e.g., Robust PCA and Kernel PCA) have been used in different single-cell studies (Amir et al., 2013; Yan et al., 2013; Pollen et al., 2014; Trapnell et al., 2014; Treutlein et al., 2014; Satija et al., 2015; Fan et al., 2016; Ilicic et al., 2016). K-means and other distance-based clustering algorithms such as hierarchical clustering or Ward's method are also widely used (Yan et al., 2013; Jaitin et al., 2014; Kharchenko et al., 2014; Lohr et al., 2014; Marco et al., 2014; Pollen et al., 2014; Shin et al., 2015). For example, Jaitin et al. combined hierarchical clustering and probabilistic mixture models to classify single cells from different tissues (Jaitin et al., 2014). A refined clustering method called pcaReduce (Zurauskiene and Yau, 2015) was designed for scRNA-Seq. It iteratively uses PCA combined with K-means to produce the hierarchical tree of the cells. For the distance metrics employed by these methods, Euclidean distance and Pearson and Spearman correlation coefficients have been popular (though perhaps not optimal) choices (Pollen et al., 2014; Rotem et al., 2015).
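To make the three metrics concrete, the sketch below computes them for a pair of cells represented as expression vectors. It is a plain-Python illustration with hypothetical function names and made-up values; the correlation functions assume that neither vector is constant, and a correlation `r` is commonly turned into a dissimilarity as `1 - r`.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            result[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Two cells whose genes are ordered identically but scaled very differently.
cell_a = [0, 1, 4, 9, 16]
cell_b = [0, 2, 3, 50, 51]
print(euclidean(cell_a, cell_b), pearson(cell_a, cell_b), spearman(cell_a, cell_b))
```

In this example the rank-based Spearman coefficient is 1 because the genes are ordered identically in both cells, while the Euclidean distance is large. Such disagreement between metrics is one reason the choice of metric can change the resulting clusters.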

### Machine-Learning Approaches Tailored for scRNA-Seq Analysis

More sophisticated machine-learning algorithms have great potential to overcome some issues of scRNA-Seq functional analysis. A main issue of scRNA-Seq analysis is that gene expression data cannot be expressed as a linear combination of the relationships between two cells in general (Buettner and Theis, 2012; Bendall et al., 2014; Levine et al., 2015). Also, classical similarity measures (such as cosine or Euclidean distances) become less meaningful as the dimensionality increases (Beyer et al., 1999), and may not be appropriate for scRNA-Seq (Xu and Su, 2015). Irrelevant associations may arise with inappropriate metrics when searching for the nearest neighbors on noisy data (Balasubramanian and Schwartz, 2002). Adequate analytical methods for scRNA-Seq data should also be able to highlight "rare events," such as the small fraction of metastatic cancer cells amongst a large cell population (Bose et al., 2015; Shin et al., 2015). We describe the scRNA-Seq specific algorithms below in the order of dimension reduction, clustering, and other clustering variant methods. The datasets that were used to test these algorithms are listed in **Table 2**.

Among the dimension reduction methods, the Zero-inflated factor analysis (ZIFA) algorithm is a new method that accounts for dropout events by representing the probability of gene dropout as an exponential function of its mean expression (Pierson and Yau, 2015). Using a latent variable model based on factor analysis, ZIFA reduces the dimension of an scRNA-Seq dataset and allows the expression of each gene to be zero with some probability. Experiments in the original study suggest that ZIFA is a more robust alternative to PCA. As mentioned earlier, scLVM is another method for identifying cell subpopulations, which features removal of confounding factors such as cell-cycle effects (Buettner et al., 2015). It first computes cell-to-cell covariance using a set of marker genes related to hidden biological factors of interest (such as the cell cycle). Another approach, PAGODA, as mentioned before, uses a weighted PCA to characterize multiple aspects of heterogeneity in mouse neuronal progenitors (Fan et al., 2016). PAGODA evaluates over-dispersion of individual genes using error models.

SIMLR is a new clustering method designed to learn a distance metric that best fits the structure of the data. It infers a distance function as a linear combination of several distance metrics (Wang et al., 2016). It is designed to tackle the heterogeneity observed amongst single-cell datasets, which is related to both technological differences across platforms and biological differences across studies. In another single-cell clustering approach named analysis of scRNA-seq based on transcript-compatibility counts (AscTC), read counts from an scRNA-Seq dataset are transformed into probabilities using transcript-compatibility counts, rather than the conventional transcript abundance (Ntranos et al., 2016). Individual cells are clustered using an affinity propagation algorithm, a derivative of spectral clustering.

A few other hierarchical clustering approaches are worth mentioning. Geneteam is a multi-level recursive clustering method that searches for bipartitions of cells sharing exclusive expression profiles for a subset of genes (Harris et al., 2015). Similarly, Backspin is a divisive hierarchical clustering algorithm that clusters both genes and cells (Zeisel et al., 2015). It uses the SPIN algorithm (Tsafrir et al., 2005) at each iteration to sort the expression matrix and then separates genes (rows) and cells (columns) into two groups by a specific splitting criterion. Alternatively, BISCUIT is a new iterative normalization and clustering procedure based on the Dirichlet Process, which was designed to correct technical variation in scRNA-seq together with cell clustering (Prabhakaran et al., 2016).

### Graph Approaches beyond Clustering

Traditional clustering methods lack the function of inferring the inherent lineage between cells. Common approaches for cell lineage inference require the creation of a graph or a tree, where single cells are represented as nodes and edges between the cells indicate their similarities. The lengths of the edges are computed from a similarity matrix based on a given metric. Before constructing the graph, a de-noising procedure is necessary. A useful de-noising procedure is to compute the k-Nearest-Neighbor graph (kNNG; Bendall et al., 2014; Levine et al., 2015; Xu and Su, 2015). Samples from the kNNG can then be compared using the geodesic distance, defined as the length of the shortest path between two nodes (Bendall et al., 2014). Such an approach can remove "shortcuts" between irrelevant pairs of samples that arise from the curse of dimensionality (Tenenbaum et al., 2000). Clustering analysis can then be performed on the graph using community detection algorithms (Fortunato, 2010). Xu and Su first used Euclidean distance to compute a Shared Nearest-Neighbor (SNN) graph, then searched for quasi-cliques to obtain clusters of cells (Xu and Su, 2015). Quasi-cliques are communities of nodes that are densely but not necessarily fully connected. Highly Connected Subgraph (HCS) is another community detection algorithm that showed performance very similar to that of SNN (Hartuv and Shamir, 2000).
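The difference between a direct distance and a geodesic distance on a kNNG can be seen in a small example. The sketch below is a minimal illustration rather than the procedure of any cited tool: it assumes Euclidean distances, makes the graph undirected by keeping an edge whenever either endpoint lists the other among its neighbors, and uses hypothetical function names.

```python
import heapq
import math


def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as {node: {neighbour: distance}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the edge in both directions
    return graph


def geodesic_distances(graph, source):
    """Shortest-path (geodesic) distance from `source` to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Cells laid out along a U-shaped trajectory (toy coordinates).
cells = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (1, 3), (0, 3)]
graph = knn_graph(cells, k=2)
print(math.dist(cells[0], cells[7]))       # straight-line "shortcut"
print(geodesic_distances(graph, 0)[7])     # distance along the trajectory
```

The two ends of the U are close in a straight line but far apart along the graph, which is the kind of shortcut the kNNG is meant to remove.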

### MICROEVOLUTION OF SINGLE CELLS

### Inference without Spatial and Temporal Information

scRNA-Seq data are also informative for revealing single-cell microevolution. Different algorithms have been specifically designed for scRNA-Seq to infer a pseudo-temporal ordering of single cells. Monocle is the first scRNA-Seq bioinformatics tool to infer the temporal ordering of single cells (Trapnell et al., 2014). It first uses Independent Component Analysis (ICA) to reduce the dimension, then computes a Minimum Spanning Tree (MST) on the graph constructed from the Euclidean distance between cell pairs. The MST connects all nodes of a graph using edges with a minimal total weight. Monocle relies on the hypothesis that the longest path through the MST corresponds to the longest series of transcriptionally similar cells. Another similar method, Waterfall, uses PCA coupled with k-means to produce clusters, then connects the cluster centroids with an MST (Shin et al., 2015). Similar to Waterfall, TSCAN is a new approach based on MST. Cells are first clustered using a model-based approach before constructing an MST, which reduces the complexity of the tree space (Ji and Ji, 2016).
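The MST-based ordering idea can be sketched as follows. This is a simplified illustration and not Monocle's actual implementation: it skips the dimension reduction step, builds the MST with Prim's algorithm on raw Euclidean distances, and takes the longest path by number of edges (found with two breadth-first searches). All function names and the toy data are hypothetical.

```python
import math
from collections import deque


def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns adjacency lists."""
    tree = {i: [] for i in range(len(points))}
    # best[j] = (distance from j to the growing tree, tree node achieving it)
    best = {j: (euclidean(points[0], points[j]), 0) for j in range(1, len(points))}
    while best:
        j = min(best, key=lambda node: best[node][0])
        _, parent = best.pop(j)
        tree[parent].append(j)
        tree[j].append(parent)
        for m in best:
            d = euclidean(points[j], points[m])
            if d < best[m][0]:
                best[m] = (d, j)
    return tree


def farthest_path(tree, start):
    """Path from start to a node at maximal edge distance, found by BFS."""
    parent = {start: None}
    queue = deque([start])
    last = start
    while queue:
        last = queue.popleft()  # the final node popped is among the deepest
        for nxt in tree[last]:
            if nxt not in parent:
                parent[nxt] = last
                queue.append(nxt)
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def pseudotime_backbone(points):
    """Longest path (in edges) through the MST, used as a cell ordering."""
    tree = minimum_spanning_tree(points)
    end = farthest_path(tree, 0)[-1]   # first sweep: find one end of the tree
    return farthest_path(tree, end)    # second sweep: walk to the other end


# Toy data (hypothetical): five cells sampled along one trajectory, in shuffled order.
cells = [(2.0, 0.1), (0.0, 0.0), (4.0, 0.0), (1.0, 0.2), (3.0, -0.1)]
backbone = pseudotime_backbone(cells)
print(backbone)
```

The recovered order has no inherent direction, so the backbone may come out reversed; cells that sit on side branches of the tree are not placed by this sketch.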

Embeddr is a method that uses the correlation metric between cells to construct a kNNG, then projects the samples into a low-dimensional embedding using Laplacian eigenmaps. The pseudo-time order is then fitted using principal curves (Campbell et al., 2015). Embeddr aims to tackle the drawbacks of Monocle, where gene expression is modeled as a linear combination and the result is highly sensitive to outliers. This scheme is also used in the workflow of SLICER, a recent algorithm using Locally Linear Embedding (LLE) to project the dataset and to construct a kNNG among cells (Welch et al., 2016).

Since visualization is key in understanding reconstructed single-cell trajectories, better visualization algorithms are as important as methods to reconstruct the single-cell microevolution. t-SNE is a popular method to visualize single cells, as part of a more complex workflow (Jiang, L. et al., 2016; Petropoulos et al., 2016). Another approach derived from diffusion map was developed, allowing one to visualize a clear bifurcation event among the cells which may be missed by independent component analysis (ICA) or t-SNE (Haghverdi et al., 2015; Moignard et al., 2015).

### Modeling Microevolution with Spatial and Temporal Information

Cell subpopulations can also be characterized by different temporal and/or spatial gene expression. Several approaches have been designed to exploit datasets with explicit temporal information. SCUBA is a method to detect bifurcation events using time course data (Marco et al., 2014). It assumes that the switch between cell states is a stochastic punctual process. To infer cellular hierarchy, it iteratively divides cells using the k-means algorithm and uses a gap statistic to determine if a bifurcation event should occur. This process creates a binary tree, which can then be used to model gene expression dynamics (Marco et al., 2014). However, one drawback of SCUBA is that it requires data with temporal features. Free from such a requirement, Oscope is another method, which infers oscillatory genes among single cells collected from a single tissue (Leng et al., 2015). It hypothesizes that these cells represent distinct states according to an oscillatory process. Oscope fits a two-dimensional sinusoidal function for each pair of genes, clusters gene pairs by frequency, and reconstructs the order of the cells in a cyclic fashion. However, Oscope is unable to infer bifurcation events.

Other models also consider the spatial organization of cells in a tissue. Seurat is an approach that infers the spatial localization of single cells by integrating RNA-Seq with in situ RNA patterns (Satija et al., 2015). Seurat divides a cellular tissue into distinct spatial bins, linked by the expression of landmark genes per RNA in situ hybridization. Within each bin, it builds a mixture model using expression values among correlated genes. A posterior probability is then generated for each cell, and the cell is assigned to a given bin. Another approach models the tissue as a 3D map and assumes that cells that are spatially close share common scRNA-Seq profiles (Pettit et al., 2014). This method uses a hidden Markov random field to assign each bin of the map to a given cluster. Similar to Seurat, it takes as input spatial gene expression measurements from whole-mount in situ hybridization (WiSH) technology, a confocal microscopic approach that detects the presence of mRNA linked to a fluorescent probe.

### CHALLENGES AND FUTURE WORK

Compared to bulk-cell analysis, single-cell genomics has the advantage of exploring cellular processes at a finer resolution, but it is more vulnerable to disturbances. Besides perfecting the experimental protocols to deal with issues such as dropouts in gene expression and biases in amplification, deriving new analytical methods to reveal the complexity in scRNA-Seq data is just as challenging. In this review, we have listed the different bioinformatics algorithms dedicated to single-cell analysis. Although the initial few steps of the workflow for scRNA-Seq analysis are similar to bulk-cell analysis (data pre-processing, batch removal, alignment, quality check, and normalization), the subsequent analyses are largely unique to single cells, such as subpopulation detection and microevolution characterization (**Figure 1**). With the increasing popularity of single-cell assays and the ever-increasing number of computational methods developed, these methods need to be more accessible to research groups without bioinformatics expertise. Moreover, datasets where cell classes have already been characterized should be identified as benchmark data, in order to accurately assess the performance of new bioinformatics methods.

Although this review focuses on scRNA-Seq analyses, with the rapid development of technologies, coupled DNA-based genomics data can be obtained from the same cell, in parallel with scRNA-Seq data (Han et al., 2014; Dey et al., 2015; Kim, K. T. et al., 2015; Macaulay et al., 2015). This will further increase the analytical challenges. Previous multi-omics bioinformatics tools applied to bulk samples could be leveraged. The use of graphs and tensor approaches that integrate heterogeneous features in bulk samples may be good starting points for multidimensional single cell data (Li et al., 2009; Levine et al., 2015; Katrib et al., 2016; Zhu et al., 2016). Efforts should also be made toward developing computational methods that make use of spatial information (possibly guided by imaging) in combination with scRNA-Seq (Pettit et al., 2014; Satija et al., 2015). Also, most emphasis in scRNA-Seq so far has been placed on protein coding genes, and the dynamics and roles of noncoding RNAs such as lncRNAs (Travers et al., 2015; Ching et al., 2016) and micro-RNAs are poorly explored. Finally, a large number of single cells (n = 4645) in a single data set was reported recently (Tirosh et al., 2016), and the scRNA-Seq data volume is expected to continue growing exponentially. Foreseeably, this poses a large spectrum of challenges, from developing more efficient aligners to better data storage and data sharing solutions.

### AUTHOR CONTRIBUTIONS

LG envisioned this project. OP, XZ, TC, and LG wrote the manuscript. All authors have read and agreed on the manuscript.

### ACKNOWLEDGMENTS

This research was supported by grants K01ES025434 awarded by NIEHS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov), P20 COBRE GM103457 awarded by NIH/NIGMS, 1R01LM012373 awarded by NLM, and Hawaii Community Foundation Medical Research Grant 14ADVC-64566 to LG.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Poirion, Zhu, Ching and Garmire. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Single Cell Isolation and Analysis

Ping Hu<sup>1†</sup>, Wenhua Zhang<sup>2†</sup>, Hongbo Xin<sup>1</sup> and Glenn Deng<sup>1,3,4</sup>\*

*<sup>1</sup> The Center for Biotechnology and Biopharmaceutics, Institute of Translational Medicine, Nanchang University, Nanchang, China, <sup>2</sup> Laboratory of Fear and Anxiety Disorders, Institute of Life Science, Nanchang University, Nanchang, China, <sup>3</sup> Yichang Research Center for Biomedical Industry and Central Laboratory of Yichang Central Hospital, Medical School, China Three Gorges University, Yichang, China, <sup>4</sup> Division of Surgical Oncology, Stanford University School of Medicine, Stanford, CA, USA*

Individual cell heterogeneity within a population can be critical to its peculiar function and fate. Subpopulation studies with mixed mutants and wild types may not be as informative regarding which cell responds to which drugs or clinical treatments. Cell to cell differences in RNA transcripts and protein expression can be key to answering questions in cancer, neurobiology, stem cell biology, immunology, and developmental biology. Conventional cell-based assays mainly analyze the average responses from a population of cells, without regard to individual cell phenotypes. To better understand the variations from cell to cell, scientists need to use single cell analyses to provide more detailed information for therapeutic decision making in precision medicine. In this review, we focus on the recent developments in single cell isolation and analysis, which include technologies, analyses and main applications. Here, we summarize the historical background, limitations, applications, and potential of single cell isolation technologies.

### Edited by:

*Ashok Kumar, University of Louisville, USA*

### Reviewed by:

*Wen-Shu Wu, University of Illinois at Chicago, USA Sandra Orsulic, Cedars-Sinai Medical Center, USA Adriana Simionescu Bankston, University of Louisville, USA*

\*Correspondence:

*Glenn Deng yaguangdeng@126.com*

*† These authors have contributed equally to this work.*

### Specialty section:

*This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology*

Received: *10 June 2016* Accepted: *07 October 2016* Published: *25 October 2016*

### Citation:

*Hu P, Zhang W, Xin H and Deng G (2016) Single Cell Isolation and Analysis. Front. Cell Dev. Biol. 4:116. doi: 10.3389/fcell.2016.00116*

Keywords: heterogeneity, single cell, isolation, analysis, sequencing

### INTRODUCTION

The cell is the fundamental unit of biological organisms. Despite the apparent synchrony in cellular systems, single cell analyses show that even cells of the same cell line or tissue can present different genomes, transcriptomes, and epigenomes during cell division and differentiation (Schatz and Swanson, 2011). For example, a developing embryo, brain, or tumor has an intricate structure consisting of numerous types of cells that may be spatially separated. Thus, the isolation of distinct cell types is essential for further analysis and will be valuable for diagnostic, biotechnological, and biomedical applications.

Conventional cell-based assays mainly measure the average response from a population of cells, assuming the average response is representative of each cell. However, in doing this, important information about a small but potentially relevant subpopulation may be lost, particularly in cases where that subpopulation determines the behavior of the whole population. For example, the tumor microenvironment is a complex heterogeneous system that consists of multiple intricate interactions between tumor cells and their neighboring non-cancerous stromal cells. The stromal cells are composed of endothelial cells, fibroblasts, macrophages, immune cells, and stem cells. Due to the variation in genetic and environmental factors, different kinds of cells have unique behaviors and present different implications in pathogenic conditions (Schor and Schor, 2001). These challenges make conventional analysis insufficient. Therefore, new technologies to isolate individual single cells from a complex sample and study the genomes and proteomes of single cells could provide great insights into genome variation and gene expression processes. It is believed that single cell analyses influence various fields including life sciences and biomedical research (Blainey and Quake, 2014).

Early on, researchers applied low-throughput single cell analysis techniques, such as immunofluorescence, fluorescence in situ hybridization (FISH), and single cell PCR, to detect certain molecular markers of single cells (Taniguchi et al., 2009; Citri et al., 2012). These techniques allow quantification of a limited number of parameters in single cells. On the other hand, high-throughput genomic analyses, such as DNA and RNA sequencing, are now widely used. However, genomic studies rely on studying collective averages obtained from pooling thousands to millions of cells, precluding genome-wide analysis of cell to cell variability. Single cell sequencing was therefore developed to meet this need in research, and it was named "method of the year" for 2013 by Nature Methods (2014). By using single cell analysis, researchers have profiled many biological processes and diseases at the single cell level including tumor evolution, circulating tumor cells (CTCs), neuron heterogeneity, early embryo development, and uncultivatable bacteria.

In this review, we discuss the technologies recently developed for single cell isolation, genome acquisition, transcriptome, and proteome analyses, and their applications. We also briefly discuss the future potentials of single cell isolation technologies and analyses.

### TECHNOLOGIES FOR SINGLE CELL ISOLATION

Before initiating a single cell analysis, scientists need to isolate or identify single cells. The performance of cell isolation technology is typically characterized by three parameters: efficiency or throughput (how many cells can be isolated in a certain time), purity (the fraction of the target cells collected after the separation), and recovery (the fraction of the target cells obtained after the separation as compared to initially available target cells in the sample). The current techniques show different advantages for each of the three parameters.
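The three parameters defined above reduce to simple ratios. The following minimal sketch computes them for one isolation run; the function name, argument names, and example counts are hypothetical and chosen only for illustration.

```python
def isolation_metrics(targets_in_sample, targets_collected, total_collected, minutes):
    """Return (throughput, purity, recovery) for one isolation run.

    throughput: cells isolated per minute
    purity:     fraction of collected cells that are target cells
    recovery:   fraction of the initially available target cells that were collected
    """
    if targets_in_sample <= 0 or total_collected <= 0 or minutes <= 0:
        raise ValueError("counts and duration must be positive")
    throughput = total_collected / minutes
    purity = targets_collected / total_collected
    recovery = targets_collected / targets_in_sample
    return throughput, purity, recovery


# Hypothetical run: 200 target cells in the sample; 160 cells collected in
# 10 minutes, of which 150 are target cells.
throughput, purity, recovery = isolation_metrics(
    targets_in_sample=200, targets_collected=150, total_collected=160, minutes=10
)
print(throughput, purity, recovery)  # 16.0 0.9375 0.75
```

A run can score high on purity while scoring low on recovery (few contaminating cells, but many target cells left behind), which is why the parameters are reported separately.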

Based on the variety of principles used, existing cell isolation techniques can be classified into two groups. The first group is based on physical properties like size, density, electric charge, and deformability, with methods including density gradient centrifugation, membrane filtration, and microchip-based capture platforms. The main advantage of methods based on physical properties is that single cells are isolated without labeling. The second group is based on cellular biological characteristics and comprises affinity methods, such as affinity solid matrix (beads, plates, fibers), fluorescence-activated cell sorting, and magnetic-activated cell sorting, which rely on biological protein expression properties (Dainiak et al., 2007). In what follows, we briefly summarize the principle of each method, as well as the advantages and limitations of their applications (**Table 1**). We will not discuss limiting dilution since it is well known in the field of monoclonal cell culture production.

### Fluorescence Activated Cell Sorting (FACS)

Fluorescence Activated Cell Sorting (FACS), a specialized type of flow cytometry with sorting capacity, is the most sophisticated and user-friendly technique for characterizing and defining different cell types in a heterogeneous cell population based on size, granularity, and fluorescence. FACS allows simultaneous quantitative and qualitative multi-parametric analyses of single cells (Gross et al., 2015). Before separation, a cell suspension is made and the target cells are labeled with fluorescent probes. Fluorophore-conjugated monoclonal antibodies (mAbs), which recognize specific surface markers on target cells, are the most widely used fluorescent probes. As the cell suspension runs through the cytometer, each cell is exposed to a laser, which allows the fluorescence detectors to identify cells based on the selected characteristics. The instrument applies a charge (positive or negative) to the droplet containing a cell of interest, and an electrostatic deflection system facilitates the collection of the charged droplets into appropriate collection tubes for later analysis (**Figure 1A**). Although FACS has been widely used for isolation of highly purified cell populations, it has been reported that FACS can also be used to sort single cells (Schulz et al., 2012). For example, BD cell sorting systems (such as the BD FACSAria III Cell Sorter) are able to isolate single cells of interest from thousands of cells in a population using up to 18 surface markers.

Since the late 1960s, remarkable advances have been made in FACS technology, including the instrumentation and the availability of a large number of highly specific antibodies. The capability of FACS technology has improved significantly, from a technique limited to measuring 1–2 fluorescent species per cell to 10–15 species. The maximum number of proteins that can be simultaneously measured has progressively increased (Wu and Singh, 2012). Due to this progress, our understanding of immunology and stem cell biology has improved tremendously alongside the discovery of scores of functionally diverse cell populations (Bendall et al., 2012). It has also been reported that a next-generation, "post-fluorescence" single cell technology termed mass cytometry is theoretically capable of measuring 70–100 parameters.

Although FACS has been widely used in both basic and clinical research, there are several limiting disadvantages. First, FACS requires a huge starting number of cells (more than 10,000) in suspension. Therefore, it fails to isolate single cells from a low quantity cell population. Second, the rapid flow in the machine and non-specific fluorescent molecules can damage the viability of the sorted cells rendering the isolation a failure. Moreover, cells or cell cultures must be subjected to stimulation experiments and treated in a separate environment before FACS analysis.

### Magnetic-Activated Cell Sorting (MACS)

Magnetic-Activated Cell Sorting (MACS) is another commonly used passive separation technique to isolate different types of cells depending on their cluster of differentiation. It has been reported that MACS is capable of isolating specific cell populations with a purity of >90% (Miltenyi et al., 1990). MACS is based on antibodies, enzymes, lectins, or streptavidins conjugated to magnetic beads to bind specific proteins on the target cells. When a mixed population of cells is placed in an external magnetic field, the magnetic beads will activate and the labeled cells will polarize while other cells are washed out. The remaining cells can be acquired by elution after the magnetic field is turned off (**Figure 1B**). With this technique, the cells can be separated by charge with respect to the particular antigens. Positive separation techniques use coated magnetic beads to attract cells. The cells of interest are labeled while the unlabeled cells are discarded. In contrast, if species-specific substances are unavailable, a good choice is to use negative separation techniques, which employ a cocktail of antibodies to coat untreated cells. In this case, labeled cells are discarded while unlabeled cells are retained (Grützkau and Radbruch, 2010).

TABLE 1 | Overview of single cell isolation techniques.

Of the two most common affinity-based techniques for specific cell isolation, MACS technology is comparatively simple and cost-effective. However, the MACS system's obvious shortcoming lies in the initial cost of the separation magnet and in its running costs, which include not only the price of the conjugated magnetic beads but also replacement columns. In addition, the final purity of isolated cells in MACS devices depends on the specificity and the affinity of the antibodies used to select the target cells. It also depends on the amount of non-specific cell capture. Non-specific contamination can arise from adsorption of background cells to the capturing device or their entrapment within the large excess of magnetic particles needed for labeling rare cells in large volumes. Using new materials can eliminate contamination from non-specific adsorption or entrapment of other blood cells. Another disadvantage of MACS is that it can only utilize cell surface molecules as markers for separation of live cells. Furthermore, it should be noted that MACS is far more limited than FACS because immunomagnetic techniques can only separate cells into positive and negative populations. Cells with high and low expression of a molecule cannot be separated, whereas this is possible with FACS sorting.

### Laser Capture Microdissection (LCM)

Laser Capture Microdissection (LCM) is an advanced technology for isolating pure cell populations or a single cell from mostly solid tissue samples on a microscope slide (Emmert-Buck et al., 1996). It can accurately and efficiently target and capture the cells of interest to fully exploit emerging molecular analytical technologies, including PCR, microarrays, and proteomics (Espina et al., 2007). Today, there are two general classes of laser capture microdissection systems: infrared (IR LCM) and ultraviolet (UV LCM). The LCM system consists of an inverted microscope, a solid state near infrared laser diode, a laser control unit, a joystick-controlled microscope stage with a vacuum chuck for slide immobilization, a CCD camera, and a color monitor (Datta et al., 2015). The basic principle of LCM starts with visualizing the cells of interest through an inverted microscope; then a fixed-position, short-duration, focused laser pulse is delivered to melt the thin transparent thermoplastic film on a cap above the targeted cells. The film melts and fuses with the underlying cells of choice. When the film is removed, the target cells remain bound to the film while the rest of the tissue is left behind. Finally, the cells are transferred to a microcentrifuge tube containing the buffer solutions required for a wide range of downstream analyses (Kummari et al., 2015; **Figure 1C**).

The most important advantage of LCM is its speed while maintaining precision and versatility (Fend and Raffeld, 2000). LCM provides a rapid, reliable method to procure pure populations of target cells from a wide range of cell and tissue preparations via microscopic visualization (Bonner et al., 1997). Conventional techniques for molecular analysis require dissociation of tissue. This may introduce inherent contamination problems and reduce the specificity and sensitivity of subsequent molecular analysis. On the other hand, LCM is a "no touch" technique that does not destroy adjacent tissues after initial microdissection. The morphology of both the captured cells and the residual tissue is well preserved, which reduces the danger of tissue loss (Esposito, 2007). In addition, after removing the chosen cells, the remaining tissue on the slide is fully accessible for further capture, allowing comparative molecular analysis of adjacent cells.

The major requirement for effective LCM is correct identification of cell subpopulations or single cells in a complex tissue. Thus, the major limitation is the need to identify cells of interest through visual microscopic inspection of morphological characteristics, which in turn, requires a pathologist, cytologist, or technologist trained in cell identification (Espina et al., 2007). Another significant limitation is that the microdissected tissue section does not have a cover slip. Cover slipping would prevent physical access to the tissue surface, which is crucial to any current microdissection method. Without a cover slip, and the index matching between the mounting media and the tissue, the dry tissue section has a refractile quality, which might obscure cellular detail at high magnifications. Moreover, LCM introduces a number of technical artifacts, including slicing the cells during the preparation of tissue sections and UV damage to DNA or RNA from the laser cutting energy (Allard et al., 2004).

### Manual Cell Picking/Micromanipulation

Manual cell picking is a simple, convenient, and efficient method for isolating single cells. Similar to LCM systems, manual cell picking micromanipulators also consist of an inverted microscope combined with micro-pipettes that are movable through motorized mechanical stages. Each isolated single cell can be observed and photographed under the microscope, thus enabling unbiased isolation (**Figure 1D**). Unlike LCM, which mainly isolates single cells from sections of fixed tissue, micromanipulation plays an important role in isolating live cultured cells or embryo cells.

Micromanipulation can be easily performed in an electrophysiology lab equipped with a patch clamp system. For example, after investigating neuronal function in brain slice preparations by standard whole-cell patch-clamp electrophysiological recordings, scientists can apply negative pressure through the patch pipette so that the cytosolic material containing cellular mRNA is aspirated for further analysis (Eberwine et al., 1992; Citri et al., 2012). However, the throughput is limited, the technique requires highly skilled professionals, and its utility is limited when detecting complex changes.

### Microfluidics

Microfluidics is recognized as a powerful enabling technology for investigating the inherent complexity of cellular systems as it provides precise fluid control, low sample consumption, device miniaturization, low analysis cost, and easy handling of nanoliter volumes (Whitesides, 2006; **Figure 1E**). Cell sorting by a microfluidic chip can be divided into four categories: cell-affinity chromatography based microfluidic separation (Nagrath et al., 2007), separation based on the physical characteristics of cells, immunomagnetic bead based microfluidic separation, and separation methods based on differences between the dielectric properties of various cell types.

Cell-affinity chromatography based microfluidic is the most commonly used method for microfluidic chip analysis. It is based upon highly specific interactions between antigen and antibody, ligand and receptor. At the beginning of the process, the microchannel in the chip is modified with specific antibodies capable of binding to cell surface antigen or aptamer, such as an epithelial cell adhesion molecule. Once the sample flows through the micro-channels, its cell surface antigen can bind to the specific antibodies or aptamer immobilizing the cells on the chip, while the remaining cells flow off the chip with the buffer. Finally, using a different buffer, we can elute the immobilized cells for downstream analysis. Compared to other separation methods, affinity based systems have higher specificity and sensitivity because of the recognition-binding event.

Today, microfluidics can be combined with different separation methods, such as filtration and sedimentation or affinity-based technologies like FACS and MACS. In recent years, numerous investigations and applications of microfluidic devices have been reported, including cancer research, microbiology, single-cell analysis, stem cell research, drug discovery, and screening (Arora et al., 2010; Li et al., 2012a). Recently, microfluidic chips have been fabricated from silicon or glass, elastomer, thermosets, hydrogel, thermoplastics, and paper (Ren et al., 2013, 2014). The advantages and disadvantages of the materials used in microfluidic chips have been well summarized previously (Ren et al., 2014). Microfluidics is used to manipulate liquids in networks of micro-channels (dimensions from 1 to 1000 µm) in a single device. At such ultralow volumes, fluids exhibit different physico-chemical properties compared to their behavior at the macro-scale (Squires and Quake, 2005). Common fluids that can be used in microfluidic devices include bacterial cell suspensions, whole blood samples, protein or antibody solutions, and various buffers.

By integrating cell handling and processing concurrently, microfluidic chips show potential applications in DNA sequencing (Hashimoto et al., 2007; Liu et al., 2007), protein analysis (Emrich et al., 2007), cell manipulation, and cell composition analysis (VanDijken et al., 2007; Bhagat et al., 2010). For example, Fluidigm developed a commercially available valve-based microfluidic qPCR system called the Dynamic Array™. This system provides researchers with low-volume (nanoliter) and high-throughput (thousands of PCR reactions per device) methods and has become increasingly popular for large-scale single cell studies. Moreover, microfluidic technology has shown increasing applications in studying diversity and variations in single cell genomes, spanning from cancer biology to environmental microbiology and neurobiology. Beyond genomics applications, the scalability and small volume advantages of microfluidic methods have found applications in the measurement of intracellular and secreted proteins from single cells.

### SINGLE CELL ANALYSIS

Single cell analysis tools can be divided into three groups: genomics, transcriptomics, and proteomics. Due to next generation sequencing (NGS) technologies as well as whole genome/transcriptome amplification (WGA/WTA) approaches, a new scientific field of single cell genome studies has been established. A combination of high-throughput and multiparameter approaches is used in single cell analysis, which can reflect cell to cell variability and heterogeneous differences among individual cells. Therefore, the development of efficient single cell analysis methods requires attention. In this section, we discuss novel technologies designed for single cell analysis of genomics, transcriptomics, and proteomics (**Table 2**).

### Single Cell Genomics

Single cell genome sequencing allows us to identify chromosomal variations, such as copy number and single-nucleotide variations. It also allows us to study tumor evolution, gametogenesis, and somatic mosaicism, which is reflected in the genomic heterogeneity among a population of cells. However, in humans, it is often limited by the small amount of genomic material: the genomic DNA of one cell weighs only about 6 pg, and each gene has only two copies in a single normal cell, which is not enough for current NGS use. Moreover, amplification using traditional PCR suffers from severe biases and allelic dropout across the genome when it is applied to single cells. Therefore, a precise, unbiased amplification of the DNA is critical to single cell genome sequencing. Many attempts have been made, mostly by modifying the traditional PCR methodology to linker-adapter PCR (LA-PCR) (Klein et al., 1999), interspersed repetitive sequence PCR (IRS-PCR), primer extension pre-amplification PCR (PEP-PCR) (Hubert et al., 1992), degenerate oligonucleotide-primed PCR (DOP-PCR) (Telenius et al., 1992), and its variant displacement DOP-PCR (D-DOP-PCR) (Langmore, 2002). For example, by using DOP-PCR, Navin and colleagues demonstrated accurate and robust determination of genome wide copy number in rearranged cancer genomes (Navin et al., 2011). This is the first report of single cell genome sequencing applied to a cancer genomic heterogeneity study. However, these methods also have limitations: low coverage, amplification bias, and allele dropout.
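The picogram-scale figure quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes a haploid human genome of about 3.2 × 10<sup>9</sup> base pairs and an average molar mass of about 650 g/mol per base pair; both constants are assumptions introduced here, not values from the text, and the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MOLAR_MASS = 650.0      # g/mol per base pair (assumed average)
HAPLOID_BP = 3.2e9         # base pairs per haploid human genome (assumed)


def genome_mass_pg(copies=2, haploid_bp=HAPLOID_BP):
    """Approximate mass in picograms of the genomic DNA in one cell."""
    grams = copies * haploid_bp * BP_MOLAR_MASS / AVOGADRO
    return grams * 1e12  # 1 g = 1e12 pg


mass = genome_mass_pg()  # diploid cell: two genome copies
print(f"{mass:.1f} pg")
```

Under these assumptions the estimate comes out at a few picograms per diploid cell, the same order of magnitude as the approximately 6 pg cited above, which shows why amplification is needed before sequencing.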

Multiple displacement amplification (MDA) is the most popular method applied in genome analysis because of its high fidelity and simplicity. It amplifies DNA in a 30 °C isothermal reaction with random hexamer primers and phi29 DNA polymerase. The core of MDA is that phi29 DNA polymerase extends the primers with high fidelity and strong processivity and displays powerful strand displacement activity during new strand synthesis (Dean et al., 2002). The displacement process generates single-stranded DNA templates, which are reprimed and extended, thereby amplifying the DNA in an isothermal reaction. Using MDA, Xu and colleagues provided the first intratumoral genetic landscape at the single-cell level and demonstrated that clear cell renal cell carcinoma (ccRCC, the most common kidney cancer) may be more genetically complex than previously thought (Xu et al., 2012). However, MDA also suffers from strong biases and a high allelic dropout rate across the genome, and the reaction is prone to generating "chimeras," resulting in unwanted noise and false results.
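To make the consequence of allelic dropout concrete, the following toy simulation treats each of the two alleles at a heterozygous site as failing to amplify independently with the same probability. The independence assumption, the function name, and the example probability are illustrative choices, not a model of any specific amplification chemistry.

```python
import random
from typing import Dict


def simulate_allelic_dropout(n_sites: int, dropout_prob: float, seed: int = 0) -> Dict[str, int]:
    """Count outcomes at heterozygous sites when each allele drops out independently."""
    if not 0.0 <= dropout_prob <= 1.0:
        raise ValueError("dropout_prob must be between 0 and 1")
    rng = random.Random(seed)
    counts = {"both_alleles": 0, "one_allele": 0, "no_allele": 0}
    for _ in range(n_sites):
        allele_a_seen = rng.random() >= dropout_prob
        allele_b_seen = rng.random() >= dropout_prob
        if allele_a_seen and allele_b_seen:
            counts["both_alleles"] += 1   # genotype called correctly
        elif allele_a_seen or allele_b_seen:
            counts["one_allele"] += 1     # site would be miscalled as homozygous
        else:
            counts["no_allele"] += 1      # locus is lost entirely
    return counts


if __name__ == "__main__":
    result = simulate_allelic_dropout(n_sites=100_000, dropout_prob=0.2)
    for outcome, count in result.items():
        print(f"{outcome}: {count / 100_000:.3f}")
```

With a per-allele dropout probability p, the expected fraction of heterozygous sites miscalled as homozygous in this model is 2p(1 − p), so even moderate dropout rates affect a large share of variant calls.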

*PCR, polymerase chain reaction; LA-PCR, linker-adapter PCR; IRS-PCR, interspersed repetitive sequence PCR; PEP-PCR, primer extension pre-amplification PCR; DOP-PCR, degenerate oligonucleotide-primed PCR; MDA, multiple displacement amplification; MALBAC, multiple annealing and looping-based amplification cycles; TPEA, 3′-end amplification; SMART, strand-switch-mediated reverse transcription amplification; IVT, in vitro transcription; TTA, total transcript amplification; PMA, Phi29 mRNA amplification; LDI-MS, laser desorption and ionization mass spectrometry; SIMS, secondary ion mass spectrometry; MALDI-MS, matrix-assisted laser desorption/ionization mass spectrometry.*

Another method, multiple annealing and looping-based amplification cycles (MALBAC), amplifies the genome of a single cell with high uniformity and showed faithful copy number variation detection (Zong et al., 2012). MALBAC is based on a strand displacement pre-amplification that generates amplicons with complementary ends. The full amplicons generated in the reaction therefore anneal to themselves to form loops, which prevents them from being amplified again. This also ensures that each new amplicon is replicated from the original template. The obvious advantage of MALBAC is thus that it reduces amplification errors and biases, because the starting materials of the exponential amplification are amplicons copied independently from the original template. However, its fidelity and bias still need to be improved (Marcy et al., 2007; Wu et al., 2014).
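The reasoning behind copying only from the original template can be illustrated with an idealized comparison. In the sketch below, two loci differ slightly in per-cycle copying efficiency; in the exponential model every product is copied again, whereas in the linear model only the original template is copied in each cycle. This is a deliberately simplified model with hypothetical efficiencies, not a description of the actual MALBAC protocol.

```python
from typing import Callable


def exponential_yield(efficiency: float, cycles: int) -> float:
    """Copies per template when every existing copy is re-copied each cycle."""
    return (1.0 + efficiency) ** cycles


def linear_yield(efficiency: float, cycles: int) -> float:
    """Copies per template when only the original template is copied each cycle."""
    return 1.0 + efficiency * cycles


def bias_ratio(yield_fn: Callable[[float, int], float],
               eff_a: float, eff_b: float, cycles: int) -> float:
    """Ratio of final yields for two loci with different copying efficiencies."""
    return yield_fn(eff_a, cycles) / yield_fn(eff_b, cycles)


if __name__ == "__main__":
    # Hypothetical efficiencies for a well-amplified and a poorly amplified locus.
    eff_a, eff_b = 0.9, 0.8
    for cycles in (5, 10, 20):
        exp_bias = bias_ratio(exponential_yield, eff_a, eff_b, cycles)
        lin_bias = bias_ratio(linear_yield, eff_a, eff_b, cycles)
        print(f"{cycles:2d} cycles: exponential bias {exp_bias:.2f}, linear bias {lin_bias:.2f}")
```

In the exponential model the ratio between the two loci grows with every cycle, while in the linear model it stays below the ratio of the two efficiencies, which is the intuition for why a linear pre-amplification step limits bias.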

### Single Cell Transcriptomics

Single cell transcriptome sequencing has recently emerged as a powerful technology for revealing differential gene expression and diverse RNA splicing patterns during early embryonic development, differentiation and reprogramming. The main application of single-cell transcriptomics is to connect a cell's genotype to its phenotype. It can detect thousands of transcripts in various kinds of tissues and cells (Cloonan et al., 2008; Mortazavi et al., 2008). Although mRNA is not as rare as DNA in a single cell, there are still only thousands of copies. This is not ideal, since NGS transcriptome sequencing requires a large amount of RNA as the starting material. The mRNA from single cells therefore needs to be reverse-transcribed to cDNA and then amplified by cycles of PCR (Sandberg, 2014). The key to successful single cell mRNA amplification is converting the mRNA to double-stranded cDNA with high efficiency and low bias.

PCR-based amplification was first reported for the preparation of single-cell cDNAs (Brady and Iscove, 1993), and amplified single-cell cDNA has since been analyzed by cDNA microarray and RNA-seq. The disadvantage of a microarray is its low detection sensitivity, which is likely to miss many low-level but key transcripts. Compared to microarray analysis, RNA-seq analysis expanded the spectrum of detected genes with high accuracy and effectively increased the proportion of full-length cDNA. One advantage of PCR-based transcriptome amplification is that its amplification bias makes expression differences between samples more visible, and any starting amount of RNA can be used. On the other hand, it may distort the original difference when that difference is marginal. Several modified PCR-based methods of cDNA amplification have been developed, such as global PCR amplification (GA), 3′-end amplification (TPEA), and strand-switch-mediated reverse transcription amplification (SMART) (Pan, 2014).

In vitro transcription (IVT)-based linear RNA amplification was the first strategy used to successfully amplify RNA for molecular profiling studies, and it helped usher in the era of single cell analysis (Liu et al., 2014). It is based on T7 RNA polymerase-mediated IVT and requires three rounds of amplification. The main advantages of the IVT strategy are its specificity, its ratio fidelity, and the reduced accumulation of non-specific products; its drawbacks are low efficiency and a time-consuming procedure.

More recently, single cell RNA amplification methods based on the phi29 DNA polymerase have been developed (Blanco and Salas, 1984; Dean et al., 2002). This polymerase is a highly processive enzyme with strong strand displacement activity that allows highly efficient isothermal DNA amplification. The phi29 DNA polymerase-based transcriptome amplification method is a simple, fast, isothermal reaction (Liu et al., 2014). Its primary advantage is highly efficient, low-bias, and uniform amplification.

Furthermore, in order to retain the spatial and temporal information of RNAs in cells, several new RNA analysis methods have been developed, including transcriptome in vivo analysis (TIVA), single molecule fluorescent in situ hybridization (smFISH), and fluorescent in situ RNA sequencing (FISSEQ) (Lee et al., 2014; Lovatt et al., 2014). These technologies have become powerful tools for unraveling longstanding biomedical questions.

### Single Cell Proteomics

Single cell analysis of DNA and RNA can provide qualitative information about protein expression. However, it cannot give information on protein concentration, location, posttranslational modifications, or interactions with other proteins. Single-cell proteomics therefore provides information that is crucial for understanding cell signaling and cell to cell heterogeneity. Traditional protein analysis techniques, such as gel electrophoresis, immunoassays, chromatography, and mass spectrometry, require numerous cells for analysis. The major challenges of analyzing proteins at the single-cell level are therefore the exceedingly small copy number of individual proteins and the lack of amplification methods. However, recent advances in multiparameter flow cytometry, microfluidics, mass spectrometry, mass cytometry, and other techniques have led to new single cell proteomics studies that can be performed with greater sensitivity and specificity.

Flow cytometry is not only widely used in cell sorting but is also the most established and user-friendly method for both qualitative and quantitative multiparameter analysis of single cells. As mentioned before, with multiparameter flow cytometry, scientists can simultaneously measure 10–15 key proteins in signaling pathways in individual cells (De Rosa et al., 2001; Perez and Nolan, 2002). In addition, in an immunological proof-of-concept study, as many as 19 separate parameters, comprising 17 fluorescent colors and 2 physical parameters, were analyzed (Perfetto et al., 2004). This capability has turned flow cytometry into a powerful tool for semi-quantitative analysis of the pathways underlying many diseases (Irish et al., 2004; Sachs et al., 2005). The main limitation is spectral overlap caused by the broad emission bands of organic fluorescent dyes. Quantum dots mitigate but do not eliminate the problem, so complex correction algorithms are required for spectral deconvolution. Moreover, commercial flow cytometers interrogate cells individually and therefore require cell suspensions. Sample preparation is still done manually and requires a large number of cells (more than 10,000). This makes it hard to analyze small samples, such as cells recovered from a biopsy, tissue specimens, or small volumes of blood.
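The spectral overlap correction mentioned above can be illustrated for the simplest case of two fluorochromes, where each dye spills a fixed fraction of its signal into the other dye's detector. The sketch below solves the resulting pair of linear equations directly. The spillover fractions and signal values are hypothetical, and real instruments apply the same idea to many more channels with measured spillover matrices.

```python
from typing import Tuple


def compensate_two_colors(observed_1: float, observed_2: float,
                          spill_a_into_2: float, spill_b_into_1: float) -> Tuple[float, float]:
    """Recover the true signals of dyes A and B from two overlapping detectors.

    Model (an assumption for illustration):
        observed_1 = true_a + spill_b_into_1 * true_b
        observed_2 = spill_a_into_2 * true_a + true_b
    """
    determinant = 1.0 - spill_a_into_2 * spill_b_into_1
    if abs(determinant) < 1e-12:
        raise ValueError("spillover is too large to separate the two dyes")
    true_a = (observed_1 - spill_b_into_1 * observed_2) / determinant
    true_b = (observed_2 - spill_a_into_2 * observed_1) / determinant
    return true_a, true_b


if __name__ == "__main__":
    # Hypothetical cell: dye A = 1000 units, dye B = 200 units,
    # with 15% of A spilling into detector 2 and 5% of B into detector 1.
    obs_1 = 1000 + 0.05 * 200
    obs_2 = 0.15 * 1000 + 200
    a, b = compensate_two_colors(obs_1, obs_2, spill_a_into_2=0.15, spill_b_into_1=0.05)
    print(f"Compensated signals: A = {a:.1f}, B = {b:.1f}")
```

Without compensation, detector 2 in this example would overestimate dye B by 75%, which shows why uncorrected overlap limits the number of parameters that can be measured together.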

To overcome these limitations, efforts have been made to develop microfluidic-based miniaturized flow cytometers, which permit analysis of small numbers of cells (100–1000) (Lindström and Andersson-Svahn, 2010). For example, Su and colleagues developed a microscope-based, label-free microfluidic cytometer. It is capable of acquiring two-dimensional light scatter patterns from the smallest mature blood cells (platelets), cord blood hematopoietic stem/progenitor cells (CD34+ cells), and myeloid precursor cells (Su et al., 2011). Srivastava et al. (2009) developed an integrated microfluidic device which retro-fitted to commercial. The major advantage of this microfluidic device is its ability to perform cell culture, stimulation and sample preparation in combination with conventional fluorescence imaging and microfluidic flow cytometry to monitor the immune response in macrophages. These microfluidic devices not only drastically reduce the amount of sample and reagent required but also provide a means to perform two orthogonal modes of measurement, imaging and cytometry, in one experiment.

Mass spectrometry (MS) is the most powerful tool for protein analysis. However, the use of MS for analyzing proteins in single cells is limited by its insufficient sensitivity for detecting low amounts of protein. Fractionation of the cell lysate by capillary electrophoresis (CE) prior to MS offers a good way to improve sensitivity. Recently, a format of flow cytometry that leverages the precision of mass spectrometry, termed mass cytometry, has been developed. It enables the measurement of over 40 cellular parameters simultaneously on single cells, with the throughput to survey millions of cells from an individual sample (Mellors et al., 2010).

### APPLICATION OF SINGLE CELL ANALYSIS

The exponential growth in studies applying single cell analysis is closely tied to the acceptance of the technique by biologists. Single cell analysis has influenced different domains of science, including cancer biology, neuroscience, and immunology. It is impossible to document each of these developments. We therefore present a short overview of the fields typically addressed by single cell analysis, focusing on research and applications in cancer, the brain, and stem cells.

### Application of Single Cell Analysis in Cancer and Neuron Research

Intra-tumor heterogeneity has been widely reported in numerous human cancer types. Tumors are frequently composed of individual, molecularly distinct clones that differ in their proliferation rates and metastatic potential and, most critically, in their sensitivities and responses to drug treatment. Cells that can cause distant metastases should possess unique characteristics compared with the remaining subpopulation. Exome sequencing of single cells isolated from primary renal carcinomas showed that only 31–37% of the genetic lesions within a tumor are shared with the rest of the tumor cells (Gerlinger et al., 2012; Xu et al., 2012). Analyzing the occurrence, development and metastasis of these tumors at the single cell level therefore provides much more detailed information on how tumor cells will respond to a drug. For example, PIK3CA mutations have been detected in primary and metastatic tumor tissues, but their status varied over time among single circulating tumor cells (CTCs) and disseminated tumor cells (DTCs), which may reflect drug efficacy (Deng et al., 2014).

Several important types of cancer cells have been described, including primary tumor cells, metastatic tumor cells, cancer stem cells (CSCs), circulating tumor cells (CTCs), and disseminated tumor cells (DTCs) (Zhang et al., 2016). CTCs and DTCs play a vital role in cancer dissemination, self-renewal, and distant metastases. They are increasingly recognized for their potential utility in disease monitoring and therapeutic targeting. Many cancer patients are diagnosed with early-stage cancer with no clinical symptoms of metastasis but subsequently succumb to metastatic relapse. One important reason is that CTCs in the blood and DTCs have already reached a secondary organ but have not yet grown into clinical metastases. However, CTCs are so rare among the massive numbers of blood cells, as few as one cell per 10 million white blood cells and 5 billion red blood cells, that their accurate identification turns out to be the most difficult step in the isolation process (Deng et al., 2008). In recent years, a variety of enrichment and detection techniques have been developed, leading to significant progress in CTC detection. For example, the CellSearch® system (Janssen Diagnostics, NJ, USA) is the first and only technique that has been approved by the US FDA for the detection, enrichment and quantification of CTCs in peripheral whole blood samples (Riethdorf et al., 2007). This system uses magnets with ferrofluid nanoparticles conjugated to antibodies that target the epithelial cell adhesion molecule (EpCAM), and the captured cells are then stained for markers such as CD45. EpCAM is the most commonly used epithelial marker and is present on epithelial tumor cells, whereas CD45 is an immunocyte marker that is present on many blood cells but absent from epithelial cells. Thus, the finding of EpCAM-positive and CD45-negative cells indicates the presence of CTCs.

Another immunomagnetic separation technology, MagSweeper (Illumina), isolates CTCs by dipping a rotating magnetic rod with bound EpCAM antibodies into the sample and then moving the rod into a new buffer to release the CTCs (Talasaz et al., 2009; Powell et al., 2012). The MagSweeper can be used reliably to extract functional human CTCs from the blood of mice inoculated with human tumor xenografts, while retaining both their tumor-initiating and metastasizing capacities (Ameri et al., 2010). The most advantageous aspect of MagSweeper is therefore that CTCs can be isolated while preserving the integrity and viability of these fragile cells.
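The marker rule and the rarity figure quoted above for CTCs can be written down as a small sketch. The `Cell` record and function names are hypothetical and only encode the EpCAM-positive, CD45-negative criterion and the stated ratio of one CTC per 10 million white blood cells and 5 billion red blood cells; clinical systems use additional markers and image review.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Minimal, hypothetical marker record for one cell."""
    epcam_positive: bool
    cd45_positive: bool


def is_ctc_candidate(cell: Cell) -> bool:
    """Apply the EpCAM-positive, CD45-negative rule described in the text."""
    return cell.epcam_positive and not cell.cd45_positive


def ctc_frequency(ctcs: float = 1.0, white_cells: float = 1e7, red_cells: float = 5e9) -> float:
    """Fraction of all blood cells that are CTCs for the given cell counts."""
    return ctcs / (ctcs + white_cells + red_cells)


if __name__ == "__main__":
    sample = [
        Cell(epcam_positive=True, cd45_positive=False),   # candidate CTC
        Cell(epcam_positive=False, cd45_positive=True),   # leukocyte
        Cell(epcam_positive=False, cd45_positive=False),  # e.g., red blood cell
    ]
    print(sum(is_ctc_candidate(c) for c in sample), "candidate CTC(s) in the toy sample")
    print(f"CTC frequency at the quoted rarity: {ctc_frequency():.1e}")
```

At the quoted rarity, roughly two cells in every ten billion are CTCs, which is why an enrichment step precedes identification.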

In recent years, a large number of studies have used single cell analysis to examine individual tumor cells isolated from breast cancer (Navin et al., 2011; Deng et al., 2014; Wang et al., 2014; Eirew et al., 2015), colon cancer (Zong et al., 2012; Yu et al., 2014), pancreatic adenocarcinomas (Ruiz et al., 2011), muscle-invasive bladder cancer (Li et al., 2012b), intestinal cancer (Grün et al., 2015), lung adenocarcinoma (Kim et al., 2015), renal cell carcinoma (Gerlinger et al., 2012; Li et al., 2012b), and acute myeloid leukemia (Ding et al., 2012; Hughes et al., 2014; Paguirigan et al., 2015). For example, Navin and colleagues investigated copy number variation in single tumor cells using DOP WGA followed by DNA sequencing to determine cell population structure and tumor evolution patterns in a single breast tumor (Navin et al., 2011). This study provided an important breakthrough for research on tumor evolution and offered a way to assess the genetic details of tumor structure. Hou and colleagues applied MDA-based single cell sequencing technology for the first time to analyze a primary thrombocytosis disease, essential thrombocythemia (ET), at the level of single bone marrow cells from patients (Hou et al., 2012). Understanding tumor heterogeneity via single cell analysis is thus considered the biggest challenge in cancer research and, if elucidated, would enhance our ability to determine the best treatment options.

It is no exaggeration to say that the brain is the most complex structure in the human body. There are more than 100 billion neurons in the human brain. Each of them can make approximately 10,000 direct connections with others, totaling some 100 trillion nerve connections. This makes the brain a complicated network (Herculano-Houzel, 2009). The brain is divided into several regions. Each region consists of various morphologically and/or neurochemically distinct neurons surrounded by various types of glial cells (oligodendrocytes, microglia, and astrocytes). Additionally, distinct regions in the brain, such as the cerebral cortex and the hippocampus, have specific functions. The cerebral cortex is responsible for many "higher-order" functions like language and information processing, while the hippocampus is involved in spatial learning and memory. Increasing evidence shows that each brain region contains different types of neurons, distinguished by their location, neurotransmitter identity, connectivity, electrophysiological properties, and molecular markers. Changes in the genomic content and epigenetic profile of specific neuronal or glial subtypes are involved in the pathogenesis of neuropsychiatric diseases, such as Parkinson's and Alzheimer's diseases and autism spectrum disorders (Citri et al., 2012).

Hence there is no doubt that single cell isolation and analysis have made increasingly significant contributions to our understanding of the role that somatic genome variations play in neuronal diversity and behaviors. For example, a MACS-based technique has been successfully applied to isolate immature neuronal cells from a large number of embryonic zebrafish; microbeads conjugated to an anti-PSA-NCAM antibody were used within a semi-automated dissociation process (Welzel et al., 2015). MACS was also used to isolate embryonic oligodendroglial progenitor cell populations from the rat embryonic spinal cord. Using superparamagnetic MicroBeads combined with A2B5 antibodies (a specific oligodendroglial development marker) and the Mini-MACS separator column, the oligodendroglial cells were isolated with a cell purity of 58–61%, compared with 6–12% in the unseparated population (Cizkova et al., 2009).
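The purity figures above can be turned into an approximate fold enrichment. The calculation below simply divides purity after separation by purity before separation; pairing the lowest final purity with the highest starting purity (and vice versa) to bracket the range is an assumption made here, not something reported in the cited study.

```python
def fold_enrichment(purity_after: float, purity_before: float) -> float:
    """Ratio of target-cell purity after separation to purity before separation."""
    if purity_before <= 0:
        raise ValueError("purity_before must be positive")
    return purity_after / purity_before


if __name__ == "__main__":
    # Purities in percent, taken from the ranges quoted in the text.
    conservative = fold_enrichment(purity_after=58.0, purity_before=12.0)
    optimistic = fold_enrichment(purity_after=61.0, purity_before=6.0)
    print(f"Fold enrichment between about {conservative:.1f}x and {optimistic:.1f}x")
```

Under this pairing, the separation enriches the target population by roughly five- to ten-fold.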

Moreover, basolateral amygdala (BLA) neurons can activate distinct populations of neurons in the lateral central nucleus of the amygdala (CeL) to either promote fear or reduce anxiety. Namburi and colleagues identified two populations of BLA neurons that undergo opposing synaptic changes following fear (negative emotion) or reward (positive emotion) conditioning. Using RNA-seq, they identified a few candidate genes that are differentially expressed between these two populations and may mediate the effects (Namburi et al., 2015). Usoskin and colleagues performed comprehensive transcriptome analysis of 622 single mouse neurons from the sensory system and discovered 11 fundamentally distinct types of sensory neurons. Interestingly, each type is associated with a different kind of sensation (Usoskin et al., 2015). Even cells that appear to be morphologically similar may show marked differences in expression patterns. In neuroscience research, combining electrophysiological analysis with molecular biology in the same cell will provide convincing results that help us better understand how changes at the molecular level are manifested in functional properties (Eberwine et al., 1992).

### Applications of Single Cell Analysis in Stem Cell Research

Stem cells are undifferentiated cells that are characterized by both the capacity for self-renewal and the potential to differentiate into specialized types of cells. How stem cells balance their self-renewal capacity and their ability to differentiate is a central question in stem cell research. Stem cells can be generally classified into pluripotent stem cells, which can give rise to cells of all three germ layers (the ectoderm, mesoderm, and endoderm), and tissue-specific stem cells (also referred to as somatic or adult stem cells), which play essential roles in the development of embryonic tissues and the homeostasis of adult tissues. Both types of stem cells are intermingled with a variety of differentiated and intermediate cell types in embryonic or adult tissues, forming heterogeneous populations. Therefore, the isolation and analysis of stem cells and the development of specific therapies that target them give cancer patients hope for improved survival and quality of life (Li et al., 2008; Sharma et al., 2010).

Cancer stem cells (CSCs) are hypothesized to persist in tumors as a distinct population and to cause relapses and metastases by forming new tumors. CSCs are intrinsically more refractory to the effects of a variety of anticancer drugs, possibly via enhanced drug efflux (Trumpp and Wiestler, 2008). Because of the limited number of CSCs in cancer tissues, their isolation and analysis remain difficult. Single cell sequencing provides powerful tools for identifying these cells and new insight into complex intra-tumoral heterogeneity. For example, Patel et al. (2014) used single-cell RNA sequencing to profile 672 single cells from five primary. Each tumor showed high intra-tumoral cell heterogeneity in many aspects, including copy number variations as well as cell cycle, immune response and hypoxia. By examining a set of "stemness" genes, they identified continuous, rather than discrete, stemness-related expression states among the individual cells of all five tumors, reflecting the complex stem cell states within a primary tumor. It has been suggested that CSCs are more resistant to chemo- and radiotherapy than other cells in a tumor. This could be one explanation for why most tumors relapse after therapy. Understanding how cancer stem cells resist medical therapy could thus lead to the development of new, more efficient cancer treatments. Although the existence of CSCs is still controversial in many cancer types, there is no doubt that they have the potential to provide a foundation for innovative treatments targeting the roots of cancer.
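The scoring procedure used by Patel et al. is not described here, so the following is only a generic sketch of how a per-cell gene-set score can be computed: the mean expression of a set of signature genes in each cell. The gene names and expression values are placeholders invented for illustration and are not data from the study.

```python
from statistics import mean
from typing import Dict, List


def signature_score(expression: Dict[str, float], gene_set: List[str]) -> float:
    """Mean expression of the signature genes that were measured in one cell."""
    values = [expression[gene] for gene in gene_set if gene in expression]
    if not values:
        raise ValueError("none of the signature genes were measured in this cell")
    return mean(values)


# Placeholder gene names and log-scale expression values (hypothetical).
STEMNESS_GENES = ["gene_a", "gene_b", "gene_c"]
cells = {
    "cell_1": {"gene_a": 0.2, "gene_b": 0.1, "gene_c": 0.3},
    "cell_2": {"gene_a": 1.5, "gene_b": 1.1, "gene_c": 1.6},
    "cell_3": {"gene_a": 3.0, "gene_b": 2.8, "gene_c": 3.2},
}

scores = {cell_id: signature_score(expr, STEMNESS_GENES) for cell_id, expr in cells.items()}

if __name__ == "__main__":
    # Sorting by score shows whether cells form a gradient or separate groups.
    for cell_id, score in sorted(scores.items(), key=lambda item: item[1]):
        print(f"{cell_id}: stemness score {score:.2f}")
```

When such scores spread smoothly across many cells instead of clustering at a few values, the expression states are better described as continuous than as discrete.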

Neural stem cells (NSCs) in the subventricular zone (SVZ) and the subgranular zone (SGZ) of the dentate gyrus continually divide and differentiate into mature neurons and glia in the adult rodent brain (Aimone et al., 2014). Although it has been documented that endogenous NSCs can be activated to produce multiple types of progeny that contribute to brain repair after injury, it remains unclear how distinct pools of NSCs react to brain injury and which molecules trigger injury-induced activation of NSCs. Single-cell sequencing revealed a population of dormant neural stem cells in the SVZ that become activated upon brain injury through downregulation of glycolytic metabolism and a concomitant upregulation of lineage-specific transcription factors and protein synthesis (Llorens-Bobadilla et al., 2015).

Increasing evidence shows that multiple molecularly distinct groups of stem cells that respond differently to physiological stimuli coexist in tissues. Understanding and exploiting this molecular diversity will be critical in harnessing the potential of stem cells for disease treatment.

### CONCLUSION AND OUTLOOK

The biological relevance of cell to cell variations and the high potential of single cell analysis in both basic research and clinical diagnostics have drawn the attention of the scientific community. Single cell gene expression analysis can be used for tumor cell identification; single cell DNA mutation analysis can be used for tumor cell monitoring and clinical decision making (Powell et al., 2012; Deng et al., 2014). Understanding cellular heterogeneity has been a major thrust of technological development over the past decade, resulting in an increasingly powerful suite of instrumentation, protocols, and methods for analyzing single cells at the DNA sequence, RNA expression and protein abundance levels (Kalisky et al., 2011; Wu and Singh, 2012). As notable examples, technical developments and clinical solutions based on single cell analyses of CTCs and CSCs have shown promise for advancing personalized medicine against cancer.

Although much progress has been made in recent years in single cell gene analysis, live single cell isolation and molecular analysis remain preferable for global profiling of RNA expression and DNA mutation (Powell et al., 2012). We are still only beginning to address the measurement challenges of cellular heterogeneity. There is still room for improvement in enabling new modes of analysis and in sensitivity, precision, speed and throughput (Lecault et al., 2012).

For single cell genomic and gene expression analyses, the greatest obstacle to direct detection of diverse genomic, transcriptomic, and epigenetic events is whether there is a sufficient amount of DNA or RNA. On the one hand, purification of high-quality nucleic acids from a single sample is pivotal for downstream studies. A common problem is adsorption to the tube wall, which causes loss of sample material. Low-adsorption containers instead of ordinary tubes, together with single-tube reactions, are recommended to reduce the loss of DNA and RNA; direct single cell PCR/RT-PCR without nucleic acid isolation is also often used. Another problem is the low replication efficiency of DNA sequences with secondary structure. Current single cell sequencing methods still have relatively high technical noise. This is acceptable when studying highly expressed genes, but the biological variation of genes expressed at low levels may be masked. The efficiency of reverse transcription and PCR amplification therefore urgently needs to be improved. On the other hand, this problem could be overcome by third-generation sequencing platforms, which are based on single-molecule sequencing and real-time signal monitoring (Schadt et al., 2010; Liu et al., 2012). Third-generation sequencing requires no amplification and thus avoids PCR amplification bias. However, its detection sensitivity, read accuracy, sample handling, recovery, and sequence assembly still need further improvement.
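The point above that technical noise can mask variation in lowly expressed genes follows from simple sampling statistics. The sketch below assumes that each mRNA molecule in a cell is captured independently with the same probability, so that detected counts follow a binomial distribution; the 10% capture efficiency and the molecule numbers are hypothetical values chosen only for illustration.

```python
import math


def capture_cv(n_molecules: int, capture_efficiency: float) -> float:
    """Coefficient of variation of detected counts under independent capture.

    For a binomial count with n molecules and capture probability p:
        mean = n * p, variance = n * p * (1 - p), CV = sqrt((1 - p) / (n * p)).
    """
    if n_molecules <= 0 or not 0.0 < capture_efficiency <= 1.0:
        raise ValueError("need n_molecules > 0 and 0 < capture_efficiency <= 1")
    expected = n_molecules * capture_efficiency
    variance = n_molecules * capture_efficiency * (1.0 - capture_efficiency)
    return math.sqrt(variance) / expected


if __name__ == "__main__":
    for n in (5, 50, 1000):
        print(f"{n:5d} molecules per cell: technical CV = {capture_cv(n, 0.10):.2f}")
```

In this model a gene present at a handful of copies has a technical coefficient of variation above 1, whereas a highly expressed gene has one below 0.1, so improving capture and reverse transcription efficiency benefits lowly expressed genes the most.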

Protein analysis is far more challenging than nucleic acid analysis. Undoubtedly, the complexity of the proteome and the lack of both amplification methods and highly specific, high-affinity probes make protein analysis technically demanding. Because the cell contents are highly diluted after lysis, high-affinity probes (not only monoclonal antibodies) and highly sensitive detection methods are needed to detect low-abundance proteins and posttranslational modifications.

To summarize, single cell analysis now stands poised to illuminate this new layer of biological complexity under normal development and disease conditions. Considering the rapid progress in the development of both single cell isolation and analysis technologies, many of the problems mentioned above will likely be solved in the near future. Nevertheless, further developments and interdisciplinary cooperative work between technologists, scientists, and clinicians will be necessary. In the distant future, we expect that single cell techniques will become a powerful tool to unravel longstanding questions in both biological research and clinical diagnostics.

### AUTHOR CONTRIBUTIONS

GD and HX conceived the structure of the manuscript; PH and WZ wrote the manuscript; GD and HX read, edited, and approved the manuscript; Mr. Brian Deng (Stanford University) helped with the discussion and English-language editing.

### ACKNOWLEDGMENTS

This work was supported by grants from the National Basic Research Program of China (2013CB531103 to HX), the National Natural Science Foundation of China (91339113, 81270202 to HX, 81601179 to WZ), the Natural Science Foundation of Jiangxi Province of China (20161BAB204166 to WZ, 20161BAB205212 to PH), the Shenzhen Basic Research Program (20140825105648 to GD), a Wuhan Science and Technology Bureau grant (2014060202010125 to GD), and the Hubei 100 Talents and Wuhan 3551 Talents Programs (to GD).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer ASB and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Hu, Zhang, Xin and Deng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Single Cell Multi-Omics Technology: Methodology and Application

Youjin Hu<sup>1</sup> \*, Qin An<sup>2</sup> , Katherine Sheu<sup>2</sup> , Brandon Trejo<sup>2</sup> , Shuxin Fan<sup>1</sup> and Ying Guo<sup>3</sup> \*

*<sup>1</sup> Zhongshan Ophthalmic Center, State Key Laboratory of Ophthalmology, Sun Yat-sen University, Guangzhou, China, <sup>2</sup> Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA, United States, <sup>3</sup> The Second Affiliated Hospital, Xiangya School of Medicine, Central South University, Changsha, China*

In the era of precision medicine, multi-omics approaches enable the integration of data from diverse omics platforms, providing multi-faceted insight into the interrelation of these omics layers on disease processes. Single cell sequencing technology can dissect the genotypic and phenotypic heterogeneity of bulk tissue and promises to deepen our understanding of the underlying mechanisms governing both health and disease. Through modification and combination of single cell assays available for transcriptome, genome, epigenome, and proteome profiling, single cell multi-omics approaches have been developed to simultaneously and comprehensively study not only the unique genotypic and phenotypic characteristics of single cells, but also the combined regulatory mechanisms evident only at single cell resolution. In this review, we summarize the state-of-the-art single cell multi-omics methods and discuss their applications, challenges, and future directions.

Keywords: single cell transcriptome, single cell multi-omics profiling, single cell epigenome, single cell proteome, gene regulation, epigenetics

### INTRODUCTION

According to the central dogma, also known as the DNA-RNA-protein axis, DNA provides the code for RNA, which is translated to produce proteins that fulfill biological functions (Crick, 1970). To discover the regulatory mechanisms behind RNA transcription and protein translation, the most straightforward approach is to analyze both DNA and RNA, or both RNA and protein, from the same sample. Despite the complexity of tissues comprised of heterogeneous cell populations, such as cancer, most experimental results to date have been based on analysis of bulk samples, which theoretically read an averaged signal from the population and prevent resolution of cellular variation (Navin et al., 2011; Huang et al., 2015; Gawad et al., 2016). To decipher the mechanism of heterogeneous gene transcriptional regulation, integrated measurement and co-analysis of multiple types of molecules, such as DNA, RNA, and protein, at single cell level is required.

The invention of PCR methods in 1983 made it possible to analyze the picogram amounts of DNA in single cells, although these initial methods could only amplify small, targeted regions of the genome. However, the development of whole genome amplification (WGA) and whole transcriptome amplification (WTA) methods (Tang et al., 2009; Zong et al., 2012; Huang et al., 2015; Wang and Navin, 2015; Gawad et al., 2016) soon allowed quantitative measurement of DNA and RNA for multiple genes in single cells. At the same time, the development of next generation sequencing technology has enabled genome-wide analysis of DNA and RNA in single cells. Inspired by the very first report of single cell DNA sequencing and single cell RNA sequencing, scientists have developed numerous methods to measure other omics at single cell level, including single cell DNA methylation, single cell chromatin sequencing and single cell proteome analysis [**Figure 1**; a detailed introduction of single cell sequencing methods has been reviewed elsewhere (Wang and Navin, 2015; Gawad et al., 2016)].

### Edited by:

*Xinghua Victor Pan, Yale University, United States*

### Reviewed by:

*Zhibin Wang, Johns Hopkins University, United States Leonard C. Edelstein, Thomas Jefferson University, United States Stephen Clark, Babraham Institute (BBSRC), United Kingdom*

### \*Correspondence:

*Youjin Hu huyoujin@gzzoc.com Ying Guo mytyl.g@hotmail.com*

### Specialty section:

*This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology*

> Received: *15 December 2017* Accepted: *08 March 2018* Published: *20 April 2018*

### Citation:

*Hu Y, An Q, Sheu K, Trejo B, Fan S and Guo Y (2018) Single Cell Multi-Omics Technology: Methodology and Application. Front. Cell Dev. Biol. 6:28. doi: 10.3389/fcell.2018.00028*

Single cell genome-wide approaches provide a valuable opportunity to measure different molecules, such as DNA, RNA, protein, and chromatin with ultimate resolution. By isolating multiple types of molecules (DNA, RNA, or protein) from a single cell simultaneously, it is feasible to profile different types of molecules in parallel. For example, genomic DNA can be used to assay the single cell genome, methylome or chromatin accessibility, while RNA from the same cell can be used to profile the transcriptome, and protein the proteome. Utilizing these different single cell omics profiling strategies as building blocks, we can construct a multi-omics profile for the same cell. Here, we summarize current single cell multi-omics approaches, such as scG&T-seq (single cell Genome & Transcriptome sequencing), scMT-seq (single cell Methylome and Transcriptome sequencing), scM&T-seq (single cell Methylome & Transcriptome sequencing), scTrio-seq (single-cell triple omics sequencing), and scCOOL-seq (single cell Chromatin Overall Omic-scale Landscape Sequencing) (MacAulay et al., 2015; Angermueller et al., 2016; Hou et al., 2016; Hu et al., 2016), with each of them measuring a different combination of omics data (**Figure 2**). We also review the bioinformatics advances that have been necessary to understand the large amounts of multi-dimensional data arising from single cell multi-omics profiling, and we examine the potential for this technology to elucidate numerous biological enigmas.

### METHODS FOR ISOLATING MULTIPLE TYPES OF MOLECULES FROM A SINGLE CELL

Isolating multiple types of molecules from a single cell is the starting point for single cell multi-omics measurement, and generally can be divided into two steps.

The first step is to collect a single cell randomly from a population with heterogeneity. The standard protocol is to get viable, intact cells by mechanical or enzymatic dissociation and then capture single cells from the dissociated cell suspension. Several approaches can be used, including mouth pipetting, serial dilution, robotic micromanipulation, flow-assisted cell sorting (FACS), and microfluidic platforms (Wang and Navin, 2015). Although these collection approaches are borrowed from methods developed for single cell mono-omics sequencing, additional considerations must be taken for multi-omics to ensure that multiple types of molecules can be viably measured in the same cell. The success of this first collection step is critical for preserving an accurate representation of the DNA, RNA, and protein within the cell for downstream measurements. The method used for the initial dissociation of tissues into single cells—mechanical or enzymatic—needs to be selected with consideration for both the nature of the starting material and the types of sequencing to be performed. Clinical samples such as solid tumors are often obtained flash frozen or embedded in paraffin (FFPE), making multi-omics measurements that include cytoplasmic RNA or protein more challenging. However, because this type of freezing process perturbs the cytoplasmic membrane while keeping the nuclear membrane intact, multi-omics measurements that involve the genome, epigenome, and chromatin-associated RNA are still possible after creation of nuclear suspensions (Navin, 2015). For fresh tissues, choice of mechanical or enzymatic dissociation reflects the need for both cell integrity and dissociation quality. 
Prolonged exposure to common dissociation enzymes such as papain, collagenase, dispase, and neutral protease can result in degradation of RNA and proteins, or generation of cell debris that aberrantly activates cell signaling pathways and cell surface proteins (Autengruber et al., 2012; Volovitz et al., 2016). Mechanical mincing of the starting material through trituration or nanofiltration may also disrupt accurate representation of the proteome or transcriptome in cells that contain long projections such as neurons. These pitfalls in turn can complicate the subsequent computational analyses performed on the data, which often involve identification of correlative relationships among the different layers of multi-omics data obtained. Thus, both tissue-specific and measurement-specific aspects of obtaining multi-omics measurements need to be considered in order to achieve optimized single cell suspensions.

Next, the technique used to select single cells after separation of bulk tissues also has an impact on the feasibility of combinatorial multi-omics measurements. The advantages of techniques such as mouth pipetting and serial dilution include the simplicity and rapidity of moving single cells from the cell suspension to individual reaction chambers. This helps limit the degradation of more labile molecules such as RNA or protein and may reduce the possibility of non-physiologic changes in chromatin accessibility and chromatin conformation (Wang and Navin, 2015; Svensson et al., 2017). Robotic manipulation, FACS, and microfluidic capture platforms can sort subpopulations by cell labeling, but require more extensive manipulation of single cells using expensive equipment (Ortega et al., 2017). Of the numerous options, selection of a protocol for isolating single cells for multi-omics data collection will ultimately depend on the molecules that need to be preserved, the type of tissue obtained, and the cost.

The second step is to isolate multiple types of molecules from the same cell, for which there are four main strategies. The first strategy, used to isolate DNA and RNA from a single cell, is physical separation of the nucleus from the cytosol, as genomic DNA is contained in the nucleus and the majority of mRNAs are located in the cytosol. Single cells are treated with a membrane-selective lysis buffer, which breaks down the cell membrane while keeping the nucleus intact. Then, single nuclei are separated from cytoplasm by micropipetting, centrifugation, or antibody-conjugated magnetic microbeads (Hou et al., 2016; Hu et al., 2016; Han et al., 2018; **Table 1**). This method has been demonstrated to be highly efficient by several research groups, including our lab. Our data indicate that profiling of cytosolic RNA can resemble the transcriptome of the whole cell. However, the micropipetting variant is low throughput (Hu et al., 2016), as the nucleus-picking procedure is manual and cannot be automated easily. Methods based on centrifugation (Hou et al., 2016) or antibody-conjugated magnetic microbeads (Han et al., 2018) can achieve relatively higher throughput in isolating DNA and RNA from single cells.

profiling the genome, epigenome, transcriptome, and proteome are shown by different shapes with variable colors (Middle). Single cell multi-omics methods are built by combining different single cell sequencing methods to simultaneously profile multiple types of molecules of a single cell genome wide (Bottom). For example, G&T-seq was built by combining genome (orange) and transcriptome (yellow) to simultaneously detect DNA and RNA of the same cell genome wide.



The second strategy uses oligo-dT primer coated magnetic beads to bind and separate polyadenylated mRNA from DNA (MacAulay et al., 2015; Angermueller et al., 2016). Genome wide sequencing of single cell DNA and RNA purified by this method indicated that breadth of genome coverage and number of genes detected were not affected by the process of separation, indicating high efficiency in the recovery of DNA and RNA. Since this strategy is adaptable to liquid-handling robots or automated work stations, higher throughput can be achieved. However, coverage of the isolated DNA was less evenly distributed across the genome than in sequencing of whole single cells, which may reduce the accuracy of copy number analysis for certain genomic regions at suboptimal sequencing depth.

Besides direct physical isolation of DNA and RNA at the beginning, the third strategy is to preamplify DNA and RNA simultaneously, followed by separation into two parts (Dey et al., 2015). Whole transcriptome sequencing of preamplified RNA of one part showed a similar number of genes covered compared to that of whole single cells. However, as the amplified DNA does not retain methylation states, this method is not suitable for methylome analysis.

The fourth strategy is to split the material of a single cell into two parts directly. For example, a recent report used the splitting strategy to split a single cell into two parts and simultaneously analyze the RNA and protein of the same cell (Darmanis et al., 2016). This splitting strategy is not an ideal method to isolate substrates such as DNA because some material will inevitably be lost due to the uneven split. However, for RNA and protein molecules with high copy number in the single cells, this method is feasible as long as the split is even between the two parts.
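The cost of splitting can be made concrete with a simple sampling argument. The sketch below assumes, as our own simplification rather than a result from the cited study, that each copy of a molecule lands in a given fraction of the lysate independently of the others; the function name `prob_detectable` is hypothetical.

```python
def prob_detectable(copies: int, fraction: float = 0.5) -> float:
    """Probability that at least one of `copies` molecules ends up in a
    sub-sample holding `fraction` of the lysate.

    Assumes every molecule is partitioned independently (illustrative model).
    """
    if copies < 0 or not 0.0 <= fraction <= 1.0:
        raise ValueError("copies must be >= 0 and fraction within [0, 1]")
    return 1.0 - (1.0 - fraction) ** copies


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, prob_detectable(n))
```

Under this assumption a single-copy template is absent from a given half in 50% of cells, whereas a transcript or protein present at 100 copies is almost never lost entirely, which is consistent with the text's point that splitting suits high-copy molecules better than DNA.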

### INTEGRATION OF GENOME AND TRANSCRIPTOME

The first single cell transcriptome analysis was reported in 2009 (Tang et al., 2009), and many additional single cell RNA sequencing methods have been developed since, such as Quartz-seq (Sasagawa et al., 2013), Smart-seq (Switching mechanism at 5′ end of the RNA transcript) (Goetz and Trimarchi, 2012; Picelli et al., 2014), and CEL-seq (Cell expression by linear amplification and sequencing) (Hashimshony et al., 2012), which were developed using different strategies for different purposes. For example, Quartz-seq detects the 3′ end of transcripts, while Smart-seq detects full length transcripts. CEL-seq barcodes and pools samples before linearly amplifying mRNA to multiplex single cell samples. In parallel, due to the development of single-cell whole-genome amplification (WGA) methods, single cell genome sequencing technologies have also been established. At present, four major WGA methods have been reported: DOP-PCR (degenerate oligonucleotide-primed polymerase chain reaction) (Telenius et al., 1992), MDA (Multiple Displacement Amplification) (Dean et al., 2001), MALBAC (Multiple Annealing and Looping Based Amplification Cycles) (Zong et al., 2012) and PicoPLEX (Rubicon Genomics PicoPLEX Kit). In 2013, Han et al. first reported a co-detection of DNA and RNA from the same single cell (Han et al., 2014), which was achieved by physical isolation of the cytoplasm (containing cytoplasmic RNAs) from the nucleus (containing the intact genome) of the same single cell, followed by separate amplification and sequencing of the transcriptome and genome. Although the initial report presented whole transcriptome data but, for the genome, only Sanger sequencing of a selected set of genomic sequences rather than whole genome data, it paved the way for multi-omic profiling methods.
Later, experimental protocols that simultaneously sequenced the genome and transcriptome were developed by elegantly integrating existing single cell sequencing methods, namely DR-seq (gDNA and mRNA sequencing) (Dey et al., 2015) and G&T-seq (Genome & Transcriptome sequencing) (MacAulay et al., 2015). In DR-seq, a cell is lysed completely, releasing its DNA and RNA into the same reaction system. The genomic DNA and cDNA are first amplified together, and the product is then split into two halves: one for RNA-seq using the CEL-seq protocol, and the other for genome sequencing using MALBAC (Dey et al., 2015). In contrast to DR-seq, G&T-seq separates poly-A tailed mRNAs from DNA using oligo-dT-coated magnetic beads. The separated mRNA and DNA are then sequenced using Smart-seq2 and one of several WGA protocols (MDA or PicoPLEX), respectively (MacAulay et al., 2015). Most recently, Han et al. reported a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells by using hypotonic lysis to preserve nuclear lamina integrity and subsequently capturing the cell lysate using antibody-conjugated magnetic microbeads. They found that copy-number variations positively correlated with the corresponding gene expression levels (Han et al., 2018). In summary, using DR-seq, G&T-seq and SIDR, researchers were able to directly determine the correlation between large-scale copy number variation and transcription levels in the CNV regions.
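The kind of copy number versus expression comparison described above can be sketched for one cell as follows. This is a minimal illustration, not the analysis used in the cited studies: the function name, the use of Pearson correlation, and the `log1p` transform are all assumptions made here for clarity.

```python
import numpy as np


def cnv_expression_correlation(copy_number, expression):
    """Pearson correlation between per-region copy number and
    log-transformed expression measured in the same cell."""
    cn = np.asarray(copy_number, dtype=float)
    expr = np.log1p(np.asarray(expression, dtype=float))
    if cn.shape != expr.shape or cn.ndim != 1:
        raise ValueError("inputs must be 1-D arrays of equal length")
    if cn.std() == 0 or expr.std() == 0:
        return float("nan")  # correlation is undefined without variation
    return float(np.corrcoef(cn, expr)[0, 1])


# Hypothetical values for six genomic regions in one cell.
copy_number = [2, 2, 3, 4, 1, 2]
expression = [10, 12, 20, 35, 4, 9]
r = cnv_expression_correlation(copy_number, expression)
```

Because both vectors come from the same cell, a positive `r` here reflects a within-cell relationship rather than a difference between cell populations.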

As discussed previously by MacAulay et al. (2017), a substantial advantage of direct measurement of multiple molecular types from the same single cell over separate measurement of each type of molecule from different cells is that genotype-phenotype correlation can be determined unambiguously. First, the genomic variation can be directly linked to the transcriptional variation without being confounded by cell heterogeneity, enabling the dissection of potential molecular mechanisms underlying variable phenotypes among single cells. Second, coupled with lineage recording technology, simultaneous sequencing of the genome and transcriptome can be used for reconstruction of lineage trees. Genomic profiling of single cells can reveal the lineage relationship among single cells, based on inherited mutations. The transcriptome profiling of the same single cells can in parallel provide information about the cell's phenotype and function. One intriguing application of this method is to dissect the mechanism of heterogeneity of tumor cells to inform our knowledge of tumor formation and potential therapeutic targets (Shapiro et al., 2013). Third, simultaneous sequencing of DNA and RNA of the same cell can detect DNA mutations with higher accuracy, as the mutations found in DNA or RNA can be verified by each other. This strategy can be very helpful in situations where highly accurate mutation calling from a single cell is required, such as genetic diagnosis screening during in vitro fertilization, when only 1–2 single blastomeres are available (Vermeesch et al., 2016). Of note, post-transcriptional modifications such as RNA editing (Tan et al., 2017), which may affect the concordance of variations between DNA and RNA, should be taken into consideration to call mutations precisely.
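The cross-verification idea can be illustrated with a small bookkeeping sketch. The categories and the function name below are hypothetical conveniences introduced here, not a published variant-calling procedure; real pipelines would also weigh read depth and base quality.

```python
def cross_validate_variants(dna_calls, rna_calls, expressed_sites):
    """Split DNA variant calls by whether RNA from the same cell supports them.

    Each call is a (chromosome, position, alt_allele) tuple; `expressed_sites`
    holds (chromosome, position) pairs with enough RNA coverage to judge.
    """
    dna, rna = set(dna_calls), set(rna_calls)
    confirmed = dna & rna
    # Covered by RNA reads but not seen there: candidate false positive,
    # allele-specific expression, or an effect such as RNA editing.
    discordant = {v for v in dna - rna if (v[0], v[1]) in expressed_sites}
    # No usable RNA coverage, so the transcriptome cannot confirm or refute.
    untestable = dna - confirmed - discordant
    # Seen only in RNA: candidate RNA editing event or a missed DNA call.
    rna_only = rna - dna
    return confirmed, discordant, untestable, rna_only
```

Keeping `rna_only` separate reflects the caveat in the text: an RNA-specific change may be an editing event rather than a genomic mutation.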

### INTEGRATION OF EPIGENOME WITH TRANSCRIPTOME

Based on the development of technologies for single cell epigenome and transcriptome profiling, methods for the integrated analysis of the epigenome and transcriptome were developed (Angermueller et al., 2016; Hou et al., 2016; Hu et al., 2016). DNA methylation has been demonstrated to have key regulatory functions on gene expression in many biological processes, so the relationship between the DNA methylome and transcriptome from the same single cell is of great interest. Two major methods for single cell methylome analysis are single cell reduced representation bisulfite sequencing (scRRBS) (Guo et al., 2013) and single cell whole genome bisulfite sequencing (scWGBS) (Smallwood et al., 2014). The first reported combined DNA methylome and transcriptome profiling method is scM&T-seq (single cell methylome and transcriptome sequencing), which uses the G&T-seq procedure to isolate DNA and RNA from the same single cell. The protocols for mRNA capture, amplification and sequencing are the same as those in G&T-seq. In parallel, the genomic DNA is subjected to bisulfite treatment and sequencing, allowing the simultaneous profiling of the DNA methylome and RNA transcriptome from the same single cell (Angermueller et al., 2016). Subsequently, scMT-seq (Hu et al., 2016) and scTrio-seq (Hou et al., 2016) were reported using a different strategy to isolate DNA and RNA from a single cell, in which the cell membrane but not the nucleus is selectively lysed to release RNA, and the intact nucleus is then physically separated from the cell lysate (Hou et al., 2016; Hu et al., 2016; Guo et al., 2017). In the scMT-seq method, the single cell nucleus is collected by micropipette and subjected to scRRBS, and mRNA in the lysate is amplified by a modified Smart-seq2 protocol. 
In scTrio-seq, the nucleus and cytosol are separated by centrifugation, and genomic DNA contained in the nucleus is sequenced by scRRBS while mRNA is amplified by the scRNA-seq protocol reported by Tang et al. (2009).

The simultaneous profiling of methylome and transcriptome of a single cell provides a unique opportunity to directly measure DNA methylation and gene transcription within the same single cell, and to study the correlation of DNA methylation differences with gene transcription variance across single cells. For example, scM&T-seq investigated the relationship between the transcriptome and DNA methylome, and found that low methylated regions (LMR) showed high variance in methylation level, which is consistent with their role as distal regulatory elements that control gene expression (Angermueller et al., 2016). Our results using scMT-seq found that variable CpG sites were significantly enriched at non-CGI (non-CpG island) promoters but depleted at CGI (CpG island) promoters, suggesting that non-CGI promoters could be the major region contributing to methylome heterogeneity among dorsal root ganglion single cells. We also found that transcription level was positively correlated with gene body methylation, but negatively correlated with promoter methylation. In addition, by integrating the genomic SNP information, we found a correlation between allelic gene body methylation and allelic expression at single cell level. Thus, scMT-seq allows us to profile genome, DNA methylome and transcriptome in parallel within a single cell (Hu et al., 2016). Similarly, scTrio-seq enables profiling of DNA methylome, genome (CNV) and transcriptome at the same time, in which the copy number variation is computationally inferred from the scRRBS data (Hou et al., 2016). More recently, Guo et al. from the same group reported another single cell multi-omics sequencing method called single-cell COOL-seq that can profile DNA methylation and chromatin state/nucleosome positioning, copy number variation and ploidy simultaneously from the same cell (Guo et al., 2017). 
Although they did not incorporate RNA sequencing in this protocol (which is theoretically possible), this method provided new insights into the comprehensive study of genome-wide gene regulation at single cell level. Most recently, Clark et al. reported scNMT-seq (single-cell nucleosome, methylation, and transcription sequencing), which can simultaneously profile nucleosome occupancy, DNA methylation and transcription in a single cell. By profiling mouse embryonic stem cells, they found novel links between all three molecular layers and revealed dynamic coupling between epigenomic layers during differentiation (Clark et al., 2018).
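Matched methylome and transcriptome profiles make it possible to correlate the two layers gene by gene across cells. The sketch below is an illustration under assumptions stated here, not the procedure of any cited method: the function name, the `min_cells` threshold, and the choice of Pearson correlation are hypothetical, and `NaN` is used to mark regions without CpG coverage in a cell.

```python
import numpy as np


def methylation_expression_correlation(methylation, expression, min_cells=3):
    """Per-gene Pearson correlation across cells.

    Both inputs are (cells x genes) arrays. NaN in either array marks a
    missing measurement in that cell, which is skipped for that gene.
    """
    meth = np.asarray(methylation, dtype=float)
    expr = np.asarray(expression, dtype=float)
    if meth.shape != expr.shape or meth.ndim != 2:
        raise ValueError("inputs must be 2-D arrays of equal shape")
    result = np.full(meth.shape[1], np.nan)
    for g in range(meth.shape[1]):
        ok = ~np.isnan(meth[:, g]) & ~np.isnan(expr[:, g])
        m, e = meth[ok, g], expr[ok, g]
        # Require enough covered cells and some variation in both layers.
        if ok.sum() >= min_cells and m.std() > 0 and e.std() > 0:
            result[g] = np.corrcoef(m, e)[0, 1]
    return result
```

Passing promoter methylation would be expected to give mostly negative values and gene body methylation mostly positive ones if the relationships reported above hold in the dataset at hand.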

### PARALLEL PROFILING OF RNA AND PROTEIN

RNA and protein have distinctive biochemical properties. Compared to genomic sequencing methods, the throughput in terms of the number of proteins that can be detected by single cell proteome profiling is limited. Until now, a few single cell proteomic methods have been developed based on different strategies, including fluorescence-activated cell sorting (FACS), western blot, metal-tagged antibodies followed by mass cytometry, and oligonucleotide-labeled antibodies. Although the multiplexing of these approaches was still limited to tens of proteins for a single cell, they demonstrated the feasibility of detecting protein and RNA expression together, paving the way to discovering the dynamics of RNA and protein within the same cell. Darmanis et al. developed a method based on a homogeneous affinity-based proximity extension assay that converts protein abundance into tag-oligo levels (Darmanis et al., 2016), and both transcript level and protein level were quantified by qPCR. This method has succeeded in capturing parallel profiles of protein and RNA for up to 96 genes (Darmanis et al., 2016). Another approach to simultaneously detect the RNA and protein of the same cell is PLAYR (proximity ligation assay for RNA). Briefly, the RNA transcripts are bound by and ligated to isotope-labeled probes. Transcript levels are converted into isotope label levels that can be easily measured together with elemental isotope-labeled protein using mass cytometry (Frei et al., 2016). With this method, simultaneous quantification of more than 40 different mRNAs and proteins can be achieved, although improvement is required to achieve genome-wide measurement with higher throughput. Most recently, two methods named REAP-seq and CITE-seq with higher throughput have been reported, in which oligonucleotide-labeled antibodies are used to integrate cellular protein and transcriptome measurements into an efficient, single-cell readout (Peterson et al., 2017; Stoeckius et al., 2017). 
Proteins quantified with 82 barcoded antibodies and more than 20,000 genes can be detected in a single workflow.

### STRATEGIES FOR BIOINFORMATICS ANALYSIS OF SINGLE CELL SEQUENCING DATA

Single cell sequencing technologies for genome wide profiling of DNA and RNA, as well as the subsequent integrative computational analysis methods, are central to the interpretation of single cell multi-omics data. The prelude to this type of analysis hinges first on the development of bioinformatics approaches for single cell single-omics sequencing data for various individual types of molecular measurements. Because technical characteristics of various single cell sequencing protocols are different, the bioinformatics methods involved must also be customized to correctly analyze each data type. The need to address the specific characteristics of different single cell sequencing approaches has inspired many computational methods that allow us to better analyze sequencing datasets involving multiple layers.

### Single Cell Genome Sequencing

Two major purposes of single-cell genome sequencing are identifying copy number variation and identifying point mutations/SNPs. Both these questions have been addressed in bulk WGS, and the methods developed for bulk WGS data have provided guidance for single cell WGS analysis.

Copy number variation can be robustly identified using Hidden Markov Model (HMM) or Circular Binary Segmentation (CBS), and these methods have proved effective for scWGS data (Knouse et al., 2016). Although these two methods perform similarly in many situations, user-defined parameter adjustments within the algorithms can affect the sensitivity and specificity of copy number calls. For example, comparison of these two methods on scWGS data with a range of parameters indicated that CBS was more sensitive in calling copy number losses, while HMM was more sensitive in calling gains (Knouse et al., 2016). In the context of single cell CNV analysis, one strategy to reconcile the two approaches has been to take the overlap of CNVs identified by CBS and HMM to increase confidence (Knouse et al., 2016). Considerations in choosing between the methods involve the biological properties of the samples, such as the expected sizes of the CNVs, which could range from whole-arm changes seen in aneuploid tumors to dinucleotide changes observed in inherited polymorphisms or in microsatellite instability. CBS is more flexible than HMM in that the algorithm recursively searches for segmentation points in an unsupervised approach, while HMM depends on the assumption that segmentation points follow a homogenous Poisson process, which is not always the case and may therefore compromise flexibility (Wineinger et al., 2008).
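The consensus strategy of keeping only CNVs supported by both callers can be sketched as an interval intersection. How "overlap" is defined is an assumption made here (same chromosome, same gain/loss state, at least `min_overlap` shared bases); the function name and tuple layout are hypothetical and do not correspond to the output format of any particular CBS or HMM implementation.

```python
def consensus_cnvs(cbs_calls, hmm_calls, min_overlap=1):
    """Intersect CNV segments reported by two callers.

    Each call is (chromosome, start, end, state) with half-open coordinates
    and state "gain" or "loss". Only overlaps with matching chromosome and
    state that span at least `min_overlap` bases are kept.
    """
    consensus = []
    for chrom_a, start_a, end_a, state_a in cbs_calls:
        for chrom_b, start_b, end_b, state_b in hmm_calls:
            if chrom_a != chrom_b or state_a != state_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if end - start >= min_overlap:
                consensus.append((chrom_a, start, end, state_a))
    return sorted(consensus)
```

The pairwise loop is quadratic in the number of segments, which is acceptable for the handful of segments typical of one cell; a sorted sweep would be preferable for large call sets.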

Many tools have been developed for detecting variations in bulk WGS data (Depristo et al., 2011; Koboldt et al., 2012), and these methods, in principle, should perform well in scWGS data. However, scWGS data suffers from high allele coverage bias and high PCR amplification error, which could impair the performance of variant calling methods if not corrected. Recently, with increased understanding of coverage bias in scWGS data (Zhang et al., 2015), Dong et al. reported a computational method that can correct amplification bias to reduce false positive SNPs resulting from PCR or sequencing errors (Dong et al., 2017). Although this new method still partially relies on GATK to identify new variants, it achieved better accuracy by removing false positive variants resulting from PCR error.

### Single Cell Transcriptome Sequencing

Single cell RNA-seq data enables the discovery of exciting and new biological phenomena while presenting new challenges for data analysis. For example, single-cell RNA-seq can help us identify cell subtypes with unprecedented resolution, and reconstruct continuous cell lineages. Some early studies showed that identification of cell subtypes or reconstruction of cell lineage could be done manually by experts with sufficient biological prior knowledge using basic statistical methods (Xue et al., 2013; Treutlein et al., 2014). However, recently, huge datasets with extremely heterogeneous cell populations have precluded the feasibility of manual annotation, and many computational pipelines have been developed. For example, tools based on different theoretical frameworks have been developed to cluster cells based on their gene expression similarity, such as SINCERA (Guo et al., 2015), pcaReduce (Žurauskiene and Yau, 2016), SC3 (Kiselev et al., 2017), and SNN-Cliq (Xu and Su, 2015). Additional tools have been developed to reconstruct cell lineage by ordering cells according to computationally inferred pseudo-time (Trapnell et al., 2014; Cannoodt et al., 2016; Qiu et al., 2017). However, despite the availability of myriad computational software packages for clustering and lineage inference, few benchmarking studies have been done to compare their performance.

In addition to those two classical biological questions, the technical problem of imputation of missing values in single-cell RNA-seq data has recently attracted increasing attention. Single-cell RNA-seq, especially for cells captured by droplet-based methods, is often plagued by missing values due to dropout events, leading to an exceedingly sparse depiction of the single cell transcriptome. Simply removing genes containing missing values restricts the analysis to only highly expressed genes. To overcome this problem, much effort has been made to impute missing values (Kiselev et al., 2017; Lin et al., 2017). These imputation methods can not only enable us to investigate lowly expressed genes but can also improve the performance of existing computational methods for other purposes by reducing noise from drop-out events.
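The effect of discarding genes with missing values can be seen on a toy count matrix. In this sketch every zero is treated as a missing value, which is a simplification adopted here (a zero may also reflect true absence of the transcript); the function name and the example counts are hypothetical.

```python
import numpy as np


def dropout_summary(counts):
    """Per-gene zero fraction and the genes that survive complete-case filtering.

    `counts` is a (cells x genes) matrix of read or UMI counts.
    """
    counts = np.asarray(counts)
    zero_fraction = (counts == 0).mean(axis=0)
    kept = np.flatnonzero(zero_fraction == 0)  # genes detected in every cell
    return zero_fraction, kept


# Four cells, three genes: one highly expressed, two lowly expressed.
counts = [[50, 0, 3],
          [62, 1, 0],
          [41, 0, 0],
          [55, 2, 1]]
zero_fraction, kept = dropout_summary(counts)
```

Only the first, highly expressed gene is detected in every cell, so complete-case filtering discards both lowly expressed genes, which is the restriction that imputation methods aim to avoid.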

### Single Cell Methylome Analysis

Compared to bulk WGBS (whole genome bisulfite sequencing) data, the analysis of single cell WGBS requires distinct bioinformatics techniques due to the sparse and uneven coverage of scWGBS (single cell WGBS) libraries across the genome. Although many tools have been developed for bulk WGBS data analysis, these methods will fail if applied to scWGBS data directly. To make scWGBS data analysis possible, the first strategy is to merge data from single cells and analyze the merged data as a single sample (Farlik et al., 2016). By combining data from many single cells (usually hundreds), the data coverage becomes high, and the bias from allele dropout is averaged out. However, this strategy cannot be used to address the heterogeneity of methylation among different single cells, because methylation data are merged and averaged among the cell population.
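The merging strategy amounts to pooling read counts per CpG before computing methylation levels. A minimal sketch follows, assuming each cell is represented as a mapping from CpG site to (methylated reads, total reads); this data layout and the function name are hypothetical.

```python
from collections import defaultdict


def merge_cells(cells):
    """Pool per-CpG read counts across cells into one pseudo-bulk sample.

    `cells` is an iterable of dicts mapping (chromosome, position) to
    (methylated_reads, total_reads). Returns site -> pooled methylation level.
    """
    pooled = defaultdict(lambda: [0, 0])
    for cell in cells:
        for site, (meth, total) in cell.items():
            pooled[site][0] += meth
            pooled[site][1] += total
    return {site: m / t for site, (m, t) in pooled.items() if t > 0}
```

The pooled level at a site is a population average, so two cells with opposite methylation states at the same CpG yield 0.5, which shows directly how cell-to-cell heterogeneity is lost by merging.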

Aside from adapting scWGBS data to existing computational pipelines by merging data, the second strategy is to develop new methods specifically for scWGBS data, and many of these methods aim to aggregate methylation levels from adjacent CpG sites or regions with similar biological properties to overcome the sparseness of scWGBS data. For example, Smallwood et al. segment the genome into 5-kbp, non-overlapping bins and use the average methylation level within each bin as the feature for subsequent analysis (Smallwood et al., 2014). Similarly, by aggregating methylation signal on regulatory elements, we can reveal regulatory mechanisms behind the changes in the DNA methylome (Farlik et al., 2015). In these methods, each single cell is treated as a separate sample, thus enabling the discovery of DNA methylome heterogeneity among single cells.
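Fixed-window aggregation of this kind can be sketched for one cell as below. The 5,000-bp default follows the bin size mentioned in the text; weighting every covered CpG equally, the input tuple layout, and the function name are assumptions made for this illustration.

```python
from collections import defaultdict


def bin_methylation(cpg_calls, bin_size=5000):
    """Average CpG methylation in fixed, non-overlapping genomic bins.

    `cpg_calls` is an iterable of (chromosome, position, level) for one cell,
    with 0-based positions and `level` in [0, 1]. Bins without any covered
    CpG are simply absent from the result.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, level in cpg_calls:
        key = (chrom, pos // bin_size)  # bin index along the chromosome
        sums[key] += level
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Running this per cell gives one feature vector per cell, so the cell-to-cell differences that merging would erase remain available for downstream analysis.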

Interestingly, besides aggregating existing methylation information to reduce noise, a method based on a deep neural network was recently developed, which infers missing methylation information from sequence motifs (Angermueller et al., 2017). Although this method achieved high prediction accuracy across the whole genome, its performance on low-methylated regions (LMRs), the regulatory regions where methylation level influences gene expression greatly, was not satisfactory. However, we believe that the prediction accuracy on LMRs can be further improved by incorporating more features into the same deep learning framework.

### Single Cell Sequencing for Chromatin Status Analysis

Success in single cell genome and transcriptome sequencing inspired the development of single cell epigenome sequencing. So far, single cell ChIP-seq (Chromatin Immunoprecipitation Sequencing) (Rotem et al., 2015), DNase-seq, and ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) (Buenrostro et al., 2015) have been reported by different groups. Since this type of single cell epigenome data has just begun to emerge, the related computational analysis methods are still in their infancy and only a few methods have been developed specifically for single cell data. For example, scChIP-seq and scATAC-seq have been developed to investigate histone modification and chromatin accessibility landscapes at single cell level (Buenrostro et al., 2015; Rotem et al., 2015; Corces et al., 2016), but the reads from one single cell are extremely sparse due to the low amount of DNA in a cell. To identify the regions that have histone modification or regions with open chromatin, reads from several dozen to several hundred single cell libraries were pooled together, and only this "pooled library" has enough reads for conventional peak calling methods. In the subsequent analysis, these putative peaks are used as guidance to aggregate sparse signal and remove background signal. Although this method enables the meaningful analysis of scChIP-seq and scATAC-seq data without requiring any new computational methods, concerns have been raised about the sensitivity of this strategy (Zamanighomi et al., 2017). Interestingly, methods designed for scATAC-seq analysis are emerging, such as chromVAR (Schep et al., 2017) and scABC (Zamanighomi et al., 2017). We believe these pipelines will also inspire the development of effective pipelines for scChIP-seq data.
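The second half of the pooled-library approach, using peaks called on pooled reads to aggregate each cell's sparse signal, can be sketched as follows. Peak calling itself is assumed to have been done with a conventional tool and is not shown; the data layout and function name are hypothetical.

```python
def count_reads_in_peaks(cell_reads, peaks):
    """Build a per-cell vector of read counts over peaks from the pooled library.

    `cell_reads` maps a cell ID to a list of (chromosome, position) read
    positions; `peaks` is a list of (chromosome, start, end), half-open.
    Reads outside every peak are treated as background and dropped.
    """
    matrix = {}
    for cell_id, reads in cell_reads.items():
        row = [0] * len(peaks)
        for chrom, pos in reads:
            for i, (p_chrom, start, end) in enumerate(peaks):
                if chrom == p_chrom and start <= pos < end:
                    row[i] += 1
                    break  # peaks are assumed not to overlap
        matrix[cell_id] = row
    return matrix
```

A limitation visible in this sketch is that a region open in only a few cells may never reach the read depth needed to be called as a peak in the pooled library, which is one way the sensitivity concern noted above can arise.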

### APPLICATION OF SINGLE CELL MULTI-OMICS METHODS

As described above, single cell multi-omics analysis integrates multiple data sets from the genome, epigenome, transcriptome, and proteome, providing a unique chance to uncover novel biological processes. By extending and integrating methods developed for single-omics analysis, we can obtain a multi-channel molecular readout and use these features from multiple omics types to achieve a more comprehensive depiction of the state of a single cell. In combination with continuously advancing bioinformatic algorithms and computational resources, experimental collection of multi-omics data has allowed us to uncover increasingly important and complex insights.

The first application of single cell multi-omics methods is to identify cell subtypes from a heterogeneous cell population. Previously, for example, single cell RNA-seq approaches were shown to be effective in identifying cell subtypes such as human blood dendritic cells, monocytes, and neurons in human brain cortex (MacOsko et al., 2015; Ofengeim et al., 2017; Villani et al., 2017). Recently, single cell DNA methylation sequencing was also applied to study human brain cortex. By examining non-CpG methylation among single cells, the authors identified novel cell subtypes that were masked in scRNA-seq analysis (Luo et al., 2017). Epigenetic modifications such as DNA methylation are developmentally regulated and cell type-specific, yet stable over the life span. Profiling the epigenome and transcriptome simultaneously can therefore compensate for the limitation of single cell RNA-seq, which mainly yields information about highly expressed transcripts. Thus, different omics measurements can provide non-redundant information about cell identity and enable more detailed and more accurate dissection of complicated tissues.

Second, single cell multi-omics can be used to reconstruct cell lineage trajectories. Understanding cell lineage trajectories during the complete time course of multicellular animal development is the holy grail of developmental biology. DNA mutations, as well as epigenetic modifications gained during cell division and passed to the daughter cells, can be used for lineage tracing, while the transcriptome of the matching single cells can reveal the concomitant alteration of gene expression and transcriptional cell fate change during cell proliferation and differentiation. For example, cancer cells have extremely unstable genomes, and understanding cancer genome evolution is crucial for revealing "driver" mutations or copy number changes that cause carcinogenesis. Single cell multi-omics can not only help us determine the order in which different mutations occur during cancer evolution, but can also reveal their functional consequences, such as alteration in gene expression, which will eventually help us identify the causal mutations that induce the transition from normal cell to cancer cell.

Last but most importantly, single cell multi-omics data provide the resolution to reveal the relationship between different omics readouts directly. Correlation analysis between different omics layers is a prevailing approach to generate regulatory hypotheses between two omics data types. For example, cytosine methylation is among the best-studied epigenetic modifications and has been shown to regulate many critical biological processes. With both DNA and RNA sequencing data, DR-seq and G&T-seq have revealed the correlation between copy number variation and gene expression level at a single cell scale. Further, scTrio-seq showed that large-scale CNVs caused proportional changes in RNA expression of genes within the gained or lost genomic regions, whereas these CNVs generally did not affect DNA methylation in these regions. Our work using scMT-seq not only showed allele-specific expression patterns based on SNV information, but also showed correlation of DNA methylation with allele-specific expression, providing new insight into the study of imprinting and its underlying mechanism. In the near future, multi-omics methods may be helpful for understanding the correlation between DNA mutations and epigenetic modifications and their effects on gene expression, and thereby reveal the mechanisms underlying interesting biological questions such as dosage compensation and X-inactivation, among others (Livernois et al., 2012; Graves, 2016). Inevitably, even with single cell multi-omics technology, we are still limited to identifying correlation rather than causality. We therefore believe that single cell multi-omics, once combined with experimental perturbation, will be effective in allowing us to understand causal relationships among omics data types.
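As a concrete illustration of correlation analysis between two omics layers, the sketch below computes a Pearson correlation between the copy number of a genomic region and the expression of a gene in that region across matched single cells. The values are invented toy data and the function is a generic textbook implementation; it is not taken from DR-seq, G&T-seq, or scTrio-seq.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical matched measurements, one entry per single cell.
copy_number = [2, 2, 3, 3, 4, 1]                  # DNA layer
expression = [10.0, 11.0, 15.5, 14.5, 20.0, 5.0]  # RNA layer, arbitrary units

r = pearson(copy_number, expression)
```

A coefficient close to 1 in such data would be consistent with a proportional dosage effect, but, as noted above, it supports only a regulatory hypothesis and not causality.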

Essential to all these applications is the development of computational approaches that help to integrate multiple data layers and to recover information lost due to the sequencing of minute amounts of biological material. Bioinformatic and computational techniques have advanced single cell multi-omics technology in several arenas, such as (1) imputation of "dropped-out" single cell measurements, (2) indirect measurement of another omics layer from a measured one (Farlik et al., 2015; Bock et al., 2016), and (3) mathematical and statistical quantification of multi-dimensional associations (Lane et al., 2017). Imputation methods pull information from groups of similar cells to help restore measurements for molecules originally in very low abundance, such as lowly expressed RNA transcripts, filling in sparse data matrices for better representations of the original relationships (Van Dijk et al., 2017; Li and Li, 2018). Furthermore, as our knowledge of biological regulatory relationships increases, one data type may be able to serve as a proxy for inference of another omics layer. For example, transcription factor binding or copy number alterations have been indirectly inferred from single cell methylation data (Farlik et al., 2015; Hou et al., 2016). Likewise, copy number information can be inferred from the single cell transcriptome (Tirosh et al., 2016), and chromatin state from the methylome (Guo et al., 2017). In addition, as single cell multi-omics technology becomes progressively higher in throughput, the computational resources and time needed for processing the raw data will become an important constraint on the flexibility of data analysis. Pipelines and new algorithms that streamline and shorten the computational time needed for data processing will be important for increasingly complex, multi-dimensional experiments.
Raw files for each omics type must be separately processed, aligned, filtered, and quality-controlled in a manner that accounts for complications inherent in single cell measurements, such as low signal-to-noise ratio, technical amplification artifacts, and technical variation (Bock et al., 2016). Each omics layer of processed data is then assigned back to the single cell and co-analyzed with both mathematical and statistical models to reveal patterns of regulation. These new computational methods, while still nascent, allow us to bypass experimental limitations and expose novel relationships.
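The idea of imputation by borrowing information from similar cells can be sketched as follows. This is a deliberately simple nearest-neighbor average under stated assumptions (zeros are treated as candidate dropouts, similarity is Euclidean distance on raw counts); it is not the algorithm of any of the cited imputation methods, and the function names and toy matrix are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_impute(matrix, k=2):
    """Replace each zero entry with the mean value of that gene in the k
    most similar cells. Non-zero entries are left unchanged."""
    imputed = []
    for i, row in enumerate(matrix):
        neighbors = sorted(
            (j for j in range(len(matrix)) if j != i),
            key=lambda j: euclidean(row, matrix[j]),
        )[:k]
        new_row = []
        for g, value in enumerate(row):
            if value == 0 and neighbors:
                value = sum(matrix[j][g] for j in neighbors) / len(neighbors)
            new_row.append(value)
        imputed.append(new_row)
    return imputed

# Rows are cells, columns are genes (toy counts).
counts = [
    [5, 0, 3],  # cell A: gene 2 is a likely dropout
    [4, 2, 3],  # cell B
    [6, 2, 4],  # cell C
    [0, 9, 0],  # cell D: a different cell type with genuine zeros
]
imputed = knn_impute(counts, k=2)
```

In this toy matrix the dropout in cell A is restored from cells B and C, but the genuine zeros in cell D are also overwritten. Telling technical dropouts apart from true absence is a central difficulty that dedicated imputation methods have to address.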

### CONCLUSIONS AND FUTURE DIRECTIONS

Single cell multi-omics methods have provided countless opportunities to systematically understand biological diversity, and to identify rare cell types and their characteristics with unprecedented accuracy through integration of information from multiple omics levels, including DNA, RNA, and protein. These single cell multi-omics methods will play an important role in many diverse fields, and their applications are rapidly expanding, including (1) delineating cellular diversity, (2) lineage tracing, (3) identifying new cell types, and (4) deciphering the regulatory mechanisms between omics. Although some of the applications have been reported in initial studies, there are still many avenues open for exploration, and the further development of new multi-omics methods will also facilitate their increasing utility. It is anticipated that better performance of multi-omics methods will be achieved through the optimization of current single cell sequencing methods. There are currently several main challenges and thus opportunities for further development of single cell multi-omics technology: (1) Overcoming the limitations of current single cell sequencing methods will facilitate the development of more types of omics measurements on single cells. For example, beyond single cell DNA methylome analysis, there are other single cell epigenome sequencing methods such as scAba-seq (DNA hydroxymethylation) (Mooijman et al., 2016), single cell ATAC-seq (open chromatin) (Buenrostro et al., 2015), single cell Hi-C (chromatin conformation) (Nagano et al., 2013), and single cell ChIP-seq (histone modifications) (Rotem et al., 2015). However, due to limitations such as low genome coverage and high noise derived from locus dropout and PCR amplification, no reliable multi-omics approach based on these methods has been reported yet.
Optimization of the existing single cell sequencing methods as well as newly developed methods will provide more opportunities to integrate diverse methods with transcriptomic analysis to reveal the relationship between epigenetic states and RNA transcription variation. (2) New approaches to isolate and label multiple types of molecules of the same single cell will help to increase the number of omics profiled in parallel, from dual-omics to triple-omics or more. Even multiple functional parameters of single cells could be included, such as with the development of patch-seq, which combined whole-cell electrophysiological patch-clamp recordings, single-cell RNA-sequencing, and morphological characterization to identify new cell types in the nervous system (Cadwell et al., 2016, 2017). (3) In contrast to the rich resources of experimental protocols, computational methods for single cell multi-omics data analysis have just started to emerge. New computational approaches tailored to the analysis of single cell multi-omics data will also substantially facilitate the application of the methods (Yan et al., 2017). In summary, with further development of multi-omics methods, the future will witness an even wider application of single cell multi-omics technology that will result in meaningful findings never before achieved.

### AUTHOR CONTRIBUTIONS

YH and YG: Conceived the structure of the manuscript; YH, YG, and QA: Wrote the manuscript; QA, BT, KS, and SF: Read and edited the manuscript.

### ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (2017YFA0104100, 2017YFC1001300) and the National Natural Science Foundation of China (31700900).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Hu, An, Sheu, Trejo, Fan and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Fluorescence In situ Hybridization: Cell-Based Genetic Diagnostic and Research Applications

Chenghua Cui <sup>1,2</sup>, Wei Shu <sup>1,3</sup> and Peining Li <sup>1</sup>\*

*<sup>1</sup> Laboratory of Clinical Cytogenetics, Department of Genetics, Yale School of Medicine, New Haven, CT, USA, <sup>2</sup> Department of Pathology, Institute of Hematology and Blood Diseases Hospital, Chinese Academy of Medical Sciences, Tianjin, China, <sup>3</sup> Department of Cell Biology and Genetics, Guangxi Medical University, Nanning, China*

Fluorescence *in situ* hybridization (FISH) is a macromolecule recognition technology based on the complementary nature of DNA or DNA/RNA double strands. Selected DNA strands incorporated with fluorophore-coupled nucleotides can be used as probes to hybridize onto the complementary sequences in tested cells and tissues and then visualized through a fluorescence microscope or an imaging system. This technology was initially developed as a physical mapping tool to delineate genes within chromosomes. Its high analytical resolution to a single gene level and high sensitivity and specificity enabled an immediate application for genetic diagnosis of constitutional common aneuploidies, microdeletion/microduplication syndromes, and subtelomeric rearrangements. FISH tests using panels of gene-specific probes for somatic recurrent losses, gains, and translocations have been routinely applied for hematologic and solid tumors and are one of the fastest-growing areas in cancer diagnosis. FISH has also been used to detect infectious microbes and parasites such as malaria in human blood cells. Recent advances in FISH technology involve various methods for improving probe labeling efficiency and the use of super-resolution imaging systems for direct visualization of intra-nuclear chromosomal organization and profiling of RNA transcription in single cells. Cas9-mediated FISH (CASFISH) allowed *in situ* labeling of repetitive sequences and single-copy sequences without the disruption of nuclear genomic organization in fixed or living cells. The use of oligopaint-FISH and super-resolution imaging enabled *in situ* visualization of chromosome haplotypes from differentially specified single-nucleotide polymorphism loci. Single molecule RNA FISH (smRNA-FISH) using combinatorial labeling or sequential barcoding by multiple rounds of hybridization has been applied to measure mRNA expression of multiple genes within single cells.
Research applications of these single-molecule, single-cell DNA and RNA FISH techniques have visualized intra-nuclear genomic structure and sub-cellular transcriptional dynamics of many genes and revealed their functions in various biological processes.

Keywords: fluorescence in situ hybridization (FISH), genetic diagnosis, aneuploidy, pathogenic copy number variants (CNV), microdeletion/microduplication syndromes, Cas-9 mediated FISH (CASFISH), oligopaint-FISH, single molecule RNA FISH (smRNA-FISH)

### Edited by:

*Shixiu Wu, Hangzhou Cancer Research Institute, China*

### Reviewed by:

*Frederick Charles Campbell, Queen's University Belfast, UK; Marco Ghezzi, University of Padova, Italy*

> \*Correspondence: *Peining Li peining.li@yale.edu*

### Specialty section:

*This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology*

Received: *01 May 2016* Accepted: *11 August 2016* Published: *05 September 2016*

### Citation:

*Cui C, Shu W and Li P (2016) Fluorescence In situ Hybridization: Cell-Based Genetic Diagnostic and Research Applications. Front. Cell Dev. Biol. 4:89. doi: 10.3389/fcell.2016.00089*

### INTRODUCTION

Fluorescence in situ hybridization (FISH) uses DNA fragments incorporated with fluorophore-coupled nucleotides as probes to examine the presence or absence of complementary sequences in fixed cells or tissues under a fluorescence microscope. This hybridization-based macromolecule recognition tool was very effective in mapping genes and polymorphic loci onto metaphase chromosomes for constructing a physical map of the human genome (Langer-Safer et al., 1982; Lichter et al., 1993). FISH technology offers three major advantages: high sensitivity and specificity in recognizing targeted DNA or RNA sequences, direct application to both metaphase chromosomes and interphase nuclei, and visualization of hybridization signals at the single-cell level. These advantages increased the analytic resolution from Giemsa bands to the gene level and enabled rapid detection of numerical and structural chromosomal abnormalities (Klinger et al., 1992; Ried et al., 1992). Clinical application of FISH technology has upgraded classical cytogenetics to molecular cytogenetics. With the improvement in probe labeling efficiency and the introduction of super-resolution imaging systems, FISH has been renovated for research analysis of nuclear structures and gene functions. This review presents the recent progress in FISH technology and summarizes its diagnostic and research applications.

### CELL BASED GENETIC DIAGNOSIS BY FISH

### Analytical and Clinical Validities and Practice Guidelines

Most DNA fragments used as probes are extracted from bacterial artificial chromosome (BAC) clones, which contain cloned human genomic DNA sequences of 100–200 kilobases (Kb). These DNA fragments can be directly labeled by nick translation to incorporate nucleotides coupled with different fluorophores such as coumarins, fluoresceins, rhodamine, and cyanines (Cy3, Cy5, and Cy7) (Morrison et al., 2003). According to the targeted regions and labeling design, FISH probes can be divided into locus-specific probes targeted to specific regions or genes and regional painting probes for specific chromosomal bands, an entire chromosome, or the whole genome. Commonly used locus-specific probes include alpha repetitive sequences for centromeric regions and single copy sequences for subtelomeric and gene regions. Multi-color locus-specific probes allow simultaneous detection of numerical abnormalities of two to three regions in one FISH assay. For structural rearrangements, locus-specific probes with different fluorophores for two genes or for the 5′ and 3′ regions of a gene have been used to detect "double-fusion" signals resulting from a reciprocal translocation or "break apart" signals from a gene rearrangement, respectively. Painting probes have been used mostly in a research setting to dissect chromosome domains within a nucleus or structural rearrangements in metaphase chromosomes. **Figure 1** shows representative FISH applications of locus-specific and chromosome painting probes in the detection of numerical and structural chromosomal abnormalities.

Earlier studies evaluated signal-to-noise ratios, spatial resolution of the fluorescent signals, and hybridization/detection efficiencies of FISH tests on lymphocytes and amniocytes (Klinger et al., 1992; Ried et al., 1992). These studies led to the commercialization of FISH probes with optimized probe selection and standardized labeling, and demonstrated the clinical utility of FISH testing in large case series (Ward et al., 1993). To ensure safe and effective diagnostic application, a clinical cytogenetics laboratory needs to establish the analytical and clinical validities for every FISH assay. The analytical validity of a FISH assay is evaluated by its targeted accuracy, sensitivity, specificity, and normal reference ranges following a standardized laboratory procedure (Wolff et al., 2007; Ciolino et al., 2009). FISH testing can be used as an adjunctive assay or a stand-alone diagnostic assay for constitutional and somatic abnormalities. The clinical validity for its intended use should be evaluated by calculating the sensitivity from patients with targeted abnormalities and the specificity from normal controls. Other analytical and clinical considerations include possible false positive or negative results, continuous monitoring of signal variations, and periodic evaluation and batch-to-batch comparisons of probe performance (Test and Technology Transfer Committee, 2000).
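The two clinical validity measures named above follow the standard definitions: sensitivity is the fraction of patients with the targeted abnormality who test positive, and specificity is the fraction of normal controls who test negative. A minimal sketch, with hypothetical validation counts chosen only for illustration:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of patients with the targeted abnormality detected by the assay."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no affected patients in the validation set")
    return true_pos / total

def specificity(true_neg, false_pos):
    """Fraction of normal controls correctly reported as normal."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no normal controls in the validation set")
    return true_neg / total

# Hypothetical validation series: 50 affected patients, 100 normal controls.
sens = sensitivity(true_pos=48, false_neg=2)
spec = specificity(true_neg=99, false_pos=1)
```

Normal reference ranges, which the text also lists, require a separate statistical treatment of signal patterns in control cells and are not covered by this sketch.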

FISH technology enabled the detection of an increased spectrum of genetic disorders from chromosomal abnormalities to submicroscopic copy number variants (CNVs) and extended the cell-based analysis from metaphases to interphases (Xu and Li, 2013). The analytical resolution of FISH is in the range of 100–200 Kb as determined by the probe size, which is 50-fold higher than the 5–10 megabase (Mb) resolution of Giemsa banding in high-resolution karyotyping. Locus-specific probes detected submicroscopic CNVs and led to the identification of a group of genomic disorders (also termed contiguous gene syndromes or microdeletion syndromes), such as DiGeorge syndrome (OMIM#188400) caused by a deletion at 22q11.2, and Prader-Willi syndrome (OMIM#176270) and Angelman syndrome (OMIM#105830) caused by a deletion at 15q11.2. FISH can be performed directly on interphase nuclei, which eliminates the time-consuming cell culture procedure and extends its diagnostic application toward rapid screening of chromosomal and genomic abnormalities. In the following sections, the diagnostic applications of FISH technology are grouped into three main areas: prenatal screening and postnatal diagnosis of constitutional chromosomal abnormalities and submicroscopic pathogenic CNVs, identification and monitoring of acquired chromosomal abnormalities in hematopoietic and solid tumors, and the detection of infectious diseases caused by microbes and parasites.

### Detection of Constitutional Chromosomal Abnormalities and Pathogenic CNVs

A multiplex FISH panel with differentially labeled probes has been developed for prenatal screening of common aneuploidies involving gains or losses of chromosomes X, Y, 13, 18, and 21 (Ried et al., 1992; Ward et al., 1993). Pregnant women with a single indication or combined indications of advanced maternal age, abnormal ultrasound findings, or abnormal maternal serum screening have an increased risk of 4–30% for carrying numerical and structural chromosomal abnormalities; among these abnormalities, 84% were numerical abnormalities mostly detectable by the multiplex FISH panel, and 16% were structural abnormalities that required further microarray analysis (Li et al., 2011). For prenatal cases with cardiac anomalies detected by prenatal ultrasound examination, DiGeorge syndrome was detected by FISH. Recently, the application of non-invasive prenatal testing by massively parallel sequencing of cell-free fetal DNA in maternal blood significantly improved the accuracy of aneuploidy screening, which resulted in a 57% decline in invasive prenatal procedures and an increase in the diagnostic yield of chromosomal abnormalities (Xu Z. Y. et al., 2013; Meng et al., 2015). Despite these technological advances in prenatal diagnosis, the multiplex FISH panel is still used as an adjunctive assay for rapid detection of common aneuploidies. It should be noted that false positive or negative results as well as maternal cell contamination have been observed in prenatal FISH analysis. Therefore, an irreversible therapeutic action should not be initiated on the basis of FISH results alone. The current guideline recommends that clinical decisions be made based on two of three pieces of available information: FISH results, conventional cytogenetic analysis, and clinical information (Test and Technology Transfer Committee, 2000).
Furthermore, aneuploidies and polyploidies have been detected in about 50% of first trimester spontaneous abortions by chromosome analysis and in 35% of products of conception culture failure cases by microarray analysis; it is recognized that an extended FISH panel for chromosomes X/Y/18, 13/21, and 15/16/22 will detect all polyploidies, 84% of aneuploidies, and 69% of multiple aneuploidies causing miscarriages (Zhou et al., 2016).

Developmental delay, intellectual disabilities, and multiple congenital anomalies are present in 1–5% of newborns, and chromosome microarray analysis as the first-tier genetic test has detected a spectrum of cytogenomic abnormalities in 10–20% of these patients (Miller et al., 2010; Li et al., 2015). Analysis of abnormal findings from consecutive pediatric cases observed genomic disorders (microdeletion/microduplication syndromes), subtelomeric rearrangements, interstitial imbalances, chromosomal structural rearrangements, and aneuploidies in about 37, 26, 19, 10, and 8% of these cases, respectively (Xu et al., 2014). Cell-based FISH testing has been a cost-effective adjunctive assay to confirm microarray-detected genomic disorders and then to detect carrier status in a follow-up parental study. Microdeletions can be detected as a loss of one signal in metaphases and interphases, while microduplications can be detected as two "twin-spot" like signals in the interphase nuclei. For subtelomeric rearrangements, a complete set of subtelomeric FISH probes for all human chromosomes was developed (Ning et al., 1996) and has been used routinely as an adjunctive assay in visualizing cryptic and complex subtelomeric rearrangements (Li et al., 2006; Rossi et al., 2009). For many newly defined loci of genomic disorders and interstitial imbalances, there are no commercially available FISH probes. Therefore, "home-brew" targeted BAC clone FISH probes were used for these unique cases (Li et al., 2006; Khattab et al., 2011).

Structural rearrangements like ring chromosomes and small supernumerary marker chromosomes (sSMC) present not only segmental gains or losses but also a mosaic pattern due to their dynamic behavior in mitosis. As shown in **Figure 1A**, centromeric FISH probes are routinely used to track the changes from dicentric, tricentric, and tetracentric ring chromosomes to loss of the ring through mitosis. Subtelomeric and interstitial FISH probes have been used to define the intactness of the ring chromosome and the level of mosaicism (Zhang et al., 2004; Xu F. et al., 2013). A cytogenomic approach combining chromosome, FISH, and microarray analyses has been recommended for characterizing the genomic structure, mitotic instability, and mechanisms of ring formation for cases with a ring chromosome (Zhang et al., 2012). sSMC are extra centric chromosome fragments, usually in the form of an inverted duplication or a small ring chromosome, and are present in 0.043% of newborn children. Several sSMC have syndromic phenotypes, such as inv dup(22q11.2) for cat-eye syndrome (OMIM∗607576) and i(12p) for Pallister-Killian syndrome (OMIM#601803), while others like inv dup(15q) and i(18p) can have variable phenotypes (Liehr et al., 2004, 2006). About 30% of sSMC are derived from chromosome 15; the D15S10 or SNRPN probes are routinely used to assess inv dup(15q) (Wang et al., 2015). The euchromatic material in sSMC can be detected by microarray analysis. A set of pericentric core probes for each arm of human chromosomes has been validated for unambiguously characterizing the chromosomal origin of sSMC and the level of mosaicism (Castronovo et al., 2013).

### Identification and Monitoring of Acquired Chromosomal Abnormalities

The discovery of Philadelphia chromosome in chronic myeloid leukemia (CML) followed by the characterization of t(9;22)(q34;q11) with underlying ABL1/BCR gene fusions supported the causative role of chromosomal abnormalities in carcinogenesis and set the foundation for cancer cytogenetics (Mitelman et al., 2007). Cancer is considered a genetic disease at the cellular level resulting from either a progressive process or a one-off catastrophic event (Stephens et al., 2011; Li and Cui, 2016). The two main pathogenetic pathways for hallmarks of cancer development are the inactivation of tumor suppressor genes by deletions, mutations, miRNA upregulation, or epigenetic mechanisms, and the activation or deregulation of oncogenes as a consequence of point mutations, amplification or balanced cytogenetic abnormalities (Vogelstein and Kinzler, 2004; Hanahan and Weinberg, 2011). Recurrent chromosomal abnormalities including translocations, deletions, duplications, and gene amplifications associated with distinct tumor entities have been characterized; specifically designed FISH panels have been widely used in the diagnosis and monitoring of acquired chromosomal abnormalities in hematologic and solid tumors (Hu et al., 2014; Liehr et al., 2015; Mikhail et al., 2016).

Current guidelines recommend an integrated approach for cancer cytogenetic diagnosis (Wolff et al., 2007). In general, both conventional karyotyping and FISH testing are used for initial diagnosis and follow-up monitoring of clonal abnormalities. For hematopoietic and lymphoid tumors, the most commonly used FISH probes and disease-specific panels in a clinical cytogenetics laboratory are listed in **Table 1**. Results from a FISH panel offer a quick evaluation of targeted abnormal patterns and their percentage within the bone marrow cells or leukocytes. Chromosome analysis will then reveal the clonal abnormalities and clonal evolution. For leukemias requiring urgent treatment, such as acute promyelocytic leukemia (APL) caused by the t(15;17)(q24;q21) with underlying PML/RARA fusions, a rapid FISH result is required for the administration of all-trans retinoic acid (ATRA). Targeted therapy against the ABL1/BCR fusion protein by small molecule tyrosine kinase inhibitors like imatinib mesylate (Gleevec), dasatinib (Sprycel), and nilotinib (Tasigna) has increased the 10-year overall survival from 20 to 80–90% (Li et al., 2013). For many cryptic rearrangements undetectable by routine chromosome analysis, such as t(12;21)(p13;q22) with ETV6/RUNX1 gene fusions, t(4;14)(p16.3;q32) with FGFR3/IGH gene fusions, and deletions of 12p13 (ETV6), 13q14 (RB1), and 17p13 (TP53), FISH tests are considered a stand-alone diagnostic assay. Adjunctive use of FISH probes to further define ambiguous or hidden chromosomal abnormalities is required for many cases (Kamath et al., 2008; Massaro et al., 2011). Additionally, FISH is a sensitive and timely method to monitor, at the cellular level, residual disease with a known clonal abnormality and the outcome of bone marrow transplantation from a sex-mismatched donor.
Because some hematologic tumors are morphologically similar and their abnormalities may be missed by low-resolution karyotyping or may be present in only a low percentage of leukemic cells, FISH can be important for differential diagnosis between these diseases. For example, cyclin D1 (CCND1) translocation can be detected by FISH as a characteristic abnormality in mantle cell lymphoma, which differentiates it from the morphologically similar chronic lymphocytic leukemia (CLL). Furthermore, FISH for nuclear DNA can be combined with immunostaining of cytoplasmic markers for simultaneous identification of chromosomal abnormalities and cell types. For example, IGH translocation is present with high frequency in multiple myeloma and monoclonal gammopathy of undetermined significance (MM/MGUS) and is usually detected in plasma cells. In a two-step assay with first the hybridization of an IGH probe and then immunostaining by fluorescein isothiocyanate (FITC)-conjugated antibodies against κ- or λ-light chain, the FITC-stained cytoplasm and IGH break-apart signals within the nuclei were visualized in plasma cells simultaneously. This modified immuno-FISH was expected to improve the diagnostic accuracy, but its low sensitivity limited its application to follow-up studies (Boersma-Vreugdenhil et al., 2003).

### TABLE 1 | List of FISH panels and probes for hematopoietic and lymphoid tumors.

*DCE, dual-color enumerate; TCE, tri-color enumerate; DCBAP, dual-color break apart; DCDF, dual-color double fusion; CML, Chronic myeloid leukemia; MDS, Myelodysplastic syndrome; AML, Acute myeloid leukemia; CLL, Chronic lymphocytic leukemia; B-ALL, B-cell acute lymphocytic leukemia; T-ALL, T-cell acute lymphocytic leukemia; MM/MGUS, Multiple myeloma/Monoclonal gammopathy of undetermined significance; MPD, Myeloproliferative disorder. Shaded for recurrent abnormalities detected by a primary FISH panel, unshaded for secondary FISH probes for specific abnormalities. For references see (Hu et al., 2014; Liehr et al., 2015), and (Mikhail et al., 2016).*

FISH tests are widely used in various types of solid tumors. For example, FISH can define gene rearrangements in congenital fibrosarcoma with a novel complex translocation (Marino-Enriquez et al., 2008) and validate subclone markers in heterogeneous melanoma biopsies (Parisi et al., 2011). FISH results can be used to guide cancer treatment. For example, Herceptin-targeted therapy is effective against HER2-overexpressing breast cancer. For routine clinical specimens, immunohistochemistry, real-time polymerase chain reaction, and FISH were used to assess the HER2 protein level, RNA expression, and DNA copy numbers, respectively. Among these methods, FISH offered a cell-based evaluation of the ratio of HER2 gene copy number to the number of copies of chromosome 17 (HER2/CEP17 ratio). The FISH scoring criteria for the HER2/CEP17 ratio and the interpretive guidelines were reported (Hicks et al., 2005). Many targeted therapies for recurrent translocations in various types of solid tumors have either been approved by the FDA or are under clinical trials. For example, lapatinib, sorafenib, sunitinib, temsirolimus, and pazopanib have been used for papillary renal cell carcinoma with translocations involving the TFE3 gene at Xp11.2; cixutumumab and mithramycin are in phase II clinical trials for Ewing sarcoma with translocation involving the EWSR1 gene at 22q12 (Li et al., 2013). FISH assays using probes for specific recurrent translocations from different solid tumors could guide effective targeted therapy. FISH tests were also used to evaluate sperm aneuploidy frequencies before and after chemotherapy in patients with testicular cancer and Hodgkin's lymphoma; significantly increased frequencies of aneuploidies for a duration of up to 24 months were noted (De Mas et al., 2001; Tempest et al., 2008). It was recommended that genetic counseling about potentially increased reproductive risk from chemotherapy be offered to cancer patients.
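The HER2/CEP17 ratio mentioned above can be sketched as a simple calculation over per-nucleus signal counts. The code below assumes the ratio is taken as total HER2 signals divided by total CEP17 signals over all scored nuclei; the signal counts are invented, and the cutoff passed to `is_amplified` is a placeholder that must come from the scoring guideline in use, not a value asserted here.

```python
def her2_cep17_ratio(nuclei):
    """Compute the HER2/CEP17 ratio from scored nuclei.
    `nuclei` is a list of (her2_signals, cep17_signals) tuples, one per nucleus."""
    total_her2 = sum(her2 for her2, _ in nuclei)
    total_cep17 = sum(cep17 for _, cep17 in nuclei)
    if total_cep17 == 0:
        raise ValueError("no CEP17 signals were counted")
    return total_her2 / total_cep17

def is_amplified(ratio, cutoff):
    """Compare the ratio with a guideline-defined cutoff (supplied by the caller)."""
    return ratio >= cutoff

# Hypothetical signal counts from five nuclei of one specimen.
scored = [(8, 2), (10, 2), (6, 3), (12, 2), (9, 1)]
ratio = her2_cep17_ratio(scored)  # 45 HER2 signals / 10 CEP17 signals

# The cutoff of 2.0 is an illustrative placeholder only.
amplified = is_amplified(ratio, cutoff=2.0)
```

Normalizing to CEP17 is what makes the measurement cell-based: it is intended to separate a gain of the HER2 locus from a gain of the whole chromosome 17.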

### Detection of Infectious Diseases by FISH

The majority of FISH probes target specific chromosomal and genomic abnormalities in the human genome. However, rapid phylogenetic identification of single microbial cells has also been achieved using fluorescently labeled oligonucleotides complementary to 16S ribosomal RNA (rRNA) (DeLong et al., 1989). Some segments of the 16S rRNA are invariant in all organisms, but phylogenetic group-specific 16S rRNA segments can be used as targets for oligonucleotide FISH probes (17–34 nucleotides in length) to identify infectious agents in clinical samples. For example, FISH probes complementary to specific sequences of 16S rRNA can detect malaria infection in blood samples. The Plasmodium Genus (P-Genus) FISH assay uses Plasmodium genus-specific probes that detect all five species of Plasmodium known to cause the disease in humans. The sensitivity of this FISH assay is better than that of the Giemsa staining method. An LED light source may be used to read the FISH result, which can extend the clinical application of FISH, especially in resource-limited areas. Because rRNA has a short half-life and is present in many copies in a living organism, FISH should be performed on live pathogens (Shah et al., 2015).

### SINGLE-CELL DNA STRUCTURAL AND RNA TRANSCRIPTIONAL ANALYSES

FISH assays using locus-specific and regional painting probes are still a powerful tool for visualizing simple and complex chromosomal and genomic rearrangements. Fiber-FISH, in which locus-specific BAC clone probes within a 900 Kb 17q12 inversion were hybridized onto stretched DNA fibers, correlated the inversion orientations with associated haplotypes, which allowed the evaluation of inversion frequencies among human populations globally (Donnelly et al., 2010). Pericentromeric heterochromatin probes were used in three-dimensional FISH (3D-FISH) to study intra-nuclear centromeric positions in cultured cells from patients with ICF syndrome (immunodeficiency, centromeric region instability, facial anomalies) and Roberts syndrome (cohesion defect caused by mutations in the ESCO2 gene) (Dupont et al., 2012, 2014). Multi-color FISH (M-FISH) using painting probes specific for each human chromosome and multi-color banding FISH (M-BAND) using painting probes specific for every band in a chromosome were used to visualize complex chromosomal rearrangements from chromothripsis in two patients with acute myeloid leukemia (Mackinnon and Campbell, 2013). Chromothripsis is seen as regional clustering of breakpoints and regular oscillation of copy-number states by microarray analysis, and as heterogeneous staining regions, marker or ring chromosomes, and other undefinable rearrangements by chromosome analysis (Stephens et al., 2011). Selected FISH probes targeting the oscillating copy-number gains and losses could be used to monitor abnormal clones with chromothripsis.

FISH technology has made significant progress with the innovation of novel labeling methods and the introduction of super-resolution imaging systems for fine mapping of intranuclear genomic structures and for single-cell, single-molecule profiling of cytoplasmic RNA transcription. Recently, a novel FISH method using the nuclease-deficient clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated protein 9 (dCas9) system was developed. The initial design used enhanced green fluorescent protein (EGFP)-tagged dCas9 and a small guide RNA (sgRNA) targeting repetitive telomere sequences, or sgRNAs tiling along a non-repetitive genomic sequence at the MUC4 locus. This method enabled the visualization of intra-nuclear locations and dynamics of telomeres and MUC4 loci during mitosis in living human cells (Chen et al., 2013). A further modification using both fluorophore-coupled sgRNA and fluorophore-coupled dCas9 was termed Cas9-mediated FISH (CASFISH); rapid and robust labeling of repetitive DNA elements in the pericentromere, centromere, G-rich telomere, and MUC4 gene by CASFISH was demonstrated (**Figure 2A**; Deng et al., 2015). CASFISH does not require denaturation of the targeted DNA and therefore preserves the natural spatiotemporal organization of the nucleus. The CASFISH process is remarkably rapid (within 1 h) and can be used directly on fixed tissues or living cells. However, using tiling sgRNAs for single-copy gene regions could result in low labeling efficiency and higher background. Further optimization of the CASFISH technology is needed before its application in basic research and genetic diagnosis.

A synthesized library of primary single-stranded oligonucleotides targeting a single-copy region of the genome, together with fluorophore-coupled secondary oligonucleotides complementary to a portion of the primary oligonucleotides, was developed for so-called oligopaint FISH (Beliveau et al., 2015). Co-hybridization of a set of hundreds to thousands of primary fluorophore-coupled oligopaint probes (30–42 bases complementary to the targeted genomic region plus a 14–32 base extension for the secondary oligonucleotide) with fluorophore-coupled secondary oligonucleotides (14–32 bases) can visualize regions of 52 Kb–3 Mb in nuclei with a 96–100% hybridization efficiency. Oligopaint FISH probes designed with one fluorophore for specified single nucleotide polymorphisms (SNPs) in a targeted region of one chromosome and another fluorophore for these SNPs in the homologous chromosome enabled differential labeling of the two homologous chromosomes. Stochastic optical reconstruction microscopy (STORM) was used for single-molecule super-resolution imaging. Therefore, with prior information on the specific SNP alleles of the two homologous chromosomes, oligopaint FISH achieved in situ haplotyping of paternal and maternal chromosomes (**Figure 2B**). The oligopaint probes are chosen bioinformatically to avoid repetitive DNA sequences, and they can be designed for any organism whose genome has been sequenced. With further improvement in signal pattern recognition at the SNP loci, oligopaint FISH should enable direct analysis of fine-scale chromatin structure, differential visualization of homologous chromosomes, and allele-specific studies of gene expression.

RNA FISH is a cell-based technique for detecting mRNA transcripts. With the advance of various methods for signal amplification and super-resolution imaging, single-molecule RNA FISH (smRNA-FISH) techniques have been developed. Several approaches, including branched DNA probes, tyramide signal amplification, quantum dots, and padlock-rolling circle amplification (RCA), have been used for signal enhancement (Kwon, 2013). Of these, RCA is the only method capable of distinguishing single-nucleotide allelic changes in transcripts. Briefly, reverse transcription was performed in situ on cells and tissue sections to generate complementary DNA (cDNA), the mRNA was degraded by ribonuclease H, and padlock probes were then hybridized to the targeted cDNA, with their 5′ and 3′ arms circularized by T4 DNA ligase. The circularized padlock probes served as templates for RCA by Φ29 DNA polymerase, after which fluorophore-coupled oligonucleotide probes specific for each padlock probe could be hybridized and visualized (**Figure 2C**; Larsson et al., 2010). To increase the capacity for multiplex detection of different mRNA molecules in single cells, combinatorial labeling and optical super-resolution microscopy were used to measure mRNA levels of 32 genes simultaneously in single Saccharomyces cerevisiae cells (Lubeck and Cai, 2012). A further modification introduced a sequential barcoding scheme for multiplex quantitation of different mRNAs (Lubeck et al., 2014). In this scheme, the mRNAs in cells were barcoded by sequential rounds of hybridization, imaging, and probe stripping (**Figure 2D**). Theoretically, the multiplexing capacity scales up quickly as the number of fluorophores and rounds of hybridization increases. In practice, the available fluorophores are limited and each round of hybridization causes some loss of RNA integrity in the tested cells.

Various smRNA-FISH methods have been used to image cell-type-specific RNA profiles and subcellular localization patterns of mRNAs in in vitro cellular systems (Ronander et al., 2012; Lalmansingh et al., 2013; Shaffer et al., 2013; Sinnamon and Czaplinski, 2014) and in model animals such as Drosophila (Zimmerman et al., 2013), Caenorhabditis elegans (Bolková and Lanctôt, 2015), and zebrafish (Hauptmann et al., 2016). Additionally, smRNA-FISH has been used to study the subcellular localization and cell-to-cell variability of long non-coding RNAs (lncRNAs); systematic quantification and categorization based on subcellular localization patterns were achieved for a representative set of 61 lncRNAs in three different cell types (Cabili et al., 2015). Knowledge of lncRNA subcellular localization patterns is essential for understanding their biological functions. An interesting application of smRNA-FISH is the study of nuclear RNA foci in genetic diseases resulting from the expansion of tri-, tetra-, penta-, and hexa-nucleotide repeats; a detailed protocol was reported for detecting mRNAs containing expanded CAG and CUG repeats in fibroblasts, lymphoblasts, and induced pluripotent stem cells (Urbanek and Krzyzosiak, 2016).

Simultaneous detection of mRNA and protein quantity and their subcellular distribution in single cells, by combining an RNase-free modification of the immunofluorescence (IF) technique with the smRNA-FISH method, revealed a direct interaction of the RNase MCPIP1 with IL-6 mRNA (Kochan et al., 2015). Real-time live imaging using a laser-scanning confocal microscope with photon-counting detectors for quantitative studies of transcription in cultured cells and model animals has been achieved with smRNA-FISH and a GFP-tagged reporter gene for RNA polymerase (Gregor et al., 2014). Using the Drosophila embryo as a testing system, smRNA-FISH revealed stochastic transcriptional activity of four critical patterning genes and co-packaging of transcripts as multicopy heterogeneous granules delivered to selected subcellular domains (Little et al., 2013, 2015). These results indicated that there are conserved mechanisms of precise mRNA transcription and localization for spatiotemporal control of protein synthesis in regulating cellular and embryonic development.

TABLE 2 | FISH applications in genetic diagnosis and research.

| Genetic diagnosis | References | Research applications | References |
|---|---|---|---|
| **Constitutional chromosomal and genomic abnormalities** | | **Analysis of complex chromosomal rearrangements** | |
| Rapid screening of common aneuploidies | Ried et al., 1992 | Mapping breakpoints and genomic orientation | Donnelly et al., 2010 |
| Detection of microdeletion/microduplication syndromes | Wei et al., 2013 | The study of 3D chromosomal structures | Dupont et al., 2012 |
| Characterization of subtelomeric rearrangements | Ning et al., 1996 | Define complex rearrangements | Mackinnon and Campbell, 2013 |
| Analysis of supernumerary marker and ring chromosomes | Zhang et al., 2012 | **Characterizing nuclear genomic structures** | |
| **Somatic recurrent chromosomal abnormalities** | | Spatiotemporal organization of centromeres/telomeres | Chen et al., 2013 |
| Detection of translocations, deletions, duplications/amplifications | Hu et al., 2014 | Chromatin interaction during cell cycle | Deng et al., 2015 |
| Monitoring disease progression and clonal evolution | Mikhail et al., 2016 | *In situ* chromosome haplotyping | Beliveau et al., 2015 |
| Assessment of sex-mismatch bone marrow transplantation | Liehr et al., 2015 | **Profiling RNA transcription and localization** | |
| **Infectious diseases** | | Quantitation of multiplex mRNAs in single cells | Lubeck et al., 2014 |
| Detection of malaria by 16S rRNA | Shah et al., 2015 | Subcellular localization of mRNAs and non-coding RNAs | Cabili et al., 2015 |

### CONCLUSIONS AND FUTURE DIRECTIONS

In summary, FISH has a wide spectrum of diagnostic and research applications, as shown in **Table 2**. FISH has the advantage that it can be used on metaphase chromosomes and interphase nuclei, and thus offers a cell-based genetic diagnosis complementary to DNA-based molecular testing (Xu and Li, 2013). FISH has been used in adjunctive and diagnostic assays for both constitutional and somatic cytogenomic abnormalities. FISH analysis of uncultured interphase cells from amniotic fluid or chorionic villus samples is a standard procedure for rapid prenatal testing of common aneuploidies and genomic disorders, which alleviates much anxiety for patients and physicians. The use of interphase FISH has been particularly fruitful in cancer cytogenetics, where the detection of recurrent chromosomal abnormalities and clonal evolution is crucial for classifying different types of tumors, selecting treatment protocols, and monitoring outcomes. Even with the introduction of genomic technologies such as microarray analysis and exome sequencing, FISH analysis will remain an integral part of genetic diagnosis (Parisi et al., 2012; Wei et al., 2013; Martin and Warburton, 2015). Microfluidic devices for miniaturized and automated FISH applications are currently under development (Vedarethinam et al., 2010; Kwasny et al., 2012; Kao et al., 2015). The validation of these devices in the near future and the availability of more disease-specific probes will further enhance and expand diagnostic FISH applications.

Novel FISH techniques and super-resolution imaging systems have been introduced to study the spatiotemporal changes of intra-nuclear genomic organization and to profile cytoplasmic RNA. FISH techniques such as CASFISH, oligopaint FISH, and smRNA-FISH have been developed mainly for genetic research applications. A current trend in FISH is toward simultaneous single-cell measurement of DNA, RNA, cell surface proteins, and intracellular proteins (Lai et al., 2016; Soh et al., 2016). The translation of these single-molecule, single-cell FISH techniques into cell-based genetic diagnosis is expected to improve the analytical resolution and capacity for a spectrum of genetic defects, from chromosomal and genomic abnormalities to epigenetic aberrations.

### AUTHOR CONTRIBUTIONS

CC drafted the cell-based genetics diagnosis by FISH. WS drafted the single cell DNA structural and RNA transcriptional analysis by FISH. PL organized, modified, and edited the manuscript. We would like to thank Audrey Meusel for proofreading and editing this manuscript.

### ACKNOWLEDGMENTS

Funding from China Scholarship Council to WS (Project No. 201308455012) supported part of this study.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Cui, Shu and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Single-Cell in Situ RNA Analysis With Switchable Fluorescent Oligonucleotides

### Lu Xiao and Jia Guo\*

*Biodesign Institute and School of Molecular Sciences, Arizona State University, Tempe, AZ, United States*

Comprehensive RNA analyses in individual cells in their native spatial contexts promise to transform our understanding of normal physiology and disease pathogenesis. Here we report a single-cell *in situ* RNA analysis approach using switchable fluorescent oligonucleotides (SFO). In this method, transcripts are first hybridized by pre-decoding oligonucleotides. These oligonucleotides subsequently recruit SFO to stain their corresponding RNA targets. After fluorescence imaging, all the SFO in the whole specimen are simultaneously removed by DNA strand displacement reactions. Through continuous cycles of target staining, fluorescence imaging, and SFO removal, a large number of different transcripts can be identified by unique fluorophore sequences and visualized at the optical resolution. To demonstrate the feasibility of this approach, we show that the hybridized SFO can be efficiently stripped by strand displacement reactions within 30 min. We also demonstrate that this SFO removal process maintains the integrity of the RNA targets and the pre-decoding oligonucleotides, and keeps them hybridized. Applying this approach, we show that transcripts can be restained in at least eight hybridization cycles with high analysis accuracy, which theoretically would enable the whole transcriptome to be quantified at the single molecule sensitivity in individual cells. This *in situ* RNA analysis technology will have wide applications in systems biology, molecular diagnosis, and targeted therapies.

Keywords: transcriptomics, genomics, fluorescence in situ hybridization, strand displacement reactions, RNA expression, oligonucleotides, fluorescent probes, single-cell

### INTRODUCTION

The ability to profile a large number of distinct transcripts in single cells in situ is crucial for our understanding of cancer, neurobiology, and stem cell biology (Crosetto et al., 2014). The differences between individual cells in complex biological systems may have significant consequences in the function and health of the entire systems. Thus, single cell analysis is required to explore such cell heterogeneity. Due to the inherent complexity of gene expression regulatory networks, comprehensive molecular profiling is required to systematically infer the functions and interactions of different RNA species. The precise location of cells in a tissue and transcripts in a cell is critical for effective cell-cell interactions and gene expression regulation, which can determine cell fates and functions. Therefore, to fully understand the organization, regulation, and function of a heterogeneous biological system, highly multiplexed single-cell in situ RNA analysis is critically needed.

### Edited by:

*Xinghua Victor Pan, Yale University, United States*

### Reviewed by:

*Jeffrey C. Petruska, University of Louisville, United States Saurabh Chattopadhyay, University of Toledo, United States*

> \*Correspondence: *Jia Guo jiaguo@asu.edu*

### Specialty section:

*This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology*

> Received: *08 December 2017* Accepted: *26 March 2018* Published: *11 April 2018*

### Citation:

*Xiao L and Guo J (2018) Single-Cell in Situ RNA Analysis With Switchable Fluorescent Oligonucleotides. Front. Cell Dev. Biol. 6:42. doi: 10.3389/fcell.2018.00042*

Next-generation sequencing (Guo et al., 2010; Metzker, 2010) and microarray technologies (Hoheisel, 2006) have been widely used to study gene expression regulation in health and disease by profiling RNA on a genome-wide scale. However, as transcripts are extracted, purified and then analyzed in these approaches, the RNA location information is lost. Imaging-based methods, such as molecular beacons (Guo et al., 2012; Huang and Martí, 2012), templated fluorescence activation probes (Franzini and Kool, 2009), and fluorescence in situ hybridization (FISH) (Raj et al., 2008), allow transcripts to be quantified in their native spatial contexts in single cells. Nonetheless, due to the spectral overlap of commonly available fluorophores, these methods can only detect a handful of different RNA species in one sample.

To enable comprehensive single-cell in situ RNA analysis, several approaches have been investigated. For instance, in situ sequencing (Ke et al., 2013; Lee et al., 2014) has been explored to enable transcriptome profiling in individual cells. However, this method has limited detection efficiency and may miss low-expression transcripts. Combinatorial labeling (Levsky et al., 2002; Lubeck and Cai, 2012; Levesque and Raj, 2013) and reiterative hybridization (Xiao and Guo, 2015; Guo, 2016; Shaffer et al., 2017; Mondal et al., 2018) offer single-molecule detection sensitivity, but these approaches suffer from limited multiplexing capacities. Recently, sequential hybridization (Lubeck et al., 2014; Shah et al., 2016) and multiplexed error-robust fluorescence in situ hybridization (MER-FISH) (Chen et al., 2015; Moffitt et al., 2016a,b) have been developed for highly multiplexed single-molecule RNA detection. In these methods, the same RNA molecules are stained in different analysis cycles, so the fluorescence signals must be removed at the end of each cycle. Several signal removal approaches have been explored, including probe degradation by DNase, photobleaching, and disulfide-based chemical cleavage. Nevertheless, probe degradation by DNase is limited by its low signal removal efficiency. In addition, DNase removes all the probes, including the large oligonucleotide library hybridized to the RNA targets. Consequently, this expensive oligonucleotide library has to be re-hybridized in every analysis cycle, which increases the assay time and cost. Photobleaching erases fluorescence signals in different imaging areas sequentially. As a result, it is less time-effective and has low sample throughput. The disulfide-based probes can cross-react with endogenous thiol groups and with the thiol groups generated by fluorophore cleavage in previous cycles, which leads to high background and false positive signals.

Here, we report a single-cell in situ RNA analysis approach using switchable fluorescent oligonucleotides (SFO). In this method, RNA molecules are first hybridized by pre-decoding oligonucleotides, which subsequently recruit SFO to stain their RNA targets. After imaging, SFO are removed by strand displacement reactions. Upon continuous cycles of target staining, fluorescence imaging, and SFO removal, varied RNA species are identified by unique fluorophore sequences at the optical resolution. To demonstrate the feasibility of this approach, we show that the hybridized SFO can be efficiently removed by strand displacement reactions within the cellular environment in 30 min. We also demonstrate that this probe removal process maintains the RNA integrity and keeps the pre-decoding oligonucleotides hybridized to their RNA targets. Additionally, we show that RNA can be quantified with high accuracy in at least eight continuous hybridization cycles, which theoretically would allow the whole transcriptome to be profiled in individual cells in situ.

### MATERIALS AND METHODS

### General Information

Chemicals and solvents were purchased from Sigma-Aldrich or Ambion and were used without further purification, unless otherwise noted. Bioreagents were purchased from Invitrogen, unless otherwise indicated.

### Cell Culture

HeLa CCL-2 cells (ATCC) were maintained in Dulbecco's modified Eagle's Medium supplemented with 10% fetal bovine serum, 10 U mL<sup>−1</sup> penicillin and 100 µg mL<sup>−1</sup> streptomycin in a humidified atmosphere at 37 °C with 5% CO<sub>2</sub>. Cells were plated on chambered coverglass (Thermo Scientific) and allowed to reach 60% confluency in 1–2 days.

### Cell Fixation

Cultured HeLa CCL-2 cells were first washed with 1X PBS at room temperature for 5 min, fixed with fixation solution [4% formaldehyde (Polysciences) in 1X PBS] at room temperature for 10 min, and subsequently washed another 2 times with 1X PBS at room temperature, each for 5 min. The fixed cells were then permeabilized with 70% (v/v) EtOH at 4 °C for at least overnight.

### Probe Design

The pre-decoding probes, each 70 nt in length, contain three 20 nt sequences: (i) a target-binding sequence for in situ hybridization to the target RNA, and (ii) two repeated readout sequences for decoding hybridization. The three sequences are separated from each other by a 5T spacer. The target-binding sequences were designed with the Stellaris Probe Designer provided by Biosearch Technologies. The sequences of the pre-decoding probes are provided in **Table S1**.

The decoding probe (SFO), 40 nt in length, contains two 20 nt sequences: (i) a binding sequence complementary to the readout sequence of the pre-decoding probes, and (ii) a toehold sequence for strand displacement reactions. The decoding probe is conjugated to fluorophores through a 5′-amino modification. The sequence of the decoding probe is provided in **Table S1**.

The eraser oligonucleotide, 40 nt in length, is complementary to the decoding probe. The sequence of the eraser oligonucleotide is provided in **Table S1**.

The SFO-orthogonal oligonucleotide, 40 nt in length, is conjugated to fluorophores through a 5′-amino modification. The sequence of the SFO-orthogonal oligonucleotide is provided in **Table S1**.
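The length bookkeeping of this probe architecture can be checked with a short sketch. The sequences below are hypothetical placeholders rather than the sequences in Table S1, and the 5′ to 3′ order of the segments (target-binding sequence first, binding sequence before toehold) is an assumption made only for illustration.

```python
# Sketch of the probe architecture; all sequences are hypothetical placeholders.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
SPACER = "T" * 5  # the 5T spacer separating the 20 nt segments


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def pre_decoding_probe(target_binding: str, readout: str) -> str:
    """Target-binding sequence plus two readout repeats, joined by 5T spacers."""
    if len(target_binding) != 20 or len(readout) != 20:
        raise ValueError("each segment is expected to be 20 nt")
    # Segment order is an assumption of this sketch.
    return target_binding + SPACER + readout + SPACER + readout


def decoding_probe(readout: str, toehold: str) -> str:
    """Binding sequence (complementary to the readout) followed by a toehold."""
    if len(readout) != 20 or len(toehold) != 20:
        raise ValueError("each segment is expected to be 20 nt")
    return reverse_complement(readout) + toehold


def eraser(decoding: str) -> str:
    """Oligonucleotide fully complementary to the decoding probe."""
    return reverse_complement(decoding)


target_binding = "ACGTTGCAAGCTTAGGCTCA"  # hypothetical
readout = "GATTACAGGCTTCAGTACCG"         # hypothetical
toehold = "TCGGATACCTAGCATGGAAC"         # hypothetical

pre = pre_decoding_probe(target_binding, readout)
dec = decoding_probe(readout, toehold)
era = eraser(dec)
print(len(pre), len(dec), len(era))
```

The three lengths follow from the segment sizes: 20 + 5 + 20 + 5 + 20 nt for the pre-decoding probe, and 20 + 20 nt for the decoding probe and its eraser.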

To further ensure specificity, all the sequences above were screened against the human transcriptome using the Basic Local Alignment Search Tool (BLAST) (Camacho et al., 2009) to ensure that there were no more than 10 nt of homology. Sequence alignment tests were also performed by BLAST among these sequences to ensure that there were no more than 8 nt of homology between them.
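The pairwise screen within the probe set can be approximated by a simple check for the longest run of identical consecutive bases shared by two sequences. This is a simplified stand-in for the BLAST alignment described above, not the procedure used here: it considers exact contiguous matches on the given strand only, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous base of both sequences.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def flag_pairs(seqs: Dict[str, str], max_run: int = 8) -> List[Tuple[str, str, int]]:
    """Return the pairs of sequences sharing a run longer than max_run."""
    flagged = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        run = longest_common_run(seq_a, seq_b)
        if run > max_run:
            flagged.append((name_a, name_b, run))
    return flagged


# Toy sequences (hypothetical, shortened for readability).
probes = {
    "p1": "AAAACCCCGGGGTTTT",
    "p2": "TTCCCCGGGGAA",
    "p3": "GACCCCGGGGTCA",
}
flagged = flag_pairs(probes, max_run=8)
print(flagged)
```

The default `max_run=8` mirrors the 8 nt limit stated above for comparisons among the probe sequences; a screen against the transcriptome would use the 10 nt limit and a proper alignment tool.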

### Probe Preparation

Pre-decoding oligonucleotides belonging to one library (IDT) were mixed and then stored as pre-decoding probe stock solution (10 mM in 0.01X Tris EDTA, pH 8.0) at 4 °C.

The 5′-amino modified decoding probe or the SFO-orthogonal oligonucleotide (IDT), at a scale of 1 nmol, was dissolved in 3 µL of nuclease-free water. To this solution were added sodium bicarbonate aqueous solution (1 M, 3 µL) and Cy3 (AAT Bioquest) or Cy5 (AAT Bioquest) in DMF (20 mM, 5 µL). The mixture was incubated at room temperature for 2 h and then purified using a nucleotide removal kit (Qiagen). The fluorophore-conjugated oligonucleotides were subsequently purified via HPLC (Agilent) equipped with a C18 column (Agilent) and a dual wavelength detector set to detect DNA absorption (260 nm) and fluorophore absorption (555 nm for Cy3, 650 nm for Cy5). For the gradient, triethyl ammonium acetate (Buffer A) (0.1 M, pH 6.5) and acetonitrile (Buffer B) (pH 6.5) were used, ranging from 7 to 30% Buffer B over the course of 30 min, then at 70% Buffer B for 10 min, followed by 7% Buffer B for another 10 min, all at a flow rate of 1 mL min<sup>−1</sup>. The collected fraction was then dried in a Savant SpeedVac Concentrator and stored as decoding probe stock solution or SFO-orthogonal oligonucleotide stock solution at 4 °C in 100 µL of 0.01X Tris EDTA (pH 8.0).

The eraser oligonucleotide was dissolved and stored as displacement stock solution (10 mM in 0.01X Tris EDTA, pH 8.0) at 4 °C.

### Pre-decoding Hybridization

To 100 µL of pre-decoding hybridization buffer (100 mg mL<sup>−1</sup> dextran sulfate, 1 mg mL<sup>−1</sup> Escherichia coli tRNA, 2 mM vanadyl ribonucleoside complex, 20 µg mL<sup>−1</sup> bovine serum albumin, and 10% formamide in 2X SSC) was added 1 µL of pre-decoding probe stock solution. Then the mixture was vortexed and centrifuged to obtain pre-decoding hybridization solution.

HeLa CCL-2 cells after fixation and permeabilization were first incubated with wash buffer (2 mM vanadyl ribonucleoside complex and 10% formamide in 2X SSC) for 5 min at room temperature, then incubated with 100 µL of pre-decoding hybridization solution at 37 °C overnight. Cells were then washed three times with wash buffer, each for 30 min, at 37 °C.

Cells were then post-fixed with post-fixation solution [4% formaldehyde (Polysciences) in 2X SSC] at room temperature for 10 min, and subsequently washed another three times with 2X SSC at room temperature, each for 5 min.

### Decoding Hybridization

To 100 µL of decoding hybridization buffer (100 mg mL<sup>−1</sup> dextran sulfate, 2 mM vanadyl ribonucleoside complex, and 10% formamide in 2X SSC) was added 5 µL of decoding probe stock solution with or without 5 µL of SFO-orthogonal oligonucleotide stock solution. Then the mixture was vortexed and centrifuged to obtain decoding hybridization solution.

Cells labeled with pre-decoding probes were directly incubated with 100 µL of decoding hybridization solution at 37 °C for 30 min, and washed once with wash buffer at 37 °C for 30 min. After incubation with GLOX buffer (0.4% glucose and 10 mM Tris HCl in 2X SSC) for 1–2 min at room temperature, the stained cells were imaged in GLOX solution (0.37 mg mL<sup>−1</sup> glucose oxidase and 1% catalase in GLOX buffer).

### Displacement of Decoding Probes

To 100 µL of displacement buffer (100 mg mL<sup>−1</sup> dextran sulfate, 2 mM vanadyl ribonucleoside complex, and 10% formamide in 2X SSC) was added 5 µL of displacement stock solution. Then the mixture was vortexed and centrifuged to obtain displacement solution.

Cells after imaging were incubated with 100 µL of displacement solution at 37 °C for 30 min and washed 3 times with 1X PBS at 37 °C, each for 15 min, followed by the next cycle of decoding hybridization.

### Imaging and Data Analysis

Cells were imaged under a Nikon Ti-E epifluorescence microscope equipped with a 100X objective, using a 5 µm range and 0.3 µm z spacing. Images were captured using a CoolSNAP HQ2 camera and NIS-Elements Imaging software. Chroma filters 49004 and 49009 were used for Quasar 579 and Cy5, respectively.

Fluorescent spots in each hybridization cycle were identified and localized by SpotDetector (Olivo-Marin, 2002). For the detected FISH spots, the intensities in the Cy3 and Cy5 channels were compared to determine the color of each spot. Raw images of the same cells in different cycles of hybridization were aligned to the coordinate system established by the images collected in the first cycle of hybridization, based on one specific spot reappearing in each cycle. Spots in the first hybridization cycle located less than 2 pixels (320 nm) from a spot in the second hybridization cycle were extracted as barcodes, each of which corresponded to a potential mRNA molecule. Spots in the following hybridization cycles located less than 2 pixels (320 nm) from a barcode were identified as reappearances of that barcode. The barcode reappearance percentage in each hybridization cycle was then calculated.
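The distance-based matching described above can be sketched as follows. This is a minimal illustration that assumes the spot coordinates are already aligned and given in pixels; the function names and example coordinates are hypothetical, and the sketch stands in for neither SpotDetector nor the alignment step.

```python
import math
from typing import List, Sequence, Tuple

Spot = Tuple[float, float]  # (x, y) in pixels, already aligned to cycle 1

MAX_DIST_PX = 2.0  # 2 pixels, i.e. 320 nm as stated in the text


def has_neighbor(spot: Spot, spots: Sequence[Spot], max_dist: float = MAX_DIST_PX) -> bool:
    """True if any spot in `spots` lies less than max_dist from `spot`."""
    x, y = spot
    return any(math.hypot(x - u, y - v) < max_dist for u, v in spots)


def extract_barcodes(cycle1: Sequence[Spot], cycle2: Sequence[Spot]) -> List[Spot]:
    """Cycle-1 spots that colocalize with a cycle-2 spot are kept as barcodes."""
    return [s for s in cycle1 if has_neighbor(s, cycle2)]


def reappearance_percentage(barcodes: Sequence[Spot], cycle_spots: Sequence[Spot]) -> float:
    """Percentage of barcodes that colocalize with a spot in a later cycle."""
    if not barcodes:
        return 0.0
    hits = sum(1 for b in barcodes if has_neighbor(b, cycle_spots))
    return 100.0 * hits / len(barcodes)


# Hypothetical spot coordinates (pixels) for three hybridization cycles.
cycle1 = [(10.0, 10.0), (50.0, 50.0), (80.0, 20.0), (30.0, 70.0)]
cycle2 = [(10.5, 10.2), (50.3, 49.6), (80.1, 21.0), (90.0, 90.0)]
cycle3 = [(10.2, 9.7), (50.0, 53.0), (79.5, 20.5)]

barcodes = extract_barcodes(cycle1, cycle2)
pct_cycle3 = reappearance_percentage(barcodes, cycle3)
print(len(barcodes), round(pct_cycle3, 1))
```

In this toy example, three of the four cycle-1 spots are retained as barcodes, and two of those three reappear in the third cycle.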

### RESULTS

### Platform Design

In this SFO-based RNA profiling approach (**Figure 1**), each RNA target is first hybridized with a set of non-fluorescent pre-decoding oligonucleotides with varied target-binding sequences. These oligonucleotides also carry one or more binding sequences for decoding oligonucleotides, which recruit SFO as decoding probes. Each subsequent analysis cycle consists of three steps. First, SFO are hybridized to the pre-decoding probes to stain the RNA targets. Second, fluorescence images are acquired, with each RNA molecule visualized as a single spot. Finally, oligonucleotide erasers, which are perfectly complementary to SFO, are applied to remove SFO by strand displacement reactions (Zhang and Seelig, 2011). These oligonucleotide erasers hybridize to the toehold on SFO, branch migrate and release SFO from the pre-decoding probes. Through reiterative cycles of target staining, fluorescence imaging and SFO release, each transcript is identified by a fluorescence sequence barcode. With M fluorophores applied in each cycle and N sequential cycles, a total of M<sup>N</sup> RNA species can be quantified in single cells in situ.

FIGURE 1 | Highly multiplexed single-cell *in situ* RNA analysis with SFO. (A) Each transcript is first hybridized with a set of pre-decoding probes, which have varied target-binding sequences to hybridize to the different regions on the target RNA and the shared decoding sequence to recruit SFO as decoding probes. After imaging, the hybridized SFO is removed by strand displacement reactions. Through reiterative cycles of SFO hybridization, fluorescence imaging and strand displacement, the target RNA is sequentially stained by a set of SFO labeled with varied fluorophores. (B) Schematic diagram of the N cycles of hybridization images. In each cycle, each transcript is visualized as a single spot with a specific color. (C) As RNA molecules remain in place during different hybridization cycles, different RNA species can be identified by the unique color sequences.
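The M<sup>N</sup> coding capacity stated in the platform design can be made concrete with a short Python sketch. The function names are hypothetical and the color labels are only examples. The sketch counts and enumerates the possible color sequences and does not model error correction or probe design.

```python
from itertools import product


def barcode_capacity(m_fluorophores, n_cycles):
    """Number of distinct color sequences with M colors per cycle over N cycles."""
    return m_fluorophores ** n_cycles


def enumerate_barcodes(fluorophores, n_cycles):
    """List every possible color sequence, one color per hybridization cycle."""
    return ["-".join(seq) for seq in product(fluorophores, repeat=n_cycles)]


# Two colors over three cycles give 2**3 = 8 sequences.
codes = enumerate_barcodes(["Cy3", "Cy5"], 3)
capacity_two_colors = barcode_capacity(2, 8)
capacity_four_colors = barcode_capacity(4, 8)
```

Because capacity grows exponentially with the number of cycles, adding one cycle multiplies the number of addressable RNA species by M.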

### SFO Removal Efficiency

One requirement for the success of this SFO-based RNA profiling technology is that fluorescent decoding probes must be removed very efficiently at the end of each analysis cycle. In this way, the minimal residual fluorescence signal will not lead to false positive signals in the subsequent cycles. Additionally, the efficient removal of SFO regenerates the single-stranded SFO-binding sequences on the pre-decoding probes, so that SFO can be recruited in the following cycle to stain the target RNA again. To assess the SFO stripping efficiency, we stained mRNA GAPDH with Cy3 labeled decoding probes (**Figure 2A**). After incubating the stained cells with the oligonucleotide eraser for 30 min at 37°C, almost all the original FISH spots became undetectable (**Figures 2B,C**). We also performed control experiments by incubating the stained cells with an SFO-orthogonal oligonucleotide (**Figure 2D**). The fluorescence intensities of the Cy3 stained GAPDH remained largely the same before and after the oligonucleotide incubation (**Figures 2E,F**). These results indicate that SFO can be efficiently removed by strand displacement reactions.

### Effects of the Strand Displacement Reactions

Another requirement for the success of this SFO-based approach is that the strand displacement reactions should maintain the RNA integrity, so that the same transcripts can be restained in the subsequent cycles. Additionally, it is preferred to keep the pre-decoding probes hybridized to their RNA targets throughout the assay, rather than to apply them in every analysis cycle. This is important for the following reasons. First, because the theoretical hybridization efficiency is ∼75% (Lubeck and Cai, 2012), a small percentage of transcripts are not hybridized with enough pre-decoding probes to make them detectable. These undetectable RNA can be different transcripts in different analysis cycles if the pre-decoding probes are removed and rehybridized in each cycle. Consequently, many missing spots will be generated in the aligned fluorophore sequences, leading to an increased error rate. Furthermore, as the hybridization of the pre-decoding probes takes overnight to 36 h, it is time-consuming to apply this step in each cycle. Finally, for highly multiplexed RNA profiling, the pre-decoding probe library is usually composed of thousands of oligonucleotides. Thus, the assay becomes less cost-effective if the expensive pre-decoding library is removed and re-hybridized in every cycle.

To assess the effects of the strand displacement reactions on the RNA targets and the hybridized pre-decoding probes, we stained mRNA GAPDH in three continuous hybridization cycles (**Figure 3**). In each cycle, Cy3 or Cy5 labeled SFO were applied to stain the transcripts, and were subsequently removed very efficiently using the same oligonucleotide eraser. We counted 1032 and 1045 spots in the first and second cycle, respectively. Among these spots, 803 spots were colocalized. These results are consistent with the ones obtained by using two sets of different colored FISH probes to stain the same transcripts (Raj et al., 2008). The small fraction of spots that did not colocalize may correspond to the non-specifically bound probes. To exclude these off-target signals, we define only the spots colocalized in the first two cycles as true mRNA signals. With our approach, 99% of the true signals reappeared in the third cycle. In comparison, when both pre-decoding and decoding probes are degraded using DNase, only 78% of spots reoccur in the third cycle (Lubeck et al., 2014). These results suggest that the DNA displacement reactions do not damage the RNA integrity, and the pre-decoding probes remain hybridized to their RNA targets throughout the assay. In this way, the analysis accuracy is improved and the assay time and cost are reduced.

### Eight-Cycle RNA Restaining

To demonstrate the multi-cycle potential of our approach, we stained mRNA GAPDH in eight consecutive hybridization cycles using SFO (**Figure 4**). To evaluate the target staining specificity, we incubated the cells with Cy3 conjugated SFO together with a Cy5 labeled orthogonal oligonucleotide in the odd hybridization cycles, and with Cy5 conjugated SFO and a Cy3 labeled orthogonal oligonucleotide in the even cycles. In the first cycle, the FISH spots were only observed in the Cy3 channel, suggesting that mRNA GAPDH is specifically stained by the corresponding SFO. After signal detection and strand displacement reactions, we imaged the cells again to confirm the efficient stripping of SFO. This process of staining, imaging and stripping was repeated eight times to obtain the 8-bit fluorophore sequence barcode for the target mRNA. For the spots co-localized in the first two cycles (n = 1470), more than 97% of these spots reappeared in each of the following cycles (**Figure 5**). And over 95% of the spots were successfully identified in all the hybridization cycles (**Figure 6**). A plot of the signal intensities of the FISH spots in both the Cy3 and Cy5 channels vs. the hybridization cycles is shown in **Figure 7**. Due to the high staining specificity, all the FISH spots were unambiguously detected in the correct fluorescence channels. We also performed control experiments to stain mRNA GAPDH using the conventional smFISH method. The copy numbers per cell obtained by the two methods (**Figure 8**), together with those reported previously using RNA-Seq (Uhlén et al., 2015), are consistent with each other. These results suggest that transcripts can be quantitatively profiled in single cells in situ by multi-cycle staining using the SFO-based approach.

FIGURE 3 | (A) In the first hybridization cycle, GAPDH transcripts are stained by Cy3 labeled SFO. (B) SFO is removed by the eraser oligonucleotide. (C) In the second hybridization cycle, GAPDH transcripts are stained by Cy5 labeled SFO. (D) SFO is removed by the eraser oligonucleotide. (E) In the third hybridization cycle, GAPDH transcripts are stained by Cy3 labeled SFO. (F) SFO is removed by the eraser oligonucleotide. (G) Signal intensity profiles corresponding to the marked FISH spot in (A,B). (H) Signal intensity profiles corresponding to the marked FISH spot in (C,D). (I) Signal intensity profiles corresponding to the marked FISH spot in (E,F). Scale bars, 5 µm.

In each cycle of MER-FISH, only certain transcripts are stained and other RNA targets remain unlabeled. Thus, to determine which transcripts are stained in a specific cycle, a detection threshold has to be manually selected by comparing the signal intensities of different FISH spots. However, due to the imperfect probe hybridization efficiency, RNA secondary structures, proteins bound to transcripts and other factors, even individual transcripts from the same RNA species can have significantly different staining intensities (**Figure 7**). As a result, the artificial detection threshold can lead to false negative signals if the stained transcripts have low signal intensities. This threshold will also result in false positive signals if the unstained transcripts have high fluorescence intensities, which are generated as the signal leftovers from the previous cycles. In contrast, all the RNA targets are stained simultaneously in every cycle in the SFO-based approach. Rather than using a threshold to identify the stained transcripts, we compare the signal intensities of the same spot in different fluorescence channels to determine which SFO is hybridized to the specific RNA target. In this way, the correct fluorescence sequence can be unambiguously identified for both the weak spots (**Figure 9A**) and the strong spots (**Figure 9B**) in each analysis cycle. These results suggest that the SFO-based approach avoids the false positive and negative signals generated by the artificial threshold, and has enhanced detection sensitivity and analysis accuracy.
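The difference between threshold-based detection and the per-spot channel comparison described above can be illustrated with a small Python sketch. It rests on assumptions: intensities are taken to be background-subtracted values in arbitrary units, the numbers are hypothetical, and the functions `call_color`, `decode_spot` and `threshold_call` are illustrative names rather than the authors' analysis code.

```python
def call_color(cy3, cy5):
    """Assign a spot to the channel with the higher intensity; None if tied."""
    if cy3 > cy5:
        return "Cy3"
    if cy5 > cy3:
        return "Cy5"
    return None


def decode_spot(intensity_pairs):
    """Turn per-cycle (Cy3, Cy5) intensities of one spot into a color sequence."""
    return [call_color(cy3, cy5) for cy3, cy5 in intensity_pairs]


def threshold_call(intensity, threshold):
    """Threshold-style detection in a single channel, shown for contrast."""
    return intensity >= threshold


# Hypothetical intensities for a weak and a strong spot over two cycles.
weak_spot = [(120, 40), (35, 110)]
strong_spot = [(2000, 300), (250, 1800)]

weak_sequence = decode_spot(weak_spot)
strong_sequence = decode_spot(strong_spot)

# A fixed threshold of 500 (hypothetical) would miss the weak spot in cycle 1.
weak_detected_by_threshold = threshold_call(weak_spot[0][0], 500)
```

In this toy example the weak and strong spots yield the same color sequence, because the call depends on the relative intensities within a spot rather than on an absolute cutoff.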

### DISCUSSION

We have developed an SFO-based technology for in situ RNA profiling. Compared with the existing methods, our approach has the following advantages. (i) By detecting transcripts directly without target sequence amplification, our technology enables RNA analysis at the single-molecule sensitivity. (ii) In this method, different RNA species can be distinguished by the varied color sequences, whose number increases exponentially with the number of hybridization cycles. Thus, our approach has the potential to enable highly multiplexed RNA analysis. (iii) All the distinct SFO in the whole specimen can be simultaneously removed by their corresponding eraser oligonucleotides. Therefore, our approach has high sample throughput, and allows a large number of cells to be quantified in a short time. (iv) As SFO can be very efficiently removed and have minimized cross-reactions with endogenous biomolecules and other probes, our approach has enhanced signal to noise ratio. (v) By keeping the pre-decoding oligonucleotides hybridized to their targets throughout the assay, our method has increased analysis accuracy and decreased assay time and cost. (vi) With each transcript stained in every cycle, this SFO-based approach avoids the false positive and false negative signals generated by the manually selected detection thresholds.

The number of RNA species that can be quantified using this SFO-based approach depends on two factors: the number of hybridization cycles and the number of different fluorophores used in each cycle. As we have demonstrated, at least eight hybridization cycles with high analysis accuracy can be carried out in the same set of cells. It is also well established that hundreds of thousands of oligonucleotides can be prepared cost-effectively by massively parallel synthesis on a microarray slide (Murgha et al., 2014). Thus, further implementation of the SFO-based approach with four classical fluorophores applied in each cycle will potentially enable the whole transcriptome to be profiled using the 65,536 (4<sup>8</sup>) distinct fluorophore sequences. Additionally, multispectral fluorophores (Dai et al., 2011; Guo et al., 2011; Wang et al., 2012) coupled with hyperspectral imaging (Garini et al., 2006) can be applied to allow more fluorophores to be distinguished and applied in each hybridization cycle. In this way, the cycle number together with the assay time can be further reduced. Furthermore, following the RNA profiling by this SFO-based approach, the nuclear and cellular membranes can be counterstained using nuclear staining dyes (such as DAPI) and fluorescent antibodies targeting membrane proteins (such as E-cadherin), respectively. With individual cells precisely segmented by this counterstaining approach, the SFO-based approach will allow RNA analysis in single cells of intact tissues. Finally, the combination of this SFO-based approach with multiplexed in situ protein analysis technologies (Bodenmiller, 2016; Mondal et al., 2017, in press) will enable the comprehensive and integrated RNA and protein profiling in single cells in situ. This molecular imaging platform will bring new insights into systems biology, signaling network regulation, molecular diagnosis and cellular targeted therapy.
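The trade-off between fluorophores per cycle and cycle number mentioned above can be sketched as follows. The helper `min_cycles` is a hypothetical name, and the 16-color case is purely illustrative. The sketch returns the smallest N for which M<sup>N</sup> reaches a target number of distinct sequences.

```python
def min_cycles(n_species, m_fluorophores):
    """Smallest number of cycles N such that m_fluorophores ** N >= n_species."""
    if n_species < 1 or m_fluorophores < 2:
        raise ValueError("need at least 1 species and 2 fluorophores")
    cycles, capacity = 0, 1
    while capacity < n_species:
        capacity *= m_fluorophores
        cycles += 1
    return cycles


# 65,536 sequences need 8 cycles with 4 colors, matching 4**8 in the text.
cycles_four_colors = min_cycles(65536, 4)
# With a hypothetical 16 distinguishable colors, the same capacity needs 4 cycles.
cycles_sixteen_colors = min_cycles(65536, 16)
```

Integer multiplication is used instead of floating-point logarithms so that exact powers such as 4<sup>8</sup> are not affected by rounding.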

### AUTHOR CONTRIBUTIONS

LX and JG designed the experiments. LX performed the experiments. LX and JG analyzed the data and wrote the manuscript.

### FUNDING

This research is supported by funding from the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R21AI132840), Arizona State University startup funds, Arizona State University/Mayo Clinic seed grant (ARI-219693), and Cystic Fibrosis Foundation (FIRTH17XX0).

### ACKNOWLEDGMENTS

We would like to thank members of the Guo lab for their input and helpful discussions.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcell.2018.00042/full#supplementary-material

Table S1 | Sequences of the pre-decoding probes, decoding probes, eraser oligonucleotide and SFO-orthogonal oligonucleotide.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Xiao and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Fluidic Logic Used in a Systems Approach to Enable Integrated Single-Cell Functional Analysis

*Naveen Ramalingam† , Brian Fowler† , Lukasz Szpankowski† , Anne A. Leyrat, Kyle Hukari, Myo Thu Maung, Wiganda Yorza, Michael Norris, Chris Cesar, Joe Shuga, Michael L. Gonzales, Chad D. Sanada, Xiaohui Wang, Rudy Yeung, Win Hwang, Justin Axsom, Naga Sai Gopi Krishna Devaraju, Ninez Delos Angeles, Cassandra Greene, Ming-Fang Zhou, Eng-Seng Ong, Chang-Chee Poh, Marcos Lam, Henry Choi, Zaw Htoo, Leo Lee, Chee-Sing Chin, Zhong-Wei Shen, Chong T. Lu, Ilona Holcomb, Aik Ooi, Craig Stolarczyk, Tony Shuga, Kenneth J. Livak, Cate Larsen, Marc Unger and Jay A. A. West\**

*Edited by: Xinghua Pan, Yale University, USA*

### *Reviewed by:*

*Senentxu Lanceros-Mendez, University of Minho, Portugal Lin Han, Shandong University, China*

> *\*Correspondence: Jay A. A. West jay.west@fluidigm.com*

*† These authors have contributed equally to this work.*

### *Specialty section:*

*This article was submitted to Genomic Assay Technology, a section of the journal Frontiers in Bioengineering and Biotechnology*

*Received: 31 May 2016 Accepted: 23 August 2016 Published: 21 September 2016*

### *Citation:*

*Ramalingam N, Fowler B, Szpankowski L, Leyrat AA, Hukari K, Maung MT, Yorza W, Norris M, Cesar C, Shuga J, Gonzales ML, Sanada CD, Wang X, Yeung R, Hwang W, Axsom J, Devaraju NSGK, Angeles ND, Greene C, Zhou M-F, Ong E-S, Poh C-C, Lam M, Choi H, Htoo Z, Lee L, Chin C-S, Shen Z-W, Lu CT, Holcomb I, Ooi A, Stolarczyk C, Shuga T, Livak KJ, Larsen C, Unger M and West JAA (2016) Fluidic Logic Used in a Systems Approach to Enable Integrated Single-Cell Functional Analysis. Front. Bioeng. Biotechnol. 4:70. doi: 10.3389/fbioe.2016.00070*

*New Technologies Research Department, Fluidigm Corporation, South San Francisco, CA, USA*

The study of single cells has evolved over the past several years to include expression and genomic analysis of an increasing number of single cells. Several studies have demonstrated widespread variation and heterogeneity within cell populations of similar phenotype. While the characterization of these populations will likely set the foundation for our understanding of genomic- and expression-based diversity, it will not be able to link the functional differences of a single cell to its underlying genomic structure and activity. Currently, it is difficult to perturb single cells in a controlled environment, monitor and measure the response due to perturbation, and link these response measurements to downstream genomic and transcriptomic analysis. In order to address this challenge, we developed a platform to integrate and miniaturize many of the experimental steps required to study single-cell function. The heart of this platform is an elastomer-based integrated fluidic circuit that uses fluidic logic to select and sequester specific single cells based on a phenotypic trait for downstream experimentation. Experiments with sequestered cells that have been performed include on-chip culture, exposure to various stimulants, and post-exposure image-based response analysis, followed by preparation of the mRNA transcriptome for massively parallel sequencing analysis. The flexible system embodies experimental design and execution that enable routine functional studies of single cells.

Keywords: single-cell, mRNA-seq, functional studies, Fluidigm, Polaris

### INTRODUCTION

Recent single-cell transcriptomic analyses have documented the importance of cellular heterogeneity in studying cancer (Ennen et al., 2014; Saadatpour et al., 2014; Kim et al., 2015), immunology (Shalek et al., 2014), developmental biology (Briggs et al., 2015), stem cell research (Wilson et al., 2015), and neurobiology (Pollen et al., 2015). It has been estimated that the human body contains 37.2 trillion cells (Bianconi et al., 2013), excluding the complex microbiome that lives in the human body. High-throughput single-cell mRNA sequencing provides an unbiased path to classifying this vast number of cells into cell types. This endeavor has stimulated the development of methods to increase throughput (Fan et al., 2015; Klein et al., 2015; Macosko et al., 2015). The classification of cell types can be thought of as a high-resolution anatomy. At the single-cell level, moving from anatomy to physiology or from description to mechanism means moving from cell type to cell function. This will require integrating transcriptional data with other cellular measurements. In this regard, progress has been made in obtaining transcriptomic and genomic information (Dey et al., 2015; Macaulay, 2015), transcriptomic and epigenomic information (Angermueller et al., 2016), or transcriptomic and proteomic information (Darmanis et al., 2016; Frei et al., 2016) from the same single cell.

Moving from cell type to cell function will also require understanding how single-cell profiles change in response to perturbations. It is important to examine these effects at the single-cell level because cell-to-cell heterogeneity has been observed in a diverse set of circumstances, such as the response of macrophages to bacterial invasion (Avraham et al., 2015), the response of hematopoietic cells to various drugs (Bendall et al., 2011), and drug resistance in adenocarcinoma cells (Kim et al., 2015). Progress in the long-term culture of circulating tumor cells (Gao et al., 2014; Yu et al., 2014; Cayrefourcq et al., 2015; Alix-Panabières et al., 2016) enables single-cell functional studies on this important class of cells, which should lead to improved cancer diagnosis and therapy. Performing perturbation experiments on single cells requires care in maintaining the appropriate microenvironment. Examining the effects of serum on mouse embryonic stem cells (ESCs), researchers (Guo et al., 2016) concluded that "a large proportion of intracellular network variability is due to the extracellular culture environment." Microfluidic-based approaches are attractive for the precise control of the microenvironment because they enable structures at a size appropriate for single cells. Microfluidic systems for high-throughput preparation of sequencing libraries, though, have cell lysis as the initial step and thus are not suitable to maintain single cells for experimentation. What is required is a system specifically designed to capture, maintain, perturb, and observe single cells and then prepare these cells for high-dimensional analysis.

In this paper, we report the development of an integrated fluidic circuit (IFC) that uses fluidic logic to actively select and sequester desired single cells based on particular biological markers of interest. This Polaris™ IFC can sequester up to 48 single cells. If required, the cells can be cultured in appropriate medium in order to control and manipulate the microenvironment around the sequestered cells. For adherent cells, appropriate extracellular matrix (ECM) can be coated inside the culture chambers. The single cells can be perturbed with a drug or other stimuli (e.g., mRNA, cytokines, bacteria, or viruses), with the response to perturbation monitored and measured by fluorescence imaging. Subsequently, the single cells are processed for cell lysis, reverse transcription (RT), and full-length transcriptome amplification using template-switching chemistry. Following harvest from the IFC, sequencing libraries are generated using a modified Nextera® protocol and sequenced on any Illumina® platform (**Figure 1**).

Figure 1 | Typical workflow for single-cell functional studies. The input single-cell suspension can be obtained from blood, primary cells, or cell culture. In the case of rare cells or a subset of a cell population, there is an option to enrich them prior to use with Polaris using either fluorescence-activated cell sorting (FACS) or other methods. Subsequently, the cells are labeled with a universal fluorescent marker for tracking the cells on the Fluidigm® Polaris system. Most of the single-cell functional study steps are automated on the Polaris system. The Polaris system generates preamplified full-length cDNA, which can be further processed for library preparation and massively parallel sequencing for mRNA sequencing.

### MATERIALS AND METHODS

### Design and Fabrication of Logic-Based Integrated Fluidic Circuit

The nanoscale IFC consists of a plastic carrier and a polydimethylsiloxane (PDMS) core (**Figure 2A**) or fluidic circuit. The carrier contains reservoir wells for input and output of reagents and circuit control. It provides a platform to facilitate interfacing with the fluidic circuit. The fluidic circuit with the desired microfluidic control components was fabricated using the multilayer soft lithography (MSL®) process (Unger et al., 2000). Fluidic circuit components include flow and control channels, valves, multiplexors, and logic devices [such as serial-to-parallel shift register (SR)]. Fabrication and operational details of the fluidic logic circuits and devices were reported earlier (Devaraju and Unger, 2012). The IFC is designed to have the capability to actively select single cells based on fluorescent markers, isolate them to a desired holding location (cell capture site), apply individual conditions (feed medium and dose reagents to cells), and finally study the functional response. Execution of all these complex functions in a routine fashion requires flexible, programmable operational control, which in turn requires many controls operating in parallel. Traditionally, in microfluidics, a dedicated external control line is required to independently control a set of valves. This imposes a limitation on the number of practical on-chip control operations and poses a challenge for scalability by requiring more external hardware. An on-chip control architecture capable of receiving and processing data by elementary computation and decision-making can integrate programmability of controls on-chip and allows an increase in the number of on-chip control lines for the same number of external chip connections (**Figure 2B**). We developed such microfluidic logic and implemented it on our Polaris IFC.

Removal of black backing enables high-resolution imaging of cells on a fluorescent microscope. The CAD drawing of the microfluidic components is shown on the right. The green channels are the control lines. (B) Polaris IFC uses fluidic processor to receive serial control inputs and converts them to parallel shift register elements. Traditional microfluidic control elements use one external control for every internal control.

The state-based microfluidic logic devices and circuits utilize static gain and normally closed valves (NCVs). NCVs are fabricated by filling specialized control channels with a flash curable prepolymer and curing while the valve is closed. The resulting closed valve exerts a certain force against fluidic pressure to keep the valve closed. The valves are characterized by breakthrough pressure: the threshold pressure in the flow channel required to push open the valve and restore the continuity of the flow. Breakthrough pressure for an NCV can be tailored by controlling the pressure at which it is cured. Using these NCVs, we have developed static gain valves (SGVs) that have the ability to control higher (or equal) fluidic pressure using a lower pressure. This type of valve is essential to create any logic/feedback structures (to account for signal strength losses), which can receive the output of the previous element/gate and use it as an input for decision making.

Utilizing the SGV, we next built an inverter (NOT gate), which was further used to build more complex circuits including bistable flip flops, clocked flip flops (latches), delay flip flops (D-FF, one bit of the SR), and complex microprocessors (SR). A SR that is capable of processing *n* + 1 bits of data is formed by combining *n* D flip flops (bits of SR). The SR presented here uses air as the medium and receives three active high-pressure inputs: source, clock, and data (**Figure 3**). The pneumatic output of the SR cannot be used to control the flow of liquids in microchannels directly, due to risk of introducing bubbles. In order to address this issue, the signal medium is converted from air to liquid using an inverter.
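The serial-to-parallel behavior of a shift register built from chained D flip-flops can be illustrated with a software analogy. The sketch below is not a model of the pneumatic device: the `ShiftRegister` class and `not_gate` function are hypothetical names, bits stand in for high and low pressure states, and the final inversion step only mirrors the statement that an inverter sits between the air-driven register and the liquid control lines. The actual signal polarity on the IFC is an assumption.

```python
def not_gate(bit):
    """Inverter (NOT gate): a high input gives a low output and vice versa."""
    return 1 - bit


class ShiftRegister:
    """Serial-in, parallel-out register modeled as a chain of D flip-flop stages."""

    def __init__(self, n_stages):
        self.bits = [0] * n_stages

    def clock(self, data_bit):
        """One clock pulse: stage 0 latches the data input, each later stage
        latches the value previously held by the stage before it."""
        self.bits = [data_bit] + self.bits[:-1]

    def load(self, serial_bits):
        """Clock in a serial bit stream and return the parallel outputs."""
        for bit in serial_bits:
            self.clock(bit)
        return list(self.bits)


register = ShiftRegister(4)
parallel = register.load([1, 0, 1, 1])

# Assumed polarity: each parallel output drives a control line through an inverter.
control_lines = [not_gate(bit) for bit in parallel]
```

The first bit clocked in ends up in the last stage, so the parallel output is the serial stream in reverse order. Only the data and clock lines are needed externally, however many stages are chained, which is the scaling advantage that on-chip logic offers over one external line per valve.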

The Polaris IFC microprocessor receives 28 external signals serially and processes them into 28 parallel independent controls capable of controlling individual valves or a set of valves. Five dedicated high-pressure external active signals are required for a SR. The CAD drawing of the various microfluidic components on a Polaris IFC is shown in **Figure 3**. The IFC can accept up to 20 independent reagents. The fluorescently labeled cells are loaded in a serpentine partition channel. Based on a desired combination of up to three fluorescent markers (refer to Section "Polaris Instrument Design" for excitation and emission details), single cells are selected and sequentially isolated to the cell capture sites through a multiplexer. Up to 48 single cells can be sequestered on a single Polaris IFC. Subsequently, these 48 cells are processed through template-switching chemistry for full-length cDNA generation for mRNA-seq. In brief, the cells are lysed and reversetranscribed, and full-length cDNA is preamplified by long and accurate PCR.

### Polaris Instrument Design

The Fluidigm Polaris system (**Figure 4A**) consists of four major modules: (1) thermal control module; (2) imaging module; (3) pneumatic control module; and (4) environmental control (EC) module. The thermal module consists of a Peltier-based thermoelectric couple (TEC) device for heating/cooling. The TEC module can provide temperature in the range of 4–99°C. Vacuum grooves on the thermal module are designed to enable tight contact with the glass-based integrated heat spreader (IHS) on the Polaris IFC. This ensures thermal uniformity across the fluidic circuit. The imaging module contains a five-color LED light engine for excitation (Ex wavelengths: 438, 475, 530, 575, and 632 nm). The light source from the engine is collected and projected onto the fluidic circuit using fiber optics. The emitted signal from the fluidic circuit passes through an emission filter (five Em wavelengths: 488, 525, 570, 630, and 700 nm) and is collected by CCD camera with 6-μm pixel resolution through a custom-designed collimator lens.

(F) through a multiplexer (E). The IFC is capable of accepting 20 reagents (G) as input. The shift register uses inverter (C) and a set of source, clock, and data (H).

through mixed gas inlet port on the interface plate. Polaris IFC is shown for reference.

The pneumatic control module generates and stores air with volume up to 1 L. The system can achieve a maximum pressure of 100 psi. The pneumatic controller generates a vacuum on the thermal chuck, clamps the Polaris IFC against the EC interface plate (IP) to enable a closed environment around the IFC, and loads reagents from the inlets on the IFC carrier to the microchannels and reagent chambers of the fluidic circuit. The system contains segregated zones to regulate four different pressures simultaneously. The EC module provides an environment suitable for cell culture using user-desired gas composition. Environmental parameters such as temperature, relative humidity (RH), and mix gas flow rate across the fluidic circuit are monitored and controlled. The gas inside the closed chamber is heated by two heater coils. The gas inlet on the EC IP is used to regulate the flow of gas across the fluidic circuit. The EC IP (**Figure 4B**) contains an indium tin oxide (ITO) coated glass on the top to maintain thermal control in the EC while allowing imaging through the EC IP. During cell culture operation, the ITO glass is heated to prevent condensation. During cell culture, the environment around the fluidic circuit is maintained by blood gas (5% CO2, 5% oxygen, and 90% nitrogen) or premixed gas of choice (for example, 5% CO2, 20% oxygen, and 75% nitrogen). Before on-IFC cell culture, a rectangular sponge saturated with water is installed inside the closed chamber to provide the desired humidity through heating. The EC IP is equipped with two sensors (T/RH) to measure and maintain temperature at 37°C and RH at 90%.

### K562 Cell Culture and CD59 Staining

K562 cells (ATCC® CCL-243) are cultured in T25 flasks in a volume ranging from 10 to 15 mL in an incubator (37°C, 5% CO2). The culture medium contains IMDM + GlutaMAX™-I + 25 mM HEPES + 3.024 g/L sodium bicarbonate (Gibco, 31980-030) and is supplemented with 10% FBS. The cells were fed every 2–3 days by dilution to 200,000 cells/mL. The K562 cells were stained with CellTracker™ Orange (CTO) CMRA Dye (Thermo Fisher Scientific, C34551) as universal marker and Alexa Fluor® 647 conjugated CD59 antibody. The recommended dyes and corresponding excitation and emission filters on the Polaris system are shown in **Table 1**. Immediately before use, the cell staining



*a Excitation values are center wavelength/band pass (*≥*90%).* solution was prepared by adding 0.6 μL of 1 mM CTO to 2 mL of HBSS without calcium or magnesium (−/−) at a final concentration of 0.3 μM. The cell staining solution was protected from light until use within 30 min. A total of ~1.5 × 106 cells was aliquoted in a 15 mL non-pyrogenic conical tube. The cell suspension was centrifuged at 300 × *g* for 3 min. Following this, the medium was aspirated without disturbing the pellet, and 2 mL of cell staining solution was added to the pellet and gently suspended by pipetting up and down three times. The cells were then incubated in the dark at 37°C for 20 min with occasional inverting and flicking. Following this, the cells were washed by adding 12 mL of HBSS to the cells in the 2 mL of staining buffer and then centrifuged at 300 × *g* for 5 min. Supernatant was aspirated and discarded without disturbing the pellet. The pellet was then resuspended in 200 μL of HBSS. The CTO-stained K562 cells were split into two tubes of 100 μL each. One tube was used as negative surface-stained cell population, and the other tube was processed further to stain CD59 epitope. In order to stain the surface CD59 epitope, 10 μL of CD59 biotinylated antibody (BD Biosciences, 555762, 100 tests, 2.0 mL) was added to 100 μL of CTO-stained cells. For negative surface-stained cell control, 10 μL of HBSS was added. Both the tubes were incubated at room temperature for 20 min with occasional inverting and flicking. Subsequently, 13 mL of HBSS was added to each tube and centrifuged at 300 × *g* for 5 min. The pellet was resuspended in 100 μL of HBSS. To this, 0.5 μL of Streptavidin Alexa Fluor® 647 (Thermo Fisher Scientific, S32357, 2 mg/mL stock) was added to positive-stain tube with CD59 biotinylated antibody in 100-μL cell suspension. This solution was mixed gently by pipetting up and down five times. 
Following this, the stain solution was incubated at room temperature for 15 min with occasional flicking. Again, 13 mL of HBSS was added to each tube, mixed by gently pipetting up and down, and centrifuged at 300 × *g* for 5 min. The supernatant was removed, and the pellet was resuspended in ~100–150 μL culture medium with FBS, but without phenol red, to prevent high background fluorescence during cell selection on the Polaris system. The resuspension volume of culture medium accounts for cell losses during the staining procedure and was chosen to yield a cell concentration greater than the target concentration of 550 cells/μL. Typically, 10 μL of cell mix is loaded into a C-Chip™ Disposable Hemocytometer (INCYTO, DHC-N01) and imaged on the Polaris system to estimate the staining intensity and purity. In order to achieve optimal buoyancy, cells in the range of 333–550 cells/μL are mixed with suspension reagent (Fluidigm, 101-0434). Typically, the ratio of cells to cell suspension reagent is 3:2. However, this ratio might need optimization depending on the cell type.
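The dilution arithmetic in this protocol can be checked with a short sketch. The function names below are illustrative and not part of any Polaris software. The sketch assumes that the small stock volume does not change the final volume, and that the quoted cell concentration refers to the suspension before the 3:2 mix with suspension reagent.

```python
def stock_volume_ul(stock_conc_um, final_conc_um, final_volume_ul):
    """Stock volume (uL) from C1 * V1 = C2 * V2.

    Assumes the added stock volume is negligible relative to the final volume.
    """
    return final_conc_um * final_volume_ul / stock_conc_um


def concentration_after_mix(cells_per_ul, cell_parts=3, reagent_parts=2):
    """Cell concentration after mixing cells with suspension reagent.

    The 3:2 default is the typical ratio quoted in the text; other cell
    types may need a different ratio.
    """
    return cells_per_ul * cell_parts / (cell_parts + reagent_parts)


# CTO staining solution: 1 mM stock (1000 uM) diluted to 0.3 uM in 2 mL (2000 uL)
cto_ul = stock_volume_ul(stock_conc_um=1000.0, final_conc_um=0.3, final_volume_ul=2000.0)
print(f"CTO stock needed: {cto_ul:.2f} uL")

# Cells at 550 cells/uL mixed 3:2 with suspension reagent
print(f"After 3:2 mix: {concentration_after_mix(550.0):.0f} cells/uL")
```

Under these assumptions the calculation reproduces the 0.6 μL of CTO stock stated above.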

### IFC Operation

The Polaris IFC is first primed to fill the control lines on the fluidic circuit, load the cell capture beads, and block the inside of the PDMS channels to prevent non-specific absorption/adsorption of proteins. To capture and maintain single cells, the 48 capture sites are preloaded during the IFC prime step with beads that are linked on-IFC to form a tightly packed bead column. In the case of adherent cells, ECM is coated inside the cell capture chambers during the prime step. After completion of the prime step, the cell mix (cells with suspension reagent) is loaded on the Polaris IFC and single CTO<sup>+</sup>/CD59<sup>+</sup> cells are selected to capture sites. We extensively tested the performance of the Polaris IFC and system at three different cell purities (3, 10, and 50%). The cell purity is defined as the ratio of CTO<sup>+</sup>/CD59<sup>+</sup> cells to CTO<sup>+</sup>/CD59<sup>−</sup> cells. During the cell selection step, the suspended cells are loaded into the serpentine partition channel (**Figure 3**). Subsequently, the flow inside the partition channel is stopped (**Figure 5A**), and the cells are imaged in the partition channel for the fluorescent markers selected by the user. Based on automated image analyses by the system's software, only single cells with the desired combination of fluorescent markers are selected and isolated to the cell capture site. Doublets and single cells with undesired fluorescent combinations are not selected by the software for further experimentation (**Figure 5B**). The selected single cells are moved to capture sites through a multiplexer (**Figure 5C**). The system takes images of the capture sites to confirm the arrival of single cells from a particular position in the partition to a particular capture site number. **Figure 5C** shows a typical image from the Polaris system showing K562 single cells captured inside sites packed with a column of beads. The system selects and isolates all available single cells from a partition fill that match the desired fluorescent marker combination. Once it completes selection of candidate cells, the system refills the serpentine partition to look for more candidate single cells. The system repeats this process until it fills all 48 capture sites.

Figure 5 | (A) Cells in suspension are loaded into a serpentine partition channel (CAD design image). (B) Image analysis shows movement of fluorescent cells inside the serpentine channel during reagent flow. Once the reagent flow is stopped, single cells are separated from each other. The system software identifies single cells based on desired fluorescent markers. (C) After identification, single cells are isolated by moving them to the capture sites through a fluidic multiplexer. (D) Image of cultured BJ fibroblast cells. The Polaris IFC can be imaged on a microscope to obtain high-resolution micrographs.
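The select-and-refill loop described above can be illustrated with a toy simulation. This is only a sketch and not the Polaris software: the number of cells seen per partition fill, the singlet fraction, and the treatment of purity as a simple per-cell probability are all hypothetical simplifications. Only the 48 capture sites come from the text.

```python
import random


def fill_capture_sites(target_fraction, n_sites=48, cells_per_fill=20,
                       singlet_fraction=0.8, seed=0):
    """Simulate repeated partition fills until every capture site holds one cell.

    All defaults except n_sites=48 are hypothetical illustration values.
    Returns (sites, fills); sites lists (fill_number, partition_position).
    """
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sites = []
    fills = 0
    while len(sites) < n_sites:
        fills += 1  # refill the serpentine partition and image it
        for position in range(cells_per_fill):
            is_single = rng.random() < singlet_fraction    # doublets are rejected
            has_markers = rng.random() < target_fraction   # e.g., CTO+/CD59+
            if is_single and has_markers:
                sites.append((fills, position))  # route the cell to the next free site
                if len(sites) == n_sites:
                    break
    return sites, fills


for fraction in (0.03, 0.10, 0.50):
    _, fills = fill_capture_sites(fraction)
    print(f"target fraction {fraction:.2f}: {fills} partition fills")
```

In this toy model, lower purity mainly increases the number of partition fills needed before all sites are occupied.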

If desired, the single cells can then be cultured in the capture sites. It is possible to culture either suspension (e.g., K562) or adherent (e.g., BJ fibroblast) cells. For adherent cells, extracellular matrix can be coated inside the capture site during the IFC priming step. **Figure 5D** shows a Polaris image of a cultured BJ fibroblast (adhered). Depending on the experimental design, it is possible to dose these single cells and on-IFC-cultured single cells with drugs or other cell stimuli. Finally, the single cells are processed through template-switching mRNA-seq chemistry for full-length cDNA generation and preamplification on-IFC.

### Full-Length cDNA Generation

Preamplified full-length cDNA of selected single cells is generated on-IFC, and the amplicons are harvested through 48 different outlets. We used the SMARTer Ultra® Low RNA Kit for Illumina Sequencing (Clontech®, 634936) to generate preamplified cDNA. The selected and sequestered single cells were lysed using Polaris cell lysis mixture. The 28-μL cell lysis mix consists of 8.0 μL of Polaris Lysis Reagent (Fluidigm, 101-1637), 9.6 μL of Polaris Lysis Plus Reagent (Fluidigm, 101-1635), 9.0 μL of 3′ SMART™ CDS Primer II A (12 μM, Clontech, 634936), and 1.4 μL of Loading Reagent (20X, Fluidigm, 101-1004). Synthetic RNA spikes can optionally be included in the cell lysis mix. We typically use ArrayControl™ RNA spikes 1, 4, and 7 (Thermo Fisher Scientific, AM1780) to establish the functionality of RT and PCR on-IFC. We also use ERCC spikes at 1:50,000 dilution (final in lysis mix) for efficiency and quantification estimations. To include synthetic RNA spikes, we thoroughly mix 96.5 μL of loading reagent with 2.5 μL of SMARTer Kit RNase Inhibitor (40 U/μL; Clontech, 634936) and subsequently add 1 μL of synthetic RNA spike to this spike mix. If an RNA spike is used, the 1.4 μL of loading reagent in the lysis mix is replaced with the spike mix. The thermal profile for single-cell lysis is 37°C for 5 min, 72°C for 3 min, 25°C for 1 min, and hold at 4°C.
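As a bookkeeping aid, the lysis mix and the optional spike mix can be written down as simple tables and their totals checked. The sketch below is illustrative only: the dictionary keys are shorthand labels, and it assumes the spike mix replaces the loading reagent volume for volume.

```python
# Component volumes in microliters, as listed in the text
LYSIS_MIX_UL = {
    "Polaris Lysis Reagent": 8.0,
    "Polaris Lysis Plus Reagent": 9.6,
    "3' SMART CDS Primer II A (12 uM)": 9.0,
    "Loading Reagent (20X)": 1.4,
}

SPIKE_MIX_UL = {
    "Loading Reagent": 96.5,
    "RNase Inhibitor (40 U/uL)": 2.5,
    "Synthetic RNA spike": 1.0,
}


def total_volume_ul(mix):
    """Sum of all component volumes in a mix."""
    return sum(mix.values())


def lysis_mix_with_spike(mix):
    """Swap the loading reagent for an equal volume of spike mix (assumed 1:1 swap)."""
    swapped = dict(mix)
    swapped["Spike mix"] = swapped.pop("Loading Reagent (20X)")
    return swapped


def spike_dilution_in_lysis_mix():
    """Fold dilution of the added spike solution once it is in the lysis mix."""
    fraction_in_spike_mix = SPIKE_MIX_UL["Synthetic RNA spike"] / total_volume_ul(SPIKE_MIX_UL)
    fraction_in_lysis_mix = LYSIS_MIX_UL["Loading Reagent (20X)"] / total_volume_ul(LYSIS_MIX_UL)
    return 1.0 / (fraction_in_spike_mix * fraction_in_lysis_mix)


print(f"Lysis mix total: {total_volume_ul(LYSIS_MIX_UL):.1f} uL")
print(f"Spike mix total: {total_volume_ul(SPIKE_MIX_UL):.1f} uL")
print(f"Spike dilution in lysis mix: 1:{spike_dilution_in_lysis_mix():.0f}")
```

The component volumes sum to the stated 28 μL, and the two mixing steps alone dilute the added spike solution 2,000-fold.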

The 48-μL preparation volume for RT contains 1X SMARTer Kit 5X First-Strand Buffer (5X; Clontech, 634936), 2.5-mM SMARTer Kit Dithiothreitol (100 mM; Clontech, 634936), 1-mM SMARTer Kit dNTP Mix (10 mM each; Clontech, 634936), 1.2-μM SMARTer Kit SMARTer II A Oligonucleotide (12 μM; Clontech, 634936), 1-U/μL SMARTer Kit RNase Inhibitor (40 U/μL; Clontech, 634936), 10-U/μL SMARTScribe™ Reverse Transcriptase (100 U/μL; Clontech, 634936), and 3.2 μL of Polaris RT Plus Reagent (Fluidigm, 101-1366). All the concentrations correspond to those found in the RT chambers inside the Polaris IFC. The thermal protocol for RT is 42°C for 90 min (RT), 70°C for 10 min (enzyme inactivation), and a final hold at 4°C.



The 90-μL preparation volume for PCR contains 1X Advantage 2 PCR Buffer [not short amplicon (SA)] (10X, Clontech, 639206, Advantage® 2 PCR Kit), 0.4-mM dNTP Mix (50X/10 mM, Clontech, 639206), 0.48-μM IS PCR Primer (12 μM, Clontech, 639206), 2X Advantage 2 Polymerase Mix (50X, Clontech, 639206), and 1X Loading Reagent (20X, Fluidigm, 101-1004). All the concentrations correspond to those found in the PCR chambers inside the Polaris IFC. The thermal protocol for preamplification consists of 95°C for 1 min (enzyme activation), five cycles (95°C for 20 s, 58°C for 4 min, and 68°C for 6 min), nine cycles (95°C for 20 s, 64°C for 30 s, and 68°C for 6 min), seven cycles (95°C for 30 s, 64°C for 30 s, and 68°C for 7 min), and final extension at 72°C for 10 min. The preamplified cDNAs are harvested into 48 separate outlets on the Polaris IFC carrier.
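The preamplification thermal protocol is easier to audit when written as data. The sketch below encodes the stages exactly as stated above and totals the cycle count and programmed hold time. Ramp times between temperatures are not given in the text and are ignored, so the actual run time would be longer.

```python
# Each stage: (number of repeats, [(temperature in C, hold in seconds), ...])
PREAMP_PROTOCOL = [
    (1, [(95, 60)]),                          # enzyme activation
    (5, [(95, 20), (58, 240), (68, 360)]),
    (9, [(95, 20), (64, 30), (68, 360)]),
    (7, [(95, 30), (64, 30), (68, 420)]),
    (1, [(72, 600)]),                         # final extension
]


def total_pcr_cycles(protocol):
    """Count repeats of multi-step stages (single-step stages are holds, not cycles)."""
    return sum(repeats for repeats, steps in protocol if len(steps) > 1)


def total_hold_seconds(protocol):
    """Total programmed hold time, excluding temperature ramps."""
    return sum(repeats * sum(hold for _, hold in steps) for repeats, steps in protocol)


print(f"PCR cycles: {total_pcr_cycles(PREAMP_PROTOCOL)}")
print(f"Programmed hold time: {total_hold_seconds(PREAMP_PROTOCOL) / 60:.1f} min")
```

The three cycling stages add up to 21 cycles, with roughly 3 h of programmed hold time.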

### qPCR Analysis on Biomark™

Harvested samples from Polaris IFCs were analyzed by qPCR using 96.96 Dynamic Array™ IFCs and the Biomark™ HD system from Fluidigm. Processing of the IFCs and operation of the instruments were performed according to the manufacturer's procedures. For detection using the RNA expression and splice variant assays, a Master Mix was prepared consisting of 360 μL of SsoFast™ EvaGreen® Supermix with Low ROX (BioRad 172-5211) and 36 μL of 20× DNA Binding Dye Sample Loading Reagent (Fluidigm 100-5360), and 3.3 μL of this mix was dispensed to each well of a 96-well assay plate. Diluted harvest product (2.7 μL) was added to each well, and the plate was briefly vortexed and centrifuged. Following priming of the IFC in the IFC Controller HX, 5 μL of the sample + Master Mix was dispensed to each sample inlet of the 96.96 IFC. Five microliters of each 10× assay [5 μM each primer, 1× Assay Loading Reagent (Fluidigm 100-5359)] was dispensed to each Detector Inlet of the 96.96 IFC. After loading the assays and samples into the IFC in the IFC Controller HX, the IFC was transferred to the Biomark HD, and PCR was performed using the thermal protocol GE Fast 96 × 96 PCR + Melt v2.pcl. This protocol consists of a thermal mix (70°C for 40 min; 60°C for 30 s), a hot start (95°C for 1 min), 30 PCR cycles (96°C for 5 s; 60°C for 20 s), and a melt using a ramp from 60 to 95°C at 1°C/3 s. Data were analyzed using Fluidigm Real-Time PCR Analysis software using the Linear (Derivative) Baseline Correction Method and the Auto (Global) Ct Threshold Method. The data were exported as a .csv file into an Excel® macro to compile and compare the data against in-house specifications.

and 3 (9 nL each), and PCR mix is loaded into chambers 4 and 5 (135 nL each). (C) Chemistry loading sequence and its function.
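The sample-plate volumes for the qPCR step can be sanity-checked in a few lines. This is a minimal sketch using only the volumes quoted above. The variable and function names are illustrative, and pipetting dead volume is ignored.

```python
import math

MASTER_MIX_UL = 360.0 + 36.0   # supermix + 20x sample loading reagent
MIX_PER_WELL_UL = 3.3          # master mix dispensed per well
SAMPLE_PER_WELL_UL = 2.7       # diluted harvest product per well
LOADED_PER_INLET_UL = 5.0      # volume transferred to each IFC sample inlet


def wells_supported(total_mix_ul, per_well_ul):
    """Number of whole wells the master mix can fill (dead volume ignored)."""
    # round() guards against floating-point noise before taking the floor
    return math.floor(round(total_mix_ul / per_well_ul, 6))


def well_volume_ul():
    """Total volume per well after adding sample to master mix."""
    return MIX_PER_WELL_UL + SAMPLE_PER_WELL_UL


wells = wells_supported(MASTER_MIX_UL, MIX_PER_WELL_UL)
print(f"Master mix covers {wells} wells ({wells - 96} more than the 96 needed)")
print(f"Per-well volume: {well_volume_ul():.1f} uL, of which {LOADED_PER_INLET_UL:.1f} uL is loaded")
```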

### RESULTS

### Performance Evaluation of Polaris IFC

In order to statistically evaluate the performance of the Polaris IFC, we designed and developed two performance tests: (1) a total-RNA-based performance test (RNA PT) and (2) a single-cell-based key performance test (KPT). Since single cells are heterogeneous, it would be difficult to evaluate the performance uniformity across 48 capture sites using a cell-based test method. Hence, we developed a 20-cell-equivalent total-RNA PT to evaluate and improve the performance of the Polaris IFC during the initial phase of the IFC development process.

### Total RNA-Based Performance Test

The primary objective of this test is to statistically validate a workflow that is very close to the cell-based experiments on the Polaris system and yet collects critical information about uniformity of cDNA synthesis across the IFC, reaction line cross-talk (on-IFC), and IFC-to-IFC correlation. To achieve this objective, we simulated steps such as loading of cell capture beads and the thermal step for cell lysis in the total-RNA PT. The workflow of the total-RNA PT is shown in **Figure 6A**. Briefly, the RNA PT is a two-step procedure. In the first step, the control lines on the Polaris IFC are primed, channels are blocked, and cell capture beads are back-loaded with ArrayControl RNA Spikes (1, 4, and 7 only, Thermo Fisher Scientific, AM1780; henceforth referred to as RNAspikes 147) in eight specific capture sites (**Figure 6B**).

The ArrayControl RNA Spikes are used to evaluate the back-dosing cross-talk using highly sensitive qPCR assays designed to detect RNA spikes 1, 4, and 7 (three total ArrayControl RNA Spikes). After the priming step, six specific capture sites are loaded with ERCC RNA Spike-In Mix (Thermo Fisher Scientific, 4456740) to estimate the cross-talk for front-loaded reagents and dosing agents. Although the ERCC RNA mix contains 92 spike-ins, only 8 ERCC spike-ins were probed using qPCR assays. The front-dosing strategy is illustrated in **Figure 7A** and the pipetting map is shown in **Figure 6C**.

For negative control, 1X Preloading Reagent (Fluidigm, 100- 9942) was loaded into specific inlets and capture sites (**Figure 7C**). After completion of front dosing, the mRNA-seq chemistry prep is integrated with the dosing step. The lysis mixture for the RNA-PT contains Leukemia (K562) Total RNA (Thermo Fisher Scientific, AM7832) at a concentration equivalent to 20 cells of total RNA in every cell capture site (48 sites). The cell capture site is serially connected to five chambers to enable multistep reaction chemistry (**Figure 7B**). Cell lysis mixture is loaded into the first 9-nL chamber. Then, RT mixture is loaded in 18-nL volume (two 9-nL chambers). Finally, PCR mixture for preamplification of full-length cDNA is loaded in 270-nL volume (two 135-nL chambers).
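The serial chamber volumes imply a fixed dilution of the lysate at each chemistry step. The sketch below computes the cumulative reaction volume and the fold dilution relative to the lysis volume. It assumes complete mixing and neglects the volume of the capture site itself, which the text does not state.

```python
from itertools import accumulate

# (step name, volume added in nL), from the chamber description above
CHEMISTRY_STEPS = [
    ("lysis", 9.0),   # one 9-nL chamber
    ("RT", 18.0),     # two 9-nL chambers
    ("PCR", 270.0),   # two 135-nL chambers
]


def cumulative_volumes_nl(steps):
    """Running total of reaction volume after each step."""
    return list(accumulate(volume for _, volume in steps))


def fold_dilution_of_lysate(steps):
    """Fold dilution of the original lysis volume after each step."""
    lysis_volume = steps[0][1]
    return [total / lysis_volume for total in cumulative_volumes_nl(steps)]


for (name, _), total, fold in zip(CHEMISTRY_STEPS,
                                  cumulative_volumes_nl(CHEMISTRY_STEPS),
                                  fold_dilution_of_lysate(CHEMISTRY_STEPS)):
    print(f"after {name}: {total:.0f} nL total, lysate diluted {fold:.0f}-fold")
```

Under these assumptions the final PCR volume is 297 nL, a 33-fold dilution of the original lysate.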

The preamplified cDNA is harvested in ~7 μL volume. The harvest is further diluted by addition of 10 μL of DNA Dilution RGT (Fluidigm, 100-9167). In order to evaluate IFC uniformity and other performance metrics, we designed 88 Delta Gene™ assays (Fluidigm) for K562 (85 genes covering high and low expressors) and RNAspikes 147. In addition to these 88 assays, we used 8 ERCC qPCR assays from a published work (Devonshire et al., 2011). In total, we used 96 intercalating dye-based qPCR assays for read-out of the RNA PT on an M96.96 Dynamic Array™ IFC (Fluidigm). We routinely test positive and negative tube controls for every chemistry preparation by qPCR assays on the M96.96. To do this, 2 of the 48 samples from a Polaris IFC are replaced by positive and negative tube controls on the M96.96. The positive tube control contains total RNA from K562, RNAspikes 147, and ERCC. The tube controls are used to validate the functionality of the chemistry preparation on a particular day. Harvest products from two Polaris IFCs are tested on a single M96.96 Dynamic Array IFC run. A typical qPCR Ct heat map and associated Excel macro for two Polaris RNA PTs are shown in **Figures 8A,B**. In order to statistically validate the performance, we tested 44 Polaris IFCs with RNA PTs. IFCs and reagents from a minimum of three manufacturing lots were used. Tolerance limit or interval analyses were performed on more than 40 Polaris IFC runs. The distribution of data and tolerance limit analyses for different metrics for the RNA PTs are shown in **Figure 9**.

### Single-Cell-Based Key Performance Test

The key performance test was developed and validated using one cell type each for suspension (K562) and adherent (BJ fibroblast) cells. As described in Section "Materials and Methods," cells are stained with the universal fluorescent marker, CTO. A subset of these cells was stained for a surface marker using antibody conjugated with Alexa 647. In the case of K562, we used Anti-Human CD59-Biotin (BD Biosciences, 555762) with Streptavidin Alexa 647, and for BJ fibroblast, we used Anti-Mouse/Human CD44-Alexa 647 (BioLegend 103018). The double-stained cells (universal CTO and surface marker Alexa 647) were mixed with cells stained with CTO only to achieve three different purity percentages (3, 10, and 50%). The cells were selected for universal CTO and surface marker. For BJ fibroblasts, we tested two different workflows, one with cell selection followed by chemistry (immediate) and another to dose the BJ fibroblasts with medium every 4 h for 24-h adherent culture, followed by chemistry (BJ dosing). In order to evaluate the cell viability prior to the cell lysis step, we used Zombie Yellow™ cell viability stain (BioLegend, 423103; λex = 396 nm and λem = 572 nm), which stains dead cells. Performance metrics, such as number of sites occupied with single cells out of the total 48 sites (cell selection), number of cells retained after dosing and prior to cell lysis (cell retention), and number of viable cells prior to lysis, were evaluated. On average from 20 Polaris IFC runs, our cell selection was ~95% for K562 and BJ fibroblast with different purity percentages (**Figure 10A**). For cell retention, >47/48 sites showed the presence of single cells as enumerated after the cell selection step and prior to the cell lysis step (**Figure 10B**). The average cell viability was ~90% as estimated from 20 Polaris IFC runs (**Figure 10C**).

Figure 8 | (A) Typical heat map of high-throughput qPCR assay for RNA-based performance test. The M96.96 IFC (96 samples) can accept amplicons from two Polaris IFCs (48 samples each). For every Polaris IFC, we replace two samples with positive and negative control samples. The columns are assays (85 high- and low-expressing assays; 8 ERCC spike assays; and 3 RNAspike 147 assays). The rows are diluted amplicons from Polaris IFCs. (B) Excel macro for the RNA-based performance test.

presented here. (C) Single-cell viability as assessed by Zombie stain on-IFC.

The Polaris system generates preamplified cDNA of high quality (size distribution) and quantity (yield) from single cells. The size distribution of preamplified cDNA from single cells, as evaluated using Bioanalyzer 2100 and the DNA high-sensitivity chip (Agilent), is typically in the size range of 0.3–7 kb (**Figure 11A**). For yield, preamplified cDNA from single cells was quantified using PicoGreen-based dsDNA quantification assay (Quant-iT™ PicoGreen® dsDNA Assay Kit, Thermo Fisher Scientific, P7589). The average total cDNA yield per single K562 cell is 38.42 ± 8.08 ng (**Figure 11B**). We randomly selected ~14 single cells from each of three Polaris IFC runs, barcoded them using modified Nextera library prep, and pooled them to generate a single sequencing library. A representative library profile from 42 single cells is shown in **Figure 11C**. The majority of the single-cell library falls in the range of 200–2,000 bp. For three sequencing libraries from nine Polaris IFCs tested with K562 immediate chemistry, sequencing data from three MiSeq™ runs using v2 150 bp PE kit were compiled, and tolerance limits (90% confidence with 95% population coverage) were estimated for two key sequencing metrics (**Figure 12**). The average percentage of reads mapping to rRNA/total reads is 0.122%. The Box–Cox transformed data fit a normal distribution with a Shapiro–Wilk *P*-value of 0.0983. Based on the normal distribution, the upper tolerance limit for percentage of reads mapping to rRNA is 0.3% (**Figure 12A**). The mean number of genes detected is 6,967 ± 115. The data fit a normal distribution with a lower tolerance limit of 5,919 genes as estimated from 115 single-cell datapoints (**Figure 12B**).
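For readers unfamiliar with tolerance limits, the sketch below shows one common way to compute a one-sided normal tolerance limit at 90% confidence and 95% coverage. It uses a standard closed-form approximation to the tolerance factor *k* (the exact factor comes from the noncentral *t* distribution). The text does not state which method or software the authors used, so this is an assumption, and the gene counts in the usage example are made-up values rather than data from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sided_tolerance_factor(n, coverage=0.95, confidence=0.90):
    """Approximate one-sided normal tolerance factor k for n observations."""
    z_p = NormalDist().inv_cdf(coverage)    # quantile for population coverage
    z_g = NormalDist().inv_cdf(confidence)  # quantile for confidence level
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + sqrt(z_p ** 2 - a * b)) / a


def lower_tolerance_limit(values, coverage=0.95, confidence=0.90):
    """Lower limit: mean - k * sample standard deviation."""
    k = one_sided_tolerance_factor(len(values), coverage, confidence)
    return mean(values) - k * stdev(values)


def upper_tolerance_limit(values, coverage=0.95, confidence=0.90):
    """Upper limit: mean + k * sample standard deviation."""
    k = one_sided_tolerance_factor(len(values), coverage, confidence)
    return mean(values) + k * stdev(values)


# Hypothetical genes-detected counts, for illustration only
example_counts = [6500, 7100, 6900, 7300, 6800, 7000, 6600, 7200]
print(f"k for n=115: {one_sided_tolerance_factor(115):.2f}")
print(f"Lower tolerance limit (example data): {lower_tolerance_limit(example_counts):.0f}")
```

For skewed metrics such as the rRNA percentage, the same calculation would be applied to the transformed values and the limit transformed back.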

We extensively analyzed our single-cell sequencing data for transcript coverage bias and possible positional bias of single cells selected across 48 capture sites on the Polaris IFC (**Figure 13**). We noted uniform coverage along the transcript length (**Figure 13B**) without any positional bias on the Polaris IFC. The plot of normalized coverage vs. normalized distance along the transcript with respect to capture sites (2, 3, 4 and 40, 41, 42) from a Polaris IFC is shown in **Figure 13B**. A plot of median 3′ end bias of transcript coverage with respect to capture site number indicates no positional bias across three Polaris IFC runs (**Figure 13A**). To determine whether the spatial location of capture sites on the Polaris IFC has any hypoxic effect on single cells, we analyzed the expression value of the *HIF1A* gene across different capture sites. Up-regulation of the hypoxia-inducible factor 1 gene (*HIF1A*) is a known consequence of hypoxia (Choudhry and Mole, 2015). Expression analyses of *HIF1A* did not show any positional bias with respect to the capture sites (**Figure 13C**). We recommend strictly following the Polaris workflow as described in the Polaris protocol document (Fluidigm, 101-0082). Any deviation from the validated workflow might introduce bias at multiple levels.

### Sensitivity Studies Using ERCC Spike-Ins

An alternative way to evaluate the performance of single-cell mRNA-seq on the Polaris system is to include the ERCC RNA Spike-In Mix 1 in the lysis mix. The ERCC control mix consists of 92 polyadenylated transcripts with a size range of 273–2,022 bases and six orders of magnitude range in concentration. We tested both qPCR- and sequencing-based methods for detection of ERCC spikes. Ninety-two primer pairs were designed to target the corresponding transcripts for qPCR testing. qPCR was performed on a Fluidigm 96.96 Dynamic Array IFC. Stochastic distribution of transcripts was observed when the input concentration was 25 copies per reaction or less on the Polaris IFC followed by qPCR detection on the 96.96 Dynamic Array (**Figure 14A**). Single-copy RNA detection is demonstrated, although intermittently, likely due to sampling at the reaction site. Transcripts at 1.6 copies per reaction were intermittently detected by qPCR on the 96.96 Dynamic Array IFC. We also evaluated the detection rate of ERCC spikes (>1.6 copies/reaction) using an approach based on massively parallel sequencing. There were 7 ERCC spikes (ERCC-00170; ERCC-00148; ERCC-00126; ERCC-00099; ERCC-00054; ERCC-00163; ERCC-00059) at a concentration of 1.6 copies per Polaris reaction chamber. One of the 7 ERCC spikes (ERCC-00054) was not detected in any of the 19 single-cell samples. If we remove this datapoint as an outlier, the average detection rate of ~1.6 copies is 28%. Based on Poisson estimates, the single-copy detection rate should be ~33% (67% should be a failure event). The single-copy detection rate (28%) from the Polaris system is very close to the expected theoretical estimate based on Poisson statistics (**Figure 14B**).
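The text does not spell out the Poisson model behind the ~33% estimate. As a point of reference, the sketch below uses the simplest version, in which a transcript is detected whenever at least one copy reaches the reaction, so that the detection probability is 1 − e<sup>−λ</sup> for a mean of λ copies. This model is an assumption. The sketch also inverts the relation to show which effective mean copy number corresponds to a given detection rate.

```python
from math import exp, log


def detection_probability(mean_copies):
    """P(at least one copy present) under a Poisson model with the given mean."""
    return 1.0 - exp(-mean_copies)


def implied_mean_copies(detection_rate):
    """Mean copy number that would give the observed detection rate."""
    if not 0.0 <= detection_rate < 1.0:
        raise ValueError("detection_rate must be in [0, 1)")
    return -log(1.0 - detection_rate)


for rate in (0.33, 0.28):
    print(f"detection rate {rate:.0%} -> effective mean of "
          f"{implied_mean_copies(rate):.2f} copies")
print(f"mean of 1.0 copy -> detection probability {detection_probability(1.0):.0%}")
```

Under this simple model, detection rates of 33% and 28% correspond to effective means of roughly 0.4 and 0.33 copies per reaction, which suggests that the quoted estimate accounts for sampling or conversion losses beyond the nominal input.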

### Single-Cell Transfection of nGFP mRNA and GFP Expression Analyses

In order to demonstrate dosing and functional response analyses, we transfected single K562 cells with nuclear green fluorescent protein (nGFP) mRNA and cultured the transfected single K562 cells for 16 h. During this culture period, the cells translated the nGFP mRNA and expressed GFP inside the cell. The imaging capability of Polaris enables monitoring of GFP expression. Subsequently, the cells were processed for mRNA-seq chemistry on-IFC and sequenced on MiSeq to quantify the reads mapped to GFP. K562 cells stained with CTO were selected on the Polaris IFC. To carry out the single-cell transfection, 10 μL of Stemfect RNA transfection reagent was mixed with 240 μL of Stemfect transfection buffer (Stemgent® Stemfect™ RNA Transfection Kit, 00-0069) (Mix A). The stock nGFP mRNA (Stemgent, 05-0019) at 100 ng/μL was diluted with Stemfect transfection buffer first and then further diluted with Mix A to make the mRNA transfection complex. This complex was incubated at room temperature for 15 min and further diluted with K562 cell culture medium (refer to Section "K562 Cell Culture and CD59 Staining") to achieve different final concentrations (0.5 and 1 ng/μL) of nGFP mRNA. Selected single K562 cells were cultured with the nGFP mRNA transfection complex in culture medium at 37°C with 5% CO2 on the Polaris IFC. During this incubation, images were taken every hour to monitor the onset of GFP expression. Image analyses (**Figure 15**) showed that single cells took up nGFP mRNA at 0.5 and 1 ng/μL concentrations and expressed the green fluorescent protein, indicating that single cells on the Polaris IFC are healthy, capable of being transfected, and able to take up naked mRNA and translate it into protein. **Figure 15A** shows typical time-series images that can be obtained from the Polaris system. For this particular experiment, the imaging interval was set to 1 h, but the Polaris system is capable of taking successive images in a rapid mode. We noted onset of GFP gene expression around the 3-h time frame at single-cell resolution. The cDNA pool showed a length range from 0.3 to 9.2 kb, with an average length of ~2 kb. More than 85% of the total cDNA pool lies between 0.5 and 9.2 kb (**Figure 15B**). Sequencing data show that the cells transfected with nGFP mRNA harbored the extracellular mRNA even after 16 h of culture. As expected, the control cells without nGFP transfection did not show any mapping to the GFP sequence. The transfection of nGFP did not alter the mapping to genome and transcriptome when compared to the control cells (**Figure 15C**). The nGFP-transfected cells showed average read-mapping percentages of 0.57, 87.99, and 47.82% to GFP, genome, and transcriptome, respectively (*n* = 7), while the control K562 cells showed average mapping percentages of 0, 88.03, and 49.22% (*n* = 7).

### DISCUSSION

In this work, we report design and development of an integrated system to perform functional studies on single cells.

We developed a nanoscale IFC, which employs fluidic logic to actively select single cells, and a system capable of performing multiple functionalities. The performance of the developed IFC and system was extensively tested using RNA-based and single-cell-based performance tests. These tests were specifically designed to evaluate different functionalities of the IFC and system. The functional capability of the Polaris IFC and system has been successfully demonstrated using transfection of naked nGFP mRNA, followed by monitoring of nGFP expression and finally analysis of the whole set of mRNA transcripts by massively parallel sequencing. To our knowledge, it is not currently possible to perform the studies reported in this work on any other single-cell platform. The limitations of the current system include the limited number of cells available for functional studies (up to 48 cells). However, the required number of cells depends on the biological question, and it is possible to expand the capability of the IFC consumable to process more cells in the future.

mRNA-seq chemistry. (C) Mapping metrics of GFP-transfected single K562 cells to genome, transcriptome, and GFP sequence.

### AUTHOR CONTRIBUTIONS

NR, BF, JS, and JAAW conceived and designed the RNA-based performance test; BF, LS, AAL, JS, and JAAW conceived and designed the single-cell-based performance test; NR, LS, AAL, JS, MLG, CDS, NDA, CG, CTL, IH, AO, CS, and JAAW performed experiments; BF, NSGKD, MZ, EO, and CP were involved in Polaris IFC development; KH, MTM, WY, MN, CC, ML, HC, ZH, LL, CC, and ZS were involved in Polaris system development; RY, WH, JA, and ZH were involved in Polaris software development; NR, LS, JS, CDS, XW, and JAAW analyzed the data; TS edited the manuscript; CL drafted the Polaris user guide; MU and JAAW supervised the project, helped with design and interpretation, and provided laboratory space and financial support; and NR, LS, KJL, and JAAW wrote the manuscript with input from all authors. All authors listed have made substantial, direct, and intellectual contributions to the work and approved it for publication.



**Conflict of Interest Statement:** All authors are employees of Fluidigm Corporation.

*Copyright © 2016 Ramalingam, Fowler, Szpankowski, Leyrat, Hukari, Maung, Yorza, Norris, Cesar, Shuga, Gonzales, Sanada, Wang, Yeung, Hwang, Axsom, Devaraju, Angeles, Greene, Zhou, Ong, Poh, Lam, Choi, Htoo, Lee, Chin, Shen, Lu, Holcomb, Ooi, Stolarczyk, Shuga, Livak, Larsen, Unger and West. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# High-Sensitivity Mass Spectrometry for Probing Gene Translation in Single Embryonic Cells in the Early Frog (Xenopus) Embryo

Camille Lombard-Banek <sup>1</sup> , Sally A. Moody <sup>2</sup> and Peter Nemes <sup>1</sup> \*

*<sup>1</sup> Department of Chemistry, The George Washington University, Washington, DC, USA, <sup>2</sup> Department of Anatomy and Regenerative Biology, The George Washington University, Washington, DC, USA*

Direct measurement of protein expression with single-cell resolution promises to deepen the understanding of the basic molecular processes during normal and impaired development. High-resolution mass spectrometry provides detailed coverage of the proteomic composition of large numbers of cells. Here we discuss recent mass spectrometry developments based on single-cell capillary electrophoresis that extend discovery proteomics to sufficient sensitivity to enable the measurement of proteins in single cells. The single-cell mass spectrometry system is used to detect a large number of proteins in single embryonic cells in the 16-cell embryo of the South African clawed frog (*Xenopus laevis*) that give rise to distinct tissue types. Single-cell measurements of protein expression provide complementary information on gene transcription during early development of the vertebrate embryo, raising a potential to understand how differential gene expression coordinates normal cell heterogeneity during development.

Edited by: *Xinghua Pan, Yale University, USA*

### Reviewed by:

*Raman Chandrasekar, Kansas State University, USA Vasudevan Seshadri, National Centre for Cell Science, India Qing-Yu He, Jinan University, China*

> \*Correspondence: *Peter Nemes petern@gwu.edu*

### Specialty section:

*This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology*

> Received: *01 June 2016* Accepted: *29 August 2016* Published: *05 October 2016*

### Citation:

*Lombard-Banek C, Moody SA and Nemes P (2016) High-Sensitivity Mass Spectrometry for Probing Gene Translation in Single Embryonic Cells in the Early Frog (Xenopus) Embryo. Front. Cell Dev. Biol. 4:100. doi: 10.3389/fcell.2016.00100*

Keywords: single-cell analysis, mass spectrometry, proteomics, cell differentiation, Xenopus laevis

### INTRODUCTION

Single-cell analysis technologies are essential to understanding cell heterogeneity during normal development and disease. Characterization of the genomes and their expression at the levels of the transcriptome, proteome, and metabolome provides a molecular window into basic cell processes. Single-cell measurements complement traditional cell population-averaging approaches by enabling studies at the level of the building blocks of life, where many critical processes unfold (Raj and van Oudenaarden, 2008; Altschuler and Wu, 2010; Singh et al., 2010; Zenobi, 2013). For example, by studying individual cells, it is possible to ask how cells give rise to all the different types of tissues in the body (stem cells) and specialize for defense (immune cells), communication (neurons), and support (glia). This information in turn lays the foundation for developing diagnostics and treatments that address pressing health concerns, such as the emergence of drug-resistant bacteria, the onset and development of neurodegeneration, and cancer, as well as infections.

Single-cell investigations take advantage of rapid developments in technology. With more than million-fold amplification of DNA and RNA and the commercialization of high-throughput DNA and RNA sequencing, it is now possible to query cell-to-cell differences (Kolisko et al., 2014; Mitra et al., 2014), including but not limited to chromosomal mosaicism in tissues (Vijg, 2014; Gajecka, 2016) and embryonic somatic cells (Liang et al., 2008; Jacobs et al., 2014), establishment of cell heterogeneity in the nervous system (McConnell et al., 2013), and mutations during disease states (Junker and van Oudenaarden, 2015; Kanter and Kalisky, 2015). How gene expression translates into the functionally important proteins and how they then feed back to modulate gene expression is essential to systems cell biology. Multiple reports found differences between transcription and translation (Vogel and Marcotte, 2012; Smits et al., 2014; Peshkin et al., 2015), and transcription is known to be controlled by translational factors during development (Radford et al., 2008); therefore, characterization of the proteome is critical to understanding cell heterogeneity. Translational cell heterogeneity has traditionally been measured by immunohistochemistry and Western blot analyses. Protein-targeted assays have recently gained substantial throughput through the development of mass cytometry (CyTOF), which uses inductively coupled plasma and mass spectrometry (MS) to simultaneously quantify ∼35 different proteins tagged with rare earth elements in thousands of cells. This level of multidimensionality has promoted applications in cell differentiation during erythropoiesis (Bendall et al., 2011), and was recently coupled to laser ablation to spatially survey cell heterogeneity in the tumor environment (Giesen et al., 2014).

Cell heterogeneity has functional implications during embryonic development. Over four decades of innovative embryological manipulations combined with gene-by-gene identifications and functional characterizations in Xenopus have shown that molecular asymmetries in the distribution of maternal mRNAs occur upon fertilization and lead to the formation of the three primary germ layers and the germ line (King et al., 2005; Lindeman and Pelegri, 2010). Recent approaches have defined the spatial and temporal changes of mRNAs and abundant proteins and metabolites in the whole embryo (Flachsova et al., 2013; Wuhr et al., 2014; De Domenico et al., 2015). However, very little is known about how these molecules change over time in individual blastomere lineages as they acquire germ layer and body axis fates. In many animals, mRNAs that are synthesized during oogenesis are sequestered to different cytoplasmic domains (Davidson, 1990; Sullivan et al., 2001), which after fertilization then specify the germ cell lineage (King et al., 2005; Haston and Reijo-Pera, 2007; Cuykendall and Houston, 2010) and determine the anterior-posterior and dorsal-ventral axes of the embryo (Heasman, 2006b; Kenyon, 2007; Ratnaparkhi and Courey, 2007; White and Heasman, 2008; Abrams and Mullins, 2009). For example, in Xenopus several mRNAs are localized to the animal pole region, which later gives rise to the embryonic ectoderm and the nervous system (Grant et al., 2014), whereas localization of VegT mRNA to the vegetal pole specifies endoderm formation (Xanthos et al., 2001), and region-specific relocalization of the Wnt and Dsh maternal proteins govern the dorsal-ventral patterning of the embryo (Heasman, 2006a; White and Heasman, 2008). However, there is abundant evidence that in developing systems not all transcripts are translated into proteins; therefore, analyses of the mRNAs may not reveal the activity state of the cell. 
In fact, different animal blastomeres of the 16-cell Xenopus embryo that are transcriptionally silent can have very different potentials to give rise to neural tissues (Gallagher et al., 1991; Hainski and Moody, 1992; Yan and Moody, 2007), even though they appear to express common mRNAs (Grant et al., 2014; Gaur et al., 2016).

High-resolution MS is the technology of choice for the analysis of the proteome (Aebersold and Mann, 2003; Guerrera and Kleiner, 2005; Walther and Mann, 2010; Zhang et al., 2013). Using millions of cells, contemporary MS enables the discovery (untargeted) characterization of the encoded proteomes of various species in near complete coverage, as recently demonstrated for the yeast (Hebert et al., 2014), mouse (Geiger et al., 2013), and human (Wilhelm et al., 2014). Recent whole-embryo analyses by MS revealed that transcriptomic events are accompanied by gross proteomic and metabolic changes during the development of Xenopus (Sindelka et al., 2010; Vastag et al., 2011; Flachsova et al., 2013; Shrestha et al., 2014; Sun et al., 2014), raising the question of whether these chemical changes are also heterogeneous between individual cells of the embryo at different developmental stages. However, the challenge has been to collect high-quality signal from the minuscule amounts of molecules contained within single blastomeres for analysis. Since different blastomeres in Xenopus are fated to give rise to different tissues (Moody, 1987a,b; Moody and Kline, 1990), elucidating the proteome in individual cells of the embryo holds great potential to elevate our understanding of the cellular physiology that regulates embryogenesis. For a deeper understanding of the mechanisms that govern early embryonic development, it would be transformative to assay the ultimate indicator of gene expression downstream of transcription: the proteome.

To address this cell biology question, we and others have developed platforms to extend MS to single cells (see reviews in Mellors et al., 2010; Rubakhin et al., 2011; Passarelli and Ewing, 2013; Li et al., 2015). For example, targeted proteins have been measured in erythrocytes (Hofstadler et al., 1995; Valaskovic et al., 1996; Mellors et al., 2010). Discovery MS has been used in the study of protein partitioning in the nucleus of the Xenopus laevis oocyte (Wuhr et al., 2015). Recently, we have developed single-cell analysis workflows and custom-built microanalytical capillary electrophoresis (CE) platforms for MS to enable the discovery (untargeted) characterization of gene translation in single embryonic cells (blastomeres). Using single-cell CE, we have measured hundreds to thousands of proteins in blastomeres giving rise to distinct tissues in the frog (X. laevis), such as neural, epidermal, and gut tissues (Moody, 1987a). We have also established quantitative approaches to compare gene translation between these cell types. Quantification of ∼150 different proteins between the blastomeres has captured translational cell heterogeneity in the 16-cell vertebrate embryo (Lombard-Banek et al., 2016a). These results complement known transcriptional cell differences in the embryo, but also provide previously unknown details on how differential gene expression establishes cell heterogeneity during early embryonic development.

In this contribution, we give an overview of the major steps of the single-cell CE-MS workflow (**Figure 1**). Protocols are provided to isolate single cells, extract and process proteins, and use the CE-MS platform to identify and quantify protein expression. Additional details on technology development and validation are available elsewhere (Nemes et al., 2013; Onjiko et al., 2015; Lombard-Banek et al., 2016a,b). These protocols have allowed us to study proteins (Lombard-Banek et al., 2016a,b) and metabolites (Onjiko et al., 2015, 2016) in single blastomeres in 8-, 16-, and 32-cell X. laevis embryos. Additionally, troubleshooting advice (**Table 1**) is provided to help others adopt single-cell MS toward the systems biology characterization of molecular processes in cells and limited amounts of specimens.

### MATERIALS AND EQUIPMENT

### Single Blastomere Dissection


### Protein Extraction, Enzymatic Digestion, and Quantification


### CE-ESI-MS Analysis


### PROCEDURES

### Sample Preparation

The goal of sample preparation is to extract proteins from single cells and process the proteins for MS analysis. The workflow (**Figure 1**) starts with the identification of blastomeres in the embryo in reference to established cell fate maps (Moody, 1987a,b; Moody and Kline, 1990; Lee et al., 2012) and differences in cell size and pigmentation. Cells are microdissected using sharp forceps and collected into individual microcentrifuge tubes. **Figure 2** shows the dissection of the V11 cell. Next, isolated blastomeres are lysed using chemical (detergent) and physical (ultrasonication) methods, and their proteins are extracted. The proteins are processed via standard bottom-up proteomics protocols (Zhang et al., 2013), whereby reduction, alkylation, and enzymatic digestion are performed to convert proteins into peptides that are more readily analyzable by MS.

### Single Blastomere Dissection and Isolation

As detailed protocols are available on the identification and dissection of blastomeres (Moody, 2012; Grant et al., 2013), only a brief summary of the major steps follows.

	- 2% cysteine solution
	- 100% Steinberg solution (SS)
	- 50% Steinberg solution (SS)
	- Sterile Pasteur pipet
	- Petri dish filled with 2% agarose (w/v in 100% SS)
	- Sharp forceps
	- Hair loop
	- 0.6 mL microcentrifuge tubes
	- a. Add 4× volume of the cysteine solution to the embryos (**Table 2**) and gently swirl the solution for ∼4 min.
	- b. Once the embryos are free of the jelly coat, immediately wash them with 100% SS (**Table 2**) 4 times for 2 min each.

### TABLE 2 | Solutions and their uses.


	- a. Transfer the selected embryos to a 60 mm Petri dish coated with 2% agarose and filled with 50% SS.
	- b. Place the embryo of interest in a groove made in the agarose coating.
	- c. Orient the embryo for easy handling of the cell of interest using a hair loop.
	- d. Remove the vitelline membrane gently using sharp forceps. During this step, take care not to damage the embryo.
	- e. Hold the embryo using sharp forceps on the opposite side of the cell of interest, and gently pull on either side to isolate the cell.
	- f. Transfer isolated cells using a sterile Pasteur pipet into a micro-centrifuge tube.

### Protein Extraction and Enzymatic Digestion

	- Lysis buffer
	- Acetone chilled to −20 °C
	- 50 mM ammonium bicarbonate
	- 1 M dithiothreitol
	- 1 M iodoacetamide
	- Sonication bath (e.g., Branson CPX2800)
	- a. Remove the excess 50% SS from around the cell. Take care not to disrupt the cell.
	- b. Add 10 µL of lysis buffer (**Table 2**) and vortex for ∼30 s.
	- c. Sonicate for ∼5 min, then vortex for ∼30 s. Repeat this step 3 times.
	- d. (Optional) Add protease inhibitor to the lysis buffer to minimize protein degradation during this step.
	- a. Add 0.5 µL of 1 M dithiothreitol to the sample, and incubate for 20–30 min at 60 °C.
	- b. Add 1 µL of 1 M iodoacetamide and incubate for 15 min in the dark at room temperature.
	- c. Quench the reaction by adding 0.5 µL of 1 M dithiothreitol.
	- a. Add to the cell extract a volume of pure acetone that is 5 times that of the cell extract (∼50 µL), and incubate at −20 °C overnight.
	- b. Recover the precipitated proteins by centrifugation at 10,000 × g for 10 min at 4 °C.
	- c. Remove the supernatant.
	- d. Dry the pellet using a vacuum concentrator.
	- e. (Optional) Store the protein pellet at −20 or −80 °C for up to 3 months.
	- a. Reconstitute the protein pellet in 50 mM ammonium bicarbonate.
	- b. Add 0.3 µL of 0.5 µg/µL trypsin (trypsin in 1 mM HCl), equivalent to a protease/protein ratio of ∼1/50.
	- c. Incubate overnight at 37 °C.
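The reagent amounts quoted in the steps above can be cross-checked with simple dilution arithmetic. The following is a minimal sketch, not part of the published protocol: the helper name `final_conc_mM` is hypothetical, and the calculation assumes that the extract volume before reduction equals the 10 µL of lysis buffer (the small volume of the cell itself is neglected).

```python
# Back-of-the-envelope check of the reagent amounts quoted in the protocol.
# Volumes and stock concentrations come from the steps above; the helper
# name and the 10 µL starting volume are illustrative assumptions.

def final_conc_mM(stock_M: float, added_uL: float, start_uL: float) -> float:
    """Final concentration (mM) after adding `added_uL` of stock to `start_uL`."""
    return stock_M * 1000.0 * added_uL / (start_uL + added_uL)

lysate_uL = 10.0                                    # lysis buffer volume
dtt_mM = final_conc_mM(1.0, 0.5, lysate_uL)         # reduction step
iaa_mM = final_conc_mM(1.0, 1.0, lysate_uL + 0.5)   # alkylation step
acetone_uL = 5 * lysate_uL                          # 5x the extract volume

trypsin_ug = 0.3 * 0.5                   # 0.3 µL of 0.5 µg/µL trypsin
implied_protein_ug = trypsin_ug * 50     # protein mass implied by a ~1/50 ratio

print(f"DTT ~{dtt_mM:.0f} mM, iodoacetamide ~{iaa_mM:.0f} mM")
print(f"acetone ~{acetone_uL:.0f} µL, trypsin {trypsin_ug:.2f} µg, "
      f"implied protein ~{implied_protein_ug:.1f} µg")
```

Under these assumptions, the ∼1/50 protease/protein ratio corresponds to several micrograms of protein per blastomere, which is a useful sanity check when the protocol is scaled to smaller cells.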

### Quantification

The presented technology is compatible with well-established protocols in quantitative proteomics. Stable isotope labeling with amino acids in cell culture (SILAC) allows barcoding of proteins with isotopic labels for multiplexed quantification (Geiger et al., 2013). Label-free quantification (LFQ) is an alternative strategy whereby peptide signal abundance is used as a proxy for protein concentration. We have recently demonstrated LFQ for single blastomeres of neural fates in the 16-cell embryo using the protocol presented here (Lombard-Banek et al., 2016b). Alternatively, relative quantification can be performed using designer mass tags. In this approach, proteins are digested to peptides and the peptides are barcoded with isotopic labels that can be distinguished by high-resolution MS. Multiple protocols allow for quantifying protein expression at the level of peptides in high throughput via multiplexing, including tandem mass tags (TMT; Thompson et al., 2006; McAlister et al., 2014), isobaric tags for relative and absolute quantitation (iTRAQ; Ross et al., 2004), and di-Leu (Xiang et al., 2010; Frost and Li, 2016). We have recently downscaled TMT-based multiplexed quantification to the protein content of single blastomeres using the following strategy (adapted from the vendor), which we then used to compare protein expression between the D11, V11, and V21 cells (Lombard-Banek et al., 2016a) that are fated to give rise to different types of tissues (neural, epidermal, and hindgut, respectively):


### Sample Analysis Using CE-ESI-MS

Peptides are analyzed using a custom-built CE-ESI-MS platform (Nemes et al., 2013; Onjiko et al., 2015; Lombard-Banek et al., 2016a). Instructions regarding the construction and operation of the platform are available elsewhere (Nemes et al., 2013). Schematics of the CE-ESI-MS instrument are shown in **Figure 3**. CE is selected to electrophoretically separate peptides in a fused silica capillary by applying a voltage difference across the capillary ends. As a general rule, peptides with smaller size and higher charge state migrate faster through the capillary. A high-resolution mass spectrometer is used to sequence peptides via data-dependent acquisition. In this approach, eluting peptides are detected based on single-stage (full) scans (MS<sup>1</sup>) and are sequenced by tandem MS (MS<sup>2</sup> scans) using collision-induced dissociation (CID), higher-energy collisional dissociation (HCD), or other fragmentation technologies. The tandem mass spectra reveal sequence information for the peptides, as exemplified for LGLGLELEA in **Figure 4**. During quantification experiments, the TMT labels also dissociate from the peptide, and the relative abundance of these TMT signals serves as a quantitative measure of protein abundance (**Figure 4C**, right panel).
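To illustrate how a tandem mass spectrum encodes sequence, the sketch below computes the monoisotopic mass of the example peptide LGLGLELEA and its singly charged b- and y-type fragment ions. It is an illustration only: the function names are hypothetical, the peptide is treated as unlabeled (no TMT tag or other modification), and only the four residues occurring in this peptide are tabulated.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {"L": 113.08406, "G": 57.02146, "E": 129.04259, "A": 71.03711}
WATER = 18.010565    # added once per intact peptide (N- and C-termini)
PROTON = 1.007276    # charge carrier in positive-ion mode

def peptide_mass(seq: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def mz(mass: float, charge: int) -> float:
    """m/z of the protonated ion [M + zH]z+."""
    return (mass + charge * PROTON) / charge

def b_ions(seq: str) -> list:
    """Singly charged b ions (N-terminal fragments b1 .. b(n-1))."""
    return [sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON
            for i in range(1, len(seq))]

def y_ions(seq: str) -> list:
    """Singly charged y ions (C-terminal fragments y1 .. y(n-1))."""
    return [sum(RESIDUE_MASS[aa] for aa in seq[-i:]) + WATER + PROTON
            for i in range(1, len(seq))]

peptide = "LGLGLELEA"
m = peptide_mass(peptide)
print(f"M = {m:.4f} Da, [M+2H]2+ = {mz(m, 2):.4f}")
for i, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{i} = {b:9.4f}   y{i} = {y:9.4f}")
```

Consecutive b (or y) ions differ by the mass of one residue, which is how a search engine reads the sequence from the fragment ladder. Note that leucine and isoleucine share the same mass and cannot be distinguished this way.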

### CE-ESI-MS Measurements

	- a. Flush the capillary with background electrolyte (25% acetonitrile with 1 M formic acid).
	- b. Flush the sheath capillary with electrospray solution (50% methanol with 0.1% formic acid).
	- c. Turn on the electronics (high voltage power supplies, syringe pumps, mass spectrometer, etc.) for ∼30 min to stabilize operation.
	- a. Transfer the capillary into the background electrolyte vial.
	- b. Deposit ∼1 µL of sample onto the sample microvial (see **Figure 3**).
	- c. Transfer the capillary from the BGE vial to the sample vial.
	- d. Elevate the injection stage by ∼15 cm for ∼3 min to siphon ∼20 nL of the sample into the CE capillary.
	- e. Lower the injection stage to level the capillary inlet to the outlet, and transfer the capillary inlet end into the BGE vial.
	- f. Apply ∼10,000 V to the background electrolyte vial to start electrophoretic separation of the peptides.
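The siphoning step above can be rationalized with the Hagen-Poiseuille equation for pressure-driven flow through a capillary. The sketch below is an order-of-magnitude estimate only: the capillary inner diameter (40 µm) and length (1 m) are assumed values that are not stated in this protocol, and the sample is assumed to have the density and viscosity of water.

```python
import math

def siphon_volume_nL(height_m: float, time_s: float,
                     inner_diameter_m: float, length_m: float,
                     density: float = 1000.0,    # kg/m^3, water (assumed)
                     viscosity: float = 1.0e-3,  # Pa*s, water (assumed)
                     g: float = 9.81) -> float:
    """Volume (nL) siphoned into a capillary by a height difference."""
    pressure = density * g * height_m                     # hydrostatic pressure, Pa
    flow = (math.pi * inner_diameter_m ** 4 * pressure
            / (128.0 * viscosity * length_m))             # Poiseuille flow, m^3/s
    return flow * time_s * 1e12                           # 1 m^3 = 1e12 nL

# 15 cm elevation for 3 min, with an assumed 40 µm i.d. x 1 m capillary.
volume = siphon_volume_nL(height_m=0.15, time_s=180.0,
                          inner_diameter_m=40e-6, length_m=1.0)
print(f"estimated injection volume: {volume:.1f} nL")
```

With these assumed dimensions, the estimate falls in the same range as the ∼20 nL quoted in step d. Because the flow scales with the fourth power of the inner diameter, small differences in capillary bore change the injected volume substantially.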

FIGURE 3 | Schematics of the high-sensitivity proteomic analyzer. The platform integrates microanalytical capillary electrophoresis (CE), electrospray ionization (ESI), and high-resolution tandem mass spectrometry (HRMS<sup>2</sup> ). Scale bar = 150µm (ESI), 1.5 mm (CE panel). Key: HVPS, high-voltage power supply. Figure adapted with permission from Lombard-Banek et al. (2016a).


CE current <8µA to prevent/minimize electrolysis or solvent heating. Monitor the CE current and adjust the separation voltage as necessary. For instructions on how to measure the current, refer to Nemes et al. (2013).

i. Start MS acquisition with data-dependent acquisition as specified by the mass spectrometer vendor. For example, we use the following settings for a quadrupole-orbitrap-linear ion trap mass spectrometer (Fusion, Thermo Scientific): MS<sup>1</sup> analyzer resolution (orbitrap), 60,000 FWHM; m/z scan range, 350–1600; injection time, 100 ms; precursor ion selection window, 0.8 Da in the quadrupole cell; fragmentation, HCD with 30% normalized energy in the multipole cell using nitrogen collision gas; MS<sup>2</sup> analyzer rate, rapid scan; MS<sup>2</sup> maximum injection time, 50 ms.

### Protein Identification

Last, peptide sequences are compared to the proteome of the specimen (X. laevis here) to identify proteins (see **Figure 4**). This step is facilitated by readily available proteomes from SwissProt, UniProt, and experimentally determined RNA expression (Wang et al., 2012; Smits et al., 2014; Wuhr et al., 2014). Well-established bioinformatics software packages are used to process raw mass spectrometric data. For example, Proteome Discoverer (Thermo Scientific), ProteinScape (Bruker Daltonics), and MaxQuant (Cox and Mann, 2008) interpret MS–tandem-MS datasets by executing well-established search engines, such as SEQUEST (Eng et al., 1994), Mascot (Perkins et al., 1999), and Andromeda (Cox et al., 2011). The general strategy of bottom-up proteomics has recently been reviewed in detail (Sadygov et al., 2004; Cox et al., 2011; Zhang et al., 2013). We typically acquire tens of thousands to a million mass spectra, which identify 2000–4000 peptides in single blastomeres in the 16-cell embryo. These data allow us to identify ∼1700 protein groups and quantify hundreds of proteins between the D11, V11, and V21 cells.

### Anticipated Results

The CE-ESI-MS platform can be used to identify gene translational differences between cells. As shown in **Figure 5**, we have used this approach to assess protein differences between blastomeres of the 16-cell X. laevis embryo (Lombard-Banek et al., 2016a,b). Cell types with different tissue developmental fates were analyzed: the midline dorsal-animal cell (named D11) develops mainly into the retina and brain, the midline ventral-animal cell (named V11) gives rise primarily to the head and trunk epidermis, and the midline ventral-vegetal cell (named V21) is the primary precursor of the hindgut. The approach allowed the identification of 1709 protein groups (<1% false discovery rate, FDR) from ∼20 ng of protein digest, corresponding to ∼0.2% of the total protein content of the blastomere (Lombard-Banek et al., 2016a). Many of the identified proteins are known to be involved in different cell fates. For example, Geminin (Gem) and Isthmin (Ism) were detected in the D11 cells in our measurements, and these proteins are involved in brain development (Pera et al., 2002; Seo et al., 2005), which is the stereotypical fate of D11 cells (Moody, 1987a). Multiplexed quantification by TMTs provided comparative evaluation for 152 non-redundant protein groups between the cell types (**Figure 5B,** left), including many that were significantly differentially expressed between the cell types (p < 0.05, fold change ≥1.3). We have also performed label-free quantification (LFQ) to compare D11 cells that were isolated at a similar developmental phase of the 16-cell X. laevis embryos (**Figure 5A**). A Pearson correlation analysis showed similar expression levels for the majority of proteins between the D11 cells (see proteins along linear fits). The study also found 25 proteins that were differentially accumulated in the respective cells, suggesting highly variable expression (**Figure 5B,** right; Lombard-Banek et al., 2016b). These data on translational cell heterogeneity complement transcriptomic information on cell differences (Flachsova et al., 2013), but also provide new insights into how differential gene expression sets up different cell fates and the major developmental axes of the early embryo.
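The two comparisons described above, a Pearson correlation between cells and a combined significance and fold-change filter, can be sketched as follows. This is a generic illustration, not the analysis code used in the cited studies: the function names are hypothetical, the abundance values are invented, and the fold change is treated symmetrically (up or down), which is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def is_differential(abundance_a, abundance_b, p_value,
                    min_fold=1.3, alpha=0.05):
    """True if a protein passes both the p-value and fold-change cutoffs."""
    fold = max(abundance_a / abundance_b, abundance_b / abundance_a)
    return p_value < alpha and fold >= min_fold

# Hypothetical log-scale abundances of five proteins in two D11 cells.
cell_1 = [5.1, 6.3, 7.0, 8.2, 9.4]
cell_2 = [5.0, 6.5, 6.8, 8.4, 9.1]
print(f"Pearson r = {pearson_r(cell_1, cell_2):.3f}")

# Hypothetical reporter-ion intensities and p-value for one protein.
print(is_differential(abundance_a=150.0, abundance_b=100.0, p_value=0.01))
```

Requiring both criteria guards against calling a protein differential on the basis of a small but statistically significant change, or a large but poorly reproducible one.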

### CONCLUSIONS

High-sensitivity MS enables the identification and quantification of a sufficiently large number of proteins to study cell and developmental processes at the level of individual cells. Advances in sampling (smaller single cells), protein processing, microanalytical MS, and bioinformatics have enabled the discovery characterization of hundreds to thousands of proteins in single cells. Unbiased measurement of protein translation by MS complements genomic and transcriptomic information, essentially laying down the foundation of the molecular characterization of cell heterogeneity. Knowledge of genomic, transcriptomic, proteomic, and metabolomic processes paves the way to understanding how differential gene expression establishes cell heterogeneity during normal development and disease states.

### AUTHOR CONTRIBUTIONS

CL, SM, and PN wrote the manuscript.

### FUNDING

This research was supported by National Science Foundation Grant DBI-1455474 (to PN and SM) and the George Washington University Start-Up Funds (to PN) and Columbian College Facilitating Funds (to PN and SM). The content of the presented work was solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Lombard-Banek, Moody and Nemes. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Cell Cycle and Cell Size Dependent Gene Expression Reveals Distinct Subpopulations at Single-Cell Level

Soheila Dolatabadi<sup>1</sup>, Julián Candia<sup>2,3</sup>\*, Nina Akrap<sup>1</sup>, Christoffer Vannas<sup>1</sup>, Tajana Tesan Tomic<sup>1</sup>, Wolfgang Losert<sup>3</sup>, Göran Landberg<sup>1</sup>, Pierre Åman<sup>1</sup> and Anders Ståhlberg<sup>1</sup>\*

<sup>1</sup> Department of Pathology and Genetics, Sahlgrenska Cancer Center, Institute of Biomedicine, University of Gothenburg, Gothenburg, Sweden, <sup>2</sup> Center for Human Immunology, Autoimmunity and Inflammation, National Institutes of Health, Bethesda, MD, USA, <sup>3</sup> Department of Physics, University of Maryland, College Park, MD, USA

### Edited by:

Xinghua Pan, Yale University, USA

### Reviewed by:

David Loose, University of Texas Medical School, USA Haiying Zhu, Second Military Medical University, China

### \*Correspondence:

Julián Candia julian.candia@nih.gov Anders Ståhlberg anders.stahlberg@gu.se

### Specialty section:

This article was submitted to Genomic Assay Technology, a section of the journal Frontiers in Genetics

Received: 07 June 2016 Accepted: 06 January 2017 Published: 25 January 2017

### Citation:

Dolatabadi S, Candia J, Akrap N, Vannas C, Tesan Tomic T, Losert W, Landberg G, Åman P and Ståhlberg A (2017) Cell Cycle and Cell Size Dependent Gene Expression Reveals Distinct Subpopulations at Single-Cell Level. Front. Genet. 8:1. doi: 10.3389/fgene.2017.00001

Cell proliferation includes a series of events that is tightly regulated by several checkpoints and layers of control mechanisms. Most studies have been performed on large cell populations, but detailed understanding of cell dynamics and heterogeneity requires single-cell analysis. Here, we used quantitative real-time PCR, profiling the expression of 93 genes in single-cells from three different cell lines. Individual unsynchronized cells from three different cell lines were collected in different cell cycle phases (G0/G1 – S – G2/M) with variable cell sizes. We found that the total transcript level per cell and the expression of most individual genes correlated with progression through the cell cycle, but not with cell size. By applying the random forests algorithm, a supervised machine learning approach, we show how a multi-gene signature that classifies individual cells into their correct cell cycle phase and cell size can be generated. To identify the most predictive genes we used a variable selection strategy. Detailed analysis of cell cycle predictive genes allowed us to define subpopulations with distinct gene expression profiles and to calculate a cell cycle index that illustrates the transition of cells between cell cycle phases. In conclusion, we provide useful experimental approaches and bioinformatics to identify informative and predictive genes at the single-cell level, which opens up new means to describe and understand cell proliferation and subpopulation dynamics.

Keywords: cell cycle, cell size, single-cell gene expression, machine learning, variable selection, random forests, cell subpopulations, cell transitions

### INTRODUCTION

Cell proliferation is a tightly organized process that involves cell division and cell growth, where cell division can be divided into distinct cell cycle phases: G0, G1, S, G2, and M. Transitions through the phases are regulated by several layers of checkpoints and control mechanisms (Baserga, 1981; Lubischer, 2007; Bertoli et al., 2013; Grant et al., 2013). The molecular processes behind cell cycle progression have been dissected by numerous morphological studies on live or fixed single cells using a plethora of techniques to visualize components and processes during cell division. Many more investigations have been made on cells, sorted according to size, or artificially arrested at various cell cycle checkpoints. However, most of our knowledge about cell proliferation comes from studies that average data from large and mixed cell populations. Such data are only indirectly related to quantitative changes in cells at different states of division and growth. Analysis at the single-cell level can overcome most of these limitations. Detailed single-cell analyses have shown that transcript numbers fluctuate in individual cells, even in seemingly homogeneous populations (Raj et al., 2006), and that features of the typical or average cell in a population cannot be deduced from measurements on cell population samples (Bengtsson et al., 2005). Variations in transcript numbers allow cells to produce unique responses to internal and external cues that lead to defined paths of cell proliferation and differentiation (Levine et al., 2013). Recent development of single-cell analytical platforms opens up new possibilities to define the molecular profiles of cells at different states and to determine the importance of cell heterogeneity on cellular processes and cell fate decisions (Kalisky et al., 2011; Ståhlberg et al., 2011b; Sanchez and Golding, 2013; Shapiro et al., 2013).

Here, we employed single-cell gene expression profiling to describe the dynamic transition between cell proliferative states in three different cell lines using a panel consisting of 93 marker genes. The selected genes are related to cell proliferation, cell cycle regulation, TP53 function, stemness, differentiation, cell signaling, and housekeeping functions (for gene details, see Table S1). We assessed cell division by collecting cells in the G0/G1, S, and G2/M phases, and cell growth by selecting small and large cells in each cell cycle phase. In contrast to cell population data, single-cell data are reported as transcripts per cell without any further normalization (Ståhlberg et al., 2013), allowing total transcript levels to be determined and compared between cell states (Sanchez and Golding, 2013). To determine whether, and to what degree, the gene expression profiles of individual cells were associated with cell division and growth, we applied the random forests algorithm (Hastie et al., 2009; Gareth et al., 2013), which is a supervised machine learning approach. By applying variable selection with a recursive feature elimination (RFE) scheme (James et al., 2013; Candia et al., 2015), we were able to identify the genes with the strongest association with cell proliferation and to define distinct subpopulations. Finally, we calculated a cell cycle index based on the most predictive genes that allowed us to visualize and biologically interpret cell cycle progression.

### MATERIALS AND METHODS

### Cell Culture

All cell lines were cultured at 37 °C and in 5% CO<sub>2</sub>. The myxoid liposarcoma cell line MLS 402-91 was cultured in RPMI 1640 GlutaMAX medium supplemented with 10% fetal bovine serum, 100 U/mL penicillin, and 100 µg/mL streptomycin (all Life Technologies). Cells were passaged with 0.25% trypsin and 0.5 mM EDTA (both Life Technologies). The breast cancer cell line MCF7 was cultured in DMEM medium supplemented with 2 mM L-glutamine, 1% penicillin/streptomycin (all PAA Laboratories), 10% fetal bovine serum (Lonza), and 1% non-essential amino acids (Sigma-Aldrich). MCF7 cells were passaged with 0.05% trypsin-EDTA (PAA Laboratories). Mesenchymal stem cells (MSC) derived from human embryonic stem cells (hES-MP 002.5, Takara Bio), were cultured in DMEM GlutaMAX, supplemented with 10% fetal bovine serum, 100 U/mL penicillin, 100 µg/mL streptomycin, and 4 ng/mL fibroblast growth factor 2 (all Life Technologies) as described (Karlsson et al., 2009). MSCs were passaged with TrypLE Select (Life Technologies). Dissociation enzyme inactivation was performed using complete medium, containing fetal bovine serum for all cell lines. Cell cultures were confirmed as mycoplasma-free using the Mycoplasma PCR Detection Kit (Applied Biological Materials).

### Fluorescence-Activated Cell Sorting

Vybrant DyeCycle violet stain (Life Technologies) and CellVue Claret far red dye (Sigma-Aldrich) were used to stain genomic DNA and membrane lipids, respectively. A suspension of 10<sup>6</sup> cells in 1 mL Hanks' balanced salt solution (Life Technologies) was first stained with Vybrant DyeCycle violet stain (5 µM, final concentration) at 37 °C for 30 min. Then, 1 mL CellVue Claret far red dye diluted in diluent C (Sigma-Aldrich, 3.3 µM, final concentration) was added, followed by an incubation step at 37 °C for 5 min. Staining was inactivated by complete medium and the cells were finally resuspended in Hanks' balanced salt solution.

G1/S cell cycle arrest was performed using a double thymidine block (Sigma-Aldrich). Thymidine (2 mM, final concentration) was added to 25–30% confluent cells for 18 h. Cells were then released by addition of fresh medium without thymidine. Finally, after 9 h, cells were re-exposed to thymidine for an additional 17 h. Complete cell cycle arrest was confirmed by Vybrant DyeCycle violet staining followed by fluorescence-activated cell sorting analysis.

Cell aggregates were removed by filtering with a 40 µm cell strainer (BD Biosciences) and single cells were sorted with a BD FACSAria II (BD Biosciences) into 96-well plates (Life Technologies), each well containing 5 µL of 1 mg/mL bovine serum albumin (Thermo Scientific; Svec et al., 2013). Collected single cells were frozen on dry ice and kept at −80 °C until subsequent analysis. Gating strategies for cell size and cell cycle phase are shown in Figure S1. The cell size/cell volume was estimated from the average CellVue Claret far red signal, assuming a spherical cell shape. All single cells from each biological condition were collected from an individual culture to minimize batch-to-batch differences, as described (Wills et al., 2013).

### Single-Cell Gene Expression Profiling

Reverse transcription was performed with SuperScript III (Life Technologies). Lysed single cells, 0.5 mM dNTPs (Sigma-Aldrich), 5.0 µM oligo(dT)<sub>12–18</sub>, and 5.0 µM random hexamers (both Life Technologies) were incubated in 6.5 µL at 65 °C for 5 min. Next, 50 mM Tris–HCl, 75 mM KCl, 3 mM MgCl<sub>2</sub>, 5 mM dithiothreitol, 10 U RNaseOut, and 50 U SuperScript III (all Life Technologies) were added to a final volume of 10 µL. Final reaction concentrations are shown. Reverse transcription was performed at 25 °C for 5 min, 50 °C for 60 min, 55 °C for 10 min, and terminated by heating to 70 °C for 15 min. All samples were diluted to 30 µL with water.

Targeted cDNA preamplification was performed with the iQ Supermix (BioRad) in 50 µL reactions. Each reaction contained 10 or 15 µL diluted cDNA and 40 nM of each primer. Primer sequences are shown in Table S1. Optimization and validation of well-performing qPCR assays and preamplification are described elsewhere (Ståhlberg and Bengtsson, 2010; Andersson et al., 2015). The temperature profile was 95 °C for 3 min followed by 20 cycles of amplification (95 °C for 20 s, 60 °C for 3 min, and 72 °C for 20 s). All preamplified samples were chilled on ice and diluted 1:20 in TE-buffer (pH 8.0; Life Technologies). Preamplification was performed as two separate reactions for each single cell, each containing half of the assays. The products of the two reactions were pooled after preamplification. Reproducibility and efficiency of the preamplification were evaluated by standard curve analysis using cDNA from MLS 402-91 (Figure S2). The overall preamplification efficiency was assessed using five different cDNA concentrations (n = 4) generated from 0.04, 0.2, 1, 5, and 25 ng total RNA, respectively. The average cycle of quantification value of all genes expressed in four or more dilutions was used to determine the overall preamplification efficiency.
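For readers unfamiliar with standard curve analysis, the sketch below shows the conventional way an amplification efficiency is derived from a dilution series: the cycle of quantification (Cq) is regressed on log<sub>10</sub> of the input amount, and the efficiency is computed from the slope as 10<sup>−1/slope</sup> − 1. This is a generic illustration with hypothetical function names and synthetic Cq values; it is not the exact procedure or data behind Figure S2.

```python
import math

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def amplification_efficiency(input_amounts, cq_values):
    """Efficiency from a standard curve; 1.0 means perfect doubling per cycle."""
    log_input = [math.log10(v) for v in input_amounts]
    slope, _ = linear_fit(log_input, cq_values)
    return 10 ** (-1.0 / slope) - 1.0

# Input amounts from the text (ng total RNA); the Cq values are synthetic
# and constructed to correspond to perfect doubling in every cycle.
rna_ng = [0.04, 0.2, 1.0, 5.0, 25.0]
cq = [22.0 - math.log2(v) for v in rna_ng]

print(f"efficiency = {amplification_efficiency(rna_ng, cq):.3f}")
```

A slope near −3.32 cycles per tenfold dilution corresponds to an efficiency of 100%; shallower or steeper slopes indicate over- or underestimation of the efficiency, for example through inhibition or pipetting error.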

The BioMark real-time PCR system with 96 × 96 dynamic arrays (Fluidigm) was used for gene expression profiling according to the manufacturer's instructions. The 5 µL sample reaction mixture contained 1X SsoFast EvaGreen Supermix (BioRad), 1X ROX (Life Technologies), 1X GE Sample Loading Reagent (Fluidigm), and 2 µL diluted preamplified cDNA. The 5 µL primer reaction contained 1X Assay Loading Reagent (Fluidigm) and 5 µM of each primer. Preamplification and qPCR were performed with the same primers (Table S1). The chip was first primed with the NanoFlex IFC Controller (Fluidigm) and then loaded with the sample and primer reaction mixtures. The cycling program was 3 min at 95 °C for polymerase activation, followed by 40 cycles of amplification (96 °C for 5 s and 60 °C for 20 s). After qPCR, all samples were analyzed by melting curve analysis (60–95 °C with 0.33 °C per s increment). All assays were confirmed to generate the correct PCR product length by agarose gel electrophoresis. Data pre-processing was performed with GenEx (v.6, MultiD) as described (Ståhlberg et al., 2013). Briefly, samples with aberrant melting curves were removed and cycle of quantification values larger than 25 were replaced with 25. Data were transformed to relative quantities assuming that a cycle of quantification value of 25 equals one molecule. Missing data were replaced with 0.5 molecules. All data were calculated per cell if not stated otherwise. For all data analysis we assumed 100% PCR efficiency. The chosen cut-off value and the applied PCR efficiency had a negligible effect on downstream analysis.
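The pre-processing rules stated above translate into a short conversion from Cq values to relative transcript numbers. The following is a minimal sketch of those rules, not the GenEx implementation: the function name is hypothetical and missing values are represented as `None`.

```python
def cq_to_molecules(cq, cutoff=25.0, efficiency=1.0, missing_value=0.5):
    """Convert a cycle of quantification (Cq) value to relative molecules.

    Rules taken from the text: Cq values above the cut-off are set to the
    cut-off, a Cq equal to the cut-off corresponds to one molecule, missing
    data are set to 0.5 molecules, and PCR efficiency is assumed to be 100%.
    """
    if cq is None:                      # no amplification detected
        return missing_value
    cq = min(cq, cutoff)                # cap late Cq values at the cut-off
    return (1.0 + efficiency) ** (cutoff - cq)

# Hypothetical Cq values for one gene measured in five single cells.
for cq in [18.0, 22.5, 25.0, 27.3, None]:
    print(cq, "->", round(cq_to_molecules(cq), 2))
```

Because each cycle corresponds to a doubling at 100% efficiency, a Cq of 20 maps to 2<sup>5</sup> = 32 molecules on this relative scale.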

### Immunofluorescence

MLS 402-91 and MCF7 cells were seeded on Millicell EZ SLIDE 4-well-glasses (Merck Millipore). After 24 h, cells were rinsed with phosphate-buffered saline (Life Technologies), fixed in 3.7% formaldehyde (Sigma-Aldrich) for 5 min, washed three times with phosphate-buffered saline, and permeabilized in AB buffer (phosphate-buffered saline supplemented with 1% bovine serum albumin and 0.5% Triton X, Sigma-Aldrich). Cells were stained with anti-MCM6 antibody (HPA004818 rabbit, diluted 1:50, Sigma-Aldrich). Detection was performed with a Cy3 conjugated secondary antibody (PA43004, diluted 1:1000, GE Healthcare Life Sciences). Slides were mounted using Prolong Gold anti-fade with 4′,6-diamidino-2-phenylindole (Life Technologies). Cellular fluorescence was imaged using a Zeiss Axioplan 2 microscope (Zeiss). Relative protein level per cell was estimated using Volocity 3D Image Analysis Software (PerkinElmer).

### Single-Cell Data Analysis and Statistics

Principal component analysis, hierarchical clustering, and Kohonen self-organizing maps were performed in GenEx software using autoscaled gene expression data as described (Ståhlberg et al., 2011a). Ward's algorithm and the Euclidean distance measure were applied for hierarchical clustering. Parameters for Kohonen self-organizing maps were: 3–4 × 1 map, 2 neighbors, 0.4 learning rate, and 150 iterations. The resulting clusters were not sensitive to parameter choice.
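For readers unfamiliar with autoscaling, the sketch below shows the usual definition: each gene is mean-centered and divided by its standard deviation across cells, so that all genes contribute on a comparable scale. The function name is hypothetical, and the use of the sample standard deviation is an assumption of this sketch; GenEx may differ in such details.

```python
import statistics

def autoscale(matrix):
    """Column-wise z-scores for a cells x genes matrix (list of lists).

    Genes with zero variance are set to 0.0 to avoid division by zero.
    """
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(row, means, sds)]
        for row in matrix
    ]

# Hypothetical expression values: three cells (rows), two genes (columns).
scaled = autoscale([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(scaled)
```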

A random forests algorithm was implemented to pairwise classify different cell cycle phases and cell sizes. Two cell states were compared at a time. Random forests are collections of decision trees. At the top-most level of each decision tree, all genes are scanned one by one to determine the best gene and the corresponding gene expression threshold that optimally partition the original cells into two branches. The optimal partition is algorithmically determined by minimizing a quality function such as the cross-entropy or the Gini index (Hastie et al., 2009; Gareth et al., 2013), which aims to increase the class purity of each branch. Subsequently, each branch is considered for further separation based on the expression values of other genes. The process continues until the full decision tree is grown in such a manner that each of its leaves, i.e., the endpoint of each branch, contains cells of a single class. To generate robust solutions and avoid overfitting, additional parameters are usually incorporated into the model in order to either limit the depth of the tree (or, alternatively, the size of the nodes that can undergo further branching) or to prune the tree. In this context, a popular technique is to generate a so-called random forest that contains a large number of partially decorrelated trees built from bootstrapped samples of the original data set. Compared to single decision trees, random forests are less intuitive, since they lack a direct visualization of the structure and relations among predictor genes, but random forests are more powerful and robust. In this study, we implemented a random forest analysis using the randomForest (v4.6-10) package in R. This implementation uses the decrease of Gini index impurity as a splitting criterion and selects the splitting predictor from a subset of predictors, randomly chosen at each split. Each random forest consisted of 1000 trees. For each random forest we scanned the size of the predictor subset in the full range from one to the total number of predictors and selected the smallest subset that minimized the out-of-bag error. The so-called out-of-bag error is calculated from predictions on out-of-bag instances, i.e., those cells that have not been used in building a particular tree. Moreover, in order to assess model variance, for each class comparison we generated ensembles consisting of 100 different random forests. Only genes with detectable expression in at least 50% of the cells in at least one cell class were included in our analysis. We report averages and standard deviations calculated over these random forest ensembles throughout.
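The split selection at a single tree node can be sketched as follows. This is a simplified illustration of choosing the gene and threshold that minimize the weighted Gini index, not the randomForest implementation used in the study; it omits bootstrapping and the random predictor subset, and the gene names and expression values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Scan midpoints between distinct values of one gene.

    Returns (threshold, weighted_gini) for the best split `value <= threshold`;
    threshold is None if no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

def best_gene(expression, labels):
    """Pick the gene whose best split gives the lowest weighted Gini index."""
    results = {gene: best_threshold(vals, labels) for gene, vals in expression.items()}
    gene = min(results, key=lambda g: results[g][1])
    return gene, results[gene][0], results[gene][1]

# Hypothetical data: six cells, two genes.
labels = ["G1", "G1", "G1", "G2M", "G2M", "G2M"]
expression = {
    "geneA": [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],   # separates the classes cleanly
    "geneB": [1.0, 5.0, 2.0, 6.0, 3.0, 4.0],   # overlapping expression
}
split = best_gene(expression, labels)
print(split)
```

In a full random forest, this scan is repeated at every node on a bootstrapped sample of cells and on a randomly chosen subset of genes, which is what decorrelates the individual trees.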

Cell classification performance can be quantified by several measures. In addition to the out-of-bag error, another measure is the balanced accuracy. The balanced accuracy is the classification accuracy averaged over all classes, where the classification accuracy for each class is the percentage of cells in the class that are correctly classified by the random forest. Yet another measure is Fisher's p-value obtained by applying Fisher's exact test on the confusion matrix, which consists of the number of correctly and incorrectly classified cells in each class. Moreover, we also computed the so-called gene importance, a quantitative measure of the impact of the gene on the node purity.
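Two of the measures above can be computed directly from a 2 × 2 confusion matrix, as in the following sketch. The function names and the cell counts are hypothetical, and the two-sided p-value uses the common convention of summing all tables that are no more probable than the observed one; the text does not state which convention was applied.

```python
from math import comb

def balanced_accuracy(confusion):
    """confusion[i][j] = number of class-i cells predicted as class j."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_class) / len(per_class)

def fisher_exact_two_sided(confusion):
    """Two-sided Fisher's exact test p-value for a 2 x 2 table."""
    (a, b), (c, d) = confusion
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical counts: rows are true classes, columns are predicted classes.
confusion = [[30, 1],
             [2, 28]]
ba = balanced_accuracy(confusion)
p = fisher_exact_two_sided(confusion)
print(f"balanced accuracy = {ba:.3f}, Fisher p = {p:.2e}")
```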

To address the question of which, and how many, genes are needed to best separate two classes we applied a recursive feature elimination (RFE) scheme, a standard approach for feature selection (Tarca et al., 2007; Candia et al., 2013). In the first RFE cycle, we generated a random forest ensemble using all (N) genes and computed classification statistics, including confusion matrices with associated Fisher's p-value, balanced accuracy, out-of-bag error, and gene importance. We determined the least significant gene based on gene importance and removed it. Then, in the second RFE cycle we used the remaining N–1 genes and repeated the random forest analysis to eliminate the second least significant gene. The procedure was subsequently iterated until one gene was left. By comparing the classification performance across all RFE cycles we could then determine the number of genes in the optimal gene signature. We verified that, for this optimal gene signature, the out-of-bag error and Fisher's p-value were minimized, while the balanced accuracy was maximized. The intended redundancy of separately considering three classification performance metrics allowed us to ensure the robustness of the optimally obtained gene signature.
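The RFE loop described above can be outlined as follows. This is a schematic sketch only: `evaluate` is a hypothetical stand-in for training a random forest ensemble and returning an error together with per-gene importances, and selecting the smallest gene set among those with minimal error is an assumption made here for illustration.

```python
def recursive_feature_elimination(genes, evaluate):
    """Repeatedly drop the least important gene until one gene is left.

    `evaluate(gene_list)` must return (error, importance_by_gene).
    Returns a list of (gene_list, error), one entry per RFE cycle.
    """
    remaining = list(genes)
    history = []
    while remaining:
        error, importance = evaluate(remaining)
        history.append((list(remaining), error))
        if len(remaining) == 1:
            break
        least = min(remaining, key=lambda g: importance[g])
        remaining.remove(least)
    return history

def optimal_signature(history):
    """Smallest gene set among those reaching the minimal error."""
    best_error = min(error for _, error in history)
    candidates = [genes for genes, error in history if error == best_error]
    return min(candidates, key=len)

# Hypothetical evaluator: genes A and B are informative, C and D add noise.
IMPORTANCE = {"A": 0.5, "B": 0.3, "C": 0.05, "D": 0.01}

def toy_evaluate(gene_list):
    if "A" in gene_list and "B" in gene_list:
        error = 0.10 + 0.02 * sum(g in ("C", "D") for g in gene_list)
    else:
        error = 0.40
    return error, {g: IMPORTANCE[g] for g in gene_list}

history = recursive_feature_elimination(["A", "B", "C", "D"], toy_evaluate)
signature = optimal_signature(history)
print(signature)
```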

The most predictive genes identified by RFE were used to calculate a cell cycle index: the sum of the expression values of all genes upregulated from G1 to S and/or G2/M, minus the sum of the expression values of all genes downregulated from G1 to S and/or G2/M, divided by the number of genes used. The lg2 expression value of each gene was used.
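A minimal sketch of this index is given below, using the MLS 402-91 gene assignment reported in the Figure 6 caption (eight upregulated genes and E2F1 as the downregulated gene). The function name and the expression values are hypothetical placeholders, not measured data.

```python
def cell_cycle_index(log2_expression, up_genes, down_genes):
    """(sum of upregulated genes - sum of downregulated genes) / number of genes.

    `log2_expression` maps gene name to the lg2 expression value in one cell.
    """
    total = (sum(log2_expression[g] for g in up_genes)
             - sum(log2_expression[g] for g in down_genes))
    return total / (len(up_genes) + len(down_genes))

# Gene assignment for MLS 402-91 as given in the Figure 6 caption.
UP = ["MKI67", "RB1", "HIST1H2AE", "CCNB1", "CBX3", "ND1", "GAPDH", "CCNB2"]
DOWN = ["E2F1"]

# Hypothetical lg2 expression values for a single cell.
cell = {gene: 2.0 for gene in UP}
cell["E2F1"] = 1.0

index = cell_cycle_index(cell, UP, DOWN)
print(round(index, 3))
```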

### RESULTS

Gene expression and cell heterogeneity of proliferating cells were studied by fluorescence activated cell sorting combined with single-cell gene expression profiling. Three different cell lines were investigated: a genetically stable myxoid liposarcoma cell line (MLS 402-91) (Aman et al., 1992); a breast cancer adenocarcinoma derived cell line (MCF7; Soule et al., 1973) and mesenchymal stem cells (MSC) differentiated from an embryonic stem cell line (Karlsson et al., 2009). Cells were stained with lipid and DNA binding dyes, visualizing cell size and DNA content. Utilizing this double-labeling approach we collected small and large cells in the G0/G1, S, and G2/M phases (Figure S1). DNA staining cannot distinguish between G0 and G1 phase cells, or between G2 and M phase cells. We refer to the G0/G1 phase as G1 phase only, since few G0 cells are expected in our continuously passaged cell cultures. The average volume ratio between large and small collected cells was 2.8 for MLS 402-91, 2.5 for MCF7, and 4.5 for MSC (Figure S1). Expression of 93 genes was analyzed in each cell using reverse transcription quantitative real-time PCR. One gene (FUS) was assessed by two assays. Assay information and gene function are shown in Table S1. All basic data, including number of positive cells expressing each gene and mean single-cell expression with standard deviation, are shown in Table S2. We tested the reproducibility of our data by collecting individual MLS 402-91 cells in the G1, S, and G2/M phases without any cell size selection in an independent experiment.

### Total Transcript Level Correlates with Cell Cycle Phase at the Single-Cell Level

Transcript numbers were measured per single cell without any further normalization between cells (Ståhlberg et al., 2011a, 2013). Hence, the total transcript level could be calculated as the sum of all measured transcripts per cell. **Figure 1A** and **Table 1** show that the total transcript level correlated with cell cycle phase, but not with cell size. In MLS 402-91 the total transcript level reached maximum in G2/M phase cells with about twofold higher levels compared to G1 phase cells. In MCF7 the total transcript level reached maximum in S phase cells and remained at the same level in G2/M phase cells. MSC only displayed a weak correlation between total transcript level and cell cycle phase.

The total transcript level varied highly between individual cells (**Figure 1B**). The distributions were skewed with few cells containing high total transcript levels. The total transcript level was 17, 120, and 820 times higher in the cell with highest total transcript level compared to the cell with lowest total transcript level in MLS 402-91, MCF7, and MSC, respectively (all cells included). Correlation analysis between transcript levels of individual genes at single-cell level showed positive correlations between most genes: 74% in MLS 402-91 (total number of comparisons = 4278), 85% (total number of comparisons = 3081) in MCF7 and 90% (total number of comparisons = 3486) in MSC. Consequently, cells with high total transcript level also displayed elevated transcript numbers of most individual genes.
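The reported numbers of comparisons are consistent with all-pairs counts, n(n − 1)/2, for 93, 79, and 84 genes, respectively; these per-cell-line gene counts are inferred from the arithmetic and are not stated in the text. The sketch below checks this and shows how the fraction of positively correlated gene pairs could be computed. Pearson correlation is used here as an assumption, since the text does not specify the correlation measure for this analysis, and the expression values are hypothetical.

```python
from itertools import combinations
from math import comb, sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def fraction_positive(expression):
    """Fraction of gene pairs with positive correlation across cells."""
    pairs = list(combinations(sorted(expression), 2))
    positive = sum(1 for a, b in pairs
                   if pearson(expression[a], expression[b]) > 0)
    return positive / len(pairs), len(pairs)

# All-pairs counts matching the totals reported in the text.
print(comb(93, 2), comb(79, 2), comb(84, 2))

# Hypothetical expression of three genes in four cells.
toy = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 9], "g3": [4, 3, 2, 1]}
fraction, n_pairs = fraction_positive(toy)
print(fraction, n_pairs)
```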

### Identification of Genes with Cell Cycle Phase and Cell Size Dependent Expression

Principal component analysis (PCA) showed that individual cells partly clustered based on their cell cycle phase in all three cell lines (MLS 402-91 in **Figure 2A**, MCF7 in **Figure 3A**, and MSC in **Figure 4A**), but only MSC displayed cell size dependent clustering. However, large overlaps between cells of different cell cycle phases and cell sizes were observed for all cell lines. Double thymidine treated MLS 402-91 cells showed a completely divergent expression profile compared to untreated G1, S, or G2/M phase cells, demonstrating that artificial cell synchronization results in severe and unintended side effects (**Figure 2A**).

To determine if individual cells can be correctly classified into cell cycle phase or cell size based on their gene expression profile we applied the random forests algorithm, a machine-learning approach based on decision trees. As a classifier, a decision tree is a hierarchically organized structure that optimally can separate cell cycle phases and cell sizes (see Section Materials and Methods for details). **Figures 2B**, **3B**, **4B** show how well cell cycle phase and cell size could be distinguished using a multigene signature at the single-cell level. In MLS 402-91, we obtained the best classification comparing G2/M with G1 phase cells, while the classifications between other cell cycle phases were less efficient (**Figure 2B**). For example, 29.86 ± 0.35 out of 31 MLS 402-91 cells were correctly classified as G1 phase cells, while 1.14 ± 0.35 G1 phase cells were falsely predicted to be G2/M phase cells. The ability to classify MCF7 cells was similar (**Figure 3B**). The gene expression profile was less predictive of cell size than of cell cycle phase in both MLS 402-91 and MCF7 cells (**Figures 2B**, **3B**). Similar gene expression profiles and classifications were also observed for the independent MLS 402-91 data set (Figure S3). The gene expression profile of individual MSC was less predictive for cell cycle phases compared to the two other cell lines, but the ability to classify cell size was more efficient in MSC (**Figure 4B**). We also compared small and large cells within each cell cycle phase, but no distinct cell size dependency was found in any of the three cell lines (data not shown). The random forests approach also allowed us to rank the individual genes based on their importance in the classification (Figure S4). **Figures 2C**, **3C**, **4C** show the genes with the strongest cell cycle phase and cell size dependent expression. Even if the median expression level of these predictive genes correlated well with their ability to classify cell cycle phase or cell size, individual cells showed highly variable and overlapping gene expression (**Figures 2C**, **3C**, **4C**).

12, n<sub>small−G2/M</sub> = 14, n<sub>large−G2/M</sub> = 15). In addition, G1/S phase arrested MLS 402-91 cells with any cell size were analyzed, using a double thymidine block (n = 61). As a separate experiment, MLS 402-91 cells were collected and analyzed based on cell cycle phase only (n<sub>G1</sub> = 30, n<sub>S</sub> = 29 and n<sub>G2/M</sub> = 30). Box-Whisker plots are shown; the box spans the 25th to 75th percentiles and the whiskers span the 5th to 95th percentiles of all data. \*indicates significance at the 95% level using the Mann-Whitney U-test with Holm-Bonferroni correction for multiple testing. (B) Distribution of total transcript levels among individual cells in MLS 402-91, MCF7, and MSC. The total transcript level per cell is calculated as the sum of all measured transcripts for all 93 genes.

TABLE 1 | Spearman's correlation coefficient between total transcript level and cell proliferation parameters at single-cell level.


\*p < 0.05, \*\*p < 0.01.

### Identification of Predictive Genes and Cell Line Specific Subpopulations

Expression data for all genes were used in the random forests classification algorithm to predict cell cycle phase and cell size. To determine if a similar prediction model could be generated with fewer genes, we applied a recursive feature elimination (RFE) approach. In RFE, the least informative gene is eliminated from the random forests analysis. This procedure is repeated until only one gene remains. Figure S5 shows how well the random forests algorithm performed with decreasing number of genes. We found that expression data from the following gene sets were almost as accurate as the complete gene panel in classifying cell cycle phase in MLS 402-91: G1 vs. S: MKI67, RB1, E2F1, HIST1H2AE, and CCNB1; S vs. G2/M: CCNB1, CBX3, and ND1; and G2/M vs. G1: MKI67, GAPDH, CCNB1, and CCNB2. The gene lists are ordered with the most predictive gene listed first. Refined PCA using only these nine predictive genes revealed a distinct subpopulation that was not clearly visible using all genes (**Figure 5A**). The same subpopulation was also identified using other algorithms, including hierarchical clustering and Kohonen self-organizing maps (Figure S6). This new subpopulation mainly consisted of G1 cell cycle phase cells and was characterized by upregulation of MCM6 and downregulation of 21 other genes, mainly cell cycle related genes (**Figures 5B,C**). We refer to this subpopulation as the G1′ subpopulation. The total transcript level in the G1′ subpopulation was on average 32% lower compared to the other G1 phase cells (p < 0.01, Mann-Whitney U-test), suggesting a distinct G1 cell state with low transcriptional activity. We also confirmed the presence of the same G1′ subpopulation with an almost identical gene expression profile in the independent MLS 402-91 data set (Figure S7).

In MCF7, the following sets of predictive genes were identified by RFE: G1 vs. S phase: HIST1H2AE, CCNB1, CDK4, and GMNN; S vs. G2/M phase: CCNB1, CCNB2, and HIST1H2AE; and G2/M vs. G1 phase: MKI67, CCNB1, RPS10, RPL7, and EIF1. Refined PCA revealed a G1 subpopulation with similar characteristics as the G1′ subpopulation found in MLS 402-91 (**Figures 5D–F**). The existence of the MCF7 defined G1′ subpopulation was confirmed by hierarchical clustering and Kohonen self-organizing maps (data not shown). The total transcript level was 47% lower in the G1′ subpopulation compared to the other G1 phase cells (p < 0.01, Mann-Whitney U-test). One gene, MCM6, displayed opposite regulation in the G1′ subpopulation in MCF7 compared to MLS 402-91. The variable and divergent MCM6 expression prompted us to analyze its protein expression. Immunofluorescence analysis showed variable MCM6 protein expression in both MLS 402-91 and MCF7 with somewhat higher variability in MCF7 cells (Figure S8).

In MSC, RFE generated the following sets of predictive genes: G1 vs. S phase: HIST1H2AE, MKI67, ATF4, and YWHAZ; S vs. G2/M phase: HIST1H2AE, E2F4, TAF15, and RB1; and G2/M vs. G1 phase: CCNA2, NOTCH1, CCNB1, and VIM. In contrast to MLS 402-91 and MCF7, MSC displayed a distinct subpopulation of small S and G2/M phase cells that was characterized by upregulated cell proliferation genes (**Figures 5G–I**). The existence of this MSC specific subpopulation was also confirmed by other algorithms (data not shown).

### Cell Cycle Progression Can Be Visualized By a Cell Cycle Index Based on Gene Expression

Multi-gene profiles are usually hard to visualize and interpret. Hence, we calculated and plotted a cell cycle index based on the expression of all cell cycle regulated genes identified by RFE for each cell line (**Figure 6**). The index correlated with the cell cycle progression for all three cell lines, where G1 phase cells showed low indexes, while G2/M phase cells displayed high indexes. The cell cycle index varied most between individual G1 phase cells in MLS 402-91 and MCF7, where a distinct index crossover point could be identified for cells in the transition from G1 to S phase. In contrast, MSC showed a different pattern with a more uniform G1 to S phase transition. The cells in the G1′ subpopulations identified in MLS 402-91 and MCF7 displayed the lowest cell cycle indexes, while the cells in the subpopulation defined in MSC showed the highest indexes.

### DISCUSSION

The mechanisms governing cell growth and division of mammalian cells have long been a subject of intense research. Many of the decisive regulatory events occur by post-translational modifications of pre-existing proteins (Pagliuca et al., 2011), but underlying this regulatory level is also synchronized de novo production of cell cycle regulated components. A large number of genes have been reported to be transcribed in a timed manner as part of cell cycle progression (Sun et al., 2007; Simmons Kovacs et al., 2008; Muller and Engeland, 2010). Here, we have taken advantage of emerging technology to study gene expression profiles in single cells of different cell cycle phases and of different cell sizes. To date, most studies aimed at cell cycle regulated gene transcription were based on large cultures and artificial cell synchronization. We and others (Cooper, 2002, 2003) have observed that standard synchronization strategies affect cell states in unintended ways as they cause cell stress and abnormal expression profiles (**Figures 1A**, **2A**). Our approach of collecting unsynchronized individual cells avoids these issues, and our data clearly demonstrate some of the benefits of using single-cell analysis. Both the observed cell-to-cell variability and the identified subpopulations would have been challenging to study at the cell population level.

Traditional expression analysis usually involves normalization processes before samples can be compared. Normalization assumes that the expression of selected housekeeping genes, i.e., reference genes, or the total amount of transcripts is essentially identical across samples. However, single-cell RT-qPCR data are reported as transcripts per cell without the need for additional normalization between cells, which enables us to calculate the total transcript level of all analyzed genes (Ståhlberg et al., 2011a, 2013). This strategy is possible since single cells are analyzed directly without any extraction steps. Our data show that the assumption of equal total transcript levels between individual cells is not valid. Instead, we observed that the total transcript level correlated with the cell cycle phase (**Table 1**). This was further tested by analyzing an additional published single-cell astrocyte data set generated directly from dissociated mouse brains (Figure S9; Rusnakova et al., 2013). Taken together, our data show a considerable cell-to-cell variation in total transcript levels where most genes are positively correlated. In addition, only a minority of cells displayed elevated total transcript levels. Consequently, these few cells expressed high numbers of transcripts of most genes. The absolute values of the calculated total transcript levels are dependent on the applied gene panel. However, the observation of subpopulations expressing elevated levels of transcripts for most genes is not gene panel dependent. Our results are in agreement with earlier observations that transcription occurs in bursts (Raj et al., 2006; Sanchez and Golding, 2013), generating skewed distributions of transcripts among individual cells (Bengtsson et al., 2005).

In many organisms cell size is strongly correlated with cell division and growth rate (Dungrawala et al., 2010; Marguerat and Bahler, 2012), but the role of cell size in mammalian cells is less clear (Echave et al., 2007; Tzur et al., 2009). Our cell size data are in line with these reports. We observed increased numbers of small cells in the G1 phase using fluorescence activated cell sorting (Figure S1), but no clear correlation between cell size and total transcript levels was observed in any cell line. In MSC, we identified a subpopulation of small S and G2/M phase cells with a distinct gene expression profile. The divergent results of MSC could be connected to the larger span in size variation of these cells compared to the other two cell lines (**Figure 1A** and Figure S1).

A large number of genes displayed correlations between their expression levels and cell cycle phase, while correlations between expression level and cell size were fewer (**Table 1** and Table S2). However, even for the genes with the highest correlations we observed large overlap in gene expression levels among individual cells of different cell cycle phases and cell sizes (**Figures 2C**, **3C**, **4C** and Table S2). To further analyze the relations between gene expression and cell cycle phase or cell size, we applied the supervised random forests learning algorithm. This strategy generated a multi-gene signature that optimally separated pre-defined cell populations. Further, to identify the most predictive genes we applied RFE. Most of the predictive genes were similar in MLS 402-91 and MCF7, while MSC displayed a different gene list. Some genes, including CCNB1 and MKI67, were predictive in all three cell lines. The RFE results showed that none of the measured genes, alone or in combination, could assign all cells to the correct cell cycle phase or cell size in any cell line.

By excluding non-informative genes in the PCA we identified distinct G1′ subpopulations in both MLS 402-91 and MCF7. The G1′ subpopulations were characterized by low total transcript levels and downregulation of several proliferation associated genes. We speculate that these G1 phase cells are cells that have recently divided (Martinsson et al., 2005). One gene, MCM6, was upregulated in MLS 402-91, while downregulated in MCF7. MCM6 belongs to the MCM gene family, where the MCM complex is loaded on chromatin exclusively during the G1 phase with the help of other proteins, including CDT1 and CDC6 (Shetty et al., 2005). Interestingly, the second most upregulated gene in the MLS 402-91 G1′ subpopulation was CDT1, further indicating that the MCM complex may be differently regulated in MLS 402-91 compared to MCF7. The heterogeneous MCM6 expression also translated into variable protein expression levels. Transcript data suggest that the cells with high MCM6 protein level in MLS 402-91 correspond to the G1′ subpopulation, while the opposite seems true for MCF7. Further analyses are needed to define the cell line specific regulation of MCM genes.

FIGURE 6 | Cell cycle index. The cell cycle index of each cell is shown in relation to its cell cycle phase. Subpopulation cells identified in Figure 5 are also indicated. (A) The MLS 402-91 index was calculated as: (MKI67 + RB1 + HIST1H2AE + CCNB1 + CBX3 + ND1 + GAPDH + CCNB2 – E2F1)/9. The lg2 expression value of each gene was used. The cell cycle index crossover point where the index enters a plateau is indicated. The linear fits are shown to guide the eye. (B) The MCF7 index was calculated as: (HIST1H2AE + CCNB1 + CDK4 + GMNN + CCNB2 + MKI67 + RPS10 + RPL7 + EIF1)/9. The lg2 expression value of each gene was used. The cell cycle index crossover point where the index enters a plateau is indicated. The linear fits are shown to guide the eye. (C) The MSC index was calculated as: (HIST1H2AE + MKI67 + ATF4 + YWHAZ + E2F4 + TAF15 + RB1 + CCNA2 + NOTCH1 + CCNB2 + VIM)/11. The lg2 expression value of each gene was used.

A single parameter is easier to visualize and interpret than a multi-gene signature. Hence, we developed a cell cycle index to illustrate cell cycle progression. The index shows that cells are in continuous transition throughout the cell cycle until mitosis. In MLS 402-91 and MCF7 we observed a distinct cell cycle index crossover point for cells that were in the G1 to S phase transition (**Figures 6A,B**). We speculate that this cell cycle index crossover point is related to the G1 restriction checkpoint (Lubischer, 2007). The identified G1′ subpopulations in MLS 402-91 and MCF7 were characterized by low indexes, illustrating that these cells are not likely to enter the S phase in the near future. However, further analysis of more cell lines under different conditions, at different degrees of differentiation, and with various genetic backgrounds is needed to determine general cell proliferation constraints. In addition, whole transcriptome analysis would most likely reveal more predictive genes, allowing for a more detailed understanding of cell transitions between cell cycle phases.

### AUTHOR CONTRIBUTIONS

AS conceived and designed the study. AS, SD, NA, CV, TT performed the experiments. AS, JC, WL performed data analysis. All authors were involved in data interpretation and manuscript drafting. All authors approved the final manuscript.

### FUNDING

Barncancerfonden, BioCARE, Cancerfonden, Johan Jansson Stiftelsen för tumörforskning och cancerskadade, Sahlgrenska Akademin-ALF, Stiftelsen Assar Gabrielssons Fond, Stiftelserna Wilhelm och Martina Lundgrens Vetenskapsfond, VINNOVA, Åke Wiberg Stiftelse.

### ACKNOWLEDGMENTS

We acknowledge the Centre for Cellular Imaging at the Sahlgrenska Academy, University of Gothenburg for imaging support and Dr. Daniel Andersson at the Sahlgrenska Cancer Center, University of Gothenburg, Gothenburg, Sweden for comments on the manuscript draft.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene.2017.00001/full#supplementary-material

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Dolatabadi, Candia, Akrap, Vannas, Tesan Tomic, Losert, Landberg, Åman and Ståhlberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Narjes S. Movahedi<sup>1</sup>, Mallory Embree<sup>2</sup>, Harish Nagarajan<sup>2</sup>, Karsten Zengler<sup>2</sup> and Hamidreza Chitsaz<sup>3</sup>\**

*<sup>1</sup> Department of Computer Science, Wayne State University, Detroit, MI, USA, <sup>2</sup> Department of Bioengineering, University of California San Diego, San Diego, CA, USA, <sup>3</sup> Department of Computer Science, Colorado State University, Fort Collins, CO, USA*

As the vast majority of all microbes are unculturable, single-cell sequencing has become a significant method to gain insight into microbial physiology. Single-cell sequencing methods, currently powered by multiple displacement genome amplification (MDA), have passed important milestones such as finishing and closing the genome of a prokaryote. However, the quality and reliability of genome assemblies from single cells are still unsatisfactory due to uneven coverage depth and the absence of scattered chunks of the genome in the final collection of reads caused by MDA bias. In this work, our new algorithm Hybrid *De novo* Assembler (HyDA) demonstrates the power of coassembly of multiple single-cell genomic data sets through significant improvement of the assembly quality in terms of predicted functional elements and length statistics. Coassemblies contain significantly more base pairs and protein coding genes, cover more subsystems, and consist of longer contigs compared to individual assemblies by the same algorithm as well as state-of-the-art single-cell assemblers SPAdes and IDBA-UD. Hybrid *De novo* Assembler (HyDA) is also able to avoid chimeric assemblies by detecting and separating shared and exclusive pieces of sequence for input data sets. By replacing one deep single-cell sequencing experiment with a few single-cell sequencing experiments of lower depth, the coassembly method can hedge against the risk of failure and loss of the sample, without significantly increasing sequencing cost. Application of the single-cell coassembler HyDA to the study of three uncultured members of an alkane-degrading methanogenic community validated the usefulness of the coassembly concept. HyDA is open source and publicly available at http://chitsazlab.org/software.html, and the raw reads are available at http://chitsazlab.org/research.html.

Keywords: genome assembly, single-cell genomics, uncultivable bacteria, colored de Bruijn graph, genome coassembly

### 1. INTRODUCTION

Enormous progress toward DNA sequencing has brought a realm of exciting applications within reach, including genomic analysis at single-cell resolution. Single-cell genome sequencing holds great promise for various areas of biology including environmental biology (McLean et al., 2013). In particular, myriad unculturable environmental microorganisms have been studied using single-cell genome sequencing powered by high-throughput DNA amplification methods (Dean et al., 2001, 2002; Hosono et al., 2003; Gill et al., 2006; Rusch et al., 2007). Since the majority of microbes to date are unculturable, single-cell sequencing has enabled significant progress in elucidating the genome sequences and metabolic capabilities of these previously inaccessible microorganisms.

*Edited by:*

*Xinghua Pan, Yale University, USA*

### *Reviewed by:*

*Malek Faham, Sequenta Inc., USA Xuefeng Wang, State University of New York at Stony Brook, USA*

> *\*Correspondence: Hamidreza Chitsaz chitsaz@chitsazlab.org*

### *Specialty section:*

*This article was submitted to Genomic Assay Technology, a section of the journal Frontiers in Bioengineering and Biotechnology*

*Received: 24 February 2016 Accepted: 06 May 2016 Published: 23 May 2016*

### *Citation:*

*Movahedi NS, Embree M, Nagarajan H, Zengler K and Chitsaz H (2016) Efficient Synergistic Single-Cell Genome Assembly. Front. Bioeng. Biotechnol. 4:42. doi: 10.3389/fbioe.2016.00042*

Single-cell sequencing, which was challenging and limited for years, is now accessible and attractive for many scientific fields, and was named Method of the Year 2013 by Nature Methods. It supports various types of projects such as antibiotics discovery (Li and Vederas, 2009), the Earth Microbiome Project (EMP) (Caporaso et al., 2012), and the Human Microbiome Project (HMP) (Gill et al., 2006). The importance of single-cell sequencing is particularly due to the fact that only 1% of environmental bacteria have been cultured in the laboratory, as they need their natural habitat for cultivation (Lasken, 2007). Also, single-cell sequencing can preserve the uniqueness of each cell and its individual mutations and structural variations, which are valuable information, especially in cancer studies.

Nevertheless, single-cell sequencing is still far from perfect, because whole-genome amplification is needed to increase the femtograms of DNA in one cell to micrograms. All amplification reactions known to date introduce some form of bias. Today, the dominant amplification method in single-cell sequencing is Multiple Displacement Amplification (MDA) (Dean et al., 2001, 2002; Lasken and Egholm, 2003). Another popular amplification method is MALBAC, which introduces its own type of amplification artifact (Lu et al., 2012; Zong et al., 2012).

MDA is the preferred amplification method for single-cell sequencing because, unlike PCR, it is an isothermal process that requires no thermocycling (Illumina, 2013, 2014). Compared to PCR-based amplification methods, it produces less coverage bias and fewer errors (Tindall and Kunkel, 1988; Esteban et al., 1993; Pinard et al., 2006).

Recently, a new whole-genome amplification method called Multiple Annealing and Looping Based Amplification Cycles (MALBAC) has been demonstrated on individual human cells (Lu et al., 2012; Zong et al., 2012). MALBAC coverage of the human genome has less bias than that of MDA. Nevertheless, amplification bias is still a challenge despite the improvements achieved by MALBAC (Daley and Smith, 2014). Furthermore, the sensitivity of MALBAC to background noise makes it unsuitable for many applications, such as *de novo* assembly (de Bourcy et al., 2014).

Although single-cell sequencing methods have passed important milestones, such as capturing ≥90% of genes in a prokaryotic cell (Chitsaz et al., 2011) or finishing and closing the genome of a prokaryote using MDA (Woyke et al., 2010), the quality and reliability of genome assemblies from single cells lag behind those of assemblies from multi-cell samples because of the bias arising from MDA. The main factors that affect quality are uneven coverage depth and the absence of scattered chunks of the genome from the final collection of reads. There is no known deterministic pattern for the preferentially amplified regions, and they are currently treated as the result of a random process. Also, the outcome of MDA is widely variable, ranging from total loss of the sample and any information therein to nearly complete reconstruction of the genome. In this sense, an MDA-based single-cell sequencing experiment is currently a gamble that can lead to the loss of both the sample and the sequencing expenses.

The uneven depth of coverage of a single-cell data set makes *de novo* assembly that assumes uniform sequencing depth inaccurate (Rodrigue et al., 2009; Woyke et al., 2009). This makes the challenges of single-cell sequencing more computational than experimental (Rodrigue et al., 2009). A computational solution proposed by Chitsaz et al. (2011) overcomes some of the complications caused by uneven depth of coverage. That method is implemented in a tool called Velvet-SC and has been adopted by subsequent single-cell assembly tools, such as SPAdes (Bankevich et al., 2012) and IDBA-UD (Peng et al., 2012), which introduce further advanced algorithmic features and outperform Velvet-SC.

No matter how sophisticated the algorithmic features of an assembler are, there is no way to assemble those regions of the genome that are not amplified enough to be captured in sequencing. Chitsaz et al. (2011) called those absent parts of the genome *blackout regions*. We propose a solution that retrieves those blackout regions using the information contained in other single-cell data sets. Coverage data of identical DNA molecules suggest that the MDA process has a strong random component, to the extent that the blackout regions in one reaction are likely to be fully covered in another. We introduce a coassembly strategy, based on the colored de Bruijn graph (Iqbal et al., 2012), that fills the blackout regions in one data set using the information in another coassembled data set.

The colored de Bruijn graph was initially introduced for structural variation detection. We modified and implemented the algorithm for single-cell coassembly. Furthermore, our algorithm modifies the iterative *k* assembly algorithm, which is implemented by SPAdes (Bankevich et al., 2012) and IDBA-UD (Peng et al., 2012), and adapts it to the colored graph (Shariat Razavi et al., 2014). It has been shown that the weakness of coassembly is related to the breaking of contigs at branches of different colors (Movahedi et al., 2012). Iterative assembly with variable *k* overcomes that contiguity weakness.

We demonstrate in this work how to hedge against the risk of poor assembly results through sequencing and coassembly of a few single cells. Our method replaces a single-cell deep sequencing experiment with multiple single-cell shallow sequencing experiments, allowing for the simultaneous acquisition of supposedly synergistic information about multiple single cells.

### 2. MATERIALS AND METHODS

### 2.1. Media and Cultivation of the Methanogenic Alkane-Degrading Community

The microbial community was enriched from sediment from a hydrocarbon-contaminated ditch in Bremen, Germany (Zengler et al., 1999). The consortium was propagated in the laboratory in anoxic medium containing 0.3 g NH4Cl, 0.5 g MgSO4⋅7H2O, 2.5 g NaHCO3, 0.5 g K2HPO4, 0.05 g KBr, 0.02 g H3BO3, 0.02 g KI, 0.003 g Na2WO4⋅2H2O, 0.002 g NiCl2⋅6H2O, trace elements, and trace minerals as previously described (Zengler et al., 1999). The medium was sparged with a mixture of N2/CO2 (80:20 v/v), and the pH was adjusted to 7.0. After autoclaving, anoxic CaCl2 (final concentration 0.25 g/L) and filter-sterilized vitamin solution (Zengler et al., 1999) were added. Cells were supplemented with anoxic hexadecane as previously described (Embree et al., 2013). Bottles were degassed as necessary to relieve over-pressurization.

### 2.2. Single-Cell Sorting, MDA, and Genome Sequencing

Individual cells from the alkane-degrading consortium were obtained by staining (SYTO-9 DNA stain) and sorting of single cells by FACS (Embree et al., 2013). Single cells were lysed as previously described, and the genomic DNA of individual cells was amplified using whole-genome multiple displacement amplification (MDA) (Swan et al., 2011). Amplified genomic DNA was screened for *Smithella*-specific 16S rDNA gene sequences. Six amplified *Smithella* genomes were selected for Next-Generation Sequencing. The MDA-amplified genomes were prepared for Illumina sequencing using the Nextera kit, version 1 (Illumina), following the Nextera protocol (ver. June 2010) with high molecular weight buffer. Libraries with an average insert size of 400 bp were created for these samples and sequenced using an Illumina Genome Analyzer IIx. The 34-bp paired-end reads were generated for K05 (20.9 million reads), C04 (23.3 million reads), F02 (26.9 million reads), and A17 (22.2 million reads). The 58-bp single-end reads were generated for MEB10 (41.3 million reads), MEK03 (54.1 million reads), and MEL13 (18.0 million reads). The 36-bp paired-end reads were generated for F16 (11.0 million reads), K04 (27.2 million reads), and K19 (22.9 million reads).

### 2.3. Assembly of Single-Cell Genomes

Assemblies were obtained using HyDA version 1.1.1, SPAdes version 2.4.0, and IDBA-UD version 1.0.9. SPAdes and IDBA-UD were run with the default parameters in the single-end mode. The scripts to generate all of the assemblies are provided in Supplementary Material. The length of *k*-mers in the de Bruijn graph was 25, and the coverage cutoff used to trim erroneous branches in the graph was selected to be 100. The contigs were then annotated using RAST (Aziz et al., 2008), and the resulting annotation was used to generate a draft metabolic reconstruction using Model SEED (Henry et al., 2010). The Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AWGX00000000. The version described here is version AWGX01000000.

### 3. RESULTS

### 3.1. Colored de Bruijn Graph

Algorithmic paradigms for fragment assembly, such as overlap-layout-consensus and the de Bruijn graph, depend on the characteristics of sequencing reads, particularly read length and error profile. Overlap-layout-consensus is usually applied to assembly projects using long reads, whereas the de Bruijn graph is widely adopted for short-read data sets (Compeau et al., 2011). In the de Bruijn graph paradigm, each read is replaced by its consecutive *k*-mers (sequences of *k* nucleotides). Each distinct *k*-mer is represented by a unique vertex. An edge is present between two vertices if there is a read in which the two respective *k*-mers are consecutive, overlapping by *k* − 1 bases. Reads that have at least *k* consecutive bases in common share a vertex (and reads that have at least *k* + 1 consecutive bases in common share an edge), and contigs are efficiently constructed along these shared vertices and edges.
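The construction described above can be illustrated with a minimal Python sketch. It is not the HyDA implementation: the function name `build_de_bruijn` and the toy reads are hypothetical, and reverse complements, sequencing errors, and memory-efficient *k*-mer storage are deliberately ignored.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Return (k-mer multiplicities, edge set) of a toy de Bruijn graph."""
    vertices = defaultdict(int)  # k-mer -> number of occurrences in the reads
    edges = set()                # (k-mer, next k-mer): one observed (k+1)-mer each
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            vertices[kmer] += 1
        # consecutive k-mers of a read overlap by k - 1 bases
        for left, right in zip(kmers, kmers[1:]):
            edges.add((left, right))
    return dict(vertices), edges

# Two hypothetical reads that share the 4-mer GTAC
reads = ["ACGTAC", "GTACGG"]
vertices, edges = build_de_bruijn(reads, k=4)
print(vertices["GTAC"])   # multiplicity of the shared vertex
print(sorted(edges))
```

Because the two toy reads share exactly four consecutive bases, they meet at a single vertex but share no edge, which is the distinction drawn in the paragraph above.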

The colored de Bruijn graph is a method proposed for coassembly of multiple short-read data sets (Iqbal et al., 2012). It extends the classical approach by superimposing uniquely colored input data sets on a single de Bruijn graph. Each vertex, which represents a *k*-mer, carries an array of colored multiplicities, one per input data set. In this way, input data sets are virtually combined while remaining almost fully tracked, which enables separation after assembly. Iqbal et al. (2012) proposed the colored de Bruijn graph in Cortex for variant calling and genotyping, whereas our tool Hybrid *De novo* Assembler (HyDA) (Movahedi et al., 2012) is developed for *de novo* assembly of short-read sequences with non-uniform coverage, which is a dominant phenomenon in MDA-based single-cell sequencing (Chitsaz et al., 2011). To fill the gaps and compare colors, contigs in HyDA are constructed in a color-oblivious manner, solely based on the branching structure of the graph. First, this method rescues a poorly covered region of the genome in one data set when it is well covered in at least one of the other input data sets (**Figure 1A**; **Table 1**). Second, it allows comparison of colored assemblies by revealing all shared and exclusive pieces of sequence not shorter than *k* (**Figure 1B**; **Table 2**).

FIGURE 1 | Two sample colored de Bruijn graphs with colors red and blue. Nodes are *k*-mers and edges represent (*k* + 1)-mers. A colored bar shows the multiplicity of the *k*-mer in the corresponding colored input data set. Each box is an output contig, and the color of a box shows non-zero colored average coverage, which is shown on the right-hand side of the contig in (A). Our coassembly algorithm (A) rescues a poorly covered region of the genome in one color when it is well covered in the other, and (B) allows pairwise comparison of colored assemblies through revealing all of their shared and exclusive pieces of sequence.
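The rescue effect of color-oblivious contig construction can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not HyDA's code: the function names, the 12-base toy genome, and the reads are hypothetical, and error trimming, reverse complements, and isolated cycles are ignored.

```python
from collections import defaultdict

def build_colored_graph(datasets, k):
    """Superimpose several read sets (one color each) on one de Bruijn graph."""
    n_colors = len(datasets)
    mult = defaultdict(lambda: [0] * n_colors)  # k-mer -> per-color multiplicity
    succ = defaultdict(set)                     # k-mer -> successors, any color
    pred = defaultdict(set)                     # k-mer -> predecessors, any color
    for color, reads in enumerate(datasets):
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            for kmer in kmers:
                mult[kmer][color] += 1
            for left, right in zip(kmers, kmers[1:]):
                succ[left].add(right)
                pred[right].add(left)
    return mult, succ, pred

def color_oblivious_contigs(mult, succ, pred):
    """Maximal non-branching paths, found without looking at colors.

    Returns (sequence, per-color average coverage) pairs.
    """
    if not mult:
        return []

    def starts_path(v):
        if len(pred[v]) != 1:
            return True
        (p,) = pred[v]
        return len(succ[p]) != 1

    n_colors = len(next(iter(mult.values())))
    contigs = []
    for v in sorted(mult):
        if not starts_path(v):
            continue
        path = [v]
        while len(succ[path[-1]]) == 1:
            (nxt,) = succ[path[-1]]
            if len(pred[nxt]) != 1:
                break
            path.append(nxt)
        seq = path[0] + "".join(kmer[-1] for kmer in path[1:])
        cov = [sum(mult[kmer][c] for kmer in path) / len(path)
               for c in range(n_colors)]
        contigs.append((seq, cov))
    return contigs

red = ["ACGTTG", "CATCAG"]  # color 0: the middle of the toy genome is a blackout
blue = ["GTTGCATC"]         # color 1: covers the region missing from color 0

red_only = color_oblivious_contigs(*build_colored_graph([red], k=4))
coassembly = color_oblivious_contigs(*build_colored_graph([red, blue], k=4))
print([seq for seq, _ in red_only])
print(coassembly)
```

Assembled alone, the red data set breaks into two contigs around its blackout region. In the two-color graph the same region is bridged by blue *k*-mers, so a single contig with non-zero coverage in both colors is produced, while the per-color coverage arrays keep the two data sets distinguishable.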

### 3.2. Coverage Characteristics of Single-Cell Read Data Sets

Genomes amplified from single cells exhibit highly non-uniform genome coverage and multiple gaps, which are called blackout regions (Chitsaz et al., 2011). For the evaluation of such coverage characteristics in this study, we used amplified DNA originating from two single *Escherichia coli* cells as well as from one single *Staphylococcus aureus* cell (Chitsaz et al., 2011). Although these amplified DNAs were quality checked for preselected genomic loci using quantitative PCR (Rodrigue et al., 2009), they still did not cover the entire genome (Table S1 in Supplementary Material; **Figure 2**). One single *E. coli* cell was sequenced in four technical replicate lanes (1–4), and the other was sequenced in three technical replicate lanes (6–8), each with a sequencing depth of 600× per lane. The single *S. aureus* cell was sequenced in two technical replicate lanes, each with a sequencing depth of 1,800×. All nine lanes were sequenced on the Illumina GAIIx platform in paired-end 2 × 100 bp read mode.

The coverage bias in technical replicates is almost identical, which suggests that the vast majority of bias is caused by MDA. The coverage bias, particularly of the blackout regions, does not always occur at the same genomic loci for different cells of the same genome (Chitsaz et al., 2011). Blackout regions in *E. coli* lanes 1 and 6 sequenced from two independently amplified single cells make up 1.8 and 0.1% of the genome, respectively, but there are no common blackout regions between these two data sets (Table S1 in Supplementary Material). This means that combining the two data sets could fill all gaps and yield a complete genome, which is the property that HyDA exploits with colored coassembly.
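The complementarity argument can be made concrete with a small Python sketch. The per-base depth vectors below are invented for illustration and are not taken from Table S1; `blackout_positions` is a hypothetical helper.

```python
def blackout_positions(depth):
    """Return the set of genome positions with zero read depth."""
    return {pos for pos, d in enumerate(depth) if d == 0}

# Hypothetical per-base depth for two independently amplified cells
# of the same genome (10 positions only, for illustration)
cell_1 = [12, 0, 0, 7, 30, 5, 9, 1, 4, 2]
cell_2 = [3, 8, 15, 0, 4, 6, 0, 2, 9, 5]

gaps_1 = blackout_positions(cell_1)
gaps_2 = blackout_positions(cell_2)
shared_gaps = gaps_1 & gaps_2  # positions that neither data set covers

fraction_1 = len(gaps_1) / len(cell_1)
fraction_2 = len(gaps_2) / len(cell_2)
fraction_combined = len(shared_gaps) / len(cell_1)
print(fraction_1, fraction_2, fraction_combined)
```

When the two blackout sets do not intersect, as in this toy case, the combined data leave no position uncovered even though each data set alone has gaps.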

### 3.3. Colored Coassembly of *E. coli* and *S. aureus* Mitigates the Effect of Dropout Regions due to Amplification Bias

Single-cell read data sets have highly variable coverage (Raghunathan et al., 2005; Rodrigue et al., 2009) (Table S1 in Supplementary Material; **Figure 2**), which poses serious challenges for downstream applications such as *de novo* assembly. A number of single-cell assemblers, including EULER + Velvet-SC (Chitsaz et al., 2011), SPAdes (Bankevich et al., 2012), and IDBA-UD (Peng et al., 2012), have been developed to mitigate the adverse effects of non-uniform coverage and maximize the transfer of sequencing information into the final assembly. These efforts have been successful, and the existing single-cell assemblers are able to extract nearly all of the information contained in the input data set. However, the vast majority of single-cell data sets do not encompass the entire genome. We report that combining multiple data sets from the same or closely related species significantly improves the final assembly by filling genome gaps (Table S1 in Supplementary Material). The challenge presented by this method is the subsequent deconvolution of single-cell genomes to avoid chimeric assemblies.


TABLE 1 | The GAGE (Salzberg et al., 2012) statistics of HyDA assemblies for the six scenarios in Figure S1 in Supplementary Material.

*GAGE (Salzberg et al., 2012) was based on MUMmer 3.23 aligner (Kurtz et al., 2004).*

TABLE 2 | Pairwise relationships between three coassembled data sets, *E. coli* lanes 1 and 6 and *S. aureus* lane 7, in a coassembly of *E. coli* lanes 1–4, 6–8, and *S. aureus* lanes 7 and 8.

*Total is the total size of those contigs that have non-zero coverage in the corresponding color. Shared is the size of those contigs that have non-zero coverage in both colors. Exclusive is the size of those contigs that have non-zero coverage in the corresponding color and zero coverage in the other color in the pair. Exclusivity ratio = exclusive/total.*

The ideal solution involves the coassembly of multiple data sets without explicitly mixing sequencing reads such that individual assemblies can benefit from the synergy without suffering from chimerism. We propose and implement this solution using the colored de Bruijn graph in HyDA.

We report in **Table 1** the coassembly results for six distinct scenarios (Figure S1 in Supplementary Material), each consisting of a combination of the input read data sets: (i) single-cell assembly of *E. coli* lane 1; (ii) single-cell assembly of *E. coli* lane 6; (iii) mixed monochromatic assembly of *E. coli* lanes 1–4 and 6–8, technical replicates of two biological replicate single cells; (iv) multichromatic coassembly of *E. coli* lanes 1–4 and 6–8; (v) mixed monochromatic assembly of non-identical cells: *E. coli* lanes 1–4 and 6–8, and *S. aureus* lanes 7 and 8; and (vi) multichromatic coassembly of non-identical cells: *E. coli* lanes 1–4 and 6–8, and *S. aureus* lanes 7 and 8, each assigned a unique color. GAGE, a standard genome assembly evaluation tool that reports the size statistics and the number of substitution, indel, and chimeric errors of an assembly, was used to evaluate our assemblies (Salzberg et al., 2012). In all six scenarios, GAGE results (**Table 1**) comparing the assembly of color 0 with the *E. coli* reference genome are reported. Color 0 corresponds to *E. coli* lane 1 in (i), (iv), and (vi); *E. coli* lane 6 in (ii); and the mixture in (iii) and (v) (Figure S1 in Supplementary Material).

TABLE 3 | Evaluation results obtained from GAGE (Salzberg et al., 2012) for assembly of *E. coli* lanes 1 and 6 using E + V-SC (Chitsaz et al., 2011), SPAdes (Bankevich et al., 2012), and IDBA-UD (Peng et al., 2012).

While the state-of-the-art individual single-cell *E. coli* assemblies by SPAdes (SPAdes outperforms IDBA-UD and EULER + Velvet-SC in this case) miss 128,600 (2.77%) and 15,831 (0.34%) base pairs of the reference genome in the two different single cells (**Table 3**), our coassembly misses only 2,023 (0.04%) base pairs of the genome (**Table 1**), an improvement of 126,577 (2.72%) base pairs for *E. coli* cell 1. Our coassembly of the two single *E. coli* cells and one *S. aureus* cell misses only 2,136 (0.05%) base pairs of the genome. The coassembly algorithm in this work, without any error correction, *k*-mer incrementation, or scaffolding, increases the total assembly size for both *E. coli* lanes 1 and 6 using only the synergy in the input data sets. Our exclusivity ratio (defined below) obtained from the coassembly results completely differentiates the *E. coli* and *S. aureus* data sets (**Table 2**).

### 3.4. Quantification of Similarities and Differences between Colors

Input data sets can be clustered based on the similarity between their assemblies. For a pair of colors *i* and *j*, contigs belonging to both colors are considered shared, and contigs belonging to color *i* but not to color *j* are considered exclusive to color *i* with respect to color *j*. We define the exclusivity ratio of color *i* with respect to color *j* as the ratio of the size of exclusive color *i* contigs to the total assembly size of color *i*. The exclusivity ratio for *E. coli* lane 1-lane 6 (Pair 1 in **Table 2**) is less than 0.5%, while that ratio for *E. coli* and *S. aureus* in the two other pairs (Pairs 2 and 3 in **Table 2**) is greater than 90%. This large difference in exclusivity ratio between Pair 1 and Pairs 2 and 3 is expected in this case, as *E. coli* and *S. aureus* are phylogenetically divergent species belonging to different phyla.
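The definition above translates directly into a short computation. The following Python sketch assumes that each contig is available as a length together with its per-color coverage; this data layout, the function name `exclusivity_ratio`, and the toy contigs are illustrative assumptions, not the output format of HyDA or Squeezambler.

```python
def exclusivity_ratio(contigs, i, j):
    """Exclusivity ratio of color i with respect to color j.

    contigs: list of (length, per-color coverage list) pairs.
    A contig belongs to a color if its coverage in that color is non-zero.
    """
    total = sum(length for length, cov in contigs if cov[i] > 0)
    exclusive = sum(length for length, cov in contigs
                    if cov[i] > 0 and cov[j] == 0)
    return exclusive / total if total else 0.0

# Hypothetical three-color coassembly: (contig length, [cov0, cov1, cov2])
toy_contigs = [
    (1000, [5, 4, 0]),  # shared by colors 0 and 1
    (200,  [3, 0, 0]),  # exclusive to color 0
    (800,  [0, 6, 0]),  # exclusive to color 1
    (900,  [0, 0, 7]),  # exclusive to color 2
    (50,   [1, 0, 2]),  # shared by colors 0 and 2
]

print(exclusivity_ratio(toy_contigs, 0, 1))
print(exclusivity_ratio(toy_contigs, 0, 2))
```

Note that the ratio is not symmetric: the ratio of color *i* with respect to color *j* is normalized by the assembly size of color *i*, so swapping the two colors generally gives a different value.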

### 3.5. *De Novo* Single-Cell Coassembly of Members of an Alkane-Degrading Methanogenic Consortium

The genomes of 10 cells from three dominant but uncultured bacterial members of a methanogenic consortium (Zengler et al., 1999; Embree et al., 2013), belonging to the families *Syntrophaceae* and *Anaerolineaceae*, were sequenced from their amplified single-cell whole DNAs: six cells belonging to *Smithella*, two cells belonging to *Anaerolinea*, and two cells belonging to *Syntrophus*. Single cells were isolated from the consortium by fluorescence-activated cell sorting, and the genomes of individual cells were amplified using MDA. MDA products were sequenced using an Illumina GAIIx with 34, 36, or 58 base pair reads. In total, 10 data sets, one per cell, were obtained. The 10 data sets were coassembled with HyDA in a *ten-color* setup, and to exhibit the advantage of the coassembly method, each data set was also assembled individually by HyDA. Individual assemblies created by SPAdes and IDBA-UD were used for comparison. The QUAST (Gurevich et al., 2013) length statistics of the resulting assemblies (≥100 bp contigs) are compared in **Table 4** and Figures S2–S11 in Supplementary Material. The comparison between individual assembly and coassembly by HyDA demonstrates that coassembly rescues on average 101.4% more total base pairs for all 10 cells (Table S2 in Supplementary Material). Although HyDA does not use advanced assembly features such as variable *k*-mer sizes and paired read information, it can assemble 3.6–54% more total base pairs than both SPAdes and IDBA-UD do in all cells except two cases: *Anaerolinea* F02 and *Smithella* MEK03 (**Table 4**; Table S2 in Supplementary Material). When all contigs are considered, HyDA coassemblies of *Anaerolinea* F02 and *Smithella* MEK03 are 11% smaller and 41% larger than their SPAdes counterparts, respectively. *Smithella* MEK03 input reads are longer (58 bp) than the reads in some of the other data sets; nevertheless, the *Smithella* MEK03 assembly contains many short contigs because it is limited by the small *k*-mer size (*k* = 25) dictated by the shorter reads.
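For readers unfamiliar with the length statistics compared here, the following Python sketch shows one common way to compute total assembly size and N50 with a minimum contig length of 100 bp. It is not QUAST's code, and the contig lengths are invented for illustration.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L cover at least
    half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths (bp); contigs shorter than 100 bp are discarded
all_contigs = [50, 120, 300, 400, 80, 450, 500]
kept = [length for length in all_contigs if length >= 100]

total_size = sum(kept)
assembly_n50 = n50(kept)
print(total_size, assembly_n50)
```

Because N50 depends on which contigs are kept, assemblies should be compared under the same length threshold.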

### 3.6. Exclusivity Analysis of Ten Assemblies from Single Uncultured Bacterial Cells

Exclusivity analysis revealed that the six *Smithella* cells clustered into a consistent group, as their exclusivity ratios with respect to the two *Anaerolinea* and two *Syntrophus* cells are almost identical (**Table 5**). It is important to note that the *Anaerolinea* A17 and *Syntrophus* C04 assemblies are relatively short, meaning that their exclusivity ratios must be interpreted with caution. Although the exclusivity signature of *Syntrophus* K05 with respect to the six *Smithella* cells is indistinguishable from the six *Smithella* signatures with respect to themselves, the exclusivity ratios of *Syntrophus* K05 with respect to the two *Anaerolinea* cells and *Syntrophus* C04 differentiate *Syntrophus* K05 from the six *Smithella* cells. Slight differences between the *Syntrophus* C04 and K05 exclusivity signatures are not surprising because of the existence of potential intraspecies variations.

TABLE 4 | QUAST (Gurevich et al., 2013) analysis of 10 cells from *Anaerolinea*, *Smithella*, and *Syntrophus* single-cell data sets assembled with HyDA (individual assembly), HyDA (10-color coassembly), SPAdes, and IDBA-UD.

*All statistics are based on contigs of size ≥100 bp. Only those HyDA contigs that have a coverage of at least 1 in the corresponding color are considered. Coverage cutoff was chosen to be 24 for all HyDA assemblies (−c = 24). Total is the total assembly size, and N50 is the assembly N50 (the largest contig size such that contigs of at least that size cover half of the assembly size). Best result is in bold face.*

TABLE 5 | The exclusivity ratio (%) of row with respect to column for the 10 cells from *Anaerolinea*, *Smithella*, and *Syntrophus* single-cell data sets coassembled using 10 colors with Squeezambler (Taghavi et al., 2013), a tool in the HyDA package.

*Only the contigs of coverage at least 1 in the corresponding color are considered. Coverage cutoff was chosen to be 24 for all HyDA assemblies (−c = 24).*

### 3.7. Annotation of the *Anaerolinea*, *Smithella*, and *Syntrophus* Assemblies

To assess the quality of the HyDA coassemblies and of the IDBA-UD and SPAdes assemblies, we used the RAST server to predict the coding sequences and subsystems present in each assembly. The HyDA assemblies are superior to those of SPAdes and IDBA-UD in terms of the number of coding sequences and captured subsystems for one *Anaerolinea*, four *Smithella*, and both *Syntrophus* assemblies (**Table 6**). For *Smithella* MEB10 and MEK03, the HyDA assembly closely follows the SPAdes assembly, which provides the largest annotation (**Table 6**). For *Smithella* F16 and *Syntrophus* K05, HyDA assemblies contain significantly more coding sequences (33 and 39%, respectively) and cover more subsystems (29 and 57%, respectively) in comparison to the best of the SPAdes and IDBA-UD assemblies.

To confirm the accuracy of the assemblies, the closest related species to each assembly was computed by the RAST server. For the HyDA, SPAdes, and IDBA-UD *Anaerolinea* F02 assemblies, the closest species was *Anaerolinea thermophila* UNI-1 (GenomeID 926569.3); the RAST server found no closest-genome data for *Anaerolinea* A17. For the HyDA, SPAdes, and IDBA-UD *Smithella* and *Syntrophus* assemblies, the closest species was *Syntrophus aciditrophicus* SB (GenomeIDs 56780.10 and 56780.15). Note that *Syntrophus aciditrophicus* SB is the closest finished genome to the genus *Smithella*. This indicates that coassembly does not create chimeric assemblies; otherwise, we would see *Syntrophus aciditrophicus* SB among the close neighbors of the *Anaerolinea* assemblies and/or *Anaerolinea thermophila* UNI-1 among the close neighbors of the *Smithella* and *Syntrophus* assemblies by HyDA.

### 3.8. Metabolic Reconstruction of *Anaerolinea, Smithella*, and *Syntrophus*

Assembly and subsequent annotation of these genomes enable the elucidation of the functional roles of individual, unculturable constituents within the community. *Anaerolinea*, *Syntrophus*, and *Smithella* are genera with very few cultured members, and *Anaerolinea thermophila* (no genome paper) and *Syntrophus aciditrophicus* (McInerney et al., 2007) are the only sequenced genomes available from these genera to date. The only member of *Smithella* that has been isolated, *Smithella propionica* (Liu et al., 1999), has not been sequenced yet. In addition to understanding the genetic basis for the unique metabolic capability of this microbial community, the genomes of these particular organisms present an opportunity to explore the breadth of genetic diversity in these elusive genera. Using the advanced genome assembly algorithm, we recently identified the key genes involved in anaerobic metabolism of hexadecane and long-chain fatty acids, such as palmitate, octadecanoate, and tetradecanoate, in *Smithella* (Embree et al., 2013). Based on sequence homology, *Syntrophus* is closely related to *Smithella*, but we cannot yet determine whether it is also actively degrading hexadecane. Only two species of *Anaerolinea* have been isolated and characterized thus far. These species, both isolated from anaerobic sludge reactors, form long, multicellular filaments and are strictly anaerobic (Sekiguchi et al., 2003; Yamada et al., 2006). Each species is capable of growing on a large number of carbon sources, and both isolates produce acetate, lactate, and hydrogen as the main end products of fermentation. Comparison of the *Anaerolinea* sp. genome derived from single-cell sequencing with the genome of *Anaerolinea thermophila* UNI-1 revealed many similarities in potential metabolic capability. The *Anaerolinea* genome obtained from a single cell contains genes for the utilization of galactose and xylose, consistent with a previous physiological characterization of *A. thermophila* (Sekiguchi et al., 2003). Additionally, the single-cell *Anaerolinea* sp. genome encodes several transporters and genes related to trehalose biosynthesis, suggesting extended metabolic capabilities of this strain. Furthermore, the genome encodes an extracellular deoxyribonuclease, an enzyme required for catabolism of external DNA, hinting at the strain's ability to scavenge deoxyribonucleosides.

### 4. DISCUSSION

We demonstrated the power of genome coassembly of multiple single-cell data sets through significant improvement of the assembly quality in terms of predicted functional elements and length statistics. Coassemblies without any effort to scaffold or close gaps contain significantly more protein-coding genes, subsystems, and base pairs, and generally longer contigs, compared to individual assemblies by the same algorithm as well as by the state-of-the-art single-cell assemblers (SPAdes and IDBA-UD). The new algorithm is also able to avoid chimeric assemblies by detecting and separating shared and exclusive pieces of sequence for input data sets. This suggests that coassembly can hedge against the risk of individual single-cell assembly, which can fail, lose the sample, or significantly increase sequencing expenses. Our single-cell coassembler HyDA proved the usefulness of the coassembly concept and permitted the study of three uncultured bacteria. The improved assembly gave insight into the metabolic capability of these microorganisms, thereby providing a new tool for the study of uncultured microorganisms. Thus, the coassembler can readily be applied to study the genomic content and metabolic capability of microorganisms, and to increase our knowledge of the function of cells related to environmental processes as well as human health and disease. The colored de Bruijn graph uses a single *k*-mer size for all input data sets, which has to be chosen based on the minimum read length across all data sets. For instance, *Smithella* MEK03 input reads are longer (58 bp) than the reads in some of the other data sets, yet the *Smithella* MEK03 assembly contains many short contigs because of the small *k*-mer size (*k* = 25) dictated by the shorter reads. This minor disadvantage can be remedied by using advanced assembly features such as variable *k*-mer size, alignment of reads back to the graph and threading, and utilization of paired-end information.

### AUTHOR CONTRIBUTIONS

NM carried out genome assembly and evaluation, helped with metabolic reconstruction analysis, participated in development of HyDA, and drafted the manuscript. ME and HN participated in acquisition of the alkane-degrading consortium genomic data and drafted the manuscript. KZ participated in the project conception, participated in acquisition of the alkane-degrading consortium genomic data, and drafted the manuscript. HC participated in the project conception, developed HyDA, carried out interpretation of results, and drafted the manuscript.

### FUNDING

Funding for this work was partially provided by NSF DBI-1262565 grant to HC.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at http://journal.frontiersin.org/article/10.3389/fbioe.2016.00042

### REFERENCES

thermodynamic limit of microbial growth. *Proc. Natl. Acad. Sci. U.S.A.* 104, 7600–7605. doi:10.1073/pnas.0610456104


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2016 Movahedi, Embree, Nagarajan, Zengler and Chitsaz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*