# RNA-SEQ ANALYSIS: METHODS, APPLICATIONS AND CHALLENGES

EDITED BY: Filippo Geraci, Indrajit Saha and Monica Bianchini

PUBLISHED IN: Frontiers in Genetics and Frontiers in Plant Science

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-705-8 DOI 10.3389/978-2-88963-705-8

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# RNA-SEQ ANALYSIS: METHODS, APPLICATIONS AND CHALLENGES

Topic Editors:

Filippo Geraci, Institute for Informatics and Telematics, CNR, Pisa, Italy

Indrajit Saha, Department of Computer Science and Engineering, National Institute of Technical Teachers Training and Research, Kolkata, India

Monica Bianchini, DIISM, University of Siena, Siena, Italy

Citation: Geraci, F., Saha, I., Bianchini, M., eds. (2020). RNA-Seq Analysis: Methods, Applications and Challenges. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-705-8

# Table of Contents

*Editorial: RNA-Seq Analysis: Methods, Applications and Challenges*

Filippo Geraci, Indrajit Saha and Monica Bianchini

## SECTION 1

### RNA-SEQ ANALYSIS

*Assessment of a Highly Multiplexed RNA Sequencing Platform and Comparison to Existing High-Throughput Gene Expression Profiling Techniques*

Eric Reed, Elizabeth Moses, Xiaohui Xiao, Gang Liu, Joshua Campbell, Catalina Perdomo and Stefano Monti

*Read Mapping and Transcript Assembly: A Scalable and High-Throughput Workflow for the Processing and Analysis of Ribonucleic Acid Sequencing Data*

Sateesh Peri, Sarah Roberts, Isabella R. Kreko, Lauren B. McHan, Alexandra Naron, Archana Ram, Rebecca L. Murphy, Eric Lyons, Brian D. Gregory, Upendra K. Devisetty and Andrew D. L. Nelson


Zhihua Gao, Zhiying Zhao and Wenqiang Tang

*CircCode: A Powerful Tool for Identifying circRNA Coding Ability*

Peisen Sun and Guanglin Li

## SECTION 2

### SINGLE CELL RNA SEQUENCING

*Single-Cell RNA-Seq Technologies and Related Computational Data Analysis*

Geng Chen, Baitang Ning and Tieliu Shi

*Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods*

Monika Krzak, Yordan Raykov, Alexis Boukouvalas, Luisa Cutillo and Claudia Angelini

*Reproducibility of Methods to Detect Differentially Expressed Genes From Single-Cell RNA Sequencing*

Tian Mou, Wenjiang Deng, Fengyun Gu, Yudi Pawitan and Trung Nghia Vu

*McImpute: Matrix Completion Based Imputation for Single Cell RNA-seq Data*

Aanchal Mongia, Debarka Sengupta and Angshul Majumdar

## SECTION 3

### CASE STUDIES


Nie Tengkun, Wang Dongdong, Ma Xiaohui, Chen Yue and Chen Qin

# Editorial: RNA-Seq Analysis: Methods, Applications and Challenges

#### Filippo Geraci<sup>1</sup>\*, Indrajit Saha<sup>2</sup> and Monica Bianchini<sup>3</sup>\*

*<sup>1</sup> Institute for Informatics and Telematics, CNR, Pisa, Italy, <sup>2</sup> Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India, <sup>3</sup> DIISM, University of Siena, Siena, Italy*

Keywords: RNA-seq, algorithm, software pipeline, method assessment, differential analysis

**Editorial on the Research Topic**

**RNA-Seq Analysis: Methods, Applications and Challenges**

## 1. INTRODUCTION

RNA-seq has revolutionized how the research community studies gene expression. The technology makes it possible to quantify the expression level of all genes at once, allowing an ex post (rather than ex ante) selection of candidates that may be of interest for a given study. The continuous drop in costs, together with the independence of library preparation protocols from the model species, has convinced stakeholders to invest in this technology by creating consortia able to produce large disease-specific datasets, which in turn have fostered transcriptomic research at the population level. Among many others, a virtuous example in this sense is The Cancer Genome Atlas. In a short time, RNA-seq has moved from a technology that merely quantifies gene expression to a powerful tool to discover new transcripts (via de novo transcriptome assembly), characterize alternative splicing variants, and identify new cell types (through single cell RNA sequencing). Leveraging RNA-seq for daily diagnostic activities is no longer a dream but a consolidated reality.

Although established best practices exist, managing RNA-seq data is not easy. Before sequencing, it is essential to plan library preparation carefully in order to minimize biases in downstream analysis. Budget optimization is another important factor. Sequencing multiple samples increases statistical power and reduces undesired side effects due to noise and variability. However, more samples imply higher costs. Multiplexing has proved to be an effective way to limit the budget without sacrificing the number of samples. DNA barcoding enables combining up to 96 samples into a single lane, trading a lower sequencing depth for a higher number of sequenced samples. The downside of this technique is the increased burden of data analysis needed to reach the accuracy that a richer input would provide.

After sequencing, FASTQ data must be validated and processed to distill raw reads into a quantitative measure of gene expression. While validation is a fairly standard procedure, read counting depends on the type of RNA (microRNA, etc.) and on the target application. Usually, reads are subjected to adapter removal, aligned against a reference genome, grouped by functional unit (e.g., transcripts, genes, microRNA, etc.), normalized and counted. Subsequent analyses can vary dramatically according to the application. In the simplest setting, the goal is to discover the subset of genes responsible for the phenotypic differences between two populations. In other cases, one may want to build the co-expression (or reverse expression) network in order to find interacting genes or a pathway related to a certain phenotype. Other applications involve the discovery of unknown cell types, the organization of cell types in homogeneous families, the identification of new molecules (e.g., new microRNA, long non-coding RNA, etc.), or the annotation of new variants or alternative splicing.

#### Edited and reviewed by:

*Richard D. Emes, University of Nottingham, United Kingdom*

#### \*Correspondence:

*Filippo Geraci filippo.geraci@iit.cnr.it*

*Monica Bianchini monica@diism.unisi.it*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *03 February 2020* Accepted: *24 February 2020* Published: *17 March 2020*

#### Citation:

*Geraci F, Saha I and Bianchini M (2020) Editorial: RNA-Seq Analysis: Methods, Applications and Challenges. Front. Genet. 11:220. doi: 10.3389/fgene.2020.00220*

## 2. RESEARCH TOPIC ORGANIZATION

This Research Topic is divided into three main sections: five articles cover the RNA-seq workflow, four papers discuss the most recent frontier of single cell RNA sequencing, while the last four contributions report on case studies, related to tumor profiling and plant science.

In the first part, we attempted to analyze the RNA-seq process (from experimental design to analysis and extraction of new knowledge) by highlighting the key choices of state-of-the-art workflows. Although we have mainly focused on computational aspects, we believe that this Research Topic can interest readers from the life sciences who intend to become independent and autonomous in the analysis of their own data. Two papers in this section describe new methods: one for the identification of differentially expressed genes and one for the prediction of circRNA coding ability.

The second section introduces a recent branch of RNA-seq data analysis: single cell sequencing (scRNA-seq). Although conceptually similar to sequencing cells in bulk, the single cell resolution of this technique introduces a lot of noise, which requires ad hoc analysis methods. Much of this section is dedicated to the introduction of basic single cell RNA sequencing concepts, from laboratory protocols to the most common analyses. In particular, the problems of assessing the results of clustering cell types and the reproducibility of differential expression experiments are discussed. Finally, this section concludes with the description of a new method to infer missing counts due to poor sequencing coverage.

The last part of the Research Topic was dedicated to four case studies: three concerning tumors and one application in plant science. The rationale behind this choice was that of showing different types of analysis. In the conceptually simpler case, the goal of the analysis was to create a panel of genes prognostic of the onset of cancer. Next, an example of a co-expression network is shown. Finally, an example of interaction among different types of RNA (long non-coding, genes, microRNAs) has been reported, showing the complexity of the pathways that regulate the life of cells.

### 2.1. RNA-Seq Analysis

In Reed et al., the opportunity offered by Multiplexed RNA Sequencing is discussed. The study provides a comparison of several methods using real data from immortalized human lung epithelial cells.

In Peri et al., RMTA, a user-friendly analysis workflow, is proposed. RMTA was designed to provide standard preprocessing tools (i.e., read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis) in a scalable and easy-to-deploy environment.

In Jimenez-Jacinto et al., an integrative differential expression analysis web server (IDEAMEX) is described. The rationale of IDEAMEX is to free non-expert users from the (sometimes frustrating) experience of interacting with a UNIX-based environment for standard differential expression analyses.

In Gao et al., a new method for the identification of differentially expressed genes is reported. The key observation of this work is that the negative binomial distribution at the basis of the majority of the algorithms for differential expression analysis is unable to capture the underdispersion characteristics of RNA-seq data.
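The underdispersion argument can be made concrete under one assumption: that the count model in question is the negative binomial in its usual mean-dispersion form, as used by many differential expression tools. The symbols $\mu$ (mean count of a gene) and $\alpha$ (dispersion) are introduced here for illustration only and do not come from the editorial.

$$
\mathrm{Var}(Y) = \mu + \alpha\,\mu^{2}, \qquad \alpha \ge 0 \quad\Longrightarrow\quad \mathrm{Var}(Y) \ge \mu .
$$

Underdispersed counts are those with $\mathrm{Var}(Y) < \mu$, which no non-negative $\alpha$ can reproduce; the Poisson case ($\alpha = 0$) is the lower bound of what this model can express.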

In Sun and Li, the problem of predicting whether a given circular RNA can be translated or not is investigated. Circular RNAs differ from other types of RNA in that they are arranged as rings joining the 3′ and 5′ ends. This characteristic makes it hard to decide on their translation potential. The manuscript provides an algorithm to identify the coding ability of circRNAs with high sensitivity.

### 2.2. Single Cell RNA Sequencing

In Chen et al., an overview of currently available single-cell isolation protocols and scRNA-seq technologies is provided. In addition, several methods for scRNA-seq data analysis, from quality control to network reconstruction, are discussed.

In Krzak et al., the use of clustering to study the heterogeneity of cells is dissected. In particular, this work aims at providing new insights into the advantages and drawbacks of scRNA-seq clustering, highlighting open challenges.

In Mou et al., some issues connected to the reproducibility of differential expression studies are debated. The complexity of this type of analysis lies in the paucity of RNA and in the consequently lower signal-to-noise ratio. The article shows the pros and cons of standard and ad hoc software for differential expression.

In Mongia et al., a method to impute dropouts in single cell expression data is detailed. Experiments on real data show that the proposed software is able to discriminate the real absence of reads from dropout events.

### 2.3. Case Studies

In Yin et al., differential expression analysis is used to pinpoint a small panel of genes potentially prognostic for the onset of glioblastoma. The focus of the article is on improving healthy/diseased classification regardless of the interaction among genes.

In Zhu et al., co-expressed genes are identified in order to build a network of interactions. Subsequently, the network is analyzed to select hub genes associated with soft tissue sarcomas.

In Zheng et al., the dynamics of the interaction among different molecules in lung adenocarcinoma is studied. The article reports on how the dysregulation of a long non-coding RNA triggers a sequence of dysregulations, causing cell cycle arrest.

In Tengkun et al., genomics and transcriptomics data are integrated in order to identify the crucial genes that affect anthocyanin biosynthesis, transforming quantitative traits into quality traits.

## AUTHOR CONTRIBUTIONS

The authors all contributed equally to the Research Topic assembly and editing and to this editorial.

## FUNDING

IS was supported by a grant (DST/INT/POL/P-36/2016) from the Department of Science and Technology, India.

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Geraci, Saha and Bianchini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Assessment of a Highly Multiplexed RNA Sequencing Platform and Comparison to Existing High-Throughput Gene Expression Profiling Techniques

Eric Reed<sup>1,2</sup>, Elizabeth Moses<sup>2</sup>, Xiaohui Xiao<sup>2</sup>, Gang Liu<sup>2</sup>, Joshua Campbell<sup>1,2</sup>, Catalina Perdomo<sup>2</sup> and Stefano Monti<sup>1,2</sup>\*

<sup>1</sup> Bioinformatics Program, Boston University, Boston, MA, United States, <sup>2</sup> Section of Computational Biomedicine, School of Medicine, Boston University, Boston, MA, United States

#### Edited by:

Filippo Geraci, National Research Council (CNR), Italy

#### Reviewed by:

Kashmir Singh, Panjab University, India

Matteo Benelli, University of Trento, Italy

Haibo Liu, Iowa State University, United States

#### \*Correspondence:

Stefano Monti smonti@bu.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 06 September 2018 Accepted: 12 February 2019 Published: 05 March 2019

#### Citation:

Reed E, Moses E, Xiao X, Liu G, Campbell J, Perdomo C and Monti S (2019) Assessment of a Highly Multiplexed RNA Sequencing Platform and Comparison to Existing High-Throughput Gene Expression Profiling Techniques. Front. Genet. 10:150. doi: 10.3389/fgene.2019.00150

The need to reduce the per-sample cost of RNA-seq profiling for scalable data generation has led to the emergence of highly multiplexed RNA-seq. These technologies utilize barcoding of cDNA sequences in order to combine multiple samples into a single sequencing lane to be separated during data processing. In this study, we report the performance of one such technique denoted as sparse full length sequencing (SFL), a ribosomal RNA depletion-based RNA sequencing approach that allows for the simultaneous sequencing of 96 samples and higher. We offer comparisons to well established single-sample techniques, including full coverage Poly-A capture RNA-seq and microarrays, as well as another low-cost highly multiplexed technique known as 3′ digital gene expression (3′DGE). Data were generated for a set of exposure experiments on immortalized human lung epithelial (AALE) cells in a two-by-two study design, in which samples received both genetic and chemical perturbations of known oncogenes/tumor suppressors and lung carcinogens. SFL demonstrated improved performance over 3′DGE in terms of coverage, power to detect differential gene expression, and biological recapitulation of patterns of differential gene expression from in vivo lung cancer mutation signatures.

Keywords: RNA sequencing, gene expression, microarray, multiplexing, platform comparison

## INTRODUCTION

Since its inception in 2008, RNA sequencing has become the gold standard for whole-transcriptome high-throughput data generation (Mortazavi et al., 2008). In addition to RNA transcript expression quantification, RNA-seq allows for more advanced analyses including de novo transcriptome assembly (Robertson et al., 2010) and characterization of alternative splicing variants (Bryant et al., 2012). Furthermore, RNA-seq is species agnostic, such that the same library preparation technique may be utilized for human, mouse, rat, kidney bean, etc. These represent clear advantages over hybridization-based microarray platforms, in which individual microarray platforms are designed to quantify specific transcripts for a specific species (Wang et al., 2009). However, one persistent drawback of RNA-seq has been its relatively high cost.


The use of classic RNA-seq techniques for experimental designs that require profiling of many samples – especially when the marginal information value of each sample is relatively low, such as in medium- and high-throughput screening applications – can thus present a disqualifying cost burden.

Large-scale projects based on transcriptional profiling of chemical exposure experiments include the Toxicogenomics Project-Genomics Assisted Toxicity Evaluation System (Open TG-GATEs) (Igarashi et al., 2015), the DrugMatrix database (Ganter et al., 2006), and the Connectivity Map (CMap) (Subramanian et al., 2017), among others. Both the TG-GATEs and the DrugMatrix projects used microarrays for expression profiling, which was at the time significantly less costly than full coverage RNA-sequencing, yet still requiring multi-million budgets. Alternatively, the CMap project utilizes the Luminex-1000 (L1000) profiling platform, a bead-based analog expression assay which quantifies 1,058 human transcripts, which are used to impute the expression of 11,350 additional transcripts (Subramanian et al., 2017). This technique is among the least expensive expression assays available, but it is restricted to human screens and it directly profiles only a limited panel of genes. Given the flexibility of RNA-sequencing platforms, highly multiplexed techniques represent a viable alternative for generating transcriptional data from exposure screens, as well as from other experiments that require a large sample size. Therefore, evaluation of the technical validity of specific techniques serves to inform research strategies for a variety of biological inquiries.

The need to reduce the per sample cost of RNA-seq has led to the adoption of barcoding technologies, where cDNA sequences from individual samples are tagged and their libraries are combined and multiplex sequenced in a single lane (Wang et al., 2011). More recently, these techniques have been optimized to allow multiplex sequencing of 96 samples per lane or higher (Hou et al., 2015; Shishkin et al., 2015). Here, we report the results of our effort at optimizing and evaluating one such technique denoted as sparse full length (SFL) sequencing (Shishkin et al., 2015), a ribosomal RNA depletion-based RNA sequencing approach that allows for the simultaneous sequencing of 96 samples and higher. We offer comparisons to well established single-sample techniques, including full coverage Poly-A capture RNA-seq and microarray, as well as another low-cost highly multiplexed technique known as 3′ digital gene expression (3′DGE) (Asmann et al., 2009). Assessments include comparisons of coverage between the three RNA-sequencing techniques, as well as signal-to-noise and biological recapitulation of gene-level differential signals between treatment groups for the same samples profiled across SFL, microarray, and 3′DGE. For this evaluation study, we generated a set of exposure experiments on immortalized human lung epithelial (AALE) cells (Lundberg et al., 2002) in a two-by-two study design, in which samples received both genetic and chemical perturbations of known oncogenes/tumor suppressors and lung carcinogens (**Figure 1**). The goal of this report is not only to assess the performance of our optimized highly multiplexed technique, but to inform future research in terms of the strengths and pitfalls of available cost-effective high throughput transcriptomic profiling techniques.

## MATERIALS AND METHODS

### Samples

Exposure experiments were performed on immortalized human bronchial epithelial cells (AALE). Cells were exposed to both chemical and genotypic perturbations with three replicates per perturbation combination. Cells were thawed from liquid nitrogen and grown up in SAGM small airway epithelial cell growth media (Lonza, Portsmouth NH). Cells were subcultured using Clonetics ReagentPack subculture reagents (Lonza, Portsmouth NH). In preparation for exposure, cells were plated into 24-well plates and allowed to reach confluency for 24 h. Cell culture media was then replaced, and compounds added at a concentration of 24 µg/ml CSC, 173 µM BaP, 490 µM NNK or DMSO. NNK and BaP compounds were obtained from Sigma-Aldrich (St. Louis, MO, United States) and CSC obtained from Murty Pharmaceuticals (Lexington, KY, United States). Genotypic perturbations included CRISPR knockouts of FAT1 and CDKN2A, as well as overexpression of NRF2 (NFE2L2), FGFR1, NRG1, and PIK3CA. Cells transfected with a pSpCas9-EGFP (GFP) plasmid (PX458) in the absence of sgRNAs were used as controls for the CRISPR perturbations, while overexpression of an empty vector containing the reporter HcRed served as control for the overexpression experiments. The same samples were profiled across SFL, microarray, and 3′DGE for a subset of combinations of exposures, though all samples were profiled by SFL. In addition, full coverage poly-A RNA-seq was performed on a separate set of samples for a subset of genotypic exposures, including CRISPR knockouts of FAT1, as well as overexpression of NRF2, NRG1, and PIK3CA. These samples did not receive any chemical exposures (**Figure 1**). Note that in a few cases there was not enough material to perform 3′DGE, as indicated by the sample numbers of certain perturbation combinations.

### Library Preparation

Library preparation for SFL sequencing was carried out based on the published protocol (Shishkin et al., 2015). An edited version of this protocol is available in the **Supplementary Material**. RNA was isolated using a standard Qiazol and Qiacube protocol from Qiagen (Valencia, CA, United States). RNA purity was assessed using a NanoDrop spectrophotometer and no samples were excluded from downstream analysis. The dual-barcoded SFL libraries were pooled from 96 individual samples and then sequenced on the Illumina® NextSeq 550 to generate more than 400 million single-end 75-bp reads. Poly-A RNA Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2 and then sequenced on the Illumina® HiSeq 2500 to generate more than 5 million single-end 50-bp reads per sample. Microarray procedures were performed as described in the GeneChip™ WT PLUS Reagent Kit manual and the GeneChip™ WT Terminal Labeling and Controls Kit protocol (Thermo Fisher Scientific). The labeled fragmented DNA was generated from 100 ng of total RNA and was hybridized to the GeneChip™ Human Gene 2.0 ST Array. Microarrays were scanned using Affymetrix GeneArray Scanner 3000 7G Plus. 3′DGE library preparation was performed by Broad Institute, Cambridge, MA, United States, similar to Soumillon et al. (2014). Final libraries were purified using AMPure XP beads (Beckman Coulter) according to the manufacturer's recommended protocol and sequenced on an Illumina NextSeq 500 using paired-end reads of 17 bp (read1) + 46 bp (read2). Read1 contains the 6-base well barcode along with the 10-base UMI. Across all platforms, the number of samples that were successfully profiled per perturbation combination is shown in **Figure 1**.

*[Figure caption, truncated in source]* … of each condition. The color scheme for each platform is consistent throughout this report.

### Data Pre-processing


Affymetrix GeneChip Human Gene 2.0 ST Microarray CEL files were annotated to unique Entrez gene IDs, using a custom CDF file from BrainArray (hugene20st\_Hs\_ENTREZG\_21.0.0) and RMA-normalized. For SFL, adapter sequences were trimmed from raw sequence files using Cutadapt v1.12. Quality assessment of trimmed SFL sequence files as well as raw full coverage RNA-seq sequencing files was performed with FastQC v0.11.5. Both SFL and RNA-seq reads were aligned to the human genome (UCSC RefSeq hg19) with STAR v2.5.2b with the non-default parameter "--outSAMtype BAM SortedByCoordinate" (Dobin et al., 2013). Expression quantification in RefSeq genes was carried out with featureCounts (subread) v1.5.0 (Liao et al., 2014). For 3′DGE, pre-quantified gene expression count matrices were obtained from the Broad Institute, Cambridge, MA, United States. These reads had been aligned to the transcriptome (UCSC RefSeq hg19), using BWA aln v0.7.10 with the non-default parameter "-l 24" (Li and Durbin, 2009). Considering that there are 4<sup>10</sup> (∼1.05 × 10<sup>6</sup>) possible UMIs and the 3′DGE library sizes are on the order of 10<sup>6</sup> reads, it is highly unlikely for the same UMI to be added to multiple cDNA fragments from the same gene. Therefore, using a custom Python program (Soumillon et al., 2014), reads with the same UMI and sample barcode were only counted once per gene. All further data processing and analysis were carried out in R.
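The UMI counting rule described above (one count per unique UMI, per sample barcode, per gene) can be illustrated with a minimal Python sketch. This is not the custom program referenced in the text: the function name `count_unique_umis`, the `(sample_barcode, umi, gene)` record layout, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict

UMI_LENGTH = 10                    # 10-base UMI, as stated in the text
N_POSSIBLE_UMIS = 4 ** UMI_LENGTH  # 1,048,576, i.e. about 1.05 x 10^6


def count_unique_umis(reads):
    """Count each (sample barcode, gene, UMI) combination only once.

    `reads` is an iterable of (sample_barcode, umi, gene) tuples; this
    record layout is a hypothetical stand-in for parsed alignments.
    Returns a dict mapping (sample_barcode, gene) to a deduplicated count.
    """
    seen = set()
    counts = defaultdict(int)
    for barcode, umi, gene in reads:
        key = (barcode, gene, umi)
        if key in seen:
            continue  # same UMI, sample and gene: treated as a duplicate
        seen.add(key)
        counts[(barcode, gene)] += 1
    return dict(counts)


# Toy input (made-up reads, not data from the study).
reads = [
    ("ACGTAC", "AAAAAAAAAA", "NFE2L2"),
    ("ACGTAC", "AAAAAAAAAA", "NFE2L2"),  # duplicate of the first read
    ("ACGTAC", "CCCCCCCCCC", "NFE2L2"),  # new UMI, same sample and gene
    ("TTGCAA", "AAAAAAAAAA", "NFE2L2"),  # same UMI, different sample
    ("ACGTAC", "AAAAAAAAAA", "KEAP1"),   # same UMI, different gene
]
umi_counts = count_unique_umis(reads)
```

In this toy input, the first sample ends up with two counts for NFE2L2 because only the repeated UMI is collapsed; the same UMI seen in another sample or another gene is counted separately.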

### Coverage Assessment

Read coverage across the 82 samples shared between SFL and 3′DGE, as well as all 18 full coverage RNA-seq samples, was assessed for library size as well as the percentage of the library size that was aligned, uniquely aligned (i.e., reads that only align once in the genome), and counted in the 22,233 genes which were annotated across all three platforms, i.e., the intersection of annotated genes. The full set of counted reads is hereafter referred to as the counted library. Unlike SFL and full coverage RNA-seq, 3′DGE reads are aligned directly to mRNA sequences, such that the reported numbers of counted reads and uniquely aligned reads are the same. To assess the relative distribution of reads across the total set of shared genes, we plotted the cumulative proportion of the sum of reads aligning to individual genes per sample, ranked by relative expression, across all three platforms. Saturation analysis of the estimated minimum percentage of the counted library size needed to maximize the number of genes quantified by each platform was performed using a loess fit to the gene discovery of 20 subsamplings of the per sample counted libraries. All subsampling analysis was performed using subSeq v1.8.0.
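The idea behind the saturation analysis can be sketched in plain Python: subsample a counted library at several proportions and record how many genes are still detected. This is an illustration under stated assumptions, not the subSeq implementation and not the authors' code: reads are kept independently with a fixed probability (binomial thinning), a gene counts as detected with at least one read, no loess fit is applied, and the function names and toy counts are hypothetical.

```python
import random


def subsample_counts(counts, proportion, rng):
    """Keep each read independently with probability `proportion`."""
    return {
        gene: sum(1 for _ in range(n) if rng.random() < proportion)
        for gene, n in counts.items()
    }


def genes_detected(counts, min_reads=1):
    """Number of genes with at least `min_reads` reads (assumed threshold)."""
    return sum(1 for n in counts.values() if n >= min_reads)


def saturation_curve(counts, proportions, n_iter=20, seed=0):
    """Mean number of detected genes at each subsampling proportion.

    n_iter=20 mirrors the 20 subsamplings mentioned in the text; the seed
    is only there to make the sketch reproducible.
    """
    rng = random.Random(seed)
    curve = []
    for p in proportions:
        detected = [
            genes_detected(subsample_counts(counts, p, rng))
            for _ in range(n_iter)
        ]
        curve.append((p, sum(detected) / n_iter))
    return curve


# Toy counted library for one sample (made-up numbers).
toy_counts = {"GENE_A": 500, "GENE_B": 40, "GENE_C": 3, "GENE_D": 0}
curve = saturation_curve(toy_counts, [0.0, 0.1, 0.5, 1.0])
```

A curve that flattens well before a proportion of 1.0 suggests that additional depth would discover few further genes; in the study this judgment is made from a loess fit rather than from raw means.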

Finally, we assessed the relative induction of noise introduced by subsampling progressively larger proportions of the original counted library sizes in each platform, as measured by the principal component error (Heimberg et al., 2016). In order to compare the three platforms assuming equally sized starting libraries, we repeated the assessment after first subsampling full coverage RNA-seq libraries and 3′DGE libraries to sizes matching that of SFL, the smallest library of the three platforms. This analysis was performed on the 18 samples of like genotypic perturbations, with no chemical treatment in the case of full coverage RNA-seq samples and vehicle DMSO treatment in SFL and 3′DGE samples. Reported values reflect means across 20 iterations of the subsampling and principal component error calculation procedure.

### Signal-to-Noise Assessment

Signal-to-noise was compared among SFL, 3'DGE, and microarrays based on four-group ANOVA and two-group differential analysis. To estimate signal-to-noise as a means of assessing expected performance when standard statistical methods, rather than differential gene expression analysis packages, are applied to the data, classic ANOVA was performed for each gene using normalized data across all three platforms, using the glm function in R. In this analysis, signal-to-noise was assessed across like samples undergoing exposure to CSC or DMSO vehicle, as well as genotypic perturbations of NRF2 overexpression or HcRed control. Thus, the analysis included four independent groups of samples, receiving each combination of chemical (CSC or DMSO) and genotypic (NRF2 or HcRed) perturbations, with three replicates in each group. Only genes with mean expression ≥ 1 across all 12 samples in both SFL and 3'DGE were included in the analysis (9,813 total genes). Expression levels across SFL and 3'DGE were normalized via trimmed mean of M values (TMM) scaling (Robinson and Oshlack, 2010) and log<sub>2</sub> counts-per-million transformation. Additionally, two-group differential gene expression analysis was performed for each stratified chemical and genotypic perturbation, using LIMMA v3.30.7. That is, differential expression analysis was performed for CSC- vs. DMSO-treated samples, within either HcRed or NRF2 treatment, as well as for NRF2- vs. HcRed-treated samples, within either DMSO or CSC exposure. The SFL and 3'DGE count data were transformed for linear modeling using voom (Ritchie et al., 2015). Following modeling, results were restricted to the top 10,000 genes as ranked by median absolute deviation (MAD). This heuristic gene filtering procedure was adopted because quantification-based filtering is not applicable to microarray data. This approach follows recommendations detailed in the LIMMA manual (Ritchie et al., 2015). All p-values reported from two-group differential analysis are two-sided. In both ANOVA and LIMMA analyses, nominal p-values for each gene were corrected for multiple comparisons using the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995).
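As an illustration of the multiple-comparison correction described above, the following minimal Python sketch computes Benjamini–Hochberg adjusted p-values (FDR Q-values). The function name and the example p-values are hypothetical; the study itself performed this step in R, and this sketch is not the authors' code.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values for five genes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_q = benjamini_hochberg(example_p)
```

Genes whose adjusted value falls below a chosen threshold (0.05 in this study) would be reported as discoveries.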

### Biological Signal Recapitulation

Two-group differential analysis signatures were compared by pre-ranked gene set enrichment analysis (GSEA) to gene sets derived from published signatures of smoking exposure in the airway from healthy volunteers (Spira et al., 2004; Beane et al., 2007), as well as to gene sets analytically derived from The Cancer Genome Atlas (TCGA) for patients with lung squamous cell carcinoma (LUSC) or lung adenocarcinoma (LUAD). The two smoking gene sets consist of genes reported as either up- or down-regulated in response to smoking in at least one of the two publications, while TCGA gene sets were derived by probing differential expression of individual genes between patients with or without point mutations or copy number alterations (CNA) in genes of interest. These include mutations for the same panel of genes profiled for genotypic perturbations. In addition, we include mutations in KEAP1, a repressor of NRF2 (Kansanen et al., 2013). Specifically, point mutation signatures were derived from LUSC and LUAD, independently, by performing differential analysis of subjects with and without point mutations in genes of interest, matched for age, sex, and cancer stage. For NRF2 and PIK3CA, point mutations were defined at specific mutation hotspots along the gene body (**Supplementary Figure S2**) (Campbell et al., 2016). Likewise, CNA gene signatures were assessed for amplifications and deletions of genes of interest by differential analysis, using subjects with zero, one, or two additional copies or deletions of a gene of interest, respectively. All models for mutations and CNA were adjusted for tumor purity, as reported (Campbell et al., 2016). Differential signatures were derived using LIMMA. Genes associated with specific mutations or CNA were defined as those for which the linear model's genetic alteration coefficient met both a significance threshold (FDR Q-value < 0.05) and a magnitude threshold (|log<sub>2</sub> fold-change| > log<sub>2</sub>(1.5)).
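The gene selection rule stated above (FDR Q-value < 0.05 and |log2 fold-change| > log2(1.5)) can be sketched in Python as follows. The function name, the input layout, and the toy gene results are assumptions made for illustration only; the split into up- and down-regulated sets mirrors the two gene sets per signature used for enrichment testing.

```python
import math
from typing import Dict, List, Tuple

FDR_THRESHOLD = 0.05
LOG2_FC_THRESHOLD = math.log2(1.5)  # roughly 0.585


def select_signature_genes(
    results: Dict[str, Tuple[float, float]]
) -> Tuple[List[str], List[str]]:
    """Split genes into up- and down-regulated sets.

    `results` maps a gene name to (log2 fold-change, FDR Q-value).
    """
    up: List[str] = []
    down: List[str] = []
    for gene, (log2_fc, q_value) in results.items():
        # A gene must pass both the significance and the magnitude threshold.
        if q_value < FDR_THRESHOLD and abs(log2_fc) > LOG2_FC_THRESHOLD:
            (up if log2_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)


# Hypothetical differential analysis results.
toy_results = {
    "GENE_A": (1.20, 0.001),   # significant and large: up-regulated
    "GENE_B": (-0.90, 0.010),  # significant and large: down-regulated
    "GENE_C": (0.40, 0.001),   # significant, but effect too small
    "GENE_D": (2.00, 0.200),   # large effect, but not significant
}
up_genes, down_genes = select_signature_genes(toy_results)
```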

Each of our genotypic perturbation signatures was compared by GSEA to the corresponding TCGA-derived gene sets. For example, the PIK3CA overexpression signatures were compared to the gene sets derived from PIK3CA mutation and CNA in the TCGA data. To assess the effect of read counts on gene discovery and biological recapitulation of each platform, we compared the differential analysis and GSEA results to those derived from subsampled libraries across full coverage RNA-seq, SFL, and 3'DGE. Similar to coverage assessment, this analysis was performed starting with full libraries across all three platforms, as well as after initially subsampling the full coverage RNA-seq and 3'DGE libraries to sizes matching that of SFL. Reported values reflect means from 20 iterations of the subsampling followed by differential analysis and GSEA procedures.
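The subsampling procedure above can be sketched as follows. The text does not specify how reads were subsampled, so this Python sketch assumes binomial thinning (each read is kept independently with a fixed probability). The function names, the toy library, and the use of "genes detected" as the summary statistic are all hypothetical stand-ins for the differential analysis and GSEA steps.

```python
import random
from statistics import mean
from typing import Callable, Dict

Counts = Dict[str, int]


def subsample_counts(counts: Counts, fraction: float, rng: random.Random) -> Counts:
    """Keep each read independently with probability `fraction`."""
    return {
        gene: sum(1 for _ in range(n) if rng.random() < fraction)
        for gene, n in counts.items()
    }


def genes_detected(counts: Counts) -> int:
    """Number of genes with at least one read."""
    return sum(1 for n in counts.values() if n > 0)


def mean_over_iterations(
    counts: Counts,
    fraction: float,
    statistic: Callable[[Counts], float],
    iterations: int = 20,
    seed: int = 0,
) -> float:
    """Average a statistic over repeated subsamples (20 iterations, as in the text)."""
    rng = random.Random(seed)
    return mean(
        statistic(subsample_counts(counts, fraction, rng)) for _ in range(iterations)
    )


# Hypothetical counted library for one sample.
toy_library = {"GENE_A": 500, "GENE_B": 40, "GENE_C": 3, "GENE_D": 0}
avg_detected = mean_over_iterations(toy_library, fraction=0.1, statistic=genes_detected)
```

Looping over individual reads is only practical for small toy libraries; a real analysis would draw from a binomial or hypergeometric distribution directly.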

## RESULTS

#### Coverage Assessment

Comparison of coverage of the three sequencing platforms, full coverage poly-A RNA-seq, SFL, and 3'DGE, is summarized in **Table 1**, **Figure 2**, and **Supplementary Figure S1**. Comparison between SFL and 3'DGE included 82 samples each, while full coverage poly-A RNA-seq included all 18 available samples. None of the three platforms demonstrated differences in the library size variability (total number of assigned reads) across samples, although there was a notably high difference between the largest and smallest library size for the SFL samples, with a fold change of 4.3. Fold changes for full coverage RNA-seq and 3'DGE were 1.9 and 2.9, respectively (**Table 1** and **Figure 2A**).

Unsurprisingly, full coverage poly-A RNA-seq generated the largest library size, while the SFL and 3'DGE libraries were of comparable size (**Figure 2A**). Furthermore, full coverage poly-A RNA-seq yielded the highest percentage of reads aligned to the genome, followed by SFL and 3'DGE (**Table 1**, **Figure 2Ci**, and **Supplementary Figure S1A**). The lower mapping rate of 3'DGE is most likely due to the lower read quality scores of 3'DGE compared to full coverage RNA-seq and SFL (**Supplementary Figure S1B**). The mean percentage of reads with Phred quality scores greater than 20 (Q20) was only ∼88% for 3'DGE, compared to ∼100% for both full coverage RNA-seq and SFL. The relative 5'–3' transcript coverage for each sample across all three platforms is shown in **Supplementary Figure S1F**. As expected, read alignments were skewed toward the 3' end of transcripts for 3'DGE, while we observed relatively uniform coverage along the transcript for full coverage RNA-seq and SFL.
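For readers less familiar with Phred scores, the sketch below shows the standard relationship between a Phred quality score and base-calling error probability (Q = −10 log10 p), together with a Q20-style percentage. How a per-read score was summarized in the study is not stated, so the use of one quality value per read, the function names, and the toy values are assumptions.

```python
from typing import Sequence


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a base-calling error."""
    return 10 ** (-q / 10)


def percent_reads_above(read_qualities: Sequence[float], threshold: float = 20.0) -> float:
    """Percentage of reads whose quality value exceeds `threshold`."""
    passing = sum(1 for q in read_qualities if q > threshold)
    return 100.0 * passing / len(read_qualities)


# Hypothetical per-read quality values (one summary value per read).
toy_qualities = [35, 30, 18, 25, 12, 38, 40, 22]
toy_q20_percent = percent_reads_above(toy_qualities)
```

A Phred score of 20 corresponds to an expected error rate of 1 in 100 base calls.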

TABLE 1 | Comparison of read assignment between full coverage poly-A RNA-seq, SFL, and 3'DGE.


shows the cumulative proportion of total counted reads assigned to these genes, i.e., the running sum of reads divided by the total number of reads across all genes. (C) The top three boxplots show the percentage of reads aligned (i), uniquely aligned (ii), and counted (iii) relative to the total library size for each platform. The bottom boxplot (iv) shows the proportion of genes with counts > 1, for protein-coding genes annotated across all three platforms (18,488). For (ii), "Reads Uniquely Aligned" is not shown for 3'DGE because "Reads Uniquely Aligned" and "Reads Counted" are the same values as a result of the data pre-processing protocol specific to 3'DGE (see section "Materials and Methods"). Count values for these percentages are given in Supplementary Figure S1A. (D) Analysis of the principal component error of subsampled counted library sizes for full coverage poly-A RNA-seq, SFL, and 3'DGE for principal component 1. Results for principal components 2–5 are shown in Supplementary Figure S1D. Initial subsamples of poly-A RNA-seq and 3'DGE to the SFL library size are also given as dotted lines.

For SFL, there was a clear drop-off from the percentage of aligned reads to the percentage of uniquely aligned reads, due to ribosomal RNA (rRNA) contamination of the SFL samples (**Figure 2Cii**). The majority of reads aligning to ribosomal regions specifically align to RNA28S (**Supplementary Figure S3**). For 3'DGE, unique UMIs are aligned directly to transcript sequences and not to the whole genome, such that the number of uniquely aligned reads and the number of reads counted in transcripts are the same (**Figures 2Cii,iii**) (Morrissy et al., 2009). The percentage of reads that are counted in transcripts is greatest for full coverage poly-A RNA-seq (mean percentage of total library size: 65.2%), followed by 3'DGE (33.3%) and SFL (24.5%). However, while the counted read library size is greater for 3'DGE than for SFL, more genes were quantified by SFL than by 3'DGE (**Figure 2Civ**) (counts > 0 across all samples for 22,233 genes shared across all three platforms). A median of 60.9% and 50.5% of genes were quantified by SFL and 3'DGE, respectively. The number of genes quantified was near the saturation point for each platform, such that this discrepancy is not due to the read depth of each platform (**Supplementary Figure S1C**). The reason for the low gene discovery of 3'DGE is further illustrated in **Figure 2B**, which shows that reads are more evenly distributed across the 22,233 genes by SFL than by 3'DGE, with the cumulative distribution of reads counted in individual genes nearly identical in SFL and full coverage poly-A RNA-seq.
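The contrast between an even and a skewed read distribution can be made concrete with a small Python sketch of the cumulative proportion of counted reads (the running sum of reads divided by the total). Sorting genes from most to least abundant is an assumption, as are the function names and the two toy libraries, which are constructed to have the same total size.

```python
from typing import List, Sequence


def cumulative_read_proportion(gene_counts: Sequence[int]) -> List[float]:
    """Running sum of reads over genes (most abundant first) divided by the total."""
    total = sum(gene_counts)
    running = 0
    proportions: List[float] = []
    for count in sorted(gene_counts, reverse=True):
        running += count
        proportions.append(running / total)
    return proportions


def fraction_genes_detected(gene_counts: Sequence[int], min_count: int = 1) -> float:
    """Fraction of genes with at least `min_count` reads."""
    return sum(1 for c in gene_counts if c >= min_count) / len(gene_counts)


# Two hypothetical libraries of equal size (1,000 counted reads over six genes).
skewed_library = [900, 50, 30, 20, 0, 0]    # most reads fall on one gene
even_library = [300, 250, 200, 150, 70, 30]  # reads spread across genes

skewed_curve = cumulative_read_proportion(skewed_library)
even_curve = cumulative_read_proportion(even_library)
```

In this toy case the skewed library reaches 90% of its reads after a single gene and detects fewer genes, even though both libraries contain the same number of counted reads.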

The principal component (PC) error was estimated for each platform for different subsamples of the full counted library size. The first PC is shown in **Figure 2D**, while the second through the fifth PCs are shown in **Supplementary Figure S1D**. We observe that as the counted library size increases, the PC error decreases at the fastest rate for full coverage RNA-seq, followed by SFL, then 3'DGE. Although these differences are considerably more prominent when comparing full coverage RNA-seq to either SFL or 3'DGE, we do observe that, when moving from 10% to 100% of the counted library size, the PC error decreases at a consistently faster rate for SFL than for 3'DGE. Initially subsampling full coverage RNA-seq and 3'DGE to match the full SFL counted library size does not change the results. The same trend is also observed in the cumulative variance explained by each successive PC across full coverage RNA-seq, SFL, and 3'DGE (**Supplementary Figure S1E**).

In summary, despite lower overall counted library size due to ribosomal RNA contamination, SFL demonstrates greater coverage in low-to-medium expressed genes than 3'DGE, comparable to full coverage poly-A RNA-seq. Consequently, the transcriptional signal captured by the SFL libraries is more robust to subsampling of the data than that of 3'DGE, as measured by the principal component error.

#### Signal-to-Noise Evaluation

Differential expression modeling comparing experimental groups of matched samples was performed in SFL, microarray, and 3'DGE, and the corresponding signal-to-noise scores were compared pairwise between platforms (**Figure 3**). Samples shared across the three platforms include three replicates for each of four experimental groups, corresponding to NRF2 overexpression or HcRed control, as well as CSC chemical exposure or DMSO vehicle (**Figure 1**). Signal-to-noise was assessed by a four-group comparison with classic ANOVA (**Figures 3A–D**), as well as by stratified two-group differential analyses using LIMMA (**Figures 3E,F**).

We compared the log<sub>10</sub> F-statistics between ANOVA models across all three platforms (**Figure 3A**). Overall, the distribution of F-statistics is most similar between SFL and microarrays, with a Pearson correlation of 0.291. Though statistically significant (p < 0.01), the corresponding mean difference between log<sub>10</sub> F-statistics is only 0.026. The mean differences of the log<sub>10</sub> F-statistics between SFL and 3'DGE, and between 3'DGE and microarray, are 0.328 and 0.302, respectively, and the corresponding Pearson correlations are 0.160 and 0.216, respectively. These results are consistent with the discovery rates estimated for different FDR Q-value thresholds (**Figure 3B**). For example, at the FDR Q-value threshold of 0.05, the discovery rates of SFL and microarray are almost identical, 0.214 (2083 genes) and 0.209 (2038 genes), respectively, while the discovery rate of 3'DGE is much smaller, 0.032 (310 genes).
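To make the per-gene signal-to-noise score concrete, the sketch below computes a one-way ANOVA F-statistic for a single gene measured in four groups of three replicates and takes its log10. Treating the four-group comparison as a one-way design is an assumption, the expression values are invented, and the study itself fitted these models with the glm function in R.

```python
import math
from typing import Dict, Sequence


def one_way_anova_f(groups: Dict[str, Sequence[float]]) -> float:
    """F-statistic: between-group mean square divided by within-group mean square."""
    values = [x for g in groups.values() for x in g]
    n_total = len(values)
    k = len(groups)
    grand_mean = sum(values) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups.values():
        group_mean = sum(g) / len(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in g)
    ms_between = ss_between / (k - 1)       # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)   # N - k degrees of freedom
    return ms_between / ms_within


# Hypothetical normalized expression of one gene across the four groups.
toy_gene = {
    "HcRed:DMSO": [5.0, 5.2, 4.8],
    "HcRed:CSC": [6.0, 6.1, 5.9],
    "NRF2:DMSO": [7.0, 7.2, 6.8],
    "NRF2:CSC": [8.0, 8.1, 7.9],
}
f_stat = one_way_anova_f(toy_gene)
log10_f = math.log10(f_stat)
```

A larger F-statistic indicates that differences between group means are large relative to the replicate-to-replicate variability within groups.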

Loess regression of the log<sub>10</sub> F-statistics as a function of mean gene expression shows that the statistical signal increases with mean normalized expression. This trend is consistently positive for both SFL and 3'DGE, while leveling off at the most highly expressed genes in microarrays (**Figure 3C**). Furthermore, SFL signal is greater than 3'DGE signal at all levels of mean expression (**Figure 3C**). In agreement with the results from the coverage comparison, mean normalized expression values are lower overall in 3'DGE than in SFL, while those of SFL are comparable to those of microarray (**Figure 3D**). Adherence to the assumption of normality, assessed through a Shapiro–Wilk test, is also associated with higher mean normalized expression (**Supplementary Figure S4**).

The results of the comparisons of the two-group differential analyses across all three platforms were generally congruous with those of the four-group ANOVA analyses (**Figures 3E,F** and **Supplementary Figures S5**, **S6**). In all four two-group comparisons, the correlation of test statistics is closest between microarray and SFL results, followed by 3'DGE versus microarray results, and 3'DGE versus SFL. For example, in the DMSO-stratified, NRF2 versus HcRed analysis, estimates of the Pearson correlations of test statistics are 0.66, 0.45, and 0.43, respectively (**Figure 3E**). The discovery rate of 3'DGE is the lowest across all four differential analyses, while the discovery rate of SFL is the highest in three of the four analyses (**Figure 3F** and **Supplementary Figures S5**, **S6**).
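The cross-platform agreement reported above is a Pearson correlation of gene-level test statistics. A minimal sketch of that calculation is given below; the function name and the toy statistics for five genes are hypothetical and do not reproduce any value from the study.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical test statistics for the same five genes on two platforms.
platform_a_t = [4.1, -2.3, 0.5, 1.8, -0.9]
platform_b_t = [3.2, -1.1, 0.9, 2.4, -1.5]
r = pearson_correlation(platform_a_t, platform_b_t)
```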

In summary, SFL demonstrated greater statistical power than 3'DGE to detect differentially expressed genes, and its results more closely matched those in microarrays.

#### Biological Signal Recapitulation Evaluation

To evaluate the ability of each platform to recapitulate biologically relevant results, we utilized previously published signatures of smoking exposure in lung (Spira et al., 2004; Beane et al., 2007), as well as differential signatures derived from the TCGA LUSC and LUAD datasets associated with mutations of the genes over-expressed in our experiments. From each of these signatures, two gene sets were extracted, one of genes positively associated and one of genes negatively associated with the variable of interest. These gene sets were then tested via pre-ranked gene set enrichment analysis against each of our differential analysis results (CSC vs. DMSO, stratified by NRF2 or HcRed perturbation; NRF2 vs. HcRed, stratified by CSC or DMSO perturbation). The enrichment results with respect to both the smoking exposure signatures and the TCGA mutations are summarized in **Figure 4A**, and further detailed in **Supplementary Figure S7**, and confirm the highest sensitivity of microarrays, followed by SFL and 3'DGE.
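The idea behind pre-ranked enrichment testing can be sketched with a simplified running-sum enrichment score. This is an unweighted, Kolmogorov–Smirnov-like variant written for illustration only: it omits the rank weighting and the permutation-based FDR estimation of the GSEA method used in the study, and the gene names and function name are hypothetical.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Signed maximum deviation of an unweighted running sum along a ranked list.

    The running sum steps up at each gene in the set and down at each gene
    outside it; the score is the deviation from zero with the largest magnitude.
    """
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    extreme = 0.0
    for gene in ranked_genes:
        running += 1.0 / hits if gene in gene_set else -1.0 / misses
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Hypothetical signature: genes ranked from most up- to most down-regulated.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]
es_top = enrichment_score(ranked, {"G1", "G2", "G3"})     # set at the top
es_bottom = enrichment_score(ranked, {"G6", "G7", "G8"})  # set at the bottom
```

A positive score indicates that the gene set is concentrated among up-regulated genes, and a negative score that it is concentrated among down-regulated genes.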

The set of genes up-regulated in "smokers vs. non-smokers" was found to be significantly (FDR Q-value < 0.05) enriched in all "CSC vs. DMSO" signatures, within both genotypic stratifications for all three platforms. Conversely, the set of down-regulated genes in "smokers vs. non-smokers" was only enriched in the microarray signature of "NRF2 over-expressed; CSC vs. DMSO" (**Supplementary Figure S7**).

The enrichment results of TCGA-derived gene sets with respect to differential signatures of genotypic perturbations were in agreement with the gene-level results, in that they consistently demonstrated smaller discovery rates by 3'DGE than by SFL or by microarrays (**Figure 4A**). For example, the significantly enriched gene sets in "DMSO-treated; NRF2 vs. HcRed" differential signatures across all three platforms are highlighted in **Supplementary Figure S7**. The numbers of gene sets enriched in the microarray, SFL, and 3'DGE platforms are five, three, and zero, respectively.

four n = 3 groups (HcRed:DMSO, HcRed:CSC, NRF2:DMSO, and NRF2:CSC). The gray line shows y = x. The platform with the higher mean log<sub>10</sub>(F-statistic) is plotted on the y-axis. Also included are the p-value and difference in means for each two-platform comparison from paired t-testing, as well as the squared correlation coefficient. P-values ∼ 0 are less than 0.01. Colors indicate genes discovered by individual platforms (green, orange, or blue), neither platform (gray), and both platforms (red). (B) Plot of the discovery rate versus FDR Q-value threshold for each platform from four-group ANOVA models. The x-axis is plotted on a −log<sub>10</sub> scale. The vertical line indicates a Q-value threshold of 0.05. (C) Loess fit of the log<sub>10</sub>(F-statistic) versus median normalized expression from four-group ANOVA models. (D) Distribution of mean normalized expression across all three platforms. (E) Comparison of gene discovery (FDR Q-value < 0.05) by differential analysis with limma, comparing normalized gene expression between DMSO:NRF2 and DMSO:HcRed, including the raw discovery rates, discovered gene overlap, and linear fits, comparing test statistics from each platform. Genes that are discovered by more than one platform are shown in red in the scatterplots. Additional comparisons are shown in Supplementary Figure S5. (F) Plot of the discovery rate versus FDR Q-value threshold for each platform from two-group differential analyses. The x-axis is plotted on a −log<sub>10</sub> scale. The vertical line indicates a Q-value threshold of 0.05.

In addition to comparing which gene sets were significantly enriched in individual differential signatures, we compared the relative statistical signal of these enrichments. To this end, we transformed the permutation-based FDR Q-values by taking the negative log<sub>10</sub> and multiplying by the sign of the enrichment score (ES), i.e., −log<sub>10</sub>(FDR Q-value) × sign(ES). For each two-platform comparison, we fit a regression model through the origin. Since consistent results across platforms would result in a model fit close to the identity line, y = x, we tested whether the slope coefficient equaled 1 (i.e., β<sub>1</sub> = 1). **Figure 4B** shows these results for each of the three comparisons of the NRF2 and KEAP1 mutation-based gene set enrichment against the "DMSO-treated; NRF2 vs. HcRed" signatures. In all three comparisons, microarrays have the highest measured enrichment signal, followed by SFL and 3'DGE; however, the difference between microarray and SFL results is not significant (β<sub>1</sub> = 0.73; p-value = 0.2). The coefficients for both of the comparisons to 3'DGE are highly skewed in favor of microarray and SFL (β<sub>1</sub> = 0.18 and 0.14, respectively). Both of these comparisons are highly significant, with p-values < 0.01. Comparison of the enrichment results for other differential signatures shows similar trends (**Supplementary Figure S8**).
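The transformation and slope test described above can be sketched as follows. The sketch computes the signed −log10(Q) score, the least-squares slope of a regression through the origin, and the t-statistic for the null hypothesis that the slope equals 1 (converting it to a p-value would additionally require a t-distribution with n − 1 degrees of freedom). The floor applied to Q-values of zero, the function names, and the toy scores are assumptions.

```python
import math
from typing import Sequence


def signed_log_q(q_value: float, enrichment_score: float, floor: float = 1e-4) -> float:
    """-log10(FDR Q-value) multiplied by the sign of the enrichment score."""
    sign = (enrichment_score > 0) - (enrichment_score < 0)
    # Assumed floor: avoids log10(0) when a permutation-based Q-value is zero.
    return -math.log10(max(q_value, floor)) * sign


def slope_through_origin(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y = b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def t_statistic_slope_equals_one(x: Sequence[float], y: Sequence[float]) -> float:
    """t-statistic for H0: slope = 1, with n - 1 residual degrees of freedom."""
    slope = slope_through_origin(x, y)
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    standard_error = math.sqrt(rss / (len(x) - 1) / sxx)
    return (slope - 1.0) / standard_error


# Hypothetical signed scores for four gene sets on two platforms.
platform_x = [2.0, 1.0, -1.5, 3.0]
platform_y = [1.2, 0.4, -0.8, 1.4]
slope = slope_through_origin(platform_x, platform_y)
t_stat = t_statistic_slope_equals_one(platform_x, platform_y)
```

A slope below 1 indicates that the platform on the y-axis yields systematically weaker enrichment signal than the platform on the x-axis.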

Next, we compared enrichment results with respect to all genotypic perturbation signatures between SFL and 3'DGE (**Figure 5A** and **Supplementary Figure S9A**). Each comparison (i.e., each point in the plot) denotes gene set enrichment results with respect to genotypic perturbations within each of the four chemical exposures, DMSO, CSC, BaP, and NNK. Gene sets were tested for enrichment against concordant differential signatures, e.g., the PIK3CA mutation-derived gene set was tested against the "PIK3CA vs. HcRed" signatures. As in the previous analysis, the permutation-based enrichment FDR Q-values were transformed by −log<sub>10</sub>(FDR Q-value) × sign(ES). In the "DMSO-treated; genotypic perturbation vs. control" signatures, we observe that the gene set enrichment is generally more significant for SFL than for 3'DGE (β<sub>1</sub> = 0.63; p-value < 0.01; **Figure 5A**). The results obtained in CSC- and NNK-treated signatures are concordant with these results (β<sub>1</sub> = 0.65; p-value = 0.03 and β<sub>1</sub> = 0.60; p-value = 0.01, respectively). The BaP-treated results are less comparable since only one genotypic perturbation signature, "FAT1 vs. GFP," is available for this stratification (**Supplementary Figure S9A**).

Additionally, we compared our differential signatures to available full coverage poly-A RNA-seq genotypic perturbations (**Supplementary Figure S9B**), although these results are considered less comparable because of differences in experimental set-up. In particular, in the full coverage poly-A RNA-seq experiments the genotypic perturbations were performed on untreated rather than DMSO-treated cell lines (**Figure 1**).

The effect of subsampling the data on discovery rate across all three platforms is shown in **Figure 5B**. Generally, the discovery rate did not plateau, unlike the number of detected genes, which plateaus near the full counted library size. When comparing the correlation between GSEA results on subsampled data, we observe similar trends across full coverage RNA-seq, SFL, and 3'DGE (**Figure 5C**). Initial subsampling of full coverage RNA-seq and 3'DGE to the SFL counted library size did not change the analysis results.

In summary, differential analysis of molecular and genotypic perturbations with SFL recapitulates biologically meaningful signal of gene sets derived from high coverage in vivo data sets. This performance is comparable to that of both 3'DGE and microarray.

## DISCUSSION

The goal of this study was to evaluate the performance of SFL sequencing, a low-cost method for performing highly multiplexed RNA-seq, and to compare it to other high-throughput gene expression profiling platforms. The development of such methods would be instrumental to the generation of large-scale perturbation screens based on in vitro models. The reduction of the cost per profile would make it feasible to significantly increase the number of replicates and conditions to be profiled, including multiple time points, concentrations, and biological models, and thus would support a more in-depth investigation of the heterogeneity of the biological response to different exposures. It would also support the development of more accurate predictive models of the adverse or therapeutic outcomes of various exposures. Finally, insights gained from our study will also inform the design of protocols for single cell RNA-sequencing (Eberwine et al., 2014), given their reliance on highly multiplexed libraries.

In addition to SFL, the platforms included in this analysis were 3'DGE, an alternative highly multiplexed sequencing platform; the Affymetrix GeneChip Human Gene 2.0 ST Microarray, an analog expression platform; and full coverage poly-A capture RNA-seq. The cost per sample for SFL and 3'DGE was ∼\$50, a 10-fold decrease from that of full coverage RNA-seq, \$500, and a 7-fold decrease from that of the microarray, \$350 USD. Throughout this analysis we demonstrate comparable performances of SFL and 3'DGE to these more expensive platforms. Furthermore, in this analysis we consistently find evidence that SFL outperforms 3'DGE.
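The cost comparison above amounts to simple arithmetic, sketched below using the approximate per-sample costs quoted in the text. The budget figure and the function names are hypothetical and serve only to show how the cost difference translates into the number of profiles that can be generated.

```python
# Approximate per-sample costs in USD, as quoted in the text.
COST_PER_SAMPLE_USD = {
    "SFL": 50,
    "3'DGE": 50,
    "full coverage RNA-seq": 500,
    "microarray": 350,
}


def fold_decrease(reference: str, platform: str) -> float:
    """How many times cheaper `platform` is than `reference`, per sample."""
    return COST_PER_SAMPLE_USD[reference] / COST_PER_SAMPLE_USD[platform]


def profiles_within_budget(budget_usd: int, platform: str) -> int:
    """Number of whole profiles affordable on a given budget."""
    return budget_usd // COST_PER_SAMPLE_USD[platform]


# Hypothetical budget of 10,000 USD.
sfl_profiles = profiles_within_budget(10_000, "SFL")
rnaseq_profiles = profiles_within_budget(10_000, "full coverage RNA-seq")
```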

Performance was assessed in terms of coverage, signal-to-noise, and recapitulation of expected biological signal derived from independently generated, publicly available data collected from human subjects. Coverage was assessed by comparing the three digital expression platforms, while signal-to-noise and biological recapitulation were assessed by comparing SFL, 3'DGE, and microarrays. Microarray expression quantification has been shown to be highly correlated with qRT-PCR, especially when processed with updated probe set annotations, as utilized in this analysis (Sandberg and Larsson, 2007). Chemical and molecular perturbations were carried out in the same samples, and concurrently profiled by SFL, 3'DGE, and microarrays. We also leveraged previously generated full coverage poly-A RNA-seq profiles from similar perturbations of AALE cell lines.

For coverage assessment, performance was evaluated in terms of the distribution of total reads, or library size, that were aligned to the human genome, and further quantified in annotated genes. The best performance was expected in full coverage poly-A RNA-seq, given that this is the most well-established technique and has by far the highest sequencing depth. This was confirmed, as full coverage poly-A RNA-seq was measured to have the highest per sample library size, percentage of aligned reads, percentage of uniquely aligned reads, and percentage of counted reads (**Figure 2** and **Supplementary Figure S1**). The coverage performance of SFL suffered as a result of rRNA contamination, where as many as 53% of the total library size per sample was assigned to ribosomal regions of the genome (**Supplementary Figure S3**).

3'DGE is a poly-A capture technique; therefore, incomplete ribosomal depletion is not a potential pitfall. 3'DGE generates short nucleotide tags from transposon-based fragmentation, which are enriched for 3'-adjacent sequences of a given transcript (Soumillon et al., 2014). Since many transcripts of the same gene generate identical sequence tags, unique molecular identifiers (UMIs) are used to distinguish between unique reads and duplicate reads generated from PCR amplification. Although mRNA fragment duplication occurs with any RNA-seq protocol, the impact of this artifact on downstream analyses is negligible for techniques, such as SFL, which generate more complex sequence libraries (Parekh et al., 2016).

3'DGE sequences were aligned directly to human mRNAs, rather than the whole genome. Therefore, percentages of reads aligned and reads counted (**Figures 2Ci,iii**) reflect the percentage of non-unique UMIs that align to at least one gene and the percentage of unique UMIs that align to only one gene, respectively. We observe that the percentage of counted reads is greater for 3'DGE than for SFL, which is explained by a loss of reads to rRNA contamination in SFL. However, we observe notably more genes quantified by SFL than by 3'DGE (**Figures 2B,Civ**), which indicates that more reads are assigned to fewer genes in 3'DGE compared to SFL, as well as to full coverage RNA-seq (**Figure 2C**). Although rRNA contamination is a potential drawback of any ribosomal depletion RNA-sequencing technique, the extent of ribosomal contamination is variable, and could potentially be reduced by further optimization of the library preparation protocol.

The difference in distribution of reads across shared genes between SFL and 3'DGE likely explains the difference in information retained by subsampling as measured by principal component error. Although full coverage poly-A RNA-seq clearly outperforms both SFL and 3'DGE for principal component assessment, we consistently observe that, as the counted library size increases, the principal component error decreases faster for SFL than for 3'DGE (**Figure 2D** and **Supplementary Figure S1D**). This is unsurprising considering that not only are considerably fewer genes quantified by 3'DGE than by SFL, but there is also no discernible difference between the two platforms in the rate of genes counted as a function of counted library size (**Supplementary Figure S1C**). As we subsample the counted libraries, though we may lose the same number of genes in SFL and 3'DGE, the percentage of genes lost, and consequently the information lost, will be greater for 3'DGE than for SFL. Furthermore, this more even read distribution likely explains the improved performance of SFL over 3'DGE in statistical signal. In particular, our signal-to-noise evaluation shows consistently higher gene-level statistical signal from SFL and microarray experiments than from 3'DGE experiments (**Figure 3**). These differences appear to be driven by the differences in the relative quantification of genes, given that statistical signal is positively associated with mean gene expression for each platform, and 3'DGE experiments showed lower gene-level quantification than SFL and microarrays (**Figures 3C,D**). We observe similar cross-platform relationships in the two-group differential analyses (**Figures 3E,F**).

The gene set-based enrichment results are consistent with those from signal-to-noise analyses. In every comparison of enrichment scores between SFL and 3'DGE, we observe generally higher gene set enrichment with respect to the SFL-derived signatures (**Figures 4**, **5A** and **Supplementary Figures S8**, **S9**). The gene sets were selected to represent known biological responses to the profiled perturbations, and thus their enrichment with respect to the perturbation signatures is expected to represent true positives.

The enrichment results confirm this expectation. For example, in the signatures of NRF2 overexpression, we consistently observe enrichment of the gene sets derived from NRF2 amplifications and KEAP1 deletions, each of which should increase NRF2 activity (**Supplementary Figure S7**) (Kansanen et al., 2013). Similarly, we observe significant concordant enrichment of the gene sets derived from NRF2 and KEAP1-dysregulated lung tumors in the signature of CSC exposure, suggesting that the NRF2 pathway is activated by CSC exposure in vitro (**Supplementary Figure S7**), which has been previously reported (Adair-Kirk et al., 2008). Interestingly, these results demonstrate that the activation of the NRF2 pathway in normal airway epithelial cells in vitro (by ectopic expression of the gene or by CSC treatment) is concordant with the activation of NRF2 by somatic genome alterations in lung tumors, a finding that, to the best of our knowledge, has not been previously observed.

Possible sources of technical variability in this study are the different sequencing platforms, service providers, and read lengths. However, when subsampling the 3'DGE and SFL counted libraries, we generally observe higher discovery rates for SFL at all percentages of the full counted libraries, and even more so when the 3'DGE counted libraries are initially subsampled to full SFL counted library sizes (**Figure 5B**), demonstrating that SFL shows improvements independent of the mapping rate. This result is consistent with previous reports showing that increasing read length above 50 bp does not improve read quantification (Chhangawala et al., 2015). Furthermore, similar results have been reported even when the same sequencing platform is used. A recent study reported a greater number of genes detected, as well as higher differential analysis discovery rates, in conventional RNA-seq than in 3'DGE at identical counted library sizes, using the Illumina HiSeq 2500 platform to generate both libraries (Xiong et al., 2017).

In summary, in this study we observe higher performance of SFL than 3'DGE, as measured by coverage, signal-to-noise, and biological recapitulation of known signal, with the performance of SFL often matching that of well-established "gold standards" (full coverage RNA-seq or microarrays). On the other hand, the fact that 3'DGE is shown to allocate a large number of reads to relatively few, highly expressed genes makes this platform more suitable for problems where high accuracy in the differential quantification of highly expressed genes is needed. Furthermore, the ready availability of 3'DGE as a core-provided option, which allows for the outsourcing of library preparation, sequence read pre-processing, and gene quantification, is an additional advantage of the platform. Ultimately, the best-suited platform for a specific project will depend on the study goals, design, and availability of different resources. We believe our study presents useful results for making a more informed choice.

The utility of highly multiplexed RNA-seq crucially depends on the trade-off between cost and data quality, and on the nature of the experiments for which the platform would be ideally suitable. These will in general be experiments where the marginal information content of a single profile is relatively low, and thus justifies trading-off some data quality for reduced cost.

### DATA AVAILABILITY

Data for SFL, 3'DGE, and microarray experiments are available through the Gene Expression Omnibus (GEO) at accession numbers GSE118797, GSE118798, and GSE118799. Reviewers may access the data prior to publication using the tokens: mfibskyuxpudxmz, gdypscyerrolvad, and cxojqwuahlgttsl.

#### AUTHOR CONTRIBUTIONS

SM, CP, and JC designed the experiments. EM performed the wet-lab experiments. ER performed all data pre-processing and analysis. ER, SM, JC, and CP interpreted the analysis results. XX and GL refined, implemented, and recorded the SFL protocol. ER, SM, JC, and CP wrote the manuscript. SM oversaw the whole project.

#### FUNDING

This work was supported in part by a Superfund Research Program grant P42ES007381 to SM, a Find the Cause Breast Cancer Foundation (http://findthecause.org) grant to SM, an

#### REFERENCES


Evans Foundation pilot grant to SM and CP, and a LUNGevity Career Development award to JC.

#### ACKNOWLEDGMENTS

Dr. Alexander A. Shishkin for his feedback during the development of the SFL protocol. 3<sup>0</sup> Digital Gene Expression libraries were prepared by the Broad Technology Labs and sequenced by the Broad Genomics Platform, using SCRB-Seq library preparation techniques.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2019.00150/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Reed, Moses, Xiao, Liu, Campbell, Perdomo and Monti. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Read Mapping and Transcript Assembly: A Scalable and High-Throughput Workflow for the Processing and Analysis of Ribonucleic Acid Sequencing Data

Sateesh Peri<sup>1</sup>, Sarah Roberts<sup>2</sup>, Isabella R. Kreko<sup>3</sup>, Lauren B. McHan<sup>3</sup>, Alexandra Naron<sup>3</sup>, Archana Ram<sup>3</sup>, Rebecca L. Murphy<sup>4</sup>, Eric Lyons<sup>1,2</sup>, Brian D. Gregory<sup>5</sup>, Upendra K. Devisetty<sup>2</sup> and Andrew D. L. Nelson<sup>6</sup>\*

#### Edited by:

Filippo Geraci, Italian National Research Council (CNR), Italy

#### Reviewed by:

Cuncong Zhong, University of Kansas, United States Eve Syrkin Wurtele, Iowa State University, United States

> \*Correspondence: Andrew D. L. Nelson an425@cornell.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 16 July 2019 Accepted: 12 December 2019 Published: 24 January 2020

#### Citation:

Peri S, Roberts S, Kreko IR, McHan LB, Naron A, Ram A, Murphy RL, Lyons E, Gregory BD, Devisetty UK and Nelson ADL (2020) Read Mapping and Transcript Assembly: A Scalable and High-Throughput Workflow for the Processing and Analysis of Ribonucleic Acid Sequencing Data. Front. Genet. 10:1361. doi: 10.3389/fgene.2019.01361

<sup>1</sup> Genetics Graduate Interdisciplinary Group, University of Arizona, Tucson, AZ, United States, <sup>2</sup> CyVerse, University of Arizona, Tucson, AZ, United States, <sup>3</sup> LIVE-for-Plants Summer Research Program, School of Plant Sciences, University of Arizona, Tucson, AZ, United States, <sup>4</sup> Biology Department, Centenary College of Louisiana, Shreveport, LA, United States, <sup>5</sup> Department of Biology, University of Pennsylvania, Philadelphia, PA, United States, <sup>6</sup> Boyce Thompson Institute, Cornell University, Ithaca, NY, United States

Next-generation RNA-sequencing is an incredibly powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-sequencing (RNA-seq) become both more complex and greater in number, there is a need to simplify RNA-seq processing workflows, make them more efficient and interoperable, and make them capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTq files) we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to utilize the Discovery Environment for data management, parallelization, and submitting jobs to OSG, and finally, employ the OSG for distributed, high throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists, of any skill level, interested in rapidly and reproducibly analyzing their large RNA-seq data sets.

Keywords: RNA-seq, transcriptomics, high throughput (-omics) techniques, bioinformatics, workflow

## INTRODUCTION

RNA-sequencing (RNA-seq) provides scientists with the ability to monitor genome-wide transcription across numerous cells or tissues and between experimental conditions in a rapid and affordable manner. Data generated from RNA-sequencing are incredibly powerful for differential gene expression analysis (Mortazavi et al., 2008; Li et al., 2016; Schlackow et al., 2017), novel gene discovery (Martin et al., 2013; Nelson et al., 2017), transcriptome-wide structural analysis (Gosai et al., 2015; Anderson et al., 2018), and even transcriptome-wide association studies (Galpaz et al., 2018; Gusev et al., 2019). In addition to generating and examining novel RNA-seq data, scientists are re-examining the hundreds of thousands of publicly available archived datasets to make novel discoveries (Lachmann et al., 2018), an analytical feat that represents a bottleneck for most researchers. The popularity of RNA-sequencing is perhaps most apparent in the dramatic increase in the number of RNA-associated sequence read archives (SRAs) deposited in the National Center for Biotechnology Information's (NCBI's) SRA (Leinonen et al., 2011; Figure 1) over the last 10 years.

FIGURE 1 | RNA-sequencing (RNA-seq) data deposited on National Center for Biotechnology Information (NCBI's) sequence read archive (SRA). SRA run information associated with transcriptomic analyses was downloaded and sorted by year deposited. Tera base pairs (Tbp, 1E+12) of RNA-seq data deposited is shown with the gray line and plotted on the left y-axis. Thousands of experiments deposited, per year, is shown with the black line on the right y-axis.

Alignment-based processing of these massive volumes of RNA-seq data typically involves two computationally intensive steps: mapping reads against a reference genome and transcript assembly. Reference genome based read mapping is performed using splice-aware algorithms such as STAR (Dobin et al., 2013) or HISAT2 (Pertea et al., 2016). The computational cost associated with mapping reads is dependent on the size of the genome and the number of reads to be mapped but typically takes hours to days on a standard lab server. The mapped reads are then used to assemble transcripts using programs such as StringTie or Cufflinks. Transcript assembly is less computationally intensive than read mapping but can still require several hours to complete. In addition to the computational requirements, both of these steps require substantial data storage resources and technical skills in transferring and manipulating large files, further increasing the technological burden for the researcher.

Successful assembly of RNA-seq data is insufficient to achieve the ultimate experimental goal: extraction of meaningful data. Data extraction usually involves differential expression analyses, isoform analysis, or novel gene identification. Each of these analyses requires different input file types and the use of different applications—each with their own intricacies surrounding installation, use, and preference for a Linux environment. In addition, preparing data files and then organizing them into the appropriate file structure for these next steps rapidly becomes tedious when performed on hundreds to thousands of files. Thus, despite the wealth of computing resources, extracting meaningful knowledge from RNA-seq data is still a non-trivial task.

Cloud-computing cyber-infrastructure platforms such as CyVerse (Merchant et al., 2016) and Galaxy (Afgan et al., 2016) have lifted the computational and data management burdens and made RNA-seq analysis more accessible to nontraditional data scientists. In contrast to fee-based services such as the Cancer Genomics Cloud (Lau et al., 2017) or FireCloud (Chet et al., 2017 – doi 10.1101/209494), CyVerse and Galaxy are free to users and provide long-term data storage solutions integrated with limited on-demand cloud compute resources. CyVerse and Galaxy also offer graphical user interface (GUI) platforms which allow researchers with minimal programming experience to easily deploy and handle large volumes of jobs in parallel. A complement to single-source resources like CyVerse and Galaxy is the Open Science Grid [OSG (Pordes et al., 2007)], a distributed computing resource capable of handling hundreds of thousands of jobs and transferring hundreds of petabytes of data per day. Thus, these computational resources make large dataset analysis and re-analysis feasible in a reasonable timeframe and cost-effective way.

Here we introduce RMTA (Read Mapping, Transcript Assembly), a high throughput RNA-seq read mapping and transcript assembly workflow. RMTA is easy to use and incorporates features that move beyond the standard RNA-seq workflow, allowing data scientists to focus their time on downstream analyses. For users with access to and familiarity with high-performance computing (HPC) command-line operations, RMTA is packaged as a Docker container for one-step installation (Table 1). In contrast to other containerized RNA-seq analysis tools (Folarin et al., 2015; Jensen et al., 2018), RMTA is also installed as an app in CyVerse's Discovery Environment, which obviates computing and data storage requirements while providing a GUI for users less familiar with the command-line. Finally, for users querying extremely large data sets, OSG-RMTA marries the computational resources of the OSG with the job scheduling, data storage and management capabilities of CyVerse. Beyond read mapping and assembly, RMTA has a number of additional features that automate onerous data transformation and quality control steps, thus producing outputs that can be directly used for differential expression analysis or novel gene identification. In addition, the output from RMTA may be rapidly integrated in downstream transcriptomic data visualization platforms to help researchers extract meaningful knowledge. RMTA is straightforward to both install and use, and is meant to be used by both advanced and novice data scientists in their examination of their RNA-seq data.

Platforms include the Discovery Environment, a local computer, or high performance computing center, or the Open Science Grid. \*Users wishing to utilize the Open Science Grid (OSG) outside of the Discovery Environment will need their own OSG account.

#### MATERIALS AND METHODS

In this section we provide an overview of RMTA, its different features, and its deployment options.

#### Overview of the Read Mapping and Transcript Assembly Workflow

RMTA automates the three critical steps of RNA-seq analysis: read mapping, transcript assembly, and read counting. For genome-guided read mapping, RMTA utilizes either the splice-aware algorithm HISAT2 or the splice-unaware algorithm Bowtie 2 (Langmead and Salzberg, 2012) for mapping and then StringTie (Pertea et al., 2016) for transcript assembly (Figure 2). Minimum input requirements include a reference genome (FASTA or pre-indexed), and RNA-seq reads as either compressed or uncompressed FASTq, or as a list of one to thousands of SRA IDs. A reference genome annotation file (in GFF/GFF3/GTF) is optional and allows for downstream novel gene identification. RMTA automatically builds a reference genome index (if it is not provided) from the user supplied reference genome, aligns reads to the genome, and then returns a binary encoded version of a sorted sequence alignment map (BAM) file for each input FASTq/SRA. This BAM file is then automatically used as input for StringTie, where it, along with the reference genome annotation, is used to assemble transcripts. Following transcript assembly, each BAM file is processed by featureCounts (Liao et al., 2014) to determine how many reads map back to each gene/exon in the reference genome annotation file.
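The genome-guided steps described above can be sketched as an ordered list of command-line calls. This is a minimal illustration and not RMTA's actual implementation: the helper `build_genome_guided_commands`, the file names, the index basename, and the restriction to single-end input are all assumptions made for the example, and only commonly used options of each tool are shown.

```python
from typing import List


def build_genome_guided_commands(
    genome_fasta: str,
    reads_fastq: str,
    annotation_gtf: str,
    sample: str,
    threads: int = 4,
) -> List[List[str]]:
    """Return the commands for one single-end sample, in execution order.

    Hypothetical helper for illustration only; RMTA's own script may use
    different options and intermediate file names.
    """
    index = "genome_index"  # assumed basename for the HISAT2 index
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # 1. Build the reference index (not needed if an index is supplied).
        ["hisat2-build", genome_fasta, index],
        # 2. Splice-aware mapping of single-end reads.
        ["hisat2", "-p", str(threads), "-x", index, "-U", reads_fastq, "-S", sam],
        # 3. Convert the alignments to a coordinate-sorted BAM file.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # 4. Reference-guided transcript assembly.
        ["stringtie", bam, "-G", annotation_gtf, "-p", str(threads),
         "-o", f"{sample}.gtf"],
        # 5. Count reads per annotated feature.
        ["featureCounts", "-T", str(threads), "-a", annotation_gtf,
         "-o", f"{sample}.counts.txt", bam],
    ]


if __name__ == "__main__":
    # Placeholder file names, not real data.
    for command in build_genome_guided_commands(
        "genome.fa", "sample.fastq", "genes.gtf", "sample"
    ):
        print(" ".join(command))
```

Each command list could be handed to `subprocess.run(command, check=True)` in turn; building the commands separately from running them keeps the sketch easy to inspect and test.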

FIGURE 2 | Read mapping and transcript assembly (RMTA) workflow with suggested downstream analyses. The standard RMTA workflow consists of read mapping by either HISAT2 or Bowtie 2, transcript assembly by StringTie, assembly comparison to the reference annotation by Cuffcompare to identify novel transcripts, and then read counting by featureCounts. Several optional features are included, such as the ability to perform quality control on RNA-sequencing (RNA-seq) data with FastQC, filtering of lowly expressed transcripts, and removal of duplicate reads (Bowtie 2 only). Outputs are listed and are ready for downstream analyses such as those shown.

As an alternative to genome-guided read mapping and transcript assembly, RMTA also allows for read alignment directly to a transcriptome using the quasi-aligner and transcript abundance quantifier Salmon (Patro et al., 2017; Srivastava et al., 2019). Minimum input for Salmon includes a reference transcriptome (in FASTA format) and RNA-seq reads (as above). Salmon maps reads to the provided transcript assembly and then counts the number of reads associated with each transcript, generating an output file (quant.sf) that can immediately be used for differential expression. Salmon is only appropriate when the user wants to rapidly test for differential expression; it cannot facilitate the identification of novel genes or data visualization in a genome browser.
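To show how a quant.sf file might be read for downstream use, the sketch below parses the tab-delimited layout that Salmon normally writes (columns `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`). The function names and the example values are invented for illustration, and the layout of the file produced by RMTA itself is assumed to follow the standard Salmon format.

```python
import csv
import io
from typing import Dict, TextIO


def read_quant_sf(handle: TextIO) -> Dict[str, float]:
    """Map transcript name -> estimated read count (the NumReads column)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


def merge_counts(per_sample: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Combine per-sample counts into transcript -> {sample: count}."""
    merged: Dict[str, Dict[str, float]] = {}
    for sample, counts in per_sample.items():
        for transcript, n_reads in counts.items():
            merged.setdefault(transcript, {})[sample] = n_reads
    return merged


# Invented example values in the standard Salmon column layout.
EXAMPLE = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "transcript_1\t1688\t1439.0\t12.5\t180.0\n"
    "transcript_2\t1623\t1374.0\t3.1\t42.0\n"
)

if __name__ == "__main__":
    counts = read_quant_sf(io.StringIO(EXAMPLE))
    print(merge_counts({"sample_A": counts}))
```

A transcript-by-sample table of this kind is the usual starting point for count-based differential expression tools.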

OSG-RMTA utilizes a similar workflow to RMTA. The primary difference is how the user plans on launching jobs and providing the necessary input data to the OSG. When launched directly from within the OSG through a user's personal account, the user must provide access to all necessary data (e.g. genomes, RNA-seq data, etc). Thus, we recommend users submit jobs to the OSG through CyVerse's Discovery Environment. When jobs are submitted via the Discovery Environment, it automatically prepares the information needed to run the job and submits it to the OSG via HTCondor (Thain et al., 2005) and requires no OSG account (Table 1). Once the job is launched OSG-RMTA uses the information provided by the Discovery Environment to retrieve input files, process the data, and upload the results back to the Data Store, allowing the user to submit and walk away.

RMTA is also available for implementation on an HPC, a public cloud-based computing system (e.g., XSEDE or Atmosphere), or a local compute system. For local or cloud-based computing, a Dockerized version of RMTA identical to that used in the Discovery Environment is available for use inside a Docker command line environment. However, the user will need to direct Docker to the location of the input files and assign the required "flags" that are hidden when using RMTA in the Discovery Environment. More information on how to run the Docker version of RMTA on a Linux/personal computer (PC)/Mac operating system (OS) and a list of all available flags are available here (https://github.com/Evolinc/RMTA). Docker requires root privileges and is therefore unavailable on HPC systems where users are denied superuser ("sudo") access. On such systems, the RMTA Docker image can instead be run with Singularity (Kurtzer et al., 2017; instructions found here: https://sylabs.io/guides/3.4/user-guide/).
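As a rough sketch of what "directing Docker to the location of the input files" involves, the snippet below assembles a `docker run` call that mounts a host directory into the container, together with the equivalent Singularity call for HPC systems. The image name `evolinc/rmta` is taken from the Data and Software Availability section; the `/data` mount point and both helper names are assumptions, and the RMTA flags themselves are deliberately left as a placeholder to be filled in from the README.

```python
from typing import List, Sequence


def docker_command(image: str, host_dir: str, tool_args: Sequence[str]) -> List[str]:
    """Run `image` with `host_dir` mounted at /data inside the container."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",  # expose the input/output directory
        "-w", "/data",              # use it as the working directory
        image, *tool_args,
    ]


def singularity_command(image: str, tool_args: Sequence[str]) -> List[str]:
    """Run the same Docker image through Singularity, without root access."""
    return ["singularity", "run", f"docker://{image}", *tool_args]


if __name__ == "__main__":
    # Placeholder: substitute the flags documented in the RMTA README.
    rmta_args = ["<RMTA flags go here>"]
    print(" ".join(docker_command("evolinc/rmta", "/path/to/project", rmta_args)))
    print(" ".join(singularity_command("evolinc/rmta", rmta_args)))
```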

#### Additional Read Mapping and Transcript Assembly Features

Several additional features have been included in the RMTA workflow to facilitate data discovery and quality control. For users wishing to call single nucleotide polymorphisms from their RNA-seq [or DNA-sequencing (DNA-seq)] data in a high throughput manner, the read aligner Bowtie 2 (Langmead and Salzberg, 2012) has been included as an optional aligner in the RMTA workflow. When the Bowtie option is selected, HISAT2 and StringTie are both removed from the workflow, but the additional option to remove duplicate reads (important for population level analyses) becomes available.

Poor quality RNA-seq reads, particularly at the 5' or 3' ends as a result of adaptor contamination or a drop in sequencing quality, can lead to a significant population of unmapped reads. To help the user identify issues resulting from poor read mapping rates, the quality control tool FastQC (Andrews, 2010) is available as an additional option in the RMTA workflow for both genome- and transcriptome-guided read mapping approaches. FastQC provides the user with both an overview of potential issues with their data and summary graphs highlighting issues such as per base sequence quality and Kmer content. Because FastQC works on read files in FASTq format, and we envision many users running RMTA directly on SRAs, FastQC has been placed downstream of read mapping (Figure 2). When the FastQC option has been selected, BAM files are converted back into FASTq with mapped and unmapped reads, along with their associated quality scores, retained. This FASTq file is then used as input for FastQC, and then deleted afterward to reduce disk usage. If issues are detected at the 5' or 3' ends of sequencing reads, RMTA includes additional options for specifically trimming bases off of either end during the next analysis. Sequencing reads of overall poor quality will simply not be mapped and therefore do not need to be trimmed, but will still be highlighted in the FastQC results.
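The effect of trimming a fixed number of bases from either end of a read can be illustrated with a few lines of code. This is only a conceptual sketch: `FastqRecord` and `trim_read` are hypothetical names, and within RMTA the trimming is assumed to be delegated to the aligner rather than performed this way.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def trim_read(record: FastqRecord, trim_5p: int = 0, trim_3p: int = 0) -> FastqRecord:
    """Remove `trim_5p` bases from the start and `trim_3p` from the end.

    Sequence and quality strings are trimmed together so they stay aligned.
    """
    if trim_5p < 0 or trim_3p < 0:
        raise ValueError("trim lengths must be non-negative")
    end = len(record.sequence) - trim_3p
    end = max(end, trim_5p)  # a read shorter than the trim becomes empty
    return FastqRecord(
        record.name,
        record.sequence[trim_5p:end],
        record.quality[trim_5p:end],
    )


if __name__ == "__main__":
    read = FastqRecord("read1", "ACGTACGTAC", "IIIIIHHHGG")
    print(trim_read(read, trim_5p=2, trim_3p=3))
```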

RMTA is also designed to aid in the identification of novel genes such as long non-coding RNAs from genome-guided transcriptome assemblies. To help the user remove transcript assembly artifacts that can arise from low expression, and therefore improve their attempts at novel gene identification, RMTA has two options for filtering lowly expressed transcripts. The user can decide to filter based on low expression [denoted as fragments per kilobase of transcript per million mapped reads (FPKM)], low/incomplete read coverage (reads per base), or use both filters in combination. We find that applying both filters (e.g., setting them both to one) helps to remove a large percentage of poorly assembled transcripts.
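The combined filter can be sketched as follows, assuming a StringTie-style GTF in which each transcript row carries `FPKM` and `cov` attributes. The function `filter_transcripts` is a hypothetical illustration, not the code RMTA runs, and treating the thresholds as inclusive lower bounds is an assumption.

```python
import re
from typing import Iterable, List

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def filter_transcripts(
    gtf_lines: Iterable[str], min_fpkm: float = 1.0, min_cov: float = 1.0
) -> List[str]:
    """Keep transcripts (and their exons) that pass both thresholds."""
    lines = [line.rstrip("\n") for line in gtf_lines]
    keep = set()
    # First pass: decide which transcript IDs pass, using the transcript rows.
    for line in lines:
        if line.startswith("#") or not line:
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        fpkm = float(attrs.get("FPKM", 0.0))
        cov = float(attrs.get("cov", 0.0))
        if fpkm >= min_fpkm and cov >= min_cov:
            keep.add(attrs["transcript_id"])
    # Second pass: emit comment lines plus every row of a retained transcript.
    kept_lines = []
    for line in lines:
        if line.startswith("#"):
            kept_lines.append(line)
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if attrs.get("transcript_id") in keep:
            kept_lines.append(line)
    return kept_lines
```

Deciding on the transcript rows first and then carrying the decision over to the exon rows avoids leaving orphaned exons in the filtered file.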

## Output From RMTA

The RMTA workflow produces a number of files that are designed to be immediately useful for downstream analyses such as differential expression, novel gene identification, and single-nucleotide polymorphism (SNP) discovery. Directly within the RMTA\_Output folder the user will find the sorted BAM files and the filtered transcript assembly files (in GTF). The naming convention of these files reflects the SRA or FASTq from which they were derived (i.e., the input ID will be prepended to the output files). The filtered transcript assembly file is prepared for immediate use in the novel long non-coding RNA (lncRNA) identification package, Evolinc (Nelson et al., 2017), whereas the sorted BAM file is ready for import and visualization within a genome browser such as EPIC-CoGe (Nelson et al., 2018) or Integrative Genomics Viewer (IGV) (J. T. Robinson et al., 2011). The user will also find a "mapped.txt" file in the RMTA\_Output folder, which contains information about alignment rates for each input FASTq/SRA. Within the RMTA\_Output folder is a subfolder labeled "Feature\_counts" which contains a featureCounts summary.txt file and a tab-delimited file containing the number of reads assigned to each gene/exon for each of the RNA-seq data sets analyzed. If using the transcriptome-guided mapping approach (i.e., Salmon), a single quant.sf file will be generated that will contain the counts of all reads mapped to each transcript in each of the RNA-seq datasets processed. If the user selected the FastQC option, there will be a subfolder within the Output folder called "FastQC\_out." This folder will contain a FastQC.html file for each data set examined. Clicking on this file within the Discovery Environment will open up a new tab in the user's browser where all of FastQC's output information will be displayed.
If the user chose Bowtie as the read aligner and "remove duplicate reads" as an additional option, then the RMTA\_Output folder will only contain a sorted BAM file with duplicates removed for each SRA/ FASTq input file, as well as a mapped.txt file. No additional files will be generated. A similar file/folder structure is generated no matter how an RMTA job has been launched (DE/OSG/HPC).
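For readers who want to load the tab-delimited read-count table programmatically, the sketch below assumes the standard featureCounts layout: a leading comment line, a header row starting with `Geneid`, five annotation columns, and then one count column per BAM file. The function name and example values are invented, and the exact file produced by a given RMTA run should be checked against this assumption.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_feature_counts(handle: TextIO) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return (sample names, {feature ID: counts in sample order})."""
    samples: List[str] = []
    counts: Dict[str, List[int]] = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip the leading "# Program: ..." comment line
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "Geneid":
            samples = fields[6:]  # columns after Length: one per BAM file
            continue
        counts[fields[0]] = [int(value) for value in fields[6:]]
    return samples, counts


# Invented example in the standard featureCounts layout.
EXAMPLE = (
    "# Program:featureCounts (example header)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample_A.sorted.bam\tsample_B.sorted.bam\n"
    "gene_1\tChr1\t3631\t5899\t+\t1688\t120\t7\n"
    "gene_2\tChr1\t6788\t9130\t-\t1623\t0\t33\n"
)

if __name__ == "__main__":
    names, table = read_feature_counts(io.StringIO(EXAMPLE))
    print(names)
    print(table)
```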

#### Deployment Options

The different deployment options for RMTA and the benefits associated with each are summarized in Table 1. RMTA is freely available as an app (RMTA v2.6.3) within CyVerse's Discovery Environment (https://wiki.cyverse.org/wiki/display/DEapps/RMTA+v2.6.3). Running RMTA within the Discovery Environment allows the user to take advantage of CyVerse's simplified data management and storage options through the Data Store. In addition, integrated in the Discovery Environment are a number of virtual interactive computing environment (VICE) apps, such as the DESeq2 RStudio app, that allow users to examine their data start to finish completely in the cloud (https://learning.cyverse.org/projects/vice/en/latest/). OSG-RMTA (v2.6.3) is available as a separate app within the Discovery Environment. Although the OSG-RMTA app outwardly looks identical to RMTA, CyVerse submits jobs to the OSG on behalf of the user and automates data management and transfer between the Data Store and the OSG. RMTA is available as a Docker image (https://hub.docker.com/r/evolinc/osg-rmta/) for easy installation in a command line environment (e.g., XSEDE or PC) where Docker is already installed or where the user has the necessary privileges to install Docker. Additionally, Docker images can be run with Singularity (Kurtzer et al., 2017), which enables launching RMTA within an HPC environment. Having RMTA packaged within a Docker container removes the need to install prerequisite software. For users with an OSG account and for whom a CyVerse account is unnecessary, OSG-RMTA is already present on the OSG as a Docker image for immediate use. A brief tutorial on how to use RMTA and OSG-RMTA in the command line and OSG, respectively, can be found in the README.md at (https://github.com/Evolinc/RMTA). Finally, a stripped down version of RMTA (few visible options) aimed at introducing undergraduates to the concepts of RNA-seq is also available in the Discovery Environment (RMTA\_Instructional) with instructions at (https://wiki.cyverse.org/wiki/display/DEapps/RMTA\_Instructional).

#### Additional Discovery Environment-Specific Features to Simplify Ribonucleic Acid Sequencing Analysis

Although RMTA and OSG-RMTA are packaged as Docker images for use outside of CyVerse's Discovery Environment (e.g., OSG or an HPC), we highly recommend using the Discovery Environment integrated RMTA apps to take advantage of both the Discovery Environment's GUI and CyVerse's integrated Data Store. The Data Store makes data management relatively easy [drag 'n' drop as opposed to shipping hard drives to Amazon Web Services (AWS) (Zhao et al., 2013)]. A number of up-to-date genomes are available in the community Data Store and the Discovery Environment has an application programming interface (API) that can acquire any of the 50,000 additional genomes from CoGe (Lyons et al., 2008) or public/private databases if needed. A Discovery Environment app has also been developed to retrieve GTF and BAM files from subdirectories generated for each SRA (File\_Select v1.0) and place them into a single, user-specified folder, making data management even easier.

Researchers running OSG-RMTA in the Discovery Environment can take advantage of two features that facilitate a "divide and conquer" approach to job submission to the OSG. Long (> 1,000s) lists of SRAs can be divided up into smaller lists using the File\_Split v1.0 app. The Discovery Environment's HT Analysis Path List file feature then uses these lists to parallelize their job submissions to the OSG (https://wiki.cyverse.org/wiki/display/TUT/Parallel+execution,+DE+(Discovery+Environment)+style). Thus, a thousand SRAs can be processed in roughly the same time it would take to process 100. All of this happens with a few clicks of a button.
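The list-splitting step of this "divide and conquer" approach is simple to reproduce outside the Discovery Environment. The sketch below is an independent illustration of the idea, not the File\_Split app itself; `split_ids` and the placeholder IDs are hypothetical.

```python
from typing import List, Sequence


def split_ids(ids: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a list of run IDs into consecutive chunks of at most `chunk_size`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [list(ids[i:i + chunk_size]) for i in range(0, len(ids), chunk_size)]


if __name__ == "__main__":
    # Placeholder identifiers, not real SRA accessions.
    run_ids = [f"RUN{i:04d}" for i in range(1000)]
    chunks = split_ids(run_ids, 100)
    print(len(chunks), "lists of", len(chunks[0]), "IDs each")
```

Each resulting list can then be submitted as its own job, so the wall-clock time is governed by the size of one chunk rather than by the full list.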

## Data and Software Availability

RMTA and OSG-RMTA are freely available to use as an app on CyVerse's Discovery Environment or on the Open Science Grid (https://hackmd.io/s/rJjrqyAAQ). Detailed instructions on how to use RMTA in the Discovery Environment can be found at (https://wiki.cyverse.org/wiki/display/DEapps/RMTA+v2.6.3). Working within the Discovery Environment requires a modern hypertext markup language 5 (HTML5) capable browser and a free CyVerse user account (user.cyverse.org). Users wishing to use OSG-RMTA on the OSG directly (not through the Discovery Environment) will need an account (http://osgconnect.net/). The source code of the workflow is available at https://github.com/Evolinc/RMTA and https://github.com/Evolinc/OSG-RMTA and the Docker images for users wishing to adapt RMTA to novel environments are available at https://hub.docker.com/r/evolinc/rmta and https://hub.docker.com/r/evolinc/osg-rmta/. Test data for RMTA are present in the Discovery Environment and on GitHub.

#### Data Visualization in EPIC-CoGe, Long Non-Coding Ribonucleic Acid Identification With Evolinc, and Analysis of Gene Expression

Two sorted.bam files, SRR2240264 (flower) and SRR2240265 (root) from an RMTA run on 100 paired-end (PE) SRAs were uploaded to EPIC-CoGe from CyVerse's Data Store using the LoadExp+ tool (Grover et al., 2017) in CoGe. Expression data were associated with the Arabidopsis thaliana (Col-0) genome (v10.02, id 16911). These two datasets are publicly available in the RMTA folder (id 2568) at www.genomevolution.org. To identify lncRNAs, all 100 "filtered.gtf" files from the 100 PE RMTA analyses were added to an HTPathlist file in the Discovery Environment. This HTPathlist file was then used as the input for a single Evolinc analysis in the Discovery Environment (Evolinc v1.7.5; Nelson et al., 2017). The updated annotation files from each Evolinc job were merged using the Evolinc\_merge app (v1.0). FASTA sequences for all identified lncRNAs were extracted using the gffread utility in the Cufflinks package. GC content and length of all Arabidopsis protein-coding genes and the newly identified lncRNAs were calculated using a custom Perl script (File S1). Principal component analyses were generated in R (code in File S2) using the read count data from RMTA.
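The length and GC content calculation is straightforward to express in code. The sketch below is an independent illustration and not the Perl script in File S1; the function names are hypothetical, and GC content is assumed to be reported as a percentage of all bases in the sequence.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into {sequence ID: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def length_and_gc(sequence: str) -> Tuple[int, float]:
    """Return (length, GC content in percent) for one sequence."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        return 0, 0.0
    gc = seq.count("G") + seq.count("C")
    return length, 100.0 * gc / length


if __name__ == "__main__":
    records = parse_fasta([">example_1 toy sequence", "ACGT", "GGCC"])
    for seq_id, seq in records.items():
        print(seq_id, *length_and_gc(seq))
```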

#### RESULTS

To demonstrate the utility of RMTA, we used our workflow (Figure 2) to process 1,000 A. thaliana SRAs (single-end reads) and 100 SRAs (paired-end reads) directly from NCBI's data repository, representing 1.27 terabases of RNA-seq data (Table 2 and Table S1). SRA IDs were obtained from NCBI's SRA by searching the term "Arabidopsis thaliana" and then exporting all summary results to a tab-delimited file using NCBI's "Send to" API. For downstream analysis of the PE data, specific RNA-seq data from root (n = 68) and flower (n = 32) tissue were chosen from these summary results (Table S1). PE and single-end (SE) SRA IDs were copied into new list files in the Discovery Environment and partitioned into lists of 10 and 100 IDs, respectively, using File\_Split v1.0. The resulting 10 list files for each data type were then added to an HT Analysis Path List that subsequently became the input for the RMTA app. Two analyses were launched in the Discovery Environment (one for PE and one for SE) whereupon they were automatically divided up and submitted simultaneously as 10 jobs each. Specific options selected for these analyses were: HISAT2 for the aligner, an FPKM and coverage cut-off threshold of 1, and Run FastQC selected. All other options were left as default.

Mapping rates and time to completion are shown in Table 2 and Table S1. While the mapping rates for most (76%) of the PE SRAs were >90% (avg = 92.7%, Table 2), six SRAs displayed rates <75%. FastQC results were interrogated to identify potential reasons for the low mapping rates and to determine whether 5' or 3' trimming of reads might improve mapping. FastQC results revealed a significant enrichment of adaptor sequence for these samples. A subsequent relaunching of RMTA with 15 nts trimmed from the 5' end (an option within RMTA) resulted in improved mapping rates for all six samples. This demonstrates the utility of being able to analyze hundreds of SRAs at once with default settings, and then follow up with adjusted parameters for problematic samples.

TABLE 2 | Mapping rates and time to completion for the example read mapping and transcript assembly (RMTA) analyses.

RMTA was used to process 100 paired-end (PE) and 1,000 single-end (SE) Arabidopsis sequence read archives (SRAs). The percentage of these SRAs with mapping rates >90%, 90–75%, etc., are shown. Gbp = 1 × 10<sup>9</sup> base pairs mapped. Mbp/minute = million base pairs mapped per minute.

We then demonstrated three ways in which RMTA can facilitate downstream analysis: 1) by visualizing the RMTA generated BAM files in the EPIC-CoGe genome browser, 2) by testing for variation between datasets using the RMTA featureCounts output, and 3) by identifying novel lncRNAs using RMTA's filtered genome annotation output. Users often wish to sanity check their RNA-seq data by viewing them in a genome browser. A benefit of performing RMTA in CyVerse's Discovery Environment is the ability to immediately import the large mapped read (BAM) files from the Data Store into the EPIC-CoGe genome browser (Nelson et al., 2018). Genomes for over 19,000 organisms are available on CoGe, meaning that the user will not only be able to visualize their RNA-seq data, but can also import genomes from CoGe into the Discovery Environment to supplement the genomes already available. Two of the Arabidopsis PE-SRAs were imported into EPIC-CoGe (publicly available in the CoGe folder "RMTA," ID: 2568) for public browsing (Figure 3A). For users performing RMTA locally (i.e., in a Docker container), genome browsers such as IGV (J. T. Robinson et al., 2011) are freely available and easy to use.

Sample variation within the 100 Arabidopsis PE-SRAs, consisting of RNA-seq data from 68 root and 32 flower samples (see Table S1 for IDs), was examined using the RMTA-produced table of exon associated read counts (feature\_counts.txt). While these analyses can occur using Discovery Environment RStudio VICE app deployments of DESeq2 (Love et al., 2014) or EdgeR (M. D. Robinson et al., 2010; https://learning.cyverse.org/projects/vice/en/latest/user\_guide/quick-rstudio.html), the output file from featureCounts is small and manageable, so it was downloaded and manipulated in a local R environment (i.e., RStudio; RStudio Team, 2015; R-code available in File S2). A principal component analysis (PCA) demonstrated that, as expected, the largest amount of variation between samples (PC1) could be explained by tissue (Figure 3B). This analysis demonstrates the ease with which researchers can validate their RNA-seq data and proceed to differential expression analyses using the RMTA workflow.
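To make the PCA step concrete, the following dependency-free sketch computes the first principal component score for each sample from a samples-by-genes count table. It is not the R code in File S2: the function name, the log2(count + 1) transform, and the use of power iteration on the sample-by-sample Gram matrix are all choices made for this illustration, and the sign of the returned scores is arbitrary.

```python
import math
from typing import List


def first_principal_component(
    samples: List[List[float]], iterations: int = 200
) -> List[float]:
    """Return the PC1 score of each sample (rows = samples, columns = genes)."""
    n = len(samples)
    n_genes = len(samples[0])
    # Log-transform counts, then center each gene across samples.
    logged = [[math.log2(value + 1.0) for value in row] for row in samples]
    means = [sum(row[j] for row in logged) / n for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in logged]
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC1 scores.
    gram = [
        [sum(a * b for a, b in zip(centered[i], centered[k])) for k in range(n)]
        for i in range(n)
    ]
    vector = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(gram[i][k] * vector[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return [0.0] * n  # no variation between samples
        vector = [x / norm for x in product]
        eigenvalue = norm  # converges to the largest eigenvalue
    return [x * math.sqrt(eigenvalue) for x in vector]


if __name__ == "__main__":
    # Toy counts: three samples of one tissue followed by three of another.
    toy = [
        [100, 5, 50], [110, 6, 48], [95, 4, 52],
        [5, 100, 50], [6, 90, 49], [4, 105, 51],
    ]
    print(first_principal_component(toy))
```

With toy data of this kind the two groups of samples fall on opposite sides of zero along PC1, which mirrors the tissue separation described above.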

The filtered genome annotation produced by RMTA is well suited for novel gene identification without any additional data transformation. To demonstrate this, the RMTA output file "filtered.gtf," with transcripts with an FPKM or read/base <3 removed, was used as input in the long non-coding RNA identification pipeline, Evolinc (v1.7.5; Nelson et al., 2017) in the Discovery Environment. Like RMTA, Evolinc is also packaged as a Docker image for local discovery. Evolinc was used to identify putative lncRNAs expressed in the root and flower RNA-seq data. The number of lncRNAs identified and some basic characteristics, such as average length and GC content relative to nuclear and organellar protein-coding genes, are shown in Figures 3C, D, with the R code necessary to recapitulate the images available in File S3. Reads mapped to these lncRNAs, and other novel genes, can also be visualized in a genome browser using the BAM files generated by RMTA (Figure 3E). In sum, RMTA is not only a simple and intuitive means of processing large amounts of RNA-seq data, but also facilitates commonly performed downstream analyses.

### DISCUSSION

As the technical and financial barriers to generating raw RNA-seq data are reduced, the barrier to discovery will shift toward the computational steps required to analyze those data and the integration with other software for extracting high-value knowledge and novel scientific insights. RMTA was designed with the goal of alleviating many of the tedious or time-consuming steps of RNA-seq processing and downstream data analysis. This goal was primarily accomplished by incorporating the three main steps of RNA-seq processing (read mapping, assembly, and counting) into a very approachable, yet scalable and interoperable tool, and by ensuring that the output files from RMTA are easily ingested by other platforms and analysis tools.

The usefulness of RMTA is most apparent when utilized within CyVerse's Discovery Environment. Access to public or private genomes (through CoGe and the Data Store), automatic data retrieval from NCBI's SRA, data management through the Data Store, and job submission within the Discovery Environment or direct to the Open Science Grid, means that data scientists can perform all of their analyses in the cloud. In addition, users can take advantage of parallelizable job submission options that are available to divide and conquer their large datasets. Once finished, RMTA produces output files that are ready for immediate analysis (e.g., differential expression), visualization (e.g., in a genome browser), or novel gene identification (e.g., long non-coding RNAs), all of which can also occur in the cloud. In sum, large-scale RNA-seq analysis is no longer limited to data scientists with HPC access or a high-end local computer.

RMTA was also designed for users who prefer to perform analyses locally. By packaging RMTA in a Docker container we have removed the tedious task of installing prerequisite software and made RMTA capable of running on any operating system. Thus, processing and analysis of RNA-seq data is no longer restricted to a Linux machine but can now also be performed on a machine utilizing Windows or Mac OS. In addition, recognizing data storage limitations, RMTA removes unnecessary files generated during the analysis that would rapidly fill up most storage allotments.

Not all data scientists have the same needs in terms of available features or in the amount of data to be processed. To this end we developed variants of RMTA targeting undergraduate or high school instructors (RMTA\_instructional), users processing 1–100s of data files (RMTA), and users processing 1,000s or more data files (OSG-RMTA). RMTA\_instructional is available as an app in the Discovery Environment with minimal fields exposed, with example input files added to the appropriate fields, and with entry level descriptions of the purpose behind each field. RMTA and OSG-RMTA are available as both Discovery Environment apps and Docker images, with OSG-RMTA already available on the OSG. RMTA and OSG-RMTA offer the same features, differing only in where and how jobs are submitted.

In summary, RMTA opens up the task of RNA-seq processing and data analysis to anyone with access to a web browser, thereby democratizing data discovery. It also enables analysis of all transcripts, not just the ones matching already annotated genes, thus encouraging a more inclusive view of what genomic regions are actually transcribed. Finally, RMTA serves as a useful tool for savvy data scientists wishing to reduce the time and effort necessary to process large data sets.

#### DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. Accession numbers can be found here: Table S1.

#### AUTHOR CONTRIBUTIONS

ADN, UD, EL, and BG designed the workflow. ADN, UD and SP wrote the code. SR and UD integrated RMTA into the OSG. SP, IK, LM, AN, AR, and RM tested RMTA, resolved bugs, and wrote the tutorials. SP, IK, LM, AN, AR, and RM analyzed the data. SP, RM, EL, and ADN wrote the manuscript. All authors read and approved the manuscript.

#### FUNDING

This work has been supported by the National Science Foundation grants IOS-1758532 (to ADN, RM, and UD), IOS-1743442 (to CyVerse), IOS-1444490 (to EL and BG), NSF Research Experience for Undergraduates (REU to IK, LM, and AN), and an NSF Research Assistantship for High School Students (RAHSS to AR). As AR is a minor, parental consent has been given to include her as an author on this manuscript.

#### ACKNOWLEDGMENTS

We would like to thank CyVerse for technical advice and application implementation. We would like to thank Jennifer Meneghin for posting her custom Perl script to the web in 2010. We would also like to thank Dr. Nirav Merchant (University of Arizona) for feedback on the manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01361/full#supplementary-material

TABLE S1 | SRA IDs, tissue information, and mapping rates of example data.




Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Peri, Roberts, Kreko, McHan, Naron, Ram, Murphy, Lyons, Gregory, Devisetty and Nelson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrative Differential Expression Analysis for Multiple EXperiments (IDEAMEX): A Web Server Tool for Integrated RNA-Seq Data Analysis

Verónica Jiménez-Jacinto<sup>1</sup>, Alejandro Sanchez-Flores<sup>1</sup>\* and Leticia Vega-Alvarado<sup>2</sup>\*

<sup>1</sup> Unidad Universitaria de Secuenciación Masiva y Bioinformática, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Mexico, <sup>2</sup> Instituto de Ciencias Aplicadas y Tecnología, Universidad Nacional Autónoma de México, Mexico City, Mexico

#### Edited by:

Monica Bianchini, University of Siena, Italy

#### Reviewed by:

Zeeshan Ahmed, University of Connecticut, United States; Gaurav Sablok, Finnish Museum of Natural History, Finland

#### \*Correspondence:

Alejandro Sanchez-Flores, alexsf@ibt.unam.mx; Leticia Vega-Alvarado, leticia.vega@icat.unam.mx

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 15 December 2018; Accepted: 13 March 2019; Published: 29 March 2019

#### Citation:

Jiménez-Jacinto V, Sanchez-Flores A and Vega-Alvarado L (2019) Integrative Differential Expression Analysis for Multiple EXperiments (IDEAMEX): A Web Server Tool for Integrated RNA-Seq Data Analysis. Front. Genet. 10:279. doi: 10.3389/fgene.2019.00279

Current DNA sequencing technologies and their high-throughput yield have allowed genomic and transcriptomic experiments to thrive, but they have also generated a big data problem. With this exponential growth of sequencing data, the complexity of managing, processing and interpreting it in order to generate results has also risen. Therefore, the demand for easy-to-use software and websites to run bioinformatic tools is pressing. In particular, RNA-Seq and differential expression analysis have become a popular and useful method to evaluate gene expression changes in any organism. However, many scientists struggle with the data analysis since most of the available tools are implemented in a UNIX-based environment. Therefore, we have developed the web server IDEAMEX (Integrative Differential Expression Analysis for Multiple EXperiments). The IDEAMEX pipeline needs a raw count table for as many replicates and conditions as desired, allowing the user to select which conditions will be compared, instead of doing all-vs.-all comparisons. The whole process consists of three main steps: (1) Data Analysis, which allows a preliminary analysis for quality control based on the data distribution per sample, using different types of graphs; (2) Differential expression, which performs the differential expression analysis with or without batch effect error awareness, using the Bioconductor packages NOISeq, limma-Voom, DESeq2 and edgeR, and generates reports for each method; (3) Result integration, in which the integrated results are reported using different graphical outputs such as correlograms, heatmaps, Venn diagrams and text lists.
Our server allows easy and friendly visualization of results and easy interaction during the analysis process, as well as error tracking and debugging through output log files. The server is currently available and can be accessed at http://www.uusmb.unam.mx/ideamex/ where the documentation and example input files are provided. We consider that this web server can help other researchers with no previous bioinformatic knowledge to perform their analyses in a simple manner.

Keywords: bioinformatics, RNA-Seq, differential expression, NGS, transcriptomics

## INTRODUCTION

fgene-10-00279 March 29, 2019 Time: 15:35 # 2

Transcriptomics experiments have been used widely to measure the RNA levels expressed in tissues or cells from practically any organism. This approach has been used since the implementation of Northern blot hybridization analysis and was scaled up by the development of microarray technology. More recently, transcriptomics has been improved by sequencing technologies, and RNA sequencing (RNA-Seq) experiments have been replacing microarrays to evaluate gene expression at a genome-wide scale. Both microarray and RNA-Seq technologies have therefore generated a massive amount of data that demands ad hoc methods to fully analyze and compare gene expression between different conditions, tissues or cell populations for a given organism.

To quantify the transcription levels and identify differentially expressed genes under different conditions, using RNA-Seq data from high-throughput sequencing technologies, a general workflow can be described: (1) quality control of RNA-Seq reads (Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data); (2) read trimming or filtering (Chen et al., 2017; Roser et al., 2018); (3) mapping trimmed/filtered reads to a reference (genome or transcriptome) (Li and Durbin, 2009; Langmead and Salzberg, 2012; Kim et al., 2013; Wu et al., 2016); (4) obtaining the read count for each gene (Quinlan and Hall, 2010; Li and Dewey, 2011; Roberts et al., 2011) and (5) differential expression analysis (Anders and Huber, 2010; Tarazona et al., 2011; McCarthy et al., 2012; Love et al., 2014; Ritchie et al., 2015). Currently, due to the size of datasets, steps 1 to 4 have to be performed by the user; many tools are available for each step and have been widely used and cited elsewhere. However, the differential expression analysis is probably the most important step, since it allows the user to interpret the biological information regarding the expression profiles of a given organism under different conditions.

The gene expression profile contains the information regarding genes related to the organism's response to a certain condition. To retrieve such information, the differential expression analysis has to be performed, and it requires statistical methods to differentiate between expression changes due to the tested conditions and biological "noise" or variability. Currently, several computational tools have been developed, mainly in the programming language R, and packages are available at the Bioconductor project repository (Huber et al., 2015). However, the R language and its packages have to be used mainly through a UNIX-based operating system and by command-line instructions, which requires a certain level of programming skill. Therefore, non-bioinformatics researchers demand either a Graphical User Interface (GUI) for differential expression tools or web-based applications. A GUI-based solution still requires a local installation of all packages needed for the differential expression analysis, and this can remain challenging. Web-based applications are now emerging (de Jong et al., 2015; Monier et al., 2018; Zhang et al., 2018) as a friendlier option to perform the differential expression analysis without installing software on a local computer.

Here, we introduce the IDEAMEX web server (Integrative Differential Expression Analysis for Multiple EXperiments), which uses as input an RNA-Seq raw count table in text format and generates results using the Bioconductor packages NOISeq, limma-voom, DESeq2 and edgeR. These packages have been constantly benchmarked and have presented the most reliable results with different datasets and gold standards (Seyednasrollah et al., 2015; Costa-Silva et al., 2017). In this work, we demonstrate the functionality of IDEAMEX using RNA-Seq data from a previous publication (Olvera et al., 2017), in which a differential expression analysis in tilapia liver was performed, in addition to other datasets used as examples to test the website.

Our server has been used in several projects and has been visited from different locations worldwide, as recorded in our site tracker. IDEAMEX is available and can be accessed at http://www.uusmb.unam.mx/ideamex/ where the documentation and example input files are provided. Our server offers a web-based analysis that can help researchers with no previous bioinformatic knowledge to perform their transcriptomic analyses in a simple manner, in order to interpret the biological data contained in their RNA-Seq experiments.

#### MATERIALS AND METHODS

#### Web Server Description

The web page is hosted by the "Unidad Universitaria de Secuenciación Masiva y Bioinformática" core lab facility at the "Instituto de Biotecnología" of the "Universidad Nacional Autónoma de México," Campus Morelos, located in Cuernavaca, Morelos, México. The server is a Linux computer running Ubuntu 14.04 LTS with the following main hardware characteristics: Intel Core i7 4770 processor, 32 GB of DDR3 RAM and 1 TB of hard disk storage.

The deployment was implemented using the Apache HTTP server version 2.4.7 with a PHP v5.5.9 front-end that coordinates the writing of the input and output files to a SQL database through a PostgreSQL relational database management system (RDBMS) server (psql version 9.3.22). The installed R version is 3.5.2. The web server can be accessed at http://www.uusmb.unam.mx/ideamex/.

The web server interface has been tested using different web browsers and different operating systems. Using Microsoft Windows 10: Microsoft EdgeHTML 17.17134; Google Chrome version 72.0.3626.109 (Official Build) (64-bit); Mozilla Firefox Quantum 63.0 (64-bit). Using MacOS Sierra 10.13.6: Safari 12.0.3; Google Chrome 71.0.3578.98 (64-bit). Using Linux Ubuntu 16.04 LTS: Mozilla Firefox Quantum 65.0.

Additionally, the scripts and binaries used in the web server can be found in the public repository https://github.com/leticiaVega/IDEAMEX

## RNA-Seq Examples and Data From Tilapia Liver Experiment

We used two datasets as examples to test our website. The first example contains data from the Pasilla Bioconductor library (Brooks et al., 2010), taking into account only the gene-level counts. This dataset contains RNA-Seq count data for treated and untreated cells from the S2-DRSC cell line. The second example file, which can be used to test the batch effect error awareness, was taken from the NBPSeq CRAN package (Di et al., 2014). This dataset contains the Arabidopsis thaliana RNA-Seq data (Cumbie et al., 2011), comparing ΔhrcC-challenged and mock-inoculated samples. In this case, the samples were collected in three batches.

We also obtained publicly available, previously reported RNA-Seq data (Olvera et al., 2017) that were generated to determine the effect of exogenous 3,5-di-iodothyronine (T2) and 3,5,3′-tri-iodothyronine (T3) treatment on the transcriptome of tilapia (Oreochromis niloticus) liver. For the control and each hormone treatment, two biological replicates were generated. The FASTQ raw data can be found under the following SRA identifiers: SRX2630485, SRX2630486, SRX2630487, SRX2630488, SRX2630489, and SRX2630490.

Briefly, the quality control (QC) and filtering of the raw data were performed using the FastQC software (Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data), and contamination and adapter removal was carried out using in-house Perl scripts. QC'ed reads were mapped using the Bowtie 1.1.234 aligner (Langmead et al., 2009) to the annotated Oreochromis\_niloticus (Orenil1.0.cds.all, 21,437 coding genes) CDS dataset downloaded from the Ensembl repository database (Aken et al., 2016) using the BioMart utility. Quantification and repetitiveness normalization were carried out using eXpress software 1.535 (Roberts et al., 2011). Total effective counts for each sample were merged, and a matrix was generated using the "abundance\_estimates\_to\_matrix.pl" Perl script included in the Trinity pipeline (Grabherr et al., 2011; Roberts et al., 2011). The resulting matrix was used as input for the differential expression analysis in the IDEAMEX web server. The selected parameters were: p-adj/FDR = 0.05; logFC = 2; CPM = 1.

## Differential Expression Packages

Based on the parameters defined by the user, 4 different R (version 3.5.2) packages for differential expression analysis are run: edgeR version 3.24.3 (Anders and Huber, 2010), using TMM normalization method (works with or without replicates); limma-Voom version 3.38.3 (Ritchie et al., 2015), using log2 counts per million normalization method (works with replicates only); DESeq2 version 1.22.2 (Love et al., 2014), with DESeq2 default normalization method (works with or without replicates) and NOISeq version 2.26.1 (Tarazona et al., 2011), with TMM normalization method (works with or without replicates). Other packages used in the server are: VennDiagram 1.6.20; ggplot2 3.1.0; UpSetR 1.3.3; corrplot 0.84 and ComplexHeatmap 1.20.0. The packages can change depending on the R programming language version, but all changes are reported to the user in log files that contain all details about the commands and parameters used for the analysis.

## RESULTS

## The IDEAMEX Web Server Implementation

The general workflow used in the IDEAMEX web server can be observed in **Figure 1**. First, the user has to enter a valid email address, which will be used to report the progress of the differential expression analysis. In a nutshell, the pipeline starts with a raw count table for as many replicates and conditions as desired, allowing the user to select which conditions will be compared, instead of doing all-vs.-all comparisons. After the web server validates the input format, the user can edit the sample names and select one or more differential expression methods and the parameters to filter results. Additionally, the user can indicate whether the samples belong to different batches, so that the selected differential expression methods can correct any possible batch effect. Then, the data analysis step is performed, in which a preliminary quality control report is generated based on the data distribution per sample. Next, the differential expression analysis is performed using one or more selected methods. Finally, the results from the different selected methods are integrated and reported using Venn diagrams, an UpSet bar plot and text files for further filtering and analysis. Several additional plots are generated, including heatmaps and correlograms to check the consistency between some calculations. Further details and case studies for the example datasets are described in the IDEAMEX User Manual, which can be downloaded from the website. To demonstrate the functionality of our web server, we used a dataset generated from an RNA-Seq experiment to compare the effect of thyroid hormones in tilapia liver (see Materials and Methods).

Optionally, the user can perform a full registration at the IDEAMEX homepage in order to keep track of all project results. The sample name format should have a prefix\_[0-9] structure: nameCond1\_1, nameCond1\_2, . . . , nameCond1\_n, nameCond2\_1, nameCond2\_2, . . . , nameCond2\_m. Once the input file is validated, the server infers the condition from the prefix before the underscore symbol, and the replicate number is the digit after the underscore symbol. However, during the input loading, the user can edit these names. If samples were prepared in different batches, this information can be specified in the same window in which the sample names are edited. Indicating that samples belong to different batches turns on the batch effect error correction of the different methods. Importantly, this option should be used only if it is known that samples from a given condition were prepared in a different batch, which can add extra variability to the experiment. The user manual includes a case study for samples with a batch effect.
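The naming convention can be illustrated with a minimal Python sketch. It is not the server's own code: the function names `parse_sample_name` and `group_replicates` are hypothetical, and the pattern accepts replicate numbers of more than one digit, which is an assumption beyond the format described above.

```python
import re
from collections import defaultdict

# Hypothetical pattern: everything before the last underscore is the
# condition; the digits after it are the replicate number.
SAMPLE_PATTERN = re.compile(r"^(?P<condition>.+)_(?P<replicate>[0-9]+)$")


def parse_sample_name(name):
    """Split a sample name such as 'liverT2_1' into ('liverT2', 1)."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"sample name {name!r} does not follow the name_N format")
    return match.group("condition"), int(match.group("replicate"))


def group_replicates(sample_names):
    """Map each condition to the list of replicate numbers found for it."""
    groups = defaultdict(list)
    for name in sample_names:
        condition, replicate = parse_sample_name(name)
        groups[condition].append(replicate)
    return dict(groups)


samples = ["liverC_1", "liverC_2", "liverT2_1", "liverT2_2", "liverT3_1", "liverT3_2"]
print(group_replicates(samples))
```

A name without the underscore and number, such as `liverC`, raises a `ValueError` in this sketch, which mirrors the idea of validating the input before the analysis starts.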

In this work, the samples were named liverC\_1 and liverC\_2 for replicates of the control condition (no treatment), and liverT2\_1, liverT2\_2, liverT3\_1 and liverT3\_2 for replicates corresponding to each of the 3,5-T2 and 3',3,5-T3 (T2 and T3) thyroid hormone treatments. A raw count table (**Supplementary Material S1**) in tab-separated text format was generated and fed to the IDEAMEX web server. A snippet of the input raw count table is shown in **Table 1**.

or more methods for differential expression analysis, data analysis and results integration. An optional step to edit the sample names is available. The user designs the comparison matrix by selecting which conditions will be compared. A link to the results is generated and after a few minutes, the results are presented in the Analysis Results web page.

#### TABLE 1 | Raw count table example.

The sample names denote the condition naming and replicate number. liverC\_N, RNA-Seq counts for liver tissue with no treatment. liverT2\_N, RNA-Seq counts for liver tissue with T2 hormone treatment. liverT3\_N, RNA-Seq counts for liver tissue with T3 hormone treatment. N, replicate number. Raw count table should be in simple text format.

#### Input and Data Quality Control

The next step is to select the differential analysis method(s), the data quality analysis and the result integration by clicking on each box. It is recommended to click on the "select all" box to perform a full analysis. Afterward, the cut-off values for statistical confidence (p-adj and False Discovery Rate [FDR]), normalization (CPM) and transcript abundance difference (logFC) can be selected. The comparison matrix can also be defined to establish which samples or conditions will be compared.

A link to the Analysis Results web page will be generated, where the user can find a link to the "(1) Data Analysis" section. A series of plots is displayed, allowing the user to carry out a preliminary quality control analysis based on the data distribution per sample. All conditions defined in the raw count table are depicted as boxplots, CPM bar plots, density plots, principal components analysis (PCA) plots and multidimensional scaling (MDS) plots. Inspection and evaluation of these plots are essential steps for the interpretation of the differential expression analysis.

#### CPM Plot Evaluation

In gene expression analysis, only a fraction of genes is expected to show differential expression between experimental conditions. The counts per million (CPM) plot shows the number of genes within each sample having no counts (CPM = 0) or more than 1, 2, 5, or 10 CPM. This plot can help the user decide the threshold for removing genes with very low expression in any of the experimental conditions. The default CPM cut-off value of 1 can be changed according to the user's judgment, but this requires re-running the analysis.

As observed in **Figure 2**, there is an increase in genes with CPM > 10 in the T2 and T3 samples compared to the C condition. Also, the group of genes with CPM = 2 decreased in T2 and T3 compared to the C condition. Approximately 70% of the genes presented no counts. This plot gives a first glance at the expression profile of the compared conditions. For this particular case, CPM = 1, the default option, is a convenient cut-off value.
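The quantity behind the CPM plot can be sketched in a few lines of Python. This is an illustration only, not the server's implementation: the function names and the toy counts are made up, and the categories are computed as cumulative thresholds ("more than 1, 2, 5, or 10 CPM"), which is an assumption about how the bars are built.

```python
def counts_per_million(counts):
    """Convert raw counts for one sample (dict gene -> count) to CPM."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no counts")
    return {gene: count * 1_000_000 / library_size for gene, count in counts.items()}


def cpm_summary(counts, thresholds=(1, 2, 5, 10)):
    """Count genes with CPM = 0 and genes above each CPM threshold."""
    cpm = counts_per_million(counts)
    summary = {"CPM = 0": sum(1 for value in cpm.values() if value == 0)}
    for threshold in thresholds:
        summary[f"CPM > {threshold}"] = sum(1 for value in cpm.values() if value > threshold)
    return summary


# Toy sample (made-up numbers) with a library size of exactly one million reads,
# so that each CPM value equals the raw count.
toy_sample = {"geneA": 0, "geneB": 1, "geneC": 4, "geneD": 45, "geneE": 999_950}
print(cpm_summary(toy_sample))
```

Comparing such summaries across samples gives the same kind of first glance as the bar plot: a CPM cut-off is reasonable when it removes genes that are absent or nearly absent in every condition.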

#### Boxplot Evaluation

**Figure 3** presents the boxplots, which provide an easy way to visualize the count distribution in each sample. If the distribution of count values is highly skewed, a data transformation can be applied to roughly normalize the distribution. **Figure 3A** presents the log2 normalized data (pseudo-counts) and **Figure 3B** depicts the data normalized using the Trimmed Mean of M-values (TMM) method, which is used for the differential expression analysis in the edgeR and NOISeq packages. As observed, TMM normalization adjusts the data according to the sequencing yield of each sample. The boxplot is an easy way to visualize the data distribution since it shows statistical measures such as the median, quartiles, and minimum and maximum values. Whiskers are also drawn extending beyond each end of the box, with points beyond the whiskers typically indicating count outliers. In the log2 boxplot, the difference in sequencing yield per sample is very evident. In this case, the control samples have fewer reads than the other samples. However, TMM normalization can correct this problem, which is why several differential expression methods have implemented this normalization procedure.

It is important to mention that the user will find a pair of boxplots, PCA and MDS graphs, since the data is plotted using pseudo-counts and TMM values.
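As a rough illustration of what the log2 boxplot summarizes, the sketch below computes pseudo-counts and the statistics a boxplot displays. It makes several assumptions: pseudo-counts are taken as log2(count + 1), outliers are flagged with the common 1.5 × IQR convention, and the counts are made up. TMM normalization is not implemented here.

```python
import math
import statistics


def pseudo_counts(counts):
    """Log2-transform raw counts after adding 1 (assumed pseudo-count)."""
    return [math.log2(count + 1) for count in counts]


def five_number_summary(values):
    """Return the minimum, quartiles and maximum shown by a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def outliers(values):
    """Flag points beyond 1.5 x IQR from the box (a common convention)."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    low, high = summary["q1"] - 1.5 * iqr, summary["q3"] + 1.5 * iqr
    return [value for value in values if value < low or value > high]


raw = [0, 1, 3, 7, 15, 31, 1023]  # made-up counts for one sample
logged = pseudo_counts(raw)
print(five_number_summary(logged))
print(outliers(logged))
```

Running the same summary on every sample and comparing the medians shows the sequencing-yield differences that are visible in the log2 boxplot.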

#### Density Plot Evaluation

The normalized count distributions can also be summarized by means of a density plot. Density plots provide more detail by enabling the detection of dissimilarity in replicate count distributions. Ideally, the density plots of the replicates of a given condition should overlap greatly, indicating low variability between replicates. **Figure 4** shows a density plot for the samples, in which the curves for the replicates of the C condition indicate a certain dissimilarity between replicates for that condition.

#### PCA Plot Evaluation

This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects. In the context of RNA-Seq analysis, PCA shows groups of samples that ideally will correspond to each condition, clustering first by the most significant group, then by progressively less significant groups. **Figure 5** depicts how the three conditions (C, T2, and T3) form separate clusters, although some dispersion between replicates can be observed. This suggests that the variability among individuals was high but, given the cluster separation, it should not affect the analysis. When a replicate groups with samples from a different condition, it is recommended to remove it from the analysis if enough replicates remain (at least two). This plot can also indicate a batch effect problem, in which samples from the same condition are widely dispersed in the plot. In that case, the user can rerun the analysis indicating which samples could belong to a different batch. However, we recommend confirming this with records from the preparation of the samples in the wet lab.

#### MDS Plot Evaluation

Multi-dimensional scaling (MDS) is a technique used to create a visual representation of the pattern of proximities (similarities, dissimilarities, or distances) among a set of objects. In the context of RNA-Seq analysis, the MDS plot shows variation among RNA-Seq samples: the greater the distance between samples, the higher their dissimilarity. Therefore, samples belonging to the same condition or treatment should be close to each other and distant from other conditions. However, if different conditions are grouped together, this could mean that those treatments or conditions have a very similar effect. In the worst-case scenario, the user may suspect sample mislabeling. Conceptually, MDS and PCA plots can provide the same information and, as observed in **Figure 6**, samples belonging to C, T2, and T3 form separate clusters with a certain dispersion among replicates. Similarly to the PCA plot, this plot could indicate a batch effect problem, in which samples from the same condition are widely dispersed in the plot. Again, we recommend confirming this by checking records from the preparation of the samples in the wet lab.
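The proximities that an MDS plot tries to preserve can be sketched as a pairwise distance table between samples. The example below is an assumption-laden illustration: it uses Euclidean distance on log-scale expression values, which may differ from the distance used by the packages behind the server, and the sample values are made up. It does not perform the MDS projection itself.

```python
import math
from itertools import combinations


def sample_distances(expression):
    """Pairwise Euclidean distances between samples.

    expression: dict sample -> list of log-scale values in the same gene order.
    """
    distances = {}
    for first, second in combinations(sorted(expression), 2):
        distances[(first, second)] = math.dist(expression[first], expression[second])
    return distances


def closest_sample(expression, sample):
    """Return the sample nearest to the given one; ideally its own replicate."""
    others = [name for name in expression if name != sample]
    return min(others, key=lambda name: math.dist(expression[sample], expression[name]))


# Made-up log-scale values for three genes in four samples.
toy = {
    "liverC_1": [1.0, 5.0, 2.0],
    "liverC_2": [1.0, 5.0, 3.0],
    "liverT2_1": [4.0, 1.0, 2.0],
    "liverT2_2": [4.0, 1.0, 3.0],
}
for pair, distance in sample_distances(toy).items():
    print(pair, distance)
```

If the closest sample to a replicate belongs to a different condition, that is the situation described above in which similar treatment effects, a batch effect or mislabeling should be considered.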

### Differential Expression Analysis

The "(2) Differential Expression Results" section has links with the name of each selected method, where the user can display the analysis output. A detailed description of each method output can be found in the User Manual at the IDEAMEX web page.

However, here we describe the generated graphs to aid their interpretation. **Table 2** shows the output plots generated by each method, which contribute different representations of the genes that were differentially expressed. Some of these plots were already used in the "(1) Data Analysis" section (PCA and MDS plots). If the user indicated that samples for a given condition belonged to different batches, the batch effect error correction will be applied for several methods.

#### Expression, MA, MD and Smear Plots

These plots depict all expressed genes, with differentially expressed genes shown in a color other than black. In all of them, the distribution of gene expression is displayed according to a certain value. For example, in the expression plot (**Supplementary Figure S1**) the average expression values for each gene in the compared conditions are plotted, and those highlighted in red are genes with a significant difference compared to the rest. In simple terms, the differentially expressed genes are those with outlier mean values.

In the MA-plot (**Supplementary Figure S2**), the log2 fold change (logFC) expression and the normalized mean counts of each gene in the compared conditions are plotted. Features declared as differentially expressed are highlighted in different colors according to the logFC threshold defined by the user and the expression directionality (UP or DOWN).

The mean-difference (MD) plot (**Supplementary Figure S3**) shows the average expression (mean: x-axis in limma or D for NOISeq) against logFC (difference: y-axis in limma or M for NOISeq). Again, values declared as differentially expressed are highlighted in red.

The smear plot allows the results of the analysis to be visualized in a similar manner to the MA-plot. This plot shows the logFC against the log-CPM, with genes declared as differentially expressed highlighted in red.

In summary, all these plots compare the expression rate or difference between conditions and the normalized values. The proportion of black and highlighted dots gives an idea of the expression change magnitude between the treatment and the control or untreated conditions.
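The two coordinates shared by these plots can be written out as a small sketch. It uses the conventional M (log ratio) and A (average log expression) definitions with an added pseudo-count of 1, which is an assumption; each package computes its own normalized values, and NOISeq defines D differently, so the numbers here are illustrative only.

```python
import math


def ma_values(mean_control, mean_treatment, pseudo=1.0):
    """Return (M, A) for one gene from its mean normalized counts.

    M is the log2 fold change (treatment vs. control, y-axis) and
    A is the average log2 expression (x-axis).
    """
    log_control = math.log2(mean_control + pseudo)
    log_treatment = math.log2(mean_treatment + pseudo)
    m = log_treatment - log_control
    a = (log_treatment + log_control) / 2
    return m, a


# Made-up mean counts: the first gene changes fourfold, the second does not change.
print(ma_values(7, 31))
print(ma_values(15, 15))
```

Genes with M close to zero fall along the horizontal midline of the plot, while highlighted genes lie above or below it by at least the chosen logFC threshold.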

#### Volcano Plot

Arguably, the volcano plot (**Figure 7**) is the most popular and probably the most informative graph, since it summarizes both the expression change (logFC) and the statistical significance (p-value). It is a scatter plot of the negative log10-transformed p-values from the gene-specific test (y-axis) against the logFC (x-axis). Data points with low p-values (highly significant) appear toward the top of the plot. The logFC values determine the direction of change (up or down), with changes of equal magnitude in either direction appearing equidistant from the center. Features declared as differentially expressed are highlighted in red, according to the selected cut-off values.
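The coordinates and highlighting rule of a volcano plot can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `volcano_points`, the default cut-offs and the toy input values are hypothetical and are not taken from IDEAMEX.

```python
import math

def volcano_points(genes, logfc_cutoff=1.0, p_cutoff=0.05):
    """Return (name, logFC, -log10(p), status) tuples for a volcano plot.

    `genes` is an iterable of (name, logFC, p-value) triples. The cut-off
    defaults are illustrative assumptions, not IDEAMEX defaults.
    """
    points = []
    for name, logfc, pvalue in genes:
        # Guard against p = 0, for which log10 is undefined.
        neg_log_p = -math.log10(max(pvalue, 1e-300))
        if pvalue <= p_cutoff and logfc >= logfc_cutoff:
            status = "UP"
        elif pvalue <= p_cutoff and logfc <= -logfc_cutoff:
            status = "DOWN"
        else:
            status = "NS"  # not significant: drawn in black
        points.append((name, logfc, neg_log_p, status))
    return points

# Toy values chosen only to exercise each branch.
example = [
    ("geneA", 2.5, 1e-6),
    ("geneB", -1.8, 0.001),
    ("geneC", 0.3, 0.4),
    ("geneD", 3.0, 0.2),
]
for point in volcano_points(example):
    print(point)
```

Note that geneD has a large logFC but is not highlighted, because a gene must pass both the fold-change and the significance cut-off.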

#### Results Integration

Finally, the "(3) Results Integration" section of the Analysis Results in the IDEAMEX web page contains several text files and graphs that integrate the results from all selected methods. In **Figure 8**, we present the results from the C vs. T2 comparison, using a Venn diagram (**Figure 8A**), upset bar (**Figure 8B**) and correlogram (**Supplementary Figure S5**) plots. For the analyzed data, the Venn diagram showed all method intersections: 852 genes were identified as differentially expressed by all four methods, with NOISeq being the main contributor, as also observed in the upset bar plot. Interestingly, limma-Voom reported 43 genes that no other method found as differentially expressed, but agreed with the other methods in 920 genes (5 + 51 + 852 + 12). This gives the user the option to work with either the intersection or the union of all methods. However, working with the results of all methods can be overwhelming, although an enrichment analysis based on GO terms or KEGG metabolic annotation could help.

#### TABLE 3 | Example of intersect results table.

Snippet of the liver CvsT2 treatment. Original table is in simple text format. The Regulation column indicates the directionality of the gene expression.

As mentioned, other generated plots are heatmaps (**Supplementary Figure S4**) and correlograms (**Supplementary Figure S5**). Since each method uses a different normalization, fold change calculation, or statistical metric (p-adj, FDR or Probability) to determine whether a gene is differentially expressed, the correlograms can help the user to evaluate the correlation of these values among the methods used. Also, heatmaps are created to observe samples clustered by their fold change, allowing the user to spot groups of genes with a similar expression change.

Among all the text file results that are explained in detail in the User Manual (**Supplementary Material S2**), the IntersectTopRegulation.txt file provides the list of all differentially expressed genes with a 0|1 matrix that can be used to select genes depending on how many and which methods reported them as differentially expressed. The last column of the file describes the gene regulation, indicating how and in which condition the gene was expressed. **Table 3** has a snippet of the liverCvsliverT2\_IntersectTopRegulation.txt file, where the Regulation column has the following structure: UP\_conditionX\_DOWN\_conditionY or DOWN\_conditionX\_UP\_conditionY. Therefore, the user can select which genes were up- or down-regulated in a certain condition and be sure of the directionality of the expression without checking the sign of the fold change.
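As an illustration of how such a 0|1 matrix can be used, the sketch below filters genes by the number of methods that reported them and, optionally, by their regulation label. The column names, the tab-separated layout and the three example rows are assumptions made for this sketch; the actual layout of IntersectTopRegulation.txt is described in the User Manual and may differ.

```python
import csv
import io

# Hypothetical tab-separated layout; the real header names may differ.
EXAMPLE = (
    "Gene\tedgeR\tDESeq2\tlimma\tNOISeq\tRegulation\n"
    "gene1\t1\t1\t1\t1\tUP_C_DOWN_T2\n"
    "gene2\t0\t0\t1\t0\tDOWN_C_UP_T2\n"
    "gene3\t1\t1\t0\t1\tDOWN_C_UP_T2\n"
)
METHODS = ["edgeR", "DESeq2", "limma", "NOISeq"]

def select_genes(handle, min_methods, regulation=None):
    """Return genes reported by at least `min_methods` methods.

    If `regulation` is given, only genes with that exact label are kept.
    """
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        votes = sum(int(row[method]) for method in METHODS)
        if votes >= min_methods and regulation in (None, row["Regulation"]):
            selected.append(row["Gene"])
    return selected

# Intersection of all four methods.
print(select_genes(io.StringIO(EXAMPLE), min_methods=4))
# Genes up-regulated in T2 according to at least three methods.
print(select_genes(io.StringIO(EXAMPLE), min_methods=3, regulation="DOWN_C_UP_T2"))
```

Setting `min_methods=1` corresponds to the union of all methods, and `min_methods=len(METHODS)` to their intersection.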

#### DISCUSSION

The IDEAMEX web server is a useful resource for transcriptome experiments designed for differential expression analysis involving several condition comparisons. The methods for differential expression analysis in the workflow were selected based on their performance in several benchmark analyses published since the emergence of RNA-Seq data as a powerful alternative to microarrays (Anders et al., 2013; Soneson and Delorenzi, 2013; Seyednasrollah et al., 2015; Costa-Silva et al., 2017). In particular, our web server uses R packages that rely on different algorithms and normalization methods, giving a broader view of the results with a higher confidence based on their agreement, following the idea that no statistical model can fully capture biological phenomena. limma and NOISeq use non-parametric methods, that is, statistical techniques that do not require assumptions about the distribution of gene expression, whereas DESeq2 and edgeR use parametric methods that assume a negative binomial distribution for the data and that most genes are not differentially expressed.

Once the user has loaded the input data in the right format, our server allows the user to define which comparisons will be made and which cut-off values will be used, instead of running an all-vs.-all comparison with default parameters for each package. For parametric methods like edgeR and DESeq2, the FDR and p-adj values are the statistical parameters that account for multiple testing and are used to decide whether a gene was differentially expressed from the statistical point of view. However, other parameters such as the CPM or logFC can have a biological meaning and can also be used as cut-off values. It is not straightforward to select the best cut-off values for a given experiment, but IDEAMEX allows users to try many combinations by running the comparisons several times and inspecting the different results.

It is very important that the user selects comparisons that make sense in terms of the experimental design. For example, in this work we used three conditions, where one was used as a control to study the effect of two thyroid hormone treatments (T2 and T3) in tilapia liver. The comparison between T2 and T3 has to be performed by comparing the results obtained from comparing each one to the control or untreated condition. A direct comparison between T2 and T3 could miss several results since, even if we can observe a gene with a certain expression change, the difference might not be statistically significant. Let's say that "gene A" has a 10-fold change in the T2 vs. C comparison and a 12-fold change in the T3 vs. C comparison. Roughly, the difference for the same gene in the T2 vs. T3 comparison will be about 1.2-fold (12/10), which might not be statistically significant. For this reason, it is very important to select the comparisons that make sense, instead of performing all possible comparisons.

The results in the "Data Analysis" section are several plots that allow the user to inspect the distribution of their data based on different metrics. This quality control checkpoint is very important, since biological data tend to be very noisy. It is expected that the data from biological replicates within a certain condition will have the same distribution and a similar trend to those in other conditions. In particular, PCA and MDS plots allow users to see whether biological replicates of a certain condition group together and whether each condition forms a separate group. In this particular case, it was known that the samples did not present any batch effect but, as observed in **Figure 6**, there is some dispersion between samples. It is not trivial to determine whether samples present a dispersion attributable to a batch effect. Therefore, it is important to obtain information regarding the sample preparation to discriminate between high "biological" variability and "noise" from batch effects.

The distance or dispersion of the replicates and groups indicates how reproducible the tested condition was in different individuals, or how variable the individuals were despite the treatment. The more replicates available, the greater the statistical power. Very disperse groups, or samples from different conditions grouping together, should be considered noisy or highly variable results that can skew the analysis and lead to misinterpretation of the experiments. However, NOISeq could be a good option when no biological replicates are available and, as reported elsewhere, it delivers reliable results that have been confirmed by quantitative PCR (qPCR).

The results from different methods are not mutually exclusive. From the statistical point of view, one, none, or all of them may be true. Therefore, working with the intersection or the union of all results is a decision that the user has to make after exploring them, based not only on the statistical significance but also on the biological meaning, which will depend on the gene annotation. The main problem with all statistics is the risk of spurious findings ("fakeness") and misrepresentation of the results. However, if four different methods agree on a certain result, it could be assumed that those genes are differentially expressed, bearing in mind that an orthogonal experimental validation using a different technology, like qPCR, would be necessary to confirm the result.

In the "Results Integration" section, there are several text lists and graphs that can guide users in making sense of the results from their experiments. As mentioned, the Venn diagram (**Figure 8A**) shows the intersection and union of the selected methods. The user can choose one or more methods by evaluating the agreement between them, since one method could generate either an overwhelming number of results or very few of them. In the former case, the user can choose to work with the intersection of all methods; in the latter case, the union will provide the maximum number of reported results.

In this work, we provide heatmaps and correlograms for different values obtained from each method. For example, heatmaps (**Supplementary Figure S4**) are useful to spot gene clusters with the same fold change pattern, suggesting that those genes could belong to a certain pathway or are regulated by the same mechanism. However, users have to be very careful when determining gene clusters since there is no straightforward method to do so. Defining the cluster size is not trivial and is usually a trial-and-error process. In terms of novelty, the most interesting plot could be the statistical parameter correlogram (**Supplementary Figure S5**), where the threshold values such as p-adj (limma-Voom and DESeq2), FDR (edgeR) and Prob (NOISeq) are correlated. To our knowledge, this correlation has not been reported in other studies. Surprisingly, methods usually correlate very well, since the statistical threshold denotes the error probability of each result. In our experience, NOISeq is the method with the lowest correlation regarding the error probability, since this is calculated using a very different approach (Tarazona et al., 2011) compared to the rest of the methods. However, it is somewhat reassuring that all methods present a good correlation, suggesting that they are consistent in identifying differentially expressed genes and those with no significant change, despite using different statistics.
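The kind of comparison summarized in a statistical parameter correlogram can be sketched as a rank correlation between the per-gene values reported by two methods. The code below is a minimal illustration, not the IDEAMEX implementation: the choice of Spearman correlation, the function names and the five toy values per method are all assumptions. It also assumes that the NOISeq Prob value is a probability of differential expression, so 1 − Prob is used to orient it in the same direction as p-adj and FDR.

```python
def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy values for five genes (illustrative only).
padj_deseq2 = [0.001, 0.02, 0.30, 0.80, 0.04]
fdr_edger = [0.002, 0.03, 0.25, 0.90, 0.05]
prob_noiseq = [0.99, 0.80, 0.60, 0.20, 0.95]
error_noiseq = [1 - p for p in prob_noiseq]

print(spearman(padj_deseq2, fdr_edger))
print(spearman(padj_deseq2, error_noiseq))
```

A rank-based coefficient is used here because the values come from different statistical frameworks and are not expected to be on the same scale.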

Finally, there are several other methods to continue the differential expression analysis that can help users to put their results in a biological context. Probably the most popular methods are those based on Gene Ontology (GO) term enrichment (Maere et al., 2005; Eden et al., 2009; Reimand et al., 2016), which require a well-curated gene annotation. Other enrichment methods, like Gene Set Enrichment Analysis (GSEA), which determines whether a defined set of genes shows statistically significant differences based on molecular signatures (Subramanian et al., 2007; Liberzon et al., 2011), or metabolic pathway enrichment analysis (Luo et al., 2009; Liu et al., 2017; Ulgen et al., 2018), can provide a better picture of the biological meaning of the observed changes in gene expression for a given treatment or condition. These enrichment methods, along with the heatmaps, can help the researcher to spot regulation networks or pathways that could be subject to further studies.

### CONCLUSION

We consider that the IDEAMEX web server can help researchers with no previous bioinformatics knowledge to perform their analyses in a simple manner. Also, more experienced users with some bioinformatics skills can use the results to perform a more detailed analysis and a different integration of them, since all the results are provided in simple text files, which are convenient to parse and handle using regular expression searches.

### DATA AVAILABILITY

The datasets analyzed for this study can be found in the NCBI SRA repository (https://submit.ncbi.nlm.nih.gov/subs/sra/), under the SRA identifiers: SRX2630485, SRX2630486, SRX2630487, SRX2630488, SRX2630489, and SRX2630490.

### AUTHOR CONTRIBUTIONS

VJ-J and LV-A developed the web deployment and scripts for the IDEAMEX server. VJ-J, LV-A, and AS-F conceived the web server workflow. AS-F wrote the manuscript. All authors read and authorized the publication of this manuscript.

#### FUNDING

The computer hosting the IDEAMEX web server was provided and maintained by the Unidad Universitaria de Secuenciación Masiva y Bioinformática using its core budget and CONACyT #260481 grant from the "Laboratorios Nacionales" program.

#### ACKNOWLEDGMENTS

We would like to thank "Laboratorio Nacional de Apoyo Tecnológico a las Ciencias Genómicas" CONACyT #260481 for infrastructural support hosting the server.

#### REFERENCES


#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00279/full#supplementary-material


MATERIAL S2 | IDEAMEX user manual.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Jiménez-Jacinto, Sanchez-Flores and Vega-Alvarado. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# DREAMSeq: An Improved Method for Analyzing Differentially Expressed Genes in RNA-seq Data

Zhihua Gao<sup>1,2</sup>, Zhiying Zhao<sup>1</sup> and Wenqiang Tang<sup>1</sup> \*

<sup>1</sup> Ministry of Education Key Laboratory of Molecular and Cellular Biology, Hebei Key Laboratory of Molecular and Cellular Biology, Hebei Collaboration Innovation Center for Cell Signaling, College of Life Sciences, Hebei Normal University, Shijiazhuang, China, <sup>2</sup> College of Biological Science and Engineering, Hebei University of Economics and Business, Shijiazhuang, China

RNA sequencing (RNA-seq) has become a widely used technology for analyzing global gene-expression changes during certain biological processes. It is generally acknowledged that RNA-seq data displays equidispersion and overdispersion characteristics; therefore, most RNA-seq analysis methods were developed based on a negative binomial model capable of capturing both equidispersed and overdispersed data. In this study, we reported that in addition to equidispersion and overdispersion, RNA-seq data also displays underdispersion characteristics that cannot be adequately captured by general RNA-seq analysis methods. Based on a double Poisson model capable of capturing all data characteristics, we developed a new RNA-seq analysis method (DREAMSeq). Comparison of DREAMSeq with five other frequently used RNA-seq analysis methods using simulated datasets showed that its performance was comparable to or exceeded that of other methods in terms of type I error rate, statistical power, receiver operating characteristics (ROC) curve, area under the ROC curve, precision-recall curve, and the ability to detect the number of differentially expressed genes, especially in situations involving underdispersion. These results were validated by quantitative real-time polymerase chain reaction using a real Foxtail dataset. Our findings demonstrated DREAMSeq as a reliable, robust, and powerful new method for RNA-seq data mining. The DREAMSeq R package is available at http://tanglab.hebtu.edu.cn/tanglab/Home/DREAMSeq.

Keywords: RNA-seq, DREAMSeq, equidispersion, overdispersion, underdispersion, double Poisson model, negative binomial model

## INTRODUCTION

With the development of next-generation sequencing technology, RNA sequencing (RNA-seq) has become a routine and powerful method for evaluating global dynamic changes in gene expression during certain biological processes. Compared with microarray technologies, RNA-seq technologies have several advantages, including a wider measurable range of expression levels, higher throughput, less noise, more information for detecting allele-specific expression, and a higher capability to detect novel promoters and alternative gene-splicing isoforms (Marioni et al., 2008; Mortazavi et al., 2008; Sultan et al., 2008; Wang et al., 2009, 2010b; Oshlack et al., 2010). Therefore, developing powerful, reliable, and unbiased RNA-seq data-mining methods would facilitate the use of RNA-seq to explore basic biological questions in this era of big data.

#### Edited by:

Monica Bianchini, Università degli Studi di Siena, Italy

#### Reviewed by:

Shihao Shen, University of California, Los Angeles, United States

Taina Raiol, Fundação Oswaldo Cruz (Fiocruz), Brazil

\*Correspondence:

Wenqiang Tang tangwq@mail.hebtu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 16 August 2018 Accepted: 15 November 2018 Published: 30 November 2018

#### Citation:

Gao Z, Zhao Z and Tang W (2018) DREAMSeq: An Improved Method for Analyzing Differentially Expressed Genes in RNA-seq Data. Front. Genet. 9:588. doi: 10.3389/fgene.2018.00588


Typically, RNA-seq experimental procedures can be divided into six steps: (1) sequencing the RNA samples to obtain raw reads, (2) filtering out low-quality reads, (3) mapping the high-quality reads to a reference genome or transcriptome, (4) summarizing the read counts for each gene, (5) detecting differentially expressed genes (DEGs), and (6) performing systems biology analysis [e.g., cluster analysis, principal components analysis (PCA), gene ontology (GO) analysis, and pathway enrichment analysis] (Oshlack et al., 2010). Of these steps, identifying DEGs across treatments/conditions is the key task and often the primary goal of RNA-seq data analysis. There are numerous statistical methods focusing directly on read-count data for DEG identification, with these classified into two categories: (1) parametric methods that rely on assumptions about discrete probability models and include methods based on a Poisson model, such as DEGseq (Wang et al., 2010a) and TSPM (Auer and Doerge, 2011), methods based on a negative binomial (NB) model, such as edgeR (Robinson et al., 2010), DESeq (Anders and Huber, 2010), baySeq (Hardcastle and Kelly, 2010), NBPSeq (Di et al., 2011), EBSeq (Leng et al., 2013), ShrinkSeq (Van De Wiel et al., 2013), and DESeq2 (Love et al., 2014), methods based on a beta-binomial model, such as BBSeq (Zhou et al., 2011), methods based on a multivariate Poisson log-normal (LN) model, such as PLNseq (Zhang et al., 2015), and methods based on a generalized Poisson (GP) model, such as GPseq (Srivastava and Chen, 2010) and deGPS (Chu et al., 2015); and (2) non-parametric methods, such as NOISeq (Tarazona et al., 2011) and SAMseq (Li and Tibshirani, 2013), that do not assume any particular model.

Among count-based RNA-seq data-analysis methods, non-parametric methods were developed based on large-sample asymptotic theory and exhibit statistical power sufficient to detect DEGs only when the number of replicates per treatment condition is ≥5 (Tarazona et al., 2011; Seyednasrollah et al., 2013; Soneson and Delorenzi, 2013). However, due to the high cost of RNA-seq, the general sample size in a typical RNA-seq experiment is <5 replicates, which limits the application of non-parametric methods in RNA-seq data mining. Therefore, the most popular RNA-seq data-analysis methods are parametric methods based on Poisson and NB models. In early RNA-seq studies where only technical replicates were used, the traditional Poisson model was highly capable of fitting read-count data characterized by equidispersion (i.e., the variance is equal to the mean) (Marioni et al., 2008; Bullard et al., 2010). However, when biological replicates are available, read-count data often exhibits more variability than the Poisson model expects, which limits the use of a Poisson model for analyzing RNA-seq data (Anders and Huber, 2010). Fortunately, the NB model, as a Gamma-Poisson mixture, can address the overdispersion issue (i.e., when the variance is larger than the mean), as well as capture equidispersion (Anders and Huber, 2010). Additionally, recent studies reported that some RNA-seq data demonstrates characteristics of underdispersion (i.e., the variance is smaller than the mean), which might be caused by RNA-seq coverage, as well as zero-inflation, cluster, or low expression level of the count data, and could lead to underestimation of DEGs (Famoye, 1993; Srivastava and Chen, 2010; Rau et al., 2011; Mi et al., 2015; Choo-Wosoba et al., 2016; Low et al., 2017). However, neither a traditional Poisson model nor the NB model works well at mining underdispersed data.

The GP model is a generalization of the Poisson model with an additional parameter. This method can process data characterized by underdispersion and non-underdispersion (equidispersion and overdispersion) (LuValle, 1990), but can only capture certain levels of dispersion, because the model is truncated under certain conditions regarding its bounded dispersion parameter (Famoye, 1993). For example, the program deGPS employs the GP model to fit read-count data characterized by non-underdispersion (Chu et al., 2015), whereas GPseq uses this model to consider potential positional bias during DEG analysis and handle position-level counts instead of gene-level counts, which is different from other methods (Srivastava and Chen, 2010). Therefore, these methods derived from different discrete models can potentially perform poorly at fitting underdispersed count data due to the restrictions associated with the inherent properties in the models.

In this study, we described a mixed Poisson model called double Poisson (DP), which offers the advantage of flexibility in fitting a wide range of data exhibiting underdispersion and non-underdispersion using only two parameters (Efron, 1986). Based on this model, we developed a novel differential relative expression-analysis method for RNA-seq data mining (DREAMSeq). Because the results of differential gene-expression analysis are dependent upon the discrete model used to fit the RNA-seq data (Consortium, 2010), we also added NB-model functionality to the DREAMSeq pipeline in order to optimize the performance of our method. Therefore, depending on the model used in the pipeline, our method can be divided into three approaches: DREAMSeq.DP (based on the DP model), DREAMSeq.NB (based on the NB model), and DREAMSeq.Mix (based on the mixture of the DP and NB models, with the lower of the two p-values generated by the DP and NB models chosen as the final p-value) in order to fit variable RNA-seq data. In order to evaluate the performance of DREAMSeq, we generated three simulated datasets using three real RNA-seq datasets. Because the DEGs can only be effectively identified when the sample size is ≥3 (Conesa et al., 2016; Lin et al., 2016), to assess DREAMSeq using the most common RNA-seq scenario, we focused on detecting DEGs under small sample sizes (three replicates per condition) and between two groups. Our results indicated that the performance of DREAMSeq at effectively detecting DEGs was comparable to other popular RNA-seq data-analysis methods, including edgeR, DESeq, DESeq2, NBPSeq, and TSPM, in non-underdispersion situations, but outperformed most of the other methods in underdispersion situations. This conclusion was validated by quantitative real-time polymerase chain reaction (qRT-PCR) using a real Foxtail dataset generated in our laboratory. Our findings demonstrated DREAMSeq as a reliable and robust DEG-detection method that provides an additional option in the RNA-seq data-analysis toolbox, especially for underdispersed-data mining.

#### MATERIALS AND METHODS

#### Models and Normalization

In this study, let Y represent the observed count and X the corresponding underlying gene expression (unknown) in an RNA-seq experiment. Let Y<sub>ijk</sub> and X<sub>ijk</sub> denote the read count and the true gene expression, respectively, of gene i from sample j in treatment group k, where i = 1, ..., I (the number of genes), j = 1, ..., J (the number of replicates; here, J = 3), and k = 1, ..., K (the number of groups; here, K = 2).

#### NB Model

We assume that Y follows an NB model with two parameters: the mean, µ, and the dispersion, φ. The probability mass function (PMF) of the NB model is given as:

$$P\left(Y=y|\mu,\phi\right) = \frac{\Gamma\left(y+\phi^{-1}\right)}{y!\,\Gamma\left(\phi^{-1}\right)} \left(\frac{1}{1+\mu\phi}\right)^{\phi^{-1}} \left(\frac{\mu\phi}{1+\mu\phi}\right)^{y}.\tag{1}$$

The expected value is estimated as:

$$E\left(Y\right) = \mu.\tag{2}$$

We parameterize the variance of the NB model according to a previous study (Robinson and Smyth, 2007):

$$\operatorname{Var}\left(Y\right) = \sigma^2 = \mu + \mu^2 \phi,\tag{3}$$

where φ ≥ 0 determines the extra variability as compared with the Poisson model. When φ > 0, σ<sup>2</sup> > µ; when φ = 0, σ<sup>2</sup> = µ and the NB model collapses to the Poisson model, which can be viewed as a special NB model with zero dispersion (Robinson and Smyth, 2007). Therefore, the NB model allows for both overdispersion and equidispersion.
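The mean and variance in Eqs 2 and 3 can be checked numerically from the PMF in Eq. 1. The following is a minimal sketch; the function name `nb_pmf` and the values µ = 10 and φ = 0.5 are arbitrary choices for illustration, and the infinite support is truncated where the remaining probability mass is negligible.

```python
import math

def nb_pmf(y, mu, phi):
    """NB probability of count y for mean mu and dispersion phi > 0 (Eq. 1)."""
    size = 1.0 / phi
    log_p = (
        math.lgamma(y + size) - math.lgamma(y + 1) - math.lgamma(size)
        + size * math.log(1.0 / (1.0 + mu * phi))
        + y * math.log(mu * phi / (1.0 + mu * phi))
    )
    return math.exp(log_p)

mu, phi = 10.0, 0.5
support = range(2000)  # truncation point; the tail beyond it is negligible here
total = sum(nb_pmf(y, mu, phi) for y in support)
mean = sum(y * nb_pmf(y, mu, phi) for y in support)
var = sum((y - mean) ** 2 * nb_pmf(y, mu, phi) for y in support)

print("sum of probabilities:", total)
print("mean:", mean, "expected:", mu)
print("variance:", var, "expected:", mu + mu ** 2 * phi)
```

Working on the log scale with `math.lgamma` avoids overflow of the Gamma function for large counts.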

#### DP Model

We assume that Y follows a DP model with two parameters: the mean, µ, and the dispersion, θ. The approximate PMF of the DP model is given as:

$$P\left(Y=y|\mu,\theta\right) = f_{\mu,\theta}\left(y\right) = \left(\theta^{\frac{1}{2}}e^{-\theta\mu}\right)\left(\frac{e^{-y}y^{y}}{y!}\right)\left(\frac{e\mu}{y}\right)^{\theta y}.\tag{4}$$

The exact DP density is:

$$P\left(Y=y|\mu,\theta\right)=\tilde{f}_{\mu,\theta}(y)=c(\mu,\theta)f_{\mu,\theta}\left(y\right),\tag{5}$$

where the factor c(µ,θ) can be calculated as:

$$\frac{1}{c(\mu,\theta)} = \sum_{y=0}^{\infty} f_{\mu,\theta} \left( y \right) \approx 1 + \frac{1-\theta}{12\mu\theta} \left(1 + \frac{1}{\mu\theta}\right) \tag{6}$$

with c(µ, θ) being the normalizing constant nearly equal to 1. The constant c(µ, θ) ensures that the density sums to unity. The expected value and the variance of the DP model in reference to the exact density f̃<sub>µ,θ</sub>(y) are estimated as follows:

$$E(Y) \approx \mu \tag{7}$$

and

$$\operatorname{Var}\left(Y\right) = \sigma^2 = \frac{\mu}{\theta},\tag{8}$$

respectively, where θ > 0 under RNA-seq data circumstances. The Poisson model is nested in the DP model for θ = 1, indicating that the DP model can fit equidispersed read-count data when θ = 1. Additionally, the DP model allows for both overdispersion (0 < θ < 1) and underdispersion (θ > 1) (Efron, 1986).
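The behavior of the DP model for different values of θ can be explored numerically. The sketch below evaluates the approximate density of Eq. 4, obtains 1/c(µ, θ) by direct summation, and compares the resulting moments with Eqs 6 to 8. The function names, the choice µ = 20 and the truncation point are assumptions made for illustration; this is not the DREAMSeq implementation.

```python
import math

def dp_unnormalized(y, mu, theta):
    """Approximate double Poisson density f_{mu,theta}(y) of Eq. 4."""
    if y == 0:
        # y^y and (e*mu/y)^(theta*y) both tend to 1 as y tends to 0.
        return math.sqrt(theta) * math.exp(-theta * mu)
    log_f = (
        0.5 * math.log(theta) - theta * mu
        - y + y * math.log(y) - math.lgamma(y + 1)
        + theta * y * (1.0 + math.log(mu) - math.log(y))
    )
    return math.exp(log_f)

def dp_moments(mu, theta, upper=2000):
    """Return (1/c, mean, variance) of the exact density by direct summation."""
    weights = [dp_unnormalized(y, mu, theta) for y in range(upper)]
    inv_c = sum(weights)  # left-hand side of Eq. 6
    mean = sum(y * w for y, w in enumerate(weights)) / inv_c
    var = sum((y - mean) ** 2 * w for y, w in enumerate(weights)) / inv_c
    return inv_c, mean, var

def inv_c_approx(mu, theta):
    """Closed-form approximation on the right-hand side of Eq. 6."""
    return 1 + (1 - theta) / (12 * mu * theta) * (1 + 1 / (mu * theta))

mu = 20.0
for theta in (0.5, 1.0, 2.0):  # overdispersed, Poisson, underdispersed
    inv_c, mean, var = dp_moments(mu, theta)
    print(theta, inv_c, inv_c_approx(mu, theta), mean, var, mu / theta)
```

For θ = 1 the density reduces exactly to the Poisson PMF, so the sum is 1 and both the mean and the variance equal µ; for the other values the agreement with Eqs 7 and 8 is only approximate.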

#### Normalization

Here, we assume that the expectation of Y<sub>ijk</sub>, µ<sub>ijk</sub>, is the product of X<sub>ijk</sub> and s<sub>jk</sub>:

$$
\mu_{ijk} = X_{ijk} s_{jk}, \tag{9}
$$

where s<sub>jk</sub> is the size factor corresponding to sample j in treatment group k, which can be estimated using various existing normalization methods, such as total counts, upper quartile (Bullard et al., 2010), median (Dillies et al., 2012), quantile (Bolstad et al., 2003; Irizarry et al., 2003), trimmed mean of M-values (TMM) (Robinson and Oshlack, 2010), DESeq normalization (DESeq) (Anders and Huber, 2010), reads per kilobase per million (RPKM) (Mortazavi et al., 2008), and remove unwanted variation (RUV) (Risso et al., 2014). Normalization is a process that makes unit-less data comparable among measurements by adjusting for sequencing depth and potentially other technical effects of different samples. Dillies et al. (2012) and Lin et al. (2016) found that the TMM and DESeq normalization methods performed much better than the other methods described here. Therefore, the most widely used TMM method was chosen as the default data-normalization method in DREAMSeq, similar to previous studies (Robinson et al., 2010; Kadota et al., 2012; Soneson and Delorenzi, 2013; Sun et al., 2013).
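To make the role of the size factor in Eq. 9 concrete, the sketch below estimates size factors with a median-of-ratios rule in the style of the DESeq normalization, which is simpler to write down than TMM. It is an illustration only and is not the TMM default used by DREAMSeq; the function name and the toy count matrix are hypothetical.

```python
import math
import statistics

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample; counts[i][j] is gene i, sample j.

    Each sample's factor is the median, over genes, of the ratio between its
    count and the gene's geometric mean across samples.
    """
    n_samples = len(counts[0])
    # Use only genes without zero counts so the geometric mean is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors

# Toy matrix: the second sample was sequenced about twice as deeply.
counts = [[10, 20], [100, 200], [5, 10], [0, 3], [40, 80]]
factors = median_of_ratios_size_factors(counts)
# Following Eq. 9, dividing by the size factor gives the normalized expression.
normalized = [[c / s for c, s in zip(row, factors)] for row in counts]
print(factors)
print(normalized[0])
```

After normalization, genes that differ only by sequencing depth have equal values across samples.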

#### Dispersion Estimations

Estimating the dispersion parameter is a crucial step in DEG detection. Various dispersion-parameter estimation methods, including pseudo-likelihood (Smyth, 2003), quasilikelihood (Nelder, 2000; Lund et al., 2012), conditional maximum likelihood (CML) (Smyth and Verbyla, 1996), quantile-adjusted CML (Robinson and Smyth, 2008), and shrinkage-estimation methods (Anders and Huber, 2010; Robinson et al., 2010), have been discussed previously. In particular, many Bayesian-based shrinkage-estimation methods, including baySeq, ShrinkSeq, DSS (Wu et al., 2013), and DESeq2, have been developed and are capable of obtaining accurate and robust estimates by sharing information across all genes when the sample size is small (Ji and Liu, 2010). Therefore, we also utilized an empirical Bayesian framework to shrink the dispersion parameter. Our strategy to estimate the dispersion parameter was divided into five steps described as follows.

#### Initial Dispersion Estimators

We first applied the method-of-moments (MoMs) described by Love et al. (2014) to estimate the initial value of dispersion for each gene. According to previous studies (Anders and Huber, 2010; Robinson et al., 2010), we first use the normalized sample mean, X̄<sub>ik</sub>, to estimate the expectation for the ith gene in group k:

$$
\mu_{ik} = \frac{1}{J} \overline{X}_{ik} \sum_{j} s_{jk}. \tag{10}
$$

We assume that the dispersions between two groups are the same under small sample sizes. Therefore, we denote n = KJ and substitute equation (10) with the following equation:

$$
\mu_i = \frac{1}{n} \overline{X}_i \sum_{n} s_{jk},\tag{11}
$$

where µ<sub>i</sub> and X̄<sub>i</sub> are the expectation and sample mean, respectively, of the ith gene. We then estimate the variance of the ith gene, σ<sub>i</sub><sup>2</sup>, by pooling count data from different groups using approaches previously described by Anders and Huber (2010) and Wu et al. (2013). For the NB model, the initial dispersion for the ith gene can be estimated by:

$$
\phi\_i^{init} = \frac{\sigma\_i^2 - \mu\_i}{\mu\_i^2}.\tag{12}
$$

Note that φ init i is often artificially assigned with an extremely low positive value (e.g., 1 × 10−<sup>8</sup> in DESeq) when σ 2 <sup>i</sup> < µi , because the NB model cannot fit underdispersed readcount data. A similar conservative strategy was also utilized for underdispersion in a previous study (Schissler et al., 2015). Under this scenario, the initial dispersion can be overestimated, which results in a conservative DEG test (Robinson and Smyth, 2008). By contrast, instead of the NB model, the DP model is capable of handling this kind of data. For the DP model, the initial dispersion for the ith gene can be estimated by:

$$
\theta\_i^{init} = \frac{\mu\_i}{\sigma\_i^2}.\tag{13}
$$
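As a minimal sketch of equations (12) and (13), the following Python function computes both starting values for one gene. It assumes the counts have already been normalized and pooled across groups, so the size-factor terms of equations (10) and (11) are not modeled; the function name `initial_dispersions` and the `floor` argument are illustrative choices, not part of DREAMSeq.

```python
from statistics import mean, variance


def initial_dispersions(norm_counts, floor=1e-8):
    """Method-of-moments starting values for a single gene.

    norm_counts: normalized read counts pooled over both groups (assumption).
    Returns (phi_init, theta_init) following equations (12) and (13).
    """
    mu = mean(norm_counts)
    sigma2 = variance(norm_counts)      # sample variance, n - 1 denominator
    if mu <= 0 or sigma2 <= 0:
        raise ValueError("gene needs a positive mean and variance")
    phi = (sigma2 - mu) / mu ** 2       # NB dispersion, equation (12)
    phi = max(phi, floor)               # NB cannot represent sigma2 < mu
    theta = mu / sigma2                 # DP dispersion, equation (13)
    return phi, theta


# An underdispersed gene: the NB value hits the floor, the DP value does not.
print(initial_dispersions([8, 10, 12, 9, 11, 10]))
```

For the underdispersed example, the NB starting value is clamped to the floor, whereas the DP starting value exceeds 1, which illustrates why the DP parameterization needs no artificial adjustment.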

#### Gene-Wise Dispersion Estimators

In RNA-seq experiments, there are typically tens of thousands of genes but only a few replicates per treatment group, a situation known as the "large p and small n" phenomenon. It is quite difficult to estimate a reliable gene-specific dispersion with the MoMs described above in such a scenario. To address this problem, we used maximum likelihood estimation (MLE) based on the initial dispersion estimator, φ<sub>i</sub><sup>init</sup> (or θ<sub>i</sub><sup>init</sup>), to estimate a gene-wise dispersion, φ<sub>i</sub><sup>genewise</sup> (or θ<sub>i</sub><sup>genewise</sup>), for gene i. The MLE of the dispersion parameters in the NB and DP models can be obtained by maximizing the log-likelihood summed over all samples across conditions for the ith gene:

$$\phi\_i^{\text{genewise}} = \operatorname{argmax}\_{\phi} \left( \sum\_{n} \log \left( f\_{\text{NB}}(Y\_{ijk}, \mu\_{ik}, \phi) \right) \right) \tag{14}$$

and

$$\theta\_i^{\text{genewise}} = \operatorname\*{argmax}\_{\theta} \left( \sum\_{n} \log \left( f\_{\text{DP}}(Y\_{ijk}, \mu\_{ik}, \theta) \right) \right), \tag{15}$$

respectively, where φ = φ<sub>i</sub><sup>init</sup>, θ = θ<sub>i</sub><sup>init</sup>, and f<sub>NB</sub>(·) and f<sub>DP</sub>(·) are the PMFs of the NB and DP models, respectively.
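The maximization in equation (14) can be sketched in Python as follows. This is an illustration only: it assumes the standard NB parameterization with mean µ and variance µ + φµ² (consistent with equation (12)), and it uses a golden-section search on log φ over a fixed interval rather than an optimizer started at φ<sub>i</sub><sup>init</sup>. The names `nb_logpmf` and `genewise_dispersion` and the search bounds are hypothetical.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log PMF of an NB model with mean mu and variance mu + phi * mu**2."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def genewise_dispersion(counts, means, lo=1e-8, hi=10.0, iters=80):
    """Golden-section search for the phi that maximizes the NB log-likelihood."""
    def loglik(log_phi):
        phi = math.exp(log_phi)
        return sum(nb_logpmf(y, m, phi) for y, m in zip(counts, means))

    a, b = math.log(lo), math.log(hi)
    g = (math.sqrt(5) - 1) / 2
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if loglik(c) > loglik(d):       # maximum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                           # maximum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return math.exp((a + b) / 2)


counts = [2, 18, 10, 4, 16, 10]
print(genewise_dispersion(counts, [10.0] * len(counts)))
```

The search finds a local maximum within the interval, which is adequate when the log-likelihood has a single peak in φ; the DP analogue in equation (15) would swap in the DP log PMF.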

#### Common Dispersion Estimators

It is essential for reliable dispersion estimation that information is shared between genes, especially when few replicates are available (Robinson and Smyth, 2008). The simplest method of sharing information is to assume that the dispersion parameters are common for all genes and then to use the entire dataset to directly calculate a precise common dispersion. However, it is generally not true that each gene has the same dispersion in practice (Robinson and Smyth, 2007). Consequently, we should seek a more general common dispersion-estimation approach that compromises between entirely individual gene-wise dispersions and an entirely shared common dispersion. Here, we assumed that the dispersions are common across all genes having similar expression strengths, suggesting that if the means for some genes are similar, the dispersions (or variances) for these genes are also similar. We adopted a locally weighted regression similar to that used by voom (Law et al., 2014) in order to obtain the common dispersion estimators (φ<sub>i</sub><sup>common</sup> for the NB model or θ<sub>i</sub><sup>common</sup> for the DP model) for the ith gene by regressing the gene-wise dispersion estimators, φ<sub>i</sub><sup>genewise</sup> (or θ<sub>i</sub><sup>genewise</sup>), onto the means, µ<sub>i</sub>, of the normalized read counts. This is similar to the data-driven parameter estimation used by DESeq, which uses a smooth function to model the observed mean-variance (or mean-dispersion) relationship for the genes in the read-count data (Anders and Huber, 2010).
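The regression of gene-wise dispersions onto means can be illustrated with a small locally weighted linear fit. The sketch below is not the lowess routine used by voom: it has no robustness iterations, it fits on the log-log scale (an assumption), and the `span` value and function names are hypothetical.

```python
import math


def local_linear_fit(x, y, span=0.5):
    """Tricube-weighted local linear regression evaluated at every x[i]."""
    n = len(x)
    k = max(2, math.ceil(span * n))            # neighborhood size
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12              # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def common_dispersions(means, genewise, span=0.5):
    """Fitted mean-dispersion trend, returned on the original scale."""
    lx = [math.log(m) for m in means]
    ly = [math.log(d) for d in genewise]
    return [math.exp(v) for v in local_linear_fit(lx, ly, span)]


means = [1, 2, 4, 8, 16, 32, 64, 128]
genewise = [2.0 * m ** -0.5 for m in means]
print(common_dispersions(means, genewise))
```

Because each gene borrows information only from genes with similar means, the fitted value sits between a fully gene-specific estimate and a single dispersion shared by all genes.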

#### Shrinkage-Dispersion Estimators

Shrinkage estimation can effectively improve statistical tests for differential gene expression in the case of a small number of samples (Cui et al., 2005). As mentioned previously, in order to obtain a more accurate and robust estimate, an empirical Bayes (EB) approach has been used to shrink gene-wise dispersions toward common dispersions, which effectively allows the borrowing of information between genes (Robinson and Smyth, 2007; Robinson et al., 2010). The DSS and DESeq2 methods use an EB approach incorporating shrinkage with an NB model to squeeze the gene-wise dispersion estimates toward an LN prior, where the strength of shrinkage depends on how reliably the individual gene-wise dispersions can be estimated (Wu et al., 2013; Love et al., 2014). Here, we assumed that the gene-wise dispersions, α, followed an LN prior with two parameters: the mean, m<sub>0</sub>, and the standard deviation (SD), τ. The probability density function (PDF) of the LN model is given as:

$$P\left(\alpha \middle| m\_0, \tau\right) = \frac{1}{\alpha \sqrt{2\pi \tau^2}} e^{-\frac{\left(\log(\alpha) - m\_0\right)^2}{2\tau^2}},\tag{16}$$

where α represents φ<sub>i</sub><sup>genewise</sup> and θ<sub>i</sub><sup>genewise</sup> for the NB and DP models, respectively. The two parameters of the LN model are estimated as follows:

$$m\_0 = \text{median}(\log(\beta))\tag{17}$$

and

$$
\tau = \text{mad}(\log(\alpha) - \log(\beta)),
\tag{18}
$$

respectively, where mad represents the median absolute deviation, and β represents φ<sub>i</sub><sup>common</sup> and θ<sub>i</sub><sup>common</sup> for the NB and DP models, respectively.

We adopted the same strategy as the DSS and DESeq2 methods to estimate the shrinkage dispersions for the ith gene in the NB and DP models:

$$\phi\_i^{shrinkage} = \operatorname{argmax}\_{\phi} \left( \sum\_{n} \log \left( f\_{\text{NB}}(Y\_{ijk}, \mu\_{ik}, \phi) \right) + f\_{\text{LN}}(\phi, m\_0, \tau) \right) \tag{19}$$

and

$$\theta\_i^{\text{shrinkage}} = \operatorname\*{argmax}\_{\theta} \left( \sum\_{n} \log \left( f\_{\text{DP}}(Y\_{ijk}, \mu\_{ik}, \theta) \right) + f\_{\text{LN}}(\theta, m\_0, \tau) \right) \tag{20}$$

respectively, where φ = φ<sub>i</sub><sup>genewise</sup>, θ = θ<sub>i</sub><sup>genewise</sup>, f<sub>NB</sub>(·) and f<sub>DP</sub>(·) are the PMFs of the NB and DP models, and f<sub>LN</sub>(·) is the PDF of the LN model.
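The prior parameters in equations (17) and (18) and the density in equation (16) can be sketched in Python as below. The sketch uses the raw median absolute deviation; R's `mad()` multiplies by 1.4826 by default, and the text does not say which convention was used, so that choice is an assumption. The function names are illustrative.

```python
import math
from statistics import median


def ln_prior_params(genewise, common):
    """Estimate the log-normal prior (m0, tau) as in equations (17) and (18)."""
    m0 = median(math.log(b) for b in common)
    resid = [math.log(a) - math.log(b) for a, b in zip(genewise, common)]
    center = median(resid)
    tau = median(abs(r - center) for r in resid)   # raw MAD, no scaling constant
    return m0, tau


def ln_logpdf(alpha, m0, tau):
    """Logarithm of the log-normal density in equation (16)."""
    z = (math.log(alpha) - m0) / tau
    return -math.log(alpha * tau * math.sqrt(2 * math.pi)) - 0.5 * z * z


m0, tau = ln_prior_params([0.1, 0.2, 0.4, 0.8, 1.6], [0.2] * 5)
print(m0, tau, ln_logpdf(0.5, m0, tau))
```

A small τ means the gene-wise values scatter tightly around the trend, so the prior term pulls each estimate strongly toward the common dispersion; a large τ leaves the gene-wise likelihood in control.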

#### Final Dispersion Estimators

Bias in dispersion estimation has serious effects on the expected false-positive rates (FPRs) in small-sample situations (Robinson and Smyth, 2008). To avoid bias, DESeq by default chooses the larger of two dispersion estimators, the individual dispersion and the fitted dispersion, as the final dispersion for the gene (Anders and Huber, 2010). However, DESeq is often overly conservative because it overestimates the dispersion, which results in conservative tests (Robinson and Smyth, 2008; Soneson and Delorenzi, 2013). For this reason, we proposed a compromise approach called "window scan" to obtain the final dispersion estimators in five steps: (1) rank the genes from smallest to largest according to the means of samples across all conditions; (2) open a window with a default width of 1 count at the position where the mean is smallest; (3) based on the relationship between the shrinkage-dispersion estimator and the common-dispersion estimator, divide all genes in this window into I-type genes (shrinkage-dispersion estimator ≥ common-dispersion estimator) and II-type genes (shrinkage-dispersion estimator < common-dispersion estimator); (4) estimate the final dispersion of each I-type gene (or II-type gene) by choosing the larger value between its shrinkage-dispersion estimator and the median of the shrinkage-dispersion estimators of all I-type genes (or II-type genes) for the NB model (or choosing the smaller value for the DP model); and (5) shift the window toward larger means and repeat steps (3) and (4) until all of the genes have been scanned.
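The five steps of the window scan can be sketched as follows. Two details are assumptions made for illustration: windows are taken as consecutive, non-overlapping intervals of the given width that start at the smallest mean not yet scanned, and the group median in step (4) is computed within the current window. The function name `window_scan` is hypothetical.

```python
from statistics import median


def window_scan(means, shrink, common, width=1.0, model="NB"):
    """Final dispersions from a window sliding over genes ranked by mean."""
    pick = max if model == "NB" else min       # NB keeps the larger value, DP the smaller
    order = sorted(range(len(means)), key=lambda i: means[i])   # step (1)
    final = [None] * len(means)
    pos = 0
    while pos < len(order):
        start = means[order[pos]]              # step (2): open a window
        window = []
        while pos < len(order) and means[order[pos]] < start + width:
            window.append(order[pos])
            pos += 1
        # step (3): split the window into I-type and II-type genes
        type_one = [i for i in window if shrink[i] >= common[i]]
        type_two = [i for i in window if shrink[i] < common[i]]
        for group in (type_one, type_two):     # step (4)
            if not group:
                continue
            med = median(shrink[i] for i in group)
            for i in group:
                final[i] = pick(shrink[i], med)
    return final                               # step (5) is the outer loop


means = [0.2, 0.5, 0.9, 3.0, 3.4]
shrink = [0.5, 0.1, 0.3, 0.2, 0.6]
common = [0.2] * 5
print(window_scan(means, shrink, common, model="NB"))
print(window_scan(means, shrink, common, model="DP"))
```

Compared with taking the maximum of the individual and fitted dispersions for every gene, this rule only raises (NB) or lowers (DP) an estimate as far as the median of comparable genes in the same window.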

#### Test Statistic and Method Evaluation

#### Test Statistic

To detect DEGs between two treatment groups, we tested hypotheses of the form H<sub>0</sub>: µ<sub>i,1</sub> = µ<sub>i,2</sub> for gene i, where µ<sub>i,1</sub> and µ<sub>i,2</sub> are the expectations for the ith gene in groups 1 and 2, respectively. The Wald test has been widely applied in many previous studies because of its simplicity and flexibility (Ng and Tang, 2005; Chen et al., 2011; Yu et al., 2017). Similar to DSS and DESeq2, we constructed the Wald test statistic as:

$$W = \frac{\left| \mu\_{i,1} - \mu\_{i,2} \right|}{\sqrt{\sigma\_{i,1}^2 + \sigma\_{i,2}^2}},\tag{21}$$

where σ<sub>i,1</sub><sup>2</sup> and σ<sub>i,2</sub><sup>2</sup> are the variances for the ith gene in groups 1 and 2, respectively, and can be estimated using the final dispersion according to equation (3) in the NB model and equation (8) in the DP model.
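A minimal sketch of equation (21) is given below. The variance helpers assume the forms implied by rearranging equations (12) and (13), since equations (3) and (8) are not reproduced in this section, and the two-sided p-value from a standard normal reference is also an assumption. The statistic is computed exactly as printed, without dividing the variances by the number of replicates.

```python
import math


def nb_variance(mu, phi):
    """NB variance implied by equation (12): sigma^2 = mu + phi * mu^2."""
    return mu + phi * mu ** 2


def dp_variance(mu, theta):
    """DP variance implied by equation (13): sigma^2 = mu / theta."""
    return mu / theta


def wald_test(mu1, mu2, var1, var2):
    """Return (W, p), with W as in equation (21).

    The normal reference distribution for p is an assumption of this sketch.
    """
    w = abs(mu1 - mu2) / math.sqrt(var1 + var2)
    p = math.erfc(w / math.sqrt(2))     # equals 2 * (1 - Phi(w))
    return w, p


# An underdispersed gene under the DP model (theta > 1).
print(wald_test(12.0, 8.0, dp_variance(12.0, 6.0), dp_variance(8.0, 4.0)))
```

In the usage line both group variances equal 2, so W = 2 and the p-value falls just below 0.05; under the NB model with the dispersion floored near zero, the variances would be close to 12 and 8 and the same difference would not be significant.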

#### Method Evaluation

All methods analyzed will return nominal p-values. In order to obtain a more reliable list of DEGs, the p-values were adjusted by the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995). We evaluated the type I error rates (i.e., FPRs) and statistical powers (i.e., true-positive rates; TPRs) of different methods with a significance level of 0.05. Additionally, we used a receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and a precision-recall curve (PRC) to compare the performances of eight methods in the simulated datasets. It is common for biologists to be interested in detecting genes with fold changes (FCs) estimated according to the ratios of the mean normalized counts between two treatment groups. Therefore, some methods use FC as an indicator of DE, such as DEGseq and AMAP.Seq (Si and Liu, 2013). Here, we defined the genes satisfying either FC < 0.67 or FC > 1.5, and an adjusted p < 0.05 as DEGs according to previous studies (Peart et al., 2005; Si and Liu, 2013). This quantitative filter combines the significance level with the FC threshold and might be considered more practical by biologists. Therefore, we also identified DEGs using this filter.
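The combined filter described above (BH-adjusted p < 0.05 together with FC < 0.67 or FC > 1.5) can be sketched in Python as follows. The function names are illustrative, and the BH adjustment is written out by hand rather than taken from any package.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top                    # 1-based rank of p-value i
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(pvalues, fold_changes, alpha=0.05, lower=0.67, upper=1.5):
    """Flag genes passing both the adjusted p-value and fold-change cutoffs."""
    padj = bh_adjust(pvalues)
    return [q < alpha and (fc < lower or fc > upper)
            for q, fc in zip(padj, fold_changes)]


pvals = [0.01, 0.04, 0.03, 0.005]
fcs = [2.0, 1.2, 0.5, 1.0]
print(bh_adjust(pvals))
print(call_degs(pvals, fcs))
```

In this toy input every gene passes the adjusted p-value cutoff, but only the first and third also pass the FC cutoff, which shows how the filter removes statistically significant genes with small effect sizes.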

The performances of different methods were further validated by qRT-PCR analysis.

#### Datasets

#### Real Datasets

We chose three real datasets to represent different characteristics of RNA-seq data. The Pickrell dataset and the Hammer dataset were downloaded from the ReCount database (http://bowtie-bio.sourceforge.net/recount) (Frazee et al., 2011). The Pickrell dataset was obtained from lymphoblastoid cell lines derived from 69 unrelated Nigerian individuals as part of the International HapMap project (Pickrell et al., 2010) and contains 69 biological replicates. The Hammer dataset contains four biological replicates in each of two treatment groups: rat L4 dorsal root ganglia in the presence or absence of induced chronic neuropathic pain (Hammer et al., 2010). The third real dataset was the Arab dataset, provided as "arab" in the NBPSeq R package, which includes three biological replicates in which Arabidopsis leaves were inoculated with either a defense-eliciting ΔhrcC mutant of Pseudomonas syringae pv. tomato DC3000 or 10 mM MgCl<sub>2</sub> as a mock-treatment control (Di et al., 2011).

#### Simulated Datasets

Simulation studies represent necessary processes for investigating the properties associated with certain statistical methods, given that the "true" DEGs are known in simulated data. An ideal simulation would generate data with similar characteristics to those produced in real RNA-seq experiments. Therefore, similar to Landau and Liu (2013), we generated three independent simulated datasets using a DP model based on three real datasets, respectively. The simulation processes were repeated 30 times to ensure reasonable precision in parameter estimation. Each simulated dataset contains 10,000 genes, including 2,000 DEGs and 8,000 non-DEGs, two treatment groups, and three replicates for each treatment group.
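Drawing counts from a DP model can be sketched as below. The sketch assumes the double Poisson density of Efron (1986), normalizes it numerically over a truncated support, and samples from the resulting PMF; the truncation rule, the seed handling, and the function names are illustrative, and the procedure for assigning DEGs and fold changes is not modeled.

```python
import math
import random


def dp_pmf(mu, theta):
    """Numerically normalized double Poisson PMF on 0..y_max (Efron, 1986)."""
    def log_kernel(y):
        if y == 0:
            return 0.5 * math.log(theta) - theta * mu
        return (0.5 * math.log(theta) - theta * mu - y + y * math.log(y)
                - math.lgamma(y + 1) + theta * y * (1 + math.log(mu / y)))

    # Truncate the support far into the upper tail (illustrative rule).
    y_max = int(mu + 20 * math.sqrt(mu / theta) + 20)
    weights = [math.exp(log_kernel(y)) for y in range(y_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_counts(mu, theta, n, seed=0):
    """Draw n read counts for one gene from the DP model."""
    pmf = dp_pmf(mu, theta)
    rng = random.Random(seed)
    return rng.choices(range(len(pmf)), weights=pmf, k=n)


# Three replicates for an underdispersed gene (theta > 1).
print(simulate_counts(mu=10.0, theta=2.0, n=3))
```

With θ = 1 the kernel reduces to the ordinary Poisson PMF, θ > 1 gives variance below the mean, and θ < 1 gives overdispersion, so a single generator covers all three dispersion regimes.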

#### Foxtail Dataset

Foxtail millet (Setaria italica) is an important cereal crop in northern China, and the whole-genome sequence of Foxtail millet (Yugu-1 cultivar) was published in 2012 (Bennetzen et al., 2012; Zhang et al., 2012). In this study, we used a Foxtail RNA-seq dataset obtained by our own laboratory to compare the performance of DREAMSeq with other methods. This Foxtail dataset includes three biological replicates, in which roots from 1-week-old Foxtail millet seedlings (Yugu-1 cultivar) were treated with or without 1 µM epi-Brassinolide (eBL) for 2 h, followed by total RNA extraction using Trizol reagent (Invitrogen, Carlsbad, CA, United States). Extracted total RNA (2 µg per sample) was sequenced on an Illumina HiSeq X-ten platform, and the remaining RNA was used for qRT-PCR validation. The paired-end reads were aligned to the Foxtail millet reference genome (JGIv2.0.34) (Bennetzen et al., 2012; Goodstein et al., 2012) using TopHat (version 2.0.12) (Trapnell et al., 2009; Kim et al., 2013), and gene read counts were obtained using the program htseq-count from the Python package HTSeq (version 0.61) (Anders et al., 2015).

### qRT-PCR

First-strand cDNA was synthesized from 1 µg total RNA using Reverse Transcriptase M-MLV (Takara Bio, Otsu, Japan) according to the manufacturer's instructions. qRT-PCR was performed according to the standard protocol using a Bio-Rad CFX Connect real-time PCR system (Bio-Rad Laboratories, Hercules, CA, United States). Primers used are listed in **Table S1**. The expression of target genes was normalized to Foxtail Actin, and the relative expression between treatment and control groups was averaged from three independent experiments, with the p-value calculated using a one-sample t-test. We defined genes satisfying relative expression >1.5 or <0.67 and p < 0.05 as "true" DEGs.

### RESULTS

### The Mean–Variance Relationship in Real Datasets

When analyzing the Hammer, Arab, and Foxtail datasets, we found strong relationships between the variances and the means on the log-log scale for the read counts from the different real datasets (**Figure S1**). For convenience of notation and calculation, we used the unit line to represent equidispersion under the Poisson assumption. The data points on and above that line exhibit non-underdispersion, whereas the data points below that line exhibit underdispersion. **Figure S1** shows that 2,606 of 18,635 genes (14.0%) in the Hammer dataset, 2,015 of 26,222 genes (7.7%) in the Arab dataset, and 4,412 of 35,158 genes (12.5%) in the Foxtail dataset were estimated as underdispersed genes. Therefore, a considerable proportion of genes in the RNA-seq data are underdispersed. Furthermore, we noted that the underdispersed data points were mostly located in low read-count regions (**Figure S1**). These results suggested that in addition to non-underdispersion, underdispersion also exists in RNA-seq data and should be properly handled during the RNA-seq data-mining process.

Most RNA-seq analysis methods were developed based on an NB model, which is able to capture both equidispersed and overdispersed data but not underdispersed data. In comparison, a DP model can capture all RNA-seq data (Efron, 1986). Using real Hammer, Arab, and Foxtail datasets, we found that both DP and NB models were able to fit read-count data very well (**Figure S2**). This suggested that the DP model can be used to mine RNA-seq data.

### Generation of Simulated Datasets

Wu et al. (2013) reported that using real data-driven simulations provided a better estimate for gene-wise dispersions and improved DEG detection, because the true DE status of each gene is known by controlling the settings. Therefore, we generated three simulated datasets with mean and dispersion parameters estimated from three real datasets based on a commonly used DP model and denoted these as simPickrell, simHammer, and simArab, respectively. The average number of underdispersed genes in simPickrell, simHammer, and simArab was 1,299 (13%), 1,935 (19%), and 1,432 (14%), respectively. As shown in **Figure S3**, all simulated datasets were very similar to the corresponding real datasets in terms of distributions of the means and dispersions and relationships between means and dispersions. This indicated that our simulated data closely mimicked the real data.

## Type I Error Rate

Using the three simulated datasets, we first evaluated the type I error rates (i.e., FPRs) of the three DREAMSeq methods (DREAMSeq.DP, DREAMSeq.NB, and DREAMSeq.Mix) and five other widely used RNA-seq data-analysis methods (edgeR, DESeq, DESeq2, NBPSeq, and TSPM) under the null hypothesis. We found that except for TSPM, all other methods were able to control type I error rates well in both non-underdispersion and underdispersion situations (**Figure 1**). In comparison, DESeq was very conservative in terms of type I error rate, whereas the FPR control of both DREAMSeq.NB and NBPSeq clearly varied between non-underdispersion and underdispersion situations. In contrast, the median FPRs of DREAMSeq.DP, DREAMSeq.Mix, edgeR, and DESeq2 were relatively stable and consistently lower than or very close to the nominal type I error rate of 0.05 under all situations.

## Statistical Power, ROC, AUC, PRC, and Number of DEGs

We then evaluated the statistical powers (i.e., TPRs) of different methods using the simulated datasets under the alternative hypothesis (**Figure 2**). The results showed that in underdispersion situations, the TPR of DREAMSeq.Mix was slightly higher than that of DREAMSeq.DP, and the TPRs of both methods were higher than those of DREAMSeq.NB, edgeR, DESeq, DESeq2, and NBPSeq (**Figure 2**). In non-underdispersion situations, the TPRs of DREAMSeq.Mix and DREAMSeq.DP were comparable with those of the other methods. Interestingly, TSPM consistently showed higher TPRs. Given that TSPM also showed higher FPRs in similar situations, it is likely that the TSPM method increased statistical power at the cost of poor FPR control.

The ROC curve was constructed by plotting the TPR against the FPR for each method used for DE analysis. Theoretically, the method with the stronger statistical power for identifying DEGs should exhibit a ROC curve with a higher TPR relative to other methods at the same FPR level. **Figure S4** shows that NBPSeq and TSPM had lower TPRs when the FPR threshold was ∼0.05 in each scenario, whereas the ROC curves of the other methods were very similar. Additionally, we found that ROC curves associated with the simHammer dataset were steeper than those for the simPickrell and simArab datasets, suggesting that the performance of DEG identification by different methods was strongly dependent upon innate data characteristics, such as heterogeneity.

AUC is a relative measure of the quality of a DEG test, where a higher AUC indicates relatively better performance. To quantify the performances of different methods in detecting DEGs, the AUCs of the different methods were calculated. The results showed that the AUCs of DREAMSeq.DP and DREAMSeq.Mix were higher than those of DREAMSeq.NB, edgeR, DESeq, DESeq2, and NBPSeq in most situations, except that they were slightly lower than that of DESeq2 when analyzing simHammer and simArab underdispersed data (**Figure 3**). Together with the above FPR, TPR, and ROC results, these findings demonstrated that both DREAMSeq.DP and DREAMSeq.Mix were able to control type I error rates well while maintaining relatively high statistical power in detecting DEGs.
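The evaluation measures used above can be sketched in Python for a simulated dataset in which the true DE status is known. The functions below are illustrative and not the authors' evaluation code: `fpr_tpr` computes the rates at a fixed call threshold, and `roc_auc` computes the AUC through its rank interpretation, with the score taken as any quantity that is larger for stronger evidence (for example, 1 minus the p-value, which is an assumption).

```python
def fpr_tpr(called, is_deg):
    """False-positive and true-positive rates for a list of DEG calls."""
    fp = sum(c and not t for c, t in zip(called, is_deg))
    tp = sum(c and t for c, t in zip(called, is_deg))
    n_neg = sum(not t for t in is_deg)
    n_pos = sum(bool(t) for t in is_deg)
    return fp / n_neg, tp / n_pos


def roc_auc(scores, is_deg):
    """AUC as the probability that a true DEG outranks a non-DEG (ties count 0.5)."""
    pos = [s for s, t in zip(scores, is_deg) if t]
    neg = [s for s, t in zip(scores, is_deg) if not t]
    if not pos or not neg:
        raise ValueError("need at least one DEG and one non-DEG")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


truth = [True, False, True, False]
print(fpr_tpr([True, True, False, False], truth))
print(roc_auc([0.9, 0.8, 0.7, 0.6], truth))
```

The pairwise AUC is quadratic in the number of genes, which is acceptable for a sketch with a few thousand DEGs and non-DEGs; a sort-based rank-sum formula gives the same value more efficiently.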

The PRC shows the precision at each corresponding recall (TPR). Similar to the ROC curve, the PRC is also an important performance indicator used to evaluate different methods for identifying DEGs. **Figure S5** shows that all methods, except TSPM, had higher precision over the entire range of recall rates, regardless of dataset or dispersion. Additionally, we found that all methods exhibited their best predictive performance using the simHammer dataset, but did not predict very accurately using the simPickrell dataset in an underdispersion situation, which might also be related to the dataset itself.

We also compared the numbers of DEGs identified by the different methods. Both DREAMSeq.DP and DREAMSeq.Mix generally detected a larger number of DEGs than the other methods when analyzing non-underdispersed or underdispersed data from the three simulated datasets (**Figure 4**). The exceptions were the simHammer non-underdispersed data and TSPM, which displayed poor FDR control.

#### Analysis of the Foxtail Dataset

Our comprehensive evaluations showed that edgeR, DESeq, DESeq2, and DREAMSeq.Mix generally performed better when analyzing different simulated RNA-seq datasets; therefore, these methods were chosen to test their abilities to detect DEGs, especially underdispersed DEGs, using a real Foxtail dataset. A total of 128 non-underdispersed and 17 underdispersed DEGs were identified by at least one of the four methods (**Figure 5** and **Tables S2**–**S5**). Overall, the number of DEGs identified by DREAMSeq.Mix was much higher than that by DESeq but lower than that by edgeR and DESeq2 (**Figure 5A**). However, DREAMSeq.Mix identified 15 underdispersed DEGs, whereas edgeR identified 12, and DESeq2 identified 9 underdispersed DEGs. We defined DEGs detected only by one method as unique DEGs. Notably, DREAMSeq.Mix detected the highest number of unique DEGs in underdispersion scenarios, whereas DESeq did not identify any unique DEGs in either non-underdispersion or underdispersion scenarios (**Figures 5B,C**). Consistent with previous reports (Seyednasrollah et al., 2013; Tang et al., 2015), all of the DEGs found by DESeq were also found by edgeR (**Figures 5B,C**), possibly because these two methods use the same statistical model (i.e., the NB model) and hypothesis testing procedure (i.e., the Robinson and Smyth exact test) (Robinson and Smyth, 2008; Anders and Huber, 2010; Robinson et al., 2010). The presence of various unique DEGs also suggested the advantage of using more than one method to analyze the same RNA-seq data in order to allow maximum discovery of DEGs.

We then used qRT-PCR to validate whether the DEGs identified from the Foxtail dataset were "true" DEGs. Because DEGs identified by DESeq were also identified by edgeR, the unique DEGs identified by either edgeR, DESeq2, or DREAMSeq.Mix and the common DEGs identified simultaneously by any two methods were chosen for qRT-PCR analysis (**Figure 6**). The results showed that most of the DEGs chosen for validation exhibited upregulation or downregulation patterns similar to those shown by RNA-seq data analysis. For non-underdispersed DEGs, qRT-PCR results verified that 9 of 19 DEGs (47.4%) identified by DREAMSeq.Mix, 19 of 42 DEGs (45.2%) identified by edgeR, and 23 of 51 DEGs (45.1%) identified by DESeq2 were significantly upregulated or downregulated by eBL treatment by at least 1.5-fold. Notably, for underdispersed DEGs, 5 of 8 (62.5%) DEGs identified by DREAMSeq.Mix were validated as "true" DEGs. By contrast, only 2 of 5 (40.0%) DEGs identified by edgeR and no DEGs identified by DESeq2 were validated as "true" DEGs. These qRT-PCR results demonstrated that for non-underdispersed data, the number of DEGs identified by DREAMSeq.Mix was lower than those by edgeR and DESeq2, but the accuracy was slightly higher; however, for underdispersed data, DREAMSeq.Mix exhibited both a higher number of identified DEGs and better accuracy than the other two methods, demonstrating DREAMSeq.Mix as a powerful RNA-seq data-analysis method, especially for situations involving underdispersed data.

### DISCUSSION

RNA-seq is an increasingly popular method used to analyze global changes in gene expression during certain biological processes. Identifying DEGs is a key step in mining RNA-seq data and important for downstream biological analyses, such as cluster analysis, PCA, GO analysis, and Kyoto Encyclopedia of Genes and Genomes enrichment analysis. When analyzing RNA-seq data, most current methods focus on non-underdispersed data, with less attention given to underdispersed data. In this study, we observed that RNA-seq data also includes underdispersion characteristics. Additionally, Low et al. (2017) found that as the RNA-seq coverage increases, underdispersion becomes increasingly obvious. With the development of sequencing technology, the read length and RNA-seq coverage have increased significantly. Therefore, to take full advantage of RNA-seq data, it is important to explore both non-underdispersed and underdispersed data. However, most widely used DE-analysis methods, such as DESeq and edgeR, are based on the NB model. Due to the limitations of this model, the dispersion of underdispersed data is often overestimated, leading to conservative results in the determination of DEGs. In comparison, the DP model is capable of capturing not only non-underdispersion but also underdispersion. Considering the potential advantages of these two models, we developed a novel RNA-seq data-mining method (DREAMSeq.Mix) that combines the DP and NB models.

Using simulated datasets generated from three real RNA-seq experiments, we compared the performance of DREAMSeq.Mix at detecting DEGs with five other commonly used RNA-seq data-analysis methods. To provide a more comprehensive conclusion, we also added DREAMSeq.DP and DREAMSeq.NB methods, which were developed using only a DP model or an NB model, respectively, into the comparison. We found that DESeq, NBPSeq, and DREAMSeq.NB were often conservative, whereas TSPM, edgeR, and DESeq2 were more liberal in detecting DEGs. The poor performance of TSPM in our study might be due to the limited number of replicates in the RNA-seq datasets used (Auer and Doerge, 2011; Kvam et al., 2012; Soneson and Delorenzi, 2013). In comparison, DREAMSeq.DP and DREAMSeq.Mix often outperformed the other methods in terms of TPR, AUC, and the number of DEGs detected (**Figures 2**–**4**). The following reasons suggest that DREAMSeq.Mix offers distinct advantages over current RNA-seq data-mining methods.

First, DREAMSeq incorporates a more flexible DP model to fit highly complex and variable RNA-seq data. The dispersion parameter of the DP model is not subject to the same restrictions as the NB model when it is estimated in underdispersion situations. As a result, logarithmic dispersion estimated using the DP model (**Figure S3**) showed better normality than that acquired using the NB model (Figure 1 in Landau and Liu, 2013). This demonstrated that the DP model was able to accurately fit a wide range of read-count data without artificial intervention in RNA-seq data analysis. Therefore, DREAMSeq.DP and DREAMSeq.Mix often outperformed the other methods, especially in underdispersion situations, in simulation studies. Moreover, in terms of identifying the "true" underdispersed DEGs, DREAMSeq.Mix outperformed edgeR, DESeq, and DESeq2 according to qRT-PCR validation.

Second, DREAMSeq incorporates strategies, such as MoMs, MLE, and EB, which are used in the edgeR, DESeq, DSS, and DESeq2 methods, to obtain reliable dispersion estimation. Importantly, to avoid bias, DREAMSeq used a "window scan" approach to estimate dispersion and enhance DREAMSeq's robustness in analyzing a wider range of RNA-seq data. This enabled all DREAMSeq approaches to maintain a higher AUC across different simulated datasets in either non-underdispersion or underdispersion scenarios.

Third, in multiple scenarios, DREAMSeq.Mix performed slightly better than DREAMSeq.DP, although the difference was small. This indicated that the efficiency and robustness of DREAMSeq.Mix were improved by taking full advantage of both the DP and NB models to fit RNA-seq data.

Recently, single-cell RNA-seq (scRNA-seq) has rapidly become a powerful tool for analyzing gene-expression heterogeneity at the individual cell level and has been widely applied to diverse fields of biological research, including stem cell differentiation, embryogenesis, and whole-tissue analysis (Saliba et al., 2014). However, scRNA-seq data displays typical features of bimodality, which the NB model cannot capture (Vu et al., 2016), making such data less efficient for mining using common RNA-seq data-analysis methods. Additionally, Choo-Wosoba et al. (2016) reported that genomic next-generation sequencing data also involves underdispersion. The increased accuracy and robustness displayed in finding "true" DEGs with higher confidence and its better performance at exploring underdispersed data make DREAMSeq a potentially valuable tool for mining sequencing data generated from many other high-throughput platforms, such as scRNA-seq and genomic sequencing.

During our analysis, we found that none of the eight tested methods consistently outperformed other methods under all situations, because different methods are capable of identifying specific groups of DEGs. Although some DEGs can be identified by all methods, the existence of unique DEGs suggested that different methods exhibited specific preferences during DEG detection. Additionally, our study showed that the same method sometimes displayed a wide range of performance variability when analyzing different datasets. It is likely that the intrinsic characteristics of the RNA-seq data determine the appropriateness of one method for data analysis over others. Therefore, to ensure maximum coverage of DEG identification, it is advantageous to use more than one method to analyze the same RNA-seq data. Based on our comparison studies, we recommend using a combination of edgeR, DESeq2, and DREAMSeq.Mix for RNA-seq data analysis to potentially ensure the maximum retrieval of true DEGs in both non-underdispersion and underdispersion situations.

### CONCLUSIONS

Previous studies reported both equidispersion and overdispersion as important characteristics of RNA-seq data. In this study, we showed that underdispersion also exists in RNA-seq data. The NB model widely used in RNA-seq data-mining methods can only capture non-underdispersion but not underdispersion. Here, we presented a DP model capable of capturing not only non-underdispersion but also underdispersion. Given the potential advantages of the two models, we developed a novel RNA-seq data-mining method (DREAMSeq) that combines both the DP and NB models to ensure its flexibility and robustness for RNA-seq data mining. Additionally, we used a "window scan" approach to estimate dispersion and enhance the reliability of DREAMSeq across a wider range of RNA-seq data. Using simulated datasets generated from three real RNA-seq datasets and an in-house-generated Foxtail dataset, we demonstrated the ability of DREAMSeq to reach a better balance between conservative and liberal tests as compared with other methods. Our findings demonstrated DREAMSeq as a reliable and robust RNA-seq data-analysis method that provides important improvements in the DE analysis of RNA-seq data, especially in underdispersion situations.

## DATA AVAILABILITY

The DREAMSeq R package (version 1.0, Windows binary release) is publicly available (http://tanglab.hebtu.edu.cn/tanglab/Home/DREAMSeq). This package also contains a real Foxtail dataset obtained by our own laboratory.

## AUTHOR CONTRIBUTIONS

WT and ZG designed the research; ZG wrote the DREAMSeq R package and performed all data analyses; ZZ performed Foxtail RNA-seq and qRT-PCR experiments; and WT and ZG wrote the manuscript.

## FUNDING

This work was supported by grants from the National Natural Science Foundation of China (91417313, 2014CB943404, and 31670265) and the Science Foundation of Hebei University of Economics and Business (2013KYZ05).

## ACKNOWLEDGMENTS

We would like to thank Dr. Hong Zhang (School of Life Sciences, Fudan University) for valuable discussions and suggestions regarding this work.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00588/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Gao, Zhao and Tang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# CircCode: A Powerful Tool for Identifying circRNA Coding Ability

*Peisen Sun<sup>1,2</sup> and Guanglin Li<sup>1,2</sup>\**

*<sup>1</sup> Key Laboratory of Ministry of Education for Medicinal Plant Resource and Natural Pharmaceutical Chemistry, Shaanxi Normal University, Xi'an, China, <sup>2</sup> College of Life Sciences, Shaanxi Normal University, Xi'an, China*

Circular RNAs (circRNAs), which play vital roles in many regulatory pathways, are widespread in many species. Although many circRNAs have been discovered in plants and animals, the functions of these RNAs have not been fully investigated. In addition to the function of circRNAs as microRNA (miRNA) decoys, the translation potential of circRNAs is important for the study of their functions; yet, few tools are available to identify their translation potential. With the development of high-throughput sequencing technology and the emergence of ribosome profiling technology, it is possible to identify the coding ability of circRNAs with high sensitivity. To evaluate the coding ability of circRNAs, we first developed the CircCode tool and then used CircCode to investigate the translation potential of circRNAs from humans and *Arabidopsis thaliana*. Based on the ribosome profile databases downloaded from NCBI, we found 3,610 and 1,569 translated circRNAs in humans and *A. thaliana*, respectively. Finally, we tested the performance of CircCode and found a low false discovery rate and high sensitivity for identifying circRNA coding ability. CircCode, a Python 3–based framework for identifying the coding ability of circRNAs, is also a simple and powerful command line-based tool. To investigate the translation potential of circRNAs, the user can simply fill in the given configuration file and run the Python 3 scripts. The tool is freely available at https://github.com/PSSUN/CircCode.

#### *Edited by:*

*Filippo Geraci, Italian National Research Council (CNR), Italy*

#### *Reviewed by:*

*Wojciech M. Karlowski, Adam Mickiewicz University in Poznań, Poland; Cuncong Zhong, University of Kansas, United States*

> *\*Correspondence: Guanglin Li glli@snnu.edu.cn*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 21 June 2019 Accepted: 13 September 2019 Published: 10 October 2019*

#### *Citation:*

*Sun P and Li G (2019) CircCode: A Powerful Tool for Identifying circRNA Coding Ability. Front. Genet. 10:981. doi: 10.3389/fgene.2019.00981*

Keywords: bioinformatics, circular RNAs, ribosome profiling data, translation, coding potential, classification

## INTRODUCTION

Circular RNAs (circRNAs) are a special type of noncoding RNA molecule that has become a hot research topic in the field of RNA and is receiving a great deal of attention (Chen and Yang, 2015). Compared with traditional linear RNAs (containing 5′ and 3′ ends), circRNA molecules usually have a closed circular structure, rendering them more stable and less prone to degradation (Vicens and Westhof, 2014). Although the existence of circRNAs has been known for some time, these molecules were long considered to be a by-product of RNA splicing. However, with the development of high-throughput sequencing and bioinformatics technologies, circRNAs have become widely recognized in animals and plants (Chen and Yang, 2015). Recent studies have also shown that a large number of circRNAs can be translated into small peptides in cells (Pamudurti et al., 2017) and have key roles despite their sometimes low level of expression (Hsu and Benfey, 2018; Yang et al., 2018). Although an increasing number of circRNAs are being identified, their functions in plants and animals generally remain to be studied. In addition to their functions as miRNA decoys, circRNAs have important translational potential, but no tools are available for specifically predicting the translational capabilities of these molecules (Jakobi and Dieterich, 2019).


Several tools do exist for the prediction and identification of circRNAs, such as CIRI (Gao et al., 2015), CIRCexplorer (Dong et al., 2019), CircPro (Meng et al., 2017), and circtools (Jakobi et al., 2018). Among them, CircPro can reveal translated circRNAs by calculating a translation potential score for circRNAs based on CPC (Kong et al., 2007), which is a tool for identifying the open reading frame (ORF) in a given sequence. However, because some circRNAs do not use the start codon during translation (Ingolia et al., 2011; Slavoff et al., 2013; Kearse and Wilusz, 2017; Spealman et al., 2018), employing CPC may filter out some truly translated circRNAs. In this study, we used BASiNET (Ito et al., 2018), which is an RNA classifier based on the machine learning methods (random forest and J48 model). It initially transforms the given coding RNAs (positive data) and noncoding RNAs (negative data) and represents them as complex networks; it then extracts the topological measures of these networks and constructs a feature vector to train the model that is used to classify the coding capacity of circRNAs. With this method, erroneous filtering of translated circRNAs that are not initiated by AUG is avoided. Additionally, Ribo-seq technology, which is based on high-throughput sequencing to monitor RPFs (ribosomal protected fragments) of transcripts (Guttman et al., 2013; Brar and Weissman, 2015), can be utilized to determine the locations of circRNAs that are being translated (Michel and Baranov, 2013). To identify the coding ability of circRNAs, we developed the tool CircCode, which involves a Python 3–based framework, and applied CircCode to investigate the translation potential of circRNAs from humans and *Arabidopsis thaliana*. Our work provides a rich resource for further study of the functions of circRNAs with coding capacity.

## METHODS

CircCode was written in the Python 3 programming language; it uses Trimmomatic (Bolger et al., 2014), bowtie (Langmead and Salzberg, 2012), and STAR (Dobin et al., 2013) to filter raw Ribo-seq reads and map these filtered reads to the genome. CircCode then identifies Ribo-seq read-mapped regions in circRNAs that contain junctions. After that, the candidate mapped sequences in the circRNAs are sorted based on classifiers (J48 model) into coding RNAs and noncoding RNAs by BASiNET. Finally, short peptides produced by translation are identified as potential coding regions of circRNAs. The entire process of CircCode consists of five steps (**Figure 1**).

#### Filtering of Ribosomal Profiling Data

First, low-quality fragments and adapters in the Ribo-Seq reads are removed by Trimmomatic with the default parameters to obtain clean Ribo-seq reads. Second, these clean Ribo-seq reads are mapped to an rRNA library to remove reads derived from rRNA using bowtie. Because the read lengths of Ribo-seq are relatively short (generally less than 50 bp), it is possible for one read to match multiple regions. In this case, it is difficult to determine which region a particular read corresponds to. To avoid this, the clean Ribo-seq reads are mapped to the genome of a species of interest, and the reads that are not perfectly aligned to the genome are regarded as the final unique Ribo-seq reads.
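The logic of the two subtraction steps can be illustrated with a toy sketch. It is not CircCode's implementation: exact substring matching stands in for the bowtie and genome alignments, strand and mismatches are ignored, and the function name `filter_reads` and all sequences are hypothetical.

```python
def filter_reads(reads, rrna_seqs, genome_seqs):
    """Toy stand-in for the rRNA and genome subtraction steps.

    A read is treated as 'perfectly aligned' when it occurs as an exact
    substring of a reference sequence (an assumption for illustration;
    real aligners also handle strand, mismatches and quality).
    """
    def occurs_in(read, references):
        return any(read in ref for ref in references)

    # Step 1: discard reads that match the rRNA library.
    non_rrna = [r for r in reads if not occurs_in(r, rrna_seqs)]
    # Step 2: keep only reads that do not match the linear genome perfectly.
    return [r for r in non_rrna if not occurs_in(r, genome_seqs)]


reads = ["ACGTAC", "GGGCCC", "TTTAAA"]
rrna = ["NNACGTACNN"]      # hypothetical rRNA sequence
genome = ["CCGGGCCCGG"]    # hypothetical genome fragment
kept = filter_reads(reads, rrna, genome)
print(kept)  # ['TTTAAA']
```

Only the read that matches neither reference survives, which mirrors the idea that reads explained by rRNA or by the linear genome are set aside before junction analysis.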

#### Assembling Virtual Genomes

CircRNAs usually appear as ring-shaped molecules in eukaryotes, and they can be identified based on their backsplicing junctions. However, the sequences of circRNAs in the fasta file are often in linear form. In such a file, the junction lies between the 5′ terminal nucleotide and the 3′ terminal nucleotide, so the junction and the sequence around it are not represented as contiguous sequence, which prevents Ribo-seq reads from being aligned across the junction in a straightforward manner.

CircCode connects the sequence of each circRNA in tandem such that the junction for each is in the middle of the newly constructed sequence. We also separated each series unit by 100N nucleotides to avoid confusion at the sequence alignment step (the length of each RPF is less than 50 bp). Finally, we obtained a virtual genome consisting only of candidate circRNAs in tandem separated by 100Ns. Because CircCode focuses only on alignment between Ribo-seq reads and circRNA sequences, we can investigate the coding potential of circRNAs by mapping the Ribo-seq reads to this virtual genome, which can save a large amount of computational time (the virtual genome is much smaller than the whole genome) and increase the accuracy (by avoiding interference between upstream and downstream sequence comparisons of the circRNAs).
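The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not CircCode's actual code: each circRNA is given as a linear string, "in tandem" is taken to mean the sequence concatenated with itself, and the function name `build_virtual_genome` and the coordinate convention are hypothetical.

```python
SPACER = "N" * 100  # 100 N bases between consecutive circRNA units


def build_virtual_genome(circrnas):
    """Concatenate doubled circRNA sequences into one virtual genome.

    circrnas: dict mapping a circRNA name to its linear sequence.
    Returns (genome, junctions), where junctions[name] is the 0-based
    index of the first base after the backsplice junction.
    """
    parts = []
    junctions = {}
    offset = 0
    for name, seq in circrnas.items():
        unit = seq + seq  # the junction now sits in the middle of the unit
        junctions[name] = offset + len(seq)
        parts.append(unit)
        offset += len(unit) + len(SPACER)
    return SPACER.join(parts), junctions


circ = {"circ1": "ACGT", "circ2": "GGCCA"}  # toy sequences
genome, junctions = build_virtual_genome(circ)
j = junctions["circ2"]
print(len(genome), junctions)
print(genome[j - 1:j + 1])  # last base of circ2 followed by its first base
```

Storing the junction coordinate while the genome is assembled is what later allows reads that cross a junction to be recognized from their mapped positions alone.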

#### Determination of the Ribo-seq Read-Mapped Region on a Junction (RMRJ) of circRNAs

The final unique Ribo-seq reads are mapped to the previously created virtual genome using STAR. Because each tandem circRNA unit was separated by 100 N bases when producing the virtual genome, the largest intron length was set to not exceed 10 bases with the parameter "--alignIntronMax 10." This parameter eliminates any interaction between different circRNAs in the sequence alignment. In the second step of virtual genome production, CircCode stores positional junction information for each circRNA in the virtual genome. If the Ribo-seq read-mapped region in the virtual genome includes the junction of the circRNA, and the number of mapped Ribo-seq reads on junction (NMJ) is greater than 3, the Ribo-seq read-mapped region on the junction of the circRNA can be regarded as an RMRJ, which reveals a roughly translated segment of the circRNA near the junction site.
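The NMJ criterion can be sketched in a few lines. The sketch assumes alignments are available as half-open `(start, end)` intervals on the virtual genome and that the RMRJ is taken as the span covered by junction-crossing reads; both are simplifications for illustration, and `find_rmrj` is a hypothetical name.

```python
def find_rmrj(alignments, junction, min_nmj=3):
    """Return the read-mapped region on a junction, or None.

    alignments: list of (start, end) half-open intervals of mapped reads.
    junction:   0-based index of the first base after the junction.
    A read counts toward NMJ only if it covers bases on both sides of
    the junction. The region is reported only when NMJ > min_nmj.
    """
    spanning = [(s, e) for s, e in alignments if s < junction < e]
    if len(spanning) <= min_nmj:
        return None
    start = min(s for s, _ in spanning)
    end = max(e for _, e in spanning)
    return start, end


reads = [(90, 120), (95, 125), (98, 128), (99, 130), (10, 40)]
print(find_rmrj(reads, junction=100))      # (90, 130): four reads cross
print(find_rmrj(reads[:3], junction=100))  # None: only three reads cross
```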

#### Training of the Model and Classification of RMRJs

Although RMRJs can constitute powerful evidence of translation, this method still has shortcomings. Because Ribo-seq reads are short, a read may be aligned to the wrong position. Therefore, it is not convincing to simply consider the region covered by the Ribo-seq reads as the translated region. To this end, a machine learning method is used to identify the coding ability of the RMRJ. First, CircCode extracts coding RNAs (positive data) and noncoding RNAs (negative data) from a species of interest and uses them for model training by means of the difference in feature vectors between coding and noncoding RNAs.

FIGURE 1 | Workflow of CircCode. Each part represents a different stage of operation. From left to right, the first part represents the filtering of the Ribo-seq data; the quality control is executed by Trimmomatic, and the rRNA reads are removed by bowtie. The second part represents the steps used to produce the virtual genome and align the filtered reads to the virtual genome with STAR. The last part represents the identification of translated circRNAs by machine learning. The bottom layer represents the last step used to predict the peptides translated from the circRNAs and the final output results, including information on translated circRNAs and their translation products.

CircCode then uses the trained model to classify the RMRJs obtained in the previous step by BASiNET. If the RMRJ of a circRNA is recognized as coding RNA, then this circRNA can be identified as a translated circRNA.
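The idea of turning a sequence into a network and reading off topological measures can be illustrated with a toy sketch. It is only loosely inspired by the approach described for BASiNET and does not reproduce that package: the word size of 3, the step of 1, the undirected edges between neighbouring words, and the four measures chosen here are all assumptions for illustration.

```python
from collections import defaultdict


def sequence_to_network(seq, word=3, step=1):
    """Map a sequence to an undirected word-adjacency network.

    Nodes are words of length `word`; an edge links two different words
    that occur next to each other when sliding along the sequence.
    """
    words = [seq[i:i + word] for i in range(0, len(seq) - word + 1, step)]
    adjacency = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def topological_features(adjacency):
    """Return [node count, edge count, mean degree, max degree]."""
    degrees = [len(neighbours) for neighbours in adjacency.values()]
    n_nodes = len(degrees)
    n_edges = sum(degrees) // 2  # each undirected edge is counted twice
    mean_degree = sum(degrees) / n_nodes if n_nodes else 0.0
    return [n_nodes, n_edges, mean_degree, max(degrees, default=0)]


features = topological_features(sequence_to_network("ATGATGA"))
print(features)  # [3, 3, 2.0, 2]
```

Feature vectors of this kind, computed for known coding and noncoding RNAs, are what a classifier such as a decision tree would be trained on before being applied to RMRJ sequences.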

#### Prediction of Translated Peptides by RMRJs

As expression of circRNAs in organisms is low, Ribo-seq data do not show the exact 3-nt periodicity clearly in the case of fewer RPFs. Therefore, it is difficult to determine the exact translation start site of a translated circRNA. Due to the presence of a stop codon in some RMRJs and because the start codon is difficult to determine, the method of finding an ORF based on a start codon and a stop codon is not feasible.

To determine the true translation regions of these circRNAs and generate the final translation product, FragGeneScan (Rho et al., 2010), which can predict protein-coding regions in fragmented genes and genes with frameshifts, is used to determine the translated peptides produced by circRNAs.

To avoid a cumbersome running process, all the modules can be called by a single shell script; the user simply fills in the given configuration file and passes it to the script, and the entire process for predicting translated circRNAs is then run. In addition, CircCode can be run step by step, so that the user can adjust the parameters in the middle of the procedure and view the results of each step as desired.

## RESULTS AND DISCUSSION

After testing on multiple computers, CircCode was found to run successfully with the required dependencies installed. To test the performance of CircCode, we used data for humans and *A. thaliana* to predict circRNAs with translation potential. The results were compared with circRNAs that have been verified experimentally as confirmation. Thereafter, we tested the false discovery rate (FDR) value of CircCode further. We used GenRGenS (Ponty et al., 2006) to generate a data set for testing based on known translated circRNAs and confirmed that the FDR value was within an acceptable range and at a low level. Finally, we evaluated the effect of different sequencing depths of Ribo-seq data on CircCode predictions and compared CircCode with other software.

#### Translated circRNAs in Humans and *A. thaliana*

To apply the CircCode tool to real data, we first downloaded the files including the human reference genome GRCh38, genome annotation, and human rRNA, from Ensembl. For *A. thaliana*, the reference genomes (TAIR10), genome annotation files, and corresponding rRNA sequences were all downloaded from Ensembl Plants. The Ribo-seq data for humans and *A. thaliana* were downloaded from RPFdb (accession numbers: GSE96643, GSE81295, GSE88794) (Hsu et al., 2016; Willems et al., 2017), and all the candidate circRNAs from human and *A. thaliana* were downloaded from CIRCPedia v2 (Dong et al., 2018) and PlantcircBase, respectively (Chu et al., 2017). Ultimately, we identified 3,610 translated circRNAs from human and 1,569 translated circRNAs from *A. thaliana* using CircCode (**Supplementary Data 1**).

#### Functional Enrichment of Human and *A. thaliana* circRNAs With Coding Potential

Using the CircCode results for human and *A. thaliana*, the online tool KOBAS 3.0 (Wu et al., 2006) was employed to annotate these translated circRNAs based on their parent genes. Furthermore, we performed GO (Gene Ontology) functional analysis and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis for these translated circRNAs using the R package clusterProfiler (Yu et al., 2012).

The KEGG results showed that the human circRNAs were enriched in protein processing in the endoplasmic reticulum pathway, carbon metabolism pathway, and RNA transport pathway. GO analysis indicated the participation of human translated circRNAs in the regulation of molecule binding, ATPase activity, and other RNA splicing-related biological processes. In addition, the translated circRNAs of *A. thaliana* are enriched in pathways related to stress resistance, suggesting that they play vital roles in this process (**Supplementary Data 2**).

#### Accuracy Test for CircCode

To investigate the accuracy of CircCode, test sequences generated by GenRGenS, which uses the hidden Markov model to produce sequences that have the same sequence characteristics (such as the frequencies of different nucleotides, different codons and different nucleotides at the start of the sequence), were used.

For this study, we used previously published human translated circRNAs (Yang et al., 2017) as the input for GenRGenS and generated 10,000 sequences to test CircCode. We repeated the test 10 times, and on average, 27 translated circRNAs were predicted each time. The FDR value was calculated to be 0.0027, which is much less than 0.05, indicating that the predicted results are credible.
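The reported value appears to follow from dividing the mean number of simulated sequences called as translated by the number of simulated sequences. A minimal check of that arithmetic, assuming all 10,000 GenRGenS sequences are treated as true negatives (implied but not stated explicitly in the text), with a hypothetical function name:

```python
def false_call_rate(false_calls, simulated_sequences):
    """Fraction of simulated (assumed non-translated) sequences that
    were nevertheless predicted as translated."""
    return false_calls / simulated_sequences


rate = false_call_rate(27, 10_000)  # mean of 27 calls per 10,000 sequences
print(f"{rate:.4f}")  # 0.0027
```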

In addition, we compared the translated circRNAs from humans as identified by CircCode with verified polysome-associated circRNA data (Yang et al., 2017). Among them, 60% of the circRNAs were identified by CircCode (**Supplementary Data 3**).

#### Influence of the Ribo-seq Data Sequencing Depth

To investigate the impact of the sequencing depth of Ribo-seq data on the CircCode identification results, we first tested the effect of sequencing depth on the number of translated circRNAs (**Figure 2A**). When the sequencing depth was low, the predicted number of translated circRNAs was low, and the number of translated circRNAs increased with increasing sequencing depth. The number of translated circRNAs became stable when the sequencing depth reached no less than 10× linear transcript coverage.

Second, the influence of NMJ on sensitivity at different sequencing depths was also assessed (**Figure 2B**). The results showed that NMJ had less impact on sensitivity as the sequencing depth increased. CircCode also had higher sensitivity when using Ribo-seq data with higher sequencing depth.

#### Comparison of CircCode With Other Tools

To compare CircCode with other tools, such as CircPro, the same set of Ribo-seq data (SRR3495999) from *A. thaliana* was used to identify translated circRNAs using six processors, with 16 gigabytes of RAM. CircPro identified 44 translated circRNAs in 13 min, whereas CircCode identified 76 translated circRNAs in 20 min. Thus, CircCode is more sensitive than CircPro at the same computer hardware level, but it takes more time. CircPro is concise and less time consuming than CircCode, but CircCode can identify more circRNAs with coding ability than CircPro.

## CONCLUSIONS

CircRNAs play an important role in biology, and it is crucial to accurately identify circRNAs with coding ability for subsequent research. Based on Python 3, we developed CircCode, an easy-to-use command line tool that has high sensitivity for identifying translated circRNAs from Ribo-seq reads with high accuracy. CircCode exhibits good performance in both plants and animals. Future work will add the downstream character analysis to CircCode by visualizing each step in the process and optimizing the accuracy of the prediction.

## AVAILABILITY AND REQUIREMENTS

CircCode is available at https://github.com/PSSUN/CircCode; operating system(s): Linux; programming languages: Python 3 and R; other requirements: bedtools (version 2.20.0 or later), bowtie, STAR, Python 3 packages (Biopython, Pandas, rpy2), R packages (BASiNET, Biostrings). The installation packages for all of the required software are available on the CircCode homepage. Users do not need to download them individually. The CircCode home page also provides detailed user manuals for reference. The tool is freely available. There are no restrictions on use by nonacademics.

## DATA AVAILABILITY STATEMENT

All relevant data are within the manuscript and its Supporting Information files.

## AUTHOR CONTRIBUTIONS

Conceptualization: PS, GL. Data Curation: PS, GL. Formal Analysis: PS, GL. Writing – Original Draft: PS, GL. Writing – Review and Editing: PS, GL.

## FUNDING

This work was supported by grants from the National Natural Science Foundation of China (grant nos. 31770333, 31370329, and 11631012), the Program for New Century Excellent Talents in University (NCET-12-0896), and the Fundamental Research Funds for the Central Universities (no. GK201403004). The funding agencies had no role in the study design, the data collection and analysis, the decision to publish, or the preparation of the manuscript.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00981/full#supplementary-material

SUPPLEMENTARY DATA 1 | The sequence of the predicted translated circRNA and short peptide.

SUPPLEMENTARY DATA 2 | GO enrichment and KEGG enrichment results for humans and *Arabidopsis thaliana*.

SUPPLEMENTARY DATA 3 | Comparison of predicted translated circRNAs with validated translated circRNAs.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Sun and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Single-Cell RNA-Seq Technologies and Related Computational Data Analysis

Geng Chen<sup>1</sup> \*, Baitang Ning<sup>2</sup> and Tieliu Shi<sup>1</sup> \*

<sup>1</sup> Center for Bioinformatics and Computational Biology, and Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China, <sup>2</sup> National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States

#### Edited by:

Filippo Geraci, Italian National Research Council (CNR), Italy

#### Reviewed by:

Vsevolod Jurievich Makeev, Vavilov Institute of General Genetics (RAS), Russia; Iros Barozzi, Imperial College London, United Kingdom

#### \*Correspondence:

Geng Chen gchen@bio.ecnu.edu.cn; Tieliu Shi tieliushi@yahoo.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 05 December 2018 Accepted: 21 March 2019 Published: 05 April 2019

#### Citation:

Chen G, Ning B and Shi T (2019) Single-Cell RNA-Seq Technologies and Related Computational Data Analysis. Front. Genet. 10:317. doi: 10.3389/fgene.2019.00317

Single-cell RNA sequencing (scRNA-seq) technologies allow the dissection of gene expression at single-cell resolution, which greatly revolutionizes transcriptomic studies. A number of scRNA-seq protocols have been developed, and these methods possess their unique features with distinct advantages and disadvantages. Due to technical limitations and biological factors, scRNA-seq data are noisier and more complex than bulk RNA-seq data. The high variability of scRNA-seq data raises computational challenges in data analysis. Although an increasing number of bioinformatics methods are proposed for analyzing and interpreting scRNA-seq data, novel algorithms are required to ensure the accuracy and reproducibility of results. In this review, we provide an overview of currently available single-cell isolation protocols and scRNA-seq technologies, and discuss the methods for diverse scRNA-seq data analyses including quality control, read mapping, gene expression quantification, batch effect correction, normalization, imputation, dimensionality reduction, feature selection, cell clustering, trajectory inference, differential expression calling, alternative splicing, allelic expression, and gene regulatory network reconstruction. Further, we outline the prospective development and applications of scRNA-seq technologies.

Keywords: single-cell RNA-seq, cell clustering, cell trajectory, alternative splicing, allelic expression

## INTRODUCTION

Bulk RNA-seq technologies have been widely used to study gene expression patterns at the population level in the past decade. The advent of single-cell RNA sequencing (scRNA-seq) provides unprecedented opportunities for exploring gene expression profiles at the single-cell level. Currently, scRNA-seq has become a favorable choice for studying the key biological questions of cell heterogeneity and the development of early embryos (which include only a small number of cells), since bulk RNA-seq mainly reflects the averaged gene expression across thousands of cells. In recent years, scRNA-seq has been applied to various species, especially to diverse human tissues (including normal and cancer), and these studies revealed meaningful cell-to-cell gene expression variability (Jaitin et al., 2014; Grun et al., 2015; Chen et al., 2016a; Cao et al., 2017; Rosenberg et al., 2018). With the innovation of sequencing technologies, several different scRNA-seq protocols have been proposed in the past few years, which largely facilitated the understanding of dynamic gene expression at single-cell resolution (Kolodziejczyk et al., 2015; Haque et al., 2017; Picelli, 2017; Chen et al., 2018). One of them is the highly efficient strategy LCM-seq (Nichterwitz et al., 2016), which combines laser capture microscopy (LCM) and Smart-seq2 (Picelli et al., 2013) for single-cell transcriptomics without tissue dissociation. Currently available scRNA-seq protocols can be mainly split into two categories based on the captured transcript coverage: (i) full-length transcript sequencing approaches [such as Smart-seq2 (Picelli et al., 2013), MATQ-seq (Sheng et al., 2017) and SUPeR-seq (Fan X. et al., 2015)]; and (ii) 3′-end [e.g., Drop-seq (Macosko et al., 2015), Seq-Well (Gierahn et al., 2017), Chromium (Zheng et al., 2017), and DroNC-seq (Habib et al., 2017)] or 5′-end [such as STRT-seq (Islam et al., 2011, 2012)] transcript sequencing technologies. Each scRNA-seq protocol has its own benefits and drawbacks, so different scRNA-seq approaches have distinct features and disparate performances (Ziegenhain et al., 2017). In conducting a single-cell transcriptomic study, a specific scRNA-seq technology may need to be chosen by balancing the research goal against the sequencing cost.

Owing to the low amount of starting material, scRNA-seq has limitations of low capture efficiency and high dropouts (Haque et al., 2017). Compared to bulk RNA-seq, scRNA-seq produces noisier and more variable data. The technical noise and biological variation (e.g., stochastic transcription) raise substantial challenges for computational analysis of scRNA-seq data. A variety of tools have been designed to conduct diverse bulk RNA-seq data analyses, but many of those methods cannot be directly applied to scRNA-seq data (Stegle et al., 2015). Except for short-read mapping, almost all data analyses (such as differential expression, cell clustering, and gene regulatory network inference) differ in methods between scRNA-seq and bulk RNA-seq. Due to the high technical noise, quality control (QC) is crucial for identifying and removing low-quality scRNA-seq data to get robust and reproducible results. Furthermore, some analyses, including alternative splicing (AS) detection, allelic expression exploration and RNA-editing identification, are not suitable for the 3′- or 5′-tag sequencing protocols of scRNA-seq, but these analyses could be applicable to the data generated by whole-transcript scRNA-seq. On the other hand, an increasing number of tools are specially proposed for analyzing scRNA-seq data, and each method has its own pros and cons (Stegle et al., 2015; Bacher and Kendziorski, 2016). Therefore, to effectively handle the high variability of scRNA-seq data, attention should be paid to choosing appropriate analytical approaches.

This Review aims to summarize and discuss currently available scRNA-seq technologies and various data analysis methods. We first introduce distinct single-cell isolation protocols and various scRNA-seq technologies developed in recent years. Then we focus on the analyses of scRNA-seq data and highlight the analytical differences between bulk RNA-seq and scRNA-seq data. Considering the high technical noise and complexity of scRNA-seq data, we also provide recommendations on the selection of suitable tools to analyze scRNA-seq data and ensure the reproducibility of results.

## ISOLATION OF SINGLE CELLS

The first step of scRNA-seq is the isolation of individual cells (**Figure 1**), although capture efficiency remains a big challenge for scRNA-seq. Currently, several different approaches are available for isolating single cells, including limiting dilution, micromanipulation, fluorescence-activated cell sorting (FACS), laser capture microdissection (LCM), and microfluidics (Gross et al., 2015; Kolodziejczyk et al., 2015; Hwang et al., 2018). The limiting dilution technique uses pipettes to isolate cells by dilution; the main limitation of this method is its inefficiency. Micromanipulation is a classical approach used to retrieve cells from samples with a small number of cells, such as early embryos or uncultivated microorganisms, but this technique is time-consuming and low throughput. FACS has been widely used for isolating single cells, but it requires large starting volumes (>10,000 cells) in suspension. LCM is an advanced strategy used for isolating individual cells from solid tissues by using a computer-aided laser system. Microfluidics is increasingly popular due to its low sample consumption, precise fluid control and low analysis cost. These single-cell isolation protocols have their own advantages and show distinct performances in terms of capture efficiency and purity of the target cells (Gross et al., 2015; Hu et al., 2016).

## CURRENTLY AVAILABLE SCRNA-SEQ TECHNOLOGIES

To date, a number of scRNA-seq technologies have been proposed for single-cell transcriptomic studies (**Table 1**). The first scRNA-seq method was published by Tang et al. (2009), and many other scRNA-seq approaches were subsequently developed. Those scRNA-seq technologies differ in at least one of the following aspects: (i) cell isolation; (ii) cell lysis; (iii) reverse transcription; (iv) amplification; (v) transcript coverage; (vi) strand specificity; and (vii) UMI (unique molecular identifiers, molecular tags that can be applied to detect and quantify the unique transcripts) availability. One conspicuous difference among these scRNA-seq methods is that some of them can produce full-length (or nearly full-length) transcript sequencing data (e.g., Smart-seq2, SUPeR-seq, and MATQ-seq), whereas others only capture and sequence the 3′-end [such as Drop-seq, Seq-Well, DroNC-seq, and SPLiT-seq (Rosenberg et al., 2018)] or 5′-end (e.g., STRT-seq) of the transcripts (**Table 1**). Distinct scRNA-seq protocols may possess disparate strengths and weaknesses, and several published reviews have compared a portion of them in detail (Kolodziejczyk et al., 2015; Haque et al., 2017; Picelli, 2017; Ziegenhain et al., 2017). A previous study demonstrated that Smart-seq2 can detect a larger number of expressed genes than other scRNA-seq technologies including CEL-seq2 (Hashimshony et al., 2016), MARS-seq (Jaitin et al., 2014), Smart-seq (Ramskold et al., 2012), and Drop-seq protocols (Ziegenhain et al., 2017). Recently, Sheng et al. (2017) showed that another full-length transcript sequencing approach, MATQ-seq, could outperform Smart-seq2 in detecting low-abundance genes.

Compared to 3′-end or 5′-end counting protocols, full-length scRNA-seq methods have clear advantages in isoform usage analysis, allelic expression detection, and RNA editing identification owing to their superior transcript coverage. Moreover, for detecting certain lowly expressed genes/transcripts, full-length scRNA-seq approaches could be better than 3′ sequencing methods (Ziegenhain et al., 2017). Notably, droplet-based technologies [e.g., Drop-seq (Macosko et al., 2015), InDrop (Klein et al., 2015), and Chromium (Zheng et al., 2017)] can generally provide a larger throughput of cells and a lower sequencing cost per cell compared to whole-transcript scRNA-seq. Thus, droplet-based protocols are suitable for profiling huge numbers of cells to identify the cell subpopulations of complex tissues or tumor samples.

Strikingly, several scRNA-seq technologies can capture both polyA+ and polyA− RNAs, such as SUPeR-seq (Fan X. et al., 2015) and MATQ-seq (Sheng et al., 2017). These protocols are extremely useful for sequencing long noncoding RNAs (lncRNAs) and circular RNAs (circRNAs). Many studies have demonstrated that lncRNAs and circRNAs play important roles in diverse biological processes of cells and may serve as crucial biomarkers for cancers (Barrett and Salzman, 2016; Chen et al., 2016b; Quinn and Chang, 2016; Kristensen et al., 2018); therefore, such scRNA-seq methods can provide unprecedented opportunities to comprehensively explore the expression dynamics of both protein-coding and noncoding RNAs at the single-cell level.

Compared to traditional bulk RNA-seq technologies, scRNA-seq protocols suffer from higher technical variation. In order to estimate the technical variance among different cells, spike-ins [such as External RNA Control Consortium (ERCC) controls (External, 2005)] and UMIs have been widely used in corresponding scRNA-seq methods. The RNA spike-ins are RNA transcripts (with known sequences and quantity) that are applied to calibrate the measurements of RNA hybridization assays, such as RNA-Seq, and UMIs can theoretically enable the estimation of absolute molecular counts. It is worth noting that ERCC spike-ins and UMIs are not applicable to all scRNA-seq technologies due to the inherent protocol differences. Spike-ins are used in approaches like Smart-seq2 and SUPeR-seq but are not compatible with droplet-based methods, whereas UMIs are typically applied to 3′-end sequencing technologies [such as Drop-seq (Macosko et al., 2015), InDrop (Klein et al., 2015), and MARS-seq (Jaitin et al., 2014)]. Consequently, users can select a suitable scRNA-seq method according to its technical properties and advantages, the number of cells to be sequenced, and cost considerations.
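How UMIs allow molecules, rather than amplified reads, to be counted can be shown with a minimal sketch. It assumes each read has already been assigned a cell barcode, a gene, and a UMI, and it ignores sequencing errors in the UMI; the function name `count_umis` and the toy records are hypothetical.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse PCR duplicates by counting distinct UMIs.

    reads: iterable of (cell_barcode, gene, umi) tuples.
    Returns a dict mapping (cell_barcode, gene) to the number of
    distinct UMIs, used here as the estimated molecule count.
    """
    molecules = defaultdict(set)
    for cell, gene, umi in reads:
        molecules[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}


reads = [
    ("cell1", "GeneA", "AAAC"),
    ("cell1", "GeneA", "AAAC"),  # same UMI: treated as a PCR duplicate
    ("cell1", "GeneA", "TTGA"),
    ("cell2", "GeneA", "AAAC"),
]
counts = count_umis(reads)
print(counts)  # {('cell1', 'GeneA'): 2, ('cell2', 'GeneA'): 1}
```

Four reads collapse to three molecules, which is the sense in which UMIs reduce amplification bias in 3′-end protocols.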

## READ ALIGNMENT AND EXPRESSION QUANTIFICATION OF SCRNA-SEQ DATA

The mapping ratio of reads is an important indicator of the overall quality of scRNA-seq data. Since both scRNA-seq and bulk RNA-seq technologies generally sequence transcripts into reads and generate raw data in FASTQ format, the two types of RNA-seq data do not differ in read alignment. The mapping tools originally developed for bulk RNA-seq are therefore also applicable to scRNA-seq data. Numerous spliced alignment programs have been designed for mapping RNA-seq data and have been extensively discussed previously (Li and Homer, 2010; Chen et al., 2011). Generally, read mapping algorithms fall into two main categories: spaced-seed indexing based and Burrows-Wheeler transform (BWT) based (Li and Homer, 2010). Currently popular aligners like TopHat2 (Kim et al., 2013), STAR (Dobin and Gingeras, 2015), and HISAT (Kim et al., 2015) perform well in mapping speed and accuracy, and they can efficiently map billions of reads to the reference genome or transcriptome (**Table 2**). STAR is a suffix-array based method and is faster than TopHat2, but it requires a large amount of memory (28 gigabytes for the human genome) for read mapping (Dobin and Gingeras, 2015). Engstrom et al. systematically evaluated 26 read alignment protocols (not including HISAT) and found that different mapping tools exhibit distinct strengths and weaknesses; for example, some programs map reads faster but detect splice junctions less accurately (Engstrom et al., 2013). HISAT is based on the BWT and the Ferragina-Manzini (FM) index. Kim et al. (2015) showed that HISAT was the fastest tool available at the time while achieving accuracy equal to or better than that of other aligners.

TABLE 2 | Tools for read mapping and expression quantification of scRNA-seq data.

For gene/transcript expression quantification, distinct approaches are needed depending on the range of the transcript sequence captured by the scRNA-seq protocol. The data generated by whole-transcript scRNA-seq methods (such as Smart-seq2 and MATQ-seq) can be analyzed with the software developed for bulk RNA-seq to quantify gene/transcript expression. Two main approaches are available for transcriptome reconstruction: de novo assembly (which does not need a reference genome) and reference-based or genome-guided assembly (Chen et al., 2017b). De novo transcriptome assembly methods are primarily applied to organisms that lack a reference genome, and are generally less accurate than genome-guided assembly (Garber et al., 2011). Popular genome-guided tools including Cufflinks (Trapnell et al., 2010), RSEM (Li and Dewey, 2011), and StringTie (Pertea et al., 2015) have been broadly used in many scRNA-seq studies to obtain relative gene/transcript expression estimates in reads or fragments per kilobase per million mapped reads (RPKM or FPKM) or transcripts per million (TPM) (**Table 2**). Pertea et al. (2015) stated that StringTie outperforms other genome-guided approaches in gene/transcript reconstruction and expression quantification. On the other hand, for 3′-end scRNA-seq protocols (e.g., CEL-seq2, MARS-seq, Drop-seq, and InDrop), specific algorithms are required to calculate gene/transcript expression based on UMIs. SAVER (single-cell analysis via expression recovery) is an efficient UMI-based tool recently proposed for accurately estimating gene expression of single cells (Huang et al., 2018). In theory, UMI-based scRNA-seq can largely reduce the technical noise, which remarkably benefits the estimation of absolute transcript counts (Islam et al., 2014).
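The relative expression units mentioned above can be written down in a few lines. The following is a minimal sketch of the standard FPKM and TPM definitions, assuming per-gene fragment counts and transcript lengths in base pairs are already available; the function names and toy numbers are illustrative and do not reproduce the statistical models used by Cufflinks, RSEM, or StringTie.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (total * length)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Toy data: three genes whose counts are proportional to their lengths,
# so all three should receive the same length-normalized expression.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

TPM values always sum to one million within a sample, which makes proportions comparable across samples, whereas the sum of FPKM values varies from sample to sample.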

## QUALITY CONTROL OF SCRNA-SEQ DATA

The limitations of scRNA-seq, including biased transcript coverage, low capture efficiency, and limited sequencing coverage, give scRNA-seq data a higher level of technical noise than bulk RNA-seq data (Kolodziejczyk et al., 2015). Even with the most sensitive scRNA-seq protocols, some transcripts frequently go undetected (termed dropout events) (Haque et al., 2017). Generally, scRNA-seq experiments can generate a portion of low-quality data from cells that are broken or dead, or from captures that contain multiple cells (Ilicic et al., 2016). These low-quality cells will hinder the downstream analysis and may lead to misinterpretation of the data. Accordingly, QC of scRNA-seq data is crucial to identify and remove the low-quality cells.

To exclude low-quality cells from scRNA-seq, close attention should be paid to avoiding multiple cells or dead cells in the cell capture step. After sequencing, a series of QC analyses are required to eliminate the data from low-quality cells. Samples that contain only a small number of reads should be discarded first, since insufficient sequencing depth may lead to the loss of a large portion of lowly and moderately expressed genes. Then tools initially developed for QC of bulk RNA-seq data, such as FastQC<sup>1</sup>, can be employed to check the sequencing quality of scRNA-seq data. Moreover, after read alignment, samples with a very low mapping ratio should be eliminated because they contain a large fraction of unmappable reads that might result from RNA degradation. If extrinsic spike-ins (such as ERCC) were used in scRNA-seq, technical noise can be estimated. Cells with an extremely high proportion of reads mapped to the spike-ins were probably broken during the cell capture process and should be removed (Ilicic et al., 2016). Cytoplasmic RNAs are usually lost from broken cells while mitochondrial RNAs are retained, so the ratio of reads mapped to the mitochondrial genome is also informative for identifying low-quality cells (Bacher and Kendziorski, 2016). Additionally, the number of expressed genes/transcripts detected in each cell is informative. If only a small number of genes can be detected in a cell, the cell is probably damaged, dead, or affected by RNA degradation. Considering the high noise of scRNA-seq data, a threshold of 1 FPKM/RPKM is usually applied to define the expressed genes. Several QC methods for scRNA-seq have been proposed (Stegle et al., 2015; Ilicic et al., 2016), including SinQC (Jiang et al., 2016) and Scater (McCarthy et al., 2017).
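The per-cell criteria described above (sequencing depth, number of detected genes, mitochondrial fraction, and spike-in fraction) can be combined into a simple filter. The sketch below is only an illustration: the field names and all threshold values are hypothetical placeholders that would need to be tuned for each protocol and dataset, and it does not reproduce SinQC or Scater.

```python
def passes_qc(cell, min_reads=50_000, min_genes=1_000,
              max_mito_frac=0.2, max_spike_frac=0.5):
    """Return True if a cell passes all four illustrative QC criteria."""
    total = cell["total_reads"]
    if total == 0 or total < min_reads:
        return False  # too shallow to detect lowly expressed genes
    if cell["genes_detected"] < min_genes:
        return False  # possibly damaged, dead, or degraded
    if cell["mito_reads"] / total > max_mito_frac:
        return False  # cytoplasmic RNA lost, mitochondrial RNA retained
    if cell["spike_reads"] / total > max_spike_frac:
        return False  # little endogenous RNA relative to spike-ins
    return True


# Toy per-cell summaries (hypothetical numbers).
cells = {
    "c1": {"total_reads": 200_000, "mito_reads": 10_000, "spike_reads": 20_000, "genes_detected": 4_500},
    "c2": {"total_reads": 8_000, "mito_reads": 500, "spike_reads": 600, "genes_detected": 1_200},
    "c3": {"total_reads": 150_000, "mito_reads": 60_000, "spike_reads": 10_000, "genes_detected": 3_000},
    "c4": {"total_reads": 180_000, "mito_reads": 9_000, "spike_reads": 120_000, "genes_detected": 2_500},
    "c5": {"total_reads": 120_000, "mito_reads": 6_000, "spike_reads": 10_000, "genes_detected": 300},
}
kept = [name for name, cell in cells.items() if passes_qc(cell)]
```

Fixed cutoffs like these are the simplest choice; data-driven thresholds based on the distribution of each metric across cells are a common alternative.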

## BATCH EFFECT CORRECTION

Batch effect is a common source of technical variation in high-throughput sequencing experiments. The innovation and decreasing cost of scRNA-seq enable many studies to profile the transcriptomes of a very large number of cells. Large-scale scRNA-seq data sets might be generated separately by distinct operators at different times, and could also be produced in multiple laboratories using disparate cell dissociation protocols, library preparation approaches, and/or sequencing platforms. These factors introduce systematic error and confound technical and biological variability, so that the gene expression profile in one batch systematically differs from that in another (Leek et al., 2010; Hicks et al., 2018). Therefore, batch effect is a major challenge in scRNA-seq data analysis, which may mask the underlying biology and cause spurious results. To avoid incorrect data integration and interpretation, batch effects must be corrected before the downstream analysis. Because of the differences in data features between scRNA-seq and bulk RNA-seq, batch-correction approaches specially proposed for bulk RNA-seq [e.g., RUVseq (Risso et al., 2014) and svaseq (Leek, 2014)] may not be suitable for scRNA-seq. Several methods have been recently designed to address batch effects in scRNA-seq data, such as MNN (mutual nearest neighbor) (Haghverdi et al., 2018) and kBET (k-nearest neighbor batch effect test) (Buttner et al., 2019). MNN corrects the batch effects using the data from the most similar cells in different batches. kBET is a χ²-based method for quantifying batch effects in scRNA-seq data. These specific approaches for scRNA-seq data can perform better than the methods developed for bulk RNA-seq (Haghverdi et al., 2018; Buttner et al., 2019).
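The idea of using "the most similar cells in different batches" can be sketched as follows. This is a simplified illustration rather than the published MNN implementation: it assumes Euclidean distance on already normalized expression vectors and subtracts a single global shift, whereas the published method is more elaborate. All function names and the toy coordinates are hypothetical.

```python
import math


def nearest(query, reference, k):
    """Indices of the k cells in `reference` closest to `query`."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])


def mutual_nearest_pairs(batch1, batch2, k=1):
    """Pairs (i, j) where cell i and cell j are in each other's k nearest neighbors."""
    nn_1_to_2 = [nearest(cell, batch2, k) for cell in batch1]
    nn_2_to_1 = [nearest(cell, batch1, k) for cell in batch2]
    return [(i, j)
            for i in range(len(batch1))
            for j in sorted(nn_1_to_2[i])
            if i in nn_2_to_1[j]]


def mean_batch_vector(batch1, batch2, pairs):
    """Average difference (batch2 minus batch1) over the mutual pairs."""
    dims = len(batch1[0])
    return [sum(batch2[j][d] - batch1[i][d] for i, j in pairs) / len(pairs)
            for d in range(dims)]


# Toy data: batch2 holds the same two cell states as batch1, shifted by
# (+1, +1), plus one cell state that is absent from batch1.
batch1 = [[0.0, 0.0], [5.0, 5.0]]
batch2 = [[1.0, 1.0], [6.0, 6.0], [20.0, 20.0]]
pairs = mutual_nearest_pairs(batch1, batch2, k=1)
shift = mean_batch_vector(batch1, batch2, pairs)
corrected_batch2 = [[v - s for v, s in zip(cell, shift)] for cell in batch2]
```

Requiring the neighbor relation to be mutual keeps the batch-specific cell state out of the pair list, so it does not distort the estimated shift.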

## NORMALIZATION OF SCRNA-SEQ DATA

To correctly interpret the results from scRNA-seq data, normalization is an essential step that recovers the signal of interest by adjusting for unwanted biases resulting from capture efficiency, sequencing depth, dropouts, and other technical effects. Technical noise is a pronounced problem in scRNA-seq because of the low amount of starting material and the challenging experimental protocols. Normalization of scRNA-seq data benefits the downstream analyses, including cell subpopulation identification and differential expression calling. In general, normalization can be divided into two types: within-sample normalization and between-sample normalization (Vallejos et al., 2017). Within-sample normalization aims to remove gene-specific biases (e.g., GC content and gene length), which makes gene expression comparable within one sample (such as RPKM/FPKM and TPM). In contrast, between-sample normalization adjusts for sample-specific differences (e.g., sequencing depth and capture efficiency) to enable the comparison of gene expression between samples. Generally, simple normalization strategies are based on sequencing depth or the upper quartile. If spike-ins or UMIs are used in the scRNA-seq protocol, normalization can be refined based on the performance of the spike-ins/UMIs (Bacher and Kendziorski, 2016).

<sup>1</sup>https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

A number of approaches have been developed for between-sample normalization of bulk RNA-seq data, such as DESeq2 (Love et al., 2014) and trimmed mean of M values (TMM) (Robinson and Oshlack, 2010). DESeq2 calculates a scaling factor based on the read counts across different samples, while TMM removes the extreme log fold changes (Vallejos et al., 2017). However, bulk-based normalization approaches may not be suitable for single-cell transcriptomic data. Because scRNA-seq generates abundant zero-expression values and has a higher level of technical variation than bulk RNA-seq, using bulk RNA-seq normalization approaches may cause overcorrection in scRNA-seq for lowly expressed genes (Vallejos et al., 2017). Several normalization methods have been proposed for scRNA-seq data, such as SCnorm (Bacher et al., 2017), SAMstrt (Katayama et al., 2013), and a recently introduced deconvolution approach that uses the summed expression values across pools of cells to conduct normalization (Lun et al., 2016). SCnorm is based on quantile regression, while SAMstrt relies on spike-ins. Bacher et al. (2017) argued that traditional normalization methods developed for bulk RNA-seq may introduce artifacts when normalizing scRNA-seq data, while SCnorm can effectively normalize scRNA-seq data and improve principal component analysis (PCA) and the identification of differentially expressed genes.
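To see why abundant zeros are a problem for bulk-style scaling factors, consider a minimal sketch of median-of-ratios size factors, the general idea behind DESeq-type scaling. The function name and toy matrix are hypothetical and this is not the DESeq2 code itself. Genes with a zero in any sample have no usable geometric mean and are skipped, so in sparse single-cell data very few genes may remain to inform the estimate.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """counts[g][c] is the raw count of gene g in sample (or cell) c.

    Returns (size_factors, number_of_genes_used).
    """
    n_samples = len(counts[0])
    # The log geometric mean is undefined for genes containing a zero.
    usable = [row for row in counts if all(v > 0 for v in row)]
    if not usable:
        raise ValueError("no gene is detected in every sample")
    log_geo_means = [sum(math.log(v) for v in row) / n_samples for row in usable]
    factors = []
    for c in range(n_samples):
        log_ratios = [math.log(row[c]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors, len(usable)


# Toy matrix: samples differ in depth by 1 : 2 : 4; two genes contain zeros.
counts = [
    [10, 20, 40],
    [5, 10, 20],
    [0, 3, 1],
    [100, 200, 400],
    [2, 0, 0],
]
size_factors, n_used = median_of_ratios_size_factors(counts)
normalized = [[v / f for v, f in zip(row, size_factors)] for row in counts]
```

Here only three of the five genes contribute to the size factors; with thousands of cells, the set of genes detected in every cell can shrink to almost nothing, which is one motivation for the pooling and quantile-regression strategies cited above.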

## IMPUTATION OF SCRNA-SEQ DATA

Single-cell RNA sequencing data generally contain many missing values or dropouts caused by failed amplification of the original RNAs. The frequency of dropout events in scRNA-seq is protocol-dependent and is closely associated with the number of sequencing reads generated for each cell (Svensson et al., 2017). Dropout events increase the cell-to-cell variability, weaken the signal of every gene, and obscure the detection of gene-gene relationships. Therefore, dropouts can strongly affect the downstream analyses, since a significant portion of truly expressed transcripts may not be detected in scRNA-seq. Imputation is a useful strategy that replaces the missing data (dropouts) with substituted values. Although some methods have been proposed for imputation of bulk RNA-seq data, they are not directly applicable to scRNA-seq data (Zhang and Zhang, 2018). Several imputation methods have been recently developed for scRNA-seq, including SAVER (Huang et al., 2018), MAGIC (van Dijk et al., 2018), ScImpute (Li and Li, 2018), DrImpute (Gong et al., 2018), and AutoImpute (Talwar et al., 2018). SAVER is a Bayesian-based model designed for UMI-based scRNA-seq data to recover the true expression level of all genes. MAGIC imputes gene expression by building a Markov affinity-based graph. The developers of ScImpute suggested that SAVER and MAGIC may change the expression of genes unaffected by dropouts, while ScImpute can impute the dropout values without introducing new biases by using information from the same genes, unlikely to be affected by dropouts, in other similar cells. DrImpute is a clustering-based approach that can effectively separate dropout zeros from true zeros. AutoImpute is an autoencoder-based method that learns the inherent distribution of scRNA-seq data to impute the missing values. Recently, Zhang and Zhang evaluated different imputation methods and found that the performance of these approaches is correlated with their model hypotheses and scalability (Zhang and Zhang, 2018).
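Most of the methods above share the idea of borrowing information from similar cells. The sketch below shows that idea in its crudest form: each zero is replaced by the average of the same gene in the k most similar cells. It is a hypothetical illustration, not an implementation of SAVER, MAGIC, ScImpute, DrImpute, or AutoImpute, and it deliberately treats every zero as a dropout in order to expose the main risk.

```python
import math


def impute_from_neighbors(cells, k=2):
    """Replace zeros in each cell by the mean of its k nearest other cells.

    `cells` is a list of expression vectors (one list of floats per cell).
    Non-zero entries are left unchanged.
    """
    imputed = []
    for i, cell in enumerate(cells):
        others = sorted((j for j in range(len(cells)) if j != i),
                        key=lambda j: math.dist(cell, cells[j]))
        neighbors = others[:k]
        new_cell = list(cell)
        if neighbors:
            for g, value in enumerate(cell):
                if value == 0:
                    new_cell[g] = sum(cells[j][g] for j in neighbors) / len(neighbors)
        imputed.append(new_cell)
    return imputed


# Toy data: cells 0-2 belong to one cell type (cell 0 has a dropout in gene 1);
# cell 3 is a different type in which genes 0 and 2 are truly not expressed.
cells = [
    [5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0],
    [6.0, 4.0, 2.0],
    [0.0, 9.0, 0.0],
]
imputed = impute_from_neighbors(cells, k=2)
```

In this toy example the dropout in cell 0 is recovered, but the true zeros of cell 3 are also overwritten with values borrowed from an unrelated cell type, which is the kind of artifact that motivates methods designed to separate dropout zeros from true zeros.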

## DIMENSIONALITY REDUCTION AND FEATURE SELECTION

Single-cell RNA sequencing data are high-dimensional and may involve thousands of genes and a large number of cells. Dimensionality reduction and feature selection are two main strategies for dealing with high-dimensional data (Andrews and Hemberg, 2018a). Dimensionality reduction methods generally project the data into a lower-dimensional space while optimally preserving some key properties of the original data. PCA is a linear dimensionality reduction algorithm, which assumes that the data are approximately normally distributed. T-distributed stochastic neighbor embedding (t-SNE) is a non-linear approach mainly designed for visualizing high-dimensional data (van der Maaten and Hinton, 2008). Both PCA and t-SNE have been broadly used in diverse scRNA-seq studies to reduce the data dimension and visualize the cells separated into distinct subpopulations (Chen et al., 2016a; Rosenberg et al., 2018). It is worth noting that PCA cannot effectively represent the complex structure of scRNA-seq data, and that t-SNE is slow to compute and can produce different embeddings when the same dataset is processed multiple times. Recently, UMAP (uniform manifold approximation and projection) (Becht et al., 2018) and scvis (Ding et al., 2018) were specially developed for reducing the dimensions of scRNA-seq data. Becht et al. showed that UMAP provides faster run times, higher reproducibility, and a more meaningful organization of cell clusters than other dimensionality reduction approaches (Becht et al., 2018).
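To make the linear projection performed by PCA concrete, the sketch below finds the first principal component of a small cells-by-genes matrix by power iteration on the covariance matrix. It is a dependency-free illustration under simplifying assumptions (a fixed starting vector that is not orthogonal to the leading component, and a fixed number of iterations); in practice a numerical library would be used, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (loading_vector, scores) for the first principal component.

    `rows` is a list of cells, each a list of expression values per gene.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between genes.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # assumed not orthogonal to the leading axis
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered cell onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data: gene 2 is always twice gene 1, gene 3 is constant.
rows = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 5.0],
    [3.0, 6.0, 5.0],
    [4.0, 8.0, 5.0],
]
loading, scores = first_principal_component(rows)
```

The constant gene receives a loading of zero, and the four cells are ordered along a single axis that captures all of the variance in this toy example.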

Feature selection removes uninformative genes and identifies the most relevant features to reduce the number of dimensions used in downstream analysis. Reducing the number of genes through feature selection can greatly speed up calculations on large-scale scRNA-seq data (Andrews and Hemberg, 2018b). Differential expression is a widely used method for feature selection in bulk RNA-seq experiments, but it is hard to apply to scRNA-seq data because the information on predetermined and/or homogeneous subpopulations needed for differential expression calling of scRNA-seq data [e.g., SCDE (Kharchenko et al., 2014)] is often unavailable. Unsupervised feature selection algorithms specially designed for scRNA-seq data can be divided into the following groups: (i) highly variable genes (HVG) based; (ii) spike-in based; and (iii) dropout-based (Andrews and Hemberg, 2018a). HVG methods rely on the assumption that highly variable expression of a gene across cells results from biological effects rather than technical noise. The HVG approaches include the algorithm proposed by Brennecke et al. (2013) and FindVariableGenes (FVG) implemented in Seurat (Satija et al., 2015). Spike-in based approaches identify the genes showing significantly higher variance than spike-ins with similar expression levels [e.g., scLVM (Buettner et al., 2015) and BASiCS (Vallejos et al., 2015)], an idea similar to that of HVG. Dropout-based methods take advantage of the dropout distribution of scRNA-seq data to perform feature selection, like M3Drop (Andrews and Hemberg, 2018b). Andrews and Hemberg showed that their M3Drop tool outperforms existing variance-based feature selection approaches.
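The HVG idea can be illustrated by ranking genes on a simple dispersion statistic (variance divided by mean). This is a minimal sketch under that assumption; the published HVG methods instead compare each gene with a fitted mean-variance trend, and the function and gene names below are hypothetical.

```python
from statistics import mean, variance


def top_variable_genes(expr, n_top):
    """Rank genes by dispersion (sample variance / mean) and keep the top n.

    `expr` maps gene name -> list of normalized expression values per cell.
    Genes with zero mean are ignored.
    """
    dispersion = {}
    for gene, values in expr.items():
        m = mean(values)
        if m > 0:
            dispersion[gene] = variance(values) / m
    return sorted(dispersion, key=dispersion.get, reverse=True)[:n_top]


# Toy data: one stable gene, two genes that vary between cells, one silent gene.
expr = {
    "Housekeeping": [10, 11, 9, 10],
    "MarkerA": [0, 20, 0, 20],
    "MarkerB": [15, 0, 14, 1],
    "Silent": [0, 0, 0, 0],
}
selected = top_variable_genes(expr, n_top=2)
```

A raw dispersion ranking cannot tell biological variability from technical noise, which is why spike-in based and dropout-based alternatives were developed.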

## CELL SUBPOPULATION IDENTIFICATION

A key goal of scRNA-seq data analysis is to identify cell subpopulations (different populations are often distinct cell types) within a certain condition or tissue to unravel the heterogeneity of cells. Notably, cell subpopulation identification should be carried out after QC and normalization of scRNA-seq data; otherwise, artifacts could be introduced. Approaches for clustering cells can be mainly grouped into two categories based on whether prior information is used. If a set of known markers is used in clustering, the methods are prior-information based. Alternatively, unsupervised clustering methods can be used for de novo identification of cell populations with scRNA-seq data. The algorithms for unsupervised clustering can be primarily divided into the following types: (i) k-means; (ii) hierarchical clustering; (iii) density-based clustering; and (iv) graph-based clustering (Andrews and Hemberg, 2018a). K-means is a fast approach that assigns cells to the nearest cluster center, and it requires a predetermined number of clusters. Hierarchical clustering can determine the relationships between clusters, but it is generally slower than k-means. Density-based clustering methods need a large number of samples to accurately calculate densities and usually assume that all clusters have equal density. Graph-based clustering can be considered an extension of density-based clustering, and it can be applied to millions of cells. Some clustering methods have been specially designed for scRNA-seq data, such as single-cell consensus clustering (SC3) (Kiselev et al., 2017) and the clustering approach implemented in Seurat (Satija et al., 2015), which can facilitate the identification of cell subpopulations (**Table 3**). SC3 is an unsupervised approach that combines multiple clustering approaches and achieves high accuracy and robustness in single-cell clustering. Seurat identifies the cell clusters mainly based on a shared nearest neighbor (SNN) clustering algorithm. Once the subpopulations are determined, the markers that best discriminate distinct subpopulations are usually identified through differential expression calling or analysis of variance (ANOVA).
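The k-means procedure described above, which assigns each cell to the nearest center and then recomputes the centers, can be sketched in a few lines. The example assumes cells are already represented in a low-dimensional space and that initial centers are supplied by the caller (which also fixes the number of clusters); the function name and coordinates are hypothetical.

```python
import math


def kmeans(points, centers, n_iter=20):
    """Plain k-means (Lloyd's algorithm) with user-supplied initial centers."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(centers[c]))  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


# Toy data: two well-separated groups of cells in a 2D reduced space.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

The result depends on the initial centers and on the chosen number of clusters, which is one reason consensus approaches such as SC3 combine several clustering runs.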

#### DIFFERENTIAL EXPRESSION ANALYSIS OF SCRNA-SEQ DATA

TABLE 3 | Subpopulation identification methods for scRNA-seq data.

Differential expression analysis is very useful for finding the significantly differentially expressed genes (DEGs) between distinct subpopulations or groups of cells. The DEGs are crucial for interpreting the biological difference between two compared conditions. The technical variability, high noise (e.g., dropouts), and massive sample size of scRNA-seq data raise challenges in differential expression calling (McDavid et al., 2013). Moreover, multiple possible cell states can exist within a population of cells, leading to multimodality of gene expression in cells (Vallejos et al., 2016). The tools originally developed for bulk RNA-seq data have been used in many single-cell studies to identify DEGs, but the applicability of these methods to scRNA-seq data is still unclear. In recent years, some specific methods have been proposed for conducting differential expression calling based on scRNA-seq data, such as MAST (Finak et al., 2015), SCDE (Kharchenko et al., 2014), DEsingle (Miao et al., 2018), Census (Qiu et al., 2017), and BCseq (Chen and Zheng, 2018) (**Table 4**). MAST is based on linear model fitting and likelihood ratio testing. SCDE is a Bayesian approach using a low-magnitude Poisson process to account for dropouts. DEsingle employs a zero-inflated negative binomial model to distinguish dropouts from real zeros. BCseq mitigates the technical noise in a data-adaptive manner. Soneson and Robinson recently assessed 36 differential expression methods (including tools designed for scRNA-seq and for bulk RNA-seq data) and revealed significant differences among these approaches in the characteristics and number of DEGs (Soneson and Robinson, 2018). An increasing number of tools for differential expression analysis of scRNA-seq data will be developed, and users are encouraged to choose the tools specially designed for scRNA-seq to identify DEGs, in consideration of the complex features of scRNA-seq data.

TABLE 4 | Differential expression analysis tools for RNA-seq data.

#### CELL LINEAGE AND PSEUDOTIME RECONSTRUCTION

The cells in many biological systems exhibit a continuous spectrum of states and undergo transitions between different cellular states. Such dynamic processes within a population of cells can be computationally modeled by reconstructing the cell trajectory and pseudotime based on scRNA-seq data. Pseudotime is an ordering of cells along the trajectory of a continuous developmental process in a system, which allows the identification of the cell types at the beginning, intermediate, and end states of the trajectory (Griffiths et al., 2018). Besides revealing the gene expression dynamics across cells, single-cell trajectory inference can also help identify the factors triggering state transitions. A number of tools have been proposed for trajectory inference, e.g., Monocle (Trapnell et al., 2014), Waterfall (Shin et al., 2015), Wishbone (Setty et al., 2016), TSCAN (Ji and Ji, 2016), Monocle2 (Qiu et al., 2017), Slingshot (Street et al., 2018), and CellRouter (Lummertz da Rocha et al., 2018) (**Table 5**). The resulting trajectory topology can be linear, bifurcating, or a tree/graph structure. Monocle builds a minimum spanning tree (MST) for cells to search for the longest backbone based on independent component analysis (ICA). Monocle2 uses a distinct approach that incorporates unsupervised data-driven methods with reversed graph embedding (RGE), which is more robust and much faster than Monocle. Slingshot is a cluster-based approach for identifying multiple trajectories with varying levels of supervision. CellRouter utilizes flow networks to identify cell-state transition trajectories. Recently, Saelens et al. (2018) evaluated a number of single-cell trajectory inference approaches (not including CellRouter) and found that Slingshot, TSCAN, and Monocle2 outperform other methods.
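The MST-based view of a trajectory can be illustrated with a small sketch: cells are connected by a minimum spanning tree, and each cell's pseudotime is taken as its distance along the tree from a chosen root cell. This is a simplification assumed for illustration only; it is not the Monocle algorithm, which also involves ICA and a search for the longest backbone. The function names, the choice of root, and the coordinates are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances; returns an adjacency list."""
    n = len(points)
    adjacency = {i: [] for i in range(n)}
    in_tree = {0}
    while len(in_tree) < n:
        # Cheapest edge that connects the tree to a cell outside it.
        weight, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        adjacency[i].append((j, weight))
        adjacency[j].append((i, weight))
        in_tree.add(j)
    return adjacency


def pseudotime(points, root):
    """Distance of every cell from `root`, measured along the MST."""
    adjacency = minimum_spanning_tree(points)
    dist = {root: 0.0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v, weight in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + weight
                stack.append(v)
    return [dist[i] for i in range(len(points))]


# Toy data: four cells along one developmental axis, listed out of order.
points = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
times = pseudotime(points, root=0)
ordering = sorted(range(len(points)), key=lambda i: times[i])
```

The root must be chosen from prior knowledge (for example, a known progenitor cell), since the tree itself has no direction.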

### ALTERNATIVE SPLICING AND RNA EDITING ANALYSIS OF SCRNA-SEQ DATA

TABLE 5 | Methods for single-cell trajectory inference.

Most published single-cell studies have mainly explored the transcriptome variation between individual cells at the gene level. In eukaryotic genomes, alternative splicing (AS) allows multi-exon genes to generate different isoforms, which can greatly increase the diversity of both protein-coding and noncoding RNAs. Five basic modes are generally recognized for AS: exon skipping (cassette exon), mutually exclusive exons, alternative donor site, alternative acceptor site, and intron retention. Many studies have shown that AS is very common in mammals and that over 90% of human genes could undergo AS based on bulk RNA-seq data (Wang et al., 2008; Chen et al., 2017a). Moreover, AS plays crucial roles in a variety of biological processes, and abnormal AS may be correlated with cancers (Sveen et al., 2016). The findings revealed by bulk RNA-seq data can only reflect the averaged AS patterns of numerous cells at the population level. Due to the high noise (e.g., dropouts and uneven transcript coverage) and low sequencing coverage of scRNA-seq data, the splicing quantification methods initially developed for bulk RNA-seq data are not suitable for scRNA-seq data. Since expression dynamics is a key aspect of cell populations, it is promising to study AS at single-cell resolution to gain insights into cell-level isoform usage. To date, only a few AS detection approaches have been devised for scRNA-seq data, such as SingleSplice (Welch et al., 2016), Census (Qiu et al., 2017), BRIE (Huang and Sanguinetti, 2017), and Expedition (Song et al., 2017) (**Table 6**). SingleSplice uses a statistical model to detect the genes with significant isoform usage without estimating the expression levels of full-length transcripts. Census models the isoform counts of each gene with a linear model as a Dirichlet-multinomial distribution. BRIE is a Bayesian hierarchical model for differential isoform quantification. Expedition contains a suite of algorithms for identifying AS, assigning splicing modalities, and visualizing modality changes. The AS detection approaches specially designed for scRNA-seq data are just emerging; thus, the innovation and improvement of such methods will greatly facilitate AS exploration at the single-cell level.

TABLE 6 | Alternative splicing detection tools for scRNA-seq data.

On the other hand, RNA editing is an important post-transcriptional processing event that leads to sequence changes on RNA molecules (Gott and Emeson, 2000). Like AS, RNA editing is mainly studied using bulk RNA-seq technologies and is rarely explored at the single-cell level. Currently, the limitations of scRNA-seq largely prevent the application of RNA editing detection to individual cells. With the development of both scRNA-seq technologies and single-cell editing detection algorithms, exploration of RNA editing dynamics among single cells will become feasible. Notably, both AS and RNA editing analyses are mainly suitable for the data generated by scRNA-seq protocols that can sequence full-length transcripts, such as Smart-seq2 and MATQ-seq, rather than 3′-end scRNA-seq approaches.

## ALLELIC EXPRESSION EXPLORATION WITH SCRNA-SEQ DATA

Diploid species contain two sets of chromosomes, one inherited from each parent. Allelic expression analysis can reveal whether genes are equally expressed between the paternal and maternal genomes. For autosomes, the paternal and maternal alleles are generally expressed equally, and aberrant expression of the paternal or maternal genome may cause certain diseases (McKean et al., 2016). Up to now, few methods have been developed to detect the genome-wide allelic expression profile of genes based on scRNA-seq data. One main caution in allelic expression calling is that the high dropout rate of scRNA-seq data may introduce many false positives. Deng et al. (2014) used a series of stringent criteria to filter the potentially false allelic calls resulting from the technical variability of scRNA-seq when studying the allelic expression profile of mouse preimplantation embryos. The robustness of this strategy was further demonstrated in analyzing the dynamics of X chromosome inactivation along developmental progression using mouse embryonic stem cells (Chen et al., 2016a). SCALE was recently proposed to classify gene expression into silent, monoallelic, and biallelic states by adopting an empirical Bayes approach (Jiang et al., 2017). We believe that allelic expression analysis at the single-cell level can greatly facilitate the understanding of the underlying mechanisms of dosage compensation and related diseases. It is worth noting that allelic expression investigation at the single-cell level also requires whole-transcript scRNA-seq and is mainly applicable to organisms for which paternal and maternal single nucleotide polymorphism (SNP) information is available.
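A naive version of allelic classification can be written directly from allele-specific read counts. The sketch below assumes reads have already been assigned to the maternal or paternal allele using known SNPs; the cutoffs are hypothetical placeholders, and this threshold rule is not the empirical Bayes model used by SCALE. Genes with too few allele-informative reads are left undetermined, because dropout of one allele can otherwise mimic monoallelic expression.

```python
def classify_allelic_state(maternal_reads, paternal_reads,
                           min_reads=10, mono_cutoff=0.9):
    """Classify one gene in one cell from allele-specific read counts."""
    total = maternal_reads + paternal_reads
    if total < min_reads:
        return "undetermined"  # too few informative reads to make a call
    if maternal_reads / total >= mono_cutoff:
        return "monoallelic (maternal)"
    if paternal_reads / total >= mono_cutoff:
        return "monoallelic (paternal)"
    return "biallelic"


# Toy allele-specific counts (maternal, paternal) for four genes in one cell.
examples = {
    "geneA": (50, 0),
    "geneB": (0, 40),
    "geneC": (20, 25),
    "geneD": (2, 1),
}
calls = {gene: classify_allelic_state(m, p) for gene, (m, p) in examples.items()}
```

Raising `min_reads` trades sensitivity for fewer false monoallelic calls, which mirrors the stringent filtering strategy described above.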

### GENE REGULATORY NETWORK RECONSTRUCTION

Gene regulatory network inference has been widely conducted in numerous bulk RNA-seq studies, and scRNA-seq also provides great potential for such analysis. For bulk RNA-seq data, networks are usually constructed from a number of samples using tools like weighted gene co-expression network analysis (WGCNA) (Langfelder and Horvath, 2008; Chen et al., 2017a). A basic assumption is that genes highly correlated in expression could be co-regulated. Because such an analysis is unable to determine the direction of regulation, the resulting networks are typically undirected. Theoretically, the cells of scRNA-seq can be treated as the samples of bulk RNA-seq, so similar approaches are applicable to scRNA-seq data for constructing gene regulatory networks.
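The co-expression assumption can be made concrete with a small sketch that links two genes whenever the absolute Pearson correlation of their expression across cells exceeds a cutoff. The hard threshold, the function names, and the toy profiles are assumptions for illustration; WGCNA itself uses a weighted (soft-thresholded) network rather than a hard cutoff.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, threshold=0.9):
    """Undirected edges between genes whose |correlation| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Toy expression profiles across five cells (hypothetical gene names).
expr = {
    "TF1": [1, 2, 3, 4, 5],
    "TargetA": [2, 4, 6, 8, 10],
    "TargetB": [5, 4, 3, 2, 1],
    "Other": [3, 1, 4, 1, 5],
}
edges = coexpression_edges(expr)
```

Each edge is an unordered pair: the correlation between `TF1` and `TargetA` gives no indication of which gene regulates the other, which is why such networks are undirected.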

Network inference from scRNA-seq data may reveal meaningful gene correlations and provide biologically important insights that could not be uncovered by population-level bulk RNA-seq data. However, because of the technical noise of scRNA-seq and the presence of different subpopulations or states of cells, network reconstruction requires care. To reduce spurious results, network inference should be carried out on each subpopulation or on cells at the same stage. Recently, Aibar et al. (2017) developed the SCENIC method to reconstruct gene regulatory networks from scRNA-seq data, and they showed that SCENIC can robustly predict the interactions between transcription factors and target genes. PIDC is another software tool designed to infer gene regulatory networks from single-cell data using multivariate information theory (Chan et al., 2017). Such network inference tools facilitate the identification of expression regulatory networks from single-cell transcriptomic data and provide critical biological insights into the regulatory relationships between genes.

### CONCLUSION

In the past 10 years, great advances have been achieved in scRNA-seq, and a variety of scRNA-seq protocols have been developed. The development and innovation of scRNA-seq have greatly facilitated single-cell transcriptomic studies, leading to insightful findings in cell expression variability and dynamics. Moreover, the throughput of scRNA-seq has significantly increased with the exciting progress in cellular barcoding and microfluidics. Meanwhile, scRNA-seq methods that can be used for fixed and frozen samples have also been proposed recently, which will greatly benefit the study of highly heterogeneous clinical samples. However, currently available scRNA-seq approaches still have a high dropout problem, in which weakly expressed genes are missed. Improvements in RNA capture efficiency and transcript coverage should reduce the technical noise of scRNA-seq. Moreover, since most current scRNA-seq methods mainly capture polyA+ RNAs, the development of protocols that can capture both polyA+ and polyA− RNAs (such as MATQ-seq) will enable comprehensive investigation of both protein-coding and noncoding gene expression dynamics at single-cell resolution.

Since the noise of scRNA-seq data is high, it is crucial to use appropriate methods to overcome this problem when analyzing scRNA-seq data. QC is necessary to exclude low-quality cells and avoid introducing artifacts into data interpretation. Furthermore, batch effect correction (if needed), between-sample normalization, and imputation are also important and should be conducted before cell subpopulation identification, differential expression calling, and other downstream analyses. Additionally, factors such as cell size and cell cycle state could play important roles in cell variability for certain types of cells, and such biases also need to be considered. Although an increasing number of methods have been specially designed to interpret scRNA-seq data, novel methods that can effectively handle the technical noise and expression variability of cells are still required. Specifically, approaches that can accurately analyze AS and RNA editing with scRNA-seq data would be highly useful for unraveling post-transcriptional mechanisms in individual cells. Overall, bioinformatics analysis of scRNA-seq data is still challenging, special attention should be paid to data interpretation, and more efficient tools are urgently needed.

Collectively, scRNA-seq and its related computational methods largely promote the development of single-cell transcriptomics. The continuous innovation of scRNA-seq technologies and concomitant advances in bioinformatics approaches will greatly facilitate biological and clinical research, and provide deep insights into the gene expression heterogeneity and dynamics of cells.

#### AUTHOR CONTRIBUTIONS

GC and TS designed the study and wrote the manuscript. BN edited the manuscript and provided constructive comments.

#### FUNDING

This work was supported by the National Science Foundation of China (31771460, 91629103 and 31671377) and the National Key Research and Development Program of China (2016YFC0902100).






**Disclaimer**: The information in these materials is not a formal dissemination of the United States Food and Drug Administration.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Chen, Ning and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods

*Monika Krzak1\*, Yordan Raykov2, Alexis Boukouvalas3, Luisa Cutillo4 and Claudia Angelini1*

1 Institute for Applied Mathematics "Mauro Picone", Naples, Italy, 2 Department of Mathematics, Aston University, Birmingham, United Kingdom, 3 Machine Learning Engineer Team, Prowler.io, Cambridge, United Kingdom, 4 School of Mathematics, University of Leeds, Leeds, United Kingdom

Single-cell RNA-seq (scRNAseq) is a powerful tool to study heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and the impact that these choices can have in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and describe the ranges of possibilities that are offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in terms of dimensionality, number of cell populations or levels of noise. Remarkably, the results presented here show that great variability in the performance of the models is strongly attributed to the choice of the user-specific parameter settings. We describe several tendencies in the performance attributed to their modes of usage and different types of datasets, and identify which methods are strongly affected by data dimensionality in terms of computational time. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.

Keywords: single-cell RNA-seq, clustering methods, benchmark, parameter sensitivity analysis, high-dimensional data analysis

#### Edited by:

Filippo Geraci, Italian National Research Council, Italy

#### Reviewed by:

Giovanna Rosone, University of Pisa, Italy Antonio Federico, Tampere University, Finland

\*Correspondence: Monika Krzak monika.sonia.krzak@gmail.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 July 2019 Accepted: 13 November 2019 Published: 11 December 2019

#### Citation:

Krzak M, Raykov Y, Boukouvalas A, Cutillo L and Angelini C (2019) Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods. Front. Genet. 10:1253. doi: 10.3389/fgene.2019.01253

## INTRODUCTION

Single-cell RNA sequencing (scRNAseq) has emerged as an important technology that allows profiling gene expression at single-cell resolution, giving new insights into cellular development (Biase et al., 2014; Goolam et al., 2016), dynamics (Vuong et al., 2018; Farbehi et al., 2019), and cell composition (Darmanis et al., 2015; Zeisel et al., 2015; Segerstolpe et al., 2016). Although the scRNAseq analysis inherits many features from bulk RNA-seq approaches, the algorithms require constant adaptation due to the several types of challenges present in scRNAseq data (Kiselev et al., 2019). For example, current droplet-based technologies allow measuring hundreds of thousands of cells, which greatly exceeds the number of samples typically handled by bulk RNA-seq protocols. The low amount of measured RNA transcripts per cell and the stochastic nature of gene expression can also introduce missing information about gene profiles (dropouts). The scRNAseq-specific noise and the increasing number of scRNAseq protocols differing in accuracy and scalability (Svensson et al., 2017; Svensson et al., 2018) make systematic data analysis even more challenging.

Over the last few years, a number of computational algorithms have been proposed to analyze scRNAseq data, focusing on different aspects (Chen et al., 2019). In particular, a growing class of computational methods is being developed for identifying distinct cell populations (Andrews and Hemberg, 2018). These methods are based on various types of clustering techniques, which aim to divide cells into groups that share similar gene expression patterns. In this way, each group can be associated with a specific cell type or subtype on the basis of well-known markers, or novel cell subtypes can be identified. However, before the clustering algorithm is applied, such methods often require a series of mandatory or optional steps that include preprocessing, filtering or dimension reduction (Luecken and Theis, 2019). In several cases, the user can adapt such steps by choosing an appropriate set of parameters. Thus, methods turn out to be very heterogeneous in the way they model data and perform the individual steps. Differences arise at each stage of the analysis and are not yet fully understood. For example, some algorithms work with raw count data (Zurauskiene and Yau, 2016; Lin et al., 2017; Sun et al., 2018), others require normalized gene expression values (Macosko et al., 2015; Ji and Ji, 2016; Senabouth et al., 2019), and others can handle both formats (Yip et al., 2017; Qiu et al., 2017; Kiselev et al., 2017; Wang et al., 2017). Some of the tools incorporate an additional method-specific preprocessing step, in terms of filtering or normalization (Senabouth et al., 2019; Yip et al., 2017), to remove noise present in the data, whereas others require such a step to be done externally before the method is executed (Julia et al., 2015). In addition to preprocessing, many methods utilize dimension reduction techniques, such as Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (tSNE), in order to reduce the high-dimensional space (expression of tens of thousands of genes) prior to clustering (Julia et al., 2015; Herman and Grün, 2018; Ren et al., 2019).

Another great difference is given by the specific clustering techniques implemented in each method. Some of the methods use partitioning algorithms (Kiselev et al., 2017; Wang et al., 2017) in order to infer distinct cell populations, others are based on hierarchical clustering (Senabouth et al., 2019; Lin et al., 2017), graph theory (Macosko et al., 2015) or density-based approaches (Ester et al., 1996). There is also a growing class of model-based algorithms (Fraley and Raftery, 2002; Ji and Ji, 2016; Sun et al., 2018) which utilize probabilistic properties of a given model to account for distinct challenges present in the data. Moreover, some methods require the number of cell populations to be known in advance (Zurauskiene and Yau, 2016; Sun et al., 2018), while others estimate the optimal value with an external procedure or as part of the clustering inference (Macosko et al., 2015; Senabouth et al., 2019; Ren et al., 2019). The available methods also vary in terms of the programming language in which they have been implemented (e.g., R, Matlab, Python), computational cost and other system requirements.

All of the mentioned variations across clustering pipelines affect the performance of the methods. Currently, there is a limited number of studies that assess clustering performance and robustness under various data-driven scenarios (Freytag et al., 2018; Duò et al., 2018; Tian et al., 2019). Existing studies mainly investigate the performance of the methods under a single selected parameter setting. Such a limitation leads to a narrow view of the performance of the methods, making it difficult to explore their full potential and identify the open challenges. For example, some algorithms provide multiple possibilities in the choice of parameters (Julia et al., 2015; Qiu et al., 2017; Herman and Grün, 2018; Ren et al., 2019) that allow the user to adapt or modify the main method in each step. At the same time, the selection of parameter settings can be crucial in various data-driven conditions. The performance of the algorithms can also depend on the presence or absence of any preprocessing steps, either external or method-specific, carried out prior to clustering. Since both parameter settings and data preprocessing can greatly affect the clustering result, we decided to investigate the effect of both aspects on the performance of the methods by carrying out a comprehensive benchmark of the existing clustering methods and performing a parameter sensitivity analysis.

For that purpose, we first described the different modes of usage and parameter settings of 13 of the most widely used scRNAseq clustering methods implemented in R, then we applied them to a large set of real and simulated scRNAseq datasets. In order to fully understand the potential of each method, we tested them while varying a wide range of available parameter settings, which greatly expands the number of possible results. Through the analysis pipeline, we evaluated the performance of the methods with respect to several factors. First, we divided the real datasets into two groups: those expressed as raw counts and those expressed as normalized fragments per kilobase of transcript per million mapped reads (FPKM) or reads per kilobase of transcript per million mapped reads (RPKM) counts. On the first group, we evaluated the performance of the methods under three basic data preprocessing types (not preprocessed counts, filtered counts, filtered and normalized counts). On the second group, we evaluated the performance of the methods as a function of the number of dimensions supplied to dimension reduction techniques prior to clustering. Synthetic datasets were used to test the capacity of each method to handle varying dataset dimensions that can additionally differ in the number of simulated cell groups and the type of group balance. In the simulation, we also assessed the accuracy of the methods in recovering cell population structure in the presence of noise. The types of noise that we simulated were dropouts and overlapping cell populations, which are key features of scRNAseq datasets. In all cases, we evaluated the performance of the methods in terms of i) Adjusted Rand Index (ARI), ii) accuracy of methods in estimating the correct number of clusters, iii) running time.

Overall, this work aims to provide new insights into the advantages and drawbacks of several scRNAseq clustering methods, by describing the ranges of possibilities that are offered to users and the impact that these choices can have on the final results. We also tried to identify some open challenges for future research that still need to be faced when inferring cell populations.

## MATERIALS AND METHODS

### Real Datasets

In order to evaluate the performance of the clustering methods considered in this study, we used 17 real scRNAseq datasets that are popular in the literature and listed in **Table 1**. To prepare the gene expression matrix for clustering, we followed the main instructions for data import and processing from the online repository https://hemberg-lab.github.io/scRNA.seq.datasets/.

The selected scRNAseq datasets vary in terms of organisms, tissues under study and experimental protocols. As illustrated in **Table 1**, some datasets were profiled using 3′ or 5′ tag and droplet-based approaches (such as inDrop), others using full-length plate-based approaches, such as Smart-Seq protocols. Moreover, depending on the platform used, each study investigates a different number of cells and the data are subject to a different proportion of dropouts. Depending on the protocol, count matrices were of different types (see **Table 2**), including Raw unique molecular identifier (UMI) counts (3 datasets), Raw read counts (7 datasets) and FPKM/RPKM counts (7 datasets). The raw counts (either UMI or read counts) consist of datasets with gene expression quantified in terms of the number of mapped reads (counts) that have not been further processed, while FPKM or RPKM data are counts adjusted for library size and gene length. Note that two datasets in **Table 2**, Deng2014 and Tasic2016, were of both types (raw read counts and FPKM/RPKM counts). Overall, the datasets cover various ranges of experimental complexity in terms of the number of sequenced cells (from tens to several thousands) and the number of cell populations in the sample (from a minimum of 3 to a maximum of 18). The cell populations (hidden groups to detect) can represent distinct cell types or cells at various time points of differentiation. Within this study, we consider the cell population annotation (available from the corresponding dataset studies) as ground truth, although we are aware that there could be some errors in the annotations, since datasets could contain some rare cell subpopulations that were not identified at the time of the study, or some misclassified cells.

TABLE 1 | List of the real datasets used to perform the clustering evaluation.


Datasets (named by the author and date of publication) contain gene expression of cells from various organisms and tissues that have been processed by different experimental protocols. Protocols include 3' or 5' tag and droplet-based approaches (inDrop and STRT/C1 UMI), or full-length plate-based approaches, such as Smart-Seq, Smart-Seq2, SMARTer or Fluidigm C1. Tang protocol corresponds to the mRNA-Seq assay described in (Tang et al., 2009). For more information about protocols see (Svensson et al., 2018).

TABLE 2 | Brief description of the main features of each real dataset considered in this study.


Datasets can contain counts of 3 different types: Raw UMI counts, Raw read counts, and normalized FPKM/RPKM counts. Raw counts are non-normalized counts that differ in terms of the gene expression quantification method. FPKM/RPKM counts are counts normalized by library size and gene length. The number of reported cell populations is obtained from the annotation as described in the corresponding dataset publications.

### Simulated Datasets

We also evaluated the performance of the methods on synthetic datasets. The simulation study was performed using the Splatter package (Zappia et al., 2017). Splatter allows simulating single-cell RNA sequencing count data with a varying number of cells and cell groups, with different degrees of cluster separability and varying rates of dropouts. We designed three simulation setups that allowed us to investigate various aspects of the performance of the methods (see **Figure 1**). Each simulation setup was repeated 5 times with 5 different values of the seed.

In the first simulation setup (**Figure 1A**), we focused on assessing both the scalability (the capacity of each method to handle datasets with an increasing number of cells) and the effect of dataset complexity (the ability of each method to cope when the number of true groups increases or when the balance between group sizes is disrupted). For this purpose, we simulated counts using three different values for the number of cells: 500, 1000 and 5000; three values for the number of groups: 4, 8, 16; and two possibilities for the number of cells in each group: balanced and unbalanced group sizes. In every configuration, we set the number of genes to 1000. Therefore, the resulting 18 simulated datasets represent different levels of data complexity and size for the clustering task.
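The design of the first setup is a full factorial grid, which can be enumerated in a few lines. The sketch below is written in Python for illustration only (the study itself was carried out in R with Splatter); the dictionary keys and the function name `setup1_grid` are hypothetical, while the parameter values are those stated in the text.

```python
from itertools import product

# Values stated in the text for simulation setup 1.
N_CELLS = (500, 1000, 5000)
N_GROUPS = (4, 8, 16)
BALANCE = ("balanced", "unbalanced")
N_GENES = 1000  # fixed in every configuration


def setup1_grid():
    """Return one configuration dict per simulated dataset in setup 1."""
    return [
        {"n_cells": c, "n_groups": g, "balance": b, "n_genes": N_GENES}
        for c, g, b in product(N_CELLS, N_GROUPS, BALANCE)
    ]


grid = setup1_grid()
print(len(grid))  # 3 * 3 * 2 = 18 configurations
```

Each configuration would then be passed, together with one of the 5 seeds, to the simulator.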

In the second simulation setup (**Figure 1B**), we fixed the dataset dimensions (1000 cells, 1000 genes) as well as the cell groups (four groups balanced in size) and we investigated the performance of each method with respect to group separability, ranging from poorly to well-separated groups. In this setup, we set the probability of a gene being differentially expressed to 0.1, 0.5, and 0.9 to obtain 3 simulated datasets: a differential expression probability close to 1 gives highly separable cell groups that should be easier for any clustering algorithm to detect.

Finally, in the third simulation setup (**Figure 1C**), we investigated the performance of clustering methods in the presence of various rates of missing information. With the number of cells and genes the same as before (1000) and cell groups fixed to four, we varied the rate of zero counts by setting the midpoint parameter (dropout.mid) of the dropout logistic function to 0, 2, 4, and 6. In this way, we obtained 4 datasets with the percentage of dropouts varying from 20% to 90%.

FIGURE 1 | […] separability between the groups (from poorly to well separable). This feature was controlled by setting the de.prob parameter of the Splatter simulation function to three values: 0.1, 0.5 and 0.9. (C) Simulation of 4 datasets using Setup 3. In this simulation setup, we used one dataset to create 3 others by placing an increasing number of zeros (controlled by the dropout.mid parameter) on the count matrix. The three datasets that are identical across all simulation setups are highlighted in red. Each simulation setup was repeated with 5 different values of the seed.

In each of the 5 runs of the simulation, we kept the synthetic datasets highlighted in red in **Figures 1A–C** (i.e., those corresponding to 1000 cells, 1000 genes, 4 groups, size-balanced, de.prob = 0.5 and dropout.mid = 0) identical across all three setups for easier direct comparison.
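To see why raising the dropout midpoint in the third setup increases the fraction of zeros, it helps to evaluate a logistic dropout curve directly. The following Python sketch assumes a standard logistic function of the log mean expression; the function name `dropout_probability` and the shape value of -1 are assumptions made for illustration and are not stated in the text.

```python
import math


def dropout_probability(log_mean, mid, shape=-1.0):
    """Logistic probability that a count with this log mean is set to zero.

    `mid` is the midpoint of the curve (probability 0.5 when log_mean == mid).
    `shape` = -1.0 is an assumed value: a negative shape means that weakly
    expressed genes are dropped more often than strongly expressed ones.
    """
    return 1.0 / (1.0 + math.exp(-shape * (log_mean - mid)))


# Midpoints used in the third simulation setup.
for mid in (0, 2, 4, 6):
    probs = [dropout_probability(x, mid) for x in (-2.0, 0.0, 2.0, 4.0)]
    print(mid, [round(p, 3) for p in probs])
```

With a negative shape, shifting the midpoint to the right places more genes on the high-probability side of the curve, which is consistent with the growing percentage of dropouts reported for the four datasets.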

### Analysis Pipeline

In order to analyze real and simulated data, we used the procedure illustrated in **Figure 2**. First, all 17 real datasets (Raw UMI/Raw read counts and FPKM/RPKM counts) underwent the same quality control, which filtered out non-expressed genes and low-quality cells (see **Figures 2A, B**) to remove potential issues from further analysis.

For the raw datasets, we considered three types of basic preprocessing before applying the specific clustering methods (**Figure 2A**). After the basic preprocessing, the clustering methods were applied with specific combinations of the parameters. Note that only a subset of methods (and combination of parameters) can be considered for filtered and normalized counts.

The FPKM/RPKM counts underwent a different basic preprocessing step (see **Figure 2B**) and were then directly clustered. To investigate the influence of the choice in the number of retained dimensions on methods performance, we considered only those methods and those combinations of parameters that allowed us to set the number of reduced dimensions.

In contrast to real data, simulated counts were used directly for clustering (see **Figure 2C**), and all methods and combinations of parameters were considered in the evaluation.

FIGURE 2 | Clustering analysis pipeline. (A, B) Real data analysis is divided into three steps: quality control, basic preprocessing and clustering. (C) Clustering is applied directly to simulated datasets. Note that not all the parameter combinations have been applied to each dataset type. For filtered and normalized raw counts, we excluded parameter combinations that use an additional method-specific preprocessing. For FPKM/RPKM counts, we used only those methods that do not allow for additional preprocessing (none) and provide the option to set the number of reduced dimensions (TRUE).

More details about data quality control, basic preprocessing of Raw and FPKM/RPKM counts, methods and parameter settings are described in the next sections.

### Quality Control of Real Datasets

All real datasets underwent an identical quality control step using the scater package (McCarthy et al., 2017). First, we removed features with duplicated gene names and/or no expression in any cell, as they do not carry any useful information. Then, we performed quality control on the cells, excluding those whose total number of expressed genes and total sum of counts were more than 3 median absolute deviations below the median across cells [as suggested in the scater documentation (McCarthy and Lun, 2019)]. Cells with a low number of expressed genes and few counts are likely to be stressed or broken and thus should be removed from the analysis. The resulting dimensions of real datasets before and after quality control are given in **Supplementary Tables 1** and **2**.
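The cell-level criterion described above can be sketched as a small outlier rule. The Python code below is only an illustration and is not the scater implementation: the scaling constant 1.4826 (the usual consistency factor for the median absolute deviation), the decision to discard a cell flagged on either metric, and the use of untransformed values are all assumptions, and the function names are hypothetical.

```python
from statistics import median

MAD_SCALE = 1.4826  # assumed consistency constant for the MAD


def low_outliers(values, nmads=3.0):
    """Flag values lying more than `nmads` scaled MADs below the median."""
    med = median(values)
    mad = MAD_SCALE * median(abs(v - med) for v in values)
    threshold = med - nmads * mad
    return [v < threshold for v in values]


def keep_cells(n_expressed_genes, total_counts, nmads=3.0):
    """Return a keep/discard flag per cell.

    Assumption: a cell is discarded if it is a low outlier on either metric.
    """
    low_genes = low_outliers(n_expressed_genes, nmads)
    low_counts = low_outliers(total_counts, nmads)
    return [not (g or c) for g, c in zip(low_genes, low_counts)]


# Toy example: the last cell has very few detected genes and counts.
genes = [2000, 2100, 1900, 2050, 1950, 100]
counts = [10000, 11000, 9000, 10500, 9500, 300]
print(keep_cells(genes, counts))
```

Only the lower tail is tested, since the stated aim is to remove cells with unusually few detected genes or counts.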

### Basic Preprocessing of Real Datasets

After quality control, we applied a basic preprocessing step that mimics some of the procedures most commonly applied before clustering scRNAseq data (McCarthy et al., 2017). In the case of Raw UMI and Raw read counts, we considered three independent types of basic preprocessing: no preprocessing, filtering, and filtering followed by normalization (see **Figure 2A**). In the first case, no further operations were performed on the raw counts. In the second case, we used scater to remove lowly expressed genes, i.e., genes with an average expression count (adjusted by library size) equal to 0, where the library size is the total sum of the counts per cell. Note that this filtering step did not affect some of the datasets, including Baron2016\_m, Klein2015, Zeisel2015, and Romanov2016 (see **Supplementary Table 1**). In the third case, we first applied the filtering described above and then performed normalization. Both Raw UMI counts and Raw read counts were normalized with the scran package using the deconvolution method, which normalizes data by size factors estimated from pooled cells and thereby accounts for dropout biases. More details about raw dataset dimensions before and after filtering are given in **Supplementary Table 1**. For illustrative purposes, **Supplementary Figure 1** reports one realization of the tSNE projections of the 10 raw datasets after the quality control step, colored by the corresponding cell group annotations. Inspection of the figure shows the heterogeneity of the datasets with respect to the number of cells, the number of cell groups and their separation.
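The gene filter based on a library-size-adjusted average count can be illustrated as follows. This Python sketch is not the scater code: it assumes that each cell's size factor is its library size divided by the mean library size, and that every cell has at least one count (as would be the case after quality control). The function names are hypothetical.

```python
def library_size_adjusted_means(counts):
    """Average count per gene after scaling each cell by its size factor.

    `counts` is a list of genes, each a list of per-cell counts.
    Assumption: size factor = library size / mean library size.
    """
    n_cells = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_cells)]
    if min(lib_sizes) <= 0:
        raise ValueError("every cell must have a positive library size")
    mean_lib = sum(lib_sizes) / n_cells
    size_factors = [s / mean_lib for s in lib_sizes]
    return [
        sum(c / f for c, f in zip(gene, size_factors)) / n_cells
        for gene in counts
    ]


def expressed_gene_mask(counts):
    """Keep genes whose adjusted average count is greater than zero."""
    return [m > 0 for m in library_size_adjusted_means(counts)]


toy_counts = [[1, 2, 3], [0, 0, 0], [4, 0, 5]]  # 3 genes x 3 cells
print(expressed_gene_mask(toy_counts))
```

Because the threshold is exactly zero, this filter only removes genes with no counts in any retained cell, which may explain why it leaves some datasets unchanged.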

In the case of FPKM/RPKM counts, the basic preprocessing involved the same gene filtering as for the raw counts, followed by highly variable gene (HVG) selection (**Figure 2B**). To extract the most informative genes, we used the Seurat package (Macosko et al., 2015), which defines the most variable genes based on mean-variance dispersion. The dimensions of the datasets before and after basic preprocessing are given in **Supplementary Table 2**. **Supplementary Figure 2** shows one realization of the tSNE projections (colored by the corresponding cell group annotations) of the 7 FPKM/RPKM datasets after the quality control and basic preprocessing steps.
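The idea behind dispersion-based gene selection can be conveyed with a deliberately simplified example. The Python sketch below ranks genes by the ratio of variance to mean and keeps the top ones; it is not the Seurat procedure, which among other things standardizes dispersion within bins of genes with similar mean expression. The function names and the toy data are hypothetical.

```python
from statistics import mean, variance


def dispersion(values):
    """Variance-to-mean ratio of one gene; 0.0 for genes with zero mean."""
    m = mean(values)
    return variance(values) / m if m > 0 else 0.0


def top_variable_genes(expr, n_top):
    """Return the `n_top` gene names with the highest dispersion.

    `expr` maps a gene name to its expression values across cells.
    Simplification: no binning by mean expression is performed.
    """
    ranked = sorted(expr, key=lambda g: dispersion(expr[g]), reverse=True)
    return ranked[:n_top]


toy_expr = {
    "geneA": [1, 1, 1, 1],  # constant, not informative
    "geneB": [0, 0, 4, 4],  # switches between two states
    "geneC": [1, 2, 1, 2],  # mild variation
}
print(top_variable_genes(toy_expr, 2))
```

Dividing the variance by the mean is one simple way to avoid selecting genes only because they are highly expressed.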

### Compared Methods and Modes of Usage

In this study, we evaluated 13 different methods that aim to identify cell populations from scRNAseq data. **Table 3** lists the methods that we considered. For the sake of code compatibility and transparency, we restricted our choice to methods implemented in the R programming language. Some of the methods have multiple releases and versions. In this evaluation, we only tested the releases with the versions reported in **Table 3**. For the sake of completeness, we stress that some of the methods listed in the table have recently undergone major updates, which could have partially improved their performance.

Most of the methods considered in this study (i.e., all except DIMMSC and pcaReduce) can be applied with different parameter combinations, thus providing potentially different results. Such combinations of parameters allow the user to tune different modes of usage, such as including or not an additional preprocessing step, including or not a dimension reduction procedure, using different criteria for choosing a suitable data dimension, applying different clustering algorithms within the same method, and setting or estimating the number of clusters. **Table 4** shows a detailed series of parameters that the user can choose, with the possible choices for each. Each row defines valid parameter settings for the specific method. Within the same row, the total number of combinations is given by the product of the numbers of options available for each parameter (the last column of **Table 4** summarizes the number of combinations). If a method is reported more than once in the table (i.e., Linnorm and sscClust), it means that some of the parameters worked only with a subset of the settings (i.e., not in a full combinatorial way). By considering all possible combinations, we obtained 143 potential different modes of usage of the 13 tested methods.
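The counting rule for one table row is a simple product, and the total over rows is a sum. The Python sketch below illustrates this with a hypothetical row: the parameter names follow the column descriptions given for Table 4, but the listed options are invented for illustration and do not correspond to any specific method.

```python
from math import prod

# Hypothetical row in the style of Table 4 (options are illustrative only).
example_row = {
    "additional_preprocessing": ["none", "method specific"],
    "dimension_reduction": ["PCA", "tSNE"],
    "number_of_dimensions": ["internal", "TRUE"],
    "clustering_technique": ["fixed"],
    "number_of_clusters": ["estimate"],
}


def count_combinations(row):
    """Number of modes of usage defined by one row: product of option counts."""
    return prod(len(options) for options in row.values())


def total_modes(rows):
    """Total number of modes over several rows (rows are counted separately)."""
    return sum(count_combinations(row) for row in rows)


print(count_combinations(example_row))  # 2 * 2 * 2 * 1 * 1 = 8
```

Methods listed on more than one row contribute one product per row, which is why rows are summed rather than multiplied.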

As shown in **Table 4**, eight methods (corresponding to 43 parameter combinations) might incorporate an additional preprocessing step (herein denoted method specific), whereas five methods do not have any specific step (herein denoted none). Out of the 8 methods that include the additional preprocessing step, four allow the user to decide whether or not to apply it (both settings available). Methods also differ in the dimension reduction step, either by providing only an internal procedure to reduce dimensions (six methods, herein denoted internal) or by allowing multiple choices for this purpose (five methods, herein denoted with the name of the specific procedures the user can choose, PCA, tSNE, ICA, etc). Note that DIMMSC must, and Linnorm can, work directly in the high-dimensional space (setting herein denoted with none), and that one method, RaceID3, uses PCA dimension reduction, which was not considered an internal technique (for more details see the methods description in Supplementary Materials). In all 12 methods that incorporate the dimension reduction step, an internal procedure can be used for selecting the number of reduced dimensions (herein denoted internal). Nine algorithms (63 combinations) also allow the number of dimensions to be set manually (herein denoted with TRUE). Those with both options give the user the possibility of either choosing the dimension or using the internal procedure. In this regard, the setting FALSE relates to methods that do not perform dimension reduction.

TABLE 3 | List of methods compared in the benchmark.


Versions of the R packages (methods) compared in this study. Methods are based on various clustering techniques that can be categorized based on the cluster model. Multiple choices indicate that the method allows cells to be clustered with more than one clustering technique.

TABLE 4 | Valid configurations in the parameter settings for each method.


We reported a set of parameters that users can tune in the method, such as the additional preprocessing, the dimension reduction strategy, the number of dimensions, the clustering technique and the number of clusters to obtain. In particular, for the key additional preprocessing: none, no additional preprocessing is applied; method specific, an additional preprocessing is applied prior to clustering (filtering and/or normalization). For dimension reduction: internal, an internal dimension reduction is applied; none, the method works in the original domain; PCA, tSNE, ICA, iCor or others listed by name, the user can choose a specific method to reduce the dimensionality. For number of dimensions: TRUE or FALSE, the method allows or does not allow the number of reduced dimensions to be set; internal, the method uses an internal value for the number of dimensions. For clustering technique: fixed, the method uses only one clustering technique; otherwise the user can choose among a few options that are listed by name. For number of clusters: set or estimate, the method allows the number of clusters to be set or estimated.

Methods can also be customized through the clustering techniques they apply. Some of them are based on a fixed clustering technique (herein denoted fixed), while others propose multiple choices in this step (herein denoted with the name of the specific technique the user can choose, k-means, hclust, etc). The group of methods with multiple clustering options includes monocle3, which offers two types of clustering techniques, RaceID3, which utilizes two partitioning algorithms and a hierarchical clustering algorithm, and sincell and sscClust, which provide more clustering options. Depending on the clustering technique, methods either require the user to set the number of clusters (36 combinations, herein denoted set) or provide an internal functionality to estimate it (107 combinations, herein denoted estimate). For more details about specifications, see the methods descriptions in **Supplementary Materials**.

Finally, we stress that all 13 methods (with all 143 combinations of parameters) can be applied to non-preprocessed or filtered Raw counts as well as to simulated datasets (see **Figure 2**). To avoid performing method-specific normalization on already normalized data, only methods for which the additional preprocessing step can be set to none were used on filtered and normalized Raw counts (i.e., 9 methods with 100 combinations of parameters) (**Figure 2A**) or FPKM/RPKM counts. In addition, according to **Figure 2**, when using normalized FPKM/RPKM counts, we reduced the number of methods and parameter combinations to those that perform a dimension reduction step before clustering and allow the number of reduced dimensions to be set in that step. In this way, we used a subset of 6 methods and 44 combinations of parameters on FPKM/RPKM counts (**Figure 2B**).

### Evaluation Metrics

To quantify the agreement between the partition obtained from a given method and the true partition, we used a well-known and widely used measure, the Adjusted Rand Index (ARI), as implemented in the R package mclust (Scrucca et al., 2016). The ARI is negative if the agreement between the partitions is worse than the agreement expected by chance, and lies between 0 and 1 for clusterings better than chance. The exact formulation of the ARI can be found in (Lawrence and Phipps, 1985).
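For readers who want to see how the index behaves, the ARI can be computed from pair counts in the contingency table of two labelings. The Python sketch below is a plain reimplementation for illustration, not the mclust code used in the study; the function name and the convention of returning 1.0 for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumed convention for fewer than two cells

    # Pairs of cells placed together in both, in the first, in the second.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (together_both - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

The first call shows that the index ignores label names, and the second gives a negative value because the two partitions never place the same pair of cells together.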

To evaluate the accuracy of methods in estimating the correct number of clusters, we used symmetric log-modulus transformation defined as follows:

$$L(x) = \operatorname{sign}(x) \cdot \log_{10}(|x| + 1) \tag{1}$$

where *x* is the difference between the estimated number of clusters and the true number of cell populations in a given dataset. The positive values of log-modulus transformation mean that the number of estimated clusters was higher than the number of true cell populations. Negative values indicate that methods underestimate the number of clusters whereas zero values denote the equality between the number of estimated clusters and the number of true cell populations.
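The log-modulus transformation is straightforward to evaluate numerically. The short Python sketch below follows the definition given in the text (sign of the difference times the base-10 logarithm of its absolute value plus one); the function name is hypothetical.

```python
import math


def log_modulus(x):
    """Symmetric log-modulus: sign(x) * log10(|x| + 1)."""
    if x == 0:
        return 0.0
    return math.copysign(math.log10(abs(x) + 1), x)


# x = estimated number of clusters - true number of cell populations
for diff in (-9, -1, 0, 1, 9):
    print(diff, log_modulus(diff))
```

The transformation keeps the sign of the error while compressing large over- or underestimates, so methods that miss by a few clusters and methods that miss by dozens can be shown on the same axis.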

To identify significant differences in method performance (ARI) when applied after different basic preprocessing types, we used hypothesis testing procedures implemented in the stats R package (Hollander and Wolfe, 1973). The Kruskal-Wallis rank sum test was used to assess the difference in method performance as we vary the basic preprocessing (among QC, QC & FILT, QC & FILT & NORM). The Wilcoxon signed-rank test was used to assess the difference in accuracy between the two basic preprocessing types (QC, QC & FILT). In each context, we computed the Benjamini-Hochberg adjusted p-values (Benjamini and Hochberg, 1995) to correct for multiple comparisons.
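The multiple-comparison correction can be illustrated with a minimal Python sketch of the standard Benjamini-Hochberg step-up adjustment. This is not the R code used in the study, and the function name `benjamini_hochberg` is an assumption made for this example.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Illustrative sketch of the standard step-up procedure (hypothetical helper).
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum enforces monotonicity, so a test with a smaller raw p-value never receives a larger adjusted one.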

Finally, to measure the computational time required by each method to complete its task, we used the *Sys.time* function from R to record the times at which the method's script starts and finishes. The difference between those time points constituted the computational time of the method for a given dataset analysis. Note that computational times are reported in minutes after a *log*(*t*+1) transformation, where *t* is the running time in minutes, and include all the steps that the method needs to cluster a dataset (except basic data preprocessing) together with the loading of the required packages and package dependencies.
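The timing scheme can be sketched in Python as follows. The study itself used R's *Sys.time*; here `time.perf_counter` stands in for it, the helper names `log_minutes` and `timed_run` are hypothetical, and the natural logarithm is an assumption, since the base of the log(*t*+1) transformation is not stated in the text.

```python
import math
import time


def log_minutes(elapsed_seconds):
    """Running time in minutes on a log(t + 1) scale.

    Assumption: natural logarithm; the source does not state the base.
    """
    return math.log1p(elapsed_seconds / 60.0)


def timed_run(cluster_fn, *args, **kwargs):
    """Run a clustering callable and report its wall-clock time.

    Returns (result, elapsed seconds, log-transformed minutes).
    """
    start = time.perf_counter()
    result = cluster_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, log_minutes(elapsed)
```

Adding one before taking the logarithm keeps runs of a few seconds close to zero instead of producing large negative values, which makes fast and slow methods readable on the same axis.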

#### Implementation

This clustering benchmark study was implemented in the R programming language, and the scripts necessary for reproducibility were deposited at the time of publication on the GitHub page: https://github.com/mkrzak/Benchmarking\_Clustering\_Methods\_scRNAseq. The repository stores code for data import, processing, and analysis as well as information about system requirements and the packages to be installed. When performing the analysis, additional HTML reports are produced with a detailed description of the data analysis steps. Note that apart from the required methods, the analysis scripts call other R packages used for plotting and managing R objects. The scripts have been tested on R version 3.5.1 and on a machine with an Intel Core i7 (4.00 GHz × 8) and 24 GB RAM, which are the minimum system requirements for the analysis.

Moreover, for the sake of completeness and to ensure the reproducibility of our study, we deposited the real and simulated datasets on the following GitHub pages: https://github.com/DataStorageForReproducibility/Real\_data\_for\_benchmark\_reproducibility and https://github.com/DataStorageForReproducibility/Simulated\_data\_for\_benchmark\_reproducibility. Both directories include .RData files as SingleCellExperiment class objects that store the count matrices and the corresponding cell group annotations.

In the clustering benchmark, we set the seed for generating pseudo-random numbers globally and applied it to the execution of every method in order to ensure the stability of the solutions and the reproducibility of the results. Note that, since the scRNAseq R packages we evaluated are often under continuous development, versions of the methods (R packages) other than those reported in **Table 3** might output slightly different results.

## RESULTS

Results are organized as follows. We first illustrate the performance of the evaluated methods on the 10 raw datasets, then on the 7 normalized FPKM/RPKM datasets. Finally, we summarize the main findings obtained on the simulated datasets in the 3 setups described in **Figure 1**.

Within this paper, methods/parameter combinations are referred to by a string obtained by concatenating keys separated by underscores. The keys are, in order, the name of the method, the type of additional preprocessing, the dimension reduction technique, the setting of the number of dimensions, the clustering technique and the number of clusters. Each of these keys can take the values reported in **Table 4**.
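The naming scheme can be illustrated with a short Python sketch. The field names in `KEY_ORDER`, the function `combination_label` and the example values are hypothetical and serve only to show the concatenation order; the admissible values are those listed in **Table 4**, and the example may not correspond to an actual combination evaluated in the study.

```python
# Hypothetical field names, listed in the order used to build the label.
KEY_ORDER = (
    "method",
    "preprocessing",
    "dim_reduction",
    "n_dimensions",
    "clustering",
    "n_clusters",
)


def combination_label(**keys):
    """Join the six keys of a method/parameter combination with underscores."""
    missing = [name for name in KEY_ORDER if name not in keys]
    if missing:
        raise KeyError("missing keys: " + ", ".join(missing))
    return "_".join(str(keys[name]) for name in KEY_ORDER)


# Illustrative values only; see Table 4 for the values each key can take.
example = combination_label(
    method="sscClust",
    preprocessing="none",
    dim_reduction="iCor",
    n_dimensions=10,
    clustering="SNN",
    n_clusters="estimate",
)
```

Because the key order is fixed, two labels can be compared position by position to see which setting differs between two combinations of the same method.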

#### Methods Performance on Raw UMI and Raw Read Counts

As mentioned, we independently applied all 13 methods (corresponding to 143 parameter combinations) to the 10 raw counts datasets after using two basic preprocessing types (QC, QC & FILT). Then, we applied only 9 methods (corresponding to a subset of 100 parameter combinations) to the same datasets after applying quality control, filtering and normalization (see the scheme illustrated in **Figure 2**). In the latter case, the 9 methods are those that allow the user to choose none as additional preprocessing to avoid renormalization of already normalized counts (see **Table 4**). To compare the methods across the basic preprocessing procedures, we first show the results corresponding to the combinations that were applied after all three basic preprocessing procedures, then those of the remaining methods/combinations applied only to QC and QC & FILT data.

Note that some of the methods/parameter combinations failed to cluster some datasets (such cases are marked in grey in **Supplementary Figures 3** and **4**) due to errors that occurred during their execution. The most frequent error messages are reported in **Supplementary Table 3**, for Data type = "Raw counts". In particular, SIMLR, DIMMSC and Linnorm encountered failures in a limited number of cases; therefore, we did not consider such datasets in the evaluation of these methods. By contrast, sincell (when ICA was chosen for dimension reduction) reported a significant number of failures; therefore, we did not consider such combinations of parameters in the evaluation of sincell. Note that this reduces the overall number of parameter combinations from 143 to 133 (90 combinations applied after all three types of basic preprocessing, and 43 applied to QC and QC & FILT data only).

#### Overall Accuracy

**Figure 3** shows the performance of the 9 methods (90 parameter combinations out of 100) in terms of ARI evaluated across all 10 raw datasets and organized with respect to the type of basic preprocessing. Analogously, **Figure 4** shows the same results corresponding to the remaining 8 methods (43 parameter combinations) independently applied after two basic preprocessing types. To evaluate the overall accuracy, we first inspected the results regardless of the type of basic preprocessing.

FIGURE 3 | [...] distinguish the different methods, although applied with different parameter combinations. Superimposed as reference, a red dashed line at ARI = 0.5.

From **Figures 3** and **4**, we can observe that most of the methods/parameter combinations show great variability in their performance across the different datasets, which indicates that there is no all-time winner across the entire set of cases we have analyzed. Some of the methods still performed relatively well (i.e., with most of the results above ARI = 0.5) regardless of the preprocessing type. This group includes CIDR, Linnorm (with some combinations of parameters), SC3 (when set is chosen for the number of clusters), some combinations of sscClust (i.e., when iCor is used for dimension reduction) and TSCAN. On the other hand, a few other methods showed very poor performance. For example, some of the poorest performance was observed for sincell (with many parameter combinations), ascend, DIMMSC, pcaReduce and Seurat (only when non-internal is chosen for the number of reduced dimensions). Although sincell generally performed poorly, the method also showed good performance on a few datasets (see the results over individual datasets shown in **Supplementary Figure 3**).

FIGURE 4 | Overall accuracy of methods applied to Raw counts. ARI accuracy for remaining 8 methods with 43 parameter combinations, independently applied to the 10 raw datasets after two basic preprocessing types (QC, QC & FILT). Box colors distinguish the different methods, although applied with different parameter combinations. Superimposed as reference, a red dashed line at ARI = 0.5.

The analysis of **Figures 3** and **4** also shows that the performance of some methods, such as sscClust, Linnorm or Seurat, strongly depends on the particular choice of the parameter settings. We found that this result was partially ignored in previous studies; therefore, we investigate it in more detail in *Effect of Parameter Settings on Accuracy*.

#### Accuracy in Estimating the Number of Clusters

In order to evaluate the accuracy of a method in estimating the correct number of populations, we used the log-modulus transformation in Eq. 1, and we limited the analysis to the 107 methods/parameter combinations that allow setting the option estimate for choosing the number of clusters (see **Table 4**).

**Figures 5** and **6** show the results, respectively, for the 69 methods/parameter combinations applied after all three types of preprocessing procedures (i.e., we excluded 10 combinations of sincell that reported frequent failures), and for the remaining 28 methods/parameter combinations applied after two basic preprocessing steps.

We observed that most of the methods/parameter combinations either under- or overestimated the number of clusters, often in a systematic way. In particular, boxes below and above the dashed lines represent parameter combinations that under- or overestimated the number of clusters, respectively. There are also methods, such as CIDR, some combinations of Linnorm, RaceID3 and TSCAN, that often provide less biased estimates. We also observed that the estimates strongly depend on the specific clustering technique used, as for monocle3, sincell and RaceID3, or on the dimension reduction applied, as for Linnorm. The group of methods that underestimated the number of clusters includes monocle3 (when densityPeak is used for clustering), SIMLR, sincell (with k-medoids and ward.D chosen as clustering techniques), all combinations of sscClust except SNN, and RaceID3 (when k-means was applied). A special case of overestimation was observed with sincell, which often returned a large number of clusters. For example, sincell used with the max. distance technique always returned a number of clusters equal to the number of cells in the dataset, whereas in combination with knn it also returned a very large number of clusters.

#### Effect of Data Basic Preprocessing on Accuracy

We found that most of the methods performed similarly when changing the preprocessing procedures (see **Figures 3** and **4**), although **Supplementary Figures 3** and **4** showed that some of them (i.e., Linnorm, monocle3, sincell and sscClust) present slight variability in performance when the data underwent different preprocessing. However, the Kruskal-Wallis rank sum test did not identify any significant difference in the performance of the methods with respect to the three types of basic preprocessing (QC, QC & FILT, QC & FILT & NORM), and the Wilcoxon signed-rank test did not identify any significant difference associated with the two types of basic preprocessing (QC, QC & FILT).

#### Effect of Parameter Settings on Accuracy

As mentioned above, **Figures 3** and **4** clearly show that the performance of several methods depends much more on the choice of parameters than on the type of basic preprocessing.

To better investigate this, we computed the PCA of the ARI matrix obtained using the 133 methods/combinations as variables and the 10 raw datasets as samples. **Figure 7** shows the results when the clustering methods were applied to QC & FILT preprocessed data (the figures for the other preprocessing types are very similar and are not shown for brevity). Each point depicted in the PCA space represents a particular methods/parameter combination. Therefore, points that are close in the PCA space have similar performance across the 10 datasets. From **Supplementary Figure 5** we can see that the first component is strongly positively correlated with the performance; therefore, methods located on the right side of the figure tend to have better performance than those located on the left side, while the second component is not significantly correlated with the ARI. Each panel of **Figure 7** represents the same PCA projection colored by method and shaped by one of the parameters of interest. The effect of parameter changes on the performance of a given method is represented by the spread of the points of the same color. Note that DIMMSC and pcaReduce have only one valid parameter combination; thus, we do not discuss them in this section, although they are depicted in the figure.

FIGURE 5 | Estimation of the number of clusters for methods applied to Raw counts. Boxplots of L in Eq. 1 for the subset of methods (i.e., 69 parameter combinations) that allow estimating the number of clusters (and with none preprocessing). Superimposed as a reference, a red dashed line at L = 0. Parameter combinations with difference below or above 0 resulted in under- or overestimation of the number of clusters, respectively.

FIGURE 6 | [...] counts. Boxplots of L in Eq. 1 for the remaining methods (28 parameter combinations) with method-specific preprocessing that allow estimating the number of clusters. Superimposed as reference, a red dashed line at L = 0. Parameter combinations with difference below or above 0 resulted in under- or overestimation of the number of clusters, respectively.

Overall, **Figure 7** confirms the poor performance of sincell and the good performance of SC3, CIDR, TSCAN, and some combinations of Linnorm, as well as the strong impact of parameter settings for many methods (e.g., sscClust, Linnorm, Seurat, SIMLR).

In particular, the analysis of **Figure 7A** shows that the performance strongly depends on whether the number of clusters is estimated or not. Not surprisingly, when using the true number of clusters (parameter set) the performance is better for most of the methods compared to when estimating it (parameter estimate). However, there are a few methods that report good overall performance also when the number of clusters is estimated (see, for example, CIDR, monocle3 and sscClust).

In the same spirit, **Figure 7B** illustrates the effect of an additional preprocessing step (that can be either method specific or none) on the performance of the methods. The figure does not indicate any global difference, but it still points out some method-specific variability (e.g., SIMLR showed significantly improved accuracy after such a step).

We also superimposed other features, such as dimension reduction or clustering techniques (not shown for brevity). Since such parameters can assume multiple values, the figures do not reveal any setting that works well for all methods. However, this analysis allowed us to recognize, for example, sscClust with iCor and Seurat with the internal number of reduced dimensions as being among the well-performing combinations.

#### Computational Time

We compared run times of the methods across all 10 raw datasets in order to assess their scalability and identify potential issues related to a specific dataset.

**Figure 8** reports execution time in minutes on a log plus one scale for the methods applied to QC & FILT preprocessed datasets. As a reference, we superimposed on the figure dashed lines at 1, 10, 60 min, and 10 hours. Overall, computational times varied from a few seconds to tens of minutes, or up to several hours (at least for some datasets). We distinguish methods that were consistently fast (showing good scalability), methods requiring longer but still reasonable run time with increased data size (showing limited scalability) and methods requiring significant execution time at least in some cases (showing either poor scalability or problems related to the analysis of a specific dataset). ascend, CIDR, monocle3, pcaReduce, RaceID3 (with non-internal number of dimensions), Seurat (with PCA dimension reduction), sincell, sscClust and TSCAN were among the fastest across the analyzed datasets. Therefore, they were assigned to the first group (with average run time below 2 minutes and maximum time of about 10 minutes). Linnorm and SC3 were assigned to the second group (with average run time of about 5 minutes and maximum time between 20 minutes and an hour). Other methods, such as DIMMSC, SIMLR, RaceID3 (with internal number of dimensions) and Seurat (with ICA dimension reduction), were among the slowest and were therefore assigned to the third group (with average run time between 10 minutes and about two hours and maximum time between an hour and more than 10 hours). In the worst case, RaceID3 took about 12 hours to complete the clustering task.

## Methods Performance on FPKM/RPKM Counts

We used 7 FPKM/RPKM datasets to evaluate the performance of the methods/parameter combinations considered in this study with respect to the number of reduced dimensions when using different dimension reduction techniques. Since FPKM/RPKM datasets consist of already normalized counts, we limited the study to those methods/parameter combinations that do not use "method specific" as additional preprocessing and that also allow setting the number of reduced dimensions. In total, we tested 44 methods/parameter combinations for each of four numbers of reduced dimensions: 3, 5, 10, and 15.

As in the previous case, we note that some of the methods/parameter combinations failed to cluster some of the datasets (see grey boxes in **Supplementary Figure 6**) due to technical errors reported in **Supplementary Table 3**, for Data type = FPKM/RPKM counts. Note that three of the methods, Linnorm, monocle3 and sincell, encountered a significant number of failures with the same error message when used with more than 3 dimensions. We did not consider such cases in further evaluation, which reduces the overall number of combinations from 44 to 33.

#### Overall Accuracy

**Supplementary Figure 7** shows the performance of all 33 methods/parameter combinations applied to FPKM/RPKM datasets. Regardless of the number of dimensions, we can observe variability in the accuracy of the methods similar to what was reported for the raw counts. Most of the methods that were reporting good or poor accuracy on raw counts show similar good/poor performance also on the FPKM/RPKM datasets (as we could have predicted from the results obtained on the QC & FILT & NORM raw datasets). For example, CIDR and sscClust (with some of the parameter combinations) are among the better-performing methods, whereas sincell with most of the combinations reports poor accuracy (although not in all cases). Additionally, we can also confirm that the performance of some methods depends on the choice of parameter settings.

We also observed a general tendency of the methods to perform poorly on datasets with a high number of cells (more than 1600) (see **Supplementary Figure 6**). Although this relationship was not clearly visible on the raw counts, it could be expected as a consequence of a larger complexity in the data not fully explained by the number of selected features and not fully captured using low dimensional projections.

Finally, we did not observe any systematic differences in the accuracy with respect to the number of reduced dimensions (see **Supplementary Figures 6 and 7**). Some of the methods are either robust to the varying number of dimensions or do not show any clear preference for one setting over another. This suggests that data complexity cannot be easily explained by a certain parameter and that the performance of the methods is often data specific.

#### Accuracy in Estimating Number of Clusters

**Supplementary Figure 8** shows the estimated number of clusters compared with the true one (as computed using Eq. 1) for all methods/combinations that allow the users to estimate such a value. We observed a tendency similar to that of the estimates reported for raw counts. For example, monocle3 (with densityPeak clustering), SIMLR, sincell (with k-medoids and ward.D techniques) or sscClust (all except SNN) tend to underestimate the number of clusters, whereas the rest of the combinations of sincell clearly overestimate that value. Moreover, CIDR often provides less biased estimates that result in better accuracy (as on the raw counts).

#### Computational Time

**Supplementary Figure 9** reports the running time evaluated for all methods/parameter combinations (for dimension = 3). First, we observe that, since FPKM/RPKM counts underwent a feature selection step that greatly reduced the data dimension in terms of the number of genes (see **Supplementary Table 2**), the running time is consequently reduced for most of the methods. Indeed, we can see that most of the methods ran in under a minute (CIDR, monocle3 and sscClust) or in a few minutes (some cases of sincell). SIMLR was the slowest method and took up to one hour. Our study also shows that the number of reduced dimensions (3, 5, 10, or 15) had little effect on computational time (data not shown).

Finally, note that some of the combinations evaluated on the raw counts, such as RaceID3 or Seurat, were not considered in the FPKM/RPKM evaluation as they do not allow setting none as the additional preprocessing.

## Methods Performance on Simulated Datasets

Synthetic datasets were used to test the performance of all 143 methods/parameter combinations. We followed three simulation setups in order to simulate the counts (see **Figure 1**) and we repeated the simulation 5 times, each with a different random seed. The simulation setups mimic different characteristics of scRNAseq datasets, e.g., in terms of dimensionality, group structure or levels of noise. In theory, the simulated datasets each pose a different level of complexity for the clustering task.

The methods/parameter combinations that failed across all runs can be seen in **Supplementary Figure 10**, and the respective error messages are reported in **Supplementary Table 3**.

In the next sections, we will describe the performance of the methods according to the three simulation setups. Note that the overall performance of the methods on the synthetic datasets is much higher than in the real data. This can be related to the fact that simulation models may not always reflect all types of noise present in the real case and thus the clustering task can be less challenging. Despite that, synthetic datasets allowed us to confirm some of the previously identified trends and to recognize the potential limits of the methods.

#### Performance on the Simulation Setup 1

Simulation Setup 1 was used to assess the performance of the methods depending on three factors: the number of cells present in the dataset, the number of cell groups and their balance in size. **Figure 9** and **Supplementary Figure 11** show the accuracy of the methods for balanced and unbalanced group design, respectively. **Supplementary Figures 12**–**14** give more details about the balanced group design and correspond to the performance on datasets with 4, 8, and 16 cell groups, respectively.

Looking across **Supplementary Figures 12–14**, we can observe high variability of the methods/parameter combinations across different numbers of simulated cell groups. Less variability was attributable to the runs (see boxplots within **Supplementary Figures 12–14**). Balanced or unbalanced group design slightly affected the performance of most of the methods, although with no clear direction (see **Figure 9** and **Supplementary Figure 11**).

On the synthetic datasets, the well performing methods included CIDR, Linnorm, SC3 and some combinations of sscClust (see **Figure 9** and **Supplementary Figure 11**), same as for the real datasets. Similarly, we could confirm the poor performance of methods such as Seurat with an imposed number of dimensions or sincell with tSNE dimension reduction. Additionally, on the synthetic data we observed high accuracy of the DIMMSC method, Seurat with internal number of dimensions and some combinations of monocle3, RaceID3 and SIMLR.

We did not observe a large loss in method performance when the number of cells increased from 500 to 5000 (**Figure 9** and **Supplementary Figure 11**). The only clear exception was SIMLR, for which several combinations that include cluster number estimation (denoted estimate) failed on datasets with 5000 cells (see the error messages in **Supplementary Table 3**). We observed that many methods were affected by the growing number of simulated cell groups (from 4 to 16 cell groups). In particular, see the methods CIDR, DIMMSC, Linnorm, SC3, SIMLR, sincell, sscClust, and TSCAN across **Supplementary Figures 12–14**. pcaReduce worked similarly across all three factors (number of cells, number of cell groups, group balance) (see **Figure 9** and **Supplementary Figure 11**). As for the real datasets, Seurat accuracy strongly depended on the number of reduced dimensions (denoted as TRUE/internal).

FIGURE 9 | [...] combinations on Setup 1 simulated data. Selected results are across all runs.

#### Performance on the Simulation Setup 2

In simulation Setup 2, we varied the separability between the cell groups from lowly to highly separable. Lowly separable groups are those in which some of the simulated populations can overlap in space, making them the most challenging to detect. Separability was controlled by the de.prob parameter in the Splatter simulation function.

**Figure 10** shows that some of the methods, such as CIDR, DIMMSC, SC3, TSCAN, Seurat (with imposed number of dimensions), SIMLR (with estimated number of clusters) and many combinations of sincell, behaved similarly, and their performance was mostly affected on the datasets with the lowest separability between the cell groups. However, their accuracy was still high, with ARI above 0.5 in most cases. The methods that performed well across all the separability modes were some combinations of Linnorm or monocle3, Seurat with internal number of dimensions and SIMLR with set number of clusters. All those methods/parameter combinations provided high accuracy, with ARI close to 1.

#### Performance on the Simulation Setup 3

The third simulation setup was used to assess the accuracy of the methods with respect to an increasing number of zero counts placed in the dataset. We simulated percentages of dropouts varying from 20% to 90% by manipulating the dropout.mid parameter in the Splatter simulation function.

Overall, we noticed that most of the methods had low accuracy on the datasets with the highest magnitude of missing values (dropout.mid = 6) (see **Figure 11**). Although this is an expected result, some of the methods/parameter combinations still performed well in this case (see, e.g., monocle3, SC3 and sscClust). Interestingly, monocle3 and sscClust performed poorly at the highest dropout rate only with particular parameter combinations. For monocle3, the poorly performing combinations included additional method-specific preprocessing, and for sscClust, iCor dimension reduction. Beyond that, some of the methods appeared to be affected by the increasing percentage of zeros, such as CIDR, DIMMSC, Linnorm, Seurat, SIMLR, and TSCAN. In particular, Linnorm experienced technical errors across all the simulated datasets with the two highest dropout modes (denoted as dropout.mid = 4 and dropout.mid = 6). Many combinations of sincell performed poorly, notably those that use tSNE for dimensionality reduction. Seurat depended highly on the number of dimensions used (either TRUE or internal), and pcaReduce seemed to work moderately well across all four ranges of dropouts.

#### Computational Time

Computational time for all parameter combinations applied to simulation Setup 1 datasets is reported in **Supplementary Figure 15**. Some of the methods scaled well in time across all simulated dataset sizes, while others took longer on the largest datasets (with 5000 cells). Note that many of the trends observed here were previously mentioned in the real datasets analysis. The fastest group of methods across all dataset sizes includes ascend, monocle3, pcaReduce, Seurat, many combinations of sincell, sscClust when PCA dimension reduction was used, and TSCAN. Other methods, like CIDR, DIMMSC, Linnorm with a set number of clusters and some combinations of RaceID3 and sincell, were still relatively fast, running for a few minutes on datasets with the highest number of cells. SC3, Linnorm (with estimated number of clusters), sincell (when nonmetric-MDS was used as dimensionality reduction) and the rest of the combinations of RaceID3 or sscClust took about one hour when applied to the largest simulated datasets, whereas the computational time of SIMLR exceeded a few hours in that case, making it the slowest method of all.

## DISCUSSION

In this study, we evaluated the performance of several clustering methods on a wide range of real and simulated scRNAseq datasets. Such methods are distributed as open-source R packages and they constitute a significant part of the computational tools nowadays available for inferring the unknown composition of cell populations from scRNAseq data. Our comparison aimed to provide insight into the mode of usage for each of these packages depending on the structural assumptions we are willing to make. We compared the ability of the different packages to infer the unknown number of cell populations, the sensitivity of the methods across different datasets and their computational cost.

For each method we tested different parameter configurations, revealing the great impact of parameter setting on the performance of individual methods. In particular, we found that some of the methods performed relatively well across most of the datasets we have considered and with respect to the different choices of the parameter settings (i.e., CIDR, and several combinations of Linnorm, SC3 and sscClust), while others often performed poorly, such as sincell (with many parameter settings) and ascend. Other methods, such as DIMMSC, monocle3, RaceID3, Seurat, SIMLR and TSCAN, can be placed in the middle in terms of overall performance across all datasets, although they reported good performance on a few datasets. However, we should consider that the field of clustering of scRNAseq data is rapidly evolving. Novel methods are continuously emerging and those that we have compared are undergoing extensive revision that might improve their performance. It is not easy to explain why certain methods work better than others, since they perform several steps before applying the clustering algorithm. However, one of the reasons is that some methods were originally developed to analyze scRNAseq data collected under specific protocols (i.e., consisting of datasets with a limited number of cells). Then, the novel challenges (in particular the increasing size and cell heterogeneity) posed by the rapidly evolving scRNAseq technology made them no longer competitive for the complex types of data that are emerging. Reasonably, methods have to be optimized with respect to a specific protocol or dataset size, rather than attempting to find methods that work well on a wide range of scRNAseq conditions. In fact, our study showed that no method emerges as performing better than the others on all datasets. Additionally, our results also showed that there is still room for improving the overall performance of the available methods on large and complex datasets or for providing novel and more accurate methods.

We have found that, among the different basic preprocessing options, there is no global preprocessing strategy that significantly improves the performance of all methods (packages). Instead, we found that the performance of several methods strongly depends on their parameter settings: in Seurat when varying the number of input dimensions; in SIMLR when estimating the true number of cell groups; in sincell when varying the clustering techniques; and in sscClust when changing the dimension reduction step. We believe that the impact of the choice of the method-specific parameters on performance has been underestimated until now, while it turns out to be crucial when using these methods. Unfortunately, we did not identify a golden rule for choosing the parameters. However, depending on the methods used, we identified some better performing configurations: sscClust performed better with iCor as the dimensionality reduction step; Seurat with the internal choice of the number of dimensions; Linnorm and SC3 with a set number of clusters (using the true number of cell populations). On the basis of our results, we suggest that users should be more aware of the different possibilities that several methods offer in terms of parameter choices and modes of usage. Moreover, we recommend that they always evaluate the robustness of their partition with respect to changes in the parameter settings. At the same time, method developers should pay more attention to documenting all the possibilities that their methods offer and to testing their robustness with respect to changes in the settings. To this end, the benchmark pipeline developed for this study can be easily modified to offer an environment where other/novel methods can be evaluated.

We also observed that the poor performance of several method/parameter combinations is often associated with a poor estimate of the number of clusters (see, for instance, the estimation accuracy of monocle3, SIMLR or sincell). Although a rigorous assessment of the number of cell populations in real datasets can be debated, our results show that several methods tend to significantly underestimate or overestimate the number of clusters when compared to the true (usually unknown) cell populations. Therefore, the estimation of the number of hidden cell populations remains challenging in scRNAseq data analysis, and we hope that novel approaches will provide less biased estimates. Moreover, by comparing the performance of each method when the true number of clusters was supplied as input with its performance when the number was estimated from the data, it is possible to quantify the impact that a more accurate estimate of the number of cell populations can have on the overall accuracy.

Dataset dimension and complexity clearly influence both the running time of the methods and the overall performance that they can achieve. In particular, the run time of SIMLR increased with the sample size and was often longer than that of the other methods by several orders of magnitude (requiring up to several hours to analyze a given dataset, compared to a few seconds or minutes for the other methods). Similar scalability issues were observed in SC3, although to a lesser extent. In contrast, other method/parameter combinations scaled well in computational time, such as ascend, CIDR, monocle3, pcaReduce, RaceID3 (with non-internal number of dimensions), Seurat (with PCA dimension reduction), sincell, TSCAN and sscClust, limiting the computational time to a few seconds or minutes. We want to stress that computational issues are becoming particularly important, since modern technologies now allow thousands or even tens of thousands of cells to be sequenced simultaneously; researchers will therefore have to analyze much larger datasets. Hence, it will be important to provide novel methods that scale well in terms of running time and/or the computational resources required for their execution. This can be achieved both by designing methods with efficient algorithms and by better exploiting parallel and high-performance computing in their implementation. From a technical point of view, we also observed frequent failures of some methods in particular cases. For example, SIMLR failed on most of the simulated datasets with 5000 cells. We suspect that, on these large datasets, the method required more memory than was available on our system. Other failures, such as those of monocle3, Linnorm and sincell on FPKM/RPKM datasets, were related to the choice of the number of reduced dimensions. In fact, all of them encountered technical errors when used with tSNE dimension reduction and more than 3 dimensions. Additionally, Linnorm failed on raw and simulated datasets with a high percentage of dropouts (above 70% of zeros in the dataset), suggesting a limited capacity of the method to handle high rates of missing data. Such issues are probably less relevant and could be solved in future releases of the methods.

Finally, it is also worth mentioning that some of the methods, such as ascend, monocle3, SIMLR, sscClust and some combinations of Linnorm and sincell, showed variability in the clustering despite the global setting of the seed. The fluctuations can be spotted by looking at the accuracy of the methods on identical datasets across the three simulation setups (see **Supplementary Figure 12** and **Figures 10** and **11**) or by looking at the accuracy of the methods on datasets not affected by filtering (see **Supplementary Figures 3** and **4**). We note that the results in such cases might not be easily reproducible. In the spirit of reproducible computational research, the user should be aware of such limits.

## CONCLUSIONS

Concurrently with technical improvements in single-cell RNA sequencing, there is a rapid growth in the development of new methods, in particular, those related to the identification of cellular populations. Newly developed methods differ considerably in their computational design, implemented algorithms and available steps giving the user a large number of options to select parameters and perform a cluster analysis on scRNAseq data. However, such possibilities are often hidden and not fully documented in the software code and their impact has to be better understood.

We are not aware of any comprehensive study that tests the various modes of usage of the available methods on large-scale datasets with different experimental complexity in terms of dimensionality, number of hidden cell populations or levels of noise. Our benchmark extends the previous comparative studies (Freytag et al., 2018; Duò et al., 2018; Tian et al., 2019) to a broader evaluation of how the algorithms depend on the parametrization (user-specified parameter choices) and on the dataset differences mentioned above. The results presented here showed that the performance of the methods strongly depends on the user-specified parameter settings and that dataset dimensionality and composition often determine the overall accuracy of the methods. Overall, this means that most of the methods lack robustness with respect to the tuning parameters or to differences among the datasets. We found that both aspects were partially ignored in previous studies, which prevented users from better understanding the potential and limitations of each method. Although we did not find a "golden" rule for choosing optimal parameter configurations, our study identified some model-dependent choices that were more robust than others. Despite that, our study also showed that the overall performance is still far from optimal. Hence, there is a need to develop novel and more accurate methods, in particular for datasets containing a very large and heterogeneous number of cells. Evaluating and improving clustering approaches for scRNAseq data might be beneficial for several areas of biomedical science, such as immunology, cell development and cancer; see, for example, Haque et al. (2017).

The analysis of real and simulated datasets confirmed that a large sample size and a high number of cell populations have a great impact on scRNAseq clustering methods. In particular, we found that the estimation of the number of clusters remains challenging. We confirmed these issues in several analyzed cases where the methods either underestimated or overestimated the true number of cell populations and of simulated cell groups. In real scRNAseq applications, overestimation of the number of clusters might simply be due to methods identifying previously unknown, biologically relevant sub-groups. Underestimation, however, means that the methods failed to accurately distinguish differences between populations of cells. Since in scRNAseq clustering we also aim to identify novel and/or rare cell populations, we typically do not know the number of cell populations. The failure to identify the number of subgroups in a consistent manner is a considerable drawback when it comes to practical applications of such methods. In fact, such failure usually comes at the cost of a lower ARI. By comparing the performance of each method when the true number of clusters was supplied as input with its performance when the number was estimated from the data, one can quantify the impact that a more accurate estimate of the number of groups can have on the overall performance.

With the development of new high-throughput scRNAseq protocols, data dimensionality grows and one has to consider not only the methodological performance but also the computational requirements of the different approaches. We have demonstrated that a higher computational cost does not always translate into better empirical accuracy, and that some configurations are simply impractical for specific protocols. Since larger and more complex datasets are going to be produced by novel droplet-based protocols, computational feasibility needs to be better addressed, and more attention should be given to designing methods with efficient algorithms and to better exploiting high-performance computing in their implementation.

Taken together, our systematic evaluation of the methods confirmed some common-sense assumptions and expected results, but also identified new potential issues in scRNAseq clustering. The summary of the methods presented here can guide readers through the options that the methods provide, while also raising awareness of their possible limitations. Moreover, the benchmark pipeline developed for this study is freely available and can be easily modified to add novel methods.

## DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found here: GSE84133, GSE65525, GSE60361, GSE67835, GSE45719, MTAB-3321, MTAB-2600, GSE81861, GSE74672, GSE71585, GSE45719, MTAB-5061, GSE71585, GSE81608, GSE36552, GSE57249, GSE52583.

## AUTHOR CONTRIBUTIONS

MK designed and implemented the clustering benchmark study, performed both real and simulated analysis, selected and discussed results and wrote the manuscript. YR and LC contributed to the design of the benchmark study, the selection and discussion of results and the drafting of the manuscript. AB contributed to the selection and discussion of the real and simulated data analysis and provided constructive comments on the benchmark study. CA contributed to the design of the benchmark, guided and supervised all phases of benchmark implementation, selection and discussion of results and wrote the manuscript. All authors read and approved the manuscript.

## FUNDING

We acknowledge INCIPIT PhD program co-funded by the COFUND scheme (Marie Skłodowska–Curie Actions) grant agreement n. 665403, EPIGEN project and ADViSE project for financial support.

## ACKNOWLEDGMENTS

MK would like to thank LC for warm hospitality while visiting the School of Mathematics, University of Leeds, to carry out part of this work.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01253/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Krzak, Raykov, Boukouvalas, Cutillo and Angelini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Reproducibility of Methods to Detect Differentially Expressed Genes from Single-Cell RNA Sequencing

Tian Mou<sup>1</sup>, Wenjiang Deng<sup>1</sup>, Fengyun Gu<sup>2</sup>, Yudi Pawitan<sup>1</sup>\* and Trung Nghia Vu<sup>1</sup>\*

<sup>1</sup> Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden, <sup>2</sup> School of Mathematical Sciences, University College Cork, Cork, Ireland

Detection of differentially expressed genes is a common task in single-cell RNA-seq (scRNA-seq) studies. Various methods based on both bulk-cell and single-cell approaches are in current use. Due to the unique distributional characteristics of single-cell data, it is important to compare these methods with rigorous statistical assessments. In this study, we assess the reproducibility of 9 tools for differential expression analysis in scRNA-seq data. These tools include four methods originally designed for scRNA-seq data, three popular methods originally developed for bulk-cell RNA-seq data that have been applied in scRNA-seq analysis, and two general statistical tests. Instead of comparing the performance across all genes, we compare the methods in terms of the rediscovery rates (RDRs) of top-ranked genes, separately for highly and lowly expressed genes. Three real and one simulated scRNA-seq data sets are used for the comparisons. The results indicate that some widely used methods, such as edgeR and monocle, have worse RDR performances compared to the other methods, especially for the top-ranked genes. For highly expressed genes, many bulk-cell–based methods can perform similarly to the methods designed for scRNA-seq data. But for the lowly expressed genes performance varies substantially; edgeR and monocle are too liberal and have poor control of false positives, while DESeq2 is too conservative and consequently loses sensitivity compared to the other methods. BPSC, limma, DEsingle, MAST, t-test and Wilcoxon have similar performances in the real data sets. Overall, the scRNA-seq based method BPSC performs well against the other methods, particularly when there is a sufficient number of cells.

Keywords: single cell, RNA sequencing, differential expression, rediscovery rate, comparison

#### Edited by:

Monica Bianchini, University of Siena, Italy

#### Reviewed by:

Max Robinson, Institute for Systems Biology (ISB), United States Yuriy L. Orlov, First Moscow State Medical University, Russia

#### \*Correspondence:

Yudi Pawitan yudi.pawitan@ki.se Trung Nghia Vu Trungnghia.vu@ki.se

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 06 August 2019 Accepted: 05 December 2019 Published: 17 January 2020

#### Citation:

Mou T, Deng W, Gu F, Pawitan Y and Vu TN (2020) Reproducibility of Methods to Detect Differentially Expressed Genes from Single-Cell RNA Sequencing. Front. Genet. 10:1331. doi: 10.3389/fgene.2019.01331

## INTRODUCTION

Traditional gene expression profiling with high-throughput RNA-sequencing technology measures the aggregated expression levels of genes from a collection of millions of cells. Such bulk-cell RNA-sequencing cannot capture cellular heterogeneity since there is no cell-specific information (Miao and Zhang, 2016; Jaakkola et al., 2017). Single-cell RNA sequencing (scRNA-seq) has developed rapidly as a powerful technology for studying transcriptomics at the single-cell level (Sandberg, 2014). However, compared to bulk-cell data, scRNA-seq data has a higher level of noise due to both biological and technical reasons, for example, lower input materials, cell-cycle phase, amplification biases, and the so-called dropout and bursting events (Dal Molin et al., 2017; Jaakkola et al., 2017; Soneson and Robinson, 2018). Such events are caused by the stochastic nature of the gene expression process at the single-cell level (Gong et al., 2018). The dropout events generate zero expression, statistically leading to zero inflation in the gene-expression distribution at a much higher proportion than expected under the standard negative-binomial model commonly assumed in bulk-cell data (Miao et al., 2018). Aggregation of expression in bulk-cell data reduces the effects of these single-cell events.

Differential expression (DE) analysis to discover quantitative changes between different groups or conditions plays an important role in understanding the molecular basis of phenotypic variation. However, due to the unique characteristics of scRNA-seq data, it is not immediately obvious that we can just use standard methods developed for bulk-cell data. A particular challenge is dealing with the large number of low (or zero) read counts in the scRNA-seq data. A previous study (Love et al., 2014) has shown that weakly expressed genes tend to produce more differences than highly expressed genes. To tackle this issue, several DE methods have been developed for scRNA-seq data, for example, BPSC (Vu et al., 2016), MAST (Finak et al., 2015), and monocle (Qiu et al., 2017). In general, bulk-cell–based DE methods were not originally designed to deal with a large fraction of lowly expressed genes. Yet, in practice, many studies use the bulk-cell–based DE methods for single-cell data, such as edgeR (Wang et al., 2016) or limma (Ziegenhain et al., 2017). Furthermore, various pipelines and workflows of RNA-seq analysis do not consider scRNA-seq data specifically (Lun et al., 2016; Chen et al., 2016; Law et al., 2016) and suggest users apply the bulk-cell–based methods to scRNA-seq data (Zhu et al., 2017).

These bulk-cell–based methods are methodologically sophisticated, and they have been used for scRNA-seq data, but evaluation of their applicability to scRNA-seq data is still uncommon and different studies have reported opposite results. For example, authors in a recent study (Jaakkola et al., 2017) compared five DE methods, including two single-cell–based methods and three bulk-cell–based methods. They concluded that the original DESeq (Anders and Huber, 2010) and limma (Law et al., 2014) are not suitable for scRNA-seq data. In contrast, another comparative study (Miao and Zhang, 2016) declared that DESeq tends to outperform other methods on scRNA-seq data. Most comparative studies (Miao and Zhang, 2016; Dal Molin et al., 2017; Jaakkola et al., 2017; Soneson and Robinson, 2018) agree that bulk-cell–based methods are applicable to scRNA-seq even though there is a lack of agreement in finding DE genes by these DE methods (Wang et al., 2019) and it is difficult to identify the best performing tool for DE analysis of scRNA-seq data (Dal Molin et al., 2017). Therefore, further evaluations of these DE methods, including both bulk-cell– and single-cell–based methods in different aspects, are warranted for better understanding of the methodologies when applied to scRNA-seq studies.

To compare the DE methods, previous studies have used conventional statistics such as type-I error rate, false discovery rate (FDR) and receiver operating characteristic (ROC) curve. Notably, these metrics are applied to the full collection of genes. Reproducibility is also an important metric, although it is sometimes calculated differently in the different studies. For example, a recent study (Miao and Zhang, 2016) assesses the reproducibility of the methods by looking at the average of the overlap of top 1,000 DE genes (ranked by p-value) across 20 replicates. In each replicate, a control group and a testing group are sampled with a different random seed. Another measure of reproducibility (Jaakkola et al., 2017) compares the precision and recall of the detection of all DE genes between the full data set and its subsets.

In this study, we compare the performance of nine DE methods, including both bulk-cell and single-cell–based approaches as well as general statistical tests not specifically designed for RNA-seq data. We focus on the reproducibility of the methods in terms of rediscovery rate (RDR) (Ganna et al., 2014) of top-ranking genes. RDR is defined as the proportion of top-ranking findings detected from a training sample that are replicated in a validation sample. In high-throughput studies, the RDR is determined by both the false positive rate (FPR) and power (Ganna et al., 2014), so it is a convenient and easily understood metric for the comparison of methods. Limiting the assessment to top-ranking genes turns out to be important. Firstly, it follows the data analytic process we perform in practice, where the top-ranked genes are usually considered the most interesting ones for further biological analyses or interpretation. Secondly, some methods perform differently for the top-ranked genes and across all genes. Besides the RDR, type-I error rate or FPR, and ROC are also used as extra metrics for the comparisons.

To get realistic distributional characteristics and capture some diversity in single-cell data, we utilize three real scRNA-seq data sets; in addition, we use simulated data from the beta-Poisson model (BPSC), which has been suggested for scRNA-seq data in a recent study (Vu et al., 2016). Because of their distinct distributions, the groups of highly and lowly expressed genes are also considered separately, as the latter is more affected by single-cell specific events such as dropouts.

## RESULTS

We compare nine methods for detecting differentially expressed isoforms, including edgeR (Robinson et al., 2010), DESeq2 (Love et al., 2014), DEsingle (Miao et al., 2018), monocle (Qiu et al., 2017), BPSC (Vu et al., 2016), MAST (Finak et al., 2015), t-test (Welch, 1947), Wilcoxon rank sum test (Hollander et al., 2013), and limmatrend (Law et al., 2014). Among those, edgeR, DESeq2 and limmatrend are designed for bulk-cell RNA-seq analysis, while DEsingle, monocle, BPSC, and MAST are developed based on scRNA-seq data. T-test and Wilcoxon rank-sum test are general comparison tests not specific to RNA-seq data. Table 1 compares the methods in terms of (i) distribution assumption, (ii) original data motivation (bulk-cell or single-cell data), (iii) test statistic, and (iv) run time for a typical data set used in the comparisons. We also state the exact version of each software tool used in the comparisons.

TABLE 1 | List of the differential expression analysis methods.

To get realistic distributional characteristics, the following three real scRNA-seq data sets are used as the basis for simulations. (Different papers and projects use isoform- and gene-level expressions. For simplicity, we shall use the terms "isoform" and "gene" interchangeably.)


In addition, we also simulate single-cell data based on the beta-Poisson model (Vu et al., 2016). The variation in sample sizes of the three real data sets, from 160 to 720, allows us to compare the performance of each method at different sample sizes. More details of the methods and data sets are given in the Materials and Methods section.

In each experiment, the comparison focuses on the DE analysis of two predefined groups of cells. Briefly, an equal number of samples is randomly selected from the two groups in the original data set to generate the training set. For each sampled cell from a real data set, all isoforms are taken together; this preserves the statistical dependencies between the isoforms. For the validation set, a different set of samples from both groups is selected. The selection of training and validation sets is repeated 50 times to average out the effect of random selection. Note that the training and validation sets are always disjoint. The nine DE methods are then applied to the training and validation sets separately.
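The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_train_validation`, the cell identifiers, the seed, the choice of 30 cells per group, and the use of equal training and validation sizes are hypothetical and are not taken from the study.

```python
import random


def split_train_validation(group_a, group_b, n_per_group, rng):
    """Draw disjoint training and validation sets with equal cells per group."""
    if 2 * n_per_group > min(len(group_a), len(group_b)):
        raise ValueError("not enough cells for two disjoint sets")
    train, valid = {}, {}
    for name, cells in (("A", group_a), ("B", group_b)):
        # Sample 2 * n cells without replacement, then cut the draw in half,
        # so that no cell can appear in both the training and validation set.
        drawn = rng.sample(cells, 2 * n_per_group)
        train[name] = drawn[:n_per_group]
        valid[name] = drawn[n_per_group:]
    return train, valid


rng = random.Random(2020)  # arbitrary fixed seed so the splits can be reproduced
control = [f"ctrl_{i}" for i in range(80)]   # hypothetical cell identifiers
treated = [f"treat_{i}" for i in range(80)]
# Repeat the selection 50 times, as in the text, to average out sampling effects.
splits = [split_train_validation(control, treated, 30, rng) for _ in range(50)]
```

Sampling whole cells (rather than individual isoform values) keeps all isoforms of a cell together, which is what preserves the dependencies between isoforms mentioned above.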

### Type-I Error Control

For each real data set, we generate a null data set by randomly sampling from the two groups combined (i.e., ignoring the group labels). Thus, the null data sets are expected to have no true DE isoforms, and the p-values of each method should theoretically follow a uniform distribution, as they do whenever the null hypothesis is true (Murdoch et al., 2008; Bland, 2013). The uniformity of the p-value distribution under the null hypothesis can therefore be used to assess the performance of the methods. We calculate the type-I error rate as the fraction of isoforms that are assigned a significant p-value (p < 0.05). This fraction is also known as the FPR. To highlight the effects of the dropout events, which tend to produce low expression and zero inflation, we split the isoforms into two groups based on expression level: highly expressed isoforms and lowly expressed isoforms. The former are the isoforms with an estimated expression above 1 transcript per million (TPM) in more than 25% of the cells, and the remaining isoforms are assigned to the latter. This threshold was also suggested in a recent comparative study of DE methods in scRNA-seq (Soneson and Robinson, 2018).
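The two quantities defined in this subsection, the expression-level split and the FPR on a null data set, can be illustrated with a short Python sketch. The function names, the dictionary layout and the toy numbers are assumptions made for this example; only the thresholds (1 TPM, 25% of cells, p < 0.05) come from the text.

```python
def split_by_expression(tpm, min_tpm=1.0, min_fraction=0.25):
    """Return (highly, lowly) expressed isoform names.

    `tpm` maps an isoform name to its list of per-cell TPM values.
    """
    high, low = [], []
    for isoform, values in tpm.items():
        fraction = sum(v > min_tpm for v in values) / len(values)
        # "Highly expressed" requires strictly more than 25% of the cells.
        (high if fraction > min_fraction else low).append(isoform)
    return high, low


def false_positive_rate(p_values, alpha=0.05):
    """Fraction of isoforms called significant in a null data set."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Toy input (hypothetical values): four cells, three isoforms.
tpm = {
    "iso1": [5.0, 0.0, 3.2, 7.1],  # above 1 TPM in 3/4 of the cells
    "iso2": [0.0, 0.0, 2.0, 0.0],  # above 1 TPM in exactly 1/4 of the cells
    "iso3": [0.0, 0.4, 0.0, 0.0],  # never above 1 TPM
}
high, low = split_by_expression(tpm)

# Hypothetical p-values returned by one DE method on a null data set.
null_p = [0.01, 0.20, 0.47, 0.03, 0.88, 0.61, 0.35, 0.74, 0.09, 0.52]
fpr = false_positive_rate(null_p)
```

In this toy case `iso2` falls in the lowly expressed group because it exceeds 1 TPM in exactly, not more than, 25% of the cells. A well-calibrated method should give an FPR near 0.05 on real null data; the toy value here is only meant to show the arithmetic.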

Results in Figure 1A show that for highly expressed isoforms, most methods manage to control the FPR close to the target 0.05. Two single-cell–based methods, monocle and DEsingle, are not stable, as their FPRs fluctuate the most from the expected error rate. As expected, the bulk-cell–based methods, edgeR, DESeq2, and limmatrend, perform well on this group, and DESeq2 is the most conservative.

For the lowly expressed isoforms, DESeq2 is also the most conservative method (Figure 1B). It identifies fewer significant isoforms, so its FPR is well below the expected level (0.05) in all data sets. In contrast, edgeR has the highest FPR, sometimes substantially above the target value. Similarly, monocle also has a large number of false positive findings. The FPR of DEsingle varies somewhat across data sets: it is liberal for the MDA-MB-231 data set, conservative for the NPCs data set, and performs rather well in the other data sets. Thus, the performance of DEsingle appears to be unstable and highly dependent on the data set. The histograms of p-values (Figure S1 in the Supplementary report) further illustrate that few methods returned uniformly distributed p-values under the null hypothesis for the lowly expressed isoforms, while most methods show better uniformity for the highly expressed isoforms.

### The RDR

The RDR is the proportion of the top-ranking DE isoforms in the training set that is found to be significant (p < 0.05) in the validation set. The RDR is calculated based on the top 5%, 10%, 20% DE and all isoforms in the training set.
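As a concrete illustration of this definition, the sketch below computes the RDR from two sets of p-values. It is a minimal example, not the authors' implementation: the function name `rediscovery_rate`, the dictionary layout, the rounding rule used to turn a percentage into a number of isoforms, and the toy p-values are all assumptions.

```python
def rediscovery_rate(train_p, valid_p, top_fraction=0.05, alpha=0.05):
    """RDR: share of top-ranked training isoforms significant in validation.

    `train_p` and `valid_p` map isoform names to p-values.
    """
    ranked = sorted(train_p, key=train_p.get)  # smallest p-value first
    # Assumption: the size of the top set is rounded to the nearest integer
    # and never allowed to be empty.
    n_top = max(1, round(top_fraction * len(ranked)))
    top = ranked[:n_top]
    rediscovered = [iso for iso in top if valid_p[iso] < alpha]
    return len(rediscovered) / n_top


# Hypothetical p-values for ten isoforms in the training and validation sets.
names = [f"iso{i}" for i in range(10)]
train_p = dict(zip(names, [0.001, 0.002, 0.01, 0.03, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]))
valid_p = dict(zip(names, [0.004, 0.30, 0.02, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))

rdr_top20 = rediscovery_rate(train_p, valid_p, top_fraction=0.20)
rdr_all = rediscovery_rate(train_p, valid_p, top_fraction=1.0)
```

In the toy data, only one of the two top-ranked isoforms is significant in the validation set, so the RDR for the top 20% is 0.5, while the RDR over all isoforms is 0.2. The gap between the two numbers shows why restricting the assessment to top-ranked isoforms can give a different picture from an assessment over all isoforms.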

#### RDR Analysis Under the Null Hypothesis

The RDRs of the null data sets generated from the real data, as described in the Type-I Error Control section, are reported in Figure 2. Panels A and B present the results for the groups of highly expressed isoforms and lowly expressed isoforms, respectively. Under the null hypothesis of no group effect, the expected RDR is 0.05. Similar to the type-I error results, the RDRs of all methods are generally better for highly expressed isoforms. Monocle and DEsingle are the worst, as their RDRs are often far from 0.05. However, performance improves as a larger number of top DE isoforms is considered. For example, the RDR of monocle for all isoforms in the NPCs data set is very close to the expected value, but it is much higher than 0.05 among the top 5% DE isoforms. Similarly, for the mESCs data set, the RDR of edgeR for all isoforms is close to 0.05, but it is consistently higher than this target value for the smaller numbers of top DE isoforms. Thus, comparing the performances based on all isoforms could be misleading.

These patterns are much more pronounced for lowly expressed isoforms; see Figure 2B. In this case, edgeR performs worst in all data sets; this result is consistent with other studies (Soneson and Robinson, 2018). The performances of DESeq2 still tend to be conservative in both groups of isoforms, while other methods generally have RDR around the expected value.

We further evaluate RDR of the DE methods in the simulated beta-Poisson data set. Results from 50 replicates of the null data sets from the simulated data are reported in the rightmost plots of Figures 2A, B. The similar patterns of RDR of DE methods for both isoform groups confirm the results from the real data sets. In particular, monocle has poor performances in both groups, and edgeR does not perform well with lowly expressed isoforms.

#### RDR Analysis Under the Alternative Hypothesis

Results of RDR analysis for the simulated beta-Poisson data under the alternative hypothesis are presented in Figure 3. As described in more detail in the Materials and Methods section, 5% of the isoforms are randomly selected to be differentially expressed between the two groups (hence true DE isoforms). For highly expressed isoforms (Figure 3A), monocle and BPSC have the highest RDRs across the top 5%, 10%, 20% and all DE isoforms, while edgeR is comparable to the rest. DESeq2 is conservative for the null data sets and the group of lowly expressed isoforms, but its performance here is comparable to other methods. For lowly expressed isoforms, edgeR and monocle produce the highest RDRs compared to other methods (Figure 3B). However, recall from the previous subsection that these two methods have high false positive rates.

FIGURE 2 | Rediscovery rate (RDR) of differential expression (DE) isoforms in the real and simulated scRNA-seq data sets under the null hypothesis calculated from the top 5%, 10%, 20% DE and all isoforms. Panels (A, B) present the results of groups of highly expressed isoforms and lowly expressed isoforms, respectively. The number of highly expressed isoforms in MDA-MB-231, mESCs, neuronal progenitor cells (NPCs), and simulated data sets are 8,299, 31,895, 10,422, and 8,077, respectively. The corresponding number of lowly expressed isoforms are 18,476, 80,698, 30,378, and 1,923.

FIGURE 3 | Observed rediscovery rate (RDR) and true rediscovery rate (TrueRDR) of differential expression (DE) isoforms in the simulated beta-Poisson data set under the alternative hypotheses calculated among the top 5%, 10%, 20% DE and all isoforms. Panels (A, B) present the rediscovery rate in the groups of highly and lowly expressed isoforms, respectively. Panels (C, D) display the true rediscovery rate collected from highly and lowly expressed isoforms separately. Panel (E) displays the true rediscovery rate collected from both highly and lowly expressed isoforms. Panel (F) presents the ratio between true RDR and observed RDR. The number of highly expressed isoforms in the simulated data set is 8,077, and the number of lowly expressed isoforms is 1,923.

In the simulated data, we know the true DE status, so we can evaluate the true RDR, which is defined as the proportion of the true positives in the validation set among the top DE isoforms identified in the training set. In other words, the true RDR counts only those rediscovered isoforms that are also true DE isoforms. This is shown in Figures 3C, D. First, let us consider panel D. While there are 5% true DE isoforms, the statistical power for the lowly expressed isoforms is tiny, so very few of the true DE isoforms appear among the top-ranking isoforms, and these isoforms do not produce significant p-values in the validation set. Hence the rediscoveries are mostly false positives. This means that there are reproducible features of the data, such as zero inflation, that consistently create problems for monocle and edgeR to the point of producing false positives in validation data. These results highlight the challenge in finding true DE among lowly expressed isoforms, or equivalently, the ease of producing false positives.
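The difference between the observed RDR and the true RDR can be made explicit with a small Python sketch. As before, this is an illustration under assumptions rather than the authors' code: the function name `observed_and_true_rdr`, the rounding rule for the size of the top set, and the toy p-values and DE labels are hypothetical.

```python
def observed_and_true_rdr(train_p, valid_p, true_de, top_fraction=0.05, alpha=0.05):
    """Return (observed RDR, true RDR) for the top-ranked training isoforms.

    `true_de` is the set of isoforms simulated as differentially expressed.
    """
    ranked = sorted(train_p, key=train_p.get)  # smallest p-value first
    n_top = max(1, round(top_fraction * len(ranked)))  # assumed rounding rule
    top = ranked[:n_top]
    # Observed RDR counts every top isoform that is significant in validation.
    observed = sum(valid_p[iso] < alpha for iso in top)
    # True RDR counts only those rediscoveries that are truly DE.
    true_hits = sum(valid_p[iso] < alpha and iso in true_de for iso in top)
    return observed / n_top, true_hits / n_top


# Hypothetical simulation with ten isoforms, three of them truly DE.
names = [f"g{i}" for i in range(10)]
train_p = {name: 0.001 * (i + 1) for i, name in enumerate(names)}
valid_p = dict(zip(names, [0.001, 0.01, 0.04, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))
true_de = {"g0", "g1", "g7"}

observed, true_rdr = observed_and_true_rdr(train_p, valid_p, true_de, top_fraction=0.4)
ratio = true_rdr / observed if observed > 0 else float("nan")
```

Here three of the four top-ranked isoforms are rediscovered, but only two of them are truly DE, so the observed RDR is 0.75 and the true RDR is 0.5. A ratio of true to observed RDR well below 1 indicates that many rediscoveries are reproducible false positives.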

From Figure 3C, the true RDRs of three methods, BPSC, monocle and DESeq2, are better than those of the other methods. The overall true RDRs are given in Figure 3E, which in this case look similar to the result for highly expressed isoforms, but do not reflect the results for lowly expressed ones. Figure 3F shows the ratio of true RDR to observed RDR. DESeq2 has the highest ratio among the compared methods, indicating a good specificity in detecting DE isoforms. However, DESeq2 generally discovers fewer true DE isoforms, i.e., has lower sensitivity, compared to BPSC. edgeR and monocle have a lower ratio than the other methods since they have more false discoveries. In the next section, the balance between sensitivity and specificity of the methods is taken into account via the ROC curve.

For the real data sets, there are no significant differences in RDR performance for the top 5%, 10%, 20% DE isoforms among the nine DE methods in the group of highly expressed isoforms (Figure S2A in the Supplementary report). However, similar to the results of the simulated data set, edgeR and monocle are highly liberal, while DESeq2 tends to be too conservative for the lowly expressed isoforms (Figure S2B). We have performed other simulations and analyzed two other data sets that confirmed this observation. These results are given in the Supplementary Material and described in the Discussion section.

#### ROC Performance

Performances of the DE methods on the simulated data under the alternative hypothesis are also evaluated using the area under the ROC curve (AUC). In Figure 4, the AUC and ROC curves of the top 5% DE isoforms and of all isoforms over 50 replicates are presented in panels A and B, respectively. For edgeR and monocle, there are obvious differences between their performances for the top 5% DE isoforms and for all isoforms. For the top 5% isoforms, these two methods perform poorly compared to the other methods. However, when all isoforms are taken into account, the two methods are comparable with the other methods. Results for the top 10% and 20% DE isoforms are given in Figure S3 in the Supplementary report. Among these methods, BPSC and DESeq2 are consistently the top performing methods with the highest AUC values for different sizes of top DE isoform sets. Overall, these results are in agreement with the results from RDR analyses.
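For reference, the AUC of a DE method on simulated data can be computed directly from its p-values and the known DE labels. The sketch below uses the rank-based (Mann-Whitney) identity for the AUC; the function name is hypothetical, and restricting the calculation to a top-ranked subset, as in panel A, is left to the caller.

```python
import numpy as np
from scipy.stats import rankdata

def auc_from_pvalues(p_values, is_true_de):
    """Area under the ROC curve when isoforms are ranked by p-value."""
    scores = -np.asarray(p_values, dtype=float)   # smaller p-value = stronger evidence
    labels = np.asarray(is_true_de, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one DE and one non-DE isoform")
    ranks = rankdata(scores)                      # average ranks handle ties
    # Mann-Whitney U statistic of the true DE isoforms
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))
```

An AUC of 0.5 corresponds to a random ranking, and 1.0 to a ranking that places every true DE isoform ahead of every non-DE isoform.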

## MATERIALS AND METHODS

## Experimental and Synthetic Data Sets

To capture the true distributional characteristics of real data, three real scRNA-seq data sets are used for the evaluation of the nine DE methods. The first data set (MDA-MB-231) includes 160 single cells from a triple-negative breast cancer cell line, half of which are treated with metformin. The cells are captured using the Fluidigm C1 system and sequenced on Illumina HiSeq 2500 machines for 80 control and 80 treated cells separately. Then we use Cufflinks (Trapnell et al., 2010) to estimate the isoform expression. This data set contains a total of 26,775 isoforms across 160 single cells. The average number of reads per cell is ∼649,000.

The second data set (mESCs) is taken from a public scRNA-seq data set (GSE60749-GPL13112) in the Conquer collection (Soneson and Robinson, 2018), which provides expression estimates of isoforms. The compared single cells are 94 individual v6.5 mouse embryonic stem cells (mESCs) with culture conditions 2i+LIF (group 1) vs. 174 v6.5 mESCs with culture conditions in serum+LIF (group 2). The data are prepared with the C1 System using the SMARTer Ultra Low RNA kit for Illumina Sequencing (Clontech) and protocols provided by Fluidigm. More details of the data can be found in the original paper (Kumar et al., 2014). The Conquer pipeline then estimates isoform abundances using Salmon (Patro et al., 2017). This data set contains 112,593 isoforms across 94 single cells in group 1 and 174 single cells in group 2. The average number of reads per cell is ∼1.7M, the largest among the three data sets.

The third real data set (NPCs) is a subset of the GSE102934 data from the NCBI Gene Expression Omnibus (Iacono et al., 2018). This data set has 720 NPCs derived from induced pluripotent stem (iPS) cells, half of which are from a Williams-Beuren patient and the other half from a healthy donor. Single-cell libraries are constructed with massively parallel single-cell RNA sequencing (MARS-Seq) and sequenced on the Illumina HiSeq 2500 platform. This data set contains a total of 41,020 isoforms from 720 single cells, and the average number of reads per cell is 18,600. Thus, this data set has a relatively large number of cells with low sequencing coverage.

The simulated data set for isoform expression of single cells is generated by the beta-Poisson model (Vu et al., 2016). In particular, we generate the counts for each isoform from a beta-Poisson distribution with four parameters estimated from the mESCs data set. The four-parameter beta-Poisson model is as follows:

$$BP_4(x \mid \alpha, \beta, \lambda_1, \lambda_2) = \lambda_2\, \text{Poisson}\left(x \mid \lambda_1 \text{Beta}(\alpha, \beta)\right) \tag{1}$$

The mean and variance of the model can be written as

$$\mu = E(X) = \lambda_1 \lambda_2 \phi_1$$

and

$$\operatorname{Var}(X) = \mu \lambda_2 + \mu^2 \phi_2,$$

where $\phi_1 = \dfrac{\alpha}{\alpha + \beta}$ and $\phi_2 = \dfrac{\beta}{\alpha(\alpha + \beta + 1)}$. Crucially, we can modify the parameter λ<sub>1</sub> to create mean differences between groups. A more detailed description of the model can be found in the original study (Vu et al., 2016).
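As a consistency check on the moment formulas, the following derivation sketch may help. It assumes that Eq. (1) is read as $X = \lambda_2 Y$ with $Y \mid B \sim \text{Poisson}(\lambda_1 B)$ and $B \sim \text{Beta}(\alpha, \beta)$; this reading of the notation is our assumption, and the symbols $Y$ and $B$ are introduced here only for illustration. With $\phi_1 = \alpha/(\alpha+\beta)$ and $\phi_2 = \beta/[\alpha(\alpha+\beta+1)]$, the laws of total expectation and total variance give

$$
\begin{aligned}
E(X) &= \lambda_2\, E\big[E(Y \mid B)\big] = \lambda_1 \lambda_2\, E(B) = \lambda_1 \lambda_2 \phi_1 = \mu, \\
\operatorname{Var}(X) &= \lambda_2^2 \Big\{ E\big[\operatorname{Var}(Y \mid B)\big] + \operatorname{Var}\big[E(Y \mid B)\big] \Big\}
 = \lambda_2^2 \Big\{ \lambda_1 \phi_1 + \lambda_1^2 \operatorname{Var}(B) \Big\} \\
 &= \mu \lambda_2 + \lambda_1^2 \lambda_2^2\, \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
 = \mu \lambda_2 + \mu^2 \phi_2 ,
\end{aligned}
$$

where the last step uses $\operatorname{Var}(B) = \alpha\beta / [(\alpha+\beta)^2(\alpha+\beta+1)] = \phi_1^2 \phi_2$. Under this reading, multiplying $\lambda_1$ by a constant multiplies the mean $\mu$ by the same constant, which is why $\lambda_1$ is the natural parameter to change when creating differences between groups.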

Beta-Poisson models fitted on the real mESCs data set are used as baseline distributions for simulation. For each isoform, expression values across samples in the control and the treated group are generated from the same beta-Poisson model. To mimic the biological variation, 5% of isoforms are selected to be differentially expressed between the two groups (true DE isoforms). Specifically, the parameter λ<sub>1</sub>, which controls the mean of the distribution, is fixed in the control group and scaled in the treated group so that the log2 fold change is 1 unit in absolute value. The effect direction is randomly determined for each DE isoform, with equal probability of upregulation and downregulation. In other words, the change between the two compared groups is either a two-fold increase or a two-fold decrease, with equal probability. The simulated data set consists of 80 samples in each of the control and treated groups and a total of 10,000 isoforms measured per sample. Library sizes of the single-cell samples are randomly sampled from a range of 1–3 million. We filter out isoforms with zero expression across all samples.
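The simulation scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, Eq. (1) is read as a Poisson count with a Beta-distributed rate rescaled by λ<sub>2</sub>, the fitted parameters are supplied by the caller, and the library-size sampling step is omitted.

```python
import numpy as np

def simulate_bp4(alpha, beta, lam1, lam2, n_cells, rng):
    """Draw n_cells values from the four-parameter beta-Poisson model
    (assumed reading: lam2 * Poisson(lam1 * Beta(alpha, beta)))."""
    b = rng.beta(alpha, beta, size=n_cells)   # per-cell Beta draw
    y = rng.poisson(lam1 * b)                 # Poisson with random mean lam1 * b
    return lam2 * y                           # lam2 rescales the Poisson counts

def simulate_two_groups(params, n_per_group=80, de_frac=0.05, lfc=1.0, seed=0):
    """params: array of shape (n_isoforms, 4) with (alpha, beta, lam1, lam2)."""
    rng = np.random.default_rng(seed)
    params = np.asarray(params, dtype=float)
    n_iso = params.shape[0]

    # Randomly pick the true DE isoforms
    n_de = int(round(de_frac * n_iso))
    is_de = np.zeros(n_iso, dtype=bool)
    is_de[rng.choice(n_iso, size=n_de, replace=False)] = True

    # Up- or down-regulation with equal probability: factor 2**(+lfc) or 2**(-lfc)
    sign = rng.choice([-1.0, 1.0], size=n_iso)
    fold = np.where(is_de, 2.0 ** (sign * lfc), 1.0)

    control = np.empty((n_iso, n_per_group))
    treated = np.empty((n_iso, n_per_group))
    for i, (a, b, l1, l2) in enumerate(params):
        control[i] = simulate_bp4(a, b, l1, l2, n_per_group, rng)
        treated[i] = simulate_bp4(a, b, l1 * fold[i], l2, n_per_group, rng)

    # Drop isoforms with zero expression across all samples
    keep = (control.sum(axis=1) + treated.sum(axis=1)) > 0
    return control[keep], treated[keep], is_de[keep]
```

Only λ<sub>1</sub> changes between groups, so a DE isoform keeps the same shape parameters and differs from its control counterpart only in scale.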

### DE Analysis Methods

The nine DE methods included in this study are categorized into four groups based on different statistical models. These nine methods are selected to cover most statistical models used in recent DE analysis. Other DE methods that are not included in this study use approaches similar to those of the nine selected methods. For instance, D3E (Delmans and Hemberg, 2016) utilizes a beta-Poisson model, which is similar to BPSC; SCDE (Kharchenko et al., 2014) models the gene expression values using a mixture of a negative-binomial distribution for amplification components and a Poisson distribution for dropout events, which is similar to DEsingle; Ballgown (Frazee et al., 2015) is based on a linear modeling strategy, which is similar to limma. In this section, we give a brief summary of these nine methods. For more details of the software packages and statistical models, the reader is referred to the original publications and related software websites. When applying these tools, we follow the standard procedures and parameter settings suggested in the software manuals.

#### Negative-Binomial–Based Methods

The read counts of an isoform from technical replicates (repeated sequencing runs of the same sample) are usually modeled to follow a Poisson law (Marioni et al., 2008). However, across biological replicates the Poisson mean is usually assumed to follow a gamma distribution to accommodate the overdispersion observed in empirical data (Chen et al., 2014). Since the negative binomial (NB) model can be derived as a gamma-Poisson mixture model, several DE methods based on the NB distribution assumption have been developed to accommodate the overdispersion among biological replicates. Note, however, that these theoretical motivations come from bulk-cell RNA-seq data. Two popular methods in this class are edgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014). The setup is then to assume the expression read counts y<sub>ij</sub> ~ NB(μ<sub>ij</sub>, ϕ<sub>i</sub>), where μ<sub>ij</sub> is the mean and ϕ<sub>i</sub> is the dispersion parameter for isoform i and sample j. Reliable estimation of the dispersion parameter ϕ<sub>i</sub> for each isoform is crucial for detecting DE isoforms. Differences in the estimation of ϕ<sub>i</sub> explain the main differences between edgeR and DESeq2.
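The gamma-Poisson construction of the NB model can be illustrated numerically. The sketch below assumes the common mean-dispersion parameterisation in which the variance is μ + ϕμ²; the function name and the example values of `mu` and `phi` are arbitrary choices for illustration.

```python
import numpy as np

def sample_gamma_poisson(mu, phi, size, rng):
    """Poisson counts whose rate is gamma-distributed with mean mu and
    squared coefficient of variation phi; marginally NB with dispersion phi."""
    shape = 1.0 / phi          # gamma shape
    scale = mu * phi           # gamma scale, so that shape * scale = mu
    rates = rng.gamma(shape, scale, size=size)   # biological variation of the true mean
    return rng.poisson(rates)                    # sampling (technical) variation

rng = np.random.default_rng(1)
mu, phi = 20.0, 0.5
counts = sample_gamma_poisson(mu, phi, 200_000, rng)
print(counts.mean())   # close to mu
print(counts.var())    # close to mu + phi * mu**2, well above the Poisson variance mu
```

Setting ϕ close to zero makes the gamma distribution collapse onto μ, and the counts revert to a Poisson law, which is the technical-replicate case.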

#### edgeR

A conditional maximum likelihood (CML) is used in edgeR (Robinson et al., 2010) to estimate a common dispersion, which is assumed to be the same for all isoforms. This procedure is then developed further to allow for isoform-specific dispersion estimates, and an empirical Bayes procedure (approximated by a weighted likelihood) is used to shrink the dispersions toward the common dispersion. The amount of shrinkage is determined by the neighbourhood set that is nearest to isoform i in average log count-per-million (logCPM). For DE testing, edgeR allows the user to select among different hypothesis tests, including the quasi-likelihood F-test (edgeRQLF) for bulk-cell RNA-seq data and the likelihood ratio test (edgeRLRT) for scRNA-seq data, as suggested by the developer. However, a recent study (Soneson and Robinson, 2018) shows that edgeRQLF performs significantly better than edgeRLRT in scRNA-seq data. Therefore, in this study, we report the results of edgeRQLF for the evaluation of edgeR in DE analysis.

#### DESeq2

DESeq2 (Love et al., 2014) uses a negative-binomial model similar to that of edgeR but employs more data-driven shrinkage estimators for dispersion and fold change. DESeq2 assumes that isoforms of similar average expression levels have similar dispersion and shrinks the isoform-specific dispersion toward a fitted smooth curve by an empirical Bayes approach. To overcome the difficulty in the log fold-change (LFC) estimation for the lowly expressed isoforms, DESeq2 shrinks LFC estimates toward zero when the expression level is low. The shrinkage procedure may result in underestimates of dispersion, thereby producing conservative test statistics for the DE test. This helps reduce the FPR at the expense of lower sensitivity.

#### DEsingle

DEsingle (Miao et al., 2018) is another negative-binomial–based approach that employs the zero-inflated NB (ZINB) model to split the observed zero values into two parts: constant zeros and zeros from the NB distribution. With this model, DEsingle is designed to overcome the issue of the excessive zero values observed in scRNA-seq data. To detect DE isoforms between two groups, DEsingle first calculates the maximum likelihood estimates (MLE) of the two ZINB populations' parameters, then computes the constrained MLE of the two models' parameters under the null hypothesis (H0), and finally uses the likelihood ratio test for testing H0.
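To make the two kinds of zeros concrete, the ZINB probability mass function can be sketched as below. The notation is chosen here for illustration and is an assumption ($\theta$ for the proportion of constant zeros, $f_{NB}$ for the NB probability mass function with parameters collected in $\Theta_{NB}$); it may differ from the notation of the original DEsingle paper.

$$P(X = x \mid \theta, \Theta_{NB}) = \theta\, \mathbb{1}(x = 0) + (1 - \theta)\, f_{NB}(x \mid \Theta_{NB})$$

An observed zero therefore arises either as a constant zero, with probability $\theta$, or as a zero from the NB component, with probability $(1 - \theta)\, f_{NB}(0 \mid \Theta_{NB})$. Writing $\ell_g$ for the maximised log-likelihood of group $g$ and $\ell_{H_0}$ for the maximised log-likelihood under the null constraint, the likelihood ratio statistic takes the usual form

$$\Lambda = 2\left[\ell_1 + \ell_2 - \ell_{H_0}\right],$$

which is compared against a $\chi^2$ reference distribution whose degrees of freedom equal the number of parameters constrained under H0.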

#### Beta-Poisson–Based Methods

#### BPSC

BPSC (Vu et al., 2016) is an analytical procedure based on the beta-Poisson mixture model, which is designed to capture the property of scRNA-seq data. The model is integrated into the generalized linear model (GLM) framework for DE analysis. The sophisticated four-parameter beta-Poisson model is as shown in Eq. (1). The iterative weighted least-squares (IWLS) algorithm is used to estimate the model parameters.

#### Normal-Based Methods

#### Limma

The limma method (Law et al., 2014) is based on linear modelling. It was originally designed for gene expression microarray data but has since been extended to RNA-seq data. In this study, we use limmatrend (Law et al., 2014), a version of limma where the empirical Bayes procedure is modified to incorporate a mean-variance trend for DE analysis. In a recent study of DE analysis of scRNA-seq data (Soneson and Robinson, 2018), limmatrend performs best among the versions of limma, such as voomlimma.

#### Monocle

Monocle (Qiu et al., 2017) is a tool originally designed for scRNA-seq data for identifying DE genes that vary across different cell types or across a so-called "pseudo-time." The mean expression level of each isoform is modeled by generalized additive models (GAMs) which relate one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m),$$

where Y is a response variable, and the x<sub>i</sub>'s are predictor variables. The function g is a link function, typically the log function, and the f<sub>i</sub>'s are nonparametric functions, such as cubic splines or other smoothing functions. Gene expression level across cells is modeled by a Tobit model; with some approximations, monocle's GAM is thus

$$E(Y) = s(\psi_t(b_x, s_i)) + \epsilon,$$

where ψ<sub>t</sub>(b<sub>x</sub>, s<sub>i</sub>) is the assigned pseudo-time of a cell and s is a cubic smoothing function with (by default) three effective degrees of freedom. ϵ is the error term, which is normally distributed with a mean of zero. The DE test is performed with a χ<sup>2</sup>-approximation of the likelihood ratio test.

#### MAST

MAST (Finak et al., 2015) uses a hurdle model tailored to scRNA-seq data. It is a two-part GLM that simultaneously models the gene expression rate (how many cells express the gene) by logistic regression and the expression level by a Gaussian distribution. The DE testing is then done using the likelihood ratio test.
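The two-part idea can be illustrated with a deliberately simplified two-group likelihood ratio test. This sketch is not the MAST implementation: it has no covariates and none of the refinements of the actual package, the function names are hypothetical, and the `log2(x + 1)` transform of the expressed values is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    p = k / n
    ll = 0.0
    if k > 0:
        ll += k * np.log(p)
    if k < n:
        ll += (n - k) * np.log(1.0 - p)
    return ll

def hurdle_lrt(x1, x2):
    """Simplified hurdle-model LRT for one gene measured in two groups of cells."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    d1, d2 = x1 > 0, x2 > 0

    # Discrete part: does the expression rate differ between groups?
    k1, k2 = int(d1.sum()), int(d2.sum())
    ll_alt = _binom_loglik(k1, x1.size) + _binom_loglik(k2, x2.size)
    ll_null = _binom_loglik(k1 + k2, x1.size + x2.size)
    stat, df = 2.0 * (ll_alt - ll_null), 1

    # Continuous part: Gaussian model for the expressed cells only
    y1, y2 = np.log2(x1[d1] + 1.0), np.log2(x2[d2] + 1.0)
    if y1.size >= 2 and y2.size >= 2:
        pooled = np.concatenate([y1, y2])
        rss_null = ((pooled - pooled.mean()) ** 2).sum()
        rss_alt = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
        if rss_alt > 0:
            # 2 * (loglik_alt - loglik_null) for a Gaussian with common variance
            stat += pooled.size * np.log(rss_null / rss_alt)
            df = 2

    stat = max(stat, 0.0)   # guard against tiny negative rounding errors
    return stat, float(chi2.sf(stat, df))
```

The two statistics are added because the discrete and continuous parts are fitted on separate aspects of the data, so a gene can be called DE for a change in detection rate, a change in expression level, or both.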

#### T-Test

T-test (Welch, 1947) is a general comparison method that is used to compare the means of two groups. One of the most common assumptions made when doing a t-test is the normality of data distribution. Empirically, scRNA-seq data are highly skewed, but the t-test is known to have a certain robustness against skewness, so it is still worth comparing against other sophisticated methods.

#### Nonparametric Methods

#### Wilcoxon Rank Sum Test

Wilcoxon rank sum test (Hollander et al., 2013) (also known as Mann-Whitney test) is a nonparametric test that is used to determine whether the two independent samples come from the same distribution. The main idea of the test is to compare the sum of the ranks for the observations which come from different samples.
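Both general-purpose tests are available in SciPy and can be applied isoform by isoform. The sketch below is an illustration rather than the pipeline used in the study: the function name is hypothetical, and applying the t-test to `log2(x + 1)` values is an assumption.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

def two_group_pvalues(control, treated):
    """Per-isoform p-values; rows are isoforms, columns are cells."""
    control = np.asarray(control, dtype=float)
    treated = np.asarray(treated, dtype=float)

    # Welch's t-test (unequal variances) on log-transformed values
    log_c, log_t = np.log2(control + 1.0), np.log2(treated + 1.0)
    t_p = ttest_ind(log_c, log_t, axis=1, equal_var=False).pvalue

    # Wilcoxon rank sum (Mann-Whitney) test; rank-based, so no transform is needed
    w_p = np.array([
        mannwhitneyu(c, t, alternative="two-sided").pvalue
        for c, t in zip(control, treated)
    ])
    return t_p, w_p
```

Isoforms whose values are identical in every cell carry no information for either test and are best filtered out before calling the function.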

## DISCUSSION

We have performed a systematic comparison of nine different statistical methods for DE analysis of scRNA-seq data. To get realistic distributional characteristics, three real scRNA-seq data sets are used as the basis for generating the data. A beta-Poisson model–based simulated data set is also generated to assess the performance of each method. The nine methods are evaluated by the type-I error control, the ROC curve and the RDR under both null and alternative hypotheses. Our results show that lowly expressed isoforms are generally the source of strong differences between methods. Most methods except monocle have good RDR performances for highly expressed isoforms.

edgeR and monocle tend to produce extremely small p-values for lowly expressed isoforms, leading to many false positives. Notably, these two methods perform very poorly compared to the other methods for top DE isoforms. These results are consistent with other recent studies (Dal Molin et al., 2017; Soneson and Robinson, 2018). DESeq2, a bulk-cell–based method with a shrinkage procedure, works rather well over all isoforms on both the real scRNA-seq data and the simulated data. However, DESeq2 is highly conservative for lowly expressed isoforms, so its sensitivity is always lower than that of the other methods for all three real data sets. The performance of BPSC is comparable to that of DESeq2 in all analyses but less conservative. Other methods, including limmatrend, t-test, Wilcoxon, MAST, and DEsingle, perform reasonably in both real and simulated data sets.

To validate our results, we analyzed two extra public real scRNA-seq data sets: one data set with 164 single cells from the H7 human cell line generated by the SMARTer C1 protocol, and another, bigger data set containing 2,027 intestinal single cells of mouse from the CEL-Seq protocol. The results in Figures S6–S8 show the consistency of the comparison analyses for different types of scRNA-seq data for the new small data set. For the new big data set, however, monocle and DESeq2 show particularly low sensitivity for lowly expressed isoforms (Figures S6D–S8D). Details of these data and results are given in the Supplementary Material.

We also investigated further the performances of the DE methods for the group of lowly expressed isoforms. We first checked the relationship between the performance of the Wilcoxon test, one of the most stable DE methods, and the signal strength at log fold-change (LFC) values of 1, 2, 3, and 4 using the simulated data set. Results in Figure S4 show that the RDR of Wilcoxon is a function of signal strength, achieving a higher RDR for data with a higher LFC. The low signal in the simulated data in Figure 3D made the differences in true RDR between methods inconspicuous. We therefore generated another simulated data set using the same procedure described in 3.1 but with a high signal strength, LFC = ±4, then applied all nine methods to the simulated lowly expressed genes. The results (Figure S5) confirmed that for the lowly expressed isoforms, DESeq2 is too conservative and consequently loses sensitivity compared to the other methods.

The nine methods compared in this study are selected to cover most statistical models used in recent DE analysis. Although some DE methods are not included in this study, they use approaches similar to those we included. For instance, D3E (Delmans and Hemberg, 2016) utilizes a beta-Poisson model, which is similar to BPSC; SCDE (Kharchenko et al., 2014) models the gene expression values using a mixture of an NB distribution for amplification components and a Poisson distribution for dropout events, which is similar to DEsingle; Ballgown (Frazee et al., 2015) is based on a linear modeling strategy, which is similar to limma.

The main strengths of our comparison include (i) the use of three real scRNA-seq data sets in order to capture the true distributional characteristics and the diversity of single-cell data; (ii) the use of the RDR metric for top-rank genes, which is consistent with the data analysis process of identifying a list of interesting genes (in some cases we show that considering the full collection of genes leads to misleading comparisons); and (iii) separate results for highly and lowly expressed genes, as these two groups have distinct distributions and the methods vary more in their performances for lowly expressed genes. In summary, performances of DE methods do vary, so we need to pay attention in choosing the method to use, and, at least for highly expressed genes, some methods designed for bulk-cell RNA-seq analysis do not necessarily perform worse than those specifically designed for scRNA-seq data. Finally, as shown in the figures, the number of lowly expressed genes is not trivial, so our results also highlight the need for further development of methods to deal with these genes.

#### CONCLUSION

There are large differences in the performance of methods for detecting DE in single-cell RNA-seq data. This is driven partly by the expression level of genes. For highly expressed genes, many bulk-cell–based DE methods perform well against single-cell–based methods. For lowly expressed genes, however, the performance of the methods varies, so a careful check of the gene expression level should be made before choosing a DE method, to ensure that the chosen method is appropriate for the data. We found edgeR and monocle to have poor control of false positives on lowly expressed genes, so we do not recommend these two methods for such genes. DESeq2 tends to be too conservative, so it sacrifices sensitivity for higher specificity. According to the simulation results, BPSC performs well against the other methods, particularly when there is a sufficient number of cells. RDR for top-rank genes is a useful metric for assessing performance of DE methods, sometimes giving different results compared to analysis of the full set of genes. We suggest that the expression level of genes be considered when choosing a DE method, because the performance of DE methods in scRNA-seq data strongly depends on it.

### REFERENCES


#### DATA AVAILABILITY STATEMENT

The raw data of two data sets mESC (GSE60749-GPL13112) and NPCs (GSE102934) are published by the original studies and publicly available from NCBI Gene Expression Omnibus repository. The gene expression data of these data sets, MDA-MB-231 data set, simulated data sets and the two supplementary data sets can be found at: https://github.com/Tianmou/scRNAseq-DE-comparison.

#### AUTHOR CONTRIBUTIONS

YP and TM designed the study. TM, TV, and YP performed the analysis and wrote the manuscript. WD performed the acquisition of MDA-MB-231 data. FG performed a part of simulation studies. All authors read and approved the final manuscript.

#### FUNDING

This work was partially supported by funding from the Swedish Cancer Fonden, the Swedish Research Council (VR) and the Swedish Foundation for Strategic Research (SSF).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01331/full#supplementary-material


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Mou, Deng, Gu, Pawitan and Vu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# McImpute: Matrix Completion Based Imputation for Single Cell RNA-seq Data

#### Aanchal Mongia<sup>1</sup> , Debarka Sengupta1,2 \* and Angshul Majumdar <sup>3</sup>

*<sup>1</sup> Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India, <sup>2</sup> Center for Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India, <sup>3</sup> Department of Electronics and Communications Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India*

Motivation: Single-cell RNA sequencing has proved revolutionary for its potential to zoom into complex biological systems. Genome-wide expression analysis at single-cell resolution provides a window into the dynamics of cellular phenotypes. This facilitates the characterization of transcriptional heterogeneity in normal and diseased tissues under various conditions. It also sheds light on the development or emergence of specific cell populations and phenotypes. However, owing to the paucity of input RNA, typical single-cell RNA sequencing data feature a high number of dropout events, where transcripts fail to get amplified.

#### Edited by:

*Indrajit Saha, National Institute of Technical Teachers' Training and Research, India*

#### Reviewed by:

*Kumardeep Chaudhary, Icahn School of Medicine at Mount Sinai, United States Sumit Kumar Bag, National Botanical Research Institute (CSIR), India Yuriy L. Orlov, Russian Academy of Sciences, Russia Shaoli Das, National Institutes of Health (NIH), United States*

> \*Correspondence: *Debarka Sengupta debarka@iiitd.ac.in*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *08 August 2018* Accepted: *10 January 2019* Published: *29 January 2019*

#### Citation:

*Mongia A, Sengupta D and Majumdar A (2019) McImpute: Matrix Completion Based Imputation for Single Cell RNA-seq Data. Front. Genet. 10:9. doi: 10.3389/fgene.2019.00009*

Results: We introduce mcImpute, a low-rank matrix completion based technique to impute dropouts in single cell expression data. On a number of real datasets, application of mcImpute yields significant improvements in the separation of true zeros from dropouts, cell-clustering, differential expression analysis, cell type separability, the performance of dimensionality reduction techniques for cell visualization, and gene distribution.

Availability and Implementation: https://github.com/aanchalMongia/McImpute_scRNAseq

Keywords: scRNA-seq, dropouts, imputation, matrix completion, nuclear norm minimization

## 1. BACKGROUND AND INTRODUCTION

In contrast to traditional bulk population-based expression studies, single-cell transcriptomics provides more precise insights into the functioning of individual cells. Over the past few years, this powerful tool has brought transformative changes to the conduct of functional biology (Wagner et al., 2016). With single-cell RNA sequencing (scRNA-seq) we are now able to discover subtypes within seemingly similar cells. This is particularly advantageous for characterizing cancer heterogeneity (Patel et al., 2014; Tirosh et al., 2016), identifying new rare cell types, and understanding the dynamics of transcriptional changes during development (Tang et al., 2010; Yan et al., 2013; Biase et al., 2014).

Despite these advantages, scRNA-seq technologies suffer from a number of sources of technical noise. The most important of these is insufficient input RNA. Due to the small quantities, transcripts are frequently missed during the reverse transcription step. As a direct consequence, these transcripts are not detected during the sequencing step (Kharchenko et al., 2014). Often the lowly expressed genes are the worst hit. Excluding these genes from the analysis may not be the best solution, as many transcription factors and cell surface markers are sacrificed in this process (van Dijk et al., 2017). In addition, variability in dropout rate across individual cells or cell types acts as a confounding factor for a number of downstream analyses (Sengupta et al., 2016; Li et al., 2017). Hicks et al. (2015) showed, on a number of scRNA-seq datasets, that the first principal components highly correlate with the proportion of dropouts across individual transcriptomes. In summary, there is a standing need for efficient methods to impute scRNA-seq datasets.

Very recently, efforts have been made to devise imputation techniques for scRNA-seq data (**Table S6**). Most notable among these are MAGIC (van Dijk et al., 2017), scImpute (Li and Li, 2018), and drImpute (Kwak et al., 2017). MAGIC uses a neighborhood-based heuristic to infer the missing values based on the idea of heat diffusion, altering all gene expression levels including the ones not affected by dropouts. On the other hand, scImpute first estimates which values are affected by dropouts based on a Gamma-Normal mixture model and then fills the dropout values in a cell by borrowing information on the same gene from other similar cells, which are selected based on the genes unlikely to be affected by dropout events. The overall performance of scImpute has been shown to be superior to that of MAGIC. Parametric modeling of single-cell expression is challenging due to our lack of knowledge about possible sources of technical noise and biases (Sengupta et al., 2016). Moreover, there is a clear lack of consensus about the choice of the probability density function. Another method, drImpute, repeatedly identifies similar cells based on clustering and performs imputation multiple times by averaging the expression values from similar cells, followed by averaging the multiple estimates for the final imputation. We propose mcImpute (**Figure 1**), an imputation algorithm for scRNA-seq data which models gene expression as a low-rank matrix and fills in values in place of dropouts in the process of recovering the full gene expression data from sparse single-cell data. This is done by applying soft-thresholding iteratively on the singular values of the scRNA-seq data. One of the salient features of mcImpute is that it does not assume any distribution for gene expression.
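The core idea of iterative soft-thresholding of singular values can be sketched generically as below. This is not the mcImpute implementation: the function names, the threshold `tau`, the stopping rule, and the choice of which entries count as observed (for example, treating non-zero entries as observed) are assumptions made for illustration, and any normalization or log-transformation of the expression matrix is omitted.

```python
import numpy as np

def soft_threshold_svd(mat, tau):
    """Shrink every singular value of mat by tau, clipping at zero."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (u * s) @ vt

def complete_low_rank(observed, mask, tau=1.0, n_iter=300, tol=1e-6):
    """Low-rank matrix completion by iterative singular value soft-thresholding.

    observed : matrix with arbitrary values where mask is False
    mask     : boolean matrix, True where an entry is treated as observed
    """
    observed = np.asarray(observed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    x = np.zeros_like(observed)
    for _ in range(n_iter):
        # Keep the data at observed entries, the current estimate elsewhere
        filled = np.where(mask, observed, x)
        x_new = soft_threshold_svd(filled, tau)
        converged = np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1.0)
        x = x_new
        if converged:
            break
    return x
```

Shrinking the singular values penalises the nuclear norm of the estimate, which acts as a convex surrogate for low rank; no distribution for the entries is assumed anywhere in the procedure.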

We first evaluate the performance of mcImpute in separating "true zero" counts from dropouts on single-cell data of myoblasts (Trapnell et al., 2014), hereafter the Trapnell dataset. On the same dataset, we assess the impact of imputation on differential genes prediction. We further investigate mcImpute's ability to recover artificially planted missing values in a single cell expression matrix of mouse neurons (Usoskin et al., 2015). Accurate imputation should enhance cell type identity, i.e., the transcriptomic similarity between cells of identical type. We, therefore, quantify cell type separability as a metric and assess its improvement. In addition to these, we also test the impact of imputation on cell clustering. Four independent datasets, Zeisel (Zeisel et al., 2015), Jurkat-293T (Zheng et al., 2017), Preimplantation (Yan et al., 2013) and Usoskin (Usoskin et al., 2015), for which cell type annotations are available, and another dataset, Trapnell et al. (2014), for which bulk RNA-seq data has been provided (required for validation of differential genes prediction and separation of "true zeros" from dropouts), are used for this purpose. McImpute clearly serves as a crucial tool in the scRNA-seq pipeline by significantly improving all the above-mentioned metrics and outperforming the state-of-the-art imputation methods in the majority of experimental conditions.

With the advent of droplet-based, high-throughput technologies (Macosko et al., 2015; Zheng et al., 2017), library depth is being compromised to curb the sequencing cost. As a result, scRNA-seq datasets are being produced with an extremely high number of dropouts. We believe that mcImpute, given its performance, will provide an adequate solution for the dropout problem.

## 2. RESULTS

We performed computational experiments to evaluate the efficacy of our proposed imputation technique comparing mcImpute with a number of existing imputation methods for single cell RNA data: scImpute, drImpute, and MAGIC.

## 2.1. Dropouts vs. True Zeros

The inflated number of zero counts in scRNA-seq data could either be biologically driven or due to a lack of measurement sensitivity in sequencing. A transcript that is not detected because it failed to get amplified essentially corresponds to a "false zero" in the finally observed count data and needs to be imputed. A reasonable imputation strategy with this discriminating property should keep the "true zero" counts (where the genes are truly unexpressed and have no transcripts from the beginning) untouched, while at the same time attempting to recover the dropouts.

The goodness of an imputation strategy can be assessed by two criteria. First, whether the imputation method leaves the true zero counts in the expression data as they are; second, whether it fills in the dropouts with biologically meaningful expression counts, which would show as an increasing difference between the fraction of zero counts in the unimputed and the imputed data as expression increases.

We investigate the performance of mcImpute in distinguishing "true zero" counts from dropouts on the Trapnell data (Trapnell et al., 2014), for which the bulk counterpart was available, so that we could pull out low-to-medium expression genes from the corresponding bulk data for validation. Of note, to differentiate between the "true" and "false" zeros, we used the matched bulk expression profiles, since bulk RNA-seq data are known to have few or no dropout events because the corresponding experiments involve millions of cells. The fraction of zero counts was observed for genes with expression ranging from zero to 500 for unimputed and imputed gene-expression data. It should be noted that an imputed count value ranging from 0 to 0.5 is taken as an imputed zero, allowing minor flexibility to all imputation techniques.
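The zero-fraction comparison can be sketched as follows. The function name, the genes-by-cells orientation, and the bin edges are hypothetical; only the 0.5 cutoff for an imputed zero comes from the text.

```python
import numpy as np

def zero_fraction_by_bulk_bin(sc_matrix, bulk_expr, bin_edges, zero_cutoff=0.5):
    """Fraction of (imputed) zeros among genes grouped by bulk expression.

    sc_matrix : genes x cells single-cell matrix (unimputed or imputed)
    bulk_expr : per-gene expression in the matched bulk data
    bin_edges : increasing edges, e.g. [0, 100, 200, 300, 400, 500]
    """
    sc = np.asarray(sc_matrix, dtype=float)
    bulk = np.asarray(bulk_expr, dtype=float)
    is_zero = sc <= zero_cutoff          # values in [0, 0.5] count as zero
    fractions = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        genes = (bulk >= lo) & (bulk < hi)
        fractions.append(is_zero[genes].mean() if genes.any() else np.nan)
    return np.array(fractions)
```

Calling the function once on the unimputed matrix and once on each imputed matrix gives curves that should coincide at the lowest bulk expression (true zeros preserved) and separate as bulk expression grows (dropouts recovered).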

Given the nature of this analysis, gene filtering in single cell expressions has been skipped. DrImpute could not be taken into account since we could not programmatically mute the gene filtering step in its pipeline.

We observe (**Figure 2A**, **Table S1**) that for low-expression genes, all imputation strategies successfully preserve the "true zeros". As gene expression increases, the unimputed matrix still exhibits a large fraction of zeros, which essentially correspond to dropouts, and only mcImpute and scImpute are able to reduce the fraction of zeros, thus recovering the dropouts. MAGIC, although it successfully preserves the "true zeros", fails to recover most of the dropouts in the expression data.

#### 2.2. Improvement in Clustering Accuracy

A correct interpretation of single-cell expression data is contingent on the accurate delineation of cell types. The high level of dropouts in scRNA-seq data often introduces batch effects, which mislead the clustering algorithm. A reasonable imputation strategy should fix these issues to a great extent. In a controlled setting, we therefore examined whether the proposed method improved clustering outcomes. For this, we ran k-means on the first two principal components of the log-transformed expression profiles in each dataset (**Figure S5**). Since the prediction of this clustering algorithm changes with the choice of initial centroids, which are chosen at random, we analyzed the results over 100 runs of k-means to obtain robust results. We set K to the number of annotated cell types in each dataset. The Adjusted Rand Index (ARI) was used to measure the correspondence between the clusters and the prior annotations.

McImpute-based re-estimation best separates the four groups of mouse neural single cells from the Usoskin dataset and the brain cells from the Zeisel dataset, and shows comparable improvement on the other datasets too (**Figures 2B–E**, **Table S2**). The striking difference between Jurkat and 293T cells made them trivially separable through clustering, leading to the same ARI across all 100 runs. Still, mcImpute maintained the ARI better than the other imputation methods.
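For readers unfamiliar with the ARI, the sketch below computes it from the contingency table of two labelings using the standard Hubert and Arabie formula. The function name and the toy labels are illustrative assumptions; in the experiment described above, such a score would be computed once per k-means run and the 100 scores summarized per dataset.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Map arbitrary labels to 0..k-1 and build the contingency table.
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=int)
    np.add.at(table, (t, p), 1)
    # Count pairs of cells that fall together in a table cell, row, or column.
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    total_pairs = comb(len(labels_true), 2)
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Relabelled but identical partitions score 1; a crossed partition scores below 0.
ari_same = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, a random assignment scores close to zero regardless of the number of clusters.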

## 2.3. Matrix Recovery

In this set of experiments, we study the choice of matrix completion algorithm – matrix factorization (MF) or nuclear norm minimization (NNM). Both the algorithms have been explained in section Materials and Methods.

dropouts: plot showing fraction of zero counts (values between 0 and 0.5) in single cell expression matrix against the median bulk expression. The genes are divided into 10 bins based on median bulk gene expression (first bin corresponds to zero-expression genes). (B–E) Boxplots showing the distribution of ARI calculated on 100 runs of k-means clustering algorithm on first two principal components of single cell expression matrix for datasets (B) Jurkat-293T, (C) Preimplantation, (D) Usoskin, and (E) Zeisel.

The experiments are carried out on the processed Usoskin dataset (Usoskin et al., 2015). We artificially removed some counts at random (sub-sampling) in the data to mimic dropout cases and used our algorithms (MF and NNM) to impute the missing values. **Figures 3A–C** and **Table S3** show the variation of Normalized Mean Squared Error (NMSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) to compare our two methods for different sub-sampling ratios. This is the standard procedure to compare matrix completion algorithms (Keshavan et al., 2010; Marjanovic and Solo, 2012).
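The three error measures can be written down in a few lines. The sketch below assumes one common convention for NMSE (squared error divided by the squared Frobenius norm of the reference matrix); the article does not spell out its normalization, so this choice, along with the function name and toy matrices, is an assumption.

```python
import numpy as np

def recovery_errors(x_true, x_hat):
    """Return (NMSE, RMSE, MAE) between a reference matrix and its estimate."""
    x_true = np.asarray(x_true, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    diff = x_true - x_hat
    nmse = np.sum(diff ** 2) / np.sum(x_true ** 2)  # assumed normalization
    rmse = np.sqrt(np.mean(diff ** 2))
    mae = np.mean(np.abs(diff))
    return nmse, rmse, mae

# Toy example: one of four entries is off by 2.
x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 2.0], [3.0, 2.0]])
nmse, rmse, mae = recovery_errors(x_true, x_hat)
```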

We show the results for the Usoskin dataset; we carried out the same analysis on the other datasets and the conclusion remained the same. We find that nuclear norm minimization (NNM) performs slightly better than matrix factorization (MF), so we used NNM as the workhorse algorithm behind mcImpute.

### 2.4. Improved Differential Genes Prediction

Optimal imputation of expression data should improve the accuracy of differential expression (DE) analysis. It is a standard practice to benchmark DE calls made on scRNA-Seq data against calls made on their matching bulk counterparts (Kharchenko et al., 2014). To this end, we used a dataset of myoblasts, for which matching bulk RNA-Seq data were also available (Trapnell et al., 2014). For simplicity, this dataset has been referred to as the Trapnell dataset. DE and non-DE genes were identified using edgeR (Zhou et al., 2014) package in R.

We used the standard Wilcoxon Rank-Sum test for identifying differentially expressed genes from matrices imputed by various methods. Congruence between bulk and single cell-based DE calls was summarized using the Area Under the Curve (AUC) values obtained from the Receiver Operating Characteristic (ROC) curves (**Figure 3D**). Among all the methods, mcImpute performed best with an AUC of 0.85.
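As a rough illustration of how such an AUC can be obtained, the sketch below uses the rank interpretation of the ROC area: the probability that a randomly chosen bulk-defined DE gene receives a higher single-cell score than a randomly chosen non-DE gene. The function name and the toy scores are assumptions; in practice the scores might be, for example, negative log p-values from the rank-sum test.

```python
import numpy as np

def roc_auc(scores, is_de):
    """Area under the ROC curve via pairwise comparison of scores.

    scores : per-gene evidence for differential expression (higher = stronger)
    is_de  : per-gene ground-truth label taken from the bulk analysis
    """
    scores = np.asarray(scores, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    pos, neg = scores[is_de], scores[~is_de]
    # Compare every DE gene with every non-DE gene; ties count one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical scores for four genes, two of which are DE in bulk.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

The pairwise comparison is quadratic in the number of genes, which is acceptable for a sketch but would normally be replaced by a sort-based computation.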

For each method, the AUC value was computed on the identical set of ground truth genes. We had to make an exception only for drImpute as it applies the filter to prune genes in its pipeline. Hence AUC value for drImpute was computed based on a smaller set of ground truth genes.

### 2.5. Improvement in Cell Type Separability

and NF cells from Usoskin dataset; and (H) S1pyramidal and Ependymal from Zeisel dataset. Refer to Table S4 for absolute values.

Downstream analysis becomes much easier if expression similarities between cells of identical type are considerably higher than that of cells coming from different subpopulations. To this end, we define the cell-type separability score as follows:

For any two cell groups, we first find the median of Spearman correlation values computed for each possible pair of cells within their respective groups. We call the average of the median correlation values the intra-cell type scatter. On the other hand, inter-cell type scatter is defined as the median of Spearman correlation values computed for pairs such that in each pair, cells belong to two different groups. The difference between the intra-cell type scatter and the inter-cell type scatter is termed the cell-type separability (CTS) score. We computed CTS scores for two sample cell-type pairs from each dataset. In more than 80% (13 out of 16) of test cases, mcImpute yielded significantly better CTS scores (**Figures 3E–H**, **Table S4**).
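The verbal definition above translates fairly directly into code. The following is a sketch under stated assumptions: the helper names and the toy profiles are hypothetical, and Spearman correlation is computed as the Pearson correlation of tie-averaged ranks.

```python
import numpy as np

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    _, inv, counts = np.unique(v, return_inverse=True, return_counts=True)
    return np.bincount(inv, weights=ranks)[inv] / counts[inv]

def cts_score(expr, groups, g1, g2):
    """Cell-type separability between groups g1 and g2 (expr is cells x genes)."""
    ranks = np.vstack([average_ranks(row) for row in expr])
    rho = np.corrcoef(ranks)  # Spearman = Pearson correlation of the ranks
    groups = np.asarray(groups)
    a = np.where(groups == g1)[0]
    b = np.where(groups == g2)[0]

    def within(idx):
        sub = rho[np.ix_(idx, idx)]
        return np.median(sub[np.triu_indices(len(idx), k=1)])

    intra = 0.5 * (within(a) + within(b))   # average of the two within-group medians
    inter = np.median(rho[np.ix_(a, b)])    # median over cross-group pairs
    return intra - inter

# Toy profiles: group A rises across genes, group B falls.
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 3.0, 4.0, 5.0],
                 [4.0, 3.0, 2.0, 1.0],
                 [5.0, 4.0, 3.0, 2.0]])
cts = cts_score(expr, ["A", "A", "B", "B"], "A", "B")
```

In this extreme toy case the within-group correlations are 1 and the cross-group correlations are −1, so the score reaches its maximum of 2.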

## 2.6. Cell Visualization

Representing scRNA-seq data visually would involve reducing the gene-expression matrix to a lower dimensional space and then plotting each cell transcriptome in that reduced two or three-dimensional space. Two well-known techniques for dimensionality reduction are PCA and t-SNE (Holland, 2008; Maaten and Hinton, 2008). It has been shown that t-Distributed Stochastic Neighbor Embedding (t-SNE) is particularly well suited and effective for the visualization of high-dimensional datasets (Liu et al., 2017). So, we use t-SNE (**Figures 4**, **5**) on Usoskin and Zeisel expression matrices to explore the performance of dimensionality reduction, both without and with imputation. The cells are visualized in 2-dimensional space, coloring each subpopulation by its annotated group, both before and after imputation. To quantify the groupings of cell transcriptomes, we use an unsupervised clustering quality metric, silhouette index. The average silhouette values for each method have been shown in the plot titles (**Figures 4**, **5** and **Figures S3**, **S4**).

The t-SNE analysis shows that, compared with the other methods, mcImpute groups the cells of all four mouse neural cell types from the Usoskin dataset most tightly, and that it also performs fairly well on the Zeisel dataset, competing with drImpute.
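The average silhouette value reported in the plot titles can be computed from any two-dimensional embedding and the annotated labels. The sketch below uses the standard silhouette definition with Euclidean distances; the function name and the toy coordinates are assumptions and are not taken from the article's data.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Average silhouette value of labelled points (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = dist[i, same].mean()  # mean distance to the point's own cluster
        b = min(dist[i, labels == other].mean()  # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well separated pairs of points, scored with matching and crossed labels.
embedding = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(embedding, [0, 0, 1, 1])
bad = mean_silhouette(embedding, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well separated groups, whereas negative values indicate that cells sit closer to another group than to their own.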

FIGURE 4 | Plot showing t-SNE visualization and average silhouette values for Usoskin dataset before and after imputation. McImpute improves the visual distinguishability the most for all groups of mouse neural single cells amongst all imputation strategies. The neuronal types were defined as neurofilament containing (NF), non-peptidergic nociceptors (NP), peptidergic nociceptors (PEP), and tyrosine hydroxylase containing (TH).

## 2.7. Improvement in Distribution of Genes

It has been shown that for single-cell gene expression data, in the ideal condition all genes should obey CV = mean<sup>−1/2</sup> (Klein et al., 2015) (CV: coefficient of variation), following a Poisson distribution as depicted by the green diagonal line (**Figures 6**, **7**). This is because individual transcripts are sampled from a pool of available transcripts for CEL-Seq. This accounts for the technical noise component, which obeys Poissonian statistics (Grün et al., 2014), and thus the CV is inversely proportional to the square root of the mean. Since this result has only been shown for single-cell data with transcript numbers, this experiment has not been analyzed for Jurkat-293T and Zeisel datasets for which the individual RNA molecules were counted using unique molecular identifiers (UMIs).

We model CV as a function of mean expression for all genes to analyze how various imputation methods affect the relationship between them. The results (**Figures 6**, **7**) show that both mcImpute and drImpute succeed in restoring the relationship between CV and mean to a great extent (making the dependency of the CV on the mean expression level more consistent with Poissonian sampling noise), while the others do not.
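The Poisson relationship can be checked numerically. The simulation below is a sketch with assumed parameters (four hypothetical genes, 20,000 simulated cells, a fixed random seed): it draws Poisson counts, computes the per-gene CV, and fits a line in log-log space, where Poissonian sampling noise predicts a slope of −1/2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genes with different true mean expression levels.
true_means = np.array([1.0, 4.0, 16.0, 64.0])
counts = rng.poisson(lam=true_means, size=(20000, true_means.size))  # cells x genes

gene_mean = counts.mean(axis=0)
gene_cv = counts.std(axis=0) / gene_mean

# For a Poisson variable the variance equals the mean, so CV = mean ** -0.5
# and log(CV) is linear in log(mean) with slope -1/2.
slope, intercept = np.polyfit(np.log(gene_mean), np.log(gene_cv), 1)
print(gene_cv, gene_mean ** -0.5, slope)
```

An imputation method that distorts this relationship would move genes away from the diagonal in a plot of CV against mean on logarithmic axes.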

## 3. DISCUSSION

Single-cell RNA seq technologies have opened up numerous possibilities for analysis at single-cell resolution. However, the low amount of starting RNA is a major limitation of the technology, which results in transcripts frequently being missed in the reverse transcription step (dropout events). This dropout problem makes the single-cell expression matrix highly sparse, which in turn hinders downstream analysis.

To overcome the dropout problem in single-cell data, we take motivation from various areas of applied science, including computer vision (Tomasi and Kanade, 1992), control (Mesbahi and Papavassilopoulos, 1997), and machine learning (Abernethy et al., 2006; Amit et al., 2007; Argyriou et al., 2007), where recovery of an unknown low-rank matrix from very limited information is of interest. The problem is akin to that of recommendation systems (e.g., Netflix movie recommendations and Amazon product recommendations) (Bell and Koren, 2007; Bennett and Lanning, 2007; SIGKDD, 2007), where there is a database of ratings given by users to movies or products. Since users typically rate only a small subset of items, not all the ratings are available, which makes the user-movie rating matrix sparse. The matrix is also assumed to be of low rank because there are not too many independent parameters on which users generally rate a movie. The objective is to estimate the ratings of all the users on all the movies. If new movie ratings can be predicted accurately, recommendation accuracy increases. There is a straightforward link between the Netflix problem and the dropout problem. Therefore, imputation of a single-cell expression matrix can be performed efficiently by low-rank approximation (Koren et al., 2009; Majumdar and Ward, 2011).

One could question the low-rank origin of gene expression data. It should be noted that numerous studies have suggested that genes do not work in isolation (Staiger et al., 2013), but as part of a complex regulatory network (Silver et al., 2013). This inter-dependency has been analyzed in the form of associated network structures (Xiong et al., 2005; Gill et al., 2010) and is best reflected by gene-gene correlations (Weckwerth et al., 2004; Klebanov and Yakovlev, 2007; Reynier et al., 2011; Najafov and Najafov, 2018). It is believed that such high levels of correlation are caused by the sharing of regulatory programs among different genes (Ye et al., 2013). Also, it has previously been shown that a small number of interdependent biophysical functions trigger the functioning of transcription factors, which in turn influence the expression levels of genes, resulting in a highly correlated data matrix (Kapur et al., 2016). On the other hand, cells coming from the same tissue source also lie on differential grades of the variability of a limited number of phenotypic characteristics. Therefore, it is reasonable to assume that the gene expression values lie on a low-dimensional linear subspace and that the resulting data matrix can be treated as a low-rank matrix.

We give a further justification for the low-rank assumption on gene expression in **Figure S2** by showing that most of the information in the expression data is held in its first few singular values; hence the rank of the expression matrix (the number of non-zero singular values) should be low.
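The singular-value argument can be illustrated on synthetic data. The sketch below is not the analysis behind Figure S2; it builds a hypothetical rank-3 matrix with a little added noise and reports how much of the squared Frobenius norm ("energy") the leading singular values capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 50 cells x 40 genes matrix of rank 3, plus small noise.
low_rank = rng.random((50, 3)) @ rng.random((3, 40))
noisy = low_rank + 0.01 * rng.standard_normal(low_rank.shape)

singular_values = np.linalg.svd(noisy, compute_uv=False)
energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)

# energy[k - 1] is the fraction of total energy held by the first k singular values.
print(np.round(energy[:5], 4))
```

On real expression data the decay is less abrupt, so an approximate rank is usually read off from where the cumulative energy curve flattens.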

Specifically, we used Nuclear Norm-based Matrix Completion for imputing single-cell RNA seq data. The algorithm models the single-cell gene expression as a low-rank matrix and recovers the full gene expression from partial information by iteratively thresholding the singular values of the expression matrix. The recovery process fills in appropriate expression values in place of dropouts while keeping the biologically silent expression values intact.

Apart from taking care of biologically silent genes, the proposed algorithm performs competitively with the state-of-the-art methods in improving the clustering accuracy of cells, identifying differentially expressed genes, enhancing cell type separability, improving the dimensionality reduction, etc.

Our method is particularly suitable for single-cell data since it does not assume anything about the statistical property of the expression or the dropouts and can be seamlessly incorporated into the single-cell analysis pipeline. We have also demonstrated that our method clearly distinguishes between biological and technical silencing.

The algorithm has some scope for improvement when it comes to handling scRNA-seq datasets with large sample sizes. As can be seen in **Table S5**, the running time of our algorithm is higher than that of MAGIC and drImpute, although much lower than that of scImpute.

### 4. DATA AND METHODS

#### 4.1. Dataset Description

We used five scRNA-seq datasets from four different studies for performing various experiments (**Table S7**).

• **Jurkat-293T:** This dataset contains expression profiles of Jurkat and 293T cells, mixed in vitro at equal proportions (50:50). All ∼3,300 cells of this data are annotated based on the expressions of cell-type specific markers (Zheng et al., 2017). Cells expressing CD3D are assigned Jurkat, while those expressing XIST are assigned 293T. This dataset is also available at the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/jurkat:293t\_50:50).


#### 4.2. Data Preprocessing

Steps involved in preprocessing of raw scRNA-seq data are enumerated below.


counts in each cell, and then by multiplying with the median of the total read counts across cells.


A brief overview of the complete mcImpute pipeline has been shown in **Figure 1**.
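The library-size normalization mentioned in the preprocessing steps (dividing by the total counts in each cell and multiplying by the median of the total counts across cells) could look as follows. This is only a sketch of that single step: the function name and toy counts are assumptions, and cells with zero total counts are assumed to have been filtered out beforehand.

```python
import numpy as np

def median_normalize(counts):
    """Scale every cell (row) to the median library size across cells."""
    counts = np.asarray(counts, dtype=float)       # cells x genes
    totals = counts.sum(axis=1, keepdims=True)     # total counts per cell
    return counts / totals * np.median(totals)

# Toy counts for three cells and two genes; library sizes are 2, 4 and 40.
raw = np.array([[1, 1],
                [2, 2],
                [30, 10]])
normalized = median_normalize(raw)
```

After this step every cell has the same total, so differences between cells reflect relative rather than absolute abundance.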

## 4.3. Low-Rank Matrix Completion: Definition

Our problem is to complete a partially observed gene expression matrix X where columns represent genes and rows, individual cells. The complete matrix is constituted by the known and the yet unknown values. We can assume that the single cell data that we have acquired, Y is a sampled version of the complete expression matrix X. Mathematically, this is expressed as,

$$Y = A(X) \tag{1}$$

Here A is the sub-sampling operator. It is a binary mask that has 0's where the counts of the complete expression data X have not been observed and 1's where they have been. The values of A are multiplied element-wise with the complete expression matrix X, so that Y (the sub-sampled data) is a sparse representation of X and has expression values only at positions where gene expression is observed. Our problem is to recover X, given the observations Y and the sub-sampling mask A. X is assumed to be of low rank.

It should be noted that matrix completion is a well-studied framework. In this work, we consider two algorithms for efficient imputation of scRNA-seq expression data: matrix factorization (Koren et al., 2009) and nuclear norm minimization.
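A toy instance of Equation (1) may make the role of A concrete. The matrix values below are made up for illustration, and the helper name `apply_mask` is an assumption; the point is only that A acts by element-wise multiplication and that, for a binary mask, applying it twice changes nothing.

```python
import numpy as np

def apply_mask(x, mask):
    """Sub-sampling operator A: keep observed entries, zero out the rest."""
    return mask * x

# Hypothetical complete expression matrix X (2 cells x 3 genes) and binary mask.
x = np.array([[3.0, 5.0, 2.0],
              [1.0, 4.0, 6.0]])
mask = np.array([[1, 0, 1],
                 [1, 1, 0]])

y = apply_mask(x, mask)  # observed data Y = A(X)

# A is its own adjoint and is idempotent for a binary mask, so A^T(A(X)) = A(X).
assert np.array_equal(apply_mask(y, mask), y)
```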

## 4.4. Matrix Factorization

Matrix factorization is the most straightforward way to address the low-rank matrix completion problem; it has previously been used for finding lower dimensional decompositions of matrices (Lee and Seung, 2001). Say X is of dimensions m × n, but is known to have rank r (< m, n). In that case, one can express X<sub>m×n</sub> as a product of two matrices U<sub>m×r</sub> and V<sub>r×n</sub>. Therefore the complete problem (1) can be formulated as,

$$Y = A(X) = A(UV) \tag{2}$$

Estimating U and V from (2) is tantamount to recovering X. The two matrices U and V can be found by minimizing the following cost function, the squared Frobenius norm of the residual.

$$\min_{U,V} \|Y - A(UV)\|_F^2 \tag{3}$$

Since this is a bi-linear problem, global convergence cannot be guaranteed. However, it usually works in practice. It has been used for solving recommender system problems (Koren et al., 2009), where (3) was solved using stochastic gradient descent (SGD). SGD is not an efficient technique and requires tuning of several parameters. In this work, we solve (3) using Majorization-Minimization (MM) (Sun et al., 2017). The basic MM approach and its geometrical interpretation are represented diagrammatically in **Figure S1**. It depicts the solution path for a simple scalar problem but essentially captures the MM idea.

For our given problem, the cost function to be minimized is J(X) = ||Y − A(X)||<sub>F</sub><sup>2</sup>; the majorization step decouples the problem from A, so that we can solve the optimization problem by solving

$$\min_{U,V} \|B - UV\|_F^2 \tag{4}$$

where B<sub>k+1</sub> = X<sub>k</sub> + (1/a) A<sup>T</sup>(Y − A(X<sub>k</sub>)) at each iteration k. Here, X<sub>k</sub> is the matrix at iteration k and a is a scalar parameter in the MM algorithm.

Problem (4) is solved by alternating least squares (Hastie et al., 2015), i.e., while updating U, V is held constant and while updating V, U is held constant.

$$U_k \leftarrow \arg\min_U \|B - U V_{k-1}\|_F^2 \tag{5}$$

$$V_k \leftarrow \arg\min_V \|B - U_k V\|_F^2 \tag{6}$$

Since the log-transformed input (with pseudo count added) expressions would never be negative, we have imposed a nonnegativity constraint on the recovered matrix X, so that it does not contain any negative values.

The matrix factorization algorithm has been summarized in Algorithm 1. The initialization of factor V is done by keeping r right singular vectors of X in V obtained by performing singular value decomposition (SVD) of X, where r is the approximate rank of the expression matrix to be recovered.
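The steps described in this section (majorization, alternating least squares, non-negativity, SVD-based initialization of V) can be put together in a compact sketch. It is written from the description in the text, not from the authors' released code: the function name, the fixed iteration count, the choice a = 1 (a valid MM constant for a binary mask), and the synthetic test matrix are all assumptions.

```python
import numpy as np

def mf_complete(y, mask, rank, n_iter=500):
    """Sketch of MM + alternating least squares for masked factorization.

    y    : observed matrix with zeros at unobserved positions
    mask : binary array, 1 where an entry was observed
    rank : assumed rank r of the matrix to recover
    """
    x = y.astype(float).copy()
    # Initialise V with the top-r right singular vectors of the observed data.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    v = vt[:rank, :]
    for _ in range(n_iter):
        # Majorization step with a = 1: B = X + A^T(Y - A(X)).
        b = x + mask * (y - mask * x)
        # Alternating least squares on ||B - UV||_F^2.
        u = b @ np.linalg.pinv(v)   # update U with V held fixed
        v = np.linalg.pinv(u) @ b   # update V with U held fixed
        # Non-negativity constraint on the recovered matrix.
        x = np.maximum(u @ v, 0.0)
    return x

# Synthetic check: non-negative rank-2 matrix with about 80% of entries observed.
rng = np.random.default_rng(0)
x_true = rng.random((30, 2)) @ rng.random((2, 20))
mask = (rng.random(x_true.shape) < 0.8).astype(float)
x_hat = mf_complete(mask * x_true, mask, rank=2)

missing = mask == 0
rel_err = (np.linalg.norm((x_hat - x_true)[missing])
           / np.linalg.norm(x_true[missing]))
print(rel_err)
```

The relative error is measured only on the held-out entries, where leaving the zeros in place would give a value of exactly 1.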


#### 4.5. Nuclear Norm Minimization

The problem depicted in (3) is non-convex. Hence, there is no guarantee of global convergence. Also, one needs to know the approximate rank of the matrix X in order to solve it, which is unknown in this case. To combat these issues, researchers in applied mathematics and signal processing proposed an alternative solution: directly solving the original problem (1) with a constraint that the solution is of low rank. This is mathematically expressed as,

$$\min_{X} \operatorname{rank}(X) \text{ such that } Y = A(X) \tag{7}$$

However, this turns out to be an NP-hard problem with doubly exponential complexity. Therefore, studies in matrix completion (Candes and Recht, 2009; Candès and Tao, 2010) proposed relaxing the NP-hard rank minimization problem to its closest convex surrogate: nuclear norm minimization.

$$\min_{X} \|X\|_{*} \text{ such that } Y = A(X) \tag{8}$$

Here ||·||<sub>∗</sub> is the nuclear norm and is defined as the sum of singular values of data matrix X. It is the ℓ<sub>1</sub> norm of the vector of singular values of X and is the tightest convex relaxation of the rank of matrix, and therefore its ideal replacement.

This is a semi-definite programming (SDP) problem. Usually its relaxed version (Quadratic Program) is solved (Candès and Plan, 2010) with the unconstrained Lagrangian version.

$$\min_{X} \|Y - A(X)\|_F^2 + \lambda \|X\|_{*} \tag{9}$$

Here, ||·||<sub>∗</sub> is the nuclear norm and λ is called the Lagrange multiplier. The problem (9) does not have a closed form solution and needs to be solved iteratively.

To solve (9), we invoke MM once more. Here J(X) = ||Y − A(X)||<sub>F</sub><sup>2</sup> + λ||X||<sub>∗</sub>, and we can express (9) in the following fashion in every iteration k

$$\min_{X} \|B - X\|_F^2 + \lambda \|X\|_{*} \tag{10}$$

where B<sub>k+1</sub> = X<sub>k</sub> + (1/a) A<sup>T</sup>(Y − A(X<sub>k</sub>)).

Using the inequality ||Z<sub>1</sub> − Z<sub>2</sub>||<sub>F</sub> ≥ ||s<sub>1</sub> − s<sub>2</sub>||<sub>2</sub>, where s<sub>1</sub> and s<sub>2</sub> are the singular values of the matrices Z<sub>1</sub> and Z<sub>2</sub>, respectively, we can solve the following instead of solving the minimization problem (10).

$$\min_{s_X} \|s_B - s_X\|_2^2 + \lambda \|s_X\|_1 \tag{11}$$

Here, s<sub>B</sub> and s<sub>X</sub> are the singular values of B and X, respectively, and ||s<sub>X</sub>||<sub>1</sub> is the ℓ<sub>1</sub> norm, or the sum of absolute values, of s<sub>X</sub>. It has been shown that problem (10) is minimized by soft thresholding the singular values with threshold λ/2. The optimal update is given by

$$s_X = \begin{cases} s_B + \lambda/2 & \text{when } s_B \le -\lambda/2 \\ 0 & \text{when } |s_B| \le \lambda/2 \\ s_B - \lambda/2 & \text{when } s_B \ge \lambda/2 \end{cases} \tag{12}$$

or more compactly by

$$s_X = \text{soft}(s_B, \lambda/2) = \text{sign}(s_B) \max(0, |s_B| - \lambda/2) \tag{13}$$

We found that the algorithm is robust to the value of λ as long as it is reasonably small (< 0.01).

Here too, we have imposed the non-negativity constraint on X since expressions cannot be smaller than zero. The Nuclear Norm Minimization algorithm has been depicted in Algorithm 2.
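The iteration described in this section, a majorization step followed by soft thresholding of the singular values as in Equation (13) and a non-negativity projection, can be sketched as below. This is an illustration written from the text rather than the released mcImpute code. The function names, the fixed iteration count and the choice a = 1 are assumptions, and λ = 1 is picked only to suit the scale of the synthetic matrix; it is not the setting recommended in the text.

```python
import numpy as np

def soft(s, threshold):
    """Soft thresholding: sign(s) * max(0, |s| - threshold)."""
    return np.sign(s) * np.maximum(0.0, np.abs(s) - threshold)

def nnm_complete(y, mask, lam, n_iter=500):
    """Sketch of nuclear norm minimization by singular value thresholding."""
    x = y.astype(float).copy()
    for _ in range(n_iter):
        # Majorization step with a = 1: B = X + A^T(Y - A(X)).
        b = x + mask * (y - mask * x)
        # Shrink the singular values of B by lambda / 2.
        u, s, vt = np.linalg.svd(b, full_matrices=False)
        x = (u * soft(s, lam / 2.0)) @ vt
        # Non-negativity constraint on the recovered matrix.
        x = np.maximum(x, 0.0)
    return x

# Synthetic check: non-negative rank-2 matrix with about 80% of entries observed.
rng = np.random.default_rng(0)
x_true = rng.random((30, 2)) @ rng.random((2, 20))
mask = (rng.random(x_true.shape) < 0.8).astype(float)
x_hat = nnm_complete(mask * x_true, mask, lam=1.0)

missing = mask == 0
rel_err = (np.linalg.norm((x_hat - x_true)[missing])
           / np.linalg.norm(x_true[missing]))
print(rel_err)
```

Unlike the factorization sketch, no rank has to be supplied: the thresholding itself decides how many singular values survive.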

#### 5. CONCLUSION

As an inevitable consequence of a steep decline in single cell library depth, dropout rates in scRNA-seq data have skyrocketed. This works as a confounding factor (Hicks et al., 2015), thereby hindering cell clustering and further downstream analyses. A good imputation strategy would handle the dropout problem gracefully and thereby has the potential to facilitate the discovery of new rare cell subtypes within seemingly similar cells. This, in turn, can be helpful for characterizing cancer heterogeneity and understanding the dynamics of transcriptional changes during development. The proposed mcImpute algorithm, without making any assumption about the expression data distribution, recovers dropouts while simultaneously retaining the true zero counts and shows comparable performance on a number of measures including clustering accuracy, cell type separability, differential gene prediction, cell visualization, gene distribution, etc.

We believe that mcImpute is by far the most intuitive way of addressing the dropout problem. It can be seamlessly integrated into, and serve as a key component of, a single-cell RNA seq pipeline.

Currently, imputation and clustering are together a piecemeal two-step process—imputation followed by clustering. In the future, we would like to incorporate both clustering and imputation as a joint optimization problem.

#### 6. SOFTWARE

The source code of mcImpute is shared at https://github.com/aanchalMongia/McImpute\_scRNAseq.

#### DATA AVAILABILITY STATEMENT

The details of the datasets for this study are given in section 4.

#### AUTHOR CONTRIBUTIONS

DS and AnM led the study, contributed to the statistical analysis and design of the experiments. AaM analyzed and interpreted the scRNA-seq data and performed the experiments. All authors read and reviewed the manuscript.

#### ACKNOWLEDGMENTS

This manuscript has been submitted to the preprint server bioRxiv (Mongia et al., 2018).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00009/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Mongia, Sengupta and Majumdar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Expression Profile Analysis Identifies a Novel Five-Gene Signature to Improve Prognosis Prediction of Glioblastoma

Wen Yin<sup>1</sup> , Guihua Tang<sup>2</sup> , Quanwei Zhou<sup>1</sup> , Yudong Cao<sup>1</sup> , Haixia Li<sup>3</sup> , Xianyong Fu<sup>1</sup> , Zhaoping Wu<sup>1</sup> and Xingjun Jiang<sup>1</sup> \*

<sup>1</sup> Department of Neurosurgery, Xiangya Hospital of Central South University, Changsha, China, <sup>2</sup> Department of Clinical Laboratory, Hunan Provincial People's Hospital (First Affiliated Hospital of Hunan Normal University), Changsha, China, <sup>3</sup> Department of Operative Nursing, Xiangya Hospital of Central South University, Changsha, China

#### Edited by:

Monica Bianchini, University of Siena, Italy

#### Reviewed by:

Nitish Kumar Mishra, University of Nebraska Medical Center, United States; Sen Peng, Translational Genomics Research Institute, United States; Max Shpak, St David's Medical Center, United States

#### \*Correspondence:

Xingjun Jiang jiangxj@csu.edu.cn; jxjyjz@163.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 20 December 2018 Accepted: 17 April 2019 Published: 03 May 2019

#### Citation:

Yin W, Tang G, Zhou Q, Cao Y, Li H, Fu X, Wu Z and Jiang X (2019) Expression Profile Analysis Identifies a Novel Five-Gene Signature to Improve Prognosis Prediction of Glioblastoma. Front. Genet. 10:419. doi: 10.3389/fgene.2019.00419

Glioblastoma multiforme (GBM) is the most aggressive primary central nervous system malignant tumor. The median survival of GBM patients is 12–15 months, and the 5-year survival rate is less than 5%. More novel molecular biomarkers are still urgently required to elucidate the mechanisms or improve the prognosis of GBM. This study aimed to explore novel biomarkers for GBM prognosis prediction. The gene expression profiles from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) datasets of GBM were downloaded. A total of 2241 overlapping differentially expressed genes (DEGs) were identified from TCGA and GSE7696 datasets. By univariate Cox regression survival analysis, 292 survival-related genes were found among these DEGs (p < 0.05). Functional enrichment analysis was performed based on these survival-related genes. A five-gene signature (PTPRN, RGS14, G6PC3, IGFBP2, and TIMP4) was further selected by multivariable Cox regression analysis and a prognostic model of this five-gene signature was constructed. Based on this risk score system, patients in the high-risk group had significantly poorer survival results than those in the low-risk group. Moreover, with the assistance of GEPIA (http://gepia.cancer-pku.cn/), all five genes were found to be differentially expressed in GBM tissues compared with normal brain tissues. Furthermore, the co-expression network of the five genes was constructed based on weighted gene co-expression network analysis (WGCNA). Finally, this five-gene signature was further validated in other datasets. In conclusion, our study identified five novel biomarkers that have potential in the prognosis prediction of GBM.

Keywords: glioblastoma, differentially expressed genes, gene signature, prognosis, TCGA, GEO

## INTRODUCTION

Glioblastoma multiforme (GBM) is the most common and aggressive primary central nervous system malignant tumor with high morbidity and mortality. According to genomic abnormalities and gene expression, GBM can be divided into four molecular subtypes: classical, mesenchymal, neural, and proneural, which lay a foundation for understanding its inherent heterogeneity (Verhaak et al., 2010; Ma et al., 2018). In the United States, the incidence of GBM is 2.96 cases/100,000 population/year (Jhanwar-Uniyal et al., 2015). Although there are several treatment options, including surgery, radiotherapy and chemotherapy, the median survival of GBM patients remains 12–15 months, and the 5-year survival rate is less than 5% (Wen and Kesari, 2008; Ostrom et al., 2013).

With the development of next-generation sequencing technologies, many specific molecular signatures have been identified to better understand the molecular pathogenesis of GBM (Aldape et al., 2015). As a result, many potential diagnostic and prognostic biomarkers have been discovered that enable a more specific classification and a more precise outcome prediction of GBM. Some molecular markers including MGMT (O6-methylguanine DNA methyltransferase), IDH (isocitrate dehydrogenase), EGFR (epidermal growth factor receptor), and PTEN (phosphatase and tensin homolog) have been routinely tested in GBM patients clinically (van den Bent et al., 2017; Binabaj et al., 2018). More importantly, these molecular signatures have contributed to personalized therapeutic approaches and targeted anti-GBM therapies (Huang et al., 2017; Szopa et al., 2017). However, considering the poor prognosis of GBM, novel molecular biomarkers and new therapeutic strategies are still urgently required to elucidate the mechanisms of GBM or increase overall patient survival.

Previous studies have shown that gene expression profile analysis could detect gene signatures to predict the outcome for malignant tumors (Luo et al., 2018; Mao et al., 2018; Zeng et al., 2018). Shergalis et al. (2018) discovered that 20 genes were overexpressed and correlated with poor survival outcomes in GBM patients by bioinformatics analysis using data from The Cancer Genome Atlas (TCGA) project. Bao et al. (2014) identified a nine-gene signature to predict the prognosis of glioma patients based on mRNA expression profiling from the Chinese Glioma Genome Atlas (CGGA) database. Therefore, it is necessary to understand the development and progression of GBM by identifying GBM-related genes and to investigate their potential clinical roles and molecular mechanisms.

In this study, RNA-Seq data from TCGA and microarray data from the Gene Expression Omnibus (GEO) database of GBM were downloaded. Based on the overlapping differentially expressed genes (DEGs), the genes related to prognosis were screened. By using Cox regression, we developed a risk score based on a five-gene signature to demonstrate the association between gene expression and the prognosis of GBM. Moreover, we validated this signature in the GEO dataset and the TCGA array dataset. These results might provide a new reference for the prognostic prediction of GBM.

#### MATERIALS AND METHODS

#### Data Source

The GBM RNA sequencing (RNA-seq) dataset and corresponding clinical follow-up information were downloaded from TCGA database (March 2018). Subtype data of GBM were downloaded from UCSC Xena<sup>1</sup>. A total of 159 samples, comprising 154 samples from primary GBM patients and five samples of normal brain tissue, were extracted for subsequent analysis.

<sup>1</sup>http://xena.ucsc.edu/

Gene expression microarray data GSE7696 (Lambiv et al., 2011), including 71 samples from primary GBM patients and four samples of normal brain tissue, were downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus<sup>2</sup>. The dataset was based on the GPL570 platform, the [HG-U133\_Plus\_2] Affymetrix Human Genome U133 Plus 2.0 Array (Affymetrix, Santa Clara, CA, United States).

#### Differential Expression Analyses

Gene expression profiles were normalized within and among samples. Because the numerical range of RPKM (reads per kilobase per million mapped reads) values is very wide, the final expression level of a gene was defined as log2(x + 1), where x is the raw expression level. Next, the DEGs between the tumor and normal samples were identified with the limma package (Padj < 0.05 and |log2FC| > 1). The Venn diagram was produced with the VennDiagram R package (Chen and Boutros, 2011).
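The transformation and thresholds described above can be illustrated with a minimal Python sketch. It does not reproduce the limma model fit; it only applies log2(x + 1) and the stated cut-offs to a table of already computed statistics. The function names and the toy values are hypothetical and are not taken from the study.

```python
import math


def log2_transform(value):
    """Return log2(x + 1) for a non-negative raw expression value."""
    return math.log2(value + 1.0)


def select_degs(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes with adjusted p < padj_cutoff and |log2FC| > lfc_cutoff.

    `results` maps gene name -> (log2 fold change, adjusted p-value).
    """
    return sorted(
        gene
        for gene, (log2fc, padj) in results.items()
        if padj < padj_cutoff and abs(log2fc) > lfc_cutoff
    )


# Hypothetical statistics, for illustration only.
toy_results = {
    "GENE_A": (2.3, 0.001),   # passes both thresholds
    "GENE_B": (-1.7, 0.020),  # passes both thresholds (down-regulated)
    "GENE_C": (0.4, 0.001),   # fold change too small
    "GENE_D": (3.1, 0.200),   # not significant
}
print(select_degs(toy_results))
```

Taking the absolute value of the fold change keeps both up- and down-regulated genes, which matches the two-sided criterion |log2FC| > 1.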

#### Identification and Selection of Survival-Related Genes

Only patients with detailed follow-up times were included in the subsequent survival analyses. Univariate Cox regression survival analysis was performed with the survival package in R to identify survival-related genes (Yang et al., 2016). Genes with a p-value of less than 0.05 were selected.

## GO and KEGG Annotation of Survival-Related Genes

Gene Ontology (GO) enrichment and KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis were performed on the survival-related genes (Ogata et al., 1999; Wanggou et al., 2016; Li et al., 2018). DAVID (The Database for Annotation, Visualization, and Integrated Discovery) (Dennis et al., 2003) software and the clusterProfiler package (Yu et al., 2012) in R were used to annotate and visualize GO terms and KEGG pathways.

#### Gene Signature Identification and Risk Score System Establishment

Based on the top 100 survival-related genes in the TCGA dataset, multivariable Cox proportional hazards regression analysis was performed to establish a risk score formula (O'Quigley and Moreau, 1986). As previously reported, a prognostic risk score formula can be constructed as a linear combination of the expression level (exp) of each gene multiplied by its regression coefficient (β) derived from the multivariate Cox regression model:

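$$
\text{Risk Score (RS)} = \mathrm{exp}_{\mathrm{PTPRN}} \times \beta_{\mathrm{PTPRN}} + \mathrm{exp}_{\mathrm{RGS14}} \times \beta_{\mathrm{RGS14}} + \mathrm{exp}_{\mathrm{G6PC3}} \times \beta_{\mathrm{G6PC3}} + \mathrm{exp}_{\mathrm{IGFBP2}} \times \beta_{\mathrm{IGFBP2}} + \mathrm{exp}_{\mathrm{TIMP4}} \times \beta_{\mathrm{TIMP4}}
$$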

Based on the formula, the risk score of each GBM patient was calculated, and GBM patients were then divided into high-risk and low-risk groups. Receiver operating characteristic (ROC) curve analysis was conducted using the R package "pROC." After choosing an optimal cut-off point with the maximal sensitivity and specificity, the survival differences between the low-risk and high-risk groups were assessed by Kaplan–Meier analysis with the log-rank test. Similarly, to evaluate the predictive power of the five-gene signature in the internal dataset, we assessed the gene signature within each subtype (classical, mesenchymal, neural, and proneural).

<sup>2</sup>http://www.ncbi.nlm.nih.gov/geo/
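One common reading of "an optimal cut-off point with the maximal sensitivity and specificity" is the threshold that maximizes Youden's J statistic (sensitivity + specificity − 1). The sketch below shows that criterion in plain Python. It is an assumption that this matches the criterion used in the study, the binary outcome labels are hypothetical because the text does not state how survival was dichotomized for the ROC analysis, and the code does not call pROC.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    labels: 1 = event, 0 = no event. A sample is called positive
    when its score is greater than or equal to the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes are required")

    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1.0
        # Strict ">" keeps the lowest cutoff when several cutoffs tie.
        if best is None or j > best[0]:
            best = (j, cutoff, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical risk scores and outcomes, for illustration only.
toy_scores = [7.6, 7.9, 8.1, 8.3, 8.5, 8.8, 9.2]
toy_labels = [0, 0, 0, 1, 0, 1, 1]
cutoff, sens, spec = youden_cutoff(toy_scores, toy_labels)
print(cutoff, sens, spec)
```

Candidate cut-offs are restricted to observed scores here for simplicity; ROC implementations often use midpoints between adjacent scores instead, which can shift the reported threshold slightly without changing the grouping.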

### Analysis in GEPIA and Exploring Co-expression by WGCNA

The expression levels of the five genes were obtained from GEPIA<sup>3</sup>, a newly developed interactive web server for analyzing the RNA sequencing expression data of 23 types of cancers and normal samples from TCGA and the GTEx projects according to a standard processing pipeline (Tang et al., 2017).

<sup>3</sup>http://gepia.cancer-pku.cn/

TABLE 1 | Information about the five genes screened to build the risk score system.

To explore the regulatory network of the five genes, all the overlapping DEGs were analyzed by WGCNA (Ahn et al., 2016; Chen et al., 2018). Finally, the co-expression network of the five genes was constructed based on WGCNA and visualized with Cytoscape 3.6.1 (Shannon et al., 2003).

FIGURE 4 | ROC and Kaplan–Meier analysis of the five-gene signature in TCGA dataset. (A) ROC analysis of the sensitivity and specificity of the survival time according to the five-gene signature based on risk score. (B) Kaplan–Meier analysis of the five-gene signature based risk score. Patients were divided into low-risk and high-risk groups based on the optimal cut-off point.

## Validation of the Five-Gene Prognostic Signature by the GEO Dataset and TCGA Microarray Dataset

Dataset GSE13041 from GEO and the TCGA microarray dataset were used to validate this five-gene prognostic signature (Lee et al., 2008). The GSE13041 dataset, including 188 samples from GBM patients, and the TCGA microarray dataset, including 498 samples from GBM patients, were both based on the Affymetrix Human Genome U133A Array platform (GPL97). ROC curves and Kaplan–Meier analyses were used to validate the prognostic value of the five-gene signature for GBM patients.

## RESULTS

## Differentially Expressed Genes (DEGs) in TCGA and GSE7696

Altogether, 4473 DEGs in TCGA dataset (**Figure 1A**) and 5789 DEGs in the GSE7696 dataset (**Figure 1B**) were screened by the limma package. The 2241 overlapping DEGs were screened for further analysis (**Figure 1C**).

### Survival-Related Genes in GBM

In the TCGA dataset, every overlapping DEG was evaluated by univariate Cox regression survival analysis. Altogether, 292 genes were considered survival-related at a threshold of p < 0.05. The top 100 survival-related genes are shown in **Supplementary Table 1**.

## GO and KEGG Analysis of Survival-Related Genes

For the "biological processes" (BP), negative regulation of catalytic activity, regulation of cell shape, negative regulation of monocyte chemotaxis, long-term synaptic potentiation and insulin secretion involved in cellular response to glucose stimulus were the commonly enriched categories (**Figure 2A**). For the "cellular component" (CC), the enriched categories were correlated with focal adhesion, extracellular space, synaptic vesicle membrane, extracellular exosome, and endoplasmic reticulum (**Figure 2B**). For the "molecular function" (MF), the genes mainly showed enrichment in calcium ion binding, phospholipase inhibitor activity, calcium-dependent protein binding, calcium-dependent phospholipid binding, and signal transducer activity (**Figure 2C**). KEGG pathway enrichment analysis suggested that glycosaminoglycan degradation was the most significant pathway. These genes also participated in the following pathways: proteoglycans in cancer, lysosome, and regulation of the actin cytoskeleton (**Figure 2D**).

## Risk Score System Based on Five-Gene Signature

After multivariate Cox regression analysis was conducted for these 100 genes, five genes (PTPRN, RGS14, G6PC3, IGFBP2, and TIMP4) were selected as signature genes that can optimally predict the overall survival of patients with GBM (**Table 1**). To comprehensively investigate the association between these five genes and the prognosis of GBM, a five-gene survival risk score system was established based on their Cox coefficients.

$$
\text{Risk Score (RS)} = 0.50894 \times \mathrm{exp}_{\mathrm{PTPRN}} + 0.54671 \times \mathrm{exp}_{\mathrm{RGS14}} + 1.20753 \times \mathrm{exp}_{\mathrm{G6PC3}} + 0.25845 \times \mathrm{exp}_{\mathrm{IGFBP2}} - 0.20684 \times \mathrm{exp}_{\mathrm{TIMP4}}
$$

Then, the risk score for each patient in the TCGA dataset was calculated, and patients were ranked according to their risk scores. Patients were thus divided into a high-risk group (n = 75) and a low-risk group (n = 76). The survival time of GBM patients was inversely associated with their risk scores (**Figure 3A**). Markedly lower expression of TIMP4 was noted in the high-risk group, while higher expression of the other genes was observed in the high-risk group (**Figure 3B**). The Kaplan–Meier analysis and log-rank test showed that patients in the low-risk group had a significantly longer overall survival time than those in the high-risk group (p = 7.055906e-11) (**Figure 3C**).
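As an illustration of how the risk score and the ranking-based grouping could be computed, the following Python sketch uses the five Cox coefficients reported above. The split at the median (scores strictly above the median are labeled high-risk) is an assumption that is merely consistent with the reported group sizes of 75 and 76; the patient identifiers and expression values are hypothetical.

```python
from statistics import median

# Cox coefficients as reported in the risk score formula above.
COEFFICIENTS = {
    "PTPRN": 0.50894,
    "RGS14": 0.54671,
    "G6PC3": 1.20753,
    "IGFBP2": 0.25845,
    "TIMP4": -0.20684,
}


def risk_score(expression):
    """Linear combination of expression levels and Cox coefficients."""
    return sum(beta * expression[gene] for gene, beta in COEFFICIENTS.items())


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the median, else 'low'."""
    cut = median(scores.values())
    return {pid: ("high" if s > cut else "low") for pid, s in scores.items()}


# Hypothetical log2-scale expression values, for illustration only.
baseline = {gene: 1.0 for gene in COEFFICIENTS}
toy_patients = {
    "P1": dict(baseline),
    "P2": dict(baseline, G6PC3=3.0),  # higher G6PC3 raises the score
    "P3": dict(baseline, TIMP4=4.0),  # higher TIMP4 lowers the score
}
toy_scores = {pid: risk_score(expr) for pid, expr in toy_patients.items()}
toy_groups = split_by_median(toy_scores)
print(toy_scores, toy_groups)
```

Because the TIMP4 coefficient is negative, higher TIMP4 expression lowers the score, which agrees with the lower TIMP4 expression observed in the high-risk group.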

Moreover, ROC analysis was performed for this risk score system. **Figure 4A** shows that the area under the ROC curve (AUC) was 0.704. The optimal cutoff point was selected as 8.421. With this cutoff point, the patients were further divided into a high-risk group and a low-risk group. The Kaplan–Meier analysis and log-rank test further indicated a significant difference in overall survival between the two groups (p = 1.075619e-11) (**Figure 4B**). Similarly, with different cutoff points, the patients in each subtype were divided into a high-risk group and a low-risk group. The Kaplan–Meier analysis and log-rank test also indicated a significant difference between the two groups in each subtype (**Figures 5A–D**).
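The Kaplan–Meier curves compared above rest on the product-limit estimator, which a short Python sketch can make concrete. This is a generic textbook implementation, not the R code used in the study; it does not perform the log-rank test, and the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    events: 1 = death observed at that time, 0 = censored.
    Patients censored at time t are still counted as at risk at t.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up data (months), for illustration only.
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
print(toy_curve)
```

Running the estimator separately on the high-risk and low-risk groups yields the two step curves that the log-rank test then compares.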

## Analysis in GEPIA and Exploring Co-expression by WGCNA

Based on the results derived from GEPIA, the expression of G6PC3, IGFBP2, and TIMP4 was significantly up-regulated in GBM, while the expression of PTPRN and RGS14 was significantly down-regulated (**Figure 6**). GEPIA, which includes a larger number of normal samples, thus verified the five selected genes as DEGs in GBM.

The genes co-expressed with the five genes were determined by WGCNA. In total, 129 genes were found to be co-expressed with PTPRN, 41 genes with IGFBP2, 10 genes with RGS14 and 1 gene with TIMP4. However, no gene was co-expressed with G6PC3. The co-expression network of the four genes, derived by WGCNA, is visualized in **Figure 7**.

## Validation of the Five-Gene Prognostic Signature by GEO Dataset and TCGA Microarray Dataset

The GSE13041 dataset, including 188 GBM patients, and the TCGA microarray dataset, including 498 GBM patients, were used separately for the validation of the five-gene signature. As before, the risk score for each patient was calculated. ROC analyses were used to identify the optimal cutoff points (**Figures 8A,C**). Then, we divided the patients in each dataset into a high-risk group and a low-risk group using the respective optimal cut-off point. The Kaplan–Meier analyses suggested a significantly prolonged survival time in the low-risk patients compared to that in the high-risk patients (p = 3.480445e-06 and p = 0.00011) (**Figures 8B,D**).

#### DISCUSSION

GBM is the most aggressive brain tumor and is associated with a poor prognosis. By analyzing the TCGA and GSE7696 datasets, we identified 2241 overlapping DEGs. A total of 292 survival-related DEGs were selected from the overlapping DEGs. Functional analyses demonstrated that these genes are mainly associated with the following pathways: glycosaminoglycan degradation, proteoglycans in cancer, lysosome, and regulation of the actin cytoskeleton. More importantly, based on multivariate Cox regression analysis of the TCGA dataset, five genes that could predict overall survival were screened out, namely PTPRN, RGS14, G6PC3, IGFBP2, and TIMP4. A risk score system based on the five genes was established from their Cox regression coefficients. Additionally, after identifying the optimal cut-off point by ROC analysis, patients were classified into high-risk and low-risk groups. This five-gene signature was further validated as a prognostic marker in each subtype of GBM, in an independent GEO dataset (GSE13041) and in the TCGA microarray dataset. Furthermore, differential expression analysis of the five genes in GEPIA validated that three genes (G6PC3, IGFBP2, and TIMP4) were significantly up-regulated and two genes (PTPRN and RGS14) were significantly down-regulated in GBM. Co-expression network analysis revealed the regulatory network of the five genes. These results suggest that these genes may play an important role in the molecular pathogenesis, progression and prognosis of GBM.

Based on GO and KEGG enrichment analyses of the survival-related DEGs among different studies, "negative regulation of catalytic activity" was the most significant enrichment in BP. This indicated that inhibiting the catalytic activity of some genes may be critical for cancer progression. Coincidentally, Zhao et al. (2009) found that IDH1 mutation could inhibit IDH1 catalytic activity and contribute to the tumorigenesis of glioma. Other BPs such as regulation of cell shape and negative regulation of monocyte chemotaxis were also enriched. For the CC category, focal adhesion was the most significant enrichment; it has been shown to be a major determinant of cell migration and an essential process in tumor invasion (Garzon-Muvdi et al., 2012). The following three CCs, extracellular space, synaptic vesicle membrane and extracellular exosome, may also play important roles in tumor development and its micro-environmental manipulation (Wei et al., 2017). Regarding the MF category, calcium ion binding was the most affected MF. Ca2+-mediated cell connectivity and plasticity are unique features of the central nervous system, and the Ca2+/calmodulin-dependent process is able to regulate cell cycle progression and inhibit proliferation of malignant glioma (Cheng et al., 1995; Liu et al., 2011). For KEGG pathway enrichment analysis, glycosaminoglycan degradation was the most significant pathway. Extracellular proteoglycans play critical roles in driving oncogenic pathways in tumor cells and promoting critical tumor-microenvironment interactions (Wade et al., 2013). The other KEGG pathways, proteoglycans in cancer, lysosome, and regulation of actin cytoskeleton, were also closely related to oncogenesis (Liu et al., 2012; Terakawa et al., 2013; Wade et al., 2013).

The five-gene signature provides a wealth of potential biological and therapeutic information about GBM. PTPRN (protein tyrosine phosphatase, receptor type N), located on the long arm of human chromosome 2 (2q35) (Lan et al., 1996), is an integral transmembrane protein of dense core vesicles and plays an important role in the secretion of hormones and neurotransmitters (Xu et al., 2016). PTPRN has been confirmed to be negatively related to the survival of hepatocellular carcinoma patients and closely related to liver tumorigenesis (Zhangyuan et al., 2018). Moreover, hypermethylation of PTPRN is also associated with shorter survival in ovarian cancer patients (Bauerschlag et al., 2011). High expression of PTPRN in small cell lung cancer is associated with tumor growth and proliferation. Interestingly, Shergalis et al. (2018) also found that high PTPRN expression is strongly associated with a poor prognosis in GBM patients, which is consistent with our finding. RGS14 is a member of the regulator of G-protein signaling (RGS) protein family and is highly expressed in the caudate nucleus of the brain, spleen and thymus (Cho et al., 2005; Gerber et al., 2016). A previous study found that RGS14 is important for centrosome function, transcriptional regulation and stress-induced cellular responses (Cho et al., 2005). However, little work has been done to elucidate the role of RGS14 in cancer. Interestingly, PTPRN and RGS14 were expressed at low levels in GBM tissue, but their increased expression was associated with poor prognosis. The reason may be that they have different functions in normal and tumor tissues. More work is needed to elucidate their functions in GBM. G6PC3, namely glucose-6-phosphatase isoform β, is a catalytic subunit of G6PC (Gao et al., 2017). G6PC (glucose-6-phosphatase) is a key enzyme that regulates glucose homeostasis and glycogenolysis, and it has been reported as a specific enzyme regulating proliferation and invasiveness in several tumors, such as liver, kidney and ovarian cancer (Gao et al., 2017). Furthermore, a previous study revealed that G6PC is a key enzyme regulating glioblastoma invasion (Abbadi et al., 2014). Our study demonstrated that G6PC3 was significantly up-regulated in GBM samples compared with normal brain tissue, and that high expression of G6PC3 was closely related to a poor prognosis in GBM patients. IGFBP2 (insulin-like growth factor binding protein 2), an important member of the insulin-like growth factor binding protein family, modulates cell growth, differentiation, migration, and invasion in neoplasms (Fukushima and Kataoka, 2007). IGFBP2 is involved in immunosuppressive activities and is a potential immunotherapeutic target for GBM (Cai et al., 2018). Our study confirmed that IGFBP2 was significantly up-regulated in GBM and predicted a worse outcome for patients, which is consistent with the previous study (Cai et al., 2018). TIMP4 is a member of the tissue inhibitors of matrix metalloproteinases (TIMPs), which are involved in several processes of tumorigenesis including proliferation, migration, and invasion (Boufraqech et al., 2016). High expression of TIMP4 has been found in patients with breast, cervical, and prostate cancers, whereas low expression has been observed in patients with pancreatic cancer (Boufraqech et al., 2016). Interestingly, our study found that TIMP4 was highly expressed in GBM patients; however, its high expression was associated with a good prognosis in patients with GBM. More work is also needed to elucidate its functions in GBM. In summary, the five-gene signature is not only robust for predicting overall survival in GBM but also has promising practical value in the treatment of GBM.

There are some limitations in our work. First of all, only a very limited number of normal samples were included in our differential expression analyses, so some potentially relevant mRNAs might have been missed. Moreover, the performance of the five-gene signature should be confirmed in more GBM patients. Furthermore, the molecular mechanisms by which the five-gene signature affects the prognosis of GBM patients should be further elucidated by a series of experiments.

## CONCLUSION

In conclusion, our study identified five novel biomarkers that have potential for prognosis prediction in GBM. Moreover, our findings provide new insights into the pathogenesis and prognosis of GBM.

## AUTHOR CONTRIBUTIONS

WY and XJ conceived and designed the study. GT, QZ, YC, HL, XF, and ZW performed the analysis procedures. GT, WY, and XJ analyzed the results. WY and XJ wrote the manuscript. All authors contributed to the editing of the manuscript.

## FUNDING

This work was supported by the National Natural Science Foundation of China (No. 81472355).

## ACKNOWLEDGMENTS

We sincerely acknowledge the public databases: TCGA, GEO, and GEPIA.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00419/full#supplementary-material

#### REFERENCES

Abbadi, S., Rodarte, J. J., Abutaleb, A., Lavell, E., Smith, C. L., Ruff, W., et al. (2014). Glucose-6-phosphatase is a key metabolic regulator of glioblastoma invasion. Mol. Cancer Res. 12, 1547–1559. doi: 10.1158/1541-7786.MCR-14-0106-T

Ahn, R., Gupta, R., Lai, K., Chopra, N., Arron, S. T., and Liao, W. (2016). Network analysis of psoriasis reveals biological pathways and roles for coding and long non-coding RNAs. BMC Genomics 17:841. doi: 10.1186/s12864-016-3188-y

Aldape, K., Zadeh, G., Mansouri, S., Reifenberger, G., and von Deimling, A. (2015). Glioblastoma: pathology, molecular mechanisms and markers. Acta Neuropathol. 129, 829–848. doi: 10.1007/s00401-015-1432-1


feature selection methods. Front. Genet. 9:246. doi: 10.3389/fgene.2018.00246




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yin, Tang, Zhou, Cao, Li, Fu, Wu and Jiang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Co-expression Network Analysis Identifies Four Hub Genes Associated With Prognosis in Soft Tissue Sarcoma

Zhenhua Zhu<sup>1</sup>, Zheng Jin<sup>2</sup>, Yuyou Deng<sup>3</sup>, Lai Wei<sup>4</sup>, Xiaowei Yuan<sup>1</sup>, Mei Zhang<sup>5</sup>\* and Dahui Sun<sup>1</sup>\*

<sup>1</sup> Department of Orthopaedic Trauma, The First Hospital of Jilin University, Changchun, China, <sup>2</sup> Department of Immunology, College of Basic Medical Sciences, Jilin University, Changchun, China, <sup>3</sup> Department of Urology, The First Hospital of Jilin University, Changchun, China, <sup>4</sup> College of Computer and Control Engineering, Nankai University, Tianjin, China, <sup>5</sup> College of Chemistry, Jilin University, Changchun, China

Background: Soft tissue sarcomas (STS) are heterogeneous tumors derived from mesenchymal cells that differentiate into soft tissues. The prognosis of patients who present with an STS is influenced by the regulation of a complex gene network.

#### Edited by:

Monica Bianchini, Università degli Studi di Siena, Italy

#### Reviewed by:

Haibo Liu, Iowa State University, United States Rahul Kumar, Columbia University Irving Medical Center, United States

#### \*Correspondence:

Mei Zhang zhangmei@jlu.edu.cn Dahui Sun sundahui1971@sina.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 07 August 2018 Accepted: 18 January 2019 Published: 04 February 2019

#### Citation:

Zhu Z, Jin Z, Deng Y, Wei L, Yuan X, Zhang M and Sun D (2019) Co-expression Network Analysis Identifies Four Hub Genes Associated With Prognosis in Soft Tissue Sarcoma. Front. Genet. 10:37. doi: 10.3389/fgene.2019.00037

Methods: Weighted gene co-expression network analysis (WGCNA) was performed to identify gene modules associated with STS (Samples = 156).

Results: Among the 11 modules identified, the black and blue modules were highly correlated with STS. However, in preservation analysis the black module demonstrated low preservation; therefore, the blue module was chosen as the module of interest. Furthermore, a total of 20 network hub genes were identified in the blue module, 12 of which were also hub nodes in the protein-protein interaction network of the module genes. Following additional verification, 4 of the 12 genes (RRM2, BUB1B, CENPF, and KIF20A) were associated with poorer overall survival and disease-free survival in the test datasets. In addition, gene set enrichment analysis (GSEA) demonstrated that samples with a high level of the blue module eigengene (ME) were enriched in cell cycle and metabolism associated signaling pathways.

Conclusion: In summary, co-expression network analysis identified four hub genes associated with prognosis for STS, which may worsen the prognosis by influencing cell cycle and metabolism associated signaling pathways.

Keywords: soft tissue sarcoma, weighted gene co-expression analysis, RRM2, BUB1B, CENPF, KIF20A

#### INTRODUCTION

Soft tissue sarcoma (STS) is a rare group of tumors that accounts for approximately 1% of adult cancers. In 2009, it was estimated that 3,300 new cases were diagnosed in Britain and 10,000 in the United States (Linch et al., 2014). There are approximately 50 STS subtypes, which differ significantly in their disease presentation, response to currently available treatments and risk of tumor progression (Casali et al., 2018). Multiple factors have been reported to be related to the progression of STS, including capillary morphogenesis gene 2 (CMG2) (Greither et al., 2017), HIF-2α protein (Nakazawa et al., 2016), epidermal growth factor receptor (EGFR) protein (Yang et al., 2017) and microRNAs (Smolle et al., 2017). However, no molecular biomarkers have been defined for predicting the prognosis of the disease in clinical practice. Therefore, a better understanding of the molecular pathogenesis is required.

To date, microarray-based expression data have been used to identify genes related to tumor progression and prognosis. Takahashi et al. (2014) identified 25 survival-associated genes using a knowledge-based filtering and multiple testing approach. Beck et al. (2010) reviewed the manner in which gene expression profiling has been used to understand sarcoma pathobiology and identify clinically useful biomarkers. However, most studies have focused on screening genes with different patterns of expression and interpreting them through gene ontology (GO) analysis. Such approaches fail to address the large number of interconnections between genes, even though genes with similar expression profiles are most likely to function closely together. Weighted gene co-expression network analysis (WGCNA) addresses this by clustering co-expressed genes into a network, based on similarities in expression profiles among samples and in clinical traits, to define sub-network regions (known as modules) (Langfelder and Horvath, 2008).

In this study, we utilized WGCNA to identify the most relevant module in STS. Key genes in the module were identified and validated using survival and protein-protein interaction (PPI) analyses. These key genes may shed new light on the biological mechanisms underlying STS progression and could potentially be used as prognostic biomarkers or therapeutic targets.

#### MATERIALS AND METHODS

#### Study Design and Data Collection

Study design, data preparation, preprocessing, analysis and validation are described in a flowchart (**Figure 1**). The core code used to reproduce the results is provided in **Supplementary Table S1**. Firstly, normalized expression data and associated clinical data were downloaded from the NCBI Gene Expression Omnibus (GEO). Dataset GSE21122 (Barretina et al., 2010), which was generated using an Affymetrix Human Genome U133A microarray (HG-U133A), was used as a training set to construct the co-expression network and identify key modules in this study. This dataset included 149 STS samples and 9 normal fat tissue samples. The STS samples contained 116 different types of liposarcoma and 34 malignant fibrous histiocytomas (MFHs). Most STSs (68.8%) were primary tumors at the time of sample procurement from patients whose mean age was 56 years. In addition, two test datasets were used to test the preservation of identified modules and the survival significance of hub genes. The first, which included RNA sequencing data and associated clinical information of 265 STS samples, was downloaded from The Cancer Genome Atlas (TCGA) database<sup>1</sup>. The second, the GSE21050 dataset (Chibon et al., 2010), which included expression data and associated clinical information of 310 STS samples, was downloaded from the NCBI GEO.

#### Data Preprocessing

Firstly, we extracted the training expression data from the GSE21122 MINiML file. The expression data were background corrected using the Robust Multi-array Average (RMA) algorithm and log2-normalized. The data were then checked for batch effects. No apparent batch effect was observed after analysis of expression clusters, box plots and principal components analysis (PCA) (**Supplementary Figure S1**). To detect outliers for WGCNA, a sample network was constructed based on squared Euclidean distance. The connectivity of each sample was defined as the sum of its connection strengths with all other samples. Outliers were identified after standardization of the sample connectivities, using the threshold z.k < 0.6. Generally, genes whose expression varies greatly are more biologically relevant. To reduce background noise, we selected genes that were variably expressed across samples and removed those whose expression was the same across samples. The median absolute deviation (MAD) was calculated for each gene as a robust measure of variability. Genes were then sorted by MAD value, and the 3,000 top-ranked genes were used for the subsequent WGCNA.
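The MAD-based gene filter described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the supplementary R code; the gene names and values are hypothetical, and the unscaled MAD is used (R's `mad()` multiplies by 1.4826 by default, which does not change the ranking).

```python
from statistics import median


def mad(values):
    """Median absolute deviation around the median (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def top_variable_genes(expression, n):
    """Return the n gene names with the largest MAD across samples.

    `expression` maps gene name -> list of expression values, one per sample.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(expression, key=lambda g: (-mad(expression[g]), g))
    return ranked[:n]


# Hypothetical expression values, for illustration only.
toy_expression = {
    "GENE_A": [1, 1, 1, 1, 1],     # constant, MAD = 0
    "GENE_B": [1, 2, 3, 4, 5],     # MAD = 1
    "GENE_C": [0, 5, 10, 15, 20],  # MAD = 5
}
print(top_variable_genes(toy_expression, 2))
```

In the study the same ranking would be applied with n = 3,000; genes with constant expression receive a MAD of zero and fall to the bottom of the list.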

#### Co-expression Network Construction and Module Preservation Analysis

The WGCNA package (Langfelder and Horvath, 2008) was used to construct the co-expression network. The concordance of gene expression was measured with the Pearson correlation, and the Pearson correlation matrix was then transformed into a weighted network with the power adjacency function. The first step in this process was the selection of an appropriate soft-thresholding power, which emphasizes strong connections between genes and penalizes weak ones, so that the network approximates a scale-free topology. Modules were identified using the dynamic tree-cutting function with a deepSplit argument value of 2 and a minimum size cutoff of 30. To test whether the identified modules were stable in the TCGA test dataset, the downloaded fragments per kilobase of transcript per million mapped reads (FPKM) expression data of 265 samples were converted to transcripts per million (TPM). A total of 2704 genes common to the training and TCGA datasets were used for preservation analysis. The modulePreservation function (nPermutations = 200) of the WGCNA package (Langfelder et al., 2011) was utilized, in which the preservation statistic Zsummary quantifies the preservation of gene modules between datasets.
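The FPKM-to-TPM conversion mentioned above is commonly done per sample by rescaling the FPKM values so that they sum to one million. The sketch below assumes this standard rescaling was the one applied, which the text does not state explicitly; the gene names and values are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values to TPM.

    `fpkm` maps gene name -> FPKM value for a single sample.
    TPM_i = FPKM_i / sum(FPKM) * 1e6, so the TPM values sum to one million.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Hypothetical FPKM values for one sample, for illustration only.
toy_fpkm = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
toy_tpm = fpkm_to_tpm(toy_fpkm)
print(toy_tpm)
```

Because every sample is rescaled to the same total, TPM values are more directly comparable between samples than FPKM values, which is useful when modules from one dataset are tested in another.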

### Finding Modules of Interest and Functional Annotation

Because the module eigengene (ME) provides the most appropriate synopsis of the gene expression profiles of a given module, we correlated MEs with clinical traits. In this study, the clinical trait refers to whether the sample was an STS or normal fat tissue. Correlations were calculated using a linear regression model. The modules whose eigengenes showed high correlation were chosen as the modules of interest. To ascertain possible mechanisms by which genes within a module affect STS progression, functional enrichment analyses of the hub module using the KEGG and GO databases were performed with the "clusterProfiler" package in R (Yu et al., 2012).

<sup>1</sup>https://genome-cancer.ucsc.edu/

#### Identification of Hub Genes and Correlation Analysis

Hub genes are those that have a high degree of intra-module connectivity. In this study, hub genes were defined as the 20 module genes with the highest connectivity in the module of interest. To identify hub nodes, a PPI network was constructed by uploading all genes in the hub module to the Search Tool for the Retrieval of Interacting Genes (STRING) database<sup>2</sup>. The PPI network was then imported into the Cytoscape software platform, and a comprehensive analysis of the relationships between nodes was performed using the Maximal Clique Centrality (MCC) function within the "cytoHubba" application; MCC has been reported to be the most effective method of finding hub nodes in a co-expression network (Chin et al., 2014). In this way, the most cohesive genes were marked as "first stage nodes." In the PPI network of blue module genes, the 30 most highly ranked nodes were identified as "first stage nodes." Genes that were both hub genes in the module and "first stage nodes" in the PPI network were chosen as primary hub genes.
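The definition of hub genes by intra-module connectivity can be illustrated as follows: a gene's connectivity is taken as the sum of its adjacencies to all other genes in the same module, and the top-ranked genes are returned. This is a simplified sketch rather than the WGCNA implementation; the helper names, gene names and adjacency values are hypothetical.

```python
def intramodular_connectivity(adjacency, module_genes):
    """Sum each gene's adjacency to every other gene in the same module."""
    members = list(module_genes)
    return {
        gene: sum(adjacency[gene][other] for other in members if other != gene)
        for gene in members
    }


def top_hub_genes(adjacency, module_genes, k=20):
    """Return the k module genes with the highest intramodular connectivity."""
    connectivity = intramodular_connectivity(adjacency, module_genes)
    ranked = sorted(connectivity, key=lambda g: (-connectivity[g], g))
    return ranked[:k]


def symmetric(pairs):
    """Build a symmetric nested-dict adjacency from {(gene_a, gene_b): weight}."""
    adjacency = {}
    for (a, b), weight in pairs.items():
        adjacency.setdefault(a, {})[b] = weight
        adjacency.setdefault(b, {})[a] = weight
    return adjacency


# Hypothetical adjacencies within one module, for illustration only.
toy_adjacency = symmetric({
    ("G1", "G2"): 0.8, ("G1", "G3"): 0.6, ("G1", "G4"): 0.1,
    ("G2", "G3"): 0.4, ("G2", "G4"): 0.2, ("G3", "G4"): 0.1,
})
toy_module = ["G1", "G2", "G3", "G4"]
hubs = top_hub_genes(toy_adjacency, toy_module, k=2)
print(hubs)
```

Intersecting such a list with the top-ranked PPI nodes, for example with Python set operations, would give the primary hub genes described above.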

### Survival Analysis and Efficacy Evaluation

The web tool Gene Expression Profiling Interactive Analysis (GEPIA)<sup>3</sup> was used to perform overall survival and disease-free survival analyses for all hub genes. The platform utilizes the expression data and survival information of the TCGA database. Users can perform a survival analysis by simply submitting a gene name and selecting a tumor type. Patients were divided into two groups (high vs. low) according to whether the expression level of the hub gene was above or below its mean expression level. Furthermore, dataset GSE21050, which includes 310 STS samples with metastasis status and survival time, was used to test the significance of hub genes for metastasis survival. A Kaplan–Meier survival plot was constructed using the "survival" package in R (Li, 2003). Differential expression between STS and normal tissue in the training set was plotted as a box plot.

<sup>2</sup>http://www.string-db.org

<sup>3</sup>http://gepia.cancer-pku.cn

#### Gene Set Enrichment Analysis (GSEA)

In the training data set, 156 samples were dichotomized into two groups (High vs. Low) according to whether the ME value of the blue module was above or below its mean across all samples. GSEA was then performed between the two groups. The 3,000 most variable genes from the WGCNA were imported for enrichment. In this way, GSEA was used to validate the results of the GO and KEGG analyses of the blue module. The cut-off criterion for GSEA was FDR < 0.05.
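The dichotomization described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name, the sample IDs and the eigengene values are hypothetical, and assigning samples that sit exactly at the mean to the Low group is an assumption.

```python
def split_by_mean(me_values):
    """Split sample IDs into (high, low) groups around the mean ME value.

    Samples exactly at the mean go to the low group (an assumption).
    """
    mean_me = sum(me_values.values()) / len(me_values)
    high = sorted(s for s, v in me_values.items() if v > mean_me)
    low = sorted(s for s, v in me_values.items() if v <= mean_me)
    return high, low


# Hypothetical eigengene values for four samples (not data from the study).
example = {"S1": 0.12, "S2": -0.05, "S3": 0.30, "S4": -0.21}
high_group, low_group = split_by_mean(example)
```

The two resulting groups would then serve as the phenotype labels for GSEA.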

### RESULTS

### Co-expression Network Construction and Module Preservation Analysis

After discarding two outlier samples (GSM528297 and GSM528333), WGCNA was performed on the 3,000 most variable genes of 156 samples. The soft-threshold power was set to 6, at which the scale-free fit index R<sup>2</sup> was 0.916, which ensured a scale-free network (**Figure 2**). Following this, 11 co-expression modules were identified, ranging in size from 43 to 669 genes, with each module assigned a color (**Figure 3**).
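The scale-free fit index mentioned above measures how well the connectivity distribution follows a power law. The sketch below is a simplified illustration of that idea, assuming equal-width bins of connectivity and an ordinary least-squares fit of log10 p(k) against log10 k; the function name and toy data are hypothetical, and the routine in the WGCNA package may differ in its binning details.

```python
import math


def scale_free_r2(connectivity, n_bins=10):
    """R^2 of the linear fit log10 p(k) ~ log10 k over equal-width bins of k.

    Assumes positive connectivities that are not all identical.
    """
    lo, hi = min(connectivity), max(connectivity)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for k in connectivity:
        b = min(int((k - lo) / width), n_bins - 1)
        sums[b] += k
        counts[b] += 1
    # One point per non-empty bin: mean connectivity vs. relative frequency.
    xs, ys = [], []
    for total, count in zip(sums, counts):
        if count:
            xs.append(math.log10(total / count))
            ys.append(math.log10(count / len(connectivity)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Toy connectivities whose frequency halves each time k doubles (a power law).
toy = [1.0] * 8 + [2.0] * 4 + [4.0] * 2 + [8.0]
r2 = scale_free_r2(toy, n_bins=7)
```

In practice the index is evaluated for a range of candidate powers, and the lowest power whose fit exceeds the chosen cut-off is retained.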

By comparing the training dataset GSE21122 with the TCGA test dataset, we were able to establish whether the co-expression modules produced in the training dataset could be reproduced in the test dataset through summary preservation statistics. Three modules (black, brown, and magenta) demonstrated poor preservation, each with a Zsummary statistic < 10. The remaining modules, including the blue module, were sufficiently stable, suggesting they were preserved between the training data set and the test data set (**Figure 4**).
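The Zsummary thresholds used to judge preservation (2 and 10, as marked in Figure 4) can be written as a small helper. This is an illustrative sketch only: the function name is hypothetical, and labelling values of 2 or below as "none" is an assumption based on the usual reading of these thresholds.

```python
def preservation_evidence(zsummary):
    """Map a Zsummary statistic to a qualitative level of preservation evidence."""
    if zsummary > 10:
        return "strong"
    if zsummary > 2:
        return "low to moderate"
    return "none"  # assumption: values at or below 2 give no evidence


# Hypothetical Zsummary values, not results from the study.
labels = [preservation_evidence(z) for z in (25.0, 6.5, 1.2)]
```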

#### Finding Modules of Interest and Functional Annotation

It is important to identify the most significant modules related to STS. Both the black and blue modules showed a significantly high correlation with sarcomas (**Figures 5**, **6**). However, because of its poor preservation (Zsummary < 10), the black module was not analyzed further. Therefore, the blue module was defined as an important module of clinical significance and extracted for further analysis.

FIGURE 4 | medianRank and Zsummary statistics of the most variant gene modules in module preservation. In the preservation medianRank graph (left), a medianRank value close to zero indicates a high degree of module preservation. In the preservation Zsummary graph (right), the dashed black lines indicate the thresholds Z = 2, 10. These horizontal lines indicate Zsummary thresholds for strong evidence of conservation (above 10) and for low to moderate evidence of conservation (above 2).

To explore the biological relevance of the blue module, GO functional and KEGG pathway enrichment analyses were performed on the 414 genes in the blue module. The biological processes of these genes were associated with the cell cycle, such as mitotic nuclear division, chromosome segregation and sister chromatid segregation. In the KEGG pathway analysis, cell cycle-associated pathways such as DNA replication, cell cycle, p53 signaling pathway, oocyte meiosis and mismatch repair, and metabolism-associated pathways such as pyrimidine metabolism and purine metabolism, were enriched (**Figure 7**).

## Identification of Sarcoma Hub Genes in the Blue Module

Highly connected hub genes within a module perform important roles in tumor biological processes. Therefore, the 20 genes with greatest module relevance in the blue module were selected as candidate hub genes for STS (**Supplementary Data Sheet S1**). In addition, a PPI network in the blue module was constructed in accordance with the STRING database (**Figure 8**). Twelve of the 20 candidate genes in the co-expression network were also identified as hub nodes of the PPI network. Finally, these 12 genes were considered "primary" hub genes associated with STS and therefore selected for additional analyses.

## Survival Analysis and Efficacy Evaluation

When testing the TCGA dataset, four of the 12 hub genes demonstrated a significant association with overall and disease-free survival (**Figure 9**). When testing the GSE21050 dataset, these four hub genes showed a significant correlation with metastasis-free survival (**Figure 10**). Furthermore, they were expressed at significantly higher levels in STS tissue than in normal fat tissue (**Figure 11**).

## Gene Set Enrichment Analysis

To explore the potential function of the blue module and the hub genes, GSEA was performed to identify KEGG pathways enriched in samples with a higher ME level of the blue module. Five signaling pathways were significantly enriched: ubiquitin mediated proteolysis (FDR = 0.01), pyrimidine metabolism (FDR = 0.03), oocyte meiosis (FDR = 0.02), cell cycle (FDR = 0.04) and DNA replication (FDR = 0.04) (**Figure 12**). Moreover, the last four pathways were consistent with the results of the KEGG pathway analysis (**Figure 7D**).

## DISCUSSION

Soft tissue sarcomas remain among the most challenging diseases for medical oncologists to treat. STSs are mesenchymal neoplasms that can arise from any site within the body, including extremities, the trunk, retroperitoneum, head, and neck. These are biologically heterogeneous diseases of which greater than 50 subtypes exist, varying by molecular, histological and clinical characteristics.

In this study, WGCNA was utilized to construct a co-expression network for identification of gene co-expression modules associated with STS. The blue module was identified, and 20 hub genes were selected from this module. In addition, using the PPI network, 12 genes were identified as hub nodes of both the co-expression module and the PPI network, indicating that these 12 hub genes were closely related to STS and had important biological significance. Subsequent survival analysis established that four of the 12 hub genes (RRM2, BUB1B, CENPF, and KIF20A) were significantly associated with survival. We, therefore, focused on these four genes.

The ribonucleotide reductase regulatory subunit M2 (RRM2) is one of two subunits that constitute ribonucleotide reductase, the enzyme responsible for catalyzing the conversion of ribonucleotides into deoxyribonucleotides, and thus performs an important role in DNA synthesis. RRM2 helps control cellular functions, including DNA repair, cell proliferation and senescence, in a number of human malignant tumors. Importantly, RRM2 functions as a driver in a variety of tumors, with in vivo and in vitro experiments confirming that knocking down its expression using siRNA significantly inhibits tumor cell proliferation (Fang et al., 2016).

The BUB1 mitotic checkpoint serine/threonine kinase B (BUB1B) is a member of the spindle assembly checkpoint protein family, crucial for ensuring correct chromosome separation during cell division (Fu et al., 2016). BUB1B plays a role in inhibiting the expression of APC, an established tumor suppressor gene in most colorectal cancers. Accordingly, many reports have shown that upregulation of BUB1B is related to the recurrence and progression of bladder cancer (Yamamoto et al., 2007), gastric cancer (Ando et al., 2010), esophageal squamous cell carcinoma (Tanaka et al., 2008), breast cancer (Yuan et al., 2006), hepatocellular carcinoma (Liu et al., 2009) and others.

Centromere protein F (CENPF) is another important protein involved in chromosome segregation during mitosis. Upregulation of CENPF protein expression, especially through a gene amplification effect, suggests that high levels of CENPF protein may affect the occurrence of tumors, especially in the early stages of tumor development (Varis et al., 2006). Clinical research has demonstrated that high expression levels of CENPF result in poor prognosis in nasopharyngeal carcinoma (Cao et al., 2010), colorectal gastrointestinal stromal tumors (Chen et al., 2011), esophageal squamous cell carcinoma (Mi et al., 2013) and prostate cancer (Zhuo et al., 2015). It has also been shown to play an important role in driving hepatocellular carcinoma (Dai et al., 2013).

Kinesin family member 20A (KIF20A, also known as RAB6KIFL) belongs to the kinesin superfamily-6, is located in the Golgi apparatus, and contributes to intracellular organelle transport and cell division (Echard et al., 1998). Recently, it has been reported that KIF20A is associated with mitosis, cell adhesion, migration and proliferation. Furthermore, recent studies have demonstrated that KIF20A is involved in tumor progression and angiogenesis. High expression of KIF20A results in poor prognosis in patients with glioma (Duan et al., 2016; Saito et al., 2017), nasopharyngeal cancer (Liu et al., 2017), hepatocellular carcinoma (Shi et al., 2016), melanoma (Yamashita et al., 2012) and early-stage cervical squamous cell carcinoma (Zhang et al., 2016).

Regarding GSEA, cell cycle and metabolism-associated pathways were significantly enriched in samples with a higher ME level of the blue module. This is consistent with the initial GO and KEGG analysis results for the blue module and is related to the physiological function of these four hub genes.

In summary, through WGCNA and other related analysis methods, we identified four genes (RRM2, BUB1B, CENPF, and KIF20A) related to the progression and prognosis of STS. These genes may play a role by regulating the cell cycle and metabolism associated signaling pathways.

#### AUTHOR CONTRIBUTIONS

ZZ and DS designed the study. ZZ and ZJ performed the data collection. ZJ and LW performed the data analysis. ZZ and MZ drafted the manuscript. All authors read and approved the final version of the manuscript.


#### FUNDING

This study was supported by the Special Projects of Health in Jilin Province (3D5148273428).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00037/full#supplementary-material

FIGURE S1 | Data quality examination.

TABLE S1 | Code for WGCNA.

DATA SHEET S1 | Hub genes of blue module.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhu, Jin, Deng, Wei, Yuan, Zhang and Sun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Long Noncoding RNA RAET1K Enhances CCNE1 Expression and Cell Cycle Arrest of Lung Adenocarcinoma Cell by Sponging miRNA-135a-5p

Chang Zheng<sup>1,2</sup>, Xuelian Li<sup>2,3</sup>, Yangwu Ren<sup>2,3</sup>, Zhihua Yin<sup>2,3</sup> and Baosen Zhou<sup>1,2</sup>\*

<sup>1</sup> Department of Clinical Epidemiology, First Affiliated Hospital of China Medical University, Shenyang, China, <sup>2</sup> Department of Epidemiology, School of Public Health, China Medical University, Shenyang, China, <sup>3</sup> Key Laboratory of Cancer Etiology and Intervention, University of Liaoning Province, Shenyang, China

#### Edited by:

Monica Bianchini, University of Siena, Italy

#### Reviewed by:

Shaoli Das, National Institutes of Health (NIH), United States

Kashmir Singh, Panjab University, India

\*Correspondence: Baosen Zhou, bszhou@cmu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 21 August 2019 Accepted: 10 December 2019 Published: 17 January 2020

#### Citation:

Zheng C, Li X, Ren Y, Yin Z and Zhou B (2020) Long Noncoding RNA RAET1K Enhances CCNE1 Expression and Cell Cycle Arrest of Lung Adenocarcinoma Cell by Sponging miRNA-135a-5p. Front. Genet. 10:1348. doi: 10.3389/fgene.2019.01348

Molecular dysregulation is believed to participate in the onset and progression of lung adenocarcinoma (LUAD). This study aimed to identify and evaluate the potential key long noncoding RNAs (lncRNAs) involved in the significant dysfunctional process of LUAD. We found that lncRNA retinoic acid early transcript 1K (RAET1K) was upregulated in tumor tissues and was correlated with a poor prognosis of patients with LUAD; further, for the first time, we investigated the biological roles of RAET1K. Weighted gene correlation network and gene set enrichment analysis revealed that high RAET1K expression is related to cell cycle dysfunction through upregulated cyclin E1 (CCNE1) by targeting miR-135. The dual-luciferase reporter gene assay was performed to clarify the binding relationship between RAET1K and miR-135a-5p in transgenic A549 and H1299 cells. Real-time PCR and Western blot analyses showed that RAET1K overexpression and miR-135a-5p inhibition exerted a strong synergistic effect on CCNE1 expression, and cell cycle flow cytometry analysis was used to confirm the arrest of A549 and H1299 cells at the G1/S phase. The lncRNA RAET1K/miR-135a-5p axis might participate in the regulation of LUAD progression by influencing CCNE1 expression and the accumulation of cells arrested at the G1/S phase boundary.

Keywords: RAET1K, cell cycle, lung adenocarcinoma, long noncoding RNA, gene regulatory networks

## INTRODUCTION

The latest report released by the International Agency for Research on Cancer has stated that lung cancer (LC) remains the most common and deadly form of malignancy (Siegel et al., 2017; Bray et al., 2018). In general, surgery is the best option for treating patients with early stage disease because the five-year survival rate of pathological stage I non-small cell LC (NSCLC) after lobectomy is 45%–65% (Ettinger et al., 2015). However, approximately 70% of patients are diagnosed in the late stage of the disease; therefore, the five-year survival rate of these patients is only 16.38% (Ettinger et al., 2015). Lung adenocarcinoma (LUAD) is the most common type of NSCLC, accounting for approximately 40% of cases (Ferlay et al., 2010). Therefore, the focus of the present study was limited to the complex molecular mechanisms leading to the onset and poor prognosis of LUAD.

Dysregulation of the cell cycle results in increased cell proliferation, and the abnormal expression of cell cycle regulators can lead to tumor formation (Otto and Sicinski, 2017). Various chemotherapeutic agents have been developed to target the cell cycle (Ingham and Schwartz, 2017). For example, cisplatin is one of the most successful anticancer drugs used to nonspecifically block the cell cycle (Besse and Le Chevalier, 2012). By focusing on the complex gene networks that cause dysregulation of cell cycle regulators, a potential strategy for the treatment of LC could be developed.

Previous studies have reported that noncoding RNAs, such as long noncoding RNAs (lncRNAs) and microRNAs (miRNAs), are involved in cell cycle processes (Djebali et al., 2012). Furthermore, it has been widely reported that lncRNAs functioning as competing endogenous RNAs (ceRNAs) can regulate cancer by sponging miRNAs (Salmena et al., 2011; Dong et al., 2018; Dong et al., 2019). Despite the rapid evolution of genomic technologies and analytical tools, the identification of novel lncRNA-related ceRNA networks affecting the cell cycle and ultimately influencing LUAD remains challenging. Therefore, the present study aimed to investigate lncRNA expression profiles from The Cancer Genome Atlas (TCGA) database via bioinformatics analysis to identify novel lncRNAs and their related biological functions. This analysis initially identified that lncRNA retinoic acid early transcript 1K (RAET1K) was significantly upregulated. Furthermore, we revealed that the upregulated expression of lncRNA RAET1K was correlated with poor prognosis in LUAD patients and facilitated cell cycle arrest at the G1 phase by functioning as a ceRNA to upregulate cyclin E1 (CCNE1).

### MATERIAL AND METHODS

#### Data Sets and Preprocess

The RNA and miRNA sequence data of LUAD and corresponding clinical information were downloaded from the TCGA database (https://cancergenome.nih.gov). The study cohort consisted of 564 LUAD patients with level 3 Illumina HiSeq RNA sequencing (RNA-seq) data and 505 patients with level 3 miRNA sequencing (miRNA-seq) data. On the basis of the clinical traits of the patients, the samples were classified into two groups: early stage (stages I and II) and advanced stage (stages III and IV). Gene symbols and types were converted from the transcript IDs of the RNA-seq data using Genome Reference Consortium Human Build 38 patch release 12 (GRCh38.p12) of the Ensembl genome browser (http://asia.ensembl.org/biomart). The DESeq2 package (Love et al., 2014) was used to normalize raw data sets and identify differentially expressed genes (DIFF-genes). The cutoff values were an absolute log2 fold change of ≥ 2 and an adjusted probability (P) value of ≤ 0.01.
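The cutoff rule above can be illustrated with a short filter. This is a sketch under stated assumptions and not the DESeq2 workflow itself: it presumes that per-gene log2 fold changes and adjusted P values have already been computed, and the function name, gene labels and numbers are hypothetical.

```python
def select_diff_genes(results, lfc_cutoff=2.0, padj_cutoff=0.01):
    """Return (upregulated, downregulated) gene lists.

    `results` maps gene -> (log2 fold change, adjusted P value or None).
    Genes without an adjusted P value are skipped.
    """
    up, down = [], []
    for gene, (lfc, padj) in results.items():
        if padj is None or padj > padj_cutoff:
            continue
        if lfc >= lfc_cutoff:
            up.append(gene)
        elif lfc <= -lfc_cutoff:
            down.append(gene)
    return sorted(up), sorted(down)


# Hypothetical results table, not values from the study.
toy_results = {
    "geneA": (3.1, 0.001),   # passes both cutoffs, upregulated
    "geneB": (-2.4, 0.005),  # passes both cutoffs, downregulated
    "geneC": (2.8, 0.2),     # fails the adjusted P cutoff
    "geneD": (0.9, 0.0001),  # fails the fold-change cutoff
    "geneE": (4.0, None),    # no adjusted P value available
}
up_genes, down_genes = select_diff_genes(toy_results)
```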

#### Construction of Co-Expression Networks

The R package for weighted correlation network analysis (WGCNA) was used to build co-expression networks (Langfelder and Horvath, 2008). Significant DIFF-genes were selected to generate co-expression networks for both the early and advanced stages of NSCLC. Briefly, a connection-weighted adjacency matrix of pair-wise genes was initially built according to unsupervised classifications. In accordance with the scale-independent topological criterion, the soft threshold value was set to 5 on the basis of a correlation coefficient threshold of 0.85 (Zhang and Horvath, 2005). Thereafter, a topological overlap matrix (TOM) was built from the adjacency matrix. The dynamic tree cutting method was used to cluster DIFF-genes into modules, with 30 as the minimum module size and 0.25 as the cluster merge height. Each module contained genes with similar expression patterns. The gray module consisted of a cluster of unclassified genes. After defining the modules, the module eigengene (ME) values were calculated for all genes in each module. The correlations between the ME values and the LUAD patient clinical traits were calculated (Langfelder and Horvath, 2007). Several significantly associated gene sets were chosen for functional enrichment analysis.
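The adjacency and TOM steps described above can be sketched in a few lines. This is a minimal illustration assuming an unsigned network, in which the adjacency is the absolute correlation raised to the soft-threshold power and the topological overlap follows the definition of Zhang and Horvath (2005); the function names and the toy correlation matrix are hypothetical, and the WGCNA package offers further network types not shown here.

```python
def adjacency(cor, power=5):
    """Unsigned soft-threshold adjacency: |cor|**power, with a zero diagonal."""
    n = len(cor)
    return [[abs(cor[i][j]) ** power if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity; the diagonal is zero
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom


# Toy correlation matrix: genes 0 and 1 are perfectly correlated,
# gene 2 is uncorrelated with both (not data from the study).
toy_cor = [[1.0, 1.0, 0.0],
           [1.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
toy_tom = topological_overlap(adjacency(toy_cor, power=5))
```

Hierarchical clustering is then typically run on the dissimilarity 1 − TOM before the dynamic tree cut.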

#### Prognostic Analysis

Survival analysis was performed with SPSS Statistics for Windows, version 17.0 (SPSS, Inc., Chicago, IL, USA). On the basis of the gene expression value of the lower or upper quartile, samples were categorized into two groups: low-exp and high-exp. The hazard ratio (HR) and estimated 95% confidence interval (CI) were calculated using the Cox proportional hazard regression model. Kaplan-Meier curves were plotted to estimate the overall survival (OS), and the log rank test was used for univariate comparisons. A P value < 0.05 was considered statistically significant. Furthermore, a nomogram was generated using a multivariate Cox regression model to evaluate the potential prognostic signature of lncRNA RAET1K for OS of LUAD patients.
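For readers unfamiliar with the Kaplan-Meier estimate, the product-limit calculation can be sketched as below. The analysis in this study was run in SPSS; this standalone function, its name and the toy follow-up data are illustrative assumptions only.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each event time.

    `events` holds 1 for an observed death and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy follow-up times with one censored subject at t = 2 (not study data).
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Curves computed separately for the low-exp and high-exp groups would then be compared with the log rank test.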

#### Function Annotation and Gene Set Enrichment Analysis (GSEA)

Gene ontology (GO) enrichment analysis was performed to identify the biological processes (BPs) of the module. Relevant genes in the Database for Annotation, Visualization, and Integrated Discovery (DAVID) were visualized using bubble plots. The DIFF-genes in specific modules were clustered into various Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway ontologies using the ClueGO plug-in, which visualizes nonredundant biological terms for large clusters of genes in a functionally grouped network (Bindea et al., 2009). According to the gene expression level, GSEA was performed to identify the BPs and biological functions of hub genes clustered into the modules (Subramanian et al., 2005). For miRNAs, the miRcode database (Jeggari et al., 2012) was used to identify target genes and binding sites based on seed complementarity and evolutionary conservation of the seed region of the miRNAs.
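The idea behind over-representation analysis of a GO term or KEGG pathway can be sketched with a one-sided hypergeometric test. This is a generic illustration and not the exact statistic reported by DAVID or ClueGO (DAVID, for example, uses a more conservative modified Fisher exact score); the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when drawing n_selected genes without replacement
    from n_background genes, n_annotated of which carry the term."""
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, x) * comb(n_background - n_annotated, n_selected - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 5 annotated to the term, 5 in the module,
# and all 5 module genes carry the annotation (not data from the study).
p_toy = enrichment_pvalue(10, 5, 5, 5)
```

In a real analysis the resulting P values are corrected for the number of terms tested.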

## Cell Lines and Culture Conditions

Human LUAD A549 and H1299 cell lines were routinely cultured in a Roswell Park Memorial Institute 1640 medium (Gibco, Carlsbad, CA, USA) supplemented with 10% fetal bovine serum and 100 U/ml of penicillin/streptomycin (Beijing Solarbio Science & Technology Co., Ltd., Beijing, China) in an incubator (Thermo Fisher Scientific, Waltham, MA, USA) at 37°C under an atmosphere of 5% CO2/95% air, as previously described (Zheng et al., 2018).

## Cell Transfection

Cells were inoculated into the wells of a six-well plate before transfection. The RAET1K overexpression lentivirus and a negative control (NC) lentivirus were purchased from GenePharma Co., Ltd. (Shanghai, China). The cells in each well were transfected with 10<sup>6</sup> lentiviruses. Four days later, the transfection efficiency was evaluated by determining the proportion of green fluorescent protein-positive cells. A medium supplemented with 2 mg/ml of puromycin was used to screen out the A549 and H1299 cells that were unsuccessfully transfected with the RAET1K and NC lentiviruses.

Cells were transiently transfected with a group of miR-135a-5p mimics and inhibitors (GenePharma Co., Ltd.) by using jetPRIME® transfection reagent (Polyplus-transfection S.A., Illkirch-Graffenstaden, France), as previously described (Zheng et al., 2018). The cells were harvested at 24 h after transfection for further use.

#### RNA Isolation and Real-Time Polymerase Chain Reaction (RT-PCR) Analysis

Total RNA was extracted using the NucleoSpin RNA Plus kit (TaKaRa Biotechnology [Dalian] Co., Ltd., Dalian, China) in accordance with the manufacturer's protocol. RNA was reverse-transcribed to complementary DNA (cDNA) using the PrimeScript RT Reagent Kit (TaKaRa Biotechnology [Dalian] Co., Ltd.). RT-PCR analysis was performed using SYBR Green Master Mixture reagent (Takara Bio, Inc., Kusatsu, Shiga, Japan) and an ABI 7500-Fast Real-Time PCR system (Applied Biosystems, Carlsbad, CA, USA). The cycling conditions for cDNA amplification are described elsewhere (Zheng et al., 2018). The fold change in relative gene expression was calculated using the 2<sup>−ΔΔCt</sup> method with glyceraldehyde 3-phosphate dehydrogenase (GAPDH) as an internal reference. The primers used for RT-PCR are listed in Supplementary Table S1.
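The 2<sup>−ΔΔCt</sup> calculation can be made explicit with a short worked example. The function name and the Ct values below are hypothetical and serve only to show the arithmetic, with the reference gene playing the role of GAPDH.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference) within each condition
    ddCt = dCt(treated) - dCt(control)
    """
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    return 2.0 ** -(dct_treated - dct_control)


# Hypothetical Ct values: target at 22 (treated) and 25 (control),
# reference gene at 18 in both conditions.
fold = fold_change_ddct(22.0, 18.0, 25.0, 18.0)
```

Here ΔCt is 4 in the treated sample and 7 in the control, so ΔΔCt is −3 and the estimated fold change is 2<sup>3</sup> = 8.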

#### Western Blot Analysis

Total protein isolated from cells was sonicated in ice-cold radioimmunoprecipitation assay lysis buffer (Pierce Biotechnology, Waltham, MA, USA). Denatured proteins were separated by sodium dodecyl sulfate polyacrylamide gel electrophoresis and then transferred to a polyvinylidene fluoride membrane (EMD Millipore Corporation, Billerica, MA, USA), which was blocked with Tris-buffered saline and 5% skim milk for 2 h. The membrane was incubated with primary antibodies against cyclin E1 (CCNE1) (catalog no. 20808; dilution, 1:1000; Cell Signaling Technology, Inc., Danvers, MA, USA) at 4°C overnight. After rinsing, the membrane was incubated with horseradish peroxidase-conjugated anti-rabbit secondary antibody (#7074; dilution, 1:1000; Cell Signaling Technology, Inc.). The protein bands were visualized using an enhanced chemiluminescence kit (Wanleibio Co., Ltd., Shenyang, China) and the ChemiDoc™ Touch Imaging System (Bio-Rad Laboratories, Hercules, CA, USA). The gray intensity was determined using ImageJ software (https://imagej.nih.gov/ij/) and normalized to that of GAPDH (#2118; dilution, 1:5000; Cell Signaling Technology, Inc.).

## Flow Cytometry Analysis

The cells were fixed with ice-cold 70% ethanol overnight and then resuspended in the staining solution included with the cell cycle detection kit (Nanjing KeyGen Biotech. Co. Ltd., Nanjing, China). After incubation for 1 h at 37°C in the dark, the stained cells were analyzed by fluorescence-activated cell sorting (FACS) using the BD FACSCalibur™ Cell Analyzer system (BD Biosciences, San Jose, CA, USA).

#### Dual-Luciferase Reporter Assay

A fragment of the wild-type (WT) RAET1K 3'-untranslated region (RAET1K-3'UTR-wt) contained a binding site downstream of the luciferase reporter gene, whereas the mutant-type RAET1K (RAET1K-3'UTR-mut) contained mutated binding sites (GenePharma Co., Ltd.). A549 and H1299 cells were seeded in the wells of 24-well plates, cultured until attachment, and co-transfected with miR-135a-5p mimics, miR-135a-5p inhibitors or miR-NC together with the luciferase plasmids (RAET1K-3'UTR-wt or RAET1K-3'UTR-mut). Luciferase gene expression was monitored using the Dual-Luciferase® Reporter Assay System (Promega Corporation, Madison, WI, USA), as described previously (Zheng et al., 2018). The results of experiments performed in triplicate were normalized to Renilla luciferase activity values.

## Statistical Analysis

Data are presented as the mean ± standard deviation. All statistical analyses were performed using Prism 8.0 software (GraphPad Software, Inc., La Jolla, CA, USA). Student's t-test and one-way analysis of variance were used to analyze two groups and more than two groups, respectively. The Pearson's correlation coefficient was used to identify correlations. Analysis of each sample was performed in triplicate. P < 0.05 was considered statistically significant.

### RESULTS

## Significant Genes and Clusters With Functions Related to LUAD

#### DIFF-Genes in Early and Advanced Stages of LUAD

The LUAD database included 24,989 genes from 564 tissue samples, which included 59 adjacent noncancerous tissues, 395 early stage LUAD tissues (274 stage I and 121 stage II), and 110 advanced stage LUAD tissues (84 stage III and 26 stage IV). In total, 1,069 and 425 DIFF-genes were upregulated and downregulated in early stage LUAD, respectively (Figure 1A), whereas 888 and 516 were upregulated and downregulated in advanced stage LUAD, respectively (Figure 1B). In total, 991 DIFF-genes shared by the early and advanced stages were used to construct the weighted correlation network.

#### Construction of the Gene Co-Expression Network in LUAD

WGCNA was performed for 991 DIFF-genes. First, potential hub genes in each module were investigated to identify correlations with the clinical features of LUAD patients. The generalized TOM defined the relationships of each pair of DIFF-genes from the adjacency matrix. The hierarchical clustering tree method detected that four modules contained DIFF-genes that highly correlated with LUAD, as depicted in turquoise, brown, blue and green color (Figure 1C). In the middle of the TOM network, a heatmap of the independent genes in different modules was constructed. The genes clustered into the blue and turquoise modules were significantly co-expressed with each other.

The DIFF-genes in each module were spontaneously clustered according to the following clinical features: early stage, advanced stage, tumor size (T), lymph node involvement (N), and presence of metastasis (M). Module trait relationships were calculated by correlating the ME values with the clinical features (Figure 1D). There were no significantly positive modules related to early stage disease or other clinical traits. However, the genes in the blue and brown modules were significantly and positively correlated with advanced stage disease. The genes in the blue module showed a strong association (correlation coefficient = 0.8, Figure 1E) and were chosen for subsequent analyses.

#### Functional Enrichment Analysis of Selected Modules

To describe the BPs and mechanisms of hub genes, GO functional enrichment analysis of the 203 DIFF-genes in the blue module was performed using DAVID as a reference. The top 10 BPs were visualized using a bubble plot (Figure 1F), which showed that most of the DIFF-genes were involved in the cell cycle (Supplementary Table S2). Furthermore, ClueGO was used to identify enriched KEGG pathways among the DIFF-genes in the blue module (Figure 1G). In total, 168 protein-coding RNAs in the blue module were grouped into six significant KEGG pathways (P ≤ 0.05). The red nodes contained 23 genes enriched in the cell cycle pathway (Supplementary Table S3).

#### Function of RAET1K as a Key Gene in LUAD

#### Detection of Significant Genes in the Blue Module

According to GRCh38.p12, 12 lncRNAs and 191 mRNAs were assigned to the blue module. To further validate the hub genes and identify potential biomarkers for LUAD, Cox proportional hazard and Kaplan-Meier analyses of the genes in the blue module were performed. In total, 141 highly expressed hub genes were significantly associated with poor prognosis. Because only one of the 141 significant genes in the blue module was an lncRNA, we focused on this lncRNA, RAET1K, for further biological study.

#### RAET1K Is Highly Expressed in LUAD and Positively Correlated With the Prognosis of LUAD

RAET1K (HR = 1.428; 95% CI = 1.052–1.939; P = 0.022, Figure 2A) was the only lncRNA among the 141 hub genes that was significantly upregulated in tumor tissue compared with normal tissue (Figure 2H). Furthermore, a nomogram was constructed to predict 1- and 3-year survival rates in patients with LUAD by showing the risk score of clinical stage, age, sex, and RAET1K expression level (Figure 2B). The concordance index, which was evaluated using the calibration plot of this nomogram model, further supported the predictive prognostic signature of lncRNA RAET1K in LUAD OS (Figure 2C).
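The concordance index mentioned above summarizes how often a model ranks patients correctly by risk. A minimal sketch of Harrell's concordance index is shown below; the function name and the toy data are hypothetical, and this is not the routine used to produce Figure 2C.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs in which the patient with
    the shorter observed survival has the higher predicted risk.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] == 1) and times[i] < times[j]. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data: risk decreases as survival time increases (not study data).
c_perfect = concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0])
c_reversed = concordance_index([1, 2, 3], [1, 1, 1], [1.0, 2.0, 3.0])
```

A value of 0.5 corresponds to random ranking, and values closer to 1 indicate better discrimination.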

#### RAET1K May Regulate the Cell Cycle Phase in LUAD

To further explore the biological functions of RAET1K, GO enrichment for GSEA was performed. The LUAD samples with higher expression levels of RAET1K were enriched in genes correlated with cell cycle biological behavior. The GSEA results also indicated that among the genes in the blue module, lncRNA RAET1K expression was enriched in the cell cycle (Figure 2D).

#### The RAET1K/miR-135a-5p Axis May Influence the Cell Cycle via CCNE1 in LUAD Patients

lncRNAs can regulate mRNA expression via miRNA-mediated ceRNAs (Salmena et al., 2011). According to the ceRNA hypothesis, the expression of ceRNA transcripts that harbor the same miRNA binding sites should change in parallel. The interaction between the ceRNA network and RAET1K, which combines expression correlation and target sites, is shown in Figure 2I. Among the genes influencing OS, mRNAs that were positively correlated with RAET1K according to the Pearson's correlation coefficient (r > 0.3 and P < 0.05, Figure 2E) and miRNAs that were negatively correlated with RAET1K and the mRNAs (r < -0.3 and P < 0.05, Figures 2F, G) were selected and then combined with the miRcode database, which was used to predict miRNA-interacting targets. As shown in Figure 2I, RAET1K may function as a sponge to absorb miR-135a-5p and thereby modulate CCNE1 expression.
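The correlation-based selection step can be sketched as follows. This illustration applies only the |r| > 0.3 cutoff; the accompanying P < 0.05 criterion requires a t distribution and is left to a statistics library. The function names, transcript labels and expression values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_partners(target, candidates, r_cutoff=0.3):
    """Split candidates into positively and negatively correlated names."""
    positive, negative = [], []
    for name, values in candidates.items():
        r = pearson_r(target, values)
        if r > r_cutoff:
            positive.append(name)
        elif r < -r_cutoff:
            negative.append(name)
    return sorted(positive), sorted(negative)


# Hypothetical expression profiles across four samples (not study data).
lncrna = [1.0, 2.0, 3.0, 4.0]
profiles = {
    "mRNA_A": [2.0, 4.0, 6.0, 8.0],     # rises with the lncRNA
    "miRNA_B": [8.0, 6.0, 4.0, 2.0],    # falls as the lncRNA rises
    "mRNA_C": [1.0, -1.0, -1.0, 1.0],   # uncorrelated
}
pos, neg = correlated_partners(lncrna, profiles)
```

Candidates passing the correlation filter would then be intersected with the predicted miRNA binding sites.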

#### The RAET1K/miR-135a-5p Axis Arrested LUAD Cells in the G1 Phase by Upregulating CCNE1

#### RAET1K Regulated CCNE1 by Sponging miR-135a-5p

Subsequently, to investigate the validity and potential biological mechanisms of the effects of the RAET1K/miR-135a-5p axis on CCNE1 expression, in vitro experiments with A549 and H1299 cells were performed. The efficiency of RAET1K overexpression lentivirus interference was confirmed by RT-PCR (Figure 3A). To further investigate the synergistic effect of the RAET1K/miR-135a-5p axis on CCNE1 expression, A549 and H1299 cells were transfected with lentiviral vectors stably overexpressing RAET1K and an empty control (hereafter referred to as A549RAET1K, A549Con, H1299RAET1K, and H1299Con cells, respectively).

FIGURE 1 | Detection of significant genes and their functions related to lung adenocarcinoma (LUAD). Volcano plots show the fold change (FC) and P-values of differentially expressed genes in early-stage (A) and advanced-stage (B) LUAD versus normal samples. Blue nodes represent significantly down-regulated genes, and red nodes represent up-regulated genes. Grey nodes are not differentially expressed. RAET1K and CCNE1 expression are annotated. (C) In the central topological overlap matrix (TOM) heatmap, every row and column represents one gene; light color represents low weighted correlation, while darker red represents higher weighted correlation. The dynamic tree cluster dendrogram of DIFF-genes is shown on the left and top; gray squares indicate genes that are involved in any known module. (D) LUAD module-clinical feature relationships. Each row corresponds to a clinical trait (early stage, advanced stage, T for tumor size, N for lymph node and M for metastasis) and each column corresponds to a gene module. The correlation between module and clinical trait is shown in each cell. The darker the color, the higher the degree of correlation. Red represents positive correlation, while blue represents negative correlation. (E) Scatterplot of gene significance and module membership in the blue module. Correlation coefficients and P-values are shown at the top. (F) Bubble plots show the top 10 terms of the gene ontology (GO) enrichment analysis in biological process for the blue module. The Y-axis corresponds to the GO terms. The gene counts and -log (enrichment P-value) of every GO term are proportional to the area and color of the bubble, respectively. (G) Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of genes in the blue module. The small nodes in the network represent the genes enriched in the specific pathway, and the large nodes represent pathway terms. The node colors correspond to the ClueGO-determined KEGG pathway clusters.

FIGURE 2 | Identification of lncRNA RAET1K function and biological mechanism. (A) The Kaplan‐Meier curve of the risk score for the overall survival of RAET1K in lung adenocarcinoma (LUAD). The blue line represents the group with lower RAET1K expression, and the red line represents the group with higher expression. Gene enrichment plots show gene set enrichment analysis (GSEA) between high and low RAET1K expression. (B) The nomogram of clinical features and RAET1K expression level for predicting 1- and 3-year survival with the risk score. (C) The calibration plot indicated that this nomogram model had predictive power for overall survival. (D) The upper enrichment plots contain the values of the genes' enrichment scores, and the corresponding barcode plot shows the gene positions. In the bottom heatmap, red represents Spearman correlations with higher expression levels of RAET1K, and blue represents Spearman correlations with lower expression levels of RAET1K. RAET1K and CCNE1 expression levels were positively correlated with each other (E), while RAET1K (F) and CCNE1 (G) were negatively correlated with miR-135a. (H) RAET1K and CCNE1 expression were upregulated in both early and advanced stages of LUAD, while miR-135a was downregulated, \*\*\*P < 0.001. (I) Construction of the ceRNA network of lncRNA-miRNA-mRNA in the blue module. The green diamond node is lncRNA RAET1K, the blue circle nodes are mRNAs, and the pink circle nodes are miRNAs. The lines between nodes represent their relations, and the red lines show that RAET1K targeted miR-135a-5p and CCNE1.

FIGURE 3 | Overexpression of RAET1K upregulated CCNE1 by sponging miR-135a-5p. (A) The interference efficiency of the RAET1K overexpression lentivirus was detected by real-time PCR in A549 and H1299 cells. Relative CCNE1 mRNA expression levels after co-transfection with miR-135a-5p (or inhibitor) and RAET1K in A549 (B) and H1299 (C) cell lines, while cyclin E1 protein levels were measured by Western blot in A549 (D and F) and H1299 (E and G) cells. Bands were quantitatively compared with the respective negative control groups. Data are represented as means ± S.D. from three independent experiments, \*P < 0.05, \*\*P < 0.01, \*\*\*P < 0.001. Con., control; inh NC, miRNA-135a-5p inhibitor negative control; inh, inhibitor; NC, negative control; mi, mimics.

Thereafter, A549RAET1K, A549Con, H1299RAET1K, and H1299Con cells were transfected with miR-135a-5p mimics, an inhibitor, an NC, or an NC inhibitor.

RT-PCR analyses of A549Con and H1299Con cells showed that miR-135a-5p inhibition resulted in 1.5- and 2.2-fold increases, respectively, in CCNE1 mRNA expression relative to the NCs (Figures 3B, C, left panel). We observed that overexpression of RAET1K enhanced the effect of miR-135a-5p inhibition, as compared with NC (2.4- and 2.9-fold increases in A549RAET1K and H1299RAET1K, respectively, Figures 3B, C, right panel).

Western blot analysis showed that cyclin E1 protein levels followed similar trends (Figures 3D–G). We observed that transfection of A549Con and A549RAET1K cells with miR-135a-5p mimics reduced cyclin E1 protein expression levels, whereas miR-135a-5p inhibitors had the opposite effect (Figure 3D). Consistently, cyclin E1 protein expression showed similar tendencies, with higher fold changes, in H1299Con and H1299RAET1K cells co-transfected with the miR-135a-5p inhibitor compared with those co-transfected with the NC inhibitor (Figure 3E). Additionally, although miR-135a-5p mimics significantly decreased cyclin E1 protein expression, this change was rescued by RAET1K overexpression, thereby indicating that the change in cyclin E1 protein expression in response to RAET1K and miR-135a-5p was due to posttranscriptional modulation in both A549 and H1299 cells. Considering these results, lncRNA RAET1K promoted CCNE1 mRNA expression, probably via the downregulation of miR-135a-5p expression.

#### RAET1K as a Target of miR-135a-5p

The expression levels of miR-135a-5p and RAET1K were inversely correlated in LUAD tissues and cell lines. Bioinformatics analysis predicted that RAET1K was a potential target of miR-135a-5p. Figure 4A describes a putative interaction of RAET1K-3'UTR and modified RAET1K-3'UTRmut with the miR-135a-5p binding sequence. The luciferase reporter assay was performed to validate the interactions between miR-135a-5p and RAET1K in A549 and H1299 cells. Relative luciferase activity was inhibited by co-transfection with the miR-135a-5p mimics and the luciferase reporters containing RAET1K-3'UTR. However, inhibition was relatively weak in the RAET1K-3'UTR-mut group (Figures 4B, C). Luciferase activity was enhanced with the use of the miR-135a-5p inhibitor (Figures 4B, C).

#### The RAET1K/miR-135a-5p Axis Arrested LUAD Cells in the G1 Phase

To determine whether the RAET1K/miR-135a-5p axis exerted synergistic effects on cell cycle progression, cell cycle distributions were investigated following the co-transfection of RAET1K and miR-135a-5p mimics or an inhibitor in A549 and H1299 cells. Although the proportions of A549Con cells in the various cell cycle phases were not significantly altered by miR-135a-5p expression levels, a tendency for such alterations was observed (Figure 4D). In comparison with the NC group, transfection with the miR-135a-5p inhibitor decreased the number of A549RAET1K cells in the G1 phase, whereas a larger proportion were observed in the S phase (Figure 4D).

Similar, yet significant, tendencies were observed in H1299 cells. As compared with the NC inhibitor group, the use of the miR-135a-5p inhibitor resulted in fewer H1299Con and H1299RAET1K cells arrested in the G1 phase than in the S phase (Figure 4E). In addition, lncRNA RAET1K overexpression enhanced the inhibition of cells arrested in the G1 phase. As compared with the NC group, transfection of H1299Con cells with the miR-135a-5p mimics increased the number of cells accumulated in the G1 phase; however, RAET1K overexpression rescued this accumulation (Figure 4E). Moreover, histograms of the cell cycle were created (Figures 4F, G). The results showed that RAET1K overexpression with decreased miR-135a-5p could synergistically arrest the A549 and H1299 cells in the G1 phase and hinder cell cycle transformation from the G1 to S phase.

## DISCUSSION

To identify significant lncRNAs in LUAD, comprehensive computational analysis of transgenic cells was performed. The results showed that lncRNA RAET1K regulated the expression of CCNE1 in LUAD and served as ceRNA to sponge miR-135a, whereas CCNE1 was targeted in cells arrested at the G1-S phase boundary. It is important to understand the pathological cell cycle process that is associated with the dysregulation of cell proliferation leading to cancer (Bertoli et al., 2013). The dynamic progression of the cell cycle consists of four sequential phases: S (chromosome replication), M (chromosome segregation), and G1 and G2 (gap), which are regulated by cyclin/cyclin-dependent kinases (Dai et al., 2018). In particular, cyclin E/Cdk2 interacts and forms complexes that promote G1 progression and G1/S transition (Sonntag et al., 2018). The amplification of cyclin E, which functions in cell cycle progression, inhibition of apoptosis, transcription, and replication, and DNA repair, has been observed in various types of cancer (Kanska et al., 2016; Vijayaraghavan et al., 2017). Furthermore, cyclin E1 can be modulated by multiple regulators, such as the transcription factors c-Myc, retinoblastoma, and E2F (Thurlings and de Bruin, 2016), as well as by miRNA-mediated inhibitors miR-15/16 (Yuan et al., 2019) and miR-424-5p (Jiang et al., 2019) at the transcriptional, posttranscriptional, and translational levels.

The rapid evolution of genomic technologies and analytical tools has improved the understanding of traditional simple gene mutations in cancer genomics. Furthermore, elucidation of the complex networks of genomic alterations in LC has provided a basic understanding of the biological consequences and alterations of signal transduction pathways (Chin et al., 2011). A range of evidence suggests that diversity and complex molecular functions of lncRNAs may regulate epigenetic processes, particularly by acting as ceRNAs to sponge miRNAs. To identify novel LUAD-specific lncRNAs, differential analysis was performed during the early and advanced stages using normal tissues in the TCGA LUAD cohort. Different genes in both subsets were selected to facilitate the next step. The coexpression gene network was detected by WGCNA, which is a systematic biological method to identify synergistically altered gene clusters, candidate biomarkers, and therapeutic targets. According to the WGCNA results, DIFF-genes in the blue module were related to the LUAD clinical stage and were enriched in cell cycle-related functions. Cell cycle dysfunction in LUAD was consistent with our results. A recent study demonstrated that cell cycle-related genes, such as E2F1 (Chen et al., 2019), were enriched during the regulation of the cell cycle progression (Li et al., 2018; Qi et al., 2019). In the present study, we found that lncRNA RAET1K could promote cell cycle dysfunction, providing insight into the crosstalk regulatory mechanism between lncRNAs and coding genes. Interestingly, GSEA results also showed that some cell cyclin proteins and CDK family members were classified by the median of RAET1K expression level including PBK, KIF14, NEK2, CCNE1, CDC45, and DENPF, among others. In addition to the survival prediction of RAET1K, a Kaplan-Meier curve and a nomogram of integrating clinical traits were constructed. Indeed, RAET1K attracted our attention. Liang et al. (Sui et al., 2019) reported that RAET1K was predictive of the prognosis of LUAD patients in a TCGA cohort, which is consistent with our results; however, this was not further verified at the molecular level. To the best of our knowledge, no study has investigated the underlying molecular mechanism of RAET1K in patients with LUAD.

lncRNA RAET1K is a conversely processed transcript at 6q25.1 composed of four exons and is 1,883 bp in length. The key mechanism of lncRNA RAET1K as a ceRNA is to competitively combine the same miRNA with cross-regulated genes by sharing the miRNA response elements in the 3'-UTR of the target genes. We hypothesized that RAET1K functions as a ceRNA that influences CCNE1 expression and the cell cycle process via miR-135a-5p. The role of RAET1K in A549, H1299, and PC-9 cells was investigated to determine why PC-9 cells did not survive puromycin-selection of cells transfected with a lentivirus overexpressing RAET1K. As a possible explanation, the epidermal growth factor receptor gene might be mutated in PC-9 cells, whereas A549 and H1299 cell lines carried the WT phenotype. Therefore, the effects of miR-135a-5p and cotransfection of RAET1K/miR-135a-5p in A549 and H1299 cells were investigated. The results of the PC-9 cells transfected with miR-135a-5p are provided in the Supplementary Figure S1. In the A549 and H1299 cell lines, CCNE1 expression was silenced by increased miR-135a-5p, which also affected the cell cycle process. In contrast, the miR-135a-5p inhibitor had opposite effects. The results revealed that overexpression of RAET1K partially absorbed miR-135a-5p and enhanced the miR-135a-5p-mediated biological effects. The tumor suppressive function of miR-135a in LUAD has been consistently demonstrated in previous studies. For instance, miR-135a-5p promoted the progression of head and neck squamous cell carcinoma by targeting HOXA10 (Guo et al., 2018), the progression of thyroid carcinoma by VCAN (Zhao et al., 2017), and the progression of gastric cancer by KIFC1 (Zhang et al., 2016). Conversely, miR-135a was found to target SIAH1 to promote cell transformation in cervical cancer via the β-catenin pathway (Leung et al., 2014). Furthermore, Zhang et al. (2019) reported that miR-135a-5p promoted LC progression via modulating LOXL4 and blockage of LC cells arrested at the G1 phase. The reasons for these findings could be the differences in the samples used for in vivo (LC tissue) vs. in vitro (LC cell lines) studies. However, the results above were in agreement regarding the influence of the G1 phase of the cell cycle.

Furthermore, the results of this study indicated that cotransfection of A549RAET1K and H1299RAET1K cells with the miR-135a-5p inhibitor could act synergistically to reduce the expression level of CCNE1 and accumulate the proportion of cells arrested at the G1-S phase boundary, thereby suggesting the possible existence of an oncogenic RAET1K/miR-135a-5p axis. As predicted and verified by the bioinformatics algorithms and luciferase reporter assay, RAET1K and CCNE1 are potential targets of miR-135a-5p at the 7-mer-m8 site. The lncRNA RAET1K/miR-135a-5p axis might have a stronger synergistic effect on the regulation of cell cycle phase-dependent CCNE1 and transformation from the G1 to S phase. Here, the role of RAET1K as a putative oncogene in LUAD was revealed, suggesting that targeting the cyclin E1-CDK signaling provides a novel targeted therapeutic option for the treatment of LUAD. However, further investigations are required to verify the crucial molecules and signaling pathways involved in lncRNA RAET1K-mediated LUAD tumorigenesis.
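To make the notion of a 7-mer-m8 site concrete, the following Python sketch scans an RNA sequence for exact matches to the reverse complement of miRNA nucleotides 2 to 8, which is the usual definition of this site type. It is an illustration only: the miRNA and target sequences are invented toy strings, the function names are hypothetical, and this is not the prediction algorithm used in the study.

```python
# Watson-Crick complements for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site_7mer_m8(mirna):
    """Target-side sequence (5'->3') that pairs with miRNA nucleotides 2-8."""
    seed = mirna[1:8]  # positions 2..8 in 1-based numbering
    return "".join(COMPLEMENT[nt] for nt in reversed(seed))

def find_7mer_m8_sites(mirna, target):
    """Return 0-based start positions of exact 7mer-m8 matches in the target."""
    site = seed_site_7mer_m8(mirna)
    width = len(site)
    return [i for i in range(len(target) - width + 1) if target[i:i + width] == site]

# Toy sequences, invented for illustration (not real miRNA or transcript sequences).
toy_mirna = "UACGGAUCUAGCUAGCUAGCUA"
toy_target = "AAAGAUCCGUAAACCCGAUCCGUUU"

site = seed_site_7mer_m8(toy_mirna)
positions = find_7mer_m8_sites(toy_mirna, toy_target)
print(site, positions)
```

A shared site of this kind in both a lncRNA and an mRNA is what allows the two transcripts to compete for the same miRNA under the ceRNA hypothesis.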

## CONCLUSION

The major finding of this study was that RAET1K acted as a ceRNA and increased the expression of CCNE1 by directly competing with miRNA-135a-5p, which influenced the function of the cyclin E1 protein. Furthermore, the RAET1K/miR-135a-5p axis, which drives cell cycle progression, was arrested at the G1 phase in LUAD onset and progression. These findings are expected to be useful for the development of novel biomarkers and pathways regulating the cell cycle in LUAD.

## DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in the Cancer Genome Atlas at (https://portal.gdc.cancer.gov).

## AUTHOR CONTRIBUTIONS

Conceptualization, analysis and validation: CZ and XL. Software: YR and ZY. Writing: CZ. Funding acquisition: BZ and XL.

## FUNDING

This project was supported by the National Natural Science Foundation of China (No.81773524 and No.81502878).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01348/full#supplementary-material


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zheng, Li, Ren, Yin and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Analysis of Key Genes Involved in Potato Anthocyanin Biosynthesis Based on Genomics and Transcriptomics Data

Nie Tengkun<sup>1</sup> \*, Wang Dongdong<sup>1</sup> , Ma Xiaohui<sup>1</sup> , Chen Yue<sup>1</sup> \* and Chen Qin<sup>2</sup> \*

<sup>1</sup> State Key Laboratory of Crop Stress Biology for Arid Areas, College of Agronomy, Northwest A&F University, Yangling, China, <sup>2</sup> State Key Laboratory of Crop Stress Biology for Arid Areas, College of Food Science and Engineering, Northwest A&F University, Yangling, China

#### Edited by:

Monica Bianchini, University of Siena, Italy

#### Reviewed by:

Dinesh Kumar, Indian Council of Agricultural Research (ICAR), India

Izabela Makałowska, Adam Mickiewicz University in Poznań, Poland

#### \*Correspondence:

Nie Tengkun, chinantk@126.com; Chen Yue, xnchenyue@nwafu.edu.cn; Chen Qin, chenpeter2289@nwsuaf.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

Received: 28 November 2018 Accepted: 24 April 2019 Published: 14 May 2019

#### Citation:

Tengkun N, Dongdong W, Xiaohui M, Yue C and Qin C (2019) Analysis of Key Genes Involved in Potato Anthocyanin Biosynthesis Based on Genomics and Transcriptomics Data. Front. Plant Sci. 10:603. doi: 10.3389/fpls.2019.00603

The accumulation of secondary metabolites, such as anthocyanins, in cells plays an important role in colored plants. The synthesis and accumulation of anthocyanins are regulated by multiple genes, of which the R2R3-MYB transcription factor gene family plays an important role. Based on the genomic data in the Potato Genome Sequencing Consortium database (PGSC) and the transcriptome data in the SRA, this study used potato as a model plant to comprehensively analyze the plant anthocyanin accumulation process. The results indicated that the most critical step in the synthesis of potato anthocyanins was the formation of p-coumaroyl-CoA to enter the flavonoid biosynthetic pathway. The up-regulated expression of the CHS gene and the down-regulated expression of HCT significantly promoted this process. At the same time, the anthocyanins in the potato were gradually synthesized during the process from leaf transport to tubers. New transcripts of stAN1 and PAL were cloned and named stAN1-like and PAL-like, respectively, but the functions of these two new transcripts still need further study. In addition, the sequence characteristics of amino acids in the R2-MYB and R3-MYB domains of potato were preliminarily identified. The aims of this study are to identify the crucial major genes that affect anthocyanin biosynthesis through multi-omics joint analysis and to transform quantitative traits into quality traits, which provides a basis and reference for the regulation of plant anthocyanin biosynthesis. Simultaneously, this study provides the basis for improving the anthocyanin content in potato tubers and the cultivation of new potato varieties with high anthocyanin content.

Keywords: anthocyanin, potato, multi-omics analysis, stAN1, PAL, R2R3-MYB

## INTRODUCTION

It is well known that some plants are colorful, and there are many reasons why plants display multiple colors. For example, the pH of plant cytoplasmic substrates, the accumulation of secondary metabolites, such as anthocyanins, and environmental factors, such as light, all have an effect on plant color formation (Asen et al., 1972; Dai and Mumper, 2010; Xu X. et al., 2015). The accumulation of anthocyanins and other flavonoids in cells results in plants displaying colors other than green (Tanaka et al., 2008). Biosynthesis and metabolic pathways of anthocyanins in plants have been studied in depth, and many key genes have been cloned.

Among the many phenylalanine metabolic pathways, the pathway based on the biosynthesis process of phenylpropanoids is an important source of flavonoids in plants (Salvatierra et al., 2010). Phenylalanine is deaminated by phenylalanine ammonia lyase (PAL) to form trans-cinnamic acid; trans-cinnamic acid produces cinnamoyl-CoA under 4-coumarate-CoA ligase (4CL); then cinnamoyl-CoA is catalyzed by trans-cinnamate 4-monooxygenase (C4H) to form p-coumaroyl-CoA; finally p-coumaroyl-CoA is involved in the biosynthesis of flavonoids (Vogt, 2010). Through the action of chalcone synthase (CHS), shikimate O-hydroxycinnamoyltransferase (HCT), chalcone isomerase (CHI), flavonoid 3′,5′-hydroxylase (F3′5′H), flavonoid 3′-monooxygenase (F3′H), naringenin 3-dioxygenase (F3H), dihydroflavonol 4-reductase (DFR), anthocyanidin synthase (ANS) and other enzymes, p-coumaroyl-CoA is finally converted to pelargonidin, cyanidin and delphinidin, which are involved in anthocyanin biosynthesis (Martens et al., 2010; Tanaka et al., 2010). Anthocyanin mainly accumulates in plant cell vacuoles in the form of glycosides (Pietrini et al., 2002).

The MYB-bHLH-WD40 transcription factor complex (MBW) is a regulator that has been thoroughly studied and has an important regulatory effect on the synthesis of flavonoids such as anthocyanins (Jaakola, 2013). The main transcription factor involved in the regulation of anthocyanin synthesis in the MYB gene family is the R2R3-MYB transcription factor (Stracke et al., 2007). A study of the Arabidopsis MBW complex TT2-TT8-TTG1 showed that the target gene of the complex might be mainly determined by a R2R3-MYB transcription factor-encoded protein (Xu W. et al., 2015). The bHLH proteins involved in the MBW complex have some common features and most belong to the IIIF subfamily (Zimmermann et al., 2004). The Arabidopsis thaliana TT8 gene belongs to the bHLH gene family, which can regulate the synthesis of flavonoids by feedback regulation (Baudry et al., 2006). Studies have indicated that the WD40 protein does not participate in the recognition of gene promoters or regulate the expression of target genes; its effect is to link the two other protein subunits in the MBW complex (Hichri et al., 2011). In the synthesis of flavonoids, for some specific genes, MYB transcription factors can activate the corresponding gene transcription directly without binding to bHLH transcription factors (Jaakola, 2013). Thus, the R2R3-MYB transcription factor plays an important role in the synthesis of flavonoids.

Anthocyanin is an important component of polyphenolic antioxidant active substances, and such compounds are easily absorbed and utilized by the human digestive system (Fernandes et al., 2014). Anthocyanins have a special chemical structure, which allows them to exert a variety of physiological and biochemical functions in mammals such as humans (Stintzing and Carle, 2004). On the one hand, anthocyanins have the effect of scavenging free radicals in living organisms and improving the antioxidant capacity of organisms themselves (Miguel, 2011); on the other hand, anthocyanins have many important pharmacological effects. For example, they have significant effects in preventing many major human-related diseases, such as cardiovascular and cerebrovascular diseases, diabetes and its complications, and cancer (Scalbert et al., 2005). Because of the above characteristics, anthocyanins are gradually being valued by chemists and pharmacologists. Potato is an important plant food for humans to obtain antioxidant active substances such as ascorbic acid and polyphenols (Lobo et al., 2010). Nutrients such as anthocyanins accumulate in colored potato tubers. In addition, it is considered that the anthocyanin content of potato with red or purple tubers is significantly higher than that of common potato with white or yellow tubers (Brown et al., 2005; Lachman and Hamouz, 2005). Since anthocyanins have favorable biological functions for humans, the key genes controlling the synthesis and accumulation of potato anthocyanins can be studied, and then the accumulation of anthocyanins in potato tubers can be regulated. This study attempted to control the content of anthocyanins in potato tubers, making it easier for humans to consume antioxidant active substances such as anthocyanins, thereby preventing a variety of diseases and making humans healthier.

Potato is a good model plant for studying the formation of plant color by studying the process of anthocyanin biosynthesis. Firstly, potato plants reproduce mainly through asexual reproduction, and the genetic composition is stable. Secondly, different potato varieties have different colors, and for a single potato, the whole plant is consistent in color. In addition, mature potato plants have a large biomass, which is convenient for the determination of various secondary metabolites. Numerous key genes regulating anthocyanin synthesis have been cloned, but it is unclear which of these key genes is the most important. At the same time, whether there are other gene regulatory pathways controlling anthocyanin accumulation in plants is also worthy of further study.

In this experiment, we analyzed the R2R3-MYB transcription factor gene family, which plays a major role in the anthocyanin synthesis process, based on the genomic data of existing diploid potato (Solanum phureja DM1-3). Then, potato transcriptomics data from the NCBI Sequence Read Archive (SRA) database were used to determine which key genes were enriched in anthocyanin synthesis. Finally, based on the above analysis results, we aimed to identify the most critical genes involved in the regulation of anthocyanin biosynthesis and to explore new genes that may be involved in the regulation of anthocyanin synthesis.

### MATERIALS AND METHODS

## Identification of the R2R3-MYB Subfamily Genes in Potato Proteome Data

We downloaded the proteomic data file PGSC\_DM\_v3.4\_pep.fasta (amino acid sequences corresponding to all gene coding sequences) from the Potato Genome Sequencing Consortium (PGSC) database<sup>1</sup>. R2R3-MYB subfamily genes were identified using stAN2 as the reference sequence (Jung et al., 2009); local BLAST analysis was performed using blast-2.6.0+ software, with the e-value threshold set to 1e-5. After removal of repeated sequences and of amino acid sequences shorter than 100 residues, the remaining sequences were submitted to the SMART<sup>2</sup> database for domain retrieval. MEME 4.11.4<sup>3</sup> was used to determine the conserved domain boundaries of the MYB-R2 and MYB-R3 domains in potato. Only the amino acid sequences having both the MYB-R2 and MYB-R3 domains were retained for subsequent analysis.
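The filtering steps described above (e-value cutoff, minimum length of 100 amino acids, removal of repeated sequences) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it assumes the BLAST results were written in the standard 12-column tabular layout (`-outfmt 6`), which the text does not specify, and the function names and toy records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def significant_hits(blast_tabular, max_evalue=1e-5):
    """Subject IDs whose e-value (column 11 of the tabular layout) passes the cutoff."""
    hits = set()
    for line in blast_tabular.splitlines():
        fields = line.split("\t")
        if len(fields) >= 12 and float(fields[10]) <= max_evalue:
            hits.add(fields[1])
    return hits

def filter_candidates(proteins, hit_ids, min_length=100):
    """Keep hits that are at least min_length residues long and not exact duplicates."""
    kept, seen = {}, set()
    for name in sorted(hit_ids):
        seq = proteins.get(name, "")
        if len(seq) < min_length or seq in seen:
            continue
        seen.add(seq)
        kept[name] = seq
    return kept

# Toy data, invented for illustration: p2 is too short, p3 duplicates p1, p4 fails the e-value cutoff.
fasta_text = (">p1 toy\n" + "MA" * 60 + "\n>p2\n" + "MK" * 20 +
              "\n>p3\n" + "MA" * 60 + "\n>p4\n" + "MG" * 70 + "\n")
row = "stAN2\t{}\t80.0\t110\t5\t0\t1\t110\t1\t110\t{}\t200"
blast_text = "\n".join([row.format("p1", "1e-40"), row.format("p2", "1e-10"),
                        row.format("p3", "1e-30"), row.format("p4", "0.5")])

hits = significant_hits(blast_text)
candidates = filter_candidates(parse_fasta(fasta_text), hits)
print(sorted(hits), list(candidates))
```

The surviving candidates would then go on to domain annotation, which is outside the scope of this sketch.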

## Construction of the Phylogenetic Tree of the Potato R2R3-MYB Gene and Collinear Analysis

Using MEGA7<sup>4</sup> software, an unrooted tree was constructed using the minimum evolution method, and the phylogenetic tree was tested with 1,000 bootstrap replicates. The potato genome collinearity analysis was performed on PGSC\_DM\_v3.4\_cds.fasta using MCScanX<sup>5</sup>, and circos-0.69<sup>6</sup> was used to visualize the results of the potato genome collinearity analysis.

## Transcriptional Data of Potato Color Changes Were Analyzed

The potato transcriptome data were downloaded from the SRA database<sup>7</sup>, the downloaded data format was converted with the SRA-Toolkit<sup>8</sup>, and the data were then regrouped. According to the color of the potato stem and tuber used in sequencing, the samples were reclassified into a colored group and a colorless group. The regrouped colored group contained 21 biological replicates; the regrouped colorless group contained 36 biological replicates. The colorless group was the control group, and the data and grouping information are shown in **Supplementary Table S5** (Hannapel et al., 2013; Liu et al., 2015; Gálvez et al., 2016; Pham et al., 2017). In this experiment, the NGSQC Toolkit (Patel and Jain, 2012) was used to filter the reads; Trimmomatic<sup>9</sup> was used to remove the sequencing adapters; and the PCR duplicates generated during the sequencing process were eliminated by FastUniq<sup>10</sup>. Using the doubled monoploid S. tuberosum Group Phureja clone DM1-3 (DM) as the reference genome (Xu et al., 2011), TopHat and Cufflinks were used to align and assemble the transcriptome data and obtain differentially expressed genes (Trapnell et al., 2012). Finally, InterProScan-5.29-68.0<sup>11</sup> and KOBAS 3.0<sup>12</sup> were used for preliminary annotations of the differentially expressed genes.
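As a rough illustration of how a colored-versus-colorless comparison yields up- and down-regulated genes, the Python sketch below computes a log2 fold change from group means, with the colorless group as the control. It covers only the effect-size part of such a comparison and does not reproduce the statistical testing done by Cufflinks. The pseudocount, the cutoff of 1, the gene labels, and the expression values are all assumptions made for the example.

```python
from math import log2
from statistics import mean

def log2_fold_change(colored, colorless, pseudocount=1.0):
    """log2 ratio of mean expression, colored group over colorless (control) group.
    The pseudocount avoids division by zero for unexpressed genes."""
    return log2((mean(colored) + pseudocount) / (mean(colorless) + pseudocount))

def classify(expression, cutoff=1.0):
    """Label each gene as 'up', 'down', or 'unchanged' by its log2 fold change."""
    labels = {}
    for gene, (colored, colorless) in expression.items():
        lfc = log2_fold_change(colored, colorless)
        labels[gene] = "up" if lfc >= cutoff else "down" if lfc <= -cutoff else "unchanged"
    return labels

# Hypothetical expression values per replicate: (colored group, colorless group).
toy_expression = {
    "gene_up": ([29, 31, 33], [6, 7, 8]),
    "gene_down": ([1, 1, 1], [15, 15, 15]),
    "gene_flat": ([5, 5, 5], [5, 5, 5]),
}

labels = classify(toy_expression)
print(labels)
```

With the real data, each group would hold 21 and 36 replicates respectively, and a significance test would be applied in addition to the fold-change cutoff.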

## GO Annotation and KEGG Enrichment Analysis Based on Genomic and Transcriptome Analysis Results

Comprehensive genomic and transcriptome analysis results were analyzed by GO annotation and KEGG enrichment using AnnotationDbi<sup>13</sup>, AnnotationHub<sup>14</sup> and clusterProfiler<sup>15</sup>. Only GO annotations and KEGG enrichment analysis results with p-value < 0.05 were retained. The GOplot<sup>16</sup> was applied to visualize the results of GO annotation. The KEGG analysis results were confirmed by the KEGG online database<sup>17</sup> .
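Over-representation analyses of this kind are commonly based on a hypergeometric test, and the p-value < 0.05 filter can be sketched in Python as below. This is an illustrative stand-in rather than the R workflow used in the study; the term names and gene counts are invented, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes when n genes
    are drawn from a background of N genes, K of which carry the annotation."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

def enriched_terms(term_counts, n, N, alpha=0.05):
    """Keep terms whose enrichment p-value is below alpha.
    term_counts maps term -> (k genes in the list, K genes in the background)."""
    results = {}
    for term, (k, K) in term_counts.items():
        p = hypergeom_enrichment_p(k, n, K, N)
        if p < alpha:
            results[term] = p
    return results

# Hypothetical counts: 50 differentially expressed genes out of a 1000-gene background.
toy_terms = {"toy_term_A": (8, 40), "toy_term_B": (5, 100)}
significant = enriched_terms(toy_terms, n=50, N=1000)
print(significant)
```

Here `toy_term_A` has four times as many hits as expected by chance and is retained, while `toy_term_B` matches its expected count and is dropped.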

#### Semi-Quantitative RT-PCR to Detect Gene Expression

Semi-quantitative RT-PCR was used to verify the expression of the key genes obtained from the above studies. We used the potato variety Shepody and the colored potato materials Yellow Meigui 1, Red Meigui 3, and Purple Meigui 2, which were bred in our laboratory. The color performance of each potato material is shown in **Figure 5C**. In this experiment, total RNA of roots, stems, leaves, and tubers of potato seedlings was extracted by TRNzol. After reverse transcription, semi-quantitative RT-PCR was carried out with EF-1α as the reference gene. The semi-quantitative RT-PCR experiment of each plant tissue was performed with 5 biological replicates. The primers used in the above experiments are shown in **Supplementary Table S6**. Finally, ImageJ<sup>18</sup> was used to measure the agarose gel gray value and perform statistical analysis.
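One common way to turn band gray values into relative expression is to divide each target band by the reference band from the same sample and then summarize across replicates. The Python sketch below assumes this normalization, which the text does not spell out; the gray values are invented and the function name is hypothetical.

```python
from statistics import mean, stdev

def relative_expression(target_gray, reference_gray):
    """Per-replicate ratio of target band gray value to reference band gray value."""
    if len(target_gray) != len(reference_gray):
        raise ValueError("each target band needs a matching reference band")
    return [t / r for t, r in zip(target_gray, reference_gray)]

# Hypothetical gray values from five biological replicates of one tissue.
target = [50, 60, 55, 65, 70]        # band of the gene of interest
reference = [100, 100, 110, 130, 140]  # band of the reference gene (EF-1α in the study)

ratios = relative_expression(target, reference)
summary = (mean(ratios), stdev(ratios))
print(ratios, summary)
```

The resulting mean and standard deviation per tissue and variety are the values that would normally be compared statistically.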

#### Application of Tobacco Leaves for Subcellular Localization

The stAN1-like-GFP vector and the PAL-like-GFP vector were constructed and transformed into Agrobacterium strain LBA4404 by the freeze-thaw method. The transformed Agrobacterium was cultured at 28°C with shaking until the OD<sub>600</sub> reached 0.6–0.8, and the cells were collected by centrifugation. We used a suspension buffer (MES = 10 mmol/L; MgCl<sub>2</sub> = 10 mmol/L; acetosyringone = 0.3 mmol/L; pH = 5.8) to resuspend the cells. The resuspended cells were allowed to stand at room temperature for 2 h, and the resuspended bacteria were injected into the tobacco leaves using a disposable syringe. While humidity was maintained, green fluorescence was observed by laser scanning confocal microscopy (LSCM) 48 h after tobacco leaf injection. The injected tobacco leaves were treated with a 0.25 g/ml sucrose solution, and the plasmolysis was observed by LSCM (**Supplementary Figure S1**). The GFP excitation wavelength was 488 nm, and the chloroplast autofluorescence excitation wavelength was 633 nm.

<sup>1</sup>http://solanaceae.plantbiology.msu.edu/pgsc\_download.shtml

<sup>2</sup>http://smart.embl-heidelberg.de/

<sup>3</sup>http://meme-suite.org/index.html

<sup>4</sup>http://megasoftware.net/

<sup>5</sup>http://chibba.pgml.uga.edu/mcscan2/

<sup>6</sup>http://circos.ca/software/download/circos/

<sup>7</sup>https://www.ncbi.nlm.nih.gov/sra

<sup>8</sup>https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

<sup>9</sup>http://www.usadellab.org/cms/index.php?page=trimmomatic

<sup>10</sup>https://sourceforge.net/projects/fastuniq/files/

<sup>11</sup>http://www.ebi.ac.uk/interpro/download.html

<sup>12</sup>http://kobas.cbi.pku.edu.cn/

<sup>13</sup>http://www.bioconductor.org/packages/devel/bioc/html/AnnotationDbi.html

<sup>14</sup>http://www.bioconductor.org/packages/release/bioc/html/AnnotationHub.html

<sup>15</sup>http://www.bioconductor.org/packages/release/bioc/html/clusterProfiler.html

<sup>16</sup>http://wencke.github.io/

<sup>17</sup>https://www.genome.jp/kegg/pathway.html

<sup>18</sup>https://imagej.nih.gov/ij/

## RESULTS

## Identification of Genes Containing Only the R2 and R3 Domains in the Potato MYB Family

In the potato genome data, a total of 101 genes with the R2R3-MYB domain were found by a literature search and sequence alignment (Jung et al., 2009; Zhao et al., 2013; Liu et al., 2016). By comparing the protein sequences encoded by these genes, the common features of the functional structure of the potato R2R3-MYB gene family were obtained. The results of the alignment of the R2 domain, which contains a total of 35 amino acids, are shown in **Figure 1A**. Analysis of the R3 domain revealed a total of 47 amino acids (**Figure 1B**). In the R2 and R3 domains, the conserved amino acids in order from the N-terminus to the C-terminus are glycine (G), tryptophan (W), glutamic acid (E), glycine (G), and tryptophan (W). Therefore, the G-W-E-G-W structure may have an important function in the binding of the MYB transcription factor to the target promoter.

A phylogenetic tree was constructed using the amino acid sequences encoded by the genes with the R2R3-MYB domain found in potato. As shown in **Figure 1C**, the population of genes could be initially divided into 16 subpopulations based on the amino acid homology alignment. Amino acid homology analysis provided a reference for finding genes with the R2R3-MYB domain in the potato genome associated with anthocyanin accumulation.

## Collinearity Analysis of Potato R2R3-MYB Genes

The whole genome of potato was subjected to collinearity analysis. The results showed that the potato genes were divided into five types: no repeat genes (singleton); modes other than segmental, tandem and proximal (dispersed duplication); nearby chromosomal region but not adjacent (proximal); consecutive repeat (tandem); and collinear genes in collinear blocks (WGD/segmental). Among them, the proximal type was the smallest, with 1,441 genes, and the WGD/segmental type was the largest, with 21,372 genes. The remaining types comprised 4,797 genes for the singleton type, 7,408 genes for the dispersed duplication type, and 4,011 genes for the tandem type (**Figures 2A–E**). There were 25,383 collinear and tandem genes in the potato genome, accounting for 65.04% of the total number of potato genes. Thus, most genes had multiple copies in the potato genome, and there was a high number of genes with similar sequence characteristics or functions.
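The reported share can be checked directly from the five class counts. The short sketch below assumes that the five duplication types partition the annotated gene set, so that their sum is the total number of potato genes; the variable names are illustrative only.

```python
# Gene counts per duplication type, as reported in the text.
counts = {
    "singleton": 4797,
    "dispersed": 7408,
    "proximal": 1441,
    "tandem": 4011,
    "WGD/segmental": 21372,
}

# Assumption: the five types are mutually exclusive and cover all genes.
total = sum(counts.values())
collinear_plus_tandem = counts["WGD/segmental"] + counts["tandem"]
share = 100 * collinear_plus_tandem / total

print(total, collinear_plus_tandem, round(share, 2))
```

Under this assumption the WGD/segmental and tandem classes together give 25,383 genes out of 39,029, which rounds to the 65.04% stated above.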

R2R3-MYB genes were present on every chromosome of potato. They were most abundant on chromosome ch05, which carried a total of 14 R2R3-MYB genes. R2R3-MYB genes were also extensively distributed on chromosomes ch01, ch02, ch03, ch06, ch07, and ch10 (**Figure 2G**). The distributions of the collinear genes and the tandem genes in the potato genome were relatively uniform on each chromosome, but there were fewer in the 41–46 Mb region of ch00 and the 1–10 Mb region of ch02. The lines in **Figure 2** indicate the collinear relationships among R2R3-MYB genes in the potato genome and between the R2R3-MYB genes and other genes in potato. Based on the above results, a total of 31 other genes that were collinear with the members of the R2R3-MYB gene family identified above were found in the potato genome (**Supplementary Table S1**). Genes that were collinear with the R2R3-MYB gene family members could also be used as key candidate genes for the regulation of potato anthocyanin synthesis.

### Transcriptome Analysis Results

Based on the regrouped transcriptome sequencing data, a total of 12,913 genes with different expression levels were found, of which 420 (p ≤ 0.05) were significantly differentially expressed (**Supplementary Table S2**). There were 11,030 genes with |log2FC| ≥ 1; of these, genes up-regulated in the colored group accounted for 58.52%, and genes down-regulated in the colored group accounted for 41.48% (**Figure 2H**). Thus, relative to the colorless group, more genes were up-regulated than down-regulated in the colored group. This indicated that the change in plant color and the accumulation of anthocyanins were achieved by the simultaneous up-regulation of multiple genes.
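The up- and down-regulated shares follow from a simple threshold on log2FC. The sketch below illustrates the calculation under the convention that a positive log2FC means higher expression in the colored group. The function name `direction_shares` and the toy gene values are hypothetical and are not taken from Supplementary Table S2.

```python
def direction_shares(log2fc_by_gene, cutoff=1.0):
    """Percent of up- and down-regulated genes among those with |log2FC| >= cutoff."""
    up = sum(1 for v in log2fc_by_gene.values() if v >= cutoff)
    down = sum(1 for v in log2fc_by_gene.values() if v <= -cutoff)
    selected = up + down
    if selected == 0:
        raise ValueError("no gene passes the |log2FC| cutoff")
    return 100 * up / selected, 100 * down / selected


# Hypothetical log2FC values (colored vs. colorless group), for illustration only.
toy = {
    "geneA": 2.3,
    "geneB": -1.4,
    "geneC": 0.2,   # below the cutoff, so excluded from both shares
    "geneD": 1.0,
    "geneE": -3.1,
    "geneF": 1.8,
}

up_pct, down_pct = direction_shares(toy)
print(f"up: {up_pct:.2f}%, down: {down_pct:.2f}%")
```

Applied to the 11,030 genes with |log2FC| ≥ 1, the same calculation would yield the two percentages reported in the text.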

## GO Enrichment and KEGG Pathway Analysis

GO enrichment analysis was performed on the transcriptome data using InterProScan and clusterProfiler (Yu et al., 2012; Jones et al., 2014). A total of 23 valid GO annotation terms (p-value < 0.05) were enriched, of which 7 had a p-value < 0.01 (**Figure 3A**). The content of anthocyanins or polyphenols in plants has a close positive correlation with the antioxidant activity of plants (Velioglu et al., 1998). Among the 23 GO analysis results, 10 were significantly associated with plant color changes or plant antioxidant activity. Among them, the GO:0015035, GO:0004601, GO:0016684, GO:0046906, GO:0016747, GO:0010333, and GO:0016829 pathways were enhanced in the colored potato group, whereas the GO:0004866, GO:0030414, and GO:0004857 pathways were weakened in the colored potato group (**Figures 3B,D**).

The antioxidant activity of potato in the colored group was therefore stronger than that in the colorless group, and the acyltransferase activity of potato in the colored group was also higher than that in the colorless group. This indicated that the high expression of some antioxidant genes and acyltransferase genes contributes to the accumulation of substances such as anthocyanins in plants. It also showed that colored potato had higher antioxidant activity, and that the antioxidant activity was improved by the simultaneous up-regulation of multiple key genes. A total of 104 differentially expressed genes were enriched in the 10 significant GO pathways, and the expression of these genes may play an important role in the accumulation of potato anthocyanins (**Figure 3C**). Therefore, the above genes can be used as key candidate genes for further study of the synthesis of plant flavonoids and changes in plant antioxidant activity.
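Over-representation analysis of this kind is commonly based on a hypergeometric test, in which the p-value is the probability of drawing at least the observed number of annotated genes by chance. The sketch below shows that calculation with the Python standard library only. It is an illustration of the general technique, not the authors' pipeline, and the gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the query list, k: query genes annotated with the term.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 1000 background genes, 50 carry the term,
# 100 genes are differentially expressed, 12 of those carry the term.
p = hypergeom_enrichment_p(1000, 50, 100, 12)
print(f"p = {p:.3g}")
```

In practice the p-values of all tested terms would also be adjusted for multiple testing; the text reports unadjusted thresholds of 0.05 and 0.01, and no adjustment method is stated.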

The transcriptome data were enriched by KEGG analysis to obtain 23 metabolic pathways (p-value < 0.05), including two pathways closely related to anthocyanin synthesis and accumulation (**Figure 4A**). These two pathways were sot00940 (phenylpropanoid biosynthesis) and sot00941 (flavonoid biosynthesis). The biological processes related to the accumulation of anthocyanins were sorted, and the up- and down-regulated expression changes of the potato genes in the colored group are shown in **Figure 4B**. The role of PAL (4.3.1.24) in phenylpropanoid biosynthesis is very important, but this study found that its up-regulated expression in colored potatoes was not obvious. However, the enhancement of the enzyme activity of caffeoyl-CoA O-methyltransferase (2.1.1.104), cinnamyl-alcohol dehydrogenase (1.1.1.195), and peroxidase (1.11.1.7) in the colored group promoted the formation of various phenolic substances, represented by lignin, and also promoted the transformation of cinnamoyl-CoA into p-coumaroyl-CoA.

p-Cinnamoyl-CoA is a key precursor for anthocyanin synthesis, and its increased content contributes to the accumulation of potato anthocyanins (Besseau et al., 2007). The up-regulated expression of the PGSC0003DMT400022254 and PGSC0003DMT400022255 genes increased the content of the CHS (2.3.1.74) enzyme and promoted the accumulation of downstream products, which is of great significance in the whole process of anthocyanin accumulation. During the whole process of anthocyanin synthesis, the expression level of the PGSC0003DMT400018861 gene was significantly decreased, resulting in a decrease in the HCT (2.3.1.133) content. This could effectively reduce the loss of p-coumaroyl-CoA to caffeic acid metabolism and promote the entry of p-coumaroyl-CoA into the flavonoid synthesis pathway, which also had a positive effect on the accumulation of anthocyanins. In addition, the up-regulation of F3′5′H (1.14.14.81) could effectively counteract the effect of the down-regulated expression of HCT (2.3.1.133) on the anthocyanin composition type. As a result, the contents of the delphinidin, pelargonidin, and cyanidin classes remained relatively balanced.

## Semi-Quantitative RT-PCR to Verify the Expression of Related Genes

The members of the potato R2R3-MYB gene family were preliminarily identified by sequence alignment and construction of a phylogenetic tree, and the characteristics of R2 and R3 domains in potato were determined. Based on the collinearity analysis of the R2R3-MYB gene family, the R2R3-MYB gene family members were further enriched. A total of 104 potato R2R3-MYB gene family members were identified by combining phylogenetic analysis and collinearity analysis. Combined with the results of transcriptome analysis, the differentially expressed genes were searched for among the 104 R2R3-MYB members, and the most differentially expressed genes may be related to the synthesis of anthocyanins and changes in potato color.

Based on a comprehensive comparison of genomic and transcriptome analysis results (**Supplementary Tables S3**, **S4**), a total of 9 genes were further confirmed. The results of transcriptome analysis were verified by semi-quantitative RT-PCR using colored potatoes as material (**Figure 5C**). The expression of 7 genes was consistent with the transcriptome analysis, and the expression of PGSC0003DMT400062326 and PGSC0003DMT400062403 was opposite to that of the transcriptome analysis (**Figures 5A,B**). PGSC0003DMT400040774, PGSC0003DMT400055148, and PGSC0003DMT400009404 were mainly expressed in the potato stem. The expression level of PGSC0003DMT400064555 in various tissues of colored potatoes was generally lower than that of the control Shepody, but higher in the root of Red Meigui 3. PGSC0003DMT400055488 (PAL-like) was expressed in leaves and tubers of colored potatoes, but the expression did not increase with the deepening of potato color. The expression levels of PGSC0003DMT400036281 (stAN1-like) and PGSC0003DMT400055489 (PAL) increased as the color of the potato deepened. Solanum tuberosum anthocyanin 1 like (stAN1-like) was mainly expressed in the roots, stems and tubers of potato; its expression in Red Meigui 3 and Purple Meigui 2 potato tubers was significantly increased. The expression of phenylalanine ammonia-lyase (PAL) was mainly concentrated in the leaves of colored potatoes, but the expression level in the leaves of the control variety Shepody was significantly reduced.

The higher the logFC, the higher the expression of genes in the potato colored group, and vice versa. (D) Detailed description of key GO enrichment pathways.

## Subcellular Localization of stAN1-Like and PAL-Like

The total RNA of leaves was extracted from the Red Meigui 3 potato, and the new transcripts stAN1-like and PAL-like of the stAN1 and PAL genes were cloned by RT-PCR. The length of the CDS sequence of stAN1-like is 798 bp, which indicates that the resulting polypeptide chain contains 265 amino acids. The length of the CDS sequence of PAL-like is 2169 bp, and the polypeptide chain contains 722 amino acids. The subcellular localization results of stAN1-like (PGSC0003DMT400036281) and PAL-like (PGSC0003DMT400055488) are shown in **Figure 6**. The protein encoded by stAN1-like was mainly concentrated in the nucleus. This suggested that stAN1-like might have the function of initiating downstream gene expression. The protein translated from PAL-like was concentrated on the cell membrane (**Supplementary Figure S1**), which is consistent with its function as a functional protein that promotes the conversion of phenylalanine to the anthocyanin-producing precursor phenylpropanoids. Some of the phenylpropanoids are further metabolized to form lignins involved in cell wall synthesis (Zhou et al., 2009).
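The stated protein lengths follow from the CDS lengths once the terminal stop codon is subtracted. A minimal check, assuming that each CDS includes exactly one stop codon at its end (the function name `protein_length` is illustrative):

```python
def protein_length(cds_bp):
    """Number of amino acids encoded by a CDS that ends with one stop codon."""
    if cds_bp % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    # One codon per 3 bp; the final stop codon is not translated.
    return cds_bp // 3 - 1


print(protein_length(798))   # stAN1-like CDS
print(protein_length(2169))  # PAL-like CDS
```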

### DISCUSSION

## Distribution of Potato R2R3-MYB Transcription Factor on Chromosomes

The R2R3-MYB transcription factor genes have important functions in the process of anthocyanin biosynthesis (Feller et al., 2011). Their primary function in the MBW transcriptional complex is binding to a gene (Xu W. et al., 2015). In this study, 101 R2R3-MYB family genes were found in the potato genome, and they were distributed on all of the chromosomes of potato. This indicates that R2R3-MYB transcription factor genes have important biological functions in potato. R2R3-MYB family genes not only participate in the synthesis and regulation of flavonoids, such as anthocyanins, but also participate in many physiological and biochemical processes, such as floral induction, photoperiod response, and plant drought resistance (Albert et al., 2011; Yang et al., 2012; Zhang et al., 2012; Liu et al., 2013). In addition, in other plants, such as Arabidopsis and Oryza sativa, the R2R3-MYB transcription factors were also found to be distributed on all of the chromosomes (Katiyar et al., 2012). This further demonstrates that the functions of the R2R3-MYB transcription factors are important for plants.

## The Function of New Transcripts of stAN1 and PAL

The new transcripts of stAN1 and PAL cloned in this experiment were from our own laboratory material, Red Meigui 3. The new transcripts were named stAN1-like and PAL-like, respectively. The cloned stAN1-like amino acid sequence differs from that of stAN1 (**Supplementary Figure S2**), which has been reported to regulate potato color (Zhang et al., 2009; Liu et al., 2016). The stAN1-like transcript has 21 additional bases at the 5′ end compared with the stAN1 reference transcript. By comparing the stAN1-like transcript with the stAN1 reference gene sequence, it was found that the 21 bases were completely identical to the stAN1 reference gene sequence. It can be inferred that the stAN1-like transcript arises from a change in the transcription initiation site or in a splicing site of the pre-mRNA. Therefore, it is necessary to further study the role of stAN1-like in potato anthocyanin synthesis and plant color change. The PAL gene also plays an important role in the accumulation of potato anthocyanins (Zhang and Liu, 2015), but PAL-like is different from the typical PAL gene (**Supplementary Figure S3**). Therefore, it is impossible to rule out the possibility that the protein encoded by PAL-like has other functions. The function of PAL-like needs further research through molecular biological methods.

## Biosynthesis and Accumulation Process of Anthocyanins

The R2R3-MYB transcription factor mainly regulates the transcription of downstream genes controlling anthocyanin synthesis, such as DFR (Nesi et al., 2001). The results of comprehensive transcriptome analysis showed that the upstream genes controlling the synthesis of anthocyanin precursors, represented by PAL (PGSC0003DMT400055489), were mainly expressed in leaves. However, the expression of the R2R3-MYB transcription factor genes, represented by stAN1-like (PGSC0003DMT400036281), was mainly concentrated in stems and tubers. This indicates that there is a transport process during the synthesis and accumulation of anthocyanins throughout the potato. Anthocyanin precursors such as phenylalanine and tyrosine accumulate in leaves; the intermediate products are then gradually catalyzed to form the final product (anthocyanins) during transport to the tubers; and the end product accumulates in the tuber in the form of anthocyanins. Synthesis thus proceeds during transport, rather than the final product of anthocyanin biosynthesis accumulating in the leaves and then being transferred to the tubers.

Analysis of transcriptomic data revealed that the role of the PAL gene in the overall anthocyanin biosynthesis process is not critical. In anthocyanin biosynthesis, the metabolic step that really plays a pivotal role should be the following: the anthocyanin synthesis precursor p-cinnamoyl-CoA is transformed into naringenin chalcone as much as possible, thereby entering the subsequent synthesis process of anthocyanins, so that p-cinnamoyl-CoA enters the synthesis pathway of lignin as little as possible. In colored potatoes, the up-regulated expression of CHS and the down-regulated expression of HCT effectively realized this process. Therefore, the up-regulation of CHS and the down-regulation of HCT should be the most critical link in promoting plant anthocyanin synthesis and increasing the plant anthocyanin content. In addition, the high expression of the F3′5′H gene effectively offsets the effect of the down-regulated expression of HCT on the anthocyanin composition type, so that the composition of each type of anthocyanin can remain relatively balanced.

## Application of Multi-Omics Joint Analysis in Experiments

With the development of bioinformatics and the accumulation of experimental data in the field of plant life sciences, it has become possible to analyze a given life phenomenon jointly with multi-omics approaches (Zhang et al., 2010; Lakshmanan et al., 2015). The transcriptomics data used in this paper were different from traditional RNA-seq data: this experiment combined multiple RNA-seq results for comprehensive analysis. Potato could be used as a good model plant to study the process of anthocyanin synthesis and accumulation. However, due to the lack of research on potato gene function, it is difficult to perform transcriptome analysis and annotation, especially for transcription factor-related genes. At the same time, potato proteomics and metabolomics experimental data are still insufficient, and the analytical methods are limited, so the relevant life phenomena cannot be fully analyzed. Future scientific research needs to further complement data on potato-related proteomics, metabolomics, and phenomics. With the advancement of life sciences, the above problems will surely be gradually solved.

The anthocyanin metabolism and synthesis process is a typical quantitative trait, and the synthesis process is controlled by multiple genes. In this experiment, the genomic and transcriptome analysis indicated that the most important step in the anthocyanin synthesis process was to transfer p-cinnamoyl-CoA into the flavonoid biosynthesis process instead of into further metabolism producing lignin species. Up-regulation of CHS and down-regulation of HCT played a central role in anthocyanin biosynthesis. Through this analysis, we strove to find the major genes that regulate quantitative traits and to convert quantitative traits into qualitative traits. At the same time, it was preliminarily found that anthocyanin precursor substances were synthesized in leaves and then gradually transformed during transport, and that the end products (anthocyanins) finally accumulated in potato tubers. After a comprehensive analysis, two new transcripts with research potential were found, namely, stAN1-like and PAL-like, and their functions were preliminarily studied. However, the specific functions of these two transcripts still require the construction of transgenic plants for further research and validation. This study provides a reference for the comprehensive analysis and application of multiple transcriptomics data in the context of big data. It also provides a reference for the application of the R programming language in GO and KEGG analysis of non-model plants. Finally, the results of this study provide a solid theoretical basis for increasing the anthocyanin content in potato tubers, cultivating new potato varieties with high anthocyanin content and regulating plant color.

## AUTHOR CONTRIBUTIONS

NT completed the main content of this manuscript. WD and MX edited the language of this manuscript. CQ and CY provided guidance for the experiments.

## FUNDING

This work was supported by the National Natural Science Foundation of China (No. 31601358), the Project of Science and Technology from Shaanxi Province (No. 2017ZDXM-NY-004), the major collaborative innovation projects for production, education and research in Yangling demonstration zone (No. 2016CXY-05), and the National Key Research and Development Program of China (2018YFD0200805).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.00603/full#supplementary-material

FIGURE S1 | PAL-like transient expression of tobacco leaves with plasmolysis.

FIGURE S2 | The amino acid sequence of stAN1-like was aligned with the reference sequence.

FIGURE S3 | The amino acid sequence of PAL-like was aligned with the reference sequence.

TABLE S1 | Other candidate genes found by collinear analysis.

TABLE S2 | Genes with significantly different expression in the transcriptome.

TABLE S3 | Genes with differential expression in the potato MYB-R2R3 gene family.

TABLE S4 | The results of GO and KEGG analysis related to anthocyanin biosynthesis and accumulation based on transcriptome data.

TABLE S5 | RNA sequencing data regrouping.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tengkun, Dongdong, Xiaohui, Yue and Qin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
