SCCNV: A Software Tool for Identifying Copy Number Variation From Single-Cell Whole-Genome Sequencing

Identification of de novo copy number variations (CNVs) across the genome in single cells requires single-cell whole-genome amplification (WGA) and sequencing. Although many experimental protocols of amplification methods have been developed, all suffer from uneven distribution of read depth across the genome after sequencing of DNA amplicons, which constrains the usage of conventional CNV calling methodologies. Here, we present SCCNV, a software tool for detecting CNVs from whole genome-amplified single cells. SCCNV is a read-depth based approach with adjustment for the WGA bias. We demonstrate its performance by analyzing data obtained with most of the single-cell amplification methods that have been employed for CNV analysis, including DOP-PCR, MDA, MALBAC, and LIANTI. SCCNV is freely available at https://github.com/biosinodx/SCCNV.


INTRODUCTION
Each single cell in a tissue or cell population has its own unique genome due to accumulating de novo mutations, such as single-nucleotide variations (SNVs), structural variations (SVs), copy number variations (CNVs) and aneuploidies. The frequency and spectrum of the mutations reflect the loss of genome integrity of a cell population, critically important to cancer and aging (Vijg and Dong, 2020). To detect the mutations unique to a single cell, single-cell whole-genome sequencing (SCWGS) is necessary. SCWGS requires whole-genome amplification (WGA), which is often biased, leading to uneven distribution of DNA content across the genome or differences between alleles. This essentially constrain the usage of variant callers designed for non-amplified bulk DNA. We recently developed a new software tool, SCcaller, that uses heterozygous SNPs to correct for the allelic bias hampering SNV calling (Dong et al., 2017).
CNV calling is typically based on variation of sequencing depth across the genome. However, for a single cell amplicon, variation of sequencing depth increases dramatically due to the locus-specific amplification bias (Navin et al., 2011;Zong et al., 2012;Chen et al., 2017). To solve this issue computationally, we developed SCCNV, a software tool to identify CNVs from SCWGS. SCCNV is also based on a read-depth approach: it controls not only bias during sequencing and alignment, e.g., bias associated with mappability and GC content, but also the locus-specific amplification bias. We demonstrate the performance of SCCNV using SCWGS data of multiple experimental protocols, i.e., DOP-PCR (degenerativeoligonucleotide PCR), MDA (multiple displacement amplification), MALBAC (multiple annealing and loopingbased amplification cycles), and LIANTI (linear amplification via transposon insertion) (Navin et al., 2011;Gundry et al., 2012;Zong et al., 2012;Chen et al., 2017;Dong et al., 2017).

SCCNV
Our software tool for analyzing single-cell copy number variation (SCCNV) was written in Python. Its source code is freely available with a usage description and an example at Github 1 under the GNU Affero General Public License v3.0. It uses SCWGS data after alignment as input (i.e., a bam file per single cell). Of note, SCCNV cannot take sequencing data of a pool of single cells (a bam file composed of thousands of single cells data), e.g., the 10× Genomics single-cell copy number data, as input.
First, SCCNV divides the genome into bins of equal size (500 kb as default), and counts the numbers of reads per bin of a cell. This step is relatively time-consuming, and we suggest users to use samtools on a high-performance computer cluster in parallel for all samples to be time-efficient (see instructions on Github). The remaining major steps of SCCNV do not require much computational resources -most modern desktop computers should work well.
SCCNV then normalizes mappability, which indicates the efficiency of the alignment to a genomic region. For a bin b of a cell, SCCNV adjusts the raw number of reads, denoted by NR raw , by dividing over the mappability M, where mappability M is a value ranging from 0 to 1. SCCNV uses Encode Align100mer mappability score, downloaded from the UCSC genome browser, and calculates the mappability of each bin by using their weighted average. Then, SCCNV normalizes for GC content. For a cell, SCCNV calculates the percentile of GC content of each bin. For a bin b of the cell, its number of aligned reads after normalizing GC content, NR GC,b , is, where NR map,genome is the average NR map per bin of all bins from the cell; NR map , b,percentile is the average NR map per bin of bins in the same GC percentile as bin b.
After normalization for mappability and GC content, a pattern of sequencing read depth emerges that is consistent across different cells amplified using the same experimental protocol, i.e., the locus-specific amplification bias. Therefore, the bias is normalized across all cells in a particular batch and experiment. First, to make the NR GC,b comparable across cells, SCCNV 1 https://github.com/biosinodx/SCCNV converts it to a raw copy number estimate, denoted by CN raw,b for bin b of cell c, as follows, where NR GC,genome,c is the median NR GC,c per bin in the genome of cell c; ploidy is 2 by default. Second, the adjusted copy number is estimated as, where CN raw,b,−c denotes the average CN raw for bin b across all cells except cell c. Of note, with this step SCCNV aims to discover the difference between the cell c and the other cells. When analyzing CNVs of multiple tumor cells, it is not appropriate to use all tumor cells as input of SCCNV; instead, one should use one tumor cell with two or more normal diploid cells as the input. Then SCCNV uses a sliding window approach to further minimize amplification noise. By default, a window includes 11,500-kb bins, i.e., 5.5 Mb of DNA sequence in total, with a 500-kb step size between two neighboring windows, SCCNV then models the distribution of CN smoothed,b,c of all bins in autosomes of a cell c as a normal distribution N(µ, σ c 2 ). The µ = 2, and σ is estimated as, (6) where CN smoothed , 30 . 9%,c and CN smoothed,69 . 1%,c are the 30.9 and 69.1% percentiles of the CN smoothed , b,c of all bins in the autosomes, corresponding to the µ -0.5σ and µ + 0.5σ percentiles, respectively. Here, we did not use the observed s.d. of CN smoothed,b,c of all the bins because the normal distribution was to estimate amplification noise, not real variation in copy number across the genome. When a cell has several large CNVs, the s.d. will be high, even if its amplification noise remains low.
Assuming equally likely priors, for a bin b and a given possible copy number k ∈ {0, 1, 2, 3, 4}, its posterior probability is, where x is the CN smoothed,b,c , and f i (x) is the probability density function of a normal distribution, where the variance σ c 2 is calculated according to Eq. (5). We only used k ∈ {0, 1, 2, 3, 4} because the final copy number call was after multiple testing correction, i.e., Eq. (9) below, and we wished to minimize the number of hypotheses tested, i.e., five for a copy number of 0-4. However, this will result in an underestimation if the real copy number exceeds four. To resolve this issue, for bins with copy numbers ≥4 and ≤100, SCCNV reports the closest integer to the CN smoothed,b,c . SCCNV allows <1 false positive per cell. Therefore, it determines bin b as a copy number variant when,

Sensitivity and False Positive Rate
To determine copy number, SCCNV is based on a statistical test described in equations (8) and (9) for a normal distribution and multiple testing correction separately. With a given value of coefficient of variation (CV) of CN smoothed,b,c , sensitivity and FPR can be estimated as follows. Sensitivity equals the difference between two cumulative distribution functions (CDFs) of Eq. (8) at the upper and lower boundaries, which SCCNV provides the correct CNV call after the correction in Eq. (9). The percentage of FP out of all bins is equal to the sum of (a) CDF at the lower boundary of SCCNV providing an incorrect CN gain call; and (b) 1 -CDF at the upper boundary of SCCNV providing an incorrect CN loss call. Then FPR was estimated as the ratio of % of FP to the sum of % of FP and % of TN. For example, under the assumption that the true copy number is 2, if SCCNV calls CN = 2 when CN smoothed,b,c is between 1.8 and 2.2, sensitivity = CDF(x = 2.2, µ = 2) -CDF(x = 1.8, µ = 2), in which CDF is the cumulative distribution function of Eq. (8). If SCCNV calls (a) CN = 1 when CN smoothed,b,c is between 0.8 and 1.2, and (b) CN = 3 when CN smoothed,b,c is between 2.8 and 3.2, then%FP = CDF(x = 1,2, µ = 2) + 1 -CDF(2.8, µ = 2).  Table 1 lists all the single-cell data used in this study.

Testing Datasets and Preprocessing of Data
Sequence alignment was performed using BWA and GATK as follows (Li and Durbin, 2009;McKenna et al., 2010). Raw sequencing data of each sample (single cell and bulk DNA) were obtained from the SRA database and subjected to quality control using FastQC (version 0.11.4; 2 ) and trimming using Trim Galore (version 0.4.1; 3 ) with default parameters. Then they were aligned to the human reference genome (version hg19) using BWA MEM (version 0.7.12; option: -t number of CPUs -M reference genome fasta file) (Li and Durbin, 2009). PCR duplications were removed using picard tools (version 1.119; 4 ). The alignments were subjected to indel realignment and basepair recalibration using GATK (version 3.5; using options, RealignerTargetCreator, IndelRealigner, BaseRecalibrator, and PrintReads) (McKenna et al., 2010). The step above was used for generating an analysisready bam file for other types of variants, e.g., single nucleotide variants, small insertions and deletions, and this step is optional for large CNVs or aneuploidies using SCCNV. Reads with mapQ < 30 were discarded. The number of reads per bin of each sample was calculated using samtools (version 1.3; option: bedcov) . SCCNV (version 1.0) was used to estimate CNV of each cell.

Major Steps in SCCNV
As illustrated in Figure 1, SCCNV is composed of four major steps. It first calculates sequencing depth of the genome in bins of equal size (500-kb by default). Second, it normalizes the depth based on two features of the reference genome, including mappability and GC content. These two features are usually also considered by conventional CNV callers for bulk DNA sequencing. Next, it does further normalization across single cells of a same experimental batch. This step minimizes locus-specific bias due to WGA. Finally, it smooths the data (5 Mb by default) and infers copy number of each bin. Intermediate results between any two connecting steps can be generated by SCCNV for users to monitor its performance. We provide an example about the intermediate results of a normal neuronal nucleus amplified with MDA (SRA id: SRR2141574) in Supplementary Figures 1-4.

Performance on Real Datasets
To evaluate the performance of SCCNV, we obtained four SCWGS datasets from the SRA database, which includes 8.2 TB high-depth WGS data of 63 single human fibroblasts, neurons and cells of a tumor cell line amplified using eight different protocols, i.e., DOP-PCR (Sigma), Rubicon, MALBAC, LIANTI, and MDA (including four MDA protocols, Qiagen, GE, Lodato et al's MDA and SCMDA) (Supplementary Table 1; Zong et al., 2012;Lodato et al., 2015;Chen et al., 2017;Dong et al., 2017). The data were processed as described in the Materials and Methods.
We used the CV of sequencing depth across all genomic bins on autosomes as an indicator of performance, because it directly determines sensitivity and FPR of copy number calling step in SCCNV. We show sensitivity and FPR of the copy number calling in Figures 2A,B, respectively. As the CV decreased from 0.135 to 0.041, the sensitivity increased from 0 to 100% and the FPR decreased from 3.8 × 10 −3 to 3.9 × 10 −11 .
For the real datasets, we calculated the CV of raw data and normalized data after each step to demonstrate the performance of normalization in SCCNV (Figure 3). Almost all raw data (CV: 0.475 ± 0.135, avg. ± s.d.) are beyond the detection threshold, i.e., CV = 0.135. Each step of normalization decreased the CV by a significant fraction: on average, mappability normalization by 5%, GC content normalization by 33% percent, across-cell normalization by 22%, and smoothing by 55%. This shows the contributions of each normalization step to performance increase in the final variant calling. After all the normalization steps, the CVs are 0.107 ± 0.076 (avg. ± s.d.), corresponding to a  sensitivity of 68.6% and an FPR of 3.6 × 10 −4 on average. Of note, different amplification protocols have significantly different performance when using SCCNV, likely due to differences in DNA amplification linearity among the protocols. As expected, LIANTI outperformed all the others (Chen et al., 2017). Protocols that included PCR steps, i.e., DOP-PCR, MALBAC and Rubicon, ranked in the middle. Although known as suffering from the least artifactual SNVs (Dong et al., 2017;Zhang et al., 2019), MDA-based protocols were ended last (Figure 3).

DISCUSSION
Identification of copy number variation and aneuploidy has been one of the major areas of genomics methods development. Several statistical models have been developed for analyzing initially microarray data and later sequencing data of bulk DNA, for example, Circular Binary Segmentation (CBS), Mean Shift-Based (MSB) model, Shifting Level Model (SLM), Expectation Maximization (EM) model, and Hidden Markov Model (HMM) as discussed in Zhao et al. (2013). Based on these models, multiple computational software tools have been developed, e.g., CBS, Copynumber, CNVnator, and HMMcopy (Olshen et al., 2004;Abyzov et al., 2011;Ha et al., 2012;Nilsen et al., 2012). To call CNVs, most of the methods rely on assessing either sequencing read depth or alternate allele fraction at heterozygous SNPs across the genome of one sample, i.e., across-genome normalization. Some of the methods have been applied directly for analyzing single-cell sequencing data with specific filtering for cells with too much bias after WGA.
A few new tools for single cell data were also developed recently under the same rationale (assessing one sample at a time, or across-genome normalization), such as AneuFinder, baseqCNV, Ginkgo and SCOPE (Garvin et al., 2015;Bakker et al., 2016;Fu et al., 2019;Wang et al., 2019). SCCNV was developed based on our observation that the locus-specific amplification bias is often the same in different cells within one experimental batch and amplified using the same protocol (e.g., Supplementary Figure 3); and we showed that normalization across multiple samples (cells) significantly contributed to the increase in variant calling performance for single cells amplified using most WGA protocols (Figure 3). Following the same across-sample normalization rationale, another software tool, SCNV, was developed (Wang et al., 2018). It differs from SCCNV that SCCNV performs normalization based on empirical data directly (Eq. 4) without any assumption on its distribution. Of note, with across-sample normalization, SCCNV essentially aims to identify differences among different cells in one input batch and, therefore, it is important to input cells of interest (e.g., tumor cells) together with cells with a standard diploid genome.

CONCLUSION
We developed SCCNV to identify copy number variations from whole-genome amplified single cells. We demonstrated its stepwise performance using most of the recent SCWGS datasets generated with 8 different amplification protocols.