METHODS article

Front. Genet., 13 January 2021

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.632311

A Density Peak-Based Method to Detect Copy Number Variations From Next-Generation Sequencing Data

  • 1. The School of Computer Science and Technology, Xidian University, Xi’an, China

  • 2. Xi’an Key Laboratory of Computational Bioinformatics, The School of Computer Science and Technology, Xidian University, Xi’an, China

Abstract

Copy number variation (CNV) is a common type of structural variations in human genome and confers biological meanings to human complex diseases. Detection of CNVs is an important step for a systematic analysis of CNVs in medical research of complex diseases. The recent development of next-generation sequencing (NGS) platforms provides unprecedented opportunities for the detection of CNVs at a base-level resolution. However, due to the intrinsic characteristics behind NGS data, accurate detection of CNVs is still a challenging task. In this article, we propose a new density peak-based method, called dpCNV, for the detection of CNVs from NGS data. The algorithm of dpCNV is designed based on density peak clustering algorithm. It extracts two features, i.e., local density and minimum distance, from sequencing read depth (RD) profile and generates a two-dimensional data. Based on the generated data, a two-dimensional null distribution is constructed to test the significance of each genome bin and then the significant genome bins are declared as CNVs. We test the performance of the dpCNV method on a number of simulated datasets and make comparison with several existing methods. The experimental results demonstrate that our proposed method outperforms others in terms of sensitivity and F1-score. We further apply it to a set of real sequencing samples and the results demonstrate the validity of dpCNV. Therefore, we expect that dpCNV can be used as a supplementary to existing methods and may become a routine tool in the field of genome mutation analysis.

Introduction

Copy number variation (CNV) is an important category of DNA structural variations, including amplifications or losses of DNA fragments with a length of more than 1 kilo base-pairs (bp) (; ). The mutation rate of CNV loci is much higher than that of single nucleotide polymorphisms (SNP) across the whole genome. CNV is one of the important pathogenic factors affecting human complex diseases (; ; ). Therefore, it is necessary and meaningful to analyze CNVs when studying and treating complex diseases especially human cancers. Generally, the mechanisms for the formation of CNVs can be classified into two categories: DNA recombination and DNA error replication (). In each category of the mechanisms, CNVs are usually presented in either amplification or deletion states. The major step of CNV analysis in samples obtained from human cancers is to identify which genome regions are CNVs and determine the corresponding states (i.e., either amplification or deletion). Therefore, it is required to develop statistically computational methods to analyze the data generated by different sequencing technologies.

There are three primary types of technologies that can produce data sets for the detection of CNVs: array comparative genomic hybridization (aCGH), SNP array, and next-generation sequencing (NGS) technologies. Currently, various computational methods have already been developed for analyzing each type of the data sets. For example, aiming at aCGH data, classic methods include fastRPCA (), PLA (), WaveDec (), and graCNV (). Meanwhile, aiming at SNP array data, famous methods include GISTIC (), STAC (), SAIC (), and AISAIC (). In comparison with these two types of data, NGS data is at the highest resolution and is used widely for the detection of CNVs in recent years. Due to the inherent characteristics behind NGS data, the CNV detection methods using NGS data can be classified into four categories (): pair-end mapping, split-read, de no assembly, and read depth (RD) based approaches. The intention of the pair-end mapping-based approach is that it determines CNVs according to the difference of the length between the two ends of paired reads mapped to the reference and the insert fragment, while the split-read based approach determines CNVs by splitting the sequence and observing the distance of the split reads mapped to the reference sequence. De no assembly approach is usually used to find out novel inserted sequences (). These three categories of approaches are appropriate for the detection of CNVs with a limited size, since the pair-end mapping and split-read based approaches are subject to the length of inserted fragments and the de no assembly method is subject to the cost of computation time. Nevertheless, CNVs are usually ranging at a large scope of interval in size, and can be up to more than tens of M base-pairs. Relative to the above three categories, the RD based approach is more versatile in detecting CNVs with any sizes. The major principle of this approach is to determine CNVs according to the variance of RDs across the genome to be analyzed.

The RD based approach is generally implemented through the following four steps (; ): (1) mapping sequencing reads to a reference genome and extracting a read count profile, (2) dividing the genome into non-overlapping bins and calculating a RD value for each bin based on the read count profile, (3) making normalization and correction to the RD values, and (4) analyzing the corrected RD values to declare CNVs. The theoretical assumption underlying the RD based approach is that the RD value of one bin or one region is roughly related to its corresponding copy number, i.e., the larger the RD value, the larger the copy number, and vice versa. Therefore, the key point here is how to design an appropriate scheme to reasonably analyze the RD values. The currently popular methods for detecting CNVs using RD values include but are not limited to: RDXplorer (), CNVnator (), GROM-RD (), XCAVATOR (), Control-FREEC (), CNVkit (), CNAseg (), CopywriteR (), SeqCNV (), CloneCNA (), iCopyDAV (), DeAnnCNV (), CNV_IFTV (), CONDEL (), and CNV-LOF (). Each of these methods has its own characteristics and advantages. For example, Control-FREEC makes the best use of GC-content to normalize the read count profile so as to find out CNV regions, and iCopyDAV chooses an appropriate bin size and uses thresholds for RD values to declare CNVs. Although much effectiveness has been achieved by these methods, some factors such as low-level tumor purity (i.e., the fraction of tumor cells in the sequencing sample), limited coverage depth and GC-content bias still pose a big challenge to the detection of CNVs with small amplitudes. Therefore, it would be necessary and meaningful to seek for new methods that can grasp the essential characteristics of sequencing data associated with CNVs.

Given the above, we summarize several aspects that should be considered to improve the detection of CNVs. In the first place, it is necessary to make a smooth or segmentation to the observed RD profile, so that adjacent bins with similar amplitudes can be merged into the same region and the bins showing a local mutation state cannot be masked. In the second place, it is meaningful to extract effective features from sequencing data that can make an accurate distinguishing between mutated and normal genome regions. In the last place, it is necessary to design a reasonable model for displaying the extracted features and perform a suitable analysis of the features to determine CNVs.

With a careful consideration of the problems described above, in this article, we propose a new method, called dpCNV, for the detection of CNVs from NGS data. The motivation and underlying idea of dpCNV could be demonstrated as below. It considers the inherent correlations among adjacent positions on the genome, and thus analyzes CNVs based on the unit of genome segments rather than individual bins. These segments can be produced by performing a segmentation process on the RD profile. It carefully takes into account that CNV regions usually accounts for a small fraction of the whole genome and many CNVs just display a “local” outlier state, and thus extracts two related features (i.e., local density and minimum distance) from the RD profile based on the density peak algorithm (). Finally, dpCNV analyzes the two feature values for each segment through multivariate Gaussian distribution and calculates the corresponding p-value to declare whether it is a CNV. We perform a large number of simulation experiments to test the dpCNV method and make comparisons with several existing methods. The experimental results demonstrate the merit of the proposed method. Moreover, we apply it to analyze a set of real sequencing samples and prove its validity.

The remainder of this article is organized as follows. Section “Materials and Methods” demonstrates the workflow of dpCNV and the related principles. In section “Results,” simulation studies are designed to evaluate the performance of the proposed method and its peer methods, as well as validations by applying it to a set of real sequencing samples. Section “Conclusion” discusses the proposed method and summarizes an outline of future work.

Materials and Methods

Workflow of dpCNV

The workflow of the dpCNV method is demonstrated in Figure 1. The dpCNV method works by starting from an input of a sequenced tumor sample and a reference genome. The sequenced tumor sample is aligned to the reference genome by using the commonly used alignment tool BWA (), and then a read count profile is extracted from the alignment result by using SAMtools (). With the read count profile, a RD profile is produced with a pre-defined bin size, such as 1000 base pairs (bp), which is moderate in the detection of CNVs ().

FIGURE 1

Based on the RD profile, the dpCNV method performs CNV analysis via the following four steps. (I) It implements a segmentation process on the RD profile to generate small genome segments, each of which usually include a set of adjacent and correlated bins. Here, the segmentation is carried out by using the Fused-Lasso algorithm (). (II) It extracts two features as the statistic and calculates the corresponding values via density peak algorithm. (III) It establishes a two-dimensional null distribution via multivariate Gaussian distribution and tests significance for each segment. (IV) It declares CNVs via a threshold of significance level and determines CNV statuses (i.e., amplification or deletion) via a RD cutoff.

Segmentation on the RD Profile

With the RD profile, a GC-content bias correction process is carried out through a similar approach with the works (; ), and then a segmentation process is implemented on the corrected RD profile. The purpose of the segmentation is to divide the whole RD profile into a set of small segments, each of which is composed by adjacent bins, and is to provide a segment-based unit for the detection of CNVs rather than a bin-based unit. Theoretically, the segment-based unit can help to increase the independence of elements in significance testing, so that a reasonable evaluation of p-values can be expected to be achieved (). Nevertheless, the bin-based unit may result in a conservativeness of p-value evaluation since adjacent bins are usually correlated ().

There are various existing approaches that can carry out segmentation on the RD profile. Here we choose the Fused-Lasso algorithm for this task (). In comparison with other segmentation algorithms, the Fused-Lasso algorithm performs better in smoothing adjacent bins with highly similar RD values while remaining local fluctuations among the resulted segments (). For convenience, the resulted segments are denoted by:

where n denotes the total number of segments that have been achieved. The following steps of analyzing CNVs are based on the set of S.

Calculation of Statistic Values for Each Segment

With the segment-based RD profile S, we adopt the density-based peak algorithm to extract two features as the statistic for each segment: local density (ρ) and minimum distance (δ), and to calculate their corresponding values. With the consideration of that regions with changed copy numbers are inherently different from those of normal copy numbers and only account for a small part of the whole genome, we transfer the problem of detecting CNVs to the issue of identifying outliers from the set of segments with features of ρ and δ. Accordingly, each segment can be regarded as an object or a point in the two dimensional space of ρ and δ. In the following text, we make a detailed description to these two features and the calculation approach.

Before describing the two features ρ and δ, we introduce the Euclidean distance between any two objects (segments) si and sj. Given the total number of segments of n, an Euclidean distance matrix Mn×n can be obtained, where each element (dij) can be calculated by the Euclidean distance formula:

where ρi and δi represent the feature values of object si, and the same to ρj and δj. With the Euclidean distance matrix Mn×n, an adjustable distance threshold γ is introduced according to the theorem of the density peak algorithm (). This threshold can be explained as a radius of each object si and is used to calculate how many objects are adjacent to the object si within the distance of γ. Then, the concept of local density ρ for each object is produced.

Definition 1

The local density ρi of the object si is defined as the number of objects adjacent to the object si with the radius γ, and can be calculated by using Eq. 3:

where χ(x) = 1 if x < 0, and otherwise, χ(x) = 0.

Definition 2

The minimum distance δi of the object si is defined as the minimum value among the distances between the object si and those objects with higher density than si, and can be expressed as Eq. 4:

For the object si with the highest density, the value δi is defined as the maximum distance between the object and the rest of objects in the set S, and can be expressed as Eq. 5:

For a clear understanding of local density and minimum distance, we use an example to describe the distribution of a set of objects with respect to the values of the two features, as shown in Figure 2. For the example, we can see that the objects at the abnormal area (outliers) are near to the left and bottom side of the distribution. From the basic idea of density peak algorithm, outliers usually have a larger minimum distance and a smaller local density than those of other objects. Here, the abnormal area denotes the place of outlier objects, and normal area denotes the cluster of most objects. More details about the density peak algorithm is referred to .

FIGURE 2

Establish of a Two-Dimensional Null Distribution

With the statistic values in a two-dimensional space [i.e., local density (ρ) and minimum distance (δ)], the task now is how to design an appropriate model to test the significance of them. Since the values of the two features are usually at different scopes, it is not appropriate to combine them as a single feature value for the analysis. Therefore, it would be reasonable to design a model that can analyze the statistic values in a two-dimensional space. To mirror this, we establish a multivariate (i.e., two-dimension) Gaussian distribution as the null distribution based on the observed statistic values, and then evaluate a p-value for each of them. The multivariate Gaussian distribution is expressed as Eq. 6:

where μ is a two-dimensional vector, representing the mean values of local density and minimum distance, i.e., , and Ʃ represents the covariance matrix of the two features.

The reason about why to choose a multivariate Gaussian distribution as the null distribution can be explained as below. Assuming that there are no CNVs in the segment-based RD profile S, and then the mean RD value should be around the sequencing coverage depth of the whole genome and the variance is primarily contributed by random artifacts such as sequencing and mapping errors. From this viewpoint, the RD values can be approximately modeled by a Gaussian distribution (). Theoretically, with a Gaussian distributed object, the deduced local density (ρ) and minimum distance (δ) would also follow Gaussian distribution, respectively. Therefore, the joint of the two features can be approximately modeled by a two-dimensional Gaussian distribution. For a clear understanding of this, we depict an example using a simulated dataset to show the distribution of the statistic values (ρ, δ) in Figure 3.

FIGURE 3

Declaration and Determination of CNV Statuses

Based on the two-dimensional null distribution above, the p-value (pi) for each object (segment) si can be calculated. We define a commonly used significance level α as the cutoff for declaring CNVs, i.e., if pi is less than α, then the object si will be declared as a CNV status; otherwise, it is regarded as a normal status. According to our experience and a large number of simulation experiments, we find that the value of α is appropriate to be assigned with 0.005.

With the abnormal objects, we further deduce their types (i.e., amplification or deletion) of CNV according to their RD values. Here, we use the average RD value of the objects in the cluster center (shown in Figure 3) as the baseline (rb) of normal copy number. This is consistent with that the objects in the cluster center are regarded as normal objects according to the density peak algorithm. Subsequently, for each abnormal object, if its RD value is larger than rb, then it is regarded as an amplification event, otherwise, it is regarded as a deletion event.

Results

The dpCNV software is implemented in Python language, and the code is publicly available at https://github.com/BDanalysis/dpCNV/. In order to demonstrate the performance and usefulness of our proposed method, we first conduct a number of simulation experiments and make comparisons with several existing methods in terms of precision, sensitivity and F1-score (the harmonic mean of sensitivity and precision). Then, we apply the proposed method to a set of real sequencing samples, which have been obtained from the European Genome-phenome Archive (EGA) databases.1 To assure a fair comparison between dpCNV and other methods, we use the default parameter values in the implementation of the compared methods.

Simulation Studies

Simulation studies are usually regarded as an appropriate and feasible way to assess the performance of existing and newly developed methods (, , ). This is because that the ground truth CNVs embedded in the simulated data sets could be used for an exact calculation of sensitivity and precision for the methods. Currently, there are many methods for simulating NGS data have been proposed. Here, we use one of our previously developed simulation methods, IntSIM (), for the simulation of NGS data with ground truth CNVs. Two primarily factors (i.e., tumor purity and depth of coverage) have been considered in the simulation process. Specifically, six scenarios have been simulated by setting different values of tumor purity (0.2, 0.3, and 0.4) and coverage depth (4× and 6×), and in each scenario 50 replicated samples have been produced.

With these simulated data sets, the dpCNV method and four peer methods (including FREEC, GROM-RD, CNVnator, and CNV_IFTV) are performed. Their results and comparisons are depicted in Figure 4. Here, the precision is calculated as the ratio of the number of correctly detected CNVs to the number of all declared CNVs, while the sensitivity is calculated by the ratio of the number of correctly detected CNVs to the total number of ground truth CNVs. From the Figure 4, one could observe that the performances of most methods are improving along with the increasing of tumor purity and coverage depth. Comparatively, the dpCNV method is superior in terms of the trade-off (F1-score) between precision and sensitivity in each of the simulation scenarios. With respect to sensitivity, dpCNV ranks first in all the simulation scenarios, followed by FREEC or CNV_IFTV. With respect to precision, GROM-RD and CNVnator display larger values than other methods.

FIGURE 4

The fact that dpCNV is superior to other methods under this study is due to the following reasons. Firstly, the relationship between adjacent bins has been taken into account by performing a segmentation process. In this process, most noised data points can be smoothed, and some local variations can be remained. In addition, two meaningful features (i.e., local density and minimum distance) are extracted from the segmented data based on a density peak algorithm. Secondly, a two-dimensional null distribution has been established for testing the significance of each genome segment. This can help to relieve the conservativeness of p-value assessment and provide a meaningful null hypothesis testing.

Real Data Applications

To further validate the performance of dpCNV, we apply it to three whole-genome sequencing data (EGAD00001000144_LC, EGAR00001004802_2053_1, and EGAR00001004836_2561_1) obtained from the EGA project. These samples include a lung cancer sample and two ovarian cancer samples. Besides, we also perform three peer methods (FREEC, CNVnator, CNV_IFTV) on these samples for comparisons. Since real sequencing data usually have no ground truth CNVs, it is difficult for us to exactly calculate the sensitivity and precision for the methods. Nevertheless, we analyze the overlapping results among the compared methods to observe the consistence between their results, as shown in Figure 5. We can note that CNVnator gets the largest number of overlaps with other methods, followed by dpCNV and FREEC. However, the total number of detected CNVs detected by CNVnator is also the largest. This means that it is not appropriate to determine which method is superior just according to the number of overlapped CNVs. Nevertheless, we adopt the overlapping density score (ODS) proposed in our previous work () to evaluate the methods. The ODS is calculated by using Eq. 7. The comparative result is shown in Table 1, from which we can notice that dpCNV achieves the highest ODS in the analysis of two ovarian tumor samples and FREEC gets the highest ODS in the analysis of the lung tumor sample:

FIGURE 5

TABLE 1

SampledpCNVFREECCNV_IFTVCNVnator
EGAD00001000144_LC99.4114.0247.9619.06
EGAR00001004802_2053_1155.25152.8935.7544.2
EGAR00001004836_2561_1263.16192.7957.04114.7
Average172.6153.2346.9259.32

Comparison of ODS between dpCNV and three peer methods on real samples.

Bold value denotes the largest values in each line.

where mcnv denotes the total overlapped CNVs divided by the number of compared methods and denotes the total overlapped CNV divided by the number of CNVs detected by itself.

An overview of the numbers of CNVs detected by the four methods are shown in Figure 6, where we could clearly take an overview of distribution on 22 autosomes of results called by dpCNV, FREEC, CNVnator, and IFTV, respectively. Each circus diagram is composed of two parts, the upper part consists of four arcs corresponding to the four detection methods and the lower part consists of 22 arcs corresponding to the 22 autosomes. In the lung cancer diagram, dpCNV obtains the largest number of CNVs while CNVnator obtains the smallest number of CNVs. In the diagrams of the two ovarian cancer samples, CNVnator gets the largest number of CNVs while FREEC and dpCNV get relatively fewer CNVs.

FIGURE 6

In addition, based on the COSMIC (catalog of somatic mutations in cancer) database, we analyze the CNVs detected by our proposed method on three whole genome sequencing data from biological meanings. For example, 425 CNVs detected by dpCNV from the lung cancer sample are compared to the COSMIC database. There are 151 cytobands and 405 genes in the comparative result. We may notice that many cytobands contain a lot of meaningful genes. For example, the cytoband 11p15.5 contains IFITM1 () and IFITM3 (). Many of genes are confirmed to be tumor driver genes and closely related to non-small cell lung cancer, such as C3orf21 (), ZNF454 (), and C10orf137 (). For the two ovarian cancer samples, dpCNV gets 225 cytobands and 128 cytobands, 285 genes and 529 genes overlapped with the COSMIC database, respectively, in which there are many important tumor driver genes corresponding to ovarian cancer, such as PUM1 (), GOLPH3L (), PIWIL4 (), and KNDC1 ().

Conclusion

Accurate detection of CNVs is a crucial step for a comprehensive analysis of genomic mutations in the study of genome evaluation and human complex diseases. In this article, a new method named dpCNV is proposed for the detection of CNVs from NGS data. The central point of dpCNV is that it extracts two meaningful features based on the density peak algorithm and establishes a two-dimensional null distribution to test the significance of genome segments. dpCNV is different from traditional methods and have some new characteristics: (1) it considers the intrinsic correlations among genome bins, and adopts Fused-Lasso segmentation algorithm to smooth the noise data between adjacent bins; (2) it carefully takes into account that CNV regions usually accounts for a small fraction of the whole genome and many CNVs just display a “local” outlier state, and thus extracts two related features (i.e., local density and minimum distance) from the RD profile based on the density peak algorithm; (3) it analyzes the two feature values for each segment through multivariate Gaussian distribution and calculates the corresponding p-value to declare whether it is a CNV.

The performance of dpCNV is assessed and validated through simulation studies and applications to a set of real sequencing samples. In simulation experiments, dpCNV outperforms four peer methods (FREEC, GROM-RD, CNVnator, and CNV_IFTV) in terms of sensitivity and F1-score. In real sample experiments, dpCNV is performed on three whole genome sequencing samples including a lung cancer sample and two ovarian samples, and is compared with three peer methods (FREEC, CNVnator, and CNV_IFTV). Here, we have not make comparison with GROM-RD since it has not obtained results from these real sequencing samples. In this comparison, we make an evaluation of the four methods by using ODS. The result indicates that dpCNV obtains a better performance than other methods. In addition, we demonstrate the biological meanings of the detected CNVs by referring the COSMIC database.

With regard to the future work, we plan to make a further improvement to the current version of the dpCNV method from the following aspects. In the first place, we will design a strategy to predict tumor purity and integrate it to the detection of CNVs. In the second place, we intend to predict absolute copy numbers for each CNV region, since absolute copy numbers might provide much information of the study of chromosome instability. In the third place, we intend to combine the detection of CNVs with other types of genomic mutations into a pipeline analysis, which will help to improve the efficiency of genomic mutation analysis. In the last palace, it is necessary to explore the detection of CNVs by using mRNA sequencing data. Generally, RD values obtained from the sequencing data on DNA are closely related with copy numbers. A high expression of mRNAs might be associated with a large copy number. Therefore, using mRNA sequencing data may facilitate the detection of CNVs in tumor genomes.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

Author contributions

KX and YT participated in the study and design of algorithms and experiments, and participated in writing the manuscript. XY directed the whole work, conceived of the study and help, and edited the manuscript. YT participated in the analysis of the performance of the proposed method. All authors read the final manuscript and agreed the submission.

Funding

This work was supported by the Natural Science Foundation of China under grant no. 61571341.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1

    AbyzovA.UrbanA. E.SnyderM.GersteinM. (2011). CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing.Genome Res.21974984. 10.1101/gr.114876.110

  • 2

    AuerH.NewsomD. L.NowakN. J.McHughK. M.SinghS.YuC. Y.et al (2007). Gene-resolution analysis of DNA copy number variation using oligonucleotide expression microarrays.BMC Genom.8:111. 10.1186/1471-2164-8-111

  • 3

    BeroukhimR.GetzG.NghiemphuL.BarretinaJ.HsuehT.LinhartD.et al (2007). Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma.Proc. Natl. Acad. Sci. U S A.1042000720012. 10.1073/pnas.0710052104

  • 4

    BoevaV.PopovaT.BleakleyK.ChicheP.CappoJ.SchleiermacherG.et al (2012). Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data.Bioinformatics28423425. 10.1093/bioinformatics/btr670

  • 5

    CaiH.ChenP.ChenJ.CaiJ.SongY.HanG. (2018). WaveDec: a wavelet approach to identify both shared and individual patterns of copy-number variations.IEEE Trans. Biomed. Eng.65353364. 10.1109/tbme.2017.2769677

  • 6

    ChenY.ZhaoL.WangY.CaoM.GelowaniV.XuM. C.et al (2017). SeqCNV: a novel method for identification of copy number variations in targeted next-generation sequencing data.BMC Bioinform.18:147. 10.1186/s12859-017-1566-3

  • 7

    DharanipragadaP.VogetiS.ParekhN. (2018). iCopyDAV: integrated platform for copy number variations-Detection, annotation and visualization.PLoS One13:e0195334. 10.1371/journal.pone.0195334

  • 8

    DiskinS. J.EckT.GreshockJ.MosseY. P.NaylorT.StoeckertC. J.et al (2006). STAC: a method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments.Genome Res.1611491158. 10.1101/gr.5076506

  • 9

    DuanJ.DengH. W.WangY. P. (2014). Common copy number variation detection from multiple sequenced samples.IEEE Trans. Biomed. Eng.61928937. 10.1109/tbme.2013.2292588

  • 10

    FengY. L.HeF.WuH. N.HuangH.ZhangL.HanX.et al (2015). GOLPH3L is a novel prognostic biomarker for epithelial ovarian Cancer.J. Cancer6893900. 10.7150/jca.11865

  • 11

    FreemanJ. L.PerryG. H.FeukL.RedonR.McCarrollS. A.AltshulerD. M.et al (2006). Copy number variation: new insights in genome diversity.Genome Res.16949961. 10.1101/gr.3677206

  • 12

    FridleyB. L.ChaliseP.TsaiY. Y.SunZ.VierkantR. A.LarsonM. C.et al (2012). Germline copy number variation and ovarian cancer survival.Front. Genet.3:142. 10.3389/fgene.2012.00142

  • 13

    GuanX.ChenS.LiuY.WangL. L.ZhaoY.ZongZ. H. (2018). PUM1 promotes ovarian cancer proliferation, migration and invasion.Biochem. Biophys. Res. Commun.497313318. 10.1016/j.bbrc.2018.02.078

  • 14

    GuoL. M.LiuM.LiX.TangH. (2009). The expression and functional research of PIWIL4 in human ovarian Cancer.Prog. Biochem. Biophys.36353357. 10.3724/sp.j.1206.2008.00478

  • 15

    InfusiniG.SmithJ. M.YuanH.PizzollaA.NgW. C.LondriganS. L.et al (2015). Respiratory DC Use IFITM3 to avoid direct viral infection and safeguard virus-specific CD8+ T cell priming.PLoS One10:e0143539. 10.1371/journal.pone.0143539

  • 16

    IvakhnoS.RoyceT.CoxA. J.EversD. J.CheethamR. K.TavareS. (2010). CNAseg–a novel framework for identification of copy number changes in cancer from second-generation sequencing data.Bioinformatics2630513058. 10.1093/bioinformatics/btq587

  • 17

    KuilmanT.VeldsA.KemperK.RanzaniM.BombardelliL.HoogstraatM.et al (2015). CopywriteR: DNA copy number detection from off-target sequence data.Genome Biol.16:49.

  • 18

    LiH.DurbinR. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics2517541760. 10.1093/bioinformatics/btp324

  • 19

    LiH.HandsakerB.WysokerA.FennellT.RuanJ.HomerN.et al (2009). The sequence alignment/map format and SAMtools.Bioinformatics2520782079. 10.1093/bioinformatics/btp352

  • 20

    MagiA.PippucciT.SidoreC. (2017). XCAVATOR: accurate detection and genotyping of copy number variants from second and third generation whole-genome sequencing experiments.BMC Genom.18:747. 10.1186/s12864-017-4137-0

  • 21

    MartinJ.TammimiesK.KarlssonR.LuY.LarssonH.LichtensteinP.et al (2019). Copy number variation and neuropsychiatric problems in females and males in the general population.Am. J. Med. Genet. Part B, Neuropsychiatric Genet.180341350. 10.1002/ajmg.b.32685

  • 22

    NowakG.HastieT.PollackJ. R.TibshiraniR. (2011). A fused lasso latent feature model for analyzing multi-sample aCGH data.Biostatistics12776791. 10.1093/biostatistics/kxr012

  • 23

    RodriguezA.LaioA. (2014). Machine learning. clustering by fast search and find of density peaks.Science34414921496. 10.1126/science.1242072

  • 24

    SakamotoS.InoueH.KohdaY.OhbaS. I.MizutaniT.KawadaM. (2020). Interferon-Induced transmembrane protein 1 (IFITM1) promotes distant metastasis of small cell lung Cancer.Int. J. Mol. Sci.21:4934. 10.3390/ijms21144934

  • 25

    ShlienA.MalkinD. (2009). Copy number variations and cancer.Genome Med.1:62.

  • 26

    SmithS. D.KawashJ. K.GrigorievA. (2015). GROM-RD: resolving genomic biases to improve read depth detection of copy number variants.PeerJ3:e836. 10.7717/peerj.836

  • 27

    TalevichE.ShainA. H.BottonT.BastianB. C. (2016). CNVkit: genome-wide copy number detection and visualization from targeted dna sequencing.PLoS Comput. Biol.12:e1004873. 10.1371/journal.pcbi.1004873

  • 28

    TibshiraniR.WangP. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso.Biostatistics91829. 10.1093/biostatistics/kxm013

  • 29

    XiJ.LiA.WangM. (2020a). HetRCNA: a novel method to identify recurrent copy number alternations from heterogeneous tumor samples based on matrix decomposition framework.IEEE/ACM Trans. Comput. Biol. Bioinform.17422434. 10.1109/tcbb.2018.2846599

  • 30

    XiJ.YuanX.WangM.LiA.LiX.HuangQ. (2020b). Inferring subgroup-specific driver genes from heterogeneous cancer samples via subspace learning with subgroup indication.Bioinformatics3618551863.

  • 31

    YangL.WangY.FangM.DengD.ZhangY. (2017). C3orf21 ablation promotes the proliferation of lung adenocarcinoma, and its mutation at the rs2131877 locus may serve as a susceptibility marker.Oncotarget83342233431. 10.18632/oncotarget.16798

  • 32

    YoonS. T.XuanZ. Y.MakarovV.YeK.SebatJ. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage.Genome Res.1915861592. 10.1101/gr.092981.109

  • 33

    YuS. Q.ShenJ. Y.FeiJ.ZhuX. Q.YinM. C.ZhouJ. W. (2020). KNDC1 is a predictive marker of malignant transformation in borderline ovarian tumors.OncoTargets Therapy13709718. 10.2147/ott.s223304

  • 34

    YuZ.LiA.WangM. (2016). CloneCNA: detecting subclonal somatic copy number alterations in heterogeneous tumor samples from whole-exome sequencing data.BMC Bioinform.17:310. 10.1186/s12859-016-1174-7

  • 35

    YuanX.BaiJ.ZhangJ.YangL.DuanJ.LiY.et al (2020). CONDEL: detecting copy number variation and genotyping deletion zygosity from single tumor samples using sequence data.IEEE/ACM Trans. Comput. Biol. Bioinform.1711411153.

  • 36

    YuanX.GaoM.BaiJ.DuanJ. (2018). SVSR: a program to simulate structural variations and generate sequencing reads for multiple platforms.IEEE/ACM Trans. Comput. Biol. Bioinform. Online ahead of print.

  • 37

    YuanX.LiJ.BaiJ.XiJ. (2019a). A local outlier factor-based detection of copy number variations from NGS data.IEEE/ACM Trans. Comput. Biol. Bioinform. Online ahead of print.

  • 38

    YuanX.XuX.ZhaoH.DuanJ. (2019b). ERINS: novel sequence insertion detection by constructing an extended reference.IEEE/ACM Trans. Comput. Biol. Bioinform. Online ahead of print.

  • 39

    YuanX.YuJ.XiJ.YangL.ShangJ.LiZ.et al (2019c). CNV_IFTV: an isolation forest and total variation-based detection of CNVs from short-read sequencing data.IEEE/ACM Trans. Comput. Biol. Bioinform. Online ahead of print.

  • 40

    YuanX.MillerD. J.ZhangJ.HerringtonD.WangY. (2012a). An overview of population genetic data simulation.J. Comput. Biol.194254. 10.1089/cmb.2010.0188

  • 41

    YuanX.YuG.HouX.Shih, IeM.ClarkeR.et al (2012b). Genome-wide identification of significant aberrations in cancer genome.BMC Genomics13:342. 10.1186/1471-2164-13-342

  • 42

    YuanX.ZhangJ.YangL. (2017). IntSIM: an integrated simulator of next-generation sequencing data.IEEE Trans. Biomed. Eng.64441451. 10.1109/tbme.2016.2560939

  • 43

    ZhangB.HouX.YuanX.Shih, IeM.ZhangZ.et al (2014). AISAIC: a software suite for accurate identification of significant aberrations in cancers.Bioinformatics30431433. 10.1093/bioinformatics/btt693

  • 44

    ZhangY.YuZ.BanR.ZhangH.IqbalF.ZhaoA.et al (2015). DeAnnCNV: a tool for online detection and annotation of copy number variations from whole-exome sequencing data.Nucleic Acids Res.43W289W294.

  • 45

    ZhaoM.WangQ. G.WangQ.JiaP. L.ZhaoZ. M. (2013). Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives.BMC Bioinform.14:S1. 10.1186/1471-2105-14-S11-S1

  • 46

    ZhengC. X.GuZ. H.HanB.ZhangR. X.PanC. M.XiangY.et al (2013). Whole-exome sequencing to identify novel somatic mutations in squamous cell lung cancers.Int. J. Oncol.43755764. 10.3892/ijo.2013.1991

  • 47

    ZhouX.LiuJ.WanX.YuW. (2014). Piecewise-constant and low-rank approximation for identification of recurrent copy number variations.Bioinformatics3019431949. 10.1093/bioinformatics/btu131

  • 48

    ZhuQ. Q.WangJ.ZhangQ. J.WangF. X.FangL. H.SongB.et al (2020). Methylation-driven genes PMPCAP1, SOWAHC and ZNF454 as potential prognostic biomarkers in lung squamous cell carcinoma.Mol. Med. Rep.2112851295.

Summary

Keywords

copy number variations, next-generation sequencing data, density peak, null distribution, read depth

Citation

Xie K, Tian Y and Yuan X (2021) A Density Peak-Based Method to Detect Copy Number Variations From Next-Generation Sequencing Data. Front. Genet. 11:632311. doi: 10.3389/fgene.2020.632311

Received

23 November 2020

Accepted

21 December 2020

Published

13 January 2021

Volume

11 - 2020

Edited by

Zhenhua Yu, Ningxia University, China

Reviewed by

Liu Guangming, Xi’an University of Technology, China; Xinhui Yu, China University of Mining and Technology, China

Updates

Copyright

*Correspondence: Xiguo Yuan,

These authors have contributed equally to this work

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics