METHODS article

Front. Genet., 30 April 2021

Sec. Computational Genomics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.665812

SIns: A Novel Insertion Detection Approach Based on Soft-Clipped Reads

  • 1. School of Computer and Information Engineering, Henan University, Kaifeng, China

  • 2. College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China

Abstract

As a common type of structural variation, an insertion refers to the addition of a DNA sequence into an individual genome and is usually associated with some inherited diseases. In recent years, many methods have been proposed for detecting insertions. However, the accurate calling of insertions remains a challenging task. In this study, we propose a novel insertion detection approach based on soft-clipped reads, called SIns. First, based on the alignments between paired reads and the reference genome, SIns extracts breakpoints from soft-clipped reads and determines insertion locations. The insert size information of paired reads is then clustered to determine the genotype, and SIns subsequently adopts Minia to assemble the insertion sequences. Experimental results show that SIns can achieve better performance than other methods in terms of the F-score on simulated and real datasets.

Introduction

Although single-nucleotide polymorphisms (SNPs) represent the most frequent genomic variation, it is generally acknowledged that human genomes show more differences as a consequence of structural variations (SVs) (Gusnanto et al., 2012). SVs generally refer to genome sequence changes greater than 50 bp and can be further categorized as insertions, deletions, duplications, inversions, and translocations, among others, as well as combinations of these categories (Feuk et al., 2006; Alkan et al., 2011; Baker, 2012). Some studies have shown that phenotypic changes and some diseases are caused by SVs, e.g., autism, Parkinson’s disease, and schizophrenia (Suzuki et al., 2011). Therefore, the accurate detection of SVs is of great significance for gene expression analysis and related disease research (MacConaill and Garraway, 2010). However, until a few years ago, there were no efficient methods for the detection of SVs with high precision. The development of next-generation sequencing (NGS) technology has allowed researchers to obtain a large amount of sequence data, which has improved research on SV detection (The 1000 Genomes Project Consortium, 2010; Zhang et al., 2010; Guan and Sung, 2016; Kosugi et al., 2019).

As one type of SV, an insertion refers to the addition of a DNA sequence to the genome. This sequence might be novel or could exist in the original genome, which would be equivalent to translocation or duplication. In general, insertions can be divided into two types: (i) novel insertions, in which the inserted sequence cannot be found in or mapped to the reference genome, and (ii) mobile element insertions or duplications, in which the inserted sequence comes from the original sequence. The sequence of this second type of insertion can be obtained through a comparison with the reference genome. Based on the identification of discordant patterns in sequence data, some SV detection methods can currently be utilized to detect insertions. In general, these methods can be categorized into the following four classes: (i) paired-end mapping (PEM)-based methods, such as BreakDancer (Chen et al., 2009), PEMer (Korbel et al., 2009) and GASV (Sindi et al., 2009), which are based on the physical position and distance information of paired-end or mate-pair reads (Lee et al., 2009; Hormozdiari et al., 2010); (ii) split read (SR)-based methods, which search for split alignments of unmapped or clipped reads, an example being CREST, which uses clipped reads to identify structural variations through multiple alignments and assembly (Wang et al., 2011); (iii) depth of coverage (DoC)-based methods, such as SegSeq (Chiang et al., 2009), EWT (Yoon et al., 2009) and CNVnator (Abyzov et al., 2011), which provide a macroscopic view of whether there is a high coverage area on the genome; and (iv) de novo assembly-based methods, which use related reads to recover insertion sequences. The latter methods, such as ANISE and BASIL (Holtgrewe et al., 2015), SvABA (Wala et al., 2018), EPGA (Luo et al., 2015b) and EPGA2 (Luo et al., 2015a), require a coverage depth of at least 40X and have a high cost. However, these methods usually focus on abnormal information, such as variations in the insert size and soft-clipped information, and thus cannot yield accurate detection results for insertions with variable sizes.

Some hybrid methods have been proposed for the detection of insertions with variable sizes in recent years. For example, Pindel, a classical method, is mainly designed for deletions and small insertions and uses PEM and SR signatures to locate the breakpoints (Ye et al., 2009). However, for large insertions over 50 bp, Pindel does not perform well and yields many false positive results. MindTheGap uses a k-mer-based method to detect the insertion site and recovers insertion sequences through an assembly of k-mers (Rizk et al., 2014). This method enables the detection of small and large insertions, but it has difficulty locating a breakpoint when other polymorphisms occur near the insertion site, which leads to a high number of false negative results. As an insertion detection approach based on breakpoints, BreakSeek applies a Bayesian model to the PEM and SR signatures to find the accurate position of an insertion (Zhao and Zhao, 2015). BreakSeek can obtain accurate breakpoint results and genotypes without assembly, but the coverage depth of the dataset has some impact on its performance. In addition, although some insertion detection methods, such as PopIns (Kehr et al., 2016) and Pamir (Kavak et al., 2017), perform well, they may require a large amount of data.

In this paper, we propose an insertion detection approach called SIns, which is based on soft-clipped reads and achieves high insertion detection accuracy. SIns adopts PEM to identify and correct the breakpoints from a previous analysis of soft-clipped reads and clusters the insert size to determine the genotype. For sequence assembly, SIns directly extracts all abnormal reads and uses Minia to recover the insertion sequences. We conducted experiments using simulated data and real datasets, and the results show that SIns exhibits high accuracy in breakpoint detection and genotype determination.

The rest of this paper is organized as follows: the Methods section introduces the proposed method in detail, the Experiments and Analysis section presents the experimental results, and the Discussion section summarizes and discusses the findings.

Methods

In this study, we propose a novel approach named SIns for the detection of insertions based on soft-clipped reads. In general, SIns performs the following three steps: (i) breakpoint detection, which determines the location of insertions based on comprehensive information; (ii) genotyping, which identifies the genotype of the insertion based on clustering results; and (iii) assembly of insertion sequences. The overall pipeline of SIns is shown in Figure 1.

FIGURE 1

Breakpoint Detection

Breakpoint detection is an important step in SIns. In this study, the breakpoints can be obtained through the following steps.

Step 1 Selection of Soft-Clipped Reads

For each soft-clipped read, SIns first obtains its clipped part, Sc, and then extracts from the reference genome the sequence Sr that corresponds to Sc. Note that the length of Sr equals that of Sc.

Based on the Smith-Waterman algorithm, a score matrix between Sc and Sr can then be constructed to reflect how well they match. SIns takes the maximum score from the matrix, which corresponds to the length of the longest run of consecutive matches. To identify and screen out real soft-clipped reads, a threshold parameter c is then set to select those reads whose Sc and Sr exhibit higher similarity. This parameter c can be computed using the following equation:

where m represents the mappability (m ∈ [0,1]). If c equals 1, SIns selects the read for the following steps; otherwise, SIns discards it. A larger m requires greater similarity between Sc and Sr. The default value of the parameter m is 0.5.
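A minimal sketch of this selection step follows. It assumes that c is 1 when the maximum score divided by the length of Sc reaches m, and 0 otherwise. That thresholding rule, the match-only scoring (a special case of Smith-Waterman in which any mismatch or gap resets the score to zero, so the maximum is the longest shared run) and the function names are assumptions made for illustration, not details confirmed by the text.

```python
def max_local_score(sc: str, sr: str) -> int:
    """Largest cell of a match-only local score matrix between sc and sr.

    Each match extends the diagonal by 1 and any mismatch resets to 0,
    so the result is the length of the longest run shared by both strings.
    """
    prev = [0] * (len(sr) + 1)
    best = 0
    for a in sc:
        curr = [0] * (len(sr) + 1)
        for j, b in enumerate(sr, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def threshold_c(sc: str, sr: str, m: float = 0.5) -> int:
    """Return 1 to keep the read, 0 to discard it (assumed rule, see lead-in)."""
    if not sc:
        return 0
    return 1 if max_local_score(sc, sr) / len(sc) >= m else 0


if __name__ == "__main__":
    # The clipped part and the reference window share the run "GTAC".
    print(max_local_score("ACGTAC", "TTGTAC"), threshold_c("ACGTAC", "TTGTAC"))
```

With the default m = 0.5, a read is kept under this assumed rule when at least half of its clipped part appears as one uninterrupted run in the reference window.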

Step 2 Determination of Candidate Breakpoints

In our study, the soft-clipped reads were further divided into four types, namely, LL, LR, RL, and RR, which are shown in Figure 2. Taking “LL” as an example, the first “L” means that the left mate read is soft-clipped, and the second “L” specifies that this read is clipped at its head, whereas “RR” indicates that the right mate read is soft-clipped at its tail.

FIGURE 2

A true insertion might be supported by all four types of soft-clipped reads, which provide similar breakpoint information. In general, an insertion breakpoint is strongly supported as true if all four types of soft-clipped reads mentioned above are present. However, it is difficult to find all types of soft-clipped reads for a true insertion, particularly if the DoC is low. In this paper, SIns defines four types of breakpoints, which are represented as {LL, LR}, {LL, RL}, {RL, LR}, and {RL, RR}. For a breakpoint, SIns collects all related soft-clipped reads, which are kept in PSD, and determines their types; SIns then uses the following equation to determine whether a breakpoint is true:

where LL∧LR indicates that the PSD of a breakpoint contains LL and LR, and LL∨RL indicates that it contains LL or RL. Subsequently, SIns obtains a list of breakpoints using the above-described method. However, the method yields some false positive breakpoints, which can be due to a high GC content, sequencing error or SNPs. Therefore, even though their proportion is small, these breakpoints should be checked and filtered.
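The read typing and the four breakpoint types can be illustrated with a short sketch. The acceptance rule used here, that a breakpoint is a candidate when its supporting reads include both members of at least one of the four pairs, is an assumption made for illustration; the function names are hypothetical as well.

```python
# The four breakpoint types named in the text, each a pair of read types.
BREAKPOINT_TYPES = (
    frozenset({"LL", "LR"}),
    frozenset({"LL", "RL"}),
    frozenset({"RL", "LR"}),
    frozenset({"RL", "RR"}),
)


def read_type(mate: str, clipped_end: str) -> str:
    """Two-letter type: first letter is the mate, second the clipped end.

    mate is 'left' or 'right'; clipped_end is 'head' (L) or 'tail' (R).
    """
    first = {"left": "L", "right": "R"}[mate]
    second = {"head": "L", "tail": "R"}[clipped_end]
    return first + second


def supported_types(observed):
    """Breakpoint types whose two read types both occur among the observed reads."""
    seen = set(observed)
    return [sorted(bp) for bp in BREAKPOINT_TYPES if bp <= seen]


def is_candidate(observed) -> bool:
    """Assumed rule: at least one of the four pairs is fully present."""
    return bool(supported_types(observed))


if __name__ == "__main__":
    reads = [read_type("left", "head"), read_type("left", "tail")]
    print(reads, is_candidate(reads))
```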

Step 3 Filtering of the Breakpoints

Through the above-described steps, SIns can obtain candidate breakpoints, which might include some false breakpoints. SIns then uses a filter method based on the insert size to further improve the precision of these breakpoints. An insertion usually causes a series of abnormal reads with an anomalous insert size distribution.

For a candidate breakpoint, SIns first finds the paired reads that span this breakpoint and the one-end-anchored (OEA) reads. Note that these reads should be aligned in the region [p − (μ + 3σ), p + (μ + 3σ)], where p is the position of the breakpoint, μ is the mean insert size of the read library, and σ is the standard deviation of the insert size, as shown in Figure 3. If the total number of paired reads and OEA reads is larger than Cov/2, SIns treats this breakpoint as true; otherwise, the method considers the breakpoint to be false. Cov is the coverage of the read library.
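The filter can be sketched as a window test followed by a count. The representation of each read by a single alignment position and the function names are assumptions for illustration; only the window bounds and the Cov/2 threshold come from the text.

```python
def window(p: int, mu: float, sigma: float):
    """Region [p - (mu + 3*sigma), p + (mu + 3*sigma)] around breakpoint p."""
    half = mu + 3 * sigma
    return p - half, p + half


def passes_filter(p, spanning_pair_positions, oea_positions, mu, sigma, cov) -> bool:
    """True if the supporting reads inside the window outnumber cov / 2.

    Each read is represented by one alignment position (an assumption
    of this sketch); reads outside the window are ignored.
    """
    lo, hi = window(p, mu, sigma)
    positions = list(spanning_pair_positions) + list(oea_positions)
    support = sum(1 for x in positions if lo <= x <= hi)
    return support > cov / 2


if __name__ == "__main__":
    # Library parameters taken from the simulation settings: mu = 500, sigma = 50.
    print(window(10000, 500, 50))
    print(passes_filter(10000, [9400, 9900, 10100], [10600, 12000], 500, 50, cov=6))
```

For a library with μ = 500 and σ = 50, the window spans 650 bp on each side of the breakpoint.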

FIGURE 3

Genotyping

Genotyping is a necessary step of SIns. The genotype of a variation is either heterozygous or homozygous. Taking a diploid genome as an example, a heterozygous variation is present on only one of the two chromosomes, whereas a homozygous variation is present on both chromosomes.

Genotyping facilitates subsequent studies, and many approaches, particularly assembly-based methods, are available for it; however, assembly-based methods usually require considerable time and memory. Here, SIns adopts a cluster-based method to reduce the running time.

If an insertion occurs, it inevitably changes the insert size of paired reads around the breakpoint, producing, for example, OEA reads and a decrease in the normal insert size. For a heterozygous insertion, the insert size is difficult to determine because the paired reads might originate from two different chromosomes: some paired reads contain the insertion, whereas others do not. We define P = (Pl, Pr, i) for a paired read spanning the breakpoint, where Pl is the aligned position of the left mate read, Pr is the aligned position of the right mate read and i is the insert size of this paired read. After obtaining P for all paired reads spanning the breakpoint, SIns applies DBSCAN for clustering. By default, the DBSCAN parameters are eps = 50 and min_samples = 2, and both can be adjusted. SIns determines a breakpoint to be heterozygous if there is one cluster in the clustering result; otherwise, the breakpoint is deemed homozygous. Two types of insert size distributions are shown in Figure 4.
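The clustering step can be sketched with a small, dependency-free DBSCAN over the (Pl, Pr, i) triples. This is an illustrative re-implementation rather than the code used by SIns: the Euclidean distance, the convention that min_samples counts the point itself, and the function names are assumptions. The decision rule is copied from the text as stated.

```python
from math import dist


def dbscan(points, eps=50.0, min_samples=2):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_samples:  # core point: keep expanding
                queue.extend(nb)
    return labels


def genotype(points, eps=50.0, min_samples=2) -> str:
    """Rule as stated in the text: exactly one cluster means heterozygous."""
    labels = dbscan(points, eps, min_samples)
    n_clusters = len({label for label in labels if label != -1})
    return "heterozygous" if n_clusters == 1 else "homozygous"


if __name__ == "__main__":
    pairs = [(1000, 1500, 500), (1010, 1505, 495), (1020, 1530, 510)]
    print(dbscan(pairs), genotype(pairs))
```

Because eps is compared against a distance over all three coordinates, pairs that align close together but differ strongly in insert size fall into separate clusters.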

FIGURE 4

Assembly Insertion Sequences

In the assembly stage, SIns extracts the OEA, soft-clipped and unmapped reads for a breakpoint to recover all possible insertion sequences. After applying the Minia (Boeva et al., 2012) algorithm to these abnormal reads, SIns generates a series of overlapping sequences, which contain the insertion sequences. SIns then maps these sequences to the reference genome and obtains the insertion sequence results. For example, if the CIGAR value of a candidate sequence is 132M186I130M, the algorithm finds that the length of this insertion is 186 bp and that the inserted sequence corresponds to bases 133–318 of the candidate sequence.
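The CIGAR example can be checked with a short parser. This is a sketch with a hypothetical function name; it relies only on the SAM convention that the operations M, I, S, = and X consume bases of the query sequence, and it reports 1-based inclusive coordinates on that sequence.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the query


def insertions_from_cigar(cigar: str):
    """Return (start, end, length) on the query for every I operation."""
    pos = 0  # number of query bases consumed so far
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "I":
            found.append((pos + 1, pos + n, n))
        if op in QUERY_CONSUMING:
            pos += n
    return found


if __name__ == "__main__":
    for start, end, length in insertions_from_cigar("132M186I130M"):
        print(f"insertion of {length} bp at bases {start}-{end}")
```

For 132M186I130M, the 132 matched bases come first, so the inserted bases are 133 to 318, which agrees with the example in the text.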

Experiments and Analysis

To verify the performance of SIns, we used SURVIVOR (Jeffares et al., 2017) and ART (Huang et al., 2012) to simulate a large number of insertions on human chromosome 22 ranging in size from 50 to 1,500 bp and in coverage from 5X to 50X. The recent popular detection methods MindTheGap and BreakSeek were compared with the proposed SIns method. In addition, the real human dataset NA12878 was selected to test the performance of SIns.

Experimental Settings

Simulation Datasets and Parameter Setting

The simulation dataset was based on human chromosome 22, and the error rate of the dataset was set to 0.1%. SURVIVOR was used to simulate the structural variation. Here, we selected insertions for the simulation, and the numbers of other types of structural variations were set to 0. ART was used to simulate different read sets from the simulated chromosome 22 containing insertions. We first generated simulated copies of chromosome 22 containing insertions of different sizes, namely, 50–300 bp, 301–600 bp, 601–1,000 bp, and 1,001–1,500 bp, and ART was then used to simulate read sets with different coverages, i.e., 5X, 10X, 20X, 30X, 40X, and 50X. The read length was uniformly set to 150 bp, the insert size was 500 bp, and the standard deviation was 50. With these parameters, we can assess the detection ability of SIns under various conditions.

Evaluation Metrics

If the difference between the detected breakpoint and the simulated breakpoint does not exceed 10 bp, we consider it a true positive (TP); otherwise, the detected breakpoint is counted as a false positive (FP). True breakpoints that were not detected are counted as false negatives (FN). To clearly show the detection performance of the various methods, we used the metrics precision (Pr), recall (Rc) and F-score, defined as follows:

$$\mathrm{Pr} = \frac{TP}{TP + FP}, \qquad \mathrm{Rc} = \frac{TP}{TP + FN}$$

The F-score was defined as the harmonic mean of precision and recall:

$$F\text{-score} = \frac{2 \times \mathrm{Pr} \times \mathrm{Rc}}{\mathrm{Pr} + \mathrm{Rc}}$$
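The evaluation can be sketched as follows, using the standard definitions Pr = TP/(TP + FP), Rc = TP/(TP + FN) and F-score = 2·Pr·Rc/(Pr + Rc). The greedy one-to-one matching of detected to simulated breakpoints and the function names are assumptions of this sketch. The counts in the usage example (TP = 922, FP = 2, FN = 129) are inferred from the percentages reported for SIns at 5X on the 1,051 insertions of 50–300 bp; they are not stated in the text.

```python
def match_breakpoints(detected, truth, tol=10):
    """Count TP, FP, FN with each true breakpoint matched at most once."""
    unmatched = sorted(truth)
    tp = 0
    for d in sorted(detected):
        hit = next((t for t in unmatched if abs(t - d) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(detected) - tp
    fn = len(unmatched)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean, as fractions in [0, 1]."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f


if __name__ == "__main__":
    print(match_breakpoints([100, 205, 900], [95, 200, 500]))
    pr, rc, f = precision_recall_f(922, 2, 129)
    print(round(100 * pr, 3), round(100 * rc, 3), round(100 * f, 3))
```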

Simulation Dataset

Results on Homozygous Dataset

We compared SIns with MindTheGap and BreakSeek, selected chromosome 22 as the reference and simulated a chromosome containing 1,051 insertions of 50–300 bp, a chromosome containing 597 insertions of 301–600 bp, a chromosome containing 597 insertions of 601–1,000 bp and a chromosome containing 790 insertions of 1,001–1,500 bp. Based on different coverages, we simulated six read sets for each simulated chromosome. The experimental results are shown in Table 1.

TABLE 1

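| DoC | Tool | 50–300 Pr | 50–300 Rc | 50–300 F-score | 301–600 Pr | 301–600 Rc | 301–600 F-score | 601–1,000 Pr | 601–1,000 Rc | 601–1,000 F-score | 1,001–1,500 Pr | 1,001–1,500 Rc | 1,001–1,500 F-score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5X | SIns | 99.784 | 87.726 | 93.367 | 100 | 64.992 | 78.782 | 100 | 61.977 | 76.525 | 100 | 63.924 | 77.992 |
| 5X | BreakSeek | 99.791 | 45.48 | 62.484 | 100 | 14.405 | 25.183 | 98.592 | 11.725 | 20.958 | 100 | 11.899 | 21.267 |
| 5X | MindTheGap | 11.949 | 26.546 | 16.48 | 2.317 | 27.471 | 4.274 | 3.104 | 26.801 | 5.563 | 4.551 | 29.494 | 7.885 |
| 10X | SIns | 99.412 | 96.48 | 97.924 | 99.815 | 90.62 | 94.996 | 100 | 89.615 | 94.523 | 100 | 90.127 | 94.807 |
| 10X | BreakSeek | 99.892 | 87.631 | 93.36 | 100 | 61.809 | 76.398 | 99.701 | 55.946 | 71.674 | 99.774 | 55.823 | 71.591 |
| 10X | MindTheGap | 30.356 | 64.986 | 41.381 | 20.918 | 65.662 | 31.728 | 21.315 | 67.337 | 32.38 | 25.962 | 67.468 | 37.496 |
| 20X | SIns | 99.037 | 97.812 | 98.42 | 99.65 | 95.477 | 97.519 | 100 | 93.802 | 96.802 | 99.868 | 95.57 | 97.671 |
| 20X | BreakSeek | 99.603 | 95.433 | 97.473 | 99.27 | 91.122 | 95.022 | 99.259 | 89.782 | 94.283 | 99.447 | 91.013 | 95.043 |
| 20X | MindTheGap | 85.845 | 80.209 | 82.932 | 75.955 | 79.899 | 77.878 | 73.242 | 80.235 | 76.579 | 79.597 | 80 | 79.798 |
| 30X | SIns | 98.848 | 98.002 | 98.423 | 99.308 | 96.147 | 97.702 | 100 | 94.807 | 97.334 | 99.867 | 95.316 | 97.539 |
| 30X | BreakSeek | 99.509 | 96.384 | 97.922 | 99.298 | 94.807 | 97.001 | 99.284 | 92.965 | 96.021 | 99.459 | 93.165 | 96.209 |
| 30X | MindTheGap | 86.829 | 81.541 | 84.102 | 77.564 | 81.072 | 79.279 | 75.425 | 81.742 | 78.457 | 80.73 | 81.139 | 80.934 |
| 40X | SIns | 98.102 | 98.382 | 98.242 | 100 | 96.482 | 98.21 | 99.825 | 95.477 | 97.603 | 99.868 | 95.949 | 97.87 |
| 40X | BreakSeek | 99.708 | 97.431 | 98.556 | 99.123 | 94.64 | 96.829 | 99.295 | 94.305 | 96.735 | 99.597 | 93.797 | 96.61 |
| 40X | MindTheGap | 86.917 | 81.541 | 84.143 | 77.404 | 80.905 | 79.115 | 75.889 | 82.245 | 78.939 | 80.832 | 81.139 | 80.985 |
| 50X | SIns | 98.57 | 98.382 | 98.476 | 98.969 | 96.482 | 97.71 | 100 | 95.477 | 97.686 | 99.869 | 96.203 | 98.001 |
| 50X | BreakSeek | 99.708 | 97.431 | 98.556 | 98.614 | 95.31 | 96.934 | 99.118 | 94.137 | 96.564 | 99.338 | 94.937 | 97.087 |
| 50X | MindTheGap | 87.018 | 81.637 | 84.242 | 77.28 | 80.905 | 79.051 | 75.153 | 82.077 | 78.463 | 80.881 | 81.392 | 81.136 |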

Comparison of three tools for four ranges.

The bold values represent the highest value of each data set in different depth.

As shown in Table 1, SIns and BreakSeek performed better than MindTheGap in detecting insertions of 50–300 bp. Although the precision of BreakSeek was generally higher than that of SIns in this range, its F-score was better than that of SIns only when the coverage of the read set was 40X or 50X. SIns also had the highest recall at every coverage and insertion length, which means that it can detect more true insertions, while its precision remained above 98%. None of the methods worked well at low DoC; however, for the cases with low coverage (DoC ≤ 10X), SIns showed better performance than the other methods.

Results on Heterozygous Dataset

To verify the performance of SIns in detecting heterozygous insertions, we simulated read sets of chromosome 22. Simulations of chromosome 22 containing insertions of 50–300 bp were used to produce one group of read sets, and simulations of chromosome 22 containing insertions of 301–600 bp were used to generate another group. We then combined the read sets from the normal chromosome 22 with those from the simulated chromosomes. Note that the read sets were simulated with different coverages: 10X, 20X, 40X, 60X, and 80X. The experimental results are shown in Table 2.

TABLE 2

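| DoC | Tool | 50–300 Pr | 50–300 Rc | 50–300 F-score | 301–600 Pr | 301–600 Rc | 301–600 F-score |
|---|---|---|---|---|---|---|---|
| 10X | SIns | 100 | 92.959 | 96.351 | 100 | 89.782 | 94.616 |
| 10X | BreakSeek | 100 | 33.111 | 49.75 | 100 | 21.441 | 35.31 |
| 10X | MindTheGap | 11.275 | 21.789 | 14.86 | 5.211 | 22.111 | 8.435 |
| 20X | SIns | 99.903 | 97.907 | 98.895 | 100 | 96.985 | 98.469 |
| 20X | BreakSeek | 99.707 | 64.7 | 78.477 | 100 | 48.576 | 65.389 |
| 20X | MindTheGap | 88.596 | 57.659 | 69.856 | 79.669 | 56.449 | 66.078 |
| 40X | SIns | 99.807 | 98.573 | 99.186 | 100 | 97.99 | 98.985 |
| 40X | BreakSeek | 98.847 | 65.271 | 78.625 | 98.805 | 41.541 | 58.491 |
| 40X | MindTheGap | 98.609 | 67.46 | 80.113 | 97.387 | 68.677 | 80.55 |
| 60X | SIns | 99.425 | 98.763 | 99.093 | 100 | 97.99 | 98.985 |
| 60X | BreakSeek | 98.389 | 63.939 | 77.509 | 98.214 | 46.064 | 62.714 |
| 60X | MindTheGap | 99.349 | 72.598 | 83.892 | 98.42 | 73.032 | 83.846 |
| 80X | SIns | 99.616 | 98.858 | 99.236 | 100 | 97.99 | 98.985 |
| 80X | BreakSeek | 98.503 | 62.607 | 76.556 | 98.264 | 47.404 | 63.955 |
| 80X | MindTheGap | 98.84 | 72.978 | 83.963 | 98.42 | 73.032 | 83.846 |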

Result of 50–300 and 301–600 bp heterozygous insertions.

The bold values represent the highest value of each data set in different depth.

As illustrated in Table 2, the results obtained with MindTheGap were worse than those obtained on the homozygous datasets because MindTheGap has more sequences to choose from when selecting k-mers, which introduces conflicts. The performance of BreakSeek on these two datasets was also not as good as on the homozygous datasets; a possible reason is that the added normal reads from the reference genome introduced much contradictory PEM and SR information. When BreakSeek iteratively analyses the PEM signature, too much contradictory information is available, and thus the result cannot reflect the true SV information. In contrast, when SIns extracts breakpoint information at the initial stage, it relies more on SR information and thus experiences less interference from contradictory information. At the subsequent filtering stage, owing to the addition of normal reads, the filtering conditions were more rigorous and precise, which explains why the precision of SIns increased, whereas the recall decreased.

Experiments Based on Real Dataset

NA12878 is a gold standard dataset commonly used in genomics. Experiments with the NA12878 sample (ERR194147, 50X) were conducted using the SIns, MindTheGap and BreakSeek methods. Because the coverage was too high, we subsampled the reads with a probability of 0.1. The generally recognized VCF file of this sample contains 50,016 insertion records larger than 50 bp; the file can be downloaded from NCBI. Only detected insertions that match records in this file were counted as true. The test results are shown in Table 3.

TABLE 3

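| Chromosome | SIns | MindTheGap | BreakSeek |
|---|---|---|---|
| chr1 | 123 | 98 | 90 |
| chr2 | 180 | 136 | 74 |
| chr3 | 107 | 57 | 38 |
| chr4 | 105 | 87 | 37 |
| chr5 | 94 | 68 | 44 |
| chr6 | 117 | 84 | 43 |
| chr7 | 134 | 91 | 44 |
| chr8 | 73 | 72 | 43 |
| chr9 | 77 | 69 | 48 |
| chr10 | 101 | 62 | 42 |
| chr11 | 88 | 46 | 41 |
| chr12 | 99 | 65 | 46 |
| chr13 | 66 | 27 | 36 |
| chr14 | 51 | 29 | 28 |
| chr15 | 42 | 44 | 29 |
| chr16 | 88 | 63 | 69 |
| chr17 | 67 | 46 | 29 |
| chr18 | 72 | 42 | 27 |
| chr19 | 67 | 46 | 23 |
| chr20 | 38 | 50 | 25 |
| chr21 | 57 | 16 | 24 |
| chr22 | 28 | 27 | 21 |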

Results obtained with NA12878.

We filtered out the SNPs and indels of this dataset. The above results show that SIns performs well on most chromosomes compared with MindTheGap and BreakSeek. Although the numbers of insertions detected on chromosomes 15 and 20 are lower than those of MindTheGap, the results on the remaining chromosomes are better than those of the other two methods. The average F-score over all 22 chromosomes is 5.46% for SIns, 2.42% for MindTheGap and 2.85% for BreakSeek, which supports the same conclusion.

Running Time Comparison

Here we list the running times for the homozygous and heterozygous experiments.

Although clustering is useful in the SIns process, it does not require as many iterations as in BreakSeek, MindTheGap and other methods; thus, SIns exhibits a relatively obvious advantage in terms of running time. All the methods were run on the same machine with a single thread by default, and the results are shown in Tables 4 and 5. SIns was faster than the other two methods in most cases. The main time-consuming step of SIns is the third step, assembly: the reads used for assembly must be extracted from the original read collection, which is the most work-intensive operation. If assembly is not required and the aim is only to detect breakpoints and determine genotypes, SIns can complete the task within a short time.

TABLE 4

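| DoC | 50–300 MindTheGap | 50–300 BreakSeek | 50–300 SIns | 301–600 MindTheGap | 301–600 BreakSeek | 301–600 SIns | 601–1,000 MindTheGap | 601–1,000 BreakSeek | 601–1,000 SIns | 1,001–1,500 MindTheGap | 1,001–1,500 BreakSeek | 1,001–1,500 SIns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5X | 176 s | 1868 s | 20 s | 174 s | 1842 s | 35 s | 178 s | 2130 s | 29 s | 177 s | 2127 s | 31 s |
| 10X | 217 s | 1868 s | 40 s | 216 s | 2250 s | 68 s | 227 s | 2156 s | 65 s | 212 s | 2089 s | 61 s |
| 20X | 243 s | 2177 s | 77 s | 242 s | 2178 s | 119 s | 242 s | 2054 s | 142 s | 235 s | 4349 s | 123 s |
| 30X | 264 s | 2249 s | 116 s | 264 s | 2109 s | 180 s | 257 s | 3723 s | 191 s | 203 s | 5281 s | 184 s |
| 40X | 284 s | 2415 s | 154 s | 286 s | 2589 s | 250 s | 292 s | 4948 s | 240 s | 204 s | 2736 s | 245 s |
| 50X | 304 s | 2577 s | 193 s | 310 s | 2943 s | 343 s | 310 s | 3207 s | 319 s | 211 s | 2539 s | 307 s |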

Homozygote results obtained with four ranges.

TABLE 5

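| DoC | 50–300 MindTheGap | 50–300 BreakSeek | 50–300 SIns | 301–600 MindTheGap | 301–600 BreakSeek | 301–600 SIns |
|---|---|---|---|---|---|---|
| 10X | 140 s | 1997 s | 38 s | 139 s | 2020 s | 19 s |
| 20X | 152 s | 2041 s | 76 s | 154 s | 1990 s | 47 s |
| 40X | 171 s | 2224 s | 150 s | 180 s | 2495 s | 84 s |
| 60X | 190 s | 2779 s | 227 s | 193 s | 2869 s | 122 s |
| 80X | 212 s | 2703 s | 305 s | 215 s | 3294 s | 204 s |
| 100X | 227 s | 3634 s | 425 s | 254 s | 3719 s | 259 s |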

Heterozygous results obtained with two ranges.

Discussion

In this article, we propose an insertion detection method named SIns based on the comprehensive processing of soft-clipped read information. SIns can provide more precise detection of breakpoints and can perform relatively accurate genotyping. In addition, SIns uses the Minia algorithm for assembly of the insertion sequence, and the successfully assembled sequence is then filtered and tailored according to the breakpoint information. After these steps, the complete insertion sequence is provided.

Most of the existing methods are effective in detecting small insertions but perform poorly in cases of low coverage. It is usually difficult for these methods to detect SVs of all types and sizes. SIns focuses on the detection of insertions of different sizes. We tested the detection performance of SIns using various simulated datasets and compared it with MindTheGap and BreakSeek. In most cases, the performance of SIns was better than that of the other two methods. Compared with the other two methods, SIns performs well on both low and high coverage datasets and on insertions of different sizes. The experimental results using a real dataset show that SIns exhibits good detection capability.

Statements

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: http://www.ebi.ac.uk/ena.

Author contributions

CY and JL conceived and designed the approach. JH performed the experiments. JW and GZ analyzed the data. JH and JL wrote the manuscript. JL and HL supervised the whole study process and revised the manuscript. All authors have read and approved the final version of manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61972134, 61802113, and 61802114), the Science and Technology Development Plan Project of Henan Province (Nos. 202102210173 and 212102210091), the China Postdoctoral Science Foundation (No. 2020M672212), and the Henan Province Postdoctoral Research Project Funding.

Acknowledgments

This paper is recommended by the 5th Computational Bioinformatics Conference.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1. Abyzov, A., Urban, A. E., Snyder, M., and Gerstein, M. (2011). CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984. doi: 10.1101/gr.114876.110

  • 2. Alkan, C., Coe, B. P., and Eichler, E. E. (2011). Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376. doi: 10.1038/nrg2958

  • 3. Baker, M. (2012). Structural variation: the genome’s hidden architecture. Nat. Methods 9, 133–137. doi: 10.1038/nmeth.1858

  • 4. Boeva, V., Popova, T., Bleakley, K., Chiche, P., Cappo, J., Schleiermacher, G., et al. (2012). Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics 28, 423–425. doi: 10.1093/bioinformatics/btr670

  • 5. Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., et al. (2009). BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681. doi: 10.1038/nmeth.1363

  • 6. Chiang, D. Y., Getz, G., Jaffe, D. B., O’kelly, M. J., Zhao, X., Carter, S. L., et al. (2009). High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods 6, 99–103. doi: 10.1038/nmeth.1276

  • 7. Feuk, L., Carson, A. R., and Scherer, S. W. (2006). Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97.

  • 8. Guan, P., and Sung, W.-K. (2016). Structural variation detection using next-generation sequencing data: a comparative technical review. Methods 102, 36–49. doi: 10.1016/j.ymeth.2016.01.020

  • 9. Gusnanto, A., Wood, H. M., Pawitan, Y., Rabbitts, P., and Berri, S. (2012). Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics 28, 40–47. doi: 10.1093/bioinformatics/btr593

  • 10. Holtgrewe, M., Kuchenbecker, L., and Reinert, K. (2015). Methods for the detection and assembly of novel sequence in high-throughput sequencing data. Bioinformatics 31, 1904–1912. doi: 10.1093/bioinformatics/btv051

  • 11. Hormozdiari, F., Hajirasouliha, I., Dao, P., Hach, F., Yorukoglu, D., Alkan, C., et al. (2010). Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 26, i350–i357.

  • 12. Huang, W., Li, L., Myers, J. R., and Marth, G. T. (2012). ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594. doi: 10.1093/bioinformatics/btr708

  • 13. Jeffares, D. C., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., et al. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8:14061.

  • 14. Kavak, P., Lin, Y.-Y., Numanagić, I., Asghari, H., Güngör, T., Alkan, C., et al. (2017). Discovery and genotyping of novel sequence insertions in many sequenced individuals. Bioinformatics 33, i161–i169.

  • 15. Kehr, B., Melsted, P., and Halldórsson, B. V. (2016). PopIns: population-scale detection of novel sequence insertions. Bioinformatics 32, 961–967. doi: 10.1093/bioinformatics/btv273

  • 16. Korbel, J. O., Abyzov, A., Mu, X. J., Carriero, N., Cayting, P., Zhang, Z., et al. (2009). PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol. 10:R23.

  • 17. Kosugi, S., Momozawa, Y., Liu, X., Terao, C., Kubo, M., and Kamatani, Y. (2019). Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20:117.

  • 18. Lee, S., Hormozdiari, F., Alkan, C., and Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat. Methods 6, 473–474. doi: 10.1038/nmeth.f.256

  • 19. Luo, J., Wang, J., Li, W., Zhang, Z., Wu, F.-X., Li, M., et al. (2015a). EPGA2: memory-efficient de novo assembler. Bioinformatics 31, 3988–3990.

  • 20. Luo, J., Wang, J., Zhang, Z., Wu, F.-X., Li, M., and Pan, Y. (2015b). EPGA: de novo assembly using the distributions of reads and insert size. Bioinformatics 31, 825–833. doi: 10.1093/bioinformatics/btu762

  • 21. MacConaill, L. E., and Garraway, L. A. (2010). Clinical implications of the cancer genome. J. Clin. Oncol. 28:5219. doi: 10.1200/jco.2009.27.4944

  • 22. Rizk, G., Gouin, A., Chikhi, R., and Lemaitre, C. (2014). MindTheGap: integrated detection and assembly of short and long insertions. Bioinformatics 30, 3451–3457. doi: 10.1093/bioinformatics/btu545

  • 23. Sindi, S., Helman, E., Bashir, A., and Raphael, B. J. (2009). A geometric approach for classification and comparison of structural variants. Bioinformatics 25, i222–i230.

  • 24. Suzuki, S., Yasuda, T., Shiraishi, Y., Miyano, S., and Nagasaki, M. (2011). ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC Bioinformatics 12(Suppl 14):S7. doi: 10.1186/1471-2105-12-S14-S7

  • 25. The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature 467:1061. doi: 10.1038/nature09534

  • 26. Wala, J. A., Bandopadhayay, P., Greenwald, N. F., O’Rourke, R., Sharpe, T., Stewart, C., et al. (2018). SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591. doi: 10.1101/gr.221028.117

  • 27. Wang, J., Mullighan, C. G., Easton, J., Roberts, S., Heatley, S. L., Ma, J., et al. (2011). CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat. Methods 8, 652–654. doi: 10.1038/nmeth.1628

  • 28. Ye, K., Schulz, M. H., Long, Q., Apweiler, R., and Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871. doi: 10.1093/bioinformatics/btp394

  • 29. Yoon, S., Xuan, Z., Makarov, V., Ye, K., and Sebat, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 19, 1586–1592. doi: 10.1101/gr.092981.109

  • 30. Zhang, Q., Ding, L., Larson, D. E., Koboldt, D. C., McLellan, M. D., Chen, K., et al. (2010). CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics 26, 464–469. doi: 10.1093/bioinformatics/btp708

  • 31. Zhao, H., and Zhao, F. (2015). BreakSeek: a breakpoint-based algorithm for full spectral range INDEL detection. Nucleic Acids Res. 43, 6701–6713. doi: 10.1093/nar/gkv605

Keywords

structural variation, alignment, short read, the next generation sequencing technology, soft-clipped read

Citation

Yan C, He J, Luo J, Wang J, Zhang G and Luo H (2021) SIns: A Novel Insertion Detection Approach Based on Soft-Clipped Reads. Front. Genet. 12:665812. doi: 10.3389/fgene.2021.665812

Received

09 February 2021

Accepted

06 April 2021

Published

30 April 2021

Volume

12 - 2021

Edited by

Wang Guohua, Harbin Institute of Technology, China

Reviewed by

Hailin Chen, East China Jiaotong University, China; Minzhu Xie, Hunan Normal University, China

*Correspondence: Junwei Luo, Huimin Luo,

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
