AUTHOR=Luo Junwei , Chen Ranran , Zhang Xiaohong , Wang Yan , Luo Huimin , Yan Chaokun , Huo Zhanqiang TITLE=LROD: An Overlap Detection Algorithm for Long Reads Based on k-mer Distribution JOURNAL=Frontiers in Genetics VOLUME=11 YEAR=2020 URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2020.00632 DOI=10.3389/fgene.2020.00632 ISSN=1664-8021 ABSTRACT=

Third-generation sequencing technologies can produce large numbers of long reads, which have been widely used in many fields. When using long reads for genome assembly, overlap detection between any pair of long reads is an important step. However, the sequencing error rate of third-generation sequencing technologies is very high, and obtaining accurate overlap detection results is still a challenging task. In this study, we present a long-read overlap detection (LROD) algorithm that can improve the accuracy of overlap detection results. To detect overlaps between two long reads, LROD first retains only the solid common k-mers between them. These k-mers can simplify the process of overlap detection. Second, LROD finds a chain (i.e., candidate overlap) that includes the consistent common k-mers. In this step, LROD proposes a two-stage strategy to evaluate whether two common k-mers are consistent. Finally, LROD uses a novel strategy to determine whether the candidate overlaps are true and to revise them. To verify the performance of LROD, three simulated and three real long-read datasets are used in the experiments. Compared with two other popular methods (MHAP and Minimap2), LROD can achieve good performance in terms of the F1-score, precision and recall. LROD is available from https://github.com/luojunwei/LROD.