AUTHOR=Ding Youde , Liao Yuan , He Ji , Ma Jianfeng , Wei Xu , Liu Xuemei , Zhang Guiying , Wang Jing TITLE=Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity JOURNAL=Frontiers in Genetics VOLUME=Volume 14 - 2023 YEAR=2023 URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1213907 DOI=10.3389/fgene.2023.1213907 ISSN=1664-8021 ABSTRACT=Background: With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. Achieving fast lossless compression and decompression that exploits the characteristics of the data, and thereby speeds up data transmission and processing, requires research on suitable compression algorithms. Methods: This paper proposes a compression algorithm for sparse asymmetric gene mutations (CA_SAGM), based on the characteristics of sparse genomic mutation data. The data is first sorted on a row-first basis so that neighboring non-zero elements are as close as possible to each other. The data is then renumbered using the reverse Cuthill-McKee sorting technique. Finally, the data is compressed into compressed sparse row (CSR) format and stored. We analyzed and compared the results of the CA_SAGM, coordinate format (COO) and compressed sparse column format (CSC) algorithms on sparse asymmetric genomic data. Nine types of single-nucleotide variation (SNV) data and six types of copy number variation (CNV) data from the TCGA database were used as the subjects of this study. Compression and decompression time, compression and decompression rate, compression memory and compression ratio were used as evaluation metrics. The correlation between each metric and the basic characteristics of the original data was further investigated.
Results: The experimental results show that the COO method has the shortest compression time, the fastest compression rate and the largest compression ratio, giving it the best compression performance. CSC has the worst compression performance, and CA_SAGM falls between the two. In decompression, CA_SAGM performed best, with the shortest decompression time and the fastest decompression rate, while COO performed worst. With increasing sparsity, the COO, CSC and CA_SAGM algorithms all exhibit longer compression and decompression times, lower compression and decompression rates, larger compression memory and lower compression ratios. When the sparsity is large, the three algorithms show no difference in compression memory or compression ratio, but the remaining metrics still differ. Conclusion: CA_SAGM is a fast and efficient compression algorithm that combines compression and decompression performance for sparse genomic mutation data.
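
To make the storage formats named in the abstract concrete, the following is a minimal pure-Python sketch of COO and CSR encoding with a lossless round trip. It is an illustration under stated assumptions, not the authors' CA_SAGM implementation: the function names (`to_coo`, `to_csr`, `from_csr`) and the toy matrix are hypothetical, and the row-first sorting and reverse Cuthill-McKee renumbering steps are omitted.

```python
from typing import List, Tuple

Matrix = List[List[int]]
# CSR triple plus the column count needed to rebuild the dense matrix.
CSR = Tuple[List[int], List[int], List[int], int]


def to_coo(dense: Matrix) -> List[Tuple[int, int, int]]:
    """Coordinate format: one (row, column, value) triple per non-zero."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row) if v != 0]


def to_csr(dense: Matrix) -> CSR:
    """Compressed sparse row: values, column indices, and row pointers."""
    data, indices, indptr = [], [], [0]
    n_cols = len(dense[0]) if dense else 0
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        # The running non-zero count marks where this row ends.
        indptr.append(len(data))
    return data, indices, indptr, n_cols


def from_csr(csr: CSR) -> Matrix:
    """Rebuild the dense matrix from its CSR representation."""
    data, indices, indptr, n_cols = csr
    dense = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense


# Hypothetical toy matrix (samples x genes); non-zero entries mark mutations.
mutations = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 1],
]

csr = to_csr(mutations)
assert from_csr(csr) == mutations  # the round trip is lossless
```

In this sketch CSR stores one row pointer per row instead of a row index per non-zero, so decoding a row is a direct slice of `data` and `indices`, whereas COO is simply a list of triples.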