Performance comparison of sequential and parallel compression applications for DNA raw data

Aníbal Guerra, Jaime Lotero, Sebastián Isaza

Research output: Contribution to journal › Research Article › peer-review

8 Scopus citations

Abstract

We present an experimental performance comparison of lossless compression programs for raw DNA data stored in FASTQ files. General-purpose compressors (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that, relative to general-purpose tools, domain-specific tools increased compression ratios by up to 70% while reducing runtime by up to 7× during compression and up to 3× during decompression. Parallelism scaled performance by up to 13× when using 20 threads. Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements. Nevertheless, end users must consider the features of the available hardware and prioritize their optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to select the best application for each particular scenario.
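The metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' benchmark harness: it uses Python's standard-library `zlib`, `bz2` and `lzma` codecs as single-threaded stand-ins for the algorithm families behind PIGZ, PBZIP2 and P7ZIP, and it runs on synthetic FASTQ-like data. The function names (`make_fastq`, `compression_ratio`, `speedup`, `benchmark`) are hypothetical, and the definition of compression ratio as original size divided by compressed size is an assumption, since the abstract does not state the formula used.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build a synthetic FASTQ-like byte string (illustrative data only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHIJ") for _ in range(read_len))
        # Four lines per record: header, bases, separator, quality scores.
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def compression_ratio(original, compressed):
    """Assumed definition: original size / compressed size (higher is better)."""
    return len(original) / len(compressed)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def benchmark(data):
    """Time compression and decompression for each stand-in codec."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_seconds = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        decompress_seconds = time.perf_counter() - start

        # Lossless compression must reproduce the input exactly.
        if restored != data:
            raise ValueError(f"{name} round trip did not restore the input")

        results[name] = {
            "ratio": compression_ratio(data, packed),
            "compress_seconds": compress_seconds,
            "decompress_seconds": decompress_seconds,
        }
    return results


if __name__ == "__main__":
    for name, row in benchmark(make_fastq()).items():
        print(name, row)
```

Under these definitions, a tool that compresses in 2 s where a baseline needs 14 s gives `speedup(14.0, 2.0) == 7.0`, which is how a figure such as "7×" would be read. Timings from this sketch depend on the machine and say nothing about the tools evaluated in the paper.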

Original language: English (US)
Pages (from-to): 4696-4717
Number of pages: 22
Journal: Journal of Supercomputing
Volume: 72
Issue number: 12
State: Published - Dec 1 2016
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture
