TY - JOUR
T1 - Performance comparison of sequential and parallel compression applications for DNA raw data
AU - Guerra, Aníbal
AU - Lotero, Jaime
AU - Isaza, Sebastián
N1 - Funding Information:
We thank Felipe Cabarcas and Juan Fernando Alzate from the Centro Nacional de Secuenciación Genómica at the University of Antioquia for giving us access to their computing cluster and test data, and for their help in clarifying many bioinformatics-related issues. This work was supported by the University of Antioquia under project code PRV15-2-02.
Publisher Copyright:
© 2016, Springer Science+Business Media New York.
PY - 2016/12/1
Y1 - 2016/12/1
N2 - We present an experimental performance comparison of lossless compression programs for raw DNA data in FASTQ files. General-purpose compressors (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that domain-specific tools improved compression ratios by up to 70% relative to general-purpose tools, while running up to 7× faster during compression and up to 3× faster during decompression. Parallelism scaled performance by up to 13× when using 20 threads. Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements. Nevertheless, end users must consider the features of the available hardware and prioritize their optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to select the best application for each particular scenario.
UR - http://www.scopus.com/inward/record.url?scp=84973603518&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84973603518&partnerID=8YFLogxK
U2 - 10.1007/s11227-016-1753-4
DO - 10.1007/s11227-016-1753-4
M3 - Research Article
AN - SCOPUS:84973603518
SN - 0920-8542
VL - 72
SP - 4696
EP - 4717
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 12
ER -