Hi all! I am running a benchmark of compressors on sequence databases. I thought I'd start a thread to discuss it, respond to comments from the "Data Compression Tweets" thread, and gather expert feedback to hopefully improve it.
Benchmark data: http://kirr.dyndns.org/sequence-compression-benchmark/
This is a work in progress, and I will continue improving it when I can. Suggestions are welcome!
Disclaimers/Disclosures/Limitations:
1) As in any benchmark, the results are specific to the particular hardware, test data and methodology. I used a reasonably standard workstation, and the test data consists of commonly used sequence databases. Thus the results should be reasonably informative, but not necessarily 100% transferable to other machines or data.
2) I benchmark a very specific task: lossless compression (and decompression), without a reference, of FASTA files containing DNA, RNA and protein sequences. This is the data I often work with, so I'm familiar with what is used and how it is used.
3) My own compressor is included in the benchmark. I need to know how it compares to other compressors, and when I make an improvement, I need to measure it. My compressor receives no special treatment in the benchmark.
4) The benchmark takes a lot of time. If anything is missing, it's possible that I just haven't had time to add it yet.
5) The interface is a mess. I'll need to organize it in a much better way.