
Thread: Comparison of compressors for molecular sequence databases

  1. #1
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post

    Comparison of compressors for molecular sequence databases

    Hi all! I am running a benchmark of compressors on sequence databases. I thought I'd start a thread to discuss it, respond to comments from the "Data Compression Tweets" thread, and gather expert feedback to hopefully improve it.

    Benchmark data: http://kirr.dyndns.org/sequence-compression-benchmark/

    This is a work in progress, and I will continue improving it when I can. Suggestions are welcome!

    Disclaimers/Disclosures/Limitations:

1) As in any benchmark, the results are specific to the particular hardware, test data and methodology. I used a reasonably standard workstation, and the test data consists of commonly used sequence databases. The results should therefore be reasonably informative, but not necessarily 100% transferable to other machines or data.

2) I benchmark a very specific task: reference-free lossless compression (and decompression) of FASTA files containing DNA, RNA and protein sequences. This is the data I often work with, so I'm familiar with what is used and how it is used.

3) My own compressor is included in the benchmark. I need to know how it compares to other compressors, and when I make an improvement, I need to measure it. My compressor receives no special treatment in the benchmark.

4) The benchmark takes a lot of time. If something is missing, it may simply be that I haven't had time to add it yet.

    5) The interface is a mess. I'll need to organize it in a much better way.

  2. #2
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post
    Now, on to replies to comments from another thread.

    Quote Originally Posted by Jarek View Post
    This comparison clearly misses CRAM ( https://en.wikipedia.org/wiki/CRAM_(file_format) ), its winning NAF is from the same author, and seems to be preprocessing+zstd: http://kirill-kryukov.com/study/naf/
CRAM uses a reference sequence. This is very interesting, but outside the scope of my benchmark.

    Quote Originally Posted by Jyrki Alakuijala View Post
They failed to benchmark in a useful way. Brotli is run with a 4 MB window whereas zstd is run with 8 to 128 MB.

If they use the same window length, brotli will compress 5% better than zstd on an average corpus.
Thanks to your helpful comment and a fair bit of googling, I finally managed to find out about the existence of brotli's "--large_window" option. I have now added the "-q 11 --large_window=30" setting to the benchmark.

If it were listed in the "brotli -h" output, I might have added it sooner. Someone might say that the brotli authors failed to expose or document this setting in a useful way.
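
    To spare others the googling, the invocation looks roughly like this (a minimal sketch driving the brotli CLI from Python; file names are placeholders, and, as I understand it, the window flag has to be repeated on decompression or large-window streams are rejected):

    Code:
        import subprocess

        # Compress at maximum quality with a 30-bit (~1 GiB) window,
        # i.e. the "-q 11 --large_window=30" setting added to the benchmark.
        subprocess.run(["brotli", "-q", "11", "--large_window=30", "input.fasta"],
                       check=True)

        # Repeat the flag when decompressing large-window streams.
        subprocess.run(["brotli", "-d", "--large_window=30", "input.fasta.br",
                        "-o", "roundtrip.fasta"], check=True)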

    If you feel that some other important setting is missing, please kindly let me know.

    Quote Originally Posted by JamesB View Post
    They should compare themselves against modern incarnation of FASTQ compressors, such as Spring
Not sure if this was about my benchmark. If it was, then I should mention that I did not use any FASTQ files. It's interesting and I might do it some day, but currently I benchmark only on FASTA files.

    Any further comments or suggestions are greatly appreciated!

  3. #3
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post
    Missed one.

    Quote Originally Posted by JamesB View Post
    NAF looks like it's rather naive, just separating out seq and qual and doing zstd on them.
Yes, NAF is very simple, by design. zstd works spectacularly well in the way I use it in NAF, and I much prefer to keep things simple. There are a number of more complex DNA compressors, but most of them are impractically slow. It was previously a mystery to me why everyone keeps using gzip instead of specialized compressors; only after starting the benchmark did I realize how slow most sequence compressors are.
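
    To illustrate the general idea, here is a toy sketch of packing nucleotides into 4 bits and handing the result to the zstd CLI. This is only an illustration under my own assumptions, not NAF's actual format or code:

    Code:
        import subprocess

        # 4-bit codes for the 16 IUPAC nucleotide symbols, two bases per byte.
        IUPAC = "ACGTURYSWKMBDHVN"
        CODE = {base: i for i, base in enumerate(IUPAC)}

        def pack(seq: str) -> bytes:
            seq = seq.upper()
            if len(seq) % 2:
                seq += "N"  # pad to an even length
            return bytes(CODE[a] << 4 | CODE[b]
                         for a, b in zip(seq[::2], seq[1::2]))

        packed = pack("ACGTACGTNNACGT" * 1000)

        # Hand the packed stream to the zstd CLI (level 19 here).
        out = subprocess.run(["zstd", "-19"], input=packed,
                             capture_output=True, check=True).stdout
        print(len(packed), "->", len(out), "compressed bytes")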

  4. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
It's a mystery to me too why people still use gzip. It's even more of a mystery why every Linux distribution I know ships with the slow zlib instead of any of the myriad of optimised ones.

You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format, then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting, as there has been much more work on compressing myriads of small fastq fragments - ie raw sequencing machine output - than on fasta.

  5. Thanks:

    Kirr (4th December 2019)

  6. #5
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format, then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting, as there has been much more work on compressing myriads of small fastq fragments - ie raw sequencing machine output - than on fasta.
That's a nice idea! I may try this some time. Indeed, it's easy to add constant quality via a wrapper script, which will allow using fastq compressors.
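
    Roughly like this minimal sketch (record handling is simplified, and the quality character is arbitrary; "I" is Phred 40 in Sanger encoding):

    Code:
        import sys

        # Minimal FASTA -> FASTQ wrapper: emit each record with a constant
        # quality string so FASTQ-only compressors can accept FASTA data.
        def fasta_to_fastq(lines, qual_char="I"):
            name, seq = None, []
            for line in lines:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if name is not None:
                        s = "".join(seq)
                        yield f"@{name}\n{s}\n+\n{qual_char * len(s)}\n"
                    name, seq = line[1:], []
                else:
                    seq.append(line)
            if name is not None:
                s = "".join(seq)
                yield f"@{name}\n{s}\n+\n{qual_char * len(s)}\n"

        if __name__ == "__main__":
            sys.stdout.writelines(fasta_to_fastq(sys.stdin))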

  7. #6
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
It's a mystery to me too why people still use gzip. It's even more of a mystery why every Linux distribution I know ships with the slow zlib instead of any of the myriad of optimised ones.
I started measuring memory consumption, and it provided one more piece of this puzzle: gzip appears to have the smallest memory footprint among the 40 compressors tested. E.g., decompression memory used by the strongest setting of each compressor: http://kirr.dyndns.org/sequence-comp...ow+scatterplot
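
    For what it's worth, the measurement can be done along these lines on Linux (a simplified sketch, not my actual benchmark harness; the gzip command is a placeholder):

    Code:
        import resource
        import subprocess

        # Run the (de)compressor as a child process, then read the peak
        # resident set size of waited-for children. On Linux, ru_maxrss
        # is reported in kilobytes.
        subprocess.run(["gzip", "-9", "-k", "test.fasta"], check=True)
        peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
        print(f"peak child RSS: {peak_kb} kB")

    (A real harness would isolate one child per measurement, since RUSAGE_CHILDREN aggregates over all waited-for children.)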

    Quote Originally Posted by JamesB View Post
You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format, then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting, as there has been much more work on compressing myriads of small fastq fragments - ie raw sequencing machine output - than on fasta.
I finally started benchmarking fastq compressors, currently including Leon, BEETL, GTZ, Quip, DSRC, HARC, SPRING, fastqz, fqzcomp and LFQC (with a few more in the pipeline). This is not easy, since each compressor has its own unique set of limitations, quirks and bugs. Even though this may not be an apples-to-apples comparison, I still find it very interesting. Eventually I'll have to add fastq test data too.

Example comparison on a 9.22 MB genome: Compression ratio vs Decompression speed (other measures and data are selectable).

  8. Thanks (2):

Gonzalo (30th November 2019), JamesB (2nd December 2019)

  9. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
    Quote Originally Posted by Kirr View Post
Example comparison on a 9.22 MB genome: Compression ratio vs Decompression speed (other measures and data are selectable).
It's worth noting there are several very different classes of compression tool out there, so it may be good to label the type of input data more clearly. The sorts I can think of are:

• Compression of many small fragments; basically sequencing machine outputs. There is a lot of replication, as we expect e.g. 30-fold redundancy, but finding those repeats is challenging.
  Further subdivided into fixed-length short reads (Illumina) and long, variable-length reads (ONT, PacBio).
• Compression of long genomes with a single copy of each chromosome. No obvious redundancy except for repeats internal to the genome itself (Alu, LINEs, SINEs, etc.).
• Compression of sets of genomes or sequence families. Very strong redundancy.

  10. #8
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
It's worth noting there are several very different classes of compression tool out there, so it may be good to label the type of input data more clearly. The sorts I can think of are:

• Compression of many small fragments; basically sequencing machine outputs. There is a lot of replication, as we expect e.g. 30-fold redundancy, but finding those repeats is challenging.
  Further subdivided into fixed-length short reads (Illumina) and long, variable-length reads (ONT, PacBio).
• Compression of long genomes with a single copy of each chromosome. No obvious redundancy except for repeats internal to the genome itself (Alu, LINEs, SINEs, etc.).
• Compression of sets of genomes or sequence families. Very strong redundancy.
The first type of data is currently not represented in the benchmark at all. I will certainly add such data in the future. The other two kinds are both used, and I thought they were labeled clearly, but apparently not clearly enough. I will try to further improve the clarity. Thanks!

There are also different kinds of compressors, e.g. those designed specifically for short reads vs. those that don't care about sequence type. I will probably separate short-read compressors into their own category. (Currently I bundle all specialized compressors together as "sequence compressors".)

  11. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
    Quote Originally Posted by Kirr View Post
    The first type of data is currently not represented at all in the benchmark. I will certainly add such data in the future.
In that case I'm amazed fqzcomp does even remotely well! It was written for short-read Illumina sequencing data.

    Luck I guess, although it's clearly not the optimal tool. NAF is occupying a great speed vs size tradeoff there.

  12. #10
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    7
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
In that case I'm amazed fqzcomp does even remotely well! It was written for short-read Illumina sequencing data.
Yes, fqzcomp performs well considering it works via a wrapper that chops long sequences into reads. (And adds constant quality as per your idea, which I probably took a bit too far.) Interestingly, it is currently leading in compactness on the spruce genome: chart (though this test is not complete; some compressors are still missing). It may also improve further after I add its newly fixed "-s9" mode. I guess it will work even better on proper fastq short-read datasets.
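
    The chopping part of the wrapper is conceptually just this (a sketch with made-up read length and naming, not the actual script):

    Code:
        import sys

        # Split one long sequence into fixed-length pseudo-reads with
        # constant quality, so a short-read FASTQ compressor can be fed
        # a whole genome. READ_LEN and the naming are for illustration.
        READ_LEN = 100

        def chop_to_fastq(name, seq, read_len=READ_LEN, qual_char="I"):
            for i in range(0, len(seq), read_len):
                chunk = seq[i:i + read_len]
                yield f"@{name}_{i}\n{chunk}\n+\n{qual_char * len(chunk)}\n"

        for record in chop_to_fastq("chr1", "ACGT" * 2500):
            sys.stdout.write(record)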

    Quote Originally Posted by JamesB View Post
    Luck I guess, although it's clearly not the optimal tool. NAF is occupying a great speed vs size tradeoff there.
    Thanks. Yeah, NAF is focused on transfer + decompression speed, because both of these steps can be a bottleneck in my work. I noticed that many other sequence compressors are primarily optimized for compactness (something I did not know before doing the benchmark), which partly explains why gzip remains popular.

  13. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
Gzip is too popular. I regularly have discussions trying to channel people towards zstd instead. No reason to use gzip in the modern era IMO, unless it's for some legacy compatibility.

