Thread: Comparison of compressors for molecular sequence databases

  1. #1
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post

    Comparison of compressors for molecular sequence databases

    Hi all! I am running a benchmark of compressors on sequence databases. I thought I'd start a thread to discuss it, respond to comments from the "Data Compression Tweets" thread, and gather expert feedback to hopefully improve it.

    Benchmark data: http://kirr.dyndns.org/sequence-compression-benchmark/

    This is a work in progress, and I will continue improving it when I can. Suggestions are welcome!

    Disclaimers/Disclosures/Limitations:

    1) As with any benchmark, the results are specific to the particular hardware, test data and methodology. I used a reasonably standard workstation, and the test data consists of commonly used sequence databases. The results should therefore be reasonably informative, but not necessarily 100% transferable to other machines or data.

    2) I benchmark a very specific task: reference-free lossless compression (and decompression) of FASTA files containing DNA, RNA and protein sequences. This is the data I often work with, so I'm familiar with what is used and how it is used.

    3) My own compressor is included in the benchmark. I need to know how it compares to other compressors, and when I make an improvement, I need to measure it. My compressor receives no special treatment in the benchmark.

    4) The benchmark takes a lot of time. If something is missing, it's possible that I simply haven't had time to add it yet.

    5) The interface is a mess. I'll need to organize it in a much better way.

  2. #2
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Now, on to replies to comments from another thread.

    Quote Originally Posted by Jarek View Post
    This comparison clearly misses CRAM ( https://en.wikipedia.org/wiki/CRAM_(file_format) ), its winning NAF is from the same author, and seems to be preprocessing+zstd: http://kirill-kryukov.com/study/naf/
    CRAM uses a reference sequence. It is very interesting, but outside the scope of my benchmark.

    Quote Originally Posted by Jyrki Alakuijala View Post
    They failed to benchmark in a useful way. Brotli is run with 4 MB window whereas zstd is run with 8 to 128 MB.

    If they use the same window length, brotli will compress 5 % better than zstd on an average corpora.
    Thanks to your helpful comment and a fair bit of googling, I finally found out about the existence of brotli's "--large_window" option. I have now added the "-q 11 --large_window=30" setting to the benchmark.

    If it were listed in the "brotli -h" output, I might have added it sooner. Someone might say that the brotli authors failed to expose or document this setting in a useful way.

    If you feel that some other important setting is missing, please kindly let me know.

    Quote Originally Posted by JamesB View Post
    They should compare themselves against modern incarnation of FASTQ compressors, such as Spring
    Not sure if this was about my benchmark. If it was, I should mention that I did not use any FASTQ files. It's interesting and I might do it some day, but currently I benchmark only on FASTA files.

    Any further comments or suggestions are greatly appreciated!

  3. #3
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Missed one.

    Quote Originally Posted by JamesB View Post
    NAF looks like it's rather naive, just separating out seq and qual and doing zstd on them.
    Yes, NAF is very simple, by design. zstd works spectacularly well in the way I use it in NAF, and I much prefer to keep it simple. There are a number of more complex DNA compressors, but most of them are impractically slow. It used to be a mystery to me why everyone keeps using gzip instead of specialized compressors; only after starting the benchmark did I realize how slow most sequence compressors are.

  4. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
    It's a mystery to me why people still use gzip too. It's even more of a mystery why every Linux distribution I know of ships with the slow zlib instead of any of the myriad optimised ones.

    You may also want to test some of the FASTQ compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in FASTA format, then just generate a fixed string of qualities (all the same) to turn it into FASTQ, if the tool in question doesn't also support FASTA. This may be interesting, as there's been much more work on compression of myriads of small FASTQ fragments - i.e. raw sequencing machine output - than there has been on FASTA.

  5. Thanks:

    Kirr (4th December 2019)

  6. #5
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
    You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting as there's been much more work on compression of myriad of small fastq fragments - ie raw sequence machine output - than there has fasta.
    That's a nice idea! I may try this sometime. Indeed, it's easy to add a constant quality string via a wrapper script, and this will allow using FASTQ compressors.
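
    In case it's useful, here is a minimal sketch of such a wrapper in Python (the file handling and the fixed quality character "I" are arbitrary assumptions, not what the benchmark actually uses):

    Code:
    import sys

    def fasta_to_fastq(fasta_path, fastq_path, qual_char="I"):
        # Read FASTA records and write them back out as FASTQ with a constant quality string.
        with open(fasta_path) as src, open(fastq_path, "w") as dst:
            name = None
            seq = []
            def flush():
                if name is not None:
                    s = "".join(seq)
                    dst.write("@%s\n%s\n+\n%s\n" % (name, s, qual_char * len(s)))
            for line in src:
                line = line.rstrip()
                if line.startswith(">"):
                    flush()
                    name = line[1:]
                    seq = []
                elif line:
                    seq.append(line)
            flush()

    if __name__ == "__main__":
        fasta_to_fastq(sys.argv[1], sys.argv[2])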

  7. #6
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
    It's a mystery to me why people use gzip still too. It's even more of a mystery why every linux distribution I know ships with the slow zlib instead of any of the myriad of optimised ones.
    I started measuring memory consumption, and it provided one more piece of this puzzle: gzip appears to have the smallest memory footprint among 40 compressors. E.g., decompression memory used by the strongest setting of each compressor: http://kirr.dyndns.org/sequence-comp...ow+scatterplot

    Quote Originally Posted by JamesB View Post
    You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting as there's been much more work on compression of myriad of small fastq fragments - ie raw sequence machine output - than there has fasta.
    I finally started benchmarking FASTQ compressors, currently including: Leon, BEETL, GTZ, Quip, DSRC, HARC, SPRING, fastqz, fqzcomp, LFQC (with a few more in the pipeline). This is not easy, since each compressor has its own unique set of limitations, quirks and bugs. Even though this may not be an apples-to-apples comparison, I still find it very interesting. Eventually I'll have to add FASTQ test data too.

    Example comparison on a 9.22 MB genome: Compression ratio vs Decompression speed. (other measures and data are selectable).

  8. Thanks (2):

    Gonzalo (30th November 2019),JamesB (2nd December 2019)

  9. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
    Quote Originally Posted by Kirr View Post
    Example comparison on a 9.22 MB genome: Compression ratio vs Decompression speed. (other measures and data are selectable).
    It's worth noting there are several very different classes of compression tool out there, so it may be good to label the type of input data more clearly. The sorts I can think of are:

    • Compression of many small fragments; basically sequencing machine outputs. There is a lot of replication, as we expect e.g. 30-fold redundancy, but finding those repeats is challenging.
      Further subdivided into fixed-length short reads (Illumina) and long variable-size reads (ONT, PacBio).
    • Compression of long genomes with a single copy of each chromosome. No obvious redundancy except for repeats internal to the genome itself (Alu, LINEs, SINEs, etc).
    • Compression of sets of genomes or sequence families. Very strong redundancy.

  10. #8
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
    It's worth noting there are several very different classes of compression tool out there, so it may be good to label the type of input data more clearly. The sort I can think of are:


    • Compression of many small fragments; basically sequencing machine outputs. There is a lot of replication as we expect e.g. 30 fold redundancy, but finding those repeats is challenging.
      Further subdivided into fixed length short reads (Illumina) and long variable size reads (ONT, PacBio)
    • Compression of long genomes with a single copy of each chromosome. No obvious redundancy except for repeats internal to the genome itself (ALU, LINES, SINES, etc).
    • Compression of sets of genomes or sequence families. Very strong redundancy.
    The first type of data is currently not represented at all in the benchmark. I will certainly add such data in the future. The other two kinds are both used, and I thought they were labeled clearly, but probably not clearly enough. I will try to further improve the clarity. Thanks!

    There are also different kinds of compressors, e.g. those designed specifically for short reads vs. those that don't care about sequence type. I will probably separate short-read compressors into their own category. (Currently I bundle all specialized compressors together as "Sequence compressors".)

  11. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
    Quote Originally Posted by Kirr View Post
    The first type of data is currently not represented at all in the benchmark. I will certainly add such data in the future.
    In that case I'm amazed fqzcomp does even remotely well! It was written for short read Illumina sequencing data.

    Luck I guess, although it's clearly not the optimal tool. NAF is occupying a great speed vs size tradeoff there.

  12. #10
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by JamesB View Post
    In that case I'm amazed fqzcomp does even remotely well! It was written for short read Illumina sequencing data.
    Yes, fqzcomp performs well considering that it works via a wrapper that chops the long sequence into reads. (And adds constant quality, as per your idea, which I probably took a bit too far.) Interestingly, it is currently leading in compactness on the spruce genome: chart (though this test is not complete, some compressors are still missing). Also, it may improve further after I add its newly fixed "-s9" mode. I guess it will work even better on proper FASTQ short-read datasets.
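
    For the curious, a rough sketch of the chopping idea in Python (the 100 bp read length and the quality character are arbitrary assumptions; the actual wrapper may differ):

    Code:
    import sys

    def chop_to_fastq(name, seq, dst, read_len=100, qual_char="I"):
        # Emit fixed-length "reads" covering the sequence, each with a constant quality string.
        for i in range(0, len(seq), read_len):
            read = seq[i:i + read_len]
            dst.write("@%s_%d\n%s\n+\n%s\n" % (name, i, read, qual_char * len(read)))

    if __name__ == "__main__":
        # Toy usage: chop a made-up sequence into reads and print FASTQ to stdout.
        chop_to_fastq("chr1", "ACGT" * 500, sys.stdout)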

    Quote Originally Posted by JamesB View Post
    Luck I guess, although it's clearly not the optimal tool. NAF is occupying a great speed vs size tradeoff there.
    Thanks. Yeah, NAF is focused on transfer + decompression speed, because both of these steps can be a bottleneck in my work. I noticed that many other sequence compressors are primarily optimized for compactness (something I did not know before doing the benchmark), which partly explains why gzip remains popular.

  13. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    467
    Thanks
    149
    Thanked 160 Times in 108 Posts
    Gzip is too popular. I regularly have discussions trying to channel people towards zstd instead. There is no reason to use gzip in the modern era, IMO, unless it's for some legacy compatibility.

  14. #12
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    76
    Thanks
    3
    Thanked 9 Times in 9 Posts
    using bigm_suryak v8
    astrammina.fna 361344 bytes
    nosema.fna 1312863 bytes

  15. #13
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    using bigm_suryak v8
    astrammina.fna 361344 bytes
    nosema.fna 1312863 bytes
    This would put bigm at #2 for Astrammina rara (after cmix) and at #5 for Nosema ceranae (after jarvis, cmix, xm and geco) in compactness. (Note that this is not the most important measurement for judging the practical usability of a compressor.)

    What were the compression and decompression times and memory use? Also, on what hardware and OS?

    According to another thread, the relationship between bigm and cmix is currently unclear, which probably means that I should not add bigm to the benchmark until the issue is resolved?

  16. #14
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    76
    Thanks
    3
    Thanked 9 Times in 9 Posts
    Quote Originally Posted by Kirr View Post
    This would put bigm on #2 for Astrammina rara (after cmix) and on #5 for Nosema ceranae (after jarvis, cmix, xm and geco), in compactness. (Note that this is not the most important measurement for judging practical usability of a compressor).

    What was compression and decompression time and memory use? Also on what hardware and OS?

    According to another thread the relationship between bigm and cmix is unclear currently, which probably means that I should not add bigm to benchmark until the issue is resolved?
    Bigm is not cmix, because bigm uses only ~1.3 GB of memory, while cmix uses 24-25 GB of memory. I use Windows 10 64-bit. So is it okay to add bigm to the benchmark list?

  17. #15
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    Bigm is not cmix because bigm use only ~1.3 gb memory.cmix use 24 gb-25gb of memory. I use Windows 10 64 bit. Do it's okay to add bigm to benchmark list
    1.3 GB is nice. It's unfortunate that you choose to waste your good work by ignoring the concerns of others (and possibly violating the GPL), by keeping the source closed, by distributing only a Windows binary, and by staying anonymous. I'm not going to touch your compressor with a 10-foot pole while this remains the case.

  18. #16
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    10
    Thanks
    1
    Thanked 2 Times in 1 Post
    One cool thing you can do with the benchmark is a detailed comparison of any two (or more) compressors, or of their settings. For example, recently I was wondering about some compressor levels that seem redundant. Let's take a closer look at one such example: lz4 -2 vs lz4 -1. From the data table you can see that they are very close, but it's not so easy to spot this in a wall of numbers.

    Fortunately, it's easy to visualize this data. For example, this scatterplot shows the difference between "-1" and "-2". For each dataset it shows the results of lz4 -2 divided by those of lz4 -1 (i.e., ratios of the measurements). Each point is a different test dataset. Which measurements to show is selectable; in this case it's compression ratio on the X axis and compression+decompression speed on the Y axis. E.g., here is the same kind of chart showing compression memory against decompression memory of lz4 -2 compared to lz4 -1.

    The charts clearly show that "-2" and "-1" have identical compression strength. The difference in speed and memory consumption is tiny and can probably be explained by measurement noise (considering that all the outliers are on very small datasets). Therefore "-2" can be considered redundant, at least on this particular machine and test data.
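
    To illustrate how these ratio charts are computed, here is a tiny Python sketch (the numbers and field names are made up for illustration; the benchmark's actual tables differ):

    Code:
    # Illustrative per-dataset measurements for the two settings (made-up numbers).
    lz4_1 = {"dataset_a": {"ratio": 2.10, "cd_speed": 850.0},
             "dataset_b": {"ratio": 1.95, "cd_speed": 910.0}}
    lz4_2 = {"dataset_a": {"ratio": 2.10, "cd_speed": 846.0},
             "dataset_b": {"ratio": 1.95, "cd_speed": 905.0}}

    # One scatterplot point per dataset: the lz4 -2 result divided by the lz4 -1 result.
    for ds in sorted(lz4_1):
        x = lz4_2[ds]["ratio"] / lz4_1[ds]["ratio"]        # compression ratio (X axis)
        y = lz4_2[ds]["cd_speed"] / lz4_1[ds]["cd_speed"]  # compress+decompress speed (Y axis)
        print("%s  x=%.3f  y=%.3f" % (ds, x, y))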

  19. #17
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    76
    Thanks
    3
    Thanked 9 Times in 9 Posts
    Quote Originally Posted by Kirr View Post
    1.3 GB is nice. It's unfortunate that you choose to waste your good work by ignoring the concerns of others (and possibly violating GPL), by keeping the source closed, by distributing only windows binary and by staying anonymous. I'm not going to touch your compressor with 10-foot pole while this remains the case.

    I have published the source code in the bigm thread.

  20. #18
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    232
    Thanks
    109
    Thanked 112 Times in 68 Posts
    @Kirr, if you haven't seen it, there was some DNA benchmarking in this thread: https://encode.su/threads/2105-DNA-Corpus
    From that thread, there is one trick for "palindromic sequences" that significantly improves the compression ratio of general compressors on DNA. One other suggestion is to try adding cmv to your benchmark.
