Thread: Comparison of compressors for molecular sequence databases

  1. #1
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts

    Comparison of compressors for molecular sequence databases

    Hi all! I am running a benchmark of compressors on sequence databases. I thought I'd start a thread to discuss it, respond to comments from the "Data Compression Tweets" thread, and gather expert feedback to hopefully improve it.

    Benchmark data: http://kirr.dyndns.org/sequence-compression-benchmark/

    This is a work in progress, and I will continue improving it when I can. Suggestions are welcome!

    Disclaimers/Disclosures/Limitations:

1) As in any benchmark, the results are specific to the particular hardware, test data and methodology. I used a reasonably standard workstation, and the test data consists of commonly used sequence databases. Thus the results should be reasonably informative, but not necessarily 100% transferable to other machines or data.

2) I benchmark a very specific task: reference-free lossless compression (and decompression) of FASTA files containing DNA, RNA and protein sequences. This is the kind of data I often work with, so I'm familiar with what is used and how it is used.

3) My own compressor is included in the benchmark. I need to know how it compares to other compressors, and when I make an improvement, I need to measure it. My compressor receives no special treatment in the benchmark.

4) Benchmarking takes a lot of time. If something is missing, it's possible that I just haven't had time to add it yet.

    5) The interface is a mess. I'll need to organize it in a much better way.

  2. #2
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Now, on to replies to comments from another thread.

    Quote Originally Posted by Jarek View Post
    This comparison clearly misses CRAM ( https://en.wikipedia.org/wiki/CRAM_(file_format) ), its winning NAF is from the same author, and seems to be preprocessing+zstd: http://kirill-kryukov.com/study/naf/
CRAM uses a reference sequence. This is very interesting, but outside the scope of my benchmark.

    Quote Originally Posted by Jyrki Alakuijala View Post
    They failed to benchmark in a useful way. Brotli is run with 4 MB window whereas zstd is run with 8 to 128 MB.

If they use the same window length, brotli will compress 5 % better than zstd on an average corpus.
Thanks to your helpful comment and a fair bit of googling, I finally managed to find out about the existence of brotli's "--large_window" option. I have now added the "-q 11 --large_window=30" setting to the benchmark.
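For reference, the invocation is along these lines (file names are just examples, and as far as I can tell the decoder also needs the flag in order to accept large-window streams):

Code:
brotli -q 11 --large_window=30 input.fa -o input.fa.br
brotli -d --large_window=30 input.fa.br -o output.fa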

Had it been listed in "brotli -h" output, I might have added it sooner. Someone might say that the brotli authors failed to expose or document this setting in a useful way.

    If you feel that some other important setting is missing, please kindly let me know.

    Quote Originally Posted by JamesB View Post
    They should compare themselves against modern incarnation of FASTQ compressors, such as Spring
Not sure if this was about my benchmark. If it was, I should mention that I did not use any FASTQ files. It's interesting and I might do it some day, but currently I benchmark only on FASTA files.

    Any further comments or suggestions are greatly appreciated!

  3. #3
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Missed one.

    Quote Originally Posted by JamesB View Post
    NAF looks like it's rather naive, just separating out seq and qual and doing zstd on them.
Yes, NAF is very simple, by design. zstd works spectacularly well in the way I use it in NAF, and I much prefer to keep things simple. There are a number of more complex DNA compressors, but most of them are impractically slow. It was previously a mystery to me why everyone keeps using gzip instead of specialized compressors; only after starting the benchmark did I realize how slow most sequence compressors are.
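To illustrate the kind of simplicity I mean, here is a toy "pack + zstd" pipeline (a sketch of the general approach only, not NAF's actual code; the real format also handles sequence names, masks, IUPAC codes and so on):

Code:
# Toy DNA pipeline: pack bases into 2 bits each, then zstd.
# Illustration of the general approach only; requires the
# third-party 'zstandard' package.
import zstandard

CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT string into 2 bits per base (no N/IUPAC support)."""
    out = bytearray()
    byte, nbits = 0, 0
    for base in seq:
        byte = (byte << 2) | CODE[base]
        nbits += 2
        if nbits == 8:
            out.append(byte)
            byte, nbits = 0, 0
    if nbits:
        out.append(byte << (8 - nbits))  # pad the final partial byte
    return bytes(out)

def compress(seq: str, level: int = 19) -> bytes:
    return zstandard.ZstdCompressor(level=level).compress(pack_2bit(seq))

print(len(compress('ACGT' * 10000)))  # packs 40000 bases into 10000 bytes first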

  4. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    489
    Thanks
    176
    Thanked 172 Times in 117 Posts
It's a mystery to me too why people still use gzip. It's even more of a mystery why every Linux distribution I know ships with the slow zlib instead of any of the myriad optimised ones.

You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format, and the tool in question doesn't also support fasta, then just generate a fixed string of qualities (all the same) to turn it into fastq. This may be interesting, as there's been much more work on compressing myriads of small fastq fragments - i.e. raw sequencing machine output - than on fasta.

  5. Thanks:

    Kirr (4th December 2019)

  6. #5
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by JamesB View Post
You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format, and the tool in question doesn't also support fasta, then just generate a fixed string of qualities (all the same) to turn it into fastq. This may be interesting, as there's been much more work on compressing myriads of small fastq fragments - i.e. raw sequencing machine output - than on fasta.
    That's a nice idea! I may try this some time. Indeed, it's easy to add constant quality via a wrapper script, and this will allow using fastq compressors.
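A minimal version of such a wrapper could look like this (untested sketch; the read length and the constant quality character are arbitrary choices):

Code:
#!/usr/bin/env python3
# Sketch: read FASTA on stdin, chop each sequence into fixed-length
# reads, attach a constant quality string, write FASTQ on stdout.
import sys

READ_LEN = 100  # arbitrary

def fasta_records(stream):
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, seq = None, []
    for line in stream:
        line = line.rstrip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(seq)
            name, seq = line[1:], []
        else:
            seq.append(line)
    if name is not None:
        yield name, ''.join(seq)

for name, seq in fasta_records(sys.stdin):
    for i in range(0, len(seq), READ_LEN):
        read = seq[i:i + READ_LEN]
        sys.stdout.write('@%s:%d\n%s\n+\n%s\n' % (name, i, read, 'I' * len(read)))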

  7. #6
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by JamesB View Post
It's a mystery to me too why people still use gzip. It's even more of a mystery why every Linux distribution I know ships with the slow zlib instead of any of the myriad optimised ones.
I started measuring memory consumption, and it provided one more piece of this puzzle: gzip appears to have the smallest memory footprint among the 40 compressors. E.g., decompression memory used by the strongest setting of each compressor: http://kirr.dyndns.org/sequence-comp...ow+scatterplot

    Quote Originally Posted by JamesB View Post
You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format, and the tool in question doesn't also support fasta, then just generate a fixed string of qualities (all the same) to turn it into fastq. This may be interesting, as there's been much more work on compressing myriads of small fastq fragments - i.e. raw sequencing machine output - than on fasta.
I finally started benchmarking fastq compressors, currently including Leon, BEETL, GTZ, Quip, DSRC, HARC, SPRING, fastqz, fqzcomp and LFQC (with a few more in the pipeline). This is not easy, since each compressor has its own unique set of limitations, quirks and bugs. Even though this may not be an apples-to-apples comparison, I still find it very interesting. Eventually I'll have to add fastq test data too.

    Example comparison on a 9.22 MB genome: Compression ratio vs Decompression speed. (other measures and data are selectable).

  8. Thanks (2):

    Gonzalo (30th November 2019),JamesB (2nd December 2019)

  9. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    489
    Thanks
    176
    Thanked 172 Times in 117 Posts
    Quote Originally Posted by Kirr View Post
    Example comparison on a 9.22 MB genome: Compression ratio vs Decompression speed. (other measures and data are selectable).
It's worth noting there are several very different classes of compression tool out there, so it may be good to label the type of input data more clearly. The sorts I can think of are:


• Compression of many small fragments; basically sequencing machine outputs. There is a lot of replication, as we expect e.g. 30-fold redundancy, but finding those repeats is challenging.
  Further subdivided into fixed-length short reads (Illumina) and long, variable-size reads (ONT, PacBio).
    • Compression of long genomes with a single copy of each chromosome. No obvious redundancy except for repeats internal to the genome itself (ALU, LINES, SINES, etc).
    • Compression of sets of genomes or sequence families. Very strong redundancy.

  10. #8
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by JamesB View Post
It's worth noting there are several very different classes of compression tool out there, so it may be good to label the type of input data more clearly. The sorts I can think of are:


• Compression of many small fragments; basically sequencing machine outputs. There is a lot of replication, as we expect e.g. 30-fold redundancy, but finding those repeats is challenging.
  Further subdivided into fixed-length short reads (Illumina) and long, variable-size reads (ONT, PacBio).
    • Compression of long genomes with a single copy of each chromosome. No obvious redundancy except for repeats internal to the genome itself (ALU, LINES, SINES, etc).
    • Compression of sets of genomes or sequence families. Very strong redundancy.
The first type of data is currently not represented in the benchmark at all. I will certainly add such data in the future. The other two kinds are both used, and I thought they were labeled clearly, but probably not clearly enough. I will try to further improve the clarity. Thanks!

There are also different kinds of compressors, e.g. those designed specifically for short reads vs. those that don't care about sequence type. I will probably separate short-read compressors into their own category. (Currently I bundle all specialized compressors together as "Sequence compressors".)

  11. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    489
    Thanks
    176
    Thanked 172 Times in 117 Posts
    Quote Originally Posted by Kirr View Post
    The first type of data is currently not represented at all in the benchmark. I will certainly add such data in the future.
    In that case I'm amazed fqzcomp does even remotely well! It was written for short read Illumina sequencing data.

    Luck I guess, although it's clearly not the optimal tool. NAF is occupying a great speed vs size tradeoff there.

  12. #10
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by JamesB View Post
    In that case I'm amazed fqzcomp does even remotely well! It was written for short read Illumina sequencing data.
Yes, fqzcomp performs well considering it works via a wrapper that chops long sequences into reads (and adds constant quality as per your idea, which I probably took a bit too far). Interestingly, it is currently leading in compactness on the spruce genome: chart (though this test is not complete; some compressors are still missing). Also, it may improve further after I add its newly fixed "-s9" mode. I guess it will work even better on proper fastq short-read datasets.

    Quote Originally Posted by JamesB View Post
    Luck I guess, although it's clearly not the optimal tool. NAF is occupying a great speed vs size tradeoff there.
    Thanks. Yeah, NAF is focused on transfer + decompression speed, because both of these steps can be a bottleneck in my work. I noticed that many other sequence compressors are primarily optimized for compactness (something I did not know before doing the benchmark), which partly explains why gzip remains popular.

  13. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    489
    Thanks
    176
    Thanked 172 Times in 117 Posts
Gzip is too popular. I regularly have discussions trying to channel people towards zstd instead. There's no reason to use gzip in the modern era, IMO, unless it's for some legacy compatibility.

  14. #12
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
    using bigm_suryak v8
    astrammina.fna 361344 bytes
    nosema.fna 1312863 bytes

  15. #13
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    using bigm_suryak v8
    astrammina.fna 361344 bytes
    nosema.fna 1312863 bytes
This would put bigm at #2 for Astrammina rara (after cmix) and at #5 for Nosema ceranae (after jarvis, cmix, xm and geco) in compactness. (Note that this is not the most important measurement for judging the practical usability of a compressor.)

What were the compression and decompression times and memory use? Also, on what hardware and OS?

According to another thread, the relationship between bigm and cmix is currently unclear, which probably means I should not add bigm to the benchmark until the issue is resolved?

  16. #14
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
    Quote Originally Posted by Kirr View Post
This would put bigm at #2 for Astrammina rara (after cmix) and at #5 for Nosema ceranae (after jarvis, cmix, xm and geco) in compactness. (Note that this is not the most important measurement for judging the practical usability of a compressor.)

What were the compression and decompression times and memory use? Also, on what hardware and OS?

According to another thread, the relationship between bigm and cmix is currently unclear, which probably means I should not add bigm to the benchmark until the issue is resolved?
Bigm is not cmix, because bigm uses only ~1.3 GB of memory while cmix uses 24-25 GB. I use Windows 10 64-bit. So is it okay to add bigm to the benchmark list?

  17. #15
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
Bigm is not cmix, because bigm uses only ~1.3 GB of memory while cmix uses 24-25 GB. I use Windows 10 64-bit. So is it okay to add bigm to the benchmark list?
1.3 GB is nice. It's unfortunate that you choose to waste your good work by ignoring the concerns of others (and possibly violating the GPL), by keeping the source closed, by distributing only a Windows binary, and by staying anonymous. I'm not going to touch your compressor with a 10-foot pole while this remains the case.

  18. #16
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
One cool thing you can do with the benchmark is a detailed comparison of any two (or more) compressors, or of their settings. For example, recently I was wondering about some compressor levels that seem redundant. Let's take a closer look at one such example: lz4 -2 vs lz4 -1. From the data table you can see that they are very close, but it's not so easy to spot this in a wall of numbers.

Fortunately, it's easy to visualize this data. For example, this scatterplot shows the difference between "-1" and "-2": for each dataset it shows the results of lz4 -2 divided by those of lz4 -1 (so ratios of the measurements are shown). Each point is a different test dataset. Which measurements to show is selectable; in this case it's compression ratio on the X axis and compression+decompression speed on the Y axis. E.g., here is the same kind of chart showing compression memory against decompression memory of lz4 -2 compared to lz4 -1.

The charts clearly show that "-2" and "-1" have identical compression strength. The difference in speed and memory consumption is tiny and can probably be explained by measurement noise (considering that all outliers are on very small data). Therefore "-2" can be considered redundant, at least on this particular machine and test data.
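For anyone curious, the ratio charts boil down to something like this (a sketch with hypothetical file and column names; the real benchmark pipeline stores its results differently):

Code:
# Compare two compressor settings dataset-by-dataset as ratios.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results.csv')  # hypothetical: one row per (compressor, dataset)
a = df[df['compressor'] == 'lz4-1'].set_index('dataset')
b = df[df['compressor'] == 'lz4-2'].set_index('dataset')

cols = ['ratio', 'cd_speed']  # compression ratio, c+d speed
r = b[cols] / a[cols]         # lz4 -2 relative to lz4 -1, per dataset

plt.scatter(r['ratio'], r['cd_speed'])
plt.axhline(1.0, linestyle=':')  # mark the "no difference" lines
plt.axvline(1.0, linestyle=':')
plt.xlabel('compression ratio: lz4 -2 / lz4 -1')
plt.ylabel('c+d speed: lz4 -2 / lz4 -1')
plt.show()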

  19. #17
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
    Quote Originally Posted by Kirr View Post
1.3 GB is nice. It's unfortunate that you choose to waste your good work by ignoring the concerns of others (and possibly violating the GPL), by keeping the source closed, by distributing only a Windows binary, and by staying anonymous. I'm not going to touch your compressor with a 10-foot pole while this remains the case.

I have published the source code in the bigm thread.

  20. #18
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    244
    Thanks
    112
    Thanked 114 Times in 69 Posts
    @Kirr, if you haven't seen it, there was some DNA benchmarking in this thread: https://encode.su/threads/2105-DNA-Corpus
    From that thread, there is one trick for "palindromic sequences" that significantly improves compression rate on DNA for general compressors. One other suggestion is to try adding cmv to your benchmark.
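The core of the trick is the reverse complement: DNA repeats often occur on the opposite strand (reversed, with A<->T and C<->G swapped), and a general compressor that only finds direct matches misses them. A rough sketch of the basic transform (the blockwise scheme discussed in that thread is more involved):

Code:
# Reverse complement of a DNA string. Interleaving blocks of the
# input with their reverse complements can turn opposite-strand
# ("palindromic") repeats into direct repeats that general-purpose
# compressors can find.
COMP = bytes.maketrans(b'ACGTacgt', b'TGCAtgca')

def revcomp(seq: bytes) -> bytes:
    return seq.translate(COMP)[::-1]

assert revcomp(b'AACGT') == b'ACGTT'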

  21. #19
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    270
    Thanks
    112
    Thanked 153 Times in 112 Posts
    Quote Originally Posted by byronknoll View Post
    One other suggestion is to try adding cmv to your benchmark.
Cmv is very slow: the setting with acceptable speed (options "-m0,0,+") is not competitive with the best compressors, and the stronger options are terribly slow.

    Code:
    cmv 0.3.0 alpha 1
    
    DNA-Genome-Astrammina-rara-GCA_000211355.2-2011-04-27.fa
      358,328 cmix
      359,114 cmv (-m2,1,0x7fa87cbf + reverse-complement block size 4096 + re-feed with the last 16 bytes)
      359,602 cmv (-m2,1,0x7fa87cbf)
      361,269 bigm
      377,031 mfc-1
      379,515 xm-12-0.50
      383,650 jarvis-7
      383,682 cmv (-m0,0,+)
      384,443 naf-16
    1,712,167 Original
    
    DNA-Genome-Cryptosporidium-parvum-Iowa-II-GCA_000165345.1-2007-02-26.fa
    2,144,097 cmix
    2,146,622 cmv (-m2,1,0x7fe27b6a + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    2,147,926 cmv (-m2,1,0x7fe27b6a)
    2,181,328 lfqc
    2,243,377 xm-13-0.50
    2,245,954 jarvis-7
    2,252,802 spring-l-4t
    2,253,292 geco2-2
    2,259,558 mfc-1
    2,271,177 dlim
    2,287,283 naf-1
    2,288,732 fqzcomp-9p
    2,294,865 cmv (-m0,0,+)
    2,302,505 bcm-b16
    9,216,802 Original
    
    DNA-Genome-Gordonia-phage-GAL1-GCF_001884535.1-2016-11-15.fa
    11,893 cmv (-m1,0,0x77a9ec32)
    11,907 cmix
    11,926 bigm
    11,949 cmv (-m1,0,0x77a9ec32 + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    12,043 geco2-2
    12,062 xm-10-0.50
    12,122 jarvis-1
    12,647 cmv (-m0,0,+)
    12,804 dcom-9
    50,654 Original
    
    DNA-Genome-Nosema-ceranae-GCA_000988165.1-2015-05-05.fa
    1,282,485 cmv (-m2,3,0x37a9f8fa + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    1,282,895 cmix
    1,285,337 jarvis-7
    1,290,084 xm-13-0.50
    1,295,935 cmv (-m2,3,0x37a9f8fa)
    1,313,392 bigm
    1,344,131 lfqc
    1,352,947 mfc-1
    1,373,983 naf-20
    1,376,438 geco2-2
    1,382,402 spring-l-4t
    1,399,894 cmv (-m0,0,+)
    1,408,176 fqzcomp-9p
    5,809,207 Original
    
    DNA-Genome-WS1-bacterium-JGI-0000059-K21-GCA_000398605.1-2013-05-16.fa
    119,941 cmix
    120,452 cmv (-m2,0,0x7fa9ebbe + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    120,494 cmv (-m2,0,0x7fa9ebbe)
    121,424 bigm
    121,629 geco2-2
    122,283 jarvis-5
    122,357 xm-12-0.50
    124,237 mfc-1
    124,341 fqzcomp-9
    125,176 dlim
    125,408 naf-17
    125,768 leon-24
    127,829 dnax-1
    128,225 pfish
    128,489 cmv (-m0,0,+)
    129,040 bcm-b128
    521,951 Original
    
    DNA-Mitochondrion-2019-03-15.fa (Other DNA datasets)
     36,082,417 brotli-11w30
     36,134,379 naf-22
     38,175,252 xz-e9
     38,621,433 dlim
     39,844,011 zstd-22-4t
     41,052,400 mfc-1
     41,975,373 dnax-0
     43,175,212 lzturbo-49-4t
     43,805,811 spring-s-4t
     44,675,961 harc-4t
     45,052,773 lfqc
     45,083,192 cmv (-m0,0,0x03a21096 partially optimized)
     46,151,475 cmv (-m0,0,0x03a21096 partially optimized + reverse-complement block size 4096 + re-feed with the last 16 bytes)
     48,732,600 cmv (-m0,0,+)
     52,015,592 quip
    245,282,526 Original
    
    AA-PDB-2019-04-09.fa (Protein dataset)
     9,059,635 cmv (-m1,3,0x03a41197 partially optimized)
     9,231,778 cmv (-m0,0,0x03a21096 partially optimized + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    10,094,588 naf-21
    10,290,284 xz-e9
    10,822,434 zstd-22-4t
    10,913,274 brotli-11
    10,935,828 cmv (-m0,0,+)
    11,460,855 lzturbo-39-1t
    67,609,430 Original
    
    RNA-SILVA-132-LSURef-2017-12-11.fa (RNA datasets)
     12,589,412 naf-22
     15,246,937 dlim
     15,754,800 xz-e9
     16,413,599 zstd-22-1t
     16,839,493 gtz-1-4t
     17,432,024 cmv (-m1,3,0x03ec1196 partially optimized)
     17,476,350 dnax-0
     17,521,079 brotli-11
     18,283,509 lzturbo-39-1t
     19,071,990 lfqc
     20,475,509 cmv (-m1,3,0x03ec1196 partially optimized + reverse-complement block size 4096 + re-feed with the last 16 bytes)
     21,609,658 spring-s-4t
     21,773,215 cmv (-m0,0,+)
     24,333,248 harc-4t
    610,296,406 Original
    
    DNA-UCSC-hg38-7way-knownCanonical-exonNuc-2014-06-06.fa (DNA alignments)
     30,142,284 cmv (partially optimized)
     31,092,994 cmv (partially optimized + reverse-complement block size 4096 + re-feed with the last 16 bytes)
     40,101,510 lzturbo-49-4t
     42,145,896 xz-e7
     42,908,692 brotli-11
     44,646,812 cmv (-m0,0,+)
     44,746,006 naf-18
    340,422,655 Original
    Last edited by Mauro Vezzosi; 3rd January 2020 at 18:25. Reason: Updated cmv tests

  22. #20
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
I have published the source code in the bigm thread.
    A step in the right direction!

    Quote Originally Posted by byronknoll View Post
    @Kirr, if you haven't seen it, there was some DNA benchmarking in this thread: https://encode.su/threads/2105-DNA-Corpus
    From that thread, there is one trick for "palindromic sequences" that significantly improves compression rate on DNA for general compressors. One other suggestion is to try adding cmv to your benchmark.
Thanks for the pointers. I did not know that cmix contains a special model for palindromic DNA (if I read the thread correctly). This might partially explain cmix's good compression strength on DNA in my benchmark.

cmv looks interesting, but at present I'm not sure if I can add it. I normally benchmark open-source compressors (with very few exceptions), and my benchmark is done entirely under Linux. Therefore adding a closed-source, Windows-only, binary-distributed compressor is a bit hard to justify (as well as hard to benchmark).

Quote Originally Posted by Mauro Vezzosi View Post
Cmv is very slow: the setting with acceptable speed (options "-m0,0,+") is not competitive with the best compressors, and the stronger options are terribly slow.
The compression strength looks nice nonetheless! I hope you'll keep improving it. Thanks for posting the results!

  23. #21
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    244
    Thanks
    112
    Thanked 114 Times in 69 Posts
    Quote Originally Posted by Kirr View Post
I did not know that cmix contains a special model for palindromic DNA (if I read the thread correctly).
    Actually, it currently doesn't. That was just something I was experimenting with in that thread. However, it is a feature I would like to add (hopefully in the next cmix release).

  24. #22
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    489
    Thanks
    176
    Thanked 172 Times in 117 Posts
Of recent interest in this domain is the GABAC paper: https://doi.org/10.1093/bioinformatics/btz922

It's an entropy encoder, so it's not going to be great at LZ- or BWT-style dedup, but potentially OK for quality values.
Edit: I take that back - it also has a match step and an RLE step, so it's a generalised LZ coder as well as an entropy encoder.

For the purposes of your tests, you'd want to split the input into two streams (names and sequence). However, like fqzcomp, rANS, etc., it's not designed specifically with your task in mind. It may still be interesting though.


    PS. Yes, I still need to investigate the fqzcomp -s9 issues.

  25. #23
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
Quote Originally Posted by Mauro Vezzosi View Post
Cmv is very slow: the setting with acceptable speed (options "-m0,0,+") is not competitive with the best compressors, and the stronger options are terribly slow.

    Code:
cmv 0.3.0 alpha 1

    DNA-Genome-Astrammina-rara-GCA_000211355.2-2011-04-27.fa
      358,328 cmix
      359,114 cmv (-m2,1,0x7fa87cbf + reverse-complement block size 4096 + re-feed with the last 16 bytes)
      359,602 cmv (-m2,1,0x7fa87cbf)
  361,269 bigm
      377,031 mfc-1
      379,515 xm-12-0.50
      383,650 jarvis-7
      383,682 cmv (-m0,0,+)
      384,443 naf-16
    1,712,167 Original
    
    DNA-Genome-Cryptosporidium-parvum-Iowa-II-GCA_000165345.1-2007-02-26.fa
    2,144,097 cmix
    2,146,622 cmv (-m2,1,0x7fe27b6a + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    2,147,926 cmv (-m2,1,0x7fe27b6a)
2,158,936 bigm
    2,181,328 lfqc
    2,243,377 xm-13-0.50
    2,245,954 jarvis-7
    2,252,802 spring-l-4t
    2,253,292 geco2-2
    2,259,558 mfc-1
    2,271,177 dlim
    2,287,283 naf-1
    2,288,732 fqzcomp-9p
    2,294,865 cmv (-m0,0,+)
    2,302,505 bcm-b16
    9,216,802 Original
    
    DNA-Genome-Gordonia-phage-GAL1-GCF_001884535.1-2016-11-15.fa
    11,893 cmv (-m1,0,0x77a9ec32)
    11,907 cmix
11,926 bigm
    12,043 geco2-2
    12,062 xm-10-0.50
    12,122 jarvis-1
    12,647 cmv (-m0,0,+)
    12,804 dcom-9
    50,654 Original
    ?????? cmv (To do: -m1,0,0x77a9ec32 + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    
    DNA-Genome-Nosema-ceranae-GCA_000988165.1-2015-05-05.fa
    1,282,895 cmix
    1,285,337 jarvis-7
    1,290,084 xm-13-0.50
1,313,392 bigm
    1,344,131 lfqc
    1,346,449 cmv (Temporary)
    1,352,947 mfc-1
    1,373,983 naf-20
    1,376,438 geco2-2
    1,382,402 spring-l-4t
    1,399,894 cmv (-m0,0,+)
    1,408,176 fqzcomp-9p
    5,809,207 Original
    ????????? cmv (To do: optimize the options)
    ????????? cmv (To do: optimize the options + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    
    DNA-Genome-WS1-bacterium-JGI-0000059-K21-GCA_000398605.1-2013-05-16.fa
    119,941 cmix
    120,498 cmv (-m2,0,0x7da9edbe)
121,424 bigm
    121,629 geco2-2
    122,283 jarvis-5
    122,357 xm-12-0.50
    124,237 mfc-1
    124,341 fqzcomp-9
    125,176 dlim
    125,408 naf-17
    125,768 leon-24
    127,829 dnax-1
    128,225 pfish
    128,489 cmv (-m0,0,+)
    129,040 bcm-b128
    521,951 Original
    ??????? cmv (To do: -m2,0,0x7da9edbe + reverse-complement block size 4096 + re-feed with the last 16 bytes)
    Last edited by suryakandau@yahoo.co.id; 19th December 2019 at 16:10.

  26. #24
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
Bigm uses only ~1.3 GB of memory.

  27. #25
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    21
    Thanks
    3
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by byronknoll View Post
    Actually, it currently doesn't. That was just something I was experimenting with in that thread. However, it is a feature I would like to add (hopefully in the next cmix release).
I see. OK, I'll look forward to testing it in the future. Hopefully it won't cause much extra slowdown.

    Quote Originally Posted by JamesB View Post
Of recent interest in this domain is the GABAC paper: https://doi.org/10.1093/bioinformatics/btz922

It's an entropy encoder, so it's not going to be great at LZ- or BWT-style dedup, but potentially OK for quality values.
Edit: I take that back - it also has a match step and an RLE step, so it's a generalised LZ coder as well as an entropy encoder.

For the purposes of your tests, you'd want to split the input into two streams (names and sequence). However, like fqzcomp, rANS, etc., it's not designed specifically with your task in mind. It may still be interesting though.
    Thanks, looks interesting. I noticed on their GitHub:
    The GABAC development is continued within the Genie project (https://github.com/mitogen/genie).
That looks like a more complete compressor. I will have to check them both in detail in the near future.

    Quote Originally Posted by JamesB View Post
    PS. Yes, I still need to investigate the fqzcomp -s9 issues.
    Thanks. It's really great that you continue to maintain fqzcomp. So many interesting compressors that I was hoping to test are either unavailable or unusable due to bugs.

  28. #26
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    236
    Thanks
    29
    Thanked 23 Times in 21 Posts
In gordonia phage.fna, the difference between cmix and bigm is only 19 bytes, whereas cmix uses ~24 GB of memory and bigm uses only 1.3 GB...

  29. #27
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

