Thread: Sequence Compression Benchmark

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    346
    Thanks
    129
    Thanked 53 Times in 37 Posts

    Sequence Compression Benchmark

    Hi all – @Kirr made an incredibly powerful compression benchmark website called the Sequence Compression Benchmark.

    It lets you select a bunch of options and run it yourself, with outputs including graphs, column charts, and tables. It can run every single level of every compressor.

    The only limitation I see at this point is the lack of text datasets – it's mostly genetic data.

    @Kirr, four things:

    1. Broaden it to include text? Would that require a name change or ruin your vision for it? It would be great to see web-based text, like the HTML, CSS, and JS files of the 100 most popular websites for example.
    2. The gzipper you currently use is the GNU gzip utility program that comes with most Linux distributions. If you add some text datasets, especially web-derived ones, the zlib gzipper will make more sense than the GNU utility. That's the gzipper used by virtually all web servers.
    3. In my limited testing the 7-Zip gzipper is crazy good, so good that it approaches Zstd and brotli levels. It's long been known to be better than GNU gzip and zlib, but I didn't know it approached Zstd and brotli. It comes with the 7-Zip Windows utility released by Igor Pavlov. You might want to include it.
    4. libdeflate is worth a look. It's another gzipper. The overarching message here is that gzip ≠ gzip. There are many implementations, and the GNU gzip utility is likely among the worst.
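To make the gzip ≠ gzip point concrete: the file format is fixed, but encoders are free to search for matches however they like, so the same input can come out as different (all valid) gzip streams. A rough sketch using only Python's stdlib, which can only vary the settings of one encoder rather than swap implementations, but the principle is the same:

```python
import gzip

data = b"the quick brown fox jumps over the lazy dog\n" * 1000

# Two different encoder settings, two different byte streams...
fast = gzip.compress(data, compresslevel=1)
small = gzip.compress(data, compresslevel=9)

# ...but both are valid gzip, and any conforming decoder recovers
# the identical input from either stream.
assert gzip.decompress(fast) == data
assert gzip.decompress(small) == data
assert len(small) <= len(fast)
```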

  2. Thanks:

    Kirr (26th May 2020)

  3. #2
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    26
    Thanks
    4
    Thanked 8 Times in 4 Posts
Thanks for the kind words, SolidComp.

    Quote Originally Posted by SolidComp View Post
    Broaden it to include text? Would that require a name change or ruin your vision for it? It would be great to see web-based text, like the HTML, CSS, and JS files of the 100 most popular websites for example.
    I work with tons of biological data, which motivated me to first make a compressor for such data, and then this benchmark. I'll probably add FASTQ data in the future, if time allows. As for text, HTML, CSS and other data, I have no immediate plans for it. There are three main obstacles: 1. Computation capacity. 2. Selecting relevant data. 3. My time needed to work on it. Possibly it will require cooperating with other compression enthusiasts. I'll need to think about it.

    Quote Originally Posted by SolidComp View Post
    The gzipper you currently use is the GNU gzip utility program that comes with most Linux distributions. If you add some text datasets, especially web-derived ones, the zlib gzipper will make more sense than the GNU utility. That's the gzipper used by virtually all web servers.
    I'm under the impression that "zlib" is a compression library, and "gzip" is a command line interface to this same library. Since I benchmark command line compression tools, it's the "gzip" that is included, rather than "zlib". However please let me know if there is some alternative command line "zlib gzipper" that I am missing.

    Quote Originally Posted by SolidComp View Post
    In my limited testing the 7-Zip gzipper is crazy good, so good that it approaches Zstd and brotli levels. It's long been known to be better than GNU gzip and zlib, but I didn't know it approached Zstd and brotli. It comes with the 7-Zip Windows utility released by Igor Pavlov. You might want to include it.
Igor Pavlov's excellent LZMA algorithm (which powers 7-Zip) is represented by the "xz" compressor in the benchmark. Igor's unfortunate focus on Windows releases allowed "xz" to become the standard LZMA implementation on Linux (as far as I understand).

    Quote Originally Posted by SolidComp View Post
    libdeflate is worth a look. It's another gzipper. The overarching message here is that gzip ≠ gzip. There are many implementations, and the GNU gzip utility is likely among the worst.
    You mean this one - https://github.com/ebiggers/libdeflate ? Looks interesting, I'll take a look at it. I noticed this bit in the GitHub readme: "libdeflate itself is a library, but the following command-line programs which use this library are also provided: gzip (or gunzip), a program which mostly behaves like the standard equivalent, except that it does not yet have good streaming support and therefore does not yet support very large files" - Not supporting very large files sounds alarming. Especially without specifying what exactly they mean by "very large".

    Regarding gzip, don't get me started! Every single biological database shares data in gzipped form, wasting huge disk space and bandwidth. There is a metric ton of research on biological sequence compression, in addition to excellent general-purpose compressors. Yet the field remains stuck with gzip. I want to show that there are good alternatives to gzip, and that there are large benefits in switching. Whether this will have any effect remains to be seen. At least I migrated all my own data to a better format (saving space and increasing access speed).

  4. #3
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    346
    Thanks
    129
    Thanked 53 Times in 37 Posts
    "gzip" as such isn't a command line interface to the zlib library. It's just a format, one of three that zlib supports (the other two are raw DEFLATE and a "zlib" format, also DEFLATE-based). GNU gzip is just a specific app that produces gzip files (and maybe others?).

    I think zlib has a program that you can easily build. It might be called minizip. Someone please correct me if I'm wrong.

    The 7-Zip gzipper is unrelated to the .7z or LZMA formats. I'm speaking of 7-Zip the app. It can produce .7z, .xz, gzip (.gz), .zip, .bz2, and perhaps more compression formats. Pavlov wrote his own gzipper from scratch, apparently, and it's massively better than any other gzipper, like GNU gzip or libdeflate. I assume it's better than zlib's gzipper as well. I don't understand how he did it. So if you want to compare the state of the art to gzip, it would probably make sense to use the best gzipper. His gzip files are 17% smaller than libdeflate's on text...

  5. #4
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    26
    Thanks
    4
    Thanked 8 Times in 4 Posts
    Quote Originally Posted by SolidComp View Post
    "gzip" as such isn't a command line interface to the zlib library. It's just a format, one of three that zlib supports (the other two are raw DEFLATE and a "zlib" format, also DEFLATE-based). GNU gzip is just a specific app that produces gzip files (and maybe others?).
Yeah, it's more accurate to say that the two are a tool and a library, both implementing the DEFLATE algorithm. In my benchmark, by "gzip" I refer to the software tool, not to the "gzip" file format.

    Quote Originally Posted by SolidComp View Post
    I think zlib has a program that you can easily build. It might be called minizip. Someone please correct me if I'm wrong.
    zlib has "zpipe.c" in "examples" directory. This may be what you mean. I guess there is no point testing it, but perhaps I should benchmark it to confirm this.

    Quote Originally Posted by SolidComp View Post
    The 7-Zip gzipper is unrelated to the .7z or LZMA formats. I'm speaking of 7-Zip the app. It can produce .7z, .xz, gzip (.gz), .zip, .bz2, and perhaps more compression formats. Pavlov wrote his own gzipper from scratch, apparently, and it's massively better than any other gzipper, like GNU gzip or libdeflate. I assume it's better than zlib's gzipper as well. I don't understand how he did it. So if you want to compare the state of the art to gzip, it would probably make sense to use the best gzipper. His gzip files are 17% smaller than libdeflate's on text...
It seems 7-Zip is still Windows-exclusive. However, there is a more portable "p7zip" - I will think about adding it to the benchmark.

  6. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    503
    Thanks
    181
    Thanked 177 Times in 120 Posts
I think libdeflate is the fastest tool out there right now, unless you limit yourself to light-weight "level 1" style compression, in which case libslz may win out.

We integrated libdeflate support into Samtools, for (de)compression of sequencing alignment data in the BAM format. I suspect this is the cause of libdeflate becoming an official Ubuntu package, as Samtools/htslib have it as a dependency. I recently retested several deflate implementations on enwik8.

    Code:
    Tool         Encode    Decode     Size
    ------------------------------------------
    vanilla      0m5.003s  0m0.517s   36548933
    intel        0m3.057s  0m0.503s   36951028
    cloudflare   0m2.492s  0m0.443s   36511793
    jtkukunas    0m2.956s  0m0.357s   36950998
    ng           0m2.022s  0m0.377s   36881293
    zstd (gz-6)  0m4.674s  0m0.468s   36548933
    libdeflate   0m1.769s  0m0.229s   36648336
Note that the file sizes fluctuate a bit. The variation is within the difference between gzip -5 and -6, so arguably you'd count it toward the time difference too.


    I also tried them at level 1 compression:

    Code:
    Tool         Encode    Decode     Size
    ------------------------------------------
    vanilla      0m1.851s  0m0.546s   42298786
    intel        0m0.866s  0m0.524s   56046821
    cloudflare   0m1.163s  0m0.470s   40867185
    jtkukunas    0m1.329s  0m0.392s   40867185
    ng           0m0.913s  0m0.397s   56045984
    zstd (gz)    0m1.764s  0m0.475s   42298786
    libdeflate   0m1.024s  0m0.235s   39597396
Level 1 is curious: you can see clearly how the different versions have traded off encoder speed against size efficiency, with cloudflare and jtkukunas apparently using the same algorithm, and intel/ng likewise. Libdeflate is no longer the fastest here, but it's not far off and produces the smallest output, so it's in a sweet spot.

    And for fun, level 9:

    Code:
    Tool         Encode    Decode     Size
    ------------------------------------------
    vanilla      0m6.113s  0m0.516s   36475804
    intel        0m5.153s  0m0.516s   36475794
    cloudflare   0m2.787s  0m0.442s   36470203
    jtkukunas    0m5.034s  0m0.365s   36475794
    ng           0m2.872s  0m0.371s   36470203
    zstd (gz)    0m5.702s  0m0.467s   36475804
    libdeflate   0m9.124s  0m0.237s   35197159
    All remarkably similar sizes, bar libdeflate which took longer but squashed it considerably more. Libdeflate actually goes up to -12, but it's not a good tradeoff on this file:

    Code:
    libdeflate   0m14.660s 0m0.236s   35100586
    Edit: I tested 7z max too, but it was comparable to libdeflate max and much slower.

  7. Thanks (3):

    Bulat Ziganshin (10th June 2020),Kirr (28th May 2020),Mike (28th May 2020)

  8. #6
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    26
    Thanks
    4
    Thanked 8 Times in 4 Posts
Thanks James, looks like I'll have to add libdeflate soon. I'm still worried about its gzip replacement not supporting large files. I guess I'll see what they mean.

  9. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    503
    Thanks
    181
    Thanked 177 Times in 120 Posts
It's good for block-based formats, but the lack of streaming may be an issue for a general-purpose zlib replacement. However, even for a streaming gzip you could artificially chunk the input into relatively large blocks. It's not ideal, but the better speed/ratio tradeoff may still make it a win for most data types.
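The chunking trick works because the gzip format allows multiple members back to back: concatenated gzip streams are themselves a valid gzip stream. A minimal sketch (the function name is hypothetical):

```python
import gzip

def gzip_in_blocks(data, block_size=64 * 1024):
    # Compress each block as an independent gzip member. Members can be
    # decompressed independently, and their concatenation is still a
    # valid gzip stream that standard decoders accept.
    return b"".join(
        gzip.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    )

payload = bytes(range(256)) * 1024  # 256 KiB of sample data
blob = gzip_in_blocks(payload)
assert gzip.decompress(blob) == payload  # multi-member decode
```

The cost is a per-member header/trailer and the loss of matches across block boundaries, which is why the blocks should be relatively large.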

    We use it in bgzf (wrapper for BAM and BCF formats), which has pathetically small block sizes, as a replacement for zlib.

  10. #8
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    346
    Thanks
    129
    Thanked 53 Times in 37 Posts
    James, what about SLZ? That's the fastest gzip implementation on earth as far as I know. It doesn't compress as well though. But it's incredibly fast, and super light in terms of CPU and memory consumption.

  11. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    503
    Thanks
    181
    Thanked 177 Times in 120 Posts
    Quote Originally Posted by SolidComp View Post
    James, what about SLZ? That's the fastest gzip implementation on earth as far as I know. It doesn't compress as well though. But it's incredibly fast, and super light in terms of CPU and memory consumption.

I just tried the latest version. It's as I recalled: very fast, but a very light level of compression. It also has no decompression support at all, so obviously it can't compete on that front.

    On the above file with the same machine, level 1 only. Redoing these encode timings as I realise I had I/O time in there which doesn't help the ultra fast ones:

    Code:
    Tool         Encode    Decode     Size
    ------------------------------------------
    vanilla      0m1.657s  0m0.546s   42298786
    intel        0m0.569s  0m0.524s   56046821
    cloudflare   0m0.974s  0m0.470s   40867185
    jtkukunas    0m1.106s  0m0.392s   40867185
    ng           0m0.695s  0m0.397s   56045984
    zstd (gz)    0m1.582s  0m0.475s   42298786
    libdeflate   0m0.838s  0m0.235s   39597396
    libslz       0m0.482s  N/A        55665941
    So yes it's the fastest. libslz has levels up to -9, but they're barely any different in size/speed.

  12. #10
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    26
    Thanks
    4
    Thanked 8 Times in 4 Posts
    I guess this discussion is about deflate-based libraries. Since I benchmark command line tools rather than bare libraries, each lib would need a command line interface before it can be tested. I already added p7zip and libdeflate's streaming gzip replacement to my to-do benchmark queue. If you have more suggestions in mind, please post here with links. Thanks!

  13. #11
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    346
    Thanks
    129
    Thanked 53 Times in 37 Posts
    @Kirr, you might be interested in this report on Intel's QuickAssist with genomic data. QuickAssist is a hardware accelerator from Intel that comes in a couple of forms. One is a PCIe card, and the other is built into some Xeon D server chips. QuickAssist accelerates compression and encryption workloads beyond what is possible with conventional CPU-based code.

    https://01.org/sites/default/files/d...uickassist.pdf

  14. #12
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    503
    Thanks
    181
    Thanked 177 Times in 120 Posts
    Interesting link, although it's a bit of a sales pitch.

    "Reduce storage system cost: In the face of a 10 Tb genetic data growth on a daily basis, BGI needs to increase data compression ratio, improve storage density and reduce data storage costs."


Fair enough, sounds good. They then compare it to gzip -1, where their system is much faster but slightly larger. How is being slightly larger than gzip -1 "increas[ing] data compression ratio"? Being 3 times faster than gzip -1 puts it at zstd or brotli level-1 speeds. They're using 32 cores too, so a threaded tool clearly helps here (pigz vs zstd, for example).
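For reference, the pigz approach can be sketched in a few lines: split the input into blocks, compress each on a worker thread (zlib releases the GIL while compressing, so threads genuinely run in parallel), and concatenate the gzip members in order. Unlike real pigz, this naive sketch doesn't prime each block with the previous 32 KB of history, so it loses cross-block matches:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data, block_size=128 * 1024, workers=4):
    # pigz-style sketch: compress fixed-size blocks on worker threads,
    # then join the resulting gzip members in their original order.
    # The concatenation is itself a valid multi-member gzip stream.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = pool.map(gzip.compress, blocks)
    return b"".join(members)

sample = b"0123456789abcdef" * 65536  # 1 MiB of sample data
compressed = parallel_gzip(sample)
assert gzip.decompress(compressed) == sample
```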

    Edit, 1GB of uncompressed BAM input:

    Code:
    $ time gzip -1 < /tmp/_1g|wc -c
    427760970
    real    0m15.498s
    user    0m15.331s
    sys    0m0.631s
    
    $ time ~/ftp/compression/libdeflate/gzip -1 < /tmp/_1g|wc -c
    416714103
    real    0m8.711s
    user    0m8.418s
    sys    0m0.372s
    
    $ time bro --quality 1 < /tmp/_1g|wc -c
    418622612
    real    0m5.753s
    user    0m5.599s
    sys    0m0.411s
    
    $ time zstd -1 < /tmp/_1g|wc -c
    402699667
    real    0m3.953s
    user    0m4.096s
    sys    0m0.392s
    
    $ time zstd --fast=3 < /tmp/_1g|wc -c
    581488734
    real    0m2.428s
    user    0m2.624s
    sys    0m0.472s

