
Thread: SLZ - stateless zip - fast zlib-compatible compressor

  #31 - Member willy (Paris)
    Quote Originally Posted by dnd View Post
    SLZ - included in TurboBench (only github)
    The pages (length + content) are concatenated into a single html file, but compressed/decompressed separately.
    compress: page1,page2,...pageN
    decompress: page1,page2,...pageN
    This avoids the cache effects seen in other benchmarks, where small files are processed repeatedly in the L1/L2 cache, giving unrealistic results.
    I'm doing the same when compressing HTML: I prefer to have one huge HTML concatenation of all the files (see the sketch below).
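    Concretely, the scheme looks like this. A minimal sketch only: compress_one()/decompress_one() are hypothetical per-codec wrappers, not TurboBench functions.
    Code:
        #include <stddef.h>

        /* Hypothetical per-codec wrappers. */
        size_t compress_one(unsigned char *dst, const unsigned char *src, size_t len);
        size_t decompress_one(unsigned char *dst, const unsigned char *src, size_t len);

        /* One concatenated corpus, but every page is compressed and then
           decompressed independently, so small inputs don't sit hot in the
           L1/L2 cache across repetitions. */
        void bench_pages(const unsigned char *corpus,
                         const size_t *off, const size_t *len, size_t npages,
                         unsigned char **cbuf, size_t *clen, unsigned char **dbuf)
        {
            for (size_t i = 0; i < npages; i++)    /* compress: page1..pageN */
                clen[i] = compress_one(cbuf[i], corpus + off[i], len[i]);

            for (size_t i = 0; i < npages; i++)    /* decompress: page1..pageN */
                decompress_one(dbuf[i], cbuf[i], clen[i]);
        }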
    Quote Originally Posted by dnd View Post
    Hardware: ODROID C2 - ARM 64 bits - 2 GHz CPU, OS: Ubuntu 16.04, gcc 5.3
    Just for info (so that it helps with your benchmarks' accuracy), the ODROID C2 lies: it claims to run at 2016 MHz while it's capped at 1536. Just rerun the same tests at 1536, 1704 and 2016 MHz and you'll get exactly the same performance. I thought I had found the cheating code in the kernel, but there is another place I have not yet found.

    Quote Originally Posted by dnd View Post
    Compression speed slz vs. zlib:
    - 2.65x faster on Intel i7
    - 1.65x faster on ARM

    - "brotli 1" can also be an alternative (but memory usage is high)
    Thanks for the results. In fact brotli's memory usage is a showstopper for me here.

  #32 - Member SolidComp (USA)
    Quote Originally Posted by willy View Post
    I also forgot to mention an important point: bandwidth at hosting companies is cheap nowadays, provided you get it with a cheap machine. For $3/month you get 200 Mbps on a quad-core ARM server, for example: https://www.scaleway.com/pricing/. Such machines are not that powerful at compressing content. If saving 100 Mbps requires tripling the number of machines, you'd rather let your bandwidth max out and order a second machine. After all, that's only $15/month per Gbps. It's cheaper than the 500 Mbps you get with a much larger machine that would allow you to push the same amount of traffic via compression. So in certain scenarios, *not* compressing can cut costs because of the CPU-induced costs.
    By the way, have you used Scaleway? How was it? Their website doesn't provide any details about their hardware, even for the "bare metal" options, which is strange. Like the $3 plan with four dedicated ARM cores – what are these cores? Who makes them? There are very few ARM server chips out there, and I'd want to know what sort of chips they are.

    It would also be important to know whether they have AES CPU instructions, and other crypto-boosting instructions, or else a web server using TLS might have a hard time. Same with CRC32 and compression-boosting instructions.

  #33 - Member SolidComp (USA)
    Quote Originally Posted by dnd View Post
    SLZ - included in TurboBench (only github)

    CPU: Sandy Bridge i7-2600K at 4.2 GHz, all with gcc 5.3, Ubuntu 16.04
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name        C Mem Peak    D Mem Peak    (bold = pareto) MB=1.000.000
        16461059    16.5       0.79     449.97   brotli 11     10616320     241016
        20163816    20.2      92.93     559.73   brotli 4      35314344     193272
        20363869    20.4      40.28     435.27   zlib 9          274064      14320
        20485533    20.5      63.34     433.08   zlib 6          274064      14320
        23143931    23.1     329.10     524.32   brotli 1       1193296   16810696
        23723266    23.7     139.94     398.91   zlib 1          274064      14320  
        28214601    28.2     371.16     433.05   slz 6                0      14320
        28214601    28.2     371.23     433.02   slz 9                0      14320
        29476316    29.5     371.19     424.83   slz 1                0      14320
    Skylake i7-6700 - 3.7 GHz
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name            
        16461059    16.5       0.76     407.68   brotli 11       
        20163816    20.2      89.92     512.58   brotli 4        
        20363869    20.4      34.31     370.89   zlib 9         
        20485533    20.5      55.47     369.05   zlib 6          
        23143931    23.1     269.01     487.64   brotli 1        
        23723266    23.7     125.46     340.55   zlib 1          
        28214601    28.2     332.37     371.57   slz 9           
        28214601    28.2     332.43     371.41   slz 6           
        28425348    28.4     611.31    1691.52   lzturbo 20      
        29476316    29.5     336.27     364.31   slz 1
    This is extremely good. I'm stunned by those compression ratios, all below 30%! I guess the Silesia corpus Willy reported gives much worse compression than realistic HTML content.

    Brotli 4 looks really nice on the compression side. For memory peaks, are you reporting bytes or MB? Where is Zstd?

  #34 - Member willy (Paris)
    Quote Originally Posted by SolidComp View Post
    By the way, have you used Scaleway? How was it? Their website doesn't provide any details about their hardware, even for the "bare metal" options, which is strange. Like the $3 plan with four dedicated ARM cores – what are these cores? Who makes them? There are very few ARM server chips out there, and I'd want to know what sort of chips they are.
    Yes I tried them. In fact I happen to know some of the people who make it and was lucky enough to try their machines before they were on sale. The CPUs are quad-core Marvell Armada XP CPUs at 1.33 GHz. I have such CPUs on a few development boards, they're really great. They are very similar to Cortex A9 at the same frequency (with a few extra instructions like IDIV) but lack the NEON engine. They have a huge memory bandwidth, with a 64-bit DDR3 1600 bus, and quite good I/O (4 gigabit NICs, PCIe 2.1, a few SATA ports), that's why they make excellent servers.

    It would also be important to know whether they have AES CPU instructions, and other crypto-boosting instructions, or else a web server using TLS might have a hard time. Same with CRC32 and compression-boosting instructions.
    There are indeed two crypto engines in the CPU (marvell-CESA). They support the standard stuff (MD5, SHA-1, DES, AES). But like all crypto engines, you take a performance hit by talking to the engine, so it's only worth it for large blocks. The CPU is quite good even in pure software, though. If you look at the numbers provided for the Linksys WRT1900AC (which uses the same CPU, but at 1.2 GHz only and with a 32-bit memory bus), you'll find quite good results: https://wiki.openwrt.org/doc/howto/benchmark.openssl

    As you can see it's the fastest non-x86 performer per CPU core with about 40 MB/s/core (it equals the Celeron N3150 at 1.6 GHz).

  #35 - Member (Switzerland)
    There is also a brotli quality 0 nowadays, which should be something like 25% faster than brotli 1. There is no practical reason for the high memory allocations in high-speed brotli compression; we will investigate and fix them soon.
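    For reference, quality 0 is reachable through brotli's one-shot C API; a minimal sketch, with buffer management left to the caller:
    Code:
        #include <brotli/encode.h>
        #include <stdint.h>

        /* Compress `in` at quality 0 (the fastest setting). Returns the
           compressed size, or 0 on failure. */
        size_t brotli_q0(uint8_t *out, size_t out_cap,
                         const uint8_t *in, size_t in_len)
        {
            size_t out_len = out_cap;

            if (!BrotliEncoderCompress(0,                     /* quality 0  */
                                       BROTLI_DEFAULT_WINDOW, /* lgwin = 22 */
                                       BROTLI_MODE_TEXT,      /* HTML hint  */
                                       in_len, in, &out_len, out))
                return 0;
            return out_len;
        }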


  #36 - Member dnd (Worldwide)

    TurboBench - brotli 0 added to i7-2600K & ARM benchmark

    - We have a surprise: brotli 0 is now on the pareto frontier (see benchmark on first page). Exactly 25% faster than brotli 1.

    Quote Originally Posted by willy View Post
    Just for info (so that it helps with your benchmarks' accuracy), the ODROID C2 lies: it claims to run at 2016 MHz while it's capped at 1536.
    Thanks, I've also found some info on the net.

    Quote Originally Posted by SolidComp View Post
    For memory peaks, are you reporting bytes or MB? Where is Zstd?
    Bytes; this is the peak memory allocated (some regions may not actually be used).
    This test is limited to codecs with a content encoding of "br" or "gzip", so other compressors are not tested.

    Note that slz uses ~64k of stack and 132k of static memory and is therefore not thread-safe.
    This should not be an issue in a multi-process environment.

    brotli also uses an additional read-only 120k dictionary.

  #37 - Member SolidComp (USA)
    Quote Originally Posted by dnd View Post

    This test is limited to codecs with a content encoding of "br" or "gzip", so other compressors are not tested.
    Does LZTurbo create gzip files? I had no idea, but I'm not very familiar with it. If so, it would be the fastest of all of them.

  #38 - Member jibz (Denmark)
    @dnd, of the libraries included in turbobench, libdeflate is also able to produce gz output. Did you try enabling that in this test?

    It would also be interesting to see how the special level 1 of the Intel zlib library (also included in zlib-ng) would compare. It sacrifices some ratio to get extra speed at level 1.

  #39 - Member dnd (Worldwide)
    Quote Originally Posted by SolidComp View Post
    Does LZTurbo create gzip files? I had no idea, but I'm not very familiar with it. If so, it would be the fastest of all of them.
    LzTurbo is not "gz"-compatible and is included only as an indication.
    Quote Originally Posted by jibz View Post
    @dnd, of the libraries included in turbobench, libdeflate is also able to produce gz output. Did you try enabling that in this test?

    It would also be interesting to see how the special level 1 of the Intel zlib library (also included in zlib-ng) would compare. It sacrifices some ratio to get extra speed at level 1.
    libdeflate 1 (zlib) and zlib-ng 1 and 6 are now included.
    The other levels show no significant improvement over zlib.
    For "zlib-ng" the optimized libz.a compiled with the default (SSE4_2) makefile is used.

  #40 - Member willy (Paris)
    Quote Originally Posted by dnd View Post
    Note that slz uses ~64k of stack and 132k of static memory and is therefore not thread-safe.
    This should not be an issue in a multi-process environment.
    In fact it is thread-safe, because the static memory is only used by pre-computed tables and is only written once. Initially I preferred not to use a constructor for the table initialization, not knowing whether it was portable enough, but I think that was a mistake and I should use one, as most platforms support them. And for the rare unsupported platforms we may have an #ifdef to initialize them manually.
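    A minimal sketch of that approach, assuming a hypothetical __slz_make_tables() routine that fills the pre-computed tables (the real initializer may be named differently):
    Code:
        /* Fill crc32_fast, fh_dist_table, ... once at program load where
           constructors are supported, otherwise fall back to a manual call. */
        void __slz_make_tables(void);   /* hypothetical table generator */

        #if defined(__GNUC__) || defined(__clang__)
        __attribute__((constructor))
        static void slz_prepare(void)
        {
            __slz_make_tables();
        }
        #else
        /* No constructor support: the application must call this once
           before the first slz_*_init(). */
        void slz_prepare(void)
        {
            __slz_make_tables();
        }
        #endif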

  #41 - Member dnd (Worldwide)
    It is better to write the generated arrays into a text file first and copy/paste that directly into the code (a generator sketch follows below).
    Otherwise the static data will be duplicated (copy-on-write) in each fork in a multi-process environment.
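    For instance, a one-off generator along these lines prints a const array that can be pasted into the source, so the table lands in the read-only section and stays shared across fork(). A plain CRC-32 table is shown for illustration; slz's actual crc32_fast may be laid out differently.
    Code:
        #include <stdio.h>
        #include <stdint.h>

        /* Emit a standard CRC-32 lookup table as a const C array. */
        int main(void)
        {
            uint32_t tab[256];

            for (uint32_t i = 0; i < 256; i++) {
                uint32_t c = i;
                for (int k = 0; k < 8; k++)
                    c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
                tab[i] = c;
            }
            printf("static const uint32_t crc32_fast[256] = {\n");
            for (int i = 0; i < 256; i++)
                printf("%s0x%08X%s", (i % 6) ? " " : "    ", (unsigned)tab[i],
                       (i == 255) ? "\n};\n" : (i % 6 == 5) ? ",\n" : ",");
            return 0;
        }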

    In conclusion, the benchmark shows that slz is very fast, without requiring a new content-encoding.
    And as I've shown in another thread, the few missing percentage points of compression ratio relative to other libs or other levels have no negative impact on users. Maybe you can report something about your experience on the user side.
    To estimate the memory usage it is also interesting to know the max. number of connections.

  #42 - Member willy (Paris)
    Quote Originally Posted by dnd View Post
    It is better to write the generated arrays into a text file first and copy/paste that directly into the code.
    Otherwise the static data will be duplicated (copy-on-write) in each fork in a multi-process environment.
    Yes, but in practice on modern operating systems it's extremely cheap, because the whole process's memory space is marked copy-on-write on fork() and pages only start to be duplicated when the first write occurs. I'd say the only real saving we can expect from putting the tables in the code is the few milliseconds it takes to generate them on the first call. During the development phase I preferred not to put them in the code for maintenance reasons, but I agree that now might be the right moment.

    In conclusion, the benchmark shows that slz is very fast, without requiring a new content-encoding.
    And as I've shown in another thread, the few missing percentage points of compression ratio relative to other libs or other levels have no negative impact on users.
    This is very true, so much so that I'm even thinking about some improvements: shrinking the lookup window to limit the number of bits required to encode the distance, and reducing the cache footprint to improve the ability to run mostly in L1. I still have to reinitialize the references upon each call, which is bad. If I don't, I have to re-check the 4 cached bytes in the match() call, and overall I lose more performance than I save by skipping the reinitialization. I may work around this with a per-thread allocator of reference pools pre-initialized with a call counter mixed with the distance, ensuring that we cannot match an old sequence, but it's possible that dealing with this will again cost me what I saved by this lighter maintenance. I should probably also take a second look at LZ4's matching mechanism, which may be more efficient or may have already solved this.

    Maybe you can report something about your experience on the user side.
    By "user" I guess you mean the user of the lib, hence the developer here. I'm probably a bit biased for having created it and integrated it. Well, that's quite simple, I initialize a new compression stream by calling slz_rfc195{0,1,2}_init(&strm, level). Then I call slz_encode(strm, dst, src, len) which is basically used like memcpy(). And after the last call I call slz_finish(strm) to flush the last pending bits. It is possible to change the compression level on the fly between calls (0 or 1), which is convenient to pass some data compressed or uncompressed to save CPU. Its stateless nature means that you must not call it with a few bytes at a time since it compresses a whole buffer at once. So if your application only works on small chunks of data, you may have to implement your own buffer (but then maybe zlib is more suited since the primary purpose of SLZ is *not* to keep data in memory).

    To estimate the memory usage it is also interesting to know the max. number of connections.
    It's about 28 bytes per connection, everything included; that's about 8000 times less than zlib, and 30 times less than what a small proxy needs to keep per connection, not even counting socket buffers. That was the initial point: no longer having to worry about the connection count. Here, if I want to process 1 million concurrent connections, the socket buffers will eat at least 8 GB and the proxy about 1 extra GB, while SLZ will eat 28 MB.

  #43 - Member dnd (Worldwide)
    Quote Originally Posted by willy View Post
    Yes, but in practice on modern operating systems it's extremely cheap, because the whole process's memory space is marked copy-on-write on fork()
    But the static arrays (crc32_fast and fh_dist_table) are actually not read-only and are modified by each forked process in the constructor, so they will be copied to new pages, resulting in more memory usage.

    Quote Originally Posted by willy View Post
    By "user" I guess you mean the user of the lib
    I mean the surfer, especially on smartphones.
    Compressing web sites is generally better than no compression. But is the difference in compression ratio between slz, zlib, ... perceptible or not?

    Quote Originally Posted by willy View Post
    It's about 28 bytes per connection, everything included...
    What about the 64k of stack memory + 132k (fh_dist_table, ...) of static memory in slz?

  #44 - Member willy (Paris)
    Quote Originally Posted by dnd View Post
    But the static arrays (crc32_fast and fh_dist_table) are actually not read-only and are modified by each forked process in the constructor, so they will be copied to new pages, resulting in more memory usage.
    They're initialized only once, so it all depends whether the fork() is performed before or after the initialization. In most programs it will be done before, either in a constructor or at the end of config parsing during the initialization phase, so that will not be a problem. Those who initialize them after the fork will definitely experience what you describe. But I agree it's probably not a big deal by now to move them to the text section as a const array and get rid of the initialization call as well. I'll compare.

    I mean the surfer, especially on smartphones.
    Compressing web sites is generally better than no compression. But is the difference in compression ratio between slz, zlib, ... perceptible or not?
    I can't tell, to be honest. Mainly because I think nobody uses zlib inline beyond level 1 for performance reasons, and the compression ratios are not much different at that level (zlib divides the volume by 3 where slz divides it by 2-2.5). For a user, a web page is composed of lots of objects (typically 100-130 nowadays!) fetched from many locations, so in the end compression reduces the load time for the text parts, but since this retrieval is mixed with other downloads, it's not that easy to tell.

    What about the 64k of stack memory + 132k (fh_dist_table, ...) of static memory in slz?
    The fh_dist_table is used only once. The 64k stack is per-thread/process. On top of that you add 28 bytes per stream. Given that zlib uses 256 kB per stream, you win from the very first stream. And I'm interested in scenarios involving 10k to 500k concurrent connections per thread/process, so here the difference is huge: it's literally 15 MB versus 128 GB per thread/process at 500k.

    Also, I intend to experiment with smaller windows (hence shorter distances). On HTML, matches with distances above 16kB represent about 3%, and about 9% above 8kB. This means it's possible to shrink the lookup tables by 2 or 4 at the expense of losing such matches. I noticed that the compression ratio didn't shrink much, probably because these far matches are just pure luck and are short (40% of all matches are 4 bytes long). And these long distances require 1 or 2 extra bits respectively, so the compression ratio shrinks as we encode them compared to the shorter ones. Typically, to encode a 4-byte match at more than 16kB distance you consume 18+7 = 25 bits (a 5-bit fixed-Huffman distance code plus 13 extra distance bits, plus a 7-bit length code); spending 25 bits to save 32 is not a huge saving. Also, in proxies buffers tend to be small (8-16kB), in which case there is no point keeping the larger distance encodings. However, since they're never used, at least they don't pollute the caches.


  #45 - Member SolidComp (USA)
    Quote Originally Posted by dnd View Post
    LzTurbo is not "gz"-compatible and is included only as an indication.

    libdeflate 1 (zlib) and zlib-ng 1 and 6 are now included.
    The other levels show no significant improvement over zlib.
    For "zlib-ng" the optimized libz.a compiled with the default (SSE4_2) makefile is used.
    Could you please add libdeflate -6 and -12? It's incredibly fast at -1, and it would be helpful to know the speed and ratio for the higher compression settings.

  #46 - Member dnd (Worldwide)
    Quote Originally Posted by SolidComp View Post
    Could you please add libdeflate -6 and -12? It's incredibly fast at -1, and it would be helpful to know the speed and ratio for the higher compression settings.
    libdeflate -6 and -12 added.
    Taking into account only compression speed, we see that "brotli 4" is more interesting than "libdeflate -6".
    "libdeflate -12" is too slow for dynamic web content compression. "libdeflate 1" is slower than "zlib 1" on ARM.

  #47 - Member (Cambridge, UK)
    Quote Originally Posted by dnd View Post
    libdeflate -6 and -12 added.
    Taking into account only compression speed, we see that "brotli 4" is more interesting than "libdeflate -6".
    "libdeflate -12" is too slow for dynamic web content compression. "libdeflate 1" is slower than "zlib 1" on ARM.
    If you are forced to use deflate (e.g. because it's defined in an existing protocol), then a highly optimised implementation is obviously better than brotli, zstd, etc. If you have control over both the encoder and the decoder, then the choices will inevitably be different.

  #48 - Member dnd (Worldwide)
    TurboBench - Dynamic Web Content Benchmark

    Updated with zopfli for comparison with brotli 11.
    Note that the deflate/gzip compression ratio can also be improved using SDCH.
    It is interesting to see the difference from brotli.

  #49 - Member (Switzerland)
    Quote Originally Posted by dnd View Post
    ... can also be improved using SDCH.
    Some more traditional compressors, at least zlib, zstd and brotli, support shared dictionaries too. I would be very interested if someone made a comparison of SDCH dictionaries with vcdiff (the algorithm specified in SDCH) vs. zlib, zstd, brotli (and others) using the same shared dictionaries.

    Shared dictionaries are available from web sites that use SDCH -- I believe that at least LinkedIn, Amazon and Google are using SDCH for some things (I didn't check myself, just rumors).
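    On the zlib side, such a comparison would use the preset-dictionary API (deflateSetDictionary/inflateSetDictionary). A minimal compression-side sketch, with the shared dictionary itself left as an assumption that both ends agree on out of band:
    Code:
        #include <string.h>
        #include <zlib.h>

        /* One-shot deflate of `in` using a shared preset dictionary.
           On success returns 0 and stores the output size in *outlen. */
        int deflate_with_dict(unsigned char *out, unsigned long *outlen,
                              const unsigned char *in, unsigned long inlen,
                              const unsigned char *dict, unsigned int dictlen)
        {
            z_stream strm;

            memset(&strm, 0, sizeof(strm));
            if (deflateInit(&strm, Z_BEST_SPEED) != Z_OK)
                return -1;
            deflateSetDictionary(&strm, dict, dictlen); /* preload the window */

            strm.next_in   = (Bytef *)in;
            strm.avail_in  = (uInt)inlen;
            strm.next_out  = out;
            strm.avail_out = (uInt)*outlen;
            if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
                deflateEnd(&strm);
                return -1;
            }
            *outlen = strm.total_out;
            return deflateEnd(&strm) == Z_OK ? 0 : -1;
        }
    The decompressor must call inflateSetDictionary() with the same bytes when inflate() returns Z_NEED_DICT.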


