
Thread: SLZ - stateless zip - fast zlib-compatible compressor

  #1 - willy:

    Hello,

    @inikep suggested that I introduce libslz here:

    First, a bit of context. As the author of the HAProxy load balancer, I was quite concerned about zlib's memory usage (>256kB per stream) and CPU usage, because they quickly become a problem on high-traffic web sites where you may have to compress tens of thousands of parallel streams at line rate. In web environments, zlib is always configured for the fastest speed and the compression ratio is not critical, because most of the contents are CSS, JS and HTML files that already compress very well at high speeds, at least well enough to save a lot of bandwidth. I would happily have gone with LZ4 if I had the choice, but browsers only understand gzip and raw deflate. So I wanted to attack the memory usage problem, which was making zlib mostly impractical as the compressor for high-traffic web sites, and I figured that the whole problem came from the fact that it needs to keep a window. I thought that if I was happy enough with the compression ratio obtained on small files, I could get the same compression ratio without keeping a window between calls, while staying compatible with the zlib format. But zlib's flush never releases its buffers. So I started working on an alternative implementation about two years ago.
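
    To put a number on that memory figure, here is a small worked example based on the per-stream memory formula documented in zlib's own zconf.h (just arithmetic, nothing measured here):

    Code:
    #include <stdio.h>

    /* Per-stream deflate memory as documented in zlib's zconf.h:
     * (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes, plus a few
     * kilobytes of state. */
    int main(void)
    {
        int windowBits = 15;   /* zlib default: 32kB window */
        int memLevel   = 8;    /* zlib default */
        long per_stream = (1L << (windowBits + 2)) + (1L << (memLevel + 9));

        printf("deflate per-stream memory: ~%ld kB (+ a few kB of state)\n",
               per_stream / 1024);                          /* ~256 kB */
        printf("10,000 parallel streams:   ~%lld MB\n",
               (long long)per_stream * 10000 / (1024 * 1024)); /* ~2500 MB */
        return 0;
    }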

    The first version was released a year ago and is now supported by HAProxy. While working on this project, I found that there was a lot of room to save CPU cycles. Sure, the zlib format is bit-oriented and expensive, but by using a fast lookup method like LZ4 does, keeping only the last match, using a fast hash, limiting the dictionary size, and using only pre-computed fixed Huffman tables and pre-computed distance encoding tables, it's possible to achieve quite a reasonable compression ratio at a reasonable speed. The typical compression ratio is pretty similar to LZ4's (less than 1% better than LZ4, or about 30% larger than zlib's output), and the compression speed is a bit more than half of LZ4's, or 3-4 times faster than zlib. The memory usage dropped to just a few bytes, which allows millions of streams to be compressed in parallel if needed. Interestingly, I found that it scales very well on multi-core CPUs, as it mostly runs in the L1+L2 caches, so I can compress multiple streams in parallel with almost no bad interactions, which is another nice benefit for a network-oriented product. In the end, a web load balancer can now compress about 1-3 Gbps of HTTP traffic per CPU core and roughly cut text-based contents in half. Typical production numbers on http://demo.haproxy.org/ indicate a 47% compression ratio on compressible files, resulting in 39% overall bandwidth savings across all contents (the site delivers a lot of tar.gz source files, which explains why not all of them are compressible).
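
    To give a feel for the kind of match search described above, here is a minimal illustrative sketch (not SLZ's actual code): a single multiplicative hash, one remembered position per bucket, no chains and no lazy matching; a real encoder would then feed the match to the precomputed fixed-Huffman and distance tables.

    Code:
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define HASH_BITS 13
    #define MIN_MATCH 4
    #define WIN_SIZE  32768               /* deflate's maximum distance */

    static uint32_t hash4(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, 4);                 /* unaligned 4-byte read */
        return (v * 2654435761u) >> (32 - HASH_BITS);
    }

    /* Walk the input; for each position probe the single candidate kept
     * in <tab> (zero-initialized by the caller), extend the match if it
     * validates, otherwise fall through as a literal. */
    static size_t scan(const uint8_t *in, size_t len,
                       uint32_t tab[1u << HASH_BITS])
    {
        size_t pos = 0, matches = 0;

        while (pos + MIN_MATCH <= len) {
            uint32_t h = hash4(in + pos);
            size_t cand = tab[h];
            tab[h] = (uint32_t)pos;       /* keep only the last position */

            if (cand < pos && pos - cand <= WIN_SIZE &&
                memcmp(in + cand, in + pos, MIN_MATCH) == 0) {
                size_t mlen = MIN_MATCH;
                while (pos + mlen < len && in[cand + mlen] == in[pos + mlen])
                    mlen++;
                /* emit_match(mlen, pos - cand) using the fixed tables */
                pos += mlen;
                matches++;
            } else {
                /* emit_literal(in[pos]) */
                pos++;
            }
        }
        return matches;
    }

    int main(void)
    {
        static uint32_t tab[1u << HASH_BITS];   /* zero-initialized */
        const char *txt = "the quick brown fox jumps over the lazy dog "
                          "the quick brown fox jumps over the lazy dog";
        printf("matches found: %zu\n",
               scan((const uint8_t *)txt, strlen(txt), tab));
        return 0;
    }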

    I added support for it to lzbench in order to get an easier comparison. Here's what I got on my laptop (Core i5-3320M @ 3.3 GHz) using Silesia:

    Code:
    $ taskset -c 0 ./lzbench -ezlib,1/slz_deflate/lz4 /dev/shm/silesia.tar
    lzbench 1.2 (64-bit Linux)   Assembled by P.Skibinski
    Compressor name         Compress. Decompress. Compr. size  Ratio Filename
    memcpy                   6355 MB/s  6221 MB/s   211957760 100.00 /dev/shm/silesia.tar
    zlib 1.2.8 -1              66 MB/s   261 MB/s    77260225  36.45 /dev/shm/silesia.tar
    slz_deflate 1.0.0 -1      247 MB/s   335 MB/s    99655491  47.02 /dev/shm/silesia.tar
    slz_deflate 1.0.0 -2      235 MB/s   337 MB/s    96861939  45.70 /dev/shm/silesia.tar
    slz_deflate 1.0.0 -3      228 MB/s   338 MB/s    96187297  45.38 /dev/shm/silesia.tar
    lz4 r131                  434 MB/s  2254 MB/s   100881187  47.59 /dev/shm/silesia.tar
    done... (cIters=1 dIters=1 cTime=1.0 dTime=0.5 chunkSize=1706MB cSpeed=0MB)
    And on my home PC (Core i7-6700K at 4.4 GHz):

    Code:
    $ taskset -c 0 ./lzbench -ezlib,1/slz_deflate/lz4 /dev/shm/silesia.tar
    lzbench 1.3 (64-bit Linux)   Assembled by P.Skibinski
    Compressor name         Compress. Decompress. Compr. size  Ratio Filename
    memcpy                  14170 MB/s 14331 MB/s   211957760 100.00 /dev/shm/silesia.tar
    zlib 1.2.8 -1             104 MB/s   360 MB/s    77260209  36.45 /dev/shm/silesia.tar
    slz_deflate 1.0.0 -1      392 MB/s   435 MB/s    99655475  47.02 /dev/shm/silesia.tar
    slz_deflate 1.0.0 -2      376 MB/s   439 MB/s    96861926  45.70 /dev/shm/silesia.tar
    slz_deflate 1.0.0 -3      367 MB/s   440 MB/s    96187285  45.38 /dev/shm/silesia.tar
    lz4 r131                  679 MB/s  3070 MB/s   100881194  47.59 /dev/shm/silesia.tar
    done... (cIters=1 dIters=1 cTime=1.0 dTime=0.5 chunkSize=1706MB cSpeed=0MB)
    The source code and a detailed explanation are available here: http://1wt.eu/projects/libslz/

    It doesn't implement decompression, since that's already available in zlib. However, some users have asked whether I could implement decompression so that it becomes an easier drop-in replacement for zlib, as they don't want to link against both zlib and slz. I still don't know if I'll do it; I feel it would be quite a lot of work for no benefit beyond easier integration, and it could suffer from vulnerabilities for some time. However, maybe it is possible to write a fast and light zlib decompressor by supporting only modern OSes and platforms, I don't really know. Maybe someone here has some useful advice regarding this.
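
    For what it's worth, since slz emits standard gzip/zlib/deflate streams, stock zlib can already decompress its output; a minimal sketch using inflateInit2() with windowBits = 15 + 32 so the zlib and gzip wrappers are auto-detected (raw deflate would use -15):

    Code:
    #include <string.h>
    #include <zlib.h>

    /* Decompress a whole slz-produced gzip/zlib buffer with stock zlib.
     * Returns 0 on success and stores the decompressed length in *out_len. */
    int inflate_buffer(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t *out_len)
    {
        z_stream zs;
        int ret;

        memset(&zs, 0, sizeof(zs));
        if (inflateInit2(&zs, 15 + 32) != Z_OK)   /* auto-detect zlib/gzip */
            return -1;

        zs.next_in   = (unsigned char *)in;
        zs.avail_in  = (uInt)in_len;
        zs.next_out  = out;
        zs.avail_out = (uInt)*out_len;

        ret = inflate(&zs, Z_FINISH);             /* one-shot: output must fit */
        *out_len = zs.total_out;
        inflateEnd(&zs);
        return ret == Z_STREAM_END ? 0 : -1;
    }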

    Suggestions and feedback welcome!


  #2 - Member (Russia):

    Please compile a version for Windows.


  #3 - Member (Mars):

    +1

  #4 - Shelwien (Administrator):

    http://nishi.dreamhosters.com/u/slz_port_v0.rar

    It's not very Windows-friendly: it uses mmap.h and gcc extensions, but with some patches I was able to make it work with gcc/mingw.
    Also there's no decoder; the included "zdec.c" is actually a deflate stream dumper, so I renamed it and included my own decoder.
    Also, zenc only writes to stdout; see test.bat for examples.


  #5 - willy:

    Quote Originally Posted by Shelwien View Post
    http://nishi.dreamhosters.com/u/slz_port_v0.rar

    It's not very Windows-friendly: it uses mmap.h and gcc extensions, but with some patches I was able to make it work with gcc/mingw.
    Yep, it should only be seen as a lib, so there's no point in providing a binary for any particular OS. Zenc is only an implementation example showing how to use the lib in other projects. The lib itself doesn't need anything OS-specific. @inikep reported that including mmap.h broke the build on Windows, which I fixed (it was a leftover from the initial development code). The only use case I have for zenc is running "zenc -t" for benchmarks. But now we have lzbench, which is portable and validates the output stream, so there's really no point in using zenc for benchmarks anymore.

    So for those who want to try it on Windows, I'd suggest downloading the latest version of lzbench (dev branch).

  #6 - Jyrki Alakuijala:

    One consideration if this thing is for the internet:

    If you pay for the bandwidth, you typically shouldn't trade bandwidth for CPU: you will lose roughly 1000x more in bandwidth cost than you save in CPU and memory. If someone else pays for your bandwidth, that party could probably be convinced to give you more CPU instead of you using more bandwidth.

  #7 - Christoph Diegelmann:

    Online compression is mostly used on servers to serve dynamic content, which is quite often not compressed at all because of the CPU overhead. If slz allows you to compress data that you could not have compressed without it (because the buffers for regular compression eat up your memory and you start idling on disk), I think this is a really good trade-off.

  #8 - willy:

    Hi Christoph,

    Yes, that's the point. The goal is not to argue whether inline compression is a good solution or not; there are many types of contents that are emitted uncompressed (sometimes just because it's difficult to compress them on the fly while processing them), and a lot of users want compression, to the point that some vendors build dedicated hardware for this. Historically, browsers were taught to deal with compress/gzip/deflate because that allowed servers to serve alternate files found on disk with an extension matching the requested format, so when on-the-fly compression started to appear, it had to adapt to these browsers' supported formats and algorithms. And while gzip is really nice for compressing files on disk, it's a pain to perform on the fly, hence this alternative, which addresses gzip's most painful issues. Other than that, like all LZ-based compressors, it's well suited to compressing text files full of repetitions, which HTML/CSS/JS clearly are.


  #9 - Jyrki Alakuijala:

    Quote Originally Posted by Christoph Diegelmann View Post
    dynamic content which is quite often not compressed at all because of the CPU overhead.
    Which large commercial website does this? None?

    To me it seems it is incompetence, not a technical reason.

  #10 - Jyrki Alakuijala:

    Quote Originally Posted by willy View Post
    there are many types of contents that are emitted uncompressed (sometimes just because it's difficult to compress them on the fly while processing them) and a lot of users want compression, to the point that some vendors build dedicated hardware for this.
    Two questions:

    What kind of content is emitted uncompressed today as a good practice?

    Which vendor do you recommend for hardware compression?

  #11 - Christoph Diegelmann:

    Quote Originally Posted by Jyrki Alakuijala View Post
    Which large commercial website does this? None?

    To me it seems it is incompetence, not a technical reason.
    The only big site I could think of that truly serves somewhat dynamic content (not cached content that can be compressed once and served several times) was Twitter. So I did a small check, which revealed: Twitter serves JSON uncompressed.

    PHP has output compression disabled by default. That doesn't imply it's what big sites use, but the developers seem to think it's the best default for a large number of users, which clearly adds up to huge traffic.

    To be really efficient, I think serving static content + templates (e.g. as XSLT or JS) precompressed, and using something like slz for the dynamic part, is quite a good option.

  #12 - Member (Planet Earth):

    Quote Originally Posted by Jyrki Alakuijala View Post
    Which large commercial website does this? None?

    To me it seems it is incompetence, not a technical reason.
    https://www.torrentz.eu/
    https://majestic.com/

  #13 - SolidComp:

    Quote Originally Posted by Jyrki Alakuijala View Post
    Which large commercial website does this? None?
    Jyrki, brah, are you serious? You of all people should know that lots of major sites don't compress even their static content, let alone their "dynamic" content (I put it in quotes because a lot of it is not that dynamic, and people often use "dynamic" as an excuse not to compress.)

    Let's take a quick look...

    Example 1: Since we're already here... encode.su. As you can see from the attached screenshot, nothing is compressed for a logged-in user, not even static CSS or JS files. (You probably know Chrome Dev Tools very well, but for everyone else: the Content-Encoding column is where gzip (or brotli) would appear; a blank cell for a given row means the content was not compressed.)

    [Attached screenshot: 2016-08-13_21-20-59.png]


    Example 2: CNN.com. The homepage is not compressed (the HTML). If we look at a specific article page, like this one, we see some compressed files and some not compressed. For example, the 143 KB buttons.js file is not compressed. (And no, there is no reason in the world that we should need a 143 KB JavaScript file just for buttons, no matter what those buttons do. They could fly all over the screen and whisper sweet nothings to you, and it still shouldn't take a separate 143 KB file.)

    Conversely, they're gzipping woff font files, which is a bad idea. In most cases, I think there would be zero size reduction. (I'm not logged into CNN, have no account there, so they don't have to deliver dynamic user-specific content.)

    Example 3: Reuters.com. The homepage is not compressed. If we look at a specific article page, like this one, we see that almost nothing is compressed, not even the big jquery file. Only bootstrap seems to be compressed, and there's well over 100 KB of uncompressed CSS (and of course, it's not one CSS file, but several – this whole fractured system with tons of separate files is terrible engineering and bad even with HTTP/2.)

    (I'm not logged into Reuters, have no account there.)

    The core problem is that compression is not the default setting for the web, and it should be. The basic behavioral economics here is that people go with defaults, and anything that imposes any friction or effort cost will be performed less than 100% of the time, usually much less. Part of the problem is that with the current generation of web formats, the machine-consumed format is the same as the authorship format. HTML, CSS, and JS are all this way. We work with these text files, and browsers also accept these text files. Separation of concerns got tangled up in people wanting to separate style from content from logic, when it should have been about separating the authorship format from the deployed, machine-consumed format. Because it's the same stupid text formats for both, there's a lot of inertia that leads to uncompressed and inefficiently structured text files being deployed on servers. If we had a sane binary format for browsers to consume, like XPack, XWRT, EXI, or a brotlified spin on those, a format that wasn't just binary but also inherently compressed and smartly structured for fast reads of the DOM and AST, we'd save enormous bandwidth, and page loads and web apps would be blazing fast, faster than AMP.


  #14 - Shelwien (Administrator):

    Random idea: it's also possible to insert pre-compressed deflate blocks into the middle of a deflate stream.
    Would it make sense along with the SLZ approach?
    I'd imagine something like a lazy/delayed dedup filter: collect anchor-hashes[1] for the transmitted data, find frequently appearing chunks, compress them with something like kzip in idle time, and insert the precompressed chunks by hash value when one is available.

    [1] Anchor-hashing is a dedup method based on splitting data into chunks between "anchor points" determined by a rolling hash of the data.
    For example, see https://msdn.microsoft.com/en-us/lib...(v=vs.85).aspx
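
    For readers unfamiliar with the technique, here is a minimal illustrative sketch of content-defined chunking with a Gear-style rolling hash (not any particular product's dedup code): positions where the rolling hash matches a mask become the anchors, so identical chunks line up the same way even when the surrounding data shifts.

    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Gear-style rolling hash: old bytes fall out of the hash after 32
     * shifts, so no explicit window removal is needed. */
    static uint32_t gear[256];

    static void init_gear(void)
    {
        uint32_t x = 0x12345678u;                 /* xorshift-filled table */
        for (int i = 0; i < 256; i++) {
            x ^= x << 13; x ^= x >> 17; x ^= x << 5;
            gear[i] = x;
        }
    }

    /* Split <data> at anchor points; ~8kB average chunks with a 13-bit
     * mask, 2kB minimum to avoid degenerate tiny chunks. */
    static void find_anchors(const uint8_t *data, size_t len)
    {
        const uint32_t mask = 0x1FFF;
        uint32_t h = 0;
        size_t last = 0;

        for (size_t i = 0; i < len; i++) {
            h = (h << 1) + gear[data[i]];
            if ((h & mask) == 0 && i + 1 - last >= 2048) {
                printf("chunk at %zu, length %zu\n", last, i + 1 - last);
                last = i + 1;
            }
        }
        if (last < len)
            printf("chunk at %zu, length %zu\n", last, len - last);
    }

    int main(void)
    {
        uint8_t buf[65536];
        init_gear();
        for (size_t i = 0; i < sizeof(buf); i++)  /* synthetic input */
            buf[i] = (uint8_t)(i * 2654435761u >> 24);
        find_anchors(buf, sizeof(buf));
        return 0;
    }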

  #15 - SolidComp:

    Quote Originally Posted by Shelwien View Post
    Random idea: it's also possible to insert pre-compressed deflate blocks into the middle of a deflate stream.
    Would it make sense along with the SLZ approach?
    I'd imagine something like a lazy/delayed dedup filter: collect anchor-hashes[1] for the transmitted data, find frequently appearing chunks, compress them with something like kzip in idle time, and insert the precompressed chunks by hash value when one is available.

    [1] Anchor-hashing is a dedup method based on splitting data into chunks between "anchor points" determined by a rolling hash of data.
    For example, see https://msdn.microsoft.com/en-us/lib...(v=vs.85).aspx
    That's interesting, and I've been wondering whether anyone has implemented concatenation of precompressed deflate blocks. Adler has written about it on Stack Overflow. Another thing I've wondered about is preprocessing HTML, CSS, and JS files to make them faster to gzip. A simple route would be to attach metadata to the files that gives the compressor useful information, such as the longest repeated strings (so the compressor knows it doesn't need to search for anything longer). For concatenating dynamic HTML fragments together, you might be able to pre-mark strings that are known matches with strings in other fragments (like strings in a footer, header, or sidebar).

    The above would require a modified, or clean-sheet, gzipper that could make use of the metadata. Another approach could be to normalize the HTML, CSS, and JS content to a strict standard to make the compressor's job more constrained and predictable. The extent to which this could improve performance is complicated and detail-dependent – it might not be a huge win. The best normalization would probably be a strict minified format, with all-lowercase tags, no CRLF, just LF (if any), normalized attribute order, etc. This also highlights the possibility of folding minification into the gzip or brotli process. Cloudflare built some kind of insanely fast minifier a few years ago, so it should be possible to have a fast minifier-compressor in one process. SLZ plus light minification might get you close to gzip -6 in some cases (on otherwise unminified content).

  #16 - Christoph Diegelmann:

    I think precompressing chunks could be quite useful for template engines like Smarty. They could precompress the static parts and only add in, e.g., some numbers as literals. Of course, transferring the templates and rendering them at the client would be better, because they could be cached and transferred completely compressed.

  #17 - Member (Urbana, IL):

    How does this compare to Intel's zlib fork (which is much faster than stock zlib when using -1) and to igzip?

  #18 - Jyrki Alakuijala:

    Quote Originally Posted by SolidComp View Post
    Jyrki, brah, are you serious?
    Yes. I think none of the sites you listed is serving uncompressed to intentionally save CPU. They do it for some other reason.

  #19 - SolidComp:

    Quote Originally Posted by Jyrki Alakuijala View Post
    Yes. I think none of the sites you listed is serving uncompressed to intentionally save CPU. They do it for some other reason.
    Oh, I can't know their reasons for not compressing. If we're strictly talking about saving CPU, the only way to know would be if they blogged about it, or we worked there.

  #20 - Jyrki Alakuijala:

    Quote Originally Posted by SolidComp View Post
    Oh, I can't know their reasons for not compressing. If we're strictly talking about saving CPU, the only way to know would be if they blogged about it, or we worked there.
    We can observe: if they don't compress static content, they are obviously doing it wrong from a financial viewpoint. There, the cost of CPU is somewhere between millions and hundreds of billions of times lower than the cost of bandwidth.

    We can calculate: if they turn on light compression (like gzip -6) for dynamic content, on the first day they will save in bandwidth the equivalent of three years of the CPU needed for it. Their website becomes faster as experienced by users, and their growth rates ramp up.

    Obviously they are not optimizing for cost. It is true that we don't know their real reasons for doing so. I think they don't know them themselves either, so even if they blogged about it or we worked there, it wouldn't help in figuring it out.

  #21 - willy:

    I think it could be done indeed, though you still have to process some of the data to compute the crc32/adler32. Since slz needs no context between calls (it just keeps the last sub-byte bits and the current CRC), you could imagine flushing the stream (to complete the last byte), inserting your pre-compressed block, then continuing. But the CRC will definitely need to be adjusted, and that might be the most difficult part.
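
    Regarding the CRC adjustment: zlib already ships crc32_combine(), which computes the CRC of a concatenation from the two partial CRCs and the length of the second part, so the gzip trailer could be fixed up without re-reading the data. A minimal sketch:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Combine the CRCs of two independently processed parts into the CRC
     * of their concatenation, as needed when splicing a pre-compressed
     * block into the stream. */
    int main(void)
    {
        const char part1[] = "hello ";
        const char part2[] = "world";

        uLong crc1 = crc32(0L, (const Bytef *)part1, (uInt)strlen(part1));
        uLong crc2 = crc32(0L, (const Bytef *)part2, (uInt)strlen(part2));
        uLong both = crc32_combine(crc1, crc2, (z_off_t)strlen(part2));

        uLong ref  = crc32(0L, (const Bytef *)"hello world", 11);
        printf("combined=%08lx reference=%08lx\n", both, ref);  /* equal */
        return 0;
    }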

  #22 - willy:

    CPU definitely is a big issue, believe me. We've been used to disabling or limiting compression in our products to limit the operational risks. From the application, it's often complicated to enable compression on the fly, just because the way application components are chained with the application server doesn't make it easy. Many times the application server is managed by application people who don't know much about interoperability issues, let alone HTTP at all, and who don't care about resource usage. Also, caches are often installed in front of the application, so the cache needs to uncompress what was delivered compressed in order to store it in both forms. Some caches use different strategies, but at the very least they need to normalize the Accept-Encoding header to limit the number of variants of the same object in the cache.

    In the end, for many hosting companies it becomes a matter of cost savings to compress their customers' contents, and they don't want to put the burden on the customers, who will do stupid things and then complain that the incomplete advice they got results in visitors experiencing trouble. So it has become quite common to see compression performed on the fly on the front load balancer. That's the stupidest thing to do in terms of architecture and scalability, but by far the most common. And there, CPU matters a lot, because you don't want to need two or three LBs when a single one was largely enough without compression. It's not just a matter of cost but also of operations and of making accurate routing decisions when you have to deal with multiple boxes instead of a single one. People who run many LBs don't care that much about CPU usage anymore, but for many users it's important.

    I also forgot to mention an important point. Bandwidth nowadays at hosting companies is cheap, provided that you get it with a cheap machine. For $3/month you get 200 Mbps on a quad-core ARM server, for example: https://www.scaleway.com/pricing/. Such machines are not that powerful at compressing contents. If saving 100 Mbps requires tripling the number of machines, you'd rather let your bandwidth max out and order a second machine. After all, that's only $15/month per Gbps. It's cheaper than the 500 Mbps you get with a much larger machine that would allow you to push the same amount of traffic via compression. So in certain scenarios, *not* compressing cuts costs, because the CPU-induced costs dominate.

  #23 - Bulat Ziganshin (Programmer):

    Look at the crcutil library; it allows combining chunk CRCs. Or just read their paper; they describe CRC features, including the math behind combining hashes.


  #24 - Jyrki Alakuijala:

    Quote Originally Posted by willy View Post
    CPU definitely is a big issue, believe me. We've been used to disabling or limiting compression in our products to limit the operational risks.
    Could you give more data on this? What kind of platform (are you running on the said ARM box with a 200 Mbps network connection?!), what is the impact on session length or number of parallel sessions, impact on user-experienced latency, and what risks are reduced?

    How do you limit compression in your products? How much output bandwidth do you generate in normal operation when in the uncompressed mode per core or per NIC? How much does the bandwidth use decrease or increase when compression is turned on?

  #25 - Shelwien (Administrator):

    > But the CRC will definitely need to be adjusted and that might be the most difficult part.

    http://ideone.com/RQdDKy
    http://nishi.dreamhosters.com/u/merge_crc.cpp


  #26 - willy:

    Quote Originally Posted by Jyrki Alakuijala View Post
    Could you give more data on this?
    Sure, I'll respond to each question inline below.

    What kind of platform (are you running on the said ARM box with a 200 Mbps network connection?!),
    We're running on x86_64 for the normal platforms (dual-core Atom at the low end and quad-core i7 at the high end), and a very small mips32 for a demo platform. The ARM is one of my favorite development platforms, however (fanless, light, etc.), as it is well balanced and could be a long-term replacement for the entry-level model. On all of these platforms (except the mips) it's trivial to achieve 1 Gbps of forwarded traffic without compressing. In one second of traffic, you transfer exactly 118MB of uncompressed HTML contents (line rate). On the Atom, it takes 7.76 seconds to transfer the same contents using gzip -1, versus 2.73 seconds using slz (in both cases at 100% CPU). The output is 44MB with gzip -1, versus 57MB with slz. So in the gzip -1 case, we produce 44MB in 7.76s = 5.7 MB/s of output traffic for 15.2 MB/s in. That's roughly 50 Mbps of output network bandwidth and 130 Mbps of server-side bandwidth. With slz, we produce 57MB in 2.73s = 21 MB/s, or about 185 Mbps of output traffic, for 43MB/s at the server (about 380 Mbps). Thus the math is simple: for a link smaller than 50 Mbps, you'd rather use gzip, which benefits from the CPU and will saturate the link before the CPU, resulting in a higher internal bandwidth. Above 50 Mbps, the CPU will cap the traffic at about 130 Mbps of server-side traffic, and that's where you'd rather use slz to go up to 380 Mbps. I don't remember the numbers on the Core i7, though obviously they're much higher for both compressors and will reflect the numbers reported by lzbench.
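
    (To make the crossover explicit, here is the arithmetic above in a few lines of C; the input figures are the ones quoted in this post, nothing new is measured.)

    Code:
    #include <stdio.h>

    /* Worked version of the Atom crossover arithmetic: 118MB of HTML
     * compressed at 100% CPU by each codec, figures from the post. */
    int main(void)
    {
        const double input_mb = 118.0;

        const struct { const char *name; double secs, out_mb; } c[] = {
            { "gzip -1", 7.76, 44.0 },
            { "slz",     2.73, 57.0 },
        };

        for (int i = 0; i < 2; i++) {
            double out_rate = c[i].out_mb / c[i].secs;  /* client-side MB/s */
            double in_rate  = input_mb    / c[i].secs;  /* server-side MB/s */
            printf("%-8s %.1f MB/s out, %.1f MB/s server-side\n",
                   c[i].name, out_rate, in_rate);
        }
        /* gzip -1:  5.7 MB/s out, 15.2 MB/s server-side
         * slz:     20.9 MB/s out, 43.2 MB/s server-side
         * Below a ~50 Mbps output link, gzip -1 saturates the link before
         * the CPU; above that, slz pushes roughly 3x more traffic. */
        return 0;
    }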

    what is the impact on session length or number of parallel sessions
    With zlib, at 256kB/session versus 16kB for pure forwarding, you simply divide the maximum number of concurrent connections by 16. In practice, what we used to do was limit the number of concurrent compressed streams to a very low value so that we could still pass all the clear-text traffic in parallel. Unfortunately, sometimes you believe you can compress and you just emit an uncompressed stream, so you really waste a compression slot (and some CPU). But HTTP contents are more or less correctly typed, so this doesn't happen much (e.g. never compress images, videos or archives, or only compress text/*, and that's fine). With slz, since the overhead is only a few bytes per stream, it's completely absorbed and not noticed at all, so we could remove this annoying limitation and 100% of the streams can be compressed in parallel; even if this brings nothing for some of them, there's no waste of resources.
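
    (As an aside, the content-type gate mentioned above can be as simple as the sketch below; the extra non-text/* types are an illustration, not HAProxy's actual configuration.)

    Code:
    #include <string.h>

    /* Illustrative content-type gate: compress text/* and a few common
     * text-like types, never images, videos or archives. */
    int should_compress(const char *content_type)
    {
        if (strncmp(content_type, "text/", 5) == 0)
            return 1;
        if (strcmp(content_type, "application/javascript") == 0 ||
            strcmp(content_type, "application/json") == 0 ||
            strcmp(content_type, "image/svg+xml") == 0)
            return 1;
        return 0;       /* jpeg, png, mp4, zip, gzip, ... */
    }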

    impact on user-experienced latency
    Fewer bytes transiting over the wire on their end, hence a faster load time when using compression in general. Since with zlib we can't afford to compress all streams, users randomly used to get either a nicely compressed object or an uncompressed one, resulting in 3-fold size differences for the same object between accesses. It also used to fill their caches with different versions of the same object upon reload, which is not nice either. Interestingly, it was also when a site was under very low load that you had the best chance of getting your contents compressed, and little chance when it was under high load, which is quite bad because it means the output bandwidth usage increases faster than the number of visitors. Indeed, below the threshold, 100% of the streams are compressed, and this ratio would drop to something like 2-3% only, resulting in 30 times the output bandwidth for 10 times the number of visitors. By compressing all streams we can deliver a consistent compressed output under any load condition, which avoids this undesirable threshold effect.

    and what risks are reduced?
    If you mean compared to zlib, risks of sudden saturation of the link due to the quick growth of the bandwidth beyond the compression threshold, and risks of CPU saturation during intense traffic (that we already mitigate differently, see below).

    How do you limit compression in your products?
    Two ways. The first one is the limit on the number of concurrent compressed streams: once the limit is reached, we refuse to compress in order to save memory. But as I said, we disabled this one with slz. The second one is that we monitor the CPU usage, and when it reaches a predefined threshold (which we set to 85%), we stop compressing new streams and only emit literals for existing streams. That ensures the device remains responsive even under extreme loads. One unexpected benefit I noticed is that this last limit, which protects the device against CPU saturation, was preventing zlib from achieving its best compression ratio on gigabit links, since we were frequently emitting literals to avoid maxing out the CPU. This is no longer the case with slz, as it easily compresses 1 Gbps of incoming traffic without reaching this limit. So in the end we get a higher overall compression ratio with slz than with zlib on high-speed links. That's funny because it's the opposite of the initial intent (i.e. trading compression ratio for memory usage).
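
    A hypothetical sketch of those two protections (names and structure are illustrative, not HAProxy's actual code): refuse compression for new streams above the CPU threshold, and switch already-compressed streams to cheap stored/literal output so their deflate streams stay valid:

    Code:
    #include <stdio.h>

    #define CPU_LIMIT_PCT 85

    enum action { FORWARD_RAW, EMIT_STORED, COMPRESS };

    /* New stream: start compressing it at all? */
    static int accept_new_compressed_stream(int cpu_pct)
    {
        return cpu_pct < CPU_LIMIT_PCT;
    }

    /* Existing stream, per chunk: compress, or emit cheap stored blocks
     * to keep the deflate stream valid while the CPU is saturated. */
    static enum action chunk_action(int stream_compressed, int cpu_pct)
    {
        if (!stream_compressed)
            return FORWARD_RAW;
        return cpu_pct >= CPU_LIMIT_PCT ? EMIT_STORED : COMPRESS;
    }

    int main(void)
    {
        printf("new stream at 90%% CPU:      %s\n",
               accept_new_compressed_stream(90) ? "compress" : "pass through");
        printf("existing stream at 90%% CPU: %s\n",
               chunk_action(1, 90) == EMIT_STORED ? "stored blocks" : "compress");
        return 0;
    }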

    How much output bandwidth do you generate in normal operation when in the uncompressed mode per core or per NIC?
    The Atom (and even the ARM platforms) has no issue pushing 1 Gbps; this is very cheap nowadays, as TCP stacks have improved to limit the number of copies and memory is quite fast. The ARM doesn't have memory that fast, but the NICs are directly connected to the L2 cache, so that helps a lot! The Core i7 can saturate its two 10G links (i.e. approx 19 Gbps of useful traffic) using a single core for haproxy and two for the network stack, though I know from having run a benchmark with dual 40G links on the same CPU that using all 4 cores for both haproxy and the network I can reach around 60 Gbps. We're far beyond what slz can achieve.

    How much does the bandwidth use decrease or increase when compression is turned on?
    It depends a lot on the web site. Sites that use few images (or use a CDN for images) can literally see their bandwidth divided by 2-3, because they emit only HTML/CSS/JS. Sites using lots of images (like shops listing their product catalog) will hardly notice any saving, since their bandwidth is mostly made up of the tens to hundreds of images composing a page. It's hard to guess in advance. That's also why I like being able to enable it all the time and then observe. If you go to http://demo.haproxy.org/ and hover over the "bytes out" column in the frontend, you'll see that out of 106 GB of traffic, 78 GB were eligible for compression, resulting in 36 GB of output and achieving a net saving of 38%. That's not bad at all! Even if that doesn't change any bandwidth cost here, it allowed users to access the site about 40% faster on average, without adding any hardware.

    Hoping this helps!


  #27 - dnd:

    TurboBench - Dynamic Web Content Benchmark

    SLZ is now included in TurboBench (GitHub only).

    html8: 100MB of random HTML pages from a 2GB Alexa Top sites corpus.

    Number of pages = 1178
    Average length = 84886 bytes.
    The pages (length + content) are concatenated into a single html8 file, but compressed/decompressed separately:
    compress: page1,page2,...pageN
    decompress: page1,page2,...pageN
    This avoids the cache effects seen in other benchmarks, where small files are processed repeatedly in the L1/L2 cache, showing unrealistic results.


    size: 100,000,000 bytes.
    Single-thread, in-memory benchmark

    CPU: Sandy Bridge i7-2600K at 4.2 GHz, all with gcc 5.4, Ubuntu 16.04
    Code:
      C Size  ratio%     C MB/s     D MB/s   Name          C Mem Peak   D Mem Peak    (bold = pareto) MB=1.000.000
        16461059    16.5       0.79     449.97   brotli 11     10,616,320     241,016
        19615056    19.6       0.13     398.95   zopfli        33,644,072      14,320      
        19815500    19.8       6.14     685.06   libdeflate 12 17,938,928      71,312  
        20163816    20.2      92.93     559.73   brotli 4      35,314,344     193,272
        20363869    20.4      40.28     435.27   zlib 9           274,064      14,320
        20485533    20.5      63.34     433.08   zlib 6           274,064      14,320
        20566179    20.6     115.35     678.32   libdeflate 6   2,202,032      71,312
        20624793    20.6      86.26     431.00   zlib_ng 6        274,272      14,320     
        22630144    22.6     165.19     634.85   libdeflate 1   2,202,032      15,184
        23143931    23.1     329.10     524.32   brotli 1       1,193,296  16,810,696
        23723266    23.7     139.94     398.91   zlib 1           274,064      14,320  
        24070531    24.1     411.35     488.54   brotli 0         145,920  16,810,696
        28214601    28.2     371.16     433.05   slz 6                  0      14,320    (64k stack + 132k static memory at compression)
        28214601    28.2     371.23     433.02   slz 9                  0      14,320
        28425348    28.4     671.00    2145.63   lzturbo 20                                *
        29466128    29.5     445.13     483.71   slz 1d                 0      14,320    (deflate)
        29476316    29.5     371.19     424.83   slz 1                  0      14,320
        29995215    30.0     309.16     414.03   zlib_ng 1        175,968      14,320    (optimized for intel SSE4.2)
    Hardware: ODROID C2 - 64-bit ARM - 1.536 GHz CPU, OS: Ubuntu 16.04, gcc 5.3
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name                    
        16461059    16.5       0.14      87.22   brotli 11       
        20163816    20.2      14.06     102.77   brotli 4        
        20363869    20.4       8.14     128.23   zlib 9          
        20485533    20.5      13.04     127.66   zlib 6
        22630144    22.6      21.31     143.02   libdeflate 1
        23143931    23.1      44.01      93.11   brotli 1        
        23723266    23.7      32.90     117.04   zlib 1          
        24070531    24.1      76.59      84.36   brotli 0        
        28214601    28.2      54.30     127.64   slz 6           
        28214601    28.2      54.33     127.64   slz 9           
        28425348    28.4     143.06     542.98   lzturbo 20      
        29476316    29.5      51.58     126.06   slz 1
    Skylake i7-6700 - 3.7GHz
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name            
        16461059    16.5       0.76     407.68   brotli 11       
        20163816    20.2      89.92     512.58   brotli 4        
        20363869    20.4      34.31     370.89   zlib 9         
        20485533    20.5      55.47     369.05   zlib 6          
        23143931    23.1     269.01     487.64   brotli 1        
        23723266    23.7     125.46     340.55   zlib 1          
        28214601    28.2     332.37     371.57   slz 9           
        28214601    28.2     332.43     371.41   slz 6           
        28425348    28.4     611.31    1691.52   lzturbo 20      
        29476316    29.5     336.27     364.31   slz 1
    * lzturbo 20 included as an indication only; it is not gz/br compatible

    Note: single-thread peak memory; only malloc, calloc, ... are tracked (no stack, no mmap).
    Input/output buffers are not counted.

    Remark:
    Compression speed, slz vs. zlib:
    - 2.65x faster on Intel i7 (20% faster than zlib-ng with SSE4.2)
    - 1.65x faster on ARM

    - "brotli 1" can also be an alternative (but its memory usage is high)
    - Edit: "brotli 0" is faster than slz, but its decompression is slow on ARM
    Last edited by dnd; 28th August 2016 at 14:53. Reason: added brotli 0,zlib-ng,libdeflate



  #29 - JamesB:

    Are those slz parameters correct? If so it implies slz -1 is slower than slz -9 on both Intel and Arm.

  #30 - dnd:

    Quote Originally Posted by JamesB View Post
    Are those slz parameters correct? If so it implies slz -1 is slower than slz -9 on both Intel and Arm.
    These are TurboBench levels (slz itself doesn't have levels); they only specify the dictionary size, and the difference is within measurement tolerance.
    Note that the average page size is 84k, which is why levels 6 and 9 give the same ratio and the dictionary size has no big impact.

    Edit: ARM slz,1 retest slightly better

