
Thread: LZFSE - New Apple Data Compression

  1. #31
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    Well, you need to be careful to make useful comparisons when doing benchmarks. I think that Brotli in general aims for stronger compression than the other formats I benchmarked, and brotli -q 11 -w 24 in particular uses a much larger sliding window size and a much more thorough encoder than the other algorithms. So to get a fair comparison you would need to either run the other algorithms in a similar mode (which isn't fully possible for LZFSE and DEFLATE as they have a lower maximum window size; this certainly could be considered a disadvantage of those formats), or use a lower compression mode of Brotli.

    So, I edited my posts above to add brotli -6 -w 18 and brotli -6 -w 19, as those seemed the most comparable for faster compression. Results were as expected: Brotli had the best compression ratio but a slower decoder.

    For the sake of curiosity, I also compared brotli -11 -w 24 with Zstd levels 19 through 22 as this seemed at least somewhat comparable for better compression. Here were the results:


    Code:
    x86_64
    
    
    Method                        Csize (bytes)  Ctime (ms) Dtime (ms)
    ----------------------------- -------------  ---------- ----------
    Brotli, -11 -w 24             50050328       823513     990       
    Zstd -22                      52790083       154684     533       
    Zstd -21                      52824748       123616     533       
    Zstd -20                      53018920       108542     535       
    Zstd -19                      53939418       93909      513       
    
    
    ARM (32-bit)
    
    
    Method                        Csize (bytes)  Ctime (ms) Dtime (ms)
    ----------------------------- -------------  ---------- ----------
    Brotli, -11 -w 24             50050328       8098536    4902      
    Zstd -22                      52814371       853979     4200      
    Zstd -21                      52839938       789343     4546      
    Zstd -20                      53018920       694703     4762      
    Zstd -19                      53939418       495606     3796
    So on ARM, Brotli decoding was slower than Zstd and about twice as slow as zlib. To some extent this is expected, due to the more complex algorithm and the large sliding window size. Brotli encoding was also very slow --- it ran for over 2 hours.

    It may be worth noting that based on my results before, Zstd is slower on ARM than it should be --- XPACK was much faster. I'm not sure why.

    I also don't know why Zstd compressed to a different size at levels 21 and 22 on ARM vs on x86_64.
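
    For anyone wanting to reproduce this kind of Csize/Ctime/Dtime table, the harness pattern is simple; here is a minimal sketch using Python's stdlib zlib as a stand-in codec (brotli/zstd bindings are not assumed to be installed, and the payload is a placeholder, not the corpus used above):

    ```python
    import time
    import zlib

    def bench(data, level):
        """Return (compressed size, compress ms, decompress ms) for one setting."""
        t0 = time.perf_counter()
        comp = zlib.compress(data, level)
        ctime = (time.perf_counter() - t0) * 1000
        t0 = time.perf_counter()
        out = zlib.decompress(comp)
        dtime = (time.perf_counter() - t0) * 1000
        assert out == data  # round-trip check, as any benchmark should do
        return len(comp), ctime, dtime

    if __name__ == "__main__":
        payload = b"some moderately repetitive payload " * 4096
        print(f"{'Level':<8}{'Csize (bytes)':<16}{'Ctime (ms)':<12}{'Dtime (ms)'}")
        for level in (1, 6, 9):
            csize, ct, dt = bench(payload, level)
            print(f"{level:<8}{csize:<16}{ct:<12.2f}{dt:.2f}")
    ```

    For real numbers like the tables above you would swap in the actual codec CLIs or bindings and a fixed corpus, and repeat runs to smooth out timing noise.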

  2. #32
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by nemequ View Post
    Here is a version of the squash benchmark with LZFSE and LZVN:

    https://quixdb.github.io/squash-benchmark/unstable/

    Note that Squash doesn't yet expose LZVN, it's blocked on a pull request to the LZFSE repo.

    Note that there is a link to the CSV in the "Choose a machine" section if you want the raw data.
    BTW, if anyone wants to play with LZVN, the API is in lzvn_encode/decode_base.h. The PR I'm blocking on only matters if you have < 8 bytes of input, so it shouldn't be an issue for benchmarking.

    Also, note that there are a bunch of compile-time tunables for LZFSE in lzfse_tunables.h, which are compatible with existing LZFSE decompressors. I haven't played with them to see what kind of effect they have; I probably won't bother benchmarking them unless LZFSE decides to expose them for run time configuration.

  3. #33
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    870
    Thanks
    471
    Thanked 264 Times in 109 Posts
    > I also don't know why Zstd compressed to a different size at levels 21 and 22 on ARM vs on x86_64.

    Zstd has a hard limit of a 32 MB window size in 32-bit mode.
    This is an implementation limit. It could be removed with some more complex code.
    Since it only impacts the highest compression modes on very large files, it hasn't been a priority up to now.

    Speed has been optimized for 64-bit systems (incl. ARM ones), but not for 32-bit ones so far.
    I believe it will require a dedicated optimization pass.
    Any help is welcome.
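
    In other words, the requested window log gets clamped on 32-bit builds; roughly like this sketch (the constant and function names here are mine, not zstd's actual identifiers):

    ```python
    MAX_WINDOW_LOG_32BIT = 25  # 2**25 = 32 MB, the hard limit described above

    def effective_window_log(requested_log: int, pointer_bits: int) -> int:
        """Clamp the match window on 32-bit targets; 64-bit targets keep the request."""
        if pointer_bits == 32:
            return min(requested_log, MAX_WINDOW_LOG_32BIT)
        return requested_log

    # The highest levels ask for a window larger than 32 MB, so the 32-bit ARM
    # build ends up with a smaller effective window and a slightly worse ratio,
    # while lower levels fit under the cap and produce identical output.
    print(effective_window_log(27, 32))  # 25
    print(effective_window_log(27, 64))  # 27
    ```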

  4. #34
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    768
    Thanks
    217
    Thanked 286 Times in 168 Posts
    Quote Originally Posted by Zyzzyva View Post
    So on ARM, Brotli decoding was slower than Zstd and about twice as slow as zlib. To some extent this is expected, due to the more complex algorithm and the large sliding window size.
    While brotli's sliding window is definitely larger than zlib's, it is actually smaller than zstd's. Brotli, even with its largest 24-bit window, has a 2-4x smaller window than zstd -22 on the Silesia benchmark, and possibly more on other benchmarks. If you limit brotli to 18 or 19 bits and compare with zstd -22, you get around a 128x smaller window with brotli than with zstd on the Silesia benchmark.

    The main issue with a larger window is decoding-time memory use. A secondary issue is the increase in memory bus bandwidth use in decoding, making a larger window less suitable for heavy multi-processing.
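
    The ratios quoted above are just powers of two; assuming zstd -22 settles on a window log of roughly 25-26 for Silesia (an assumption for illustration, not a measured value):

    ```python
    def window_bytes(window_log: int) -> int:
        """An LZ window with log2 size n covers 2**n bytes."""
        return 1 << window_log

    # brotli's maximum window is 2**24; zstd -22 assumed at 2**25..2**26 here.
    print(window_bytes(25) // window_bytes(24))  # 2x
    print(window_bytes(26) // window_bytes(24))  # 4x
    # Limiting brotli to 18 bits against an assumed 2**25 zstd window:
    print(window_bytes(25) // window_bytes(18))  # 128x
    ```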

  5. #35
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    The main issue with a larger window is decoding-time memory use. A secondary issue is the increase in memory bus bandwidth use in decoding, making a larger window less suitable for heavy multi-processing.
    There are good reasons for an LZ to support unlimited window, and there are good reasons to support finite window sizes.

    If you are decoding into a single memory buffer, then an unlimited window does not affect decode-time memory usage. For example, when using a compressor to load content into applications, the apps need a buffer for that object anyway, so they don't count it as part of the compressor's memory usage. In this case, limiting the window size just hurts compression ratio for no good reason.

    16M or even 256M might seem like a reasonable maximum window now, but it will seem ridiculously small in 20 years. Let's not make another major LZ standard that's crippled almost immediately after launch due to a fixed window limit. (*cough* deflate, LZX, etc.)

    Obviously for streaming decodes of files that are bigger than you want to hold in memory, you want some kind of limited window.

    Certainly there are still lots of important micro-controllers and other devices that have very small RAM and want small windows, but really none of these modern LZ's are great for that domain anyway.

    On platforms with slow memory, or some kind of big speed step from L2 to L3 or L3 to RAM, there are good performance reasons to limit the LZ window to the fast caches. However, that is quite a tricky proposition in general and can be very misleading on benchmarks, as there is a massive number of different hardware possibilities there. For example, w.r.t. your issue of heavy multi-processing: do your cores have local L2, or is it shared? Is the L3 local or shared? (Is there an L3?) How many read ports are there on the caches and RAM? What is the size of each cache? Etc.

    Trying to get the settings for this right without knowing the exact details of the usage is pretty impossible.

    IMO the LZ that will replace deflate/Zlib needs to support unlimited window and finite window.
    Last edited by cbloom; 3rd August 2016 at 19:07.
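
    The fixed-window limitation is easy to see with DEFLATE itself, since zlib exposes the window log through the wbits parameter. A toy illustration using Python's stdlib zlib (nothing LZFSE- or zstd-specific): data whose only redundancy lies 4 KiB back is nearly incompressible with a 512-byte window but collapses with the full 32 KiB window, DEFLATE's hard maximum.

    ```python
    import hashlib
    import zlib

    # 64 KiB of data whose only redundancy is a repeat 4 KiB back:
    # a 4 KiB pseudo-random unit repeated 16 times.
    unit = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
    data = unit * 16

    def deflate(data: bytes, wbits: int) -> bytes:
        """Compress with an explicit window log (wbits in 9..15 for zlib streams)."""
        co = zlib.compressobj(9, zlib.DEFLATED, wbits)
        return co.compress(data) + co.flush()

    small_window = deflate(data, 9)   # 512-byte window: the repeats are out of reach
    full_window = deflate(data, 15)   # 32 KiB window: every repeat is matchable
    print(len(small_window), len(full_window))
    assert len(full_window) < len(small_window)
    ```

    No wbits value can reach past 32 KiB, which is exactly the kind of baked-in ceiling being argued against here.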

  6. #36
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    WRT LZFSE and Apple - why are they doing their own thing? Wouldn't it be better for everyone (Apple and the rest of the world) if they just used Zstd and put some time into optimizing it for ARM and other platforms that Apple cares about?
    Last edited by cbloom; 3rd August 2016 at 19:07.

  7. #37
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    342
    Thanks
    36
    Thanked 37 Times in 22 Posts
    The magic words they used are "energy efficient", and maybe this is a factor that needs to be considered while benchmarking. So it's less CPU usage (maybe especially on ARM).

  8. #38
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Here is an interesting performance assertion: https://news.ycombinator.com/item?id=12050279

    Advocating memcpy.

    I know my way around LLVM and GCC internals a bit and I'd anticipate that normal memcpy would be lowered to __builtin_memcpy and that the backend would turn that into a normal store in this case. Of course people can compile with -fno-builtin(-memcpy) or -ffreestanding or other flags that can affect this and turn it into a proper indirect inter-lib call.

  9. #39
    Member
    Join Date
    Nov 2015
    Location
    Ślůnsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Quote Originally Posted by willvarfar View Post
    Here is an interesting performance assertion: https://news.ycombinator.com/item?id=12050279

    Advocating memcpy.

    I know my way around LLVM and GCC internals a bit and I'd anticipate that normal memcpy would be lowered to __builtin_memcpy and that the backend would turn that into a normal store in this case. Of course people can compile with -fno-builtin(-memcpy) or -ffreestanding or other flags that can affect this and turn it into a proper indirect inter-lib call.
    A couple of years ago I experimented with it and found memcpy performance to be inconsistent and in general lower.
    I see that LZFSE doesn't use it and Yann's codecs don't either, so I guess that the manual way is faster for them too.
    What is the reason? Loss of alignment information? Compiler inability to exploit the fact that the memcpy size is always a multiple of X bytes? Something else? I don't know.

  10. #40
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    LZFSE already uses memcpy for most unaligned memory accesses. See lzfse_internal.h. However, it looks like lzvn_encode_base.c is an exception. This was probably an oversight.

    memcpy() is a great solution when the compiler handles it properly. It was probably sufficient for the compiler and platforms Apple used --- I think clang for x86_64 and AArch64. Unfortunately, not all compilers handle it well for all targets, so sometimes a non-standard solution or even just a byte-by-byte copy is better.
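
    The idiom in question is the fixed-size memcpy load/store; a minimal sketch (helper names are mine, not LZFSE's actual identifiers), which gcc and clang lower to a single load or store instruction on targets that allow unaligned access:

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Well-defined unaligned access: let memcpy's fixed size tell the
       compiler what to emit, instead of dereferencing a misaligned pointer
       (the `*(const uint32_t *)p` cast is undefined behavior in C). */
    static inline uint32_t load32_unaligned(const void *p) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }

    static inline void store32_unaligned(void *p, uint32_t v) {
        memcpy(p, &v, sizeof v);
    }
    ```

    A round trip through an odd offset works on any host regardless of endianness, and with optimization enabled both helpers compile down to one mov/ldr-class instruction on x86 and modern ARM. When the compiler handles memcpy poorly for a target, a byte-by-byte fallback is the safe alternative.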

  11. #41
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by m^3 View Post
    A couple of years ago I experimented with it and found memcpy performance to be inconsistent and in general lower.
    I see that LZFSE doesn't use it and Yann's codecs don't either, so I guess that the manual way is faster for them too.
    What is the reason? Loss of alignment information? Compiler inability to exploit the fact that the memcpy size is always a multiple of X bytes? Something else? I don't know.
    zstd uses memcpy by default. See lib/common/mem.h:95. AFAIK lz4 does, too (though it didn't always).

    Yann wrote a good post on his blog about it a while back. Note that the ARM issues should be fixed with newer versions of GCC (6+, I believe); Yann filed a bug about it, and the GCC devs fixed it. So, for gcc and clang, memcpy should be as fast as (or faster than) other methods, and safe to use. My understanding is that MSVC handles memcpy well, too.

    It's worth noting that people tend to think unaligned access is safe on x86, which isn't quite true. Some SIMD instructions can't handle unaligned accesses. LZ4 ran into this issue a while back, where GCC 4.9+ started generating code which caused a crash when compiled with -O3 (well, -ftree-loop-vectorize, which is included in -O3). I blamed GCC and filed a bug which, TBH, is a bit embarrassing, but worth mentioning because the discussion in the bug was quite illuminating. I think this is actually what prompted Yann's investigation into unaligned access.

    The moral of the story is: don't depend on undefined behavior. Even if you test every platform and compiler, it's quite possible that a future version of that same compiler will not work. If you find a performance issue where the compiler generates sub-optimal code, file a bug against the compiler instead of trying to work around the issue by depending on undefined behavior.

  12. Thanks (5):

    Bulat Ziganshin (9th July 2016),Cyan (8th July 2016),m^2 (8th July 2016),SolidComp (10th July 2016),willvarfar (8th July 2016)

  13. #42
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Not sure anyone cares, but LZFSE just merged a PR to get rid of the unaligned accesses.

  14. Thanks (3):

    Bulat Ziganshin (17th August 2016),Christoph Diegelmann (16th August 2016),Turtle (17th August 2016)


