
Thread: rANS light (or tANS)

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts

    rANS light (or tANS)

    Ahoy. I read that rANS has a processing overhead comparable to Huffman. Is this accurate? Is there an implementation as light as something like SLZ is with gzip? This would be less than 1 MB of RAM and maybe 1% or something of CPU running at line rate on a 10 GbE network connection. Does Zstd have a mode that is super light like this?

    If this doesn't exist, could it exist, in theory? Zstd apparently goes much faster with a dictionary, and I wonder if it would be possible to build some sort of super fast and super lightweight ANS implementation that uses a dictionary tailored for small data payloads. That would be sweet. Benchmarks usually don't report the CPU and memory overhead, which is frustrating. I also wonder if GLZA can run super light.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    > I read that rANS has a processing overhead comparable to Huffman.

rANS is equivalent to arithmetic coding.
It requires multiplications, so it has to be slower than Huffman coding.

tANS processing (the actual coding steps) is equivalent to Huffman,
but the lookup tables are larger and their generation is more complex.
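The multiply/divide in the rANS step can be seen in a toy sketch (my own illustration, not any particular implementation): the encoder divides and takes a remainder by the symbol frequency, where tANS would replace the whole step with a table lookup. Arbitrary-precision state, no renormalization, frequencies summing to a power of two:

```python
# Minimal single-stream rANS with arbitrary-precision state (no renormalization),
# to show the multiply/divide in the coding step. Frequencies must sum to a
# power of two (2^SCALE_BITS). A toy sketch, not a production layout.

SCALE_BITS = 4
M = 1 << SCALE_BITS                      # total frequency = 16
freq = {'a': 8, 'b': 4, 'c': 4}          # P(a)=1/2, P(b)=P(c)=1/4
cdf = {'a': 0, 'b': 8, 'c': 12}          # cumulative frequencies

def encode(symbols):
    x = 1
    for s in reversed(symbols):          # rANS encodes in reverse order
        x = (x // freq[s]) * M + cdf[s] + (x % freq[s])
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x & (M - 1)               # x mod M selects the subrange
        s = next(t for t in freq if cdf[t] <= slot < cdf[t] + freq[t])
        x = freq[s] * (x >> SCALE_BITS) + slot - cdf[s]
        out.append(s)
    return ''.join(out)

msg = 'abacabca'
assert decode(encode(msg), len(msg)) == msg
```

In tANS the `(x // freq[s]) * M + ...` arithmetic is precomputed into a state-transition table, which is why its per-symbol cost matches Huffman decoding.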

    > Is there an implementation as light as something like SLZ is with gzip?

    You can test zstd --fast modes.

    > Does Zstd have a mode that is super light like this?

    In terms of encoding speed and memory usage, yes.
    I don't know whether there's an implementation with static tables for entropy coding (like SLZ),
    but that's possible too.

    > I wonder if it would be possible to build some sort of super fast and super
    > lightweight ANS implementation that uses a dictionary tailored for small
    > data payloads. That would be sweet.

    You don't need zstd for your case. Just write a simple custom solution.

    > Benchmarks usually don't report the CPU and memory overhead, which is frustrating.

For LZ this information is unnecessary - CPU usage should normally be 100% and you
can just look at coding speed. And memory usage depends on parameters
(window size, hashtable size, matchfinder type, etc.) which can be manually controlled
to work with almost any given amount of memory.

    But you can try to use this utility for memory usage checks: http://nishi.dreamhosters.com/u/memchk_v0.rar
(PeakWorkingSetSize is the maximal amount of memory used during the test;
keep in mind that this value takes the whole executable footprint into account,
so the minimum is around 2.5MB).

    You can control zstd memory usage with "secret" options like "--zstd=wlog=15".
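The arithmetic behind that option: wlog is the base-2 logarithm of the match window size, so the window itself takes 2^wlog bytes (other structures add to this):

```python
# wlog=15 caps the zstd match window at 2^15 bytes = 32 KiB.
# (This covers only the window; hash tables etc. add some more memory.)
wlog = 15
window_bytes = 1 << wlog
assert window_bytes == 32 * 1024
```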

    > I also wonder if GLZA can run super light.

It's called "LZW" in that case.
GLZA's benefits come from its dictionary generation algorithm, but it's also very slow.

  3. #3
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
I've been confused by static Huffman since seeing the HPACK spec. There they have a predefined Huffman tree, where characters like 0, 1, 2, a, e, i, o, u, s, t are all five bits, and so forth. Does static Huffman normally mean something else, like one custom tree for the whole payload?

    https://www.rfc-editor.org/rfc/rfc7541.html#appendix-B
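A toy illustration of what an HPACK-style predefined table means in practice: the codeword table is fixed in the spec rather than derived from the payload. The codewords below are invented for the example; the real table is in RFC 7541 Appendix B:

```python
# A made-up predefined prefix code (NOT the real HPACK table): both sides
# agree on the table in advance, so no tree is transmitted with the payload.
code = {'e': '00', 't': '01', 'a': '100', 'o': '101', 'x': '110', 'q': '111'}
decode_table = {v: k for k, v in code.items()}

def encode(text):
    return ''.join(code[c] for c in text)

def decode(bits):
    out, cur = [], ''
    for b in bits:                 # greedy decode works for any prefix-free code
        cur += b
        if cur in decode_table:
            out.append(decode_table[cur])
            cur = ''
    return ''.join(out)

assert decode(encode('ateo')) == 'ateo'
```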

What do you mean by 100% CPU? Do you mean that a production server running a bunch of stuff should use all available CPU?

  4. #4
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    803
    Thanks
    244
    Thanked 255 Times in 159 Posts
Surprisingly, the fastest entropy coder currently available on CPU is rANS: ~1500 MB/s/core decoding: https://github.com/jkbonfield/rans_static vs ~500 MB/s for tANS.
This amazing speed is thanks to using SIMD ... but it requires 32 independent streams.
Splitting data into independent parts, like H.265 into slices, usually comes at a cost in compression ratio.
For LZ77 we can e.g. prepare a common dictionary and expand it independently for each "slice".

However, there is practically no penalty for simple models like i.i.d. or Markov, as used for the entropy coders in zstd-like compressors.
Zstd uses (4) static probability distributions per frame - we could accumulate multiple such entropy coding tasks and encode/decode them simultaneously, exploiting the speedup.
... ok, James' implementation rather uses the same probability distribution for all 32 independent streams - the question is the speed penalty for using separate distributions.

Also, James has a super-fast o1 Markov coder, which might be worth considering for improving the compression ratio of zstd ... ?
The biggest problem here is the size of such models (to be stored in the header) - it would need some work, but should allow a significant improvement in compression ratio.
    Last edited by Jarek; 23rd June 2020 at 09:48.
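The interleaving idea can be sketched like this (a toy version: 4 big-int streams instead of 32 SIMD lanes, shared static frequency table, no renormalization). Each stream handles every N-th symbol, so the N coding steps advance in lockstep and a vector unit could execute them with one instruction sequence:

```python
# Toy interleaved rANS: N independent states, symbol i goes to stream i % N.
# Each stream is an ordinary rANS coder on its own subsequence, so they can
# be advanced in lockstep - the property SIMD implementations exploit.

N = 4
SCALE_BITS = 4
M = 1 << SCALE_BITS
freq = {'a': 8, 'b': 4, 'c': 4}
cdf = {'a': 0, 'b': 8, 'c': 12}

def encode_interleaved(msg):
    states = [1] * N
    for i in reversed(range(len(msg))):  # rANS encodes in reverse
        s = msg[i]
        j = i % N
        x = states[j]
        states[j] = (x // freq[s]) * M + cdf[s] + (x % freq[s])
    return states

def decode_interleaved(states, n):
    states = list(states)
    out = []
    for i in range(n):
        j = i % N                        # streams decoded round-robin
        x = states[j]
        slot = x & (M - 1)
        s = next(t for t in freq if cdf[t] <= slot < cdf[t] + freq[t])
        states[j] = freq[s] * (x >> SCALE_BITS) + slot - cdf[s]
        out.append(s)
    return ''.join(out)

msg = 'abcabcabbaca'
assert decode_interleaved(encode_interleaved(msg), len(msg)) == msg
```

Note the streams share one frequency table here, like James' implementation; separate tables per stream would need a gather per lookup, which is where the speed-penalty question comes in.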

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,305 Times in 741 Posts
    > Does static Huffman normally mean something else, like one custom tree for the whole payload?

Normally "static coder" is used in the sense of "static vs adaptive": https://en.wikipedia.org/wiki/Adapti...#Static_method
    Whether it uses relevant statistics or not is less important.

    > What do you mean by the 100% CPU?

    Well, I guess I don't understand what _you_ mean by "CPU overhead".
    Speed is usually reported in benchmarks.
But if you have a codec with 500 MB/s encoding speed writing to 100 MB/s storage,
you'd end up with 100 MB/s overall speed and 20% CPU load. I thought you meant that.
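The arithmetic behind that 20% figure: the pipeline runs at the speed of its slowest stage, and the CPU is idle while waiting on the storage.

```python
# A 500 MB/s encoder feeding a 100 MB/s disk spends most of its time stalled.
codec_speed = 500       # MB/s, CPU-bound encoding speed
storage_speed = 100     # MB/s, storage write bandwidth

overall = min(codec_speed, storage_speed)  # pipeline speed = slowest stage
cpu_load = overall / codec_speed           # fraction of time the CPU is busy
assert overall == 100
assert cpu_load == 0.2                     # 20% of one core
```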

  6. #6
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Not rANS but a rABS decoder on a 6502 8-bit CPU (~1MHz) with 64KB RAM:
    https://yupferris.github.io/blog/201...S-on-6502.html
    The author also talks about it (more a how-to video) here:
    https://youtu.be/-lJDJgBaOAw

  7. Thanks (4):

    Cyan (23rd June 2020),introspec (23rd June 2020),JamesB (3rd July 2020),Jarek (23rd June 2020)

  8. #7
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    803
    Thanks
    244
    Thanked 255 Times in 159 Posts
Thanks, I didn't know about it. Pascal Massimino (skal) ran tests for binary alphabets, and uABS turned out a bit better than rABS: lower redundancy, more uniform decoding speed:
    https://encode.su/threads/1821-Asyme...l-System/page8
decoders:
[attached image: 3sGBgPY.png]

  9. Thanks:

    Lucas (30th June 2020)

  10. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    Quote Originally Posted by Jarek View Post
Surprisingly, the fastest entropy coder currently available on CPU is rANS: ~1500 MB/s/core decoding: https://github.com/jkbonfield/rans_static vs ~500 MB/s for tANS.
This amazing speed is thanks to using SIMD ... but it requires 32 independent streams.
Splitting data into independent parts, like H.265 into slices, usually comes at a cost in compression ratio.
For LZ77 we can e.g. prepare a common dictionary and expand it independently for each "slice".

However, there is practically no penalty for simple models like i.i.d. or Markov, as used for the entropy coders in zstd-like compressors.
Zstd uses (4) static probability distributions per frame - we could accumulate multiple such entropy coding tasks and encode/decode them simultaneously, exploiting the speedup.
... ok, James' implementation rather uses the same probability distribution for all 32 independent streams - the question is the speed penalty for using separate distributions.

Also, James has a super-fast o1 Markov coder, which might be worth considering for improving the compression ratio of zstd ... ?
The biggest problem here is the size of such models (to be stored in the header) - it would need some work, but should allow a significant improvement in compression ratio.
    For what it's worth I reckon tANS ought to be faster still than rANS, but I'm not aware of anyone going to such extreme SIMD levels. I only did static rANS because I understood it and kind of stuck with it. A lame reason really!

  11. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
tANS requires memory lookups, which are limited to 2 loads/cycle in the best case (on Intel CPUs), so you probably can't beat SIMD rANS.

  12. #10
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
tANS requires memory lookups, which are limited to 2 loads/cycle in the best case (on Intel CPUs), so you probably can't beat SIMD rANS.
    Should SIMD rANS always beat SIMD AC?

  13. #11
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    803
    Thanks
    244
    Thanked 255 Times in 159 Posts
In https://sites.google.com/site/powturbo/entropy-coder the fastest rANS reaches ~600/1500 MB/s/core enc/dec ... AC 33/22 MB/s/core.
But in https://github.com/jkbonfield/rans_static there is a faster AC: 240/150 MB/s/core.

Generally ANS is much more convenient for vectorization, due to its state being a single number; in AC one needs to process two numbers ... it might be doable, but I haven't seen it.

Another story is SIMD 4-bit adaptive rANS - it started in LZNA, simultaneously updates 16 probabilities, and compares against all CDFs to find the proper subrange.
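The "compare against all CDFs" lookup can be sketched as follows (scalar Python standing in for the vector compare; the CDF values are made up for the example):

```python
# Symbol lookup for a 16-symbol (4-bit) alphabet: instead of a binary search,
# test slot >= cdf[k] for all k and sum the comparison results. A SIMD version
# does all 16 comparisons with one vector instruction and a horizontal sum.

cdf = [0, 3, 7, 12, 20, 25, 31, 40, 47, 52, 55, 57, 59, 61, 62, 63]  # example CDF, total 64

def find_symbol(slot):
    # number of CDF entries <= slot, minus one, is the symbol index
    return sum(1 for c in cdf if slot >= c) - 1

assert find_symbol(0) == 0    # slot in [0, 3)  -> symbol 0
assert find_symbol(20) == 4   # slot in [20, 25) -> symbol 4
assert find_symbol(63) == 15  # last subrange    -> symbol 15
```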

  14. #12
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    876
    Thanks
    242
    Thanked 325 Times in 198 Posts
    Quote Originally Posted by SolidComp View Post
    This would be less than 1 MB of RAM and maybe 1% or something of CPU running at line rate on a 10 GbE network connection.
Entropy coders tend to run at less than 10 Gbit/s in software. If you have a 10 GbE network connection and 4:1 compression, you need 400% of a CPU core to fill the 10 Gbit network. 1% is about 3 orders of magnitude off.

It might be possible with relatively simple hardware.

Still, you'd be using more than 10% of the memory bus for storing the decompressed data.
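The arithmetic behind the 400%: to keep a 10 Gbit/s link full of 4:1 compressed data, the compressor must consume raw input at 40 Gbit/s; at roughly 10 Gbit/s of entropy coding per core (the figure above), that is about four cores.

```python
link_gbit = 10                 # 10 GbE line rate
ratio = 4                      # 4:1 compression
core_gbit = 10                 # assumed per-core entropy-coding throughput

raw_gbit = link_gbit * ratio   # raw data rate needed to fill the link
cores = raw_gbit / core_gbit
assert cores == 4.0            # ~400% of one CPU core
```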

So, no, it is not possible with 1%.

  15. #13
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    How did you get the 400% CPU figure? How does network bitrate convert to CPU utilization percentage?
