Page 7 of 7 FirstFirst ... 567
Results 181 to 195 of 195

Thread: Benchmarking Entropy Coders

  1. #181
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    177
    Thanks
    29
    Thanked 75 Times in 45 Posts
    "The performance claims just don't hold up, and it has to encode everything in reverse."
    Looks like I need to scrap my current work and use 16 muls per nibble in my future work :P

ANS has been proven faster; it's used by Zstandard and a ton of other fast codecs for exactly this reason, so they can't use that as an excuse.
From the looks of it the Daala devs haven't read much about optimizing rANS. I've used rANS with FIFO modelling and reverse encoding for nearly 2 years; it's entirely applicable to their video codec.
A single-state rANS coder is typically a bit faster than a range coder, but with more and more interleaved states the tables turn in favour of ANS and range coding cannot compete.
My best guess is they just don't want anything to do with the patent scares, that or laziness, because it does take a while to build a working FIFO ANS implementation from scratch.

  2. #182
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    815
    Thanks
    252
    Thanked 262 Times in 161 Posts
    Maybe they have more promising patent for the range coder they use?
And generally, on the Daala side there are real professionals (Timothy, Jean-Marc) fighting for their solution (there was Moffat's approximation earlier, but it finally got to a nearly standard RC) ... while it looks like Alex was the only one responsible for rANS there, taking over from Pascal (?)

  3. Thanks:

    dnd (22nd June 2018)

  4. #183
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    521
    Thanks
    196
    Thanked 186 Times in 127 Posts
My guess is simply that they didn't optimise it well. They know Daala_EC and have spent a long time learning how to optimise it. The same likely isn't true of adaptive nibble rANS: it needs careful work to get the huge speeds out of it, and the implementations out there that do this aren't open source (though the methods are known and understood).

  5. Thanks:

    dnd (22nd June 2018)

  6. #184
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    rANS vs AOMedia AV1 Discussion

Quote:
    Firstly, the AV1 range coder only uses 1 multiplication per CDF entry; the 16 is the "worst case" (keep in mind that they can be done in parallel, e.g. with SIMD, so it's actually better to use more rather than fewer, as the multiply is the cheapest part in software).
For SIMD decoding in AV1 you need 4 vector multiplications (_mm_mullo_epi32, an SSE4.1 instruction) + 4 comparisons (_mm_cmpgt_epi32) + combining the 4 registers (after _mm_movemask_ps).
It is unlikely that this will be faster than scalar decoding.
For AV1 hardware implementations you need 16 multipliers, otherwise the multiplications cannot be done in parallel; 16 comparisons plus other operations are additionally required.
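To make the multiplication count concrete, here is a scalar sketch of the per-entry search those multiplications implement for one nibble. The function name, the 15-bit probability precision, and the simplified range scaling are illustrative assumptions, not AV1's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (not AV1's actual code): decoding one nibble means
 * scaling each CDF entry into the current range and finding the subrange
 * that contains the code value.  Done serially this costs one multiply
 * and one compare per CDF entry visited (worst case 16); SIMD or
 * hardware performs the multiplies in parallel instead. */
static int nibble_decode_search(uint32_t rng, uint32_t code,
                                const uint16_t cdf[17]) /* cdf[0]=0 .. cdf[16]=32768 */
{
    int s = 0;
    while (s < 15 &&
           code >= (uint32_t)(((uint64_t)rng * cdf[s + 1]) >> 15))
        s++;                     /* code lies above symbol s's subrange */
    return s;
}
```

With rng = 32768 the scaling is the identity, so a uniform 16-entry CDF decodes code = 5000 to symbol 2 (5000 / 2048).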

Quote:
    Secondly, the difference is nowhere near 7x when we benchmarked the two - rANS was faster, but by a factor of about 2.
For this benchmark and current implementations, rANS decoding is SEVEN times faster than the AV1 entropy coder; on ARM the scalar version is 5 times faster.
The AV1 nibble entropy coder is even slower than a bitwise range coder.

Quote:
    However, the requirement to buffer and reverse the symbols was unfortunately insurmountable.
This is only required for encoding, which is usually done in software; this irrelevant argument keeps coming up in their discussions.
The benchmark shows that TurboANXN, even with reverse encoding, is more than 4 times faster than the current AOMedia AV1 encoder.

Quote:
    Also keep in mind that AV1 adjusts the probabilities on a per-symbol basis.
    The entropy coder CDFs are designed to make adapting the probabilities very fast (with only adds and shifts).
    This puts some constraints on the design that don't exist in the linked benchmark (which uses fixed probabilities as far as I can tell).
The benchmark is using adaptive probabilities.
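For context, the sketch below shows one common adds-and-shifts CDF adaptation scheme of the kind described. It is an illustrative variant, not AV1's actual update code; CDF_TOP, cdf_adapt, and the rate parameter are assumed names:

```c
#include <assert.h>
#include <stdint.h>

#define CDF_TOP 32768  /* 15-bit probability precision (assumption) */

/* After coding symbol `sym`, move every interior CDF entry a fraction
 * 1/2^rate of the way toward the CDF of a distribution concentrated on
 * `sym`.  cdf[] is ascending with cdf[0] == 0 and cdf[nsym] == CDF_TOP;
 * only adds and shifts are needed, no multiplications. */
static void cdf_adapt(uint16_t *cdf, int nsym, int sym, int rate)
{
    for (int i = 1; i < nsym; i++) {
        if (i <= sym)
            cdf[i] -= cdf[i] >> rate;             /* pull down toward 0   */
        else
            cdf[i] += (CDF_TOP - cdf[i]) >> rate; /* push up toward top   */
    }
}
```

Each call widens the frequency slot of the symbol just seen (cdf[sym+1] - cdf[sym] grows) while keeping the CDF monotone and the total fixed.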

Quote:
    There is so much cleverness that goes on, even in decoders.
    And there are so many different kinds of parallelization (SIMD, ASIC, etc.) available.
    And a surprising number of decoders don't implement basic things like skipping non-reference frames when seeking, because the system layer and the decoder layer are not tightly coupled enough.
This is independent of entropy coding. Here we are comparing the AV1 entropy coder against rANS, and they are interchangeable.

Also, rANS is quite recent and highly optimised; it uses 32/64-bit arithmetic and SIMD instructions, while the Daala range coder uses only 16-bit arithmetic.
And you can fit between 2 and 4 single-cycle 16-bit multipliers in the same number of gates as one single-cycle 32-bit multiplier.
According to the AV1 source code, 32-bit operations are used. rANS is 32 bits only.

I think the decision against rANS is politically motivated (not-invented-here syndrome).
Otherwise, why not simply leave the (now removed) rANS version in the repository for comparison?
Hardware comparisons (complexity, energy consumption, cost, ...) are only possible after implementing optimised versions of both.

Note: we are considering here only adaptive rANS. Do not confuse this with the block-based ANS used in zstd, lzfse, lzturbo, ...
    Last edited by dnd; 27th June 2018 at 16:48.

  7. Thanks (3):

    JamesB (28th June 2018),Jarek (27th June 2018),SolidComp (28th June 2018)

  8. #185
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Entropy Coding Benchmark updated:

- New: TurboRC nibble range coder (similar to the AOMedia AV1 entropy coder)


  9. #186
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    967
    Thanks
    264
    Thanked 347 Times in 219 Posts
    Quote Originally Posted by dnd View Post
Hardware comparisons (complexity, energy consumption, cost, ...) are only possible after implementing optimised versions of both.
    AV1 HW engineers are among the smartest and most experienced there are. I'm pretty sure that such decisions are not made without a detailed analysis.

  10. #187
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    521
    Thanks
    196
    Thanked 186 Times in 127 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    AV1 HW engineers are among the smartest and most experienced there are. I'm pretty sure that such decisions are not made without a detailed analysis.
Perhaps, but not everyone is using hardware, and if there are fast methods in software (which runs on hardware) then perhaps it gives at least some indication of what could be achieved in hardware. I know there are big differences, such as how parallelisable it is and whether it uses complex or simple instructions, but if dnd's analysis above is correct, the AOM entropy coder is still slower than an old bitwise AC method, which we already know how to implement in hardware anyway.

Clearly I'm missing something, but from these results it looks like a very poor choice of codec.

  11. #188
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    815
    Thanks
    252
    Thanked 262 Times in 161 Posts
I have seen a comment somewhere that they focused on nibbles to avoid patents - which mainly cover methods working on binary EC.

Nearly complete replacement of the hardware on the market takes a decade - software optimization is crucial for that period.

  12. #189
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    AV1 HW engineers are among the smartest and most experienced there are. I'm pretty sure that such decisions are not made without a detailed analysis.
It would be nice to know which criteria and analysis the use of ANS in jpeg-xl is based on.

From my analysis, one reasonable criterion in AV1 is the use of 16-bit multiplications in the Daala range coder instead of the 32 bits required for ANS.
This is apparently a lot cheaper to implement in hardware.
The range coder in AV1 also requires less memory than ANS. ANS in jpeg-xl is based on alias tables.

Given that AV1 decoders will soon be shipping in smartphones, it is more reasonable to base future image/video codecs on the AV1 coders, to exploit the hardware coders.
Unlike jpeg-xl, AVIF is based on AV1.


    Quote Originally Posted by Jarek View Post
    Nearly complete replacement of hardware from the market needs a decade - software optimization is crucial for this period.
    Realtek 4K Set-top Box SoC RTD1319/RTD1311 to Support Android 10 on Android TV
    Last edited by dnd; 21st February 2020 at 23:54.

  13. #190
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    815
    Thanks
    252
    Thanked 262 Times in 161 Posts
    Quote Originally Posted by dnd View Post
    32-bits required for ANS
Very interesting - where did you get this nonsense?
For tANS, we need ~4x more states than the alphabet size to get redundancy down to noise level; redundancy drops ~4x for every 2x more states.
Let's say it is ~16x for rANS, so for a 256-symbol alphabet you would need a 12-bit state ... less for the 16-symbol alphabet used in the adaptive setting.
A 16-bit multiplication with 8-bit renormalization for rANS is much more than what is needed.
32 bits is usually used only because it doesn't matter on a CPU and allows 16-bit renormalization.

Even better, we can use multiplication by very low numbers - here is the redundancy in bits/symbol for a binary alphabet using multiplication by at most 2-bit numbers, also compared with Moffat's approximation for AC:



Quote Originally Posted by dnd View Post
    Less memory is also required for the range coder in AV1 compared to ANS.
RC and ANS decoders need exactly the same search: which subrange a given position corresponds to.
Alias is an optimization of that search into a table lookup and a single branch - it could also be applied to RC.
ANS requires an additional buffer for backward encoding - but only in the encoder, which already requires huge memory for performing all its analysis.

  14. #191
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
Well, I'm referring to current rANS (<= 16 symbols) CDF implementations, which are all based on a branchless 32-bit SIMD symbol search at decoding.
This is currently the only possibility to make rANS decoding fast; if you do a scalar symbol search, you lose the speed advantage over a range coder.
And as you can see in Turbo-Range-Coder, you can also implement fast interleaving and a symbol search in an RC that's faster than using division or reciprocal multiplication (see the benchmark).
The division argument used to justify the complexity and slowness of a multisymbol RC at decoding doesn't count anymore.


Alias tables are static and not ANS-specific. I've not tested this enough to have a definitive conclusion.
A bitwise RC can also be very fast and adaptive, compress better, and doesn't require large tables.


I'm not aware of any fast rANS implementation with less than a 32-bit state.
I'm only interested in practical implementations, not what might be possible in theory,
because there is often a big divergence between the two.
See facts and real numbers in the entropy coder benchmark.


In general a range coder is simpler and more natural to use than rANS; encoding is several times faster
and doesn't require buffering. You can easily combine bitwise and multisymbol RC.

  15. #192
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    815
    Thanks
    252
    Thanked 262 Times in 161 Posts
You have lots of 16-bit in https://github.com/jkbonfield/rans_static
Especially for a 16-symbol alphabet you could use a much lower state, down to ~8-bit multiplication if optimizing for hardware.
tANS usually uses an 11-12 bit state for a 256-symbol alphabet. It has 1-bit renormalization to reduce redundancy; in hardware this could also be done cheaply for a low-state rANS.

The rest is the same for RC and rANS: the search for the corresponding subrange, and probability adaptation.

Quote Originally Posted by dnd View Post
    You can easy combine bitwise and multisymbols RC.
The same goes for ANS - you can use the same state while varying the probability distribution (as done in the adaptive setting) or the alphabet size.

  16. #193
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    521
    Thanks
    196
    Thanked 186 Times in 127 Posts
    Quote Originally Posted by dnd View Post
Well, I'm referring to current rANS (<= 16 symbols) CDF implementations, which are all based on a branchless 32-bit SIMD symbol search at decoding.
This is currently the only possibility to make rANS decoding fast.
    It's currently the way to do fast rANS decoding *in CPU*.

    If you're looking at something which is optimised primarily for hardware, then your design can change radically.

PS. Personally I wouldn't put too much faith in my rans_static implementations. They're rather hacked-up messes, just for experimentation, to see how fast it can be driven. For production code we're not yet using any SIMD-based ANS, because it hasn't been the bottleneck in our pipelines anyway. The more stable code is https://github.com/jkbonfield/htscodecs, although I'm open to suggestions for improvement. I keep thinking I should do a CDF adaptive version, but in my own experiments it didn't massively win out over normal arithmetic or range coding, as the efficiency of rANS starts to get lost once you add higher-level modelling.

  17. #194
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
Quote Originally Posted by Jarek View Post
    You have lots of 16bit in https://github.com/jkbonfield/rans_static
RansState is defined as uint32_t.
The multiplication is 32-bit (32 bits for the RansState and 12-16 bits for the probability).


    The AV1 range coder is using 16x16 multiplications.


Quote Originally Posted by Jarek View Post
    The same for ANS - you can use the same state while varying probability distributions (done in adaptive) or alphabet size.
The stack buffer storing the probabilities in reverse order is more complex to maintain in a mixed bitwise and multisymbol adaptive encoder.


    Quote Originally Posted by JamesB View Post
    I keep thinking I should do a CDF adaptive version, but in my own experiments it didn't massively win out over normal arithmetic or range coding as the efficiency of rANS starts to get lost once you add higher level modelling in there.
    This is also what I've found in the Turbo-Range-Coder experiment.

  18. #195
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    815
    Thanks
    252
    Thanked 262 Times in 161 Posts
Ok, so let's look at the arithmetic of the rANS decoding step:
D(x) = (f[s] * (x >> n) + (x & mask) - CDF[s], s)

We have n bits of accuracy; e.g. for a 16-bit state and 10 bits of accuracy, we need a "10 bits times 6 bits -> 16 bits" multiplication.
A hardware implementation would use exactly such a multiplier, and we can reduce redundancy by using 1-bit renormalization (a count of leading zeros gives the number of bits).
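The decoding step above and its inverse encoding step can be sketched as follows. This is a toy, unrenormalized illustration (the 3-symbol alphabet and function names are assumptions, and a 64-bit state stands in for renormalization), not a production coder:

```c
#include <assert.h>
#include <stdint.h>

enum { N_BITS = 4, MASK = (1 << N_BITS) - 1 };
static const uint32_t f[3]   = { 8, 4, 4 };       /* frequencies, sum = 1 << N_BITS */
static const uint32_t CDF[4] = { 0, 8, 12, 16 };  /* cumulative frequencies         */

/* C(s,x): encoding step - applied to the symbols in REVERSE order */
static uint64_t rans_enc(uint64_t x, int s)
{
    return ((x / f[s]) << N_BITS) + (x % f[s]) + CDF[s];
}

/* D(x) = (f[s]*(x >> n) + (x & mask) - CDF[s], s): decoding step */
static uint64_t rans_dec(uint64_t x, int *s)
{
    uint32_t slot = x & MASK;          /* position within the low n bits */
    int sym = 0;
    while (CDF[sym + 1] <= slot)       /* which subrange holds slot?     */
        sym++;
    *s = sym;
    return f[sym] * (x >> N_BITS) + slot - CDF[sym];
}

/* Round trip: encode a message in reverse, decode it forward; returns 1
 * if every symbol is recovered and the state returns to its start value. */
static int rans_roundtrip(const int *msg, int len)
{
    uint64_t x = 1;
    for (int i = len - 1; i >= 0; i--)
        x = rans_enc(x, msg[i]);
    for (int i = 0; i < len; i++) {
        int s;
        x = rans_dec(x, &s);
        if (s != msg[i])
            return 0;
    }
    return x == 1;
}
```

Note how decoding walks the message forward even though encoding consumed it backward - the buffering and reversal live entirely on the encoder side.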

  19. Thanks:

    dnd (22nd February 2020)


