
Thread: Wikipedia war: ANS vs. Huffman

  1. #1
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

    Question Wikipedia war: ANS vs. Huffman

    There is a strange discussion (see the 2017 posts) about "Asymmetric Numeral Systems" that is becoming increasingly personal.
    While you can find articles about nearly everything on Wikipedia, including plenty of barely notable topics, there is a strange confrontation around this one.
    Until recently, the article about "Asymmetric Numeral Systems" was repeatedly prevented from being published.


    Are some editors at Wikipedia acting like dictators, or is this understandable objectivity?
    Last edited by dnd; 22nd July 2017 at 15:41.

  2. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    It looks like this is just one guy (Calbaer), whose dozens of Huffman papers can easily be found - apparently wanting to protect students from being exposed to Huffman's issues and the ways they are solved.
    Here is the original one: https://en.wikipedia.org/w/index.php...ldid=783217766
    The most important erasures:
    - the clear example of its suboptimality in the introduction, and the other entropy codings used to handle it,
    - the practical implementation - there was a link to Yann's, which he deleted.

    There seems to have been no reaction from any moderators.
    Last edited by Jarek; 22nd July 2017 at 16:36.

  3. Thanks:

    boxerab (24th July 2017)

  4. #3
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    141
    Thanks
    66
    Thanked 21 Times in 12 Posts
    This Calbaer fellow gets totally owned in that thread, but he continues to desperately cling to his ad-hominem attacks on Jarek and ANS.

    https://media.tenor.co/images/544514...752857d53c/raw

  5. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    ANS now has its own page which is totally justified due to large external use by many systems.

    The main problem with the Huffman coding article is that it glosses over the suboptimal nature by suggesting various techniques such as combining multiple symbols together if you have a small alphabet (not workable on large alphabets) or using RLE to offset this on large-alphabet systems. It really needs to just state where Huffman works well (usually within 1% or so of the entropy) and where it doesn't, with links to both arithmetic coding and ANS as alternative entropy encoders. Linking both is important so as not to show bias.

    The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^(-1) = 0.5, making the upper limit of inefficiency unbounded. Combining symbols together ("blocking") can be used to make Huffman coding as efficient as theoretically possible. A form of blocking called run-length encoding, in which lengths of repeated symbols are counted, is in particularly wide use.
    Actually that's worse than just glossing over it. "Blocking" cannot always be used to make Huffman coding as efficient as theoretically possible - rarely, in fact. It gets *closer* to the entropy limit, but it's still not the best possible.

    Edit: If we want the most extreme example of suboptimality, a binary stream consisting of just two symbols will always be left uncompressed by a prefix code, as it can only ever assign "0" and "1", leaving the data as-is (or possibly with negated bits). It's also clear with a larger alphabet: as a pure entropy encoder, a prefix code must spend at least 1 bit per symbol, so the maximum compression ratio for any individual symbol is 8:1 if we started with a byte-wise encoding. Discussing ways these limitations can be offset is still a useful thing to do, but it's moving away from a pure entropy encoder and into practical workarounds.
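
    To put some numbers on the binary case, here is a tiny Python sketch (the probabilities are made up purely for illustration) comparing the Shannon limit of a two-symbol source with the 1 bit/symbol a prefix code is forced to spend:

    Code:
    # Two-symbol source where one symbol has probability p. A prefix code must
    # spend >= 1 bit per symbol, while the Shannon limit is H(p) bits per symbol.
    import math

    def shannon_entropy(probs):
        """Entropy in bits per symbol."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    for p in (0.5, 0.9, 0.99, 0.999):
        h = shannon_entropy([p, 1.0 - p])
        print(f"p={p:<6} entropy={h:.4f} bits/symbol vs prefix code 1 bit/symbol "
              f"({1.0 / h:.1f}x the optimum)")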

  6. #5
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Blocking is rather for frequent symbols or small alphabets, but getting low redundancy (deltaH) this way is extremely costly - see the plot below for some small alphabets and grouping of m symbols:
    [Attachment: huffman.png - Huffman redundancy (deltaH) vs. the number m of grouped symbols, for several small alphabets]

    The deleted example of suboptimality was that a symbol of probability 0.99 carries only ~0.014 bits of information, while Huffman has to use 1 bit for it.
    Sure, grouping and techniques like RLE (which is also grouping: of identical symbols into a representation of their count) can help, but the basic universal way to solve Huffman's suboptimality is simply using a different entropy coding: one which directly handles fractional bits.
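
    As a rough, self-contained sketch of the blocking cost (my own toy Python example with made-up parameters, not the code behind the attached plot): build a Huffman code over blocks of m symbols from a skewed binary source and watch how slowly the per-symbol redundancy shrinks while the block alphabet grows as 2^m:

    Code:
    import heapq
    import itertools
    import math

    def huffman_lengths(probs):
        """Code lengths of an optimal prefix (Huffman) code for the given probabilities."""
        # heap items: (probability, tie_breaker, indices of the leaves under this node)
        heap = [(p, i, [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        tie = len(probs)
        while len(heap) > 1:
            p1, _, l1 = heapq.heappop(heap)
            p2, _, l2 = heapq.heappop(heap)
            for i in l1 + l2:          # merging two subtrees deepens all their leaves by 1
                lengths[i] += 1
            heapq.heappush(heap, (p1 + p2, tie, l1 + l2))
            tie += 1
        return lengths

    p = 0.9  # probability of the frequent symbol (assumed, for illustration)
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy per source symbol

    for m in range(1, 9):
        blocks = list(itertools.product([0, 1], repeat=m))  # all 2^m blocks of m symbols
        probs = [p ** b.count(0) * (1 - p) ** b.count(1) for b in blocks]
        lengths = huffman_lengths(probs)
        bits_per_symbol = sum(pr * n for pr, n in zip(probs, lengths)) / m
        print(f"m={m}  block alphabet={2**m:4d}  Huffman={bits_per_symbol:.4f} bits/symbol"
              f"  deltaH={bits_per_symbol - h:.4f}")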

  7. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    I find those plots hard to understand with immediacy (it takes time to interpret them correctly). I think far easier examples can be used to demonstrate the inefficiencies in the most extreme scenarios, along with workarounds.

    I added a section to the talk page specifically about the Optimality section, so we'll see what happens. Right now that section is simply factually incorrect, which should not be ignored in an encyclopedia.

  8. Thanks (2):

    Cyan (24th July 2017),Jarek (24th July 2017)

  9. #7
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Another one ... is encode.ru/su a reliable source?
    https://en.wikipedia.org/wiki/Talk:A...um_as_a_source

    He has a point - forums and blogs are generally not seen as reliable sources.
    Also, while preparing an information coding semester course I was searching for sources for the methods in use - for data compression I often ended up on blogs.
    Practical data compression is missing PR - maybe it is sometimes worth considering writing the methods down with documentation, e.g. as an arxiv paper ... or taking care of proper Wikipedia coverage.
    Last edited by Jarek; 28th July 2019 at 12:14.

  10. Thanks:

    Shelwien (28th July 2019)

  11. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    Unfortunately data compression algorithms are often too complicated to describe properly in a paper.
    Unless it's possible to turn program source into an arxiv.org paper with some minimal foreword in text.

    Also I think it's a Wikipedia issue, rather than ours.
    If websites are not reliable as a source, they should implement their own archive.org clone or something,
    which they would consider reliable.
    Discarding unofficial sources as unreliable is not really a solution.

  12. Thanks:

    Jarek (28th July 2019)

  13. #9
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Articles often describe (while avoiding details) much more complex methods - I believe the problem is rather a lack of motivation: for data compression practitioners a working program plus minimalist documentation is usually sufficient.
    Meanwhile, when looking e.g. for an overview of LZ methods, it is difficult to find a paper one could comfortably cite, e.g. in Wikipedia or an academic article.

    Sure, I also disagree here with Wikipedia's policy, especially knowing how peer review works: anonymous reviewers who are very far from objective.
    However, Wikipedia requires some universal set of rules - it is difficult, for various reasons, to give special treatment e.g. to the data compression community, though it might slowly go in this direction as blogs, forums, github, stack etc. have often become basic sources of information ... but generally Wikipedia is, and will probably remain, the first source in many situations.
    Unfortunately Wikipedia has mostly ground to a halt - there was momentum at the beginning, but now moderation is mainly done by lone enthusiasts ... one cannot count on moderators just doing what should be done - if we want to be properly represented, it is mostly in our hands as data compression people ... but it requires sources considered there not only as reliable, but as "notable".

    Update: "Links normally to be avoided": https://en.wikipedia.org/wiki/Wikipe..._to_be_avoided
    e.g.
    5. Individual web pages (...)
    10. Social networking sites (such as Myspace, Facebook, LinkedIn, and Instagram), chat or discussion forums/groups (such as Yahoo! Groups), Twitter feeds, Usenet newsgroups or email lists.
    11. Blogs, personal web pages and most fansites (negative ones included), except those written by a recognized authority. (This exception for blogs, etc., controlled by recognized authorities is meant to be very limited; as a minimum standard, recognized authorities who are individuals always meet Wikipedia's notability criteria for people.)

    For example, the discussion here concerned zhuff - mainly documented on Yann's blog - so the question is whether he counts as a "recognized authority".
    If we believe so, maybe he deserves a Wikipedia article - which would also confirm it.
    E.g. Matt Mahoney also doesn't have a Wikipedia article - I think he deserves one, and creating it would, by the way, give more credibility/notability to his sources.

  14. #10
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    We should dig practical data compression out of the underground - please expand:
    https://en.wikipedia.org/wiki/Draft:Yann_Collet

  15. #11
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by Jarek View Post
    Another one ... is encode.ru/su a reliable source? https://en.wikipedia.org/wiki/Talk:A...um_as_a_source He has some point - generally forums and blogs are often not seen as a reliable source. Also, while preparing information coding semester course I was searching for sources for used methods - for data compression often ending on blogs. Practical data compression is missing PR - maybe it worth sometimes to consider writing down the used methods with documentation e.g. as an arxiv ... or taking care of proper Wikipedia coverage.
    Err, the first time I was pointed at the method was in a 2015 (I believe) PCS paper on a practical image compression method based on ANS. I could certainly dig that out. PCS is a well-regarded conference, though probably not quite in the focus of the data compression folks. It is certainly peer-reviewed, but the conference was somewhat remote (Australia, back then), so it was probably hard to get there. If I may make a suggestion, seeing a paper in a top-rank data compression conference like the DCC would certainly be nice, and would add a valuable reference. I wouldn't waste my time with Wikipedia. It is not a scientific source; just ignore the problems there.

  16. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Thomas, there are a few issues here, starting with the fact that while peer-reviewed articles are the source for in-depth information, the first line of learning new stuff is often Wikipedia - for newcomers both from outside academia and from inside it.
    Also, there are many practical data compression people who just don't publish articles: often a blog with an executable plus minimalist documentation, maybe github.
    E.g. when searching for details of LZ approaches it is really tough to find peer-reviewed papers; in many cases the best source I could find was Matt's http://mattmahoney.net/dc/dce.html

    But these methods are still often widely used - does the lack of a peer-reviewed article mean that they are "not notable"?
    In the current Wikipedia discussion I merely changed "encode.ru" to "su" and got attacked because it is supposedly not a reliable source.
    Many important discussions, e.g. ANS-related ones, took place here; they are usually on archive.org - everything can be verified.
    It also concerns the blogs where e.g. Yann first presented LZ4, zhuff, Zstandard - are they a reliable enough source for Wikipedia, e.g. for confirming dates?

    But this is a two-sided issue. One side is the treatment of new popular media - not only data compression people often present their work via blogs, forums, github - are they just "unreliable" by definition?
    On the other side ... good documentation of the methods in use is often missing, especially in data compression.

  17. #13
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    767
    Thanks
    217
    Thanked 286 Times in 168 Posts
    FYI, we made an attempt to base brotli on ANS in early 2014, but we got ever so slightly worse results with it than with Huffman. Zoltan documented this effort in a four-page technical report, but the report was lost when we moved from code.google.com to github -- it was in the bug/issue tracking system of code.google.com, which was decommissioned.

    The main reason Huffman coding did better was the way the entropy codes were codified in these efforts. We used quite a few tricks to make the Huffman codes small, but stored the ANS probabilities with a fixed (or log-related) precision. In our experiments ANS was a tiny bit slower, needed to be encoded the wrong way around in blocks, had a larger working set and used more memory bandwidth in decoding, and compressed with slightly worse density. As much as I liked ANS, I did not want to add it to Brotli based on those experiments.

    ANS can be useful in one domain whereas it is less interesting in another. We used ANS with great success in brunsli at the time when we were developing brotli. Also, pik and jpeg xl use ANS -- with the context modeling histogram clustering borrowed from WebP lossless and brotli.

  18. Thanks:

    Jarek (28th July 2019)

  19. #14
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Yes, I remember this report with Zoltan, we had discussions about it - you have a large alphabet with low probabilities, where Huffman is usually nearly optimal.
    Kennon Conrad has an even larger alphabet in tree/glza grammatical LZ - Huffman gives practically Shannon there. Zstd also finally uses Huffman for literals.
    The largest gain is for skewed distributions; also, Huffman is impractical for adaptation.
    tANS is in Zstd and LZFSE, but lately rANS completely dominates - it can be adaptive, and is even faster ... even for order-1 models, as long as there are many independent sources.
    This GPU ~20GB/s tANS decoder seems promising; maybe tANS will also return for hardware compressors for IoT - especially since we can include some encryption in it.

  20. #15
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    I'm pretty sure tANS could be several times faster if it was implemented in SIMD with e.g. 16 interleaved streams. It works for rANS, but my focus on rANS was for entirely spurious reasons (it was closer to what I understood - range coding - and tANS was a bit mystifying so I went with what I knew despite tANS actually being the better fit for my goal).

    rANS has been in CRAM since 2014 where it appeared in the CRAM 3.0 file format specification: https://github.com/samtools/hts-spec...113cc3ed3bd27d

    Thus we have proof of it being used in the wild in 2014. I'm not sure they'd like a github link any better than a forum one though.

    PS. For what it's worth, I have tidied up description of the decoder process in the CRAM 3.1 specification, along with pseudocode: https://github.com/samtools/hts-spec...RAM_codecs.pdf
    I reimplemented it in JavaScript from the spec to validate it too: https://github.com/jkbonfield/htscod...ter/javascript

    It's surprisingly speedy still. JS engines have been heavily optimised.

    Edit: maybe also link to https://en.wikipedia.org/wiki/CRAM_(file_format) which details rANS usage in 2014. Is Wikipedia a sufficiently credible source for Wikipedia?

  21. Thanks:

    Jarek (30th July 2019)

  22. #16
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Thanks, so I have moved CRAM to 2014 in the list ... also zhuff to 2013 (Dec 23 on Yann's blog).
    Indeed, this ~500MB/s FSE/tANS decoding vs ~1500MB/s for your static rANS is a huge difference, which could probably at least be reduced, but such a speedup would need splitting into independent streams - which might be problematic e.g. for Zstd (?)
    There are also super-fast CPU implementations by Hamid, but I think they also use rANS (?).

    ps. Has there been any development regarding the MPEG-G situation? On their webpage ( https://mpeg-g.org ) the last event is reaching some draft on 29th March.

    Update: I see in http://datageekdom.blogspot.com/2019...weet-spot.html that CRAM 3.1 improved a lot for high compression ratios - congratulations!
    Last edited by Jarek; 30th July 2019 at 16:34.

  23. Thanks:

    JamesB (30th July 2019)

  24. #17
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Drifting off-topic somewhat.

    Regarding MPEG-G, it's pretty silent from the company side (GenomSys). They demoed it working seamlessly with various standard genomics tools over a year ago, but no one I know has heard a squeak out of them since. I wonder whether their demo was a mock-up of how it should be rather than how it actually is? I didn't see it myself and am not sure exactly what they claimed at the time.

    There is an opensource version of MPEG-G called Genie (https://github.com/mitogen/genie). It raises the awkward difference between open source and royalty-free, which most people don't stop to consider - they're completely orthogonal issues. I.e. if GenomSys get their patents then it may still get clobbered. From what I understand from speaking with the devs of Genie, it is top-notch for storing unaligned FASTQ data, primarily because it's built on top of existing (also top-notch) compression tools like Spring, but for aligned data so far it's only marginally ahead of the current CRAM 3.0 spec, so probably considerably behind the new one - although note that no one has demonstrated that yet. I haven't tried Genie recently, but a couple of months ago it wouldn't run on aligned data at all, so I couldn't benchmark it. Maybe if it now builds and runs I'll be able to test that myself and write a new blog entry.

    PS. My slides for the recent ISMB conference have a more recent pic of CRAM 3.1 performance (slide 18/19): https://www.slideshare.net/JamesBonf...ram-31-crumble

  25. Thanks:

    Jarek (30th July 2019)

  26. #18
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by Jarek View Post
    Thomas, there are a few issues here, starting with that while peer-reviewed articles are the source for in-depth information, the first line of learning new stuff is often Wikipedia - for newcommers both from outside academia, but also from inside. Also, there are many practical data compression people who just don't publish articles: often blog with exec plus minimalist documentation, maybe github. E.g. searching for details of LZ approaches it is really tough to find peer-review papers, in many cases the best source I could find was Matt's http://mattmahoney.net/dc/dce.html
    LZ has not been an area of active research for a couple of years now, but about 20 years ago or so I read a couple of interesting works on it that provided an in-depth analysis of multiple LZ variants. Something like that is needed to drive the scientific community. Concerning ANS, I have to admit that I know surprisingly little, except for the PCS work.
    Quote Originally Posted by Jarek View Post
    But these methods are still often widely used, does lack of peer-review article mean that they are "not notable"?
    It means that the scientific community is not well-aware of the method, at least not well enough to attract some research in this area - which is a miss.
    Quote Originally Posted by Jarek View Post
    In the current Wikipedia discussion I have just changed "encode.ru" to "su" and got attacked that it is not a reliable source.
    It is not. A discussion forum is not the same as a peer-reviewed article. No matter what the method is. A peer-reviewed article should provide all the details and sources to reproduce an experiment. Here, we are only discussing. That is good, and necessary, but not sufficient.
    Quote Originally Posted by Jarek View Post
    Many important e.g. ANS related discussions were performed here, they usually are on archive.org - everything can be verified.
    archive.org is not a reliable source either. It is just a junkyard of papers. Some good, some bad, but the lack of a review process makes it hard to decide which is which. Maybe these papers are reliable - maybe not. Who knows?
    Quote Originally Posted by Jarek View Post
    It also concerns blogs where e.g. Yann first presented LZ4, zhuff, Zstandard - is it a reliable enough source for Wikipedia e.g. for confirming dates?
    Nope. It's a good way to attract people, but a bad way to document work.
    Quote Originally Posted by Jarek View Post
    But this is a two-sided issue: both of the way of treating new popular media - not only data compression people often present their work using e.g. blogs, forums, github - are they just "unreliable" by definition?
    Yes, they are. They are unreliable because they haven't been cross-checked. That is the point. This doesn't make the method good or bad - it just leaves the question open. What I personally miss is a "reference paper" that gives an in-depth discussion, and a detailed analysis.
    Quote Originally Posted by Jarek View Post
    From the other side ... there is often missing good documentation of used methods, especially in data compression.
    Yes, indeed. I'm not saying that the traditional paper process is ideal. In fact, what I really miss is that papers do not ship algorithms that could be reviewed along with the text (this is quite a miss).

  27. Thanks:

    compgt (31st July 2019)

  28. #19
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    ANS can be useful in one domain whereas it is less interesting in another. We used ANS with great success in brunsli at the time when we were developing brotli. Also, pik and jpeg xl use ANS -- with the context modeling histogram clustering borrowed from WebP lossless and brotli.
    Yes, I certainly know, of course. Yet, at this point, I'm still skeptical whether that is the best possible method. Huffman certainly has weaknesses as well, but so much work went into arithmetic coding, and I wonder how it would compare to ANS in a serious study. At this point, it seems to me that XL has just too many entropy coding methods, where one(!) should be sufficient if parametrized correctly.

  29. #20
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Thomas, there is now a big distance between "reliable" academia with mainly peer-reviewed papers, and the "unreliable" practical data compression community with mainly github, blogs and forums - for example, the Linux kernel now uses LZ4, zstd and xz/LZMA (and gzip) - which of them came from academia?

    I tried a few times to publish ANS myself earlier (introduced in 2006 - PCS paper in 2015), or just to talk with anybody, e.g. while being a postdoc of the NSF Center for Science of Information for a year (2013-14) - no luck, especially since these people have often built their careers on Huffman and AC papers and have patents on them - it is very often in academics' interest to maintain the status quo and not acknowledge the new work of others, especially if it could reduce interest in their own past work. And such a stagnation incentive concerns other fields too, especially physics.
    My own peer-review experience includes a reviewer holding our paper for two years - unless we not only mentioned his unrelated paper, but wrote a chapter about it.
    Also, peer-reviewed papers often just state results - without providing any way to independently test these claims, including for reviewers.
    In contrast, practical data compression doesn't care about the subjective evaluation of anonymous reviewers, but just provides a working program - anybody can test it objectively, and if it is somehow superior it has a chance to become widely used (...especially if its author works for a big corp...).
    What really new practical data compression concepts has academia brought in the last decade or two? (e.g. BWT is from 1994). Hasn't it mainly become a club of mutual adoration, just giving prizes to each other?

    Regarding ANS, as usual it has both advantages and weaknesses:
    - compared to Huffman - it can get as close to Shannon as we want at similar cost (though in many cases Huffman is also nearly there), is practical for adaptation of probabilities (rANS) (...tANS can give simultaneous encryption), but needs a bit more resources (tANS - memory; rANS - a multiplication, which can be approximated with one or two additions), and has this backward encoding - e.g. the encoder needs an additional buffer,
    - compared to binary arithmetic coding, ANS cheaply operates on large alphabets - reducing the number of steps; e.g. nicely SIMD-able nibble (4-bit) adaptive coding became quite a popular replacement for binary AC (LZNA, Dropbox DivANS, Razor ... for some time AV1),
    - compared to large-alphabet AC/RC, which pessimistically requires alphabet-size multiplications per symbol - rANS requires just one instead (and it can be approximated with ~2 additions), at the cost of backward encoding (see the sketch below).
    Ok, ANS also has the cost of writing the final state, but it can be compensated by storing information in the initial state.
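
    To make the "one multiplication per decoded symbol" point concrete, here is a minimal, unoptimized range-variant ANS (rANS) sketch in Python - my own toy illustration in the spirit of the usual byte-wise rANS, not any particular production implementation; the frequencies are made up and must sum to a power of two so that the decoder's "mod total" becomes a mask:

    Code:
    # Toy rANS: 32-bit-style state, byte-wise renormalization, static frequencies.

    SCALE_BITS = 12
    TOTAL = 1 << SCALE_BITS      # frequencies must sum to TOTAL
    RANS_L = 1 << 23             # lower bound of the normalized state interval

    def build_tables(freqs):
        cum = [0]
        for f in freqs:
            cum.append(cum[-1] + f)
        assert cum[-1] == TOTAL
        slot2sym = [s for s, f in enumerate(freqs) for _ in range(f)]  # slot -> symbol
        return cum, slot2sym

    def encode(symbols, freqs):
        cum, _ = build_tables(freqs)
        state, out = RANS_L, []
        for s in reversed(symbols):                   # rANS encodes in reverse order
            f, c = freqs[s], cum[s]
            x_max = ((RANS_L >> SCALE_BITS) << 8) * f
            while state >= x_max:                     # renormalize: shift out low bytes
                out.append(state & 0xFF)
                state >>= 8
            state = (state // f) * TOTAL + (state % f) + c   # division only on the encoder side
        return state, out[::-1]                       # decoder consumes the bytes forwards

    def decode(state, data, freqs, n):
        cum, slot2sym = build_tables(freqs)
        pos, symbols = 0, []
        for _ in range(n):
            slot = state & (TOTAL - 1)                # state % TOTAL, no division
            s = slot2sym[slot]
            state = freqs[s] * (state >> SCALE_BITS) + slot - cum[s]   # the single multiplication
            while state < RANS_L:                     # renormalize: pull bytes back in
                state = (state << 8) | data[pos]
                pos += 1
            symbols.append(s)
        return symbols

    freqs = [2048, 1024, 512, 512]                    # assumed toy distribution (sums to 4096)
    msg = [0, 1, 0, 2, 3, 0, 1, 0]
    final_state, payload = encode(msg, freqs)
    assert decode(final_state, payload, freqs, len(msg)) == msg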
    Last edited by Jarek; 31st July 2019 at 12:00.

  30. Thanks (3):

    compgt (31st July 2019),Cyan (1st August 2019),rainerzufalldererste (25th August 2019)

  31. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    Some papers actually do exist: scholar.google.com
    Maybe there's something useful? Like https://arxiv.org/pdf/1402.3392.pdf ?

    > comparing to large alphabet AC/RC, it pessimistically requires alphabet size multiplications/symbol -
    > rANS requires just one instead (and it can be approximated with ~2 additions), at cost of backward encoding.

    Actually AC is compatible with the same counter update methods as rANS -
    either two additions with a "frequency table", or vector multiplication, like
    what fgiesen used for LZNA nibble coding.

    AFAIK, the number (and type) of operations required for RC and rANS are the same,
    and most speed optimization tricks apply to both - although RC encoding is more complicated
    because of carry handling. It's just that nobody has implemented a vector RC with optimizations for modern CPUs.

    tANS, though, is much harder to evaluate - yes, it provides an easier way to build a state machine
    for the coding process, but the same was successfully done for AC too (with all the Q-coders) -
    so the main benefit of tANS may just as well be the fact that it's relatively new and has fewer patent issues.

  32. #22
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Academia officially acknowledges only (anonymously) peer-reviewed papers, and for ANS the first one was this 2015 PCS paper (9 years after its introduction) - when I got help from Edward Delp's group, more than a year after its advantages had been shown by "unreliable" practitioners, and it was our second attempt (rejected from ISIT a few months earlier).
    And half a decade later ANS is still treated as some data compression underground by the information theory community, often pretended not to exist, as in this NSF information theory center I was a postdoc of, or in this Huffman Wikipedia article - changing the status quo, against authorities guarding their own work, literally takes decades.

  33. Thanks:

    compgt (31st July 2019)

  34. #23
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    For Wikipedia, I think the main text citations should ideally be published documents: scientific papers or sometimes recognised online news outlets. arXiv.org seems a good placeholder for papers until they're formally published, and peer-review-wise it's no different from a recognised media outlet (some of which are dubious).

    For other assertions, I think it's fine to make them (e.g. dates) and, if one gets challenged, just have comments in the talk section. It's not super critical to have links all over the shop to back up every minor claim, and they're liable to get lost anyway. Better not to have them than to suffer a broken dangling link. As for the validity of other sources, for the purpose of backing up your claims should someone query them, we don't *have* to have scientific papers. If something is good enough as prior art in a patent dispute then it's certainly good enough for an encyclopedia entry! People claiming otherwise are just trolls.

  35. #24
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Quote Originally Posted by Shelwien View Post
    Some papers actually do exist: scholar.google.com
    Maybe there's something useful? Like https://arxiv.org/pdf/1402.3392.pdf ?

    > comparing to large alphabet AC/RC, it pessimistically requires alphabet size multiplications/symbol -
    > rANS requires just one instead (and it can be approximated with ~2 additions), at cost of backward encoding.

    Actually AC is compatible with the same counter update methods as rANS -
    either two additions with a "frequency table", or vector multiplication, like
    what fgiesen used for LZNA nibble coding.

    AFAIK, the number (and type) of operations required for RC and rANS are the same,
    and most speed optimization tricks do apply to both - although RC encoding is more complicated
    because of carry handling. Just that nobody implemented a vector RC with optimizations for modern cpus.
    One problem I see with RC is the division. Certainly one of the divisions can use lookup tables (range/totFreq), or be factored out if you're using static probabilities, which is what my arith_static code did (written literally days before I became aware of ANS, duh!). The other division in the decoder is a full 32-bit division, and it'd either need a lookup table with 4 billion entries or have to use a smaller range size and suffer potentially poorer compression ratios. Maybe I'm missing something though.

    The divisions in rANS are like the range/totFreq one - doable via lookup tables or precomputed in a static world. Thus it seems faster in a multi-symbol-alphabet world.
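
    As an aside on the "precomputed in a static world" part, here is a generic multiply-and-shift sketch in Python (the classic division-by-invariant-integer trick, with made-up numbers; it is not taken from arith_static or any particular codec) showing how a division by a frequency that is known in advance can be turned into a multiplication and a shift:

    Code:
    def reciprocal(f, bits=32):
        """Return (mult, shift) such that (x * mult) >> shift == x // f for 0 <= x < 2**bits."""
        shift = bits + (f - 1).bit_length()    # enough precision for all x below 2**bits
        mult = (1 << shift) // f + 1           # rounded-up fixed-point reciprocal of f
        return mult, shift

    f = 771                                    # some static symbol frequency (assumed)
    mult, shift = reciprocal(f)

    # spot-check against the real division
    for x in (0, 1, f - 1, f, 123456789, (1 << 32) - 1):
        assert (x * mult) >> shift == x // f

    In a static model the (mult, shift) pair would simply be stored per symbol next to its frequency.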

    Bitwise arithmetic coding however is certainly doable in a SIMD fashion with interleaved bit-streams. I just expect some ^&*%hole has patented the bleeding obvious already.

  36. #25
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Quote Originally Posted by Shelwien View Post
    AFAIK, the number (and type) of operations required for RC and rANS are the same,
    and most speed optimization tricks do apply to both - although RC encoding is more complicated
    because of carry handling. Just that nobody implemented a vector RC with optimizations for modern cpus.
    Usually an RC decoder has to pessimistically multiply the current width by all the cumulative frequencies (as in AV1) - can that be avoided?
    Also, the RC state consists of two values - making vectorization harder than for the single rANS value, but probably doable.
    Quote Originally Posted by Shelwien View Post
    tANS though is much harder to evaluate - yes, it provides an easier way to build a state machine
    for the coding process, but same was successfully done for AC too (with all the Q-coders) -
    so the main benefit of tANS just as well may be the fact that its relatively new and has less patent issues.
    Creating an automaton for AC is much more costly, as you need to quantize two values (instead of one in ANS) - the number of states grows with the square of the range size (instead of the first power for ANS).
    Hence in practice it is done only for the binary alphabet - do you have an AC automaton for a larger one?
    For tANS it is cheap, the redundancy drops with (alphabet size / number of states)^2, and you can add a scrambler for simultaneous encryption - perfect e.g. for IoT.
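
    To give a feel for how cheap the tANS table construction is (the work and the table are linear in the number of states L), here is a toy decoding-table build in Python - my own simplified sketch of the usual FSE-style procedure, with an arbitrary spread and made-up counts, not any particular implementation:

    Code:
    # Toy tANS decoding-table construction: counts[s] must sum to L (a power of two).

    L = 16                        # number of states (kept tiny for readability)
    R = L.bit_length() - 1        # log2(L)
    counts = [8, 5, 2, 1]         # assumed symbol counts, summing to L

    # 1) spread the symbols over the L table slots (simple FSE-like step spread)
    step = (L >> 1) + (L >> 3) + 3
    spread, pos = [None] * L, 0
    for s, c in enumerate(counts):
        for _ in range(c):
            spread[pos] = s
            pos = (pos + step) % L

    # 2) build the decode table: in state X we output spread[X], read nb_bits bits
    #    and jump to new_x_base + the value of those bits
    decode_table = []
    next_occurrence = list(counts)              # x_s runs over [counts[s], 2*counts[s])
    for X in range(L):
        s = spread[X]
        x_s = next_occurrence[s]
        next_occurrence[s] += 1
        nb_bits = R - (x_s.bit_length() - 1)    # bits needed to bring x_s back into [L, 2L)
        new_x_base = (x_s << nb_bits) - L
        decode_table.append((s, nb_bits, new_x_base))

    for X, (s, nb, base) in enumerate(decode_table):
        print(f"state {X:2d}: emit symbol {s}, read {nb} bit(s), next-state base {base:2d}")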

  37. #26
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by Jarek View Post
    Thomas, there is now a big distance between "reliable" academia with mainly peer-reviewed papers, and "unreliable" practical data compression society with mainly github, blogs, forums - for example Linux kernel now uses LZ4, zstd and xz/LZMA (and gzip) - which of them came from academia?
    There are many answers. First, let me state that I'm more into image compression than universal compression. As far as universal compression is concerned, I have to admit that I'm completely ignorant (but not in a negative way) of how far LZ4 variants provided new ideas or were just incremental improvements over existing LZ variants. This said, I believe it is quite a miss *NOT* to see such papers at conferences like DCC. Which, nowadays, are also mostly taken hostage by "standards-driven" development, i.e. incremental research on HEVC, VVC and the like. There is more than one reason for that, of course, but the main reason may be that conferences live from the papers they receive, and the researchers who pay for the conferences need to live from research money, and research money is nowadays (as sad as it is) mostly industry money, and industry money in this area mostly comes from such standards-driven development, where the industry behind it believes it will make some money from the inventions of its egg-heads. So it's not exactly the "fault" of academia, in that sense. You can only help to make it better.
    Quote Originally Posted by Jarek View Post
    I have tried a few times to publish ANS myself earlier (introduced in 2006 - PCS paper in 2015), or try to just talk with anybody e.g while being a postdoc of NSF Center for Science of Information for a year (2013-4) - no luck, especially that these people often have built their careers on Huffman and AC papers, have patents on that - it is very often in academia people interest to maintain the status quo, not acknowledging new work of others - especially if it could reduce interest in own past work. And such stagnation incentive concerns also other fields, especially physics.
    Look, I can of course only speak for myself, but I would love to see some serious, peer-reviewed research papers on this topic. While I can certainly not promise anything, I would expect that DCC would welcome such work, and I'm a bit "fed up" with incremental research on VVC and friends. You probably started at the "wrong end". Getting something "non-standard" into a journal is a very high bar (again, it is just human nature to be skeptical about something new). You don't convince anyone of new ideas, the enemies just die out... (-; I would start with conferences, and from there work your way up. It is a long road, but it may be worthwhile to meet some interesting folks and get into discussion with them.
    Quote Originally Posted by Jarek View Post
    My example of peer-review experience is reviewer holding our paper for two years - unless we not only mention his unrelated paper, but write a chapter about it. Also, peer-reviewed papers often just write results - without providing any way to independently test these claims, including for reviewers.
    A typical problem of the page limit. The typical 4 pages at ICIP or PCS are not sufficient to tell much, unfortunately. Yet, if I read a paper, I try to understand what is going on, and as I had to look through some papers this morning, I can also tell that I'm quite critical if the experimental setup is inadequate. Try to look for conferences which give you more room to tell a story. ICIP is just too big for many things; its idea is more to get an impression of what is going on. PCS is a bit more "private".
    Quote Originally Posted by Jarek View Post
    In contrast, practical data compression doesn't care about subjective evaluation of anonymous reviewers, but just provide working program - anybody can objectively test, and if it is somehow superior it has a chance to become widely used (...especially if working for a big corp...). What really new practical data compression concepts has academia brought in the last decade or two? (e.g. BWT is 1994). Didn't it mainly become a club of mutual adoration just giving prizes to each other?
    HEVC and VVC, to mention some. It is a completely different area (and again, not exactly my area, but never mind). It's worth something. One can certainly criticize the process, but it is not "nothing". Entropy coding (as in CABAC) is only a small part of it, of course; the biggest part is data modeling. If it helps at all, I did not receive any prize from anyone, and I'm not keen on one. I just love what I do, and make a living from it, which is the best prize I could ever hope for.
    Quote Originally Posted by Jarek View Post
    Regarding ANS, as usually it has both advantages and weaknesses: - comparing to Huffman - it can get as close to Shannon as we want at similar cost (but in many cases Huffman is also nearly there), is practical for adaptation of probabilities (rANS) (...tANS can give simultaneous encryption), but needs a bit more resources (tANS - memory, rANS - multiplication, can be approximated with one or two additions), and has this backward encoding - e.g. encoder needs additional buffer. - comparing to binary arithmetic coding, ANS cheaply operates on large alphabets - reducing the number of steps, e.g. nicely SIMD-able nibble (4bit) adaptive became quite popular replacement for binary AC (LZNA, Dropbox DivANS, Razor ... for some time AV1), - comparing to large alphabet AC/RC, it pessimistically requires alphabet size multiplications/symbol - rANS requires just one instead (and it can be approximated with ~2 additions), at cost of backward encoding. Ok, ANS has also cost of writing the final state, but it can be compensated by storing information in the initial state.
    So, sorry for the naive question, but one of the big advantages of AC is that it is so simple to implement context modeling with it. Adaptive Huffman is much harder than adaptive AC - which is the real improvement (the coding efficiency is usually used to "sell" the story - true, but not quite the best argument for it).

  38. Thanks (2):

    compgt (1st August 2019),introspec (1st August 2019)

  39. #27
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    There are definitely papers missing on universal compressors - starting with those which are actually used: good documentation with details of the solutions, analysis, discussion.

    Indeed, getting something non-standard through peer review is extremely difficult - and requires orders of magnitude more work.
    Hence the mainstream is mainly variations on known ideas - AVC/HEVC/VVC are mostly incremental extensions: more split modes, inter-prediction directions, deblocking filters, parallelization ...
    The revolution is coming with ML-based compressors; an intermediate step might be predicting the parameters of parametric distributions - e.g. this https://arxiv.org/pdf/1906.03238 . I don't feel motivated to work alone on repeating my ANS story. I unsuccessfully tried to find collaboration in data compression for a few years - in and outside Poland, in academia and industry ... it ended with accidentally finding G's patent after 3.5 years of my help.
    I have given up working in this field - maybe I will return from time to time to write an arxiv paper on something I believe in, but I don't care whether some anonymous reviewer lets it through - only whether it can find real applications.
    ANS evolves on its own, including e.g. a few papers from Moffat, but it is really hard to analyse completely theoretically - especially tANS.
    MPEG will rather stay with CABAC due to patents. In contrast, there is recently a growing number of ML-based image compression papers using rANS.

    Regarding adding context modelling, everything is now also available for ANS: order-1 Markov with up to 1GB/s/core decoding, nibble-wise adaptive at ~100MB/s/core ... and you can easily switch between models - e.g. AV1 at some point had interchangeable RC and rANS nibble-wise coding.

  40. #28
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    75
    Thanks
    26
    Thanked 0 Times in 0 Posts
    That's the problem if you're a virtually unknown nobody. It will be hard to reach standards bodies, as Jarek noted, when he was just starting to develop ANS. Jarek was probably virtually unknown at that time. It's like David Scott actually having to write C/C++ programs to prove his ideas. Good thing ANS proved itself mainstream, thanks to Jarek.

  41. #29
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    233
    Thanked 235 Times in 145 Posts
    Sadly, science is idealistic only from a naive outside view. In practice it is made of people guarding their own interests, e.g. through the peer-review system anonymously deciding what is relevant and what is not - authorities protecting the importance of their old work (which their position is based on), young researchers having to glorify the old ways to get a position. New ways of thinking, especially those endangering the old ones, literally require a generation change in these clubs of mutual adoration.

    If we want data compression practitioners to also have some relevance/authority, we need to act.

  42. #30
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    870
    Thanks
    471
    Thanked 264 Times in 109 Posts
    Just to provide some perspective, here is an interesting book offering a stark view of what happens to scientists who try to frontally challenge a dominant theory (in this case, quantum physics as per Bohr's Copenhagen interpretation).

    So there is nothing new here; it has always been this way. Maybe the disappointing part is how easily we are taught in school how "stupid" our pre-industrial ancestors were, for rejecting Copernicus's heliocentric model for example, with the implicit message that we are so much better now. Only to discover later that, actually, nope, today is not that different.

    But don't worry, Jarek. History also tells us that, sooner or later (and rather later), the better solutions stand out, and their authors are acknowledged for them. ANS already has practical applications, so that threshold has been passed; it's now merely a matter of time. In the words of Max Planck, just "wait for the opponents to eventually die out".
    Last edited by Cyan; 1st August 2019 at 18:39.
