# Thread: Comprehensive wiki for data compression

1. ## Comprehensive wiki for data compression

Hi all – It seems like it would be a good idea to have a comprehensive wiki for data compression: a site that includes in-depth articles on every subtopic, encoding method, and codec implementation, plus benchmarks, the history and evolution of certain approaches, etc.

Would any of you be willing to help deploy the wiki? I'd need some help to get it deployed by, say, July. If I have to do it by myself, it will take longer.

I already bought two applicable domains – compression.wiki and compress.wiki. They'll probably cost about $25 per year each to renew. I think the total cost of hosting the wiki site, including domain registration, would be $100-200 per year. I assume a donation link on the site will be sufficient to take care of it, maybe with a live "balance needed" display or something.

Anyway, who's willing to help?

Cheers,

Joe Duarte
https://www.researchgate.net/profile/Joe_Duarte

2. ## Thanks (3):

boxerab (6th June 2017), Gonzalo (21st April 2017), Jarek (6th June 2017)

3. I guess you can take some stuff from https://web.archive.org/web/20160620...m/authors.html

Also you can save on hosting by letting me or webmaster host it.

But the main question is what exactly is your point in this. It certainly won't be profitable - you won't even recover hosting/domain costs from donations/ads.
Also, Wikipedia already has plenty of articles about data compression - https://en.wikipedia.org/wiki/Catego...ion_algorithms .
The only (imho) important thing missing from Wikipedia is the "who did what" list like Sami's site had, because most compression developers
are not popular enough for Wikipedia. Otherwise there's not that much information really; it's mostly covered by http://mattmahoney.net/dc/text.html etc.

There are also problems with copyrights (for example, there's a DCC paper archive and decompiled sources of various codecs, but we can't make them easily available),
and it's hard to say whether to include audio/video compression (because general data compression would easily drown in it, like happened with compression.ru),
and how to describe the algorithms (e.g. Python is useless for data compression).

4. Wikipedia has a few drawbacks for scientific and technical info. One of them is its policy about "notability".
See this for an example:
https://en.wikipedia.org/wiki/Wikipe...letion/NanoZip

Much of the interesting things in data compression happen on the alpha side of development. In Wikipedia, you basically have to wait until it is yesterday's news to publish it.

5. ## Thanks:

SolidComp (27th April 2017)

6. I found that wiki page: http://nishi.dreamhosters.com/u/nanozip.txt
Can't say that it would have been better to keep it.

7. you can compare that to the first version of freearc page: https://en.wikipedia.org/w/index.php...ldid=308460316

...And then Sami appeared in deletion talk and killed it, by using his own benchmark site as proof of nanozip's performance.

9. I agree with you, but you are missing the whole point. It doesn't matter whether it was written as an advertisement, because that wasn't the reason it got kicked off. It was the lack of notability. And that's OK, because what's "notable" for the whole world differs a lot from what is notable for a person in the data compression world.
Just try to add an entry about any ground-breaking, not-thoroughly-tested program in its beta version, like some of those present here in the forum. Your reflate, for example, or GLZA. I would love to have an article explaining data re-compression in depth, but that's not gonna happen any time soon inside Wikipedia...

10. Why not just host it on GitHub or GitLab? They both have free wikis, and most people around here probably have at least a GitHub account already. Best of all, while both sites let you edit it through a web interface, you can also use git to download the whole wiki and edit it locally (it's just a bunch of markdown documents). Not only is that a nice feature for working with the repo, but it helps make backups really easy in case GitHub/GitLab ever goes away.

The only real disadvantage I see is that (AFAIK) neither supports embedding equations directly into the page, though you can use iTex2Img (or something similar) and embed an image in the wiki.

11. ## Thanks:

12. Originally Posted by nemequ
Why not just host it on GitHub or GitLab? They both have free wikis, and most people around here probably have at least a GitHub account already. Best of all, while both sites let you edit it through a web interface, you can also use git to download the whole wiki and edit it locally (it's just a bunch of markdown documents). Not only is that a nice feature for working with the repo, but it helps make backups really easy in case GitHub/GitLab ever goes away.

The only real disadvantage I see is that (AFAIK) neither supports embedding equations directly into the page, though you can use iTex2Img (or something similar) and embed an image in the wiki.
I think something like wikia would be a better choice as it doesn't require registration.

13. Technically neither would GitHub/GitLab. You don't have to use the web API; you could easily accept contributions with git send-email, diffs posted as attachments here (or anywhere else), etc.

Accepting unregistered edits may not be a good idea anyways… it tends to make abuse easier.

Personally, I'd feel a lot more comfortable with a DVCS where lots of people have copies instead of trusting everything to one site. Wikia would be safer than setting up something custom, but I'm not excited about the idea of a bunch of ads plastered all over the place… some company offering spyware with a veneer of compression software could easily buy some ad space and make it seem like an endorsement.

That said, I probably wouldn't contribute much to a wiki like this (I already spend more than enough time writing documentation), so I'm not sure why anyone should care about my opinion here.

14. ## Thanks:

15. Equations can be easily supported, LaTeX-style, if you use the GitHub repo as the markdown source for a pandoc-generated github.io site (
pandoc -f markdown+tex_math_dollars+simple_tables --mathjax -t html5
).

16. I'm not sure where exactly the line between a wiki and an open-source, collaboratively developed web site is, but that seems pretty firmly in the latter camp. Jekyll + GitHub Pages may be a bit closer, since GitHub will automatically regenerate the HTML, but I think that's still unacceptably complex… You would also lose the web-based wiki editor if you used something like pandoc or Jekyll.

17. Originally Posted by Shelwien
I guess you can take some stuff from https://web.archive.org/web/20160620...m/authors.html

Also you can save on hosting by letting me or webmaster host it.

But the main question is what exactly is your point in this. It certainly won't be profitable - you won't even recover hosting/domain costs from donations/ads.
Also, Wikipedia already has plenty of articles about data compression - https://en.wikipedia.org/wiki/Catego...ion_algorithms .
The only (imho) important thing missing from Wikipedia is the "who did what" list like Sami's site had, because most compression developers
are not popular enough for Wikipedia. Otherwise there's not that much information really; it's mostly covered by http://mattmahoney.net/dc/text.html etc.

There are also problems with copyrights (for example, there's a DCC paper archive and decompiled sources of various codecs, but we can't make them easily available),
and it's hard to say whether to include audio/video compression (because general data compression would easily drown in it, like happened with compression.ru),
and how to describe the algorithms (e.g. Python is useless for data compression).
Yo Shelwien, it never occurred to me that it could be profitable, and that definitely isn't a goal. I'm pretty sure we could recover the hosting costs – what makes you think that would be hard?

Wikipedia has major limitations for this, as Gonzalo said. We'll just have to deal with whatever copyright issues come up – that won't be a problem for most content.

When you say that you could host it, what site are you referring to? Or do you have a hosting service?

18. Originally Posted by nemequ
Why not just host it on GitHub or GitLab? They both have free wikis, and most people around here probably have at least a GitHub account already. Best of all, while both sites let you edit it through a web interface, you can also use git to download the whole wiki and edit it locally (it's just a bunch of markdown documents). Not only is that a nice feature for working with the repo, but it helps make backups really easy in case GitHub/GitLab ever goes away.

The only real disadvantage I see is that (AFAIK) neither supports embedding equations directly into the page, though you can use iTex2Img (or something similar) and embed an image in the wiki.
I think GitHub wikis are pretty bad. The design is confusing, and the tables of contents are strange. They also have a major bug: their tables of contents generally don't work on mobile. If you tap on a heading in a TOC on Chrome for Android, for instance, nothing happens. The anchor links don't work on mobile, and this has been a known bug for years. That's a pretty severe bug that should've been fixed in a week.

Equations are extremely important.

(I'm not 100 percent certain that the TOC links on their wiki pages have the same problems as on the normal repo readme pages, but the fact that they have such massive bugs annoys me.)

I know less about GitLab. I should take a look at their wiki.

19. Originally Posted by m^3
I think something like wikia would be a better choice as it doesn't require registration.
The problem I have with Wikia is that it's a brutal website – it's just so heavy and slow, probably because of too many HTTP requests and ads galore. On mobile, it's a terrible experience trying to look up a Game of Thrones character or something. It's just so slow to load and strains the browser.

So I should probably make this clear: the compression wiki needs to be a damn good website, clean and fast on all platforms. Most websites suck, and I want it to not be like most websites.

20. > Yo Shelwien, it never occurred to me that it could be profitable, and that definitely isn't a goal.

Once upon a time, around 1995-2005, it could be.

> I'm pretty sure we could recover the hosting costs – what makes you think that would be hard?

Experience with this forum, for example (and other related sites).
Of course, you might have some friends who would donate $100 at once, but you certainly won't get 20 x $5 from random people.
With something like nanozip, maybe, but not with a wiki site.

> When you say that you could host it, what site are you referring to? Or do you have a hosting service?

I'm already paying for a hosting service anyway (dreamhost), so I have no problem using their NS,
and I'd make an account for you.

21. Originally Posted by nemequ
Technically neither would GitHub/GitLab. You don't have to use the web API; you could easily accept contributions with git send-email, diffs posted as attachments here (or anywhere else), etc.
This doesn't sound like a serious proposal.

Originally Posted by nemequ
Accepting unregistered edits may not be a good idea anyways… it tends to make abuse easier.
Not easier, just saves a couple of minutes.

Originally Posted by nemequ
Personally, I'd feel a lot more comfortable with a DVCS where lots of people have copies instead of trusting everything to one site. Wikia would be safer than setting up something custom, but I'm not excited about the idea of a bunch of ads plastered all over the place… some company offering spyware with a veneer of compression software could easily buy some ad space and make it seem like an endorsement.
Originally Posted by SolidComp
The problem I have with Wikia is that it's a brutal website – it's just so heavy and slow, probably because of too many HTTP requests and ads galore. On mobile, it's a terrible experience trying to look up a Game of Thrones character or something. It's just so slow to load and strains the browser.
That's indeed a major concern. Whatever ads they have, they don't get past my guards, so I wasn't aware of it having any advertisements, let alone being plastered with them. And it's super-fast here too.

22. I'm not sure if anyone is still interested in this, but I just found out that GitLab supports equations (using KaTeX).

23. ## Thanks (2):

pothos2 (23rd May 2017), schnaader (29th May 2017)

24. Originally Posted by nemequ
I'm not sure if anyone is still interested in this, but I just found out that GitLab supports equations (using KaTeX).
Interesting. I've been playing with KaTeX and MathJax. I probably like KaTeX more because it's easier to generate the HTML for the equation ahead of time, rather than forcing users to download a big JS library that renders the equation on the fly, a typically terrible pattern of web development. I'm still trying to understand how they make equations look so good with just HTML and CSS, and no MathML (which only Firefox and Safari support).

Does GitLab support custom domains?

25. While I completely understand the need for a comprehensive wiki for data compression (e.g. benchmarks with generally agreed objectivity, lesser-known compressors, some of their details), a more basic and complementary activity is improving Wikipedia, which already exists and is widely used - and where the data compression coverage is very far from perfect. Maybe let us also discuss this issue here ... and try to bring some improvements and updates there.

Here are some of my remarks (any others?):

- the template is a complete mess ( https://en.wikipedia.org/wiki/Templa...ession_methods ), mixing everything: methods with actual compressors, predictors with transformations. I wrote a comment a few months ago ( https://en.wikipedia.org/wiki/Templa...ession_methods ) but there was no response (generally, data compression seems to be ignored by editors),

- missing articles and updates, like a general article about repacking - discussing the growing set containing e.g. packJPG, mozjpeg, lepton, guetzli,

- guarding existing articles from vandalism - maybe specialists in some topics should have them on their watchlists (checked occasionally). For example, I currently have a "discussion" regarding https://en.wikipedia.org/wiki/Huffman_coding - while at its top there was a clear warning, with an example, about its suboptimality and the popular ways used to repair it (AC and ANS), a person with many Huffman articles is just trying to hide this information by replacing this paragraph with the extremely innocent "Huffman coding is not always optimal among all compression methods",

- maybe deletion of suspicious articles (e.g. https://en.wikipedia.org/wiki/Statis...el%E2%80%93Ziv ?), or merging some, e.g. "arithmetic-range coding", "PPM-DMC-CTW", "dozens of LZ variants" (which often don't even make a clear distinction between them - maybe there should be additional collective articles clearly specifying these differences?).

26. Agreed that improving Wikipedia would reach the widest audience. Some details may be too technical, but we also have Matt's excellent pages.

That Huffman article is particularly confusing. "Although Huffman's original algorithm is optimal for a symbol-by-symbol coding..." is followed by (correctly) "Huffman coding is optimal when each input symbol is a known independent and identically distributed random variable having a probability that is the inverse of a power of two". The first statement clearly means the choice of bit-codes is optimal, but not that bit-codes themselves are optimal.

27. Actually, a Huffman code is not an optimal bitcode.
It's an optimal _prefix_ bitcode.
I have a working implementation of a non-prefix bitcode which provides better compression than Huffman
(with escape codes and backtracking).

Btw, I installed mediawiki: http://ctxmodel.net/w/index.php?title=Special:Log

28. ## Thanks:

Jarek (7th June 2017)

29. Shelwien, could you give a simple example of a better than Huffman bitcode?

30. As I said, it's a code without the prefix property.
I can generate an example if you really want, but a comprehensible step-by-step description would take some work.
For now, I found this: https://encode.su/threads/1116-Charl...ll=1#post22277

My implementation started with a Huffman code and then iteratively replaced Huffman codes with shorter codes that weren't observed
in the block (so the additional compression comes from higher-order statistics).
Thus encoding is very slow, but decoding is just a little slower than Huffman.

The decoder has to work kind of like a recursive parser for a programming language.
At some point it will encounter multiple matches, so it has to try every choice
until it can parse to the end of the block without meeting any escape codes.

P.S.: Added an example, but I doubt that you'd be able to see anything in it.
Unfortunately it's a very old experiment (~1999), so I only have DOS implementations of it.
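The "recursive parser" decoding described above can be sketched in a few lines. This is a toy reconstruction with an invented code table ({0, 01, 11}), not the thread's actual codes or escape mechanism:

```python
# Toy non-prefix code: '0' is a prefix of '01', so a left-to-right decoder
# sometimes has to guess and backtrack. The set {0, 01, 11} is still
# uniquely decodable, just not prefix-free.
CODES = {'x': '0', 'y': '01', 'z': '11'}

def decode(bits, pos=0, out=()):
    """Try every code that matches at `pos`; backtrack on dead ends.
    Returns the decoded string, or None if no full parse exists."""
    if pos == len(bits):
        return ''.join(out)
    for sym, code in CODES.items():
        if bits.startswith(code, pos):
            parsed = decode(bits, pos + len(code), out + (sym,))
            if parsed is not None:
                return parsed
    return None  # dead end: caller tries the next candidate code

# 'yz' encodes to '01'+'11' = '0111'; the decoder first tries 'x' for the
# leading '0', hits a dead end, backtracks, and recovers 'yz'.
```

In the real scheme the dead-end signal would be an escape code rather than an unparsable tail, but the control flow is the same.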

31. As Huffman is generally believed to be the optimal bitcode (not only among prefix codes), it would be great to have a simple counterexample which can be easily traced - using just a few symbols (3?) ...
A follow-up, but much more difficult, question is understanding how bad it is - like a theoretical boundary for improvements with non-prefix codes, and maybe examples reaching or approaching it ...

32. 1. One simple example with 4 symbols ('a','b','c','d') is where Huffman assigns the codes a:0 b:10 c:110 d:111,
but 'aa' never occurs in the data. Then the code 00 can be assigned to 'c', and a greedy parser would always
correctly decode 00 as 'c'. No backtracking is needed in this example, because all other codes start with 1.

2. It's similar to how we can get better compression by grouping symbols before assigning Huffman codes.
In any case, it won't be able to beat the order-* entropy of the data, and then practical results depend more on the specific implementation.
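Example 1 can be turned into a runnable sketch (my own toy reconstruction, not code from the thread). One subtlety: streams like 'ac' and 'ca' would both greedily encode to "000", so in this sketch 'c' keeps its long Huffman code as a fallback, and the encoder uses the short code 00 only when the greedy decoder would still recover the text:

```python
# Huffman codes for an order-0 model over {a, b, c, d}.
HUFFMAN = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
SHORT_C = '00'  # extra non-prefix code for 'c', usable because 'aa' never occurs

def greedy_decode(bits):
    """Greedy left-to-right parse: try the short 'c' code first, then Huffman."""
    out, i = [], 0
    while i < len(bits):
        if bits.startswith(SHORT_C, i):
            out.append('c')
            i += len(SHORT_C)
            continue
        for sym, code in HUFFMAN.items():
            if bits.startswith(code, i):
                out.append(sym)
                i += len(code)
                break
        else:
            raise ValueError('undecodable bit string')
    return ''.join(out)

def encode(text):
    """Encode text (assumed to contain no 'aa' pair), preferring the short
    'c' code whenever greedy decoding still reproduces the prefix."""
    assert 'aa' not in text
    bits = ''
    for i, sym in enumerate(text):
        if sym == 'c':
            trial = bits + SHORT_C
            if greedy_decode(trial) == text[:i + 1]:
                bits = trial
                continue
        bits += HUFFMAN[sym]
    return bits

msg = 'abcad'
enc = encode(msg)                 # 9 bits, vs 10 bits for pure Huffman
assert greedy_decode(enc) == msg
```

The saving comes exactly from the higher-order statistics ('aa' never occurring) that a plain order-0 Huffman code cannot exploit.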

33. ## Thanks:

Jarek (8th June 2017)

34. 1. huffman code is optimal for an order-0 data source with a given probability distribution

2. it's not a bit coder. instead, like ari/ans, it just finally outputs some sequence of bits

35. We just discussed that with Shelwien. My conclusion: his program includes a table with the optimal code for each particular file, so the overall code for the data source isn't a bitcode any more, from a theoretical POV. But it doesn't make any difference from a practical POV - a non-prefix code may work as fast as Huffman while providing a better compression ratio.

Moreover, if symbol probabilities are encoded in the data, they don't actually represent an order-0 model! Once we have started decoding, the probabilities of the remaining symbols change as we decode the string prefix, and we can modify the code according to the changed probabilities.

36. Ok, so Huffman is still the optimal (not only prefix) "bitcode" (symbol -> bit sequence) ... for i.i.d. sources (which indeed should be explicitly stated).

Anyway, the https://en.wikipedia.org/wiki/Huffman_coding article is a kind of "gateway to data compression" - Huffman coding is probably the most widely taught data compression algorithm, and many of these students will likely check its Wikipedia article - it is crucial that it properly shows the situation and maybe gets the reader interested in the data compression field.
First of all, it should clearly (at the top of the article) emphasize the suboptimality of Huffman/prefix codes (and mention the most popular ways to repair it), like the ~1% ratio loss for letter-wise English text, or like in the deleted paragraph: "As an example, a symbol of probability 0.99 carries only $\log_2(1/0.99)\approx 0.014$ bits of information, but Huffman coding encodes each symbol separately and therefore the minimum length for each symbol is 1 bit" ... but this Huffman crusader (dozens of Huffman/prefix-code articles) has removed this information about its weaknesses and trivialized it with "Huffman coding is not always optimal among all compression methods."
A person visiting this article might also be searching for a good (fast) implementation - there was a link to Yann's in External Links ... but this person has just deleted it.

And no editor has intervened - this is an example showing that if the data compression community wants to be properly represented in widely visited Wikipedia articles, we should improve and guard them ourselves.

37. My first compressor was a Huffman one, on a trusty BBC Micro (in 6502 assembly).

It's a tricky balance between being too wordy, so that people don't bother to read it, and being insufficiently detailed. Huffman can be "sufficient" for most cases, and it's simple, so there is nothing wrong with people using it - even reimplementing what everyone else has done already (likely badly, like my first effort). The key is to explain when it doesn't work. For example, pure Huffman can only ever reach an 8:1 compression ratio, max. (Unless you have a single-symbol case, in which case the prefix code is zero bits.) To beat that you need something else (deduplication, run-length encoding, or just not Huffman!). Small alphabet sizes or extreme distributions are its primary weaknesses.
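A quick arithmetic check of the two limits mentioned in the thread (the ~0.014-bit symbol from the deleted Wikipedia paragraph, and the 8:1 cap for byte-wise Huffman):

```python
import math

# A symbol with probability 0.99 carries log2(1/0.99) bits of information,
# but any prefix code (Huffman included) must spend at least 1 whole bit on it.
p = 0.99
info = math.log2(1 / p)
print(f"information content: {info:.4f} bits, Huffman spends: 1 bit")

# Byte-wise Huffman reads 8-bit symbols and emits at least 1 bit per symbol,
# so the compression ratio can never exceed 8:1.
max_ratio = 8 / 1
print(f"best possible byte-wise Huffman ratio: {max_ratio:.0f}:1")
```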
