I've made https://github.com/rarten/ooz/
It builds upon the excellent work by Powzix.
It can compress oodle kraken/leviathan/selkie and mermaid. Same compression ratio as real oodle!
For educational use only. Happy hacking.
Any way to build this on Linux? Or at least a Windows binary to run with Wine?
Here I fixed it to compile with MinGW gcc; it seems to work.
Half of it is gcc being unfriendly as usual (who cares about standards when they force the developer to manually add keywords like typename, which _all_ other compilers don't need).
But the other half is annoying syntax like this:
Code:
template<int Level, typename Hasher>
struct MermaidCompressVeryfast {
  typedef Hasher Hasher;
  enum { Level = Level };
and some missing header files (I guess they are auto-included by VS), so it would seem like a good idea to fix this in the main source too.
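For reference, the usual way to fix that second kind of error, assuming nothing else depends on the parameter names, is to rename the template parameters so the member typedef and enum no longer shadow them (MSVC tolerates the shadowing, gcc/clang reject it). The sketch below only illustrates the idea and is not the actual patch:
Code:
// Illustrative rename-based fix, not the actual patch: give the template
// parameters distinct names so the members that re-export them no longer
// shadow the parameters themselves (which MSVC accepts but gcc/clang reject).
template<int Level_, typename Hasher_>
struct MermaidCompressVeryfast {
  typedef Hasher_ Hasher;   // keeps the old public name for the rest of the code
  enum { Level = Level_ };  // same for the compile-time level constant
  // ...
};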
Unfortunately, compiling on Linux needs more fixing (tchar.h, intrin.h, windows.h).
@rarten As this requires enormous effort to create, what's the point of it? You could have created your own (possibly better) compression algorithm.
Like this it seems to compile with gcc on Linux (binary included).
It also compiles on Windows with MinGW.
VS compatibility is likely broken (by stuff in stdafx.h only).
The DLL loading was removed because it's Windows-specific.
QueryPerformanceCounter() was replaced with a stub, so there's no speed estimation.
Code:
[aardwolf]$ ls -al
total 640
drwxrwxr-x 3 skymmer pg2092724    133 Apr 22 06:04 .
drwxr-xr-x 9 skymmer pg2092724   4096 Apr 22 06:04 ..
-rw-rw-r-- 1 skymmer pg2092724     76 Apr 22 2019 g
-rw-rw-r-- 1 skymmer pg2092724    626 Apr 22 2019 g.bat
-rw-rw-r-- 1 skymmer pg2092724    221 Apr 22 2019 list
-rw-rw-r-- 1 skymmer pg2092724 620544 Apr 22 2019 ooz.exe
drwxrwxr-x 2 skymmer pg2092724   4096 Apr 22 2019 src
-rw-rw-r-- 1 skymmer pg2092724     52 Apr 22 2019 test
-rw-rw-r-- 1 skymmer pg2092724     83 Apr 22 2019 test.bat
[aardwolf]$ sh g 2>err
[aardwolf]$ sh test
ooz : 485627 => 222791 (0.00 seconds, inf MB/s)
1 : 222791 => 485627 (0.00 seconds, inf MB/s)
3bc847566d980f7edaf44adceecf6207 ./ooz
3bc847566d980f7edaf44adceecf6207 2
[aardwolf]$
I pushed an update that builds with Clang 8
Is it possible to use ooz as a precompressor (like Precomp) to decompress kraken/leviathan/selkie/mermaid files?
No, for that you'll need the specific oodle DLL which was used to compress the files originally.
There are differences even between versions of the original oodle.
It produces identical results to oodle version 7 for all files I tried, except mermaid/selkie < 64 KB at some of the levels, and files < 256 bytes, which oodle by default does not compress. Since there is some floating-point math involved, it could possibly depend on the compiler. The official oodle Leviathan compressor also has a bug which sometimes produces output based on uninitialized memory unless you work around it.
@rarten You used a trie-based match finder. How does this compare to classical treaps (BST on content, heap on index)?
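To make that parenthetical concrete, here is a minimal sketch of the treap idea: a binary search tree ordered by the suffix content at each position, with the position index acting as the heap priority, so the newest position always ends up at the root. The names below are illustrative and not taken from ooz; a real match finder would also cap comparisons at a maximum match length and record the best matches found while descending.
Code:
// Minimal sketch of a treap-style match finder: BST on suffix content,
// heap on position index. Because positions only increase, every new
// position has the highest priority and becomes the root, splitting the
// old tree along the suffix ordering.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct TreapNode {
    uint32_t pos;     // position in the input buffer (the heap key)
    int left, right;  // children ordered by suffix content (the BST key)
};

struct TreapMatchFinder {
    const uint8_t* buf;
    size_t len;
    std::vector<TreapNode> nodes;
    int root = -1;

    TreapMatchFinder(const uint8_t* b, size_t n) : buf(b), len(n) {}

    // Lexicographic comparison of the suffixes starting at a and b.
    // A real implementation would cap this at a maximum match length
    // and also return the common prefix length to report matches.
    int cmp(uint32_t a, uint32_t b) const {
        size_t n = len - (a > b ? a : b);
        return memcmp(buf + a, buf + b, n);
    }

    void insert(uint32_t pos) {
        int id = (int)nodes.size();
        nodes.push_back({pos, -1, -1});
        int* lhook = &nodes[id].left;   // where the next lexicographically smaller node goes
        int* rhook = &nodes[id].right;  // where the next lexicographically greater node goes
        for (int cur = root; cur != -1; ) {
            if (cmp(nodes[cur].pos, pos) < 0) {
                *lhook = cur;               // cur's suffix sorts before ours
                lhook = &nodes[cur].right;
                cur = nodes[cur].right;
            } else {
                *rhook = cur;               // cur's suffix sorts after (or equals) ours
                rhook = &nodes[cur].left;
                cur = nodes[cur].left;
            }
        }
        *lhook = *rhook = -1;
        root = id;  // newest position = highest heap priority
    }
};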
It compresses much better than zstd, at least on some inputs I'm interested in. This is scandalous.
Update: this seems to be in large part due to zstd's small default window; --zstd=wlog=26 knocks the size down quite a bit, so it doesn't look that bad anymore.
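For anyone reproducing this: --zstd=wlog=26 is the CLI spelling of zstd's advanced windowLog parameter. The sketch below shows the equivalent through the libzstd API (assuming libzstd 1.4+, with error handling abbreviated and level 19 chosen arbitrarily); window logs above 27 would additionally require the decoder to raise ZSTD_d_windowLogMax.
Code:
// Sketch: setting a 64 MB (2^26) match window via the libzstd API,
// the programmatic equivalent of the --zstd=wlog=26 CLI flag.
#include <zstd.h>
#include <vector>

std::vector<char> compress_big_window(const void* src, size_t srcSize) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 26);  // overrides the level's default window
    std::vector<char> dst(ZSTD_compressBound(srcSize));
    size_t n = ZSTD_compress2(cctx, dst.data(), dst.size(), src, srcSize);
    ZSTD_freeCCtx(cctx);
    dst.resize(ZSTD_isError(n) ? 0 : n);
    return dst;
}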
Charles Bloom has been well-known in data compression circles since the early 90s.
Without Charles, I wouldn't be here.
I appreciate his hard work throughout the years.
I disagree with that. I consider that you almost always want the maximum window you can use at decoding time. At encoding time you tend to have more resources, and decoding happens more often, often on a smaller computer.
Also, even when you have peculiar situations that would warrant optimization, the result of optimizing the window size would be the same across algorithms.
This is why one should not compare algorithms against each other with different window sizes.
I thought some more about your argument, but I still do not think I get it. We are talking about windows and offsets, so we are talking about variations of LZ77 algorithms.
Every algorithm of this kind has to introduce a format for storing compressed data. This format has to have a way of storing offsets, and there are many possible ways in which this can be done.
1) You can have a fixed-length offset code for every match. This would be very traditional and, in fact, exactly how it was done in the original 1977 Ziv-Lempel paper. The fixed offset size can mean a fixed window size or, if all files are smaller than this size, it may well correspond to an unlimited window. Importantly, this corresponds to an implicit assumption that far matches are equally likely to be useful as matches nearby, which in particular implies that the statistical properties of your data are uniform. In other words, you make an assumption about the distribution of your offsets, and in this case the assumption is that this distribution is uniform.
2) A lot of successful compressors operate under another assumption about offset distribution: they assume that the probability of a nearby match is higher than the probability of a far match. This assumption usually leads to the use of variable-length codes for offsets, with shorter codes allocated to nearer matches (see the sketch after this list).
3) If you take assumption 2) to an extreme, with the probability of far matches going lower and lower, they'll end up getting longer and longer codes. At some point you can expect some of these codes to be so long as to make compression of all but the longest (and least likely to exist) strings impractical. Then it sounds like a natural development of a format to simply prohibit matches further away than some hopefully carefully chosen limit. As a side effect, truncating your offsets at some finite point shortens the offset codes just below that limit, so you may even benefit from making such an assumption. However, this is still an assumption about the distribution of offset lengths, even if it is a pretty drastic one.
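To make assumption 2) concrete, here is a minimal sketch of how it is usually realized, in the spirit of DEFLATE/LZMA-style offset coding (the code is illustrative and not taken from ooz or Oodle): offsets are grouped into slots by bit length, the slot gets an entropy code (short for near offsets, long for far ones), and the low bits are sent raw. Forbidding slots above some maximum is exactly the window limit described in point 3).
Code:
// Sketch of variable-length offset coding: bucket offsets by bit length,
// entropy-code the bucket ("slot"), send the remaining low bits raw.
// Near offsets land in small slots and therefore get short codes.
#include <cstdint>

struct OffsetCode {
    uint32_t slot;        // bucket index = bit length of the offset
    uint32_t extra_bits;  // number of raw low bits that follow the slot code
};

static OffsetCode encode_offset(uint32_t offset) {        // offset >= 1
    uint32_t slot = 32 - (uint32_t)__builtin_clz(offset); // 1-based index of the top set bit (gcc/clang builtin)
    return { slot, slot - 1 };                             // the top bit is implied by the slot
}

// Approximate cost in bits, given the entropy-code length assigned to this slot.
static uint32_t offset_cost_bits(uint32_t offset, uint32_t slot_code_len) {
    return slot_code_len + encode_offset(offset).extra_bits;
}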
All in all, different variations of LZ77 make different assumptions about the distribution of offsets. For a specific type of data, and assuming you are developing your own compression scheme, I can imagine collecting some data on the relevant distributions and then devising a format that matches the actual distribution of offsets well. Often you do not have the luxury of being able to go through a process like this. So, please explain: what is so fundamentally wrong when someone takes 10 or 20 compressors with different windows (in my mind, with different assumptions about the implied offset distribution) and compares their performance on a specific dataset? In what way would such observations be meaningless?
My arguments are not derived from theory but from personal observation. If I take a benchmark corpus and run it with lzma, brotli, lzham, and zstd, I get the same performance changes for all of them when going from window size X to window size Y, and the difference is typically larger than the density differences between the algorithms otherwise. For example, changing from a 16 MB to a 1 GB window on enwik9 makes about an 11% density difference.
OK, thank you, now I understand. You were thinking about a situation where the window is increased by a factor of 64, which is almost two orders of magnitude. I was thinking about a comparison between compressors where one might have a window of 8K and the other 16K.