
Thread: Oodle compressor open source

  1. #1
    Member
    Join Date
    Apr 2019
    Location
    Eastern europe
    Posts
    3
    Thanks
    0
    Thanked 8 Times in 1 Post

    Oodle compressor open source

    I've made https://github.com/rarten/ooz/

    It builds upon the excellent work by Powzix.

It can compress oodle kraken/leviathan/selkie and mermaid. Same compression ratio as the real oodle!

    For educational use only. Happy hacking.

  2. Thanks (8):

    Bulat Ziganshin (22nd April 2019),encode (22nd April 2019),introspec (22nd April 2019),kassane (1st May 2019),load (22nd April 2019),moisesmcardona (22nd April 2019),Shelwien (22nd April 2019),svpv (22nd April 2019)

  3. #2
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    106
    Thanks
    37
    Thanked 29 Times in 20 Posts
Any way to build it on Linux? Or at least a Windows binary to run with Wine?

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,972
    Thanks
    296
    Thanked 1,299 Times in 736 Posts
Here I fixed it to compile with mingw gcc; it seems to work.
Half of it is gcc being unfriendly as usual
(who cares about standards when they force the developer to manually add keywords like typename, which _all_ other compilers manage without).
But the other half is annoying syntax like this:
Code:
    template<int Level, typename Hasher>
    struct MermaidCompressVeryfast {
        typedef Hasher Hasher;
        enum { Level = Level };

There were also some missing header includes (I guess VS pulls them in automatically), so it would seem like a good idea to fix these in the main source too.
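
A standard-conforming way to keep the same member names is to rename the template parameters themselves, since redeclaring a parameter name inside the template body is what gcc objects to; a minimal sketch of such a rewrite (the underscored names are just illustrative, not what the repository uses):
Code:
    template<int Level_, typename Hasher_>
    struct MermaidCompressVeryfast {
        // Renamed template parameters: redeclaring a template parameter name as a
        // member (typedef Hasher Hasher;) is ill-formed, even though MSVC accepts it.
        typedef Hasher_ Hasher;     // member typedef keeps the old name
        enum { Level = Level_ };    // member constant keeps the old name
        // ...
    };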

Unfortunately compiling on Linux needs more fixes (tchar.h, intrin.h, windows.h).
Attached Files

  5. Thanks (2):

    algorithm (22nd April 2019),load (22nd April 2019)

  6. #4
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    106
    Thanks
    37
    Thanked 29 Times in 20 Posts
@rarten As this requires enormous effort to create, what's the point of it? You could have created your own (possibly better) compression algorithm.

  7. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,972
    Thanks
    296
    Thanked 1,299 Times in 736 Posts
Like this it seems to compile with gcc on Linux (binary included).
It also compiles on Windows with mingw.
VS compatibility is likely broken (only by stuff in stdafx.h).
The DLL loading was removed because it's Windows-specific.
QueryPerformanceCounter() was replaced with a stub, so there is no speed estimation.
    Code:
    [aardwolf]$ ls -al
    total 640
    drwxrwxr-x 3 skymmer pg2092724    133 Apr 22 06:04 .
    drwxr-xr-x 9 skymmer pg2092724   4096 Apr 22 06:04 ..
    -rw-rw-r-- 1 skymmer pg2092724     76 Apr 22  2019 g
    -rw-rw-r-- 1 skymmer pg2092724    626 Apr 22  2019 g.bat
    -rw-rw-r-- 1 skymmer pg2092724    221 Apr 22  2019 list
    -rw-rw-r-- 1 skymmer pg2092724 620544 Apr 22  2019 ooz.exe
    drwxrwxr-x 2 skymmer pg2092724   4096 Apr 22  2019 src
    -rw-rw-r-- 1 skymmer pg2092724     52 Apr 22  2019 test
    -rw-rw-r-- 1 skymmer pg2092724     83 Apr 22  2019 test.bat
    [aardwolf]$ sh g 2>err
    [aardwolf]$ sh test
    ooz                 :   485627 =>   222791 (0.00 seconds, inf MB/s)
    1                   :   222791 =>   485627 (0.00 seconds, inf MB/s)
    3bc847566d980f7edaf44adceecf6207  ./ooz
    3bc847566d980f7edaf44adceecf6207  2
    [aardwolf]$
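
The "inf MB/s" in the log above comes from the timer stub. If speed numbers are wanted on Linux, a portable timer based on std::chrono could replace the QueryPerformanceCounter() call; a minimal sketch (illustrative only, not part of the attached files):
Code:
    #include <chrono>
    #include <cstdint>

    // Monotonic timestamp in nanoseconds; behaves the same on Windows and Linux,
    // so it could stand in for the QueryPerformanceCounter()-based timing.
    static int64_t TimerNanos() {
        using namespace std::chrono;
        return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
    }

    // usage:
    //   int64_t t0 = TimerNanos();
    //   ... compress / decompress ...
    //   double seconds = (TimerNanos() - t0) * 1e-9;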
Attached Files

  8. Thanks:

    encode (22nd April 2019)

  9. #6
    Member
    Join Date
    Apr 2019
    Location
    Eastern europe
    Posts
    3
    Thanks
    0
    Thanked 8 Times in 1 Post
I pushed an update that builds with Clang 8.

  10. #7
    Member
    Join Date
    Mar 2018
    Location
    canada
    Posts
    3
    Thanks
    1
    Thanked 0 Times in 0 Posts
Is it possible to use ooz as a precompressor, like precomp, to decompress kraken/leviathan/selkie/mermaid files?

  11. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,972
    Thanks
    296
    Thanked 1,299 Times in 736 Posts
No, for that you'd need the specific oodle DLL that was used to compress the files originally.
There are differences even between versions of the original oodle.

  12. #9
    Member
    Join Date
    Apr 2019
    Location
    Eastern europe
    Posts
    3
    Thanks
    0
    Thanked 8 Times in 1 Post
It produces identical results to oodle version 7 for all files I tried, except mermaid/selkie files under 64 KB at some of the levels, and files under 256 bytes, which oodle by default does not compress. Since there is some floating point math involved, the output could possibly depend on the compiler. Also, the official oodle Leviathan compressor has a bug which sometimes produces output based on uninitialized memory unless you work around it.

  13. #10
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    106
    Thanks
    37
    Thanked 29 Times in 20 Posts
@rarten You used a trie-based match finder. How does this compare to classical treaps (BST on content, heap on index)?
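
For readers who have not met the structure being asked about: a treap match finder keeps one node per position, ordered as a BST on the suffix content and as a heap on the position index, so recent positions stay near the root and old ones sink where they are cheap to prune. A minimal node layout, purely illustrative and not taken from ooz:
Code:
    #include <cstdint>

    // One node per inserted position. BST ordering: lexicographic comparison of the
    // suffixes starting at each node's 'pos'. Heap ordering: larger 'pos' (more recent
    // position) closer to the root.
    struct TreapNode {
        uint32_t  pos;     // position in the input buffer; doubles as the heap priority
        TreapNode *left;   // suffixes comparing lexicographically smaller
        TreapNode *right;  // suffixes comparing lexicographically greater
    };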

  14. #11
    Member
    Join Date
    Jan 2017
    Location
    Selo Bliny-S'edeny
    Posts
    24
    Thanks
    7
    Thanked 10 Times in 8 Posts
It compresses much better than zstd, at least on some inputs I'm interested in. This is scandalous.

    Update: this seems to be in large part due to a small window; --zstd=wlog=26 knocks the size down quite a bit, so it doesn't look that bad anymore.
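
For reference, the same window setting is also reachable through zstd's library API, not just the --zstd=wlog=26 command line; a minimal sketch, with error handling omitted, an arbitrary compression level, and the function name being my own:
Code:
    #include <zstd.h>

    // Compress 'src' with a 2^26-byte (64 MB) window, matching --zstd=wlog=26.
    // Windows above the decoder's default 2^27 limit would additionally need
    // ZSTD_d_windowLogMax raised (or `zstd -d --long`) on the decoding side.
    size_t CompressWithLargeWindow(void* dst, size_t dstCap,
                                   const void* src, size_t srcSize) {
        ZSTD_CCtx* cctx = ZSTD_createCCtx();
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19); // level is arbitrary here
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 26);
        size_t written = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
        ZSTD_freeCCtx(cctx);
        return written;  // on failure this is an error code; check with ZSTD_isError()
    }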

  15. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
Charles Bloom has been well-known in data compression circles since the early 90s.

  16. #13
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    289
    Thanks
    10
    Thanked 34 Times in 22 Posts
Without Charles, I wouldn't be here.
I appreciate his hard work throughout the years.

  17. #14
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    875
    Thanks
    242
    Thanked 324 Times in 197 Posts
    Quote Originally Posted by svpv View Post
It compresses much better than zstd, at least on some inputs I'm interested in. This is scandalous.

    Update: this seems to be in large part due to a small window; --zstd=wlog=26 knocks the size down quite a bit, so it doesn't look that bad anymore.
It is a common mistake to benchmark LZ77 compressors at different window sizes.

  18. #15
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    82
    Thanks
    64
    Thanked 33 Times in 21 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
It is a common mistake to benchmark LZ77 compressors at different window sizes.
    Yes, but if an average file size is fixed, the choice of the window size becomes a parameter to optimize in itself.
    I agree that this is a somewhat niche use case, but it does mean that such comparisons can be meaningful.

  19. #16
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    875
    Thanks
    242
    Thanked 324 Times in 197 Posts
    Quote Originally Posted by introspec View Post
    Yes, but if an average file size is fixed, the choice of the window size becomes a parameter to optimize in itself.
    I agree that this is a somewhat niche use case, but it does mean that such comparisons can be meaningful.
I disagree with that. I consider that you almost always want the maximum window you can afford at decoding time. At encoding time you tend to have more resources, and decoding happens more often, and often on a smaller computer.

Also, even when you have peculiar situations that would warrant optimization, the optimal window size would be the same across algorithms.

This is why one should not compare algorithms against each other at different window sizes.

  20. #17
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    82
    Thanks
    64
    Thanked 33 Times in 21 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
I disagree with that. I consider that you almost always want the maximum window you can afford at decoding time. At encoding time you tend to have more resources, and decoding happens more often, and often on a smaller computer.

Also, even when you have peculiar situations that would warrant optimization, the optimal window size would be the same across algorithms.

This is why one should not compare algorithms against each other at different window sizes.
I thought some more about your argument but I still do not think I get it. We are talking about windows and offsets, so we are talking about variations of LZ77 algorithms.
Every algorithm of this kind has to define a format for storing compressed data. That format has to have a way of storing offsets, and there are many possible ways in which this can be done.

1) You can have a fixed-length offset code for every match. This would be very traditional and, in fact, is exactly how it was done in the original 1977 Ziv-Lempel paper. A fixed offset size can mean a fixed window size or, if all files are smaller than this size, it may well correspond to an unlimited window. Importantly, this corresponds to an implicit assumption that far matches are just as likely to be useful as matches nearby, which in particular implies that the statistical properties of your data are uniform. In other words, you make an assumption about the distribution of your offsets, and in this case the assumption is that this distribution is uniform.

2) A lot of successful compressors operate under another assumption about the offset distribution: they assume that the probability of nearby matches is higher than the probability of far matches. This assumption usually leads to the use of variable-length codes for offsets, with shorter codes allocated to nearer matches.

3) If you take assumption 2) to an extreme, with the probability of far matches going lower and lower, those matches end up getting longer and longer codes. At some point you can expect some of these codes to be so long as to make compression of all but the longest (least likely to exist) strings impractical. It then seems a natural development of a format to simply prohibit matches further away than some, hopefully carefully chosen, limit. As a side effect, truncating your offsets at some finite point shortens the offset codes just below that limit, so you may even benefit from making such an assumption. However, this is still an assumption about the distribution of offset lengths, even if it is a pretty drastic one.

All in all, different variations of LZ77 make different assumptions about the distribution of offsets. For a specific type of data, and assuming you are developing your own compression scheme, I can imagine collecting some data on the relevant distributions and then devising a format that matches the actual distribution of offsets well. Often you do not have the luxury of being able to go through a process like this. So, please explain: what is so fundamentally wrong when someone takes 10 or 20 compressors with different windows (in my mind, with different assumptions about the implied offset distribution) and compares their performance on a specific dataset? In what way would such observations be meaningless?
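
To make the difference between these assumptions concrete, here is a toy cost model for the bits spent on a single offset under each of the three schemes (purely illustrative, not taken from any real codec's format):
Code:
    #include <cmath>
    #include <optional>

    // 1) Fixed-length offsets: every offset in a 2^windowLog window costs the same,
    //    i.e. the uniform assumption that near and far matches are equally likely.
    int FixedOffsetBits(unsigned offset, int windowLog) {
        (void)offset;
        return windowLog;
    }

    // 2) Variable-length offsets (Elias-gamma-like): near matches get shorter codes,
    //    reflecting the assumption that small offsets are more probable.
    int VariableOffsetBits(unsigned offset) {
        int n = 1 + (int)std::floor(std::log2((double)offset));  // bits in the offset value
        return 2 * n - 1;                                        // (n-1) prefix bits + n value bits
    }

    // 3) Variable-length offsets with a hard window limit: beyond the window the match
    //    is simply not representable; a real format would also shave the prefix bits
    //    of codes just below the cap, which is the side benefit mentioned above.
    std::optional<int> CappedOffsetBits(unsigned offset, unsigned window) {
        if (offset > window) return std::nullopt;   // prohibited by the format
        return VariableOffsetBits(offset);
    }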

  21. #18
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    875
    Thanks
    242
    Thanked 324 Times in 197 Posts
    Quote Originally Posted by introspec View Post
    I thought some more about your argument but I still do not think I get it.
My arguments are not derived from theory but from personal observation. If I take a benchmark corpus and run it with lzma, brotli, lzham, and zstd, I get the same performance change for all of these when going from window size X to window size Y, and that difference is typically larger than the density differences between the algorithms themselves. For example, changing from a 16 MB to a 1 GB window on enwik9 makes about an 11 % density difference.

  22. Thanks:

    introspec (5th August 2019)

  23. #19
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    82
    Thanks
    64
    Thanked 33 Times in 21 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
My arguments are not derived from theory but from personal observation. If I take a benchmark corpus and run it with lzma, brotli, lzham, and zstd, I get the same performance change for all of these when going from window size X to window size Y, and that difference is typically larger than the density differences between the algorithms themselves. For example, changing from a 16 MB to a 1 GB window on enwik9 makes about an 11 % density difference.
OK, thank you, now I understand. You were thinking about a situation where the window is increased by a factor of 64, which is almost two orders of magnitude. I was thinking about a comparison between compressors where one might have a window of 8K and another a window of 16K.

Similar Threads

  1. Open source gzip interface for Oodle SDK
    By Lucas in forum Data Compression
    Replies: 2
    Last Post: 10th November 2018, 02:42
  2. Replies: 23
    Last Post: 24th March 2018, 17:57
  3. BALZ - An Open-Source ROLZ-based compressor
    By encode in forum Data Compression
    Replies: 60
    Last Post: 6th March 2015, 16:47
  4. Why not open source?
    By nemequ in forum Data Compression
    Replies: 65
    Last Post: 25th November 2013, 22:05
  5. New fast open-source paq-based jpeg compressor
    By Bulat Ziganshin in forum Forum Archive
    Replies: 14
    Last Post: 13th September 2007, 13:57
