
Thread: Crook, a new binary PPM compressor

  1. #1
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Crook, a new binary PPM compressor

    Hi guys, I'm new on the forum. I wrote this little file compressor; it's nothing earth-shattering, but I thought maybe somebody would be interested. It started out as DMC, but eventually morphed into binary PPM, since binary PPM > DMC. Compared to PPMd it compresses a bit worse, is a bit slower and uses a lot more memory. I think the speed and memory usage could probably be fixed if I used a better data structure (currently a binary tree), but I didn't see the point.

    It is, however, really simple: the whole modelling code is around 200 lines. I've attached the source; you can also get it from https://github.com/valdmann/crook.
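
    Roughly speaking, you can picture each node as something like this (a simplified illustration only; the field names and widths are not the exact layout in the source):

    Code:
    #include <cstdint>

    // Simplified illustration of a binary-tree node for a bitwise PPM model;
    // not the actual layout used in the source.
    struct Node {
        std::uint16_t count[2];  // statistics for the next bit in this context
        std::uint32_t ext[2];    // longer contexts: this one extended by bit 0 / 1
        std::uint32_t sfx;       // shorter (suffix) context to escape to
    };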
    Attached Files

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Always a pleasure to see a new compression algorithm. Some test results: http://mattmahoney.net/dc/text.html#1933
    Hopefully my description is right. If not, let me know.

    Looking at the code, it seems like the low 4 bits of the 10 bit counts (scaled by 32) are never used because they are initialized to either 1.5 or 12 and incremented by 1. Maybe you tuned it and found that allowing counts higher than 32 hurt compression, although I think on stationary sources it might help.
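
    To illustrate the representation I mean, something like this (a sketch with made-up names, not crook's actual code):

    Code:
    #include <cstdint>

    // Sketch of a count kept in units of 1/32 (10 bits total), so a fractional
    // initial value like 1.5 is representable; made-up names, not crook's code.
    struct Count {
        std::uint16_t n;                                       // value * 32, at most 1023
        Count() : n(static_cast<std::uint16_t>(1.5 * 32)) {}   // fractional initial value
        void hit()         { if (n <= 1023 - 32) n += 32; }    // "increment by 1"
        double value() const { return n / 32.0; }
    };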

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Could you add support for big files? Or even make it work with stdin/stdout?

  4. #4
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Always a pleasure to see a new compression algorithm. Some test results: http://mattmahoney.net/dc/text.html#1933
    Hopefully my description is right. If not, let me know.
    Thank you.

    There are actually two (structural) problems with DMC. The first is, as you have written, duplicate states. The second problem is 'absent-minded' updating: imagine a PPM model with states for 'bra', 'ra', 'a', 'ac', 'c', and order zero of course. Let's say the current longest matching context is 'ra' and the new symbol is 'c'. Now the next longest matching state would be 'ac', since 'rac' doesn't exist; let's call this the successor of 'ra' for symbol 'c'. Definition: the successor of S for C is the next longest matching context, given that S is the current longest matching context and C is the new symbol. In this example the successor of 'bra' for 'c' would also be 'ac', since neither 'brac' nor 'rac' exists. If PPM adds a new state for context 'rac', then the successor of 'ra' for 'c' becomes 'rac' AND the successor of 'bra' for symbol 'c' ALSO becomes 'rac'. DMC works like this too, except it doesn't update the successor of 'bra' for 'c'; it still remains 'ac'. And that's stupid.
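
    To make the successor bookkeeping concrete, here is a rough byte-oriented sketch (made-up names; not the actual code of crook or of any DMC implementation):

    Code:
    // Rough byte-oriented sketch (made-up names; not the actual code of crook
    // or of any DMC implementation).
    struct Context {
        Context* successor[256];  // next longest matching context, per symbol
        // counts, links to longer/shorter contexts, etc. omitted
    };

    // When a new state for 'rac' is split off, a full PPM model effectively does
    //     ra->successor['c']  = rac;   // the state we just extended from
    //     bra->successor['c'] = rac;   // every longer state that also mapped to 'ac'
    // whereas DMC only performs the first assignment, so 'bra' keeps pointing
    // at the shorter context 'ac'.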

    I also tried a variation of PPM where instead of a global order limit I used DMC's growth heuristic. It wasn't too bad actually, about the same as the global limit.

    Quote Originally Posted by Matt Mahoney View Post
    Looking at the code, it seems like the low 4 bits of the 10 bit counts (scaled by 32) are never used because they are initialized to either 1.5 or 12 and incremented by 1. Maybe you tuned it and found that allowing counts higher than 32 hurt compression, although I think on stationary sources it might help.
    Your observation is indeed correct. However, it's not the limiting that improves compression (although testing it right now does show a 0.1% improvement on enwik7 compared to plain counts). It's the initial value of 1.5 for new states that gives a 1% improvement on enwik7. I thought this was very, very interesting when I discovered it, although I probably should've omitted it from the final program since it doesn't have much practical value. Especially considering that the effect tends to vanish on longer files (since I don't restart the model, structural adaptation eventually just stops). Yes, and of course I could've just used the lowest bit for storing the 1.5, heh.

    EDIT: You know, I actually tried to get rid of that 1.5 magic constant by doing adaptive mixing of inherited and actual statistics. So I'd have two counters in each context: one for inherited counts from the lower-order state and one for new statistics. Even using the previous byte as context, adaptive linear mixing was worse than the current scheme (and massively slower of course). Very sad.

    EDIT2: I can't remember if I tried using the actual count as context for the weights. That would make more sense than the previous byte. Hmmmm.

    EDIT3: Removing the fractional counts increases size on enwik8 by 2.4% (-m128 -O4), 0.049 bpc difference.
    Last edited by valdmann; 6th March 2012 at 14:26.

  5. #5
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Could you add support for big files? Or even make it work with stdin/stdout?
    Well, it doesn't restart the model, so compressing big files would probably be a bad idea anyway.

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Context models in ZPAQ also use an initial count of 1.5.

    Another possible improvement is to throw out the model and start over when it is full, sort of like ppmd. Or back up and partially rebuild it like the -r1 option. This should make the algorithm more adaptive on nonstationary sources.

    It seems that crook should beat hook (DMC), but hook uses some other tricks like LZP preprocessing.

  7. #7
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    876
    Thanks
    474
    Thanked 175 Times in 85 Posts
    Can anyone please offer a win32 compile of Crook?

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I compiled crook with g++ 4.6.1 for 32 bit Windows as recommended in the source comments and compressed with UPX.
    Attached Files

  9. #9
    Member Vacon's Avatar
    Join Date
    May 2008
    Location
    Germany
    Posts
    523
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hello everyone,

    Quote Originally Posted by Matt Mahoney View Post
    I compiled crook with g++ 4.6.1 for 32 bit Windows as recommended in the source comments and compressed with UPX.
    Thanks for providing the compiled .exe.
    But it gives me the following error:
    ---------------------------
    crook_01.exe - System error
    ---------------------------
    The program cannot be started because libgcc_s_dw2-1.dll is missing from the computer. Reinstall the program to fix the problem.
    ---------------------------
    OK
    ---------------------------
    Long story short -> libgcc_s_dw2-1.dll is missing. Are static compiles possible?

    Best regards!

  10. #10
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    876
    Thanks
    474
    Thanked 175 Times in 85 Posts
    Dear Matt, thank you very much for compiling.
    CROOK reaches rank 264 in the SqueezeChart; needs 2590 sec for 6.4 GB using -m1600 -O7. I would love to see a model restart as well.

    Hey Vacon, I can send you the requested libraries if you want.

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    compile with -static-libgcc -static-libstdc++ (or just -static)

    attached icl-produced executables
    Attached Files
    Last edited by Bulat Ziganshin; 9th March 2012 at 16:20.

  12. #12
    Member Vacon's Avatar
    Join Date
    May 2008
    Location
    Germany
    Posts
    523
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hello everyone,

    Quote Originally Posted by Bulat Ziganshin View Post
    compile with -static-libgcc -static-libstdc++

    attached icl-produced executables
    Great, thanks!
    @ Stephan: Thanks for the kind offer, but as you see... (originally in German)

    Best regards!

  13. #13
    Member Vacon's Avatar
    Join Date
    May 2008
    Location
    Germany
    Posts
    523
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hello everyone,

    Quick and dirty test by picking some random folders from my tmp folder and tar-ing them with 7-Zip. Afterwards I compressed the tar files with default settings (right-click -> compress to <format>).
    Note: WinBuilder contains some scripts which, afaik, contain pre-compressed content. So this is just a random test and not intended to be discussed at a scientific level.

    Code:
    Source:
    WinBuilder_082.tar = 116.153.856 Bytes
    7z -> 88.734.414 Bytes 164 s
    FA -> 87.593.348 Bytes 110 s
    crook -> 116153856 -> 97031644, 143.59 s, 6.683 bpc
    7-zip -> 93.414.293 Bytes 48 s
    
    Source:
    ffmpeg-git-01fcbdf-win32-static.tar = 54.278.656 Bytes
    7z -> 16.056.356 Bytes 98 s
    FA -> 6.896.930 Bytes 24 s
    crook -> 54278656 -> 21342597, 41.40 s, 3.146 bpc
    7-zip -> 20.503.113 Bytes 32 s
    <ouch...> forgot to give the specs for my testing-pc *whistle...*
    PackardBell dot_SE, Intel Atom N450, 2 GB RAM, Windows 7 Starter

    Best regards!
    Last edited by Vacon; 11th March 2012 at 13:27. Reason: forgotten specs for test-pc

  14. #14
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Attached Windows binaries for crook-0.1.1 (source on Github). Fixed overflow in the progress display and optimized a little for 32-bit platforms. Should be 100% compatible.
    Attached Files

  15. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I have tested crook on enwik9 split into 10 equally sized parts (so the first part equalled enwik8). Options used were -m2047 -O8.

    Code:
    22400314	enwik9a.crook
    22283257	enwik9b.crook
    21864201	enwik9c.crook
    21781466	enwik9d.crook
    16028443	enwik9e.crook
    10363712	enwik9f.crook
    21354701	enwik9g.crook
    21209717	enwik9h.crook
    20278614	enwik9i.crook
    20721985	enwik9j.crook
    Ubuntu 11.10 64-bit, 8 GiB RAM, Core 2 Duo E8400. Timings are gone as my terminal broke when I accidentally did 'cat' on the compressed files, but most parts compressed in slightly over 30 seconds each.

    The total size of the parts is 198286410, so it's higher than the result already posted to LTCB.


    BTW Matt:
    Why do you have a habit of splitting enwik9 into equally sized blocks for block-based compressors? I think the best compression on homogeneous data would be obtained with the maximum possible block size. There is some non-homogeneity within enwik9, as can be seen in crook's compression ratios for the different parts, but that shouldn't destroy the advantage of a bigger block size.
    Last edited by Piotr Tarsa; 10th March 2012 at 23:52.

  16. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Larger blocks compress better but need more memory. I assume you are talking about BWT compressors.

    The middle part of enwik9 is highly compressible because it contains a large number of automatically generated articles taken from census data. I wrote about it in http://mattmahoney.net/dc/textdata.html

  17. #17
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I have tested crook on enwik9 split into 10 equally sized parts (so the first part equalled enwik8). Options used were -m2047 -O8.

    [...]

    The total size of the parts is 198286410, so it's higher than the result already posted to LTCB.

    [...]
    I did test model restarting on an earlier variant of crook and it did indeed lead to reduced compression. This might be caused by my limiting model growth to a single new state per input bit (just like DMC or Vitter's Fast PPM); otherwise the model is too quickly filled up with mostly useless data (there's no point in storing info on contexts that are only seen once or twice). Most PPM algos create many new states for each input symbol, so in comparison to, say, PPMd my program exhibits slower structural adaptation, and therefore the price of a model restart is correspondingly higher. The reason PPMd can afford such fast growth is its delayed state creation: new states take only one byte to store (very nice!).
    Last edited by valdmann; 11th March 2012 at 12:36.

  18. #18
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Larger blocks compress better but need more memory. I assume you are talking about BWT compressors.
    Yes, they need more memory, but you should try using a block size that uses all of the free memory, instead of splitting enwik9 into equally sized parts.

    The reason PPMd can afford such fast growth is thanks to its delayed state creation: new states take one only byte to store (very nice!).
    One byte? From what I see in Shkarin's paper:
    - Transition structure takes 6 bytes,
    - Context structure takes 12 bytes,

    So a node with:
    - one child takes 12 bytes,
    - two children takes (12 + 6 + 6 = 24) bytes
    - N > 2 children takes (12 + N * 6) bytes

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    The problem with using all available memory is you need the same amount to decompress.

  20. #20
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    One byte? From what I see in Shkarin's paper:
    - Transition structure takes 6 bytes,
    - Context structure takes 12 bytes,

    So a node with:
    - one child takes 12 bytes,
    - two children takes (12 + 6 + 6 = 24) bytes
    - N > 2 children takes (12 + N * 6) bytes
    That is for contexts that have been seen at least twice. New contexts are represented as a pointer into a text buffer ("pText" in the source). If the current longest matching context is 'a' and the next symbol is 'c', then instead of creating a new structure for context 'ac', PPMd sets successor('a', 'c') := pText. In the next iteration the new symbol that follows 'ac' is written to *pText and pText is incremented. So it's a "pointer into the future".
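
    A sketch of the idea (made-up names and raw pointers here; the real PPMd code packs things differently):

    Code:
    #include <cstdint>

    // Sketch of the "pointer into the future" (made-up names and raw pointers;
    // not the actual PPMd code).
    static std::uint8_t  text[1 << 20];   // the text buffer
    static std::uint8_t* pText = text;    // next free position in that buffer

    struct State { std::uint8_t sym, freq; void* successor; };

    // When context 'ac' would have to be created but has been seen only once,
    // no node is allocated; the counter for 'c' in context 'a' just remembers
    // where the continuation of 'ac' will appear:
    void deferCreation(State& c_in_a) {
        c_in_a.successor = pText;
    }

    // Every processed symbol is appended to the buffer, so the deferred
    // successor ends up identifying the symbol that actually followed 'ac':
    void emit(std::uint8_t sym) {
        *pText++ = sym;
    }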

    By the way, I remember reading on this forum that PPMd uses suffix trees. This is false: it's just a trie, a much simpler data structure. The text buffer doesn't even contain the whole text, just a subset (IIRC).

    EDIT: Hmm. Actually I think if PPMd creates more than one new state, let's say N new states, then each of those would be represented as a pointer to the same byte in the text buffer. So the total overhead per state should be 1/N bytes...
    Last edited by valdmann; 11th March 2012 at 22:19.

  21. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    The problem with using all available memory is you need the same amount to decompress.
    But you're benchmarking programs; you're decompressing on the same machine that you're compressing on.

    valdmann:
    It seems that you're familiar with PPMd's internals. Sorry for being off-topic, but I have some questions.

    1. I don't know what you call overhead. I'm, for example, interested in memory usage per input byte. I've tested with -m256 -o16. For random data PPMd uses about 9 times the input size, for english.dic 20x, and for world95.txt and vcfiu.hlp about 16x. So it seems that for random data the most-used data structure is 8 bytes long, probably a pointer to the input byte and a pointer to the next sibling, totalling 9 bytes per input byte.

    2. If there's an input buffer then in fact PPMd cannot have suffixes not present in the current window (i.e. the current input buffer contents). Am I correct? If so, then model rescaling can be viewed as sliding the window, and the entire PPMd algorithm with rescaling can be viewed as a simplified sliding-window suffix tree.

    3. Does PPMd do path compression? Either internal path compression or leaf compression - well, from your description it seems that leaves are indeed compressed, i.e. a pointer into the input buffer effectively represents the chain from a leaf to its lowest branching parent in the trie.

    4. Does Shkarin's paper "PPM: one step to practicality" accurately describe the newest PPMd versions?


    I'm asking because I'm developing a compressor based on Ukkonen's on-line suffix tree construction with support for sliding the window (loosely based on/inspired by Jesper Larsson's sliding window suffix trees, but instead of using a hash map to store edges I use a left-child/right-sibling representation of N-ary trees), and I'm curious about PPMd's internals, but the sources are too complicated/cryptic for me.

  22. #22
    Member
    Join Date
    Mar 2012
    Location
    Estonia
    Posts
    14
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    valdmann:
    It seems that you're familiar with PPMd's internals. Sorry for being off-topic, but I have some questions.
    No problem. I'll try to answer, but I don't remember all the details off the top of my head.

    Quote Originally Posted by Piotr Tarsa View Post
    1. I don't know what you call overhead. I'm, for example, interested in memory usage per input byte. I've tested with -m256 -o16. For random data PPMd uses about 9 times the input size, for english.dic 20x, and for world95.txt and vcfiu.hlp about 16x. So it seems that for random data the most-used data structure is 8 bytes long, probably a pointer to the input byte and a pointer to the next sibling, totalling 9 bytes per input byte.
    Memory usage per byte depends very much on the bytes themselves. Best case scenario is a file full of zeroes (constant space). I'm not sure what the worst case is TBH.

    As for "overhead" I was referring to the number of bytes it takes to store a context. Deterministic contexts that have been seen at least twice are stored in 12 bytes. Deterministic contexts that have only been seen once are stored in one byte or less.

    If by sibling pointers you're thinking of a linked list for storing the counters in a context node (very common in older PPM implementations), then know that there are no linked lists in PPMd.

    Quote Originally Posted by Piotr Tarsa View Post
    3. Does PPMd do path compression? Either internal path compression or leaf compression - well, from your description it seems that leaves are indeed compressed, i.e. a pointer into the input buffer effectively represents the chain from a leaf to its lowest branching parent in the trie.
    I'll try to explain PPMd's data structure as far as I can understand it. Begin with a trie: each context has a pointer to a dynamic array of up to 256 counters. Each counter stores symbol value, "frequency" and a pointer (possibly null) to a child context.

    Now optimize:
    1. First, if a context has only one counter then merge it into the context structure and delete the counters array.
    2. Second, if a deterministic context predicting symbol S has only been seen once then delete that context and replace all pointers to that context with pointers to symbol S in a text buffer.
    3. Third, in all counters replace the child pointers with successor pointers (I haven't figured out this part yet).

    As you can see, there is no path compression.

    Quote Originally Posted by Piotr Tarsa View Post
    2. If there's an input buffer then in fact PPMd cannot have suffixes not present in the current window (i.e. the current input buffer contents). Am I correct? If so, then model rescaling can be viewed as sliding the window, and the entire PPMd algorithm with rescaling can be viewed as a simplified sliding-window suffix tree.
    I haven't looked much at the -r1 "cut-off" algorithm, but AFAIK the text buffer is only used for storing those symbols S I mentioned above. So I'd imagine one possibility when doing model rescaling is just to delete the whole buffer and delete all those contexts there. Since they've only been seen once they probably weren't very important.


    Quote Originally Posted by Piotr Tarsa View Post
    4. Does Shkarin's paper "PPM: one step to practicality" accurately describe the newest PPMd versions?
    The data structure is the same, but the SEE stuff is different for sure.

    Quote Originally Posted by Piotr Tarsa View Post
    I'm asking because I'm developing a compressor based on Ukkonen's on-line suffix tree construction with support for sliding the window (loosely based on/inspired by Jesper Larsson's sliding window suffix trees, but instead of using a hash map to store edges I use a left-child/right-sibling representation of N-ary trees), and I'm curious about PPMd's internals, but the sources are too complicated/cryptic for me.
    Very interesting! I'd love to see such a compressor, although I admit I also have some doubts. I mean it is quite difficult to actually use all of that information that you get from a suffix tree.

    Why use left-child-right-sibling representation though? That is so 1980s. A dynamic array would be so much more efficient both in space and in time. Yeah, you'd need a custom memory allocator, but a simple address-order first-fit allocator takes only ~50 lines of code. In the end it would probably be simpler than all that linked list stuff.
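
    Something along these lines is all I mean (a rough sketch with made-up names; block coalescing and alignment details omitted):

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Address-ordered first-fit allocator, sketch only: made-up names,
    // no block coalescing, no alignment guarantees beyond the header size.
    class Arena {
        struct Free { std::size_t size; Free* next; };  // header of a free block
        std::vector<std::uint8_t> pool;
        Free* head;                                      // free list, sorted by address
        static std::size_t round(std::size_t n) {
            return (n + sizeof(Free) - 1) / sizeof(Free) * sizeof(Free);
        }
    public:
        explicit Arena(std::size_t bytes) : pool(round(bytes)) {
            head = reinterpret_cast<Free*>(pool.data());
            head->size = pool.size();
            head->next = nullptr;
        }
        void* alloc(std::size_t n) {
            n = round(n);
            for (Free** p = &head; *p; p = &(*p)->next) {
                Free* b = *p;
                if (b->size < n) continue;               // first fit, in address order
                if (b->size >= n + sizeof(Free)) {       // split off the remainder
                    Free* rest = reinterpret_cast<Free*>(
                        reinterpret_cast<std::uint8_t*>(b) + n);
                    rest->size = b->size - n;
                    rest->next = b->next;
                    *p = rest;
                } else {
                    *p = b->next;                        // take the block whole
                }
                return b;
            }
            return nullptr;                              // arena exhausted
        }
        void release(void* q, std::size_t n) {
            Free* b = static_cast<Free*>(q);
            b->size = round(n);
            Free** p = &head;
            while (*p && *p < b) p = &(*p)->next;        // keep the list address-ordered
            b->next = *p;
            *p = b;                                      // (coalescing with neighbours omitted)
        }
    };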

    It's quite late; hopefully I haven't made too many mistakes.

  23. #23
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Thanks for answers.

    Why use left-child-right-sibling representation though? That is so 1980s. A dynamic array would be so much more efficient both in space and in time. Yeah, you'd need a custom memory allocator, but a simple address-order first-fit allocator takes only ~50 lines of code. In the end it would probably be simpler than all that linked list stuff.
    Well, there are some problems with dynamic arrays. The most serious one is that I need to maintain parent pointers in the internal (i.e. non-leaf) nodes for smooth sliding of the window (with instant sliding, parent pointers wouldn't be required), so when reallocating a dynamic array I would have to update the parent pointers in a possibly very big number of nodes/leaves. Secondly, an internal node needs much more information than a leaf; for example, a leaf doesn't need a parent pointer, child pointer or suffix link. Currently both an internal node and a leaf occupy a full record each, but I have an idea to stuff possibly many (e.g. up to four) leaves into one record, saving space. That would be roughly equivalent to linked-list unrolling. I haven't implemented it yet nor made any simulations, so I don't know how big the saving will be, but I suppose that with a good merging scheme the memory complexity should be comparable to PPMd's - of course not counting files like FP.LOG (from maximumcompression.com), where PPMd's low maximum order makes it lose a lot of information.

    And dealing with singly linked lists is very easy.
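
    For reference, the kind of record I have in mind looks roughly like this (the fields and widths here are just an illustration, not my actual layout):

    Code:
    #include <cstdint>

    // Roughly the kind of record meant above; fields and widths are only an
    // illustration, not the actual layout.
    struct Node {
        std::int32_t child;    // leftmost child, or -1 if this is a leaf
        std::int32_t sibling;  // next sibling in the parent's child list
        std::int32_t parent;   // needed for smooth window sliding
        std::int32_t suffix;   // suffix link (internal nodes only)
        std::int32_t start;    // edge label: position in the window...
        std::int32_t len;      // ...and its length
    };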

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > 1. I'm for example interested in memory usage per input byte.

    It's a tree, so there's no specific rate.
    The worst case should be about model_order*6 bytes per input byte.

    > So it seems that for random data the most used data structure is 8
    > bytes long, probably a pointer to input byte and pointer to next
    > sibling, totalling 9 bytes per input byte.

    A symbol node consists of symbol code, freq, and a suffix pointer
    (6 bytes total).
    In fact the memory manager there always works with 6-byte units.
    There's also a context descriptor node, which is 12 bytes.
    (The idea is that the context's address doesn't change after
    reallocation of its symbol table. But it's possible to save some
    memory on this - Shkarin's "ppms" coder does that.)
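
    In struct form the layout is roughly the following (field names here are guesses for illustration only; the 4-byte links may be pointers or packed offsets):

    Code:
    #include <cstdint>
    #pragma pack(push, 1)

    // Approximate layout of the two unit types; field names are guesses for
    // illustration, and the 4-byte links may be pointers or packed offsets.
    struct SymbolNode {          // 6 bytes, the memory manager's basic unit
        std::uint8_t  symbol;
        std::uint8_t  freq;
        std::uint32_t link;      // the 4-byte pointer mentioned above
    };

    struct ContextNode {         // 12 bytes, address stays fixed for the context
        std::uint16_t numStats;  // number of distinct symbols seen here
        std::uint16_t summFreq;  // sum of their frequencies
        std::uint32_t stats;     // location of the SymbolNode array
        std::uint32_t suffix;    // location of the next shorter context
    };

    #pragma pack(pop)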

    > 2. If there's an input buffer then in fact PPMd cannot have suffixes
    > not present in the current window (ie current input buffer
    > contents). I am correct?

    Not quite. Only the tree is used for coding, while window pointers
    are only used for tree updates.
    Afaik these pointers are only validated when it's necessary to use them.

    > If so, then model rescaling can be viewed as sliding the window and
    > entire PPMd algorithm with rescaling can be viewed as simplified
    > sliding window suffix tree.

    I don't see it as anything similar to sliding.
    The ppmd tree can contain nodes corresponding to strings that don't exist in the window buffer,
    and, at the same time, some strings from the buffer may not have a tree node.

    > 3. Does PPMd do path compression? Either internal paths compression
    > or leaves - well from your description it seems that leaves are
    > indeed compressed, ie pointer to input buffer accurately represents
    > the chain from leaf to lowest branching parent in trie.

    Yes, there's no internal path compression.

    > 4. Does Shkarin's paper "PPM: one step to practicality" accurately describe the newest PPMd versions?

    Afair the general description of the model and data structures is correct,
    but it doesn't mention lots of implementation details.

    > I'm curious about PPMd's internals but the sources are too complicated/ cryptic to me.

    I suppose you know about http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar ?
    If yes, what's still hard to understand for you there?

  25. #25
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I suppose you know about http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar ?
    If yes, what's still hard to understand for you there?
    Yes, I know. But the documentation is absent and the code is not commented. I would rather have a more detailed scientific paper than working code. For example, I don't fully understand Larsson's code for sliding window suffix trees, but I understand the ideas he described in his papers, and with their help I devised algorithms for my suffix tree representation, which is vastly different from Larsson's.

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > But documentation is absent and code is not commented.

    Are you one of these "loop_counter++; // increment a loop counter" guys?

    > I would rather want some more detailed scientific paper than a working code.

    There are some in http://www.ctxmodel.net/files/PPMd/
    If you don't care about specific implementation details, I think Shkarin's
    papers contain enough information.
    Also note Nelson's comp2 coder there, which can be considered a prototype
    of ppmd; there's also a DDJ paper about it.

    In ppmd's case though, I think that any natural-language description of it
    would be useless.
    All the data structures and formulas there are fairly obvious - there's
    no sliding, no tricky mixing, no interpolation, no parsing optimization, etc.
    It's only the actual implementation which is hard to beat, not the theory
    behind it.
    For example, having 3 different coding functions (basically different models)
    for different types of PPM contexts.
    So I think that ppmd_sh is currently the most concise description of Shkarin's method -
    it's about 2x smaller than the original source, and I even added some comments (at relevant points).

    Well, originally I also wanted to cut off the really complicated memory allocator there,
    but nobody seemed to care, so I switched to something else.

