Results 1 to 8 of 8

Thread: lzma's binary tree matchfinder

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,378
    Thanks
    215
    Thanked 1,025 Times in 546 Posts

    lzma's binary tree matchfinder

    <cyrus>
    01 Tree
    I dont get it! the tree used in lzma is confusing.
    How does it work ?
    How does it maintain the tree for searching strings ?

    <Shelwien>
    just a binary tree of strings
    like, if( p.s<s ) p=p.ptr[1]; else p=p.ptr[0];

    <cyrus>
    It dosent look like normal binary tree. The most recent offset is
    stored in the hash. usually when we build a tree the oldest offset
    are at the top so we get those. but here i see he gets the most
    recent offsets first then the older ones but HOW ?
    I cant think of how to build a tree in reverse with recent offsets
    at top and older ones near leaves
    how how how

    <Shelwien>
    it just adds a new root to a tree every time
    with some little link updates along the search path
    but the tree is far from balanced, so its not very efficient
    but still, its more efficient than a plain chain
    also obviously there's not one tree, but one per each hashtable entry

    <cyrus>
    :(
    I tried to figure out the code but failed. i see that the new hash
    is updated with current position. and offsets from recent to old are
    searched for matches.
    the most recent one is replaced with current. and next search offset
    is replaced with previous one and so on
    but how is the pointer stuff maintained in that cyclicbuffer is dam
    confusing i cant get a big picture of whats going on
    if string is less ptr1 is adjusted else ptr0
    and cyclicpos is used to index the offsets somehow
    what kind of datastructure is this.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,378
    Thanks
    215
    Thanked 1,025 Times in 546 Posts
    Here's my refactored version of bt4's tree lookup/update: http://pastebin.com/JaqbBjB4
    It collects distance/length pairs for all available match lengths.

    > what kind of datastructure is this.
    http://en.wikipedia.org/wiki/Circular_buffer

    Though actually there're many data structures.
    There's a text buffer (double buffer where 2nd half is copied to 1st half after filling it up)
    a ring buffer for tree nodes (that "son" thing is actually the pointer to the start of tree buffer),
    and a hashtable for tree roots.

    This function is also relevant: http://pastebin.com/vCMD6TbA
    As you can see there, normally old tree nodes are simply rewritten by old ones,
    but the pointer size limit (32bit=4G) requires special handling, so at that point
    it stops and explicitly shifts the pointers.

  3. Thanks:

    encode (17th March 2016)

  4. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Shelwien:
    Could you try to implement Splay Trees instead of simple binary trees? Splay Trees are easy to implement, doesn't use auxiliary space for tags and performs well if some nodes and have "the additional property that recently accessed elements are quick to access again" (quote from Wikipedia). Or does Igor use Splay Trees already?

  5. #4
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Well, according to LZMA source, I can't get any clue about Igor used Splay Tree. Because, as Shelwien already stated, there is no balancing function. Instead, there are several trees which are accessed from a hash table. So, every trees are already small enough. And old trees are overwritten by new ones due to hashed access.
    BIT Archiver homepage: www.osmanturan.com

  6. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    And old trees are overwritten by new ones due to hashed access.
    What? They are updated IMO. In Splay Tree the most recently accessed element is the current root, so it is IMO perfectly compatible with roots hashtable.

  7. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,378
    Thanks
    215
    Thanked 1,025 Times in 546 Posts
    I think its not really a splay tree, because
    Quote Originally Posted by wikipedia
    A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again
    and lzma's tree doesn't bring up the matches it finds.
    But in a way its similar too, kinda a reduced version

  8. #7
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    It cannot be a balanced tree, otherwise the fundamental assumption "next node is older than current one" is lost.
    And this property is key to ensure that new nodes simply overwrite oldest ones.
    Otherwise, some chains will be lost in the process

  9. #8
    Member
    Join Date
    Aug 2011
    Location
    US
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts
    A while ago i tried this approach.
    A hashtable to keep the current offset. A tree like structure to hold left and right kids.
    The current offset is placed in the hashtable. the previous offset at that hash is used to start searching.
    Going left and right depending on the string comparisions, also updating the tree as we go along.
    Well the concept is simple but the implementation seems tricky as i lost compression compared to normal binary tree(with insertions at the end).

    Christian has written about this kind of tree in "RZM" posts. It is not in any books that i read either.
    It is not splay tree , not a balanced tree , it is a variant data structure of binary search tree that is yet to be named properly and classified.

Similar Threads

  1. Simple binary rangecoder demo
    By Shelwien in forum Data Compression
    Replies: 35
    Last Post: 17th June 2019, 17:21
  2. Nibble MMC matchfinder
    By Shelwien in forum Data Compression
    Replies: 4
    Last Post: 25th April 2011, 13:21
  3. BMF is not binary lossless NOR pictore lossy
    By SvenBent in forum Data Compression
    Replies: 4
    Last Post: 23rd August 2009, 13:54
  4. Delta: binary tables preprocessor
    By Bulat Ziganshin in forum Forum Archive
    Replies: 14
    Last Post: 1st April 2008, 10:43
  5. Bytewise vs Binary
    By Shelwien in forum Forum Archive
    Replies: 9
    Last Post: 30th March 2008, 17:51

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •