
Thread: New data structure for bitwise CM

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts

    New data structure for bitwise CM

    Got an idea for a CM coder yesterday - basically compressed data structures again.

    There are currently two incompatible approaches:
    bytewise coders like ppmd, which use trees for statistics, and
    bitwise coders like paq, which use hashtables and bitwise statistics.

    Each approach has its benefits, but they don't combine:
    a binary tree would take too much memory,
    and dynamic symbol ranking like ppmd's doesn't work with probability-based statistics.

    But now I've got a workaround: use fixed binary trees like in bitwise coders,
    but store them in compressed form, without empty nodes.

    This makes some benefits of bytewise coders available to bitwise ones.

    For example, a coder like ppmd can be given 1G for encoding,
    but if it only uses 10M (on a small file, say), the output can be decoded with 10M.
    It doesn't work like that with hashtables:
    if paq compressed a 100-byte file with a 1G hashtable,
    it won't be able to decode it without allocating the same table.

    One problem, though, is that I don't know how fast this can be :)
    That is, ppmd-like speed is probably possible,
    but access to that kind of data structure would be complicated.
    Well, it would be useful somewhere anyway :)

    To be specific, I'm talking about something similar to the nibble-mask
    tree in my "tree" compressor.
    Tree nodes there are packed like this:
    Code:
    Suppose we have a context with '0' x 3, '1' x 2, 'A' x 1 
    (ie context history is 00011A)
    Then ppmd's tree symbol table would be like this:
    'A':1
    '1':2
    '0':3
    (3*(1+1+4)=18 bytes with byte counters)
    
    And the context node in "tree" would be like this:
    0000,0000,0001,1000 // 3x,4x enabled
    0000,0000,0000,0011 // x0,x1 enabled = 0x30,0x31
    0000,0000,0000,0010 // x1 enabled = 0x41
    3
    2
    1
    (2+2*2+3*(1+4)=21 bytes)
    The latter version is actually more compact with more symbols
    (ie if we add 5 more symbols 'B'-'F', it would be 18+6*5=48
    vs 21+5*5=46),
    and its main point is that it allows locating a symbol
    by code without a linear scan.
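To make the lookup concrete, here's a minimal C sketch of finding a symbol's slot in such a packed node by counting enabled mask bits. All names (`node_find`, `example_hi`, etc.) are made up for illustration; the real layout lives in green's node.inc.

```c
#include <stdint.h>

static int popcount16(uint16_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }  /* clear lowest set bit */
    return n;
}

/* Returns the index of symbol `c` in the packed counter table,
   or -1 if the symbol is not present in this context. */
static int node_find(uint16_t hi_mask, const uint16_t *lo_masks, uint8_t c) {
    uint16_t hb = 1u << (c >> 4);                 /* bit for the high nibble */
    if (!(hi_mask & hb)) return -1;
    int lo_idx = popcount16(hi_mask & (hb - 1));  /* which lo_mask to use */
    uint16_t lb = 1u << (c & 15);                 /* bit for the low nibble */
    if (!(lo_masks[lo_idx] & lb)) return -1;
    /* counters for earlier lo_masks, plus earlier bits in ours */
    int idx = 0;
    for (int i = 0; i < lo_idx; i++) idx += popcount16(lo_masks[i]);
    return idx + popcount16(lo_masks[lo_idx] & (lb - 1));
}

/* The example node above, for context history 00011A ('0':3,'1':2,'A':1): */
static const uint16_t example_hi    = 0x0018;             /* nibbles 3,4 enabled */
static const uint16_t example_lo[2] = {0x0003, 0x0002};   /* x0,x1 ; x1 */
```

With that node, `node_find` maps '0' (0x30) to slot 0, '1' (0x31) to slot 1 and 'A' (0x41) to slot 2, while absent symbols come back as -1.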

    But these masks were inherited from another experiment with
    alternative statistical storage structures ("ctx" with layered
    context masks), so I didn't notice that it is actually a compressed
    data structure - a full frequency table for 256 symbols, where
    zero freqs are compressed by disabling their bits in low nibble masks,
    and all-zero masks are again compressed by high nibble mask.

    Anyway, this same bitmask compression method can be applied to
    binary trees as well.
    So now I can make a bitwise CM with good counters (linear or fsm
    instead of freqs) and a tree with precise statistics
    (so it's more cache-friendly, and there's no extra multiplication per order
    for hash functions).
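As a sketch of what such a compressed fixed binary tree could look like - a 256-bit presence bitmap over the tree's node indices, with counters packed by rank and inserted on first touch. All names here are hypothetical; this is an illustration of the bitmask idea, not the actual coder.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t present[8];   /* bit i set => tree node i has a counter */
    uint8_t  counter[256]; /* packed: counter[rank(i)] belongs to node i */
    int      used;         /* number of allocated counters */
} CompTree;

/* Number of present nodes with index below `node`. */
static int rank_below(const CompTree *t, int node) {
    int r = 0;
    for (int w = 0; w < node / 32; w++) r += __builtin_popcount(t->present[w]);
    r += __builtin_popcount(t->present[node / 32] & ((1u << (node % 32)) - 1));
    return r;
}

/* Get (allocating if needed) the counter slot for tree node `node`. */
static uint8_t *tree_slot(CompTree *t, int node) {
    int w = node / 32, b = node % 32;
    int r = rank_below(t, node);
    if (!(t->present[w] & (1u << b))) {
        /* insert: shift later counters up to keep packed order */
        memmove(t->counter + r + 1, t->counter + r, t->used - r);
        t->counter[r] = 0;
        t->present[w] |= 1u << b;
        t->used++;
    }
    return t->counter + r;
}
```

Only touched nodes cost memory, so a context that saw two symbols stores a handful of counters instead of a full 255-node tree.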

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    One problem is that the array of masks has to grow, so there is overhead for that.

    Another problem is that, as with PPM, you can't model adaptive statistics with just counters - for example, distinguishing the history 00011A from A11000, which in general make different predictions.

  3. #3
    Administrator Shelwien's Avatar
    > One problem is that the array of masks has to grow, so there is
    > overhead for that.

    The masks cost one extra bit per symbol (well, 1 and 1/16),
    while ppmd stores an extra 8 bits (the symbol code) per symbol.
    Btw, the bitmask tree described above is used in "tree" compressor here:
    http://www.ctxmodel.net/files/green/green_v3a.rar (see node.inc)

    > Another problem is like with PPM, you can't model adaptive
    > statistics with just counters, like distinguishing the history
    > 00011A from A11000, which in general make different predictions.

    Yes, in "tree" it contains counters.
    But the whole point of this thread is that same
    compression method (hiding empty nodes with bitmasks) is
    applicable not only to symbol tables, but also to trees,
    so with that I can make an alphabet tree with paq counters,
    and store it in context nodes of ppmd's tree, and it won't
    require 255*sizeof(Symbol) bytes per node thanks to bitmask
    compression.
    Note also that I don't ever expand it in "tree" - I just
    count bits to find the table index by symbol code.

    Btw, there's now an instruction for bit counting - http://popcnt.org/2007/09/magic-popc...t-command.html
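For illustration, that bit-counting step maps directly to the instruction via compiler builtins (gcc/clang shown; `mask_rank` is a made-up name):

```c
#include <stdint.h>

/* The table index of a symbol = the number of enabled bits below its
   bit in the mask; __builtin_popcount compiles to a single POPCNT
   instruction on targets that have it. */
static int mask_rank(uint16_t mask, int bit) {
    return __builtin_popcount(mask & ((1u << bit) - 1));
}
```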

  4. #4
    Member Piotr Tarsa
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    > For example, a coder like ppmd can be given 1G for encoding,
    > but if it uses only 10M (small file or something), then it can be decoded with 10M,
    > and it doesn't work like that with hashtables -
    > if paq compressed a 100 byte file with 1G hashtable,
    > it won't be able to decode it without the same table.
    You can compute hashes of hashes and then store them in a (possibly adaptively-sized) hashtable. This way you can decode a 100-byte file compressed with ordinary PAQ without using 1G of RAM.

    A better way would be to use adaptively-sized hashtables directly - ie specify lower and upper limits for each hashtable (ie for each model) and then do compacting/expanding of individual hashtables.

    You can even temporarily dump currently irrelevant hashtables to disk, and/or reorder compressed files or segments so that you could safely discard some hashtables during compression with no, or minimized, compression loss.

    Of course, similar methods could be used with trees.
    Last edited by Piotr Tarsa; 27th November 2010 at 16:57.

  5. #5
    Expert
    Matt Mahoney's Avatar
    You could also have the compressor choose the hash table size depending on the input file size. (zpaq would support this).
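A minimal sketch of that idea - the sizing policy here (slots proportional to input size, rounded to a power of two, clamped) is an assumption for illustration, not zpaq's actual rule; the chosen size would go into the header so the decoder allocates the same amount.

```c
#include <stdint.h>

/* Pick a hash table size from the input size: ~4 slots per input byte,
   rounded up to a power of two, between 64K and 1G. */
static uint64_t pick_table_size(uint64_t input_bytes) {
    uint64_t want = input_bytes * 4;
    uint64_t size = 1 << 16;  /* 64K minimum */
    while (size < want && size < (1ull << 30)) size <<= 1;
    return size;
}
```

So a 100-byte file gets the 64K minimum instead of whatever the encoder was launched with.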

  6. #6
    Administrator Shelwien's Avatar
    > You can compute hashes of hashes and then store them in (possibly
    > adaptively-sized) hashtable. This way you can decode 100byte file
    > compressed with ordinary PAQ without using 1G RAM.

    Well, I guess it's possible to apply the bitmask method to hashtables as well -
    ie keep the table in compressed form without empty cells, and insert new elements.
    But what's the point if it's too slow?

    > A better way would be to use adaptively-sized hashtables directly -
    > ie. specify lower and upper limits for each hashtable (ie. for each
    > model) and then do compacting/ expanding of individual hashtables.

    I do have some kind of hashtable/tree hybrid - it supports direct
    lookups, but behavior on collisions is controlled by coefs in the nodes -
    it's guaranteed that a second item with the same hashtable index would
    be located within one rehashing step. Expansion is also supported.

    But actually, for this part (locating the descriptor node for a context),
    ppmd's suffix tree is clearly better than hashtables - you just follow
    a pointer, with no hash functions to compute and no collisions to resolve.

    The problem is after that. We need some kind of tree for stable
    processing speed; that tree has to be stored all in one place to
    be cache-friendly (and to not waste memory on more pointers), but
    we can't afford keeping full binary trees for all contexts.

    Well, fortunately now we have a solution for that with bitmask
    indexing, so for a while I can forget about hash functions,
    collisions, etc :)

