
Thread: DCA [antidictionary compression]

  1. #1
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    159
    Thanks
    30
    Thanked 62 Times in 38 Posts

    DCA [antidictionary compression]

    I'm looking for an antidictionary-based compressor. Does anyone know where to find one? (source & exe)

  2. Thanks:

    Jarek (28th November 2016)

  3. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    232
    Thanked 235 Times in 145 Posts
    Interesting, compression based on a list of forbidden words - let me say something about the theory behind such constraints.

    The simplest example is the Fibonacci coding - bit-sequences with forbidden word '11'.
    https://en.wikipedia.org/wiki/Fibonacci_coding
    This constraint reduces the capacity from 1 bit/node to log_2((1 + Sqrt[5])/2) ~ 0.694242 bits/node.
    For the 2D analogue of Fibonacci coding (vertical and horizontal '11' forbidden) it is ~0.5878911 bits/node (and can be practically approached: https://arxiv.org/pdf/0710.3861 ).
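    As a quick sanity check of that number, here is a rough brute-force sketch (Python; the counting recurrence is standard, the function name is just for illustration): count the n-bit strings with no '11' and watch log_2(count)/n creep down toward ~0.694.

    from math import log2

    def count_no_11(n):
        # strings avoiding "11", split by last bit; this is the Fibonacci recurrence
        ends0, ends1 = 1, 1   # n = 1: "0", "1"
        for _ in range(n - 1):
            ends0, ends1 = ends0 + ends1, ends0
        return ends0 + ends1

    for n in (8, 16, 32, 64):
        print(n, log2(count_no_11(n)) / n)   # ~0.723, 0.708, 0.701, 0.698 -> 0.694242...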

    For a general "antidictionary" (set of forbidden words), the capacity can be analytically found, but it's a bit complex (Maximal Entropy Random Walk).
    Specifically, if L is the maximal length of words in the anti-dictionary, we take all length L-1 sequences and construct a matrix from them:
    M_{ij}=1 iff after sequence i there can be sequence j
    E.g. for Fibonacci coding this matrix is {{1,1},{1,0}}.
    Then log_2(largest eigenvalue of M) is the maximal capacity in bits/symbol.
    It is achieved by a well defined transition probability:
    Pr(next = j | previous = i) = M_{ij}/lambda *psi_j/psi_i
    where lambda is the largest eigenvalue of M, psi is the corresponding eigenvector.
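    Here is a minimal numpy sketch of that recipe (Python; the function name and the alphabet-as-string convention are mine): build M over the length-(L-1) contexts, take log_2 of its largest eigenvalue as the capacity, and form the MERW transition probabilities Pr(j|i) = (M_ij/lambda)*(psi_j/psi_i). It assumes the constraint graph is irreducible, so the Perron eigenvector is strictly positive.

    from itertools import product
    import numpy as np

    def capacity_and_merw(antidictionary, alphabet="01"):
        L = max(len(w) for w in antidictionary)
        # states = all (L-1)-symbol contexts that contain no forbidden word
        contexts = ("".join(t) for t in product(alphabet, repeat=L - 1))
        states = [s for s in contexts if not any(w in s for w in antidictionary)]
        n = len(states)
        M = np.zeros((n, n))
        for i, a in enumerate(states):
            for j, b in enumerate(states):
                # j may follow i if the contexts overlap and a + b[-1] stays legal
                if b[:-1] == a[1:] and not any(w in a + b[-1] for w in antidictionary):
                    M[i, j] = 1.0
        eigvals, eigvecs = np.linalg.eig(M)
        k = np.argmax(eigvals.real)                # Perron root of the 0/1 matrix
        lam = eigvals[k].real
        psi = np.abs(eigvecs[:, k].real)           # corresponding positive eigenvector
        P = M / lam * psi[np.newaxis, :] / psi[:, np.newaxis]   # P[i,j] = M_ij/lam * psi_j/psi_i
        return np.log2(lam), states, P

    cap, states, P = capacity_and_merw(["11"])
    print(cap)      # ~0.694242 (Fibonacci coding)
    print(states)   # ['0', '1']
    print(P)        # rows sum to 1; Pr('1' after '1') = 0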

    It seems tempting to support data compression with a set of forbidden words; however, it seems difficult to make it really practical (?)

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,200 Times in 668 Posts
    1. At some point I made a non-prefix bitcode based on escape codes (aka forbidden words).
    It had a decoder similar to a huffman decoder, but it backtracked when parsing encountered
    an escape code - so decoding was a little slower than huffman, and compression was better.
    However, encoding was really dumb - it started with a huffman code, built the set of
    all bit substrings used in it (up to max codelength), then assigned the shortest
    unused string as a code for some symbol (it's unique, so it can't appear at random),
    and this process was repeated until it became impossible to reduce any symbol's codelength.

    2. In the same way, it's possible to attach some meaning to bytewise "forbidden words".
    Like, actually use them for encoding some frequent longer words/phrases
    (some text preprocessors actually do just this, by encoding some n-grams with
    unused byte codes etc) - a toy sketch of this appears after this post.

    3. Perfect coding can be approximated by counting the number of exclusions
    among the strings of length N (or more; with a context of length N-1) starting with 0 and 1 bits,
    and then arithmetic-decoding the bits (similar to stegdict).
    It's actually not that complex to make - do you think it would be interesting?
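    For point 2, a minimal toy sketch (Python; my own example, not an existing preprocessor): byte values that never occur in the input act as forbidden words of length 1, so they can be reused as shorthand for frequent digrams.

    from collections import Counter

    def pack(data: bytes, max_subs=16):
        used = set(data)
        free = [b for b in range(256) if b not in used]
        digrams = Counter(zip(data, data[1:]))
        table = []                                   # (free byte, digram) substitutions
        for (a, b), _ in digrams.most_common(max_subs):
            if not free:
                break
            table.append((free.pop(), bytes((a, b))))
        out = bytes(data)
        for code, digram in table:
            out = out.replace(digram, bytes((code,)))
        return table, out

    def unpack(table, packed: bytes):
        # undo the substitutions in reverse order
        for code, digram in reversed(table):
            packed = packed.replace(bytes((code,)), digram)
        return packed

    text = b"to be or not to be, that is the question"
    table, packed = pack(text)
    assert unpack(table, packed) == text
    print(len(text), len(packed), len(table))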

  5. #4
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    159
    Thanks
    30
    Thanked 62 Times in 38 Posts
    I haven't understood the algorithm well yet, so I'm still looking into it.
    If the compression is good or it runs fast, it is interesting.

  6. #5
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    232
    Thanked 235 Times in 145 Posts
    I was thinking about optimally using a channel that has some constraints (like in Fibonacci coding) - for data compression this seems like just a weakened PPM: instead of modeling a probability distribution for various context lengths, you only model which symbols are allowed for the next step (assuming a uniform probability distribution among them?).

    However, this is more memory friendly - it might be worth considering a hybrid: a PPM/CM which immediately cuts branches corresponding to forbidden words.
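    A rough sketch of that accounting (Python; the framing and names are just illustrative): at each position the antidictionary rules out some of the next symbols, and with a uniform distribution over the remaining ones the cost of the next symbol is log_2(#allowed) bits.

    from math import log2

    def constrained_cost(data, antidictionary, alphabet="01"):
        L = max(len(w) for w in antidictionary)
        bits = 0.0
        for i in range(len(data)):
            context = data[max(0, i - L + 1):i]
            allowed = [c for c in alphabet
                       if not any((context + c).endswith(w) for w in antidictionary)]
            assert data[i] in allowed
            bits += log2(len(allowed))     # 0 bits when only one symbol is possible
        return bits

    s = "01001010010100101001"             # contains no "11"
    print(constrained_cost(s, ["11"]), len(s))   # 13.0 bits vs 20 raw bits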

  7. Thanks:

    xezz (29th November 2016)

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,200 Times in 668 Posts
    I mainly considered this kind of approach for compressing nearly-random data (like archives).
    For example, jpegs have FF masking - FF FF can't appear there (and most other FF xx strings).
    And many rangecoders have various small skews in their output - for example, paq uses a carryless rc,
    so a long series of FF (like, longer than 4) can't appear there either.

    @xezz:
    The closest practical thing is text preprocessors, like xwrt, because they use symbol codes
    (and short strings) which don't appear in the text to encode frequent words etc.
    Otherwise it's not useful for normal kinds of data, nor is it fast.

  9. Thanks:

    xezz (29th November 2016)

  10. #7
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    My compressor bce (http://github.com/akamiru/bce) uses Compression by Substring Enumeration.
    That means it lists all minimal unique substrings. For the binary alphabet it uses, this is equal
    to specifying all minimal absent words. Thus it is equal to encoding the maximum useful
    antidictionary (it can always decode the next bit from the current context).
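    For illustration, a naive sketch of minimal absent words (Python; this is not how bce computes them, and it ignores absent single symbols): a word is a minimal absent word if it never occurs in the data although both its longest proper prefix and its longest proper suffix do.

    from itertools import product

    def minimal_absent_words(data, alphabet="01", max_len=8):
        # all substrings of data up to max_len symbols
        present = {data[i:j] for i in range(len(data))
                   for j in range(i + 1, min(len(data), i + max_len) + 1)}
        maws = []
        for k in range(2, max_len + 1):
            for w in ("".join(t) for t in product(alphabet, repeat=k)):
                if w not in present and w[1:] in present and w[:-1] in present:
                    maws.append(w)
        return maws

    print(minimal_absent_words("0100101001"))   # includes '11' and '000'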

  11. Thanks (2):

    Shelwien (1st December 2016),xezz (2nd December 2016)

  12. #8
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,019
    Thanks
    615
    Thanked 416 Times in 314 Posts
    Quote Originally Posted by Christoph Diegelmann View Post
    My compressor bce (http://github.com/akamiru/bce) uses Compression by Substring Enumeration.
    That means it lists all minimal unique substrings. For the binary alphabet it uses, this is equal
    to specifying all minimal absent words. Thus it is equal to encoding the maximum useful
    antidictionary (it can always decode the next bit from the current context).
    Christoph, could you make a Win executable of bce v0.4?
    Second question - where could I find bce v0.2 and v0.3 Win executables?

    Darek

  13. #9
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    bce v0.2 and v0.3 were pretty experimental and are around a factor of 10 slower while having worse compression (they lack the adaptive stage of bce v0.4).
    The source can still be found on my homepage: http://www.kta-design.de/releases/.
    bce v0.4 is pretty easy to compile on Windows using CMake for Windows + Visual Studio.
    You need to compile libdivsufsort as a static lib and bce will link against it.
    If you're having trouble with it, I can try to find some time to do it for you (maybe Sunday).

    If you want to try, you could use my modification of libdivsufsort (http://github.com/akamiru/libdivsufsort), which I modified to use a novel SACA that makes it run around 15% faster.
    Note that my modification has only been tested by me (but quite well) because I haven't released it officially yet.

  14. Thanks:

    Darek (2nd December 2016)

  15. #10
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    122
    Thanks
    104
    Thanked 71 Times in 51 Posts
    Here are 32- and 64-bit builds of 0.4.0 from GCC 6.2 using Christoph's libdivsufsort fork and OpenMP.
    Attached Files

  16. Thanks (3):

    Christoph Diegelmann (4th December 2016),Darek (2nd December 2016),Shelwien (2nd December 2016)
