I'm looking for an antidictionary-based compressor. Does anyone know where it is? (source & exe)
Jarek (28th November 2016)
Interesting, compression based on a list of forbidden words - let me say something about the theory of such constraints.
The simplest example is the Fibonacci coding - bit-sequences with forbidden word '11'.
https://en.wikipedia.org/wiki/Fibonacci_coding
This constraint reduces the capacity from 1 bit/node to log_2((1 + sqrt(5))/2) ~ 0.694242 bits/node.
For the 2D analogue of Fibonacci coding (with '11' forbidden both vertically and horizontally) it is ~ 0.5878911 bits/node (and can be practically approached: https://arxiv.org/pdf/0710.3861 ).
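As a quick numerical check (my own sketch, not the construction from the linked paper), the 2D value can be estimated with a transfer matrix over rows of width n: the increment log2(lambda_{n+1}) - log2(lambda_n) converges toward ~0.5879 bits/node as the width grows.

```python
# Estimate the capacity of the 2D constraint (no vertical or horizontal '11',
# a.k.a. hard squares) with a transfer matrix over rows of a fixed width.
import numpy as np

def log2_lambda(width):
    # All rows of the given width with no horizontal '11'.
    rows = [r for r in range(1 << width) if r & (r >> 1) == 0]
    # T[a][b] = 1 iff row b may be stacked on row a (no vertical '11').
    T = np.array([[1.0 if (a & b) == 0 else 0.0 for b in rows] for a in rows])
    return np.log2(max(abs(np.linalg.eigvals(T))))

# Bits per node gained by widening the strip by one column;
# the values approach ~0.5878911 bits/node.
for n in range(2, 12):
    print(n, log2_lambda(n + 1) - log2_lambda(n))
```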
For a general "antidictionary" (set of forbidden words), the capacity can be analytically found, but it's a bit complex (Maximal Entropy Random Walk).
Specifically, if L is the maximal length of words in the antidictionary, we take all allowed length-(L-1) sequences and construct a matrix from them:
M_{ij}=1 iff after sequence i there can be sequence j
E.g. for Fibonacci coding this matrix is {{1,1},{1,0}}.
Then log_2(largest eigenvalue of M) is the maximal capacity in bits/symbol.
It is achieved by a well defined transition probability:
Pr(next = j | previous = i) = (M_{ij} / lambda) * (psi_j / psi_i)
where lambda is the largest eigenvalue of M, psi is the corresponding eigenvector.
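Here is a minimal numpy sketch of this recipe for the Fibonacci matrix above (just the standard eigenvalue/eigenvector computation, not tied to any particular compressor):

```python
# Capacity and maximal-entropy transition probabilities for a constraint
# given by an adjacency matrix M; here the Fibonacci case M = {{1,1},{1,0}}
# (states = previous bit, word '11' forbidden).
import numpy as np

M = np.array([[1.0, 1.0],
              [1.0, 0.0]])

w, v = np.linalg.eig(M)
k = np.argmax(w.real)
lam = w[k].real                # largest eigenvalue (the golden ratio here)
psi = np.abs(v[:, k].real)     # corresponding (Perron) eigenvector

print("capacity:", np.log2(lam), "bits/symbol")    # ~0.694242

# Pr(next = j | previous = i) = (M_ij / lam) * (psi_j / psi_i)
P = (M / lam) * psi[np.newaxis, :] / psi[:, np.newaxis]
print(P)   # rows sum to 1; after a '1' the next bit is forced to be '0'
```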
It seems tempting to support data compression with a set of forbidden words; however, it seems difficult to make it really practical (?)
1. At some point I made a non-prefix bitcode based on escape codes (aka forbidden words).
It had a decoder similar to a huffman code, but it backtracked when parsing encountered
an escape code - so decoding was a little slower than huffman, and compression better.
However, encoding was really dumb - it started with a huffman code, built a set of
all bit substrings used in it (up to max codelength), then assigned the shortest
unused string as a code for some symbol (it's unique, so it can't appear at random),
and this process repeated until it became impossible to reduce any symbol's codelength
(a rough sketch of the unused-substring search is below, after point 3).
2. In the same way, it's possible to attach some meaning to bytewise "forbidden words".
Like, actually use them for encoding some frequent longer words/phrases
(some text preprocessors actually do just this, by encoding some n-grams with
unused byte codes etc)
3. Perfect coding can be approximated by counting the number of exclusions
among the strings of length N (or more; with context of length N-1) starting with 0 and 1 bits,
and then arithmetic-decoding of the bits (similar to stegdict).
It's not that complex to make, actually - do you think it would be interesting?
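To make point 1 concrete, below is a rough brute-force sketch of the unused-substring search (my reading of the description above, not the actual coder). One detail worth noting: for a complete huffman code (Kraft sum 1) every bit pattern eventually occurs somewhere in the coded stream, so unused strings only exist once the code has holes - hence the toy example uses an incomplete prefix code.

```python
# Rough sketch: given the bit codewords of a prefix code, find the shortest
# bit string that never appears as a substring of any concatenation of
# codewords, so it can safely serve as an escape / replacement code.
# Brute force, only suitable for tiny examples.
from itertools import product

def stream_substrings(codewords, max_len, depth):
    # Substrings (up to max_len bits) of all concatenations of up to `depth` codewords.
    subs = set()
    for n in range(1, depth + 1):
        for combo in product(codewords, repeat=n):
            s = ''.join(combo)
            for i in range(len(s)):
                for j in range(i + 1, min(i + max_len, len(s)) + 1):
                    subs.add(s[i:j])
    return subs

def shortest_unused(codewords, max_len=5):
    # Enough concatenated codewords to cover any window of max_len bits.
    depth = max_len // min(len(c) for c in codewords) + 2
    used = stream_substrings(codewords, max_len, depth)
    for length in range(1, max_len + 1):
        for bits in product('01', repeat=length):
            s = ''.join(bits)
            if s not in used:
                return s
    return None

# Toy incomplete prefix code (Kraft sum 7/8), so some bit patterns never occur:
print(shortest_unused(['0', '10', '1100', '1101']))   # -> '1111'
```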
I haven't understood its algorithm well yet, so I'm looking for it.
If the compression is good or it runs fast, it is interesting.
I was thinking about optimally using a channel having some constraints (like in Fibonacci coding) - for data compression it seems like just a weakened PPM: instead of modeling a probability distribution for various-length contexts, you only model which symbols are allowed in the next step (assuming a uniform probability distribution among them?).
However, this is more memory-friendly - it might be worth considering a hybrid: PPM/CM which immediately cuts branches corresponding to forbidden words.
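A tiny sketch of that "allowed symbols only" idea (hypothetical allowed() predicate; a real coder would feed the same probabilities to an arithmetic coder): each step ideally costs log2(number of allowed next symbols) bits.

```python
# Ideal code length under a constraint-only model: the model just lists which
# symbols may follow a context and assumes a uniform distribution over them.
import math

def ideal_cost_bits(data, allowed, order=1):
    bits = 0.0
    for i in range(order, len(data)):
        options = allowed(data[i - order:i])
        assert data[i] in options          # the data must respect the constraint
        bits += math.log2(len(options))
    return bits

# Example with the Fibonacci constraint (no '11'): after a '1' only '0' is allowed.
no_11 = lambda ctx: ['0'] if ctx.endswith('1') else ['0', '1']
print(ideal_cost_bits('0100101001010010', no_11))    # 9.0 bits for 15 coded symbols
```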
xezz (29th November 2016)
I mainly considered this kind of approach for compressing nearly-random data (like archives).
For example, jpegs have FF masking - FF FF can't appear there (and most other FF xx strings).
And many rangecoders have various small skews in their output - for example, paq uses a carryless rc,
so a long series of FF (like, longer than 4) can't appear there either.
@xezz:
The closest practical thing is text preprocessors, like xwrt, because they use symbol codes
(and short strings) which don't appear in the text to encode frequent words etc.
Otherwise it's not anything useful for normal kinds of data, nor is it fast.
xezz (29th November 2016)
My compressor bce (http://github.com/akamiru/bce) uses Compression by Substring Enumeration.
That means it lists all minimal unique substrings. For the binary alphabet it uses, this is equal
to specifying all minimal absent words. Thus it is equal to encoding the maximum useful
antidictionary (it can always decode the next bit from the current context).
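For anyone curious what that antidictionary looks like, here is a brute-force sketch (definitely not bce's actual algorithm, which works on suffix structures): a word is a minimal absent word of a text iff it does not occur in the text while both the word without its first symbol and the word without its last symbol do occur.

```python
# Brute-force enumeration of minimal absent words of a binary text.
from itertools import product

def minimal_absent_words(text, alphabet='01', max_len=None):
    if max_len is None:
        max_len = len(text) + 1
    # All factors (substrings) of the text up to length max_len, incl. the empty word.
    factors = {text[i:j] for i in range(len(text))
                         for j in range(i, min(i + max_len, len(text)) + 1)}
    maws = []
    for length in range(1, max_len + 1):
        for w in (''.join(p) for p in product(alphabet, repeat=length)):
            if w not in factors and w[1:] in factors and w[:-1] in factors:
                maws.append(w)
    return maws

print(minimal_absent_words('0100101001'))   # begins with ['11', '000', ...]
```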
bce v0.2 and v0.3 were pretty experimental and are around a factor of 10 slower while having worse compression (they lack the adaptive stage of bce v0.4).
source can still be found at my homepage: http://www.kta-design.de/releases/.
bce v0.4 is pretty easy to compile on windows using cmake for windows + visual studio.
You need to compile libdivsufsort as a static lib and bce will link against it.
If you're having trouble with it I can try to find some time to do it for you (maybe sunday).
If you want to try, you could use my modification of libdivsufsort (http://github.com/akamiru/libdivsufsort), which I modified to use a novel SACA that makes it run around 15% faster.
Note that my modification has only been tested by me (but quite well), because I haven't officially released it yet.
Darek (3rd December 2016)
Here are 32- and 64-bit builds of 0.4.0 from GCC 6.2 using Christoph's libdivsufsort fork and OpenMP.
Christoph Diegelmann (4th December 2016),Darek (3rd December 2016),Shelwien (2nd December 2016)