Fractal text compression, imperfect match LZ and distortion-resistant hash function?
There is a nice old concept for fractal compression of pictures: finding self-similarities and exploiting them, for example by encoding the mappings and differences:
https://en.wikipedia.org/wiki/Fractal_compression
It didn’t seem to be very successful, but maybe an analogous approach could be reasonable for text or binary file compression?
Specifically, while standard LZ schemes require a perfect match for a sequence, there should be many more sequences with a nearly perfect match. It might be beneficial to extend the LZ scheme to also allow imperfect matches and then encode the differences: for example, a byte sequence to XOR with, dominated by zeros – such a difference could be stored cheaply using an entropy coder.
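As a rough sketch of what such an extended LZ token could look like (the token format, function names, and brute-force search below are all made up for illustration; a real encoder would need a much cheaper similarity search):

```python
# Toy sketch of an LZ token extended with an XOR residual, so that
# near-matches become usable. Brute-force search for clarity only.

def xor_residual(window: bytes, pos: int, target: bytes) -> bytes:
    """XOR a window candidate against the target; a near-match gives a
    residual dominated by zero bytes, which an entropy coder stores cheaply."""
    return bytes(a ^ b for a, b in zip(window[pos:pos + len(target)], target))

def encode_token(window: bytes, target: bytes, max_diff: int = 2):
    """Pick the window position whose residual has the fewest nonzero
    bytes; emit (kind, offset, length, residual) if it is close enough."""
    best = None
    for pos in range(len(window) - len(target) + 1):
        res = xor_residual(window, pos, target)
        nonzero = sum(b != 0 for b in res)
        if best is None or nonzero < best[0]:
            best = (nonzero, pos, res)
    if best is not None and best[0] <= max_diff:
        nonzero, pos, res = best
        return ("match", len(window) - pos, len(target), res)
    return ("literal", target)

def decode_token(window: bytes, token):
    """XOR is its own inverse, so decoding just re-applies the residual."""
    kind, *rest = token
    if kind == "literal":
        return rest[0]
    offset, length, res = rest
    pos = len(window) - offset
    return bytes(a ^ b for a, b in zip(window[pos:pos + length], res))

window = b"the president of the company"
token = encode_token(window, b"the presidemt")   # one distorted byte
```

The imperfect match still yields a (distance, length) pair plus a residual with a single nonzero byte; it is the near-zero statistics of that residual that would make it cheap to entropy-code.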
The problem is that finding similar sequences (on the encoder side) seems very costly.
This is a very common task in DNA alignment, and a few days ago I realized that such a search can be done relatively cheaply: http://arxiv.org/pdf/1602.05889
Specifically, rate distortion theory allows one to construct Distortion-Resistant Hashing (DRH): hash values that do not change under small distortions of the object (such as a sequence).
So searching for similar sequences can be reduced to finding sequences with the same DRH value, and such collisions are relatively cheap to find.
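The paper's rate-distortion construction is not reproduced here, but a deliberately simple stand-in shows the collision-based search idea: quantize before hashing, so distortions that stay inside a quantization cell leave the hash value unchanged (all names and parameters below are invented for illustration):

```python
import hashlib
from collections import defaultdict

def toy_drh(seq: bytes, q: int = 16) -> str:
    """Toy distortion-'resistant' hash: drop the low bits of every byte
    before hashing, so small per-byte distortions that stay inside one
    quantization cell do not change the hash value. (The real DRH
    constructions tolerate far more general distortions than this.)"""
    return hashlib.sha1(bytes(b // q for b in seq)).hexdigest()[:8]

a = bytes([0x40, 0x83, 0x22, 0x95])
b = bytes([0x43, 0x81, 0x25, 0x97])   # small per-byte distortion of a
c = bytes([0x10, 0xF0, 0x55, 0x07])   # unrelated sequence

# Similarity search reduces to grouping sequences by equal hash value:
buckets = defaultdict(list)
for s in (a, b, c):
    buckets[toy_drh(s)].append(s)
```

This toy of course fails when a distortion crosses a cell boundary; the point is only the reduction: once a distortion-resistant hash exists, finding similar sequences becomes finding hash collisions, e.g. with a single hash-table pass.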
Does an LZ scheme which (additionally!) allows partial matches and then codes the differences seem reasonable? Have you come across such approaches?
What other applications of distortion-resistant hashing can you think of?
LZ77 considered harmful for text?
Charles,
My impression is that for compressing text, literal handling isn't the main limitation on compression ratio for LZ77-based algorithms---especially if you use a preloaded dictionary. It's the fact that match offset distances aren't a particularly good way to encode repeated strings.
(I assume here that shorter match distances get shorter codes; my argument is that that's not good enough.)
If you have a good, sizable preloaded dictionary for a kind of text, you'll generally have relatively few literals and a lot of matches to reasonable-sized strings, like "in that case" and "president of"; what follows a dictionary string will usually be a dictionary string, too, and you'll mostly have literals where you need to "spell out" unusual words or numbers. (And XOR prediction already works well for spelling out numbers as digits.)
The LZ77 sliding window offset model assumes that recent strings are likely to repeat, and that recency is a passable proxy for frequency. (A string like "of the" that is stably frequent is likely to be recent, too.) But for the stably frequent stuff that should go in a preloaded text dictionary, actual frequency is just better. In normal text, the distance back to the previous occurrence of a string like "of the" is fairly randomly Zipf distributed, but the distribution isn't quite as skewed as you'd like.
(You know all this, but...) LZX- and LZMA-like algorithms partially solve this problem for clearly strided (usually binary) data, where the match distances are often repetitive, using short codes for repeats of recently seen offsets. If you have strong clear strides, the current match distance is likely to be the same as the previous one, or closely harmonically related to it. (E.g., if your stride is 6, you'll tend to get matches at small multiples of 6.)
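A toy model of that repeated-offset mechanism (the class and code strings below are invented for illustration; real LZMA keeps four recent distances with slightly different update rules):

```python
class RepOffsetCache:
    """Move-to-front cache of recently used match distances: a distance
    seen recently gets a cheap 'repN' code instead of a full explicit one."""
    def __init__(self, size: int = 4):
        self.reps = []          # most recently used first
        self.size = size

    def code_for(self, distance: int) -> str:
        if distance in self.reps:
            idx = self.reps.index(distance)
            self.reps.insert(0, self.reps.pop(idx))   # move to front
            return f"rep{idx}"              # cheap: a repeated distance
        self.reps.insert(0, distance)       # expensive: explicit distance
        del self.reps[self.size:]           # keep only the newest few
        return f"dist({distance})"

cache = RepOffsetCache()
# Strided (stride-6) data keeps producing the same few match distances:
codes = [cache.code_for(d) for d in (6, 6, 12, 6, 12, 6)]
# codes == ['dist(6)', 'rep0', 'dist(12)', 'rep1', 'rep1', 'rep1']
```

On strided data most distances collapse into cheap rep codes; on ordinary text, where match distances rarely repeat exactly, almost every token would pay the explicit-distance price – which is the argument above.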
As I understand it, this just doesn't work for actual text where there's ubiquitous variation in string lengths, due to orthographic weirdness, word choice, phrasing choice, etc. All you get is a Zipfy distribution of match distances, with little exact repetition. There are other, weaker regularities in there, but LZX-like stride detection won't work.
Which makes me think that another hack on LZ77 isn't the best way to go for text. LZ77 is just the wrong algorithm. If you're going to use a dictionary compressor for text, you probably want a smarter one, more like LZ78 in its basic structure, and likely a grammar-based compressor like (Kennon Conrad's) GLZA. A grammar-based compressor can decompress very, very nearly as fast as LZ77 and give you considerably better ratios for text, closer to PPM's. (GLZA is the first to actually do so, because previous ones made one or another basic conceptual mistake in grammar construction. Kennon nailed it.)
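GLZA's grammar induction is far more careful than this, but a minimal Re-Pair-style sketch (a classic grammar-based scheme, not GLZA's actual algorithm) shows the basic structure: replace frequent adjacent pairs with rules, and decompression is just fast rule expansion:

```python
from collections import Counter

def repair(tokens):
    """Minimal Re-Pair-style grammar construction: repeatedly replace the
    most frequent adjacent pair with a fresh nonterminal rule."""
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(tokens, tokens[1:]))
        pair, count = pairs.most_common(1)[0] if pairs else (None, 0)
        if count < 2:
            return tokens, rules
        sym = f"R{next_id}"
        next_id += 1
        rules[sym] = pair
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(sym)     # replace the pair with its rule symbol
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out

def expand(sym, rules):
    """Decompression: recursively expand rules back to terminals."""
    if sym in rules:
        a, b = rules[sym]
        return expand(a, rules) + expand(b, rules)
    return [sym]

start, rules = repair(list("abcabcabc"))
flat = [t for s in start for t in expand(s, rules)]   # round trip
```

Note how the grammar names each repeated string once and reuses the rule symbol, instead of re-deriving it through a Zipf-distributed window offset each time – which is roughly the frequency-over-recency advantage argued for above.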