I've bugged Matt Mahoney on Quora enough to make him send me here. I'm not exactly
a compression guy, more of an AI person, so please forgive me some illiteracy.
Some compression and AI foundations intersect though, and I have some basic questions,
if you guys are willing to help me out. I'll try to keep the message/question concise.
Both compression and AI look for repetitive patterns, for slightly different purposes [simplified].
Matt gave me the example of the BANANA sequence with its repetitive "NA" pattern, which could be exploited.
I mutated the BANANA by breaking "NA" with random symbols, like "BANcANdA". Now there are
no easily discoverable contiguous patterns (substrings), but there are still discoverable subsequences
like "N*A", with * being a wildcard, which might be exploited.
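To make the subsequence idea concrete, here is a tiny sketch (my own illustration, assuming the wildcard * stands for exactly one symbol, so "N*A" becomes the regex N.A):

```python
import re

mutated = "BANcANdA"

# "N*A" with a one-symbol wildcard: both broken "NA" pairs survive as gapped matches
hits = re.findall(r"N.A", mutated)
print(hits)  # ['NcA', 'NdA']
```

With a multi-symbol wildcard one would use a lazy quantifier like `N.*?A` instead, but the one-symbol version already shows the point: the repetition is still there, just gapped.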
Obviously, at the scale of a mutated but very short BANANA, those subsequences only make compression worse,
because of the bookkeeping overhead. On long texts it might make sense.
Another example is a string like "a--b++c--d++e", which is compressed to "aAbBcAdBe" with a 2-entry dictionary: "--" = A and "++" = B.
Now, compressing that string again with a 1-entry dictionary "A*B" = C (* is a wildcard), we can get:
a) aCbcCde (13 symbols --> 7 symbols), losslessly
b) CC (13 symbols --> 2 symbols), lossy
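A quick sketch of both stages (again assuming a one-symbol wildcard; the lossless variant keeps the wildcard's symbol right after each C so the original can be rebuilt):

```python
import re

s = "a--b++c--d++e"

# Stage 1: ordinary 2-entry dictionary, "--" = A and "++" = B
stage1 = s.replace("--", "A").replace("++", "B")   # "aAbBcAdBe"

# Stage 2a: lossless -- rewrite each "A*B" occurrence as C plus the wildcard symbol
lossless = re.sub(r"A(.)B", r"C\1", stage1)        # "aCbcCde", 7 symbols

# Stage 2b: lossy -- keep only the pattern occurrences, dropping everything else
lossy = "C" * len(re.findall(r"A.B", stage1))      # "CC", 2 symbols
```

The lossless form is invertible (C plus the following symbol expands back to A&lt;symbol&gt;B); the lossy form only remembers how many times the pattern fired.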
The number of possible patterns of length M built from N possible symbols is N^M.
Introducing a wildcard can be treated as adding an extra symbol "*", and the number of possible patterns
becomes (N + 1)^M, which grows faster than N^M.
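A quick numeric sanity check of that blow-up, with illustrative values I picked myself (a lowercase alphabet and 16-symbol patterns):

```python
N, M = 26, 16              # illustrative: 26-letter alphabet, pattern length 16

plain = N ** M             # patterns without wildcards
wild = (N + 1) ** M        # wildcard treated as one extra symbol
ratio = wild / plain       # equals (1 + 1/N)**M, roughly 1.83 here
```

So even a single extra "symbol" nearly doubles the pattern space at this length, and the gap widens as M grows.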
The task of mining for "wildcarded" patterns is close to the Multiple Longest Common Subsequence (MLCS) search problem.
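For reference (this is not the miner itself, just the classic two-string special case): plain LCS has a textbook dynamic program, while MLCS over many strings is NP-hard in general, which is presumably why an approximate/probabilistic miner is needed at all. A minimal sketch:

```python
def lcs(a, b):
    """Classic O(len(a)*len(b)) dynamic program for the two-string LCS."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk the table backwards to recover one longest common subsequence.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

# The original BANANA is still a common subsequence of its mutated form:
# lcs("BANANA", "BANcANdA") == "BANANA"
```

The gaps that LCS skips over are exactly where the wildcards of a mined "N*A"-style pattern would go.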
I'm experimenting with a "probabilistic MLCS miner", which processes arbitrary text corpora for MCS (I cannot claim longest).
Here are some results:
It surely makes no sense for practical compression, because mining takes days and the approach is explicitly corpus-specific.
There are unexpectedly useful consequences from a theoretical and AI viewpoint though, which I can get into later if there is interest.
Now comes the first question: are there compression methods based on searching for MCS-like patterns?
And the second one: I like my long, 16-symbol patterns; do you guys?
Matt nudged me in the "DNA compressors" direction (whatever those are), but a brief reading of the papers shows that people shy away from dealing
with ALL encountered skip-patterns, and tend to minimize the number of wildcards [to one?].
Thank you for reading. Nice to meet you.