Code:
public static boolean backwardsMatch(java.util.BitSet bitSet, int setIndex, char[] pattern, int maxDepth)
{
// These two values are set when we observe a wildcard character. They
// represent the locations, in the two strings, from which we start once
// we've observed it.
//
int startIndex = setIndex;
int setBookmark = 0;
int patternBookmark = 0;
int patternIndex = 0;
// Walk the text strings one character at a time.
while (startIndex - setIndex < maxDepth) {
// How do you match a unique text string?
if (pattern[patternIndex] == '*') {
// Easy: unique up on it!
while (++patternIndex < pattern.length && pattern[patternIndex] == '*') {
} // "xy" matches "x**y"
if (patternIndex == pattern.length) {
return true; // "x" matches "*"
}
if (pattern[patternIndex] != '?') {
// Fast-forward to next possible match.
while (bitSet.get(setIndex) != (pattern[patternIndex] == '1')) {
if (--setIndex == -1 || startIndex - setIndex >= maxDepth)
return false; // "x" doesn't match "*y*"
}
}
patternBookmark = patternIndex;
setBookmark = setIndex;
} else if (setIndex > -1 && bitSet.get(setIndex) != (pattern[patternIndex] == '1') && pattern[patternIndex] != '?') {
// Got a non-match. If we've set our bookmarks, back up to one
// or both of them and retry.
//
if (patternBookmark > 0) {
if (setIndex != patternBookmark) {
patternIndex = patternBookmark;
if (bitSet.get(setIndex) != (pattern[patternIndex] == '1')) {
// Don't go this far back again.
setIndex = --setBookmark;
continue; // "xy" matches "*y"
} else {
patternIndex++;
}
}
if (setIndex > -1) {
setIndex--;
continue; // "mississippi" matches "*sip*"
}
}
return false; // "xy" doesn't match "x"
}
setIndex--;
patternIndex++;
// How do you match a tame text string?
if (setIndex < 0) {
// The tame way: unique up on it!
while (patternIndex < pattern.length && pattern[patternIndex] == '*') {
patternIndex++; // "x" matches "x*"
}
if (patternIndex >= pattern.length) {
return true; // "x" matches "x"
}
return false; // "x" doesn't match "xy"
}
}
return false;
}
This is a function in Java to be able to use wildcard patterns on bitsets. I found a piece of code on the DrDobbs website and changed it from forwards to backwards and from characters to bits. Now you can do "000?1*" and it will return true if for some index you find this pattern.
The second step is that I made a function that generates N pseudo-random patterns and goes over the bitset to find matches. I build an simple filter that recognizes not so useful patterns so we end up with at least probable useful patterns. Lets say 10 bits (1024) worth of patterns. I take the pattern that has the biggest entropy gain (P1 * matchCount) and save the index of the pattern with the probability converted to 8 bits. This way a pattern costs just 18 bits to save to the disk. The gain has to be at least 18 bits to improve compression.
I ran this algorithm on AMillionRandomDigits.bin and a 7zipped Book1. The best I could find was a gain of -8 bits. Or in other words: a loss of 1 byte for every try to compress the uncompressible. So no luck here.
Okay, lets cheat... Lets generate hundreds of thousands of patterns on AMillionRandomDigits.bin and try to gain more than 18 bits. I could not find anything...
Lets use a different approach and lets use 10000 patterns in parallel and use a conservatively configured Paq-like mixer to try to compress AMillionRandomDigits.bin and the 7zipped Book1. My best result was 2 bits loss on AMillionRandomDigits.bin and 8 byte gain on the 7zipped book1. Reason? 7zip has some compressable bits on the start and the end of the file like mentioning the filename and a bit of checksum stuff.
I think this is a good approach of trying to compress the random and it still fails. It fails because random means not predictable and that means compression is not possible.
I could try to use Data Science to assemble a random forest of decision trees, boosted by a modified version of Gradient Boosting (to entropy) that does the job for every random file you input. And then I hide the decision trees in the executable. Compression by obscurity. Still no theoretical gains though!