# Thread: A Random Data Compressor/s to solve

1. ## A Random Data Compressor/s to solve

I create this separate thread for my random data compression ideas i posted in several threads here, so that it's easy to find in one thread.

My "frequency table" random data coding was programmed in 2006-2007, perhaps with or without a decoder. Or maybe i solved it already but deleted the compressor. I remembered it in 2017 so here it is again.

Note, do not try random data compression unless you can actually write a computer program to test your ideas, however futile they might be. It might take you years to clear up on your ideas, or admit that random data compression is impossible.

2. Random data compressor #1

One slow early approach is to guess the pattern or symbol. Just try to guess the input byte in < 32 tries, to output just 5 bits. (You can do this by randomly setting bits of a dummy byte on or off and compare it with the input byte.) If not guessed, output 00000 and then 8-bit byte. How would you initialize the dummy byte? Maybe by context; crude LZP like. What else? Build on this. Improve this.

3. Random data compressor #2

How about compressing an array of 256 Very Large Ints (BigNums or infinite-precision integers), i.e. a frequency table for 8-bit symbols?

This is hard to compress, i think, since the said frequency table is already the output of a compression algorithm (which was fascinating at first).

4. The compression algorithm is best understood if you "visualize" a bar chart or a histogram, where new symbol frequencies are always trying to become greater than the current highest frequency, which we increment by its delta with the new symbol's frequency. The new highest frequency becomes the new symbol's frequency; or put simply, the new symbol must have the highest frequency. So at most, the new highest frequency can only "double" or add by 1 bit in length. (In decoding, the symbol with the highest frequency is the symbol to decode; this means it is stack based. We add the delta to the highest frequency during encoding so we can preserve or get back to the previous frequency of the symbol when decoding.) Output is actually the frequency table, which is easy to compress or generate?

Algorithm pseudo-code:

Code:
```/* initialize frequency table. */
for (i=0; i < 256; i++) freq[i] = i + 1;
max = freq[255];

do {
c = get_byte(infile);
if (c == EOF) break;
freq[c] = max + (max - freq[c]);
max = freq[c];
} while (1);```

No "runs" of a single character allowed in the input, as much as possible. "Random data" indeed.

New or recalled observations:

1. This algorithm ironically "expands" the frequencies at first. ? LOL

We're back to the early days of information theory or data compression history!

2. The bombshell: It takes more than 1 bit added to encode for very small frequencies which suddenly must be maximum. The solution might be to "swap" them but this requires new information or codes. This is back to delta coding. haist

3. But a total cycling of the frequency table might work...

4. Instead of 8-bit bytes, use 4-bit symbols;

***

This is similar, i think, to WEB Technologies' algorithm as featured in BYTE magazine in 1992 and noted by comp.compression FAQ:
"WEB, in fact, says that virtually any amount of data can be
squeezed to under 1024 bytes by using DataFiles/16 to compress
its own output multiple times."

I think they were using or playing with a frequency table too, 256 32-bit frequencies = 1K.

They might had to output the MSbit of the highest frequency, the result of which may equal another byte frequency/ies?

That's why they had the problem that 4 numbers in a matrix are equal, a rare case in their algorithm.

Just maybe.

(Ideally, at most 1 bit increase in frequency of output or new symbol, but the bombshell precludes that. If they are of the same bitsize, then only 1 bit increase in the new max frequency.

The current symbol has always the highest frequency.

You decode backwards, from last symbol to first; the symbol with the highest frequency is the current symbol.

One parameter in decoding is the famed file_size().
)

The problem with the algorithm is that the emitted frequency table could be very large due to very large frequencies if you implement it by really using BigNums or BigInts;
You then have to compress the very large frequency table.

Maybe to achieve compression, you can just consider the MSBit after the arithmetic (addition) operation.
Or the solution is nearly just MTF (you have to output the character that *doubled* (MSBit activated)).

WEB Technologies' Datafiles/16 algorithm is clearly designed for compression of *random* data, and recursive, which are futile indeed.

5. > 4. Instead of 8-bit bytes, use 4-bit symbols;

How about 2-bit (base-4) symbols? Or maybe even better, a data source of base-3 symbols ??

6. To solve the bombshell, or presence of "anomalous" symbols, (1) you must have a way to create a smooth frequency distribution or the frequencies must be of the same bitsize as much as possible, of which there are many ways to do this, but must be reversible. For example, you can maybe XOR the bytes of the data source first (this is reversible), pre-whitening it for encoding. (2) Or, the bombshell symbol or byte can be thought of as an LZ77 literal, simply output a prefix bit flag (anomalous symbol or not?) for the symbols. This means at most two bits per symbol encoding, with the bit flag to indicate if the symbol sum doubles or MSbit. Plus the anomalous symbol when it happens, 8 bits. I wonder how large would the frequencies or freqtable be...

And, like in Huffman coding, you can contrive or generate a file that is exactly perfect or suitable for this algorithm. What would be interesting is that the generated file is a "random-appearing data" file, perhaps indeed incompressible to known compressors.

(See post above, now has pseudo-code for your easy understanding.)

7. Originally Posted by Matt Mahoney
... A string is algorithmically random if there is no program which outputs that string whose code is shorter than the string.
So, if you somehow compressed a string of jumbled characters into a significantly smaller (or program) size, it is simply "random-appearing" and not algorithmically random.

8. Since everyone has all the time in the world, with these lockdowns and home quarantines,
Has anyone actually implemented RandomDataCompressor (RDC) #1 and #2, hmmm?

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•