The compression algorithm is best understood if you "visualize" a bar chart or histogram, where each new symbol's frequency is always forced to become greater than the current highest frequency: we take the current maximum and add to it its delta with the new symbol's frequency, i.e. freq[c] becomes max + (max - freq[c]). That result is the new symbol's frequency and the new maximum; or put simply, the newest symbol must always have the highest frequency. So at most, the new highest frequency can only "double", i.e. grow by 1 bit in length, per symbol. (In decoding, the symbol with the highest frequency is the symbol to decode, which makes the scheme stack-based: last encoded, first decoded. We add the delta to the highest frequency during encoding precisely so we can get back to the symbol's previous frequency when decoding.) The actual output is the frequency table itself, which is supposedly easy to compress, or to regenerate the data from?
Algorithm pseudo-code:
Code:
/* initialize frequency table: every symbol gets a distinct, nonzero
   starting frequency; symbol 255 starts out as the maximum.
   (freq[] should really be BigInts -- a 64-bit counter overflows
   after ~55 input bytes; see below.) */
unsigned long long freq[256], max;
int i, c;

for (i = 0; i < 256; i++) freq[i] = i + 1;
max = freq[255];

while ((c = getc(infile)) != EOF) {
    /* mirror freq[c] across the current maximum:
       freq[c] becomes 2*max - freq[c], the new unique maximum */
    freq[c] = max + (max - freq[c]);
    max = freq[c];
}
No "runs" of a single character allowed in the input, as much as possible. "Random data" indeed.
New or recalled observations:
1. This algorithm ironically "expands" the frequencies at first. LOL
We're back to the early days of information theory or data compression history!
2. The bombshell: it takes more than 1 bit added to the encoding when a very small frequency suddenly must become the maximum. The solution might be to "swap" frequencies instead, but this requires new information or codes. This is back to delta coding. *sigh*
3. But a total cycling of the frequency table might work...
4. Instead of 8-bit bytes, use 4-bit symbols (see the sketch after this list).
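Here is a minimal sketch of observation 4, under my own assumptions (two nibbles per byte, high nibble first, a 16-entry table named freq4 -- none of that is specified above):

Code:
/* 4-bit variant: 16-entry table, two symbols per input byte.
   Same caveat as the 8-bit version: real frequencies need BigInts. */
unsigned long long freq4[16], max4;
int i, c;

for (i = 0; i < 16; i++) freq4[i] = i + 1;
max4 = freq4[15];

while ((c = getc(infile)) != EOF) {
    int hi = (c >> 4) & 0x0F;   /* high nibble first (assumption) */
    int lo = c & 0x0F;
    freq4[hi] = max4 + (max4 - freq4[hi]);
    max4 = freq4[hi];
    freq4[lo] = max4 + (max4 - freq4[lo]);
    max4 = freq4[lo];
}

Note that the run problem gets worse here: a byte like 0x55 already feeds two equal symbols in a row.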
***
This is similar, I think, to WEB Technologies' algorithm as featured in BYTE magazine in 1992 and noted by the comp.compression FAQ:
"WEB, in fact, says that virtually any amount of data can be
squeezed to under 1024 bytes by using DataFiles/16 to compress
its own output multiple times."
I think they were using or playing with a frequency table too: 256 32-bit frequencies = 1K.
They might have had to output the MSBit of the highest frequency, the result of which may then equal another byte's frequency (or several).
That might be why they had the problem that 4 numbers in a matrix are equal, a rare case in their algorithm.
Just maybe.
(Ideally, the frequency of the output or new symbol increases by at most 1 bit, but the bombshell above precludes that. If they are of the same bit size, then the new maximum frequency grows by only 1 bit.
The current symbol always has the highest frequency.
You decode backwards, from the last symbol to the first; the symbol with the highest frequency is the current symbol.
One parameter in decoding is the famed file_size().
)
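A minimal decoding sketch of that description, under my own assumptions: freq[] is the received final table, num_symbols plays the role of file_size(), the input had no runs (so the previous maximum is always the second-highest entry), and the first byte wasn't 255 (which, like a run, leaves the initial table unchanged):

Code:
/* decode in reverse: the highest frequency marks the current symbol */
long n;
for (n = num_symbols; n > 0; n--) {
    int i, c = 0;
    unsigned long long prev_max = 0;

    /* the symbol with the highest frequency is the symbol to decode */
    for (i = 1; i < 256; i++)
        if (freq[i] > freq[c]) c = i;

    /* no runs assumed: the previous maximum is the second-highest entry */
    for (i = 0; i < 256; i++)
        if (i != c && freq[i] > prev_max) prev_max = freq[i];

    putc(c, outfile);   /* emits the input back to front (stack-based) */

    /* undo freq[c] = 2*prev_max - old value, restoring the old value */
    freq[c] = 2*prev_max - freq[c];
}

The undo step works because encoding set freq[c] to 2*max - freq[c]; knowing the new value and the old maximum gives back the old value exactly.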
The problem with the algorithm is that the emitted frequency table becomes very large, because the frequencies themselves become very large if you really implement it with BigNums or BigInts: each step sets the new maximum to 2*max - freq[c], just under double the old one, so the maximum gains about 1 bit per input byte and the table alone ends up roughly as big as the input.
You then have to compress the very large frequency table.
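A quick demo of that growth rate (my sketch; it stays within 64 bits by stopping after 50 bytes, since max starts at 256 = 9 bits and gains about 1 bit per step):

Code:
#include <stdio.h>

/* bit length of x */
static int bitlen(unsigned long long x) {
    int n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

int main(void) {
    unsigned long long freq[256], max;
    int i, step;

    for (i = 0; i < 256; i++) freq[i] = i + 1;
    max = freq[255];

    /* feed the bytes 0,1,2,...,49 (no runs): each update nearly
       doubles max, i.e. about 1 bit of table per input byte */
    for (step = 0; step < 50; step++) {
        int c = step;
        freq[c] = max + (max - freq[c]);
        max = freq[c];
        printf("byte %2d: max is %2d bits\n", step + 1, bitlen(max));
    }
    return 0;
}

After 50 input bytes the maximum alone is already ~58 bits, and it is only one of 256 entries.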
Maybe, to achieve compression, you can just consider the MSBit after the arithmetic (addition) operation.
Or the solution is nearly just MTF (move-to-front): you have to output the character that *doubled* (MSBit activated).
WEB Technologies' DataFiles/16 algorithm is clearly designed for compression of *random* data, and recursive compression at that, both of which are futile indeed.