Hello. I can't think of a better place to ask this.
The string in question may include some sort of xml tags, numbers and plain text. How would you go about achieving the best possible compression ratio for such data?
Preprocess with xml-wrt. It's included in some PAQ versions or standalone.
Umm. I think I need to state my requirements more clearly.
I have to implement compression for small, separate chunks of (mostly text) data. Assuming that general approaches using dictionaries and huge blocks are sub-optimal for this scenario, I would like to know which method or combination of methods gives the best theoretically possible results when compressing tiny amounts of data.
>best theoretically possible results
There is my CM algorithm, which lets you store common stats for a large block of data and then decode short fragments using those stats as a dictionary.
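To illustrate the general idea of sharing statistics across many short fragments, here is a minimal sketch. It is not the CM algorithm mentioned above; it swaps in deflate with a preset dictionary (the `zdict` parameter of Python's `zlib`), which is a much simpler relative of the same idea. The contents of `SHARED_DICT` and the helper names are made up for the example; in practice the dictionary would be built from a sample of your real data and shipped with both the compressor and the decompressor.

```python
import zlib

# Hypothetical sample of "typical" content. Both sides must hold the same bytes.
SHARED_DICT = b'<item id="" name="" price="" currency="USD"></item><order></order>'

def compress_chunk(chunk: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    # The preset dictionary lets deflate reference strings it has not yet
    # seen in this chunk, which is where short inputs usually lose out.
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(chunk) + c.flush()

def decompress_chunk(blob: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    # The decompressor needs exactly the same dictionary bytes.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

chunk = b'<item id="42" name="widget" price="9.99" currency="USD"></item>'
packed = compress_chunk(chunk)
plain = zlib.compress(chunk, 9)  # same chunk, no shared dictionary
print(len(chunk), len(plain), len(packed))
```

Note that the zlib container still spends a few bytes per chunk on its header and checksum, so for very small chunks the fixed overhead matters as much as the modelling.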
There is no best way, since it's a moving target. But for small text files, I would view the text as its underlying string, do a binary BWTS, and then apply some simple bijective encoder. If these small separate chunks don't live in separate files, you may have to add a count field at the start of the data so that they can be separated out.
Easier yet is to just use a static Huffman or arithmetic table that's fixed and stored offline. But still do it bijectively, as above.
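A rough sketch of the static Huffman table idea, assuming the table is built once from a training sample and then reused for every chunk, so nothing about the table has to be stored with the data. The `TRAINING` bytes and the function names are invented for the example. It does not attempt the bijective property mentioned above: it produces a string of '0'/'1' characters for readability and leaves bit packing and end-of-data handling aside.

```python
import heapq
from collections import Counter

def build_code(training: bytes) -> dict:
    # Start every byte value at count 1 so unseen bytes remain encodable.
    freq = Counter(range(256))
    freq.update(training)
    # Heap entries are (weight, unique tie breaker, {symbol: code so far}).
    heap = [(w, sym, {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    tie = 256
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code in them.
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(data: bytes, code: dict) -> str:
    return "".join(code[b] for b in data)

def decode(bits: str, code: dict) -> bytes:
    inverse = {c: s for s, c in code.items()}
    out, cur = bytearray(), ""
    for bit in bits:
        cur += bit
        if cur in inverse:  # codes are prefix-free, so the first hit is the symbol
            out.append(inverse[cur])
            cur = ""
    return bytes(out)

# Hypothetical training sample; a real one would come from representative data.
TRAINING = b'<item id="1" name="a">text 123</item>' * 50
STATIC_CODE = build_code(TRAINING)  # computed once, stored offline

msg = b'<item id="3">data</item>'
bits = encode(msg, STATIC_CODE)
print(len(msg) * 8, len(bits))
```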
In theory the best way to compress anything is to assign a probability to each possible input and use a single Huffman code for the whole string. If your strings are short enough that you can list every possibility, and you know the probabilities, that might even be practical.
But usually, that's not the case. If you have more than a few thousand or million possible inputs, then use an arithmetic code. Guessing the probabilities is another matter. In general the problem is not even computable. That's what makes data compression an art instead of a science.
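To make the link between probabilities and compressed size concrete: a string that the model assigns probability P costs about -log2(P) bits with an ideal code, and an arithmetic coder gets within a couple of bits of that for the whole string. The sketch below only computes that ideal cost; it is not an arithmetic coder. It assumes a very crude model (independent characters with frequencies taken from a training string), and the function names are made up for the example.

```python
import math
from collections import Counter

def order0_model(training: str) -> dict:
    # Estimate each character's probability from its frequency in the sample.
    counts = Counter(training)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def ideal_bits(message: str, model: dict) -> float:
    # Ideal code length is the sum of -log2 p over the characters.
    # Raises KeyError for a character the model has never seen.
    return sum(-math.log2(model[ch]) for ch in message)

model = order0_model("aaab")           # p(a) = 0.75, p(b) = 0.25
print(ideal_bits("b", model))          # the rare symbol costs 2.0 bits
print(ideal_bits("aaaa", model))       # frequent symbols cost well under 1 bit each
```

A better model (one that assigns higher probability to the strings you actually see) lowers this number directly, which is why the guessing part is where all the work is.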