
Originally Posted by Cristo
Part of the process of discovering and inventing is throwing out magic ideas.
Then once in a while some of them work. Or maybe I'm just a fool again.
I'll briefly explain my idea for a preprocessor. Usually dictionaries are limited in size or by some other factor, such as computation. I thought about how to build dictionaries for, and process, every type of data and info in the enwik file, so that everything could be represented in fewer bytes than it originally takes, but without converting it into random-looking binary data. The preprocessor just changes the way the data is represented, using dictionaries and other techniques, but it is not strictly compressing: it only picks the smallest value that can represent each item. (That was the Occam's razor similarity: shave the symbols down to the simplest possible form, which sounds obvious but is not actually done.)
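To make that concrete, here is a minimal sketch of what "representing with the lesser value" could look like for the word class, assuming a frequency-ranked dictionary and short printable codes. The ~ escape and every name here are my own illustration, not an actual implementation:

Code:
import re
from collections import Counter

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_code(n):
    # Base-36 codes: lower dictionary indexes get shorter strings.
    s = ""
    while True:
        s = ALPHABET[n % 36] + s
        n //= 36
        if n == 0:
            return s

def build_dictionary(text, max_entries=1000):
    # Rank words by frequency so the most common get the shortest codes.
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w, _ in Counter(words).most_common(max_entries)]

def preprocess(text, dictionary):
    # Replace whole words with ~-prefixed codes; everything else passes
    # through untouched, so the output stays text-like, not binary.
    index = {w: "~" + to_code(i) for i, w in enumerate(dictionary)}
    return re.sub(r"[A-Za-z]+",
                  lambda m: index.get(m.group(0), m.group(0)), text)

A real version would also need to escape any literal ~ already present in the input to stay reversible.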
The various classes or categories are: wiki pages, XML, tags, markdown, phrases, words, numbers (including things like dates and IPs), and punctuation marks. Then there are content mixers/un-mixers.
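As a rough sketch of how tokens might get routed to those classes, so each one can have its own dictionary (the patterns here are my guesses, not the actual ones, and they don't cover phrases or the mixers):

Code:
import re

# Order matters: more specific classes are tried first.
CLASSES = [
    ("ip",     re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("date",   re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("number", re.compile(r"\d+")),
    ("tag",    re.compile(r"</?\w+[^>]*>")),
    ("word",   re.compile(r"[A-Za-z]+")),
    ("punct",  re.compile(r"[^\w\s]")),
]

def classify(token):
    # Route each token to the dictionary/model of its class.
    for name, pattern in CLASSES:
        if pattern.fullmatch(token):
            return name
    return "other"

So classify("1997-08-29") returns "date", classify("<page>") returns "tag", and so on.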
The resulting file weighs less than the original but is not compressed yet. A binary compressor should then compress it to a smaller size than the original file, because the compressor doesn't care what the symbols represent.
Compare for example:
Lorem ipsum Lorem ipsum Lorem ipsum
and
Lorem
ipsum
lilili
where l stands for Lorem and i for ipsum: a two-word dictionary followed by the code stream. With the program/dictionary overhead being negligible, the second form should come out smaller after compression.
That's the idea for words. The same applies to all the other classes of data.
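A quick way to try the toy example, with zlib standing in for the "binary compressor" (any general-purpose one would do). Note that on an input this tiny the compressor's own header overhead can swamp the gain, so the effect only really shows at enwik scale:

Code:
import zlib

original = b"Lorem ipsum Lorem ipsum Lorem ipsum"
# Two-word dictionary header, then the code stream (l = Lorem, i = ipsum).
preprocessed = b"Lorem\nipsum\nlilili"

for label, data in (("original", original), ("preprocessed", preprocessed)):
    packed = zlib.compress(data, 9)
    print(label, len(data), "bytes ->", len(packed), "bytes")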