28th February 2021, 21:25
I don't feel that I'm stuck. I'm posting improvements quite regularly. Why would you say that we are stuck?
Data compression is a technical challenge - not a philosophical one. So we address the issue from a technical point of view. If you think there is a better approach... show us. But it should work ;-)
Your issue is that you would like to understand why compressing a 2-digit file is worse than simply converting it to binary. This is what you asked. This is what I explained. Regarding what the issue is - I think we are on the same page.
No, it's not. The 33,421-byte file could be compressed to 4-5K; it's clearly not random. The 4096 bytes are random. From a (general) compression point of view these two files are totally different.
In your head (and in my head) the binary file is equivalent to the 2-digit file (and that's absolutely true from our viewpoint), but they are not equal for the compression engines. Please read my explanation again above.
It looks like my explanation didn't get through. I tried my best to make it clear - it looks like I was not very successful. Please read my posts again and tell me where it is difficult to understand. I'm ready to try to explain it better.
No, they will never beat conversion. The entropy is 4096 bytes in each of your cases. That's how low you can go, no matter what. Regarding their information content, they are equal - you are right! But "technically" they are different: the 2-digit file needs to be actually compressed to get to (nearly) 4K, while the 4K file is already 4K - nothing to do from a compression point of view.
You cannot beat 4K. That's the limit.
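To make it concrete, here is a minimal Python sketch. I'm assuming the 2-digit file is simply the bits written out as ASCII '0'/'1' characters - I don't know your exact format, so treat this only as an illustration:

import os
import zlib

# 4096 random bytes: already at maximum entropy, nothing for a compressor to do
random_bytes = os.urandom(4096)

# The "2-digit" version: every bit written out as an ASCII '0' or '1' character
# (32,768 characters - 8x bigger, but carrying exactly the same information)
two_digit = ''.join(format(b, '08b') for b in random_bytes).encode('ascii')

print(len(random_bytes), len(zlib.compress(random_bytes, 9)))
# 4096 and slightly MORE than 4096: random bytes do not compress at all

print(len(two_digit), len(zlib.compress(two_digit, 9)))
# 32768 and much less than 32768, but still above 4096:
# the compressor works hard just to get back near the 4K limit

The binary file doesn't shrink at all, while the 2-digit file shrinks a lot - but only back towards 4096 bytes. A stronger compressor (lzma, paq) gets closer to 4096, but never below it.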
What do you mean by "conventional excuses"? Is it that you hear similar explanations again and again that random content is not compressible? Because it's not. You experienced it yourself.
If you read most of the threads in the "Random Compression" subforum, you'll notice that most people try converting some random content to different formats hoping that it will become compressible. Like yourself. It's the most natural approach, it's true. So everyone is doing it. But everyone fails. Do you think that there is some law behind it? There is.
You still have the same information content (4K) in whatever format you convert the file to. If you convert it to a format that is actually compressible (like the 2-digit format), you just give file compressors a hard time, because they need to do actual compression to get back down to near 4K. What they see is not a random file - the 2-digit format is not random.
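And that is why plain conversion always wins here: going from the 2-digit text back to binary is not compression at all, just repacking eight characters into one byte, and it lands exactly on 4096 bytes. A quick sketch (same assumed '0'/'1' text format as above):

import os

def bytes_to_digits(data: bytes) -> str:
    # expand every byte into eight ASCII '0'/'1' characters
    return ''.join(format(b, '08b') for b in data)

def digits_to_bytes(text: str) -> bytes:
    # repack every group of eight characters into one byte - no compression involved
    return bytes(int(text[i:i + 8], 2) for i in range(0, len(text), 8))

data = os.urandom(4096)
digits = bytes_to_digits(data)

assert digits_to_bytes(digits) == data             # lossless round trip
print(len(digits), len(digits_to_bytes(digits)))   # 32768 and 4096: conversion alone reaches the limit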
You mean 256 symbols and 2 symbols (not digits). Your intuition is correct: it's easier to work with 2 symbols than with 256 symbols. This is how the paq* family works. But that does not mean that a 2-symbol file is more compressible than a 256-symbol file. From an information content point of view they are equal. It's surprising, I know. In my next post I'll give you an example; maybe it will help you grasp what information content is.
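As a quick preview, here is a toy entropy calculation (assuming, as above, that both views come from the same uniformly random 4096-byte source). The per-symbol entropy differs, but the total information content does not:

import math
import os
from collections import Counter

def entropy_bits_per_symbol(symbols) -> float:
    # empirical Shannon entropy: -sum(p * log2(p)) over the observed symbol frequencies
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

data = os.urandom(4096)                          # the 256-symbol view: 4096 byte values
bits = ''.join(format(b, '08b') for b in data)   # the 2-symbol view: 32768 '0'/'1' characters

h256 = entropy_bits_per_symbol(data)   # close to 8 bits per symbol
h2 = entropy_bits_per_symbol(bits)     # close to 1 bit per symbol

print(h256 * len(data) / 8)   # roughly 4096 bytes of information content
print(h2 * len(bits) / 8)     # roughly 4096 bytes again - same content, spread over more symbols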