I've had a fascination with compression for a number of years. I don't really know why, but I like compressing collections of files and testing various archivers, e.g. the various zip options, 7z, UC2, AIM, ARJ, etc.
A couple of years ago I even made an LZW compressor in PHP, just for fun. It was very basic, and an insult to compare it to anything like what some of you can do.

But back then I discovered the Million Random Digit Challenge and I tried to think about how this could be compressed.

I think I've come up with an idea for how this could be done. It would be extremely time-consuming, but apparently that's not against the rules.
I've considered trying to code a simple program in C to test this, but I'm not the best programmer and I just haven't gotten around to it, so I'd like to post my idea here to see whether it would actually be possible with today's technology or not.

So my idea is: compress random data by generating the CRC8 value of a 7-character block, then storing that value in the compressed file. Decompression is done by brute-forcing 7-character strings until one matches the CRC8 value.

There's a problem here: there would be a lot of matches for that CRC8 value which aren't the original string. So my idea is to generate the CRC8 value when compressing, then brute-force it and compare the first result to the original string. If it matches, save the CRC8 value into the compressed file. If the first result does not match, discard it all, advance the window by one character (still 7 characters total), generate the CRC8, brute-force it, compare it, and so on.

Some pseudocode:

Open bigfile.txt
$position = 0
while $position + 7 <= filesize
{
    $currentString = Read 7 characters at $position
    $currentCRC8 = CRC8($currentString)

    // CRC8BruteForce returns the first 7-character string,
    // in a fixed search order, whose CRC8 equals the target
    if $currentString == CRC8BruteForce($currentCRC8)
    {
        Write $currentCRC8 to compressed.txt
        $position = $position + 7
    }
    else
    {
        $position = $position + 1
    }
}
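To make the pseudocode above concrete, here is a minimal runnable sketch in Python. It shrinks the problem so the brute force finishes: 3-digit blocks over the alphabet '0'-'9' instead of 7 arbitrary characters, and a toy CRC-8 (polynomial 0x07, init 0 — one common variant, chosen here as an assumption). It also exposes a side effect of the scheme as described: characters skipped by `$position++` are simply lost.

```python
from itertools import product

ALPHABET = b"0123456789"   # toy alphabet so the brute force finishes quickly
BLOCK = 3                  # 3-digit blocks instead of 7 characters

def crc8(data: bytes) -> int:
    # toy CRC-8, polynomial 0x07, init 0 (one common variant)
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def brute_force(target: int) -> bytes:
    # first block (in a fixed search order) whose CRC-8 equals target
    for combo in product(ALPHABET, repeat=BLOCK):
        cand = bytes(combo)
        if crc8(cand) == target:
            return cand
    return b""

def compress(data: bytes) -> bytes:
    out, pos = bytearray(), 0
    while pos + BLOCK <= len(data):
        block = data[pos:pos + BLOCK]
        h = crc8(block)
        if brute_force(h) == block:   # only store if the first preimage is the original
            out.append(h)
            pos += BLOCK
        else:
            pos += 1                  # note: this skipped character is simply lost
    return bytes(out)

def decompress(blob: bytes) -> bytes:
    return b"".join(brute_force(h) for h in blob)
```

"000" is the very first candidate in the search order, so a file of repeated zeros round-trips; most blocks don't, and every skipped character is gone for good, which is one more hole in the scheme besides the ones discussed below.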

You wouldn't need to do this for the entire file: you could take the size of the compression tool into account and stop once the total output size is smaller than that.

I'm sorry in advance if this is a stupid idea. I don't claim to be a great programmer, I know just enough to be able to make small tools for my own use.

Truly random data, with high probability, doesn't compress. Unless the 'random' data is not actually random (the result of a pseudorandom number generator, or of some physical process that is not completely random), you can't do better than just storing the digits.

If the digits are binary digits, then one bit per digit is optimal.

If the digits are base 10, then since 10 is not a power of 2, probably the best approach is to convert the number to binary and write that. To make it practical you could do it in chunks, e.g. take 9 digits and write them as a 30-bit number, so 3.333... bits per digit. That would produce a 416,670-byte file for a million digits. Larger chunks would be slightly more efficient, e.g. you could take 120 decimal digits at a time and write them in 399 bits, which is 3.325 bits per digit. That's 415,659 bytes then.
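The 9-digits-into-30-bits chunking described above can be sketched as follows (a minimal illustration, not a full tool — it assumes the digit count is a whole multiple of 9; a real implementation would handle the tail group separately):

```python
def pack(digits: str) -> bytes:
    # pack each group of 9 decimal digits into 30 bits (10^9 <= 2^30)
    assert len(digits) % 9 == 0, "tail shorter than 9 digits handled separately"
    buf, nbits, out = 0, 0, bytearray()
    for i in range(0, len(digits), 9):
        buf = (buf << 30) | int(digits[i:i + 9])
        nbits += 30
        while nbits >= 8:           # emit full bytes as they become available
            nbits -= 8
            out.append((buf >> nbits) & 0xFF)
    if nbits:                       # flush the last partial byte, zero-padded
        out.append((buf << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack(data: bytes, ndigits: int) -> str:
    ngroups = ndigits // 9
    # drop the zero padding, then peel off 30-bit groups from high to low
    value = int.from_bytes(data, "big") >> (len(data) * 8 - 30 * ngroups)
    return "".join(f"{(value >> (30 * g)) & ((1 << 30) - 1):09d}"
                   for g in reversed(range(ngroups)))
```

For example, 27 digits pack into 90 bits, i.e. 12 bytes instead of 27, matching the 3.333... bits per digit figure.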

I don't think there's anything better you can do than this approach. Am I missing something?

I thought about this some 10 years ago. My idea was to use two or three different hashes at the same time so they'd cancel collisions out. I'd need to use a sufficiently large chunk so there are actual savings.
*Conceptually*, it should work. In practical terms, there is no system on earth that can compute that. Maybe after quantum supremacy, who knows.

Conceptually, it doesn't work. You can't encode all 2^n n-bit patterns to less than n bits on average and still decode. No matter how many hashes or clever tricks you use, you still need n bits on average to identify one of the 2^n possibilities.

Compression only works if the data is not random, but has some kind of structure or order. Entropy coding is about exploiting the lack of entropy in actual data to assign shorter codes to more likely data, but it also means unlikely data gets longer codes (could be just 1 bit longer, to signal 'fall-back to uncompressed', but still longer). If all data has the same probability, you can't compress.
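The counting argument behind this can be checked directly: there are strictly fewer bit strings shorter than n bits than there are n-bit inputs, so no lossless scheme can shrink every input (a small illustrative calculation, not tied to any particular compressor):

```python
n = 20
n_bit_inputs = 2 ** n
# all possible outputs strictly shorter than n bits: lengths 0 .. n-1
shorter_outputs = sum(2 ** k for k in range(n))
# one fewer output than inputs, so at least two inputs must share
# an output - and then decoding cannot tell them apart
print(n_bit_inputs, shorter_outputs)
```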

I thought about this some 10 years ago. My idea was to use two or three different hashes at the same time so they'd cancel collisions out. I'd need to use a sufficiently large chunk so there are actual savings.
*Conceptually*, it should work. In practical terms, there is no system on earth that can compute that. Maybe after quantum supremacy, who knows.

It doesn't matter how many hashes you use for rechecking.
The hashes have to be stored too, and they compress badly, which kills the benefit.
There's only one way to avoid collisions - use a hash longer than the data block =)

1) It's a mistake to blindly believe that Nelson's million-digits file is perfectly random.
It is near-random, but it was generated in 1947, with whatever methods were available,
so there's some visible redundancy.
The problem is not just compressing the file, but compressing it enough to also fit the decoder.
But there's no proof that the generation method didn't have enough mistakes to allow that.

2) There's little point in messing with the million.bin file - it's a derivative form
which provides extra randomization.

3) As to hashing/CRC - it doesn't work like that.
Your method can compress some files accidentally (well, there are many files around),
but why do you think that this specific CRC would be good for million.bin?

You can "compress" an arbitrarily large file to a fixed number of bytes using hashing. Of course, information is actually lost in the process, and due to the pigeonhole principle, decompression is simply not possible, because that same hash can also be derived from arbitrarily many different inputs.

Now, let's imagine for a moment we could, somehow, compute all or a large number of possible original strings from a given hash. The "trick" of using more hashes is because there can only be one possible solution that's in all of the outputs. CRC32 will produce a very different array of false positives than md5, or any other method. The more the merrier. The only thing all sets will have in common is the original content.

The problem is actually in the implementation, because, in Wikipedia's terms, cryptographic functions are "practically infeasible to invert", let alone trying to do that for more than one function at once. Note that they are "practically infeasible" to invert, not mathematically impossible. Quantum computers show some promise, using Grover's algorithm, but they're not there yet.

The only hope for a method like this to work would be a technological advancement so radical that right now it is indistinguishable from magic, and also using easier-to-invert functions in the first place.

decompression is simply not possible because that same string can be derived also from an arbitrary number of inputs.

This was the main part of my idea: after calculating the CRC8 hash you then brute-force it, and if the first result matches the original, you save that hash into the compressed file.
Then when decompressing the file, you brute-force each hash, and the first matching result is written to the decompressed file.

It would be extremely time-consuming, but if enough of the calculated hashes had the original string as their first brute-force match, then it could work.

It's an idea that is completely impractical for anything other than this type of challenge.

CRC is reversible: https://stackoverflow.com/questions/...rc-32-checksum
There's no need to brute-force anything, you can simply calculate the data that gives a specific CRC value.
But you only need N bits of data to produce any given N-bit CRC value... thus you can only extract N bits of data from an N-bit CRC.
With a brute-force search, any extra bits (>N) would simply be determined by the search order.

Same applies to any cryptographic hashes too, even if it may be not that easy to see.

@Gonzalo:
Multiple hashes can contain less information than one hash, but won't contain more.
Consider
hash1(x) = md5(x) mod 2^32
hash2(x) = (md5(x)*3+1) mod 2^32
With these you'd get two different 32-bit values for x, but these values would only contain 32 bits of information.
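This example is easy to verify in code — hash2 is a pure function of hash1, so storing both values never identifies more inputs than storing hash1 alone (a direct transcription of the definitions above):

```python
import hashlib

def hash1(x: bytes) -> int:
    return int.from_bytes(hashlib.md5(x).digest(), "big") % 2**32

def hash2(x: bytes) -> int:
    return (int.from_bytes(hashlib.md5(x).digest(), "big") * 3 + 1) % 2**32

# hash2 can be computed from hash1 without ever seeing x,
# so the pair (hash1, hash2) carries only 32 bits of information
for x in [b"hello", b"million random digits", b"\x00\x01\x02"]:
    assert hash2(x) == (hash1(x) * 3 + 1) % 2**32
```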

P.S. Don't be surprised when this thread moves to Random Compression soon.

@ryanstev: Still no, the algorithm will not work. And the point is not that it will take a long time (CRC is fast, by the way).
You're missing the point that the hash list will also need to be stored.
Have you tried to compress hashes? They're not compressible: you can't sort them, each hash has to stay in its position, because its position is the key to where the raw data goes.
You're just exchanging one random stream for another.
You can check this yourself, you don't need to know how to program: just generate a list of CRC hashes and try to compress it.
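That check takes a few lines with the standard library (CRC32 and zlib used here purely as an illustration):

```python
import zlib

# 10,000 CRC32 values of consecutive strings, stored as 4 bytes each
hashes = b"".join(zlib.crc32(str(i).encode()).to_bytes(4, "big")
                  for i in range(10_000))
packed = zlib.compress(hashes, 9)
# the hash list looks like random bytes, so it barely shrinks, if at all
print(len(hashes), len(packed))
```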

If I understand correctly, it is possible, albeit not certain, to use a cryptographic function that is (realistically) considered free of collisions to "compress".

Suppose we have (for simplicity) a file consisting of two bytes.
Each of them can be between 0 and 255, for a total (obviously) of 65,536 possible combinations.

If I calculate the file's hash (maybe two different ones, just to be sure) and store only that, I can later "rebuild" the starting file.

Run a nested loop from 0 to 255 (one loop for each byte that makes up the file) and calculate the hash of each candidate.

If a candidate's hash is the same as the one stored, then I have reconstructed the file, "unzipping" it
(inverting the hash).

Obviously the complexity grows "slightly" quickly.

A sort of inverted Cantor diagonalization for madmen, or a crazy arithmetic encoder, or the Final Rainbow Table!
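The nested-loop "decompression" described above can be written out directly. Here CRC32 and MD5 stand in for the "two different hashes" (my choice, not the poster's); for a 2-byte file the search does recover the original, but note what was stored: 4 + 16 = 20 bytes of hashes to "compress" 2 bytes of data.

```python
import hashlib
import zlib

original = bytes([137, 42])       # the 2-byte "file"
stored = (zlib.crc32(original),   # 4-byte hash
          hashlib.md5(original).digest())  # 16-byte hash

recovered = None
for a in range(256):              # nested loop over all 65,536 candidates
    for b in range(256):
        cand = bytes([a, b])
        if (zlib.crc32(cand) == stored[0]
                and hashlib.md5(cand).digest() == stored[1]):
            recovered = cand      # first candidate matching both hashes
            break
    if recovered is not None:
        break

print(recovered == original)      # True - but we stored 20 bytes for 2
```

The reconstruction works only because the stored hashes are ten times larger than the file, which is exactly the objection raised earlier in the thread.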

This frequently happens when people learn about password cracking.
They see that sometimes longer passwords can be recovered from a hash (e.g. a 15-letter password from a 64-bit hash)
and think that this can be directly used for compression.
The problem is that in this case the 15-letter password simply doesn't contain 15*8=120 (or even 64) bits of information,
since it was found by dictionary permutation, or some such.
There's also the charset - for example, an [a-z]{15} string contains 15*log(26)/log(2)=70.5 bits of information, rather than 120.
And if you ran the password cracker with the full \x00-\xFF charset and without statistical models or rules,
you'd likely get an earlier hash collision, which wouldn't look anything like the real password.
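The 70.5-bit figure above is just the charset size taken to the password length, in bits:

```python
import math

full_bytes = 15 * 8             # 120 bits if all 256 byte values were possible
lowercase = 15 * math.log2(26)  # information in an [a-z]{15} password
print(round(lowercase, 1))      # 70.5
```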

So yes, it is possible to use hashing as a replacement for entropy coding.
But no, it wouldn't let you compress random data, and it isn't better than normal entropy coding (like AC or rANS).