# Thread: A pattern in random data

1. ## A pattern in random data

Ok so we know that in theory random data doesn't compress past a certain amount of entropy or `randomness` or whatever. I've tried hundreds of algorithms and none of them have worked (except, sometimes, for some data, by chance). And we know random data seems to have no discernible `order` or meaning to it. But I have found a way to show that there is a kind of meaningful 'signal' in random data.

In the below image (hopefully attached) is a visualization of a portion of `amillionrandomdigits.bin`, after a simple transform has been applied. Rather than look like a totally scattered set of random numbers, it instead shows a signal with 'features' - valleys and peaks. Yes this data does compress, but the transform has way more overhead than can be made up for. Despite the continued incompressibility, I just thought it was interesting to see that there IS something meaningful going on in what normally looks like a totally scattered set of data.

The image shows the sequence of byte values after the transform. I guess my point is that something is `happening` in the data beyond the appearance of total nonsense.

This said, it's possible to do some pretty wacky things to data to make it look meaningful, like sorting the entire file into order... not that it can be undone in fewer bytes. So maybe this is also nonsense?

2. ## Thanks:

Ernst (20th August 2016)

3. "A pattern in random data": oxymoron of the year.
If you find a pattern in what is presented as random data, you only find that the random data generator is not that random.

4. Could you show us what kind of transform you used? It could be interesting...

5. "And we know random data seems to have no discernible `order` or meaning to it."

Random data lacks discernible order or meaning by definition.

6. Random data is an interesting concept. Once you put constraints on what the file looks like you really don't have a purely random file anymore.

A random file created by a pure random generator can be anything.

It's child's play to compress any given file to one byte, if the decompressor is built around that file. So it's no surprise that you can compress one particular file. If you could, on average, compress a large number of files created the same way, then you would have something. It's even possible for a random file generator to create a file of all zeros, though it may not occur for eons. There is no way to prove a single file was not generated by a random source.
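The "compress any file to one byte" point is easy to make concrete: the trick is that the data hides inside the decompressor, not in the output. A minimal sketch (the `ORIGINAL` contents are a hypothetical stand-in for whichever file you pick):

```python
ORIGINAL = b"any fixed file contents you like"  # hypothetical target file

def compress(data: bytes) -> bytes:
    # Only works for the one file the scheme was built around.
    assert data == ORIGINAL
    return b"\x00"

def decompress(code: bytes) -> bytes:
    # The "compression" is fake: the data lives here, in the decompressor.
    assert code == b"\x00"
    return ORIGINAL

assert decompress(compress(ORIGINAL)) == ORIGINAL
```

The decompressor plus its one-byte input is, of course, larger than the original file, which is the whole point of the joke.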

Also, to say you compressed "except for the overhead" is a JOKE; the overhead is a key part of the data in compression.

7. Perhaps the "transformation" is what actually gives the file some significance. If so, can you de-transform the file to get back the original one?
I doubt it's possible. The key here is what you are doing to the file. With a not-that-complicated algo you can transform anything into anything else. It doesn't mean you are actually making any progress in data compression. Trust me, I've tried.

8. This is actually a paradox. If you can prove that random data has patterns, you have proved it to be not random. So you have proved nothing
That's because the definition of randomness implies the absence of patterns, at least in the sense we use it. As pointed before, a random generator can produce *any* string even meaningful ones. But, as the https://en.m.wikipedia.org/wiki/Infinite_monkey_theorem states:

If there were as many monkeys as there are atoms in the observable universe typing extremely fast for trillions of times the life of the universe, the probability of the monkeys replicating even a single page of Shakespeare is unfathomably minute.
So, in short, design comes from a designer. If you have found patterns in random data, your data is not random. It's likely to be designed by an intelligent mind, or a well-structured computational process (which also was designed by someone), as opposed to a "random" output.

9. "but the transform has way more overhead than can be made up for"
Isn't that exactly the issue? If you remove that constraint, I can make a pretty nifty filter that would make your random data compressible by simple RLE.
Just XOR all bytes with a same-size random file as a filter, and the resulting file will be easily compressible. It's just that "tiny" issue of the filter file adding even more overhead...
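The XOR-filter point is easy to demonstrate; a sketch using the most extreme case, where the filter file is the original data itself:

```python
import os
import zlib

original = os.urandom(65536)   # incompressible random data
key = original                 # the "filter" file
# XOR with itself yields all zeros:
filtered = bytes(a ^ b for a, b in zip(original, key))

# The filtered stream now compresses beautifully...
compressed = zlib.compress(filtered)
assert len(compressed) < 1000

# ...but only because the filter file carries all the information:
# compressed stream + filter file exceeds the original size.
assert len(compressed) + len(key) > len(original)
```

Any fixed random filter gives the same outcome: whatever the XOR removes from the data has to be stored in the filter, so the total never shrinks.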

10. Hi folks. So yes, you are correct, as I stated, there is more overhead than can be saved by the representation of the data in this transformed way, and therefore the final `total` of bytes needed is not a compression, but an expansion. I don't think I was trying to make out that this was a compression as such. Mainly that it is just very interesting that seemingly scattered data can be turned into something much less so.

As another person also said, you can take some pretty simple algorithms and 'do stuff' to random numbers to make them seem orderly... like I said.. e.g. sort the whole file into order... then the byte values form approximately a nice straight progression of increase from 0 to 255 ... looks very orderly and meaningful, but definitely not representable in fewer bytes than we started with.

I find the properties of random data to be really fascinating. Although it may be true most data like this can't be compressed further, it doesn't stop one from trying all manner of approaches `just in case` ... And like someone else said, you can sometimes compress some files by a little bit, but not always all files using the same model. Random data is very slippery that way, avoiding compressibility usually by just a small fraction in each attempt.

There is a way to compress some random files some of the time, and that is to e.g. encode a 15-bit `location` of a known value (e.g. two bytes with value 0,0). Provided that pattern happens to occur in the first 32kb of data, you can save 1 bit. But whether this happens is itself down to chance, so as it turns out a little fewer than half of all random data samples will have this property and be compressible, while the rest won't. And it's not predictable which will or won't, so you can't just store some flag to say which value to look for... because that then narrows down the search range further, which doubles the unlikelihood of finding it, etc... gotta love how that works, but it is infuriating.
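A quick Monte Carlo check of the odds described above (a sketch; for a two-byte 0,0 pattern somewhere among the first 32768 starting positions, the probability works out to roughly 0.4):

```python
import random

random.seed(1)  # fixed seed so the experiment is repeatable
trials = 200
# Does 0x00 0x00 occur among the first 32768 starting positions?
hits = sum(b"\x00\x00" in random.randbytes(32769) for _ in range(trials))
frac = hits / trials
print(frac)  # a bit under half of the samples qualify (~0.4)
```

Which matches the post: a bit fewer than half of random samples admit this particular 1-bit saving, and the rest don't.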

Anyway... this transform is simple. I'll share it since right now it's not really leading to anything.

Take a set of random data, e.g. a 256-byte block. Sort it into order, low to high. Reversing the sort requires storing the permutation - one of 256! possibilities - which takes around log2(256!) ≈ 1684 bits, i.e. roughly 211 bytes. Then, having sorted the block of data, find the difference between the value of each byte and its index/location in the block. In other words, pretend the values are supposed to be completely evenly distributed from 0..255, with no gaps or repeats. Subtracting each byte's offset from the start of the block from its value gives a graph of the `deviation` from the predicted/expected orderly values. Obviously this is easily reversible by just adding the indexes back onto the deltas to get the original values - however, storage of the deltas is always too big (when added to the sorting data) to produce an overall compression.
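The transform described above can be sketched in a few lines (my reading of the description; `transform` returns both the deviation deltas and the sort permutation needed to undo it):

```python
import os

def transform(block: bytes):
    # Sort the block, remembering the permutation (the "unsort" data).
    order = sorted(range(len(block)), key=lambda i: block[i])
    sorted_vals = [block[i] for i in order]
    # Deviation of each sorted value from a perfectly even 0..n-1 spread.
    deltas = [v - i for i, v in enumerate(sorted_vals)]
    return deltas, order

def untransform(deltas, order):
    # Add the indexes back, then undo the sort.
    sorted_vals = [d + i for i, d in enumerate(deltas)]
    block = bytearray(len(order))
    for pos, src in enumerate(order):
        block[src] = sorted_vals[pos]
    return bytes(block)

block = os.urandom(256)
deltas, order = transform(block)
assert untransform(deltas, order) == block
# Reversible - but the deltas plus the ~211 bytes of permutation data
# always exceed the 256 bytes we started with.
```

Plotting `deltas` reproduces the smooth "deviation" curve the original post describes.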

So what you're looking at in the graph is a presentation of how the distribution of bytes differs from a completely even distribution. So instead of the 112th byte being a value of 112, it might be 142, and thus deviates by 30.

What I find really fascinating here, is the way in which multiple sequential bytes seem to have a very similar deviation. E.g. the 112th, 113th and 114th bytes might all sway away from the even distribution prediction by a similar or ascending or descending amount, forming a `trend`. I ask myself, WHY would seemingly unrelated byte values, which originally were in scattered locations, WHY when sorted do they seem to have this behavior of deviating along a similar higher-level path? You would think they should just be pretty randomly scattered. Is it perhaps because they've been `combined with` the influence of the sort sequence, but only partially? i.e. if you sort the entire file you get an almost completely equal distribution graph... but since our sample is smaller than that, it seems like `part way` between randomly scattered and evenly distributed. At least that's about the best I can understand what I'm seeing.

But even though I'm seeing this, it still seems really interesting to me that the `next higher` byte value is following a similar trend to the one before it, even though they were totally scattered in the random data.

Another thing I find interesting is that even when I sort each block of 256 bytes completely separately, trends in the deviation connect up across block boundaries. I can get my head around that even less. It's very intriguing.

Is it meaning? Is it illusion? Is it all about subjective perception and interpretation? Is it useful? Maybe, maybe not.

11. ## Thanks:

Gonzalo (10th July 2015)

12. Incidentally, I stumbled upon a good program to evaluate the randomness of a file :
http://www.fourmilab.ch/random/

The key value to watch is the Chi-square Test. A value close to 1% or 99% indicates the data is not truly random - like the values generated by a Linear Congruential Generator, for example:
```asm
    rdtsc                        ; seed EAX from the time-stamp counter
    mov     edi, 403000h         ; destination buffer
    mov     ecx, 2000h           ; number of dwords to generate
L1:
    imul    eax, eax, 015A4E35h  ; LCG step: state = state * 15A4E35h
    stosd                        ; store state at [EDI], advance EDI
    loop    L1
    int     3                    ; breakpoint
```

More sophisticated Pseudo Random Generators would return a Chi-square test closer to 50%, like the PCG:
```
state = (state * multiplier) + increment    ; 64-bit LCG state update

PCG-XSH-RR:
output = rotate32((state xor (state shr 18)) shr 27, state shr 59)

PCG-XSH-RS:
output = (state xor (state shr 22)) shr (22 + (state shr 61))
```
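For illustration, the chi-square statistic that tools like the one above report can be computed directly from a byte histogram. A sketch (the weak multiplicative LCG mirrors the assembly listing; the statistic should sit near 255, the degrees of freedom, for good data, and far away for the LCG's low bytes):

```python
import os

def chi_square_bytes(data: bytes) -> float:
    # Chi-square of the byte histogram against a uniform distribution.
    expected = len(data) / 256
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

def lcg_bytes(n: int, state: int = 12345) -> bytes:
    # Low bytes of the multiplicative LCG from the assembly listing.
    out = bytearray()
    for _ in range(n):
        state = (state * 0x15A4E35) & 0xFFFFFFFF
        out.append(state & 0xFF)  # the low byte has a very short period
    return bytes(out)

print(chi_square_bytes(os.urandom(65536)))  # near 255
print(chi_square_bytes(lcg_bytes(65536)))   # enormous: clearly non-random
```

The low bits of a power-of-two-modulus LCG cycle with a tiny period, so many byte values never occur at all, which is exactly what the chi-square test punishes.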

13. "WHY would seemingly unrelated byte values, which originally were in scattered locations, WHY when sorted do they seem to have this behavior of deviating along a similar higher-level path?"

I think you could relate this to the law of large numbers, i.e. the fact that over a large enough number of samples, the average of a random variable approaches the expected value more and more closely. In your transform, the position of an element x is determined by the number of elements in the block that are <= x, which is a random variable with expected value ≈ x. So, with big enough blocks, the element values and positions converge and give a predictable-looking result.
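This is easy to confirm empirically; a sketch of the deviation statistics for one sorted 256-byte block:

```python
import random

random.seed(0)
block = sorted(random.randrange(256) for _ in range(256))
# Deviation of each sorted value from its index, as in the transform above.
devs = [abs(v - i) for i, v in enumerate(block)]
avg_dev = sum(devs) / len(devs)
print(avg_dev, max(devs))  # typically well under 256: positions track values
```

The rank of a value is a sum of 256 indicator variables, so its standard deviation is at most about sqrt(256/4) = 8, which is why the deviations stay small and why neighbouring ranks drift together in smooth "trends".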

"Although it may be true most data like this can't be compressed further, it doesn't stop one from trying all manner of approaches `just in case` ..."

It might be more helpful to spend the time trying to understand why compressing random data is impossible, because even if one of your approaches finally succeeds, all that will show is that you made a mistake somewhere.

14. Random data does have some redundancy to it. Byte and bit patterns do reoccur quite often, although in scattered positions. The problem is that typically those reoccurrences are spaced, on average, just slightly too far apart to be able to encode them in fewer bits. The trick then might be simply finding an encoding method which is leaner and more efficient than the output from previous compressors. After all, if e.g. we ZIP a file and it gets to 90kb and the output seems random, but then we LZMA the file and it gets to 70kb, obviously both outputs seem random, yet LZMA has identified an even more efficient way to encode the redundancy. And then there are other compressors which can produce even greater compression. So although there may theoretically be a limit to such compression, it still stands that inefficiencies in a given compressor leave room for improvement. Data can look random and yet still exhibit some limited `potential` for further reduction, up to a point.

15. strictly speaking, there is no such thing as "random data". there is an entropy of source, i.e. number of data sets it can produce

16. Originally Posted by Bulat Ziganshin
strictly speaking, there is no such thing as "random data". there is an entropy of source, i.e. number of data sets it can produce
Whether there is such a thing as random data is somewhat of a philosophical question. I think you and I agree; randomness is a property of the process that produced the data, not the data itself. OTOH, Matt has argued that some data is inherently random -- the catch, of course, being that recognizing it requires knowing its Kolmogorov complexity.

17. Originally Posted by ImaginaryHuman
Random data does have some redundancy to it. Byte and bit patterns do reoccur quite often, although in scattered positions. The problem is that typically those reoccurrences are spaced, on average, just slightly too far apart to be able to encode them in fewer bits. The trick then might be simply finding an encoding method which is leaner and more efficient than the output from previous compressors. After all, if e.g. we ZIP a file and it gets to 90kb and the output seems random, but then we LZMA the file and it gets to 70kb, obviously both outputs seem random, yet LZMA has identified an even more efficient way to encode the redundancy. And then there are other compressors which can produce even greater compression. So although there may theoretically be a limit to such compression, it still stands that inefficiencies in a given compressor leave room for improvement. Data can look random and yet still exhibit some limited `potential` for further reduction, up to a point.
A lot of what you've said is true, however, you seem to be over-generalizing based on special cases where random-looking data happens to be compressible. The proof that not everything can be compressed is based on the pigeonhole principle. If you start there, it should start to become clear that a randomly-selected input is overwhelmingly-unlikely to compress, and your choice of compression scheme can't make any difference.

18. Understood. I do see though that `some` random data can be compressed. I also see that an algorithm that causes some random data to compress causes other random data not to compress, which is understandable. The data changes so dynamically/chaotically that the encoder is basically not flexible/dynamic enough to react to it.

19. Originally Posted by ImaginaryHuman
Understood. I do see though that `some` random data can be compressed. I also see that an algorithm that causes some random data to compress causes other random data not to compress, which is understandable. The data changes so dynamically/chaotically that the encoder is basically not flexible/dynamic enough to react to it.
The upper bound for compressing random data is 50% of files (you can't quite reach 50%). Even an encoder with perfect flexibility and dynamism could not compress on average since there are not any predictable trends in random data.
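The ~50% figure drops straight out of counting bit strings; a quick sketch of the arithmetic:

```python
# A lossless compressor must map distinct files to distinct outputs, so
# files that shrink must land on strictly shorter strings. Counting all
# strings up to length n shows just under half can possibly shrink:
n = 20
total = sum(2 ** k for k in range(n + 1))  # strings of length <= n
shorter = sum(2 ** k for k in range(n))    # strings of length <  n
print(shorter / total)  # just under 0.5
assert shorter == total // 2
```

Strings shorter than any length L are always outnumbered roughly 2:1 by strings of length up to L, so no injective scheme can shrink even half of its possible inputs.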

20. Been a while.
When I decided to look at compression as a science, these are some things I noticed.

Compression requires prevalence/repetition of some sort in order to be able to record a header/overhead and ultimately represent the data in a smaller way.
There is a relationship between CPU/Bandwidth/Storage on a computer. The process of compression reduces the storage aspect and, via the overhead, offloads that work to the CPU - i.e. you transform the data into a smaller form which needs more CPU to reproduce the original file.
In order to offload the transformation from storage into CPU steps, you require something for the CPU to do in terms of recording prevalence/repetition.

If you have a file which you can re-write with a header and use the CPU to transform it, you can compress a file.

There is a file spectrum of every possible combination of files of all sizes and iterations of 0/1. A 'random' file is technically any file which falls into this - i.e. any file possible in the world.

Consider that if you compress a file, you find that it generally cannot be compressed any further. This is generally a transformation of a 'compressible string in the file spectrum' into a file which is an 'uncompressible string in the file spectrum'. Compressing a file usually rewrites the data in a way that makes the result uncompressible.

It is as though you take compressible data strings in the file spectrum and convert it to its correlating uncompressible data string in the file spectrum. A compressed file needs to have a 'slot' in the file spectrum in order to have its representation.

If you want to give the highest opportunity to compress 'random' data, you need to be able to detect the compressibility of the file, to see where you can break even between the header cost and the CPU work.
There are common methods, like looking for every 8 bit string to recode it, or finding repetition in a row (RLE) and rewriting some data (like BWT).

From my research, there are certain arrangements of data in their extremes which compress to the highest possible for each approach.
There is a certain arrangement of data which will compress to the smallest possible if you give it the '8 bit string' approach, and likewise for the in-a-row (RLE) approach and so forth.
If you have a file of 0 bits repeating in a row for 4GB, and you represent this in your header as '0 bit' and '4GB worth', you cannot compress this data any smaller.
If you have a file where a 0 bit repeats every 32 bits apart for 4GB worth, and you represent this in your header as '0 bit', 'every 32 bits apart', '4GB worth', you cannot possibly compress this smaller, because it is the smallest possible way to write it.
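The extreme RLE case above is easy to sketch: a hypothetical 9-byte header describing the "0 bit, 4GB worth" file (the field sizes here are my own choice, purely for illustration):

```python
import struct

def encode_run(bit: int, length_bits: int) -> bytes:
    # 1 byte for the repeated bit value + 8 bytes for the run length.
    return bytes([bit]) + struct.pack(">Q", length_bits)

def decode_run(header: bytes):
    return header[0], struct.unpack(">Q", header[1:9])[0]

four_gb_of_zero_bits = 4 * 2**30 * 8
header = encode_run(0, four_gb_of_zero_bits)
assert decode_run(header) == (0, four_gb_of_zero_bits)
assert len(header) == 9  # 4 GB of zeros described in 9 bytes
```

Such degenerate inputs show why a per-method "best case" exists, while the counting argument elsewhere in the thread shows why these cases are vanishingly rare among random files.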

There is a proportion in the entire file spectrum, based on size, as to how much compression targeting '8 bit strings' or 'in a row (RLE)' can possibly give you - i.e. you can calculate how much RLE you can possibly find in every iteration of data 128 bytes long and give a percentage of what it can possibly do.

What I am saying is that there is a formula which one can apply to data of 0 and 1 bits and every single iteration and how much each approach can compress in the grand scale.

If, instead of using 8 bit string patterns as in common LZ, you opt for a variable and dynamic approach with varying-length bit patterns, this can bring down the filesize for some files. Consider however that using 8 bit patterns is optimal for a select number of iterations of files out there, as they will compress to the smallest possible using 8 bit patterns. This is why allowing for something dynamic, and optimizing the header to minimize the cost of identifying multiple lengths, also assists, and can allow for compressing a larger range of data in the data spectrum.

The point here is that by widening the ability of compression in regards to the methods implemented, you can widen the scope of possible compression on data (and on 'random' data being any file thrown at this).

You give the highest possible opportunity to compress a 'random/any' file if you widen the scope of what your tool can target to compress.

If you allow for dynamic strings, dynamic RLE, dynamic encoding (switch between static length / variable length codes, as well as multiple code trees throughout the file), rewriting methods (such as a formula to read bits in a custom way, or read all right to left, or read all bits in every odd then even number), and more, you give the highest potential to compress a file.
The point is that if such software cannot compress a file, it is because it is mathematically impossible, since the software covers the full scope of what is able to be compressed and detected.

Of course, the more methods you incorporate, the more scanning you need to do to identify the data, in order to know which methods to apply and where, etc.
If you know your constraints - such as having a lot of 0 bits in a row for much of the file - you can already tell in advance and exclude rewriting the file (like BWT) when you can already tell it is redundant given the overheads.

If you include all these methods and determine constraints, you can minimize possible scanning to what is mathematically required.
I have determined a number of methods to approach compressing a file, and to take advantage of even compressing a single stretch of data with one entry in the header and leaving the rest as raw - i.e. having the file exactly as it is and having a header which accommodates for a specific offset and a little bit of compression just for that area.
The point of this is to be able to compress a file in any possible way where it is mathematically feasible.
By accommodating for all possible scenarios to detect and record the compression, you can have a versatile and effective tool on a wider (widest) scope of files mathematically possible.

I am still (in free time) working on writing more about all these methods I have written about and how they correlate with their boundaries/priorities for minimizing the whole scanning/applying functions.
Every single method can compress some file in the file spectrum to the smallest possible (like the RLE example above), and the methods can work together with their correlating boundaries/priorities accounted for. By using all these approaches (where deemed appropriate from the constraints determined in advance), you give the highest opportunity to get compression from a 'random/any' file out there. (This also includes rejecting writing anything if you can calculate in advance, on an initial run, that it will not reduce the file size, as well as possible avenues where you need to branch to be sure of which route gives a smaller file.)

I'm not a programmer familiar with syntaxes of C and so forth, but I am familiar with the computer requirements such as CPU/RAM/Storage etc and the nature of bits of 0/1 and how these all need to come together (on paper).

So in regards to a 'pattern in random data' or understanding why a 'random file' will not compress, consider the file spectrum of any iteration of 0/1 bits of any length and how compressible data needs a slot to be written to (which is usually an uncompressible data slot in transformation) and how prevalence which can be recorded needs to be present in the file (and be able to detected from all methods to compress in the world).

If anyone is interested, I have written more methods to compress data since my last thread of 'ultimate compression' and can post more if there is interest in reading/doing something.

21. Perhaps some of the more experienced people here can comment, but it seems to me that there are several inaccurate statements in SoraK05's post concerning the compressibility of random data.

Originally Posted by SoraK05
If you have a file which you can re-write with a header and use the CPU to transform it, you can compress a file.
On average for random data, the cost of a header will be greater than the average savings. The act of writing a header and using a transformation by no means guarantees you can compress a file. The most likely outcome is an output file that is at least as big as the original file and the average result can not be a smaller file.

Originally Posted by SoraK05
If you want to give the highest opportunity to compress 'random' data, you need to be able to detect the compressibility of the file, to see where you can break even between the header cost and the CPU work.
There is no advantage to detecting compressibility of a file containing random data. On average, they are not compressible and a header only adds inefficiency that reduces the probability of compression.

Originally Posted by SoraK05
You give the highest possible opportunity to compress a 'random/any' file if you widen the scope of what your tool can target to compress.
Adding more methods to an attempt to compress random data increases the inefficiency and reduces the probability of compression.

There is no method for compressing random data that will have a higher probability of achieving compression than a simple encoder that assumes the first bit is a 0: drop the first bit if it was a 0, and write a very long special symbol to indicate the first bit was not a 0. This would achieve compression on 50% of files, except for the fact that the encoder would have to replace any pattern that matched a chosen N-1 of the N bits in the special symbol with a pattern that is one bit longer.

22. Hi, I've been developing a mechanism to compress random data based on VOICE (as in speech) instructions, where you record an MP3 audio file sampled as 16-bit PCM at 8000 Hz, and then re-encode it using a 450 bps codec, obtaining 1 hour (up to 73 minutes) of lossless (I mean recognizable) audio speech with a size of 200 KB. That's 450 bits per second × 3600 seconds = 1,620,000 bits = 202,500 bytes, so about 200 KB. The resulting file can be further compressed down to 85-100 KB using LRZIP based on zpaq (100% of the time, because the audio codec output has a lot of redundancy while representing the soundwaves). So what you end up with is 85,000 bytes containing 73 minutes of human speech. The thing is, this technique allows you to pack 150KB worth of random data in that single 85KB file.

The audio recording contains instructions meant for another human to understand it, and speech recognition to speed the process up, but speed is not my priority right now. I'm just trying to prove (and I actually have the results) that any random data file can be:
1. Described using plain english
2. Recorded at 85kb per hour.
3. In 1 hour (85kb), contain 150kb of the original random data to be restored.

That's 150kb of random data in an 85kb codec-2 file. If anyone is interested I'll post the whole technical mechanism. But keep in mind that this is not efficient AT ALL, since it requires months to restore a file (it uses a lot of brute-force guesses as to what the audio really means).

23. JonasX, that's not random data when considering this topic, since you say yourself your output has a lot of redundancy. That's a pattern, so it's not random, it's patterned.
In this context random data does not mean any given data, but a piece of data that has no repeatable or apparent pattern. So, as said earlier, if you can compress it (including header info) you have just proven it was not really random (in regards to this topic).
I am also pretty sure your compression is lossy and not lossless. Once you start to accept tossing away data as compression, you can compress anything down to sub 1 byte. Everything missing is just the "lossy" part.

Better people than me can explain it better or probably correct me, but it seems like people are using two different terminologies: randomly created vs. randomly patterned.

24. Originally Posted by SvenBent
JonasX, that's not random data when considering this topic, since you say yourself your output has a lot of redundancy. That's a pattern, so it's not random, it's patterned.
In this context random data does not mean any given data, but a piece of data that has no repeatable or apparent pattern. So, as said earlier, if you can compress it (including header info) you have just proven it was not really random (in regards to this topic).
I am also pretty sure your compression is lossy and not lossless. Once you start to accept tossing away data as compression, you can compress anything down to sub 1 byte. Everything missing is just the "lossy" part.

Better people than me can explain it better or probably correct me, but it seems like people are using two different terminologies: randomly created vs. randomly patterned.
I should've been more precise; it seems you didn't quite understand my post.

You take 150 KB of random data (7.999999 bits of entropy per byte), then generate 200 KB of REDUNDANT data (an audio codec-2 file), which is then compressed down to 85 KB. The 200 KB of redundant audio data, now 85 KB, contains the instructions to generate the 150 KB of random data, so you end up with the random data at about 57% of its original size.

It's lossy, yes. But by brute-forcing every possible byte per instruction in the audio file, the computer eventually spits out (in 605.4 hours in some tests) the exact random data, completely lossless. This is because we are dealing with a human being who is listening to the audio and deciding what is correct and what is not, but it is perfectly possible to do with a speech recognition program.

25. If your data contains recognizable speech it is not random.

26. Originally Posted by Kennon Conrad
If your data contains recognizable speech it is not random.
The original random data contains no speech. Its any random sequence of bytes, which is then encoded as speech.

To compress:
Random data --- (encoding) ---> Speech data --- (compression) ---> Compressed data.

To decompress:
Compressed data --- (decompress) ---> Speech data --- (bruteforce decoding) ----> Random data.

27. Originally Posted by JonasX

Random data --- (encoding) ---> Speech data
Actually, it seems I read too many of these posts claiming compression of random data. It's funny how many posters issue verbal attacks when their claims of having compressed random data are questioned. You did say you STARTED by recording an MP3 audio file BEFORE re-encoding it. That is not random data and as SvenBent pointed out you proved it's not random.

28. Originally Posted by Kennon Conrad
Actually, it seems I read too many of these posts claiming compression of random data. It's funny how many posters issue verbal attacks when their claims of having compressed random data are questioned. You did say you STARTED by recording an MP3 audio file BEFORE re-encoding it. That is not random data and as SvenBent pointed out you proved it's not random.
Yes, that first MP3 is a transformation of the 150 KB random file.
I'm not compressing an audio file, I'm using an audio representation to encode the random bytes.

150 KB random bytes ----------------> into a (2 MB) MP3.

The first recording in MP3 is sampled at 6 kbps, and its size is about 6000 bits × 3600 seconds (1 hour) = 21,600,000 bits (roughly 2.7 MB).
Then you re-sample at 450 bps using codec-2, which gives you 1,620,000 bits (200 KB).

Finally it becomes 85 KB, with Lrzip. Remember, I'm not compressing an audio MP3, I'm using the audio to encode the 150 KB random bytes.

I think either I'm terrible at explaining this sequence, or I'm in the wrong place
150KB random --> 2MB mp3 --> 200KB codec-2 -> 85KB lrzip.

29. If you are using the same "random" data, perhaps you found one of the at most 2^680,000 combinations out of the 2^1,200,000 possible combinations (85 KB vs. 150 KB, expressed in bits) that happen to be compressible to this level because they happen to have compressible patterns. On rare occasions, random data generators will produce files that are compressible by ordinary methods.

30. This is my last post. I might release a closed version if anyone cares, but this was done primarily to compress CommonCrawl data (2 petabytes of web crawl).

My software team used a rather unconventional approach as I've been describing, using 450 bps audio files to represent blocks of random data, achieving recurrent compression for every block with entropy analysis of up to 0.000010% redundancy (this is as random as any file gets). Results show that it doesn't matter how random the data is, an audible instruction can always instruct a human to use a series of procedural calculations for every byte until it matches the audible instruction.

I just posted here because people claiming random data is impossible to compress seem to have discarded human speech as a means of transmitting information without the restrictions of digital storage.

Keep in touch with the CommonCrawl blog, as well as the Rowetel codec-2 site.

31. There are grey areas which I think is what causes confusion.

Eg take random 7-bit data stored with one 7-bit value per byte. It's truly random, and obviously we can "compress" it to 7/8ths of the original size by packing the bits better. That's not really compression of random data as most people see it, not in the mathematical sense at least, but it's what causes these confusions I think.
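That 7/8ths repacking can be sketched directly (a minimal bit-packer; the data stays random, only the wasted top bit of each byte disappears):

```python
def pack7(values):
    # Pack values in 0..127 into a bitstream, 7 bits each, MSB first.
    acc = bits = 0
    out = bytearray()
    for v in values:
        acc = (acc << 7) | v
        bits += 7
        while bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
    if bits:  # flush leftover bits, padded with zeros
        out.append((acc << (8 - bits)) & 0xFF)
    return bytes(out)

import random
random.seed(0)
vals = [random.randrange(128) for _ in range(800)]
assert len(pack7(vals)) == 700  # exactly 7/8 of the naive 800 bytes
```

No entropy was removed here; the original *representation* simply wasted one bit per byte, which is the grey area the post describes.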

A trickier example is random data chosen from a gaussian distribution centred on 128 and spread such that all values from 0 to 255 occur. A simple frequency analysis will show that those around 128 are more common than those around 0 and 255. What we essentially have is a shaped random signal; randomness on top of non-randomness. Any program which losslessly compresses audio is doing a similar thing - teasing out the signal from the random part so that the signal can be compressed (while the random part cannot).
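The gaussian case can be quantified: the per-byte entropy of such a shaped source is below 8 bits, which is exactly the headroom an entropy coder exploits. A sketch (σ = 30 is an arbitrary choice for illustration):

```python
import math
import random

random.seed(0)
# Gaussian-shaped bytes centred on 128: each byte is unpredictable,
# but the distribution is non-uniform.
data = [min(255, max(0, round(random.gauss(128, 30)))) for _ in range(100_000)]

counts = [0] * 256
for b in data:
    counts[b] += 1
n = len(data)
entropy = -sum(c / n * math.log2(c / n) for c in counts if c)
print(entropy)  # about 6.9 bits/byte, not 8: the shaping is compressible
```

An arithmetic coder driven by this histogram would store the stream at roughly 6.9 bits per byte, compressing the "shape" while the residual randomness stays incompressible, as the post says.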

Put simply, if you think you're compressing random data, then you're just compressing a small residual signal somewhere in the random data while leaving the random part uncompressed. That's pretty much by definition.

