# Thread: Euler's Number Triangle and Random Data

1. ## Euler's Number Triangle and Random Data

This is a plot of a high row (around 100) of a different number triangle, the Trinomial Number Triangle. Although it has two symmetrical sides, it looks like highly random data.

But it is produced by a simple equation that always gives the same result, so it is not random.

If large amounts of seemingly random data could be compressed into the form of simple equations, we would have extremely effective lossless compression.

http://mathworld.wolfram.com/EulersNumberTriangle.html

I'm just offering an idea to be explored; I haven't seen anyone talk about it yet.

2. These kinds of problems are the subject of Kolmogorov complexity. A practical, already-discussed example is the special ZPAQ model that computes pi. Searching the forum should help.

3. BetaTester:
"Seemingly" random is not true random, so the fact that we can compress textual representation of number PI (whatever precision we want) into a small source code doesn't mean we can represent any data using short source code.

Compressing such generated "seemingly" random data would IMO only be beneficial when decompression is much faster than generating the data, and when that data is useful for some purpose.

4. True randomness is (IMHO) the complete collection of all bytes 0-255 in a page (by page, I mean 256 bytes) of a file, placed at different positions. After 256^256 pages you've "lost" all true randomness and are merely recreating a previous page, although you'd be using a serious amount of storage.

There is no way to "compress" this type of file. There is a way to "encode" it so it's smaller. Two ways (possibly three; I'm a lazy coder).

The reason I state 0-255 is that a byte is the smallest whole unit a computer can hold (even a binary value of 1010 would still take up a byte), regardless of the variables.

Going back to the point of "writing code to generate it": if you do the maths (excuse the pun) on 256^256, that's a lot of routines you'd need to code to cover ALL possibilities. It's not impossible, but it's a LOT of work - the only time I've used something like this was a generator that made a generator to make the code. The initial generator for such tables was so large that it wasn't worth writing by itself, which is why I had to hand-code something that created the generator. Not something I want to do again :S

If you want some nice true random files, check out random.org - that's where I get the binary & textual random files for my test-sets.

5. > True randomness is (IMHO) the complete collection of all bytes 0-255 in a page

As you see here - http://www.wolframalpha.com/input/?i...6.%2C256%21%5D ("result")
if it's known that your 256-byte "page" always contains all 256 byte values, it means that it can
be compressed to ~210.5 bytes (with a range coder, for example).
Of course it doesn't mean that you can compress the random data, but that the compressible data samples are not random.
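A quick sanity check on that ~210.5 figure (my own sketch, not from the post): a page containing every byte value exactly once is one of 256! permutations, so an ideal coder needs log2(256!) bits to index it.

```python
import math

# Number of distinct orderings of a 256-byte page that contains
# every byte value exactly once: 256 factorial.
perms = math.factorial(256)

# An ideal coder needs log2(256!) bits to pick out one permutation.
bits = math.log2(perms)
print(bits / 8)  # ≈ 210.5 bytes instead of 256
```

The ~45 bytes saved per page is exactly the redundancy introduced by the "all values present" constraint; a page without that constraint has no such slack.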

6. It all comes down to what the definition of "random" is. Over many years of testing, the "random" data I've found incompressible ALWAYS follows the rule of never exceeding a run of bits longer than 64. These samples were claimed to be "true random", not "pseudo-random" - yet if there's a repetitive pattern, how can it be "true random"? Personally I'd drop the idea of compressing as-is and remodel the bit-streams per page, lining up the longest runs in page 1 at the end and the longest runs in page 2 at the start, which would NOW make the data compressible. I know I've explained it badly, but I'm sure you'll find that this transformation makes any "random" data easier to work with for compression.

7. > It all falls down to what the definition of "random" is

The data that can't be compressed, i.e. decoder_size + compressed_size > uncompressed_size for all compression methods.

> not exceeding a run of bits longer than 64

The probability of a run of zeroes of length M appearing in a random bitstring of length N is about
(N-M+1)/2^M, so it's no wonder you never see one.
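Plugging illustrative numbers into that bound (my own sketch; the sizes are made up):

```python
# Upper bound on the probability that a random bitstring of length N
# contains a run of M zeroes: N-M+1 possible starting positions, each
# matching with probability 2^-M (union bound).
def run_probability_bound(n_bits, m):
    return (n_bits - m + 1) / 2.0 ** m

# One gigabyte of random data, looking for a 64-bit zero run:
p = run_probability_bound(8 * 2**30, 64)
print(p)  # on the order of 5e-10: you would essentially never see one
```

So the absence of 64-bit runs in "true random" test files is exactly what randomness predicts, not evidence of a pattern.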

Also, the information that a zero run of length > M can't appear only lets you save one bit
per zero run of length M that does appear, which is nearly nothing for M=64.
(We can delete the bit after a zero run of length M, because it can't be 0, so it's 1.)
And even if you accidentally saved 1-2 bits that way, the decoder able to undo it would still be much larger.

8. Yup - you didn't understand me - my bad.

I really should patent this or something as someone will rip it off, but meh can't afford such luxuries.

1. Grab a file with as much entropy in it as possible.
2. Take the first 256 bytes (a page).
3. Split the bits to make 8 bit-streams.
4. Find the longest run of 0's (or 1's) in the stream.
5. Rotate so the longest run is at the end of the bytes (rol or ror - doesn't matter).
6. Repeat 4 & 5 for all 8 streams.
7. Put them together and store the 8 bytes you used to rol (or ror) in a separate file for later.
8. Grab the next 256 bytes (page).
9. As above, only you put the longest run at the START of the block.
10. Check the results and say "Wow - I can't believe it" (or some such thing in a ludicrous voice).

I've made a (very primitive) routine to test the theory out and the results were most interesting indeed. There are a few tweaks here & there you could apply - but the transformation seems sound. So no, it doesn't compress entropy - but you can apply my transformation as a pre-processing step.
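The steps above can be sketched roughly as follows (a minimal guess at the method from the description - the names and details are mine, not EwenG's code): split a page into 8 bit planes, rotate each plane so its longest zero run sits at the end, and keep the 8 rotation amounts so the step is reversible.

```python
def bit_planes(page):
    # Plane i collects bit i of every byte in the page (step 3).
    return [[(b >> i) & 1 for b in page] for i in range(8)]

def planes_to_bytes(planes):
    n = len(planes[0])
    return bytes(sum(planes[i][j] << i for i in range(8)) for j in range(n))

def longest_zero_run_end(bits):
    # Index just past the end of the longest run of zeroes (step 4).
    best_len = best_end = run = 0
    for i, b in enumerate(bits):
        run = run + 1 if b == 0 else 0
        if run > best_len:
            best_len, best_end = run, i + 1
    return best_end

def transform_page(page):
    # Steps 3-7: rotate each plane left so its longest zero run ends at
    # the last bit; the 8 rotation amounts are the side channel to store.
    rotations, out_planes = [], []
    for plane in bit_planes(page):
        r = longest_zero_run_end(plane) % len(plane)
        out_planes.append(plane[r:] + plane[:r])  # rotate left by r
        rotations.append(r)
    return planes_to_bytes(out_planes), rotations

def untransform_page(page, rotations):
    # Inverse: rotate each plane right by the stored amount.
    n = len(page)
    restored = [p[n - r:] + p[:n - r] if r else p
                for p, r in zip(bit_planes(page), rotations)]
    return planes_to_bytes(restored)
```

Note the side channel: 8 rotation bytes per 256-byte page is 3.125% overhead, which is the cost the later replies focus on.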

9. You said that you didn't see runs longer than 64 before.
So with that algo (and another step which is not described) you can cut off those 64 bits from each block - at best.

10. What it actually does is INCREASE the size of the runs in the file overall. So after such a transformation, you'll find the "incompressible" file is now more compressible. If you think about it, you're changing the entire structure of the file and aligning (sorting) the largest runs of bits, which creates longer runs of bytes & sequences, so compression would help on a previously incompressible file. By making the first page "back-loaded" (the runs at the end) and the second page "front-loaded" (runs at the start), what would be a run of up to 4 bytes is now up to 8 bytes - and this doesn't even count the other bytes created, which WILL create more repetitive sequences that'll compress nicely.

Obviously this kind of transformation REQUIRES files with a high rate of entropy. But it'll help if you want to further reduce the size of an already compressed file or a "random data" file.

I get your point about the +8 bytes per page; however, if you know of another way to compress an incompressible file then I'll be happy to hear it. It may not give the best results - but you can always try it recursively.

11. m^2 - thanks for the link. I'll retire for now as it'd be pointless for me to continue. However, I've tried to explain how & why my routine works - I've omitted why it's the optimal variant and how much research & time it took. I'll finish by saying it's akin to a BWT on a block of 256, using bit-streams instead of bytes. I'm far from good at explaining my thoughts (yay autism!) but I've seen the results and shared them - there is nothing more to do other than wait for someone else to "find it out on their own".

12. Originally Posted by EwenG
m^2 - thanks for the link. I'll retire for now as it'd be pointless for me to continue. [...]
Hi. This is my first post. I was reading some of the older posts (to catch up) and thought your algorithm sounded interesting. I've tried a great number of approaches to compressing random data, and basically none of them have worked. I keep trying, but the scattering of values is very hard to overcome.

My theory is that the ability to compress random data can only occur randomly, i.e. if you try some things, you might get lucky on one set of random data and not on another. For example, I tried a somewhat brute-force effort to match randomly generated patterns, and every once in a while I got a hit - but it's so sporadic and inconsistent that it really is a shot in the dark, pure luck. It would make sense to me that random data can only be compressed randomly, i.e. rarely. I don't think it's impossible, just extremely unlikely and not at all consistent... such is the nature of the beast.

So anyway, I liked your idea, so I programmed it and got it to run and reverse correctly. However, I can confirm that it does not compress random data. Unfortunately, as is very often the case, it just produces too much new highly random code data (those 8 bytes), which can be compressed some but not enough to be less than the gains from the rest of the algorithm. The algorithm itself mainly does work - the rotated data does compress a lot better - but this doesn't cover the coding overhead, which is always greater than what you save. I've tried it with numerous block sizes and iterations etc. (yes, you can recurse with your algorithm, but it doesn't help overall once you count the code data). But it's pretty cool - thanks for the exercise.

I think the only way gains can be made with random data MIGHT be if you can do a transform on it with similar gains to this algorithm but without needing any added code data, or very little (ie much less than 1 byte per 256 bits).

13. Random data is not compressible. Furthermore, there is no general algorithm for compressing pseudo-random data even when you know it was generated by a simple program. For example, an encrypted string of zero bytes looks random, but the only way you could compress it is to guess the key. http://mattmahoney.net/dc/dce.html#Section_11
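To illustrate that example, here is a toy version (my own sketch; SHA-256 in counter mode stands in for real encryption): a megabyte of encrypted zero bytes does not shrink under a general-purpose compressor, while the plaintext zeros collapse.

```python
import hashlib
import zlib

def keystream_encrypt(key, plaintext):
    # Toy stream cipher: SHA-256 in counter mode XORed with the plaintext.
    # Stands in for real encryption; without the key the output looks random.
    out = bytearray()
    counter = 0
    while len(out) < len(plaintext):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, out))

zeros = bytes(1 << 20)                       # a megabyte of zero bytes
ciphertext = keystream_encrypt(b"secret key", zeros)

print(len(zlib.compress(zeros, 9)))          # collapses to around a kilobyte
print(len(zlib.compress(ciphertext, 9)))     # slightly larger than the input
```

The ciphertext carries essentially no redundancy a general-purpose model can find; only guessing the key would recover the trivial structure underneath.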

14. Well, technically for _pseudo_-random data there _is_ an algorithm.
Just look for statistical quirks in all possible contexts and proceed with arithmetic encoding.

Also by "random data" in this topic they probably refer to compressed data,
and compressed data is usually not completely random - it's quite possible
to write a specialized compressor for compressed data, one with a very precise
rangecoder and statistics, and specific kinds of contexts.

And even cases where recursive (multi-pass actually) compression of compressed data
is applicable are possible (like with 1G of zeroes in .rar).

The main point though is that compression exploits redundancy,
and for modern archive formats, squeezing out the last 0.001%
of entropy coding redundancy is a slow and hard task.
For older formats there may be 1-3% even, but there's simply
no demand for such space saving.

And of course there's no way to do it fast, nor compress _everything_ further -
no more redundancy means no more compression.

15. Yes, it's very hard, next to impossible, takes a tonne of time, etc. - not really worth it.

One area I find interesting, though, is filters/preprocessing, because there you bypass the issue of the data not being compressible and turn it (reversibly) into data that is. But doing that on compressed/random data has to be very difficult indeed.

Back to working on my image compressor... currently doing better than PNG and close to/sometimes better than jpeg2000. Nothing earth-shaking but I'm quite happy with it.

16. > Well, technically for _pseudo_-random data there _is_ an algorithm.
> Just look for statistical quirks in all possible contexts and proceed with arithmetic encoding.

Yes, you're right. For a string x of n bits, search all 2^n - 1 programs shorter than n bits for one that outputs x. The problem is that some programs may not halt, and there is no general way to know which ones. But if you know that x is compressible, then running all of these programs in parallel must eventually find the solution. It's not a practical solution because it is possible for even small programs to run for a ridiculously long time, like 3^^^^3 steps (using Knuth's up-arrow notation).

Testing whether a string is compressible or not is another matter. If such a test existed, then I could describe "the first string that cannot be described in a million bits", which is a contradiction because I just did.

17. > search all 2^n - 1 programs

No, I don't think that Kolmogorov compression is practically applicable.
Well, at least I don't know a language nice enough to make it work.
It would have to be something with the relevant operations (arithmetic and branches),
and with a mutation-friendly syntax that makes transitions between
algorithms smooth enough.

> Testing whether a string is compressible or not is another matter.

We don't really need a theoretically perfect method at this point.
Simple cryptographic tests would be enough to compress paq8 code
a little.

And as a workaround for possible expansion, a long enough random signature
can be used. So eg. if we can save 136+ bits by CM compression of a paq
archive, then we add a 128-bit unique signature at the start and thus
save 8+ bits. Otherwise we don't add the signature and the size remains the same.

18. I've been toying with a random-data compressor for a while and at the moment (but don't hold your breath) it looks like it may work... BUT, that said, I have not finished writing the decompressor yet. The proof is always in the pudding, and it could be a matter of a tiny little bug that eventually renders it useless (it wouldn't be the first time, heh).

Depending on the data, it compresses `any data` by up to ~2.5% - not huge, but it is recursive! Okay, so I hear you saying "here we go again"... lol. Sometimes it expands certain sets of data by up to ~1%. However, I have an easy workaround which reversibly alters the data until it compresses. What this means is that literally any set of data can be compressed, no matter how much entropy it started with or what byte sequences it contains. I deliberately make a real mess of it and then start from there.

The catch... it's rather slow. At present it may take around 10-15 seconds per megabyte for *each pass*. Ideally it needs to run at least 50-100 times to really have a big impact, and each of those passes gets a bit faster. It could take 5-10 minutes per megabyte but would end up with a file in the 10 KB range. There are of course lots of optimizations I can do, including multithreading. Anyway... I guess this is presumptive - I have to get the decompressor working to prove that it's reversible, but right now, looking at the code, I can't see anything wrong with it.

Possibly to my detriment I'm using LZMA as a back end, so each pass has to be fully LZMA'd, which may give you some idea of the speed. A compressor with a better ratio would probably improve greatly on that ~2.5%.

The algorithm is my original idea for how to partially sort the data, but it does produce some codes to reverse it which fortunately compress to slightly less than the amount of space saved by compressing the semi-sorted bytes.

... It's time to get out my fine toothed-comb and make sure it's doing what it's supposed to be doing.

19. I would suggest comp.compression. That seems to be where all the random data compressionists hang out. All of their programs seem to have bugs too.

20. Hi. I see you're in the US. If your code works it could be worth billions. However, since the US is now first-to-file, it might be better to patent it before someone else does. You can always write the code after you patent it. I would hate to see you make a miracle breakthrough and end up penniless.

21. Random data is like the Transformers: there is more than meets the eye.
Check out this hexadecimal string, which looks pretty random:
1184 27b3 b4a0 5bc8 a8a4 de84 5986 7ffa d373 cc71

Now let's change its radix to base 10, and we see:
99999999999999999999999999999999999977777777777

Magic!

22. lol. I found a nasty bug... which unfortunately significantly impacts the compression, but I'm still getting ~0.46%. The Million Random Digits .bin file is getting about 1.36% compression on the first pass. I'll keep working at it until I can prove whether or not it works. ... but what use would such an algorithm be if it is very slow? It's not exactly practical to wait 10 minutes for an image to save, for example.

I tried 50 MB of random data; it compressed by 0.5%. The bigger the data, the better the compression. I tested a small file: the random data currently stops being compressible at around 50 KB. The compression ratio would be ~1:1000 for a 50 MB file. Hmm... would 50 GB be ~1:1,000,000?

23. How does it do on sharnd_challenge.dat file?
http://mattmahoney.net/dc/#sharnd

24. It compressed sharnd_challenge.dat by 0.273%, saving 273 bytes in the first pass. I peeked at the challenge and I see it asks to go below the space needed for the source code... not there yet, since I haven't applied the recursion. And I'm still working on the decompressor to make sure it works, so let's not jump the gun.

... upon decompressing, I finally found an elusive bug which completely undoes all of the compression benefits. Heh... I guess you could see that coming. It does not work at all. I tried a much heavier-duty compressor but it's still way off. My apologies for littering the forum.

25. > However since the US is now first to file. It might be better to patent it before someone else does.

> but what use would such an algorithm be if it is very slow?

There is the \$1 million Clay Millennium prize for proving P = NP. http://www.claymath.org/millennium/

The proof goes like this. Suppose you have a program that can compress any string longer than m bits by at least 1 bit (e.g. m = 400Kb). Then you can solve any problem in O(n) time, including NP-complete problems, as follows:

1. Compress any input of size n to m bits recursively in O(n) steps. (Each step is O(1) if you compress an m+1 bit prefix, then append bits.)
2. Look up the compressed output in a 2^m by m table.
3. Decompress the output recursively in O(n) steps.

Although this algorithm is impractical for large m, it nevertheless constitutes a proof and is therefore eligible for the prize. Also, since proof search is one of the problems this algorithm can solve, you can apply it to the 5 remaining Clay problems (the Poincare conjecture is taken), to claim a total of \$6 million.

26. Matt, how can you be so unserious?! Once he proves that n == n-1, the Universe will disappear.

27. I have some data which is quite randomly arranged, but the `how many times a value occurs` counts are skewed. The skew is almost exactly linear: for example, there may be only 50 zeros in the data but 30,000 255s. If you imagine a straight line between those two extremes, that's what a plot of the counts (how many of each value) looks like. There is a small amount of random scattering in those numbers, but it's almost exactly a linear increase as the values go up.

What I find from this data is that almost exactly 75% of the file is represented by byte values 0..127. This suggests to me that maybe I could do some kind of Huffman thing, but I think I might be fighting the math. Does it even seem possible, considering the linear gradient of the distribution? Do Huffman and similar approaches only work when there is even more skew, like along a curve or with a hotspot of commonly occurring values? Or is there some other technique which could save some bytes on this kind of data? I need to save at least 12.5%.

28. A simple order-0 model and arithmetic encoder should compress this data. It will assign a probability p to each byte based on counts seen so far, and encode using -log2(p) bits. As long as the distribution is not uniform then it should compress. You will also get better compression if each byte depends on earlier bytes. Have you tried other compressors on your data (zip, 7zip, zpaq, etc)?

If you want to write your own compressors, you really should read http://mattmahoney.net/dc/dce.html
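For the linear "ramp" distribution described above, the order-0 entropy can be estimated directly (a back-of-envelope sketch of my own, assuming the count of byte value v is proportional to v + 1):

```python
import math

# Hypothetical linear "ramp": count of byte value v proportional to
# v + 1 (few zeros, many 255s), as described in the post above.
counts = [v + 1 for v in range(256)]
total = sum(counts)

# Order-0 entropy: the best any memoryless byte model can achieve.
entropy = -sum(c / total * math.log2(c / total) for c in counts)
print(entropy)            # ≈ 7.72 bits per byte
print(1 - entropy / 8)    # ≈ 3.4% saving
```

Under this assumption an order-0 coder tops out near 3-4%, which would explain why LZMA and paq8 only shave a few percent off such data: a linear skew just isn't very far from uniform.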

29. Hi Matt. Yes, at present I LZMA the data and save about 3%. I need to get to at least around 11%, because I already generated 1 bit per byte in codes as part of the transform that produced this data. I tried paq8 on the data and only saved about 4.5%. There is not much repetition in the sequence - the order of bytes is highly random - but as I said, there are a lot of bytes with the same value as you approach the high end of the spectrum. It seems to me it should compress better than it does if the compressor were based more on the probability of occurrence, as you say. I don't understand what `order-0 model` means; I'll have to research it. I thought maybe Huffman would be good for modeling the distribution, but it seems to me the prefix code lengths would inflate too quickly to be worthwhile? I tried SR3 on it, which was slightly better than LZMA, but not by much.

Thanks for your online book, btw - it's very informative and interesting. I am writing my own compressor, yes. Or at least trying to.
