
Thread: Euler's Number Triangle and Random Data

  1. #1
    Member BetaTester
    Join Date
    Dec 2010
    Location
    Brazil
    Posts
    43
    Thanks
    0
    Thanked 3 Times in 3 Posts

    Euler's Number Triangle and Random Data

    This is a plot of a high row (100 or so) of a different number triangle, the trinomial number triangle. Although it has two symmetrical sides, it looks like highly random data.

    [Attached image: 9trinomialpl.gif - binary plot of a high row of the trinomial number triangle]
    But it is produced by a simple equation that always gives the same result, so it is not random.

    [Attached image: EulersNumberTriangleBinaryPlot.jpg - binary plot of Euler's number triangle]

    If large amounts of seemingly random data can be compressed into the form of simple equations, then we would have extremely high lossless compression.

    http://mathworld.wolfram.com/EulersNumberTriangle.html

    I'm just offering an idea to be explored; I haven't seen anyone talk about it yet.

  2. #2
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    These kinds of problems are the subject of Kolmogorov complexity. A practical, already-discussed example is the special ZPAQ model which computes pi. Searching the forum could help.
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    BetaTester:
    "Seemingly" random is not truly random, so the fact that we can compress a textual representation of pi (to whatever precision we want) into a small source code doesn't mean we can represent arbitrary data with a short source code.

    Compression of such generated "seemingly" random data could be beneficial, IMO, only when decompression is much faster than generating the data, and the data is useful for some purpose.

  4. #4
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    True randomness is (IMHO) the complete collection of all bytes 0-255 in a page (by page, I mean 256 bytes) of a file, placed at different positions; thus after 256^256 pages you've "lost" all true randomness and are merely recreating a previous page, although you'll be using a serious amount of storage.

    There is no way to "compress" this type of file. There is a way to "encode" it so it's smaller - two ways, possibly three (I'm a lazy coder).

    The reason I use 0-255 is that a byte is the smallest whole unit a computer can address (even a binary value of 1010 would still take up a byte), regardless of the variables.

    Going back to the point of "writing code to generate it": if you do the maths (excuse the pun) of 256^256, that's a lot of routines you'll need to code in order to cover ALL possibilities. It's not impossible, but a LOT of work - and the only occasion I've used something like this is a generator that made a generator to make the code. The initial generator to make such tables was so large that it's not worth doing that alone, which is why I needed to code (by hand) something that created the generator. Not something I want to do again :S

    If you want some nice true random files, check out random.org - that's where I get the binary & textual random files for my test-sets.

  5. #5
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,322
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    > True randomness is (IMHO) the complete collection of all bytes 0-255 in a page

    As you see here - http://www.wolframalpha.com/input/?i...6.%2C256%21%5D ("result")
    if it's known that your 256-byte "page" always contains all 256 byte values, then it can be compressed to ~210.5 bytes (with a rangecoder, for example).
    Of course that doesn't mean you can compress random data - it means that data samples which are compressible this way are not random.
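    For reference, the ~210.5-byte figure is just log2(256!) expressed in bytes: a page that holds every byte value exactly once is a permutation of 256 symbols. A minimal sketch (an editorial Python example, not from the post) that reproduces the number:

    Code:
    import math

    # A 256-byte page containing all 256 byte values exactly once is a
    # permutation of 256 symbols, so it carries log2(256!) bits of information.
    bits = sum(math.log2(i) for i in range(1, 257))
    print(f"log2(256!) = {bits:.1f} bits = {bits / 8:.1f} bytes")  # ~1684.0 bits, ~210.5 bytes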

  6. #6
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    It all falls down to what the definition of "random" is. Over many years of testing, the "random" data I've found incompressible ALWAYS follows the rule of not exceeding a run of bits longer than 64. These samples were claimed to be "true random", not "pseudo random" - yet if there's a repetitive pattern, then how can it be "true random"? Personally I'd drop the idea of compressing and remodel the bit streams per page, lining up the longest runs in page 1 to the end and the longest runs in page 2 to the start, which would NOW make the data compressible. I know I've explained it badly, but I'm sure you'll find that using my transformation makes any "random" data easier to work with for compression.

  7. #7
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,322
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    > It all falls down to what the definition of "random" is

    Data that can't be compressed, i.e. where decoder_size + compressed_size > uncompressed_size for every compression method.

    > not exceeding a run of bits longer than 64

    The probability of a run of M zeroes appearing in a random bitstring of length N is at most (N-M+1)/2^M (the expected number of such runs), so it's no wonder if you never see one.

    Also, the information that a zero run of length >M can't appear only lets you save one bit per zero run of exactly length M that did appear, which is nearly nothing for M=64.
    And even if you accidentally saved 1-2 bits (the bit after a length-M zero run can be deleted, because it can't be 0, so it must be 1), the decoder able to undo that would still be much larger.
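    To put numbers on this: (N-M+1)/2^M is exactly the expected number of length-M all-zero windows in N random bits, which in turn bounds the probability of seeing such a run at all. A small simulation sketch (editorial example; N, M and the trial count are arbitrary demo values):

    Code:
    import random

    def zero_windows(bits, M):
        # count starting positions of M consecutive zero bits
        count = run = 0
        for b in bits:
            run = run + 1 if b == 0 else 0
            if run >= M:
                count += 1
        return count

    N, M, trials = 4096, 12, 2000
    avg = sum(zero_windows([random.getrandbits(1) for _ in range(N)], M)
              for _ in range(trials)) / trials
    print("simulated:", avg, " formula:", (N - M + 1) / 2 ** M)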

  8. #8
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Yup - you didn't understand me - my bad.

    I really should patent this or something, as someone will rip it off, but meh, I can't afford such luxuries.

    1. Grab a file with as much entropy in it as possible.
    2. Take the first 256 bytes (a page).
    3. Split the bits to make 8 bit streams.
    4. Find the longest run of 0s (or 1s) in the stream.
    5. Rotate so the longest run is at the end of the bytes (rol or ror - doesn't matter).
    6. Repeat 4 & 5 for all 8 streams.
    7. Put them back together and store the 8 bytes you used to rol (or ror) in a separate file for later.
    8. Grab the next 256 bytes (page).
    9. As above, only you put the longest run at the START of the block.
    10. Check the results and say "Wow - I can't believe it" (or some such thing in a ludicrous voice).

    I've made a (very primitive) routine to test the theory and the results were most interesting indeed. There are a few tweaks here and there you could apply, but the transformation seems sound. So no, you can't compress entropy - but you can apply my transformation to pre-process it (again).
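    For concreteness, here is one way the steps above could be coded; this is an editorial sketch of how the description reads (in Python, with made-up helper names), not EwenG's actual routine. Note the 8 rotation bytes of side data per 256-byte page, which come up later in the thread.

    Code:
    def bit_plane(page, i):
        # bit i of every byte in the page, as a list of 0/1 values
        return [(b >> i) & 1 for b in page]

    def longest_zero_run(bits):
        best_len, best_start, run = 0, 0, 0
        for pos, b in enumerate(bits):
            run = run + 1 if b == 0 else 0
            if run > best_len:
                best_len, best_start = run, pos - run + 1
        return best_start, best_len

    def transform_page(page, front_load):
        # page: up to 256 bytes; front_load=False puts the longest zero run of
        # each bit plane at the end, front_load=True puts it at the start
        planes, rotations = [], []
        for i in range(8):
            bits = bit_plane(page, i)
            start, length = longest_zero_run(bits)
            rot = start if front_load else (start + length) % len(bits)
            planes.append(bits[rot:] + bits[:rot])   # rotate left by rot
            rotations.append(rot)                    # 1 byte of side data per plane
        out = bytes(sum(planes[i][j] * 2 ** i for i in range(8)) for j in range(len(page)))
        return out, bytes(rotations)                 # transformed page + 8 bytes needed to undo it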

  9. #9
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,322
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    You said that you didn't see runs longer than 64 before.
    So with that algo (and another step which is not described) you can cut off these 64 bits from all blocks - at best.
    But you're adding 64 bits of indexes instead.

  10. #10
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    What it actually does is INCREASE the size of the runs in the file overall. So after such a transformation, you'll find the "uncompressible" file is now more compressible. If you think about it, you're changing the entire structure of the file and aligning (sorting) the largest runs of bits, which creates larger runs of bytes and sequences, so compression will help on a previously uncompressible file. By making the first page "back-loaded" (runs at the end) and the second page "front-loaded" (runs at the start), what would be a run of up to 4 bytes is now up to 8 bytes - and this does not take into consideration the other bytes created, which WILL create more repetitive sequences that compress nicely.

    Obviously this kind of transformation REQUIRES files with a high rate of entropy. But it may help if you want to further reduce the size of an already compressed file or a "random data" file.

    I get your point about the +8 bytes per page; however, if you know of another way to compress an uncompressible file then I'll be happy to hear it. It may not be the best in results - but you can always try it recursively.

  11. #11
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts

  12. #12
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    m^2 - thanks for the link. I'll retire for now, as it'd be pointless for me to continue. However, I've tried to explain how and why my routine works - I've omitted why it's the optimal variant and how much research and time it took. I'll finish by saying it's akin to a BWT on a block of 256, using bit streams instead of bytes. I'm far from good at explaining my thoughts (yay autism!), however I've seen the results and shared them - there is nothing more to do other than wait for someone else to "find it out on their own".

  13. #13
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by EwenG
    m^2 - thanks for the link. I'll retire for now, as it'd be pointless for me to continue. However, I've tried to explain how and why my routine works - I've omitted why it's the optimal variant and how much research and time it took. I'll finish by saying it's akin to a BWT on a block of 256, using bit streams instead of bytes. I'm far from good at explaining my thoughts (yay autism!), however I've seen the results and shared them - there is nothing more to do other than wait for someone else to "find it out on their own".
    Hi, this is my first post. So I was reading some of the older posts to catch up, and thought your algorithm sounded interesting. I've tried a great number of approaches to compressing random data, and basically none of them have worked. I keep trying, but the scattering of values is very hard to overcome.

    My theory is that the ability to compress random data can only occur randomly, i.e. if you try some things, you might get lucky on one set of random data and not on another. I tried a somewhat brute-force effort to match randomly generated patterns, for example, and every once in a while I got a hit - but it's so sporadic and inconsistent that it really is a shot in the dark, pure luck. It would make sense to me that random data can only be compressed randomly, i.e. rarely. I don't think it's impossible, just extremely unlikely and not at all consistent... such is the nature of the beast.

    So anyway, I liked your idea, so I programmed it and got it to run and reverse correctly. However, I can confirm that it does not compress random data. Unfortunately, as is very often the case, it just produces too much new, highly random code data (those 8 bytes), which can be compressed some, but not enough to be less than the gains from the rest of the algorithm. The algorithm itself mainly does work - the rotated data does compress a lot better - but this doesn't account for the coding overhead, which is always greater than what you save. I've tried it with numerous block sizes and iterations etc. (yes, you can recurse with your algorithm, but it doesn't help overall once you count the code data). But it's pretty cool, thanks for the exercise.

    I think the only way gains can be made with random data MIGHT be if you can do a transform on it with similar gains to this algorithm but without needing any added code data, or very little (ie much less than 1 byte per 256 bits).
    Last edited by ImaginaryHuman; 28th January 2013 at 20:49.

  14. #14
    Expert Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Random data is not compressible. Furthermore, there is no general algorithm for compressing pseudo-random data even when you know it was generated by a simple program. For example, an encrypted string of zero bytes looks random, but the only way you could compress it is to guess the key. http://mattmahoney.net/dc/dce.html#Section_11
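    The encrypted-zeros example is easy to reproduce: encrypt a block of zero bytes with any keyed stream and hand both versions to a general-purpose compressor. A sketch (stdlib Python only; the keystream below is a stand-in built from SHA-256 in counter mode with an arbitrary key, not Matt's example and not a real cipher recommendation):

    Code:
    import hashlib, zlib

    def keystream(key, n):
        # stand-in stream cipher: SHA-256 of key || counter, concatenated
        out, counter = bytearray(), 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(out[:n])

    plain = bytes(1 << 18)                      # 256 KiB of zero bytes
    cipher = bytes(p ^ k for p, k in zip(plain, keystream(b"arbitrary key", len(plain))))

    print("zeros     ->", len(zlib.compress(plain, 9)), "bytes")   # a few hundred bytes
    print("encrypted ->", len(zlib.compress(cipher, 9)), "bytes")  # ~256 KiB, no gain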

  15. #15
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,322
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    Well, technically for _pseudo_-random data there _is_ an algorithm.
    Just look for statistical quirks in all possible contexts and proceed with arithmetic encoding.

    Also, by "random data" in this topic they probably mean compressed data, and compressed data is usually not completely random - it's quite possible to write a specialized compressor for compressed data, one with a very precise rangecoder and statistics, and specific kinds of contexts.

    And even cases where recursive (multi-pass actually) compression of compressed data
    is applicable are possible (like with 1G of zeroes in .rar).

    The main point though is that compression exploits redundancy,
    and for modern archive formats, squeezing out the last 0.001%
    of entropy coding redundancy is a slow and hard task.
    For older formats there may be 1-3% even, but there's simply
    no demand for such space saving.

    And of course there's no way to do it fast, nor compress _everything_ further -
    no more redundancy means no more compression.

  16. #16
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Yes, it's very hard, next to impossible, takes a tonne of time, etc. - not really worth it.

    One area I do find interesting is filters/preprocessing, because there you bypass the issue of the data not being compressible and turn it into data that is (reversibly). But doing that on compressed/random data has to be very difficult indeed.

    Back to working on my image compressor... currently doing better than PNG and close to (sometimes better than) JPEG 2000. Nothing earth-shaking, but I'm quite happy with it.

  17. #17
    Expert Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    > Well, technically for _pseudo_-random data there _is_ an algorithm.
    > Just look for statistical quirks in all possible contexts and proceed with arithmetic encoding.

    Yes, you're right. For a string x of n bits, search all 2^n - 1 programs shorter than n bits for one that outputs x. The problem is that some programs may not halt, and there is no general way to know which ones. But if you know that x is compressible, then running all of these programs in parallel must eventually find the solution. It's not a practical solution because it is possible for even small programs to run for a ridiculously long time, like 3^^^^3 steps (using Knuth's up-arrow notation).

    Testing whether a string is compressible or not is another matter. If such a test existed, then I could describe "the first string that cannot be described in a million bits", which is a contradiction because I just did.

  18. #18
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,322
    Thanks
    209
    Thanked 1,001 Times in 526 Posts
    > search all 2^n - 1 programs

    No, I don't think that Kolmogorov compression is practically applicable.
    Well, at least I don't know of a language nice enough to make it work.
    It would have to be something with the relevant operations (arithmetic and branches),
    and with a mutation-friendly syntax that could make transitions between
    algorithms smooth enough.

    > Testing whether a string is compressible or not is another matter.

    We don't really need a theoretically perfect method at this point.
    Simple cryptographic tests would be enough to compress paq8 code
    a little.

    And as a workaround for possible expansion, a long enough random signature can be used.
    So, e.g., if we can save 136+ bits by CM compression of a paq archive, we add a
    128-bit unique signature at the start and thus save 8+ bits; otherwise we don't
    add the signature and the size remains the same.
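    As a sketch of that signature trick as described (an editorial Python example, with zlib standing in for the stronger CM coder and an arbitrary 128-bit magic value):

    Code:
    import zlib

    MAGIC = bytes.fromhex("3f8a6c1e9b24d7705ae1c49f02b6d833")  # arbitrary fixed 16 bytes

    def pack(data):
        packed = zlib.compress(data, 9)
        if len(MAGIC) + len(packed) < len(data):
            return MAGIC + packed      # compression paid for the 16-byte marker
        return data                    # stored as-is, size unchanged

    def unpack(blob):
        if blob.startswith(MAGIC):
            return zlib.decompress(blob[len(MAGIC):])
        return blob

    # Caveat: raw input that happens to begin with MAGIC (probability ~2^-128 for
    # random data) would be mis-decoded; accepting that tiny risk is the point of
    # using a long random signature.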

  19. #19
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I've been toying with a random-data compressor for a while, and at the moment (don't hold your breath) it looks like it may work... BUT, that said, I have not finished writing the decompressor yet. The proof is always in the pudding, and it could be a matter of a tiny little bug which eventually renders it useless (it wouldn't be the first time, heh).

    Depending on the data, it compresses "any data" by up to ~2.5% - not huge, but it is recursive! Okay, so I hear you saying "here we go again"... lol. Sometimes it expands certain sets of data by up to ~1%. However, I have an easy workaround which reversibly alters the data until it compresses. What this means is that literally any set of data can be compressed, no matter how much entropy or what byte sequence it started out as. I deliberately make a real mess of it and then start from there.

    The catch... it's rather slow. At present it may take around 10-15 seconds per megabyte for *each pass*. Ideally it needs to run at least 50-100 times to really have a big impact, and each of those passes gets a bit faster. It could take like 5-10 minutes per megabyte, but would end up with a file in the 10 KB range. There are of course lots of optimizations I can do, including multithreading. Anyway... I guess this is presumptive - I have to get the decompressor working to prove that it's reversible, but right now, looking at the code, I can't see anything wrong with it.

    Possibly to my detriment I'm using LZMA as a back end, so each pass has to be fully LZMA'd, which may give you some idea of the speed. And probably a compressor with better ratio would improve greatly on that 2%.

    The algorithm is my original idea for how to partially sort the data, but it does produce some codes to reverse it which fortunately compress to slightly less than the amount of space saved by compressing the semi-sorted bytes.

    ... It's time to get out my fine-toothed comb and make sure it's doing what it's supposed to be doing.
    Last edited by ImaginaryHuman; 30th January 2013 at 01:59.

  20. #20
    Expert Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I would suggest comp.compression. That seems to be where all the random data compressionists hang out. All of their programs seem to have bugs too.

  21. #21
    Member biject.bwts
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Hi, I see you're in the US. If your code works, it could be worth billions. However, since the US is now first-to-file, it might be better to patent it before someone else does. You can always write the code after you patent it. I would hate to see you make a miracle breakthrough and end up penniless.

  22. #22
    Member Jean-Marie Barone
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Random data is like the Transformers: there is more than meets the eye.
    Check out this hexadecimal string, which looks pretty random:
    1184 27b3 b4a0 5bc8 a8a4 de84 5986 7ffa d373 cc71

    Now let's change its radix to base 10, and we see:
    99999999999999999999999999999999999977777777777

    Magic!
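    The radix change is easy to reproduce; a one-line sketch (Python's arbitrary-precision integers handle the conversion), assuming the posted hex digits are exact:

    Code:
    hex_string = "118427b3b4a05bc8a8a4de8459867ffad373cc71"
    print(int(hex_string, 16))   # should print the long run of 9s and 7s shown above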

  23. #23
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    lol. I found a nasty bug... which unfortunately significantly impacts the compression, but I'm still getting ~0.46%. The Million Random Digits .bin file is getting about 1.36% compression on the first pass. I'll keep working at it until I can prove that it does/does not work... but what use would such an algorithm be if it is very slow? It's not exactly practical to wait 10 minutes for an image to save, for example.

    [edit] I tried 50 MB of random data; it compressed by 0.5%. The bigger the data is, the better the compression. I tested a small file, and the random data currently stops being compressible at around 50 KB. The compression ratio would be ~1:1000 for a 50 MB file. Hmm... would 50 GB be ~1:1,000,000?
    Last edited by ImaginaryHuman; 30th January 2013 at 08:03.

  24. #24
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    89
    Thanks
    8
    Thanked 4 Times in 4 Posts
    How does it do on the sharnd_challenge.dat file?
    http://mattmahoney.net/dc/#sharnd

  25. #25
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    It compressed sharnd_challenge.dat by 0.273%, saving 273 bytes in the first pass. I peeked at the challenge and I see it asks to go below the space needed for the source code... not there yet, since I haven't applied the recursion. And I'm still working on the decompressor to make sure it works, so let's not jump the gun.

    [edit] ... Upon decompressing, I finally found an elusive bug which completely undoes all of the compression benefits. heh... I guess you could see that coming. It does not work at all. I tried a much heavier-duty compressor but it's still way off. My apologies for littering the forum.
    Last edited by ImaginaryHuman; 31st January 2013 at 00:51.

  26. #26
    Expert Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    > However, since the US is now first-to-file, it might be better to patent it before someone else does.

    There are already patents on recursive compression. http://encode.su/threads/1130-Recurs...tent-for-sale?

    > but what use would such an algorithm be if it is very slow?

    There is the $1 million Clay Millennium prize for proving P = NP. http://www.claymath.org/millennium/

    The proof goes like this. Suppose you have a program that can compress any string longer than m bits by at least 1 bit (e.g. m = 400 Kb). Then you can solve any problem in O(n) time, including NP-complete problems, as follows:

    1. Compress any input of size n to m bits recursively in O(n) steps. (Each step is O(1) if you compress an m+1 bit prefix, then append bits.)
    2. Look up the compressed output in a 2^m by m table.
    3. Decompress the output recursively in O(n) steps.

    Although this algorithm is impractical for large m, it nevertheless constitutes a proof and is therefore eligible for the prize. Also, since proof search is one of the problems this algorithm can solve, you can apply it to the 5 remaining Clay problems (the Poincaré conjecture is taken), to claim a total of $6 million.

  27. #27
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    Matt, how can you be so unserious?! Once he proves that n == n-1, the Universe will disappear.

  28. #28
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I have some data which is quite randomly arranged, but how many times each value occurs is skewed. The skewing is almost exactly linear: for example, there may be only 50 zeros in the data but 30,000 255s. If you imagine a straight line between those two extremes, that's what a plot of the counts looks like, i.e. how many of each value. There is a small amount of random scattering in those numbers, but it's almost exactly a linear increase as the values go up.

    What I find from this data is that almost exactly 75% of the file is represented by byte values 0..127. This suggests to me that maybe I could do some kind of Huffman thing, but I think I might be fighting the math. Does it seem even possible considering the linear gradient of the distribution? Do Huffman and similar approaches only work when there is even more skewing, like along a curve or with a hotspot of commonly occurring values? Or is there some other technique which could save some bytes on this kind of data? I need to save at least 12.5%.

  29. #29
    Expert Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    A simple order-0 model and arithmetic encoder should compress this data. It will assign a probability p to each byte based on counts seen so far, and encode using -log2(p) bits. As long as the distribution is not uniform then it should compress. You will also get better compression if each byte depends on earlier bytes. Have you tried other compressors on your data (zip, 7zip, zpaq, etc)?

    If you want to write your own compressors, you really should read http://mattmahoney.net/dc/dce.html
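    As a rough calibration of what an order-0 coder can do here, the sketch below (an editorial Python example using an idealized linear ramp, count(v) proportional to v+1, rather than the actual file's counts) computes the order-0 entropy of such a distribution. It comes out around 7.7 bits per byte, i.e. only a few percent of savings, which is well short of the 12.5% target unless higher-order structure can be exploited.

    Code:
    import math

    # idealized linearly ramped byte distribution: count(v) proportional to v+1
    weights = [v + 1 for v in range(256)]
    total = sum(weights)
    entropy = -sum(w / total * math.log2(w / total) for w in weights)
    print(f"order-0 entropy: {entropy:.2f} bits/byte "
          f"(~{100 * (1 - entropy / 8):.1f}% below 8 bits/byte)")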

  30. #30
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Hi Matt. Yes, at present I LZMA the data and save about 3%. I need to get to at least around 11%, because I already generated 1 bit per byte in codes as part of the transform that produced this data. I tried paq8 on the data and only saved about 4.5%. There is not much repetition in the sequence of data - the order of bytes is highly random - but, as I say, there are a lot of bytes with the same value as you approach the high end of the spectrum. It seems to me it should compress better than it does if the compressor were based more on the probability of occurrence, as you say. I don't understand what "order-0 model" means; I'll have to research it. I thought maybe Huffman would be good for modelling the distribution, but it seems to me the prefix code lengths would inflate too quickly to be worthwhile? I tried SR3 on it, which was slightly better than LZMA but not by much.

    Thanks for your online book, btw - it's very informative and interesting. I am writing my own compressor, yes. Or at least trying to.
    Last edited by ImaginaryHuman; 1st February 2013 at 18:39.


