Page 1 of 2
Results 1 to 30 of 58

Thread: Compressing pseudo random files

  1. #1
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts

    Compressing pseudo random files

    Last year I created a compressor that was able to compress pseudo-random files >1.2MB by 3 bytes, but then I discovered some existing compressors did better.
    For true random (random noise) files the contest is still open at https://encode.su/threads/1176-losel...ll=1#post59199

    Input:
    10,485,761 bytes - pseudo random file

    Output:
    10,693,145 bytes - glza v?
    10,526,776 bytes - brotli v?
    10,519,072 bytes - mcm v0.83
    10,505,541 bytes - top4 0.0.0.2
    10,491,999 bytes - zpaq v7.15
    10,487,391 bytes - gzip v1.2.4
    10,486,543 bytes - 7z v19.00
    10,486,417 bytes - rz v1.03.7
    10,486,413 bytes - zcm v0.93
    10,486,017 bytes - zstd v1.3.8
    10,486,003 bytes - nz v0.09a
    10,485,929 bytes - rar v5.70
    10,485,807 bytes - bsc v3.1.0
    ---------------------------------
    10,400,380 bytes - paq8pxd v47_3 (gain 85,381 bytes)
    10,366,387 bytes - paq8px v178 (gain 119,374 bytes)
    10,330,502 bytes - cmix v17 (gain 155,259 bytes)

  2. #2
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    >>10,400,380 bytes - paq8pxd v47_3 (gain 85,381 bytes)
    >>10,366,387 bytes - paq8px v178 (gain 119,374 bytes)
    >>10,330,502 bytes - cmix v17 (gain 155,259 bytes)


    Does this still work on pseudorandom permutations applied twice with different seeds?

  3. #3
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    803
    Thanks
    244
    Thanked 255 Times in 159 Posts
    Could you tell something more about the PRNG used?
    I would say that these reductions show its imperfection - that it leaves some statistical dependencies, found and exploited by these compressors.
    If it is some high-end one like the Mersenne twister, this would be extremely interesting, e.g. from cryptography and physics (Monte Carlo) perspectives - especially if the models from these CMs were analyzed ...
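    Jarek's point about leftover statistical dependencies can be made concrete with the weakest common PRNG family: in an LCG with a power-of-two modulus, bit k of the state has period at most 2^(k+1), so the low bits are nearly deterministic, and a compressor fed those bits would exploit this immediately. A minimal sketch (the constants are the ones commonly cited for MSVC's rand(); treat them as an illustrative assumption, not a specific claim about the generators tested in this thread):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* One step of an LCG with modulus 2^32 (MSVC-style constants, assumed). */
    static uint32_t lcg_step(uint32_t s) { return s * 214013u + 2531011u; }

    int main(void) {
        /* Bit 0 of the state flips on every step (period 2), because
           multiplier and increment are both odd. This is why real
           implementations discard the low bits of the state. */
        uint32_t s = 12345u;
        for (int i = 0; i < 64; i++) {
            uint32_t next = lcg_step(s);
            assert(((s ^ next) & 1u) == 1u);  /* low bit always alternates */
            s = next;
        }
        return 0;
    }
    ```

    A byte stream built from such low bits would be trivially compressible; discarding them (as rand() does) hides only the grossest dependencies.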

  4. #4
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by LawCounsels View Post
    Does this still work on pseudorandom permutations applied twice with different seeds?
    I didn't test that, but last year I tried running my own compressor recursively on its output; it only gained once, on the original input.

  5. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Jarek View Post
    Could you tell something more about the PRNG used?
    Last year I tried different random generation code in different programming languages, but I do not remember the exact RND generation code I used or which output was compressible and which was not. I mainly tested with my own compressor; some VS .NET-generated pseudo-random files were compressible and some were not (depending on the RND generation code used). The pseudo-random file above was generated with Xojo, which has only one RND generation option in code. Output generated with code that used the built-in Intel CPU RND generator was not compressible with my own compressor.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    2019-01-21.bin has some short matches, for example C3 3E CC occurs 3 times.
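    A match like this can be found with a straightforward scan; the sketch below (hypothetical code, not Shelwien's actual tool) counts possibly-overlapping occurrences of one 3-byte string in a buffer:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Count (possibly overlapping) occurrences of a 3-byte string in a buffer. */
    static size_t count_trigram(const unsigned char *buf, size_t n,
                                const unsigned char tri[3]) {
        size_t count = 0;
        for (size_t i = 0; i + 3 <= n; i++)
            if (memcmp(buf + i, tri, 3) == 0)
                count++;
        return count;
    }

    int main(void) {
        /* Tiny synthetic buffer containing the trigram from the post 3 times. */
        const unsigned char data[] = { 0xC3,0x3E,0xCC, 0x00, 0xC3,0x3E,0xCC,
                                       0x3E,0xCC, 0xC3,0x3E,0xCC };
        const unsigned char tri[3] = { 0xC3, 0x3E, 0xCC };
        assert(count_trigram(data, sizeof data, tri) == 3);
        return 0;
    }
    ```

    Running this over all 2^24 possible trigrams (or with a hash table in one pass) yields the kind of frequency tables shown later in the thread.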

  7. #7
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    803
    Thanks
    244
    Thanked 255 Times in 159 Posts
    Sportman, the built-in PRNG is rather weak - it might leave some dependencies ... but if you could reduce something generated by a strong PRNG such as the Mersenne twister, or by an actual CSPRNG, that would be huge - it would mean that e.g. cmix is able to efficiently predict something about newly generated bits, which could allow for some new attacks etc.
    https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator

    ps. There are many test suites for PRNGs, and adding e.g. cmix to them seems very interesting - especially if one finds a PRNG that passes all standard tests but whose output sequence cmix is able to reduce ...
    Last edited by Jarek; 16th April 2019 at 00:04.

  8. #8
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    2019-01-21.bin has some short matches, for example C3 3E CC occurs 3 times.
    Unexploitable?

  9. #9
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    Yes, I venture to suggest I already have a compressor able to reduce Mersenne-twister-type output by 1 bit much more consistently (correct the big majority of times).

    Professionally verifiable by arrangement.

  10. #10
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Jarek View Post
    like Mersenne twister
    Quick test:

    Input:
    10,485,760 bytes - pseudo random file (generated with Ron Charlton's Mersenne twister pseudo-random number generator VB.NET code)

    Output:
    10,495,690 bytes - paq8pxd v47_3
    10,490,955 bytes - paq8px v178
    10,486,207 bytes - cmix v17
    Last edited by Sportman; 16th April 2019 at 19:27.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    > Unexploitable?

    Depends on the level of overtuning, I suppose.
    With some actual analysis it's like this:

    Code:
    freq - string - distances = codelen
    4 - 10 7D 9C - 20:609479,17:71312,18:217301,16:44165, = 80
    4 - 11 7D 91 - 17:68380,17:90052,17:107225,20:627618, = 84
    4 - 14 D7 99 - 18:136385,17:114266,18:199254,19:524147, = 84
    4 - 26 9F 74 - 16:59090,20:572772,17:103401,18:172438, = 80
    4 - 53 5C 40 - 14:8703,16:47371,20:717363,16:55570, = 72
    4 - 59 6D 1E - 18:191049,15:18331,16:57546,20:588107, = 76
    4 - 77 16 F8 - 18:195262,18:198286,17:71020,19:491486, = 84
    4 - 8F 79 6F - 19:304028,19:361473,16:34410,17:67830, = 80
    4 - AA 1A 2C - 17:96629,18:202432,19:291561,18:248825, = 84
    4 - C0 21 5B - 17:105253,18:135117,15:32653,19:401044, = 80
    4 - C5 AF 0E - 17:77771,19:345667,16:48030,19:350997, = 80
    4 - CB C5 8D - 18:164974,17:109653,18:139840,19:297836, = 84
    
    poslen - freq
    14 - 1
    15 - 2
    16 - 7
    17 - 12
    18 - 12
    19 - 9
    20 - 5
    4x 3-byte match = 96 bits.
    If these are deleted from the main file, but then added as a dictionary (string {24 bits} + distances {less than the remaining 72 bits}),
    there's some potential for compression depending on the encoding etc.
    Here I counted "=xx" bits as 16 bits + flag + another 4 bits if flag=1.

    For example, if match string pattern is hardcoded as "0101?0?1 01??110? 0?0????0" (for 53 5C 40 and 59 6D 1E),
    we'd end up saving 96*2-(10+72+10+76)=24 bits = 3 bytes.
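    The per-distance codelen rule (16 bits, plus a flag bit, plus 4 extra bits when the flag is set, i.e. when the position needs more than 16 bits) and the final savings arithmetic can be checked mechanically against the table rows used in the example:

    ```c
    #include <assert.h>

    /* Codelen rule from the post: 16 bits + flag, + 4 more bits if flag=1. */
    static int dist_bits(int poslen) { return poslen > 16 ? 16 + 1 + 4 : 16 + 1; }

    /* Total codelen for the four distances of one trigram. */
    static int codelen(const int poslen[4]) {
        int bits = 0;
        for (int i = 0; i < 4; i++) bits += dist_bits(poslen[i]);
        return bits;
    }

    int main(void) {
        int a[4] = { 20, 17, 18, 16 };  /* "10 7D 9C" row */
        int b[4] = { 14, 16, 20, 16 };  /* "53 5C 40" row */
        int c[4] = { 18, 15, 16, 20 };  /* "59 6D 1E" row */
        assert(codelen(a) == 80);
        assert(codelen(b) == 72);
        assert(codelen(c) == 76);
        /* Hardcoding the two cheapest patterns (~10 bits each, assumed):
           2 matches * 96 bits, minus pattern and distance costs. */
        assert(96*2 - (10 + codelen(b) + 10 + codelen(c)) == 24);  /* = 3 bytes */
        return 0;
    }
    ```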

  12. #12
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    This does not apply on average across all files overall (whether they contain this particular short-string match or not).

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    Of course it's impossible to compress all files (without tricks like abusing the filesystem for this).
    But what I described is a fairly general-purpose algorithm; it would also compress some other files (like texts),
    and would losslessly encode everything else (with lengths increased by 1).

    If you don't like the pattern, it should be possible to overtune a statistical model for distance coding, with similar results.

    In any case, so many 3-byte strings occurring 4 times in a 1MB file is not a good thing, statistically.
    It's probably not quite random.

  14. #14
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Shelwien View Post
    It's probably not quite random.
    Input:
    1,048,576 bytes - 2019-01-21.bin (random noise random file)

    Output:
    1,049,605 bytes - paq8pxd v47_3
    1,049,150 bytes - paq8px v178
    1,048,638 bytes - cmix v17
    Last edited by Sportman; 16th April 2019 at 19:09.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    Problem is, these kinds of things require specialized solutions.
    If we'd normally encode a file with a binary rangecoder using p=0.5, there'd already be ~10 bytes of redundancy (filesize + rc flush).
    For small statistical redundancy I already made cdm, but it doesn't help here -
    instead we see a few more short-string matches than is likely for a random file.
    I explained how it can be done - we can make a specialized compressor for trigrams that occur 4 times within the 1st MB of the file data.
    The actual trigrams would be deleted, and instead we'd add a header with a list of trigram strings + 4 positions per string, plus a terminator byte.

    It seems to be a common property of random.org files; for example, with 2019-04-16.bin I get:
    Code:
    13 F6 83 254986,472997,799955,1045732,
    2F 7A D8 82457,176832,426997,836185,
    A4 E7 52 414983,522274,549181,967106,
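    A quick arithmetic check on the header scheme (assumed numbers: ~1 MiB file, so a position fits in 20 bits) shows why the encoding of the positions matters: naive fixed-width positions already lose more bits than the deleted trigrams free up.

    ```c
    #include <assert.h>

    /* Cost model for one trigram occurring 4 times in a ~1 MiB file. */
    int main(void) {
        int freed = 4 * 24;        /* bits recovered by deleting the 4 matches */
        int naive = 24 + 4 * 20;   /* header: string + four fixed 20-bit positions */
        assert(naive - freed == 8);  /* naive coding loses 8 bits per trigram */
        /* So the four positions must fit in under 72 bits total, as with the
           flagged 16/21-bit distance coding shown earlier in the thread. */
        assert(freed - 24 == 72);
        return 0;
    }
    ```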
    Attached Files

  16. #16
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    >>If we'd normally encode a file with a binary rangecoder using p=0.5, there'd be already ~10 bytes of redundancy (filesize + rc flush).

    What input filesize best illustrates this? Please give exact figures for all components.

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    Not sure what you asked, but the distance coding described above can be pretty universal if designed to save 1 byte on 2019-01-21.bin rather than 3.
    As to the statistical approach, I'm currently tuning my CM for that file. Atm at 1,048,582 (+6 bytes); hard to say if it will work out.

    Update: got one byte for now
    Code:
    1,048,576 2019-01-21.bin
    1,048,575 2019-01-21.ari
    
      201,728 CM.exe
      196,114 CM.ari
    I've checked the 2019-03 files; some are not expanded, but none is compressed.
    Still, there are lots of these files on random.org, so I'm pretty sure that more of them can be compressed :)
    Attached Files

  18. Thanks:

    xinix (17th April 2019)

  19. #18
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Input:
    10,000,000 bytes - pseudo random file

    Output:
    10,199,299 bytes - glza v?
    10,039,103 bytes - brotli v?
    10,019,356 bytes - mcm v0.83
    10,014,317 bytes - top4 0.0.0.2
    10,001,559 bytes - gzip v1.2.4
    10,000,752 bytes - 7z v19.00
    10,000,620 bytes - zcm v0.93
    10,000,616 bytes - rz v1.03.7
    10,000,244 bytes - zstd v1.3.8
    10,000,069 bytes - rar v5.70
    ---------------------------------
    9,994,878 bytes - cm (gain 5,122 bytes)
    9,249,068 bytes - nz v0.09a (gain 750,932 bytes)
    7,894,591 bytes - zpaq v7.15 (gain 2,105,409 bytes)
    7,520,516 bytes - bsc v3.1.0 (gain 2,479,484 bytes)
    7,484,538 bytes - paq8px v178 (gain 2,515,462 bytes)
    4,807,272 bytes - paq8pxd v47_3 (gain 5,192,728 bytes)
    4,712,859 bytes - cmix v17 (gain 5,287,141 bytes)
    Last edited by Sportman; 17th April 2019 at 13:44.

  20. #19
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Shelwien View Post
    I'm pretty sure that more of them can be compressed
    Tested CM with 2018-08-xx.bin and 2018-09-xx.bin (61 random files, each 1,048,576 bytes):

    1,048,579 bytes - 2x (2018-08-25.bin, 2018-08-02.bin)
    1,048,578 bytes - 23x
    1,048,577 bytes - 30x
    1,048,576 bytes - 6x (2018-09-28.bin, 2018-09-23.bin, 2018-09-19.bin, 2018-09-09.bin, 2018-09-06.bin, 2018-08-21.bin)

  21. #20
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Shelwien View Post
    got one byte for now
    Congrats, verified between two computers!
    I think you are the first person who managed to compress a random.org file.

    CM also managed to compress the second pseudo random file, added to the results above.
    Last edited by Sportman; 17th April 2019 at 13:45.

  22. Thanks:

    xinix (17th April 2019)

  23. #21
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    If you were tasked with finding a random generator's possible valid seed, how would you plan to do this? Even if it's only a small, tiny step towards achieving it.

  24. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    If you know the PRNG function, the approach would be the same as this: https://encode.su/threads/3090-Moder...desired-output

    If you don't know the specific PRNG function, but do know that it's something common, it could make sense to collect several popular RNG implementations from
    compiler libraries and such, and build a composite function for SAT (with the specific RNG choice via an undefined variable).

    If it's known that a cryptographic PRNG was used, the fastest approach would be brute force (i.e. try all parameter combinations and see whether any would match your data).

  25. #23
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    >>If you know the PRNG function, the approach would be the same as this: https://encode.su/threads/3090-Moder...desired-output

    This simplifies things.... but even this could take more time than the universe has.

    What's the order of time required to guarantee results?

  26. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    > This simplifies things.... but even this could take more time than the universe has.

    Ok, here I tried it with 32-bit VisualC's RNG (the 64-bit one may be different).
    Took 22s. CBMC found a slightly different seed (0x8BADF00D instead of 0xBADF00D), which otherwise
    produces the same output sequence.
    Code:
    typedef unsigned char byte;
    typedef unsigned int uint;

    byte s[] = {  13, 240, 173, 139, };  // 0x8BADF00D, little-endian
    uint& seed = (uint&)s[0];
    printf( "seed=%08X\n", seed );
    srand( seed );
    for( int i=0; i<100; i++ ) printf( "%i,", rand()%10 );


    > What's the order of time required to guarantee results?

    Depends on the specific RNG design. Should be fast enough with the usual simple ones from a compiler RTL.

    Update: tested another seed (0x12345678). Solved, but took 46.3s, so I guess it depends on the specific data too.
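    The seed equivalence CBMC found can also be verified directly: in the commonly documented MSVC-style LCG (shown here as an assumption, not a verified copy of the compiler's RTL), carries in the mod-2^32 state update only propagate upward and the output reads bits 16..30, so bit 31 of the seed (0x8BADF00D vs 0xBADF00D) can never reach the output.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* MSVC-style rand(): LCG state update, output = bits 16..30 of the state. */
    static int msvc_rand(uint32_t *state) {
        *state = *state * 214013u + 2531011u;
        return (int)((*state >> 16) & 0x7FFF);
    }

    int main(void) {
        /* The two states always differ in bit 31 only, which is never output,
           so both seeds generate identical sequences. */
        uint32_t s1 = 0x0BADF00Du, s2 = 0x8BADF00Du;
        for (int i = 0; i < 100; i++)
            assert(msvc_rand(&s1) == msvc_rand(&s2));
        return 0;
    }
    ```

    This also suggests why plain brute force is feasible for such generators (only 2^31 effective seeds); the SAT route mainly pays off for more complex RNG designs.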
    Attached Files

  27. #25
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    How many distinct 100-output sequences can a 32-bit seed generate?

  28. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    Obviously no more than 2^32. Probably 2^31 in VC32 RNG case.

  29. #27
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    Can quantum computers find the 1024-bit seed?

  30. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    The problem is the RNG/hash complexity, not the number of bits in the variables.
    For example, CBMC solves 64-bit xxhash faster than the VC32 RNG (0.8s).
    So the size of the RNG seed won't be an issue, if the RNG design remains the same.

    As to quantum computers, we don't have any with 1024 entangled qubits atm, so nope.

  31. #29
    Member
    Join Date
    Apr 2012
    Location
    London
    Posts
    265
    Thanks
    13
    Thanked 0 Times in 0 Posts
    D-Wave previews quantum computing platform with over 5,000 qubits ...
    https://venturebeat.com › d-wav...
    27 Feb 2019 · D-Wave Systems today unveiled the roadmap for its 5,000-qubit quantum computer. Components of ...

    AND how exactly do entangled qubits arrive at the solution?

  32. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,976
    Thanks
    296
    Thanked 1,303 Times in 740 Posts
    While its predecessor, Chimera, offered six connected qubits, Pegasus more than doubles the number to 15.
    Having each qubit connected to 15 other qubits instead of six translates to 2.5x more connectivity,
    which in turn enables the embedding of larger and more complex problems with fewer physical qubits.
    The problem with cryptographic functions is that they're designed to have dependencies between each bit of input and each bit of output.
    So total number of qubits basically doesn't matter since quantum logic only applies in groups of "connected" ones.

    Btw, in any case, 5000 is also too little.
    For example for VC32RNG SAT system CBMC logged this: "15685 variables, 80408 clauses".
    "variables" here are bits, and "clauses" are bit operations (number of which is also limited on DWave machines).

    > AND how exactly entangled qubits arrive at the solution?

    It's basically the same bruteforce search (well, Monte Carlo),
    but implemented using elementary particles as memory and physical laws for operations.
    So it's both faster than current electronics and massively parallel (potentially).

    There's some similarity with this: https://en.wikipedia.org/wiki/DNA_co...les/Prototypes
    But that is based on chemistry, while quantum computing works with particles, which is "lower level".


