Results 1 to 18 of 18

Thread: Prime Number Benchmark

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts

    Prime Number Benchmark

    I did some compression tests on synthetic data: a list of prime numbers up to 1 million in 10 different formats. In theory all 10 files should compress to the same size because they all contain the same information. (Actually very small if the compressor recognizes the data as a list of primes, which none of them did). The 10 files are:

    msb - 32 bit integer format, most significant byte first.
    lsb - 32 bits, LSB first (x86 format).
    dbl - each number written twice (2,2,3,3,5,5,7,7,...,999983,999983) as 32 bit ints, MSB first.
    hex - each number written as 8 hex digits with no separators. "00000002000000003"...
    text - each number written in ASCII decimal followed by a newline "2\n3\n...999983\n".
    twice - two copies of msb concatenated together (2,3,5,...999983,2,3,5,...999983), 32 bit ints MSB first.
    interlv - all the MSB bytes of msb first, then all the second, third, last byte (0,0,0..., 0,0,0..., 0,0,0..., 2,3,5...).
    delta - successive differences as 32 bit ints, MSB first (2,1,2,2,4,2,4...).
    bytemap - 1M chars, either '1' if prime or '0' if composite.
    bitmap - same but as single bits packed into 125K bytes, LSB first.

    I grouped the results into LZ77, BWT, PPM, and CM type algorithms.
    Code:
        msb    lsb    dbl    hex   text  twice interlv delta bytemap bitmap
    
     313992 313992 627984 627984 538468 627984 313992 313992 1000000 125000 uncompressed
    
     314011 314011 160995 326410 299215 316485  90730 142644 232564  71928 lz5
     315511 315546 212296 274813 339411 315571  89784  89296 138739  71007 zpaq -m 1
     315402 315446 212283 273113 234554 315454  77504  62441  64726  52886 zpaq -m 2
     124751 152820 176242 202786 199829 144201  44959  35988  36004  36450 zpaq -m 3
     198314 105536 109362 198771 189936 396668  65146  57278  67579  46203 gzip
     198538 105760 109586 199139 190171 396894  65223  52480  55925  44541 zip -9
      57724  58701  57727  67378  77168  57848  53934  40044  52875  41368 7zip
      57724  58701  57727  67378  77354  57848  53212  40276  46369  41434 7zip -mx
      46428  46473  66777 139253 152351  50672  59957  47656  65942  44310 rar
      46194  46232  65248 143938 156028  50162  44978  46956  61330  44102 rar -m5 -ma5
      49328  43125  59371  67545  64953  49360  44604  42681  74288  43988 freearc
      45040  40676  45431  67545  62007  45498  42062  40370  40641  41694 freearc -m9
    
     190356 187421 271302 225982 214919 258811  50965  37831  38183  41196 bzip2
     183104 181153 183756 205063 199727 183167  44002  35328  34698  36193 bsc
     184914 184445 272700 221881 207723 272718  44994  35903  34983  35764 zpaq -m s4.3c
     123855 125097 175124 201847 198919 175014  44096  35225  34788  35674 zpaq -m s4.3ci1
      42902  43098  43503  62766  67771  41853  42186  36246  35783  37385 nz
    
     180336 191716 232018 144353 213793 180773  45507  36114  50945  36780 ppmd
      50613  49050  50757  66905  72933  50634  41696  34089  41987  35935 ppmonstr
    
     207941 207942 403928 241013 217784 416265 154946  70384  49824  38936 zpaq -m s4.0c256
     153306 154014 272911 229866 178614 305787  77212  70693  48121  39292 zpaq -m s4.0c0
      85141 110722 110278 150666 118636 171019  44696  64196  47450  38614 zpaq -m s4.0ci1
      72609 106695  79154 120408 100269 103116  42723  57541  46586  37520 zpaq -m s4.0ci1.1
      70770  91990  74563 100128  96293  91242  42841  40964  46291  36612 zpaq -m s4.0ci1.1.1
      70792  91861  74109  92700  90622  73422  42995  36769  45809  36019 zpaq -m s4.0ci1.1.1.1
      65157  84709  65644  89223  77471  68307  42291  36843  46252  35979 zpaq -m s4.0ci1.1.1.1m
      64888  83952  65536  89095  77486  65159  42420  37064  46893  36072 zpaq -m s4.0ci1.1.1.1am
      66131  84847  64891 202786 199829  66761  43239  35988  36004  36450 zpaq -m 4
      46419  46543  49942  56787  55973  47096  42641  35519  34450  30518 zpaq -m 5
      40741  40026  43083  46822  41930  40965  40249  34466  70477  32860 paq8pxd13 -8
      37835  37973  38597  56017  55520  37894  36255  34698  42764  35425 nz -cc
    The zpaq methods s4.3c and 4.3ci1 are BWT, 16 MB blocks, modeled with an order 0 ICM and with an order 0-1 ICM-ISSE chain. zpaq normally uses the second. s4.0c256 is a stationary order 0 model. s4.0c is an ICM (fast adapting order 0: context -> bit history -> prediction). s4.0.ci1.1.1.1 etc are order 1, 2, 3, 4 ICM-ISSE chains. m adds a mixer. am adds a match model and mixer.

    Interesting things to note.

    lz5, zpaq -method 1 and 2 do not compress msb or lsb at all. There are no matches of length 4 or more.

    Several programs depend on byte order (msb vs lsb) especially zip, gzip, zpaq -m 3 (LZ77 with order 1 literal modeling), freearc, bzip2,
    bsc, ppmd, ppmonstr, and zpaq ICM-ISSE chains.

    paq8pxd_v13 -8 does really poorly on bytemap, worse than gzip.

    zpaq -m 5 bitmap is the best result, probably because the sparse models are useful.

    zip and gzip fail to recognize the second copy in 'twice' because they only have a 32K window. Low order context models and most BWT also fail to compress this well. BSC probably does well because of its LZP preprocessor. Not sure why zpaq -m 3 misses it.

    nanozip supposedly uses BWT but compresses much better. Hard to speculate since it's closed source.

    Here is the program that generates the 10 test files.
    Code:
    // Generate test files for prime number compression benchmark. Public domain.
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    unsigned char* sieve;  // bit 1 = prime
    const unsigned N=1000000;  // largest prime
    
    // is i prime?
    int prime(unsigned i) {
      return (sieve[i/8]>>(i&7))&1;
    }
    
    int main() {
    
      // init sieve
      sieve=(unsigned char*)malloc(N/8+1);
      if (!sieve) return 0;
      memset(sieve, 255, N/8+1);
      sieve[0]=0xfc;  // 0 and 1 are composite
      unsigned i, j;
      for (i=0; i*i<=N; ++i) {
        if (prime(i))
          for (j=i*2; j<N; j+=i)
            sieve[j/8] &= ~(1<<(j%8));
      }
    
      // List primes
      FILE* f1=fopen("text", "wb");    // decimal text, LF terminated
      FILE* f2=fopen("msb", "wb");     // 32 bit ints, MSB first
      FILE* f3=fopen("lsb", "wb");     // 32 bit ints, LSB first
      FILE* f4=fopen("delta", "wb");   // differnces, MSB first
      FILE* f5=fopen("dbl", "wb");     // 32 bit ints twice each
      FILE* f6=fopen("twice", "wb");   // 2 copies of msb
      FILE* f7=fopen("hex", "wb");     // hex, no spaces
      FILE* f8=fopen("interlv", "wb"); // all MSB digits...all LSB digits
      FILE* f9=fopen("bytemap", "wb"); // '1' if prime else '0'
      FILE* f10=fopen("bitmap", "wb"); // packed byte map (inverted)
      j=0;  // previous prime
      for (i=0; i<N; ++i) {
        if (prime(i)) {
          fprintf(f1, "%u\n", i);
          fprintf(f2, "%c%c%c%c", i>>24, i>>16, i>>8, i);
          fprintf(f3, "%c%c%c%c", i, i>>8, i>>16, i>>24);
          j=i-j;
          fprintf(f4, "%c%c%c%c", j>>24, j>>16, j>>8, j);
          fprintf(f5, "%c%c%c%c", i>>24, i>>16, i>>8, i);
          fprintf(f5, "%c%c%c%c", i>>24, i>>16, i>>8, i);
          fprintf(f6, "%c%c%c%c", i>>24, i>>16, i>>8, i);
          fprintf(f7, "%08X", i);
          fprintf(f8, "%c", i>>24);
          j=i;
        }
        fprintf(f9, "%c", '0'+prime(i));
      }
      fwrite(sieve, 1, N/8, f10);
      for (i=0; i<N; ++i) {
        if (prime(i)) {
          fprintf(f6, "%c%c%c%c", i>>24, i>>16, i>>8, i);
          fprintf(f8, "%c", i>>16);
        }
      }
      for (i=0; i<N; ++i) if (prime(i)) fprintf(f8, "%c", i>>8);
      for (i=0; i<N; ++i) if (prime(i)) fprintf(f8, "%c", i);
      return 0;
    }
    Last edited by Matt Mahoney; 9th January 2016 at 04:40.

  2. Thanks (6):

    avitar (9th January 2016),byronknoll (14th January 2016),encode (11th March 2016),Mauro Vezzosi (9th January 2016),Mike (9th January 2016),Turtle (9th January 2016)

  3. #2
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    278
    Thanks
    116
    Thanked 160 Times in 117 Posts
    Results verified for cmv version 0.1.1.
    Code:
    cmv c -mx
        msb    lsb    dbl    hex   text  twice interlv delta bytemap bitmap total (4803388) tarball (4812800, 7z.exe a -ttar) tarball/total
      39547  39314  40403  41851  39019  39569  40831  34251  25789  32110  372684           292304                          0,78432130169
    
    Best overall (all programs tested until 2016/01/14)
        msb    lsb    dbl    hex   text  twice interlv delta bytemap bitmap total (4803388)
      37835  37973  38597  41619  38967  37894  36255  33726  25789  29527  358182
         nz     nz     nz   cmix   cmix     nz     nz   cmix    cmv   cmix
    You already post "Compressing prime numbers": is compressing prime numbers a big issue for you?
    Last edited by Mauro Vezzosi; 14th January 2016 at 09:25. Reason: Added cmix

  4. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    It's more interesting to see the failure modes and limitations of different compression algorithms. The earlier benchmark didn't show that. For example, I didn't expect gzip to do better than paq8pxd_v13 on some files. I didn't expect deflate to show such a huge difference between compressing big-endian and little-endian numbers.

  5. #4
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    260
    Thanked 242 Times in 121 Posts
    One interesting thing to add would be "mirrored" - same as twice, but the copy is reversed - I always wondered if it would help to add "reverse matches" to match compression as no algorithm I've seen so far implements them (did simple tests on files with X random bytes and the same random bytes reversed after that) and in image compression, it seems like a not so unlikely case.
    http://schnaader.info
    Damn kids. They're all alike.

  6. Thanks:

    Gonzalo (9th March 2016)

  7. #5
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    688
    Thanks
    41
    Thanked 173 Times in 88 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Here is the program that generates the 10 test files.
    Just in case if somebody needs the exe.
    Attached Files Attached Files

  8. Thanks:

    comp1 (10th January 2016)

  9. #6
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    The huge difference between ppmd & ppmonstr on this data is pretty interesting

    180336 191716 232018 144353 213793 180773 45507 36114 50945 36780 ppmd
    50613 49050 50757 66905 72933 50634 41696 34089 41987 35935 ppmonstr


    I wish we knew more about nanozip!
    Last edited by cbloom; 3rd August 2016 at 20:32.

  10. #7

  11. #8
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    249
    Thanks
    113
    Thanked 123 Times in 72 Posts
    Here are the results with cmix v8:

    msb: 38912
    lsb: 38696
    dbl: 39580
    hex: 41619
    text: 38967
    twice: 38940
    interlv: 38913
    delta: 33726
    bytemap: 30155
    bitmap: 29527

  12. Thanks (3):

    JamesB (14th January 2016),Matt Mahoney (14th January 2016),Mauro Vezzosi (14th January 2016)

  13. #9
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    876
    Thanks
    242
    Thanked 325 Times in 198 Posts
    Code:
        msb    lsb    dbl    hex   text  twice interlv delta bytemap bitmap
    
     198314 105536 109362 198771 189936 396668  65146  57278  67579  46203 gzip
     198538 105760 109586 199139 190171 396894  65223  52480  55925  44541 zip -9
     105954 105060 106989 116413 111864 210700  58280  47067  47969  41156 zopfli (ubuntu)
     104888 105058 106989 116096 111812 209677  58284  47071  48006  41126 zopfli -i1000 (github)
     84194  83375  83577  85410  83756  84202   52316  46661  46262  42427 brotli (Feb 12 github)

    zopfli and brotli give no huge surprises, but there is a strange reversal at bitmap-file between zopfli and brotli
    Last edited by Jyrki Alakuijala; 7th March 2016 at 10:49.

  14. #10
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Better results are obtained by preprocessing these files with integer transformations "16/32/64 bits transpose" and possibly "delta" (see TurboPFor).

    I think nanozip (like WinRar) is also using transpose/delta preprocessing after detecting binary tables.

  15. #11
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    Actually I've already written a tool that detects and optimizes structured data https://github.com/loxxous/prepack, it has taken me about a month of work to get it to outperform nanozip's filters. I have yet to implement dictionary optimizations like in nanozip, but that update will be very soon.

    Last edited by Lucas; 10th March 2016 at 20:27.

  16. Thanks:

    dnd (13th March 2016)

  17. #12
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    925
    Thanks
    57
    Thanked 116 Times in 93 Posts
    Quote Originally Posted by Lucas View Post
    Actually I've already written a tool that detects and optimizes structured data, in my tests I've gotten it to outperform nanozip's filters. https://github.com/loxxous/prepack
    Sorry I'm at work so icant play with this yet. but is this a generic prefilter as in i can use in front og any compressior or does it invoke its own compression ?

  18. #13
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    Quote Originally Posted by SvenBent View Post
    Sorry I'm at work so icant play with this yet. but is this a generic prefilter as in i can use in front og any compressior or does it invoke its own compression ?
    Yes you can use it in front of any compressor, currently it stores predictions of a static or adaptive model plus a 1 byte header.

  19. #14
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    876
    Thanks
    242
    Thanked 325 Times in 198 Posts
    Quote Originally Posted by dnd View Post
    I think nanozip (like WinRar) is also using transpose/delta preprocessing after detecting binary tables.
    I talked with Sami (author of nanozip) about transpose, and in general he was against transposing. He told that often there are more powerful alternatives, and transposing loses important correlations without being able to recover them.

  20. Thanks:

    Cyan (11th March 2016)

  21. #15
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    Here's the results of prepack applied to the prime test files
    Code:
    prepack+bsc : 437,342
    nanozip bwt : 453,493 (from Matt's test)

  22. #16
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts

    Exclamation Number Benchmark with TurboBench + Transpose/Delta

    Benchmark in 3 groups lz77, bwt and context mixing
    Code:
    bitmap    bytemap     dbl     delta     hex    interlv    lsb      msb      text    twice      size   ratio%         (bold=pareto)
     42431     46266     83581    46665    85414    52320    83379    84198    83760    84206     692220  14,41 brotli 11                         
       -         -       44976    38803    53545    41508    39094    41557    60403    41655     450238   9,37 brotli 11  +Transform  
     41469     40683     57630    40031    67280    53242    58604    57627    77172    57747     551485  11,48 lzma 9   
       -         -       42470    38920    55837    41777    38967    41788    59902    41943     443756   9,24 lzma 9     +Transform
    
     37385     35783     43503    36246    62766    42186    43098    42902    67771    41853     453493   9,44 nz                          
     36190     35280    184076    35328   205182    45110   181294   183226   199890   183490    1289066  26,84 bsc 2        
       -          -      38132    35084    49246    37318    35090    37370    66494    37610     407814   8,49 bsc 2      +Transform
    
     29527     30155     39580    33726    41619    38913    38696    38912    38967    38940     369035   7,68 cmix v8
     32110     25789     40403    34251    41851    40831    39314    39547    39019    39569     372684   7,76 cmv c -mx
     35425     42764     38597    34698    56017    36255    37973    37835    55520    37894     412978   8,60 nz -cc
     29742     33214     49006    34744    55836    41775    45645    45553    55045    46378     436938   9,10 zpaq 5
       -        -        36222    34049    46155    36025    34056    36076    51695    38163     375397   7,82 zpaq 5     +Transform
    
    125000   1000000    627984   313992   627984   313992   313992   313992   538468   627984    4803388 100,00 copy
    
    
    Preprocessing before compression:
    delta: transpose,32
    hex: transpose,64
    interlv: delta8,8
    msb,lsb,dbl,twice: delta32,32/transpose,32
    text: transpose,56 
    
    transpose,32 = byte transpose for 4 bytes entries 
    transpose,64 = byte transpose for 8 bytes entries
    transpose,56 = byte transpose for 7 bytes entries
    
    for cmv, nz, cmix see results from other posts in this thread
    For all files where transpose/delta is applicable, bsc+Transform beats Nanozip!
    bsc+Transform beats both Nanozip bwt & Nanozip cc in the total file sizes.
    Last edited by dnd; 13th March 2016 at 13:30.

  23. Thanks:

    Cyan (13th March 2016)

  24. #17
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts

    Thumbs up The myth Nanozip disenchanted!

    I've updated TurboBench with an extended transpose and updated the total file size.

    bsc+delta/Transpose beats not only Nanozip bwt but also the context mixing Nanozip cc in the total file sizes.
    Last edited by dnd; 13th March 2016 at 11:15.

  25. #18
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    876
    Thanks
    242
    Thanked 325 Times in 198 Posts
    Quote Originally Posted by dnd View Post
    I've updated TurboBench with an extended transpose and updated the total file size.

    bsc+delta/Transpose beats not only Nanozip bwt but also the context mixing Nanozip cc in the total file sizes.
    Super!

    BTW, https://sites.google.com/site/powturbo/home/benchmark enwik8 shows 27131626 bytes for "brotli 11 16 MB", but I get 25764698 bytes if I try it. It is likely due to recent improvements in the brotli encoder.

Similar Threads

  1. Prime: find prime number >N
    By Bulat Ziganshin in forum Download Area
    Replies: 8
    Last Post: 24th August 2013, 00:03
  2. Compressing prime numbers
    By Matt Mahoney in forum Data Compression
    Replies: 14
    Last Post: 18th May 2013, 18:41
  3. Quo Vaidis JPEG ... the child has a name, and a number.
    By thorfdbg in forum Data Compression
    Replies: 3
    Last Post: 25th April 2013, 04:16
  4. Compression via Number Factorisation
    By MiY4Gi in forum Random Compression
    Replies: 14
    Last Post: 20th July 2012, 18:51
  5. rnd - simple pseudo random number generator
    By encode in forum Forum Archive
    Replies: 21
    Last Post: 14th January 2008, 02:41

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •