Results 1 to 14 of 14

Thread: Synthetic data benchmark

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

    Synthetic data benchmark

    I ran some compression benchmarks with 8 progressively harder artificial problems. They were generated by the following program, so obviously all 8 have low algorithmic complexity. The program generates 8 files named test0 through test7. All but the first have a size of 100 KB. test0 is empty.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <string>
    
    // return j'th byte of i'th file
    unsigned int gen(int i, int j) {
      switch(i) {
        default: return 0;  // case 1
        case 2: return j/1000;
        case 3: return (sin(j)+1)*128;
        case 4: return '0'+(sin(j)>=0);
        case 5: return j/2%(251-j%2*10);
        case 6: return (sin(j%50000)+1)*1e8;
        case 7: return (sin(j)+1)*1e8;
      }
    }
    
    int main() {
      for (int i=0; i<8; ++i) {
        FILE* f=fopen((std::string("test")+char(i+'0')).c_str(), "wb");  // test0...test7
        for (int j=0; j<100000; ++j)
          if (i) putc(gen(i, j)&255, f);
        fclose(f);
      }
      return 0;
    }
    test0 is empty. It is used to measure the archive overhead. Occasionally it will make a compressor fail (like slug).

    test1 is a 0 byte repeated 100,000 times. Most compressors have no problem with it, but a few like lzss and lzrw* have a lot of overhead.

    test2 is 0 repeated 1000 times, then 1 repeated 1000 times, and so on up to 99 repeated 1000 times. It tests whether the compressor is adaptive. Stationary models like fpaq0 do poorly.

    test3 is an 8 bit sine wave with period 2*pi, thus not repeating. It is easily compressed with a low order context model, but not order 0.

    test4 is a 1 bit sine wave with period 2*pi, i.e. like "0001110001111000...", thus not exactly repeating. It is easily compressed with a high order context model, but not low order. LZ77 does very well on it.

    test5 interleaves two counting sequences with periods 241 and 251 (two highest 8 bit primes), thus not repeating within 100KB. It requires either a sparse model or a delta model to compress. An ordinary contiguous context model like LZ77 or PPM will not work. This test tends to separate the high end compressors from the rest.

    test6 is effectively two copies of a 50KB random byte string. This should compress to 50KB, but some programs fail to notice the match, in particular deflate (zip, gzip) because the window size is 32KB, and bsc without the -p option to turn off LZP preprocessing.

    test7 is effectively random. The test is to make sure the data does not expand too much. For example, paq1 expands 8.8%, lzss 12.5%, and lzw 50%. The compress and quark programs do not expand it because they check and refuses to compress if larger. In those cases I filled in the uncompressed size.

    Results are below. I didn't time the programs or verify decompression. I ran many programs with options for best compression and best speed as well as default settings. The files are small enough that default memory options are normally sufficient. I sorted by total size, although giving each test equal weighting is somewhat arbitrary. I renamed the columns test0...test7 to empty...random but compressed with the original file names.

    I can't guarantee that the code above will generate the exact same data because sin(100000)*1e8 has a floating point round off error of about 0.01. It is possible that 1% of the bytes might be different from mine in test6 and test7. But it should not make much difference in the results.

    Code:
      empty   zero  adapt    low   high sparse  match random  total  Sep 26 2011
         18     78    234   1274     88    207  50084 100048  152031 .fp8
         22     70    229   1177     99    213  50156 100146  152112 .paq8px
         21     66    232   1183     93    245  50146 100138  152124 .paq8l
         67     86    114   1267    110    623  50094 100083  152444 -mx.uha (uharc)
         25     31    193   1568    203    763  50347 100025  153155 -o128.pmm (ppmonstr)
         25     31    193   1571    216    763  50347 100025  153171 -o32.pmm
          6     42    325   1835    460     83  50211 100253  153215 -7.ccmx
          6     42    325   1844    460     83  50264 100255  153279 -0.ccmx
         67    101    182   1672    119   1026  50108 100089  153364 -m3.uha
         67    101    179   1675    119   1027  50108 100089  153365 -m2.uha
         67    101    264   1675    119   1027  50108 100089  153450 -m1.uha
         24     28    276   1877    766    560  50116 100161  153808 .paq9a
         26     77    256   2136    101    973  50086 100162  153817 -cO_.nz (nanozip)
         26     77    259   2280    106   1106  50086 100162  154102 .nz
          7     20    201   3263     43    892  50160 100215  154801 .lzpxj
         25     31    193   1580   1895    763  50347 100025  154859 .pmm
         27     90    323   2766    555    931  50199 100128  155019 -cc.nz
        244    252    549   2074    612   1374  50442 100322  155869 -m4.zpaq
         28     84    617   4002    112    989  50061 100050  155943 .sbc
         82    149    317   2253    413   2909  50753 101154  158030 .slim23d
          0     26    442   3204    139   3799  50355 100281  158246 .qc
          8     60    353   2887    473   8603  50540 100042  162966 -p-m4.bsc
          8     60    354   2982    473   8603  50540 100042  163062 -p-m5.bsc
          8     60    354   3100    471   8603  50540 100042  163178 -p-m6.bsc
          8     60    353   2864    527   8603  50807 100042  163264 -p-m3.bsc
          8    108    401   3926    140   8600  50591 100042  163816 -p.bsc
         25     95    383   1936    227  14239  50456 100056  167417 -cD.nz
         11     63   1086   2942   1000  12315  51456 101406  170279 -1.rings
         11     63   1086   2942   1000  12315  51456 101406  170279 -10.rings
         12     56    269   3845     78  16562  50338 100040  171200 -m4.grzipii
         12     52    286   2694    652  16389  51319 100040  171444 -m4-p.grzipii
         12     56    269   3946     77  17663  50338 100040  172401 -m2.grzipii
         12     52    284   3605     84  18342  51329 100040  173748 -m2-p.grzipii
         10     46    296   1646    979  22970  50244 100175  176366 -0.lpaq9m
         10     47    305   1686   1007  23844  50180 100150  177229 -5.lpaq9m
        157    176    434   3744    210   7389  64338 100783  177231 -m2.zpaq
          7     89    723   2510    498  27341  50302 100317  181787 -96.cmm4
        116    129    393   2332    612  28698  50315 100183  182778 -m3.zpaq
         18     67    784   4694    104  28088  50218 100040  184013 .bssc
          2     44    393   4251     92  20102  59054 100338  184276 .bbb
          9     49    324   1959    543  34415  50496 100586  188381 -5.lpaq1
          0    146    313   2407     91  31093  54060 100665  188775 -100000.m03
          9     49    320   1954    518  35105  50629 100585  189169 -2.lpaq1
         22     28    508   2197    180  35789  50510 100448  189682 .paq6
          9     49    320   1955    518  36016  50726 100587  190180 -1.lpaq1
         14     47    487   4015     81  21991  62803 100836  190274 .bz2
         87    169    758   2635    581  36695  50401 100415  191741 -p5.bit
          0     78    271   1900     97  39831  50465 100000  192642 .quark
         12     56    281   3852     79  38069  50338 100040  192727 -m3.grzipii
         12     52    286   2821    652  38282  51313 100040  193458 -m3-p.grzipii
         12     56    270   4018     77  38671  50338 100040  193482 -m1.grzipii
         12     52    284   3870     84  39584  51341 100040  195267 -m1-p.grzipii
         18    153    416   1973    172  42710  51207 100022  196671 -m4.lzham
         18    153    416   1973    172  42710  51207 100022  196671 .lzham
         25     95    374   2658   1771  36348  56795 100056  198122 -cd_.nz
         27     35    347   2143     48  44751  50792 100027  198170 -ag.enc
         18    180    421   2029   2174  42395  52192 100022  199431 -m0.lzham
         32    148    320   1848    248  47232  50824 100064  200716 .xz
          5     13    253   3685     30  43926  51157 101729  200798 .tc
         36    116    287   1806    217  47200  50798 101412  201872 .plzip
         78    192    364   1918    295  47277  50869 101501  202494 .7z
         34     45    307   3496     74  36942  63055 101447  205400 .dark
         34     48    491      0      0  54006  50793 101623  206995 .b58 (boa)
         52    117   1959   2252    216  52896  50781 101049  209322 -m2-c5-b6.yzx
          6     20   1109   5119    104  25348  77495 101018  210219 .bcm
         30     52    438   3399    421  53826  50901 102001  211068 .ppmvc
         66    258    392   2182    394  57408  50514 100134  211348 -lzx21.cab
         12    296    720   3856    872  54680  50624 100400  211460 -e3.thor
          8     99    389   4129    128   6795 100042 100042  211632 .bsc
         26     48    437   3342   1695  53822  50897 101997  212264 -o16.pmd (ppmd)
          5     16    293   4164   1282  52695  53806 100446  212707 .px
         12     32    814   5218     77  57209  50030 100023  213415 .etincelle
         26     48    434   3818   2409  53822  50897 101997  213451 -o8.pmd
         12    412    756   4120    504  57048  50552 100284  213688 -e4.thor
          4    407    810   4495    441  57446  50407 100402  214412 .lzp2
         26     65    437   3417   1711  54870  51183 102726  214435 -o16-H.pmd
         26     48    430   4718   2572  53816  50904 101997  214511 -o4.pmd
         26     64    429   3357   1710  55002  51267 102815  214670 -o16-I.pmd
         19    102    614   2190    241  59012  51385 101530  215093 -m6-s9.packet
         19    102    614   2522    326  59012  51385 101530  215510 .packet
         26     65    439   4007   2418  54872  51183 102726  215736 -o8-H.pmd
         26     64    428   3943   2420  55003  51267 102815  215966 -o8-I.pmd
         12     57    844   4146    263  58562  51103 101233  216220 -9.lzpm
         12     57    844   4460    449  58554  51103 101233  216712 -1.lzpm
         26     65    438   4946   2585  54871  51199 102726  216856 -o4-H.pmd
         26     64    427   4887   2583  55004  51283 102815  217089 -o4-I.pmd
          6     25    299   4457    144  60046  50748 101366  217091 -o255.szip
         25     93    371   3845    892  46382  65663 100054  217325 -cf_.nz
          0   1131   1401   3797   1150  57981  51144 100782  217386 -6.ulz
         25     93    390   3876   1164  46123  65663 100054  217388 -cF.nz
          8     89    688   2838   1064  57060  51235 104410  217392 .tarsalzp
          6     25    297   3434   1579  60046  50748 101366  217501 .szip
         67    116    326   2134    137  58775  52617 103718  217890 -mz.uha
          4     16    314   2556    194  59578  51460 104024  218146 .ash
         13     76   1056   4520    261  59001  51961 102300  219188 -x.balz
         50    122   1424   4827    159  54885  56890 101047  219404 .flashzip
         50    122   1546   4821    154  54885  56890 101047  219515 -m3-c7-b7.flashzip
         13     76   1056   4710    399  59001  51961 102300  219516 .balz
          0    108    549   3214    282  59226  55910 100309  219598 -u.acb
          0    108    470   3212    273  59115  56449 100242  219869 -B_.acb
          0    108    514   3206    277  59182  56988 100248  220523 -b.acb
         19     49    655   4839     88  60005  52744 103150  221549 -5.tor (tornado)
          0   1131   1401   7407   2211  57981  51150 100782  222063 -1.ulz
         57    128    426   2237    628  70017  50387 100057  223937 .RAR (v2.50)
        100    144    420   7520    716  57452  57521 100150  224023 -m3.csc32
        100    144    420   7528   1559  57441  57521 100150  224863 .csc32
          9     24    433   3401    217  62754  54269 105912  227019 -x.quad
         28     35    392   2334     48  61045  54595 108709  227186 -aq.enc
         28     35    392   2334     48  61045  54595 108709  227186 .paq1
          9     24    433   3518    378  62754  54269 105912  227297 .quad
          8   8844   6076   7176   6440  45163  54420 100972  229099 .lcssr
          4     30    694   6771     71  66063  56281 100016  229930 -3.stz
          4     30    694   6771     71  66063  56281 100016  229930 -4.stz
          4     30    694   6771     71  66063  56281 100016  229930 -5.stz
         13    110    481   4732   2624  60864  56570 104872  230266 .ppmx
         36     59    795   2681     76  72548  55496 100036  231727 .lzt
         28     96    943   5687   3694  63308  58875 100473  233104 .bzp
         19     59    662   2827     85  61178  65528 103150  233508 -12.tor
         12     19    439   3710   1102  67514  54318 107551  234665 -256.hook
         12     35    443   5072     73  79465  50034 100027  235161 .zhuff
        210    246    544   4730    293  48233  80544 100488  235288 -m1.zpaq
         14     95    412   4711   2531  66255  61552 100295  235865 -d6-n16M-f16M.ctw
          0    537    717   3098    639  64228  56581 112339  238139 -x.crush
         26     48    427   5002   2716  64159  64058 101724  238160 -o3.pms (ppms)
          0    537    717   3423    639  64228  56581 112339  238464 .crush
          4     31    917   5991     56  75331  57021 100016  239367 -2.stz
          4     31    917   6025     71  76452  57125 100016  240641 .stz
         40    458    861   6477   5151  72655  51941 103165  240748 .fastlz
         11     16    567   3041     38  71354  55305 110475  240807 -l7-d9-x7.qazar
          0    537    717   3433   2371  65629  56581 112339  241607 -f.crush
         26     47    425   7096   8764  53886  70029 101971  242244 -o2.pmd
         26     47    421   7072   8769  54250  70139 101735  242459 -o1.pms
         26     47    421   7072   8769  54250  70139 101735  242459 -o2.pms
         11     16    551   2863     47  71207  55806 112177  242678 .qazar
         12    772   1120   4560   1768  80632  53216 100820  242900 -e2.thor
          1     15    642  27633   2283  16014  99980 102830  249398 .fpaq0f2
          5     17    810   3693   1575  75211  65634 107457  254402 .sr2
         25     32    638   4172   2727  71637  62880 112299  254410 .p6
         26     33    696   5397   2608  68082  68784 112448  258074 .p12
          4   4744   5118   6780   4757  65825  58606 112433  258267 -x.lzss
         66     74    629   7355   8871  68193  68302 105788  259278 -o2.zpaq
         26     48    429   4670   2575  74716  78350 101724  262538 -o4.pms
         12    816   1130   4454   2496  86864  59300 107506  262578 -e1.thor
          0    461   1127  15413   2891  85100  57982 106334  269308 .flzp
          4   4744   5118   6838  17972  65825  58606 112433  271540 .lzss
         26     48    432   4407   2576  80112  84650 101724  273975 -o5.pms
         26     48    432   4407   2576  80112  84650 101724  273975 .pms
         26     47    430   4092   2577  83371  88113 101724  280380 -o6.pms
         47    602    863   3341    626  77220 100059 100059  282817 -9.lzo (lzop)
         12    200    452   2696   1068  77144 100760 100768  283100 -e5.thor
         26     48    433   3950   2416  85518  90307 101724  284422 -o7.pms
         47    464    867   6726    502  77733 100059 100059  286457 -1.lzo
         26     48    435   3800   2418  87065  91835 101724  287351 -o8.pms
         26    138    344   2841    518  84779 100044 100044  288734 -9.gz
         26    138    344   2841    974  84779 100044 100044  289190 .gz
        108    221   1315   2919    594  84398 100108 100108  289771 .kzip
        250    364    570   3067    744  85005 100270 100270  290540 -9.zip
        108    221    424   2927    880  87147 100108 100108  291923 .PKZ (pkzip)
         52    117   2151  17793   8297  72449  90655 101049  292563 .yzx
         31     38    477   5040   2389 100031 100031 100031  308068 -A2.HA
          4     30    694   8527     71 100016 100016 100016  309374 -1.stz
         25     32    484   5652   2565  78478 100846 130325  318407 .p5
         19    546   5181  20900   3384  90003  85437 116404  321874 -1.lzc
         19    546   5181  20900   3384  90003  85437 116404  321874 -10.lzc
         60     74    564  34327   3161  93023  99213 100856  331278 -o1-icm.zpaq
          3    530   6484  28044   3527 100000 100000 100000  338588 .Z (compress)
          8     59   1741  35420  12143 100374 100248 101808  351801 .fcm1
          0  11824  12253  16344  12109 100008 100008 100008  352554 .lzrw3-a
         14     18    698   7843  11854  94340  98073 140982  353822 .sr (srank)
          0   1896   9934  41172   7612 100016 100016 100016  360662 .lzrw5
          0  11824  12029  16484  22459 100008 100008 100008  362820 .lzrw2
          0  11824  12253  16426  22457 100008 100008 100008  362984 .lzrw3
          0  12500  12995  15876  21524  99178  88926 112096  363095 .ppp
         57     71    545  96111   5446  62025 100325 100326  364906 -o0-icm.zpaq
          0  11824  12029  19545  22459 100008 100008 100008  365881 .lzrw1-a
         62     70    538  26853  11671 115632 103315 111199  369340 -o1.zpaq
         20     48    987   7238     96 114591 125020 125020  373020 -1.tor
          0  13306  13515  20809  28126 100008 100008 100008  375780 .lzrw1
         59     67   1687  97528  12750  95736 100903 100898  409628 -o0.zpaq
          6    686   2426 101698   1276 102262 101545 101540  411439 -5000-4096-200-3.bpe
          1   1263   2511  98254  13908  93703 101306 101304  412250 .fpaq0p
          0     56   3200  16606    140 159030 122646 150790  452468 .lzw
          1     17  83293  96262  12534  99715 100131 100156  492109 .fpaq0
          0 100000 100000 100000 100000 100000 100000 100000  700000
    Last edited by Matt Mahoney; 27th September 2011 at 04:31.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    Thanks for an interesting post.
    But you forgot to tell us its purpose - to show off paq once more? :)
    Well, tests 0,1,6,7 are common enough (although 0 mostly demonstrates the
    difference between archivers and single-file compressors).
    But,
    1. Files have to be much larger - 10M or better 100M, otherwise paq's coding
    overhead in test1 is masked by header size. lzma should be better in that test.
    Also 50k window test may be ok, but we also need one with 5M window (for rar/lzx generation), and even larger ones.
    2. A script that produces testfiles has to be stable (ie no float point, preferably no STL, no stdlib rand).
    Otherwise why not upload the files?
    3. Either use coefficients for compressed sizes, or adjust the testfile sizes,
    because currently the ranking is basically determined by test5.
    4. Imho such benchmarks have to show good points of various algorithms,
    not their well-known "bad" points (like absence of delta and sparse submodels, bytewise processing),
    which have much smaller effect in practice.
    Surely, such good points are harder to think of, but I can give some examples:
    - Most LZs have a match size limit (273 for lzma, 258? for deflate),
    while there're LZ/PPM/CM coders which support higher orders;
    also BWT is relatively bad with a few very long matches (it essentially does
    unary coding of match length), but should be much better with many shorter ones.
    - Dictionary tests are pretty informative (eg. http://www.maximumcompression.com/data/dict.php),
    wordlists are good to demonstrate the effect of misprediction.
    - lzma has a data alignment option (lc/pb, manually controlled for now) which
    is very helpful for 32-bit aligned data, eg. arm/mips code.
    - 2D,3D tables (eg. mp3dump output)
    - variable-length structures (including ascii ones; eg. log files)
    - various types of numbers int8-16-32-64,uint8-64,float,double; vectors

  3. #3
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Thumbs up

    I think its a good test. It's nice to see how the files are generated but It would be nice to have fixed files so same set of files can be used by others.
    I also think the size is right. Why would everyone need to compress files of several mega bytes. There should be tests for those that target smaller files. Also it allows one to see how poorly or not people integrate overheard with compression which really effects shorter files. If one has an unknown compressor at least this set of tests gives some idea how the code works without actually digging into the source code. I like it and yes I suspect one could code a bijective compressor to target the whole set.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Of course we would like a benchmark to determine which compression algorithms work best. Unfortunately there is no benchmark that does this because it depends on the data you need to compress and there is no such thing as "typical data". I attempted to do that using a Solomonoff distribution, but I think there is too much dependency on the choice of universal Turing machine to make this practical. http://mattmahoney.net/dc/uiq/

    The purpose of this benchmark is to test whether various compressors behave the way we would expect on some simple problems. Most of them do but there are a few surprises. The only test where file size makes a difference is test6 (2 copies of a random string). This should probably be benchmarked over a range of sizes. For the other tests, we can use small files because we can subtract the result of compressing an empty file in test0. So rather than rank the compressors, it is more interesting to sort them on various columns. But I had to present the data in some order.

    Anyway I am attaching the test set, although I'm not entirely satisfied with it and will probably use something else. My goal is to find the smallest test set that predicts general performance on practical data. I know that is an impossible task, but at least we should be able to make predictions how an algorithm will perform given some knowledge of the type of data it needs to compress.

    I am also considering a benchmark that includes only open source compressors, or at least open source or fully specified decompressers. I figure that undocumented or proprietary formats otherwise have no chance of widespread acceptance and are therefore irrelevant.
    Attached Files Attached Files

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    For test1 size also matters:
    Code:
    100k      100000   10m     10000000
    100k.7z      190   10M.7z      1586
    100k.paq8px   65   10M.paq8px  3551
    100k.rar     104   10M.rar      151
    100k.zip     268   10M.zip    11884
    As to open-source compressors, you don't include mine anyway, so I don't care :)

  6. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    It's fine for a synthetic benchmark.
    It's kind of wondering "well, how does the algorithm deals with that kind of input", which is an interesting question, and can lead to incremental improvements.
    Of course, it does not mean "we should interpolate the result to get a ranking on real-world data", since it only qualifies a behavior on an set of specific situations. But that's nonetheless useful.

    I although like the way the set can be produced and distributed, with a trivial source code.
    Maybe it should not rely on Float calculation then, but if i do understand, these float calculations are only used to create noise, so a lots of others methods can be used to create the same effect, such as, for example :
    Code:
    accumulator32 *= prime32; 
    output8 = accumulator32 >> 24;
    or whatever equivalent.

    test5 interleaves two counting sequences with periods 241 and 251 (two highest 8 bit primes), thus not repeating within 100KB.
    well, i thought that : 241 x 251 = 60491 ...
    Last edited by Cyan; 27th September 2011 at 22:04.

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by Shelwien View Post
    For test1 size also matters:
    Code:
    100k      100000   10m     10000000
    100k.7z      190   10M.7z      1586
    100k.paq8px   65   10M.paq8px  3551
    100k.rar     104   10M.rar      151
    100k.zip     268   10M.zip    11884
    OK, I see the problem. test1 - test0 size is not a good predictor like I thought it would be.

    7z, 192 - 78 = 114
    paq8px, 70 - 22 = 48
    rar, 128 - 57 = 71
    zip, 364 - 250 = 114

    Quote Originally Posted by Shelwien View Post
    As to open-source compressors, you don't include mine anyway, so I don't care
    Sorry about that. I have some of them on LTCB but I see some I missed on ctxmodel.net. I haven't got to them yet but if you want to benchmark them yourself I can add them more quickly.

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by Cyan View Post
    I although like the way the set can be produced and distributed, with a trivial source code.
    Maybe it should not rely on Float calculation then, but if i do understand, these float calculations are only used to create noise, so a lots of others methods can be used to create the same effect, such as, for example :
    Code:
    accumulator32 *= prime32; 
    output8 = accumulator32 >> 24;
    or whatever equivalent.
    Except it's not very random. I thought about better integer generators (like RC4) but I wanted to keep the code simple to illustrate the idea.

    Quote Originally Posted by Cyan View Post
    well, i thought that : 241 x 251 = 60491 ...
    x 2 bytes > 100 KB

  9. #9
    Tester
    Black_Fox's Avatar
    Join Date
    May 2008
    Location
    [CZE] Czechia
    Posts
    471
    Thanks
    26
    Thanked 9 Times in 8 Posts
    Even in case this doesn't get used as a benchmark, it's still good functional/regression testset.
    I am... Black_Fox... my discontinued benchmark
    "No one involved in computers would ever say that a certain amount of memory is enough for all time? I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again." -- Bill Gates

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Here is another benchmark. The test data is a sorted list of prime numbers less than 1 million, written 8 different ways.

    test0 - listed in decimal, one per line (2 3 5 7 ... 999983, linefeed terminated)
    test1 - each number listed twice (2 2 3 3 5 5 ...)
    test2 - whole list repeated twice (2 3 5 7 ... 999983 2 3 5 7 ... 999983)
    test3 - each number written as a 32 bit integer, big-endian
    test4 - as 64 bit floating point, little-endian
    test5 - delta coded as (prime - lastprime)/2 as a single byte
    test6 - as a million character vector, '0' = composite, '1' = prime
    test7 - as a packed bit vector

    For test5, the largest delta for primes up to 2e9 is 292. Since deltas are always even (except 3 - 2), you can divide by 2 and store as a byte in the range 0...145.

    Test files are generated as follows. I used the default n = 1000000.
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <vector>
    using namespace std;
    
    union U {
      unsigned long long i;
      double x;
    } u;
    
    int main(int argc, char** argv) {
    
      // Get n = max prime
      int n=1000000;
      if (argc>1) n=atoi(argv[1]);
    
      // Create list of primes less than n using sieve of Eratosthenes
      vector<bool> isComposite(n);
      isComposite[0]=isComposite[1]=true;
      for (int i=2; i*i<=n; ++i)
        if (!isComposite[i])
          for (int j=i*2; j<n; j+=i)
            isComposite[j]=true;
    
      // Generate output
      FILE* f0=fopen("test0", "wb");  // list of primes as decimal numbers
      FILE* f1=fopen("test1", "wb");  // each number printed twice
      FILE* f2=fopen("test2", "wb");  // whole list printed twice
      FILE* f3=fopen("test3", "wb");  // as a 32 bit int, MSB first
      FILE* f4=fopen("test4", "wb");  // as a double, LSB first
      FILE* f5=fopen("test5", "wb");  // delta/2 as 1 byte (max 146 up to n=2e9)
      FILE* f6=fopen("test6", "wb");  // as an unpacked vector '0'=composite, '1'=prime
      FILE* f7=fopen("test7", "wb");  // as a packed vector, n/8 bytes
      int lastprime=0, bits=0;
      for (int i=0; i<n; ++i) {
        bits+=bits;
        if (!isComposite[i]) {
          u.x=i;
          fprintf(f0, "%d\n", i);
          fprintf(f1, "%d\n%d\n", i, i);
          fprintf(f2, "%d\n", i);
          fprintf(f3, "%c%c%c%c", i>>24, i>>16, i>>8, i);
          for (int j=0; j<8; ++j) putc(int(u.i>>j*8)&255, f4);
          putc((i-lastprime)/2, f5);
          putc('1', f6);
          bits++;
          lastprime=i;
        }
        else
          putc('0', f6);
        if (i%8==7) putc(bits, f7), bits=0;
      }
      for (int i=0; i<n; ++i)
        if (!isComposite[i])
          fprintf(f2, "%d\n", i);
      return 0;
    }
    Results:

    Code:
       list  twice 2lists    int double  delta vector packed -------
      test0  test1  test2  test3  test4  test5  test6  test7 - total
      41954  43252  42174  40780  43417  33613  34813  33181  313184 .paq8px
      55504  60491  55696  37845  42516  33869  42708  35478  364107 -cc.nz
      55959  56193  56227  55512  79864  34164  41354  35427  414700 -6.lpaq9m
      64231  63990  64350  48083  69378  33837  44705  34891  423465 -m4.zpaq
      72934  79757  72957  50615  50050  34047  41985  35938  438283 .pmm
      68773  70756  68827  48491  88608  33928  47508  35596  462487 .paq9a
      77356  71373  77520  57728  75721  38729  46365  41318  486110 -mx=9.7z
      77170  71385  77333  57728  75835  38684  52871  41337  492343 .7z
      74604  78045  75133  69568 107452  34342  46153  35652  520949 -m3.zpaq
      67572  85380 171657  42990  48499  36237  35779  37499  525613 .nz
      80953  81732  81201  59198 110743  33625  44916  35339  527707 -6.lpaq1
     107138 101530 108452 100144 119156  39186  49584  41696  666886 -lzx21.cab
      97142 115012 142504 125586 136140  34231  36036  35644  722295 .ash
     106941 151398 107665  81316 150427  36539  57558  40271  732115 -m3-c7-b7.flashzip
     157720 135787 158653  66639 123976  45561  68398  48431  805165 -12.tor
     112231 119323 224328 108032 124666  38930  48915  42093  818518 .kzip
     152806 162705 153712  80161 116496  44725  99039  50693  860337 -5.tor
     177253 190518 179725 110595 106954  39531  62863  43641  911080 .comprox_sa
     134960 192310 269715  94247 138320  39146  44446  38894  952038 .fpaq0f2
     187836 226444 197516 130457 155762  34242  34668  35282 1002207 -cfm100.bbb
     163721 165639 324828 183066  85679  35039  38945  37444 1034361 -p-m3.bsc
     142397 244681 247763 132622 161475  33671  45893  35380 1043882 -d6-n16M-f16M.ctw
     126043 145533 252939 159115 224241  39460  53149  45873 1046353 .sr2
     175224 251581 198583 170131 169389  35992  36525  37211 1074636 -U.ACB
     199727 231066 200139 183104 183431  35025  34698  36301 1103491 -p.bsc
     199727 231066 200139 183104 183431  35025  34698  36301 1103491 .bsc
     199211 226147 341276 183068  88661  34987  38789  36312 1148451 -p-m6.bsc
     198865 303016 286373 123792 143829  35143  34732  35736 1161486 -m2.zpaq
     132141 244724 264094 178784 225030  34757  52883  38156 1170569 -m256-o2.pmd
     208656 287713 208783 180338 192099  35012  42286  37912 1192799 -m256-o16-r1.pmd
     208656 287713 208783 180338 192099  35012  42286  37912 1192799 -m256-o16.pmd
     208658 287730 209320 180338 192099  35012  47362  37886 1198405 -m256-o8.pmd
     190173 165143 380351 198542 125800  40894  55921  44515 1201339 -9.zip
     227242 280603 227252 138598 198555  37914  54604  36832 1201600 -256.hook
     189937 164927 380103 198316 125669  40667  67577  45693 1212889 .gz
     190163 165153 380329 198542 125895  40893  67802  45919 1214696 .zip
     202609 341779 332909 144884 164716  34217  34976  35586 1291676 .bcm
     214919 327855 363094 190356 198772  37813  38183  41318 1412310 .bz2
     214153 340243 333305 206163 205829  37632  54475  41656 1433456 .comprox_ba
     211832 378014 368220 183662 197590  36503  44525  39819 1460165 -m1.zpaq
     213794 343244 415797 180338 191078  35095  50943  36778 1467067 -m256-o4.pmd
     213794 343244 415797 180338 191078  35095  50943  36778 1467067 .pmd
     180458 282033 360585 195631 263707  40460  98726  46010 1467610 .slug
     228315 327431 239539 238815 302092  39998 194588  51742 1622520 .etincelle
     212154 416310 424263 180787 264695  39336  62610  40612 1640767 .fpaq0p
     242422 350251 484680 235542 235375  49029  86849  53790 1737938 -9.lzo
     230781 443685 461531 269780 284040  41012  52496  42309 1825634 .Z
     228147 455048 456622 239482 334345  38409  49758  38889 1840700 .fpaq0
     273642 498266 273647 256507 274721  55970 245530  94094 1972377 .stz
     270641 491326 540827 259857 265290  53925 147664  75123 2104653 .lzrw3-a
     270752 515022 540308 306370 321404  51502  97982  54232 2157572 .lzrw5
     258867 327175 513466 220807 375811  79723 279643 105229 2160721 .ppp
     274989 496311 549755 257726 274753  57098 238178  94087 2242897 .lzrw3
     277625 489728 555078 257731 274784  58390 238146  94111 2245593 .lzrw2
     286202 465064 572340 257731 274864  60854 238146  94609 2249810 .lzrw1-1
     286202 465063 572340 257728 274864  60853 245484  94609 2257143 .lzrw1
     337723 530587 357726 392025 296636  70775 242865  75896 2304233 -1.tor
     302687 510306 604965 250226 253184  62495 251175  75912 2310950 .lzo
     424413 454793 846656 298945 395041  77166 205762 112803 2815579 .flzp
     53846810769361076936 313992 627984  784981000000 125000 4837814

  11. Thanks:

    jman (14th April 2017)

  12. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    Yeah, like this its more interesting.
    best lzma results are a little different though...

    http://nishi.dreamhosters.com/u/plzma_test_0.rar
    Code:
      64231  63990  64350  48083  69378  33837  44705  34891  423465 -m4.zpaq
      72934  79757  72957  50615  50050  34047  41985  35938  438283 .pmm
      68773  70756  68827  48491  88608  33928  47508  35596  462487 .paq9a
      77356  71373  77520  57728  75721  38729  46365  41318  486110 -mx=9.7z
      77170  71385  77333  57728  75835  38684  52871  41337  492343 .7z
    
      76306  61963  76531  52773  71772  38660  40098  41012  459115 lzma
      71057  64524  71119  52926  69789  37460  40114  39979  446968 plzma
      71057  61963* 71119  52773* 69789  37460  40098* 39979  444238 best of two

  13. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Interesting how test1 is smaller than test0 for lzma and zip.

  14. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    For lzma test0 vs test1 this is the markup: http://nishi.dreamhosters.com/u/test0vs1.png
    And this is the entropy distribution by symbol types:
    Code:
                        test0     test1
       c_IsMatch:    10952.12   4364.15 
      c_Literal1:    35357.38  26978.76 
      c_Literal2:    17427.67  16000.60 
         c_IsRep:      285.50    239.68 
       c_IsRepG0:      722.35    232.26 
       c_IsRepG1:      146.47      1.65 
       c_IsRepG2:       46.54      0.12 
    c_IsRep0Long:      674.75    255.91 
     c_LenChoice:      222.24   1023.62 
        c_LenLow:    10325.34    423.97 
    c_LenChoice2:                212.87 
        c_LenMid:              12193.88 
       c_PosSlot:       72.39     12.08 
       c_SpecPos:       20.71      1.26 
     direct bits:       18.00      1.62 
         c_Align:       16.08      1.51 
           Total:    76287.54  61943.95
    Looks like its able to merge matches for duplicate prefixes and duplicate lines.
    But the main reason is probably that it found longer matches, thus the balance
    of literal/match prices changed, and in the end it was able to save a lot on literals and flags.
    I guess lzma's optimal parser finds the wrong local minimum for test0...

  15. #14
    Member
    Join Date
    Jan 2019
    Location
    UK
    Posts
    7
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Matt Mahoney View Post
    Here is another benchmark. The test data is a sorted list of prime numbers less than 1 million, written 8 different ways.
    Since this data is wholly synthetic, it should be possible to calculate the theoretical Shannon entropy of the test file. That would directly illustrate the efficiency of the compressors. Exactly what I'm looking for. The prime gap used in your prime delta column is well studied, but I'm too thik to calculate it? Any ideas? Anyone?

Similar Threads

  1. loseless data compression method for all digital data type
    By rarkyan in forum Random Compression
    Replies: 228
    Last Post: 6th November 2019, 16:53
  2. Synthetic compression benchmark
    By giorgiotani in forum Forum Archive
    Replies: 6
    Last Post: 3rd March 2008, 12:14

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •