
Thread: Silesia compression corpus

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

Silesia compression corpus

    Just a very useful compression corpus:
    http://sun.aei.polsl.pl/~sdeor/silesia.html


  2. #2
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    Nice one. Waiting for some results, I guess it could be compressed down to about 20-30 MB or even better.

    The PDF is uncompressed, by the way, so Precomp doesn't help here.
    It helps with the mozilla and samba testsets, though:
    mozilla: pcf+pcf+bzip2 -> 16,685,982 bytes instead of just bzip2 -> 17,914,392
    samba: pcf+bzip2 -> 4,310,438 bytes instead of just bzip2 -> 4,549,790

    EDIT: First results with CCM 1.30c (setting 4):

    Code:
    CCM             45.223.851
    Precomp + CCM   43.377.536
    Last edited by schnaader; 10th January 2009 at 12:36.
    http://schnaader.info
    Damn kids. They're all alike.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quick test on a 2.0 GHz T3200, Win32. zip and nanozip -cc used 1 core. zpaq, 7zip, and nanozip in default mode used 2 cores. Times are real times to compress and decompress.

    Code:
    Size     Ctime Dtime
    39,452,564 621 620 silesia-cc.nz
    40,568,503 776     silesia-m4.zpaq
    43,422,818 279 303 silesia-m3.zpaq
    44,841,467 107  21 silesia.nz
    47,183,792  79  93 silesia-m2-b64.zpaq
    49,729,056 137  10 silesia.7z
    49,982,874  49  54 silesia-m1-b64.zpaq
    50,533,746  48  56 silesia-m1.zpaq
    67,633,896  41 2.5 silesia-9.zip
    68,229,871  19 2.5 silesia.zip
    zip and zpaq compress files separately. nanozip and 7zip produce solid archives. It takes as long to extract just the last file as to extract all of them. 7zip stops when it reaches the file you want, but nanozip needlessly decompresses to the end and discards the rest.

  4. #4
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    Completely forgot about this one.

    Some up-to-date results (Precomp 0.4.3 developer version using PackJPG 2.5, zpaq 3.01, reflate v0a), M 520 @ 2.4 GHz:

    Code:
    38.831.249 silesia.pcf.zpaq_m4_b48      (54 s + 638 s comp)
    
    13.366.403 mozilla.7z                   (55 s comp)
    12.393.857 mozilla.zpaq_m4              (317 s comp)
    11.913.640 mozilla.pcf.7z               (73 s + 44 s comp)
    11.740.249 mozilla.reflate              (193 s comp, 92 s decomp)
    11.121.041 mozilla.pcf.paq8px_v69_3     (39 s + 841 s comp)
    10.664.503 mozilla.pcf.zpaq_m4          (73 s + 325 s comp)
     8.906.579 mozilla.pcf.paq8px_v69_4     (39 s + 8340 s comp)
    
     3.764.255 samba.7z                     (19 s comp)
     3.435.432 samba.reflate                (150 s comp, 10 s decomp)
     3.067.085 samba.zpaq_m4                (181 s comp)
     2.614.968 samba.pcf.zpaq_m4            (7 s + 224 s comp)
    Unfortunately, the current version of reflate doesn't restore the whole testset correctly, but it works very well, especially on the mozilla part, as it is able to decompress ZIP files that Precomp fails on.

    I also had a closer look at the x-ray file: it has a 240 byte header and, after that, 1900*2230 grayscale pixels (12 bit, but stored as 16 bit values 0x0000-0x0FFF). I'm sure there's some way to convert it to one big or two separate BMP files and process it with some image compressor or paq8px_v69, but the best results I got so far with conversion aren't much better than without:

    Code:
     3.678.159        (paq8px_v69 -3, 139 s comp)
     3.627.031 + ~240 (interpreting file as 16 bit BGR, 6:5:5, conversion to 1900*2230 24-bit BMP + paq8px_v69 -4, 274 s comp)
     3.620.831 + ~240 (interpreting file as 8 bit grayscale, conversion to 3800*2230 8-bit BMP + paq8px_v69 -4, 371 s comp)
     3.613.619        (paq8px_v69 -4, 1801 s comp)
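    By the way, the layout described above (240 byte header, then 16 bit little-endian samples that never exceed 0x0FFF) is easy to verify. A minimal sketch - just an illustration, not part of any tool mentioned here:

    Code:
    // Sketch: check that, after the 240 byte header, every 16-bit little-endian
    // sample of the x-ray file fits into 12 bits (0x0000-0x0FFF), as described above.
    #include <cstdio>

    int main() {
        FILE* f = fopen("x-ray", "rb");        // local copy of the corpus file
        if (!f) return 1;
        fseek(f, 240, SEEK_SET);               // skip the header
        long samples = 0, tooBig = 0;
        int lo, hi;
        while ((lo = fgetc(f)) != EOF && (hi = fgetc(f)) != EOF) {
            unsigned v = (unsigned)lo | ((unsigned)hi << 8);
            ++samples;
            if (v > 0x0FFF) ++tooBig;
        }
        fclose(f);
        printf("%ld samples (1900*2230 = 4237000 expected), %ld above 0x0FFF\n",
               samples, tooBig);
        return 0;
    }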
    Last edited by schnaader; 16th December 2011 at 01:08.
    http://schnaader.info
    Damn kids. They're all alike.

  5. #5
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    reflate v0b is able to recompress and restore the whole testset completely:

    Code:
    211.938.968 silesia.7z_store
     46.025.705 silesia.7z_store.reflate (1225 s comp, 213 s decomp)
    http://schnaader.info
    Damn kids. They're all alike.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    That's cool, but is the -6 mode the best there? (you can change it in "set level=6" lines in c.bat/d.bat).
    Also I suppose that's a plzma result? Can you test with some other compression too? (like shar+ccm)

  7. #7
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    I'm not sure if level 6 is the best. Precomp says the used compression levels are 1, 5 and 6, so those would be other possibilities to test, but the .hif files aren't that big anyway. Some test results and comments:

    Code:
     46.025.705 silesia.7z_store.reflate (level 6, 1880 .hif files, total .hif size 76.128 bytes)
     46.125.184 silesia.7z_store.reflate (level 9, 1880 .hif files, total .hif size 168.068 bytes)
    Tests with other compression methods:

    Code:
    219.754.380 silesia.reflate.shar
     43.337.208 silesia.reflate.shar.ccm (CCM 1.30a level 7)
     38.923.300 silesia.reflate.shar.zpaq_m4_b48
     38.831.249 silesia.pcf.zpaq_m4_b48  (from a post above, for comparison)
    Results are very good. Precomp is a bit better as it processes non-zlib streams, too (994 GIF, 4 JPG, 7 bZip2). When using switches to disable those streams and recursion, the comparison is more fair:

    Code:
    218.655.017 silesia.pcf              (-t-jbf -d0)
     39.039.373 silesia.pcf.zpaq_m4_b48
    Last edited by schnaader; 18th December 2011 at 03:50.
    http://schnaader.info
    Damn kids. They're all alike.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Thanks.
    With 46M for reflate and 38M for precomp it didn't look very convincing before.

    These gifs and bzip2 streams are troublesome for me though, as handling them is not even in the plan yet
    (zip,png,jpg,mp3,exe)

    Also, is it necessary to buffer a few MBs ahead to recompress bzip2?..

  9. #9
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    Quote Originally Posted by Shelwien View Post
    These gifs and bzip2 streams are troublesome for me though, as handling them is not even in the plan yet
    (zip,png,jpg,mp3,exe)
    Hmm.. at least for users, it wouldn't be bad if reflate's and Precomp's supported formats diverged a bit, because they could combine both for best results:

    Code:
    225.744.709 silesia.shar.pcf
     38.721.067 silesia.shar.pcf.zpaq_m4_b48
    Also note that the GIF and bZip2 implementations are quite basic, so it wouldn't be that hard to create your own that gives similar or better results.

    Combining the other way round (.pcf.shar) would be possible, too, but as Precomp will partially recompress some of the streams it can't recompress completely, I guess the result is likely to get worse or make reflate fail.

    Quote Originally Posted by Shelwien View Post
    Also, is it necessary to buffer a few MBs ahead to recompress bzip2?..
    bZip2 recompression is the simplest part of Precomp atm. Basically, I just use bzlib to decompress and recompress the complete bZip2 streams to temporary files. Most of them will be recompressed 100% identically, although I already found some streams that differ.

    In theory, I think buffering one of the 100-900 KB blocks ahead should be enough, but I haven't looked at the bzlib internals close enough yet to be sure.

    The bZip2 format is much better suited for recompression: it has a 3 byte file header ("BZh") followed by the compression "level" ("1"-"9" for block sizes of 100-900 KB), and even the blocks start with a magic string (BCD of pi, 0x314159265359).
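    For illustration, the whole decompress-recompress-compare idea fits in a few lines using bzlib's one-shot buffer calls. This is only a sketch of the approach described above (the in-memory handling and buffer sizes are mine), not Precomp's actual code:

    Code:
    // Sketch: decompress a complete bZip2 stream, recompress it with the level taken
    // from the "BZh<level>" header and check whether the result is bit-identical.
    #include <bzlib.h>
    #include <cstring>
    #include <vector>

    bool roundtrips_identically(const std::vector<char>& stream) {
        if (stream.size() < 4 || memcmp(stream.data(), "BZh", 3) != 0) return false;
        int level = stream[3] - '0';                        // '1'..'9' -> 100-900 KB blocks
        if (level < 1 || level > 9) return false;

        std::vector<char> plain(100 * 1024 * 1024);         // generous guess, demo only
        unsigned int plainLen = (unsigned int)plain.size();
        if (BZ2_bzBuffToBuffDecompress(plain.data(), &plainLen,
                const_cast<char*>(stream.data()), (unsigned int)stream.size(), 0, 0) != BZ_OK)
            return false;

        std::vector<char> packed(plainLen + plainLen / 100 + 600);
        unsigned int packedLen = (unsigned int)packed.size();
        if (BZ2_bzBuffToBuffCompress(packed.data(), &packedLen,
                plain.data(), plainLen, level, 0, 0) != BZ_OK)
            return false;

        return packedLen == stream.size() &&
               memcmp(packed.data(), stream.data(), packedLen) == 0;
    }
    Streams where this check fails are the ones that would need extra handling (or have to be stored as-is).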
    http://schnaader.info
    Damn kids. They're all alike.

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I decided to create a new benchmark. To encourage open source, open source programs will be the only ones listed. http://mattmahoney.net/dc/silesia.html

    Right now there are just a few programs listed. More will come.

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by schnaader View Post
    Most of them will be recompressed 100% identically, although I already found some streams that differ.
    7-zip has its own bzip2 codec

  12. #12
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    462
    Thanks
    147
    Thanked 158 Times in 106 Posts
    The BTRFS people have been benchmarking LZ4 vs Snappy & Snappy-C using this corpus (link below). Given they're putting them into kernel modules, they should be open source.

    http://www.spinics.net/lists/linux-btrfs/msg15036.html

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yeah, still need to test some Linux compressors.

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Added ocamyd-1.65, freearc, dmc, irolz, lz4, brieflz, runcoder1, data-shrinker, quicklz, snappy. http://mattmahoney.net/dc/silesia.html

  15. #15
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    462
    Thanks
    147
    Thanked 158 Times in 106 Posts
    Have you considered adding timings for the total corpus - either compression, decompression or both? Maybe memory too?

    I know it is more work; however, right now a novice reader would wonder why anyone would use, say, gzip over paq8, although we know there is one large and compelling reason.

  16. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Since the largest file is only ~50 MB and the files are compressed individually, measuring performance shouldn't be a big problem on a laptop with 2 GiB RAM, except for the fastest programs.

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    The large text benchmark measures speed and memory usage. For most compressors, speed depends mostly on the input size, so there is no need to repeat the speed tests. Yes, it would be nice to have it all in one table. But ultimately it is not practical because when I upgrade to a new computer, all of the tests would need to be repeated. Compression ratios are machine independent.

  18. #18
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    The large text benchmark measures speed and memory usage. For most compressors, speed depends mostly on the input size, so there is no need to repeat the speed tests. Yes, it would be nice to have it all in one table. But ultimately it is not practical because when I upgrade to a new computer, all of the tests would need to be repeated. Compression ratios are machine independent.
    Well... no. Almost all compressors have 10+% speed variability depending on input. Quite a few have extremes here by either:
    - having special modes for special data, i.e. skipping incompressible parts or using different models for different data.
    - using algorithms whose pessimistic complexity is much different from the optimistic one. Trees, sorting, etc.

    And, in general, no, compression ratios are not machine-independent. For some compressors, more cores = splitting into more chunks. For others, more free RAM = larger windows. Sadly, neither is likely to show up in SCC split into individual files because each input is too small. This shows one flaw of the benchmark: an area where its results are going to differ from what happens in real use. Or maybe it's not a flaw - maybe the benchmark is just not supposed to show the strength of codecs in archiver usage. But then I don't see what it is supposed to show, really.

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I don't think any of the programs I tested have machine dependent compression ratios. At least if anyone finds one, they can report a discrepancy. If it depends on free memory or number of threads, there are usually options to set them.

    Anyway, as I said in the other thread, my goal is to encourage research in extreme compression. Not because extreme compression is useful by itself, but because it leads to better practical compressors. It's why we have lots of CM compressors and not just PPM, even though PAQ is too slow to be useful.

    Thus, open source only, and rank by size. paq8px_v69 is 2 years old. I predict that in a few months somebody will beat it.

  20. #20
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    237
    Thanks
    39
    Thanked 92 Times in 48 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I predict that in a few months somebody will beat it.
    Somebody can beat it in a few hours by adding* APM(s) with failcount to Predictor::update(), as in paq8hp12, but why would he do that now if he hasn't done that years ago?

    *or modifying an existing APM

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  21. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I don't think any of the programs I tested have machine dependent compression ratios
    freearc depends on memory/cores for large enough files

  22. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Can you suggest some options to improve on -m9?

  23. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    What do you mean? Make it independent of memory/cores?

  24. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    A lot of the freearc options are not well documented (at least in English), so I was hoping you could find some combination that would give better compression than just -m9. I'd prefer those that don't depend on memory or cores so results are repeatable. I would be testing in 32 bit Windows using 2 cores with 3 GB or 64 bit Linux with 4 cores and 4 GB.

  25. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I was hoping you could find some combination that would give better compression than just -m9
    sorry, no

  26. #26
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    566
    Thanks
    217
    Thanked 200 Times in 93 Posts
    Experimented with filtering/converting the file MR. It's a DICOM file with some header bytes followed by 19 grayscale pictures (512*512, 16-bit).

    Renaming the file to mr.raw, opening it with IrfanView (8 bit grayscale, 1024*9737) and saving it as BMP gives a slightly better compression ratio and much better speed in paq8px:

    Code:
    mr						9.970.564
    mr_8_bit_1024_9737.bmp				9.971.766
    mr.paq8px_v69_4					2.112.410 (2690 s)
    mr_8_bit_1024_9737.bmp.paq8px_v69_4		2.038.773 ( 604 s)
    But as this interprets the high and low bytes of the 16-bit grayscale values as "mixed" 8-bit grayscale values, I tried to split them into 2 "planes" and saved the result as a big 512*19474 image (containing the "even" bytes in the upper image half, the "odd" bytes in the lower one). This improves the compression ratio again. Results for paq8px and paq8pxd_v4:

    Code:
    mr_split_8_bit_512_19474.bmp                    9.971.766
    mr_split_8_bit_512_19474.bmp.paq8px_v69_4	1.922.196 (   610.24 s)
    mr_split_8_bit_512_19474.bmp.paq8pxd_v4_4       1.905.203 (   649.02 s)
    mr_split_8_bit_512_19474.bmp.paq8pxd_v4_8       1.894.843 (   657.51 s)
    I attached the BMP and two C++ source code files (split.cpp, join.cpp) that do the mr<->mr_split conversion, although the conversion is quite trivial. BMP conversion isn't done, but would be easy to add - as the image width is a multiple of 4, no scanline padding is needed and only a ~1000 byte header and some trailing bytes have to be added.
    Splitting the DICOM header and the image data would perhaps improve results slightly, but I haven't tried it so far.
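    The core of that split is just byte de-interleaving; a stripped-down sketch (raw planes only, no BMP header, so not literally the attached split.cpp):

    Code:
    // Sketch: split the 16-bit little-endian MR samples into two 8-bit planes
    // ("even" bytes first, then "odd" bytes). Raw output only - the attached
    // split.cpp plus a BMP header is what produced the results above.
    #include <cstdio>
    #include <vector>

    int main() {
        FILE* in = fopen("mr", "rb");
        if (!in) return 1;
        std::vector<unsigned char> data;
        for (int c; (c = fgetc(in)) != EOF; ) data.push_back((unsigned char)c);
        fclose(in);

        FILE* out = fopen("mr_split.raw", "wb");
        for (size_t i = 0; i < data.size(); i += 2) fputc(data[i], out);       // "even" bytes, upper half
        for (size_t i = 1; i < data.size(); i += 2) fputc(data[i], out);       // "odd" bytes, lower half
        fclose(out);
        return 0;
    }
    The reverse direction (join.cpp) just interleaves the two halves again.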

    Had no luck with other approaches, e.g. (simple) delta compression didn't work - the images are similar, but it seems there's too much noise.

    Not sure how to add those results to Matt's benchmark or whether they should be added at all. The benchmark is about compressing the mr file itself, not a filtered version. Perhaps creating (yet another) PAQ branch specialized on Silesia would be a good solution.
    Attached Files
    Last edited by schnaader; 28th April 2012 at 06:08. Reason: 19 grayscale pictures, not 38 - thanks Ethatron :)
    http://schnaader.info
    Damn kids. They're all alike.

  27. #27
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yes, this is exactly the kind of research I was hoping to encourage.

    I was looking at the file osdb. The Silesia docs say it comes from the file hundred.med from the OSDB benchmark. Actually, it is a binary version of the comma-separated text file asap.hundred from data-40mb at http://sourceforge.net/projects/osdb/files/
    The text version has 10 fields filled with random values of different types like this:

    Code:
    99347,86340864,174,-368686869.00,616161616,616161616,7/17/1944,FhXTub:ZQN,5mbIXWAmfRRmuV5:ef6y,8WJ
    27507,178321784,159,308080808.00,90909091,90909091,12/9/1990,oIPGrilOHc,t3FNs4t1EPl 2oifg1wf,ihZPaYgRcPtLx3
    21036,832348324,113,-136363636.00,-555555556,-555555556,12/27/1900,EGE5CsbnqQ,DTpl1xgzarZEV:7oxO2D,yU
    24466,281962820,101,-176767677.00,777777778,777777778,12/5/1908,Nnj.fJbN8Y,87:N7qBywYOxu pscbqM,822 aTeKPA08DTfyk1munPPNI6FZvfhb8TWoFxdJiG:ef28Fe6OBN675zVlVVXbDjvW3Q3Qr4WPFEYnf
    asap.report describes it thus:
    Code:
    Hundred Relation:
        Attribute  Dist.    Min          Max         # Uniques   Comments
        =========  =======  ===========  ==========  ==========  ========
        key        Uniform            0      100000      100000  '1' missing
        int        Uniform            0  1000000000      100000  '1' missing
        signed     Uniform          100         199         100
        float      Uniform   -500000000   500000000         100
        double     Uniform  -1000000000  1000000000         100
        decim      Uniform  -1000000000  1000000000         100  same as "double"
        date       Uniform     1/1/1900  12/31/1999       36524
        code       Uniform            1  1000000001      100000  min/max = rnd seed
        name       Uniform            1  1000000001         100  min/max = rnd seed
        address    Uniform            5         995         100  min/max = rnd seed
        no fill
    The binary version codes numbers as 4 or 8 bytes. Dates are strings. Record lengths are variable. Each record has a 6 byte header apparently giving the string lengths. I didn't completely analyze how to parse it.

    The OSDB data distribution omits the code that generated the data, but I found an old version here: http://www.koders.com/c/fid99EBA91CB...B1E.aspx?s=sql

    It differs in that it generates strings containing only digits and upper case letters. As you can see, the actual data contains lower case, ":", and spaces. Also, the date range is slightly different. The distribution in some cases is not random, but rather a permutation to avoid duplicates.

    It apparently calls a random number function ran1() which might be http://ciks.cbt.nist.gov/bentz/flyash/node14.html

  28. #28
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by schnaader View Post
    Experimented with filtering/converting the file MR. It's a DICOM file with some header bytes followed by 38 grayscale pictures (512*512, 16-bit).
    You meant 19. The file contains 10-bit greyscale values (479 distinct values from 0 to 992) in little-endian order.

    PW compresses the file to 1,533,621 without inter-frame compression. I couldn't try anyway, I haven't even looked at the animation-code for 5 years, LOL.

  29. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I calculated the entropy of osdb. It is 1910 KB if you assume a true random number generator. The best known compression is 2069 KB by paq8pxd_v4 -8.

    As I mentioned earlier, osdb is a synthetic database file. It is the binary conversion of asap.hundred from the OSDB benchmark. I could not find any documentation on the binary format so I reverse engineered it. The file consists of 100,000 records with 10 fields each. Each record begins on a 4 byte boundary and is padded with trailing 0 bytes as needed. Each record is preceded by a header in one of the two formats:

    1 0 length fmt1 fmt0 (if length = 1 (mod 4))
    3 0 length pad fmt1 fmt0 (otherwise)

    pad is 0, 1, or 2 as required to make the record length including header a multiple of 4. The length byte is the number of bytes that follow, including the rest of the header but not including padding 0 bytes. I believe the second byte (0) is the high byte of the 16 bit length, but it is always 0. fmt1, fmt0 is a 16 bit format as follows:

    011000xx 000000x0

    The 3 x bits indicate the format of the key, int, and address fields. The 10 fields are as follows:

    key: 4 byte integer, range 0,2..99999 (LSB first).
    int: 4 byte integer, range 0,2..1000000000.
    signed: 4 byte integer, range 100..199.
    float: 4 byte IEEE float, range -5e8..5e8.
    double: 4 byte IEEE float, range -1e9..1e9.
    decim: variable string, range "-1000000000.00"..."1000000000.00".
    date: variable string, range "1/1/1900"..."12/31/1999".
    code: 10 byte string.
    name: 20 byte string.
    address: 80 byte or variable length string.

    Bits 9, 8, and 1 of the fmt1-fmt0 are usually 0, 0, 1, respectively. If bit 9 is 1 then the key is omitted and assumed 0. If bit 8 is 1 then int is omitted and assumed 0. These happen only once each out of 100000 records. If bit 1 is 0 then address has a length of 80. This happens 1000 times. Otherwise it is a variable length string. Variable length strings are preceded by a byte that gives the number of bytes to follow. Integers and floats are 32 bits, LSB first (Intel x86 format).
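    To make the header layout concrete, here is a sketch that only walks the record boundaries according to the description above (it assumes the records start at offset 0 and skips the fmt bits and field decoding entirely, so take it as illustration, not a reference parser):

    Code:
    // Sketch: walk the osdb record headers. The length byte counts everything after
    // itself except the trailing 0 padding, and each record starts on a 4-byte boundary.
    #include <cstdio>
    #include <vector>

    int main() {
        FILE* f = fopen("osdb", "rb");
        if (!f) return 1;
        std::vector<unsigned char> d;
        for (int c; (c = fgetc(f)) != EOF; ) d.push_back((unsigned char)c);
        fclose(f);

        size_t pos = 0, records = 0;
        while (pos + 3 <= d.size() && (d[pos] == 1 || d[pos] == 3)) {
            size_t len  = d[pos + 2];            // bytes that follow the length byte (no padding)
            size_t next = pos + 3 + len;         // end of record before padding
            next = (next + 3) & ~(size_t)3;      // realign to the next 4-byte boundary
            pos = next;
            ++records;
        }
        printf("parsed %zu records (expect 100000), stopped at offset %zu of %zu\n",
               records, pos, d.size());
        return 0;
    }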

    Each key is unique. There are 100000 possible keys. Thus, the entropy is log2(100000!) = 1,516,704 bits.

    Each int is unique. Since there would be only a few expected collisions, these can be modeled as random, giving 100000 * log2(1000000000) = 2,989,735 bits. An exact calculation with the uniqueness constraint is log2(1e9! / (1e9-1e5)!) which is 7 bits less.

    There are 100 possible values of signed, float, double, decim, name, and address. Each appears exactly 1000 times. The double and decim fields always have the same value. However the other 5 fields are independent. Thus, the entropy is 5 log2(100000! / 1000!^100) = 5 * 663764 = 3,318,820 bits.

    Each date appears either 2 or 3 times. There are 36524 days from 1/1/1900 to 12/31/1999 (in month/day/year format) including leap years 1904..1996 every 4 years. Thus, 9572 dates appear twice and 26952 dates appear 3 times. The two sets are not random. Dates occurring twice are spaced every 36524/9572 = 3.8157 days. Thus, the entropy is log2(100000!/(2^9572 6^26952)) = 1,437,462 bits.

    Codes are 10 bytes chosen randomly from a 64 character alphabet [A-Za-z0-9.:]. Thus, each field is 60 bits, or 6,000,000 bits total.

    99 of the 100 names are 20 bytes chosen randomly from a 64 character alphabet like codes except that it uses spaces instead of periods. One name is "THE+ASAP+BENCHMARKS+". Thus, the entropy is 20*100*6 = 12000 bits.

    99 of the 100 address are chosen from the same alphabet as names. The other is "SILICON VALLEY". The length varies from 2 to 80. Only one address has length 80. The average length is 19.99 bytes for a total of 11994 bits.

    The 100 values for float, double, and decim are uniformly distributed at intervals of 1/99 of their range and rounded to the nearest integer. Thus, no additional information is needed to represent them. Note that double is not really a double, but a 32 bit float. A 32 bit float does not have enough precision to represent the values to within the nearest integer as it would appear in asap.hundred. Furthermore, the decim values are string representations of the exact integer value followed by ".00". Thus, when the number 90909091 appears in the double and decim fields of asap.hundred, the representation in osdb is 90909088.0 (as a 32 bit float) and "90909091.00" (as a string of length 11).

    Thus, the total entropy of osdb is 15,286,715 bits = 1,910,839 bytes.
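    For anyone who wants to re-check the arithmetic, the terms can be re-evaluated with lgamma(). This little sketch just mirrors the calculation above (the name and address terms are taken as stated rather than re-derived):

    Code:
    // Sketch: re-evaluate the osdb entropy terms from the analysis above.
    #include <cmath>
    #include <cstdio>

    static double log2fact(double n) { return lgamma(n + 1.0) / log(2.0); }  // log2(n!)

    int main() {
        double key   = log2fact(100000);                               // permutation of unique keys
        double ints  = 100000 * (log(1e9) / log(2.0));                 // ~unique, modeled as random
        double five  = 5 * (log2fact(100000) - 100 * log2fact(1000));  // 5 independent 100-value fields
        double dates = log2fact(100000) - 9572 * 1.0 - 26952 * (log(6.0) / log(2.0));
        double codes = 100000.0 * 10 * 6;                              // 10 chars, 64-symbol alphabet
        double names = 12000, addrs = 11994;                           // as stated in the post
        double total = key + ints + five + dates + codes + names + addrs;
        printf("total: %.0f bits = %.0f bytes\n", total, total / 8);   // ~15.29 Mbits, ~1.91 MB
        return 0;
    }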
    Last edited by Matt Mahoney; 2nd May 2012 at 00:57. Reason: "length = 3 (mod 4)" -> "1 (mod 4)"

  30. #30
    Member
    Join Date
    Aug 2011
    Location
    US
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Impressive!
    Who can argue with that?

