Results 1 to 4 of 4

Thread: Compression test file generator

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

    Compression test file generator

    I wrote a program for generating test files for testing data compression programs. The advantage of this approach is that the exact entropy is known, and therefore it is easy to find types of data where the compressor does poorly. For example, any compressor should compress a file of all zero bytes to be very small, and should not expand random data. But there are harder cases too. For example, if each byte is generated by randomly adding 1 to the previous byte (or n bytes back), then ideally the file should compress to 1 bpc. As an intermediate case, we expect to be able to compress order-n data from a random vocabulary. We expect that 2 copies of a random string should compress as small as 1.

    The tcgen v1.00 program (GPL, http://mattmahoney.net/dc/tcgen100.zip ) creates files like this. It writes sets of data consisting of R blocks of N words each of length W, where words are chosen from a vocabulary of size V (initialized randomly at the start of each block) from an alphabet of size B in the range C to C+B-1. Data can also be delta coded by adding the D'th previous output byte to the current byte. The program takes a command string containing patterns like "n100v4" meaning set N=100, V=4, then generates a file of the same name so that the name exactly specifies its contents. The output is deterministic so that tests can be repeated. It uses RC4 seeded by the command string as its random number generator. The program calculates the entropy, which is log(B^W choose Vused) + N*log(V) per block, where Vused is the number of words from the vocabulary written at least once.

    I ran some tests using compress, zip, bzip2, 7zip, ppmd, bsc, zp -m3, and paq8px_v69 -6. Results are below. Some interesting findings.

    - bsc v2.50 crashes compressing an empty file.

    - ppmd is worst on random data, with 1.8% expansion. compress (LZW) does best because it refuses to compress if the output would be larger.

    - compress, zip, and bsc completely fail to compress 10 copies of a 100K byte random string. However bsc does best (beating paq8px by a factor of 4) if the characters are restricted to '0' and '1'.

    - There are many cases where bsc or 7zip beat paq8px on moderate or high order data.

    - paq8px is the only program that handles delta coded bytes with a stride greater than 1 well.

    In the file names below,

    Code:
    Tested:
      compress
      zip 2.32
      bzip2 1.0.3
      7zip 9.20
      ppmd J
      bsc 2.5.0
      zp 1.0.3 -m3 -b1000 (mid.cfg)
      paq8px_v69 -6
    
    Test files generated with tcgen v1.00
    http://mattmahoney.net/dc/tcgen100.zip
    tcgen . filename
    
    Command (filename) may contain the following patterns where # is a number:
      r# = set number R of blocks to write, default r1
      n# = set number N of words per block to write, default n1
      k# = set N = # * 1000
      m# = set N = # * 1000000
      w# = set word length W, default w1
      b# = set base (alpabet size) B, range 1..256, default b256
      c# = set first character of alphabet C, range 0..255, default c0
      v# = set vocabulary size V, range 1..B^W, default v1
      d# = set delta offset to D bytes, default d0 = no delta coding
      - or end of command = write R blocks of N words of W bytes each from
    a random vocabulary of V different words each block using bytes in the
    range C..C+B-1. If D > 0 then add the D'th previous output byte.
    
    Empty file, H = 0
            0 n0.bsc (crashed)
            0 n0
            3 n0.Z
           14 n0.bz2
           18 n0.paq8px -6
           23 n0.pmd
           72 n0.7z
          126 n0.zpaq
          244 n0.zip
    
    1M zero bytes, H = 0
           48 m1b1.bz2
          124 m1b1.bsc
          201 m1b1.zpaq
          218 m1b1.pmd
          317 m1b1.7z
          385 m1b1.paq8px
        1,233 m1b1.zip
        1,820 m1b1.Z
    1,000,000 m1b1
    
    1M ascending byte sequence, H = 0
          491 m1c1b1d1.bsc
          491 m1c1b1d1.paq8px
          524 m1c1b1d1.zpaq
          542 m1c1b1d1.pmd
          559 m1c1b1d1.7z
        1,991 m1c1b1d1.bz2
        4,456 m1c1b1d1.zip
       38,351 m1c1b1d1.Z
    1,000,000 m1c1b1d1
    
    500K ascending and 500K descending sequence, H = 0
          585 k500c1b1d1-c255.paq8px
          616 k500c1b1d1-c255.bsc
          716 k500c1b1d1-c255.zpaq
          763 k500c1b1d1-c255.7z
          830 k500c1b1d1-c255.pmd
        2,610 k500c1b1d1-c255.bz2
        4,759 k500c1b1d1-c255.zip
       55,687 k500c1b1d1-c255.Z
    1,000,000 k500c1b1d1-c255
    
    1000 strings, each 1000 copies of random byte, H = 1000
        2,012 r1000k1.bsc
        2,689 r1000k1.paq8px
        2,887 r1000k1.bz2
        3,126 r1000k1.7z
        3,357 r1000k1.pmd
        3,595 r1000k1.zpaq
        3,666 r1000k1.zip
       37,991 r1000k1.Z
    1,000,000 r1000k1
    
    1M random bytes, H = 1000000
    1,000,000 m1v256
    1,000,000 m1v256 (.Z does not compress)
    1,000,042 m1v256.bsc
    1,000,407 m1v256.zip
    1,000,516 m1v256.zpaq
    1,000,757 m1v256.paq8px
    1,004,492 m1v256.bz2
    1,013,715 m1v256.7z
    1,018,000 m1v256.pmd
    
    1M random from a 16 character alphabet, H = 500010.38
      500,983 m1v16.paq8px
      501,884 m1v16.bsc
      505,543 m1v16.zpaq
      508,818 m1v16.bz2
      523,360 m1v16.pmd
      537,764 m1v16.7z
      570,364 m1v16.zip
      577,911 m1v16.Z
    1,000,000 m1v16
    
    1M random bit string from '0' and '1' (text), H = 125000.00
      125,668 m1v2c48b2.bsc
      125,802 m1v2c48b2.paq8px
      126,142 m1v2c48b2.zpaq
      127,446 m1v2c48b2.pmd
      137,497 m1v2c48b2.7z
      142,554 m1v2c48b2.Z
      159,304 m1v2c48b2.zip
      160,113 m1v2c48b2.bz2
    1,000,000 m1v2c48b2
    
    1M random from a 16 char alphabet changed every 100K, H = 500103.82
      502,141 r10k100v16.paq8px
      509,134 r10k100v16.zpaq
      511,956 r10k100v16.bsc
      519,159 r10k100v16.bz2
      529,131 r10k100v16.pmd
      553,357 r10k100v16.7z
      586,142 r10k100v16.zip
      629,863 r10k100v16.Z
    1,000,000 r10k100v16
    
    10 copies of a 100K random string, H = 100000
      100,455 n10w100000.paq8px
      101,529 n10w100000.zpaq
      101,730 n10w100000.7z
      102,271 n10w100000.pmd
      276,494 n10w100000.bz2
    1,000,000 n10w100000 (.Z)
    1,000,000 n10w100000
    1,000,042 n10w100000.bsc
    1,000,415 n10w100000.zip
    
    10 copies of a 100K random bit string (text), H = 12500
       13,390 n10w100000c48b2.bsc
       14,499 n10w100000c48b2.7z
       49,289 n10w100000c48b2.paq8px
       56,784 n10w100000c48b2.bz2
      126,108 n10w100000c48b2.zpaq
      127,453 n10w100000c48b2.pmd
      141,838 n10w100000c48b2.Z
      158,967 n10w100000c48b2.zip
    1,000,000 n10w100000c48b2
    
    100 copies of a 10K random string, H = 10000
       10,405 n100w10000.bsc
       10,411 n100w10000.paq8px
       10,456 n100w10000.7z
       10,515 n100w10000.pmd
       11,133 n100w10000.zpaq
       17,884 n100w10000.zip
       41,940 n100w10000.bz2
      531,115 n100w10000.Z
    1,000,000 n100w10000
    
    100 copies of a 10K random bit string (text), H = 1250.00
        1,955 n100w10000c48b2.7z
        2,138 n100w10000c48b2.paq8px
        2,312 n100w10000c48b2.bsc
       10,615 n100w10000c48b2.bz2
      125,251 n100w10000c48b2.zpaq
      127,337 n100w10000c48b2.pmd
      129,260 n100w10000c48b2.Z
      158,581 n100w10000c48b2.zip
    1,000,000 n100w10000c48b2
    
    1000 copies of a 1K random byte string, H = 1000
        1,328 k1w1000.bsc
        1,329 k1w1000.pmd
        1,347 k1w1000.7z
        1,397 k1w1000.paq8px
        1,544 k1w1000.zpaq
        5,606 k1w1000.bz2
        6,285 k1w1000.zip
       80,529 k1w1000.Z
    1,000,000 k1w1000
    
    1M order 2, 16 word vocabulary, H = 250026.47
      250,977 k500w2v16.paq8px
      252,352 k500w2v16.bsc
      254,108 k500w2v16.pmd
      255,212 k500w2v16.zpaq
      259,356 k500w2v16.bz2
      279,791 k500w2v16.7z
      318,553 k500w2v16.zip
      338,903 k500w2v16.Z
    1,000,000 k500w2v16
    
    1M order 2, 256 word vocabulary, H = 500301.41
      503,405 k500w2v256.paq8px
      508,947 k500w2v256.bsc
      524,223 k500w2v256.zpaq
      532,814 k500w2v256.bz2
      541,083 k500w2v256.pmd
      560,785 k500w2v256.7z
      831,342 k500w2v256.zip
      864,093 k500w2v256.Z
    1,000,000 k500w2v256
    
    1M order 2, 256 word vocabulary changed every 100K, H = 503014.11
      515,938 r10k50w2v256.paq8px
      607,516 r10k50w2v256.zpaq
      646,536 r10k50w2v256.7z
      662,588 r10k50w2v256.pmd
      721,699 r10k50w2v256.bz2
      732,699 r10k50w2v256.bsc
      852,853 r10k50w2v256.zip
      907,517 r10k50w2v256.Z
    1,000,000 r10k50w2v256
    
    1M order 3, 256 word vocabulary, H = 333890.50
      339,222 n333333w3v256.paq8px
      341,961 n333333w3v256.bsc
      358,170 n333333w3v256.bz2
      360,811 n333333w3v256.zpaq
      362,294 n333333w3v256.pmd
      405,950 n333333w3v256.7z
      540,310 n333333w3v256.zip
      623,516 n333333w3v256.Z
      999,999 n333333w3v256
    
    1M order 4, 256 word vocabulary, H = 250813.50
      254,814 k250w4v256.paq8px
      259,252 k250w4v256.bsc
      275,731 k250w4v256.bz2
      276,256 k250w4v256.pmd
      279,022 k250w4v256.zpaq
      280,046 k250w4v256.7z
      401,966 k250w4v256.zip
      478,230 k250w4v256.Z
    1,000,000 k250w4v256
    
    1M order 4, 256 word vocabulary changed every 100K, H = 258135.00
      289,355 r10k25w4v256.paq8px
      304,353 r10k25w4v256.7z
      366,010 r10k25w4v256.zpaq
      373,536 r10k25w4v256.pmd
      408,856 r10k25w4v256.bz2
      410,327 r10k25w4v256.bsc
      414,017 r10k25w4v256.zip
      502,398 r10k25w4v256.Z
    1,000,000 r10k25w4v256
    
    1M order 4, 4K vocabulary, H = 385977.74
      425,489 k250v4096w4.paq8px
      437,865 k250v4096w4.7z
      462,377 k250v4096w4.bsc
      489,689 k250v4096w4.pmd
      505,941 k250v4096w4.zpaq
      511,569 k250v4096w4.bz2
      591,265 k250v4096w4.zip
      715,307 k250v4096w4.Z
    1,000,000 k250v4096w4
    
    1M order 8, word vocabulary = {'0','1'}, H = 15626.87
       15,973 k125v2w8c48b2.bsc
       16,258 k125v2w8c48b2.paq8px
       19,592 k125v2w8c48b2.bz2
       20,746 k125v2w8c48b2.zpaq
       25,117 k125v2w8c48b2.7z
       36,575 k125v2w8c48b2.zip
       37,213 k125v2w8c48b2.Z
       62,553 k125v2w8c48b2.pmd
    1,000,000 k125v2w8c48b2
    
    1M order 8, 16 word vocabulary, H = 62622.47
       63,385 k125v16w8.paq8px
       63,662 k125v16w8.pmd
       63,700 k125v16w8.bsc
       65,068 k125v16w8.bz2
       66,023 k125v16w8.zpaq
       76,191 k125v16w8.7z
      105,966 k125v16w8.zip
      140,666 k125v16w8.Z
    1,000,000 k125v16w8
    
    1M order 8, 256 word vocabulary, H = 126837.50
      130,521 k125v256w8.paq8px
      136,654 k125v256w8.bsc
      143,786 k125v256w8.pmd
      149,974 k125v256w8.bz2
      154,135 k125v256w8.7z
      157,168 k125v256w8.zpaq
      216,177 k125v256w8.zip
      267,887 k125v256w8.Z
    1,000,000 k125v256w8
    
    1M order 8, 4K word vocabulary, H = 214861.74
      241,388 k125v4096w8.7z
      254,859 k125v4096w8.paq8px
      266,651 k125v4096w8.pmd
      287,925 k125v4096w8.bsc
      313,131 k125v4096w8.zpaq
      350,276 k125v4096w8.bz2
      565,867 k125v4096w8.zip
    1,000,000 k125v4096w8 (.Z)
    1,000,000 k125v4096w8
    
    1M order 8, 64K word vocabulary, H = 596588.53
      610,864 k125v65536w8.7z
      646,854 k125v65536w8.paq8px
      718,831 k125v65536w8.zpaq
      739,093 k125v65536w8.pmd
      739,821 k125v65536w8.bsc
      792,213 k125v65536w8.bz2
      965,529 k125v65536w8.zip
    1,000,000 k125v65536w8 (.Z)
    1,000,000 k125v65536w8
    
    1M order, 256 word vocabulary of 20 bit strings (text), H = 50429.49
       80,260 k50w20v256b2c48.bsc
       85,806 k50w20v256b2c48.paq8px
       90,247 k50w20v256b2c48.7z
       96,818 k50w20v256b2c48.bz2
      114,179 k50w20v256b2c48.Z
      125,606 k50w20v256b2c48.zpaq
      127,112 k50w20v256b2c48.pmd
      153,765 k50w20v256b2c48.zip
    1,000,000 k50w20v256b2c48
    
    1M random from alphabet -1, 0, 1, H = 198120.31
      198,888 m1c255b3v3.paq8px
      199,244 m1c255b3v3.bsc
      199,937 m1c255b3v3.zpaq
      200,777 m1c255b3v3.pmd
      218,877 m1c255b3v3.7z
      221,221 m1c255b3v3.Z
      224,756 m1c255b3v3.bz2
      237,717 m1c255b3v3.zip
    1,000,000 m1c255b3v3
    
    1M from -1, 0, 1, but delta coded, H = 198120.31
      200,723 m1c255b3v3d1.paq8px
      200,862 m1c255b3v3d1.bsc
      201,986 m1c255b3v3d1.zpaq
      208,400 m1c255b3v3d1.pmd
      239,603 m1c255b3v3d1.bz2
      306,414 m1c255b3v3d1.7z
      408,380 m1c255b3v3d1.zip
      419,085 m1c255b3v3d1.Z
    1,000,000 m1c255b3v3d1
    
    1M from -1, 0, 1, delta coded with stride 2, H = 198120.31
      204,053 m1c255b3v3d2.paq8px
      248,010 m1c255b3v3d2.bsc
      268,766 m1c255b3v3d2.zpaq
      301,676 m1c255b3v3d2.bz2
      435,714 m1c255b3v3d2.7z
      475,220 m1c255b3v3d2.pmd
      676,657 m1c255b3v3d2.zip
    1,000,000 m1c255b3v3d2 (.Z)
    1,000,000 m1c255b3v3d2
    
    1M from -1, 0, 1, delta coded with stride 4, H = 198120.31
      204,178 m1c255b3v3d4.paq8px
      424,443 m1c255b3v3d4.zpaq
      470,555 m1c255b3v3d4.7z
      780,024 m1c255b3v3d4.zip
      863,153 m1c255b3v3d4.bz2
      865,548 m1c255b3v3d4.bsc
      921,900 m1c255b3v3d4.pmd
    1,000,000 m1c255b3v3d4 (.Z)
    1,000,000 m1c255b3v3d4
    
    1M from -1, 0, 1, delta coded with stride 1000, H = 198120.31
      300,688 m1c255b3v3d1000.paq8px
      504,378 m1c255b3v3d1000.7z
      714,266 m1c255b3v3d1000.zpaq
      735,316 m1c255b3v3d1000.zip
      757,259 m1c255b3v3d1000.bsc
      759,987 m1c255b3v3d1000.pmd
      778,823 m1c255b3v3d1000.bz2
      900,645 m1c255b3v3d1000.Z
    1,000,000 m1c255b3v3d1000

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post


    ....

    - ppmd is worst on random data, with 1.8% expansion. compress (LZW) does best because it refuses to compress if the output would be larger.

    .....

    I must have missed something here. Since compression is at best a swapping short files for long files. A good compressor has to make random data expand. Of course you could have a control bit or byte in front of output file to say you really using the full ppmd or not but while this might limit expansion to one byte. It would also lengthen the output for files that the method would otherwise been able to shorten one byte more. If one is always using say ppmd for files that are known to compress way waste the space in output file to tell if compressed or not.

    Also basic LZW not that smart it would expand real random files. Especailly long files since with traditional LZW there can be more than one file that decompresses to same file that you tried to compress. So LZW tends to waste some output space thus making most random files compress to a longer size.

  3. #3
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Well I looked farther at your results. I am sorry I was thrown off by your word "best" I didn't take that as meaning failure. I should have read more closely some of your tables to try to guess what you meant. It just seems strange to call LZW best when it fails to produce an output file any of the programs could have added that feature for FREE. If they had only know that that feature made them better.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    What I mean is that compress checks if the output is larger, and if so it does not compress. None of the other programs will check, but since the output is not much larger, there isn't much need to.

    The idea is that compression algorithms should always be tested for simple cases like this to see if it produces a sensible output.

Similar Threads

  1. Test data generator
    By Lasse Reinhold in forum Data Compression
    Replies: 1
    Last Post: 28th February 2010, 14:53
  2. video compression (test)
    By Lone_Wolf in forum Data Compression
    Replies: 42
    Last Post: 14th January 2010, 23:50
  3. JPEG Compression Test [December 2009]
    By Skymmer in forum Data Compression
    Replies: 9
    Last Post: 23rd December 2009, 21:06
  4. Searching for special file generator
    By nimdamsk in forum Data Compression
    Replies: 5
    Last Post: 19th March 2009, 02:33
  5. Audio test file
    By Mihai Cartoaje in forum Download Area
    Replies: 1
    Last Post: 7th January 2009, 10:47

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •