Thread: How to detect compression of file

  1. #1
    Member
    Join Date: Sep 2011
    Location: Yan
    Posts: 10
    Thanks: 0
    Thanked 0 Times in 0 Posts

    How to detect compression of file

    Hi all encode.su members,

    So this is my question: how do I detect whether a file has compressed data inside it? For example, for a file like "1.ff", how can I tell whether it contains a zlib stream or not? What tool can I use to find that compression?
    I ask because I want to use Precomp to decompress such a stream and recompress it with another compressor, without having to test the files one by one to find a stream inside. Maybe my question is a bit confusing, but if anyone understands it, please reply.

    BTW, I'm just a newbie who wants to learn about compression, so I hope someone can advise me if I get something wrong.

  2. #2
    Expert Matt Mahoney
    Join Date: May 2008
    Location: Melbourne, Florida, USA
    Posts: 3,257
    Thanks: 307
    Thanked 797 Times in 489 Posts
    A simple test for incompressible data is to count byte predictions in an order-1 model. If the data is random, the predictions will be right about 1/256 of the time. On most other data, the hit rate will be higher.

    Code:
    /* t[c1] is the order-1 prediction: the byte last seen after byte c1. */
    int c, c1=0, hits=0, total=0, t[256]={0};
    while ((c=getc(in))!=EOF) {   /* in is an open FILE* */
      ++total;
      if (t[c1]==c) ++hits;       /* prediction was correct */
      t[c1]=c;                    /* remember what followed c1 this time */
      c1=c;
    }
    Then test the ratio hits/total.
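
    For illustration, here is a minimal complete program built around that snippet (my own sketch, not zpaq code); the 1/32 threshold is only an example, and post #4 below gives the thresholds zpaq actually uses.

    Code:
    #include <stdio.h>

    /* Returns the order-1 hit rate of a file, as in the snippet above. */
    double hit_rate(FILE *in) {
      int c, c1 = 0, hits = 0, total = 0, t[256] = {0};
      while ((c = getc(in)) != EOF) {
        ++total;
        if (t[c1] == c) ++hits;
        t[c1] = c;
        c1 = c;
      }
      return total ? (double)hits / total : 0.0;
    }

    int main(int argc, char **argv) {
      if (argc < 2) return 1;
      FILE *in = fopen(argv[1], "rb");
      if (!in) return 1;
      double r = hit_rate(in);
      printf("%s: hit rate %.5f -> %s\n", argv[1], r,
             r < 1.0/32 ? "looks already compressed" : "looks compressible");
      fclose(in);
      return 0;
    }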

  3. #3
    Member
    Join Date: Jun 2008
    Location: G
    Posts: 377
    Thanks: 26
    Thanked 23 Times in 16 Posts
    Regarding t[c1] = c: should this not be t[c1] = hits?

  4. #4
    Expert Matt Mahoney
    Join Date: May 2008
    Location: Melbourne, Florida, USA
    Posts: 3,257
    Thanks: 307
    Thanked 797 Times in 489 Posts
    t[] is the order-1 prediction table. c1 and c are the previous and current bytes. t[c1] should predict the byte that comes after c1. We count correct predictions (hits) when the prediction matches c.

    This is the method I use in zpaq to detect incompressible data when deciding whether to compress or store it. I use thresholds of 1/16, 1/32, 1/64, and 1/128 for -method 1 through 4 respectively. Sometimes it is surprising: you find zip, jpg, and other compressed files that are more predictable than they ought to be.

    I also use predictions to compute the rolling hash that decides fragment boundaries for deduplication. The hash depends on the last 32 bytes that are incorrectly predicted plus any bytes in between. The reason is that when the data is highly predictable (like all 0 bytes) you don't want to be computing the same hash at every point. I use a hash like h=(h+c+1)*M, where h is 32 bits; if c is predicted then M is odd, and if c is not predicted then M is even but not a multiple of 4. The actual constants are 314159265 and 271828182.
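
    For concreteness, here is a small sketch of that hash update as I read it (not zpaq's actual source); "predicted" means t[c1]==c with the same order-1 table as in post #2.

    Code:
    #include <stdint.h>

    /* One step of the prediction-driven rolling hash described above. */
    uint32_t roll_hash(uint32_t h, int c, int predicted) {
      /* A predicted byte uses the odd multiplier, which loses no bits of h.
         A mispredicted byte uses the even multiplier (2 * odd), which drops
         one high bit, so after 32 mispredicted bytes the older history no
         longer affects the hash. */
      uint32_t M = predicted ? 314159265u : 271828182u;
      return (h + (uint32_t)c + 1u) * M;
    }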

  5. #5
    Programmer Bulat Ziganshin
    Join Date: Mar 2007
    Location: Uzbekistan
    Posts: 4,572
    Thanks: 780
    Thanked 687 Times in 372 Posts
    Very interesting. I bet your method of detecting incompressible data is more efficient than mine.

    For the rolling hash, I use a simple polynomial hash: h = a[0] + K*a[1] + K*K*a[2] + ... Updating it needs just 1-2 multiplies per byte plus a few additions, and is probably faster than any predictability check. OTOH, if you need to check predictability anyway, your method may be more efficient.
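
    For illustration, a sketch of a windowed polynomial hash of this kind (the window size and the constant K are made up, not the constants actually used):

    Code:
    #include <stdint.h>

    /* h = a[n-1] + K*a[n-2] + ... + K^(W-1)*a[n-W]  (mod 2^32), over the
       last W bytes.  One multiply to drop the outgoing byte and one to
       shift in the new byte, matching "1-2 multiplies per byte" above. */
    enum { W = 48 };                     /* illustrative window size */
    static const uint32_t K = 153191u;   /* illustrative odd constant */

    uint32_t roll(uint32_t h, uint8_t in, uint8_t out, uint32_t K_W1) {
      /* K_W1 = K^(W-1) mod 2^32, precomputed once. */
      h -= (uint32_t)out * K_W1;         /* remove the byte leaving the window */
      return h * K + in;                 /* scale the rest and add the new byte */
    }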

  6. #6
    Expert Matt Mahoney
    Join Date: May 2008
    Location: Melbourne, Florida, USA
    Posts: 3,257
    Thanks: 307
    Thanked 797 Times in 489 Posts
    I think your rolling hash looks just like mine except that I add 1. Detecting incompressible data is not perfect. For example, the repeating sequence AABBAABB... would appear incompressible: the last-seen prediction in each context is always one step behind (after an A the model predicts whichever of A or B followed last time, which is always the wrong one), so the hit rate stays near zero even though the data is trivially compressible.

  7. #7
    Member
    Join Date: Sep 2012
    Location: UK
    Posts: 11
    Thanks: 0
    Thanked 0 Times in 0 Posts
    I am interested in your answer but don't understand it fully.
    How do I calculate t[], the order-1 prediction table?
    I have (for a 256-character alphabet) a 256 x 256 byte array, where element (i,j) holds the frequency in my sample of character j following character i.
    Can I calculate t[] from this?

  8. #8
    Expert Matt Mahoney
    Join Date: May 2008
    Location: Melbourne, Florida, USA
    Posts: 3,257
    Thanks: 307
    Thanked 797 Times in 489 Posts
    I don't use a 256x256 frequency table. I use a 256 byte array to predict the next byte from a 1 byte context. The prediction is whatever byte occurred in that context the last time it appeared.
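
    If you did want to derive a prediction table from a 256 x 256 frequency matrix as in post #7, a sketch would take the most frequent successor per row. Note this is an alternative, not what zpaq does:

    Code:
    #include <stdint.h>

    /* Build t[] from freq[i][j] = count of byte j following byte i:
       predict the successor seen most often in each context. */
    void build_table(uint32_t freq[256][256], int t[256]) {
      for (int i = 0; i < 256; ++i) {
        int best = 0;
        for (int j = 1; j < 256; ++j)
          if (freq[i][j] > freq[i][best]) best = j;
        t[i] = best;
      }
    }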

  9. #9
    Administrator Shelwien
    Join Date: May 2008
    Location: Kharkov, Ukraine
    Posts: 3,972
    Thanks: 296
    Thanked 1,299 Times in 736 Posts
    I'm using this in reflate - http://nishi.dreamhosters.com/u/entropy_v0.rar

    Code:
      LOG2.Init();
      ENT.Init();
      while(1) {
        int c = getc(f); if( c<0 ) break;
        fprintf(g, "%06X %06X %02X %c\n", ENT.Eval(), ENT.bpc(), c, (c>=' ')?c:'.' );
        ENT.Update(c);
      }
    It's a rolling order-0 model which estimates the rc (range coder) codelength for a window.
    Eval() is a precise estimation which multiplies the codelength by freqs etc.,
    and bpc() is the incremental one.
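
    For readers without the archive: a sketch of the underlying idea only (not the actual ENT class) is to sum the order-0 Shannon codelength of the byte counts in the window.

    Code:
    #include <math.h>
    #include <stdint.h>

    /* Order-0 codelength, in bits, of a window with byte counts cnt[]
       and cnt[0]+...+cnt[255] == total. */
    double order0_bits(const uint32_t cnt[256], uint32_t total) {
      double bits = 0.0;
      for (int i = 0; i < 256; ++i)
        if (cnt[i])
          bits += cnt[i] * log2((double)total / cnt[i]);
      return bits;   /* near 8*total means the window looks incompressible */
    }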

  10. #10
    Member
    Join Date: Nov 2012
    Location: Bangalore
    Posts: 114
    Thanks: 9
    Thanked 46 Times in 31 Posts
    Quote Originally Posted by Matt Mahoney View Post
    ...I also use predictions to compute the rolling hash to decide fragment boundaries for deduplication. ...
    Interesting technique. In Pcompress I am using a Rabin-style hash with a table lookup and XOR to avoid doing a modulus. It requires one multiply, a few adds and one bitwise "and" per byte. The table lookup and XOR happen only once fragments cross 4K in size and testing for breakpoints starts. I am using minimum and maximum size limits to avoid too-small and too-large fragments. One advantage is that if there is a long sequence of all 0 bytes it will produce a series of minimum-sized fragments, which will all be deduplicated to a single minimum-sized one. I have done some analysis of the chunking process here and here.
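
    For illustration, a sketch of that breakpoint logic with size limits (the constants and the mask test are made up, not the actual Pcompress values):

    Code:
    #include <stdint.h>

    #define MIN_FRAG (4u * 1024)    /* no breakpoint tests below this size */
    #define MAX_FRAG (64u * 1024)   /* force a boundary at this size */
    #define MASK     0x1FFFu        /* ~8 KiB average spacing of matches */

    /* h is the current rolling hash, frag_len the bytes in the fragment so far. */
    int is_breakpoint(uint64_t h, uint32_t frag_len) {
      if (frag_len < MIN_FRAG) return 0;    /* too small: keep growing */
      if (frag_len >= MAX_FRAG) return 1;   /* too large: cut unconditionally */
      return (h & MASK) == MASK;            /* content-defined cut point */
    }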

  11. #11
    Expert Matt Mahoney
    Join Date: May 2008
    Location: Melbourne, Florida, USA
    Posts: 3,257
    Thanks: 307
    Thanked 797 Times in 489 Posts
    A Rabin hash seemed too complicated for what I wanted to do. But I did an experiment on zero bytes and found that the chunk size is 41343 bytes. Not sure it makes a big difference. I bound the fragment size between 4 KiB and 508 KiB and set a fragment boundary when the high 16 bits of the hash are 0, to give an average size of 64 KiB. The hash seems to work well, rarely giving very large or small fragment sizes. I am using large chunks so that I can keep the index in memory. It uses about 1 GB of memory for a 1 TB archive.
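
    In code, my reading of that boundary rule (not zpaq's actual source) is the condition below; it is consistent with the memory figure: at ~64 KiB per fragment a 1 TB archive has about 16M fragments, so roughly 64 bytes of index per fragment gives the ~1 GB mentioned.

    Code:
    #include <stdint.h>

    /* Boundary when the high 16 bits of the 32-bit hash are zero (expected
       about once per 2^16 = 64 KiB), with fragments bounded to 4 KiB..508 KiB. */
    int cut_here(uint32_t h, uint32_t frag_len) {
      return (frag_len >= 4u * 1024 && (h >> 16) == 0) || frag_len >= 508u * 1024;
    }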

