# Thread: How to detect compression of file

1. ## How to detect compression of file

Hi all encode.su members,

So this is my question: how do I detect whether a file has compressed data inside it? For example, given a file "1.ff", how can I tell whether it contains a zLib stream or not? What tool can I use to find that out?
I ask because I want to use precomp to decompress such streams and recompress them with another compressor, without having to test the files one by one to find a stream inside. Maybe my question is a bit confusing, but if anyone understands it, please reply.

BTW, I'm just a newbie who wants to learn about compression, so I hope someone will correct me if I get something wrong.

2. A simple test for uncompressible data is to count byte predictions in an order 1 model. If the data is random, then predictions will be right 1/256 of the time. On most other data, it will be higher.

Code:
```
#include <stdio.h>

// t[x] = byte that followed context byte x last time (order 1 prediction)
int c, c1 = 0, hits = 0, total = 0, t[256] = {0};
while ((c = getc(in)) != EOF) {   // 'in' is the FILE* under test
    ++total;
    if (t[c1] == c) ++hits;       // correct prediction
    t[c1] = c;                    // update the prediction for this context
    c1 = c;
}
```
Then test the ratio hits/total.

3. t[c1] = c: shouldn't this be t[c1] = hits?

4. t[] is the order 1 prediction table. c1 and c are the previous and current bytes. t[c1] should predict the byte that comes after c1. We count correct predictions (hits) when the prediction matches c.

This is the method I use in zpaq to detect uncompressible data in deciding whether to compress or store it. I use thresholds of 1/16, 1/32, 1/64, and 1/128 for -method 1 through 4 respectively. Sometimes it is surprising that you find zip, jpg, and other compressed files that are more predictable than they ought to be.
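Putting the counting loop and the threshold test together, the store-vs-compress decision can be sketched as below. The function name is my own; the threshold is passed in as one of the fractions quoted above (e.g. 1/16 for -method 1):

```c
#include <stdio.h>

/* Sketch: return 1 if the order 1 hit ratio exceeds the given threshold,
   i.e. the data looks compressible enough to be worth compressing. */
static int looks_compressible(FILE *in, double threshold) {
    int c, c1 = 0, hits = 0, total = 0, t[256] = {0};
    while ((c = getc(in)) != EOF) {
        ++total;
        if (t[c1] == c) ++hits;   /* prediction matched the actual byte */
        t[c1] = c;
        c1 = c;
    }
    return total > 0 && hits > total * threshold;
}
```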

I also use predictions to compute the rolling hash to decide fragment boundaries for deduplication. The hash depends on the last 32 bytes that are incorrectly predicted plus any bytes in between. The reason is that when the data is highly predictable (like all 0 bytes) you don't want to be computing the same hash at every point. I use a hash like h=(h+c+1)*M, where h is 32 bits; if c is predicted then M is odd, and if c is not predicted then M is even but not a multiple of 4. The actual constants are 314159265 and 271828182.
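A minimal sketch of that update rule; the function and constant names are mine, the two multipliers are the ones quoted:

```c
#include <stdint.h>

/* One step of the predictive rolling hash: h = (h + c + 1) * M,
   where M depends on whether the byte c was predicted. */
static uint32_t roll(uint32_t h, int c, int predicted) {
    const uint32_t M_PRED = 314159265u;  /* odd: predicted byte */
    const uint32_t M_MISS = 271828182u;  /* even, not a multiple of 4 */
    return (h + (uint32_t)c + 1u) * (predicted ? M_PRED : M_MISS);
}
```

Multiplying by an odd constant is invertible mod 2^32, so predicted bytes never destroy information in h; multiplying by 2 times an odd number shifts one bit of history out, so after 32 mispredicted bytes the older input no longer affects the hash.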

5. Very interesting. I bet your method of detecting incompressible data is more efficient than mine.

For the rolling hash, I use a simple polynomial hash: h = a[0] + K*a[1] + K*K*a[2] + ... Updating it needs just 1-2 multiplies per byte plus a few additions, and it is probably faster than any predictability checking. OTOH, if you need to check predictability anyway, your method may be more efficient.
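A sketch of such a polynomial hash over a fixed-size window, with all arithmetic mod 2^32; K, the window size, and the names here are arbitrary choices of mine:

```c
#include <stdint.h>

#define WIN 32                        /* window size: my arbitrary choice */
static const uint32_t K = 123456791u; /* multiplier: my arbitrary choice */

/* full recompute: h = a[0]*K^(n-1) + a[1]*K^(n-2) + ... + a[n-1] (mod 2^32) */
static uint32_t full_hash(const uint8_t *a, int n) {
    uint32_t h = 0;
    for (int i = 0; i < n; i++) h = h * K + a[i];
    return h;
}

/* slide the window one byte: drop a_old, append a_new.
   kpow must be K^(WIN-1), precomputed once. */
static uint32_t slide(uint32_t h, uint32_t kpow, uint8_t a_old, uint8_t a_new) {
    return (h - a_old * kpow) * K + a_new;
}
```

This costs two multiplies per byte (one of them by the precomputed K^(WIN-1)), matching the 1-2 multiplies mentioned.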

6. I think your rolling hash looks just like mine except that I add 1. Detecting incompressible data is not perfect. For example, the repeating sequence AABBAABB... would appear not compressible.

7. I am interested in your answer but don't understand it fully.
How do I calculate t[], the order 1 prediction table?
I have (for a 256 character alphabet) a 256 X 256 byte array, where element (i,j) holds the frequency in my sample of character j following character i.
Can I calculate t[] from this?

8. I don't use a 256x256 frequency table. I use a 256 byte array to predict the next byte from a 1 byte context. The prediction is whatever byte occurred in that context the last time it appeared.
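That said, if you do already have the 256x256 frequency matrix from the earlier question, a frequency-based analogue would take the argmax of each row. Note this is only an illustration of my own, not the last-occurrence scheme described above:

```c
/* freq[i][j] = count of byte j following byte i in the sample */
static void build_predictions(unsigned freq[256][256], int t[256]) {
    for (int i = 0; i < 256; i++) {
        int best = 0;
        for (int j = 1; j < 256; j++)
            if (freq[i][j] > freq[i][best]) best = j;
        t[i] = best;   /* most frequent successor of context byte i */
    }
}
```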

9. I'm using this in reflate - http://nishi.dreamhosters.com/u/entropy_v0.rar

Code:
```
LOG2.Init();
ENT.Init();
while (1) {
    int c = getc(f);
    if (c < 0) break;
    fprintf(g, "%06X %06X %02X %c\n", ENT.Eval(), ENT.bpc(), c, (c >= ' ') ? c : '.');
    ENT.Update(c);
}
```
It's a rolling order-0 model which estimates the rc (range coder) codelength for a window.
Eval() is a precise estimation which multiplies the codelengths by the freqs etc.,
and bpc() is the incremental one.

10. Originally Posted by Matt Mahoney
...
I also use predictions to compute the rolling hash to decide fragment boundaries for deduplication. The hash depends on the last 32 bytes that are incorrectly predicted plus any bytes in between. The reason is that when the data is highly predictable (like all 0 bytes) you don't want to be computing the same hash at every point. I use a hash like h=(h+c+1)*M, where h is 32 bits; if c is predicted then M is odd, and if c is not predicted then M is even but not a multiple of 4. The actual constants are 314159265 and 271828182.
Interesting technique. In Pcompress I am using a Rabin-style hash with table lookup and XOR to avoid doing modulus. It requires one multiply, a few adds and one bitwise "and" per byte. The table lookup and XOR happens only when fragments cross 4K size and testing for breakpoints start to happen. I am using minimum and maximum size limits to avoid too small and too large fragments. One advantage is if there is a long sequence of all 0 bytes it will get a series of minimum sized fragments which will all be deduplicated to a single minimum sized one. I have done some analysis of the chunking process here and here.

11. A Rabin hash seemed too complicated for what I wanted to do. But I did an experiment on zero bytes and found that the chunk size is 41343 bytes. Not sure it makes a big difference. I bound the fragment size between 4 KiB and 508 KiB and set a fragment boundary when the high 16 bits of the hash are 0, to give an average size of 64 KiB. The hash seems to work well, rarely giving very large or small fragment sizes. I am using large fragments so that I can keep the index in memory; it uses about 1 GB of memory for a 1 TB archive.
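The boundary rule described here can be sketched as below; the names and structure are mine, while the size limits and the 16-bit test are from the post:

```c
#include <stdint.h>

/* Cut a fragment when the top 16 bits of the rolling hash are zero
   (probability 2^-16 per byte, so the average fragment is 64 KiB),
   subject to the minimum and maximum fragment sizes mentioned. */
static int is_boundary(uint32_t h, uint32_t frag_len) {
    const uint32_t MIN_FRAG = 4u << 10;    /* 4 KiB minimum */
    const uint32_t MAX_FRAG = 508u << 10;  /* 508 KiB maximum */
    if (frag_len < MIN_FRAG) return 0;     /* never cut a tiny fragment */
    if (frag_len >= MAX_FRAG) return 1;    /* force a cut at the maximum */
    return (h & 0xFFFF0000u) == 0;
}
```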
