Thread: Compression Problem

  1. #1
    Nquisitive (Member)

    Compression Problem

    I need to compress the contents of a byte array.
    The byte array contains a file fragment (4 Kbytes) from the middle of a file, so there may be no headers, dictionaries or other information. The byte array contains either data compressed by standard means such as Deflate, or an encrypted fragment. I need to distinguish between them.
    Using an efficient algorithm, the compressed fragment should compress some more, whereas the encrypted fragment should not compress much. The measure of compression would be the length of the compressed data itself, excluding any headers, Huffman tables etc., to give a valid comparison with the file fragment.
    Can anybody suggest an open-source algorithm that could be modified to suit?
    Does anybody have a better idea?
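    To make the measurement concrete, here is a minimal sketch (Python, purely illustrative; the helper name is made up). Raw Deflate via zlib with negative wbits produces no zlib header or checksum, which roughly matches "length of the compressed data itself", though Huffman tables for dynamic blocks are still counted. In practice a stronger (e.g. context-mixing) compressor would replace zlib here.

    Code:
    import zlib

    def recompressed_length(fragment: bytes) -> int:
        # Raw Deflate stream (wbits = -15): no zlib header or Adler-32 checksum,
        # so the output length approximates "compressed data only".
        co = zlib.compressobj(9, zlib.DEFLATED, -15)
        return len(co.compress(fragment) + co.flush())

    # Idea: an already-compressed fragment should shrink a little more under a
    # sufficiently strong model, while an encrypted fragment should stay near
    # 4096 bytes or expand.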


  2. #2
    Bulat Ziganshin (Programmer)
    First - what is your goal?

  3. #3
    Nquisitive (Member)
    Quote Originally Posted by Bulat Ziganshin View Post
    First - what is your goal?
    The goal is to differentiate between compressed and encrypted file fragments. No decompression is required.

  4. #4
    schnaader (Programmer)
    You might already know the following, though it still might be interesting, so I'll wrap TL;DR markers around it.

    TL;DR start

    Quote Originally Posted by Nquisitive View Post
    Using an efficient algorithm the compressed fragment should compress some more, whereas the encrypted fragment should not compress much.
    This is quite a speculative statement. I'd guess it is true in most cases, but without running some tests I wouldn't sign off on it. You should be aware of the possibility of getting both false positives (encrypted data compressing better) and false negatives (compressed fragments compressing worse) here.

    Some observations:

    (a) If the encryption algorithm is weak in matters of entropy (e.g. a block cipher using ECB mode), the encrypted data might have an entropy only slightly above that of the original data. This will lead to a false positive.

    (b) Even if the encryption algorithm is strong, you can still get some low entropy blocks by chance, leading to a false positive, though the probability for this is very low, especially for 4K blocks.

    (c) If the original data had a high entropy (was "random"), the compression algorithm will expand the data and the resulting compressed data will very likely still have a high entropy. This will often lead to false negatives.

    Luckily, there aren't many entropy-wise weak encryption algorithms around anymore, and if the encryption used is a standard algorithm (e.g. AES, as used for encryption in WinRAR/WinZip/7-Zip) and not a "homebrew" algorithm, you can leave (a) aside.

    Looking at the other two cases, I'd expect many more false negatives from (c) than false positives from (b). So, as a summary: expect the probability of detecting a block as encrypted even though it is a compressed one to be quite high. In general, don't expect your compressed/encrypted detection to be correct in 100% of cases, perhaps not even in 50%. You might already know this, but I just want to emphasize it to make sure you aren't wasting your time on something that won't work.
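
    For what it's worth, a minimal sketch (Python, just an illustration) of an order-0 byte entropy estimate for a 4K fragment; both Deflate output and ciphertext will typically score close to 8 bits/byte, which is exactly why (b) and (c) are the interesting cases:

    Code:
    import math
    from collections import Counter

    def byte_entropy(fragment: bytes) -> float:
        # Order-0 Shannon entropy in bits per byte; values near 8.0 mean the
        # fragment "looks random" at the byte level.
        counts = Counter(fragment)
        n = len(fragment)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())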

    Quote Originally Posted by Nquisitive View Post
    The byte array contains either data compressed by standard means such as Deflate, or an encrypted fragment.
    I'll now concentrate on Deflate, although the statement "such as Deflate" sounds like there could be data compressed with other algorithms, too.

    The Deflate algorithm is described in RFC 1951, which gives a concise overview of the details that are important for us. You talked about it in your earlier post, so I just mention it here as a reference for others and for quoting.

    Unfortunately, as you already found out, you can't decompress or even detect compressed data reliably, as "block sizes are arbitrary" (Section 2) and you'll be in the middle of a compressed block in most cases. To make things worse, the Deflate "header" for fixed/dynamic Huffman blocks is only 3 bits, making detection impossible. Even if the Deflate blocks are embedded in a zLib stream, the header size is only 2 bytes. There is a constraint in this case (RFC 1950, Section 2.2): the two bytes must be a multiple of 31 when interpreted as a 16-bit MSB integer. This is similar to the LEN/NLEN constraint of uncompressed Deflate blocks, but doesn't help nearly as much.
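
    If you do happen to sit at the start of a zLib stream, the check itself is cheap. A small sketch (Python, hypothetical helper name) of the RFC 1950 constraint:

    Code:
    def looks_like_zlib_header(b0: int, b1: int) -> bool:
        # CMF/FLG check: compression method must be 8 (Deflate) and the two
        # bytes, read as a 16-bit big-endian integer, must be a multiple of 31.
        return (b0 & 0x0F) == 8 and ((b0 << 8) | b1) % 31 == 0

    # Roughly 1 in 500 random byte pairs pass both tests, so this filters out
    # little - which is why it doesn't help nearly as much as LEN/NLEN.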

    TL;DR end

    As you said, your goal is to use an open-source algorithm to compress a 4K block in order to distinguish Deflate blocks from encrypted blocks. For that, we'll need an algorithm that is

    (a) open-source
    (b) "better" than Deflate
    (c) bitwise

    The first two requirements should be clear, so I'll only explain the reason for (c). Deflate transforms the data into <length, distance> pairs and literals and compresses those using Huffman codes. The Huffman code lengths in the fixed case are between 7 and 9, so there's a chance for literal bytes to be aligned ("shine through") such that a bytewise compression algorithm will "see" them, but this won't help for a couple of reasons - mainly the three-bit header destroying the initial alignment and literals getting rare after the beginning of the stream. In the dynamic case, it will be worse, as Huffman codes can be of (almost) any length. In contrast, a bitwise compression algorithm will be able to "see" repeating Huffman codes (both
    literals and <length, distance> pairs) and might even recognize patterns for <length, distance> pairs that differ in length or distance only.

    All these requirements give a perfect match: the PAQ/ZPAQ family. The only thing you might have to worry about is speed, but you could adjust the compressors to your application by using a custom config file (ZPAQ) or disabling models (PAQ). Even unmodified, you might be able to get good results using the faster modes (-1..-3 for PAQ, -1 and -2 for ZPAQ).

    You can still use the uncompressed Deflate detection you mentioned in your earlier post as a fast way to sort some of the blocks out, but that's a speed optimization.

  5. #5
    Matt Mahoney (Expert)
    I assume you just want to distinguish compressed and encrypted data, not actually compress them further. It is not possible to compress encrypted data at all without the key (except in ECB mode, like schnaader said, which shouldn't be used). If the compression algorithm is poor, you might compress its output further by a few percent, maybe 1% for deflate. Deflate packs Huffman codes LSB first, so what you might try is reading the bits in this order and using a context mixing algorithm with a mix of different bit-length (not byte-length) contexts. If you can compress 4 bytes out of a 4K packet this way, you can assume it is not encrypted with probability 1 - 2^-32.

    JPEG packs Huffman codes MSB first, so try both bit orders. Also, JPEG uses byte stuffing (0xFF 0x00) to distinguish data bytes from markers, which are identified by a 0xFF first byte. This fact alone makes JPEG 0.4% compressible. Usually you can get 3-5% compression with simple algorithms like zip, but mostly this comes from the JPEG headers, which are not compressed.
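
    As a rough sketch of that decision rule (Python; raw LZMA here is only a convenient bytewise stand-in for the bitwise context mixing model described above, so expect it to miss many deflate fragments):

    Code:
    import lzma

    # A raw LZMA2 stream has no container header, so the length comparison is
    # reasonably fair for a 4K fragment.
    FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 9}]

    def probably_not_encrypted(fragment: bytes, margin: int = 4) -> bool:
        packed = lzma.compress(fragment, format=lzma.FORMAT_RAW, filters=FILTERS)
        # Saving at least `margin` bytes (32 bits) on a random/encrypted input
        # happens with probability at most about 2^-32.
        return len(packed) + margin <= len(fragment)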

  6. #6
    Matt Mahoney (Expert)
    Here is an example. I wrote 2 identical zpaq models that mix bitwise order-8..16 contexts, except that one includes a pre/post-processor that reverses the bit order of each byte. zpaq normally models bits MSB first. msb.cfg contexts consist of the current byte and the low bits of the previous byte; it does not use any pre/post-processing. lsb.cfg recursively calls itself to preprocess, because the pre- and post-processing transforms are identical (which is not always the case). This trick requires zpaq v6.xx (currently v6.06) because older versions have a different command line syntax.

    Some results:
    lsb: maxcomp.zip 14952164 -> 14903145
    msb: maxcomp.zip 14952164 -> 14910108
    lsb: a10.jpg 842468 -> 827863
    msb: a10.jpg 842468 -> 826429

    maxcomp.zip is the maximum compression corpus created with zip -9. a10.jpg is from that data set. Note that zip compresses better LSB first, and jpeg MSB first, because they store Huffman codes in that order.

    To test, paste the 2 config files below to msb.cfg and lsb.cfg, then run, e.g.

    zpaq -add archive maxcomp.zip -method lsb -solid

    -solid mode adds to archive.zpaq without deduplication or incremental update. Otherwise if you add a second time it will detect that it is already stored and you'll get a compressed size of 0. If you want an accurate compressed size then delete the archive each time or you will get an archive with multiple copies.

    Code:
    (msb.cfg)
    comp 0 0 0 0 10
      0 cm 9  $1+16 (order 8..16 bit context models)
      1 cm 10 $1+16
      2 cm 11 $1+16
      3 cm 12 $1+16
      4 cm 13 $1+16
      5 cm 14 $1+16
      6 cm 15 $1+16
      7 cm 16 $1+16
      8 cm 17 $1+16
      9 mix 0 0 9 $2+24 0
    hcomp
      a<<= 9 *d=a halt (previous byte shifted left 9 bits, stored in H[0] as the shared context)
    post 0 end
    
    
    (lsb.cfg)
    comp 0 0 0 0 10
      0 cm 9  $1+16 (order 8..16 bit context model)
      1 cm 10 $1+16
      2 cm 11 $1+16
      3 cm 12 $1+16
      4 cm 13 $1+16
      5 cm 14 $1+16
      6 cm 15 $1+16
      7 cm 16 $1+16
      8 cm 17 $1+16
      9 mix 0 0 9 $2+24 0
    hcomp
      a<<= 9 *d=a halt
    (pre- and post-processor reverse bit order)
    pcomp zpaq -method lsb -quiet -run pcomp ;
      c=a
      a> 255 if halt endif
      a&= 1 a<<= 7 b=a
      a=c a&= 2 a<<= 5 a+=b b=a
      a=c a&= 4 a<<= 3 a+=b b=a
      a=c a&= 8 a<<= 1 a+=b b=a
      a=c a&= 16 a>>= 1 a+=b b=a
      a=c a&= 32 a>>= 3 a+=b b=a
      a=c a&= 64 a>>= 5 a+=b b=a
      a=c a&= 128 a>>= 7 a+=b
      out
      halt
    end
    You can pass 1 or 2 arguments to either config file to change the rate of adaptation of the context models and mixers like this:

    zpaq -add archive maxcomp.zip -method lsb -10 20 -solid

    I just used the defaults which work pretty well, but you might get better compression.

    A CM maps a context to a bit prediction and a count, then updates the prediction by error/count. It takes two arguments: the number of bits of context and the maximum count/4 before the rate of adaptation stops decreasing. The computed context is added to the current partial byte context, expanded to 9 bits, so what I did was take the previous byte (passed in A), shift it left 9 bits, and store it in H[0] (pointed to by D) as input to the model. The size of H is 1 (given by the first 0 after COMP), so all 10 components read their context from the same shared location. The size of the model chops off the high bits of the order-1 context.
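
    In other words, a CM component behaves roughly like this (a simplified Python sketch, not the exact ZPAQ arithmetic):

    Code:
    class ContextModel:
        def __init__(self, context_bits: int, limit: int = 1024):
            # One (prediction, count) pair per context value.
            self.table = [[0.5, 1.0] for _ in range(1 << context_bits)]
            self.limit = limit  # cap on the count, so adaptation never stops entirely

        def predict(self, cxt: int) -> float:
            return self.table[cxt][0]      # P(next bit = 1) in this context

        def update(self, cxt: int, bit: int) -> None:
            p, n = self.table[cxt]
            p += (bit - p) / n             # move the prediction by error/count
            self.table[cxt] = [p, min(n + 1.0, self.limit)]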

    A MIX averages a range of predictions by other components using weighted averaging in the logistic domain (log(p/(1-p))), then updates the weights in favor of the best models. This MIX does not use any context to select a weight array, but you can usually get better compression if it does. The 5 arguments to MIX are the number of context bits (0), first component (0), number of components (9), adaptation rate (24), and order 0 context mask (0, but try 255).
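
    And the MIX step, again as a rough Python sketch of logistic mixing rather than the exact ZPAQ update:

    Code:
    import math

    def stretch(p: float) -> float:        # probability -> logistic domain
        return math.log(p / (1.0 - p))

    def squash(x: float) -> float:         # logistic domain -> probability
        return 1.0 / (1.0 + math.exp(-x))

    def mix_predict(probs, weights) -> float:
        # Weighted average of the component predictions in the logistic domain.
        return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

    def mix_update(probs, weights, bit, rate=0.002):
        # Nudge the weights toward the components that predicted the bit well.
        err = bit - mix_predict(probs, weights)
        return [w + rate * err * stretch(p) for w, p in zip(weights, probs)]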

    A postprocessor, if any, runs in a separate virtual machine. The postprocessor in lsb.cfg receives the decoded input byte in A. It uses B and C as temporary registers to reverse the bit order and output it. The input is 0xffffffff at EOF; the if-statement ignores it. The PCOMP section calls zpaq with -run to call itself as the preprocessor. Normally this would be a separate program that takes an input and output file as its last 2 arguments.
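
    For reference, the per-byte transform that the PCOMP code above performs is equivalent to this Python sketch:

    Code:
    def reverse_bits(byte: int) -> int:
        # Swap bit 0 with bit 7, bit 1 with bit 6, and so on - the same effect as
        # the shift-and-add sequence in the PCOMP section of lsb.cfg.
        out = 0
        for i in range(8):
            out |= ((byte >> i) & 1) << (7 - i)
        return out

    assert reverse_bits(0x01) == 0x80 and reverse_bits(0xB4) == 0x2D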

    Get zpaq at http://mattmahoney.net/dc/zpaq.html

  7. #7
    Nquisitive (Member)
    schnaader - thanks for your most informative reply. You have expressed very clearly many of the thoughts that were at the back of my mind but had not yet crystallised.
    You are correct that my assertion that encrypted blocks will not compress much is speculative, but I'm at an early stage of research at the moment and just playing with ideas.
    I am going to have a good look at PAQ/ZPAQ as you have suggested. Hopefully I can - in a week or two - have some figures to support or refute my assertion.

  8. #8
    Nquisitive (Member)
    Matt. This will help greatly in my investigation of zpaq once I get to grips with it. Many thanks.

  9. #9
    biject.bwts (Member)
    You can't tell the difference between good compressed data and good encrypted data when looking at a random 4k segment. That is one reason why a good compression pass before encryption tends to make for a better product. You get a shorter file with the same information.

    Take any file you want, or BOOK1 as an example. Compress it with ARB255 and take a random 4k fragment of the compressed file.

    Then take BOOK1 by itself and encrypt it with your favorite AES or DES or JQ3 or whatever. The resulting file will be longer than the one obtained from using ARB255, but take a 4k fragment of it too. I don't believe that with any standard method you could detect which fragment was encrypted and which was compressed, unless the encryption was so poor that it leaks traces of which encryption method was used.



    But since the real entropy density is higher the first way, it's generally better to compress and then encrypt. The fact that compressing before encryption helps used to be well known, but the powers that be want people to forget it. I used ARB255 as an example since it's old compression and not the best. Today I would get better compression using something with more than one pass of BWTS, like SCOTT followed by bijective LZW, followed by SCOTT, and maybe then BICOM. But hey, that's just me.

  10. #10
    Nquisitive (Member)
    biject.bwts - one of my assumptions was that the 'off the shelf' compression packages that most people use do not provide optimal compression. I am trying to show that an efficient compression method can compress these fragments 1% or 2% more. If this is so, I will then investigate whether this can differentiate between compressed and encrypted fragments. Maybe it is impossible, but it is still an interesting problem.

  11. #11
    Nquisitive (Member)
    Matt - would it take long to explain or give references for:
    1. "If you can compress 4 bytes out of a 4K packet this way, you can assume it is not encrypted with probability 1 - 2^-32"
    2. "use a context mixing algorithm with a mix of different bit length contexts"

  12. #12
    Matt Mahoney (Expert)
    1. is from the counting argument. Encrypted data is indistinguishable from random by any computationally feasible process (unless the encryption is insecure). So any mapping from n bits to n-32 bits can encode at most 2^-32 possible inputs.

    2. http://mattmahoney.net/dc/dce.html#Section_43 or see my ZPAQL example above.

  13. #13
    Black_Fox (Tester)
    2^(n-32) maybe?

  14. #14
    biject.bwts (Member)
    Quote Originally Posted by Matt Mahoney View Post
    1. is from the counting argument. Encrypted data is indistinguishable from random by any computationally feasible process (unless the encryption is insecure). So any mapping from n bits to n-32 bits can encode at most 2^-32 possible inputs.

    2. http://mattmahoney.net/dc/dce.html#Section_43 or see my ZPAQL example above.

    The problem with the counting argument viewed this way is that there are so many possible compression methods that, for any given 4k fragment of data, there will be many compression methods that compress it, no matter what the 4k sequence is. So you might falsely think the data is not encrypted, since if you used the right compression program it would compress.

    About the only thing one can do is test with some fixed compressors and state that, with this specific type of compressor, the data appears to be random or not.

    Another thing several people, including me, have suggested: compress text, encrypt it with your favorite method, then do something like UNARI2A.EXE from http://bijective.dogma.net/compres2bse.htm. Once this is done, the data will be highly compressible and yet look like easy-to-break, common "child" encryption.

  15. #15
    Matt Mahoney (Expert)
    Quote Originally Posted by Black_Fox View Post
    2^(n-32) maybe?
    Yes. I meant 2^-32 or about 1 out of 4 billion as the largest possible fraction of encrypted inputs that would compress by 32 bits. Depending on the compression method, it will probably be less, maybe 0, because most compressors are not bijective.
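
    Spelled out as a worked bound (using the convention that "compresses by 32 bits" means the output is exactly 32 bits shorter):

    Code:
    % Of the 2^n equally likely n-bit encrypted inputs, an injective compressor
    % can map at most one input to each of the 2^(n-32) outputs of length n-32,
    % so the fraction that can compress by 32 bits is at most
    \[
      \frac{2^{\,n-32}}{2^{\,n}} \;=\; 2^{-32}.
    \]
    % Allowing outputs of n-32 bits or fewer gives at most 2^(n-31)-1 outputs,
    % a bound just under 2^(-31) - still about one chance in two billion.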

  16. #16
    Nquisitive (Member)
    Hi Matt. I am struggling to get my head around this probability - I'm a real tyro in this area.
    I can see that there are 2^n possible files of n bits.
    Also, if it compresses by 32 bits then there are 2^(n-32) possible compressed files.
    This is true for any file and the fact of encryption has no bearing here.

    I'm sure that it is not coincidence that 2^(n-32)/2^n = 2^(-32) but can't make the leap from the fact of the file being encrypted to 2^(-32) being the largest possible fraction of encrypted inputs that would compress by 32 bits.

  17. #17
    Matt Mahoney (Expert)
    The counting argument says that you can't compress random data. But in practice, compression works because most data is predictable by computationally feasible algorithms like context modeling. Encrypted data is not random, but for all practical purposes it is, because there should not be any algorithm that can predict bits that is any faster than brute force key search. So if data won't compress, then there are 3 possibilities:

    1. The compression algorithm isn't smart enough.
    2. The data is truly random.
    3. The data is encrypted.

    An example of 1 is compressing data that was already compressed with a weaker algorithm, and the second compressor not recognizing this fact. Another example is compressing a binary representation of pi or sqrt(2) or some other nonrepeating digit sequence that has a simple formula. A general solution is not computable, so there will always be such cases.

    There is also no way to know if data is truly random, in the sense that there is no shorter program that outputs it. Kolmogorov proved this. Suppose there was a function K(x) that tells you the length of the shortest program that outputs any given bit string x. Then you could write a program to output "the first x such that K(x) > 1000000", which itself is short, contradicting the claim about K. This also proves that there is no general purpose compression algorithm for nonrandom data.

    So there is no foolproof way to detect encrypted data except to find a key, and even that won't work if the plaintext is random. The best you can do is test whether the data compresses at all by a good algorithm, and if not, then it might be encrypted. If it does compress by even just a few bytes, then it is very unlikely to be directly encrypted. However, you can't rule out things like steganography or base64 encodings that expand the ciphertext.

  18. #18
    Nquisitive (Member)
    Matt - The argument that there is no way to determine if data is random is similar to the argument given in your book in section 1.3, which refers to Li and Vitanyi (2007) extending Gödel's Theorem (takes me back to undergraduate days 50 years ago!).
    However I still struggle with 'If it does compress by even just a few bytes, then it is very unlikely to be directly encrypted' (my emphasis).
    I had hoped to put a figure on 'very unlikely' and thought that the probability being 1 - 2^(-32) would be ideal, but I am still struggling with a proof.

  19. #19
    Matt Mahoney (Expert)
    Well, it's not exactly that. By Bayes' law,
    P(encrypted | 4 bytes compression) = P(4 bytes compression | encrypted) P(encrypted) / P(4 bytes compression).

    It is easy to show that P(4 bytes compression | encrypted) = 2^-32. The other numbers you have to guess.
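
    For example, with purely illustrative guesses for those other numbers (a Python back-of-the-envelope, not measurements):

    Code:
    p_comp_given_enc = 2.0 ** -32   # P(4 bytes compression | encrypted)
    p_enc            = 0.5          # prior P(encrypted)                      - a guess
    p_comp_given_not = 0.6          # P(4 bytes compression | not encrypted)  - a guess

    p_comp = p_comp_given_enc * p_enc + p_comp_given_not * (1.0 - p_enc)
    p_enc_given_comp = p_comp_given_enc * p_enc / p_comp
    print(p_enc_given_comp)         # ~3.9e-10: a fragment that compresses by
                                    # 4 bytes is almost certainly not encrypted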

  20. #20
    Nquisitive (Member)
    Thanks Matt. I'll work on that.
