Results 1 to 7 of 7

Thread: Blackbox identification of compression engine

  1. #1
    Member
    Join Date
    Jan 2013
    Location
    Australia
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Blackbox identification of compression engine

    Hello all.

    I'm a long time reader of this forum, first time poster, so I sort of feel like I already know most of you just from reading your posts over the years.

    I have an interesting problem in that I want to identify the various methods being used to compress data in a target chip I am examining. This chipset is undocumented and therefore I'll be treating it as a black box, but it presented an interesting problem and I was curious in your thoughts as to a potential solution.

    What I am thinking of is that various compression schemes in use have specific strengths and weaknesses (ie: they compress specific data types in specific ways). So I thought that it should be possible to create a set of source data that when presented to various compression engines will produce an output that if not directly gives away, at the very least, will strongly suggest that a specific type of algorithm is being used.

    So that's the crux of it: What kinds of data would be proposed to create a kind of generic corpus that could be provided to various compression engines and by examining the output and various other side channels (slowdown of output, heat generated by MCU, etc) would help me to determine what methods and are being used?

    Obviously, streams with various length runs of repeated symbols (be they bits or bytes) could be used to detect simple RLE, but a more complex incremental count being compressed better would indicate that some form of ADE was in use. Then simply extending this to include psuedo-random data streams with injected length runs of repeated runs at various intervals and lengths to detect sliding window LZW and such.

    The chip in question is doing real-time compression of data and uses an MCU core without any DSP capability, has limited buffer memory in the 10's of Megabytes range, and its clock speed is limited to <500Mhz - so it isn't going to be anything super complex or advanced like a BWT or context mixing.

    Do any of you have any suggestions of generating data to present to such an unknown engine that will assist me in detecting other basic compression mechanisms?

    Thanks,

    Luntik.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 796 Times in 488 Posts
    You can start by compressing small strings of a few bytes. Also, try changing bits of the compressed data and see what happens to the decompressed output.

    If it is LZ77 you can test the window size by compressing 2 copies of a random string of size n, and increasing n slowly until you see the output size jump. If it's BWT this will also tell you the block size. BWT differs from LZ77 in that it compresses sorted data poorly, but stationary sources better. BWT is also sensitive to alphabet reordering, but LZ77 is not.

    You can try compressing 2 strings that differ by 1 bit in the middle. If it uses a Huffman code, you will find cases where only a few bits differ in the output. If it uses arithmetic coding, all of the data past that bit will be different.

  3. #3
    Member
    Join Date
    Jan 2013
    Location
    Australia
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi Matt,

    Many, many thanks for that; it's exactly the kind of information that I was looking for.

    If anyone else has any tips or suggestions in the same vein, they'd also be greatly appreciated.

    Luntik.

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post
    You can start by compressing small strings of a few bytes. Also, try changing bits of the compressed data and see what happens to the decompressed output.

    If it is LZ77 you can test the window size by compressing 2 copies of a random string of size n, and increasing n slowly until you see the output size jump. If it's BWT this will also tell you the block size. BWT differs from LZ77 in that it compresses sorted data poorly, but stationary sources better. BWT is also sensitive to alphabet reordering, but LZ77 is not.

    You can try compressing 2 strings that differ by 1 bit in the middle. If it uses a Huffman code, you will find cases where only a few bits differ in the output. If it uses arithmetic coding, all of the data past that bit will be different.
    If you have 2 long strings that differ by 1 bit in the middle and the compressed files seem unrelated it could be one based on BWTS or some other multi pass compression system. If you can run the box to also uncompress you can change a bit in the middle of the compress file then uncompress and compress back. If it even on one try matches the changed file its a good chance its some fancy bijective compressor.

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 796 Times in 488 Posts
    I bet it's not bijective. But that is a good point. If changing the middle of the file changes the beginning of the compressed output, it could be BWT or it could be something that uses 2 passes to build a static model.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    291
    Thanked 1,286 Times in 728 Posts
    There could be a crc too. might make sense to look for crc values of known data in the compressed file.
    Also deflate streams can be detected with reflate (rawdet) - http://nishi.dreamhosters.com/u/reflate_v0c2.rar
    and lzma streams with lzmarec - http://nishi.dreamhosters.com/u/lzmarec_v4b_bin.rar

  7. #7
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I bet it's not bijective. But that is a good point. If changing the middle of the file changes the beginning of the compressed output, it could be BWT or it could be something that uses 2 passes to build a static model.
    Of course that's not a bet I would make with you unless I tested the black box first. There are less than 10 people in the whole world who are concerned with bijective compression.

Similar Threads

  1. Duplicate File Finder Engine
    By david_werecat in forum Download Area
    Replies: 10
    Last Post: 10th February 2018, 13:08
  2. A fast diffing engine
    By m^2 in forum Data Compression
    Replies: 36
    Last Post: 21st September 2011, 19:30
  3. File "Type" identification tool
    By soor in forum Data Compression
    Replies: 4
    Last Post: 6th June 2011, 04:04
  4. RZM - a dull ROLZ compression engine
    By Christian in forum Forum Archive
    Replies: 178
    Last Post: 1st May 2008, 21:26

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •