Results 1 to 10 of 10

Thread: Compression algorithm identification

  1. #1
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Compression algorithm identification

    Hi all,

    I'm trying to figure out the compression used to compress some parts of Intel Manageability Engine's firmware. I've identified the compressed blocks, and from another firmware which happened to use no compression I found what looks like corresponding uncompressed blocks. Here's what I have so far:
    Code:
    uncompressed:
    0000000000: E1 C5 E1 C6 F1 C0 08 76 | 32 08 20 00 28 75 C9 70  ???????v2?  (u?p
    0000000010: A2 08 20 00 A9 71 D1 C0 | C1 C6 E0 7F C1 C5 E0 78  ??  ?q?????|???x
    0000000020: 03 D0 04 D1 04 D2 B9 01 | EF FF 22 7A xx xx xx xx  |?|?|?O??"z    
    
    
    compressed 1:
    0000000000: 5B 29 4F AB C6 48 F3 42 | 50 2C 78 0A 27 67 AE F2  [)O??H?BP,x◙'g??
    0000000010: 1B F4 7D 10 34 5B 2F 5A | CB 93 B4 3D D7 8F 3D 21  ←?}►4[/Z??=?=!
    0000000020: BE F1 97 AD BE A5 EE 86 | E2 D9 7E 58 62 2A FF 4B  ?????????~Xb*?K
     
    compressed 2:
    0000000000: 5B 29 4F AB C6 48 F3 42 │ 50 2C 78 0A 27 4F AE F2  [)O??H?BP,x◙'O??
    0000000010: 1B F4 7D 10 34 5B 2F 5A │ CB 93 B4 3D D7 8F 3D 21  ←?}►4[/Z??=?=!
    0000000020: BE F1 97 AD BE A5 EE 86 │ E2 D9 6C AC DA 72 2B 67  ?????????l??r+g
     
     
    compressed 3 (maybe):
    0000000000: 62 42 39 51 67 CF D3 DC │ 18 29 4A 9E EB EF 45 E9  bB9Qg???↑)J???E?
    0000000010: 60 4D 4C 02 F3 2F 86 8B │ BB 7F A5 F1 51 90 68 05  `ML☻?/???⌂??Q?h♣
    0000000020: CB 96 9C BF FB B4 A9 65 │ E6 5F 8B F4 DD FB 3F 4D  ??????e?_?????M
    Any idea what this could be? There are some strings in the firmware tools that hint it could be related to Huffman, however I was not able to match any of algorithms I found to the bitstream. Firmware also used LZMA, although that one seems to be implemented in software only.

    P.S. I'm interested mainly in decompression.
    Last edited by igorsk; 25th January 2013 at 14:48.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,717
    Thanks
    271
    Thanked 1,185 Times in 656 Posts
    I can only suggest to start by applying
    lzmarec to detect lzma streams ( http://encode.su/threads/1231-lzma-r...ll=1#post24548 )
    rawdet to detect deflate streams ( http://encode.su/threads/1399-reflat...ll=1#post32007 )

    Also maybe try finding bytes from uncompressed chunk in the binary representation of compressed chunk
    (LZ literals; might help in case if it uses simple bit packing and no huffman).
    Otherwise there're too many LZ implementations and too little data to reverse-engineer anything.

  3. #3
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    I can only suggest to start by applying
    lzmarec to detect lzma streams ( http://encode.su/threads/1231-lzma-r...ll=1#post24548 )
    rawdet to detect deflate streams ( http://encode.su/threads/1399-reflat...ll=1#post32007 )
    Thank you for the reply. I tried the tools. Reflate didn't find anything and lzmarec crashed.
    Quote Originally Posted by Shelwien View Post
    Also maybe try finding bytes from uncompressed chunk in the binary representation of compressed chunk
    (LZ literals; might help in case if it uses simple bit packing and no huffman).
    Otherwise there're too many LZ implementations and too little data to reverse-engineer anything.
    Unfortunately there is only one chunk which I'm reasonably sure matches the compressed data, and there haven't been many matches (just one byte, in fact). I tried looking for common byte sequences in other compressed chunks but got very little hits. I suspect the literals are not always byte-aligned or byte-sized.

    I have many more compressed chunks and I can post them. In fact, let me do one better.

    I've attached three samples of compressed data and a script that splits them into chunks. As far as I could figure out, each chunk corresponds to 1K (0x400 bytes) of original data. The index table lists offset to compressed data for each chunk and the flags (00 - not compressed, 40/C0 - compressed, 80 - no data/empty chunk). Compressed data length does not seem to be encoded explicitly so probably there is either an explicit end-of-stream marker, or the decompression is done until it reaches 1K.

    Now that I looked at it once more, there are several interesting chunks. They are very short so they probably encode long runs of zeroes or FFs.

    Code:
    (from fptr_8.1.20.1336)
    
     0000000000: 36 DF 7F 7F 7F 7F 7F 7F │ 7F 7F 7F 7F 7F 7F 7F 7F  
     0000000010: 7F 7F 7F 7F 7F 7F 7F 7F │ 7F 7F 7F 7F 7F 7F 7F 7F  
     0000000020: 7F 7F 7F 7F 7F 7F 7F 7F │ 7F 7F 7F 7F 7F 7F 7F 7F  
     0000000030: 7F 7F 7F 7F 7F 7F 7F 7F │ 7F 7F 7F 7F 7F 7F 7F 7F  
     0000000040: 7F 7F 7F 7F 7F 60       │                          
    
    
     0000000000: BC CB B8 E5 20 CB 8C 52 │ 0C B2 DD 66 4C 4D 1D F2  
     0000000010: FD A4 89 C6 49 49 A3 92 │ 54 16 93 61 61 61 F7 F7  
     0000000020: F7 F7 F7 F7 F7 F7 F7 F7 │ F7 F7 F7 F7 F7 F7 F7 F7  
     0000000030: F7 F7 F7 F7 F7 F7 F7 F7 │ F7 F7 F7 F7 F7 F7 F7 F7  
     0000000040: F7 F7 F7 F7 F7 F7 F7 F7 │ F7 F7 F7 F7 F7 F7 F7 F7  
     0000000050: F7 F7 F7 F7 F7 F7 F7 F7 │ F7 F7 F7 F7 F7 F6        
    
    
     0000000000: B0 FB FB 6B 5B 55 1B EA │ 07 BE 62 6F 99 87 0F D4  
     0000000010: 0F 59 0B 9C 56 63 4E 5B │ 03 BB 6D 50 4C B6 07 E8  
     0000000020: 7E 0F 6C B6 07 CA DA BE │ E5 B0 35 23 EF EF EF EF  
     0000000030: EF EF EF EF EF EF EF EF │ EF EF EF EF EF EF EF EF  
     0000000040: EF EF EF EF EF EF EF EF │ EF EF EF EF EF EF EF EF  
     0000000050: EF EF EF EF EF EF EF EF │ EF EF EF EF EF EF EF EF  
     0000000060: EF EF EF EF EF EF EF EF │ EC                       
    
    (from fptr_6.0.40.1215)
    
    0000000000: B1 3B A7 74 EE 9D D3 BA │ 77 4E E9 DD 3B A7 74 EE
    0000000010: 9D D3 BA 77 4E E9 DD 3B │ A7 74 EE 9D D3 BA 77 4E
    0000000020: E9 DD 3B A7 74 EE 9D D3 │ BA 77 4E E9 DD 3B A7 74
    0000000030: EE 9D D3 BA 77 4E E9 DD │ 3B A7 74 EE 9D D3 BA 77
    0000000040: 4E E9 DD 3B A7 74 EE 9D │ D3 BA 77 4E E9 DD 3B A7
    0000000050: 74 EE 9D D3 BA 77 4E E9 │ DD 3B A7 74 EE 9D D0
    
    0000000000: 11 E8 8F 4E E9 DD 3B A7 │ 74 EE 9D D3 BA 77 4E E9
    0000000010: DD 3B A7 74 EE 9D D3 BA │ 77 4E E9 DD 3B A7 74 EE
    0000000020: 9D D3 BA 77 4E E9 DD 3B │ A7 74 EE 9D D3 BA 77 4E
    0000000030: E9 DD 3B A7 74 EE 9D D3 │ BA 77 4E E9 DD 3B A7 74
    0000000040: EE 9D D3 BA 77 4E EB 78 │ 9A DB C4 D7 3C 4D 67 74
    0000000050: EE 9D D3 BA 77 4E E9 DD │ 3B A7 74 EE 9D D3 BA 77
    0000000060: 4E E9 DD 3B A7 74       │
    Attached Files Attached Files

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,717
    Thanks
    271
    Thanked 1,185 Times in 656 Posts
    > lzmarec crashed

    Not sure how you managed to crash it
    (the most recent one is http://nishi.dreamhosters.com/u/lzmarec_v4b_bin.rar
    used like "rec c input output")
    But it doesn't find anything in the attached files anyway.

    > I suspect the literals are not always byte-aligned or byte-sized.

    Yes, it seems pretty likely.
    That's why I suggested to look for literals in bit strings ("binary representation")

    Btw, I also tried using uniextract (just in case its LZO or something), but
    it didn't work out either.

    > They are very short so they probably encode long runs of zeroes or FFs.

    Yeah, seems quite likely that its some LZH.

    If its decoded by hardware, maybe its possible to find some info in the specs?

    Or maybe at least modify the bytes in the image and see how unpacked data changes?

  5. #5
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Unfortunately I can't run my own bitsreams on the hardware. I'm analyzing the firmware binary statically, "offline" so to speak. The firmware is signed and any modification will make it just not work. The documentation for the chip is non-existent with the exception of some high-level marketing style diagrams.

    I did find a few Intel patents which might be relevant. Unfortunately they only cover some aspects of the compression and don't have anything nice such as decompression pseudocode. And I'm not even sure any of them apply to this specific case...

    https://www.google.com/search?tbm=pt...:intel+huffman
    Last edited by igorsk; 28th January 2013 at 21:28. Reason: grammar

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,717
    Thanks
    271
    Thanked 1,185 Times in 656 Posts
    > Unfortunately I can't run my own bitsreams on the hardware.

    What about capturing the unpacked version of specific firmware?
    Instead of comparing it to a different revision?

    If only static analysis is possible, then I guess we can collect
    all the possible data-code pairs and try analyzing how the minimum
    differences between data blocks are reflected in the code.

    Also a frequency analysis of bitstrings in the code might help with
    determining the codes... though blocks are kinda too small for that.

    I can say though, that if it was deflate, i'd not be able to reverse-engineer
    it like that - deflate headers are very complex and there's no direct
    relation between data bits and header bits.

    > The firmware is signed and any modification will make it just not work.

    Maybe its possible to crack the key?

    > I did find a few Intel patents which might be relevant.

    I don't think that its plain huffman - LZ without huffman is more useful
    than huffman without LZ.

  7. #7
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    > Unfortunately I can't run my own bitsreams on the hardware.

    What about capturing the unpacked version of specific firmware?
    Instead of comparing it to a different revision?
    That's what I have in mind next. It's somewhat difficult since the memory area with unpacked code is inaccessible to the main CPU (enforced by the chipset). I have some ideas but they will likely need some serious effort.

    Quote Originally Posted by Shelwien View Post
    > The firmware is signed and any modification will make it just not work.
    Maybe its possible to crack the key?
    In theory yes, but it's RSA-2048, so...

    Quote Originally Posted by Shelwien View Post
    > I did find a few Intel patents which might be relevant.

    I don't think that its plain huffman - LZ without huffman is more useful
    than huffman without LZ.
    Could be. There's also the possibility of hardcoded dictionary in the hardware...

  8. #8
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Someone gave me an idea that the compression could be related to the one Intel used in EFI and later UEFI. It does use Huffman, in conjunction with LZ77. I looked at the spec and tried to match it against my streams but was not able to figure out if it's anything close. But I don't have much experience with compressions. Also, it seems there were several variations of the algorithm - EFI, UEFI (possibly the same), and Tiano. None of the decompressors I could find managed to extract anything, so if it's related then the bitstream format is probably different...

    The description of the algorithm and the stream format can be found in the UEFI spec: http://www.uefi.org/specs/ (chapter 1.

  9. #9
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have found a patent that I believe describes the actual implementation (it's by a person who worked on the device).

    https://www.google.com/patents/US7280052

    The language is somewhat obscure, but it seems that the dictionary and the tree are stored in hardware ("decompression engine") and not in the stream :/

    I'm somewhat fuzzy about Huffman; do the quoted code snippets provide any insights on the stream format? The "Array" seems to be bytes(?) but I'm not sure that's correct. Maybe the code indexes are converted to bit sequences when writing our the stream.

  10. #10
    Member
    Join Date
    Jan 2013
    Location
    Earth
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    For the record, the algorithm turned out to be plain Huffman (just a stream of Huffman codes) with the dictionary apparently stored in the hardware. More info here.

Similar Threads

  1. Blackbox identification of compression engine
    By Luntik in forum Data Compression
    Replies: 6
    Last Post: 19th January 2013, 20:57
  2. Hierarchy compression algorithm and more
    By teddybot in forum The Off-Topic Lounge
    Replies: 7
    Last Post: 3rd May 2012, 02:16
  3. Best Compression Algorithm for a File Hosting service?
    By Accident in forum Data Compression
    Replies: 4
    Last Post: 27th November 2011, 14:10
  4. New layer 0 - compression algorithm
    By abocut in forum Data Compression
    Replies: 5
    Last Post: 28th May 2010, 02:32
  5. The best algorithm for high compression
    By Wladmir in forum Data Compression
    Replies: 8
    Last Post: 18th April 2010, 15:54

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •