Results 1 to 9 of 9

Thread: Anyone know which compression algorithm does this?

  1. #1
    Member
    Join Date
    Aug 2013
    Location
    Singapore
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Anyone know which compression algorithm does this?

    Hi all,

    I came across this compression algorithm some time ago, but I can't remember what algorithm it is. What I remember was the compression ratio was about 50%, and there's something unique about the way it pads the compressed payload to a byte. The first 3 bits of the compressed payload indicates the number of padding bits, and they are not all ones or zeros. Instead, it takes the last however many number of bits of the size of the uncompressed data.

    For example, if the first 3 bits indicate that there are 7 padding bits, and the uncompressed data is 1000 bytes long, then the padding bits is 1101000.

    Does anyone know which compression algorithm does something like this?

    Thank you!

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    It was a common method that could be applied to any almost any encoder. Sine the compressed output string is seldom a nice number dividable by 8. People who did not understand bijective coding sometimes had a field in front to tell the space need to write the whole string as any number so the decoder would not think more data is to be decoded. Actually today people don't care about header space so they sometimes store the number of symbols or they can encode a special symbol one time that marks end of compressed stream at the end. Actually for static or dynamic Huffman on short files it might have been common at one time.

  3. #3
    Member
    Join Date
    Aug 2013
    Location
    Singapore
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by biject.bwts View Post
    It was a common method that could be applied to any almost any encoder. Sine the compressed output string is seldom a nice number dividable by 8. People who did not understand bijective coding sometimes had a field in front to tell the space need to write the whole string as any number so the decoder would not think more data is to be decoded.
    Thanks for the reply. I understand that there is a need to pad the compressed output string to a byte, but wouldn't it be easier to just use all zero bits or all one bits? I thought using part of the uncompressed data size as the padding bits was pretty unusual, although I'm not really all that familiar with too many compression algorithms.

    ETA: My next question is are there any tools or methods out there to detect for the kind of compression algorithm used? For example, given a compressed output string, how does one go about finding out how the data was compressed? I've seen mentions of lzmarec to detect for lzma streams, and reflate to detect for deflate streams. Is there any more?
    Last edited by hjazz; 20th March 2014 at 09:23.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    Have you tried precomp? It searches for deflate streams and uncompresses them. http://schnaader.info/precomp.php
    But I guess from your description it is not a deflate stream.

  5. #5
    Member
    Join Date
    Aug 2013
    Location
    Singapore
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Have you tried precomp? It searches for deflate streams and uncompresses them. http://schnaader.info/precomp.php
    But I guess from your description it is not a deflate stream.
    No, I'll give precomp a try.

    Another thing I remember was that there wasn't any literals in the compressed output. If I'm not wrong, I think a lot of compression algorithms do have literals in the output?

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    How do you know if there are literals? deflate uses literals like all LZ77 algorithms, but they are Huffman coded.

  7. #7
    Member
    Join Date
    Aug 2013
    Location
    Singapore
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I compressed a string of the character 'a' with a couple of compression algorithms (LZ4, LZO), and saw at least 1 'a' in the output.

    Is Huffman coding the only thing, or rather, the most common method compression algorithms use to encode literals? If there are different encoding for the same character for different input texts, then does that point to dynamic huffman coding?

    ETA: I've tried the precomp 0.3 and 0.4 in PerfectCompress. Am I right to say if the output PCF contains the same compressed output string, i.e. no change, then the compressed string wasn't a deflate stream?
    Last edited by hjazz; 21st March 2014 at 09:23.

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    Precomp should produce a PCF file that is larger than the input if it contained a deflate string. It should also print a message if it finds something. The detection method isn't perfect but usually works.

    Algorithms using byte-aligned LZ77 for high speed like Snappy and I think LZ4 and LZO will show literals. However, others like deflate and LZMA (7zip) use different methods of encoding literals. Deflate uses a static Huffman code, so a literal might be 7 or 9 bits but in any case it will not be the same bits and usually not aligned on a byte boundary. LZMA compresses the literals and matches using arithmetic coding.

    Maybe it would help if we knew where the files came from.

  9. #9
    Member
    Join Date
    Aug 2013
    Location
    Singapore
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have a couple of compressed output strings, if that would help.

    The compressed output for 1000 bytes of the character 'A' is

    Code:
    5b 27 f8 34 81 4f 6b 28 3b 0f
    Another input text was 1410 bytes of the repeated string "abcdefghijklmnopqrstuvwxyz0123456789", meaning the string repeats until there is a total of 1410 characters.

    Code:
    4c 2b 50 2f f8 b4 04 aa 42 8c 65 9e f1 05 87 5c a4 65 72 91 b2 c6 f7 fa 27 eb b3 5b 45 7b ba 9a ed ee 2c bb c2 2b 25

Similar Threads

  1. Compression algorithm identification
    By igorsk in forum Data Compression
    Replies: 9
    Last Post: 26th April 2014, 21:26
  2. Help identify compression algorithm?
    By DotDotDot in forum Data Compression
    Replies: 0
    Last Post: 1st June 2013, 09:15
  3. Hierarchy compression algorithm and more
    By teddybot in forum The Off-Topic Lounge
    Replies: 7
    Last Post: 3rd May 2012, 01:16
  4. New layer 0 - compression algorithm
    By abocut in forum Data Compression
    Replies: 5
    Last Post: 28th May 2010, 01:32
  5. The best algorithm for high compression
    By Wladmir in forum Data Compression
    Replies: 8
    Last Post: 18th April 2010, 14:54

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •