Results 1 to 3 of 3

Thread: Need example compression lib that can detect patterns of 2+ bytes long

  1. #1
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    8
    Thanks
    3
    Thanked 0 Times in 0 Posts

    Need example compression lib that can detect patterns of 2+ bytes long

    So... some time ago I made unreleased algorithm (called "mz"), but I put it aside for other projects. Now I wanna finish it off

    OK so... it features very tight pattern compression. It is a sliding-window-based compressor like lz77 compressors.

    The main difference is that mine can store 2-15 byte long "literals", using a 2 byte code. (Later I will allow to store some 2-byte using a 1-byte code.)

    So... the theoretical compression is quite high. I wrote a tree-based pattern-matcher, and... I got bad compression.

    So I tried using LZ4 compression (very simple), then converting the LZ4 file blindly into my format. Basically, LZ4 is a sequence of data + back-references. So I just translated the LZ4 file into my "mz" file. It works fine. The file decompresses perfectly.

    Guess what? Its much tighter to translate an LZ4 compressed file, into an mz file... than to compress it directly.

    In fact, often the files were about 5% smaller than LZ4, despite that LZ4 will miss 3-byte long patterns that could have been compressed.

    Obviously, this proves that my ENCODING is very good, but my PATTERN MATCHER is bad.

    In fact, LZ4 can store 4 byte-long "literals" minimum. ("Then we need to extract the matchlength. For this, we use the second token field, the low 4-bits. Value, obviously, ranges from 0 to 15. However here, 0 means that the copy operation will be minimal. The minimum length of a match, called minmatch, is 4. As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes" https://github.com/lz4/lz4/blob/mast...lock_format.md)

    So, a compressor, that will MISS shorter patterns of 2 and 3 bytes long, generated a tighter format than mine, which can detect them.

    Obviously my tree lookup system is either very bad, or it has a bug in it. (I think its a bug actually. But I am too tired to find it. I want to delete this code and use someone else's)

    ...

    Heres my question finally:

    Does anyone know a compression library that can detect patterns of 2+bytes long?

    The only codes I could find, were to find patterns of 4+ bytes long (such as LZ). Also, they often used "chains", which maybe can miss patterns of 5 bytes or 6 bytes...

    I am very proud of my encoding. I like it a lot. VERY few lines of c++ code, and good encoding. But my pattern matcher is poop. lol.

    I want to use their source code of course.

    Thank you very much!

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    LZMA normally supports 2-byte matches and even 1-byte rep-matches.
    And of course you wouldn't find LZ77 without entropy coding (like LZ4) which could use 2-byte matches.
    This only makes sense if it can improve compression, thus 1-byte code has to be used instead of 2-byte match.
    It could be made I suppose - like 1 bit flag for this token type + 7-bit distance, but that would require using 1 extra bit
    for all other token types, so you'd end with worse overall compression.

    With LZ78 (LZW etc) its possible though.

  3. #3
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    82
    Thanks
    63
    Thanked 33 Times in 21 Posts
    Quote Originally Posted by Shelwien View Post
    And of course you wouldn't find LZ77 without entropy coding (like LZ4) which could use 2-byte matches.
    This only makes sense if it can improve compression, thus 1-byte code has to be used instead of 2-byte match.
    Your implicit assumption is that the encoder must work with whole bytes. There are many LZ77/LZSS-style compressors working with 1-byte matches using variable length codes, but without using true entropy coding (LZB would be a more correct comparison, but no-one really knows LSB). ApLib does it (1-byte matches are replaced by a 7 bit long command). Less well known compressors do it too; Hrum and MegaLZ on ZX Spectrum have 6-bit commands for copying 1-byte matches). Exomizer is using light entropy coding (worse than Huffman) and can do it too.

  4. Thanks (2):

    jibz (15th April 2020),Shelwien (14th April 2020)

Similar Threads

  1. Compression and mining for Multiple Long Common Subsequences
    By Bullbash in forum Data Compression
    Replies: 34
    Last Post: 21st April 2019, 08:06
  2. 7-Zip binaries using the Fast LZMA2 lib
    By Conor in forum Data Compression
    Replies: 54
    Last Post: 3rd May 2018, 05:05
  3. Help to detect unpacker
    By Pehat in forum Data Compression
    Replies: 7
    Last Post: 8th January 2016, 23:13
  4. Replies: 32
    Last Post: 24th September 2013, 00:57
  5. How to detect compression of file
    By achik961 in forum Data Compression
    Replies: 10
    Last Post: 19th January 2013, 04:01

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •