So... some time ago I wrote an unreleased compression algorithm (called "mz"), but I put it aside for other projects. Now I want to finish it off.

OK so... it features very tight pattern compression. It is a sliding-window compressor, like the LZ77 family.

The main difference is that mine can store 2-15 byte long "literals" (by which I mean matched patterns, i.e. back-references) using a 2-byte code. (Later I will allow storing some 2-byte ones using a 1-byte code.)

So... the theoretical compression is quite high. I wrote a tree-based pattern-matcher, and... I got bad compression.

So I tried compressing with LZ4 (very simple) and then blindly converting the LZ4 file into my format. Basically, an LZ4 file is a sequence of raw data + back-references, so I just translated each of those into its "mz" equivalent. It works fine: the file decompresses perfectly.

Guess what? It's much tighter to translate an LZ4-compressed file into an mz file than to compress the data directly with my own matcher.

In fact, the translated files were often about 5% smaller than the LZ4 files, even though LZ4 misses 3-byte patterns that could have been compressed.

This strongly suggests that my ENCODING is very good, but my PATTERN MATCHER is bad.

In fact, the shortest "literal" (match) LZ4 can store is 4 bytes long. From the LZ4 format description: "Then we need to extract the matchlength. For this, we use the second token field, the low 4-bits. Value, obviously, ranges from 0 to 15. However here, 0 means that the copy operation will be minimal. The minimum length of a match, called minmatch, is 4. As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes"
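
For anyone following along, here is a rough sketch of how a raw LZ4 block can be walked into (literal run, offset, match length) records, which is the form a translator into another format would consume. It follows the LZ4 block layout as I understand it (token, optional extra length bytes, literals, 2-byte little-endian offset, match length + minmatch). The names `Lz4Sequence`, `readExtendedLength` and `parseLz4Block` are made up for this example, and it handles a single raw block only, not the LZ4 frame header.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// One LZ4 sequence: a run of raw bytes followed by a back-reference.
// (Hypothetical struct, named for this sketch only.)
struct Lz4Sequence {
    std::vector<uint8_t> literals; // raw bytes copied as-is
    uint16_t offset = 0;           // distance back into the output; 0 = no match (last sequence)
    uint32_t matchLength = 0;      // decoded length, minmatch already added
};

constexpr uint32_t kMinMatch = 4;  // LZ4's minimum match length

// A 4-bit length field of 15 is followed by extra bytes that are summed
// until a byte other than 255 is read.
static uint32_t readExtendedLength(const std::vector<uint8_t>& in, std::size_t& pos, uint32_t base) {
    uint32_t len = base;
    if (base == 15) {
        uint8_t b;
        do {
            if (pos >= in.size()) throw std::runtime_error("truncated length");
            b = in[pos++];
            len += b;
        } while (b == 255);
    }
    return len;
}

std::vector<Lz4Sequence> parseLz4Block(const std::vector<uint8_t>& in) {
    std::vector<Lz4Sequence> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        const uint8_t token = in[pos++];
        Lz4Sequence seq;

        // High 4 bits of the token: literal run length.
        const uint32_t litLen = readExtendedLength(in, pos, token >> 4);
        if (litLen > in.size() - pos) throw std::runtime_error("truncated literals");
        seq.literals.assign(in.begin() + static_cast<std::ptrdiff_t>(pos),
                            in.begin() + static_cast<std::ptrdiff_t>(pos + litLen));
        pos += litLen;

        // The last sequence of a block ends after its literals.
        if (pos == in.size()) { out.push_back(std::move(seq)); break; }

        if (in.size() - pos < 2) throw std::runtime_error("truncated offset");
        seq.offset = static_cast<uint16_t>(in[pos] | (in[pos + 1] << 8));
        pos += 2;

        // Low 4 bits of the token: match length, where 0 means kMinMatch bytes.
        seq.matchLength = readExtendedLength(in, pos, token & 0x0Fu) + kMinMatch;
        out.push_back(std::move(seq));
    }
    return out;
}
```

Each returned record can then be re-emitted in the target format. Since every LZ4 match is at least 4 bytes long, a 2-15 byte match code would presumably need long LZ4 matches split into several pieces.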

So a compressor that MISSES shorter patterns of 2 and 3 bytes produced tighter output than mine, which can detect them.

Obviously my tree lookup system is either very bad or has a bug in it. (I think it's a bug, actually, but I am too tired to find it. I want to delete this code and use someone else's.)
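
One way to check whether a matcher has a bug is to compare it against an exhaustive search, which is far too slow for real use but cannot miss anything. A minimal sketch follows. `Match` and `bruteForceLongestMatch` are names invented here, and the window size and the 2-15 length limits are passed in as parameters rather than taken from the real mz format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of a match search (hypothetical type for this sketch).
struct Match {
    std::size_t offset = 0; // distance back from pos; 0 means no match found
    std::size_t length = 0; // number of matching bytes
};

// Tries every earlier start position inside the window and keeps the longest
// match, preferring the nearest candidate on ties. Assumes minLen >= 1.
// Cost is O(window * maxLen) per position, so this is only a correctness reference.
Match bruteForceLongestMatch(const std::vector<uint8_t>& data, std::size_t pos,
                             std::size_t window, std::size_t minLen, std::size_t maxLen) {
    Match best;
    const std::size_t start = pos > window ? pos - window : 0;
    for (std::size_t cand = start; cand < pos; ++cand) {
        std::size_t len = 0;
        // The match may run past pos (overlapping copy), as usual in LZ77.
        while (len < maxLen && pos + len < data.size() &&
               data[cand + len] == data[pos + len]) {
            ++len;
        }
        if (len >= minLen && len >= best.length) {
            best.offset = pos - cand;
            best.length = len;
        }
    }
    return best;
}
```

Running both matchers over the same input and stopping at the first position where the tree returns a shorter match than `bruteForceLongestMatch(data, pos, window, 2, 15)` should point fairly directly at the failing case.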


Here's my question, finally:

Does anyone know a compression library that can detect patterns 2+ bytes long?

The only code I could find looks for patterns 4+ bytes long (such as LZ). Also, it often uses "chains", which maybe can miss patterns of 5 or 6 bytes...
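
To illustrate the "chains" idea with a 2-byte minimum, here is a sketch of a chain matcher that is indexed by the exact 2-byte prefix instead of a hash, so two positions share a chain only if their first two bytes really are equal. Because it walks the whole chain inside the window, it should not miss a match of 2+ bytes. Real libraries usually cap the chain depth for speed, which is where misses can come from. `ChainMatch`, `TwoBytePrefixMatcher` and `kNoPosition` are names invented for this example; it is not taken from any existing library.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ChainMatch {
    std::size_t offset = 0; // distance back from pos; 0 means no match found
    std::size_t length = 0;
};

constexpr std::size_t kNoPosition = static_cast<std::size_t>(-1);

// Keeps, for each of the 65536 possible 2-byte prefixes, a linked list of the
// positions where that prefix occurred (newest first).
// The caller must keep `data` alive for as long as the matcher is used.
class TwoBytePrefixMatcher {
public:
    explicit TwoBytePrefixMatcher(const std::vector<uint8_t>& data)
        : data_(data), head_(65536, kNoPosition), prev_(data.size(), kNoPosition) {}

    // Registers pos as a future match candidate. Call in increasing order of pos.
    void insert(std::size_t pos) {
        if (pos + 1 >= data_.size()) return; // no full 2-byte prefix here
        const std::size_t key = keyAt(pos);
        prev_[pos] = head_[key];
        head_[key] = pos;
    }

    // Longest match (2..maxLen bytes, maxLen >= 2 assumed) for pos among
    // previously inserted positions no further back than `window`.
    ChainMatch find(std::size_t pos, std::size_t window, std::size_t maxLen) const {
        ChainMatch best;
        if (pos + 1 >= data_.size()) return best;
        for (std::size_t cand = head_[keyAt(pos)]; cand != kNoPosition; cand = prev_[cand]) {
            if (cand >= pos) continue;      // ignore pos itself if already inserted
            if (pos - cand > window) break; // older entries are even further away
            std::size_t len = 2;            // the first two bytes are equal by construction
            while (len < maxLen && pos + len < data_.size() &&
                   data_[cand + len] == data_[pos + len]) {
                ++len;
            }
            if (len > best.length) {
                best.length = len;
                best.offset = pos - cand;
            }
        }
        return best;
    }

private:
    std::size_t keyAt(std::size_t pos) const {
        return (static_cast<std::size_t>(data_[pos]) << 8) | data_[pos + 1];
    }

    const std::vector<uint8_t>& data_;
    std::vector<std::size_t> head_; // newest position for each 2-byte prefix
    std::vector<std::size_t> prev_; // previous position with the same prefix
};
```

The intended usage is to call `find(pos, ...)` first and then `insert` every position that the encoder steps over, including the ones covered by the match it just emitted. The trade-off is that long runs of identical bytes make the chains long, so walking them in full can get slow.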

I am very proud of my encoding. I like it a lot: VERY few lines of C++ code, and good encoding. But my pattern matcher is poop. lol.

I would want to use the library's source code, of course.

Thank you very much!