Thread: Original LZ77 algorithm implementation?

1. Original LZ77 algorithm implementation?

Much code labelled "LZ77" exists, but it all produces different output files and sizes. I am looking for a reliable reference implementation of the original LZ77 for study.

1) Why is there so much variation in LZ77 code? Is it not broadly defined? Is the original too obsolete?

2) Does any code exist that faithfully follows the original LZ77? If not, what is the next best choice, LZSS? Haruhiko Okumura's code is possibly a good starting place.

I want to study the simplest practical classic LZ implementation and go from there, then proceed to learn how improvements such as hash tables affect it, etc. Then later look into combining it with other entropy coding (Huffman, range coding).

Any knowledge on the subject would be helpful.

2. https://en.wikipedia.org/wiki/LZ77_and_LZ78 - they provided only the algorithms, not their implementation. Their only interest was to prove that the algorithms are asymptotically optimal, so the algorithms were impractical, and no one bothered to implement them until LZSS/LZW came along with more practical schemes

3. Thanks:

Shelwien (24th November 2019)

4. > I am looking for a reliable, reference implementation of original LZ77 for study.

https://en.wikipedia.org/wiki/LZ77_and_LZ78#Pseudocode

In modern discussions, LZ77 and LZ78 are usually treated as types of compression algorithms,
rather than specific algorithms.
Basically algorithms that decode strings by copying from previously decoded data are LZ77,
while ones that have to maintain a dictionary during decoding are LZ78.

Now there're also several known intermediate stages.
For example, ROLZ (reduced-offset LZ77) is a version which tracks positions separately per context.
And LZP is a special case of ROLZ without position coding.

http://mattmahoney.net/dc/dce.html#Section_525

Btw, PPM can also be interpreted as an LZ-family algorithm...
it's like LZP with unary coding of match length.

> 1) Why is there so much variation of LZ77 code?
> Is it not broadly defined? Original is too obsolete?

It's just hard to find a perfect solution.
Depending on simplicity/patents/speed/memory/platform/ratio/... requirements
we end up with different algorithms.

It may be in theory possible to build a "perfect LZ77"
(say, based on state machines) which could be adapted to any requirements,
but for now it doesn't exist.

> 2) Does there exist some code which faithfully follows the original LZ77?
> If not, what is the next best choice, LZSS?
> Haruhiko Okumura's code is possibly a good starting place.

Yes, but there's basically nothing to "faithfully follow" at that point.
You can just make a new such coder from scratch in like 10 minutes,
simply with an idea to replace repeated strings by references.

Of course, an efficient speed-optimized algorithm would be much harder to make,
but these old implementations are quite inefficient on modern platforms anyway.

> I want to study the simplest practical classic LZ implementation and go from there,
> then proceed to learn how improvements such as hashtables affect it, etc.
> Then later look into combination with other entropy coding (huffman, range coding).

Well, I'd suggest LZ4, deflate(puff.c), then lzma.

5. Thanks:

zikey (24th November 2019)

6. https://encode.su/threads/2266-Stanf...pression-Forum
Points to video: https://www.youtube.com/watch?v=36CJat99i7c
About 15 min in there is author talking about it. Maybe it helps.

7. I think that Bulat Ziganshin and Shelwien answered the question very well. LZ77 is nowadays just an umbrella term for schemes with a sliding window and match referencing. In fact, the original algorithm in the LZ77 paper is absolutely atrocious and needs very special input data for the authors to actually demonstrate an example of data compression (it is not explained particularly clearly, but the example of encoding and decoding is clear enough). Interestingly, the LZSS paper covers many variations on the original LZ77 scheme, some of which I have not really seen explored elsewhere (maybe I need to search more), e.g. forward and backward referencing, or referencing within compressed data. In a way, LZSS would be a much fairer abbreviation to use as an umbrella term for such compression schemes, in my opinion.

I actually want to ask a somewhat tangential question: is there much interest in the community in the history of specific ideas and compression algorithms? I am particularly interested in the LZ77-style variations, but I am sure that for other families of algorithms there is much to cover too. I was thinking of setting up a post and keeping it updated based on newly found bits of information and comments from others. How many people would be interested in something like this?

8. Could be interesting, but also controversial. Some authors would deny the relation, even if it's obvious.
Also, how would you classify PPM (it still maintains a sliding window and copies strings... just with a unary-coded match length (escapes)) or ACB?

9. is there much interest in the community in the history of specific ideas and compression algorithms?
It would be great if we could set up a Wikipedia engine and fill it with whatever info we like. In particular, I can write a bit about LZ77 history, since I have participated in it since the '90s and learned what occurred prior to me.

At the same time, we can use wiki engine to write our own Compressopedia, even if there are multiple variants of it made by different authors. We aren't Britannica and don't need the single version for each thing.

10. My wiki instance is still where you left it btw: http://ctxmodel.net/w/index.php?title=Special:Log :)

11. I think it tends to be easier to start a new wiki than to come up with the actual content...
Thus, I would be fully content starting just here and collecting some information together, before embarking on a big documentation project, which may or may not have enough contributors from the community.

@Shelwien, I understand where you are coming from with your comment about PPM. However, I think, when people talk about LZ77-style compression they almost invariably mean something a lot narrower in scope - a scheme without an explicit statistical predictor, using backreferences to the already decompressed data. This excludes even a number of schemes discussed in the LZSS paper! Obviously, when you have something like LZMA, it is an amalgamation of several types of ideas and the boundaries become blurred. At this stage I do not want to create a volunteer equivalent of Salomon's Handbook of Data Compression - it is a huge undertaking and probably not particularly worth it. Instead, my current ambition is to create a somewhat more balanced and better researched version of https://ethw.org/History_of_Lossless...ion_Algorithms.

12. Originally Posted by Bulat Ziganshin
At the same time, we can use wiki engine to write our own Compressopedia, even if there are multiple variants of it made by different authors. We aren't Britannica and don't need the single version for each thing.
This is an excellent point by the way. There is no need to strive for some sort of an absolute truth here; we can afford to have different outlooks and perspectives.

13. > when people talk about LZ77-style compression they almost invariably mean
> something a lot more narrow in scope - a scheme without explicit
> statistical predictor and backreferences to the already decompressed data

1) LZMA has many "descendants" now - lzham, oodle etc, so they have to be included?
2) There're versions of Okumura LZSS with absolute positions instead of distances
3) There was a period (198x-200x?) with many attempts to reduce LZ77 redundancy
by length/distance masking (eg. LZFG), where the most extreme is ACB
(distance is coded by lexicographic index of prefix, minlength is determined
by suffixes of adjacent strings in prefix-sorted list, length is coded as
number of occurrences of mismatch symbol, rather than full number of symbols).

So it looks like your "LZ77 history" would be very limited :)

14. Originally Posted by Shelwien
1) LZMA has many "descendants" now - lzham, oodle etc, so they have to be included?
2) There're versions of Okumura LZSS with absolute positions instead of distances
3) There was a period (198x-200x?) with many attempts to reduce LZ77 redundancy
by length/distance masking (eg. LZFG), where the most extreme is ACB
(distance is coded by lexicographic index of prefix, minlength is determined
by suffixes of adjacent strings in prefix-sorted list, length is coded as
number of occurrences of mismatch symbol, rather than full number of symbols).

So it looks like your "LZ77 history" would be very limited
Oh, many good points. It would be very limited indeed! Limited is the only way to get it manageable in my opinion.

1) Eventually - maybe, yes. But the lack of focus is going to kill any such undertaking. OK, I do not really know LZHAM. But I do know Oodle and the range of technologies there is staggering. Selkie/Mermaid should probably be in, but things like Kraken and Leviathan - they are pretty much LZMA-style creatures, so to talk about them properly, we need to cover entropy coders and a lot of other stuff.

2) I actually loved this comment - because I noticed that it surprised you in that other thread. Would it surprise you then that the original LZ77 paper actually works with absolute positions in the buffer, not relative offsets? In a way, this was the best explanation, from my own perspective, of why Okumura did it this way - he simply had a good knowledge of the original paper, and this option was just as natural to him. This is the kind of understanding that I am hoping to come up with by digging through these old papers and compressors.

3) I am not aware of a single LZ77-style scheme outside of academia as of 1987-1988. Okumura's article (and code) seems to have been very influential. I suspect that the LZB work was just as important, because in it you can find all of the ideas later implemented in LZS and other kinds of simplistic mixed bit/byte-stream compressors. There is a lot of other important work done during the 1990s, of course; I do not know a lot of it and hope to learn more as I go along.

15. > Limited is the only way to get it manageable in my opinion.

Then maybe it would be better to discard the old part of the history, rather than new?
New codecs are much more relevant, but are also almost totally undocumented.

> so to talk about them properly, we need to cover entropy coders and a lot of other stuff.

I don't see anything bad in that?
Is Huffman less of an entropy coder than AC, RC or ANS?
Even if the developer of some LZ codec is not concerned with entropy coding,
probabilities are still implicitly assigned to coding choices.

> Would it surprise you then that the original LZ77 paper is
> actually working with absolute positions in the buffer, not relative offsets?

I was mostly surprised not by the existence of such an option, but that, after looking at the actual implementation,
I found that it makes sense - with absolute positions it's much easier to track probabilities
of specific matches, while with distances we can only assign shorter codes to smaller distances.
Thanks for reminding me about the idea, btw; I may be able to use it :)

> 3) I am not aware of a single LZ77-style scheme outside of academia as of 1987-1988.
> Okumura's article (and code) seems to have been very influential.

Yeah, it seems that ZOO/ARC/PKZIP1/compress were all based on LZW rather than LZ77.
Later I guess LZW development was stopped by patenting craze.

But LZSS is from 1982 paper and Okumura only made an open-source implementation of it.
I'd not be surprised if there was some obscure software, maybe a game engine,
that used something like LZ77 before 1987.

16. Originally Posted by Shelwien
Then maybe it would be better to discard the old part of the history, rather than new?
New codecs are much more relevant, but are also almost totally undocumented.
Well, it is not very clear how this is supposed to happen. The authors do not document their programs and, judging by another thread, actively prohibit reverse engineering.
I have no interest in this competition for the best modern compressor in whatever category, but I do crave understanding, and the understanding of what was going on in the 1980s seems already to be largely lost.

Originally Posted by Shelwien
Is huffman less of entropy coding than AC, RC or ANS?
Even if the developer of some LZ codec is not concerned with entropy coding,
probabilities are still implicitly assigned to coding choices.
No, my reasons were entirely different. Huffman coding, just by itself, is a big enough topic to talk about.
And the fact that probability is assigned by coding choices is important to understand, but it is not very crucial in terms of the LZ77 methodology, which always assumes some specific, constant distribution of matches.

Originally Posted by Shelwien
Yeah, it seems that ZOO/ARC/PKZIP1/compress were all based on LZW rather than LZ77.
Later I guess LZW development was stopped by patenting craze.

But LZSS is from 1982 paper and Okumura only made an open-source implementation of it.
I'd not be surprised if there was some obscure software, maybe a game engine,
that used something like LZ77 before 1987.
LZB was published in late 1986. Stac's compression is protected by several patents from the 1980s. Okumura is the third source of knowledge about LZ77.
I do not know of any other sources of information about LZ77 at the time. Ross Williams's work was already happening in the early 1990s. Your suggestion about game engines is plausible; however, don't forget that game coding in the 1980s was a bit like the Wild West, with a large proportion of it happening on 8-bit computers. I agree that some people used really modern solutions, but a huge majority of coders were self-taught, often teenagers. So the chances that any of them used something beyond RLE are very slim. For example, I am not aware of a single compressed game on the ZX Spectrum, even though the benefits of faster loading from tape would be immediate and noticeable.

17. > So, the chances that any of them used something beyond RLE are very slim.

I think LZ78 was pretty popular then, so LZ77 just didn't have a chance.
For example: http://www.worldofspectrum.org/infos...cgi?id=0023582

But then
Originally Posted by Okumura
Also, I was told that the algorithm described in Fiala and Greene [4] got patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990. Actually they got three patents! The other two were: "Start, Step, Stop Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems," Application Ser. No. 07/187,699.)

Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, 4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, ceased to use 'compress' program any more because its LZW algorithm got patented.

18. Originally Posted by zikey
Much code labelled "LZ77" exists, but it all produces different output files and sizes. I am looking for a reliable reference implementation of the original LZ77 for study.

1) Why is there so much variation in LZ77 code? Is it not broadly defined? Is the original too obsolete?

2) Does any code exist that faithfully follows the original LZ77? If not, what is the next best choice, LZSS? Haruhiko Okumura's code is possibly a good starting place.

I want to study the simplest practical classic LZ implementation and go from there, then proceed to learn how improvements such as hash tables affect it, etc. Then later look into combining it with other entropy coding (Huffman, range coding).

Any knowledge on the subject would be helpful.
That was my experience when I restarted data compression coding.

I said in my blog that "I, at one time, took pride in my compression programs as truly lossless." Especially my LZ77 programs, because I saw some LZ77 codecs that were actually lossy - at least, according to my tests.

I have LZ77 compressors here in my "The Data Compression Guide" site:

https://sites.google.com/site/datacompressionguide/lz77

lz772/lz773 are LZ77/LZSS, respectively, and lzhh uses Huffman. lzuf2 and lzuf5 use "variable-length" codes. They are the successors to lzuf, which is in Matt Mahoney's Large Text Compression Benchmark. These programs are simple to understand, somewhat book-example style 90's C code. lzgt3a can encode very long match strings at the cost of decoding speed.

19. I think the very early idea behind the LZ77 algorithm is the realization that across a whole file there are *similar* strings. How to encode these strings is the problem. Since these strings precede each other, one string can be encoded via its offset with respect to the last string/s (ROLZ, good with variable-length codes), or relative to a previous block of literals which may contain the string (LZ77). If you code these strings one by one, via their positions or instances in the file or previous block, then you also have to encode the length (LZ77).

But if you encode all these similar strings in one encoding, you simply have to gather them all and encode them using offsets only, "no length code for the succeeding strings past the initial string", but you have to encode this initial string, which is just equivalent to outputting it the first time as literals. I think this is the most basic idea new data compression programmers think of when encoding strings of symbols/bytes. See "Reduced Length LZ (RLLZ)", which also does not transmit a boolean prefix bit indicating whether a token is a literal or a match string:

https://encode.su/threads/3013-Reduc...put-LZ77-codes

This is done by outputting the literals last, after all the match strings are encoded. During decoding, the literals nicely fit in the holes or gaps in the output buffer not occupied by the match strings!
