Hello. I'm just starting out with compression and I want to ask for help. I've learned and implemented LZW, and I've come up with a variant of it, but I don't know whether it's any good, and I haven't been able to get my implementation of it working yet.
First of all, getting started...
LZW-encode:
i = read-longest-match(stream)
emit i
note dict[i] ++ peek(stream)
This is normal LZW, where the dictionary gets added to immediately. When decoding, the encoder's dictionary is one step ahead of the decoder's, so you need an exceptional case to handle the situation where the newly added dictionary word is used right away.
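To show what I mean, here is a minimal Python sketch of a plain LZW decoder with that special case. It is only an illustration with names I made up (lzw_decode, dict_), assuming a byte alphabet and a list of already-read codes:

def lzw_decode(codes):
    # Decoder dictionary is just a list: an entry's code is its index.
    dict_ = [bytes([b]) for b in range(256)]
    prev = dict_[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(dict_):
            entry = dict_[code]
        else:
            # The special case: the code names the entry the encoder created
            # on this very step, which must be prev + first byte of prev.
            entry = prev + prev[:1]
        out.append(entry)
        dict_.append(prev + entry[:1])  # mirror the encoder: dict[i] ++ peek(stream)
        prev = entry
    return b"".join(out)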
I wanted to get rid of that special case, so I delayed adding to the dictionary by one step:
LZW-slow:
i = read-longest-match(stream)
emit i
note dict[j] ++ head(dict[i]) // except first iteration
j = i
(j is the previous iteration's dictionary index)
I tried this change and it worked fine.
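Here is roughly how I picture the delayed version as an encoder, again only a Python sketch with made-up names (lzw_slow_encode is not my real code), assuming a byte alphabet:

def lzw_slow_encode(data: bytes):
    # Dictionary maps string -> code; it starts with all 256 single bytes.
    dict_ = {bytes([b]): b for b in range(256)}
    next_code = 256
    codes = []
    prev = None                      # j: the previous iteration's match
    pos = 0
    while pos < len(data):
        # read-longest-match: greedily extend while the longer string is in the dictionary
        match = data[pos:pos + 1]
        while pos + len(match) < len(data) and data[pos:pos + len(match) + 1] in dict_:
            match = data[pos:pos + len(match) + 1]
        codes.append(dict_[match])               # emit i
        if prev is not None:                     # except first iteration
            dict_[prev + match[:1]] = next_code  # note dict[j] ++ head(dict[i])
            next_code += 1
        prev = match                             # j = i
        pos += len(match)
    return codes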
Now for the issue with LZW:
If you compress the string "AAAAA.....AAAA", it will add AA, AAA, AAAA, AAAAA, ... to the dictionary (one character longer each time), so the result is that the file is only compressed to about the square root of its size.
What I want: add to the dictionary A, AA, AAA, AAAAA, AAAAAAAA, ... (almost doubling each time). This should compress a long repeated string to logarithmic size, which is much better compression.
LZOOM:
i = read-longest-match(stream)
emit i
note dict[j] ++ dict[i]
j = i
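The same sketch as before, with only the new-entry rule changed (previous match ++ current match instead of previous match ++ one character), is what I mean by LZOOM; lzoom_encode is again just a made-up name:

def lzoom_encode(data: bytes):
    dict_ = {bytes([b]): b for b in range(256)}
    next_code = 256
    codes = []
    prev = None                      # j: the previous iteration's match
    pos = 0
    while pos < len(data):
        # read-longest-match, same greedy search as before
        match = data[pos:pos + 1]
        while pos + len(match) < len(data) and data[pos:pos + len(match) + 1] in dict_:
            match = data[pos:pos + len(match) + 1]
        codes.append(dict_[match])           # emit i
        if prev is not None:
            dict_[prev + match] = next_code  # note dict[j] ++ dict[i]
            next_code += 1
        prev = match                         # j = i
        pos += len(match)
    return codes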
Questions
What do you think about this variant?
Have you heard of this variant before? (or is it new?)
Do you think it can give good compression? (I am worried it will be too "scattered")
I tried to implement it, but my code ran really slowly and crashed; I don't know what is wrong.
A known variant of LZW is one where, instead of creating new entries with the rule:
newEntry = existingEntry + character
you instead use:
newEntry = existingEntry1 + existingEntry2
Since characters are part of the initial entry list, this new rule is a superset of the previous one.
Also, if existingEntry1==existingEntry2, you get the doubling you are looking for.
This variant creates longer entries faster.
That doesn't mean it's always better.
Sure, it beats the simpler version for the "AAAA.....AA" corner case,
but for more usual distributions that's less clear, and whatever difference there is usually remains small.
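As a rough sanity check (a throwaway Python sketch, not a benchmark; count_codes and its parameters are made up for this post), you can count how many codes a delayed-update encoder emits when the only difference between the two runs is the new-entry rule:

def count_codes(data: bytes, pair_rule: bool) -> int:
    # Count emitted codes only; pair_rule selects entry+entry vs entry+character.
    dict_ = {bytes([b]): b for b in range(256)}
    prev, pos, n = None, 0, 0
    while pos < len(data):
        # greedy longest match against the current dictionary
        match = data[pos:pos + 1]
        while pos + len(match) < len(data) and data[pos:pos + len(match) + 1] in dict_:
            match = data[pos:pos + len(match) + 1]
        n += 1
        if prev is not None:
            new = (prev + match) if pair_rule else (prev + match[:1])
            dict_.setdefault(new, len(dict_))  # ignore duplicate entries
        prev = match
        pos += len(match)
    return n

run = b"A" * 10_000
print(count_codes(run, pair_rule=False))  # roughly 200 codes: square-root-style growth
print(count_codes(run, pair_rule=True))   # a couple dozen codes: near-doubling growth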
The LZW variant that you (r01) are working on, and that Cyan correctly describes, is known as Miller-Wegman.
The variant implemented by the source code in #4 is Storer's All Prefixes; it adds to the dictionary as follows:
A (first word, in this case a character) | BCD (next word)
New entries: AB, ABC, ABCD
Please note that because the parser needs to know the second word before it starts adding to the dictionary, the encoder and decoder are always in lock-step, and there is no need to handle the LZW special case.
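As a rough illustration (not the actual code from #4; ap_new_entries is a made-up name), the All Prefixes update takes the previous word and the next word and adds the previous word followed by every non-empty prefix of the next word:

def ap_new_entries(prev_word: bytes, next_word: bytes):
    # Storer's All Prefixes rule: previous word + each non-empty prefix of the next word.
    return [prev_word + next_word[:k] for k in range(1, len(next_word) + 1)]

print(ap_new_entries(b"A", b"BCD"))  # [b'AB', b'ABC', b'ABCD']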