
Thread: LZ77 Variation Idea

  1. #1
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts

    LZ77 Variation Idea

    nburns asked me a while back about whether Tree could use an existing LZ format. At the time, I couldn't think of a way, but now I have an idea and am curious if anyone has tried anything similar. So here's my idea. It's not really all that much like Tree, but might offer some of the advantages if I understand LZ77 correctly.

    1. Start with basic LZ77 algorithm that supports large data blocks.
    2. Redefine what literals are. Instead of representing single-byte raw characters, literals will be all single-byte raw characters plus all offset+length pairs that have been used.
    3. Change how the history is built. Whenever a non-literal code is sent, instead of putting the raw (uncompressed) data into the history, put the literals used to write the code in the history and then add the string represented by the offset+length pair as a "parallel" history.
    4. When matches are found, use the parallel history symbols in place of the regular history wherever possible.
    5. Optimize coding to be more efficient with shorter average offsets, shorter average lengths, and a higher percentage of literals.

    I left out a few details, but hopefully that's enough to describe the basic characteristics of the key features. The thought is that compression ratios would be improved because the history is "compressed", which would yield shorter offsets and shorter lengths on average, at the cost of (most likely) an increase in the number of symbols sent. I think it would also not use as much memory as standard LZ77, but that's just a guess.
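
    To make step 3 a bit more concrete, here is a rough Python sketch of a toy decoder for this kind of format. The code layout and names are my own illustration, not an existing format, and I've collapsed the "parallel" history into a single list of symbols that each remember the bytes they expand to.

    Code:
    # Toy decoder sketch for the "compressed history" idea (step 3 above).
    # A code is either ('lit', byte) or ('match', offset, length), with offset and
    # length counted in history *symbols*, not raw bytes. Each history entry keeps
    # the raw bytes it expands to, so matches can be resolved without re-parsing.

    def decode(codes):
        history = []            # one byte string per symbol seen so far
        out = bytearray()
        for code in codes:
            if code[0] == 'lit':
                sym = bytes([code[1]])
                history.append(sym)
                out += sym
            else:
                _, offset, length = code
                start = len(history) - offset
                referenced = history[start:start + length]
                # the match contributes its component symbols to the history,
                # keeping the history "compressed" (counted in symbols)
                history.extend(referenced)
                for sym in referenced:
                    out += sym
        return bytes(out)

    # Example: "abcabc" as three literals plus one 3-symbol match at offset 3.
    assert decode([('lit', 97), ('lit', 98), ('lit', 99), ('match', 3, 3)]) == b'abcabc'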

    Has anyone ever tried something like this?

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,495
    Thanks
    26
    Thanked 131 Times in 101 Posts
    You've described a brand new format and not an existing one :]


    I have another idea for something in between Tree and "normal" LZ. In LZ77 schemes you output an offset in bytes plus a length in bytes. In Tree you output a symbol representing a token whose contents were defined somewhere before. In the hybrid scheme you could output offsets in tokens instead of bytes and skip outputting the length, as tokens have a predefined one. As in Tree, non-repeated raw literal sequences wouldn't be saved as tokens. The advantage of that scheme over Tree is that you don't have to maintain an MTF queue of recent tokens to shorten their code lengths - instead, frequently repeating tokens will have small offsets. The disadvantage is that frequent tokens will pollute the token stream with multiple copies, increasing offsets for other tokens and increasing memory usage.

    As you don't need copies of tokens older than the freshest one, you could periodically do garbage collection of the token stream. And you could do that like the JVM (or other VMs for managed languages), i.e. do a full garbage collection rarely, do GC on the last 1 million tokens moderately frequently, and do GC on the last 1 thousand tokens very frequently. You could do the GC using a parallel mark & sweep method, assuming every token instance has a reference to its previous instance/occurrence. And because the coding scheme would somewhat resemble existing LZ schemes, you could borrow tricks from them.
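
    To illustrate the tiered garbage collection, here is a minimal Python sketch, assuming the token stream is just a list of hashable tokens and "garbage" means any instance that has a fresher instance later in the window being collected. The encoder and decoder would of course have to run the same collection at the same points to stay in sync; the window sizes and trigger counts below are only the example numbers from the paragraph above.

    Code:
    # Minimal sketch of tiered GC over a token stream (my own illustration).
    # Only the freshest copy of each token inside the collected window is kept;
    # duplicates between the untouched head and the window survive until a
    # larger (or full) collection reaches them, like a generational GC.

    def collect_tail(stream, window):
        head, tail = stream[:-window], stream[-window:]
        seen, kept = set(), []
        for tok in reversed(tail):          # scan newest -> oldest
            if tok not in seen:
                seen.add(tok)
                kept.append(tok)
        kept.reverse()
        return head + kept

    def maybe_collect(stream, tokens_emitted):
        if tokens_emitted and tokens_emitted % 1_000_000 == 0:
            return collect_tail(stream, len(stream))              # rare full GC
        if tokens_emitted and tokens_emitted % 1_000 == 0:
            return collect_tail(stream, min(1_000, len(stream)))  # frequent small GC
        return stream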

  3. #3
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    482
    Thanked 279 Times in 119 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    You've described a brand new format and not an existing one :]
    Well, I feel the proposed scheme has quite some resemblance with LZW.

    The main difference I noticed is the existence of two "parallel" histories, while LZW maintains a single index.
    But it's unclear whether this separation brings any tangible benefit.
    Each "history" is of course shorter, but there is now a need to select which one to use, which will also bring some cost.

    Update rules are also different.
    The implied MTF scheme should improve compression ratio, but it's also going to be CPU intensive.

    Most importantly, the strength of the solution will rely on points 4 & 5, which are very parsing dependent.
    It's unclear what the final performance will be. The scheme has "potential" if the parsing is good, but its associated CPU cost is going to be quite significant.

    For the record, I did try something similar (a long time ago): an LZ scheme using a "history" index instead of an offset. I found that the scheme was beneficial if, and only if, older and used positions were removed from the list of choices as decompression progresses. But it was also CPU intensive, on both sides, since compression and decompression need to stay synchronized. In the end, I found that the final compression benefit wasn't really worth the CPU cost.

    But it was a long time ago. I have kept the idea in memory, just in case it would become useful later, after learning some new tricks.
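
    For what it's worth, here is a rough Python sketch of how I picture the "index into a shrinking history list" decoder described above. The toy format of ('lit', byte) and ('ref', index, length) codes is my own guess at the general shape, not the actual old experiment; the key point it shows is that used positions leave the candidate list, so later indexes stay small.

    Code:
    # Sketch of an LZ decoder that codes a history *index* instead of an offset
    # and removes a position from the candidate list once it has been used.

    def decode(codes):
        out = bytearray()
        candidates = []                      # positions later matches may refer to
        for code in codes:
            if code[0] == 'lit':
                candidates.append(len(out))
                out.append(code[1])
            else:
                _, index, length = code
                pos = candidates.pop(index)  # used positions leave the list
                candidates.append(len(out))  # the copy itself becomes a candidate
                for i in range(length):      # byte by byte, so overlaps work
                    out.append(out[pos + i])
        return bytes(out)

    # 'ab' as literals, then index 0 with length 4 copies "abab" (overlapping).
    assert decode([('lit', 97), ('lit', 98), ('ref', 0, 4)]) == b'ababab'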

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,495
    Thanks
    26
    Thanked 131 Times in 101 Posts
    Quote Originally Posted by Cyan View Post
    For the record, I did try something similar (a long time ago): an LZ scheme using a "history" index instead of an offset. I found that the scheme was beneficial if, and only if, older and used positions were removed from the list of choices as decompression progresses. But it was also CPU intensive, on both sides, since compression and decompression need to stay synchronized. In the end, I found that the final compression benefit wasn't really worth the CPU cost.
    But have you done it like I described, i.e. using tiered garbage collection, or were you removing garbage immediately?

    Garbage here is a particular instance of a token that appeared before the last instance of that token (as assigning one token many offsets at the same time introduces redundancy).

    (Assuming I understand your algorithm correctly)

  5. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    482
    Thanked 279 Times in 119 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    But have you done it like I described, i.e. using tiered garbage collection, or were you removing garbage immediately?
    Immediately.
    Doing it "later", à la garbage collector, would have had an impact on the compression ratio.
    The efficiency was closely tied to this ability to remove indexes (and hence shorten the values of subsequent indexes).

    For the record, I also tried to "not remove" any index, which was good for speed.
    But the resulting compression ratio wasn't improved much compared to "classic" LZ. So it was also not worthwhile.

    Note that, even if there are some common points between the schemes we described, there is no guarantee that we are talking about exactly the same thing. So it could be that the conclusions I reached back then would be different today for a closely-related-but-nonetheless-different scheme.

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,495
    Thanks
    26
    Thanked 131 Times in 101 Posts
    Still, the tiered garbage collection looks (to me) like it has the potential to provide a satisfactory compromise between speed loss and ratio gain. I'm assuming you haven't tried anything like that.
    Last edited by Piotr Tarsa; 17th September 2014 at 18:53.

  7. #7
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Thanks for your thoughts.

    @Piotr: Yes, the hybrid scheme you described is consistent with my thoughts. Garbage collection should be included and is one of the details I left out. Your ideas sound good. My initial thought was to collect all garbage immediately, but you are right, the garbage collection can be intelligently scheduled without causing a big file size penalty.

    I have been thinking about fast compression methods (relative to Tree), which is what led me down this path. The dumbest simplified Tree-like compression algorithm I can think of fits within what I tried to describe and seems similar to existing LZ algorithms except for the history referencing compressed symbols instead of uncompressed data. You are right, no existing LZ code would recognize the format, but I think it wouldn't be too hard to modify an existing LZ program to recognize it. Perhaps deflate would be a good candidate to try it on.

    @Cyan: Please correct me if I am wrong, but I think LZW doesn't support match lengths longer than two symbols and new dictionary symbols are created for most output symbols. The algorithm I described (partially) would support the immediate creation of new symbols representing many symbols (like LZ77) and would have a significantly smaller dictionary than LZW since the majority of symbols sent would be literals (hopefully). In terms of the parallel history, this is just an idea in its infancy. I was thinking the code would use the regular history until it aligns with the parallel history, and then switch to the parallel history. Maybe it would switch back at the end, but that would consume extra bits to identify which history to use.

    I am curious which LZ algorithm you started with. It's disappointing that the results were not that impressive, but much can depend on the implementation details. I suspect the compression ratio would be highly dependent on the encoder's ability to make smart choices for which literals and offset+length codes to use.

  8. #8
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    482
    Thanked 279 Times in 119 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think LZW doesn't support match lengths longer than two symbols and new dictionary symbols are created for most output symbols.
    Yes, you're right, that's an important difference.
    LZW can have "symbols" (or "dictionary entries") of any length, but they take time to build up: byte by byte in the basic scheme or, in some more recent derivatives, by joining two existing dictionary entries.
    Your proposition is more flexible. But as a consequence, it also adds the requirement to provide a "length" field, which introduces some new costs.


    Quote Originally Posted by Kennon Conrad View Post
    I am curious which LZ algorithm you started with. It's disappointing that the results were not that impressive, but much can depend on the implementation details. I suspect the compression ratio would be highly dependent on the encoder's ability to make smart choices for which literals and offset+length codes to use.
    Correct. I can't say I spent time on the parsing scheme. It was just a quick test, to "taste" whether the basic idea held promise or not. I was disappointed by the results and stopped there. But that doesn't prove there isn't a jewel next door.
    I also believe that clever parsing is key to extracting the most benefit from the scheme. I also fear that it's going to be quite CPU intensive.

  9. #9
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by Cyan View Post
    Yes, you're right, that's an important difference.
    LZW can have "symbols" (or "dictionary entries") of any length, but they take time to build up: byte by byte in the basic scheme or, in some more recent derivatives, by joining two existing dictionary entries.
    Your proposition is more flexible. But as a consequence, it also adds the requirement to provide a "length" field, which introduces some new costs.
    Thanks for the confirmation. I am still learning about existing compression methods, but when I compared existing LZ77-style compression ratios to LZ78-style ones, I found that LZ77 seems to do better. I suspect the additional length field pays off because the added flexibility increases the efficiency of the dictionary by more than the cost of having the field. With efficient coding, the cost of including the length field may not be as much as you would expect. When my compression program (Tree) compresses enwik9, the average cost of including the length code is less than two bits. Length two is assumed (about 65% of dictionary entries), and an escape bit is used if the length is not two. Tree's compressed files could be directly converted to an LZ77-like format, but I am not aware of any LZ77 format that counts compressed symbols instead of raw data. If the converted Tree file had to count raw data, I imagine the compressed file size would take a significant hit.
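
    As a sanity check on the "less than two bits" figure, here is a back-of-the-envelope calculation. The 65% share and the 1-bit escape come from the paragraph above; the 2.5 bits assumed for coding a length other than two is just my own placeholder, not Tree's actual number.

    Code:
    # Rough average length-code cost under the stated split (placeholder numbers
    # for the cost of coding lengths other than two).
    p_len2 = 0.65           # fraction of dictionary entries with length two
    escape_bit = 1.0        # every entry pays the escape/flag bit
    other_len_bits = 2.5    # assumed extra bits when the length is not two
    avg = p_len2 * escape_bit + (1 - p_len2) * (escape_bit + other_len_bits)
    print(avg)              # 1.875 bits per dictionary entry, under 2 bits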

    Quote Originally Posted by Cyan View Post
    Correct. I can't say I spent time on the parsing scheme. It was just a quick test, to "taste" whether the basic idea held promise or not. I was disappointed by the results and stopped there. But that doesn't prove there isn't a jewel next door.
    I also believe that clever parsing is key to extracting the most benefit from the scheme. I also fear that it's going to be quite CPU intensive.
    Yes, it does seem like a smart algorithm will be slow relative to other LZ programs. But I am thinking about this as a possible way to speed up compression for Tree, which is extraordinarily slow because it is a recursive, top-down compressor (about 27 KB/sec for enwik9 on a 4 GHz i7-4790K). I think a smart linear program could be much faster than that and perhaps maintain Pareto frontier decompression size/speed performance (excluding other Tree variations).

  10. #10
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    I mean...

    LZFG-alike

    Build a Suffix Trie of the dictionary as you go.
    Use path compression.
    Matches are coded by writing a node index, and then additionally a match length along the compressed path (usually a small bound).
    Matches at leaves have unbounded match length.

    There is no redundancy in string references - each substring is unique in the Trie. All possible string matches can be coded.

    Recency modeling can be done with an MTF on the node indexes.

    This is good for text but not for binary because it destroys offset patterns.

    Schemes like this are very well known, they're just not Pareto when compared to PPM on text.
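
    For the recency modeling mentioned above, a plain move-to-front transform over the node indexes is enough to show the idea; the entropy coder then sees small ranks for recently used nodes. A minimal sketch in Python (the node IDs and alphabet are made up for the example):

    Code:
    # Move-to-front over node indexes: recently referenced nodes get small ranks.

    def mtf_encode(node_ids, alphabet):
        table = list(alphabet)              # current recency order of node indexes
        ranks = []
        for node in node_ids:
            r = table.index(node)
            ranks.append(r)
            table.insert(0, table.pop(r))   # move the used node to the front
        return ranks

    # Repeated references to node 7 cost rank 0 right after the first use.
    print(mtf_encode([7, 7, 3, 7, 3], alphabet=range(10)))   # [7, 0, 4, 1, 1]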
    Last edited by cbloom; 3rd August 2016 at 20:37.

  11. #11
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I think my idea was basically that Tree's parsing could be represented as (start,length) pairs instead of token references without much loss of information. (Start, length) pairs would take up more space initially, but, since the parsing is optimized to reuse matches, it should be easy for EC.

    Edit:
    The hope is that building the offline dictionary wouldn't be a waste of time, even though it isn't used directly. AFAIK, choosing LZ matches so that the references compress near optimally with Huffman is an unsolved problem. Even though that wasn't what Tree was designed specifically to do, I think there's a good bit of overlap between that problem and the problem solved by Tree.

    I don't really see any obvious advantage to this over what Tree does now. But since LZ compressors are actively researched, if it were me, I'd like to take credit for advancing an LZ problem even more than I'd like to take credit for working in a more novel area that takes much longer to explain.
    Last edited by nburns; 25th September 2014 at 06:11.

  12. #12
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by cbloom View Post
    I mean...

    LZFG-alike

    Build a Suffix Trie of the dictionary as you go.
    Use path compression.
    Matches are coded by writing a node index, and then additionally a match length along the compressed path (usually a small bound).
    Matches at leaves have unbounded match length.

    There is no redundancy in string references - each substring is unique in the Trie. All possible string matches can be coded.

    Recency modeling can be done with an MTF on the node indexes.

    This is good for text but not for binary because it destroys offset patterns.

    Schemes like this are very well known, they're just not Pareto when compared to PPM on text.
    LZFG isn't shown on Wikipedia or dce. It's an interesting way of doing LZ and has some similarities to what I have been thinking about, but I'm not seeing where a dictionary is used in LZFG.

    The program I am thinking of would be able to create "copies" whose offsets and lengths are counted in compressed symbols rather than offsets and lengths over only literal/uncompressed sequences. I am starting to think that type of scheme is not as well known.

  13. #13
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by nburns View Post
    I think my idea was basically that Tree's parsing could be represented as (start,length) pairs instead of token references without much loss of information. (Start, length) pairs would take up more space initially, but, since the parsing is optimized to reuse matches, it should be easy for EC.

    Edit:
    The hope is that building the offline dictionary wouldn't be a waste of time, even though it isn't used directly. AFAIK, choosing LZ matches so that the references compress near optimally with Huffman is an unsolved problem. Even though that wasn't what Tree was designed specifically to do, I think there's a good bit of overlap between that problem and the problem solved by Tree.

    I don't really see any obvious advantage to this over what Tree does now. But since LZ compressors are actively researched, if it were me, I'd like to take credit for advancing an LZ problem even more than I'd like to take credit for working in a more novel area that takes much longer to explain.
    I agree 100%. I see no reason Tree's compressed files couldn't be converted to a format that uses (start,length) pairs instead of references and think the loss of compression ratio may not be significant if done well. The new format would have some limitations of what could be referenced relative to LZ77, but I think the gain from using offsets and lengths of compressed symbols over uncompressed/raw symbols would more than make up for the limitations on references.

    As many people have correctly pointed out, Tree is too slow on compression, making it more of a novelty item than a useful compressor. The advantage I am looking for in comparing LZ77-like methods and formats to Tree is faster compression. I am willing to give up some compression efficiency to get there if compression times can become reasonable.

  14. #14
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Kennon Conrad View Post
    The new format would have some limitations of what could be referenced relative to LZ77, but I think the gain from using offsets and lengths of compressed symbols over uncompressed/raw symbols would more than make up for the limitations on references.
    I'm not 100% sure what you mean, but limitations can be good. Limitations are predictable and so you don't need to code them.


  15. #15
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by nburns View Post
    I'm not 100% sure what you mean, but limitations can be good. Limitations are predictable and so you don't need to code them.
    My thoughts go in slightly different directions as time goes by, so I imagine the brief explanations could be hard to follow. Here's a rough example of what I am thinking.

    Suppose we needed to compress the phrase "How much wood would a woodchuck chuck if a woodchuck could chuck wood?", and had previously created dictionary entries as follows:
    1: "How"
    2: " much"
    3: " wood"
    4: " would"
    5: " a"
    6: " woodchuck"
    7: " if"
    8: " could"

    The phrase could be written with length/distance pairs as follows:
    1,? (offset to previous "How")
    1,? (offset to previous " much")
    1,? (offset to previous " wood")
    1,? (offset to previous " would")
    1,? (offset to previous " a")
    1,? (offset to previous " woodchuck")
    1,? (offset to previous " chuck")
    1,? (offset to previous " if")
    1,4 (" a")
    1,4 (" woodchuck")
    1,? (offset to previous " could")
    1,5 (offset to " chuck")
    1,10 (offset to " wood")
    1,? (offset to previous "?")

    The goal would be to make the lengths smaller (around 80% - 90% would be length 1 and over 50% of the remainder would have length 2 for typical text files) and also to make the offsets smaller/more predictable. For instance, the second " woodchuck" would have a length of 1 and an offset of 4 instead of a length of 10 and an offset of 21 for standard LZ77 (ignoring the fact that LZ77 would more likely choose " a woodchuck" or possibly even something else).
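
    The offsets in the example above are easy to check mechanically. Here is a small Python snippet that walks the token stream (using the same word split as the dictionary) and prints the offset, in tokens, back to the previous occurrence of each token; the "?" cases are tokens whose previous occurrence lies before the example:

    Code:
    # Offsets counted in tokens back to the previous occurrence of the same token.
    tokens = ["How", " much", " wood", " would", " a", " woodchuck", " chuck",
              " if", " a", " woodchuck", " could", " chuck", " wood", "?"]

    last_seen = {}
    for i, tok in enumerate(tokens):
        if tok in last_seen:
            print(f"(1,{i - last_seen[tok]})  # {tok!r}")
        else:
            print(f"(1,?)   # first {tok!r} here; offset points into earlier history")
        last_seen[tok] = i
    # Prints (1,4) for the second " a" and " woodchuck", (1,5) for the second
    # " chuck" and (1,10) for the last " wood", matching the list above.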

    My current thinking is that new dictionary entries would be created whenever the length was not 1 and would contain an offset and length pair for each symbol in the entry. For instance, in the above example, the first " a" and " woodchuck" could be encoded as length 2 with the length/offset codes as above. Then the second " a woodchuck" could be written with a single length/offset pair of (1,3). Also, since most symbols would have length 1, it could make sense to use an escape symbol to indicate that a code is not length 1, so this could easily be just (3).

    This would really just be an alternate format for Tree's encoded stream, since the limitations and capabilities on what can be tokenized are identical to Tree's. For instance, neither Tree nor this method could access the " wood" in " woodchuck", and both methods would use multi-level dictionaries (symbols can be written with other symbols in addition to the original characters).

    This type of compressor could be similar to TreeCompress, with one major difference. Instead of creating new symbols recursively from best to worst, they would be created in one pass from the beginning of the file to the end of the file whenever the number of occurrences of a string passed a threshold calculated from the length of the string, the number of symbols in the file, and (probably) the probabilities of the individual symbols (like TreeCompress). The program would make a suffix tree (trie?) of the input plus the dictionary, with the dictionary at the end of the file and each entry preceded by a unique identifier to block matches. New symbols would be created by adding them to the dictionary, replacing the matching suffix tree paths with the new symbol, and making the associated suffix tree node counter updates and branch deletions. When all the input has been parsed, the program would create the output from the final suffix tree.
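
    To make the threshold test concrete, here is one hypothetical shape it could take. This is my own illustration of a gain estimate built from the quantities mentioned above (string cost via symbol probabilities, symbol count, occurrence count), not TreeCompress's actual formula.

    Code:
    import math

    # Hypothetical test: tokenizing is worthwhile if the bits saved by replacing
    # `count` copies of the string with a new symbol exceed the one-time cost of
    # writing the dictionary entry.
    def worth_tokenizing(count, symbol_probs, total_symbols):
        string_bits = sum(-math.log2(p) for p in symbol_probs)  # cost of the string
        new_symbol_bits = -math.log2(count / total_symbols)     # assumed code cost
        entry_bits = string_bits                                 # define it once
        return count * (string_bits - new_symbol_bits) > entry_bits

    # e.g. a 3-symbol string seen 50 times in a 1M-symbol file
    print(worth_tokenizing(50, [0.01, 0.02, 0.005], 1_000_000))  # True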
    Last edited by Kennon Conrad; 26th September 2014 at 08:02.

  16. Thanks:

    surfersat (26th September 2014)

  17. #16
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    If you were only compressing words followed by spaces, could not one do a sort of DOUBLE LZW, where one of them is building the linked list of words, and the other, if the word is not in the list, builds the new word/space to be used in the other LZW? It would be easy to make it bijective and ONE PASS.

  18. #17
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Kennon Conrad View Post
    The phrase could be written with length/distance pairs as follows:
    1,? (offset to previous "How")
    1,? (offset to previous " much")
    1,? (offset to previous " wood")
    1,? (offset to previous " would")
    1,? (offset to previous " a")
    1,? (offset to previous " woodchuck")
    1,? (offset to previous " chuck")
    1,? (offset to previous " if")
    1,4 (" a")
    1,4 (" woodchuck")
    1,? (offset to previous " could")
    1,5 (offset to " chuck")
    1,10 (offset to " wood")
    1,? (offset to previous "?")
    What you're describing sounds a lot like simply replacing the tokens and then running LZ on the resulting stream. You could probably get a pretty good idea of what that would be like by writing out the replaced stream without further processing and then gzipping it.

  19. #18
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    This looks like a variation of LZW where you code the offset instead of the dictionary index, and the length can be more than 1. But if you have a rule that every pair of symbols is added to the dictionary, then the length should always be 1.

  20. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    This looks like a variation of LZW where you code the offset instead of the dictionary index, and the length can be more than 1. But if you have a rule that every pair of symbols is added to the dictionary, then the length should always be 1.
    With those modifications to LZW, it no longer sounds like LZW to me. It sounds as close to LZ77 as to LZW, anyway. I get the feeling that, to some people at least, LZW stands for any dictionary scheme where entries are associated with single numbers. To me, LZ78/LZW only represent one fairly primitive example of such schemes.

  21. #20
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Quote Originally Posted by biject.bwts View Post
    If you were only compressing words followed by spaces, could not one do a sort of DOUBLE LZW, where one of them is building the linked list of words, and the other, if the word is not in the list, builds the new word/space to be used in the other LZW? It would be easy to make it bijective and ONE PASS.
    I don't see why not. My example was probably overly simplified in terms of showing single words for each symbol though. For the algorithm I am describing, the dictionary entries could be portions of words (or numbers, etc.) or more than one word.

    Quote Originally Posted by nburns View Post
    What you're describing sounds a lot like simply replacing the tokens and then running LZ on the resulting stream. You could probably get a pretty good idea of what that would be like by writing out the replaced stream without further processing and then gzipping it.
    It's a lot like that, but LZ can't create the new words within the stream at the first location they are used. I have tried running LZ on TreeCompress output files. It doesn't work too well; I think much of that is because it doesn't understand the 4-byte-long symbols. What I am describing is pretty much Tree, except it would not have counts for the low-frequency tokens or the ability to remove them from the dictionary when they are no longer useful (or the associated overhead), and there would be one giant MTF queue instead of lots of short ones. I feel these differences would not cause a big change in compressed file size vs. Tree, so I feel comfortable in thinking this compression format can beat all existing LZ formats on compressed file size for typical large text files. As long as the data is compressed with an algorithm that approximates smart (not just longest-match) top-down deduplication, I think it can do well.

    I could be completely wrong here since I have never written or studied the code of any LZ algorithm, but I agree that the compressed data stream would look a lot like an LZW data stream with offsets in place of dictionary references since most symbols would have length 1 and the history uses symbols instead of raw data. But I also think looks can be deceiving.

    I would guess the full dictionary (never flushed) for enwik9 when compressing with LZW would contain close to 100 million entries by the time compression is finished. I think most of those dictionary entries would never get used, making the dictionary less useful than it could be. The scheme I am describing allows the designer to choose the strings that get entered in the dictionary by choosing when to use lengths of more than 1 in the length/offset pairs. The test results I posted for Tree indicated the dictionary could be as small as 23K entries for 73% compression of enwik9 or as large as 8M entries for 82% compression, and I would expect similar results here since the key features are the same.

    In many cases, LZW does not handle deduplication well because it has no way of quickly and efficiently writing long, low-frequency string matches (like strings 1000 characters long that occur twice). With the scheme I described, if there is a second copy of the woodchuck phrase, it can be handled by simply adding an additional length code before the string to create a dictionary entry for it, and then referencing the phrase later with a code length of 1 and the appropriate offset (just like LZ77, except with a length of 1 and a significantly smaller offset).

  22. #21
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I don't see why not. My example was probably overly simplified in terms of showing single words for each symbol though. For the algorithm I am describing, the dictionary entries could be portions of words (or numbers, etc.) or more than one word.



    It's a lot like that, but LZ can't create the new words within the stream at the first location they are used. I have tried running LZ on TreeCompress output files. It doesn't work too well; I think much of that is because it doesn't understand the 4-byte-long symbols. What I am describing is pretty much Tree, except it would not have counts for the low-frequency tokens or the ability to remove them from the dictionary when they are no longer useful (or the associated overhead), and there would be one giant MTF queue instead of lots of short ones. I feel these differences would not cause a big change in compressed file size vs. Tree, so I feel comfortable in thinking this compression format can beat all existing LZ formats on compressed file size for typical large text files. As long as the data is compressed with an algorithm that approximates smart (not just longest-match) top-down deduplication, I think it can do well.
    I have written a few bijective LZWs. What the simplest one does is start with 2 roots or symbols, "0" and "1", and then every time one is used it creates two new symbols by destroying the one used. The last time I used it for something on this site was after the one I immodestly called the SCOTT transform, and it does compress quite well with it. This was for a general compressor. When you try to write the code for more roots it becomes messy and slow very fast. I have written and tested it for 4 roots and it was not too bad; 256 roots are needed when one works with bytes, but that code lost. In fact I don't think it's worth writing again for 256, since the output is the same as my normal LZW until you finally (if ever) reach a case where a given symbol has had all 256 symbol types following it. That would be the first case where a node is dropped.

    Quote Originally Posted by Kennon Conrad View Post

    I could be completely wrong here since I have never written or studied the code of any LZ algorithm, but I agree that the compressed data stream would look a lot like an LZW data stream with offsets in place of dictionary references since most symbols would have length 1 and the history uses symbols instead of raw data. But I also think looks can be deceiving.

    I would guess the full dictionary (never flushed) for enwik9 when compressing with LZW would contain close to 100 million entries by the time compression is finished. I think most of those dictionary entries would never get used, making the dictionary less useful than it could be. The scheme I am describing allows the designer to choose the strings that get entered in the dictionary by choosing when to use lengths of more than 1 in the length/offset pairs. The test results I posted for Tree indicated the dictionary could be as small as 23K entries for 73% compression of enwik9 or as large as 8M entries for 82% compression, and I would expect similar results here since the key features are the same.
    I have 64-bit code for my binary LZW. When I used it on enwik9 by itself, it compressed to 412,928,335 bytes; when enwik9 was followed by the SCOTT transform and then LZW, it compressed to 232,638,564 bits. Find the thread on the SCOTT transform. The point being that the tree for the LZW had to have many millions of entries and the code was not that slow. I have meant to do more but have not.

    Quote Originally Posted by Kennon Conrad View Post
    In many cases, LZW does not handle deduplication well because it has no way of quickly and efficiently writing long, low-frequency string matches (like strings 1000 characters long that occur twice). With the scheme I described, if there is a second copy of the woodchuck phrase, it can be handled by simply adding an additional length code before the string to create a dictionary entry for it, and then referencing the phrase later with a code length of 1 and the appropriate offset (just like LZ77, except with a length of 1 and a significantly smaller offset).
    Everything adds a cost. I prefer one pass for most algorithms used, except that my favorite is BWTS. That said, you are correct: the double LZW, where you have two separate trees, one for using words only and one for building new words, would never find the second copy of the woodchuck phrase. However, there is still a possible trade-off: allow a third LZW tree that only builds out of the words. If the text repeated long portions often enough, it would create a node for the phrase in question. However, if just two long repeats occur out of the blue in random places, it would NEVER notice them.

    Look, I call it LZW but you might not. Here is an example of the state of the trees.

    The first LZW contains 5 nodes, with bit values as in the table below:

    10 The
    01 cat
    11 is
    001 at
    000 * (this is the complete table for 5 nodes; the * indicates you go to the word-building LZW)

    When the word is built, you now have a table with 6 nodes:

    10 the
    01 cat
    110 is
    111 at
    001 pound
    000 * (now this table has 6 entries and 6 states, increased only by the new word. The order of the entries can change based on how often a symbol occurs. Also, since many tables are possible, you can use weights to build the current Huffman tree at this point in time.)

    When an * appears, you are in the other table and do standard (well, my standard) 256-symbol LZW until the word is built; for that symbol you may have to write several numbers as the string is built, but when done, the next number is part of the beginning tree in this example.
    For this current tree, to be fast it should not be bijective, but that only occurs when a symbol has been followed by every symbol at least once, and that is not likely to happen here. I am only describing what I call double LZW; you could go to triple, but with much more overhead.

  23. #22
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 183 Times in 108 Posts
    Interesting. I can see where a transformation would really help LZW. I can also see where having separate models for words and characters could help, especially for smaller dictionary sizes that leave more raw data behind.

    You would really not like Tree in its current form since it passes over enwik9 something like 10K times to compress it. While one pass would be ideal, I think two passes would still be far better than 10K passes if it doesn't ruin the compression ratio.

