
Thread: Improving LZ78/LZW?

  1. #1
    Member RichSelian

    Improving LZ78/LZW?

    It's easy to use an entropy coder to enhance LZ77, but what about LZ78/LZW? The compression ratio is not high, and the output codes are hard to compress with an entropy coder (is there no way to predict them?). Are there any LZ78/LZW-ARI compressors that provide a good compression ratio, just like LZ77-ARI?
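    For reference, a minimal textbook LZW encoder (a generic sketch, not any particular program mentioned in this thread); the indexes it emits are exactly the output codes that are hard to model further:

        def lzw_encode(data: bytes):
            """Minimal textbook LZW: grow a phrase dictionary, emit one index per phrase."""
            dictionary = {bytes([i]): i for i in range(256)}   # start with all single bytes
            phrase = b""
            codes = []                                         # dictionary indexes (unbounded)
            for b in data:
                candidate = phrase + bytes([b])
                if candidate in dictionary:
                    phrase = candidate                         # keep extending the current phrase
                else:
                    codes.append(dictionary[phrase])           # emit the longest known phrase
                    dictionary[candidate] = len(dictionary)    # then learn its extension
                    phrase = bytes([b])
            if phrase:
                codes.append(dictionary[phrase])
            return codes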

  2. #2
    Member Piotr Tarsa
    There are many variations of LZ78. I think even ACB ( http://encode.su/threads/540-ACB ) could be considered an LZW variation. David Scott has made a bijective LZW (although a very slow one).

    I think it's better to invest time into BWT, as it's scalable and the preprocessing + transformation stage is almost fully decoupled from the encoding stage, ie changes in the first stage (eg switching from LZP to LZ77 preprocessing, or between BWT and BWTS) usually don't influence the performance of the encoding stage much. IMO any LZ coder with adaptive entropy coding is a messy thing, ie optimizing performance relies too much on tuning and trying ad-hoc combinations of constants, as LZ has an inherent information overhead (by allowing multiple encodings of the same input string). On the other hand, with static entropy coding it is sometimes feasible to do optimal parsing and then focus on improving speed, thus avoiding spending time on extensive tuning, which would be useless anyway if a customer compresses files with different content than your data corpus.

    Regarding LZW: I think it is possible to improve the compression, for example by keeping the dictionary sorted. Then you can predict the highest bits of the indexes using statistics like in PPM. Additionally you can, eg, assume a higher probability for longer dictionary entries.
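    A rough sketch of what such index prediction could look like (an illustration only, not an existing coder): keep the dictionary sorted so that entries sharing a first byte are adjacent, predict that first byte PPM-style from the previous byte, and spend a flat code only on the offset inside the group:

        import math
        from collections import defaultdict

        def index_cost_bits(dictionary, entry, prev_byte, o1_counts):
            """Estimated bits to address `entry` in a sorted LZW dictionary:
            the first byte is predicted by an order-1 model, the offset is coded flat."""
            groups = defaultdict(list)
            for e in sorted(dictionary):
                groups[e[0]].append(e)                 # entries grouped by their first byte
            ctx = o1_counts.get(prev_byte, {})         # counts of first byte given previous byte
            total = sum(ctx.values()) + len(groups)    # +1 per group so nothing gets probability 0
            p_first = (ctx.get(entry[0], 0) + 1) / total
            return -math.log2(p_first) + math.log2(len(groups[entry[0]]))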

    Edit:
    What about variable-length LZW? Ie: usual LZW (from what I know) uses a dictionary whose size is a power of two (some entries are free during compression) and then refers to a particular entry with a fixed-width binary code. We could instead use something like prefix(-free) codes (Huffman codes are an example) that have varying length. The entire dictionary could be organized into a binary tree, with entries in the leaves. Each node would keep a probability of selecting its left or right child. In fact such a scheme would be a good hybrid between PPM and LZW; the advantage over plain PPM is that the hybrid could sometimes encode multiple symbols with one bit (ie before actual entropy coding), when some edge corresponds to multiple symbols. Has anyone tried such an algorithm? The ability to keep multi-symbol edges could be a memory saver compared to PPM. Of course such a hybrid would have the disadvantage of allowing multiple encodings of the same input string, maybe unless David Scott makes a bijective version.
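    A sketch of the per-node bookkeeping such a tree would need (an illustration; the binary arithmetic coder that would consume these probabilities is assumed to exist elsewhere):

        class Node:
            """Binary dictionary tree: entries live in the leaves,
            internal nodes keep adaptive left/right counts."""
            def __init__(self, entry=None):
                self.entry = entry            # dictionary string, if this node is a leaf
                self.left = self.right = None
                self.counts = [1, 1]          # adaptive counts for choosing left / right

        def path_bits(root, leaf_path):
            """Walk the tree along leaf_path (a list of 0/1 branch choices), yielding
            (bit, probability) pairs for a binary arithmetic coder and updating the counts."""
            node = root
            for bit in leaf_path:
                p = node.counts[bit] / sum(node.counts)
                yield bit, p
                node.counts[bit] += 1                       # adapt, PPM-style
                node = node.right if bit else node.left

    The cost of a dictionary entry is then the sum of -log2(p) along its path, so entries reachable through well-predicted branches stay cheap even when the tree is large.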
    Last edited by Piotr Tarsa; 17th September 2011 at 21:37.

  3. #3
    Administrator Shelwien
    > I think even ACB could be considered an LZW variation.

    I believe that it's closer to LZ77, because it encodes
    distances and lengths, though in reduced form.

    > I think it's better to invest time into BWT

    Unfortunately BWT is inherently not universal, because of its blockwise nature,
    and because it's a fixed transformation too.
    For example, BWT methods are completely useless for network traffic compression -
    the large blocks needed for good compression result in a huge delay, and the local cache can't
    be used as a dictionary.
    Specialized compressors for various kinds of structured data also can't be based on BWT,
    so 20-50% better compression (than is possible with BWT) of most data types is imho just
    a matter of time.

    > LZ coder with adaptive entropy coding is a messy thing, ie
    > optimizing performance relies too much on tuning and trying ad-hoc

    Actually that perfectly applies to postcoders in bsc/bcm/bwtmix.
    Also, BWT needs adaptive segmentation and LZP preprocessing
    (to stabilize sorting speed and compression ratio), and some
    other possible tricks (like alphabet reordering) are just ignored for now.

    > combinations of constants as LZ has an inherent information overhead

    I don't quite see how magic constants are related to LZ transformation redundancy,
    but after playing with the lzma source I learned that the effect of that redundancy
    can be significantly reduced by optimal parsing.
    I mean that with arithmetic coding, when some symbol has a probability
    of 1/180337 or less, the redundancy when that symbol never occurs is
    at most 1 byte per megabyte of data.
    (See http://wolframalpha.com/input/?i=Solve[-1E6*Log[256,(1-1/q)]==1])
    So basically, if the model is precise enough, it doesn't matter that
    LZ is _potentially_ able to encode the same string with multiple token sequences,
    because the redundancy would be insignificant.
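    A quick numeric check of that figure (just re-evaluating the linked formula):

        import math

        q = 180337
        # redundancy in bytes per 1e6 coded bytes when a never-used symbol keeps probability 1/q
        print(-1e6 * math.log(1 - 1/q, 256))   # ~1.0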

    To be specific, the symbol masking when a literal is encoded after a match in lzma
    is technically redundant too - it still allows encoding any symbol. Lzma only does
    statistical masking - some symbols get a much lower probability than others.
    But that turned out to be enough to beat 3 different approaches which I tried, even though
    all of them perfectly excluded impossible symbols (and were much slower too).
    I guess sometimes flexibility is that much more important.

    Note that CMs (including BWT postcoders) also have some inherent redundancy,
    mostly because of limited precision.

    > Has anyone tried such an algorithm?

    I have something like that planned. The idea is to make LZ77 with optimal parsing
    and match coding "by value", instead of by distances.
    Then maybe turn it into PPM with optimal parsing.
    There are also supposed to be some adaptive binary trees in the context nodes
    (btw, the unary codes in the PPMd tree are sometimes efficient enough too),
    but as to encoding multiple input symbols with a single bit - I really don't see a way to do that.
    It would make symbol masking really troublesome...

  4. #4
    The Founder encode
    A brief forum search for the LZW keyword gives this:
    http://encode.su/threads/214-LZW-wit...?highlight=lzw

  5. #5
    Member Piotr Tarsa
    > Note that CMs (including BWT postcoders) also have some inherent redundancy,
    > mostly because of limited precision.
    It's not very hard to make bijective coders for BWT, PPM or CM and in this case, for me, bijectivity == no redundancy.

    > but as to encoding multiple input symbols with a single bit - I really don't see a way to do that.
    > It would make symbol masking really troublesome...
    Well, I have mixed up some ideas and something bogus happened :P

    But I have another idea - let's use an on-line suffix tree construction algorithm, like Ukkonen's. (It's not a binary tree; each node has up to |Alphabet| children.) That suffix tree has links that would help with masking (ie, AFAIR, links leading to the next suffix, ie from the node for abcd to the node for bcd). Some edges in suffix trees (which are in fact compressed tries) correspond to multi-symbol substrings.
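    A naive illustration of those links (a plain O(n^2) suffix trie, not Ukkonen's on-line, path-compressed construction): every node for a string c·w gets a suffix link to the node for w:

        class TrieNode:
            def __init__(self):
                self.children = {}        # symbol -> TrieNode (up to |Alphabet| children)
                self.suffix_link = None

        def build_suffix_trie(text):
            """Naive suffix trie with suffix links; Ukkonen builds the equivalent
            (compressed) structure on-line in linear time."""
            root = TrieNode()
            node_of = {"": root}
            for start in range(len(text)):
                node, s = root, ""
                for ch in text[start:]:
                    s += ch
                    if ch not in node.children:
                        node.children[ch] = TrieNode()
                        node_of[s] = node.children[ch]
                    node = node.children[ch]
            for s, node in node_of.items():
                if s:                                  # node for "abcd" links to node for "bcd"
                    node.suffix_link = node_of[s[1:]]
            return root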

  6. #6
    Member biject.bwts
    Quote Originally Posted by Piotr Tarsa
    There are many variations of LZ78. I think even ACB ( http://encode.su/threads/540-ACB ) could be considered an LZW variation. David Scott has made a bijective LZW (although a very slow one).

    My LZW that I posted works on bits. That is, it starts with 2 roots, the 1-bit string "0" and the 1-bit string "1", and at each use the string used is replaced with 2 strings, each one bit longer. I once had code for 4 roots but it's long gone. Don't get me wrong, I say it's for bits but it works on any file. If one wanted a fast and dirty 256-root LZW, that would not be too hard to code. One could automatically reduce the tree as needed, or restart, when conditions arise that would lead to nonbijectivity. That is, as long as no string has more than 255 children when using 256 root symbols, and it is coded properly, there is no problem with the tree as far as bijectivity is concerned. The main problem with most LZW methods is that the output should come either from a two-level Huffman tree or from an arithmetic coder based on the current number of nodes. There is no need for an EOF symbol in these methods.
    The last time I looked at LZW was when I was playing with the Scott Transform. I updated it for file size, but have been too lazy to post it; I'm not sure what anyone's interest in it is. But the LZW I used was not the best. It has a bias that, for book1, makes it better than using a pure arithmetic coder for the nodes. But that bias can be exploited more than I took advantage of, since one can assume that most long runs contain more zeroes after the Scott Transform; of course the same and a better fix could be done with a biased arithmetic coder.
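    A rough sketch of the dictionary growth described above (reconstructed from this description, not the posted program, and without the careful bijective handling of the file ending):

        def bit_lzw_phrases(bits):
            """2-root bit LZW: the leaves start as "0" and "1"; each time a leaf is used
            it is replaced by its two one-bit-longer extensions.  Emits (index, leaf_count)
            pairs for an arithmetic coder driven by the current number of nodes."""
            leaves = ["0", "1"]
            out, pos = [], 0
            while pos < len(bits):
                rest = bits[pos:]
                # exactly one leaf is a prefix of the rest (the leaves form a prefix code);
                # near the end of the file the match may be partial - a real bijective coder
                # ends the file much more carefully than this
                i, leaf = next((i, l) for i, l in enumerate(leaves)
                               if rest.startswith(l) or l.startswith(rest))
                out.append((i, len(leaves)))
                pos += min(len(leaf), len(rest))
                leaves[i:i + 1] = [leaf + "0", leaf + "1"]   # split the used string
            return out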

  7. #7
    Administrator Shelwien
    > It's not very hard to make bijective coders for BWT, PPM or CM and in this case,
    > for me, bijectivity == no redundancy.

    Unfortunately it's not so simple.
    For example, the "compression method" called "store" is bijective, but obviously redundant.

    Another example: paq8p compresses 1MB of zeroes to 359 bytes, and lzma to 189 bytes -
    in this case paq is more redundant, because it has a lower ratio limit.

    On the other hand, real bijectivity is very troublesome and doesn't really appear in
    any practical codecs, because even simple rounding up of the entropy to an integer number of bytes
    already breaks it.
    Of course it's nearly always possible to find a solution, but it's just not practical
    to overcomplicate things to save a byte per file.

    > That suffix tree has some links that would help in masking

    The masking problem looks like this.
    Suppose that at some point there's a possibility to encode the string "ab",
    but we encountered "ac" and had to encode just "a".
    Then at the next step we have to take into account that "ab" wasn't used,
    so the current symbol is not 'b'.
    Of course it's not impossible, but it's very annoying and slow, especially in
    a scheme with support for multiple variable-length strings per node.
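    A toy version of the bookkeeping this implies (an illustration only): after the longer candidate "ab" was available but not taken, the literal model for the next position is renormalized with 'b' excluded:

        def masked_distribution(counts, excluded):
            """Renormalize a literal distribution after excluding impossible symbols."""
            total = sum(c for sym, c in counts.items() if sym not in excluded)
            return {sym: c / total for sym, c in counts.items() if sym not in excluded}

        # masked_distribution({'a': 3, 'b': 5, 'c': 2}, excluded={'b'}) -> {'a': 0.6, 'c': 0.4}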

  8. #8
    Member biject.bwts
    Quote Originally Posted by Shelwien
    Unfortunately it's not so simple.
    For example, the "compression method" called "store" is bijective, but obviously redundant.

    On the other hand, real bijectivity is very troublesome and doesn't really appear in
    any practical codecs, because even simple rounding up of the entropy to an integer number of bytes
    already breaks it.

    .....
    You're right: to be bijective it has to use all files, so obviously outputs that could
    contain large amounts of redundancy exist as part of the output.


    Second, it does take careful thought to make a real bijective compressor.
    It also takes more time. But virtually all lossless compressors could be
    made bijective, and they would compress to the same size or smaller over the set
    of all files. That includes what you say are the practical compressors.

    Unless you really don't care about output file size. If speed is your main
    concern and not output file size, then you're correct, it's not practical. But
    I think some people enjoy seeing a smaller size.

    Also, it's not simply a rounding up to an integer number of bytes. What
    do you mean by "breaks it"? The so-called rounding works both ways: you
    sometimes round down, you sometimes round up. But it's done in a way
    that adds zero data and is more efficient than the commonly used ways
    of ending file compression. One could use such methods in any file
    compressor that uses arithmetic-like entropy stages. It will save space.

