Results 1 to 13 of 13

Thread: Are there LZW/LZ78/LZC etc. variants with efficient codeword encoding?

  1. #1
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts

    Are there LZW/LZ78/LZC etc. variants with efficient codeword encoding?

    This is about encoding of the actual codeword range of this family of compressors. So far the number of used bits just gets increased after reaching the limit of the current codeword size (e.g. 9 -> 10 bits). But the new range only gets 50% used right after increasing by 1 bit, with slow filling of the range.
    Is some existing algorithm adressing this? AC, RC, ANS come to my mind here. Anything else?

  2. #2
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    278
    Thanks
    116
    Thanked 160 Times in 117 Posts
    ​Just in these days I'm looking at lzw-ab, its "adjusted-binary" part does a good trick.
    https://github.com/dbry/lzw-ab

  3. #3
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    ​Just in these days I'm looking at lzw-ab, its "adjusted-binary" part does a good trick.
    https://github.com/dbry/lzw-ab
    Thanks, this variant looks interesting. I'll have a look at it.

  4. #4
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    A little update: LZW-AB is roughly 5-7% better than LZW. Some paper mentioned possible reductions of 8% vs standard LZW.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Code:
    768,771 book1
    567,221 1     // lzw-ab -1
    484,819 2     // lzw-ab -2
    436,111 3     // lzw-ab -3
    402,414 4     // lzw-ab -4
    354,568 2.zst // -22 --zstd=wlog=12 
    299,755 1.gz  // 7z -tgzip -mx=9
    264,731 1.zst // -22
    214,938 1.glz // glza -x

  6. Thanks:

    Dresdenboy (20th June 2020)

  7. #6
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    353
    Thanks
    131
    Thanked 54 Times in 38 Posts
    Is that normal for GLZA? What's the story?

  8. #7
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    278
    Thanks
    116
    Thanked 160 Times in 117 Posts
    Code:
    768,771 book1
    567,221 1.lzw-ab  // lzw-ab -1
    484,819 2.lzw-ab  // lzw-ab -2
    436,111 3.lzw-ab  // lzw-ab -3
    402,414 4.lzw-ab  // lzw-ab -4
    376,752 5.lzw-ab  // lzw-ab -5
    355,634 6.lzw-ab  // lzw-ab -6
    354,568 2.zst     // -22 --zstd=wlog=12 
    338,861 7.lzw-ab  // lzw-ab -7
    323,503 8.lzw-ab  // lzw-ab -8
    312,855 9.lzw-ab  // lzw-ab -9
    304,695 10.lzw-ab // lzw-ab -10
    304,695 11.lzw-ab // lzw-ab -11
    299,755 1.gz      // 7z -tgzip -mx=9
    264,731 1.zst     // -22
    214,938 1.glz     // glza -x

  9. Thanks:

    Dresdenboy (20th June 2020)

  10. #8
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    56
    Thanks
    0
    Thanked 12 Times in 9 Posts
    Any entropy stage stacked upon the LZW solves the issue, because not yet assigned dictionary entries have zero frequency. I pursued this approach with my compressor Bolt, where I used the Splay Tree (Douglas Jones) after the LZW. Please note that there is an improvement only with binary files while for text ones it goes to a degradation, but I think this is due to the profile of the Splay Tree; for other entropy encoders the result may vary, in this case you have to consider that LZW produces a huge alphabet and speed is a concern.
    The command line options for Bolt are:

    2 - LZW 12 bit
    3 - LZW 14 bit
    4 - LZW 12 bit + Splay Tree
    5 - LZW 14 bit + Splay Tree

    Another approach is starting to code the dictionary entries with maximum bit size and when the dictionary becomes full applying a strategy to recycle the least recently used entry, of course it is worth only for input stream large in respect to the dictionary size.
    Attached Files Attached Files
    Last edited by Marco_B; 4th July 2020 at 14:35.

  11. #9
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by Shelwien View Post
    Code:
    768,771 book1
    [see Mauro Vezzosi's table]
    (Where are these many blank lines in code blocks coming from?)
    I already expected not too stellar performance with file sizes like this, as it still is LZW, creating dictionary entries, where many are never used. And more efficient encoding might just shave of fractions of a bit up to roughly one bit per symbol.

    Quote Originally Posted by Mauro Vezzosi View Post
    Code:
    768,771 book1
    567,221 1.lzw-ab  // lzw-ab -1
    484,819 2.lzw-ab  // lzw-ab -2
    436,111 3.lzw-ab  // lzw-ab -3
    402,414 4.lzw-ab  // lzw-ab -4
    376,752 5.lzw-ab  // lzw-ab -5
    355,634 6.lzw-ab  // lzw-ab -6
    354,568 2.zst     // -22 --zstd=wlog=12 
    338,861 7.lzw-ab  // lzw-ab -7
    323,503 8.lzw-ab  // lzw-ab -8
    312,855 9.lzw-ab  // lzw-ab -9
    304,695 10.lzw-ab // lzw-ab -10
    304,695 11.lzw-ab // lzw-ab -11
    299,755 1.gz      // 7z -tgzip -mx=9
    264,731 1.zst     // -22
    214,938 1.glz     // glza -x
    Interesting! At those big dictionary sizes it even starts to look competitive.

    Quote Originally Posted by Marco_B View Post
    Any entropy stage stacked upon the LZW solves the issue, because not yet assigned dictionary entries have zero frequency. I pursued this approach with my compressor Bolt, where I used the Splay Tree after the LZW. Please note that there is an improvement only with binary files while for text ones it goes to a degradation, but I think this is due to the profile of the Splay Tree; for other codecs the result may vary, in this case you have to consider that LZW produces a large alphabet and speed is a concern.
    The command line options for Bolt are:

    2 - LZW 12 bit
    3 - LZW 14 bit
    4 - LZW 12 bit + Splay Tree
    5 - LZW 14 bit + Splay Tree

    Another approach is starting to code the dictionary entries with maximum bit size and when the dictionary becomes full applying a strategy to recycle the least recently used entry, of course it is worth only for input stream large in respect to the dictionary size.
    This also looks interesting. My Pascal days were roughly a quarter century ago. But as I'm testing compression of binary files, Bolt is on my list now.

  12. #10
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    278
    Thanks
    116
    Thanked 160 Times in 117 Posts
    Quote Originally Posted by Dresdenboy View Post
    Quote Originally Posted by Shelwien View Post
    Code:
    768,771 book1
    [see Mauro Vezzosi's table]
    (Where are these many blank lines in code blocks coming from?)
    I don't understand what "many blank lines in code blocks" you're referring to.

    Quote Originally Posted by Dresdenboy View Post
    At those big dictionary sizes
    lzw-ab -11 requires ~5 MB.
    -11 = 19 bits dictionary = 512 KiB.
    512 KiB * (4 (first_references) + 4 (next_references) + 1 (terminators)) = ~4.5 MiB (the first 256 values are implicit and not allocated)
    Last edited by Mauro Vezzosi; 21st June 2020 at 18:06.

  13. #11
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    I don't understand what "many blank lines in code blocks" you're referring to.
    Sorry, this was just a comment about the code box formatting to Shelwien as a forum admin.

    Quote Originally Posted by Mauro Vezzosi View Post
    lzw-ab -11 requires ~5 MB.
    -11 = 19 bits dictionary = 512 KiB.
    512 KiB * (4 (first_references) + 4 (next_references) + 1 (terminators)) = ~4.5 MiB (the first 256 values are implicit and not allocated)
    Sounds plausible! I think the LZW part is pretty standard. Just the code-word/symbol encoding got more efficient. And since those symbols grow to 19 bits, the adjusted binary savings are going down to ~1/38 bits per encoded symbol, I think.

  14. #12
    Member
    Join Date
    May 2020
    Location
    Berlin
    Posts
    64
    Thanks
    10
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    ​Just in these days I'm looking at lzw-ab, its "adjusted-binary" part does a good trick.
    https://github.com/dbry/lzw-ab
    There is the Horspool paper (Improving LZW), which describes it. I also checked his source. No additional information. But Charles Bloom did some interesting analysis of this encoding (calling it "flat codes" BTW):
    http://cbloomrants.blogspot.com/2013...bits-flat.html

  15. #13
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    278
    Thanks
    116
    Thanked 160 Times in 117 Posts
    Quote Originally Posted by Dresdenboy View Post
    Quote Originally Posted by Mauro Vezzosi View Post
    Just in these days I'm looking at lzw-ab, its "adjusted-binary" part does a good trick.https://github.com/dbry/lzw-ab
    There is the Horspool paper (Improving LZW), which describes it. I also checked his source. No additional information. But Charles Bloom did some interesting analysis of this encoding (calling it "flat codes" BTW):http://cbloomrants.blogspot.com/2013...bits-flat.html
    Yes, after doing various Internet searches on LZW, I saw that this type of coding was known to others too.

Similar Threads

  1. Encoding FLAC residuals (alternatives to Rice encoding)
    By flubadubadoo in forum Data Compression
    Replies: 4
    Last Post: 17th January 2020, 01:25
  2. Replies: 38
    Last Post: 27th April 2016, 18:01
  3. Improving LZ78/LZW?
    By RichSelian in forum Data Compression
    Replies: 7
    Last Post: 19th September 2011, 18:05
  4. LZC Question
    By moisesmcardona in forum Data Compression
    Replies: 3
    Last Post: 16th August 2009, 22:33
  5. LZWA vs. LZC
    By encode in forum Forum Archive
    Replies: 8
    Last Post: 16th July 2006, 18:37

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •