
Thread: LZW with unbounded dictionary

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts

    LZW with unbounded dictionary

    OK, I was just curious how LZW would perform if the dictionary is never reset - i.e. the dictionary grows without bound. So I wrote a small LZW encoder which can hold any number of dictionary entries. To be honest, it is limited to 128M entries, since that's plenty to compress even ENWIK9 with no dictionary reset. The LZW codes are coded as flat codes, as in the original UNIX COMPRESS (so actually my LZW should be called LZC, anyway). For string searching I use a special tree structure, which is really FAST and makes such a huge dictionary possible.
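
    The post doesn't include source, so below is a minimal sketch, under my own assumptions, of an encode loop whose dictionary only grows: a plain open-addressed hash table stands in for the author's (unspecified) fast tree, and emit_code() is a hypothetical flat-code bit writer.
    Code:
    /* Minimal growing-dictionary LZW encode loop (a sketch, not the
       author's code). The dictionary maps (prefix code, next byte) to
       a new code and is never reset. A real version must bound
       next_code or grow the table before it fills up. */
    #include <stdint.h>
    #include <stdio.h>
    
    #define HBITS 24                         /* 16M slots, ~200 MB of tables (assumption) */
    #define HSIZE (1u << HBITS)
    
    static uint64_t keys[HSIZE];             /* (prefix << 8) | next byte   */
    static uint32_t vals[HSIZE];             /* assigned code; 0 = empty    */
    
    extern void emit_code(uint32_t code);    /* hypothetical bit writer     */
    
    void lzw_encode(FILE *in)
    {
        uint32_t next_code = 256;            /* codes 0..255 are the roots  */
        int c = getc(in);
        if (c == EOF) return;
        uint32_t prefix = (uint32_t)c;
        while ((c = getc(in)) != EOF) {
            uint64_t key = ((uint64_t)prefix << 8) | (uint32_t)c;
            uint32_t h = (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> (64 - HBITS));
            while (vals[h] && keys[h] != key)        /* linear probing      */
                h = (h + 1) & (HSIZE - 1);
            if (vals[h]) {
                prefix = vals[h];            /* string+c is known: extend   */
            } else {
                emit_code(prefix);           /* longest match ends here     */
                keys[h] = key;
                vals[h] = next_code++;       /* grow forever, never reset   */
                prefix = (uint32_t)c;
            }
        }
        emit_code(prefix);                   /* flush the final string      */
    }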

    OK, here are the results; for comparison I've included COMPRESS v4.2 and ZIP from WinRAR 3.51 with the Best setting:
    Code:
    ENWIK8:
    LZW:      34,423,356 bytes
    ZIP:      36,422,375 bytes
    COMPRESS: 46,247,947 bytes
    
    ENWIK9:
    LZW:      284,356,309 bytes [!]
    ZIP:      322,454,406 bytes
    COMPRESS: 448,136,005 bytes
    
    3200.txt:
    LZW:      5,641,427 bytes
    ZIP:      6,201,591 bytes
    COMPRESS: 6,576,921 bytes

  2. #2
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Note that it is possible to add some tricks to this plain LZW, say lazy code-length expansion, to increase compression further.
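
    The post doesn't spell out what lazy code-length expansion means here; one decodable scheme in that spirit is phased-in codes, sketched below as an assumption (put_bits() is a hypothetical bit writer). With 300 dictionary entries, for instance, 212 codes cost 8 bits and 88 cost 9, about 0.7 bits per code cheaper on average than flat 9-bit codes.
    Code:
    /* Phased-in codes: with n possible codes and 2^k <= n < 2^(k+1),
       the first u = 2^(k+1) - n codes take k bits and the rest k+1,
       instead of spending k+1 bits on every code. Both encoder and
       decoder track n, so the stream stays decodable. */
    #include <stdint.h>
    
    extern void put_bits(uint32_t value, uint32_t nbits);  /* hypothetical */
    
    void put_code(uint32_t c, uint32_t n)    /* n = current dictionary size */
    {
        uint32_t k = 0;
        while ((1u << (k + 1)) <= n)
            ++k;                             /* k = floor(log2(n))          */
        uint32_t u = (1u << (k + 1)) - n;    /* how many short codes        */
        if (c < u)
            put_bits(c, k);                  /* short code: k bits          */
        else
            put_bits(c + u, k + 1);          /* long code, offset by u      */
    }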

  3. #3
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    166
    Thanks
    3
    Thanked 2 Times in 2 Posts
    Results seem very promising.
    The dictionary size is quite big, but if the string searching is fast it may be a reasonable tradeoff today: it would be interesting to see memory usage and compression time details.

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    I am really surprised that plain LZW did so well. After Huffman and arithmetic coding, LZW was one of my favorite compressors to play with. At my site I have two bijective LZW variants that start with 2 roots, 0 and 1.
    Unless my memory fails me, plain LZW adds one more leaf to the tree each time you emit a code, while still leaving the old leaf in place. On my tests this meant a lot of wasted leaves, and that it was possible to construct more than one file that decompressed to the same value. The bijective LZW differs: each time you reach a terminal leaf, that leaf creates a new node with two new leaves, so the tree grows faster and there is no chance of two files decompressing to the same output. My code was very slow and the dictionary very small; when the table fills up, the last tree is used for the rest of the file.
    The link to my site in that area is

    http://bijective.dogma.net/compreslzw1.htm

    I would be honored if you or others interested in compression looked at it.

    At one time I had code for 4 roots but never put it on the web page; it was lost, and then my interests shifted to something else. It's still in the back of my mind, and if I live long enough I would like to make one with 256 roots, but it's way back on the burner.


    Thank You
    David Scott

  5. #5
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    I know you stated that it was plain LZW, but how many starting codes: 256, or more? If more, what are they?
    Also, how do you encode the codes? If you have 300 values, do you use 9 bits for each code, or do you use an adjusted table where at that point you use 8 bits for some and 9 for the others? When you get more than 512 values, do you jump to 10 bits or mix 9 and 10? How do you feel about using arithmetic coding, so that at a point where you have 300 codes you always use -log2(1/300) = log2(300) ≈ 8.23 bits for each code?
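
    For concreteness, at 300 dictionary entries the three options above cost roughly:
    Code:
    flat codes:       9 bits per code
    adjusted table:   212 codes at 8 bits + 88 at 9 bits, ~8.29 bits average
    arithmetic:       -log2(1/300) = log2(300) ~ 8.23 bits per code
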
    Anyway, if your code is simple C, I think I could modify it to be fully bijective and save a few bytes, if you would post it.

    Thank You
    David Scott

  6. #6
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Quote Originally Posted by giorgiotani View Post
    it would be interesting to have memory usage and compression times details.
    Such things depend on implementation details; anyway, check out some results of the current DRAFT version (Core 2 Duo @ 2.4 GHz, 2 GB RAM):

    ENWIK8: 34,423,356 bytes, 17 sec, 142 MB
    ENWIK9: 284,356,309 bytes, 225 sec, 1048 MB

    Quite a monstrous LZW...

    Quote Originally Posted by biject.bwts
    I know you stated that it was plain LZW but how many starting codes 256 or more. If more what are they?
    Read carefully: my LZW uses the same scheme as the original COMPRESS - i.e. flat codes. No additional codes are used - i.e. I start with 256 codes, one per character.

    Quote Originally Posted by biject.bwts
    I think I could mode it to be fully bijective
    and save a few bytes if you would post it.
    ...just compare the performance of your LZW and mine...
    My first serious (non-RLE) compressor (non-public) was based on LZW. I know at least a few very cool tricks for LZW that will increase compression (not by just a few bytes) with no extra speed loss.

    I'm just curious what Matt Mahoney, Shelwien and toffer think about such a big and fat LZW?

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Additional ENWIK8 testing results with various dictionary sizes:
    Code:
     64K: 45,485,655 bytes
    128K: 43,607,570 bytes
    256K: 41,816,621 bytes
    512K: 40,168,441 bytes
      1M: 38,724,293 bytes
      2M: 37,384,540 bytes
      4M: 36,249,326 bytes
      8M: 35,267,409 bytes
     16M: 34,423,356 bytes

  8. #8
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by encode View Post
    Such things depend on implementation details; anyway, check out some results of the current DRAFT version (Core 2 Duo @ 2.4 GHz, 2 GB RAM):

    ENWIK8: 34,423,356 bytes, 17 sec, 142 MB
    ENWIK9: 284,356,309 bytes, 225 sec, 1048 MB

    Quite a monstrous LZW...


    Read carefully: my LZW uses the same scheme as the original COMPRESS - i.e. flat codes. No additional codes are used - i.e. I start with 256 codes, one per character.


    ...just compare the performance of your LZW and mine...
    My first serious (non-RLE) compressor (non-public) was based on LZW. I know at least a few very cool tricks for LZW that will increase compression (not by just a few bytes) with no extra speed loss.

    I'm just curious what Matt Mahoney, Shelwien and toffer think about such a big and fat LZW?
    I can't compare it, since mine has 2 starting states, not 256. I don't know what you mean by flat. It could mean you use 8 bits the first time, then 9 bits until you have 512 values, and then switch to 10-bit codes. If that is what you mean, you are wasting a lot of space. I just looked again at Mark Nelson's code. I think I can make it 256-root and bijective, since I have given a lot of thought to that over the years. When done I will post it. But I'm not sure how big the dictionary can be.

    Take Care
    David Scott

  9. #9
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Quote Originally Posted by biject.bwts View Post
    I can't compare it, since mine has 2 starting states, not 256. I don't know what you mean by flat. It could mean you use 8 bits the first time, then 9 bits until you have 512 values, and then switch to 10-bit codes. If that is what you mean, you are wasting a lot of space.

    Take Care
    David Scott
    Yep, that's what I meant.
    Some time ago I tried a range coder and an FPAQ0-like coder to encode the LZW output. I don't think it's really worth slowing LZW down with such tricks...
    Maybe an efficient ARI encoder and decoder exists for such tasks?

  10. #10
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by encode View Post
    Yep, that's what I meant.
    Some time ago I tried a range coder and an FPAQ0-like coder to encode the LZW output. I don't think it's really worth slowing LZW down with such tricks...
    Maybe an efficient ARI encoder and decoder exists for such tasks?
    I could write a special version of arb255 to handle your output file, if I knew the exact format of it. I realize it's 8 bits, then 9 bits 256 times, then 10 bits for the next 512 outputs. But there is more than one way to cut and paste the bits onto a byte stream. Also, one would need to know exactly how you terminate the file.

    Alternatively, I could write one that is purely bijective if you gave me an ASCII file that is n1,n2,n3...nk, where n1 is one of 256 values, n2 is one of 257 values, and so on. When you read EOF you output the code for the string read so far and don't write the EOF to the ASCII file. I could make a symbol mode of arb255 to do this; it would not be too hard, and I'd supply the unarb255 version that recreates the ASCII file.



    David

  11. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    biject.bwts
    This topic is about my LZW!

    I just made my LZW fully compatible with COMPRESS. At the same time my program compresses better, currently just due to better dictionary resets. If I add more advanced parsing, I may call my LZW "optz": the .Z file optimizer. That means we may get smaller .Z and .GIF files...

  12. #12
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Check this out:
    Code:
    pak0.pak
         LZW: 132,837,105 bytes
    COMPRESS: 147,988,895 bytes

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts

    New results:
    Code:
    pak0.pak
         LZW: 132,163,909 bytes
    COMPRESS: 147,988,895 bytes

  14. #14
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Results on ENWIK8 with flat arithmetic encoding:
    33,994,960 bytes

  15. #15
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by encode View Post
    Results on ENWIK8 with flat arithmetic encoding:
    33,994,960 bytes
    Quote Originally Posted by encode View Post
    ENWIK8:
    LZW: 34,423,356 bytes
    ZIP: 36,422,375 bytes
    COMPRESS: 46,247,947 bytes
    First, good job!
    After sleeping on it, I see I could not easily improve any code for a 256-root LZW, even yours. It works slick when you can build long strings quickly, but when you can't, the built-in waste of output space is hard to overcome without lots of extra work.

    Here is what bugs me about 256-root LZW: suppose I have a compressed file that starts with the same code repeated three times. When I try to decompress it, it will either error out or decompress to the same character three times. But if it doesn't error on decompress, that output will compress back to two codes; either way it's not making full use of the compression space.

    Here is a better example. Say I have "ABCDA" and "ABCDB" as codes to be used in the compression, but I don't have "ABCDX", which is what occurs at this point in the file being compressed. So when I compress, I write the code for "ABCD" and create a code for "ABCDX". Say at this point in the process I have 512 unique codes, so when it comes time to write the next code you only need to write 1 of 512. But several of the codes may start with either an A or a B, yet on the previous output I did not go down that path, so any code that starts with string A or B should not be allowed next time. This means that if 20 codes start with A and 30 with B, at this point I can really only encode 1 out of (512 - 50), which uses less space. So this is space you're wasting unless you can keep track, at each step, of exactly which codes are really possible. It may not be worth making your code squeeze this extra space out, but it would be cool if you could; then it would compress better.
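
    To put numbers on that last example:
    Code:
    log2(512)      = 9.000 bits for the next code
    log2(512 - 50) ~ 8.852 bits for the next code
    saving         ~ 0.148 bits (about 1.6%) on that one code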

    Take Care

  16. #16
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Another cool trick is LZW-PM - i.e. you maintain lots of mini-dictionaries selected by the current order-N context.
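
    A minimal sketch of the dictionary-selection step, assuming order-2 contexts and hash-bucketed mini-dictionaries; the Dict type and both helper functions are hypothetical stand-ins, not anything from the post.
    Code:
    #include <stddef.h>
    #include <stdint.h>
    
    #define NDICT 4096                  /* number of context buckets (assumption) */
    
    typedef struct Dict Dict;           /* an ordinary small LZW dictionary */
    
    extern Dict *dict_for(uint32_t bucket);  /* hypothetical: get/create bucket */
    extern size_t lzw_parse(Dict *d, const uint8_t *p, size_t avail);
    /* hypothetical: greedy-match in d's dictionary, emit one code,
       return the number of bytes consumed (>= 1) */
    
    static uint32_t ctx_bucket(uint8_t p2, uint8_t p1)
    {
        return ((uint32_t)p2 * 31u + p1) & (NDICT - 1);  /* order-2 context */
    }
    
    void encode(const uint8_t *buf, size_t n)
    {
        size_t i = 2;                   /* sketch: ignore the order-2 warmup */
        while (i < n) {                 /* previous two bytes pick the dict  */
            Dict *d = dict_for(ctx_bucket(buf[i - 2], buf[i - 1]));
            i += lzw_parse(d, buf + i, n - i);
        }
    }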

  17. #17
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    324
    Thanks
    29
    Thanked 36 Times in 21 Posts
    Quote Originally Posted by encode View Post
    biject.bwts
    This topic is about my LZW!

    I just made my LZW fully compatible with COMPRESS. At the same time my program compresses better, currently just due to better dictionary resets. If I add more advanced parsing, I may call my LZW "optz": the .Z file optimizer. That means we may get smaller .Z and .GIF files...
    So maybe it can be combined with Precomp..!

    .. I read an article about GIF and LZW saying there's a chance of better reduction if the data is rotated 90° before compression.. I can't recall where I read it..

  18. #18
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
    .. I read an article about GIF and LZW saying there's a chance of better reduction if the data is rotated 90° before compression.. I can't recall where I read it..
    This is because LZW compression is one-dimensional, and GIF doesn't have any predictors to compensate for this; here are a few examples.

    Another interesting thing: true-color GIFs exist:
    http://phil.ipal.org/tc.html
    Attached images: h.gif (8.8 KB), v.gif (13.1 KB)

  19. #19
    Member
    Join Date
    May 2008
    Location
    England
    Posts
    325
    Thanks
    18
    Thanked 6 Times in 5 Posts
    The same can happen with JPEGs: using lossless transforms to rotate or flip them can save you some space.

  20. #20
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    324
    Thanks
    29
    Thanked 36 Times in 21 Posts
    Quote Originally Posted by Intrinsic View Post
    The same can happen with JPEGs when using lossless transforms to rotate it/flip it etc, can save you some space.
    I have to check this myself.. (so PackJPG may use it, if not already used..)

    I found a link about the GIF rotation here:
    http://www.ahuka.com/htmllevel1/optimizegif.html

    I tested lossless JPEG rotation:
    Code:
    Angle    jpg          packjpg 2.3    precomp 0.8 (packjpg 2.4)
       0°    1,821,058    1,510,224      1,491,363
     -90°    1,822,722    1,510,324      1,491,971
    -180°    1,821,012    1,510,406      1,491,812
    -270°    1,822,842    1,510,017      1,491,763

    The change is only a fraction of a percent, so not good..

  21. #21
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Quote Originally Posted by encode View Post
    I just made my LZW fully compatible with COMPRESS. At the same time my program compresses better, currently just due to better dictionary resets. If I add more advanced parsing, I may call my LZW "optz": the .Z file optimizer. That means we may get smaller .Z and .GIF files...
    I can't wait for optz!

  22. #22
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Quote Originally Posted by roytam1 View Post
    I can't wait for optz!
    I think the program should have two modes:
    e - compatible with COMPRESS
    ex - the ultimate LZW

    Or the program could even have a dictionary-size option (code bits - dictionary entries):
    9 - 512
    10 - 1k
    11 - 2k
    12 - 4k
    ...
    16 - 64k
    17 - 128k
    18 - 256k
    19 - 512k
    ...
    27 - 128m

    Actually, with a dictionary-size setting of 16 or less we will be fully compatible with COMPRESS. My new program would just extend the .Z format.

  23. #23
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Quote Originally Posted by encode View Post
    I think the program should have two modes:
    e - compatible with COMPRESS
    ex - the ultimate LZW

    Or the program could even have a dictionary-size option (code bits - dictionary entries):
    9 - 512
    10 - 1k
    11 - 2k
    12 - 4k
    ...
    16 - 64k
    17 - 128k
    18 - 256k
    19 - 512k
    ...
    27 - 128m

    Actually, with a dictionary-size setting of 16 or less we will be fully compatible with COMPRESS. My new program would just extend the .Z format.
    What about optimizing GIF files then?

  24. #24
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Quote Originally Posted by roytam1 View Post
    What about optimizing GIF files then?
    I will compare .GIF compression from popular software like Photoshop with my LZW implementation...

  25. #25
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Funny thing - LZP with unbounded context length. Using a tree data structure with backward (context) matching, we can keep buffer positions very effectively. With hashing we keep positions only for fixed context lengths - say 8 and 4. With this approach we keep positions for ALL context lengths, ranging from 1 up to unbounded!

    I tested a Predictor algorithm with such a tree - cool results! I will write a small LZP to test the idea...
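
    For contrast, the usual fixed-order LZP step with a hash table looks like the sketch below (order 4 only; the tree described above would return positions for all context orders at once). This is an illustration under my own assumptions, not the author's structure.
    Code:
    #include <stddef.h>
    #include <stdint.h>
    
    #define TBITS 20
    static uint32_t last_pos[1u << TBITS];  /* context hash -> last position */
    
    static uint32_t h4(const uint8_t *p)    /* hash of the 4 preceding bytes */
    {
        uint32_t x = (uint32_t)p[-1] | ((uint32_t)p[-2] << 8)
                   | ((uint32_t)p[-3] << 16) | ((uint32_t)p[-4] << 24);
        return (x * 2654435761u) >> (32 - TBITS);
    }
    
    /* One LZP step at position i (requires i >= 4): return the match length
       against the position that last saw the same order-4 context; a length
       of 0 means "code a literal". */
    size_t lzp_match(const uint8_t *buf, size_t i, size_t n)
    {
        uint32_t h = h4(buf + i);
        size_t p = last_pos[h];             /* predicted match position      */
        last_pos[h] = (uint32_t)i;
        size_t len = 0;
        while (i + len < n && buf[p + len] == buf[i + len])
            ++len;
        return len;                         /* encoder codes len or literal  */
    }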

  26. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by encode View Post
    I'm just curious what Matt Mahoney, Shelwien and toffer think about such a big and fat LZW?
    Lots of memory always helps on enwik9.

  27. #27
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I will compare .GIF compression from popular software like Photoshop with my LZW implementation...
    Here are results from some software LZW implementations, ENWIK8 as a 10000x10000 8-bit image:
    Code:
    irfanview4.20 xnview1.95.3 acdsee9 
    imageanalizer1.30 m$paint5.1
    
    54809718  ac_lzw.tif
    55018081  ia.gif
    55022485  ac.gif
    55022485  pa.gif
    55022493  iv.gif
    55023331  xn.gif
    55606526  pa.TIF
    55606548  iv_lzw.tif
    55621898  xn_lzw.tif
    55623446  xn_lzw-ad.tif
    71718421  xn_lzw-predict-ad.tif
    74335346  xn_lzw-predict.tif

  28. #28
    Member
    Join Date
    May 2009
    Location
    France
    Posts
    97
    Thanks
    13
    Thanked 74 Times in 44 Posts

    A bit late but...

    Quote Originally Posted by roytam1 View Post
    I can't wait for optz!
    Me too!!


    AiZ

  29. #29
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    OK, it's a dummy version of OPTZ. It is compatible with UNIX COMPRESS but in most cases provides a little bit more compression due to better dictionary reset. No advanced parsing yet. In addition, it should be FAST!

    Anyway, enjoy!
    Attached Files

  30. #30
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Quote Originally Posted by encode View Post
    OK, it's a dummy version of OPTZ. It is compatible with UNIX COMPRESS but in most cases provides a little bit more compression due to better dictionary reset. No advanced parsing yet. In addition, it should be FAST!

    Anyway, enjoy!
    Does it optimize GIF images?
