Results 1 to 7 of 7

Thread: LZWC - A fast tree based LZW compressor

  1. #1
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    113
    Thanks
    9
    Thanked 22 Times in 15 Posts

    LZWC - A fast tree based LZW compressor

    LZW is an algorithm rarely found in compression benchmarks, mostly due to legal issues and the fact that it is relatively outdated. Since there are few LZW compressors, I decided to make one myself. It uses trees to speed up dictionary searching and encodes each entry as a 16-bit codeword. I have included comments about other possible optimizations in the source code. The release includes the C source code and a 32-bit executable (built with tcc).

    Version 0.3 adds in file buffering and verbose output.
    Version 0.4 fixes an error with binary data.
    Version 0.7 adds multiple engine wrappers and has an optional bitwise mode for adaptive code lengths (now two executables, one for each mode, incompatible with each other).


    Benchmarks:

    enwik8 100,000,000 -> 46,647,318 (46.6%) in 4.094 seconds
    Attached Files Attached Files
    Last edited by david_werecat; 16th January 2013 at 02:07.

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    FYI: gcc 4.2.1 doesn't accept recursive inline functions and throws a linker error.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    Added to LTCB. http://mattmahoney.net/dc/text.html#4638

    Edit: lzwc 0.3 crashed while decompressing several of the Silesia files. It looks like only the text files were decompressed correctly.

    Code:
    10,192,446 dickens
     4,306,040 dickens.lzw
    51,220,480 mozilla
    41,791,536 mozilla.lzw
     9,970,564 mr
     3,785,442 mr.lzw
    33,553,445 nci
     3,665,810 nci.lzw
     6,152,192 ooffice
     5,103,854 ooffice.lzw
    10,085,684 osdb
     4,342,370 osdb.lzw
     6,627,202 reymont
     1,941,516 reymont.lzw
    21,606,400 samba
    17,599,344 samba.lzw
     7,251,944 sao
     6,981,730 sao.lzw
    41,458,703 webster
    14,574,788 webster.lzw
     8,474,240 x-ray
     7,771,628 x-ray.lzw
     5,345,280 xml
     3,847,292 xml.lzw
    
    decompresed:
    
    10,192,446 dickens
             0 mozilla
             0 mr
    33,553,445 nci
             0 ooffice
             0 osdb
     6,627,202 reymont
        65,536 samba
             0 sao
    41,458,703 webster
             0 x-ray
     5,345,280 xml
    Last edited by Matt Mahoney; 15th January 2013 at 22:11.

  4. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I have troubles with it on scc too.
    I converted it to a library and maybe I've done it wrong, but it fails to compress scc as a whole - it actually expands it here.
    Code:
    *** LZWC.c    2013-01-15 13:53:32.000000000 +0100
    --- ../LZWC.c    2013-01-15 21:09:57.327317870 +0100
    ***************
    *** 22,31 ****
      
      //#define PROGRESS
      #include <stdlib.h>
    ! #include <stdio.h>
      #include <string.h>
      #include <time.h>
    ! #include <BufferedFile.h>
      
      struct LZWNODE {
          unsigned short num;
    --- 22,31 ----
      
      //#define PROGRESS
      #include <stdlib.h>
    ! //#include <stdio.h>
      #include <string.h>
      #include <time.h>
    ! //#include <BufferedFile.h>
      
      struct LZWNODE {
          unsigned short num;
    ***************
    *** 45,51 ****
          for(i=0;i<256;i++) { ln->subnodes[i] = NULL; }
          return ln;
      }
    ! __inline void lzwnode_free(PLZWNODE node, int full) {
          int i;
          if(full) {
              for(i=0;i<256;i++) { 
    --- 45,51 ----
          for(i=0;i<256;i++) { ln->subnodes[i] = NULL; }
          return ln;
      }
    ! void lzwnode_free(PLZWNODE node, int full) {
          int i;
          if(full) {
              for(i=0;i<256;i++) { 
    ***************
    *** 92,114 ****
          return lb;
      }
      
    ! int lzw_compress(BufferedFile* input, BufferedFile* output) {
          //Initialize LZW
          int c;
          long long ib=0,ob=0;
          PLZWBASE lzw = lzwbase_init(1);
          if(!lzw) { return 0; }
          //Compress
    !     while((c = bfgetc(input)) != EOF) {
              if(!lzw->node->subnodes[c]) {
                  //Not in dictionary.
                  if(lzw->entries < 65536) {
    !                 if(!(lzw->node->subnodes[c] = lzw->dict[lzw->entries] = lzwnode_init(lzw->node, c, lzw->entries))) { lzwbase_free(lzw); printf("\n"); return 0; }
                      lzw->entries++;
                  }
                  //Write code
    !             bfputc((lzw->node->num >> 8) & 0xFF, output);
    !             bfputc(lzw->node->num & 0xFF, output);
                  ob += 2;
                  //Reset
                  lzw->node = lzw->tree->subnodes[c];
    --- 92,118 ----
          return lb;
      }
      
    ! size_t lzwc_compress(const unsigned char* input, size_t input_size, unsigned char* output, size_t output_size) {
          //Initialize LZW
    +     const unsigned char* input_end = input + input_size;
    +     const unsigned char* output_end = output + output_size;
          int c;
          long long ib=0,ob=0;
          PLZWBASE lzw = lzwbase_init(1);
          if(!lzw) { return 0; }
          //Compress
    !     while(input < input_end) {
    !         c = *input++;
              if(!lzw->node->subnodes[c]) {
                  //Not in dictionary.
                  if(lzw->entries < 65536) {
    !                 if(!(lzw->node->subnodes[c] = lzw->dict[lzw->entries] = lzwnode_init(lzw->node, c, lzw->entries))) { lzwbase_free(lzw); return 0; }
                      lzw->entries++;
                  }
                  //Write code
    !             if(output_end - output < 2) { lzwbase_free(lzw); return 0; }
    !             *output++ = (lzw->node->num >> 8);
    !             *output++ = lzw->node->num;
                  ob += 2;
                  //Reset
                  lzw->node = lzw->tree->subnodes[c];
    ***************
    *** 122,138 ****
      #endif
          }
          //Flush data
    !     bfputc((lzw->node->num >> 8) & 0xFF, output);
    !     bfputc(lzw->node->num & 0xFF, output);
    !     bfflush(output);
          ob += 2;
    -     printf("Compressed    %lli -> %lli (%.1f%%)    \n", ib, ob, (double)(ob * 100) / ib);
          //Return
          lzwbase_free(lzw);
    !     return 1;
      }
    ! int lzw_decompress(BufferedFile* input, BufferedFile* output) {
          //Initialize LZW
          int c,d,i,p;
          long long ib=0,ob=0;
          LZWNODE* last = NULL;
    --- 126,143 ----
      #endif
          }
          //Flush data
    !     if(output_end - output < 2) { lzwbase_free(lzw); return 0; }
    !     *output++ = (lzw->node->num >> 8);
    !     *output++ = lzw->node->num;
          ob += 2;
          //Return
          lzwbase_free(lzw);
    !     return output_size - (output_end - output);
      }
    ! size_t lzwc_decompress(const unsigned char* input, size_t input_size, unsigned char* output, size_t output_size) {
          //Initialize LZW
    +     const unsigned char* input_end = input + input_size;
    +     const unsigned char* output_end = output + output_size;
          int c,d,i,p;
          long long ib=0,ob=0;
          LZWNODE* last = NULL;
    ***************
    *** 140,146 ****
          PLZWBASE lzw = lzwbase_init(1);
          if(!lzw) { return 0; }
          //Compress
    !     while(((c = bfgetc(input)) != EOF) & ((d = bfgetc(input)) != EOF)) {
              d |= c << 8;
              d &= 65535;
              //Find chain
    --- 145,153 ----
          PLZWBASE lzw = lzwbase_init(1);
          if(!lzw) { return 0; }
          //Compress
    !     while(input < input_end - 1) {
    !         c = *input++;
    !         d = *input++;
              d |= c << 8;
              d &= 65535;
              //Find chain
    ***************
    *** 148,154 ****
              if(d == lzw->entries) {
                  //Node hasn't been added yet, add new dictionary entry
                  if(last && (lzw->entries < 65536)) {
    !                 if(!(last->subnodes[p] = lzw->dict[lzw->entries] = lzwnode_init(last, p, lzw->entries))) { lzwbase_free(lzw); printf("\n"); return 0; }
                      lzw->entries++;
                  }
                  lzw->node = last->subnodes[p];
    --- 155,161 ----
              if(d == lzw->entries) {
                  //Node hasn't been added yet, add new dictionary entry
                  if(last && (lzw->entries < 65536)) {
    !                 if(!(last->subnodes[p] = lzw->dict[lzw->entries] = lzwnode_init(last, p, lzw->entries))) { lzwbase_free(lzw); return 0; }
                      lzw->entries++;
                  }
                  lzw->node = last->subnodes[p];
    ***************
    *** 162,168 ****
                  last = chain[0];
              } else {
                  //Normal node
    !             if(!(lzw->node = lzw->dict[d])) { lzwbase_free(lzw); printf("\n"); return 0; }
                  //Walk node chain
                  while(lzw->node->parent) { 
                      chain[i++] = lzw->node;
    --- 169,175 ----
                  last = chain[0];
              } else {
                  //Normal node
    !             if(!(lzw->node = lzw->dict[d])) { lzwbase_free(lzw); return 0; }
                  //Walk node chain
                  while(lzw->node->parent) { 
                      chain[i++] = lzw->node;
    ***************
    *** 171,184 ****
                  //Add new dictionary entry
                  p = chain[i-1]->val;
                  if(last && (lzw->entries < 65536)) {
    !                 if(!(last->subnodes[p] = lzw->dict[lzw->entries] = lzwnode_init(last, p, lzw->entries))) { lzwbase_free(lzw); printf("\n"); return 0; }
                      lzw->entries++;
                  }
                  last = chain[0];            
              }
              //Decode
              ob += i;
    !         while(i--) { bfputc(chain[i]->val, output); }
              //Update progress
              ib += 2;
      #ifdef PROGRESS
    --- 178,192 ----
                  //Add new dictionary entry
                  p = chain[i-1]->val;
                  if(last && (lzw->entries < 65536)) {
    !                 if(!(last->subnodes[p] = lzw->dict[lzw->entries] = lzwnode_init(last, p, lzw->entries))) { lzwbase_free(lzw); return 0; }
                      lzw->entries++;
                  }
                  last = chain[0];            
              }
              //Decode
              ob += i;
    !         if(output_end - output < i) { lzwbase_free(lzw); return 0; }
    !         while(i--) *output++ = chain[i]->val;
              //Update progress
              ib += 2;
      #ifdef PROGRESS
    ***************
    *** 186,198 ****
      #endif
          }
          //Flush data
    -     bfflush(output);
    -     printf("Decompressed  %lli -> %lli (%.1f%%)    \n", ib, ob, (double)(ob * 100) / ib);
          //Return
          lzwbase_free(lzw);
    !     return 1;
      }
    ! 
      void dohelp(char* prog) {
          //printf("LZWC 0.3                            Copyright (c) 2013 by David Catt\n\n");
          printf("usage: %s c|d [infile] [outfile]\n", prog);
    --- 194,204 ----
      #endif
          }
          //Flush data
          //Return
          lzwbase_free(lzw);
    !     return output_size - (output_end - output);
      }
    ! #if 0
      void dohelp(char* prog) {
          //printf("LZWC 0.3                            Copyright (c) 2013 by David Catt\n\n");
          printf("usage: %s c|d [infile] [outfile]\n", prog);
    ***************
    *** 243,246 ****
              return 1;
          }
          return 0;
    ! }
    \ No newline at end of file
    --- 249,253 ----
              return 1;
          }
          return 0;
    ! }
    ! #endif
    \ No newline at end of file

  5. #5
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    113
    Thanks
    9
    Thanked 22 Times in 15 Posts
    Version 0.4 fixes the issue with binary files and removes recursive inline functions.

    Also, using the algorithm to compress in small blocks will most likely expand files, since there isn't enough data to fill the dictionary. Sometime in the future I might implement variable length codes for dictionary entries to lessen this problem.
    Last edited by david_werecat; 15th January 2013 at 22:53.

  6. #6
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    113
    Thanks
    9
    Thanked 22 Times in 15 Posts
    Version 0.7 adds generic functions to easily support multiple file to file compression modes. It also introduces bitwise mode, in which the dictionary indices are written with less bits when the dictionary is empty. Bitwise mode will compress slightly better (mainly on small files), although is approximately twice as slow as the default mode. There is one executable for each mode: the standard executable is compatible with all previous versions and the bitwise executable is incompatible with all other versions.

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts

Similar Threads

  1. BALZ - An Open-Source ROLZ-based compressor
    By encode in forum Data Compression
    Replies: 60
    Last Post: 6th March 2015, 16:47
  2. PPMX v0.05 - new PPM-based compressor
    By encode in forum Data Compression
    Replies: 49
    Last Post: 28th July 2010, 02:47
  3. BCM v0.08 - The ultimate BWT-based file compressor!
    By encode in forum Data Compression
    Replies: 78
    Last Post: 12th August 2009, 10:14
  4. BCM v0.01 - New BWT+CM-based compressor
    By encode in forum Data Compression
    Replies: 81
    Last Post: 9th February 2009, 15:47
  5. New fast open-source paq-based jpeg compressor
    By Bulat Ziganshin in forum Forum Archive
    Replies: 14
    Last Post: 13th September 2007, 13:57

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •