
Thread: BWT alphabet reordering

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

    BWT alphabet reordering

    I have been doing some experiments with alphabet reordering to improve BWT compression. In theory, this problem could be solved by trying all 256! (about 10^507) permutations, compressing each one, and picking the best. In practice, we aren't going to find it. However, I did find some small improvements with bzip2, bsc (default options), and zpaq -m1 and -m2 with 16 and 100 MB block sizes.

    Code:
    768,771 book1
    769,027 book1.bwtopt
    232,598 book1.bz2
    232,316 book1.bwtopt.bz2
    212,534 book1.bsc
    211,876 book1.bwtopt.bsc
    231,899 book1-m1.zpaq
    231,036 book1-m1.bwtopt.zpaq
    215,277 book1-m2.zpaq
    214,539 book1-m2.bwtopt.zpaq
    
    100,000,000 enwik8
    100,000,256 enwik8.bwtopt
     29,008,758 enwik8.bz2
     28,887,556 enwik8.bwtopt.bz2
     22,254,418 enwik8.bsc
     22,210,454 enwik8.bwtopt.bsc
     25,035,410 enwik8-m1.zpaq
     24,961,147 enwik8-m1.bwtopt.zpaq
     23,328,707 enwik8-m2.zpaq
     23,261,619 enwik8-m2.bwtopt.zpaq
     21,246,030 enwik8-m2-b100.zpaq
     21,210,811 enwik8-m2-b100.bwtopt.zpaq
    The transformed file consists of a 256 byte header D followed by the data; to decode, each remaining byte c is replaced with D[c]. I only reordered the letters a..z because in my experiments, attempting to reorder all 256 bytes made compression worse. These results are for the following alphabet orders:

    book1: zwphfmrbeoqgjycktlixvndsau
    enwik8: qiclsjgnwyhombrfukaepvtxzd

    which means that in book1, a is encoded as z, b as w, c as p, etc. Uppercase is encoded the same as lowercase, A as Z, B as W, etc. The dictionary header D contains the inverse permutation. The encoder/decoder is below. You might get different results depending on your implementation of rand().
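
    To make the mapping concrete, here is a minimal sketch (separate from the full bwtopt.cpp program below; the toy text is only an example) that builds the forward table d[] from a printed order string and applies it. Uppercase uses the same -32 offset as the encoder.

    Code:
    #include <stdio.h>
    
    int main() {
      // Printed order for book1: 'a' encodes to 'z', 'b' to 'w', 'c' to 'p', ...
      const char* order = "zwphfmrbeoqgjycktlixvndsau";
      unsigned char d[256];
      for (int i = 0; i < 256; ++i) d[i] = i;    // identity outside the letters
      for (int i = 0; i < 26; ++i) {
        d['a' + i] = order[i];                   // lowercase mapping
        d['A' + i] = order[i] - 32;              // uppercase follows lowercase
      }
      const char* msg = "Hello";
      for (const char* p = msg; *p; ++p) putchar(d[(unsigned char)*p]);
      putchar('\n');                             // prints "Bfggc" with the book1 order
      return 0;
    }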

    Code:
    /* bwtopt.cpp - Alphabet reorder transform for BWT compressors
    
    To encode: bwtopt N in out
    To decode: bwtopt d in out
    
    where the optimization uses the first N bytes of file in. Uses 6*N memory.
    The output is a 256 byte permutation of the alphabet followed by the
    transformed data. To decode, read the first 256 bytes into dictionary D
    and then replace each remaining input byte c with D[c].
    */
    
    #include <stdio.h>
    #include <stdlib.h>
    #include "divsufsort.h" // lite, from http://code.google.com/p/libdivsufsort/
    
    template <typename T>
    void swap(T& a, T& b) {
      T tmp=a;
      a=b;
      b=tmp;
    }
    
    // swap r1,r2 in d. If r1<r2 swap everything in between.
    // If r1==r2 then swap next adjacent letter.
    void permute(unsigned char* d, int r1, int r2) {
      if (r1==r2) r2=(r1-'a'+1)%26+'a';
      do {
        swap(d[r1], d[r2]);
        swap(d[r1-32], d[r2-32]);
        ++r1;
        --r2;
      } while (r1<r2);
    }
    
    int main(int argc, char** argv) {
    
      // Check args
      int n=0;
      if (argc<4 || ((n=atoi(argv[1]))<1 && argv[1][0]!='d'))
        fprintf(stderr, "Usage: bwtopt N|d input output\n"), exit(1);
    
      // Open files
      FILE* in=fopen(argv[2], "rb");
      if (!in) perror(argv[2]), exit(1);
      FILE* out=fopen(argv[3], "wb");
      if (!out) perror(argv[3]), exit(1);
    
      // Decode
      unsigned char d[256];  // permutation
      if (argv[1][0]=='d') {
        fread(d, 1, 256, in);
        for (int c; (c=getc(in))!=EOF; putc(d[c], out));
        return 0;
      }
    
      // Read n bytes into buf
      for (int i=0; i<256; ++i) d[i]=i;
      unsigned char* buf=(unsigned char*)calloc(n, 6);  // extra space for BWT
      if (!buf) fprintf(stderr, "out of memory\n"), exit(1);
      n=fread(buf, 1, n, in);
      printf("Read n bytes of %s\n", argv[2]);
    
      // Optimize by randomly swapping pairs of chars in a..z or reversing
      // the order in between and testing if BWT improves compression as
      // estimated by counting mismatches between nearby bytes.
      // Try about 10^9/n times.
      unsigned char* bwt=buf+n;   // bwt output
      int* work=(int*)(buf+n*2);  // bwt working space
      int score=0x7fffffff;  // what we want to reduce
      for (int i=0; i<256; ++i) d[i]=i;  // initial identity permutation
      for (int i=1; i<=1000000/(n>>10); ++i) {  // about 1e9/n trials; assumes n >= 1024
        int r1='a'+rand()%26;
        int r2='a'+rand()%26;
        permute(d, r1, r2);
        for (int j=0; j<n; ++j) bwt[j]=d[buf[j]];
        divbwt(bwt, bwt, work, n);
        int newscore=0;
        const int L=4;
        const int M[L+1]={0,2,1,1,1};
        for (int j=0; j<n-L; ++j)
          for (int k=1; k<=L; ++k)
             newscore+=(bwt[j]!=bwt[j+k])*M[k];
        if (newscore<score) {
          score=newscore;
          for (int j='a'; j<='z'; ++j) putchar(d[j]);
          printf(" %5d (%c %c) -> %d\n", i, r1, r2, score);
        }
        else
          permute(d, r1, r2);
      }
    
      // Write
      for (int i=0; i<256; ++i)  // invert d and write to header
        for (int j=0; j<256; ++j)
          if (d[j]==i) putc(j, out);
      for (int i=0; i<n; ++i)  // encode buf
        putc(d[buf[i]], out);
      for (int c; (c=getc(in))!=EOF; putc(d[c], out));  // encode rest of file
      return 0;
    }
    I created the files:

    bwtopt 1000000 book1 book1.bwtopt
    bwtopt 1000000 enwik8 enwik8.bwtopt

    This uses only the first n = 1 MB of input for the optimization. It repeats about 10^9/n times, taking about 1 minute: pick a pair of letters and either swap them or reverse the order of all the letters in between, then compute the BWT (using libdivsufsort) and estimate the compressed size by comparing each byte with the next L bytes and counting mismatches with weight M[k], k=1..L. After experimenting with many values of M and L, I found that L=4, M=2,1,1,1 gives good results.
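
    For illustration, here is a minimal standalone sketch of that mismatch-count estimator (same L and M as in the code above, applied to two toy strings): the better grouped string gets the lower score.

    Code:
    #include <stdio.h>
    #include <string.h>
    
    // Estimate compressibility: compare each byte with the next L bytes and
    // count mismatches, weighting the nearest neighbor twice as much.
    int score(const char* s) {
      const int L = 4;
      const int M[L + 1] = {0, 2, 1, 1, 1};
      int n = (int)strlen(s), sc = 0;
      for (int j = 0; j < n - L; ++j)
        for (int k = 1; k <= L; ++k)
          sc += (s[j] != s[j + k]) * M[k];
      return sc;
    }
    
    int main() {
      printf("%d %d\n", score("aabbaa"), score("ababab"));  // prints 5 6
      return 0;
    }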

    I'm not sure how to improve this. In theory it should also be possible to optimize the bytes outside a..z. Another approach I have experimented with (so far unsuccessfully) is to find a short path through a 512-dimensional context space in order to group letters that have similar contexts (counts of adjacent byte values). That avoids the BWT but is still an NP-hard problem (essentially the traveling salesman problem).
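
    A rough sketch of that idea (a hypothetical illustration, not the code from the actual experiment): build a 512-dimensional context vector for each letter from counts of the preceding and following byte values, then order the letters with a greedy nearest-neighbor walk, which is only a cheap TSP heuristic rather than an exact solution.

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main() {
      // Toy input; in practice this would be the first N bytes of the file.
      const char* s = "the quick brown fox jumps over the lazy dog";
      int n = (int)strlen(s);
    
      // v[c] = counts of the byte before and the byte after each occurrence
      // of letter c, concatenated into one 512-dimensional vector.
      static double v[26][512];
      for (int i = 0; i < n; ++i) {
        int c = s[i];
        if (c < 'a' || c > 'z') continue;
        if (i > 0)     v[c - 'a'][(unsigned char)s[i - 1]] += 1;
        if (i + 1 < n) v[c - 'a'][256 + (unsigned char)s[i + 1]] += 1;
      }
    
      // Greedy nearest-neighbor path: start at 'a' and repeatedly move to the
      // unvisited letter whose context vector is closest in Euclidean distance.
      int visited[26] = {0}, cur = 0;
      visited[0] = 1;
      printf("a");
      for (int step = 1; step < 26; ++step) {
        int best = -1;
        double bestd = 1e30;
        for (int j = 0; j < 26; ++j) {
          if (visited[j]) continue;
          double dist = 0;
          for (int k = 0; k < 512; ++k) {
            double diff = v[cur][k] - v[j][k];
            dist += diff * diff;
          }
          if (dist < bestd) { bestd = dist; best = j; }
        }
        visited[best] = 1;
        cur = best;
        printf("%c", 'a' + best);
      }
      printf("\n");  // the printed letter order could replace the random search
      return 0;
    }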

    The fastest way to solve this might be on a quantum computer. It looks like it can find in 1 ms a probabilistic solution for a vector x in {-1, 1}^128 that minimizes x^T A x + B^T x, where the 128 x 128 matrix A and the 128-element vector B are programmable. I think this could be used to express the problem for the 128 most frequently occurring byte values in a text file, or I could wait for the 256 qubit version. The solution would not be optimal, because NP-complete problems are believed to remain exponential even on a quantum computer, but it would still be faster than a conventional computer. For now it is probably out of my budget (it requires liquid helium cooling), so my laptop will have to do.

  2. #2
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Before trying a quantum computer, you may want to try a genetic algorithm, which is capable of solving the traveling salesman problem over the whole alphabet, and the cost function could be a compression estimate. In that case it won't be optimal, but at least you could eliminate many of the tries (far better than brute force). If you want to adapt my code, I implemented a traveling-salesman version of a genetic algorithm in MATLAB a while ago (actually I believe it's not a real genetic algorithm, but who cares). It permutes a string until the cost function is satisfied. The source lies somewhere on my HDD. Ah... wait... I also have a C# version of it (which guarantees it doesn't contain any "closed source" code blocks like the MATLAB one does). The real use case was finding the optimal sort order for given articles in several columns: it permutes the sort order so that the total height is minimal. Quite the same problem, as you see.
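
    For what it's worth, here is a minimal hypothetical sketch of that kind of mutate-and-select search over permutations (this is not osmanturan's MATLAB or C# code, and the cost function is only a placeholder; a real one would estimate the compressed size after remapping).

    Code:
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <utility>
    #include <vector>
    
    // Placeholder cost: sum of distances between adjacent remapped bytes, so
    // the search tries to map frequently adjacent letters to nearby codes.
    static int Cost(const std::string& s, const std::vector<unsigned char>& perm) {
      int cost = 0;
      for (size_t i = 1; i < s.size(); ++i) {
        int d = (int)perm[(unsigned char)s[i]] - (int)perm[(unsigned char)s[i - 1]];
        cost += d < 0 ? -d : d;
      }
      return cost;
    }
    
    int main() {
      const std::string data = "the quick brown fox jumps over the lazy dog";
      const int kPop = 16, kGenerations = 200;
    
      // Population of candidate permutations, all starting as the identity.
      std::vector<std::vector<unsigned char> > pop(kPop, std::vector<unsigned char>(256));
      for (int i = 0; i < kPop; ++i)
        for (int c = 0; c < 256; ++c) pop[i][c] = (unsigned char)c;
    
      for (int g = 0; g < kGenerations; ++g) {
        // Rank the candidates by cost.
        std::vector<std::pair<int, int> > ranked;            // (cost, index)
        for (int i = 0; i < kPop; ++i)
          ranked.push_back(std::make_pair(Cost(data, pop[i]), i));
        std::sort(ranked.begin(), ranked.end());
        // Keep the best half, replace the worst half with mutated copies.
        std::vector<std::vector<unsigned char> > next(kPop);
        for (int i = 0; i < kPop; ++i) {
          next[i] = pop[ranked[i % (kPop / 2)].second];
          if (i >= kPop / 2) {                               // mutate the copy
            int a = 'a' + rand() % 26, b = 'a' + rand() % 26;
            std::swap(next[i][a], next[i][b]);
          }
        }
        pop.swap(next);
      }
      printf("best cost: %d\n", Cost(data, pop[0]));
      return 0;
    }
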
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    I think D-Wave One is either a hoax or has severely limited possible applications. It's unverified, and the authors claim that it only does discrete optimization.

    Regarding alphabet reordering, how about:
    - checking random 1 MB parts of the data, not only the first 1 MB (a minimal sampling sketch is below),
    - trying order-1 alphabet reorderings?
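
    A minimal sketch of the first suggestion (hypothetical; it assumes the optimizer can work from any buffer): instead of reading only the first 1 MB, read a few 1 MB blocks from random offsets and concatenate them into the training buffer.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    // Fill buf with 'samples' blocks of 'blocksize' bytes taken from random
    // offsets in file f; returns the number of bytes actually read.
    // (Note: rand() limits how far into a very large file the offsets reach.)
    size_t sample_blocks(FILE* f, unsigned char* buf, int samples, long blocksize) {
      fseek(f, 0, SEEK_END);
      long filesize = ftell(f);
      long maxoff = filesize > blocksize ? filesize - blocksize : 0;
      size_t total = 0;
      for (int i = 0; i < samples; ++i) {
        fseek(f, maxoff ? rand() % maxoff : 0, SEEK_SET);
        total += fread(buf + total, 1, blocksize, f);
      }
      return total;
    }
    
    int main(int argc, char** argv) {
      if (argc < 2) return 1;
      FILE* f = fopen(argv[1], "rb");
      if (!f) return 1;
      unsigned char* buf = (unsigned char*)malloc(4 << 20);
      size_t n = sample_blocks(f, buf, 4, 1 << 20);   // four random 1 MB blocks
      printf("sampled %lu bytes\n", (unsigned long)n);
      return 0;
    }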

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yeah, not sure about the quantum algorithm. But what I'm doing now is sort of a genetic algorithm but with only one organism.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    I found my old alphabet permutation sources and started an optimization with book1/bsc as the target.
    For now I've got 212172 -> 210411 ... I'll see if it improves any further and post the final tables.
    (the options for 212172 are bsc.exe e book1 book1.bsc -m0tTp)

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    http://nishi.dreamhosters.com/u/BWT_reorder_v4.rar
    Code:
    book1  enwik6
    212172 253988  original + bsc 2.7.0
    210239 254560  book1 permutation + bsc 2.7.0
    Switched to the enwik6 target; it should be more stable...

    Update:

    http://nishi.dreamhosters.com/u/BWT_reorder_v4a.rar
    Code:
    book1  enwik6 enwik8
    212172 253988 20787496  original + bsc 2.7.0         
    210700 251693 20736429  book1 permutation + bsc 2.7.0
    Update2:
    http://nishi.dreamhosters.com/u/BWT_reorder_v4b.rar

    Code:
    book1  enwik6 enwik8
    212172 253988 20787496  original + bsc 2.7.0         
    210808 251403 20730606  book1 permutation + bsc 2.7.0

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    The permutation optimized on the enwik6 target improves over bwtopt on enwik8 too.

    Code:
    28,852,844 enwik8r.bz2
    22,178,404 enwik8r.bsc
    24,981,295 enwik8r-m1.zpaq
    23,250,554 enwik8r-m2.zpaq
    21,194,231 enwik8r-m2-b100.zpaq

  8. #8
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts

    Lehmer code & factorial number system

    Quote Originally Posted by Matt Mahoney View Post
    The transformed file consists of a 256 byte header D followed by the data; to decode, each remaining byte c is replaced with D[c].
    You could reduce the size of this 256-byte header (2048 bits) to 210.5 bytes (log2(256!) ≈ 1684 bits), since the permutation can be coded with a Lehmer code in the factorial number system. It would require some sort of "bignum" library to perform the computations on integers thousands of bits long.

    For example, with a 5-letter alphabet (0,1,2,3,4), the number of permutations is 5! = 120.
    An easy way to associate a number with a permutation is the Lehmer code.

    Let's say we want to code the permutation (2,0,3,4,1) as a number between 0 and 119. The Lehmer code turns the permutation into (2,0,1,1,0):

    Code:
    Matching (2,0,3,4,1) out of (0,1,2,3,4)
    
    (0,1,2,3,4)
         ^
    index of 2 is 2 -> Lehmer code (2,
    
    2 is removed from the list
    (0,1,3,4)
     ^
    index of 0 is 0 -> Lehmer code (2,0
    
    0 is removed from the list
    (1,3,4)
       ^
    index of 3 is 1 -> Lehmer code (2,0,1
    
    3 is removed from the list
    (1,4)
       ^
    index of 4 is 1 -> Lehmer code (2,0,1,1
    
    4 is removed from the list
    (1)
     ^
    index of 1 is 0 -> Lehmer code (2,0,1,1,0)
    Of course the last index is always zero, so it can be dropped.
    Now we use the factorial number system to turn the Lehmer code into a single number:

    Code:
    (2,   0,   1,   1,   0)
     |    |    |    |
     2x4!+0x3!+1x2!+1x1! = 2x24 + 1x2 + 1x1 = 51
    
    The reversal of the alphabet (4,3,2,1,0) has a Lehmer code equal to itself and produces the biggest number:
    (4,   3,   2,   1,   0)
     |    |    |    |
     4x4!+3x3!+2x2!+1x1! = 4x24 + 3x6 + 2x2 + 1x1 = 119
    And now the reverse from 51 back to the permutation.

    Code:
    51 divided by 1 gives 51 (remainder 0)
    51 divided by 2 gives 25 (remainder 1)
    25 divided by 3 gives 8 (remainder 1)
    8 divided by 4 gives 2 (remainder 0)
    2 divided by 5 gives 0 (remainder 2)
    If you look at the remainders in reverse order... Tah dah! The Lehmer code (2,0,1,1,0) is back!
    Getting the permutation from the Lehmer code is pretty easy:
    Code:
    (0,1,2,3,4)
         ^
    at index 2 we have 2 -> Permutation (2,
    
    2 is removed from the list
    (0,1,3,4)
     ^
    at index 0 we have 0 -> Permutation (2,0
    
    0 is removed from the list
    (1,3,4)
       ^
    at index 1 we have 3 -> Permutation (2,0,3
    
    3 is removed from the list
    (1,4)
       ^
    at index 1 we have 4 -> Permutation (2,0,3,4
    
    4 is removed from the list
    (1)
     ^
    at index 0 we have 1 -> Permutation (2,0,3,4,1)
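
    Here is a minimal sketch of the two steps described above (permutation to Lehmer code and back), kept small so no bignum library is needed; the function names are only for illustration.

    Code:
    #include <stdio.h>
    #include <vector>
    
    // Permutation -> Lehmer code: each digit is the index of the next symbol
    // within the list of symbols not yet used.
    std::vector<int> lehmer_encode(const std::vector<int>& perm) {
      std::vector<int> rest, code;
      for (int i = 0; i < (int)perm.size(); ++i) rest.push_back(i);
      for (int i = 0; i < (int)perm.size(); ++i) {
        int idx = 0;
        while (rest[idx] != perm[i]) ++idx;   // position of perm[i] among the rest
        code.push_back(idx);
        rest.erase(rest.begin() + idx);       // remove it from the list
      }
      return code;
    }
    
    // Lehmer code -> permutation: pick the code[i]-th remaining symbol.
    std::vector<int> lehmer_decode(const std::vector<int>& code) {
      std::vector<int> rest, perm;
      for (int i = 0; i < (int)code.size(); ++i) rest.push_back(i);
      for (int i = 0; i < (int)code.size(); ++i) {
        perm.push_back(rest[code[i]]);
        rest.erase(rest.begin() + code[i]);
      }
      return perm;
    }
    
    int main() {
      int p[5] = {2, 0, 3, 4, 1};
      std::vector<int> perm(p, p + 5);
      std::vector<int> code = lehmer_encode(perm);          // (2,0,1,1,0)
      long num = 0;                                         // factorial number system
      for (size_t i = 0; i < code.size(); ++i)
        num = num * (long)(code.size() - i) + code[i];      // Horner form, gives 51
      printf("Lehmer code:");
      for (size_t i = 0; i < code.size(); ++i) printf(" %d", code[i]);
      printf(" -> %ld\n", num);                             // Lehmer code: 2 0 1 1 0 -> 51
      return 0;
    }
    For the full 256-byte table the packed number needs about 1684 bits, hence the bignum library mentioned above.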
    Last edited by caveman; 3rd September 2011 at 20:42. Reason: Lehmer code sample

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    A regular range coder should add a few bytes of overhead, maybe less if we use the bijective ones from biject.bwts.

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Very clever. But you should also store the permutation table in a form that is easy to compress. Using a Lehmer+factorial code or a range coder would result in 208 incompressible random bytes. A Lehmer code without the factorial step would avoid big-number arithmetic and still compress because of the bias toward smaller byte values.

