I have been doing some experiments with alphabet reordering to improve BWT compression. In theory, this problem could be solved by trying all 256! (about 10^507) permutations, compressing each one, and picking the best. In practice, we aren't going to find it. However, I did find some small improvements with bzip2, bsc (default options), and zpaq -m1 and -m2 with 16 and 100 MB block sizes.

The transformation writes a 256 byte header D, then replaces each remaining byte c with D[c]. I only reordered the letters a..z because, in my experiments, attempting to reorder all 256 bytes made compression worse. The results:

Code:
       768,771 book1
       769,027 book1.bwtopt
       232,598 book1.bz2
       232,316 book1.bwtopt.bz2
       212,534 book1.bsc
       211,876 book1.bwtopt.bsc
       231,899 book1-m1.zpaq
       231,036 book1-m1.bwtopt.zpaq
       215,277 book1-m2.zpaq
       214,539 book1-m2.bwtopt.zpaq
   100,000,000 enwik8
   100,000,256 enwik8.bwtopt
    29,008,758 enwik8.bz2
    28,887,556 enwik8.bwtopt.bz2
    22,254,418 enwik8.bsc
    22,210,454 enwik8.bwtopt.bsc
    25,035,410 enwik8-m1.zpaq
    24,961,147 enwik8-m1.bwtopt.zpaq
    23,328,707 enwik8-m2.zpaq
    23,261,619 enwik8-m2.bwtopt.zpaq
    21,246,030 enwik8-m2-b100.zpaq
    21,210,811 enwik8-m2-b100.bwtopt.zpaq

These results are for the following alphabet orders:

book1: zwphfmrbeoqgjycktlixvndsau

enwik8: qiclsjgnwyhombrfukaepvtxzd

which means that in book1, a is encoded as z, b as w, c as p, etc. Uppercase is encoded the same as lowercase, A as Z, B as W, etc. The dictionary header D contains the inverse permutation. The encoder/decoder is below. You might get different results depending on your implementation of rand().
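The header/transform format described above can be sketched as a simple round trip. This is a minimal illustration, not the full program below; the function names `encode` and `decode` are my own. The encoder applies the forward permutation d but writes its inverse D as the header, so the decoder needs no knowledge of d.

```cpp
#include <string>

// Encode: write the 256 byte inverse permutation D, then map each input
// byte c through the forward permutation d.
std::string encode(const std::string& in, const unsigned char d[256]) {
    unsigned char D[256];
    for (int i = 0; i < 256; ++i) D[d[i]] = i;  // invert d: D[d[c]] == c
    std::string out(D, D + 256);                // 256 byte header
    for (size_t i = 0; i < in.size(); ++i)
        out += (char)d[(unsigned char)in[i]];
    return out;
}

// Decode: read the first 256 bytes into dictionary D, then replace each
// remaining byte c with D[c].
std::string decode(const std::string& in) {
    const unsigned char* D = (const unsigned char*)in.data();
    std::string out;
    for (size_t i = 256; i < in.size(); ++i)
        out += (char)D[(unsigned char)in[i]];
    return out;
}
```

Any permutation d recovers the original exactly, so the transform costs a fixed 256 bytes (as the enwik8.bwtopt size shows) and can only pay off through better downstream compression.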

Code:
/* bwtopt.cpp - Alphabet reorder transform for BWT compressors

To encode: bwtopt N in out
To decode: bwtopt d in out

where the optimization uses the first N bytes of file in. Uses 6*N memory.
The output is a 256 byte permutation of the alphabet followed by the
transformed data. To decode, read the first 256 bytes into dictionary D
and then replace each remaining input byte c with D[c].
*/

#include <stdio.h>
#include <stdlib.h>
#include "divsufsort.h"  // lite, from http://code.google.com/p/libdivsufsort/

template <typename T> void swap(T& a, T& b) {T tmp=a; a=b; b=tmp;}

// Swap r1,r2 in d. If r1<r2 swap everything in between.
// If r1==r2 then swap next adjacent letter.
void permute(unsigned char* d, int r1, int r2) {
  if (r1==r2) r2=(r1-'a'+1)%26+'a';
  do {
    swap(d[r1], d[r2]);
    swap(d[r1-32], d[r2-32]);  // same swap for uppercase
    ++r1;
    --r2;
  } while (r1<r2);
}

int main(int argc, char** argv) {

  // Check args
  int n=0;
  if (argc<4 || ((n=atoi(argv[1]))<1 && argv[1][0]!='d'))
    fprintf(stderr, "Usage: bwtopt N|d input output\n"), exit(1);

  // Open files
  FILE* in=fopen(argv[2], "rb");
  if (!in) perror(argv[2]), exit(1);
  FILE* out=fopen(argv[3], "wb");
  if (!out) perror(argv[3]), exit(1);

  // Decode
  unsigned char d[256];  // permutation
  if (argv[1][0]=='d') {
    if (fread(d, 1, 256, in)!=256)
      fprintf(stderr, "bad header\n"), exit(1);
    for (int c; (c=getc(in))!=EOF; putc(d[c], out));
    return 0;
  }

  // Read n bytes into buf
  unsigned char* buf=(unsigned char*)calloc(n, 6);  // extra space for BWT
  if (!buf) fprintf(stderr, "out of memory\n"), exit(1);
  n=fread(buf, 1, n, in);
  printf("Read %d bytes of %s\n", n, argv[2]);

  // Optimize by randomly swapping pairs of chars in a..z or reversing
  // the order in between and testing if BWT improves compression as
  // estimated by counting mismatches between nearby bytes.
  unsigned char* bwt=buf+n;    // bwt output
  int* work=(int*)(buf+n*2);   // bwt working space
  int score=0x7fffffff;        // what we want to reduce
  for (int i=0; i<256; ++i) d[i]=i;  // initial identity permutation
  for (int i=1; i<=1000000/(n>>10); ++i) {  // about 10^9/n trials
    int r1='a'+rand()%26;
    int r2='a'+rand()%26;
    permute(d, r1, r2);
    for (int j=0; j<n; ++j) bwt[j]=d[buf[j]];
    divbwt(bwt, bwt, work, n);
    int newscore=0;
    const int L=4;
    const int M[L+1]={0,2,1,1,1};
    for (int j=0; j<n-L; ++j)
      for (int k=1; k<=L; ++k)
        newscore+=(bwt[j]!=bwt[j+k])*M[k];
    if (newscore<score) {  // keep the change
      score=newscore;
      for (int j='a'; j<='z'; ++j) putchar(d[j]);
      printf(" %5d (%c %c) -> %d\n", i, r1, r2, score);
    }
    else permute(d, r1, r2);  // undo the change
  }

  // Write
  for (int i=0; i<256; ++i)  // invert d and write to header
    for (int j=0; j<256; ++j)
      if (d[j]==i) putc(j, out);
  for (int i=0; i<n; ++i)  // encode buf
    putc(d[buf[i]], out);
  for (int c; (c=getc(in))!=EOF; putc(d[c], out));  // encode rest of file
  return 0;
}

I created the files:

bwtopt 1000000 book1 book1.bwtopt

bwtopt 1000000 enwik8 enwik8.bwtopt

These commands use only the first n = 1 MB of the input for optimization. The optimizer repeats about 10^9/n times, taking about 1 minute: pick a pair of letters and either swap them or reverse the order of all the letters in between. Then compute the BWT (using libdivsufsort) and estimate the compressed size by comparing each byte with the next L bytes, counting mismatches with weight M[i], i=1..L. After experimenting with many values of M and L, I found that L=4, M=2,1,1,1 gives good results.
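The cost estimate used inside the search loop can be pulled out as a standalone function. This is just the scoring step extracted from the program above; the name `mismatch_score` is mine:

```cpp
// Estimate BWT compressibility: compare each byte with its next L=4
// bytes and sum weighted mismatches. Lower score means longer runs of
// equal bytes, which BWT post-coders compress better.
int mismatch_score(const unsigned char* bwt, int n) {
    const int L = 4;
    const int M[L + 1] = {0, 2, 1, 1, 1};  // weight 2 at distance 1
    int score = 0;
    for (int j = 0; j < n - L; ++j)
        for (int k = 1; k <= L; ++k)
            score += (bwt[j] != bwt[j + k]) * M[k];
    return score;
}
```

A run of identical bytes scores 0, while an alternating pattern like "abab..." pays the full distance-1 and distance-3 weights at every position, which is why minimizing this score tends to group letters with similar contexts.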

I am not sure how to improve this further. In theory it should be possible to optimize outside a..z as well. Another approach I have experimented with (so far without success) is to find a short path through a 512 dimensional context space, grouping together letters that have similar contexts (counts of the adjacent byte values on each side). That avoids computing the BWT, but it is still an NP-hard problem (traveling salesman).

The fastest way to solve this might be on a quantum computer. It looks like such a machine can find, in 1 ms, a probabilistic solution for the vector x in {-1, 1}^128 that minimizes x^T Ax + B.x, where the 128 x 128 matrix A and the 128 element vector B are programmable. I think this could be used to express the problem for the 128 most frequently occurring byte values in a text file, or I could wait for a 256 qubit version. The solution would not be optimal, because NP-complete problems presumably remain exponential even on a quantum computer, but it would still be faster than a conventional computer. For now it is probably out of my budget (it requires liquid helium cooling), so my laptop will have to do.