1. ## BWT alphabet reordering

I have been doing some experiments with alphabet reordering to improve BWT compression. In theory, this problem could be solved by trying all 256! (about 10^507) permutations, compressing each one, and picking the best. In practice, we aren't going to find it. However, I did find some small improvements with bzip2, bsc (default options), and zpaq -m1 and -m2 with 16 and 100 MB block sizes.

Code:
```768,771 book1
769,027 book1.bwtopt
232,598 book1.bz2
232,316 book1.bwtopt.bz2
212,534 book1.bsc
211,876 book1.bwtopt.bsc
231,899 book1-m1.zpaq
231,036 book1-m1.bwtopt.zpaq
215,277 book1-m2.zpaq
214,539 book1-m2.bwtopt.zpaq

100,000,000 enwik8
100,000,256 enwik8.bwtopt
29,008,758 enwik8.bz2
28,887,556 enwik8.bwtopt.bz2
22,254,418 enwik8.bsc
22,210,454 enwik8.bwtopt.bsc
25,035,410 enwik8-m1.zpaq
24,961,147 enwik8-m1.bwtopt.zpaq
23,328,707 enwik8-m2.zpaq
23,261,619 enwik8-m2.bwtopt.zpaq
21,246,030 enwik8-m2-b100.zpaq
21,210,811 enwik8-m2-b100.bwtopt.zpaq```
The transformation consists of a 256 byte header D, and then replace each remaining byte c with D[c]. I only reordered the letters a..z because in my experiments, attempting to reorder all 256 bytes made compression worse. These results are for the following alphabet orders:

book1: zwphfmrbeoqgjycktlixvndsau
enwik8: qiclsjgnwyhombrfukaepvtxzd

which means that in book1, a is encoded as z, b as w, c as p, etc. Uppercase is encoded the same as lowercase, A as Z, B as W, etc. The dictionary header D contains the inverse permutation. The encoder/decoder is below. You might get different results depending on your implementation of rand().

Code:
```/* bwtopt.cpp - Alphabet reorder transform for BWT compressors

To encode: bwtopt N in out
To decode: bwtopt d in out

where the optimization uses the first N bytes of file in. Uses 6*N memory.
The output is a 256 byte permutation of the alphabet followed by the
transformed data. To decode, read the first 256 bytes into dictionary D
and then replace each remaining input byte c with D[c].
*/

#include <stdio.h>
#include <stdlib.h>
#include "divsufsort.h" // lite, from http://code.google.com/p/libdivsufsort/

template <typename T>
void swap(T& a, T& b) {
T tmp=a;
a=b;
b=tmp;
}

// swap r1,r2 in d. If r1<r2 swap everything in between.
// If r1==r2 then swap next adjacent letter.
void permute(unsigned char* d, int r1, int r2) {
if (r1==r2) r2=(r1-'a'+1)%26+'a';
do {
swap(d[r1], d[r2]);
swap(d[r1-32], d[r2-32]);
++r1;
--r2;
} while (r1<r2);
}

int main(int argc, char** argv) {

// Check args
int n=0;
if (argc<4 || ((n=atoi(argv))<1 && argv!='d'))
fprintf(stderr, "Usage: bwtopt N|d input output\n"), exit(1);

// Open files
FILE* in=fopen(argv, "rb");
if (!in) perror(argv), exit(1);
FILE* out=fopen(argv, "wb");
if (!out) perror(argv), exit(1);

// Decode
unsigned char d;  // permutation
if (argv=='d') {
for (int c; (c=getc(in))!=EOF; putc(d[c], out));
return 0;
}

// Read n bytes into buf
for (int i=0; i<256; ++i) d[i]=i;
unsigned char* buf=(unsigned char*)calloc(n, 6);  // extra space for BWT
if (!buf) fprintf(stderr, "out of memory\n"), exit(1);
printf("Read n bytes of %s\n", argv);

// Optimize by randomly swapping pairs of chars in a..z or reversing
// the order in between and testing if BWT improves compression as
// estimated by counting mismatches between nearby bytes.
unsigned char* bwt=buf+n;   // bwt output
int* work=(int*)(buf+n*2);  // bwt working space
int score=0x7fffffff;  // what we want to reduce
for (int i=0; i<256; ++i) d[i]=i;  // initial identity permutation
for (int i=1; i<=1000000/(n>>10); ++i) {
int r1='a'+rand()%26;
int r2='a'+rand()%26;
permute(d, r1, r2);
for (int j=0; j<n; ++j) bwt[j]=d[buf[j]];
divbwt(bwt, bwt, work, n);
int newscore=0;
const int L=4;
const int M[L+1]={0,2,1,1,1};
for (int j=0; j<n-L; ++j)
for (int k=1; k<=L; ++k)
newscore+=(bwt[j]!=bwt[j+k])*M[k];
if (newscore<score) {
score=newscore;
for (int j='a'; j<='z'; ++j) putchar(d[j]);
printf(" %5d (%c %c) -> %d\n", i, r1, r2, score);
}
else
permute(d, r1, r2);
}

// Write
for (int i=0; i<256; ++i)  // invert d and write to header
for (int j=0; j<256; ++j)
if (d[j]==i) putc(j, out);
for (int i=0; i<n; ++i)  // encode buf
putc(d[buf[i]], out);
for (int c; (c=getc(in))!=EOF; putc(d[c], out));  // encode rest of file
return 0;
}```
I created the files:

bwtopt 1000000 book1 book1.bwtopt
bwtopt 1000000 enwik8 enwik8.bwtopt

This means to use only the first n = 1 MB of input to optimize. It repeats about 10^9/n times, taking about 1 minute: pick a pair of letters and either swap them or reverse the order of all letters in between. Then compute the BWT (using libdivsufsort) and estimate the compressed size by comparing each byte with the last L bytes and counting mismatches with weight M[i], i=1..L. After some experimenting with lots of M and L, I found L=4, M=2,1,1,1 gives good results.

Not sure how I can improve this. In theory it should be possible to optimize outside A..Z. Another approach I have experimented with (so far not successfully) is to find a short path through 512 dimensional context space to group letters that have similar contexts (counts of adjacent byte values). It avoids the BWT but is still an NP-complete problem (traveling salesman).

The fastest way to solve this might be on a quantum computer. It looks like it finds in 1 ms a probabilistic solution to vector x in {-1, 1}^128 that minimizes Ax + B, where the 128 x 128 matrix A and 128 element vector B are programmable. I think this could be used to express the problem for the 128 most frequently occurring byte values in a text file, or I could wait for the 256 qubit version. The solution is not optimal because NP-complete problems are still exponential even on a quantum computer, but it is faster than a conventional computer could do. But for now it is probably out of my budget (it includes liquid helium cooling), so my laptop will have to do for now.   Reply With Quote

2. Before trying out a quantum computer, you may want to try a genetic algorithm which is capable of solving traveling salesman problem on whole alphabet And cost function could be compression estimation. In that case, it won't be optimal, but at least you could eliminate several tries (far better than brute force ). If you want to adapt from my codes, I've implemented a traveling salesman version of genetic algorithm in MATLAB at a time (actually I believe it's not a real genetic algorithm, but who cares ). It permutes a string until the cost function satisfies. The source lies somewhere in my HDD. Ahh... Wait... I've also C# version of it (it guarantees that it doesn't consist any of "closed source" code block like in MATLAB). The real usage was about finding optimal sort order for given articles in several columns. It permutes the sort order so that total height would be minumum. Quite same problem as you see.  Reply With Quote

3. I think D-Wave One is either a hoax or has severely limited possible applications. It's unverified and authors claims that it only does Discrete Optimization.

- checking on random 1 MB parts of data, not the only first 1 MB,
- trying order-1 alphabet reorderings   Reply With Quote

4. Yeah, not sure about the quantum algorithm. But what I'm doing now is sort of a genetic algorithm but with only one organism.  Reply With Quote

5. I found my old alphabet permutation sources and started optimization with book1/bsc target.
For now I've got 212172->210411 ... will see if it improves any further and post the final tables.
(the options for 212172 are bsc.exe e book1 book1.bsc -m0tTp)  Reply With Quote

6. http://nishi.dreamhosters.com/u/BWT_reorder_v4.rar
Code:
```book1  enwik6
212172 253988  original + bsc 2.7.0
210239 254560  book1 permutation + bsc 2.7.0```
Switched to enwik6 target, it should be more stable...

Update:

http://nishi.dreamhosters.com/u/BWT_reorder_v4a.rar
Code:
```book1  enwik6 enwik8
212172 253988 20787496  original + bsc 2.7.0
210700 251693 20736429  book1 permutation + bsc 2.7.0```
Update2:
http://nishi.dreamhosters.com/u/BWT_reorder_v4b.rar

Code:
```book1  enwik6 enwik8
212172 253988 20787496  original + bsc 2.7.0
210808 251403 20730606  book1 permutation + bsc 2.7.0```  Reply With Quote

7. enwik6 target improves over bwtopt with enwik8 too.

Code:
```28,852,844 enwik8r.bz2
22,178,404 enwik8r.bsc
24,981,295 enwik8r-m1.zpaq
23,250,554 enwik8r-m2.zpaq
21,194,231 enwik8r-m2-b100.zpaq```  Reply With Quote

8. ## Lehmer code & factorial number system Originally Posted by Matt Mahoney The transformation consists of a 256 byte header D, and then replace each remaining byte c with D[c].
You could reduce the size of these 256 bytes (2048 bits) to 210 bytes and a half (log2(256!) = 1684 bits) since this permutation could be coded using a Lehmer code in factorial number system, it would require some sort of "bignum" library to perform the computations on thousands of bits long integers.

i.e. with a 5 letters alphabet (0,1,2,3,4), the number of permutations of 5 letters is 5! = 120.
An easy way to associate a number to a permutation is the Lehmer code.

Let's say we want to code the following permutation (2,0,3,4,1) into a number between 0 and 119, the Lehmer code will turn the permutation into this (2,0,1,1,0)

Code:
```Matching (2,0,3,4,1) out of (0,1,2,3,4)

(0,1,2,3,4)
^
index of 2 is 2 -> Lehmer code (2,

2 is removed from the list
(0,1,3,4)
^
index of 0 is 0 -> Lehmer code (2,0

0 is removed from the list
(1,3,4)
^
index of 3 is 1 -> Lehmer code (2,0,1

3 is removed from the list
(1,4)
^
index of 4 is 1 -> Lehmer code (2,0,1,1

4 is removed from the list
(1)
^
index of 1 is 0 -> Lehmer code (2,0,1,1,0)
of course the last index is always zero thus it could be dropped.```
Now we use the factorial number system to record the Lehmer code this gives:

Code:
```(2,   0,   1,   1,   0)
|    |    |    |
2x4!+0x3!+1x2!+1x1! = 2x24 + 1x2 + 1x1 = 51

The reversal of the alphabet (4,3,2,1,0) has a Lehmer code equal to itself and produces the biggest number:
(4,   3,   2,   1,   0)
|    |    |    |
4x4!+3x3!+2x2!+1x1! = 4x24 + 3x6 + 2x2 + 1x1 = 119```
And now the reverse from 51 back to the permutation.

Code:
```51 divided by 1 gives 51 (it remains 0)
51 divided by 2 gives 25 (it remains 1)
25 divided by 3 gives 8 (it remains 1)
8 divided by 4 gives 2 (it remains 0)
2 divided by 5 gives 0 (it remains 2)
If you look at the remainders in reverse order... Tah dah! The Lehmer code (2,0,1,1,0) is back!```
Getting the permutation from the Lehmer code is pretty easy:
Code:
```(0,1,2,3,4)
^
at index 2 we have 2 -> Permutation (2,

2 is removed from the list
(0,1,3,4)
^
at index 0 we have 0 -> Permutation (2,0

0 is removed from the list
(1,3,4)
^
at index 1 we have 3 -> Permutation (2,0,3

3 is removed from the list
(1,4)
^
at index 1 we have 4 -> Permutation (2,0,3,4

4 is removed from the list
(1)
^
at index 0 we have 1 -> Permutation (2,0,3,4,1)```  Reply With Quote

9. Regular range coder should have a few bytes overhead, maybe less if we use bijective ones from biject.bwts.  Reply With Quote

10. Very clever. But you should also store the permutation table in a form that is easy to compress. Using a Lehmer+factorial code or range coder would result in 208 uncompressable random bytes. Lehmer code without the factorial step would avoid big number arithmetic and still compress because of a bias to smaller byte values.  Reply With Quote

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•