Let's discuss various ways to implement the subject: BWT output compression on the GPU.

In the '90s it was well known that BWT output should be compressed by low-order entropy coders. As Deorowicz said, "it can be approximated with high precision as an output of a piecewise stationary memoryless source". Nowadays we have powerful computation engines with very limited local memory: GPUs. It seems that BWT output compression and GPUs are a perfect match.

The BWT implementation in NVBIO is claimed to reach 70-80 MB/s on Kepler-class Tesla GPUs, so it may be a little over 100 MB/s on the best modern GPUs - and that territory is already occupied by the BSC compressor. Let's look into free territory instead, i.e. speeds of 200+ MB/s and the Sorting Transform (ST). On a 980 Ti, ST3 runs at 3 GB/s and ST8 at 500 MB/s. They just need to be paired with fast enough codecs, i.e. ones flying at 0.5-10 GB/s.

We can look in 3 directions:

1) the simplest algorithm

2) the fastest one

3) the best-compressing one

Let's start with the simplest one: ST+RLE+MTF+RangeCoder. ST and RLE are already implemented by CUB+BSC. Even a naive MTF implementation should process dozens of GB/s, and the same holds for the range coder. Of course, the data should be split into independent blocks of 100 KB or so in order to allow parallel processing and the use of per-block stats for encoding.
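To make the per-symbol work concrete, here is a minimal scalar MTF sketch (plain C++, not GPU code; a real GPU version would need some parallel formulation, e.g. one recency list per block in shared memory):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Move-to-front: each input byte is replaced by its current position in a
// recency list, and the byte is then moved to the front of that list.
std::vector<uint8_t> mtf_encode(const std::vector<uint8_t>& in) {
    uint8_t table[256];
    std::iota(table, table + 256, 0);       // table[i] = i initially
    std::vector<uint8_t> out;
    out.reserve(in.size());
    for (uint8_t c : in) {
        uint8_t rank = 0;
        while (table[rank] != c) ++rank;    // current position of c
        out.push_back(rank);
        for (uint8_t r = rank; r > 0; --r)  // shift down, move c to front
            table[r] = table[r - 1];
        table[0] = c;
    }
    return out;
}
```

Runs of identical bytes, which BWT/ST produce in abundance, come out as runs of zeros, which is exactly what the following RLE/RC stages feed on.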

I'm not sure what exactly should be coded by the RC. Is it enough to split the numbers produced by MTF and RLE into 16-32 bins and encode the bin number plus the direct output of any remaining bits? Something like:

{0}, {1}, {2}, {3}, {4,5}, {6,7}, {8,9,10,11}...
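A sketch of one possible bin layout matching that pattern (my own guess at the exact scheme, Deflate-style: two bins per power of two above 3, so all byte values 0..255 fit into 16 bins). The names `bin_encode`/`bin_decode` are mine:

```cpp
#include <cstdint>

// Bins: {0},{1},{2},{3},{4,5},{6,7},{8..11},{12..15},{16..23},...
// The bin number is range-coded; the low `nbits` bits go out verbatim.
struct BinCode { int bin; int nbits; uint32_t bits; };

BinCode bin_encode(uint32_t n) {
    if (n < 4) return {int(n), 0, 0};
    int b = 31 - __builtin_clz(n);        // floor(log2(n)); GCC/Clang builtin
    int nbits = b - 1;                    // count of verbatim low bits
    int bin = 2 * b + int((n >> nbits) & 1);
    return {bin, nbits, n & ((1u << nbits) - 1)};
}

uint32_t bin_decode(int bin, uint32_t bits) {
    if (bin < 4) return uint32_t(bin);
    int b = bin / 2;
    int nbits = b - 1;
    uint32_t top = 2u | uint32_t(bin & 1);  // implicit leading "1x" bit pair
    return (top << nbits) | bits;
}
```

With 16 bins the whole frequency table is tiny, which matters when the stationary model has to live in ~30 KB of shared memory per block.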

BWT is a huge area and I never looked into it, so I'm asking the experts about better algorithms. Even if you have never looked into GPUs: basically it's just a few dozen cores, each running a few hundred threads that share 30 KB of fast on-chip memory (per core), plus a few dozen registers per thread. Plus a memory interface able to serve 10^9 256-byte blocks each second.

Now, the possible improvements to the simplest codec:

1) i wonder why Deorowicz says that we don't know the same-context boundaries. At least when Radix Sort is used to produce an ST, the radix sort output contains the original contexts, so we can split the ST-sorted chars into blocks by those contexts (of course with minimum & maximum limits on block size)

2) there are a lot of preprocessing techniques we can run at GPU memory speed: E8/E9, Szymon Grabowski's text tricks, alphabet reordering, data reversing
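For point 1, the block-splitting idea can be sketched like this: given the sorted context keys that fall out of the radix sort, cut a new coding block wherever the context changes, while enforcing the size limits (scalar C++; the function name and signature are mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Split ST-sorted positions into coding blocks at context boundaries.
// ctx[i] is the context key of the i-th output symbol (already sorted).
// A block is cut at a context change once it holds >= min_block symbols,
// or unconditionally once it reaches max_block. Returns block start offsets.
std::vector<size_t> split_by_context(const std::vector<uint32_t>& ctx,
                                     size_t min_block, size_t max_block) {
    std::vector<size_t> starts{0};
    size_t last = 0;
    for (size_t i = 1; i < ctx.size(); ++i) {
        bool boundary = (ctx[i] != ctx[i - 1]);
        if ((boundary && i - last >= min_block) || i - last >= max_block) {
            starts.push_back(i);
            last = i;
        }
    }
    return starts;
}
```

Each resulting block then gets its own model, so symbols from (mostly) one context never dilute another context's stats.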

So, we are left with all those rle/mtf/if/dc/wfc/qlfc variations. Almost anything will be fine as long as it can be squeezed into a 30 KB stationary model. So, what are your ideas?
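For reference on point 2, the E8/E9 trick is simple enough to sketch: x86 CALL (0xE8) and JMP (0xE9) opcodes are followed by 32-bit relative targets, and rewriting them as absolute addresses makes repeated calls to the same function produce repeated byte strings. A scalar sketch (little-endian host assumed; real filters add extra checks to reduce false positives on non-code bytes):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Forward E8/E9 transform: relative call/jmp offsets -> absolute addresses.
void e8e9_forward(std::vector<uint8_t>& buf) {
    for (size_t i = 0; i + 5 <= buf.size(); ) {
        if (buf[i] == 0xE8 || buf[i] == 0xE9) {
            int32_t rel;
            std::memcpy(&rel, &buf[i + 1], 4);   // little-endian read
            int32_t abs = rel + int32_t(i + 5);  // relative to next instruction
            std::memcpy(&buf[i + 1], &abs, 4);
            i += 5;                              // skip the rewritten offset
        } else {
            ++i;
        }
    }
}

// Inverse transform: absolute addresses -> relative offsets. Both passes
// skip the 4 payload bytes after an opcode, so they stay in sync.
void e8e9_inverse(std::vector<uint8_t>& buf) {
    for (size_t i = 0; i + 5 <= buf.size(); ) {
        if (buf[i] == 0xE8 || buf[i] == 0xE9) {
            int32_t abs;
            std::memcpy(&abs, &buf[i + 1], 4);
            int32_t rel = abs - int32_t(i + 5);
            std::memcpy(&buf[i + 1], &rel, 4);
            i += 5;
        } else {
            ++i;
        }
    }
}
```

Since it's a fixed-stride byte scan with no data dependencies between distant positions, it should indeed run at close to GPU memory speed.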

PS: If you want to learn more about BWT output compression:

http://www.data-compression.info/Algorithms/BWT/

http://www.compression.ru/download/bwt.html#mod

http://homepage3.nifty.com/wpage/lin...transform.html