1. ## Modern BWT speed

What is the current state of the art for BWT computation? I realize this is nearly the same as asking what the fastest string sorting routine is.

There are several good references online; BZIP2's page in fact discusses it, and Julian himself wrote a paper on BWT sorting speed. But that result is many years old now.

Two things I'm especially interested in: first, what's the current speed of BWT for, say, a standard corpus? Is it something on the order of 2 MB/sec or 200 MB/sec?
Note that this is just the BWT speed, I could run BZIP2 or whatever to measure an overall compression speed but that doesn't isolate the BWT contribution.

Second, has anyone worked on parallelizing BWT? Sure, it can be parallelized by giving each thread its own block, but I mean having multiple threads work on the same block.

Thanks for any pointers/hints! There's a lot of data out there but it's hard to see where the modern speed & efficiency is at.

2. There are timings, so you can derive the speed.
It's tested on an "Athlon XP 2200+ with 768 MB memory", though.
And for plain BWT performance it's probably best to
compile some implementations yourself
(like http://www.geocities.com/zxcb33/)
or try a compressor which allows disabling everything but the BWT;
for example, DC allows that.

Also, to me it looks like BWT is not that fashionable anymore,
since the postcoders required to get good compression out of it
can reach pretty good results by themselves.

Btw, here's my idea for a GPU-based (massively parallel) BWT
implementation:
1. The data is compressed with a good static range coder
(which preserves lexicographical order).
2. The string shifts are split into N roughly equal intervals
(this should be fairly precise, based on the entropy code).
3. Each thread then just has to sequentially scan the
compressed source data, fetch the strings that fall into
its interval, and sort them (1/128 of the BWT block or less).
Hope you can see how entropy-coded strings can be
compared without decoding.
4. The sorted intervals (they don't have to be merged,
as the first stage is a kind of radix sort) then have to be
stored in some order and encoded with some postcoder,
parallel or not, but that's out of scope here.

The point here is that normal BWT algorithms, while
very parallel by design, involve too much random memory access
and too little arithmetic.
GPUs are good at arithmetic and very bad at random
memory access (no cache), so a trivial CUDA port
of an existing BWT sort would be slower on an expensive
GPU than on a Q6600.

3. Benchmark for suffix array construction is at:
http://homepage3.nifty.com/wpage/benchmark/index.html

Suffix array construction (as you are probably well aware) is
the bulk of the time required to compute the BWT.

MSufSort has a BWT feature which skips the second stage of
the improved two-stage algorithm (which would complete the suffix
array) and instead jumps directly to computing the BWT.
DivSufSort borrows this (along with several other MSufSort features)
and is currently slightly faster still.

Both (and many others) are measured in the benchmark listed
above.

- Michael

4. Wow, two excellent responses with exactly the information I was looking for.
I really appreciate it!

Shel: you're right that BWT is a poor fit for GPUs. They don't have the memory cache to deal with much data at once, and BWT is entirely about pointers: dereferencing them and comparing them, which is the worst case for a GPU.

I'm thinking more about CPU multithreading, but that's tough too. I'm considering experimenting with a BWT compressor that allows random access, so you can compress/decompress data from the middle of the stream. (Nothing too fancy or innovative, since with blockwise compression it isn't hard to seek to or update individual blocks.)
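The blockwise random-access idea can be sketched in a few lines. This is a hypothetical toy, with plain zlib standing in for a BWT-based block compressor; the point is only that independent blocks plus an offset table give seekable compressed data:

```python
import zlib

BLOCK = 4  # tiny block size, just for the demo

def compress_blocks(data: bytes):
    """Compress each block independently and record its start offset."""
    blocks = [zlib.compress(data[i:i + BLOCK])
              for i in range(0, len(data), BLOCK)]
    offsets, pos = [], 0
    for b in blocks:
        offsets.append(pos)
        pos += len(b)
    return b"".join(blocks), offsets

def read_block(stream: bytes, offsets, idx: int) -> bytes:
    """Decompress one block from the middle of the stream, touching nothing else."""
    start = offsets[idx]
    end = offsets[idx + 1] if idx + 1 < len(offsets) else len(stream)
    return zlib.decompress(stream[start:end])

stream, offsets = compress_blocks(b"hello world, bwt")
print(read_block(stream, offsets, 1))  # -> b'o wo'
```

Updating a block in place works the same way, at the cost of rewriting the offset table when a recompressed block changes size.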

5. Also have a look at : http://www.nanozip.net/research.html