I've compared CPU- and GPU-based BWT and ST4 sorting algorithms. The results are performed on Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with beginning 10.485.768 bytes from ENWIK8.
1 CPU:
BWT divsufsort = 1669 ms (6282 KB/s)
BWT std::stable_sort = 7316 ms (1433 KB/s) - uses n*log(n) sorting
ST4 std::stable_sort = 4836 ms (2168 KB/s) - uses n*log(n) sorting
ST4 BSC = 219 ms (47880 KB/s)
ST4 GRZip = 188 ms (55775 KB/s)
GPU:
BWT thrust::stable_sort = 4960 ms (2114 KB/s) - uses n*log(n) sorting
ST4 thrust::stable_sort = 2278 ms (4603 KB/s) - uses n*log(n) sorting
ST4 thrust::stable_merge_sort_by_key = 343 ms (30570 KB/s) - uses n*log(n) sorting
ST4 thrust::stable_radix_sort_by_key = 94 ms (111550 KB/s)
ST8 thrust::stable_radix_sort_by_key = 171 ms (61320 KB/s)
ST4 sorting is 2 times faster on GPU (111550 KB/s) than CPU (55775 KB/s). GPU sorting time includes host<=>device memory copy times, which are bigger than actual sorting for thrust::stable_radix_sort_by_key.
So far BWT sorting is slower on GPU, because I didn't found an implementation of Suffix Array Construction Algorithm for GPU. There is one algorithm (http://web.iiit.ac.in/~abhishek_shuk...ber%202009.pdf) and I've asked the authors about the implementation. So far no response...