I've been sitting on this for a little while, and talked about it with David Scott, and I figure now is as good a time as any to share.
This computes the BWTS using a new approach. Unlike the original implementation and the one in openbwt, it suffix sorts the whole input at once and then corrects the resulting order to produce the BWTS. I'm not sure whether this is fundamentally a better approach than merging, or whether it's just simpler and benefits from all the optimization that has gone into divsufsort. Either way, it seems to compute the BWTS quite a bit faster, and it produces the same output as openbwt on all the data I've thrown at it.
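For anyone who wants an independent cross-check like the openbwt comparison, here is a minimal sketch of the textbook definition of the BWTS. It is NOT the suffix-sort-and-correct method described above, and it does not use divsufsort. It is a slow reference that works in three steps. First, it splits the input into Lyndon words with Duval's algorithm. Next, it sorts every rotation of every factor in omega-order, comparing the infinite repetitions u^ω and v^ω. Finally, it emits the last character of each rotation. The function names (`lyndon_factors`, `omega_cmp`, `bwts_naive`) are hypothetical and only used for illustration.

```python
from functools import cmp_to_key


def lyndon_factors(s: str) -> list[str]:
    """Duval's algorithm: split s into a nonincreasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            # Strictly larger char restarts the comparison; equal char extends the period.
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit as many copies of the current Lyndon word (length j - k) as fit.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def omega_cmp(u: str, v: str) -> int:
    """Compare the infinite repetitions u^w and v^w.

    By the Fine-Wilf theorem, the first len(u) + len(v) characters decide the order.
    """
    m = len(u) + len(v)
    a = (u * (m // len(u) + 1))[:m]
    b = (v * (m // len(v) + 1))[:m]
    return (a > b) - (a < b)


def bwts_naive(s: str) -> str:
    """Reference BWTS (hypothetical helper, slow): sort all rotations of all
    Lyndon factors in omega-order and emit the last character of each."""
    rotations = [w[i:] + w[:i] for w in lyndon_factors(s) for i in range(len(w))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


if __name__ == "__main__":
    print(bwts_naive("banana"))  # annbaa
```

Because Lyndon words are primitive, two rotations compare equal only when they are identical strings, so ties in the sort never change the output. This makes the sketch usable as a brute-force oracle on small random inputs, though it is far too slow for real data.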
Please let me know if you find a bug.