This might be old news but I just came across this. http://people.unipmn.it/manzini/bwtdisk/
The paper describes a clever method of using less memory to construct a BWT. The idea is to compute the suffix array over small segments of the input starting from the end and going backward, then merging the output into the BWT on disk in multiple passes. This is fast enough because all of the disk access is sequential. This uses a lot less external disk than BBB, which writes all the suffix arrays to disk as temporary files and then merges them by reading from all of them at once.
There is also a demo program which I added to LTCB. http://mattmahoney.net/dc/text.html#1901
It is source code only, and I only got it to compile in Linux. There is no block size parameter. It compresses the whole file in a single block. Decompression still requires 4x memory. The BWT is compressed using zlib, lzma, or RLE plus range coding. The compression ratio is not that great, but the code is designed to make it easy to add other compressors.