
Thread: Modern BWT speed

  1. #1
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Modern BWT speed

    What is the current state of the art for BWT computation? I realize this is nearly the same as asking what the fastest string sorting routine is.

There are several good references online; BZIP2's page in fact talks about it, and Julian himself wrote a paper on BWT sorting speed. But that result is many years old now.

Two things I'm especially interested in. First, what's the current speed of BWT for, say, a standard corpus? Is it something on the order of 2 MB/s or 200 MB/s?
Note that this is just the BWT speed; I could run BZIP2 or whatever to measure overall compression speed, but that doesn't isolate the BWT contribution.
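To be concrete, by "just the BWT speed" I mean timing something like the following naive sketch in isolation. (This rotation sort is far slower than real implementations, so it only illustrates the measurement, not representative throughput.)

```python
import time

def bwt_naive(data: bytes) -> bytes:
    """Compute the BWT via a plain sort of all rotations (O(n^2 log n))."""
    n = len(data)
    doubled = data + data  # rotation i is doubled[i:i+n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    # Last column of the sorted rotation matrix.
    return bytes(data[i - 1] for i in order)

block = (b"the quick brown fox jumps over the lazy dog ") * 200
start = time.perf_counter()
out = bwt_naive(block)
elapsed = time.perf_counter() - start
print(f"{len(block) / elapsed / 1e6:.3f} MB/s (naive)")
```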

    Second, has anyone worked on parallelizing BWT? Sure, it can be parallelized by giving each thread its own block, but I mean having multiple threads work on the same block.

    Thanks for any pointers/hints! There's a lot of data out there but it's hard to see where the modern speed & efficiency is at.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    How about http://compressionratings.com/bwt.html ?
There are timings, so you can derive the speed.
It's tested on an "Athlon XP 2200+ with 768 MB memory", though.
And for plain BWT performance it's probably best to
compile some implementations yourself
(like http://www.geocities.com/zxcb33/)
or to use a compressor which allows disabling everything but the BWT;
for example, DC allows that.

Also, to me it looks like BWT is not that fashionable anymore,
since the postcoders required to get good compression out of it
can reach pretty good results by themselves.

    Btw, here's my idea on GPU-based (massively parallel) BWT
    implementation:
1. The data is compressed with a good static rangecoder
(which preserves lexicographic order).
2. The string shifts are split into N roughly equal intervals
(the split can be made fairly precise, based on the entropy code).
3. Each thread then just has to sequentially scan the
compressed source data, fetch the strings that fall into
its interval, and sort them (1/128 of the BWT block or less).
Hopefully you can see how entropy-coded strings can be
compared without decoding.
4. The sorted intervals (they don't have to be merged,
as the first stage is a kind of radix sort) then have to be
stored in some order and encoded with some postcoder,
parallel or not, but that's out of scope here.
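A toy illustration of the order-preserving property step 1 relies on. The model and strings below are made up, and exact fractions stand in for real range-coder output, but the point carries over: with cumulative frequencies assigned in symbol order, each string's code interval starts at a value that sorts the same way as the string itself, so encoded strings can be compared without decoding.

```python
from fractions import Fraction

# Assumed static model: symbol -> (cumulative_count, frequency), total = 8.
# Cumulative counts are assigned in alphabet order, which is what
# makes the code order-preserving.
FREQ = {"a": (0, 4), "b": (4, 3), "c": (7, 1)}
TOTAL = 8

def interval(s: str) -> tuple[Fraction, Fraction]:
    """Arithmetic-coding interval of s under the static model above."""
    low, width = Fraction(0), Fraction(1)
    for ch in s:
        cum, f = FREQ[ch]
        low += width * Fraction(cum, TOTAL)
        width *= Fraction(f, TOTAL)
    return low, low + width

# Prefix-free test set: if one string were a prefix of another, a
# tie-break on interval width would be needed.
strings = ["abc", "ba", "aab", "cab", "bb"]
by_code = sorted(strings, key=lambda s: interval(s)[0])
print(by_code)          # matches plain lexicographic order
print(sorted(strings))
```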

The point here is that normal BWT algorithms, while
very parallel by design, involve too much random memory access
and too little arithmetic.
GPUs are good at arithmetic and very bad at random
memory access (little to no cache), so a trivial CUDA compile
of an existing BWT sort would be slower on an expensive
GPU than on a Q6600.
    Last edited by Shelwien; 4th May 2009 at 19:08.

  3. #3
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    113
    Thanks
    11
    Thanked 88 Times in 26 Posts
    Benchmark for suffix array construction is at:
    http://homepage3.nifty.com/wpage/benchmark/index.html

Suffix array construction (as you are probably well aware) is
the bulk of the time required to compute the BWT.

MSufSort has a BWT feature which skips the second stage of
the improved two-stage algorithm (completing the suffix array) and
instead jumps directly to computing the BWT.
DivSufSort borrows this (and several other MSufSort features)
and is currently slightly faster still.

    Both (and many others) are measured in the benchmark listed
    above.
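To make the split concrete: once the suffix array exists, the BWT itself is a single cheap pass over it. In this rough Python sketch the sorted() call stands in for a real suffix-array constructor like DivSufSort, and the sentinel byte is an assumption of the sketch, not of any particular library:

```python
def bwt_from_sa(text: bytes) -> bytes:
    # Append a sentinel smaller than any byte so suffix order is
    # well defined (a common internal convention in BWT tools).
    t = text + b"\x00"
    # Naive O(n^2 log n) suffix array; this is the expensive stage
    # that MSufSort/DivSufSort replace with fast algorithms.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # The BWT is then one linear pass: BWT[j] is the character
    # preceding suffix sa[j], wrapping around at position 0.
    return bytes(t[i - 1] for i in sa)

print(bwt_from_sa(b"banana"))  # b'annb\x00aa'
```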

    - Michael

  4. #4
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Wow, two excellent responses with exactly the information I was looking for.
    I really appreciate it!

Shel: you're right that BWT is a poor fit for GPUs. They don't have the memory cache to deal with much data at once, and BWT is entirely about pointers, dereferencing and comparing them, which is the worst case for a GPU.

I'm more thinking about CPU multithreading, but that's tough too. I'm thinking of experimenting with a BWT compressor that allows random access, so you can compress/decompress data from the middle of the stream. (Nothing too fancy or innovative, since with blockwise compression it isn't hard to seek to or update individual blocks.)
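As a rough sketch of what I mean by random access (zlib standing in for a BWT-based block coder; all names and the block size are made up):

```python
import zlib

BLOCK = 4096  # fixed uncompressed block size

def pack(data: bytes) -> list[bytes]:
    """Compress each block independently; a real container would
    also store per-block offsets/lengths on disk for seeking."""
    return [zlib.compress(data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)]

def read_block(blocks: list[bytes], n: int) -> bytes:
    """Random access: only block n is decompressed."""
    return zlib.decompress(blocks[n])

data = bytes(range(256)) * 64  # 16 KiB of sample data
blocks = pack(data)
assert read_block(blocks, 2) == data[2 * BLOCK:3 * BLOCK]
```

Updating a single block works the same way: recompress just that block and patch the index, leaving the rest of the stream untouched.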

  5. #5
    Member
    Join Date
    May 2008
    Location
    Antwerp , country:Belgium , W.Europe
    Posts
    487
    Thanks
    1
    Thanked 3 Times in 3 Posts
Also have a look at: http://www.nanozip.net/research.html
