Results 1 to 8 of 8

Thread: Multithreaded merge BWT/ST

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts

    Multithreaded merge BWT/ST

    Is it possible to make BWT/ST fully multithreaded using divide-and-merge technique? We can divide input data into 2 or more subarrays, build BWT/ST indicies for subarrays and then merge indexes into the final index. We even don't need to build space-consuming final index, just put to outbuf bytes selected by merging procedure:

    bwt(char* buf, int len, char* outbuf)
    {
    build_bwt_index(buf, len/2, index0)
    build_bwt_index(buf+len/2, len-len/2, index1)
    // the merge procedure
    while(...) {
    if strcmp(*index0+1, *index1+1) > 0
    *outbuf++ = **index1
    else
    *outbuf++ = **index0
    }

    Random memory accesses to buf in the merge procedure may be prefetched since they are very predictable

    For ST[N] transform, index may contain N+1 chars (N bytes to compare and 1 byte to output), thus suppressing use of buf in the merge procedure, making memory access linear

    Merge can be made multithreaded - we just need to split index0+index1 into two parts of equal size and process them simultaneously. Merge even can be made somewhat virtual - data produced by merge may be immediately consumed by the next encoding stage.

    So if we have 10-core cpu, we can split input block into 10 subarrays, build BWT/ST index for each subarray, then split indexes into 10 parts of equal size and run 10 parallel procedures producing BWT/ST outputs and immediately encoding them



    Alternative approach is to split data by alphabet, f.e. index0 may include only strings starting with A-M and index1 - strings starting with N-Z. Its drawback is that every compression thread needs to scan entire input block. At least, merge operation is trivial
    Last edited by Bulat Ziganshin; 4th September 2012 at 11:52.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,775
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    1. Please don't try too hard and just use american plural forms. http://en.wiktionary.org/wiki/index#Noun

    2. Yes, its possible to split the sort, but doing it at the level of strcmp would be likely slower than
    divsufsort on single core.

    3. There's a method which is very compatible with massive MT - http://encode.su/threads/379-BWT-wit...sed-input-data

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
    Eugene, i will look at your code. seems that you have not implemented m/t?

    I think that we can extend 2-threaded ST procedure in BSC to any number of threads. BSC fills each bucket from start and end simultaneously. The trick is that we can do it using as many threads as we like. Just split input block into 10 parts and count freqs inside each one, thus getting bounds for 10 subbuckets inside each bucket. Then each thread should fill only its own subbucket inside each bucket.

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Is it possible to make BWT/ST fully multithreaded using divide-and-merge technique? We can divide input data into 2 or more subarrays, build BWT/ST indicies for subarrays and then merge indexes into the final index. We even don't need to build space-consuming final index, just put to outbuf bytes selected by merging procedure:

    bwt(char* buf, int len, char* outbuf)
    {
    build_bwt_index(buf, len/2, index0)
    build_bwt_index(buf+len/2, len-len/2, index1)
    // the merge procedure
    while(...) {
    if strcmp(*index0+1, *index1+1) > 0
    *outbuf++ = **index1
    else
    *outbuf++ = **index0
    }

    Of course you realize that the final output from the merge will not be the same output if the BWT was not split into parts?
    However depending on the file that might help as well as hurt the final compression.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,775
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    > Eugene, i will look at your code. seems that you have not implemented m/t?

    No, it was a proof-of-concept demo. The relevant loop is test.cpp:363 ,
    but the idea is simple anyway - compress the input data, split the interval
    range, scan for strings within a specific subinterval, sort them and output
    the corresponding bwt symbols (scan+sort can be done in parallel).

    Some kind of preprocessing (lzp? srep?) would be necessary to handle the
    most redundant cases, but otherwise it should be pretty efficient, afaik.

    The original idea was to apply this on a GPU though.

    > I think that we can extend 2-threaded ST procedure in BSC to any
    > number of threads.

    Well, ST doesn't count as with its limited contexts its very easy
    to split even without any special tricks.

    > Then each thread should fill only its own subbucket inside each bucket.

    Yes, but low-order ST is kinda limited... real BWT is more interesting.
    Also dunno how it is now, but when I looked, bsc also had openmp directives
    in various places, so it might use more threads than you think.

  6. #6
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @bulat:
    i think that bsc in mode st order 6 definitively uses many cores
    because we know it can use NVIDIA CUDA on GPU and then it can work very fast

    the only thing that is absent within bsc 3.1x is a definitive archiv-file-format
    like within for example *.zip or *.7z or *.zpaq

    so it is for example partly incompatible to its predecessors bsc 2.* and bsc 3.0x

    best regards

    Joerg

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,487
    Thanks
    26
    Thanked 129 Times in 99 Posts
    Prefix doubling can be easily parallelized, but the parallel version used about 14 * N memory in my tests, was much slower than DivSufSort (multithreaded version of my algo was comparable to single threaded DivSufSort) and I had problems with scaling. It was a subject of my master thesis and I haven't touched the problem after finishing my master thesis, so probably it has potential I haven't uncovered. If anyone is interested, despite the high memory requirements, then I can upload my master thesis and undocumented code.

    joerg:
    CPU and GPU versions of ST6 work differently in libbsc. ST6 in CUDA versions just uses stable radix sort and sorts 7 symbols, but ST6 in CPU version uses something like two times radix sorting, each sorting 3 symbols long n-grams.
    Last edited by Piotr Tarsa; 4th September 2012 at 19:32.

  8. Thanks:

    Bulat Ziganshin (24th November 2013)

  9. #8
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @Piotr Tarsa:
    Thank you very match for explaining the details within bsc:

    "CPU and GPU versions of ST6 work differently in libbsc.
    ST6 in CUDA versions just uses stable radix sort and sorts 7 symbols,
    but ST6 in CPU version uses something like two times radix sorting, each sorting 3 symbols long n-grams."

    best regards
    Joerg

Similar Threads

  1. Alphabet Reordering for BWT
    By Shelwien in forum Data Compression
    Replies: 3
    Last Post: 31st May 2009, 03:32
  2. BWT - how to use?
    By m^2 in forum Data Compression
    Replies: 29
    Last Post: 6th November 2008, 02:01
  3. BWT/ST compression
    By encode in forum Data Compression
    Replies: 60
    Last Post: 25th June 2008, 07:21
  4. 4x4 - multithreaded compressor
    By Bulat Ziganshin in forum Forum Archive
    Replies: 12
    Last Post: 19th April 2008, 17:25
  5. CCM(x) multithreaded ?
    By SvenBent in forum Forum Archive
    Replies: 2
    Last Post: 15th September 2007, 15:29

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •