Page 2 of 2 FirstFirst 12
Results 31 to 45 of 45

Thread: Optimizing BWT decoding

  1. #31
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    I've done similar measurements on my i7-4770:
    Code:
    C:\Jampack-master>Jampack_x64.exe d M:\c8 nul -t1
    entropy: 1.78 seconds
    MAP[count[BWT[i]]++]: 0.20 seconds
    unbwt: 3.84 seconds
    lz: 0.03 seconds
    Read: 22 MB => 95 MB (23.18%) @ 16.68 MB/s
    Completed in 5.71 seconds
    
    C:\Jampack-master>Jampack_x64.exe d M:\c8 nul
    entropy: 0.49 seconds
    MAP[count[BWT[i]]++]: 0.20 seconds
    unbwt: 1.32 seconds
    lz: 0.03 seconds
    Read: 22 MB => 95 MB (23.18%) @ 50.08 MB/s
    Completed in 1.91 seconds
    
    
    
    C:\Jampack-master>Jampack_x64.exe d M:\c9 nul -t1
    entropy: 14.99 seconds
    MAP[count[BWT[i]]++]: 1.84 seconds
    unbwt: 64.13 seconds
    lz: 0.33 seconds
    Read: 177 MB => 953 MB (18.62%) @ 11.93 MB/s
    Completed in 79.99 seconds
    
    C:\Jampack-master>Jampack_x64.exe d M:\c9 nul
    entropy: 3.77 seconds
    MAP[count[BWT[i]]++]: 1.85 seconds
    unbwt: 20.36 seconds
    lz: 0.33 seconds
    Read: 177 MB => 953 MB (18.62%) @ 38.30 MB/s
    Completed in 24.99 seconds
    As you can see, unbwt on enwik9 is ~1.5x slower than on enwik8, and I think we can close this gap by using large pages. I attached the corresponding part of the freearc sources (Shelwien may have a smaller similar library). You should allocate large memory blocks with BigAlloc and return them with BigFree. In fact, you can just use these functions for all memory allocations in this code. Just don't mix BigAlloc+free, or malloc+BigFree.
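    For illustration, here is a minimal sketch of what a BigAlloc/BigFree pair could look like on Linux (freearc's real version targets Windows VirtualAlloc with MEM_LARGE_PAGES; the huge-page flag, size rounding, and malloc fallback below are my assumptions, not freearc's actual code):

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for freearc's BigAlloc/BigFree on Linux: try to
// back the allocation with 2 MB huge pages via MAP_HUGETLB, and fall back
// to a plain malloc when no huge pages are configured. (On Windows the
// equivalent is VirtualAlloc with MEM_LARGE_PAGES, which also requires
// the SeLockMemoryPrivilege.)
struct BigBlock { void* ptr; size_t size; bool huge; };

static const size_t kLargePage = 2u << 20;           // 2 MB

static BigBlock BigAlloc(size_t size)
{
    size_t rounded = (size + kLargePage - 1) & ~(kLargePage - 1);
    void* p = mmap(nullptr, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return { p, rounded, true };
    return { std::malloc(size), size, false };       // fallback: regular pages
}

static void BigFree(BigBlock b)                      // never mix with free()!
{
    if (b.huge) munmap(b.ptr, b.size);
    else        std::free(b.ptr);
}
```

    As the thread says, the two allocators must never be mixed; the `huge` flag in the struct is there so the free side knows which path was taken.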



    Another point is that the loop I mentioned occupies 15% of execution time:

    MAP[count[BWT[i]]++]: 0.20 seconds
    unbwt: 1.32 seconds

    Well, radix sort optimization will probably not help here since the data has low entropy. But using multiple threads probably will. I think it can be run in 2 threads for free: one thread counts forward and another one backward. And to use 2N more threads, one needs to store N*256 frequencies in the archive.
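    The 2-thread trick above can be sketched in plain C++ with std::thread. Since the final slot of the j-th occurrence of byte c is base[c] + j, one thread can fill the slots for the first half of the block counting forward while a second fills the second half counting backward from the per-symbol totals (function name and layout are illustrative, not Jampack's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Build MAP[count[BWT[i]]++] = i with two threads: one walks the block
// forward from index 0, the other walks backward from the end, filling
// each symbol's slot range from opposite sides. The two ranges never
// overlap because earlier occurrences always get lower slots.
static std::vector<uint32_t> buildMapMT(const std::vector<uint8_t>& bwt)
{
    const size_t n = bwt.size(), mid = n / 2;
    uint32_t freq[256] = {0}, base[256];
    for (uint8_t c : bwt) freq[c]++;
    uint32_t sum = 0;
    for (int c = 0; c < 256; c++) { base[c] = sum; sum += freq[c]; }

    std::vector<uint32_t> map(n);
    std::thread fwd([&] {                        // occurrences 0, 1, 2, ...
        uint32_t cnt[256];
        std::copy(base, base + 256, cnt);
        for (size_t i = 0; i < mid; i++) map[cnt[bwt[i]]++] = (uint32_t)i;
    });
    std::thread bwd([&] {                        // occurrences ..., last
        uint32_t cnt[256];
        for (int c = 0; c < 256; c++) cnt[c] = base[c] + freq[c];
        for (size_t i = n; i-- > mid; ) map[--cnt[bwt[i]]] = (uint32_t)i;
    });
    fwd.join(); bwd.join();
    return map;
}
```

    The 2N-thread generalization is the same idea with the block cut into N pieces, which is where the stored N*256 frequencies come in: each thread needs to know the slot range its piece owns.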

    BTW, frequency counting runs at 2 GB/s, which is about 5% of overall unbwt time after all these optimizations. So storing the frequencies in the archive would avoid this delay too. Or the computation can be made 4x faster by using m/t.



    Combining all these improvements together (large pages + stored frequencies + m/t MAP assignment + combining the decoder with unbwt), the decompression speed may be improved ~2-2.5x, i.e. up to about 100 MB/s, which is similar to lzma s/t decompression speed.

    Aside from that, your model seems almost as slow as bsc -e1, but much less efficient (165M vs 186M on enwik9). So you could just use the lzp and model from bsc, and work on the novel things: parallel unbwt combined with modeling, as well as GPU unbwt.



    I suggest adding "-static" to your compilation scripts: "g++ ... -static"; otherwise the executables require libraries that are usually absent on users' machines.

    Also, I wonder why single-thread unbwt is so slow. Do you still follow multiple BWT chains in this case, as was done in the initial version?
    Attached Files
    Last edited by Bulat Ziganshin; 6th April 2017 at 01:22.

  2. #32
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    I'm still working on the model and all the components around it. The WFC currently isn't performing at WFC levels, since it's actually adjusting the ranks based on rank frequency instead of symbol frequency. The reason I use that is that with WFC you can use a logistic function to perform bytewise frequency updates without really needing to care about cumulative counts remaining constant, though I just used the fpaqc-style frequency update scheme out of curiosity and a bit of laziness. I paired an MTF-like system with it for rare occurrences, so only the 8 most frequent symbols use WFC (rather than bothering to update all 256 symbol frequencies). As for SRC, I have no idea how it works because it isn't really described anywhere other than in OpenBWT. I did try to use SRC at one point but it kept crashing on me; not sure what happened there, but I'll give it another go in a bit.

    The current state of the model is just static block frequencies with order-1 inheritance, but I disabled the order-1 model (when order-1-states in model.h is 1 it is equivalent to order-0) because it actually hurts compression when RLE0 is used. But like you mentioned, it is a strange model; I only made it to have extremely low header requirements while still performing like an order-1 model.

    The LZ stage is just to help out in special cases where there are big duplicate chunks. I do plan to have the parser be guided by the entropy of the LZ tokens eventually, because in some cases data is better represented as offsets; I noticed this odd property in the SMILE test suite a while ago.

    Not sure what you mean by the unbwt being slow. In your enwik8 and enwik9 tests I can see there is a speed difference, but enwik9 is harder to cache simply because it's larger. If you use -b100 it should run at the same rate, since it'd be using the same block size in both tests.

    Also yes, it still follows the initial version: it decodes 4 symbols per iteration per thread.
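    For readers following along, the single-chain decode under discussion boils down to chasing the MAP permutation from the primary index. A toy, self-contained sketch (the naive rotation-sort forward BWT is only there to make the round trip testable; real encoders use a suffix array):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Toy forward BWT by sorting all rotations; returns the last column and
// the primary index (row containing the original string).
static std::pair<std::string, size_t> bwtNaive(const std::string& s)
{
    const size_t n = s.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t k = 0; k < n; k++) {
            char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    std::string out(n, 0);
    size_t primary = 0;
    for (size_t i = 0; i < n; i++) {
        if (rot[i] == 0) primary = i;
        out[i] = s[(rot[i] + n - 1) % n];
    }
    return { out, primary };
}

// Single-chain inverse BWT: build MAP[count[bwt[i]]++] = i, then follow
// the chain from the primary index, emitting one byte per step.
static std::string unbwt(const std::string& bwt, size_t primary)
{
    const size_t n = bwt.size();
    uint32_t cnt[256] = {0};
    for (unsigned char c : bwt) cnt[c]++;
    uint32_t sum = 0;
    for (int c = 0; c < 256; c++) { uint32_t f = cnt[c]; cnt[c] = sum; sum += f; }
    std::vector<uint32_t> map(n);
    for (size_t i = 0; i < n; i++)
        map[cnt[(unsigned char)bwt[i]]++] = (uint32_t)i;
    std::string out(n, 0);
    uint32_t p = map[primary];
    for (size_t i = 0; i < n; i++) { out[i] = bwt[p]; p = map[p]; }
    return out;
}
```

    Every step of the chain-following loop is a dependent random access into map[], which is why cache behavior (and hence block size and page size) dominates the speed discussion above.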
    Last edited by Lucas; 6th April 2017 at 08:52.

  3. #33
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    I just made some interesting observations:
    Code:
    C:\>bsc
    Platform specific options:
      -G       Enable Sort Transform acceleration on NVIDIA GPU, default: disable
      -P       Enable large 2MB RAM pages, default: disable
      -t       Disable parallel blocks processing, default: enable
      -T       Disable multi-core systems support, default: enable
    
    C:\>timer bsc d m:\b9 nul -P
    Kernel Time  =     3.244 = 00:00:03.244 =  20%
    User Time    =    74.521 = 00:01:14.521 = 479%
    Process Time =    77.766 = 00:01:17.766 = 499%
    Global Time  =    15.554 = 00:00:15.554 = 100%
    
    C:\>timer bsc d m:\b9 nul
    Kernel Time  =     1.544 = 00:00:01.544 =   9%
    User Time    =    97.500 = 00:01:37.500 = 576%
    Process Time =    99.045 = 00:01:39.045 = 585%
    Global Time  =    16.926 = 00:00:16.926 = 100%
    
    C:\>timer bsc d m:\b9 nul -PT
    Kernel Time  =     0.592 = 00:00:00.592 =   1%
    User Time    =    55.832 = 00:00:55.832 =  98%
    Process Time =    56.425 = 00:00:56.425 =  99%
    Global Time  =    56.519 = 00:00:56.519 = 100%
    
    C:\>timer bsc d m:\b9 nul -T
    Kernel Time  =     1.232 = 00:00:01.232 =   1%
    User Time    =    72.400 = 00:01:12.400 =  98%
    Process Time =    73.632 = 00:01:13.632 =  99%
    Global Time  =    73.648 = 00:01:13.648 = 100%
    It seems that bsc already has parallel unbwt. Large pages make s/t unbwt ~1.5x faster (18 seconds of the overall time is -e1 model execution), but they don't change much in m/t mode, maybe because there aren't enough threads, since CPU time still increases by 23 seconds. Unfortunately, there is no way to increase the number of threads to check that assumption.


    EDIT: from BSC history:

    Changes in 2.2.0 (June 15, 2010)
    - Added parallel version of reverse BWT transform

    So sorry, you reinvented the wheel. Look up "num_indexes" in the bsc sources.
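    The num_indexes idea can be sketched roughly like this (hypothetical names; in bsc the per-segment chain pointers are stored by the encoder, whereas this sketch recovers them with a preliminary sequential walk just to stay self-contained):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// If the encoder stores the chain pointer reached at every (n/K)-th
// output byte, the decoder can follow K chains in parallel, each thread
// emitting its own contiguous n/K slice of the output.
static std::string unbwtParallel(const std::string& bwt, size_t primary, int K)
{
    const size_t n = bwt.size();
    uint32_t cnt[256] = {0};
    for (unsigned char c : bwt) cnt[c]++;
    uint32_t sum = 0;
    for (int c = 0; c < 256; c++) { uint32_t f = cnt[c]; cnt[c] = sum; sum += f; }
    std::vector<uint32_t> map(n);
    for (size_t i = 0; i < n; i++)
        map[cnt[(unsigned char)bwt[i]]++] = (uint32_t)i;

    // Recover the K chain-start pointers (the part a real codec stores).
    const size_t seg = (n + K - 1) / K;
    std::vector<uint32_t> start;
    uint32_t p = map[primary];
    for (size_t i = 0; i < n; i++) {
        if (i % seg == 0) start.push_back(p);
        p = map[p];
    }

    std::string out(n, 0);
    std::vector<std::thread> pool;
    for (size_t s = 0; s < start.size(); s++)
        pool.emplace_back([&, s] {           // each thread decodes one slice
            uint32_t q = start[s];
            size_t lo = s * seg, hi = std::min(n, lo + seg);
            for (size_t i = lo; i < hi; i++) { out[i] = bwt[q]; q = map[q]; }
        });
    for (auto& t : pool) t.join();
    return out;
}
```

    The cost of storing the extra indexes is K-1 integers per block, which is negligible next to the ~K-fold parallelism it unlocks.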
    Last edited by Bulat Ziganshin; 6th April 2017 at 14:33.

  4. #34
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    Oh well, at least bsc doesn't have GPU decoding support. Also, Bulat, would you mind helping me configure OpenCL? I'm having issues linking it with g++.

  5. #35
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    I never used OpenCL. Did you install the SDK from AMD/NVIDIA? Google "compiling opencl gcc windows" or so.

  6. #36
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    I installed the SDK from Nvidia, added the SDK directory to the environment path variables, and fed the SDK path directly into g++ with the -L flag. I tried using Visual Studio and its linker but I couldn't get it working. I'm almost tempted to install Linux and use mingw to compile for Windows.

  7. #37
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Try CUDA first, just for a start. It requires MSVC, though.

  8. #38
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    Okay, I added CUDA support. I noticed some pretty odd things: g++ produces much faster executables than nvcc and cl; the g++ compile generally decodes 10 MB/s faster than the cl compile. I added a -g flag to enable CUDA acceleration, but from my tests it isn't running anywhere near your estimated 400 MB/s (840 threads * 0.5 MB/s); it's more like a 5-10% speed gain over CPU decoding. But then again, I've only been writing CUDA for about a week, so I might be overlooking some optimizations.
    Last edited by Lucas; 15th April 2017 at 01:07.

  9. #39
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Yes, you need to learn A LOT.

    The main problem is that your code runs on a single SM due to https://github.com/loxxous/Jampack/b...3/bwt.cpp#L237

    You create only one block (with 840 threads), and all threads in a single block run on the same SM (since they may share access to shared memory, which is SM-local).

    Replace this with simply N_Units = Threads = BWT_UNITS; and use BWT_UNITS of ~100K to check the real limits of GPU efficiency. Of course, divide those BWT_UNITS between SMs fairly. You have ~20 SMs plus a tail effect, so dimGrid should be at least 100 and preferably 1000. OTOH, the more dimBlock threads you have, the better; the maximum is 1024.
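    The launch-geometry arithmetic above can be sketched on the host side like this (hypothetical helper, not Jampack's code; the defaults just encode the "at least 100 blocks, up to 1024 threads each" advice):

```cpp
#include <cassert>

// Pick a CUDA launch geometry that spreads BWT_UNITS chains over many
// blocks so all SMs stay busy (instead of one 840-thread block pinned to
// a single SM). Whatever units don't fit one-per-thread are handled by
// having each thread walk several chains.
struct LaunchGeom { unsigned dimGrid; unsigned dimBlock; unsigned unitsPerThread; };

static LaunchGeom planUnbwtLaunch(unsigned bwtUnits,
                                  unsigned dimGrid  = 100,   // >= ~20 SMs * a few blocks
                                  unsigned dimBlock = 1024)  // hardware max threads/block
{
    unsigned threads = dimGrid * dimBlock;
    unsigned upt = (bwtUnits + threads - 1) / threads;       // chains per thread
    return { dimGrid, dimBlock, upt };
}
```

    With ~100K units every thread gets one chain; pushing units per thread toward the ~32 suggested later in the thread requires a correspondingly larger BWT_UNITS.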

    Since each thread accesses memory, even 16 threads/core (the Maxwell maximum) may not be enough to hide delays. You may need to process ~32 BWT units per thread, but I suggest using a constant N (e.g. as a CUDAInverse template parameter) to specify its size.

    What's your GPU?

    Anyway, good job!!! Thank you for sharing all your code and ideas.
    Last edited by Bulat Ziganshin; 15th April 2017 at 02:17.

  10. #40
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    Oh, I didn't realize it would still need to process a bunch of symbols within each thread on the GPU; I just assumed there were enough threads already. I'll work on adding OoOE for CUDA.

    Quote Originally Posted by Bulat Ziganshin View Post
    what's your gpu?
    I have a GTX 970.

  11. #41
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    BTW, no GPU I know of has OoOE. Instead, they stall at a memory read operation only when that register value is used by a subsequent operation. This serves as a kind of OoOE limited to memory operations: if you load 16 values into 16 registers and then start to use those registers, all 16 load operations will be performed simultaneously. You don't need to perform this optimization manually; something like "for (i=0..15) {use a[i]}" should be optimized into efficient code. It's an archetypal pattern for CUDA programming.
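    A CPU-side sketch of that pattern, with the batch size as a compile-time constant as suggested earlier in the thread (names are illustrative, not Jampack's):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Analogue of the register-batched loads described above: a compile-time
// constant N (like a CUDAInverse<N> template parameter) lets the compiler
// keep N chain pointers in registers and issue their loads back-to-back,
// which on a GPU is what overlaps the memory latency of all N chains.
template <int N>
static void decodeStep(const std::vector<uint32_t>& map,
                       const std::string& bwt,
                       uint32_t p[N], char out[N])
{
    for (int j = 0; j < N; j++) out[j] = bwt[p[j]];  // N independent loads
    for (int j = 0; j < N; j++) p[j] = map[p[j]];    // advance all N chains
}
```

    The two loops are fully unrollable because N is a constant, so no load depends on the result of another load in the same batch.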

    The GTX 970 is an SM 5.2 GPU, so it has a 256 KB register file per SM, i.e. 512 registers/ALU. Since you can't run more than 16 threads/ALU, one thread has at least 32 registers available, so for best performance you need to use at least 20-30 BWT units/thread. Higher values will be even more efficient, since you have a fixed overhead of ~10 registers/thread used for bookkeeping purposes.

    But first things first: you don't need OoOE at all. Instead, increase BWT_UNITS and make sure you run at least 100 blocks to fill the entire GPU.

    Moreover, 1664 ALUs * 16 threads provides enough parallelism, so implementing OoOE won't change anything. Instead, try smaller BWT block sizes; I've seen reports that GF9xx GPUs can't efficiently random-access more than ~1 GB of VRAM due to a limited TLB cache.

  12. #42
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    Okay, I think I've got the CUDA thing almost nailed down; an updated version is now on GitHub, and of course that means some new test results.
    Code:
    > D:\Jampack_nv.exe c D:\comp\enwik9\enwik9 D:\comp\enwik9\enwik9.jam -b256
    Read: 953 MB => 190 MB (19.99%) @ 16.49 MB/s
    Completed in 57.98 seconds
    
    
    > D:\Jampack_nv.exe d D:\comp\enwik9\enwik9.jam nul -g
    Read: 190 MB => 953 MB (19.99%) @ 60.90 MB/s
    Completed in 15.69 seconds
    Looks like a new record for BWT speed. Thanks for the advice, Bulat!
    Also, it turns out that querying the CUDA library for device information really kills performance (which was the cause of the cl compile being slower); luckily the new version avoids repeated querying.

  13. Thanks:

    Bulat Ziganshin (19th April 2017)

  14. #43
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Can you measure the speed of the unbwt itself, separately for its CPU part and GPU part? It seems you got a ~2x speedup, which is pretty close to the 1.5x speedup you would get from using large pages.
    Last edited by Bulat Ziganshin; 19th April 2017 at 06:29.

  15. #44
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    Code:
    Gpu unbwt: 11.498s
    Cpu unbwt m/t: 15.274s
    Cpu unbwt s/t: 44.051s
    I tried to use the alloc functions you provided but I couldn't get them to compile, so it's still using standard alloc.

  16. #45
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Yeah, it's ripped directly from freearc and bloated with freearc-specific code. Eugene Shelwien provided this link: http://nishi.dreamhosters.com/u/2mpages.cpp ; hope you'll find it more useful.

    details at https://msdn.microsoft.com/en-us/lib...(v=vs.85).aspx


