
Thread: [GPU] Fast block-sorting compressor

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts

    [GPU] Fast block-sorting compressor

    Let's discuss various ways to implement the subject of this thread.

    In the '90s it was well known that BWT output should be compressed by low-order entropy coders. As Deorowicz said, "it can be approximated with high precision as an output of a piecewise stationary memoryless source". Nowadays, we have powerful computation engines with very limited local memory - GPUs. It seems that BWT output compression and GPUs are a perfect match.

    The BWT implementation in NVBIO is claimed to reach 70-80 MB/s on Kepler-class Tesla GPUs, so it may be a little over 100 MB/s on the best modern GPUs - and this territory is already occupied by the BSC compressor. Let's look into free territory instead, i.e. speeds of 200+ MB/s and the Sorting Transformation. On the 980 Ti, ST3 speed is 3 GB/s and ST8 is 500 MB/s. They just need to be paired with fast enough codecs, i.e. ones flying at 0.5-10 GB/s.

    We can look into 3 directions:
    1) simplest algorithm
    2) fastest one
    3) best-compression

    Let's start with the simplest one: ST+RLE+MTF+RangeCoder. ST and RLE are already implemented by CUB and BSC. Even a naive MTF implementation should process dozens of GB/s. The same holds for the RangeCoder. Of course, the data should be split into independent blocks of 100 KB or so in order to allow parallel processing and to use per-block stats for encoding.
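    For reference, a minimal scalar MTF sketch in Python; how to map it efficiently onto GPU threads is a separate question:
    Code:
    ```python
    def mtf_encode(data: bytes) -> list[int]:
        """Move-to-front: emit each byte's current rank, then move it to front."""
        table = list(range(256))
        out = []
        for b in data:
            r = table.index(b)    # rank of the symbol in the current table
            out.append(r)
            del table[r]
            table.insert(0, b)    # move the symbol to the front
        return out

    def mtf_decode(ranks: list[int]) -> bytes:
        table = list(range(256))
        out = bytearray()
        for r in ranks:
            b = table.pop(r)      # the rank directly indexes the table
            out.append(b)
            table.insert(0, b)
        return bytes(out)
    ```
    After a BWT/ST, runs of identical symbols become runs of zero ranks, which is what the subsequent RLE/RC stages exploit.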

    I'm not sure what should be coded by the RC. Is it enough to split the numbers produced by MTF and RLE into 16-32 bins and encode the bin number plus a direct output of any remaining bits? Some sort of:
    {0}, {1}, {2}, {3}, {4,5}, {6,7}, {8,9,10,11}...
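    One concrete realization of such a binning - this merely extends the example list above (four singleton bins, then two bins per power-of-two level, Elias-gamma style); the exact layout and the helper names `to_bin`/`from_bin` are an illustration, not a fixed proposal:
    Code:
    ```python
    def to_bin(n: int) -> tuple[int, int, int]:
        """Map value n to (bin_number, extra_bits_value, extra_bits_count)
        per the scheme {0},{1},{2},{3},{4,5},{6,7},{8..11},{12..15},{16..23},..."""
        if n < 4:
            return n, 0, 0
        k = n.bit_length() - 2            # level k covers values [2^(k+1), 2^(k+2))
        base = 1 << (k + 1)
        half = 1 << k                     # each of the two bins at this level
        bin_no = 2 + 2 * k + (1 if n - base >= half else 0)
        return bin_no, (n - base) % half, k

    def from_bin(bin_no: int, extra: int) -> int:
        if bin_no < 4:
            return bin_no
        k = (bin_no - 4) // 2 + 1
        base = 1 << (k + 1)
        return base + ((bin_no - 4) % 2) * (1 << k) + extra
    ```
    The RC would then model only the small bin-number alphabet, while the `extra` bits are emitted verbatim.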

    BWT is a huge area I have never really looked into, so I'm asking the experts about better algorithms. Even if you have never looked into GPUs, a GPU is basically just a few dozen cores, each running a few hundred threads sharing 30 KB of fast on-chip memory (per core), plus a few dozen registers per thread. Plus a memory interface able to request 10^9 256-byte blocks each second.


    Now, the possible improvements to the simplest codec:

    1) I wonder why Deorowicz says that we don't know same-context bounds. At least when Radix Sort is used to produce an ST, the radix sort output contains the original contexts, so we can split the ST-sorted chars into blocks by those contexts (with minimum and maximum limits on block size, of course)

    2) there are a lot of preprocessing techniques we can run at GPU memory speed: e8/e9, Szymon Grabowski's text tricks, alphabet reordering, data reversing

    So we are left with all those rle/mtf/if/dc/wfc/qlfc variations. Almost anything will be fine as long as it can be squeezed into a 30 KB stationary model. So, what are your ideas?


    PS: If you want to learn more about BWT output compression:
    http://www.data-compression.info/Algorithms/BWT/
    http://www.compression.ru/download/bwt.html#mod
    http://homepage3.nifty.com/wpage/lin...transform.html
    Last edited by Bulat Ziganshin; 1st June 2016 at 04:55.

  2. Thanks (5):

    Cyan (9th May 2016),encode (9th May 2016),Mike (10th May 2016),Paul W. (9th May 2016),Turtle (10th May 2016)

  3. #2
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    I'm not sure how you do anything like MTF on GPU in a reasonable way.

    GPUs really want to have N of the same operation to do at the same time (N very large).

    MTF is nasty because each operation depends on the previous, so bytes are totally serialized.

    Even if you break it into chunks and try to run independent MTFs of the chunks across threads, you have the problem that sometimes you're moving an element 0 or 1 steps and sometimes 256 steps, which GPUs are really bad at. You really want all your threads doing the exact same operations.

    [...]

    I found "Parallel Lossless Data Compression on the GPU", which has a reasonable idea for doing MTF on the GPU with a typical branch-merge structure... but in their tests it's usually slower than CPU MTF.
    Last edited by cbloom; 3rd August 2016 at 20:11.

  4. #3
    Programmer Bulat Ziganshin's Avatar
    You are right - I didn't look into that thoroughly. So let's see. First, we aren't (yet) fighting against the best CPU compressors - medium-to-low compression is OK as long as we can run it at 200+ MB/s. In particular, I will be happy even with a bzip2-class backend. So the question is: how large should an MTF block be to lose no more than 1-3% of compressed size compared to full-block MTF?

    Now let's attack the question from the other side: byte-level MTF will require 256 bytes of shared memory per scalar thread, so no more than 192-384 threads per SM (since SMs have 48-96 KB of shmem). E.g. a GTX 980 has 16 SMs, so it can run 6K threads. With 4 GB VRAM, it can sort data blocks of up to 160-200 MB, therefore (splitting those 160 MB among 6K threads) each MTF block should be 25-30 KB in order to fill the entire GPU.
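    The occupancy arithmetic above can be checked directly; every input below is one of the post's stated assumptions:
    Code:
    ```python
    # Back-of-envelope check of the occupancy figures above.
    shmem_per_sm   = 96 * 1024                   # up to 96 KB shared memory per SM
    mtf_table      = 256                          # bytes of shmem per scalar MTF thread
    threads_per_sm = shmem_per_sm // mtf_table    # 384 threads/SM (192 with 48 KB)
    sms            = 16                           # SM count of the GPU in question
    total_threads  = sms * threads_per_sm         # 6144, i.e. the "6K threads"
    data_block     = 160 * 1024 * 1024            # 160 MB block sorted in 4 GB VRAM
    per_thread     = data_block // total_threads  # ~27 KB per MTF block
    ```
    With 48 KB of shmem the thread count halves and the per-block size doubles, which is where the 25-30 KB range comes from.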

    But of course, if we wait for full MTF processing before going on to other tasks, we will have tails. Sometimes very long tails (= threads processing random data). So a better solution is to make dynamic decisions - i.e. start the post-MTF stages on blocks as soon as their MTF processing has finished. That's easily doable with dynamic parallelism (i.e. launch the post-MTF kernel once we have finished MTF processing of this particular block), and with less convenience it can be done on older GPUs too.

    So, for example, we can split the data into 64 KB blocks, each warp will process 32*64 KB = 2 MB of data, and once those 2 MB have been MTF-processed, the post-MTF kernel will be started. Or even better: once each individual 64 KB block is finished, its post-MTF processing will start.

    Another solution is to split large MTF blocks into smaller, partially overlapping ones. E.g. large block = 1 MB, small block = 4 KB + 256 bytes of the previous small block for training. This way we can do fast encoding, but decoding can be performed only in full 1 MB chunks.
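    A sketch of that overlapping-block idea - each small block starts from a fresh table that is warmed up on the tail of the previous block, so blocks become independent for encoding (the sizes are the post's example values; `mtf_block` and `encode_overlapping` are illustrative names):
    Code:
    ```python
    def mtf_block(data: bytes, table: list[int]) -> list[int]:
        """MTF over one block, continuing from the given table state."""
        out = []
        for b in data:
            r = table.index(b)
            out.append(r)
            del table[r]
            table.insert(0, b)
        return out

    def encode_overlapping(data: bytes, small: int = 4096, train: int = 256) -> list[list[int]]:
        """Each small block is primed by a training pass (output discarded)
        over the last `train` bytes of the previous block, then encoded.
        Every block can thus be encoded by a different thread in parallel."""
        blocks = []
        for pos in range(0, len(data), small):
            table = list(range(256))
            mtf_block(data[max(0, pos - train):pos], table)   # training pass
            blocks.append(mtf_block(data[pos:pos + small], table))
        return blocks
    ```
    Decoding is the mirror image and is equally parallel per small block, but only within a fully transmitted large chunk, matching the 1 MB decoding granularity above.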

    Yet another solution is to abandon MTF entirely and use only the "repeat last" code. As Fenwick said in his second report (see Direct-"Repeat"), it's still 10% better than pure symbol coding.

    Or we can use the CPU for that purpose - it needs to have some work to do while we are falling in love with the GPU.

    Finally, there is the MTF-joining solution in CUDPP (described in their "Parallel Lossless Data Compression on the GPU" paper). I haven't yet looked into that part of their code. Overall, their bwt+mtf+huffman code is here:
    https://github.com/cudpp/cudpp/blob/...ompress_app.cu
    https://github.com/cudpp/cudpp/blob/...mpress_cta.cuh
    https://github.com/cudpp/cudpp/blob/...ess_kernel.cuh
    Last edited by Bulat Ziganshin; 14th May 2016 at 15:08.

  5. #4
    Programmer Bulat Ziganshin's Avatar
    As has been shown by researchers, the main problem of BWT output compression is the rapid, implicit context changes, favoring highly-adaptive back-end compressors. We can consider the following back-ends, sorted by the time of their appearance as well as by complexity and speed (in parentheses I show the name of the closest original compressor as well as its results on the Calgary corpus):

    1) RLE+RangeCoder, with two separate tables for lengths and symbols for each block of 16 KB or so (Direct-"Repeat", 2.513 bpb)

    2) MTF+RangeCoder, with 16 KB blocks (BW94, 2.427 bpb)

    3) RLE + MTF + dynamic structured or hierarchical (unary) coder + blockwise encoding of remaining bits of MTF values (BZIP 0.21, 2.34 bpb)

    4) I'm still reading about further improvements. It looks like the main idea was IFC (inversion frequency coding), which allows sorting the data by symbol. The fast BSC model predicts the next byte's rank & length-of-run from the symbol itself + global context + state (binary logarithm of the previous rank of the same symbol). These 3 predictions are mixed, and then a bunch of binary models is employed to ensure fast adaptation. It can be implemented on GPU, but only for the compression stage (we can perform 3 passes over the data, sorted in different ways, and then mix the predictions made in the different passes).

    Still, the simpler approach of IFC + dynamic arithmetic coder achieved 2.27 bpb on Calgary. I think we can do the same with QLFC + sort by symbol + encode rank/len in 16 KB blocks (or use a structured/hierarchical model + encode the remaining rank/len bits in 16 KB blocks).

    Overall, it's important to avoid large (>100 elements) dynamic tables since their implementation will significantly lower GPU utilization (i.e. the number of threads per SM). Also, it's preferable to avoid divisions, both in the encoder and the decoder. Everything else is possible. Even binary coders will be very fast on GPUs (10-100 Gbit/s), but we should take the CPU decoding speed into account.

    OpenBWT 2.0 (benchmarks) provides a lot of ready-to-use routines implementing various mtf/rle-like transformations, including several sorted ones (sorted IF, sorted MTF), so it's an invaluable tool for checking a compression architecture before going to a GPU implementation. An interesting optimization that should further increase the compression ratio is that the OpenBWT transformations sort data by freq[c] rather than by c itself. Going further, we can split all chars into classes, either by frequency or by their meaning (lowercase letters, uppercase letters, spaces, punctuation...), and sort symbols / split them into blocks by their class.

    I think that sorting is very important since it improves compression by a few percent and makes the context slower-changing, so we can use block-static encoding rather than dynamic.
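    The freq[c] reordering mentioned above can be sketched like this - relabel the alphabet so that more frequent symbols get smaller codes (an illustration of the trick, not OpenBWT's actual code):
    Code:
    ```python
    from collections import Counter

    def reorder_by_frequency(data: bytes) -> tuple[bytes, list[int]]:
        """Relabel symbols so code 0 is the most frequent symbol, code 1 the
        next, etc.  Returns the relabeled data and the order table needed to
        invert the mapping (it must be transmitted alongside the data)."""
        freq = Counter(data)
        order = sorted(range(256), key=lambda c: (-freq[c], c))
        relabel = [0] * 256
        for new, old in enumerate(order):
            relabel[old] = new
        return bytes(relabel[b] for b in data), order

    def restore(data: bytes, order: list[int]) -> bytes:
        return bytes(order[b] for b in data)
    ```
    The same order table is also a natural starting point for splitting symbols into frequency classes.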

  6. #5
    Programmer Bulat Ziganshin's Avatar
    Unfortunately, OpenBWT 2.0 fails on TP, so I repeated the 6-year-old OpenBWT 1.5 test, now on an i7-4770. The results, including bsc and zstd for comparison, follow. SRC (Sorted Rank Coding) is the best second stage, while TP (Transpose) is the fastest:
    Code:
    C:\>sst_test.exe z:\e8
    Input: z:\e8, 100000000 bytes
    BWT ... Done. (20.92 sec)
    BWT -> SST -> RLE0 + Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                0.94            0.49          22944781 (22.94%)
      M1FF               0.93            0.50          22620816 (22.62%)
      M1FF2              0.93            0.49          22480483 (22.48%)
      SIF                0.99            3.14          21648436 (21.65%)
      DC                 0.96            0.69          23118798 (23.12%)
      SDC                1.19            0.88          21774150 (21.77%)
      SRC                1.40            0.74          21434745 (21.43%)
      TP                 0.37            0.29          22901245 (22.90%)
      TS0                0.88            0.84          22311750 (22.31%)
      Bx(3)              1.05            1.01          22275919 (22.28%)
      Bx(4)              1.03            1.04          22354727 (22.35%)
      Bx(5)              1.04            1.03          22454707 (22.45%)
      Bx(6)              1.06            1.12          22559436 (22.56%)
      SBR                0.93            0.82          22129901 (22.13%)
      FC                 2.21            1.92          22354372 (22.35%)
      WFC                4.55            4.40          21944813 (21.94%)
      AWFC               4.68            4.34          21743432 (21.74%)
      IFC                6.59            7.44          22290628 (22.29%)
    
    bsc 3.1.0 -b1000m0pe2: 20800348
    bsc 3.1.0 -b1000m0pe1: 20933900
    
    zstd -1: 40760118 bytes, 0.530 seconds (User Time)
    zstd -2: 37733203 bytes, 0.702 seconds
    zstd -3: 36614874 bytes, 0.967 seconds
    binary data:
    Code:
    C:\>sst_test.exe z:\100m
    Input: z:\100m, 100000000 bytes
    BWT ... Done. (17.17 sec)
    BWT -> SST -> RLE0 + Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                4.13            0.94          33395019 (33.40%)
      M1FF               4.09            1.17          33822495 (33.82%)
      M1FF2              4.10            1.09          33836968 (33.84%)
      SIF                1.05            3.58          33870492 (33.87%)
      DC                 4.13            1.25          34556589 (34.56%)
      SDC                4.38            1.44          33384052 (33.38%)
      SRC                4.62            1.29          32029367 (32.03%)
      TP                 0.33            0.25          43845346 (43.85%)
      TS0                2.38            1.28          35976652 (35.98%)
      Bx(3)              2.38            1.35          37106074 (37.11%)
      Bx(4)              2.12            1.35          38313450 (38.31%)
      Bx(5)              1.98            1.35          39200073 (39.20%)
      Bx(6)              1.88            1.44          40091164 (40.09%)
      SBR                2.85            1.37          35375009 (35.38%)
      FC                 4.28            2.56          37818472 (37.82%)
      WFC               12.93           12.34          32397114 (32.40%)
      AWFC              13.16           12.39          32056036 (32.06%)
      IFC                9.40            8.60          32976314 (32.98%)
    
    bsc 3.1.0 -b1000m0pe2: 30658806
    bsc 3.1.0 -b1000m0pe1: 30951152
    
    zstd -1: 47570893 bytes, 0.421 seconds (User Time)
    zstd -2: 44965974 bytes, 0.546 seconds
    zstd -3: 43649692 bytes, 0.780 seconds

    ST timings (CUDA times will be much better with modern CUB versions):
    Code:
    M:\>cuda_ST.exe z:\e8
    CUDA_ST_bsc 0.6  (c) Dell Inc.  Written by P.Skibinski
    Kernel Warmup = 1 CPU ms (100000000 KB/s)
    BSC ST3 = 812 CPU ms (123152 KB/s)
    CUDA ST3 = 388 CPU ms (257731 KB/s)
    - output of BSC_ST3 and CUDA_ST3 (2 parts) is identical (index=26567496)
    BSC ST4 = 799 CPU ms (125156 KB/s)
    CUDA ST4 = 445 CPU ms (224719 KB/s)
    - output of BSC_ST4 and CUDA_ST4 (3 parts) is identical (index=3247660)
    BSC ST5 = 1802 CPU ms (55493 KB/s)
    CUDA ST5 = 629 CPU ms (158982 KB/s)
    - output of BSC_ST5 and CUDA_ST5 (3 parts) is identical (index=16326958)
    BSC ST6 = 2385 CPU ms (41928 KB/s)
    CUDA ST6 = 702 CPU ms (142450 KB/s)
    - output of BSC_ST6 and CUDA_ST6 (3 parts) is identical (index=16326959)
    Last edited by Bulat Ziganshin; 15th May 2016 at 01:27.

  7. Thanks:

    hexagone (14th May 2016)

  8. #6
    Programmer Bulat Ziganshin's Avatar
    With a few modifications, I've managed to compile win32 & win64 builds of OpenBWT v1.5 with the full range of RLE stages and tested them on enwik8:
    Code:
    bsc 3.1.0 -b1000m0pe2: 20800348
    bsc 3.1.0 -b1000m0pe1: 20933900
    
    
    M:\openbwt-1.5\samples>sst_test64.exe z:\e8
    Input: z:\e8, 100000000 bytes
    BWT ... Done. (21.19 sec)
    
    BWT -> SST -> RLE0 + Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                0.66            0.49          22944781 (22.94%)
      M1FF               0.64            0.48          22620816 (22.62%)
      M1FF2              0.64            0.48          22480483 (22.48%)
      SIF                0.88            3.04          21648436 (21.65%)
      DC                 0.74            0.54          23118798 (23.12%)
      SDC                0.95            0.69          21774150 (21.77%)
      SRC                0.97            0.67          21434745 (21.43%)
      TP                 0.27            0.20          22901245 (22.90%)
      TS0                0.75            0.61          22311750 (22.31%)
      Bx(3)              0.76            0.79          22275919 (22.28%)
      Bx(4)              0.75            0.81          22354727 (22.35%)
      Bx(5)              0.78            0.83          22454707 (22.45%)
      Bx(6)              0.80            0.88          22559436 (22.56%)
      SBR                0.78            0.64          22129901 (22.13%)
      FC                 1.73            1.69          22354372 (22.35%)
      WFC                3.89            3.76          21944813 (21.94%)
      AWFC               4.07            3.72          21743432 (21.74%)
      IFC                6.04            6.12          22290628 (22.29%)
    
    BWT -> SST -> Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                0.64            0.48          23812585 (23.81%)
      M1FF               0.62            0.47          23420199 (23.42%)
      M1FF2              0.63            0.47          23195616 (23.20%)
      SIF                0.86            2.99          22039742 (22.04%)
      DC                 0.72            0.53          23118800 (23.12%)
      SDC                0.93            0.67          21728650 (21.73%)
      SRC                0.98            0.68          21850719 (21.85%)
      TP                 0.27            0.20          23688499 (23.69%)
      TS0                0.72            0.60          23046525 (23.05%)
      Bx(3)              0.76            0.79          23009408 (23.01%)
      Bx(4)              0.75            0.81          23107679 (23.11%)
      Bx(5)              0.77            0.83          23224949 (23.22%)
      Bx(6)              0.80            0.88          23347266 (23.35%)
      SBR                0.77            0.63          22816836 (22.82%)
      FC                 1.77            1.71          23214397 (23.21%)
      WFC                3.98            3.85          22815571 (22.82%)
      AWFC               4.08            3.76          22605314 (22.61%)
      IFC                6.12            6.17          23161881 (23.16%)
    
    BWT -> RLE-EXP(2) -> SST -> Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                0.60            0.45          23084157 (23.08%)
      M1FF               0.59            0.44          22776631 (22.78%)
      M1FF2              0.59            0.45          22644672 (22.64%)
      SIF                0.79            1.60          21577638 (21.58%)
      DC                 0.71            0.52          23195674 (23.20%)
      SDC                0.89            0.66          21704366 (21.70%)
      SRC                0.91            0.59          21504329 (21.50%)
      TP                 0.24            0.17          22698988 (22.70%)
      TS0                0.67            0.52          22473348 (22.47%)
      Bx(3)              0.63            0.64          22418244 (22.42%)
      Bx(4)              0.60            0.64          22388230 (22.39%)
      Bx(5)              0.60            0.64          22385973 (22.39%)
      Bx(6)              0.61            0.66          22402733 (22.40%)
      SBR                0.69            0.54          22286657 (22.29%)
      FC                 0.98            0.94          22169390 (22.17%)
      WFC                3.89            3.74          22085072 (22.09%)
      AWFC               3.96            3.66          21882892 (21.88%)
      IFC                2.63            2.65          22401027 (22.40%)
    
    BWT -> RLE-EXP(2) -> SST -> RLE0 + Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                0.60            0.45          23255097 (23.26%)
      M1FF               0.59            0.44          22946273 (22.95%)
      M1FF2              0.59            0.45          22833445 (22.83%)
      SIF                0.79            1.59          21645823 (21.65%)
      DC                 0.71            0.52          23195676 (23.20%)
      SDC                0.90            0.66          21752119 (21.75%)
      SRC                0.91            0.59          21829352 (21.83%)
      TP                 0.24            0.17          22865908 (22.87%)
      TS0                0.66            0.52          22653500 (22.65%)
      Bx(3)              0.63            0.64          22562587 (22.56%)
      Bx(4)              0.60            0.63          22533308 (22.53%)
      Bx(5)              0.61            0.64          22534690 (22.53%)
      Bx(6)              0.61            0.65          22552449 (22.55%)
      SBR                0.69            0.54          22494949 (22.49%)
      FC                 0.98            0.94          22322804 (22.32%)
      WFC                3.88            3.75          22254610 (22.25%)
      AWFC               3.96            3.65          22055571 (22.06%)
      IFC                2.63            2.65          22571101 (22.57%)
    ... as well as enwik9:
    Code:
    bsc 3.1.0 -b1000m0pe2: 163924696
    bsc 3.1.0 -b1000m0pe1: 164998990
    
    
    M:\openbwt-1.5\samples>sst_test64.exe z:\e9
    Input: z:\e9, 1000000000 bytes
    BWT ... Done. (256.54 sec)
    
    BWT -> SST -> RLE0 + Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                5.23            3.93         181549492 (18.15%)
      M1FF               5.04            3.80         178914394 (17.89%)
      M1FF2              5.15            3.86         177733856 (17.77%)
      SIF                7.30           30.49         170713509 (17.07%)
      DC                 5.83            4.37         182804192 (18.28%)
      SDC                7.48            5.37         171715155 (17.17%)
      SRC                7.86            5.72         168629937 (16.86%)
      TP                 2.28            1.75         179001288 (17.90%)
      TS0                6.19            5.25         175951344 (17.60%)
      Bx(3)              6.54            6.98         175360763 (17.54%)
      Bx(4)              6.61            7.29         175800543 (17.58%)
      Bx(5)              6.77            7.43         176420772 (17.64%)
      Bx(6)              7.20            7.94         177110276 (17.71%)
      SBR                6.31            5.36         174629805 (17.46%)
      FC                17.52           17.28         176313847 (17.63%)
      WFC               31.38           30.37         173430485 (17.34%)
      AWFC              32.09           29.35         171564360 (17.16%)
      IFC               11.47           11.65         174328078 (17.43%)
    
    BWT -> SST -> Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                5.43            4.10         189007682 (18.90%)
      M1FF               5.17            3.85         185704381 (18.57%)
      M1FF2              5.12            3.84         183758421 (18.38%)
      SIF                7.84           30.16         173762346 (17.38%)
      DC                 5.81            4.38         182804165 (18.28%)
      SDC                7.59            5.45         171269940 (17.13%)
      SRC                7.84            5.72         172123429 (17.21%)
      TP                 2.28            1.77         185687010 (18.57%)
      TS0                6.16            5.24         182198103 (18.22%)
      Bx(3)              6.50            6.92         181580950 (18.16%)
      Bx(4)              6.49            7.14         182196806 (18.22%)
      Bx(5)              6.73            7.38         182977182 (18.30%)
      Bx(6)              7.11            7.84         183821065 (18.38%)
      SBR                6.28            5.34         180477566 (18.05%)
      FC                17.21           17.02         183561940 (18.36%)
      WFC               30.66           29.90         180905180 (18.09%)
      AWFC              31.79           28.92         178855718 (17.89%)
      IFC               11.29           11.42         181647733 (18.16%)
    
    BWT -> RLE-EXP(2) -> SST -> Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                4.70            3.58         182668735 (18.27%)
      M1FF               4.58            3.52         180137239 (18.01%)
      M1FF2              4.59            3.50         179008332 (17.90%)
      SIF                6.29           12.45         169699886 (16.97%)
      DC                 5.55            4.11         183263178 (18.33%)
      SDC                7.09            5.26         170736913 (17.07%)
      SRC                7.18            4.71         169175965 (16.92%)
      TP                 1.95            1.65         177745784 (17.77%)
      TS0                5.14            4.23         177220014 (17.72%)
      Bx(3)              4.91            5.01         176439113 (17.64%)
      Bx(4)              4.88            4.96         176007148 (17.60%)
      Bx(5)              4.84            5.17         175813275 (17.58%)
      Bx(6)              4.80            5.18         175796514 (17.58%)
      SBR                5.40            4.30         175856647 (17.59%)
      FC                 7.86            7.57         174509163 (17.45%)
      WFC               30.24           29.13         174555152 (17.46%)
      AWFC              30.59           27.75         172687017 (17.27%)
      IFC                6.12            6.30         175487575 (17.55%)
    
    BWT -> RLE-EXP(2) -> SST -> RLE0 + Simple Structured Coding Model
      SST       Encoding Time   Decoding Time   Compressed Size
      MTF                4.67            3.53         184056029 (18.41%)
      M1FF               4.53            3.48         181562799 (18.16%)
      M1FF2              4.50            3.45         180624211 (18.06%)
      SIF                6.19           12.22         170525127 (17.05%)
      DC                 5.43            4.02         183263165 (18.33%)
      SDC                6.90            5.12         171202306 (17.12%)
      SRC                7.00            4.58         171886471 (17.19%)
      TP                 1.90            1.62         179181572 (17.92%)
      TS0                5.15            4.19         178734804 (17.87%)
      Bx(3)              4.91            4.98         177699985 (17.77%)
      Bx(4)              4.69            4.98         177284807 (17.73%)
      Bx(5)              4.74            5.03         177126778 (17.71%)
      Bx(6)              4.77            5.13         177130608 (17.71%)
      SBR                5.30            4.23         177599571 (17.76%)
      FC                 7.74            7.44         175914723 (17.59%)
      WFC               29.74           28.66         175931471 (17.59%)
      AWFC              29.97           27.86         174121558 (17.41%)
      IFC                6.13            6.29         176932513 (17.69%)
    So, both files are compressed best with SRC+RLE0, while the fastest transformation is RLE-EXP(2)+TP.

    With SRC (Sorted Rank Coding) we are just 2-2.5% worse than BSC -e1, and this transformation (essentially just MTF + sort by 8-bit value) - as well as the Simple Structured Coding Model - can be implemented on GPU.

    The changed/compiled files are attached. If you want to compile them on your own, put them into the "samples" directory of OpenBWT v1.5.
    Attached Files

  9. #7
    Programmer Bulat Ziganshin's Avatar
    I've published the first version of the Compression-Research suite.

    I consider this repository the place for my compression-related experiments. Right now it includes an almost full stack of BWT-compression-related algos, with two applications to benchmark them. There are no real compressors here, nor are any planned for future versions.

    So, the contribution of this release:


    Two new LZP compression routines

    My first algorithm, lzp_cpu_bsc_mod, is just a small low-level optimization of BSC LZP, implementing a multiplicative hash and unaligned memory access. Depending on the compiler, it's 3-25% faster than the original.

    LZP preprocessing for BWT/ST is usually called with a minimum match length minLen of 32..128. And any LZP implementation aiming for high minLen values suffers from the "false positives" problem - match candidates shorter than minLen bytes come in runs, since the initial check is performed on only 4 hashed bytes. The BSC LZP implementation only partially solves this problem by comparing the bytes at offset minLen first, as well as by employing other heuristics.

    My second algorithm, lzp_cpu_rollhash, completely solves the "false positive" problem by storing in each hash entry, besides a pointer, a hash of the minLen bytes it points to. This allows checking, with a very high confidence level, whether we may have a match. The current implementation stores a 32-bit hash, making "false positives" almost impossible (probability = 2^-32 ≈ 0.000000023%). Since successful matches are pretty rare (e.g. enwik9 processed by LZP with the -h15 -l32 options has 1 match per 400 input bytes), this means we run the innermost loop almost without unpredicted branches, making lzp_cpu_rollhash a new member of the growing family of branch-less compression algorithms.

    Even storing an 8-bit hash should make the cost of handling "false positives" negligible, while allowing the number of hash-table entries to be increased (still fitting in L2 cache) from 2^15 to 2^17, improving the overall compression ratio with only a tiny speed loss.

    And of course, hashing 32..128 bytes (= minLen) on every loop step would be prohibitively slow, so we run a rolling hash over the next minLen bytes - hence the algorithm's name.
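    A simplified scalar model of the idea (not the actual lzp_cpu_rollhash code; the parameters and helper names are illustrative, and the real code computes the long hash incrementally with a rolling hash rather than from scratch):
    Code:
    ```python
    MINLEN = 32        # minimum match length
    HASH_BITS = 15     # head-hash table size = 2^15 entries

    def head_hash(buf: bytes, pos: int) -> int:
        """Hash of the 4 bytes used to index the table (as in classic LZP)."""
        v = int.from_bytes(buf[pos:pos + 4], "little")
        return ((v * 2654435761) & 0xFFFFFFFF) >> (32 - HASH_BITS)

    def long_hash(buf: bytes, pos: int) -> int:
        """32-bit hash of MINLEN bytes (recomputed here; rolling in the real code)."""
        h = 0
        for b in buf[pos:pos + MINLEN]:
            h = (h * 31 + b) & 0xFFFFFFFF
        return h

    def find_match(buf: bytes, pos: int, table: dict) -> int | None:
        """table maps head-hash -> (position, long_hash of MINLEN bytes there).
        A candidate is accepted only if both hashes agree, so sub-MINLEN
        "false positives" are rejected before any byte comparison."""
        key = head_hash(buf, pos)
        entry = table.get(key)
        table[key] = (pos, long_hash(buf, pos))
        if entry is None:
            return None
        cand_pos, cand_hash = entry
        if cand_hash != long_hash(buf, pos):
            return None
        return cand_pos     # the real code would verify bytes and extend the match
    ```
    Since the long-hash comparison almost never fires on non-matches, the inner loop stays essentially branch-predictable.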

    The speed of all three LZP routines somewhat depends on compiler:
    Code:
    M:\CR>bslab-icl-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  352 MiB/s,  2710.463 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  361 MiB/s,  2639.064 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  488 MiB/s,  1955.604 ms
    
    M:\CR>bslab-clang-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  314 MiB/s,  3033.328 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  324 MiB/s,  2943.981 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  438 MiB/s,  2177.476 ms
    
    M:\CR>bslab-gcc-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  352 MiB/s,  2713.003 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  369 MiB/s,  2584.420 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  421 MiB/s,  2266.292 ms
    
    M:\CR>bslab-msvc-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  348 MiB/s,  2736.850 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  369 MiB/s,  2581.483 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  434 MiB/s,  2197.373 ms
    The current lzp_cpu_rollhash implementation employs 64-bit computations, so it's suboptimal on 32-bit platforms (but that will be easy to fix):
    Code:
    M:\CR>bslab-icl.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  315 MiB/s,  3026.841 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  334 MiB/s,  2856.902 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  359 MiB/s,  2654.012 ms
    
    M:\CR>bslab-clang.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  252 MiB/s,  3782.299 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  315 MiB/s,  3026.308 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  328 MiB/s,  2905.698 ms
    
    M:\CR>bslab-gcc.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  327 MiB/s,  2918.068 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  343 MiB/s,  2777.789 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  267 MiB/s,  3566.095 ms
    
    M:\CR>bslab-msvc.exe enwik9 -nobwt -norle -nomtf -lzp1-3
    [ 1] lzp_cpu_bsc     : 1,000,000,000 => 855,183,966 (85.52%)  288 MiB/s,  3308.380 ms
    [ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%)  315 MiB/s,  3023.044 ms
    [ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%)  307 MiB/s,  3111.218 ms
    You can change the LZP parameters using the -b, -l and -h options. Add "-nobwt -norle -nomtf" to run only the LZP compression part. The fourth LZP algo, "lzp_cpu_rollhash (OpenMP)", just splits the input data into 8 MB chunks and runs multiple instances of the algorithm on multiple cores simultaneously.

    The compilers tested:
    - Intel C++ Compiler 16.0
    - Clang 3.8
    - GCC 5.3
    - Microsoft Visual C++ 2015 Update 1

    Compression-Research-v1.zip includes BSLab executables compiled by all of them, for every combination of x86/x64 and sse2/avx2 targets - i.e. 16 executables overall. All speed measurements given here were performed on a single core of a Haswell i7-4770.

    You can find more benchmark results here (boost/enwik8/enwik9 are text files, 100m/1g are executables, and 1g.tor3 is incompressible data), and download these testfiles (742 MB).

    I will continue describing the contributions of this release in the next posts...
    Last edited by Bulat Ziganshin; 19th June 2016 at 14:33.

  10. Thanks (4):

    Cyan (19th June 2016),inikep (19th June 2016),Shelwien (19th June 2016),Turtle (20th June 2016)

  11. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    BSC is the hallmark of BWT/ST compression, so I'm interested in ways to improve its performance. Let's start with an analysis of its current timings when compressing ENWIK9 on a single core of my i7-4770:

    - LZP -h15: 3 sec, or -h26: 7.5 sec
    - BWT: 85 sec, or ST6/5/4: 21/15/5.5 sec, or GPU-ST: 0 sec
    - RLE+MTF: 5 sec
    - static modeling (-e1): 18 sec, or dynamic modeling (-e2): 36 sec

    As you can see, encoding time is dominated by the BWT, but once we switch to ST4 or GPU ST, modeling becomes the hog. So, possible BSC optimizations fall into 3 areas:

    BWT or high-order ST optimizations are always welcome. What can be done here:
    - fast parallel/GPU UnBWT by Lucas (120 MB/s on CPU, an expected 1 GB/s on GPU)
    - fast parallel BWT: MSufSort4 (50 MB/s in m/t mode)
    - GPU BWT: NVBio or CUDPP, reported as 75 MB/s on a pretty old GF780 (AFAIR)

    Once we use ST, modeling becomes the hog. The next post describes several models with speeds up to a few GB/s.

    Once we have fast ST and modeling, we are ready to improve LZP/MTF speeds:
    - Shelwien's SIMD MTF algo runs in 2 sec with SSE2 and 1.5 sec with AVX2 (included in the Compression-Research package)
    - the rolling-hash LZP decreases the LZP -h15 time to 2 sec (also included)
    - a sorted LZP may run at ~1 GB/s on GPU and ~100 MB/s per CPU core, providing the same results as LZP -h26

  12. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Encoding of BWT/ST+MTF+RLE0 output

    For comparison: gzip compresses the Calgary corpus to 2.71 bpb, PPMdJ to 2.091 bpb (the best modern BWT compressors have very similar results).

    The original BW94 algorithm combined bwt+mtf+rle+some entropy coder. The results were 2.43 bpb with Huffman and 2.40 bpb with arithmetic coding. This shows how skewed the data are, even after the RLE pass! Simply replacing Huffman with modern FSE or a vectorized RC/rANS would improve ratios by more than 1%.

    The models to consider:

    1. 2.513 bpb: direct FSE encoding of BWT output with a single Repeat code: page 18 of [1]. The Repeat code replaces a repetition of the last symbol, so it's essentially RLE with unary encoding of the run length. Actually, Fenwick employed an adaptive arithmetic model with very fast adaptation (increment=600, total=10k), so I'm not sure whether we can get similar results with block-static coding.

    We can try to encode repeat lengths with the 1/2 code, the Maniscalco code, or some wider structured/hierarchical model.


    2. 2.46 bpb: bred1 == bwt+mtf+rle0+1/2-coding+huffman. We can further improve this result by using FSE (+1%), using one FSE table per 16-128 KB of input data (bred1 used a single Huffman table for the entire file), and maybe by replacing the 1/2 code with better ways to encode RLE counts.

    But most important, we can replace MTF with SRC, which improves the compression ratio by 7%!!! Well, those results are on enwik9 in combination with a structured model, so we can only guess about its effect in this case. Nevertheless, 2.46 - 8% = 2.26 (7% from MTF->SRC and 1% from Huffman->FSE).


    3. 2.37 bpb: bzip2, which is pretty similar to bred1, except that the Huffman tables are built in a much more sophisticated way. With FSE+SRC, it may be 2.37 - 8% = 2.18. The most important point is that the bzip2 model is an excellent fit for SIMD and GPUs.


    4. 2.34 bpb: bzip1 = bwt+mtf+rle0+1/2-coding+structured model. Since the structured model's output is encoded with adaptive arithmetic coding, it seems that bzip1 provides essentially the same results as bzip2+FSE while being slower.

    This model is implemented in OpenBWT, and we can see its results on enwik9: 168.6 MB with SRC, compared to 165.0 MB for bsc -e1. So it looks like we can implement the bzip2 model+FSE+SRC and get results only 2% worse than bsc -e1 and 5% worse than BWMonstr!!!



    Overall time required to process ENWIK9 on the single core:
    - LZP: 2 sec
    - MTF: 1.5 sec with AVX2, 2 sec with SSE2
    - RLE, 1/2 coding and sorting of MTF output: ?
    - 2 passes of probability recomputation (see below): 0.5 sec
    - FSE encoding: 0.5 sec

    So, it's 5/3 sec with/without LZP, plus the unknown time in the third line. Divided over 4 cores, the time will be ~1 sec, i.e. we can compress only 2% worse than BSC -e1 while being ~5x faster.

    Moreover, we can easily offload the probability computations to the GPU. Offloading LZP, MTF and entropy coding is also possible, but will require more work.

    [1] http://www.compression.ru/download/a..._1995_8_ps.rar

  13. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    RLE0

    Usually, more than 50% of MTF output consists of Rank0 symbols; in particular, text files such as enwik have 60-70% zeros. They contribute 20-30% of the compressed size, so efficient encoding of zeros is very important.

    Simplifying things, MTF output consists of two types of blocks. The first is rather random blocks corresponding to UNSTABLE bwt contexts, f.e. "aeeioeuuu". The second is rather repetitive blocks produced in STABLE contexts, f.e. "aaaaaaeaaaaaoaaaa". The second type of block is common when the context is sufficient to predict the next symbol with high probability, f.e. "u" after "q" in English texts, or a space after a dot in any text.

    The second type of block contributes most of the Rank0 symbols after MTF, and my example shows that Rank0 symbols tend to occur in runs, calling for RLE after MTF. But Rank1 and other symbols don't have an increased probability of runs, so we should RLE-process only Rank0 symbols - this algorithm was named RLE0.

    MTF+RLE0 is equivalent to RLE+MTF. Some programs, f.e. BSC, combine RLE and MTF into a single procedure. For other programs, it may be more efficient to run RLE prior to MTF, decreasing the amount of work for MTF. For research purposes, it's easier to talk about MTF followed by an optional RLE0.
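    For illustration, a naive MTF+RLE0 pass can be sketched like this (my own toy code with made-up names, not BSC's fused routine, and nowhere near the multi-GB/s goal):
    Code:
    ```cpp
    #include <cstdint>
    #include <numeric>
    #include <utility>
    #include <vector>

    // MTF followed by RLE0: emit (rank, zeros_after) pairs, rank in 1..255,
    // plus a count of Rank0 symbols appearing before the first RankN symbol.
    struct Rle0Output {
        uint32_t leading_zeros = 0;
        std::vector<std::pair<uint8_t, uint32_t>> pairs;
    };

    Rle0Output mtf_rle0(const std::vector<uint8_t>& input) {
        uint8_t table[256];
        std::iota(table, table + 256, 0);   // identity-initialized MTF table
        Rle0Output out;
        for (uint8_t c : input) {
            // find the current rank of c and move it to the front (naive O(rank))
            int rank = 0;
            while (table[rank] != c) rank++;
            for (int i = rank; i > 0; i--) table[i] = table[i - 1];
            table[0] = c;
            if (rank == 0) {                // Rank0: extend the current zero run
                if (out.pairs.empty()) out.leading_zeros++;
                else out.pairs.back().second++;
            } else {
                out.pairs.push_back({(uint8_t)rank, 0});
            }
        }
        return out;
    }
    ```
    F.e. the stable-context block "aaaaaa" collapses into a single pair with a zero-run count of 5, which is exactly the redundancy the Count encodings below exploit.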

    MTF+RLE0 output can be represented as a sequence of pairs (Rank1..255, Count0..BlockSize), where Count encodes the number of Rank0 symbols after the single RankN symbol. Small counts have much higher probability, so a good encoding should pay special attention to efficient coding of small numbers. We can consider the following encodings:

    1. The lack of RLE0 encoding may be viewed as RLE0 followed by UNARY encoding of Counts, with its single code, Rank0, embedded into the table of Rank1..255 symbols.

    2. We can improve over unary encoding by using BINARY encoding with two codes, RUNA and RUNB, embedded in the Rank1..255 table (so there are 257 symbols total). The length of a Rank0 run is encoded by a sequence of RUNA/RUNB symbols delimited by RankN symbols. All possible runs of RUNA/RUNB symbols, enumerated in natural order, represent increasing Count values: 1 -> A, 2 -> B, 3 -> AA, 4 -> AB... One can compute the sequence from Count by adding 1 and removing the leading 1 from the binary digits of the result, f.e. Count=3 is encoded as 3+1 = 0b100 -> 00 = AA. Count=0 is encoded by an empty run. Compared to unary encoding, this never extends the sequence, but adds one more symbol to the table (i.e. RUNA and RUNB instead of the single Rank0 symbol).
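    A sketch of this Count <-> RUNA/RUNB mapping (hypothetical helper names, for illustration only):
    Code:
    ```cpp
    #include <cstdint>
    #include <string>

    // 1/2-code: add 1 to Count, drop the leading 1 bit of the result, and map
    // the remaining binary digits 0 -> RUNA ('A'), 1 -> RUNB ('B'), MSB first.
    std::string count_to_runab(uint32_t count) {
        uint32_t n = count + 1;
        std::string out;
        while (n > 1) {                    // collect the bits below the leading 1
            out += (n & 1) ? 'B' : 'A';
            n >>= 1;
        }
        // bits were collected LSB-first, so reverse; Count=0 -> empty run
        return std::string(out.rbegin(), out.rend());
    }

    uint32_t runab_to_count(const std::string& code) {
        uint32_t n = 1;                    // re-attach the leading 1 bit
        for (char c : code) n = n * 2 + (c == 'B');
        return n - 1;
    }
    ```
    F.e. count_to_runab(3) yields "AA" and count_to_runab(10)/count_to_runab(11) yield "ABB"/"BAA" - the pair whose codes get swapped by the reordering trick below.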

    Despite its simplicity, this embedded binary encoding, nicknamed the 1/2-code, was used by most BWT compressors and remained unbeatable from the original BW94 through the next decade. Its gain over unary encoding was a little more than 1% (2.37 -> 2.34 bpb) with the BZIP1 compressor on the Calgary corpus. AFAIK, attempts to use ternary and 4-ary encodings were unsuccessful, probably because the second and following occurrences of RUN* symbols in each run are much less probable than the first occurrence and thus waste code space.

    One later improvement is some reordering of codes. F.e. the BAA encoding is shorter than ABB (since the RUNA symbol has higher probability), but they represent Counts 11 and 10, respectively. By exchanging the codes for these Counts, we can improve the compression ratio. The same trick is applicable to longer codes, but their probability is so low that usually only 10 and 11 are rearranged.

    3. Maniscalco proposed [1] to add only a single RUN symbol to the RankN table, and to encode the sequence of A/B symbols with a separate model or multiple models, f.e. Count=3 is encoded as RUN,RUN symbols in the main model plus 0,0 bits in the separate model. The length of the run of RUN symbols tells the decoder how many bits to read from the second model.

    Note that using a single model gives the same compression results as plain 1/2-coding, except that we perform more computations and can use a different update mechanism for this separate table. Extra contexts open more possibilities - we can have a separate model for the first binary digit, or even separate models for each digit position, or more still - a separate set of models for every length of the RUN-symbol run.
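    The split into a run of RUN symbols plus side bits can be expressed as follows (my own illustrative formulas and names, matching the Count=3 -> RUN,RUN + bits 0,0 example above):
    Code:
    ```cpp
    #include <cstdint>
    #include <utility>

    // Encode: k = number of RUN symbols in the main model (position of the
    // leading 1 bit of Count+1), bits = the k binary digits below it.
    std::pair<uint32_t, uint32_t> maniscalco_encode(uint32_t count) {
        uint32_t n = count + 1, k = 0;
        while ((n >> (k + 1)) != 0) k++;   // find the leading 1 bit of n
        return { k, n - (1u << k) };       // (RUN-run length, side-model bits)
    }

    // Decode: re-attach the leading 1 bit above the k side bits, subtract 1.
    uint32_t maniscalco_decode(uint32_t k, uint32_t bits) {
        return ((1u << k) | bits) - 1;
    }
    ```
    This makes explicit why a single side model matches plain 1/2-coding: the information content is identical, only the bit-sequence length moves from the A/B symbols themselves into the count of RUN symbols.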

    4. The obvious way: encoding Count with the usual entropy coders. Surprisingly, this is employed only by the most advanced BWT algorithms such as GRZipII/BSC, which encode Count bit-by-bit using a lot of contextual information. It may be considered a further development of the Maniscalco encoding, replacing the implicit encoding of the bit-sequence length via a run of RUN symbols with an explicit one.

    I consider the 1/2-code as the baseline. It has an efficient implementation and is successfully used in bzip2, which is my model coder. Bit modeling seems too slow for my needs, especially for decoding/GPUs - they prefer block-static codes. OTOH, direct Count encoding with FSE may be even faster, so we may check whether it can provide the same compression ratios as the 1/2-code when combined with bred1/bzip2 modeling.

    [1] http://www.geocities.ws/m99datacompr...pers/rle2.html

  14. Thanks (2):

    boxerab (5th April 2017),SolidComp (3rd April 2017)

  15. #11
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    61
    Thanked 21 Times in 12 Posts
    I have worked quite a lot with GPU codecs over the past 3 years. Designing an efficient algorithm for a GPU is very different from doing so for a CPU. A GPU has very large memory bandwidth but also very large memory latency, plus a limited number of registers and limited on-chip memory. The key design principles I follow, from experience, are:

    1. make the algorithm as simple as possible, to help the compiler reduce register usage
    2. reduce branches as much as possible, to avoid thread divergence
    3. organize data in memory to optimize load/store efficiency: on AMD hardware, for example, 64 work items reading 64 consecutive 32-bit numbers is the most efficient way of reading data from global memory

    Some thread divergence can be tolerated while still getting good performance, but it must be controlled very carefully.

