C++ AMP
CUDA
OpenCL
OpenACC
BWT & ST[N] transformations
Entropy coding
Deduplication
LZ exhaustive match finder
LZ optimal parser
LZ lazy parser with appropriate matchfinder
Basic algorithms
I've finally found the easiest way into GPGPU programming: C++ AMP, part of the Microsoft C++ compiler (which is free for Windows users). There are several ways to learn the basics of AMP:
- read MSDN intro [available in your native language]
- read MSDN magazine articles #1 and #2 [also available in your native language]
- read a series of blog posts indexed here
- buy the 300-page book (I got the PDF)
There is an MSDN blog dedicated mainly to AMP; it has a samples page, including a 14-line "Hello world", and also contains Performance Guidance.
Besides MSVC, AMP is also implemented in a clang fork. The AMP spec is open, so other implementations may follow.
The MS AMP implementation works through DirectCompute 5.0, which is part of DirectX 11. The good part is that it can run on any DX11-compatible GPU, including AMD/NVidia discrete graphics, AMD APUs, and Intel GPUs starting with Ivy Bridge. Additionally, starting with Windows 7, there is a fallback CPU engine that takes advantage of multi-core/SSE.
The bad part is that, being compatible with a wide range of GPUs, AMP allows only operations supported by them all. This means that kernels (i.e. code chunks that run on the GPU) have pretty limited abilities: there is no recursion (it seems the compiler just recursively inlines all functions called from the kernel), no arbitrary pointers (although you can use computed indexes into arrays), no 8/16/64-bit ints and no dynamic parallelism, all of which are available in the latest CUDA.
The great side of AMP is two libraries implementing STL-like operations over arrays stored on the GPU: the C++ AMP Algorithms Library (see function list) and AMD Bolt (see function list). Bolt runs on Linux/Windows over AMP, OpenCL and Intel TBB (a CPU-based parallelism library). Both libraries implement sort, scan, copy_if and other algorithms that may be used for STn, dedup, LZ and Huffman compression.
Sorting seems to be the key to any data searching on GPU, so I should mention that AmpAlgo includes radix_sort, while Bolt has stable_sort and stable_sort_by_key.
CUDA 7 (introduction) is the latest version, supporting C++11 inside kernels and Thrust 1.8. Thrust is an STL-like library that can also be compiled to a CPU backend by GCC with OpenMP or Intel TBB. It supports stable sort and stable sort by key (with speeds up to 1700 MB/s according to the blog post). Compared to MS AMP, CUDA supports Linux and every feature of the latest NVidia GPUs, but is of course limited to those GPUs. Also, it has its own C++ compiler, which means you lose all the GCC/MSVC-specific features. I believe you can put the GPU-specific code into a DLL, however.
Thrust: site, introduction, tutorial, documentation, examples, releases
CUB and Modern GPU are two more data processing libs.
OpenCL: AMD Bolt (AMD GPUs only!), CLOGS, VexCL, Boost.Compute (docs)
Intel C++ can offload to Intel GPUs: https://software.intel.com/node/522442
The only BWT implementation I know of is part of NVBIO. Edit: an article about CUDPP 2.2, another CUDA BWT/MTF/bzip2 implementation.
The ST[N] transformation is the easiest one to implement once we have sorting routines. Just save the current char + context (i.e. the N preceding or following chars) in each array element and stable-sort by the context, then extract the first char from every record of the sorted array. This algorithm requires 2*(N+1)*BlockSize bytes of memory, since stable sorting can't be performed in place. Alternatively, we may include a 4-byte position in each record and perform in-place unstable sorting (by context+position), reducing memory requirements to (N+5)*BlockSize.
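The stable-sort formulation can be sketched in plain CPU-side C++ for illustration (on the GPU the stable sort would come from one of the libraries above; the function name `st_n` and the cyclic context handling are my own illustrative choices, not any particular codec's):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// ST(N) forward transform, first method from the text: pair every position
// with its N preceding bytes (taken cyclically), stable-sort positions by
// that context, then emit the current char of each position in sorted order.
// Assumes n <= block size.
std::string st_n(const std::string& block, int n) {
    const int size = (int)block.size();
    std::vector<int> pos(size);
    for (int i = 0; i < size; i++) pos[i] = i;
    std::stable_sort(pos.begin(), pos.end(), [&](int a, int b) {
        for (int k = 1; k <= n; k++) {              // compare the N preceding bytes
            uint8_t ca = (uint8_t)block[(a - k + size) % size];
            uint8_t cb = (uint8_t)block[(b - k + size) % size];
            if (ca != cb) return ca < cb;
        }
        return false;                               // equal contexts keep input order
    });
    std::string out(size, '\0');
    for (int i = 0; i < size; i++) out[i] = block[pos[i]];
    return out;
}
```

The output is just a permutation of the input bytes, which is what makes the sort-based memory accounting above apply directly.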
BSC implements the first method, and with N=7 or 8 it needs VRAM ~= 20*BlockSize. We can implement ST compression on GPU with larger block sizes by applying two techniques of splitting the data:
Vertical splitting: split the data by original position. This way we split the original data into subblocks, sort them using either 2*(N+1)*SubBlockSize or (N+5)*SubBlockSize memory and, once all subblocks are sorted, merge them using the saved N-byte context plus the subblock number as the key. The (hierarchical binary) merging itself may be performed at speeds around 1 GB/s and, taking into account the following compression at about 40 MB/s, it's almost free. The memory requirements are a bit more than (N+1)*BlockSize.
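A minimal CPU sketch of the vertical-split idea (the name `sort_by_split` and the two-subblock layout are hypothetical). It relies on std::merge taking equal keys from the first range first, which is exactly the subblock-number tie-break, so the result matches one big stable sort:

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Vertical splitting: stable-sort each subblock of (context, char) records
// separately, then merge by context with the subblock number as tie-break.
using Rec = std::pair<std::string, char>;   // (saved N-byte context, current char)

std::vector<Rec> sort_by_split(const std::vector<Rec>& recs, size_t split) {
    auto by_ctx = [](const Rec& a, const Rec& b) { return a.first < b.first; };
    std::vector<Rec> lo(recs.begin(), recs.begin() + split);   // earlier positions
    std::vector<Rec> hi(recs.begin() + split, recs.end());     // later positions
    std::stable_sort(lo.begin(), lo.end(), by_ctx);
    std::stable_sort(hi.begin(), hi.end(), by_ctx);
    std::vector<Rec> out;
    out.reserve(recs.size());
    std::merge(lo.begin(), lo.end(), hi.begin(), hi.end(),
               std::back_inserter(out), by_ctx);
    return out;
}
```

With more than two subblocks, the same merge is applied hierarchically, as the text describes.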
Horizontal splitting: split data by output position, i.e. by the context:
1. Calculate freqs for byte pairs
2. Split those byte pairs into G groups with about the same total frequency, e.g. 00..'FA' with 16123 pairs total, 'FB'..'LN' with 17333 pairs, 'LO'..'TK' with 15777 pairs, 'TL'..FFFF with 18222 pairs
3. Perform G scans of the input data, each time sorting only the bytes whose context falls in the current group; this requires (N+5)*MaxGroupSize of additional memory. The output of each scan can be immediately compressed, since it's the next chunk of output data, and compression may be overlapped with sorting of the next group.
Now let's see how this may be used in practice. There are lots of possible combinations of vertical+horizontal splitting on the CPU/GPU side, but the most practical ones are:
1. Vertical splitting into blocks compressible on the GPU. As mentioned above, merging sorted blocks back is so fast that we can sort arbitrary-sized blocks using the GPU at virtually the same speed as 50 MB blocks. RAM usage is (N+1)*BlockSize, though. But since the merging and the following compression stage access memory sequentially, we can use a paging file or AWE memory without much speed loss. This also makes the approach practical for CPU-only sorting of arbitrary blocks with limited (randomly accessed) memory usage.
2. Horizontal splitting inside the GPU. This way we can sort blocks of size up to ~1/2 of VRAM. Sorted groups are immediately downloaded to CPU RAM and compressed in parallel with sorting of the remaining groups on the GPU. Horizontal splitting on the GPU should be very fast, both due to the high bandwidth of VRAM and the abundant computation resources, so again I expect it to be almost as fast as whole-block sorting. The most exciting side of this method is that it doesn't need any CPU RAM at all, except for small I/O and compression buffers.
3. Horizontal splitting through the GPU. We can go further and implement horizontal splitting on the GPU for an arbitrary input block size. This way, we stream the entire input block G times through the GPU; each time the GPU filters only the entries that belong to the current group and then sorts them using the entire VRAM. So it requires CPU RAM only for the original input block; the GPU-produced output can be immediately compressed and sent to the outstream.
4. CPU-only sorting with horizontal splitting allows limiting memory usage to 1.5-2 * BlockSize. For example, by splitting into 16 blocks we may limit ST6 memory usage to BlockSize (for the original data) + (N-2)*BlockSize/16 = 1.25*BlockSize, or 1.5*BlockSize if we want to compress and sort the next group in parallel. (Unlike GPU sorting, on the CPU we can use a more memory-efficient algorithm with small changes to support horizontal splitting.) And since the original data are accessed sequentially, we may offload them to a pagefile/AWE memory or just reread the input data multiple times.
Huffman coding can be done in this way:
1. Count frequencies of the input data (not necessarily plain chars; it may be chars+matches with a complex formula for the frequency bin). The attached histogram.exe manages 13 GiB/s on a GF560Ti. Also implemented by CUB.
2. Build Huffman table(s) from freqs, including:
2.1 Sort symbols by freq using segmented sort, or just any sort whose steps don't depend on the data
2.2 Build Huffman tree (sequential process requiring simultaneous access to only 3 words per table, i.e. 3*128 bytes if cache rows are 128 bytes long, so we can process ~1000 tables simultaneously)
2.3 Assign codes to table entries (a sequential process requiring simultaneous access to only 2-3 words per table: the current node and either its parent or its children)
3. Split data into chunks of 128..1024 elements and compute how many bits each chunk will use
4. Compute partial sums of those bitsizes to determine position where output for each chunk should be started
5. Encode the data using one thread per chunk, with special handling for the first (partial) word of output. One Huffman table is shared between many chunks, so it needs at most 16 bytes/chunk. The encoding process is sequential, requiring simultaneous access to only 2 words: the current source data and the encoded data.
CUDA NPP also contains functions for histogram building and (JPEG) Huffman decoding.
----------------------------------------------------------------------------------------------------------------
ANS/Ari coding may split the data into blocks of 16 KB or so and process each block sequentially; this still allows running thousands of threads, but we may run out of local memory. Therefore, computation-based approaches should be preferred to table-based ones.
1. The SREP -m1 chunking algorithm (chunking by CRC of the last 48 bytes) can run at speeds around 10 GB/s even on an i7-class CPU, so a GPU implementation doesn't make much sense. Nevertheless, it's a good start.
So, one way is to compute a quick checksum of the last 32..64 bytes at every byte position (which requires 16..32 32-bit ALU operations) and then mark each position where this checksum has, say, 0 in the left-most 12 bits. Since we want to use no more than one byte (or even better, only one bit) per mark, each thread should process at least 4 (or 32) positions, because a 32-bit word is the smallest accessible memory chunk.
Another way is to use a rolling hash. First we need to hash the preceding 32..64 bytes (which requires 128..256 ALU operations: mask+shift+mul+add per byte, since every byte must be processed individually). Then the thread may process several hundred bytes using ~10 operations per byte (mask+shift+mul+add for the added byte, mask+shift+mul+sub for the removed byte, then check the hash and conditionally set the corresponding bit in the flags word). Once per 32 input bytes processed, we store the 32-bit flags word we have constructed to memory.
This means ~10-15 ALU operations per input byte, for resulting speeds around 100 GB/s (I suppose a GPU can perform ~10^12 operations/second): much faster than a CPU, but pretty useless unless we can run further compression stages on the GPU.
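A scalar CPU sketch of the rolling-hash variant (the multiplier, the window handling and the top-12-bits-zero test are illustrative choices, not SREP's actual constants):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Rolling-hash boundary detection as described: maintain a hash of the last
// w bytes (multiply-add the incoming byte, subtract the outgoing one scaled
// by M^w), and mark every position whose hash has its top 12 bits zero.
// All arithmetic is mod 2^64, so the rolling update is exact.
std::vector<size_t> find_boundaries(const std::string& data, size_t w) {
    const uint64_t M = 0x100000001b3ULL;   // multiplier (FNV prime, arbitrary)
    uint64_t mw = 1;                       // M^w, to remove the outgoing byte
    for (size_t i = 0; i < w; i++) mw *= M;
    uint64_t h = 0;
    std::vector<size_t> marks;
    for (size_t i = 0; i < data.size(); i++) {
        h = h * M + (uint8_t)data[i];              // add incoming byte
        if (i >= w) h -= mw * (uint8_t)data[i - w]; // drop outgoing byte
        if (i + 1 >= w && (h >> 52) == 0)          // top 12 bits zero: boundary
            marks.push_back(i);
    }
    return marks;
}
```

On the GPU, each thread would run this loop over its own span and pack the marks into 32-bit flag words, as described above.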
2. The ZPAQ chunking algorithm is also doable, but it needs a much larger "warm-up" stage in order to properly fill the o1[] array; I used 8 KB in the SREP -m2 implementation. So, in order to limit the overhead, each thread should process, say, 32 KB of useful data (+8 KB for warm-up), and to create 2048 threads we need to process 64 MB of data.
3. REP algo...
4. Main SREP algo...
...
Now we have a match table: for every position in some range, it includes the nearest match of length 2+, then the nearest match longer than the previous one, and so on. For parsing, we split the input data into segments of about 16 KB and run one thread per segment.
1. Build an optimal LZSS parse, i.e. one that minimizes the number of matches and doesn't use any stats from the entropy coder. This parsing should provide a compression ratio close to a carefully built lazy parser.
2. Optionally, collect statistics and build an entropy coder model. Build the optimal parse again, now using the entropy coder stats (i.e. information on how many bits each char/match will require).
3. Optionally, repeat last step again.
With the match length / fast bytes limited to 128, we need to cache only the last 128 elements of the optimal pricing table, where each entry needs at least 5 bytes (3 bytes for optimal_price plus 2 bytes for optimal_pos), 640 bytes/thread in total. We also need to keep the entropy coder table, at least a price for each entry. Each price needs 1-2 bytes, so with 300 (deflate) to 800 (tornado) entries, we need 300-1600 bytes of local memory for this table. Even more memory may be required for building the entropy coder model, but that operation is pretty fast, so we can live with lower parallelism.
The LZSS-optimum parser requires only 2 bytes for storing the previous position in the optimal-path table, so with fast_bytes=32 it's only 64 bytes/thread.
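The token-minimizing parse of step 1 is a simple shortest-path DP. A CPU sketch, under the simplifying assumption that we're given only the longest match available at each position (with LZSS, any shorter length is also usable); the function name and interface are my own:

```cpp
#include <algorithm>
#include <vector>

// LZSS-optimum parse in the sense used above: a shortest-path DP that
// minimizes the number of emitted tokens (literals and matches), ignoring
// entropy-coder prices. max_len[i] is the longest match available at
// position i (0 if none). Returns the minimal token count for the block.
int lzss_optimal_tokens(const std::vector<int>& max_len, int min_match) {
    int n = (int)max_len.size();
    const int INF = 1 << 30;
    std::vector<int> cost(n + 1, INF);   // cost[i] = min tokens to encode first i bytes
    cost[0] = 0;
    for (int i = 0; i < n; i++) {
        if (cost[i] == INF) continue;
        cost[i + 1] = std::min(cost[i + 1], cost[i] + 1);         // emit a literal
        for (int len = min_match; len <= max_len[i] && i + len <= n; len++)
            cost[i + len] = std::min(cost[i + len], cost[i] + 1); // emit a match
    }
    return cost[n];
}
```

The real parser would also record, per position, which token achieved the minimum (the 2-byte previous-position entry mentioned above) so the path can be walked back.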
1. Build the ST4 index of the block (say, 64 MB). Such an index contains all block positions stable-sorted by the 4 bytes they point to. Memory usage = 4*BlockSize.
2. Process block in smaller chunks, say 4 MB. Scan the index and for each position that belongs to current chunk, save position in the index to another index:
Code:
for (i = 0; i < 64*MB; i++)
    if (index[i] / (4*MB) == current_chunk) {
        pos = index[i] % (4*MB);
        another_index[pos] = i;
    }
Now another_index stores the initial LZ match search position for every entry in the current chunk.
3. Split the chunk into segments of 16 KB or so and compress each segment independently of the others. Alternatively, find lazy matches in 1 KB segments and then encode them in 16 KB ones.
It's important to mention that memory usage is limited to 4*BlockSize + O(ChunkSize). Even if we implement full solid compression, BlockSize = 2*DictSize, so memory usage is 8*DictSize + O(1), which is pretty good for a 100% parallel GPU implementation of LZ lazy compression.
High Performance and Scalable Radix Sorting
Modern GPU blog
Structured patterns for parallel computation
Scan (partial sums)
Quicksort
Fast Histograms Using Shared Atomics on Maxwell
Efficient sort() implementations:
- radix sort in CUB (CUDA only) - allows to specify the sort key as bitrange
- merge/radix sort in Bolt (C++ AMP and AMD OpenCL)
- radix sort in CLOGS (OpenCL, better optimized for NVidia cards) - allows to specify number of bits in the sort key (performance)
- merge sort in ModernGPU (CUDA only)
- merge/radix sort in Thrust (CUDA only, examples) - only in-place, supports values+keys model
RLE encoding and decoding
http://devblogs.nvidia.com/parallelf...research-cuda/ talks about another BWT, MTF and bzip2 implementation: http://cudpp.github.io/releases/2014...-released.html
A few more papers:
1. Parallel Lossless Data Compression on the GPU (Ritesh A Patel, Yao Zhang, Jason Mak, Andrew Davidson, John D. Owens)
2. GPU-Accelerated BWT Construction for Large Collection of Short Reads (Chi-Man Liu, Ruibang Luo, Tak-Wah Lam)
3. A massively parallel algorithm for constructing the BWT of large string sets (Jacopo Pantaleoni)
4. Fast Parallel Suffix Array on the GPU (Leyuan Wang, Sean Baxter, and John D. Owens)
Worth looking at SYCL as well.
https://www.codeplay.com/portal/introduction-to-sycl
OpenCL may well become a better option due to SPIR.
https://en.wikipedia.org/wiki/Standa...Representation
With SPIR being used for Vulkan, there is a lot of work going into it. With a good validation layer, it might lead to a good, solid approach.
It was also announced last November that AMD is going to offer CUDA support in the future.
http://www.amd.com/en-us/press-relea...2015nov16.aspx
Another option for compatibility between different compute languages is to use an intermediate language.
https://github.com/Celtoys/ComputeBridge
Finally, for some of the tasks it may be easier to just make better use of SIMD and threads on the CPU. ISPC is worth a look for that.
http://ispc.github.io/
A good example of using ISPC for texture compression.
https://github.com/GameTechDev/ISPCTextureCompressor
As each block in the block texture compression formats is independent, they are very parallel-friendly.
-Brian.
AMD has released their HIP compiler, which converts CUDA code into C++ code that will run on both Nvidia and AMD hardware.
http://gpuopen.com/hip-to-be-squared...-hip-tutorial/