
Thread: GPGPU

  1. #1
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,564
    Thanks
    773
    Thanked 687 Times in 372 Posts

    Index:

    C++ AMP
    CUDA
    OpenCL
    OpenACC

    BWT & ST[N] transformations
    Entropy coding
    Deduplication
    LZ exhaustive match finder
    LZ optimal parser
    LZ lazy parser with appropriate matchfinder

    Basic algorithms
    Last edited by Bulat Ziganshin; 12th May 2015 at 12:52.

  2. #2
    Programmer Bulat Ziganshin

    C++ AMP

    I've finally found the easiest way into GPGPU programming - C++ AMP, which is part of the Microsoft C++ compiler (free for Windows users). There are several ways to learn the basics of AMP:
    • read the MSDN intro [available in your native language]
    • read the MSDN Magazine articles #1 and #2 [also available in your native language]
    • read a series of blog posts indexed here
    • buy the 300-page book (I got the PDF)

    There is an MSDN blog dedicated mainly to AMP; it has a samples page, including a 14-line "Hello world", as well as Performance Guidance.

    Besides MSVC, AMP is also implemented in a clang fork. The AMP specification is open, so other implementations may follow.

    The MS AMP implementation works through DirectCompute 5.0, which is part of DirectX 11. The good part is that it can run on any DX11-compatible GPU, including AMD/NVidia discrete graphics, AMD APUs, and Intel GPUs starting with Ivy Bridge. Additionally, starting with Windows 7, there is a fallback CPU engine that takes advantage of multicore/SSE.

    The bad part is that, being compatible with a wide range of GPUs, AMP allows only the operations supported by all of them. This means that kernels (i.e. code chunks that run on the GPU) have pretty limited abilities - there is no recursion (it seems the compiler just recursively inlines all functions called from the kernel), no arbitrary pointers (although you can use computed indexes into arrays), no 8/16/64-bit ints and no dynamic parallelism - all of which are available in the latest CUDA.
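    To give a feel for what such a kernel looks like under these restrictions, here is a minimal C++ AMP sketch (my own illustration, not taken from the docs; function and variable names are made up): an element-wise vector add using array_view and a restrict(amp) lambda.
    Code:
    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Element-wise add of two vectors on the default accelerator.
    void vector_add(const std::vector<int>& a, const std::vector<int>& b, std::vector<int>& c)
    {
        const int n = static_cast<int>(a.size());
        array_view<const int, 1> av(n, a);
        array_view<const int, 1> bv(n, b);
        array_view<int, 1> cv(n, c);
        cv.discard_data();                               // no need to copy c's old contents to the GPU
        parallel_for_each(cv.extent, [=](index<1> i) restrict(amp)
        {
            cv[i] = av[i] + bv[i];                       // kernel body: only AMP-legal operations allowed here
        });
        cv.synchronize();                                // copy the result back into c
    }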

    The great side of AMP is the two libraries implementing STL-like operations over arrays stored on the GPU: the C++ AMP Algorithms Library (see function list) and AMD Bolt (see function list). Bolt runs on Linux/Windows over AMP, OpenCL and Intel TBB (a CPU-based parallelism library). Both libraries implement sort, scan, copy_if and other algorithms that may be used for STn, dedup, LZ and Huffman compression.

    Sorting seems to be the key to any data searching on the GPU, so I should mention that AmpAlgo includes radix_sort, while Bolt has stable_sort and stable_sort_by_key.
    Last edited by Bulat Ziganshin; 15th May 2015 at 22:19.

  3. Thanks (2):

    Cyan (8th May 2015), Paul W. (9th May 2015)

  4. #3
    Programmer Bulat Ziganshin

    CUDA, OpenCL, OpenACC...

    CUDA 7 (introduction) is the latest version, supporting C++11 inside kernels and Thrust 1.8. Thrust is an STL-like library that can also be compiled to a CPU backend by GCC with OpenMP or Intel TBB. It supports stable sort and stable sort by key (with speeds up to 1700 MB/s according to a blog post). Compared to MS AMP, CUDA supports Linux and every feature of the latest NVidia GPUs, but it is of course limited to those GPUs. Also, it has its own C++ compiler, which means you lose all the GCC/MSVC-specific features. I believe you can put the GPU-specific code into a DLL, however.
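    As a taste of the Thrust API, here is a hedged sketch (my own names) of the stable key-value sort mentioned above, written as ordinary host code compiled by nvcc:
    Code:
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdint>

    // Sort `positions` by their `contexts` on the GPU, keeping equal keys in input order -
    // the primitive that ST[N] transforms and suffix-style matchfinders are built on.
    void sort_by_context(thrust::device_vector<uint64_t>& contexts,
                         thrust::device_vector<uint32_t>& positions)
    {
        thrust::stable_sort_by_key(contexts.begin(), contexts.end(), positions.begin());
    }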

    Thrust: site, introduction, tutorial, documentation, examples, releases
    CUB and Modern GPU are two more data processing libs.


    OpenCL: AMD Bolt (AMD GPUs only!), CLOGS, VexCL, Boost.Compute (docs)


    Intel C++ can offload to Intel GPUs: https://software.intel.com/node/522442
    Last edited by Bulat Ziganshin; 24th May 2015 at 20:23.

  5. #4
    Programmer Bulat Ziganshin

    BWT & ST[N] transformations

    The only GPU BWT implementation I know of is part of NVBIO. Edit: an article about CUDPP 2.2 - another CUDA BWT/MTF/bzip2 implementation.

    The ST[N] transformation is the easiest one to implement once we have sorting routines. Just save the current char + context (i.e. the N preceding or following chars) in each array element and stable-sort by the context, then extract the first char from every record of the sorted array. This algorithm requires 2*(N+1)*BlockSize bytes of memory, since stable sorting can't be performed in-place. Alternatively, we may include a 4-byte position in each record and perform in-place unstable sorting (by context+position), reducing memory requirements to (N+5)*BlockSize.
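    Here is a minimal CPU sketch of this idea for N=4 (my own illustration; it follows the position-based variant and wraps contexts around the block boundary for simplicity):
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // ST4 forward transform: record (context, position) per byte, sort by context+position,
    // then emit the bytes in sorted order. Assumes in.size() >= 4.
    std::vector<uint8_t> st4_transform(const std::vector<uint8_t>& in)
    {
        const size_t n = in.size();
        std::vector<std::pair<uint32_t, uint32_t>> recs(n);   // (4-byte context, position)
        for (size_t i = 0; i < n; ++i)
        {
            uint32_t ctx = 0;
            for (size_t k = 1; k <= 4; ++k)                   // the 4 preceding bytes form the context
                ctx = (ctx << 8) | in[(i + n - k) % n];
            recs[i] = { ctx, static_cast<uint32_t>(i) };
        }
        std::sort(recs.begin(), recs.end());                  // unstable sort by (context, position)
        std::vector<uint8_t> out(n);
        for (size_t i = 0; i < n; ++i)
            out[i] = in[recs[i].second];                      // the "current char" of each sorted record
        return out;
    }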

    BSC implements the first method, and with N=7 or 8 it needs VRAM ~= 20*BlockSize. We can implement ST compression on the GPU with larger block sizes by applying two data-splitting techniques:

    Vertical splitting: split data by original position. This way we split the original data into subblocks, sort each of them using either 2*(N+1)*SubBlockSize or (N+5)*SubBlockSize memory and, once all subblocks are sorted, merge them using the saved N-byte context plus the subblock number as a key. The (hierarchical binary) merging itself may be performed at speeds around 1 GB/s and, taking into account the following compression at about 40 MB/s, it's almost free. The memory requirements are a bit more than (N+1)*BlockSize.
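    A hedged sketch of the merge step (my own names; the context is truncated to 8 bytes here, i.e. N <= 8): since std::merge is stable and prefers the first range on ties, merging by context alone already reproduces the context+subblock order that a single stable sort of the whole block would give.
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Rec { uint64_t context; uint8_t ch; };             // saved N-byte context + current char

    // Merge two sorted subblocks; subblock `a` precedes `b` in the original block.
    std::vector<Rec> merge_subblocks(const std::vector<Rec>& a, const std::vector<Rec>& b)
    {
        std::vector<Rec> out(a.size() + b.size());
        std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin(),
                   [](const Rec& x, const Rec& y) { return x.context < y.context; });
        return out;
    }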

    Horizontal splitting: split data by output position, i.e. by the context:
    1. Calculate freqs for byte pairs

    2. Split those byte pairs into G groups with about the same total frequency, e.g. 00..'FA' - 16123 pairs total, 'FB'..'LN' - 17333 pairs, 'LO'..'TK' - 15777 pairs, 'TL'..FFFF - 18222 pairs (see the sketch after this list)

    3. Perform G scans of the input data, each time sorting only the bytes whose context falls in the current group; this requires additional memory of (N+5)*MaxGroupSize. The output of each scan can be immediately compressed since it's the next chunk of output data, and compression may be overlapped with sorting of the next group.
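    A minimal sketch of step 2 (my own names): split the 65536 two-byte contexts into G ranges of roughly equal total frequency, so each sorting pass handles a similar amount of data.
    Code:
    #include <array>
    #include <cstdint>
    #include <vector>

    // Returns G+1 boundaries; group g covers contexts [bounds[g], bounds[g+1]).
    std::vector<uint32_t> split_contexts(const std::array<uint64_t, 65536>& freq, int G)
    {
        uint64_t total = 0;
        for (uint64_t f : freq) total += f;

        std::vector<uint32_t> bounds = { 0 };
        uint64_t acc = 0, target = (total + G - 1) / G;       // aim for ~total/G per group
        for (uint32_t ctx = 0; ctx < 65536 && (int)bounds.size() < G; ++ctx)
        {
            acc += freq[ctx];
            if (acc >= target) { bounds.push_back(ctx + 1); acc = 0; }
        }
        bounds.push_back(65536);                              // close the last group
        return bounds;
    }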



    Now let's see how this may be used in practice. There are lots of possible combinations of vertical+horizontal splitting on the CPU/GPU side, but the most practical ones are:

    1. Vertical splitting into blocks compressible on the GPU. As mentioned above, merging the sorted blocks back is so fast that we can sort arbitrary-sized blocks using the GPU at virtually the same speed as 50 MB blocks. RAM usage is (N+1)*BlockSize, though. But since the merging and the following compression stage access memory sequentially, we can use the paging file or AWE memory without much speed loss. This also makes the approach practical for CPU-only sorting of arbitrary blocks with limited (randomly accessed) memory usage.

    2. Horizontal splitting inside the GPU. This way we can sort blocks of size up to ~1/2 of VRAM. Sorted groups are immediately downloaded to CPU RAM and compressed in parallel with sorting of the remaining groups on the GPU. Horizontal splitting on the GPU should be very fast, both due to the high bandwidth of VRAM and the abundance of compute resources, so again I expect it to be almost as fast as whole-block sorting. The most exciting side of this method is that it doesn't need any CPU RAM at all, except for small I/O and compression buffers.

    3. Horizontal splitting through the GPU. We can go further and implement horizontal splitting on the GPU for an arbitrary input block size. This way, we stream the entire input block G times through the GPU; each time the GPU filters only the entries that belong to the current group and then sorts them using the entire VRAM. So it requires CPU RAM only for the original input block; the GPU-produced output can be immediately compressed and sent to the outstream.

    4. CPU-only sorting with horizontal splitting allows limiting memory usage to 1.5-2 * BlockSize. For example, by splitting into 16 groups we may limit ST6 memory usage to BlockSize (for the original data) + (N-2)*BlockSize/16 = 1.25*BlockSize, or 1.5*BlockSize if we want to compress and sort the next group in parallel. (Unlike GPU sorting, on the CPU we can use a more memory-efficient algorithm with small changes to support horizontal splitting.) And since the original data are accessed sequentially, we may offload them to the pagefile/AWE memory or just reread the input data multiple times.
    Last edited by Bulat Ziganshin; 2nd January 2016 at 05:18.

  6. #5
    Programmer Bulat Ziganshin

    Entropy coding

    Huffman coding can be done in this way:

    1. Count the frequencies of the input data (not necessarily plain chars; it may be chars+matches with a complex formula for the frequency bin). The attached histogram.exe manages 13 GiB/s on a GF560Ti. Histograms are also implemented by CUB.

    2. Build Huffman table(s) from the freqs, including:
    2.1 Sort symbols by freq using a segmented sort, or just any sort whose steps don't depend on the data
    2.2 Build the Huffman tree (a sequential process requiring simultaneous access to only 3 words per table, i.e. 3*128 bytes if cache lines are 128 bytes long, so we can process ~1000 tables simultaneously)
    2.3 Assign codes to the table entries (a sequential process requiring simultaneous access to only 2-3 words per table - the current node and either its parent or its children)

    3. Split data into chunks of 128..1024 elements and compute how many bits each chunk will use

    4. Compute partial sums of those bit sizes to determine the position where output for each chunk should start (see the sketch after this list)

    5. Encode the data using one thread per chunk. Use special handling for the first (partial) word of output. One Huffman table is shared between many chunks, so it costs at most 16 bytes/chunk. The encoding process is sequential, requiring simultaneous access to only 2 words - the current source and the encoded data.
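    A CPU sketch of steps 3-4 (my own names): per-chunk bit counts followed by an exclusive prefix sum give every encoder thread its starting bit offset in the output. On the GPU the prefix sum would be a parallel scan, of course.
    Code:
    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <vector>

    // offsets[c] = starting bit of chunk c's output; offsets.back() = total bitsize.
    std::vector<uint64_t> chunk_bit_offsets(const std::vector<uint8_t>& data,
                                            const std::array<uint8_t, 256>& code_len,  // Huffman code length per symbol
                                            size_t chunk = 256)
    {
        const size_t chunks = (data.size() + chunk - 1) / chunk;
        std::vector<uint64_t> offsets(chunks + 1, 0);
        for (size_t c = 0; c < chunks; ++c)
        {
            uint64_t bits = 0;
            const size_t end = std::min(data.size(), (c + 1) * chunk);
            for (size_t i = c * chunk; i < end; ++i)
                bits += code_len[data[i]];                    // bits this chunk will occupy
            offsets[c + 1] = offsets[c] + bits;               // exclusive prefix sum
        }
        return offsets;
    }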

    CUDA NPP also contains functions for histogram building and (JPEG) Huffman decoding.

    ----------------------------------------------------------------------------------------------------------------

    ANS/arithmetic coding may split the data into blocks of 16 KB or so and process each block sequentially - this still allows running thousands of threads, but we may run out of local memory. Therefore, computation-based approaches should be preferred to table-based ones.
    Attached Files
    Last edited by Bulat Ziganshin; 20th May 2015 at 10:09.

  7. #6
    Programmer Bulat Ziganshin

    Deduplication

    1. The SREP -m1 chunking algorithm (chunking by CRC of the last 48 bytes) can run at speeds around 10 GB/s even on an i7-class CPU, so a GPU implementation doesn't make much sense. Nevertheless, it's a good start.

    So, one way is to compute a quick checksum of the last 32..64 bytes at every byte position (which requires 16..32 32-bit ALU operations) and then mark each position where this checksum has, say, zeroes in the left-most 12 bits. Since we want to use no more than one byte (and even better only one bit) per mark, each thread should process at least 4 (or 32) positions, because a 32-bit word is the smallest memory chunk that can be accessed.

    Another way is to use a rolling hash. First we need to hash the preceding 32..64 bytes (which requires 128..256 ALU operations - mask+shift+mul+add per byte, since every byte has to be processed individually). Then the thread may process several hundred bytes using ~10 operations per byte (mask+shift+mul+add for the added byte, mask+shift+mul+sub for the removed byte, then check the hashsum and conditionally set the corresponding bit in the flags word). Once per 32 input bytes processed, we store the 32-bit flags word we have built to memory.

    This means ~10-15 ALU operations per input byte, with resulting speeds around 100 GB/s (I suppose a GPU can perform ~10^12 operations/second) - much faster than a CPU, but pretty useless unless we can run the further compression stages on the GPU.
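    A CPU sketch of the rolling-hash variant (my own illustration; the multiplier and the 12-bit threshold, i.e. ~4 KB average chunks, are assumptions):
    Code:
    #include <cstdint>
    #include <vector>

    // Polynomial rolling hash over a 48-byte window; a position is a chunk boundary
    // when the left-most 12 bits of the hash are zero.
    std::vector<size_t> find_boundaries(const std::vector<uint8_t>& data)
    {
        const size_t W = 48;
        const uint32_t M = 0x9E3779B1u;                       // odd multiplier (assumption)
        uint32_t Mw = 1;
        for (size_t i = 0; i < W; ++i) Mw *= M;               // M^W, to drop the oldest byte in O(1)

        std::vector<size_t> boundaries;
        if (data.size() < W) return boundaries;

        uint32_t h = 0;
        for (size_t i = 0; i < W; ++i) h = h * M + data[i];   // warm-up over the first window
        for (size_t i = W; ; ++i)
        {
            if ((h >> 20) == 0) boundaries.push_back(i);      // top 12 bits zero => boundary
            if (i == data.size()) break;
            h = h * M + data[i] - Mw * data[i - W];           // roll: add the new byte, remove the old one
        }
        return boundaries;
    }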


    2. The ZPAQ chunking algorithm is also doable, but it needs a much larger warm-up stage in order to properly fill the o1[] array - I used 8 KB in the SREP -m2 implementation. So, in order to limit the overhead, each thread should process, say, 32 KB of useful data (+8 KB for warm-up), and to create 2048 threads we need to process 64 MB of data.


    3. REP algo...


    4. Main SREP algo...
    Last edited by Bulat Ziganshin; 9th May 2015 at 03:49.

  8. #7
    Programmer Bulat Ziganshin

    LZ exhaustive match finder

    ...
    Last edited by Bulat Ziganshin; 9th May 2015 at 01:52.

  9. #8
    Programmer Bulat Ziganshin

    LZ optimal parser

    Now we have a match table - for every position in some range it includes the nearest match of length 2+, then the nearest match longer than the previous one, and so on. For parsing, we split the input data into segments of about 16 KB and run one thread per segment.

    1. Build an optimal LZSS parse - i.e. one that minimizes the number of matches and doesn't use any stats from the entropy coder (see the sketch after this list). This parsing should provide a compression ratio close to a carefully built lazy parser.
    2. Optionally, collect statistics and build the entropy coder model. Build the optimal parse again, now using the entropy coder stats (i.e. information about how many bits each char/match will require).
    3. Optionally, repeat the last step again.
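    A minimal sketch of step 1 (my own names): an optimal LZSS parse of one segment by dynamic programming over positions, minimizing the number of tokens (literals + matches) when every token has the same cost.
    Code:
    #include <cstdint>
    #include <vector>

    struct Match { uint32_t len; uint32_t dist; };            // one candidate from the match table

    // price[i] = minimal number of tokens covering the first i bytes of the segment;
    // parent[i] = where the last token started, so the parse is traced back from price[n].
    void optimal_lzss(const std::vector<std::vector<Match>>& matches,   // candidates per position
                      std::vector<uint32_t>& price, std::vector<uint32_t>& parent)
    {
        const size_t n = matches.size();
        const uint32_t INF = ~0u;
        price.assign(n + 1, INF);  parent.assign(n + 1, 0);
        price[0] = 0;
        for (size_t i = 0; i < n; ++i)
        {
            // literal: one token covering one byte
            if (price[i] + 1 < price[i + 1]) { price[i + 1] = price[i] + 1; parent[i + 1] = (uint32_t)i; }
            // match: one token covering m.len bytes
            for (const Match& m : matches[i])
                if (i + m.len <= n && price[i] + 1 < price[i + m.len])
                    { price[i + m.len] = price[i] + 1; parent[i + m.len] = (uint32_t)i; }
        }
    }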

    With the match length / fast bytes limited to 128, we need to cache only the last 128 elements of the optimal pricing table, where each entry needs at least 5 bytes (3 bytes for optimal_price plus 2 bytes for optimal_pos), 640 bytes/thread in total. We also need to keep the entropy coder table, at least the price for each entry. Each price needs 1-2 bytes, so with 300 (deflate) to 800 (tornado) entries, we need 300-1600 bytes of local memory for this table. Even more memory may be required for building the entropy coder model, but this operation is pretty fast, so we can live with lower parallelism there.

    The LZSS-optimum parser requires only 2 bytes for storing the previous position in the optimal-path table, so with fast_bytes=32 it's only 64 bytes/thread.
    Last edited by Bulat Ziganshin; 9th May 2015 at 01:51.

  10. #9
    Programmer Bulat Ziganshin

    LZ lazy parser with appropriate matchfinder

    1. Build the ST4 index of the block (say, 64 MB). Such an index contains all block positions stable-sorted by the 4 bytes they point to. Memory usage = 4*BlockSize.

    2. Process the block in smaller chunks, say 4 MB. Scan the index and, for each position that belongs to the current chunk, save its position in the index to another index:
    Code:
    for (size_t i = 0; i < BLOCK_SIZE; i++)               // BLOCK_SIZE = 64 MB
        if (index[i] / CHUNK_SIZE == current_chunk) {     // CHUNK_SIZE = 4 MB
            size_t pos = index[i] % CHUNK_SIZE;           // position inside the current chunk
            another_index[pos] = i;                       // where this position sits in the sorted ST4 index
        }
    Now another_index stores the initial LZ match search position for every entry of the current chunk.

    3. Split the chunk into segments of 16 KB or so and compress each segment independently of the others. Alternatively, find lazy matches in 1 KB segments and then encode them in 16 KB ones.

    It's important to mention that memory usage is limited to 4*BlockSize + O(ChunkSize). Even if we implement fully solid compression, BlockSize = 2*DictSize, so memory usage is 8*DictSize + O(1), which is pretty good for a 100% parallel GPU implementation of LZ lazy compression.
    Last edited by Bulat Ziganshin; 9th May 2015 at 02:30.

  11. #10
    Programmer Bulat Ziganshin

    Basic algorithms

    High Performance and Scalable Radix Sorting
    Modern GPU blog
    Structured patterns for parallel computation
    Scan (partial sums)
    Quicksort
    Fast Histograms Using Shared Atomics on Maxwell

    Efficient sort() implementations:


    RLE encoding and decoding
    Last edited by Bulat Ziganshin; 27th June 2015 at 14:45.

  13. #12
    Member - Yorkshire, England
    Worth looking at SYCL as well.
    https://www.codeplay.com/portal/introduction-to-sycl

    OpenCL may well become a better option due to SPIR.
    https://en.wikipedia.org/wiki/Standa...Representation
    With SPIR being used for Vulkan, there is a lot of work going into it. With a good validation layer, it might lead to a good, solid approach.

    It was also announced last November that AMD is going to offer CUDA support in the future.
    http://www.amd.com/en-us/press-relea...2015nov16.aspx

    Another option for compatibility between different compute languages is to use an intermediate language.
    https://github.com/Celtoys/ComputeBridge

    Finally, for some of the tasks it may be easier to make better use of SIMD and threads on the CPU. ISPC is worth a look for that.
    http://ispc.github.io/

    A good example of using ISPC for texture compression.
    https://github.com/GameTechDev/ISPCTextureCompressor
    As each block in the block-based texture compression formats is independent, it is very parallel-friendly.

    -Brian.

  14. Thanks (2):

    Bulat Ziganshin (4th January 2016), JamesB (7th April 2016)

  15. #13
    Member - Canada
    AMD has released their HIP compiler, which converts CUDA code into C++ code that will run on both NVidia and AMD hardware.

    http://gpuopen.com/hip-to-be-squared...-hip-tutorial/
