all i`ve seen are sdks and papers papers without working binaries 8-)
i mean there are password bruteforcing software, video encoding but no file compression related 8-(
all i`ve seen are sdks and papers papers without working binaries 8-)
i mean there are password bruteforcing software, video encoding but no file compression related 8-(
Last edited by necros; 25th September 2014 at 06:37.
a902cd23 (25th September 2014),Bulat Ziganshin (12th May 2015),necros (4th June 2016)
thanx 8-)
i might be wrong but i think the serial structure of most compression makes it hard to put into a massive multithreading model
Video encoding and brute forcing is easier cause it can divide up the data into multiple individuel data's to work on (hej thread 1 do the first 1024 brute force tries. thread 2 do the 1024. thread 1gazzilion do your next gazzilant part of the tries)
i know only one good open source project
in the field of compresson which use CUDA and this is BSC
http://libbsc.com/
gribok wrotes:
---
1. You need an NVIDIA GPU with compute capability 1.1 or higher.
2. You need 270.81 drivers at least.
3. You need Microsoft Visual C++ 2010 SP1 Redistributable Packages.
4. Currently CUDA acceleration is implemented only for forward ST5, ST6, ST7 and ST8 transforms.
Usage:
1. GPU compression uses 17x-19x GPU memory for single block. So use smaller block size.
2. GPU compression is designed for ultra fast compression,
so preprocessing like segmentation or reordering need to be disabled.
I recommend to use something like "-b4p -m7f" or "-b8p -m7f"
---
status: The latest stable version is 3.1.0, released 8 July 2012
- in my tests ST5 and ST6 give the best results
- ST5 and ST6 is implemented for GPU and for CPU too
- ST4 is fast , but for now only implemented for CPU
- it would be wonderful, if we would have a simplified source with only ST5 and/or ST6
which we could implement into a archive-format like 7Zip or similar
- i think the performance of the ST5/ST6/ST7 implementation on CUDA will even better,
if we can use CUDA compute capability 2.0 and the actual CUDA version 6.5-14
another interesting project which uses parallel compression , but not GPU/CUDA is plzip
http://download.savannah.gnu.org/releases/lzip/plzip/
the actual version 1.12 for now has no windows binary
best regards
Unfortunately I've never managed to get the GPU part of libbsc to build, normally suffering errors in the back40 code. It's also unclear how you're meant to do this given no documentation, but even this on the command line fails:
Code:nvcc -arch sm_20 -g -DLIBBSC_SORT_TRANSFORM_SUPPORT -DLIBBSC_CUDA_SUPPORT -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -O3 -DNDEBUG -c libbsc/st/st.cu libbsc/st/b40c/radix_sort/../radix_sort/problem_instance.cuh(331): error: expected a ")" detected during: instantiation of "cudaError_t b40c::radix_sort::Enactor::IteratePasses<ProblemInstance, PROBLEM_SIZE, BITS_REMAINING, CURRENT_BIT, CURRENT_PASS>::DispatchPass<TUNE_ARCH>(ProblemInstance &, b40c::radix_sort::Enactor &) [with ProblemInstance=b40c::radix_sort::ProblemInstance<b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>, int>, PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, CURRENT_PASS=0, TUNE_ARCH=200]" libbsc/st/b40c/radix_sort/enactor.cuh(246): here instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort<PROBLEM_SIZE,BITS_REMAINING,CURRENT_BIT,DoubleBuffer>(DoubleBuffer &, int, cudaStream_t, int, bool) [with PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>]" libbsc/st/st.cu(237): here libbsc/st/b40c/radix_sort/../radix_sort/problem_instance.cuh(331): warning: variable "spine_elements" is used before its value is set detected during: instantiation of "cudaError_t b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DispatchPass(int, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::UpsweepKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::SpineKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DownsweepKernelProps &, bool, b40c::radix_sort::DynamicSmemConfig, int) [with DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>, _SizeT=int]" libbsc/st/b40c/radix_sort/enactor.cuh(145): here instantiation of "cudaError_t b40c::radix_sort::Enactor::IteratePasses<ProblemInstance, PROBLEM_SIZE, BITS_REMAINING, CURRENT_BIT, CURRENT_PASS>::DispatchPass<TUNE_ARCH>(ProblemInstance &, b40c::radix_sort::Enactor &) [with ProblemInstance=b40c::radix_sort::ProblemInstance<b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>, int>, PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, CURRENT_PASS=0, TUNE_ARCH=200]" libbsc/st/b40c/radix_sort/enactor.cuh(246): here instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort<PROBLEM_SIZE,BITS_REMAINING,CURRENT_BIT,DoubleBuffer>(DoubleBuffer &, int, cudaStream_t, int, bool) [with PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>]" libbsc/st/st.cu(237): here libbsc/st/b40c/radix_sort/../radix_sort/problem_instance.cuh(331): warning: variable "spine_elements" is used before its value is set detected during: instantiation of "cudaError_t b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DispatchPass(int, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::UpsweepKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::SpineKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DownsweepKernelProps &, bool, b40c::radix_sort::DynamicSmemConfig, int) [with DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, unsigned char>, _SizeT=int]" libbsc/st/b40c/radix_sort/enactor.cuh(145): here instantiation of "cudaError_t b40c::radix_sort::Enactor::IteratePasses<ProblemInstance, PROBLEM_SIZE, BITS_REMAINING, CURRENT_BIT, CURRENT_PASS>::DispatchPass<TUNE_ARCH>(ProblemInstance &, b40c::radix_sort::Enactor &) [with ProblemInstance=b40c::radix_sort::ProblemInstance<b40c::util::DoubleBuffer<unsigned long long, unsigned char>, int>, PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=64, CURRENT_BIT=0, CURRENT_PASS=0, TUNE_ARCH=200]" libbsc/st/b40c/radix_sort/enactor.cuh(246): here instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort<PROBLEM_SIZE,BITS_REMAINING,CURRENT_BIT,DoubleBuffer>(DoubleBuffer &, int, cudaStream_t, int, bool) [with PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=64, CURRENT_BIT=0, DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, unsigned char>]" libbsc/st/b40c/radix_sort/enactor.cuh(289): here instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort(DoubleBuffer &, int, cudaStream_t, int, bool) [with DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, unsigned char>]" libbsc/st/st.cu(315): here 1 error detected in the compilation of "/tmp/tmpxft_00000ccc_00000000-4_st.cpp1.ii".
gribok wrotes:
---
3. You need Microsoft Visual C++ 2010 SP1 Redistributable Packages.
I was building under Linux. For windows I could just use the precompiled binaries anyway.
it looks like there is no linux cuda support. at least author never thought about it
There actually is, it is working. However, I personally prefer OpenCL for that. However, GPU algorithms only work well if they are massively parallel, and here comes the actual problem into the play: Data compression, especially data modelling is inherently a serial process - you need to adapt your dictionary, need to take history into account. Of course, one can disregard such knowledge and split the data into independent blocks. However, that also means that you pay a price in terms of compression performance (you loose the co-information between the data, i.e. were you split it off).
Data synchronization within a GPU only works within the same compute unit (i.e. within the same work-group in the language of OpenCL), i.e. there are no generic synchronization primitives.
Of course, if you look into the works of Ana, specifically the parallel Huffman, you suffer from a two-pass approach where you first need to store bit-offsets along with a static(!) Huffman output, then a second pass plus atomic operations to insert the data into the right spot. I haven't made an experiment on specifically *this* approach, but I made a quite similar approach (static Huffman) for image compression, and the "serial fit-together" part of the algorithm was clearly the dominating (aka "slow") part of the algorithm.
Data compression means, almost by definition of the problem, that you either compress fixed-length blocks to varying-length data (Huffman) or varying-length blocks to fixed-length data (Tunstall codes), both of which requires some kind of synchronization between the blocks.
The other option is lossy compression where you could realize fixed-length to fixed-length codes (lossy, of course) where you again have a coding loss due to the lack of proper rate-distribution (this argument can also be made more precise, in the sense of fixed-length and variable length quantization, estimates on the loss are available). In reality it means that it simply doesn't perform well.
Or, and that is the last option, you parallelize only a certain part of the codec, i.e. wavelet or DCT, probably Huffman per block. Whenever I tried that, the overhead of moving the data between GPU and CPU is the dominating factor, and not the compression itself.
Either way, there is no silver bullet.
I disagree. Let's take for example LZ77 based codecs. Matt Mahoney is already using block sorting to find matches and block sorting is already done using GPGPU in libbsc. What's missing is fitting it together (which is a tough task, but I don't see any blocker).Data compression, especially data modelling is inherently a serial process - you need to adapt your dictionary, need to take history into account. Of course, one can disregard such knowledge and split the data into independent blocks. However, that also means that you pay a price in terms of compression performance (you loose the co-information between the data, i.e. were you split it off).
There are global atomics, so you can synchronize data flow. For synchronization of control flow, you're left with synchronization within command queues. I don't know what OpenCL 2.0 brings in regards of synchronization facilities but probably it will allow much greater set of problems to be solved efficiently.Data synchronization within a GPU only works within the same compute unit (i.e. within the same work-group in the language of OpenCL), i.e. there are no generic synchronization primitives.
I haven't really well studied Ana's paper, but for Huffman coding I think we can use a scheme which is efficient in regards to memory bandwidth usage.Of course, if you look into the works of Ana, specifically the parallel Huffman, you suffer from a two-pass approach where you first need to store bit-offsets along with a static(!) Huffman output, then a second pass plus atomic operations to insert the data into the right spot. I haven't made an experiment on specifically *this* approach, but I made a quite similar approach (static Huffman) for image compression, and the "serial fit-together" part of the algorithm was clearly the dominating (aka "slow") part of the algorithm.
Data compression means, almost by definition of the problem, that you either compress fixed-length blocks to varying-length data (Huffman) or varying-length blocks to fixed-length data (Tunstall codes), both of which requires some kind of synchronization between the blocks.
We can:
- split uncompressed input into fixed size blocks,
- compute histogram over whole input - there are many efficient solutions to do that, it can basically run at the speed of VRAM,
- compute Huffman tree - I can't predict how long it would take, but if we're compressing hundreds of megabytes then if shouldn't matter much,
- compute offsets of compressed blocks and do prefix sum of them - it will again require scanning the whole input,
- compress data - again it could be done at the speed of VRAM, but now we have not only to read but also to write,
Consequently, it seems possible to do Huffman coding at 1/4 of the speed of VRAM.
Concurrent writes to the same buffer require synchronization, but concurrent reads don't. We can implement Huffman where each thread writes to non-overlapping region of memory. To achieve this we need the offsets of compressed blocks (I've included them in scheme above) and resolve the problem of Huffman codes crossing the block boundaries by additional loops that hoists a few symbols from previous block. This way we don't need to synchronize writes, but we need additional loops for special handling of the beginning of blocks. We also need to be careful not to write to next block during encoding the last symbol.
Update:
Computing offsets of blocks can be done faster. While computing global histogram, we can store a histogram per block (but that would require having bigger blocks to provide a gain) so then we can compute block offsets from the histograms and code lengths alone, reducing the memory bandwidth requirement for that step. That would bring the total speed closer to 1/3 of VRAM speed.
I can think of lossy texture compression on GPGPU which does multiple passes to select the best local methods in every pixel block. Probably that's already done with the DXTC compression schemes.The other option is lossy compression where you could realize fixed-length to fixed-length codes (lossy, of course) where you again have a coding loss due to the lack of proper rate-distribution (this argument can also be made more precise, in the sense of fixed-length and variable length quantization, estimates on the loss are available). In reality it means that it simply doesn't perform well.
nVidia Maxwell architecture achieves great efficienty with narrow memory interfaces thanks to novel techniques of lossless image buffers compression. So various forms of image compression on GPGPUs are doable.
You can interleave transfer and computation, by dividing work into groups and processing them asynchronously. This way you would have computation and transfer in both ways happening at the same time. PCI-Express is full duplex, meaning that uplink and downlink are independent and can be simultaneously saturated.Or, and that is the last option, you parallelize only a certain part of the codec, i.e. wavelet or DCT, probably Huffman per block. Whenever I tried that, the overhead of moving the data between GPU and CPU is the dominating factor, and not the compression itself.
There's also HSA standard and devices where CPU and GPGPU fully share global memory.
Last edited by Piotr Tarsa; 3rd October 2014 at 20:43.
Linux itself has cuda support and NVidia release "nvcc" compiler. It just seems that for whatever reason it doesn't like the back40 component so well, giving a very cryptic missing bracket error (which isn't obviously true, but with templates who knows!). The latest back40 code seems to have changed radically and suffers other build problems; besides the back40computing project itself has abandoned it and switched to a newer library (http://nvlabs.github.io/cub/).
Anyway this thread isn't really about BSC, but about practical code examples of doing data compression in a GPU. My post was simply to refute the claim that BSC is suitable as "a simple gpgpu packer". We still need simple examples.
Probably the best starting point would be the general libraries ("cub" above), or for the more advanced, pulling apart the nvBWT component of nvBio (http://nvlabs.github.io/nvbio/nvbwt_page.html). The latter is specifically designed around BWT of DNA sequences for alignment purposes (eg nvBowtie), but the block sorting components should be transferable.