http://arxiv.org/pdf/1107.1525v1
ca. 2x faster than the CPU with OpenMP, and 4-6x faster than single-threaded.
Using:
NVIDIA GeForce GTX 285 with 240 cores at 1.5 GHz, Core i7 965 with four cores at 3.2 GHz
Why such poor scaling on the multicore CPU? 4 cores / 8 threads and only a little above 200% scaling on a quad-core i7...
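A back-of-the-envelope Amdahl check (my own arithmetic, not from the paper): speedup = 1 / (s + (1-s)/N), so about 2.2x on N=4 cores implies a serial (or bandwidth-bound) fraction s of roughly 27%. That's a lot for something that should split cleanly into independent blocks.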
Nice paper. It gives some interesting and insightful explanations of the seemingly poor GPU scaling:
Despite the GPU having 60 times the number of cores as our CPU, the differences in throughput between the GPU encoder and the OpenMP encoder are not dramatic. This paradox can be largely resolved by recalling that the architecture of the GPU was developed for the SIMD, single instruction multiple data, programming model while our CPU was developed with MIMD, multiple instruction multiple data, in mind. [...] The control unit broadcasts an instruction to all the cores, and optimal performance can only be achieved when every core can execute it. If, for example, the instruction is a branching statement, then there is a likelihood that some cores will not follow the jump, and in this case, some cores must remain inactive until they either themselves satisfy the branching instruction or control passes beyond the branching sections of the code. Therefore, in the worst case, when only one core can satisfy the jump and the other seven are left idle, our GPU behaves more like a crippled 30 core shared memory MIMD machine with a slow clock speed and no automatic memory caching.
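To make that concrete, here's a minimal CUDA sketch of branch divergence (my own illustration, not the paper's code; slow_path/fast_path are made-up stand-ins):

```cuda
#include <cuda_runtime.h>

// Purely illustrative "expensive" and "cheap" paths (not from the paper).
__device__ int slow_path(unsigned char c) {
    int x = c;
    for (int k = 0; k < 1000; ++k) x = x * 31 + k;   // busy work
    return x;
}
__device__ int fast_path(unsigned char c) { return c + 1; }

// All threads of a warp share one instruction stream. When lanes disagree
// on a data-dependent branch, the hardware runs each side in turn with the
// other lanes masked off, so the effective SIMD width collapses -- exactly
// the "crippled MIMD machine" effect the paper describes.
__global__ void divergent(const unsigned char *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (in[i] & 1) ? slow_path(in[i]) : fast_path(in[i]);
}

int main() {
    const int n = 1 << 20;
    unsigned char *h_in = new unsigned char[n];
    for (int i = 0; i < n; ++i) h_in[i] = (unsigned char)i;  // adjacent lanes alternate branches

    unsigned char *d_in; int *d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);
    divergent<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out); delete[] h_in;
    return 0;
}
```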
This is mysterious indeed. The decoding scales well (~40 MB/s => ~150 MB/s with OpenMP), so I guess they're having some problem with the encoder, which is strange because by splitting the data into blocks the expected ~300% speedup should be no problem (see the skeleton below). Another strange thing is that encoding is much faster than decoding - is this common for Huffman?
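For illustration, a minimal OpenMP skeleton of such a block-split encoder (my own sketch; huff_encode_block is a hypothetical stand-in, here just a copy):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>
// compile with -fopenmp

// Stand-in for a real per-block Huffman encoder (hypothetical name);
// the real thing would build a code table for the block and emit bits.
// Returns the compressed size.
static size_t huff_encode_block(const unsigned char *src, size_t len,
                                unsigned char *dst) {
    std::memcpy(dst, src, len);   // placeholder only
    return len;
}

// Each block carries its own table, so iterations share no state and
// near-linear speedup should be expected on 4 cores -- unless a serial
// stage or memory bandwidth gets in the way.
void encode_parallel(const unsigned char *src, size_t total, size_t block,
                     std::vector<std::vector<unsigned char>> &out) {
    size_t nblocks = (total + block - 1) / block;
    out.resize(nblocks);
    #pragma omp parallel for schedule(dynamic)
    for (long b = 0; b < (long)nblocks; ++b) {
        size_t off = (size_t)b * block;
        size_t len = std::min(block, total - off);
        out[b].resize(len + len / 8 + 64);  // crude worst-case bound (assumption)
        out[b].resize(huff_encode_block(src + off, len, out[b].data()));
    }
}
```

If they only see ~2.2x with a structure like this, I'd guess the bottleneck is a serial stage (histogramming, I/O) or memory bandwidth rather than the encoding loop itself.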
It's common for Huffman Decoding to be slower than Encoding, but it does not need to be *that* dramatic.
I would say that beyond a 20% speed difference, there is certainly some missing optimisation.
I'm not sure the CPU implementation they compare the GPGPU version against is properly optimized.
For reference:
http://fastcompression.blogspot.com/...py-coders.html
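The usual optimisation that closes most of that gap is decoding several bits per step through a lookup table instead of walking the tree bit by bit. A minimal sketch, assuming a flat table and codes no longer than PEEK_BITS (layout and names are my own, not the paper's or the linked post's):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: the table is indexed by the next PEEK_BITS bits of
// the stream and stores the decoded symbol plus the code's true length, so
// one lookup replaces a bit-by-bit tree walk. Assumes no code is longer
// than PEEK_BITS (longer codes would need a fallback path, omitted here).
constexpr int PEEK_BITS = 12;
struct Entry { uint16_t symbol; uint8_t length; };

size_t huff_decode(const Entry table[1 << PEEK_BITS],
                   const uint8_t *in, size_t in_len,
                   uint8_t *out, size_t out_cap) {
    uint64_t acc = 0;   // bit accumulator, codes MSB-first at the top
    int have = 0;       // valid bits currently in the accumulator
    size_t pos = 0, n = 0;

    while (n < out_cap) {
        while (have <= 56 && pos < in_len) {    // refill one byte at a time
            acc |= (uint64_t)in[pos++] << (56 - have);
            have += 8;
        }
        if (have < PEEK_BITS) break;            // end of stream (sketch)
        Entry e = table[acc >> (64 - PEEK_BITS)];
        out[n++] = (uint8_t)e.symbol;
        acc <<= e.length;                       // consume exactly one code
        have -= e.length;
    }
    return n;
}
```

With 12-bit lookups the inner loop is one table access plus two shifts per symbol, which is in the same ballpark as the encoder's per-symbol table lookup - presumably the kind of optimisation the fast decoders in the link rely on.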