Page 1 of 3
Results 1 to 30 of 76

Thread: CUDA anyone?

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts

    CUDA anyone?

    Did any of you programmers ever play with the thought of using CUDA or ATI's equivalent as a main processor instead of the normal CPU? The more efficient compressors are painfully slow, yet no one seems to feel the need to harness the power of the graphics cards.

    I'm aware that this would require a multi-threading capable code, but some archivers are already past that hurdle. In conjunction with that I wonder if it would be possible to use the RAM of graphics cards instead or additionally to the system RAM since it's so much faster and otherwise a wasted resource during compression.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
yes, and SSE makes your internet faster, so why not use a graphics accelerator for compression? it's a cool idea, no one ever proposed this before!

  3. #3
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I'm not sure, are you being sarcastic?

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
no, no, believe me - you were the first to propose such a cool idea. i never heard it from 7z/fa users who believe in nvidia ads, nor from managers who need to add marketing hype to their products

  5. #5
    Member PAQer's Avatar
    Join Date
    Jan 2010
    Location
    Russia
    Posts
    22
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Last edited by PAQer; 9th March 2010 at 22:13.

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
please check the url

  7. #7
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    So you are being sarcastic :P

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by PAQer View Post
    http://www.hydrogenaudio.org/forums/...dpost&p=576803 answers the question. and as i see, core2quad still beats gpus even for flac
    Last edited by Bulat Ziganshin; 9th March 2010 at 23:55.

  9. #9
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Well, the programmer himself said that his hardware might not be powerful enough to provide a proper comparison. A GTS 250 is really not too powerful. I have a GTX 260 and an AMD 4200+. I'll test his app later to see which performs better on my rig.

Anyway, CUDA is compatible even with weak chipsets such as ION. It would be interesting to see how this flac encoder performs on an ION-powered subnotebook. Those usually have pretty weak CPUs, so a CUDA-based app might make sense. The same goes, of course, for all other apps. Even if the speed isn't that much better, it would keep the CPU free from intense tasks such as archiving.

Another idea: wouldn't a combined GPU+CPU mode also be possible?

In any case, it's not certain that this flac encoder is written in the most efficient way possible. GPU-based processing is used for all sorts of scientific simulation work, including Folding@home, SETI@home and complex financial calculations. I doubt these institutions would go through the hassle of converting their code to GPU processing if it weren't worth it. Last but not least, there's also the option to use CUDA only for certain tasks, to improve the speed of selected routines. No one says that an application has to run exclusively on a GPU.

    http://www.tomshardware.com/reviews/...pgpu,2299.html

  10. #10
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
It'd be interesting to explore this.

Compression is typically a sequential thing - you process the input in order; you can't process symbols out of order.

This means you need an ever-growing history, and you need RAM to store this history. More RAM means better compression.

RAM - the "Random Access" bit of the acronym - is something that GPUs don't have a lot of. They are streaming cores - they work best on a sequential window of data with strict coalesced read/write rules.

What kind of (de)compression is a streaming thing with good locality of reference? FLAC, and I guess lossy audio codecs like AMR and GSM (of course that's light work on a PC, and on phones and such there is dedicated functionality, typically in the wireless parts, but still). And is this (de)compression pushing the limits of the CPUs it is typically or potentially deployed to?

    I guess the first step is for people with dual-cores to explain how they utilise more than one core for classic CPU-based (de)compression?
    Last edited by willvarfar; 10th March 2010 at 00:27.
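On the closing question: the usual CPU-side answer is block-level parallelism - split the input into independent chunks and compress each on its own thread, trading a little ratio (no shared history across chunks) for near-linear speedup. A minimal sketch in C++; the run-length coder is a toy stand-in for a real codec, and all names here are illustrative:

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Toy run-length encoder standing in for a real per-block codec.
static std::string rle_compress(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size();) {
        size_t j = i;
        while (j < in.size() && in[j] == in[i] && j - i < 255) ++j;
        out.push_back(static_cast<char>(j - i));  // run length
        out.push_back(in[i]);                     // run byte
        i = j;
    }
    return out;
}

static std::string rle_decompress(const std::string& in) {
    std::string out;
    for (size_t i = 0; i + 1 < in.size(); i += 2)
        out.append(static_cast<unsigned char>(in[i]), in[i + 1]);
    return out;
}

// Compress independent blocks in parallel: each thread owns one block,
// so no history is shared and no locking is needed.
std::vector<std::string> parallel_compress(const std::string& data, size_t block) {
    size_t n = (data.size() + block - 1) / block;
    std::vector<std::string> parts(n);
    std::vector<std::thread> pool;
    for (size_t t = 0; t < n; ++t)
        pool.emplace_back([&, t] {
            parts[t] = rle_compress(data.substr(t * block, block));
        });
    for (auto& th : pool) th.join();
    return parts;
}
```

The cost is that each block restarts its model from scratch, which is exactly the compression-ratio hit the "ever-growing history" point above predicts.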

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by willvarfar View Post
    I guess the first step is for people with dual-cores to explain how they utilise more than one core for classic CPU-based (de)compression?
    why not ask cpus themselves?

  12. #12
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
I don't like CUDA. It's completely proprietary.

I prefer OpenCL. It's an open standard.

  13. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Bulat:
    Look here: http://golubev.com/about_cpu_and_gpu_2_en.htm
A single GTX 260 has only about 2 times more peak integer performance than a Core 2 Quad Q6600. On the other hand, AMD's 4800 is theoretically almost twice as fast as a GTX 260.

  14. #14
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
I just tested FlaCuda. My specs: AMD 4200+ X2, Palit GTX 260 SP216. I used a ramdisk as source and destination to avoid slowdowns from disk access.

    Wave size: 782.531.612 bytes

    compression 11, full gpu: 20.347s (compressed size 474.919.550)
    compression 11, slow gpu: 36.500s (compressed size 474.919.550)

    standard flac, compression 8: 162s (compressed size 483.374.967)

Full GPU means that FlaCuda ran the whole process on the graphics card. Slow GPU means that supposedly slow routines are performed by the CPU to speed things up even further. Apparently my CPU is too weak for this purpose, which is why the full GPU mode performed almost twice as fast. As for old-fashioned flac: I think the numbers speak for themselves. FlaCuda is 8 times faster while also compressing better!

This comparison is not really fair, though, since FlaCuda had to do more complex work than flac: FlaCuda supports 11 compression levels, flac only 8, which is why the settings differ in the test. However, it is a comparison at each encoder's maximum compression. I also tested FlaCuda at compression level 8 to get a more comparable result:

    full gpu: 16.730s
    slow gpu: 24.513s

    compressed size: 480.159.565 bytes

This time FlaCuda is 9.7 times faster while still delivering better compression.

  15. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Bulat:
    Single GTX 260 has only 2 times more peak integer performance than Core 2 Quad 6600
does peak integer performance have any relation to our algos?

  16. #16
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
Willvarfar's point is very valid here.
It is one thing to work on a contiguous stream of data; it is an entirely different one to go back into data history several megabytes before, if not much more.
So here, streaming processors don't work at all.

At least, this looks like a no-go for any LZ-based compression.

The situation could be different with PPM. Here we are dealing with context statistics, so each statistic could be considered a small amount of data, well suited to remain in a GPU stream processor. Would that help? It's not so sure: you still have to process data sequentially (input and output), so this is like jumping in order from one context to another. Maybe by limiting the number of contexts to the number of stream processors, to avoid loading times? This would badly hurt compression, though.
And this is without talking of further SSE processing (which I don't know enough about to comment on).

So there are serious reasons why this is not yet implemented.

  17. #17
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Ah I see. Thanks for the noob-friendly explanation

  18. #18
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Bulat:
Yes, because flac probably uses SSE heavily, so it gets as close to peak performance as FlaCuda does.

  19. #19
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Bulat:
Yes, because flac probably uses SSE heavily, so it gets as close to peak performance as FlaCuda does.
    lossy compression isn't our business at all

  20. #20
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
From what I understood, FlaCuda is lossless. If it weren't, it would be pretty pointless.

SSE2 adds many vector integer operations, e.g. 4 x 32-bit integer lanes per register. That's why it's used in some versions of the PAQ mixer.

  21. #21
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    FLAC=Free Lossless Audio Codec

  22. #22
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    It occurs to me that PPM and CM might possibly be CUDAisable. (OpenCLable, whatever; they are all the same abstraction.)

    They can be implemented as a hash-table.

    For each context to be mixed, it is hashed and looked up in the table to get the counter.

    CUDA is a NUMA architecture. Well, so is a CPU when you're counting cycles. But CUDA is very NUMA.

The GPU is made from cores; each core can execute a thread (well, that's a good approximation), threads are grouped into thread blocks, and there are several thread blocks on the chip. The memory follows this hierarchy.

There is fast memory (registers) in each core, slightly slower (but still blazingly fast) local memory per thread, slightly slower shared memory accessible by all threads in the block, and very slow global memory accessible by all threads in all thread blocks. The size of these memories also scales with the hierarchy - the shared memory might be kilobytes, the global memory gigabytes.

One major performance point is coalesced reads and writes for non-local memory. It's important that the threads in a block read and write to aligned memory locations relative to their peers, or else performance is heavily impacted.

    If you have a hash table, you can distribute that hash table - divide the keyspace between separate parts.

If you place a portion of the table into different threads' local memory - basically giving each thread responsibility for a small portion of the larger problem - then they can calculate their predictions in parallel. It is perhaps logical to take the division further by also dividing by context, so that different orders and other contexts get separate tables, with separate thread blocks responsible for mixing them.

    So for each symbol input, each thread makes a prediction by hashing the context and accessing its portion of the counters should the hash be within its range, then writing the prediction and updating the counters.

    Another small thread block can be in charge of reading these predictions and mixing them.

    It'd be the same for decompressing.

    You can, with care, actually use texture memory for large table storage, if you ensure you coalesce your reads and writes.

It would be natural for the counters to perhaps be integer, and the mixing to perhaps be float. Once you're on the GPU, floats don't have the cost they have in the CPU world, and the various fixed-point approaches to avoiding floating-point math on the CPU can be abandoned.

    Now the utility of this approach relies on there being enough thread-local memory and threads to make this approach viable. I don't know what normal figures are.
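The keyspace partitioning described above can be sketched on the CPU: each worker owns a fixed slice of the hash range, updates only its own shard of counters, and therefore needs no locks. A toy simulation with `std::thread` standing in for GPU thread blocks (all names and parameters are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker owns the hash values h where h % WORKERS == worker id, so
// counter updates never collide and need no synchronization - the property
// the per-thread-block partitioning above relies on.
const int WORKERS = 4;
const uint32_t TABLE = 1 << 12;

uint32_t ctx_hash(uint32_t context) { return (context * 2654435761u) % TABLE; }

void count_contexts(const std::vector<uint32_t>& contexts,
                    std::vector<uint32_t>& counters) {
    std::vector<std::thread> pool;
    for (int w = 0; w < WORKERS; ++w)
        pool.emplace_back([&, w] {
            for (uint32_t c : contexts) {        // every worker sees the input,
                uint32_t h = ctx_hash(c);
                if (h % WORKERS == uint32_t(w))  // but touches only its shard
                    ++counters[h];
            }
        });
    for (auto& t : pool) t.join();
}
```

The obvious inefficiency - every worker scans every symbol and most of them discard it - is the price of keeping the shards private, and it is one reason the scheme needs many cheap threads to pay off.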

  23. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
Quote Originally Posted by willvarfar View Post
It'd be the same for decompressing.
    no

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
1. I may be mistaken, and things could have changed since my 8800GT,
but when I played with it, I found that local/shared/global memory
all behave the same (even though I expected the same results as you).
There's a register called %clock, which is similar to the x86 TSC,
and I found that accessing any kind of memory takes ~100 clocks,
while arithmetic operations have reasonable timings, similar to x86.

The only memory with fast access was "const" memory, but it seems
that it's actually part of the code - I found that immediate constants
for GPU instructions are in fact stored there.
But const memory is not modifiable, as expected.

Anyway, I'd suggest enabling .ptx generation in CUDA
and checking there what kind of differences you get depending on
memory types etc.

2. There's only one CM for which it might make sense to apply such
threading - paq8. But even there you won't find that much continuous
calculation - there are about two random memory accesses per multiplication,
and afaik there are no divisions in the main engine (counters/mixing/APM/SSE).
Still, I guess thread syncing on a GPU is much less expensive than on a CPU
(there's even an instruction for it), so it should be possible to somewhat
parallelize the paq8 probability estimation.
But it would still require syncing many times per bit, parallelizing
the update won't go that well (for example, because of possible collisions
in secondary statistics), and GPU clock speed is lower than the CPU's,
so it doesn't look like there would be any speed improvement compared
to the CPU.

3. As to floats - this was already mentioned in the flac thread.
The C++ compilers are really rough with floats - for example,
it's hardly possible to get the same output from an mp3 or jpeg
decoder with a float DCT using different compilers - and that's
deadly for compression, as the format would lose compatibility
at random, e.g. after a change in compiler options or code
modifications.
Although it's still possible to make a working float-based compressor -
Ash is an example of that.
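The usual way around the float-reproducibility problem described here is to keep the coder's probabilities in fixed point, where every compiler and platform produces bit-identical results. A tiny sketch of a 12-bit fixed-point bit-probability update of the general kind used in place of floats (illustrative; not any particular codec's update rule):

```cpp
#include <cassert>
#include <cstdint>

// Adaptive bit probability in 12-bit fixed point (0..4095 represents 0..1).
// Integer shifts are exactly specified, so every compiler produces the same
// bitstream - unlike float updates, which can drift with compiler options
// and silently break format compatibility.
struct BitModel {
    uint16_t p = 2048;  // probability of a 1 bit, initially 0.5
    void update(int bit) {
        if (bit) p += (4096 - p) >> 5;  // move toward 1 by 1/32 of the gap
        else     p -= p >> 5;           // move toward 0 by 1/32 of p
    }
};
```

Because the update is pure integer arithmetic, an encoder and decoder built by different compilers stay in lockstep, which is exactly the property the float DCT example lacks.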

  25. #25
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    no
In what significant way would a PAQ-like decompressor be dissimilar from the compressor?

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
It's much easier to parallelize any CM compressor than decompressor -
unrelated parts can just work separately and store their predictions
in buffers.
That's relatively easy because all input data is known.
But in the decoder, it only works sequentially.
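The asymmetry can be made concrete: during encoding, every model can scan the fully known input independently and dump its predictions into a buffer, so only the final mix is sequential; the decoder cannot do this, because each model's next context depends on a symbol that hasn't been decoded yet. A sketch of the encoder side with two toy models (all names illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Each model predicts the probability that the next byte has its high bit
// set. Both see the whole input up front, so they can run concurrently -
// impossible in a decoder, where the input is the thing being reconstructed.
std::vector<double> predict_order0(const std::string& in) {
    std::vector<double> p(in.size());
    int ones = 1, total = 2;  // tiny adaptive order-0 model
    for (size_t i = 0; i < in.size(); ++i) {
        p[i] = double(ones) / total;
        ones += (in[i] & 0x80) ? 1 : 0; ++total;
    }
    return p;
}

std::vector<double> predict_order1(const std::string& in) {
    std::vector<double> p(in.size());
    std::vector<int> ones(256, 1), total(256, 2);  // per-context counters
    uint8_t ctx = 0;
    for (size_t i = 0; i < in.size(); ++i) {
        p[i] = double(ones[ctx]) / total[ctx];
        ones[ctx] += (in[i] & 0x80) ? 1 : 0; ++total[ctx];
        ctx = uint8_t(in[i]);
    }
    return p;
}

std::vector<double> encode_side_mix(const std::string& in) {
    std::vector<double> p0, p1;
    std::thread t0([&] { p0 = predict_order0(in); });  // models run in parallel
    std::thread t1([&] { p1 = predict_order1(in); });
    t0.join(); t1.join();
    std::vector<double> out(in.size());
    for (size_t i = 0; i < in.size(); ++i)  // only the mix is sequential
        out[i] = 0.5 * (p0[i] + p1[i]);
    return out;
}
```

In the decoder the two `predict_*` scans cannot be precomputed: position i's context is only known after position i-1 has been decoded, so the whole pipeline collapses back to one sequential loop.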

  27. #27
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    I thought I had described sequential mixing for both encoding and decoding

  28. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    That's what I implied too, but Bulat didn't.

  29. #29
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by willvarfar View Post
    I thought I had described sequential mixing for both encoding and decoding
it simply doesn't work, due to the high cost of synchronization

  30. #30
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Shelwien:
You're probably wrong about the memory latencies. The Fermi slides indicate there will even be an L2 cache, so it's definitely not UMA. And I think previous generations had L1 caches - small, but present.


