
Thread: CUDA anyone?

  1. #31
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
    I may have missed something, or my knowledge of GPUs may be lackluster. While I do understand that each GPU core can calculate a probability - or, to be more precise, one prediction stage, based on some hashed-context distribution - I do not understand how the parallelism can be kept (and therefore taken advantage of) when reaching the final coding step. Assuming an arithmetic or range coder, all final probabilities must be merged into a single stream (or a few, if you know how to jump from one to another), and each stream needs serialising in the same order as the source input.
    Now, maybe this final stage can be considered not so costly compared to probability estimation itself. Even then, there is still the need to ensure probabilities are encoded in the correct order. Meta-tags, maybe...
    Last edited by Cyan; 10th March 2010 at 22:08.

  2. #32
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,531
    Thanks
    755
    Thanked 674 Times in 365 Posts
    Just write the intermediate data to buffers (of course it's for encoding only).
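    A minimal sketch of that buffer approach, assuming a bitwise model and encoding only (at encoding time every context can be derived from the plaintext in advance): the GPU fills a buffer of (bit, probability) decisions indexed by source position, and a single CPU thread then runs the serial coding pass over it. All names are illustrative, and the commented-out rc.encodeBit stands for whatever range coder is actually used.

    Code:
    #include <cstdint>
    #include <vector>
    #include <cuda_runtime.h>

    struct Decision {
        uint8_t  bit;   // the bit to be coded
        uint16_t p1;    // model's probability of a 1, scaled to [0,65535]
    };

    // Each thread "models" one bit position. This works for encoding only,
    // where every context is already known from the plaintext; a real CM
    // model would be far more involved than this stand-in.
    __global__ void predictBits(const uint8_t* data, int n, Decision* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        out[i].bit = data[i] & 1;   // stand-in for real modelling
        out[i].p1  = 32768;         // stand-in static probability
    }

    // Serial final stage on the CPU: order is preserved because the buffer
    // is indexed by source position, not by thread completion time.
    void encodeSerial(const Decision* d, int n) {
        (void)d;
        for (int i = 0; i < n; ++i) {
            // rc.encodeBit(d[i].bit, d[i].p1);   // the only serial part
        }
    }

    int main() {
        const int n = 1 << 20;
        uint8_t* dData;  Decision* dOut;
        cudaMalloc(&dData, n);
        cudaMalloc(&dOut, n * sizeof(Decision));
        cudaMemset(dData, 0x5A, n);                    // dummy input
        predictBits<<<(n + 255) / 256, 256>>>(dData, n, dOut);
        std::vector<Decision> h(n);
        cudaMemcpy(h.data(), dOut, n * sizeof(Decision),
                   cudaMemcpyDeviceToHost);
        encodeSerial(h.data(), n);
        cudaFree(dData);
        cudaFree(dOut);
        return 0;
    }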

  3. #33
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,688
    Thanks
    264
    Thanked 1,180 Times in 651 Posts
    > Assuming an arithmetic or range coder, all final probabilities must be merged into a single stream

    That's not really true either, even for decoding.
    There are some tested ways to thread a rangecoder.
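    One family of such schemes - sketched here as a guess at the general idea, not necessarily the one meant above - interleaves several independent coders round-robin over the symbols, each writing its own buffer; because the assignment is deterministic, a decoder can run the same number of coders side by side. The RangeCoder interface below is an assumed placeholder:

    Code:
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Placeholder interface; any carry-less range coder fits here.
    struct RangeCoder {
        std::vector<uint8_t> out;
        void encode(uint8_t /*sym*/) { /* model + renormalise */ }
        void flush()                 { /* emit the final bytes */ }
    };

    // K coders take symbols round-robin; each owns its output buffer.
    // Storing the K buffer lengths in the header lets the decoder run
    // K decoders in parallel with the same deterministic assignment.
    std::vector<std::vector<uint8_t>>
    encodeInterleaved(const uint8_t* src, std::size_t n, int k) {
        std::vector<RangeCoder> coders(k);
        for (std::size_t i = 0; i < n; ++i)
            coders[i % k].encode(src[i]);
        std::vector<std::vector<uint8_t>> streams;
        for (auto& c : coders) {
            c.flush();
            streams.push_back(c.out);
        }
        return streams;
    }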

  4. #34
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    it simply doesn't work due to high cost of synchronization
    GPGPU threads in a block - and even between blocks - are synchronised; in many cases they even run in lock-step (which is why branches are expensive: threads that decline a branch must stall until the threads in the group that did take the branch rejoin). Synchronisation is completely different in the GPGPU world, and completely cheap.

    In the x86 world, cores are independent execution units, each with its own program. In the GPGPU world, cores are more like SIMD lanes: a single program (flow), with each core running said program on separate data.

    So imagine in my description you have a single context split between 6 GPGPU threads; all hash the context, only one actually has it in range and does the mixing whilst the other 5 stall waiting automatically, and then they move to the next, doing cheap synchronisation with the other thread blocks responsible for other contexts if necessary. And you don't care that you're only utilising one of the 6 threads for that bit, since you're effectively modelling a CPU with a massive L1 cache.
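    For illustration, a small CUDA fragment of that pattern (the kernel, names, and toy hash are all invented for the example):

    Code:
    // Six lanes of a warp probe one context slot each, in lock-step; only
    // the lane whose hash matches runs the expensive path, and the
    // masked-off lanes rejoin when the branch reconverges.
    __device__ unsigned hashContext(unsigned ctx, unsigned lane) {
        return (ctx * 2654435761u + lane) & 7u;   // toy hash, range 0..7
    }

    __global__ void mixOneContext(unsigned ctx, unsigned wantedSlot,
                                  float* pOut) {
        unsigned lane = threadIdx.x & 31u;        // lane id within the warp
        if (lane < 6 && hashContext(ctx, lane) == wantedSlot) {
            // Only the matching lane executes this; the divergence costs
            // the warp the length of the branch, nothing more.
            pOut[0] = 0.5f;                       // stand-in for real mixing
        }
        // Lanes reconverge here; within a warp no explicit sync is needed.
    }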

    As for the timings of memory in CUDA, it's certainly my experience that the published access times for memory in CUDA are accurate, and you can - if you're diligent - program to avoid memory fetch stalls.

  5. #35
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Fermi cores have individual instruction caches and pointers. They are effectively independent units, as Larrabee promised.

    Unfortunately, the claims that Fermi won't ship are rather compelling reading.

    http://www.semiaccurate.com/2010/02/...and-unfixable/

    You'd still expect synchronisation to be cheap on Fermi, else it won't be much good for graphics!

    GPGPUs have completely different hardware-based thread scheduling and synchronisation, and they do those things extremely well - because that's crucial to graphics programs and the trade-offs are completely different to classic CPUs.

  6. #36
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
    Very interesting reading on Fermi, and quite credible.
    It sounds very familiar too - same reasons, same consequences (doesn't it look like the ATI Radeon HD2900XT?)

  7. #37
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Quote Originally Posted by willvarfar View Post
    Fermi cores have individual instruction caches and pointers. They are effectively independent units, as Larrabee promised.
    Are you SURE? Because that would have a great impact on the bandwidth eaten by instruction flow. Currently there are only a few instruction streams, broadcast to the units in shader clusters (one instruction stream per cluster).

  8. #38
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Are you SURE?
    No, it's a recollection. I'll go looking for my source.

  9. #39
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Not so long ago I asked about Fermi:
    http://www.semiaccurate.com/forums/s...ead.php?t=1494
    The answer was: Fermi is not MIMD.

  10. #40
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    As madmac said on that semiaccurate thread: "NVDA claims MIMD because of multiple kernels that can be executed at once i think... but that's plain marketing."

    I can't find where I got the impression it was independent cores, although I have the definite feeling I read it somewhere. I've speed-read the AnandTech stuff, but can't spot the exact source...

    Now Larrabee would have been independent cores, iirc. Have I got that wrong too?

  11. #41
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,531
    Thanks
    755
    Thanked 674 Times in 365 Posts
    Larrabee has 16 or so cores, each one executing 4 threads, each of which can execute SSE commands.

    nv260 has 16 cores or so, each one executing a program on 16 data streams simultaneously.

    So they are all multiple cores (threads) executing SIMD commands. You may call this MIMD too (any dual-core or SMT CPU is MIMD, after all).

  12. #42
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Larrabee was supposed to have up to 48 cores and 512-bit registers. The project was cancelled, and the next incarnation will probably have more cores and/or wider units.

    Fermi will have 16 cores, each one with 32 streams (?). Charlie said, "There is no MIMD in NV GPUs, as was stated above, just concurrent thread execution." - but I don't fully understand what that means.

  13. #43
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    I can see that the SMT helps despite the cores being in-order.

    Depressing that they are not what I'd term 'independent cores' - that you cannot have thread blocks of 1 thread efficiently.

    Of course, the desktop is going that way all the time.

  14. #44
    Member Lone_Wolf236's Avatar
    Join Date
    Aug 2009
    Location
    Canada
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I think CUDA has its place in the compression world.

    I'm currently working on a different approach to compressing files, and I want it to benefit from parallel processing. The problem with the current algos is that they need the results of the previous processes to compute the next result. The algo I'm working on splits the input file into blocks of a few bytes and compresses each of these blocks (see the sketch below). My current GPU has 240 cores clocked at 1550 MHz with a memory bandwidth of 135 GB/s, and don't forget that each core and each block of cores has a cache. And the next generation coming out in 2 weeks will have 480 cores.

    We are talking about more than 2.5 TFLOPS of power, while my Core i7 @ 4.2 GHz with HT on only has 60 GFLOPS.

    I'm sorry, but graphics cards are WAY more powerful than CPUs for compression IF
    1) we use algos that support parallel processing
    2) we reduce our algos' memory usage (#1 will reduce it anyway)
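    For illustration, roughly what that block-split layout looks like as a CUDA kernel - one thread per chunk, with compressBlock as a placeholder for any self-contained codec (both names are invented here):

    Code:
    #include <cstdint>

    // Placeholder: identity "compression". A real self-contained codec
    // (one with no cross-chunk state) goes here.
    __device__ int compressBlock(const uint8_t* in, int n, uint8_t* out) {
        for (int i = 0; i < n; ++i) out[i] = in[i];
        return n;                                 // compressed size
    }

    // One thread per chunk; chunks are fully independent, so no
    // synchronisation is needed - but tiny chunks also give each codec
    // very little context to work with, which costs compression ratio.
    __global__ void compressChunks(const uint8_t* in, uint8_t* out,
                                   int* sizes, int chunk, int nChunks) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= nChunks) return;
        sizes[c] = compressBlock(in + c * chunk, chunk, out + c * chunk);
    }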

  15. #45
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,531
    Thanks
    755
    Thanked 674 Times in 365 Posts
    Quote Originally Posted by Lone_Wolf236 View Post
    The problem with the current algos is that they need the results of the previous processes to compute the next result. The algo I'm working on splits the input file into blocks of a few bytes and compresses each of these blocks
    one more genius here

  16. #46
    Member Lone_Wolf236's Avatar
    Join Date
    Aug 2009
    Location
    Canada
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have to admit I'm a complete n00b compared to you all.

    But anyway, I'm working on it.

  17. #47
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    I see GPGPU as a perfect solution for compressing photos, movies, sounds, etc. - "stream" data.

    I've chosen BWT on GPGPUs as my master's thesis, and this is crazy. I think I'll make some heuristic like hashing to floats ;D and some other calculations, with the goal of being 9x% accurate (i.e. producing correct output in 9x% of cases).

  18. #48
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Unfortunately, GPGPUs aren't a perfect fit for image and video compression either.

    Clearly they do work, and better than CPUs, but basically, if you're making a recorder/player, you have a choice between a beefy GPGPU with a fan or a tiny ASIC. Obvious choice, really.

  19. #49
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    I'm talking about custom compression algorithms. The ASICs found in Blu-ray players are fixed-function, so not so interesting from a programmer's point of view.

  20. #50
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    159
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Piotr Tarsa View Post
    I've chosen BWT on GPGPUs as my master's thesis, and this is crazy. I think I'll make some heuristic like hashing to floats ;D and some other calculations, with the goal of being 9x% accurate (i.e. producing correct output in 9x% of cases).
    Good luck. This is not easy, but I think it is possible. You could start by looking into my idea, which I posted a couple of years ago:
    http://forum.compression.ru/viewtopi...t=1948&start=0
    Enjoy coding, enjoy life!

  21. #51
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    63
    Thanks
    2
    Thanked 12 Times in 6 Posts
    BWT on a GPU? I don't get it... I expected a neural network, for example, but BWT? I admit I'm not an expert on GPU usage for compression, but I have read several articles about GPUs boosting neural networks.

    The idea is to build up a network structure in the memory of the GPU that stays there for the entire time of usage. Only the input for the network and the output from the network are transported during usage. This keeps the speed gain from the GPU profitable compared to the cost of transporting information to the GPU.

    If you talk about BWT on the GPU, I see 2 problems:
    1. GPUs are not specialised in integer operations. Maybe a clever algorithm could overcome this problem by using floats somehow, but that more complex algorithm could well be responsible for enough speed loss to make usage of the GPU inefficient.
    2. BWT is about moving and ordering blocks of memory. That's exactly what you don't want to do on your GPU.

    From my point of view, it would be interesting to see the performance of a paq-like algorithm where one or more CPUs are working on several models and the GPU is mixing the statistics from those models into an accurate prediction.
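    As a sketch of what that mixing step could look like - following the general paq-style logistic mixing recipe, not any particular program - each GPU thread can mix the per-model probabilities for one bit position:

    Code:
    // paq-style logistic mixing, one thread per bit position. p holds the
    // nModels predictions for each of nBits positions, w the mixer
    // weights. Illustrative only - not taken from paq8 or any real coder.
    __device__ float stretchf(float p) { return logf(p / (1.0f - p)); }
    __device__ float squashf(float x)  { return 1.0f / (1.0f + expf(-x)); }

    __global__ void mixPredictions(const float* p,  // [nBits][nModels]
                                   const float* w,  // [nModels]
                                   float* pMixed, int nBits, int nModels) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nBits) return;
        float dot = 0.0f;
        for (int m = 0; m < nModels; ++m)
            dot += w[m] * stretchf(p[i * nModels + m]);
        pMixed[i] = squashf(dot);   // mixed probability for bit i
    }

    The catch is that in a real CM coder the mixer weights adapt after every coded bit, which reintroduces a serial dependency between bit positions.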

  22. #52
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    Quote Originally Posted by Gribok View Post
    Good luck. This is not easy, but I think it is possible. You could start by looking into my idea, which I posted a couple of years ago:
    http://forum.compression.ru/viewtopi...t=1948&start=0
    Smart idea. Sadly, I don't understand Russian (I can only "decipher" / read out Cyrillic). The problem with GPUs is that quicksort is rather difficult to implement and doesn't offer enough performance gain over traditional radix sort to offset the difficulty (quicksort performance on a GPU may even be lower when coupled with some BWT-oriented techniques). But the algo should be easy to modify for radix sort.

  23. #53
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,688
    Thanks
    264
    Thanked 1,180 Times in 651 Posts
    1. Modern GPUs already have integer support. More than that, CUDA allows running most C/C++ programs on a GPU without much work - making them efficient is a completely different question, though.

    2. Neural networks which can be easily parallelized are redundant - something like a few planes, where each neuron is connected to the whole previous plane. And this doesn't apply to paq8, and even less to other CM coders.

    3. BWT as a pointer-manipulation algorithm isn't very compatible with GPU features, but even so it's easy to split into parallel parts. Then, there are many ways to implement BWT, and some of them can make use of more arithmetic and fewer memory accesses, thus being more compatible.

  24. #54
    Member Raymond_NGhM's Avatar
    Join Date
    Oct 2008
    Location
    UK
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    The latest NVidia CUDA Video Encoder (nvcuvenc.dll ActiveX driver v.196.21) also supports a private LZMA decoder.

    In the .dll I found these functions:

    "AVCPrivNLZMADecoder"
    "AVCnVZipDecoder"
    "AVCLosslessDecoder"
    or more?

    In the Fermi Compute Architecture Whitepaper, you can find everything about the Fermi GPU's internals & the differences from the previous G80 & GT200 series.

    *Improved Double Precision Performance - while single precision floating point performance was on the order of ten times the performance of desktop CPUs, some GPU computing applications desired more double precision performance as well.
    *ECC support - ECC allows GPU computing users to safely deploy large numbers of GPUs in datacenter installations, and also ensures data-sensitive applications like medical imaging and financial options pricing are protected from memory errors.
    *True Cache Hierarchy - some parallel algorithms were unable to use the GPU's shared memory, and users requested a true cache architecture to aid them.
    *More Shared Memory - many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications.
    *Faster Context Switching - users requested faster context switches between application programs and faster graphics and compute interoperation.
    *Faster Atomic Operations - users requested faster read-modify-write atomic operations for their parallel algorithms.

    +Third Generation Streaming Multiprocessor (SM)
    -32 CUDA cores per SM, 4x over GT200
    -8x the peak double precision floating point performance over GT200
    -Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
    -64 KB of RAM with a configurable partitioning of shared memory and L1 cache
    +Second Generation Parallel Thread Execution ISA
    -Unified Address Space with Full C++ Support
    -Optimized for OpenCL and DirectCompute
    -Full IEEE 754-2008 32-bit and 64-bit precision
    -Full 32-bit integer path with 64-bit extensions
    -Memory access instructions to support transition to 64-bit addressing
    +Improved Performance through Predication
    -Improved Memory Subsystem
    -NVIDIA Parallel DataCache(TM) hierarchy with Configurable L1 and Unified L2 Caches
    -First GPU with ECC memory support
    -Greatly improved atomic memory operation performance
    +NVIDIA GigaThread(TM) Engine
    -10x faster application context switching
    -Concurrent kernel execution
    -Out of Order thread block execution
    -Dual overlapped memory transfer engines

    & other links:

    MD4/MD5/SHA1 GPU Password Recovery:

    http://www.golubev.com/files/ighashgpu/readme.htm
    http://www.golubev.com/hashgpu.htm

    also a benchmark comparing CPUs & the latest nVidia/ATi GPUs:
    http://www.golubev.com/about_cpu_and_gpu_2_en.htm

    RAR GPU Password Recovery:
    http://www.golubev.com/rargpu.htm

    Note that, day by day, GPUs are opening up a wide new area for programmers to process data far faster than CPUs. But we mustn't forget what the CPU is fundamentally there for.

  25. #55
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Another thing about how ATI cards kick Nvidia's butt for some GPGPU use-cases: http://www.theregister.co.uk/2010/03...word_recovery/

  26. #56
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    377
    Thanks
    26
    Thanked 23 Times in 16 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I see GPGPU as a perfect solution for compressing photos, movies, sounds, etc. - "stream" data.

    I've chosen BWT on GPGPUs as my master's thesis, and this is crazy. I think I'll make some heuristic like hashing to floats ;D and some other calculations, with the goal of being 9x% accurate (i.e. producing correct output in 9x% of cases).
    Any progress?

  27. #57
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    I have too many courses to finish right now (lots of projects at university). I'll play with GPU-BWT during the summer holidays. And I don't guarantee that it will be any faster than CPU-BWT.

  28. #58
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    @Piotr_Tarsa

    It seems your favorite algorithm is BWT on GPU.

    1. What about trying to implement a bzip2 (parallel bzip2) algorithm on a GPU?

    pbzip2 seems to get a good speedup on multiple cores, especially with more than 4 cores:

    http://compression.ca/pbzip2/bench-multi.gif

    2. The new bsc (http://encode.dreamhosters.com/showthread.php?t=586) may be a good candidate for multi-core too.

    Especially interesting to me is its strong ST5 algorithm.

    3. Last but not least:

    the strongest of the new programs - the new zp 1.0 from Matt Mahoney:

    http://encode.dreamhosters.com/showthread.php?t=608

    Especially great is the "mid" = "c2" mode.

    But I don't know:
    a) will it be possible to implement this on a parallel multi-core system?
    b) will it profit well from a parallel multi-core system?

    ---
    Have you ever compared ATI GPUs versus NVIDIA GPUs, and OpenCL versus CUDA?

    If we do not speak about Fermi/Tesla, it seems ATI has the stronger devices at the moment (more GFLOPS).

    I wish you big success with your work.

    Best regards

  29. #59
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Making a ZPAQ compatible program like zp parallel could be done by compressing or decompressing blocks in separate threads. Blocks are independent so this could be done. For decompression this is no problem as long as each block decompresses to a different set of files. For compression, the compressed data would have to be kept in memory or temporary files until all of the threads have finished, and then concatenated.

    It is possible to split a single file into smaller blocks that are compressed independently in parallel. In ZPAQ format, each block would have only one segment and only the segment in the first block would have a file name. Both compression and decompression would require saving the output in memory or temporary files until all the threads are finished and the data can be concatenated.

    Both approaches have some problems. Splitting the input into smaller blocks makes compression worse. Also, the memory requirement is the sum of the requirements of each block instead of the maximum. Also, speed is limited partly by random memory access where using threads does not help. Also in ZPAQ, each block would have to have a header describing the compression algorithm. This can be a few hundred bytes. For example compressing an empty file with zp c1, c2, or c3 gives an archive size of 73, 116, or 244 bytes.

    pbzip2 gets good speedup with no compression loss because it already uses a small block (0.9 MB). The best ZPAQ modes use hundreds of MB.
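    A host-side sketch of the first approach above, with compressBlock standing in for the real per-block codec (this is not the zp/ZPAQ API; the helper names are assumed for illustration):

    Code:
    #include <cstdint>
    #include <thread>
    #include <vector>

    // Stand-in for compressing one independent block.
    std::vector<uint8_t> compressBlock(const std::vector<uint8_t>& in) {
        return in;                        // placeholder for the real codec
    }

    // Compress every block in its own thread, keep the results in memory,
    // then concatenate them in the original order once all threads finish.
    std::vector<uint8_t>
    compressParallel(const std::vector<std::vector<uint8_t>>& blocks) {
        std::vector<std::vector<uint8_t>> results(blocks.size());
        std::vector<std::thread> pool;
        for (size_t i = 0; i < blocks.size(); ++i)
            pool.emplace_back([&, i] { results[i] = compressBlock(blocks[i]); });
        for (auto& t : pool) t.join();
        std::vector<uint8_t> archive;
        for (const auto& r : results)
            archive.insert(archive.end(), r.begin(), r.end());
        return archive;
    }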

  30. #60
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,478
    Thanks
    26
    Thanked 122 Times in 96 Posts
    I think that:
    - the STx family is well suited to GPUs, as the commonly used sorting algorithms on GPUs are radix sorts, so I'll probably start with such a thing (probably coupled with DWFC or something that parallelizes well and is strong) - see the sketch after this list,
    - compressing large blocks in multiple threads requires an enormous amount of memory - and since fast memory for GPUs is expensive, it is shipped in low volume; the average PC has several times less memory on the GPU than on the mainboard,
    - inventing my own exotic algorithm won't bring my program much attention or popularity; I will probably try to create something compatible with bzip2 (additionally, it has small block sizes, so it could run on graphics cards with moderate amounts of memory),
    - QSufSort seems promising - i.e. I think it's possible to implement it in a reasonable time in OpenCL; I'll also try Gribok's idea,
    - offloading 100% of the work to the GPU leaves the CPU jobless, so the CPU should also handle some tasks; at the moment I'm thinking about segmentation for bzip2; I don't know if Bulat has implemented it already (I know that most of the time it's better to just use the biggest block, but sometimes that isn't the case).
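    For reference, the counting step at the core of such a radix sort maps naturally to a GPU. A sketch of one least-significant-digit pass (illustrative only - a full sort adds a prefix scan and a scatter, and real implementations tile the histogram in shared memory):

    Code:
    #include <cstdint>

    // Counting step of one 8-bit LSD radix-sort pass over 32-bit keys
    // (e.g. suffix ranks): each thread classifies one key and bumps the
    // matching bucket. hist must point at 256 zero-initialised counters.
    __global__ void radixHistogram(const uint32_t* keys, int n,
                                   unsigned int* hist, int shift) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned int digit = (keys[i] >> shift) & 0xFFu;
        atomicAdd(&hist[digit], 1u);
    }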
