
Thread: CUDA anyone?

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts

    CUDA anyone?

    Did any of you programmers ever play with the thought of using CUDA or ATI's equivalent as a main processor instead of the normal CPU? The more efficient compressors are painfully slow, yet no one seems to feel the need to harness the power of the graphics cards.

    I'm aware that this would require a multi-threading capable code, but some archivers are already past that hurdle. In conjunction with that I wonder if it would be possible to use the RAM of graphics cards instead or additionally to the system RAM since it's so much faster and otherwise a wasted resource during compression.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    802
    Thanked 698 Times in 378 Posts
    yes, SSE makes your internet faster, so why not use a graphics accelerator for compression. it's a cool idea, no one proposed this before!

  3. #3
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I'm not sure, are you being sarcastic?

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    802
    Thanked 698 Times in 378 Posts
    no, no, believe me - you were the first to propose such a cool idea. i never heard it from 7z/fa users who believe in nvidia ads, nor from managers who need to add marketing hype to their products

  5. #5
    Member PAQer's Avatar
    Join Date
    Jan 2010
    Location
    Russia
    Posts
    22
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Last edited by PAQer; 9th March 2010 at 22:13.

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    802
    Thanked 698 Times in 378 Posts
    please check the url

  7. #7
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    So you are being sarcastic :P

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    802
    Thanked 698 Times in 378 Posts
    Quote Originally Posted by PAQer View Post
    http://www.hydrogenaudio.org/forums/...dpost&p=576803 answers the question. and as i see, core2quad still beats gpus even for flac
    Last edited by Bulat Ziganshin; 9th March 2010 at 23:55.

  9. #9
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Well, the programmer himself said that his hardware might not be powerful enough to provide a proper comparison. A GTS 250 is really not too powerful. I have a GTX 260 and an AMD 4200+. I'll test his app later to see which performs better on my rig.

    Anyway, CUDA is compatible even with weak chipsets such as ION. It would be interesting to see how this FLAC encoder performs on an ION-powered subnotebook. Those usually have pretty weak CPUs, so a CUDA-based app might make sense. The same goes, of course, for all other apps. Even if the speed isn't that much better, it would keep the CPU free from intense tasks such as archiving.

    Another idea: wouldn't a combined GPU+CPU mode also be possible?

    In any case, it's not certain that this FLAC encoder is written in the most efficient way possible. GPU-based processing is used for all sorts of scientific simulation work, including Folding@home, SETI@home and complex financial calculations. I doubt these institutions would go through the hassle of porting their code to GPU processing if it weren't worth it. Last but not least, there's also the option to use CUDA only for certain tasks, to speed up selected processes. No one says that an application has to run exclusively on a GPU.

    http://www.tomshardware.com/reviews/...pgpu,2299.html

  10. #10
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    http://code.google.com/p/back40compu...i/RadixSorting claims to be the fastest sort ever, on GPUs

  11. #11
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @Piotr_Tarsa:

    " - STx family is well suited for GPU's as the commonly uses sorting algorithms on GPU's are radix sorts, so I'll probably start with such thing"

    "In the moment I think about segmentation for bzip2; I don't know if Bulat implemented it already
    (I know that most of time it's better to just use biggest block but sometimes it isn't the case)"

    any success with compression on GPU ?

    maybe the new stable version 1.0.215 of "High performance GPU radix sorting in CUDA" from above is interesting for you?
    ( download: http://code.google.com/p/back40computing/downloads/list )

    *** ***
    This project implements a very fast, efficient radix sorting method for CUDA-capable devices.
    For sorting large sequences of fixed-length keys (and values),
    we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture:
    our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).

    In addition, one of our design goals for this project is flexibility.
    We've designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs,
    and for a wide variety of input types.

    Features:

    * The current implementation can sort:
    o Keys of any C/C++ built-in, numeric type (e.g., signed char, float, unsigned long long, etc.)
    o Any structured payload type (within reason).

    * Early-exit digit-passes. When the sorting operation detects that all keys have the same digit at the same digit-place,
    the pass for that digit-place is short-circuited, reducing the cost of that pass by 80%.
    This makes our implementation suitable for even low-degree binning problems where sorting would normally be overkill.
    (We note that the random keys used to produce our performance results do not trigger this feature.)

    Radix Sorting

    - works by iterating over the d-bit digit-places of the keys from least-significant to most-significant.
    For each digit-place, the method performs a stable distribution sort of the keys based upon their digit at that digit-place.
    Given an n-element sequence of k-bit keys and a radix r = 2^d,
    a radix sort of these keys will require k/d iterations of a distribution sort over all n keys.

    The distribution sort (a.k.a. counting sort) is the fundamental component of the radix sorting method. In a data-parallel, shared-memory model of computation, each logical processor gathers its key, decodes the specific digit at the given digit-place, and then must cooperate with other processors to determine where the key should be relocated. The relocation offset will be the key's global rank, i.e., the number of keys with lower digits at that digit place plus the number of keys having the same digit, yet occurring earlier in the input sequence.

    We implement these distribution sorting passes using a very efficient implementation of a generalized parallel prefix scan. Our generalized scan is designed to operate over multiple, concurrent scan problems. For example, with d = 4 bits (r = 16 digits), our multi-scan does sixteen scan operations: one for each digit. For each of the scan operations (e.g., the 0s scan, the 1s scan, the 2s scan, etc.), the input for each key is a 1 if the key's digit place contains that operation's digit, 0 otherwise. When the multi-scan is done, the logical processor for each key can look up the scan result from the appropriate scan operation to determine where its key should be placed.
    *** ***
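    As a rough sequential illustration of the distribution-sort passes described above, here is a Python sketch (the function name and parameters are illustrative, not from the library; the real CUDA kernels implement the histogram and offsets as a parallel multi-scan):

    ```python
    def radix_sort(keys, key_bits=32, d=4):
        """LSD radix sort: k/d stable counting-sort passes over d-bit digits."""
        r = 1 << d  # radix, r = 2^d
        for shift in range(0, key_bits, d):
            counts = [0] * r
            for k in keys:  # histogram of digits at this digit-place
                counts[(k >> shift) & (r - 1)] += 1
            # early-exit digit-pass: all keys share the same digit here
            if max(counts) == len(keys):
                continue
            offsets = [0] * r  # exclusive prefix scan -> start offset per digit
            for dgt in range(1, r):
                offsets[dgt] = offsets[dgt - 1] + counts[dgt - 1]
            out = [0] * len(keys)
            for k in keys:  # stable scatter by global rank
                dgt = (k >> shift) & (r - 1)
                out[offsets[dgt]] = k
                offsets[dgt] += 1
            keys = out
        return keys
    ```

    The `continue` branch corresponds to the early-exit feature quoted above: when every key shares the same digit at the current digit-place, the stable scatter would be an identity permutation, so the pass can be skipped.
    
    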
    Last edited by joerg; 30th August 2010 at 12:41. Reason: subscription of radix-sort-on CUDA

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,139
    Thanks
    321
    Thanked 1,402 Times in 804 Posts
    That's interesting, but normally isn't it better to run BWT sorting in parallel, instead of encoding?
    And specifically what about the approach which I suggested before - kind of a multithreaded radix sort?
    Basically you make 336 threads which all scan the whole file and extract their own string prefix ranges,
    then sort independently. Although in my case the main idea was to work with a compressed representation
    (to reduce memory usage and make use of GPU's arithmetic capability) -
    see http://encode.su/threads/379-BWT-wit...sed-input-data
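    A minimal sequential sketch of that prefix-range idea, assuming a plain 256-way split on the first byte rather than the 336 ranges, and without the compressed representation (names are illustrative):

    ```python
    def bwt_via_prefix_buckets(data):
        """BWT where suffixes are partitioned into disjoint prefix ranges;
        each bucket could be sorted by an independent worker/thread."""
        n = len(data)
        buckets = [[] for _ in range(256)]
        for i in range(n):  # each worker would scan the whole file like this
            buckets[data[i]].append(i)
        suffixes = []
        for b in buckets:  # sorting within a bucket is independent work
            b.sort(key=lambda i: data[i:])
            suffixes.extend(b)
        # BWT output: the byte preceding each sorted suffix (cyclic)
        return bytes(data[i - 1] for i in suffixes)
    ```

    Since the buckets are disjoint prefix ranges, sorting each independently and concatenating in bucket order already yields the globally sorted order. This sketch compares raw suffixes rather than cyclic rotations, which is a simplification.
    
    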

  13. #13
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by Shelwien View Post
    That's interesting, but normally isn't it better to run BWT sorting in parallel, instead of encoding?
    Of course it's better. Duane Merrill claims (http://code.google.com/p/back40computing) his radix sorting algorithm exceeds 1G keys/sec on a GTX 480, so this problem seems to be solved. For me it means that you can speed up any BWT-based compressor by replacing its sorting with GPU sorting, so I wanted to see what can be done with BWT modeling+encoding.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,139
    Thanks
    321
    Thanked 1,402 Times in 804 Posts
    Thanks for the link, but it seems that it's not a BWT sort, and BWT potentially adds O(N) per key comparison.
    And as to postcoding, I think some hierarchical model is necessary for that (something like frequency table compression) -
    forcing sequential methods to compress small blocks isn't really a good idea.
    Btw, if we can afford less efficient coding, then it should be easy enough to get 300-400MB/s on cpu with huffman etc.
    And even if we can't, compression part is not really an issue, because all the data are available and we can thread stuff easily.
    (eg. in each thread find the occurrences of specific context and only update its model)
    But for decoding w/o compression loss I still only know a method of entropy coding with 2 threads, which is not enough for GPU
    (well, it can be any number of threads actually, but the model still has to run sequentially)

  15. #15
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    1. True, radix sort is not enough for BWT, but we can use GPU merge sort:
    http://thrust.googlecode.com/svn/tag...rge__sort.html
    2. Dividing data into blocks is not the best idea, but it's simple and works quite well (results above) with blocks bigger than 50-100 kb. The memory usage, however, is N times bigger. I've tried to use one common model for all BWT threads/compressors, but it was not faster than separate models (and gave much worse compression).

  16. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,505
    Thanks
    26
    Thanked 136 Times in 104 Posts
    GPUs have SIMD engines. For example, a Radeon HD 5870 has 20 SIMD engines. Each SIMD engine has several processors, but all of them share the same instruction pointer. This means that in case of branching both branches are processed by each processor, and with while loops each processor must wait for the slowest thread. Using a one-pass radix sort you can do at most an STx transform (where x is a small constant).


    QSufSort does a full BWT in O(log n) ordinary fixed-size-integer-key sorts.
    Quote from http://www.larsson.dogma.net/ssrev-tr.pdf :
    Observation 1 (Manber and Myers). Sorting the suffixes using, for each suffix S_i, the position in the h-order of S_i as its primary key, and the position of S_{i+h} in the same order as its secondary key, yields the 2h-order.
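    That observation translates directly into the prefix-doubling construction: each round is an ordinary sort on fixed-size (primary, secondary) integer key pairs, which is exactly the kind of sort a GPU radix sorter handles well. A sequential Python sketch (illustrative, not QSufSort itself):

    ```python
    def suffix_array_doubling(s):
        """Manber-Myers prefix doubling: O(log n) rounds, each an ordinary
        sort of fixed-size (rank[i], rank[i+h]) integer key pairs."""
        n = len(s)
        sa = list(range(n))
        rank = [ord(c) for c in s]  # 1-order: rank by first character
        h = 1
        while h < n:
            # key: (h-order rank of S_i, h-order rank of S_{i+h})
            key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
            sa.sort(key=key)  # this pass is what a GPU radix sort would do
            new_rank = [0] * n
            for j in range(1, n):
                new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
            rank = new_rank
            h *= 2
            if rank[sa[-1]] == n - 1:  # all ranks distinct: finished early
                break
        return sa
    ```
    
    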

  17. #17
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've found two interesting papers about GPU Suffix Array Construction:
    http://web.iiit.ac.in/~abhishek_shuk...ber%202009.pdf
    http://www.nvidia.com/content/GTC/po...Prefix-Sum.pdf

  18. #18
    Member
    Join Date
    Oct 2011
    Location
    Porto Alegre, Brazil
    Posts
    16
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Anyone got any progress?

  19. #19
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Can anyone imagine how you might parallelise or multi-pass Levenshtein Distance and extract optimal path to make it computable for large files in reasonable time using reasonable resources?
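    One standard approach is wavefront parallelism: every cell on one anti-diagonal of the DP matrix depends only on the previous two anti-diagonals, so all cells of a diagonal can be filled concurrently. A sequential Python sketch of that traversal order (in a GPU version the inner loop would be one thread per cell; names are illustrative):

    ```python
    def levenshtein_antidiagonal(a, b):
        """Levenshtein distance filled anti-diagonal by anti-diagonal."""
        n, m = len(a), len(b)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for d in range(n + m + 1):        # anti-diagonal index: i + j = d
            for i in range(max(0, d - m), min(n, d) + 1):  # independent cells
                j = d - i
                if i == 0:
                    D[0][j] = j
                elif j == 0:
                    D[i][0] = i
                else:
                    D[i][j] = min(D[i - 1][j] + 1,      # deletion
                                  D[i][j - 1] + 1,      # insertion
                                  D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D[n][m]
    ```

    Path extraction can then backtrack through the finished matrix as usual; for large files the matrix would additionally need tiling or a Hirschberg-style divide-and-conquer to keep memory reasonable.
    
    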

  20. #20
    Member
    Join Date
    Oct 2011
    Location
    Porto Alegre, Brazil
    Posts
    16
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by willvarfar View Post
    Can anyone imagine how you might parallelise or multi-pass Levenshtein Distance and extract optimal path to make it computable for large files in reasonable time using reasonable resources?
    http://mohamedfahmed.wordpress.com/2...ing-on-gpgpus/

  21. #21
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Gorgeous!

    I had fuzzy ideas along those lines but was rather suspicious my intuition could in any way be correct.

    So how does the Levenshtein path help with compressing deltas? Anyone care to hazard a guess how it might help compression compared to, say, xdelta or bsdiff?

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,139
    Thanks
    321
    Thanked 1,402 Times in 804 Posts
    > how it might help compression compared to, say, xdelta or bsdiff?

    Likely won't be of any help.

    The only type of data where replace/delete/insert operations on bytes are relevant
    is revisions of wiki articles or some such.
    And even there it's far from perfect, because utf8 diffing is not the same as byte diffing,
    and typical editing includes not only fixing typos, but also moving words around,
    changing suffixes etc, thus a linguistic model is necessary to build really efficient diffs.

    And for binaries it's much more relevant to parse the format structure - if some field
    contains the size of some structure at another place, there's no sense in diffing these sizes.

    Also the Levenshtein distance by definition is not the best for compression, because
    compression requires an entropy-based metric instead of simple counts.
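    To make that concrete, here is a hypothetical sketch in which each edit operation is charged its code length in bits (-log2 of an assumed operation probability) instead of a unit count; the probabilities are arbitrary placeholders, not taken from any real model:

    ```python
    import math

    def weighted_edit_distance(a, b, p_sub=0.05, p_ins=0.05, p_del=0.05):
        """Edit distance with entropy-based costs: each edit costs
        -log2(p) bits, so the cheapest path approximates the shortest
        encoding of the diff rather than the fewest operations."""
        c_sub, c_ins, c_del = (-math.log2(p) for p in (p_sub, p_ins, p_del))
        n, m = len(a), len(b)
        D = [[0.0] * (m + 1) for _ in range(n + 1)]
        for j in range(1, m + 1):
            D[0][j] = j * c_ins
        for i in range(1, n + 1):
            D[i][0] = i * c_del
            for j in range(1, m + 1):
                # a match costs 0 bits; a substitution pays its code length
                match = D[i - 1][j - 1] + (c_sub if a[i - 1] != b[j - 1] else 0.0)
                D[i][j] = min(match, D[i - 1][j] + c_del, D[i][j - 1] + c_ins)
        return D[n][m]  # total cost in bits
    ```

    With all probabilities equal the optimal path is the same as the unit-cost one; the interesting cases are skewed probabilities, where a cheap frequent operation is preferred over an expensive rare one.
    
    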

  23. #23
    Member Raymond_NGhM's Avatar
    Join Date
    Oct 2008
    Location
    UK
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Alenka is a modern analytical database engine written to take advantage of vector based processing
    and high bandwidth of modern GPUs.

    Features include:

    -Vector-based processing
    CUDA programming model allows a single operation to be applied to an entire set of data at once.

    -Self optimizing compression
    Ultra fast compression and decompression performed directly inside GPU

    -Column-based storage
    Minimize disk I/O by only accessing the relevant data

    -Fast database loads
    Data load times measured in minutes, not in hours.

    -Open source and free

    Some benchmarks:
    Alenka : Pentium E5200 (2 cores), 4 GB of RAM, 1x2TB hard drive , NVidia GTX 260
    Current Top #10 TPC-H 300GB non-clustered performance result : MS SQL Server 2005 : Hitachi BladeSymphony (8 CPU/8 Cores), 128 GB of RAM, 290x36GB 15K rpm drives
    Current Top #7 TPC-H 300GB non-clustered performance result : MS SQL Server 2005 : HP ProLiant DL585 G2 (4 CPU/8 Cores), 128 GB of RAM, 200x36GB 15K rpm drives

    Time of Q1 of TPC-H Scale Factor 300 :

    Alenka : 178s
    Top#10: 485s
    Top#7 : 309s

    http://sourceforge.net/projects/alenka/files/

