Page 3 of 3 FirstFirst 123
Results 61 to 76 of 76

Thread: CUDA anyone?

  1. #61
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,532
    Thanks
    755
    Thanked 674 Times in 365 Posts
    Segmentation is implemented not by me but by Gribok in bsc; he said it's just Shkarin's seg_file.

  2. #62
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    http://code.google.com/p/back40compu...i/RadixSorting claims the fastest sort ever on GPUs

  3. #63
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    @Piotr_Tarsa:

    " - STx family is well suited for GPU's as the commonly uses sorting algorithms on GPU's are radix sorts, so I'll probably start with such thing"

    "In the moment I think about segmentation for bzip2; I don't know if Bulat implemented it already
    (I know that most of time it's better to just use biggest block but sometimes it isn't the case)"

    Any success with compression on GPU?

    Maybe the new stable version 1.0.215 of "High performance GPU radix sorting in CUDA" from above is interesting for you?
    ( download: http://code.google.com/p/back40computing/downloads/list )

    *** ***
    This project implements a very fast, efficient radix sorting method for CUDA-capable devices.
    For sorting large sequences of fixed-length keys (and values),
    we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture:
    our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).

    In addition, one of our design goals for this project is flexibility.
    We've designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs,
    and for a wide variety of input types.

    Features:

    * The current implementation can sort:
      o Keys of any C/C++ built-in numeric type (e.g., signed char, float, unsigned long long, etc.)
      o Any structured payload type (within reason).

    * Early-exit digit-passes. When the sorting operation detects that all keys have the same digit at the same digit-place,
    the pass for that digit-place is short-circuited, reducing the cost of that pass by 80%.
    This makes our implementation suitable for even low-degree binning problems where sorting would normally be overkill.
    (We note that the random keys used to produce our performance results do not trigger this feature.)

    Radix Sorting

    - works by iterating over the d-bit digit-places of the keys from least-significant to most-significant.
    For each digit-place, the method performs a stable distribution sort of the keys based upon their digit at that digit-place.
    Given an n-element sequence of k-bit keys and a radix r = 2^d,
    a radix sort of these keys will require k/d iterations of a distribution sort over all n keys.

    The distribution sort (a.k.a. counting sort) is the fundamental component of the radix sorting method. In a data-parallel, shared-memory model of computation, each logical processor gathers its key, decodes the specific digit at the given digit-place, and then must cooperate with other processors to determine where the key should be relocated. The relocation offset will be the key's global rank, i.e., the number of keys with lower digits at that digit place plus the number of keys having the same digit, yet occurring earlier in the input sequence.

    We implement these distribution sorting passes using a very efficient implementation of a generalized parallel prefix scan. Our generalized scan is designed to operate over multiple, concurrent scan problems. For example, with d = 4 bits (r = 16 digits), our multi-scan does sixteen scan operations: one for each digit. For each of the scan operations (e.g., the 0s scan, the 1s scan, the 2s scan, etc.), the input for each key is a 1 if the key's digit place contains that operation's digit, 0 otherwise. When the multi-scan is done, the logical processor for each key can look up the scan result from the appropriate scan operation to determine where its key should be placed.
    *** ***
    Last edited by joerg; 30th August 2010 at 12:41. Reason: subscription of radix-sort-on CUDA
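    The distribution sort and multi-scan described in the quote above can be modelled on the CPU. Below is a serial Python sketch of one d-bit counting-sort pass and the k/d-pass LSD radix sort built on it; it mirrors the algorithm's structure only, not Merrill's CUDA implementation:

```python
def distribution_sort(keys, shift, d=4):
    """One stable counting-sort pass on the d-bit digit at bit offset `shift`.

    The per-digit counts plus the prefix scan play the role of the GPU
    multi-scan: for each of the r = 2**d digits we compute how many keys
    precede a given key with a lower digit (or the same digit, earlier in
    the input), which is exactly the key's relocation offset (its rank).
    """
    r = 1 << d
    mask = r - 1
    counts = [0] * r
    for k in keys:                      # histogram of digits at this place
        counts[(k >> shift) & mask] += 1
    offsets = [0] * r                   # exclusive prefix scan of the counts
    for digit in range(1, r):
        offsets[digit] = offsets[digit - 1] + counts[digit - 1]
    out = [0] * len(keys)
    for k in keys:                      # scatter each key to its global rank
        digit = (k >> shift) & mask
        out[offsets[digit]] = k
        offsets[digit] += 1
    return out

def radix_sort(keys, key_bits=32, d=4):
    """k/d stable passes from least- to most-significant digit-place."""
    for shift in range(0, key_bits, d):
        keys = distribution_sort(keys, shift, d)
    return keys
```

    On the GPU each histogram and scan is itself parallel across thousands of threads; the serial loops here just make the data flow explicit.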

  4. #64
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've extracted the modeling+entropy coding stages from several BWT-based compressors and converted them to CUDA. Currently each compressor divides the input data into N independent blocks and compresses them using N GPU processors.

    The results were measured on an Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with the first 10.485.768 bytes of ENWIK8. Sorting is done on the CPU (divsufsort) and can easily be replaced with GPU sorting. CPU (1 thread) results are measured using almost the same source code as for GPU.

    grzip (39.552 bytes per thread memory)
    Sorting CPU = 1950 ms (5377 KB/s)
    Encoding CPU 1 thread = 593 ms (17682 KB/s), 10485768->2551762 bytes
    Encoding CUDA 7 threads = 6895 ms (1520 KB/s), 10485768->2553538 bytes
    Encoding CUDA 21 threads = 2808 ms (3734 KB/s), 10485768->2556614 bytes
    Encoding CUDA 42 threads = 1591 ms (6590 KB/s), 10485768->2560395 bytes
    Encoding CUDA 84 threads = 1107 ms (9472 KB/s), 10485768->2566211 bytes
    Encoding CUDA 168 threads = 1046 ms (10024 KB/s), 10485768->2574894 bytes
    Encoding CUDA 336 threads in 336 blocks = 999 ms (10496 KB/s), 10485768->2587506 bytes
    Encoding CUDA 336 threads in 168 blocks = 952 ms (11014 KB/s), 10485768->2587506 bytes
    Encoding CUDA 336 threads in 84 blocks = 999 ms (10496 KB/s), 10485768->2587506 bytes

    bcm (542.740 bytes per thread memory)
    Sorting CPU = 1966 ms (5333 KB/s)
    Encoding CPU 1 thread = 4383 ms (2392 KB/s), 10485768->2421049 bytes
    Encoding CUDA 336 threads in 168 blocks = 2340 ms (4481 KB/s), 10485768->2478555 bytes

    bsc (3.578.964 bytes per thread memory)
    Sorting CPU = 1872 ms (5601 KB/s)
    Encoding CPU 1 thread = 1029 ms (10190 KB/s), 10485768->2424133 bytes
    Encoding CUDA 224 threads = 2434 ms (4308 KB/s), 10485768->2483240 bytes
    Encoding CUDA 112 threads = 2137 ms (4906 KB/s), 10485768->2462086 bytes
    Encoding CUDA 64 threads = 2075 ms (5053 KB/s), 10485768->2450479 bytes
    Encoding CUDA 32 threads = 3557 ms (2947 KB/s), 10485768->2440428 bytes

    I'm no expert in CUDA optimization, so these results can be improved.
    It seems that the main problem with GPU compression is the GPU's global memory latency, or the small amount of fast shared memory.
    Last edited by inikep; 21st January 2011 at 22:55.
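    For reference, the block-splitting scheme above (N independent blocks, each with its own model) can be modelled on the CPU in a few lines; zlib here is only a stand-in for the BWT coders, and the thread pool for the CUDA threads:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data: bytes, n_blocks: int):
    """Split `data` into n_blocks independent chunks and compress each one
    separately. More blocks means more parallelism but worse compression,
    because each block starts with an empty model/dictionary -- the same
    trade-off visible in the thread-count benchmarks above."""
    size = (len(data) + n_blocks - 1) // n_blocks
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        return list(pool.map(zlib.compress, blocks))

def decompress_blocks(blocks):
    """Blocks are independent, so decompression parallelises the same way."""
    return b"".join(zlib.decompress(b) for b in blocks)
```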

  5. #65
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,713
    Thanks
    271
    Thanked 1,184 Times in 655 Posts
    That's interesting, but normally isn't it better to run BWT sorting in parallel, instead of encoding?
    And specifically, what about the approach I suggested before - a kind of multithreaded radix sort?
    Basically you make 336 threads which all scan the whole file and extract their own string prefix ranges,
    then sort independently. Although in my case the main idea was to work with a compressed representation
    (to reduce memory usage and make use of GPU's arithmetic capability) -
    see http://encode.su/threads/379-BWT-wit...sed-input-data
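    For illustration, that bucket idea can be modelled serially in Python: each bucket corresponds to one would-be GPU thread, which scans the whole input for suffixes starting with its byte and sorts them independently (the compressed-representation part of the proposal is not modelled here):

```python
def bwt_by_buckets(data: bytes):
    """Toy BWT via independently sorted first-byte buckets.

    Each bucket plays the role of one GPU thread: it collects the suffix
    positions whose first byte it owns and sorts them without looking at
    any other bucket. Concatenating the buckets in byte order yields the
    suffix array. Full suffix slices are compared, so this is O(n^2 log n)
    -- a model of the parallel decomposition, not a real suffix sort.
    """
    s = data + b"\x00"                      # unique smallest sentinel
    buckets = {}
    for i in range(len(s)):                 # every "thread" scans the input
        buckets.setdefault(s[i], []).append(i)
    sa = []
    for first_byte in sorted(buckets):      # buckets are already disjoint
        sa.extend(sorted(buckets[first_byte], key=lambda i: s[i:]))
    return bytes(s[i - 1] for i in sa)      # BWT = last column
```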

  6. #66
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by Shelwien View Post
    That's interesting, but normally isn't it better to run BWT sorting in parallel, instead of encoding?
    Of course it's better. Duane Merrill claims (http://code.google.com/p/back40computing) that his radix sorting algorithm exceeds 1G keys/sec on a GTX 480, so this problem seems to be solved. For me it means that you can speed up any BWT-based compressor by replacing its sorting with GPU sorting, so I wanted to see what can be done with BWT modeling+encoding.

  7. #67
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,713
    Thanks
    271
    Thanked 1,184 Times in 655 Posts
    Thanks for the link, but it seems that it's not a BWT sort, and BWT potentially adds O(N) per key comparison.
    And as to postcoding, I think some hierarchical model is necessary for that (something like frequency table compression) -
    forcing sequential methods to compress small blocks isn't really a good idea.
    Btw, if we can afford less efficient coding, then it should be easy enough to get 300-400MB/s on cpu with huffman etc.
    And even if we can't, compression part is not really an issue, because all the data are available and we can thread stuff easily.
    (eg. in each thread find the occurrences of specific context and only update its model)
    But for decoding w/o compression loss I still only know a method of entropy coding with 2 threads, which is not enough for GPU
    (well, it can be any number of threads actually, but the model still has to run sequentially)

  8. #68
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    1. True, radix sort is not enough for BWT, but we can use GPU merge sort:
    http://thrust.googlecode.com/svn/tag...rge__sort.html
    2. Dividing data into blocks is not the best idea, but it's simple and works quite well (results above) with blocks bigger than 50-100 kb. The memory usage, however, is N times bigger. I've tried to use one common model for all BWT threads/compressors, but it was not faster than separate models (and gave much worse compression).

  9. #69
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,479
    Thanks
    26
    Thanked 122 Times in 96 Posts
    GPUs have SIMD engines. For example, the Radeon HD 5870 has 20 SIMD engines. Each SIMD engine has several processors, but all of them share the same instruction pointer. This means that both branches (in case of branching) are processed by every processor, and with while loops every processor must wait for the slowest thread. Using a one-pass radix sort you can do at most an STx transform (where x is a small constant).
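    For illustration, an order-2 ST transform fits a single fixed-size-key sort exactly because the key (the first two bytes of each rotation) is bounded; a toy Python model, with Python's stable sort standing in for a one-pass radix sort:

```python
def st2(data: bytes) -> bytes:
    """Schindler transform of order 2 (toy model): rotations are ordered
    only by their first two bytes -- a fixed-size key that a one-pass
    radix sort can handle -- and ties keep input order (stable sort),
    which is what makes the transform invertible."""
    n = len(data)
    idx = sorted(range(n), key=lambda i: (data[i], data[(i + 1) % n]))
    return bytes(data[(i - 1) % n] for i in idx)
```

    A full BWT, by contrast, needs unbounded-length key comparisons, which is why a comparison sort (e.g. merge sort) or prefix doubling is needed there.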


    QSufSort does full BWT in O(log n) ordinary fixed-size-integer-key sorts.
    Quote from http://www.larsson.dogma.net/ssrev-tr.pdf :
    Observation 1 (Manber and Myers). Sorting the suffixes using, for each suffix S_i, the position in the h-order of S_i as its primary key, and the position of S_(i+h) in the same order as its secondary key, yields the 2h-order.
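    Observation 1 is exactly the prefix-doubling step; a compact Python sketch of standard Manber-Myers doubling (not qsufsort itself), where each of the O(log n) rounds is an ordinary sort on fixed-size integer pairs:

```python
def suffix_array_doubling(s: bytes):
    """Manber-Myers prefix doubling: O(log n) rounds, each an ordinary sort
    of fixed-size integer pairs (rank of S_i, rank of S_(i+h))."""
    n = len(s)
    rank = list(s)                       # 1-order: rank by first byte
    sa = list(range(n))
    h = 1
    while h < n:
        # key for suffix i in the 2h-order: (primary, secondary) ranks
        key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):            # dense re-ranking after the sort
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:        # all ranks distinct: done early
            break
        h *= 2
    return sa
```

    Each round is a sort on small fixed-size keys, which is the part a GPU radix sort handles well.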

  10. #70
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've found two interesting papers about GPU Suffix Array Construction:
    http://web.iiit.ac.in/~abhishek_shuk...ber%202009.pdf
    http://www.nvidia.com/content/GTC/po...Prefix-Sum.pdf

  11. #71
    Member
    Join Date
    Oct 2011
    Location
    Porto Alegre, Brazil
    Posts
    16
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Anyone got any progress?

  12. #72
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Can anyone imagine how you might parallelise or multi-pass Levenshtein Distance and extract optimal path to make it computable for large files in reasonable time using reasonable resources?

  13. #73
    Member
    Join Date
    Oct 2011
    Location
    Porto Alegre, Brazil
    Posts
    16
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by willvarfar View Post
    Can anyone imagine how you might parallelise or multi-pass Levenshtein Distance and extract optimal path to make it computable for large files in reasonable time using reasonable resources?
    http://mohamedfahmed.wordpress.com/2...ing-on-gpgpus/
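    The usual way such GPU treatments parallelise the edit-distance DP is by anti-diagonal wavefronts: every cell on one anti-diagonal depends only on the two previous diagonals, so a whole diagonal can be computed in parallel. A serial Python model of that dependency structure:

```python
def levenshtein_antidiagonal(a: str, b: str) -> int:
    """Levenshtein distance computed anti-diagonal by anti-diagonal.

    All cells with i + j == d are mutually independent, so on a GPU each
    could go to its own thread; here they are just an inner loop."""
    n, m = len(a), len(b)
    prev2, prev = None, None             # diagonals d-2 and d-1, keyed by i
    for d in range(n + m + 1):
        cur = {}
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i == 0:
                cur[i] = j               # first row: j insertions
            elif j == 0:
                cur[i] = i               # first column: i deletions
            else:
                sub = prev2[i - 1] + (a[i - 1] != b[j - 1])
                cur[i] = min(prev[i] + 1,       # D[i][j-1] + 1: insert b[j-1]
                             prev[i - 1] + 1,   # D[i-1][j] + 1: delete a[i-1]
                             sub)               # match / substitute
        prev2, prev = prev, cur
    return prev[n]
```

    Recovering the optimal path needs either the full table or a divide-and-conquer (Hirschberg-style) traceback; this sketch only models the distance computation.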

  14. #74
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Gorgeous!

    I had fuzzy ideas along those lines but rather doubted my intuition could in any way be correct.

    So how does a Levenshtein path help with compressing deltas? Anyone care to hazard a guess at how it might help compression compared to, say, xdelta or bsdiff?

  15. #75
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,713
    Thanks
    271
    Thanked 1,184 Times in 655 Posts
    > how it might help compression compared to, say, xdelta or bsdiff?

    Likely won't be of any help.

    The only type of data where replace/delete/insert operations on bytes are relevant
    is revisions of wiki articles or some such.
    And even there it's far from perfect, because utf8 diffing is not the same as byte diffing,
    and typical editing includes not only fixing of typos, but also moving words around,
    changing suffixes etc, thus a linguistic model is necessary to build really efficient diffs.

    And for binaries it's much more relevant to parse the format structure - if some field
    contains the size of some structure at other place, there's no sense in diffing these sizes.

    Also the levenshtein distance by definition is not the best for compression, because
    compression requires an entropy-based metric instead of simple counts.

  16. #76
    Member Raymond_NGhM's Avatar
    Join Date
    Oct 2008
    Location
    UK
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Alenka is a modern analytical database engine written to take advantage of vector-based processing
    and the high bandwidth of modern GPUs.

    Features include:

    -Vector-based processing
    CUDA programming model allows a single operation to be applied to an entire set of data at once.

    -Self-optimizing compression
    Ultra-fast compression and decompression performed directly on the GPU

    -Column-based storage
    Minimize disk I/O by only accessing the relevant data

    -Fast database loads
    Data load times measured in minutes, not in hours.

    -Open source and free

    Some benchmarks:
    Alenka : Pentium E5200 (2 cores), 4 GB of RAM, 1x2TB hard drive , NVidia GTX 260
    Current Top #10 TPC-H 300GB non-clustered performance result : MS SQL Server 2005 : Hitachi BladeSymphony (8 CPU/8 Cores), 128 GB of RAM, 290x36GB 15K rpm drives
    Current Top #7 TPC-H 300GB non-clustered performance result : MS SQL Server 2005 : HP ProLiant DL585 G2 (4 CPU/8 Cores), 128 GB of RAM, 200x36GB 15K rpm drives

    Time of Q1 of TPC-H Scale Factor 300 :

    Alenka : 178s
    Top#10: 485s
    Top#7 : 309s

    http://sourceforge.net/projects/alenka/files/


