Page 1 of 6
Results 1 to 30 of 153

Thread: CUDA/GPU-based BWT/ST4 sorting

  1. #1
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts

    CUDA/GPU-based BWT/ST4 sorting

I've compared CPU- and GPU-based BWT and ST4 sorting algorithms. The tests were performed on an Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with the first 10,485,768 bytes of ENWIK8.

CPU (1 thread):
    BWT divsufsort = 1669 ms (6282 KB/s)
    BWT std::stable_sort = 7316 ms (1433 KB/s) - uses n*log(n) sorting
    ST4 std::stable_sort = 4836 ms (2168 KB/s) - uses n*log(n) sorting
    ST4 BSC = 219 ms (47880 KB/s)
    ST4 GRZip = 188 ms (55775 KB/s)

    GPU:
    BWT thrust::stable_sort = 4960 ms (2114 KB/s) - uses n*log(n) sorting
    ST4 thrust::stable_sort = 2278 ms (4603 KB/s) - uses n*log(n) sorting
    ST4 thrust::stable_merge_sort_by_key = 343 ms (30570 KB/s) - uses n*log(n) sorting
    ST4 thrust::stable_radix_sort_by_key = 94 ms (111550 KB/s)
    ST8 thrust::stable_radix_sort_by_key = 171 ms (61320 KB/s)

ST4 sorting is 2 times faster on the GPU (111550 KB/s) than on the CPU (55775 KB/s). The GPU sorting times include host<=>device memory copy times, which for thrust::stable_radix_sort_by_key take longer than the actual sorting.

So far BWT sorting is slower on the GPU, because I haven't found a GPU implementation of a suffix array construction algorithm. There is one paper (http://web.iiit.ac.in/~abhishek_shuk...ber%202009.pdf) and I've asked the authors about their implementation. So far no response...
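As an aside for readers unfamiliar with sort transforms: ST4 can be sketched on the CPU as a stable sort of positions keyed by the next 4 bytes, the same (key, value) formulation that thrust::stable_radix_sort_by_key exploits on the GPU. This is only a minimal illustration under my own assumptions (wrap-around rotation, big-endian key packing); it is not code from BSC, GRZip, or the benchmark above:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sketch of an order-4 sort transform (ST4): rotations are ordered by their
// first 4 bytes only, with ties broken by original position (stable sort).
// The output is the byte preceding each rotation in sorted order.
std::vector<uint8_t> st4(const std::vector<uint8_t>& in) {
    const size_t n = in.size();
    std::vector<uint32_t> key(n);
    for (size_t i = 0; i < n; ++i) {
        uint32_t k = 0;
        for (size_t j = 0; j < 4; ++j)      // pack 4 context bytes, big-endian,
            k = (k << 8) | in[(i + j) % n]; // wrapping around like a rotation
        key[i] = k;
    }
    std::vector<uint32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),
                     [&](uint32_t a, uint32_t b) { return key[a] < key[b]; });
    std::vector<uint8_t> out(n);
    for (size_t i = 0; i < n; ++i)
        out[i] = in[(idx[i] + n - 1) % n];  // byte preceding each sorted rotation
    return out;
}
```

For ST8 the key simply grows to a uint64_t; the sorting idea is unchanged.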
    Last edited by inikep; 25th January 2011 at 17:39.

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Could you pair sorting with coding, so you count data movement once (and do it one way in a smaller form)?

  3. #3
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by m^2 View Post
    Could you pair sorting with coding, so you count data movement once (and do it one way in a smaller form)?
It's possible if the modeling+encoding stage is also done on the GPU. So far, however, the CPU seems to be faster than the GPU at this stage:

    grzip (39.552 bytes per thread memory)
    Encoding CPU 1 thread = 593 ms (17682 KB/s), 10485768->2551762 bytes
    Encoding CUDA 7 threads = 6895 ms (1520 KB/s), 10485768->2553538 bytes
    Encoding CUDA 21 threads = 2808 ms (3734 KB/s), 10485768->2556614 bytes
    Encoding CUDA 42 threads = 1591 ms (6590 KB/s), 10485768->2560395 bytes
    Encoding CUDA 84 threads = 1107 ms (9472 KB/s), 10485768->2566211 bytes
    Encoding CUDA 168 threads = 1046 ms (10024 KB/s), 10485768->2574894 bytes
    Encoding CUDA 336 threads in 336 blocks = 999 ms (10496 KB/s), 10485768->2587506 bytes
    Encoding CUDA 336 threads in 168 blocks = 952 ms (11014 KB/s), 10485768->2587506 bytes
    Encoding CUDA 336 threads in 84 blocks = 999 ms (10496 KB/s), 10485768->2587506 bytes

    bcm (542.740 bytes per thread memory)
    Encoding CPU 1 thread = 4383 ms (2392 KB/s), 10485768->2421049 bytes
    Encoding CUDA 336 threads in 168 blocks = 2340 ms (4481 KB/s), 10485768->2478555 bytes

    bsc (3.578.964 bytes per thread memory)
    Encoding CPU 1 thread = 1029 ms (10190 KB/s), 10485768->2424133 bytes
    Encoding CUDA 224 threads = 2434 ms (4308 KB/s), 10485768->2483240 bytes
    Encoding CUDA 112 threads = 2137 ms (4906 KB/s), 10485768->2462086 bytes
    Encoding CUDA 64 threads = 2075 ms (5053 KB/s), 10485768->2450479 bytes
    Encoding CUDA 32 threads = 3557 ms (2947 KB/s), 10485768->2440428 bytes

  4. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by inikep View Post
    It's possible if the modeling+encoding stage will be done on GPU. So far, however, CPU seems to be faster than GPU on this stage:

I assumed that the CUDA numbers include moving data to and from the GPU. If so, you count these times twice. Now I see that modelling+encoding time dominates the whole process, but it would still be more accurate to time them both together.

  5. #5
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    159
    Thanks
    0
    Thanked 1 Time in 1 Post
I am glad to see that someone is pushing GPGPU-based compression. I think with the existing GPU implementation it is possible to do something ST-based.

Note: You can save some time on transferring data to the GPU if you compress it first. You could use an o0/o1 static Huffman coder. You could even build the ST transform on top of the compressed input.
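A cheap way to estimate whether pre-compressing the transfer pays off: compute the order-0 entropy of the block, the lower bound that a static order-0 Huffman coder approaches. This helper is only an illustrative sketch of that estimate (my own, not from grzip or bsc):

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// Shannon order-0 bound in bytes for a data block: sum over symbols of
// f * log2(n/f) bits. A static o0 Huffman coder gets close to this, so it
// estimates how much host->device transfer could shrink.
double order0_size_bytes(const std::vector<uint8_t>& data) {
    std::array<uint64_t, 256> freq{};
    for (uint8_t b : data) ++freq[b];          // byte histogram
    double bits = 0.0;
    for (uint64_t f : freq)
        if (f) bits += f * std::log2(double(data.size()) / double(f));
    return bits / 8.0;
}
```

Comparing this estimate against the PCIe copy time for the block tells you whether the saved transfer outweighs the coding cost.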

    Please let me know if you have any questions about grzipii or bsc. Good luck.
    Enjoy coding, enjoy life!

  6. #6
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by m^2 View Post
    I assumed that the CUDA numbers include moving data to and from the GPU. If yes, you include these times twice.
Data transfer times are included; however, they're only about 50 ms and don't change anything in the modeling+encoding stage.

    Quote Originally Posted by Gribok View Post
    Please let me know if you have any questions about grzipii or bsc. Good luck.
    Thanks. Libbsc is exceptionally well written, thanks for this amazing piece of software.

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Quote Originally Posted by inikep View Post
    There is one algorithm (http://web.iiit.ac.in/~abhishek_shuk...ber%202009.pdf) and I've asked the authors about the implementation. So far no response...
I can't see any complete BWT algorithm in this paper. I can't figure out how Figure 1b is related to BWT.

Did they compare BWT on GPU and CPU using their own algorithm in both cases? That's pointless. They should have compared their BWT on the GPU to divsufsort on all CPU cores.


    Quote Originally Posted by inikep View Post
    Data transfer times are included, however, it's about 50 ms
    So actually ST8 using Thrust is 3 times slower than ST4 (about 44ms vs about 121ms)?

  8. #8
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I can't see any complete BWT algorithm in this paper.
    Yes, it's not complete. You can also look at:
    http://web.iiit.ac.in/~abhishek_shuk...nAlgorithm.pdf
    http://web.iiit.ac.in/~abhishek_shukla/suffix

    Quote Originally Posted by Piotr Tarsa View Post
    So actually ST8 using Thrust is 3 times slower than ST4 (about 44ms vs about 121ms)?
    The results of ST4/ST8 sorting without data transfer times:
    ST4 thrust::stable_merge_sort_by_key time = 270 ms (38836 KB/s)
    ST4 thrust::stable_radix_sort_by_key time = 33 ms (317750 KB/s)
    ST8 thrust::stable_radix_sort_by_key time = 84 ms (124830 KB/s)

    BTW. thrust::stable_radix_sort_by_key is based on http://code.google.com/p/back40compu...i/RadixSorting
    Last edited by inikep; 26th January 2011 at 17:57.

  9. #9
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    159
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by inikep View Post
This algorithm is the same as divsufsort; the only difference is that the buckets were renamed from A, B, B* to A, B, C, CA, ... etc. This algorithm is not suitable for GPGPU, because the final step is sequential. That step can't even be parallelized on the CPU.
    Enjoy coding, enjoy life!

  10. #10
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
bwtari (http://www.compressconsult.com/st/bwtari06.zip) requires only 2212 bytes of memory to compress BWT/STx-sorted data. Therefore I've managed to port bwtari to CUDA to use the fast GPU shared memory:

    Encoding CPU = 670 ms (15650 KB/s), 10485768->2571263 bytes
    Encoding CUDA = 468 ms (22405 KB/s), 10485768->2582143 bytes


    I've also created GPU-based ST8 compressor (CUDA ST8 sorting + CUDA bwtari):

    ST8 sorting GPU = 111.09 GPU ms (94387 KB/s)
    Encoding GPU = 400.08 GPU ms (26209 KB/s)
    ST8+encoding GPU = 609 ms (17217 KB/s), 10485760->2606484 bytes


For comparison, here are the single-threaded results of BSC 2.4.5 (ST4+QLFC and ST5+QLFC):
    >bsc e ENWIK ENWIK.bsc -m1fptT
    ENWIK compressed 10485760 into 2653050 in 0.796 seconds.
    >bsc e ENWIK ENWIK.bsc -m2fptT
    ENWIK compressed 10485760 into 2533023 in 1.076 seconds.
    Last edited by inikep; 29th January 2011 at 21:08.

  11. #11
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts

ST5 algorithm?

@inikep: your results sound very interesting to me

congratulations

I read that you have done an ST4 and an ST8 variant

I think it would be wonderful to have a variant of your GPU implementation with ST5 too, because in my test cases ST5 seems the most powerful algorithm

would this be possible?

best regards

  12. #12
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
@joerg: Usually ST8 should give a better compression ratio than ST5:

    ST5 sorting GPU = 106.63 GPU ms (98339 KB/s)
    Encoding GPU = 404.18 GPU ms (25944 KB/s)
    ST5+encoding GPU = 608 ms (17246 KB/s), 10485760->2687342 bytes

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts


    Can you upload an executable of ST8? I'm really curious to test it.

  14. #14
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by encode View Post
    Can you upload an executable of ST8? I'm really curious to test it.
    Attached
Attached Files

  15. #15
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Oops...
    CUDA_ST8 0.1 (c) Dell Inc. Written by P.Skibinski
    w:/Ocarina/cuda/cuda_ST8/cuda_ST8.cu(37) : cudaSafeCall() Runtime API error : CU
    DA driver version is insufficient for CUDA runtime version.
Looks like I have no CUDA (I use an ATI Radeon). Can you please add a CPU version to the archive as well?

  16. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
So you're left with OpenCL. If you know Java, then AMD Aparapi can convert the Java code into OpenCL.

  17. #17
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by encode View Post
    Can you please add a CPU version to the archive as well?
Sorry, but I don't have a CPU version of ST8. It could be done easily with std::stable_sort, but it would be slow.
    Last edited by inikep; 31st January 2011 at 01:08.

  18. #18
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
@inikep: congratulations again on the first CUDA implementation of ST4/ST8 known to me

but sadly I can't test it:

    ------
    CUDA_ST8 requires:

    - GPU with Compute Capability 2.0 (Fermi) or higher
- 2*N of CPU memory and 12*N of GPU memory, where N is the input data size
    ------

    NVIDIA-Website:
    ------
    Compute capability 1.0

    Chip: G80
    Cards: GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
    ------
    Compute capability 1.1

    Chip: G86, G84, G98, G96, G96b, G94, G94b, G92, G92b
    Cards: GeForce 8400GS/GT, 8600GT/GTS, 8800GT, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
    ------
    Compute capability 1.2

    Chip: GT218, GT216, GT215
    Cards: GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
    ------
    Compute capability 1.3

    Chip: GT200, GT200b
    Cards: GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
    ------
    Compute capability 2.0 "Fermi"

    Chip: GF100, GF110
    Cards: GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600,4/5/6000, Plex7000, GTX570, GTX580
    ------
    Compute capability 2.1

    Chip: GF108, GF106, GF104
    Cards: GT 420/30/40, GTS 450, GTX 460, 500M
    ------


ask 1:

Can you tell which card you are using?

ask 2:

If I want to compress a 700 MB file,
how much memory must I have on the graphics board?

ask 3:

Is there any possibility of a program version for the older CUDA compute capability 1.1?


    best regards

    Joerg
    Last edited by joerg; 31st January 2011 at 15:27. Reason: spelling

  19. #19
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by joerg View Post
Is there any possibility of a program version for the older CUDA compute capability 1.1?
Or, better, OpenCL?

  20. #20
    Member
    Join Date
    May 2008
    Location
    England
    Posts
    325
    Thanks
    18
    Thanked 6 Times in 5 Posts
    Yes OpenCL so us ATI users can use em ;p

  21. #21
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by joerg View Post
Can you tell which card you are using?
The tests were performed on an Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with the first 10,485,768 bytes of ENWIK8.

    Quote Originally Posted by joerg View Post
If I want to compress a 700 MB file,
how much memory must I have on the graphics board?
CUDA_ST8 uses 12*N of GPU memory, where N is the input data size. I don't know how much SRTS Radix Sorting (http://code.google.com/p/back40compu...i/RadixSorting) uses, but I've managed to compress files of up to 35 MB with the 1 GB of a GeForce GTX 460. Therefore, it's possible to compress 700 MB in 20 blocks (each of 35 MB).
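The arithmetic above can be checked with a couple of helpers (the function names are hypothetical, not from CUDA_ST8; the 12*N ratio and 35 MB block size come from the post):

```cpp
#include <cstddef>

// How many fixed-size blocks does a file split into? Rounds up, so a
// 700 MB file at 35 MB per block gives 20 blocks.
size_t blocks_needed(size_t file_mb, size_t block_mb) {
    return (file_mb + block_mb - 1) / block_mb;
}

// Device memory one block needs at the stated 12*N ratio:
// 12 * 35 MB = 420 MB, which fits comfortably in a 1 GB card
// (the sort's temporary buffers take more on top of this).
size_t block_device_mem_mb(size_t block_mb) {
    return 12 * block_mb;
}
```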


    Quote Originally Posted by joerg View Post
Is there any possibility of a program version for the older CUDA compute capability 1.1?
A compute capability lower than 2.0 (Fermi) means no L1 cache, and CUDA_ST8 would be very slow.

  22. #22
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by Intrinsic View Post
    Yes OpenCL so us ATI users can use em ;p
Sorry, I wanted to buy a Radeon myself, but I decided to buy a GeForce to have CUDA. It's not so easy to convert CUDA_ST8 to OpenCL, especially as SRTS Radix Sorting has only a CUDA version.

  23. #23
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by inikep View Post
A compute capability lower than 2.0 (Fermi) means no L1 cache, and CUDA_ST8 would be very slow.
Well, on my laptop, which has a GeForce GT 330M (CUDA compute capability 1.2) with the latest driver, it seems there is something wrong. Here is the output:
    Code:
    C:\Users\Osman\Desktop\CUDA_ST8>cuda_ST8.exe readme.txt output.tmp
    CUDA_ST8 0.1  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
    input_size=447 ST8_threads=42*64=2688 compress_threads=48
    +Initialize_Kernel 3.01 GPU ms (148 KB/s)
    +Pre_Sort_Kernel   0.14 GPU ms (3210 KB/s)
    +Radix_Sort_Kernel 1.04 GPU ms (428 KB/s)
    +Post_Sort_Kernel  0.12 GPU ms (3588 KB/s)
    +Deinitialize_Kern 0.07 GPU ms (6824 KB/s)
    Total_Kernel      4.39 GPU ms (102 KB/s)
    Compress_Kernel   0.16 GPU ms (2804 KB/s)
    ST8+compress with CUDA = 105 ms (4 KB/s), 447->0 bytes
    done...
Edit: Judging from the per-task statistics, the GPU and CPU work in an asynchronous fashion. In OpenGL there is an extension to know exactly whether a task has finished (an NVIDIA-specific fence extension). But I don't know of anything similar in CUDA. Am I missing something?
    Last edited by osmanturan; 1st February 2011 at 03:58. Reason: addition...
    BIT Archiver homepage: www.osmanturan.com

  24. #24
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by inikep View Post
    Sorry, I wanted to buy Radeon by myself, but I decided to buy GeForce to have CUDA. It's not so easy to convert CUDA_ST8 to OpenCL, especially as SRTS Radix Sorting has only a CUDA version.
I can't stop wondering: why are so many people into CUDA?
I also own a CUDA-capable card, but I don't intend to use this feature... because what's the point of writing programs that work on 20% of computers? OK, in scientific computing you sometimes code for a particular machine, but otherwise I don't see the point.

  25. #25
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
Not all code is written with utility in mind. Really, from a user perspective compression is a solved problem, and deflate is that solution. The vast majority of compression work since has run on only a tiny fraction of 20% of the world's computers. Code, and compression algorithms especially, is written out of the curiosity of the programmer. That's reason enough. While CUDA or any other GPGPU language might itself wane, the general march of GPGPUs will only ever increase beyond that 20%, so understanding how to best utilize these kinds of platforms is a meaningful endeavour.

  26. #26
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
If I use it in FreeArc, the default mode will compress using ST8 if a GPU is present, and ST5 otherwise, resulting in faster and better compression on GPU-equipped boxes.

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
@Bulat: actually it's probably not so simple, because ST5 and ST8 would have different inverse transforms (slower for ST8).

  28. #28
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by osmanturan View Post
    Well, on my laptop which has a GeForce GT 330M (CUDA compute capability 1.2) with lastest driver, it seems there is something wrong.
According to readme.txt (CUDA_ST8.rar), CUDA_ST8.exe requires a GPU with CUDA Compute Capability 2.0 (Fermi) or higher.
A compute capability lower than 2.0 (Fermi) means no L1 cache, so CUDA_ST8 would be very slow. I've checked it on my laptop.

  29. #29
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by m^2 View Post
    I can't stop wondering, why are so many people into CUDA?
CUDA_ST8 is only a proof-of-concept compressor. It doesn't even have a decompressor (I didn't write the inverse ST8 transform, and I'm afraid a CPU version would be faster). Why did I choose CUDA?
1. CUDA is faster than OpenCL.
2. It's easier to write for CUDA than for OpenCL.
3. I didn't find an OpenCL radix sort for (key, value) pairs.

I agree that a GPU compressor should work on both NVIDIA and AMD/ATI, but it can use separate code paths for CUDA and AMD Stream/APP instead of OpenCL.
    Last edited by inikep; 1st February 2011 at 14:13.

  30. #30
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    @inikep: thank you very much for answering my questions

you wrote: "I didn't find an OpenCL radix sort ..."

I found these links, but I don't know whether they are usable for you:

    http://developer.download.nvidia.com...lRadixSort.zip
    http://developer.download.nvidia.com...lRadixSort.zip
    http://developer.download.nvidia.com...dixSort.tar.gz

about the CUDA version:

    ask 1:

you wrote that you are using a GeForce GTX 460

What about the new GeForce GTX 560 Ti with the "GF114" chip and 1024 MB RAM;
do you expect this card to be faster with your compression program
than the GeForce GTX 460 with the "GF104" and 1024 MB RAM?

    ask 2:

there are "GeForce GTX 560 Ti" cards with 2048 MB RAM on the market too

Do you expect a card with 2048 MB RAM to give faster/better results
with your compression program?

    ask 3:

here is the very powerful ST5 implementation for CPU within bsc
    http://encode.su/threads/586-bsc-new...ing-compressor

if we had an ST5 implementation in CUDA too,
then everybody could easily compare the results ...

do you see any chance of this?
    best regards
    Joerg
