Results 61 to 90 of 153

Thread: CUDA/GPU-based BWT/ST4 sorting

  1. #61
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep: thanks for the wonderful news
    CUDA_STx_bsc03.rar is downloadable. Is it possible to download your new version ? CUDA_STx_bsc04.rar ?

    I think that if it is file-compatible with the bsc 2.4.5 implementation,
    then we don't have a problem compressing with your CUDA implementation and decompressing with the bsc implementation.

    sounds wonderful - i would like to test it

    best regards Joerg
    Last edited by joerg; 25th February 2011 at 13:54. Reason: correct

  2. #62
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by Gribok View Post
    I have an idea for a fast ST7 sort for you. You can build an 8-byte data structure like (7-byte suffix, prevByte) and sort those. I assume that the sorting is stable. Now you have a 7th-order sort transform and you just need to output the prevBytes. This allows you to sort plain values instead of sorting <key, value> pairs. Instead of building a data structure, you should pack all the data into a qword and use it for comparison, but ignore one byte. You also don't need the original array; you can replace it with the prevBytes. This should make sorting much faster.
    I've used the newest version of SRTS Radix Sorting (v328 from SVN), which allows ignoring some bits of the keys, and I've implemented your idea:

    CUDA ST5 0.2 = 132 CPU ms (79437 KB/s) 64-bit keys, 32-bit values (pointers)
    CUDA ST5 0.3 = 85 CPU ms (123361 KB/s) 64-bit keys, 8-bit values (chars)
    CUDA ST5 0.4 = 77 CPU ms (136178 KB/s) 64-bit keys only
    BSC 2.4.5 ST5 = 492 CPU ms (21312 KB/s)
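For illustration, here is a minimal CPU-side sketch of the packed-qword idea quoted above. It is not the CUDA code from this thread. It assumes cyclic rotations of the input, a 7-byte context in the upper 56 bits and the preceding byte in the low 8 bits. The names `radix_sort_ignore_low_byte` and `st7_sketch` are hypothetical, and byte-for-byte compatibility with bsc is not claimed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stable LSD radix sort on bits [8, 64) of each qword.
// The low byte (prevByte) travels with the key but is never used as a digit.
void radix_sort_ignore_low_byte(std::vector<std::uint64_t>& keys) {
    std::vector<std::uint64_t> tmp(keys.size());
    for (int shift = 8; shift < 64; shift += 8) {
        std::array<std::size_t, 257> start{};  // start[d] becomes the first slot for digit d
        for (std::uint64_t k : keys) ++start[((k >> shift) & 0xFF) + 1];
        for (int d = 0; d < 256; ++d) start[d + 1] += start[d];
        assert(start[256] == keys.size());     // histogram must cover every key
        for (std::uint64_t k : keys) tmp[start[(k >> shift) & 0xFF]++] = k;
        keys.swap(tmp);
    }
}

// Hypothetical helper: order-7 sort transform over cyclic rotations of t.
std::string st7_sketch(const std::string& t) {
    const std::size_t n = t.size();
    std::vector<std::uint64_t> packed(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t key = 0;
        for (std::size_t j = 0; j < 7; ++j)  // 7 context bytes starting at i
            key = (key << 8) | static_cast<unsigned char>(t[(i + j) % n]);
        const unsigned char prev = static_cast<unsigned char>(t[(i + n - 1) % n]);
        packed[i] = (key << 8) | prev;       // prevByte rides in the ignored byte
    }
    radix_sort_ignore_low_byte(packed);
    std::string out(n, '\0');
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<char>(packed[i] & 0xFF);  // emit only the prevBytes
    return out;
}
```

Because the sort is stable and never looks at the low byte, equal contexts keep their input order, which is the property the quoted post relies on. No separate value array or pointer array is needed.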

  3. #63
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by joerg View Post
    if it is file-compatible with the bsc 2.4.5 - implementation
    It's not compatible with bsc 2.4.5, only the ST5 part is compatible. It's possible to add CUDA ST5 sorting to bsc and use it when a card with CUDA is available.

  4. #64
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep

    "only the ST5 part is compatible.
    It's possible to add CUDA ST5 sorting to bsc and use it when a card with CUDA is available."

    Thank you very much for your explanation. It sounds very interesting to me anyway.

    Would it be possible to download your new version 0.4 for testing purposes ?

    My test results with the last version were, in several cases:

    CUDA_STx 0.3 (c) Dell Inc. Written by P.Skibinski
    ...
    - output of BSC_ST4 and CUDA_ST4 is NOT identical
    ...
    - output of BSC_ST5 and CUDA_ST5 is NOT identical
    ...

    best regards

    joerg

  5. #65
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Attached is CUDA_STx_bsc 0.4 as a library. It has 3 functions that can be used as a replacement for bsc_stX_encode():

    int cuda_st3_encode(unsigned char * T, int n, int block_count, int threads_in_block, bool print_time);
    int cuda_st4_encode(unsigned char * T, int n, int block_count, int threads_in_block, bool print_time);
    int cuda_st5_encode(unsigned char * T, int n, int block_count, int threads_in_block, bool print_time);
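A minimal sketch of how such a drop-in replacement might be wired up with a CPU fallback, in the spirit of "use it when a card with CUDA is available". Everything here is an assumption for illustration: the `StEncodeFn` type, the `st_encode_with_fallback` name, and the convention that a negative return value signals failure. The failed runs later in this thread print index=4294967295, which is how -1 looks as an unsigned 32-bit value, but that convention is not confirmed.

```cpp
#include <cassert>
#include <functional>

// Illustrative callable type: takes (buffer, length) and returns the transform
// index, or a negative value on failure. Not the library's actual interface.
using StEncodeFn = std::function<int(unsigned char*, int)>;

// Try the GPU path first when a CUDA device is present; otherwise, or if the
// GPU path reports failure, fall back to the CPU path.
// Assumption: a failed GPU call leaves T untouched. If that does not hold,
// the caller must keep a copy of the input before calling.
int st_encode_with_fallback(unsigned char* T, int n, bool cuda_available,
                            const StEncodeFn& gpu_encode,
                            const StEncodeFn& cpu_encode) {
    assert(T != nullptr && n >= 0);
    if (cuda_available) {
        const int index = gpu_encode(T, n);
        if (index >= 0) return index;  // GPU transform succeeded
    }
    return cpu_encode(T, n);           // no device, or GPU failure (e.g. out of memory)
}
```

In a real integration, `gpu_encode` could be a lambda that binds `block_count`, `threads_in_block` and `print_time` and forwards to one of the three functions above, while `cpu_encode` forwards to the existing bsc routine.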

    Results on an Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with the first 10.485.768 bytes of ENWIK8:
    Code:
    CUDA_STx_bsc 0.4  (c) Dell Inc.  Written by P.Skibinski
    Kernel Warmup = 118 CPU ms (88862 KB/s)
    BSC ST3 = 201 CPU ms (52167 KB/s)
    CUDA ST3 = 47 CPU ms (223101 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is identical (index=2373116)
    BSC ST4 = 212 CPU ms (49461 KB/s)
    CUDA ST4 = 53 CPU ms (197844 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical (index=2373116)
    BSC ST5 = 560 CPU ms (18724 KB/s)
    CUDA ST5 = 73 CPU ms (143640 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical (index=2373116)
    Last edited by inikep; 1st March 2011 at 19:42.

  6. #66
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep: thank you very much for the new version

    But what am I doing wrong?

    test: compressing the readme.txt from version 0.4
    ---
    CUDA_ST readme.txt readme.st
    CUDA_STx_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    Kernel Warmup = 100 CPU ms (5 KB/s)
    BSC ST3 = 1 CPU ms (587 KB/s)
    CUDA ST3 = 2 CPU ms (293 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is NOT identical (index=162, bsc_index=160)
    BSC ST4 = 1 CPU ms (587 KB/s)
    CUDA ST4 = 3 CPU ms (195 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical (index=162, bsc_index=160)
    BSC ST5 = 9 CPU ms (65 KB/s)
    CUDA ST5 = 4 CPU ms (146 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical (index=162, bsc_index=161)
    done...
    ---

    best regards Joerg

    ps: have you seen this news?
    ---
    A release candidate of CUDA Toolkit 4.0 will be available free of charge beginning March 4, 2011, by enrolling in the CUDA Registered Developer Program at: www.nvidia.com/paralleldeveloper.
    ---
    http://www.anandtech.com/show/4198/n...ounces-cuda-40
    ---

  7. #67
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    @joerg: Please try CUDA_STx_bsc04b.rar (I've removed the old version). I know about CUDA Toolkit 4.0, thanx.

  8. #68
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep
    ---
    CUDA_ST readme.txt readme.st
    CUDA_ST_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    Kernel Warmup = 116 CPU ms (5 KB/s)
    BSC ST3 = 1 CPU ms (587 KB/s)
    CUDA ST3 = 2 CPU ms (293 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is identical (index=160)
    BSC ST4 = 1 CPU ms (587 KB/s)
    CUDA ST4 = 3 CPU ms (195 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical (index=160)
    BSC ST5 = 17 CPU ms (34 KB/s)
    CUDA ST5 = 5 CPU ms (117 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical (index=161)
    done...
    ---
    With the new version 0.4a the problem seems to be solved.
    ---
    ps: the resulting files from version 0.4a and from 0.4b are identical

    best regards joerg
    Last edited by joerg; 3rd March 2011 at 15:09. Reason: spelling

  9. #69
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep:
    In my test I can do a CUDA sort of 13.482.000 bytes.
    How can I sort a file of 650.000.000 bytes?
    Is that possible with the program?

    Can you help me?

    best regards joerg

  10. #70
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,505
    Thanks
    26
    Thanked 136 Times in 104 Posts
    joerg:
    You can split it, for example with 7zip GUI.
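The same splitting can be done in code rather than with an external tool. Below is a minimal sketch that computes (offset, length) ranges of at most `max_block` bytes. The function name is hypothetical. Note that each block is then transformed independently, so the result is not the same as one transform over the whole file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (offset, length) pairs covering [0, total_bytes) in chunks of at
// most max_block bytes; the last chunk holds the remainder.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
split_into_blocks(std::uint64_t total_bytes, std::uint64_t max_block) {
    assert(max_block > 0);
    std::vector<std::pair<std::uint64_t, std::uint64_t>> blocks;
    for (std::uint64_t off = 0; off < total_bytes; off += max_block)
        blocks.emplace_back(off, std::min(max_block, total_bytes - off));
    return blocks;
}
```

With the sizes from the question, 650.000.000 bytes in blocks of 13.482.000 bytes gives 49 blocks, the last one 2.864.000 bytes long.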

  11. #71
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,023
    Thanks
    415
    Thanked 416 Times in 158 Posts
    c:\Test>cuda_stx -p enwik8
    CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
    Device 0: "GeForce GTX 570" with Compute 2.0 capability (480 CUDA cores)
    BSC ST4 = 1040 CPU ms (96153 KB/s)
    BSC ST5 = 2517 CPU ms (39729 KB/s)
    input_size=1024 STx_threads=42*64=2688 bsize=1
    +Initialize_Kernel 0.62 GPU ms (1655 KB/s)
    +Pre_Sort_Kernel 0.08 GPU ms (13474 KB/s)
    +Radix_Sort_Kernel 0.46 GPU ms (2223 KB/s)
    +Post_Sort_Kernel 0.07 GPU ms (14460 KB/s)
    +Deinitialize_Kern 0.05 GPU ms (18935 KB/s)
    STx sorting total 1.28 GPU ms (800 KB/s)
    Kernel Warmup = 113 CPU ms (884955 KB/s)
    input_size=100000000 STx_threads=42*64=2688 bsize=37203
    +Initialize_Kernel 52.54 GPU ms (1903150 KB/s)
    +Pre_Sort_Kernel 20.43 GPU ms (4895196 KB/s)
    d:/stx/src/cuda_STx.cu(207) : cudaSafeCall() Runtime API error : unknown error.

  12. #72
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,023
    Thanks
    415
    Thanked 416 Times in 158 Posts
    Quote Originally Posted by joerg View Post
    PS: searching another Nvidia card ...
    ---
    Gigabyte Nvidia GeForce GTX 560 Ti

    GV-N560OverClock-1GI or GV-N560SuperOverclock-1GI ?

    or better Palit GeForce GTX 560 Ti 2048 MB ?

    or another GeForce GTX 560 Ti ?
    Gainward is the best! And better get Gainward GeForce GTX 570 Golden Sample!

  13. #73
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    @encode: try CUDA_STx_bsc 0.4 with files up to 60 MB (if you have 1280 MB GDDR5)
    -b60 or -b120 or -t128 should give better results on your gfx card
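As a rough rule of thumb, the hint above can be turned into a capacity check. The sketch below simply scales "60 MB of input per 1280 MB of device memory" linearly. That scaling is an extrapolation, not a documented memory model of the tool. The real requirement also depends on the ST order: later in this thread ST3 still succeeds on a 100 MB input, while ST4 and ST5 run out of memory. Both function names are hypothetical.

```cpp
#include <cstdint>

// Largest input (in MB) expected to fit, scaling the "60 MB per 1280 MB" hint
// linearly with device memory. Integer arithmetic, rounded down.
constexpr std::uint64_t max_input_mb(std::uint64_t device_memory_mb) {
    return device_memory_mb * 60 / 1280;
}

// True if an input of input_mb is within the extrapolated limit.
constexpr bool fits_on_device(std::uint64_t input_mb, std::uint64_t device_memory_mb) {
    return input_mb <= max_input_mb(device_memory_mb);
}
```

By this estimate a 100 MB file such as enwik8 does not fit on a 1280 MB card, so it would need to be processed in smaller blocks.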
    Last edited by inikep; 15th March 2011 at 14:13.

  14. #74
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @encode: can you please try the corrected version CUDA_STx_bsc04b.rar ?

    sorry - inikep was faster

  15. #75
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,023
    Thanks
    415
    Thanked 416 Times in 158 Posts
    c:\Test>cuda_st -t128 -p -b120 enwik8
    CUDA_ST_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    Kernel Warmup = 102 CPU ms (980392 KB/s)
    BSC ST3 = 1033 CPU ms (96805 KB/s)
    input_size=100000000 STx_threads=120*128=15360 bsize=6511
    +Initialize_Kernel 39.89 GPU ms (2507108 KB/s)
    +Pre_Sort_Kernel 9.43 GPU ms (10604939 KB/s)
    +Radix_Sort_Kernel 82.67 GPU ms (1209558 KB/s)
    +Post_Sort_Kernel 30.42 GPU ms (3287000 KB/s)
    +Deinitialize_Kern 2.12 GPU ms (47100138 KB/s)
    STx sorting total 164.54 GPU ms (607766 KB/s)
    CUDA ST3 = 182 CPU ms (549450 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is identical (index=2248130
    BSC ST4 = 993 CPU ms (100704 KB/s)
    input_size=100000000 STx_threads=120*128=15360 bsize=6511
    +Initialize_Kernel 44.99 GPU ms (2222529 KB/s)
    +Pre_Sort_Kernel 9.17 GPU ms (10909903 KB/s)
    [w:\ocarina\cuda\cuda_stx_bsc\radix_sort_328\radixsort_api_enactor.cuh, 151] LsbSortEnactor cudaMalloc problem_storage.d_values[1] failed (CUDA error 2: out of memory)
    w:/Ocarina/cuda/cuda_STx_bsc/cuda_st.cu(251) : cutilSafeCall() Runtime API error : out of memory.
    CUDA ST4 = 71 CPU ms (1408450 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical (index=4294967295, bsc_index=2248130
    BSC ST5 = 2306 CPU ms (43365 KB/s)
    input_size=100000000 STx_threads=120*128=15360 bsize=6511
    w:/Ocarina/cuda/cuda_STx_bsc/cuda_st.cu(212) : cutilSafeCall() Runtime API error : out of memory.
    CUDA ST5 = 1 CPU ms (100000000 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical (index=4294967295, bsc_index=2248130
    done...

  16. #76
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,023
    Thanks
    415
    Thanked 416 Times in 158 Posts
    c:\Test>cuda_st -t128 -p -b120 enwik5
    CUDA_ST_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    Kernel Warmup = 132 CPU ms (37878 KB/s)
    BSC ST3 = 51 CPU ms (98039 KB/s)
    input_size=5000000 STx_threads=120*128=15360 bsize=326
    +Initialize_Kernel 2.57 GPU ms (1946386 KB/s)
    +Pre_Sort_Kernel 0.47 GPU ms (10661890 KB/s)
    +Radix_Sort_Kernel 5.07 GPU ms (985388 KB/s)
    +Post_Sort_Kernel 1.73 GPU ms (2892073 KB/s)
    +Deinitialize_Kern 0.34 GPU ms (14744739 KB/s)
    STx sorting total 10.18 GPU ms (491162 KB/s)
    CUDA ST3 = 13 CPU ms (384615 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is identical (index=110784
    BSC ST4 = 48 CPU ms (104166 KB/s)
    input_size=5000000 STx_threads=120*128=15360 bsize=326
    +Initialize_Kernel 2.65 GPU ms (1884642 KB/s)
    +Pre_Sort_Kernel 0.49 GPU ms (10147422 KB/s)
    +Radix_Sort_Kernel 7.02 GPU ms (711988 KB/s)
    +Post_Sort_Kernel 1.65 GPU ms (3034806 KB/s)
    +Deinitialize_Kern 0.39 GPU ms (12765523 KB/s)
    STx sorting total 12.21 GPU ms (409581 KB/s)
    CUDA ST4 = 15 CPU ms (333333 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical (index=110784
    BSC ST5 = 110 CPU ms (45454 KB/s)
    input_size=5000000 STx_threads=120*128=15360 bsize=326
    +Initialize_Kernel 2.73 GPU ms (1830290 KB/s)
    +Pre_Sort_Kernel 0.75 GPU ms (6695950 KB/s)
    +Radix_Sort_Kernel 11.06 GPU ms (452094 KB/s)
    +Post_Sort_Kernel 2.03 GPU ms (2464589 KB/s)
    +Deinitialize_Kern 0.33 GPU ms (15381966 KB/s)
    STx sorting total 16.89 GPU ms (295999 KB/s)
    CUDA ST5 = 20 CPU ms (250000 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical (index=110784
    done...

  17. #77
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    These results are better than expected. GTX 460 is almost two times slower:

    CUDA_ST_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    Kernel Warmup = 116 CPU ms (43103 KB/s)
    BSC ST3 = 98 CPU ms (51020 KB/s)
    CUDA ST3 = 25 CPU ms (200000 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is identical (index=110784
    BSC ST4 = 104 CPU ms (48076 KB/s)
    CUDA ST4 = 28 CPU ms (178571 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical (index=110784
    BSC ST5 = 261 CPU ms (19157 KB/s)
    CUDA ST5 = 37 CPU ms (135135 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical (index=110784
    done...
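To make the "almost two times slower" comparison concrete: the KB/s figures in these logs appear to be the input size in bytes divided by the elapsed milliseconds, truncated to an integer. For the 5,000,000-byte input, 25 ms gives 200000 KB/s. The sketch below reproduces that arithmetic and the GTX 570 vs GTX 460 ratios. The formula is inferred from the printed numbers, not taken from the tool's source, and the function names are hypothetical.

```cpp
#include <cstdint>

// Throughput as the logs appear to report it: bytes per millisecond,
// truncated (numerically equal to KB/s with 1 KB = 1000 bytes).
constexpr std::uint64_t kb_per_s(std::uint64_t input_bytes, std::uint64_t elapsed_ms) {
    return input_bytes / elapsed_ms;
}

// How many times faster the second timing is than the first.
constexpr double speedup(double slow_ms, double fast_ms) {
    return slow_ms / fast_ms;
}
```

With the CPU-ms timings from the two posts (GTX 460 vs GTX 570), the ratios are about 1.92 for ST3 (25 vs 13 ms), 1.87 for ST4 (28 vs 15 ms) and 1.85 for ST5 (37 vs 20 ms).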

  18. #78
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    91
    Thanks
    10
    Thanked 4 Times in 4 Posts
    CUDA_ST_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    CUDA compatible card not found!
    It does not detect my 8800gt.

  19. #79
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I think you have to install newer drivers (263.06+).

  20. #80
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    91
    Thanks
    10
    Thanked 4 Times in 4 Posts
    Yes, you're right. I was still on 197.xx.

    CUDA_ST_bsc 0.4 (c) Dell Inc. Written by P.Skibinski
    Kernel Warmup = 112 CPU ms (145421 KB/s)
    BSC ST3 = 396 CPU ms (41129 KB/s)
    input_size=16287253 STx_threads=42*64=2688 bsize=6060
    +Initialize_Kernel 20.74 GPU ms (785423 KB/s)
    +Pre_Sort_Kernel 60.73 GPU ms (268190 KB/s)
    +Radix_Sort_Kernel 236.82 GPU ms (68774 KB/s)
    +Post_Sort_Kernel 42.34 GPU ms (384665 KB/s)
    +Deinitialize_Kern 4.00 GPU ms (4072237 KB/s)
    STx sorting total 364.63 GPU ms (44668 KB/s)
    CUDA ST3 = 376 CPU ms (43317 KB/s)
    - output of BSC_ST3 and CUDA_ST3 is identical (index=8597354)
    BSC ST4 = 442 CPU ms (36848 KB/s)
    input_size=16287253 STx_threads=42*64=2688 bsize=6060
    +Initialize_Kernel 22.36 GPU ms (728483 KB/s)
    +Pre_Sort_Kernel 73.53 GPU ms (221505 KB/s)
    +Radix_Sort_Kernel 270.57 GPU ms (60197 KB/s)
    +Post_Sort_Kernel 49.36 GPU ms (329956 KB/s)
    +Deinitialize_Kern 3.78 GPU ms (4311279 KB/s)
    STx sorting total 419.60 GPU ms (38817 KB/s)
    CUDA ST4 = 431 CPU ms (37789 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical (index=8597354)
    BSC ST5 = 1264 CPU ms (12885 KB/s)
    input_size=16287253 STx_threads=42*64=2688 bsize=6060
    +Initialize_Kernel 22.89 GPU ms (711684 KB/s)
    +Pre_Sort_Kernel 73.12 GPU ms (222733 KB/s)
    +Radix_Sort_Kernel 237.42 GPU ms (68602 KB/s)
    +Post_Sort_Kernel 31.81 GPU ms (511986 KB/s)
    +Deinitialize_Kern 4.73 GPU ms (3440798 KB/s)
    STx sorting total 369.97 GPU ms (44023 KB/s)
    CUDA ST5 = 387 CPU ms (42085 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical (index=8597354)
    done...

  21. #81
    Member Surfer's Avatar
    Join Date
    Mar 2009
    Location
    oren
    Posts
    203
    Thanks
    18
    Thanked 7 Times in 1 Post
    I have a lame question.
    On the Russian forum ru-board I found a comparison of password crackers using CUDA vs AMD Stream(?).

    Why does nobody develop AMD GPU acceleration?

  22. #82
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    inikep (the programmer) wrote: "I need OpenCL Radix Sort for pairs (keys, values)."

    As I understand it:
    if somebody can deliver such a radix sort implementation in OpenCL,
    then he would complete his program with an OpenCL variant.

    And maybe this is the problem with OpenCL:
    there are not as many supported tools, libraries, and freely available code snippets
    as there are for CUDA.

    CUDA is very actively supported and developed by NVIDIA (e.g. CUDA Toolkit 4.0 RC).

    It seems AMD has 3 APIs:

    1. the "old" ATI Stream
    2. OpenCL
    3. "better to program the GPU directly, without an API" ??
    "developers want a low-level API for the GPU"

    AMD's support for the OpenCL standard seems to be "not so big".

    AMD has a very powerful GPU in the Radeon, but the software side seems to be poor.

  23. #83
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    > Why nobody develops AMD GPU acceleration?

    Because CUDA is a C++ extension and the AMD tools basically use their own language.
    Also, for ATI, having the newest driver doesn't mean that anything will work.

  24. #84
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    For ATI, OpenCL is limited to "CPU only" (for nvidia it's the opposite). Actually, proper OpenCL support from a vendor should include both CPU and GPU. But, AFAIK, it's "still" incomplete on both sides (nvidia vs ati). I think that could be one of the reasons why OpenCL isn't popular yet. BTW, most implementation rules are the same for both OpenCL and CUDA. I mean, one could translate ordinary CUDA source to OpenCL without any real trouble. Moreover, on nvidia cards, the OpenCL implementation is actually a kind of wrapper around CUDA. That could be the reason why inikep once stated that "CUDA is faster than OpenCL" (IIRC).
    BIT Archiver homepage: www.osmanturan.com

  25. #85
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by osmanturan View Post
    For ATI, OpenCL is limited to "CPU only" (for nvidia it's the opposite). Actually, proper OpenCL support from a vendor should include both CPU and GPU. But, AFAIK, it's "still" incomplete on both sides (nvidia vs ati). I think that could be one of the reasons why OpenCL isn't popular yet. BTW, most implementation rules are the same for both OpenCL and CUDA. I mean, one could translate ordinary CUDA source to OpenCL without any real trouble. Moreover, on nvidia cards, the OpenCL implementation is actually a kind of wrapper around CUDA. That could be the reason why inikep once stated that "CUDA is faster than OpenCL" (IIRC).
    Yes, NV has no incentive to support OpenCL properly; it's not popular, so there's little push from customers, yet it's something that can shutter its proprietary technology. Reminds me of how MS claimed that OpenGL was slower than DirectX - until SGI made a proper Windows implementation. How can this be resolved? Some suggest that AMD should implement CUDA for its own GPUs. I don't expect that to happen. Or developers could use OpenCL. Once a poor implementation puts NV at a competitive disadvantage, they would fix the issues.

  26. #86
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,505
    Thanks
    26
    Thanked 136 Times in 104 Posts
    AMD has OpenCL for both CPU and GPU. The problems with OpenCL are, for example, the lack of templating like in C++ (so one has to copy and paste functions and change only the parameter types) and the lack of many functions like bitreverse, populationcount, firstbitset, etc., which are supported directly by hardware.
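To illustrate the templating point: in C++ (and therefore in CUDA device code) a single generic helper can serve both 32-bit and 64-bit sort keys, whereas without templates one copy per key type would be needed, as the post describes. The helper below is a hypothetical example, not code from any of the radix sort libraries discussed here.

```cpp
#include <cstdint>

// Extracts the 8-bit radix digit used in pass `pass` (0 = least significant
// byte). One definition covers every unsigned key width.
template <typename Key>
constexpr unsigned radix_digit(Key key, int pass) {
    return static_cast<unsigned>((key >> (8 * pass)) & 0xFF);
}
```

`radix_digit<std::uint32_t>` and `radix_digit<std::uint64_t>` are both instantiated from the same source, so a fix to the digit logic is made in only one place.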

  27. #87
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    AMD has OpenCL for both CPU and GPU.
    That's good news indeed. I was wondering when it would be implemented.
    BIT Archiver homepage: www.osmanturan.com

  28. #88
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    IMHO OpenCL is slower than CUDA because it has interpreted GPU code (compiled executables from AMD and Nvidia require additional .cl files/OpenCL kernels) instead of compiled GPU code (that is the reason why a CUDA executable is so big).

  29. #89
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    AFAIK it's not interpreted but compiled at runtime.

  30. #90
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've compiled OpenCL RadixSort from CUDA Toolkit 3.2 (attached). It sorts 4M of random 32-bit keys, so it's loosely comparable to ST3.

    Results on Athlon II X4 630 2.8 GHz + GeForce GTX 460:
    NVidia OpenCL RadixSort = 7854 KB/s
    AMD OpenCL RadixSort = 4696 KB/s
    CUDA ST3 = 223101 KB/s
    BSC ST3 = 52167 KB/s

    I don't claim that CUDA is 28 times faster, because it's not, but you can see that the OpenCL RadixSort implementation is not yet sufficiently mature.
    Last edited by inikep; 23rd March 2011 at 18:31.
