Results 31 to 60 of 153

Thread: CUDA/GPU-based BWT/ST4 sorting

  1. #31
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by joerg View Post
    I found this link, but I don't know if it is usable for you:
    It's not enough, I need OpenCL Radix Sort for pairs (keys, values).


    Quote Originally Posted by joerg View Post
    What about the new GeForce GTX 560 Ti with "GF114" and 1024 MB RAM,
    do you expect this card will be faster with your compression program
    than the GeForce GTX 460 with "GF104" and 1024 MB RAM?
    I'm sure it will be faster, but I'm not sure by how much. The speed with 1 GB and 2 GB should be the same.

  2. #32
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by inikep View Post
    According to readme.txt (CUDA_ST8.rar) CUDA_ST8.exe requires GPU with CUDA Compute Capability 2.0 (Fermi) or higher.
    Compute capability lower than 2.0 (Fermi) means no L1 cache and CUDA_ST8 will be very slow. I've checked it on my laptop.
    It seems you didn't look at the output of the program: it compresses any file to zero bytes. I wasn't saying anything about speed. According to your description, which has been posted twice in this thread, the compressor should at least compress a file, whether slowly or not.
    BIT Archiver homepage: www.osmanturan.com

  3. #33
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by osmanturan View Post
    It seems you didn't look at the output of the program: it compresses any file to zero bytes. I wasn't saying anything about speed. According to your description, which has been posted twice in this thread, the compressor should at least compress a file, whether slowly or not.
    CUDA_ST8 was compiled only for "arch=compute_20". It's possible to build one executable for both 2.0 and 1.0, but it will be twice as big. In your case the output is always 0 bytes because I didn't know (until yesterday) how to check GPU kernel execution results. I'm curious: where did you read that a program called CUDA_ST8 would work without CUDA?

  4. #34
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Bulat and joerg: I have a little gift for you. Attached is CUDA_STx 0.2, which compares the following ST4 and ST5 (Schindler Transform of order 4 and 5) implementations:
    a) CPU-based ST4/ST5 from libbsc 2.4.5 (c) 2009-2011 Ilya Grebnov
    b) GPU-based ST4/ST5 written by P.Skibinski that uses SRTS Radix Sorting (c) 2010 Duane Merrill

    CUDA_STx requires an NVIDIA CUDA-enabled GPU (see http://www.nvidia.com/object/cuda_gpus.html). It's compiled for devices with Compute Capability 1.0 and 2.0. Results with a GeForce GTX 460:

    Code:
    CUDA_STx 0.2  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
    Kernel Warmup = 54 CPU ms (194180 KB/s)
    BSC ST4 = 208 CPU ms (50412 KB/s)
    CUDA ST4 = 108 CPU ms (97090 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical
    BSC ST5 = 563 CPU ms (18624 KB/s)
    CUDA ST5 = 132 CPU ms (79437 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical
    I don't know why the output of BSC_ST5 and CUDA_ST5 is not identical. Gribok should know.
    Last edited by inikep; 1st February 2011 at 21:17.

  5. #35
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by inikep View Post
    CUDA_ST8 was compiled only for "arch=compute_20". It's possible to build one executable for both 2.0 and 1.0, but it will be twice as big. In your case the output is always 0 bytes because I didn't know (until yesterday) how to check GPU kernel execution results. I'm curious: where did you read that a program called CUDA_ST8 would work without CUDA?
    Well... another misunderstanding. My graphics card does support CUDA, but with compute capability 1.2 rather than 2.0. So, based on your description of the speed (lack of L1 cache), I at least expected to see some compression (it could be slow, but that's OK in my case). That's why I wrote "there is something wrong". I wasn't aware of the effect of the "arch" parameter. I've only played with GPGPU the old-fashioned way (OpenGL fragment programs).
    BIT Archiver homepage: www.osmanturan.com

  6. #36
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by osmanturan View Post
    So, based on your description of the speed (lack of L1 cache), I at least expected to see some compression (it could be slow, but that's OK in my case).
    You can try CUDA_STx 0.2, but you will see something similar to this:

    Code:
    CUDA_STx 0.2  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "NVS 3100M" with Compute 1.2 capability (16 CUDA cores)
    Kernel Warmup = 96 CPU ms (109226 KB/s)
    BSC ST4 = 257 CPU ms (40800 KB/s)
    CUDA ST4 = 1634 CPU ms (6417 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical
    BSC ST5 = 492 CPU ms (21312 KB/s)
    CUDA ST5 = 2343 CPU ms (4475 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical

  7. #37
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    No luck. This time it crashed (I also tried again after applying timeout.reg). Here is the output:
    Code:
    C:\Users\Osman\Desktop\CUDA_STx>cuda_STx.exe readme.txt output.tmp
    CUDA_STx 0.2  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
    Kernel Warmup = 154 CPU ms (3 KB/s)
    For further information about my laptop, please have a look at the attachment which was generated by GPU Caps Viewer.
    BIT Archiver homepage: www.osmanturan.com

  8. #38
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    osmanturan: you can find a fixed executable four posts above.
    Last edited by inikep; 1st February 2011 at 21:19.

  9. #39
    Programmer Gribok
    Join Date
    Apr 2007
    Location
    USA
    Posts
    162
    Thanks
    0
    Thanked 14 Times in 2 Posts
    Quote Originally Posted by inikep View Post
    I don't know why the output of BSC_ST5 and CUDA_ST5 is not identical. Gribok should know.
    This may have happened because, for all ST functions from st.cpp, you need to allocate n + LIBBSC_HEADER_SIZE bytes of memory.
    Enjoy coding, enjoy life!

  10. #40
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    It seems to work now. Here is the output, if you're interested:
    Code:
    C:\Users\Osman\Desktop\CUDA_STx_02>cuda_STx.exe readme.txt output.tmp
    CUDA_STx 0.2  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
    Kernel Warmup = 153 CPU ms (3 KB/s)
    BSC ST4 = 1 CPU ms (582 KB/s)
    CUDA ST4 = 4 CPU ms (145 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical
    BSC ST5 = 7 CPU ms (83 KB/s)
    CUDA ST5 = 8 CPU ms (72 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical
    done...
    
    C:\Users\Osman\Desktop\CUDA_STx_02>cuda_STx.exe enwik7 output
    CUDA_STx 0.2  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
    Kernel Warmup = 114 CPU ms (91980 KB/s)
    BSC ST4 = 213 CPU ms (49228 KB/s)
    CUDA ST4 = 706 CPU ms (14852 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical
    BSC ST5 = 447 CPU ms (23458 KB/s)
    CUDA ST5 = 1033 CPU ms (10150 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical
    done...
    BIT Archiver homepage: www.osmanturan.com

  11. #41
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by Gribok View Post
    This may have happened because, for all ST functions from st.cpp, you need to allocate n + LIBBSC_HEADER_SIZE bytes of memory.
    This is not the cause. Can you look at sources of CUDA_STx (CUDA_STx_02.rar)?

  12. #42
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep
    Wonderful!
    Thank you very much for this wonderful piece of software.

    ---
    first run:
    ---
    CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
    Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
    Not enough memory!
    ---
    second run
    ---
    CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
    Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
    Kernel Warmup = 101 CPU ms (18703 KB/s)
    BSC ST4 = 42 CPU ms (44977 KB/s)
    CUDA ST4 = 260 CPU ms (7265 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical
    BSC ST5 = 143 CPU ms (13210 KB/s)
    CUDA ST5 = 554 CPU ms (3409 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical
    done...
    ---
    Question 1:
    Would it be possible to see which file results from CPU-ST5 and
    which file results from CUDA-ST5?

    Question 2:
    Would it be possible to run the ST5 compression on the CPU (bsc mode?)
    independently of the presence of an NVIDIA card?

    Thanks again!

    best regards

    Joerg

    PS: searching for another Nvidia card ...
    ---
    Gigabyte Nvidia GeForce GTX 560 Ti

    GV-N560OverClock-1GI or GV-N560SuperOverclock-1GI?

    or better, the Palit GeForce GTX 560 Ti 2048 MB?

    or another GeForce GTX 560 Ti?
    Last edited by joerg; 4th February 2011 at 18:23. Reason: correct

  13. #43
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by joerg View Post
    Not enough memory!
    You ran out of RAM (not GPU memory). You can try a smaller file.

    Quote Originally Posted by joerg View Post
    would it be possible to run first the st5-compression on cpu (bsc-mode ?)
    independently from the existence of a nvidia-card?
    Done and attached.

    Quote Originally Posted by joerg View Post
    PS: Gigabyte Nvidia Geforce GFX 560 Ti

    GV-N560OC-1GI or GV-N560SO-1GI ?

    or Palit GeForce GTX 560 Ti 2048 MB ?

    or other Geforce GFX 560 Ti ?
    Gigabyte is OK; I'm happy with my Gigabyte GTX 460. The choice between GV-N560OC-1GI and GV-N560SO-1GI depends on the price difference.

  14. #44
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep: thank you very much for your quick answer

    a) if no CUDA-compatible graphics card is present, now:
    ---
    CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
    CUDA compatible card not found!
    BSC ST4 = 204 CPU ms (16504 KB/s)
    BSC ST5 = 904 CPU ms (3724 KB/s)
    d:/stx/src/cuda_STx.cu(177) : cudaSafeCall() Runtime API error : CUDA driver version is insufficient for CUDA runtime version.
    ---
    the result-file is 0 bytes long

    b) if a CUDA-compatible graphics card is present, now:
    ---
    CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
    Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
    BSC ST4 = 1 CPU ms (2173 KB/s)
    BSC ST5 = 10 CPU ms (217 KB/s)
    Kernel Warmup = 68 CPU ms (31 KB/s)
    CUDA ST4 = 5 CPU ms (434 KB/s)
    CUDA ST5 = 10 CPU ms (217 KB/s)
    done...
    ---
    the result file has exactly the same length in bytes as the input file, but is not identical
    ---

    if the compression results from ST5-BSC and ST5-CUDA are not identical,
    I think it would be useful to have the compression results in two independent files

    for example:
    to outputfilename.st5_bsc and to outputfilename.st5_cuda

    best regards

    joerg
    Last edited by joerg; 4th February 2011 at 23:35. Reason: correct

  15. #45
    Programmer Gribok
    Join Date
    Apr 2007
    Location
    USA
    Posts
    162
    Thanks
    0
    Thanked 14 Times in 2 Posts
    Quote Originally Posted by inikep View Post
    This is not the cause. Can you look at sources of CUDA_STx (CUDA_STx_02.rar)?
    1. It may be due to the fact that the sorting is not stable. If suffixes are equal, their original order should be preserved.
    2. You handle the first suffix incorrectly. The first suffix is s[0], s[1], s[2], s[3], not s[1], s[2], s[3], s[4].
    Enjoy coding, enjoy life!

  16. #46
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by joerg View Post
    the result-file has exactly the same length in bytes as the input-file, but is not identically
    The output file is ST5-transformed data.


    Quote Originally Posted by Gribok View Post
    1. It may be due to the fact that the sorting is not stable. If suffixes are equal, their original order should be preserved.
    2. You handle the first suffix incorrectly. The first suffix is s[0], s[1], s[2], s[3], not s[1], s[2], s[3], s[4].
    AFAIK the sort I use is stable. Suffixes are handled correctly: they are initialized starting from the second suffix, but the order of initialization doesn't matter (remember that I'm sorting (key, value) pairs).
    Last edited by inikep; 5th February 2011 at 15:50.
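
Editorial sketch: the two points debated above (stability, and sorting (key, value) pairs) can be illustrated with a small CPU-side reference. This is not the CUDA_STx or libbsc code. The names `st4_pair`, `st4_cmp` and `st4_reference` are made up for illustration, the context is assumed to be the 4 bytes following each position (mirroring the `inbuf[i+1]...` indexing used in CUDA_STx), and the input is assumed to wrap around cyclically. Since `qsort` is not guaranteed to be stable, the original position is used as a tie-breaker, which gives the order a stable sort would produce.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One (key, value) pair: key = 4 context bytes, pos = original position. */
typedef struct {
    uint32_t key;
    uint32_t pos;
} st4_pair;

/* Order by key; equal keys keep their original order (emulated stability). */
static int st4_cmp(const void *x, const void *y)
{
    const st4_pair *a = (const st4_pair *)x;
    const st4_pair *b = (const st4_pair *)y;
    if (a->key != b->key) return a->key < b->key ? -1 : 1;
    if (a->pos != b->pos) return a->pos < b->pos ? -1 : 1;
    return 0;
}

/* Hypothetical reference ST4 for inputs shorter than 2^32 bytes.
   in and out must not overlap. Returns 0 on success, -1 if malloc fails. */
static int st4_reference(const unsigned char *in, unsigned char *out, size_t n)
{
    assert(in != NULL && out != NULL);
    if (n == 0) return 0;

    st4_pair *p = (st4_pair *)malloc(n * sizeof *p);
    if (p == NULL) return -1;

    for (size_t i = 0; i < n; i++) {
        uint32_t k = 0;
        for (size_t j = 1; j <= 4; j++)
            k = (k << 8) | in[(i + j) % n];   /* assumed cyclic context */
        p[i].key = k;
        p[i].pos = (uint32_t)i;
    }

    qsort(p, n, sizeof *p, st4_cmp);

    for (size_t i = 0; i < n; i++)
        out[i] = in[p[i].pos];                /* byte preceding each sorted context */

    free(p);
    return 0;
}
```

Under these assumptions the 6-byte input `banana` is transformed to `nnbaaa`. Comparing such a reference against the GPU output byte by byte is one way to find the first position where the two diverge.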

  17. #47
    Programmer Gribok
    Join Date
    Apr 2007
    Location
    USA
    Posts
    162
    Thanks
    0
    Thanked 14 Times in 2 Posts
    Quote Originally Posted by inikep View Post
    AFAIK the sort I use is stable. Suffixes are handled correctly: they are initialized starting from the second suffix, but the order of initialization doesn't matter (remember that I'm sorting (key, value) pairs).
    Unfortunately I don't have an Nvidia GPU to debug with, but I have an idea for a fast ST7 sort for you. You can build an 8-byte data structure like (7-byte suffix, prevByte) and sort these. I assume the sorting is stable. Now you have an order-7 sort transform, and you just need to output the prevBytes. This allows you to sort only values instead of sorting <key, value> pairs. Instead of building a data structure, you should pack all the data into a qword and use it for comparison, but ignore one byte. You also don't need the original array: you can replace it with the prevBytes. This should make sorting much faster.
    Enjoy coding, enjoy life!
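
Editorial sketch: the packed-qword idea above can be tried on the CPU before touching the GPU code. The sketch below is an illustration only, with made-up names (`st7_pack`, `st7_forward`), a cyclic 7-byte context assumed to follow each position, and a plain LSD radix sort standing in for the GPU sort. The payload byte is "ignored" simply by never running a radix pass over bits 0..7; whether SRTS can be configured the same way is not assumed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Pack the 7 context bytes after position i into bits 8..63
   and the byte at position i (the "prevByte") into bits 0..7. */
static uint64_t st7_pack(const unsigned char *s, size_t n, size_t i)
{
    uint64_t k = 0;
    for (size_t j = 1; j <= 7; j++)
        k = (k << 8) | s[(i + j) % n];        /* assumed cyclic context */
    return (k << 8) | s[i];
}

/* Stable LSD radix sort over bits 8..63 only, then emit the low bytes.
   Returns 0 on success, -1 if malloc fails. */
static int st7_forward(const unsigned char *in, unsigned char *out, size_t n)
{
    assert(in != NULL && out != NULL);
    if (n == 0) return 0;

    uint64_t *a = (uint64_t *)malloc(n * sizeof *a);
    uint64_t *b = (uint64_t *)malloc(n * sizeof *b);
    if (a == NULL || b == NULL) { free(a); free(b); return -1; }

    for (size_t i = 0; i < n; i++)
        a[i] = st7_pack(in, n, i);

    /* Seven counting-sort passes, one per context byte; bits 0..7 are skipped. */
    for (int shift = 8; shift < 64; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int c = 0; c < 256; c++)
            count[c + 1] += count[c];         /* count[v] = first slot for digit v */
        for (size_t i = 0; i < n; i++)
            b[count[(a[i] >> shift) & 0xFF]++] = a[i];
        uint64_t *t = a; a = b; b = t;        /* sorted data is now in a */
    }

    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(a[i] & 0xFF);

    free(a);
    free(b);
    return 0;
}
```

Because each pass is a stable counting sort, records with equal 7-byte contexts keep their original order, which is the property the idea relies on. No separate value array is sorted; the output byte travels inside the key.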

  18. #48
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    @Gribok: I like your idea, but I see two problems:
    a) it's not possible to ignore one byte in the current implementation of CUDA radix sort (SRTS Radix Sorting)
    b) I have no idea how to make a GPU implementation of the reverse STx transform that is faster than a CPU implementation

  19. #49
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've found the reason the ST5 output was different. It concerns "(inbuf[i+2]<<24)" in the following code:
    Code:
    keys[i] = inbuf[i+1];
    keys[i] <<= 32;
    keys[i] += (inbuf[i+2]<<24) + (inbuf[i+3]<<16) + (inbuf[i+4]<<8) + inbuf[i+5];
    I assume this is related to a signed/unsigned 32-bit int difference; this code works fine:
    Code:
    keys[i] = (inbuf[i+1]<<8) + inbuf[i+2];
    keys[i] <<= 32;
    keys[i] += (inbuf[i+3]<<16) + (inbuf[i+4]<<8) + inbuf[i+5];
    Last edited by inikep; 7th February 2011 at 16:04.
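
Editorial sketch: a likely mechanism for the bug above, assuming `inbuf` is an `unsigned char` array and `keys` holds unsigned 64-bit integers (the post does not show the declarations). In C, `inbuf[i+2]` is promoted to a signed `int` before the shift, so any byte of 0x80 or more shifted left by 24 ends up in the sign bit. When that negative `int` is added to a 64-bit key it is sign-extended and disturbs the upper 32 bits. Widening each byte before shifting avoids this. The helper name `st5_key` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack the 5 bytes after position i into one 64-bit key, most significant
   byte first. Each byte is widened to 64 bits before it is shifted, so no
   intermediate value is ever a (possibly negative) 32-bit int. */
static uint64_t st5_key(const unsigned char *inbuf, size_t i)
{
    assert(inbuf != NULL);
    return ((uint64_t)inbuf[i + 1] << 32) |
           ((uint64_t)inbuf[i + 2] << 24) |
           ((uint64_t)inbuf[i + 3] << 16) |
           ((uint64_t)inbuf[i + 4] << 8)  |
            (uint64_t)inbuf[i + 5];
}
```

The working version in the post sidesteps the problem differently: it never shifts a byte by more than 16 within 32-bit arithmetic. That leaves bits 24..31 of the key unused, but the keys still compare in the same order.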

  20. #50
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Attached is a new, faster version of CUDA_STx with the ST5 output fixed. It could still be made faster using Gribok's idea.

    CUDA_STx 0.3 compares the following ST4 and ST5 (Schindler Transform of order 4 and 5) implementations:
    a) CPU-based ST4/ST5 from libbsc 2.4.5 (c) 2009-2011 Ilya Grebnov
    b) GPU-based ST4/ST5 written by P.Skibinski that uses SRTS Radix Sorting (c) 2010 Duane Merrill

    Results on an Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with the first 10.485.768 bytes of ENWIK8:

    Code:
    CUDA_STx 0.3  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
    BSC ST4 = 207 CPU ms (50655 KB/s)
    BSC ST5 = 548 CPU ms (19134 KB/s)
    Kernel Warmup = 90 CPU ms (116508 KB/s)
    CUDA ST4 = 53 CPU ms (197844 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is identical
    CUDA ST5 = 85 CPU ms (123361 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is identical
    Last edited by inikep; 7th February 2011 at 16:14.

  21. #51
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    412
    Thanks
    38
    Thanked 64 Times in 38 Posts
    @inikep
    ---
    thanks for the new version
    ---
    CUDA_STx 0.3 (c) Dell Inc. Written by P.Skibinski
    Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
    BSC ST4 = 1 CPU ms (2173 KB/s)
    BSC ST5 = 9 CPU ms (241 KB/s)
    Kernel Warmup = 115 CPU ms (18 KB/s)
    CUDA ST4 = 5 CPU ms (434 KB/s)
    - output of BSC_ST4 and CUDA_ST4 is NOT identical
    CUDA ST5 = 8 CPU ms (271 KB/s)
    - output of BSC_ST5 and CUDA_ST5 is NOT identical
    done...
    ---

    it seems that ST5 profits more from CUDA than ST4

    maybe focus on the ST5 algorithm?

    ---
    if the results from ST5-BSC and ST5-CUDA are not identical,
    I think it would be useful to have the results in two independent files

    for example:
    to outputfilename.st5_bsc and to outputfilename.st5_cuda
    ---

    best regards

    joerg

  22. #52
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    Quote Originally Posted by joerg View Post
    the results from ST5-BSC and ST5-CUDA are not identical
    The results should be identical. I think that I've found a bug in SRTS Radix Sorting (it concerns Compute capability lower than 2.0).

  23. #53
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Not having a decent enough GPU card, I tend to stay out of these tests. However, a small note: surely the RAM issue could be solved by using a USB flash drive? I'm sure we've all got a few 1-2 GB sticks lying around doing nothing. I realise they're not as fast as "conventional" RAM, but wouldn't that help with the RAM issue? Or is this not a viable solution?

  24. #54
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Obviously, you can't use external devices such as USB drives as memory for GPU-powered programs. In short, it's not an option. Besides, I've found "regular" flash drives to be very slow (most likely ~10-15 MiB/s read, 5-8 MiB/s write). Instead, I would use file mapping (via the Win32 API) on a hard drive as extra memory space. It's better, IMO. But again, it's not usable in GPU-powered programs. More precisely, a GPU function cannot work under these circumstances.
    BIT Archiver homepage: www.osmanturan.com

  25. #55
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Okay, thanks for that, osmanturan. I guess I had better start saving for a decent desktop and card then.

  26. #56
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    @EwenG: GPU RAM is not a problem. I can compress files in blocks of up to 40 MB on my GTX 460 1GB. Bigger blocks give only slightly better compression.

    You don't need a desktop. You can get good GPU-based compression speed with all Fermi-based cards (GeForce 4xx and GeForce 5xx), including mobile ones (especially the 460M, 470M, 480M and 485M).

  27. #57
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    inikep, the thing is finance: being disabled limits my income. By the time I've saved enough to buy a decent card for my laptop, it'd be just as easy to buy a desktop. This is why I'm one of those who sit there for hours on end reversing code and optimising routines in assembler by hand.

    I'm very interested in the entire GPU idea, as it follows something I've been looking into for a while: the PIC (Programmable IC), which, given Intel's limitation (i.e. the speed barrier, hence all the multi-cores), might be a worthwhile investment due to the low costs. What I mean is, it would be nice to have a small USB plug-in that handles all the compression and RAM side for you, whilst the bulk of your system gets on with something else. However, that's something for another post. This is not as far off as it seems.

  28. #58
    Programmer osmanturan
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    (inikep, sorry for being slightly off-topic...)
    @EwenG: PIC + USB is not a good idea, especially for compression. To be competitive by today's standards, you need either more processing power or more memory. Actually we need both, but we can sacrifice one in favor of the other in some circumstances (not a very general rule). If you bet on memory, you're bound by very limited memory (even 512K models are described as having "big memory"). So you need an MMU beside the microprocessor to connect external memory. In that case, driving an MMU over SPI (which is actually very tricky) with a PIC is not a brilliant idea. As for the other option, speed, the PIC is not very good either, because even the high-end/low-cost models with an embedded USB stack are actually very slow (~48 MHz, most instructions execute in a single clock). So, in short, PIC is not an option. You could use a couple of ARM processors (for multicore processing) with plenty of memory for your scenario (ARM has a built-in MMU). But in that case you are bound by USB transfer speed (please don't say anything about USB's theoretical speed; it's much lower in practice). So you need a faster bus such as PCI or, even better, PCIe. To control a bus that fast you need FPGA-based solutions, which become expensive in the end. And I haven't even said anything about designing and debugging such high-tech stuff.
    BIT Archiver homepage: www.osmanturan.com

  29. #59
    Member
    Join Date
    Feb 2011
    Location
    St. Albans, England
    Posts
    20
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @Osmanturan: thanks for clearing that up. Really, I was wondering why nobody has done it yet, and your argument makes it clear why people favour the GPU. I won't argue with the theoretical speed either: we all know that a lot of systems work faster "in theory" than they do in the real world.

    @inikep: Apologies for going off-topic, I was merely wondering why people favoured the GPU over other options. I'm here to try and learn, and without asking questions, even ones that may seem obvious, I'll never learn anything. I just hope that I can give something useful back to you guys in future. One thing I'm working on at present is a transformation method for 3D OBJ files, which in simple terms are a bunch of floats written in ASCII. Obviously most people who use these files would have a card capable of this kind of processing, along with a (possible) need for using BWT in the variants (I won't go into huge detail here as I'm still working out a lot of the issues on paper).

    Thank you all for replying - I appreciate it.

  30. #60
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    309
    Thanks
    68
    Thanked 173 Times in 64 Posts
    I've written a fully working GPU-based ST5 compressor (with Schindler's bwtari as the entropy coder). Decompression is 3 times slower because the reverse ST5 is CPU-based (bsc 2.4.5); I didn't manage to create a GPU version.

    Compression:
    Code:
    CUDA_STx 0.4  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
    Kernel Warmup = 66 CPU ms (158875 KB/s), 10485760->2688 bytes
    CUDA ST5+BWTari = 474 CPU ms (22121 KB/s), 10485760->2690156 bytes
    Total = 555 ms
    Decompression:
    Code:
    CUDA_STx 0.4  (c) Dell Inc.  Written by P.Skibinski
    Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
    Kernel Warmup = 49 CPU ms (54901 KB/s), 2690156->2690156 bytes
    CUDA unBWTari = 495 CPU ms (21183 KB/s), 2690156->10485760 bytes
    BSC unST5 = 1145 CPU ms (9157 KB/s), 10485760->10485760 bytes
    Total = 1706 ms
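
Editorial note on reading these logs: the KB/s figures appear to be the stage's input size in bytes divided by the elapsed milliseconds, rounded down (so 1 KB = 1000 bytes here). This is inferred from the printed numbers, not from the CUDA_STx sources, and the helper name `kb_per_s` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* bytes per millisecond equals (bytes / 1000) per second, i.e. KB/s
   with 1 KB = 1000 bytes; integer division rounds down. */
static uint64_t kb_per_s(uint64_t input_bytes, uint64_t elapsed_ms)
{
    assert(elapsed_ms > 0);
    return input_bytes / elapsed_ms;
}
```

For example, `kb_per_s(10485760, 474)` gives 22121 and `kb_per_s(10485760, 1145)` gives 9157, matching the "CUDA ST5+BWTari" and "BSC unST5" lines. The totals also bear out the "3 times slower" remark: 1706 ms / 555 ms is about 3.1.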
