It seems you didn't look at the output of the program.
It compresses any file into a zero-byte file. I didn't mean anything about speed. According to your description, which is posted more or less twice in this thread, the compressor should at least compress a file, whether slowly or not.
BIT Archiver homepage: www.osmanturan.com
CUDA_ST8 was compiled just for "arch=compute_20". It's possible to make one executable for both compute 2.0 and 1.0, but it would be two times bigger. In your case the output is always 0 bytes, because I didn't know (until yesterday) how to check GPU kernel execution results. I'm curious: where did you read that a program called CUDA_ST8 would work without CUDA?
Bulat and joerg: I have a little gift for you. Attached is CUDA_STx 0.2, which compares the following ST4 and ST5 (Schindler Transform of order 4 and 5) implementations:
a) CPU-based ST4/ST5 from libbsc 2.4.5 (c) 2009-2011 Ilya Grebnov
b) GPU-based ST4/ST5 written by P.Skibinski that uses SRTS Radix Sorting (c) 2010 Duane Merrill
CUDA_STx requires NVIDIA CUDA-enabled GPU (see http://www.nvidia.com/object/cuda_gpus.html). It's compiled for devices with Compute Compatibility 1.0 and 2.0. Results with GeForce GTX 460:
I don't know why the output of BSC_ST5 and CUDA_ST5 is not identical. Gribok should know.

Code:
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
Kernel Warmup = 54 CPU ms (194180 KB/s)
BSC ST4 = 208 CPU ms (50412 KB/s)
CUDA ST4 = 108 CPU ms (97090 KB/s)
- output of BSC_ST4 and CUDA_ST4 is identical
BSC ST5 = 563 CPU ms (18624 KB/s)
CUDA ST5 = 132 CPU ms (79437 KB/s)
- output of BSC_ST5 and CUDA_ST5 is NOT identical
Last edited by inikep; 1st February 2011 at 21:17.
Well... another misunderstanding. My graphics card already supports CUDA, but with compute capability 1.2 rather than 2.0. So, given your description about speed (lack of L1 cache), I at least expected to see some compression (it could be slow, but that's OK in my case). That's why I wrote "there is something wrong". I wasn't aware of the effect of the "arch" parameter. I've only played with GPGPU the old-fashioned way (with OpenGL via fragment programs).
You can try CUDA_STx 0.2, but you will see something similar to this:
Code:
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "NVS 3100M" with Compute 1.2 capability (16 CUDA cores)
Kernel Warmup = 96 CPU ms (109226 KB/s)
BSC ST4 = 257 CPU ms (40800 KB/s)
CUDA ST4 = 1634 CPU ms (6417 KB/s)
- output of BSC_ST4 and CUDA_ST4 is NOT identical
BSC ST5 = 492 CPU ms (21312 KB/s)
CUDA ST5 = 2343 CPU ms (4475 KB/s)
- output of BSC_ST5 and CUDA_ST5 is NOT identical
No luck. This time it crashed (I've also tried after using timeout.reg). Here is the output:
For further information about my laptop, please have a look at the attachment, which was generated by GPU Caps Viewer.

Code:
C:\Users\Osman\Desktop\CUDA_STx>cuda_STx.exe readme.txt output.tmp
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
Kernel Warmup = 154 CPU ms (3 KB/s)
osmanturan: you can find a fixed executable 4 posts above
Last edited by inikep; 1st February 2011 at 21:19.
Seems it works now. Here is the output, if you are interested:
Code:
C:\Users\Osman\Desktop\CUDA_STx_02>cuda_STx.exe readme.txt output.tmp
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
Kernel Warmup = 153 CPU ms (3 KB/s)
BSC ST4 = 1 CPU ms (582 KB/s)
CUDA ST4 = 4 CPU ms (145 KB/s)
- output of BSC_ST4 and CUDA_ST4 is NOT identical
BSC ST5 = 7 CPU ms (83 KB/s)
CUDA ST5 = 8 CPU ms (72 KB/s)
- output of BSC_ST5 and CUDA_ST5 is identical
done...

C:\Users\Osman\Desktop\CUDA_STx_02>cuda_STx.exe enwik7 output
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GT 330M" with Compute 1.2 capability (48 CUDA cores)
Kernel Warmup = 114 CPU ms (91980 KB/s)
BSC ST4 = 213 CPU ms (49228 KB/s)
CUDA ST4 = 706 CPU ms (14852 KB/s)
- output of BSC_ST4 and CUDA_ST4 is NOT identical
BSC ST5 = 447 CPU ms (23458 KB/s)
CUDA ST5 = 1033 CPU ms (10150 KB/s)
- output of BSC_ST5 and CUDA_ST5 is NOT identical
done...
@inikep
Wonderful!
Thank you very much for this wonderful piece of software
---
first run:
---
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
Not enough memory!
---
second run
---
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
Kernel Warmup = 101 CPU ms (18703 KB/s)
BSC ST4 = 42 CPU ms (44977 KB/s)
CUDA ST4 = 260 CPU ms (7265 KB/s)
- output of BSC_ST4 and CUDA_ST4 is NOT identical
BSC ST5 = 143 CPU ms (13210 KB/s)
CUDA ST5 = 554 CPU ms (3409 KB/s)
- output of BSC_ST5 and CUDA_ST5 is NOT identical
done...
---
ask 1:
Would it be possible to see which file results from CPU-ST5 and
which file results from CUDA-ST5?
ask 2:
Would it be possible to run the ST5 compression on the CPU (BSC mode?)
independently of the presence of an NVIDIA card?
Thanks again!
best regards
Joerg
PS: searching another Nvidia card ...
---
Gigabyte NVIDIA GeForce GTX 560 Ti
GV-N560OverClock-1GI or GV-N560SuperOverclock-1GI ?
or better Palit GeForce GTX 560 Ti 2048 MB ?
or another GeForce GTX 560 Ti?
Last edited by joerg; 4th February 2011 at 18:23. Reason: correct
@inikep: thank you very much for your quick answer
a) if no CUDA-compatible graphics card is present, now:
---
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
CUDA compatible card not found!
BSC ST4 = 204 CPU ms (16504 KB/s)
BSC ST5 = 904 CPU ms (3724 KB/s)
d:/stx/src/cuda_STx.cu(177) : cudaSafeCall() Runtime API error : CUDA driver version is insufficient for CUDA runtime version.
---
the result-file is 0 bytes long
b) if a CUDA-compatible graphics card is present, now:
---
CUDA_STx 0.2 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
BSC ST4 = 1 CPU ms (2173 KB/s)
BSC ST5 = 10 CPU ms (217 KB/s)
Kernel Warmup = 68 CPU ms (31 KB/s)
CUDA ST4 = 5 CPU ms (434 KB/s)
CUDA ST5 = 10 CPU ms (217 KB/s)
done...
---
the result file has exactly the same length in bytes as the input file, but is not identical
---
if the compression results from ST5-BSC and ST5-CUDA are not identical,
i think it would be useful to have the compression results in two independent files
for example:
to outputfilename.st5_bsc and to outputfilename.st5_cuda
best regards
joerg
Last edited by joerg; 4th February 2011 at 23:35. Reason: correct
The output file is ST5-transformed data.
AFAIK the sorting used is stable. Suffixes are handled correctly; they are initialized from the second suffix, but the order of the initialization doesn't matter (remember that I'm sorting (key, value) pairs).
Last edited by inikep; 5th February 2011 at 15:50.
Unfortunately I don't have an NVIDIA GPU to debug with, but I have an idea for a fast ST7 sort for you. You can build an 8-byte data structure like (7-byte suffix, prevByte) and sort these. I assume that the sorting is stable. Now you have an order-7 sort transform, and you just need to output the prevBytes. This allows you to sort only values instead of sorting (key, value) pairs. Instead of building a data structure, you should pack all the data into a qword and use it for comparison, but ignore one byte. You also don't need the original array, as you can replace it with the prevBytes. This should make the sorting much faster.
Enjoy coding, enjoy life!
@Gribok: I like your idea, but I see two problems:
a) it's not possible to ignore one byte in the current implementation of CUDA radix sort (SRTS Radix Sorting)
b) I have no idea how to make a GPU implementation of the reverse STx transform that is faster than a CPU implementation
I've found the problem that caused the ST5 output to differ. It concerns "(inbuf[i+2]<<24)" in the following code:

Code:
keys[i] = inbuf[i+1];
keys[i] <<= 32;
keys[i] += (inbuf[i+2]<<24) + (inbuf[i+3]<<16) + (inbuf[i+4]<<8) + inbuf[i+5];

I assume this is related to a signed/unsigned 32-bit int difference, and this code works fine:

Code:
keys[i] = (inbuf[i+1]<<8) + inbuf[i+2];
keys[i] <<= 32;
keys[i] += (inbuf[i+3]<<16) + (inbuf[i+4]<<8) + inbuf[i+5];
Last edited by inikep; 7th February 2011 at 16:04.
Attached is a new, faster version of CUDA_STx with fixed ST5 output. It could still be made faster using Gribok's idea.
CUDA_STx 0.3 compares the following ST4 and ST5 (Schindler Transform of order 4 and 5) algorithms:
a) CPU-based ST4/ST5 from libbsc 2.4.5 (c) 2009-2011 Ilya Grebnov
b) GPU-based ST4/ST5 written by P.Skibinski that uses SRTS Radix Sorting (c) 2010 Duane Merrill
Results on an Athlon II X4 630 2.8 GHz + GeForce GTX 460 (336 CUDA cores) with the first 10.485.768 bytes of ENWIK8:
Code:
CUDA_STx 0.3 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
BSC ST4 = 207 CPU ms (50655 KB/s)
BSC ST5 = 548 CPU ms (19134 KB/s)
Kernel Warmup = 90 CPU ms (116508 KB/s)
CUDA ST4 = 53 CPU ms (197844 KB/s)
- output of BSC_ST4 and CUDA_ST4 is identical
CUDA ST5 = 85 CPU ms (123361 KB/s)
- output of BSC_ST5 and CUDA_ST5 is identical
Last edited by inikep; 7th February 2011 at 16:14.
@inikep
---
thanks for the new version
---
CUDA_STx 0.3 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce 8600 GS" with Compute 1.1 capability (16 CUDA cores)
BSC ST4 = 1 CPU ms (2173 KB/s)
BSC ST5 = 9 CPU ms (241 KB/s)
Kernel Warmup = 115 CPU ms (18 KB/s)
CUDA ST4 = 5 CPU ms (434 KB/s)
- output of BSC_ST4 and CUDA_ST4 is NOT identical
CUDA ST5 = 8 CPU ms (271 KB/s)
- output of BSC_ST5 and CUDA_ST5 is NOT identical
done...
---
it seems that ST5 profits more from CUDA than ST4 does
maybe focus on the ST5 algorithm?
---
if the results from ST5-BSC and ST5-CUDA are not identical,
i think it would be useful to have the results in two independent files
for example:
to outputfilename.st5_bsc and to outputfilename.st5_cuda
---
best regards
joerg
Not having a decent enough GPU card, I tend to stay out of these tests. However, a small note: surely the RAM issue could be sorted by using a USB flash drive? I'm sure we've all got a few 1-2 GB sticks lying around doing nothing, and I realise they're not as fast as "conventional" RAM, but wouldn't that help the RAM issue? Or is this not a viable solution?
Obviously, you can't use USB-like external storage as memory while running GPU-powered programs. In short, it's not an option. Besides, I've found "regular" flash drives to be very slow (most likely ~10-15 MiB/s read, 5-8 MiB/s write). Instead, I would use file mapping (via the Win32 API) on a hard drive as extra memory space. That's better, IMO. But again, it's not usable in GPU-powered programs; more precisely, a GPU function cannot work under those circumstances.
Okay, thanks for that, osmanturan. Guess I had better start saving for a decent desktop & card then.
@EwenG: The GPU's RAM is not a problem. I can compress files in blocks of up to 40 MB on my GTX 460 1GB. Bigger blocks give only slightly better compression.
You don't need a desktop. You can get good GPU-based compression speed with any Fermi-based card (GeForce 4xx and GeForce 5xx), including mobile ones (especially the 460M, 470M, 480M, and 485M).
inikep, the thing is finance: being disabled limits my income. By the time I've saved enough to buy a decent card for my laptop, it'd be just as easy to buy a desktop. This is why I'm one of those who sit there for hours on end reversing code & optimising routines in assembler by hand.
I'm very interested in the entire GPU idea, as it follows something I've been looking into for a while: the PIC (Programmable IC), which, going by Intel's limitation (e.g. the speed barrier, hence all the multi-cores), might be a worthwhile investment due to the low costs. What I mean is, it would be nice to have a small USB plug-in to handle all the compression & RAM side for you, whilst the bulk of your system can get on with something else. However, that's something for another post. This is not as far off as it seems.
(inikep, sorry for being slightly off-topic...)
@EwenG: PIC + USB is not a good idea, especially for compression, because by today's standards for competitive compression you need either more processing power or more memory. Actually we need both, but we can sacrifice one in favor of the other in some circumstances (not a very general rule). If you bet on memory, you're bound to very limited memory (even 512K models are described as having "big memory"), so you need an MMU beside the microprocessor to connect external memory. Driving an MMU over SPI (which is actually very tricky) with a PIC is not a brilliant idea. As for the other option, speed, PIC is not very good either, because even high-end/low-cost models with an embedded USB stack are actually very slow (~48 MHz, most instructions processed in a single clock). So, in short, PIC is not an option. You could use a couple of ARM processors (for multicore processing) with plenty of memory for your scenario (ARM has a built-in MMU). But in that case you're bound by the USB transmission speed (and please don't quote USB's theoretical speed; it's much lower in practice). So you'd need a faster bus such as PCI, or even better, PCIe. Controlling such a high-speed bus requires FPGA-based solutions, which become expensive in the end. And I haven't even said anything about designing and debugging such high-tech stuff.
@Osmanturan: thanks for clearing that up. Really, I was wondering why nobody HAS done it yet, and your argument certainly explains why people favour the GPU. I won't argue with the theoretical speed either; we all know that a lot of systems "in theory" work faster than they do in the real world.
@inikep: Apologies for going off-topic; I was merely wondering why people favoured the GPU over other options. I'm here to try and learn, and without asking questions that may seem obvious to me, I'll never learn anything. I just hope that I can give something useful back to you guys in future. One thing I'm working on at present is a method of transforming 3D OBJ files, which in simple terms are a bunch of floats written in ASCII. Obviously most people who use these files would have a card capable of this kind of processing, along with a (possible) need for using BWT variants (I won't go into huge detail here as I'm still working out a lot of the issues on paper).
Thank you all for replying - I appreciate it.
I've written a fully working GPU-based ST5 compressor (with Schindler's bwtari as the entropy coder). Decompression is 3 times slower than compression, as the reverse ST5 is CPU-based (bsc 2.4.5); I didn't manage to create a GPU version.
Compression:

Code:
CUDA_STx 0.4 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
Kernel Warmup = 66 CPU ms (158875 KB/s), 10485760->2688 bytes
CUDA ST5+BWTari = 474 CPU ms (22121 KB/s), 10485760->2690156 bytes
Total = 555 ms

Decompression:

Code:
CUDA_STx 0.4 (c) Dell Inc. Written by P.Skibinski
Device 0: "GeForce GTX 460" with Compute 2.1 capability (336 CUDA cores)
Kernel Warmup = 49 CPU ms (54901 KB/s), 2690156->2690156 bytes
CUDA unBWTari = 495 CPU ms (21183 KB/s), 2690156->10485760 bytes
BSC unST5 = 1145 CPU ms (9157 KB/s), 10485760->10485760 bytes
Total = 1706 ms