<schnaader> Got two machines here. First is an 800 MHz AMD, 256 MB RAM :( second one is a 2.2 GHz Celeron, 1 GB RAM. I plan to upgrade to a 3 GHz single-core machine soon and will invest in a nice quad-core machine with enough RAM and a nice GPU card some time after that, to get rid of "my-machine-sucks" problems :) and of course, to be able to get some experience in multithreading and work on GPU. Precomp would get a nice speed boost from running on 4 cores and dropping temporary file usage.
<toffer> well, multithreading is nice indeed. i cut down the time eaten up by my genetic optimizer by a factor of 3, using 4 threads.
<schnaader> Yes, it's useful, at least if you can parallelize the algorithm. I guess Precomp could get almost linear speed-up, as you'll get thousands of independent (de)compression tasks. If there were a deflate implementation on GPUs, you could even get it to lightspeed, but I guess there's too much I/O involved for GPUs...
<toffer> LZ on a GPU... weird stuff
<schnaader> weird and useless... doesn't matter if you get 1 or 10 GB/s if your disk speed is only 100 MB/s :)
<toffer> yep, i didn't take that into account
<Shelwien> as to deflate on GPU, it should be easy, at least with CUDA: you just compile the plain C/C++ code for GPU and dispatch it to run on the GPU from the main unit
<schnaader> it should be easy that way, yes, but would it get faster, too?
<Shelwien> now, that's really unlikely ;)
<schnaader> most compression discussions involving GPUs that I saw said something like "There's too much I/O involved, this won't get you anywhere"...
<Shelwien> GPU is not much faster than core2quad, even on dumb parallel tasks like password cracking
<schnaader> Huh? IIRC, especially on password cracking or things like computing md5 hashes it's much faster, like 10x speedup
<Shelwien> http://3.14.by/en/read/md5_benchmark - no, alas
<schnaader> Elcomsoft, e.g., does password cracking things on GPUs very successfully
<Shelwien> this is much faster than the elcomsoft implementation. you can see the elcomsoft impl there, in fact
<schnaader> OK... nice one
<Shelwien> anyway, its clock is lower than x86, something around 1 GHz afair, and memory access takes ~100 clocks, and all is memory, other than 8k CPU registers. and then, what's even more important for LZ: GPU really hates branches in the code
<schnaader> Ah, I see. That's why they are better for straightforward tasks like video compression
<Shelwien> it seems that it doesn't really process that many threads in parallel, it's more like automatic vectorization than lots of independent parallel cores. So if you'd try to process 100 blocks of arithmetic instructions on registers, it'd really run 100x faster, but if you'd make these blocks unsync with branches etc., so that instruction pointers in all threads would be different, then it would be only 8x faster or something, depending on how many real physical cores there are
<schnaader> of course, even having the same speed as your CPU would still be nice, as you can use both CPU+GPU at the same time :)
<Shelwien> in other words, it's like multicore + hyperthreading, but with somewhat more cores and physical threads. and btw there're virtual threads too. As to "having the same speed as your CPU would still be nice": yes, but that usually isn't worth the work, especially taking into account that for ATI cards you need to repeat it all again
<schnaader> yes, the incompatibility is one of the main disadvantages atm. let's hope things get better. CUDA already is a nice step towards better usability, hopefully OpenCL will solve some problems, too.
<Shelwien> http://nsa.unaligned.org/index.php
<schnaader> yeah, of course nothing beats FPGAs.
<Shelwien> that's not FPGA, that's an FPGA-controlled bunch of cheap GPUs. with plain FPGA, unfortunately, it's very hard to beat core2, because cheap boards run at something like 30 MHz, so even if you make it very parallel it won't really help. and the number of elements there is very limited too, so not that many threads are possible
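
Two of the rough estimates in the log above can be put into numbers with a toy model: the disk bottleneck schnaader mentions (1 or 10 GB/s of compute behind a 100 MB/s disk) and Shelwien's "100x in lockstep, about 8x once the threads diverge". This is only a sketch. The function names are made up for illustration, and the figures of 8 cores with 16 lanes each are hypothetical values picked so the result lines up with the numbers quoted in the chat; they do not describe any real GPU.

```python
# Toy models for two claims in the log; every number here is an illustrative assumption.

def pipeline_throughput(disk_mb_s, compute_mb_s):
    """A disk -> (de)compressor pipeline cannot run faster than its slowest stage."""
    return min(disk_mb_s, compute_mb_s)


def lockstep_speedup(blocks, cores, lanes_per_core):
    """All blocks follow the same instruction stream, so every lane does useful work."""
    return min(blocks, cores * lanes_per_core)


def divergent_speedup(blocks, cores):
    """Worst case: every block takes its own branch, so a core serves one block at a time."""
    return min(blocks, cores)


if __name__ == "__main__":
    # 1 GB/s or 10 GB/s of compute makes no difference behind a 100 MB/s disk.
    for compute in (1_000, 10_000):
        print(compute, "MB/s compute ->", pipeline_throughput(100, compute), "MB/s overall")

    # Hypothetical device: 8 physical cores, 16 lanes per core.
    print("lockstep :", lockstep_speedup(100, cores=8, lanes_per_core=16), "x")
    print("divergent:", divergent_speedup(100, cores=8), "x")
```

The model ignores memory latency and partial divergence, so real code would land somewhere between the two speedup figures.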