Code:
<schnaader>
  Got two machines here. First is 800 MHz AMD, 256 MB RAM :(
  second one is 2.2 GHz Celeron, 1 GB RAM. I plan to
  upgrade to a 3 GHz single-core machine soon and will
  invest in a nice quad-core machine with enough RAM and a
  nice GPU card some time after that to get rid of
  "my-machine-sucks"-problems :)
  and of course, to be able to get some experience in
  multithreading and work on GPU. Precomp would get a nice
  speed boost running on 4 cores and stopping temporary file
  usage
<toffer> 
  well multithreading is nice indeed
  i cut down the time eaten up by my genetic optimizer by a
  factor of 3, using 4 threads
<schnaader> 
  Yes, it's useful at least if you can parallelize the
  algorithm - I guess precomp could get almost linear
  speed-up, as you'll get thousands of independent
  (de)compression tasks
  if there'd be some deflate implementation on GPUs, you
  could even get it to lightspeed, but I guess there's too
  much I/O involved for GPUs...
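
A minimal sketch of the "thousands of independent (de)compression tasks" idea on the CPU side, assuming the streams are already extracted and held in memory. The helper names `make_streams` and `decompress_all` are hypothetical and not part of precomp. As far as I know, CPython's `zlib` releases the GIL during the actual inflate work, so plain threads can overlap here, but the real speed-up depends on the machine and the data.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_streams(count, size=64 * 1024):
    """Build `count` independent deflate streams (hypothetical stand-ins for real input)."""
    return [zlib.compress(bytes([i % 251]) * size) for i in range(count)]


def decompress_all(streams, workers=4):
    """Decompress every stream on a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, streams))


if __name__ == "__main__":
    streams = make_streams(16)
    outputs = decompress_all(streams, workers=4)
    print(len(outputs), "streams,", sum(len(o) for o in outputs), "bytes")
```

Because the tasks share no state, no locking is needed; the pool size of 4 simply mirrors the quad-core machine mentioned above.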
<toffer> 
  lz on a gpu
  weird stuff
<schnaader> 
  weird and useless... doesn't matter if you get 1 or 10
  GB/s if your disk speed is only 100 MB/s :)
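
The disk argument can be written down as a one-line model: a streaming pipeline is no faster than its slowest stage. This is only a sketch that assumes the stages overlap perfectly; `pipeline_throughput` is a hypothetical helper, and the 100 MB/s, 1 GB/s and 10 GB/s figures are the ones from the chat.

```python
def pipeline_throughput(*stage_mb_per_s):
    """Throughput of a fully overlapped pipeline, limited by its slowest stage."""
    return min(stage_mb_per_s)


disk_mb_per_s = 100
for codec_mb_per_s in (1_000, 10_000):  # 1 GB/s and 10 GB/s codec
    total = pipeline_throughput(disk_mb_per_s, codec_mb_per_s)
    print(f"codec {codec_mb_per_s} MB/s -> pipeline {total} MB/s")
```

In both cases the pipeline stays at the disk speed, which is the point being made.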
<toffer>
  yep
  i didn't take that into account
<Shelwien> 
  as to deflate on GPU
  it should be easy at least with Cuda
  you just compile the plain C/C++ code for GPU
  and dispatch it to run on GPU from the main unit
<schnaader> 
  it should be easy that way, yes, but would it get faster, too?
<Shelwien> 
  now, that's really unlikely ;)
<schnaader> 
  most compression discussion that involved GPUs I saw
  said something like "There's too much I/O involved, this
  won't get you anywhere"...
<Shelwien> 
  GPU is not much faster than core2quad even on dumb
  parallel tasks like password cracking
<schnaader> 
  Huh? IIRC, especially on password cracking or things
  like computing md5 hashes it's much faster, like 10x
  speedup
<Shelwien> 
  http://3.14.by/en/read/md5_benchmark
  no, alas
<schnaader> 
  Elcomsoft, for example, does password cracking on GPUs very successfully
<Shelwien> 
  this is much faster than elcomsoft implementation
  you can see elcomsoft impl there, in fact
<schnaader>
  OK... nice one
<Shelwien> 
  anyway, the GPU clock is lower than x86
  something around 1 GHz afair
  and memory access takes ~100 clocks
  and all is memory, other than 8k CPU registers
  and then, what's even more important for LZ
  GPU really hates branches in the code
<schnaader> 
  Ah, I see. That's why they are better for straightforward
  tasks like video compression
<Shelwien> 
  it seems that it doesn't really process that many
  threads in parallel, it's more like automatic
  vectorization than lots of independent parallel cores.
  So if you'd try to process 100 blocks of arithmetic
  instructions on registers, it'd really run 100x faster,
  but if you'd make these blocks unsync with branches etc
  so that instruction pointers in all threads would be
  different,
  then it would be only 8x faster or something
  depending on how many real physical cores there are
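
A toy model of the lockstep-versus-divergence point, using the 100 blocks and roughly 8 physical cores from the chat. It assumes (my simplification, not a description of any real GPU) that the threads are spread evenly over the cores, that a core runs all its threads in one step when they share an instruction pointer, and that it serializes them when they have diverged. Memory latency is ignored, and `simd_speedup` is a hypothetical name.

```python
import math


def simd_speedup(threads, physical_cores, diverged):
    """Speed-up over running the threads one after another on a single unit."""
    per_core = math.ceil(threads / physical_cores)  # threads assigned to the busiest core
    steps = per_core if diverged else 1              # diverged threads run one at a time
    return threads / steps


print(simd_speedup(100, 8, diverged=False))
print(round(simd_speedup(100, 8, diverged=True), 1))
```

Under these assumptions the lockstep case scales with the thread count, while the fully diverged case falls back to roughly the number of physical cores.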
<schnaader> 
  of course, even having the same speed as your CPU would
  still be nice as you can use both CPU+GPU at the same
  time :)
<Shelwien> 
  in other words, it's like multicore + hyperthreading
  but with a few more cores and physical threads
  and btw there're virtual threads too

  As to "having the same speed as your CPU would still be nice" -
  yes, but that usually isn't worth the work, especially
  taking into account that for ATI cards you need to do
  it all over again
<schnaader> 
  yes, the incompatibility is one of the main disadvantages atm
  let's hope things get better. CUDA already is a nice
  step towards better usability, hopefully OpenCL will
  solve some problems, too.
<Shelwien> 
  http://nsa.unaligned.org/index.php
<schnaader> 
  yeah, of course nothing beats FPGAs.
<Shelwien> 
  that's not FPGA
  that's FPGA-controlled bunch of cheap GPUs
  with plain FPGA, unfortunately, it's very hard to beat core2
  because cheap boards run at something like 30 MHz, so even if you
  make it very parallel it won't really help
  and number of elements there is very limited too
  so it's not like that many threads are possible
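
A back-of-the-envelope version of the FPGA clock argument. The 30 MHz figure is from the chat; the 2400 MHz quad-core CPU and the "one operation per clock per unit" rule are assumptions made only for illustration, and `break_even_units` is a hypothetical helper.

```python
import math


def break_even_units(cpu_mhz, cpu_cores, fpga_mhz):
    """Parallel FPGA units needed to match the CPU, at one operation per clock per unit."""
    return math.ceil(cpu_mhz * cpu_cores / fpga_mhz)


# Assumed 2400 MHz quad-core CPU against a 30 MHz FPGA board.
print(break_even_units(2400, 4, 30))
```

With these numbers the design needs a few hundred parallel units just to break even, which is where the limited number of elements on a cheap board becomes the problem.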