The string search sample finally lit the bulb in my mind and I got the GPU idea: you can compute anything as long as you simultaneously process 256 similar items, using masked execution to simulate branching.
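To make the masked-execution point concrete, here is a toy CPU model of the idea (the lane count and data are made up for illustration): all lanes step through both sides of a branch, and a per-lane mask selects which result survives.

```cpp
#include <cstdint>
#include <cstdio>

// Toy model of SIMT masked execution: all 256 "lanes" run the same
// instructions; a per-lane mask decides whose result is kept, so a
// data-dependent branch costs both sides but never diverges.
int main() {
    const int LANES = 256;
    int32_t item[LANES], out[LANES];
    for (int i = 0; i < LANES; i++) item[i] = (i % 2 ? -i : i);

    for (int lane = 0; lane < LANES; lane++) {  // on a GPU: one thread per lane
        bool mask = item[lane] < 0;             // per-lane predicate
        int32_t ifTrue  = -item[lane];          // both branch bodies computed...
        int32_t ifFalse = 2 * item[lane];
        out[lane] = mask ? ifTrue : ifFalse;    // ...mask selects the live one
    }
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 0 1 4 3
}
```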
Sketch of an LZ-on-GPU implementation, based on the HST matchfinder idea:
0. Split the input data into blocks of ~64 MB; the next stages are performed on a single block. Some stages may run on the CPU until an efficient GPU implementation is developed.
1. Find short matches: split the block into 256 chunks and scan every chunk sequentially, looking for 2/3-byte matches. Each chunk needs hash2[256] and hash3[4096] tables to store the last occurrences; with 2-byte entries these tables occupy 8.5 KB, i.e. 2.125 MB for all 256 chunks. At this stage we could also look for 4/5-byte matches (and use HST5/6 at the next stage), but that would fall out of the GPU caches and would therefore probably be too slow. (A per-chunk sketch follows this list.)
2. Find long matches: build an HST4 index of the block. For every position in the index, check the previous 1..256 entries and compute the number of matching bytes. My experience says that, for optimal parsing, every position on average provides 1.5 matches in the increasing-length sequence. The open question is how to compact those 1..256 potential matches into the "increasing-length sequence" and then sort them by the LZ source position. At least it looks suitable for the "mass parallelism" paradigm. (See the HST4 sketch after the list.)
3. Build the optimal path: split the block into chunks, each parsed optimally on its own. The natural split points are long matches (easily found by a multi-threaded sequential scan); otherwise the data may be split after every 768 (a la lzma) .. 32K (a la tornado) positions. So a 64 MB block should provide enough data to process 256 chunks simultaneously. The optimal parser then enumerates all possible matches at every position and builds the path with the minimum price. (A minimal DP sketch follows the list.)
4. Huffman encoding: the LZ output of the previous stage may be split into 256 equal-sized chunks encoded simultaneously. (The last sketch below shows the parallel histogram step.)
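A minimal sketch of the stage-1 per-chunk scan (what a single GPU thread would execute). Table sizes follow the text: 256 + 4096 two-byte entries = 8.5 KB. For simplicity the sketch assumes a chunk of at most 64 KB so a position fits in a 2-byte entry; with 64 MB / 256 = 256 KB chunks one would store e.g. the low 16 bits and rely on the byte verification. The hash functions and names are illustrative, not taken from a real matchfinder.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <cstdio>

struct ShortMatch { uint32_t pos, src, len; };

// Sequential scan of one chunk with the 8.5 KB hash2/hash3 tables;
// every candidate is verified against the data before being reported.
static void scanChunk(const uint8_t* p, uint32_t n, std::vector<ShortMatch>& out) {
    uint16_t hash2[256], hash3[4096];
    memset(hash2, 0xFF, sizeof hash2);      // 0xFFFF = empty slot
    memset(hash3, 0xFF, sizeof hash3);
    for (uint32_t i = 0; i + 3 <= n; i++) {
        uint32_t h2 = (p[i] ^ (p[i+1] << 3)) & 255;
        uint32_t h3 = (p[i] ^ (p[i+1] << 4) ^ (p[i+2] << 8)) & 4095;
        uint16_t c3 = hash3[h3], c2 = hash2[h2];
        if (c3 != 0xFFFF && memcmp(p + c3, p + i, 3) == 0)
            out.push_back({i, c3, 3});      // verified 3-byte match
        else if (c2 != 0xFFFF && memcmp(p + c2, p + i, 2) == 0)
            out.push_back({i, c2, 2});      // verified 2-byte match
        hash2[h2] = hash3[h3] = (uint16_t)i;
    }
}

int main() {
    std::vector<uint8_t> data(1000, 'a');   // trivially repetitive test input
    std::vector<ShortMatch> m;
    scanChunk(data.data(), (uint32_t)data.size(), m);
    printf("%zu short matches found\n", m.size());
}
```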
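Next, a sketch of the stage-2 long-match search, under my reading of the HST4 idea: sort all positions by a 4-byte hash so that equal 4-grams become index neighbors, then let every position compare itself against up to 256 preceding index entries and keep only matches of increasing length. The hash, constants, and names are illustrative.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <algorithm>
#include <cstdio>

struct Match { uint32_t pos, src, len; };

// 4-byte hash used only for grouping; the real HST4 hash may differ.
static uint32_t hash4(const uint8_t* p) {
    uint32_t v; memcpy(&v, p, 4);
    return v * 2654435761u;
}

// Sort positions by hash4, then check up to 256 preceding index entries
// per position, keeping the "increasing length sequence" of matches.
static std::vector<Match> longMatches(const uint8_t* p, uint32_t n) {
    std::vector<uint32_t> idx(n >= 4 ? n - 3 : 0);
    for (uint32_t i = 0; i < idx.size(); i++) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](uint32_t a, uint32_t b) {
        uint32_t ha = hash4(p + a), hb = hash4(p + b);
        return ha != hb ? ha < hb : a < b;   // equal hashes adjacent, older first
    });
    std::vector<Match> out;
    for (size_t i = 1; i < idx.size(); i++) {
        uint32_t pos = idx[i], bestLen = 3;  // stage 1 already covers len <= 3
        for (size_t j = (i >= 256 ? i - 256 : 0); j < i; j++) {
            uint32_t src = idx[j];
            if (src >= pos || hash4(p + src) != hash4(p + pos)) continue;
            uint32_t len = 0;
            while (pos + len < n && p[src + len] == p[pos + len]) len++;
            if (len > bestLen) { bestLen = len; out.push_back({pos, src, len}); }
        }
    }
    return out;
}

int main() {
    const char* s = "abcdefgh_abcdefgh_abcdefgh";
    auto m = longMatches((const uint8_t*)s, (uint32_t)strlen(s));
    for (auto& x : m) printf("pos %u <- src %u, len %u\n", x.pos, x.src, x.len);
}
```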
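For stage 3, a minimal sketch of the optimal parser's core: shortest-path dynamic programming over chunk positions, where every edge is a literal or one of the matches found earlier and the edge weight is an estimated bit price. The price constants are placeholders for a real coder's statistics, and a production parser would bucket matches by position instead of rescanning the list.

```cpp
#include <cstdint>
#include <vector>
#include <cstdio>

struct Match { uint32_t pos, src, len; };

// Shortest-path DP: price[i] = cheapest bit cost to encode the first i
// bytes; edges are literals and matches. LIT_BITS/MATCH_BITS are
// placeholder prices, not a real coder's model.
static std::vector<uint32_t> optimalPath(uint32_t n, const std::vector<Match>& matches) {
    const uint32_t INF = ~0u, LIT_BITS = 8, MATCH_BITS = 20;
    std::vector<uint32_t> price(n + 1, INF), from(n + 1, 0);
    price[0] = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (price[i] == INF) continue;
        if (price[i] + LIT_BITS < price[i + 1]) {   // literal edge i -> i+1
            price[i + 1] = price[i] + LIT_BITS;
            from[i + 1] = i;
        }
        for (const Match& m : matches)              // naive rescan for the sketch
            if (m.pos == i && i + m.len <= n &&
                price[i] + MATCH_BITS < price[i + m.len]) {
                price[i + m.len] = price[i] + MATCH_BITS;
                from[i + m.len] = i;
            }
    }
    std::vector<uint32_t> cuts;                     // cheapest path, walked backwards
    for (uint32_t i = n; i > 0; i = from[i]) cuts.push_back(i);
    return cuts;
}

int main() {
    std::vector<Match> m = { {9, 0, 17} };          // one long match found earlier
    auto cuts = optimalPath(26, m);
    printf("optimal path has %zu edges\n", cuts.size());
}
```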
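Stage 4's parallelism is the simplest: cut the LZ output into 256 equal chunks and encode each independently. The sketch below shows just the per-chunk symbol histogram, the embarrassingly parallel first step of Huffman coding; per-chunk tree construction and bit output are then standard. The input here is a placeholder.

```cpp
#include <cstdint>
#include <algorithm>
#include <vector>
#include <cstdio>

// The LZ output is cut into 256 equal chunks; each chunk gets its own
// symbol histogram (then its own Huffman table and bitstream, omitted).
// Every iteration over c is independent, i.e. one GPU thread each.
int main() {
    std::vector<uint8_t> lzOutput(1 << 20, 'x');    // placeholder token stream
    const size_t CHUNKS = 256;
    size_t chunkSize = (lzOutput.size() + CHUNKS - 1) / CHUNKS;

    std::vector<std::vector<uint32_t>> freq(CHUNKS, std::vector<uint32_t>(256, 0));
    for (size_t c = 0; c < CHUNKS; c++) {
        size_t beg = c * chunkSize;
        size_t end = std::min(beg + chunkSize, lzOutput.size());
        for (size_t i = beg; i < end; i++)
            freq[c][lzOutput[i]]++;
    }
    printf("chunk 0 counts %u tokens '%c'\n", freq[0]['x'], 'x');
}
```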
Overall, this builds up the scheme of an LZH/LZAri compressor with an optimal parser and a non-sliding window. A sliding window can be emulated by shifting out only half of the encoding window and encoding only the second half of the window. Actually, even a 4 MB data overlap should be enough to significantly reduce the loss of compression ratio.
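A small sketch of that overlapped-block emulation: each block keeps the tail of the previous block as match-source-only context and encodes only the bytes after it. Block and overlap sizes follow the numbers above; the driver loop itself is illustrative.

```cpp
#include <cstdint>
#include <algorithm>
#include <cstdio>

// Each 64 MB load keeps the previous OVERLAP bytes as match-source-only
// context; only bytes past the overlap are encoded. Sizes follow the
// text (64 MB blocks, 4 MB overlap).
int main() {
    const uint64_t BLOCK = 64ull << 20, OVERLAP = 4ull << 20;
    uint64_t total = 200ull << 20;                  // example input size
    for (uint64_t enc = 0; enc < total; ) {
        uint64_t ctx = enc < OVERLAP ? 0 : enc - OVERLAP;  // context start
        uint64_t end = std::min(ctx + BLOCK, total);       // last loaded byte + 1
        printf("load [%llu, %llu), encode [%llu, %llu)\n",
               (unsigned long long)ctx, (unsigned long long)end,
               (unsigned long long)enc, (unsigned long long)end);
        enc = end;
    }
}
```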
Also, REP codes (repeat previous offset) should be implemented in the optimal parser in order to reach tornado/rar5 compression levels.
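Finally, a sketch of how REP codes change the parser's price function: a match whose offset repeats the previous one is charged a short rep code instead of a full offset. The bit costs are placeholders, and a real parser (as in lzma) has to carry the previous offset as part of the DP state.

```cpp
#include <cstdint>
#include <cstdio>

// Placeholder prices: a repeated offset costs a tiny REP code, a new
// offset costs a full offset+length code. Real parsers track the last
// few offsets as DP state so this comparison is available at every node.
static uint32_t matchPrice(uint32_t offset, uint32_t prevOffset) {
    const uint32_t REP_BITS = 4, OFFSET_BITS = 20;
    return offset == prevOffset ? REP_BITS : OFFSET_BITS;
}

int main() {
    printf("rep: %u bits, new offset: %u bits\n",
           matchPrice(100, 100), matchPrice(100, 7));
}
```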