I want to discuss the schemes that zstd, lz5 and other LZ algorithms may use to implement m/t or GPU operation.
1. M/T parsing
The simple approach to m/t compression, used by 4x4/lzma2/pzstd/mtzstd, is to break the data into several [overlapping] blocks and compress them in parallel. Unfortunately, this raises memory requirements to O(threads*dictsize). Here I describe a scheme that limits memory usage to O(dictsize).
This requires splitting the work into two independent tasks, match finding and parsing, as lzma has already done for years. Since match finding precedes parsing, we can't find matches only for the positions where the parser stops - we have to find matches for every position in the file. This makes fast parsing strategies (lazy/greedy) less interesting here, since an important part of their speedup is exactly the ability to skip match searching for most positions.
So, the first group of threads should find matches for all positions and the second group should use [optimal] parsing to build the LZ output. The list of matches found should be split into multiple parts, each part parsed and compressed independently, and then the results concatenated. This requires some support from the compressed data format, though - it should allow concatenation of such streams, i.e. either reset the entropy coding model (for lzma-like codecs) or encode the entropy model as part of the compressed data itself (like deflate/zstd/lz5 do). Note that for lzma-like codecs it's preferable to use larger parsing blocks, since we lose compression ratio on each reset, while for deflate-like codecs it's ok to concatenate individual compressed blocks, so we may split the input data, e.g., into 128 KB blocks plus points where extra-large matches are found (longer than FAST_BYTE chars in lzma terminology).
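To make the splitting rule concrete, here is a rough C sketch of one possible reading of it (the function name, the Match layout and the 128 KB / LONG_MATCH constants are all my assumptions): cut the input roughly every 128 KB, but if an extra-large match crosses the cut, move the boundary to the end of that match, since the parser takes such matches whole anyway. Each resulting block is then parsed and compressed independently and the outputs are concatenated.
Code:
#include <stddef.h>
#include <stdint.h>

#define PARSE_BLOCK  (128 * 1024)
#define LONG_MATCH   273                 /* "larger than fast bytes" threshold, assumed */

typedef struct { uint32_t len; } Match;  /* best match length found at each position */

/* Return where the parse block starting at 'start' should end. */
size_t next_block_end(const Match *m, size_t total, size_t start)
{
    size_t end = start + PARSE_BLOCK;
    if (end >= total)
        return total;
    /* if an extra-large match crosses the boundary, move the boundary
       to the end of that match - the parser takes such matches whole anyway */
    for (size_t p = start; p < end; p++)
        if (m[p].len >= LONG_MATCH && p + m[p].len > end)
            end = p + m[p].len;
    return end < total ? end : total;
}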
2. M/T matchfinder
The first group of threads in this scheme should fill the match list. They can do that cooperatively by assigning work based on the hash value at each position. E.g. if we use a matchfinder with minlen=4, we need to compute hash(pos) based on pos[0]..pos[3]. Then, depending on 3 bits from this hash, we assign the position to one of 8 threads.
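A minimal sketch of the hashing and the assignment rule (constants and names are my assumptions, not taken from any real codec):
Code:
#include <stdint.h>
#include <string.h>

#define HASH_BITS   16                       /* hash table size = 2^16 rows, assumed */
#define THREAD_BITS 3                        /* 2^3 = 8 matchfinder threads */

uint32_t hash4(const uint8_t *p)             /* hash of pos[0]..pos[3] for minlen=4 */
{
    uint32_t x;
    memcpy(&x, p, 4);
    return (x * 2654435761u) >> (32 - HASH_BITS);
}

unsigned thread_of(uint32_t hash)            /* top 3 bits select the owning thread */
{
    return hash >> (HASH_BITS - THREAD_BITS);
}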
There are two ways to calculate the assignment. The first way is to scan every pos in the main thread, compute the hash and push a {pos,hash} pair to the queue allocated for the corresponding thread. This means a lot of memory traffic, so ideally queue blocks should be kept small enough to restrict this traffic to the L3 cpu cache.
The second approach is that every sub-thread scans every position, filtering out all positions that belong to other threads. This means we do N times more work (where N is the number of sub-threads), but since this work is extremely simple (just a few ALU operations), it may be unimportant for desktop cpus (i.e. while N is limited to 8..16). There is also a combined approach where the main thread sends only pos to the queue and the hash is recomputed by the sub-thread.
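A sketch of the second approach (it reuses hash4()/thread_of() from the sketch above; insert_and_search() is a hypothetical routine that updates only this thread's segment of the table): every sub-thread scans the whole buffer and simply skips positions owned by other threads.
Code:
#include <stddef.h>
#include <stdint.h>

extern uint32_t hash4(const uint8_t *p);                /* from the sketch above */
extern unsigned thread_of(uint32_t hash);
extern void insert_and_search(size_t pos, uint32_t h);  /* hypothetical: own segment only */

void matchfinder_thread(const uint8_t *buf, size_t size, unsigned my_id)
{
    for (size_t pos = 0; pos + 4 <= size; pos++) {
        uint32_t h = hash4(buf + pos);                  /* a few ALU ops, repeated in every thread */
        if (thread_of(h) != my_id)
            continue;                                   /* position belongs to another thread */
        insert_and_search(pos, h);                      /* touches only this thread's segment */
    }
}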
To improve speed, sub-threads can prefetch data from the hash table using hash values ahead of the position being processed. This is easier to implement with the first approach, although it is also possible with the other two.
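Here is how queue-driven prefetch could look in a sub-thread with the first approach (batch size, queue API and table layout are my assumptions): since the hash already sits in the queue, the worker can issue prefetches for a whole batch of upcoming items before doing the actual searches.
Code:
#include <stdint.h>

#define BATCH 16                                       /* assumed prefetch distance */

typedef struct { uint32_t pos, hash; } Work;           /* what the main thread pushed */
extern uint32_t HashTable[][8];                        /* this thread's rows, see 2.1 below */
extern int  queue_pop_batch(unsigned id, Work *w, int max);   /* hypothetical queue */
extern void search_and_insert(const uint8_t *buf, Work w);    /* hypothetical */

void worker(const uint8_t *buf, unsigned my_id)
{
    Work w[BATCH];
    int  n;
    while ((n = queue_pop_batch(my_id, w, BATCH)) > 0) {
        for (int i = 0; i < n; i++)                    /* touch the future rows first... */
            __builtin_prefetch(&HashTable[w[i].hash], 0, 1);
        for (int i = 0; i < n; i++)                    /* ...then do the real searches */
            search_and_insert(buf, w[i]);
    }
}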
2.1 Searching the data
By assigning work depending on the hash value we effectively split the hash tables into several segments, each segment belonging to one thread and searched/updated only by that thread. The details depend on the hash organization:
The HashTable matchfinder (e.g. the one in Tornado) is the simplest one. All pointers are saved into HashTable[hash(pos)][i] slots where i is a small number, e.g. 0..7, so we can divide work between threads based on the higher bits of the hash. All required data from the HashTable can be prefetched once we know the hash value.
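A sketch of one thread's segment of such a table (slot count and names are assumed; check_match() is a hypothetical routine that measures the match length against a candidate):
Code:
#include <stdint.h>
#include <string.h>

#define HASH_BITS 16                                 /* as in the hashing sketch above */
#define SLOTS     8                                  /* candidate positions per row */

static uint32_t HashTable[1 << HASH_BITS][SLOTS];

extern void check_match(const uint8_t *buf, size_t pos, uint32_t cand);  /* hypothetical */

void ht_insert_and_search(const uint8_t *buf, size_t pos, uint32_t h)
{
    uint32_t *row = HashTable[h];                    /* the whole row fits in 1-2 cache lines */
    for (int i = 0; i < SLOTS; i++)
        if (row[i])                                  /* 0 = empty slot */
            check_match(buf, pos, row[i]);
    memmove(row + 1, row, (SLOTS - 1) * sizeof(row[0]));   /* evict the oldest candidate */
    row[0] = (uint32_t)pos;                          /* newest candidate goes first */
}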
The HashChain matchfinder is a bit more tricky. It maintains a list of positions with the head stored in HeadTable[hash(pos)] and next pointers stored in ChainTable[pos]. So, to avoid enormous traffic between cores, the ChainTable should be updated in the main thread:
Code:
for (every pos)
    hash = hashfunc(pos)                // hash of the minlen bytes at pos
    next = HeadTable[hash]              // old head becomes the start of the search
    HeadTable[hash] = pos               // link pos in as the new head
    ChainTable[pos] = next
    Queue[thread(pos)].push(pos,next)   // hand the read-only chain walk to a sub-thread
Since the subsequent search in the ChainTable doesn't modify the table and doesn't depend on the hash value, the thread(pos) function doesn't need to depend on the hash either and may be as simple as pos%THREADS.
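For illustration, a sketch of such a search worker (queue API, MAX_CHAIN and check_match() are my assumptions): since the main thread has already linked pos into the chain, the worker only walks the read-only ChainTable starting from the next pointer it received.
Code:
#include <stdint.h>

#define MAX_CHAIN 128                          /* search depth limit, assumed */

typedef struct { uint32_t pos, next; } Item;   /* what the main thread pushed above */

extern uint32_t ChainTable[];                  /* written only by the main thread */
extern int  queue_pop(unsigned thread_id, Item *out);                      /* hypothetical */
extern void check_match(const uint8_t *buf, uint32_t pos, uint32_t cand);  /* hypothetical */

void chain_worker(const uint8_t *buf, unsigned my_id)
{
    Item it;
    while (queue_pop(my_id, &it)) {
        uint32_t cand = it.next;
        for (int depth = 0; cand != 0 && depth < MAX_CHAIN; depth++) {
            check_match(buf, it.pos, cand);    /* candidates get older along the chain */
            cand = ChainTable[cand];           /* read-only walk, no cross-core writes */
        }
    }
}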
For the BinaryTree, the first step of the search is pretty close to the HashTable case, i.e. each thread can deal with its own part of the HeadTable.
Igor Pavlov already tried a naive implementation where multiple threads share a single lzma-style BinaryTree and found that it is too slow due to massive traffic between the cores. So, the only way to use a single BinaryTree is with HyperThreaded threads on a single core, which can be forced via OS thread-affinity calls.
Alternatively, we can use multiple BinaryTrees - one per thread. This raises memory usage by 50% compared to the lzma-style BT, since it's no longer possible to use "natural indexing" where index=pos. Instead, every node has to keep a {pos,left_index,right_index} tuple.
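A minimal sketch of such a node (field names are my choice):
Code:
#include <stdint.h>

typedef struct {
    uint32_t pos;      /* buffer position of this node */
    uint32_t left;     /* index of the left child in this thread's node array */
    uint32_t right;    /* index of the right child */
} BTNode;              /* 12 bytes per node vs. lzma's 8 (two sons indexed by pos) */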
Finally, I considered a B-Tree organization - it's a natural choice for reducing memory traffic. While B-Trees have larger memory requirements than a single BinaryTree and are thus avoided by current s/t algorithms, they may be the best choice for m/t ones with exhaustive matchfinders. Like with the BT, on the first level we should have a large, directly indexed HashTable, segmented between threads. On lower levels, each node should occupy 64-256 bytes, since each memory transaction usually reads 128 bytes. We may keep only buffer positions on the lowest level, while on higher levels we also need to store indexes of B-Tree nodes. The organization of the B-Tree should be similar to the organization of the lzma BinaryTree, i.e. each entry can point only to entries with smaller buffer positions. It still needs more research - can anyone look into that?
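To show what I mean by node sizes, here is one possible 128-byte layout (all field names and counts are assumptions, not a worked-out design): the lowest level keeps only buffer positions, higher levels also keep child-node indexes, and both node types are sized so that one node costs roughly one memory transaction.
Code:
#include <stdint.h>

#define LEAF_KEYS      31         /* 4-byte header + 31*4 bytes of positions = 128 bytes */
#define INTERNAL_KEYS  15         /* 4 + 15*4 positions + 16*4 child indexes = 128 bytes */

typedef struct {                  /* lowest level: buffer positions only */
    uint16_t count, flags;
    uint32_t pos[LEAF_KEYS];
} BTreeLeaf;                      /* 128 bytes */

typedef struct {                  /* higher levels: positions plus children */
    uint16_t count, flags;
    uint32_t pos[INTERNAL_KEYS];
    uint32_t child[INTERNAL_KEYS + 1];
} BTreeNode;                      /* 128 bytes */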
PS: m/t decompression and GPU compression will be described later.