I have done some basic research in this field, and it seems like dictionary-based compressors are still dominant. Am I right? If you can point me to some useful papers or links, it would save me a lot of time. Thanks!
Here is my understanding:
1. LZ family, dictionary-based compression. We either build a global dictionary or use a sliding window. As the input streams in, the longest repetition is matched, and the id or position of the phrase in the dictionary is stored (rough sketch after this list).
2. Entropy coder: dictionary-based methods have a disadvantage, namely a minimum code length. The id or position takes some bits to encode, so for short matches (1-3 bytes) the gain does not cover the cost. We therefore usually combine the LZ family with an entropy coder. Huffman coding is the classic choice; the current fashion is ANS (see the Huffman sketch after this list).
3. PPM/context based. We may still have a sliding window, but instead of just recording previous phrases, a *context tree* is built, and we predict the next symbol from the preceding context (sketch below).
4. BWT based + run-length coding. BWT effectively changes the distribution of the source: the same characters tend to end up next to each other, which gives a chance to apply run-length coding or a similar scheme (sketch below).
5. Re-Pair/byte-pair algorithm. We recursively replace the most frequent adjacent pair of symbols with a new symbol until no pair repeats (sketch below).
6. Other methods, please specify. Are there any other interesting methods?
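To check my understanding of item 1, here is a minimal, naive sliding-window sketch; the function name and token format are my own, and real coders (LZ4, zstd, etc.) use hash chains or suffix structures instead of brute-force search:

```python
def lz77_compress(data: bytes, window: int = 4096, min_match: int = 3):
    """Emit tokens: either ('lit', byte) or ('match', distance, length)."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force scan of the window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(('match', best_dist, best_len))
            i += best_len
        else:
            tokens.append(('lit', data[i]))
            i += 1
    return tokens

print(lz77_compress(b"abcabcabcabd"))  # literals a, b, c, then one long back-reference
```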
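For item 2, here is a small Huffman sketch (my own helper, not any library API) that shows the "minimum code length" point: even the most frequent symbol still costs at least 1 bit, which is why LZ output (literals, lengths, distances) is usually entropy-coded afterwards:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return a map symbol -> code length in bits."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {s: 1 for s in freq}
    # Heap of (weight, tiebreak, {symbol: depth}) nodes.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**n1, **n2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_code_lengths(b"aaaaaaaabbbc"))  # 'a' -> 1 bit, 'b' and 'c' -> 2 bits
```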
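For item 3, this is how I picture the context-modelling idea behind PPM; a real PPM/CM coder feeds the probabilities into an arithmetic coder and "escapes" to shorter contexts, while this sketch (class and method names are my own) only shows the prediction step from order-2 counts:

```python
from collections import defaultdict, Counter

class Order2Model:
    def __init__(self):
        self.counts = defaultdict(Counter)  # 2-byte context -> next-byte counts

    def update(self, data: bytes):
        for i in range(2, len(data)):
            self.counts[data[i-2:i]][data[i]] += 1

    def predict(self, context: bytes):
        """Return (most likely next byte, its probability) or None if context unseen."""
        seen = self.counts.get(context)
        if not seen:
            return None
        sym, cnt = seen.most_common(1)[0]
        return sym, cnt / sum(seen.values())

model = Order2Model()
model.update(b"the theme of the thesis")
print(model.predict(b"th"))  # most likely byte after "th" and its estimated probability
```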
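For item 4, a naive BWT + run-length sketch (O(n^2 log n) rotation sort; real implementations use suffix arrays) that just shows how the transform groups equal bytes so a run-length stage becomes useful:

```python
def bwt(data: bytes) -> bytes:
    s = data + b"\x00"  # sentinel, assumed not to occur in the input
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)

def rle(data: bytes):
    """Return (byte, run_length) pairs."""
    runs, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        runs.append((data[i], j - i))
        i = j
    return runs

transformed = bwt(b"banana_bandana")
print(transformed)        # same characters clustered together
print(rle(transformed))   # hence much shorter run-length output
```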
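For item 5, a rough Re-Pair sketch: repeatedly replace the most frequent adjacent pair with a fresh symbol until no pair occurs more than once. Real implementations keep a priority queue over a doubly-linked sequence to run in linear time; this quadratic version is just to illustrate the idea:

```python
from collections import Counter

def repair(data: bytes):
    seq = list(data)   # sequence of symbol ids (starts as raw byte values)
    rules = {}         # new symbol id -> (left, right)
    next_sym = 256
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:   # stop when no adjacent pair repeats
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules

seq, rules = repair(b"abababcababc")
print(seq, rules)  # compressed sequence plus the grammar rules needed to expand it
```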
Currently, I am designing a compressor for a database engine like leveldb/rocksdb, so I need random access into the compressed data. The LZ family with a global dictionary seems most reasonable (a rough sketch of that idea is below). My question is whether the byte-pair/Re-Pair algorithm has a chance here. Byte re-pairing seems perfect for this scenario, but I have not implemented it. Is it good in practice?
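Here is how I picture the "global dictionary + block-level random access" idea, sketched with the preset-dictionary support in Python's zlib (zstd offers the same concept with trained dictionaries, which is what RocksDB's dictionary compression builds on); the sample dictionary and block contents are made up for illustration:

```python
import zlib

# Hypothetical sample of substrings that recur across many records.
shared_dict = b"user_id:00000,name:,email:@example.com,"

def compress_block(block: bytes) -> bytes:
    c = zlib.compressobj(level=6, zdict=shared_dict)
    return c.compress(block) + c.flush()

def decompress_block(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob) + d.flush()

blocks = [b"user_id:00042,name:alice,email:alice@example.com,",
          b"user_id:00043,name:bob,email:bob@example.com,"]
compressed = [compress_block(b) for b in blocks]

# Random access: decompress only block 1, without touching block 0,
# because every block is compressed independently against the shared dictionary.
print(decompress_block(compressed[1]))
print([len(c) for c in compressed])
```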