-
LZ4, BWT, RLE?
Hi there...
This is my first time here on encode.su and seems cool! :)
I wanted to ask if you guys think it might be a good idea to create this kind of C++ Application: It uses 3 different types of Algorithms before giving the output to the user in this way: LZ4, BWT, RLE:
Is it possible to use LZ4 to create an initial step and then apply BWT to sort data... In the end RLE on data previously sorted by bwt?
-
bwt+rle can compress data 2-3x. this combination is pretty bad in terms of speed/compression ratio, compared to other alternatives (such as bsc of zstd), but if you want to make compressor for learning purposes, it will at least compress. i expect that lz4 preprocessing will decrease overall compression ratio. but yoy can try to tune it increasing minimum match length to 32 or so
for example, best practical bwt-based compressor BSC runs data through lzp+bwt+rle+reversed mtf+special entropy coding. with only bwt+rle you leave ~50% of possible compression on the table, while compression time will be about the same (bwt is the slowest part of this algo sequence)
-
I made a quick test for you using bce (my bwt based compressor) so you can make your own thoughts:
Code:
Program | Time | Size
bce | 20.1 | 20.878.504
lz4 -1 | bce | 12.7 | 35.930.444
lz4 -9 | bce | 9.4 | 35.439.320
drt | bce | 14.4 | 20.672.568
BCE is <2 times slower than bwt+rle but compresses better.
DRT is the dictionary preprocessor of Alexander Rasushnyak.
-
Idea
What about instead of BWT using PPM?
-
ppm usually employs some sort of arithmetic coding so it's full compressor by itself. pre-lz4 will make compression much worse, post-rle is useless
but what's your goal? it should be obvious that you can't beat existing compressors developed by professionals, or even go any close. if you have reasonable C++ experience, i invite you to participate in improving BSC. i know how it should be done, but prefer to collaborate with people who can do part of work
-
I'm so sorry for the late! It is for educational purpose of course! And yes... Seems a great idea!
-
I'm also sorry for late answer :)
The idea is to make a faster compressor based on the BSC by replacing its last step. It can be done in the following steps:
1) replace the bsc_coder_encode_block/bsc_coder_decode_block with functions executing SRC forward/backward transformation from OpenBWT
2) add huffman/fse coders from FSE package processing SRC output in 16-64 KB blocks
3) run an RLE encoding prior to SRC and somethat compress run lengths too
4) speed up the SRC procedure using mtf_shelwien2 code
I can provide you support in this work. If you are going to do it, i suggest you to clone BSC in GitHub and start to work there. In the plan above, after the first step the program will not compress at all but you can check that it can correctly extract the data. Second step will add compression, and steps 3-4 will ensure unprecedented speed/compression ratio, especially on GPU-occupied boxes. First two steps should be easy to implement - you just need to call the existing procedures.