1. minimum match length per each position
assume we have a position to 'AAAAAAB' in ROLZ hash chain, and the next position 'AAAAAAC' comes. we get a match of length 6, and we know if there's another position that matches 'AAAAAAB', the length will be always greater than 6, otherwise 'AAAAAAC' is a better match (with shorter offset). a length<=6 match to 'AAAAAB' is impossible.
2. expected match length per each position
when 'AAAAAAC' is appended to hash chain a get a match of length 6, we will assume that something like 'AAAAAA*' is likely to appear again. so if there's a match to 'AAAAAAC', we expect its match length to be 6.
this two tricks are added in Orz compressor. I saved `match_len_min` and `match_len_expected` for each position in hash chain, and used a simple transform to achieve smaller match length:
// start from match_len_min
// bring match_len_expected to front
let encoded_match_len = match match_len.cmp(&match_len_expected) {
std::cmp::Ordering::Greater => match_len - match_len_min,
std::cmp::Ordering::Less => match_len - match_len_min + 1,
std::cmp::Ordering::Equal => 0,
};
here's benchmark for enwik8:
| match_len_min | match_len_expected | size |
|---------------|--------------------|------------|
| Off | Off | 27,695,500 |
| Off | On | 27,620,396 |
| On | Off | 27,437,772 |
| On | On | 27,066,417 |
Last edited by RichSelian; 26th July 2019 at 20:45.
Yeah, but how does it affect the decompression speed?
for rolz it's not a big problem, these two lengths can be saved together with position in the dictionary while updating.
about 5%~10% slower in practice.
Good for text, what about binary/mixed data?
Cannot build the code.
Code:
$ cargo install --git https://github.com/richox/orz --tag v1.5.0 --force
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/at/.cargo/git/checkouts/orz-8ca8c064613dc1e7/f2dfa43/src/main.rs:1:1
|
1 | #![feature(nll)]
| ^^^^^^^^^^^^^^^^
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/at/.cargo/git/checkouts/orz-8ca8c064613dc1e7/f2dfa43/src/main.rs:2:1
|
2 | #![feature(const_slice_len)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
error: aborting due to 2 previous errors
Further trying to understand how the MTF array works.
Good for text, what about binary/mixed data?
Cannot build the code.
Code:
$ cargo install --git https://github.com/richox/orz --tag v1.5.0 --force
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/at/.cargo/git/checkouts/orz-8ca8c064613dc1e7/f2dfa43/src/main.rs:1:1
|
1 | #![feature(nll)]
| ^^^^^^^^^^^^^^^^
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/at/.cargo/git/checkouts/orz-8ca8c064613dc1e7/f2dfa43/src/main.rs:2:1
|
2 | #![feature(const_slice_len)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
error: aborting due to 2 previous errors
Further trying to understand how the MTF array works.
you need a nightly build rust, run `rustup default nightly` to get one.
the mtf.rs is not a complete move-to-front transform, it just swap symbol to a more advanced position. it is used to apply symbol exclusion and make compression ratio closer to o-1 but still have o-0 speed.