## LZMA algorithm question

1. Hi. Thanks in advance to anyone who responds.

Are the rep codes used in LZMA absolute offsets, or relative distances?

I mean, is it trying to take advantage of repeated duplicate strings, or
duplicate strings with breaks in between?

2. Relative distances. See an example at http://encode.su/threads/1288-LZMA-markup-tool
(e.g. how <P 58>\n is encoded there)

3. I see.

So:

abcTUVWXYZabcRSVWXYZ
          ^^^        "abc"   --> a match at distance 10
               ^^^^^ "VWXYZ" --> same distance 10: is that a rep0?
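For illustration, here is a tiny sanity check of the distances in that example, assuming rep0 simply reuses the most recent match distance (a sketch, not real LZMA code):

```python
# Check that both matches in the example use the same distance,
# which is what would let the second be coded as a rep0 (distance reuse).
data = "abcTUVWXYZabcRSVWXYZ"

# First match: "abc" at position 10 copies from position 0 -> distance 10.
first_src, first_dst = data.index("abc"), data.index("abc", 1)
d1 = first_dst - first_src

# Second match: "VWXYZ" at position 15 copies from position 5 -> distance 10.
second_src, second_dst = data.index("VWXYZ"), data.index("VWXYZ", 6)
d2 = second_dst - second_src

print(d1, d2, d1 == d2)  # -> 10 10 True: same distance, so rep0 applies
```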

For the past couple of days I've been thinking of combining

LZ77 with LZW-like behavior (different from LZFG).

Where:

1. You code matches similar to LZ77.
2. Every time you get a match, you add the offset and length to an array of pointers.
3. When you find a match with one of the strings pointed to by the array, you output the index
instead of an offset/length pair.
4. You can tell the decoder to remove a string from that array if it will not occur again
in the future. This can be done two ways:
a. if a string is a first occurrence, yet will not occur again, tell the decoder outright.
b. if a string is the last occurrence, tell the decoder that it is the last so it can remove
it from the array.

I might map it this way:

(in bits)
0 + byte
1 + 0 + offset length

1 + 1 + 0 + index

where index is coded in the minimum number of bits required,
so if the array contains 1 entry, it's just 1+1+0;
if the array contains 2, it's 1+1+0 + one bit;
if the array contains 4, it's 1+1+0 + two bits
(then update array[index] with the current offset)

1 + 1 + 1 + 0 + offset length
--> an lz77 pair, without adding it to the array

1 + 1 + 1 + 1 + index
--> offset = array[index].offset and length, removing it from the
array, and moving array[last] to array[index]

(I was also thinking of using the stored offsets and coding
a displacement from them, but got lost in thought and circled
back to this ^^)
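As a sanity check, the token layout above could be sketched like this (the helper names are my own, purely illustrative):

```python
import math

def index_bits(array_size):
    """Minimum number of bits needed to code an index into the array
    (0 bits when the array holds a single entry)."""
    return math.ceil(math.log2(array_size)) if array_size > 1 else 0

def token_header(kind):
    """Flag bits for each token kind from the mapping above."""
    return {
        "literal":      "0",     # 0 + byte
        "match":        "10",    # 1 0 + offset, length (added to array)
        "index":        "110",   # 1 1 0 + index
        "match_no_add": "1110",  # 1 1 1 0 + offset, length (not added)
        "index_last":   "1111",  # 1 1 1 1 + index (remove entry after use)
    }[kind]

# e.g. coding an index while the array holds 4 entries: "110" + 2 index bits
print(token_header("index"), index_bits(4))  # -> 110 2
```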

But since LZMA is using a relative distance, I'm not sure this
would be an improvement in the same direction.

4. > Every time you get a match, you add the offset and length to an array of pointers

Now that's ROLZ. That way you can get better compression than LZ77, but decoding is much more complex.

But don't forget about parsing optimization. LZ schemes gain a lot from it (~10% better than lazy matching),
so you might gain 5% with a complex model, then won't be able to make an optimizer for it (because it's so complex).

> But since LZMA is using a relative distance, I'm not sure this might be an improvement in the same direction.

Sure, rep0 is mostly used for sparse matches. As to rep1-3, I think their use is more or less random - the only relevant
example that I can think of is table/image compression, where a few multiples of row size can be frequent distance values.

But even lzma's parser is not really aware of rep codes. It just checks whether it can use one (and whether it improves compression),
but doesn't explicitly try to find longer sparse matches or anything like that.

5. > Now that's ROLZ

Hmmm... I believe in ROLZ you have to see the first 2 bytes first,
then output the number of jumps in the array.

Whereas here, the array works just like in LZW: it represents
unique strings, but rather than storing the string itself, it stores just

a pointer, i.e. the offset and length, since you are encoding a la LZ77
and use a sliding window.

So it looks like so:

array[0..max]

array_size, initialized to 0.

When you code an LZ77 offset/length pair, you increment array_size
to state that you have added a string into the array.

Encoding will look like this:

abc123abc456abc789abc
      ^^^             offset/length pair (6, 3)
            ^^^       index[0] (which stored [absolute offset] 0, length 3);
                      update index[0] with the latest offset of the repeated
                      string, i.e. 12
                  ^^^ index[0] again
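A toy greedy encoder for the scheme, just to make the walkthrough concrete (names and structure are my own sketch under the rules described above, not any real codec):

```python
def encode(data, min_len=3):
    """Toy greedy encoder: the first repeat of a string is coded as an
    LZ77 (offset, length) pair and added to the array; later repeats of
    that string are coded as an index into the array."""
    tokens, array, pos = [], [], 0   # array holds [offset, length] entries
    while pos < len(data):
        s = data[pos:pos + min_len]
        # Repeat of a string already in the array?
        hit = next((i for i, (o, l) in enumerate(array)
                    if data[o:o + l] == s), None)
        if hit is not None:
            tokens.append(("index", hit))
            array[hit][0] = pos          # update entry to the latest offset
            pos += min_len
            continue
        # First repeat: plain LZ77 pair, then remember it in the array
        src = data.find(s, 0, pos)
        if src != -1 and len(s) == min_len:
            tokens.append(("pair", pos - src, min_len))
            array.append([pos, min_len])
            pos += min_len
        else:
            tokens.append(("lit", data[pos]))
            pos += 1
    return tokens

# On the example string: one (6, 3) pair, then two index[0] tokens.
print(encode("abc123abc456abc789abc"))
```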

In LZW, you add every single unique string into the allotted code space, i.e.

codes 0..255 for the initial single characters, and codes above that
for unique strings seen so far (the GIF implementation caps codes
at 12 bits, I believe).

Instead of doing that, I keep only a list (in the array) of unique
strings that will occur again.

If the string won't occur in the future, within the boundary of the window
size, it would tell the decoder so.

If it's the last instance w/in the window, it would tell the decoder
too.

So the array doesn't grow without bound, and thus you can output fewer
bits, since both encoder and decoder can keep an updated size
of that array.

>> but decoding is much more complex.

I don't believe so (will try to code it later, I'm still sleepy),

since it behaves just like LZW, only optimizing the size of its list of strings.

And the decoder will look like an LZMA decoder; instead of rep0..3,
I have a whole array of whatever maximum size I'd like, and it represents
duplicate strings instead of sparse matches.

>> But don't forget about parsing optimization

Yes, I won't. But that's where my original idea about different types of
matches will come in. I'd test just this first, though, and use optimal
parsing.

6. ROLZ = reduced-offset LZ, i.e. where only match offsets from some kind of subset are allowed.
It doesn't necessarily have to use previous symbols as context.
Anyway, because you need fewer bits to encode the distance choice, it can compress better than LZ77,
but then you also need to maintain the offset tables in the decoder, so it's slower.
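For illustration, the decoder-side bookkeeping could look like this (a deliberately simplified sketch using the previous byte as context; real ROLZ implementations differ):

```python
from collections import defaultdict, deque

# For each context (here, the previous byte) keep a small table of recent
# offsets; a match is then coded as a slot number into that table rather
# than a full distance.  Fewer bits per match, but the decoder must
# maintain the exact same tables, which is the extra work mentioned above.
TABLE_SIZE = 16
tables = defaultdict(lambda: deque(maxlen=TABLE_SIZE))

def register_offset(prev_byte, offset):
    """Record a match offset under the given context (most recent first)."""
    tables[prev_byte].appendleft(offset)

def resolve(prev_byte, slot):
    """Turn a coded slot number back into an actual offset."""
    return tables[prev_byte][slot]

register_offset(ord('x'), 100)
register_offset(ord('x'), 200)
print(resolve(ord('x'), 0), resolve(ord('x'), 1))  # -> 200 100
```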

And I'm not saying that it's bad... for example, xwrt+lzma would afaik be similar to what you describe,
just that it's slower and more complicated.

Also, sparse matches are actually fairly important, like for structured texts (logs, tables etc) and most binaries.
So I think that you won't beat lzma without some kind of sparse matches... the rep codes may not be the best
idea to implement these though.

7. Noted. ^^

By the way, I don't just reduce distance, but the length as well.

I don't see the point of saving the offset, but not the length.

Of course the string can be longer, but that suffix can be saved
in another entry.

(Also, reduction isn't done on the first instance of a string to be
encoded)

Thus it's a little different from ROLZ.

Maybe it's ROLLZ.

LOL

8. I'm having some problems with implementation.

I'm trying to use the optimal parsing described in

"Optimal Parsing in Dictionary-Symbolwise
Data Compression Schemes" by G. Della Penna,
A. Langiu, F. Mignosi, and A. Ulisse.

But it assumes that the cost of coding a dictionary match
is greater than the cost of coding a literal.

But I have repeat matches, where coding a match
from a saved index costs less than a literal.
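One way around that assumption is to treat parsing as a shortest-path problem over positions, where each token is an edge weighted by its coded cost; that works with any cost function, including index matches cheaper than literals. A sketch of the standard DP (not the paper's exact algorithm):

```python
def optimal_parse(n, matches, literal_cost):
    """Shortest-path optimal parsing over positions 0..n.
    matches: list of (start, length, cost) edges for available matches.
    Works with arbitrary costs, including matches cheaper than literals."""
    INF = float("inf")
    best = [INF] * (n + 1)   # cheapest cost to reach each position
    best[0] = 0
    back = [None] * (n + 1)  # back-pointers for the backward pass
    edges = {}
    for start, length, cost in matches:
        edges.setdefault(start, []).append((length, cost))
    for pos in range(n):
        if best[pos] == INF:
            continue
        # literal edge: always available
        if best[pos] + literal_cost < best[pos + 1]:
            best[pos + 1] = best[pos] + literal_cost
            back[pos + 1] = (pos, "lit")
        # match edges starting here
        for length, cost in edges.get(pos, []):
            if best[pos] + cost < best[pos + length]:
                best[pos + length] = best[pos] + cost
                back[pos + length] = (pos, "match")
    return best[n]

# Two repeat matches (3 bits each) cheaper than literals (9 bits each)
# are handled without any special assumption:
print(optimal_parse(6, [(0, 3, 3), (3, 3, 3)], literal_cost=9))  # -> 6
```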

Haha, think mode again...

9. The lzma encoder does optimal parsing with precise codelength estimation and rep codes, so I guess you can look at the source.
There's nothing especially smart there though - its structures include the values of rep distances and context ("state")
per offset in the parsing buffer.
Also it "cheats" a little by estimating combinations (like match-literal-repmatch) - I suppose the normal context/rep tracking
isn't very precise, and such a multi-token estimation improves it
(price estimation is based on the current parsing, but the real final parsing is unknown until the backward pass).
