# Optimal Preprocessing/Parsing for LZ?

Printable View

Show 40 post(s) from this thread on one page
Page 2 of 3 First 123 Last
• 24th December 2014, 02:48
Piotr Tarsa
If you're computing price[pos] in the second line, then in the fourth line you're accessing price[pos + len], which usually isn't computed yet. Hmm, prices could be initialized to infinity, which solves the problem. But it seems more likely to me that you need to update prices in that if statement rather than in the outer for statement.
• 24th December 2014, 02:53
Bulat Ziganshin
yes, it's updated. i omitted this line because it's obvious if you know LZMA. CBloom has his own optimal parsing schemes, so if you read his blog, LZMA OP is slightly outside of your frame of mind

Quote:

it seems more likely to me that you need to update prices in that if statement rather than in the outer for statement.
yes, you are thinking the CBloom way :) when i evaluate all matches at position N, i have already found my best way to this position, updated the state and recomputed all prices. so i can compute the candidate price[N+len] as price[N]+state[N].match_price(len,dist)

no need to update the state there, though, since it's just one possibility. when i reach position N+len, i will already have evaluated ALL ways to this position and will know the best one. now i can copy the appropriate state, update it with the appropriate match/literal, compute exact prices for this position and start to evaluate all matches/literals starting at this position
• 24th December 2014, 03:22
Piotr Tarsa
If you don't update prices[pos + len] in the if statement then you'll most probably enter that if too often, since price[pos + len] will always be infinity.

In the outer for statement there is a need to compute the price, but only if it's not already computed. Then it's price[pos - 1] + price[literal[pos - 1]].

And even if you always recompute the price for a given pos when processing that pos, you should still end up with the same price as computed within the if statement.

That's how I understand it.

I thought the LZMA state is rather big, exceeding a megabyte or so.

State differences could be represented as hashtables holding the updated values, plus a pointer to the not-yet-updated state. Hashtables would be copied on write, and states would be updated with the hashtables' contents when the hashtables exceed some size threshold.
This could work this way:
- on read, first look into the hashtable to see whether the needed value is present. If it is, get it from the hashtable; if not, get it from the frozen state,
- on write, always send the value to the hashtable,
- when the hashtable becomes full, copy the frozen state and update it with all the values from the hashtable,
- the key into the hashtable would be a pointer to an element of the state; the value in the hashtable would be the updated value for that element.
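A minimal sketch of this frozen-state + overlay scheme in Python (all names hypothetical; a real LZMA-style state would hold probability counters, not a plain list):

```python
# Copy-on-write overlay over a shared "frozen" state, as described above.
# Reads check the overlay first; writes always go to the overlay; once the
# overlay exceeds a threshold, it is folded into a private copy of the state.

class OverlayState:
    def __init__(self, frozen, threshold=4):
        self.frozen = frozen          # shared, never mutated in place
        self.overlay = {}             # index -> updated value
        self.threshold = threshold

    def read(self, index):
        # the overlay shadows the frozen state
        return self.overlay.get(index, self.frozen[index])

    def write(self, index, value):
        self.overlay[index] = value
        if len(self.overlay) >= self.threshold:
            self._flush()

    def _flush(self):
        # copy the frozen state, apply all diffs, start a fresh overlay
        self.frozen = list(self.frozen)
        for i, v in self.overlay.items():
            self.frozen[i] = v
        self.overlay = {}
```

The point of the design is that the copy cost is proportional to the number of changes, not to the state size, and the frozen list can be shared by many overlays.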
• 24th December 2014, 03:25
Bulat Ziganshin
the price is updated, look at the edited message. or even better, read my post. you need to understand how lzma/tornado optimal parsing works so we can discuss my small changes to this algorithm rather than the entire lzma OP from scratch

Quote:

price[pos - 1] + price[literal[pos - 1]].
i consider a literal as a kind of match, to omit unnecessary details here

Quote:

State differences could be represented as hashtables with updated values and extra pointer to not updated state
i proposed an IMHO better solution - update the state in place and keep the UNDO info. if the lzma state is a megabyte long, then we just need a larger UNDO buffer. situations where the entire megabyte state is recalculated should be very rare
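A minimal sketch of the update-in-place + UNDO-log alternative (hypothetical names): before each write we remember (index, old_value), and rolling back replays the log in reverse, so match evaluation can mutate the single shared state and restore it afterwards.

```python
# Update-in-place with an undo log: the state is mutated directly, and the
# log of (index, old_value) pairs lets us restore it exactly afterwards.

def write_with_undo(state, undo_log, index, value):
    undo_log.append((index, state[index]))   # remember the old value first
    state[index] = value

def rollback(state, undo_log):
    while undo_log:
        index, old = undo_log.pop()          # reverse order restores exactly
        state[index] = old

state = [10, 20, 30]
log = []
write_with_undo(state, log, 0, 11)
write_with_undo(state, log, 2, 33)
# ... evaluate matches using the updated state here ...
rollback(state, log)
# state is back to [10, 20, 30]
```

The cost of a rollback is proportional to the number of writes since the last checkpoint, which matches the argument above that huge recalculations should be rare.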
• 24th December 2014, 04:14
nburns
Quote:

Originally Posted by Piotr Tarsa
'Optimal' is an adjective like 'best'. 'Best' always meant best; it never meant second place or whatever. And like 'optimal', 'best' depends on constraints - you can be the best runner at a competition (the constraint here is that particular competition), while someone could be even better than you but not starting in the competition you do. You would still have been the best runner at that competition, though.

According to dictionary.reference.com, optimal came to English from the Latin word optimus, which means best. So optimal and best have more in common than perhaps you had thought.

The same dictionary site lists this as a definition for optimum (optimal): the greatest degree or best result obtained or obtainable under specific conditions. So, even this nontechnical definition makes reference to "specific conditions" as essential parts of the condition of being optimal.

Quote:

The word 'parsing' was already hijacked and given a new meaning. If you look into a dictionary you would see:

But parsing in LZ schemes is not looking for syntactic roles. It's merely a decomposition that doesn't care whether the chunks correspond to anything. It doesn't create confusion, though, because when 'parsing' is used in place of 'decomposition' it's immediately explained that it's about finding repetitions, and nothing is said about syntax.
Interesting choice of example. I think a more widespread use of the word parsing in computer science is in relation to the process of matching, e.g., a computer program as text to the corresponding productions in a formal grammar. By the time the word was first being used in relation to LZ, whoever selected it would have almost certainly been quite familiar with the older meaning. I guess they saw similarities between the two processes. Thinking about it now, parsing doesn't seem like quite a perfect description, but I have no idea what other word I'd use instead. Decomposition seems somehow inaccurate, but it's hard to explain exactly why. I think I associate decompositions with processes that are more determined exactly by deep mathematical rules, whereas LZ parsing seems like a more ad hoc and arbitrary procedure that you do for reasons that are not deep.

Quote:

Originally Posted by Bulat Ziganshin
seems that you both misunderstood my algorithm and optimal parsing overall. in lzma/tornado, when you start checking matches starting at position N, you have already found the best path to this position from the start, and the position price, but only using old prices

the only thing i've added is an up-to-date state, which means up-to-date prices. contrary to my approach, A* only knows prices optimized for the best encoding of the further path to the end of the window

PS: just now 80% of this topic is about linguistics and 20% about algorithms

I'm not surprised that I misunderstood, because my knowledge of LZ is not especially deep (whereas you would most likely be counted as somewhat of an authority on LZ in almost any circumstances). I take it that there must be existing LZ variants that adapt the symbol costs gradually as they progress across the text from start to finish. That way, partial solutions that are prefixes are not in danger of needing to be recomputed after they've finished, because the prices they saw are fixed in time and not subject to change.

I am relatively sure that doing optimal LZ with matches chosen according to their simultaneously-optimal Huffman code lengths is still considered an open problem. If that's so, then there is most likely some desirable property that a solution would have that the variant you mentioned doesn't -- or, at least, for some reason it still interests some academics. You mentioned that, in that variant, the symbol costs are updated along the way through the text; I'm guessing that it must use AC instead of Huffman, which would mean it can in effect change all the symbol codes instantly whenever and as frequently as it needs to. That's an interesting idea. One problem that it most likely isn't addressing is that of optimally choosing LZ matches across the whole span of the text such that the LZSS matching part and the symbol-coding part are at all times simultaneously mutually optimal. (If it is solving that problem, I guess I would like to understand their solution -- because it's a neat trick.)

Since I sent off my last post, I had an insight into the LZSS-Huffman problem (to invent a new name for it). I would not call it a sure thing or anything along those lines -- that would certainly be over-optimistic and premature. (From my standpoint, I don't really care if there's a practical solution available or not. My goal is to answer the question, and yes and no are both valid answers.) I was going to describe the insight, but I got distracted by responding to Bulat, and if it's already been formally solved by someone else, I will not dwell on it much more. At best, it looks to involve some math/CS that I only know the general outlines of, and so I'd have to spend time studying that stuff to do it justice on my own.
• 24th December 2014, 04:32
Bulat Ziganshin
Quote:

One problem that it most likely isn't addressing is that of optimally choosing LZ matches across the whole span of the text such that the LZSS matching part and the symbol-coding part are at all times simultaneously mutually optimal.
even with up-to-date prices we will build only a locally-optimal solution. i'm still interested to know how it looks compared to the A* scheme, but it seems only CBloom understands it

Quote:

You mentioned that, in that variant, the symbol costs are updated along the way through the text; I'm guessing that it must use AC instead of Huffman, which would mean it can in effect change all the symbol codes instantly whenever and as frequently as it needs to.
actually we don't need symbol codes at all. we only need to update prices at each step, and later perform a second pass to encode the selected path. although it can be done in a single pass too
• 24th December 2014, 04:50
Piotr Tarsa
Quote:

i proposed IMHO better solution - update the state and keep the UNDO info. if lzma state is megabyte long, then we just need to have larger UNDO buffer. situations where entire megabyte state is recalculated should be very rare
That wouldn't work, IMO. If you update the state at each step then you can't share state easily. A frozen state + small hashtables with updates allows for easy sharing of the frozen state. You always need to copy the state before updating it in place. Same for hashtables. And the size of the hashtables is proportional to the extent of changes since freezing the state, not to the size of the entire state. With hashtables much smaller than the frozen state you're saving the bandwidth burned on copying data.
• 24th December 2014, 05:28
Bulat Ziganshin
Quote:

If you update state at each step then you can't share state easily.
when i start processing position N+len, i need state[N+len], which is a copy of state[N] plus the match(len,dist). instead of copying the state structure, i modify it, use it to check all possible matches starting at this position, and then revert to the original state. no sharing is required since we use only one state at a time

look at the state vars:

Code:

```
for pos in (0..N)
  state[pos] := state[best_prev_pos[pos]], updated for best_match[pos]
  for (len,dist) in possible_matches[]
    match_price = state[pos].price(len,dist)
    if price[pos]+match_price < price[pos+len]
      price[pos+len] = price[pos]+match_price
      best_prev_pos[pos+len] = pos
      best_match[pos+len] = (len,dist)
  // here we undo the state changes
```
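For readers less familiar with LZMA-style parsing, here is a runnable Python sketch of the loop above, with a toy cost model standing in for the real state-dependent prices (the state copy/undo machinery is omitted, and the names `optimal_parse`, `literal_price` etc. are hypothetical):

```python
# Forward optimal-parse skeleton: price[pos] is the cheapest cost to encode
# the first pos bytes; a literal is treated as a length-1 match, as in the
# discussion above. possible_matches maps position -> list of (len, dist).

INF = float('inf')

def optimal_parse(n, possible_matches, literal_price=9,
                  match_price=lambda l, d: 20):
    price = [INF] * (n + 1)
    best_prev_pos = [0] * (n + 1)
    best_match = [None] * (n + 1)
    price[0] = 0
    for pos in range(n):
        if price[pos] == INF:
            continue
        # candidate steps from this position: one literal plus all matches
        steps = [(1, 0, literal_price)] + \
                [(l, d, match_price(l, d))
                 for (l, d) in possible_matches.get(pos, [])]
        for length, dist, cost in steps:
            if pos + length <= n and price[pos] + cost < price[pos + length]:
                price[pos + length] = price[pos] + cost
                best_prev_pos[pos + length] = pos
                best_match[pos + length] = (length, dist)
    # walk best_prev_pos backwards to recover the chosen path
    path, pos = [], n
    while pos > 0:
        path.append(best_match[pos])
        pos = best_prev_pos[pos]
    return price[n], path[::-1]
```

With flat prices this reduces to plain LZSS-style shortest-path parsing; the point of the thread is that in LZMA/tornado the `match_price` call actually depends on the evolving coder state at `pos`.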
• 24th December 2014, 08:25
nburns
Quote:

Originally Posted by Matt Mahoney
I think LZ77 was proved to be asymptotically optimal on stationary, ergodic sources. That's not the same as optimal compression, which Kolmogorov proved is not computable :D

I am pretty certain that at least one of LZ78 or LZ77 has been proved asymptotically optimal. That mostly tells you how much being asymptotically optimal is actually worth in compression: not much.
• 24th December 2014, 08:49
nburns
Quote:

Originally Posted by Bulat Ziganshin
even with up-to-date prices we will build only a locally-optimal solution. i'm still interested to know how it looks compared to the A* scheme, but it seems only CBloom understands it

actually we don't need symbol codes at all. we only need to update prices at each step, and later perform a second pass to encode the selected path. although it can be done in a single pass too

I don't normally follow cbloom's rants on LZ. I read a good-sized chunk of what was in the archives, all within a relatively short timespan at one point, but I don't think I understood everything then, and I haven't kept up with it.

That loop code you posted looks like a fairly straightforward implementation of the kind of dynamic programming algorithm that I would expect to be appropriate for that type of problem. There are some additional frills attached, to be sure, but the basic control flow is straightforward dynamic programming. (I just happen to have had dynamic programming on my mind lately.)

I sort of figured that it must not really matter when you deal with symbol codes in that algorithm, since AC doesn't truly have codes anyway. So approximations should be good enough.

The combined LZSS and optimal Huffman algorithm would be a whole different type of beast, though. One thing that I realized is that you can't really think of it as basically LZSS with a little bit of Huffman added on, like an appendage. The optimal matching part and the Huffman part would be more like equal partners. Neither one would really be in charge; they would both have roughly equal control and responsibility for the quality of the output. You could choose either one to start with... you could do LZSS first using symbol-size approximations, followed by Huffman. Or you could do Huffman first using symbol-statistics approximations, followed by LZSS using the symbol sizes that came out of Huffman. You could actually continue going back and forth for a while, and you should get successively better approximations. I bet the chance of getting stuck in local minima would be very high, though...
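The back-and-forth idea above can be sketched as a fixed-point loop (all names hypothetical; Shannon lengths -log2(p) stand in for real Huffman lengths, and the `parse` callback stands in for an LZSS-style pass):

```python
import math
from collections import Counter

# Alternate between parsing under the current cost estimates and re-deriving
# code lengths from the symbol statistics the parse produces, until the
# parse stops changing (or a round limit is hit).

def shannon_lengths(counts):
    total = sum(counts.values())
    return {s: -math.log2(c / total) for s, c in counts.items()}

def iterate_parse(parse, data, initial_costs, max_rounds=10):
    costs = initial_costs
    prev = None
    for _ in range(max_rounds):
        symbols = parse(data, costs)               # parse under current costs
        if symbols == prev:                        # reached a fixed point
            break
        prev = symbols
        costs = shannon_lengths(Counter(symbols))  # coder's reply to the parse
    return prev, costs
```

As the post says, nothing guarantees this converges to a global optimum; it can settle into a local minimum where the parse and the code lengths merely agree with each other.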
• 24th December 2014, 21:49
nburns
Quote:

Originally Posted by Piotr Tarsa
Kennon Conrad in his Tree program doesn't do any LZSS-like parsing (i.e. going backwards) but he achieves very good compression ratios. I think that tokenization could be applicable in the LZ77 world and maybe could allow for better compression ratios than the so-called "optimal parsing". And tokenization in Tree is basically a parsing - it's used to find matches, which should lead to very good compression ratios.

I'd say that creating tokens, i.e. named matches, seems to have worked well for Tree. Tokens might end up offering more freedom than you get with LZ references, so that you can employ more compression strategies.

I don't think anyone currently knows whether token algorithms can be made as fast as LZ (for encoding). Since tokens are made to be used more than once (otherwise, why assign a name?), determining which tokens should be created and kept in the dictionary might turn out to be inherently slower than LZ parsing, at least to get halfway-decent results.

Quote:

Repeated matches or matches continuations bring conventional LZ77 closer to Tree like coding. The more entries are in the recent matches queue the more similarity between Tree and that LZ77 derived coder. So one could eg try local Tree like tokenizations within small windows to extract repeated matches and then complete the parsing by finding matches over a bigger LZ77 window.
It would probably be harder to do for LZ77, because the matches you'd have to consider would be full of redundancy and there would be too many choices. Tree goes to great lengths when choosing which tokens to make, so by the later stages they will be fairly filtered and high quality.
• 24th December 2014, 22:11
nburns
I guess I am not entirely sure what Piotr and Bulat have meant by "adaptive codes". I think I might have interpreted it wrong. I've been focused on Huffman, mainly, but Huffman is not an adaptive code... it would be a static code.
• 24th December 2014, 23:55
Kennon Conrad
Quote:

Originally Posted by nburns
I guess I am not entirely sure what Piotr and Bulat have meant by "adaptive codes". I think I might have interpreted it wrong. I've been focused on Huffman, mainly, but Huffman is not an adaptive code... it would be a static code.

I think mostly they are talking about dictionary symbols having variable code lengths. If I understand correctly, LZSS is optimal in the sense that if all codes must be length N (non adaptive), then it produces the smallest output given that constraint.

Even though Tree is not optimal, it generally outperforms LZSS on text because it is adaptive. Dictionary symbols can have different lengths, and the code length for individual symbols can change as the file is processed (via MTF queues). You are right, a Huffman tree is not adaptive because it is based on global statistics. It is the best tree available if you can't change the tree while processing the file.
• 25th December 2014, 00:35
Kennon Conrad
Bulat, could you provide a few details on how the price(len,dist) value is determined? It looks like maybe it is adaptive, so I was wondering if that's just because the cost of writing the lengths and distances vary.

Has anyone combined the pricing method you show with a dictionary? It's only for regular LZ, right?

I view tokenization as a way to improve upon LZ-style compression by guiding the encoder to build a code tree that is more efficient for writing the upcoming data than is possible without using context modeling of the prior history. If there is an optimal parser like LZSS that has variable-length dictionary symbols, it would be interesting to try adding a data structure containing the upcoming common prefixes over a relatively small sliding window and adjusting the price function based on the usefulness of symbols within that sliding window.
• 25th December 2014, 01:08
Bulat Ziganshin
have you read http://encode.su/threads/1895-LZ-Opt...ll=1#post37055 and the tornado sources?
• 25th December 2014, 01:34
Kennon Conrad
Quote:

Originally Posted by Bulat Ziganshin

Yes, but not thoroughly. As I understand it, Tornado does not use a dictionary, just a LZ77 style history. But maybe I do not understand Tornado.
• 25th December 2014, 02:10
Bulat Ziganshin
yes, both lzma and tornado use only lz77-style history
• 25th December 2014, 03:50
Kennon Conrad
I suppose you might be able to get some benefit even for LZ77 by adding a special code to indicate that the previous match should be converted to a literal. Code couldn't reference portions of the match, but match distances across the match, and any subsequent use of the new literal, would be reduced by one less than the length of the match. For instance, if " the" were converted to a literal, then many match distances would be reduced by a multiple of 3 positions, since every " the" would be written as a literal (or as part of a match containing the literal), counting as one position in the history.
• 25th December 2014, 04:29
Bulat Ziganshin
it's performed by dictionary preprocessor such as xml-wrt. making it a part of lz77 compressor itself will make it much slower (at least in naive implementations)
• 25th December 2014, 05:33
Kennon Conrad
Thanks, I will try xml-wrt. It looks like xml-wrt + LZ77 is competitive with LZMA. I would like to know if the dictionary is completely separate, or embedded within the data.
• 25th December 2014, 06:11
nburns
Quote:

Originally Posted by Kennon Conrad
I think mostly they are talking about dictionary symbols having variable code lengths. If I understand correctly, LZSS is optimal in the sense that if all codes must be length N (non adaptive), then it produces the smallest output given that constraint.

Even though Tree is not optimal, it generally outperforms LZSS on text because it is adaptive.

I recommend taking some time, someday, to study how the LZSS algorithm finds its optimal solution. Alternatively, you could pick up an algorithms textbook (such as CLR) and read the dynamic programming chapter. I never really got dynamic programming while I was an undergrad, but years later I went back and reread the chapter in CLR to find out what I missed. Until you get around to that, it would be safe to think of something like LZSS as an exhaustive search. The general idea is to look at each possible solution one by one and compute the cost of each. After you've seen every possible solution, you must have seen the best one according to the cost function, so return it.

The disappointing part about dynamic programming is that it doesn't work on all optimization problems, and when optimization problems fail the test for D.P., they are almost always intractable, i.e. NP-hard.

LZSS uses D.P. to find its optimal match sequence, and it's probably not a good idea to overestimate what LZSS can do.
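The exhaustive-search framing can be made concrete (toy fixed costs, hypothetical names): the parse graph is a DAG whose paths are exactly the possible parses, and D.P. finds the cheapest path while visiting each position once, even though the number of parses grows exponentially:

```python
# Node i = "first i bytes parsed". A literal is an edge i -> i+1, a match of
# length L is an edge i -> i+L. matches maps position -> list of match lengths.

def count_parses(n, matches):
    # how many distinct parses exist -- what naive exhaustive search enumerates
    ways = [0] * (n + 1)
    ways[0] = 1
    for i in range(n):
        ways[i + 1] += ways[i]                      # literal edge
        for length in matches.get(i, []):
            if i + length <= n:
                ways[i + length] += ways[i]         # match edge
    return ways[n]

def cheapest_parse(n, matches, lit_cost=9, match_cost=17):
    # same graph, but tracking minimum cost instead of path counts
    best = [float('inf')] * (n + 1)
    best[0] = 0
    for i in range(n):
        best[i + 1] = min(best[i + 1], best[i] + lit_cost)
        for length in matches.get(i, []):
            if i + length <= n:
                best[i + length] = min(best[i + length], best[i] + match_cost)
    return best[n]
```

With a length-2 match available at every position of an 8-byte input there are already 34 distinct parses (a Fibonacci count), yet the D.P. does only O(n) work; that gap is the whole point of the technique.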

Quote:

Originally Posted by Kennon Conrad
Bulat, could you provide a few details on how the price(len,dist) value is determined? It looks like maybe it is adaptive, so I was wondering if that's just because the cost of writing the lengths and distances vary.

Has anyone combined the pricing method you show with a dictionary? It's only for regular LZ, right?

I view tokenization as a way to improve upon LZ style compression by guiding the encoder to build a code tree that is more efficient for writing the upcoming data than is possible without using a context modeling of the prior history. If there is an optimal parser like LZSS that has variable length dictionary symbols, it would be interesting to try adding a data structure containing the upcoming common prefixes over a relatively small sliding window and adjusting the price function based on the usefulness of symbols within that sliding window.

I think it's significant that Tree works so hard to choose matches/tokens. I think the problem Tree needs to solve may be trickier than what LZ needs to do for its references.

By the way, it occurred to me that there may be some kind of D.P. approach that Tree could use in its encoding algorithm, e.g. while evaluating potential tokens. I don't think that the D.P. literature holds any breakthroughs, but if you could figure out a way to fit Tree into some sort of D.P. framework, I think it would be worth exploring. Even if you find that you're already doing what the book is describing, just learning the names and getting familiar with relevant algorithms is worthwhile.
• 25th December 2014, 17:44
Bulat Ziganshin
Quote:

Originally Posted by Kennon Conrad
Thanks, I will try xml-wrt. It looks like xml-wrt + LZ77 is competetive with LZMA. I would like to know if the dictionary is completely separate, or embedded within the data.

look at the program output. i don't remember - it may be either at the start of the file or at the first word occurrence

you can also check my dict preprocessor: arc a compressed-file orig-file -m=dict --no-dir
• 8th January 2015, 04:41
cbloom
Hi! A few notes -

Calling these parsers "optimal" is a historical convention. Early LZH's would implement an LZSS optimal parse, and then just apply a Huffman entropy coder to the result and not worry about the fact that it was no longer the optimal solution to the LZH parse. They kept the name "optimal" because the truly optimal parse of LZH is uncomputable (I've never even heard of a proposed algorithm to compute anything like it).

The earliest attempt to do forward "optimal" parsing was Quantum by Dave Stafford (the original PAQ). It used an adaptive arithmetic coder and did what we would now call "flexible parsing" ; it evaluated costs a few steps ahead.
• 8th January 2015, 04:51
cbloom
So, the big difference between the LZMA optimal parse that Bulat describes here, and the ones I've worked on is that I wanted to consider how current choices affect coding in the future.

The LZMA algorithm is very local or greedy, in the sense that when it considers a decision, it only looks at the price to take the next step.

The strongest effect is not from the statistical adaptation, but from the "last offset" for repeat matches. This can be pretty significant on highly structured data that have strong repeat-match properties.

A common case is something like choosing between {len=3, offset=16} {len=5, offset=912}

If you just look at current prices, the len=5 choice might seem slightly better. But if you look a few steps into the future, the len=3 case turns out to be cheaper because it loads a useful offset into the repeat-match state, which gets used a few decisions in the future.

This is the idea for the "Chain N" parsers. They take an existing optimal parse, which provides costs to the end of the buffer, and they improve that parse by speculatively trying different decisions and computing how that decision affects the costs N steps into the future.
• 8th January 2015, 04:56
cbloom
Another little note -

when the size of your dynamic "state" is small, you can do a truly optimal parse using dynamic programming. If the state is of size S, it simply requires a table of size N*S.

For example in LZ4 the current LRL affects code cost, so it breaks the simple LZSS parsing, but you can set S = 15 or so and you get truly optimal parsing. (this is ignoring the effects of very high LRL affecting code cost at the next 255 boundary)

The problem is as soon as you introduce even one "repeat match", then S is too big, and just putting a max on the repeat offset doesn't seem to work.
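A sketch of the N*S-table idea for an LZ4-like coder (toy costs and a deliberately simplified escape rule; hypothetical names, not LZ4's actual token format):

```python
# dp[pos][lrl] = cheapest cost to reach pos with a pending literal run of
# length lrl. Because the run length changes future token cost, lrl must be
# part of the DP state; capping it at S keeps the table at N*S entries.

INF = float('inf')
S = 16   # literal-run-length state in 0..15

def parse_with_lrl(n, matches, lit=8, match=24, lrl_escape=8):
    dp = [[INF] * S for _ in range(n + 1)]
    dp[0][0] = 0
    for pos in range(n):
        for lrl in range(S):
            c = dp[pos][lrl]
            if c == INF:
                continue
            # emit a literal; reaching a run of 15 pays an extra length byte
            # (toy stand-in for LZ4's 15-boundary escape)
            nl = min(lrl + 1, S - 1)
            extra = lrl_escape if lrl + 1 == S - 1 else 0
            dp[pos + 1][nl] = min(dp[pos + 1][nl], c + lit + extra)
            # emit a match; this flushes/resets the literal run
            for length in matches.get(pos, []):
                if pos + length <= n:
                    dp[pos + length][0] = min(dp[pos + length][0], c + match)
    return min(dp[n])
```

The same construction stops scaling once a repeat-match offset joins the state, as the post notes: the offset can take on the whole window's worth of values, so S explodes.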
• 8th January 2015, 05:06
cbloom
And -

My A* idea is an attempt to do something closer to true optimal parsing. It's a forward parse, and it explores the entire graph of all possible parses.

The A* parse considers the way current decisions affect the future all the way to the end - not limited after N steps.

Obviously this is intractable. The solution is to discard exploration of paths that cannot be better than the current best path by more than some threshold. The higher you set that threshold, the more approximate and faster the parse.

For me this is a big improvement over the LZMA parse or my previous parsers, because it makes the error *bounded*. Previous optimal parsers could be an unbounded amount worse than truly optimal.

The problem I ran into was that 95% of the time, I could set a pretty low error threshold and get path explorations that terminated quickly. But on some blocks there are too many parses that are similar in cost to the best parse, so the bound doesn't work well and they explode into the worst case complexity.

I do believe it's a promising area for future work if anyone really ever cares. (IMO there are better places to spend CPU time in your encoder!)

See for example :

http://en.wikipedia.org/wiki/A*_search_algorithm
and "Bounded Relaxation"
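On a generic graph, the bounded-relaxation idea can be sketched like this (hypothetical names; an illustration of the pruning rule, not cbloom's actual parser, and it assumes an admissible heuristic that never overestimates the remaining cost):

```python
import heapq

# Best-first search ordered by optimistic bound = cost so far + heuristic.
# Paths that cannot beat the current best completed path by more than
# `threshold` are discarded, so the returned cost is guaranteed to be within
# `threshold` of truly optimal; larger thresholds prune more and run faster.

def bounded_astar(start, goal, edges, heuristic, threshold=0.0):
    best_done = float('inf')
    frontier = [(heuristic(start), 0.0, start)]
    while frontier:
        bound, cost, node = heapq.heappop(frontier)
        if bound > best_done - threshold:
            break            # nothing left can improve by more than threshold
        if node == goal:
            best_done = min(best_done, cost)
            continue
        for nxt, weight in edges.get(node, []):
            new_cost = cost + weight
            new_bound = new_cost + heuristic(nxt)
            if new_bound <= best_done - threshold:
                heapq.heappush(frontier, (new_bound, new_cost, nxt))
    return best_done
```

The failure mode described above shows up here directly: when many paths have bounds within `threshold` of the best, the prune never fires and the frontier grows toward the exhaustive worst case.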
• 8th January 2015, 08:30
cbloom
Also -

LZMA state size depends on the settings for the # of bits in the contexts.

With default settings it can fit in less than 16k.

There's a small mistake in LZMA -

http://cbloomrants.blogspot.com/2014...lzma-like.html

those fixes reduce the # of contexts required somewhat.

In my A* I do not include the statistical state in the graph. A node is labelled by {state,pos} where "state" only includes the rep match offsets and a few bits similar to the LZMA "markov chain". The node label has to be small enough to be usable as a hash index for the dynamic programming. It's also useful to collapse that state as much as possible, so that paths come back together. In practice this reduces the branching factor enormously on most files.

Lastly I should note that improvements beyond the LZMA-style parse are not very fruitful. The LZMA-style seems to be a very good tradeoff. My parsers are much slower for not much more compression.
• 8th January 2015, 10:42
Kennon Conrad
Quote:

Originally Posted by cbloom
Lastly I should note that improvements beyond the LZMA-style parse are not very fruitful.

Perhaps I misunderstand, but I disagree. The whole point of developing Tree was to determine how an alternate method of parsing would impact compression ratios. The results on large text files such as enwik9 indicate that Tree-style parsing can yield files at least 10% smaller than LZMA with Pareto-frontier decompression speed.

I could be wrong, but based on the results, I tend to think it is better to make decisions based on global effects and fine tune for local effects rather than to choose symbols based on local effects.
• 8th January 2015, 20:59
cbloom
Quote:

Originally Posted by Kennon Conrad
Perhaps I misunderstand, but I disagree.

I should have said *my* improvements were not very fruitful.

I do think it's an important area of study for research. But in a practical sense at the moment it's not a very useful way to spend encoder CPU time.

How does Tree do on binary data? LZMA is really quite special on binary data and only okay on text. LZMA+wrt is perhaps more fair on text. Text matches and binary matches have quite different properties. In text you're building a grammar (the content repeats), while in binary the repeated-offset structure is very important (the offsets repeat, but with different content).

I suppose something for future work would be to learn grammars of offsets for binary structured data. Like you might get repeated sections of {ml 7, offset multiple of 16, 1 literal, ml 3, offset multiple of 4, 2 literals}.
• 8th January 2015, 21:09
Kennon Conrad
Quote:

Originally Posted by cbloom
How does Tree do on binary data? LZMA is really quite special on binary data and only okay on text. LZMA+wrt would be more fair on text. Text matches and binary matches have quite different properties. In text you're building a grammar, while in binary the repeated-offset structure is very important.

Tree works much better on text than binary data. A delta filter would help in some cases, but I think building a grammar generally isn't a good fit for binary data.