***
*** Note: This algorithm is broken and useless.
*** Maybe an admin could change the thread title to indicate this?
***
***

There is an algorithm in this paper that finds the optimal LZ77 parsing in linear time, given certain constraints. The paper is extremely complicated and hard to read, and I haven't fully deciphered the algorithm.

However, I think I've figured out how to do linear optimal parsing on my own. The following code hasn't been tested.

// Find cost of optimal parsing.
//
// Condition for algorithm:
// For all A, x, and y where xAy is a string appearing in the text and x and y are individual symbols,
// Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)

int Cost(int i, int j); /* cost of best match for text[i..j-1] */

int len = /* length of text */;
int *prefix_cost = (int*)malloc(sizeof(int) * (len+1));
prefix_cost[0] = 0;
for(int i = 0, j = 1; j < len+1; j++) {
    while(!match_exists(i, j)) {
        i++;
    }
    int next_cost, minimum_cost = prefix_cost[i] + Cost(i, j);
    while((i+1 < j) && ((next_cost = prefix_cost[i+1] + Cost(i+1, j)) <= minimum_cost)) {
        minimum_cost = next_cost;
        i++;
    }
    prefix_cost[j] = minimum_cost;
}
// prefix_cost[len] == cost of parsing text
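For anyone who wants to poke at it, here's the loop wrapped into a runnable harness under a toy cost model that satisfies the condition with equality: every match costs its length in bytes, and a match always exists, so the optimal cost of any prefix is just its length. The toy_cost/toy_match_exists stand-ins are mine, not real match finders; this only exercises the mechanics of the loop, not the interesting cases.

```c
/* Toy harness for the loop above. toy_cost(i,j) = j-i (every match costs
   its length, which satisfies the condition with equality) and matches
   always exist, so the optimal cost of any prefix equals its length.
   These are stand-in definitions, not a real match finder. */
#include <stdlib.h>

static int toy_cost(int i, int j)         { return j - i; }
static int toy_match_exists(int i, int j) { (void)i; (void)j; return 1; }

int parse_cost_linear(int len) {
    int *prefix_cost = malloc(sizeof(int) * (len + 1));
    prefix_cost[0] = 0;
    for (int i = 0, j = 1; j < len + 1; j++) {
        while (!toy_match_exists(i, j))
            i++;
        int next_cost, minimum_cost = prefix_cost[i] + toy_cost(i, j);
        while ((i + 1 < j) &&
               ((next_cost = prefix_cost[i+1] + toy_cost(i+1, j)) <= minimum_cost)) {
            minimum_cost = next_cost;
            i++;
        }
        prefix_cost[j] = minimum_cost;
    }
    int result = prefix_cost[len];
    free(prefix_cost);
    return result;
}
```

Under this model any parsing of a length-n text costs n, so the harness should return exactly len.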

The following works for arbitrary cost functions, but it's O(n^2). (Also not tested.)

int len = /* length of input */;
int *prefix_cost = (int*)malloc(sizeof(int) * (len+1));
prefix_cost[0] = 0;
for(int j = 1; j < len+1; j++) {
    int next_cost, minimum_cost = INT_MAX; /* highest number */
    for(int i = 0; i < j; i++) {
        if(match_exists(i, j) && ((next_cost = prefix_cost[i] + Cost(i, j)) < minimum_cost)) {
            minimum_cost = next_cost;
        }
    }
    prefix_cost[j] = minimum_cost;
}
// prefix_cost[len] == cost of parsing text
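To make the quadratic version concrete, here's a self-contained sketch with an invented toy model: 9 bits per literal, a flat 24 bits for any match of length >= 2, and a brute-force match_exists that checks whether text[i..j-1] occurred earlier in the text. None of these numbers come from the thread; they're just for illustration.

```c
/* Self-contained sketch of the O(n^2) parser with a made-up cost model:
   9 bits per literal, a flat 24 bits for any match of length >= 2, and a
   brute-force check for an earlier occurrence of text[i..j-1]. */
#include <limits.h>
#include <stdlib.h>
#include <string.h>

enum { LIT_BITS = 9, MATCH_BITS = 24 };

static int match_exists(const char *text, int i, int j) {
    if (j - i == 1) return 1;              /* a single literal always "matches" */
    for (int p = 0; p < i; p++)            /* earlier occurrence of text[i..j-1]? */
        if (memcmp(text + p, text + i, j - i) == 0)
            return 1;
    return 0;
}

static int cost(int i, int j) { return (j - i == 1) ? LIT_BITS : MATCH_BITS; }

int parse_cost(const char *text, int len) {
    int *prefix_cost = malloc(sizeof(int) * (len + 1));
    prefix_cost[0] = 0;
    for (int j = 1; j < len + 1; j++) {
        int next_cost, minimum_cost = INT_MAX;
        for (int i = 0; i < j; i++)
            if (match_exists(text, i, j) &&
                (next_cost = prefix_cost[i] + cost(i, j)) < minimum_cost)
                minimum_cost = next_cost;
        prefix_cost[j] = minimum_cost;
    }
    int result = prefix_cost[len];
    free(prefix_cost);
    return result;
}
```

On "abcabc", six literals cost 54 bits, while "abc" as literals plus one match for the repeat costs 3*9 + 24 = 51, so the DP should return 51.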

Last edited by nburns; 10th January 2017 at 18:55.

match_exists(i, j) isn't described here. I guess it means that text[i..j-1] has a match, so it always returns true when i+1==j? Otherwise I fear that it may go into an endless loop right at the first iteration of the main loop.

Overall, the first loop seems to be exactly what's implemented in tornado (and probably most other algorithms too). lzma implements a more complex strategy that tries multiple paths, rather than multiple matches irrespective of the previous path, as your and my algorithms do.

The main point of both your codes is that Cost(i,j) doesn't depend on the path to i, which is true only for a static model. Real algos employ either dynamic (lzma/tornado) or block-static (deflate/rar/zstd) models. For a block-static model, you can find a local optimum by repeating optimal parsing with stats generated by the previous pass, using the stats of the previous block on the first pass. For most cases, I expect that this algo will give an optimal result (it's already implemented by ultra deflate compression in 7-zip (see the "passes" parameter), and probably in the ultra compression modes of zstd/brotli).

For a dynamic algo, the stats at each step depend on previous matches (i.e. the path), so Cost(i,j) may either use stats collected at some previous point (lzma updates stats and restarts parsing afresh every 4 KB, tornado every 32 KB), or save at each already-processed position the entire stats structure after the "best match" in addition to the "best match" itself. But even this approach doesn't give us a real optimum, since it makes local decisions.

The paper you mentioned provides an algo with the following constraint: "In this paper we consider variable-length encodings which use longer codewords for larger integers". So it works for gamma/Elias-style encodings, and fails for huf/ari/ans and any other where the codeword for 0 may be longer than the codeword for 1. I.e. the algo is inapplicable even to lzh/deflate, only to algos with a FIXED entropy model like lz4/lz5 and tornado -c1/-c2 modes.

match_exists(i, j) isn't described here. I guess it means that text[i..j-1] has a match, so it always returns true when i+1==j?

That's correct. It would always return true for single characters, and longer strings if and only if there's a match. "Match" is left undefined, except to the effect that it's a single, indivisible unit of parsing and it has cost Cost(i,j).

Otherwise I fear that it may go into an endless loop right at the first iteration of the main loop.

Overall, the first loop seems to be exactly what's implemented in tornado (and probably most other algorithms too). lzma implements a more complex strategy that tries multiple paths, rather than multiple matches irrespective of the previous path, as your and my algorithms do.

From tornado:

price[pos+1] ?= price[pos] + literal_price(data[pos])
    // at success: len[pos+1]=1, dist[pos+1]=0

for every match in match_list and every intermediate match_len:
    price[pos+match_len] ?= price[pos] + match_price(match_len, match_dist)
    // at success: len[pos+match_len]=match_len, dist[pos+match_len]=dist

That is styled sort of like single-source shortest path, where edges = matches.

That doesn't quite demonstrate a linear time optimal parser.

The main point of both your codes is that Cost(i,j) doesn't depend on the path to i, which is true only for a static model. Real algos employ either dynamic (lzma/tornado) or block-static (deflate/rar/zstd) models. For a block-static model, you can find a local optimum by repeating optimal parsing with stats generated by the previous pass, using the stats of the previous block on the first pass. For most cases, I expect that this algo will give an optimal result (it's already implemented by ultra deflate compression in 7-zip (see the "passes" parameter), and probably in the ultra compression modes of zstd/brotli).

Cost(i, j) can't depend on any previous parsing decisions, because in that case, dynamic programming fails and finding the optimal parse tends to become intractable.

Cost(i, j) can take into consideration the previous text (actually, it has to). Just not how it was parsed.

This algorithm should give an optimal result in every case. You can only use it if the cost function meets the constraint, though.

For a dynamic algo, the stats at each step depend on previous matches (i.e. the path), so Cost(i,j) may either use stats collected at some previous point (lzma updates stats and restarts parsing afresh every 4 KB, tornado every 32 KB), or save at each already-processed position the entire stats structure after the "best match" in addition to the "best match" itself. But even this approach doesn't give us a real optimum, since it makes local decisions.

The paper you mentioned provides an algo with the following constraint: "In this paper we consider variable-length encodings which use longer codewords for larger integers". So it works for gamma/Elias-style encodings, and fails for huf/ari/ans and any other where the codeword for 0 may be longer than the codeword for 1. I.e. the algo is inapplicable even to lzh/deflate, only to algos with a FIXED entropy model like lz4/lz5 and tornado -c1/-c2 modes.

Right. The cost function can't take into consideration the statistics of the output symbols (i.e. in Huffman), because the statistics depend on the particular parsing and that makes this approach impossible.

One thing that might be interesting is to do the parsing in blocks and update the statistics and cost function at the end of each block. Then each block would be optimally-parsed based on the stats from the last block. You might not be able to use the linear algorithm, but you could use the second algorithm. The complexity of the second algorithm is actually bounded by the number of matches; it only has to be O(n^2) in pathological cases (mine is O(n^2) as written, but it's a trivial change).

Last edited by nburns; 10th January 2017 at 07:37.

This algorithm should give an optimal result in every case. You can only use it if the cost function meets the constraint, though.

Unfortunately, there are no practical algos where Cost(i,j) doesn't depend on previous parsing decisions. In fact, in block-static algos (deflate, zstd) it even depends on future decisions, i.e. the entire block parsing. So the best practical approach is multiple passes using stats from the previous pass.

For dynamic algos, Cost depends on the path to the current point, so a locally-optimal parser needs to store the stats corresponding to the "best parsing" at each position already parsed. This allows it to use these stats when looking for the "best parsing" at new positions:

Code:

for (j)
    -- find the best parsing for position j
    for (any i < j)
        -- check match(i,j), i.e. the match from previous position i to position j
        cost = Model[i].Cost(i,j)  -- cost of match(i,j) in the Model of the best parsing for position i
        if cost < BestCostSoFar then BestCostSoFar = cost; pos = i
    best_parsing[j] = pos
    Model[j] = Model[pos] updated with match(pos,j)

(I found this algo in Charles Bloom's blog)

But even this algo can't produce a globally-optimal parsing, since it's not ready to "sacrifice" some bits in order to get much better results in the future. E.g. it may prefer a "2323" parsing over "22222" (digits are match lengths) despite the fact that parsing the entire buffer as "222..." will lower the cost of 2-byte matches and provide an overall improvement.

Finally, there is one type of algo where Cost(i,j) is semi-static: semi-dynamic algos like tornado (sorry, I forgot that when I wrote the previous answer). Here the Model is updated only once per, say, 10K matches, based on the stats gathered on previous matches. So, if you start optimal parsing at such a point and continue it until the next update, your Cost(i,j) function will give you exact real costs.

And my error: tornado actually implements your algo 2, i.e. it's O(N*MAX_MATCH_LEN). That's exactly why all optimal parsers I know are so slow. It can deal with some edge cases using the FASTBYTE trick I described. But the entire idea of trying all previous positions up to the longest match is a source of serious slowness. I have considered various ideas for making this faster, such as use of SIMD, but it's still a research area.

So I looked again at your first algo. It seems that the main point here isn't the algo itself, but the requirement "Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)". Can you describe in simple words how it restricts the possible models?

And my error: tornado actually implements your algo 2, i.e. it's O(N*MAX_MATCH_LEN). That's exactly why all optimal parsers I know are so slow. It can deal with some edge cases using the FASTBYTE trick I described. But the entire idea of trying all previous positions up to the longest match is a source of serious slowness. I have considered various ideas for making this faster, such as use of SIMD, but it's still a research area.

Right. It sounds like tornado is already doing the thing I say might be interesting, too.

So I looked again at your first algo. It seems that the main point here isn't the algo itself, but the requirement "Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)". Can you describe in simple words how it restricts the possible models?

The main point is the condition and the fact that it runs in linear time (it would be fast).

You'll notice that it never backs up. It only moves forward.

In order to never have to back up, it needs to know that adding a symbol to the end of the current match won't cause the optimal match starting point to move backwards.

// Condition for algorithm:
// For all A, x, and y where xAy is a string appearing in the text and x and y are individual symbols,
// Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)

I think I deciphered it! Let's start with a compressor having MIN_MATCH=3 and consider the simple case where len(A)=1, so A is just a char like x and y. In this case:

I.e. the condition is that ANY 3-byte match has a higher cost than the 3 individual bytes it covers! I.e. the condition is that 3-byte matches are useless, so we may set the minimum match length to 4 and... repeat the proof with len(A)=2.

Overall, if len(A)=N, the condition means cost(N+2)-cost(N+1) >= cost(N+1)-cost(N), where cost(L) is the cost of some match of length L. I.e. we require that, for any N, adding a char to a match of length N+1 has a HIGHER cost than adding the same char to its prefix, the match of length N!!! But we use LZ for exactly the opposite reason: extending a long match with one more char usually has a SMALLER cost than extending a short match. And for the condition you've given, parsing is trivial: 1-byte matches are optimal.

It may be good in practice, however, just as one more sort of "flexible" parsing algo, but it needs to be checked with real data.

OTOH, the paper's algorithm may be used for lz5-like engines having a FIXED but VARIABLE-SIZED len/dist encoding, as opposed both to the fixed-size encoding of lz4/lzss and to the data-dependent encoding of lzh-class algorithms.
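This can be sanity-checked numerically. Take a hypothetical gamma-style match cost, say cost(L) = 16 + 2*floor(log2(L)) (the numbers are invented for illustration): the cost differences shrink as L grows, so the convexity requirement cost(N+2)-cost(N+1) >= cost(N+1)-cost(N) fails whenever the length code grows a bit, e.g. at N = 1, 3, 7, ...

```c
/* Hypothetical gamma-style match cost: a fixed 16-bit base plus
   2*floor(log2(L)) bits for the length. Such codes are concave in L
   (the cost differences shrink), so the convexity condition fails. */
static int ilog2(int x) { int r = 0; while (x >>= 1) r++; return r; }
static int match_cost(int L) { return 16 + 2 * ilog2(L); }

/* Does cost(N+2)-cost(N+1) >= cost(N+1)-cost(N) hold at length N? */
int condition_holds(int N) {
    return match_cost(N+2) - match_cost(N+1) >= match_cost(N+1) - match_cost(N);
}
```

E.g. cost(3)-cost(2) = 0 but cost(2)-cost(1) = 2, so the condition already fails at N = 1.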

Overall, if len(A)=N, the condition means cost(N+2)-cost(N+1) >= cost(N+1)-cost(N), where cost(L) is the cost of some match of length L. I.e. we require that adding one char to a match of length N+1 has a HIGHER cost than adding one char to a match of length N!!! But we use LZ for exactly the opposite reason: extending a long match with one more char usually has a SMALLER cost than extending a short match. And for the condition you've given, parsing is trivial: 1-byte matches are optimal.

It may be good in practice, however, just as one more sort of "flexible" parsing algo, but it needs to be checked with real data.

I didn't follow all of your reasoning, but extending long matches can cost more than extending shorter matches, because longer matches are harder to find, so the distance may get larger. Also, the length gets bigger. Adding to the larger match doesn't *have* to cost more, it just can't cost less.

I'll try to think some more about situations that might violate that constraint. I think that constraint is no more strict than the constraint in the paper. But as I said, I'm not 100% confident in this yet.

Well, it definitely MAY cost more; otherwise LZSS-style "optimal parsing" would be really optimal, i.e. it would be enough to minimize the number of matches. The point is that USUALLY it's the opposite (i.e. costs LESS), and your algo1 gives an optimal parsing only in the cases where extending ANY match is unprofitable (i.e. costs MORE). But in that case we should just stick with literals.

I can add more explanations to my proof if it's hard to follow. The following is equivalent to your condition but may be easier to analyze:

Cost(xAy) - Cost(xA) >= Cost(Ay) - Cost(A)

i.e. extending the longer match xA with one more char y should have a larger cost than extending the shorter match, its suffix A, with the same char y.

For optimality, your algo requires that this condition hold for EVERY (src_position, dst_position) pair, but on real data it will be wrong, not only at a few positions, but at 90% of such pairs.

In short, the idea of your algo is to keep i and j as close as possible and skip long matches, since they indeed can't improve the cost in models satisfying that strange condition.

PS: every time I write "more", it means "more or equal", and so on. It's just for brevity and doesn't change anything I said.

PPS: your idea is opposite to that of the paper and of any reasonable LZ encoding. Indeed, it makes building an "optimal LZ" algorithm so much easier. Try to make an O(N) algorithm for the opposite condition.

Last edited by Bulat Ziganshin; 10th January 2017 at 09:19.

Let's repeat that again: your condition means that any match of the MINIMAL length supported by this particular coder should have a cost larger than the summary cost of the individual chars it replaces. Indeed, if xAy is a minimal-length match, then xA and Ay are too short to be matches, so they can only be encoded as sequences of literals.

This means that an optimal encoding should never use matches of minimal length, so it's equivalent to encoding with a coder whose MINIMAL match length is 1 higher. But this coder also can't use matches of its minimal length in an optimal encoding. Repeating that proof recursively, we find that for models satisfying your condition, optimal parsing should use no matches at all!

We can simplify that with a proof by contradiction. Suppose that an optimal parsing OP of given data D with a given model M (satisfying your condition) includes matches. Let's find the minimal match length L actually used in OP. Since there are no smaller matches, OP is equivalent to OP1, that is, data D encoded in model M1, which is model M with support for matches shorter than L dropped. But as we proved above, no optimal encoding can use matches of the minimal length supported by a particular model (if the model obeys your condition). That means, again, that optimal parsing in model M can't use any matches.

Last edited by Bulat Ziganshin; 10th January 2017 at 10:20.

Now every match costs the same, so you get 0>=0 in every case.

Don't forget that we consider literals as matches of length 1, so you are going to encode each literal with 8 bytes too.

But you misinterpret my point. It's not a problem to find an awkward model that satisfies your condition; it's a problem to find a useful one. In this particular model, all matches have the same price, so we just need to minimize the NUMBER of matches+literals, which is done by the LZSS optimal parsing algo, which is already O(N). But any practical algorithm where a minimal-length match at least sometimes provides better compression than the separate literals it replaces doesn't satisfy your condition.

What's even worse is that your algo just skips large matches, supposing they are worse than short ones, so it can't be useful in practice. It's like selecting the shortest matches in greedy compression.

You seem to be claiming that Cost(Ay)=Cost(A)+Cost(y), and that is not generally true.

It's true when xAy has the minimal match length, so Ay can be encoded only as a sequence of literals. Overall, theory tends to consider literals as (pseudo-)matches of length 1, and your code in the first post follows that tradition.

OTOH, when I say minimal match length, I mean the minimal length of real matches supported by a particular coder, and my reasoning operates with xAy matches having this minimal real match length. Don't mix them up!

After doing a lot of algebra, I got the constraint "Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)".

Since the loop header ensures only that (i+1 < j), it's possible that i+1==j-1, and in this case Cost(i+1, j-1) covers an empty string. So, A in your condition may be an empty string, and that means even a model where everything is encoded in 8 bytes doesn't satisfy it.

Anyway, it doesn't have much practical meaning: the entire condition requires that long matches always be more expensive than the smaller ones they replace. It doesn't cover any practical LZ models (we use LZ for exactly the opposite reason). That's the real point!

Let's replace A with Ab. We get Cost(xAby) - Cost(Aby) >= Cost(xAb) - Cost(Ab),

but Cost(xAb) - Cost(Ab) >= Cost(xA) - Cost(A), so we get

Cost(xAby) - Cost(Aby) >= Cost(xA) - Cost(A)

Repeating that again and again, we prove that for any string B

Cost(xAB) - Cost(AB) >= Cost(xA) - Cost(A)

Now let's replace A with the empty string here, and we get:

Cost(xB) >= Cost(B) + Cost(x)

i.e. any match should have a higher cost than its first byte as a literal plus the suffix match. I.e. any match is worse than a shorter one + a literal, up to the point where only literals are used to encode the text. Only that condition can make algo1 optimal!

Ok... 'A' can't be empty. It has to be at least one character. xAy can't be 3 literals, because each literal would be treated separately (matches are supposed to be atomic). The constraint doesn't apply unless you have at least 3 characters.

I think you may be trying to apply the constraint to every match, or every group of matches involving 3 contiguous characters, but it only applies to single matches of at least 3 characters.

Last edited by nburns; 10th January 2017 at 17:27.

I think one or two good diagrams would be really helpful here, but I don't know if it's worth straining my weak diagramming abilities for this little algorithm. If I were writing a paper on this, there would definitely be diagrams.

Maybe that's what the Ferragina paper needs... diagrams.

Ok... 'A' can't be empty. It has to be at least one character. xAy can't be 3 literals, because each literal would be treated separately (matches are supposed to be atomic). The constraint doesn't apply unless you have at least 3 characters.

cost(xAy) means the MINIMAL cost of the string xAy, which may be the cost of those 3 literals.

About the diagram: algo1 is fast because it avoids checking long matches. And that's possible because you require that long matches always have a larger price than encoding the same string with a shorter match + literal. So it's a trivial and useless algo.

LOL... you can't prove that. The constraint is like an axiom. Not provable.

I could set a constraint that every text must contain the string "Bulat" and it can't be proven false.

cost(xAy) means the MINIMAL cost of the string xAy, which may be the cost of those 3 literals.

If xAy is three literals, then that's 3 separate "matches" (or "atomic parsing units"). You can't pass three matches to cost() at once. The purpose of the algorithm is to start with a cost function that only works for single matches and get a cost for a sequence of matches.

About the diagram: algo1 is fast because it avoids checking long matches. And that's possible because you require that long matches always have a larger price than encoding the same string with a shorter match + literal. So it's a trivial and useless algo.

LOL... you can't prove that. The constraint is like an axiom. Not provable.
I could set a constraint that every text must contain the string "Bulat" and it can't be proven false.

algo1 can't build an optimal parsing for an arbitrary model. It requires that your condition hold for any xAy part of the input text, including those with empty A. That is what I proved.

Originally Posted by nburns

If xAy is three literals, then that's 3 separate "matches" (or "atomic parsing units"). You can't pass three matches to cost() at once. The purpose of the algorithm is to start with a cost function that only works for single matches and get a cost for a sequence of matches.

Your Cost function gives the OPTIMUM price of a STRING of input text. The optimum price is selected among ALL possible parsings of the string, including all-literals.

Originally Posted by nburns

It may be useless, but not trivial.

Its triviality doesn't look obvious, but basically you require that any match of length >=2 have a larger price than a shorter match + literal encoding the same data. So a sequence of matches of length 1 is the most efficient parsing.

algo1 can't build an optimal parsing for an arbitrary model. It requires that your condition hold for any xAy part of the input text, including those with empty A. That is what I proved.

Your Cost function gives the OPTIMUM price of a STRING of input text. The optimum price is selected among ALL possible parsings of the string, including all-literals.

Its triviality doesn't look obvious, but basically you require that any match of length >=2 have a larger price than a shorter match + literal encoding the same data. So a sequence of matches of length 1 is the most efficient parsing.

I'm trying to implement it and you might be right. Hopefully it's fixable.

Oh, you're right...

Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)

If Cost(A)=1 and Cost(xA)=2, then every successive increase in length increases cost by >=1.

Optimal parsing for simple LZs already exists anyway; it doesn't matter much whether it's O(n) or not, if it has reasonable speed in practice.
What I really want to know is how to correctly implement optimal parsing for codes like lzma's, with alignment, rep-matches and masked literals.
Well, one idea was to actually find imprecise matches (kinda like what bsdiff does), and then build a code for these using all types of lzma's tokens, and thus compute the price.
Lzma itself only explicitly checks for a few simple token patterns, so who knows how much its compression can be improved for formats like .exe.

// Find cost of optimal parsing.
//
// Condition for algorithm:
// For all A, x, and y where xAy is a string appearing in the text and x and y are individual symbols,
// Cost(xAy) - Cost(Ay) >= Cost(xA) - Cost(A)

int Cost(int i, int j); /* cost of best match for text[i..j-1] */

int len = /* length of text */;
int *prefix_cost = (int*)malloc(sizeof(int) * (len+1));
prefix_cost[0] = 0;
for(int i = 0, j = 1; j < len+1; j++) {
    while(!match_exists(i, j)) {
        i++;
    }
    int next_cost, minimum_cost = prefix_cost[i] + Cost(i, j);
    while((i+1 < j) && ((next_cost = prefix_cost[i+1] + Cost(i+1, j)) < minimum_cost)) {
        minimum_cost = next_cost;
        i++;
    }
    prefix_cost[j] = minimum_cost;
}
// prefix_cost[len] == cost of parsing text

I suspect that I made a similar mistake in deriving the condition and that led to the condition being too strict.

Optimal parsing for simple LZs already exists anyway; it doesn't matter much whether it's O(n) or not, if it has reasonable speed in practice.
What I really want to know is how to correctly implement optimal parsing for codes like lzma's, with alignment, rep-matches and masked literals.
Well, one idea was to actually find imprecise matches (kinda like what bsdiff does), and then build a code for these using all types of lzma's tokens, and thus compute the price.
Lzma itself only explicitly checks for a few simple token patterns, so who knows how much its compression can be improved for formats like .exe.

Conceptually, the way to find optimal parsings is to construct a directed graph where each text position is a vertex and edges represent matches, literals, etc. Each edge starts at the beginning of the match and points one vertex past the end of the match. The edge is weighted by the cost of the match. Given this graph, you can use the algorithm for single source shortest path to get the lowest-cost path.

The graph has to exist (in a conceptual sense, not necessarily literally) before you can find the shortest path, so the weights have to be known (or knowable) in advance.

This code implements the equivalent of the SSSP algorithm.

int len = /* length of input */;
int *prefix_cost = (int*)malloc(sizeof(int) * (len+1));
prefix_cost[0] = 0;
for(int j = 1; j < len+1; j++) {
    int next_cost, minimum_cost = INT_MAX; /* highest number */
    for(int i = 0; i < j; i++) {
        if(match_exists(i, j) && ((next_cost = prefix_cost[i] + Cost(i, j)) < minimum_cost)) {
            minimum_cost = next_cost;
        }
    }
    prefix_cost[j] = minimum_cost;
}
// prefix_cost[len] == cost of parsing text

You can optimize the inner loop if you know where the matches are going to be.
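One way to do that, sketched along the lines of the tornado loop quoted earlier: relax forward over an explicit match list, so the work is O(n + total matches) rather than O(n^2). The match list and bit costs below are hypothetical toy data, not the output of a real match finder.

```c
/* Forward relaxation over an explicit match list: SSSP on the parse DAG,
   visited in topological (left-to-right) order. Toy costs: 9 bits per
   literal, a flat 24 bits per match; matches are (start, length) pairs. */
#include <limits.h>

enum { LIT_BITS = 9, MATCH_BITS = 24, MAX_LEN = 64 };

int forward_parse_cost(int len, int n_matches,
                       const int match_start[], const int match_len[]) {
    int price[MAX_LEN + 1];
    price[0] = 0;
    for (int p = 1; p <= len; p++) price[p] = INT_MAX;
    for (int pos = 0; pos < len; pos++) {
        if (price[pos] == INT_MAX) continue;             /* unreachable position */
        if (price[pos] + LIT_BITS < price[pos + 1])      /* literal edge */
            price[pos + 1] = price[pos] + LIT_BITS;
        for (int m = 0; m < n_matches; m++)              /* match edges out of pos */
            if (match_start[m] == pos &&
                price[pos] + MATCH_BITS < price[pos + match_len[m]])
                price[pos + match_len[m]] = price[pos] + MATCH_BITS;
    }
    return price[len];
}
```

For "abcabc" the matches would be (3,2), (3,3), (4,2), and the cheapest path is three literals plus the length-3 match: 3*9 + 24 = 51. (Scanning the whole match list at every position is just for brevity; a real implementation would index matches by starting position.)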

[I tried to rewrite the constraint, but writing it correctly is hard and I gave up for now.]

The idea is that if it is costly to add to the front of a match, adding to the back of the match shouldn't discount the cost of adding to the front so much that the optimal starting point moves backwards.

That could be extended to "the starting point can't move backwards by more than O(1) positions" or "the starting point can't move backwards by more than O(1) positions (amortized)".