Originally Posted by

**Malcolm Taylor**
...

There is also for lz77 several different optimal parsing methods. One

(used by quantum IIRC) parses all the matches in the file, and then

passes backwards over them finding the best match to reach each byte.

The result is optimal for a static encoding (much like FP).

Another one (used by 7zip and ROLZ and others), is a dynamic programming

approach which can cope with adaptive encodings. This often leads to

some fantastic results. Although I call this optimal, it is still only a

near optimal method. It always converges to a local minimum, so may not

always hit the global minimum.

...

It is easier to understand than any other optimal parsing I have come

across, but not necessarily obvious.

1) Find all the matches for every byte of your file.

2) Start at the end (position n).

3) Find the cheapest way to get to n from n-1, this is the best so far.

4) Find the cheapest way to get to n from n-2, if better than the best

so far, then this new way becomes the best so far.

5) Repeat step 4 until we reach the start of the file.

By induction the way to get from 0 to n that you end up with is the

cheapest.

Things are a little more complicated when you consider the actual

implementation. As you go, you have to remember the cheapest from each

byte - ie. have an array of the best match found for each byte (or

literal which we can think of as a length 1 match), and the cost.

If you are testing position x which has a match length l, then the total

cost of reaching the end of the file will be:

best_cost_found(x) = cost_of_match(l) + best_cost_found(x+l).

In other words, the total cost is the cost of the match at the current

position, plus the best cost found earlier from where the match ends.

Then you try all match lengths from l down to 1, and save the best total

cost and match details for the current byte, and repeat for the previous

one...

This parsing depends on a well defined cost function, which implies a

static encoding. If your encoding is dynamic, then it kindof screws with

the optimisation, especially since you parse backwards. One way to solve

this is to parse optimally only over small blocks (say 4096), then you

can have static encodings only for each block.