How to use suffix tree as an LZ matchfinder? I suppose I need to use Ukkonen's algorithm to construct the tree: stepping one by one in the input (so the tree contains suffixes of the string ended at the current position), and searching the tree for the byte string starting at the current position. However, this method fails to find a match if it overlaps with the current lookahead buffer. How is this problem solved? Of course, I could find these kind of matches manually, but it kinda defeats the purpose of the tree (in degenerate cases, this search makes the algorithm from O(n) to O(n^2)).

Here's an example:

ABC|ABCABC
^
+--- current position

Here, suffixes of "ABC" are in the tree. The suffix tree will report that "ABC" is the longest match. However, the real longest match that an LZ compressor can use is "ABCABC" (not yet in the tree). My match lengths have unbounded length, so "forward-adding" MAXLENGTH bytes to the tree is not an option.

A basic approach would be to save matches only when going through suffix pointers in the tree. But this causes a problem - you can't easily control the size of a window where you store data about matches for parsing. I'm assuming that parsing has much bigger input than output. Ie:
- input size is M bytes,
- parsing input is O bytes per position,
- parsing output is P bytes per position,
If O is much bigger than P (that's usually the case) then it would be beneficial to make N-byte parsing windows, so you collect matches for N positions, compute some good parsing output, and repeat (ie collect another N positions and so on). It's clear that if we save matches only on going through suffix links then we can't predict nor limit the positions and number of matches. To work around that you could add some artifical unique symbol nodes to the tree when you are too deep in it thus restarting the parsing in that place. If the depth threshold is big then the compression ratio loss should be minimal. To further reduce it you could restart parsing at some earlier position. Example:
- we are on position 2600 in input,
- our current depth in the tree is 1000,
- we reached our depth threshold (which is that 1000),
- we insert unique symbol to the tree, thus effectively moving to depth 0,
- we want to minimize compression ratio loss so we restart parsing from position 2600 - 300 = 2300,
- we need to go through the 300 bytes so we are at position 2600 in the input again,
- now we can continue building tree and saving matches normally (till we reach the depth limit of 1000 again so we need to repeat the procedure again),

I hope I'm clear enough

Keep in mind that saving only one maximum match per input position won't necessarily give you enough information to get good parsing. Sometimes more matches are optimal when they have shorter distances. Eg 10 matches of 8 bits each will be shorter than 9 matches where each takes 10 bits. I would advise to not only store the deepest node from the tree that would point to the longest match but also a few (fixed number of) parent nodes. Those parent nodes should mostly correspond to matches with much shorter distances.

Also there's another problem to tackle: saving matches only when going through suffix pointer will leave you with holes in parsing window. Eg after building tree and saving matches you could be left with,

Code:

Input position Match position Match length
... ... ...
100 50 10
105 30 9
... ... ...

We can interpolate the matches then then:

Code:

Input position Match position Match length
... ... ...
100 50 10
101 51 9
102 52 8
103 53 7
104 54 6
105 30 9
... ... ...

Disclaimer:
This is fully untested, everything from this post I've just simulated in my head so it can be wrong

However, I'm not sure that I can understand what you said.

(Note that I've implemented optimal parsing already, with more matches at a position, input is divided into blocks, etc. - but my BST matchfinder becomes slow at degenerate cases (yes, I could limit search depth), and it sometimes fails to find the longest match, that's why I'd like to try suffix tree based matching).

Let's see a basic example: "ABABAB".

At the beginning, suffix tree is empty.
At the first byte, we search for "ABABAB" in the suffix tree, but as it's empty, it gives us no match. Then, we insert "A" into the tree (Ukkonen's algo), right?
At the second byte, we search for "BABAB" in the suffix tree -> still no matches. We insert "B" into the tree.
At the third byte, we search for "ABAB" in the suffix tree -> we got a match (at 0, "AB") but it is only 2 bytes long. It should be 4 bytes long, because the match overlaps with the lookahead buffer. I could manually expand the length with simple linear search, but that's no good, because it will make the algorithm O(n^2).

I'm completely new to suffix trees (never implemented Ukkonen's algo), so maybe I misunderstand something...

Correct (but not full) description for Ukkonen's algorithm would be different:
- it's an on-line algorithm,
- after fully processing N-th byte the produced tree is a complete proper suffix tree for sequence of N bytes,
- on N-th step exactly N bytes were analyzed so you can't say it added longer suffixes,
- processing N-th byte basically adds it to every suffix in the tree (plus recreates the empty one),
Paths in Ukkonen's suffix tree are compressed and thanks to that you have linear number of nodes instead of quadratic number of them.

I'll describe what I mean by 'depth' in previous post. Imagine we've processed L symbols from input, so our tree built by Ukkonen's algorithm is full proper suffix tree for first L symbols. Let D be a maximum number lower than L so that the suffix of length D is a substring that could be found somewhere earlier in the input. D is the depth I was talking about. I've named it depth because in Ukkonen's algorithm you're always looking at symbol at that depth in the tree. The number of nodes in the tree is L - D (after processing L symbols of course). If the last added symbol is unique then D = 0 and you don't need to special case anything if you're going to use the suffix tree for anything. If D > 0 then you need to special-case the suffixes of length D or lower. That's what I've done in my suffx tree based CM algorithm: http://encode.su/threads/1671-Demixe...in-development

Anyway, for me the only sensible moment for saving matches is just before you add new node to the suffix tree in Ukonnen's algorithm. That's precisely the moment when you have the maximum match at hand and also there's a linear number of such moments. If you're going to do N searches for N bytes long input then you will certainly have superlinear complexity (I'm not sure if quadratic or not).

Correct (but not full) description for Ukkonen's algorithm would be different:

I didn't intend to describe Ukkonen's algorithm. What I described is to how to use it for LZ matchfinding.

Originally Posted by Piotr Tarsa

Anyway, for me the only sensible moment for saving matches is just before you add new node to the suffix tree in Ukonnen's algorithm.

Yep, this is what I described with my little example. At each position, first I search the tree for the current lookahead buffer, then I add the current byte to the tree (or do these tasks simultaneously, like in BST).

Originally Posted by Piotr Tarsa

That's precisely the moment when you have the maximum match at hand

If the longest match doesn't overlap with the lookahead buffer (LAB), then yes, it's correct. But, if it overlaps, this statement is not true anymore. As the tree doesn't contain LAB, it fails to find matches which overlaps with LAB. That is my problem. Suffix tree is advertised that it handles degenerate cases well (and it is true as long as the match doesn't overlap with the LAB). However, I don't see how it handles the case where the match overlaps with the LAB in O(n) time.

What I've suggested is to delay saving matches until you create new nodes. If you have input like "ABABABAB" then the solution to reduce depth to 0 at the end is to add unique symbol at the end, so we'll end with input ""ABABABAB$". Try it. When adding '$' symbol the tree will go from 3 nodes (root + 2 leaves) to a full tree with leaf node for every suffix. This means that you'll get all the matches after adding that last symbol.

The workaround for high depths described previously by me could be compared to periodical dry run of adding unique symbol (ie simulate adding and actually collect matches without modifying real tree) then restarting parsing from some earlier point.

I also suggest you to first go with easier route first and extract matches from final suffix tree (instead of doing it on-line). This way it won't matter if you implement Ukkonen's algorithm or any other one for building the suffix tree. The key consideration here is that you don't need to compare symbols from input when you extract matches from suffix tree. You know that you have a match when you have two nodes sharing a path from root - there's no other way to have a match in suffix tree.

After doing the off-line approach and simplifyling job with adding a unique symbol at the end you can return to on-line approach without using unique artifical symbols. To tackle the problem of suffixes that do not have own leaves (becase they are substrings that can be found earlier in the input) you need to use the active point/ active node/ whatever it was called in the algorithm's description. Active point point exactly to the place where you would add new node if an unique symbol would come as input. It also shows you the maximum match you have at current moment. Correctly using active point and suffix links is crucial to maintain linear complexity.

Piotr, I really thank you for trying to helping me

However, I still have no idea how to solve my problem. The offline method is works, of course, because it contains all the input (a new problem is here, how to find the nearest occurrence? The tree will give me some position, but it won't be necessarily the nearest match - it is important because of entropy coding).

The online method, however, doesn't contain the full input. In my previous example, after adding "AB" to the empty tree, the tree will contain suffixes for the string "AB". So this tree is unable to give me the longest match for position 3. Active point here will tell me that the maximum match length is 2 (am I right here?), but it is incorrect, because the maximum match length is 4. Does it makes sense?