# Thread: A new match searching structure ?

1. ## A new match searching structure ?

Hello everyone

At last. It has been a long time since i last posted in this forum.
I've had a hard time catching up with my work objectives after a stupid accident kept me out of programming for a few weeks. I'm finally available again to spend some time on compression.

I've been interested these last days in Match Searching. While writing a simple tutorial on basic match searching with Hash chains, i stumbled upon what seemed a new idea, one i've not been able to find or read about on the net before.
Not being aware of prior work is obviously no proof that the idea is novel. So i will describe it here, in the hope that experts of this board may already know about it. In that case, i would be glad to learn more by reading documentation on it.

Context :
The basic objective is to find the best match in a given search window size.
As everyone probably already knows, a proven method is to hash the minimal valid sequence (minmatch) into a first table, and then link all positions sharing the same hash within a second table, which has as many entries as the search window size. (In effect, the second table is typically 4 times the size of the search window, since each entry stores a 4-byte position.)
It works reasonably well, and indeed provides a tremendous speed boost compared to "naive" search (comparing all previous positions).
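For illustration, here is a minimal sketch of this classic two-table scheme. The hash function, sizes, and names are illustrative only, not taken from any particular compressor:

```c
#include <stdint.h>
#include <string.h>

#define MINMATCH    4
#define HASH_BITS   16
#define WINDOW_LOG  16                       /* 64K search window */
#define WINDOW_SIZE (1 << WINDOW_LOG)
#define WINDOW_MASK (WINDOW_SIZE - 1)

static int32_t hashTable[1 << HASH_BITS];    /* hash -> most recent position  */
static int32_t chainTable[WINDOW_SIZE];      /* position -> previous position */

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, MINMATCH);                 /* hash covers MINMATCH=4 bytes */
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

static void mf_init(void)
{
    memset(hashTable,  0xFF, sizeof hashTable);   /* -1 = empty slot */
    memset(chainTable, 0xFF, sizeof chainTable);
}

/* Search the best previous match for position pos, then insert pos.
 * Note: on a hash collision, the compared match may be shorter than
 * MINMATCH; the caller simply ignores matches that are too short. */
static int insert_and_find(const uint8_t *base, int32_t pos, int32_t end)
{
    uint32_t h   = hash4(base + pos);
    int32_t  cur = hashTable[h];
    int      best = 0;

    while (cur >= 0 && pos - cur <= WINDOW_MASK) {
        int len = 0;
        while (pos + len < end && base[cur + len] == base[pos + len]) len++;
        if (len > best) best = len;
        cur = chainTable[cur & WINDOW_MASK]; /* follow the chain backward */
    }
    chainTable[pos & WINDOW_MASK] = hashTable[h];
    hashTable[h] = pos;
    return best;
}
```

Every position sharing the hash is visited, which is exactly the behavior whose cost grows with the window size.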

Now, the problem is : with increasing search window sizes, the number of positions within the chain can become so large that it slows down compression considerably. This has nothing to do with occasional collisions (two different sequences with the same hash), as those can be kept reasonably rare with an increased hash size. It is the normal behavior of the algorithm : there are simply many positions sharing the same prefix, that's all.

Therefore, i was wondering : it is quite a pity that, when i already have a match of say 4 bytes, i cannot ask the match search function for a position with at least 5 common bytes, in order to improve on the previous result, rather than for yet another position with at least 4 common bytes.

That, i figured, could be achieved with nearly the same structure.

Basic idea :
Let's imagine we have a chain of positions sharing a common hash. What i want to end up with is a chain of positions sharing MinMatch common bytes; in effect, i want to remove hash collisions.

This can be achieved with a simple "double chain" methodology. This is nothing new, i think Bulat already advocated for it some time ago.
Just to make everything clear, here is a picture of how it works :

Just use the first "hash chain" normally, looking for a sequence of at least MinMatch bytes. When you find one, link it into the second chain, and continue searching on the second chain.
This reduces the number of positions to compare, as there are no more collisions to fear. As another small speed advantage, there is no need to compare the first MinMatch bytes anymore, as they are guaranteed to be strictly identical, so comparison can start from the MinMatch+1 character.
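To make the idea concrete, here is a toy sketch with two explicit chain tables. The real MMC morphs a single table in place, so take the names, layout, and sizes as illustrative only:

```c
#include <stdint.h>
#include <string.h>

#define MINMATCH  4
#define HASH_BITS 16
#define MAX_POS   4096          /* toy buffer limit; real code indexes modulo window */

static int32_t head[1 << HASH_BITS];  /* hash -> most recent position            */
static int32_t chainH[MAX_POS];       /* pos  -> previous position, same hash    */
static int32_t chainM[MAX_POS];       /* pos  -> previous position with the same
                                         MINMATCH bytes (built lazily on search) */

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, MINMATCH);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

static void mmc_init(void)
{
    memset(head,   0xFF, sizeof head);    /* -1 everywhere */
    memset(chainH, 0xFF, sizeof chainH);
    memset(chainM, 0xFF, sizeof chainM);
}

static void mmc_insert(const uint8_t *base, int32_t pos)
{
    uint32_t h = hash4(base + pos);
    chainH[pos] = head[h];
    head[h] = pos;                        /* cheap insert, no search needed */
}

/* Search the best match for pos; each position confirmed to share the full
 * MINMATCH prefix is linked into chainM on the fly, so later searches can
 * skip hash collisions entirely. */
static int mmc_search(const uint8_t *base, int32_t pos, int32_t end)
{
    int32_t cur   = head[hash4(base + pos)];
    int32_t lastM = pos;   /* last position known to share the full prefix */
    int     onM   = 0;     /* 1 once we are walking the MinMatch chain     */
    int     best  = 0;

    while (cur >= 0) {
        if (onM || memcmp(base + cur, base + pos, MINMATCH) == 0) {
            int len = MINMATCH;           /* first MINMATCH bytes guaranteed */
            while (pos + len < end && base[cur + len] == base[pos + len]) len++;
            if (len > best) best = len;
            if (chainM[lastM] < 0) chainM[lastM] = cur;  /* promote cur */
            lastM = cur;
            if (chainM[cur] >= 0) { cur = chainM[cur]; onM = 1; continue; }
            onM = 0;   /* end of MinMatch chain so far: back to hash chain */
        }
        cur = chainH[cur];
    }
    return best;
}
```

The promotion links survive across searches, which is where the savings come from: the sorting work done by one search benefits all following ones.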

Generalization :
Now, the same idea could be used to reach higher levels.
Just start with a chain of positions with a guaranteed common prefix of N elements.
That's nice, but in order to quickly reduce the number of useless comparisons, you would prefer to reach a "higher level" : a chain with at least N+1 common prefix elements.

A naive strategy would be to create a table for each prefix depth. This obviously would cost more and more memory, on top of making the chain update process more and more expensive.

So the clever idea here is to keep working with just 2 tables. This can be done to reach unlimited prefix depth.

Just replace, in the previous picture, the words "hash" with "prefix N", and "minmatch" with "prefix N+1", and we have our generalization.
For each position, we keep 2 links : "bad" means this position has not improved over the guaranteed prefix, so we should continue on this chain; "good" means this position has at least N+1 common elements, so we use this opportunity to reach the higher chain. By applying this principle recursively, we can create more and more chains with more and more common prefix elements, making further searches faster.

Here is a graphical attempt at explaining the algorithm :

Note that the match chain is being modified as we search into it, hence the name "morphing match chain". The good point is that there is no "setup time" : everything is done on the fly, during the search.

Special cases:
The whole idea may seem simple now, but there are some hard points during implementation.
Mostly, the basic behavior is guaranteed as long as the chain of prefix N has already been fully sorted and linked to valid chains of prefix N+1. With a fast update, unfortunately, this is not always the case. Therefore, you sometimes have to continue updating chain N even after having found an N+1 sequence.
However, dealing with this complex "unsure" situation allows greater agility. Adding positions to any chain without having to search for their higher level is a great time saver. Typically, positions within a match may not trigger a search (with greedy or lazy matching). This is not an issue with this methodology : just add them to the base hash chain; they'll get sorted later, if ever needed.

Experimental results:
Time for a precise evaluation of this algorithm. I spent some time building a modified version of LZ4, a trivial LZ compressor, ideal for testing such a structure.
We are comparing 2 versions : "HC" (Hash Chain) and "MMC" (Morphing Match Chain). Both look for the best match (no comparison limit) within a search window. The resulting compressed files are strictly identical. Here are some results, comparing the number of searches :

Window Search 64K :
(results table missing)

Window Search 512K :
(results table missing)

Window Search 4M :
(results table missing)

Results conform to expectations : the larger the search window, the more advantageous the morphing match chain methodology becomes.
Note also that benefits vary greatly depending on file type.

As for speed, faster search is not the only element in the equation, but for LZ4, which is a trivial LZ compressor, the relation is mostly linear : half as many searches translates into nearly twice the speed. Obviously, a search with MMC is a bit more complex, due to the changes made to the chain table, so there is a cost to it. But it is not terrible : to give an idea, compressing an incompressible file results in less than 10% speed loss with MMC compared with the classic single Hash Chain methodology.

There is no claim here that this structure is the best.
For a start, it is still vulnerable to long streams of zeros and the like, which are also a weak point of hash chains. For better speed, these special cases should get their own search structure or coding sequence (typically RLE).
Certainly it is a fair improvement over the simple Hash chain methodology, but it can be argued that a clever tree can provide better results, especially with very large search depths.
Indeed, investigating further, this structure does look like a tree disguised in a table. But, unlike in a tree, there is no direct access to reach a higher level : you have to search the current level and find the gateway.

However, we get some nice properties too :
- First, the search at level N is not random; it proceeds in MTF order, which is very favorable. And we have made no effort at sorting anything : this effect comes for free.
- Second, we get automatic cleaning of old positions : a slot is simply reused by a newer position. Trees are typically much more complex to clean up, with sometimes erasing everything and restarting being the easy way out (as in ppm).
- Third, the memory budget may seem high (2 tables basically means memory is 8 times the window size), but this is still favorable compared to most tree structures. Moreover, it is completely predictable.

All in all, this seems a fair improvement over hash chains; however, it is still a bit too slow (for my taste) for very large search windows (64MB and more). Fortunately, you can still combine this method with a classic comparison limit, trading some compression level for bounded worst cases. The good news is that, for a given limit, you'll search much farther with MMC than with a classic hash chain.

Now the question is : is that really new ?

2. imho, it's new, but not better than a binary tree, which requires the same 8N memory. lzma needs about 10N memory due to the dictionary itself and the hash heads array that MMC needs too

3. Imho the best approach would be to use BWT. It gives you all the matches at once with 4N memory.
The drawback is that it's completely blockwise, but as a workaround you can do BWT on overlapped windows.

Indeed, BST is on my list of things to try, and likely my next attempt.
I'm a bit worried though about tree maintenance, as all inserted elements must be fully "searched" to get to their right position. Also, deletion is not free, but probably not the main issue there.
I also wonder about the 8N size : it is fine for reading a BST from root to leaves, and therefore allows searches and insertions (with minor local variables). But for more complex operations such as deletion and balancing, i was under the impression that additional information, such as depth, priority or parent node, would be necessary for efficiency. And therefore, it would cost some memory too...

lzma needs about 10N memory due to dictionatry itself
Is there some documentation available ? I'm interested.

[EDIT] : Here is what the LZMA documentation says :
Code:
```              Memory requirements depend from dictionary size
(parameter "d" in table below).

MF_ID     Memory                   Description

bt2    d *  9.5 + 4MB  Binary Tree with 2 bytes hashing.
bt3    d * 11.5 + 4MB  Binary Tree with 3 bytes hashing.
bt4    d * 11.5 + 4MB  Binary Tree with 4 bytes hashing.
hc4    d *  7.5 + 4MB  Hash Chain with 4 bytes hashing.```
Interesting, these *.5 sizes, as if some information had been removed...

and hash heads array that MMC needs too
Optimal size for the Hash heads array tends to be surprisingly stable as the search window grows, and in the end relatively negligible when N becomes big. In my recent experiments, it depends mostly on minmatch size. It was a surprising conclusion, so i made a few graphs to study it :
http://phantasie.tonempire.net/news-...hique-t109.htm
and http://phantasie.tonempire.net/news-...imale-t110.htm
Therefore, a hash of 17 bits for a 4-byte minmatch seems good enough for nearly any search window size. In any case, it does not grow linearly with N.

Imho the best approach would be to use BWT
I agree, although i'm glad to learn and investigate different solutions to properly understand them. btw, regarding BWT : certainly the end result is a 4N array, but is a temporary structure of vastly different size necessary to build this table in the first place ?

5. > certainly the end table result is a 4N array, but is a temporary
> structure with vastly different size necessary to build this table
> in the first place ?

No, see http://ctxmodel.net/files/BWT.cpp
That can be slow in redundant cases though, but there're real efficient
implementations still with 5N memory (including input block).
Well, it just stores pointers to all data bytes into a table,
and sorts them by referenced strings.
So in the end you have all the pointers to strings with matching prefixes next to each other.
And usual sequential match finders can be viewed as BWT implementations
based on insertion sort... while there're lots of more efficient algorithms.
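The "sort pointers by referenced strings" idea can be sketched naively with qsort. This is illustrative code; as noted above, the naive comparison is slow on redundant data, unlike real suffix array construction (e.g. divsufsort):

```c
#include <stdlib.h>
#include <string.h>

/* Naive 5N-style match index: sort all positions by the suffix they
 * point to. O(N^2 log N) worst case on redundant input; a real suffix
 * array construction algorithm does the same job in O(N) or O(N log N). */
static const unsigned char *g_buf;
static int g_len;

static int cmp_pos(const void *a, const void *b)
{
    int pa = *(const int *)a, pb = *(const int *)b;
    int la = g_len - pa, lb = g_len - pb;      /* remaining suffix lengths */
    int l  = la < lb ? la : lb;
    int r  = memcmp(g_buf + pa, g_buf + pb, l);
    if (r) return r;
    return la - lb;                            /* shorter suffix sorts first */
}

static void build_index(const unsigned char *buf, int len, int *idx)
{
    g_buf = buf; g_len = len;
    for (int i = 0; i < len; i++) idx[i] = i;  /* one pointer per data byte */
    qsort(idx, len, sizeof *idx, cmp_pos);
}
```

After the sort, positions whose strings share a long prefix sit next to each other in `idx`, which is exactly the "all matches at once" property mentioned above.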

Well, I'm not into LZ, so I didn't implement a BWT-based match finder myself,
but maybe you'd be able to use this as a reference:
http://nishi.dreamhosters.com/u/bsdiff_sh2.rar
Match finding there (for patch construction) is implemented by computing
BWT of file1+file2 in memory, and then scanning the index.
In fact, it even does _fuzzy_ match finding, eg. it would use
"bsdiff_sh1.rar" as a reference for "bsdiff_sh2.rar" - I'd like to
see this in LZ someday.

6. Thanks for reference Shelwien. I will look into them.
OK, i understand that 4N memory can be enough to build the final table (i think it is called the suffix array).
I'm willing to look deeper into this someday, although with so many excellent open-source implementations already available, this is unlikely to bring any breakthrough.

Now, even if 4N is indeed enough for the suffix array itself, searching for the current position P in it (in order to get its neighbors) can become quite long, and void the speed advantage.
As a consequence, maybe a reverse table, which for any position P gives its rank in the suffix array, would be preferable for speed.
It's pretty easy to build, but then you end up with 8N...
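Indeed, the reverse table mentioned here is a single pass over the suffix array; a minimal sketch (names are illustrative):

```c
/* Given a suffix array sa[0..n), build rank[] so that rank[sa[i]] == i.
 * The neighbors sa[rank[p]-1] and sa[rank[p]+1] are then the
 * lexicographically closest suffixes to the one starting at p,
 * i.e. the best match candidates for position p. */
static void build_rank(const int *sa, int *rank, int n)
{
    for (int i = 0; i < n; i++)
        rank[sa[i]] = i;        /* invert the permutation */
}
```

This is where the extra 4N goes: one int of rank per input position, on top of the 4N suffix array.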

7. 1. lzx/lzma binary trees are not the same as you may find in any Algorithms book. look into lzma sources to understand how they work

2. the best lzma doc is its source. 11.5x = 1.5x for dictionary, 8x for bt, 2x for hash heads

8. > I'm willing to look deeper into this someday, although with so many
> excellent open-source implementation already available, this is
> unlikely to bring any breakthrough.

There're good BWT implementations, but almost no examples (besides bsdiff)
of BWT usage for match finding.

> Now, if 4N is indeed enough for the suffix array itself, searching
> for the current position P into it (in order to get the neighbors)
> can become quite long, and void the speed advantage.

Here's an example of index table after BWT:

Code:
```aabracadabr <- 10
dabraabraca <-  6```
As you see, if you know your current position in the index table, you can
find all the matches around it.

> As a consequence, maybe a reverse-table, which for any position P
> gives the rank into the suffix array, would be preferable for speed.
> It's pretty easy to build this one, but then, you end up with 8N...

1. Rebuild the string offset table from BWT into the permutation table used for iBWT;
it can be done in place:
Code:
`for( i=0; i<N; i++ ) p[i]=freq[data[t[i]]]++`
2. Like in iBWT, you can trace the symbols in original order, starting from
index of 0 returned by BWT:

Code:
```index = BWT(...);
uint freq[CNUM+1];
for( i=0; i<=CNUM; i++ ) freq[i]=0;
for( i=0; i<f_len; i++ ) freq[data[i]+1]++;
for( i=0; i<CNUM; i++ ) freq[i+1]+=freq[i];
for( i=0; i<N; i++ ) p[i]=freq[data[t[i]]]++;

for( i=0; i<N; i++ ) {
next = p[index];
p[index] = i|(1<<31); // convert back into string offset and mark as processed
// here you can find the matches for data position i around p[index]
index = next; // p[index] was just overwritten above, so follow the saved link
}```
3. You can use another 1N for a match length table. It can be built in O(N) time
before (1). But even without it, you only need a single full string comparison per symbol.

9. ## Fastest hash function

Hi Cyan,

if you find it useful, check out
FNV1A_Hash_WHIZ. These days I am interested in the fastest TEXT (alpha) hashers; in fact a lot of things must be learned, like the graphs mentioned by Sebastian.

Code:
```UINT FNV1A_Hash_WHIZ(const char *str, SIZE_T wrdlen)
{
const UINT PRIME = 1607;

UINT hash32 = 2166136261;
const char *p = str;

for(; wrdlen >= sizeof(DWORD); wrdlen -= sizeof(DWORD), p += sizeof(DWORD)) {
hash32 = (hash32 ^ *(DWORD *)p) * PRIME;
}
if (wrdlen & sizeof(WORD)) {
hash32 = (hash32 ^ *(WORD*)p) * PRIME;
p += sizeof(WORD);
}
if (wrdlen & 1)
hash32 = (hash32 ^ *p) * PRIME;

return hash32 ^ (hash32 >> 16);
}```
An excerpt from http://www.strchr.com/hash_functions follows:
Results on a large data set (list of all words in English Wikipedia, 12.5 million words, from the benchmark by Georgi 'Sanmayce'):
Core i5 processor:
Execution time, then the number of collisions in square brackets.
Execution time is expressed in thousands of clock cycles (a lower number is better).

Hashing words from wikipedia-en-html.tar on 3 Intel powered PCs: currently I have no access to AMD CPUs, so here are 3 Intel CPUs under fire: Subnotebook Toshiba PC, Notebook Toshiba PC, Desktop PC.

[Subnotebook Toshiba PC, Intel Atom N450:]

Code:
```Microsoft Windows XP [Version 5.1.2600]

C:\Test\hash>dir

11/07/2010  06:47 AM            84,992 hash.exe
10/27/2010  03:48 PM       146,973,879 wikipedia-en-html.tar.wrd

C:\Test\hash>hash wikipedia-en-html.tar.wrd
33554432 elements in the table (25 bits)
Sixtinsensitive+:   16153226  16031348  16039844  16048697  16026421|  16026421 [2251734] !FASTEST!
Hash_Sixtinsensitive:   16501364  16504662  16506868  16509768  16515127|  16501364 [2139242]
Whiz:   16081498  16080691  16080162  16079881  16085656|  16079881 [2189360] !Second-to-FASTEST!
Bernstein:   16222890  16208653  16195534  16205258  16207576|  16195534 [2074237]
K&R:   16181141  16153089  16147347  16144145  16144279|  16144145 [2083145]
x17 unrolled:   16196693  16200720  16203124  16195429  16196980|  16195429 [2410605]
x65599:   17213371  17215886  17211092  17225140  17218658|  17211092 [2102893]
FNV-1a:   17648084  17649994  17644643  17654437  17650937|  17644643 [2081195]
Sedgewick:   17715483  17711538  17706339  17709057  17711857|  17706339 [2080640]
Weinberger:   19795514  19798517  19792040  19783954  19796407|  19783954 [3541181]
Paul Larson:   16904068  16909317  16906315  16905215  16899096|  16899096 [2080111]
Paul Hsieh:   16731679  16720594  16932967  16731003  16725432|  16720594 [2180206]
One At Time:   18317093  18329453  18319193  18325990  18318877|  18317093 [2087861]
lookup3:   16591932  16600364  16593657  16594726  16605181|  16591932 [2084889]
Arash Partow:   16790714  16782529  16783414  16782986  16791156|  16782529 [2084572]
CRC-32:   16878539  16872606  16866719  16876415  16860548|  16860548 [2075088]
Ramakrishna:   16437073  16449160  16438019  16442447  16452696|  16437073 [2093253]
Fletcher:   74995725  74991965  74983105  75027126  75020605|  74983105 [9063797]
Murmur2:   16742975  16730183  16736149  16741761  16735690|  16730183 [2081476]
Hanson:   17286767  17288260  17273143  17280187  17371322|  17273143 [2129832]
Novak unrolled:   60320056  60322432  60315223  60308709  60325221|  60308709 [6318611]
SBox:   17677039  17685532  17686854  17681897  17680986|  17677039 [2084018]
MaPrime2c:   19286056  19296348  19307574  19294326  19295151|  19286056 [2084467]

C:\Test\hash>```

[Notebook Toshiba PC,
Intel Pentium T3400:]

Code:
```D:\_KAZE_new-stuff>hash wikipedia-en-html.tar.wrd
33554432 elements in the table (25 bits)
Sixtinsensitive+:   12058258  11973671  11953887  11970400  11973268|  11953887 [2251734]
Hash_Sixtinsensitive:   12387803  12368726  12437268  12419321  12367736|  12367736 [2139242]
Whiz:   10594894  10600128  10582191  10593562  10580720|  10580720 [2189360] !FASTEST!
Bernstein:   12157227  12160397  12163431  12152197  12206038|  12152197 [2074237]
K&R:   12113889  12062050  12066080  12053732  12058839|  12053732 [2083145]
x17 unrolled:   11831265  11851909  11849651  11834030  11842581|  11831265 [2410605] !Second-to-FASTEST!
x65599:   12015548  12059651  12043365  12003264  12014828|  12003264 [2102893]
FNV-1a:   12512030  12476285  12472843  12464470  12481049|  12464470 [2081195]
Sedgewick:   12423386  12430959  12446600  12481711  12417274|  12417274 [2080640]
Weinberger:   15886398  15882493  15873672  15871503  15887145|  15871503 [3541181]
Paul Larson:   11865686  11875179  11880320  11930336  11883977|  11865686 [2080111]
Paul Hsieh:   12743450  12720752  12740266  12722144  12733419|  12720752 [2180206]
One At Time:   13025820  13037695  13023502  13036316  13087593|  13023502 [2087861]
lookup3:   12513213  12510938  12517817  12521024  12528520|  12510938 [2084889]
Arash Partow:   12616358  12613361  12610421  12609956  12596411|  12596411 [2084572]
CRC-32:   12407442  12316574  12336861  12333267  12332221|  12316574 [2075088]
Ramakrishna:   12281227  12301150  12289356  12282034  12288241|  12281227 [2093253]
Fletcher:   50147553  50090287  50105264  50208280  50116095|  50090287 [9063797]
Murmur2:   12188900  12187655  12192879  12190325  12256534|  12187655 [2081476]
Hanson:   12475359  12463011  12446859  12444451  12445178|  12444451 [2129832]
Novak unrolled:   37262357  37364140  37383476  37304713  37319954|  37262357 [6318611]
SBox:   12414330  12413821  12512329  12421929  12411736|  12411736 [2084018]
MaPrime2c:   12973562  12980602  12983752  12983859  12987014|  12973562 [2084467]

D:\_KAZE_new-stuff>```

[Desktop PC,

Code:
```Microsoft Windows [Version 6.1.7600]

D:\TESTS\hash>hash wikipedia-en-html.tar.wrd
33554432 elements in the table (25 bits)
Sixtinsensitive+:    9053718   8989266   8951519   8952219   8949380|   8949380 [2251734]
Hash_Sixtinsensitive:    9269264   9264150   9263994   9263761   9263358|   9263358 [2139242]
Whiz:    7865326   7856200   7859796   7866475   7852540|   7852540 [2189360] !FASTEST!
Bernstein:    9081460   9069960   9068900   9070469   9079405|   9068900 [2074237]
K&R:    8993801   8992382   8990556   8983363   8986733|   8983363 [2083145]
x17 unrolled:    8682855   8681626   8692004   8684190   8682793|   8681626 [2410605] !Second-to-FASTEST!
x65599:    8793795   8792213   8791230   8790913   8801408|   8790913 [2102893]
FNV-1a:    9391666   9382336   9379497   9379865   9389164|   9379497 [2081195]
Sedgewick:    9091018   9085970   9085732   9094216   9097600|   9085732 [2080640]
Weinberger:   11999347  11988551  11987173  11995011  11991473|  11987173 [3541181]
Paul Larson:    8788831   8797335   8784061   8788220   8785700|   8784061 [2080111]
Paul Hsieh:    9470151   9474939   9471454   9470880   9483763|   9470151 [2180206]
One At Time:    9856512   9865262   9853988   9864387   9856661|   9853988 [2087861]
lookup3:    9346552   9341545   9346329   9338833   9346559|   9338833 [2084889]
Arash Partow:    9341715   9323081   9332539   9338738   9333211|   9323081 [2084572]
CRC-32:    9159684   9134865   9138729   9155572   9155020|   9134865 [2075088]
Ramakrishna:    9183066   9178082   9191092   9184270   9187847|   9178082 [2093253]
Fletcher:   31111801  31116622  31094619  31055130  31036218|  31036218 [9063797]
Murmur2:    9013410   9011146   9010973   9016315   9016830|   9010973 [2081476]
Hanson:    9275481   9276946   9284989   9279977   9281620|   9275481 [2129832]
Novak unrolled:   17390159  17391792  17387836  17389504  17391874|  17387836 [6318611]
SBox:    9318679   9317361   9316436   9316997   9319220|   9316436 [2084018]
MaPrime2c:    9819245   9818182   9820444   9816287   9817004|   9816287 [2084467]

D:\TESTS\hash>```

10. Thanks for info,

Indeed, Shelwien, i will look into this. Using a well established BWT sorter such as libdivsufsort can provide quick interesting results. I have a few algorithms to try out first, in order to compare them.

11.5x = 1.5x for dictionary, 8x for bt, 2x for hash heads
1.5x for dictionary ? quite surprising, why not just 1x ?

Regards

11. Originally Posted by Cyan
Thanks for info,

Indeed, Shelwien, i will look into this. Using a well established BWT sorter such as libdivsufsort can provide quick interesting results. I have a few algorithms to try out first, in order to compare them.

1.5x for dictionary ? quite surprising, why not just 1x ?

Regards
IIRC, it's an implementation choice for the ring buffer (sliding LZ window). LZMA reads a couple of bytes and does not fill the window until that amount of bytes is consumed. So, a "safe area" must exist. This is why it needs some extra size. Maybe Bulat has a better explanation.

12. Originally Posted by osmanturan
IIRC, it's an implementation choice for the ring buffer (sliding LZ window). LZMA reads a couple of bytes and does not fill the window until that amount of bytes is consumed. So, a "safe area" must exist. This is why it needs some extra size. Maybe Bulat has a better explanation.
As far as I know maximum match length in LZMA is something around ~272.
A sliding window does not need a safe area. This is just for faster string compare.
You insert bytes at window[ (windowpos+maximumlen) % windowsize].
If windowpos+maximumlen>=windowsize you mirror the byte at window[ (windowpos+maximumlen)],
thus avoiding a % or & operation (if windowsize is a power of 2) if you do string-compares.

So this should not be the reason.
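One common way to implement the mirroring Sebastian describes is to duplicate the first maximum-match-length bytes of the ring buffer past its end, so string compares never need a wrap-around mask. A sketch (sizes and names are illustrative; this is not LZMA's actual code):

```c
#define WLOG   16
#define WSIZE  (1 << WLOG)
#define MAXLEN 273                  /* illustrative maximum match length */

static unsigned char window[WSIZE + MAXLEN];   /* ring buffer + mirrored tail */

/* Insert one byte at ring position pos; the first MAXLEN bytes are
 * mirrored past the end, so a memcmp starting near WSIZE-1 can run
 * straight ahead without any % or & per byte. */
static void put_byte(unsigned pos, unsigned char c)
{
    pos &= WSIZE - 1;
    window[pos] = c;
    if (pos < MAXLEN)
        window[WSIZE + pos] = c;    /* mirror the low bytes at the top */
}
```

The cost is MAXLEN extra bytes and a second store for a small fraction of positions, in exchange for branch-free string compares across the wrap point.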

13. because it's sliding window. i think it makes match searching faster than with cyclic one

14. Originally Posted by Sebastian
As far as I know maximum match length in LZMA is something around ~272.
A sliding window does not need a safe area. This is just for faster string compare.
You insert bytes at window[ (windowpos+maximumlen) % windowsize].
If windowpos+maximumlen>=windowsize you mirror the byte at window[ (windowpos+maximumlen)],
thus avoiding a % or & operation (if windowsize is a power of 2) if you do string-compares.

So this should not be the reason.
Well, it seems you misunderstood. I didn't say "a sliding window requires a safe zone". As I mentioned earlier, it's just an implementation choice. You can verify this implementation side-effect by observing LZMA decompression (it flushes some amount of data at some interval, IIRC 8 MiB for decompression). It's also true for compression. IMO, this implementation gives us two advantages:
1- Simpler match finding (IIRC, the match finder does not check bounds at each step)
2- Better I/O handling.
BTW, IIRC again, I've implemented both of them and the single cyclic version was slower.

15. In a will to investigate the presented MMC search algorithm a bit more, i've upgraded the initial implementation by introducing two major improvements : segment detection, and multiple-level promotions.
While the first is a rather straightforward optimization for streams of identical characters, the second is a more complex refinement, removing a limitation of the initial MMC, which would promote elements by only one level per pass.
"Level L" means at least L bytes are guaranteed to be exactly the same as the current pattern.
By allowing elements to be upgraded immediately to their best possible level, it considerably reduces the number of comparisons needed for the next match search.

The main driver behind MMC is that data gets sorted during search. As a consequence, data that is never searched never gets sorted. With a "greedy match" search strategy, this happens quite often, and therefore saves many useless comparisons.
This is in contrast with BST, which requires data to be sorted on insertion.

Experimental results will tell more than words, so here they are.
As previously, i'm counting the number of comparisons required to do a full match search over a given window size.

Window Search 64K :
(results table missing)

Window Search 512K :
(results table missing)

Window Search 4M :
(results table missing)

The benefit is now much better. Firefox results at 4M are particularly good, considering that nearly 30% of the remaining comparisons are hash collisions.
On the other hand, Enwik8 results are a bit disappointing, which seems due to the presence of many valid small-length matches. For example, MMC needs to find at least one match of length 6 to eliminate all matches of length 5 and less from the search chain. Sometimes, it takes many comparisons before finding such an occurrence, and the algorithm spends a lot of time at length 4, then at length 5, ...

This version of MMC is implemented in a version of LZ4 called "HC", for High Compression. LZ4 is a very simple LZ77 compressor, which ensures that 99% of CPU cycles are spent in the match searching loop. The current speed of LZ4-HC, and therefore of MMC, is getting close to 40MB/s on my test system (Core 2 Duo). Not "extremely" fast, but this is nonetheless a complete full search, with no early exit to speed up results.

At this stage, i have no further ideas to beef up MMC, and would therefore like to test it in a head-to-head comparison with BST.
Greedy match strategy seems to be the terrain on which MMC stands a chance. With an optimal parsing strategy, however, i guess BST can only win, since all positions are "searched", which is the basic behavior of BST insertion.

Although i should write my own BST implementation, in order to learn, i may end up with a subpar implementation, which would produce biased results. In order to achieve a respectful comparison, i'd prefer to use an open-source one, if such a thing exists.
I myself tend to completely "entangle" the search algorithm within the main compression loop, making it impossible to extract and reuse in another context.
Reading a "clean" implementation would also serve as a good practice example of writing code in a way that makes it reusable by others, something i've struggled to do properly up to now.

 http://fastcompression.blogspot.com/...erimental.html

16. I understand the appeal of building your own compressors to test your own match finder ideas, but as you point out, that might not be a fair test.

Freearc has pluggable matchfinders (Compression\LZMA2\C\ e.g. picked in C_LZMA.cpp and implemented e.g. in LzFind.c)

You could implement yours in there, which would give you proper comparable results to extremely solid versions of the various alternatives. Very benchmarkable!

17. I have some idea for optimization, don't know if it'll work.

Suppose we're following the bad links, eg: the current prefix is "blaha", and we've successfully matched "blah" against "blahb" in the dictionary. Then we follow bad links and encounter the strings "blahc", "blahd", "blahb", "blahe", etc. When we encounter the second "blahb", we're sure it is purely useless at this point and for future searches (is it? I'm not 100% sure, but quite confident), so we can update the bad link of "blahd" to point directly to "blahe". Or maybe you do that already. The second "blahb" will remain reachable, because when we're searching for "blahb" we'll use the promotion link at the first "blahb", which will point to the second "blahb".

18. Hi Piotr

Yes, you are fully right.
That's indeed an idea i also had in mind initially, somehow discarded at implementation time, and finally forgot. First because i'm lazy, but also because i feared that handling this situation would cost more than it would gain.

I want to share the description of the algorithm i had in mind, so that you can judge. Indeed, maybe something better could be invented.

1) Start with a clean table of 256 characters at the beginning of the level L chain, to track character occurrences
2) Test the next position in the chain. If the tested character (blah"a") is good, proceed as usual.
3) If it is not (blah"b"), register the character into the table.
4) While registering the character, check if it was already present in the table. If it was not (blah"c", blah"d", ...), just continue.
5) If it was (blah"b" again), then :
- remove the current position from the current level L chain (blah"d" now points to blah"e")
- link it to the previous occurrence of this character (the first blah"b"), which is either a gateway (first occurrence) or already at level L+1 (all other occurrences). The position is now at level L+1 (L+1 relative to "blahb").
6) Continue
7) Whenever conditions are met to enter level L+1 (a good "blaha" position tested, a promotion link to level L+1 available), clean the table.
Just start again from step 1.

And that's it. I believe it would work, but, obviously, it would cost more CPU too. The table cleaning phase (256 entries) was particularly worrying : as it stands, encountering a "bad" character is much more common than finding a good one, so this table cleaning is bound to happen very, very often. I just feared it would kill performance.
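For what it's worth, the occurrence-tracking part of steps 3-5 can be sketched in a few lines (illustrative C; the names are hypothetical, and the actual unlinking of the chain is left to the caller):

```c
#include <stdint.h>

/* Per-level tracking table for the pruning idea above: for each "bad"
 * character seen while walking a level-L chain, remember the last chain
 * position where it occurred. A repeat means the older occurrence can be
 * unlinked from level L and promoted toward level L+1. */
static int32_t lastSeen[256];

static void table_clean(void)
{
    for (int i = 0; i < 256; i++) lastSeen[i] = -1;   /* the feared 256-entry wipe */
}

/* Record that byte c was seen at chain position cur.
 * Returns the position of its previous occurrence at this level,
 * or -1 if this is the first one. */
static int32_t register_bad(unsigned c, int32_t cur)
{
    int32_t prev = lastSeen[c];
    lastSeen[c] = cur;
    return prev;
}
```

The caller unlinks `cur` from level L whenever `register_bad` returns a valid previous position, exactly as described in step 5.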

Now i may be wrong, either on performance assumption, or on algorithm itself. If you believe this can be improved upon, please feel free to comment.

@willvarfar : thanks for hint.
Yes, this is a good point; instead of integrating another match finder into LZ4, i could also try to integrate the MMC match finder into a proven, stable foundation such as FreeArc.
I remember spending a bit of time last year reading some parts of FreeArc, and it was one of the most readable and well-commented source codes out there. I could only wish to reach such a level of clarity.
It's still some work, though, to get accustomed to another environment and properly plug the code where it should go. Anyway, maybe worth a try...

Best Regards

A working demo of MMC with the proposed multiple-level promotion improvement is available on the website : http://fastcompression.blogspot.com/p/lz4.html

19. If I understand correctly, you operate on a single global 256-record table.

Observation no 1:
Instead of cleaning the table when entering consecutive levels, just tag every record with a pair (counter, level). Counter is the number of the match-finding iteration, ie. if you search matches for all positions then it'll be equal to the position in the file. Then, instead of cleaning the table, just compare the pair (current counter, current level) with the pair from the table, and when they're different, treat the record as empty.

Observation no 2:
We must keep the last occurrence registered for each symbol, to enable efficient execution of step 5, ie. without rescanning the chain to find the previous position carrying the current bad character. Probably you've thought of that.

Observation no 3:
We must keep track of whether a symbol was registered zero times, once, or more than once. If it was registered zero times then just update the table; if it was registered once then also update the promotion link; but if it was registered more than once then we must update the bad link instead of the promotion link. I hope you understand that.

This all only causes that 256-record table to grow in size a few times, but it probably still won't require more than a few kilobytes.

Hope I'm right

20. wow, thanks, there are many good ideas in your post Piotr.

Originally Posted by Piotr Tarsa
Observation no 1:
Instead of cleaning table when entering consecutive levels, just tag every record with a pair of (counter, level).
Yes, that's an excellent idea.
Indeed, just comparing the current level with the one stored in the table is enough to know if the character was already seen at the current level. This avoids any reset after promotion, making the search much faster.

An issue remains. It works well within a single search, since the level can only go up. But when we start a new search, there is a risk that a byte registered, for example, at level 4 during the previous search appears again at level 4 during the next search. In this case, the algorithm will incorrectly believe that the character was already seen, and link it to the previous occurrence, which would produce corrupted chains.
The easy way to solve this is to reset the table between each search.

Granted, that is already a great improvement compared to one reset per promotion. But still, with one search per match and per literal, it means something like 20 million searches for enwik8, hence 20 million resets. This might still be too costly.

Building on your idea of a counter, we can simply use a rolling counter, independent of the current level. It increases by one with each promotion. OK, an additional variable is not the best for performance, but this should remain manageable.
The only issue is to make sure that the rolling counter does not wrap around to its initial value too fast, otherwise there is still a risk, even if low, that a character is incorrectly believed to have already occurred. A large enough counter should prevent such a risk.

Originally Posted by Piotr Tarsa
Observation no 2:
We must keep the last occurence of hashes registered at each symbol to enable efficient execution of step 5, ie without rescanning the chain to find the previous hash in chain with current bad character. Probably you've thought of that.
Yep. Simply keep it in the table, with the character as index.

Originally Posted by Piotr Tarsa
Observation no 3:
We must keep track if a symbol was registered zero times, once or more than once. If it was registered zero times then just update the table, if it was registered once then update also promotion link, but if it was registered more than once then we must update the bad link instead of promotion link. I hope you understand that.
This one is simpler than it sounds.
Instead of keeping an occurrence, i'm keeping a pointer to the link which needs to be updated. This way, the question of 'promotion link' vs 'continuation link' is solved with no extra check.
This trick is already in use within MMC.

OK, with all these new ideas, it seems possible now that your improvement generates more payoff than costs. And it seems to be worth a try.

[Update]
Mmmh, there is still a risk, which is actually worse than i thought initially.

In the current algorithm, i'm leaving the current level L when i find a valid promotion link. By construction, this guarantees that the character i'm looking for is no longer present in the current level.
Now, what happens with the proposed new strategy ?
The first time i see character "b", i keep its promotion link in the table, for further update if necessary.
If this promotion link value is different from zero, it means the current position is already a gateway to its own level L+1. As a consequence, i will not find any more "b" at the current level.
The issue therefore starts with a promotion link of value zero, which means that this position has not been tested yet. In that case, some "b" may still be found at the current level.
Now imagine there is a second character "b".
The promotion link of first "b" is now updated to point to the second "b". Which means the value of promotion link is no longer zero.
Now imagine i'm finding the character i'm looking for ("a") with a valid promotion link. I should now leave level L and start on the new level L+1.
But, maybe there is still another "b" behind this "a"...

This is a real issue, because next time i look for "b", i will leave the current level L at its first occurrence, since it contains a valid promotion link. I will therefore miss the third "b" of the level L+1 chain, and all others which follow.

To counter this problem, an obvious solution is to continue along the level L chain, up to its very last element. It solves the problem, but obviously results in many more positions to test, and will therefore erase some if not all of the gains.

Another idea, much more difficult, would be to ensure that, by construction, there is no risk of forgetting a "b" behind the searched "a" character. Unfortunately, this is not trivial, and i'm not even sure it is possible.

Maybe another, completely different way needs to be found.

[Edit 2] another idea is for the second "b" to still point to level L with its 'continuation link'. Unfortunately, it would mean that the current level is no longer guaranteed. I don't think the rest of the algorithm would like it.

[Edit 3] maybe i incorrectly understood your Obs.n1. You mention keeping a level tag for each "record". If by "record" you mean position, then this is, in my implementation, not necessary, since the current level is implicit.
Now, if we cross this idea with "edit 2", it means we can actually go back by one level while searching, since the current level is no longer "implicit", but directly read from the tag.
It seems to solve the issue. The drawback, however, is more memory and more updates. Constantly updating the level tag is bound to cost some performance. It is unclear at this stage whether this can pay off through the better sorting which can be achieved.

[Edit 4] I finally properly understood your idea of the (counter,level) pair. This is a very good solution that completely solves the "reset" issue. It competes with the rolling counter, which is just a variation of the same idea. Not sure which of the two is best; they are probably very comparable.
Now, the "unfinished sort" issue mentioned in "update" paragraph still needs to be fixed...

Well, still some food for thoughts...

21. Well, if I understand you properly, then in the "update" paragraph you're talking about a situation where you don't execute a search on every position in the file, ie. you're doing greedy parsing or something like that. If you do a search on every position then the problem shouldn't exist.

I can't understand how there can be a third "b" on some level at any point in the process; it should be linked by a continuation link on an upper level. Maybe if you make some illustration I will understand what you mean.

Regarding counters:
If the counter overflows or gets back to its initial value in any other way, then we should do a full reset of the table. As clearing the table is rather cheap, even 8-bit counters could be enough. They will require a full table reset every 256 searches.

22. Hi Piotr

Well, some situations are pretty difficult to describe, as they result from several corner cases; but they do happen, over the several million sequences to analyze.

Here is a simple enough example :

The green cells, in the second part, are the ones affected during the search for Blaha.
As you can see, the gateway position "blahb" becomes the first level 4 element, when blaha becomes the gateway.
What's important to catch is that "b" is not the character being looked for.
Therefore, blahb is just inserted into level 4, but not dealt with. Its promotion link remains initialized to zero.
This situation is pretty important to keep in mind, as it creates all sort of side effects later on.
As you see in this example, we reach a valid promotion for "a" to level 5 before we meet "b" again.
So, there is no opportunity for the "Blahb" sequences to be linked to each other during this search.

However, in this first example, it's not really a problem. The final status is still "valid".

What we absolutely want to avoid is a situation like the one below :

Here, the result is terrible : on insertion of Blaha, the first 2 "blahb" positions are linked to each other, and we never reach the 3rd one.

Now the question is : is this situation guaranteed to never happen ?
I know the situation described on the first (black) line above can and does happen with the current version of MMC,
but with secondary sorting enabled, it's not so clear.

If the counter overflows or gets back to its initial value in any other way, then we should do a full reset of the table.
Very good point ! Thanks for advice.

23. Actually, i think this "forbidden" situation can happen, unfortunately.
Here is, below, an example :

It's really complex, so it deserves some careful explanation.

First, we make the hypothesis that data is inserted at level 2.
from most recent to oldest :
P4 : Blahb : level 2 and gateway to level 3
P3 : Blahb : level 3 and gateway to level 4
P2 : Blaha : level 4 and gateway to level 5 "Blaha"
P1 : Blahb : level 4 and gateway to level 5 "Blahb"

now we try to insert P5 : Blaha.
This is the situation described at line 2.
Here is the detailed algorithm behavior :

1) P5: Blaha : inserted at level 2, and gateway to level 3.
2) Since P4: Blahb has 3 characters in common, it gets promoted to level 3.
3) Since P4: Blahb has 4 characters in common, it also becomes gateway to level 4.
4) Since P3: Blahb has 4 characters in common, it gets promoted to level 4.
5) Since P3: Blahb does NOT have 5 characters in common, its promotion link remains at zero.
6) P3 is now remembered as the first "b" position for the current level, for further linking, if detected.
7) We now reach P2: Blaha.
8) Since P2: Blaha has 5 characters in common, we use its promotion link.
9) We never reach P1: Blahb. Therefore, the promotion link of P3: Blahb remains at zero.

At this stage, there is still no problem. Not yet, in fact, because on the next insertion ....

Now we are inserting P6:Blaha.

1) P6: Blaha : inserted at level 2, and gateway to level 3.
2) Since P5: Blaha has 3 characters in common, it gets promoted to level 3.
3) Since P5: Blaha has 4 characters in common, it also becomes gateway to level 4.
4) Since P4: Blahb has 4 characters in common, it gets promoted to level 4.
5) Since P4: Blahb does NOT have 5 characters in common, its promotion link remains at zero.

Now we are in the situation represented on line 3.
It is the forbidden situation described in the previous post : it will create a link between P4 and P3, but will never reach P1, therefore cutting these level 5 positions out of the chain.

Hope the explanations are clear; otherwise, don't hesitate to ask or challenge.

It is also possible that a small change to the algorithm could solve or entirely avoid this problem.

24. Here's my response to your post at 19:58 (there were problems with posting before):

It seems that on level N, the (N+1)th character of the gateway string can sometimes be found in the chain. That's visible in the initial situation in the first figure (the right "blahb" is a gateway, the left "blahb" is another occurrence of the same string). When inserting/searching "blaha", the right "blahb" ceases to be a gateway. In that case we must search the chain for another "blahb", and if we find it, we must make the right "blahb" a gateway with a link to the left "blahb", and we must unlink the left "blahb" from the current level chain.

When we make the right "blahb" a gateway, we must repeat the process recursively. Recursion stops if the (N+1)th character of the gateway is not found in its chain - there will be no further promotion to gateway, so no rescanning is needed.

If we do that, then the situation in the second figure should be impossible to encounter.

It seems to be an easy fix, but it will be somewhat costly. I don't know if it fixes all problems with the scheme we're working on, or if it will ruin performance.

Alternatively, we can squeeze a flag into the chain that tells us if the (N+1)th character of the gateway string can be found in the current level chain, but it shouldn't make much difference.

Well, it's getting complicated

I'm getting to sleep. Maybe I will think about MMC tomorrow

25. Originally Posted by Piotr Tarsa
It seems that on level N, the (N+1)th character of the gateway string can sometimes be found in the chain.
Yes, exactly

26. I don't know why but I can't post the full text on this forum. It gets cut. I'll try later.

27. My point about adding this to FreeArc is that there are a bunch of different hash chain, hash table, and tree structures in there, so direct comparisons are possible. Until you wipe the floor with the best Igor and Bulat have written, they won't believe you can

28. I'm constantly trying to post full text, but I'm failing :/

I have idea for multiple promotions:

If we're on level N, and we have a gateway on character 'a' to the upper level, but on the upper level all strings have at least N+M common characters, then we can do a promotion of M chars instead of 1. That would be equivalent to path compression in tries.

If we're on the same level N, and we're adding a string on character 'a', but it doesn't have at least N+M common characters with those on the upper level, then we must:
- reduce the promotion level on the current gateway,
- make a multiple-promotion gateway on the upper level, for those strings that have at least N+M common characters,

Now you're using 8 bytes per position (ie. a continuation link and a promotion link). If you use only 28 bits of each double word to store pointers, then you can split an 8-bit number and store it in the upper bits of those double words. 8 bits should be enough for multiple promotion: they allow a range of 0..255, but since we always promote by at least one character, that maps to a range of 1..256.

I see a problem with match finders and sliding/circular windows. If we're overwriting a position in our circular window then we can have a link to that position somewhere in our matching structures. The solution would be to prune the links from time to time, ie. prune the links pointing to the oldest 1/4th of the buffer and then slide the window. But this way the effective size of the dictionary would be reduced by 1/4.

29. If my ideas for duplicate removal and multiple promotions work, then MMC should have only slightly higher computational complexity than tries/radix trees, but it will be much more cache-friendly, as scanning memory is cheaper than accessing nodes scattered through the entire memory.

I think you should try reversing the entire MMC, so that instead of scanning memory backwards you scan memory forwards. Modern processors perform better at forward prefetching than backward prefetching.

You should also try that repetition flag I've mentioned, and reduce the promotion length to 7 bits at the same time (to keep all bits used).

Has anyone compared BSTs to tries/radix trees? I think tries should win

30. Wow, so many good points, as usual Piotr; i'll take some time to try answering them.

When inserting/searching "blaha", the right "blahb" ceases to be a gateway. In that case we must search the chain for another "blahb", and if we find it, we must make the right "blahb" a gateway with a link to the left "blahb", and we must unlink the left "blahb" from the current level chain.

When we make the right "blahb" a gateway, we must repeat the process recursively. Recursion stops if the (N+1)th character of the gateway is not found in its chain - there will be no further promotion to gateway, so no rescanning is needed.

If we do that, then the situation in the second figure should be impossible to encounter.

It seems to be an easy fix, but it will be somewhat costly. I don't know if it fixes all problems with the scheme we're working on, or if it will ruin performance.
Exactly !
And in fact, due to the recursive nature of this update, we end up updating the whole tree.
OK, each update is smaller since the tree is constantly kept up to date, and this helps make future searches faster.
Still, there is no guarantee that the gain will offset the cost.
On top of that, we would have to add positions one by one, as for a BST, not "several in a row" as i'm doing currently with greedy matches.
My assumption, up to now, was that the cost would become too high.

Alternatively, we can squeeze a flag in chain that tells us if (N + 1)th character of gateway string can be found in the current level chain, but it shouldn't make much difference.
Not sure i understand this one.
But it reminded me of an alternative, which is not only to track when a new character is present and needs a link, but also to ensure we properly close the search at the current level.
It works like this :

- If we find a new character ("b"), we first check if it is a gateway.
==> If it is, then we don't need to track this character anymore, as it is the last one at this level.

- If not, we add it to the table, and also count it into a kind of "reminder value".
==> The "reminder value" is the count of all characters in the table that still need to be completed.

- When we find the gateway of a character in the table ("b"), the gateway is promoted, the character becomes the new gateway, and is removed from the table. It is also removed from the "reminder value".
==> We can also have more complex situations, with multiple "b" characters found before reaching the gateway; it does not change the principle.

- When we find the character we are looking for ("a"), we first check the "reminder value". If it is zero, then no unfinished job remains, so we can level up.
==> We can also have more complex situations, with multiple "a" characters found before reaching the gateway; it does not change the principle.

- If it is not zero, then we have to continue on the current level, until either :
=> we find all the necessary gateways, so the "reminder value" becomes zero, or
=> we reach the end of the chain (window size limit).
Afterwards, we resume our search on the next level.

The idea is that the number of additional searches is limited to current level, and therefore may not grow too large.
It is still unclear if it is a win, but it seems a bit more likely.

If we're on level N, and we have a gateway on character 'a' to the upper level, but on the upper level all strings have at least N+M common characters, then we can do a promotion of M chars instead of 1. That would be equivalent to path compression in tries.
Yes, good point, like a radix tree.

I thought about it in the beginning, but quickly dismissed the idea, on the grounds that it would cost too much for too little.
The point is that we have to take into consideration the associated complexity, such as checking and modifying the level-up value, storing it, reacting to changes (deletions, insertions), etc.
On top of that, i suspected such multi-level situations would be pretty rare, essentially a symptom of very few remaining positions to test in the first place.

I haven't made proper statistics on it up to now; however, my available traces seem to indicate that they are, indeed, rare.

You can prove me wrong though.

I see a problem with match finders and sliding/circular windows. If we're overwriting a position in our circular window then we can have a link to that position somewhere in our matching structures.
Yes, but this is not an issue, since all links always point backwards : when a position is overwritten, everything it links to is even older, hence also outside the window.
Delete is in fact completely free, as for a classic hash chain.

On the other hand, it completely forbids optimizations which involve shuffling pointers around. Indeed, i thought of intentionally pointing the gateway to the first instance which shares the same next byte. This would have helped tremendously for the next promotion. Unfortunately, if the destination falls outside of the window, we lose the rest of the chain.

The quick fix would be an additional "parent" pointer, but then the memory cost becomes much higher (12 bytes per position instead of 8) and the delete operation is no longer free.

If my ideas for duplicate removal and multiple promotions work, then MMC should have only slightly higher computational complexity than tries/radix trees, but it will be much more cache-friendly
Indeed, as suggested by Piotr, MMC looks like a distant cousin of the radix tree.

Talking about cache-friendliness, could a cache-conscious radix tree be called a Judy tree ?

I think you should try reversing the entire MMC, so that instead of scanning memory backwards you scan memory forwards. Modern processors perform better at forward prefetching than backward prefetching.
Right. However, we are still making some big jumps through memory, so even if these jumps are forward, they remain a problem for prefetching.

Currently, my guess is that a cache-friendly search algorithm would need a completely different layout, ensuring that as many checks as possible remain within a single cache line, and occasionally the next one, with prefetching.

Then, it would also mean a completely different search algorithm ...

It seems this is a theme with many ideas still left to try out
