what ypu think about sorting of data block (like BWT does) in order to find matches? this technique looks crazy, but it was successfully used in IMP. and it was very fast compressor
what ypu think about sorting of data block (like BWT does) in order to find matches? this technique looks crazy, but it was successfully used in IMP. and it was very fast compressor
I tried this out once.
It has some advantages over BWT-sorting. You don't need to maintain a full comparison - e.g. if the first 256 bytes of two strings are equal then the two strings are equal (for BWT this limit is buffersize). Still, you'd have a gurantee to find every maximum match upto 256 bytes. Additionally, you don't need to use a stable sort.
If you're looking for a best match the problems arise. In such a sorted table maximum matches are neighbours. But in general you're not looking for a maximum match, but the most cost effective one. Additionally, the sorting order makes it very expensive to find a match of length n with the best index.
aaaaa - index 10000
...
aaaay - index 1
aaaaz - index 10001
baaaa - ...
There can be hundreds of entries in between aaaaa and aaaaz. All these have a 4 byte match with aaaaz. But aaaaa has the best index. I know that this example isn't really good, but it should make the problem clear.
There might be many ways to impove this. But IMO a nice search tree or hashing table do this job just fine.
They are not a lot interested to illusory and little useful technique BWT in the how much in how much ruin structure of the data for which with some files it gets worse the situation. They are a lot interested instead to the regarding speech the system of indices LZP to outside of the BWT. I wanted to know if there is a fast way in order to discard the indices gets worse and to use a smaller number of indices? (for Christian and Bulat) Hi!!
I'm sorry Nania. I do not understand what you're asking for.
Seems to me as Nania wanted to know if there is fast way to discard unusable indexes and decrease indexes number or something similar
EDIT:After some thinking and playing with google translator I changed "matches" into "indexes"
I always apologize me for my English!. I wanted to exactly say this..
but this time use a practical example:
abaaababcaaaaa
1......xx1............
xx=index that doesn't serve us!
1 = index for the copy
In this example of mine I want to say that the 2° ab if as position was memorized it would make the smaller copy of the equal bytes. Did I want note to know if there is a some algorithm that does this in automatic or however in precise way?
christian:
i have some ideas to make bwt based pattern matching fast and efficient.
firtstly, when you have entries like:
aaa 5
aab 7
aac 90
aad 45
...
aaz 34
then you don't need all those entries but only one:
aa? 5
beacuse there's no two entries that starts with aa and have third letter the same. 5 is the index of earliest entry starting with aa.
then if you find that match, you update index to next value. in above case updated entry would look like:
aa? 7
additionally you can add skip indexes.
assume you have following entries:
a?
aa?
abc?
abcccdef?
abd?
abea?
abeb?
ac?
ad?
and you search for ad?
you start at a? . okay. then you go to aa? . you notice that it didn't enlarged your match. so you go to abc? . again no gain. so you go to abcccdef? . hey!! why we are checking entries that start with ab? ??? we can skip directly to ac? !!! so we skip and continue search. this way we found ad? quickly.
skip index points to first entry after current that has same (or less) letter in common than the previous entry.
this way you would never chceck more than (size of alphabet) * (longest match) entries.
additionally you could reorganize entries ie. move more popular prefixes before less popular ones (of course without breaking the structure).
this way bwt based pattern matching can be very fast.
additionally while walking you can remember longest matches for different distances (eg. for 8- bit distance, 9- bit distance, ..., 20- bit distance) and use that for more optimal parsing.
those techniques i described above makes bwt based pattern matching as fast as tree based pattern matching (or even faster due to linear scanning) and it can have even shorter building time (if you use some fast bwt sorting algorithm like msufsort http://www.michael-maniscalco.com/msufsort.htm ).
the disadvantages are high memory requirements and offline nature, ie. you must fetch entire block before compression and use fixed window (sliding window would be unpractiably inefficient with bwt based pattern matching).
Thank you for your interesting post, Donkey!
I still have some problems with your idea, so I just ask. The goal is to find the longest match with the best index? e.g.
00000*
aaaaa* index i
aaaab*
...
aaazy*
aaazz* <- best index
Let's say we're at index i. Obviously, we're not allowed to match index i with itself. Thus, we're looking through the neighbourhood for a longest match with best index (regarding cost). The further we go into one direction (in the neighbourhood of aaaaa) the shorter the matches (at least they don't get longer, since they're sorted).
So we start at aaaba (the direct neighbour). Here, we have a match of length 3. We know that this match is the longest we can get. But how do we reach aaazz efficiently?
hmm, i have little time now. i will explain it later.
but for now you can think about my method as a array implementation of prefix tree. skip index in my array is equivalent to right sibling in tree implementation. and compacting method i described in my first post is equivalent to removing nodes that have no leafs.
since array is built once at start and the structure is not changing during compression (only indexes are updated) you can optimize it at start once (it should be relatively quick) and gain in compression speed while compressing.
the optimization i'm talking here is equivalent to sorting childs of a particular leaf in frequency order, so more popular patterns will be found faster than less popular ones.
i've dicovered also one more trick that would save time by not needing to compare buffer at inexes (ie. don't need to perform computelongestcommonprefix(buffer[index], buffer[current]) ), but comparing only one byte per entry (this byte would be stored in array). but this will require some special structure of array (ie. in a block of entries with common prefix, pairs of entries should be sorted by size of longest common prefix - okay, i'm messed here but it can work )
Thank you Donkey! Itd be great if you use my last example.Originally Posted by donkey7
Nania Francesco Antonio
i still can't understood your question. as a variant, you can try to write it in Italiano and ask someone to translate it
of course, sorting doen's allow us to use sliding window (although we can sort 2 times more data to get this effect)
afaik, imp used suffix sort and at each step we have results analogue to STn sorting so we can easily use it to find match of length=n (just use previous string). On next sorting step we have results analogous to ST(n+1) sort so we can easily find macthes of length n+1 and so on. details was described in fido7.ru.compress 7-8 years ago by Dmitry Subbotin but i don't understood them those time
so, we see 3 schemes
1) described by donkey7 (which i don't understood)
2) imp-like
3) just make STn sort where n is minimal match len we are interested. then scab through this array searching matches among the last, say, 64 entries. this is very like to LZ77/ROLZ searching but cache misses will be much less likely! moreover, we can easily prefetch data that will be scanned a bit later. so, this scheme may completely omit cache miss expenses (except ones on sorting phase)! also, it should be easily multithreaded up to hundreds and thousands of threads (btw, are you know that $100 32-core processors start shipping this month?)
moreover, because searching at each time limited to vary small amount of strings, we can use highly complex schemes of searching speedup that require large amount of data to be saved (per each node searched), because total number of searched nodes would be anyway no more than, say, 1000-10k - old data simply must be automatically discarded (shifted out) from these search acceleration datastructures
so, final result is - we run completely inside L2 cache (or even L1). we find matches for all strings (i.e. 5-10 times more than really required we can therefore use OP in its simple form - using longest matches (not cheapest ones)
this method should be especially great for a huge amount of cores, slow memory access, binary files (because for these files percentage of strings used is bigger than for text files)
4) after ST4 sorting we scan data in usual LZ manner, but use STn-sorted data as very fast way to find strings similar to current. i.e. if Ptr is current pointer in input buffer, then ST_Index[Ptr] gives us some ST_Ptr such as ST_Strings[ST_Ptr]==Ptr and ST_Strings[ST_Ptr-1], ST_Strings[ST_Ptr-2] and so on gives us pointers to previous strings with the same first 4 bytes. not bad? it looks like current QUAD scheme but allows to scan more than 64 (or any other fixed value) of possible matches and gives us only the strings that has the same 4 first bytes - no any hash collisions!
i'm not sure but may be such scheme may be also used with BWT sorting. alternatively, one can try to organize Binary Tree (like lzma/cabarc does) inside each sub-array of nodes having the same first 4 bytes. this may allow to implement rather fast OP with weightning
i have started writing explanations to my methods. they are there: http://asembler.republika.pl/misc/bwt-lz.txt . it's not finished currently and it's not polished as it can be. i'm studying theoretical informatics at jagiellonian university, so i'm learning new algorithms, and i'm discovering more and more effective methods for pattern matching.
when i will complete course of algorithms and data structures (this is on second year) and fully understand fast bwt sorting algorithms (like msufsort or others, look at http://www.michael-maniscalco.com/msufsort.htm ) i will make some ground- breaking matching algo
i've updated my document. i will be happy if anyone comment it (especially christian).
christian:
if you're searching for aaazz then you can't go into aaaaa*. the furthest you can go is aaaa* and you see that common prefix is shorter than length of aaaa* so you go to next sibling that is aaab* and so onm till aaaz*. for details look at my above document. (next sibling has the last letter different)
Originally Posted by ChristianIn my previous post I wasnt searching for aaazz.Originally Posted by donkey7
I was searching for a longest match with the best index for aaaaa. In the example aaazz qualifies for this because its a longest match and does have the best (=cheapest) index.Originally Posted by Christian
Im sorry, I do not have enough time to read all this. But Im sure some of the other guys can give you some feedback.Originally Posted by donkey7
EDIT: Upps, I just saw that there was a tiny mistake in the example. Ive underlined the difference in this post.
christian:
if you read my document you will get the picture
firstly youre example is impossible, ie you have:
but dont have:00000*
aaaaa* index i
aaaba*
aaabb*
...
aaazy*
aaazz* <- best index
and so on.aaba*
aabb*
..
aazy*
aazz*
aba*
abb*
..
azy*
azz*
but if we convert your suffix sorted array to my preodrer binary tree resresentation pattern search array (yup, whats a long name ), it will look like:
so you will basically get three (all) cheapest matches for lengths 1, 2 and 3 by checking only one entryordinal letter compressed.nodes right.sibling
1 a 2 31
2 b 0 3
......
30 z 0 null
31 b ....
*(30 is only hand picked number here)
my pattern matching function finds cheapest for all match lengths and requires constant time (if maximum match length is constant, precisely it takes o(k) time where k is longest match at current position and alphabet size is fixed) for one execution. but you must execute it for every position in buffer if you want to get cheapest matches all the time because it updates indexes only with current position.
my searching scheme *always* (ie. if searching & updating is performed at every position in buffer) guarantees finding cheapes indexes for every match of size up to longest match in constant time. it will give same results as executing naive pattern search for every position in the buffer (it will be quadratic time complexity).
with binary trees you cant achieve such property, unless the binary tree is several times larger than buffer size (up to average_longest_match_size larger), because you cant dynamically compress nodes as i do statically when creating my repetition search array.
memory requirements for my repetition search array is 15 N:
- 4 N for last index (= cheapest index),
- 4 N for indexes (= postion in buffer,
- 4 N for right siblings,
- 2 N for compressed nodes & letters,
- 2 N for commons & suffix lengths,
(two above can be represented as union, taking consequently only 2 N space, because they are indepedent)
- 1 N for input buffer,
plus some fraction of N for sorting algorithm (or that overhead can be hold in other array, eg. last.index because its not needed at sorting stage).
I thought it was obvious that I left out those uninteresting strings.Originally Posted by donkey7
I do trust you that it works but I hope you understand that I dont have time to do this - its not really a short read.Originally Posted by donkey7