http://encode.su/threads/15-7-Zip?p=...ull=1#post7997 has pseudo-code for your HT4, Bulat; is it accurate?
From that post:
Code:
int bestMatchDist = 0;
int bestMatchLen = 0;
HashEntry entry = &hashtable[context];
for (int i=0; i<16; ++i)
{
int pos = entry->Positions[i];
int len = CheckMatch(pos);
if (len > bestMatchLen)
{
bestMatchLen = len;
bestMatchDist = CurrentPosition - pos;
if (bestMatchLen >= MAX_MATCH_LENGTH)
break;
}
}
Now imagine that you used the HT of the next key too. Imagine you have, say, 3 bytes hashing and you're looking for a match of at least 5 bytes. Split this needle into two hash lookups, e.g. the string "crazy" would be split into [cra] and [azy].
Consider this following code and how much nicer may potentially be to the cachelines:
Code:
int bestMatchDist = 0;
int bestMatchLen = 0;
HashEntry* entry1 = &hashtable[context];
if(!entry1)
goto give_up_searching;
// try for 6-or-longer matches first, falling back to 5 then 4 and so on
for(int hash_offset = HASH_LENGTH*2;
!bestMatchLen && ((hash_offset-HASH_LENGTH) >= MIN_MATCH);
hash_offset--) {
if(HashEntry* entry2 = &hashtable[context+hash_offset]) {
for(int i=0, j=0; i<entry1->num_matches; i++) {
// slide along the second entry
while(entry2->pos[j] < entry1->pos[i]) {
j++;
if(j >= entry2->num_matches)
goto give_up_seaching;
}
if(entry2->pos[j] == entry1.pos[i] + hash_offset) {
// we have a match that is HASH_LENGTH + hash_offset, i.e. 5
if(hash_offset == HASH_LENGTH*2) {
// match can be longer since we haven't shorted our hash_offset for a retry
// this is where we go stall to get the buffer
int len = CheckMatch(...);
if(len > bestMatchLen) {
...
}
} else {
bestMatchLen = HASH_LENGTH + hash_offset;
bestMatchDist = ...
goto give_up_searching;
}
}
}
}
give_up_searching:
...