I have just created a hash formula for the Match model of Smac 1.16.
As I'm a bit lazy, I simply used 2 SSE2 instructions, Rcpps & Sqrtsd.
Surprisingly, my tests on Silesia Corpus show that, overall, it outperforms FNV-1 in terms of dispersion, but not with all files though.
I say 'surprisingly' because calculating square roots of negative values should have issued a lot of Indefinite values, and thus as many collisions. That is a mistery to be solved...
It seems to give better results on text files : on Enwik8, I achieved 21836249, against 21871800 with FNV1.

;Smac-hash, 64 bits to 32 bits
;We assume that low qword of xmm0 contains the 8 bytes to be hashed

rcpps xmm1,xmm0 ;xmm1=reciprocal of 2 low-dwords of xmm0
sub esp,8 ;allocate a qword in stack
sqrtsd xmm1,xmm1 ;xmm1=square root of low-qword of xmm1
movq [esp],xmm1 ;store result in stack
pop eax ebx ;then in eax & ebx
xor eax,ebx ;eax=32 bits hash

This second version appends a shift-right instruction before calculating the square root : the bit of sign will be erased, so xmm1 will always be positive. It gets a better result on enwik8 (2183473, but I have not tested it thoroughly yet.

;Smac-hash, 64 bits to 32 bits
;We assume that low qword of xmm0 contains the 8 bytes to be hashed

rcpps xmm1,xmm0 ;xmm1=reciprocal of 2 low-dwords of xmm0
sub esp,8 ;allocate a qword in stack
psrlq xmm1,1 ;erase bit of sign
sqrtsd xmm1,xmm1 ;xmm1=square root of low-qword of xmm1
movq [esp],xmm1 ;store result in stack
pop eax ebx ;then in eax & ebx
xor eax,ebx ;eax=32 bits hash

Also, you could use sqrtpd to work on a 128 bits value; you might replace rcpps xmm1,xmm0 with rcpps xmm1,dqword [Source]. Feel free to use and modify...

I doubt Rcpps & Sqrtsd will provide repeatable results on different processors and even on different compilers or compilation settings.

I think FNV-1 is good enough for start. If you want something more sophisticated, look for a recent topic with TarsaLZP. Shelwien made a variation that uses his rolling CRC32 algorithm which should fit nicely with match model.

After further testing, first version should be avoided : the square root function will issue the same NAN value for all negative entries. Second version corrects this problem though, and looks Ok.
I had originally used FNV-1 for my Match model, and I've made some tests with CRC32, S-Box, to see if I could achieve some better results, but FNV-1 remained better.
Curiously, this weird hash is getting better results, but this may be due to the way my match Model is implemented; I have to clear this out.

Last edited by Jean-Marie Barone; 24th July 2013 at 00:14.
Reason: typo

I'm working on a hashing project at the moment. I'm not sure how to interpret your results, such as on enwik8. I haven't checked out your hash function yet, and I haven't really looked much at work other people have done with hashing, especially applied to full text. I'm trying it my way, but I really don't know what other techniques exist, so I don't know how to compare.

I tracked down the formula to calculate the expected number of hash collisions, under the assumption that hashes are random and evenly distributed. Here's how I wrote it in C:

double D = hashtablesize;
double n = trials;
double phi = (D - 1.0)/D;
double expected_collisions = n - D * (1 - pow(phi, n));

D is the number of possible hash values. n is the number of objects being hashed. Collisions are defined as instances when you compute a hash for some object and find that it's already been assigned to a different object.

My experimental results match the numbers I get from this formula with high precision.

I think it would provide useful context to give results of hashing as number of collisions observed compared with the expected number for a random assignment of hashes.

Thanks nburns, this might be helpful. My own testing method was much trivial : I have tested it with a different set of files, and compared with FNV-1 results. I don't really plan to analyze it deepfully; I just noticed it was working good for me, especially with text files, so I intend to use it.

I have read some documentation regarding Floating-point determinism.
It turns out using rcpss instruction was a bad idea, as it is not fully IEEE-754 compliant. Actually, at leat 4 instructions must be avoided : rcpss, rcpps, rsqrtps, rsqrtss because they calculate estimates, and might give different results between different processors. A guy on a forum (I can't find back the link) had made some tests with rcpps on different procs, and the results were sometimes a bit different, even between Intel Core 2 and Core 2 duo. Using square root is ok though.
So I had to find a substitute for the rcpss instruction in my Match model hash with something like this:

;Reset SSE2 control/status register (set rounding mode to Nearest)
push eax ; allocate a qword in stack
stmxcsr [esp] ;store MXCSR register in stack
mov word ptr [esp],1F80H ;default settings
ldmxcsr [esp] ;load MXCSR register from stack
pop eax ;clear stack

;Initiate xmm2 with a mask
pcmpeqd xmm2,xmm2 ;fill xmm2 with 1
pslld xmm2,8 ;xmm2=FFFFFF00FFFFFF00H

;Hash computation assuming datas to hash are located in low-qword of xmm0
cvtdq2ps xmm1,xmm0 ;xmm1=4 xmm0 dwords converted to float
pand xmm1,xmm2 ;set bytes 0 & 4 to naught
sub esp,8 ;allocate a Qword in stack
psrlq xmm1,1 ;erase the bit of sign of xmm1
sqrtsd xmm1,xmm1 ;xmm1=square root of low-qword of xmm1
movq [esp],xmm1 ;store result in stack
pop eax ebx ;then in eax & ebx
xor eax,ebx ;eax=32 bits hash

Results on enwik8 (21833024) are a bit better than second version, but results on Silesia are a bit worse, altough still better than FNV-1.
Anyway, I still have the feeling that something useful can be worked out using a small chain of SSE instructions.

The kind of hash I'm playing with is an extension of a rolling hash in which bytes can be added and removed in any order, so you can add as many bytes as you want, remove some, and add more, in any order.

It's based on the Randomized Karp-Rabin implementation from Daniel Lemire at http://code.google.com/p/ngramhashing/. However, it's simpler, because I started eliminating anything unnecessary, and pared it back to one header in C. I made changes to the algorithm, so it's not really Randomized Karp-Rabin any more.

c is the input byte, and B was originally the number 37. One of my innovations was to replace 37 with its multiplicative inverse (mod 2^31), which is probably also prime, but I haven't checked it. I removed the randomization part, which substituted each byte with a random integer of size equal to the hash. I think the random numbers were counterproductive, because by making the low bits of the hash random, it fairly guarantees that some patterns will repeat, and since those bits receive little or no mixing, that makes those bits a source of collisions. The random numbers could contribute no entropy to the hash over what was in the original bytes. I think the only real purpose they served was to spread the effects of each input byte across the entire hash faster than it could spread through repeated multiplication by 37. That's not so much of an issue once the hash gets going, but I had collisions after removing them when using the hash on short words, where the hash was almost completely zero after adding one or two bytes. However, I had the idea to try the multiplicative inverse of 37 instead of 37, which is the modular equivalent to a regular multiplicative inverse, and is uniquely determined for any prime number mod n in a field. That's the number Binv in the code, and just by looking at this number in binary, you can see it has many more 1 bits than 37 does, and it is much larger relative to the size of the hash. The 1 bits all become shift-and-add operations when doing binary multiplication, so lots of 1s imply rapid mixing (too many will cancel out). The calculators in both Windows and OSX have programmer modes that show the current number in binary, and it's educational and simple to watch what happens in this hash through repeated multiplications by a prime. Experimentally, replacing 37 with its inverse seemed to solve any lack of mixing caused by the lack of random numbers, and in some cases reduced collisions. I didn't look too deeply into the significance of the number 37, which was built in to Lemire's implementation, but I have a feeling that 37 and its inverse share common properties, and exchanging them is a mathematically reasonable thing to do.

I'm not that experienced with testing hash functions, but this one seems like a pretty good one, and its simple.

nburns, this sort of rolling hash is well known and used at least since 90's. many of us here at encode.su use 123456791 prime number. i also posted simple program that finds prime numbers just for such sort of hashes

nburns, this sort of rolling hash is well known and used at least since 90's. many of us here at encode.su use 123456791 prime number. i also posted simple program that finds prime numbers just for such sort of hashes

It's Karp-Rabin, as I said, and it's derived from Daniel Lemire's. The major thing I've done is to make it expandable, which I haven't seen done by anyone else. A standard rolling hash has a fixed window, but this one is more like a queue that holds any number of elements. The idea is to apply hashing to a broader range of problems, and the experiment is ongoing.

I've also done some modifications, like entirely removing the randomized part, and I've experimented with using the modular multiplicative inverse of a prime, rather than choosing a small prime or choosing one that just looked good without mathematical justification. Using the inverse of 37 looks remarkably like dividing rather than multiplying in terms of how it affects the bits, which is a curious thing. This function was designed by Karp and Rabin as a randomized hash function, that was dependent on random numbers, but now it's deterministic. And it still seems to work, actually, pretty well.

I brought it up mainly for the sake of discussing hashing, and it's a platform for experimentation more than anything else.

I think I read your message where you posted the prime number finder. I think that was a different kind of Rabin-Karp, and I have a sense that there are many Rabin-Karps. This one doesn't have a requirement to find prime numbers. The modulus is any power of 2, and it came hard-coded with a small prime that's suitable for any mod.

New hashes are always welcome! I'm sure this rolling hash can prove itself useful for someone, somewhere.
For the moment, I'm still happy with my weird-looking hash

The major thing I've done is to make it expandable, which I haven't seen done by anyone else. A standard rolling hash has a fixed window, but this one is more like a queue that holds any number of elements.

now i got it! i think that it may provide better results than zpaq idea of multiplication by even number. we should perform some tests

I guess it depends on the rules for changing the window size. To find dedupe fragment boundaries, zpaq uses an order 1 predictor and computes a rolling hash that depends on the last 32 mispredicted bytes plus any bytes in between. To do this it uses a 32 bit hash and multiplies by an even number when a byte is mispredicted and by an odd number not a multiple of 4 when the byte is predicted. The actual code is like:

Code:

if (o1[c1]==c) h=(h+c+1)*314159265u;
else h=(h+c+1)*271828182u;
o1[c1]=c;
c1=c;

where c is the current byte, c1 is the previous byte, h is the 32 bit hash, and o1[256] is the order 1 prediction table.

i know this code very well (and employ it in the srep -m1 mode). its drawback, obviously, is that older bytes affect only a few highest bits of h. so nburns idea may provide us with better hashing

Yes, that is a problem, or a feature, depending on what you do with it. The low n bits of the hash depend on only a window of the last n mispredicted bytes. For zpaq, it makes fragment boundary decisions based on the high bits of h.

i think that it makes decision worser because f.e. oldest byte in the window can change only one bit of hash. it would be a problem for regular hashing (f.e. used in match finder), so i expect that it reduces quality of hashing (anf therefore chunking) in zpaq too

i think that it makes decision worser because f.e. oldest byte in the window can change only one bit of hash. it would be a problem for regular hashing (f.e. used in match finder), so i expect that it reduces quality of hashing (anf therefore chunking) in zpaq too

I think Matt designed it that way on purpose. It seemed like a questionable thing to do, but I haven't looked into the problem+solution enough to really judge.

It's been a while since I looked at this hash, so I'm not sure if this version was the one that passed all the SMHasher tests. The final version should be around somewhere. It might be on, e.g., Google Code.

P.S. When I wrote that post, I had a misunderstanding related to composite finite fields. I'm doing normal integer arithmetic there mod 2^31. That's not the same as the Galois Field with 2^31 elements. However, the math still worked out.

standard rabin-karp scheme involves rolling hash of fixed, say 48 last bytes. zpaq hashing includes variable number of bytes including last 32 non-predicted ones. the idea, as i understand it, is to include enough "entropy" (since predicted bytes has low entropy). zpaq multiplies hash by even number in order to suppress influence of bytes preceding those 32 non-predicted ones, but side-effect is that older indluencing bytes also has less influence. may be is is desirable effect, i don't know. but if not, your scheme of multiplicaion will improve results. well, it's easier to check than discuss

standard rabin-karp scheme involves rolling hash of fixed, say 48 last bytes. zpaq hashing includes variable number of bytes including last 32 non-predicted ones. the idea, as i understand it, is to include enough "entropy" (since predicted bytes has low entropy). zpaq multiplies hash by even number in order to suppress influence of bytes preceding those 32 non-predicted ones, but side-effect is that older indluencing bytes also has less influence. may be is is desirable effect, i don't know. but if not, your scheme of multiplicaion will improve results. well, it's easier to check than discuss

If you are working mod 2^k and you multiply by 2, the whole thing shifts left and you lose one bit from the top. The same applies if you multiply by any x=2*a. ... I guess the purpose of Matt's scheme is to let the older symbols fade out gradually, rather than removing them all at once. It would be interesting to see evidence for doing that. But it seems easier to envision the results when you hash the last k symbols (without trying to weight them).

Last edited by nburns; 10th February 2015 at 05:08.