It's pretty similar to rc logic, don't you think?
And it's not accidental, as an ideal hash is a mapping of a value to its index,
and compression is exactly the same thing.

I wonder whether an optimal multiplier exists. My own impression is that a multiplier like 123456791 is not optimal for 3-byte sequences. Such a multiplier is probably too small for hashing a 24-bit value, since larger values work better. Maybe 2654435761 would be the best? Do you remember what Knuth wrote about this problem in his book?

Doesn't look like you understand...
There's no single optimal coefficient, because its performance depends on the specific distribution of values.
Instead, to make it perfect you should create a model for that mapping, and then approximate it until it becomes fast enough.
Also afaik that topic is covered by some lattice theory so Knuth didn't discuss it in much detail.

I assumed that you want fewer collisions, not obfuscation.
And for fewer collisions you'd have to exploit the value's internal correlations,
by not assigning any cells to improbable values and assigning more cells to probable ones.

Also, multiplication is just a conditional sum of shifted copies of the value, so you can
easily imagine what would happen after adding a specific shift, knowing
that your value is e.g. 4 adjacent symbols from text.

In my LZ77-like algorithm the magic number 123456789 is always better than Knuth's famous prime 2654435761. I tested it on enwik8, enwik9, silesia.tar, LZ4.1.9.2.exe, ...

I tried using two hash functions (one for even values, the other for odd values), but this worsened the result.

How many bits are used for addressing the hash table (or: how many slots do you have)?
How exactly do you implement the hashing (do you shift with >>)? What exactly do you hash?

const
  TABLEBITS = 17;
  TABLESIZE = 1 shl TABLEBITS;
var
  table: array[0..TABLESIZE-1] of dword;
................................
// Calculate hash by 4 bytes from inb1^
function CalcHash(dw: dword): dword; assembler;
asm
  mov ecx,123456789
  mul ecx
  shr eax,32-TABLEBITS
end; { CalcHash }
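
For readers who don't read Delphi assembler, here is a rough C++ equivalent of the routine above. It is only a sketch and assumes the usual register calling convention, where dw arrives in EAX: the multiply keeps the low 32 bits of the product and the shift keeps the top TABLEBITS bits of those. The names calc_hash, kTableBits and kTableSize are mine, and the multiplier is a parameter so that other constants can be tried.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kTableBits = 17;                          // same as TABLEBITS above
constexpr std::uint32_t kTableSize = 1u << kTableBits;  // 128K slots

// Multiply by the constant, keep the low 32 bits of the product,
// then keep the top kTableBits bits of those 32 bits as the table index.
std::uint32_t calc_hash(std::uint32_t dw, std::uint32_t multiplier = 123456789u) {
    const std::uint32_t h = (dw * multiplier) >> (32 - kTableBits);
    assert(h < kTableSize);
    return h;
}
```

Swapping the constant is then a one-argument change, e.g. calc_hash(dw, 2654435761u).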

I believe that the fewer 1 bits there are in the magic constant, the faster the multiplication is performed, which would increase the compression speed.

Are those numbers better or worse than my personal magic number, 506832829?

I believe the number doesn't need to be prime, but needs to be odd and have a good distribution of 0s and 1s.

The humorous constant 123456789 gives noticeably better compression. Your number is better only on silesia.tar (by 0.006%). I think that is because silesia.tar contains very specific image files.

> I believe the number doesn't need to be prime, but needs to be odd and have a good distribution of 0s and 1s.

Yes, the number 123456789 includes runs of 1, 2, 3 and 4 consecutive 1 bits.

By the way: when I used Knuth's number 2654435761, my algorithm compressed enwik8 to 40.008% at its best ratio. After I changed Knuth's number to 123456789, my algorithm crossed the psychological barrier and reached 39.974%. That is < 40% on enwik8 with only a hash table of 128K cells, without match searching, source analysis or additional compression! Maybe after some improvement the algorithm will reach a compression ratio < 40% with a hash table of 64K cells...

> I believe that the fewer 1 bits there are in the magic constant, the faster the multiplication is performed, which would increase the compression speed.

I don't think that the number of 1 bits in the multiplier has an effect on the performance of the multiplication. The CPU does not actually look for the 1 bits in the multiplier to do shifts and additions. An explanation is here.

> The humorous constant 123456789 gives noticeably better compression.

40.008% -> 39.974% is a gain of 0.034 percentage points. I would not call it noticeably better.

That is unfortunately not optimal for this kind of hash. This hash works best with small values. The larger the values get, the worse it performs. That is probably why a smaller multiplier has worked better for you (so far).

I've just run some checks and I got the best results on the corpora using a hash function originally from paq8a, still used in paq8pdx. Second place goes to crc32c and third to Robert Sedgewick's hash function. The fastest of the top 3 is paq8a's wordmodel hash, so that is the absolute winner.
Could you please try something like ( ((a*271+b)*271+c)*271+d ) where a,b,c,d are the four bytes? Keep the lower bits (and 0x1FFFF) at the end.
If you do timings: is it much slower in your case?

(I'm still running tests, there are many candidates.)

See my new CalcHash and compare the results with the ones in my message #11:

Code:

function CalcHash(dw: dword): dword;
var a,b,c,d: byte;
begin
  a:=dw and $ff; b:=dw shr 8 and $ff; c:=dw shr 16 and $ff; d:=dw shr 24 and $ff;
  Result:=(((a*271+b)*271+c)*271+d) and $1FFFF;
end;

enwik8: 40.579%
enwik9: 36.318%
silesia.tar: 37.434%

It's 15-20% slower and the ratio is always worse compared with 123456789...

Oh, I see.
It looks like your use case is significantly different from mine.
Let me share what exactly I did.

For every file (49 in total) in calgary, canterbury, maximumcompression, silesia plus some images from Darek plus enwik8 I fetched the consecutive bytes ("strings"), hashed them using different algorithms and counted how many items are to be stored in hash tables with sizes in the range 16..22 bits (64K..4M). I ran the same experiment for 2..8-byte strings, just to have a more global view.

Results for 4-byte strings and 128K hash tables?
For example, in calgary\bib there are 19086 distinct 4-byte strings. Fitting them to a 128K (17 bit) hash table looks simple and easy but we need to squeeze those 32-bit (4-byte) values to their 17-bit representation "somehow". That is what hashing does.
It turns out that assigning 19086 values *randomly* to 128K buckets will produce some collisions (the birthday problem) even if we have plenty of room. We are to expect around 1300 cases where 2-3 different strings will be assigned to the same hash table bucket. That is a collision rate of 0.069 on average per string.
The worst collision rate to be expected for 4-byte strings with a 128K hash table is from silesia\xml. It has 125899 distinct 4-byte strings. Having 131072 buckets and assigning our strings randomly to the buckets we are to expect 0.357 collisions per string on average.
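
The expected figures can be estimated with the standard occupancy formula. The sketch below is one reading, not necessarily the exact calculation used for the post: it counts every string that lands in an already occupied bucket as a collision and divides by the number of distinct strings. With that reading it gives values close to the ones quoted for calgary\bib and silesia\xml. The function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Expected number of occupied buckets when n items are thrown
// uniformly at random into m buckets: m * (1 - (1 - 1/m)^n).
double expected_occupied(double n, double m) {
    assert(n > 0 && m >= 1);
    return m * (1.0 - std::exp(n * std::log1p(-1.0 / m)));
}

// Items that land in an already occupied bucket, as a fraction of all items.
double expected_collision_rate(double n, double m) {
    return (n - expected_occupied(n, m)) / n;
}
```

For example, expected_collision_rate(19086, 131072) covers the calgary\bib case and expected_collision_rate(125899, 131072) the silesia\xml case.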

So I established these expected collision rates for every [file + string size + bucket size]. Then I measured the actual collision rate for the different hashing algorithms.
I simply normalized at the end: my metric for cross-scenario comparison is the actual collision rate divided by the expected collision rate. A value less than or around 1.0 means the hash algo performed well enough. I averaged these metrics to get a total, and also across dimensions (the string sizes and hash table sizes), to verify that this approach gives consistent results "no matter in which scenario".
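
A minimal sketch of such a measurement could look like this. It is not the actual test harness; it assumes the keys are already deduplicated, that the hash under test returns a bucket index directly, and that a collision is counted whenever a key lands in an occupied bucket. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count collisions among distinct keys: every key that lands in a bucket
// which is already occupied counts as one collision.
// hash(key, table_bits) must return a bucket index below 2^table_bits.
template <typename HashFn>
std::size_t count_collisions(const std::vector<std::uint32_t>& distinct_keys,
                             int table_bits, HashFn hash) {
    assert(table_bits > 0 && table_bits < 32);
    const std::size_t buckets = std::size_t(1) << table_bits;
    std::vector<bool> occupied(buckets, false);
    std::size_t collisions = 0;
    for (std::uint32_t key : distinct_keys) {
        const std::size_t slot = hash(key, table_bits);
        assert(slot < buckets);
        if (occupied[slot]) {
            ++collisions;
        } else {
            occupied[slot] = true;
        }
    }
    return collisions;
}

// Actual collisions divided by the collisions expected from a uniformly
// random assignment of n keys to 2^table_bits buckets.
double normalized_collision_metric(std::size_t collisions, std::size_t n,
                                   int table_bits) {
    assert(n > 1);
    const double m = std::ldexp(1.0, table_bits);
    const double nd = static_cast<double>(n);
    const double expected = nd - m * (1.0 - std::exp(nd * std::log1p(-1.0 / m)));
    return static_cast<double>(collisions) / expected;
}
```

A result near or below 1.0 then means the hash did about as well as a random assignment for that file, string size and table size.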

Remarks:
- Hashing is not a random assignment to buckets, because the input strings make it deterministic, but having such an expected value for every setup we can compare the hashing algorithms across the setups. I cannot just add up the numbers of collisions for every [file + string size + bucket size]; that would give a strong bias with setups with very low expected collision rates.
- There are other metrics for measuring hash quality, but I think the above is useful for finding the best algo for our scenario: which are the hash algorithms that perform equally well with different (small) string sizes and hash table sizes.
- With proper collision detection (and by chaining the strings of course) any "good enough" hashing works equally well. You will need to choose the faster/simpler one.
- A collision is not necessarily a bad thing. In lucky cases it can even be advantageous. If for instance "ant", "fire" and "police" produced the same base hash value and used the same statistics for the three words (since you don't have collision detection), and your text contained the space or "m" as the next character (as in "antman", "fireman" and "policeman", or simply "ant ", "fire " and "police "), then you would be lucky with this collision.

We have the following differences:
- I ran the tests on every file separately. You tar your input files.
- For me a "collision" is when there are multiple items to be assigned to the same bucket. For you a collision may occur multiple times: when one item overwrites an old one, and then a newer occurrence of the old string overwrites the new one. And this can go on multiple times. As I understand, you don't have collision detection (right?).
- What else? What do you have in your compression algorithm that would explain the result you experienced?

Attached are the results for calgary/bib for 4-byte strings and 128K hash table size.
The bold items are the hash algorithms with different magic numbers you tried where you hash the dword at once not byte-by-byte.

Yes, it is clear that there is no clear global winner - any algo or magic number must be tried in the specific context.

For example in paq8px it is important to 1) have a near-uniform spread i.e. as few collisions as possible, 2) be fast, because there are hundreds of hash calculations for every bit (or byte). Yes. Not one hash calculation for every byte or dword. Hundreds for every bit (or byte). (Let it sink in.) That's a lot. 3) And finally to have a result larger than 32 bits, since we have large hash tables and we need some bits for collision detection as well. We need 35+ bits in some cases.
It is also important that we usually hash small values or short strings of bytes. And a multiplicative hash or a set of multiplicative hashes are ideal for that (they are most suitable for small values).

That is one specific use case. When searching for a solution to improve hashing I tried a lot of possibilities and settled on the current one. This gave the best improvement for paq8px. It is clear from the results (link#1, link#2) that the larger the files were, the better results we got. (There are more collisions to be expected for larger files, so better hashing is more crucial for them.)

From a lot of experience I can say: you have to try until you are satisfied with one algo/method.

Results of my search for a better hash functionality in paq8px.

Remark: Most of the statements below are well known to this community. I just wanted to have them in one place with paq8px in focus.

Overview

Paq8px uses hash tables to store its context statistics.
For each input byte many contexts are taken into account for prediction. Therefore for each input byte (sometimes: for each input bit) many statistics are gathered by the different models and are stored in hash tables.
Naturally there are collisions in hash tables. Paq8px's collision resolution policy is (simply put) to overwrite infrequent older context statistics.
During compression, as the hash tables slowly fill up, there are more and more collisions. As a result, useful contexts are sometimes overwritten. We need to minimize the loss of useful contexts.
We need a "good" hash function to disperse our contexts nearly uniformly into hash buckets. We also need a reasonably good checksum to test for collisions.

We need two hash functions: one that squeezes multiple integers into one integer: like

hashkey = hashfunc(x1,x2,x3);

and one that squeezes arbitrarily many bytes (a string) into one integer: like

Both need to have good spreading (dispersion) and they need to be reasonably fast since we do a lot of hashing for every input bit or byte.

Hash key size matters

When looking up or inserting a new context statistic item to a hash table it needs an index (bucket selector) and a checksum (usually 1 or 2 bytes) to detect collisions.
With maximum compression (memory) settings ContextMap and ContextMap2 used 19 bits for bucket selection and 16 bits as a checksum from the 32-bit hash key. This is a 3-bit overlap so the effective checksum is around 13 bits.
This suggests that 32-bit hash keys may not be enough.
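
The bit budget above is simple arithmetic, sketched here in C++ under the assumption that the bucket selector and the checksum are both cut from the same key and overlap only when their combined width exceeds the key width. The function name is made up.

```cpp
#include <algorithm>
#include <cassert>

// Number of checksum bits that do not overlap the bucket-selector bits
// when both are cut from one hash key of key_bits bits.
int effective_checksum_bits(int key_bits, int bucket_bits, int checksum_bits) {
    assert(key_bits > 0 && bucket_bits >= 0 && checksum_bits >= 0);
    const int overlap = std::max(0, bucket_bits + checksum_bits - key_bits);
    return checksum_bits - std::min(overlap, checksum_bits);
}
```

With a 32-bit key, 19 bucket bits and 16 checksum bits this gives 13 independent checksum bits; with a 64-bit key the full 16 bits remain.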

Which hash function(s) to use in paq8px then?

When choosing a hash function, we need to understand the data we have.
Range: We usually have "smaller" integers. Sometimes very small. These integers are not random. Hash functions developed and fine tuned with random data in mind (like hash functions tuned for 32-bit random numbers as inputs) may not perform optimally in this range.
Order: hashing order matters for strings: (simplified example follows) when hashing "fireman" and "policeman" from left to right they may produce the same hash key after "fire" and "police". After including the remaining bytes (m+a+n) the two hash values will still be equal. These strings will be represented by the same hash key in a hashtable so their collision will be inevitable and undetectable. Their statistics will be calculated together as if they were the same string. However, strings having the same ending would probably also have very similar statistics. Therefore, in this case a collision is advantageous. So hashing from left to right is probably better than right to left: more recent bytes will have greater "weight": when an early collision happens "in the middle" of the string, the last few bytes decide the final hash value. Most of the time in paq8px we hashed from right to left for performance reasons, so it was not ideal.
Parameter order in multiple value hashes: when we hash a fixed number of integers and do it similarly as in the case of strings (i.e. combining the input values in order: one after the other), then in case of an intermediate collision the most recently added inputs will have greater "weight". But this time it is a disadvantage. We usually want our inputs to have "equal weight". So as a value hashing function hash(x1)+hash(x2)+hash(x3) is probably better than hash(hash(hash(x1),x2),x3). ("+" is not necessarily an addition.)
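
Both ordering points can be illustrated with a small C++ sketch. Nothing here is paq8px code: the multiplier 31, the strings "Aa" and "BB" (a well-known colliding pair for this toy hash), the three 64-bit constants and the +1 offsets are all assumptions picked for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// String hashing, left to right: h = h * 31 + next byte.
std::uint32_t hash_string(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = h * 31u + c;
    return h;
}

// Once two prefixes collide, appending the same suffix keeps them colliding:
// "Aa" and "BB" collide under this hash, and so do "Aaman" and "BBman".
void check_prefix_collision() {
    assert(hash_string("Aa") == hash_string("BB"));
    assert(hash_string("Aaman") == hash_string("BBman"));
}

// Arbitrary large odd 64-bit constants, chosen for illustration only.
constexpr std::uint64_t kMul1 = 0x9E3779B97F4A7C15ULL;
constexpr std::uint64_t kMul2 = 0xC2B2AE3D27D4EB4FULL;
constexpr std::uint64_t kMul3 = 0x165667B19E3779F9ULL;

// Value hashing with "equal weight": each input is scrambled by its own
// multiplier and the parts are summed, instead of nesting the hashes.
// The +1 keeps an input of 0 from contributing nothing.
std::uint64_t hash_values(std::uint64_t x1, std::uint64_t x2, std::uint64_t x3) {
    return (x1 + 1) * kMul1 + (x2 + 1) * kMul2 + (x3 + 1) * kMul3;
}
```

Because every position has its own multiplier, hash_values(1, 2, 3) and hash_values(3, 2, 1) still differ even though no input is processed "later" than another.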

Tests

I tried several hash functions mostly from the multiplicative, xor-rotate and crc families. I haven't tested any hash functions that I thought would be an overkill (for performance reasons).
I ran a couple of tests with my hash function candidates using different parameters and different hash table sizes. I used n rolling bytes (n=4..16) from larger input files (binary and text) as input values.

The "best" functions produced surprisingly similar results - the difference was negligible. The number of collisions was *very* similar. There is no clear winner.
But some of the functions are simpler and faster. Based on that I finally chose a winner.

Some interesting findings

Whenever a weaker hash function was post-processed ("finalized") with a simple permutation, the result was usually better than the "raw" result. Finalizing the "raw" result usually helps. Note: most of the best hash functions have some finalization step. This final scrambling is not necessarily a simple permutation.
Remark: paq8px (until v167) post-processed the calculated hash when it was used for hash table lookup to enhance quality:

On the other hand, when "too much" hashing was performed, the result was sometimes slightly worse. "Too much" hashing means: including the character position, including a magic additive constant (for multiplicative hashes) or beginning with a magic number as a seed (vs. beginning with 0). We can decrease the number of hash calculations: hashing more input bytes (4) in one go is a little bit better than doing 4 hashes after each other. Too much hashing may weaken the hash key. The above finalizers were dropped in v168: it turned out that hash quality had improved, so using such finalizers had become disadvantageous.

I'd mention two of the more outstanding string hash functions: crc32c and the family of 32/64 bit multiplicative hashes with large 32/64 bit multipliers.
Crc32c is a stable and good hash function. For instance, it beats the internal hash function of the .NET framework. It has hardware support in some CPUs. It is fast with hardware acceleration but slower with software emulation.
Crc32c is good but I excluded it for being 32-bit.
Multiplicative hashes are fast and spread very well when using a good multiplier (a large odd random number).
Remark: Paq8px (until v167) used a combination of multiplicative hashes as its main hash function with some post-processing ("finalizing"):

The magic numbers here are all primes - except the first one (I don't know why). Notice that there is some post-processing involved (xors and shifts).

Between v160 and v167 paq8px used a new string hash from the multiplicative hash family in MatchModel (see the combine() function), and a high-quality xor-shift hash as a general hash function: https://github.com/skeeto/hash-prosp...ound-functions

uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= UINT32_C(0xed5ad4bb);
    x ^= x >> 11;
    x *= UINT32_C(0xac4c1b51);
    x ^= x >> 15;
    x *= UINT32_C(0x31848bab);
    x ^= x >> 14;
    return x;
}

Note: it's high quality when the inputs are random 32-bit numbers. But our use case is different: it turned out that this is not the best for us, even though it is so good for random inputs. So it was soon dropped.

How did I get these large random prime numbers?
I used an online random number generator. I generated a couple of numbers and looked up the next prime with WolframAlpha.
Then I tried which ones work best in which combination. Really. I didn't need to try for too long, the results were very similar.

Then we finally finalize the result by keeping the most significant bits for hashtable lookup and the next couple of bits (usually 8 or 16) as checksums for collision detection:

static ALWAYS_INLINE
auto finalize64(const uint64_t hash, const int hashBits) -> uint32_t {
  assert(uint32_t(hashBits) <= 32); // just a reasonable upper limit
  return uint32_t(hash >> (64 - hashBits));
}

static ALWAYS_INLINE
uint64_t checksum64(const uint64_t hash, const int hashBits, const int checksumBits) {
  assert(0 < checksumBits && uint32_t(checksumBits) <= 32); // 32 is just a reasonable upper limit
  // build the mask from a 64-bit 1: shifting a plain int by 32 would be undefined
  return (hash >> (64 - hashBits - checksumBits)) & ((UINT64_C(1) << checksumBits) - 1);
}
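
To show how the two pieces fit together, here is a self-contained sketch that splits one 64-bit hash into a bucket index and a checksum in a single call. It is not paq8px code; Slot and split_hash are hypothetical names, and the asserts simply mirror the limits used above.

```cpp
#include <cassert>
#include <cstdint>

struct Slot {
    std::uint32_t index;     // bucket selector: the top hash_bits bits
    std::uint32_t checksum;  // the next checksum_bits bits below the index
};

Slot split_hash(std::uint64_t hash, int hash_bits, int checksum_bits) {
    assert(hash_bits > 0 && hash_bits <= 32);
    assert(checksum_bits > 0 && checksum_bits <= 32);
    assert(hash_bits + checksum_bits <= 64);
    Slot s;
    s.index = static_cast<std::uint32_t>(hash >> (64 - hash_bits));
    s.checksum = static_cast<std::uint32_t>(
        (hash >> (64 - hash_bits - checksum_bits)) &
        ((std::uint64_t(1) << checksum_bits) - 1));
    return s;
}
```

For the hash 0xABCDEF0123456789 with 16 index bits and 8 checksum bits, the index is 0xABCD and the checksum is 0xEF, so the two fields never share a bit.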