Imagine there is a database (e.g. a dictionary) indexed by hashes: the hash of an object indicates its location.

Obviously, the length of the hash (label) has to grow like lg(n); for example, to distinguish among a billion objects, we need ~30 bits per object.

So for all n objects we need n lg(n) bits in total.
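A quick sanity check of the arithmetic (plain Python, nothing beyond the claim above):

```python
import math

n = 10**9                             # a billion objects
label_bits = math.ceil(math.log2(n))  # shortest label that can distinguish n objects
total_bits = n * label_bits           # naive cost: n * lg(n) bits

print(label_bits)  # 30
print(total_bits)  # 30000000000
```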

However,

- we don't need their order, which allows us to save lg(n!) ~ n lg(n) bits,

- if we only need to distinguish objects inside a given database (no questions about objects outside it), we can cut each hash at its first distinguishing bit - creating a prefix tree (TRIE) in which all leaves (truncated hashes) are children of degree-2 nodes.

Thanks to these two savings, it turns out that we can reduce the entropy from n lg(n) to linear growth: ~2.77544n bits.
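Here is a minimal sketch of the truncation idea, assuming SHA-256 (64-bit prefix) as the hash and made-up key names for illustration - each hash is cut at the first bit that makes it unique within this particular database:

```python
import hashlib

def hash_bits(key: str, nbits: int = 64) -> str:
    """First nbits of SHA-256(key) as a '0'/'1' string (stand-in for any hash)."""
    digest = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    return format(digest, "0256b")[:nbits]

def truncate_to_distinguishing(keys):
    """Cut each key's hash at the first bit that makes it unique
    among the hashes of the other keys in this database."""
    hashes = {k: hash_bits(k) for k in keys}
    prefixes = {}
    for k, h in hashes.items():
        others = [h2 for k2, h2 in hashes.items() if k2 != k]
        for length in range(1, len(h) + 1):
            p = h[:length]
            if not any(o.startswith(p) for o in others):
                prefixes[k] = p  # shortest prefix no other hash starts with
                break
    return prefixes
```

For a few hundred keys the prefixes average around lg(n) + a small constant bits each, versus the 64 we started from; of course a key from outside the database may now collide with a stored prefix, which is exactly the restriction above.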

It can be further reduced by observing that some internal bits are not distinguishing (degree-1 nodes) - we can save 1 bit per each such node.

Finally, we get asymptotically ~2.33275n bits, and this "distinguishability constant" (2.33275 bits/element) can be found analytically (3/2 + gamma*lg(e); can it be reduced further?):

slides: https://www.dropbox.com/s/ugmduw3c9pqsnnr/hash3.pdf

arxiv: https://arxiv.org/pdf/1206.4555
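The quoted constant can be reproduced numerically; the grouping that matches 2.33275 is 3/2 + gamma*lg(e), with gamma the Euler-Mascheroni constant:

```python
import math

gamma = 0.5772156649015329  # Euler-Mascheroni constant
lg_e = math.log2(math.e)

c = 3 / 2 + gamma * lg_e    # distinguishability constant, bits per element
print(round(c, 5))  # 2.33275
```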

So the question is whether this could be put into practice - e.g. to reduce the size of hashes ~10x for a database with millions of entries?

The big problem is that it requires compressing the binary tree, which makes it terrible for random access ... but it could still be beneficial for storage and transmission.

And there might be some practical approximations...?