# Thread: Compressing hash database from n lg(n) to ~2.3327n bits (distinguishability constant)

1. ## Compressing hash database from n lg(n) to ~2.3327n bits (distinguishability constant)

Imagine there is a database (e.g. a dictionary) indexed by hashes: the hash of an object indicates its location.
Obviously, the length of the hash (label) has to grow with lg(n) - for example, to distinguish among a billion objects we need >30 bits per object.
For all objects together we need n lg(n) bits.
However,
- we don't need their order, which lets us save lg(n!) ~ n lg(n) bits,
- if we only need to distinguish inside a given database (no questions about objects outside it), we can cut hashes to the first distinguishing bit - creating a prefix tree (trie), where all leaves (hashes) are children of degree-2 nodes.

Thanks to these two savings, it turns out that we can reduce the entropy from n lg(n) to linear growth: ~2.77544n bits.
It can be further reduced by observing that some internal bits are not distinguishing (degree-1 nodes) - we can save 1 bit per such node.
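To make the trie saving concrete, here is a small Python sketch (my own illustration, not from the slides) that computes, for random hashes, the length of the shortest prefix of each hash that distinguishes it from all the others - the average comes out close to lg(n) + 1.33, not the full hash width:

```python
import random

def distinguishing_prefix_lengths(hashes):
    """For each hash (a '0'/'1' string), the length of the shortest prefix
    that is not shared as a prefix by any other hash in the set."""
    lengths = [0] * len(hashes)

    def split(items, depth):
        # items: (original index, bit string) pairs sharing a common prefix
        if len(items) == 1:
            idx, _ = items[0]
            lengths[idx] = depth      # alone in its subtree: depth bits suffice
            return
        zeros = [it for it in items if it[1][depth] == '0']
        ones = [it for it in items if it[1][depth] == '1']
        if zeros:
            split(zeros, depth + 1)
        if ones:
            split(ones, depth + 1)

    split(list(enumerate(hashes)), 0)
    return lengths

random.seed(1)
n = 1 << 12
hashes = [''.join(random.choice('01') for _ in range(64)) for _ in range(n)]
avg = sum(distinguishing_prefix_lengths(hashes)) / n
print(avg)   # close to lg(4096) + 1.33 = 13.33
```

Storing these prefixes instead of the full hashes is the first saving; subtracting the lg(n!) bits of order is the second.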

Finally, we get asymptotically ~2.33275n bits, and this "distinguishability constant" (2.33275 bits/element) can be found analytically (3/2 + gamma·lg(e); can it be reduced further?):
slides: https://www.dropbox.com/s/ugmduw3c9pqsnnr/hash3.pdf
arxiv: https://arxiv.org/pdf/1206.4555

So the question is whether it could be put into practice - e.g. to reduce ~10x the size of hashes for a database with millions of entries?
The big problem is that it requires compressing the binary tree - making it terrible for random access ... but it could still be beneficial for storage and transmission.
And there might be some practical approximations...?

2. ## Thanks (2):

Shelwien (13th April 2017),xinix (13th April 2017)

3. I actually considered something like that as a replacement for entropy coding.
Hashing basically makes it possible to produce near-optimal codes for data strings without any complicated probability estimation.

But for compression of a large set of hash values, I'd likely consider the Rainbow Table approach first.
The plain RT method might not work if the hashes in question are computed from largish blocks of data,
but then I invented a delta-RT, where hashes are used for generating diffs instead of actual blocks,
and chains can be constructed from hashes of blocks with similar content (diff_size < log2(n_hashes)).

4. ## Thanks (2):

Jarek (13th April 2017),xinix (13th April 2017)

5. Shelwien, I remember your thread about handling the lg(n!) bits of ordering information - that besides "subtracting it" by some compression, we can store some information in this permutation instead.
This is one step (with a nice approximation attached by James Dow Allen) ... which still needs n lg(n) size.
For example, to halve the storage for 32-bit hashes you would need 178 thousand of them; for 64-bit hashes you need 11 billion.
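These figures can be checked numerically; a quick Python sketch (my own, using lgamma for lg(n!)) searches for the smallest n where the order information lg(n!) reaches half of the n*width bits of storage:

```python
import math

def lg_fact(n):
    """lg(n!) computed via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def n_for_half(bits):
    """Smallest n where lg(n!) >= n * bits / 2, i.e. forgetting the order
    of n hashes of the given width saves half the storage."""
    n = 2
    while lg_fact(n) < n * bits / 2:   # exponential search for an upper bound
        n *= 2
    lo, hi = n // 2, n                  # then binary search
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if lg_fact(mid) < mid * bits / 2:
            lo = mid
        else:
            hi = mid
    return hi

print(n_for_half(32))   # about 178 thousand
print(n_for_half(64))   # about 11 billion
```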

Here I am talking about also including another step, to get from n lg(n) to linear growth: do not store the entire hash, only its distinguishing prefix (as in the left-most tree above for 4 hashes).
It no longer allows testing whether an element belongs to the database (huge false-positive rate), it only allows distinguishing elements inside the database.
My point is that while testing for membership in the database requires n lg(n) bits, linear growth is sufficient for distinguishing inside it.
Instead of e.g. 32-256 bit hashes, "2.33275-bit hashes" are asymptotically sufficient for distinguishing inside a database.

6. Yes, I think I understand it well enough, thanks.
Although I wouldn't use trits etc., but a generic range coder instead.
Just that, depending on the type of input blocks from which the hashes are computed, the RT method allows much higher compression, up to millions to one.
https://en.wikipedia.org/wiki/Rainbo...ed_hash_chains

7. I am not saying it is impossible, but it would be a big surprise if you could reduce the hash table below these 2.33 bits per element (the entropy of such a random tree) while still being able to distinguish them.
I don't see how rainbow tables could help here?

8. RT can compress N hashes (which can be linked into a chain) to just 2.
Depending on what the source of these hashes is, it might be possible to get a higher compression ratio than what you suggest.
For example, it would obviously work for password hashes.

9. Shelwien, I see RT as just a fancy way to brute-force collision attacks on hash functions - instead of just testing all or random plaintexts, combine them into chains by deriving each new plaintext as a "reduction function" of the previous hash (e.g. just taking some subsequence).
I don't see any magic here; analogously, we could just generate an arbitrarily long hash sequence - which could be "compressed" into the hashed object with a ratio as large as we want (e.g. a quadrillion) ... nor do I see how it could help with the task this thread was supposed to be about (?)

Once again, the task here is being able to distinguish (identify) elements inside a large database - generate a unique pseudorandom hash value for each object and use it e.g. to locate it inside the database.
The question is how many bits are required to store such a data structure, generated from the database, that allows identifying all objects inside it.

It should contain all hashes, but can exploit some reductions, e.g.
- remembering only the shortest distinguishing prefixes of the hashes (tree a. in the picture, also called a trie),
- forgetting the non-distinguishing internal bits of these prefixes - the direction at degree-1 nodes (tree d. in the picture, analogous to PATRICIA trees),
- finally compressing such a tree, forgetting the order of elements (subtracting lg(n!) bits).
This way we get linear entropy growth for this family of random trees (instead of the standard n lg(n)) - asymptotically "~2.33275-bit hashes" suffice to distinguish elements.

Some possible ways that could lead to reducing these 2.33275 bits are:
- instead of just using prefixes, use some further positions, optimized for a given database. However, these positions have to be stored, and we are talking about asymptotic behavior,
- instead of using an i.i.d. Pr(0)=Pr(1)=1/2 hash function, use one with a different probability distribution of bits (which can be realized with an entropy coder),
- instead of using a binary hash function, use e.g. a ternary one,
- ...
I haven't checked, but intuitively they shouldn't improve on these 2.33275 bits (maybe the first one) ... however, a formal proof of the nonexistence of a better method seems a kind of impossible task.

Could you really distinguish elements of a database below these asymptotically 2.33275 bits/element?
What is the real fundamental "distinguishability constant"?
I don't believe it can be reduced below linear growth - the question is whether these 2.33275 bits/element can be further reduced.

10. > Shelwien, I see RT as just a fancy way for brute force collision attack of hash functions

Sure, it is currently used for precomputing data for collision attacks.

But I see RT itself as a unique compression method.
It can be applied in any case where it's feasible to compress (e.g. with paq) a lot (90%+?)
of N relevant data blocks to log2(N/k) bits or less (k>=4 is the target chain length).

As a simpler application of a similar method, imagine processing enwik6 with a paq
model, then using it to generate 100M of data (by decoding random data with the paq model trained on enwik6),
then using the generated data as a dictionary for some LZ compressor.
I think such a dictionary would contain plenty of useful references which can't
be found in enwik6.

My idea was that e.g. with 128-bit hashes (presuming a password database, where it is
certainly possible to apply RT compression), it's enough to get an average hash chain length of 128
to reach "hash database size" = 2*n. Actual RTs use much longer chains.
Then it would also be possible to apply permutation-based compression to the chain descriptors.

> Once again, the task here is being able to distinguish (identify) elements
> inside a large database - generate a unique pseudorandom hash value for
> each object and use it e.g. to localize it inside the database.

Sure. But in the end, as far as I understand, you're still compressing a separate table of
hashes, which are computed from the records of a large database.
In which case, as I said, it should be possible to apply RT-based compression to this
table of hashes, depending on the nature of the original database records.
Do you at least agree that it's possible if it's a password database?

And if it's not, but is a sufficiently large database with the usual redundant data,
then we can likely apply my delta-RT idea - basically constructing a separate
RT instance for each cluster of database records within a limited edit distance.

11. I have to admit that I don't see how to use RT for data compression.
The chains are random sequences identified by the initial value - choosing the best (closest) one in a smaller space is exactly what rate-distortion does (the theoretical approach to lossy compression) ... but I don't think that's your point.

Originally Posted by Shelwien
Sure. But in the end, as far as I understand, you're still compressing a separate table of
hashes, which are computed from records of a large database.
By compressing the table of hashes you only subtract lg(n!) bits, but it is not enough to get to linear growth ... for that you also need to cut these hashes and finally compress the binary tree.

To compress such a binary tree with n leaves, you encode how many of them go to the left child; the remaining ones go to the right.
Then you encode both subtrees recursively.
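A minimal Python sketch of this recursive scheme (my own illustration): the shape of the prefix tree is serialized as a sequence of left-counts, and the decoder rebuilds the distinguishing prefixes in left-to-right (lexicographic) order, so a leaf's position in this enumeration serves as the element's index:

```python
import random

def encode_tree(hashes, depth=0):
    """Serialize the prefix tree of distinct bit-string hashes as a list of
    left-counts: at each node, how many hashes continue with bit '0'."""
    if len(hashes) <= 1:
        return []                        # a lone hash needs no further counts
    left = [h for h in hashes if h[depth] == '0']
    right = [h for h in hashes if h[depth] == '1']
    return [len(left)] + encode_tree(left, depth + 1) + encode_tree(right, depth + 1)

def decode_leaves(counts, n):
    """Rebuild the shortest distinguishing prefixes, left to right."""
    it = iter(counts)
    def rec(n, prefix):
        if n == 0:
            return []
        if n == 1:
            return [prefix]
        k = next(it)                     # left-count, consumed in preorder
        return rec(k, prefix + '0') + rec(n - k, prefix + '1')
    return rec(n, '')

random.seed(0)
n = 256
hashes = [''.join(random.choice('01') for _ in range(64)) for _ in range(n)]
leaves = decode_leaves(encode_tree(hashes), n)
# the i-th leaf prefix identifies the i-th hash in lexicographic order
assert all(h.startswith(p) for h, p in zip(sorted(hashes), leaves))
```

Entropy-coding each left-count k against the distribution C(m,k)/2^m of its node (of m hashes) is what should bring the total down to the ~2.77n bits discussed above; this sketch only round-trips the structure.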

12. > I have to admit that I don't see how to use RT for data compression.

Well, I guess my main point here was that hashes are hard to compress,
and it's perfectly possible that the actual data records for which the hashes are computed
could be compressed to a smaller size than their hashes (btw, the permutation thing applies here too).

RT simply provides a way to actually find records by their hashes, with a time/size tradeoff.

Note that it's possible to combine the methods too - i.e. for redundant records compress the actual records,
while for the remaining ones apply your method.
So it doesn't look like your constant is actually a constant :)

13. Originally Posted by Shelwien
>So it doesn't look like your constant is actually a constant
So please suggest a data representation below the 2.33275 bits/element "distinguishability constant" which is able to perform the task in question: distinguishing between elements of a given (large) database.

Let's say you could use RT to retrieve an object from a hash ... however, such a 2.33275-bit hash would only allow you to distinguish between 2^2.33275 ~ 5 objects ... to get such a tiny average cost per hash you need to work on their collection as a whole.
For me it was a big surprise that a set of hashes can be reduced from the standard n lg(n) size down to linear growth ...

14. As example, suppose that we have a database with records containing numbers from 1 to N.
And so we're compressing hashes of these numbers.
But it's quite clear that we can make a small fixed-size program that would output all these hashes,
so it is possible to compress them better than your 2.33*N limit.

15. Ok, for a "database" of the natural numbers from 1 to n (of some numberphile) ... we just don't need hashes.
However, as I have started this thread: "Imagine there is a database (e.g. dictionary) pointed by hashes: hash of an object indicates its location." - it is for situations when we need hashes.

E.g. imagine there is a database of vehicles - assuming each of them has a unique plate number; still, these numbers are usually not just sequential, they can use letters etc.
So you hash a given plate number and use a data structure to locate the record of this vehicle based on this hash.
The question is the minimal overhead of such a structure to distinguish inside a database based on hash values - naively it requires ~lg(n) bits/element ... can it be reduced below 2.33 bits/element?

ps. the plate number itself could be used directly instead of a hash value - but you would still need e.g. a tree to locate the record based on this number.
So the question is: for general data (random labels/hashes), can this tree be compressed below 2.33 bits/element?
In other words, is i.i.d. Pr(0)=Pr(1)=1/2 the most efficient choice for the hash function? Maybe a different distribution or a larger-than-binary alphabet would be more efficient?

16. > E.g. imagine there is a database of vehicles - assuming each of them has a unique plate number,
> still these numbers are usually not just sequential, can use letters etc.

Now, this is exactly the case where RT would shine - how are plate numbers different from
passwords? In fact, plate numbers would be much easier to generate than passwords.

> So the question is: for general data (random labels/hashes), can this tree be compressed below 2.33 bits/element?

Sure, if you have a database of sha3 hashes (of unknown nature) and have to index it with md5 hashes,
then there would be no way to generate the data.
But _real_ "general data" is normally quite redundant, and it should be possible to develop a general-purpose
method to make use of it, instead of manual RT design.

17. I really don't understand what you mean by "indexing" using less than 2.33 bits/element?

You need to store hashes for all elements in the database, but you can reduce them to the shortest distinguishing (unique) prefixes and subtract the order (lg(n!)) - this way you can get down asymptotically to ~2.33 bits/element.

Please sketch a scheme for how you would get below it.

18. Well, it's one thing if you had some way to store these reduced hashes in the database fields, to use them as some kind of pointers.
But, afaik, "subtract order" can only mean that we have to compress an array of hashes separately from the database.

And in this case, if the database is redundant enough (as a whole - individual records may look random), it may be better
to compress the database itself, then compute the hashes from it.
It's just an example though - the RT approach would rely on the same data redundancy.

19. Yes, the task here is to still use them as distinguishing pointers - that just 2.33 bits/element is sufficient was surprisingly low for me (especially that it's not ~lg(n) bits/element).

You build the tree of all hashes as in figure a) or d) in the first post above.
Then, to subtract the order, compress this binary tree - you know the total number of leaves (hashes); for the root you write how many of them go left, and so on recursively.

After decoding the tree, you enumerate the leaves from left to right, and this number indicates the pointed object.

20. Sure, you can compress the whole table of reduced hashes like this (and establish their mapping to database records).
But when actually using these hashes to reference the records (e.g. in another table), we'd not be able to "subtract order".

21. Indeed, as I have written - this is not for random access, only for compression, as in storage and transmission.
To use this tree, it rather needs to be fully decompressed into a less memory-friendly structure (the enumeration of the final leaves is n lg(n) itself).

However, there might be some approximations (?)
There is the unary coding and trit trick (James Dow Allen) for subtracting the information about order for fixed-length sequences. It was not exactly random access even with the trit - we would rather need some additional structure to locate the suffix of a given number.
Here the situation is even more difficult, as the hashes no longer have fixed length - the trick of dividing into fixed-length buckets no longer works ...

22. > Indeed, as I have written - this is not for random access, only for compression as storage and transmission.

Well, in this case it's the same as simply compressing a set of hashes,
like what we'd need to compress https://thepiratebay.org/torrent/733...SHA1_passwords
or https://thepiratebay.org/torrent/658..._rainbow_table

So my point about being able to improve your method by replacing some of the hashes
with actual data (if it can be compressed better) still remains.

> subtracting information about the order for fixed length sequences.

Well, we can just as well use stegdict for fixed blocks - it also doesn't

23. Compressing a set of hashes only subtracts the lg(n!) bits about the order - it is not sufficient to get from n lg(n) to linear entropy growth - cases b,c in the figure from the first post.
As I have written a few times here - you additionally need to reduce the hashes, e.g. to the shortest distinguishing prefixes - cases a,d from the figure.

24. No, as I keep saying, in many cases it would make sense to compress the actual data from which the hashes are computed.

25. But it would take muuuuuuch more than these 2.33 bits/element ... this thread was intended for the lower bound.

26. But what it would take depends on actual database contents.
Isn't it obvious that in your plate number example, a large set of XXX#### strings can be compressed better than their hashes?

27. Originally Posted by Shelwien
But what it would take depends on actual database contents.
As we are agnostic about the content (just working on pseudorandom hashes), it should depend only on its size: n.
So naively the required length of a hash is lg(n); for example, the length of phone numbers grows with lg(number of users).
It can be found analytically (slides, arxiv); for example, for the minimal prefix tree (a in the figure), the average hash length (depth of such a random tree) is
D_n = sum_{d=0..infinity} (1 - (1 - 2^{-d})^{n-1}) ~ lg(n) + 1.33275
(plus some strange oscillating terms)

However, compressing all of them together, we can subtract their order - the entropy of such a random tree is:
H_n = n D_n - lg(n!) ~ 2.77544n
This linear growth (not n lg(n)) was a big surprise for me.

To reduce to 2.33275n (from tree a to tree d), we forget the direction of degree-1 nodes (all hashes in our database go the same way there) - we save 1 bit per such node, and it can be found that the expected number of such nodes is asymptotically lg(e/2)·n ~ 0.4427n.
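These asymptotics are easy to check numerically; a short Python sketch (my own) evaluates the exact sum for D_n and the resulting entropy per element:

```python
import math

def D(n, dmax=200):
    """Average leaf depth of the random prefix tree:
    D_n = sum_{d>=0} (1 - (1 - 2^-d)^(n-1))."""
    return sum(1 - (1 - 2.0 ** -d) ** (n - 1) for d in range(dmax))

def H_per_element(n):
    """(n*D_n - lg(n!)) / n, with lg(n!) computed via the log-gamma function."""
    return D(n) - math.lgamma(n + 1) / math.log(2) / n

n = 1 << 16
print(D(n) - math.log2(n))   # close to 1.33275
print(H_per_element(n))      # close to 2.77544
```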

28. ## Thanks:

Shelwien (16th April 2017)

https://cstheory.stackexchange.com/q...-using-less-th

Update: let's formalize the problem: imagine you get n random, potentially infinite bit sequences (hash values) - what is the minimal expected size of a structure which determines some injection from this set of hashes to {1,..,n}?

Looking at this formulation, I have just got down from ~2.33 to ~1.44 bits/element; however, the current algorithm is absolutely impractical.

So while the above construction used only prefixes, it might turn out to be more efficient to use different positions in the obtained hashes. A more general approach might be:

1. Based on the obtained hashes, choose better (more distinguishing) positions and encode these positions.
2. For these chosen positions, build the minimal distinguishing tree as above.

Previously we had a trivial step 1 (just prefixes); now let's look at the approach with a trivial step 2: choose positions such that we get a complete binary tree (constant depth, assume n = 2^m) - then we don't need to encode the tree.

The first position should split the hashes into one half with 0 and one half with 1; the probability of finding it is {n \choose n/2}/2^n. The second should have half 0s and half 1s within both previous subsets, and so on - finally, the bits on the positions chosen this way uniquely determine all n hashes.

We can encode the distance to each successive such position - it has geometric distribution (1-p)^k p with entropy h(p) = (-p lg(p) - (1-p) lg(1-p))/p.

Finally, summing the costs of writing the positions halving all previous subsets, we get (n = 2^m):

sum_{i=1}^m h( ({2^i \choose 2^{i-1}} / 2^{2^i})^{2^{m-i}} )

Numerics suggest that it quickly approaches lg(e)·n ≈ 1.4427n bits (?)
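One way to run these numerics (my own sketch, not from the thread): for small p the geometric-coding cost is h(p) ≈ lg(1/p) + lg(e), so for large m the per-element cost tends to sum_i 2^-i · lg(1/q_i), where q_i = C(2^i, 2^{i-1})/2^{2^i}; evaluating this sum in the log domain via lgamma:

```python
import math

def lg_q(i):
    """lg of q_i = C(2^i, 2^(i-1)) / 2^(2^i): the chance that one random
    bit-column splits a subset of size 2^i exactly in half."""
    s = 1 << i
    lg_binom = (math.lgamma(s + 1) - 2 * math.lgamma(s // 2 + 1)) / math.log(2)
    return lg_binom - s

# level i has 2^(m-i) subsets of size 2^i, i.e. 2^-i positions per element;
# the series converges quickly, so 40 terms suffice
cost = sum(2.0 ** -i * -lg_q(i) for i in range(1, 40))
print(cost)   # close to lg(e) = 1.4427
```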

While it is completely impractical - requiring testing an exponential number of positions - there might be some practical intermediate algorithms using both 1. and 2. It might be a valuable minimum description length lesson: choosing features while still keeping the cost of their description in consideration.

There remains the question whether this 1.4427 bits/element is optimal? (Now I doubt it)

To enforce the previous 2.33275 bits/element, we might require that the original lexicographic order is maintained.
