# Thread: Compressing hash database from n lg(n) to ~2.3327n bits (distinguishability constant)

1. ## Compressing hash database from n lg(n) to ~2.3327n bits (distinguishability constant)

Imagine there is a database (e.g. a dictionary) indexed by hashes: the hash of an object indicates its location.
Obviously, the length of the hash (label) has to grow with lg(n) - for example, to distinguish among a billion objects we need >30 bits per object.
For all objects together we need n lg(n) bits.
However,
- we don't need their order, which lets us save lg(n!) ~ n lg(n) bits,
- if we only need to distinguish inside a given database (no questions about objects outside it), we can cut hashes to the first distinguishing bit - creating a prefix tree (trie), where all leaves (hashes) are children of degree-2 nodes.

Thanks to these two savings, it turns out that we can reduce the entropy from n lg(n) to linear growth: ~2.77544n bits.
It can be further reduced by observing that some internal bits are not distinguishing (degree-1 nodes) - we can save 1 bit per such node.
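To make the trie saving concrete, here is a small Python sketch (my own illustration, not from the slides) that computes, for random hashes, the length of the shortest prefix of each hash that distinguishes it from all the others - the average comes out close to lg(n) + 1.33, not the full hash width:

```python
import random

def distinguishing_prefix_lengths(hashes):
    """For each hash (a '0'/'1' string), the length of the shortest prefix
    that is not shared as a prefix by any other hash in the set."""
    lengths = [0] * len(hashes)

    def split(items, depth):
        # items: (original index, bit string) pairs sharing a common prefix
        if len(items) == 1:
            idx, _ = items[0]
            lengths[idx] = depth      # alone in its subtree: depth bits suffice
            return
        zeros = [it for it in items if it[1][depth] == '0']
        ones = [it for it in items if it[1][depth] == '1']
        if zeros:
            split(zeros, depth + 1)
        if ones:
            split(ones, depth + 1)

    split(list(enumerate(hashes)), 0)
    return lengths

random.seed(1)
n = 1 << 12
hashes = [''.join(random.choice('01') for _ in range(64)) for _ in range(n)]
avg = sum(distinguishing_prefix_lengths(hashes)) / n
print(avg)   # close to lg(4096) + 1.33 = 13.33
```

Storing these prefixes instead of the full hashes is the first saving; subtracting the lg(n!) bits of order is the second.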

Finally, we get asymptotically ~2.33275n bits, and this "distinguishability constant" (2.33275 bits/element) can be found analytically (3/2 + gamma·lg(e); can it be reduced further?):
slides: https://www.dropbox.com/s/ugmduw3c9pqsnnr/hash3.pdf
arxiv: https://arxiv.org/pdf/1206.4555

So the question is whether it could be put into practice - e.g. to reduce ~10x the size of hashes for a database with millions of entries?
The big problem is that it requires compressing the binary tree - making it terrible for random access ... but it could still be beneficial for storage and transmission.
And there might be some practical approximations...?

2. ## Thanks (2):

Shelwien (13th April 2017),xinix (13th April 2017)

3. I actually considered something like that as a replacement for entropy coding.
Hashing basically makes it possible to produce near-optimal codes for data strings without any complicated probability estimation.

But for compression of a large set of hash values, I'd likely consider the Rainbow Table approach first.
The plain RT method might not work if the hashes in question are computed from largish blocks of data,
but then I invented a delta-RT, where hashes are used for generating diffs instead of actual blocks,
and chains can be constructed from hashes of blocks with similar content (diff_size < log2(n_hashes)).

4. ## Thanks (2):

Jarek (13th April 2017),xinix (13th April 2017)

5. Shelwien, I remember your thread about handling the lg(n!) bits of ordering information - that besides "subtracting it" by some compression, we can store some information in this permutation instead.
This is one step (with a nice approximation attached by James Dow Allen) ... which still needs n lg(n) size.
For example, to halve the storage for 32-bit hashes you would need 178 thousand of them; for 64-bit hashes you need 11 billion.
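These figures can be checked numerically; a quick Python sketch (my own, using lgamma for lg(n!)) searches for the smallest n where the order information lg(n!) reaches half of the n*width bits of storage:

```python
import math

def lg_fact(n):
    """lg(n!) computed via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def n_for_half(bits):
    """Smallest n where lg(n!) >= n * bits / 2, i.e. forgetting the order
    of n hashes of the given width saves half the storage."""
    n = 2
    while lg_fact(n) < n * bits / 2:   # exponential search for an upper bound
        n *= 2
    lo, hi = n // 2, n                  # then binary search
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if lg_fact(mid) < mid * bits / 2:
            lo = mid
        else:
            hi = mid
    return hi

print(n_for_half(32))   # about 178 thousand
print(n_for_half(64))   # about 11 billion
```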

Here I am talking about also including another step, to get from n lg(n) to linear growth: do not store the entire hash, only its distinguishing prefix (as in the left-most tree above for 4 hashes).
It no longer allows testing whether an element belongs to the database (huge false-positive rate), it only allows distinguishing elements inside the database.
My point is that while testing for membership in the database requires n lg(n) bits, linear growth is sufficient for distinguishing inside it.
Instead of e.g. 32-256 bit hashes, "2.33275-bit hashes" are asymptotically sufficient for distinguishing inside a database.

6. Yes, I think I understand it well enough, thanks.
Although I wouldn't use trits etc., but a generic range coder instead.
Just that, depending on the type of input blocks from which the hashes are computed, the RT method allows much higher compression, up to millions to one.
https://en.wikipedia.org/wiki/Rainbo...ed_hash_chains

7. I am not saying it is impossible, but it would be a big surprise if you could reduce the hash table below these 2.33 bits per element (the entropy of such a random tree) while still being able to distinguish them.
I don't see how rainbow tables could help here?

8. RT can compress N hashes (which can be linked into a chain) to just 2.
Depending on what the source of these hashes is, it might be possible to get a higher compression ratio than what you suggest.
For example, it would obviously work for password hashes.

9. Shelwien, I see RT as just a fancy way to brute-force collision attacks on hash functions - instead of just testing all or random plaintexts, combine them into chains by deriving each new plaintext as a "reduction function" of the previous hash (e.g. just taking some subsequence).
I don't see any magic here; analogously, we could just generate an arbitrarily long hash sequence - which could be "compressed" into the hashed object with a ratio as large as we want (e.g. a quadrillion) ... nor do I see how it could help with the task this thread was supposed to be about (?)

Once again, the task here is being able to distinguish (identify) elements inside a large database - generate a unique pseudorandom hash value for each object and use it e.g. to locate it inside the database.
The question is how many bits are required to store such a data structure, generated from the database, that allows identifying all objects inside it.

It should contain all hashes, but can exploit some reductions, e.g.
- remembering only the shortest distinguishing prefixes of the hashes (tree a. in the picture, also called a trie),
- forgetting the non-distinguishing internal bits of these prefixes - the direction at degree-1 nodes (tree d. in the picture, analogous to PATRICIA trees),
- finally compressing such a tree, forgetting the order of elements (subtracting lg(n!) bits).
This way we get linear entropy growth for this family of random trees (instead of the standard n lg(n)) - asymptotically "~2.33275-bit hashes" suffice to distinguish elements.

Some possible ways that could lead to reducing these 2.33275 bits are:
- instead of just using prefixes, use some further positions, optimized for a given database. However, these positions have to be stored, and we are talking about asymptotic behavior,
- instead of using an i.i.d. Pr(0)=Pr(1)=1/2 hash function, use one with a different probability distribution of bits (which can be realized with an entropy coder),
- instead of using a binary hash function, use e.g. a ternary one,
- ...
I haven't checked, but intuitively they shouldn't improve on these 2.33275 bits (maybe the first one) ... however, a formal proof of the nonexistence of a better method seems a kind of impossible task.

Could you really distinguish elements of a database below these asymptotically 2.33275 bits/element?
What is the real fundamental "distinguishability constant"?
I don't believe it can be reduced below linear growth - the question is whether these 2.33275 bits/element can be further reduced.

10. > Shelwien, I see RT as just a fancy way for brute force collision attack of hash functions

Sure, it is currently used for precomputing data for collision attacks.

But I see RT itself as a unique compression method.
It can be applied in any case where it's feasible to compress (e.g. with paq) a lot (90%+?)
of N relevant data blocks to log2(N/k) bits or less (k>=4 is the target chain length).

As a simpler application of a similar method, imagine processing enwik6 with a paq
model, then using it to generate 100M of data (by decoding random data with the paq model trained on enwik6),
then using the generated data as a dictionary for some LZ compressor.
I think such a dictionary would contain plenty of useful references which can't
be found in enwik6.

My idea was that e.g. with 128-bit hashes (presuming a password database, where it is
certainly possible to apply RT compression), it's enough to get an average hash chain length of 128
to reach "hash database size" = 2*n. Actual RTs use much longer chains.
Then it would also be possible to apply permutation-based compression to the chain descriptors.

> Once again, the task here is being able to distinguish (identify) elements
> inside a large database - generate a unique pseudorandom hash value for
> each object and use it e.g. to localize it inside the database.

Sure. But in the end, as far as I understand, you're still compressing a separate table of
hashes, which are computed from the records of a large database.
In which case, as I said, it should be possible to apply RT-based compression to this
table of hashes, depending on the nature of the original database records.
Do you at least agree that it's possible if it's a password database?

And if it's not, but is a sufficiently large database with the usual redundant data,
then we can likely apply my delta-RT idea - basically constructing a separate
RT instance for each cluster of database records within a limited edit distance.

11. I have to admit that I don't see how to use RT for data compression.
The chains are random sequences identified by the initial value - choosing the best (closest) one in a smaller space is exactly what rate-distortion does (the theoretical approach to lossy compression) ... but I don't think that's your point.

Originally Posted by Shelwien
Sure. But in the end, as far as I understand, you're still compressing a separate table of
hashes, which are computed from records of a large database.
By compressing the table of hashes you only subtract lg(n!) bits, but it is not enough to get to linear growth ... for that you also need to cut these hashes and finally compress the binary tree.

To compress such a binary tree with n leaves, you encode how many of them go to the left child; the remaining ones go to the right.
Then you encode both subtrees recursively.
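A minimal Python sketch of this recursive scheme (my own illustration): the shape of the prefix tree is serialized as a sequence of left-counts, and the decoder rebuilds the distinguishing prefixes in left-to-right (lexicographic) order, so a leaf's position in this enumeration serves as the element's index:

```python
import random

def encode_tree(hashes, depth=0):
    """Serialize the prefix tree of distinct bit-string hashes as a list of
    left-counts: at each node, how many hashes continue with bit '0'."""
    if len(hashes) <= 1:
        return []                        # a lone hash needs no further counts
    left = [h for h in hashes if h[depth] == '0']
    right = [h for h in hashes if h[depth] == '1']
    return [len(left)] + encode_tree(left, depth + 1) + encode_tree(right, depth + 1)

def decode_leaves(counts, n):
    """Rebuild the shortest distinguishing prefixes, left to right."""
    it = iter(counts)
    def rec(n, prefix):
        if n == 0:
            return []
        if n == 1:
            return [prefix]
        k = next(it)                     # left-count, consumed in preorder
        return rec(k, prefix + '0') + rec(n - k, prefix + '1')
    return rec(n, '')

random.seed(0)
n = 256
hashes = [''.join(random.choice('01') for _ in range(64)) for _ in range(n)]
leaves = decode_leaves(encode_tree(hashes), n)
# the i-th leaf prefix identifies the i-th hash in lexicographic order
assert all(h.startswith(p) for h, p in zip(sorted(hashes), leaves))
```

Entropy-coding each left-count k against the distribution C(m,k)/2^m of its node (of m hashes) is what should bring the total down to the ~2.77n bits discussed above; this sketch only round-trips the structure.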

12. > I have to admit that I don't see how to use RT for data compression.

Well, I guess my main point here was that hashes are hard to compress,
and it's perfectly possible that the actual data records for which the hashes are computed
could be compressed to a smaller size than their hashes (btw, the permutation thing applies here too).

RT simply provides a way to actually find records by their hashes, with a time/size tradeoff.

Note that it's possible to combine the methods too - i.e. for redundant records compress the actual records,
while for the remaining ones apply your method.
So it doesn't look like your constant is actually a constant :)

13. Originally Posted by Shelwien
>So it doesn't look like your constant is actually a constant
So please suggest a data representation below the 2.33275 bits/element "distinguishability constant" which is able to perform the task in question: distinguishing between elements of a given (large) database.

Let's say you could use RT to retrieve an object from a hash ... however, such a 2.33275-bit hash would only allow you to distinguish between 2^2.33275 ~ 5 objects ... to get such a tiny average cost per hash you need to work on their collection as a whole.
For me it was a big surprise that a set of hashes can be reduced from the standard n lg(n) size down to linear growth ...

14. As example, suppose that we have a database with records containing numbers from 1 to N.
And so we're compressing hashes of these numbers.
But it's quite clear that we can make a small fixed-size program that would output all these hashes,
so it is possible to compress them better than your 2.33*N limit.

15. Ok, for a "database" of the natural numbers from 1 to n (of some numberphile) ... we just don't need hashes.
However, as I have started this thread: "Imagine there is a database (e.g. dictionary) pointed by hashes: hash of an object indicates its location." - it is for situations when we need hashes.

E.g. imagine there is a database of vehicles - assuming each of them has a unique plate number; still, these numbers are usually not just sequential, they can use letters etc.
So you hash a given plate number and use a data structure to locate the record of this vehicle based on this hash.
The question is the minimal overhead of such a structure to distinguish inside a database based on hash values - naively it requires ~lg(n) bits/element ... can it be reduced below 2.33 bits/element?

ps. the plate number itself could be used directly instead of a hash value - but you would still need e.g. a tree to locate the record based on this number.
So the question is: for general data (random labels/hashes), can this tree be compressed below 2.33 bits/element?
In other words, is i.i.d. Pr(0)=Pr(1)=1/2 the most efficient choice for the hash function? Maybe a different distribution or a larger-than-binary alphabet would be more efficient?

16. > E.g. imagine there is a database of vehicles - assuming each of them has a unique plate number,
> still these numbers are usually not just sequential, can use letters etc.

Now, this is exactly the case where RT would shine - how are plate numbers different from
passwords? In fact, plate numbers would be much easier to generate than passwords.

> So the question is: for general data (random labels/hashes), can this tree be compressed below 2.33 bits/element?

Sure, if you have a database of sha3 hashes (of unknown nature) and have to index it with md5 hashes,
then there would be no way to generate the data.
But _real_ "general data" is normally quite redundant, and it should be possible to develop a general-purpose
method to make use of it, instead of manual RT design.

17. I really don't understand what you mean by "indexing" using less than 2.33 bits/element?

You need to store hashes for all elements in the database, but you can reduce them to the shortest distinguishing (unique) prefixes and subtract the order (lg(n!)) - this way you can get down asymptotically to ~2.33 bits/element.

Please sketch a scheme for how you would get below it.

18. Well, it's one thing if you had some way to store these reduced hashes in the database fields, to use them as some kind of pointers.
But, afaik, "subtract order" can only mean that we have to compress an array of hashes separately from the database.

And in this case, if the database is redundant enough (as a whole - individual records may look random), it may be better
to compress the database itself, then compute the hashes from it.
It's just an example though - the RT approach would rely on the same data redundancy.

19. Yes, the task here is to still use them as distinguishing pointers - that just 2.33 bits/element is sufficient was surprisingly low for me (especially that it's not ~lg(n) bits/element).

You build the tree of all hashes as in figure a) or d) in the first post above.
Then, to subtract the order, compress this binary tree - you know the total number of leaves (hashes); for the root you write how many of them go left, and so on recursively.

After decoding the tree, you enumerate the leaves from left to right, and this number indicates the pointed object.

20. Sure, you can compress the whole table of reduced hashes like this (and establish their mapping to database records).
But when actually using these hashes to reference the records (e.g. in another table), we'd not be able to "subtract order".

21. Indeed, as I have written - this is not for random access, only for compression, as in storage and transmission.
To use this tree, it rather needs to be fully decompressed into a less memory-friendly structure (the enumeration of the final leaves is n lg(n) itself).

However, there might be some approximations (?)
There is the unary coding and trit trick (James Dow Allen) for subtracting the information about order for fixed-length sequences. It was not exactly random access even with the trit - we would rather need some additional structure to locate the suffix of a given number.
Here the situation is even more difficult, as the hashes no longer have fixed length - the trick of dividing into fixed-length buckets no longer works ...

22. > Indeed, as I have written - this is not for random access, only for compression as storage and transmission.

Well, in this case it's the same as simply compressing a set of hashes,
like what we'd need to compress https://thepiratebay.org/torrent/733...SHA1_passwords
or https://thepiratebay.org/torrent/658..._rainbow_table

So my point about being able to improve your method by replacing some of the hashes
with actual data (if it can be compressed better) still remains.

> subtracting information about the order for fixed length sequences.

Well, we can just as well use stegdict for fixed blocks - it also doesn't

23. Compressing a set of hashes only subtracts the lg(n!) bits about the order - it is not sufficient to get from n lg(n) to linear entropy growth - cases b,c in the figure from the first post.
As I have written a few times here - you additionally need to reduce the hashes, e.g. to the shortest distinguishing prefixes - cases a,d from the figure.

24. No, as I keep saying, in many cases it would make sense to compress the actual data from which the hashes are computed.

25. But it would take muuuuuuch more than these 2.33 bits/element ... this thread was intended for the lower bound.

26. But what it would take depends on actual database contents.
Isn't it obvious that in your plate number example, a large set of XXX#### strings can be compressed better than their hashes?

27. Originally Posted by Shelwien
But what it would take depends on actual database contents.
As we are agnostic about the content (just working on pseudorandom hashes), it should depend only on its size: n.
So naively the required length of a hash is lg(n); for example, the length of phone numbers grows with lg(number of users).
It can be found analytically (slides, arxiv); for example, for the minimal prefix tree (a in the figure), the average hash length (depth of such a random tree) is
D_n = sum_{d=0..infinity} (1 - (1 - 2^{-d})^{n-1}) ~ lg(n) + 1.33275
(plus some strange oscillating terms)

However, compressing all of them together, we can subtract their order - the entropy of such a random tree is:
H_n = n D_n - lg(n!) ~ 2.77544n
This linear growth (not n lg(n)) was a big surprise for me.

To reduce to 2.33275n (from tree a to tree d), we forget the direction of degree-1 nodes (all hashes in our database go the same way there) - we save 1 bit per such node, and it can be found that the expected number of such nodes is asymptotically lg(e/2)·n ~ 0.4427n.
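These asymptotics are easy to check numerically; a short Python sketch (my own) evaluates the exact sum for D_n and the resulting entropy per element:

```python
import math

def D(n, dmax=200):
    """Average leaf depth of the random prefix tree:
    D_n = sum_{d>=0} (1 - (1 - 2^-d)^(n-1))."""
    return sum(1 - (1 - 2.0 ** -d) ** (n - 1) for d in range(dmax))

def H_per_element(n):
    """(n*D_n - lg(n!)) / n, with lg(n!) computed via the log-gamma function."""
    return D(n) - math.lgamma(n + 1) / math.log(2) / n

n = 1 << 16
print(D(n) - math.log2(n))   # close to 1.33275
print(H_per_element(n))      # close to 2.77544
```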

28. ## Thanks:

Shelwien (16th April 2017)

https://cstheory.stackexchange.com/q...-using-less-th

Update: let's formalize the problem: imagine you get n random, potentially infinite bit sequences (hash values) - what is the minimal expected size of a structure which determines some injection from this set of hashes to {1,..,n}?

Looking at this formulation, I have just got down from ~2.33 to ~1.44 bits/element; however, the current algorithm is absolutely impractical.

So while the above construction used only prefixes, it might turn out to be more efficient to use different positions in the obtained hashes. A more general approach might be:

1. Based on the obtained hashes, choose better (more distinguishing) positions and encode these positions.
2. For these chosen positions, build the minimal distinguishing tree as above.

Previously we had a trivial step 1 (just prefixes); now let's look at the approach with a trivial step 2: choose positions such that we get a complete binary tree (constant depth, assume n = 2^m) - then we don't need to encode the tree.

The first position should split the hashes into one half with 0 and one half with 1; the probability of finding it is {n \choose n/2}/2^n. The second should have half 0s and half 1s within both previous subsets, and so on - finally, the bits on the positions chosen this way uniquely determine all n hashes.

We can encode the distance to each successive such position - it has geometric distribution (1-p)^k p with entropy h(p) = (-p lg(p) - (1-p) lg(1-p))/p.

Finally, summing the costs of writing the positions halving all previous subsets, we get (n = 2^m):

sum_{i=1}^m h( ({2^i \choose 2^{i-1}} / 2^{2^i})^{2^{m-i}} )

Numerics suggest that it quickly approaches lg(e)·n ≈ 1.4427n bits (?)
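One way to run these numerics (my own sketch, not from the thread): for small p the geometric-coding cost is h(p) ≈ lg(1/p) + lg(e), so for large m the per-element cost tends to sum_i 2^-i · lg(1/q_i), where q_i = C(2^i, 2^{i-1})/2^{2^i}; evaluating this sum in the log domain via lgamma:

```python
import math

def lg_q(i):
    """lg of q_i = C(2^i, 2^(i-1)) / 2^(2^i): the chance that one random
    bit-column splits a subset of size 2^i exactly in half."""
    s = 1 << i
    lg_binom = (math.lgamma(s + 1) - 2 * math.lgamma(s // 2 + 1)) / math.log(2)
    return lg_binom - s

# level i has 2^(m-i) subsets of size 2^i, i.e. 2^-i positions per element;
# the series converges quickly, so 40 terms suffice
cost = sum(2.0 ** -i * -lg_q(i) for i in range(1, 40))
print(cost)   # close to lg(e) = 1.4427
```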

While it is completely impractical - requiring testing an exponential number of positions - there might be some practical intermediate algorithms using both 1. and 2. It might be a valuable minimum description length lesson: choosing features while still keeping the cost of their description in consideration.

There remains the question whether this 1.4427 bits/element is optimal? (Now I doubt it)

To enforce the previous 2.33275 bits/element, we might require that the original lexicographic order is maintained.
