
Thread: Perfect Hash Function to Hash Strings

  1. #1
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Perfect Hash Function to Hash Strings

    I want a perfect hash function to hash strings. My use case: the function should take a string and compute an integer hash. I will store both the hash and the corresponding string in a dictionary. The main requirement is that it be collision-free, i.e. a 'perfect hash function'.

    Kindly suggest open-source implementations of this kind in C.

    TIA.

  2. #2
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    749
    Thanks
    215
    Thanked 281 Times in 164 Posts
    Quote Originally Posted by joey View Post
    I want a perfect hash function to hash strings. My use case: the function should take a string and compute an integer hash. I will store both the hash and the corresponding string in a dictionary. The main requirement is that it be collision-free, i.e. a 'perfect hash function'.

    Kindly suggest open-source implementations of this kind in C.

    TIA.
    Is the system related to data compression?

  3. #3
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    I looked into PHFs a few years ago (not for compression), and IIRC http://cmph.sourceforge.net/index.html is probably going to be your best bet. That said, AFAIK every PHF requires knowledge of the entire key space, so I'm not really sure how useful it would be for data compression…

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by joey View Post
    I want a perfect hash function to hash strings. My use case: the function should take a string and compute an integer hash. I will store both the hash and the corresponding string in a dictionary. The main requirement is that it be collision-free, i.e. a 'perfect hash function'.

    Kindly suggest open-source implementations of this kind in C.

    TIA.
    There is NO PERFECT HASH.

    That said, if you bijectively map all strings to all strings and allow the hash itself to be of variable length, then you could call that a perfect hash with no collisions.

    But if your hash has a fixed finite length, then the counting argument (pigeonhole principle) alone shows it can't be collision-free.

  5. #5
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by biject.bwts View Post
    There is NO PERFECT HASH.
    To be clear, "perfect hash function" is a well-defined concept, and there absolutely are perfect hash functions. I think what biject.bwts means to say is that there are no perfect hash functions for infinite sets. PHFs require knowledge of the entire key space when they are generated, which greatly limits their applications.

  6. #6
    Member
    Join Date
    Jun 2014
    Location
    Ro
    Posts
    23
    Thanks
    4
    Thanked 6 Times in 5 Posts
    I thought about this some time ago. You can have a perfect hash function using a hash result of variable length. For instance, you can train a statistical compressor on some (or all) of the input strings you expect. The entire model (and coder) then becomes the hash generator; you can save both so the hash generator doesn't change. After that, you can compress arbitrary strings (using the model, now treated as static), and the compressed output is the hash. With a good compressor you will probably get about two bits per byte for regular text, so the hash is about a quarter of the string's length.
    Any result that exceeds your maximum hash length can be treated as a key in an ordinary (imperfect) hash map.
    Keys shorter than the maximum hash length can be padded.
    Let me know what you think, advantages and disadvantages. Two big disadvantages, as I see it: you need to store the model (the larger it is, the better the short codes you can expect), and computing a hash is slow, a lot more work than just multiplying and flipping/shifting bits.

  7. #7
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    If you have a low to medium number of strings known at compile time, you can try "gperf".
    If you have a large number of strings, you can try cmph. Personally, I see no practical advantage over ordinary hashing in this case,
    because cmph generates large tables, and memory access will dominate the hash lookup.

    Possible simple solutions for the dictionary:
    1 - Variable length strings:
    - Sorted array of offsets to string entries. Use the unique offset or array index as the string-ID and use binary search to retrieve the string-ID from the key string.
    - Hashed (e.g. linear probing) array of offsets to string entries. Use the unique array index as the string-ID and a hash lookup to retrieve the string-ID from the key string.
    2 - Fixed length strings:
    - Sorted array of fixed length strings
    - Hashed array of fixed length string entries
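    Option 1's sorted-array variant can be sketched as follows. This is an illustrative example (the key set and function names are mine, not from any library): the array index of a key in a sorted array of string pointers serves as its string-ID, retrieved with standard `bsearch`.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sorted array of string keys; the array index serves as the string-ID. */
    static const char *keys[] = { "apple", "banana", "cherry", "mango" };
    static const size_t n_keys = sizeof(keys) / sizeof(keys[0]);

    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    /* Returns the string-ID (array index), or -1 if the key is unknown. */
    static int string_id(const char *key)
    {
        const char **found = bsearch(&key, keys, n_keys, sizeof(keys[0]), cmp);
        return found ? (int)(found - keys) : -1;
    }

    int main(void)
    {
        printf("%d\n", string_id("cherry"));  /* prints 2 */
        printf("%d\n", string_id("durian")); /* prints -1 */
        return 0;
    }
    ```

    Lookup is O(log n) string comparisons; the ID is stable as long as the key set (and its sort order) doesn't change.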

  8. #8
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by dnd View Post
    If you have low to medium number of known strings at compile time, then you can try "gperf".
    What is the approximate number for low to medium? I typically have 50K or 1 lakh (100K) strings to hash at once.

  9. #9
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Medium size is no more than 1000-2000 entries.
    I think it is better to use a pointer/offset array as proposed.
    The table size should be approximately 2x the number of your string keys.
    Use hashing with e.g. linear probing for accessing the hash table.
    You will have empty entries, but this is not an issue in your case.

    You can also use other libraries like khash in klib.
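    The linear-probing scheme suggested above can be sketched like this (the hash function, table size, and names are illustrative choices, not prescribed by any library; a real table for 100K keys would of course be sized and allocated accordingly):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    #define TABLE_SIZE 16  /* roughly 2x the number of keys, as suggested */

    static const char *table[TABLE_SIZE];  /* NULL = empty slot */

    /* FNV-1a: a simple, well-known string hash (not a perfect hash). */
    static unsigned long fnv1a(const char *s)
    {
        unsigned long h = 2166136261UL;
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619UL; }
        return h;
    }

    /* Insert with linear probing; returns the slot index, usable as the
       string-ID. Assumes the table never becomes completely full. */
    static int insert(const char *key)
    {
        unsigned long i = fnv1a(key) % TABLE_SIZE;
        while (table[i] && strcmp(table[i], key) != 0)
            i = (i + 1) % TABLE_SIZE;   /* collision: probe the next slot */
        table[i] = key;
        return (int)i;
    }

    /* Look up the string-ID for a key; -1 if absent. */
    static int lookup(const char *key)
    {
        unsigned long i = fnv1a(key) % TABLE_SIZE;
        while (table[i]) {
            if (strcmp(table[i], key) == 0) return (int)i;
            i = (i + 1) % TABLE_SIZE;
        }
        return -1;
    }

    int main(void)
    {
        int id = insert("alpha");
        insert("beta");
        assert(lookup("alpha") == id);
        assert(lookup("gamma") == -1);
        puts("ok");
        return 0;
    }
    ```

    As noted, some slots stay empty at 50% load, but lookups stay fast and no perfect hash generation step is needed.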

  10. #10
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    51
    Thanks
    0
    Thanked 12 Times in 9 Posts
    Do you think a cuckoo hash would be acceptable? It uses (at least) two hash functions, and when needed it rearranges the strings (spending computational time) so that a maximum of n (= number of functions) probes suffices.
    The more functions you use, the less hash space you need. I implemented it for a database you can find at
    http://digilander.libero.it/borsa77
    under the link "most bastard dbm" (source in Pascal and executable for DOS).
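    A minimal two-table cuckoo hash, to make the "maximum of n probes" property concrete. This is a sketch in C with arbitrary toy hash functions and table sizes (not taken from the dbm linked above); a real implementation would rehash with new functions when an insertion cycles.

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    #define SLOTS 16
    #define MAX_KICKS 32

    static const char *t1[SLOTS], *t2[SLOTS];  /* NULL = empty */

    /* Two independent string hashes (illustrative choices). */
    static unsigned h1(const char *s)
    {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % SLOTS;
    }
    static unsigned h2(const char *s)
    {
        unsigned h = 2166136261u;
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h % SLOTS;
    }

    /* Each key lives in t1[h1(key)] or t2[h2(key)]; an evicted key is
       kicked to its alternate slot. Returns 0 on success, -1 when a
       rebuild with new hash functions would be needed. */
    static int insert(const char *key)
    {
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            const char *evicted = t1[h1(key)];
            t1[h1(key)] = key;
            if (!evicted) return 0;
            key = evicted;
            evicted = t2[h2(key)];
            t2[h2(key)] = key;
            if (!evicted) return 0;
            key = evicted;
        }
        return -1;  /* cycle detected: rearrange/rehash here */
    }

    /* Lookup needs at most two probes, one per table. */
    static int contains(const char *key)
    {
        return (t1[h1(key)] && strcmp(t1[h1(key)], key) == 0) ||
               (t2[h2(key)] && strcmp(t2[h2(key)], key) == 0);
    }

    int main(void)
    {
        assert(insert("alpha") == 0);
        assert(insert("beta") == 0);
        assert(contains("beta"));
        assert(!contains("delta"));
        puts("ok");
        return 0;
    }
    ```

    The guaranteed two-probe lookup is the appeal; the cost is moved into the occasional rearrangement on insert.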

  11. #11
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @nemequ: CMPH is licensed under the LGPL. Can it be used for commercial purposes?
    Last edited by joey; 25th January 2016 at 16:53.

  12. #12
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by joey View Post
    @nemequ: CMPH is licensed under the LGPL. Can it be used for commercial purposes?
    Yes.

  13. #13
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @nemequ: I tried using cmph and it works fine if my keys are all distinct. If I give it data with duplicates, it throws the message "Unable to create perfect hash function".

    Should we give it only distinct values?
    My data will definitely have repeated values. How do I handle this?

  14. #14
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by dnd View Post
    You can also use other libraries like khash in klib.
    Can klib take an array of strings with duplicates in it and produce a hash? I couldn't find this info in klib's readme.
    Thanks

  15. #15
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by joey View Post
    @nemequ: I tried using cmph and it works fine if my keys are all distinct. If I give it data with duplicates, it throws the message "Unable to create perfect hash function".

    Should we give it only distinct values?
    My data will definitely have repeated values. How do I handle this?
    I think you're misunderstanding how perfect hash functions work. You provide the generator (CMPH, gperf, etc.) with a list of *all* keys. They then create a function which will accept any one of those keys (and *only those keys*) and generate a unique hash; for minimal perfect hashing the hash is in the range [0, n_keys - 1] (or maybe [1, n_keys]), for non-minimal perfect hashing the range is larger, but you're still guaranteed that there will be no duplicate hashes.

    Since there are no duplicates, you can use the hash function to create a very simple hash table which doesn't have to worry about duplicates. If you're working in-memory this could be a simple array, and you would just use arr[generated_hash_function(key)] to access the value and all operations would be O(1), no exceptions. This is in contrast to a standard hash table where many keys will hash to the same bucket, and if that bucket is full it will have to do some sort of collision resolution, which will cause performance to degrade.

    During the generation phase cmph requires distinct values because it isn't creating a hash table for you, it is creating a hash function. Once you have the hash function, using it on duplicate keys will (obviously) simply hash to the same value.
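    The `arr[generated_hash_function(key)]` idea above can be sketched as follows, with `generated_hash` standing in for whatever function a generator like CMPH or gperf would emit for a fixed key set (the key set, dispatch logic, and names here are hypothetical, not real generator output):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Stand-in for a generated minimal perfect hash over the fixed key
       set {"red", "green", "blue"}: maps each key to a unique index in
       [0, 2]. A real generator emits something equivalent automatically. */
    static unsigned generated_hash(const char *key)
    {
        /* Toy dispatch on the first letter; valid only for these keys. */
        switch (key[0]) {
        case 'r': return 0; /* red */
        case 'g': return 1; /* green */
        default:  return 2; /* blue */
        }
    }

    int main(void)
    {
        /* Values live in a plain array indexed directly by the hash:
           O(1) access, no collision resolution, because the hash is
           perfect over the fixed key set. */
        int wavelengths[3];
        wavelengths[generated_hash("red")]   = 700;
        wavelengths[generated_hash("green")] = 550;
        wavelengths[generated_hash("blue")]  = 450;

        assert(wavelengths[generated_hash("green")] == 550);
        /* Caveat: a key outside the original set hashes to an arbitrary
           index; store the key alongside the value and compare if
           unknown keys can ever be queried. */
        printf("red -> %d\n", wavelengths[generated_hash("red")]);
        return 0;
    }
    ```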

    Quote Originally Posted by joey View Post
    Can Klib take array of strings with duplicates in it and produce hash? I couldn't find this info in readme of klib.
    klib uses a traditional hash table implementation, not a perfect hash function: it resolves collisions with an open addressing scheme (double hashing, to be precise) rather than avoiding them altogether.

  16. #16
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    So there is no way I can generate a perfect hash function (table) if I have duplicates in my data?
    (If I preprocess the data to remove duplicates, it will further decrease my overall performance, which I can't afford.)

  17. #17
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Joey,

    Perfect hash functions are for situations where you have a known, finite set of all possible unique keys. You have to know this ahead of time.

    If you have a known, finite set of all possible keys, but it has duplicates, you can deduplicate it before handing the set to the perfect hash function generator. Just keep the first (or last) occurrence of each key. You have to convert your bag of keys into a set.

    Then when you use the perfect hash function in your program, it is okay if the same key(s) occur any number of times, as long as they're all in the set that you gave the perfect hash function generator. If any new keys occur that you hadn't told the perfect hash function generator about, the hash function will fail.
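    The one-time deduplication step is just a sort plus a pass dropping adjacent repeats, O(n log n) overall; a sketch (names are illustrative):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    /* Sort the key array and drop adjacent duplicates in place.
       Returns the number of distinct keys. */
    static size_t dedup(const char **keys, size_t n)
    {
        if (n == 0) return 0;
        qsort(keys, n, sizeof(keys[0]), cmp);
        size_t out = 1;                       /* keys[0] always survives */
        for (size_t i = 1; i < n; i++)
            if (strcmp(keys[i], keys[out - 1]) != 0)
                keys[out++] = keys[i];
        return out;  /* keys[0..out-1] is now a duplicate-free set */
    }

    int main(void)
    {
        const char *keys[] = { "b", "a", "b", "c", "a" };
        size_t n = dedup(keys, 5);
        assert(n == 3);  /* "a", "b", "c" */
        for (size_t i = 0; i < n; i++)
            puts(keys[i]);
        return 0;
    }
    ```

    For 100K keys this runs well under a second, and it only has to happen once, when the perfect hash function is generated, not on every lookup.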

  18. #18
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Will XXHASH assure that any two distinct strings produce two distinct hash values?

    What is the chance of collision in xxhash?

  19. #19
    Member
    Join Date
    Nov 2015
    Location
    Śląsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Don't ask the same question multiple times.
    XXHASH will not give such an assurance. You need a perfect hash for that.

    The chance of collision? On random data - low.

