Results 1 to 12 of 12

Thread: What type of hashing is this.

  1. #1
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts

    What type of hashing is this.

    For my compressor code I decided that I needed a replacement for my old method of hashing bytes (IE, I used direct hashing for 1-3 bytes and shifted and EOR additional bytes for 6 byte and nine byte searches)

    I don't know where I first saw it but the hashing method I choose to explore works are follows

    ulong hash = 0;
    for (i=0; i<length; i++) {
    hash = (hash << 13) ^ (hash << 19) ^ table[key[i]]; }

    Where key[] is a string of bytes to hash.

    And table[] has been filled with 256 (32 bit) values chosen to insure that no value is a rotated and/or an inverted version of any other value found inside table[].

    An addition property is that no two values of table[] if EXORed together will not generate any value already in table[], this insures that equal length strings with a single byte difference will not hash to the same value.

    Using Google and WIKI I now know this is not a CRC, Zobrist or Pearson hash scheme, does anyone have any pointers to this type of hashing?

    My main problem is I have not problem generating the needed values when using 32 bits, but 21 bits hashes for searching keep failing for me. I need to know if anyone has done the research to state what the properties of table[] need to be. If I can relax my constraints I probably can generate the 256 numbers needed but I need to know what constraints matter the most first.

    Also if you wish I can post my table[] or the code I use to generate it.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I don't know where the hash came from, but it is not really useful. The low 13 bits depend only on the last byte, and the low 26 bits depend on only the last 2 bytes. It is not even useful as a rolling hash because you would need a bigger table than if you just used a direct lookup. A better hash would be:

    hash = (hash + key[i]) * M;

    where M is almost any large, odd number, or if the low n bits of M are 0 then you get a rolling context hash so that the last kn bits of the hash depend on the last k bytes of the key. For example if M = 48 then the low 4 bits of M are 0, so the last 12 bits would be an order 3 context hash.

    By "almost", I mean there are bad choices for M like 0x80000001. But most random choices will work.
    Last edited by Matt Mahoney; 20th June 2010 at 00:48. Reason: math error

  3. #3
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Why only the lower 13 bits?

    Quote Originally Posted by Matt Mahoney View Post
    I don't know where the hash came from, but it is not really useful. The low 13 bits depend only on the last byte, and the low 26 bits depend on only the last 2 bytes. It is not even useful as a rolling hash because you would need a bigger table than if you just used a direct lookup. A better hash would be:most", I mean there are bad choices for M like 0x80000001. But most random choices will work.
    I don't understand your reply, key[i] indexes table[] and table[key[i]] is filled with 32 bit numbers thus each byte entered affects the *ENTIRE* 32 bit hash value.

    Or am I missing something else in your reply? And yes it does use a bigger table. The old hash used a table of 2^24 elements where as the present code uses a table of 2^32 elements. What I am trying to develop is a version of the above hash that uses 2^21 elements. That is why I am trying to find out even just the name of this type of hash so I can do some targeted Google searches.

    PS. Would it help if I posted the hash-seeds found in table[]?

  4. #4
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Big OOPs on my part!!!!

    I messed up, the code should read as follows:

    ulong hash = 0;
    for (i=0; i<length; i++) {
    hash = (hash << 13) ^ (hash >> 19) ^ table[key[i]]; }

    Sorry for the stupid mistake. No wonder you guys must think me a fool of a coder.

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    OK, that makes more sense. If you fill the table with random values you should have a good hash function.

  6. #6
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts
    First, I have to thank you. In writting out the message that is going to be posted below I finally in one place consolidated all my steps in developing the filtering of seed values over that last three weeks. I was starting to forget some of my early design choices and the reasons for them, so even if you find major faults in my reasoning I still come out ahead posting this here.

    ================================================== ============

    I have to disagree with the idea of filling the table[] with just random numbers. If you see an error in my logic please correct me.

    0) Assume nothing just because the numbers are supposed to be random. I first test for and filter out any repeated values from going into table[], it would be dumb to have two diffirent bytes with the same hash-seed value.

    1) The values in table[] should have major effects on the final hash value over it's entire range, a hash-seed with no bits or only a few bits ( examples 0, 1, 2, 1024, 65537.) will make small changes only. Instead I filter out and only add numbers to table[] with certain range of bits being set to one(1). So far I have tested tables filled with numbers which have (15,17) or (15,16,17) or (10-22) bits are set to one.

    >>> I really am having a hard time figuring out what the best combination of number of bits in the hash seeds are best.

    >>> Notice if I used a table[] with values where only 16 bits set this prevents 50% of the possible hash values from being created. It is problems like these I wish to avoid.

    2) When I fill table[] with hash-seeds I pre-filtered out numbers that are just a rotation of a previous hash-seed that are already in the said table. This is to prevent collision between strings that have just one/two byte diffirences. Example, if I had two strings "AB" and "BA" but the hash-seed of "A" happens if rotated by 13 bits to be the hash-seed of "B" then these two strings will always cancel each other out.

    >>> Since I plan only to have strings up 12 characters it would be easy to only filter those (12) possible rotations, but because I want a table[] that can have general use I test and filter out all 31 possible rotations.

    3) The last filtering I do as I add numbers to table[] is to test if that number and any of the numbers already in the table if EXOR together will generate any other hash-seed already in said table[]. And again testing is done against every possible rotation. The idea is to prevent two bytes generating the same hashing as another single byte. If the hash-seed for "A" when rotated by 13 bits and EXOR with the hash-seed of "B" happens to have the same value as the hash-seed for "C" then the strings "xxxxAB" and "xxxxC" will have the same hashs. That also applies to "ABxxxx" and "Cxxxx".

    The testing with all possible rotations insures that strings like "AxxxxB" and "xxCxx" also do not end up with the same hash value.
    Last edited by Earl Colby Pottinger; 21st June 2010 at 05:45.

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I guess that makes sense. You want to create a table so there is no A, B, C, M, N such that (rot(A,M)^rot(B,N)^C)==0 where rot(A,M) is (A<<M|A>>32-M). That happens with probability 2^-32 but you have 2^33 independent tests, so a random table would have 2 failed cases on average. I think you could generate a table in a few seconds by testing each value as you add it and choosing a new random 32 bit number until the test passes.

    One problem with this hash would be for strings longer than 32 bytes. (I know you aren't using them). If you swapped 2 bytes that were a multiple of 32 bytes apart like A...B and B...A then they would collide. You could avoid that problem with a hash like:

    hash = (hash * 3) ^ table[key[i]];

    which repeats every 2^31 bytes instead of every 32 bytes. It would also be just as efficient because the multiplication would be optimized to a shift and add using a single LEA instruction. g++ is also smart enough to optimize rot() or (hash<<13 ^ hash >> 19) to a ROL instruction.

  8. #8
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Thanks

    Quote Originally Posted by Matt Mahoney View Post
    I guess that makes sense. You want to create a table so there is no A, B, C, M, N such that (rot(A,M)^rot(B,N)^C)==0 where rot(A,M) is (A<<M|A>>32-M). That happens with probability 2^-32 but you have 2^33 independent tests, so a random table would have 2 failed cases on average. I think you could generate a table in a few seconds by testing each value as you add it and choosing a new random 32 bit number until the test passes.
    In-fact, I get a far higher rejection rate than that. Remember I am also choosing numbers with a certain range in the number of bit being set to one(1). The pool of numbers I can thus use is far smaller. I can get the numbers if you want.

    Quote Originally Posted by Matt Mahoney View Post
    One problem with this hash would be for strings longer than 32 bytes. (I know you aren't using them). If you swapped 2 bytes that were a multiple of 32 bytes apart like A...B and B...A then they would collide. You could avoid that problem with a hash like:

    hash = (hash * 3) ^ table[key[i]];

    which repeats every 2^31 bytes instead of every 32 bytes. It would also be just as efficient because the multiplication would be optimized to a shift and add using a single LEA instruction. g++ is also smart enough to optimize rot() or (hash<<13 ^ hash >> 19) to a ROL instruction.
    Yes, I was aware of that limit, thank you for the tip.

  9. #9
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts
    think you could generate a table in a few seconds by testing each value as you add it and choosing a new random 32 bit number until the test passes.


    It takes me over two hours to do all the comparisons, I must be doing something wrong.

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    if you need faster and efficient algorithm, just use smth like this:

    hash = *(uint32*)p * 123456791

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    This should give you a table in about 5 seconds. All the elements have equal number of 0 and 1 bits, although I don't really think that helps. Run time is about the same without this constraint.

    Code:
    // Generate table of 256 32-bit numbers such that no rotations
    // of any 1 or 2 numbers XORed is equal to any third number.
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    // Return a 32 bit random number with 16 zero bits and 16 ones.
    unsigned int rnd() {
      unsigned int r;
      int c;
      do {
        r=rand()^rand()<<10^rand()<<19;
        c=0;  // bit count
        for (int i=0; i<32; ++i)
          c+=r>>i&1;
      } while (c!=16);
      return r;
    }
    
    int main() {
      srand(time(0));
      unsigned int t[257]={0};  // 0, then 256 random elements
      for (int i=1; i<257; ++i) {
        int good=1;
        unsigned int c=0;
        do {
          c=rnd();
          good=1;
          for (int j=0; j<i && good; ++j) {
            for (int m=0; m<32; ++m) {
              unsigned int a=t[j]<<m|t[j]>>32-m;
              for (int k=j+1; k<i; ++k) {
                for (int n=0; n<32; ++n) {
                  unsigned int b=t[k]<<n|t[k]>>32-n;
                  if ((a^b)==c) good=0;
                }
              }
            }
          }
        } while (!good);
        t[i]=c;
        printf("  0x%08X, // %d\n", c, i-1);
      }
      return 0;
    }

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Links to forum from http://encode.su/ now redirect to IRC chat. It was working about an hour ago.
    Sorry for off topic post but I can't get to any of the other forums except by clicking on "last post".

Similar Threads

  1. Knuth's Multiplicative Hashing
    By encode in forum Data Compression
    Replies: 5
    Last Post: 28th May 2008, 23:18
  2. A nice article about hashing
    By encode in forum Forum Archive
    Replies: 19
    Last Post: 26th September 2007, 22:31
  3. MM type detector
    By Bulat Ziganshin in forum Forum Archive
    Replies: 10
    Last Post: 5th April 2007, 16:32

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •