Results 1 to 20 of 20

Thread: A nice article about hashing

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

  2. #2
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts
    Thanks Ilia!

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,503
    Thanks
    741
    Thanked 665 Times in 359 Posts
    10x! my mulptiply hash is rather close to FNV hash mentioned but not tested there: http://isthe.com/chongo/tech/comp/fnv/

  4. #4
    Member
    Join Date
    Apr 2007
    Posts
    12
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Very nice article ! Thanks !

  5. #5
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

  6. #6
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts
    Thanks Ilia!

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Probably, in new LZPM Ill replace current multiplicative hashing with modular hashing.

    MOD hashing is very simple:

    key % PRIME

    In my case, key = (*(reinterpret_cast<__int64 *>(&buf[i])) & 0xffffffffff)

    Quote Originally Posted by lzpm.cpp
    #define HASHSIZE 4194301
    #define gethash(i) ((*(reinterpret_cast<__int64 *>(&buf[i]))
    & 0xffffffffff) % HASHSIZE)
    I think such hash function should be useful for 5-byte strings.

    PRIME values:
    for 20-bit hash PRIME = 1048573
    for 21-bit hash PRIME = 2097143
    for 22-bit hash PRIME = 4194301
    for 23-bit hash PRIME = 8388593
    for 24-bit hash PRIME = 16777213

    The complete table of values can be found here:
    http://gcc.gnu.org/ml/gcc-patches/2001-03/msg01075 .html


  8. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    A 64-bit PRIME that can be used with multiplicative hashing:
    11400714819323198483
    or
    0x9e3779b97f4a7c13 (hex)

    Modular hashing failed. LZPM works 2-times slower (on text files) with such technique.


  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Modular hashing is slow because division is slow. g++ will optimize a/b to the form a*((1<<n)/b)>>n when b is constant. Then it will optimize a%b to a-((a/b)*b). Still this is a lot of instructions.

    A simpler hash for a 32 bit input is to multiply by some large odd number (does not have to be prime) and use the high bits. Each bit depends on the bits to the right, but not the bits to the left. A better hash is:

    x *= C1;
    x = x<<16|x>>16;
    x *= C2;

    where C1 and C2 are odd. This also has the nice property that there are no collisions because each step has an inverse. Also, g++ will optimize the second statement to a single rotate instruction.

  10. #10
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    imo almost every modern compiler uses multiplying by reciprocal when dealing with diviosion by constant.

    so entire code sequence for modulo would look like that:

    mov ebx,[number]
    mov eax,reciprocal
    mul ebx ; after that a / b is in edx
    mov eax,edx
    mul ebx
    sub ebx,eax ; after that a % b is in ebx
    of course there should be also some incs for corrections (as everything is in integer precision), but finally it still be very fast.

    details on how to use multiplying by reciprocal are at http://agner.org/optimize/

    in my tarsalzp is use fnv hash. entire code sequence to compute fnv- 1 hash of last four bytes is:

    mov eax,2166136261
    mov ebx,16777619
    mul ebx
    xor al,[edi-01]
    mul ebx
    xor al,[edi-02]
    mul ebx
    xor al,[edi-03]
    mul ebx
    xor al,[edi-04]
    and eax,1 shl 22 - 1
    ; and we have fnv- 1 hash in eax
    much slower than matts, multiplicative or modulo hashing but still very fast. in my tarsalzp im computing fnv- 1 hash for the last eight bytes sequence for every position in file and i think that computing such hashes consumes about 20 seconds on matts system when dealing with enwik9, so its resource requirements are negligible.

    i think that good bit dispersion is more important as it would provide less collisions. lzpm can also use hash collision detection, eg. use highest 5 bits of fnv- 1 hash for collision detection.

    maybe ilia is computing hashes too many times unnecessary wasting time. memoizing hashes would be a good technique is such case. or maybe its necessary (i doubt, but i cant tell without source code) and computing hashes consumes most of the time.

    ilia: please share the code of lzpm

    ive looked at quad sources. i noticed that you use 32- bit for every rolz context: 8 bits for detecting hash collisions of first four bytes and 24- bit for pointers. with good hash (when collisions are rare) you can use this 8- bit counter to store length of longest prefix with previous entry in table accelerating speed many times, especially on very reduntant files (like fp.log).

    maybe ill play with such technique later. today im busy - i have an exam ;(

  11. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Matt Mahoney
    x *= C1;
    x = x<<16|x>>16;
    x *= C2;
    Quote Originally Posted by lpaq1.cpp
    i*=123456791;
    i=i<<16|i>>16; // use _rotr(i, 16) or _rotl(i, 16) instead?
    i*=234567891;
    int chk=i>>24;
    Just tested this function. Well, with LZPM this one gives no improvement.

    Currently I use:
    Quote Originally Posted by lzpm.cpp
    #define HASHBITS 24
    #define HASHSIZE (1 << HASHBITS)

    #define gethash(i) ((*(reinterpret_cast<unsigned int *>(&buf[i - 1]))
    * 123456791) >> (32 - HASHBITS))
    Like you see I hash 32-bits to 24-bits.

    Quote Originally Posted by donkey7
    maybe ilia is computing hashes too many times unnecessary wasting time. memoizing hashes would be a good technique is such case. or maybe its necessary (i doubt, but i cant tell without source code) and computing hashes consumes most of the time.
    Actually hash is not so important with LZPM, I tried many, found the best. The catch in string searching (memory access). With max level LZPM tries all available matches, overwise we loose compression. With such mode I desperate for each byte. However, since we have a few compression levels, I even dont care by now. I will try to improve the existing code anyway.

    Quote Originally Posted by donkey7
    ilia: please share the code of lzpm
    Just currently I see no reason to do that - the code is not finished yet. Only a small amount of people will understand whats going on inside LZPM. In addition, LZPM is just an improved version of QUAD - which is open source. To prove that, let me introduce a part of the LZPM decoder v0.11 (look at its size!!):
    Quote Originally Posted by lzpm.cpp
    static void decode() {
    static int tab[256][TABSIZE];
    int cnt[256];

    __int64 size = 0;
    fread(&size, 1, sizeof(size), input);
    if (size < 0)
    fprintf(stderr, "size error"), exit(1);

    ppm.decoder.Init();

    while (--size >= 0) {
    buf[0] = ppm.Decode(0);
    int i = 1;

    while ((i < N) && (--size >= 0)) {
    int c = buf[i - 1];
    int j = i;

    int s = ppm.Decode(c);
    if (s >= 256) {
    size -= ((s -= (256 - MINMATCH)) - 1);

    int p = tab[c][(s > MINMATCH
    ? cnt[c] - ppm.DecodeIndex() : cnt[c]) & (TABSIZE - 1)];
    while (--s >= 0)
    buf[i++] = buf[p++];
    }
    else
    buf[i++] = s;

    tab[c][++cnt[c] & (TABSIZE - 1)] = j;
    }

    fwrite(&buf, 1, i, output);
    }
    }
    Does anyone sent me an improved version of QUAD which will be at least near to the LZPM in terms of compression ratio or decompression speed? Nope. I am the author and Im moving the shadow!


  12. #12
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Actually memcpy() can be faster than loop:
    Quote Originally Posted by encode
    while (--s >= 0)
    buf[i++] = buf[p++];
    ->
    Quote Originally Posted by encode
    memcpy(&buf[i], &buf[p], s); // DEBUG
    i += s;
    But due to some reason here memcpy() (and memmove()) works not correctly. Finding whats wrong...

    The catch in overlaps - the buffer must be updated during match copying...

  13. #13
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Memcpy will not be able to never work in how much is not in a position to copying reconstructing the structure!
    My solution:
    void fast_copy(unsigned char *src,unsigned char *dst,long len)
    {while (len--) *dst++=*src++;}

  14. #14
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Nania Francesco Antonio
    void fast_copy(unsigned char *src,unsigned char *dst,long len)
    {while (len--) *dst++=*src++;}
    I believe that a good compiler perform this automatically.

  15. #15
    Member
    Join Date
    Apr 2007
    Posts
    16
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Try something like this:

    pushfd
    pushad

    mov ecx,LEN
    and ecx,ecx
    jz @EXIT

    lea esi,SRC
    lea edi,DST
    cld
    rep movsb

    @EXIT: popad
    popfd


    Regards,

  16. #16
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    With VC2005 we can do the same thing with:
    __movsb(&buf[i], &buf[p], s);
    i += s;
    Will make some tests to see what it gives. However, LZMA uses the approach I currently use.

  17. #17
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    Author claims his hash function to be the fastest, i can't check it
    http://www.azillionmonkeys.com/qed/hash.html

  18. #18
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    The last hash code are not good in performance ! (tested)

  19. #19
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Nania Francesco Antonio
    a simple hash = value & mask could also serve as hash and it would be fastest hash possible but many algorithms require low number of collisions.

    for example i'm using fnv hash in my tarsalzp despite it has relatively low performance. but if i change it to something simpler and faster i can gain only few percent of speed for compression ratio loss of few percent.

    and fnv hash is commonly used

  20. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,503
    Thanks
    741
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by nimdamsk
    http://www.azillionmonkeys.com/qed/hash.html
    it seems quite useless because its oriented to hashing large strings while only ~4bytes are hashed in our compressors

Similar Threads

  1. Knuth's Multiplicative Hashing
    By encode in forum Data Compression
    Replies: 5
    Last Post: 28th May 2008, 23:18
  2. nice CM improvement
    By toffer in forum Forum Archive
    Replies: 4
    Last Post: 10th January 2008, 10:21
  3. A small article on ROLZ (Russian)
    By encode in forum Forum Archive
    Replies: 21
    Last Post: 29th April 2007, 16:18

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •