
Thread: hashing LZ

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    hashing LZ

    I post this here because I'd enjoy a discussion and have the errors of my naive ways pointed out to me; I've already promised myself I'll not be sucked into implementing a proof of concept! Who knows, maybe in here there's a nugget that will inspire someone else, and that will be nice.

    Imagine a hashtable-(only?)-based history for LZ. Imagine it like this:

    You pick some order for the key. In this example, I'll pick order 3.

    Think of the stream as being in multiples of this key-size, even though you're sliding through it one char at a time.

    So, for example, our compressor is mid-stream:

    Code:
    input:    ...I had no love of world of work, with its ...
    current:                                  ^
    blocks:                       ...[of ][wor][k, ][wit]...
    We look up the previous key in the table, [wor]

    This points to a *list* of historic offsets for the [wor] prefix

    We also look up where we are now, [k, ], which is another list of historic offsets

    We take these two lists and slide along them, looking for a continuation:

    [wor] -> hash lookup -> list of offsets -> 56, 68, 122, 157 ...
    [k, ] -> hash lookup -> list of offsets -> 35, 125, 220 ...

    In the sliding-along we spot that [wor]122 and [k, ]125 are adjacent. We know that historically there was a [work, ] at 122 and the match was 6 long.
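    The sliding-along is just a merge of two sorted lists. Here's a minimal sketch, assuming each hash entry keeps its offsets sorted (function and variable names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Given sorted offset lists for a key (e.g. [wor]) and the key that
// follows it (e.g. [k, ]), find historic positions where the two keys
// were adjacent, i.e. where a match of at least key_len*2 bytes exists.
std::vector<int> find_adjacent(const std::vector<int>& prev_key_offsets,
                               const std::vector<int>& next_key_offsets,
                               int key_len) {
    std::vector<int> matches;
    std::size_t j = 0;
    for (int a : prev_key_offsets) {
        // slide the second list forward until it reaches a + key_len;
        // both lists are ascending, so j never has to move backwards
        while (j < next_key_offsets.size() && next_key_offsets[j] < a + key_len)
            ++j;
        if (j == next_key_offsets.size())
            break;  // second list exhausted, no more adjacencies possible
        if (next_key_offsets[j] == a + key_len)
            matches.push_back(a);  // e.g. [wor] at a, [k, ] at a+3
    }
    return matches;
}
```

    With the lists above ({56, 68, 122, 157} and {35, 125, 220}) and key length 3, this reports the single adjacency at 122.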

    Here's an augmenting idea that might be useful elsewhere. We're doing a *greedy* LZ coder. Look up the prefix - [of ][wor] - what is its history? It will surely contain [of ][wor][ld ]. Now we *know* that our match is not going to be anything beginning "l", because if it were, we'd already have encoded it greedily as part of the previous match! We can use the historic offsets to cull candidates because we know that we're greedy. PPMs do masking/exclusion of course. This might group the ROLZ offsets even closer to 0 slightly more often, for example.
    Last edited by willvarfar; 20th August 2010 at 18:38.

  2. #2
    Programmer Bulat Ziganshin
    useless
    1. what does "slide along them" mean? it's the key, and i don't see a fast way to do it
    2. if you just need a 6-byte match - hash those 6 bytes and you will find all 6-byte repetitions directly

    in order to make something new, one should have excellent knowledge of existing solutions. right now you don't even distinguish LZ77 and ROLZ

  3. #3
    Member willvarfar
    thx for the encouragement; I hadn't tried to claim I was an expert at anything

    the masking/exclusion might help the compression ratio of all LZ variants, with a slight speed cost of course.

  4. #4
    Member Sanmayce
    Hi willvarfar,

    I understand the idea behind the thoughts you share with others -
    keep dreaming, that's all that matters.

    > I post this here because I'd enjoy a discussion and have the errors of my naive ways pointed out to me; I've already promised myself I'll not be sucked into implementing a proof of concept! Who knows, maybe in here there's a nugget that will inspire someone else, and that will be nice.

    Well said. I believe that pushing the boundaries while sharing opinions (giving different points of view) will eventually inspire the guys with skills and knowledge to bring better and better implementations (whether LZ, hash techniques, ...).

    Regards
    Last edited by Sanmayce; 20th August 2010 at 20:24.

  5. #5
    Administrator Shelwien
    I don't know why there's "hashing" in the title.
    But as to masking, it can be described a bit more simply: after an encoded match, the next literal or the first symbol of the next match
    can't be the same as the next symbol in the match,
    i.e. data[x] != data[lastmatchofs+lastmatchlen]
    That would certainly improve compression (at least compared to plain greedy LZ77 or ROLZ), but you would have to enumerate
    the offsets each time, including during decoding.
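    The check can be written as a one-line predicate; a sketch with illustrative names (a real coder would apply it to candidate symbols on both the encoding and the decoding side):

```cpp
#include <cstdint>

// After a greedy match (last_match_ofs, last_match_len) has been emitted,
// the byte that would have extended that match one step further can never
// be the next symbol -- a greedy parser would have taken the longer match.
// The decoder knows this too, so the symbol can be excluded from the model.
inline bool symbol_excluded(const uint8_t* data, uint8_t candidate,
                            int last_match_ofs, int last_match_len) {
    uint8_t would_have_extended = data[last_match_ofs + last_match_len];
    return candidate == would_have_extended;
}
```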

  6. #6
    Member willvarfar
    nice clarification on the greedy masking bit, Shelwien.

    As I said, the next character can't be any character that would have extended a run of literals into a match we would already have taken, had such a match existed.

    As you said, the next character immediately after a match can't be the next character in that match.

    Furthermore, along your thinking, it can't be the next match character in any other match in the history that is the same length as the chosen match.

    I started thinking about how this would help the entropy coder too, and not just, say, the match-offset counting in ROLZ.

  7. #7
    Administrator Shelwien
    > Furthermore along your thinking, it can't be the next
    > match character in any other match in the history that is
    > the same length as the chosen match.

    I'd say it's better not to look further, because you'd end up making a PPM that way:
    "All symbols which can extend a length-n match to n+1, or alternatively,
    a length-(n-1) match to n" etc.

    > I starting thinking about how this would help the entropy
    > coder too and not just say the match offset counting in ROLZ

    As to entropy coding, afaik it's better to mask symbols there too.
    I mean, if we know that offset M is masked and can't appear,
    instead of renumbering M+1..N to M..N-1 it's better to exclude
    the code interval corresponding to M.
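    One way to read "exclude the code interval": when building the cumulative frequencies for the range coder, the masked offset simply contributes zero, so its interval vanishes while every other offset keeps its identity - no renumbering. A sketch (names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Build cumulative frequencies for offsets 0..n-1 while treating the
// masked offset's frequency as zero: the coder never reserves a code
// interval for it, and the other offsets keep their original numbers.
std::vector<int> cum_freq_excluding(const std::vector<int>& freq, int masked) {
    std::vector<int> cum(freq.size() + 1, 0);
    for (std::size_t i = 0; i < freq.size(); ++i) {
        int f = (static_cast<int>(i) == masked) ? 0 : freq[i];
        cum[i + 1] = cum[i] + f;
    }
    return cum;  // cum.back() is the reduced total for the range coder
}
```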

  8. #8
    Member willvarfar
    http://encode.su/threads/15-7-Zip?p=...ull=1#post7997 has pseudo-code for your HT4, Bulat; is it accurate?

    From that post:
    Code:
    int bestMatchDist = 0;
    int bestMatchLen = 0;
    HashEntry* entry = &hashtable[context];
    for (int i=0; i<16; ++i)
    {
      int pos = entry->Positions[i];
      int len = CheckMatch(pos);
      if (len > bestMatchLen)
      {
        bestMatchLen = len;
        bestMatchDist = CurrentPosition - pos;
        if (bestMatchLen >= MAX_MATCH_LENGTH)
          break;
      }
    }
    Now imagine that you used the HT of the next key too. Imagine you have, say, 3-byte hashing and you're looking for a match of at least 5 bytes. Split this needle into two hash lookups, e.g. the string "crazy" would be split into [cra] and [azy].

    Consider the following code and how much nicer it may potentially be to the cachelines:

    Code:
    int bestMatchDist = 0;
    int bestMatchLen = 0;
    HashEntry* entry1 = &hashtable[context];
    if(entry1->num_matches == 0)
      goto give_up_searching;
    // try for 6-or-longer matches first, falling back to 5 then 4 and so on
    for(int hash_offset = HASH_LENGTH*2;
        !bestMatchLen && (hash_offset + HASH_LENGTH >= MIN_MATCH);
        hash_offset--) {
      // pseudocode: the entry for the key starting hash_offset bytes ahead
      HashEntry* entry2 = &hashtable[context + hash_offset];
      if(entry2->num_matches == 0)
        continue;
      for(int i=0, j=0; i<entry1->num_matches; i++) {
        // slide along the second entry until it catches up with entry1
        while(entry2->pos[j] < entry1->pos[i] + hash_offset) {
          j++;
          if(j >= entry2->num_matches)
            goto next_hash_offset;
        }
        if(entry2->pos[j] == entry1->pos[i] + hash_offset) {
          // we have a match of HASH_LENGTH + hash_offset bytes, e.g. 5
          if(hash_offset == HASH_LENGTH*2) {
            // match can be longer since we haven't shortened our hash_offset for a retry;
            // this is where we stall to examine the buffer itself
            int len = CheckMatch(entry1->pos[i]);
            if(len > bestMatchLen) {
              bestMatchLen = len;
              bestMatchDist = CurrentPosition - entry1->pos[i];
            }
          } else {
            bestMatchLen = HASH_LENGTH + hash_offset;
            bestMatchDist = CurrentPosition - entry1->pos[i];
            goto give_up_searching;
          }
        }
      }
      next_hash_offset: ;
    }
    give_up_searching:
    ...
    Last edited by willvarfar; 24th August 2010 at 12:34.

  9. #9
    The Founder encode
    These days I designed a VERY fast LZ coder that outputs dictionary indexes instead of exact match offsets. These indexes are hashes at some point, but things got more complicated in the real implementation. Anyway, it's about online compression - compression on the fly that takes nearly no time...

  10. Thanks:

    Bulat Ziganshin (24th November 2013)

  11. #10
    The Founder encode
    A few notes. An ultra-fast LZ coder should use very simple output coding - byte-aligned output, for example. LZ77 by its nature has redundant offsets. As an example, if we have a 1 MB dictionary and offsets coded as plain integer values, we will spend 20 bits per offset coded, which is pretty much. Furthermore, many of those offsets are redundant - we may have the same strings at different offsets. Many techniques were proposed to reduce such redundancy - LZFG, ROLZ or even LZP. LZFG is slow at building dictionary entries. ROLZ/LZP or anything else with contexts is not that useful with byte-aligned coding - the context is equal to a missed match. It's cool in pair with a string output encoder like CM, but it doesn't work if we just store LZ codes. So, the ultimate thing is to invent something that will build a good and useful dictionary fast. It's not just talk - I implemented all of the LZ family of coders along with other stuff to empirically test the words above. I'm still experimenting, but hopefully a new Ultimate LZ will be born - I have a few practical ideas on how to efficiently build the dictionary.
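    Byte-aligned coding of such an offset can be illustrated like this (a hypothetical token layout for a 1 MB window, not the format of any particular coder): a 20-bit offset plus a 4-bit length code pack into exactly three bytes, so the output stream stays byte aligned.

```cpp
#include <cstdint>

// Pack a 20-bit offset (offset < 2^20, i.e. a 1 MB dictionary) and a
// 4-bit length code into 3 little-endian bytes.
inline void pack_token(uint8_t out[3], uint32_t offset, uint32_t len_code) {
    uint32_t v = (offset << 4) | (len_code & 0xF);
    out[0] = static_cast<uint8_t>(v & 0xFF);
    out[1] = static_cast<uint8_t>((v >> 8) & 0xFF);
    out[2] = static_cast<uint8_t>((v >> 16) & 0xFF);
}

// Recover the offset and length code from the 3-byte token.
inline void unpack_token(const uint8_t in[3], uint32_t* offset, uint32_t* len_code) {
    uint32_t v = in[0] | (in[1] << 8) | (static_cast<uint32_t>(in[2]) << 16);
    *len_code = v & 0xF;
    *offset = v >> 4;
}
```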

  12. #11
    Programmer Bulat Ziganshin
    >Consider the following code and how much nicer it may potentially be to the cachelines:

    you just forgot a few simple things: let's assume that our HT5 cache has 16 entries per hash value, so one can check 16 entries with just 17 memory accesses

    now let's look at a direct 3-byte hash that allows finding the same 16 5-byte matches. how deep should it be? i think about 1024 entries. so in order to have the same search depth, we need to access 1024*2/16=128 memory lines. and that's not the worst part of the scheme. the hash size should be 2^24*1024*4=64 gigabytes

    remember: the more chars we can hash, the better. much, much better. reducing the number of hashed chars is very expensive; never do it without outstanding compensation
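    The arithmetic can be restated directly (a sketch that takes the chain depth of 1024 as Bulat's estimate, and assumes 4-byte positions and 64-byte cache lines, which his figures imply):

```cpp
#include <cstdint>

// a direct 3-byte hash deep enough to surface the same 16 five-byte
// matches as a depth-16 HT5, per Bulat's estimates
constexpr uint64_t kHashSlots      = 1ull << 24;        // one slot per 3-byte key
constexpr uint64_t kChainDepth     = 1024;              // entries per slot (estimate)
constexpr uint64_t kEntryBytes     = 4;                 // one 32-bit position each
constexpr uint64_t kTableBytes     = kHashSlots * kChainDepth * kEntryBytes;
constexpr uint64_t kTableGiB       = kTableBytes >> 30; // 2^36 bytes = 64 GiB
constexpr uint64_t kEntriesPerLine = 64 / kEntryBytes;  // 16 entries per cache line
// memory lines touched per search (the factor of 2 is from his formula 1024*2/16)
constexpr uint64_t kLinesTouched   = kChainDepth * 2 / kEntriesPerLine;
```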

  13. #12
    Member willvarfar
    I'm confused about the counting.

    The reason HT keeps multiple offsets in a cacheline is that, once the data is in L1, you can go through it very fast.

    So to do your hash lookup, you're probably stalling at least once. But then iterating over all the offsets for that hash, which are in the same cacheline, is very fast. However, for each of those offsets, you go and look in the history buffer itself to count the length of the match. That is likely a stall.

    In what I'm proposing, you stall once in the hash lookup, as before. And then you stall again for the second hash lookup. You can then go 'sliding' through those two cachelines very fast, and only if you have a match longer than HASH_LEN*2 do you have to stall to go examine the history buffer and check it.

    So far, I've implemented LZ77, BWT, PPM and a few unsuccessful side things. I'm currently in my ROLZ stage which is why I'm having these thoughts about how to index the buffer.

    So, apart from what I'm saying: why did you create HTn for FreeArc's LZMA, Bulat? And why 4 (rather than 3 or 6 or whatever)? And how does it compare to the other match finders you use? Is there a generally best choice for western texts?

  14. #13
    Programmer Bulat Ziganshin
    >I'm confused about the counting;

    just spend more time thinking on it. draw it on paper. collect some stats

    >why did you create HTn for freearc LZMA, Bulat

    i once wrote about it on the old forum. HC was developed in the ARJ times; now it's a bit obsolete

    > Is there a generally best choice for western texts?

    bsc

  15. #14
    Member willvarfar
    so here are some numbers from 7z 9.04 beta, which is a normal ubuntu package.

    I've got a core2duo laptop.

    It doesn't have HT4 for comparison, unfortunately.

    7-Zip 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
    p7zip Version 9.04 (locale=en_GB.utf8,Utf16=on,HugeFiles=on,2 CPUs)


    Code:
     Performance counter stats for '7z a -m0=lzma:mf=bt2 enwik8.7z enwik8':
    
        92513899303  L1-dcache-loads          
         4659291366  L1-dcache-load-misses    
        19792997314  L1-dcache-stores         
          403442854  L1-dcache-store-misses   
          497573148  L1-dcache-prefetches     
      <not counted>  L1-dcache-prefetch-misses
       300709874903  L1-icache-loads          
           24723345  L1-icache-load-misses    
      <not counted>  L1-icache-prefetches     
      <not counted>  L1-icache-prefetch-misses
         3193070279  LLC-loads                
          655357396  LLC-load-misses          
          123686758  LLC-stores               
           47595563  LLC-store-misses         
      <not counted>  LLC-prefetches           
      <not counted>  LLC-prefetch-misses      
        90275173015  dTLB-loads               
         1211963842  dTLB-load-misses         
        20221900638  dTLB-stores              
           20952191  dTLB-store-misses        
      <not counted>  dTLB-prefetches          
      <not counted>  dTLB-prefetch-misses     
       206629139055  iTLB-loads               
            1294973  iTLB-load-misses         
        29362887643  branch-loads             
         3486702355  branch-load-misses       
                                              
       98.767845359  seconds time elapsed
         26,187,410  bytes
    
     Performance counter stats for '7z a -m0=lzma:mf=bt3 enwik8.7z enwik8':
    
        90118429731  L1-dcache-loads          
         4318854269  L1-dcache-load-misses    
        20007043891  L1-dcache-stores         
          403696509  L1-dcache-store-misses   
          498834301  L1-dcache-prefetches     
      <not counted>  L1-dcache-prefetch-misses
       294447934751  L1-icache-loads          
           22227423  L1-icache-load-misses    
      <not counted>  L1-icache-prefetches     
      <not counted>  L1-icache-prefetch-misses
         2900324538  LLC-loads                
          689492140  LLC-load-misses          
          117302292  LLC-stores               
           43851739  LLC-store-misses         
      <not counted>  LLC-prefetches           
      <not counted>  LLC-prefetch-misses      
        86786573311  dTLB-loads               
         1206597774  dTLB-load-misses         
        19983016198  dTLB-stores              
           31456682  dTLB-store-misses        
      <not counted>  dTLB-prefetches          
      <not counted>  dTLB-prefetch-misses     
       202050653128  iTLB-loads               
            1215207  iTLB-load-misses         
        28144253208  branch-loads             
         3206738090  branch-load-misses       
                                              
       90.216583600  seconds time elapsed     
         25,947,915  bytes                    
    
     Performance counter stats for '7z a -m0=lzma:mf=bt4 enwik8.7z enwik8':
                                             
        83957452393  L1-dcache-loads          
         3901836216  L1-dcache-load-misses    
        19044210741  L1-dcache-stores         
          492948474  L1-dcache-store-misses   
          439573721  L1-dcache-prefetches     
      <not counted>  L1-dcache-prefetch-misses
       276787092434  L1-icache-loads          
           19931196  L1-icache-load-misses    
      <not counted>  L1-icache-prefetches     
      <not counted>  L1-icache-prefetch-misses
         2578908534  LLC-loads                
          672922193  LLC-load-misses          
          120699161  LLC-stores               
           40413127  LLC-store-misses         
      <not counted>  LLC-prefetches           
      <not counted>  LLC-prefetch-misses      
        85575838861  dTLB-loads               
         1066028284  dTLB-load-misses         
        19239703173  dTLB-stores              
           47445414  dTLB-store-misses        
      <not counted>  dTLB-prefetches          
      <not counted>  dTLB-prefetch-misses     
       189621448898  iTLB-loads               
             915180  iTLB-load-misses         
        26414348745  branch-loads             
         2909416093  branch-load-misses       
                                              
       78.891124630  seconds time elapsed
         25,895,909  bytes
         
     Performance counter stats for '7z a -m0=lzma:mf=hc4 enwik8.7z enwik8':
    
        50843826946  L1-dcache-loads          
         3462263304  L1-dcache-load-misses    
        12730153068  L1-dcache-stores         
          253215889  L1-dcache-store-misses   
          204718949  L1-dcache-prefetches     
      <not counted>  L1-dcache-prefetch-misses
       197708822891  L1-icache-loads          
            6731606  L1-icache-load-misses    
      <not counted>  L1-icache-prefetches     
      <not counted>  L1-icache-prefetch-misses
         3134033394  LLC-loads                
          416830350  LLC-load-misses          
           64584531  LLC-stores               
            6738763  LLC-store-misses         
      <not counted>  LLC-prefetches           
      <not counted>  LLC-prefetch-misses      
        51350710111  dTLB-loads               
          954164362  dTLB-load-misses         
        12748954659  dTLB-stores              
           90928317  dTLB-store-misses        
      <not counted>  dTLB-prefetches          
      <not counted>  dTLB-prefetch-misses     
       142757058727  iTLB-loads               
             292936  iTLB-load-misses         
        18841964494  branch-loads             
         1438040735  branch-load-misses       
    
       75.997639156  seconds time elapsed
         29,418,352  bytes

