
Thread: A new 900GB compression target

  1. #1
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts

    A new 900GB compression target

    The entire Geocities site (famous for terrible web design 10 years ago!) is about to be released as a single, ginormous 900GB archive.

    Much of the hassle in preparing the archive was compressing it!
    This is going to be one hell of a torrent - the compression is happening as we speak, and it's making a machine or two very unhappy for weeks on end.
    Just as a theoretical discussion, what would the best strategy be to compress such a huge amount of data? I don't even know the original format, but I'd guess it's a very, very large directory tree. And as a total guess, I bet the team's approach is to isolate something like 1GB subtrees and feed each one into ZIP or WinRAR.

    Of course this would mean they're losing enormous efficiency since there's no hope of data in one subarchive helping compress data in a different subarchive...

    It would be interesting (not important, but interesting) to see how different algorithms would deal with compressing such huge data... kind of a ginormous enwik8-like benchmark.
    It's no longer insane to think about handling so much data (a 2TB hard drive is $100), so it might be interesting to think about how to compress and access such an archive.

    As a wild initial "practical and decent" strategy, streaming the whole archive as a flat tar-like file through 7zip would likely be a good tradeoff between speed and size. I pick 7zip because of its very large LZ77-like window, which allows similar code from one website to be matched in later websites. (But a compressed tar-like archive wouldn't allow random access... it's all tradeoffs!)

    The link above goes to the site that will be hosting the torrent soon.

  2. #2
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    166
    Thanks
    3
    Thanked 2 Times in 2 Posts
    It is an interesting challenge... I think in the end the winning approach would be the one targeting the lightest compression possible, as most of the weight of the archive would be in graphics already highly optimized for the web, i.e. heavily compressed JPEGs, many GIFs and PNGs, and lossily compressed audio/video.
    Although HTML and script files would benefit greatly from compression (and some crazy people may have uploaded huge BMP, TIFF or WAV files, which compress well), I think most of the files would compress poorly.
    The real issue, since the dump will contain a really large number of files, is a format (and an implementation handling it) that is efficient at handling and randomly accessing large trees... unless the intended usage is extracting the dump, or part of it (i.e. one website at a time), in order to correctly access all the files linked in the websites, honoring the originally intended hypertext structure.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I would start with deduplication. Look for identical files and large identical chunks. You can use a rolling hash to find chunk boundaries, then build an index of hashes. Once you do that you can break it into pieces (1 GB or smaller) and use standard compression.
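    A minimal sketch of that pipeline (the constants, the toy hashes and the names here are illustrative, not Matt's or any particular tool's): a rolling hash over a small window picks chunk boundaries from the content itself, and a hash of each chunk goes into an index, so identical chunks are detected wherever they occur in the stream. A real deduplicator would confirm matches with a strong hash (SHA-1 etc.) or a byte-wise compare before dropping data.
    Code:
      #include <cstdint>
      #include <cstdio>
      #include <string>
      #include <unordered_map>

      enum {
        W         = 48,          // rolling-hash window, bytes
        MIN_CHUNK = 2048,        // don't cut chunks shorter than this
        MAX_CHUNK = 256*1024,    // force a cut here on pathological data
        MASK      = (1<<16)-1    // boundary when low 16 bits are zero => ~64 KB average chunks
      };

      int main( int argc, char** argv ) {
        if( argc<2 ) { fprintf(stderr,"usage: dedup <file>\n"); return 1; }
        FILE* f = fopen(argv[1],"rb"); if( !f ) return 1;

        const uint32_t PRIME = 153191;
        uint32_t pw = 1; for( int i=0; i<W; i++ ) pw *= PRIME;   // PRIME^W mod 2^32

        std::unordered_map<size_t,uint64_t> index;   // chunk hash -> offset of first occurrence
        std::string chunk;
        uint8_t win[W] = {0}; int wi = 0;
        uint32_t h = 0;                              // rolling hash of the last W bytes
        uint64_t offset = 0, dup = 0;

        int c;
        while( (c=fgetc(f)) != EOF ) {
          uint8_t in = (uint8_t)c, out = win[wi];
          win[wi] = in; wi = (wi+1) % W;
          h = h*PRIME + in - out*pw;                 // slide the window by one byte
          chunk.push_back( (char)in );

          if( chunk.size()>=MIN_CHUNK && ((h & MASK)==0 || chunk.size()>=MAX_CHUNK) ) {
            size_t ch = std::hash<std::string>()(chunk);    // stand-in for a strong hash
            if( index.count(ch) ) dup += chunk.size();      // duplicate chunk: emit a reference
            else index.emplace( ch, offset );               // new chunk: store it
            offset += chunk.size(); chunk.clear();
          }
        }
        fclose(f);   // the last partial chunk would simply be stored
        printf( "duplicate chunk bytes: %llu\n", (unsigned long long)dup );
        return 0;
      }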

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Rolling hashes are used in Bulat's rep/srep utils, but with such an amount of data the approach with lookups (or stores)
    on each byte becomes inefficient, so there's another trick, which I call "anchored hashing".
    Quoting MS: "RDC divides a file's data into chunks by computing the local maxima of a fingerprinting function that is computed at every byte position in the file."
    http://msdn.microsoft.com/en-us/libr...48(VS.85).aspx

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    but with such an amount of data the approach with lookups (or stores) on each byte becomes inefficient,
    SREP speed is 15 mb/sec (i.e. one random memory access per byte), independent of file size. The only problem is that SREP needs a lot of memory

    The quoted article proposes two new ideas over rolling hashes - recursion and storing the maximum hash around a range. I'm still thinking about how each one improves results; if you can describe it - please do. It may be that these ideas are more important for file1->file2 patching rather than for compressing a single big file?

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > i.e. one random memory access per byte

    Yes, and I'm saying that you don't need that access per byte.
    With your method, you can either store the "fingerprint" once per fragment
    and look up on each byte, or store on each byte and look up fragments,
    and both versions are inefficient.

    > The quoted article proposes two new ideas over rolling hashes -
    > recursion

    That's a funny idea, but afaik it's better to use custom anchor
    functions for known data formats, like structured data or text.

    > and storing the maximum hash around a range.

    Well, it's an anchor function, it doesn't have to be that.
    For example, my current function is (hash32<(1<<32)/l), where
    l is the desired average fragment length.
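    A minimal illustration of that predicate (illustrative code, not the actual implementation): since 1<<32 overflows a 32-bit int, the threshold is computed in 64 bits; with a well-distributed rolling hash the condition fires on roughly one position in l, giving fragments of average length about l.
    Code:
      #include <cstdint>

      // anchor predicate: true on ~1 out of every l positions
      inline bool is_anchor( uint32_t hash32, uint32_t l ) {
        uint32_t threshold = (uint32_t)( (1ull<<32) / l );  // = 2^32/l
        return hash32 < threshold;
      }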

    > I'm still thinking about how each one improves results; if you can
    > describe it - please do.

    As to recursion - yeah, it's mostly useful for backup and such.
    You make a data fingerprint file (something like a .torrent file),
    and still have to transfer it to the remote machine to detect differences.
    But the fingerprint file can get large too (1% of 100G is 1G),
    and differences between backup snapshots are usually small, so
    recursion kinda makes sense.

    But for srep-like usage, maybe an algo with two different fragment lengths
    would make sense instead... for example, l1=128, l2=512
    - if l2 fragment matches, enable l1 lookups
    - if there're no matches in a block, disable l1 lookups

    > It may be that these ideas are more important for file1->file2
    > patching rather than for compressing a single big file?

    The anchor function idea is really important - with it fragment
    sizes become somewhat random, but you can do both stores and
    lookups once per fragment.

    Something like this:
    Code:
      void wincrc::Update( byte c, byte d ) {
        x = CRC32Update(x,c) ^ IncWinTab[d]; // rolling crc update
        y = CRC32Update(y,c); // normal crc update
        z = CRC32Update(z,c+1); // alt. crc update
        blklen++;
      }
    
      int wincrc::Check( void ) {
        int flag = (blklen>=cfg_minblklen) && ((uint(x)<uint(cfg_anchormask)) | (blklen>=cfg_maxblklen));
        if( flag ) Flush();
        return flag;
      }
    
      wincrc crc;
    
      while(1) {
        if( i>=l ) {
          l = fread( buf, 1,inpbufsize, f ); // refill the input buffer
          if( l==0 ) break;
          i = 0;
        }
        byte c = buf[i];
        inpwin[i&inpwinM] = c; // circular window of the last winsize bytes
        crc.Update( c, inpwin[(i-winsize)&inpwinM] ); // new byte in, byte leaving the window out
        i++;
        if( crc.Check() ) hout.Store( g, crc.h ); // anchor reached - store the fragment hash
      }
    Or you can see an actual working demo here:
    http://nishi.dreamhosters.com/u/uam_find_v0.rar

  7. #7
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    The anchor function idea is really important - with it fragment
    sizes become somewhat random, but you can do both stores and
    lookups once per fragment.
    [...]
    These fragments sound a lot like what Data Domain (IIRC) is doing. Might be patented.

    EDIT:
    After refreshing my memory of what DD does, it probably isn't. The idea is roughly the same, but the implementation isn't.
    Last edited by m^2; 31st October 2010 at 10:26.

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    The comments on the http://msdn.microsoft.com/en-us/libr...48(VS.85).aspx page mention multiround rsync, described in the following paper: http://www.cs.cmu.edu/~jcl/research/mrsync/mrsync.ps (I've attached its PDF conversion)

    and a paper about their Remote Differential Compression (RDC) implementation: http://research.microsoft.com/~gurevich/Opera/183.pdf
    Attached Files

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Sure, I also mentioned it in http://encode.su/threads/452-Remote-diff-utility
    And Matt in http://mattmahoney.net/dc/dce.html#Section_527
    But I reinvented this "wheel" on my own, and my implementation has significant differences.

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    I didn't understand you in 2010 and spent a lot of time optimizing my SREP in 2011. Very stupid! A week ago I started to optimize REP in the same stupid way, but while looking for "memory access" discussions I found this topic

    btw, MS's paper talks about selecting "local maxima" over h positions, i.e. positions whose hash is larger than any hash in the h-1 positions before and the h-1 positions after. But initially I understood "local maxima" as just selecting the maximal hash among the next L positions, i.e. without any probabilities or so - we just split the data into L-sized blocks and find exactly one "local maximum" in every block. It has the following advantages:

    - a streamlined match search process, especially with background hashing
    - I think it makes it easier to vary the storing/checking conditions - f.e. store once for every 64 bytes and check once per 16 bytes, or vice versa
    - it shouldn't have problems with highly repetitive data like AAAAAA or ABABABAB

    Do you see any possible drawbacks in such a solution compared to the (1<<32)/L check?

    Using your idea, it should be easy to utilize many cores. Background threads do the following:
    - read the next 256kb block from the input file
    - slide the rolling hash over the block and save the "local maximum" for every L positions

    Then the main thread does:
    - scan through the stored "local maxima" and check/store matches in the global hashtable

    This way, the main thread should perform just one memory access per L bytes processed, i.e. spend 17 ns using memory prefetching. With L=256, the default for (s)rep, that means 15 GB/s processing speed! Actually, the main thread will spend its time mainly comparing memory regions seeking match boundaries, and the overall compressor would probably be limited just by I/O speed!
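    A hypothetical sketch of that main-thread pass (names and layout are illustrative, not the actual REP/SREP code; __builtin_prefetch is the gcc/clang intrinsic): the background threads hand over (position, hash) pairs, and the scan loop touches the global table only once per anchor, prefetching the slot needed by the next anchor so that the single random access per L input bytes is mostly hidden. Candidates still have to be verified by comparing the actual data.
    Code:
      #include <cstdint>
      #include <vector>

      struct Anchor    { uint64_t pos; uint32_t hash; };   // produced by the background hashing threads
      struct Candidate { uint64_t newpos, oldpos; };       // to be verified by memcmp later

      // 'table' maps hash -> previous position+1 (0 = empty slot); its size is a power of two
      void scan_anchors( const std::vector<Anchor>& a,
                         std::vector<uint64_t>& table,
                         std::vector<Candidate>& out )
      {
        const size_t mask = table.size()-1;
        for( size_t i=0; i<a.size(); i++ ) {
          if( i+1 < a.size() )
            __builtin_prefetch( &table[ a[i+1].hash & mask ] );  // hide the next random access

          uint64_t& slot = table[ a[i].hash & mask ];
          if( slot!=0 ) out.push_back( { a[i].pos, slot-1 } );   // possible duplicate block
          slot = a[i].pos + 1;                                   // remember the newest occurrence
        }
      }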
    Last edited by Bulat Ziganshin; 9th January 2012 at 22:44.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > btw, MS's paper talks about selecting "local maxima" over h positions

    I also found this discussion:
    http://nishi.dreamhosters.com/chanlo...2009%3A49%3A03

    Somehow I originally had the same idea as theirs, but changed it after trying to
    actually implement it.

    > we just split the data into L-sized blocks and find exactly one "local maximum" in every block

    We want to find duplicate blocks of size >=L.
    Thus we can either compare whole blocks, or hashes of whole blocks.
    Also the method has to be position-independent, ie quickly converge
    to the same behavior even if we don't start processing at the same
    point in the data.

    Then, we can maintain a rolling hash of the last L bytes and look up matches
    on every byte.
    Or we can choose the block start/end based on properties of the actual data -
    then the block layout would be the same for the same data, so there's no
    need to check for matches on every byte.
    This is less precise though, because to find a match of length L we'd need
    Code:
    ----------|------ L bytes -----|  block layout
    | N bytes |          | N bytes |  arguments of anchor function
    N+L bytes to actually match.
    But then again, we can compensate for that by adjusting L, and even if
    some matches are lost, it's also not really a problem for a fast algorithm.

    Then, let's look at the anchor function (the predicate which is used to determine
    the block boundaries).
    The best idea would be to use an N-byte rolling hash, because any other way to
    compute a function of N bytes would be slower.

    Now, if we compute the next L values of the rolling hash and set the anchors
    (block boundaries) at the points with max (or min) hash values, this would
    let us split the data into blocks of length 1..2*L, and thus locate
    matches of size 2*L+N in the worst case.

    But this approach requires maintaining a window of the last L values of the rolling hash,
    and finding the min/max of these.

    So when I actually tried to implement this scheme, it immediately became obvious
    that a simpler way is possible - any function (predicate) of the rolling hash
    which is rarely true would do, and hash<2^32/L is good enough, because
    the hash value distribution is random enough.
    Of course, I also had to add checks for min/max block lengths, while in the scheme
    with a fixed window of hash values that's not necessary.
    But still, it's much simpler, the block size hits the limit only on very redundant data,
    and it's more tunable, because x in hash<x can be fine-tuned to get any necessary
    average block length, which is hard with just L.

    As to parallel hashing, it's still doable anyway - we just have to process slightly
    overlapped blocks.

    > 15 GB/s processing speed!

    That's still unlikely imho - there are still two hashes to update per byte,
    plus size checks and branches. But yeah, it should be significantly faster
    than methods with hashtable lookups per byte.

  12. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Either you misunderstood me or I can't understand your arguments; at least, "a minimal hash value in some area" is exactly what I mean. Here's the code:
    Code:
        // Index L-byte blocks starting at ptr[0]..ptr[bytes-1]
        void process (int *hashptr,  // Place for saving info about maximum hashes and their indices
                           byte *ptr,      // Buffer being hashed
                           int    bytes,    // How many bytes to hash in ptr[] (plus an L-byte lookahead after that)
                           int    L           // Size of the rolling-hash window AND the number of positions among which we look for the "local maximum"
                                               //  (these may be different numbers in other implementations)
                          )

        {
            // Initial hash value over ptr[0..L-1]; update_hash(sub,add) is the rolling-hash macro shown later in the thread
            int hash=0;  for (int i=0; i < L; i++)  update_hash (0, ptr[i]);

            // Split buffer into L-sized blocks and store the maximal hash in each block, and its position, to hashptr[]
            for (int block=0; block<bytes/L; block++) {
                int maxhash = hash, maxi = 0;
                for (int i=0; i<L; i++, ptr++) {
                    update_hash (ptr[0], ptr[L]);   // drop ptr[0] from the window, add ptr[L]
                    if (hash > maxhash)
                        maxhash = hash, maxi = i;
                }
                *hashptr++ = maxi+block*L;   // absolute position of this block's maximum
                *hashptr++ = maxhash;
            }
        }
    Last edited by Bulat Ziganshin; 11th January 2012 at 00:06.

  13. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    yeah, it works and easily outperforms the old code:

    old code:
    4,531,060,447 -> 3,046,406,598: 67.23% Cpu 364 mb/s (11.872 sec), real 333 mb/s (12.989 sec) = 91%
    old code with m/t optimizations:
    4,531,060,447 -> 3,046,406,598: 67.23% Cpu 298 mb/s (14.508 sec), real 364 mb/s (11.883 sec) = 122%
    Anchored hashing (local maxima) with just the same m/t optimizations:
    4,531,060,447 -> 3,064,484,898: 67.63% Cpu 492 mb/s (8.783 sec), real 595 mb/s (7.257 sec) = 121%

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Yes, it's exactly what I imagined from what you described before.
    In short, my points are:
    1. You can't find the extremum per block independently - it's not a position-independent scheme.
    It may not be completely useless for rep-like applications, but it would certainly be bad for a diff.
    I mean that the set of max( buf[i*L+0]..buf[i*L+L-1] ) can be completely different from
    max( buf[i*L+1]..buf[i*L+L] ).
    2. A position-independent implementation with hash-max is possible, but it has to look for the next maximum
    among the L hashes after the previous maximum, not just in the next L-byte block.
    3. It's also necessary to compute the hashes of the blocks between anchors (maxima in your case),
    because the hashes at the anchors are not very good for lookups.
    4. Either implementation can be modified to work in parallel, but mine is simpler and more tunable.
    Btw, I even managed to make it work for L=5 (or so) in the unaligned bitstring matching experiment.

  15. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    and finally with multiple hashing threads as proposed above:
    4,531,060,447 -> 3,064,484,898: 67.63% Cpu 529 mb/s (8.174 sec), real 1341 mb/s (3.221 sec) = 254%

  16. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    btw, some analysis. Your method selects blocks with hash < 1/L. My method selects the minimum hash among L of them, i.e. it's also < 1/L or so on average, but the check isn't as strict as yours

    In the case of too regular data, you may get too few or too many blocks with hash<1/L. My method handles such situations "automatically", so in effect the range of accepted hashes grows or shrinks depending on the data being processed. Your method handles it by cutpoints - i.e. minimal and maximal range size

    It seems that your method can be improved by 1) selection of the minimal hash among all those K checked (where either K==min_block or K==max_block), 2) use of accumulative range checks (i.e. 10*min_block <= K[-1]+..+K[-10] <= 10*max_block), 3) slight variations in min_block/max_block that may help break worst cases like (abad)+ and ac(adab)+ reviewed in the MS paper, f.e. min_block=32/33/34/35/32...

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > check isn't as strict as yours

    From my p.o.v. your version is more strict,
    because the max size of its blocks is limited by design, implicitly,
    while in my case it's a tweakable parameter.

    > In the case of too regular data, you may get too few or too many blocks with hash<1/L

    Yes, but
    1) I actually think that hitting the block size limit is ok for very redundant data
    (like chunks of all zeroes) - processing of these becomes even faster.
    2) "too many blocks with hash<1/L" is very unlikely when a good rolling hash function is used
    (crc32 in my case). The probability of N sequential values with hash<1/L is 1/L^N, which is low.
    In practice it's only necessary to tweak min_block_size when the required L is very small
    (like <10), because the rolling hash's window has to be reduced there too, and the output
    of the hash function really becomes less random.
    But still I managed to make it work even for very small L values, like
    Code:
    enum {
      winsize = 4, // window size for the rolling hash 
      cfg_minblklen  = 4,
      cfg_maxblklen  = 8,
      cfg_anchormask = (1ULL<<32)/2, // parameter for hash<anchormask 
    };
    > range of accepted hashes grows or shrinks depending on data being processed

    That's actually bad.
    It's not a problem to add some adaptivity to my "anchormask", but that would
    in effect mean a much longer dependency window - ie the same strings won't match
    with different anchormasks.

    This also applies to the max block size limit - it's much better to set
    the max block size to some reasonably high value (like 64k) and split
    it on "natural" anchors than to hit an "artificial" size limit after
    which the next block likely won't ever be matched.

    > Your method handles it by cutpoints - i.e. minimal and maximal range size

    Yes, but with minblklen=1 and maxblklen=65535 it becomes just a precaution.

    > It seems that your method can be improved by
    > 1) selection of minimal hash among all those K checked (where either K==min_block or K==max_block),

    That's not an improvement.
    minblklen can usually be 1 with L=256 or so, so it's not an issue at all
    (and your suggestion would produce blocks shorter than minblklen).
    And maxblklen... if I wanted it to be smaller, I'd just reduce it.
    There's no point in setting anchors randomly, because the whole point
    is to set anchors that mark strings with specific properties.

    Again, I wanted to simplify the algorithm (get rid of the window of
    hash function values) and make it tunable - and that's what I arrived at
    after some evolution.
    Surely your implementation with the max hash in each L-byte block looks
    even simpler, but its properties are completely different, so it
    doesn't help me in any way.
    Although I was surprised that it works at all your way, which is
    good in itself, I guess

    > 2) use of accumulative range checks (i.e. 10*min_block <= K[-1]+..+K[-10] <= 10*max_block),

    No point. As I said, it practically never happens in rep-like applications.
    And btw, the profile above for L=4..8 blocks is quoted from my lzp based
    on anchor hashing - it treats short fragments as symbols and reaches
    about 30-32M on enwik8 (with entropy coding).
    Which is imho proof that these limits are not an issue.

    > 3) slight
    > variations in min_block/max_block that may help break worst cases
    > like (abad)+ and ac(adab)+ reviewed in the MS paper, f.e.
    > min_block=32/33/34/35/32...

    Those are not really worst cases with crc32 and window size >=12.
    And as I said, randomizing anything in the block layout would just make
    compression worse.

    -------------

    1. My implementation of anchor hashing was originally designed for applications
    like remote diffing and remote backups (see http://encode.su/threads/452-Remote-diff-utility).
    Then I started using it also as a rep equivalent, but I guess in that case it could be
    simplified for speed - anchor hashes could be used for match lookups and so on.
    But I also see some benefits in keeping it as is, so I likely won't optimize it like that.

    2. The parameter "anchormask" comes from the previous anchor function which looked like
    ((hash&anchormask)==0), which in turn comes from "distinguished points" in rainbow tables.
    I still wonder how to apply more of RT ideas for matchfinding.
    RT is basically a method of compression of a hash->string map, so it has to be applicable somehow,
    but I don't have any practical ideas yet.

    3. Actually I prefer to call the strings between anchor points "fragments", because "block" is
    too overloaded - an anchor hashing implementation is very likely to also have i/o blocks,
    maybe format blocks etc.
    Note also that the anchor function doesn't always have to be the same.
    For example, it can be changed to (c=='\n') for text.
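    A tiny hypothetical illustration of that last point (an illustrative snippet, not from the actual implementation): the anchor predicate is pluggable, so for plain text one could anchor on line ends instead of a rolling-hash threshold, keeping the same min/max length guards as in the wincrc::Check() excerpt above.
    Code:
      // fragment boundary test for text data: cut at '\n', bounded by min/max fragment lengths
      inline bool is_anchor_text( unsigned char c, unsigned blklen,
                                  unsigned cfg_minblklen, unsigned cfg_maxblklen ) {
        return (blklen>=cfg_minblklen) && ( c=='\n' || blklen>=cfg_maxblklen );
      }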

  18. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    One difference is that I don't split the data into variable-sized "fragments". I always hash L-sized blocks, although in the new implementation the positions of these blocks may be arbitrary; that's one reason why we didn't understand each other. Next, usual data generates pretty random hash values for any good hash (both CRC and FNV); that's the first reason why even my approach gives good enough results (at least with large enough L). The second reason, specific to usage in FreeArc, is that I need the results of rep+tor/lzma compression: for fast compression modes losing 0.3% or so isn't important, and for slow compression modes losing matches slightly larger than L doesn't change overall compression much, and may even improve it a tiny bit (because matches with length ~512 are almost equally well handled by rep and lzma)

    But it's interesting to build a theory behind it. For example, in FreeArc I need to find as many matches as possible with length>=512 (MINLEN or, for short, M). We either index blocks of fixed size L or variable-sized fragments. We select the positions to index by hashing either blocks or fragments, and we either select the minimum hash among l consecutive ones or hashes < 1/l. So, my question is the best selection of L and l for a given M, and the best strategy among the aforementioned "eithers"



    Another interesting question is a comparison of CRC and FNV as hash functions. BTW, your scheme doesn't need a rolling hash at all. You may start hashing from the fragment's beginning. Two questions - speed of hashing and distribution quality. BTW, CRC gets much faster when unrolled, so you can compute the second hash for the entire fragment at once. And quality may be checked using a smaller L, since for large L the distribution would be perfect anyway

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > One difference is that i don't split data into variable-sized "fragments".

    Afaik you do. Fragments are the data between anchors.
    In your case that would be ((block+1)*L+maxi_1)-(block*L+maxi_0) = L+maxi_1-maxi_0.
    Also maxi_1-maxi_0 = -(L-1)..(L-1), so L+that = 1..2*L-1.
    I can also configure min/max to produce similar results.

    > So, my question is best selection of L and l for given M,

    Looking up fragments by the L-byte anchor hash value at their end would
    let you find matches of length L..2*L-1.
    Basically only a match of length ~4*L would certainly be found
    (as the fragment length is up to 2*L-1 and it can start at any offset).
    Also, finding the anchors only in fixed L-byte blocks makes it much less
    certain - in some cases even very long matches won't be found.

    And looking up the actual fragment hash lets you find matches of the specified
    length min..max, though matches of length <L would still be kind of accidental.
    Otherwise it's similar (an identical string of length >2*max is necessary
    for a certain find), but at least the explicit max fragment length is configurable.

    Other benefits are:
    1. A stronger hash can be used for match lookup, with an anchor function still based
    on a fast rolling hash.
    For example, I encountered crc32 collisions in enwik9.
    2. With hashes of actual fragments (especially stronger ones) it's possible to
    encode the matches without actual string comparison (and thus without the data window etc).
    I even have an idea for a (relatively) safe version of that, using error recovery methods -
    ie we presume that the data matches when the hashes match and treat collisions as transmission errors.
    3. Applications like remote diffing are simply impossible without fragment hashes.

    > and best strategy among the aforementioned "eithers"

    Afaik there are no large differences in speed or compression on usual data.
    It's more a matter of the specific application and interfaces.
    For example, my anchor-hashing class is made in such a way that I can
    feed it single bytes, which would be harder with your approach.

    > Another interesting question is a comparison of CRC and FNV as hash functions.

    Well, CRC is simpler from the p.o.v. of arithmetic and is more random -
    LCG-based hashes are more sensitive to the data - not all bits are equally distributed etc.

    > BTW, your scheme doesn't need a rolling hash at all.

    Hey, the main part is still finding the anchors, which can't be done without a rolling hash
    (well, more precisely, it would be slow without one).

    > can compute the second hash for entire fragment at once

    Yes, but presuming that it would be used in conjunction with some other algos,
    these crc32 updates would be interleaved well enough as they are.

  20. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Fragments are the data between anchors.

    Well, you hash the blocks between the anchors and call these blocks "fragments". I always hash the L bytes preceding the anchor. These are two different strategies. Probably the second one isn't too useful for patch/rsync, but for me it looks more reasonable if we need to find matches >= M. btw, have you read the original rsync thesis? http://samba.org/~tridge/phd_thesis.pdf



    >> BTW, your scheme doesn't need a rolling hash at all.
    >
    > Hey, the main part is still finding the anchors, which can't be done without a rolling hash

    Why? If you save the hashes of fragments between anchors, you may just incrementally compute them starting from the first byte of the fragment:

    Code:
    for(int i=0,last_i=0,crc=0; i<size;i++)
    {
      if (crc>0xFF000000)                // anchor condition: fires on roughly 1/256 of hash values
      {
        Store(outfile, i-last_i, crc);   // store the fragment length and its hash
        crc=0;                           // restart the hash at the fragment boundary
        last_i=i;
      }
      updateCRC(crc,buf[i]);
    }


    CRC is simpler from p.o.v of arithmetics and is more random - LCG-based hashes are more sensitive to data - not all bits are equally distributed etc.
    I agree that the lower LCG bits are less random - they cannot be influenced by the higher bits of the hashed chars. But computing a rolling LCG hash may be faster - it needs 2 multiplies and 2 additions:

    Code:
    // rolling polynomial hash over an L-byte window: 'add' is the byte entering the window,
    // 'sub' the byte leaving it; cPOWER_PRIME_L is the precomputed constant PRIME^L (mod 2^32)
    #define update_hash(sub,add)                        \
    {                                                   \
        hash = hash*PRIME + add - sub*cPOWER_PRIME_L;   \
    }
    while rolling CRC requires 4 arithmetic operations and two memory reads

    Code:
      void Update0( const byte c ) {
        x = CRCTab[byte(x)^c] ^ (x>>8);  // standard table-driven crc32 update by one byte
      }

      void Update( const byte c ) {
        Update0( c );
        x ^= IncWinTab[buf[i]];  // xor out the contribution of the byte leaving the window
        buf[i] = c;              // circular buffer of the last winsize bytes
        i = (i+1) % winsize;
      }
    btw, if we don't need an incremental computation then a simple FNV (?) hash may be used:

    Code:
    int *p = (int*)buf;
    hash = (123456791*p[0] + 31415926)*p[1] ....   // and so on, consuming 4 bytes per step
    By consuming 4 bytes at a time, it may be even faster than an unrolled CRC computation, although again - it has weak lower bits

    btw, why do we need to check for an anchor on every byte? We can check every 4th byte - we can afford to lose some matches, and it would further improve hashing speed for quick-and-dirty compression
    Last edited by Bulat Ziganshin; 13th January 2012 at 04:55.

  21. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > I always hash L bytes preceding the anchor.

    Yes, but the distances between anchors are still variable.
    You can only find an L-byte string at the end of a fragment,
    not just anywhere.

    > if you save hashes of fragments between anchors, you may just
    > incrementally compute them starting from the first byte of fragment

    1. Again, that introduces a dynamic dependency.
    I mean that we would get completely different fragment layouts
    just by adding a single symbol at the start of a file -
    the first fragment would end at a different point, then the second one
    would start and end at different points, etc.
    The point of a rolling hash is to get a static layout, where
    the positions of anchors depend only on occurrences of L-byte
    strings with specific properties.

    2. Also it reduces the randomness of the hash.

    > while rolling CRC requires 4 arithmetic operations

    2 multiplications is not the fastest option either;
    I guess on a PC either one can be faster, depending on circumstances
    (like the crc32 instruction in modern cpus).
    But there are also popular platforms (like ARM) where multiplications
    would be much slower.

    > why we need to check for anchor every byte? we can check every 4th byte

    If we want to find byte-aligned strings, then we need to check anchor on every byte.
    However, for data with 32-bit alignment, every 4th byte would be good too.

  22. #22
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    for those tracking rolling hashing: http://news.ycombinator.com/item?id=3589676

  23. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts

  24. #24
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    The part regarding the fingerprint being stored and checked only if lower bits are zero is clear. Nice and efficient.

    However this part is not :
    "But minimum and maximum block sizes are also enforced for pathological cases which skews this slightly."

    I guess such a case can happen when dealing with a repetitive pattern (such as a large chunk of zeroes).
    Then the fingerprint of the last N bytes will always be the same, or will repeat itself every "pattern length".
    Either the fingerprint happens to also respect the "lower bits are zero" rule and gets updated all the time,
    or, the other way round, no fingerprint respects the rule and no update is done over a potentially large region.

    So enforcing some kind of "minimum" and "maximum" looks reasonable.

    However, it is on the decoding side that it looks more complex.

    I assume that, after calculating a fingerprint, it is checked and compared against the hash table only if it respects the "lower bits are zero" rule. But if a "maximum" distance is enforced, does that mean that a fingerprint which does not respect the "lower bits are zero" rule is also inserted into the hash table? If yes, how could it be found later on, since the check is supposed to happen only when all "lower bits are zero"??

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > The part regarding the fingerprint being stored and checked only if lower bits are zero is clear

    It's just a boolean function of the rolling hash; "lower bits are zero" is too specific.

    > However, it is on the decoding side that it looks more complex.

    In most cases it just doesn't appear on the decoding side at all.
    In LZ-like applications like rep it's just a method of speeding up matchfinding.

    > If yes, how could it be found later on, since the check is supposed to happen only if

    Well, the same way, obviously - when the same hash is generated after stopping at the size limit in the next zero block.
    If such "invalid" hash values are somehow bad for your implementation, you can just discard them imho;
    there are better ways to handle zero blocks and such anyway.
    The actual problem is that the following fragment would also be invalid, and in theory it can stay out of sync until the end of the file,
    which would be bad.
    But it won't happen with normal data (not very redundant), and in the worst case you'd just get slightly worse compression.
