> i.e. one random memory access per byte
Yes, and I'm saying that you don't need that access per byte.
With your method, you can either store the "fingerprint" once per fragment
and look up on each byte, or store on each byte and look up fragments,
and both versions are inefficient.
> qouted article propose two new ideas over rolled hashes -
> recursion
That's a funny idea, but afaik its better to use custom archor
functions for known data formats, like structured data or text.
> and storing maximum hash around a range.
Well, its an anchor function, it doesn't have to be that.
For example, my current function is (hash32<(1<<32)/l), where
l is desired average fragment length.
> i'm still thinking how each one improves results, if you can
> describe it - please do.
As to recursion - yeah, its mostly useful for backup and such.
You make a data fingerprint file (something like a .torrent file),
and still have to transfer that to remote machine to detect differences.
But the fingerprint file can be large enough too (1% of 100G is 1G),
and differences between backup snapshots are usually small, so
recursion kinda makes sense.
But for srep-like usage, maybe an algo with two different fragment lengths
would make sense instead... for example, l1=128, l2=512
- if l2 fragment matches, enable l1 lookups
- if there're no matches in a block, disable l1 lookups
> it may be that these ideas are more importnat for file1->file2
> patching rather than compressing single big file?
The anchor function idea is really important - with it fragment
sizes become somewhat random, but you can do both stores and
lookups once per fragment.
Something like this:
Code:
void wincrc::Update( byte c, byte d ) {
x = CRC32Update(x,c) ^ IncWinTab[d]; // rolling crc update
y = CRC32Update(y,c); // normal crc update
z = CRC32Update(z,c+1); // alt. crc update
blklen++;
}
int wincrc::Check( void ) {
int flag = (blklen>=cfg_minblklen) && ((uint(x)<uint(cfg_anchormask)) | (blklen>=cfg_maxblklen));
if( flag ) Flush();
return flag;
}
wincrc crc;
while(1) {
if( i>=l ) {
l = fread( buf, 1,inpbufsize, f );
if( l==0 ) break;
i = 0;
}
byte c = buf[i];
inpwin[i&inpwinM] = c;
crc.Update( c, inpwin[(i-winsize)&inpwinM] );
i++;
if( crc.Check() ) hout.Store( g, crc.h );
}
Or you can see an actual working demo here:
http://nishi.dreamhosters.com/u/uam_find_v0.rar