I really would like to see your P2P implementation on
  linux distros, btw. Like you'd only have to download 1/3 of
  the new distribution as most files can be patched from the
  old one you still got :)
  i guess we'd need to add support for using all the local
  data as a reference ;)
  but people won't like a p2p program scanning all of their hdds ;)
  Even with non-local data it would speed up the upload very
  much :) Well, at least adding the whole data in your
  shared/incoming folder would be enough in most cases.
  well, simply reference data indexing should be different
  from shared data indexing
  i didn't think about that, any more ideas? ;)
  Not yet. Still somehow ordering the whole concept in my
  head, especially the recompression part.
  to take recompression into account
  we just have to produce multiple hash "fingerprints" for the data
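A minimal sketch of the "multiple fingerprints" idea, assuming a zlib stream as the only recompressible format. The function name `fingerprints` and the key names `raw` / `unpacked` are made up for illustration and are not taken from the rdiff implementation mentioned in this chat.

```python
import hashlib
import zlib


def fingerprints(data: bytes) -> dict:
    """Return one hash per 'view' of the data.

    'raw' is the hash of the bytes as stored. 'unpacked' is added only
    when the bytes happen to be a valid zlib stream, and hashes the
    decompressed payload instead.
    """
    result = {"raw": hashlib.sha1(data).hexdigest()}
    try:
        unpacked = zlib.decompress(data)
    except zlib.error:
        return result  # not a zlib stream, only the raw fingerprint exists
    result["unpacked"] = hashlib.sha1(unpacked).hexdigest()
    return result


payload = b"hello world " * 100
a = fingerprints(zlib.compress(payload, 1))
b = fingerprints(zlib.compress(payload, 9))
# The raw fingerprints differ, the unpacked ones can be used to match sources.
same_content = a["unpacked"] == b["unpacked"]
```

Two copies packed at different levels share the `unpacked` fingerprint, so a lookup by that hash would find both as sources even though their stored bytes differ.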
  Guess you'll see the best possibilities after you've got
  a first usable reference implementation.
  I have a usable rdiff implementation using it, that's imho enough
  1. we can search for the original file using its hash
  2. compute primary hashtable, then secondary hashtable
     from it, and search for secondary hashtable hashes
  3. search for primary hashtable hashes
  ...wait, which direction is this?
  ok, let's say we want to download a file, then
  That's exactly how I got confused with the idea the last days :)
  1. we search for info on that file by its hash
  2. download its secondary hashtable and look up its
     parts in local reference database
  3. download parts of primary hashtable corresponding to
     unknown secondary hashes
  4. repeat the local reference lookup again
  5. find sources by primary/secondary hashes
  6. check availability of alternate hashtables (for
     precomp'ed data etc) and repeat 2-5 to get more sources
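Steps 2 to 4 of the download list above can be sketched roughly like this. It is only an illustration under assumptions: the block size, the grouping of primary hashes into one secondary hash, SHA-1, and the `fetch_primary` callback are all invented here. Source lookup (step 5) and alternate hashtables (step 6) are left out.

```python
import hashlib

BLOCK = 4  # bytes per primary block (tiny, for illustration only)
GROUP = 2  # primary hashes covered by one secondary hash


def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def primary_table(data: bytes) -> list:
    """One hash per fixed-size block of the file."""
    return [h(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


def secondary_table(primary: list) -> list:
    """One hash per group of primary hashes."""
    return [h("".join(primary[i:i + GROUP]).encode())
            for i in range(0, len(primary), GROUP)]


def plan_download(secondary, fetch_primary, local_secondary, local_primary):
    """Return indexes of primary blocks that are not available locally.

    secondary       -- secondary hashtable of the wanted file (step 2)
    fetch_primary   -- callback returning the primary hashes of one group,
                       standing in for a network request (step 3)
    local_secondary -- set of secondary hashes in the local reference db
    local_primary   -- set of primary hashes in the local reference db
    """
    missing = []
    for group, sec_hash in enumerate(secondary):
        if sec_hash in local_secondary:
            continue  # the whole group is already known locally
        for k, prim_hash in enumerate(fetch_primary(group)):
            if prim_hash not in local_primary:  # step 4: second lookup
                missing.append(group * GROUP + k)
    return missing
```

The point of the two levels is that the primary hashtable is only fetched for groups whose secondary hash is unknown, so a mostly unchanged file costs very little metadata traffic.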
  Step 3 is pretty much where I often got stuck, 
  this helped, thanks.
  also for precomp'ed data we have to know how it maps to
  original data
  Yes, for deflate f.e. we'd have to separate file data
  and "deflate-data" (literal/match information). We know
  the file data but need the deflate-data to reconstruct
  the stream. Or is there something better?
  Of course, standard deflate streams would help as we
  just have "standard compression level 9" as deflate-data
  in that case :)
  ah, wait, i guess we can find the proper place for
  recompressed data by primary hash lookups
  then we only have to know how to reconstruct the
  original data from a part of filtered data
  i guess it won't work with just precomp
  as even with a single zipped file we won't be able to
  reconstruct the deflate code after downloading a block
  from the middle of precomp'ed data
  but adding some info about deflate blocks would fix that
  so that if we'd download a block from 100M..101M of
  precomp'ed data, we'd be able to read from it where the
  deflate block starts and how far behind it references
  Ah, that would be kind of a lookup table for "I'm at
  position 100M in the decompressed stream, where is my
  position in the deflate stream?"
  not quite... as i said, position in deflate stream can
  be found by primary hashtable, if we managed to rebuild
  some deflated data
  but we'd have to know where deflate blocks start, and
  how far behind they reference
  so that after getting enough data for that "window",
  we'd be able to reconstruct the deflate block
  and then merge it with the main download
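The "where deflate blocks start, and how far behind they reference" info could look something like the sketch below. The `BlockInfo` record and its field names are assumptions made up for this example. It only answers which blocks have their full window available, it does not rebuild any deflate code.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    start: int     # offset of the block's first byte in the unpacked data
    length: int    # number of unpacked bytes the block produces
    max_dist: int  # farthest back-reference, counted back from `start`


def needed_range(block: BlockInfo) -> tuple:
    """Half-open range of unpacked data needed to re-encode this block."""
    return (max(0, block.start - block.max_dist), block.start + block.length)


def rebuildable(blocks: list, have: list) -> list:
    """Indexes of blocks whose whole window lies inside a downloaded range.

    `have` is a list of half-open (lo, hi) ranges of unpacked data that
    were already downloaded; it is assumed to be merged, so adjacent or
    overlapping ranges appear as a single entry.
    """
    def covered(lo, hi):
        return any(a <= lo and hi <= b for a, b in have)

    return [i for i, blk in enumerate(blocks) if covered(*needed_range(blk))]


blocks = [BlockInfo(0, 100, 0), BlockInfo(100, 100, 50), BlockInfo(200, 100, 150)]
ready = rebuildable(blocks, [(50, 200)])  # only the middle block qualifies
```

Since deflate back-references reach at most 32 KiB, `max_dist` stays small there; for LZ formats with bigger windows the same record would just carry larger values.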
  or maybe you're right...
  first, we'd have to know which precomp'ed data to request first
  because it only makes sense for the blocks which we
  didn't download already, and won't download very soon.
  So some approximate offset mapping is necessary too, it
  It's also possible that we talk about similar things,
  just use other terms or slightly different attempts -
  The whole thing is too much of a vague idea atm :)
  it's not really that vague for me, as i already have a
  similar transfer protocol working, with secondary hashes and all
  the recompression thing would be the most difficult
  (although optional) part of this concept and can't be
  generalized for different algorithms, right?
  err, why?
  For example (just as an extreme example) we can't easily
  recompress data in a solid mode 7z archive.
  here it depends on what we call "easy"
  first it can be recompression with forced "restore points"
  with added redundancy so that we would be able to
  reconstruct only part of compressed data
  its not that problematic with 7-zip or any other LZ
  algorithms actually, only PPM/CM are troublesome in that sense
  (though ppmd tends to reset the model)
  but then, there's an alternative protocol too
  instead of downloading "uncompressed" data and
  converting them to required format locally,
  we can send some reference data to uploader
  and have him produce the recompressed data for you
  nice, that's another possibility I didn't think of.
  in other words, if you're downloading a zip archive, but
  only see a source which has same files in rar
  you can send some deflate parameters which you acquired
  to that source
  and it would be able to repack rar to zip using it
  and send that (or parts of) to you
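A rough sketch of the uploader-side repack, assuming the "deflate parameters" are just zlib's compressobj settings and that both sides run the same zlib. `DeflateParams`, `repack` and `matches` are hypothetical names. Streams produced by other deflate encoders would need more detailed deflate-data than this.

```python
import hashlib
import zlib
from typing import NamedTuple


class DeflateParams(NamedTuple):
    """Settings the downloader learned and sends to the source."""
    level: int
    wbits: int      # -15 selects a raw deflate stream, as stored in zip
    mem_level: int
    strategy: int


def repack(raw: bytes, p: DeflateParams) -> bytes:
    """Uploader side: compress locally held file data as requested."""
    c = zlib.compressobj(p.level, zlib.DEFLATED, p.wbits, p.mem_level, p.strategy)
    return c.compress(raw) + c.flush()


def matches(packed: bytes, wanted_sha1: str) -> bool:
    """Check that the repacked stream is the one being downloaded."""
    return hashlib.sha1(packed).hexdigest() == wanted_sha1
```

The hash check matters because a source that cannot reproduce the stream bit-exactly is useless for a download that is verified against the original hashtables.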
  or a third choice ;)
  if its not about transfer compression, but about finding
  rare sources, then we'd be able to just find matches in
  precomped data of some other files, and then download
  these files _in_whole_ and use them locally
  but the first version with runtime mapping of compressed
  formats actually seems reasonable too
  at least it would work with versions of deflate and other LZ