Code:
<schnaader> I really would like to see your P2P implementation on Linux distros, btw. Like, you'd only have to download 1/3 of the new distribution, as most files can be patched from the old one you still have :)
<Shelwien> i guess we'd need to add support for using all the local data as a reference ;) but people won't like a p2p program scanning all of their hdds ;)
<schnaader> Even with non-local data it would speed up the upload very much :) Well, at least adding the whole data in your shared/incoming folder would be enough in most cases.
<Shelwien> well, simply that reference data indexing should be different from shared data indexing. i didn't think about that, any more ideas? ;)
<schnaader> Not yet. Still somehow ordering the whole concept in my head, especially the recompression part.
<Shelwien> to take recompression into account we just have to produce multiple hash "fingerprints" for the data
<schnaader> Guess you'll see the best possibilities after you've got a first usable reference implementation.
<Shelwien> I have a usable rdiff implementation using it, that's imho enough
<Shelwien> 1. we can search for the original file using its hash
<Shelwien> 2. compute the primary hashtable, then the secondary hashtable from it, and search for the secondary hashtable hashes
<Shelwien> 3. search for the primary hashtable hashes
<Shelwien> ...wait, which direction is this? %) ok, let's say we want to download a file, then:
<schnaader> That's exactly how I got confused with the idea during the last days :)
<Shelwien> 1. we search for info on that file by its hash
<Shelwien> 2. download its secondary hashtable and look up its parts in the local reference database
<Shelwien> 3. download the parts of the primary hashtable corresponding to unknown secondary hashes
<Shelwien> 4. repeat the local reference lookup again
<Shelwien> 5. find sources by primary/secondary hashes
<Shelwien> 6. check availability of alternate hashtables (for precomp'ed data etc.) and repeat 2-5 to get more sources
<schnaader> Step 3 is pretty much where I often got stuck, this helped, thanks.
<Shelwien> also, for precomp'ed data we have to know how it maps to the original data
<schnaader> Yes, for deflate e.g. we'd have to separate file data and "deflate data" (literal/match information). We know the file data but need the deflate data to reconstruct the stream. Or is there something better? Of course, standard deflate streams would help, as we'd just have "standard compression level 9" as the deflate data in that case :)
<Shelwien> ah, wait, i guess we can find the proper place for recompressed data by primary hash lookups. then we only have to know how to reconstruct the original data from a part of the filtered data
<Shelwien> ...so i guess it won't work with just precomp, as even with a single zipped file we won't be able to reconstruct the deflate code after downloading a block from the middle of the precomp'ed data. but adding some info about deflate blocks would fix that, so that if we'd download a block from 100M..101M of the precomp'ed data, we'd be able to read from it where the deflate block starts and how far behind it references
<schnaader> Ah, that would be kind of a lookup table for "I'm at position 100M in the decompressed stream, where is my position in the deflate stream?"
<Shelwien> not quite... as i said, the position in the deflate stream can be found by the primary hashtable, if we managed to rebuild some deflated data. but we'd have to know where the deflate blocks start, and how far behind they reference, so that after getting enough data for that "window", we'd be able to reconstruct the deflate block and then merge it with the main download
<Shelwien> ... or maybe you're right... first, we'd have to know which precomp'ed data to request, because it only makes sense for the blocks which we didn't download already, and won't download very soon. so some approximate offset mapping is necessary too, it seems
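A rough sketch of the primary/secondary hashtable lookup described above, just to pin the terms down. The block size, group size, FNV-1a hash and the local_index interface are made-up placeholders for illustration, not taken from the actual implementation, which would presumably use a rolling checksum plus a strong hash pair the way rdiff does:

Code:
// Sketch of the two-level hashtable idea from the log above (illustration only).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Hash = std::uint64_t;

// Placeholder hash; a real implementation would pair a rolling checksum
// with a strong hash, as rdiff/rsync do.
static Hash fnv1a(const std::uint8_t* p, std::size_t n) {
    Hash h = 14695981039346656037ull;
    while (n--) { h ^= *p++; h *= 1099511628211ull; }
    return h;
}

constexpr std::size_t BLOCK = 64 * 1024;  // primary hash covers one block (made-up size)
constexpr std::size_t GROUP = 64;         // secondary hash covers 64 primary hashes (made-up)

// Primary hashtable: one hash per fixed-size block of the file.
std::vector<Hash> primary_hashes(const std::vector<std::uint8_t>& data) {
    std::vector<Hash> out;
    for (std::size_t i = 0; i < data.size(); i += BLOCK)
        out.push_back(fnv1a(data.data() + i, std::min(BLOCK, data.size() - i)));
    return out;
}

// Secondary hashtable: one hash per group of primary hashes, so a peer can
// fetch the much smaller secondary table first and then only the parts of
// the primary table it actually needs (steps 2-3 of the protocol above).
std::vector<Hash> secondary_hashes(const std::vector<Hash>& primary) {
    std::vector<Hash> out;
    for (std::size_t i = 0; i < primary.size(); i += GROUP) {
        std::size_t n = std::min(GROUP, primary.size() - i);
        out.push_back(fnv1a(reinterpret_cast<const std::uint8_t*>(primary.data() + i),
                            n * sizeof(Hash)));
    }
    return out;
}

// Local reference lookup (hypothetical interface): which block indices of the
// wanted file are already present in the local reference database?
std::vector<std::size_t> find_local_blocks(const std::vector<Hash>& primary,
                                           const std::unordered_set<Hash>& local_index) {
    std::vector<std::size_t> found;
    for (std::size_t i = 0; i < primary.size(); ++i)
        if (local_index.count(primary[i])) found.push_back(i);
    return found;
}

The point of the two levels is that the secondary table is GROUP times smaller than the primary one, so a downloader can fetch it first and only request primary hashes for the groups that the local reference lookup didn't already resolve.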
<schnaader> It's also possible that we're talking about similar things and just use different terms or slightly different approaches - the whole thing is too much of a vague idea atm :)
<Shelwien> it's not really that vague for me, as i already have a similar transfer protocol working, with secondary hashes and all
<schnaader> The recompression thing would be the most difficult (although optional) part of this concept, and it can't be generalized for different algorithms, right?
<Shelwien> err, why?
<schnaader> For example (just as an extreme example), we can't easily recompress data in a solid-mode 7z archive.
<Shelwien> here it depends on what we call "easy". first, it can be recompression with forced "restore points" with added redundancy, so that we'd be able to reconstruct only a part of the compressed data. it's not that problematic with 7-zip or any other LZ algorithm actually, only PPM/CM are troublesome in that sense (though ppmd tends to reset the model)
<Shelwien> but then, there's an alternative protocol too: instead of downloading "uncompressed" data and converting it to the required format locally, we can send some reference data to the uploader and have him produce the recompressed data for you
<schnaader> nice, that's another possibility I didn't think of.
<Shelwien> in other words, if you're downloading a zip archive but only see a source which has the same files in a rar, you can send the deflate parameters which you acquired to that source, and it would be able to repack the rar to zip using them and send that (or parts of it) to you
<Shelwien> ... or a third choice ;) if it's not about transfer compression, but about finding rare sources, then we'd be able to just find matches in the precomp'ed data of some other files, then download these files _in_whole_ and use them locally
<Shelwien> ... but the first version, with runtime mapping of compressed formats, actually seems reasonable too. at least it would work with different versions of deflate and other LZ
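One way to picture the "runtime mapping" / approximate offset mapping for deflate streams that came up above. The structs and field names are invented for illustration and are not precomp's or the protocol's actual format:

Code:
// Per-block metadata that lets a peer re-deflate a block of precomp'ed
// (decompressed) data downloaded from the middle of a stream. Names are
// hypothetical; only the deflate facts (block boundaries, <= 32 KiB window)
// come from the format itself.
#include <cstdint>
#include <vector>

struct DeflateBlockInfo {
    std::uint64_t comp_offset;    // bit offset of the block in the deflate stream
    std::uint64_t uncomp_offset;  // offset of the block in the decompressed (precomp'ed) data
    std::uint32_t uncomp_size;    // decompressed size of the block
    std::uint16_t max_distance;   // how far behind the block references (at most 32 KiB for deflate)
    // plus whatever "deflate data" (literal/match decisions, compression level)
    // is needed to reproduce the original compressed bytes exactly
};

// Approximate offset mapping for one recompressed stream.
struct DeflateStreamMap {
    std::uint64_t stream_uncomp_offset;   // where the stream starts in the precomp'ed file
    std::vector<DeflateBlockInfo> blocks; // sorted by uncomp_offset
};

With a table like this, a peer holding decompressed bytes around 100M..101M can tell which deflate blocks fall into that range, how much extra "window" data (at most max_distance bytes before each block) it still needs, and whether requesting that precomp'ed range is worthwhile at all, which is the "which precomp'ed data to request first" question above.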