
Thread: filesharing with built-in recompression

  1. #1
    Administrator Shelwien

    filesharing with built-in recompression

    Code:
    <schnaader>
      I really would like to see your P2P implementation on
      linux distros, btw. Like you'd only have to download 1/3 of
      the new distribution as most files can be patched from the
      old one you still got :)
    <Shelwien> 
      i guess we'd need to add support for using all the local
      data as a reference ;)
      but people won't like a p2p program scanning all of their hdds ;)
    <schnaader>
      Even with non-local data it would speed up the upload very
      much :) Well, at least adding the whole data in your
      shared/incoming folder would be enough in most cases.
    <Shelwien>
      well, simply reference data indexing should be different
      from shared data indexing
      i didn't think about that, any more ideas? ;)
    <schnaader> 
      Not yet. Still somehow ordering the whole concept in my
      head, especially the recompression part.
    <Shelwien> 
      to take recompression into account
      we just have to produce multiple hash "fingerprints" for the data
    <schnaader> 
      Guess you'll see the best possibilities after you've got
      a first usable reference implementation.
    <Shelwien> 
      I have a usable rdiff implementation using it, that's imho enough
      1. we can search for the original file using its hash
      2. compute primary hashtable, then secondary hashtable
         from it, and search for secondary hashtable hashes
      3. search for primary hashtable hashes
      ...wait, which direction is this?
      %)
      ok, lets say we want to download a file, then
    <schnaader> 
      That's exactly how I got confused with the idea the last days :)
    <Shelwien> 
      1. we search for info on that file by its hash
      2. download its secondary hashtable and look up its
         parts in local reference database
      3. download parts of primary hashtable corresponding to
         unknown secondary hashes
      4. repeat the local reference lookup again
      5. find sources by primary/secondary hashes
      6. check availability of alternate hashtables (for
         precomp'ed data etc) and repeat 2-5 to get more sources
    <schnaader> 
      Step 3 is pretty much where I often got stuck, 
      this helped, thanks.
    <Shelwien> 
      also for precomp'ed data we have to know how it maps to
      original data
    <schnaader> 
      Yes, for deflate f.e. we'd have to separate file data
      and "deflate-data" (literal/match information). We know
      the file data but need the deflate-data to reconstruct
      the stream. Or is there something better?
      Of course, standard deflate streams would help as we
      just have "standard compression level 9" as deflate-data
      in that case :)
    <Shelwien> 
      ah, wait, i guess we can find the proper place for
      recompressed data by primary hash lookups
      then we only have to know how to reconstruct the
      original data from a part of filtered data
      ...so i guess it won't work with just precomp
      as even with a single zipped file we won't be able to
      reconstruct the deflate code after downloading a block
      from the middle of precomp'ed data
      but adding some info about deflate blocks would fix that
      so that if we'd download a block from 100M..101M of
      precomp'ed data, we'd be able to read from it where the
      deflate block starts and how far behind it references
    <schnaader> 
      Ah, that would be kind of a lookup table for "I'm at
      position 100M in the decompressed stream, where is my
      position in the deflate stream?"
    <Shelwien> 
      not quite... as i said, position in deflate stream can
      be found by primary hashtable, if we managed to rebuild
      some deflated data
      but we'd have to know where deflate blocks start, and
      how far behind they reference
      so that after getting enough data for that "window",
      we'd be able to reconstruct the deflate block
      and then merge it with the main download
      ...
      or maybe you're right...
      first, we'd have to know which precomp'ed data to request first
      because it only makes sense for the blocks which we
      didn't download already, and won't download very soon.
      So some approximate offset mapping is necessary too, it
      seems
    <schnaader> 
      It's also possible that we talk about similar things,
      just use different terms or slightly different approaches -
      The whole thing is too much of a vague idea atm :)
    <Shelwien> 
      it's not really that vague for me, as i already have a
      similar transfer protocol working, with secondary hashes and all
    <schnaader>
      the recompression thing would be the most difficult
      (although optional) part of this concept and can't be
      generalized for different algorithms, right?
    <Shelwien> 
      err, why?
    <schnaader> 
      For example (just as an extreme example) we can't easily
      recompress data in a solid mode 7z archive.
    <Shelwien> 
      here it depends on what we call "easy"
      first it can be recompression with forced "restore points"
      with added redundancy so that we would be able to
      reconstruct only part of compressed data
      it's not that problematic with 7-zip or any other LZ
      algorithms actually, only PPM/CM are troublesome in that sense
      (though ppmd tends to reset the model)
      but then, there's an alternative protocol too
      instead of downloading "uncompressed" data and
      converting them to required format locally,
      we can send some reference data to uploader
      and have him produce the recompressed data for you
    <schnaader> 
      nice, that's another possibility I didn't think of.
    <Shelwien> 
      in other words, if you're downloading a zip archive, but
      only see a source which has the same files in rar
      you can send some deflate parameters which you acquired
      to that source
      and it would be able to repack rar to zip using it
      and send that (or parts of) to you
      ...
      or a third choice ;)
      if it's not about transfer compression, but about finding
      rare sources, then we'd be able to just find matches in
      precomped data of some other files, and then download
      these files _in_whole_ and use them locally
      ...
      but the first version with runtime mapping of compressed
      formats actually seems reasonable too
      at least it would work with versions of deflate and other LZ
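
    The two-level lookup above (steps 1-5) boils down to a set difference against a local reference index. A rough Python sketch, assuming fixed-size blocks, SHA-1 hashes and a fetch_primary_group() callback standing in for the actual protocol - these are illustrative choices, not how the real rdiff-style implementation works:
    Code:
    import hashlib

    BLOCK = 64 * 1024   # primary hash granularity (assumed)
    GROUP = 16          # primary hashes per secondary hash (assumed)

    def primary_hashes(data):
        """One strong hash per fixed-size block of the file."""
        return [hashlib.sha1(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)]

    def secondary_hashes(primary):
        """One hash per group of primary hashes - a much smaller
        table that can be downloaded first."""
        return [hashlib.sha1(b"".join(primary[i:i + GROUP])).digest()
                for i in range(0, len(primary), GROUP)]

    def plan_download(secondary_table, local_secondary,
                      fetch_primary_group, local_primary):
        """Steps 2-5: match the small secondary table against local
        reference data, fetch primary hashes only for unknown groups,
        then request from sources only the blocks still not found locally."""
        needed = []
        for g, h in enumerate(secondary_table):
            if h in local_secondary:                    # whole group available locally
                continue
            for block_no, ph in fetch_primary_group(g): # step 3
                if ph not in local_primary:             # step 4
                    needed.append(block_no)             # step 5
        return needed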
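
    And the "where does the deflate block start and how far behind does it reference" bookkeeping could be a small side table along these lines; only the 32 KB deflate window is a hard fact here, the field layout is just an illustration of the idea, not Precomp's actual metadata:
    Code:
    from dataclasses import dataclass

    @dataclass
    class DeflateBlockInfo:
        uncomp_offset: int   # where the deflate block starts in the precomp'ed stream
        uncomp_length: int   # how much uncompressed data it covers
        max_back_ref: int    # how far behind it references (<= 32 * 1024 for deflate)

    def range_needed(blocks, start, end):
        """For a requested range [start, end) of precomp'ed data, return the
        possibly larger range that must be available locally so that every
        overlapping deflate block can be reconstructed."""
        lo = start
        for b in blocks:
            if b.uncomp_offset < end and b.uncomp_offset + b.uncomp_length > start:
                lo = min(lo, b.uncomp_offset - b.max_back_ref)
        return max(lo, 0), end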

  2. #2
    Administrator Shelwien
    Code:
    <schnaader> 
      btw, I planned to write some analysing tool soon that
      perhaps also could be helpful for testing here, kind of
      like Precomp without the recompression part, visualizing
      the file content in its different stages (by default the
      completely decompressed stage)
    <Shelwien> 
      it could be helpful
      i'd appreciate also a similar detection tool for other
      data properties
      like rep + delta + records + seg_file + wav/bmp etc
    <schnaader> 
      these would be useful plugins then :)
    <Shelwien> 
      otherwise, as it is now, we just don't know why a
      certain compressor wins
      Although, Sami says that practical compressors are those
      that can be used as is and show good results,
      but with experimental stuff it's not quite that simple -
      there're lots of data types, and a codec which is even
      _very_ good for specific data (like some WRT)
      won't show good overall results with testsets like
      Sami's or this Bulat's VM image
    <schnaader> 
      there are also many other things involved, like data
      ordering and dictionary sizes, which make a difference at
      least for big test data
      best would be to combine best results of several
      specialized codecs and a general purpose compression,
      but this would take ages to finish...
      although decompression would be quite fast :)
    <Shelwien> 
      well, hopefully i'm on the way to it now
      this thing i'm writing is basically the same as an archiver,
      just with a more general architecture - split into
      frontend and backend etc, and for now i'm still messing
      with stuff like archive format and exchanging patches
      but eventually i'd get to compression too, and there
      we'd have some use for various codecs, recompression, etc
    <schnaader> 
      well, even the basic concept is much better than the
      current things, so from there it should be possible to make
      some money out of it and continue with the optional parts
    <Shelwien> 
      one interesting point btw
      is that for online backup
      fast (de)compression is not really important
      no sense at all to have it working faster than your
      internet connection
      so it's quite likely that i would be able to start with
      some good CM right away ;)
    <schnaader> 
      yes, and you still have the possibility to multiply
      speed by simply using multiple servers :)
    <Shelwien> 
      well, as to servers the idea is to do as little as
      possible on servers. But the server doesn't really
      need to process the uncompressed data, right?
      So i can compress it locally, send, and server would
      store it. Then, probably, some kind of "defrag" process
      would be required for servers, which would reduce
      archive fragmentation (layers of updates etc), but
      normally it's not really required in backup operation
    <schnaader> 
      I found Ocarina's concept pretty nice (I guess this is
      something similar to what you just talked about) -
      they store the data uncompressed by default and after
      the data hasn't been used f.e. for 30 days, it gets
      compressed, so it will take less space.
    <Shelwien>
      yeah
      though in my case the backup would be compressed right
      away (for transfer and storage) and only "defragmented" later
      ...
      and then, with online backup, the p2p idea appears
      naturally too ;)
    <schnaader> 
      Yes, it's pretty much the same thing, only that you have data
      "in the cloud" instead of on the servers.
      And p2p at the moment just doesn't cross the borders of
      the file that gets transferred.
    <Shelwien> 
      well, kinda
      i've seen p2p apps that are barely able to merge downloads
      like downloading the same file from ed2k and torrent at once
      but it was very painful and manual
    <schnaader> 
      yes, there should be automatic tools for that. Even
      merging multiple torrents of the same source can get
      quite complicated.
    <Shelwien>
      not really - at least they could detect the blocks with
      matching hashes
    <schnaader> 
      "they could" is the sad thing here :)

  3. #3
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Okay, too lazy to read the whole thing, but ...

    1). Chrome (browser) uses some improved binary diff "technology" (better than Firefox, even)

    2). Fedora 12 has some kind of binary patches available via Yum (I forget what they call it)

    Both of these might? (I can't remember) use LZMA (which is also used by Slackware now), but it's still annoying how every mainstream Linux liveCD takes up at least 680 MB. :-/

  4. #4
    Programmer schnaader
    Quote Originally Posted by Rugxulo
    1). Chrome (browser) uses some improved binary diff "technology" (better than Firefox, even)
    That one was really nice. I didn't remember the name, it was something French, but it was about not just patching the executable differences, but disassembling the executables first, so it sounded a bit like Precomp for executables - and the results were much smaller than usual binary diffs.

    Ah, found it now: "Courgette" - see this Chromium Blog entry.

    Quote Originally Posted by Rugxulo
    but it's still annoying how every mainstream Linux liveCD takes up at least 680 MB. :-/
    Using Precomp in slow mode and 7-Zip/CCM you can bring most of them down to about 400-500 MB, although it'll take some time.
    Last edited by schnaader; 7th December 2009 at 20:13.

  5. #5
    Administrator Shelwien
    Here it is btw.
    It's been a pain with a not-Vista-aware MSVS version,
    but I managed to compile it, and it kinda works.
    Attached Files
    Last edited by Shelwien; 8th December 2009 at 00:49.

  6. #6
    Member Cyan

    Quote Originally Posted by Rugxulo
    but it's still annoying how every mainstream Linux liveCD takes up at least 680 MB. :-/
    At least there is this one, which has stuck (successfully, for now) to the 50MB limit...

  7. #7
    Member Skymmer
    Quote Originally Posted by Shelwien
    Here it is btw.
    It's been a pain with a not-Vista-aware MSVS version,
    but I managed to compile it, and it kinda works.
    Oh, that's nice! Thanks!
    Found some undocumented stuff. First of all, the syntax for the -genbsdiff and -applybsdiff commands:
    Code:
    -genbsdiff <old_file> <new_file> <patch_file>
    -applybsdiff <old_file> <patch_file> <new_file>
    Also undocumented switches:
    Code:
    -gen1[au] <old_file> <new_file> <patch_files_root>
    -repeat
    Don't know what they're for. Also interesting: what's the difference between gen/apply and genbsdiff/applybsdiff? Maybe the *bsdiff switches are for working with BSDiff?

    Anyway, a first dirty test brought up one issue: the -dis/-asm commands don't preserve overlays in PE files. Tested on MSO.dll and the output file has no overlay.
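
    With the syntax above, a quick round-trip sanity check is easy to script. A sketch - the binary name courgette.exe and the test file names are just assumptions for whatever the attached sources build to:
    Code:
    import filecmp, subprocess

    TOOL = "courgette.exe"   # assumed name of the compiled binary

    def roundtrip(old, new, patch="test.patch", rebuilt="rebuilt.bin"):
        """Generate a patch from old -> new, apply it back to old, and check
        that the rebuilt file is byte-identical to new."""
        subprocess.run([TOOL, "-genbsdiff", old, new, patch], check=True)
        subprocess.run([TOOL, "-applybsdiff", old, patch, rebuilt], check=True)
        return filecmp.cmp(new, rebuilt, shallow=False)

    print("round-trip OK:", roundtrip("MSO_old.dll", "MSO_new.dll"))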

  8. #8
    Administrator Shelwien
    It seems that -gen is like an updated version of bsdiff - works a little better,
    but basically the same.
    The actual improvement comes when you convert both files with -dis
    first and run -gen after that.
    But it seems that -dis/-asm are lossy and might not be applicable
    to just any program.

  9. #9
    Programmer schnaader
    Quote Originally Posted by Cyan
    At least there is this one, which has stuck (successfully, for now) to the 50MB limit...
    This is actually a nice example of the advantages of an improved p2p. Just pick out two variants: the standard ISO (dsl-4.4.10.iso) and the VM image (dsl-4.4.10-vmx.zip). If you download them both, you'll download almost 104 MB. But now see what happens if we just glue them together and run srep on the result:

    Code:
       103 999 403      dsl_iso_zip.dat
        74 527 355      dsl_iso_zip.dat.srep
    And it gets even worse. If we have a look inside the VM zip, we'll see there are 2 files in it: Another dsl-4.4.10.iso (no, it's not exactly the same) and a tiny dsl.vmx file (488 bytes). Let's see what happens if we glue the standard ISO and those two files together and run srep on it:

    Code:
       104 657 384      dsl_both_isos_and_vmx.dat
        51 923 368      dsl_both_isos_and_vmx.dat.srep
    So we downloaded 103.9 MB, but only 51.9 MB were necessary... Of course, you could blame the DSL creators here, who could've provided a program smaller than 1 MB that "converts" the standard ISO into the VM version. But an improved p2p with recompression features could automatically detect the redundancy here and download only a "patch" to recreate the VM zip. And even without recompression, only 74.5 MB are necessary.
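
    The saving comes from long-range matches across the concatenated files, which is what srep looks for. For a rough do-it-yourself estimate of such cross-file redundancy, you could count how many fixed-size blocks of one file also occur (block-aligned) in the other - much cruder than srep's matching and sensitive to alignment, so treat it only as a lower bound:
    Code:
    import hashlib, sys

    BLOCK = 4096   # assumed block size; alignment differences will hide matches

    def block_set(path):
        """Hashes of all fixed-size blocks of a file."""
        seen = set()
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK):
                seen.add(hashlib.sha1(chunk).digest())
        return seen

    def shared_fraction(path_a, path_b):
        """Fraction of B's blocks that also occur at block-aligned positions in A."""
        a = block_set(path_a)
        shared = total = 0
        with open(path_b, "rb") as f:
            while chunk := f.read(BLOCK):
                total += 1
                shared += hashlib.sha1(chunk).digest() in a
        return shared / max(total, 1)

    print(f"{shared_fraction(sys.argv[1], sys.argv[2]):.1%} of B's blocks also occur in A")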

    Additionally, even if you just download one of the files, you would have twice the sources to download from, as you'd be able to get most of the data from people who have the other variant. This "seed multiplier" probably won't speed up your download here, as DSL is highly available, but it will help with rarer files.
    Last edited by schnaader; 8th December 2009 at 14:16.
