Thread: Multi-Volume Archives

  1. #1
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Multi-Volume Archives

    Hi,
    I'm currently collecting archive structure ideas to decide which way to go for BIT (or maybe to invent a better way). Today I tried to understand the multi-volume archive capabilities of WinRAR and 7-Zip separately, and made some small tests. Here is what I've observed:

    WinRAR is fully aware of each volume's number even when the volume names are damaged (including the extension). That's possible because it stores the volume number in each volume. WinRAR also stores metadata for each file which has a part in the current volume: basically a flag that indicates how to decode a stream, i.e. whether it needs the previous volume, the next volume, or both.

    7-Zip volumes are basically "raw" parts of a single archive. In other words, 7-Zip is not aware of any volume information when the volume names change. Actually, I was a bit shocked that it uses such a basic method to implement multi-volume archives.

    As a summary, the minimum volume size in WinRAR is [Archive Headers]+[N], while in 7-Zip it's basically [N].
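
    For illustration, here is a rough sketch of what such a volume header and per-file flag could look like; the names and layout are my own guesses for a BIT-like format, not the real RAR structures:

        #include <cstdint>

        // Hypothetical sketch of a WinRAR-style volume header (not the real
        // RAR layout). Storing the volume number inside each volume lets the
        // archiver identify a volume even if its file name is damaged.
        struct VolumeHeader {
            uint32_t magic;         // archive signature, same in every volume
            uint32_t volume_number; // 1-based index of this volume in the set
        };

        // Per-file flag telling the decoder where a stream continues.
        enum StreamContinuation : uint8_t {
            kContained        = 0, // stream lies entirely in this volume
            kFromPrev         = 1, // stream started in a previous volume
            kIntoNext         = 2, // stream continues into the next volume
            kFromPrevIntoNext = 3  // both: middle part of a long stream
        };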

    Now, the question is: do you have any ideas for implementing multi-volume archives? I'd prefer to hear about multi-volume archives which support modification.
    BIT Archiver homepage: www.osmanturan.com

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    I think it's straightforward.
    Just understand how you want it to work from the user's point of view,
    and implement that.
    For example, whether existing files have to be overwritten (with
    whole-archive updates, volume resplitting etc.), or whether new versions
    can just be added to a new volume.
    Or whether you want to be able to extract files completely contained
    in one volume without access to other volumes.
    Or whether you want to list the whole archive contents without access
    to all volumes.
    You just have to imagine a few practical use cases, and implement
    a format that would be convenient for all of them.
    Last edited by Shelwien; 9th June 2009 at 20:53.

  3. #3
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    I guess those situations are what Osman wants to read about.
    I don't use all the features every good archiver has, but what matters in this area is some extra information that makes the archive safer.
    It would be important to me that the volume names don't matter :-D and that a missing block could be skipped without losing the information in the following blocks.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    Btw, I've just got a new (imho) idea: compressor recovery points.
    The idea is to compress the data in a completely solid mode,
    while occasionally saving the information required for independent
    block access, like compressed model state (including out-of-block
    matches for LZ), maybe even state diffs.
    For LZ, it might actually be cheaper than compressing the files/blocks
    separately, while still providing random access to these files/blocks.
    And with an additional feature of that extra info being removable.
    And it would be ok for ppmd as well, I think.
    However, for hashtable-based CMs it would be kinda problematic
    to properly compress the hashed statistics... and there'd be high
    redundancy if we don't do it properly.
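
    A minimal sketch of what I mean, assuming generic Model and Encoder types with a serializable state (the interfaces here are illustrative, not an existing API):

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Sketch: compress fully solid, but every `interval` bytes emit a
        // snapshot of the model state to a side channel. A decoder can later
        // restore the nearest snapshot and decode one block independently,
        // without replaying the whole archive.
        struct RecoveryPoint {
            uint64_t input_offset;      // position in the uncompressed data
            uint64_t coded_offset;      // position in the compressed stream
            std::vector<uint8_t> state; // (compressed) model state snapshot
        };

        // Model/Encoder and their member functions are assumed interfaces.
        template <class Model, class Encoder>
        void compress_with_recovery(const uint8_t* data, size_t n,
                                    size_t interval, Model& model,
                                    Encoder& enc,
                                    std::vector<RecoveryPoint>& points) {
            for (size_t i = 0; i < n; ++i) {
                if (i % interval == 0)
                    points.push_back({i, enc.position(), model.serialize()});
                enc.encode(data[i], model); // solid: model is never reset
                model.update(data[i]);
            }
        }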

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by Shelwien View Post
    Btw, I've just got a new (imho) idea: compressor recovery points.
    The idea is to compress the data in a completely solid mode,
    while occasionally saving the information required for independent
    block access
    I think it will be worse than having independent blocks. With independent blocks, you start with an empty model. With a recovery point, you still start with an empty model, then add some info learned from the previous data (!), and then go on to compress this block's data. If you instead added info learned from this block's data, compression should be better.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    It's not that obvious; it depends on the model and data properties.

    1. Model recovery info + full solid compression
    2. blockwise compression with model reinit

    The result of the comparison 1 vs 2 depends on the amount of
    redundancy added due to model reinit vs the compressed model
    state size.

    So basically it's the same as compressed statistics vs adaptive models
    (like your static CM vs PPM) - in theory the amount of information is
    the same, and in practice the statistics need a much more advanced
    model for compression, so normally such an approach displays some
    redundancy. However, there are also examples of improved compression
    with separately compressed statistics - e.g. with low-order CM coders,
    or in asymmetric compression algos.

    Thus, I believe that with the "recovery point" approach the compression
    loss can be kept at an insignificant level, with possible occasional improvements,
    over the model-reinit approach.

    And there's an additional benefit of being able to remove the recovery-point
    data and turn the non-solid archive into a solid one without recompression.

    I'm quite sure that it would work with LZMA and such, probably PPMD too.
    But it would hardly be possible for paq-like coders, because of the very high
    complexity of the models necessary for good compression of their statistics.
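
    To make the removability point concrete: if each recovery snapshot is stored as a separate tagged record beside the coded data, stripping them back out is a pure copy. A sketch, assuming a hypothetical [tag][length][payload] record layout:

        #include <cstdint>
        #include <iostream>
        #include <vector>

        // Hypothetical record layout: [1-byte tag][4-byte LE length][payload].
        enum RecordTag : uint8_t { kCodedData = 0, kRecoveryState = 1 };

        // Turning the random-access archive back into a plain solid one is a
        // pure copy that drops recovery records - no recompression needed.
        void strip_recovery(std::istream& in, std::ostream& out) {
            uint8_t tag;
            while (in.read(reinterpret_cast<char*>(&tag), 1)) {
                uint8_t len4[4];
                in.read(reinterpret_cast<char*>(len4), 4);
                uint32_t len = len4[0] | len4[1] << 8 | len4[2] << 16 |
                               uint32_t(len4[3]) << 24;
                std::vector<char> payload(len);
                in.read(payload.data(), len);
                if (tag == kCodedData) {
                    out.write(reinterpret_cast<const char*>(&tag), 1);
                    out.write(reinterpret_cast<const char*>(len4), 4);
                    out.write(payload.data(), len);
                } // kRecoveryState records are simply skipped
            }
        }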

  7. #7
    Member Fallon's Avatar
    Join Date
    May 2008
    Location
    Europe - The Netherlands
    Posts
    158
    Thanks
    14
    Thanked 10 Times in 5 Posts


    Some history.
    Besides WinRAR, the other archiver with usable multi-volumes has been WinAce, because it featured a recovery option (it is, however, even more closed-source than WinRAR).
    WinImp had multi-volumes too (closed source, no support, no recovery, not widely used) and was the first archiver that started mv numbering with 01 instead of 00. WinRAR changed to that newer style later.

    A comparison between WinRAR and 7-Zip leaves 7-Zip lacking, because Igor chose the 'split' approach that you describe (to avoid any possible issue with future compatibility?).
    With 7-Zip, a user has to put all parts in the same directory first before extraction can start (and only from the first part).
    Multi-volumes should preferably open via any part in the chain, like in WinRAR.
    Unpacking from separate volumes is obviously nice to have in practice.

    To make the case for mv in a few words: multi-volumes are convenient to have whenever file size limitations become an issue. In the past for spanning floppies (ARJ), nowadays for uploading to lots of sites, for email, and for splitting big files. What 7-Zip has is still better than nothing.

  8. #8
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    To keep the thread from going off-topic, it's better to write what I'm thinking about it.

    I always imagine my housemate's behavior while designing compression features, because he has a really "bad" habit which can be described as collecting programs, games, movies, TV series, MP3s etc. and uploading them to his RapidShare account to share with someone. So, he frequently uses multi-volumes. His possible "bad" scenarios with multi-volumes can be summarized like this:

    1-He forgot to add the newest service pack/patch of a program/game to the archive. So, he must create another multi-volume archive to share it.

    2-He accidentally added a "bad" version of a service pack/patch of a program/game to the archive. So, he must "replace" it with a new, usable version. Basically he has to re-upload all volumes, or distribute a "patch" volume whose contents have to be merged into the main archive manually.

    My current "draft" idea can fit in case 1. But, case 2 has some problem though. Prior to explain my idea, it's better to explain some info about new structure:

    1-The format should support fragmentation (=data analysis). So, the smallest primitive compressed element in an archive is defined as a "stream". A file can be represented by several streams, and/or two or more files can share the same streams in solid mode. A stream is basically a single independent compressed block.
    2-The first volume stores the total number of volumes.
    3-The FAT should start in the first volume and can continue on specific volumes. The last volumes would be a good place for it in case of modification.
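
    To make this concrete, here is a minimal sketch of the metadata these three points imply (the field names are illustrative, not a fixed BIT format):

        #include <cstdint>
        #include <string>
        #include <vector>

        // Illustrative metadata for points 1-3 (not a fixed BIT format).
        struct StreamRef {
            uint32_t stream_id;     // the smallest compressed primitive
            uint32_t volume_number; // volume holding this part of the stream
        };

        // A file maps to several streams; streams may be shared between
        // files in solid mode.
        struct FileEntry {                // one FAT record
            std::string path;
            std::vector<StreamRef> parts;
        };

        struct FirstVolumeHeader {
            uint32_t total_volumes; // point 2: the volume count lives here
            uint64_t fat_offset;    // point 3: the FAT starts in volume 1
        };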

    Now, it's time to explain how to deal with Scenario #1 using the above structure definition:

    He just reopens his multi-volume archive and adds new files to it. BIT creates new volumes and also changes the first volume so that BIT knows how many volumes exist in the archive. BIT also modifies the FAT table. So, only the first volume plus the added volumes need to be re-uploaded.

    Now, here is the problematic part: Scenario #2. Here is the detailed scenario:

    - There are 10 volumes in the archive.
    - Each volume's size is N bytes.
    - He wants to replace file X, which has "streams" in both the 5th and 6th volumes, with file Y, which is 3N bytes or more.
    - The previous volumes (i.e. the 4th volume) have some streams which continue only into the 5th volume.
    - File Z starts within the 6th volume and continues across the following volumes.

    Restrictions:
    - There should be as little "upload" traffic as possible (=fewer affected volumes).
    - BIT should be aware of each volume's number, so it can ask "please insert disk N".

    Here is my draft idea to deal with it (a bookkeeping sketch follows this list):
    - Each stream has a pair of references, equivalent to WinRAR's flag. But instead of a flag, it's an integer which tells us the disk number. So, streams can be freely distributed over any volumes. The same is true for the FAT table.
    - When file X is deleted, it's removed from the FAT table, which can affect the last volume (because the FAT continues from the first volume), and its streams are flushed out of the 5th and 6th volumes (note that the 5th and 6th volumes still contain the other streams).
    - Instead of totally removing a volume when it has no streams left, BIT should just flush the streams out of it, and this tiny volume should be re-uploaded to keep the process from breaking.
    - File Y's stream parts are written to the 5th and 6th volumes, and the rest are referenced to volumes N+1, N+2, ..., N+M.
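
    And a rough sketch of the bookkeeping for this replacement flow; it only computes which volumes get dirty (and so must be re-uploaded), with the actual stream rewriting omitted:

        #include <cstdint>
        #include <set>
        #include <vector>

        // Sketch of the scenario-2 bookkeeping: which volumes are touched
        // and therefore must be re-uploaded. All other volumes stay as-is.
        std::set<uint32_t> volumes_touched_by_replace(
                const std::vector<uint32_t>& x_volumes, // X's streams: 5, 6
                const std::vector<uint32_t>& y_volumes, // Y: 5, 6, N+1..N+M
                uint32_t fat_tail_volume) {             // the FAT ends here
            std::set<uint32_t> dirty(x_volumes.begin(), x_volumes.end());
            dirty.insert(y_volumes.begin(), y_volumes.end());
            dirty.insert(fat_tail_volume); // deleting X rewrites the FAT tail
            return dirty;
        }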

    Now, here is the questions:
    - Is it really necessary that we have "all" volumes to see the full file list? (=WinRAR behaviour). In that case, if we have several disks, each storing one volume, and only the first disk is inserted, we can only see the first volume's contents! The good point is that you don't need the other volumes unless they contain parts of the streams belonging to the file you want to extract. Note that, in WinRAR, the FAT is spread around all volumes.
    - Do you have any better idea?
    BIT Archiver homepage: www.osmanturan.com

  9. #9
    Member Fallon's Avatar
    Join Date
    May 2008
    Location
    Europe - The Netherlands
    Posts
    158
    Thanks
    14
    Thanked 10 Times in 5 Posts


    >> - Do you have any better idea?
    No. But as you say, if there is a way to see more of the content of all volumes, that would be better.

    In my modest opinion, the main thing is that a multi-volume archive can be opened as easily as any archive
    (a double-click, without a search for a particular file number - my grandmother would not like that).
    When it's open, the typical user is a keystroke away from extracting, which will solve other issues.

    >> There should be as little "upload" traffic as possible (=fewer affected volumes).

    The mv updating idea may work offline, and maybe on known servers (where you have permissions).
    But I wonder whether 'file size limitations' are an issue in those circumstances.
    When we talk about our own space, we may even ask how much archiving is necessary.

    Cases where multi-volumes can be of interest for users to make:
    a)
    with 100 MB of space and a limited file size of 11 MB - example: Google Sites.
    (here, because of the limited space, in case of packing mistakes a complete re-upload seems as likely as adding additional volumes).
    b)
    with backup to an online mail service with a large inbox - example: Gmail.
    (here, backup is not of interest when a webmail provider requires login every 30 or 60 days).
    c)
    when you upload say 2 GB to a server that does not support download managers (resume) or FTP.
    (here, multi-volumes are likely to save time. You have to start over again if the browser halts the 'Download' for some reason, or halts the 'Upload' for that matter.) Example: posted multi-volume links to movies on forums.
    d)
    for mailing some archive to a colleague or a friend. Lots of providers limit attachment size to 5 MB or less. Alternative file exchange via a chat program may not be wanted in a workplace.
    For files over 10 MB to an unknown receiver, mailing the password and uploading the archive encrypted is probably the better option (the size of the inbox may be unknown, and the download time of a large attachment may be unwelcome).
    e)
    for large archives, exchanges between (home) servers. When downloading multi-volumes, the user has the option to shut down the PC after any part (and can be off for the weekend or whatever).
    (in more and more countries less of an issue because of upcoming broadband speed via glass fiber cable).
    People are unlikely to pack their stuff up in multi-volumes just to be considerate.

    For some, an additional reason to use multi-volumes could be to guard against corruption in large archives (the recovery option as an argument to use mv).

    Rapidshare with its 100 MB limit is still popular because of its search option, but for any file over 100 MB it cannot be recommended. You can only download 1 file per 24 hours for free, and there is no guarantee that a second volume will still be up there the next day.
    Several online storage sites will do a better job for bigger files.

    Concluding (just my opinion): if you do multi-volumes anything like the way WinRAR does, it's fine.
    Last edited by Fallon; 10th June 2009 at 14:53.

  10. #10
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
    1) unlike other archivers, make the first volume smaller (not the last); this will ease preview and update (but will make archive creation slower)

    2) it is possible to extract a file that starts in the current volume, at least partly, for preview = a volume contains all info about streams started in it

    3) it is impossible to extract a file that does not start in the current volume, for obvious reasons (compression) and for lack of sense (video won't play from the middle, unless it's MPEG) = a volume contains only basic info about files started in another volume

    4) the first volume contains all info about all files

    5) in case of name loss, the first volume should be able to detect the other volumes among a bunch of messed-up archives, using some unique hash for every volume (sketched below)

    6) a binary copy command should produce a perfectly valid archive (dunno why)

    7) every volume is a valid archive file, except one stream

    >> only the first volume plus the added volumes need to be re-uploaded

    9) >> There should be as little "upload" traffic as possible
    in case some files are deleted, the archiver should say whether there are absolutely redundant volumes, so you can easily delete a few volumes without affecting extraction of the whole archive.
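
    Point 5 is easy to sketch: if every volume carries the archive's unique id plus its own index in a fixed header, renamed volumes can be found by scanning headers instead of trusting names (the header layout below is hypothetical):

        #include <cstdint>
        #include <filesystem>
        #include <fstream>
        #include <map>
        #include <string>

        // Hypothetical fixed header at the start of every volume.
        struct VolumeId {
            uint64_t archive_uid;   // unique per archive set
            uint32_t volume_number; // index of this volume
        };

        // Recover volume order by scanning headers instead of file names.
        std::map<uint32_t, std::string> find_volumes(const std::string& dir,
                                                     uint64_t wanted_uid) {
            std::map<uint32_t, std::string> found; // volume number -> path
            for (const auto& e : std::filesystem::directory_iterator(dir)) {
                std::ifstream f(e.path(), std::ios::binary);
                VolumeId id{};
                if (f.read(reinterpret_cast<char*>(&id), sizeof id) &&
                    id.archive_uid == wanted_uid)
                    found[id.volume_number] = e.path().string();
            }
            return found; // gaps in the keys = missing volumes
        }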

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    It's a totally "innovative" idea,
    but I think that for development it's very reasonable to completely
    split the compressor and archiver parts.
    Like that, a compressor would produce a compressed stream from
    a list of input files, and the decompressor would restore the files given
    the compressed file and the same list.
    Where each file starts is important for compression, so there's
    probably no way around that; a file-to-file compressor interface won't be
    universal enough, but it should be possible to encode EOFs instead
    of file lengths in such an encoding, and filenames don't have to be saved.
    And then, a separate archiver utility would have to construct an archive
    out of the already compressed files, including updating an existing archive,
    splitting into volumes, combining the optimal (by any metric) archive
    from multiple versions of compressed data, etc.

    My point here is that it might be better not to start with the concept
    of an archiver sequentially processing the data, calling the compressor
    when needed, and sequentially writing the archive until the end.
    Imho it's much more interesting for the archiver to start actually working
    after all the data is compressed - at the least it's much more flexible.
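
    A minimal sketch of the split I mean; the interfaces are placeholders, not an existing API:

        #include <cstddef>
        #include <cstdint>
        #include <string>
        #include <vector>

        // The compressor sees only ordered file contents; file boundaries are
        // encoded as EOF symbols in-stream, and names are not stored at all.
        struct Compressor {
            virtual std::vector<uint8_t>
            compress(const std::vector<std::string>& file_list) = 0;
            virtual void
            decompress(const std::vector<uint8_t>& stream,
                       const std::vector<std::string>& file_list) = 0;
            virtual ~Compressor() = default;
        };

        // A separate archiver works only with already-compressed streams:
        // building, updating, splitting into volumes, or picking the best of
        // several candidate compressed versions of the same data.
        struct Archiver {
            virtual void
            build(const std::vector<std::vector<uint8_t>>& streams,
                  size_t volume_size, const std::string& out_base) = 0;
            virtual ~Archiver() = default;
        };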

  12. #12
    Member Fallon's Avatar
    Join Date
    May 2008
    Location
    Europe - The Netherlands
    Posts
    158
    Thanks
    14
    Thanked 10 Times in 5 Posts


    @Shelwien
    >> "innovative" idea (....) to completely split the compressor and archiver parts.
    If you like, a name for your idea: modular archiving.

    @chornobyl
    >> 1) unlike other archivers, make the first volume smaller (not the last)
    First implemented in WinImp.

  13. #13
    Member
    Join Date
    Feb 2009
    Location
    München, Germany
    Posts
    8
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by osmanturan View Post
    He just reopens his multi-volume archive and adds new files to it. BIT creates new volumes and also changes the first volume so that BIT knows how many volumes exist in the archive. BIT also modifies the FAT table. So, only the first volume plus the added volumes need to be re-uploaded.
    Hmm... if you want to put a table of files (like a FAT) into the first part of the archive, you must not put compressed data into the first part too. Because: how big would you want the FAT to be? You'd need a fixed size, and it couldn't be compressed, or it might grow when you add a file and thus not fit into the space that was reserved for it in the first volume.

    How about appending the FAT to the last part of the multi-volume? When adding new files, BIT could cut the FAT off the last volume, add the new files to the end, update the FAT (thereby maybe increasing its size), and append it to the last volume (or add another volume if needed). And it would also be possible to compress the FAT before appending it...

    Thinking about it... the obvious disadvantage is that you need the last part of a multi-volume before you can start. The archiver must seek to the end of the file, read the last few bytes (they describe the length of the (compressed) FAT), then seek backwards by this length, uncompress the FAT, and can then start unpacking files from the archive.
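
    That lookup is the same trick ZIP uses with its end-of-central-directory record. A sketch of the open path, assuming a hypothetical 8-byte trailer at the very end that holds the compressed FAT length:

        #include <cstdint>
        #include <fstream>
        #include <string>
        #include <vector>

        // Sketch: the last 8 bytes of the last volume hold the length of the
        // compressed FAT that sits right before them (read as host-endian
        // here for brevity; a real format would fix the byte order).
        std::vector<char> read_fat(const std::string& last_volume) {
            std::ifstream f(last_volume, std::ios::binary | std::ios::ate);
            std::streamoff end = f.tellg();
            uint64_t fat_len = 0;
            f.seekg(end - 8);                             // the trailer
            f.read(reinterpret_cast<char*>(&fat_len), 8); // FAT length
            std::vector<char> fat(fat_len);
            f.seekg(end - 8 - static_cast<std::streamoff>(fat_len));
            f.read(fat.data(), fat_len);
            return fat; // still compressed; decompress before parsing
        }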

    Quote Originally Posted by osmanturan View Post
    Now, here is the problematic part: Scenario #2.
    Here I don't see any other way but rewriting/recompressing the multi-volume beginning at the first byte that became obsolete, and then you might gain some compression ratio if you do it for the whole archive altogether.

    Unless you want to go for the smallest effort... but replacing some compressed data in an archive with some other compressed data leads to something like a "defragmentation problem", made worse by the fact that different files can share streams.


    Quote Originally Posted by osmanturan View Post
    - Instead of totally removing a volume when it has no streams left, BIT should just flush the streams out of it, and this tiny volume should be re-uploaded to keep the process from breaking.
    - File Y's stream parts are written to the 5th and 6th volumes, and the rest are referenced to volumes N+1, N+2, ..., N+M.
    So the volumes could have different sizes...

    Scenario 1 would only require you to replace the last of the N uploaded volumes, given that the FAT is not split over the last two volumes.

    Scenario 2... Hm. Another idea: just declare the old streams invalid and add the replacement data to the end. If the invalid data spans a full volume, the archiver could tell you that you don't need to download Vol. 5.
    Otherwise, a small amount to upload, a bigger amount to download.

    It might be somewhat unusual to download the last volume only, let the archiver look at it and have it tell you what other volumes you need. But why not...
