Thread: Help on designing new file archiver format

  1. #1
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I want to create a real file archiver format for QuickLZ and have a few questions.

    a) Is there any reason to have the header (file list, etc) in the beginning of the file archive instead of having it at the end? The first method requires backwards seeking into the file/device at compression which I want to avoid.

    b) Are there any good (and fast!) alternatives to Adler32 that don't have that "missing wrap flaw" for small files?

    c) What can I do to make it easy for 3rd parties to support compression/decompression in their archiver tools? How is this usually handled?

    Any other important considerations?

    Lasse

  2. #2
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    166
    Thanks
    3
    Thanked 2 Times in 2 Posts
    Hi, I hope my opinion on the subject may help:

    a) There is no right or wrong way to implement this. You can even pre-scan the archive for a header area at the end: read the last word of the archive, which contains the size of the header, then start reading the header at archive size - header size. This usually has a very small impact on performance.
    However, writing the header at the beginning of the file may avoid some annoyances. For example, if the archive is spanned over multiple volumes on removable devices, you will need to ask the user to provide the last volume (hoping it contains the whole header, otherwise asking for a previous one) and then to provide the first one after the header has been read.

    b) usually CRC32 is preferred to Adler32 even if the latter is simpler to implement; however a good CRC32 implementation is almost as fast as Adler32.

    c) That may vary. In the *x world, for example, there is a strong emphasis on pipelining applications, so an application that runs without a GUI, accepts parameters, and communicates with other applications will be preferred (see, for example, how tar and gz/bz2 work together, or how Linux archivers support external binaries to extend archive format support).
    On Windows, instead, a DLL exposing archiving functions to be called from the main program's executable is usually preferred.
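
    On point b), the weakness of Adler32 on small inputs can be checked directly. Adler32 packs two 16-bit sums, A (1 plus the sum of the bytes) and B (the running sum of A), each taken modulo 65521. For short inputs A never reaches the modulus, so its upper bits stay zero. The sketch below uses Python's standard zlib module; the helper names are made up for this illustration, and this is presumably the "missing wrap" effect the question refers to.

```python
import os
import zlib


def adler_halves(data: bytes):
    """Split an Adler32 checksum into its two 16-bit sums (A, B)."""
    value = zlib.adler32(data)
    return value & 0xFFFF, value >> 16


def max_low_half(length: int) -> int:
    """Largest value A can take for an input of the given length."""
    # A = 1 + sum(bytes) mod 65521; each byte adds at most 255.
    return min(1 + 255 * length, 65520)


if __name__ == "__main__":
    length = 16
    limit = max_low_half(length)
    worst = max(adler_halves(os.urandom(length))[0] for _ in range(10000))
    print(f"{length}-byte inputs: A is at most {limit} (needs {limit.bit_length()} of 16 bits)")
    print(f"largest A seen over 10000 random inputs: {worst}")
    print(f"CRC32 of one random input: {zlib.crc32(os.urandom(length)):#010x}")
```

    For 16-byte inputs A fits in 12 bits, so 4 of the 32 checksum bits carry no information; CRC32 has no such length-dependent limit.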

    As for other considerations, IMHO you should think about what you want from the archive format in terms of additional properties and expected usage:
    What filesystem information are you going to store? That is especially relevant if the format is also targeted at backup purposes; will Unix or NT access and/or security policies be saved?
    Will the format need to implement cryptography?
    Do you plan to allow volume spanning?
    Will the archive contain recovery records for repair attempts?
    I think those are the most relevant additional topics to keep in mind when designing an archive format, but certainly the list can't be exhaustive.

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    744
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by giorgiotani
    but certainly the list can't be exhaustive.
    but the format can be (and should be!) extensible


    Quote Originally Posted by Lasse Reinhold

    a) Is there any reason to have the header (file list, etc) in the beginning of the file archive instead of having it at the end?
    it's a three-sided sword

    a) directory at the beginning, a la cabarc - decompression can be implemented with sequential file reads, but compression requires creating temporary files

    b) directory at the end, like 7zip/freearc. compression is sequential, but decompression needs to lseek in the archive file. that's much less work, though, than creating temporary files, so it's what we chose in our programs

    c) directory entry before each file, like in TAR. this allows both compression and decompression to be sequential, which is essential in the unix world, but it has its own obvious disadvantages
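
    Option b) can be sketched in a few lines. This is only an illustration under assumed conventions, not the 7zip or freearc layout: payloads are stored uncompressed, the directory is JSON, and the archive ends with a hypothetical 8-byte little-endian trailer holding the directory size (the "last word" idea from earlier in the thread). The writer never seeks; only the reader does.

```python
import io
import json
import struct

# Hypothetical trailer: directory size as an 8-byte little-endian integer.
TRAILER = struct.Struct("<Q")


def write_archive(out, files):
    """Write payloads sequentially, then the directory, then its size."""
    directory = []
    offset = 0
    for name, data in files.items():
        out.write(data)  # a real archiver would compress here
        directory.append({"name": name, "offset": offset, "size": len(data)})
        offset += len(data)
    blob = json.dumps(directory).encode("utf-8")
    out.write(blob)
    out.write(TRAILER.pack(len(blob)))


def read_directory(inp):
    """Seek to the trailer, then back to the start of the directory."""
    inp.seek(-TRAILER.size, io.SEEK_END)
    (dir_size,) = TRAILER.unpack(inp.read(TRAILER.size))
    inp.seek(-(TRAILER.size + dir_size), io.SEEK_END)
    return json.loads(inp.read(dir_size).decode("utf-8"))


def extract(inp, entry):
    """Random access to one member using its directory entry."""
    inp.seek(entry["offset"])
    return inp.read(entry["size"])


if __name__ == "__main__":
    buf = io.BytesIO()
    write_archive(buf, {"a.txt": b"hello", "b.bin": b"\x00\x01\x02"})
    for entry in read_directory(buf):
        print(entry["name"], extract(buf, entry))
```

    Note that the directory here starts at archive size minus directory size minus the trailer size, since the trailer itself is not counted in the stored size.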


    Quote Originally Posted by giorgiotani
    However writing the header at the beginning of the file may avoid some annoyances, i.e. if the archive is spanned on multiple volumes
    there are many ways to implement volume splitting inside the archiver. if we just split the archive file into pieces with an external utility, then we again have to choose between ease of archive creation and ease of extraction


    Quote Originally Posted by Lasse Reinhold
    c) What can I do to make it easy for 3rd parties to support compression/decompression in their archiver tools? How is this usually handled?
    windows: unrar.dll and optionally unrar sources


    Quote Originally Posted by Lasse Reinhold
    Any other important considerations?
    this depends on how much time you are ready to spend on this project. i suggest you start by studying existing archive formats, especially 7-zip. PeaZip is interesting for its great encryption/authentication support.

    my own archive format imho is the most flexible and powerful one, although it may be too complex to implement. when i developed it, i wrote some notes, and now i will try to translate them into English. i now know some weak sides of my current format and plan to improve it somewhere around 0.50, so you can readily start with an even better one

  4. #4
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    166
    Thanks
    3
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Bulat Ziganshin
    c) dir entry before each file, like in TAR. allows to make both compression and decompression sequential, which is essential in the unix world, but has its own obvious disadvantages
    To clarify further (btw, this is the solution I used in the PEA format): the obvious shortcoming of this scheme is that a program browsing the archive has to read through the whole archive to perform a list command, and to read through the archive until the selected object is found in order to extract it. That is much less efficient than the other two schemes (with contiguously listed content and offset indexes for non-sequential access to the archive's content).
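
    The cost described above is easy to see in a small sketch. The entry layout here (name length, payload size, name, payload) is a made-up illustration and is neither the TAR nor the PEA layout. Listing has to visit every entry header, and extraction has to walk the entries until the wanted name turns up.

```python
import io
import struct

# Hypothetical per-entry header: 2-byte name length, 8-byte payload size.
ENTRY = struct.Struct("<HQ")


def write_stream(out, files):
    """Write each entry header directly before its payload; no seeking."""
    for name, data in files.items():
        raw_name = name.encode("utf-8")
        out.write(ENTRY.pack(len(raw_name), len(data)))
        out.write(raw_name)
        out.write(data)


def walk(inp):
    """Yield (name, size) with the stream positioned at each payload."""
    while True:
        head = inp.read(ENTRY.size)
        if not head:
            return
        if len(head) < ENTRY.size:
            raise ValueError("truncated entry header")
        name_len, size = ENTRY.unpack(head)
        yield inp.read(name_len).decode("utf-8"), size


def list_stream(inp):
    """List all members; every entry in the archive must be visited."""
    result = []
    for name, size in walk(inp):
        inp.seek(size, io.SEEK_CUR)  # on a pipe: read and discard instead
        result.append((name, size))
    return result


def extract_one(inp, wanted):
    """Scan forward until the wanted member is found."""
    for name, size in walk(inp):
        if name == wanted:
            return inp.read(size)
        inp.seek(size, io.SEEK_CUR)
    raise KeyError(wanted)


if __name__ == "__main__":
    buf = io.BytesIO()
    write_stream(buf, {"a": b"xx", "b": b"yyy"})
    buf.seek(0)
    print(list_stream(buf))
    buf.seek(0)
    print(extract_one(buf, "b"))
```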

    Quote Originally Posted by Bulat Ziganshin
    there are many ways to implement volume splitting inside the archiver. if we just split the archive file into pieces with an external utility, then we again have to choose between ease of archive creation and ease of extraction
    Yes, I was thinking of a single-pass archiving scheme, which is inherently more complex to implement, and that complexity is wasted if volume spanning is not used, which may be the case most of the time.
    However, a two-pass scheme with an external utility for volume spanning has some disadvantages (i.e. it requires temporary files and double the disk time) unless the two applications can pipeline the work like tar/gz on *x, but this adds some complexity too.

    Quote Originally Posted by Bulat Ziganshin
    PeaZip is interesting for its great encryption/authentication support.
    Thanks a lot for the appreciation!
    I think that in a modern archive format, if encryption is taken into account, an authenticated encryption scheme should definitely be used rather than classic encryption-only schemes; otherwise I would recommend AES (Twofish and Serpent are good too, but today AES should be the first choice because it is more widely used and consequently more tested) in CTR mode: fast, efficient, well known and understood by analysts, with provable security properties.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    744
    Thanked 668 Times in 361 Posts
    Quote Originally Posted by giorgiotani
    Yes, I was thinking to a single pass archiving scheme
    there are many ways to implement such a scheme. you may put the directory of a volume at the volume's end; you may even ensure that volumes can be extracted separately, as was done in the arj format

    Quote Originally Posted by giorgiotani
    PeaZip is interesting for its great encryption/authentication support.
    well, we are talking here only about the archive format. if encryption/authentication is to be used, one should carefully plan which parts of the archive are encrypted/authenticated. in particular, i decided against implementing authentication in fa 0.40 because i would need to ensure that every part of the archive is signed - that's a stricter requirement than just encrypting the parts that hold interesting information. for example, i would have to resist someone simply appending to the archive a "false archive" that contains the enemy's data
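
    The append attack mentioned above is one reason to cover every byte of the archive with a single tag. The following is a minimal sketch of that idea only, using HMAC-SHA256 from Python's standard library as an assumed choice; it is not how fa or PeaZip do it, and the function names are hypothetical.

```python
import hashlib
import hmac

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes


def seal(archive: bytes, key: bytes) -> bytes:
    """Append a tag computed over the entire archive."""
    tag = hmac.new(key, archive, hashlib.sha256).digest()
    return archive + tag


def verify(sealed: bytes, key: bytes) -> bool:
    """Accept only if the last TAG_SIZE bytes authenticate everything before them."""
    if len(sealed) < TAG_SIZE:
        return False
    body, tag = sealed[:-TAG_SIZE], sealed[-TAG_SIZE:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


if __name__ == "__main__":
    key = b"demo key, not for real use"
    sealed = seal(b"directory + compressed blocks", key)
    print(verify(sealed, key))                      # untouched archive
    print(verify(sealed + b"false archive", key))   # data appended by an attacker
```

    Because the verifier treats the final bytes of the file as the tag over everything preceding them, anything appended after sealing shifts that window and the check fails.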


    Quote Originally Posted by giorgiotani
    dir entry before each file, like in TAR.
    one more variant is to place before each compressed block the list of files in that block. in this case we need only one read per, say, 8 mb of archive. we can also compress the directory blocks if they contain large numbers of files. this also makes the archive more robust - data and its directory lie together, so damage in other parts of the archive doesn't affect us

  6. #6
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    CRC32 is nowhere near as fast as Adler32. But with its speed, who cares? See http://www.cryptopp.com/benchmarks.html

  7. #7
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    166
    Thanks
    3
    Thanked 2 Times in 2 Posts
    With both algorithms fine-tuned, Adler32 keeps a good speed advantage (I thought it was not as big as in the linked example; this morning I was probably recalling obsolete or less optimized benchmarks, I apologize).

    Quote Originally Posted by nimdamsk
    But with its speed who cares?
    Yes, the performance penalty will probably not be noticeable in a typical use case for an archiving application.
    As the speed of both algorithms is well beyond disk read/write speed, it would probably not change much for the user even in a single-threaded application, and not at all in a multithreaded application, since either algorithm can process the data of one block before the next is read from the disk.
    It would probably be proportionally more noticeable if the data is in RAM, but in that case the process would be so fast that the user will not complain anyway.

  8. #8
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Quote Originally Posted by nimdamsk
    CRC32 is no way as fast as Adler32. But with its speed who cares? See http://www.cryptopp.com/benchmarks.html
    afair igor pavlov developed an algorithm that computes the crc-32 code using 1.2 cycles per byte. search the 7-zip forum for details.

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    744
    Thanked 668 Times in 361 Posts

  11. #11
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Thanks a lot for all the input on the different aspects.

    Bulat - if you get time to translate it, post a note here.
