
Thread: Idea for raising compression efficiency on disk images

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts

    Idea for raising compression efficiency on disk images

    A year ago I was involved in a project that dealt with compressing a few hundred iso images. Since isos tend to be treated as a single file by archivers, they are not handled as efficiently as they could be.

    My idea is to parse the images. Every image comes with a TOC, which an archiver can use to virtually handle all the files in it separately. This way all the usual benefits, such as sorting and grouping similar files together and being able to use filters on certain files, become available, and this should boost compressibility quite a bit.

    Parsing an ISO is also not too complicated. The archiver could take the file names from the TOC and later - when extracting - use the TOC to append the files in the correct order to the iso header. The excess space in the last sector of each file could be treated as part of the file (meaning each stored file would have a size that is a multiple of the sector size, e.g. 2048 bytes). This might be a little less efficient than just filling up the excess space algorithmically, but in the rare cases where this space is not zeroed out, it would prevent the archiver from being lossy.
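
    A minimal sketch of what reading such a TOC could look like, assuming a plain
    single-session ISO 9660 image with 2048-byte sectors and only listing the root
    directory (field offsets as in ECMA-119; the function name and output format
    are just for illustration):

    Code:
    import struct

    SECTOR = 2048

    def list_root_entries(path):
        with open(path, "rb") as f:
            # The Primary Volume Descriptor lives at sector 16.
            f.seek(16 * SECTOR)
            pvd = f.read(SECTOR)
            assert pvd[0] == 1 and pvd[1:6] == b"CD001", "not an ISO 9660 image"

            # The root directory record is embedded at offset 156 of the PVD.
            root_lba = struct.unpack_from("<I", pvd, 156 + 2)[0]
            root_len = struct.unpack_from("<I", pvd, 156 + 10)[0]

            f.seek(root_lba * SECTOR)
            data = f.read(root_len)

        entries, off = [], 0
        while off < len(data):
            rec_len = data[off]
            if rec_len == 0:                       # padding up to the next sector
                off = (off // SECTOR + 1) * SECTOR
                continue
            lba  = struct.unpack_from("<I", data, off + 2)[0]    # extent location
            size = struct.unpack_from("<I", data, off + 10)[0]   # data length
            name_len = data[off + 32]
            name = data[off + 33:off + 33 + name_len]
            padded = (size + SECTOR - 1) // SECTOR * SECTOR      # on-disc size incl. slack
            if name not in (b"\x00", b"\x01"):                   # skip "." and ".."
                entries.append((name.decode("ascii", "replace"), lba, size, padded))
            off += rec_len
        return entries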

    Of course this gets more complex as the image format gets more complex, but most of the formats are not just similar to iso, they are direct derivatives of it, such as cdi, nrg, bin and many more. In any case iso itself would be easy to handle. Also, it's the most widespread format, so a lot of people could profit from such an algorithm. Back then I had planned to write an external tool for this job, but I quickly gave up on the idea since my programming skills were too rudimentary and anything I could have written wouldn't have been up to standard and would have been far too slow to be usable.

    Anyway, that's my idea. Tell me what you think about it.


    Edit: This is the 10,000th post in this subforum. Woohoo!
    Last edited by Mexxi; 16th February 2010 at 01:45.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    Well, you're right, though it's kind of an obvious idea.
    But there are actually 2 completely different formats for .iso files -
    iso9660 and UDF, and a few different versions of each,
    so it's not really very easy to parse (also note that it can contain
    multiple sessions etc).

    But I already explained (on irc) how you can get a similar effect with
    existing tools:

    Compression:
    1. extract the .iso with 7z (eg. 7z x -ofiles .iso)
    2. compress the files to archive1 (with any archiver and any compression)
    3. compress the files to archive2 without compression (7z a -t7z -mx=0 archive2.7z files)
    4. generate a patch from archive2 to .iso with xdelta
    (or, alternatively, with 7z or other relevant tools like rep/srep, but that'd
    require a little programming)
    5. Store archive1 and xdelta patch to the final archive

    Decompression:
    1. extract the archive1 and xdelta patch from the "final archive"
    2. extract the files from archive1
    3. compress the files to archive2 without compression
    4. apply the xdelta patch to archive2
    5. you'd have your .iso at this point
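
    In script form the recipe could look roughly like this - a sketch only, assuming
    7z and xdelta3 are on PATH, using 7z for both archives for simplicity, and with
    placeholder file names. Note that for the patch to apply, archive2 has to come out
    byte-identical on both sides, so the same 7z version and options have to be used
    for packing and unpacking:

    Code:
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def pack(iso="image.iso"):
        run("7z", "x", "-ofiles", iso)                            # 1. unpack the image into ./files
        run("7z", "a", "archive1.7z", "files")                    # 2. real compression
        run("7z", "a", "-t7z", "-mx=0", "archive2.7z", "files")   # 3. same files, stored only
        run("xdelta3", "-e", "-s", "archive2.7z", iso, "iso.xd")  # 4. patch: stored archive -> iso
        run("7z", "a", "final.7z", "archive1.7z", "iso.xd")       # 5. keep archive1 + the patch

    def unpack(iso="image.iso"):
        run("7z", "x", "final.7z")                                # 1. recover archive1 + patch
        run("7z", "x", "archive1.7z")                             # 2. extract the files into ./files
        run("7z", "a", "-t7z", "-mx=0", "archive2.7z", "files")   # 3. rebuild the stored archive
        run("xdelta3", "-d", "-s", "archive2.7z", "iso.xd", iso)  # 4./5. apply patch -> original .iso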

    For batch application though, you'd have to also make a normal
    archive of .iso and compare the archive size with the one from
    above.
    Also, similar recompression of other decodable formats is
    possible too (e.g. cab or InstallShield), but you'd have to
    replace [2] with compression of the files into the target format
    with all the possible options and find which options produce
    the smallest patch size.

  3. #3
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Yes, but your way is much more complicated. It takes more steps, and xdelta's efficiency is also very RAM-dependent. My method doesn't require the program to process the full set of files twice. For a full DVD-R image xdelta would have to process roughly 8.7GB (2x4.37GB) of data. On top of that, xdelta would have to read the extracted files and write the new image, which amounts to another 8.7GB of traffic. That's anything but economical. My method processes the 4.37GB only once during compression and once during decompression. Also, the question is whether xdelta's redundancy detection stays that efficient on huge files. A couple of months ago I contacted Josh MacDonald and he told me that to get the most efficiency out of xdelta you'd have to give it as much RAM as the size of the file you want to process. That's okay for CD images, but DVD-R is already a close call, let alone DVD-DL or Blu-ray. Why worry about all of that if the same can be done with almost no extra memory consumption and hardly any overhead?

    You're correct that ISO supports more than just ISO9660. In fact, it supports 4 substandards of that convention apart from Joliet, and of course it also supports UDF and Rock Ridge. However, parsing any of the ISO9660 standards as well as Joliet is very easy. After all, if it were so hard, there wouldn't be so many tools to create, edit and burn them. Also, those TOCs are pretty simple in themselves; don't forget that they are meant to allow easy file access after all.

    Anyway, even if there were only iso9660 support, that would be enough. Most isos use the 9660 standard even if they also include Joliet. A simple check before processing the iso is enough to determine how to read the TOC; if Joliet-formatted entries are found, the algorithm just has to adapt to that. In any case you only need to read two values from the TOC: the name of the file including its full path, and its location. Both values use the exact same format under ISO9660 and Joliet, with the difference that Joliet stores filenames as 16-bit characters, so plain ASCII names simply gain an extra zero byte per character.
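
    As a sketch, the only per-name difference would then be something like this
    (assuming the ISO9660 tree uses plain ASCII identifiers and the Joliet tree
    uses UCS-2 big-endian; the helper name is made up):

    Code:
    def decode_identifier(raw: bytes, joliet: bool) -> str:
        if joliet:
            name = raw.decode("utf-16-be")   # 2 bytes per character
        else:
            name = raw.decode("ascii")       # 1 byte per character
        # Strip the ";1" file version suffix that ISO 9660 appends to file names.
        return name.split(";")[0]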

    After all files have been read and processed, the iso header including the TOC can be compressed separately, which ensures that the file structure on the disc remains the same. As for the files themselves, the consecutive way they are stored is exactly the same in all these different iso formats; they only differ in their headers.

    It's really less complicated than it sounds. I have built iso images in hex editors and inserted sectors and even complete files into their structure manually. It's no big deal. Also, there are libraries available that let third-party programs open all kinds of formats without requiring the programmer to know much about the image structure. And that's already enough, because it lets you parse the files and attach the original header back later on.
    Last edited by Mexxi; 16th February 2010 at 03:27.

  4. #4
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts

    compress iso ... ciso ...

    Maybe this can help:

    There is an open-source file format, CISO (Compressed ISO),
    which can use ZIP or LZMA compression:

    cfs/ciso - open container format - storage of DVD images and file sets

    There is a tool, PTISO:

    http://www.pismotechnic.com/ptiso/

    also with some source code:

    http://www.pismotechnic.com/download...154-src.tar.gz

    ---
    There was also an old thread here:

    http://encode.dreamhosters.com/showthread.php?t=356
    ---

    I think it would be very nice to have free, universal and simple source code
    to process an ISO file within a compressor program.
    That way each component of the iso file could be compressed in an optimal way,
    and we would get support for a directory structure included as well.

    Would this be very hard to implement?

    What do you say, Shelwien?

    Best regards

  5. #5
    Member Surfer's Avatar
    Join Date
    Mar 2009
    Location
    oren
    Posts
    203
    Thanks
    18
    Thanked 7 Times in 1 Post

  6. #6
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    ECM filters the ECC and EDC chunks out of raw sectors so they compress better; plain ISO doesn't contain those. Using ECM on images that use such sectors is a good idea, but it still doesn't help you process the files inside the image one by one, let alone group them.

    @Shelwien: Forgot to mention that ISO does not support multisession as you claim. You need other formats for that like mds, cdi or nrg. Even then, multisession images look like several iso files appended together. Each session usually has an ISO-based TOC followed by the appended data. The only difference in processing would be the necessity to read the footer of the image first, because that usually contains information about where each session starts and ends. The rest is just the same as described before.

    WinRAR, 7z and FreeArc can already read isos natively, so most of the work is done anyway. It's just a small step to implement the rest.
    Last edited by Mexxi; 16th February 2010 at 12:02.

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    @joerg: thanks for posting the links.
    I know these, but didn't store the links.

    > It's just a small step to implement the rest.

    It's kind of like saying "there are lots of deflate decoders - what's
    the problem with making a deflate recompressor?".

    The parser needed for recompression and a usual format decoder
    are completely different things. People who write ordinary decoders never
    care about losslessness or error tolerance.

    So it's a lot more work than you imagine, and very likely
    the parser would have to be written from scratch, without
    much help from any of these open sources.

    But as I said, the idea is obvious and will be implemented someday.
    It's just that there are lots of other formats with higher priority.

  8. #8
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Actually, I don't see what a parser for recompression has to do with this. I think your deflate example is comparing apples with oranges. Given that the file compression used by archivers is lossless anyway, there is little to worry about. Just extract the files, compress them and store the header of the image. Decompression is little more than extracting the image header and appending the files according to their position in the TOC (or with the help of a sort file that the archiver can create along the way). What exactly is complicated about that?
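
    A rough sketch of that rebuild step, assuming the raw image header (everything up
    to the first file extent) and each file's extent LBA plus its sector-padded data
    were saved at pack time (the names and in-memory layout are just illustrative):

    Code:
    SECTOR = 2048

    def rebuild_iso(header: bytes, files, out_path):
        """files: iterable of (lba, padded_data) tuples taken from the TOC at pack time."""
        with open(out_path, "wb") as out:
            out.write(header)
            for lba, data in sorted(files):   # TOC order == position on disc
                out.seek(lba * SECTOR)
                out.write(data)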

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    You suggest splitting the iso format... and then merging it back...
    preferably with the least possible redundancy.
    So I presumed that names and file sizes would have to be derived
    from the names and sizes of the actual files on restoring, etc.

    I guess you'd be surprised that courgette (Google's bsdiff-based
    binary diff tool) "forgets" overlays and values in section padding in exes.
    And that's even though the COFF format is clearly better documented and
    simpler than .iso.
    But that's what happens when you write a parser without the intention
    of making it lossless.
    Last edited by Shelwien; 17th February 2010 at 01:11.

  10. #10
    Member
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    77
    Thanks
    2
    Thanked 0 Times in 0 Posts
    If I understand you correctly, you think the idea is to derive the data from the files themselves, but I want to derive it from the TOC of the iso. In the end, that's the point of reference every drive and every iso application uses to retrieve the files.

    Of course losslessness is essential, that's why I also suggest not filling up the last sector of each file with zeroes, but actually compressing them as part of the file, just in case some information is stored there that is not referenced in the TOC.

    I had never heard of the COFF format, but iso is veeeery simple. I've been reverse engineering image formats from scratch and iso is by far the easiest. mdf/mds, for example, is really complex; same goes for the VirtualCD formats and other proprietary ones. I'll check out COFF later to see how it differs from ISO.

    Edit: In what way is ISO insufficiently documented? If any format is extremely well described, it's this one, or it wouldn't have become an industry standard. Also, you seem to ignore the fact that iso is already supported by major archivers, so whatever parsing problems there might be, they have already been overcome.

    I also checked out COFF. It's not a disc image format, so how can you compare it to iso in terms of simplicity? You're comparing apples with oranges again. Dragging other formats into the discussion kind of misses the point, and the same goes for Google's diff tool. The process I described doesn't rely on any diff algorithms, so why even mention that? How is Google's problem with not working losslessly related to what I described?
    Last edited by Mexxi; 17th February 2010 at 14:42.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    1. I haven't tried writing an iso parser, so I can only compare it to other formats.
    2. I don't have anything against the idea. In fact I have a plan to do it eventually.
    But do you expect that I'd write it for you now or what?
    3. We're talking about different kinds of parsers.
    4. Maybe you're different, but this kind of specification is
    not clear enough for me:
    http://www.ecma-international.org/pu...T/Ecma-119.pdf
    5. Don't forget that DVD images are normally in UDF format,
    and for CD images I don't see a problem with applying
    the scheme explained above.

    What I have in mind is extraction of all the structured information and
    removal of most of the redundancy, while also supporting any image files,
    including broken ones.

    And just splitting the single given image into parts may be simpler,
    but does it really make sense for compression improvement?
    I guess you should try seg_file or something instead.

    But existing iso decoder sources would likely be of no help for that,
    because such decoders tend to seek around the file extracting
    the data fields, and don't try to preserve all the information.
    Last edited by Shelwien; 18th February 2010 at 07:20.

