
Thread: precomp - further compress already compressed files

  1. #31
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    >it will be great to have archive format that can be written strictly sequentially. this means that index to metadata block(s) should be placed at the archive end

    FTL still saves an 8-byte size in the archive header, so you need to seek back. 7z/arc store the metainfo size in the last bytes of the archive, so the archive is decoded starting from the last few bytes.
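
    For illustration, a minimal sketch of the trailer approach (hypothetical layout, not the exact 7z/arc structure): the writer appends the metainfo after the data and then its size as the last 8 bytes, so everything is written strictly sequentially, and the reader starts from the end.

    Code:
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Writer: data blocks, then metainfo, then an 8-byte metainfo size (native
    // endianness here for brevity; a real format would fix the byte order).
    void write_archive(const std::string& path,
                       const std::vector<char>& data,
                       const std::vector<char>& metainfo) {
        std::ofstream f(path, std::ios::binary);
        f.write(data.data(), data.size());
        f.write(metainfo.data(), metainfo.size());
        uint64_t msize = metainfo.size();
        f.write(reinterpret_cast<const char*>(&msize), sizeof(msize)); // strictly sequential
    }

    // Reader: seek to end-8, read the size, then seek back to the metainfo itself.
    std::vector<char> read_metainfo(const std::string& path) {
        std::ifstream f(path, std::ios::binary);
        f.seekg(-static_cast<std::streamoff>(sizeof(uint64_t)), std::ios::end);
        uint64_t msize = 0;
        f.read(reinterpret_cast<char*>(&msize), sizeof(msize));
        f.seekg(-static_cast<std::streamoff>(sizeof(uint64_t) + msize), std::ios::end);
        std::vector<char> metainfo(msize);
        f.read(metainfo.data(), metainfo.size());
        return metainfo;
    }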


    >Every global data structure is a checksummed single block. They would be stored uncompressed

    so, FTL doesn't support compression/encryption of metainfo, making some users unhappy.

    also, FTL employs a fixed order of fixed-structure blocks. This means that if you ever need to change some block structure, older programs will be broken. Instead, arc metainfo is a sequence of tagged blocks. When block type 1 becomes obsolete, I just stop including it in newer archives. Some functionality will lose upward compatibility, but the rest will still work. This also means that 3rd-party utilities can add blocks of their own types.

    Then, some items, such as the codec tag, require more input. This means that when an older version reads an archive with a newer codec, it can't continue the decoding. arc encodes all codec params as an asciiz string, so older programs can display an archive listing that includes newer codecs, and can even extract files in solid blocks that use only older codecs.
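
    A toy illustration of that listing behaviour (the parameter-string syntax and the helper below are made up for the example, not arc's actual format):

    Code:
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // The archive only stores an asciiz description per codec, e.g. "lzma:d=64m" or
    // "newcodec:x=3". A program that doesn't implement "newcodec" can still show it in
    // the listing, and can still extract solid blocks that only use codecs it knows.
    void list_codecs(const std::vector<std::string>& stored,
                     const std::vector<std::string>& implemented) {
        for (const std::string& s : stored) {
            std::string name = s.substr(0, s.find(':'));
            bool known = std::find(implemented.begin(), implemented.end(), name)
                         != implemented.end();
            std::printf("%-24s %s\n", s.c_str(), known ? "" : "(unknown codec, listing only)");
        }
    }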


    >Since we'd be using variable-length integers, most of the benefits of SoA over AoS aren't really available

    this is the most important point. Among the usual file directory fields (name, size, date/time, attr, crc) only the first two are variable-sized, and even the size field may be encoded as a fixed-size integer - after lzma, compressed fixed-size data for the file-size field has about the same size as compressed VLI data, according to my tests. (I think that storing each byte of the size field separately may further improve the lzma-compressed size, but I haven't checked it.)

    next, with SoA you encode the field tag only once per archive rather than once per file. Also, you can store the field size after the field tag and thus skip unknown fields, again improving forward compatibility and 3rd-party extensibility

    finally, compression ratios are greatly improved on same-type data, i.e. name1,name2,...,attr1,attr2,... compresses better than the AoS layout
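
    To make the difference concrete, here is a toy sketch of the two layouts (the field set and the encoding are hypothetical, for illustration only); the SoA writer emits each tag once per archive and keeps same-type values contiguous, which is what the last two points rely on:

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct FileEntry { std::string name; uint64_t size; uint32_t mtime, attr, crc; };

    enum : uint8_t { TAG_NAME = 1, TAG_SIZE, TAG_MTIME, TAG_ATTR, TAG_CRC };

    static void put(std::vector<uint8_t>& out, const void* p, size_t n) {
        const uint8_t* b = static_cast<const uint8_t*>(p);
        out.insert(out.end(), b, b + n);
    }

    // AoS: the tag is repeated for every field of every file.
    std::vector<uint8_t> write_aos(const std::vector<FileEntry>& files) {
        std::vector<uint8_t> out;
        for (const FileEntry& f : files) {
            out.push_back(TAG_NAME);  put(out, f.name.c_str(), f.name.size() + 1);
            out.push_back(TAG_SIZE);  put(out, &f.size,  sizeof f.size);
            out.push_back(TAG_MTIME); put(out, &f.mtime, sizeof f.mtime);
            out.push_back(TAG_ATTR);  put(out, &f.attr,  sizeof f.attr);
            out.push_back(TAG_CRC);   put(out, &f.crc,   sizeof f.crc);
        }
        return out;
    }

    // SoA: each tag written once, followed by that field for all files
    // (name1,name2,...,size1,size2,... - same-type data stays together).
    std::vector<uint8_t> write_soa(const std::vector<FileEntry>& files) {
        std::vector<uint8_t> out;
        out.push_back(TAG_NAME);  for (const auto& f : files) put(out, f.name.c_str(), f.name.size() + 1);
        out.push_back(TAG_SIZE);  for (const auto& f : files) put(out, &f.size,  sizeof f.size);
        out.push_back(TAG_MTIME); for (const auto& f : files) put(out, &f.mtime, sizeof f.mtime);
        out.push_back(TAG_ATTR);  for (const auto& f : files) put(out, &f.attr,  sizeof f.attr);
        out.push_back(TAG_CRC);   for (const auto& f : files) put(out, &f.crc,   sizeof f.crc);
        return out;
    }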

  2. #32
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    @mpais:

    > This is exactly what made me decide to quit posting any stuff here.
    > All I see here is talk. Lots of talk, very little if any action.
    > It's so, so easy to just sit on the sidelines and just criticize others.

    Actually I think there's too little constructive discussion, rather than too much.
    What action do you want?
    Bulat made a few complete archivers, freearc is still relatively popular.
    I also made my share of archiver-like tools and maintain a 7z-based archive format with recompression.

    Do you mean participation specifically in FT project?
    But I don't feel that it's compatible with my ideas, or even with the requirements of my existing codecs
    (mainly multi-stream output).
    I do try to suggest features and open-source preprocessors that might be useful - is there anything else?

    >> Its full of unimportant details like specific data structures,
    >
    > It's.. a specification for a format, i.e., it should describe in detail
    > how to read an archive and interpret its data structures..

    I think it has to start with concepts which the format is supposed to support.
    1) there're common features which would be necessary to compete with popular
    formats, like .rar or .7z: encryption, volumes, recovery, solid modes, API for external modules;
    2) some possible specialized features: compatibility with stream processing (like tar),
    incremental backups (like zpaq), forward compatibility (rarvm,zpaq),
    random access to files/blocks (file systems);
    3) new features that would establish superiority of the new format:
    recompression, multi-level dedup, smart detection, MT compression improvements
    (eg. ability to reference all unlocked regions of data for all threads,
    rather than dumb independent compression of blocks), file/block diffs,
    virtual files, new file/block sorting modes...

    Of course, it's not necessary to implement everything right away.
    But it'd be bad if you don't consider eg. volumes, and end up unable
    to implement it properly, eg. because block descriptors don't support
    splitting of blocks between volumes.

    And this FT format spec looks like a cleaner version of .paq8px or .pcf.
    Sure, it would be an improvement for these, but do you really think
    the proposed format is better than .7z, for example?

    >> Normally there're only two options - either we preprocess a file, or we don't.
    >> But here the optimal solution would be to preprocess most of the data,
    >> except for chunks which are duplicated in original files.

    > It doesn't even mention preprocessors, since those would just be codecs,
    > usually the first ones in a codec sequence for a block type.

    But your format has "chunk" descriptors which specify codecs,
    and "block" descriptors which specify preprocessors?

    > And it was specifically designed to do just what you mentioned.

    Nope. I was talking about multi-level dedup.
    As an example, I made this: https://encode.su/threads/3231-LZ-token-dedup-demo
    Currently in most cases preprocessing is applied either blindly
    (for example, many archivers would apply exe filters to files preprocessed by external tools,
    even though that makes compression worse), or with very simple
    detection algorithm integrated in the preprocessor itself.

    So there's this token dedup example, where dedup ideally
    has to be applied on two levels simultaneously, but a greedy approach doesn't work:
    entropy-level recompression isn't always better, but sometimes it can be.
    But if a duplicate fragment is removed from the lzma stream, it becomes impossible to decode it to tokens;
    and if a dup is simply removed from the token stream, it becomes impossible to reapply the entropy coding.

    I think that this is actually a common case for a new archiver with lots of preprocessors.
    But at the moment, at best, developers make specific approximate detectors for preprocessors,
    rather than add a part of archiver framework which would handle detection and switching
    of codecs and preprocessors, including external plugins.

    > First, deduplication occurs at the file level, then later on it's done at the content level
    > (if a parser splits a block, we perform deduplication on those new sub-blocks).

    That's actually good, atm only .rar has explicit file deduplication,
    and extending it to embedded formats is a good idea.
    But how would you handle a chunked-deflate format (like http traffic or email or png)?
    (there's a single deflate stream with extra data inserted into it).
    Or container formats with filenames, which are best handled by extracting to virtual files,
    then sorting them by filetype along with all other files?

    > After all parsing is done, the plan was to run a rep-like (or such) deduplication
    > stage on any non-specific blocks, so they could still be further divided.

    I'm not too sure about this, since atm in most cases dedup is implemented more like a long-range LZ.
    As an example, we can try to add movie1.avi and movie1.mkv to an archive,
    where the 2nd is converted from the 1st. It turns out there are many frames smaller than 512 bytes,
    so even default srep doesn't properly dedup them.
    But I think that adding every ~100-byte CDC fragment to a block list
    would be quite inefficient.

    > So in your example, if the common code chunks were large enough to merit deduplication,
    > they'd be deduped before any exe-preprocessor would even be called.

    delta(5,6,7,8, 1,2,3,4,5,6,7,8,9) -> (5,1,1,1, 1,1,1,1,1,1,1,1,1).
    - better to ignore dups in original data and just compress delta output;
    delta(5,6,7,8,9, 1,3,7,7, 5,6,7,8,9) -> (5,1,1,1,1, 1,2,4,0, 5,1,1,1,1).
    - better to ignore delta;
    delta(5,6,7,8, 2,3,7,7,7, 3,4,8,8,8,5,6,7,8) -> (5,1,1,1, 2,1,4,0,0, 3,1,4,0,0,-3,1,1,1).
    - "1,4,0,0" dups in delta, but "5,6,7,8" is better in original?

    > Nothing in the format prevents that. Each block uses a codec-sequence.
    > The user could specify that, for instance, on 24bpp images,
    > we should try 2 sequences for every block: one that includes a color-space transform before the actual compression,
    > and one that doesn't; and we just keep the best.

    Yeah, as I said, a strict choice.

    1) this is only perfect with block dedup.
    While with a more generic long-range-LZ-like dedup
    there'd be cases where locally one preprocessor is better,
    but globally another, because part of its output matches something.

    2) I'm mostly concerned about cases where preprocessing stops
    working after dedup - especially recompression;
    For example, suppose we have a deflate stream with 3 blocks: blk1,blk2,blk3.
    blk2 is a duplicate of some previous block.
    but with blk2 removed, blk3 can't be recompressed.
    And if dedup is ignored, recompression works, but an extra parsing diff is generated for blk2.
    One possible solution would be to do both, with decoding kinda like this:
    - restore blk1
    - insert blk2 by reference
    - recompress blk1.blk2
    - patch rec(blk1.blk2) to rec(blk1.blk2.blk3)
    - restore full sequence
    But this certainly would require support in archive index.
    And normal diffs would too, I suppose.

    Of course, there may be an easier solution.
    For some types of preprocessing (eg. exe) it really may be
    enough to give the preprocessor the original position,
    rather than the one in the deduplicated stream.

    In any case, I think that a new format has to be able to beat existing ones,
    otherwise why not just use .7z or .arc which have a plugin interface?
    And thus the most important part is new features that can significantly improve compression,
    but would require support in the format, since otherwise it's better to use .7z.
    Sure, block dedup is one such feature, but in the case of solid compression
    it would always be worse than long-range-LZ dedup like srep.

    As an example, for half of my codecs I can only use .7z, because it provides
    an explicit multi-stream interface. I'd have to add my own stream interleaving
    to make reflate or mp3det/packmp3c compatible with any other format, including FT.

  3. #33
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    577
    Thanks
    220
    Thanked 832 Times in 340 Posts
    @bulat:

    >FTL still saves 8-byte size in the archive header, so you need to go back.
    >7z/arc store metainfo size in the last bytes of archive,
    >so archive should be decoded starting with the last few bytes

    That was discussed quite a bit in the Gitter chat, and something that seemed simple actually required considering a lot of things.
    Most would prefer a streaming-friendly format, which was basically out of the question once other features were considered.
    So with that out of the way, we weighed ease of data handling against robustness to data loss, specifically truncation.
    The order of the metainfo blocks is chosen to maximize the chance of data recovery in that case; the file information structure is the last one in the file, so if the archive got truncated, only some files would be lost. Even this wasn't really much to my liking, since losing info for the directory tree structure would possibly be less catastrophic, but that is why this was a draft.
    Placing the offset last would mean truncation effectively renders all data lost.
    So it was considered, but mostly left in limbo since we never actually had other draft proposals; personally I think the small inconvenience is worth the effort. Sure, if you're considering backups to tape, then completely strict sequential IO would be better. And the described reasoning would only be of concern for archives that didn't use the proposed wrapping recovery format.

    >so, FTL doesn't support compression/encryption of metainfo, making some users unhappy.

    Only compression of the metainfo wasn't described in the draft, and it could be added with no loss of generality; there are flag bytes specifically useful for things like encryption, aside from a version number. There was discussion on the encryption method to use and whether to support several.

    >also, FTL employs fixed order of fixed-structure blocks. This means that if you will need to change some block structure, older programs will be screwed.

    As above, backward compatibility was required, but not upward, obviously.

    >Instead, arc metainfo is a sequence of tagged blocks. When block type 1 will become obsolete,
    >I will just stop including it in newer archives. Some functionality will lose upward compatibility,
    >but the rest will still work. This also means that 3rd-party uitilities can add blocks of their own types

    Only the information required to describe the block structure, the codecs used and their sequences, the directory tree and the file list is required in a specific order and thus doesn't need tags (and inside these, almost all other metadata is based on a variable-length list of tags). If these need to change, a simple version-number change suffices to let newer decoders handle it and older (incompatible) ones skip it. Should a 3rd-party tool wish, it can append its own blocks after these, tagged or not as a way of identification, since any data after the end of the file list structure is ignored (again, because we don't rely on the metainfo offset being in the last bytes of the archive).

    >Then, some items such as codec tag, requires more input. This means that when older version
    >reads an archive with newer codec, it can't continue the decoding. arc encodes all codec params
    >as asciiz string, so older programs can display archive listing that includes newer codecs and
    >even extract files in solid blocks that use only older codecs

    The way parameters for each codec are stored is completely undefined, as it is codec-dependent; it is literally only stated as being X bytes, so you could even include an asciiz/utf8/other string with the name of the codec, so that older versions could at least inform the user which codec they don't recognize and what parameters were used for it. And that doesn't affect their ability to extract any other blocks, solid or not, that use codecs they do recognize.

    >this is the most important point. among usual file directory fields (name,size,date/time,attr,crc)
    >only first two are varsized. and even the size field may be encoded as fixed-sized integer - after lzma,
    >compressed fixed-size data for the filesize field has about the same size as compressed VLI data,
    >according to my tests (I think that saving each byte of the size field separately may further improve
    >lzma-compressed size but don't checked it

    We use the optional metadata tag structure to hold all file/directory fields apart from the name. Even the file size is optional, since it can be deduced from the file's block structure, but Stephan Busch rightly noted that this would make listing file sizes a lot more computationally expensive, so it can also be stored with its own tag.

    >next, with SoA you encode field tag only once per archive rather than once per file.

    Correct, but since the user would have complete control over what fields to include in the archive, if the absolute smallest archive size were required they could skip all fields, in which case a single null byte would be stored per file/directory, which would help reduce the overhead.
    And it allows for flexibility: the user may choose to store complex metadata for just some file types of personal interest. A photographer could even keep a full disk backup including his/her photos and, when updating it with newer photos, include personalized text comments for each one, describing personal thoughts about it.

    >Also, you can store field size after field tag and thus skip unknown fields,
    >again improving forward compatibility and 3rd-party extensibility

    Again, straight from the draft:
    All tags are identified by their id, and those for which decoder support is not obligatory must have at
    least the tag value field, so that the decoder can ignore unknown tags. For these, the tag value field
    represents the size, in bytes, of the tag data only. The decoder should read the tags until it finds a
    termination tag, i.e., with its tag id set to 0 (zero). This reserved tag doesn’t require any further fields.
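
    For illustration, a decoder loop for such a tag list could look like the sketch below (the 1-byte id and 4-byte little-endian value are assumptions made for the example, not the draft's actual integer encoding):

    Code:
    #include <cstdint>
    #include <vector>

    // Hypothetical layout: [tag id: 1 byte][tag value: 4 bytes LE][tag data: "tag value" bytes],
    // repeated, terminated by a tag with id 0, which carries no further fields.
    // A decoder that doesn't recognize an id just skips "tag value" bytes.
    struct Tag { uint8_t id; std::vector<uint8_t> data; };

    std::vector<Tag> read_tags(const uint8_t* p, const uint8_t* end) {
        std::vector<Tag> tags;
        while (p < end) {
            uint8_t id = *p++;
            if (id == 0) break;                   // termination tag
            if (end - p < 4) break;               // truncated input
            uint32_t size = 0;
            for (int i = 0; i < 4; ++i) size |= uint32_t(*p++) << (8 * i);
            if (uint32_t(end - p) < size) break;  // truncated input
            tags.push_back({id, std::vector<uint8_t>(p, p + size)});
            p += size;                            // unknown ids could simply be skipped here
        }
        return tags;
    }
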
    >finally, compression ratios are greatly improved on same-type data,
    >i.e. name1,name2...,attr1,attr2... is compressed better than AoS layout

    Agreed. But again, at least in this initial version, we weren't considering compressing the metainfo, since that would almost surely mean the format would be ignored by everyone else. Do you really think 3rd-party tool creators would want to have to write or use decompression routines just to be able to read the archive structure, let alone do anything with it? I don't see anyone apart from a few enthusiasts in this forum being willing to go through all that hassle.

    @shelwien:

    >Actually I think there's too little of constructive discussion, rather than too much.

    A Gitter chat with several feature-focused rooms was set up for discussing all of this, so as not to pollute the forum.

    >What action do you want?
    >Bulat made a few complete archivers, freearc is still relatively popular.
    >I also made my share of archiver-like tools and maintain a 7z-based archive format with recompression.

    Exactly, there is more than enough know-how in this community to make something great.
    And once in a while, when it's discussed, there are lots of suggestions and discussion.
    But when the time comes to actually help out and do something, no one can be bothered to contribute. And then, as the saying goes, everyone's a critic.

    >Do you mean participation specifically in FT project?

    I'm actually also waiting on your improvements to paq8px. Gotty is going through so much trouble to improve the code (and probably fix all my bugs) and make it more decoupled, so that others, like yourself, can contribute ideas without being hamstrung by all the clunky mess it was.

    >But I don't feel that its compatible with my ideas and even requirements of my existing codecs?
    >(mainly multi-stream output).
    >I do try to suggest features and open-source preprocessors that might be useful, is there anything else?

    Fair enough, everyone has their own ideas, and suggestions are always welcome.
    But why is it that those here trying to cobble something together for free in their spare time are belittled so easily and met with such criticism?
    This dismissive attitude some of you here have towards those trying to make progress in this field, constantly reminding everyone of how you could do a much better job at it, is rightfully going to get you called out. Sorry, but sometimes one either puts up or shuts up.

    >I think it has to start with concepts which the format is supposed to support.
    >1) there're common features which would be necessary to compete with popular
    >formats, like .rar or .7z: encryption, volumes, recovery, solid modes, API for external modules;

    Of those, only volumes weren't considered, since they can be approached the same way we handled recovery: design a wrapper format for it, useful not just for our own archive. Such a spin-off project could allow people with different expertise to contribute and would accelerate development. This was discussed quite a bit in the chat, and the recovery format was still in limbo, with nothing set in stone, before discussion stopped. That is not to say better ways of handling any of those couldn't be explored; it was, after all, just a response to a call for drafts.

    >2) some possible specialized features: compatibility with stream processing (like tar),
    >incremental backups (like zpaq), forward compatibility (rarvm,zpaq),
    >random access to files/blocks (file systems);

    >3) new features that would establish superiority of the new format:
    >recompression, multi-level dedup, smart detection, MT compression improvements
    >(eg. ability to reference all unlocked regions of data for all threads,
    >rather than dumb independent compression of blocks), file/block diffs,
    >virtual files, new file/block sorting modes...

    Sure, lots of cool and innovative features can be explored, but we need to be realistic. Even as-is the project would already be extremely complex, as its stillborn status proves.
    And a few of those you mentioned were proposed.

    >Of course, its not necessary to implement everything right away.
    >But it'd be bad if you don't consider eg. volumes, and end up unable
    >to implement it properly, eg. because block descriptors don't support
    >splitting of blocks between volumes.

    Recovery and volumes could/would be handled as lower-level IO abstraction layers.
    They would be independent formats (like, for example, .PAR), so an FTL archive with recovery info would have the extension ".ftl.rec", for instance.
    When processing, we'd check which IO handler to use, and it would be passed on to the codecs.

    >And this FT format spec looks like a cleaner version of .paq8px or .pcf.
    >Sure, it would be an improvement for these, but do you really think
    >that proposed format is better than .7z, for example?

    If it were that simple to ascertain what is better, we wouldn't have needed any lengthy discussions.
    As long as there are multiple specific needs in play, each possible option with its own set of pros and cons, we should remain open. Hence the call for drafts.

    >But your format has "chunk" descriptors which specify codecs,
    >and "block" descriptors which specify preprocessors?

    Preprocessors are just another codec, that's all. Only after all the parsers, transforms and dedupers have settled on a block segmentation do we proceed to apply codec sequences.
    The format just describes the segmentation structure and what codecs were used, nothing more. Each individual block can have its own side-data (like reconstruction info for deflate streams, GIF images, etc.).
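
    As a rough sketch of what that amounts to (hypothetical names and fields, not the draft's exact structures):

    Code:
    #include <cstdint>
    #include <string>
    #include <vector>

    struct CodecStep {
        std::string codec;              // e.g. "deflate-transform", "delta", "lzma"
        std::vector<uint8_t> params;    // codec-dependent, opaque to the container
    };

    struct Block {
        uint64_t offset, length;        // where the block sits in the segmented input
        uint32_t type;                  // e.g. DEFAULT, JPEG, DEFLATE, LZMA-TOKENS, ...
        uint32_t codec_sequence;        // index into the table of codec sequences below
        std::vector<uint8_t> side_data; // e.g. deflate reconstruction info, GIF LZW params
    };

    // Archive-level view: just the segmentation plus the codec sequences used.
    struct ArchiveIndex {
        std::vector<std::vector<CodecStep>> codec_sequences;
        std::vector<Block> blocks;
    };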

    >Nope. I was talking about multi-level dedup.
    >As example, I made this: https://encode.su/threads/3231-LZ-token-dedup-demo

    >So there's this token dedup example, where dedup ideally
    >has to be applied on two levels simultaneously, but a greedy approach doesn't work:
    >entropy-level recompression isn't always better, but sometimes it can be.
    >But if a duplicate fragment is removed from lzma stream, it would be impossible to decode to tokens;
    >and if a dup simply removed from token stream, it would become impossible to reapply entropy coding.

    On your posted example, if I understood it correctly:

    Both files passed file-level dedup, so we parse them. The LZMA parser detects the stream. Maybe we try a few common options; none decodes it, so we can't mark it as an LZMA block. We try decoding it to tokens, succeed, and mark the block as an LZMA-TOKENS block. The two don't match, so content-level dedup doesn't affect them.
    After all parsing is done, we're still free to do any reversible operations on the blocks before calling the preprocessors and codecs.
    So since we have a few LZMA-TOKENS blocks, we run them through their specific deduper, which finds their similarity. Block 1 is left unchanged, block 2 is now set as LZMA-TOKENS-DEDUPED, its private info points to block 1, and its related stream is now your ".dif". Both get compressed later on by the codecs.
    Now when decompressing, we get to block 2. We see it was lz-deduped from block 1, so we decompress blocks 1 and 2, apply the diff to get the tokens for block 2, and apply the inverse transform to get the original lzma stream.

    >That's actually good, atm only .rar has explicit file deduplication,
    >and extending it to embedded formats is a good idea.
    >But how would you handle a chunked-deflate format (like http traffic or email or png)?
    >(there's a single deflate stream with extra data inserted into it).

    Specific parsers and transforms; it was clear from the start that something like PNG would require them. Even decoding the LZW-encoded data from GIF images requires a specific transform.

    >Or container formats with filenames, which are best handled by extracting to virtual files,
    >then sorting them by filetype along with all other files?

    Again, content-aware segmentation would then allow (if using solid compression) for an (optional) content-aware clustering stage, designed to sort blocks of the same type by ranking their similarity. So if, say, a TAR parser split all files in the stream into single blocks, of which some were then already detected as a specific type, all the files could be sorted. Images could be sorted by resolution, or date taken, or camera used, or even visual likeness. Texts could be sorted by language. For this stage, that TAR parser might even have included the filename as optional info in each block it detected, to better help us sort them, since for some types we might not have any specific similarity-comparer.
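
    A minimal sketch of such a clustering pass (the fields and sort keys are illustrative assumptions):

    Code:
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Stable-sort blocks so same-type data lands together, then by a per-type similarity
    // key (resolution or camera for images, language for text, or a filename hint passed
    // down by the container parser when no specific similarity-comparer exists).
    struct BlockRef {
        uint32_t type;
        std::string sort_key;   // filled by the type-specific comparer, may be empty
        uint64_t block_id;
    };

    void cluster(std::vector<BlockRef>& blocks) {
        std::stable_sort(blocks.begin(), blocks.end(),
            [](const BlockRef& a, const BlockRef& b) {
                if (a.type != b.type) return a.type < b.type;
                return a.sort_key < b.sort_key;   // within a type, group similar blocks
            });
    }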

    >But I think that adding every ~100-byte CDC fragment to a block list
    >would be quite inefficient.

    Which is why I said they would need to be dedup-worthy; we need to take the block segmentation overhead into account. There's no point in deduping just a few bytes if the segmentation alone is bigger.

    >delta(5,6,7,8, 1,2,3,4,5,6,7,8,9) -> (5,1,1,1, 1,1,1,1,1,1,1,1,1).
    >- better to ignore dups in original data and just compress delta output;
    >delta(5,6,7,8,9, 1,3,7,7, 5,6,7,8,9) -> (5,1,1,1,1, 1,2,4,0, 5,1,1,1,1).
    >- better to ignore delta;
    >delta(5,6,7,8, 2,3,7,7,7, 3,4,8,8,8,5,6,7,8) -> (5,1,1,1, 2,1,4,0,0, 3,1,4,0,0,-3,1,1,1).
    >- "1,4,0,0" dups in delta, but "5,6,7,8" is better in original?

    The last-level dedup would handle most big duplications, where deduping would most likely provide better compression anyway.
    The codec sequences for these default blocks (assuming fast, lz-based codecs, since most CM codecs probably wouldn't need it) could be set up either with a delta preprocessor before the lz codec or without one, and the best option would be used. Not the same, sure, but if delta helped we'd still get the benefit, and otherwise an lz codec should reasonably handle small duplications, especially if we cluster similar blocks together.
    For specific-type blocks, we can make use of the information in them to sort them and to create transforms and codecs.
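
    A minimal sketch of that strict choice, using zlib as a stand-in lz codec and a plain byte delta (illustrative only - the real codec sequences and selection logic would be richer, and error handling is omitted):

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    // Byte-wise delta preprocessor (lossless: keep the first byte, then differences).
    std::vector<uint8_t> delta_filter(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> out(in.size());
        for (size_t i = 0; i < in.size(); ++i)
            out[i] = (i == 0) ? in[i] : uint8_t(in[i] - in[i - 1]);
        return out;
    }

    // Compressed size of one candidate codec sequence (here just zlib level 9).
    size_t lz_size(const std::vector<uint8_t>& in) {
        uLongf len = compressBound(in.size());
        std::vector<uint8_t> tmp(len);
        compress2(tmp.data(), &len, in.data(), in.size(), 9);
        return len;
    }

    // Strict choice: try the block with and without the delta step, keep the smaller
    // result, and record the winner so the decoder knows whether to undo the delta.
    bool delta_helps(const std::vector<uint8_t>& block) {
        return lz_size(delta_filter(block)) < lz_size(block);
    }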

    >2) I'm mostly concerned about cases where preprocessing stops
    >working after dedup - especially recompression;
    >For example, suppose we have a deflate stream with 3 blocks: blk1,blk2,blk3.
    >blk2 is a duplicate of some previous block.
    >but with blk2 removed, blk3 can't be recompressed.
    >And if dedup is ignored, recompression works, but an extra parsing diff is generated for blk2.
    >One possible solution would be to do both, with decoding kinda like this:
    >- restore blk1
    >- insert blk2 by reference
    >- recompress blk1.blk2
    >- patch rec(blk1.blk2) to rec(blk1.blk2.blk3)
    >- restore full sequence
    >But this certainly would require support in archive index.
    >And normal diffs would too, I suppose.

    If blk2, as an intermediate deflate block, is a direct match for some other deflate block in another file, it's most likely a stored/raw block, so after we decompress the whole stream, the last-level dedup could likely handle it (for default FTL blocks) and not much would be lost anyway. But let's assume that is not the case: it's a huffman-compressed block, static or not, and as luck would have it, it doesn't decompress the same. Your problem is that we'd then possibly be storing duplicated deflate diff information.
    Nothing in the proposed draft keeps you from handling that. Again, it simply describes a block structure and parameters on how to recreate the blocks. You can have a deflate-diffed block type whose individual reconstruction parameters do just what you wanted.

    But if we're considering contrived examples, think of Darek's testset, with the same image in 2 different compressed formats. We could decompress both, but one is stored top-to-bottom and the other bottom-to-top. Ideally, one of the transforms would account for this, to allow better chances of deduping, but let's say that is not the case.

    If we use the last-level dedup on non-default blocks, we will match every line of one image to the same line in the vertically-mirrored image, and we'll have to split both into single-line blocks: one image will be just deduped blocks, and the other will be split into as many 1-line "image" blocks as the height. So for one image we'll just have to store dedup info and get great compression. But the other image now has a lot of extra segmentation overhead, and if we're not using solid mode (and even there, we'd need stable sorting in the clustering stage to ensure the lines would be stored in order), then we're compressing single lines without any added 2d context, and may get sub-par compression.

    Would it still be better in solid mode than relying on the clustering stage to put both images consecutively, so that a solid image codec could get the 2nd one almost for free anyway? Would it then make sense to have an image-specific deduper that would realize this, and just store as the "diff" for the 2nd image that it should be mirrored vertically from the 1st image?

    >In any case, I think that a new format has to be able to beat existing ones,
    >otherwise why not just use .7z or .arc which have a plugin interface?

    Sure, but why not discuss other approaches? Isn't that the idea of a call for drafts, that people can provide alternative solutions and a consensus be reached on what to keep from each?
    I tried to detail how to handle every feature we set out to achieve with this draft, and we discussed its pros and cons.
    And since no one cared, I called it a day and stopped bothering with the project.

    @schnaader:

    I'm sorry for the off-topic, it's nice to see precomp is still getting developed.

  4. Thanks (3):

    hexagone (17th November 2019),Shelwien (16th November 2019),Stephan Busch (16th November 2019)

  5. #34
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by mpais View Post
    I'm sorry for the off-topic, it's nice to see precomp is still getting developed.
    No problem, nice to read something from you. Also, that discussion is very interesting, and precomp is in a similar phase now: I have to decide which features to implement and where its journey should go. I think it will get kind of schizophrenic in the next versions, offering both faster modes (e.g. making zstd the default instead of lzma) and modes with better compression (e.g. brute-force modes for image data trying FLIF, webp, pik, PNG filters and pure lzma). The multithreading changes also make me doubt streaming support, since those two are often incompatible.
    http://schnaader.info
    Damn kids. They're all alike.

  6. #35
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Quote Originally Posted by mpais View Post
    I have a (non-public) completely rewritten version that solves the problem of having lots of memory allocations and temporary files, by using a pre-allocated hybrid pool of a single memory block and a single temporary file, so that much is done.
    Need any assistance with that? PM if you could use some testing or whatever

  7. #36
    Member
    Join Date
    Apr 2017
    Location
    Bangladesh
    Posts
    13
    Thanks
    57
    Thanked 2 Times in 2 Posts
    Any chance of including stdio support in precomp?

  8. #37
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    See rzwrap above.

  9. #38
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    @Christian:

    so the preferred test case would be to throw all kind of stuff together and try this version on it.
    Last night I processed and restored all my PDFs, GZs and ZIPs with mtprecomp. Some 630+ files, several GBs.

    [Attached image: image.png]

    Totally forgot to bit compare outputs OTOH. At least we know there are no crashes


    -----------------------

    On a different note:

    e.g. making zstd default instead of lzma
    Are you sure?? I mean, what's the point? ZSTD is barely better than zlib. Why go through all the trouble just to end up using a weak algorithm?

    If you want a faster codec, I strongly suggest you look at flzma2. In my tests it is always way faster than lzma (1.5x at the slowest, up to 5x and more when multithreading) and the ratios are barely 1% worse (sometimes even better than lzma - see the pdf).

    Here are the results of a recent test. The testset is a folder with Windows .ico icons (with PNGs inside). The first two pages show the methods that make up the Pareto frontier for compression speed vs ratio (with Razor manually added, as it is probably the strongest LZ program I know of); after that comes everything else for comparison. I did several of these tests with other kinds of data too, and the results are pretty much the same; this is just the best-formatted one.
    Attached Files
    Last edited by Gonzalo; 18th November 2019 at 22:27. Reason: Added test result

  10. Thanks:

    schnaader (19th November 2019)

  11. #39
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Just checked the newest zstd (1.4.4) with silesia.tar. As the compression ratio drops compared to "Precomp -t+" (with similar compression speed), I will most likely not make zstd the new default, but it is great when looking at decompression times - 5 times faster! So for many use cases it is much better, even though it might not compress as well. It also offers long-distance matching, which might be useful until Precomp does its own deduplication (current precomp offers up to 192 MiB block size, zstd can have a window size of 2 GB).
    http://schnaader.info
    Damn kids. They're all alike.

  12. #40
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    976
    Thanks
    266
    Thanked 350 Times in 221 Posts
    Quote Originally Posted by schnaader View Post
    I will most likely not make zstd the new default, but it is great when looking at decompression times - 5 times faster!
    How did it fare against large window brotli?

  13. #41
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    I'm running a roundtrip of mtprecomp against literally every single file on a personal computer (two OSes and a data partition). Since the script is now stuck on a folder with a bunch of movies and series and it's gonna be a few days until I have more results, I decided to report some stats now:

    [Attached screenshot: DeepinScreenshot_select-area_20191120110916.png]

    By the way: Kudos on a rock-solid program, Christian! I believe it's pretty safe to say precomp is stable software by now...

  14. Thanks:

    schnaader (20th November 2019)

  15. #42
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    How did it fare against large window brotli?
    Code:
    silesia.tar, 211.948.032 Bytes - CPU: Intel i7-4800MQ, 4 cores @ 2.7 GHz
    
    Program                 compressed size     compression time (s)    decompression time (ms)
    Precomp 0.4.7 -t+       49.498.111          43                      7016
    brotli 1.0.4 -q 11      49.774.385          945                     1375
    zstd 1.4.4 -T0 -19      53.264.556          68                      1125
    Brotli has a nice ratio and a decompression time similar to zstd, but no multi-threaded compression (at least in the command-line version), so its compression time is horrendous.
    Last edited by schnaader; 20th November 2019 at 18:35.
    http://schnaader.info
    Damn kids. They're all alike.

  16. #43
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by Gonzalo View Post
    MP3: There are two files reporting a high rate of packMP3 fails. They are in a browser's cache so my guess is they're some video with mp3 audio interleaved. I made a copy just in case you need them.
    These would be some interesting test files indeed, especially since they might be related to Issue #20 (Support MP3 format in containers), so please upload them somewhere or send them by e-mail.
    http://schnaader.info
    Damn kids. They're all alike.

  17. #44
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    @Gonzalo: Can you test packmp3 compression vs 7zdll?
    example: "7z.exe a -m0=mp3det -m1=packmp3c -m2=plzma -mb0s0:1 -mb0s1:2 1.pa cat.mp3"

  18. #45
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Done, and done.

    @schnaader: check your e-mail
    @Shelwien:

    Code:
    1425858    90.63%    2ca47597ec5fecbe_s.pcf    
    1427458    90.73%    2ca47597ec5fecbe_s.pcf_cn    
    1520647    96.66%    2ca47597ec5fecbe_s.rz    // Razor
    1528742    97.17%    2ca47597ec5fecbe_s.pcf    // -t+
    1573219    100.00%    2ca47597ec5fecbe_s    
    1647168    104.70%    2ca47597ec5fecbe_s.pa    
                
    3014158    90.18%    658939e740457b10_s.pcf    
    3016315    90.24%    658939e740457b10_s.pcf_cn    
    3214080    96.16%    658939e740457b10_s.rz    // Razor
    3234478    96.77%    658939e740457b10_s.pcf    // -t+
    3342489    100.00%    658939e740457b10_s    
    3458058    103.46%    658939e740457b10_s.pa
    Command line is exactly as suggested. Didn't change anything.

  19. Thanks:

    Shelwien (20th November 2019)

  20. #46
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Quote Originally Posted by schnaader View Post
    Code:
    silesia.tar, 211.948.032 Bytes - CPU: Intel i7-4800MQ, 4 cores @ 2.7 GHz
    
    Program            compressed size        compression time (s)    decompression time (ms)
    Precomp 0.4.7 -t+    49.498.111        43            7016
    brotli 1.0.4 -q 11      49.774.385              945                     1375
    zstd 1.4.4 -T0 -19      53.264.556              68                      1125
    Brotli has a nice ratio and similar decompression time to zstd, but no multi-threaded compression (at least the command line version), so compression time is horrendous.

    Have you tried fastlzma2? This is the size I get:

    Code:
    7-Zip-Zstandard/7z.exe a Silesia.7z -mfb=273 -myx=9 -m0=flzma2 -mf=on -mqs=on -md=128m -mx9 -ms=on silesia
    
    Program                 compressed size     compression time (s)    decompression time (ms)
    
    fastlzma2               48.669.533          ??                      ??
    Precomp 0.4.7 -t+       49.498.111          43                      7016
    brotli 1.0.4 -q 11      49.774.385          945                     1375
    zstd 1.4.4 -T0 -19      53.264.556          68                      1125
    Speed on the two systems is not directly comparable, but I compared flzma2 with my copy of precomp:
    Compression speed is 0.55x faster for fastlzma2 (about 2/3 of the time spent)
    Decompression speed is 0.11x faster for fastlzma2 - not really a difference
    Last edited by Gonzalo; 21st November 2019 at 00:39. Reason: Test correction

  21. Thanks:

    schnaader (21st November 2019)

  22. #47
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by Gonzalo View Post
    @schnaader: check your e-mail
    Interesting test files. For each file, there's corrupted JPG data at the beginning which can be extracted, but looks damaged. After that, there is MP3 data which can be extracted and played seemingly flawlessly, but I guess that's just because of the high error tolerance of the MP3 format. There are some strings in the files indicating some iTunes format, so it looks like this could be some container format and/or copyright protection.

    Even more interesting, there are URLs at the beginning of the files:
    https://download-a.akamaihd.net/file...45_Ro_S_14.mp3
    https://download-a.akamaihd.net/file...6_1Co_S_15.mp3

    These two files can be downloaded and are processing just fine (1 JPG, 1 MP3 each):

    Code:
    1.570.365 bi12_45_Ro_S_14.mp3
    1.328.179 bi12_45_Ro_S_14.pcf
    3.335.986 bi12_46_1Co_S_15.mp3
    2.812.212 bi12_46_1Co_S_15.pcf
    Comparing the files to the originals shows there are insertions (32 bytes in size) into the original data that interrupt the JPG/MP3 streams, so this is indeed some iTunes container format. I will research a bit more, perhaps there's a specification of it somewhere.
    http://schnaader.info
    Damn kids. They're all alike.

  23. Thanks:

    Shelwien (22nd November 2019)

  24. #48
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    There's this utility that can detect mp3 frames and separate them from non-mp3 data: https://nishi.dreamhosters.com/u/mp3rw_v1.rar
    Does it improve recompression in this case?

  25. Thanks:

    schnaader (22nd November 2019)

  26. #49
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Yes, this helps indeed:

    Code:
     1.573.219 2ca47597ec5fecbe_s
     1.425.858 2ca47597ec5fecbe_s.pcf // recompresses 69/757 MP3 streams
    
       100.074 2c_bin
     1.474.409 2c_mp3                 // 9403 frames found
     1.310.286 2c_mp3.pcf             // recompresses 78/78 MP3 streams
        98.354 2c_bin.pcf
    
     1.408.640                        // 2c_mp3.pcf + 2c_bin.pcf
    http://schnaader.info
    Damn kids. They're all alike.

  27. #50
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Quote Originally Posted by schnaader View Post
    perhaps there's a specification of it somewhere.
    I highly doubt it. I believe those files are from here and here. In some other articles on that site there is an option to sync a given paragraph to its audio reading (clicking on the paragraph shows a little 'play' button, which then jumps to the corresponding part). I think those insertions could be like mkv 'chapters', especially if they are not at a fixed distance from each other.


    ---------------

    On another note entirely: using Google's Colaboratory, it's extremely easy to test precomp against your whole Google Photos backup, for example. That's what I did (with my whole Google Drive, actually). I didn't find any bugs, so here I rest my case.

  28. Thanks:

    schnaader (23rd November 2019)

  29. #51
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    65
    Thanks
    31
    Thanked 14 Times in 11 Posts
    Quote Originally Posted by Gonzalo View Post
    Have you tried fastlzma2?
    What about replacing the range coder in LZMA2 with FSE (from ZSTD) or rANS for use in Precomp?
    OK, this would be a new format, but it could improve the compression/speed ratio. Is this a stupid idea?

  30. #52
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    tANS/FSE would also require changing the entropy model (and the parsing optimizer).
    rANS is easier, but that's basically what oodle LZNA is.

    Also, a switch to ANS coding does not mean an automatic speed improvement (or especially compression).
    Speed-wise, rANS is the same as a good range coder... RC and rANS actually do exactly the same arithmetic operations, just in a different order.
    The actual cause of the speed improvements in new-gen LZ codecs is format optimization for out-of-order vector CPUs.
    In lzma's case, that means we'd have to somehow implement multiple independently decodable streams of compressed data per block.
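
    To illustrate the "same arithmetic, different order" point, here is a toy byte-wise rANS with a fixed 4-symbol model, in the spirit of ryg_rans (a sketch only, not lzma's coder): the encoder performs the same divide/multiply by the symbol frequency that a range coder does, but walks the symbols in reverse and renormalizes by shifting whole bytes out of the state.

    Code:
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static const int PROB_BITS = 12;
    static const uint32_t RANS_L = 1u << 23;                  // lower bound of the state
    static const uint32_t freq[4]  = {2048, 1024, 512, 512};  // sums to 1<<PROB_BITS
    static const uint32_t start[4] = {0, 2048, 3072, 3584};   // cumulative frequencies

    std::vector<uint8_t> encode(const std::vector<int>& syms) {
        std::vector<uint8_t> buf;
        uint32_t x = RANS_L;
        for (size_t i = syms.size(); i-- > 0;) {               // rANS encodes in reverse
            uint32_t f = freq[syms[i]], s = start[syms[i]];
            while (x >= ((RANS_L >> PROB_BITS) << 8) * f) {    // renormalize: emit a byte
                buf.push_back(uint8_t(x));
                x >>= 8;
            }
            x = ((x / f) << PROB_BITS) + (x % f) + s;          // same div/mod as a range coder
        }
        for (int i = 0; i < 4; ++i) { buf.push_back(uint8_t(x)); x >>= 8; }  // flush state
        return std::vector<uint8_t>(buf.rbegin(), buf.rend()); // so the decoder reads forward
    }

    std::vector<int> decode(const std::vector<uint8_t>& buf, size_t count) {
        size_t pos = 0;
        uint32_t x = 0;
        for (int i = 0; i < 4; ++i) x = (x << 8) | buf[pos++]; // read back the flushed state
        std::vector<int> out;
        for (size_t n = 0; n < count; ++n) {
            uint32_t slot = x & ((1u << PROB_BITS) - 1);
            int s = 0;
            while (s < 3 && slot >= start[s] + freq[s]) ++s;   // cumulative-frequency lookup
            out.push_back(s);
            x = freq[s] * (x >> PROB_BITS) + slot - start[s];
            while (x < RANS_L && pos < buf.size()) x = (x << 8) | buf[pos++];
        }
        return out;
    }

    int main() {
        std::vector<int> in = {0, 1, 2, 0, 0, 3, 1, 0, 2, 0};
        std::vector<uint8_t> c = encode(in);
        std::vector<int> out = decode(c, in.size());
        std::printf("round-trip %s, %zu symbols -> %zu bytes\n",
                    in == out ? "ok" : "FAILED", in.size(), c.size());
        return 0;
    }

    The per-symbol work is essentially what a carryless range coder does; what changes is the order of operations and where the renormalization bytes go, which is consistent with the point that the real speed wins come from format-level changes rather than from the coder itself.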

  31. Thanks (2):

    Bulat Ziganshin (25th November 2019),WinnieW (24th November 2019)

  32. #53
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Code:
    Prepack
    
    Generic data compression filters
    
    Prepack uses multiple resampling techniques to optimize files prior to compression.
     It's well suited for raw image file-types and raw audio such as WAV.
    If the file's structure cannot be detected or optimized it is stored as-is.
    Code:
    /*
        Fast general purpose data preprocessor
        
        The MIT License (MIT)
        Copyright (c) 2016 Lucas Marsh
        How it works:
            encoding:
                pass 1: skims over input file to get a rough idea of how many channels there are
                pass 2: encodes a 1 byte header with the amount of channels then encodes the entire file
            decoding is a single pass that un-interleaves the data with n channels.
        
            encode method is either delta (best for image) or adaptive LPC with a single weight (best for audio)
    */
    https://github.com/loxxous/prepack
    1 c file, 312 LOC

    Didn't run any tests, though. Posting just in case.
    Last edited by Gonzalo; 25th November 2019 at 17:44. Reason: Bad formatting

  33. #54
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Here's the latest development version - I fixed an error with a file write that had no mutex, which led to incorrect reconstruction on files with many small interleaved JPG and preflate (PDF/ZIP/...) streams.
    Attached Files
    http://schnaader.info
    Damn kids. They're all alike.

  34. Thanks (2):

    Mike (5th December 2019),Stephan Busch (5th December 2019)

  35. #55
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Precomp hangs when restoring an .iso image file. I attached a 10 MB chunk around the area where it happens. On this particular file, precomp hangs at 39.27%.
    Attached Files

  36. #56
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by Gonzalo View Post
    Precomp hangs restoring an .iso image file.
    Thanks! That was a mutex error in PNG restoration, fixed.
    Attached Files
    http://schnaader.info
    Damn kids. They're all alike.

  37. Thanks:

    Stephan Busch (11th December 2019)

  38. #57
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    I was thinking about a rather naive way to improve precomp's effectiveness... I'm sure somebody has thought about it before; I'm just sharing it to find out whether it could be done or whether it's a bad idea.

    The possibility of rearranging data inside the .PCFs to group similar streams, and in doing so improve compression, has been mentioned before. Couldn't it be simpler to output every stream as a separate file with a guessed extension, like '.ari' for incompressible streams, '.txt' for text, '.bmp' for bitmaps and '.bin' for everything else? Then any modern archiver would take care of the grouping and maybe the codec selection.

    An alternative (so as not to write a million little files to disk) would be to output a few big TXT, BIN, and so on files with all the respective streams concatenated, plus an index.pcf containing the metadata needed for reconstruction.
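
    A rough sketch of that second variant (hypothetical layout: one container per guessed type plus an index with enough metadata to restore the original order):

    Code:
    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    // Each stream is appended to a per-type container ("streams.txt", "streams.bmp",
    // "streams.bin", ...) and the index records where it went, so reconstruction can
    // put everything back at its original position.
    struct IndexEntry {
        uint64_t original_offset;   // position of the stream in the input/pcf
        std::string container;      // which per-type file it was routed to
        uint64_t container_offset;  // where it was appended inside that container
        uint64_t length;
    };

    void route_stream(const std::vector<uint8_t>& stream, uint64_t original_offset,
                      const std::string& guessed_ext,
                      std::map<std::string, std::ofstream>& containers,
                      std::vector<IndexEntry>& index) {
        std::string name = "streams." + guessed_ext;           // e.g. txt, bmp, ari, bin
        std::ofstream& out = containers[name];
        if (!out.is_open()) out.open(name, std::ios::binary);
        std::streamoff pos = out.tellp();
        out.write(reinterpret_cast<const char*>(stream.data()), stream.size());
        index.push_back({original_offset, name, uint64_t(pos), stream.size()});
    }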

    What do you think about it?

  39. #58
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    In theory you can run "precomp -v" and parse its output:
    Code:
    (67.02%) Possible bZip2-Stream found at position 154448374, compression level = 9
    Compressed size: 3051
    Can be decompressed to 8996 bytes
    Identical recompressed bytes: 3051 of 3051
    Identical decompressed bytes: 8996 of 8996
    Best match: 3051 bytes, decompressed to 8996 bytes
    Recursion start - new recursion depth 1
    No recursion streams found
    Recursion end - back to recursion depth 0
    (72.75%) Possible GIF found at position 167662070
    Can be decompressed to 5211 bytes
    Recompression successful
    (72.75%) Possible GIF found at position 167663606
    Can be decompressed to 5193 bytes
    Recompression successful
    (72.75%) Possible GIF found at position 167665142
    Can be decompressed to 5211 bytes
    Recompression successful
    (72.75%) Possible GIF found at position 167666678
    Can be decompressed to 5988 bytes
    Recompression successful
    It prints positions, so it shouldn't be that hard.
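
    A rough sketch of such a parser (it assumes the line format shown above; real -v output may differ between stream types and versions):

    Code:
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Stream { std::string type; uint64_t position; };

    // Reads "precomp -v" output from stdin and extracts the
    // "Possible <type> ... at position <N>" lines; everything else is ignored.
    int main() {
        std::vector<Stream> streams;
        std::string line;
        while (std::getline(std::cin, line)) {
            size_t p = line.find("Possible ");
            size_t q = line.find(" at position ");
            if (p == std::string::npos || q == std::string::npos) continue;
            std::string type = line.substr(p + 9, q - (p + 9));
            size_t f = type.rfind(" found");                  // "GIF found" -> "GIF"
            if (f != std::string::npos && f + 6 == type.size()) type.resize(f);
            uint64_t pos = std::stoull(line.substr(q + 13));  // digits follow immediately
            streams.push_back({type, pos});
        }
        for (const Stream& s : streams)
            std::cout << s.position << "\t" << s.type << "\n";
        return 0;
    }

    Feeding the captured log through something like this gives a (position, type) list that could then drive the splitting into per-type files.
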
    Attached Files

  40. #59
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    572
    Thanks
    245
    Thanked 98 Times in 77 Posts
    Quote Originally Posted by Shelwien View Post
    In theory you can run "precomp -v" and parse its output:
    It prints positions so shouldn't be that hard.
    Yes, something like that is possible in theory - the 'hack' way. Although I don't see why the redundancy: it would imply running precomp -v (which is proven to slow it down considerably), then parsing the output, then butchering the pcf, then deleting the pcf - and the inverse for reconstruction. And of course writing a program to do it, which I think would be the same or even more work than actually modifying precomp to just write the streams to different files. I mean, it's the same work without having to write the parser. I'll see if I can do it, but I don't really trust my programming skills; I'm afraid I'll end up breaking something else.

    What do you think about this, Christian?

    BTW: Is that attachment any different than the original precomp, or is it just the last commit compiled for Windows?

  41. #60
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    As an example, I already did this for lzmarec: http://nishi.dreamhosters.com/u/lzmadump_v1.rar
    Same idea here - just extract the positions from the precomp -v output.

    > same or even more work than actually modifying precomp to just write down the streams to different files

    Yes, but precomp modifications have to be done by schnaader (precomp is not very modular, plus there's encode+decode), while log parsing is independent.

    > Is that attachment any different than the original precomp, or is it just the last commit compiled for Windows?

    It's precomp 4.8 that I built from the github files to look at the -v output... schnaader's binary didn't run because of VS/MD dependencies.
    Also, based on my previous tests, my version is probably faster (built with clang).


