
Thread: Backup compression algorithm recommendations

  1. #31
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Use whatever you want that can handle at least 1TB in half a night (4 hours).
    Wrong. If you back up your disk every day, typically only a few percent of it changes. So you may need a very fast dedup algorithm (or not, if you track disk changes using the OS API), but compression speed is of less importance. And for decompression, you may just employ multiple servers.

  2. #32
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    174
    Thanks
    20
    Thanked 62 Times in 31 Posts
    You can try some other ROLZ-based compressor like balz (https://sourceforge.net/projects/balz). ROLZ is more symmetric than LZ77 and still very fast.

  3. #33
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    174
    Thanks
    20
    Thanked 62 Times in 31 Posts
    I copied that code from libsnappy, if I remember correctly.

  4. #34
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Wrong. If you back up your disk every day, typically only a few percent of it changes. So you may need a very fast dedup algorithm (or not, if you track disk changes using the OS API), but compression speed is of less importance. And for decompression, you may just employ multiple servers.
    You're confirming my thesis

    In my previous post

    Quote Originally Posted by fcorbelli View Post
    For me the answer is easy.
    The one that scales best on multicore.
    Just like pigz.
    Single-thread performance is useless.
    On the implementation side: the one which can extensively use HW SSE instructions.
    Compression ratio is irrelevant.
    Only speed (and limited RAM usage).
    In two words: a deduplicated pigz (aka deflate).
    Or lz4 for decompression speed (not so relevant).
    In fact this is what I use (storing the deduplicated archive on zfs):
    0) versioning "a-la-time-machine"
    1) deduplication.
    2) highly parallelizable compression.
    3) low RAM consumption
    4) works with really large files
    5) decompression which does NOT seek (if possible)
    6) an advanced and fast copy-verification mechanism WITHOUT decompressing, if possible
    7) easy portability between Windows, Linux and *nix systems.
    8) append-only format
    9) Reliability, reliability, reliability. No software "chains", where bugs and limitations can add up.
    A real ZPAQ-based example of a virtual Windows 2008 server with SQL Server for an ERP software:

    https://encode.su/threads/3559-Backu...ll=1#post68383

    Quote Originally Posted by Shelwien View Post
    So you're just saying that it's not acceptable for you, that's ok.
    On the filesystem side:
    Quote Originally Posted by fcorbelli View Post
    For a quick backup the answer is a differential zfs send (not incremental), piped through pigz.
    Requires zfs, lots of RAM and fast disks.
    It is doable: I have done it every day for years.
    But restore is painful, and extensive zfs expertise is needed.
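
    For reference, a minimal sketch of that kind of pipeline (the pool, dataset and snapshot names are placeholders):

    Code:
    # take today's snapshot and send only the delta against yesterday's, compressed with pigz
    zfs snapshot tank/vm@today
    zfs send -i tank/vm@yesterday tank/vm@today | pigz -1 -p 8 > /backup/vm_today.zfs.gz
    # restoring needs the base snapshot already present on the receiving pool:
    # pigz -dc /backup/vm_today.zfs.gz | zfs receive tank/vm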

  5. #35
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    @fcorbelli:
    > Use whatever you want that can handle at least 1TB in half a night (4 hours).

    2**40/(4*60*60) = 76,354,974 bytes/s
    It's not actually that fast, especially taking MT into account.
    I posted the requirements for the single thread of the algorithm, but of course the complete tool would be MT.
    You are thinking at the gigabyte scale.
    Look at the watch for 4 seconds.
    That's a GB at 250MB/s (very fast indeed).

    Now look at the watch for an hour (~3,600 s), or 1,000x longer.
    That's the TB scale at 250MB/s.

    So every time you run a test, or think about a method or an algorithm, imagine it running for a time 1,000 times longer than what you are used to, unless your job is to make backups of hundreds of servers every day.

    I think it is unlikely that you test on 1TB, or maybe 50TB, of data when developing such a program.
    I need to, because it's my job.

    Then you will begin to understand that nothing else matters IF you don't have something really fast.
    I mean FAST.

    And how can it be that fast?
    Only if heavily multithreaded, of course.
    Single-core performance is completely irrelevant IF it does not scale ~linearly with size.

    And which compression algorithms (deduplication taken for granted) are so fast on multi-core systems,
    not because they are magical, but because they are highly parallel?

    Someone was offended, but it is simply factual.

    When in doubt, think about the terabyte scale with my Gedankenexperiment and everything will be clearer.
    That's 1TB, the size for the hobbyist.

    Then multiply by 10 or even 100, so by 10,000 or 100,000 vs the 4-seconds-per-GB time scale,
    and THEN choose the algorithm.

    The same goes for RAM consumption and all the points I have already written several times before.
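
    As a quick sanity check on the scale argument, plain shell arithmetic (nothing more than the numbers above):

    Code:
    # 1 TiB in 4 hours -> required sustained throughput in bytes/s
    echo $(( (1 << 40) / (4 * 3600) ))     # 76354974  (~76 MB/s)
    # 1 TiB at a sustained 250 MB/s -> wall-clock seconds
    echo $(( (1 << 40) / 250000000 ))      # 4398 s    (~1.2 hours)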

  6. #36
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    1) Any compression algorithm would benefit from MT, when you can feed it different blocks.
    Linear scaling is not realistic because of shared L3 and other resources.
    But something like 0.5*N_cores*ST_speed would still turn 50MB/s into 400MB/s at 16 cores,
    and that's not even the maximum on modern servers.
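
    A minimal shell sketch of that "independent blocks" idea (purely illustrative, not how any particular archiver does it internally; file names and sizes are placeholders):

    Code:
    # split the input into independent 1 GiB blocks and compress them with 16 parallel jobs
    split -b 1G big.img block.
    ls block.* | xargs -P 16 -I{} zstd -q -3 --rm {}
    # concatenated zstd frames decode back to the original concatenation,
    # but cross-block matches are lost (the shared-resource and CR caveats above still apply)
    cat block.*.zst > big.img.zst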

    2) If you have a scheduled backup every 4 hours, you don't really need the compressor to work faster than 80MB/s,
    so it may be sometimes profitable to use a slower codec, if it still fits the schedule and saves 20% of storage space.

    3) Nobody forces you to use slower compression algorithms than what you currently like.
    Compression algorithm development is simply interesting for some people,
    and some other people are getting paid for development of new compression algorithms.

  7. #37
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    1) Any compression algorithm would benefit from MT, when you can feed it different blocks.
    I wouldn't be so assertive.
    In some cases separating the blocks considerably reduces efficiency.
    In others, less so.

    Linear scaling is not realistic
    ~ linear, yes.
    But as mentioned, the "how" is not important.
    ...400MB/s at 16 cores, and that's not even the maximum on modern servers.
    You will typically never have all those cores 100% available, because servers work 24 hours a day.
    You don't turn everything off to make backups: you get more resources, but not unlimited ones.

    And those who use servers with 16 physical cores often will not have 1TB of data, but maybe 10.
    Or 100.
    I do.

    In this case, as I have already explained, you will typically use specialized compression machines that read the .zfs snapshot data (thus loading the server's IO subsystem, but not the CPU, and not very much at that, with NVMe drives).
    For example, systems with 16 physical cores (an AMD 3950X in my case).
    This gives you nearly 24 hours of backup time (at the 100TB scale).
    But it is not exactly a common system, nor a requirement that seems realistic to me for new software [it works, but you have to buy 10,000 euros of hardware and hire two more engineers to do it].


    2) If you have a scheduled backup every 4 hours, you don't really need the compressor to work faster than 80MB/s, so it may be sometimes profitable to use a slower codec, if it still fits the schedule and saves 20% of storage space.
    When you need 600TB of backup space you don't worry too much about saving 20%.
    Indeed, even 0% (no compression at all) would do.
    You just buy some more hard disks.
    3) Nobody forces you to use slower compression algorithms than what you currently like.
    Compression algorithm development is simply interesting for some people,
    and some other people are getting paid for development of new compression algorithms.
    Certainly.
    I point out, however, that the main backup software houses in the world for virtual systems do not think so.
    They compress only a little, with ~deflate for example.
    This does not mean that "we" are right, but I expect new software to AT LEAST outperform the "old" ones.

    ===========
    I add a detail for the choice of the algorithm: advanced handling of large blocks of identical data.
    When you export from vSphere, thin disks typically become thick and are padded with zeros (actually it depends on the filesystem; for example this happens on Linux-based QNAP NAS, in sparse mode).

    So a new software should efficiently handle the case where there are hundreds of gigabytes of empty blocks, often positioned at the end.
    It is a serious problem especially during the decompression phase (slowdown).
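
    A rough illustration of the zero-padding issue (file names and sizes are made up):

    Code:
    # build a 1 GiB test image whose second half is all zeros
    dd if=/dev/urandom of=disk.img bs=1M count=512
    dd if=/dev/zero of=disk.img bs=1M count=512 seek=512 conv=notrunc
    time lz4 -1 -f disk.img disk.img.lz4
    time pigz -1 -k -f disk.img
    # sparse-aware writes on restore avoid physically rewriting the zero run:
    # lz4 -dc disk.img.lz4 | dd of=restored.img bs=1M conv=sparse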

    Just my two cents.

  8. #38
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    >> 1) Any compression algorithm would benefit from MT, when you can feed it different blocks.

    > In some cases separating the blocks considerably reduces efficiency.

    Sure, it depends on inter-block matches, and whether blocks break
    the internal data structure, which could be important for format detection
    and/or recompression.

    But recompression (forward transform) is usually slower than 50MB/s anyway,
    we have CDC dedup to take care of long inter-block matches,
    fast dictionary methods can be used to factor out inter-block matches,
    and nothing of the above is really relevant when we're dealing with TBs of data -
    we'd be able to compress 100GB blocks in parallel, and that would barely affect the CR
    at all, because only specialized dedup algorithms handle that scale -
    for normal codecs it's good if they can handle 1-2GB windows.

    > And those who use servers with 16 physical cores often will not have 1TB of data, but maybe 10. Or 100.

    As Bulat already said, that won't be unique data which has to be compressed
    with actual compression algorithms.
    Most of that data would be handled by dedup module in any case,
    so the speed of compression algorithm won't affect the overall performance that much.

    > But it is not exactly a common system, nor a requirement that seems
    > realistic to me for new software

    As I already calculated, 80MB/s is enough to compress 1TB of new data
    every 4 hours in 1 thread.
    You'd only use 16 if you really want it to run faster for some reason.

    > When you need to have 600TB of backup space don't worry too much about saving 20%.
    > Just buy some other hard disks.

    Maybe, but if the question is - buy extra 200TB of storage or use new free software
    (like your zpaq) to save 20% of space - are you that certain about the answer?

    > I point out, however, that the main backup software houses
    > in the world for virtual systems do not think so.
    > They compress only a little, with ~deflate for example.

    It's more about ignorance and fashion than anybody actually evaluating their choices.
    Most even use default zlib when there are many much faster optimized implementations
    of the same API.

    For example, did you properly evaluate zstd for your use case?
    (Not just default levels, but also --fast ones, dictionary modes, manual parameter setting).
    If not, how can you say that zlib or LZ4 are better?
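
    A sketch of the kind of evaluation meant here (placeholder file names; not a recommendation of specific settings):

    Code:
    zstd -b1 -e19 sample.img                 # built-in benchmark across levels 1..19
    zstd --fast=3 -T0 -k -f sample.img       # "fast" negative levels, all cores
    zstd -19 --long=27 -T0 -k -f sample.img  # higher ratio, 128 MiB match window
    zstd --train samples/* -o vm.dict        # dictionary mode for many small inputs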

    > This does not mean that "we" are right,
    > but I expect new software to AT LEAST outperform the "old" ones.

    This is actually true in the zlib vs zstd case,
    and seems to be rather common for goals of many other codecs too.

    But for me compression ratio is more interesting,
    so I made this thread about new algorithms with different properties,
    rather than about making zstd 2x faster via SIMD tricks.

  9. #39
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    You'd only use 16 if you really want it to run faster for some reason.
    Ahem... those little machines cost many thousands of euros each.
    And they consume about 300W each, and require air conditioning 24/365
    (in Italy we have neither gas, oil nor nuclear).

    Maybe, but if the question is - buy extra 200TB of storage or use new free software
    (like your zpaq) to save 20% of space - are you that certain about the answer?
    Yes, I am.
    Because BEFORE trusting new software, a couple of years of testing is needed.
    You never just run anything new.
    Even new builds of the same software you run in parallel for months.
    -rw-r--r--  1 root  wheel  794036166617 Jan 26 19:17 fserver_condivisioni.zpaq
    -rw-r--r--  1 root  wheel  320194332144 Jan 26 19:22 fserver_condivisioni47.zpaq

    Those are two backups, one made with zpaqfranz v11, one with zpaqfranz v47.
    Even if you wrote it yourself.
    Imagine losing all your money for days because oops, there was corruption while restoring your bank's backup.
    It just can't happen.

    It's more about ignorance and fashion than anybody actually evaluating their choices.
    Most even use default zlib when there are many much faster optimized implementations
    of the same API.
    In part yes, I agree.

    For example, did you properly evaluate zstd for your use case?
    (Not just default levels, but also --fast ones, dictionary modes, manual parameter setting).
    If not, how can you say that zlib or LZ4 are better?
    I am not able to tell.
    Unfortunately I am no longer young enough, and therefore no longer have the time,
    to devote myself to projects that would interest me.
    These are things I could have done 25 years ago.
    Unfortunately, much of my time today is devoted to... paying taxes.

    But for me compression ratio is more interesting,
    so I made this thread about new algorithms with different properties,
    rather than about making zstd 2x faster via SIMD tricks.
    A new algorithm from scratch that was more efficient would certainly be interesting.

    An algorithm as fast as the actual transfer rate of the media, say 500MB/s on 4 cores (which is typically how many you can use), would be even better.

    At that point, once the speed has been set, we can discuss the reduction in size.
    And the decompression speed, which must be decent.
    Because when you have a system hang, and you need to do a rollback, and your bank's accounts are
    frozen, you can't wait 12 hours for unzipping.

    The ideal program reads and writes at the same speed at which the data is "pumped" by the IO subsystem (which can also be a 40Gb NIC).
    Just like pv.
    It would be a great relief to those who work in data storage.

  10. #40
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    > At that point, once the speed has been set, we can discuss the reduction in size.

    Unfortunately it's the reverse of how it actually works.
    Speed optimization is time-consuming, but also much more predictable than compression improvement.
    There're many known "bruteforce" methods for speed improvement - compiler tweaking, SIMD, MT, custom hardware.
    For example, these people claim 16GB/s LZMA compression (up to 256GB/s in a cluster): https://www.fungible.com/product/dpu-platform/

    But it's much harder to take an already existing algorithm and incrementally improve its compression ratio.
    To even start working on that its usually necessary to remove most of existing
    speed optimizations from the code (manual inlining/unrolling, precalculated tables etc),
    and then some algorithms simply can't be pushed further after some point (like LZ77 and huffman coding).

    Thus designing the algorithm for quality first (compression ratio in this case) is a much more reliable approach.
    Speed/quality tradeoff can be adjusted later, after reaching the maximum quality.
    Of course, the choices would still be affected by the minimum acceptable speed - depending on whether it's 1MB/s, 10MB/s, 100MB/s or 1000MB/s
    we'd have completely different choices (algorithm classes) to work with.

    Still, compromising on speed is the only option if we want to have better algorithms in the future.
    Speed optimizations can be always added later and better hardware is likely to appear,
    while better algorithms won't appear automatically - somebody has to design them and push the limits.

    It's just how it is - compression algorithms may have some parameters, but never cover
    the full spectrum of use cases; it's simply impossible to push deflate to paq-level compression
    by allowing it to run slower - that requires a completely different algorithm
    (and underlying mathematical models).

  11. Thanks:

    Gotty (28th January 2021)

  12. #41
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    Thus designing the algorithm for quality first (compression ratio in this case) is a much more reliable approach.
    Most of the data that takes up a lot of space is already compressed.
    Often already extremely compressed.
    Videos, images etc. are the largest files.
    Executables compress little. Often there are also compressed files (zip, 7z) from internal backups that were performed.
    There remain the tablespaces of databases (where the compression level is very high) and large quantities of text (e.g. HTML).
    So I don't expect sensational results.

    Even using NZ, or the various paqs, the differences are modest.
    I therefore recommend taking the disk of a test virtual machine and testing the algorithms already available.
    This is a really, really crude test (I'm working... as always!)
    Attachment 8299

    Here you can see the mighty nanozip and the ubiquitous 7z (all with default values, just a test) vs pigz -1 and lz4, on an almost empty Windows 8.1 image.

    As you can see, you will never use an algorithm that takes 20 times (!) as long to go from a 4.5GB backup to 3.2GB (pigz vs nz),
    or one that doubles the time (11 minutes vs 5) to go from 3.3 to 3.2 (7z vs nz).
    I understand that, from a theoretical point of view, these are important improvements.
    But from the practical one... you will use lz4 or pigz (remember: scale x1000).
    Or whatever else (srep+lz4? no, thanks: ~pigz performance WITHOUT the complication) is as fast as you can get.

    Or, of course, zpaqfranz
    Attached Thumbnails: rude.jpg

  13. #42
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    > Most of the data that takes up a lot of space is already compressed.
    > Often already extremely compressed.

    Mostly "extremely bad".
    See https://github.com/schnaader/precomp-cpp

    > Videos, images etc are the largest files.

    See
    https://github.com/google/brunsli
    https://encode.su/threads/1111-Ocari...deo-compressor
    https://github.com/danielrh/losslessh264
    https://github.com/packjpg/packMP3
    https://encode.su/threads/3256-Lossl...ll=1#post62604
    etc

    > Executables compress little.

    https://fgiesen.wordpress.com/2011/0...n-in-kkrunchy/

    > Often there are also compressed files (zip, 7z) of internal backups performed.

    zip is mostly covered by precomp and other similar tools
    https://encode.su/threads/1231-lzma-recompressor

    > So I don't expect sensational results.

    You can expect it or not, but based on paq8 results, there's at least 30%
    between max zstd compression and the actual data entropy.
    deflate and LZ4 only support windows of 32-64KB, so they can't even really be compared -
    it's easy to demonstrate even 100x better compression against these.

    > Even using NZ, or the various paqs, the differences are modest.

    You need precomp and srep passes before comparing actual compression algorithms.

    Also you're wrong to expect archivers with default options to show their full potential.
    Except for zstd with its "paramgrill", nobody else (afaik) bothered with actual tuning
    of their level profiles.

    For comparisons it's also necessary to compare using the same number of threads,
    same window size, and same types of preprocessing if one of the programs lacks some.
    Most of the popular archivers lack some separately available features (like a dedup module),
    so comparing them to rare programs which have these features built in is unfair.

    Or, to be precise, it's fair if you're testing to choose the best tool for some task,
    but unfair if you're looking for the highest potential for further development.

    > Here you can see the mighty nanozip and the ubiquitous 7z (all with default
    > values, just a test) vs pigz -1 and lz4, on an almost empty Windows 8.1 image

    It doesn't mean much, since nz and 7z defaults are not designed for such volumes.
    But they have commandline parameters which can significantly change the results in this case.
    Like "-mx=9 -myx=9 -md=1536M -ms=999t -mmt=2" for 7z.

    > As you can see you will never use an algorithm that takes 20 times (!) the
    > time to go from a 4.5GB backup to 3.2GB (pigz vs nz)

    Problem is, you didn't bother to read the nz usage text, and it has lots of parameters
    and 12(!) different compression algorithms, from much faster to much stronger
    (plus memory usage, MT controls etc) which significantly affect the results.

    > I understand that, from a theoretical point of view, these are important improvements.

    We simply don't know the actual results, since these programs weren't designed
    for your use case, and you can't RTFM and tweak their options.

    See this post for example: https://encode.su/threads/3559?p=68397&pp=1
    In that post, non-default settings made LZMA encoding 8x faster,
    while still providing significantly better compression than another codec.
    7-zip also has syntax for this.
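
    As a sketch of what such syntax can look like in 7-Zip (the values here are illustrative, not the settings from the linked post):

    Code:
    # fast-mode LZMA2 (a=0) with a 64 MB dictionary and 4 threads
    7z a -m0=lzma2:a=0:d=64m:fb=32:mc=8 -mmt=4 backup.7z disk.img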

    > But from the practical one... you will use lz4 or pigz (remember: scale x1000).

    No, you simply learn how to properly use existing tools
    instead of expecting their authors to provide you a perfect solution for each use case.

  14. #43
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    No, you simply learn how to properly use existing tools
    instead of expecting their authors to provide you a perfect solution for each use case.
    Ahem...
    With all respect, I disagree: your hypotheses are very far from how this work actually goes.

    Try to find the optimal parameters to compress 1TB of virtual machine images for 10 different programs with 10 different parameters each.
    It will take weeks of work.
    Then I will upload some MP4s into the image.
    Your parameters are now wrong.
    Spend more weeks finding the best ones again.
    Then I will upload a big mysqldump backup.
    Your parameters are now wrong again.
    (...)

    Everyone knows that there are 1,000 parameters that matter significantly,
    but you can only know which ones EX POST, not EX ANTE.

    It is not the Hutter Prize, with a fixed file to compress where you can tweak almost everything.
    It is not "compress a fresh Windows 2008 R2 virtual disk server".

    You will find anything.
    Small BSD servers.
    Huge Linux servers.
    Mid-size Windows servers.
    With "anything" inside (NTFS, already-compressed zfs blocks, ext4 blocks, btrfs).


    Problem is, you didn't bother to read the nz usage text
    Of course I tried them all, years ago.
    But on small files.
    If you want to try them on 1TB machines, I can supply as many as you want.
    But beware: the content varies from day to day.
    Often with already-compressed images and videos.

    We simply don't know the actual results, since these programs weren't designed
    for your use case, and you can't RTFM and tweak their options.
    "MY" use case doesn't exist.
    There are so many different VMs with so many different containers.
    It would be so easy if there were ONLY disk images of a certain type.

    When the difference in speed is so big (seconds vs minutes),
    it doesn't make much sense to tweak for the sake of saving every little byte.
    The backups, sooner or later, will be deleted to make room for new ones.
    They are disposable in the medium term.

    ===
    However, I'm curious to see if much better algorithms will be developed (in terms of space saving for the same time) for VM backup

    PS: yes, EXEs compress little, very little.
    Reductions on the order of tens of percent are not worth the effort.
    And it is not at all easy to figure out where the executables are and how they are laid out.

    Maybe you are confusing file-level access (like zip, nz or whatever you want)
    with access to the SECTORS (/clusters/blocks or whatever the filesystem uses)
    that make up the virtual disks.

    Where, in case of both internal and external fragmentation,
    you will NOT have a continuous stream of bytes representing the EXE, or the JPG...
    Maybe you'll have some EXE chunks, then some MP4, then some HTML,
    then some internal filesystem structures, and so on.
    Preprocessors, pre-analyses etc. must take this into account.
    Precomp et similia simply fail.
    Attached: precomp.jpg
    Not "doesn't work, but they could, if you were smarter"
    Instead "can't"

    A PAQ-like heuristic analysis method for recognizing individual chunks does not work
    at all with advanced filesystems where there is no sequential data stream.
    It's more like a shuffled deck of cards.

    Then you can do your own analysis, but the sun is about to rise
    and the server needs to resume full operation.

  15. #44
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    fcorbelli, a few notes:
    - you talk about the architecture of the entire backup engine, while Eugene works as part of a team and his specific job is the final compression algorithm
    - when you talk about "inherently parallel" algorithms, you need to investigate why they are parallel. To start with, pigz and zpaq aren't algorithms but programs - especially zpaq, which is an entire backup engine employing many algorithms together

    PIGZ just runs multiple independent compression jobs simultaneously. Any LZ77-class compressor with a dictionary of X bytes can be multi-threaded WITHOUT DECREASING CR this way, by just parsing X extra bytes prior to compression in every chunk. PIGZ employs DEFLATE compression, which has a dictionary of only 32 KB, and that allows it to employ many CPU cores using less than 1 MB per core.

    This approach is applicable to any LZ77-class algorithm (zstd, lzma) and probably to ROLZ ones too. The drawback is single-threaded decompression. There are also other ideas that help to parallelize LZ compressors, but they are usually applicable to any LZ algorithm, so they can be added later. As Eugene said, he looks first into improving compression ratio; speed optimization can be done later.
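
    For reference, a minimal pigz invocation of the block scheme described above (the file name is a placeholder):

    Code:
    pigz -1 -p 16 -b 128 disk.img    # 16 compression threads, independent 128 KiB input blocks
    unpigz -t disk.img.gz            # testing/decompression stays essentially single-threaded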

    Similarly, my own fa'next (which outperforms pigz by miles) works by applying deduplication and then splitting the data into blocks, each compressed independently with zstd or lzma. AFAIR, zpaq does the same but employs its own LZ/BWT compression algorithms, which are inferior to zstd.

    ----------------------

    Now, do you agree that splitting the data into 100 GB blocks is enough? You shouldn't make blocks larger, because a block can be completely lost and because you may need to extract smaller parts of the entire dedup set. At 20 MB/s such a block will be processed in about 1.5 hours. That should be OK for the backup stage, although you may need to restore faster. So the emphasis should be on decompression speed, and if you are looking for m/t algorithms, you should look specifically at M/T DECOMPRESSION ability.

  16. Thanks:

    Shelwien (28th January 2021)

  17. #45
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    fcorbelli, a few notes:
    - you talk about the architecture of the entire backup engine, while Eugene works as part of a team and his specific job is the final compression algorithm
    That's true.
    In fact, for me it doesn't matter at all whether you choose dedup+compress, precomp+dedup+compress, dedup+precomp+compress, plain compress or whatever (I am just running a paq test on a real VM example, just to be clear. Running...)
    - when you talk about "inherently parallel" algorithms...
    It should be noted that server CPUs normally have a modest clock speed, for thermal reasons.
    While it is easier to have 8-16 cores, and sometimes even 1-4 CPUs, modern servers are not "number crunchers".

    PIGZ just runs multiple independent compression jobs simultaneously (...) PIGZ employs DEFLATE compression, which has a dictionary of only 32 KB, and that allows it to employ many CPU cores using less than 1 MB per core.
    It's archaic technology, but it works.
    In the average case it works great (many cores).
    It has problems (very little parallelism) in decompression, and in fact that is one of the reasons why it is not always used.
    This approach is applicable to any LZ77-class algorithms (zstd, lzma) and probably ROLZ ones too. The drawback is single-threaded decompression.
    The author (of pigz) explains it very well in his notes.
    It is not a small problem.
    ...As Eugene said, he looks first into improving compression ratio; speed optimization can be done later.
    I am a little skeptical about the concrete possibility of having a "smart" algorithm that operates on large amounts of data. Sure, I might be surprised.
    zpaq does the same but employs its own LZ/BWT compression algorithms inferior to zstd.
    I dream of a zpaq with a much faster compressor.
    Maybe I will make the patch.
    Now, do you agree that splitting the data into 100 GB blocks is enough? You shouldn't make blocks larger, because a block can be completely lost and because you may need to extract smaller parts of the entire dedup set. At 20 MB/s such a block will be processed in about 1.5 hours. That should be OK for the backup stage, although you may need to restore faster. So the emphasis should be on decompression speed, and if you are looking for m/t algorithms, you should look specifically at M/T DECOMPRESSION ability.
    In my previous posts I have already explained that decompression speed is almost as critical as compression speed.
    This is why, in my opinion, even a "stupid" algorithm, as long as it has few or no inter-process locks, would be desirable.

    It is very hard, because...
    5) decompression which does NOT seek (if possible)
    and
    ...
    And the decompression speed, which must be decent.
    Because when you have a system hang, and you need to do a rollback, and your bank's accounts are frozen, you can't wait 12 hours for unzipping.
    Just to give some real-world numbers: a very "smart" analyzer (PAQ8PX) vs three real (small) images.
    Special compressors (for example for EXE, with substitutions etc.) are not so "smart" with a filesystem-based virtual container.

    EDIT: after-dinner results for a MacOS virtual machine
    (a good dinner)
    Attached Thumbnails: solaris.jpg, linux.jpg, windows.jpg

  18. #46
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    > Try to find the optimal parameters to compress 1TB of virtual machine
    > for 10 different programs with 10 different parameters.
    > It will take weeks of work.

    Ideally yes, if you want to optimize the compressed file to byte precision,
    then it has to be done.

    You may be surprised, but it is really done sometimes,
    eg. for game repacks, since you can spend a week on this optimization
    once, and then 1000s of people would download and use your releases for years.

    > Then I will upload some MP4 into the image. Your parameters are now wrong.

    Ideally yes, at the very least you'd have to add some MP4 recompressor,
    which would be disabled by default, because archiver would normally
    compress MP4s as standalone files and always-on inline detection
    would slow down processing in more common cases.

    > Then I will upload a big mysqldump backup. Your parameters are now wrong again.

    Yes, ideally you'd have to add a mysqldump preprocessor, since it's a popular format
    and it's possible to convert the actual data contained in the dump to a more compact form,
    which could be compressed better and much faster.

    > Everyone knows that there are 1,000 parameters that matter significantly,
    > but you can only know which ones EX POST, not EX ANTE.

    Problem is, there're also more common parameters, which significantly
    affect the archiver's performance (both CR and speed), and would be common
    for the whole VM-image use case.

    Archivers are mostly relatively old (they were very important for all PC users
    some 15-30 years ago, but not anymore, as compression is now integrated in all
    popular office formats and filesystems), and are targeted at office use -
    copying or sending a bunch of related files in a single container.

    VM image compression at terabyte scale is a completely different use case,
    so it definitely requires some non-default settings for these archivers.

    > With "anything" inside (ntfs, already-compressed zfs block, ext4 blocks, btrfs)

    That's ok and any compression-related solution can be usually improved forever.

    But I cannot accept your opinion that LZ4 is better than LZMA(7z) for your use case,
    because you failed to do a fair comparison.

    > "MY" use case doesn't exist.

    But it does.
    The number of files to compress, the average volume of data,
    data types that likely would be there, fragmentation
    (FS types would be only relevant if you have parsers for them;
    do you know that 7z does have parsers for some FS types,
    so you can extract e.g. an NTFS image with 7z? And then compress it
    with a much better CR.)
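
    A hedged sketch of that (7-Zip opens several filesystem image formats directly; the image name is a placeholder, and whole-disk images with partition tables may need one extra extraction step):

    Code:
    7z l ntfs_partition.img               # list the files inside an NTFS partition image
    7z x ntfs_partition.img -oextracted/  # extract them, then compress the extracted tree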

    > When the difference in speed is so big (seconds vs minutes),
    > it doesn't make much sense to tweak for the sake of saving every little byte.

    Sure, it won't actually make sense to back up VM images with paq8.
    Actually, from hundreds of codecs at LTCB, maybe 5-10 would be applicable.
    But even if your choice is limited to LZ4, it still does have some detailed
    parameters, which could improve its performance in your case.
    Also branches with improved compression: https://github.com/inikep/lizard

    Of course, there's no problem if you found a perfect solution for yourself.
    Just don't tell other people to stop experimenting,
    especially if your opinion is very subjective and lazy.

    > However, I'm curious to see if much better algorithms will be developed
    > (in terms of space saving for the same time) for VM backup

    With recompression and "diff-based dedup" there's still a lot of potential atm.
    Btw, the Fungible thing linked above seems to support jpeg recompression in hardware.

    > It is not at all easy to understand where and how the executables are.

    Actually, at least x86/x64 code is easy enough to detect with an E8 filter.
    And the CR difference between deflate and disasm/delta/lzma can easily be
    2x or more - https://www.maximumcompression.com/data/exe.php
    LZ4 is of course even worse.
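
    A rough way to see that gap with stock tools (this uses xz's generic BCJ x86 filter as a stand-in for the E8 filter mentioned above; program.exe is a placeholder):

    Code:
    xz -9 -c program.exe > plain.xz
    xz --x86 --lzma2=preset=9 -c program.exe > bcj.xz   # branch/call/jump filter before LZMA2
    ls -l plain.xz bcj.xz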

    > Maybe you are confusing a file access (like zip, nz or whatever you want)
    > with access to the SECTORS (/cluster/block or whatever the filesystem)
    > that make up the virtual disks.

    No, at least NZ should be able to detect x86 code in containers.
    7z has an exe handler too, but might require explicit cmdline options
    to enable it for VM images.

    > Where, in case of both internal and external fragmentation,
    > you will NOT have a continuous stream of bytes representing the EXE, or JPG ....

    Sure, it limits the maximum recompression potential,
    and most open-source recompressors would have problems
    with fragmented data.

    But there's still some non-zero potential in this case too.
    We'd just have to write specialized solutions for recompression
    of fragments of popular formats, rather than valid whole files
    as it is now in most cases.
    Also there may be some worse but universal solutions:
    https://encode.su/threads/2742-Compr...ll=1#post52493

    > Not "doesn't work, but they could, if you were smarter"
    > Instead "can't"

    No, in most cases partial recompression would be still possible,
    we just don't have implementations yet.
    (Well, mp3zip can handle any chunks of mp3 data, even muxed with video,
    jojpeg can compress partial jpegs, many formats would have small enough deflate streams,
    all video and audio formats consist of independently parsable small frames, etc).

    It's simply complex work, so developers try to save time
    by skipping detection and handling of broken data...
    but its far from impossible.

    > PAQ-like heuristic analysis method for recognizing individual chunks does not work

    Sure, but the detection code in paq is actually of very low quality,
    since in the end it's a hobby project without much practical use.

    > Then you can do your own analysis, but the sun is about to rise
    > and the server needs to resume full operation

    If you're regularly doing backups of the same images,
    it should actually be helpful to run defrag on them sometimes.

    And instead of running full analysis from scratch every time,
    we could in theory save image structure descriptions and only
    update differences during subsequent backups.

  19. #47
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    7 years ago I benchmarked FreeArc, Nanozip, 7-zip and RAR5 on a 16.3 GB folder with the programs installed on my computer. Attached is the file with my results. Tests were done on an i7-4770 (Haswell, 3.9GHz, 4 cores + HT), which is about 2x slower than your CPU.

    The fastest NZ setting (-cf -t8 ) got 750 MB/s compression speed (50x faster than the default -co), with compression similar to rar -m1 -md64k (which should be better than pigz/deflate -1). On your computer it should be 1.4 GB/s, i.e. 4x faster than pigz -1 - while with a better compression ratio.

    And FA'Next is even better in both CR and speed, especially the unpublished 0.12 version.
    Attached Files

  20. Thanks:

    fcorbelli (28th January 2021)

  21. #48
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    7 years ago I benchmarked FreeArc, Nanozip, 7-zip and RAR5 on a 16.3 GB folder with the programs installed on my computer. Attached is the file with my results. Tests were done on an i7-4770 (Haswell, 3.9GHz, 4 cores + HT), which is about 2x slower than your CPU.

    The fastest NZ setting (-cf -t8 ) got 750 MB/s compression speed (50x faster than the default -co), with compression similar to rar -m1 -md64k (which should be better than pigz/deflate -1). On your computer it should be 1.4 GB/s, i.e. 4x faster than pigz -1 - while with a better compression ratio.
    I got ~1800 MB/s.
    This is the kind of speed (~300MB/s per core) that I would like to have... in... zpaq.
    The size is just about that of pigz -1.
    As I said, the archaic parallel deflate is not that bad.

    4.586.692.754 81.vmdk.gz
    4.487.043.973 nz_test2.nz


    R:\prova>c:\nz\nz64\nz.exe a -cf -t6 nz_test2.nz c:\vm\81\81.vmdk
    NanoZip 0.09 alpha/Win64 (C) 2008-2011 Sami Runsas www.nanozip.net
    Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz|37356 MHz|#6+HT|17649/32432 MB
    Archive: nz_test2.nz
    Threads: 6, memory: 512 MB, IO-buffers: 20+4 MB
    Compressor #0: nz_lzpf [90 MB]
    Compressor #1: nz_lzpf [90 MB]
    Compressor #2: nz_lzpf [90 MB]
    Compressor #3: nz_lzpf [90 MB]
    Compressor #4: nz_lzpf [90 MB]
    Compressor #5: nz_lzpf [90 MB]
    Compressed 9 584 050 176 into 4 478 090 255 in 4.86s, 1880 MB/s
    IO-in: 4.35s, 2100 MB/s. IO-out: 2.19s, 1944 MB/s

    But, AFAIK, nanozip is not open source.

  22. #49
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    your CPU can run 12 threads, so add the -t12 option to make it 10x faster than pigz

    The simplest way to make such a fast compressor is to use zstd -1 in parallel. You can incorporate zstd into zpaq, if it allows disabling deduplication while running multiple zstd threads in parallel.
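
    For reference, zstd already exposes that kind of parallelism from the command line (a minimal sketch, separate from any zpaq integration):

    Code:
    zstd -1 -T12 -k disk.img -o disk.img.zst   # level 1, 12 worker threads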

    Note that zpaq dedup is pretty slow (50-100 MB/s per thread), but with a proper implementation you can dedup at 1-2 GB/s. Check for example srep's -m1 mode.

    At the end of the day, all of that was implemented in my fa'next.


    >As I said, the archaic parallel deflate is not that bad

    being only 10x slower than a 10-year-old program? BTW, try -cD -t12 - it should still be faster than pigz but with much better compression. but fa'next will just shine here, since it's the only program combining fast dedup with modern fast LZ compressors
    Last edited by Bulat Ziganshin; 29th January 2021 at 00:00.

  23. #50
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    your cpu can run 12 threads, so add -t12 option to make it 10x faster than pigz
    No.
    The 8700K has 6 physical cores.
    With hyperthreading, NZ runs a little slower.


    note that zpaq dedup is pretty slow (50-100 MB/s per thread), but with proper implementation you can dedup at 1-2 GB/s. check for example srep -m1 mode
    I have tried srep for years, but it is simply not reliable.
    Sometimes it crashes, and that is not OK.
    It does not like piping very much.
    Zpaq is not very fast, but... it works.
    You run it, wait, and get the job done.
    And, of course, it keeps the versions.

    at the end of day, all that was implemented in my fa'next
    Is there any source to try on BSD?

    On pigz: it works well with streams and gzcat.
    Without much RAM. It runs everywhere.
    It just works. Not bad for a 30-year-old deflate on steroids.

  24. #51
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by fcorbelli View Post
    Where, in case of both internal and external fragmentation,
    you will NOT have a continuous stream of bytes representing the EXE, or the JPG...
    Maybe you'll have some EXE chunks, then some MP4, then some HTML,
    then some internal filesystem structures, and so on.
    Preprocessors, pre-analyses etc. must take this into account.
    Precomp et similia simply fail.

    Not "doesn't work, but they could, if you were smarter"
    Instead "can't"
    Fragmentation can be undone using libguestfs. I did a quick proof of concept for "defragmented" VMDK compression (decompression would be harder, but not impossible):
    • created a Damn Small Linux VM using VirtualBox, downloaded silesia.zip to /home/dsl inside the VM
    • extracted VMDK disk content to a .tar.gz (fragmentation is undone):
      Code:
      sudo guestfish --ro -a DSL_Test.vmdk -m /dev/sda2 tgz-out / content.tar.gz && gunzip -d content.tar.gz


    Results (using Precomp 0.4.7):

    Code:
    DSL_Test.vmdk   218,103,808 bytes
    DSL_Test.pcf_t+ 109,024,465 (only lzma2)
    DSL_Test.pcf     96,130,933 (1370/1384 streams, -d1)
    content.tar     196,423,680
    content.pcf_t+  108,590,172 (only lzma2)
    content.pcf      84,638,080 (4582/4650 streams, -d2)
    
    for reference:
    silesia.zip      68,182,744
    silesia.pcf_t+   67,957,231 (only lzma2)
    silesia.pcf      47,328,711 (3492/3555 streams, -d2)
    The VM metadata is lost in that quick-and-dirty proof of concept, but it's not that much of a difference when comparing the "only lzma2" sizes (~0.4%). However, silesia.zip can be completely processed in content.tar because there's no more fragmentation, giving a ~12% smaller result.

    With some effort, such preprocessing (and the reverse transform) should be possible to implement for VMDK and other VM image formats.
    http://schnaader.info
    Damn kids. They're all alike.

  25. Thanks (2):

    Mike (30th January 2021),Shelwien (30th January 2021)

  26. #52
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,133
    Thanks
    320
    Thanked 1,396 Times in 801 Posts
    How about making a diff from .tar to .vmdk?
    https://github.com/sisong/HDiffPatch

  27. Thanks:

    schnaader (30th January 2021)

  28. #53
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Thanks, that didn't come to my mind last night, it was late

    Code:
       556,030 bytes    tar_to_vmdk.hdiffz.pcf
                        command line:
                        hdiffz content.tar DSL_Test.vmdk tar_to_vmdk.hdiffz && precomp -e -t+ tar_to_vmdk.hdiffz
                        
    96,130,933          DSL_Test.pcf     
    85,194,110          (size of content.pcf + size of tar_to_vmdk.hdiffz.pcf)
    So yeah, it's still ~11% smaller and now it's reversible

    Now we can have a look at the timings (AMD Ryzen 5 4600H, 6 x 3.0 GHz, 4.0 GHz boost):
    guestfish (.vmdk -> content.tar) takes 9 seconds, hdiffz takes 16 s. VMDK size is 218 MB -> ~9 MB/s.
    No multithreading at all so far, so this might be improvable to ~40-50 MB/s.
    And combining both processes might be much faster as guestfish should know the mapping .tar -> .vmdk which would support the diff.
    http://schnaader.info
    Damn kids. They're all alike.

  29. #54
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by schnaader View Post
    Fragmentation can be undone...
    Can you please post the timings?
    Just to have a rough estimate.
    Can I post some VMs to be checked?
    About 5GB each.

  30. #55
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    630
    Thanks
    288
    Thanked 252 Times in 128 Posts
    Quote Originally Posted by fcorbelli View Post
    Can you please post the timings?
    Just to have a rough estimate.
    Can I post some VMs to be checked?
    About 5GB each.
    See the post right before yours, we posted simultaneously

    5 GB should take ~3 minutes (.vmdk -> .tar) or ~10 minutes (.tar and diff to make it reversible). Feel free to upload somewhere, I can download and test later today.
    http://schnaader.info
    Damn kids. They're all alike.

  31. #56
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by schnaader View Post
    See the post right before yours, we posted simultaneously

    5 GB should take ~3 minutes (.vmdk -> .tar) or ~10 minutes (.tar and diff to make it reversible). Feel free to upload somewhere, I can download and test later today.
    It depends on the complexity of the algorithms.
    Those that are O(n), O(n log n) etc. scale well.
    But those (especially diff algorithms) that are of higher polynomial order are a totally different story.
    For 100MB "any" algorithm is fine.
    When you test on 1GB or 1TB you will see the average asymptotic complexity in a real case.
    I will make a little test setup:
    *a fresh-installed FreeBSD
    *with portsnap (lots of source code)
    *with some MP4s inside (like a fileserver)

  32. #57
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    OK I made a little (very little) test setup

    30/01/2021 14:53 781.189.120 114_fresh.vmdk
    30/01/2021 14:59 2.188.574.720 114_ports.vmdk
    30/01/2021 15:32 4.741.660.672 114_somefile.vmdk
    30/01/2021 15:40 4.822.990.848 114_zpaqcompile.vmdk

    FreeBSD 11.4 "fresh" install (fresh)
    With ports (for those who are not accustomed to it: a sizeable library of program sources, thousands of them, therefore essentially text)
    With somefile (MP4 videos)
    With some compiling of ZPAQ (to see what happens with small differences)

    Obviously it is not a scientific test, different systems, different CPUs etc.


    But as an order of magnitude, zpaqfranz
    takes about 80s to make a 3.251.727.759-byte archive
    and about 110s to create the file AND verify it.
    By verification I mean checking the stored contents against the files on disk,
    which are read again.

    A more "real world" test (in the sense of sequential backups)
    c:\zpaqfranz\zpaqfranz a z:\1.zpaq  114_fresh.vmdk
    c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_ports.vmdk
    c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_somefile.vmdk
    c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_zpaqcompile.vmdk

    In this case the time is about 83s for 3.253.167.479 bytes
    And about 111s with verify

    Just as info, after tarring into a single file:
    srep ~36s (very high memory requirement -5GB- for decompression) => 3.419GB

    srep+nz -cf (very fast indeed) ~40s => 3GB
    srep+nz (nz_optimum1) => 2.8GB

    So there is room for speed improvements over zpaq (a well-known fact).
    Not so much, however, in the overall compression ratio, because compression
    is much less relevant compared to deduplication (it hardly matters at all)
    (a well-known fact that I've written about before).

    Only NZ -cf (~50s) => 16GB

    I don't know about the actual performance with verification.
    It is difficult to do this with chained programs

    http://archivio.francocorbelli.it/in...8penpTgxXzkS2O

  33. #58
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    82
    Thanks
    8
    Thanked 6 Times in 6 Posts
    It would be best to back up only settings and other unique data, not common files, if the archiver were smart enough.
    I'd back up the installation disks and patches separately for long-term archiving.

  34. #59
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by schnaader View Post
    Fragmentation can be undone using libguestfs...
    With some effort, such preprocessing (and the reverse transform) should be possible to implement for VMDK and other VM image formats.
    In fact... no, you can't.
    Unpacking the virtual machine archive (the vmdk) must yield something identical to the vmdk itself.
    Not even a diff (complexity too high => way too slow).
    For simple reasons of verification and reliability.


    You cannot mount the image and read its contents (also because in some cases you would not be able to; if you want, I can post some examples like the Solaris one above).

    Encrypted filesystems can also occur (they are rare, but they exist), and they are inherently incompressible.
    They are, however, easily deduplicable.

  35. #60
    Member
    Join Date
    Jan 2021
    Location
    Spain
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by fcorbelli View Post
    I have tried srep for years, but it is simply not reliable.
    Sometimes it crashes, and that is not OK.
    Wrong. You are simply using the wrong version. Don't use 3.9x:
    https://encode.su/threads/231-FreeAr...ll=1#post56822
    Use 3.2 unless you have access to the unreleased newer version.

    Quote Originally Posted by fcorbelli View Post
    srep ~36s (very high memory requirement -5GB- for decompression) => 3.419GB
    Again wrong: use Future-LZ. The switches are ONLY m1f ... m3f - they must contain the "f"!
    Less RAM is needed for decompression!


