Use whatever you want that can handle at least 1TB in half a night (4 hours).
Wrong. If you back up your disk every day, typically only a few percent of it changes. So you may need a very fast dedup algorithm (or not, if you watch disk changes via the OS API), but compression speed is of less importance. And for decompression, you can simply employ multiple servers.
You're confirming my thesis
In my previous post
Originally Posted by fcorbelli
For me the answer is easy.
The one that scales best on multicore.
Just like pigz.
Single thread performance is useless.
On the implementation side: the one that can make extensive use of HW SSE instructions.
Compression ratio is irrelevant.
Only speed (and limited RAM usage).
In two words: a deduplicated pigz (aka deflate).
Or lz4 for decompression speed (not so relevant).
In fact I use this one (storing the deduplicated archive on zfs)
0) versioning "a-la-time-machine"
1) deduplication.
2) highly parallelizable compression.
3) limited RAM consumption
4) works with really large files
5) decompression which does NOT seek (if possible)
6) an advanced and fast copy verification mechanism WITHOUT decompress if possible
7) easy portability between Windows-Linux-* Nix systems.
8) append-only format
9) Reliability reliability reliability. No software "chains", where bugs and limitations can add up.
A real ZPAQ-based example: a virtual Windows 2008 server with SQL Server for an ERP application
So you're just saying that it's not acceptable for you; that's OK.
On filesystem
Originally Posted by fcorbelli
For quick backups the answer is a differential zfs send (not incremental), compressed with pigz.
Requires zfs, lots of RAM and fast disks.
It is doable: I have done it every day for years.
But restore is painful, and extensive zfs expertise is needed.
@fcorbelli:
> Use whatever you want that can handle at least 1TB in half night (4 hours).
2**40/(4*60*60) = 76,354,974 bytes/s
It's not actually that fast, especially taking MT into account.
I posted the requirements for the single thread of the algorithm, but of course the complete tool would be MT.
You are thinking in the "gigabyte scale".
Look at the watch for 4 seconds.
That's a GB at 250MB/s (very fast indeed).
Now look at the watch for an hour (~3,600s), or 1,000x longer.
That's the TB scale at 250MB/s.
So every time you run a test, or think about a method or an algorithm, imagine it running for 1,000 times longer than what you are used to - unless your job is already making backups for hundreds of servers every day.
I think it is unlikely that you test on 1TB, or maybe 50TB, of data when developing such a program.
I need to, because it's my job.
Then you will begin to understand that everything else doesn't matter IF you don't have something really fast. But really fast. I mean FAST.
And how can it be that fast?
Only if heavily multithreaded, of course.
Single-core performance is completely irrelevant IF it does not scale ~linearly with size.
And which compression algorithms (deduplication taken for granted) are so fast on multi-core systems,
not because they are magical, but because they are highly parallel?
Someone was offended, but it is simply factual.
When in doubt, think about the terabyte scale with my Gedankenexperiment and everything will be clearer.
That's 1TB, the size for the hobbyist.
Then multiply by 10 or even 100, so by 10,000 or 100,000 vs the 4-seconds-GB-time-scale,
and THEN choose the algorithm
Similarly for the consumption of RAM and all the points I have already written several times before
1) Any compression algorithm would benefit from MT, when you can feed it different blocks.
Linear scaling is not realistic because of shared L3 and other resources.
But something like 0.5*N_cores*ST_speed would still turn 50MB/s into 400MB/s at 16 cores,
and that's not even the maximum on modern servers.
2) If you have a scheduled backup every 4 hours, you don't really need the compressor to work faster than 80MB/s,
so it may be sometimes profitable to use a slower codec, if it still fits the schedule and saves 20% of storage space.
3) Nobody forces you to use slower compression algorithms than what you currently like.
Compression algorithm development is simply interesting for some people,
and some other people are getting paid for development of new compression algorithms.
1) Any compression algorithms would benefit from MT, when you can feed them different blocks.
I wouldn't be so assertive
In some cases separating the blocks considerably reduces efficiency.
In others less.
Linear scaling is not realistic
~ linear, yes.
But as mentioned it is not important "how"
...400MB/s at 16 cores,and that's not even the maximum on modern servers.
You will typically never have all those cores 100% available, because servers work 24 hours a day.
You don't turn everything off to make backups: there are more resources available, but not infinite.
And those who use servers with 16 physical cores often will not have 1TB of data, but maybe 10.
Or 100.
I do.
In this case, as I have already explained, you will typically use specialized compression machines that read the snapshot data from .zfs (thus loading the server's IO subsystem, but not the CPU, and not by much with NVMe).
For example with physical 16-core systems (AMD 3950X in my case).
This gives you nearly 24 hours of backup time (100TB-time-scale).
But it is not exactly a common system, nor a requirement that seems realistic to me for new software [it works, but you have to buy 10,000 euros of hardware to do it and hire two more engineers]
2) If you have a scheduled backup every 4 hours, you don't really need the compressor to work faster than 80MB/s, so it may be sometimes profitable to use a slower codec, if it still fits the schedule and saves 20% of storage space.
When you need 600TB of backup space, you don't worry too much about saving 20%.
Indeed even 0% (no compression at all).
Just buy some other hard disks.
3) Nobody forces you to use slower compression algorithms than what you currently like.
Compression algorithm development is simply interesting for some people,
and some other people are getting paid for development of new compression algorithms.
Certainly.
I point out, however, that the main backup software houses in the world for virtual systems do not think so.
They compress only a little, ~deflate for example.
This does not mean that "we" are right, but I expect new software to AT LEAST outperform the "old" ones.
===========
I'll add a detail for the choice of algorithm: efficient handling of large blocks of identical data.
When you export from vSphere, thin disks typically become thick and get padded with zeros (actually it depends on the filesystem; for example this happens on Linux-based QNAP NAS in sparse mode).
So new software should efficiently handle the case where there are hundreds of gigabytes of empty blocks, often positioned at the end.
It is a serious problem, especially during the decompression phase (slowdowns).
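This zero-padding case is cheap to special-case before any general compressor sees the data. A minimal sketch of the idea in Python (a hypothetical helper, not taken from zpaq or any existing tool; the 1 MiB block size is an assumption): runs of all-zero blocks are recorded as (offset, length) pairs, so restore can recreate them with a seek/truncate instead of decompressing and writing the zeros back.

# Minimal sketch: strip runs of all-zero blocks before general compression.
# Hypothetical helper, not taken from any existing backup tool.
BLOCK = 1 << 20          # 1 MiB blocks (assumption)
ZERO = bytes(BLOCK)

def split_zero_runs(stream):
    """Yield ('zero', offset, length) for runs of all-zero blocks
    and ('data', offset, bytes) for everything else."""
    offset, zero_start, zero_len = 0, None, 0
    while True:
        block = stream.read(BLOCK)
        if not block:
            break
        if block == ZERO[:len(block)]:
            if zero_start is None:
                zero_start = offset
            zero_len += len(block)
        else:
            if zero_start is not None:
                yield ('zero', zero_start, zero_len)
                zero_start, zero_len = None, 0
            yield ('data', offset, block)
        offset += len(block)
    if zero_start is not None:
        yield ('zero', zero_start, zero_len)

# Usage: for kind, off, payload in split_zero_runs(open('disk.vmdk', 'rb')): ...
# Hundreds of GB of trailing zeros collapse into one tiny ('zero', offset, length) record.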
>> 1) Any compression algorithms would benefit from MT, when you can feed them different blocks.
> In some cases separating the blocks considerably reduces efficiency.
Sure, it depends on inter-block matches, and whether blocks break
the internal data structure, which could be important for format detection
and/or recompression.
But recompression (forward transform) is usually slower than 50MB/s anyway,
we have CDC dedup to take care of long inter-block matches,
fast dictionary methods can be used to factor out inter-block matches,
and none of the above is really relevant when we're dealing with TBs of data -
we'd be able to compress 100GB blocks in parallel, and that would barely affect the CR
at all, because only specialized dedup algorithms handle that scale -
for normal codecs it's good if they can handle 1-2GB windows.
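For illustration, the chunking part of such a dedup module is simple and fast: a rolling (gear) hash over the data decides chunk boundaries, so an insertion only shifts boundaries locally, and duplicate chunks are then found via a strong hash. A minimal Python sketch of content-defined chunking (the constants, sizes and mask are my own illustrative assumptions, not those of zpaq, srep or any particular tool):

# Minimal content-defined chunking (CDC) sketch using a gear hash.
# Constants (mask, min/max sizes) are illustrative assumptions only.
import hashlib, random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # per-byte random table
MASK = (1 << 13) - 1                                   # ~8 KiB average chunk
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def chunks(data: bytes):
    """Yield content-defined chunks of data."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i + 1 - start
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup(data: bytes):
    """Return (unique_chunks, recipe): recipe lists chunk hashes in order."""
    store, recipe = {}, []
    for c in chunks(data):
        key = hashlib.sha256(c).digest()
        store.setdefault(key, c)   # keep only the first copy of each chunk
        recipe.append(key)
    return store, recipe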
> And those who use servers with 16 physical cores often will not have 1TB of data, but maybe 10. Or 100.
As Bulat already said, that won't be unique data which has to be compressed
with actual compression algorithms.
Most of that data would be handled by the dedup module in any case,
so the speed of the compression algorithm won't affect the overall performance that much.
> But it is not exactly a common system, nor a requirement that seems
> realistic to me for new software
As I already calculated, 80MB/s is enough to compress 1TB of new data
every 4 hours in 1 thread.
You'd only use 16 if you really want it to run faster for some reason.
> When you need to have 600TB of backup space don't worry too much about saving 20%.
> Just buy some other hard disks.
Maybe, but if the question is - buy extra 200TB of storage or use new free software
(like your zpaq) to save 20% of space - are you that certain about the answer?
> I point out, however, that the main backup software houses
> in the world for virtual systems do not think so.
> Which compress a little, ~deflate for example.
It's more about ignorance and fashion than anybody actually evaluating their choices.
Most even use default zlib when there're many much faster optimized implementations
of the same API.
For example, did you properly evaluate zstd for your use case?
(Not just default levels, but also --fast ones, dictionary modes, manual parameter setting).
If not, how can you say that zlib or LZ4 are better?
> This does not mean that "we" are right,
> but I expect a new software AT LEAST outperform the "old" ones.
This is actually true in zlib vs zstd case,
and seems to be rather common for goals of many other codecs too.
But for me compression ratio is more interesting,
so I made this thread about new algorithms with different properties,
rather than about making zstd 2x faster via SIMD tricks.
You'd only use 16 if you really want it to run faster for some reason.
Ahem... those little machines cost many thousands of euros each.
And they consume about 300W each, and require air conditioning 24/365
(in Italy we have neither gas, oil nor nuclear).
Maybe, but if the question is - buy extra 200TB of storage or use new free software
(like your zpaq) to save 20% of space - are you that certain about the answer?
Yes, I am.
Because BEFORE trusting new software, a couple of years of testing is needed.
You never just run something new.
Even with new builds of the same software, you run them in parallel for months.
-rw-r--r-- 1 root wheel 794036166617 Jan 26 19:17 fserver_condivisioni.zpaq
-rw-r--r-- 1 root wheel 320194332144 Jan 26 19:22 fserver_condivisioni47.zpaq
Those are two backups, one for zpaqfranz v11, one for zpaqfranz v47
Even if you did it yourself.
Imagine losing access to all your money for days because oops, there was corruption when restoring your bank's backup.
It just can't happen.
Its more about ignorance and fashion than anybody actually evaluating their choices.
Most even use default zlib when there're many much faster optimized implementations
of the same API.
In part yes, I agree.
For example, did you properly evaluate zstd for your use case?
(Not just default levels, but also --fast ones, dictionary modes, manual parameter setting).
If not, how you can say that zlib or LZ4 are better?
I am not able to tell.
Unfortunately I no longer have the age, and therefore the time,
to devote myself to projects that would interest me.
These are things I could have done 25 years ago.
Unfortunately, much of my time today is devoted to... paying taxes.
But for me compression ratio is more interesting,
so I made this thread about new algorithms with different properties,
rather than about making zstd 2x faster via SIMD tricks.
A new algorithm from scratch that was more efficient would certainly be interesting.
An algorithm as fast as the actual media transfer rate, say 500MB/s on 4 cores (which is typically how many you can use), would be even better.
At that point, once the speed has been set, we can discuss the reduction in size.
And the decompression speed, which must be decent.
Because when you have a system hang, and you need to do a rollback, and your Bank's account is
frozen, you can't wait 12 hours for unzipping.
The ideal program reads and writes at the same speed that the data is "pumped" by the IO subsystem (which can also be a 40Gb NIC).
Just like pv
It would be a great relief to those who work in data storage.
> At that point, once the speed has been set, we can discuss the reduction in size.
Unfortunately it's the reverse of how it actually works.
Speed optimization is time-consuming, but also much more predictable than compression improvement.
There're many known "bruteforce" methods for speed improvement - compiler tweaking, SIMD, MT, custom hardware.
For example, these people claim 16GB/s LZMA compression (up to 256GB/s in a cluster): https://www.fungible.com/product/dpu-platform/
But its much harder to take an already existing algorithm and incrementally improve its compression ratio.
To even start working on that it's usually necessary to remove most of the existing
speed optimizations from the code (manual inlining/unrolling, precalculated tables etc),
and then some algorithms simply can't be pushed further after some point (like LZ77 and huffman coding).
Thus designing the algorithm for quality first (compression ratio in this case) is a much more reliable approach.
Speed/quality tradeoff can be adjusted later, after reaching the maximum quality.
Of course, the choices would be still affected by minimum acceptable speed - depending on whether its 1MB/s,10MB/s,100MB/s or 1000MB/s
we'd have completely different choices (algorithm classes) to work with.
Still, compromising on speed is the only option if we want to have better algorithms in the future.
Speed optimizations can be always added later and better hardware is likely to appear,
while better algorithms won't appear automatically - somebody has to design them and push the limits.
It's just how it is - compression algorithms may have some parameters, but never cover
the full spectrum of use cases; it's simply impossible to push deflate to paq-level compression
by allowing it to run slower - that requires a completely different algorithm
(and underlying mathematical models).
Thus designing the algorithm for quality first (compression ratio in this case) is a much more reliable approach.
Most of the data that takes up a lot of space is already compressed.
Often already heavily compressed.
Videos, images etc are the largest files.
Executables compress little. Often there are also compressed files (zip, 7z) from internal backups.
That leaves the database tablespaces (where the achievable compression is very high) and large quantities of text (e.g. HTML).
So I don't expect sensational results.
Even using NZ, or the various paqs, the differences are modest.
I therefore recommend taking the disk of a test virtual machine and testing the algorithms already available
This is a really, really crude test (I'm working... just always!) Attachment 8299
Here you can see the mighty nanozip, the ubiquitous 7z (all with default values, just a test) vs pigz -1 and lz4, on an almost empty Windows 8.1 image
As you can see you will never use an algorithm that takes 20 times (!) the time to go from a 4.5GB backup to 3.2GB (pigz vs nz)
Or one that takes double the time (11 minutes vs 5) to go from 3.3GB to 3.2GB (7z vs nz)
I understand that, from a theoretical point of view, these are important improvements.
But from the practical one... you will use lz4 or pigz (remember: scale x1000).
Or whatever else is as fast as you can get (srep+lz4? no thanks: ~pigz performance WITHOUT the complications)
You can expect it or not, but based on paq8 results, there's at least 30%
between max zstd compression and actual data entropy.
deflate and LZ4 only support windows up to 64kb, so they can't even really be compared -
it's easy to demonstrate even 100x better compression against these.
> Even using NZ, or the various paqs, the differences are modest.
You need precomp and srep passes before comparing actual compression algorithms.
Also you're wrong to expect archivers with default options to show their full potential.
Except for zstd with its "paramgrill", nobody else (afaik) bothered with actual tuning
of their level profiles.
For comparisons its also necessary to compare using the same number of threads,
same window size, and same types of preprocessing if one of the programs lacks some.
Most of the popular archivers lack some separately available features (like dedup module),
so comparing them to rare programs which have these features inside is unfair.
Or, to be precise, it's fair if you're testing to choose the best tool for some task,
but unfair if you're looking for highest potential for further development.
> Here you can see the mighty nanozip, the ubiquous 7z (all with default
> values, just a test) vs pigz -1 and lz4 vs almosty empty Windows 8.1 image
It doesn't mean much, since nz and 7z defaults are not designed for such volumes.
But they have commandline parameters which can significantly change the results in this case.
Like "-mx=9 -myx=9 -md=1536M -ms=999t -mmt=2" for 7z.
> As you can see you will never use an algorithm that takes 20 times (!) the
> time to go from a 4.5GB backup to 3.2GB (pigz vs nz)
Problem is, you didn't bother to read the nz usage text, and it has lots of parameters
and 12(!) different compression algorithms, from much faster to stronger
(plus memory usage, MT controls etc) which significantly affect the results.
> I understand that, from a theoretical point of view, these are important improvements.
We simply don't know the actual results, since these programs weren't designed
for your use case, and you can't RTFM and tweak their options.
See this post for example: https://encode.su/threads/3559?p=68397&pp=1
In that post, non-default settings made LZMA encoding 8x faster,
while still providing significantly better compression than another codec.
7-zip also has syntax for this.
> But from the practical one... you will use lz4 or pigz (remember: scale x1000).
No, you simply learn how to properly use existing tools
instead of expecting their authors to provide you a perfect solution for each use case.
No, you simply learn how to properly use existing tools
instead of expecting their authors to provide you a perfect solution for each use case.
Ahem...
With all due respect, I disagree: your hypotheses are very far from this kind of work.
Try finding the optimal parameters to compress 1TB of virtual machines for 10 different programs, each with 10 different parameters.
It will take weeks of work.
Then I will upload some MP4 into the image.
Your parameters are now wrong.
Run another weeks to find the best.
Then I will upload a big mysqldump backup.
Your parameters are now wrong again.
(...)
Everyone knows that there are 1000 parameters that significantly affect the result,
but you can only know which ones EX POST, not EX ANTE.
It is not the Hutter Prize, with a fixed file to compress, where you can tweak almost everything.
It is not "compress a fresh Windows 2008 R2 virtual disk server".
You will find a bit of everything:
Small BSD server
Huge LINUX server
Mid-size Windows Server
With "anything" inside (ntfs, already-compressed zfs block, ext4 blocks, btrfs)
Problem is, you didn't bother to read nz usage text
Of course I've tried them all years ago.
But for small files.
If you want to try them for 1TB machines, I can supply as many as you want.
But beware: the content varies from day to day.
Often with already compressed images and videos.
We simply don't know the actual results, since these programs weren't designed
for your use case, and you can't RTFM and tweak their options.
"MY" use case doesn't exist.
There are so many different VMs with so many different containers.
It would be so easy if there were ONLY disk images of a certain type.
When the difference in speed is so big (seconds vs minutes),
it doesn't make much sense to tweak just to save every little byte.
The backups, sooner or later, will be deleted to make room for new ones.
They are disposable in the medium term.
===
However, I'm curious to see if much better algorithms will be developed (in terms of space saving for the same time) for VM backup
PS yes, EXEs compress little, very little.
Reductions on the order of tens of percent are not worth the effort.
It is not at all easy to work out where and how the executables are laid out.
Maybe you are confusing file access (like zip, nz or whatever you want)
with access to the SECTORS (or clusters/blocks, whatever the filesystem uses)
that make up the virtual disks.
Where, in case of both internal and external fragmentation,
you will NOT have a continuous stream of bytes representing the EXE, or JPG ....
Maybe you'll have some EXE chunks, then some MP4, then some HTML,
then some internal filesystem structures, and so on.
Preprocessors, pre-analyses etc. must take this into account.
Precomp and similar tools simply fail.
Not "doesn't work, but they could, if you were smarter"
Instead "can't"
PAQ-like heuristic analysis methods for recognizing individual chunks do not work
at all with advanced filesystems where there is no sequential data stream.
It's more like a shuffled deck of cards.
Then you can do your own analysis, but the sun is about to rise
and the server needs to resume full operation.
fcorbelli, a few notes:
- you talk about the architecture of an entire backup engine, while Eugene works as part of a team and his narrow task is the final compression algorithm
- when you talk about "inherently parallel" algorithms, you need to investigate why they are parallel. To start, pigz and zpaq aren't algorithms but programs - especially zpaq, which is an entire backup engine employing many algorithms together
PIGZ just runs multiple independent compression jobs simultaneously. Any LZ77-class compressor with a dictionary of X bytes can be multi-threaded this way WITHOUT DECREASING CR, just by parsing X extra bytes prior to the compression of every chunk. PIGZ employs DEFLATE compression, which has a dictionary of only 32 KB, so it can employ many CPU cores using less than 1 MB/core.
This approach is applicable to any LZ77-class algorithm (zstd, lzma) and probably ROLZ ones too. The drawback is single-threaded decompression. There are also other ideas that help to parallelize LZ compressors, but they are usually applicable to any LZ algorithm, so they can be added later. As Eugene said, he looks first into improving compression ratio; speed optimization can be done later.
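To illustrate the "X extra bytes per chunk" trick, here is a minimal Python sketch using zlib's preset-dictionary (zdict) support: chunks are compressed in parallel, each primed with the 32 KB that precede it, so cross-chunk matches are not lost, while decompression stays sequential as noted above. The chunk size is an arbitrary assumption, and unlike pigz this produces independently framed zlib streams rather than one concatenated deflate stream, so it only shows the principle.

# Sketch of pigz-style parallelism: independent chunks, each primed with the
# preceding 32 KB as a preset dictionary so cross-chunk matches survive.
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 << 20        # 8 MiB per job (assumption)
DICT_SIZE = 32 << 10   # deflate window size

def _compress_chunk(args):
    data, zdict = args
    c = zlib.compressobj(level=6, zdict=zdict) if zdict else zlib.compressobj(level=6)
    return c.compress(data) + c.flush()

def parallel_deflate(data: bytes, workers: int = 8):
    jobs = [(data[off:off + CHUNK], data[max(0, off - DICT_SIZE):off])
            for off in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_compress_chunk, jobs))   # independent zlib streams

def sequential_inflate(streams):
    # Decompression is sequential: each chunk needs the previous output as its zdict.
    out = b''
    for s in streams:
        zdict = out[-DICT_SIZE:]
        d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
        out += d.decompress(s) + d.flush()
    return out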
Similarly, my own fa'next (which outperforms pigz by miles) works by applying deduplication and then splitting data into blocks, each compressed independently with zstd or lzma. AFAIR, zpaq does the same but employs its own LZ/BWT compression algorithms, which are inferior to zstd.
----------------------
Now, do you agree that splitting data into 100 GB blocks is enough? You shouldn't make blocks larger, because they can be completely lost and because you may need to extract smaller parts of the entire dedup set. At 20 MB/s such a block will be processed in 1.5 hours. That should be OK for the backup stage, although you may need to restore faster. So the emphasis should be on decompression speed, and if you are looking for m/t algorithms, you should look specifically into M/T DECOMPRESSION ability.
fcorbelli, a few notes:
- you talk about architecture of the entire backup engine, while Eugene works as part of team and his narrow work is to make the final compression algorithm
That's true.
In fact, for me it doesn't matter at all whether you choose dedup+compress, precomp+dedup+compress, dedup+precomp+compress, compress-only or whatever (I am just running some real-VM paq examples right now, to be clear. Still running...)
- when you talk about "inherently parallel" algorithms...
It should be noted that server CPUs normally have modest clock speeds, for thermal reasons.
While it is easy to have 8-16 cores, and sometimes even 1-4 CPU sockets, modern servers are not number crunchers.
PIGZ just runs multiple independent compression jobs simultaneously (...) PIGZ employs DEFLATE compression that has dictionary of only 32 KB, that allows to employ many CPU cores using less than 1 MB/core.
It's an archaic technology, but it works.
In the average case, it works great (many cores).
It has problems (very little parallelism) in decompression, and in fact that is one of the reasons why it cannot be used in every case.
This approach is applicable to any LZ77-class algorithms (zstd, lzma) and probably ROLZ ones too. The drawback is single-threaded decompression.
The author (of pigz) explains it very well in his notes.
It is not a small problem.
...As Eugene said, he looks first into improving compression ratio; speed optimization can be done later.
I am a little skeptical about the concrete possibility of having a "smart" algorithm that operates on large amounts of data. Sure I might be surprised.
zpaq does the same but employs its own LZ/BWT compression algorithms inferior to zstd.
I dream of a zpaq with a much faster compressor.
Maybe I will make the patch.
Now, do you agree that splitting data into 100 GB blocks is enough? You shouldn't make blocks larger, because they can be completely lost and because you may need to extract smaller parts of the entire dedup set. At 20 MB/s such a block will be processed in 1.5 hours. That should be OK for the backup stage, although you may need to restore faster. So the emphasis should be on decompression speed, and if you are looking for m/t algorithms, you should look specifically into M/T DECOMPRESSION ability.
In my previous posts I have already explained that decompression speed is almost as critical as compression speed.
This is why, in my opinion, even a "stupid" algorithm, as long as it has few or no interprocess locks, would be desirable.
But it is very hard, because...
5) decompression which does NOT seek (if possible)
and
...
And the decompression speed, which must be decent.
Because when you have a system hang, and you need to do a rollback, and your Bank's account is frozen, you can't wait 12 hours for unzipping.
Just to add some real-world data: a very "smart" analyzer (PAQ8PX) vs three real (small) images.
Special compressors (for example for EXE, with substitutions etc.) are not so "smart" with a filesystem-based virtual container.
EDIT: after-dinner results for a macOS virtual machine
A good dinner
> Try to find the optimal parameters to compress 1TB of virtual machine
> for 10 different programs with 10 different parameters.
> It will take weeks of work.
Ideally yes, if you want to optimize the compressed file to the byte precision,
then it has to be done.
You may be surprised, but it is really done sometimes,
eg. for game repacks, since you can spend a week on this optimization
once, and then 1000s of people would download and use your releases for years.
> Then I will upload some MP4 into the image. Your parameters are now wrong.
Ideally yes, at the very least you'd have to add some MP4 recompressor,
which would be disabled by default, because archiver would normally
compress MP4s as standalone files and always-on inline detection
would slow down processing in more common cases.
> Then I will upload a big mysqldump backup. Your parameters are now wrong again.
Yes, ideally you'd have to add a mysqldump preprocessor, since it's a popular format
and it's possible to convert the actual data contained in the dump to a more compact form,
which could be compressed better and much faster.
> Anyone knows that there are 1000 parameters that affect significantly,
> but you can only know this EX POST, not EX ANTE.
Problem is, there're also more common parameters, which significantly
affect the archiver's performance (both CR and speed), and would be common
for the whole VM-image use case.
Archivers are mostly relatively old (they were very important for all PC users
some 15-30 years ago, but not anymore, as compression is now integrated in all
popular office formats and filesystems), and are targeted at office use -
copying or sending a bunch of related files in a single container.
VM image compression at terabyte scale is a completely different use case,
so it definitely requires some non-default settings for these archivers.
That's ok and any compression-related solution can be usually improved forever.
But I cannot accept your opinion that LZ4 is better than LZMA(7z) for your use case,
because you failed to do a fair comparison.
> "MY" use case doesn't exist.
But it does.
The number of files to compress, the average volume of data,
data types that likely would be there, fragmentation
(FS types would be only relevant if you have parsers for them;
do you know that 7z does have parsers for some FS types,
so you can extract eg. NTFS image with 7z? And then compress it
with much better CR).
> When the difference in speed is so big (seconds vs minutes),
> it doesn't make much sense to tweak in spite of saving every little byte.
Sure, it won't actually make sense to backup VM images with paq8.
Actually, from hundreds of codecs at LTCB, maybe 5-10 would be applicable.
But even if your choice is limited to LZ4, it still does have some detailed
parameters, which could improve its performance in your case.
Also branches with improved compression: https://github.com/inikep/lizard
Of course, there's no problem if you found a perfect solution for yourself.
Just don't tell other people to stop experimenting,
especially if your opinion is very subjective and lazy.
> However, I'm curious to see if much better algorithms will be developed
> (in terms of space saving for the same time) for VM backup
With recompression and "diff-based dedup" there's still a lot of potential atm.
Btw, the Fungible thing linked above seems to support jpeg recompression in hardware.
> It is not at all easy to understand where and how the executables are.
Actually, at least x86/x64 code is easy enough to detect with an E8 filter.
And the CR difference between deflate and disasm/delta/lzma can be easily
2x or more - https://www.maximumcompression.com/data/exe.php
LZ4 is of course even worse.
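For reference, such a filter is tiny: it rewrites the 4-byte relative displacement that follows E8/E9 (x86 CALL/JMP rel32) opcodes into an absolute target, so repeated calls to the same function become identical byte patterns that an LZ matcher can find. A minimal reversible sketch (simplified; real exe filters add range checks, block limits and better detection heuristics):

# Minimal reversible E8/E9 (x86 CALL/JMP rel32) filter sketch.
# Simplified: real filters add range checks and detection heuristics.
import struct

def e8_filter(buf: bytearray, encode: bool = True) -> bytearray:
    i, n = 0, len(buf)
    while i + 5 <= n:
        if buf[i] in (0xE8, 0xE9):
            rel = struct.unpack_from('<i', buf, i + 1)[0]
            val = (rel + i) & 0xFFFFFFFF if encode else (rel - i) & 0xFFFFFFFF
            struct.pack_into('<I', buf, i + 1, val)   # relative <-> absolute
            i += 5                                    # skip the rewritten displacement
        else:
            i += 1
    return buf

# e8_filter(bytearray(code), encode=True) before compression,
# e8_filter(filtered, encode=False) restores the original bytes.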
> Maybe you are confusing a file access (like zip, nz or whatever you want)
> with access to the SECTORS (/cluster/block or whatever the filesystem)
> that make up the virtual disks.
No, at least NZ should be able to detect x86 code in containers.
7z has an exe handler too, but might require explicit cmdline options
to enable it for VM images.
> Where, in case of both internal and external fragmentation,
> you will NOT have a continuous stream of bytes representing the EXE, or JPG ....
Sure, it limits the maximum recompression potential,
and most open-source recompressors would have problems
with fragmented data.
But there's still some non-zero potential in this case too.
We'd just have to write specialized solutions for recompression
of fragments of popular formats, rather than valid whole files
as it is now in most cases.
Also there may be some worse but universal solutions: https://encode.su/threads/2742-Compr...ll=1#post52493
> Not "doesn't work, but they could, if you were smarter"
> Instead "can't"
No, in most cases partial recompression would be still possible,
we just don't have implementations yet.
(Well, mp3zip can handle any chunks of mp3 data, even muxed with video,
jojpeg can compress partial jpegs, many formats would have small enough deflate streams,
all video and audio formats consist of independently parsable small frames, etc).
It's simply complex work, so developers try to save time
by skipping detection and handling of broken data...
but it's far from impossible.
> PAQ-like heuristic analysis method for recognizing individual chunks does not work
Sure, but detection code in paq is actually of very low quality,
since in the end it's a hobby project without many practical uses.
> Then you can do your own analysis, but the sun is about to rise
> and the server needs to resumes full operation
If you're regularly doing backups of the same images,
it should be actually helpful to run defrag there sometimes.
And instead of running full analysis from scratch every time,
we could in theory save image structure descriptions and only
update differences during subsequent backups.
7 years ago I benchmarked FreeArc, Nanozip, 7-zip and RAR5 on a 16.3 GB folder of programs installed on my computer. Attached is the file with my results. Tests were done on an i7-4770 (Haswell, 3.9GHz, 4 cores + HT), which is about 2x slower than your CPU.
The fastest NZ setting (-cf -t8) got 750 MB/s compression speed (50x faster than the default -co) with compression similar to rar -m1 -md64k (which should be better than pigz/deflate -1). On your computer it should be 1.4 GB/s, i.e. 4x faster than pigz -1 - while with a better compression ratio.
And FA'Next is even better in both CR and speed, especially the unpublished 0.12 version.
7 years ago I benchmarked FreeArc, Nanozip, 7-zip and RAR5 on a 16.3 GB folder of programs installed on my computer. Attached is the file with my results. Tests were done on an i7-4770 (Haswell, 3.9GHz, 4 cores + HT), which is about 2x slower than your CPU.
The fastest NZ setting (-cf -t8) got 750 MB/s compression speed (50x faster than the default -co) with compression similar to rar -m1 -md64k (which should be better than pigz/deflate -1). On your computer it should be 1.4 GB/s, i.e. 4x faster than pigz -1 - while with a better compression ratio.
I got ~1800 MB/s.
This is the kind of speed (~300MB/s per core) that I would like to have... in... zpaq.
The size is just about that of pigz -1.
As I said, the archaic parallel deflate is not that bad.
your cpu can run 12 threads, so add -t12 option to make it 10x faster than pigz
The simplest way to make such a fast compressor is to run zstd -1 in parallel. You can incorporate zstd into zpaq, if it allows you to disable deduplication while running multiple zstd threads in parallel.
Note that zpaq dedup is pretty slow (50-100 MB/s per thread), but with a proper implementation you can dedup at 1-2 GB/s; check for example srep's -m1 mode.
At the end of the day, all of that was implemented in my fa'next.
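A rough sketch of what "zstd -1 over independent blocks" could look like, assuming the third-party python-zstandard package purely for illustration (fa'next and a patched zpaq would of course do this in C++, and real code would be fed deduplicated blocks); independent frames also keep decompression parallel, which matches the earlier point about M/T decompression:

# Rough sketch: blocks (e.g. deduplicated ones) compressed independently with zstd -1.
# Uses the third-party `zstandard` package purely for illustration.
import zstandard as zstd
from concurrent.futures import ThreadPoolExecutor

LEVEL = 1

def _compress_block(block: bytes) -> bytes:
    # One compressor per call keeps every block a self-contained zstd frame.
    return zstd.ZstdCompressor(level=LEVEL).compress(block)

def _decompress_block(frame: bytes) -> bytes:
    return zstd.ZstdDecompressor().decompress(frame)

def compress_blocks(blocks, workers=12):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_compress_block, blocks))

def decompress_blocks(frames, workers=12):
    # Independent frames can be decompressed in parallel too.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_decompress_block, frames))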
>As said the archaic parallel deflate is not that bad
Being only 10x slower than a 10-year-old program? BTW, try -cD -t12 - it should still be faster than pigz but with much better compression. But fa'next will just shine here, since it's the only program combining fast dedup with modern fast LZ compressors.
your cpu can run 12 threads, so add -t12 option to make it 10x faster than pigz
No
The 8700K has 6 physical cores.
With hyperthreading, NZ runs a little slower.
note that zpaq dedup is pretty slow (50-100 MB/s per thread), but with proper implementation you can dedup at 1-2 GB/s. check for example srep -m1 mode
I have tried srep for years, but it is simply not reliable.
Sometimes it crashes, and that is not OK.
It does not like piping very much.
Zpaq is not very fast, but... it works.
You run it, wait, and get the job done.
And, of course, it keeps the versions.
at the end of day, all that was implemented in my fa'next
Any source to try on BSD?
On pigz: works well with streams and gzcat.
Without much RAM. Runs everywhere.
It just works. Not bad for a 30-year-old deflate on steroids.
Where, in case of both internal and external fragmentation,
you will NOT have a continuous stream of bytes representing the EXE, or JPG ....
Maybe you'll have some EXE chunks, then some MP4, then some HTML,
then some internal filesystem structures, and so on.
Preprocessors, pre-analyzes etc must take this into account.
Precomp et similia simply fail.
Not "doesn't work, but they could, if you were smarter"
Instead "can't"
Fragmentation can be undone using libguestfs, I did a quick proof of concept for "defragmented" VMDK compression (decompression would be harder, but not impossible):
The VM metadata is lost in that quick-and-dirty proof of concept, but it's not that much of a difference when comparing the "only lzma2" sizes (~0.4%). However, silesia.zip can be completely processed in content.tar because there's no more fragmentation, giving a ~12% smaller result.
With some effort, such preprocessing (and the reverse transform) should be possible to implement for VMDK and other VM image formats.
So yeah, it's still ~11% smaller and now it's reversible
Now we can have a look at the timings (AMD Ryzen 5 4600H, 6 x 3.0 GHz, 4.0 GHz boost):
guestfish (.vmdk -> content.tar) takes 9 seconds, hdiffz takes 16 s. VMDK size is 218 MB -> ~9 MB/s.
No multithreading at all so far, so this might be improvable to ~40-50 MB/s.
And combining both processes might be much faster as guestfish should know the mapping .tar -> .vmdk which would support the diff.
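For anyone wanting to reproduce the guestfish step, something along these lines should work through the libguestfs Python bindings (a minimal sketch under my own assumptions: a single-filesystem image, no handling of the VM metadata and no reverse transform):

# Minimal sketch of the .vmdk -> content.tar step via the libguestfs Python bindings.
# Assumes a single mountable filesystem; a real tool would use g.inspect_os()
# and keep the metadata needed for the reverse transform.
import guestfs

def image_to_tar(image_path: str, tar_path: str) -> None:
    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts(image_path, readonly=1)    # never touch the source image
    g.launch()
    filesystems = g.list_filesystems()          # e.g. {'/dev/sda1': 'ntfs', ...}
    device = next(dev for dev, fstype in filesystems.items() if fstype != 'unknown')
    g.mount_ro(device, '/')
    g.tar_out('/', tar_path)                    # file-ordered, defragmented content
    g.shutdown()
    g.close()

# image_to_tar('114_fresh.vmdk', 'content.tar')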
Can you please post the time?
Just to have a rough estimate
Can I post some VM to be checked?
About 5GB each
See the post right before yours, we posted simultaneously
5 GB should take ~3 minutes (.vmdk -> .tar) or ~10 minutes (.tar and diff to make it reversible). Feel free to upload somewhere, I can download and test later today.
See the post right before yours, we posted simultaneously
5 GB should take ~3 minutes (.vmdk -> .tar) or ~10 minutes (.tar and diff to make it reversible). Feel free to upload somewhere, I can download and test later today.
It depends on the complexity of the algorithms.
Those that are O(n), O(n log n), etc. scale well.
But those (especially diffs) with higher polynomial complexity are a totally different story.
For 100MB "any" algorithm is fine.
Only when you test on 1GB or 1TB do you see the real-world asymptotic behaviour.
I will make a small one:
* a fresh-installed FreeBSD
* with portsnap (lots of source code)
* with some MP4s inside (like a fileserver)
FreeBSD 11.4 "fresh" install (fresh)
With ports (for those not familiar with it, this is a large library of program sources, thousands of them, therefore essentially text)
With somefile (MP4 videos)
With some compiling of ZPAQ (to see what happens with small differences)
Obviously it is not a scientific test, different systems, different CPUs etc.
But as an order of magnitude, zpaqfranz
takes about 80s to make a 3,251,727,759-byte archive
and about 110s to create the file AND verify it.
By verification I mean checking the contents of what is stored with the files on the disk,
which are read again
A more "real world" test (in the sense of sequential backups)
c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_fresh.vmdk
c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_ports.vmdk
c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_somefile.vmdk
c:\zpaqfranz\zpaqfranz a z:\1.zpaq 114_zpaqcompile.vmdk
In this case the time is about 83s for 3,253,167,479 bytes
And about 111s with verify
Just for info, after tarring into a single file:
srep: ~36s (very high memory required for decompression: ~5GB) => 3.419GB
So there is room for speed improvements (well known fact) over zpaq.
Not so much, however, in the overall compression ratio, because compression
is much less relevant compared to deduplication (it doesn't matter at all)
(well known fact that I've written before).
Only NZ -cf (~50s) => 16GB
I don't know about the actual performance with verification.
It is difficult to do this with chained programs
It would be best to back up only settings and other unique data, not common files, if the archiver were smart enough.
I'd back up the installation disks and patches separately for long-term archiving.
Fragmentation can be undone using libguestfs...
With some effort, such preprocessing (and the reverse transform) should be possible to implement for VMDK and other VM image formats.
In fact... no, you can't.
Unpacking the virtual machine archive must yield a file identical to the original vmdk.
And not via a diff either (complexity too high => way too slow).
For simple reasons of verification and reliability.
You cannot mount the image and read its contents (also because in some cases you would not be able to; if you want, I can post some examples like the one above with Solaris).
Encrypted filesystems can also occur (they are rare, but they exist), and they are inherently incompressible.
They are, however, easily deduplicable.