As I tried to explain, the compression ratio of virtual machine disks is the last, the very last, of the aspects to consider when operating with VMs.
Let me enumerate the necessary requirements for anyone who wants to develop their own:
0) versioning "a-la-time-machine"
1) deduplication. This is the most important, indeed fundamental, element for saving space across versioned copies (see the first sketch after this list)
2) highly parallelizable compression. Server CPUs are typically low-clocked, but with many cores.
Therefore the maximum performance obtainable by a certain algorithm on a single core is almost irrelevant.
3) since the problem is clearly IO-bound, ideally a program should be able to process in parallel data streams arriving from different media (e.g. multiple NFS shares).
But it is not a real requirement; the point is the RAM consumption of multiple processes launched in the background with &
4) works with really large files (I have handled thousands of terabytes), with low RAM usage (~20-30GB, not more).
RAM is precious on a VM server. Dedicated compression machines are expensive, delicate, and prone to failure.
5) decompression performance is, again, IO-bound rather than CPU-bound. So a system that, like ZPAQ for example, does NOT seek when extracting is excellent
Even if you compress, absurdly, a 500GB virtual disk by 98% down to 10GB, you still have to write 500GB during extraction, and you will pay the "writing cost" (time) of 500GB
6) an advanced and fast copy-verification mechanism. Any unverified backup is not eligible.
A fast check mechanism is even more important than fast software. So, ideally, you need a check that does NOT involve massive data extraction (which we know is really huge). That is... the mechanism of ZPAQ (!).
Keep the hashes of the decompressed blocks, so that you do NOT have to decompress the data to verify them. Clearly using TWO different algorithms (... like I do ...) against hash collisions, if paranoid (see the second sketch after this list)
7) easy portability between Windows, Linux and other *nix systems.
No exotic compilation paradigms, libraries, etc.
8) append-only format, so rsync (or whatever) can be used.
Otherwise you simply cannot even move the backups (if you do not have days to spare)
9) Reliability, reliability, reliability.
No software "chains", where bugs and limitations can add up.
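To make point 1 concrete, here is a minimal sketch of block-level deduplication: a toy illustration of the idea, not the format of any particular tool, and the file names are hypothetical. Every block is hashed, and only blocks never seen before are actually stored, so a second, almost-unchanged version of the same VM disk costs almost nothing.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1 MiB fixed blocks; real tools often use content-defined chunking

def dedup_store(path, store):
    """Read a (possibly huge) file block by block, keeping only unseen blocks.
    `store` maps block hash -> block data; here it is just an in-memory dict."""
    recipe = []  # ordered list of block hashes, enough to rebuild the file
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:      # new data: store it once
                store[digest] = block    # (a real tool would compress it here)
            recipe.append(digest)        # a duplicate block costs only one hash entry
    return recipe

# Two versioned copies of the same VM disk share almost all of their blocks.
store = {}
monday = dedup_store("vm-disk-monday.vmdk", store)
tuesday = dedup_store("vm-disk-tuesday.vmdk", store)
print(len(monday) + len(tuesday), "blocks referenced,", len(store), "blocks stored")
```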
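And for point 6, a sketch of one way to check a backup without extracting anything, assuming the archive index keeps the hashes of the uncompressed blocks (two different algorithms, if you are paranoid): verification re-hashes the data and compares against the index, so nothing is decompressed to disk. Again this is only my illustration of the idea, not ZPAQ's actual format, and the file name is hypothetical.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024

def block_hashes(block):
    # Two independent algorithms as a paranoid guard against collisions.
    return (hashlib.sha256(block).hexdigest(),
            hashlib.blake2b(block).hexdigest())

def index_file(path):
    """At backup time: record both hashes of every uncompressed block."""
    index = []
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            index.append(block_hashes(block))
    return index

def verify(path, index):
    """At check time: re-hash the data and compare with the stored index.
    No extraction, no 500GB written to disk just to run a compare."""
    with open(path, "rb") as f:
        for expected in index:
            block = f.read(BLOCK_SIZE)
            if not block or block_hashes(block) != expected:
                return False
        return not f.read(1)  # fail if the file has grown since the index was built

# Usage: index at backup time, verify later.
idx = index_file("vm-disk.vmdk")
print("OK" if verify("vm-disk.vmdk", idx) else "MISMATCH")
```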
=====
Just today I am restoring a small VirtualBox Windows server with a 400GB drive.
Even assuming you get a sustained rate of 100MB/s (a normal value for a virtualization server under normal load), it takes over an hour just to read it.
Obviously I didn't do that; I took a zfs snapshot and copied yesterday's version instead (about 15 minutes).
In the real world you make a minimum of one backup per day (in fact 6+).
This gives you no more than 24 hours to do a backup (typically a 6-hour window, 23:00-05:00, plus 1 hour until 06:00 for uploading to the remote site).
With a single small server with 1TB (just about a home server) this means
10^12 / 86,400 ≈ 11.6MB/s as a lower bound.
In fact, with only a 6-hour window, this is ~46MB/s per terabyte.
This is roughly the throughput of Zip or similar.
For a small SOHO with 10TB it is ~460MB/s for 6 hours. This is much more than a typical server can do.
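Spelled out, with the same data sizes and backup windows as above (a trivial calculation, but it shows how fast the required rate grows):

```python
TB = 10**12  # decimal terabyte, in bytes

def required_mb_per_s(data_bytes, hours):
    """Sustained throughput needed to process `data_bytes` within `hours`."""
    return data_bytes / (hours * 3600) / 1e6

print(required_mb_per_s(1 * TB, 24))   # ~11.6 MB/s : 1TB home server, whole day
print(required_mb_per_s(1 * TB, 6))    # ~46 MB/s   : 1TB, 6-hour backup window
print(required_mb_per_s(10 * TB, 6))   # ~463 MB/s  : 10TB SOHO, 6-hour window
```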
For a medium-size vSphere server it soon becomes challenging, needing an external "cruncher" (I use an AMD 3950X), plus a blazing-fast network (not cheap at all) and a lot of effort.
To recap: the amount of data is so gargantuan that hoping to compress it with something really efficient, within a window of a few hours, becomes unrealistic.
Decompression is also no small problem for thick (fully provisioned) disks.
If it takes a week to compress'n'test a set of VM images, you get one backup per week.
Not quite ideal.
Moving the data to a different server and then compressing it "calmly" there doesn't work either.
There is simply too much of it.
Often compression is completely disabled (for example, leaving it to the OS with LZ4).
This is my thirty years of experience in data storage speaking, and twenty-five in virtualized data storage.