
Thread: Backup compression algorithm recommendations

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts

    Backup compression algorithm recommendations

    - VM image data (mostly exes and other binaries)
    - compression ratio 15% higher than zstd (lzma is ~10% better, but its encoding is too slow, or there's no gain without parsing optimization)
    - encoding speed = 50MB/s or higher (single thread)
    - decoding speed = 20MB/s or higher (single thread)

    Any ideas?
    Preprocessors can be used, but the same ones are also applicable to zstd.
    CM actually might fit by CR and enc speed (with context sorting etc), but decoding speed is a problem.
    PPM doesn't fit because its CR is bad on this data.

  2. #2
    Member ivan2k2's Avatar
    Join Date
    Nov 2012
    Location
    Russia
    Posts
    49
    Thanks
    15
    Thanked 8 Times in 5 Posts
    ZPAQ maybe? It has fast modes like -m1/2/3, or you can play with a custom compression mode (-m x.......); it takes some time to find a good one. Just check the ZPAQ thread.
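    For reference, a minimal sketch of those modes on the command line (archive/directory names and the thread count are placeholders, not from this thread):
    Code:
    zpaq a backup.zpaq vm-images/ -m1 -t4   // fastest built-in mode, 4 threads
    zpaq a backup.zpaq vm-images/ -m2 -t4   // slower, better ratio
    zpaq l backup.zpaq                      // list stored versions/contents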

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Code:
    100,000,000 mixed100.cut
     27,672,053 m1.zpaq
     24,873,818 m2.zpaq
     19,040,766 m3.zpaq
     17,735,276 mixed100.cut.zst
     15,180,571 mixed100.cut.lzma
     13,465,031 m5.zpaq
    No, zpaq is totally useless in this case since its LZ and BWT are subpar and CM is too slow.
    In any case, even -m1 -t1 encoding is already slower than 50MB/s, and -m2 is more like 5MB/s.
    -m5 compression is good, but 0.5MB/s... there're much faster CMs around.

  4. #4
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Zpaq is the only answer.
    I have used it for this same job for years, up to today.

    Compression speed is decent (150-200MB/s on a modern server).
    Deduplication is very good.
    High compression is a total waste of time for virtual disks.
    Use -m1 or even -m0 (dedup only).
    I would prefer pigz -1, but it is too hard to merge into zpaq.

    It simply... WORKS, even with very big files.
    Decompression is slow on magnetic disks, but...
    who cares?

    Better still, of course, is my zpaqfranz fork.
    It compiles on BSD, Linux and Windows,

    with a decent GUI for Windows (pakka), a la Time Machine.

  5. #5
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    When making backups of virtual disks, even worse thick-provisioned ones, it is normal for a single image to be, say, 400GB.
    If you have even just 2 or 3 of them, that's a terabyte just for a home server;
    in production it is easily 10TB per DAY.
    There is not a single x86 CPU that can compress this amount of data at a high compression ratio.
    Bandwidth dominates the problem; it is IO-bound, not CPU-bound.

    Whether the backup is 370GB or 390GB makes no difference at all.
    Whether it is 3000GB or 3600GB, even less.

    For quick backups the answer is a differential zfs send (not incremental), piped through pigz (see the sketch below).
    It requires zfs, lots of RAM and fast disks.
    It is doable: I have done it every day for years.
    But restores are painful, and extensive zfs expertise is needed.
    I make intermediate backups with zfs (hourly) and nightly zpaqfranz, plus ZIP (yes, 7z in ZIP mode).
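    A rough sketch of that zfs-send + pigz pipeline (pool/dataset names, snapshot names and paths are made-up placeholders; the fixed @base snapshot is my assumption of what "differential, not incremental" means here):
    Code:
    # one-time: full stream of the base snapshot
    zfs snapshot tank/vm@base
    zfs send tank/vm@base | pigz -1 > /backup/vm-base.zfs.gz

    # nightly: differential stream against the same base, compressed with pigz -1
    zfs snapshot tank/vm@2021-01-23
    zfs send -i tank/vm@base tank/vm@2021-01-23 | pigz -1 > /backup/vm-2021-01-23.zfs.gz

    # restore: receive the base first, then the differential
    gunzip -c /backup/vm-base.zfs.gz       | zfs receive tank/vm-restored
    gunzip -c /backup/vm-2021-01-23.zfs.gz | zfs receive tank/vm-restored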

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    This is not about using some already available tool, but more about development of one.
    zpaq doesn't fit at all in that sense, because all of its algorithms are worse than other known open-source ones.
    Yes, it's nice that zpaq is a working integrated solution, and I really appreciate that you're trying to improve it.

    But this thread is about designing a compression algorithm with given constraints.
    These constraints are a little ahead of the current state of the art, and there're multiple ways to meet them
    (making a stronger/slower LZ77 or ROLZ, speed-optimizing a fast-CM, finding a fitting BWT/postcoder setup, some LZ/CM hybrid maybe, etc),
    so I'd like to know what other developers think about this.

  7. Thanks (3):

    Mike (22nd January 2021),radames (25th January 2021),xinix (23rd January 2021)

  8. #7
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    82
    Thanks
    8
    Thanked 6 Times in 6 Posts
    If you've got several of them, I'd defragment them, perhaps make them sparse files, and then use lrzip.

  9. #8
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    This is not about using some already available tool, but more about development of one.
    zpaq doesn't fit at all in that sense, because all of its algorithms are worse than other known open-source ones.
    Yes, it's nice that zpaq is a working integrated solution, and I really appreciate that you're trying to improve it.

    But this thread is about designing a compression algorithm with given constraints.
    These constraints are a little ahead of the current state of the art, and there're multiple ways to meet them
    (making a stronger/slower LZ77 or ROLZ, speed-optimizing a fast-CM, finding a fitting BWT/postcoder setup, some LZ/CM hybrid maybe, etc),
    so I'd like to know what other developers think about this.
    As I tried to explain, the compression ratio of virtual machine disks is the last, but the very last, of the aspects considered when operating with VMs.
    Let me enumerate the necessary requirements for anyone who wants to develop their own:

    0) versioning "a-la-time-machine"
    1) deduplication. This is the most important, indeed fundamental, element to save space across versioned copies
    2) highly parallelizable compression. Server CPUs are typically low-clocked, but with many cores.
    Therefore the maximum performance obtainable by a certain algorithm on a single core is almost irrelevant.
    3) since the problem is clearly IO-bound, ideally a program should be able to process in parallel data streams arriving from different media (e.g. multiple NFS shares).
    This is not a hard requirement, though; the point is the RAM consumption of multiple processes launched in the background with &
    4) works with really large files (I've had thousands of terabytes), with low RAM use (~20-30GB, not more).
    RAM is precious on a VM server. Dedicated compression machines are expensive, delicate, and fail
    5) decompression performance is, again, IO-bound rather than CPU-bound. So a system that, for example, does NOT seek when extracting (as does, for example, ZPAQ) is excellent.
    Even if, absurdly, you compress a 500GB virtual disk by 98% down to 10GB, on extraction you still have to write 500GB, and you will pay the "writing cost" (time) of 500GB
    6) an advanced and fast copy verification mechanism. Any unverified backup is not acceptable.
    A fast verification mechanism is even more important than fast software. So, ideally, you need a check that does NOT involve extracting the (as we know, really huge) data. That is... the mechanism of ZPAQ (!).
    Keep the hashes of the decompressed blocks, so that you do NOT have to decompress the data to verify them. Clearly, use TWO different hash algorithms (... like I do ...) against hash collisions, if paranoid
    7) easy portability between Windows-Linux-*nix systems.
    No strange build paradigms, libraries, etc.
    8) append-only format, so you can use rsync or whatever.
    Otherwise you simply cannot even move the backups (unless you have days to spare)
    9) Reliability, reliability, reliability.
    No software "chains", where bugs and limitations can add up.
    =====

    Just today I'm restoring a small VirtualBox Windows server with a 400GB drive.

    Even assuming you get 100MB/s sustained (a normal value for a virtualization server under normal load), it takes over an hour just to read it.
    Obviously I didn't do that, but rather a zfs snapshot and a copy of yesterday's version (about 15 minutes).

    In the real world you make a minimum of one backup per day (in fact 6+).
    This gives you no more than 24 hours to do a backup (typically 6 hours, 23:00-05:00, plus 1 hour until 06:00 for uploading to a remote site).

    With a single small server holding 1TB (about a home server) this means
    10^12 / 86,400s = ~11.6MB/s as a lower bound.
    In fact, for a 6-hour window this is ~46MB/s per terabyte.
    This is about the performance of Zip or whatever.

    For a small SOHO with 10TB, it is ~460MB/s sustained for 6 hours. This is much more than a typical server can do.
    For a medium-size vSphere server it soon becomes challenging, needing an external cruncher (I use an AMD 3950X), plus a blazing-fast network (not cheap at all) and a lot of effort.

    To recap: the amount of data is so gargantuan that hoping to compress it with something really efficient, within a few hours, becomes unrealistic.
    Decompression is also no small problem for thick disks.

    If it takes a week to compress'n'test a set of VM images, you get one backup per week.
    Not quite ideal.

    Moving the data to a different server and then compressing it "calmly" also doesn't work.
    There is simply too much of it.

    Often compression is completely disabled (for example, leaving it to the OS with LZ4).

    This is my thirty years of experience in data storage, and twenty-five in virtual data storage, speaking.

  10. Thanks:

    Shelwien (23rd January 2021)

  11. #9
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    82
    Thanks
    8
    Thanked 6 Times in 6 Posts
    I think you should consider using 'Linux containers' or similar if possible; they should use space and other resources much more efficiently.
    Dunno about security.

  12. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Again, this is not about the archive format, and obviously there'd be dedup, MT, etc.
    The question is which class of compression algorithms to use as a base for development.
    Compression-wise, my PLZMA almost fits, but encoding with parsing optimization is too slow.
    fast-CM (like MCM, or nzcc) fits by CR and, potentially, encoding speed
    (encoding, but only encoding, can be significantly sped up with context sorting and out-of-order probability evaluation),
    but there's no solution for decoding speed.
    And BWT fits both by enc and dec speed, and even CR on text, but BWT CR on binaries is relatively bad.
    Plus, there're preprocessors and hybrid options - plenty of choices, which is the problem.
    Code:
    1,048,576 corpus_VDI_pcf_x3.1M
      249,592 corpus_VDI_pcf_x3.1M.lzma      1048576/249592 = 4.20 (276366/249592-1)*100 = 10.72%
      243,743 corpus_VDI_pcf_x3.1M.plzma_c1  1048576/243743 = 4.30 (276366/243743-1)*100 = 13.38%
      248,687 corpus_VDI_pcf_x3.1M.rz        1048576/248687 = 4.22 (276366/248687-1)*100 = 11.13%
      276,366 corpus_VDI_pcf_x3.1M.zst       1048576/276366 = 3.79
      276,403 corpus_VDI_pcf_x3.1M.lzma_a0 // lzma -a0 -d20 -fb8 -mc4 -lc0 -lp0 
    
    533,864 corpus_VDI_pcf_x3.1M.lz4-1
    369,616 corpus_VDI_pcf_x3.1M.lz4-1.c7lit_c2
    443,586 corpus_VDI_pcf_x3.1M.lz4-12
    355,800 corpus_VDI_pcf_x3.1M.lz4-12.c7lit_c2
    707,961 corpus_VDI_pcf_x3.1M.LZP-DS
    236,180 corpus_VDI_pcf_x3.1M.LZP-DS.c7lit_c2
    391,962 corpus_VDI_pcf_x3.1M.lzpre
    306,616 corpus_VDI_pcf_x3.1M.lzpre.c7lit_c2

  13. #11
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    Again, this is not about the archive format, and obviously there'd be dedup, MT, etc.
    The question is which class of compression algorithms to use as a base for development...
    For me the answer is easy.
    The one that scales best on multicore.
    Just like pigz.
    Single-thread performance is useless.
    On the implementation side: the one that can make extensive use of SSE hardware instructions.
    Compression ratio is irrelevant.
    Only speed (and limited RAM usage).
    In two words: a deduplicated pigz (aka deflate).
    Or lz4, for decompression speed (not so relevant).
    In fact that is what I use (storing the deduplicated archive on zfs).

  14. #12
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    65
    Thanks
    31
    Thanked 14 Times in 11 Posts
    FWIW

    I use Zstandard for multi-Gbyte heavy backups, with commandline: -12 --long=30
    I'm quite content with it.
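    Spelled out as full commands (file names are placeholders): note that an archive made with --long=30 uses a 1GB window, above zstd's default decoder limit, so decompression needs the same flag (or a raised --memory):
    Code:
    zstd -12 --long=30 vm-image.raw -o vm-image.raw.zst
    zstd -d --long=30 vm-image.raw.zst -o vm-image.restored.raw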

  15. #13
    Member
    Join Date
    Jan 2021
    Location
    ?
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    NTFS image:
    38.7 GiB - ntfs-ptcl-img (raw)

    simple dedupe:
    25.7 GiB - ntfs-ptcl-img.srep (m3f) -- 22 minutes

    single-step:
    14.0 GiB - ntfs-ptcl-img.zpaq (method 3) -- 31 minutes
    13.0 GiB - ntfs-ptcl-img.zpaq (method 4) -- 69 minutes

    Chained:
    12.7 GiB - ntfs-ptcl-img.srep.zst (-19) -- hours
    11.9 GiB - ntfs-ptcl-img.srep.7z (ultra) -- 21 minutes
    11.8 GiB - ntfs-ptcl-img.srep.zpaq (method 4) -- 60 minutes

    2700X, 32 GB RAM
    times are for the respective step, not cumulative

    I think there is no magic archiver for VM images yet, just good old SREP+LZMA2.
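    For the record, the chain above boils down to two steps, roughly like this (the m3f switch is quoted from the table above; the exact srep invocation syntax is from memory and may differ between srep versions):
    Code:
    srep -m3f ntfs-ptcl-img ntfs-ptcl-img.srep           # dedupe pass
    7z a -mx=9 ntfs-ptcl-img.srep.7z ntfs-ptcl-img.srep  # LZMA2 "ultra" pass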

  16. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    1) There're probably better options for zstd, like lower level (-9?) and --long=31, or explicit settings of wlog/hlog via --zstd=wlog=31

    2) LZMA2 certainly isn't the best option, there're at least RZ and nanozip.

    3) zstd doesn't have integrated exe preprocessing, while zpaq and 7z do - I'd suggest testing zstd
    with output of "7z a -mf=bcj2 -mm=copy"

  17. #15
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    1) There're probably better options for zstd, like lower level (-9?) and --long=31, or explicit settings of wlog/hlog via --zstd=wlog=31

    2) LZMA2 certainly isn't the best option, there're at least RZ and nanozip.

    3) zstd doesn't have integrated exe preprocessing, while zpaq and 7z do - I'd suggest testing zstd
    with output of "7z a -mf=bcj2 -mm=copy"
    Ahem... No.
    Srep is not reliable enough to use.
    Nanozip source is not available.

    And those are NOT versioned backups.
    VMs are simply too big to keep a separate file per version.
    If you have a 500GB thick disk that becomes a 300GB compressed file today, what will you do tomorrow?
    Another 300GB?

    For a month-long backup retention policy,
    where do you put 300GB x 30 = 9TB for just a single VM?
    How long will it take to transfer 300GB over the LAN?
    How long will it take to verify 300GB over the LAN?

    Almost anything is good for a 100MB file.
    Or 30GB.
    But for the 10TB of a typical vSphere server?

  18. #16
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Now change the image a little, just like a real VM does.
    Redo the backup.
    How much space is needed to retain today's and yesterday's backups?

    How much time and RAM is needed to verify those backups?

    Today and yesterday.

    PS My shitty Korean smartphone does not like English at all

  19. #17
    Member
    Join Date
    Jan 2021
    Location
    ?
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    This is not about using some already available tool, but more about development of one.
    This is where I agree with Eugene.

    Quote Originally Posted by fcorbelli View Post
    And those are NOT versioned backups.
    How long will it take to transfer 300GB over the LAN?
    How long will it take to verify 300GB over the LAN?
    This is where I agree with Franco.

    My tests were about which tool does what best.
    Zpaq cannot deduplicate as well as SREP.
    LZMA2 is fast enough, with a good ratio, to compress an SREP file.

    Integrated exe/dll preprocessing is important as well, like Eugene said.

    Will we need precomp? What streams do we find on a Windows OS VM?
    ZIP/CAB (non-deflate ones like LZX/Quantum)? Are the latter covered?

    What about ELF/.so and *NIX operating systems? Those are important for VMs and servers as well.

    What are the priorities, and in what order?
    Multithreading >>> Deduplication/Journaling >> Recompress popular streams > Ratio (controlled by switch)

    Which entropy coder really makes a difference?

    Franco made an excellent point about transfer speeds (ratio will matter for your HDD storage space and a 10Gbit/s net).
    Your network and disk speeds are almost as important as your total threads and RAM.

    I am just here because I am interested in what you might code.
    Eugene, just please don't make it too difficult or too perfect, or it may never be finished.

  20. #18
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    82
    Thanks
    8
    Thanked 6 Times in 6 Posts
    IIRC, VirtualBox has an option to keep the base virtual disk and save diffs in a separate file.
    Dunno about others.

  21. #19
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by radames View Post
    My tests were about which tool does what best.
    Zpaq cannot deduplicate as well as SREP.
    It doesn't matter.
    You need something that deduplicates BUT that also allows you to do many other things, such as verifying files without decompressing.
    Whether you have to store 300GB or 330GB really makes no difference (huge from the point of view of software performance, irrelevant in use).
    LZMA2 is fast enough, with a good ratio, to compress an SREP file.
    It doesn't matter.
    Integrated exe/dll preprocessing is important as well, like Eugene said.
    It doesn't matter, at all.
    Will we need precomp? What streams do we find on a Windows OS VM?
    It doesn't matter.
    You will find anything: images, executables, database files, videos, Word and Excel documents, ZIP, RAR, 7z.

    What are the priorities, and in what order?
    Multithreading >>> Deduplication/Journaling >> Recompress popular streams > Ratio (controlled by switch)
    As in my previous post:
    0) versioning "a-la-time-machine"
    1) deduplication.
    2) highly parallelizable compression.
    3) RAM consumption
    4) works with really large files
    5) decompression which does NOT seek (if possible)
    6) an advanced and fast copy verification mechanism, WITHOUT decompression if possible
    7) easy portability between Windows-Linux-*nix systems
    8) append-only format
    9) Reliability, reliability, reliability. No software "chains", where bugs and limitations can add up.

    This is, in fact, a patched zpaqfranz with a fast LZ4 -1 compressor/decompressor.

    Or zpaqfranz running on a zfs datastorage system, with embedded lz4.


    On the development side:

    a block-chunked format, with hashes of both the compressed AND the uncompressed data, AND an uncompressed CRC-32.

  22. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    @fcorbelli:
    What you say is correct, but completely unrelated to the topic here.

    Current compression algorithms are not perfect yet.
    For example, zstd design and development focuses on speed, while lzma provides 10% better compression
    and could be made much faster if somebody redesigned and optimized it to work on modern CPUs.
    Then, CM algorithms are slow, but can provide 20-30% better compression than zstd, and
    CM encoding can be actually much faster than LZ with parsing optimization - maybe even the requested 50MB/s.
    But decoding would currently still be around 4MB/s or so.
    So in some discussion I mentioned that such algorithms with "reverse asymmetry" can be useful in backup,
    because decoding is relatively rare there. And after a while I got feedback from actual backup developers
    with the codec performance constraints that they're willing to accept.
    Problem is, it would be very hard to push CM to reach 20MB/s decoding, because it's mostly determined by L3 latency.
    But it may be still possible, and there're other potential ways to the same goal - basically with all usual classes of compression algorithms.
    So I want to know which way would be easiest.

  23. #21
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by Shelwien View Post
    @fcorbelli:
    So in some discussion I mentioned that such algorithms with "reverse asymmetry" can be useful in backup,
    because decoding is relatively rare there.
    In fact, no.
    You will always decode.
    Every single time.
    Because you need to verify (or check),
    unless you use, as I wrote, a "block checksumming" approach.
    In that case you do not need to extract the data at all (to verify).

    And you can never use a 4 or even 20MB/s algorithm.
    I attach an example of a "real world" virtual machine disk backup:

    about 151TB stored in 437MB.

    To "handle" this kind of file you need the fastest algorithm, not the one with the most compression.
    That's what I'm trying to explain: the faster and lighter, the better.

    So LZ4 or pigz, AFAIK.
    Attached Files
    • File Type: txt 1.txt (126.5 KB, 23 views)

  24. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    So you're just saying that it's not acceptable for you, and that's ok.
    But it doesn't provide an answer to my question - which compression algorithm fits the constraints.

    Also,
    1) Decoding is not strictly necessary for verification.
    Instead we can
    a) Prove that our compression algorithm is always reversible
    b) Add ECC to recover from hardware errors (at that scale it would be necessary in any case, even with verification by decoding)

    2) LZ4 and deflate(pigz) are far from the best compression algorithms in any case.
    Even if you like these specific algorithms/formats, there're still known ways to improve their compression without speed loss.
    And then there're different algorithms that may be viable on different hardware, in the future, etc.
    You can't just say that perfect compression algorithms already exist and there's no need to think about further improvements.

  25. Thanks:

    radames (26th January 2021)

  26. #23
    Member
    Join Date
    Jan 2021
    Location
    ?
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by fcorbelli View Post
    It doesn't matter.
    ...
    It doesn't matter.
    ...
    It doesn't matter, at all.
    ...
    It doesn't matter.
    Not sure why you're being so negative on a site where people share thoughts, but whatever.
    Your use case != my use case, short and polite.


    Quote Originally Posted by Shelwien View Post
    which compression algorithms fits the constraints.
    ZPAQ:
    -method 4: LZ77+CM, BWT or CM

    https://github.com/moinakg/pcompress#pcompress
    supports multiple algorithms like LZMA, Bzip2, PPMD, etc., with SKEIN/SHA checksums for data integrity

    Let the user decide, or use detection.
    A single magic compression doesn't exist, and my idea with stream detection was okay.

  27. #24
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    517
    Thanks
    25
    Thanked 45 Times in 37 Posts
    Quote Originally Posted by radames View Post
    Not sure why you're being so negative on a site where people share thoughts, but whatever.
    Your use case != my use case, short and polite.
    It's just that the use case is neither "mine" nor "yours".

    A virtual disk is typically hundreds of gigabytes.
    A typical virtual backup can be multiple terabytes.
    Do you agree on this use case?
    Because if your virtual disk is 100MB, then you are right:
    your use case != mine.


    Short version:
    everything you want to use must account for the cardinality, space and time needed.
    It doesn't matter if you use X, Y or Z.
    Use whatever you want that can handle at least 1TB in half a night (4 hours).
    Preprocessing, deduplication, etc., whatever.
    The real limit is time, not efficiency (space).
    This is not the Hutter Prize.
    This is not a 1GB challenge.
    This is a minimum-1TB challenge.
    I hope this is clear.
    Anyway, I will not post anymore.

  28. #25
    Member
    Join Date
    Jan 2021
    Location
    ?
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by fcorbelli View Post
    Anyway I will not post anymore
    Nah, no worries, just post as you like. I'll let my account stay unused now.
    I just see why the community archiver Fairytale stopped being developed.
    Bye everyone!

  29. #26
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    372
    Thanks
    133
    Thanked 57 Times in 40 Posts
    Have you looked at libzling? It had an intriguing balance of ratio and speed as of 2015 or so, and it seemed like there was still headroom for improvement. The ratios weren't as good as LZMA, but it might be possible to get there.

    You mentioned the possibility of optimizing LZMA/2, which is what I was thinking. Have you seen Igor's changes to 7-Zip 21.00?

    I wonder if SIMD might have asymmetric implications on algorithm choices. AVX2 should be the new baseline going forward, and some algorithms might be differentially impacted.

  30. #27
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    174
    Thanks
    20
    Thanked 62 Times in 31 Posts
    Quote Originally Posted by SolidComp View Post
    Have you looked at libzling? It had an intriguing balance of ratio and speed as of 2015 or so, and it seemed like there was still headroom for improvement. The ratios weren't as good as LZMA, but it might be possible to get there.

    You mentioned the possibility of optimizing LZMA/2, which is what I was thinking. Have you seen Igor's changes to 7-Zip 21.00?

    I wonder if SIMD might have asymmetric implications on algorithm choices. AVX2 should be the new baseline going forward, and some algorithms might be differentially impacted.
    I rewrote libzling in Rust years ago (https://github.com/richox/orz), and libzling is no longer maintained.
    Now the compression ratio is almost the same as lzma for text data, but ~10x faster.

  31. Thanks:

    Shelwien (26th January 2021)

  32. #28
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    176
    Thanks
    60
    Thanked 16 Times in 12 Posts
    Quote Originally Posted by RichSelian View Post
    I rewrote libzling in Rust years ago...
    It seems that in your C++ version of libzling you used my idea for finding the match length.

  33. #29
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Unfortunately orz doesn't seem very helpful:
    Code:
    73102219 11.797s 2.250s // orz.exe encode -l0 corpus_VDI 1
    72816656 12.062s 2.234s // orz.exe encode -l1 corpus_VDI 1
    72738286 12.422s 2.234s // orz.exe encode -l2 corpus_VDI 1
    
    53531928 87.406s 2.547s // lzma.exe e corpus_VDI 1 -d28 -fb273 -lc4
    59917669 27.125s 2.703s // lzma.exe e corpus_VDI 1 -a0 -d28 -fb16 -mc4 -lc0
    65114536 15.344s 2.860s // lzma.exe e corpus_VDI 1 -a0 -d24 -fb8 -mc1 -lc0
    65114536 11.532s 2.875s // lzma.exe e corpus_VDI 1 -a0 -d24 -fb8 -mc1 -lc0 -mfhc4

  34. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,137
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    @fcorbelli:
    > Use whatever you want that can handle at least 1TB in half a night (4 hours).

    2**40/(4*60*60) = 76,354,974 bytes/s
    It's not actually that fast, especially taking MT into account.
    I posted the requirements for the single thread of the algorithm, but of course the complete tool would be MT.


