
Thread: 10 GB benchmark

  1. #1
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts

    10 GB benchmark

    I am starting a new archiver benchmark. So far there are only a few tests, because they can take a long time.
    There will be more to come.
    http://mattmahoney.net/dc/10gb.html

    The reason for the benchmark is to have a more realistic test for backup software. Real disks have a lot of already-compressed data and lots of duplicate or near-duplicate files. And the backups are big.

  2. Thanks:

    samsat1024 (31st July 2013)

  3. #2
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Matt, what zpaq version did you use to create 10gb.zpaq?

    It parses incorrectly in the winzpaq versions because the attribute values are missing. When I pack the /10gb/ folder myself with winzpaq and zpaq 6.40, parsing works fine (size 3,701,607,317 bytes at level 2, around 180 seconds to pack).

    Compare the listing of your 10gb.zpaq with the listing of my build of 10gb.zpaq:

    Your 10gb.zpaq list:
    zpaq v6.40 journaling archiver, compiled Jul 28 2013
    Reading archive 10gb.zpaq

    Ver Date Time (UT) Attr Size Ratio File
    ----- ---------- -------- ------ ------------ ------ ----
    > 1 2013-07-27 20:10:18 0 1.0000 10gb/
    > 1 2013-07-28 00:02:23 0 1.0000 10gb/2011/
    > 1 2013-07-28 00:02:23 0 1.0000 10gb/2011/www.mattmahoney.net/
    > 1 2011-12-21 19:17:06 0 1.0000 10gb/2011/www.mattmahoney.net/dc/
    > 1 2011-06-13 01:59:42 234 0.1476 10gb/2011/www.mattmahoney.net/dc/11ninke.html
    > 1 2010-03-05 21:47:34 19836 1.0002 10gb/2011/www.mattmahoney.net/dc/220px-JPEG_ZigZag.png
    > 1 2010-02-26 14:58:20 54800 1.0002 10gb/2011/www.mattmahoney.net/dc/250px-Dctjpeg.png
    > 1 2010-03-02 02:38:51 18442 1.0002 10gb/2011/www.mattmahoney.net/dc/250px-Suffix_tree_BANANA.png
    > 1 2010-02-26 14:59:51 17762 1.0002 10gb/2011/www.mattmahoney.net/dc/400px-Ntsc_channel.png
    etc.

    My build 10gb.zpaq list:
    zpaq v6.40 journaling archiver, compiled Jul 28 2013
    Reading archive 10gb.zpaq

    Ver Date Time (UT) Attr Size Ratio File
    ----- ---------- -------- ------ ------------ ------ ----
    > 1 2013-07-27 20:10:18 D..... 0 1.0000 10gb/
    > 1 2013-07-28 00:02:23 D..... 0 1.0000 10gb/2011/
    > 1 2013-07-28 00:02:23 D..... 0 1.0000 10gb/2011/www.mattmahoney.net/
    > 1 2011-12-21 19:17:06 D..... 0 1.0000 10gb/2011/www.mattmahoney.net/dc/
    > 1 2011-06-13 01:59:42 .A.... 234 0.1476 10gb/2011/www.mattmahoney.net/dc/11ninke.html
    > 1 2010-03-05 21:47:34 .A.... 19836 1.0002 10gb/2011/www.mattmahoney.net/dc/220px-JPEG_ZigZag.png
    > 1 2010-02-26 14:58:20 .A.... 54800 1.0002 10gb/2011/www.mattmahoney.net/dc/250px-Dctjpeg.png
    > 1 2010-03-02 02:38:51 .A.... 18442 1.0002 10gb/2011/www.mattmahoney.net/dc/250px-Suffix_tree_BANANA.png
    > 1 2010-02-26 14:59:51 .A.... 17762 1.0002 10gb/2011/www.mattmahoney.net/dc/400px-Ntsc_channel.png
    etc.

  4. #3
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I created the archive with zpaq v6.40 -method 2 -noattributes (new option in 6.40). Attributes are not saved because some of the files had system, hidden, read-only, etc. attributes set, which can cause problems for some archivers that would not add such files. (I was hoping this wouldn't break the GUI. Oops). Older zpaq versions should extract the files with default attributes.
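
    For anyone who wants to see which files are affected, here is a minimal sketch (assuming Python 3.5+ on Windows and that the test set sits in a local 10gb directory) that lists files with the hidden, system, or read-only attribute set:

import os, stat

# Attribute flags that commonly make archivers skip or refuse files (Windows only).
SUSPECT = (stat.FILE_ATTRIBUTE_HIDDEN |
           stat.FILE_ATTRIBUTE_SYSTEM |
           stat.FILE_ATTRIBUTE_READONLY)

for root, dirs, files in os.walk("10gb"):         # assumed location of the test set
    for name in files:
        path = os.path.join(root, name)
        attrs = os.stat(path).st_file_attributes  # Windows-only stat field
        if attrs & SUSPECT:
            print(hex(attrs), path)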

    Also, I added a bunch of new results.

  5. #4
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    So far I broke zip, exdupe, freearc, and zcm. zcm 0.88 extraction of .wav files was not bit-identical. I did not report the results, but the compressed size was 5,100,768,860 bytes, with compression taking 8093 seconds and decompression 6951 seconds on system 2 (Vista), using options -m7 -t1 -s (1.6 GB memory, 1 thread, solid) for best compression.

    On system 1 (Xeon E5-2620, 24 hyperthreads, 64 GB, Fedora), zip gives an error after about 20 minutes that it can't open one of the input files (enwik9.pmd: no such file or directory). I tried twice with the same result. No other program has trouble reading it.

    The freearc generic Linux build can't load libcurl.so.4 at startup, although unarc starts OK. I tried "yum install curl" and "yum install curl-devel" with no effect.

    The exdupe 64-bit Linux build complains "kernel too old" and gives a segmentation fault. Compiling from source succeeded, but then I just get a segmentation fault with no message.

    Top pending results are currently zpaq -method 610 -threads 4 (using 41 GB memory) and nanozip -cc -m16g (using 10 GB in 1 thread), but it will take several hours to decompress and check the results. Fortunately the test machine can handle both at the same time with 19 cores to spare.
    Unfortunately the machine does not have wine, so I can't test Windows programs on it.

  6. #5
    Member
    Join Date
    Mar 2009
    Location
    Prague, CZ
    Posts
    62
    Thanks
    32
    Thanked 7 Times in 7 Posts
    Avast complains about bbb.exe. Reported as a false alarm.

  7. #6
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I added some more test results. I am testing on a Linux laptop (4 cores) with and without an external USB drive. With tar, exdupe, and zpaq -m 2, it is faster to compress from local disk to external disk and decompress back to local disk than to do everything on the local disk, so I ran all the rest of the tests that way. I was surprised that zpaq compresses both better and faster than exdupe at default compression levels. It might be because less data is sent over the USB cable. Decompression speed is definitely I/O bound and the same for both. Disk speed seems to be 250 MB/sec locally and 125 MB/sec on the USB drive. On the 24 core machine (system 1), disk speed is 1 GB/sec.

    I compared zpaq and exdupe deduplication without compression and they are nearly the same. zpaq -m 0 gives 8452 MB. exdupe -x0 gives 8465 MB.
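
    As a rough illustration of what a dedup-only size measures, here is a minimal sketch that hashes fixed-size 64 KB chunks and counts each distinct chunk once (the 10gb path and the chunk size are assumptions, and real dedup tools generally use content-defined fragment boundaries, so this will not reproduce either program's numbers):

import hashlib, os

CHUNK = 64 * 1024          # fixed-size chunks; a deliberate simplification
seen = set()
total = unique = 0

for root, dirs, files in os.walk("10gb"):        # assumed location of the test set
    for name in files:
        with open(os.path.join(root, name), "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                total += len(block)
                digest = hashlib.sha1(block).digest()
                if digest not in seen:           # count each distinct chunk once
                    seen.add(digest)
                    unique += len(block)

print("total MB:", total // 10**6, "after dedup MB:", unique // 10**6)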

    The Linux versions of exdupe and nanozip don't restore file dates properly. exdupe is off by 5 hours. (I am in Eastern time zone, UT-0500, or UT-0400 in the summer). Some nanozip times are off by years. That doesn't disqualify them from the results because the contents are still identical. Nanozip does not restore empty directories but I included it anyway.

    zpaq -m 612 (4 GB blocks) compresses worse than -m 610: 2,771,860,839 bytes in 22335 seconds using 2 threads and up to 62 GB resident memory (75 GB virtual). I have not posted it yet because I have not yet verified decompression. That might not be for a while, because in the middle of decompression my company decided to change their VPN server, which will require that I re-image the company-issued laptop that I use for the Linux tests.

  8. #7
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Quote Originally Posted by mhajicek View Post
    Avast complains about bbb.exe. Reported as a false alarm.
    I expect that some virus detectors will complain about lpaq*.zip, containing lpaq*.exe (under 10gb/www.mattmahoney.net/dc/ and 10gb/2011/www.mattmahoney.net/dc/). These are false alarms too. They were packed with upack, which gives better compression than upx but unfortunately can trigger alarms because other people have used upack to compress viruses. It was a bit annoying when Yahoo took down my website for 2 days for hosting malware until I called them about transferring my domain.

  9. #8
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I split up the results by system so it is easier to compare run times and show the Pareto frontier. I'm not sure this is the best way to display the data. It would be easier to compare compression ratios with a single table. http://mattmahoney.net/dc/10gb.html
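
    For reference, the Pareto frontier here is just the set of results that no other result on the same system beats on both compressed size and run time. A minimal sketch (the tuples are placeholders, not actual benchmark numbers):

# (name, compressed size in bytes, compression time in seconds) -- placeholder values
results = [("prog-a", 3_500_000_000, 900),
           ("prog-b", 2_800_000_000, 9000),
           ("prog-c", 3_600_000_000, 1200)]   # dominated by prog-a

def pareto(rows):
    # keep a row unless some other row is no worse on both axes
    # and strictly better on at least one
    return [r for r in rows
            if not any(o[1] <= r[1] and o[2] <= r[2] and (o[1] < r[1] or o[2] < r[2])
                       for o in rows)]

for name, size, secs in sorted(pareto(results), key=lambda r: r[1]):
    print(name, size, secs)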

  10. #9
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I split up the results by system so it is easier to compare run times and show the Pareto frontier.
    I just added NanoZip to my old 10gb folder benchmark results.

  11. #10
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Two new updates. One is a new set of tests that was submitted to me (system 8), duplicating most of the tests under Linux that I ran on system 4. He said that he got 32-bit freearc to run by installing all of the needed packages, but in both of his tests decompression either hung or gave CRC errors. It also looks like there was a problem with zpaq -method 1 -fragile decompression. I asked about it, but haven't gotten an answer yet. In the other tests, pcompress does quite well, but since it is a single-file compressor, it loses to zpaq when you add the time to tar and untar.

    I also did a brief test comparing Linux and 64 bit Windows 7 on the same laptop (systems 4 and 6). In Windows, zip and zpaq are much slower, especially for decompression. I suspect the problem is the McAfee antivirus scanning each file as it is created. The good news is it did not find any viruses in the benchmark data.

    Edit: The zpaq -fragile problem turned out to be a test script error. It is fixed now. Also, I added tests of the 32- and 64-bit Windows versions under Ubuntu/wine to compare with 64-bit Windows 7. It turns out the native Linux (64-bit) version is the fastest. Compression and decompression times are:

    Win7-64 zpaq64.exe: 796, 1159
    Win7-64 zpaq.exe: 996, 1159
    wine zpaq64.exe: 377, 255
    wine zpaq.exe: 407, 272
    zpaq: 355, 215
    Last edited by Matt Mahoney; 13th September 2013 at 05:53.

  12. #11
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I created two new data sets of size 100 MB and 1 GB from the 10 GB benchmark. They are described and can be downloaded from http://mattmahoney.net/dc/10gb.html (scroll to the bottom).

    The purpose is not to create a new benchmark, but to make testing more convenient. I created the data sets by randomly sampling files from 10GB until the desired size was reached. Getting the size exact is an example of the subset-sum problem, which is NP-complete for n = 79431. However, it turned out not to be so hard. At each step, the randomly selected file must be between 0.1% and 10% of the remaining space, until the remaining space is less than 1000 bytes, at which point the restriction is removed. The restriction preserves plenty of small files to be selected at the end, but it does skew the distribution so that the average file size is larger and there are fewer files. The upper bound is there to prevent one very large file from dominating the test set. However, it blocked most of the human genome data (31% of 10GB) and enwik9 (10%) from getting into the smaller data sets, since these files are very large. Also, because of the sampling, there is much less duplication for a dedup algorithm to exploit.
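
    A minimal sketch of that sampling procedure (the file list, the target size, and what to do if no file fits are assumptions; the 0.1%-10% window and the 1000-byte cutoff come from the description above):

import random

def sample_files(files, target):
    # files: list of (path, size) pairs; returns files summing to exactly target
    chosen, remaining, pool = [], target, files[:]
    while remaining > 0:
        if remaining >= 1000:
            # only accept files between 0.1% and 10% of the remaining space
            candidates = [f for f in pool
                          if 0.001 * remaining <= f[1] <= 0.10 * remaining]
        else:
            # near the end, drop the restriction so small files can finish the job
            candidates = [f for f in pool if f[1] <= remaining]
        if not candidates:
            break                      # no exact fit possible from what is left
        pick = random.choice(candidates)
        pool.remove(pick)
        chosen.append(pick)
        remaining -= pick[1]
    return chosen, remaining           # remaining == 0 means the size came out exact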

    The goal is to have a small test set that predicts performance on larger sets. I ran some tests with several different archivers and various options, but as you can see from my results, there does not seem to be much correlation. I don't know of any other benchmark that has this property either, so I consider this an unsolved problem. Compression of 10GB depends much more on available memory and deduplication than compression of the smaller data sets does.
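
    One way to quantify "not much correlation" is a rank correlation between the orderings the two data sets produce for the same archivers. A minimal sketch using Spearman's rho (the sizes are placeholders, not actual benchmark numbers):

def spearman(xs, ys):
    # Spearman's rho: 1 - 6*sum(d^2) / (n*(n^2 - 1)), assuming no ties
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# placeholder compressed sizes (MB) for the same archivers on the two data sets
small_set = [300, 280, 350, 310]       # 1 GB data set
large_set = [2900, 3100, 3300, 2700]   # 10 GB data set
print(spearman(small_set, large_set))  # near +1 would mean the small set is predictive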

  13. #12
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Tested obnam (system 4). http://mattmahoney.net/dc/10gb.html
    obnam: http://liw.fi/obnam/
    pothos pointed it out to me in a PM. (He can't post for some reason).

    obnam is an incremental backup utility for Linux that uses deduplication and stores multiple versions. By default it does not compress, but you can use the option --compress-with=deflate to get compression slightly worse than zpaq -method 1, and slower because it is single-threaded. It created a backup repository with 72265 files in 105 directories. The total size of all the files is 3,958,233,075 bytes, but the size reported by du is 4,004,577 1K blocks, or 4,100,686,848 bytes.
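
    The gap between the byte total and the du figure is most likely just filesystem block allocation (du counts whole allocated blocks rather than file sizes). A minimal sketch of how to get both numbers for a repository tree (the repo path is an assumption; directories themselves are skipped, so this slightly undercounts what du reports):

import os

total_bytes = allocated = 0
for root, dirs, files in os.walk("repo"):          # assumed repository path
    for name in files:
        st = os.stat(os.path.join(root, name))
        total_bytes += st.st_size                  # what adding up file sizes gives
        allocated += st.st_blocks * 512            # st_blocks is in 512-byte units

print("sum of file sizes:", total_bytes)
print("allocated on disk (what du counts):", allocated)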

    obnam has a lot of features, like backing up via sftp and managing versions (called generations), for example keeping hourly backups for 3 days and daily backups for a month.
