Thanks. It would be better if you changed the names of the key files, 7z.exe and 7z.dll, so they don't get confused with the real 7z and can be put into the same directory, e.g. to 7zf.exe and 7zf.dll. And maybe 7zfm too?
Do you plan to create a separate codec DLL that can be used with any 7z host app?
I've included Fast-lzma into TurboBench Compression Benchmark
There is a conflict with the ZSTD library in "pool.c". I tried to test multithreading by compiling TurboBench without ZSTD, but fast-lzma2 crashes when you specify more than 2 threads.
Here are the results without multithreading on a Skylake i6700 at 3.4 GHz.
Benchmark with lzma-sdk v18.01
Files from the Compression Benchmark
Code:
C Size     ratio%   C MB/s   D MB/s   Name       File
48758739   23.0     2.47     81.17    lzma 9     silesia.tar
49515082   23.4     4.84     79.77    flzma2 9   silesia.tar
32823983   32.8     3.38     55.53    lzma 9     app3.tar
39558020   39.5     5.22     55.03    flzma2 9   app3.tar
7992210    25.0     4.21     66.85    lzma 9     pd3d.tar
8108746    25.4     6.49     66.90    flzma2 9   pd3d.tar
24861228   24.9     1.49     83.07    lzma 9     enwik8
26465055   26.5     4.46     78.64    flzma2 9   enwik8
13598062   6.6      2.70     260.41   lzma 9     access_log_Jul95
13612075   6.6      3.51     243.60   flzma2 9   access_log_Jul95
Conor (10th March 2018)
Thanks for the test results. I'll rename the functions from Zstd in the next release which will be very soon. Please note that level 9 is not comparable to lzma 9. Level 11 or 12 is a better comparison.
EDIT: Please update your copy with the attached source files and delete pool.*
Last edited by Conor; 10th March 2018 at 04:57.
7-Zip v18.03 FL2 v0.9.2 beta is compiled with debug settings (needs special DLLs); please compile it as release.
How about multithreading decompression with radyx mf? How about bigger dictionary, maybe 2gb or even 4gb? Btw on testset1 below dictionary size doesn't matter.
Code:
Testset1 - 3794MB                                    | C-Size       | C-time   | C-Ram   | D-speed | D-Ram
7-Zip v18.03 lzma d1024mb ultra mt2 qs               | 1419.789.767 | 00:23:00 | 10820MB | 58MB/s  | 1041MB
7-Zip v18.03 lzma2 d512mb ultra mt4 qs               | 1428.925.790 | 00:11:42 | 10736MB | 127MB/s | 2871MB
7-Zip v18.03 FL2 v0.9.2 flzma2 d1024 x12 fb64 mt4 qs | 1439.799.697 | 00:07:38 | 6254MB  | 56MB/s  | 1043MB
If someone needs debug x64 dlls:
That was weird. The debug binary for the GUI somehow ended up in the output folder for the make script. I've replaced it with the correct file in the release.
load (10th March 2018)
Great work, Conor! As always
Thanks, it is working after your changes.
- lzma uses only one thread, even when the number of threads is set to 2 as a parameter in the call to LzmaEncode.
Therefore, we can't make a reasonable multithreaded comparison.
- fast-lzma2 is a lot faster with 2 threads. Excellent work!
- this benchmark clearly shows that multithreaded match finders are memory bound.
There is no acceleration with more than 2 threads, at least on my test systems, a Skylake i6700 and an overclocked Sandy Bridge i2600k.
This also applies to lzham and, in general, to all programs that access large parts of memory in a more or less random pattern.
- brotli compression seems to work with a larger window (>16MB), but the decompressed buffer doesn't match.
- included the peak memory usage for compression and decompression
Turbobench compression benchmark with Skylake i6700 3.4 GHz, ubuntu 17.10, gcc 7.2. File: silesia.tar
Code:
C Size        ratio%   C MB/s     D MB/s     Name           C Peak Mem      D Peak Mem
48,758,739    23.0     2.47       81.17      lzma 9         604,629,600     15,992
49,013,348    23.1     4.23       80.62      flzma2 11      675,964,240     57,208
49,021,875    23.1     7.91       80.67      flzma2 11mt2   680,049,408     57,208
49,030,944    23.1     7.87       80.69      flzma2 11mt4   688,219,312     57,208
49,034,963    23.1     7.82       80.69      flzma2 11mt8   704,560,056     57,208
50,470,286    23.8     0.47       373.42     brotli 11d24   264,516,392     18,887,456
50,861,542    24.0     1.68       269.97     lzham 4        5,406,315,184   43,104
50,861,542    24.0     2.39       271.98     lzham 4t2      5,410,490,888   43,104
50,861,542    24.0     2.35       271.97     lzham 4t4      5,411,539,608   43,104
50,861,542    24.0     2.24       271.29     lzham 4t8      5,414,686,760   43,104
211,948,036   100.0    13148.96   13543.87   memcpy         silesia.tar
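For anyone checking the numbers: the ratio% column appears to be compressed size divided by the original size (the memcpy row, 211,948,036 bytes), times 100. A minimal sketch in C, assuming that definition; `ratio_percent` is a name made up for this example and is not part of TurboBench.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a TurboBench function.
   Assumes ratio% = 100 * compressed size / original size. */
static double ratio_percent(uint64_t compressed, uint64_t original)
{
    assert(original > 0);
    return 100.0 * (double)compressed / (double)original;
}
```

With the silesia.tar figures above, `ratio_percent(48758739, 211948036)` is about 23.0 (lzma 9) and `ratio_percent(49013348, 211948036)` is about 23.1 (flzma2 11).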
Thanks for the test results. There are definitely only 2 cores available to the program. Speed with 4 threads should be at least 13 MB/s. Most of the performance improvement comes from reducing random memory access within a large block to improve cache efficiency, so it still works well at 4 or even 8 threads.
The 7Zip-zstd project has codecs, and you can examine its source code (CPP\7zip\Bundles\Codec_*). I can create a sample, but I am a Delphi developer and my sources would be useless to a C++ developer.
Conor (11th March 2018)
Not sure how the Skylake system is configured. I only have remote access.
I have now included the test on an i2600k CPU at 3.4 GHz.
You get better scaling with small files, but the benefit of using more threads diminishes as the files get larger.
Everyone can download "TurboBench" and redo the benchmarks.
Code:
C Size     ratio%   C MB/s   D MB/s   Name
48758739   23.0     2.14     76.66    lzma 9
49013348   23.1     3.69     77.63    flzma2 11
49021875   23.1     5.94     77.62    flzma2 11mt2
49030944   23.1     12.08    77.61    flzma2 11mt4
49034963   23.1     15.13    77.62    flzma2 11mt8
50861542   24.0     1.39     222.60   lzham 4
50861542   24.0     3.47     232.17   lzham 4t4
50861542   24.0     1.90     227.12   lzham 4t2
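To put the thread scaling in the i2600k table into numbers, one can divide each multithreaded C MB/s figure by the single-thread one, and then by the thread count to get a rough parallel efficiency. A small sketch in C; the function names are invented for this example, and "efficiency" here is just speedup divided by thread count.

```c
#include <assert.h>

/* Hypothetical helpers for reading the benchmark table. */

/* Speedup of a multithreaded run over the single-thread run. */
static double speedup(double mt_mb_per_s, double st_mb_per_s)
{
    assert(st_mb_per_s > 0.0);
    return mt_mb_per_s / st_mb_per_s;
}

/* Rough parallel efficiency: speedup divided by the thread count. */
static double efficiency(double mt_mb_per_s, double st_mb_per_s, int threads)
{
    assert(threads > 0);
    return speedup(mt_mb_per_s, st_mb_per_s) / (double)threads;
}
```

For flzma2 11 in the table above, `speedup(12.08, 3.69)` is roughly 3.27 with 4 threads (efficiency about 0.82), and `speedup(15.13, 3.69)` is roughly 4.10 with 8 threads (efficiency about 0.51).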
Results from turbobench, gcc 7.2.0, i5 2500 at 3.3 GHz. Your results on the 2600k suggest another process was loading the CPU a bit. That would impact mt4 results more than the others.
C Size ratio% C MB/s D MB/s Name
48761289 23.0 2.06 78.76 lzma 9
48756483 23.0 3.17 79.05 lzma 9:mt2
49001957 23.1 3.59 79.73 flzma2 11
49003096 23.1 6.46 79.66 flzma2 11:mt2
49008164 23.1 11.07 80.10 flzma2 11:mt4
I've made the test for 4 and 8 threads again.
I am indeed getting better numbers for 4/8 threads: 12.08 instead of 9.87 MB/s for 4 threads. The lzham numbers for 4 threads are also better.
On my system, the timings are not as stable as with single-thread benchmarking.
Note that lzma,9 with 2 threads works only on Windows.
Very good work!
Btw, in LzTurbo I'm using cache-efficient compact tries, but I have not implemented multithreading.
LzTurbo considers all possible matches at each position.
It is not clear which data structures you are using. Binary trees?
The basic data structure is an array of integer pairs, one pair for each byte in the data block. Each pair consists of the index of the longest previous match in the data block, and the match length. For dictionaries up to 64 MB and a maximum match length of < 64 bytes, the pair is packed into a single 32-bit value. For cache efficiency, each chain of unique 2-byte or 4-byte matches is copied into a buffer, resolved up to the maximum match length, and copied back.
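To illustrate the packing: a 64 MB dictionary needs 26 bits for the index (2^26 bytes), which leaves 6 bits for a length below 64. Here is a minimal sketch in C of one way such a pair could fit in 32 bits. The bit layout (index in the low 26 bits, length in the high 6) and all names are assumptions for illustration only; the actual layout in radix_engine.h may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only:
   low 26 bits  = index of the previous match (dictionary up to 64 MB = 2^26),
   high 6 bits  = match length (0..63). */
#define LINK_BITS   26u
#define LINK_MASK   ((UINT32_C(1) << LINK_BITS) - 1u)
#define MAX_LENGTH  63u

/* Pack a (match index, match length) pair into one 32-bit value. */
static uint32_t pack_match(uint32_t link, uint32_t length)
{
    assert(link <= LINK_MASK);
    assert(length <= MAX_LENGTH);
    return (length << LINK_BITS) | link;
}

/* Extract the match index from a packed value. */
static uint32_t match_link(uint32_t packed)
{
    return packed & LINK_MASK;
}

/* Extract the match length from a packed value. */
static uint32_t match_length(uint32_t packed)
{
    return packed >> LINK_BITS;
}
```

One such value per input byte gives a table of 4 bytes per byte of the data block, and both fields can be read back with a mask and a shift.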
The basic algorithm without buffering is in the functions RadixInitReference and RecurseListsReference in radix_engine.h
dnd (12th March 2018)
Thank you, Conor!
FLZMA2 is a substantial improvement over vanilla LZMA2
I have two questions for you if you'd be so kind as to answer:
1) Why is radyx so much faster than 7z-FLZMA2?
2) Do you plan on having a look at extraction speed? Archives generated by vanilla LZMA2 seem to extract quite a bit faster with the optimisations in 7z 18.xx.
Code:
Archiver           | File Size | Elapsed | per Sec | 7z x
-------------------+-----------+---------+---------+------
7za a -mx6 -mmt4   | 137967163 | 33.8s   | 3.89    | 5.1s
7zfl2 a -mx7 -mmt4 | 137746718 | 26.3s   | 4.99    | 5.6s
radyx a -mx7 -mmt4 | 138192220 | 17.4s   | 7.57    | 5.4s
This is a nice example of why in-memory benchmarks are the most accurate. Much time is spent reading your input from disk. Radyx uses an extra buffer to read during compression, but 7z does not, or at least not a large buffer. Radyx uses the same compression code as the DLL in 7z-FL2 so it runs at about the same speed.
7-Zip's own decoder is used for all LZMA2 decoding in 7z-FL2 (the FL2 DLL lacks the optimized decoder but the latest source has it). The speed difference occurs because 7-Zip does a dictionary reset part way through the compressed stream, so an extra decoder thread can start decoding there in parallel. FL2 never resets the dictionary so decoding is single-threaded. I will modify it to add dictionary resets. This decreases compression ratio a little but I'll see how it turns out.
choochootrain (13th March 2018)
Maybe this will be interesting for someone: I created a separate FLZMA2 plugin that can be used with baseline 7-Zip.
- 7-Zip 18.05 was released.
7-Zip for 32-bit Windows:
http://7-zip.org/a/7z1805.exe
or
http://7-zip.org/a/7z1805.msi
7-Zip for 64-bit Windows x64:
http://7-zip.org/a/7z1805-x64.exe
or
http://7-zip.org/a/7z1805-x64.msi
What's new after 7-Zip 18.01:
- The speed for single-thread LZMA/LZMA2 decoding was increased by 30% in x64 version and by 3% in x86 version.
- 7-Zip now can use multi-threading for 7z/LZMA2 decoding, if there are multiple independent data chunks in LZMA2 stream.
- 7-Zip now can use multi-threading for xz decoding, if there are multiple blocks in xz stream.
- The speed for LZMA/LZMA2 compressing was increased by 8% for fastest/fast compression levels and by 3% for normal/maximum compression levels.
- 7-Zip now shows Properties (Info) window and CRC/SHA results window as "list view" window instead of "message box" window.
- Some improvements in zip, hfs and dmg code.
- Previous versions of 7-Zip could work incorrectly in "Large memory pages" mode in Windows 10 because of some BUG with "Large Pages" in Windows 10. Now 7-Zip doesn't use "Large Pages" on Windows 10 up to revision 1709 (16299).
- The vulnerability in RAR unpacking code was fixed (CVE-2018-10115).
- Some bugs were fixed.
- New localization: Kabyle.
Conor (3rd May 2018)
Thanks, I'll update the repo soon.
load (3rd May 2018)