
Thread: Filesystem benchmark

  1. #61 - Cyan (Member, France)
    A few ideas:

    - Previous threads are not "killed" and continue polluting the list or creating contention
    - Memory leaks (unlikely)
    - CPU heat protection (throttling), in which case a serious look at your radiator is recommended.

    In any case, my first suspect would be the MT code, and if you still have a single-threaded code branch available, I would test that one to check whether the slowdown pattern still appears.

  2. #62 - m^2 (Member, Ślůnsk, PL)
    Thanks Cyan.
    Quote Originally Posted by Cyan View Post
    A few ideas:
    - Previous threads are not "killed" and continue polluting the list or creating contention
    - Memory leaks (unlikely)
    Nope. The same happens when I run the exe with multiple tests several times in quick succession.

    Quote Originally Posted by Cyan View Post
    - CPU heat protection (throttling), in which case a serious look at your radiator is recommended.
    Hmmm... that might be the case. I'll be sure to look into it once the issue repeats.

    In any case, my first suspect would be the MT code, and if you still have a single-threaded code branch available, I would test that one to check whether the slowdown pattern still appears.
    I'm not entirely sure, but I think that I tried it and the pattern was present.
    Also, I should add that LZ4 was just an example and it happened with all codecs that I tried.

    In the meantime I implemented contention counting.

    With nop (no codec), I can see a clear correlation between performance and the number of contention hits. On most runs ~95% of locking operations are contended and performance is somewhat variable. But once in a while there is a run with much lower contention, even 3%, and performance is up to 2.5x faster. At the same time, contention is a much smaller issue with regular codecs. With the fastest codec (RLE64), which also happens to have a fairly stable speed, the contention count is below 10% in 90% of runs. The highest peak that I've seen was 70%, but even then there was no drop in performance.

    So I won't be implementing the scalable version now.
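
    For illustration, a minimal sketch of the kind of contention counter described above (hypothetical code, not fsbench's actual implementation): each worker tries the lock first and only counts a contention hit when try_lock fails.
    Code:
    #include <atomic>
    #include <cstdint>
    #include <mutex>

    static std::mutex job_mutex;
    static std::atomic<std::uint64_t> lock_ops{0};
    static std::atomic<std::uint64_t> contended_ops{0};

    // Wraps a critical section and counts how often the lock was already taken.
    template <typename F>
    void locked(F &&critical_section)
    {
        ++lock_ops;
        if (!job_mutex.try_lock())   // somebody else holds the lock: a contention hit
        {
            ++contended_ops;
            job_mutex.lock();        // then wait for it as usual
        }
        critical_section();
        job_mutex.unlock();
    }
    // contention rate = 100.0 * contended_ops / lock_ops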

    ADDED: And the performance variability that I reported a couple of days ago appears to be just some random issue, probably a background process. During development, I usually don't benchmark on a clean machine.
    Last edited by m^2; 8th April 2012 at 16:34.

  3. #63 - nimdamsk (Member, Moscow)
    Can you please add data-shrinker to the test?
    It seems to have the same interface as LZ4, so adding it should not take much time.
    And, if possible, include Win32 and Win64 builds with new releases.
    Thanks.

  4. #64 - m^2 (Member, Ślůnsk, PL)
    Thanks for the heads-up, I will gladly add the codec.
    There's a bit more work to it than that, but not a problem.
    It's written by fusiyuan2010... sounds familiar.

    I can't say when I will release the benchmark with it, though. The code is not in good enough shape ATM and there's quite a lot of work (relative to the amount of my free time) to get it fixed.

    As to binaries, I'll think about it. I stopped providing them because making them cost a bit too much time.

  5. #65 - m^2 (Member, Ślůnsk, PL)
    Despite not having finished many things that I planned for the next release, I decided to push it earlier because Shrinker turned out to be a cool thing.
    First, some numbers:
    Code:
    e:\projects\benchmark04\tst>fsbench default shrinker -i3 -s1 -b131072 -m4096 -t2
     ..\scc.tar
    memcpy               = 78 ms (2591 MB/s), 211927552->211927552
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    LZ4               r59      12  106569728 (x 1.99)   C: 185 MB/s  D:  598 MB/s
    LZO              2.05     1x1  104497152 (x 2.03)   C: 161 MB/s  D:  256 MB/s
    QuickLZ       1.5.1b6       1  100093952 (x 2.12)   C: 153 MB/s  D:  132 MB/s
    Snappy          1.0.5          108097536 (x 1.96)   C:  89 MB/s  D:  304 MB/s
    Shrinker           r4           98582528 (x 2.15)   C:  88 MB/s  D:  309 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x1 iteration(s))
    In most tests the picture is quite similar; I consider it to be the strongest of the fast codecs.
    There are issues, though: it doesn't work correctly on my 32-bit Linux. It's quite possible that it doesn't work correctly on 32-bit systems at all.

    I made another noteworthy addition: LZX. Sadly, lzx_compress doesn't come with a decompressor. I want to add one from some other library later; for now there's just a dummy, so don't be surprised by how fast it is.

    As to internal changes:

    The multithreading code is much different: now I schedule work dynamically instead of statically. The implementation is still very basic; I use mutexes and didn't try critical sections or custom solutions. That has to wait. Also, I don't have good job-size adjustment; I use a default of 256 KB, and for now it has to do.
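
    A rough sketch of the dynamic scheduling idea (hypothetical names and structure, not the benchmark's real code): threads pull fixed-size jobs from a shared cursor instead of getting one big slice each up front.
    Code:
    #include <cstddef>
    #include <mutex>

    struct JobQueue
    {
        std::mutex  mtx;
        std::size_t next  = 0;            // offset of the next unassigned byte
        std::size_t total = 0;            // total input size
        std::size_t job   = 256 * 1024;   // default job size

        // Returns false when no work is left; otherwise hands out [off, off + len).
        bool get(std::size_t &off, std::size_t &len)
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (next >= total)
                return false;
            off   = next;
            len   = (total - next < job) ? (total - next) : job;
            next += len;
            return true;
        }
    };
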
    I still haven't decided what to do with the measured overhead of my code; right now I do nothing, but later I may subtract it from the results.
    And, sadly, I have to say that I don't have much confidence in the correctness of the code; I don't feel it's well tested yet. This is the thing that makes me least comfortable about releasing the code now.

    To enable the multithreading changes, I had to modify the memory layout. The new code is, in general, safer, but less cache-friendly, with things scattered over memory instead of sitting in just a couple of big chunks. I noted a performance drop in some codecs and I suspect this might be the cause.

    I changed the way decompression speed is calculated: now I take into account only the amount of data that was actually decompressed. While I consider it a less important metric, it's less confusing.

    Changelog:
    Code:
    [+] added LZFX
    [+] added LZX (without decompressor)
    [+] added Shrinker
    [-] removed lzp1. It crashed on calgary.tar created with 7-zip 9.22. After a short analysis I decided that debugging would cost too much to be worth it.
    [~] major refactoring and other improvements
    [~] dynamic work scheduling. Up to now threads got roughly equally sized pieces. Sometimes one piece would take much longer than others thus skewing results in favor of codecs with more predictable performance.
    [~] added QuickLZ to the list of default codecs
    [~] use memcpy instead of LZ4 for warmup. It touches more memory, which is good for fairness with codecs that are weaker than LZ4.
    [~] when calculating decompression speed, take into account amount of data that was really decompressed
    [~] improved measurement accuracy
    [!] the algorithm often considered an incompressible last block of a file to be compressible
    [!] *nix compatibility fixes
    [!] fixed UCL, LZO and LZMAT crashes
    Attached: source and a MinGW Win64 build.

  6. #66 - Bulat Ziganshin (Programmer, Uzbekistan)
    deleted
    Last edited by Bulat Ziganshin; 14th April 2012 at 15:18.

  7. #67 - Cyan (Member, France)
    LZHAM still doesn't work on MinGW32, but I guess your earlier comment on the LZHAM portability issue is still valid...

  8. #68 - m^2 (Member, Ślůnsk, PL)
    Yes, I changed the version numbering scheme because of this; I don't want to release the full 0.10 until the benchmark is reasonably portable again.
    Sadly, in the meantime Rich Geldreich has committed no changes to LZHAM, and I'm not willing to fix codecs that are actively maintained; I prefer to wait. Did I mention that my time is scarce?
    I'm even considering removing LZHAM (or making it optional) if I get tired of this issue.
    Last edited by m^2; 14th April 2012 at 20:36.

  9. #69 - Bulat Ziganshin (Programmer, Uzbekistan)
    > Shrinker turned out to be a cool thing.
    > I consider it to be the strongest of fast codecs.

    It seems that LZ4 uses a 12-bit hash table here, while Shrinker's is 15-bit. I've compared their code:
    1. LZ4 is able to quickly skip incompressible data (segmentSize)
    2. Shrinker hashes two more strings at the beginning of a match (src+1 and src+3)
    3. Shrinker uses 1/2-byte encoding for distances (long_dist)

    It seems there are no more differences that could change the compression ratio, and I doubt that these can change the compression ratio much either.

  10. #70 - m^2 (Member, Ślůnsk, PL)
    Yes, it's very similar to LZ4. At first, I thought it was a fork.
    As to differences, there's also a different (and weird) hash: ((a*21788233) >> (32 - HASH_BITS)). Also, it uses 8n+3 bits to encode literal lengths, while LZ4 uses 8n+4.
    I tried reducing its hash to 12 bits and the performance gain was low; it seems to be tuned pretty well for this size. IIRC, LZ4 was both weaker and slower than Snappy back when it used a 15-bit hash...
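
    For reference, the quoted hash spelled out as code (just the formula from the post wrapped in a function; the 15-bit table size is Shrinker's, per the discussion above):
    Code:
    #include <cstdint>

    static const int HASH_BITS = 15;   // Shrinker's table size; LZ4 reportedly uses 12 here

    // Multiply by the (weird) constant, keep the top HASH_BITS bits as the table index.
    inline std::uint32_t shrinker_hash(std::uint32_t a)
    {
        return (a * 21788233u) >> (32 - HASH_BITS);
    }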

  11. #71 - Cyan (Member, France)
    Indeed, this is an interesting variation of LZ4; the author (Fu Siyuan) is quite open about it on the homepage. I guess it must be a simple experiment of his, since Fu has already demonstrated he can write much stronger, "7zip-like" compression algorithms, such as CSC.

    The most significant difference seems to be the variable-size offset scheme.
    There are also some other differences, which you have already pointed out perfectly.
    Variable-sized offsets are an important and almost obvious compression improvement, especially for small input sizes.
    I also considered them in the early days of LZ4, but they cost too much speed for my taste.
    I mean, once some fields become variable-sized, the performance drop is large enough to consider LZH instead, imho.

    Which is the reason why I turned my attention to Zhuff afterwards...
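
    To make the trade-off concrete, here is a hypothetical 1-or-2-byte distance encoding (illustrative only, not Shrinker's actual long_dist format); the extra branch on every match is exactly the speed cost mentioned above.
    Code:
    #include <cstdint>
    #include <vector>

    // Short distances fit in one byte; longer ones set the top bit and spill
    // into a second byte. Assumes dist < (1 << 15).
    void put_distance(std::vector<std::uint8_t> &out, std::uint32_t dist)
    {
        if (dist < 0x80) {
            out.push_back(static_cast<std::uint8_t>(dist));                 // 0xxxxxxx
        } else {
            out.push_back(static_cast<std::uint8_t>(0x80 | (dist >> 8)));   // 1xxxxxxx = high bits
            out.push_back(static_cast<std::uint8_t>(dist & 0xff));          // low 8 bits
        }
    }

    std::uint32_t get_distance(const std::uint8_t *&p)
    {
        std::uint32_t b = *p++;
        if (b < 0x80)
            return b;
        return ((b & 0x7f) << 8) | *p++;
    }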

  12. #72 - nimdamsk (Member, Moscow)

    Not enough memory!

    I wanted to test compression on a 3696 MiB log file on a server with Windows Server 2008 x64 and 32 GiB of RAM.
    fsbench-mingw64-4.6.2.exe from benchmark-0.10d.7z fails with a "Not enough memory!" error.
    Here are a VMMap screenshot and the analysis log.
    Why does the "Not enough memory!" error appear if an 8 TiB free block can be allocated?
    Attached: ff6.png (VMMap screenshot) and the analysis log.

  13. #73 - Bulat Ziganshin (Programmer, Uzbekistan)
    modern compression algos are sooo greedy

  14. #74 - m^2 (Member, Ślůnsk, PL)
    Quote Originally Posted by nimdamsk View Post
    I wanted to test compression on a 3696 MiB log file on a server with Windows Server 2008 x64 and 32 GiB of RAM.
    fsbench-mingw64-4.6.2.exe from benchmark-0.10d.7z fails with a "Not enough memory!" error.
    Here are a VMMap screenshot and the analysis log.
    Why does the "Not enough memory!" error appear if an 8 TiB free block can be allocated?
    This should need ~12 GB of RAM.
    I guess the system had that much free?
    Also, could you give me the exact command line used (memory usage depends on which algorithms you use), the exact file size and the exact program output (allocation failures with this message can happen in different parts of the code).

    Does anybody have experience with such errors?

    Also, malloc aside, on Windows the code may not work with files > 2 GB. I didn't expect this would be needed. I like such TODOs.
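
    For context, one generic way to get a file's true size without 32-bit truncation (a sketch of the general approach, not necessarily the fix that was applied later):
    Code:
    #include <cstdint>
    #include <fstream>
    #include <iostream>

    // The stream position type is 64-bit on mainstream implementations, so this
    // reports sizes above 2 GB correctly where a plain long/int would truncate.
    std::int64_t file_size(const char *path)
    {
        std::ifstream f(path, std::ios::binary | std::ios::ate);   // open and seek to the end
        if (!f)
            return -1;
        return static_cast<std::int64_t>(f.tellg());
    }

    int main()
    {
        std::cout << file_size("big.log") << "\n";   // "big.log" is just a placeholder name
    }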

  15. #75 - nimdamsk (Member, Moscow)
    Windows Task Manager says 25-28 GiB are free, 34 GiB available.
    The exact command is "fsbench fast %file%". %file% is 3696 MiB long, as I've said already.
    The exact output is, again, "Not enough memory!" - nothing more.
    By the way, is it possible to have the compressed output written to disk?

  16. #76 - m^2 (Member, Ślůnsk, PL)
    That is - 3875536896 bytes?

    No, there is no way to write the output to disk. Also, it wouldn't be readable by anything but my code because:
    1. I insert my own metadata that's needed to support splitting data into blocks.
    2. In some other cases I insert even more metadata to overcome a codec's limitations.
    3. My metadata was not designed to be cross-platform portable, in the sense that different systems may lay it out differently.

  17. #77 - m^2 (Member, Ślůnsk, PL)
    A new version.
    I added a slightly more verbose error message. Also, I tried to remove the problems that my code had with large files on Windows. I was unable to test it, so some things may be missing. This may help with the issue, though it probably won't.

    There are other notable changes too.

    Most importantly, I added LZSS by Encode. It's the strongest LZ77 included. Advice: don't try it on big files with a 4k block size; it was designed to be used with large chunks of data.
    Encode: because the name LZSS is not unique enough, I called your codec LZSS-IM. Feel free to name it differently.

    Also, I reduced the default block size. Up to now I wanted the default mode to compress the entire file in one go. But because quite a few codecs work on ints, and some may work on (u)int32_t's, they won't work with chunks > 2 GB. Also, the maximum compressed size is bigger than the input size; some codecs may be calculating it, and I don't trust that they handle integer overflows internally, so I decided to be much more conservative. The default is 1.6 GB now.
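
    To put numbers on the overflow worry (the bound formula here is just an LZ4-style illustration, not any particular codec's exact one): a block close to 2 GB already pushes the worst-case compressed size past INT32_MAX.
    Code:
    #include <cstdint>
    #include <iostream>

    // Illustrative worst-case compressed-size bound in the style many
    // byte-oriented LZ codecs use (not any specific codec's formula).
    std::int64_t worst_case_bound(std::int64_t size)
    {
        return size + size / 255 + 16;
    }

    int main()
    {
        const std::int64_t block = INT64_C(2147483000);    // a block just under 2 GB
        const std::int64_t bound = worst_case_bound(block);
        std::cout << bound << " > INT32_MAX is " << std::boolalpha
                  << (bound > INT32_MAX) << "\n";          // true: the same math done in a
                                                           // 32-bit int would overflow
    }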

    Code:
    [+] added lzss by Encode
    [~] display speed in KB/s if needed
    [~] real-time printing of results
    [~] reduced default block size to avoid possible troubles with codecs that don't support big ones
    [!] fixes to enable files > 2 GB on Windows 64.
    Attached Files
    Last edited by m^2; 17th April 2012 at 00:28.

  18. #78 - nimdamsk (Member, Moscow)
    Thanks.

    Not enough memory!
    Tried to allocate 15+3074457776117147357+172657685346 bytes

    OK, 2796203 TiB - there really is no such amount of memory, even virtual.

    BTW, the file size in bytes is 3,875,397,108, if it helps.

    Added:

    Tried to compress log truncated to 2 GiB - the same error.
    Tried all compressors from "fast" one-by-one:

    fsbench.exe fastlz 2gb.log
    Not enough memory!
    Tried to allocate 15+922337291560508480+172657685346 bytes

    fsbench.exe lz4 2gb.log
    Not enough memory!
    Tried to allocate 15+72340424956025619+172657685346 bytes

    fsbench.exe lzf 2gb.log
    Not enough memory!
    Tried to allocate 15+258557031290+172657685346 bytes

    fsbench.exe lzjb 2gb.log
    Not enough memory!
    Tried to allocate 15+258557031290+172657685346 bytes

    fsbench.exe lzo 2gb.log
    Not enough memory!
    Tried to allocate 15+258557031290+172657685346 bytes

    fsbench.exe quicklz 2gb.log
    Not enough memory!
    Tried to allocate 15+2305843431623727630+172657685346 bytes

    fsbench.exe rle64 2gb.log
    Not enough memory!
    Tried to allocate 15+2305843431623727630+172657685346 bytes

    fsbench.exe shrinker 2gb.log
    Not enough memory!
    Tried to allocate 15+645104088038+172657685346 bytes

    fsbench.exe snappy 2gb.log
    Not enough memory!
    Tried to allocate 15+3074457776117147357+172657685346 bytes


    Compressing 2 GiB - 512 B is OK.

    Added:

    Not as OK as it seemed: the fsbench process took ~9 GiB of RAM.
    The system had ~11 GiB free and ~25 GiB available, but was totally sluggish.
    Also, some of the results output was corrupted.

    Code:
    > fsbench.exe fast 2gb-512b.log
    memcpy               = 253 ms (8094 MB/s), 2147483136->2147483136
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    fastlz          0.1.0       1   66496166 (x32.29)   C: 686 MB/s  D: 719 MB/s
    LZ4               r59      12   41419461 (x51.85)   C:2848 MB/s  D: 624 MB/s
    LZF               3.6    very   58119586 (x36.95)   C: 898 MB/s  D:3555 MB/s
    lzjb             2010          677635401 (x 3.17)   C: 244 MB/s  D: 271 MB/s
    LZO              2.05     1x1   76451214 (x28.09)   C:2730 MB/s  D:4129 MB/s
    QuickLZ       1.5.1b6       1   42124236 (x50.98)   C: 984 MB/s  D:1914 MB/s
    RLE64           R3.00      64 2147483136 (x 1.00)   C:3150 MB/s  D:        -
    Error
    Error
    Errorker           r4           39056675 (x54.98)   C:2311 MB/s  D:2070 MB/s
    Error
    Errorker           r4           39056675 (x54.98)   C:2311 MB/s  D:2162 MB/s
    Error
    Shrinker           r4           39056675 (x54.98)   C:2311 MB/s  D:2162 MB/s
    Snappy          1.0.5          169046662 (x12.70)   C:2077 MB/s  D:2698 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x1 iteration(s))
    The same test for the first 1 GiB of the log file:

    Code:
    > fsbench.exe fast 1gb.log
    memcpy               = 269 ms (7613 MB/s), 1073741824->1073741824
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    fastlz          0.1.0       1   35491690 (x30.25)   C: 678 MB/s  D: 717 MB/s
    LZ4               r59      12   21324631 (x50.35)   C:2426 MB/s  D:4708 MB/s
    LZF               3.6    very   31086870 (x34.54)   C: 888 MB/s  D:3407 MB/s
    lzjb             2010          363594856 (x 2.95)   C: 234 MB/s  D: 456 MB/s
    LZO              2.05     1x1   40391391 (x26.58)   C:2663 MB/s  D:4047 MB/s
    QuickLZ       1.5.1b6       1   21107075 (x50.87)   C: 984 MB/s  D:1914 MB/s
    RLE64           R3.00      64 1073741824 (x 1.00)   C:3180 MB/s  D:        -
    Shrinker           r4           19953246 (x53.81)   C:2280 MB/s  D:2592 MB/s
    Snappy          1.0.5           87376026 (x12.29)   C:2009 MB/s  D:2652 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x2 iteration(s))
    The LZ4 and lzjb decompression speeds for the 2 GiB - 512 B file compared to the 1 GiB file are suspicious...
    Last edited by nimdamsk; 17th April 2012 at 11:58.

  19. #79 - m^2 (Member, Ślůnsk, PL)
    Thanks, it's a very useful post.

    So for now I see 4 issues with the code:
    - the malloc issue. It's caused by your OS's standard C library's inability to fully support large files, plus poor error handling in my code. I know enough to be able to fix it.
    - Shrinker error messages are mangled. Again, I know enough to fix it.
    - Shrinker has some errors. It shouldn't. They may be in the wrapper that I wrote or in Shrinker itself. I'll test my code better and, if that doesn't help, write some debug code to get the problem nailed down. It's interesting that the problem doesn't happen with smaller blocks.
    - the decompression speed of 2 codecs. I suspect that some background process interrupted the test. Is the issue repeatable?

    Also, wow, what a beast of a machine; what CPU (and RAM) does it have?

    And it's interesting to see such great differences in strength. Any idea what the reason might be? LZO, LZ4 and Snappy all score very closely on the 200+ MB files that I tried.

    ADDED: 1k posts.

    ADDED: 5. The 9 GB memory consumption, which makes 5 problems. I already investigated the issue. It happens only with large blocks and data not much larger than the block size. In extreme cases memory consumption can go up by as much as 80%. It's a significant but not critical issue IMO, because it happens very rarely and can be worked around manually by selecting a different block size (e.g. ceil(file_size/2)).
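
    A rough illustration of that tuning, assuming (hypothetically) that buffers are reserved per full block and that the 1.6 GB default is about 1.7e9 bytes:
    Code:
    #include <cstdint>
    #include <iostream>

    int main()
    {
        const std::int64_t file_size     = INT64_C(2147483136);   // the 2 GiB - 512 B log
        const std::int64_t default_block = INT64_C(1717986918);   // assumed ~1.6 GB default

        // With the default, the file splits into one full block plus a small tail,
        // yet (under the assumption above) buffers are still sized per full block.
        const std::int64_t blocks = (file_size + default_block - 1) / default_block;   // 2
        std::cout << "default: " << blocks << " blocks, buffers sized for "
                  << blocks * default_block << " bytes\n";   // ~3.4e9 bytes for ~2.1e9 of data

        // Manual tuning: block = ceil(file_size / 2) gives two snugly sized blocks.
        const std::int64_t tuned = (file_size + 1) / 2;
        std::cout << "tuned:   2 blocks of " << tuned << " bytes each\n";
    }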

    And I'd like to add that system sluggishness and even total unresponsiveness during the test is normal. If you want accurate results, you don't want other programs to interrupt the tested algos. For this reason I set the process priority as high as I can.
    Last edited by m^2; 17th April 2012 at 23:26.

  20. #80 - Fu Siyuan (Member, Mountain View, CA, US)
    If the size is < 32 or >= 128 MB, the function will refuse to run and return -1. The block size must be smaller than 128 MB, or the benchmark program may behave unpredictably.

    I use some higher bits in the offset hash table as a cache, so the offset is limited to 128 MB.

  21. #81 - nimdamsk (Member, Moscow)
    The issue with decompression speed is repeatable. I have no particular background process running during the test.
    As for the machine, it is a rather old 2-CPU (8 cores, 16 with HT) Xeon E5530 server with 36 GiB of 3-channel DDR3.
    The difference in compression ratio may be caused by the bad log file:
    our software went insane and wrote the same error to the log millions of times, so this file is not relevant for benchmarking.
    Here are some more results:
    Code:
    > timer fsbench.exe all -b67108864 -t16 1gb.log
    memcpy               = 211 ms (4853 MB/s), 1073741824->1073741824
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    BriefLZ         1.0.5           17926694 (x59.90)   C:1556 MB/s  D:4302 MB/s
    Doboz      2011-03-19           21295798 (x50.42)   C:  18 MB/s  D:8904 MB/s
    fastlz          0.1.0       1   35503420 (x30.24)   C:4718 MB/s  D:5818 MB/s
    LZ4               r59      12   21339729 (x50.32)   C:13473 MB/s  D:8982 MB/s
    LZ4hc             r12           14338980 (x74.88)   C: 948 MB/s  D:8982 MB/s
    LZF               3.6           26336863 (x40.77)   C:2106 MB/s  D:8982 MB/s
    LZFX              r16           35113461 (x30.58)   C:3923 MB/s  D:6320 MB/s
    lzg             1.0.6       5   38151448 (x28.14)   C: 370 MB/s  D:5120 MB/s
    LZHAM         SVN r96       4    4601108 (x233.37)   C:  32 MB/s  D:8533 MB/s
    lzjb             2010          363600470 (x 2.95)   C:1889 MB/s  D:3494 MB/s
    lzmat             1.1           16630003 (x64.57)   C:2316 MB/s  D:3084 MB/s
    LZO              2.05     1x1   40395087 (x26.58)   C:11377 MB/s  D:8904 MB/s
    LZSS-IM    2008-07-31           58020356 (x18.51)   C:  62 MB/s  D:2124 MB/s
    LZV1              0.5           31613249 (x33.96)   C:1651 MB/s  D:6606 MB/s
    LZX        2005-07-06      21    4491556 (x239.06)   C:2956 KB/s  D:1024000 MB/s
    LZX        2005-07-06      21    4491556 (x239.06)   C:2978 KB/s  D:1024000 MB/s
    LZX        2005-07-06      21    4491556 (x239.06)   C:2978 KB/s  D:1024000 MB/s
    miniz            1.11       6   10754120 (x99.84)   C:1119 MB/s  D:4970 MB/s
    nrv2b            1.03       6   11301974 (x95.00)   C: 298 MB/s  D:4452 MB/s
    nrv2d            1.03       6   11441924 (x93.84)   C: 305 MB/s  D:4675 MB/s
    nrv2e            1.03       6   11442346 (x93.84)   C: 293 MB/s  D:4612 MB/s
    QuickLZ       1.5.1b6       1   21123758 (x50.83)   C:5535 MB/s  D:8533 MB/s
    RLE64           R3.00      64 1073741824 (x 1.00)   C:5197 MB/s  D:        -
    Shrinker           r4           19960509 (x53.79)   C:12337 MB/s  D:8677 MB/s
    Snappy          1.0.5           87376085 (x12.29)   C:11377 MB/s  D:8752 MB/s
    tornado           0.5       6   10043766 (x106.91)   C: 606 MB/s  D:1145 MB/s
    Yappy              v2      10   78523884 (x13.67)   C:1359 MB/s  D:8827 MB/s
    zlib            1.2.5       6   11261466 (x95.35)   C: 591 MB/s  D:4591 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x1 iteration(s))
    
    CPU stats:
            Kernel time : [0:01:37.4850249], [97.485025 secs]
            User time   : [5:47:38.4257070], [20858.425707 secs]
            Total time  : [5:49:15.9107319], [20955.910732 secs]
    Memory stats:
            Page faults count       : [    7'955'635]
            Peak pagefile usage     : [4'294'967'295] bytes
            Peak virtual size       : [4'294'967'295] bytes
            Peak working set size   : [4'294'967'295] bytes
    I/O stats:
            Total reads : [               1] Total read    : [   1'073'741'824]
            Total writes: [               0] Total written : [               0]
            Total other : [              25] Total other   : [              76]
    Exit code: [0x0], [0]
    1. LZX is the decompression speed king! Why not take the decompression code from http://www.cabextract.org.uk/ ?
    2. LZX results are tripled for some reason.
    3. RLE64 has no decompression speed result. Maybe because it didn't compress anything?
    4. Why does fsbench need 4x the RAM?
    5. Decompression for some of the fastest algorithms is slower than compression.
    Last edited by nimdamsk; 18th April 2012 at 12:47.

  22. #82 - Bulat Ziganshin (Programmer, Uzbekistan)
    Quote Originally Posted by Fu Siyuan View Post
    I use some higher bits in the offset hash table as a cache
    It's useful for decreasing the number of cache misses, but Shrinker only uses 128 KB of memory anyway, so I wonder whether it really helps. Have you run benchmarks with and without caching?

  23. #83 - Cyan (Member, France)
    Quote Originally Posted by nimdamsk View Post
    1. LZX is the decompression speed king!
    Are you sure your RAM is fast enough to deliver 1 TB/s?

  24. #84 - Bulat Ziganshin (Programmer, Uzbekistan)
    Cyan, it returns in a microsecond because there is no LZX decompression code in there yet.

  25. #85 - Cyan (Member, France)
    Yep

  26. #86 - Fu Siyuan (Member, Mountain View, CA, US)
    Quote Originally Posted by Bulat Ziganshin View Post
    It's useful for decreasing the number of cache misses, but Shrinker only uses 128 KB of memory anyway, so I wonder whether it really helps. Have you run benchmarks with and without caching?
    The 128 KB is the size of the hash table.

    Code:
    if (unlikely(cache == (*pcur & 0x1f))
        && pfind + 0xffff >= (u8*)pcur
        && pfind < pcur
        && *(u32*)pfind == *(u32*)pcur)  // the cache check may eliminate this comparison and the access at pfind, which may range beyond the 128 KB table

    I tested and got about 7% speed improvement with this.
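
    A hypothetical illustration of that packing (not Shrinker's actual layout): the low 27 bits of a table entry hold the match position and the top 5 bits cache a few bits of the byte stored there, so most false candidates are rejected without touching pfind; capping positions at 2^27 bytes is where the 128 MB limit comes from.
    Code:
    #include <cstdint>

    // Low 27 bits: position (so positions are capped at 2^27 = 128 MB).
    // Top 5 bits:  cached bits of the byte at that position.
    inline std::uint32_t pack_entry(std::uint32_t pos, std::uint8_t byte_at_pos)
    {
        return (static_cast<std::uint32_t>(byte_at_pos & 0x1f) << 27) | (pos & 0x07ffffffu);
    }

    // Cheap filter: if the cached bits disagree with the current byte, the
    // candidate can be skipped without dereferencing it (no extra cache miss).
    inline bool quick_reject(std::uint32_t entry, std::uint8_t cur_byte)
    {
        return (entry >> 27) != static_cast<std::uint32_t>(cur_byte & 0x1f);
    }

    inline std::uint32_t entry_pos(std::uint32_t entry)
    {
        return entry & 0x07ffffffu;
    }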

  27. #87 - Bulat Ziganshin (Programmer, Uzbekistan)
    Oh, sorry, 128+64 = 192 KB; that's still smaller than the 256 KB L2 cache on Sandy Bridge. What CPU did you test on? Single-threaded or many threads? (Since an SB core can run 2 threads, m/t compression should be a better case for caching.)

  28. #88 - Member (Kraków, Poland)
    L2 access has some latency; it could play a role in the slowdown.

  29. #89 - Karhunen (Member, USA)
    I thought that LZX was too similar to LZ77, so I didn't mention it; however, http://sourceforge.net/projects/libmspack/ exists, and I believe it was modified to decompress streams as well as the regular CAB spec. Apparently, M$ is updating the spec to support delta compression.

  30. #90 - Fu Siyuan (Member, Mountain View, CA, US)
    Quote Originally Posted by Bulat Ziganshin View Post
    Oh, sorry, 128+64 = 192 KB; that's still smaller than the 256 KB L2 cache on Sandy Bridge. What CPU did you test on? Single-threaded or many threads? (Since an SB core can run 2 threads, m/t compression should be a better case for caching.)
    Generated by cachegrind:

    The CPU is some kind of Xeon; I can't tell the exact model number from /proc/cpuinfo. L1 size unknown, L2 is 2 MB.

    Shrinker with NO higher bits cache:
    ==13691== My PID = 13691, parent PID = 13539. Prog and args are:
    ==13691== ./nocache
    ==13691== c
    ==13691== enwik8
    ==13691== enwik8.shr
    ==13691==
    --13691-- warning: Pentium with 12 K micro-op instruction trace cache
    --13691-- Simulating a 16 KB cache with 32 B lines
    ==13691==
    ==13691== I refs: 1,534,622,325
    ==13691== I1 misses: 2,199
    ==13691== L2i misses: 1,272
    ==13691== I1 miss rate: 0.00%
    ==13691== L2i miss rate: 0.00%
    ==13691==
    ==13691== D refs: 493,735,652 (357,086,542 rd + 136,649,110 wr)
    ==13691== D1 misses: 55,958,445 ( 34,382,782 rd + 21,575,663 wr)
    ==13691== L2d misses: 33,164 ( 22,966 rd + 10,198 wr)
    ==13691== D1 miss rate: 11.3% ( 9.6% + 15.7% )
    ==13691== L2d miss rate: 0.0% ( 0.0% + 0.0% )
    ==13691==
    ==13691== L2 refs: 55,960,644 ( 34,384,981 rd + 21,575,663 wr)
    ==13691== L2 misses: 34,436 ( 24,238 rd + 10,198 wr)
    ==13691== L2 miss rate: 0.0% ( 0.0% + 0.0% )

    Shrinker with higher bits cache:
    ==13827== My PID = 13827, parent PID = 13539. Prog and args are:
    ==13827== ./havecache
    ==13827== c
    ==13827== enwik8
    ==13827== enwik8.shr
    ==13827==
    --13827-- warning: Pentium with 12 K micro-op instruction trace cache
    --13827-- Simulating a 16 KB cache with 32 B lines
    ==13827==
    ==13827== I refs: 1,811,334,804
    ==13827== I1 misses: 2,202
    ==13827== L2i misses: 1,272
    ==13827== I1 miss rate: 0.00%
    ==13827== L2i miss rate: 0.00%
    ==13827==
    ==13827== D refs: 543,059,246 (406,410,139 rd + 136,649,107 wr)
    ==13827== D1 misses: 51,774,786 ( 30,486,790 rd + 21,287,996 wr)
    ==13827== L2d misses: 33,163 ( 22,965 rd + 10,198 wr)
    ==13827== D1 miss rate: 9.5% ( 7.5% + 15.5% )
    ==13827== L2d miss rate: 0.0% ( 0.0% + 0.0% )
    ==13827==
    ==13827== L2 refs: 51,776,988 ( 30,488,992 rd + 21,287,996 wr)
    ==13827== L2 misses: 34,435 ( 24,237 rd + 10,198 wr)
    ==13827== L2 miss rate: 0.0% ( 0.0% + 0.0% )

    And this is LZ4 with 15bits hash table:
    ==22422== My PID = 22422, parent PID = 21872. Prog and args are:
    ==22422== ./havecache
    ==22422== c2
    ==22422== enwik8
    ==22422== enwik8.lz4
    ==22422==
    --22422-- warning: Pentium with 12 K micro-op instruction trace cache
    --22422-- Simulating a 16 KB cache with 32 B lines
    ==22422==
    ==22422== I refs: 1,202,957,962
    ==22422== I1 misses: 2,547
    ==22422== L2i misses: 1,298
    ==22422== I1 miss rate: 0.00%
    ==22422== L2i miss rate: 0.00%
    ==22422==
    ==22422== D refs: 245,381,899 (170,194,823 rd + 75,187,076 wr)
    ==22422== D1 misses: 48,658,715 ( 35,997,342 rd + 12,661,373 wr)
    ==22422== L2d misses: 38,561 ( 24,463 rd + 14,098 wr)
    ==22422== D1 miss rate: 19.8% ( 21.1% + 16.8% )
    ==22422== L2d miss rate: 0.0% ( 0.0% + 0.0% )
    ==22422==
    ==22422== L2 refs: 48,661,262 ( 35,999,889 rd + 12,661,373 wr)
    ==22422== L2 misses: 39,859 ( 25,761 rd + 14,098 wr)
    ==22422== L2 miss rate: 0.0% ( 0.0% + 0.0% )
    Last edited by Fu Siyuan; 19th April 2012 at 13:57. Reason: Fixed result of lz4
