Results 91 to 120 of 249

Thread: Filesystem benchmark

  1. #91
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    Hi Fu

    Nice to hear from you again
    I'm not yet accustomed to Valgrind/Cachegrind data, but in any case it seems a very useful tool.

    What's striking in the statistics you provided is the large number of L2 cache misses for LZ4.
    It looks way out of proportion compared to Shrinker (2.4M vs 0.034M).
    So that's an area that seems worth investigating.

    Even when doing the math, it doesn't fit:
    a 15-bit hash table means an HT of 128 KB. This is true even in 64-bit mode.
    And since the match window is 64 KB, 192 KB are necessary, which should fit even into the 256 KB L2 cache of a modern iCore, let alone the 2 MB of the test system.

    So, how come there are so many L2 cache misses reported?
    That looks interesting...

  2. #92
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Hi Cyan. I was lazy and used an old compiled version of LZ4, with unknown settings.
    Here is the test result with the newly compiled version (hash bits = 15). I tested twice; the results are identical.

    ==22422== My PID = 22422, parent PID = 21872. Prog and args are:
    ==22422== ./havecache
    ==22422== c2
    ==22422== enwik8
    ==22422== enwik8.lz4
    ==22422==
    --22422-- warning: Pentium with 12 K micro-op instruction trace cache
    --22422-- Simulating a 16 KB cache with 32 B lines
    ==22422==
    ==22422== I refs: 1,202,957,962
    ==22422== I1 misses: 2,547
    ==22422== L2i misses: 1,298
    ==22422== I1 miss rate: 0.00%
    ==22422== L2i miss rate: 0.00%
    ==22422==
    ==22422== D refs: 245,381,899 (170,194,823 rd + 75,187,076 wr)
    ==22422== D1 misses: 48,658,715 ( 35,997,342 rd + 12,661,373 wr)
    ==22422== L2d misses: 38,561 ( 24,463 rd + 14,098 wr)
    ==22422== D1 miss rate: 19.8% ( 21.1% + 16.8% )
    ==22422== L2d miss rate: 0.0% ( 0.0% + 0.0% )
    ==22422==
    ==22422== L2 refs: 48,661,262 ( 35,999,889 rd + 12,661,373 wr)
    ==22422== L2 misses: 39,859 ( 25,761 rd + 14,098 wr)
    ==22422== L2 miss rate: 0.0% ( 0.0% + 0.0% )

  3. #93
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    Thanks Fu. The new results look more in line with expectations.
    Best Regards

  4. #94
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Shrinker with NO higher bits cache:
    ==13691== I refs: 1,534,622,325
    ==13691== D refs: 493,735,652 (357,086,542 rd + 136,649,110 wr)
    ==13691== D1 misses: 55,958,445 ( 34,382,782 rd + 21,575,663 wr)

    Shrinker with higher bits cache:
    ==13827== I refs: 1,811,334,804
    ==13827== D refs: 543,059,246 (406,410,139 rd + 136,649,107 wr)
    ==13827== D1 misses: 51,774,786 ( 30,486,790 rd + 21,287,996 wr)

    LZ4 (no cache)
    ==22422== I refs: 1,202,957,962
    ==22422== D refs: 245,381,899 (170,194,823 rd + 75,187,076 wr)
    ==22422== D1 misses: 48,658,715 ( 35,997,342 rd + 12,661,373 wr)

    It's interesting, but doesn't give us the actual speed. BTW, the L1 cache of many generations of Intel CPUs is 32+32 KB, AMD's are 64+64 KB. But the latest Intel CPUs have a fast 256 KB L2 cache, and the last-level cache is an 8 MB L3. While the L3 cache is pretty slow, the L2 should be fast enough, probably making bits caching disadvantageous. Well, except for compression with a HT, where you need 192*2 KB on each core.

  5. #95
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,505
    Thanks
    26
    Thanked 136 Times in 104 Posts
    I don't know which Xeon Fu used but there's a sample latency listing: http://users.atw.hu/instlatx64/Genui..._MemLatX86.txt
    In "*** Random Read Latency ***" sections there is an indication that:
    - a random read from L1 costs 4 clocks,
    - a random read from L2 costs about 30 clocks when reading randomly from a block of about 256 KiB.

    If you understand those listings better, please comment.

    There's also something like a summary at the end:
    Code:
    L1D cacheline size: 64 bytes      [Mode:6 Size:    32K, Stride:   512, #NOP:  0, UnRoll:64]
    L2  cacheline size:128 bytes      [Mode:6 Size:  4096K, Stride:   512, #NOP:  0, UnRoll:64]
    Mem latency: 147.407ns( 471 clks) [Mode:1 Size: 16384K, Stride:  1024, #NOP: 24, UnRoll:64]
    L1D latency:   1.242ns(   4 clks) [Mode:1 Size:    16K, Stride:    64, #NOP:  0, UnRoll:64]
    L2  latency:   8.678ns(  28 clks) [Mode:1 Size:  2048K, Stride:   128, #NOP:  6, UnRoll:64]
    Also, writes may be asynchronous. At least when it comes to writing to RAM, there are asynchronous writes (in standard-TDP processors, at least).

    Sandy Bridge has a super fast cache: http://users.atw.hu/instlatx64/Genui..._MemLatX86.txt but Bulldozer doesn't: http://users.atw.hu/instlatx64/Authe..._MemLatX86.txt so at least in the case of Bulldozer such an optimization could actually give a gain.
    Last edited by Piotr Tarsa; 19th April 2012 at 17:38.

  6. #96
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    I don't know which Xeon Fu used but there's a sample latency listing:
    Xeon is their line of server CPUs, so saying "Xeon" is the same as saying "Intel CPU" for a discussion of CPU caches.

  7. #97
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,505
    Thanks
    26
    Thanked 136 Times in 104 Posts
    Fu said:
    The CPU is a kind of Xeon. I can't tell the exact model number from /proc/cpuinfo. L1 size unknown, L2 is 2 MB.
    So I assumed he really tests his program on some old Xeon, as it's probably not based on Core 2 (those have much bigger L2 caches) or Core i7 (those have much smaller L2 caches).

    Saying only "intel cpu" would carry less information, as there are some low-end Core 2 processors with 2 MiB L2 cache.

  8. #98
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    I mean that your info about some Xeon's delays and his info about some other Xeon don't have anything in common.

  9. #99
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Another bug to scratch: I know the cause of the Shrinker errors, and it's in my code. I haven't fixed it yet; it's not trivial, but it's certainly fixable.

    Quote Originally Posted by Fu Siyuan View Post
    If size < 32 or size >= 128 MB, the function will refuse to run and return -1. The block size must be smaller than 128 MB, or the benchmark program may behave unpredictably.

    I use some higher bits of the offset hash table as a cache. So the offset is limited to 128 MB.
    I wrote a wrapper that lets me use it with any block larger than 31 B.

    Quote Originally Posted by nimdamsk View Post
    The issue with decompression speed is repeatable. I have no particular background process during the test.
    I need to think about it. ATM I have no idea what's going on.

    Quote Originally Posted by nimdamsk View Post
    As for machine, it is rather old 2 CPU (8 cores, 16 with HT) E5530 Xeon server with 36 GiB of 3-channel DDR3.
    The difference in compression ratio may be caused by a bad log file.
    Our software went insane and wrote the same error to the log millions of times, so this file is not relevant for benchmarking.
    Mhm

    Quote Originally Posted by nimdamsk View Post
    1. LZX is the decompression speed king! Why no taking decompression code from http://www.cabextract.org.uk/ ?
    Adding an LZX decompressor is a fairly small thing to do. But I want to do something better: a mechanism that lets you mix and match compressors and decompressors as long as they are mutually compatible. This is a much larger task and I haven't even gotten to designing it yet. At the same time, if I picked some LZX decompressor and added it, my motivation to do it right would decrease, so for now I prefer to keep it broken.

    Quote Originally Posted by nimdamsk View Post
    2. LZX results are tripled for some reason.
    At the time I wrote the code, I didn't expect speeds greater than 9999 MB/s. When I saw your 8 GB/s with 1 thread, I knew it needed fixing, but I haven't done it yet. A similar thing: compression ratios of 100:1 and greater misalign the output. But I don't think it's worth fixing.


    Quote Originally Posted by nimdamsk View Post
    3. RLE64 has no decompression speed result. Maybe because it didn't compress anything?
    Yep.

    Quote Originally Posted by nimdamsk View Post
    4. Why does fsbench need 4x RAM?
    The listed memory stats are a mystery. It uses 1/3 of what it should!
    That's because LZSS-IM needs 600 MB/thread, plus 1 GB for input, 1.2 GB for compressed data and 1 GB for decompressed.


    Quote Originally Posted by nimdamsk View Post
    5. Decompression for some fastest algorithms is slower than compression.
    Dunno. I would guess that it's caused by how weird your file is. Maybe it's because of the huge discrepancy between the amount of data read and written? Anybody?
    Quote Originally Posted by Piotr Tarsa View Post
    Fu said:

    So I assumed he really tests his program on some old Xeon, as probably that isn't Core 2 (those have much bigger L2 caches) or Core i7 (those have much smaller L2 caches) based.

    Saying only "intel cpu" would carry less information, as there are some low-end Core 2 processors with 2 MiB L2 cache.
    Intel has many CPUs that are not listed officially. Recently my colleague said that when he was working remotely on some computer in our branch office, he noticed that the CPU had 16 cores. When he asked what it was, he was told 'some old xeon'. He tried CPUID and it didn't work. Sadly, he didn't write down what he could. He was poor at hardware, but he thought it was something Atom-based.

  10. #100
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    16 cores + Atom architecture: it may be some sort of Larrabee, which has Pentium-like cores, as the Atom does.

  11. #101
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I also thought so. But Larrabee is not a standalone CPU. So for me it's just some unknown prototype. Now I kind of regret that I didn't ask for a chance to play with it; maybe I could have learned a bit about it. But I was occupied with less fun but more important things at that point.

  12. #102
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,505
    Thanks
    26
    Thanked 136 Times in 104 Posts
    Whatever it is, the CPU Fu tested certainly behaves like it has slow L2 access. And Sandy Bridge doesn't have a big installed user base (yet?). So L2 access latency is still a problem in general. That's what I was trying to say.

  13. #103
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Nehalem (i.e. all those i3/i5/i7 CPUs since Nov 2008) also has 256 KB/core L2 and a large L3 cache.

  14. #104
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Quote Originally Posted by Fu Siyuan View Post
    if (unlikely(cache == (*pcur & 0x1f)))
    I tested and got about 7% speed improvement with this.
    Do you mean 7% fewer L2 accesses? I can't believe a 7% overall speedup.

    And I don't understand what this means:
    &&*(u32*)pfind == *(u32*)pcur) // it may eliminate the upper comparison and an access at pfind; the range may be larger than 128 kb.
    Isn't the range (distance) limited to 64 KB due to the method of storing matches?

  15. #105
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    nimdamsk, as to the memory usage of *just* 4 GB, I think it's your tool. What it reports is UINT32_MAX. I guess it has some internal limits.

  16. #106
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Do you mean 7% fewer L2 accesses? I can't believe a 7% overall speedup.
    It really is a 7% overall speedup on my computer. I can't explain why.

    Quote Originally Posted by Bulat Ziganshin View Post
    And I don't understand what this means:
    Isn't the range (distance) limited to 64 KB due to the method of storing matches?
    I was wrong. It should need 192 KB of cache at most.

    I'm sure my Xeon is an old model, probably born in 2004 or 2005. It's even slower than my old laptop.

  17. #107
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    what's your computer?

  18. #108
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    what's your computer?
    It's a company test server machine. No one cares what it is.
    Here it is:
    [root@tp40:/data0/siyuan5/shrinker] cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping : 3
    cpu MHz : 2992.722
    cache size : 2048 KB
    physical id : 0
    siblings : 2
    core id : 0
    cpu cores : 1
    apicid : 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
    bogomips : 5985.44
    clflush size : 64
    cache_alignment : 128
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 1
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping : 3
    cpu MHz : 2992.722
    cache size : 2048 KB
    physical id : 3
    siblings : 2
    core id : 0
    cpu cores : 1
    apicid : 6
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
    bogomips : 5985.42
    clflush size : 64
    cache_alignment : 128
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 2
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping : 3
    cpu MHz : 2992.722
    cache size : 2048 KB
    physical id : 0
    siblings : 2
    core id : 0
    cpu cores : 1
    apicid : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
    bogomips : 5985.39
    clflush size : 64
    cache_alignment : 128
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 3
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping : 3
    cpu MHz : 2992.722
    cache size : 2048 KB
    physical id : 3
    siblings : 2
    core id : 0
    cpu cores : 1
    apicid : 7
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
    bogomips : 5985.37
    clflush size : 64
    cache_alignment : 128
    address sizes : 36 bits physical, 48 bits virtual
    power management:

  19. #109
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    SSE2 only = early Pentium 4. Thanks.

  20. #110
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I made a bugfix release.

    I think all reported bugs have been fixed, except for the decompression performance drop of LZ4 and LZJB with block size increase. If not, please ping me about them.

    Changes:
    Code:
    [+] added protection from codecs that report too small output size
    [~] updated Shrinker to r5
    [~] shrinker's compressed size listing now includes the metadata that I had to add to make it support a wider range of block sizes (4 bytes / ~128 MB block)
    [!] cleanup of error messages
    [!] replaced cstdio with fstream to enable testing with large files on Windows x64
    [!] changed some int types to avoid potential troubles
    [!] fixed shrinker decompression errors (that could happen to some other codecs too)
    As to the performance issue:
    nimdamsk, could you test:
    LZ4, LZJB, LZO on file parts with sizes of about:
    1395864371
    1707501158
    1728472678
    1822844518
    bytes?

    Also, it would be nice if you checked if the problem appears on some different file.
    Attached Files

  21. #111
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    Code:
    > fsbench.exe all -b67108864 -t16 1gb.log
    memcpy               = 342 ms (5988 MB/s), 1073741824->1073741824
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    BriefLZ         1.0.5           17926694 (x59.90)   C:1681 MB/s  D:4876 MB/s
    Doboz      2011-03-19           21295798 (x50.42)   C:  19 MB/s  D:8827 MB/s
    fastlz          0.1.0       1   35503420 (x30.24)   C:4511 MB/s  D:6282 MB/s
    LZ4               r59      12   21339729 (x50.32)   C:  17 TB/s  D:8677 MB/s
    LZ4hc             r12           14338980 (x74.88)   C: 981 MB/s  D:8641 MB/s
    LZF               3.6           26336863 (x40.77)   C:2111 MB/s  D:8982 MB/s
    LZFX              r16           35113461 (x30.58)   C:4284 MB/s  D:7236 MB/s
    lzg             1.0.6       5   38151448 (x28.14)   C: 483 MB/s  D:5953 MB/s
    LZHAM         SVN r96       4    4601108 (x233.37)   C:  32 MB/s  D:8605 MB/s
    lzjb             2010          363600470 (x 2.95)   C:2115 MB/s  D:3930 MB/s
    lzmat             1.1           16630003 (x64.57)   C:2572 MB/s  D:3351 MB/s
    LZO              2.05     1x1   40395087 (x26.58)   C:  16 TB/s  D:8497 MB/s
    LZSS-IM    2008-07-31           58020356 (x18.51)   C:  66 MB/s  D:2021 MB/s
    LZV1              0.5           31613249 (x33.96)   C:2458 MB/s  D:8062 MB/s
    LZX        2005-07-06      21    4491556 (x239.06)   C:2956 KB/s  D:2000 TB/s
    miniz            1.11       6   10754120 (x99.84)   C:1272 MB/s  D:5673 MB/s
    nrv2b            1.03       6   11301974 (x95.00)   C: 307 MB/s  D:5673 MB/s
    nrv2d            1.03       6   11441924 (x93.84)   C: 307 MB/s  D:5305 MB/s
    nrv2e            1.03       6   11442346 (x93.84)   C: 303 MB/s  D:5184 MB/s
    QuickLZ       1.5.1b6       1   21123758 (x50.83)   C:6041 MB/s  D:8497 MB/s
    RLE64           R3.00      64 1073741824 (x 1.00)   C:9266 MB/s  D:        -
    Shrinker           r5           19960573 (x53.79)   C:  14 TB/s  D:8827 MB/s
    Snappy          1.0.5           87376085 (x12.29)   C:  13 TB/s  D:8752 MB/s
    tornado           0.5       6   10043766 (x106.91)   C: 628 MB/s  D:1147 MB/s
    Yappy              v2      10   78523884 (x13.67)   C:1517 MB/s  D:8982 MB/s
    zlib            1.2.5       6   11261466 (x95.35)   C: 600 MB/s  D:5417 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x2 iteration(s)).
    
    > fsbench LZ4 LZJB LZO shrinker snappy -b2210886 -t16 17439.rtf
    memcpy               = 174 ms (7307 MB/s), 33330406->33330406
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    LZ4               r59      12    1591108 (x20.95)   C:  14 TB/s  D:9780 MB/s
    lzjb             2010            3128352 (x10.65)   C:5045 MB/s  D:4834 MB/s
    LZO              2.05     1x1    1568212 (x21.25)   C:  14 TB/s  D:9780 MB/s
    Shrinker           r5            1394401 (x23.90)   C:  11 TB/s  D:9780 MB/s
    Snappy          1.0.5            2681351 (x12.43)   C:  11 TB/s  D:9488 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x40 iteration(s)).
    
    > fsbench LZ4 LZJB LZO shrinker snappy -b16108860 -t16 23.bmp
    memcpy               = 65 ms (7907 MB/s), 269485110->269485110
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    LZ4               r59      12  227347678 (x 1.19)   C:2325 MB/s  D:6047 MB/s
    lzjb             2010          241917783 (x 1.11)   C: 962 MB/s  D:2264 MB/s
    LZO              2.05     1x1  268886798 (x 1.00)   C:4047 MB/s  D:  71 TB/s
    Shrinker           r5          208479563 (x 1.29)   C:1132 MB/s  D:4319 MB/s
    Snappy          1.0.5          230916699 (x 1.17)   C:1882 MB/s  D:4942 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x2 iteration(s)).
    
    > fsbench LZ4 LZJB LZO shrinker snappy -b30000000 -t16 benchmark3.O1.DMP
    memcpy               = 70 ms (7005 MB/s), 514238726->514238726
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    LZ4               r59      12   99877295 (x 5.15)   C:4086 MB/s  D:7430 MB/s
    lzjb             2010          137081332 (x 3.75)   C:1796 MB/s  D:3027 MB/s
    LZO              2.05     1x1  119856081 (x 4.29)   C:5004 MB/s  D:7430 MB/s
    Shrinker           r5           95091453 (x 5.41)   C:2622 MB/s  D:7107 MB/s
    Snappy          1.0.5           96744981 (x 5.32)   C:2636 MB/s  D:5449 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x1 iteration(s)).
    
    > fsbench LZ4 LZJB LZO shrinker snappy -b3000000000 -i1 -s1 -w1 3696mb.log
    Not enough memory!
    Tried to allocate 3875397124+18446744073288336087+3875397124 bytes.
    
    > fsbench LZ4 LZJB LZO shrinker snappy -b2000000000 -i1 -s1 -w1 3696mb.log
    Error reading file!
    
    > fsbench LZ4 LZJB LZO shrinker snappy -b5000000 3696mb.log
    Error reading file!
    0. Thanks for the new version
    1. Decompression in my tests was always faster.
    2. The compression speed of LZ4, LZO, Shrinker and Snappy on highly compressible data is suspicious.
    3. No particular problems with files of sizes 1395864371, 1707501158, 1728472678, 1822844518.
    4. The decompression speed of LZO on almost-incompressible data is suspicious.
    5. I still can't test on the 3696 MiB file using any block size.
    6. The ConsMeter utility used for checking CPU and RAM usage is 32-bit, so AFAIK it can't count values > 4 GiB. Does anyone have a 64-bit version of it, or another 64-bit utility for the same purpose?
    Last edited by nimdamsk; 22nd April 2012 at 22:22.

  22. #112
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I guess I was not clear enough. By the decompression speed problem I meant the one reported here, where LZ4 / LZJB speed dropped significantly for no apparent reason. I wanted to be able to draw a performance curve, with special attention to what happens around the boundary between having 1 and 2 blocks. That's why I asked you to look at what happens at the specified data points with these 2 codecs, and with LZO as a control sample.
    2. You mean the RTF? I guess it's just too small; can you try to run the test on 4 or more copies concatenated?
    4. There's too little data to get meaningful measurements; there's probably just 1 block being decompressed twice, so 3 MB/iteration. It easily fits in the CPU cache, and your clock is not accurate enough to get sensible measurements in such a case. That's because the benchmark doesn't decompress blocks that weren't successfully compressed in the first place. The benchmark was not designed to measure decompression of incompressible data, and I'm not willing to write code to enable it. It's not terribly hard, but I don't see the point; you're better off storing such data uncompressed anyway.
    5. Mhm. I will look deeper. Thanks for the info. As to 2 GB+ blocks: I said that you can expect trouble. An allocation attempt of 18 EB is nothing special, and neither are crashes. Maybe I should add a warning.
    Last edited by m^2; 22nd April 2012 at 23:12.

  23. #113
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Compression speed of LZ4, LZO, Shrinker and Snappy on highly compressible data is suspicious.

    const string units[] = {" B", "KB", "MB", "TB", "PB", "EB", "ZB", "YB"};//OK, uint64_t is not enough to reach YB

    Note that "GB" is missing from the list, so every reported unit above MB is shifted up by one: the "TB/s" figures really mean GB/s, i.e. compression is just 2x faster than decompression. Compression of highly repetitive data is mainly sequential reading from memory plus memcmp; decompression is mainly sequential writing with memcpy. So:

    min(sequential reading, memcmp) = 16 GB/s = 2*min(sequential writing, memcpy)



    m^2, how about adding tor:3 and tor:5? And it's still interesting to me to see lz4/32k, in order to make a direct comparison with Shrinker (well, I will compile it myself).

  24. #114
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Thanks for the bug, Bulat.

    As to tor:3 etc., it's already supported; it's just called "tornado,3". In my long-term plans there's a more generic parameter parser that would enable setting advanced parameters. Tor's defaults are not well suited for an in-memory benchmark. inikeep used some customised options instead, but IMO it was misleading.

  25. #115
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    compression speed of LZ4, LZO, Shrinker and Snappy on highly compressible data is suspicious.
    I frequently use LZO on my Windoze backups: I just pipe the output of a disk imager through LZO. The difference in time (to copy) is negligible, and for, say, an 80 GB partition that hosts WinXP I can expect 30% savings. Deflate usually gets another 15% savings, but at the cost of 2x the execution time. Decompression seems the same (LZO vs GZIP).

  26. #116
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Tor's defaults are not well suited for an in-memory benchmark
    I don't think so. I've optimized tornado using a RAM disk as a null compression target, and the main part of tornado is its huffman/ari compression modes, which don't suffer too much from the overhead of reading from a RAM disk.

    inikeep used some customised options instead
    He tested compression of very small blocks (targeted at filesystems), where tornado is absolutely useless: its fast (byte/bit) models are extremely inefficient compared to the snappy-alikes, and the huf/ari modes are useless there because on the first few KBs they just gather stats.

  27. #117
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    compression of zeroes:

    Code:
    memcpy               = 171 ms (17846 MB/s), 100000001->100000001
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    fastlz          0.1.0       1    1145047 (x87.33)   C:2114 MB/s  D:1507 MB/s
    LZ4               r59      12     392168 (x254.99)   C:8620 MB/s  D:6115 MB/s
    LZF               3.6    very    1136371 (x88.00)   C:1920 MB/s  D:1532 MB/s
    lzjb             2010            3219736 (x31.06)   C:1440 MB/s  D:1446 MB/s
    LZO              2.05     1x1     443539 (x225.46)   C:7988 MB/s  D:2057 MB/s
    QuickLZ       1.5.1b6       1  100000001 (x 1.00)   C: 693 MB/s  D:        -
    RLE64           R3.00      64         25 (x4000000.00)   C:9940 MB/s  D:  11 TB/s
    Shrinker           r5             392179 (x254.99)   C:7667 MB/s  D:2095 MB/s
    Snappy          1.0.5            4696662 (x21.29)   C:8094 MB/s  D:3208 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (3x32 iteration(s)).

  28. #118
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i don't think so. i've optimized tornado using ram-disk to nul compression, and main part of tornado are huffman/ari compression modes that doesn't suffice too much from overhead of reading from ram disk
    RAM is fast, and it has an advantage over disk-to-disk compression only when your algorithm is faster than the disk. At the same time it uses a lot of RAM for itself, leaving less of it for the algorithm. Tornado is not very fast (compared to disk speed), but it is very memory-hungry. Yes, it can be faster than disk, especially if you have just 1 HDD. But I guess that memory constraints are a limiting factor for many users.
    Quote Originally Posted by Bulat Ziganshin View Post
    he tested compression of very small blocks (targeted for filesystem)
    No, he tested on regular files. It's me who added the small-blocks mode. Though he did optimise it for lower memory usage and concentrated heavily on the faster modes.

    As to the compression of zeroes... Bulat, you got me interested, because I had no idea why QuickLZ would fail to compress. My results:
    Code:
    e:\projects\benchmark04\tst>fsbench all -w0 -i1 -s1 -v ..\0
    memcpy               = 63 ms (1513 MB/s), 100000001->100000001
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    blosc           1.1.4       5     392168 (x254.99)   C: 908 MB/s  D:1986 MB/s
    BriefLZ         1.0.5                 10 (x10000000.00)   C: 381 MB/s  D: 386 MB/s
    Doboz      2011-03-19            1600426 (x62.48)   C:1232 KB/s  D: 623 MB/s
    fastlz          0.1.0       1    1145047 (x87.33)   C: 341 MB/s  D: 541 MB/s
    LZ4               r59      12     392167 (x254.99)   C: 574 MB/s  D: 756 MB/s
    LZ4hc             r12             392167 (x254.99)   C: 451 MB/s  D: 635 MB/s
    LZF               3.6            1136370 (x88.00)   C:  91 MB/s  D: 314 MB/s
    LZFX              r16            1136370 (x88.00)   C: 299 MB/s  D: 298 MB/s
    lzg             1.0.6       5    1562526 (x64.00)   C:  52 MB/s  D: 283 MB/s
    LZHAM         SVN r96       4      85706 (x1166.78)   C:1762 KB/s  D: 739 MB/s
    lzjb             2010            3219736 (x31.06)   C: 298 MB/s  D: 341 MB/s
    lzmat             1.1               7032 (x14220.71)   C: 353 MB/s  D: 241 MB/s
    LZO              2.05     1x1     443539 (x225.46)   C: 557 MB/s  D: 334 MB/s
    LZSS-IM    2008-07-31            4734862 (x21.12)   C:1787 KB/s  D: 435 KB/s
    LZV1              0.5            1136369 (x88.00)   C:4163 KB/s  D: 489 MB/s
    LZX        2005-07-06      21     106158 (x941.99)   C: 250 KB/s  D:  93 GB/s
    count=100000000  30 30 30 30!=0 0 0 0
    ERROR in block 0: common=0
    miniz            1.11       6      97145 (x1029.39)   C:  49 MB/s  D: 193 MB/s
    nrv2b            1.03       6     152599 (x655.31)   C:  12 MB/s  D: 451 MB/s
    nrv2d            1.03       6     152597 (x655.32)   C:  15 MB/s  D: 371 MB/s
    nrv2e            1.03       6     152597 (x655.32)   C:  15 MB/s  D: 460 MB/s
    QuickLZ       1.5.1b6       1    1227092 (x81.49)   C: 507 MB/s  D: 794 MB/s
    RLE64           R3.00      64         25 (x4000000.00)   C: 623 MB/s  D:1306 MB/s
    Shrinker           r5             392179 (x254.99)   C: 139 MB/s  D: 494 MB/s
    Snappy          1.0.5            4693608 (x21.31)   C: 653 MB/s  D: 727 MB/s
    tornado           0.5       6       1933 (x51733.06)   C:  37 MB/s  D: 237 MB/s
    Yappy              v2      10   14285723 (x 7.00)   C:  24 MB/s  D:1036 MB/s
    zlib            1.2.5       6      97210 (x1028.70)   C:  60 MB/s  D: 343 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (1x1 iteration(s)).
    BriefLZ is cool.
    And QuickLZ compressed just fine. Any idea what's up?

    Also, I took a closer look at the fastest decompressing codecs.

    Code:
    memcpy               = 182 ms (2619 MB/s), 100000001->100000001
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    blosc           1.1.4       9     392168 (x254.99)   C:2788 MB/s  D:2649 MB/s
    LZ4               r59      12     392167 (x254.99)   C:1841 MB/s  D:1204 MB/s
    Yappy              v2      10   14285723 (x 7.00)   C:  25 MB/s  D:1003 MB/s
    Snappy          1.0.5            4693608 (x21.31)   C:1543 MB/s  D: 739 MB/s
    Codec         version    args       Size (Ratio)    C.Speed      D.Speed
    done... (4x5 iteration(s)).
    LZ4 used to be either the fastest or close to it. Now it has lost a lot of ground. It appears that the plain C library memcpy used by blosc can be a big win in some cases.
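    To illustrate the trade-off being discussed: fast LZ decoders typically copy matches in fixed 8-byte steps (cheap per call, but byte-at-a-time in spirit), while a single libc memcpy has more call overhead yet wins on very long copies, such as the huge runs produced by an all-zero input. This is only a sketch of the two strategies; the function names are illustrative and not taken from LZ4 or blosc.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Incremental 8-byte "wild copy", as fast LZ decoders often use for
       short matches. It may overrun the destination by up to 7 bytes, so
       the caller must reserve slack at the end of both buffers. */
    static void wild_copy8(uint8_t *dst, const uint8_t *src, size_t len)
    {
        uint8_t *end = dst + len;
        do {
            memcpy(dst, src, 8);   /* fixed-size copy: compiles to one load/store pair */
            dst += 8;
            src += 8;
        } while (dst < end);
    }

    /* One libc memcpy: higher startup cost, but the vectorized library
       implementation wins once copies get long (e.g. megabytes of zeroes). */
    static void long_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        memcpy(dst, src, len);
    }
    ```

    Both produce identical output for non-overlapping copies; the difference is purely in which copy length each one is tuned for.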

    Oh...blosc.
    Since I introduced it already, I'm releasing the new version, though there are very few changes.
    Code:
    0.10g
    [+] added blosc
    [~] better warnings
    [!] GBytes were missing from the list of units
    Some more about blosc:
    Don't bother too much. It's poor.
    Attached Files

  29. #119
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    892
    Thanks
    492
    Thanked 280 Times in 120 Posts
    Blosc uses a memset() operation for repetitive data, such as a long stream of zeroes.
    This is even faster than a memcpy(), which explains its advantage in this specific situation.

    I tested the idea, but ruled against it after benchmarks:
    long streams of zeroes do happen, but not often enough to offset the cost of one additional branch.
    So, on generic data, it translates into a (small) loss.

    In contrast, blosc depends on this trick for its speed.
    It only works properly in tandem with shuffle(), which itself is only usable with strictly aligned data,
    such as a table of unsigned int with a relatively "progressive" pattern (i.e. not random).
    With shuffle() regrouping the high bytes together, the odds of long streams of identical characters become much better.

    Blosc also uses a lot of memcpy(), which is good for long copies only.
    I guess its author may have over-tuned the algorithm to his own benchmark, which is based not on real data but on generated data.
    And apparently the generator outputs some long cyclic patterns, which can then be copied with an efficient memcpy().
    Unfortunately, on real data most copies are short, and this bet doesn't pay off.
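    The shuffle() effect described above is easy to demonstrate. For a buffer of little-endian 32-bit values, byte-shuffling transposes the data into four byte planes; with a slowly increasing ("progressive") sequence, the three high-byte planes collapse into long runs of identical bytes that memset/RLE-style coding handles trivially. A minimal sketch, assuming little-endian uint32 input; the function name is illustrative and not blosc's actual API:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Transpose `nvals` little-endian 32-bit values into four byte planes:
       plane b (of size nvals) collects byte b of every value. For
       progressive data, the high planes become long runs of equal bytes. */
    static void shuffle32(const uint8_t *src, uint8_t *dst, size_t nvals)
    {
        for (size_t i = 0; i < nvals; i++)
            for (size_t b = 0; b < 4; b++)
                dst[b * nvals + i] = src[i * 4 + b];
    }
    ```

    For the sequence 0..255 stored as uint32, the shuffled output is 256 increasing bytes followed by 768 consecutive zeroes, exactly the kind of run the memset trick exploits.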

  30. #120
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,610
    Thanks
    30
    Thanked 65 Times in 47 Posts
    In my code I use blosclz directly, to avoid its tricks with threading. It doesn't use shuffle, though it does use memset.
    I noted that on those zeroes its decompressor's speed is almost exactly the same as that of memcpy.

    ADDED:
    Do you think it would be better if I enabled shuffling in blosc and disabled threading in some other way?
    Last edited by m^2; 24th April 2012 at 00:57.
