
Thread: Real life vs max checksum speed

  1. #1
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts

    Real life vs max checksum speed

    I have made my little ZPAQ patch with two "checksums": CRC-32C (via the hardware SSE 4.2 instruction) and SHA-1.
    It's hard to make a choice: obviously SHA-1 is much "stronger" from every point of view.

    In "real life" (checking a mixed load of small files) SHA-1 is about two times slower than CRC-32C (on my PC):
    ~34s vs ~17s on my testbed

    Code:
    C:\zpaqfranz>zpaqfranz sha1 f:\zarc\* -crc32c >1.txt
    
    17.735 seconds (all OK)
    
    C:\zpaqfranz>zpaqfranz sha1 f:\zarc\* >2.txt
    
    33.390 seconds (all OK)
    Of course, in this case, the cumulative effect of mass-storage speed, OS overhead, latency and so on also comes into play.

    When running from cached data the gap is huge
    [Screenshot: max.jpg]
    4.9 GB/s for CRC-32C (I had to watch out for elapsed times shorter than one timer tick, which caused a division by zero when computing the average speed!)
    vs
    ~0.55 GB/s for SHA-1 (not a bad performance)
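    Just to illustrate the divide-by-zero point (a hypothetical guard, not my exact patch code): clamp the elapsed time to at least one timer tick before dividing.
    Code:
    // hypothetical helper (illustration only): never divide by an elapsed time of zero ticks
    double average_speed(double bytes, double seconds, double min_tick = 0.001)
    {
        return bytes / (seconds > min_tick ? seconds : min_tick); // bytes per second, clamped
    }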

    In real life, for a big file not preloaded, the gap is about 350 MB/s vs 230 MB/s.
    [Screenshot: reallife.jpg]

    In other words, SHA1, when executed in software by a modern CPU, seems able to reach roughly the speeds of "normal" SSD disks, reducing the advantage of the hardware-based crc32c (so fast that it does not require a parallel calculation).

    Clearly in the real world the data is loaded from the media, and is not already preloaded into the cache.

    So ... which solution would you choose as the checksum method?

  2. #2
    Member
    Join Date
    May 2015
    Location
    ~
    Posts
    10
    Thanks
    1
    Thanked 5 Times in 2 Posts
    Is there any reason to stick with crc32c and sha1? xxHash and BLAKE3 are both worth investigating for your purposes. ~800MB/s is not unrealistic for real world reads from a modern SSD.

  3. #3
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by choochootrain View Post
    Is there any reason to stick with crc32c and sha1? xxHash and BLAKE3 are both worth investigating for your purposes. ~800MB/s is not unrealistic for real world reads from a modern SSD.
    One reason is simplicity.
    SHA-1 is embedded in the source code, and CRC-32C is very easy to write.
    About speed: real-world SSDs do not give much more than 400-500 MB/s.
    So a 4.9 GB/s checksum vs. SHA-1 code that is ten times slower does not make much difference (in my tests; I will also try with magnetic disks).

    Hence the thread's question.

  4. #4
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    177
    Thanks
    29
    Thanked 75 Times in 45 Posts
    When I write software I tend to keep disk I/O out of the equation: people could be running it on an HDD (150 MB/s), an SSD (550 MB/s), or even one of those Gen 4 NVMe SSDs (5 GB/s), and because I/O throughput varies so much across devices it makes accurate measurements difficult.

    The most stable measurements come from the time spent on CPU operations, excluding I/O. Remember: I/O isn't actually part of your code; you can't make it any faster, since it's handled by the kernel and is only as fast as the available hardware.

    If you want my opinion: for verifying content, go with the fastest hashing algorithm you can develop or get off the shelf. For encryption, use the stronger but slower hashing methods.

    As for real-world speeds, look into multi-threading with I/O read-ahead and write-behind buffers; this will keep the CPU busy for longer periods of time. A rough sketch of the read-ahead idea is below.
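    For example, a minimal read-ahead sketch (my own illustration, not code from zpaqfranz or ZPAQ; the names BlockQueue / update_checksum are just placeholders): one thread reads fixed-size blocks into a small bounded queue while another thread drains it and updates the checksum, so disk reads and hashing overlap.
    Code:
    // Read-ahead sketch: producer thread reads blocks, consumer thread hashes them.
    #include <condition_variable>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>
    
    // Stand-in checksum so the sketch is self-contained; swap in CRC-32C, SHA-1, ...
    static uint32_t update_checksum(uint32_t state, const char* data, size_t n) {
        for (size_t i = 0; i < n; ++i)
            state = state * 31u + (unsigned char)data[i];
        return state;
    }
    
    struct BlockQueue {                       // bounded queue of file blocks
        std::queue<std::vector<char>> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;
        size_t max_blocks = 4;                // read-ahead depth
    
        void push(std::vector<char> b) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return q.size() < max_blocks; });
            q.push(std::move(b));
            cv.notify_all();
        }
        bool pop(std::vector<char>& out) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty()) return false;      // reader finished and queue drained
            out = std::move(q.front());
            q.pop();
            cv.notify_all();
            return true;
        }
        void finish() {
            std::lock_guard<std::mutex> lk(m);
            done = true;
            cv.notify_all();
        }
    };
    
    uint32_t checksum_file_readahead(const char* path) {
        BlockQueue bq;
        std::thread reader([&] {              // producer: pure disk I/O
            if (FILE* f = fopen(path, "rb")) {
                const size_t kBlock = 1 << 20;            // 1 MiB per read
                std::vector<char> buf(kBlock);
                size_t n;
                while ((n = fread(buf.data(), 1, kBlock, f)) > 0) {
                    buf.resize(n);
                    bq.push(std::move(buf));
                    buf.assign(kBlock, 0);                // fresh buffer for next read
                }
                fclose(f);
            }
            bq.finish();
        });
        uint32_t state = 0;
        std::vector<char> block;
        while (bq.pop(block))                 // consumer: pure CPU work
            state = update_checksum(state, block.data(), block.size());
        reader.join();
        return state;
    }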

  5. Thanks:

    fcorbelli (12th November 2020)

  6. #5
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Real-world disk I/O is part of the equation.
    I have NVMe, SSD, magnetic, PCIE RISC controllers etc. (I am a data-storage guy).
    After profiling I find that a 10x faster algorithm does not make any real difference, because a modern CPU can "beat" even very fast storage.
    It's not only theoretical bandwidth, but latency, OS/filesystem slowdown and so on.
    It is not easy to find "real" storage which can pump (sustained, not burst) half a gigabyte per second for hundreds of GB.

    So I will take your advice and make extensive tests on different (virtual) hardware and very different software (Unix with ZFS).

  7. #6
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    I have compiled a Windows 64-bit executable, containing both algorithms (SHA-1 in software and CRC-32C in hardware), by patching ZPAQ.

    http://www.francocorbelli.it/zpaqfranz.exe


    Could someone please do some testing and report the results with "real world" workloads,
    without cache if possible?

    The syntax is pretty easy

    zpaqfranz sha1 z:\mypath\*

    and

    zpaqfranz sha1 z:\mypath\* -crc32c

    Please report the last lines of the output, something like
    Code:
    Using SHA1
    Scanning filesystem time  0.718000
    Data transfer+CPU   time  10.204000
    Data output         time  49.329000
    Worked on 5668339712 average speed 555.501.735 B/s
    
    60.250 seconds (all OK)
    and
    Code:
    Using CRC-32C
    Scanning filesystem time  0.719000
    Data transfer+CPU   time  1.798000
    Data output         time  34.453000
    Worked on 5946088891 average speed 3.307.057.225 B/s
    Thanks

  8. #7
    Member
    Join Date
    Nov 2020
    Location
    France
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Point 1 : cryptographic hashes

    I think most people have a wrong idea of what SHA and the other cryptographic hashes (MD4, MD5, SHA-2, SHA-3, etc.) are made for.

    Cryptographic hashes are made to be used in signatures.

    They are designed to make it very hard to find another message with the same hash when the alteration is made on purpose.

    To achieve that, computing a secure hash takes a long time.

    As a side effect, it is very hard to demonstrate that your data is well protected against accidental errors.

    Since your archive contains both the data and its unencrypted hash, cryptographic hashes are not designed for that purpose. You just waste CPU cycles.

    Point 2 : size of protected data

    The bigger your data is, the more likely it is that you will not detect errors with only one hash.

    For example: with CRC32 or CRC32C, the hash is only 32 bits.

    For a message of size 2^32 bits (512 MiB) there are 2^32 messages that differ from the original by exactly one bit, but only 2^32 possible hash values, so you can expect to find two of those one-bit variants with the same hash.
    This means there are many pairs of 512 MiB messages which differ by only two bits yet share the same hash.

    So you should compute several hashes over big files, or increase the number of bits in the hash (see the sketch just below).
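    As an illustration of the "several hashes" idea (my own sketch, nothing standard; per_block_crcs is a made-up name): one CRC-32 per fixed-size block instead of a single CRC over the whole file, so each 32-bit check only covers a short span.
    Code:
    // One CRC-32 per block instead of a single CRC over the whole file.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    
    // Minimal bitwise CRC-32 (reflected, poly 0xEDB88320): slow but self-contained.
    static uint32_t crc32_update(uint32_t crc, const unsigned char* p, size_t n) {
        crc = ~crc;
        for (size_t i = 0; i < n; ++i) {
            crc ^= p[i];
            for (int k = 0; k < 8; ++k)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }
    
    // Returns one CRC per 64 MiB block; a corruption must now collide within its
    // own (much shorter) block to go undetected.
    std::vector<uint32_t> per_block_crcs(const char* path, size_t block_size = 64u << 20) {
        std::vector<uint32_t> crcs;
        if (FILE* f = fopen(path, "rb")) {
            std::vector<unsigned char> buf(block_size);
            size_t n;
            while ((n = fread(buf.data(), 1, buf.size(), f)) > 0)
                crcs.push_back(crc32_update(0, buf.data(), n));
            fclose(f);
        }
        return crcs;
    }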

    Point 3 : recommendation

    xxhash looks like the fastest hash.
    xxhash128 uses 128 bits.

    Note : I use "message" instead of "file" because in cryptography, we talk about message rather than files.

  9. #8
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Thanks, but in this case it's "file" and not "message", because I am not working with cryptography but with checksums (file integrity).

    Using, as is often done in practice, a cryptographic hash as a checksum: is that better, worse, or the same as a CRC32?


    Here's a quick result
    Code:
    SHA1
    Data transfer+CPU   time  68.219000
    
    CRC32-c
    Data transfer+CPU   time  69.063000
    As you can see the time is almost the same, even using a much faster algorithm.
    Of course the transfer time from the hard disk is the bottleneck.

    I am currently testing some rather "big" files (>1.5 TB): I do not expect anything different.

  10. #9
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by VinceLeGrand View Post
    Point 3 : recommendation

    xxhash looks like the fastest hash.
    xxhash128 uses 128 bits.
    "My" own softwarello seems way faster than xxhash
    [Screenshot: xxhsum.jpg]
    Of course "my" means: Intel SSE

  11. #10
    Member
    Join Date
    Dec 2019
    Location
    USA
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by fcorbelli View Post
    "My" own softwarello seems way faster than xxhash
    Not at all a side-effect of file system cache...

  12. #11
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by standolf View Post
    Not at all a side-effect of file system cache...
    Of course not (I use a RAMDISK).

    The CRC-32C (SSE) algorithm IS 10 times faster than SHA-1.
    In this case it is (about) two times faster than xxHash.

    That's because both programs use the same hardware (mine), the same OS,
    the same cache (not actually used, in fact), etc.

  13. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,576
    Thanks
    790
    Thanked 687 Times in 372 Posts
    you can find xxh64 and xxh128 speeds in their docs. your executable prints a 32-bit value, so it's probably xxh32

    properly implemented crc32c is 8 bytes/cycle, i.e. 32 GB/s on a 4 GHz cpu. but a crc hash is too weak, so i suggest you use a better hash that may still reach 10+ GB/s

    ps: the OS doesn't distinguish between disk types and copies data into the cache anyway. so the first time you perform two memory copies, and the second time only one

  14. #13
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    you can find xxh64 and xxh128 speeds in their docs. your executable prints a 32-bit value, so it's probably xxh32
    It is the -H0 switch.
    It is xxh64; xxh128 seems to use SSE.
    http://cyan4973.github.io/xxHash/

    properly implemented crc32c is 8 bytes/cycle, i.e. 32 GB/s on a 4 GHz cpu. but a crc hash is too weak, so i suggest you use a better hash that may still reach 10+ GB/s
    ... like... what?
    With a "not too fancy" source?
    I am reading xxHash, but it's a gazillion #ifdefs and so on.

    EDIT: maybe I found one
    https://create.stephan-brumme.com/xxhash/
    Yann's C code is heavily optimized for various environments: little and big endian, aligned and unaligned data, etc.
    However, this makes his code hard to read. That's why I wrote my C++ implementation.

    It's just 180 lines of code - all in a single header file (no .cpp) and heavily commented. No #ifdefs, too - scroll down to see the full source code!
    ps: the OS doesn't distinguish between disk types and copies data into the cache anyway. so the first time you perform two memory copies, and the second time only one
    There is no real difference between the first run from ramdisk and the (N+1)th run from cache.
    At least on my machine.

    EDIT: done.
    On my machine the "straight" 64-bit xxHash (to be checked) runs at about 3.9 GB/s
    Code:
    XXHASH: 8565B2C49FAACB26 r:/partf.zpaq
    Using xxhash
    Scanning filesystem time  0.000000
    Data transfer+CPU   time  3.470000
    Data output         time  0.000000
    Worked on 13536133120 average speed 3.900.902.916 B/s
    This is not bad at all (author's code is a real mess)
    Last edited by fcorbelli; 12th November 2020 at 23:11.

  15. #14
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    115
    Thanks
    39
    Thanked 30 Times in 21 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    you can find xxh64 and xxh128 speeds in their docs. your executable prints a 32-bit value, so it's probably xxh32

    properly implemented crc32c is 8 bytes/cycle, i.e. 32 GB/s on a 4 GHz cpu. but a crc hash is too weak, so i suggest you use a better hash that may still reach 10+ GB/s

    ps: the OS doesn't distinguish between disk types and copies data into the cache anyway. so the first time you perform two memory copies, and the second time only one
    CRC is a checksum and not a hash.

  16. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,576
    Thanks
    790
    Thanked 687 Times in 372 Posts
    so, what's the hash you are running - xxh32/64/128? i guess that you didn't use the best switches for compilation. xxh64 requires x64 for best speed, and xxh128/xxh3 requires avx2 but can probably work with sse4

  17. #16
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    so, what's the hash you are running - xxh32/64/128? i guess that you didn't use the best switches for compilation. xxh64 requires x64 for best speed, and xxh128/xxh3 requires avx2 but can probably work with sse4
    None of them.
    They would need too much work to "strip down" and "inject" into ZPAQ.
    I used this one instead (gcc 7.3.0, -O3):
    https://create.stephan-brumme.com/xxhash/


  18. #17
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,576
    Thanks
    790
    Thanked 687 Times in 372 Posts
    1. use -m64 to compile 64-bit executable of xxh64sum
    2. xxh32 is pretty weak hash, although still better than crc

  19. #18
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    1. use -m64 to compile 64-bit executable of xxh64sum
    2. xxh32 is pretty weak hash, although still better than crc
    1. Too time consuming (I have no clang on Windows, only on BSD), and Collet's source is really unmanageable.
    2. xxHash64 from Brumme's code seems to give a different hash from xxhash (maybe the seed? Tomorrow I'll check... with the Delphi version!)

    Thank you for the hints.

  20. #19
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    After some experimentation also with CRC-32 (software implementations, following Bulat Ziganshin's pointers to Stephan Brumme's code) I come to these conclusions:
    1) Actual (real-world) results are extremely different from the theoretical maximums.
    On my PC, running from a 1 GB in-memory array (unrealistic in the real world):
    Code:
    bitwise          : CRC=221F390F, 8.082s, 126.706 MB/s
    half-byte        : CRC=221F390F, 3.850s, 265.980 MB/s
    tableless (byte) : CRC=221F390F, 3.628s, 282.283 MB/s
    tableless (byte2): CRC=221F390F, 3.555s, 288.010 MB/s
      1 byte  at once: CRC=221F390F, 1.909s, 536.368 MB/s
      4 bytes at once: CRC=221F390F, 0.640s, 1600.459 MB/s
      8 bytes at once: CRC=221F390F, 0.440s, 2325.023 MB/s
    4x8 bytes at once: CRC=221F390F, 0.317s, 3231.518 MB/s
     16 bytes at once: CRC=221F390F, 0.197s, 5193.688 MB/s
     16 bytes at once: CRC=221F390F, 0.196s, 5213.763 MB/s (including prefetching)
        chunked      : CRC=221F390F, 0.199s, 5138.350 MB/s
    16 bytes at once, on a "real" case: about 2900MB/s (13GB fully cached by filesystem)

    2) Considering that roughly all algorithms work somewhat like this
    (block-reading the file into a buffer and then processing it):
    Code:
    while ((n = fread(data, 1, mysize, myfile)) > 0)
        checksum = calc(data, n, checksum);
    for many reasons (latency, operating-system cache loading, loss of CPU cache locality and so on) the actual performance is vastly reduced (at least in Brumme's implementations), even for 16-byte slices.
    With a very rough, certainly not scientific, estimate: about -60%.
    However, it stays around 3 GB/s, which for a software implementation is not bad at all.

    3) CRC-32C with SSE turns out to be really fast, it is easy to program, and it does not require particularly "strange" source code (a minimal sketch of the idea is at the end of this post).

    4) SHA-1 (!) is not that bad, again as a software implementation.

    5) xxHash64 (simplified implementation, software) is somewhere in between, but - again in the actual version, processing the file broken into blocks rather than in pure buffer benchmarks - I don't know if it's worth it.

    So I'll put all the possibilities into my ZPAQ, and maybe SHA-256 too, for the "real paranoid".
    Unfortunately I have not found a usable hardware SHA-1 implementation; it seems that Intel sells libraries of this type. Patience.
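    To show what I mean by "easy to program" (a minimal sketch of the general idea with the standard CRC-32C init/final XOR, not the exact code in zpaqfranz; crc32c_hw is just a placeholder name): the whole hardware path is basically the SSE 4.2 crc32 instruction in a loop, via the <nmmintrin.h> intrinsics. Compile with -msse4.2 and only call it on CPUs that report SSE 4.2.
    Code:
    // CRC-32C (Castagnoli) using the SSE 4.2 hardware instruction, 8 bytes at a time.
    // Note: _mm_crc32_u64 needs a 64-bit build.
    #include <nmmintrin.h>   // _mm_crc32_u64 / _mm_crc32_u8
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    
    uint32_t crc32c_hw(uint32_t crc, const void* data, size_t n)
    {
        const unsigned char* p = (const unsigned char*)data;
        uint64_t c = crc ^ 0xFFFFFFFFu;          // standard CRC-32C pre-inversion
        while (n >= 8) {                         // one instruction per 8 bytes
            uint64_t v;
            std::memcpy(&v, p, 8);
            c = _mm_crc32_u64(c, v);
            p += 8;
            n -= 8;
        }
        while (n--)                              // leftover tail bytes
            c = _mm_crc32_u8((uint32_t)c, *p++);
        return (uint32_t)c ^ 0xFFFFFFFFu;        // standard final inversion
    }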

