
Thread: Checksumming big files (~400GB and more): is CRC32 / CRC32c enough?

  1. #1
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts

    Checksumming big files (~400GB and more): is CRC32 / CRC32c enough?

    I am extending ZPAQ to add file integrity check codes, which the standard, for various technical reasons, does not provide, without breaking compatibility.

    ZPAQ can directly use SHA1 and SHA256 hashes, which are extremely safe, but very slow to compute (no "fancy" hardware implementation) and large to store.

    Is a trivial CRC32 enough (or maybe the slightly better CRC32c) for this purpose?
    The files can be pretty big >>4G (vmdk) and I need something not too complex/strange.
    ---
    Not wanting to complicate my life too much, I would rather avoid elaborate schemes of parallel computation with subsequent combination of the partial results.
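    Something this simple is roughly what I have in mind. A minimal sketch (using zlib's crc32() just to illustrate the idea; this is not the actual ZPAQ code) of a single-threaded streaming CRC32 over an arbitrarily large file:
    Code:
    // Minimal sketch: streaming CRC32 over a file of any size, using zlib.
    // Assumes zlib is available; not the actual ZPAQ patch.
    #include <cstdio>
    #include <zlib.h>

    unsigned long crc32_of_file(const char* path) {
        FILE* f = fopen(path, "rb");
        if (!f) return 0;                          // error handling omitted for brevity
        unsigned long crc = crc32(0L, Z_NULL, 0);  // CRC of the empty string
        unsigned char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            crc = crc32(crc, buf, (uInt)n);        // 32-bit state, file size is irrelevant
        fclose(f);
        return crc;
    }

    int main(int argc, char** argv) {
        if (argc > 1)
            printf("%08lx  %s\n", crc32_of_file(argv[1]), argv[1]);
        return 0;
    }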

    What do 7Z and Freearc use?

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,013
    Thanks
    303
    Thanked 1,328 Times in 759 Posts
    CRC32 is insufficient as a file checksum only when you have millions of files in the archive; file size doesn't matter.

  3. #3
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    85
    Thanks
    7
    Thanked 18 Times in 11 Posts
    Quote Originally Posted by Shelwien View Post
    CRC32 is insufficient as a file checksum only when you have millions of files in the archive; file size doesn't matter.
    The number of files in an archive does not matter. OP doesn't try to find duplicates but tries to ensure file integrity.

    With CRC32 there is roughly a 1/2^32 chance that a changed file still produces the same CRC32. That's a very small chance. You have to decide whether that's acceptable or not.

    Also, CRC32 does not guarantee that the file wasn't changed on purpose. It's hard for a random, spontaneous mutation in the file to keep the CRC32 the same, but doing it on purpose is very easy. SHA1 is in theory no longer safe against these kinds of attacks, but in practice such an attack is still almost impossible to pull off. SHA256 is still safe and even theoretically as good as free of weaknesses.

    So it depends on what you are trying to achieve. If it's protection against non-maliciously changed files, I think CRC32 should do the job.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,013
    Thanks
    303
    Thanked 1,328 Times in 759 Posts
    > The number of files in an archive does not matter.

    The chance of getting a correct CRC for an incorrect file grows with the number of files; file size doesn't matter for this.

    Also, there are existing collision file pairs for MD5 and SHA1.

  5. #5
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    85
    Thanks
    7
    Thanked 18 Times in 11 Posts
    Quote Originally Posted by Shelwien View Post
    The chance of getting a correct CRC for an incorrect file grows with the number of files; file size doesn't matter for this.
    Well, yes and no. There is an independent probability of around 1/2^32 that, if a file is changed, its CRC32 stays the same. So in a very strange situation where every file gets corrupted and you have millions of corrupted files, it becomes very likely that at least one file is corrupted but still has the same CRC32.
    But would you trust an archive at all when it contains millions of files that are known to be corrupt and 1 or 2 files are flagged as okay only because of accidentally matching hash values?
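    To put rough numbers on it, here is a quick back-of-the-envelope sketch (my own illustration, assuming an independent 1/2^32 collision chance per corrupted file):
    Code:
    // Probability that at least one of N corrupted files accidentally keeps
    // its CRC32, assuming an independent 1/2^32 collision chance per file.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double p = 1.0 / 4294967296.0;   // 2^-32
        const double Ns[] = {1e3, 1e6, 1e9};
        for (double n : Ns)
            printf("N = %.0e -> P(at least one undetected) = %.3g\n",
                   n, 1.0 - std::pow(1.0 - p, n));
        return 0;
    }
    // Roughly 2.3e-7 for a thousand corrupted files, 2.3e-4 for a million,
    // and about 0.21 for a billion.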

    Quote Originally Posted by Shelwien View Post
    Also, there are existing collision file pairs for MD5 and SHA1.
    I am not advertising MD5, for this very reason. SHA1 has known collisions, but it's very hard (as in "you have to own a GPU-powered data center" hard) to create one on purpose.

  6. #6
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by Shelwien View Post
    > The number of files in an archive does not matter.

    The chance of getting a correct CRC for an incorrect file grows with the number of files; file size doesn't matter for this.

    Also, there are existing collision file pairs for MD5 and SHA1.
    As mentioned, my concern is checksumming files far larger than 4 GB, not collisions or anything else.

    Can I trust CRC32 even for VERY large and VERY similar files (different versions of vmdk virtual disks)?

    I just patched ZPAQ to use SHA1 as a checksum (it's already included), but I have a rather difficult bug to fix (the source code is a real mess of dirty tricks).

    In a couple of days I think I can replace it with something different.

    But does it make sense?
    CRC32 vs CRC32c vs SHA1 vs SHA256 vs MD5?

    If I were truly paranoid I would store TWO different hashes, to prevent extension attacks, but my aim is simply to verify quickly that the files in a folder are identical to those inside the ZPAQ archive.

    What do you suggest?

  7. #7
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by Kaw View Post
    Well, yes and no. There is an independent probability of around 1/2^32 that, if a file is changed, its CRC32 stays the same. So in a very strange situation where every file gets corrupted and you have millions of corrupted files, it becomes very likely that at least one file is corrupted but still has the same CRC32.
    The integrity of the files within an archive (zpaq in my case) is guaranteed by the SHA1 of the individual fragments. So I'm pretty sure that a file marked as not corrupted really isn't.

    I need the checksum to compare the files inside the archive against those in a remote (cloud) folder.

    Let's say we have ~500 different versions of pippo.vmdk, a 400GB+ virtual disk, stored inside copia.zpaq.
    Suppose we also have a remote pippo.vmdk file, and we want to be sure that the two pippo.vmdk files are the same.

    Obviously I do NOT want to extract pippo.vmdk from inside copia.zpaq, calculate the hash, and then compare it with the remote one.

    It can be done (and, in fact, today I do so).

    But it's slow and painful.

    If, on the other hand, I had the CRC32 of pippo.vmdk explicitly "frozen" inside copia.zpaq, it would be very easy to compute the CRC32 of pippo.vmdk on the remote server and compare.

    This is not a problem for 7z, where the CRC32 of the stored files is present.
    But in ZPAQ it isn't there, and cannot be (without patching).
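    One way such a "frozen" whole-file CRC32 could be produced without ever extracting anything (just a sketch; the per-fragment records below are hypothetical, not ZPAQ's real data structures) is to keep a CRC32 and length per fragment and combine them in file order with zlib's crc32_combine():
    Code:
    // Sketch only: rebuild a whole-file CRC32 from per-fragment CRC32s.
    // The Frag struct is hypothetical; ZPAQ does not store this today.
    #include <vector>
    #include <zlib.h>

    struct Frag { unsigned long crc; long len; };     // hypothetical per-fragment record

    unsigned long whole_file_crc(const std::vector<Frag>& frags) {
        unsigned long crc = crc32(0L, Z_NULL, 0);     // CRC of the empty prefix
        for (const Frag& f : frags)
            crc = crc32_combine(crc, f.crc, f.len);   // append this fragment's CRC
        return crc;
    }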

  8. #8
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    620
    Thanks
    269
    Thanked 245 Times in 123 Posts
    Apart from the CRC32/SHA1 topic (I also think that CRC32 is enough for your use case), note that the modified behaviour leads to a somewhat weaker result (if I understood your description of the new method correctly). Extracting to verify is painfully slow, but compares the current state of the file inside the archive. On the other hand, the new method compares against the stored CRC32, which is very fast, but it compares to the state of the file when it was added to the archive. So there are three possible new effects:

    1. False positive: Parts of the archive get corrupted, but the stored CRC32 stays the same. This will lead to the result "file is the same" although the file content is corrupted and can't be restored correctly (well, depending on the other hashes and the way ZPAQ deals with errors).

    2. False negative: The stored CRC32 gets corrupted, but the archived file content stays the same. This will lead to the result "file is not the same" although the file content is identical.

    3. "True negative": The stored CRC32 and the archived file content gets corrupted. This is not that interesting as most likely other parts of the archive will be corrupted, too, so the error would be detected anyway.
    http://schnaader.info
    Damn kids. They're all alike.

  9. #9
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by schnaader View Post
    Apart from the CRC32/SHA1 topic (I also think that CRC32 is enough for your use case), note that the modified behaviour leads to a somewhat weaker result (if I understood your description of the new method correctly). Extracting to verify is painfully slow,
    Sometimes near impossible, or very difficult.

    but compares the current state of the file inside the archive. On the other hand, the new method compares against the stored CRC32, which is very fast, but it compares to the state of the file when it was added to the archive. So there are three possible new effects:

    1. False positive: Parts of the archive get corrupted, but the stored CRC32 stays the same. This will lead to the result "file is the same" although the file content is corrupted and can't be restored correctly (well, depending on the other hashes and the way ZPAQ deals with errors).
    It's unlikely; file integrity is checked by SHA1.
    2. False negative: The stored CRC32 gets corrupted, but the archived file content stays the same. This will lead to the result "file is not the same" although the file content is identical.
    Unlikely as well; the checksum is stored inside an SHA1-checked block.
    3. "True negative": The stored CRC32 and the archived file content gets corrupted. This is not that interesting as most likely other parts of the archive will be corrupted, too, so the error would be detected anyway.
    ZPAQ can check if a file is SHA1-OK (apart from collisions, of course), but cannot quickly compute a checksum.

    At least until I fix the last bug of my patched ZPAQ.

    Back to the question: CRC32, CRC32c, MD5, SHA1, SHA256 or whatever?

  10. #10
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    70
    Thanks
    16
    Thanked 21 Times in 16 Posts
    Is https://github.com/Cyan4973/xxHash in the running? It's non-cryptographic like CRC32, etc., but should be substantially faster. It looks like for data corruption/integrity checking you don't need anything stronger than CRC variants or other fast hashes.
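    For what it's worth, the streaming interface is only a handful of calls. A minimal sketch against the XXH64 API (the XXH3/XXH128 calls are analogous; error checks are omitted):
    Code:
    // Minimal sketch: streaming XXH64 over a file with the xxHash library.
    #include <cstdio>
    #include "xxhash.h"

    XXH64_hash_t xxh64_of_file(const char* path) {
        FILE* f = fopen(path, "rb");
        if (!f) return 0;                       // error handling omitted for brevity
        XXH64_state_t* st = XXH64_createState();
        XXH64_reset(st, 0);                     // seed 0
        char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            XXH64_update(st, buf, n);           // feed the file in chunks
        XXH64_hash_t h = XXH64_digest(st);
        XXH64_freeState(st);
        fclose(f);
        return h;
    }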

  11. #11
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    This is my first "checksum-enabled" ZPAQ (very dirty, full of debug code and bugs, no resilience, does not compile on *nix, and so on; just a working build).

    Currently with SHA1.

    I will read something about MD4 (yes, MD4), but I'm too lazy to work on "strange" code (ZPAQ is quite enough for me) and to run some tests.
    Attached Files

  12. #12
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    64
    Thanks
    31
    Thanked 14 Times in 11 Posts
    In fact, CRC algorithms become less reliable as the amount of data increases.

    The recommendation is to use CRC-n for amounts of data up to 2^n bits.
    That means for CRC-32 the recommended limit is 2^32 bits, or 2^29 bytes (512 MByte).

    https://www.ece.unb.ca/tervo/ece4253/crc.shtml

  13. #13
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    First test (on my PC, single thread, no slicing):

    crc32c with HW instructions (SSE4.2) ~1.4GB/s
    crc32c SW only 0.9GB/s
    ZPAQ's SHA1 ~0.6GB/s (not bad)
    sha1deep64 ~0.6GB/s
    md5deep64 ~0.6GB/s
    openssl sha1 ~0.6GB/s

  14. #14
    Member
    Join Date
    Jan 2020
    Location
    Chagrin Falls, OH
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts

  15. #15
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    ... because I do not know of any already-written C++ library without a too-exotic interface.

  16. #16
    Member
    Join Date
    Jan 2020
    Location
    Chagrin Falls, OH
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by fcorbelli View Post
    ... because I do not know of any already-written C++ library without a too-exotic interface.
    You don't know OpenSSL? It's not C++, but C++ can use C libraries just fine, and the interface is very easy to use.
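    For example, hashing a file with the EVP interface is just a few calls. A minimal sketch (OpenSSL 1.1+; error handling omitted; swap EVP_sha1() for EVP_sha256() and so on):
    Code:
    // Minimal sketch: SHA1 of a file via OpenSSL's EVP digest interface.
    #include <cstdio>
    #include <openssl/evp.h>

    void sha1_of_file(const char* path, unsigned char out[EVP_MAX_MD_SIZE], unsigned* outlen) {
        FILE* f = fopen(path, "rb");
        EVP_MD_CTX* ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_sha1(), NULL);
        unsigned char buf[1 << 16];
        size_t n;
        while (f && (n = fread(buf, 1, sizeof(buf), f)) > 0)
            EVP_DigestUpdate(ctx, buf, n);      // feed the file in chunks
        EVP_DigestFinal_ex(ctx, out, outlen);   // 20-byte digest for SHA1
        EVP_MD_CTX_free(ctx);
        if (f) fclose(f);
    }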

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,013
    Thanks
    303
    Thanked 1,328 Times in 759 Posts

  18. #18
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by hotaru View Post
    you don't know OpenSSL? it's not C++, but C++ can use C libraries just fine and the interface is very easy to use.
    Thank you, but I need something I can compile directly into the source.

  19. #19
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by Shelwien View Post
    I am looking at the Intel one, but of course no real demo is available.
    I will email the authors for an md5sum-like source.
    They claim to have the fastest SHA1, and I think that is possible.

  20. #20
    Member
    Join Date
    Jun 2015
    Location
    Moscow
    Posts
    33
    Thanks
    4
    Thanked 5 Times in 4 Posts
    Quote Originally Posted by Stefan Atev View Post
    Is https://github.com/Cyan4973/xxHash in the running? It's non-cryptographic like CRC32, etc., but should be substantially faster. It looks like for data corruption/integrity checking you don't need anything stronger than CRC variants or other fast hashes.
    I think XXH128 is a good solution. On an E5-2689:

    Code:
    xxhsum.exe 0.8.0 by Yann Collet                              
    64-bit x86_64 autoVec little endian with Clang 10.0.0 
    Sample of 100 KB...
    1#XXH32                         :     102400 ->    59266 it/s ( 5787.7 MB/s)
    3#XXH64                         :     102400 ->    95970 it/s ( 9372.1 MB/s)
    5#XXH3_64b                      :     102400 ->   180706 it/s (17647.1 MB/s)
    11#XXH128                       :     102400 ->   178667 it/s (17447.9 MB/s)

  21. Thanks (2):

    fcorbelli (12th November 2020),standolf (12th November 2020)

  22. #21
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    25
    Thanks
    22
    Thanked 4 Times in 4 Posts
    Just a small note: I think MD5 and SHA1 are not as slow as you think; maybe you are compressing and hashing directly in memory. But even with the fastest SSD I think you are still bound by SSD speed, which is much slower than the speed you can get from hashing. And there are really optimized (unrolled, etc.) SHA and MD5 implementations in software.

    Also, a curiosity: maybe some of you remember when the MAME project started using MD5 and SHA1 in its dat files; these hashes are also pretty common and standard in game preservation collections, where you deal with millions of files in the same dataset. The reason, as I remember it, was that they started getting collisions between files that were not equal, so the dat files (a dat is a text file used to identify files at the binary level, containing names, hashes, etc.) needed better hashes.
    The biggest dataset I've seen in preservation has nearly 16 million files; I don't know the total size...

  23. #22
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    425
    Thanks
    16
    Thanked 40 Times in 32 Posts
    Quote Originally Posted by anormal View Post
    Just a small note: I think MD5 and SHA1 are not as slow as you think (...)
    On my PC, "my" (hardware SSE) CRC32 is about 10x faster than "my" SHA1 (libzpaq).
