
Thread: Compute overall SHA1 from a SHA1 series of fragments?

  #1
    fcorbelli (Member, Italy)

    Compute overall SHA1 from a SHA1 series of fragments?

    I would like to know, please, whether it is possible to calculate the SHA1 hash of a stream of bytes (a file, in fact) that has been split into a number of fragments of variable length (the last one is typically shorter), of which you know the SHA1 codes.

    Essentially I'm looking for a way to calculate the SHA1 code of the files stored in a ZPAQ archive.

    Currently it seems to me that this is not possible: everything is divided into blocks of a few KB, each of which has its own SHA1 code, without a "global" SHA1.

    Thank you

  #2
    Shelwien (Administrator, Kharkov, Ukraine)
    It's impossible: SHA1 is not that kind of hash.
    At most it may be possible to reuse the sha1(block1) value to compute sha1(block1+block2),
    but even that would only work for blocks whose size is a multiple of 64 bytes.

    However, SHA1 specifically is not essential to the zpaq format; in fact there is a potential problem with the published SHA1 collision example files:
    https://stackoverflow.com/questions/...n-demo-example

    So I think you can look into designing a custom hash function with the necessary properties.
    For example, CRC has that property: https://stackoverflow.com/questions/...ith-known-crcs
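    A minimal sketch of that combining property with zlib (crc32() and crc32_combine() are standard zlib calls; the sample buffers are just illustrative):

    Code:
    // CRC of a concatenation, derived from the two fragment CRCs plus the
    // length of the second fragment, without re-reading the data
    #include <zlib.h>
    #include <cassert>
    #include <string>

    int main() {
      const char a[] = "fragment one ";
      const char b[] = "fragment two";
      uLong crcA = crc32(0L, (const Bytef*)a, sizeof(a)-1);
      uLong crcB = crc32(0L, (const Bytef*)b, sizeof(b)-1);
      uLong combined = crc32_combine(crcA, crcB, sizeof(b)-1);
      // same value as one pass over the whole stream
      std::string ab = std::string(a) + b;
      assert(combined == crc32(0L, (const Bytef*)ab.data(), ab.size()));
      return 0;
    }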

  Thanks: fcorbelli (6th November 2020)

  #3
    fcorbelli (Member, Italy)
    I was pretty sure, but I preferred to ask for a second opinion.

    I am not at all worried about SHA1 collisions in zpaq, but I was trying to modify its source to get the SHA1 of every single file (for verification purposes with rsync).

  #4
    Bulat Ziganshin (Programmer, Uzbekistan)
    I would say that any hash with the first or second property would be a bad crypto hash. SHA's final value is only produced after a padding block that encodes the message length, so you cannot simply resume hashing from it.

    Even if you don't care about cryptography, it would be a bad idea to release a product that can easily be used to attack its users.
    Last edited by Bulat Ziganshin; 6th November 2020 at 23:30.

  #5
    fcorbelli (Member, Italy)
    Unfortunately, I did not design how ZPAQ operates, for better or for worse.

    Looking at the source, I have not found an effective method for checking the integrity of individual files; if I understand Matt's code correctly, the check is carried out by verifying that the individual fragments have their respective correct SHA1 codes.
    If every fragment is OK => the entire file is OK.

    Or at least that is how I interpret this portion of the software, where the file's data is compared with the internal fragments, of size <= 4 KB.

    It is not clear to me what happens if a fragment is larger than 4 KB (assuming that is even possible; maybe it is not).
    It would be very easy to ask Matt, but he no longer answers questions about this program, at least not to me.

    So there is SHA1(fragment1), SHA1(fragment2), SHA1(...), SHA1(fragment-n).
    I think I will make my own "fake SHA1" (!!!), something like

    SHA1 ( SHA1(fragment1) U SHA1(fragment2) U (...) SHA1(fragment-n) )

    and a special "fake SHA1" calculator that, given a file and a fixed fragment size, does the job.

    Not for cryptographic purposes, but as a checksum.
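    A minimal sketch of that "fake SHA1", assuming a fixed fragment size and using libzpaq::SHA1 the same way as in the snippet below (the function name is just illustrative):

    Code:
    // "fake SHA1": SHA1 over the juxtaposed per-fragment SHA1 digests
    #include "libzpaq.h"
    #include <cstdio>
    #include <string>
    #include <vector>

    std::string fakeSha1(FILE* in, size_t fragSize=4096) {
      libzpaq::SHA1 outer;                     // hashes the concatenated 20-byte digests
      std::vector<char> buf(fragSize);
      size_t n;
      while ((n=fread(buf.data(), 1, fragSize, in))>0) {
        libzpaq::SHA1 inner;                   // per-fragment SHA1
        inner.write(buf.data(), n);
        outer.write(inner.result(), 20);       // the "U" above = concatenation of digests
      }
      return std::string(outer.result(), 20);  // 20 raw bytes, not hex
    }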

    I can't think of other solutions, at least at the moment.

    The question is not trivial, because ZPAQ seems to use a sort of "rolling hash" that produces fragments of variable size (and again it is not clear to me what happens above 4 KB).
    I have to study the source better.

    Code:
    // compare hashes: for each fragment id in this file's pointer list,
    // re-read that fragment from the extracted file (BUFSIZE is only the
    // read-buffer size) and compare its SHA1 with the stored ht[f].sha1
      fseeko(in, 0, SEEK_SET);
      libzpaq::SHA1 sha1;
      const int BUFSIZE=4096;
      char buf[BUFSIZE];
      for (unsigned i=0; i<p->second.ptr.size(); ++i) {
        unsigned f=p->second.ptr[i];
        if (f<1 || f>=ht.size() || ht[f].usize<0) return fclose(in), false;
        for (int j=0; j<ht[f].usize;) {  // ht[f].usize = uncompressed fragment size
          int n=ht[f].usize-j;
          if (n>BUFSIZE) n=BUFSIZE;
          int r=fread(buf, 1, n, in);
          if (r!=n) return fclose(in), false;
          sha1.write(buf, n);
          j+=n;
        }
        if (memcmp(sha1.result(), ht[f].sha1, 20)!=0) return fclose(in), false;
      }

    So, in summary, I don't think it is possible to get ZPAQ to print the SHA1 of every single file, which is what I needed to speed up the remote integrity check, because those hashes simply are not there.

    That's why I appreciated confirmation from those who are more experienced than me in this area.

  #6
    fcorbelli (Member, Italy)
    [Attached image: concatenate.jpg]

    I found a way to compute the SHA1 hashes of the whole files in a ZPAQ archive by modifying one of Matt's programs (unzpaq206.cpp) into unz.cpp.

    It is a very slow method, but it can be fine for nightly crontab tasks on a remote machine.

    Essentially the goal is to obtain one SHA1 of all the files present, after they have been sorted alphabetically and then juxtaposed (-all), as if they were a single big file.

    This is compared with the SHA1, likewise sorted and juxtaposed, of all the files in a local folder (as usual excluding .zfs and :$DATA).

    If the codes are identical then, by transitivity, you can be pretty sure that the remote ZPAQ file contains exactly all the files in the local folder (not quite 100%; I should also add file sizes to get to 100%).

    It's a much tighter check than just extracting from a remote zpaq and testing that everything is "all OK".

    Unfortunately it's slow (it works one byte at a time, single threaded!).

    Maybe over time I will be able to improve Matt's work.
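    For the local side, the check described above could look roughly like this (a sketch only: C++17 <filesystem> plus libzpaq::SHA1, with the .zfs and :$DATA exclusions omitted):

    Code:
    // one running SHA1 over all files of a folder, sorted by name and juxtaposed
    #include "libzpaq.h"
    #include <algorithm>
    #include <cstdio>
    #include <filesystem>
    #include <string>
    #include <vector>

    std::string folderSha1(const std::string& dir) {
      namespace fs = std::filesystem;
      std::vector<std::string> names;
      for (const auto& e : fs::recursive_directory_iterator(dir))
        if (e.is_regular_file()) names.push_back(e.path().string());
      std::sort(names.begin(), names.end());       // alphabetical order
      libzpaq::SHA1 sha1;
      std::vector<char> buf(1<<16);
      for (const std::string& n : names) {
        FILE* f = fopen(n.c_str(), "rb");
        if (!f) continue;
        size_t r;
        while ((r = fread(buf.data(), 1, buf.size(), f)) > 0)
          sha1.write(buf.data(), r);               // juxtapose the file contents
        fclose(f);
      }
      return std::string(sha1.result(), 20);       // compare with the remote side
    }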

    ====
    In summary, this is the scenario.

    Locally there is the pippo directory, which contains many different files.
    Create the file copia.zpaq, containing the data of the pippo folder, with zpaq.

    Send the copia.zpaq file to a cloud server, on a scheduled basis, via rsync over an RSA-keyed ssh tunnel or whatever.

    ====
    How can I verify that the content of the remote copia.zpaq matches the local pippo folder?

    There are several ways, all of them uncomfortable.
    It is not a trivial task when there are ~1 TB of data, ~1M files and ~1K versions.

    The first is to extract the entire contents of the copia.zpaq file locally into a temporary folder
    and compare each individual file with the data (pippo) folder.

    And finally compare the SHA1 of the local copia.zpaq with that of the remote copia.zpaq.

    For large amounts of data, particular care is required for the wear on mass-storage devices (writes of hundreds of GB every day), enabling ZFS deduplication to save space, and so on.
    In short, feasible, but not for everyone.

    ====
    With unz and zpaqfranz-35 two commands are enough, the first executed remotely, the second locally, for the SHA1 calculation with the -all option (file sizes will be added in a later build for 100%).

    I hope the relevance for a data storage scientist is clear.

  #7
    SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    Or at least that is how I interpret this portion of the software, where the file's data is compared with the internal fragments, of size <= 4 KB. [...] It is not clear to me what happens if a fragment is larger than 4 KB (assuming that is even possible; maybe it is not).

    const int BUFSIZE=4096; is not the fragment size.

    By default zpaq's fragment size is in the range [64<<6 ... 8128<<6] = [4096 ... 520192] bytes;
    the fragment size can be less than 4096 only when the last block of a file is less than 4096.

    To implement your functionality, you should store the SHA1 of the entire file somewhere in the archive itself; I think the c block is great for this.
    The zpaq format has only one requirement for a c block (data size >= 8 bytes),
    so you can create a larger c block and store the SHA1s for the files of that transaction.

  #8
    fcorbelli (Member, Italy)
    unz.cpp (slooooowlyyy) computes the SHA1 of each file.
    I'd like to extend the zpaq format without breaking compatibility.

  #9
    SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    unz.cpp (slooooowlyyy) computes the SHA1 of each file.
    I'd like to extend the zpaq format without breaking compatibility.
    This is what I suggested: use the c block to store the SHA1, and that will not break compatibility with the zpaq format.

    Now the c block holds usize[8]
    --->

    usize[8] + sha1[N], where N is the number of files in the transaction
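    A sketch of how that extended c-block payload might be laid out (field order, endianness and the helper name are assumptions, not the actual zpaq specification):

    Code:
    // c-block payload: usize[8] followed by one 20-byte SHA1 per file
    #include <cstdint>
    #include <string>
    #include <vector>

    struct FileSha1 { char sha1[20]; };          // one digest per file in the transaction

    std::string buildCBlockData(uint64_t usize, const std::vector<FileSha1>& files) {
      std::string data;
      for (int i = 0; i < 8; ++i)                // usize[8], little-endian here by assumption
        data += char((usize >> (8*i)) & 255);
      for (const FileSha1& f : files)            // sha1[N]
        data.append(f.sha1, 20);
      return data;                               // always >= 8 bytes, meeting the c-block minimum
    }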

  #10
    fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    This is what I suggested: use the c block to store the SHA1, and that will not break compatibility with the zpaq format.

    Now the c block holds usize[8]
    --->

    usize[8] + sha1[N], where N is the number of files in the transaction
    Let me try to explain myself better.

    It is not possible for me to run a SHA1/MD5/whatever calculation across multiple threads, as it is essentially a serial computation.

    So, if I am not mistaken, there is no method that can be applied to ZPAQ to compute the SHA1 code of an entire file, given that the file is split, in the compression phase, across multiple concurrent threads.

    In this paper
    https://pkolano.github.io/papers/lisa10.pdf
    there is a parallel, but non-standard, checksum calculation system that is very interesting (msum).

    Therefore the problem is not only "where" to store the SHA1 code of the individual files in the ZPAQ format, but "how" to calculate it.
    Even the extraction phase (where the SHA1 would also have to be recomputed) is divided into threads, with a large output file being filled progressively.

    Code:
            // each extraction thread writes its decoded bytes directly at the
            // correct offset of the (shared) output file
            if (!job.jd.dotest && (nz<q+usize || j+1==ptr.size())) {
              fseeko(job.outf, offset, SEEK_SET);
              fwrite(out.c_str()+q, 1, usize, job.outf);
            }

    And here
    https://code.google.com/archive/p/crcutil/downloads
    is how to "combine" (slicing) data so that the CRC can be computed in parallel.

    BUT... where to store 4 more bytes (a CRC) per fragment?
    Last edited by fcorbelli; 9th November 2020 at 19:49.

  #11
    SpyFX (Member, Moscow)
    Let's figure out how zpaq works (simplified algorithm):

    1. The main thread gets the list of files to add.
    2. The main thread creates a set of threads that compress blocks (sets of fragments).
    3.1 The main thread reads all files sequentially.
    3.2 Each file is divided into fragments.
    3.3 SHA1 is calculated for each fragment, and the fragment is added to the current block if it is new.
    3.4 When the block is full, the main thread sends it to a thread that compresses the block.

    What prevents calculating the SHA1 of all the fragments of one file?
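    Since the main thread already sees every file's bytes in order while fragmenting, it could keep a second, whole-file SHA1 running over the same bytes; a minimal sketch of that idea (Fragment and the function name are illustrative, not real zpaq identifiers):

    Code:
    // per-fragment SHA1 as zpaq already does, plus a running whole-file SHA1
    #include "libzpaq.h"
    #include <string>
    #include <vector>

    struct Fragment { std::string data; };               // one content-defined chunk

    std::string hashFileWhileFragmenting(const std::vector<Fragment>& frags) {
      libzpaq::SHA1 fileHash;                             // runs over the whole file
      for (const Fragment& f : frags) {
        libzpaq::SHA1 fragHash;                           // per-fragment digest
        fragHash.write(f.data.data(), f.data.size());
        const char* digest = fragHash.result();           // stored per fragment by zpaq
        (void)digest;
        fileHash.write(f.data.data(), f.data.size());     // same bytes, whole-file digest
      }
      return std::string(fileHash.result(), 20);          // could be stored per the c-block idea
    }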

  #12
    fcorbelli (Member, Italy)
    What prevents calculating the SHA1 of all the fragments of one file?
    Nothing, but it is useless: that is exactly the question of this thread.
    You cannot build the SHA1 of one entire file with an unordered number of concurrent threads,
    nor by combining the fragments' SHA1s the way zlib's crc32_combine does for CRC32.
    Therefore SHA1 is useless in this case (quickly computing a checksum of every file inside the zpaq).
    I am thinking about a more radical approach (a hidden special file with a real database inside).

  #13
    fcorbelli (Member, Italy)
    OK, I found a way, without breaking backward compatibility (maybe), using the i blocks and adding an "offset" into the attr, something like this, extending the DT structure:
    Code:
    if ((p->second.attr&255)=='u')
    {   // unix attributes: size field, then attr padded with FRANZOFFSET extra bytes
        puti(is, 3+FRANZOFFSET, 4);
        puti(is, p->second.attr, 3+FRANZOFFSET);
    }
    else if ((p->second.attr&255)=='w')
    {   // windows attributes: same scheme, 5 base bytes plus FRANZOFFSET
        puti(is, 5+FRANZOFFSET, 4);
        puti(is, p->second.attr, 5+FRANZOFFSET);
    }
    Now I have a second question: is CRC32 (or CRC32c) enough?
    I started another thread for that,

    this one:

    https://encode.su/threads/3513-Check...-CRC32c-enough
    Last edited by fcorbelli; 10th November 2020 at 20:36.

