
Thread: zpaq updates

  1. #2581 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    I get a list of blocks from the archive,
    read the blocks sequentially, and send them to the processing threads.
    Each thread gets an original block and compresses it again (from -m1 to -mX);
    the new block is sent to the common heap,
    and one thread reads the finished blocks and saves them to a new file.

    I also have to repackage the h-blocks, because the first Int32 in an h-block contains the size of the corresponding d-block.


    As an example, I also made a variant that glues blocks together, from -m14 blocks to -m1x (5[32MB] .... 9[512MB]).
    This is of course a little more complicated, but I did that too.
    WOW.
    Can you please post the source, so I can learn something new?
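    In outline, the pipeline SpyFX describes is a classic bounded producer/consumer chain: one reader, N compressor threads, one writer. A minimal sketch of that shape (not SpyFX's actual code; Block, recompress() and Channel are hypothetical stand-ins):
    Code:
    // One reader feeds N workers; workers push results to a shared queue;
    // a single writer drains it, reordering by Block::index before writing.
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Block { size_t index; std::vector<char> data; };

    template <class T>
    struct Channel {                      // simple blocking queue
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;
        void push(T v) {
            { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); }
            cv.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> l(m); closed = true; }
            cv.notify_all();
        }
        bool pop(T& v) {                  // returns false when drained and closed
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return !q.empty() || closed; });
            if (q.empty()) return false;
            v = std::move(q.front());
            q.pop();
            return true;
        }
    };

    // Placeholder: the real work would re-encode the block from -m1 to -mX
    Block recompress(Block b) { return b; }

    void run_workers(Channel<Block>& in, Channel<Block>& out, unsigned nthreads) {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < nthreads; ++i)
            workers.emplace_back([&] {
                Block b;
                while (in.pop(b))
                    out.push(recompress(std::move(b)));
            });
        for (auto& w : workers) w.join();
        out.close();                      // unblocks the writer thread
    }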

  2. #2582 · fcorbelli (Member, Italy)
    I'm putting a series of help examples into my fork of ZPAQ.
    Any interesting command lines to suggest adding?

    Currently
    Code:
    +++ ADDING
    
    zpaqfranz a z:\backup.zpaq c:\data\* d:\pippo\* -summary 1
    Add all files from c:\data and d:\pippo to the archive z:\backup.zpaq
    
    zpaqfranz a z:\backup.zpaq c:\vecchio.sql r:\1.txt -summary 1 -noeta
    Add two files to the archive z:\backup.zpaq, no ETA (for batch files)
    
    Windows running with administrator privileges:
    zpaqfranz a z:\backup.zpaq c:\users\utente\* -summary 1 -vss -pakka
    Delete all VSS, create a fresh one of C:\, back up the 'utente' profile
    
    +++ SUPPORT FUNCTIONS
    
    zpaqfranz s r:\vbox s:\uno
    Sum of file sizes in the directories, EXCLUDING .zfs and NTFS extensions
    
    zpaqfranz sha1 r:\vbox s:\uno -all
    Get the SHA1 of all files (concatenated) in the two directories
    zpaqfranz sha1 r:\vbox s:\uno
    zpaqfranz sha1 r:\vbox s:\uno -crc32c
    zpaqfranz sha1 r:\vbox s:\uno -sha256
    zpaqfranz sha1 r:\vbox s:\uno -xxhash
    zpaqfranz sha1 r:\vbox s:\uno -crc32

  3. #2583 · fcorbelli (Member, Italy)
    This is the zpaqfranz-38 version.
    I did a "merge" (!) of unzpaq206 (!) patched for file verification and SHA1 injection,
    with hardware crc32c, crc32, xxhash64, and sha256.
    Highly experimental; the source is bad, there is no handling of special cases,
    multipart volumes and so on, and it compiles only on Windows.

    http://www.francocorbelli.it/zpaqfranz.exe

    Essentially it can store the individual SHA1 hashes of files within the ZPAQ while maintaining backward compatibility (so 7.15 and the like should still read the generated files easily).

    The main parameter is -checksum (on add), which stores the SHA1s.
    You can use the -test switch (or the t command) for paranoid ex-post testing.

    Do this to store the individual SHA1s:
    Code:
    zpaqfranz a z:\test1.zpaq f:\zarc\ok\ -checksum -summary 1
    or
    Code:
    zpaqfranz a z:\test1.zpaq f:\zarc\ok\ -checksum -summary 1 -test
    for a really paranoid verify after compression

    To see the SHA1s (if any)
    Code:
    zpaqfranz l z:\test1.zpaq
    To re-check
    Code:
    zpaqfranz t z:\test1.zpaq
    Example
    Code:
    Total file: 00001111
    ERRORS    : 00000000 (ERROR:  something WRONG)
    Level 0   : 00000000 (UNKNOWN:cannot verify)
    Level 1   : 00000000 (GOOD:   decompressed=file on disk)
    Level 2   : 00001111 (SURE:   stored=decompressed=file on disk)
    All OK (paranoid test)
    In this case all 1111 files in the ZPAQ have a stored SHA1, which matches the file on disk and,
    when decompressed from the ZPAQ, yields the same SHA1.

    In other words: given a pippo.jpg with SHA1 "27", the string "27 pippo.jpg" is stored in the ZPAQ and, when pippo.jpg is extracted, the extracted file hashes back to SHA1 "27".

    Any feedback is welcome

  4. #2584 · fcorbelli (Member, Italy)
    Wow, really great feedback!
    Maybe I should avoid further updates, so as not to bloat the forum.

    OK, the -39 version compiles with gcc 9.3.0 (FreeBSD; I do not use Linux):
    Code:
    g++ -O3 -march=native -Dunix zpaq.cpp libzpaq.cpp -pthread -o zpaq -static-libstdc++ -static-libgcc
    Finally, I added a function that I have wanted to implement for some time: -find during listing.
    Something like | grep -i something:
    Code:
    zpaqfranz l foo.zpaq -find something
    or even
    Code:
    zpaqfranz l foo.zpaq -find something -replace somethingelse
    The -pakka switch can be used to produce a SHA1 list for checking with sha1sum etc.
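    A minimal sketch of what such a listing filter does (a hypothetical helper, not the zpaqfranz source):
    Code:
    #include <algorithm>
    #include <cctype>
    #include <cstdio>
    #include <string>

    static std::string lowercase(std::string s)
    {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return s;
    }

    // Print 'name' only if it contains 'find' (case-insensitive, like grep -i),
    // substituting the first match with 'replace' when one is given.
    void listline(const std::string& name, const std::string& find,
                  const std::string& replace)
    {
        size_t pos = lowercase(name).find(lowercase(find));
        if (pos == std::string::npos)
            return;                       // filtered out of the listing
        std::string out = name;
        if (!replace.empty())
            out = out.substr(0, pos) + replace + out.substr(pos + find.size());
        printf("%s\n", out.c_str());
    }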

    The main problem still present is a monotonically increasing (fast) decompressor:
    something that sorts the individual fragments before extracting them, reducing memory consumption (currently very high).
    I was hoping to find someone who has already faced this (big) limitation, for the "definitive" -test.

    PS: What about an embedded SMTPS client, to send logfiles without needing something like smtp-cli?
    Too fancy?

    Windows 64 binary
    http://www.francocorbelli.it/zpaqfranz.exe

  5. #2585 · fcorbelli (Member, Italy)
    It wasn't super easy, but you can now do an integrity check of ZPAQ files (using the venerable CRC32) at pretty much the rate at which the files are extracted.
    The "trick" consists in keeping track of each fragment decompressed in parallel by the various threads, grouping the fragments by file, and then calculating each file's CRC32.
    Broadly speaking, like this:
    Code:
                    uint32_t crc;
                    // CRC32 of this decompressed fragment (usize bytes at offset q)
                    crc=crc32_16bytes (out.c_str()+q, usize);
                    s_crc32block myblock;
                    myblock.crc32=crc;
                    myblock.crc32start=offset;          // fragment offset within the file
                    myblock.crc32size=usize;
                    myblock.filename=job.lastdt->first;
                    g_crc32.push_back(myblock);         // collected out of order; sorted later
    Finally, after sorting the individual blocks...
    Code:
    bool comparecrc32block(const s_crc32block& a, const s_crc32block& b)
    {
        // Zero-padded offsets make the lexicographic order of filename+offset
        // match the numeric order of the fragments within each file.
        char a_start[40];
        char b_start[40];
        sprintf(a_start,"%014lld",a.crc32start);
        sprintf(b_start,"%014lld",b.crc32start);
        return a.filename+a_start<b.filename+b_start;
    }
    the "final" CRC32 can be calculated by combining the individual CRC32s (snippet for a single file)



    Code:
        // Dump the fragments as collected (unsorted)...
        for (auto it = g_crc32.begin(); it != g_crc32.end(); ++it)
            printf("Start %014lld size %14lld (next %14lld) CRC32 %08X %s\n",
                   it->crc32start,it->crc32size,it->crc32start+it->crc32size,it->crc32,it->filename.c_str());
        printf("SORT\n");

        // ...sort them by (filename, offset)...
        sort(g_crc32.begin(),g_crc32.end(),comparecrc32block);

        // ...then fold the per-fragment CRCs into the whole-file CRC32
        uint32_t currentcrc32=0;
        for (auto it = g_crc32.begin(); it != g_crc32.end(); ++it)
        {
            printf("Start %014lld size %14lld (next %14lld) CRC32 %08X %s\n",
                   it->crc32start,it->crc32size,it->crc32start+it->crc32size,it->crc32,it->filename.c_str());
            currentcrc32=crc32_combine (currentcrc32, it->crc32,it->crc32size);
        }
        printf("%08X\n",currentcrc32);
    Getting output like this example:

    Code:
    Start 00000001680383 size          93605 (next        1773988) CRC32 401AB2BB z:/myzarc/Globals.pas
    Start 00000001123668 size         143047 (next        1266715) CRC32 7060F541 z:/myzarc/Globals.pas
    Start 00000001354597 size         239761 (next        1594358) CRC32 E183AA23 z:/myzarc/Globals.pas
    Start 00000001798290 size         110253 (next        1908543) CRC32 ECC91F12 z:/myzarc/Globals.pas
    Start 00000001928573 size          64971 (next        1993544) CRC32 0F19B1F5 z:/myzarc/Globals.pas
    Start 00000000189470 size         372350 (next         561820) CRC32 AFC2ADD7 z:/myzarc/Globals.pas
    Start 00000001266715 size          87882 (next        1354597) CRC32 4086324B z:/myzarc/Globals.pas
    Start 00000001594358 size          86025 (next        1680383) CRC32 92AA9167 z:/myzarc/Globals.pas
    Start 00000001773988 size          24302 (next        1798290) CRC32 44F3E2AE z:/myzarc/Globals.pas
    Start 00000001908543 size          20030 (next        1928573) CRC32 1D7D9886 z:/myzarc/Globals.pas
    Start 00000000059210 size         130260 (next         189470) CRC32 64E567C0 z:/myzarc/Globals.pas
    Start 00000000561820 size         561848 (next        1123668) CRC32 481E0307 z:/myzarc/Globals.pas
    Start 00000000000000 size          59210 (next          59210) CRC32 E1B00AE8 z:/myzarc/Globals.pas
    SORT
    Start 00000000000000 size          59210 (next          59210) CRC32 E1B00AE8 z:/myzarc/Globals.pas
    Start 00000000059210 size         130260 (next         189470) CRC32 64E567C0 z:/myzarc/Globals.pas
    Start 00000000189470 size         372350 (next         561820) CRC32 AFC2ADD7 z:/myzarc/Globals.pas
    Start 00000000561820 size         561848 (next        1123668) CRC32 481E0307 z:/myzarc/Globals.pas
    Start 00000001123668 size         143047 (next        1266715) CRC32 7060F541 z:/myzarc/Globals.pas
    Start 00000001266715 size          87882 (next        1354597) CRC32 4086324B z:/myzarc/Globals.pas
    Start 00000001354597 size         239761 (next        1594358) CRC32 E183AA23 z:/myzarc/Globals.pas
    Start 00000001594358 size          86025 (next        1680383) CRC32 92AA9167 z:/myzarc/Globals.pas
    Start 00000001680383 size          93605 (next        1773988) CRC32 401AB2BB z:/myzarc/Globals.pas
    Start 00000001773988 size          24302 (next        1798290) CRC32 44F3E2AE z:/myzarc/Globals.pas
    Start 00000001798290 size         110253 (next        1908543) CRC32 ECC91F12 z:/myzarc/Globals.pas
    Start 00000001908543 size          20030 (next        1928573) CRC32 1D7D9886 z:/myzarc/Globals.pas
    Start 00000001928573 size          64971 (next        1993544) CRC32 0F19B1F5 z:/myzarc/Globals.pas
    4A99417C
    The key is the CRC32 property of being combinable "in blocks": the CRC of a concatenation can be computed from the CRCs of the pieces plus their lengths (crc32_combine).
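    A self-contained demonstration of that property, using zlib's crc32() and crc32_combine() (assuming zlib provides the crc32_combine used above; the fork may carry its own implementation):
    Code:
    // Demo: crc32(A||B) computed from crc32(A), crc32(B) and len(B) only.
    // Build: g++ crcdemo.cpp -lz
    #include <zlib.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        const char* a = "Hello, ";
        const char* b = "zpaq!";

        uLong seed  = crc32(0L, Z_NULL, 0);    // standard zlib starting value
        uLong crc_a = crc32(seed, (const Bytef*)a, (uInt)strlen(a));
        uLong crc_b = crc32(seed, (const Bytef*)b, (uInt)strlen(b));

        // Combine the piece CRCs: no access to the original bytes is needed
        uLong combined = crc32_combine(crc_a, crc_b, (z_off_t)strlen(b));

        // Reference: CRC32 of the whole concatenated buffer
        char ab[32];
        snprintf(ab, sizeof(ab), "%s%s", a, b);
        uLong whole = crc32(seed, (const Bytef*)ab, (uInt)strlen(ab));

        printf("combined %08lX whole %08lX -> %s\n",
               combined, whole, combined == whole ? "MATCH" : "MISMATCH");
        return 0;
    }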

    Remember that the sequence of blocks is not ordered, and that in general multiple threads are processing simultaneously.

    Using a method similar to the one I implemented in my fork (extending the file's attribute portion), you can store the CRC32 inside the ZPAQ file in a way that is compatible with older versions.

    I guess nobody cares about having a quick method of verifying ZPAQ files, but for me it is one of the most annoying shortcomings.

    It was a difficult job to "disassemble" the logic of ZPAQ without help...
    But... I did it!

  6. #2586 · FatBit (Member, Prague, CZ)
    Quote Originally Posted by fcorbelli View Post
    I guess nobody cares about having a quick method of verifying ZPAQ files…

    Wrong assumption, at least for me…

    Best regards,

    FatBit

  7. #2587 · SpyFX (Member, Moscow)
    P.S. Sorry, my mistake; my post should be deleted (:

  8. #2588 · fcorbelli (Member, Italy)
    This is version 41 of zpaqfranz.

    It is beginning to look like something vaguely functional.

    The -checksum switch (with the add command) stores BOTH the SHA1 and the CRC32 of each file inside the ZPAQ file.

    Those codes can be seen with the normal l (list) command in zpaqfranz.
    zpaq 7.15 should ignore them without errors.

    The t and p commands test the new format.

    The first uses the CRC32 codes (if present) and, if desired, with -force also compares against the files on the filesystem for a double check.
    It is about as fast as the standard test in ZPAQ 7.15.

    The second, p (as in paranoid), does the same on the SHA1s.
    In this case it is MUCH slower and uses MUCH more RAM, so much so that it often crashes on 32-bit systems.

    -verbose gives a more extensive result list.

    Examples
    Code:
    zpaqfranz a z:\1.zpaq c:\zpaqfranz\* -checksum -summary 1 -pakka
    zpaqfranz32 t z:\1.zpaq -force -verbose



    I emphasize that the source is a real mess, due to the "injection" of different programs into the original file, so as not to differentiate it too much from "normal" zpaq.

    It should be corrected and fixed, perhaps in the future.

    To summarize: zpaqfranz-41 can now check file integrity (CRC32) in a way that is (hopefully) perfectly backward compatible with ZPAQ 7.15.

    EXE for Windows 32-bit
    http://www.francocorbelli.it/zpaqfranz32.exe

    EXE for Windows 64-bit
    http://www.francocorbelli.it/zpaqfranz.exe

    Any feedback is welcome.

  9. #2589 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    I looked at your source code and found an incompatibility with zpaq 7.15:


    Code:
    unsigned na=btoi(s);  // attr bytes
    if (s+na>end || na>65535) error("attr too long");
    for (unsigned i=0; i<na; ++i, ++s)  // read attr
      if (i<8) dtr.attr+=int64_t(*s&255)<<(i*8);


    In the original code at most 8 bytes go into dtr.attr, although the limit is set to 65535.
    In your code the value is na = FRANZOFFSET (40), so zpaq 7.15 will have problems decoding the i-block.
    Thank you, but... why?
    The attr does not have a fixed size: it can be 3 or 5 bytes, or 0, or... 55 (50+5) bytes long.
    At least that's what it looks like from the Mahoney source.
    But I could be wrong.

    Indeed, more precisely, I think the solution could be padding the "7.15" attr to 8 bytes (with zeros after the 3 or 5 real bytes), then appending my new 50-byte attr block.

    This way a 7.15 reader can always take 8 bytes, the last 3 or 5 of which are zero, to put into dtr.attr.
    Code:
    -7.15tr- << 40 bytes of SHA1                   >> zero  <<CRC >> zero
    12345678 1234567890123456789012345678901234567890 0     12345678 0
    lin00000 THE-SHA1-CODE-IN-ASCII-HEX-FORMAT-40-BYT 0     ASCIICRC 0
    windo000 THE-SHA1-CODE-IN-ASCII-HEX-FORMAT-40-BYT 0     ASCIICRC 0
    Seems legit?


    Something like this (I know, I know... not very elegant...):
    Code:
    void Jidac::writefranzoffset(libzpaq::StringBuffer& i_sb, uint64_t i_data, int i_quanti,bool i_checksum,string i_filename,char *i_sha1)
    {
        if (i_checksum) /// OK, we put a larger size
        {
    /// experimental fix: pad to 8 bytes (with zeros) for 7.15 enhanced compatibility
    /// in this case 3 attr, 5 pad, then 50
    
            const char pad[FRANZOFFSET+8+1] = {0};
            puti(i_sb, 8+FRANZOFFSET, 4);     // 8+FRANZOFFSET block
            puti(i_sb, i_data, i_quanti);
            i_sb.write(&pad[0],(8-i_quanti));                // pad with zeros (for 7.15 little bug)
            
            if (getchecksum(i_filename,i_sha1))
            {
                i_sb.write(i_sha1,FRANZOFFSET);
                if (!pakka)
                    printf("SHA1 <<%s>> CRC32 <<%s>> %s\n",i_sha1,i_sha1+41,i_filename.c_str());
            }
            else                            // if something wrong, put zeros
                i_sb.write(&pad[0],FRANZOFFSET);
                    
        }
        else
        { // default ZPAQ
            puti(i_sb, i_quanti, 4);
            puti(i_sb, i_data, i_quanti);
        }
    }
    
    
    ....
    if ((p->second.attr&255)=='u') // unix attributes
     writefranzoffset(is,p->second.attr,3,checksum,filename,p->second.sha1hex);
    else 
    if ((p->second.attr&255)=='w') // windows attributes
     writefranzoffset(is,p->second.attr,5,checksum,filename,p->second.sha1hex);
    else 
     puti(is, 0, 4);  // no attributes
    From that observation I found a possible bug: what happens if the CRC calculation fails? I should actually compute it FIRST and, only if it is OK, insert a FRANZOFFSET block.

    In other words: if a file cannot be opened, I can save space by NOT storing the SHA1 and CRC32.
    Next release...
    Last edited by fcorbelli; 24th November 2020 at 18:35.

  10. #2590 · SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    Thank you, but... why?
    The attr does not have a fixed size: it can be 3 or 5 bytes, or 0, or... 55 (50+5).
    At least that's what it looks like from the Mahoney source.
    But I could be wrong.
    P.S. Sorry, I deleted my post.

  11. #2591 · SpyFX (Member, Moscow)
    Yes, right: for your extension you should reserve 8 bytes for compatibility with zpaq 7.15, and then place the checksum.
    Also, in your code I don't see where the checksum goes if there are no attributes.

  12. Thanks: fcorbelli (24th November 2020)

  13. #2592 · fcorbelli (Member, Italy)
    ... no attributes, no checksum ...
    (I never use -noattributes!)
    I'll fix it with a function (I edited the previous post).

    I attach the current source

    EDIT: do you know how to "intercept" the blocks just before they are written to disk, in the add() function?

    I am trying to compute the CRC32 of each file from the resulting compressed blocks, sorted (as I do for verification). That would save re-reading the file from disk (eliminating the SHA1 calculation altogether).

    In short: during add(), for each file and for each compressed block (even out of order), I want to save the block in my vector and then process it.
    Can you help me?

  14. #2593 · SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    ... no attributes, no checksum ...
    (I never use -noattributes!)
    I'll fix it with a function (I edited the previous post).

    I attach the current source

    OK,

    my small fix:

    delete the line const char pad[FRANZOFFSET+8+1] = {0};
    change i_sb.write(&pad[0],(8-i_quanti)); to puti(i_sb, 0, (8-i_quanti));

    Full code, no zero buffer:

    Code:
    if (getchecksum(i_filename, i_sha1))
    {
        puti(i_sb, 8 + FRANZOFFSET, 4);   // 8+FRANZOFFSET block
        puti(i_sb, i_data, i_quanti);
        puti(i_sb, 0, (8 - i_quanti));

        i_sb.write(i_sha1, FRANZOFFSET);
        if (!pakka)
            printf("SHA1 <<%s>> CRC32 <<%s>> %s\n", i_sha1, i_sha1 + 41, i_filename.c_str());
    }
    else
    {
        puti(i_sb, 8, 4);
        puti(i_sb, i_data, i_quanti);
        puti(i_sb, 0, (8 - i_quanti));
    }

  15. #2594 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    OK,

    my small fix:

    delete the line const char pad[FRANZOFFSET+8+1] = {0};
    change i_sb.write(&pad[0],(8-i_quanti)); to puti(i_sb, 0, (8-i_quanti));
    Good: only a max of 8 bytes are needed (puti writes a uint64_t)...

    BUT...
    I am too lazy to iterate to write an empty FRANZOFFSET block if something goes wrong:
    i_sb.write(&pad[0],FRANZOFFSET);



  16. #2595 · SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    Good: only a max of 8 bytes are needed (puti writes a uint64_t)...

    BUT...
    I am too lazy to iterate to write an empty FRANZOFFSET block if something goes wrong:
    i_sb.write(&pad[0],FRANZOFFSET);


    If attr size == (FRANZOFFSET + 8), then the checksum is OK; else, if attr size == 8, checksum error?

  17. #2596 · fcorbelli (Member, Italy)
    I think this should be OK.

    Code:
    void Jidac::write715attr(libzpaq::StringBuffer& i_sb, uint64_t i_data, unsigned int i_quanti)
    {
        assert(i_quanti<=8);
        puti(i_sb, i_quanti, 4);
        puti(i_sb, i_data, i_quanti);
    }

    void Jidac::writefranzattr(libzpaq::StringBuffer& i_sb, uint64_t i_data, unsigned int i_quanti, bool i_checksum, string i_filename, char *i_sha1)
    {
        /// experimental fix: pad to 8 bytes (with zeros) for 7.15 enhanced compatibility
        if (i_checksum)
        {
            assert(i_sha1);
            assert(i_filename.length()>0);  // I do not like empty()
            assert(i_quanti<8);             // ensure at least 1 zero pad, so < and not <=
            if (getchecksum(i_filename, i_sha1))
            {
                puti(i_sb, 8+FRANZOFFSET, 4);   // 8+FRANZOFFSET block
                puti(i_sb, i_data, i_quanti);
                puti(i_sb, 0, (8 - i_quanti));  // pad with zeros (for the 7.15 quirk)
                i_sb.write(i_sha1, FRANZOFFSET);
                if (!pakka)
                    printf("SHA1 <<%s>> CRC32 <<%s>> %s\n", i_sha1, i_sha1+41, i_filename.c_str());
            }
            else
                write715attr(i_sb, i_data, i_quanti);
        }
        else
            write715attr(i_sb, i_data, i_quanti);
    }
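    For the reading side, SpyFX's size test above suggests the whole detection logic. A hypothetical sketch of the counterpart parser (not the actual zpaqfranz code), assuming the layout drawn earlier: 8 zero-padded attr bytes, then 40 hex SHA1 chars, a NUL, 8 hex CRC32 chars, a NUL:
    Code:
    #include <cstdint>
    #include <string>

    const int FRANZOFFSET = 50;   // 40 hex SHA1 + NUL + 8 hex CRC32 + NUL

    // Returns false for a plain 7.15 attr (no checksum stored).
    bool readfranzattr(const char* attr, unsigned na, int64_t& o_attr,
                       std::string& o_sha1hex, std::string& o_crc32hex)
    {
        if (na != 8 + FRANZOFFSET)
            return false;
        o_attr = 0;
        for (int i = 0; i < 8; ++i)               // same decoding as zpaq 7.15
            o_attr += int64_t(attr[i] & 255) << (i * 8);
        o_sha1hex.assign(attr + 8, 40);           // ASCII hex SHA1
        o_crc32hex.assign(attr + 8 + 41, 8);      // ASCII hex CRC32, after the NUL
        return true;
    }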

  18. #2597 · fcorbelli (Member, Italy)
    Last step: CRC32s of the blocks during compression.



    Code:
    // Update HT and ptr list
    if (fi<vf.size()) {
      if (htptr==0) {
        htptr=ht.size();
        ht.push_back(HT(sha1result, sz));
        htinv.update();
        fsize+=sz;
      }
      vf[fi]->second.ptr.push_back(htptr);


    As mentioned, I'm trying to find a (simple) way to calculate the hashes of the blocks that make up a single file during the add() phase.

    This way I would not have to re-read the file afterwards to calculate the CRC32 for storage (an operation that takes time).

    However it is not easy, at least for me, to hook an "interception" of the blocks.

    The "real" problem is not so much the new blocks, but the duplicate ones: a duplicated block would theoretically have to be decompressed to calculate its CRC32 (so I would say no; it takes too long and is too complex).
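    That said, if the fragment bytes are still in memory where sha1result is computed (as the snippet above suggests), a per-file running CRC32 kept at that point would, in principle, sidestep both the re-read and the duplicate-block problem, since fragments of one file arrive there in file order. A hypothetical sketch, assuming the crc32_16bytes used earlier accepts a previous-CRC seed (as in common fast-CRC32 implementations):
    Code:
    #include <cstdint>
    #include <map>
    #include <string>

    uint32_t crc32_16bytes(const void* data, size_t length,
                           uint32_t previousCrc32 = 0);  // as used earlier

    // filename -> CRC32 of the fragment bytes seen so far for that file
    std::map<std::string, uint32_t> g_filecrc32;

    // Called once per fragment, in file order, where sha1result is computed
    void update_filecrc(const std::string& filename, const void* fragbuf, size_t sz)
    {
        uint32_t& filecrc = g_filecrc32[filename];
        filecrc = crc32_16bytes(fragbuf, sz, filecrc);   // seed with the CRC so far
    }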

    The alternative that comes to mind is something similar to FRANZOFFSET, that is, to store in the blocks, in addition to the SHA1 code, also the CRC32.

    However, the ZPAQ "take the SHA1 data at the end of the block" mechanism seems rather rigid to me, with no concrete possibility of changing anything (beyond the 21 bytes of SHA1) without losing backwards compatibility (readSegmentEnd):

    Code:
    // End segment, write sha1string if present
    void Compressor::endSegment(const char* sha1string) {
      if (state==SEG1)
        postProcess();
      assert(state==SEG2);
      enc.compress(-1);
      if (verify && pz.hend) {
        pz.run(-1);
        pz.flush();
      }
      enc.out->put(0);
      enc.out->put(0);
      enc.out->put(0);
      enc.out->put(0);
      if (sha1string) {
        enc.out->put(253);
        for (int i=0; i<20; ++i)
          enc.out->put(sha1string[i]);
      }
      else
        enc.out->put(254);
      state=BLOCK2;
    }


    Code:
    void Decompresser::readSegmentEnd(char* sha1string) {
      assert(state==DATA || state==SEGEND);

      // Skip remaining data if any and get next byte
      int c=0;
      if (state==DATA) {
        c=dec.skip();
        decode_state=SKIP;
      }
      else if (state==SEGEND)
        c=dec.get();
      state=FILENAME;

      // Read checksum ///// this is the problem: only SHA1, or nothing
      if (c==254) {
        if (sha1string) sha1string[0]=0;  // no checksum
      }
      else if (c==253) {
        if (sha1string) sha1string[0]=1;
        for (int i=1; i<=20; ++i) {
          c=dec.get();
          if (sha1string) sha1string[i]=c;
        }
      }
      else
        error("missing end of segment marker");
    }



    Ideas?

  19. #2598 · Shelwien (Administrator, Kharkov, Ukraine)
    I think there's little point in preserving the zpaq format.
    Forward compatibility is a nice concept, but IMHO not worth adding redundancy to archives.
    Also, the zpaq format doesn't support many useful algorithms (e.g. faster methods of entropy coding),
    so I think it's too early to freeze it in any case.

    Thus I'd vote for taking the useful parts from zpaq (CDC dedup, compression methods) and designing
    a better archive format around them.
    There are more CDC implementations than just zpaq's, though, so maybe not even that.

  20. Thanks: Mike (25th November 2020)

  21. #2599 · fcorbelli (Member, Italy)
    I can imagine A LOT of improvements to zpaq's archive format, making it much more like an RDBMS.

    But I do not like breaking compatibility at all.
    And I am not the author, so an incompatible fork would be fully "unofficial".
    I am thinking about storing the info in a special file, like an MFT.
    zpaq will just extract this file; zpaqfranz will use it like an embedded DB.

    Maybe someone will present better ideas.

  22. #2600 · algorithm (Member, Greece)
    @fcorbelli do you use zpaq for backups?
    A very good alternative is borg. Deduplication is similar, I think (content-based chunking). It uses zstd, lzma, and lz4, has encryption, and you can even mount the archive and read it like a normal file system through FUSE.

  23. #2601 · fcorbelli (Member, Italy)
    Quote Originally Posted by algorithm View Post
    @fcorbelli do you use zpaq for backups?
    A very good alternative is borg. Deduplication is similar, I think (content-based chunking). It uses zstd, lzma, and lz4, has encryption, and you can even mount the archive and read it like a normal file system through FUSE.
    Yes, with great results, for years.
    I like it very much, with only a couple of defects:
    first, the lack of a fast check (almost solved by my fork);
    second, very slow file listing.
    I am thinking about a fake deleted file (zpaq 7.15 does not extract files with date=0).
    I will make my little MFT file, ignored by zpaq 7.15 but used by zpaqfranz.
    I have to check by experiment whether listing gets faster.

    I tried borg years ago, but I did not like it very much.

  24. #2602 · SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    I can imagine A LOT of improvements to zpaq's archive format, making it much more like an RDBMS.

    But I do not like breaking compatibility at all.
    And I am not the author, so an incompatible fork would be fully "unofficial".
    I am thinking about storing the info in a special file, like an MFT.
    zpaq will just extract this file; zpaqfranz will use it like an embedded DB.

    Maybe someone will present better ideas.
    Add the fsize to 'FRANZOFFSET' and don't calculate fsize by reading the h-blocks; I think it will speed up the list command.

  25. #2603 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    Add the fsize to 'FRANZOFFSET' and don't calculate fsize by reading the h-blocks; I think it will speed up the list command.
    I will try a more extreme approach: an embedded, precomputed text listing.
    But extraction speed can be a problem, because it is appended after the first version.
    Putting it at the head could be better, but then errors could not be handled.

  26. #2604 · SpyFX (Member, Moscow)
    Quote Originally Posted by Shelwien View Post
    I think there's little point in preserving the zpaq format.
    Forward compatibility is a nice concept, but IMHO not worth adding redundancy to archives.
    Also, the zpaq format doesn't support many useful algorithms (e.g. faster methods of entropy coding),
    so I think it's too early to freeze it in any case.

    Thus I'd vote for taking the useful parts from zpaq (CDC dedup, compression methods) and designing
    a better archive format around them.
    There are more CDC implementations than just zpaq's, though, so maybe not even that.
    The zpaq format seems quite well thought out to me, and it is possible to squeeze additional improvements out of it without violating backward compatibility.

    We can use any CDC, maybe even different ones for different data, but it seems to me that zpaq's CDC is not so bad.

    At the moment I'm not satisfied with the processing of a large number of small files (hundreds of thousands or more): everything is very slow, as is processing large files from different physical HDDs. All of this stems from the fact that zpaq 7.15 reads all files sequentially.

  27. #2605 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    The zpaq format seems quite well thought out to me, and it is possible to squeeze additional improvements out of it without violating backward compatibility.
    Only partially, at least for me.
    Two big problems:
    1- no space to store anything in a version (an ASCII comment)
    2- no space for anything in blocks (end of segment with 20 bytes of SHA1, or nothing).

    As stated, I am thinking about "fake" (date==0==deleted) files to store information (7.15 ignores deleted ones).
    But it is not so easy and, even worse, not so fast.

    At the moment I'm not satisfied with the processing of a large number of small files (hundreds of thousands or more): everything is very slow, as is processing large files from different physical HDDs. All of this stems from the fact that zpaq 7.15 reads all files sequentially.
    This is typically OK for magnetic disks, not so good for SSDs.

    But, at least for me, the very slow file listing is currently the main defect.

  28. #2606 · SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    Only partially, at least for me.
    Two big problems:
    1- no space to store anything in a version (an ASCII comment)
    2- no space for anything in blocks (end of segment with 20 bytes of SHA1, or nothing).

    As stated, I am thinking about "fake" (date==0==deleted) files to store information (7.15 ignores deleted ones).
    But it is not so easy and, even worse, not so fast.

    This is typically OK for magnetic disks, not so good for SSDs.

    But, at least for me, the very slow file listing is currently the main defect.
    I wrote that in zpaq there is no size limit on the c-block;
    at the moment I have decided to store a second usize[8] there, equal to the sum of the sizes of all d-blocks plus all h-blocks.
    This makes it possible to jump immediately to the first i-block; there seems to be a small speedup.

    Since there are no boundaries, you can store any additional information in the c-block.

    I don't like that the c-block is not aligned to a 4K boundary,
    so at the moment I create it with a 4K size, so that the final rewrite does not touch other data.

    Do I understand correctly that the fake file is supposed to store information about checksums and file sizes?

    I like this idea. I also plan to use it for 4K alignment of the first h/i-block in each version and of subsequent c-blocks in the archive.
    Alignment lets me simplify the algorithm for reading blocks from the archive without system buffering,
    because I don't like that working with a zpaq archive pushes useful cached data out of the server's RAM.

    P.S.
    I understand that it is very difficult to make fundamental changes in the zpaq code,
    so I am rewriting all the work with the zpaq archive; I use C# and my own zpaq API (DLL).

    Almost everything already works :)
    But your ideas are forcing me to change my concept of a correct archive so that it addresses your wishes as well.

  29. #2607 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    I wrote that in zpaq there is no size limit on the c-block (...)
    I will think about it.
    Do I understand correctly that the fake file is supposed to store information about checksums and file sizes?
    Yes, and an optional ASCII list of all the files and all versions.
    So, when you "list" an archive, the fake file is decompressed and sent to the output.
    Another option is ASCII comments on versions, so you could do something like
    add ... blablabla -comment "my first version"

    P.S.
    I understand that it is very difficult to make fundamental changes in the zpaq code,
    so I am rewriting all the work with the zpaq archive; I use C# and my own zpaq API (DLL).

    Almost everything already works :)
    But your ideas are forcing me to change my concept of a correct archive so that it addresses your wishes as well.
    I work mostly with Delphi.

  30. #2608 · SpyFX (Member, Moscow)
    Quote Originally Posted by fcorbelli View Post
    Yes, and an optional ASCII list of all the files and all versions.
    So, when you "list" an archive, the fake file is decompressed and sent to the output.
    The fake file is limited to 64K by the zpaq format; I think one such file is not enough.

    It seems to me that placing the file sizes in such a file would significantly speed up getting the list of files in the console,
    but writing the entire list of files in each version would be redundant.

  31. #2609 · fcorbelli (Member, Italy)
    Quote Originally Posted by SpyFX View Post
    The fake file is limited to 64K by the zpaq format; I think one such file is not enough.

    It seems to me that placing the file sizes in such a file would significantly speed up getting the list of files in the console,
    but writing the entire list of files in each version would be redundant.
    You are right, but in my case the archives are big (400+ GB), so the overhead is small.

    A very quick and very dirty test (same hardware, from SSD to ramdisk, decent CPU).

    Listing of copia_zarc.zpaq:
    1057 versions,
    3271902 files,
    166512 fragments,
    5730.358526 MB

    796080.407841 MB of 796080.407841 MB (3272959 files) shown
    -> 10544.215824 MB (13999029 refs to 166512 of 166512 frags) after dedupe
    -> 5730.358526 MB compressed.


    ZPAQ 7.15 64-bit (sorted by version):
    zpaq64 l h:\zarc\copia_zarc.zpaq -all >z:\715.txt
    60.297 seconds, 603,512,182 bytes of output

    zpaqlist 64-bit - franz22 (sorted by file):
    zpaqlist l "h:\zarc\copia_zarc.zpaq" -all -out z:\22.txt
    15.047 seconds, 172,130,287 bytes of output

    Way slower on magnetic disks, painfully slow from a magnetic-disk NAS, even with 10Gb Ethernet.

  32. #2610 · SpyFX (Member, Moscow)
    Code:
    ..\zpaq715.exe l DISK_F_Y_????.zpaq -all > zpaq715.first.txt
    zpaq v7.15 journaling archiver, compiled Aug 17 2016
    DISK_F_Y_????.zpaq: 778 versions, 833382 files, 24245031 fragments, 764858.961314 MB
    30027797.206934 MB of 30027797.206934 MB (834160 files) shown
    -> 2000296.773364 MB (415375470 refs to 24245031 of 24245031 frags) after dedupe
    -> 764858.961314 MB compressed.
    54.032 seconds (all OK)

    Z:\ZPAQ\backup>..\zpaq715.exe l DISK_F_Y_????.zpaq -all > zpaq715.first.txt
    54.032 seconds (all OK)
    Z:\ZPAQ\backup>..\zpaq715.exe l DISK_F_Y_????.zpaq -all > zpaq715.second.txt
    38.453 seconds (all OK)
    Z:\ZPAQ\backup>..\zpaq715.exe l DISK_F_Y_????.zpaq -all > zpaq715.third.txt
    38.812 seconds (all OK)

    The first launch caches the h/i-blocks, and for this reason the following launches are faster.
    For my archive, the time to get a list of files is 38 seconds if all blocks are in the system file cache.

    Can you do multiple runs? You should flush the system file cache before the first one.
    Last edited by SpyFX; 28th November 2020 at 17:37.

